=head1 NAME
X<character class>

perlrecharclass - Perl Regular Expression Character Classes

=head1 DESCRIPTION

The top level documentation about Perl regular expressions
is found in L<perlre>.

This manual page discusses the syntax and use of character
classes in Perl Regular Expressions.

A character class is a way of denoting a set of characters,
in such a way that one character of the set is matched.
It's important to remember that matching a character class
consumes exactly one character in the source string. (The source
string is the string the regular expression is matched against.)

There are three types of character classes in Perl regular
expressions: the dot, backslashed sequences, and the form enclosed in square
brackets. Keep in mind, though, that often the term "character class" is used
to mean just the bracketed form. This is true in other Perl documentation.

=head2 The dot

The dot (or period), C<.>, is probably the most used, and certainly
the most well-known character class. By default, a dot matches any
character, except for the newline. The default can be changed to
add matching the newline with the I<single line> modifier: either
for the entire regular expression using the C</s> modifier, or
locally using C<(?s)>.

Here are some examples:

 "a"  =~  /./       # Match
 "."  =~  /./       # Match
 ""   =~  /./       # No match (dot has to match a character)
 "\n" =~  /./       # No match (dot does not match a newline)
 "\n" =~  /./s      # Match (global 'single line' modifier)
 "\n" =~  /(?s:.)/  # Match (local 'single line' modifier)
 "ab" =~  /^.$/     # No match (dot matches one character)

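As a small sketch of the effect of C</s> on a longer match (the variable
names here are only for illustration), the following captures as much as
C<.+> can reach in a two-line string, with and without the modifier:

```perl
use strict;
use warnings;

my $text = "one\ntwo";

# Without /s the dot stops at the newline, so .+ captures only "one".
my ($without_s) = $text =~ /(.+)/;

# With /s the dot also matches "\n", so .+ captures the whole string.
my ($with_s) = $text =~ /(.+)/s;

printf "without /s: %d characters\n", length $without_s;
printf "with /s:    %d characters\n", length $with_s;
```
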
=head2 Backslashed sequences
X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\p> X<\P>
X<\N> X<\v> X<\V> X<\h> X<\H>
X<word> X<whitespace>

Perl regular expressions contain many backslashed sequences that
constitute a character class. That is, they will match a single
character, if that character belongs to a specific set of characters
(defined by the sequence). A backslashed sequence is a sequence of
characters starting with a backslash. Not all backslashed sequences
are character classes; for a full list, see L<perlrebackslash>.

Here's a list of the backslashed sequences that are character classes. They
are discussed in more detail below.

 \d             Match a digit character.
 \D             Match a non-digit character.
 \w             Match a "word" character.
 \W             Match a non-"word" character.
 \s             Match a whitespace character.
 \S             Match a non-whitespace character.
 \h             Match a horizontal whitespace character.
 \H             Match a character that isn't horizontal whitespace.
 \N             Match a character that isn't newline.  Experimental.
 \v             Match a vertical whitespace character.
 \V             Match a character that isn't vertical whitespace.
 \pP, \p{Prop}  Match a character matching a Unicode property.
 \PP, \P{Prop}  Match a character that doesn't match a Unicode property.

=head3 Digits

C<\d> matches a single character that is considered to be a I<digit>. What is
considered a digit depends on the internal encoding of the source string and
the locale that is in effect. If the source string is in UTF-8 format, C<\d>
not only matches the digits '0' - '9', but also Arabic, Devanagari and digits
from other languages. Otherwise, if there is a locale in effect, it will match
whatever characters the locale considers digits. Without a locale, C<\d>
matches the digits '0' to '9'. See L</Locale, EBCDIC, Unicode and UTF-8>.

Any character that isn't matched by C<\d> will be matched by C<\D>.

=head3 Word characters

A C<\w> matches a single alphanumeric character (an alphabetic character, or a
decimal digit) or an underscore (C<_>), not a whole word. Use C<\w+> to match
a string of Perl-identifier characters (which isn't the same as matching an
English word). What is considered a word character depends on the internal
encoding of the string and the locale or EBCDIC code page that is in effect. If
the source string is in UTF-8 format, C<\w> matches those characters that are
considered word characters in the Unicode database. That is, it not only
matches ASCII letters, but also Thai letters, Greek letters, etc. If the source
string isn't in UTF-8 format, C<\w> matches those characters that are
considered word characters by the current locale or EBCDIC code page. Without
a locale or EBCDIC code page, C<\w> matches the ASCII letters, digits and the
underscore.
See L</Locale, EBCDIC, Unicode and UTF-8>.

Any character that isn't matched by C<\w> will be matched by C<\W>.

=head3 Whitespace

C<\s> matches any single character that is considered whitespace. In the ASCII
range, C<\s> matches the horizontal tab (C<\t>), the new line (C<\n>), the form
feed (C<\f>), the carriage return (C<\r>), and the space. (The vertical tab,
C<\cK>, is not matched by C<\s>.) The exact set of characters matched by C<\s>
depends on whether the source string is in UTF-8 format and the locale or
EBCDIC code page that is in effect. If it's in UTF-8 format, C<\s> matches what
is considered whitespace in the Unicode database; the complete list is in the
table below. Otherwise, if there is a locale or EBCDIC code page in effect,
C<\s> matches whatever is considered whitespace by the current locale or EBCDIC
code page. Without a locale or EBCDIC code page, C<\s> matches the five
characters mentioned at the beginning of this paragraph. Perhaps the most
notable possible surprise is that C<\s> matches a non-breaking space only if
the non-breaking space is in a UTF-8 encoded string or the locale or EBCDIC
code page that is in effect has that character.
See L</Locale, EBCDIC, Unicode and UTF-8>.

Any character that isn't matched by C<\s> will be matched by C<\S>.

C<\h> will match any character that is considered horizontal whitespace;
this includes the space and the tab characters and 17 other characters that are
listed in the table below. C<\H> will match any character
that is not considered horizontal whitespace.

C<\N> is new in 5.12, and is experimental. It, like the dot, will match any
character that is not a newline. The difference is that C<\N> will not be
influenced by the single line C</s> regular expression modifier. Note that
there is a second meaning of C<\N> when of the form C<\N{...}>. This form is
for named characters. See L<charnames> for those. If C<\N> is followed by an
opening brace and something that is not a quantifier, perl will assume that a
character name is coming, and not this meaning of C<\N>. For example, C<\N{3}>
means to match 3 non-newlines; C<\N{5,}> means to match 5 or more non-newlines,
but C<\N{4F}> and C<\N{F4}> are not legal quantifiers, and will cause perl to
look for characters named C<4F> or C<F4>, respectively (and won't find them,
thus raising an error, unless they have been defined using custom names).

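The difference between C<.> and C<\N> under C</s> can be sketched as follows.
This assumes a perl that supports C<\N> (5.12 or later); the variable names
are illustrative only.

```perl
use strict;
use warnings;

my $text = "a\nb";

# Under /s the dot matches the newline between "a" and "b" ...
my $dot_matches = $text =~ /a.b/s  ? 1 : 0;

# ... but \N still refuses to match it, with or without /s.
my $N_matches   = $text =~ /a\Nb/s ? 1 : 0;

print "dot under /s: $dot_matches, \\N under /s: $N_matches\n";
```
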
C<\v> will match any character that is considered vertical whitespace;
this includes the carriage return and line feed characters (newline) plus 5
other characters listed in the table below.
C<\V> will match any character that is not considered vertical whitespace.

C<\R> matches anything that can be considered a newline under Unicode
rules. It's not a character class, as it can match a multi-character
sequence. Therefore, it cannot be used inside a bracketed character
class; use C<\v> instead (vertical whitespace).
Details are discussed in L<perlrebackslash>.

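To illustrate why C<\R> is not a character class while C<\v> is, here is a
minimal sketch that splits the same text on each of them. The sample string
and variable names are made up for the example.

```perl
use strict;
use warnings;

my $text = "alpha\r\nbeta\ngamma";

# \R treats the two-character sequence "\r\n" as a single line break.
my @by_R = split /\R/, $text;

# \v matches exactly one character, so "\r" and "\n" are separate
# delimiters and an empty field appears between them.
my @by_v = split /\v/, $text;

print scalar(@by_R), " fields with \\R\n";
print scalar(@by_v), " fields with \\v\n";
```
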
Note that unlike C<\s>, C<\d> and C<\w>, C<\h> and C<\v> always match
the same characters, regardless of whether the source string is in UTF-8
format or not. The set of characters they match is also not influenced
by locale or EBCDIC code page.

One might think that C<\s> is equivalent to C<[\h\v]>. This is not true. The
vertical tab (C<"\x0b">) is not matched by C<\s>; it is, however, considered
vertical whitespace. Furthermore, if the source string is not in UTF-8 format,
and any locale or EBCDIC code page that is in effect doesn't include them, the
next line (C<"\x85">) and the no-break space (C<"\xA0">) characters are not
matched by C<\s>, but are by C<\v> and C<\h> respectively. If the source
string is in UTF-8 format, both the next line and the no-break space are
matched by C<\s>.

The following table is a complete listing of characters matched by
C<\s>, C<\h> and C<\v> as of Unicode 5.2.

The first column gives the code point of the character (in hex format),
the second column gives the (Unicode) name. The third column indicates
by which class(es) the character is matched (assuming no locale or EBCDIC code
page is in effect that changes the C<\s> matching).

 0x00009  CHARACTER TABULATION       h s
 0x0000a  LINE FEED (LF)              vs
 0x0000b  LINE TABULATION             v
 0x0000c  FORM FEED (FF)              vs
 0x0000d  CARRIAGE RETURN (CR)        vs
 0x00020  SPACE                      h s
 0x00085  NEXT LINE (NEL)             vs  [1]
 0x000a0  NO-BREAK SPACE             h s  [1]
 0x01680  OGHAM SPACE MARK           h s
 0x0180e  MONGOLIAN VOWEL SEPARATOR  h s
 0x02000  EN QUAD                    h s
 0x02001  EM QUAD                    h s
 0x02002  EN SPACE                   h s
 0x02003  EM SPACE                   h s
 0x02004  THREE-PER-EM SPACE         h s
 0x02005  FOUR-PER-EM SPACE          h s
 0x02006  SIX-PER-EM SPACE           h s
 0x02007  FIGURE SPACE               h s
 0x02008  PUNCTUATION SPACE          h s
 0x02009  THIN SPACE                 h s
 0x0200a  HAIR SPACE                 h s
 0x02028  LINE SEPARATOR              vs
 0x02029  PARAGRAPH SEPARATOR         vs
 0x0202f  NARROW NO-BREAK SPACE      h s
 0x0205f  MEDIUM MATHEMATICAL SPACE  h s
 0x03000  IDEOGRAPHIC SPACE          h s

=over 4

=item [1]

NEXT LINE and NO-BREAK SPACE only match C<\s> if the source string is in
UTF-8 format, or the locale or EBCDIC code page that is in effect includes them.

=back

It is worth noting that C<\d>, C<\w>, etc, match single characters, not
complete numbers or words. To match a number (that consists of digits),
use C<\d+>; to match a word, use C<\w+>.

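A short sketch of the difference between the bare class and the quantified
form, using a made-up ASCII sample line:

```perl
use strict;
use warnings;

my $line = "order 66 shipped 3 items";

# \d matches one digit at a time ...
my @single_digits = $line =~ /(\d)/g;

# ... while \d+ matches each maximal run of digits.
my @numbers = $line =~ /(\d+)/g;

# \w+ likewise collects runs of word characters.
my @words = $line =~ /(\w+)/g;

print "digits:  @single_digits\n";
print "numbers: @numbers\n";
print "words:   @words\n";
```
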
=head3 Unicode Properties

C<\pP> and C<\p{Prop}> are character classes to match characters that
fit given Unicode classes. One-letter classes can be used in the C<\pP>
form, with the class name following the C<\p>; otherwise, braces are required.
There is a single form, which is just the property name enclosed in the braces,
and a compound form which looks like C<\p{name=value}>, which means to match
if the property "name" for the character has the particular "value".
For instance, a match for a number can be written as C</\pN/> or as
C</\p{Number}/>, or as C</\p{Number=True}/>.
Lowercase letters are matched by the property I<Lowercase_Letter>, whose
short form is I<Ll>. They need the braces, so are written as C</\p{Ll}/> or
C</\p{Lowercase_Letter}/>, or C</\p{General_Category=Lowercase_Letter}/>
(the underscores are optional).
C</\pLl/> is valid, but means something different.
It matches a two character string: a letter (Unicode property C<\pL>),
followed by a lowercase C<l>.

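The C<\p{Ll}> versus C<\pLl> distinction can be checked with a sketch like
the following. The test strings are arbitrary ASCII samples chosen for the
example.

```perl
use strict;
use warnings;

# \p{Ll} is a single class: exactly one lowercase letter.
my $braced   = qr/^\p{Ll}$/;

# \pLl is two atoms: \pL (any letter) followed by a literal "l".
my $unbraced = qr/^\pLl$/;

for my $str ("a", "al", "Al") {
    printf "%-2s  \\p{Ll}: %-8s  \\pLl: %s\n",
        $str,
        ($str =~ $braced   ? "match" : "no match"),
        ($str =~ $unbraced ? "match" : "no match");
}
```
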
For more details, see L<perlunicode/Unicode Character Properties>; for a
complete list of possible properties, see
L<perluniprops/Properties accessible through \p{} and \P{}>.
It is also possible to define your own properties. This is discussed in
L<perlunicode/User-Defined Character Properties>.

=head4 Examples

 "a"  =~  /\w/      # Match, "a" is a 'word' character.
 "7"  =~  /\w/      # Match, "7" is a 'word' character as well.
 "a"  =~  /\d/      # No match, "a" isn't a digit.
 "7"  =~  /\d/      # Match, "7" is a digit.
 " "  =~  /\s/      # Match, a space is whitespace.
 "a"  =~  /\D/      # Match, "a" is a non-digit.
 "7"  =~  /\D/      # No match, "7" is not a non-digit.
 " "  =~  /\S/      # No match, a space is not non-whitespace.

 " "  =~  /\h/      # Match, space is horizontal whitespace.
 " "  =~  /\v/      # No match, space is not vertical whitespace.
 "\r" =~  /\v/      # Match, a return is vertical whitespace.

 "a"  =~  /\pL/     # Match, "a" is a letter.
 "a"  =~  /\p{Lu}/  # No match, /\p{Lu}/ matches upper case letters.

 "\x{0e0b}" =~ /\p{Thai}/  # Match, \x{0e0b} is the character
                           # 'THAI CHARACTER SO SO', and that's in
                           # Thai Unicode class.
 "a"  =~  /\P{Lao}/ # Match, as "a" is not a Laotian character.

=head2 Bracketed Character Classes

The third form of character class you can use in Perl regular expressions
is the bracketed form. In its simplest form, it lists the characters
that may be matched inside square brackets, like this: C<[aeiou]>.
This matches one of C<a>, C<e>, C<i>, C<o> or C<u>. Like the other
character classes, exactly one character will be matched. To match
a longer string consisting of characters mentioned in the character
class, follow the character class with a quantifier. For instance,
C<[aeiou]+> matches a string of one or more lowercase ASCII vowels.

Repeating a character in a character class has no
effect; it's considered to be in the set only once.

Examples:

 "e"  =~  /[aeiou]/        # Match, as "e" is listed in the class.
 "p"  =~  /[aeiou]/        # No match, "p" is not listed in the class.
 "ae" =~  /^[aeiou]$/      # No match, a character class only matches
                           # a single character.
 "ae" =~  /^[aeiou]+$/     # Match, due to the quantifier.

=head3 Special Characters Inside a Bracketed Character Class

Most characters that are meta characters in regular expressions (that
is, characters that carry a special meaning like C<.>, C<*>, or C<(>) lose
their special meaning and can be used inside a character class without
the need to escape them. For instance, C<[()]> matches either an opening
parenthesis, or a closing parenthesis, and the parens inside the character
class don't group or capture.

Characters that may carry a special meaning inside a character class are:
C<\>, C<^>, C<->, C<[> and C<]>, and are discussed below. They can be
escaped with a backslash, although this is sometimes not needed, in which
case the backslash may be omitted.

The sequence C<\b> is special inside a bracketed character class. While
outside the character class C<\b> is an assertion indicating a point
that does not have either two word characters or two non-word characters
on either side, inside a bracketed character class, C<\b> matches a
backspace character.

The sequences
C<\a>,
C<\c>,
C<\e>,
C<\f>,
C<\n>,
C<\N{I<NAME>}>,
C<\N{U+I<wide hex char>}>,
C<\r>,
C<\t>,
and
C<\x>
are also special and have the same meanings as they do outside a bracketed character
class.

Also, a backslash followed by two or three octal digits is considered an octal
number.

A C<[> is not special inside a character class, unless it's the start
of a POSIX character class (see below). It normally does not need escaping.

A C<]> is either the end of a POSIX character class (see below), or it
signals the end of the bracketed character class. Normally it needs
escaping if you want to include a C<]> in the set of characters.
However, if the C<]> is the I<first> (or the second if the first
character is a caret) character of a bracketed character class, it
does not denote the end of the class (as you cannot have an empty class)
and is considered part of the set of characters that can be matched without
escaping.

Examples:

 "+"   =~ /[+?*]/     #  Match, "+" in a character class is not special.
 "\cH" =~ /[\b]/      #  Match, \b inside a character class
                      #  is equivalent to a backspace.
 "]"   =~ /[][]/      #  Match, as the character class contains
                      #  both [ and ].
 "[]"  =~ /[[]]/      #  Match, the pattern contains a character class
                      #  containing just [, and the character class is
                      #  followed by a ].

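When the characters to be matched come from data rather than being typed by
hand, it can be easier to escape all of them than to reason about position.
The following is one possible sketch; C<class_for> is a hypothetical helper
written for this example, not part of Perl. It relies on the core
C<quotemeta> function, which backslash-escapes non-"word" ASCII characters.

```perl
use strict;
use warnings;

# Hypothetical helper: build the source text of a bracketed class that
# matches any one of the given literal characters.
sub class_for {
    my @chars = @_;
    # quotemeta() escapes \, ^, -, [ and ] (among others), so none of
    # them can be mistaken for class syntax.
    return '[' . join('', map { quotemeta($_) } @chars) . ']';
}

my $class = class_for('^', '-', ']', '\\');
my $re    = qr/^$class$/;

print "class source: $class\n";
print "matches ']'\n"        if ']' =~ $re;
print "does not match 'a'\n" if 'a' !~ $re;
```
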
=head3 Character Ranges

It is not uncommon to want to match a range of characters. Luckily, instead
of listing all the characters in the range, one may use the hyphen (C<->).
If inside a bracketed character class you have two characters separated
by a hyphen, it's treated as if all the characters between the two are in
the class. For instance, C<[0-9]> matches any ASCII digit, and C<[a-m]>
matches any lowercase letter from the first half of the ASCII alphabet.

Note that the two characters on either side of the hyphen are not
necessarily both letters or both digits. Any character is possible,
although not advisable. C<['-?]> contains a range of characters, but
most people will not know which characters that will be. Furthermore,
such ranges may lead to portability problems if the code has to run on
a platform that uses a different character set, such as EBCDIC.

If a hyphen in a character class cannot syntactically be part of a range, for
instance because it is the first or the last character of the character class,
or if it immediately follows a range, the hyphen isn't special, and will be
considered a character that may be matched. If you want a hyphen in your set
of characters to be matched, and its position in the class is such that it
could be considered part of a range, you have to escape it with a backslash.

Examples:

 [a-z]       #  Matches a character that is a lower case ASCII letter.
 [a-fz]      #  Matches any letter between 'a' and 'f' (inclusive) or the
             #  letter 'z'.
 [-z]        #  Matches either a hyphen ('-') or the letter 'z'.
 [a-f-m]     #  Matches any letter between 'a' and 'f' (inclusive), the
             #  hyphen ('-'), or the letter 'm'.
 ['-?]       #  Matches any of the characters  '()*+,-./0123456789:;<=>?
             #  (But not on an EBCDIC platform).

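If you are unsure what a range such as C<['-?]> covers, one way to find out
is to test every candidate character. A minimal sketch, assuming an ASCII
platform and looking only at code points 0 to 127:

```perl
use strict;
use warnings;

# Collect every ASCII character (code points 0..127) matched by ['-?].
my $members = join '', grep { /['-?]/ } map { chr($_) } 0 .. 127;

print "$members\n";
```

On an EBCDIC platform the same code would list a different set, which is
the portability concern mentioned above.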
=head3 Negation

It is also possible to instead list the characters you do not want to
match. You can do so by using a caret (C<^>) as the first character in the
character class. For instance, C<[^a-z]> matches a character that is not a
lowercase ASCII letter.

This syntax makes the caret a special character inside a bracketed character
class, but only if it is the first character of the class. So if you want
to have the caret as one of the characters you want to match, you either
have to escape the caret, or not list it first.

Examples:

 "e"  =~  /[^aeiou]/   #  No match, the 'e' is listed.
 "x"  =~  /[^aeiou]/   #  Match, as 'x' isn't a lowercase vowel.
 "^"  =~  /[^^]/       #  No match, matches anything that isn't a caret.
 "^"  =~  /[x^]/       #  Match, caret is not special here.

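A common use of a negated class is to delete everything outside a wanted
set. The sketch below uses made-up sample strings; the second substitution
shows a caret that is special in first position and literal later on.

```perl
use strict;
use warnings;

# Remove every character that is not an ASCII digit.
my $phone = "+1 (555) 010-9999";
(my $digits = $phone) =~ s/[^0-9]//g;

# The first ^ negates the class; the second ^ is a literal caret,
# so digits and carets are kept and everything else is removed.
my $expr = "2 ^ 10";
(my $compact = $expr) =~ s/[^0-9^]//g;

print "$digits\n";
print "$compact\n";
```
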
=head3 Backslash Sequences

You can put any backslash sequence character class (with the exception of
C<\N>) inside a bracketed character class, and it will act just
as if you put all the characters matched by the backslash sequence inside the
character class. For instance, C<[a-f\d]> will match any digit, or any of the
lowercase letters between 'a' and 'f' inclusive.

C<\N> within a bracketed character class must be of the forms C<\N{I<name>}> or
C<\N{U+I<wide hex char>}> for the same reason that a dot C<.> inside a
bracketed character class loses its special meaning: it matches nearly
anything, which generally isn't what you want to happen.

Examples:

 /[\p{Thai}\d]/     # Matches a character that is either a Thai
                    # character, or a digit.
 /[^\p{Arabic}()]/  # Matches a character that is neither an Arabic
                    # character, nor a parenthesis.

Backslash sequence character classes cannot form one of the endpoints
of a range.

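As an illustration of mixing a backslash sequence with a range in one class,
here is a small sketch that checks six-digit hexadecimal colour strings. The
pattern name and the sample inputs are invented for the example, and the
inputs are plain ASCII.

```perl
use strict;
use warnings;

# [\da-f] combines the \d class with the range a-f; /i also admits A-F.
my $hex_colour = qr/^#[\da-f]{6}$/i;

for my $candidate ('#1a2B3c', '#12345', '#ggg000') {
    printf "%-8s %s\n", $candidate,
        ($candidate =~ $hex_colour ? "accepted" : "rejected");
}
```
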
=head3 Posix Character Classes
X<character class> X<\p> X<\p{}>
X<alpha> X<alnum> X<ascii> X<blank> X<cntrl> X<digit> X<graph>
X<lower> X<print> X<punct> X<space> X<upper> X<word> X<xdigit>

Posix character classes have the form C<[:class:]>, where I<class> is the
name, and C<[:> and C<:]> are the delimiters. Posix character classes only
appear I<inside> bracketed character classes, and are a convenient and
descriptive way of listing a group of characters. Be careful about the syntax:

 # Correct:
 $string =~ /[[:alpha:]]/

 # Incorrect (will warn):
 $string =~ /[:alpha:]/

The latter pattern would be a character class consisting of a colon,
and the letters C<a>, C<l>, C<p> and C<h>.
These character classes can be part of a larger bracketed character class. For
example,

 [01[:alpha:]%]

is valid and matches '0', '1', any alphabetic character, and the percent sign.

Perl recognizes the following POSIX character classes:

 alpha   Any alphabetical character ("[A-Za-z]").
 alnum   Any alphanumerical character ("[A-Za-z0-9]").
 ascii   Any character in the ASCII character set.
 blank   A GNU extension, equal to a space or a horizontal tab ("\t").
 cntrl   Any control character.  See Note [2] below.
 digit   Any decimal digit ("[0-9]"), equivalent to "\d".
 graph   Any printable character, excluding a space.  See Note [3] below.
 lower   Any lowercase character ("[a-z]").
 print   Any printable character, including a space.  See Note [4] below.
 punct   Any graphical character excluding "word" characters.  See Note [5]
         below.
 space   Any whitespace character.  "\s" plus the vertical tab ("\cK").
 upper   Any uppercase character ("[A-Z]").
 word    A Perl extension ("[A-Za-z0-9_]"), equivalent to "\w".
 xdigit  Any hexadecimal digit ("[0-9a-fA-F]").

Most POSIX character classes have two Unicode-style C<\p> property
counterparts. (They are not official Unicode properties, but Perl extensions
derived from official Unicode properties.) The table below shows the relation
between POSIX character classes and these counterparts.

One counterpart, in the column labelled "ASCII-range Unicode" in
the table, will only match characters in the ASCII range. (On EBCDIC platforms,
they match those characters which have ASCII equivalents.)

The other counterpart, in the column labelled "Full-range Unicode", matches any
appropriate characters in the full Unicode character set. For example,
C<\p{Alpha}> will match not just the ASCII alphabetic characters, but any
character in the entire Unicode character set that is considered to be
alphabetic.

(Each of the counterparts has various synonyms as well.
L<perluniprops/Properties accessible through \p{} and \P{}> lists all the
synonyms, plus all the characters matched by each of the ASCII-range
properties. For example C<\p{AHex}> is a synonym for C<\p{ASCII_Hex_Digit}>,
and any C<\p> property name can be prefixed with "Is" such as C<\p{IsAlpha}>.)

Both the C<\p> forms are unaffected by any locale that is in effect, or whether
the string is in UTF-8 format or not, or whether the platform is EBCDIC or not.
In contrast, the POSIX character classes are affected. If the source string is
in UTF-8 format, the POSIX classes (with the exception of C<[[:punct:]]>, see
Note [5]) behave like their "Full-range" Unicode counterparts. If the source
string is not in UTF-8 format, and no locale is in effect, and the platform is
not EBCDIC, all the POSIX classes behave like their ASCII-range counterparts.
Otherwise, they behave based on the rules of the locale or EBCDIC code page.
It is proposed to change this behavior in a future release of Perl so that
the UTF8ness of the source string will be irrelevant to the behavior of the
POSIX character classes. This means they will always behave in strict
accordance with the official POSIX standard. That is, if either locale or
EBCDIC code page is present, they will behave in accordance with those; if
absent, the classes will match only their ASCII-range counterparts. If you
disagree with this proposal, send email to C<perl5-porters@perl.org>.

 [[:...:]]  ASCII-range          Full-range       backslash  Note
            Unicode              Unicode          sequence
 ----------------------------------------------------------------
 alpha      \p{PosixAlpha}       \p{Alpha}
 alnum      \p{PosixAlnum}       \p{Alnum}
 ascii      \p{ASCII}
 blank      \p{PosixBlank}       \p{Blank} =
                                 \p{HorizSpace}   \h         [1]
 cntrl      \p{PosixCntrl}       \p{Cntrl}                   [2]
 digit      \p{PosixDigit}       \p{Digit}        \d
 graph      \p{PosixGraph}       \p{Graph}                   [3]
 lower      \p{PosixLower}       \p{Lower}
 print      \p{PosixPrint}       \p{Print}                   [4]
 punct      \p{PosixPunct}       \p{Punct}                   [5]
            \p{PerlSpace}        \p{SpacePerl}    \s         [6]
 space      \p{PosixSpace}       \p{Space}                   [6]
 upper      \p{PosixUpper}       \p{Upper}
 word       \p{PerlWord}         \p{Word}         \w
 xdigit     \p{ASCII_Hex_Digit}  \p{XDigit}

=over 4

=item [1]

C<\p{Blank}> and C<\p{HorizSpace}> are synonyms.

=item [2]

Control characters don't produce output as such, but instead usually control
the terminal somehow: for example newline and backspace are control characters.
In the ASCII range, characters whose ordinals are between 0 and 31 inclusive,
plus 127 (C<DEL>) are control characters.

On EBCDIC platforms, it is likely that the code page will define this character
class to be the counterparts to the ASCII controls, plus the controls that in
Unicode have ordinals from 128 through 159.

=item [3]

Any character that is I<graphical>, that is, visible. This class consists
of all the alphanumerical characters and all punctuation characters.

=item [4]

All printable characters, which is the set of all the graphical characters
plus whitespace characters that are not also controls.

=item [5]

C<\p{PosixPunct}> and C<[[:punct:]]> in the ASCII range match all the
non-controls, non-alphanumeric, non-space characters:
C<[-!"#$%&'()*+,./:;<=E<gt>?@[\\\]^_`{|}~]> (although if a locale is in effect,
it could alter the behavior of C<[[:punct:]]>).

When the matching string is in UTF-8 format, C<[[:punct:]]> matches the above
set, plus whatever C<\p{Punct}> matches beyond the ASCII range. It matches
more than what C<\p{Punct}> matches in the ASCII range, because the POSIX
definition of "Punct" includes more than what Unicode calls "Punct"; namely, it
includes what Unicode calls "Symbol". In other words, the Posix C<[[:punct:]]>
lumps the Unicode "Punct" and "Symbol" together.

This character class does not match any characters of Unicode type "Symbol"
outside the ASCII range when the matching string is in UTF-8 format.

=item [6]

C<\p{SpacePerl}> and C<\p{Space}> differ only in that C<\p{Space}> additionally
matches the vertical tab, C<\cK>. The same holds for the two ASCII-range forms.

=back

=head4 Negation
X<character class, negation>

A Perl extension to the POSIX character class is the ability to
negate it. This is done by prefixing the class name with a caret (C<^>).
Some examples:

 POSIX         ASCII-range      Full-range   backslash
               Unicode          Unicode      sequence
 -----------------------------------------------------
 [[:^digit:]]  \P{PosixDigit}   \P{Digit}    \D
 [[:^space:]]  \P{PosixSpace}   \P{Space}
 [[:^word:]]   \P{PerlWord}     \P{Word}     \W

=head4 [= =] and [. .]

Perl will recognize the POSIX character classes C<[=class=]>, and
C<[.class.]>, but does not (yet?) support them. Use of
such a construct will lead to an error.

=head4 Examples

 /[[:digit:]]/            # Matches a character that is a digit.
 /[01[:lower:]]/          # Matches a character that is either a
                          # lowercase letter, or '0' or '1'.
 /[[:digit:][:^xdigit:]]/ # Matches a character that can be anything,
                          # but the letters 'a' to 'f' in either case.
                          # This is because the character class contains
                          # all digits, and anything that isn't a
                          # hex digit, resulting in a class containing
                          # all characters, but the letters 'a' to 'f'
                          # and 'A' to 'F'.

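The last example can be verified by listing the characters the class
rejects. This is only a sketch: it looks at code points 0 to 127 and assumes
that no locale is in effect.

```perl
use strict;
use warnings;

# List the ASCII characters that [[:digit:][:^xdigit:]] does NOT match.
my $excluded = join '',
    grep { !/[[:digit:][:^xdigit:]]/ }
    map  { chr($_) } 0 .. 127;

print "$excluded\n";
```
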
=head2 Locale, EBCDIC, Unicode and UTF-8

Some of the character classes have a somewhat different behaviour depending
on the internal encoding of the source string, the locale that is
in effect, and whether the program is running on an EBCDIC platform.

C<\w>, C<\d>, C<\s> and the POSIX character classes (and their negations,
including C<\W>, C<\D>, C<\S>) suffer from this behaviour. (This also affects
the backslash sequences C<\b> and C<\B>.)

The rule is that if the source string is in UTF-8 format, the character
classes match according to the Unicode properties. If the source string
isn't, then the character classes match according to whatever locale or EBCDIC
code page is in effect. If neither a locale nor EBCDIC is in effect, they
match the ASCII defaults (52 letters, 10 digits and underscore for C<\w>; 0 to
9 for C<\d>; L</Whitespace> above gives the list for C<\s>).

This usually means that if you are matching against characters whose C<ord()>
values are between 128 and 255 inclusive, your character class may match
or not depending on the current locale or EBCDIC code page, and whether the
source string is in UTF-8 format. The string will be in UTF-8 format if it
contains characters whose C<ord()> value exceeds 255. But a string may be in
UTF-8 format without it having such characters. See L<perlunicode/The
"Unicode Bug">.

For portability reasons, it may be better to not use C<\w>, C<\d>, C<\s>
or the POSIX character classes, and use the Unicode properties instead.

=head4 Examples

 $str =  "\xDF";      # $str is not in UTF-8 format.
 $str =~ /^\w/;       # No match, as $str isn't in UTF-8 format.
 $str .= "\x{0e0b}";  # Now $str is in UTF-8 format.
 $str =~ /^\w/;       # Match! $str is now in UTF-8 format.
 chop $str;
 $str =~ /^\w/;       # Still a match! $str remains in UTF-8 format.

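The internal encoding can also be inspected and changed directly, without
appending and chopping a wide character. The sketch below uses the core
C<utf8::is_utf8> and C<utf8::upgrade> functions. Under the rules described in
this section the first match is expected to fail and the second to succeed,
but the first result may differ if a locale or other pragma affecting regular
expression semantics is in effect.

```perl
use strict;
use warnings;

my $str = "\xDF";    # one character, not in UTF-8 format

my $utf8_before  = utf8::is_utf8($str) ? 1 : 0;
my $match_before = $str =~ /^\w/       ? 1 : 0;

# utf8::upgrade() changes only the internal encoding; the string still
# holds the same single character.
utf8::upgrade($str);

my $utf8_after  = utf8::is_utf8($str) ? 1 : 0;
my $match_after = $str =~ /^\w/       ? 1 : 0;

print "before: utf8=$utf8_before match=$match_before\n";
print "after:  utf8=$utf8_after match=$match_after\n";
```
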
=cut