This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5
=head1 NAME
X<character class>

perlrecharclass - Perl Regular Expression Character Classes

=head1 DESCRIPTION

The top level documentation about Perl regular expressions
is found in L<perlre>.

This manual page discusses the syntax and use of character
classes in Perl Regular Expressions.

A character class is a way of denoting a set of characters,
in such a way that one character of the set is matched.
It's important to remember that matching a character class
consumes exactly one character in the source string.  (The source
string is the string the regular expression is matched against.)

There are three types of character classes in Perl regular
expressions: the dot, backslashed sequences, and the form enclosed in square
brackets.  Keep in mind, though, that the term "character class" is often used
to mean just the bracketed form.  This is the case in other Perl documentation.

=head2 The dot

The dot (or period), C<.>, is probably the most used, and certainly
the best-known, character class.  By default, a dot matches any
character except the newline.  The default can be changed so that
the dot also matches the newline with the I<single line> modifier: either
for the entire regular expression using the C</s> modifier, or
locally using C<(?s)>.

Here are some examples:

 "a"  =~  /./       # Match
 "."  =~  /./       # Match
 ""   =~  /./       # No match (dot has to match a character)
 "\n" =~  /./       # No match (dot does not match a newline)
 "\n" =~  /./s      # Match (global 'single line' modifier)
 "\n" =~  /(?s:.)/  # Match (local 'single line' modifier)
 "ab" =~  /^.$/     # No match (dot matches one character)

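As a small runnable sketch of the difference the C</s> modifier makes, the
following captures as much as the dot will match from the start of a two-line
string.  The variable names and the sample string are illustrative only.

```perl
use strict;
use warnings;

# Sample data for illustration: two lines separated by a newline.
my $text = "one\ntwo";

# Without /s the dot stops at the newline.
my ($first_line) = $text =~ /^(.*)/;

# With /s the dot also matches the newline, so .* runs to the end.
my ($everything) = $text =~ /^(.*)/s;

print "default: [$first_line]\n";
print "with /s: [$everything]\n";
```
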
=head2 Backslashed sequences
X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\p> X<\P>
X<\N> X<\v> X<\V> X<\h> X<\H>
X<word> X<whitespace>

Perl regular expressions contain many backslashed sequences that
constitute a character class.  That is, they will match a single
character, if that character belongs to a specific set of characters
(defined by the sequence).  A backslashed sequence is a sequence of
characters starting with a backslash.  Not all backslashed sequences
are character classes; for a full list, see L<perlrebackslash>.

Here's a list of the backslashed sequences that are character classes.  They
are discussed in more detail below.

 \d             Match a digit character.
 \D             Match a non-digit character.
 \w             Match a "word" character.
 \W             Match a non-"word" character.
 \s             Match a whitespace character.
 \S             Match a non-whitespace character.
 \h             Match a horizontal whitespace character.
 \H             Match a character that isn't horizontal whitespace.
 \N             Match a character that isn't newline.  Experimental.
 \v             Match a vertical whitespace character.
 \V             Match a character that isn't vertical whitespace.
 \pP, \p{Prop}  Match a character matching a Unicode property.
 \PP, \P{Prop}  Match a character that doesn't match a Unicode property.

=head3 Digits

C<\d> matches a single character that is considered to be a I<digit>.  What is
considered a digit depends on the internal encoding of the source string and
the locale that is in effect.  If the source string is in UTF-8 format, C<\d>
matches not only the digits '0' to '9', but also Arabic, Devanagari and digits
from other languages.  Otherwise, if there is a locale in effect, it will match
whatever characters the locale considers digits.  Without a locale, C<\d>
matches the digits '0' to '9'.  See L</Locale, EBCDIC, Unicode and UTF-8>.

Any character that isn't matched by C<\d> will be matched by C<\D>.

=head3 Word characters

A C<\w> matches a single alphanumeric character (an alphabetic character, or a
decimal digit) or an underscore (C<_>), not a whole word.  Use C<\w+> to match
a string of Perl-identifier characters (which isn't the same as matching an
English word).  What is considered a word character depends on the internal
encoding of the string and the locale or EBCDIC code page that is in effect.  If
the string is in UTF-8 format, C<\w> matches those characters that are
considered word characters in the Unicode database.  That is, it not only
matches ASCII letters, but also Thai letters, Greek letters, etc.  If the source
string isn't in UTF-8 format, C<\w> matches those characters that are
considered word characters by the current locale or EBCDIC code page.  Without
a locale or EBCDIC code page, C<\w> matches the ASCII letters, digits and the
underscore.
See L</Locale, EBCDIC, Unicode and UTF-8>.

Any character that isn't matched by C<\w> will be matched by C<\W>.

=head3 Whitespace

C<\s> matches any single character that is considered whitespace.  In the ASCII
range, C<\s> matches the horizontal tab (C<\t>), the new line (C<\n>), the form
feed (C<\f>), the carriage return (C<\r>), and the space.  (The vertical tab,
C<\cK>, is not matched by C<\s>.)  The exact set of characters matched by C<\s>
depends on whether the source string is in UTF-8 format and the locale or
EBCDIC code page that is in effect.  If the string is in UTF-8 format, C<\s>
matches what is considered whitespace in the Unicode database; the complete
list is in the table below.  Otherwise, if there is a locale or EBCDIC code page
in effect, C<\s> matches whatever is considered whitespace by the current
locale or EBCDIC code page.  Without a locale or EBCDIC code page, C<\s> matches
the five characters mentioned at the beginning of this paragraph.  Perhaps the
most notable possible surprise is that C<\s> matches a non-breaking space only
if the non-breaking space is in a UTF-8 encoded string or the locale or EBCDIC
code page that is in effect has that character.
See L</Locale, EBCDIC, Unicode and UTF-8>.

Any character that isn't matched by C<\s> will be matched by C<\S>.

C<\h> will match any character that is considered horizontal whitespace;
this includes the space and the tab characters and 17 other characters that are
listed in the table below.  C<\H> will match any character
that is not considered horizontal whitespace.

C<\N> is new in 5.12, and is experimental.  It, like the dot, will match any
character that is not a newline.  The difference is that C<\N> will not be
influenced by the single line C</s> regular expression modifier.  Note that
there is a second meaning of C<\N> when it is of the form C<\N{...}>.  This
form is for named characters.  See L<charnames> for those.  If C<\N> is followed
by an opening brace and something that is not a quantifier, perl will assume
that a character name is coming, and not this meaning of C<\N>.  For example,
C<\N{3}> means to match 3 non-newlines; C<\N{5,}> means to match 5 or more
non-newlines, but C<\N{4F}> and C<\N{F4}> are not legal quantifiers, and will
cause perl to look for characters named C<4F> or C<F4>, respectively (and won't
find them, thus raising an error, unless they have been defined using custom
names).

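The following sketch contrasts the dot and C<\N> under the C</s> modifier, and
shows C<\N> followed by a brace quantifier.  It assumes a perl that supports
C<\N> (5.12 or later, per the text above); the variable names are illustrative
only.

```perl
use strict;
use warnings;

my $two_lines = "a\nb";

# Under /s the dot matches the newline between "a" and "b" ...
my $dot_matches = $two_lines =~ /a.b/s  ? 1 : 0;

# ... but \N still refuses to match a newline, even under /s.
my $N_matches   = $two_lines =~ /a\Nb/s ? 1 : 0;

# \N followed by a quantifier in braces: exactly three non-newlines.
my $three       = "abc" =~ /^\N{3}$/    ? 1 : 0;

print "dot under /s: $dot_matches\n";
print "\\N under /s:  $N_matches\n";
print "\\N{3}:        $three\n";
```
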
C<\v> will match any character that is considered vertical whitespace;
this includes the carriage return and line feed characters (newline) plus 5
other characters listed in the table below.
C<\V> will match any character that is not considered vertical whitespace.

C<\R> matches anything that can be considered a newline under Unicode
rules.  It's not a character class, as it can match a multi-character
sequence.  Therefore, it cannot be used inside a bracketed character
class; use C<\v> instead (vertical whitespace).
Details are discussed in L<perlrebackslash>.

Note that unlike C<\s>, C<\d> and C<\w>, C<\h> and C<\v> always match
the same characters, regardless of whether the source string is in UTF-8
format or not.  The set of characters they match is also not influenced
by locale or EBCDIC code page.

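A short sketch of that invariance, assuming an ASCII (non-EBCDIC) platform:
the same no-break space is tested once as a plain byte string and once after
C<utf8::upgrade>, and a few vertical whitespace characters are checked
against C<\v> and C<\h>.  The hash keys are made up for this illustration.

```perl
use strict;
use warnings;

# The same no-break space, once as a byte string and once upgraded to
# Perl's internal UTF-8 representation.
my $nbsp_bytes = "\xA0";
my $nbsp_utf8  = "\xA0";
utf8::upgrade($nbsp_utf8);

my %result = (
    h_bytes => ($nbsp_bytes =~ /\h/ ? 1 : 0),
    h_utf8  => ($nbsp_utf8  =~ /\h/ ? 1 : 0),
    v_nel   => ("\x85"      =~ /\v/ ? 1 : 0),   # NEXT LINE
    v_vtab  => ("\x0b"      =~ /\v/ ? 1 : 0),   # LINE TABULATION
    h_vtab  => ("\x0b"      =~ /\h/ ? 1 : 0),   # not horizontal
);

printf "%-8s %d\n", $_, $result{$_} for sort keys %result;
```
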
One might think that C<\s> is equivalent to C<[\h\v]>.  This is not true.  The
vertical tab (C<"\x0b">) is not matched by C<\s>; it is, however, considered
vertical whitespace.  Furthermore, if the source string is not in UTF-8 format,
and any locale or EBCDIC code page that is in effect doesn't include them, the
next line (C<"\x85">) and the no-break space (C<"\xA0">) characters are not
matched by C<\s>, but are by C<\v> and C<\h> respectively.  If the source
string is in UTF-8 format, both the next line and the no-break space are
matched by C<\s>.

The following table is a complete listing of characters matched by
C<\s>, C<\h> and C<\v> as of Unicode 5.2.

The first column gives the code point of the character (in hex format),
the second column gives the (Unicode) name.  The third column indicates
by which class(es) the character is matched (assuming no locale or EBCDIC code
page is in effect that changes the C<\s> matching).

 0x00009  CHARACTER TABULATION       h s
 0x0000a  LINE FEED (LF)              vs
 0x0000b  LINE TABULATION             v
 0x0000c  FORM FEED (FF)              vs
 0x0000d  CARRIAGE RETURN (CR)        vs
 0x00020  SPACE                      h s
 0x00085  NEXT LINE (NEL)             vs  [1]
 0x000a0  NO-BREAK SPACE             h s  [1]
 0x01680  OGHAM SPACE MARK           h s
 0x0180e  MONGOLIAN VOWEL SEPARATOR  h s
 0x02000  EN QUAD                    h s
 0x02001  EM QUAD                    h s
 0x02002  EN SPACE                   h s
 0x02003  EM SPACE                   h s
 0x02004  THREE-PER-EM SPACE         h s
 0x02005  FOUR-PER-EM SPACE          h s
 0x02006  SIX-PER-EM SPACE           h s
 0x02007  FIGURE SPACE               h s
 0x02008  PUNCTUATION SPACE          h s
 0x02009  THIN SPACE                 h s
 0x0200a  HAIR SPACE                 h s
 0x02028  LINE SEPARATOR              vs
 0x02029  PARAGRAPH SEPARATOR         vs
 0x0202f  NARROW NO-BREAK SPACE      h s
 0x0205f  MEDIUM MATHEMATICAL SPACE  h s
 0x03000  IDEOGRAPHIC SPACE          h s

=over 4

=item [1]

NEXT LINE and NO-BREAK SPACE only match C<\s> if the source string is in
UTF-8 format, or the locale or EBCDIC code page that is in effect includes them.

=back

It is worth noting that C<\d>, C<\w>, etc., match single characters, not
complete numbers or words.  To match a number (that consists of digits),
use C<\d+>; to match a word, use C<\w+>.

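For instance, a minimal sketch that pulls whole numbers and whole words out
of a string by adding the C<+> quantifier (the sample sentence and variable
names are illustrative only):

```perl
use strict;
use warnings;

my $line = "3 apples and 12 pears";

# \d+ collects runs of digits; \w+ collects runs of word characters.
my @numbers = $line =~ /(\d+)/g;
my @words   = $line =~ /(\w+)/g;

print "numbers: @numbers\n";
print "words:   @words\n";
```
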
=head3 Unicode Properties

C<\pP> and C<\p{Prop}> are character classes to match characters that fit given
Unicode properties.  One-letter property names can be used in the C<\pP> form,
with the property name following the C<\p>; otherwise, braces are required.
When using braces, there is a single form, which is just the property name
enclosed in the braces, and a compound form which looks like C<\p{name=value}>,
which means to match if the property "name" for the character has the particular
"value".
For instance, a match for a number can be written as C</\pN/> or as
C</\p{Number}/>, or as C</\p{Number=True}/>.
Lowercase letters are matched by the property I<Lowercase_Letter>, which
has the short form I<Ll>.  They need the braces, so are written as C</\p{Ll}/> or
C</\p{Lowercase_Letter}/>, or C</\p{General_Category=Lowercase_Letter}/>
(the underscores are optional).
C</\pLl/> is valid, but means something different.
It matches a two-character string: a letter (Unicode property C<\pL>),
followed by a lowercase C<l>.

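The sketch below exercises the three spellings of the lowercase-letter
property named above and contrasts them with C<\pLl>.  The variable names are
illustrative only.

```perl
use strict;
use warnings;

# Three spellings that name the same property.
my @lowercase_forms = (
    qr/^\p{Ll}$/,
    qr/^\p{Lowercase_Letter}$/,
    qr/^\p{General_Category=Lowercase_Letter}$/,
);

# grep in scalar context counts how many of the patterns match.
my $forms_matching_a = grep { "a" =~ $_ } @lowercase_forms;
my $forms_matching_A = grep { "A" =~ $_ } @lowercase_forms;

# \pLl is \pL (any letter) followed by a literal "l".
my $pLl_matches_Al = "Al" =~ /^\pLl$/ ? 1 : 0;
my $pLl_matches_a  = "a"  =~ /^\pLl$/ ? 1 : 0;

print "forms matching 'a': $forms_matching_a\n";
print "forms matching 'A': $forms_matching_A\n";
print "\\pLl matches 'Al':  $pLl_matches_Al\n";
print "\\pLl matches 'a':   $pLl_matches_a\n";
```
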
For more details, see L<perlunicode/Unicode Character Properties>; for a
complete list of possible properties, see
L<perluniprops/Properties accessible through \p{} and \P{}>.
It is also possible to define your own properties.  This is discussed in
L<perlunicode/User-Defined Character Properties>.


=head4 Examples

 "a"  =~  /\w/      # Match, "a" is a 'word' character.
 "7"  =~  /\w/      # Match, "7" is a 'word' character as well.
 "a"  =~  /\d/      # No match, "a" isn't a digit.
 "7"  =~  /\d/      # Match, "7" is a digit.
 " "  =~  /\s/      # Match, a space is whitespace.
 "a"  =~  /\D/      # Match, "a" is a non-digit.
 "7"  =~  /\D/      # No match, "7" is not a non-digit.
 " "  =~  /\S/      # No match, a space is not non-whitespace.

 " "  =~  /\h/      # Match, space is horizontal whitespace.
 " "  =~  /\v/      # No match, space is not vertical whitespace.
 "\r" =~  /\v/      # Match, a return is vertical whitespace.

 "a"  =~  /\pL/     # Match, "a" is a letter.
 "a"  =~  /\p{Lu}/  # No match, /\p{Lu}/ matches upper case letters.

 "\x{0e0b}" =~ /\p{Thai}/  # Match, \x{0e0b} is the character
                           # 'THAI CHARACTER SO SO', and that's in
                           # the Thai Unicode class.
 "a"  =~  /\P{Lao}/ # Match, as "a" is not a Laotian character.


=head2 Bracketed Character Classes

The third form of character class you can use in Perl regular expressions
is the bracketed form.  In its simplest form, it lists the characters
that may be matched, surrounded by square brackets, like this: C<[aeiou]>.
This matches one of C<a>, C<e>, C<i>, C<o> or C<u>.  Like the other
character classes, exactly one character will be matched.  To match
a longer string consisting of characters mentioned in the character
class, follow the character class with a quantifier.  For instance,
C<[aeiou]+> matches a string of one or more lowercase ASCII vowels.

Repeating a character in a character class has no
effect; it's considered to be in the set only once.

Examples:

 "e"  =~  /[aeiou]/        # Match, as "e" is listed in the class.
 "p"  =~  /[aeiou]/        # No match, "p" is not listed in the class.
 "ae" =~  /^[aeiou]$/      # No match, a character class only matches
                           # a single character.
 "ae" =~  /^[aeiou]+$/     # Match, due to the quantifier.

=head3 Special Characters Inside a Bracketed Character Class

Most characters that are meta characters in regular expressions (that
is, characters that carry a special meaning like C<.>, C<*>, or C<(>) lose
their special meaning and can be used inside a character class without
the need to escape them.  For instance, C<[()]> matches either an opening
parenthesis, or a closing parenthesis, and the parens inside the character
class don't group or capture.

Characters that may carry a special meaning inside a character class are:
C<\>, C<^>, C<->, C<[> and C<]>, and are discussed below.  They can be
escaped with a backslash, although this is sometimes not needed, in which
case the backslash may be omitted.

The sequence C<\b> is special inside a bracketed character class.  While
outside the character class C<\b> is an assertion indicating a point
that does not have either two word characters or two non-word characters
on either side, inside a bracketed character class, C<\b> matches a
backspace character.

The sequences
C<\a>,
C<\c>,
C<\e>,
C<\f>,
C<\n>,
C<\N{I<NAME>}>,
C<\N{U+I<wide hex char>}>,
C<\r>,
C<\t>,
and
C<\x>
are also special and have the same meanings as they do outside a bracketed
character class.

Also, a backslash followed by two or three octal digits is considered an octal
number.

A C<[> is not special inside a character class, unless it's the start
of a POSIX character class (see below).  It normally does not need escaping.

A C<]> is normally either the end of a POSIX character class (see below), or it
signals the end of the bracketed character class.  If you want to include a
C<]> in the set of characters, you must generally escape it.
However, if the C<]> is the I<first> (or the second if the first
character is a caret) character of a bracketed character class, it
does not denote the end of the class (as you cannot have an empty class)
and is considered part of the set of characters that can be matched without
escaping.

Examples:

 "+"   =~ /[+?*]/     #  Match, "+" in a character class is not special.
 "\cH" =~ /[\b]/      #  Match, \b inside a character class
                      #  is equivalent to a backspace.
 "]"   =~ /[][]/      #  Match, as the character class contains
                      #  both [ and ].
 "[]"  =~ /[[]]/      #  Match, the pattern contains a character class
                      #  containing just [, and the character class is
                      #  followed by a ].

=head3 Character Ranges

It is not uncommon to want to match a range of characters.  Luckily, instead
of listing all the characters in the range, one may use the hyphen (C<->).
If inside a bracketed character class you have two characters separated
by a hyphen, it's treated as if all the characters between the two are in
the class.  For instance, C<[0-9]> matches any ASCII digit, and C<[a-m]>
matches any lowercase letter from the first half of the ASCII alphabet.

Note that the two characters on either side of the hyphen are not
necessarily both letters or both digits.  Any character is possible,
although not advisable.  C<['-?]> contains a range of characters, but
most people will not know which characters that will be.  Furthermore,
such ranges may lead to portability problems if the code has to run on
a platform that uses a different character set, such as EBCDIC.

If a hyphen in a character class cannot syntactically be part of a range, for
instance because it is the first or the last character of the character class,
or if it immediately follows a range, the hyphen isn't special, and will be
considered a character that may be matched literally.  You have to escape the
hyphen with a backslash if you want to have a hyphen in your set of characters
to be matched, and its position in the class is such that it could be
considered part of a range.

Examples:

 [a-z]       #  Matches a character that is a lower case ASCII letter.
 [a-fz]      #  Matches any letter between 'a' and 'f' (inclusive) or
             #  the letter 'z'.
 [-z]        #  Matches either a hyphen ('-') or the letter 'z'.
 [a-f-m]     #  Matches any letter between 'a' and 'f' (inclusive), the
             #  hyphen ('-'), or the letter 'm'.
 ['-?]       #  Matches any of the characters  '()*+,-./0123456789:;<=>?
             #  (But not on an EBCDIC platform).

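One way to see exactly what a range covers is to test every ASCII character
against the class.  The sketch below does that for three of the classes
above; C<ascii_matches> is a helper invented for this illustration, and the
results assume an ASCII (non-EBCDIC) platform.

```perl
use strict;
use warnings;

# Return every ASCII character (code points 0 to 127) matched by a regex,
# joined in code point order.
sub ascii_matches {
    my ($re) = @_;
    return join '', grep { $_ =~ $re } map { chr } 0 .. 127;
}

my $odd_range    = ascii_matches(qr/['-?]/);    # range from ' to ?
my $hyphen_first = ascii_matches(qr/[-z]/);     # leading hyphen is literal
my $after_range  = ascii_matches(qr/[a-f-m]/);  # hyphen after a range is literal

print "['-?]   matches: $odd_range\n";
print "[-z]    matches: $hyphen_first\n";
print "[a-f-m] matches: $after_range\n";
```
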
=head3 Negation

It is also possible to instead list the characters you do not want to
match.  You can do so by using a caret (C<^>) as the first character in the
character class.  For instance, C<[^a-z]> matches a character that is not a
lowercase ASCII letter.

This syntax makes the caret a special character inside a bracketed character
class, but only if it is the first character of the class.  So if you want
to have the caret as one of the characters you want to match, you either
have to escape the caret, or not list it first.

Examples:

 "e"  =~  /[^aeiou]/   #  No match, the 'e' is listed.
 "x"  =~  /[^aeiou]/   #  Match, as 'x' isn't a lowercase vowel.
 "^"  =~  /[^^]/       #  No match, matches anything that isn't a caret.
 "^"  =~  /[x^]/       #  Match, caret is not special here.

=head3 Backslash Sequences

You can put any backslash sequence character class (with the exception of
C<\N>) inside a bracketed character class, and it will act just
as if you put all the characters matched by the backslash sequence inside the
character class.  For instance, C<[a-f\d]> will match any digit, or any of the
lowercase letters between 'a' and 'f' inclusive.

C<\N> within a bracketed character class must be of the forms C<\N{I<name>}> or
C<\N{U+I<wide hex char>}> for the same reason that a dot C<.> inside a
bracketed character class loses its special meaning: it matches nearly
anything, which generally isn't what you want to happen.

Examples:

 /[\p{Thai}\d]/     # Matches a character that is either a Thai
                    # character, or a digit.
 /[^\p{Arabic}()]/  # Matches a character that is neither an Arabic
                    # character, nor a parenthesis.

Backslash sequence character classes cannot form one of the endpoints
of a range.

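As a small sketch of mixing backslash sequences with literal characters, the
following lists the ASCII characters matched by C<[a-f\d]> and checks two
inputs against the negated class from the examples.  The variable names are
illustrative, and the listed result assumes an ASCII platform.

```perl
use strict;
use warnings;

# Collect the ASCII characters (code points 0 to 127) matched by [a-f\d].
my $matched = join '', grep { /[a-f\d]/ } map { chr } 0 .. 127;

# A negated class mixing a Unicode property with literal parentheses.
my $paren_rejected  = "(" =~ /[^\p{Arabic}()]/ ? 0 : 1;
my $letter_accepted = "x" =~ /[^\p{Arabic}()]/ ? 1 : 0;

print "[a-f\\d] matches: $matched\n";
print "'(' rejected:    $paren_rejected\n";
print "'x' accepted:    $letter_accepted\n";
```
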
=head3 Posix Character Classes
X<character class> X<\p> X<\p{}>
X<alpha> X<alnum> X<ascii> X<blank> X<cntrl> X<digit> X<graph>
X<lower> X<print> X<punct> X<space> X<upper> X<word> X<xdigit>

POSIX character classes have the form C<[:class:]>, where I<class> is the
name, and C<[:> and C<:]> are the delimiters.  POSIX character classes only
appear I<inside> bracketed character classes, and are a convenient and
descriptive way of listing a group of characters, though they currently suffer
from portability issues (see below and
L</Locale, EBCDIC, Unicode and UTF-8>).  Be careful about the syntax:

 # Correct:
 $string =~ /[[:alpha:]]/

 # Incorrect (will warn):
 $string =~ /[:alpha:]/

The latter pattern would be a character class consisting of a colon,
and the letters C<a>, C<l>, C<p> and C<h>.

These character classes can be part of a larger bracketed character class.  For
example,

 [01[:alpha:]%]

is valid and matches '0', '1', any alphabetic character, and the percent sign.

Perl recognizes the following POSIX character classes:

 alpha   Any alphabetical character ("[A-Za-z]").
 alnum   Any alphanumerical character ("[A-Za-z0-9]").
 ascii   Any character in the ASCII character set.
 blank   A GNU extension, equal to a space or a horizontal tab ("\t").
 cntrl   Any control character.  See Note [2] below.
 digit   Any decimal digit ("[0-9]"), equivalent to "\d".
 graph   Any printable character, excluding a space.  See Note [3] below.
 lower   Any lowercase character ("[a-z]").
 print   Any printable character, including a space.  See Note [4] below.
 punct   Any graphical character excluding "word" characters.  Note [5].
 space   Any whitespace character.  "\s" plus the vertical tab ("\cK").
 upper   Any uppercase character ("[A-Z]").
 word    A Perl extension ("[A-Za-z0-9_]"), equivalent to "\w".
 xdigit  Any hexadecimal digit ("[0-9a-fA-F]").

Most POSIX character classes have two Unicode-style C<\p> property
counterparts.  (They are not official Unicode properties, but Perl extensions
derived from official Unicode properties.)  The table below shows the relation
between POSIX character classes and these counterparts.

One counterpart, in the column labelled "ASCII-range Unicode" in
the table, will only match characters in the ASCII range.  (On EBCDIC platforms,
they match those characters which have ASCII equivalents.)

The other counterpart, in the column labelled "Full-range Unicode", matches any
appropriate characters in the full Unicode character set.  For example,
C<\p{Alpha}> will match not just the ASCII alphabetic characters, but any
character in the entire Unicode character set that is considered to be
alphabetic.

(Each of the counterparts has various synonyms as well.
L<perluniprops/Properties accessible through \p{} and \P{}> lists all the
synonyms, plus all the characters matched by each of the ASCII-range
properties.  For example, C<\p{AHex}> is a synonym for C<\p{ASCII_Hex_Digit}>,
and any C<\p> property name can be prefixed with "Is" such as C<\p{IsAlpha}>.)

Both the C<\p> forms are unaffected by any locale that is in effect, or whether
the string is in UTF-8 format or not, or whether the platform is EBCDIC or not.
In contrast, the POSIX character classes are affected.  If the source string is
in UTF-8 format, the POSIX classes (with the exception of C<[[:punct:]]>, see
Note [5]) behave like their "Full-range" Unicode counterparts.  If the source
string is not in UTF-8 format, and no locale is in effect, and the platform is
not EBCDIC, all the POSIX classes behave like their ASCII-range counterparts.
Otherwise, they behave based on the rules of the locale or EBCDIC code page.

It is proposed to change this behavior in a future release of Perl so that
the UTF8ness of the source string will be irrelevant to the behavior of the
POSIX character classes.  This means they will always behave in strict
accordance with the official POSIX standard.  That is, if either locale or
EBCDIC code page is present, they will behave in accordance with those; if
absent, the classes will match only their ASCII-range counterparts.  If you
disagree with this proposal, send email to C<perl5-porters@perl.org>.

 [[:...:]]    ASCII-range          Full-range       backslash Note
              Unicode              Unicode          sequence
 -----------------------------------------------------------------
   alpha      \p{PosixAlpha}       \p{Alpha}
   alnum      \p{PosixAlnum}       \p{Alnum}
   ascii      \p{ASCII}
   blank      \p{PosixBlank}       \p{Blank} =                [1]
                                   \p{HorizSpace}   \h        [1]
   cntrl      \p{PosixCntrl}       \p{Cntrl}                  [2]
   digit      \p{PosixDigit}       \p{Digit}        \d
   graph      \p{PosixGraph}       \p{Graph}                  [3]
   lower      \p{PosixLower}       \p{Lower}
   print      \p{PosixPrint}       \p{Print}                  [4]
   punct      \p{PosixPunct}       \p{Punct}                  [5]
              \p{PerlSpace}        \p{SpacePerl}    \s        [6]
   space      \p{PosixSpace}       \p{Space}                  [6]
   upper      \p{PosixUpper}       \p{Upper}
   word       \p{PerlWord}         \p{Word}         \w
   xdigit     \p{ASCII_Hex_Digit}  \p{XDigit}

=over 4

=item [1]

C<\p{Blank}> and C<\p{HorizSpace}> are synonyms.

=item [2]

Control characters don't produce output as such, but instead usually control
the terminal somehow: for example, newline and backspace are control characters.
In the ASCII range, characters whose ordinals are between 0 and 31 inclusive,
plus 127 (C<DEL>), are control characters.

On EBCDIC platforms, it is likely that the code page will define C<[[:cntrl:]]>
to be the EBCDIC equivalents of the ASCII controls, plus the controls
that in Unicode have ordinals from 128 through 159.

=item [3]

Any character that is I<graphical>, that is, visible.  This class consists
of all the alphanumerical characters and all punctuation characters.

=item [4]

All printable characters, which is the set of all the graphical characters
plus whitespace characters that are not also controls.

=item [5]

C<\p{PosixPunct}> and C<[[:punct:]]> in the ASCII range match all the
non-control, non-alphanumeric, non-space characters:
C<[-!"#$%&'()*+,./:;<=E<gt>?@[\\\]^_`{|}~]> (although if a locale is in effect,
it could alter the behavior of C<[[:punct:]]>).

When the matching string is in UTF-8 format, C<[[:punct:]]> matches the above
set, plus what C<\p{Punct}> matches.  This is different from strictly matching
according to C<\p{Punct}>, because the above set includes characters that aren't
considered punctuation by Unicode, but rather "symbols".  Another way to say it
is that for a UTF-8 string, C<[[:punct:]]> matches all the characters that
Unicode considers to be punctuation, plus all the ASCII-range characters that
Unicode considers to be symbols.

=item [6]

C<\p{SpacePerl}> and C<\p{Space}> differ only in that C<\p{Space}> additionally
matches the vertical tab, C<\cK>.  The same holds for the two ASCII-range forms.

=back

=head4 Negation
X<character class, negation>

A Perl extension to the POSIX character class is the ability to
negate it.  This is done by prefixing the class name with a caret (C<^>).
Some examples:

 POSIX          ASCII-range     Full-range     backslash
                Unicode         Unicode        sequence
 -------------------------------------------------------
 [[:^digit:]]   \P{PosixDigit}  \P{Digit}      \D
 [[:^space:]]   \P{PosixSpace}  \P{Space}
                \P{PerlSpace}   \P{SpacePerl}  \S
 [[:^word:]]    \P{PerlWord}    \P{Word}       \W

=head4 [= =] and [. .]

Perl will recognize the POSIX character classes C<[=class=]> and
C<[.class.]>, but does not (yet?) support them.  Use of
such a construct will lead to an error.


=head4 Examples

 /[[:digit:]]/            # Matches a character that is a digit.
 /[01[:lower:]]/          # Matches a character that is either a
                          # lowercase letter, or '0' or '1'.
 /[[:digit:][:^xdigit:]]/ # Matches a character that can be anything
                          # except the letters 'a' to 'f' and 'A' to
                          # 'F'.  This is because the main character
                          # class is composed of two POSIX character
                          # classes that are ORed together, one that
                          # matches any digit, and the other that
                          # matches anything that isn't a hex digit.


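The claims about these classes can be checked over the ASCII range with a
short script.  This is only a sketch: it tests byte strings with no locale in
effect on an ASCII platform, and the variable names are illustrative.

```perl
use strict;
use warnings;

# Every ASCII character, as single-character strings.
my @ascii = map { chr } 0 .. 127;

# A POSIX class combined with literal characters.
my $mixed = join '', grep { /[01[:alpha:]%]/ } @ascii;

# ASCII characters rejected by the ORed class from the last example.
my $rejected = join '', grep { !/[[:digit:][:^xdigit:]]/ } @ascii;

# [[:^digit:]] is the negated form of [[:digit:]]; count its matches.
my $non_digits = grep { /[[:^digit:]]/ } @ascii;

print "mixed class matches: $mixed\n";
print "ORed class rejects:  $rejected\n";
print "non-digit count:     $non_digits\n";
```
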
=head2 Locale, EBCDIC, Unicode and UTF-8

Some of the character classes have a somewhat different behaviour depending
on the internal encoding of the source string, the locale that is
in effect, and whether the program is running on an EBCDIC platform.

C<\w>, C<\d>, C<\s> and the POSIX character classes (and their negations,
including C<\W>, C<\D>, C<\S>) suffer from this behaviour.  (Since the backslash
sequences C<\b> and C<\B> are defined in terms of C<\w> and C<\W>, they also are
affected.)

The rule is that if the source string is in UTF-8 format, the character
classes match according to the Unicode properties.  If the source string
isn't, then the character classes match according to whatever locale or EBCDIC
code page is in effect.  If there is neither a locale nor an EBCDIC code page,
they match the ASCII defaults (52 letters, 10 digits and underscore for C<\w>;
0 to 9 for C<\d>; etc.).

This usually means that if you are matching against characters whose C<ord()>
values are between 128 and 255 inclusive, your character class may match
or not depending on the current locale or EBCDIC code page, and whether the
source string is in UTF-8 format.  The string will be in UTF-8 format if it
contains characters whose C<ord()> value exceeds 255.  But a string may be in
UTF-8 format without it having such characters.  See L<perlunicode/The
"Unicode Bug">.

For portability reasons, it may be better to not use C<\w>, C<\d>, C<\s>
or the POSIX character classes, and use the Unicode properties instead.

=head4 Examples

 $str =  "\xDF";      # $str is not in UTF-8 format.
 $str =~ /^\w/;       # No match, as $str isn't in UTF-8 format.
 $str .= "\x{0e0b}";  # Now $str is in UTF-8 format.
 $str =~ /^\w/;       # Match! $str is now in UTF-8 format.
 chop $str;
 $str =~ /^\w/;       # Still a match! $str remains in UTF-8 format.

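The changes of internal format in the example above can be observed directly
with C<utf8::is_utf8>.  The sketch below records only the UTF-8 flag after
each step, not the regular expression results, and its variable names are
illustrative.

```perl
use strict;
use warnings;

my @flag;                            # UTF-8 flag after each step

my $str = "\xDF";                    # only a code point below 256
push @flag, utf8::is_utf8($str) ? 1 : 0;

$str .= "\x{0e0b}";                  # a code point above 255 forces UTF-8
push @flag, utf8::is_utf8($str) ? 1 : 0;

chop $str;                           # removing it does not downgrade
push @flag, utf8::is_utf8($str) ? 1 : 0;

print "UTF-8 flag per step: @flag\n";
```
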
=cut