Commit | Line | Data |
---|---|---|
8a118206 | 1 | =head1 NAME |
ea449505 | 2 | X<character class> |
8a118206 RGS |
3 | |
4 | perlrecharclass - Perl Regular Expression Character Classes | |
5 | ||
6 | =head1 DESCRIPTION | |
7 | ||
8 | The top level documentation about Perl regular expressions | |
9 | is found in L<perlre>. | |
10 | ||
11 | This manual page discusses the syntax and use of character | |
6b83a163 | 12 | classes in Perl regular expressions. |
8a118206 | 13 | |
6b83a163 | 14 | A character class is a way of denoting a set of characters |
8a118206 | 15 | in such a way that one character of the set is matched. |
6b83a163 | 16 | It's important to remember that matching a character class |
8a118206 RGS |
17 | consumes exactly one character in the source string. (The source |
18 | string is the string the regular expression is matched against.) | |
19 | ||
20 | There are three types of character classes in Perl regular | |
6b83a163 | 21 | expressions: the dot, backslash sequences, and the form enclosed in square |
ea449505 | 22 | brackets. Keep in mind, though, that often the term "character class" is used |
6b83a163 | 23 | to mean just the bracketed form. Certainly, most Perl documentation does that. |
8a118206 RGS |
24 | |
25 | =head2 The dot | |
26 | ||
27 | The dot (or period), C<.>, is probably the most used, and certainly | |
28 | the most well-known character class. By default, a dot matches any | |
5db9882c | 29 | character, except for the newline. That default can be changed to |
6b83a163 KW |
30 | add matching the newline by using the I<single line> modifier: either |
31 | for the entire regular expression with the C</s> modifier, or | |
32 | locally with C<(?s)>. (The experimental C<\N> backslash sequence, described | |
33 | below, matches any character except newline without regard to the | |
34 | I<single line> modifier.) | |
8a118206 RGS |
35 | |
36 | Here are some examples: | |
37 | ||
38 | "a" =~ /./ # Match | |
39 | "." =~ /./ # Match | |
40 | "" =~ /./ # No match (dot has to match a character) | |
41 | "\n" =~ /./ # No match (dot does not match a newline) | |
42 | "\n" =~ /./s # Match (global 'single line' modifier) | |
43 | "\n" =~ /(?s:.)/ # Match (local 'single line' modifier) | |
44 | "ab" =~ /^.$/ # No match (dot matches one character) | |
45 | ||
6b83a163 | 46 | =head2 Backslash sequences |
82206b5e | 47 | X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\p> X<\P> |
ea449505 KW |
48 | X<\N> X<\v> X<\V> X<\h> X<\H> |
49 | X<word> X<whitespace> | |
8a118206 | 50 | |
6b83a163 KW |
51 | A backslash sequence is a sequence of characters, the first one of which is a |
52 | backslash. Perl ascribes special meaning to many such sequences, and some of | |
53 | these are character classes. That is, they match a single character each, | |
54 | provided that the character belongs to the specific set of characters defined | |
55 | by the sequence. | |
8a118206 | 56 | |
6b83a163 KW |
57 | Here's a list of the backslash sequences that are character classes. They |
58 | are discussed in more detail below. (For the backslash sequences that aren't | |
59 | character classes, see L<perlrebackslash>.) | |
8a118206 | 60 | |
6b83a163 KW |
61 | \d Match a decimal digit character. |
62 | \D Match a non-decimal-digit character. | |
8a118206 RGS |
63 | \w Match a "word" character. |
64 | \W Match a non-"word" character. | |
ea449505 KW |
65 | \s Match a whitespace character. |
66 | \S Match a non-whitespace character. | |
67 | \h Match a horizontal whitespace character. | |
68 | \H Match a character that isn't horizontal whitespace. | |
ea449505 KW |
69 | \v Match a vertical whitespace character. |
70 | \V Match a character that isn't vertical whitespace. | |
6b83a163 KW |
71 | \N Match a character that isn't a newline. Experimental. |
72 | \pP, \p{Prop} Match a character that has the given Unicode property. | |
6c5a041f | 73 | \PP, \P{Prop} Match a character that doesn't have the Unicode property. |
8a118206 | 74 | |
1433f837 KW |
75 | =head3 \N |
76 | ||
77 | C<\N> is new in 5.12, and is experimental. It, like the dot, matches any | |
78 | character that is not a newline. The difference is that C<\N> is not influenced | |
79 | by the I<single line> regular expression modifier (see L</The dot> above). Note | |
80 | that the form C<\N{...}> may mean something completely different. When the | |
81 | C<{...}> is a L<quantifier|perlre/Quantifiers>, it means to match a non-newline | |
82 | character that many times. For example, C<\N{3}> means to match 3 | |
83 | non-newlines; C<\N{5,}> means to match 5 or more non-newlines. But if C<{...}> | |
84 | is not a legal quantifier, it is presumed to be a named character. See | |
85 | L<charnames> for those. For example, none of C<\N{COLON}>, C<\N{4F}>, and | |
86 | C<\N{F4}> contain legal quantifiers, so Perl will try to find characters whose | |
87 | names are respectively C<COLON>, C<4F>, and C<F4>. | |
88 | ||
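As a small sketch of the behavior described above (the variable names are illustrative only, and a Perl of at least 5.12 is assumed), the following compares the dot with C<\N> under the I<single line> modifier and shows the quantifier reading of C<\N{3}>:

```perl
use strict;
use warnings;

# The dot matches a newline once /s is in effect ...
my $dot_matches_nl = ("\n" =~ /./s)  ? 1 : 0;

# ... but \N never does, with or without /s.
my $N_matches_nl   = ("\n" =~ /\N/s) ? 1 : 0;

# {3} is a legal quantifier, so \N{3} means "three non-newlines".
my $three_non_nl   = ("abc" =~ /\A\N{3}\z/) ? 1 : 0;

print "dot under /s: $dot_matches_nl, \\N under /s: $N_matches_nl, ",
      "\\N{3}: $three_non_nl\n";
```
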
8a118206 RGS |
89 | =head3 Digits |
90 | ||
b6538e4f | 91 | C<\d> matches a single character considered to be a decimal I<digit>. |
5db9882c | 92 | If the C</a> regular expression modifier is in effect, it matches [0-9]. |
582da942 | 93 | Otherwise, it |
82206b5e KW |
94 | matches anything that is matched by C<\p{Digit}>, which includes [0-9]. |
95 | (An unlikely possible exception is that under locale matching rules, the | |
96 | current locale might not have [0-9] matched by C<\d>, and/or might match | |
97 | other characters whose code point is less than 256. Such a locale | |
98 | definition would be in violation of the C language standard, but Perl | |
99 | doesn't currently assume anything in regard to this.) | |
100 | ||
101 | What this means is that unless the C</a> modifier is in effect, C<\d> not | |
102 | only matches the digits '0' - '9', but also Arabic, Devanagari, and | |
103 | digits from other languages. This may cause some confusion, and some | |
104 | security issues. | |
105 | ||
106 | Some digits that C<\d> matches look like some of the [0-9] ones, but | |
107 | have different values. For example, BENGALI DIGIT FOUR (U+09EA) looks | |
108 | very much like an ASCII DIGIT EIGHT (U+0038). An application that | |
109 | is expecting only the ASCII digits might be misled, or if the match is | |
110 | C<\d+>, the matched string might contain a mixture of digits from | |
111 | different writing systems that look like they signify a number different | |
e397bccf KW |
112 | than they actually do. L<Unicode::UCDE<sol>num()|Unicode::UCD/num> can |
113 | be used to safely | |
82206b5e KW |
114 | calculate the value, returning C<undef> if the input string contains |
115 | such a mixture. | |
116 | ||
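The following is a minimal sketch of the points above, assuming a Perl recent enough to provide both the C</a> modifier and C<Unicode::UCD::num()>. The variable names are hypothetical. It shows C<\d> matching a non-ASCII digit, C</a> restricting it to [0-9], and C<num()> rejecting a string that mixes digits from two scripts:

```perl
use strict;
use warnings;
use Unicode::UCD qw(num);

my $bengali_four = "\x{09EA}";    # BENGALI DIGIT FOUR

# Without /a, \d follows \p{Digit}; with /a it is limited to [0-9].
my $matches_d   = ($bengali_four =~ /\d/)  ? 1 : 0;
my $matches_d_a = ($bengali_four =~ /\d/a) ? 1 : 0;

# num() yields a value only when every digit comes from the same script.
my $same_script  = num("\x{09EA}\x{09EB}");   # Bengali four, Bengali five
my $mixed_script = num("1\x{09EA}");          # ASCII one, Bengali four

print "\\d: $matches_d, \\d under /a: $matches_d_a\n";
print "mixed-script string is ",
      (defined $mixed_script ? "accepted" : "rejected"), "\n";
```

An application that expects only ASCII digits could either add C</a> to the pattern or validate the matched string with C<num()> before using it.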
117 | What C<\p{Digit}> means (and hence C<\d> except under the C</a> | |
118 | modifier) is C<\p{General_Category=Decimal_Number}>, or synonymously, | |
119 | C<\p{General_Category=Digit}>. Starting with Unicode version 4.1, this | |
120 | is the same set of characters matched by C<\p{Numeric_Type=Decimal}>. | |
6b83a163 KW |
121 | But Unicode also has a different property with a similar name, |
122 | C<\p{Numeric_Type=Digit}>, which matches a completely different set of | |
82206b5e KW |
123 | characters. These characters are things such as C<CIRCLED DIGIT ONE> |
124 | or subscripts, or are from writing systems that lack all ten digits. | |
6b83a163 | 125 | |
82206b5e KW |
126 | The design intent is for C<\d> to exactly match the set of characters |
127 | that can safely be used with "normal" big-endian positional decimal | |
128 | syntax, where, for example 123 means one 'hundred', plus two 'tens', | |
129 | plus three 'ones'. This positional notation does not necessarily apply | |
130 | to characters that match the other type of "digit", | |
131 | C<\p{Numeric_Type=Digit}>, and so C<\d> doesn't match them. | |
6b83a163 | 132 | |
e2cfb18c | 133 | The Tamil digits (U+0BE6 - U+0BEF) can also legally be |
82206b5e KW |
134 | used in old-style Tamil numbers in which they would appear no more than |
135 | one in a row, separated by characters that mean "times 10", "times 100", | |
136 | etc. (See L<http://www.unicode.org/notes/tn21>.) | |
8a118206 | 137 | |
b6538e4f | 138 | Any character not matched by C<\d> is matched by C<\D>. |
8a118206 RGS |
139 | |
140 | =head3 Word characters | |
141 | ||
ea449505 | 142 | A C<\w> matches a single alphanumeric character (an alphabetic character, or a |
d35dd6c6 KW |
143 | decimal digit) or a connecting punctuation character, such as an |
144 | underscore ("_"). It does not match a whole word. To match a whole | |
82206b5e | 145 | word, use C<\w+>. This isn't the same thing as matching an English word, but |
765fa144 | 146 | in the ASCII range it is the same as a string of Perl-identifier |
82206b5e KW |
147 | characters. |
148 | ||
149 | =over | |
150 | ||
151 | =item If the C</a> modifier is in effect ... | |
152 | ||
153 | C<\w> matches the 63 characters [a-zA-Z0-9_]. | |
154 | ||
155 | =item otherwise ... | |
156 | ||
157 | =over | |
158 | ||
159 | =item For code points above 255 ... | |
160 | ||
161 | C<\w> matches the same as C<\p{Word}> matches in this range. That is, | |
162 | it matches Thai letters, Greek letters, etc. This includes connector | |
d35dd6c6 | 163 | punctuation (like the underscore) which connects two words together, or |
b6538e4f | 164 | diacritics, such as a C<COMBINING TILDE> and the modifier letters, which |
82206b5e KW |
165 | are generally used to add auxiliary markings to letters. |
166 | ||
167 | =item For code points below 256 ... | |
168 | ||
169 | =over | |
170 | ||
171 | =item if locale rules are in effect ... | |
172 | ||
173 | C<\w> matches the platform's native underscore character plus whatever | |
174 | the locale considers to be alphanumeric. | |
175 | ||
176 | =item if Unicode rules are in effect or if on an EBCDIC platform ... | |
177 | ||
178 | C<\w> matches exactly what C<\p{Word}> matches. | |
179 | ||
180 | =item otherwise ... | |
181 | ||
182 | C<\w> matches [a-zA-Z0-9_]. | |
183 | ||
184 | =back | |
185 | ||
186 | =back | |
187 | ||
188 | =back | |
189 | ||
190 | Which rules apply are determined as described in L<perlre/Which character set modifier is in effect?>. | |
8a118206 | 191 | |
6b83a163 KW |
192 | There are a number of security issues with the full Unicode list of word |
193 | characters. See L<http://unicode.org/reports/tr36>. | |
194 | ||
195 | Also, for a somewhat finer-grained set of characters that are in programming | |
196 | language identifiers beyond the ASCII range, you may wish to instead use the | |
e2cfb18c KW |
197 | more customized L</Unicode Properties>, C<\p{ID_Start}>, |
198 | C<\p{ID_Continue}>, C<\p{XID_Start}>, and C<\p{XID_Continue}>. See | |
199 | L<http://unicode.org/reports/tr31>. | |
6b83a163 | 200 | |
b6538e4f | 201 | Any character not matched by C<\w> is matched by C<\W>. |
8a118206 | 202 | |
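To illustrate the rules above for a code point above 255, here is a short sketch (the variable names are hypothetical, and a Perl that supports the C</a> modifier is assumed):

```perl
use strict;
use warnings;

my $alpha = "\x{03B1}";   # GREEK SMALL LETTER ALPHA, a code point above 255

my $w_unicode = ($alpha =~ /\w/)  ? 1 : 0;   # matched the same as \p{Word}
my $w_ascii   = ($alpha =~ /\w/a) ? 1 : 0;   # /a limits \w to [a-zA-Z0-9_]
my $W_ascii   = ($alpha =~ /\W/a) ? 1 : 0;   # so \W matches it instead

# \w matches one character; \w+ is needed for a whole "word".
my $whole     = ("foo_bar42" =~ /\A\w+\z/) ? 1 : 0;

print "\\w: $w_unicode, \\w/a: $w_ascii, \\W/a: $W_ascii, \\w+: $whole\n";
```
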
ea449505 KW |
203 | =head3 Whitespace |
204 | ||
82206b5e KW |
205 | C<\s> matches any single character considered whitespace. |
206 | ||
207 | =over | |
208 | ||
209 | =item If the C</a> modifier is in effect ... | |
210 | ||
211 | C<\s> matches the 5 characters [\t\n\f\r ]; that is, the horizontal tab, | |
212 | the newline, the form feed, the carriage return, and the space. (Note | |
213 | that it doesn't match the vertical tab, C<\cK> on ASCII platforms.) | |
214 | ||
215 | =item otherwise ... | |
216 | ||
217 | =over | |
218 | ||
219 | =item For code points above 255 ... | |
220 | ||
221 | C<\s> matches exactly the code points above 255 shown with an "s" column | |
222 | in the table below. | |
223 | ||
224 | =item For code points below 256 ... | |
225 | ||
226 | =over | |
227 | ||
228 | =item if locale rules are in effect ... | |
229 | ||
230 | C<\s> matches whatever the locale considers to be whitespace. Note that | |
231 | this is likely to include the vertical tab, unlike non-locale C<\s> | |
232 | matching. | |
233 | ||
234 | =item if Unicode rules are in effect or if on an EBCDIC platform ... | |
235 | ||
236 | C<\s> matches exactly the characters shown with an "s" column in the | |
237 | table below. | |
238 | ||
239 | =item otherwise ... | |
240 | ||
241 | C<\s> matches [\t\n\f\r ]. | |
242 | Note that this list doesn't include the non-breaking space. | |
243 | ||
244 | =back | |
245 | ||
246 | =back | |
247 | ||
248 | =back | |
249 | ||
250 | Which rules apply are determined as described in L<perlre/Which character set modifier is in effect?>. | |
8a118206 | 251 | |
b6538e4f | 252 | Any character not matched by C<\s> is matched by C<\S>. |
8a118206 | 253 | |
b6538e4f | 254 | C<\h> matches any character considered horizontal whitespace; |
82206b5e | 255 | this includes the space and tab characters and several others |
b6538e4f TC |
256 | listed in the table below. C<\H> matches any character |
257 | not considered horizontal whitespace. | |
ea449505 | 258 | |
b6538e4f | 259 | C<\v> matches any character considered vertical whitespace; |
82206b5e | 260 | this includes the carriage return and line feed characters (newline) |
b6538e4f TC |
261 | plus several other characters, all listed in the table below. |
262 | C<\V> matches any character not considered vertical whitespace. | |
8a118206 RGS |
263 | |
264 | C<\R> matches anything that can be considered a newline under Unicode | |
265 | rules. It's not a character class, as it can match a multi-character | |
266 | sequence. Therefore, it cannot be used inside a bracketed character | |
ea449505 KW |
267 | class; use C<\v> instead (vertical whitespace). |
268 | Details are discussed in L<perlrebackslash>. | |
8a118206 | 269 | |
82206b5e | 270 | Note that unlike C<\s> (and C<\d> and C<\w>), C<\h> and C<\v> always match |
b6538e4f TC |
271 | the same characters, without regard to other factors, such as whether the |
272 | source string is in UTF-8 format. | |
8a118206 | 273 | |
82206b5e | 274 | One might think that C<\s> is equivalent to C<[\h\v]>. This is not true. |
5db9882c KW |
275 | The difference is that the vertical tab (C<"\x0b">) is not matched by |
276 | C<\s>; it is however considered vertical whitespace. | |
8a118206 RGS |
277 | |
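A brief sketch of how C<\h> and C<\v> classify a few of the characters discussed here (the variable names are illustrative). Whether C<\s> matches the vertical tab is printed rather than assumed, because the answer may differ on a Perl release other than the one this page describes:

```perl
use strict;
use warnings;

my $vt = "\x0b";   # LINE TABULATION (vertical tab)

my $vt_is_v    = ($vt =~ /\v/)        ? 1 : 0;   # vertical whitespace
my $vt_is_h    = ($vt =~ /\h/)        ? 1 : 0;   # not horizontal whitespace
my $em_space_h = ("\x{2003}" =~ /\h/) ? 1 : 0;   # EM SPACE
my $line_sep_v = ("\x{2028}" =~ /\v/) ? 1 : 0;   # LINE SEPARATOR

# Reported, not asserted: this depends on the Perl release in use.
my $vt_is_s    = ($vt =~ /\s/)        ? 1 : 0;
print "\\s matches the vertical tab on this perl: $vt_is_s\n";
```
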
278 | The following table is a complete listing of characters matched by | |
82206b5e | 279 | C<\s>, C<\h> and C<\v> as of Unicode 6.0. |
8a118206 | 280 | |
582da942 | 281 | The first column gives the Unicode code point of the character (in hex format), |
8a118206 | 282 | the second column gives the (Unicode) name. The third column indicates |
ea449505 KW |
283 | by which class(es) the character is matched (assuming no locale or EBCDIC code |
284 | page is in effect that changes the C<\s> matching). | |
8a118206 | 285 | |
fc28d2a3 KW |
286 | 0x0009 CHARACTER TABULATION h s |
287 | 0x000a LINE FEED (LF) vs | |
288 | 0x000b LINE TABULATION v | |
289 | 0x000c FORM FEED (FF) vs | |
290 | 0x000d CARRIAGE RETURN (CR) vs | |
291 | 0x0020 SPACE h s | |
292 | 0x0085 NEXT LINE (NEL) vs [1] | |
293 | 0x00a0 NO-BREAK SPACE h s [1] | |
294 | 0x1680 OGHAM SPACE MARK h s | |
295 | 0x180e MONGOLIAN VOWEL SEPARATOR h s | |
296 | 0x2000 EN QUAD h s | |
297 | 0x2001 EM QUAD h s | |
298 | 0x2002 EN SPACE h s | |
299 | 0x2003 EM SPACE h s | |
300 | 0x2004 THREE-PER-EM SPACE h s | |
301 | 0x2005 FOUR-PER-EM SPACE h s | |
302 | 0x2006 SIX-PER-EM SPACE h s | |
303 | 0x2007 FIGURE SPACE h s | |
304 | 0x2008 PUNCTUATION SPACE h s | |
305 | 0x2009 THIN SPACE h s | |
306 | 0x200a HAIR SPACE h s | |
307 | 0x2028 LINE SEPARATOR vs | |
308 | 0x2029 PARAGRAPH SEPARATOR vs | |
309 | 0x202f NARROW NO-BREAK SPACE h s | |
310 | 0x205f MEDIUM MATHEMATICAL SPACE h s | |
311 | 0x3000 IDEOGRAPHIC SPACE h s | |
8a118206 RGS |
312 | |
313 | =over 4 | |
314 | ||
315 | =item [1] | |
316 | ||
82206b5e KW |
317 | NEXT LINE and NO-BREAK SPACE may or may not match C<\s> depending |
318 | on the rules in effect. See | |
319 | L<the beginning of this section|/Whitespace>. | |
8a118206 RGS |
320 | |
321 | =back | |
322 | ||
8a118206 RGS |
323 | =head3 Unicode Properties |
324 | ||
c1c4ae3a KW |
325 | C<\pP> and C<\p{Prop}> are character classes to match characters that fit given |
326 | Unicode properties. One letter property names can be used in the C<\pP> form, | |
327 | with the property name following the C<\p>; otherwise, braces are required. | |
328 | When using braces, there is a single form, which is just the property name | |
329 | enclosed in the braces, and a compound form which looks like C<\p{name=value}>, | |
b6538e4f | 330 | which means to match if the property "name" for the character has that particular |
c1c4ae3a | 331 | "value". |
e1b711da KW |
332 | For instance, a match for a number can be written as C</\pN/> or as |
333 | C</\p{Number}/>, or as C</\p{Number=True}/>. | |
334 | Lowercase letters are matched by the property I<Lowercase_Letter> which | |
e2cfb18c | 335 | has the short form I<Ll>. They need the braces, so are written as C</\p{Ll}/> or |
e1b711da KW |
336 | C</\p{Lowercase_Letter}/>, or C</\p{General_Category=Lowercase_Letter}/> |
337 | (the underscores are optional). | |
338 | C</\pLl/> is valid, but means something different. | |
8a118206 RGS |
339 | It matches a two character string: a letter (Unicode property C<\pL>), |
340 | followed by a lowercase C<l>. | |
341 | ||
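The difference between the braced and unbraced forms can be checked with a short sketch like the following (the variable names are hypothetical):

```perl
use strict;
use warnings;

# \p{Ll} is a single class: one lowercase letter.
my $braced   = ("a"  =~ /\A\p{Ll}\z/) ? 1 : 0;

# \pLl is \pL (any letter) followed by a literal "l": two characters.
my $two_char = ("al" =~ /\A\pLl\z/)   ? 1 : 0;
my $one_char = ("a"  =~ /\A\pLl\z/)   ? 1 : 0;

# One-letter property names work without braces; the compound form needs them.
my $number   = ("7" =~ /\A\pN\z/) ? 1 : 0;
my $compound = ("a" =~ /\A\p{General_Category=Lowercase_Letter}\z/) ? 1 : 0;

print "braced: $braced, two-char: $two_char, one-char: $one_char, ",
      "number: $number, compound: $compound\n";
```
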
82206b5e KW |
342 | If neither the C</a> modifier nor locale rules are in effect, the use of |
343 | a Unicode property will force the regular expression into using Unicode | |
344 | rules. | |
345 | ||
56ca34ca KW |
346 | Note that almost all properties are immune to case-insensitive matching. |
347 | That is, adding a C</i> regular expression modifier does not change what | |
82206b5e | 348 | they match. There are two sets that are affected. The first set is |
56ca34ca KW |
349 | C<Uppercase_Letter>, |
350 | C<Lowercase_Letter>, | |
351 | and C<Titlecase_Letter>, | |
352 | all of which match C<Cased_Letter> under C</i> matching. | |
b6538e4f | 353 | The second set is |
56ca34ca KW |
354 | C<Uppercase>, |
355 | C<Lowercase>, | |
356 | and C<Titlecase>, | |
357 | all of which match C<Cased> under C</i> matching. | |
358 | (The difference between these sets is that some things, such as Roman | |
e2cfb18c | 359 | numerals, come in both upper and lower case, so they are C<Cased>, but |
b6538e4f | 360 | aren't considered to be letters, so they aren't C<Cased_Letter>s. They're |
82206b5e KW |
361 | actually C<Letter_Number>s.) |
362 | This set also includes its subsets C<PosixUpper> and C<PosixLower>, both | |
e2cfb18c | 363 | of which under C</i> match C<PosixAlpha>. |
56ca34ca KW |
364 | |
365 | For more details on Unicode properties, see L<perlunicode/Unicode | |
366 | Character Properties>; for a | |
e1b711da | 367 | complete list of possible properties, see |
56ca34ca KW |
368 | L<perluniprops/Properties accessible through \p{} and \P{}>, |
369 | which notes all forms that have C</i> differences. | |
e1b711da | 370 | It is also possible to define your own properties. This is discussed in |
8a118206 RGS |
371 | L<perlunicode/User-Defined Character Properties>. |
372 | ||
8a118206 RGS |
373 | =head4 Examples |
374 | ||
375 | "a" =~ /\w/ # Match, "a" is a 'word' character. | |
376 | "7" =~ /\w/ # Match, "7" is a 'word' character as well. | |
377 | "a" =~ /\d/ # No match, "a" isn't a digit. | |
378 | "7" =~ /\d/ # Match, "7" is a digit. | |
ea449505 | 379 | " " =~ /\s/ # Match, a space is whitespace. |
8a118206 RGS |
380 | "a" =~ /\D/ # Match, "a" is a non-digit. |
381 | "7" =~ /\D/ # No match, "7" is not a non-digit. | |
ea449505 | 382 | " " =~ /\S/ # No match, a space is not non-whitespace. |
8a118206 | 383 | |
ea449505 KW |
384 | " " =~ /\h/ # Match, space is horizontal whitespace. |
385 | " " =~ /\v/ # No match, space is not vertical whitespace. | |
386 | "\r" =~ /\v/ # Match, a return is vertical whitespace. | |
8a118206 RGS |
387 | |
388 | "a" =~ /\pL/ # Match, "a" is a letter. | |
389 | "a" =~ /\p{Lu}/ # No match, /\p{Lu}/ matches upper case letters. | |
390 | ||
391 | "\x{0e0b}" =~ /\p{Thai}/ # Match, \x{0e0b} is the character | |
392 | # 'THAI CHARACTER SO SO', and that's in | |
393 | # Thai Unicode class. | |
ea449505 | 394 | "a" =~ /\P{Lao}/ # Match, as "a" is not a Laotian character. |
8a118206 | 395 | |
82206b5e KW |
396 | It is worth emphasizing that C<\d>, C<\w>, etc, match single characters, not |
397 | complete numbers or words. To match a number (that consists of digits), | |
398 | use C<\d+>; to match a word, use C<\w+>. But be aware of the security | |
399 | considerations in doing so, as mentioned above. | |
8a118206 RGS |
400 | |
401 | =head2 Bracketed Character Classes | |
402 | ||
403 | The third form of character class you can use in Perl regular expressions | |
6b83a163 | 404 | is the bracketed character class. In its simplest form, it lists the characters |
c1c4ae3a | 405 | that may be matched, surrounded by square brackets, like this: C<[aeiou]>. |
ea449505 | 406 | This matches one of C<a>, C<e>, C<i>, C<o> or C<u>. Like the other |
1f59b283 | 407 | character classes, exactly one character is matched.* To match |
ea449505 | 408 | a longer string consisting of characters mentioned in the character |
6b83a163 | 409 | class, follow the character class with a L<quantifier|perlre/Quantifiers>. For |
b6538e4f | 410 | instance, C<[aeiou]+> matches one or more lowercase English vowels. |
8a118206 RGS |
411 | |
412 | Repeating a character in a character class has no | |
413 | effect; it's considered to be in the set only once. | |
414 | ||
415 | Examples: | |
416 | ||
417 | "e" =~ /[aeiou]/ # Match, as "e" is listed in the class. | |
418 | "p" =~ /[aeiou]/ # No match, "p" is not listed in the class. | |
419 | "ae" =~ /^[aeiou]$/ # No match, a character class only matches | |
420 | # a single character. | |
421 | "ae" =~ /^[aeiou]+$/ # Match, due to the quantifier. | |
422 | ||
1f59b283 KW |
423 | ------- |
424 | ||
df0e3973 KW |
425 | * There is an exception to a bracketed character class matching a |
426 | single character only. When the class is to match caselessly under C</i> | |
1f59b283 KW |
427 | matching rules, and a character inside the class matches a |
428 | multiple-character sequence caselessly under Unicode rules, the class | |
429 | (when not L<inverted|/Negation>) will also match that sequence. For | |
430 | example, Unicode says that the letter C<LATIN SMALL LETTER SHARP S> | |
431 | should match the sequence C<ss> under C</i> rules. Thus, | |
432 | ||
433 | 'ss' =~ /\A\N{LATIN SMALL LETTER SHARP S}\z/i # Matches | |
434 | 'ss' =~ /\A[aeioust\N{LATIN SMALL LETTER SHARP S}]\z/i # Matches | |
435 | ||
8a118206 RGS |
436 | =head3 Special Characters Inside a Bracketed Character Class |
437 | ||
438 | Most characters that are meta characters in regular expressions (that | |
df225385 | 439 | is, characters that carry a special meaning like C<.>, C<*>, or C<(>) lose |
8a118206 RGS |
440 | their special meaning and can be used inside a character class without |
441 | the need to escape them. For instance, C<[()]> matches either an opening | |
442 | parenthesis, or a closing parenthesis, and the parens inside the character | |
443 | class don't group or capture. | |
444 | ||
445 | Characters that may carry a special meaning inside a character class are: | |
446 | C<\>, C<^>, C<->, C<[> and C<]>, and are discussed below. They can be | |
447 | escaped with a backslash, although this is sometimes not needed, in which | |
448 | case the backslash may be omitted. | |
449 | ||
450 | The sequence C<\b> is special inside a bracketed character class. While | |
6b83a163 | 451 | outside the character class, C<\b> is an assertion indicating a point |
8a118206 RGS |
452 | that does not have either two word characters or two non-word characters |
453 | on either side, inside a bracketed character class, C<\b> matches a | |
454 | backspace character. | |
455 | ||
df225385 KW |
456 | The sequences |
457 | C<\a>, | |
458 | C<\c>, | |
459 | C<\e>, | |
460 | C<\f>, | |
461 | C<\n>, | |
e526e8bb | 462 | C<\N{I<NAME>}>, |
765fa144 | 463 | C<\N{U+I<hex char>}>, |
df225385 KW |
464 | C<\r>, |
465 | C<\t>, | |
466 | and | |
467 | C<\x> | |
06ee63cd KW |
468 | are also special and have the same meanings as they do outside a |
469 | bracketed character class. (However, inside a bracketed character | |
470 | class, if C<\N{I<NAME>}> expands to a sequence of characters, only the first | |
471 | one in the sequence is used, with a warning.) | |
df225385 | 472 | |
ea449505 KW |
473 | Also, a backslash followed by two or three octal digits is considered an octal |
474 | number. | |
df225385 | 475 | |
6b83a163 KW |
476 | A C<[> is not special inside a character class, unless it's the start of a |
477 | POSIX character class (see L</POSIX Character Classes> below). It normally does | |
478 | not need escaping. | |
8a118206 | 479 | |
6b83a163 KW |
480 | A C<]> is normally either the end of a POSIX character class (see |
481 | L</POSIX Character Classes> below), or it signals the end of the bracketed | |
482 | character class. If you want to include a C<]> in the set of characters, you | |
483 | must generally escape it. | |
b6538e4f | 484 | |
8a118206 RGS |
485 | However, if the C<]> is the I<first> (or the second if the first |
486 | character is a caret) character of a bracketed character class, it | |
487 | does not denote the end of the class (as you cannot have an empty class) | |
488 | and is considered part of the set of characters that can be matched without | |
489 | escaping. | |
490 | ||
491 | Examples: | |
492 | ||
493 | "+" =~ /[+?*]/ # Match, "+" in a character class is not special. | |
494 | "\cH" =~ /[\b]/ # Match, \b inside a character class | |
c1c4ae3a | 495 | # is equivalent to a backspace. |
8a118206 RGS |
496 | "]" =~ /[][]/ # Match, as the character class contains |
497 | # both [ and ]. | |
498 | "[]" =~ /[[]]/ # Match, the pattern contains a character class | |
499 | # containing just [, and the character class is | |
500 | # followed by a ]. | |
501 | ||
502 | =head3 Character Ranges | |
503 | ||
504 | It is not uncommon to want to match a range of characters. Luckily, instead | |
b6538e4f | 505 | of listing all characters in the range, one may use the hyphen (C<->). |
8a118206 | 506 | If inside a bracketed character class you have two characters separated |
b6538e4f | 507 | by a hyphen, it's treated as if all characters between the two were in |
8a118206 | 508 | the class. For instance, C<[0-9]> matches any ASCII digit, and C<[a-m]> |
e2cfb18c | 509 | matches any lowercase letter from the first half of the ASCII alphabet. |
8a118206 RGS |
510 | |
511 | Note that the two characters on either side of the hyphen are not | |
765fa144 | 512 | necessarily both letters or both digits. Any character is possible, |
8a118206 | 513 | although not advisable. C<['-?]> contains a range of characters, but |
b6538e4f | 514 | most people will not know which characters that means. Furthermore, |
8a118206 RGS |
515 | such ranges may lead to portability problems if the code has to run on |
516 | a platform that uses a different character set, such as EBCDIC. | |
517 | ||
ea449505 KW |
518 | If a hyphen in a character class cannot syntactically be part of a range, for |
519 | instance because it is the first or the last character of the character class, | |
b6538e4f TC |
520 | or if it immediately follows a range, the hyphen isn't special, and so is |
521 | considered a character to be matched literally. If you want a hyphen in | |
522 | your set of characters to be matched and its position in the class is such | |
523 | that it could be considered part of a range, you must escape that hyphen | |
524 | with a backslash. | |
8a118206 RGS |
525 | |
526 | Examples: | |
527 | ||
528 | [a-z] # Matches a character that is a lower case ASCII letter. | |
c1c4ae3a KW |
529 | [a-fz] # Matches any letter between 'a' and 'f' (inclusive) or |
530 | # the letter 'z'. | |
8a118206 RGS |
531 | [-z] # Matches either a hyphen ('-') or the letter 'z'. |
532 | [a-f-m] # Matches any letter between 'a' and 'f' (inclusive), the | |
533 | # hyphen ('-'), or the letter 'm'. | |
534 | ['-?] # Matches any of the characters '()*+,-./0123456789:;<=>? | |
535 | # (But not on an EBCDIC platform). | |
536 | ||
537 | ||
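The hyphen rules above can be tried out with a sketch such as this one (the variable names are illustrative only):

```perl
use strict;
use warnings;

# In [a-f-m] the second hyphen follows a range, so it is literal:
# the class is 'a' through 'f', '-', and 'm'.
my $hyphen_after_range = ("-" =~ /[a-f-m]/) ? 1 : 0;
my $h_in_class         = ("h" =~ /[a-f-m]/) ? 1 : 0;   # f-m is not a range

# An escaped hyphen is literal even where it could form a range.
my $escaped_hyphen     = ("-" =~ /[a\-z]/)  ? 1 : 0;
my $m_in_escaped       = ("m" =~ /[a\-z]/)  ? 1 : 0;

# A leading hyphen cannot start a range, so it needs no escape.
my $leading_hyphen     = ("-" =~ /[-z]/)    ? 1 : 0;

print "$hyphen_after_range $h_in_class $escaped_hyphen ",
      "$m_in_escaped $leading_hyphen\n";
```
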
538 | =head3 Negation | |
539 | ||
540 | It is also possible to instead list the characters you do not want to | |
541 | match. You can do so by using a caret (C<^>) as the first character in the | |
b6538e4f | 542 | character class. For instance, C<[^a-z]> matches any character that is not a |
e2cfb18c KW |
543 | lowercase ASCII letter, which therefore includes more than a million |
544 | Unicode code points. The class is said to be "negated" or "inverted". | |
8a118206 RGS |
545 | |
546 | This syntax makes the caret a special character inside a bracketed character | |
547 | class, but only if it is the first character of the class. So if you want | |
82206b5e | 548 | the caret as one of the characters to match, either escape the caret or |
e2cfb18c | 549 | else don't list it first. |
8a118206 | 550 | |
1f59b283 | 551 | In inverted bracketed character classes, Perl ignores the Unicode rules |
56e1c5aa KW |
552 | that normally say that certain characters should match a sequence of |
553 | multiple characters under caseless C</i> matching. Following those | |
554 | rules could lead to highly confusing situations: | |
1f59b283 | 555 | |
582da942 | 556 | "ss" =~ /^[^\xDF]+$/ui; # Matches! |
1f59b283 KW |
557 | |
558 | This should match any sequence of characters that aren't C<\xDF> nor | |
559 | what C<\xDF> matches under C</i>. C<"s"> isn't C<\xDF>, but Unicode | |
560 | says that C<"ss"> is what C<\xDF> matches under C</i>. So which one | |
561 | "wins"? Do you fail the match because the string has C<ss> or accept it | |
582da942 KW |
562 | because it has an C<s> followed by another C<s>? Perl has chosen the |
563 | latter. | |
1f59b283 | 564 | |
8a118206 RGS |
565 | Examples: |
566 | ||
567 | "e" =~ /[^aeiou]/ # No match, the 'e' is listed. | |
568 | "x" =~ /[^aeiou]/ # Match, as 'x' isn't a lowercase vowel. | |
569 | "^" =~ /[^^]/ # No match, matches anything that isn't a caret. | |
570 | "^" =~ /[x^]/ # Match, caret is not special here. | |
571 | ||
572 | =head3 Backslash Sequences | |
573 | ||
ea449505 | 574 | You can put any backslash sequence character class (with the exception of |
765fa144 | 575 | C<\N> and C<\R>) inside a bracketed character class, and it will act just |
b6538e4f TC |
576 | as if you had put all characters matched by the backslash sequence inside the |
577 | character class. For instance, C<[a-f\d]> matches any decimal digit, or any | |
6b83a163 KW |
578 | of the lowercase letters between 'a' and 'f' inclusive. |
579 | ||
580 | C<\N> within a bracketed character class must be of the forms C<\N{I<name>}> | |
765fa144 | 581 | or C<\N{U+I<hex char>}>, and NOT be the form that matches non-newlines, |
6b83a163 KW |
582 | for the same reason that a dot C<.> inside a bracketed character class loses |
583 | its special meaning: it matches nearly anything, which generally isn't what you | |
584 | want to happen. | |
df225385 | 585 | |
8a118206 RGS |
586 | |
587 | Examples: | |
588 | ||
589 | /[\p{Thai}\d]/ # Matches a character that is either a Thai | |
590 | # character, or a digit. | |
591 | /[^\p{Arabic}()]/ # Matches a character that is neither an Arabic | |
592 | # character, nor a parenthesis. | |
593 | ||
594 | Backslash sequence character classes cannot form one of the endpoints | |
6b83a163 KW |
595 | of a range. Thus, you can't say: |
596 | ||
597 | /[\p{Thai}-\d]/ # Wrong! | |
8a118206 | 598 | |
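A small sketch of backslash sequences used as members of a bracketed class, both plain and negated (the variable names are hypothetical):

```perl
use strict;
use warnings;

# [a-f\d] is the union of the range a-f and everything \d matches.
my $letter_ok  = ("c" =~ /\A[a-f\d]\z/) ? 1 : 0;
my $digit_ok   = ("7" =~ /\A[a-f\d]\z/) ? 1 : 0;
my $letter_bad = ("g" =~ /\A[a-f\d]\z/) ? 1 : 0;

# In a negated class, the property and the parentheses are all excluded.
my $paren  = ("("        =~ /[^\p{Arabic}()]/) ? 1 : 0;
my $arabic = ("\x{0627}" =~ /[^\p{Arabic}()]/) ? 1 : 0;   # ARABIC LETTER ALEF
my $latin  = ("a"        =~ /[^\p{Arabic}()]/) ? 1 : 0;

print "$letter_ok $digit_ok $letter_bad $paren $arabic $latin\n";
```
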
6b83a163 | 599 | =head3 POSIX Character Classes |
ea449505 | 600 | X<character class> X<\p> X<\p{}> |
ea449505 KW |
601 | X<alpha> X<alnum> X<ascii> X<blank> X<cntrl> X<digit> X<graph> |
602 | X<lower> X<print> X<punct> X<space> X<upper> X<word> X<xdigit> | |
8a118206 | 603 | |
6b83a163 KW |
604 | POSIX character classes have the form C<[:class:]>, where I<class> is |
605 | the name, and C<[:> and C<:]> are the delimiters. POSIX character classes only appear | |
8a118206 | 606 | I<inside> bracketed character classes, and are a convenient and descriptive |
82206b5e | 607 | way of listing a group of characters. |
6b83a163 KW |
608 | |
609 | Be careful about the syntax: | |
8a118206 RGS |
610 | |
611 | # Correct: | |
612 | $string =~ /[[:alpha:]]/ | |
613 | ||
614 | # Incorrect (will warn): | |
615 | $string =~ /[:alpha:]/ | |
616 | ||
617 | The latter pattern would be a character class consisting of a colon, | |
618 | and the letters C<a>, C<l>, C<p> and C<h>. | |
82206b5e | 619 | POSIX character classes can be part of a larger bracketed character class. |
b6538e4f | 620 | For example, |
ea449505 KW |
621 | |
622 | [01[:alpha:]%] | |
623 | ||
624 | is valid and matches '0', '1', any alphabetic character, and the percent sign. | |
8a118206 RGS |
625 | |
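The difference between the correct and incorrect forms can be seen by running both against a few single characters. The following is a sketch with made-up sample data; the C<no warnings 'regexp'> line is there only to silence the warning that the incorrect form would otherwise produce.

```perl
use strict;
use warnings;

# Correct form: the POSIX class sits inside a bracketed class.
my @alpha = grep { /^[[:alpha:]]$/ } qw(a Z : 1 %);

# Mixed with literal members: '0', '1', any letter, and '%'.
my @mixed = grep { /^[01[:alpha:]%]$/ } qw(0 1 2 x % :);

# Incorrect form: just a bracketed class of ':', 'a', 'l', 'p', 'h'.
my @literal = do {
    no warnings 'regexp';
    grep { /^[:alpha:]$/ } qw(a l p h : b Z);
};

print "alpha:   @alpha\n";     # alpha:   a Z
print "mixed:   @mixed\n";     # mixed:   0 1 x %
print "literal: @literal\n";   # literal: a l p h :
```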
626 | Perl recognizes the following POSIX character classes: | |
627 | ||
ea449505 | 628 | alpha Any alphabetical character ("[A-Za-z]"). |
b6538e4f | 629 | alnum Any alphanumeric character ("[A-Za-z0-9]"). |
ea449505 | 630 | ascii Any character in the ASCII character set. |
ea8b8ad2 | 631 | blank A GNU extension, equal to a space or a horizontal tab ("\t"). |
ea449505 KW |
632 | cntrl Any control character. See Note [2] below. |
633 | digit Any decimal digit ("[0-9]"), equivalent to "\d". | |
634 | graph Any printable character, excluding a space. See Note [3] below. | |
635 | lower Any lowercase character ("[a-z]"). | |
636 | print Any printable character, including a space. See Note [4] below. | |
c1c4ae3a | 637 | punct Any graphical character excluding "word" characters. Note [5]. |
ea449505 KW |
638 | space Any whitespace character. "\s" plus the vertical tab ("\cK"). |
639 | upper Any uppercase character ("[A-Z]"). | |
640 | word A Perl extension ("[A-Za-z0-9_]"), equivalent to "\w". | |
641 | xdigit Any hexadecimal digit ("[0-9a-fA-F]"). | |
642 | ||
643 | Most POSIX character classes have two Unicode-style C<\p> property | |
644 | counterparts. (They are not official Unicode properties, but Perl extensions | |
645 | derived from official Unicode properties.) The table below shows the relation | |
646 | between POSIX character classes and these counterparts. | |
647 | ||
648 | One counterpart, in the column labelled "ASCII-range Unicode" in | |
b6538e4f | 649 | the table, matches only characters in the ASCII character set. |
ea449505 KW |
650 | |
651 | The other counterpart, in the column labelled "Full-range Unicode", matches any | |
652 | appropriate characters in the full Unicode character set. For example, | |
b6538e4f | 653 | C<\p{Alpha}> matches not just the ASCII alphabetic characters, but any |
82206b5e | 654 | character in the entire Unicode character set considered alphabetic. |
582da942 | 655 | An entry in the column labelled "backslash sequence" is a (short) |
5db9882c | 656 | equivalent. |
ea449505 | 657 | |
cbc24f92 KW |
658 | [[:...:]]  ASCII-range      Full-range         backslash  Note |
659 |            Unicode          Unicode            sequence | |
ea449505 | 660 | -------------------------------------------------------------- |
cbc24f92 KW |
661 | alpha      \p{PosixAlpha}   \p{XPosixAlpha} |
662 | alnum      \p{PosixAlnum}   \p{XPosixAlnum} | |
82206b5e | 663 | ascii      \p{ASCII} |
cbc24f92 KW |
664 | blank      \p{PosixBlank}   \p{XPosixBlank}    \h         [1] |
665 |                             or \p{HorizSpace}             [1] | |
666 | cntrl      \p{PosixCntrl}   \p{XPosixCntrl}               [2] | |
667 | digit      \p{PosixDigit}   \p{XPosixDigit}    \d | |
668 | graph      \p{PosixGraph}   \p{XPosixGraph}               [3] | |
669 | lower      \p{PosixLower}   \p{XPosixLower} | |
670 | print      \p{PosixPrint}   \p{XPosixPrint}               [4] | |
671 | punct      \p{PosixPunct}   \p{XPosixPunct}               [5] | |
672 |            \p{PerlSpace}    \p{XPerlSpace}     \s         [6] | |
673 | space      \p{PosixSpace}   \p{XPosixSpace}               [6] | |
674 | upper      \p{PosixUpper}   \p{XPosixUpper} | |
675 | word       \p{PosixWord}    \p{XPosixWord}     \w | |
82206b5e | 676 | xdigit     \p{PosixXDigit}  \p{XPosixXDigit} |
8a118206 RGS |
677 | |
678 | =over 4 | |
679 | ||
ea449505 KW |
680 | =item [1] |
681 | ||
682 | C<\p{Blank}> and C<\p{HorizSpace}> are synonyms. | |
683 | ||
684 | =item [2] | |
8a118206 | 685 | |
ea449505 | 686 | Control characters don't produce output as such, but instead usually control |
b6538e4f | 687 | the terminal somehow: for example, newline and backspace are control characters. |
82206b5e | 688 | In the ASCII range, characters whose code points are between 0 and 31 inclusive, |
ea449505 | 689 | plus 127 (C<DEL>) are control characters. |
8a118206 | 690 | |
c1c4ae3a KW |
691 | On EBCDIC platforms, it is likely that the code page will define C<[[:cntrl:]]> |
692 | to be the EBCDIC equivalents of the ASCII controls, plus the controls | |
82206b5e | 693 | that in Unicode have code points from 128 through 159. |
ea449505 KW |
694 | |
695 | =item [3] | |
8a118206 RGS |
696 | |
697 | Any character that is I<graphical>, that is, visible. This class consists | |
b6538e4f | 698 | of all alphanumeric characters and all punctuation characters. |
8a118206 | 699 | |
ea449505 | 700 | =item [4] |
8a118206 | 701 | |
b6538e4f TC |
702 | All printable characters, which is the set of all graphical characters |
703 | plus those whitespace characters which are not also controls. | |
ea449505 | 704 | |
b6dac59a | 705 | =item [5] |
ea449505 | 706 | |
b6538e4f | 707 | C<\p{PosixPunct}> and C<[[:punct:]]> in the ASCII range match all |
ea449505 KW |
708 | non-control, non-alphanumeric, non-space characters: | |
709 | C<[-!"#$%&'()*+,./:;<=E<gt>?@[\\\]^_`{|}~]> (although if a locale is in effect, | |
710 | it could alter the behavior of C<[[:punct:]]>). | |
711 | ||
cbc24f92 KW |
712 | The similarly named property, C<\p{Punct}>, matches a somewhat different |
713 | set in the ASCII range, namely | |
6c5a041f KW |
714 | C<[-!"#%&'()*,./:;?@[\\\]_{}]>. That is, it is missing C<[$+E<lt>=E<gt>^`|~]>. |
715 | This is because Unicode splits what POSIX considers to be punctuation into two | |
716 | categories, Punctuation and Symbols. | |
717 | ||
e2cfb18c | 718 | C<\p{XPosixPunct}> and (under Unicode rules) C<[[:punct:]]>, match what |
765fa144 KW |
719 | C<\p{PosixPunct}> matches in the ASCII range, plus what C<\p{Punct}> |
720 | matches. This is different from strictly matching according to | |
721 | C<\p{Punct}>. Another way to say it is that | |
82206b5e KW |
722 | if Unicode rules are in effect, C<[[:punct:]]> matches all characters |
723 | that Unicode considers punctuation, plus all ASCII-range characters that | |
724 | Unicode considers symbols. | |
8a118206 | 725 | |
ea449505 | 726 | =item [6] |
8a118206 | 727 | |
82206b5e KW |
728 | C<\p{SpacePerl}> and C<\p{Space}> differ only in that in non-locale |
729 | matching, C<\p{Space}> additionally | |
ea449505 | 730 | matches the vertical tab, C<\cK>. The same holds for the two ASCII-range forms. |
8a118206 RGS |
731 | |
732 | =back | |
733 | ||
ab6199be KW |
734 | There are various other synonyms that can be used besides the names |
735 | listed in the table. For example, C<\p{XPosixAlpha}> can be written as | |
736 | C<\p{Alpha}>. All are listed in | |
737 | L<perluniprops/Properties accessible through \p{} and \P{}>, | |
738 | plus all characters matched by each ASCII-range property. | |
739 | ||
740 | Both the C<\p> counterparts always assume Unicode rules are in effect. | |
741 | On ASCII platforms, this means they assume that the code points from 128 | |
742 | to 255 are Latin-1, and that means that using them under locale rules is | |
743 | unwise unless the locale is guaranteed to be Latin-1 or UTF-8. In contrast, the | |
744 | POSIX character classes are useful under locale rules. They are | |
745 | affected by the actual rules in effect, as follows: | |
746 | ||
747 | =over | |
748 | ||
749 | =item If the C</a> modifier is in effect ... | |
750 ||
751 | Each of the POSIX classes matches exactly the same as its ASCII-range | |
752 | counterpart. | |
753 | ||
754 | =item otherwise ... | |
755 | ||
756 | =over | |
757 | ||
758 | =item For code points above 255 ... | |
759 | ||
760 | The POSIX class matches the same as its Full-range counterpart. | |
761 | ||
762 | =item For code points below 256 ... | |
763 | ||
764 | =over | |
765 | ||
766 | =item if locale rules are in effect ... | |
767 | ||
768 | The POSIX class matches according to the locale. | |
769 | ||
770 | =item if Unicode rules are in effect or if on an EBCDIC platform ... | |
771 | ||
772 | The POSIX class matches the same as the Full-range counterpart. | |
773 | ||
774 | =item otherwise ... | |
775 | ||
776 | The POSIX class matches the same as the ASCII-range counterpart. | |
777 | ||
778 | =back | |
779 | ||
780 | =back | |
781 | ||
782 | =back | |
783 | ||
784 | Which rules apply are determined as described in | |
785 | L<perlre/Which character set modifier is in effect?>. | |
786 | ||
787 | It is proposed to change this behavior in a future release of Perl so that | |
788 | whether or not Unicode rules are in effect would not change the | |
789 | behavior: Outside of locale or an EBCDIC code page, the POSIX classes | |
790 | would behave like their ASCII-range counterparts. If you wish to | |
791 | comment on this proposal, send email to C<perl5-porters@perl.org>. | |
cbc24f92 | 792 | |
1f59b283 | 793 | =head4 Negation of POSIX character classes |
ea449505 | 794 | X<character class, negation> |
8a118206 RGS |
795 | |
796 | A Perl extension to the POSIX character class is the ability to | |
797 | negate it. This is done by prefixing the class name with a caret (C<^>). | |
798 | Some examples: | |
799 | ||
ea449505 KW |
800 | POSIX         ASCII-range     Full-range       backslash |
801 |               Unicode         Unicode          sequence | |
802 | -------------------------------------------------------- | |
cbc24f92 KW |
803 | [[:^digit:]]  \P{PosixDigit}  \P{XPosixDigit}  \D |
804 | [[:^space:]]  \P{PosixSpace}  \P{XPosixSpace} | |
805 |               \P{PerlSpace}   \P{XPerlSpace}   \S | |
806 | [[:^word:]]   \P{PerlWord}    \P{XPosixWord}   \W | |
807 | ||
765fa144 | 808 | The backslash sequence can mean either ASCII- or Full-range Unicode, |
82206b5e | 809 | depending on various factors as described in L<perlre/Which character set modifier is in effect?>. |
8a118206 RGS |
810 | |
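As a small illustration of negated POSIX classes, the sketch below filters a few ASCII sample characters. The sample data and variable names are assumptions made for the example.

```perl
use strict;
use warnings;

my @chars = ('7', 'x', ' ', '_');

# [[:^digit:]] and \D pick out the same ASCII characters here.
my @non_digit = grep { /[[:^digit:]]/ } @chars;
my @via_D     = grep { /\D/ }           @chars;

# Only the space is not a "word" character.
my @non_word = grep { /[[:^word:]]/ } @chars;

# A negated class can sit next to other members; the members are ORed.
my @digit_or_non_xdigit = grep { /^[[:digit:][:^xdigit:]]$/ } qw(5 a F g Z);

print join('|', @non_digit), "\n";   # x| |_
print join('|', @non_word),  "\n";   # a single space
print "@digit_or_non_xdigit\n";      # 5 g Z
```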
811 | =head4 [= =] and [. .] | |
812 | ||
b6538e4f | 813 | Perl recognizes the POSIX character classes C<[=class=]> and |
82206b5e | 814 | C<[.class.]>, but does not (yet?) support them. Any attempt to use |
b6538e4f | 815 | either construct raises an exception. |
8a118206 RGS |
816 | |
817 | =head4 Examples | |
818 | ||
819 | /[[:digit:]]/ # Matches a character that is a digit. | |
820 | /[01[:lower:]]/ # Matches a character that is either a | |
821 | # lowercase letter, or '0' or '1'. | |
c1c4ae3a KW |
822 | /[[:digit:][:^xdigit:]]/ # Matches any character except the hex letters. |
823 | # The main character class is composed of two | |
824 | # POSIX character classes that are ORed | |
825 | # together, one that matches any digit, and | |
826 | # the other that matches anything that isn't a | |
827 | # hex digit. The first class restores the | |
828 | # digits excluded by the second, so the result | |
829 | # matches all characters except 'a' to 'f' and | |
830 | # 'A' to 'F'. |