=head1 NAME
X<character class>

perlrecharclass - Perl Regular Expression Character Classes

=head1 DESCRIPTION
7 | ||
8 | The top level documentation about Perl regular expressions | |
9 | is found in L<perlre>. | |
10 | ||
11 | This manual page discusses the syntax and use of character | |
6b83a163 | 12 | classes in Perl regular expressions. |
8a118206 | 13 | |
6b83a163 | 14 | A character class is a way of denoting a set of characters |
8a118206 | 15 | in such a way that one character of the set is matched. |
It's important to remember that matching a character class
consumes exactly one character in the source string. (The source
string is the string the regular expression is matched against.)
19 | ||
20 | There are three types of character classes in Perl regular | |
6b83a163 | 21 | expressions: the dot, backslash sequences, and the form enclosed in square |
ea449505 | 22 | brackets. Keep in mind, though, that often the term "character class" is used |
6b83a163 | 23 | to mean just the bracketed form. Certainly, most Perl documentation does that. |
8a118206 RGS |
24 | |
25 | =head2 The dot | |
26 | ||
The dot (or period), C<.>, is probably the most used, and certainly
the most well-known, character class. By default, a dot matches any
character except the newline. That default can be changed to
also match the newline by using the I<single line> modifier:
for the entire regular expression with the C</s> modifier, or
locally with C<(?s)> (and even globally within the scope of
L<C<use re '/s'>|re/'E<sol>flags' mode>). (The C<L</\N>> backslash
sequence, described below, matches any character except newline
without regard to the I<single line> modifier.)
37 | |
38 | Here are some examples: | |
39 | ||
    "a"  =~ /./        # Match
    "."  =~ /./        # Match
    ""   =~ /./        # No match (dot has to match a character)
    "\n" =~ /./        # No match (dot does not match a newline)
    "\n" =~ /./s       # Match (global 'single line' modifier)
    "\n" =~ /(?s:.)/   # Match (local 'single line' modifier)
    "ab" =~ /^.$/      # No match (dot matches one character)
47 | ||
6b83a163 | 48 | =head2 Backslash sequences |
82206b5e | 49 | X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\p> X<\P> |
ea449505 KW |
50 | X<\N> X<\v> X<\V> X<\h> X<\H> |
51 | X<word> X<whitespace> | |
8a118206 | 52 | |
6b83a163 KW |
53 | A backslash sequence is a sequence of characters, the first one of which is a |
54 | backslash. Perl ascribes special meaning to many such sequences, and some of | |
55 | these are character classes. That is, they match a single character each, | |
56 | provided that the character belongs to the specific set of characters defined | |
57 | by the sequence. | |
8a118206 | 58 | |
6b83a163 KW |
59 | Here's a list of the backslash sequences that are character classes. They |
60 | are discussed in more detail below. (For the backslash sequences that aren't | |
61 | character classes, see L<perlrebackslash>.) | |
8a118206 | 62 | |
6b83a163 KW |
    \d             Match a decimal digit character.
    \D             Match a non-decimal-digit character.
    \w             Match a "word" character.
    \W             Match a non-"word" character.
    \s             Match a whitespace character.
    \S             Match a non-whitespace character.
    \h             Match a horizontal whitespace character.
    \H             Match a character that isn't horizontal whitespace.
    \v             Match a vertical whitespace character.
    \V             Match a character that isn't vertical whitespace.
    \N             Match a character that isn't a newline.
    \pP, \p{Prop}  Match a character that has the given Unicode property.
    \PP, \P{Prop}  Match a character that doesn't have the Unicode property.
8a118206 | 76 | |
1433f837 KW |
77 | =head3 \N |
78 | ||
2171640d | 79 | C<\N>, available starting in v5.12, like the dot, matches any |
1433f837 KW |
80 | character that is not a newline. The difference is that C<\N> is not influenced |
81 | by the I<single line> regular expression modifier (see L</The dot> above). Note | |
82 | that the form C<\N{...}> may mean something completely different. When the | |
83 | C<{...}> is a L<quantifier|perlre/Quantifiers>, it means to match a non-newline | |
84 | character that many times. For example, C<\N{3}> means to match 3 | |
85 | non-newlines; C<\N{5,}> means to match 5 or more non-newlines. But if C<{...}> | |
86 | is not a legal quantifier, it is presumed to be a named character. See | |
87 | L<charnames> for those. For example, none of C<\N{COLON}>, C<\N{4F}>, and | |
88 | C<\N{F4}> contain legal quantifiers, so Perl will try to find characters whose | |
89 | names are respectively C<COLON>, C<4F>, and C<F4>. | |
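The following is a minimal sketch, not taken from the original text, that contrasts the dot with C<\N> and shows the two readings of C<\N{...}> described above. The variable names are invented for the illustration, and C<use charnames ':full'> is included only on the assumption that the code may run on a Perl older than v5.16, where named characters are not loaded automatically.

```perl
use strict;
use warnings;
use charnames ':full';   # assumption: only needed for \N{NAME} before v5.16

my $newline = "\n";

# The dot matches a newline only under /s; \N never does.
my $dot_default = $newline =~ /\A.\z/   ? 1 : 0;
my $dot_single  = $newline =~ /\A.\z/s  ? 1 : 0;
my $bign_single = $newline =~ /\A\N\z/s ? 1 : 0;

# {3} is a legal quantifier, so \N{3} means three non-newlines.
my $three_ok  = "abc"  =~ /\A\N{3}\z/ ? 1 : 0;
my $three_bad = "ab\n" =~ /\A\N{3}\z/ ? 1 : 0;

# {COLON} is not a legal quantifier, so \N{COLON} is the named character ":".
my $colon = ":" =~ /\A\N{COLON}\z/ ? 1 : 0;

printf "dot: %d, dot with /s: %d, \\N with /s: %d\n",
    $dot_default, $dot_single, $bign_single;
printf "\\N{3} on 'abc': %d, on 'ab\\n': %d, \\N{COLON} on ':': %d\n",
    $three_ok, $three_bad, $colon;
```

Anchoring each pattern with C<\A> and C<\z> keeps the test about a single character (or exactly three), so a match elsewhere in the string cannot hide the difference.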
90 | ||
8a118206 RGS |
91 | =head3 Digits |
92 | ||
b6538e4f | 93 | C<\d> matches a single character considered to be a decimal I<digit>. |
5db9882c | 94 | If the C</a> regular expression modifier is in effect, it matches [0-9]. |
582da942 | 95 | Otherwise, it |
82206b5e KW |
96 | matches anything that is matched by C<\p{Digit}>, which includes [0-9]. |
97 | (An unlikely possible exception is that under locale matching rules, the | |
d66e1f56 KW |
98 | current locale might not have C<[0-9]> matched by C<\d>, and/or might match |
99 | other characters whose code point is less than 256. The only such locale | |
100 | definitions that are legal would be to match C<[0-9]> plus another set of | |
101 | 10 consecutive digit characters; anything else would be in violation of | |
102 | the C language standard, but Perl doesn't currently assume anything in | |
103 | regard to this.) | |
82206b5e KW |
104 | |
What this means is that unless the C</a> modifier is in effect, C<\d> not
only matches the digits '0' - '9', but also Arabic, Devanagari, and
digits from other languages. This may cause some confusion, and some
security issues.
109 | ||
Some digits that C<\d> matches look like some of the [0-9] ones, but
have different values. For example, BENGALI DIGIT FOUR (U+09EA) looks
very much like an ASCII DIGIT EIGHT (U+0038), and LEPCHA DIGIT SIX
(U+1C46) looks very much like an ASCII DIGIT FIVE (U+0035). An
application that is expecting only the ASCII digits might be misled, or
if the match is C<\d+>, the matched string might contain a mixture of
digits from different writing systems that look like they signify a
number different than they actually do. L<Unicode::UCD/num()> can be
used to safely calculate the value, returning C<undef> if the input
string contains such a mixture. Otherwise, for example, a displayed
price might be deliberately different than it appears.
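A short sketch of the point above, offered as an illustration rather than as recommended validation code: C<\d+> accepts digits from any script (and mixtures of them) unless C</a> is in effect, while C<num()> from L<Unicode::UCD> rejects a mixture. The sample strings and variable names are assumptions chosen for the example.

```perl
use strict;
use warnings;
use Unicode::UCD qw(num);

# DEVANAGARI DIGIT ONE, TWO, THREE, and a string that mixes ASCII and
# Devanagari digits.
my $devanagari = "\x{0967}\x{0968}\x{0969}";
my $mixed      = "1\x{0968}3";

my $deva_default  = $devanagari =~ /\A\d+\z/  ? 1 : 0;  # Unicode digits accepted
my $deva_ascii    = $devanagari =~ /\A\d+\z/a ? 1 : 0;  # /a limits \d to [0-9]
my $mixed_default = $mixed      =~ /\A\d+\z/  ? 1 : 0;  # a mixture still matches

my $value_deva  = num($devanagari);   # numeric value of a single-script string
my $value_mixed = num($mixed);        # undef, because the scripts are mixed

printf "Devanagari: \\d+ %d, \\d+ with /a %d, num() %s\n",
    $deva_default, $deva_ascii,
    defined $value_deva ? $value_deva : 'undef';
printf "Mixed:      \\d+ %d, num() %s\n",
    $mixed_default,
    defined $value_mixed ? $value_mixed : 'undef';
```

An application that expects only ASCII digits would typically use C</a> (or an explicit C<[0-9]>); one that accepts other scripts can check the result of C<num()> for C<undef> before trusting the value.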
82206b5e KW |
123 | |
124 | What C<\p{Digit}> means (and hence C<\d> except under the C</a> | |
125 | modifier) is C<\p{General_Category=Decimal_Number}>, or synonymously, | |
126 | C<\p{General_Category=Digit}>. Starting with Unicode version 4.1, this | |
127 | is the same set of characters matched by C<\p{Numeric_Type=Decimal}>. | |
6b83a163 KW |
128 | But Unicode also has a different property with a similar name, |
129 | C<\p{Numeric_Type=Digit}>, which matches a completely different set of | |
82206b5e KW |
130 | characters. These characters are things such as C<CIRCLED DIGIT ONE> |
131 | or subscripts, or are from writing systems that lack all ten digits. | |
6b83a163 | 132 | |
82206b5e KW |
133 | The design intent is for C<\d> to exactly match the set of characters |
134 | that can safely be used with "normal" big-endian positional decimal | |
135 | syntax, where, for example 123 means one 'hundred', plus two 'tens', | |
136 | plus three 'ones'. This positional notation does not necessarily apply | |
137 | to characters that match the other type of "digit", | |
138 | C<\p{Numeric_Type=Digit}>, and so C<\d> doesn't match them. | |
6b83a163 | 139 | |
e2cfb18c | 140 | The Tamil digits (U+0BE6 - U+0BEF) can also legally be |
82206b5e KW |
141 | used in old-style Tamil numbers in which they would appear no more than |
142 | one in a row, separated by characters that mean "times 10", "times 100", | |
143 | etc. (See L<http://www.unicode.org/notes/tn21>.) | |
8a118206 | 144 | |
b6538e4f | 145 | Any character not matched by C<\d> is matched by C<\D>. |
8a118206 RGS |
146 | |
147 | =head3 Word characters | |
148 | ||
ea449505 | 149 | A C<\w> matches a single alphanumeric character (an alphabetic character, or a |
41805eb9 KW |
150 | decimal digit); or a connecting punctuation character, such as an |
151 | underscore ("_"); or a "mark" character (like some sort of accent) that | |
152 | attaches to one of those. It does not match a whole word. To match a | |
153 | whole word, use C<\w+>. This isn't the same thing as matching an | |
154 | English word, but in the ASCII range it is the same as a string of | |
155 | Perl-identifier characters. | |
82206b5e KW |
156 | |
157 | =over | |
158 | ||
159 | =item If the C</a> modifier is in effect ... | |
160 | ||
161 | C<\w> matches the 63 characters [a-zA-Z0-9_]. | |
162 | ||
163 | =item otherwise ... | |
164 | ||
165 | =over | |
166 | ||
167 | =item For code points above 255 ... | |
168 | ||
169 | C<\w> matches the same as C<\p{Word}> matches in this range. That is, | |
170 | it matches Thai letters, Greek letters, etc. This includes connector | |
d35dd6c6 | 171 | punctuation (like the underscore) which connect two words together, or |
b6538e4f | 172 | diacritics, such as a C<COMBINING TILDE> and the modifier letters, which |
82206b5e KW |
173 | are generally used to add auxiliary markings to letters. |
174 | ||
175 | =item For code points below 256 ... | |
176 | ||
177 | =over | |
178 | ||
179 | =item if locale rules are in effect ... | |
180 | ||
181 | C<\w> matches the platform's native underscore character plus whatever | |
182 | the locale considers to be alphanumeric. | |
183 | ||
04c2d19c | 184 | =item if, instead, Unicode rules are in effect ... |
82206b5e KW |
185 | |
186 | C<\w> matches exactly what C<\p{Word}> matches. | |
187 | ||
188 | =item otherwise ... | |
189 | ||
190 | C<\w> matches [a-zA-Z0-9_]. | |
191 | ||
192 | =back | |
193 | ||
194 | =back | |
195 | ||
196 | =back | |
197 | ||
198 | Which rules apply are determined as described in L<perlre/Which character set modifier is in effect?>. | |
8a118206 | 199 | |
6b83a163 KW |
200 | There are a number of security issues with the full Unicode list of word |
201 | characters. See L<http://unicode.org/reports/tr36>. | |
202 | ||
203 | Also, for a somewhat finer-grained set of characters that are in programming | |
204 | language identifiers beyond the ASCII range, you may wish to instead use the | |
e2cfb18c KW |
205 | more customized L</Unicode Properties>, C<\p{ID_Start}>, |
206 | C<\p{ID_Continue}>, C<\p{XID_Start}>, and C<\p{XID_Continue}>. See | |
207 | L<http://unicode.org/reports/tr31>. | |
6b83a163 | 208 | |
b6538e4f | 209 | Any character not matched by C<\w> is matched by C<\W>. |
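As a hedged illustration of the rules in this section, the sketch below checks C<\w> against a code point above 255 with and without C</a>. The sample characters and variable names are assumptions made for the example, and it deliberately avoids code points 128 to 255, whose behavior depends on which rules are in effect.

```perl
use strict;
use warnings;

my $alpha = "\x{3B1}";   # GREEK SMALL LETTER ALPHA, a code point above 255

my $alpha_default = $alpha     =~ /\A\w\z/   ? 1 : 0;  # matches as \p{Word}
my $alpha_ascii   = $alpha     =~ /\A\w\z/a  ? 1 : 0;  # /a: only [a-zA-Z0-9_]
my $underscore    = "_"        =~ /\A\w\z/a  ? 1 : 0;  # underscore is a word char
my $hyphen        = "-"        =~ /\A\w\z/   ? 1 : 0;  # hyphen is not
my $one_char_only = "ab"       =~ /\A\w\z/   ? 1 : 0;  # \w is a single character
my $identifier    = "foo_bar9" =~ /\A\w+\z/a ? 1 : 0;  # \w+ for a whole "word"

printf "alpha: %d (default), %d (/a); '_': %d; '-': %d\n",
    $alpha_default, $alpha_ascii, $underscore, $hyphen;
printf "'ab' as one \\w: %d; 'foo_bar9' as \\w+: %d\n",
    $one_char_only, $identifier;
```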
8a118206 | 210 | |
ea449505 KW |
211 | =head3 Whitespace |
212 | ||
82206b5e KW |
213 | C<\s> matches any single character considered whitespace. |
214 | ||
215 | =over | |
216 | ||
217 | =item If the C</a> modifier is in effect ... | |
218 | ||
d28d8023 KW |
219 | In all Perl versions, C<\s> matches the 5 characters [\t\n\f\r ]; that |
220 | is, the horizontal tab, | |
221 | the newline, the form feed, the carriage return, and the space. | |
779cf272 | 222 | Starting in Perl v5.18, it also matches the vertical tab, C<\cK>. |
d28d8023 | 223 | See note C<[1]> below for a discussion of this. |
82206b5e KW |
224 | |
225 | =item otherwise ... | |
226 | ||
227 | =over | |
228 | ||
229 | =item For code points above 255 ... | |
230 | ||
231 | C<\s> matches exactly the code points above 255 shown with an "s" column | |
232 | in the table below. | |
233 | ||
234 | =item For code points below 256 ... | |
235 | ||
236 | =over | |
237 | ||
238 | =item if locale rules are in effect ... | |
239 | ||
d28d8023 | 240 | C<\s> matches whatever the locale considers to be whitespace. |
82206b5e | 241 | |
04c2d19c | 242 | =item if, instead, Unicode rules are in effect ... |
82206b5e KW |
243 | |
244 | C<\s> matches exactly the characters shown with an "s" column in the | |
245 | table below. | |
246 | ||
247 | =item otherwise ... | |
248 | ||
779cf272 | 249 | C<\s> matches [\t\n\f\r ] and, starting in Perl |
d28d8023 KW |
250 | v5.18, the vertical tab, C<\cK>. |
251 | (See note C<[1]> below for a discussion of this.) | |
82206b5e KW |
252 | Note that this list doesn't include the non-breaking space. |
253 | ||
254 | =back | |
255 | ||
256 | =back | |
257 | ||
258 | =back | |
259 | ||
260 | Which rules apply are determined as described in L<perlre/Which character set modifier is in effect?>. | |
8a118206 | 261 | |
b6538e4f | 262 | Any character not matched by C<\s> is matched by C<\S>. |
8a118206 | 263 | |
b6538e4f | 264 | C<\h> matches any character considered horizontal whitespace; |
8129baca | 265 | this includes the platform's space and tab characters and several others |
b6538e4f | 266 | listed in the table below. C<\H> matches any character |
8129baca KW |
267 | not considered horizontal whitespace. They use the platform's native |
268 | character set, and do not consider any locale that may otherwise be in | |
269 | use. | |
ea449505 | 270 | |
b6538e4f | 271 | C<\v> matches any character considered vertical whitespace; |
8129baca | 272 | this includes the platform's carriage return and line feed characters (newline) |
b6538e4f TC |
273 | plus several other characters, all listed in the table below. |
274 | C<\V> matches any character not considered vertical whitespace. | |
8129baca KW |
275 | They use the platform's native character set, and do not consider any |
276 | locale that may otherwise be in use. | |
8a118206 RGS |
277 | |
278 | C<\R> matches anything that can be considered a newline under Unicode | |
412a49a2 KW |
279 | rules. It can match a multi-character sequence. It cannot be used inside |
280 | a bracketed character class; use C<\v> instead (vertical whitespace). | |
281 | It uses the platform's | |
8129baca KW |
282 | native character set, and does not consider any locale that may |
283 | otherwise be in use. | |
ea449505 | 284 | Details are discussed in L<perlrebackslash>. |
8a118206 | 285 | |
82206b5e | 286 | Note that unlike C<\s> (and C<\d> and C<\w>), C<\h> and C<\v> always match |
8129baca KW |
287 | the same characters, without regard to other factors, such as the active |
288 | locale or whether the source string is in UTF-8 format. | |
8a118206 | 289 | |
d28d8023 KW |
290 | One might think that C<\s> is equivalent to C<[\h\v]>. This is indeed true |
291 | starting in Perl v5.18, but prior to that, the sole difference was that the | |
292 | vertical tab (C<"\cK">) was not matched by C<\s>. | |
8a118206 RGS |
293 | |
294 | The following table is a complete listing of characters matched by | |
a9c9e371 | 295 | C<\s>, C<\h> and C<\v> as of Unicode 6.3. |
8a118206 | 296 | |
582da942 | 297 | The first column gives the Unicode code point of the character (in hex format), |
8a118206 | 298 | the second column gives the (Unicode) name. The third column indicates |
4b9734bf KW |
299 | by which class(es) the character is matched (assuming no locale is in |
300 | effect that changes the C<\s> matching). | |
8a118206 | 301 | |
    0x0009  CHARACTER TABULATION       h s
    0x000a  LINE FEED (LF)              vs
    0x000b  LINE TABULATION             vs  [1]
    0x000c  FORM FEED (FF)              vs
    0x000d  CARRIAGE RETURN (CR)        vs
    0x0020  SPACE                      h s
    0x0085  NEXT LINE (NEL)             vs  [2]
    0x00a0  NO-BREAK SPACE             h s  [2]
    0x1680  OGHAM SPACE MARK           h s
    0x2000  EN QUAD                    h s
    0x2001  EM QUAD                    h s
    0x2002  EN SPACE                   h s
    0x2003  EM SPACE                   h s
    0x2004  THREE-PER-EM SPACE         h s
    0x2005  FOUR-PER-EM SPACE          h s
    0x2006  SIX-PER-EM SPACE           h s
    0x2007  FIGURE SPACE               h s
    0x2008  PUNCTUATION SPACE          h s
    0x2009  THIN SPACE                 h s
    0x200a  HAIR SPACE                 h s
    0x2028  LINE SEPARATOR              vs
    0x2029  PARAGRAPH SEPARATOR         vs
    0x202f  NARROW NO-BREAK SPACE      h s
    0x205f  MEDIUM MATHEMATICAL SPACE  h s
    0x3000  IDEOGRAPHIC SPACE          h s
8a118206 RGS |
327 | |
328 | =over 4 | |
329 | ||
330 | =item [1] | |
331 | ||
779cf272 KW |
332 | Prior to Perl v5.18, C<\s> did not match the vertical tab. |
333 | C<[^\S\cK]> (obscurely) matches what C<\s> traditionally did. | |
d28d8023 KW |
334 | |
335 | =item [2] | |
336 | ||
82206b5e KW |
337 | NEXT LINE and NO-BREAK SPACE may or may not match C<\s> depending |
338 | on the rules in effect. See | |
339 | L<the beginning of this section|/Whitespace>. | |
8a118206 RGS |
340 | |
341 | =back | |
342 | ||
8a118206 RGS |
343 | =head3 Unicode Properties |
344 | ||
C<\pP> and C<\p{Prop}> are character classes to match characters that fit given
Unicode properties. One-letter property names can be used in the C<\pP> form,
with the property name following the C<\p>; otherwise, braces are required.
When using braces, there is a single form, which is just the property name
enclosed in the braces, and a compound form which looks like C<\p{name=value}>,
which means to match if the property "name" for the character has that particular
"value".
For instance, a match for a number can be written as C</\pN/> or as
C</\p{Number}/>, or as C</\p{Number=True}/>.
Lowercase letters are matched by the property I<Lowercase_Letter> which
has the short form I<Ll>. They need the braces, so are written as C</\p{Ll}/> or
C</\p{Lowercase_Letter}/>, or C</\p{General_Category=Lowercase_Letter}/>
(the underscores are optional).
C</\pLl/> is valid, but means something different.
It matches a two character string: a letter (Unicode property C<\pL>),
followed by a lowercase C<l>.
361 | ||
bc943be5 | 362 | If locale rules are not in effect, the use of |
82206b5e | 363 | a Unicode property will force the regular expression into using Unicode |
bc943be5 | 364 | rules, if it isn't already. |
82206b5e | 365 | |
56ca34ca KW |
366 | Note that almost all properties are immune to case-insensitive matching. |
367 | That is, adding a C</i> regular expression modifier does not change what | |
82206b5e | 368 | they match. There are two sets that are affected. The first set is |
56ca34ca KW |
369 | C<Uppercase_Letter>, |
370 | C<Lowercase_Letter>, | |
371 | and C<Titlecase_Letter>, | |
372 | all of which match C<Cased_Letter> under C</i> matching. | |
b6538e4f | 373 | The second set is |
56ca34ca KW |
374 | C<Uppercase>, |
375 | C<Lowercase>, | |
376 | and C<Titlecase>, | |
377 | all of which match C<Cased> under C</i> matching. | |
378 | (The difference between these sets is that some things, such as Roman | |
e2cfb18c | 379 | numerals, come in both upper and lower case, so they are C<Cased>, but |
b6538e4f | 380 | aren't considered to be letters, so they aren't C<Cased_Letter>s. They're |
82206b5e KW |
381 | actually C<Letter_Number>s.) |
382 | This set also includes its subsets C<PosixUpper> and C<PosixLower>, both | |
e2cfb18c | 383 | of which under C</i> match C<PosixAlpha>. |
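The following sketch, which is an illustration and not part of the original text, exercises two points from this section: the difference between C<\pLl> and C<\p{Ll}>, and the effect of C</i> on the two affected sets of properties. The variable names are invented, and ROMAN NUMERAL ONE (U+2160) is used here as an assumed example of a character that is C<Cased> but is a C<Letter_Number> rather than a letter.

```perl
use strict;
use warnings;

# \pLl is \pL followed by a literal "l"; \p{Ll} is Lowercase_Letter.
my $two_chars  = "xl" =~ /\A\pLl\z/   ? 1 : 0;  # a letter, then "l"
my $one_char   = "a"  =~ /\A\pLl\z/   ? 1 : 0;  # too short for \pLl
my $lower      = "a"  =~ /\A\p{Ll}\z/ ? 1 : 0;  # "a" is a lowercase letter
my $upper      = "A"  =~ /\A\p{Ll}\z/ ? 1 : 0;  # "A" is not

# Under /i, Lowercase_Letter matches as Cased_Letter and Lowercase
# matches as Cased.
my $upper_i    = "A"  =~ /\A\p{Ll}\z/i ? 1 : 0;

my $roman_one  = "\x{2160}";   # ROMAN NUMERAL ONE
my $roman_ll_i = $roman_one =~ /\A\p{Ll}\z/i        ? 1 : 0;  # not a letter
my $roman_lc_i = $roman_one =~ /\A\p{Lowercase}\z/i ? 1 : 0;  # but it is Cased

printf "\\pLl: 'xl' %d, 'a' %d; \\p{Ll}: 'a' %d, 'A' %d, 'A' with /i %d\n",
    $two_chars, $one_char, $lower, $upper, $upper_i;
printf "U+2160 with /i: \\p{Ll} %d, \\p{Lowercase} %d\n",
    $roman_ll_i, $roman_lc_i;
```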
56ca34ca KW |
384 | |
385 | For more details on Unicode properties, see L<perlunicode/Unicode | |
386 | Character Properties>; for a | |
e1b711da | 387 | complete list of possible properties, see |
56ca34ca KW |
388 | L<perluniprops/Properties accessible through \p{} and \P{}>, |
389 | which notes all forms that have C</i> differences. | |
e1b711da | 390 | It is also possible to define your own properties. This is discussed in |
8a118206 RGS |
391 | L<perlunicode/User-Defined Character Properties>. |
392 | ||
94b42e47 | 393 | Unicode properties are defined (surprise!) only on Unicode code points. |
2d88a86a KW |
394 | Starting in v5.20, when matching against C<\p> and C<\P>, Perl treats |
395 | non-Unicode code points (those above the legal Unicode maximum of | |
396 | 0x10FFFF) as if they were typical unassigned Unicode code points. | |
94b42e47 | 397 | |
2d88a86a KW |
398 | Prior to v5.20, Perl raised a warning and made all matches fail on |
399 | non-Unicode code points. This could be somewhat surprising: | |
94b42e47 | 400 | |
    chr(0x110000) =~ /\p{ASCII_Hex_Digit=True}/   # Fails on Perls < v5.20.
    chr(0x110000) =~ /\p{ASCII_Hex_Digit=False}/  # Also fails on Perls
                                                  # < v5.20
404 | ||
405 | Even though these two matches might be thought of as complements, until | |
406 | v5.20 they were so only on Unicode code points. | |
94b42e47 | 407 | |
8a118206 RGS |
408 | =head4 Examples |
409 | ||
    "a"  =~  /\w/      # Match, "a" is a 'word' character.
    "7"  =~  /\w/      # Match, "7" is a 'word' character as well.
    "a"  =~  /\d/      # No match, "a" isn't a digit.
    "7"  =~  /\d/      # Match, "7" is a digit.
    " "  =~  /\s/      # Match, a space is whitespace.
    "a"  =~  /\D/      # Match, "a" is a non-digit.
    "7"  =~  /\D/      # No match, "7" is not a non-digit.
    " "  =~  /\S/      # No match, a space is not non-whitespace.

    " "  =~  /\h/      # Match, space is horizontal whitespace.
    " "  =~  /\v/      # No match, space is not vertical whitespace.
    "\r" =~  /\v/      # Match, a return is vertical whitespace.

    "a"  =~  /\pL/     # Match, "a" is a letter.
    "a"  =~  /\p{Lu}/  # No match, /\p{Lu}/ matches upper case letters.

    "\x{0e0b}" =~ /\p{Thai}/  # Match, \x{0e0b} is the character
                              # 'THAI CHARACTER SO SO', and that's in
                              # Thai Unicode class.
    "a"  =~  /\P{Lao}/ # Match, as "a" is not a Laotian character.
8a118206 | 430 | |
82206b5e KW |
431 | It is worth emphasizing that C<\d>, C<\w>, etc, match single characters, not |
432 | complete numbers or words. To match a number (that consists of digits), | |
433 | use C<\d+>; to match a word, use C<\w+>. But be aware of the security | |
434 | considerations in doing so, as mentioned above. | |
8a118206 RGS |
435 | |
436 | =head2 Bracketed Character Classes | |
437 | ||
438 | The third form of character class you can use in Perl regular expressions | |
6b83a163 | 439 | is the bracketed character class. In its simplest form, it lists the characters |
c1c4ae3a | 440 | that may be matched, surrounded by square brackets, like this: C<[aeiou]>. |
ea449505 | 441 | This matches one of C<a>, C<e>, C<i>, C<o> or C<u>. Like the other |
1f59b283 | 442 | character classes, exactly one character is matched.* To match |
ea449505 | 443 | a longer string consisting of characters mentioned in the character |
6b83a163 | 444 | class, follow the character class with a L<quantifier|perlre/Quantifiers>. For |
b6538e4f | 445 | instance, C<[aeiou]+> matches one or more lowercase English vowels. |
8a118206 RGS |
446 | |
447 | Repeating a character in a character class has no | |
448 | effect; it's considered to be in the set only once. | |
449 | ||
450 | Examples: | |
451 | ||
    "e"  =~  /[aeiou]/        # Match, as "e" is listed in the class.
    "p"  =~  /[aeiou]/        # No match, "p" is not listed in the class.
    "ae" =~  /^[aeiou]$/      # No match, a character class only matches
                              # a single character.
    "ae" =~  /^[aeiou]+$/     # Match, due to the quantifier.
457 | ||
1f59b283 KW |
458 | ------- |
459 | ||
8f0cd35a KW |
460 | * There are two exceptions to a bracketed character class matching a |
461 | single character only. Each requires special handling by Perl to make | |
462 | things work: | |
463 | ||
464 | =over | |
465 | ||
466 | =item * | |
467 | ||
468 | When the class is to match caselessly under C</i> matching rules, and a | |
469 | character that is explicitly mentioned inside the class matches a | |
1f59b283 | 470 | multiple-character sequence caselessly under Unicode rules, the class |
8f0cd35a KW |
471 | will also match that sequence. For example, Unicode says that the |
472 | letter C<LATIN SMALL LETTER SHARP S> should match the sequence C<ss> | |
473 | under C</i> rules. Thus, | |
1f59b283 KW |
474 | |
    'ss' =~ /\A\N{LATIN SMALL LETTER SHARP S}\z/i           # Matches
    'ss' =~ /\A[aeioust\N{LATIN SMALL LETTER SHARP S}]\z/i  # Matches
477 | ||
8f0cd35a KW |
478 | For this to happen, the class must not be inverted (see L</Negation>) |
479 | and the character must be explicitly specified, and not be part of a | |
480 | multi-character range (not even as one of its endpoints). (L</Character | |
481 | Ranges> will be explained shortly.) Therefore, | |
9d53c457 | 482 | |
    'ss' =~ /\A[\0-\x{ff}]\z/ui       # Doesn't match
    'ss' =~ /\A[\0-\N{LATIN SMALL LETTER SHARP S}]\z/ui   # No match
    'ss' =~ /\A[\xDF-\xDF]\z/ui   # Matches on ASCII platforms, since
                                  # \xDF is LATIN SMALL LETTER SHARP S,
                                  # and the range is just a single
                                  # element
9d53c457 KW |
489 | |
490 | Note that it isn't a good idea to specify these types of ranges anyway. | |
491 | ||
8f0cd35a KW |
492 | =item * |
493 | ||
494 | Some names known to C<\N{...}> refer to a sequence of multiple characters, | |
495 | instead of the usual single character. When one of these is included in | |
496 | the class, the entire sequence is matched. For example, | |
497 | ||
    "\N{TAMIL LETTER KA}\N{TAMIL VOWEL SIGN AU}"
                                  =~ / ^ [\N{TAMIL SYLLABLE KAU}] $ /x;
500 | ||
501 | matches, because C<\N{TAMIL SYLLABLE KAU}> is a named sequence | |
502 | consisting of the two characters matched against. Like the other | |
eb9e3b14 | 503 | instance where a bracketed class can match multiple characters, and for |
8f0cd35a KW |
504 | similar reasons, the class must not be inverted, and the named sequence |
may not appear in a range, even one where it is both endpoints. If
these happen, it is a fatal error if the character class is within the
scope of L<C<use re 'strict'>|re/'strict' mode>, or within an extended
L<C<(?[...])>|/Extended Bracketed Character Classes> class; otherwise
only the first code point is used (with a C<regexp>-type warning
raised).
8f0cd35a KW |
511 | |
512 | =back | |
513 | ||
8a118206 RGS |
514 | =head3 Special Characters Inside a Bracketed Character Class |
515 | ||
516 | Most characters that are meta characters in regular expressions (that | |
df225385 | 517 | is, characters that carry a special meaning like C<.>, C<*>, or C<(>) lose |
8a118206 RGS |
518 | their special meaning and can be used inside a character class without |
519 | the need to escape them. For instance, C<[()]> matches either an opening | |
520 | parenthesis, or a closing parenthesis, and the parens inside the character | |
6e16fd37 KW |
521 | class don't group or capture. Be aware that, unless the pattern is |
522 | evaluated in single-quotish context, variable interpolation will take | |
523 | place before the bracketed class is parsed: | |
524 | ||
    $, = "\t| ";
    $a =~ m'[$,]';        # single-quotish: matches '$' or ','
    $a =~ q{[$,]};        # same
    $a =~ m/[$,]/;        # double-quotish: matches "\t", "|", or " "
8a118206 RGS |
529 | |
530 | Characters that may carry a special meaning inside a character class are: | |
531 | C<\>, C<^>, C<->, C<[> and C<]>, and are discussed below. They can be | |
532 | escaped with a backslash, although this is sometimes not needed, in which | |
533 | case the backslash may be omitted. | |
534 | ||
The sequence C<\b> is special inside a bracketed character class. Outside
a character class, C<\b> is an assertion indicating a point
that does not have either two word characters or two non-word characters
on either side; inside a bracketed character class, C<\b> matches a
backspace character.
540 | ||
df225385 KW |
541 | The sequences |
542 | C<\a>, | |
543 | C<\c>, | |
544 | C<\e>, | |
545 | C<\f>, | |
546 | C<\n>, | |
e526e8bb | 547 | C<\N{I<NAME>}>, |
765fa144 | 548 | C<\N{U+I<hex char>}>, |
df225385 KW |
549 | C<\r>, |
550 | C<\t>, | |
551 | and | |
552 | C<\x> | |
06ee63cd | 553 | are also special and have the same meanings as they do outside a |
eb9e3b14 | 554 | bracketed character class. |
df225385 | 555 | |
ea449505 KW |
556 | Also, a backslash followed by two or three octal digits is considered an octal |
557 | number. | |
df225385 | 558 | |
6b83a163 KW |
559 | A C<[> is not special inside a character class, unless it's the start of a |
560 | POSIX character class (see L</POSIX Character Classes> below). It normally does | |
561 | not need escaping. | |
8a118206 | 562 | |
6b83a163 KW |
563 | A C<]> is normally either the end of a POSIX character class (see |
564 | L</POSIX Character Classes> below), or it signals the end of the bracketed | |
565 | character class. If you want to include a C<]> in the set of characters, you | |
566 | must generally escape it. | |
b6538e4f | 567 | |
8a118206 RGS |
568 | However, if the C<]> is the I<first> (or the second if the first |
569 | character is a caret) character of a bracketed character class, it | |
570 | does not denote the end of the class (as you cannot have an empty class) | |
571 | and is considered part of the set of characters that can be matched without | |
572 | escaping. | |
573 | ||
574 | Examples: | |
575 | ||
    "+"   =~ /[+?*]/     #  Match, "+" in a character class is not special.
    "\cH" =~ /[\b]/      #  Match, \b inside a character class
                         #  is equivalent to a backspace.
    "]"   =~ /[][]/      #  Match, as the character class contains
                         #  both [ and ].
    "[]"  =~ /[[]]/      #  Match, the pattern contains a character class
                         #  containing just [, and the character class is
                         #  followed by a ].
584 | ||
77c8f263 KW |
585 | =head3 Bracketed Character Classes and the C</xx> pattern modifier |
586 | ||
587 | Normally SPACE and TAB characters have no special meaning inside a | |
588 | bracketed character class; they are just added to the list of characters | |
589 | matched by the class. But if the L<C</xx>|perlre/E<sol>x and E<sol>xx> | |
590 | pattern modifier is in effect, they are generally ignored and can be | |
591 | added to improve readability. They can't be added in the middle of a | |
592 | single construct: | |
593 | ||
    / [ \x{10 FFFF} ] /xx  # WRONG!
595 | ||
596 | The SPACE in the middle of the hex constant is illegal. | |
597 | ||
598 | To specify a literal SPACE character, you can escape it with a | |
599 | backslash, like: | |
600 | ||
    /[ a e i o u \ ]/xx
602 | ||
603 | This matches the English vowels plus the SPACE character. | |
604 | ||
605 | For clarity, you should already have been using C<\t> to specify a | |
606 | literal tab, and C<\t> is unaffected by C</xx>. | |
607 | ||
8a118206 RGS |
608 | =head3 Character Ranges |
609 | ||
610 | It is not uncommon to want to match a range of characters. Luckily, instead | |
b6538e4f | 611 | of listing all characters in the range, one may use the hyphen (C<->). |
8a118206 | 612 | If inside a bracketed character class you have two characters separated |
b6538e4f | 613 | by a hyphen, it's treated as if all characters between the two were in |
8a118206 | 614 | the class. For instance, C<[0-9]> matches any ASCII digit, and C<[a-m]> |
e2cfb18c | 615 | matches any lowercase letter from the first half of the ASCII alphabet. |
8a118206 RGS |
616 | |
617 | Note that the two characters on either side of the hyphen are not | |
765fa144 | 618 | necessarily both letters or both digits. Any character is possible, |
8a118206 | 619 | although not advisable. C<['-?]> contains a range of characters, but |
b6538e4f | 620 | most people will not know which characters that means. Furthermore, |
8a118206 RGS |
621 | such ranges may lead to portability problems if the code has to run on |
622 | a platform that uses a different character set, such as EBCDIC. | |
623 | ||
ea449505 KW |
624 | If a hyphen in a character class cannot syntactically be part of a range, for |
625 | instance because it is the first or the last character of the character class, | |
b6538e4f TC |
626 | or if it immediately follows a range, the hyphen isn't special, and so is |
627 | considered a character to be matched literally. If you want a hyphen in | |
628 | your set of characters to be matched and its position in the class is such | |
629 | that it could be considered part of a range, you must escape that hyphen | |
630 | with a backslash. | |
8a118206 RGS |
631 | |
632 | Examples: | |
633 | ||
634 | [a-z] # Matches a character that is a lower case ASCII letter. | |
c1c4ae3a KW |
635 | [a-fz] # Matches any letter between 'a' and 'f' (inclusive) or |
636 | # the letter 'z'. | |
8a118206 RGS |
637 | [-z] # Matches either a hyphen ('-') or the letter 'z'. |
638 | [a-f-m] # Matches any letter between 'a' and 'f' (inclusive), the | |
639 | # hyphen ('-'), or the letter 'm'. | |
640 | ['-?] # Matches any of the characters '()*+,-./0123456789:;<=>? | |
641 | # (But not on an EBCDIC platform). | |
c7d25594 KW |
642 | [\N{APOSTROPHE}-\N{QUESTION MARK}] |
643 | # Matches any of the characters '()*+,-./0123456789:;<=>? | |
644 | # even on an EBCDIC platform. | |
ad63362f | 645 | [\N{U+27}-\N{U+3F}] # Same. (U+27 is "'", and U+3F is "?") |
c7d25594 | 646 | |
dabde021 | 647 | As the final two examples above show, you can achieve portability to |
c7d25594 KW |
648 | non-ASCII platforms by using the C<\N{...}> form for the range |
649 | endpoints. These indicate that the specified range is to be interpreted | |
650 | using Unicode values, so C<[\N{U+27}-\N{U+3F}]> means to match | |
651 | C<\N{U+27}>, C<\N{U+28}>, C<\N{U+29}>, ..., C<\N{U+3D}>, C<\N{U+3E}>, | |
652 | and C<\N{U+3F}>, whatever the native code point versions for those are. | |
b927b7e9 KW |
653 | These are called "Unicode" ranges. If either end is of the C<\N{...}> |
654 | form, the range is considered Unicode. A C<regexp> warning is raised | |
655 | under C<S<"use re 'strict'">> if the other endpoint is specified | |
656 | non-portably: | |
657 | ||
658 | [\N{U+00}-\x09] # Warning under re 'strict'; \x09 is non-portable | |
659 | [\N{U+00}-\t] # No warning | |
660 | ||
661 | Both of the above match the characters C<\N{U+00}>, C<\N{U+01}>, ... | |
662 | C<\N{U+08}>, C<\N{U+09}>, but the C<\x09> looks like it could be a | |
663 | mistake so the warning is raised (under C<re 'strict'>) for it. | |
c7d25594 KW |
664 | |
665 | Perl also guarantees that the ranges C<A-Z>, C<a-z>, C<0-9>, and any | |
09e43397 | 666 | subranges of these match what an English-only speaker would expect them |
c7d25594 KW |
667 | to match on any platform. That is, C<[A-Z]> matches the 26 ASCII |
668 | uppercase letters; | |
09e43397 KW |
669 | C<[a-z]> matches the 26 lowercase letters; and C<[0-9]> matches the 10 |
670 | digits. Subranges, like C<[h-k]>, match correspondingly, in this case | |
671 | just the four letters C<"h">, C<"i">, C<"j">, and C<"k">. This is the | |
672 | natural behavior on ASCII platforms where the code points (ordinal | |
673 | values) for C<"h"> through C<"k"> are consecutive integers (0x68 through | |
674 | 0x6B). But special handling to achieve this may be needed on platforms | |
675 | with a non-ASCII native character set. For example, on EBCDIC | |
676 | platforms, the code point for C<"h"> is 0x88, C<"i"> is 0x89, C<"j"> is | |
677 | 0x91, and C<"k"> is 0x92. Perl specially treats C<[h-k]> to exclude the | |
678 | seven code points in the gap: 0x8A through 0x90. This special handling is | |
679 | only invoked when the range is a subrange of one of the ASCII uppercase, | |
680 | lowercase, and digit ranges, AND each end of the range is expressed | |
681 | either as a literal, like C<"A">, or as a named character (C<\N{...}>, | |
682 | including the C<\N{U+...}> form). | |
683 | ||
684 | EBCDIC Examples: | |
685 | ||
686 | [i-j] # Matches either "i" or "j" | |
687 | [i-\N{LATIN SMALL LETTER J}] # Same | |
688 | [i-\N{U+6A}] # Same | |
689 | [\N{U+69}-\N{U+6A}] # Same | |
690 | [\x{89}-\x{91}] # Matches 0x89 ("i"), 0x8A .. 0x90, 0x91 ("j") | |
691 | [i-\x{91}] # Same | |
692 | [\x{89}-j] # Same | |
693 | [i-J] # Matches 0x89 ("i") .. 0xD1 ("J"); special | |
694 | # handling doesn't apply because range is mixed | |
695 | # case | |
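The rules for when a hyphen forms a range and when it is literal can be tried out with a small script. This is a sketch only; C<in_class> is a hypothetical helper written for this illustration, not part of Perl.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $char is matched by the bracketed class
# whose body (the text between the brackets) is $body.
sub in_class {
    my ($body, $char) = @_;
    return $char =~ /\A[$body]\z/ ? 1 : 0;
}

my @cases = (
    [ 'a-f-m', 'c' ],    # inside the range a-f
    [ 'a-f-m', '-' ],    # hyphen right after a range is literal
    [ 'a-f-m', 'g' ],    # f-m is NOT a second range
    [ '-z',    '-' ],    # leading hyphen is literal
    [ 'a\-z',  '-' ],    # escaped hyphen is literal
    [ 'a\-z',  'm' ],    # so there is no a-z range here
);

for my $case (@cases) {
    my ($body, $char) = @$case;
    printf "[%s] vs '%s': %s\n", $body, $char,
        in_class($body, $char) ? "match" : "no match";
}
```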
8a118206 RGS |
696 | |
697 | =head3 Negation | |
698 | ||
699 | It is also possible to instead list the characters you do not want to | |
700 | match. You can do so by using a caret (C<^>) as the first character in the | |
b6538e4f | 701 | character class. For instance, C<[^a-z]> matches any character that is not a |
e2cfb18c KW |
702 | lowercase ASCII letter, which therefore includes more than a million |
703 | Unicode code points. The class is said to be "negated" or "inverted". | |
8a118206 RGS |
704 | |
705 | This syntax makes the caret a special character inside a bracketed character | |
706 | class, but only if it is the first character of the class. So if you want | |
82206b5e | 707 | the caret as one of the characters to match, either escape the caret or |
e2cfb18c | 708 | else don't list it first. |
8a118206 | 709 | |
1f59b283 | 710 | In inverted bracketed character classes, Perl ignores the Unicode rules |
8f0cd35a KW |
711 | that normally say that named sequences and certain characters should | |
712 | match a sequence of multiple characters under caseless C</i> | |
713 | matching. Following those rules could lead to highly confusing | |
714 | situations: | |
1f59b283 | 715 | |
582da942 | 716 | "ss" =~ /^[^\xDF]+$/ui; # Matches! |
1f59b283 KW |
717 | |
718 | This should match any sequence of characters that are neither C<\xDF> nor | |
719 | what C<\xDF> matches under C</i>. C<"s"> isn't C<\xDF>, but Unicode | |
720 | says that C<"ss"> is what C<\xDF> matches under C</i>. So which one | |
721 | "wins"? Do you fail the match because the string has C<ss> or accept it | |
582da942 | 722 | because it has an C<s> followed by another C<s>? Perl has chosen the |
8f0cd35a | 723 | latter. (See note in L</Bracketed Character Classes> above.) |
1f59b283 | 724 | |
8a118206 RGS |
725 | Examples: |
726 | ||
727 | "e" =~ /[^aeiou]/ # No match, the 'e' is listed. | |
728 | "x" =~ /[^aeiou]/ # Match, as 'x' isn't a lowercase vowel. | |
729 | "^" =~ /[^^]/ # No match, matches anything that isn't a caret. | |
730 | "^" =~ /[x^]/ # Match, caret is not special here. | |
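The caret rules, and the multi-character fold case for inverted classes, can be checked together. The script below is a sketch that assumes an ASCII platform; the hash keys are illustrative labels only.

```perl
use strict;
use warnings;

my %got = (
    # A leading caret negates, so [^^] is "anything but a caret".
    caret_first    => ( '^'  =~ /[^^]/          ? 1 : 0 ),
    # Escaped, or not in first position, the caret is an ordinary member.
    caret_escaped  => ( '^'  =~ /[\^x]/         ? 1 : 0 ),
    caret_not_lead => ( '^'  =~ /[x^]/          ? 1 : 0 ),
    # In an inverted class the multi-character fold of \xDF is ignored,
    # so each "s" is simply a character that is not \xDF.
    inverted_fold  => ( 'ss' =~ /^[^\xDF]+$/ui  ? 1 : 0 ),
);

printf "%-15s %s\n", $_, $got{$_} ? "match" : "no match" for sort keys %got;
```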
731 | ||
732 | =head3 Backslash Sequences | |
733 | ||
ea449505 | 734 | You can put any backslash sequence character class (with the exception of |
765fa144 | 735 | C<\N> and C<\R>) inside a bracketed character class, and it will act just |
b6538e4f TC |
736 | as if you had put all characters matched by the backslash sequence inside the |
737 | character class. For instance, C<[a-f\d]> matches any decimal digit, or any | |
6b83a163 KW |
738 | of the lowercase letters between 'a' and 'f' inclusive. |
739 | ||
740 | C<\N> within a bracketed character class must be of the forms C<\N{I<name>}> | |
765fa144 | 741 | or C<\N{U+I<hex char>}>, and NOT be the form that matches non-newlines, |
6b83a163 KW |
742 | for the same reason that a dot C<.> inside a bracketed character class loses |
743 | its special meaning: it matches nearly anything, which generally isn't what you | |
744 | want to happen. | |
df225385 | 745 | |
8a118206 RGS |
746 | |
747 | Examples: | |
748 | ||
749 | /[\p{Thai}\d]/ # Matches a character that is either a Thai | |
750 | # character, or a digit. | |
751 | /[^\p{Arabic}()]/ # Matches a character that is neither an Arabic | |
752 | # character, nor a parenthesis. | |
753 | ||
754 | Backslash sequence character classes cannot form one of the endpoints | |
6b83a163 KW |
755 | of a range. Thus, you can't say: |
756 | ||
757 | /[\p{Thai}-\d]/ # Wrong! | |
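As a brief illustration of a backslash sequence used as a member of a bracketed class, the following sketch tests C<[a-f\d]> against a few single characters. The variable names are made up for the example.

```perl
use strict;
use warnings;

# \d contributes all the characters it matches to the class.
my $a_to_f_or_digit = qr/\A[a-f\d]\z/;

my %got = map { $_ => ( $_ =~ $a_to_f_or_digit ? 1 : 0 ) } qw(7 c g);

printf "'%s': %s\n", $_, $got{$_} ? "match" : "no match" for sort keys %got;
```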
8a118206 | 758 | |
6b83a163 | 759 | =head3 POSIX Character Classes |
ea449505 | 760 | X<character class> X<\p> X<\p{}> |
ea449505 KW |
761 | X<alpha> X<alnum> X<ascii> X<blank> X<cntrl> X<digit> X<graph> |
762 | X<lower> X<print> X<punct> X<space> X<upper> X<word> X<xdigit> | |
8a118206 | 763 | |
d66e1f56 | 764 | POSIX character classes have the form C<[:class:]>, where I<class> is the |
6b83a163 | 765 | name, and C<[:> and C<:]> are the delimiters. POSIX character classes only appear |
8a118206 | 766 | I<inside> bracketed character classes, and are a convenient and descriptive |
82206b5e | 767 | way of listing a group of characters. |
6b83a163 KW |
768 | |
769 | Be careful about the syntax: | |
8a118206 RGS |
770 | |
771 | # Correct: | |
772 | $string =~ /[[:alpha:]]/ | |
773 | ||
774 | # Incorrect (will warn): | |
775 | $string =~ /[:alpha:]/ | |
776 | ||
777 | The latter pattern would be a character class consisting of a colon, | |
778 | and the letters C<a>, C<l>, C<p> and C<h>. | |
d66e1f56 | 779 | |
82206b5e | 780 | POSIX character classes can be part of a larger bracketed character class. |
b6538e4f | 781 | For example, |
ea449505 KW |
782 | |
783 | [01[:alpha:]%] | |
784 | ||
785 | is valid and matches '0', '1', any alphabetic character, and the percent sign. | |
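The difference between the correct and incorrect spellings can be seen by matching a few characters. This is a sketch; the C<regexp> warning category is switched off around the incorrect form only to keep the output quiet, and the variable names are illustrative.

```perl
use strict;
use warnings;

my $correct = qr/[[:alpha:]]/;

# The single-bracket form is just a class of ':', 'a', 'l', 'p' and 'h'.
my $incorrect = do { no warnings 'regexp'; qr/[:alpha:]/ };

my $mixed = qr/\A[01[:alpha:]%]\z/;

for my $char ( 'z', ':', 'h' ) {
    printf "'%s'  [[:alpha:]]: %-8s  [:alpha:]: %s\n", $char,
        ( $char =~ $correct   ? "match" : "no match" ),
        ( $char =~ $incorrect ? "match" : "no match" );
}
printf "'%s' against [01[:alpha:]%%]: %s\n", $_,
    ( $_ =~ $mixed ? "match" : "no match" ) for '%', '2';
```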
8a118206 RGS |
786 | |
787 | Perl recognizes the following POSIX character classes: | |
788 | ||
ea449505 | 789 | alpha Any alphabetical character ("[A-Za-z]"). |
48cbae4f | 790 | alnum Any alphanumeric character ("[A-Za-z0-9]"). |
ea449505 | 791 | ascii Any character in the ASCII character set. |
ea8b8ad2 | 792 | blank A GNU extension, equal to a space or a horizontal tab ("\t"). |
ea449505 KW |
793 | cntrl Any control character. See Note [2] below. |
794 | digit Any decimal digit ("[0-9]"), equivalent to "\d". | |
795 | graph Any printable character, excluding a space. See Note [3] below. | |
796 | lower Any lowercase character ("[a-z]"). | |
797 | print Any printable character, including a space. See Note [4] below. | |
c1c4ae3a | 798 | punct Any graphical character excluding "word" characters. Note [5]. |
d28d8023 KW |
799 | space Any whitespace character. "\s" including the vertical tab |
800 | ("\cK"). | |
ea449505 KW |
801 | upper Any uppercase character ("[A-Z]"). |
802 | word A Perl extension ("[A-Za-z0-9_]"), equivalent to "\w". | |
803 | xdigit Any hexadecimal digit ("[0-9a-fA-F]"). | |
804 | ||
93106464 KW |
805 | Like the L<Unicode properties|/Unicode Properties>, most of the POSIX |
806 | properties match the same regardless of whether case-insensitive (C</i>) | |
807 | matching is in effect or not. The two exceptions are C<[:upper:]> and | |
808 | C<[:lower:]>. Under C</i>, they each match the union of C<[:upper:]> and | |
809 | C<[:lower:]>. | |
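A quick check of the C</i> behavior of C<[:upper:]> and C<[:lower:]> might look like the sketch below; the hash keys are illustrative labels.

```perl
use strict;
use warnings;

my %got = (
    upper_no_i   => ( 'a' =~ /[[:upper:]]/  ? 1 : 0 ),
    upper_with_i => ( 'a' =~ /[[:upper:]]/i ? 1 : 0 ),
    lower_with_i => ( 'A' =~ /[[:lower:]]/i ? 1 : 0 ),
    # Other characters are unaffected: a digit is neither case.
    digit_with_i => ( '5' =~ /[[:upper:]]/i ? 1 : 0 ),
);

printf "%-13s %s\n", $_, $got{$_} ? "match" : "no match" for sort keys %got;
```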
810 | ||
ea449505 KW |
811 | Most POSIX character classes have two Unicode-style C<\p> property |
812 | counterparts. (They are not official Unicode properties, but Perl extensions | |
813 | derived from official Unicode properties.) The table below shows the relation | |
814 | between POSIX character classes and these counterparts. | |
815 | ||
816 | One counterpart, in the column labelled "ASCII-range Unicode" in | |
b6538e4f | 817 | the table, matches only characters in the ASCII character set. |
ea449505 KW |
818 | |
819 | The other counterpart, in the column labelled "Full-range Unicode", matches any | |
820 | appropriate characters in the full Unicode character set. For example, | |
b6538e4f | 821 | C<\p{Alpha}> matches not just the ASCII alphabetic characters, but any |
82206b5e | 822 | character in the entire Unicode character set considered alphabetic. |
582da942 | 823 | An entry in the column labelled "backslash sequence" is a (short) |
5db9882c | 824 | equivalent. |
ea449505 | 825 | |
cbc24f92 KW |
826 | [[:...:]] ASCII-range Full-range backslash Note |
827 | Unicode Unicode sequence | |
ea449505 | 828 | ----------------------------------------------------- |
cbc24f92 KW |
829 | alpha \p{PosixAlpha} \p{XPosixAlpha} |
830 | alnum \p{PosixAlnum} \p{XPosixAlnum} | |
82206b5e | 831 | ascii \p{ASCII} |
cbc24f92 KW |
832 | blank \p{PosixBlank} \p{XPosixBlank} \h [1] |
833 | or \p{HorizSpace} [1] | |
834 | cntrl \p{PosixCntrl} \p{XPosixCntrl} [2] | |
835 | digit \p{PosixDigit} \p{XPosixDigit} \d | |
836 | graph \p{PosixGraph} \p{XPosixGraph} [3] | |
837 | lower \p{PosixLower} \p{XPosixLower} | |
838 | print \p{PosixPrint} \p{XPosixPrint} [4] | |
839 | punct \p{PosixPunct} \p{XPosixPunct} [5] | |
840 | \p{PerlSpace} \p{XPerlSpace} \s [6] | |
841 | space \p{PosixSpace} \p{XPosixSpace} [6] | |
842 | upper \p{PosixUpper} \p{XPosixUpper} | |
843 | word \p{PosixWord} \p{XPosixWord} \w | |
82206b5e | 844 | xdigit \p{PosixXDigit} \p{XPosixXDigit} |
8a118206 RGS |
845 | |
846 | =over 4 | |
847 | ||
ea449505 KW |
848 | =item [1] |
849 | ||
850 | C<\p{Blank}> and C<\p{HorizSpace}> are synonyms. | |
851 | ||
852 | =item [2] | |
8a118206 | 853 | |
ea449505 | 854 | Control characters don't produce output as such, but instead usually control |
b6538e4f | 855 | the terminal somehow: for example, newline and backspace are control characters. |
93106464 KW |
856 | On ASCII platforms, in the ASCII range, characters whose code points are |
857 | between 0 and 31 inclusive, plus 127 (C<DEL>) are control characters; on | |
858 | EBCDIC platforms, their counterparts are control characters. | |
8a118206 | 859 | |
ea449505 | 860 | =item [3] |
8a118206 RGS |
861 | |
862 | Any character that is I<graphical>, that is, visible. This class consists | |
b6538e4f | 863 | of all alphanumeric characters and all punctuation characters. |
8a118206 | 864 | |
ea449505 | 865 | =item [4] |
8a118206 | 866 | |
b6538e4f TC |
867 | All printable characters, which is the set of all graphical characters |
868 | plus those whitespace characters which are not also controls. | |
ea449505 | 869 | |
b6dac59a | 870 | =item [5] |
ea449505 | 871 | |
b6538e4f | 872 | C<\p{PosixPunct}> and C<[[:punct:]]> in the ASCII range match all |
ea449505 KW |
873 | non-controls, non-alphanumeric, non-space characters: |
874 | C<[-!"#$%&'()*+,./:;<=E<gt>?@[\\\]^_`{|}~]> (although if a locale is in effect, | |
875 | it could alter the behavior of C<[[:punct:]]>). | |
876 | ||
cbc24f92 KW |
877 | The similarly named property, C<\p{Punct}>, matches a somewhat different |
878 | set in the ASCII range, namely | |
0be9b861 KW |
879 | C<[-!"#%&'()*,./:;?@[\\\]_{}]>. That is, it is missing the nine |
880 | characters C<[$+E<lt>=E<gt>^`|~]>. | |
6c5a041f KW |
881 | This is because Unicode splits what POSIX considers to be punctuation into two |
882 | categories, Punctuation and Symbols. | |
883 | ||
e2cfb18c | 884 | C<\p{XPosixPunct}> and (under Unicode rules) C<[[:punct:]]>, match what |
765fa144 KW |
885 | C<\p{PosixPunct}> matches in the ASCII range, plus what C<\p{Punct}> |
886 | matches. This is different than strictly matching according to | |
887 | C<\p{Punct}>. Another way to say it is that | |
82206b5e KW |
888 | if Unicode rules are in effect, C<[[:punct:]]> matches all characters |
889 | that Unicode considers punctuation, plus all ASCII-range characters that | |
890 | Unicode considers symbols. | |
8a118206 | 891 | |
ea449505 | 892 | =item [6] |
8a118206 | 893 | |
7fa2fdc0 | 894 | C<\p{XPerlSpace}> and C<\p{Space}> match identically starting with Perl |
d28d8023 | 895 | v5.18. In earlier versions, these differed only in that in non-locale |
779cf272 | 896 | matching, C<\p{XPerlSpace}> did not match the vertical tab, C<\cK>. |
d28d8023 | 897 | Same for the two ASCII-only range forms. |
8a118206 RGS |
898 | |
899 | =back | |
900 | ||
ab6199be | 901 | There are various other synonyms that can be used besides the names |
4cb26c52 | 902 | listed in the table. For example, C<\p{XPosixAlpha}> can be written as |
ab6199be | 903 | C<\p{Alpha}>. All are listed in |
d66e1f56 | 904 | L<perluniprops/Properties accessible through \p{} and \P{}>. |
ab6199be KW |
905 | |
906 | Both the C<\p> counterparts always assume Unicode rules are in effect. | |
907 | On ASCII platforms, this means they assume that the code points from 128 | |
908 | to 255 are Latin-1, and that means that using them under locale rules is | |
909 | unwise unless the locale is guaranteed to be Latin-1 or UTF-8. In contrast, the | |
910 | POSIX character classes are useful under locale rules. They are | |
911 | affected by the actual rules in effect, as follows: | |
912 | ||
913 | =over | |
914 | ||
915 | =item If the C</a> modifier is in effect ... | |
916 | ||
917 | Each of the POSIX classes matches exactly the same as their ASCII-range | |
918 | counterparts. | |
919 | ||
920 | =item otherwise ... | |
921 | ||
922 | =over | |
923 | ||
924 | =item For code points above 255 ... | |
925 | ||
926 | The POSIX class matches the same as its Full-range counterpart. | |
927 | ||
928 | =item For code points below 256 ... | |
929 | ||
930 | =over | |
931 | ||
932 | =item if locale rules are in effect ... | |
933 | ||
a145a423 KW |
934 | The POSIX class matches according to the locale, except: |
935 | ||
936 | =over | |
937 | ||
938 | =item C<word> | |
939 | ||
940 | also includes the platform's native underscore character, no matter what | |
8129baca | 941 | the locale is. |
ab6199be | 942 | |
a145a423 KW |
943 | =item C<ascii> |
944 | ||
945 | on platforms that don't have the POSIX C<ascii> extension, this matches | |
946 | just the platform's native ASCII-range characters. | |
947 | ||
948 | =item C<blank> | |
949 | ||
950 | on platforms that don't have the POSIX C<blank> extension, this matches | |
951 | just the platform's native tab and space characters. | |
952 | ||
953 | =back | |
954 | ||
04c2d19c | 955 | =item if, instead, Unicode rules are in effect ... |
ab6199be KW |
956 | |
957 | The POSIX class matches the same as the Full-range counterpart. | |
958 | ||
959 | =item otherwise ... | |
960 | ||
961 | The POSIX class matches the same as the ASCII range counterpart. | |
962 | ||
963 | =back | |
964 | ||
965 | =back | |
966 | ||
967 | =back | |
968 | ||
969 | Which rules apply are determined as described in | |
970 | L<perlre/Which character set modifier is in effect?>. | |
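The decision list above can be exercised with two test characters, one below 256 and one above 255. The sketch assumes an ASCII platform, no C<use locale>, and that the C<unicode_strings> feature is not enabled, so that a pattern without an explicit modifier compiles under C</d> rules. The variable names are illustrative.

```perl
use strict;
use warnings;

my $e_acute = "\x{E9}";     # LATIN SMALL LETTER E WITH ACUTE (below 256)
my $g_alpha = "\x{3B1}";    # GREEK SMALL LETTER ALPHA (above 255)

my %got = (
    # /a: always the ASCII-range counterpart.
    e_acute_a => ( $e_acute =~ /[[:alpha:]]/a ? 1 : 0 ),
    g_alpha_a => ( $g_alpha =~ /[[:alpha:]]/a ? 1 : 0 ),
    # /u: Unicode rules, so the Full-range counterpart.
    e_acute_u => ( $e_acute =~ /[[:alpha:]]/u ? 1 : 0 ),
    # No modifier (/d here): below 256 without Unicode rules is ASCII-range,
    # above 255 is always Full-range.
    e_acute_d => ( $e_acute =~ /[[:alpha:]]/  ? 1 : 0 ),
    g_alpha_d => ( $g_alpha =~ /[[:alpha:]]/  ? 1 : 0 ),
);

printf "%-10s %s\n", $_, $got{$_} ? "match" : "no match" for sort keys %got;
```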
971 | ||
972 | It is proposed to change this behavior in a future release of Perl so that | |
973 | whether or not Unicode rules are in effect would not change the | |
4b9734bf | 974 | behavior: Outside of locale, the POSIX classes |
ab6199be KW |
975 | would behave like their ASCII-range counterparts. If you wish to |
976 | comment on this proposal, send email to C<perl5-porters@perl.org>. | |
cbc24f92 | 977 | |
1f59b283 | 978 | =head4 Negation of POSIX character classes |
ea449505 | 979 | X<character class, negation> |
8a118206 RGS |
980 | |
981 | A Perl extension to the POSIX character class is the ability to | |
982 | negate it. This is done by prefixing the class name with a caret (C<^>). | |
983 | Some examples: | |
984 | ||
ea449505 KW |
985 | POSIX ASCII-range Full-range backslash |
986 | Unicode Unicode sequence | |
987 | ----------------------------------------------------- | |
cbc24f92 KW |
988 | [[:^digit:]] \P{PosixDigit} \P{XPosixDigit} \D |
989 | [[:^space:]] \P{PosixSpace} \P{XPosixSpace} | |
990 | \P{PerlSpace} \P{XPerlSpace} \S | |
991 | [[:^word:]] \P{PerlWord} \P{XPosixWord} \W | |
992 | ||
765fa144 | 993 | The backslash sequence can mean either ASCII- or Full-range Unicode, |
82206b5e | 994 | depending on various factors as described in L<perlre/Which character set modifier is in effect?>. |
8a118206 RGS |
995 | |
996 | =head4 [= =] and [. .] | |
997 | ||
b6538e4f | 998 | Perl recognizes the POSIX character classes C<[=class=]> and |
82206b5e | 999 | C<[.class.]>, but does not (yet?) support them. Any attempt to use |
b6538e4f | 1000 | either construct raises an exception. |
8a118206 RGS |
1001 | |
1002 | =head4 Examples | |
1003 | ||
1004 | /[[:digit:]]/ # Matches a character that is a digit. | |
1005 | /[01[:lower:]]/ # Matches a character that is either a | |
1006 | # lowercase letter, or '0' or '1'. | |
c1c4ae3a | 1007 | /[[:digit:][:^xdigit:]]/ # Matches a character that can be anything |
bc943be5 KW |
1008 | # except the letters 'a' to 'f' and 'A' to |
1009 | # 'F'. This is because the main character | |
1010 | # class is composed of two POSIX character | |
1011 | # classes that are ORed together, one that | |
1012 | # matches any digit, and the other that | |
1013 | # matches anything that isn't a hex digit. | |
1014 | # The OR adds the digits, leaving only the | |
1015 | # letters 'a' to 'f' and 'A' to 'F' excluded. | |
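The last example above is easy to misread, so here is a small sketch that tests it against a handful of ASCII characters. The variable names are made up for the illustration.

```perl
use strict;
use warnings;

# Union of "any digit" and "anything that is not a hex digit".
my $digit_or_non_hex = qr/\A[[:digit:][:^xdigit:]]\z/;

my %got = map { $_ => ( $_ =~ $digit_or_non_hex ? 1 : 0 ) } qw(5 a F g);

printf "'%s': %s\n", $_, $got{$_} ? "match" : "no match" for sort keys %got;
```

Only the hex letters are excluded: C<"5"> is a digit, C<"g"> is not a hex digit, while C<"a"> and C<"F"> are hex digits that are not decimal digits.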
572224ce KW |
1016 | |
1017 | =head3 Extended Bracketed Character Classes | |
1018 | X<character class> | |
1019 | X<set operations> | |
1020 | ||
1021 | This is a fancy bracketed character class that can be used for more | |
1022 | readable and less error-prone classes, and to perform set operations, | |
1023 | such as intersection. An example is | |
1024 | ||
1025 | /(?[ \p{Thai} & \p{Digit} ])/ | |
1026 | ||
1027 | This will match all the digit characters that are in the Thai script. | |
1028 | ||
1029 | This is an experimental feature available starting in 5.18, and is | |
1030 | subject to change as we gain field experience with it. Any attempt to | |
1031 | use it will raise a warning, unless disabled via | |
1032 | ||
1033 | no warnings "experimental::regex_sets"; | |
1034 | ||
1035 | Comments on this feature are welcome; send email to | |
1036 | C<perl5-porters@perl.org>. | |
1037 | ||
a60b7922 KW |
1038 | The rules used by L<C<use re 'strict'>|re/'strict' mode> apply to this | |
1039 | construct. | |
1040 | ||
572224ce KW |
1041 | We can extend the example above: |
1042 | ||
1043 | /(?[ ( \p{Thai} + \p{Lao} ) & \p{Digit} ])/ | |
1044 | ||
1045 | This matches digits that are in either the Thai or Laotian scripts. | |
1046 | ||
1047 | Notice the white space in these examples. This construct always has | |
77c8f263 | 1048 | the C<E<sol>xx> modifier turned on within it. |
572224ce KW |
1049 | |
1050 | The available binary operators are: | |
1051 | ||
1052 | & intersection | |
1053 | + union | |
1054 | | another name for '+', hence means union | |
1055 | - subtraction (the result matches the set consisting of those | |
1056 | code points matched by the first operand, excluding any that | |
1057 | are also matched by the second operand) | |
1058 | ^ symmetric difference (the union minus the intersection). This | |
1059 | is like an exclusive or, in that the result is the set of code | |
1060 | points that are matched by either, but not both, of the | |
1061 | operands. | |
1062 | ||
1063 | There is one unary operator: | |
1064 | ||
1065 | ! complement | |
1066 | ||
6798c95d KW |
1067 | All the binary operators left associate; C<"&"> is higher precedence |
1068 | than the others, which all have equal precedence. The unary operator | |
1069 | right associates, and has highest precedence. Thus this follows the | |
1070 | normal Perl precedence rules for logical operators. Use parentheses to | |
1071 | override the default precedence and associativity. | |
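The operators and their precedence can be tried out on small ASCII sets. This sketch assumes Perl 5.18 or later; the warning category is disabled because the construct is documented here as experimental, and all variable names are illustrative.

```perl
use strict;
use warnings;
no warnings 'experimental::regex_sets';

my $consonant = qr/\A(?[ [a-z] - [aeiou] ])\z/;           # subtraction
my $default   = qr/\A(?[ [a-c] + [x-z] & [y] ])\z/;       # [a-c] + ([x-z] & [y])
my $grouped   = qr/\A(?[ ( [a-c] + [x-z] ) & [y] ])\z/;   # parentheses override
my $sym_diff  = qr/\A(?[ [a-d] ^ [c-f] ])\z/;             # a, b, e, f
my $not_lower = qr/\A(?[ ! [a-z] ])\z/;                   # complement

my @checks = (
    [ consonant => $consonant, 'b' ], [ consonant => $consonant, 'e' ],
    [ default   => $default,   'a' ], [ default   => $default,   'x' ],
    [ grouped   => $grouped,   'a' ], [ grouped   => $grouped,   'y' ],
    [ sym_diff  => $sym_diff,  'c' ], [ sym_diff  => $sym_diff,  'e' ],
    [ not_lower => $not_lower, 'A' ],
);

printf "%-10s '%s': %s\n", $_->[0], $_->[2],
    ( $_->[2] =~ $_->[1] ? "match" : "no match" ) for @checks;
```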
572224ce KW |
1072 | |
1073 | The main restriction is that everything is a metacharacter. Thus, | |
1074 | you cannot refer to single characters by doing something like this: | |
1075 | ||
1076 | /(?[ a + b ])/ # Syntax error! | |
1077 | ||
1078 | The easiest way to specify an individual typable character is to enclose | |
1079 | it in brackets: | |
1080 | ||
1081 | /(?[ [a] + [b] ])/ | |
1082 | ||
1083 | (This is the same thing as C<[ab]>.) You could also have said the | |
1084 | equivalent: | |
1085 | ||
1086 | /(?[[ a b ]])/ | |
1087 | ||
de36fb2e KW |
1088 | (You can, of course, specify single characters by using C<\x{...}>, | |
1089 | C<\N{...}>, etc.) | |
572224ce KW |
1090 | |
1091 | This last example shows the use of this construct to specify an ordinary | |
1092 | bracketed character class without additional set operations. Note the | |
77c8f263 KW |
1093 | white space within it. This is allowed because C<E<sol>xx> is |
1094 | automatically turned on within this construct. | |
572224ce | 1095 | |
572224ce | 1096 | All the other escapes accepted by normal bracketed character classes are |
7d4c055d KW |
1097 | accepted here as well. |
1098 | ||
1099 | Because this construct compiles under | |
1100 | L<C<use re 'strict'>|re/'strict' mode>, unrecognized escapes that | |
1101 | generate warnings in normal classes are fatal errors here, as are | |
1102 | all other warnings from these class elements, along with some | |
1103 | practices that don't currently warn outside C<re 'strict'>. For example, | |
1104 | you cannot say | |
572224ce KW |
1105 | |
1106 | /(?[ [ \xF ] ])/ # Syntax error! | |
1107 | ||
1108 | You have to have two hex digits after a braceless C<\x> (use a leading | |
1109 | zero to make two). These restrictions are to lower the incidence of | |
1110 | typos causing the class to not match what you thought it would. | |
1111 | ||
f194034a KW |
1112 | If a regular bracketed character class contains a C<\p{}> or C<\P{}> and |
1113 | is matched against a non-Unicode code point, a warning may be | |
1114 | raised, as the result is not Unicode-defined. No such warning will come | |
1115 | when using this extended form. | |
1116 | ||
572224ce KW |
1117 | The final difference between regular bracketed character classes and |
1118 | these is that it is not possible to get these to match a | |
1119 | multi-character fold. Thus, | |
1120 | ||
1121 | /(?[ [\xDF] ])/iu | |
1122 | ||
1123 | does not match the string C<ss>. | |
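A side-by-side comparison of the regular and extended forms is sketched below. It assumes Perl 5.18 or later and an ASCII platform; the variable names are illustrative.

```perl
use strict;
use warnings;
no warnings 'experimental::regex_sets';

# A regular bracketed class can match the multi-character fold "ss"
# of \xDF under /i; the extended form never does.
my $regular_ss  = ( "ss"    =~ /[\xDF]/iu        ? 1 : 0 );
my $extended_ss = ( "ss"    =~ /(?[ [\xDF] ])/iu ? 1 : 0 );

# The single character itself still matches the extended form.
my $extended_df = ( "\xDF"  =~ /(?[ [\xDF] ])/iu ? 1 : 0 );

printf "regular  vs 'ss':  %s\n", $regular_ss  ? "match" : "no match";
printf "extended vs 'ss':  %s\n", $extended_ss ? "match" : "no match";
printf "extended vs \\xDF: %s\n", $extended_df ? "match" : "no match";
```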
1124 | ||
1125 | You don't have to enclose POSIX class names inside double brackets, | |
1126 | hence both of the following work: | |
1127 | ||
1128 | /(?[ [:word:] - [:lower:] ])/ | |
1129 | /(?[ [[:word:]] - [[:lower:]] ])/ | |
1130 | ||
1131 | Any contained POSIX character classes, including things like C<\w> and C<\D>, | |
1132 | respect the C<E<sol>a> (and C<E<sol>aa>) modifiers. | |
1133 | ||
19a498a4 YO |
1134 | Note that C<< (?[ ]) >> is a regex-compile-time construct. Any attempt |
1135 | to use something which isn't knowable at the time the containing regular | |
572224ce | 1136 | expression is compiled is a fatal error. In practice, this means |
11a9b3e0 | 1137 | just three limitations: |
572224ce KW |
1138 | |
1139 | =over 4 | |
1140 | ||
1141 | =item 1 | |
1142 | ||
a0bd1a30 KW |
1143 | When compiled within the scope of C<use locale> (or the C<E<sol>l> regex |
1144 | modifier), this construct assumes that the execution-time locale will be | |
1145 | a UTF-8 one, and the generated pattern always uses Unicode rules. What | |
1146 | gets matched or not thus isn't dependent on the actual runtime locale, so | |
1147 | tainting is not enabled. But a C<locale> category warning is raised | |
1148 | if the runtime locale turns out to not be UTF-8. | |
572224ce KW |
1149 | |
1150 | =item 2 | |
1151 | ||
1152 | Any | |
1153 | L<user-defined property|perlunicode/"User-Defined Character Properties"> | |
1154 | used must be already defined by the time the regular expression is | |
1155 | compiled (but note that this construct can be used instead of such | |
1156 | properties). | |
1157 | ||
1158 | =item 3 | |
1159 | ||
1160 | A regular expression that otherwise would compile | |
1161 | using C<E<sol>d> rules, and which uses this construct will instead | |
1162 | use C<E<sol>u>. Thus this construct tells Perl that you don't want | |
1163 | C<E<sol>d> rules for the entire regular expression containing it. | |
1164 | ||
1165 | =back | |
1166 | ||
572224ce KW |
1167 | Note that skipping white space applies only to the interior of this |
1168 | construct. There must not be any space between any of the characters | |
1169 | that form the initial C<(?[>. Nor may there be space between the | |
1170 | closing C<])> characters. | |
1171 | ||
11a9b3e0 | 1172 | Just as in all regular expressions, the pattern can be built up by |
572224ce KW |
1173 | including variables that are interpolated at regex compilation time. |
1174 | Care must be taken to ensure that you are getting what you expect. For | |
1175 | example: | |
1176 | ||
1177 | my $thai_or_lao = '\p{Thai} + \p{Lao}'; | |
1178 | ... | |
1179 | qr/(?[ \p{Digit} & $thai_or_lao ])/; | |
1180 | ||
1181 | compiles to | |
1182 | ||
1183 | qr/(?[ \p{Digit} & \p{Thai} + \p{Lao} ])/; | |
1184 | ||
1185 | But this does not have the effect that someone reading the code would | |
1186 | likely expect, as the intersection applies just to C<\p{Thai}>, | |
1187 | excluding the Laotian. Pitfalls like this can be avoided by | |
1188 | parenthesizing the component pieces: | |
1189 | ||
1190 | my $thai_or_lao = '( \p{Thai} + \p{Lao} )'; | |
1191 | ||
1192 | But any modifiers will still apply to all the components: | |
1193 | ||
1194 | my $lower = '\p{Lower} + \p{Digit}'; | |
1195 | qr/(?[ \p{Greek} & $lower ])/i; | |
1196 | ||
1197 | matches upper case things. You can avoid surprises by making the | |
1198 | components into instances of this construct by compiling them: | |
1199 | ||
1200 | my $thai_or_lao = qr/(?[ \p{Thai} + \p{Lao} ])/; | |
1201 | my $lower = qr/(?[ \p{Lower} + \p{Digit} ])/; | |
1202 | ||
1203 | When these are embedded in another pattern, what they match does not | |
1204 | change, regardless of parenthesization or what modifiers are in effect | |
1205 | in that outer pattern. | |
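The precedence pitfall with interpolated pieces can be reproduced with plain ASCII sets instead of scripts. This is a sketch assuming Perl 5.18 or later; C<$loose> and C<$tight> are hypothetical names standing in for C<$thai_or_lao> with and without parentheses.

```perl
use strict;
use warnings;
no warnings 'experimental::regex_sets';

my $loose = '[a-c] + [x-z]';        # no parentheses
my $tight = '( [a-c] + [x-z] )';    # parenthesized

# Interpolates to (?[ [b-y] & [a-c] + [x-z] ]), that is
# ( [b-y] & [a-c] ) + [x-z], because & binds more tightly than +.
my $re_loose = qr/\A(?[ [b-y] & $loose ])\z/;

# Interpolates to (?[ [b-y] & ( [a-c] + [x-z] ) ]).
my $re_tight = qr/\A(?[ [b-y] & $tight ])\z/;

for my $char (qw(a b x z)) {
    printf "'%s'  loose: %-8s  tight: %s\n", $char,
        ( $char =~ $re_loose ? "match" : "no match" ),
        ( $char =~ $re_tight ? "match" : "no match" );
}
```

With the unparenthesized piece, C<"z"> matches even though it lies outside C<[b-y]>; with parentheses it does not.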
1206 | ||
1207 | Due to the way that Perl parses things, your parentheses and brackets | |
1208 | may need to be balanced, even including comments. If you run into any | |
1209 | examples, please send them to C<perlbug@perl.org>, so that we can have a | |
1210 | concrete example for this man page. | |
1211 | ||
1212 | We may change it so that things that remain legal uses in normal bracketed | |
1213 | character classes might become illegal within this experimental | |
1214 | construct. One proposal, for example, is to forbid adjacent uses of the | |
1215 | same character, as in C<(?[ [aa] ])>. The motivation for such a change | |
1216 | is that this usage is likely a typo, as the second "a" adds nothing. |