| 1 | =head1 NAME |
| 2 | X<character class> |
| 3 | |
| 4 | perlrecharclass - Perl Regular Expression Character Classes |
| 5 | |
| 6 | =head1 DESCRIPTION |
| 7 | |
| 8 | The top level documentation about Perl regular expressions |
| 9 | is found in L<perlre>. |
| 10 | |
| 11 | This manual page discusses the syntax and use of character |
| 12 | classes in Perl regular expressions. |
| 13 | |
| 14 | A character class is a way of denoting a set of characters |
| 15 | in such a way that one character of the set is matched. |
| 16 | It's important to remember that matching a character class |
| 17 | consumes exactly one character in the source string. (The source |
| 18 | string is the string the regular expression is matched against.) |
| 19 | |
| 20 | There are three types of character classes in Perl regular |
| 21 | expressions: the dot, backslash sequences, and the form enclosed in square |
| 22 | brackets. Keep in mind, though, that often the term "character class" is used |
| 23 | to mean just the bracketed form. Certainly, most Perl documentation does that. |
| 24 | |
| 25 | =head2 The dot |
| 26 | |
| 27 | The dot (or period), C<.>, is probably the most used, and certainly |
| 28 | the most well-known character class. By default, a dot matches any |
| 29 | character, except for the newline. That default can be changed to |
| 30 | add matching the newline by using the I<single line> modifier: either |
| 31 | for the entire regular expression with the C</s> modifier, or |
| 32 | locally with C<(?s)>. (The C<L</\N>> backslash sequence, described |
| 33 | below, matches any character except newline without regard to the |
| 34 | I<single line> modifier.) |
| 35 | |
| 36 | Here are some examples: |
| 37 | |
| 38 | "a" =~ /./ # Match |
| 39 | "." =~ /./ # Match |
| 40 | "" =~ /./ # No match (dot has to match a character) |
| 41 | "\n" =~ /./ # No match (dot does not match a newline) |
| 42 | "\n" =~ /./s # Match (global 'single line' modifier) |
| 43 | "\n" =~ /(?s:.)/ # Match (local 'single line' modifier) |
| 44 | "ab" =~ /^.$/ # No match (dot matches one character) |
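
The same checks can be run as a small script. This is only a sketch: the helper C<matches> and the hash keys are made-up names for illustration, and C<\N> needs Perl v5.12 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether a compiled pattern matches a string.
sub matches {
    my ($string, $regex) = @_;
    return $string =~ $regex ? 1 : 0;
}

my %result = (
    dot_plain   => matches("\n", qr/./),       # default: dot skips newline
    dot_s_flag  => matches("\n", qr/./s),      # /s lets dot match newline
    dot_s_local => matches("\n", qr/(?s:.)/),  # same effect, scoped to the group
    not_newline => matches("\n", qr/\N/s),     # \N ignores /s
    two_chars   => matches("ab", qr/^.$/),     # dot consumes one character
);

printf "%-12s %s\n", $_, $result{$_} ? "match" : "no match"
    for sort keys %result;
```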
| 45 | |
| 46 | =head2 Backslash sequences |
| 47 | X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\p> X<\P> |
| 48 | X<\N> X<\v> X<\V> X<\h> X<\H> |
| 49 | X<word> X<whitespace> |
| 50 | |
| 51 | A backslash sequence is a sequence of characters, the first one of which is a |
| 52 | backslash. Perl ascribes special meaning to many such sequences, and some of |
| 53 | these are character classes. That is, they match a single character each, |
| 54 | provided that the character belongs to the specific set of characters defined |
| 55 | by the sequence. |
| 56 | |
| 57 | Here's a list of the backslash sequences that are character classes. They |
| 58 | are discussed in more detail below. (For the backslash sequences that aren't |
| 59 | character classes, see L<perlrebackslash>.) |
| 60 | |
| 61 | \d Match a decimal digit character. |
| 62 | \D Match a non-decimal-digit character. |
| 63 | \w Match a "word" character. |
| 64 | \W Match a non-"word" character. |
| 65 | \s Match a whitespace character. |
| 66 | \S Match a non-whitespace character. |
| 67 | \h Match a horizontal whitespace character. |
| 68 | \H Match a character that isn't horizontal whitespace. |
| 69 | \v Match a vertical whitespace character. |
| 70 | \V Match a character that isn't vertical whitespace. |
| 71 | \N Match a character that isn't a newline. |
| 72 | \pP, \p{Prop} Match a character that has the given Unicode property. |
| 73 | \PP, \P{Prop} Match a character that doesn't have the Unicode property. |
| 74 | |
| 75 | =head3 \N |
| 76 | |
| 77 | C<\N>, available starting in v5.12, like the dot, matches any |
| 78 | character that is not a newline. The difference is that C<\N> is not influenced |
| 79 | by the I<single line> regular expression modifier (see L</The dot> above). Note |
| 80 | that the form C<\N{...}> may mean something completely different. When the |
| 81 | C<{...}> is a L<quantifier|perlre/Quantifiers>, it means to match a non-newline |
| 82 | character that many times. For example, C<\N{3}> means to match 3 |
| 83 | non-newlines; C<\N{5,}> means to match 5 or more non-newlines. But if C<{...}> |
| 84 | is not a legal quantifier, it is presumed to be a named character. See |
| 85 | L<charnames> for those. For example, none of C<\N{COLON}>, C<\N{4F}>, and |
| 86 | C<\N{F4}> contain legal quantifiers, so Perl will try to find characters whose |
| 87 | names are respectively C<COLON>, C<4F>, and C<F4>. |
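
A short sketch of the two readings of C<\N{...}> follows. The variable names are illustrative only, and C<use charnames ':full'> is loaded explicitly as a precaution for older Perls.

```perl
use strict;
use warnings;
use charnames ':full';   # explicit load of character names

# {3} is a legal quantifier, so this means "three non-newline characters".
my ($three) = "abc\ndef" =~ /^(\N{3})/;

# {5,} is also a quantifier; only four non-newlines precede the "\n" here.
my $five_or_more = "abcd\n" =~ /^\N{5,}/ ? 1 : 0;

# COLON is not a quantifier, so it is looked up as a character name.
my $colon = "a:b" =~ /^a\N{COLON}b$/ ? 1 : 0;

print "three=$three five_or_more=$five_or_more colon=$colon\n";
```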
| 88 | |
| 89 | =head3 Digits |
| 90 | |
| 91 | C<\d> matches a single character considered to be a decimal I<digit>. |
| 92 | If the C</a> regular expression modifier is in effect, it matches [0-9]. |
| 93 | Otherwise, it matches anything that is matched by C<\p{Digit}>, which |
| 94 | includes [0-9]. |
| 95 | (An unlikely possible exception is that under locale matching rules, the |
| 96 | current locale might not have C<[0-9]> matched by C<\d>, and/or might match |
| 97 | other characters whose code point is less than 256. The only such locale |
| 98 | definitions that are legal would be to match C<[0-9]> plus another set of |
| 99 | 10 consecutive digit characters; anything else would be in violation of |
| 100 | the C language standard, but Perl doesn't currently assume anything in |
| 101 | regard to this.) |
| 102 | |
| 103 | What this means is that unless the C</a> modifier is in effect, C<\d> not |
| 104 | only matches the digits '0' - '9', but also Arabic, Devanagari, and |
| 105 | digits from other languages. This may cause some confusion, and some |
| 106 | security issues. |
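
To make the difference concrete, here is a minimal sketch comparing C<\d> with and without C</a> (which needs Perl v5.14 or later). The sample characters and variable names are chosen only for illustration.

```perl
use strict;
use warnings;

# "4", ARABIC-INDIC DIGIT FOUR, DEVANAGARI DIGIT ONE
my @samples = ("4", "\x{0664}", "\x{0967}");

# Without /a, \d follows \p{Digit}; with /a it is limited to [0-9].
my @default_rules = map { /^\d$/  ? 1 : 0 } @samples;
my @ascii_rules   = map { /^\d$/a ? 1 : 0 } @samples;

print "default: @default_rules\n";   # default: 1 1 1
print "with /a: @ascii_rules\n";     # with /a: 1 0 0
```

An application that expects only ASCII digits may therefore prefer C</a> (or an explicit C<[0-9]>) when validating input.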
| 107 | |
| 108 | Some digits that C<\d> matches look like some of the [0-9] ones, but |
| 109 | have different values. For example, BENGALI DIGIT FOUR (U+09EA) looks |
| 110 | very much like an ASCII DIGIT EIGHT (U+0038). An application that |
| 111 | is expecting only the ASCII digits might be misled, or if the match is |
| 112 | C<\d+>, the matched string might contain a mixture of digits from |
| 113 | different writing systems that look like they signify a number different |
| 114 | than they actually do. L<Unicode::UCD/num()> can be used to safely |
| 115 | calculate the value, returning C<undef> if the input string contains |
| 116 | such a mixture. |
| 118 | |
| 119 | What C<\p{Digit}> means (and hence C<\d> except under the C</a> |
| 120 | modifier) is C<\p{General_Category=Decimal_Number}>, or synonymously, |
| 121 | C<\p{General_Category=Digit}>. Starting with Unicode version 4.1, this |
| 122 | is the same set of characters matched by C<\p{Numeric_Type=Decimal}>. |
| 123 | But Unicode also has a different property with a similar name, |
| 124 | C<\p{Numeric_Type=Digit}>, which matches a completely different set of |
| 125 | characters. These characters are things such as C<CIRCLED DIGIT ONE> |
| 126 | or subscripts, or are from writing systems that lack all ten digits. |
| 127 | |
| 128 | The design intent is for C<\d> to exactly match the set of characters |
| 129 | that can safely be used with "normal" big-endian positional decimal |
| 130 | syntax, where, for example, 123 means one 'hundred', plus two 'tens', |
| 131 | plus three 'ones'. This positional notation does not necessarily apply |
| 132 | to characters that match the other type of "digit", |
| 133 | C<\p{Numeric_Type=Digit}>, and so C<\d> doesn't match them. |
| 134 | |
| 135 | The Tamil digits (U+0BE6 - U+0BEF) can also legally be |
| 136 | used in old-style Tamil numbers in which they would appear no more than |
| 137 | one in a row, separated by characters that mean "times 10", "times 100", |
| 138 | etc. (See L<http://www.unicode.org/notes/tn21>.) |
| 139 | |
| 140 | Any character not matched by C<\d> is matched by C<\D>. |
| 141 | |
| 142 | =head3 Word characters |
| 143 | |
| 144 | A C<\w> matches a single alphanumeric character (an alphabetic character, or a |
| 145 | decimal digit); or a connecting punctuation character, such as an |
| 146 | underscore ("_"); or a "mark" character (like some sort of accent) that |
| 147 | attaches to one of those. It does not match a whole word. To match a |
| 148 | whole word, use C<\w+>. This isn't the same thing as matching an |
| 149 | English word, but in the ASCII range it is the same as a string of |
| 150 | Perl-identifier characters. |
| 151 | |
| 152 | =over |
| 153 | |
| 154 | =item If the C</a> modifier is in effect ... |
| 155 | |
| 156 | C<\w> matches the 63 characters [a-zA-Z0-9_]. |
| 157 | |
| 158 | =item otherwise ... |
| 159 | |
| 160 | =over |
| 161 | |
| 162 | =item For code points above 255 ... |
| 163 | |
| 164 | C<\w> matches the same as C<\p{Word}> matches in this range. That is, |
| 165 | it matches Thai letters, Greek letters, etc. This includes connector |
| 166 | punctuation (like the underscore) which connect two words together, or |
| 167 | diacritics, such as a C<COMBINING TILDE> and the modifier letters, which |
| 168 | are generally used to add auxiliary markings to letters. |
| 169 | |
| 170 | =item For code points below 256 ... |
| 171 | |
| 172 | =over |
| 173 | |
| 174 | =item if locale rules are in effect ... |
| 175 | |
| 176 | C<\w> matches the platform's native underscore character plus whatever |
| 177 | the locale considers to be alphanumeric. |
| 178 | |
| 179 | =item if Unicode rules are in effect ... |
| 180 | |
| 181 | C<\w> matches exactly what C<\p{Word}> matches. |
| 182 | |
| 183 | =item otherwise ... |
| 184 | |
| 185 | C<\w> matches [a-zA-Z0-9_]. |
| 186 | |
| 187 | =back |
| 188 | |
| 189 | =back |
| 190 | |
| 191 | =back |
| 192 | |
| 193 | Which rules apply is determined as described in L<perlre/Which character set modifier is in effect?>. |
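
The cases above can be sketched as follows. This assumes an ASCII platform, no locale, and no C<use feature 'unicode_strings'> (so no C<use v5.12> or later either); the hash keys are illustrative names.

```perl
use strict;
use warnings;
# Deliberately no "use feature 'unicode_strings'": an unmodified pattern
# then applies the legacy rules to a byte string.

my $e_acute = "\xE9";       # LATIN SMALL LETTER E WITH ACUTE, below 256
my $alpha   = "\x{03B1}";   # GREEK SMALL LETTER ALPHA, above 255

my %w = (
    low_default  => ($e_acute =~ /^\w$/  ? 1 : 0),  # "otherwise" case: [a-zA-Z0-9_]
    low_unicode  => ($e_acute =~ /^\w$/u ? 1 : 0),  # Unicode rules: \p{Word}
    low_ascii    => ($e_acute =~ /^\w$/a ? 1 : 0),  # /a: ASCII only
    high_default => ($alpha   =~ /^\w$/  ? 1 : 0),  # above 255: \p{Word}
    high_ascii   => ($alpha   =~ /^\w$/a ? 1 : 0),  # /a: ASCII only
);

print "$_=$w{$_}\n" for sort keys %w;
```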
| 194 | |
| 195 | There are a number of security issues with the full Unicode list of word |
| 196 | characters. See L<http://unicode.org/reports/tr36>. |
| 197 | |
| 198 | Also, for a somewhat finer-grained set of characters that are in programming |
| 199 | language identifiers beyond the ASCII range, you may wish to instead use the |
| 200 | more customized L</Unicode Properties>, C<\p{ID_Start}>, |
| 201 | C<\p{ID_Continue}>, C<\p{XID_Start}>, and C<\p{XID_Continue}>. See |
| 202 | L<http://unicode.org/reports/tr31>. |
| 203 | |
| 204 | Any character not matched by C<\w> is matched by C<\W>. |
| 205 | |
| 206 | =head3 Whitespace |
| 207 | |
| 208 | C<\s> matches any single character considered whitespace. |
| 209 | |
| 210 | =over |
| 211 | |
| 212 | =item If the C</a> modifier is in effect ... |
| 213 | |
| 214 | In all Perl versions, C<\s> matches the 5 characters [\t\n\f\r ]; that |
| 215 | is, the horizontal tab, |
| 216 | the newline, the form feed, the carriage return, and the space. |
| 217 | Starting in Perl v5.18, experimentally, it also matches the vertical tab, C<\cK>. |
| 218 | See note C<[1]> below for a discussion of this. |
| 219 | |
| 220 | =item otherwise ... |
| 221 | |
| 222 | =over |
| 223 | |
| 224 | =item For code points above 255 ... |
| 225 | |
| 226 | C<\s> matches exactly the code points above 255 shown with an "s" column |
| 227 | in the table below. |
| 228 | |
| 229 | =item For code points below 256 ... |
| 230 | |
| 231 | =over |
| 232 | |
| 233 | =item if locale rules are in effect ... |
| 234 | |
| 235 | C<\s> matches whatever the locale considers to be whitespace. |
| 236 | |
| 237 | =item if Unicode rules are in effect ... |
| 238 | |
| 239 | C<\s> matches exactly the characters shown with an "s" column in the |
| 240 | table below. |
| 241 | |
| 242 | =item otherwise ... |
| 243 | |
| 244 | C<\s> matches [\t\n\f\r ] and, starting experimentally in Perl |
| 245 | v5.18, the vertical tab, C<\cK>. |
| 246 | (See note C<[1]> below for a discussion of this.) |
| 247 | Note that this list doesn't include the non-breaking space. |
| 248 | |
| 249 | =back |
| 250 | |
| 251 | =back |
| 252 | |
| 253 | =back |
| 254 | |
| 255 | Which rules apply is determined as described in L<perlre/Which character set modifier is in effect?>. |
| 256 | |
| 257 | Any character not matched by C<\s> is matched by C<\S>. |
| 258 | |
| 259 | C<\h> matches any character considered horizontal whitespace; |
| 260 | this includes the platform's space and tab characters and several others |
| 261 | listed in the table below. C<\H> matches any character |
| 262 | not considered horizontal whitespace. They use the platform's native |
| 263 | character set, and do not consider any locale that may otherwise be in |
| 264 | use. |
| 265 | |
| 266 | C<\v> matches any character considered vertical whitespace; |
| 267 | this includes the platform's carriage return and line feed characters (newline) |
| 268 | plus several other characters, all listed in the table below. |
| 269 | C<\V> matches any character not considered vertical whitespace. |
| 270 | They use the platform's native character set, and do not consider any |
| 271 | locale that may otherwise be in use. |
| 272 | |
| 273 | C<\R> matches anything that can be considered a newline under Unicode |
| 274 | rules. It's not a character class, as it can match a multi-character |
| 275 | sequence. Therefore, it cannot be used inside a bracketed character |
| 276 | class; use C<\v> instead (vertical whitespace). It uses the platform's |
| 277 | native character set, and does not consider any locale that may |
| 278 | otherwise be in use. |
| 279 | Details are discussed in L<perlrebackslash>. |
| 280 | |
| 281 | Note that unlike C<\s> (and C<\d> and C<\w>), C<\h> and C<\v> always match |
| 282 | the same characters, without regard to other factors, such as the active |
| 283 | locale or whether the source string is in UTF-8 format. |
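
A minimal sketch of that contrast, using NO-BREAK SPACE as the test character. It assumes an ASCII platform, no locale, and no C<use feature 'unicode_strings'>; the hash keys are illustrative names.

```perl
use strict;
use warnings;

my $nbsp = "\xA0";       # NO-BREAK SPACE, held as a byte string
my $em   = "\x{2003}";   # EM SPACE
my $ls   = "\x{2028}";   # LINE SEPARATOR

my %ws = (
    nbsp_h   => ($nbsp =~ /\h/  ? 1 : 0),  # \h ignores the rules in effect
    nbsp_s   => ($nbsp =~ /\s/  ? 1 : 0),  # legacy rules: not whitespace
    nbsp_s_u => ($nbsp =~ /\s/u ? 1 : 0),  # Unicode rules: whitespace
    em_h     => ($em   =~ /\h/  ? 1 : 0),  # horizontal
    em_v     => ($em   =~ /\v/  ? 1 : 0),  # not vertical
    ls_v     => ($ls   =~ /\v/  ? 1 : 0),  # vertical
    ls_h     => ($ls   =~ /\h/  ? 1 : 0),  # not horizontal
);

print "$_=$ws{$_}\n" for sort keys %ws;
```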
| 284 | |
| 285 | One might think that C<\s> is equivalent to C<[\h\v]>. This is indeed true |
| 286 | starting in Perl v5.18, but prior to that, the sole difference was that the |
| 287 | vertical tab (C<"\cK">) was not matched by C<\s>. |
| 288 | |
| 289 | The following table is a complete listing of characters matched by |
| 290 | C<\s>, C<\h> and C<\v> as of Unicode 6.3. |
| 291 | |
| 292 | The first column gives the Unicode code point of the character (in hex format), |
| 293 | the second column gives the (Unicode) name. The third column indicates |
| 294 | by which class(es) the character is matched (assuming no locale is in |
| 295 | effect that changes the C<\s> matching). |
| 296 | |
| 297 | 0x0009 CHARACTER TABULATION h s |
| 298 | 0x000a LINE FEED (LF) vs |
| 299 | 0x000b LINE TABULATION vs [1] |
| 300 | 0x000c FORM FEED (FF) vs |
| 301 | 0x000d CARRIAGE RETURN (CR) vs |
| 302 | 0x0020 SPACE h s |
| 303 | 0x0085 NEXT LINE (NEL) vs [2] |
| 304 | 0x00a0 NO-BREAK SPACE h s [2] |
| 305 | 0x1680 OGHAM SPACE MARK h s |
| 306 | 0x2000 EN QUAD h s |
| 307 | 0x2001 EM QUAD h s |
| 308 | 0x2002 EN SPACE h s |
| 309 | 0x2003 EM SPACE h s |
| 310 | 0x2004 THREE-PER-EM SPACE h s |
| 311 | 0x2005 FOUR-PER-EM SPACE h s |
| 312 | 0x2006 SIX-PER-EM SPACE h s |
| 313 | 0x2007 FIGURE SPACE h s |
| 314 | 0x2008 PUNCTUATION SPACE h s |
| 315 | 0x2009 THIN SPACE h s |
| 316 | 0x200a HAIR SPACE h s |
| 317 | 0x2028 LINE SEPARATOR vs |
| 318 | 0x2029 PARAGRAPH SEPARATOR vs |
| 319 | 0x202f NARROW NO-BREAK SPACE h s |
| 320 | 0x205f MEDIUM MATHEMATICAL SPACE h s |
| 321 | 0x3000 IDEOGRAPHIC SPACE h s |
| 322 | |
| 323 | =over 4 |
| 324 | |
| 325 | =item [1] |
| 326 | |
| 327 | Prior to Perl v5.18, C<\s> did not match the vertical tab. The change |
| 328 | in v5.18 is considered an experiment, which means it could be backed out |
| 329 | in v5.20 or v5.22 if experience indicates that it breaks too much |
| 330 | existing code. If this change adversely affects you, send email to |
| 331 | C<perlbug@perl.org>; if it affects you positively, email |
| 332 | C<perlthanks@perl.org>. In the meantime, C<[^\S\cK]> (obscurely) |
| 333 | matches what C<\s> traditionally did. |
| 334 | |
| 335 | =item [2] |
| 336 | |
| 337 | NEXT LINE and NO-BREAK SPACE may or may not match C<\s> depending |
| 338 | on the rules in effect. See |
| 339 | L<the beginning of this section|/Whitespace>. |
| 340 | |
| 341 | =back |
| 342 | |
| 343 | =head3 Unicode Properties |
| 344 | |
| 345 | C<\pP> and C<\p{Prop}> are character classes to match characters that fit given |
| 346 | Unicode properties. One letter property names can be used in the C<\pP> form, |
| 347 | with the property name following the C<\p>; otherwise, braces are required. |
| 348 | When using braces, there is a single form, which is just the property name |
| 349 | enclosed in the braces, and a compound form which looks like C<\p{name=value}>, |
| 350 | which means to match if the property "name" for the character has that particular |
| 351 | "value". |
| 352 | For instance, a match for a number can be written as C</\pN/> or as |
| 353 | C</\p{Number}/>, or as C</\p{Number=True}/>. |
| 354 | Lowercase letters are matched by the property I<Lowercase_Letter> which |
| 355 | has the short form I<Ll>. They need the braces, so are written as C</\p{Ll}/> or |
| 356 | C</\p{Lowercase_Letter}/>, or C</\p{General_Category=Lowercase_Letter}/> |
| 357 | (the underscores are optional). |
| 358 | C</\pLl/> is valid, but means something different. |
| 359 | It matches a two character string: a letter (Unicode property C<\pL>), |
| 360 | followed by a lowercase C<l>. |
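
The difference between the braced and unbraced forms can be checked with a short sketch like this one (the hash keys are illustrative names):

```perl
use strict;
use warnings;

my %p = (
    digit_short  => ("7"  =~ /^\pN$/        ? 1 : 0),  # one-letter form
    digit_long   => ("7"  =~ /^\p{Number}$/ ? 1 : 0),  # braced long name
    lower_braces => ("a"  =~ /^\p{Ll}$/     ? 1 : 0),  # lowercase letter
    upper_braces => ("A"  =~ /^\p{Ll}$/     ? 1 : 0),  # not lowercase
    two_chars    => ("Al" =~ /^\pLl$/       ? 1 : 0),  # \pL, then a literal "l"
    one_char     => ("a"  =~ /^\pLl$/       ? 1 : 0),  # too short for \pLl
);

print "$_=$p{$_}\n" for sort keys %p;
```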
| 361 | |
| 362 | If locale rules are not in effect, the use of |
| 363 | a Unicode property will force the regular expression into using Unicode |
| 364 | rules, if it isn't already. |
| 365 | |
| 366 | Note that almost all properties are immune to case-insensitive matching. |
| 367 | That is, adding a C</i> regular expression modifier does not change what |
| 368 | they match. There are two sets that are affected. The first set is |
| 369 | C<Uppercase_Letter>, |
| 370 | C<Lowercase_Letter>, |
| 371 | and C<Titlecase_Letter>, |
| 372 | all of which match C<Cased_Letter> under C</i> matching. |
| 373 | The second set is |
| 374 | C<Uppercase>, |
| 375 | C<Lowercase>, |
| 376 | and C<Titlecase>, |
| 377 | all of which match C<Cased> under C</i> matching. |
| 378 | (The difference between these sets is that some things, such as Roman |
| 379 | numerals, come in both upper and lower case, so they are C<Cased>, but |
| 380 | aren't considered to be letters, so they aren't C<Cased_Letter>s. They're |
| 381 | actually C<Letter_Number>s.) |
| 382 | This set also includes its subsets C<PosixUpper> and C<PosixLower>, both |
| 383 | of which under C</i> match C<PosixAlpha>. |
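
As a rough illustration of the first set, the following sketch shows C<\p{Lu}> and C<\p{Ll}> widening to cased letters under C</i>. The hash keys are illustrative names.

```perl
use strict;
use warnings;

my %ci = (
    lu_plain => ("a" =~ /^\p{Lu}$/  ? 1 : 0),  # "a" is not uppercase
    lu_i     => ("a" =~ /^\p{Lu}$/i ? 1 : 0),  # under /i, any cased letter
    ll_i     => ("A" =~ /^\p{Ll}$/i ? 1 : 0),  # likewise for Lowercase_Letter
    digit_i  => ("7" =~ /^\p{Lu}$/i ? 1 : 0),  # a digit is still not a letter
);

print "$_=$ci{$_}\n" for sort keys %ci;
```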
| 384 | |
| 385 | For more details on Unicode properties, see L<perlunicode/Unicode |
| 386 | Character Properties>; for a |
| 387 | complete list of possible properties, see |
| 388 | L<perluniprops/Properties accessible through \p{} and \P{}>, |
| 389 | which notes all forms that have C</i> differences. |
| 390 | It is also possible to define your own properties. This is discussed in |
| 391 | L<perlunicode/User-Defined Character Properties>. |
| 392 | |
| 393 | Unicode properties are defined (surprise!) only on Unicode code points. |
| 394 | Starting in v5.20, when matching against C<\p> and C<\P>, Perl treats |
| 395 | non-Unicode code points (those above the legal Unicode maximum of |
| 396 | 0x10FFFF) as if they were typical unassigned Unicode code points. |
| 397 | |
| 398 | Prior to v5.20, Perl raised a warning and made all matches fail on |
| 399 | non-Unicode code points. This could be somewhat surprising: |
| 400 | |
| 401 | chr(0x110000) =~ /\p{ASCII_Hex_Digit=True}/ # Fails on Perls < v5.20. |
| 402 | chr(0x110000) =~ /\p{ASCII_Hex_Digit=False}/ # Also fails on Perls |
| 403 | # < v5.20 |
| 404 | |
| 405 | Even though these two matches might be thought of as complements, until |
| 406 | v5.20 they were so only on Unicode code points. |
| 407 | |
| 408 | =head4 Examples |
| 409 | |
| 410 | "a" =~ /\w/ # Match, "a" is a 'word' character. |
| 411 | "7" =~ /\w/ # Match, "7" is a 'word' character as well. |
| 412 | "a" =~ /\d/ # No match, "a" isn't a digit. |
| 413 | "7" =~ /\d/ # Match, "7" is a digit. |
| 414 | " " =~ /\s/ # Match, a space is whitespace. |
| 415 | "a" =~ /\D/ # Match, "a" is a non-digit. |
| 416 | "7" =~ /\D/ # No match, "7" is not a non-digit. |
| 417 | " " =~ /\S/ # No match, a space is not non-whitespace. |
| 418 | |
| 419 | " " =~ /\h/ # Match, space is horizontal whitespace. |
| 420 | " " =~ /\v/ # No match, space is not vertical whitespace. |
| 421 | "\r" =~ /\v/ # Match, a return is vertical whitespace. |
| 422 | |
| 423 | "a" =~ /\pL/ # Match, "a" is a letter. |
| 424 | "a" =~ /\p{Lu}/ # No match, /\p{Lu}/ matches upper case letters. |
| 425 | |
| 426 | "\x{0e0b}" =~ /\p{Thai}/ # Match, \x{0e0b} is the character |
| 427 | # 'THAI CHARACTER SO SO', and that's in |
| 428 | # the Thai Unicode class. |
| 429 | "a" =~ /\P{Lao}/ # Match, as "a" is not a Laotian character. |
| 430 | |
| 431 | It is worth emphasizing that C<\d>, C<\w>, etc., match single characters, not |
| 432 | complete numbers or words. To match a number (that consists of digits), |
| 433 | use C<\d+>; to match a word, use C<\w+>. But be aware of the security |
| 434 | considerations in doing so, as mentioned above. |
| 435 | |
| 436 | =head2 Bracketed Character Classes |
| 437 | |
| 438 | The third form of character class you can use in Perl regular expressions |
| 439 | is the bracketed character class. In its simplest form, it lists the characters |
| 440 | that may be matched, surrounded by square brackets, like this: C<[aeiou]>. |
| 441 | This matches one of C<a>, C<e>, C<i>, C<o> or C<u>. Like the other |
| 442 | character classes, exactly one character is matched.* To match |
| 443 | a longer string consisting of characters mentioned in the character |
| 444 | class, follow the character class with a L<quantifier|perlre/Quantifiers>. For |
| 445 | instance, C<[aeiou]+> matches one or more lowercase English vowels. |
| 446 | |
| 447 | Repeating a character in a character class has no |
| 448 | effect; it's considered to be in the set only once. |
| 449 | |
| 450 | Examples: |
| 451 | |
| 452 | "e" =~ /[aeiou]/ # Match, as "e" is listed in the class. |
| 453 | "p" =~ /[aeiou]/ # No match, "p" is not listed in the class. |
| 454 | "ae" =~ /^[aeiou]$/ # No match, a character class only matches |
| 455 | # a single character. |
| 456 | "ae" =~ /^[aeiou]+$/ # Match, due to the quantifier. |
| 457 | |
| 458 | ------- |
| 459 | |
| 460 | * There is an exception to a bracketed character class matching a |
| 461 | single character only. When the class is to match caselessly under C</i> |
| 462 | matching rules, and a character that is explicitly mentioned inside the |
| 463 | class matches a |
| 464 | multiple-character sequence caselessly under Unicode rules, the class |
| 465 | (when not L<inverted|/Negation>) will also match that sequence. For |
| 466 | example, Unicode says that the letter C<LATIN SMALL LETTER SHARP S> |
| 467 | should match the sequence C<ss> under C</i> rules. Thus, |
| 468 | |
| 469 | 'ss' =~ /\A\N{LATIN SMALL LETTER SHARP S}\z/i # Matches |
| 470 | 'ss' =~ /\A[aeioust\N{LATIN SMALL LETTER SHARP S}]\z/i # Matches |
| 471 | |
| 472 | For this to happen, the character must be explicitly specified, and not |
| 473 | be part of a multi-character range (not even as one of its endpoints). |
| 474 | (L</Character Ranges> will be explained shortly.) Therefore, |
| 475 | |
| 476 | 'ss' =~ /\A[\0-\x{ff}]\z/i # Doesn't match |
| 477 | 'ss' =~ /\A[\0-\N{LATIN SMALL LETTER SHARP S}]\z/i # No match |
| 478 | 'ss' =~ /\A[\xDF-\xDF]\z/i # Matches on ASCII platforms, since \xDF |
| 479 | # is LATIN SMALL LETTER SHARP S, and the |
| 480 | # range is just a single element |
| 481 | |
| 482 | Note that it isn't a good idea to specify these types of ranges anyway. |
| 483 | |
| 484 | =head3 Special Characters Inside a Bracketed Character Class |
| 485 | |
| 486 | Most characters that are meta characters in regular expressions (that |
| 487 | is, characters that carry a special meaning like C<.>, C<*>, or C<(>) lose |
| 488 | their special meaning and can be used inside a character class without |
| 489 | the need to escape them. For instance, C<[()]> matches either an opening |
| 490 | parenthesis, or a closing parenthesis, and the parens inside the character |
| 491 | class don't group or capture. |
| 492 | |
| 493 | Characters that may carry a special meaning inside a character class are: |
| 494 | C<\>, C<^>, C<->, C<[> and C<]>, and are discussed below. They can be |
| 495 | escaped with a backslash, although this is sometimes not needed, in which |
| 496 | case the backslash may be omitted. |
| 497 | |
| 498 | The sequence C<\b> is special inside a bracketed character class. While |
| 499 | outside the character class, C<\b> is an assertion indicating a point |
| 500 | that does not have either two word characters or two non-word characters |
| 501 | on either side, inside a bracketed character class, C<\b> matches a |
| 502 | backspace character. |
| 503 | |
| 504 | The sequences |
| 505 | C<\a>, |
| 506 | C<\c>, |
| 507 | C<\e>, |
| 508 | C<\f>, |
| 509 | C<\n>, |
| 510 | C<\N{I<NAME>}>, |
| 511 | C<\N{U+I<hex char>}>, |
| 512 | C<\r>, |
| 513 | C<\t>, |
| 514 | and |
| 515 | C<\x> |
| 516 | are also special and have the same meanings as they do outside a |
| 517 | bracketed character class. (However, inside a bracketed character |
| 518 | class, if C<\N{I<NAME>}> expands to a sequence of characters, only the first |
| 519 | one in the sequence is used, with a warning.) |
| 520 | |
| 521 | Also, a backslash followed by two or three octal digits is considered an octal |
| 522 | number. |
| 523 | |
| 524 | A C<[> is not special inside a character class, unless it's the start of a |
| 525 | POSIX character class (see L</POSIX Character Classes> below). It normally does |
| 526 | not need escaping. |
| 527 | |
| 528 | A C<]> is normally either the end of a POSIX character class (see |
| 529 | L</POSIX Character Classes> below), or it signals the end of the bracketed |
| 530 | character class. If you want to include a C<]> in the set of characters, you |
| 531 | must generally escape it. |
| 532 | |
| 533 | However, if the C<]> is the I<first> (or the second if the first |
| 534 | character is a caret) character of a bracketed character class, it |
| 535 | does not denote the end of the class (as you cannot have an empty class) |
| 536 | and is considered part of the set of characters that can be matched without |
| 537 | escaping. |
| 538 | |
| 539 | Examples: |
| 540 | |
| 541 | "+" =~ /[+?*]/ # Match, "+" in a character class is not special. |
| 542 | "\cH" =~ /[\b]/ # Match, \b inside a character class |
| 543 | # is equivalent to a backspace. |
| 544 | "]" =~ /[][]/ # Match, as the character class contains |
| 545 | # both [ and ]. |
| 546 | "[]" =~ /[[]]/ # Match, the pattern contains a character class |
| 547 | # containing just [, and the character class is |
| 548 | # followed by a ]. |
| 549 | |
| 550 | =head3 Character Ranges |
| 551 | |
| 552 | It is not uncommon to want to match a range of characters. Luckily, instead |
| 553 | of listing all characters in the range, one may use the hyphen (C<->). |
| 554 | If inside a bracketed character class you have two characters separated |
| 555 | by a hyphen, it's treated as if all characters between the two were in |
| 556 | the class. For instance, C<[0-9]> matches any ASCII digit, and C<[a-m]> |
| 557 | matches any lowercase letter from the first half of the ASCII alphabet. |
| 558 | |
| 559 | Note that the two characters on either side of the hyphen are not |
| 560 | necessarily both letters or both digits. Any character is possible, |
| 561 | although not advisable. C<['-?]> contains a range of characters, but |
| 562 | most people will not know which characters that means. Furthermore, |
| 563 | such ranges may lead to portability problems if the code has to run on |
| 564 | a platform that uses a different character set, such as EBCDIC. |
| 565 | |
| 566 | If a hyphen in a character class cannot syntactically be part of a range, for |
| 567 | instance because it is the first or the last character of the character class, |
| 568 | or if it immediately follows a range, the hyphen isn't special, and so is |
| 569 | considered a character to be matched literally. If you want a hyphen in |
| 570 | your set of characters to be matched and its position in the class is such |
| 571 | that it could be considered part of a range, you must escape that hyphen |
| 572 | with a backslash. |
| 573 | |
| 574 | Examples: |
| 575 | |
| 576 | [a-z] # Matches a character that is a lower case ASCII letter. |
| 577 | [a-fz] # Matches any letter between 'a' and 'f' (inclusive) or |
| 578 | # the letter 'z'. |
| 579 | [-z] # Matches either a hyphen ('-') or the letter 'z'. |
| 580 | [a-f-m] # Matches any letter between 'a' and 'f' (inclusive), the |
| 581 | # hyphen ('-'), or the letter 'm'. |
| 582 | ['-?] # Matches any of the characters '()*+,-./0123456789:;<=>? |
| 583 | # (But not on an EBCDIC platform). |
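
The hyphen rules can be exercised with a small sketch such as this one (the hash keys are illustrative names):

```perl
use strict;
use warnings;

my %r = (
    hyphen_after_range => ("-" =~ /^[a-f-m]$/ ? 1 : 0),  # literal hyphen after a-f
    m_after_hyphen     => ("m" =~ /^[a-f-m]$/ ? 1 : 0),  # literal "m"
    g_outside          => ("g" =~ /^[a-f-m]$/ ? 1 : 0),  # f-m is not a range
    leading_hyphen     => ("-" =~ /^[-z]$/    ? 1 : 0),  # first in class: literal
    escaped_hyphen     => ("-" =~ /^[a\-z]$/  ? 1 : 0),  # escaped: literal
    escaped_no_range   => ("m" =~ /^[a\-z]$/  ? 1 : 0),  # so a..z is not a range
);

print "$_=$r{$_}\n" for sort keys %r;
```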
| 584 | |
| 585 | |
| 586 | =head3 Negation |
| 587 | |
| 588 | It is also possible to instead list the characters you do not want to |
| 589 | match. You can do so by using a caret (C<^>) as the first character in the |
| 590 | character class. For instance, C<[^a-z]> matches any character that is not a |
| 591 | lowercase ASCII letter, which therefore includes more than a million |
| 592 | Unicode code points. The class is said to be "negated" or "inverted". |
| 593 | |
| 594 | This syntax makes the caret a special character inside a bracketed character |
| 595 | class, but only if it is the first character of the class. So if you want |
| 596 | the caret as one of the characters to match, either escape the caret or |
| 597 | else don't list it first. |
| 598 | |
| 599 | In inverted bracketed character classes, Perl ignores the Unicode rules |
| 600 | that normally say that certain characters should match a sequence of |
| 601 | multiple characters under caseless C</i> matching. Following those |
| 602 | rules could lead to highly confusing situations: |
| 603 | |
| 604 | "ss" =~ /^[^\xDF]+$/ui; # Matches! |
| 605 | |
| 606 | This should match any sequences of characters that aren't C<\xDF> nor |
| 607 | what C<\xDF> matches under C</i>. C<"s"> isn't C<\xDF>, but Unicode |
| 608 | says that C<"ss"> is what C<\xDF> matches under C</i>. So which one |
| 609 | "wins"? Do you fail the match because the string has C<ss> or accept it |
| 610 | because it has an C<s> followed by another C<s>? Perl has chosen the |
| 611 | latter. |
| 612 | |
| 613 | Examples: |
| 614 | |
| 615 | "e" =~ /[^aeiou]/ # No match, the 'e' is listed. |
| 616 | "x" =~ /[^aeiou]/ # Match, as 'x' isn't a lowercase vowel. |
| 617 | "^" =~ /[^^]/ # No match, matches anything that isn't a caret. |
| 618 | "^" =~ /[x^]/ # Match, caret is not special here. |
| 619 | |
| 620 | =head3 Backslash Sequences |
| 621 | |
| 622 | You can put any backslash sequence character class (with the exception of |
| 623 | C<\N> and C<\R>) inside a bracketed character class, and it will act just |
| 624 | as if you had put all characters matched by the backslash sequence inside the |
| 625 | character class. For instance, C<[a-f\d]> matches any decimal digit, or any |
| 626 | of the lowercase letters between 'a' and 'f' inclusive. |
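
A minimal sketch of C<[a-f\d]> in use; the candidate strings and the variable name are chosen only for illustration.

```perl
use strict;
use warnings;

# Keep the candidates that are a single decimal digit or one of a..f.
my @hits = grep { /^[a-f\d]$/ } ("c", "7", "g", "F");

print "@hits\n";   # c 7
```

The uppercase C<F> is rejected because the range lists only lowercase letters and C</i> is not in effect.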
| 627 | |
| 628 | C<\N> within a bracketed character class must be of the forms C<\N{I<name>}> |
| 629 | or C<\N{U+I<hex char>}>, and NOT be the form that matches non-newlines, |
| 630 | for the same reason that a dot C<.> inside a bracketed character class loses |
| 631 | its special meaning: it matches nearly anything, which generally isn't what you |
| 632 | want to happen. |
| 633 | |
| 634 | |
| 635 | Examples: |
| 636 | |
| 637 | /[\p{Thai}\d]/ # Matches a character that is either a Thai |
| 638 | # character, or a digit. |
| 639 | /[^\p{Arabic}()]/ # Matches a character that is neither an Arabic |
| 640 | # character, nor a parenthesis. |
| 641 | |
| 642 | Backslash sequence character classes cannot form one of the endpoints |
| 643 | of a range. Thus, you can't say: |
| 644 | |
| 645 | /[\p{Thai}-\d]/ # Wrong! |
| 646 | |
| 647 | =head3 POSIX Character Classes |
| 648 | X<character class> X<\p> X<\p{}> |
| 649 | X<alpha> X<alnum> X<ascii> X<blank> X<cntrl> X<digit> X<graph> |
| 650 | X<lower> X<print> X<punct> X<space> X<upper> X<word> X<xdigit> |
| 651 | |
| 652 | POSIX character classes have the form C<[:class:]>, where I<class> is the |
| 653 | name of the class, and C<[:> and C<:]> are the delimiters. POSIX character |
| 654 | classes only appear I<inside> bracketed character classes, and are a |
| 655 | convenient and descriptive way of listing a group of characters. |
| 656 | |
| 657 | Be careful about the syntax, |
| 658 | |
| 659 | # Correct: |
| 660 | $string =~ /[[:alpha:]]/ |
| 661 | |
| 662 | # Incorrect (will warn): |
| 663 | $string =~ /[:alpha:]/ |
| 664 | |
| 665 | The latter pattern would be a character class consisting of a colon, |
| 666 | and the letters C<a>, C<l>, C<p> and C<h>. |
| 667 | |
| 668 | POSIX character classes can be part of a larger bracketed character class. |
| 669 | For example, |
| 670 | |
| 671 | [01[:alpha:]%] |
| 672 | |
| 673 | is valid and matches '0', '1', any alphabetic character, and the percent sign. |
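A minimal sketch of that combined class in use. The helper name C<in_class> and the sample characters are chosen here for illustration, and C</a> is added so that C<[:alpha:]> is limited to ASCII letters:

```perl
use strict;
use warnings;

# in_class() is a hypothetical helper for this sketch, not part of Perl.
# The /a modifier keeps [:alpha:] restricted to ASCII letters.
sub in_class {
    my ($char) = @_;
    return $char =~ /^[01[:alpha:]%]$/a ? 1 : 0;
}

# '0', '1', a letter and '%' are listed in the class; '2' and '#' are not.
for my $char ('0', '1', 'q', '%', '2', '#') {
    printf "%s => %s\n", $char, in_class($char) ? "match" : "no match";
}
```

The POSIX class sits alongside the literal characters in the same way a range or a backslash sequence would.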
| 674 | |
| 675 | Perl recognizes the following POSIX character classes: |
| 676 | |
| 677 | alpha Any alphabetical character ("[A-Za-z]"). |
| 678 | alnum Any alphanumeric character ("[A-Za-z0-9]"). |
| 679 | ascii Any character in the ASCII character set. |
| 680 | blank A GNU extension, equal to a space or a horizontal tab ("\t"). |
| 681 | cntrl Any control character. See Note [2] below. |
| 682 | digit Any decimal digit ("[0-9]"), equivalent to "\d". |
| 683 | graph Any printable character, excluding a space. See Note [3] below. |
| 684 | lower Any lowercase character ("[a-z]"). |
| 685 | print Any printable character, including a space. See Note [4] below. |
| 686 | punct Any graphical character excluding "word" characters. Note [5]. |
| 687 | space Any whitespace character. "\s" including the vertical tab |
| 688 | ("\cK"). |
| 689 | upper Any uppercase character ("[A-Z]"). |
| 690 | word A Perl extension ("[A-Za-z0-9_]"), equivalent to "\w". |
| 691 | xdigit Any hexadecimal digit ("[0-9a-fA-F]"). |
| 692 | |
| 693 | Most POSIX character classes have two Unicode-style C<\p> property |
| 694 | counterparts. (They are not official Unicode properties, but Perl extensions |
| 695 | derived from official Unicode properties.) The table below shows the relation |
| 696 | between POSIX character classes and these counterparts. |
| 697 | |
| 698 | One counterpart, in the column labelled "ASCII-range Unicode" in |
| 699 | the table, matches only characters in the ASCII character set. |
| 700 | |
| 701 | The other counterpart, in the column labelled "Full-range Unicode", matches any |
| 702 | appropriate characters in the full Unicode character set. For example, |
| 703 | C<\p{Alpha}> matches not just the ASCII alphabetic characters, but any |
| 704 | character in the entire Unicode character set considered alphabetic. |
| 705 | An entry in the column labelled "backslash sequence" is a (short) |
| 706 | equivalent. |
| 707 | |
| 708 | [[:...:]] ASCII-range      Full-range        backslash  Note |
| 709 |           Unicode          Unicode           sequence |
| 710 | ------------------------------------------------------------ |
| 711 | alpha     \p{PosixAlpha}   \p{XPosixAlpha} |
| 712 | alnum     \p{PosixAlnum}   \p{XPosixAlnum} |
| 713 | ascii     \p{ASCII} |
| 714 | blank     \p{PosixBlank}   \p{XPosixBlank}   \h         [1] |
| 715 |                            or \p{HorizSpace}            [1] |
| 716 | cntrl     \p{PosixCntrl}   \p{XPosixCntrl}              [2] |
| 717 | digit     \p{PosixDigit}   \p{XPosixDigit}   \d |
| 718 | graph     \p{PosixGraph}   \p{XPosixGraph}              [3] |
| 719 | lower     \p{PosixLower}   \p{XPosixLower} |
| 720 | print     \p{PosixPrint}   \p{XPosixPrint}              [4] |
| 721 | punct     \p{PosixPunct}   \p{XPosixPunct}              [5] |
| 722 |           \p{PerlSpace}    \p{XPerlSpace}    \s         [6] |
| 723 | space     \p{PosixSpace}   \p{XPosixSpace}              [6] |
| 724 | upper     \p{PosixUpper}   \p{XPosixUpper} |
| 725 | word      \p{PosixWord}    \p{XPosixWord}    \w |
| 726 | xdigit    \p{PosixXDigit}  \p{XPosixXDigit} |
| 727 | |
| 728 | =over 4 |
| 729 | |
| 730 | =item [1] |
| 731 | |
| 732 | C<\p{Blank}> and C<\p{HorizSpace}> are synonyms. |
| 733 | |
| 734 | =item [2] |
| 735 | |
| 736 | Control characters don't produce output as such, but instead usually control |
| 737 | the terminal somehow: for example, newline and backspace are control characters. |
| 738 | In the ASCII range, characters whose code points are between 0 and 31 inclusive, |
| 739 | plus 127 (C<DEL>) are control characters. |
| 740 | |
| 741 | =item [3] |
| 742 | |
| 743 | Any character that is I<graphical>, that is, visible. This class consists |
| 744 | of all alphanumeric characters and all punctuation characters. |
| 745 | |
| 746 | =item [4] |
| 747 | |
| 748 | All printable characters, which is the set of all graphical characters |
| 749 | plus those whitespace characters which are not also controls. |
| 750 | |
| 751 | =item [5] |
| 752 | |
| 753 | C<\p{PosixPunct}> and C<[[:punct:]]> in the ASCII range match all |
| 754 | non-controls, non-alphanumeric, non-space characters: |
| 755 | C<[-!"#$%&'()*+,./:;<=E<gt>?@[\\\]^_`{|}~]> (although if a locale is in effect, |
| 756 | it could alter the behavior of C<[[:punct:]]>). |
| 757 | |
| 758 | The similarly named property, C<\p{Punct}>, matches a somewhat different |
| 759 | set in the ASCII range, namely |
| 760 | C<[-!"#%&'()*,./:;?@[\\\]_{}]>. That is, it is missing the nine |
| 761 | characters C<[$+E<lt>=E<gt>^`|~]>. |
| 762 | This is because Unicode splits what POSIX considers to be punctuation into two |
| 763 | categories, Punctuation and Symbols. |
| 764 | |
| 765 | C<\p{XPosixPunct}> and (under Unicode rules) C<[[:punct:]]>, match what |
| 766 | C<\p{PosixPunct}> matches in the ASCII range, plus what C<\p{Punct}> |
| 767 | matches. This is different from strictly matching according to |
| 768 | C<\p{Punct}>. Another way to say it is that |
| 769 | if Unicode rules are in effect, C<[[:punct:]]> matches all characters |
| 770 | that Unicode considers punctuation, plus all ASCII-range characters that |
| 771 | Unicode considers symbols. |
| 772 | |
| 773 | =item [6] |
| 774 | |
| 775 | C<\p{SpacePerl}> and C<\p{Space}> match identically starting with Perl |
| 776 | v5.18. In earlier versions, these differ only in that in non-locale |
| 777 | matching, C<\p{SpacePerl}> does not match the vertical tab, C<\cK>. |
| 778 | Same for the two ASCII-only range forms. |
| 779 | |
| 780 | =back |
| 781 | |
| 782 | There are various other synonyms that can be used besides the names |
| 783 | listed in the table. For example, C<\p{XPosixAlpha}> can be written as |
| 784 | C<\p{Alpha}>. All are listed in |
| 785 | L<perluniprops/Properties accessible through \p{} and \P{}>. |
| 786 | |
| 787 | Both the C<\p> counterparts always assume Unicode rules are in effect. |
| 788 | On ASCII platforms, this means they assume that the code points from 128 |
| 789 | to 255 are Latin-1, and that means that using them under locale rules is |
| 790 | unwise unless the locale is guaranteed to be Latin-1 or UTF-8. In contrast, the |
| 791 | POSIX character classes are useful under locale rules. They are |
| 792 | affected by the actual rules in effect, as follows: |
| 793 | |
| 794 | =over |
| 795 | |
| 796 | =item If the C</a> modifier is in effect ... |
| 797 | |
| 798 | Each of the POSIX classes matches exactly the same as their ASCII-range |
| 799 | counterparts. |
| 800 | |
| 801 | =item otherwise ... |
| 802 | |
| 803 | =over |
| 804 | |
| 805 | =item For code points above 255 ... |
| 806 | |
| 807 | The POSIX class matches the same as its Full-range counterpart. |
| 808 | |
| 809 | =item For code points below 256 ... |
| 810 | |
| 811 | =over |
| 812 | |
| 813 | =item if locale rules are in effect ... |
| 814 | |
| 815 | The POSIX class matches according to the locale, except that |
| 816 | C<word> uses the platform's native underscore character, no matter what |
| 817 | the locale is. |
| 818 | |
| 819 | =item if Unicode rules are in effect ... |
| 820 | |
| 821 | The POSIX class matches the same as the Full-range counterpart. |
| 822 | |
| 823 | =item otherwise ... |
| 824 | |
| 825 | The POSIX class matches the same as the ASCII range counterpart. |
| 826 | |
| 827 | =back |
| 828 | |
| 829 | =back |
| 830 | |
| 831 | =back |
| 832 | |
| 833 | Which rules apply are determined as described in |
| 834 | L<perlre/Which character set modifier is in effect?>. |
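The rules above can be sketched with explicit modifiers, which sidesteps the question of which default is in effect. The sample code points (U+00E9 and U+03B1) and the variable names are chosen here for illustration:

```perl
use strict;
use warnings;

my $e_acute = "\xE9";     # LATIN SMALL LETTER E WITH ACUTE (below 256)
my $alpha   = "\x{3B1}";  # GREEK SMALL LETTER ALPHA (above 255)

# /a: [[:alpha:]] matches like its ASCII-range counterpart, \p{PosixAlpha}.
my $e_under_a = $e_acute =~ /[[:alpha:]]/a ? 1 : 0;

# /u: below 256, the class matches like its Full-range counterpart.
my $e_under_u = $e_acute =~ /[[:alpha:]]/u ? 1 : 0;

# Above 255, the Full-range counterpart is used unless /a is in effect.
my $alpha_plain   = $alpha =~ /[[:alpha:]]/  ? 1 : 0;
my $alpha_under_a = $alpha =~ /[[:alpha:]]/a ? 1 : 0;

printf "e-acute: /a=%d /u=%d\n",    $e_under_a,   $e_under_u;
printf "alpha:   plain=%d /a=%d\n", $alpha_plain, $alpha_under_a;
```

Spelling out C</a> or C</u> in a pattern is one way to make the intended behavior visible to a reader, whatever rules the surrounding code would otherwise select.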
| 835 | |
| 836 | It is proposed to change this behavior in a future release of Perl so that |
| 837 | whether or not Unicode rules are in effect would not change the |
| 838 | behavior: Outside of locale, the POSIX classes |
| 839 | would behave like their ASCII-range counterparts. If you wish to |
| 840 | comment on this proposal, send email to C<perl5-porters@perl.org>. |
| 841 | |
| 842 | =head4 Negation of POSIX character classes |
| 843 | X<character class, negation> |
| 844 | |
| 845 | A Perl extension to the POSIX character class is the ability to |
| 846 | negate it. This is done by prefixing the class name with a caret (C<^>). |
| 847 | Some examples: |
| 848 | |
| 849 | POSIX          ASCII-range      Full-range        backslash |
| 850 |                Unicode          Unicode           sequence |
| 851 | ----------------------------------------------------------- |
| 852 | [[:^digit:]]   \P{PosixDigit}   \P{XPosixDigit}   \D |
| 853 | [[:^space:]]   \P{PosixSpace}   \P{XPosixSpace} |
| 854 |                \P{PerlSpace}    \P{XPerlSpace}    \S |
| 855 | [[:^word:]]    \P{PerlWord}     \P{XPosixWord}    \W |
| 856 | |
| 857 | The backslash sequence can mean either ASCII- or Full-range Unicode, |
| 858 | depending on various factors as described in L<perlre/Which character set modifier is in effect?>. |
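As a small sketch, the negated POSIX class and its backslash counterpart can be compared directly. The helper names and the sample characters are assumptions made for this illustration, and C</a> is used so that both forms follow the ASCII-range column:

```perl
use strict;
use warnings;

# Hypothetical helpers for this sketch; both use /a (ASCII-range rules).
sub posix_non_digit     { return $_[0] =~ /[[:^digit:]]/a ? 1 : 0 }
sub backslash_non_digit { return $_[0] =~ /\D/a           ? 1 : 0 }

# '7' is a digit; the other samples are not.
for my $char ('7', 'x', ' ', '_') {
    printf "[%s] [[:^digit:]]=%d \\D=%d\n",
        $char, posix_non_digit($char), backslash_non_digit($char);
}
```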
| 859 | |
| 860 | =head4 [= =] and [. .] |
| 861 | |
| 862 | Perl recognizes the POSIX character classes C<[=class=]> and |
| 863 | C<[.class.]>, but does not (yet?) support them. Any attempt to use |
| 864 | either construct raises an exception. |
| 865 | |
| 866 | =head4 Examples |
| 867 | |
| 868 | /[[:digit:]]/ # Matches a character that is a digit. |
| 869 | /[01[:lower:]]/ # Matches a character that is either a |
| 870 | # lowercase letter, or '0' or '1'. |
| 871 | /[[:digit:][:^xdigit:]]/ # Matches a character that can be anything |
| 872 | # except the letters 'a' to 'f' and 'A' to |
| 873 | # 'F'. This is because the main character |
| 874 | # class is composed of two POSIX character |
| 875 | # classes that are ORed together, one that |
| 876 | # matches any digit, and the other that |
| 877 | # matches anything that isn't a hex digit. |
| 878 | # The OR adds the digits, leaving only the |
| 879 | # letters 'a' to 'f' and 'A' to 'F' excluded. |
| 880 | |
| 881 | =head3 Extended Bracketed Character Classes |
| 882 | X<character class> |
| 883 | X<set operations> |
| 884 | |
| 885 | This is a fancy bracketed character class that can be used for more |
| 886 | readable and less error-prone classes, and to perform set operations, |
| 887 | such as intersection. An example is |
| 888 | |
| 889 | /(?[ \p{Thai} & \p{Digit} ])/ |
| 890 | |
| 891 | This will match all the digit characters that are in the Thai script. |
| 892 | |
| 893 | This is an experimental feature available starting in 5.18, and is |
| 894 | subject to change as we gain field experience with it. Any attempt to |
| 895 | use it will raise a warning, unless disabled via |
| 896 | |
| 897 | no warnings "experimental::regex_sets"; |
| 898 | |
| 899 | Comments on this feature are welcome; send email to |
| 900 | C<perl5-porters@perl.org>. |
| 901 | |
| 902 | We can extend the example above: |
| 903 | |
| 904 | /(?[ ( \p{Thai} + \p{Lao} ) & \p{Digit} ])/ |
| 905 | |
| 906 | This matches digits that are in either the Thai or Laotian scripts. |
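Both patterns can be tried against a few sample characters. This is only a sketch: the variable names and the sample code points are chosen here, not taken from the text above.

```perl
use strict;
use warnings;
no warnings "experimental::regex_sets";  # silence the warning where it applies

my $thai_digit        = qr/(?[ \p{Thai} & \p{Digit} ])/;
my $thai_or_lao_digit = qr/(?[ ( \p{Thai} + \p{Lao} ) & \p{Digit} ])/;

# Sample characters picked for illustration.
my @samples = (
    [ "THAI DIGIT ONE",        "\x{0E51}" ],
    [ "LAO DIGIT ONE",         "\x{0ED1}" ],
    [ "THAI CHARACTER KO KAI", "\x{0E01}" ],
    [ "DIGIT ONE (ASCII)",     "1"        ],
);

for my $sample (@samples) {
    my ($name, $char) = @$sample;
    printf "%-22s thai=%d thai_or_lao=%d\n", $name,
        ($char =~ $thai_digit        ? 1 : 0),
        ($char =~ $thai_or_lao_digit ? 1 : 0);
}
```

The ASCII digit is a digit but belongs to neither script, and the Thai letter belongs to the script but is not a digit, so neither satisfies the intersection.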
| 907 | |
| 908 | Notice the white space in these examples. This construct always has |
| 909 | the C<E<sol>x> modifier turned on within it. |
| 910 | |
| 911 | The available binary operators are: |
| 912 | |
| 913 | & intersection |
| 914 | + union |
| 915 | | another name for '+', hence means union |
| 916 | - subtraction (the result matches the set consisting of those |
| 917 | code points matched by the first operand, excluding any that |
| 918 | are also matched by the second operand) |
| 919 | ^ symmetric difference (the union minus the intersection). This |
| 920 | is like an exclusive or, in that the result is the set of code |
| 921 | points that are matched by either, but not both, of the |
| 922 | operands. |
| 923 | |
| 924 | There is one unary operator: |
| 925 | |
| 926 | ! complement |
| 927 | |
| 928 | All the binary operators left associate, and are of equal precedence. |
| 929 | The unary operator right associates, and has higher precedence. Use |
| 930 | parentheses to override the default associations. Some feedback we've |
| 931 | received indicates a desire for intersection to have higher precedence |
| 932 | than union. This is something that feedback from the field may cause us |
| 933 | to change in future releases; you may want to parenthesize copiously to |
| 934 | avoid such changes affecting your code, until this feature is no longer |
| 935 | considered experimental. |
| 936 | |
| 937 | The main restriction is that everything is a metacharacter. Thus, |
| 938 | you cannot refer to single characters by doing something like this: |
| 939 | |
| 940 | /(?[ a + b ])/ # Syntax error! |
| 941 | |
| 942 | The easiest way to specify an individual typable character is to enclose |
| 943 | it in brackets: |
| 944 | |
| 945 | /(?[ [a] + [b] ])/ |
| 946 | |
| 947 | (This is the same thing as C<[ab]>.) You could also have said the |
| 948 | equivalent: |
| 949 | |
| 950 | /(?[[ a b ]])/ |
| 951 | |
| 952 | (You can, of course, specify single characters by using C<\x{...}>, |
| 953 | C<\N{...}>, etc.) |
| 954 | |
| 955 | This last example shows the use of this construct to specify an ordinary |
| 956 | bracketed character class without additional set operations. Note the |
| 957 | white space within it; C<E<sol>x> is turned on even within bracketed |
| 958 | character classes, except you can't have comments inside them. Hence, |
| 959 | |
| 960 | (?[ [#] ]) |
| 961 | |
| 962 | matches the literal character "#". To specify a literal white space character, |
| 963 | you can escape it with a backslash, like: |
| 964 | |
| 965 | /(?[ [ a e i o u \ ] ])/ |
| 966 | |
| 967 | This matches the English vowels plus the SPACE character. |
| 968 | All the other escapes accepted by normal bracketed character classes are |
| 969 | accepted here as well; but unrecognized escapes that generate warnings |
| 970 | in normal classes are fatal errors here. |
| 971 | |
| 972 | All warnings from these class elements are fatal, as well as some |
| 973 | practices that don't currently warn. For example you cannot say |
| 974 | |
| 975 | /(?[ [ \xF ] ])/ # Syntax error! |
| 976 | |
| 977 | You have to have two hex digits after a braceless C<\x> (use a leading |
| 978 | zero to make two). These restrictions are to lower the incidence of |
| 979 | typos causing the class to not match what you thought it would. |
| 980 | |
| 981 | If a regular bracketed character class contains a C<\p{}> or C<\P{}> and |
| 982 | is matched against a non-Unicode code point, a warning may be |
| 983 | raised, as the result is not Unicode-defined. No such warning will come |
| 984 | when using this extended form. |
| 985 | |
| 986 | The final difference between regular bracketed character classes and |
| 987 | these, is that it is not possible to get these to match a |
| 988 | multi-character fold. Thus, |
| 989 | |
| 990 | /(?[ [\xDF] ])/iu |
| 991 | |
| 992 | does not match the string C<ss>. |
| 993 | |
| 994 | You don't have to enclose POSIX class names inside double brackets, |
| 995 | hence both of the following work: |
| 996 | |
| 997 | /(?[ [:word:] - [:lower:] ])/ |
| 998 | /(?[ [[:word:]] - [[:lower:]] ])/ |
| 999 | |
| 1000 | Any contained POSIX character classes, including things like C<\w> and C<\D>, |
| 1001 | respect the C<E<sol>a> (and C<E<sol>aa>) modifiers. |
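A short sketch of POSIX classes used as operands of a set operation, here word characters minus lowercase letters. The variable name and the sample characters are assumptions made for this illustration:

```perl
use strict;
use warnings;
no warnings "experimental::regex_sets";  # silence the warning where it applies

# Word characters that are not lowercase letters.
my $word_not_lower = qr/^(?[ [:word:] - [:lower:] ])$/;

# 'A', '5' and '_' are word characters that are not lowercase; 'a' is lowercase.
for my $char ('A', '5', '_', 'a') {
    printf "%s => %s\n", $char,
        $char =~ $word_not_lower ? "match" : "no match";
}
```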
| 1002 | |
| 1003 | C<< (?[ ]) >> is a regex-compile-time construct. Any attempt to use |
| 1004 | something which isn't knowable at the time the containing regular |
| 1005 | expression is compiled is a fatal error. In practice, this means |
| 1006 | just three limitations: |
| 1007 | |
| 1008 | =over 4 |
| 1009 | |
| 1010 | =item 1 |
| 1011 | |
| 1012 | This construct cannot be used within the scope of |
| 1013 | C<use locale> (or the C<E<sol>l> regex modifier). |
| 1014 | |
| 1015 | =item 2 |
| 1016 | |
| 1017 | Any |
| 1018 | L<user-defined property|perlunicode/"User-Defined Character Properties"> |
| 1019 | used must be already defined by the time the regular expression is |
| 1020 | compiled (but note that this construct can be used instead of such |
| 1021 | properties). |
| 1022 | |
| 1023 | =item 3 |
| 1024 | |
| 1025 | A regular expression that otherwise would compile |
| 1026 | using C<E<sol>d> rules, and which uses this construct will instead |
| 1027 | use C<E<sol>u>. Thus this construct tells Perl that you don't want |
| 1028 | C<E<sol>d> rules for the entire regular expression containing it. |
| 1029 | |
| 1030 | =back |
| 1031 | |
| 1032 | The C<E<sol>x> processing within this class is an extended form. |
| 1033 | Besides the characters that are considered white space in normal C</x> |
| 1034 | processing, there are 5 others, recommended by the Unicode standard: |
| 1035 | |
| 1036 | U+0085 NEXT LINE |
| 1037 | U+200E LEFT-TO-RIGHT MARK |
| 1038 | U+200F RIGHT-TO-LEFT MARK |
| 1039 | U+2028 LINE SEPARATOR |
| 1040 | U+2029 PARAGRAPH SEPARATOR |
| 1041 | |
| 1042 | Note that skipping white space applies only to the interior of this |
| 1043 | construct. There must not be any space between any of the characters |
| 1044 | that form the initial C<(?[>. Nor may there be space between the |
| 1045 | closing C<])> characters. |
| 1046 | |
| 1047 | Just as in all regular expressions, the pattern can be built up by |
| 1048 | including variables that are interpolated at regex compilation time. |
| 1049 | Care must be taken to ensure that you are getting what you expect. For |
| 1050 | example: |
| 1051 | |
| 1052 | my $thai_or_lao = '\p{Thai} + \p{Lao}'; |
| 1053 | ... |
| 1054 | qr/(?[ \p{Digit} & $thai_or_lao ])/; |
| 1055 | |
| 1056 | compiles to |
| 1057 | |
| 1058 | qr/(?[ \p{Digit} & \p{Thai} + \p{Lao} ])/; |
| 1059 | |
| 1060 | But this does not have the effect that someone reading the code would |
| 1061 | likely expect: the intersection applies just to C<\p{Thai}>, so the |
| 1062 | result is the Thai digits plus every Laotian character, digit or not. |
| 1063 | Pitfalls like this can be avoided by parenthesizing the component pieces: |
| 1064 | |
| 1065 | my $thai_or_lao = '( \p{Thai} + \p{Lao} )'; |
| 1066 | |
| 1067 | But any modifiers will still apply to all the components: |
| 1068 | |
| 1069 | my $lower = '\p{Lower} + \p{Digit}'; |
| 1070 | qr/(?[ \p{Greek} & $lower ])/i; |
| 1071 | |
| 1072 | matches upper case things. You can avoid surprises by making the |
| 1073 | components into instances of this construct by compiling them: |
| 1074 | |
| 1075 | my $thai_or_lao = qr/(?[ \p{Thai} + \p{Lao} ])/; |
| 1076 | my $lower = qr/(?[ \p{Lower} + \p{Digit} ])/; |
| 1077 | |
| 1078 | When these are embedded in another pattern, what they match does not |
| 1079 | change, regardless of parenthesization or what modifiers are in effect |
| 1080 | in that outer pattern. |
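The difference between interpolating a plain string and interpolating a compiled component can be sketched as follows. The variable names and the sample code points are chosen for illustration only:

```perl
use strict;
use warnings;
no warnings "experimental::regex_sets";  # silence the warning where it applies

# Interpolated as a plain string: the pieces are re-parsed inside the
# outer construct, so the intersection applies only to \p{Thai}.
my $as_string  = '\p{Thai} + \p{Lao}';
my $surprising = qr/(?[ \p{Digit} & $as_string ])/;

# Compiled first: the component keeps its meaning when embedded.
my $thai_or_lao     = qr/(?[ \p{Thai} + \p{Lao} ])/;
my $digit_in_either = qr/(?[ \p{Digit} & $thai_or_lao ])/;

my $lao_letter = "\x{0E81}";  # LAO LETTER KO, not a digit
my $lao_digit  = "\x{0ED1}";  # LAO DIGIT ONE

printf "string form,   Lao letter: %d\n", ($lao_letter =~ $surprising      ? 1 : 0);
printf "compiled form, Lao letter: %d\n", ($lao_letter =~ $digit_in_either ? 1 : 0);
printf "compiled form, Lao digit:  %d\n", ($lao_digit  =~ $digit_in_either ? 1 : 0);
```

The string form lets a Lao letter through because the union with C<\p{Lao}> is applied after the intersection; the compiled form restricts both scripts to digits.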
| 1081 | |
| 1082 | Due to the way that Perl parses things, your parentheses and brackets |
| 1083 | may need to be balanced, even including comments. If you run into any |
| 1084 | examples, please send them to C<perlbug@perl.org>, so that we can have a |
| 1085 | concrete example for this man page. |
| 1086 | |
| 1087 | We may change this experimental construct so that some usages that are |
| 1088 | legal in normal bracketed character classes become illegal within it. |
| 1089 | One proposal, for example, is to forbid adjacent uses of the |
| 1090 | same character, as in C<(?[ [aa] ])>. The motivation for such a change |
| 1091 | is that this usage is likely a typo, as the second "a" adds nothing. |