| 1 | =head1 NAME |
| 2 | |
| 3 | perlrecharclass - Perl Regular Expression Character Classes |
| 4 | |
| 5 | =head1 DESCRIPTION |
| 6 | |
| 7 | The top level documentation about Perl regular expressions |
| 8 | is found in L<perlre>. |
| 9 | |
| 10 | This manual page discusses the syntax and use of character |
| 11 | classes in Perl Regular Expressions. |
| 12 | |
| 13 | A character class is a way of denoting a set of characters, |
| 14 | in such a way that one character of the set is matched. |
| 15 | It's important to remember that matching a character class |
| 16 | consumes exactly one character in the source string. (The source |
| 17 | string is the string the regular expression is matched against.) |
| 18 | |
| 19 | There are three types of character classes in Perl regular |
| 20 | expressions: the dot, backslashed sequences, and the bracketed form. |
| 21 | |
| 22 | =head2 The dot |
| 23 | |
| 24 | The dot (or period), C<.>, is probably the most used, and certainly |
| 25 | the best-known, character class. By default, a dot matches any |
| 26 | character, except for the newline. The default can be changed to |
| 27 | add matching the newline with the I<single line> modifier: either |
| 28 | for the entire regular expression using the C</s> modifier, or |
| 29 | locally using C<(?s)>. |
| 30 | |
| 31 | Here are some examples: |
| 32 | |
| 33 | "a" =~ /./ # Match |
| 34 | "." =~ /./ # Match |
| 35 | "" =~ /./ # No match (dot has to match a character) |
| 36 | "\n" =~ /./ # No match (dot does not match a newline) |
| 37 | "\n" =~ /./s # Match (global 'single line' modifier) |
| 38 | "\n" =~ /(?s:.)/ # Match (local 'single line' modifier) |
| 39 | "ab" =~ /^.$/ # No match (dot matches one character) |
| 40 | |
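The effect of the I<single line> modifier on a multi-line string can be sketched as follows. This is a minimal illustration; the sample string and the variable names are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical sample input with an embedded newline.
my $text = "line one\nline two";

# Without /s the dot stops at the newline, so .* covers the first line only.
my ($first_line) = $text =~ /^(.*)/;

# With /s the dot also matches the newline, so .* runs to the end of the string.
my ($everything) = $text =~ /^(.*)/s;

print "without /s: ", length($first_line), " characters\n";
print "with /s:    ", length($everything), " characters\n";
```

The same result can be had locally with C<(?s:.*)> when only part of a pattern should treat the newline as an ordinary character.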
| 41 | |
| 42 | =head2 Backslashed sequences |
| 43 | |
| 44 | Perl regular expressions contain many backslashed sequences that |
| 45 | constitute a character class. That is, they will match a single |
| 46 | character, if that character belongs to a specific set of characters |
| 47 | (defined by the sequence). A backslashed sequence is a sequence of |
| 48 | characters starting with a backslash. Not all backslashed sequences |
| 49 | are character classes; for a full list, see L<perlrebackslash>. |
| 50 | |
| 51 | Here's a list of the backslashed sequences, which are discussed in |
| 52 | more detail below. |
| 53 | |
| 54 | \d Match a digit character. |
| 55 | \D Match a non-digit character. |
| 56 | \w Match a "word" character. |
| 57 | \W Match a non-"word" character. |
| 58 | \s Match a white space character. |
| 59 | \S Match a non-white space character. |
| 60 | \h Match a horizontal white space character. |
| 61 | \H Match a character that isn't horizontal white space. |
| 62 | \v Match a vertical white space character. |
| 63 | \V Match a character that isn't vertical white space. |
| 64 | \pP, \p{Prop} Match a character matching a Unicode property. |
| 65 | \PP, \P{Prop} Match a character that doesn't match a Unicode property. |
| 66 | |
| 67 | =head3 Digits |
| 68 | |
| 69 | C<\d> matches a single character that is considered to be a I<digit>. |
| 70 | What is considered a digit depends on the internal encoding of |
| 71 | the source string. If the source string is in UTF-8 format, C<\d> |
| 72 | not only matches the digits '0' - '9', but also Arabic, Devanagari and |
| 73 | digits from other languages. Otherwise, if there is a locale in effect, |
| 74 | it will match whatever characters the locale considers digits. Without |
| 75 | a locale, C<\d> matches the digits '0' to '9'. |
| 76 | See L</Locale, Unicode and UTF-8>. |
| 77 | |
| 78 | Any character that isn't matched by C<\d> will be matched by C<\D>. |
| 79 | |
| 80 | =head3 Word characters |
| 81 | |
| 82 | C<\w> matches a single I<word> character: an alphanumeric character |
| 83 | (that is, an alphabetic character, or a digit), or the underscore (C<_>). |
| 84 | What is considered a word character depends on the internal encoding |
| 85 | of the string. If it's in UTF-8 format, C<\w> matches those characters |
| 86 | that are considered word characters in the Unicode database. That is, it |
| 87 | not only matches ASCII letters, but also Thai letters, Greek letters, etc. |
| 88 | If the source string isn't in UTF-8 format, C<\w> matches those characters |
| 89 | that are considered word characters by the current locale. Without |
| 90 | a locale in effect, C<\w> matches the ASCII letters, digits and the |
| 91 | underscore. |
| 92 | |
| 93 | Any character that isn't matched by C<\w> will be matched by C<\W>. |
| 94 | |
| 95 | =head3 White space |
| 96 | |
| 97 | C<\s> matches any single character that is considered white space. In the |
| 98 | ASCII range, C<\s> matches the horizontal tab (C<\t>), the new line |
| 99 | (C<\n>), the form feed (C<\f>), the carriage return (C<\r>), and the |
| 100 | space (the vertical tab, C<\cK>, is not matched by C<\s>). The exact set |
| 101 | of characters matched by C<\s> depends on whether the source string is |
| 102 | in UTF-8 format. If it is, C<\s> matches what is considered white space |
| 103 | in the Unicode database. Otherwise, if there is a locale in effect, C<\s> |
| 104 | matches whatever is considered white space by the current locale. Without |
| 105 | a locale, C<\s> matches the five characters mentioned in the beginning |
| 106 | of this paragraph. Perhaps the most notable difference is that C<\s> |
| 107 | matches a non-breaking space only if the non-breaking space is in a |
| 108 | UTF-8 encoded string. |
| 109 | |
| 110 | Any character that isn't matched by C<\s> will be matched by C<\S>. |
| 111 | |
| 112 | C<\h> will match any character that is considered horizontal white space; |
| 113 | this includes the space and the tab characters. C<\H> will match any character |
| 114 | that is not considered horizontal white space. |
| 115 | |
| 116 | C<\v> will match any character that is considered vertical white space; |
| 117 | this includes the carriage return and line feed characters (newline). |
| 118 | C<\V> will match any character that is not considered vertical white space. |
| 119 | |
| 120 | C<\R> matches anything that can be considered a newline under Unicode |
| 121 | rules. It's not a character class, as it can match a multi-character |
| 122 | sequence. Therefore, it cannot be used inside a bracketed character |
| 123 | class. Details are discussed in L<perlrebackslash>. |
| 124 | |
| 125 | C<\h>, C<\H>, C<\v>, C<\V>, and C<\R> are new in perl 5.10. |
| 126 | |
| 127 | Note that unlike C<\s>, C<\d> and C<\w>, C<\h> and C<\v> always match |
| 128 | the same characters, regardless of whether the source string is in UTF-8 |
| 129 | format or not. The set of characters they match is also not influenced |
| 130 | by locale. |
| 131 | |
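One use of the horizontal and vertical classes is to tidy spacing inside lines without disturbing line breaks. The following is a sketch only, assuming perl 5.10 or later; the input string and variable names are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical input mixing spaces, a tab, and a newline.
my $raw = "a \t b\nc   d";

# Collapse each run of horizontal white space to one space. The newline
# is vertical white space, so \h+ never touches it.
(my $squeezed = $raw) =~ s/\h+/ /g;

# Count the vertical white space characters that remain.
my $line_breaks = () = $squeezed =~ /\v/g;

print "$squeezed\n";
print "line breaks kept: $line_breaks\n";
```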
| 132 | One might think that C<\s> is equivalent to C<[\h\v]>. This is not true. |
| 133 | The vertical tab (C<"\x0b">) is not matched by C<\s>; it is, however, |
| 134 | considered vertical white space. Furthermore, if the source string is |
| 135 | not in UTF-8 format, the next line (C<"\x85">) and the no-break space |
| 136 | (C<"\xA0">) are not matched by C<\s>, but are by C<\v> and C<\h> respectively. |
| 137 | If the source string is in UTF-8 format, both the next line and the |
| 138 | no-break space are matched by C<\s>. |
| 139 | |
| 140 | The following table is a complete listing of characters matched by |
| 141 | C<\s>, C<\h> and C<\v>. |
| 142 | |
| 143 | The first column gives the code point of the character (in hex format), |
| 144 | the second column gives the (Unicode) name. The third column indicates |
| 145 | by which class(es) the character is matched. |
| 146 | |
| 147 | 0x00009 CHARACTER TABULATION h s |
| 148 | 0x0000a LINE FEED (LF) vs |
| 149 | 0x0000b LINE TABULATION v |
| 150 | 0x0000c FORM FEED (FF) vs |
| 151 | 0x0000d CARRIAGE RETURN (CR) vs |
| 152 | 0x00020 SPACE h s |
| 153 | 0x00085 NEXT LINE (NEL) vs [1] |
| 154 | 0x000a0 NO-BREAK SPACE h s [1] |
| 155 | 0x01680 OGHAM SPACE MARK h s |
| 156 | 0x0180e MONGOLIAN VOWEL SEPARATOR h s |
| 157 | 0x02000 EN QUAD h s |
| 158 | 0x02001 EM QUAD h s |
| 159 | 0x02002 EN SPACE h s |
| 160 | 0x02003 EM SPACE h s |
| 161 | 0x02004 THREE-PER-EM SPACE h s |
| 162 | 0x02005 FOUR-PER-EM SPACE h s |
| 163 | 0x02006 SIX-PER-EM SPACE h s |
| 164 | 0x02007 FIGURE SPACE h s |
| 165 | 0x02008 PUNCTUATION SPACE h s |
| 166 | 0x02009 THIN SPACE h s |
| 167 | 0x0200a HAIR SPACE h s |
| 168 | 0x02028 LINE SEPARATOR vs |
| 169 | 0x02029 PARAGRAPH SEPARATOR vs |
| 170 | 0x0202f NARROW NO-BREAK SPACE h s |
| 171 | 0x0205f MEDIUM MATHEMATICAL SPACE h s |
| 172 | 0x03000 IDEOGRAPHIC SPACE h s |
| 173 | |
| 174 | =over 4 |
| 175 | |
| 176 | =item [1] |
| 177 | |
| 178 | NEXT LINE and NO-BREAK SPACE only match C<\s> if the source string is in |
| 179 | UTF-8 format. |
| 180 | |
| 181 | =back |
| 182 | |
| 183 | It is worth noting that C<\d>, C<\w>, etc., match single characters, not |
| 184 | complete numbers or words. To match a number (that consists of digits), |
| 185 | use C<\d+>; to match a word, use C<\w+>. |
| 186 | |
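The difference between a single-character class and a quantified one can be sketched like this. The sample line is hypothetical, and the string is plain ASCII so that locale and encoding play no part.

```perl
use strict;
use warnings;

# Hypothetical ASCII-only input.
my $line = "order 66 shipped 3 items";

# \d+ captures whole runs of digits.
my @numbers = $line =~ /(\d+)/g;

# \w+ captures whole runs of word characters; digits count as word characters.
my @words = $line =~ /(\w+)/g;

# A bare \d matches one digit at a time, so "66" contributes two matches.
my $digit_count = () = $line =~ /\d/g;

print "numbers: @numbers\n";
print "words:   @words\n";
print "digits:  $digit_count\n";
```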
| 187 | |
| 188 | =head3 Unicode Properties |
| 189 | |
| 190 | C<\pP> and C<\p{Prop}> are character classes to match characters that |
| 191 | fit given Unicode classes. One-letter classes can be used in the C<\pP> |
| 192 | form, with the class name following the C<\p>; otherwise, the property |
| 193 | name is enclosed in braces, and follows the C<\p>. For instance, a |
| 194 | match for a number can be written as C</\pN/> or as C</\p{Number}/>. |
| 195 | Lowercase letters are matched by the property I<LowercaseLetter>, which |
| 196 | has the short form I<Ll>. They have to be written as C</\p{Ll}/> or |
| 197 | C</\p{LowercaseLetter}/>. C</\pLl/> is valid, but means something different. |
| 198 | It matches a two character string: a letter (Unicode property C<\pL>), |
| 199 | followed by a lowercase C<l>. |
| 200 | |
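The pitfall with C<\pLl> can be checked directly. This is a small sketch with illustrative variable names; each variable holds 1 for a match and 0 otherwise.

```perl
use strict;
use warnings;

# Braced form: the whole name "Ll" is the property.
my $braced_lower = "a" =~ /^\p{Ll}$/ ? 1 : 0;   # lowercase letter
my $braced_upper = "A" =~ /^\p{Ll}$/ ? 1 : 0;   # not lowercase

# Unbraced form: \pL is "any letter", and the trailing l is a literal.
my $unbraced_pair   = "Al" =~ /^\pLl$/ ? 1 : 0; # letter, then literal l
my $unbraced_single = "a"  =~ /^\pLl$/ ? 1 : 0; # only one character

print "$braced_lower $braced_upper $unbraced_pair $unbraced_single\n";
```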
| 201 | For a list of possible properties, see |
| 202 | L<perlunicode/Unicode Character Properties>. It is also possible to |
| 203 | define your own properties. This is discussed in |
| 204 | L<perlunicode/User-Defined Character Properties>. |
| 205 | |
| 206 | |
| 207 | =head4 Examples |
| 208 | |
| 209 | "a" =~ /\w/ # Match, "a" is a 'word' character. |
| 210 | "7" =~ /\w/ # Match, "7" is a 'word' character as well. |
| 211 | "a" =~ /\d/ # No match, "a" isn't a digit. |
| 212 | "7" =~ /\d/ # Match, "7" is a digit. |
| 213 | " " =~ /\s/ # Match, a space is white space. |
| 214 | "a" =~ /\D/ # Match, "a" is a non-digit. |
| 215 | "7" =~ /\D/ # No match, "7" is not a non-digit. |
| 216 | " " =~ /\S/ # No match, a space is not non-white space. |
| 217 | |
| 218 | " " =~ /\h/ # Match, space is horizontal white space. |
| 219 | " " =~ /\v/ # No match, space is not vertical white space. |
| 220 | "\r" =~ /\v/ # Match, a return is vertical white space. |
| 221 | |
| 222 | "a" =~ /\pL/ # Match, "a" is a letter. |
| 223 | "a" =~ /\p{Lu}/ # No match, /\p{Lu}/ matches upper case letters. |
| 224 | |
| 225 | "\x{0e0b}" =~ /\p{Thai}/ # Match, \x{0e0b} is the character |
| 226 | # 'THAI CHARACTER SO SO', and that's in |
| 227 | # the Thai Unicode class. |
| 228 | "a" =~ /\P{Lao}/ # Match, as "a" is not a Lao character. |
| 229 | |
| 230 | |
| 231 | =head2 Bracketed Character Classes |
| 232 | |
| 233 | The third form of character class you can use in Perl regular expressions |
| 234 | is the bracketed form. In its simplest form, it lists the characters |
| 235 | that may be matched inside square brackets, like this: C<[aeiou]>. |
| 236 | This matches one of C<a>, C<e>, C<i>, C<o> or C<u>. Just as with the other |
| 237 | character classes, exactly one character will be matched. To match |
| 238 | a longer string consisting of characters mentioned in the character |
| 239 | class, follow the character class with a quantifier. For instance, |
| 240 | C<[aeiou]+> matches a string of one or more lowercase ASCII vowels. |
| 241 | |
| 242 | Repeating a character in a character class has no |
| 243 | effect; it's considered to be in the set only once. |
| 244 | |
| 245 | Examples: |
| 246 | |
| 247 | "e" =~ /[aeiou]/ # Match, as "e" is listed in the class. |
| 248 | "p" =~ /[aeiou]/ # No match, "p" is not listed in the class. |
| 249 | "ae" =~ /^[aeiou]$/ # No match, a character class only matches |
| 250 | # a single character. |
| 251 | "ae" =~ /^[aeiou]+$/ # Match, due to the quantifier. |
| 252 | |
| 253 | =head3 Special Characters Inside a Bracketed Character Class |
| 254 | |
| 255 | Most characters that are meta characters in regular expressions (that |
| 256 | is, characters that carry a special meaning like C<*> or C<(>) lose |
| 257 | their special meaning and can be used inside a character class without |
| 258 | the need to escape them. For instance, C<[()]> matches either an opening |
| 259 | parenthesis, or a closing parenthesis, and the parens inside the character |
| 260 | class don't group or capture. |
| 261 | |
| 262 | Characters that may carry a special meaning inside a character class are: |
| 263 | C<\>, C<^>, C<->, C<[> and C<]>, and are discussed below. They can be |
| 264 | escaped with a backslash, although this is sometimes not needed, in which |
| 265 | case the backslash may be omitted. |
| 266 | |
| 267 | The sequence C<\b> is special inside a bracketed character class. While |
| 268 | outside the character class C<\b> is an assertion indicating a point |
| 269 | that does not have either two word characters or two non-word characters |
| 270 | on either side, inside a bracketed character class, C<\b> matches a |
| 271 | backspace character. |
| 272 | |
| 273 | A C<[> is not special inside a character class, unless it's the start |
| 274 | of a POSIX character class (see below). It normally does not need escaping. |
| 275 | |
| 276 | A C<]> is either the end of a POSIX character class (see below), or it |
| 277 | signals the end of the bracketed character class. Normally it needs |
| 278 | escaping if you want to include a C<]> in the set of characters. |
| 279 | However, if the C<]> is the I<first> (or the second if the first |
| 280 | character is a caret) character of a bracketed character class, it |
| 281 | does not denote the end of the class (as you cannot have an empty class) |
| 282 | and is considered part of the set of characters that can be matched without |
| 283 | escaping. |
| 284 | |
| 285 | Examples: |
| 286 | |
| 287 | "+" =~ /[+?*]/ # Match, "+" in a character class is not special. |
| 288 | "\cH" =~ /[\b]/ # Match, \b inside a character class |
| 289 | # is equivalent to a backspace. |
| 290 | "]" =~ /[][]/ # Match, as the character class contains |
| 291 | # both [ and ]. |
| 292 | "[]" =~ /[[]]/ # Match, the pattern contains a character class |
| 293 | # containing just [, and the character class is |
| 294 | # followed by a ]. |
| 295 | |
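The rules for C<]>, C<\b> and ordinary meta characters can be combined in a short sketch. The pattern and variable names are assumptions chosen for the example; each result is 1 for a match and 0 otherwise.

```perl
use strict;
use warnings;

# A leading ] is taken literally, so this class holds both ] and [.
my $brackets_only = qr/^[][]+$/;

my $all_brackets = "[[]]" =~ $brackets_only ? 1 : 0;  # only [ and ]
my $with_letter  = "[a]"  =~ $brackets_only ? 1 : 0;  # "a" is not in the class

# Inside a class, \b is the backspace character (code point 8).
my $backspace = "\x08" =~ /^[\b]$/ ? 1 : 0;

# Quantifiers and parentheses are plain characters inside a class.
my $metas = "(+)" =~ /^[()+*?]+$/ ? 1 : 0;

print "$all_brackets $with_letter $backspace $metas\n";
```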
| 296 | =head3 Character Ranges |
| 297 | |
| 298 | It is not uncommon to want to match a range of characters. Luckily, instead |
| 299 | of listing all the characters in the range, one may use the hyphen (C<->). |
| 300 | If inside a bracketed character class you have two characters separated |
| 301 | by a hyphen, it's treated as if all the characters between the two are in |
| 302 | the class. For instance, C<[0-9]> matches any ASCII digit, and C<[a-m]> |
| 303 | matches any lowercase letter from the first half of the ASCII alphabet. |
| 304 | |
| 305 | Note that the two characters on either side of the hyphen need not |
| 306 | both be letters or both be digits. Any character is possible, |
| 307 | although not advisable. C<['-?]> contains a range of characters, but |
| 308 | most people will not know which characters those will be. Furthermore, |
| 309 | such ranges may lead to portability problems if the code has to run on |
| 310 | a platform that uses a different character set, such as EBCDIC. |
| 311 | |
| 312 | If a hyphen in a character class cannot be part of a range, for instance |
| 313 | because it is the first or the last character of the character class, |
| 314 | or if it immediately follows a range, the hyphen isn't special, and will be |
| 315 | considered a character that may be matched. You have to escape the hyphen |
| 316 | with a backslash if you want to have a hyphen in your set of characters to |
| 317 | be matched, and its position in the class is such that it can be considered |
| 318 | part of a range. |
| 319 | |
| 320 | Examples: |
| 321 | |
| 322 | [a-z] # Matches a character that is a lower case ASCII letter. |
| 323 | [a-fz] # Matches any letter between 'a' and 'f' (inclusive) or the |
| 324 | # letter 'z'. |
| 325 | [-z] # Matches either a hyphen ('-') or the letter 'z'. |
| 326 | [a-f-m] # Matches any letter between 'a' and 'f' (inclusive), the |
| 327 | # hyphen ('-'), or the letter 'm'. |
| 328 | ['-?] # Matches any of the characters '()*+,-./0123456789:;<=>? |
| 329 | # (But not on an EBCDIC platform). |
| 330 | |
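How the position of a hyphen changes the set can be seen by filtering a few candidate characters. This is a sketch assuming an ASCII platform; the candidate list is arbitrary.

```perl
use strict;
use warnings;

# Hypothetical single-character candidates.
my @candidates = ('a', '-', 'f', 'g', 'm');

# Trailing hyphen: the range a-f plus a literal hyphen.
my @trailing = grep { /^[a-f-]$/ } @candidates;

# Escaped hyphen: the three characters a, - and m, and no range at all.
my @escaped = grep { /^[a\-m]$/ } @candidates;

# Unescaped hyphen between two characters: the range a to m.
my @range = grep { /^[a-m]$/ } @candidates;

print "trailing: @trailing\n";
print "escaped:  @escaped\n";
print "range:    @range\n";
```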
| 331 | |
| 332 | =head3 Negation |
| 333 | |
| 334 | It is also possible to instead list the characters you do not want to |
| 335 | match. You can do so by using a caret (C<^>) as the first character in the |
| 336 | character class. For instance, C<[^a-z]> matches a character that is not a |
| 337 | lowercase ASCII letter. |
| 338 | |
| 339 | This syntax makes the caret a special character inside a bracketed character |
| 340 | class, but only if it is the first character of the class. So if you want |
| 341 | to have the caret as one of the characters you want to match, you either |
| 342 | have to escape the caret, or not list it first. |
| 343 | |
| 344 | Examples: |
| 345 | |
| 346 | "e" =~ /[^aeiou]/ # No match, the 'e' is listed. |
| 347 | "x" =~ /[^aeiou]/ # Match, as 'x' isn't a lowercase vowel. |
| 348 | "^" =~ /[^^]/ # No match, matches anything that isn't a caret. |
| 349 | "^" =~ /[x^]/ # Match, caret is not special here. |
| 350 | |
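A negated class is often handy for deleting everything outside a wanted set. The following sketch assumes a made-up, phone-number-like string; the variable names are illustrative.

```perl
use strict;
use warnings;

# Hypothetical input.
my $phone = "+1 (555) 010-9999";

# Remove every character that is not an ASCII digit.
(my $digits = $phone) =~ s/[^0-9]//g;

# A caret that is not first in the class is an ordinary character.
my $has_caret = "a^b" =~ /[x^]/ ? 1 : 0;

print "$digits\n";
print "caret found: $has_caret\n";
```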
| 351 | =head3 Backslash Sequences |
| 352 | |
| 353 | You can put a backslash sequence character class inside a bracketed character |
| 354 | class, and it will act just as if you put all the characters matched by |
| 355 | the backslash sequence inside the character class. For instance, |
| 356 | C<[a-f\d]> will match any digit, or any of the lowercase letters between |
| 357 | 'a' and 'f' inclusive. |
| 358 | |
| 359 | Examples: |
| 360 | |
| 361 | /[\p{Thai}\d]/ # Matches a character that is either a Thai |
| 362 | # character, or a digit. |
| 363 | /[^\p{Arabic}()]/ # Matches a character that is neither an Arabic |
| 364 | # character, nor a parenthesis. |
| 365 | |
| 366 | Backslash sequence character classes cannot form one of the endpoints |
| 367 | of a range. |
| 368 | |
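As a sketch of mixing a range with a backslash sequence, C<[a-f\d]> can serve as a check for lowercase hexadecimal strings. The inputs are ASCII-only and hypothetical, and the pattern name is an assumption of the example.

```perl
use strict;
use warnings;

# One or more characters, each a digit or a letter from a to f.
my $lower_hex = qr/^[a-f\d]+$/;

my $ok_lower  = "deadbeef42" =~ $lower_hex ? 1 : 0;  # all in the class
my $bad_upper = "DEADBEEF"   =~ $lower_hex ? 1 : 0;  # uppercase is not listed
my $bad_punct = "cafe-01"    =~ $lower_hex ? 1 : 0;  # the hyphen is not listed

print "$ok_lower $bad_upper $bad_punct\n";
```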
| 369 | =head3 Posix Character Classes |
| 370 | |
| 371 | POSIX character classes have the form C<[:class:]>, where I<class> is the |
| 372 | name, and C<[:> and C<:]> are the delimiters. POSIX character classes appear |
| 373 | I<inside> bracketed character classes, and are a convenient and descriptive |
| 374 | way of listing a group of characters. Be careful about the syntax: |
| 375 | |
| 376 | # Correct: |
| 377 | $string =~ /[[:alpha:]]/ |
| 378 | |
| 379 | # Incorrect (will warn): |
| 380 | $string =~ /[:alpha:]/ |
| 381 | |
| 382 | The latter pattern would be a character class consisting of a colon, |
| 383 | and the letters C<a>, C<l>, C<p> and C<h>. |
| 384 | |
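The difference between the nested and the unnested form can be observed with two test characters. This is a sketch only; it switches off the C<regexp> warning category so that the deliberately incorrect pattern compiles quietly, and the variable names are illustrative.

```perl
use strict;
use warnings;
no warnings 'regexp';   # the unnested form below would otherwise warn

# Correct: a POSIX class nested inside a bracketed class.
my $nested_letter = "z" =~ /[[:alpha:]]/ ? 1 : 0;
my $nested_colon  = ":" =~ /[[:alpha:]]/ ? 1 : 0;

# Incorrect: just a set made of ":", "a", "l", "p" and "h".
my $bare_letter = "z" =~ /[:alpha:]/ ? 1 : 0;
my $bare_colon  = ":" =~ /[:alpha:]/ ? 1 : 0;

print "nested: $nested_letter $nested_colon\n";
print "bare:   $bare_letter $bare_colon\n";
```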
| 385 | Perl recognizes the following POSIX character classes: |
| 386 | |
| 387 | alpha Any alphabetical character. |
| 388 | alnum Any alphanumerical character. |
| 389 | ascii Any ASCII character. |
| 390 | blank A GNU extension, equal to a space or a horizontal tab (C<\t>). |
| 391 | cntrl Any control character. |
| 392 | digit Any digit, equivalent to C<\d>. |
| 393 | graph Any printable character, excluding a space. |
| 394 | lower Any lowercase character. |
| 395 | print Any printable character, including a space. |
| 396 | punct Any punctuation character. |
| 397 | space Any white space character. C<\s> plus the vertical tab (C<\cK>). |
| 398 | upper Any uppercase character. |
| 399 | word Any "word" character, equivalent to C<\w>. |
| 400 | xdigit Any hexadecimal digit, '0' - '9', 'a' - 'f', 'A' - 'F'. |
| 401 | |
| 402 | The exact set of characters matched depends on whether the source string |
| 403 | is internally in UTF-8 format or not. See L</Locale, Unicode and UTF-8>. |
| 404 | |
| 405 | Most POSIX character classes have C<\p> counterparts. The difference |
| 406 | is that the C<\p> classes will always match according to the Unicode |
| 407 | properties, regardless of whether the string is in UTF-8 format or not. |
| 408 | |
| 409 | The following table shows the relation between POSIX character classes |
| 410 | and the Unicode properties: |
| 411 | |
| 412 | [[:...:]] \p{...} backslash |
| 413 | |
| 414 | alpha IsAlpha |
| 415 | alnum IsAlnum |
| 416 | ascii IsASCII |
| 417 | blank |
| 418 | cntrl IsCntrl |
| 419 | digit IsDigit \d |
| 420 | graph IsGraph |
| 421 | lower IsLower |
| 422 | print IsPrint |
| 423 | punct IsPunct |
| 424 | space IsSpace |
| 425 | IsSpacePerl \s |
| 426 | upper IsUpper |
| 427 | word IsWord |
| 428 | xdigit IsXDigit |
| 429 | |
| 430 | Some character classes may have a non-obvious name: |
| 431 | |
| 432 | =over 4 |
| 433 | |
| 434 | =item cntrl |
| 435 | |
| 436 | Any control character. Usually, control characters don't produce output |
| 437 | as such, but instead control the terminal somehow: for example newline |
| 438 | and backspace are control characters. All characters with C<ord()> less |
| 439 | than 32 are usually classified as control characters (in ASCII, the ISO |
| 440 | Latin character sets, and Unicode), as is the character with an C<ord()> value |
| 441 | of 127 (C<DEL>). |
| 442 | |
| 443 | =item graph |
| 444 | |
| 445 | Any character that is I<graphical>, that is, visible. This class consists |
| 446 | of all the alphanumerical characters and all punctuation characters. |
| 447 | |
| 448 | =item print |
| 449 | |
| 450 | All printable characters, which is the set of all the graphical characters |
| 451 | plus the space. |
| 452 | |
| 453 | =item punct |
| 454 | |
| 455 | Any punctuation (special) character. |
| 456 | |
| 457 | =back |
| 458 | |
| 459 | =head4 Negation |
| 460 | |
| 461 | A Perl extension to the POSIX character class is the ability to |
| 462 | negate it. This is done by prefixing the class name with a caret (C<^>). |
| 463 | Some examples: |
| 464 | |
| 465 | POSIX Unicode Backslash |
| 466 | [[:^digit:]] \P{IsDigit} \D |
| 467 | [[:^space:]] \P{IsSpace} \S |
| 468 | [[:^word:]] \P{IsWord} \W |
| 469 | |
| 470 | =head4 [= =] and [. .] |
| 471 | |
| 472 | Perl will recognize the POSIX character classes C<[=class=]> and |
| 473 | C<[.class.]>, but does not (yet?) support these constructs. Use of |
| 474 | such a construct will lead to an error. |
| 475 | |
| 476 | |
| 477 | =head4 Examples |
| 478 | |
| 479 | /[[:digit:]]/ # Matches a character that is a digit. |
| 480 | /[01[:lower:]]/ # Matches a character that is either a |
| 481 | # lowercase letter, or '0' or '1'. |
| 482 | /[[:digit:][:^xdigit:]]/ # Matches a character that can be anything, |
| 483 | # but the letters 'a' to 'f' in either case. |
| 484 | # This is because the character class contains |
| 485 | # all digits, and anything that isn't a |
| 486 | # hex digit, resulting in a class containing |
| 487 | # all characters, but the letters 'a' to 'f' |
| 488 | # and 'A' to 'F'. |
| 489 | |
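The combined class from the last example can be exercised against a handful of ASCII characters. This is a minimal sketch; the candidate list is arbitrary and the variable names are assumptions of the example.

```perl
use strict;
use warnings;

# Digits, plus everything that is not a hexadecimal digit.
my $class = qr/^[[:digit:][:^xdigit:]]$/;

# Hypothetical candidates: a digit, two non-hex letters, two hex letters.
my @accepted = grep { $_ =~ $class } qw(5 g Z a F);

# A negated POSIX class and its backslash counterpart agree on ASCII input.
my $posix_says = "x" =~ /^[[:^digit:]]$/ ? 1 : 0;
my $slash_says = "x" =~ /^\D$/           ? 1 : 0;

print "accepted: @accepted\n";
print "agree: ", ($posix_says == $slash_says ? "yes" : "no"), "\n";
```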
| 490 | |
| 491 | =head2 Locale, Unicode and UTF-8 |
| 492 | |
| 493 | Some of the character classes have a somewhat different behaviour depending |
| 494 | on the internal encoding of the source string, and the locale that is |
| 495 | in effect. |
| 496 | |
| 497 | C<\w>, C<\d>, C<\s> and the POSIX character classes (and their negations, |
| 498 | including C<\W>, C<\D>, C<\S>) suffer from this behaviour. |
| 499 | |
| 500 | The rule is that if the source string is in UTF-8 format, the character |
| 501 | classes match according to the Unicode properties. If the source string |
| 502 | isn't, then the character classes match according to whatever locale is |
| 503 | in effect. If there is no locale, they match the ASCII defaults |
| 504 | (52 letters, 10 digits and underscore for C<\w>, 0 to 9 for C<\d>, etc). |
| 505 | |
| 506 | This usually means that if you are matching against characters whose C<ord()> |
| 507 | values are between 128 and 255 inclusive, your character class may match |
| 508 | or not depending on the current locale, and whether the source string is |
| 509 | in UTF-8 format. The string will be in UTF-8 format if it contains |
| 510 | characters whose C<ord()> value exceeds 255. But a string may be in UTF-8 |
| 511 | format without it having such characters. |
| 512 | |
| 513 | For portability reasons, it may be better to not use C<\w>, C<\d>, C<\s> |
| 514 | or the POSIX character classes, and use the Unicode properties instead. |
| 515 | |
| 516 | =head4 Examples |
| 517 | |
| 518 | $str = "\xDF"; # $str is not in UTF-8 format. |
| 519 | $str =~ /^\w/; # No match, as $str isn't in UTF-8 format. |
| 520 | $str .= "\x{0e0b}"; # Now $str is in UTF-8 format. |
| 521 | $str =~ /^\w/; # Match! $str is now in UTF-8 format. |
| 522 | chop $str; |
| 523 | $str =~ /^\w/; # Still a match! $str remains in UTF-8 format. |
| 524 | |
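The dependence on the internal encoding can also be shown without appending a wide character, using the core functions C<utf8::is_utf8> and C<utf8::upgrade>. This is a sketch that assumes no locale is in effect and that the C<unicode_strings> feature is not enabled; the variable names are illustrative.

```perl
use strict;
use warnings;

my $str = "\xDF";   # one character, stored as a single byte

my $flag_before = utf8::is_utf8($str) ? 1 : 0;
my $word_before = $str =~ /^\w/       ? 1 : 0;  # ASCII rules apply here
my $prop_before = $str =~ /^\p{L}/    ? 1 : 0;  # \p always uses Unicode rules

# Switch the internal representation to UTF-8; the character is unchanged.
utf8::upgrade($str);

my $flag_after = utf8::is_utf8($str) ? 1 : 0;
my $word_after = $str =~ /^\w/       ? 1 : 0;   # Unicode rules apply now
my $prop_after = $str =~ /^\p{L}/    ? 1 : 0;

print "before: flag=$flag_before \\w=$word_before \\p{L}=$prop_before\n";
print "after:  flag=$flag_after \\w=$word_after \\p{L}=$prop_after\n";
```

Only the C<\w> result changes between the two runs, which is why the Unicode properties are the more portable choice.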
| 525 | =cut |