| 1 | =head1 NAME |
| 2 | X<character class> |
| 3 | |
| 4 | perlrecharclass - Perl Regular Expression Character Classes |
| 5 | |
| 6 | =head1 DESCRIPTION |
| 7 | |
| 8 | The top level documentation about Perl regular expressions |
| 9 | is found in L<perlre>. |
| 10 | |
| 11 | This manual page discusses the syntax and use of character |
| 12 | classes in Perl Regular Expressions. |
| 13 | |
| 14 | A character class is a way of denoting a set of characters, |
| 15 | in such a way that one character of the set is matched. |
| 16 | It's important to remember that matching a character class |
| 17 | consumes exactly one character in the source string. (The source |
| 18 | string is the string the regular expression is matched against.) |
| 19 | |
| 20 | There are three types of character classes in Perl regular |
| 21 | expressions: the dot, backslashed sequences, and the form enclosed in square |
| 22 | brackets. Keep in mind, though, that the term "character class" is often used |
| 23 | to mean just the bracketed form. Other Perl documentation often uses it that way. |
| 24 | |
| 25 | =head2 The dot |
| 26 | |
| 27 | The dot (or period), C<.>, is probably the most used, and certainly |
| 28 | the best-known, character class. By default, a dot matches any |
| 29 | character, except for the newline. The default can be changed to |
| 30 | add matching the newline with the I<single line> modifier: either |
| 31 | for the entire regular expression using the C</s> modifier, or |
| 32 | locally using C<(?s)>. |
| 33 | |
| 34 | Here are some examples: |
| 35 | |
| 36 | "a" =~ /./ # Match |
| 37 | "." =~ /./ # Match |
| 38 | "" =~ /./ # No match (dot has to match a character) |
| 39 | "\n" =~ /./ # No match (dot does not match a newline) |
| 40 | "\n" =~ /./s # Match (global 'single line' modifier) |
| 41 | "\n" =~ /(?s:.)/ # Match (local 'single line' modifier) |
| 42 | "ab" =~ /^.$/ # No match (dot matches one character) |
| 43 | |
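The effect of the I<single line> modifier is easiest to see on a string that
contains a newline. The following is a minimal sketch; the sample string and
the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample: two lines separated by a newline.
my $text = "one\ntwo";

# Without /s the dot stops at the newline, so only the first line is captured.
my ($first) = $text =~ /^(.+)/;

# With /s the dot also matches the newline, so the capture spans both lines.
my ($all) = $text =~ /^(.+)/s;

printf "without /s: %d characters\n", length($first);
printf "with /s:    %d characters\n", length($all);
```
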
| 44 | =head2 Backslashed sequences |
| 45 | X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\p> X<\P> |
| 46 | X<\N> X<\v> X<\V> X<\h> X<\H> |
| 47 | X<word> X<whitespace> |
| 48 | |
| 49 | Perl regular expressions contain many backslashed sequences that |
| 50 | constitute a character class. That is, they will match a single |
| 51 | character, if that character belongs to a specific set of characters |
| 52 | (defined by the sequence). A backslashed sequence is a sequence of |
| 53 | characters starting with a backslash. Not all backslashed sequences |
| 54 | are character classes; for a full list, see L<perlrebackslash>. |
| 55 | |
| 56 | Here's a list of the backslashed sequences that are character classes. They |
| 57 | are discussed in more detail below. |
| 58 | |
| 59 | \d Match a digit character. |
| 60 | \D Match a non-digit character. |
| 61 | \w Match a "word" character. |
| 62 | \W Match a non-"word" character. |
| 63 | \s Match a whitespace character. |
| 64 | \S Match a non-whitespace character. |
| 65 | \h Match a horizontal whitespace character. |
| 66 | \H Match a character that isn't horizontal whitespace. |
| 67 | \N Match a character that isn't a newline. Experimental. |
| 68 | \v Match a vertical whitespace character. |
| 69 | \V Match a character that isn't vertical whitespace. |
| 70 | \pP, \p{Prop} Match a character matching a Unicode property. |
| 71 | \PP, \P{Prop} Match a character that doesn't match a Unicode property. |
| 72 | |
| 73 | =head3 Digits |
| 74 | |
| 75 | C<\d> matches a single character that is considered to be a I<digit>. What is |
| 76 | considered a digit depends on the internal encoding of the source string and |
| 77 | the locale that is in effect. If the source string is in UTF-8 format, C<\d> |
| 78 | not only matches the digits '0' - '9', but also Arabic, Devanagari and digits |
| 79 | from other languages. Otherwise, if there is a locale in effect, it will match |
| 80 | whatever characters the locale considers digits. Without a locale, C<\d> |
| 81 | matches the digits '0' to '9'. See L</Locale, EBCDIC, Unicode and UTF-8>. |
| 82 | |
| 83 | Any character that isn't matched by C<\d> will be matched by C<\D>. |
| 84 | |
| 85 | =head3 Word characters |
| 86 | |
| 87 | A C<\w> matches a single alphanumeric character (an alphabetic character, or a |
| 88 | decimal digit) or an underscore (C<_>), not a whole word. Use C<\w+> to match |
| 89 | a string of Perl-identifier characters (which isn't the same as matching an |
| 90 | English word). What is considered a word character depends on the internal |
| 91 | encoding of the string and the locale or EBCDIC code page that is in effect. If |
| 92 | it's in UTF-8 format, C<\w> matches those characters that are considered word |
| 93 | characters in the Unicode database. That is, it not only matches ASCII letters, |
| 94 | but also Thai letters, Greek letters, etc. If the source string isn't in UTF-8 |
| 95 | format, C<\w> matches those characters that are considered word characters by |
| 96 | the current locale or EBCDIC code page. Without a locale or EBCDIC code page, |
| 97 | C<\w> matches the ASCII letters, digits and the underscore. |
| 98 | See L</Locale, EBCDIC, Unicode and UTF-8>. |
| 99 | |
| 100 | Any character that isn't matched by C<\w> will be matched by C<\W>. |
| 101 | |
| 102 | =head3 Whitespace |
| 103 | |
| 104 | C<\s> matches any single character that is considered whitespace. In the ASCII |
| 105 | range, C<\s> matches the horizontal tab (C<\t>), the new line (C<\n>), the form |
| 106 | feed (C<\f>), the carriage return (C<\r>), and the space. (The vertical tab, |
| 107 | C<\cK>, is not matched by C<\s>.) The exact set of characters matched by C<\s> |
| 108 | depends on whether the source string is in UTF-8 format and the locale or |
| 109 | EBCDIC code page that is in effect. If it's in UTF-8 format, C<\s> matches what |
| 110 | is considered whitespace in the Unicode database; the complete list is in the |
| 111 | table below. Otherwise, if there is a locale or EBCDIC code page in effect, |
| 112 | C<\s> matches whatever is considered whitespace by the current locale or EBCDIC |
| 113 | code page. Without a locale or EBCDIC code page, C<\s> matches the five |
| 114 | characters mentioned in the beginning of this paragraph. Perhaps the most |
| 115 | notable possible surprise is that C<\s> matches a non-breaking space only if |
| 116 | the non-breaking space is in a UTF-8 encoded string or the locale or EBCDIC |
| 117 | code page that is in effect has that character. |
| 118 | See L</Locale, EBCDIC, Unicode and UTF-8>. |
| 119 | |
| 120 | Any character that isn't matched by C<\s> will be matched by C<\S>. |
| 121 | |
| 122 | C<\h> will match any character that is considered horizontal whitespace; |
| 123 | this includes the space and the tab characters and 17 other characters that are |
| 124 | listed in the table below. C<\H> will match any character |
| 125 | that is not considered horizontal whitespace. |
| 126 | |
| 127 | C<\N> is new in 5.12, and is experimental. It, like the dot, will match any |
| 128 | character that is not a newline. The difference is that C<\N> will not be |
| 129 | influenced by the single line C</s> regular expression modifier. Note that |
| 130 | there is a second meaning of C<\N> when it is of the form C<\N{...}>. This form is |
| 131 | for named characters. See L<charnames> for those. If C<\N> is followed by an |
| 132 | opening brace and something that is not a quantifier, perl will assume that a |
| 133 | character name is coming, and not this meaning of C<\N>. For example, C<\N{3}> |
| 134 | means to match 3 non-newlines; C<\N{5,}> means to match 5 or more non-newlines, |
| 135 | but C<\N{4F}> and C<\N{F4}> are not legal quantifiers, and will cause perl to |
| 136 | look for characters named C<4F> or C<F4>, respectively (and won't find them, |
| 137 | thus raising an error, unless they have been defined using custom names). |
| 138 | |
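The difference between C<\N> and the dot under C</s>, and the quantifier
reading of C<\N{3}>, can be sketched as follows. This assumes a perl that
supports C<\N> (5.12 or later); the variable names are illustrative only.

```perl
use strict;
use warnings;

my $newline = "\n";

# Under /s the dot matches a newline ...
my $dot_matches = ($newline =~ /./s) ? 1 : 0;

# ... but \N is not influenced by /s and still rejects it.
my $n_matches = ($newline =~ /\N/s) ? 1 : 0;

# \N followed by a quantifier in braces: exactly three non-newlines.
my $three = ("abc" =~ /^\N{3}$/) ? 1 : 0;

print "dot=$dot_matches N=$n_matches three=$three\n";
```
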
| 139 | C<\v> will match any character that is considered vertical whitespace; |
| 140 | this includes the carriage return and line feed characters (newline) plus 5 |
| 141 | other characters listed in the table below. |
| 142 | C<\V> will match any character that is not considered vertical whitespace. |
| 143 | |
| 144 | C<\R> matches anything that can be considered a newline under Unicode |
| 145 | rules. It's not a character class, as it can match a multi-character |
| 146 | sequence. Therefore, it cannot be used inside a bracketed character |
| 147 | class; use C<\v> instead (vertical whitespace). |
| 148 | Details are discussed in L<perlrebackslash>. |
| 149 | |
| 150 | Note that unlike C<\s>, C<\d> and C<\w>, C<\h> and C<\v> always match |
| 151 | the same characters, regardless of whether the source string is in UTF-8 |
| 152 | format or not. The set of characters they match is also not influenced |
| 153 | by locale or EBCDIC code page. |
| 154 | |
| 155 | One might think that C<\s> is equivalent to C<[\h\v]>. This is not true. The |
| 156 | vertical tab (C<"\x0b">) is not matched by C<\s>; it is, however, considered |
| 157 | vertical whitespace. Furthermore, if the source string is not in UTF-8 format, |
| 158 | and any locale or EBCDIC code page that is in effect doesn't include them, the |
| 159 | next line (C<"\x85">) and the no-break space (C<"\xA0">) characters are not |
| 160 | matched by C<\s>, but are by C<\v> and C<\h> respectively. If the source |
| 161 | string is in UTF-8 format, both the next line and the no-break space are |
| 162 | matched by C<\s>. |
| 163 | |
| 164 | The following table is a complete listing of characters matched by |
| 165 | C<\s>, C<\h> and C<\v> as of Unicode 5.2. |
| 166 | |
| 167 | The first column gives the code point of the character (in hex format), |
| 168 | the second column gives the (Unicode) name. The third column indicates |
| 169 | by which class(es) the character is matched (assuming no locale or EBCDIC code |
| 170 | page is in effect that changes the C<\s> matching). |
| 171 | |
| 172 | 0x00009 CHARACTER TABULATION h s |
| 173 | 0x0000a LINE FEED (LF) vs |
| 174 | 0x0000b LINE TABULATION v |
| 175 | 0x0000c FORM FEED (FF) vs |
| 176 | 0x0000d CARRIAGE RETURN (CR) vs |
| 177 | 0x00020 SPACE h s |
| 178 | 0x00085 NEXT LINE (NEL) vs [1] |
| 179 | 0x000a0 NO-BREAK SPACE h s [1] |
| 180 | 0x01680 OGHAM SPACE MARK h s |
| 181 | 0x0180e MONGOLIAN VOWEL SEPARATOR h s |
| 182 | 0x02000 EN QUAD h s |
| 183 | 0x02001 EM QUAD h s |
| 184 | 0x02002 EN SPACE h s |
| 185 | 0x02003 EM SPACE h s |
| 186 | 0x02004 THREE-PER-EM SPACE h s |
| 187 | 0x02005 FOUR-PER-EM SPACE h s |
| 188 | 0x02006 SIX-PER-EM SPACE h s |
| 189 | 0x02007 FIGURE SPACE h s |
| 190 | 0x02008 PUNCTUATION SPACE h s |
| 191 | 0x02009 THIN SPACE h s |
| 192 | 0x0200a HAIR SPACE h s |
| 193 | 0x02028 LINE SEPARATOR vs |
| 194 | 0x02029 PARAGRAPH SEPARATOR vs |
| 195 | 0x0202f NARROW NO-BREAK SPACE h s |
| 196 | 0x0205f MEDIUM MATHEMATICAL SPACE h s |
| 197 | 0x03000 IDEOGRAPHIC SPACE h s |
| 198 | |
| 199 | =over 4 |
| 200 | |
| 201 | =item [1] |
| 202 | |
| 203 | NEXT LINE and NO-BREAK SPACE only match C<\s> if the source string is in |
| 204 | UTF-8 format, or the locale or EBCDIC code page that is in effect includes them. |
| 205 | |
| 206 | =back |
| 207 | |
| 208 | It is worth noting that C<\d>, C<\w>, etc, match single characters, not |
| 209 | complete numbers or words. To match a number (that consists of digits), |
| 210 | use C<\d+>; to match a word, use C<\w+>. |
| 211 | |
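As a short illustration of the single-character rule, the sketch below
captures with and without a quantifier. The input string is a made-up
example.

```perl
use strict;
use warnings;

# Hypothetical input line.
my $line = "order 66 shipped";

my ($digit)  = $line =~ /(\d)/;    # one digit only: "6"
my ($number) = $line =~ /(\d+)/;   # the whole run of digits: "66"
my ($word)   = $line =~ /(\w+)/;   # first run of word characters: "order"

print "$digit $number $word\n";
```
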
| 212 | |
| 213 | =head3 Unicode Properties |
| 214 | |
| 215 | C<\pP> and C<\p{Prop}> are character classes to match characters that fit given |
| 216 | Unicode properties. One-letter property names can be used in the C<\pP> form, |
| 217 | with the property name following the C<\p>; otherwise, braces are required. |
| 218 | When using braces, there is a single form, which is just the property name |
| 219 | enclosed in the braces, and a compound form which looks like C<\p{name=value}>, |
| 220 | which means to match if the property "name" for the character has the particular |
| 221 | "value". |
| 222 | For instance, a match for a number can be written as C</\pN/> or as |
| 223 | C</\p{Number}/>, or as C</\p{Number=True}/>. |
| 224 | Lowercase letters are matched by the property I<Lowercase_Letter>, whose |
| 225 | short form is I<Ll>. They need the braces, so are written as C</\p{Ll}/> or |
| 226 | C</\p{Lowercase_Letter}/>, or C</\p{General_Category=Lowercase_Letter}/> |
| 227 | (the underscores are optional). |
| 228 | C</\pLl/> is valid, but means something different. |
| 229 | It matches a two character string: a letter (Unicode property C<\pL>), |
| 230 | followed by a lowercase C<l>. |
| 231 | |
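The difference between the braced and unbraced spellings can be checked
directly. This is only a sketch; the test strings are arbitrary.

```perl
use strict;
use warnings;

# \p{Ll} is one character class: a single lowercase letter.
my $braced_lower = ("x" =~ /^\p{Ll}$/) ? 1 : 0;   # 1
my $braced_upper = ("X" =~ /^\p{Ll}$/) ? 1 : 0;   # 0

# \pLl is \pL (any letter) followed by a literal "l": two characters.
my $unbraced_pair   = ("Xl" =~ /^\pLl$/) ? 1 : 0; # 1
my $unbraced_single = ("x"  =~ /^\pLl$/) ? 1 : 0; # 0

print "$braced_lower $braced_upper $unbraced_pair $unbraced_single\n";
```
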
| 232 | For more details, see L<perlunicode/Unicode Character Properties>; for a |
| 233 | complete list of possible properties, see |
| 234 | L<perluniprops/Properties accessible through \p{} and \P{}>. |
| 235 | It is also possible to define your own properties. This is discussed in |
| 236 | L<perlunicode/User-Defined Character Properties>. |
| 237 | |
| 238 | |
| 239 | =head4 Examples |
| 240 | |
| 241 | "a" =~ /\w/ # Match, "a" is a 'word' character. |
| 242 | "7" =~ /\w/ # Match, "7" is a 'word' character as well. |
| 243 | "a" =~ /\d/ # No match, "a" isn't a digit. |
| 244 | "7" =~ /\d/ # Match, "7" is a digit. |
| 245 | " " =~ /\s/ # Match, a space is whitespace. |
| 246 | "a" =~ /\D/ # Match, "a" is a non-digit. |
| 247 | "7" =~ /\D/ # No match, "7" is not a non-digit. |
| 248 | " " =~ /\S/ # No match, a space is not non-whitespace. |
| 249 | |
| 250 | " " =~ /\h/ # Match, space is horizontal whitespace. |
| 251 | " " =~ /\v/ # No match, space is not vertical whitespace. |
| 252 | "\r" =~ /\v/ # Match, a return is vertical whitespace. |
| 253 | |
| 254 | "a" =~ /\pL/ # Match, "a" is a letter. |
| 255 | "a" =~ /\p{Lu}/ # No match, /\p{Lu}/ matches upper case letters. |
| 256 | |
| 257 | "\x{0e0b}" =~ /\p{Thai}/ # Match, \x{0e0b} is the character |
| 258 | # 'THAI CHARACTER SO SO', and that's in |
| 259 | # Thai Unicode class. |
| 260 | "a" =~ /\P{Lao}/ # Match, as "a" is not a Laotian character. |
| 261 | |
| 262 | |
| 263 | =head2 Bracketed Character Classes |
| 264 | |
| 265 | The third form of character class you can use in Perl regular expressions |
| 266 | is the bracketed form. In its simplest form, it lists the characters |
| 267 | that may be matched, surrounded by square brackets, like this: C<[aeiou]>. |
| 268 | This matches one of C<a>, C<e>, C<i>, C<o> or C<u>. Like the other |
| 269 | character classes, exactly one character will be matched. To match |
| 270 | a longer string consisting of characters mentioned in the character |
| 271 | class, follow the character class with a quantifier. For instance, |
| 272 | C<[aeiou]+> matches a string of one or more lowercase ASCII vowels. |
| 273 | |
| 274 | Repeating a character in a character class has no |
| 275 | effect; it's considered to be in the set only once. |
| 276 | |
| 277 | Examples: |
| 278 | |
| 279 | "e" =~ /[aeiou]/ # Match, as "e" is listed in the class. |
| 280 | "p" =~ /[aeiou]/ # No match, "p" is not listed in the class. |
| 281 | "ae" =~ /^[aeiou]$/ # No match, a character class only matches |
| 282 | # a single character. |
| 283 | "ae" =~ /^[aeiou]+$/ # Match, due to the quantifier. |
| 284 | |
| 285 | =head3 Special Characters Inside a Bracketed Character Class |
| 286 | |
| 287 | Most characters that are meta characters in regular expressions (that |
| 288 | is, characters that carry a special meaning like C<.>, C<*>, or C<(>) lose |
| 289 | their special meaning and can be used inside a character class without |
| 290 | the need to escape them. For instance, C<[()]> matches either an opening |
| 291 | parenthesis, or a closing parenthesis, and the parens inside the character |
| 292 | class don't group or capture. |
| 293 | |
| 294 | Characters that may carry a special meaning inside a character class are: |
| 295 | C<\>, C<^>, C<->, C<[> and C<]>, and are discussed below. They can be |
| 296 | escaped with a backslash, although this is sometimes not needed, in which |
| 297 | case the backslash may be omitted. |
| 298 | |
| 299 | The sequence C<\b> is special inside a bracketed character class. While |
| 300 | outside the character class C<\b> is an assertion indicating a point |
| 301 | that does not have either two word characters or two non-word characters |
| 302 | on either side, inside a bracketed character class, C<\b> matches a |
| 303 | backspace character. |
| 304 | |
| 305 | The sequences |
| 306 | C<\a>, |
| 307 | C<\c>, |
| 308 | C<\e>, |
| 309 | C<\f>, |
| 310 | C<\n>, |
| 311 | C<\N{I<NAME>}>, |
| 312 | C<\N{U+I<wide hex char>}>, |
| 313 | C<\r>, |
| 314 | C<\t>, |
| 315 | and |
| 316 | C<\x> |
| 317 | are also special and have the same meanings as they do outside a bracketed character |
| 318 | class. |
| 319 | |
| 320 | Also, a backslash followed by two or three octal digits is considered an octal |
| 321 | number. |
| 322 | |
| 323 | A C<[> is not special inside a character class, unless it's the start |
| 324 | of a POSIX character class (see below). It normally does not need escaping. |
| 325 | |
| 326 | A C<]> is normally either the end of a POSIX character class (see below), or it |
| 327 | signals the end of the bracketed character class. If you want to include a |
| 328 | C<]> in the set of characters, you must generally escape it. |
| 329 | However, if the C<]> is the I<first> (or the second if the first |
| 330 | character is a caret) character of a bracketed character class, it |
| 331 | does not denote the end of the class (as you cannot have an empty class) |
| 332 | and is considered part of the set of characters that can be matched without |
| 333 | escaping. |
| 334 | |
| 335 | Examples: |
| 336 | |
| 337 | "+" =~ /[+?*]/ # Match, "+" in a character class is not special. |
| 338 | "\cH" =~ /[\b]/ # Match, \b inside a character class |
| 339 | # is equivalent to a backspace. |
| 340 | "]" =~ /[][]/ # Match, as the character class contains |
| 341 | # both [ and ]. |
| 342 | "[]" =~ /[[]]/ # Match, the pattern contains a character class |
| 343 | # containing just [, and the character class is |
| 344 | # followed by a ]. |
| 345 | |
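The rules above can be exercised with a few anchored patterns. The sketch
below uses made-up test values.

```perl
use strict;
use warnings;

# A ] listed first is a literal member, so this class holds both ] and [.
my @brackets = grep { /^[][]$/ } ("[", "]", "x");

# Inside a bracketed class, \b is a backspace, not a word boundary.
my $backspace = ("\cH" =~ /^[\b]$/) ? 1 : 0;

# Parentheses are ordinary characters inside a class.
my $paren = ("(" =~ /^[()]$/) ? 1 : 0;

print "brackets: @brackets; backspace: $backspace; paren: $paren\n";
```
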
| 346 | =head3 Character Ranges |
| 347 | |
| 348 | It is not uncommon to want to match a range of characters. Luckily, instead |
| 349 | of listing all the characters in the range, one may use the hyphen (C<->). |
| 350 | If inside a bracketed character class you have two characters separated |
| 351 | by a hyphen, it's treated as if all the characters between the two are in |
| 352 | the class. For instance, C<[0-9]> matches any ASCII digit, and C<[a-m]> |
| 353 | matches any lowercase letter from the first half of the ASCII alphabet. |
| 354 | |
| 355 | Note that the two characters on either side of the hyphen are not |
| 356 | necessarily both letters or both digits. Any character is possible, |
| 357 | although not advisable. C<['-?]> contains a range of characters, but |
| 358 | most people will not know which characters that will be. Furthermore, |
| 359 | such ranges may lead to portability problems if the code has to run on |
| 360 | a platform that uses a different character set, such as EBCDIC. |
| 361 | |
| 362 | If a hyphen in a character class cannot syntactically be part of a range, for |
| 363 | instance because it is the first or the last character of the character class, |
| 364 | or if it immediately follows a range, the hyphen isn't special, and will be |
| 365 | considered a character that may be matched literally. You have to escape the |
| 366 | hyphen with a backslash if you want to have a hyphen in your set of characters |
| 367 | to be matched, and its position in the class is such that it could be |
| 368 | considered part of a range. |
| 369 | |
| 370 | Examples: |
| 371 | |
| 372 | [a-z] # Matches a character that is a lower case ASCII letter. |
| 373 | [a-fz] # Matches any letter between 'a' and 'f' (inclusive) or |
| 374 | # the letter 'z'. |
| 375 | [-z] # Matches either a hyphen ('-') or the letter 'z'. |
| 376 | [a-f-m] # Matches any letter between 'a' and 'f' (inclusive), the |
| 377 | # hyphen ('-'), or the letter 'm'. |
| 378 | ['-?] # Matches any of the characters '()*+,-./0123456789:;<=>? |
| 379 | # (But not on an EBCDIC platform). |
| 380 | |
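The hyphen rules can be verified by filtering a list of one-character
strings. This is a sketch with arbitrary candidates.

```perl
use strict;
use warnings;

# The second hyphen follows a range, so it is a literal hyphen.
my $class   = qr/^[a-f-m]$/;
my @matched = grep { $_ =~ $class } ("a", "c", "f", "-", "m", "g", "z");

# An escaped hyphen is always literal: this class is 'a', '-' and 'z'.
my @escaped = grep { /^[a\-z]$/ } ("a", "-", "z", "m");

print "matched: @matched\n";
print "escaped: @escaped\n";
```
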
| 381 | |
| 382 | =head3 Negation |
| 383 | |
| 384 | It is also possible to instead list the characters you do not want to |
| 385 | match. You can do so by using a caret (C<^>) as the first character in the |
| 386 | character class. For instance, C<[^a-z]> matches a character that is not a |
| 387 | lowercase ASCII letter. |
| 388 | |
| 389 | This syntax makes the caret a special character inside a bracketed character |
| 390 | class, but only if it is the first character of the class. So if you want |
| 391 | to have the caret as one of the characters you want to match, you either |
| 392 | have to escape the caret, or not list it first. |
| 393 | |
| 394 | Examples: |
| 395 | |
| 396 | "e" =~ /[^aeiou]/ # No match, the 'e' is listed. |
| 397 | "x" =~ /[^aeiou]/ # Match, as 'x' isn't a lowercase vowel. |
| 398 | "^" =~ /[^^]/ # No match, matches anything that isn't a caret. |
| 399 | "^" =~ /[x^]/ # Match, caret is not special here. |
| 400 | |
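A negated class is often combined with other class members. The sketch
below finds the first character that is neither a lowercase vowel nor
whitespace, and shows the two ways of matching a literal caret. The sample
string is hypothetical.

```perl
use strict;
use warnings;

# First character that is not a lowercase vowel and not whitespace.
my ($first_consonant) = "audio books" =~ /([^aeiou\s])/;   # "d"

# A caret that is not listed first is a literal caret.
my $caret_literal = ("^" =~ /^[x^]$/) ? 1 : 0;   # 1

# A leading caret negates; the second caret is the excluded character.
my $caret_negated = ("^" =~ /^[^^]$/) ? 1 : 0;   # 0

# An escaped caret is literal even when listed first.
my $caret_escaped = ("^" =~ /^[\^x]$/) ? 1 : 0;  # 1

print "$first_consonant $caret_literal $caret_negated $caret_escaped\n";
```
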
| 401 | =head3 Backslash Sequences |
| 402 | |
| 403 | You can put any backslash sequence character class (with the exception of |
| 404 | C<\N>) inside a bracketed character class, and it will act just |
| 405 | as if you put all the characters matched by the backslash sequence inside the |
| 406 | character class. For instance, C<[a-f\d]> will match any digit, or any of the |
| 407 | lowercase letters between 'a' and 'f' inclusive. |
| 408 | |
| 409 | C<\N> within a bracketed character class must be of the forms C<\N{I<name>}> or |
| 410 | C<\N{U+I<wide hex char>}> for the same reason that a dot C<.> inside a |
| 411 | bracketed character class loses its special meaning: it matches nearly |
| 412 | anything, which generally isn't what you want to happen. |
| 413 | |
| 414 | Examples: |
| 415 | |
| 416 | /[\p{Thai}\d]/ # Matches a character that is either a Thai |
| 417 | # character, or a digit. |
| 418 | /[^\p{Arabic}()]/ # Matches a character that is neither an Arabic |
| 419 | # character, nor a parenthesis. |
| 420 | |
| 421 | Backslash sequence character classes cannot form one of the endpoints |
| 422 | of a range. |
| 423 | |
| 424 | =head3 Posix Character Classes |
| 425 | X<character class> X<\p> X<\p{}> |
| 426 | X<alpha> X<alnum> X<ascii> X<blank> X<cntrl> X<digit> X<graph> |
| 427 | X<lower> X<print> X<punct> X<space> X<upper> X<word> X<xdigit> |
| 428 | |
| 429 | POSIX character classes have the form C<[:class:]>, where I<class> is the |
| 430 | name, and C<[:> and C<:]> are the delimiters. POSIX character classes only appear |
| 431 | I<inside> bracketed character classes, and are a convenient and descriptive |
| 432 | way of listing a group of characters, though they currently suffer from |
| 433 | portability issues (see below and L</Locale, EBCDIC, Unicode and UTF-8>). Be |
| 434 | careful about the syntax: |
| 435 | |
| 436 | # Correct: |
| 437 | $string =~ /[[:alpha:]]/ |
| 438 | |
| 439 | # Incorrect (will warn): |
| 440 | $string =~ /[:alpha:]/ |
| 441 | |
| 442 | The latter pattern would be a character class consisting of a colon, |
| 443 | and the letters C<a>, C<l>, C<p> and C<h>. |
| 444 | POSIX character classes can be part of a larger bracketed character class. For |
| 445 | example, |
| 446 | |
| 447 | [01[:alpha:]%] |
| 448 | |
| 449 | is valid and matches '0', '1', any alphabetic character, and the percent sign. |
| 450 | |
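As an illustration, the class shown above can be used with a quantifier to
accept only strings built from those characters. The candidate strings are
made up, and only ASCII input is used so that the result does not depend on
locale or string encoding.

```perl
use strict;
use warnings;

# Keep only strings made up entirely of '0', '1', alphabetic characters
# and the percent sign.
my @tokens = grep { /^[01[:alpha:]%]+$/ } ("a1%", "10", "2b", "x y");

print "accepted: @tokens\n";
```
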
| 451 | Perl recognizes the following POSIX character classes: |
| 452 | |
| 453 | alpha Any alphabetical character ("[A-Za-z]"). |
| 454 | alnum Any alphanumerical character ("[A-Za-z0-9]"). |
| 455 | ascii Any character in the ASCII character set. |
| 456 | blank A GNU extension, equal to a space or a horizontal tab ("\t"). |
| 457 | cntrl Any control character. See Note [2] below. |
| 458 | digit Any decimal digit ("[0-9]"), equivalent to "\d". |
| 459 | graph Any printable character, excluding a space. See Note [3] below. |
| 460 | lower Any lowercase character ("[a-z]"). |
| 461 | print Any printable character, including a space. See Note [4] below. |
| 462 | punct Any graphical character excluding "word" characters. Note [5]. |
| 463 | space Any whitespace character. "\s" plus the vertical tab ("\cK"). |
| 464 | upper Any uppercase character ("[A-Z]"). |
| 465 | word A Perl extension ("[A-Za-z0-9_]"), equivalent to "\w". |
| 466 | xdigit Any hexadecimal digit ("[0-9a-fA-F]"). |
| 467 | |
| 468 | Most POSIX character classes have two Unicode-style C<\p> property |
| 469 | counterparts. (They are not official Unicode properties, but Perl extensions |
| 470 | derived from official Unicode properties.) The table below shows the relation |
| 471 | between POSIX character classes and these counterparts. |
| 472 | |
| 473 | One counterpart, in the column labelled "ASCII-range Unicode" in |
| 474 | the table will only match characters in the ASCII range. (On EBCDIC platforms, |
| 475 | they match those characters which have ASCII equivalents.) |
| 476 | |
| 477 | The other counterpart, in the column labelled "Full-range Unicode", matches any |
| 478 | appropriate characters in the full Unicode character set. For example, |
| 479 | C<\p{Alpha}> will match not just the ASCII alphabetic characters, but any |
| 480 | character in the entire Unicode character set that is considered to be |
| 481 | alphabetic. |
| 482 | |
| 483 | (Each of the counterparts has various synonyms as well. |
| 484 | L<perluniprops/Properties accessible through \p{} and \P{}> lists all the |
| 485 | synonyms, plus all the characters matched by each of the ASCII-range |
| 486 | properties. For example C<\p{AHex}> is a synonym for C<\p{ASCII_Hex_Digit}>, |
| 487 | and any C<\p> property name can be prefixed with "Is" such as C<\p{IsAlpha}>.) |
| 488 | |
| 489 | Both the C<\p> forms are unaffected by any locale that is in effect, or whether |
| 490 | the string is in UTF-8 format or not, or whether the platform is EBCDIC or not. |
| 491 | In contrast, the POSIX character classes are affected. If the source string is |
| 492 | in UTF-8 format, the POSIX classes (with the exception of C<[[:punct:]]>, see |
| 493 | Note [5]) behave like their "Full-range" Unicode counterparts. If the source |
| 494 | string is not in UTF-8 format, and no locale is in effect, and the platform is |
| 495 | not EBCDIC, all the POSIX classes behave like their ASCII-range counterparts. |
| 496 | Otherwise, they behave based on the rules of the locale or EBCDIC code page. |
| 497 | It is proposed to change this behavior in a future release of Perl so that |
| 498 | the UTF8ness of the source string will be irrelevant to the behavior of the |
| 499 | POSIX character classes. This means they will always behave in strict |
| 500 | accordance with the official POSIX standard. That is, if either locale or |
| 501 | EBCDIC code page is present, they will behave in accordance with those; if |
| 502 | absent, the classes will match only their ASCII-range counterparts. If you |
| 503 | disagree with this proposal, send email to C<perl5-porters@perl.org>. |
| 504 | |
| 505 |  [[:...:]]   ASCII-range          Full-range       backslash  Note |
| 506 |              Unicode              Unicode          sequence |
| 507 |  ------------------------------------------------------------------ |
| 508 |  alpha       \p{PosixAlpha}       \p{Alpha} |
| 509 |  alnum       \p{PosixAlnum}       \p{Alnum} |
| 510 |  ascii       \p{ASCII} |
| 511 |  blank       \p{PosixBlank}       \p{Blank} =                 [1] |
| 512 |                                   \p{HorizSpace}   \h         [1] |
| 513 |  cntrl       \p{PosixCntrl}       \p{Cntrl}                   [2] |
| 514 |  digit       \p{PosixDigit}       \p{Digit}        \d |
| 515 |  graph       \p{PosixGraph}       \p{Graph}                   [3] |
| 516 |  lower       \p{PosixLower}       \p{Lower} |
| 517 |  print       \p{PosixPrint}       \p{Print}                   [4] |
| 518 |  punct       \p{PosixPunct}       \p{Punct}                   [5] |
| 519 |              \p{PerlSpace}        \p{SpacePerl}    \s         [6] |
| 520 |  space       \p{PosixSpace}       \p{Space}                   [6] |
| 521 |  upper       \p{PosixUpper}       \p{Upper} |
| 522 |  word        \p{PerlWord}         \p{Word}         \w |
| 523 |  xdigit      \p{ASCII_Hex_Digit}  \p{XDigit} |
| 524 | |
| 525 | =over 4 |
| 526 | |
| 527 | =item [1] |
| 528 | |
| 529 | C<\p{Blank}> and C<\p{HorizSpace}> are synonyms. |
| 530 | |
| 531 | =item [2] |
| 532 | |
| 533 | Control characters don't produce output as such, but instead usually control |
| 534 | the terminal somehow: for example newline and backspace are control characters. |
| 535 | In the ASCII range, characters whose ordinals are between 0 and 31 inclusive, |
| 536 | plus 127 (C<DEL>) are control characters. |
| 537 | |
| 538 | On EBCDIC platforms, it is likely that the code page will define C<[[:cntrl:]]> |
| 539 | to be the EBCDIC equivalents of the ASCII controls, plus the controls |
| 540 | that in Unicode have ordinals from 128 through 139. |
| 541 | |
| 542 | =item [3] |
| 543 | |
| 544 | Any character that is I<graphical>, that is, visible. This class consists |
| 545 | of all the alphanumerical characters and all punctuation characters. |
| 546 | |
| 547 | =item [4] |
| 548 | |
| 549 | All printable characters, which is the set of all the graphical characters |
| 550 | plus whitespace characters that are not also controls. |
| 551 | |
| 552 | =item [5] |
| 553 | |
| 554 | C<\p{PosixPunct}> and C<[[:punct:]]> in the ASCII range match all the |
| 555 | non-controls, non-alphanumeric, non-space characters: |
| 556 | C<[-!"#$%&'()*+,./:;<=E<gt>?@[\\\]^_`{|}~]> (although if a locale is in effect, |
| 557 | it could alter the behavior of C<[[:punct:]]>). |
| 558 | |
| 559 | When the matching string is in UTF-8 format, C<[[:punct:]]> matches the above |
| 560 | set, plus what C<\p{Punct}> matches. This is different than strictly matching |
| 561 | according to C<\p{Punct}>, because the above set includes characters that aren't |
| 562 | considered punctuation by Unicode, but rather "symbols". Another way to say it |
| 563 | is that for a UTF-8 string, C<[[:punct:]]> matches all the characters that |
| 564 | Unicode considers to be punctuation, plus all the ASCII-range characters that |
| 565 | Unicode considers to be symbols. |
| 566 | |
| 567 | =item [6] |
| 568 | |
| 569 | C<\p{SpacePerl}> and C<\p{Space}> differ only in that C<\p{Space}> additionally |
| 570 | matches the vertical tab, C<\cK>. Same for the two ASCII-only range forms. |
| 571 | |
| 572 | =back |
| 573 | |
| 574 | =head4 Negation |
| 575 | X<character class, negation> |
| 576 | |
| 577 | A Perl extension to the POSIX character class is the ability to |
| 578 | negate it. This is done by prefixing the class name with a caret (C<^>). |
| 579 | Some examples: |
| 580 | |
| 581 |  POSIX         ASCII-range      Full-range      backslash |
| 582 |                Unicode          Unicode         sequence |
| 583 |  --------------------------------------------------------- |
| 584 |  [[:^digit:]]  \P{PosixDigit}   \P{Digit}       \D |
| 585 |  [[:^space:]]  \P{PosixSpace}   \P{Space} |
| 586 |                \P{PerlSpace}    \P{SpacePerl}   \S |
| 587 |  [[:^word:]]   \P{PerlWord}     \P{Word}        \W |
| 588 | |
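For ASCII-only input, the negated POSIX class and its backslash counterpart
from the table above remove the same characters. This is a sketch; the
input string is a made-up example.

```perl
use strict;
use warnings;

# Hypothetical input containing only ASCII characters.
my $input = "tel: 555-0100";

# Strip everything that is not a digit, once with each notation.
(my $posix     = $input) =~ s/[[:^digit:]]+//g;
(my $backslash = $input) =~ s/\D+//g;

print "$posix $backslash\n";
```
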
| 589 | =head4 [= =] and [. .] |
| 590 | |
| 591 | Perl will recognize the POSIX character classes C<[=class=]>, and |
| 592 | C<[.class.]>, but does not (yet?) support them. Use of |
| 593 | such a construct will lead to an error. |
| 594 | |
| 595 | |
| 596 | =head4 Examples |
| 597 | |
| 598 | /[[:digit:]]/ # Matches a character that is a digit. |
| 599 | /[01[:lower:]]/ # Matches a character that is either a |
| 600 | # lowercase letter, or '0' or '1'. |
| 601 | /[[:digit:][:^xdigit:]]/ # Matches any character except the letters |
| 602 | # 'a' to 'f' and 'A' to 'F'. This is |
| 603 | # because the main character class is composed |
| 604 | # of two POSIX character classes that are ORed |
| 605 | # together, one that matches any digit, and |
| 606 | # the other that matches anything that isn't a |
| 607 | # hex digit. The digits are therefore matched |
| 608 | # by the first class, and only the hex letters |
| 609 | # are excluded. |
| 610 | |
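The last example can be checked against a handful of ASCII characters. The
candidates below are arbitrary.

```perl
use strict;
use warnings;

# Digits match via [:digit:]; everything that is not a hex digit matches
# via [:^xdigit:]. Only the hex letters are left out.
my $class = qr/^[[:digit:][:^xdigit:]]$/;

my @accepted = grep { $_ =~ $class } ("5", "g", "Z", "a", "F");

print "accepted: @accepted\n";
```
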
| 611 | |
| 612 | =head2 Locale, EBCDIC, Unicode and UTF-8 |
| 613 | |
| 614 | Some of the character classes behave somewhat differently depending |
| 615 | on the internal encoding of the source string, the locale that is |
| 616 | in effect, and whether the program is running on an EBCDIC platform. |
| 617 | |
| 618 | C<\w>, C<\d>, C<\s> and the POSIX character classes (and their negations, |
| 619 | including C<\W>, C<\D>, C<\S>) suffer from this behaviour. (Since the backslash |
| 620 | sequences C<\b> and C<\B> are defined in terms of C<\w> and C<\W>, they also are |
| 621 | affected.) |
| 622 | |
| 623 | The rule is that if the source string is in UTF-8 format, the character |
| 624 | classes match according to the Unicode properties. If the source string |
| 625 | isn't, then the character classes match according to whatever locale or EBCDIC |
| 626 | code page is in effect. If neither a locale nor an EBCDIC code page is in effect, they match the ASCII |
| 627 | defaults (52 letters, 10 digits and underscore for C<\w>; 0 to 9 for C<\d>; |
| 628 | etc.). |
| 629 | |
| 630 | This usually means that if you are matching against characters whose C<ord()> |
| 631 | values are between 128 and 255 inclusive, your character class may match |
| 632 | or not depending on the current locale or EBCDIC code page, and whether the |
| 633 | source string is in UTF-8 format. The string will be in UTF-8 format if it |
| 634 | contains characters whose C<ord()> value exceeds 255. But a string may be in |
| 635 | UTF-8 format without it having such characters. See L<perlunicode/The |
| 636 | "Unicode Bug">. |
| 637 | |
| 638 | For portability reasons, it may be better to not use C<\w>, C<\d>, C<\s> |
| 639 | or the POSIX character classes, and use the Unicode properties instead. |
| 640 | |
| 641 | =head4 Examples |
| 642 | |
| 643 | $str = "\xDF"; # $str is not in UTF-8 format. |
| 644 | $str =~ /^\w/; # No match, as $str isn't in UTF-8 format. |
| 645 | $str .= "\x{0e0b}"; # Now $str is in UTF-8 format. |
| 646 | $str =~ /^\w/; # Match! $str is now in UTF-8 format. |
| 647 | chop $str; |
| 648 | $str =~ /^\w/; # Still a match! $str remains in UTF-8 format. |
| 649 | |
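Whether a string is in UTF-8 format can be inspected with the built-in
C<utf8::is_utf8> function. The sketch below follows the same steps as the
example above and only reports the flag; it does not test C<\w>, whose
result depends on the perl version and the features in effect. Treat it as
a debugging aid rather than as something to branch on in ordinary code.

```perl
use strict;
use warnings;

my $str = "\xDF";                                # not in UTF-8 format
my $before = utf8::is_utf8($str) ? 1 : 0;

$str .= "\x{0e0b}";                              # a character above 255 forces UTF-8 format
my $after_append = utf8::is_utf8($str) ? 1 : 0;

chop $str;                                       # removing it does not downgrade the string
my $after_chop = utf8::is_utf8($str) ? 1 : 0;

print "before=$before after_append=$after_append after_chop=$after_chop\n";
```
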
| 650 | =cut |