Commit | Line | Data |
---|---|---|
8a118206 RGS |
1 | =head1 NAME |
2 | ||
3 | perlrecharclass - Perl Regular Expression Character Classes | |
4 | ||
5 | =head1 DESCRIPTION | |
6 | ||
7 | The top level documentation about Perl regular expressions | |
8 | is found in L<perlre>. | |
9 | ||
10 | This manual page discusses the syntax and use of character | |
11 | classes in Perl Regular Expressions. | |
12 | ||
13 | A character class is a way of denoting a set of characters, | |
14 | in such a way that one character of the set is matched. | |
15 | It's important to remember that matching a character class | |
16 | consumes exactly one character in the source string. (The source | |
17 | string is the string the regular expression is matched against.) | |
18 | ||
19 | There are three types of character classes in Perl regular | |
20 | expressions: the dot, backslashed sequences, and the bracketed form. | |
21 | ||
22 | =head2 The dot | |
23 | ||
24 | The dot (or period), C<.>, is probably the most used, and certainly | |
25 | the best-known, character class. By default, a dot matches any | |
26 | character, except for the newline. The default can be changed to | |
27 | add matching the newline with the I<single line> modifier: either | |
28 | for the entire regular expression using the C</s> modifier, or | |
29 | locally using C<(?s)>. | |
30 | ||
31 | Here are some examples: | |
32 | ||
33 | "a" =~ /./ # Match | |
34 | "." =~ /./ # Match | |
35 | "" =~ /./ # No match (dot has to match a character) | |
36 | "\n" =~ /./ # No match (dot does not match a newline) | |
37 | "\n" =~ /./s # Match (global 'single line' modifier) | |
38 | "\n" =~ /(?s:.)/ # Match (local 'single line' modifier) | |
39 | "ab" =~ /^.$/ # No match (dot matches one character) | |
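The effect of the I<single line> modifier is easiest to see with a greedy quantifier. The following is a minimal sketch; the variable names are illustrative only and are not part of the original text.

```perl
use strict;
use warnings;

my $text = "one\ntwo";

# Without /s the dot stops at the newline, so .* captures only the first line.
my ($first_line) = $text =~ /^(.*)/;

# With /s the dot also matches "\n", so .* runs to the end of the string.
my ($everything) = $text =~ /^(.*)/s;

print "default: [$first_line]\n";
print "with /s: [$everything]\n";
```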
40 | ||
8a118206 RGS |
41 | =head2 Backslashed sequences |
42 | ||
43 | Perl regular expressions contain many backslashed sequences that | |
44 | constitute a character class. That is, they will match a single | |
45 | character, if that character belongs to a specific set of characters | |
46 | (defined by the sequence). A backslashed sequence is a sequence of | |
47 | characters starting with a backslash. Not all backslashed sequences | |
df225385 | 48 | are character classes; for a full list, see L<perlrebackslash>. |
8a118206 RGS |
49 | |
50 | Here's a list of the backslashed sequences, which are discussed in | |
51 | more detail below. | |
52 | ||
53 | \d Match a digit character. | |
54 | \D Match a non-digit character. | |
55 | \w Match a "word" character. | |
56 | \W Match a non-"word" character. | |
57 | \s Match a white space character. | |
58 | \S Match a non-white space character. | |
59 | \h Match a horizontal white space character. | |
60 | \H Match a character that isn't horizontal white space. | |
b3b85878 | 61 | \N Match a character that isn't newline. Experimental. |
8a118206 RGS |
62 | \v Match a vertical white space character. |
63 | \V Match a character that isn't vertical white space. | |
64 | \pP, \p{Prop} Match a character matching a Unicode property. | |
65 | \PP, \P{Prop} Match a character that doesn't match a Unicode property. | |
66 | ||
67 | =head3 Digits | |
68 | ||
69 | C<\d> matches a single character that is considered to be a I<digit>. | |
70 | What is considered a digit depends on the internal encoding of | |
71 | the source string. If the source string is in UTF-8 format, C<\d> | |
72 | not only matches the digits '0' - '9', but also Arabic, Devanagari and | |
73 | digits from other languages. Otherwise, if there is a locale in effect, | |
74 | it will match whatever characters the locale considers digits. Without | |
75 | a locale, C<\d> matches the digits '0' to '9'. | |
76 | See L</Locale, Unicode and UTF-8>. | |
77 | ||
78 | Any character that isn't matched by C<\d> will be matched by C<\D>. | |
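As a small illustration of the point above, the sketch below tests a non-ASCII digit against C<\d> and against the explicit range C<[0-9]>. The character chosen (ARABIC-INDIC DIGIT THREE) and the variable names are assumptions made for the example.

```perl
use strict;
use warnings;

# A code point above 255 forces the string into UTF-8 format internally.
my $arabic_three = "\x{0663}";    # ARABIC-INDIC DIGIT THREE

my $matches_d     = $arabic_three =~ /\d/    ? 1 : 0;
my $matches_ascii = $arabic_three =~ /[0-9]/ ? 1 : 0;

print "\\d    : ", ($matches_d     ? "match" : "no match"), "\n";
print "[0-9] : ", ($matches_ascii ? "match" : "no match"), "\n";
```

If only the ten ASCII digits are wanted, spelling out C<[0-9]> avoids the dependence on the string's encoding.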
79 | ||
80 | =head3 Word characters | |
81 | ||
82 | C<\w> matches a single I<word> character: an alphanumeric character | |
83 | (that is, an alphabetic character, or a digit), or the underscore (C<_>). | |
84 | What is considered a word character depends on the internal encoding | |
85 | of the string. If it's in UTF-8 format, C<\w> matches those characters | |
86 | that are considered word characters in the Unicode database. That is, it | |
87 | not only matches ASCII letters, but also Thai letters, Greek letters, etc. | |
88 | If the source string isn't in UTF-8 format, C<\w> matches those characters | |
89 | that are considered word characters by the current locale. Without | |
90 | a locale in effect, C<\w> matches the ASCII letters, digits and the | |
91 | underscore. | |
92 | ||
93 | Any character that isn't matched by C<\w> will be matched by C<\W>. | |
94 | ||
95 | =head3 White space | |
96 | ||
c741660a | 97 | C<\s> matches any single character that is considered white space. In the |
8a118206 RGS |
98 | ASCII range, C<\s> matches the horizontal tab (C<\t>), the new line |
99 | (C<\n>), the form feed (C<\f>), the carriage return (C<\r>), and the | |
100 | space (the vertical tab, C<\cK> is not matched by C<\s>). The exact set | |
101 | of characters matched by C<\s> depends on whether the source string is | |
102 | in UTF-8 format. If it is, C<\s> matches what is considered white space | |
103 | in the Unicode database. Otherwise, if there is a locale in effect, C<\s> | |
104 | matches whatever is considered white space by the current locale. Without | |
105 | a locale, C<\s> matches the five characters mentioned in the beginning | |
106 | of this paragraph. Perhaps the most notable difference is that C<\s> | |
107 | matches a non-breaking space only if the non-breaking space is in a | |
108 | UTF-8 encoded string. | |
109 | ||
110 | Any character that isn't matched by C<\s> will be matched by C<\S>. | |
111 | ||
112 | C<\h> will match any character that is considered horizontal white space; | |
113 | this includes the space and the tab characters. C<\H> will match any character | |
114 | that is not considered horizontal white space. | |
115 | ||
b3b85878 KW |
116 | C<\N> is an experimental feature. It, like the dot, will match any character |
117 | that is not a newline. The difference is that C<\N> will not be influenced by | |
118 | the single line C</s> regular expression modifier. (Note that, since C<\N{}> is | |
119 | also used for named characters, if C<\N> is followed by an opening brace and | |
120 | something that is not a quantifier, perl will assume that a character name is | |
121 | coming. For example, C<\N{3}> means to match 3 non-newlines; C<\N{5,}> means | |
122 | to match 5 or more non-newlines, but C<\N{4F}> is not a legal quantifier, and | |
123 | will cause perl to look for a character named C<4F> (and won't find one unless | |
124 | custom names have been defined that include it).) | |
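The difference between C<\N> and the dot under C</s>, and the quantifier reading of C<\N{3}>, can be sketched as follows. This assumes a perl recent enough to support C<\N> as a character class; the variable names are illustrative.

```perl
use strict;
use warnings;

# Under /s the dot matches a newline, but \N still refuses to.
my $dot_with_s = "\n" =~ /./s  ? 1 : 0;
my $n_with_s   = "\n" =~ /\N/s ? 1 : 0;

# \N{3} is read as "\N" followed by the quantifier {3}.
my $three_chars  = "abc"   =~ /^\N{3}$/ ? 1 : 0;
my $with_newline = "ab\nc" =~ /^\N{3}/  ? 1 : 0;

print "dot with /s on newline: $dot_with_s\n";
print "\\N with /s on newline:  $n_with_s\n";
print "\\N{3} on 'abc':         $three_chars\n";
print "\\N{3} on 'ab\\nc':       $with_newline\n";
```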
c741660a | 125 | |
8a118206 RGS |
126 | C<\v> will match any character that is considered vertical white space; |
127 | this includes the carriage return and line feed characters (newline). | |
128 | C<\V> will match any character that is not considered vertical white space. | |
129 | ||
130 | C<\R> matches anything that can be considered a newline under Unicode | |
131 | rules. It's not a character class, as it can match a multi-character | |
132 | sequence. Therefore, it cannot be used inside a bracketed character | |
133 | class. Details are discussed in L<perlrebackslash>. | |
134 | ||
99d59c4d | 135 | C<\h>, C<\H>, C<\v>, C<\V>, and C<\R> are new in perl 5.10.0. |
8a118206 RGS |
136 | |
137 | Note that unlike C<\s>, C<\d> and C<\w>, C<\h> and C<\v> always match | |
138 | the same characters, regardless of whether the source string is in UTF-8 | |
139 | format or not. The set of characters they match is also not influenced | |
140 | by locale. | |
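A minimal sketch of this independence from the internal encoding: the same no-break space is tested once as a native 8-bit string and once after C<utf8::upgrade>. The variable names are assumptions for the example.

```perl
use strict;
use warnings;

my $native   = "\xA0";    # NO-BREAK SPACE, stored as a single byte
my $upgraded = "\xA0";
utf8::upgrade($upgraded); # same character, now in UTF-8 format internally

my $h_native   = $native   =~ /\h/ ? 1 : 0;
my $h_upgraded = $upgraded =~ /\h/ ? 1 : 0;

# The vertical tab is vertical white space, not horizontal white space.
my $vt_is_v = "\x0b" =~ /\v/ ? 1 : 0;
my $vt_is_h = "\x0b" =~ /\h/ ? 1 : 0;

print "\\h on native NBSP:   $h_native\n";
print "\\h on upgraded NBSP: $h_upgraded\n";
print "\\v on vertical tab:  $vt_is_v\n";
print "\\h on vertical tab:  $vt_is_h\n";
```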
141 | ||
142 | One might think that C<\s> is equivalent to C<[\h\v]>. This is not true. | |
143 | The vertical tab (C<"\x0b">) is not matched by C<\s>; it is, however, | |
144 | considered vertical white space. Furthermore, if the source string is | |
145 | not in UTF-8 format, the next line (C<"\x85">) and the no-break space | |
146 | (C<"\xA0">) are not matched by C<\s>, but are by C<\v> and C<\h> respectively. | |
147 | If the source string is in UTF-8 format, both the next line and the | |
148 | no-break space are matched by C<\s>. | |
149 | ||
150 | The following table is a complete listing of characters matched by | |
151 | C<\s>, C<\h> and C<\v>. | |
152 | ||
153 | The first column gives the code point of the character (in hex format), | |
154 | the second column gives the (Unicode) name. The third column indicates | |
155 | by which class(es) the character is matched. | |
156 | ||
157 | 0x00009 CHARACTER TABULATION h s | |
158 | 0x0000a LINE FEED (LF) vs | |
159 | 0x0000b LINE TABULATION v | |
160 | 0x0000c FORM FEED (FF) vs | |
161 | 0x0000d CARRIAGE RETURN (CR) vs | |
162 | 0x00020 SPACE h s | |
163 | 0x00085 NEXT LINE (NEL) vs [1] | |
164 | 0x000a0 NO-BREAK SPACE h s [1] | |
165 | 0x01680 OGHAM SPACE MARK h s | |
166 | 0x0180e MONGOLIAN VOWEL SEPARATOR h s | |
167 | 0x02000 EN QUAD h s | |
168 | 0x02001 EM QUAD h s | |
169 | 0x02002 EN SPACE h s | |
170 | 0x02003 EM SPACE h s | |
171 | 0x02004 THREE-PER-EM SPACE h s | |
172 | 0x02005 FOUR-PER-EM SPACE h s | |
173 | 0x02006 SIX-PER-EM SPACE h s | |
174 | 0x02007 FIGURE SPACE h s | |
175 | 0x02008 PUNCTUATION SPACE h s | |
176 | 0x02009 THIN SPACE h s | |
177 | 0x0200a HAIR SPACE h s | |
178 | 0x02028 LINE SEPARATOR vs | |
179 | 0x02029 PARAGRAPH SEPARATOR vs | |
180 | 0x0202f NARROW NO-BREAK SPACE h s | |
181 | 0x0205f MEDIUM MATHEMATICAL SPACE h s | |
182 | 0x03000 IDEOGRAPHIC SPACE h s | |
183 | ||
184 | =over 4 | |
185 | ||
186 | =item [1] | |
187 | ||
188 | NEXT LINE and NO-BREAK SPACE only match C<\s> if the source string is in | |
189 | UTF-8 format. | |
190 | ||
191 | =back | |
192 | ||
193 | It is worth noting that C<\d>, C<\w>, etc., match single characters, not | |
194 | complete numbers or words. To match a number (that consists of digits), | |
195 | use C<\d+>; to match a word, use C<\w+>. | |
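The difference between matching one character and matching a whole number or word shows up clearly in list context with C</g>. The sample string and variable names below are made up for the illustration.

```perl
use strict;
use warnings;

my $text = "order 66, aisle 7";

# \d alone yields one element per digit character.
my @single_digits = $text =~ /\d/g;

# \d+ and \w+ yield whole runs of digits and of word characters.
my @numbers = $text =~ /\d+/g;
my @words   = $text =~ /\w+/g;

print "digits:  @single_digits\n";
print "numbers: @numbers\n";
print "words:   @words\n";
```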
196 | ||
197 | ||
198 | =head3 Unicode Properties | |
199 | ||
200 | C<\pP> and C<\p{Prop}> are character classes to match characters that | |
201 | fit given Unicode classes. One letter classes can be used in the C<\pP> | |
e1b711da KW |
202 | form, with the class name following the C<\p>, otherwise, braces are required. |
203 | There is a single form, which is just the property name enclosed in the braces, | |
204 | and a compound form which looks like C<\p{name=value}>, which means to match | |
205 | if the property C<name> for the character has the particular C<value>. | |
206 | For instance, a match for a number can be written as C</\pN/> or as | |
207 | C</\p{Number}/>, or as C</\p{Number=True}/>. | |
208 | Lowercase letters are matched by the property I<Lowercase_Letter> which | |
209 | has the short form I<Ll>. They need the braces, so are written as C</\p{Ll}/> or | |
210 | C</\p{Lowercase_Letter}/>, or C</\p{General_Category=Lowercase_Letter}/> | |
211 | (the underscores are optional). | |
212 | C</\pLl/> is valid, but means something different. | |
8a118206 RGS |
213 | It matches a two character string: a letter (Unicode property C<\pL>), |
214 | followed by a lowercase C<l>. | |
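The pitfall with C</\pLl/> can be checked directly. The following is a small sketch with illustrative variable names.

```perl
use strict;
use warnings;

# \pLl is \pL (any letter) followed by a literal "l": two characters.
my $two_chars  = "Al" =~ /^\pLl$/ ? 1 : 0;
my $one_char   = "a"  =~ /^\pLl$/ ? 1 : 0;

# \p{Ll} is a single lowercase letter.
my $lower_a    = "a" =~ /^\p{Ll}$/ ? 1 : 0;
my $upper_a    = "A" =~ /^\p{Ll}$/ ? 1 : 0;

print "'Al' =~ /^\\pLl\$/   : $two_chars\n";
print "'a'  =~ /^\\pLl\$/   : $one_char\n";
print "'a'  =~ /^\\p{Ll}\$/ : $lower_a\n";
print "'A'  =~ /^\\p{Ll}\$/ : $upper_a\n";
```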
215 | ||
e1b711da KW |
216 | For more details, see L<perlunicode/Unicode Character Properties>; for a |
217 | complete list of possible properties, see | |
218 | L<perluniprops/Properties accessible through \p{} and \P{}>. | |
219 | It is also possible to define your own properties. This is discussed in | |
8a118206 RGS |
220 | L<perlunicode/User-Defined Character Properties>. |
221 | ||
222 | ||
223 | =head4 Examples | |
224 | ||
225 | "a" =~ /\w/ # Match, "a" is a 'word' character. | |
226 | "7" =~ /\w/ # Match, "7" is a 'word' character as well. | |
227 | "a" =~ /\d/ # No match, "a" isn't a digit. | |
228 | "7" =~ /\d/ # Match, "7" is a digit. | |
229 | " " =~ /\s/ # Match, a space is white space. | |
230 | "a" =~ /\D/ # Match, "a" is a non-digit. | |
231 | "7" =~ /\D/ # No match, "7" is not a non-digit. | |
232 | " " =~ /\S/ # No match, a space is not non-white space. | |
233 | ||
234 | " " =~ /\h/ # Match, space is horizontal white space. | |
235 | " " =~ /\v/ # No match, space is not vertical white space. | |
236 | "\r" =~ /\v/ # Match, a return is vertical white space. | |
237 | ||
238 | "a" =~ /\pL/ # Match, "a" is a letter. | |
239 | "a" =~ /\p{Lu}/ # No match, /\p{Lu}/ matches upper case letters. | |
240 | ||
241 | "\x{0e0b}" =~ /\p{Thai}/ # Match, \x{0e0b} is the character | |
242 | # 'THAI CHARACTER SO SO', and that's in | |
243 | # the Thai Unicode class. | |
244 | "a" =~ /\P{Lao}/ # Match, as "a" is not a Lao character. | |
245 | ||
246 | ||
247 | =head2 Bracketed Character Classes | |
248 | ||
249 | The third form of character class you can use in Perl regular expressions | |
250 | is the bracketed form. In its simplest form, it lists the characters | |
251 | that may be matched inside square brackets, like this: C<[aeiou]>. | |
252 | This matches one of C<a>, C<e>, C<i>, C<o> or C<u>. Just as with the other | |
253 | character classes, exactly one character will be matched. To match | |
254 | a longer string consisting of characters mentioned in the character | |
255 | class, follow the character class with a quantifier. For instance, | |
256 | C<[aeiou]+> matches a string of one or more lowercase ASCII vowels. | |
257 | ||
258 | Repeating a character in a character class has no | |
259 | effect; it's considered to be in the set only once. | |
260 | ||
261 | Examples: | |
262 | ||
263 | "e" =~ /[aeiou]/ # Match, as "e" is listed in the class. | |
264 | "p" =~ /[aeiou]/ # No match, "p" is not listed in the class. | |
265 | "ae" =~ /^[aeiou]$/ # No match, a character class only matches | |
266 | # a single character. | |
267 | "ae" =~ /^[aeiou]+$/ # Match, due to the quantifier. | |
268 | ||
269 | =head3 Special Characters Inside a Bracketed Character Class | |
270 | ||
271 | Most characters that are meta characters in regular expressions (that | |
df225385 | 272 | is, characters that carry a special meaning like C<.>, C<*>, or C<(>) lose |
8a118206 RGS |
273 | their special meaning and can be used inside a character class without |
274 | the need to escape them. For instance, C<[()]> matches either an opening | |
275 | parenthesis, or a closing parenthesis, and the parens inside the character | |
276 | class don't group or capture. | |
277 | ||
278 | Characters that may carry a special meaning inside a character class are: | |
279 | C<\>, C<^>, C<->, C<[> and C<]>, and are discussed below. They can be | |
280 | escaped with a backslash, although this is sometimes not needed, in which | |
281 | case the backslash may be omitted. | |
282 | ||
283 | The sequence C<\b> is special inside a bracketed character class. While | |
284 | outside the character class C<\b> is an assertion indicating a point | |
285 | that does not have either two word characters or two non-word characters | |
286 | on either side, inside a bracketed character class, C<\b> matches a | |
287 | backspace character. | |
288 | ||
df225385 KW |
289 | The sequences |
290 | C<\a>, | |
291 | C<\c>, | |
292 | C<\e>, | |
293 | C<\f>, | |
294 | C<\n>, | |
e526e8bb KW |
295 | C<\N{I<NAME>}>, |
296 | C<\N{U+I<wide hex char>}>, | |
df225385 KW |
297 | C<\r>, |
298 | C<\t>, | |
299 | and | |
300 | C<\x> | |
301 | are also special and have the same meanings as they do outside a bracketed character | |
302 | class. | |
303 | ||
304 | Also, a backslash followed by digits is considered an octal number. | |
305 | ||
8a118206 RGS |
306 | A C<[> is not special inside a character class, unless it's the start |
307 | of a POSIX character class (see below). It normally does not need escaping. | |
308 | ||
309 | A C<]> is either the end of a POSIX character class (see below), or it | |
310 | signals the end of the bracketed character class. Normally it needs | |
311 | escaping if you want to include a C<]> in the set of characters. | |
312 | However, if the C<]> is the I<first> (or the second if the first | |
313 | character is a caret) character of a bracketed character class, it | |
314 | does not denote the end of the class (as you cannot have an empty class) | |
315 | and is considered part of the set of characters that can be matched without | |
316 | escaping. | |
317 | ||
318 | Examples: | |
319 | ||
320 | "+" =~ /[+?*]/ # Match, "+" in a character class is not special. | |
321 | "\cH" =~ /[\b]/ # Match, \b inside a character class | |
322 | # is equivalent to a backspace. | |
323 | "]" =~ /[][]/ # Match, as the character class contains | |
324 | # both [ and ]. | |
325 | "[]" =~ /[[]]/ # Match, the pattern contains a character class | |
326 | # containing just [, and the character class is | |
327 | # followed by a ]. | |
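The rule for an unescaped C<]> at the start of a class, including the negated case, can be sketched like this. The patterns and variable names are chosen only for illustration.

```perl
use strict;
use warnings;

# "]" listed first is a literal member of the class, alongside "a".
my $bracket_first   = "]" =~ /^[]a]$/ ? 1 : 0;

# After a leading caret, "]" in second position is still a literal member,
# so this negated class rejects "]" and accepts "b".
my $negated_bracket = "]" =~ /^[^]a]$/ ? 1 : 0;
my $negated_other   = "b" =~ /^[^]a]$/ ? 1 : 0;

# "[" needs no escaping inside a class.
my $open_bracket    = "[" =~ /^[[]$/ ? 1 : 0;

print "[]a]  matches ']': $bracket_first\n";
print "[^]a] matches ']': $negated_bracket\n";
print "[^]a] matches 'b': $negated_other\n";
print "[[]   matches '[': $open_bracket\n";
```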
328 | ||
329 | =head3 Character Ranges | |
330 | ||
331 | It is not uncommon to want to match a range of characters. Luckily, instead | |
332 | of listing all the characters in the range, one may use the hyphen (C<->). | |
333 | If inside a bracketed character class you have two characters separated | |
334 | by a hyphen, it's treated as if all the characters between the two are in | |
335 | the class. For instance, C<[0-9]> matches any ASCII digit, and C<[a-m]> | |
336 | matches any lowercase letter from the first half of the ASCII alphabet. | |
337 | ||
338 | Note that the two characters on either side of the hyphen need not | |
339 | both be letters or both be digits. Any character is possible, | |
340 | although not advisable. C<['-?]> contains a range of characters, but | |
341 | most people will not know which characters that will be. Furthermore, | |
342 | such ranges may lead to portability problems if the code has to run on | |
343 | a platform that uses a different character set, such as EBCDIC. | |
344 | ||
345 | If a hyphen in a character class cannot be part of a range, for instance | |
346 | because it is the first or the last character of the character class, | |
347 | or if it immediately follows a range, the hyphen isn't special, and will be | |
348 | considered a character that may be matched. You have to escape the hyphen | |
349 | with a backslash if you want to have a hyphen in your set of characters to | |
350 | be matched, and its position in the class is such that it can be considered | |
351 | part of a range. | |
352 | ||
353 | Examples: | |
354 | ||
355 | [a-z] # Matches a character that is a lower case ASCII letter. | |
356 | [a-fz] # Matches any letter between 'a' and 'f' (inclusive) or the | |
357 | # letter 'z'. | |
358 | [-z] # Matches either a hyphen ('-') or the letter 'z'. | |
359 | [a-f-m] # Matches any letter between 'a' and 'f' (inclusive), the | |
360 | # hyphen ('-'), or the letter 'm'. | |
361 | ['-?] # Matches any of the characters '()*+,-./0123456789:;<=>? | |
362 | # (But not on an EBCDIC platform). | |
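The hyphen rules can be exercised as follows. This is a sketch under ASCII assumptions, with illustrative variable names.

```perl
use strict;
use warnings;

# In [a-f-m] the second hyphen follows a range, so it is a literal hyphen.
my $hyphen_literal = "-" =~ /^[a-f-m]$/ ? 1 : 0;
my $g_in_class     = "g" =~ /^[a-f-m]$/ ? 1 : 0;   # "g" is not in a-f
my $m_in_class     = "m" =~ /^[a-f-m]$/ ? 1 : 0;

# Escaping the hyphen prevents a range: the class is just "a", "-" and "z".
my $escaped_hyphen = "-" =~ /^[a\-z]$/ ? 1 : 0;
my $m_not_in_range = "m" =~ /^[a\-z]$/ ? 1 : 0;

print "[a-f-m] matches '-': $hyphen_literal\n";
print "[a-f-m] matches 'g': $g_in_class\n";
print "[a-f-m] matches 'm': $m_in_class\n";
print "[a\\-z]  matches '-': $escaped_hyphen\n";
print "[a\\-z]  matches 'm': $m_not_in_range\n";
```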
363 | ||
364 | ||
365 | =head3 Negation | |
366 | ||
367 | It is also possible to instead list the characters you do not want to | |
368 | match. You can do so by using a caret (C<^>) as the first character in the | |
369 | character class. For instance, C<[^a-z]> matches a character that is not a | |
370 | lowercase ASCII letter. | |
371 | ||
372 | This syntax makes the caret a special character inside a bracketed character | |
373 | class, but only if it is the first character of the class. So if you want | |
374 | to have the caret as one of the characters you want to match, you either | |
375 | have to escape the caret, or not list it first. | |
376 | ||
377 | Examples: | |
378 | ||
379 | "e" =~ /[^aeiou]/ # No match, the 'e' is listed. | |
380 | "x" =~ /[^aeiou]/ # Match, as 'x' isn't a lowercase vowel. | |
381 | "^" =~ /[^^]/ # No match, matches anything that isn't a caret. | |
382 | "^" =~ /[x^]/ # Match, caret is not special here. | |
383 | ||
384 | =head3 Backslash Sequences | |
385 | ||
df225385 KW |
386 | You can put any backslash sequence character class (with one exception listed |
387 | in the next paragraph) inside a bracketed character class, and it will act just | |
388 | as if you put all the characters matched by the backslash sequence inside the | |
389 | character class. For instance, C<[a-f\d]> will match any digit, or any of the | |
390 | lowercase letters between 'a' and 'f' inclusive. | |
391 | ||
e526e8bb KW |
392 | C<\N> within a bracketed character class must be of the forms C<\N{I<name>}> or |
393 | C<\N{U+I<wide hex char>}> for the same reason that a dot C<.> inside a | |
394 | bracketed character class loses its special meaning: it matches nearly | |
395 | anything, which generally isn't what you want to happen. | |
8a118206 RGS |
396 | |
397 | Examples: | |
398 | ||
399 | /[\p{Thai}\d]/ # Matches a character that is either a Thai | |
400 | # character, or a digit. | |
401 | /[^\p{Arabic}()]/ # Matches a character that is neither an Arabic | |
402 | # character, nor a parenthesis. | |
403 | ||
404 | Backslash sequence character classes cannot form one of the endpoints | |
405 | of a range. | |
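A short sketch of a backslash sequence used as one member of a bracketed class, using the C<[a-f\d]> class mentioned above. The test strings and variable names are assumptions for the example.

```perl
use strict;
use warnings;

# [a-f\d] is the union of the range a-f and everything \d matches.
my $letter_in  = "c" =~ /^[a-f\d]$/ ? 1 : 0;
my $digit_in   = "7" =~ /^[a-f\d]$/ ? 1 : 0;
my $letter_out = "g" =~ /^[a-f\d]$/ ? 1 : 0;

# Backslash sequences combine with negation as well:
# [^\s\d] is any character that is neither white space nor a digit.
my $neither = "x" =~ /^[^\s\d]$/ ? 1 : 0;
my $digit   = "4" =~ /^[^\s\d]$/ ? 1 : 0;

print "[a-f\\d] : c=$letter_in 7=$digit_in g=$letter_out\n";
print "[^\\s\\d]: x=$neither 4=$digit\n";
```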
406 | ||
407 | =head3 Posix Character Classes | |
408 | ||
409 | Posix character classes have the form C<[:class:]>, where I<class> is | |
410 | the name, and C<[:> and C<:]> are the delimiters. Posix character classes appear | |
411 | I<inside> bracketed character classes, and are a convenient and descriptive | |
412 | way of listing a group of characters. Be careful about the syntax: | |
413 | ||
414 | # Correct: | |
415 | $string =~ /[[:alpha:]]/ | |
416 | ||
417 | # Incorrect (will warn): | |
418 | $string =~ /[:alpha:]/ | |
419 | ||
420 | The latter pattern would be a character class consisting of a colon, | |
421 | and the letters C<a>, C<l>, C<p> and C<h>. | |
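The consequence of the incorrect form can be seen by testing a few characters against both patterns. This is a hedged sketch: it assumes the warning for the incorrect form belongs to the C<regexp> warnings category, which is silenced here only to keep the output clean. The variable names are illustrative.

```perl
use strict;
use warnings;

my ($colon_loose, $b_loose);
{
    # Assumption: the "belongs inside character classes" warning
    # is in the 'regexp' category.
    no warnings 'regexp';

    # [:alpha:] on its own is an ordinary class of ':', 'a', 'l', 'p', 'h'.
    $colon_loose = ":" =~ /[:alpha:]/ ? 1 : 0;
    $b_loose     = "b" =~ /[:alpha:]/ ? 1 : 0;
}

# The correct form places the POSIX class inside a bracketed class.
my $b_posix     = "b" =~ /[[:alpha:]]/ ? 1 : 0;
my $colon_posix = ":" =~ /[[:alpha:]]/ ? 1 : 0;

print "/[:alpha:]/   ':'=$colon_loose 'b'=$b_loose\n";
print "/[[:alpha:]]/ ':'=$colon_posix 'b'=$b_posix\n";
```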
422 | ||
423 | Perl recognizes the following POSIX character classes: | |
424 | ||
425 | alpha Any alphabetical character. | |
426 | alnum Any alphanumerical character. | |
427 | ascii Any ASCII character. | |
ea8b8ad2 | 428 | blank A GNU extension, equal to a space or a horizontal tab ("\t"). |
8a118206 | 429 | cntrl Any control character. |
ea8b8ad2 | 430 | digit Any digit, equivalent to "\d". |
8a118206 RGS |
431 | graph Any printable character, excluding a space. |
432 | lower Any lowercase character. | |
433 | print Any printable character, including a space. | |
434 | punct Any punctuation character. | |
ea8b8ad2 | 435 | space Any white space character. "\s" plus the vertical tab ("\cK"). |
8a118206 | 436 | upper Any uppercase character. |
ea8b8ad2 | 437 | word Any "word" character, equivalent to "\w". |
8a118206 RGS |
438 | xdigit Any hexadecimal digit, '0' - '9', 'a' - 'f', 'A' - 'F'. |
439 | ||
440 | The exact set of characters matched depends on whether the source string | |
441 | is internally in UTF-8 format or not. See L</Locale, Unicode and UTF-8>. | |
442 | ||
443 | Most POSIX character classes have C<\p> counterparts. The difference | |
444 | is that the C<\p> classes will always match according to the Unicode | |
445 | properties, regardless of whether the string is in UTF-8 format or not. | |
446 | ||
447 | The following table shows the relation between POSIX character classes | |
448 | and the Unicode properties: | |
449 | ||
450 | [[:...:]] \p{...} backslash | |
451 | ||
452 | alpha IsAlpha | |
453 | alnum IsAlnum | |
454 | ascii IsASCII | |
455 | blank | |
456 | cntrl IsCntrl | |
457 | digit IsDigit \d | |
458 | graph IsGraph | |
459 | lower IsLower | |
460 | print IsPrint | |
461 | punct IsPunct | |
462 | space IsSpace | |
463 | IsSpacePerl \s | |
464 | upper IsUpper | |
465 | word IsWord | |
466 | xdigit IsXDigit | |
467 | ||
e1b711da | 468 | Some of these names may not be obvious: |
8a118206 RGS |
469 | |
470 | =over 4 | |
471 | ||
472 | =item cntrl | |
473 | ||
474 | Any control character. Usually, control characters don't produce output | |
475 | as such, but instead control the terminal somehow: for example newline | |
476 | and backspace are control characters. All characters with C<ord()> less | |
477 | than 32 are usually classified as control characters (in ASCII, the ISO | |
478 | Latin character sets, and Unicode), as is the character with the C<ord()> value | |
479 | of 127 (C<DEL>). | |
480 | ||
481 | =item graph | |
482 | ||
483 | Any character that is I<graphical>, that is, visible. This class consists | |
484 | of all the alphanumerical characters and all punctuation characters. | |
485 | ||
486 | =item print | |
487 | ||
488 | All printable characters, which is the set of all the graphical characters | |
489 | plus the space. | |
490 | ||
491 | =item punct | |
492 | ||
493 | Any punctuation (special) character. | |
494 | ||
495 | =back | |
496 | ||
497 | =head4 Negation | |
498 | ||
499 | A Perl extension to the POSIX character class is the ability to | |
500 | negate it. This is done by prefixing the class name with a caret (C<^>). | |
501 | Some examples: | |
502 | ||
503 | POSIX Unicode Backslash | |
504 | [[:^digit:]] \P{IsDigit} \D | |
505 | [[:^space:]] \P{IsSpace} \S | |
506 | [[:^word:]] \P{IsWord} \W | |
507 | ||
508 | =head4 [= =] and [. .] | |
509 | ||
510 | Perl will recognize the POSIX character classes C<[=class=]> and | |
511 | C<[.class.]>, but does not (yet?) support these constructs. Use of | |
740bae87 | 512 | such a construct will lead to an error. |
8a118206 RGS |
513 | |
514 | ||
515 | =head4 Examples | |
516 | ||
517 | /[[:digit:]]/ # Matches a character that is a digit. | |
518 | /[01[:lower:]]/ # Matches a character that is either a | |
519 | # lowercase letter, or '0' or '1'. | |
520 | /[[:digit:][:^xdigit:]]/ # Matches a character that can be anything, | |
521 | # but the letters 'a' to 'f' in either case. | |
522 | # This is because the character class contains | |
523 | # all digits, and anything that isn't a | |
524 | # hex digit, resulting in a class containing | |
525 | # all characters, but the letters 'a' to 'f' | |
526 | # and 'A' to 'F'. | |
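The last example above can be verified on ASCII input with a few test characters. This is a minimal sketch; the chosen characters and variable names are illustrative.

```perl
use strict;
use warnings;

# Union of "all digits" and "everything that is not a hex digit":
# only the hex letters a-f and A-F are left out.
my $class = qr/^[[:digit:][:^xdigit:]]$/;

my %result = map { $_ => ($_ =~ $class ? 1 : 0) } ("5", "g", "!", "a", "F");

print "'$_' => ", ($result{$_} ? "match" : "no match"), "\n"
    for sort keys %result;
```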
527 | ||
528 | ||
529 | =head2 Locale, Unicode and UTF-8 | |
530 | ||
531 | Some of the character classes have a somewhat different behaviour depending | |
532 | on the internal encoding of the source string, and the locale that is | |
533 | in effect. | |
534 | ||
535 | C<\w>, C<\d>, C<\s> and the POSIX character classes (and their negations, | |
536 | including C<\W>, C<\D>, C<\S>) suffer from this behaviour. | |
537 | ||
538 | The rule is that if the source string is in UTF-8 format, the character | |
539 | classes match according to the Unicode properties. If the source string | |
540 | isn't, then the character classes match according to whatever locale is | |
541 | in effect. If there is no locale, they match the ASCII defaults | |
542 | (52 letters, 10 digits and underscore for C<\w>, 0 to 9 for C<\d>, etc). | |
543 | ||
544 | This usually means that if you are matching against characters whose C<ord()> | |
545 | values are between 128 and 255 inclusive, your character class may match | |
546 | or not depending on the current locale, and whether the source string is | |
547 | in UTF-8 format. The string will be in UTF-8 format if it contains | |
548 | characters whose C<ord()> value exceeds 255. But a string may be in UTF-8 | |
549 | format without it having such characters. | |
550 | ||
551 | For portability reasons, it may be better to not use C<\w>, C<\d>, C<\s> | |
552 | or the POSIX character classes, and use the Unicode properties instead. | |
553 | ||
554 | =head4 Examples | |
555 | ||
556 | $str = "\xDF"; # $str is not in UTF-8 format. | |
557 | $str =~ /^\w/; # No match, as $str isn't in UTF-8 format. | |
558 | $str .= "\x{0e0b}"; # Now $str is in UTF-8 format. | |
559 | $str =~ /^\w/; # Match! $str is now in UTF-8 format. | |
560 | chop $str; | |
561 | $str =~ /^\w/; # Still a match! $str remains in UTF-8 format. | |
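The same effect can be reproduced without appending a wide character, by changing only the internal representation with C<utf8::upgrade>. This sketch assumes no locale is in effect and that features such as C<unicode_strings> are not enabled; the variable names are illustrative.

```perl
use strict;
use warnings;

my $str = "\xDF";    # LATIN SMALL LETTER SHARP S, stored as one byte

my $flag_before  = utf8::is_utf8($str) ? 1 : 0;
my $match_before = $str =~ /^\w/       ? 1 : 0;

# Same character, same length; only the internal encoding changes.
utf8::upgrade($str);

my $flag_after  = utf8::is_utf8($str) ? 1 : 0;
my $match_after = $str =~ /^\w/       ? 1 : 0;

print "before upgrade: utf8=$flag_before \\w=$match_before\n";
print "after upgrade:  utf8=$flag_after \\w=$match_after\n";
```

Using a Unicode property such as C<\p{IsWord}> instead of C<\w> is one way to sidestep the difference, as suggested above.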
562 | ||
563 | =cut |