Commit | Line | Data |
---|---|---|
8a118206 | 1 | =head1 NAME |
ea449505 | 2 | X<character class> |
8a118206 RGS |
3 | |
4 | perlrecharclass - Perl Regular Expression Character Classes | |
5 | ||
6 | =head1 DESCRIPTION | |
7 | ||
8 | The top level documentation about Perl regular expressions | |
9 | is found in L<perlre>. | |
10 | ||
11 | This manual page discusses the syntax and use of character | |
6b83a163 | 12 | classes in Perl regular expressions. |
8a118206 | 13 | |
6b83a163 | 14 | A character class is a way of denoting a set of characters |
8a118206 | 15 | in such a way that one character of the set is matched. |
6b83a163 | 16 | It's important to remember that matching a character class |
8a118206 RGS |
17 | consumes exactly one character in the source string. (The source |
18 | string is the string the regular expression is matched against.) | |
19 | ||
20 | There are three types of character classes in Perl regular | |
6b83a163 | 21 | expressions: the dot, backslash sequences, and the form enclosed in square |
ea449505 | 22 | brackets. Keep in mind, though, that often the term "character class" is used |
6b83a163 | 23 | to mean just the bracketed form. Certainly, most Perl documentation does that. |
8a118206 RGS |
24 | |
25 | =head2 The dot | |
26 | ||
27 | The dot (or period), C<.>, is probably the most used, and certainly | |
28 | the most well-known character class. By default, a dot matches any | |
29 | character, except for the newline. The default can be changed to | |
6b83a163 KW |
30 | add matching the newline by using the I<single line> modifier: either |
31 | for the entire regular expression with the C</s> modifier, or | |
32 | locally with C<(?s)>. (The experimental C<\N> backslash sequence, described | |
33 | below, matches any character except newline without regard to the | |
34 | I<single line> modifier.) | |
8a118206 RGS |
35 | |
36 | Here are some examples: | |
37 | ||
38 | "a" =~ /./ # Match | |
39 | "." =~ /./ # Match | |
40 | "" =~ /./ # No match (dot has to match a character) | |
41 | "\n" =~ /./ # No match (dot does not match a newline) | |
42 | "\n" =~ /./s # Match (global 'single line' modifier) | |
43 | "\n" =~ /(?s:.)/ # Match (local 'single line' modifier) | |
44 | "ab" =~ /^.$/ # No match (dot matches one character) | |
45 | ||
6b83a163 | 46 | =head2 Backslash sequences |
ea449505 KW |
47 | X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\p> X<\P> |
48 | X<\N> X<\v> X<\V> X<\h> X<\H> | |
49 | X<word> X<whitespace> | |
8a118206 | 50 | |
6b83a163 KW |
51 | A backslash sequence is a sequence of characters, the first one of which is a |
52 | backslash. Perl ascribes special meaning to many such sequences, and some of | |
53 | these are character classes. That is, they match a single character each, | |
54 | provided that the character belongs to the specific set of characters defined | |
55 | by the sequence. | |
8a118206 | 56 | |
6b83a163 KW |
57 | Here's a list of the backslash sequences that are character classes. They |
58 | are discussed in more detail below. (For the backslash sequences that aren't | |
59 | character classes, see L<perlrebackslash>.) | |
8a118206 | 60 | |
6b83a163 KW |
61 | \d Match a decimal digit character. |
62 | \D Match a non-decimal-digit character. | |
8a118206 RGS |
63 | \w Match a "word" character. |
64 | \W Match a non-"word" character. | |
ea449505 KW |
65 | \s Match a whitespace character. |
66 | \S Match a non-whitespace character. | |
67 | \h Match a horizontal whitespace character. | |
68 | \H Match a character that isn't horizontal whitespace. | |
ea449505 KW |
69 | \v Match a vertical whitespace character. |
70 | \V Match a character that isn't vertical whitespace. | |
6b83a163 KW |
71 | \N Match a character that isn't a newline. Experimental. |
72 | \pP, \p{Prop} Match a character that has the given Unicode property. | |
6c5a041f | 73 | \PP, \P{Prop} Match a character that doesn't have the Unicode property. |
8a118206 RGS |
74 | |
75 | =head3 Digits | |
76 | ||
6b83a163 | 77 | C<\d> matches a single character that is considered to be a decimal I<digit>. |
17657a39 KW |
78 | What is considered a decimal digit depends on several factors, detailed |
79 | below in L</Locale, EBCDIC, Unicode and UTF-8>. If those factors | |
80 | indicate a Unicode interpretation, C<\d> not only matches the digits | |
81 | '0' - '9', but also Arabic, Devanagari and digits from other languages. | |
82 | Otherwise, if there is a locale in effect, it will match whatever | |
83 | characters the locale considers decimal digits. Without a locale, C<\d> | |
84 | matches just the digits '0' to '9'. | |
6b83a163 KW |
85 | |
86 | Unicode digits may cause some confusion, and some security issues. In UTF-8 | |
f7d1198f KW |
87 | strings, unless the C<"a"> regular expression modifier is specified, |
88 | C<\d> matches the same characters matched by | |
6b83a163 KW |
89 | C<\p{General_Category=Decimal_Number}>, or synonymously, |
90 | C<\p{General_Category=Digit}>. Starting with Unicode version 4.1, this is the | |
91 | same set of characters matched by C<\p{Numeric_Type=Decimal}>. | |
92 | ||
93 | But Unicode also has a different property with a similar name, | |
94 | C<\p{Numeric_Type=Digit}>, which matches a completely different set of | |
95 | characters. These characters are things such as subscripts. | |
96 | ||
97 | The design intent is for C<\d> to match all the digits (and no other characters) | |
98 | that can be used with "normal" big-endian positional decimal syntax, whereby a | |
99 | sequence of such digits {N0, N1, N2, ...Nn} has the numeric value (...(N0 * 10 | |
100 | + N1) * 10 + N2) * 10 ... + Nn). In Unicode 5.2, the Tamil digits (U+0BE6 - | |
101 | U+0BEF) can also legally be used in old-style Tamil numbers in which they would | |
102 | appear no more than one in a row, separated by characters that mean "times 10", | |
103 | "times 100", etc. (See L<http://www.unicode.org/notes/tn21>.) | |
104 | ||
105 | Some of the non-European digits that C<\d> matches look like European ones, but | |
6671dd37 KW |
106 | have different values. For example, BENGALI DIGIT FOUR (U+09EA) looks |
107 | very much like an ASCII DIGIT EIGHT (U+0038). | |
6b83a163 KW |
108 | |
109 | It may be useful for security purposes for an application to require that all | |
110 | digits in a row be from the same script. See L<Unicode::UCD/charscript()>. | |
8a118206 RGS |
111 | |
112 | Any character that isn't matched by C<\d> will be matched by C<\D>. | |
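The difference between C<\d> and an explicit ASCII range can be seen in a small sketch. It assumes a perl that supports the C<"a"> modifier mentioned above (5.14 or later); the variable names are purely illustrative.

```perl
use strict;
use warnings;

# ARABIC-INDIC DIGIT ONE (U+0661) is a decimal digit in Unicode,
# but it is not one of the ASCII digits '0' to '9'.
my $arabic_one = "\x{0661}";

my $unicode_d = $arabic_one =~ /^\d$/    ? 1 : 0;  # Unicode interpretation
my $ascii_d   = $arabic_one =~ /^\d$/a   ? 1 : 0;  # "a" restricts \d to ASCII
my $range     = $arabic_one =~ /^[0-9]$/ ? 1 : 0;  # explicit ASCII range

print "\\d: $unicode_d, \\d with /a: $ascii_d, [0-9]: $range\n";
```

When only '0' to '9' should be accepted, for example when validating input, the explicit range or the C<"a"> modifier is probably the safer choice.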
113 | ||
114 | =head3 Word characters | |
115 | ||
ea449505 | 116 | A C<\w> matches a single alphanumeric character (an alphabetic character, or a |
d35dd6c6 KW |
117 | decimal digit) or a connecting punctuation character, such as an |
118 | underscore ("_"). It does not match a whole word. To match a whole | |
6b83a163 | 119 | word, use C<\w+>. This isn't the same thing as matching an English word, but |
765fa144 | 120 | in the ASCII range it is the same as a string of Perl-identifier |
d35dd6c6 | 121 | characters. What is considered a |
17657a39 KW |
122 | word character depends on several factors, detailed below in L</Locale, |
123 | EBCDIC, Unicode and UTF-8>. If those factors indicate a Unicode | |
124 | interpretation, C<\w> matches the characters that are considered word | |
ea449505 | 125 | characters in the Unicode database. That is, it not only matches ASCII letters, |
d35dd6c6 KW |
126 | but also Thai letters, Greek letters, etc. This includes connector |
127 | punctuation (like the underscore) which connects two words together, or | |
128 | marks, such as a C<COMBINING TILDE>, which are generally used to add | |
129 | diacritical marks to letters. If a Unicode interpretation | |
17657a39 KW |
130 | is not indicated, C<\w> matches those characters that are considered |
131 | word characters by the current locale or EBCDIC code page. Without a | |
132 | locale or EBCDIC code page, C<\w> matches the ASCII letters, digits and | |
133 | the underscore. | |
8a118206 | 134 | |
6b83a163 KW |
135 | There are a number of security issues with the full Unicode list of word |
136 | characters. See L<http://unicode.org/reports/tr36>. | |
137 | ||
138 | Also, for a somewhat finer-grained set of characters that are in programming | |
139 | language identifiers beyond the ASCII range, you may wish to instead use the | |
140 | more customized Unicode properties, "ID_Start", "ID_Continue", "XID_Start", and | |
141 | "XID_Continue". See L<http://unicode.org/reports/tr31>. | |
142 | ||
8a118206 RGS |
143 | Any character that isn't matched by C<\w> will be matched by C<\W>. |
144 | ||
ea449505 KW |
145 | =head3 Whitespace |
146 | ||
6b83a163 | 147 | C<\s> matches any single character that is considered whitespace. The exact |
17657a39 KW |
148 | set of characters matched by C<\s> depends on several factors, detailed |
149 | below in L</Locale, EBCDIC, Unicode and UTF-8>. If those factors | |
150 | indicate a Unicode interpretation, C<\s> matches what is considered | |
151 | whitespace in the Unicode database; the complete list is in the table | |
152 | below. Otherwise, if there is a locale or EBCDIC code page in effect, | |
153 | C<\s> matches whatever is considered whitespace by the current locale or | |
154 | EBCDIC code page. Without a locale or EBCDIC code page, C<\s> matches | |
155 | the horizontal tab (C<\t>), the newline (C<\n>), the form feed (C<\f>), | |
156 | the carriage return (C<\r>), and the space. (Note that it doesn't match | |
157 | the vertical tab, C<\cK>.) Perhaps the most notable possible surprise | |
158 | is that C<\s> matches a non-breaking space only if a Unicode | |
159 | interpretation is indicated, or the locale or EBCDIC code page that is | |
160 | in effect has that character. | |
8a118206 RGS |
161 | |
162 | Any character that isn't matched by C<\s> will be matched by C<\S>. | |
163 | ||
ea449505 | 164 | C<\h> will match any character that is considered horizontal whitespace; |
6b83a163 KW |
165 | this includes the space and the tab characters and a number of other characters, |
166 | all of which are listed in the table below. C<\H> will match any character | |
ea449505 KW |
167 | that is not considered horizontal whitespace. |
168 | ||
ea449505 | 169 | C<\v> will match any character that is considered vertical whitespace; |
6b83a163 KW |
170 | this includes the carriage return and line feed characters (newline) plus several |
171 | other characters, all listed in the table below. | |
ea449505 | 172 | C<\V> will match any character that is not considered vertical whitespace. |
8a118206 RGS |
173 | |
174 | C<\R> matches anything that can be considered a newline under Unicode | |
175 | rules. It's not a character class, as it can match a multi-character | |
176 | sequence. Therefore, it cannot be used inside a bracketed character | |
ea449505 KW |
177 | class; use C<\v> instead (vertical whitespace). |
178 | Details are discussed in L<perlrebackslash>. | |
8a118206 RGS |
179 | |
180 | Note that unlike C<\s>, C<\d> and C<\w>, C<\h> and C<\v> always match | |
17657a39 KW |
181 | the same characters, without regard to other factors, such as if the |
182 | source string is in UTF-8 format or not. | |
8a118206 | 183 | |
ea449505 KW |
184 | One might think that C<\s> is equivalent to C<[\h\v]>. This is not true. The |
185 | vertical tab (C<"\x0b">) is not matched by C<\s>; it is, however, considered | |
186 | vertical whitespace. Furthermore, if the source string is not in UTF-8 format, | |
187 | and any locale or EBCDIC code page that is in effect doesn't include them, the | |
6b83a163 KW |
188 | next line (ASCII-platform C<"\x85">) and the no-break space (ASCII-platform |
189 | C<"\xA0">) characters are not matched by C<\s>, but are by C<\v> and C<\h> | |
f7d1198f KW |
190 | respectively. If the C<"a"> modifier is not in effect, and the source |
191 | string is in UTF-8 format, both the next line and | |
6b83a163 | 192 | the no-break space are matched by C<\s>. |
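Because C<\h> and C<\v> do not depend on the string's encoding or on a locale, their behavior is easy to check directly. The following sketch assumes an ASCII platform and probes a few code points from the table below; the hash names are illustrative only.

```perl
use strict;
use warnings;

# Record, for a handful of code points, whether \h and \v match them.
my (%is_h, %is_v);
for my $cp (0x09, 0x0a, 0x0b, 0x20, 0x85, 0xa0) {
    my $char = chr $cp;
    $is_h{$cp} = $char =~ /\h/ ? 1 : 0;   # horizontal whitespace?
    $is_v{$cp} = $char =~ /\v/ ? 1 : 0;   # vertical whitespace?
    printf "0x%05x  h=%d  v=%d\n", $cp, $is_h{$cp}, $is_v{$cp};
}
```

C<\s> is deliberately left out of the sketch, since what it matches for some of these code points depends on the factors described above.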
8a118206 RGS |
193 | |
194 | The following table is a complete listing of characters matched by | |
ea449505 | 195 | C<\s>, C<\h> and C<\v> as of Unicode 5.2. |
8a118206 RGS |
196 | |
197 | The first column gives the code point of the character (in hex format), | |
198 | the second column gives the (Unicode) name. The third column indicates | |
ea449505 KW |
199 | by which class(es) the character is matched (assuming no locale or EBCDIC code |
200 | page is in effect that changes the C<\s> matching). | |
8a118206 RGS |
201 | |
202 | 0x00009 CHARACTER TABULATION h s | |
203 | 0x0000a LINE FEED (LF) vs | |
204 | 0x0000b LINE TABULATION v | |
205 | 0x0000c FORM FEED (FF) vs | |
206 | 0x0000d CARRIAGE RETURN (CR) vs | |
207 | 0x00020 SPACE h s | |
208 | 0x00085 NEXT LINE (NEL) vs [1] | |
209 | 0x000a0 NO-BREAK SPACE h s [1] | |
210 | 0x01680 OGHAM SPACE MARK h s | |
211 | 0x0180e MONGOLIAN VOWEL SEPARATOR h s | |
212 | 0x02000 EN QUAD h s | |
213 | 0x02001 EM QUAD h s | |
214 | 0x02002 EN SPACE h s | |
215 | 0x02003 EM SPACE h s | |
216 | 0x02004 THREE-PER-EM SPACE h s | |
217 | 0x02005 FOUR-PER-EM SPACE h s | |
218 | 0x02006 SIX-PER-EM SPACE h s | |
219 | 0x02007 FIGURE SPACE h s | |
220 | 0x02008 PUNCTUATION SPACE h s | |
221 | 0x02009 THIN SPACE h s | |
222 | 0x0200a HAIR SPACE h s | |
223 | 0x02028 LINE SEPARATOR vs | |
224 | 0x02029 PARAGRAPH SEPARATOR vs | |
225 | 0x0202f NARROW NO-BREAK SPACE h s | |
226 | 0x0205f MEDIUM MATHEMATICAL SPACE h s | |
227 | 0x03000 IDEOGRAPHIC SPACE h s | |
228 | ||
229 | =over 4 | |
230 | ||
231 | =item [1] | |
232 | ||
233 | NEXT LINE and NO-BREAK SPACE only match C<\s> if the source string is in | |
f7d1198f KW |
234 | UTF-8 format and the C<"a"> modifier is not in effect; or the locale or |
235 | EBCDIC code page that is in effect includes them. | |
8a118206 RGS |
236 | |
237 | =back | |
238 | ||
239 | It is worth noting that C<\d>, C<\w>, etc, match single characters, not | |
e486b3cc | 240 | complete numbers or words. To match a number (that consists of digits), |
8a118206 RGS |
241 | use C<\d+>; to match a word, use C<\w+>. |
242 | ||
6b83a163 KW |
243 | =head3 \N |
244 | ||
245 | C<\N> is new in 5.12, and is experimental. It, like the dot, will match any | |
246 | character that is not a newline. The difference is that C<\N> is not influenced | |
247 | by the I<single line> regular expression modifier (see L</The dot> above). Note | |
248 | that the form C<\N{...}> may mean something completely different. When the | |
249 | C<{...}> is a L<quantifier|perlre/Quantifiers>, it means to match a non-newline | |
250 | character that many times. For example, C<\N{3}> means to match 3 | |
251 | non-newlines; C<\N{5,}> means to match 5 or more non-newlines. But if C<{...}> | |
252 | is not a legal quantifier, it is presumed to be a named character. See | |
253 | L<charnames> for those. For example, none of C<\N{COLON}>, C<\N{4F}>, and | |
254 | C<\N{F4}> contain legal quantifiers, so Perl will try to find characters whose | |
255 | names are, respectively, C<COLON>, C<4F>, and C<F4>. | |
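The different meanings of C<\N> can be sketched as follows. The example assumes perl 5.12 or later; C<use charnames> is loaded explicitly so that the named character is available on older perls, and the variable names are illustrative.

```perl
use strict;
use warnings;
use charnames ':full';   # makes \N{COLON} available on older perls

my $dot_s = "\n"  =~ /./s          ? 1 : 0;  # /s lets the dot match a newline
my $N_s   = "\n"  =~ /\N/s         ? 1 : 0;  # \N is not influenced by /s
my $three = "abc" =~ /^\N{3}$/     ? 1 : 0;  # {3} is a legal quantifier
my $colon = ":"   =~ /^\N{COLON}$/ ? 1 : 0;  # COLON is a character name

print "dot+s: $dot_s, \\N+s: $N_s, \\N{3}: $three, \\N{COLON}: $colon\n";
```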
8a118206 RGS |
256 | |
257 | =head3 Unicode Properties | |
258 | ||
c1c4ae3a KW |
259 | C<\pP> and C<\p{Prop}> are character classes to match characters that fit given |
260 | Unicode properties. One-letter property names can be used in the C<\pP> form, | |
261 | with the property name following the C<\p>; otherwise, braces are required. | |
262 | When using braces, there is a single form, which is just the property name | |
263 | enclosed in the braces, and a compound form which looks like C<\p{name=value}>, | |
264 | which means to match if the property "name" for the character has the particular | |
265 | "value". | |
e1b711da KW |
266 | For instance, a match for a number can be written as C</\pN/> or as |
267 | C</\p{Number}/>, or as C</\p{Number=True}/>. | |
268 | Lowercase letters are matched by the property I<Lowercase_Letter> which | |
269 | has the short form I<Ll>. They need the braces, so are written as C</\p{Ll}/> or | |
270 | C</\p{Lowercase_Letter}/>, or C</\p{General_Category=Lowercase_Letter}/> | |
271 | (the underscores are optional). | |
272 | C</\pLl/> is valid, but means something different. | |
8a118206 RGS |
273 | It matches a two character string: a letter (Unicode property C<\pL>), |
274 | followed by a lowercase C<l>. | |
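The braces pitfall described above can be checked with a short sketch; the variable names are illustrative.

```perl
use strict;
use warnings;

my $ll_a   = "a"  =~ /^\p{Ll}$/ ? 1 : 0;  # "a" is a lowercase letter
my $ll_A   = "A"  =~ /^\p{Ll}$/ ? 1 : 0;  # "A" is not
my $pLl_Al = "Al" =~ /^\pLl$/   ? 1 : 0;  # \pL, then a literal "l"
my $pLl_a  = "a"  =~ /^\pLl$/   ? 1 : 0;  # only one character: no match

# The one-letter and the long form of the same property agree.
my $number = ("7" =~ /^\pN$/ && "7" =~ /^\p{Number}$/) ? 1 : 0;

print "$ll_a $ll_A $pLl_Al $pLl_a $number\n";
```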
275 | ||
56ca34ca KW |
276 | Note that almost all properties are immune to case-insensitive matching. |
277 | That is, adding a C</i> regular expression modifier does not change what | |
278 | they match. There are two sets that are affected. The first set is | |
279 | C<Uppercase_Letter>, | |
280 | C<Lowercase_Letter>, | |
281 | and C<Titlecase_Letter>, | |
282 | all of which match C<Cased_Letter> under C</i> matching. | |
283 | And the second set is | |
284 | C<Uppercase>, | |
285 | C<Lowercase>, | |
286 | and C<Titlecase>, | |
287 | all of which match C<Cased> under C</i> matching. | |
288 | (The difference between these sets is that some things, such as Roman | |
289 | Numerals come in both upper and lower case so they are C<Cased>, but | |
290 | aren't considered to be letters, so they aren't C<Cased_Letter>s.) | |
291 | This set also includes its subsets C<PosixUpper> and C<PosixLower> both | |
292 | of which under C</i> matching match C<PosixAlpha>. | |
293 | ||
294 | For more details on Unicode properties, see L<perlunicode/Unicode | |
295 | Character Properties>; for a | |
e1b711da | 296 | complete list of possible properties, see |
56ca34ca KW |
297 | L<perluniprops/Properties accessible through \p{} and \P{}>, |
298 | which notes all forms that have C</i> differences. | |
e1b711da | 299 | It is also possible to define your own properties. This is discussed in |
8a118206 RGS |
300 | L<perlunicode/User-Defined Character Properties>. |
301 | ||
8a118206 RGS |
302 | =head4 Examples |
303 | ||
304 | "a" =~ /\w/ # Match, "a" is a 'word' character. | |
305 | "7" =~ /\w/ # Match, "7" is a 'word' character as well. | |
306 | "a" =~ /\d/ # No match, "a" isn't a digit. | |
307 | "7" =~ /\d/ # Match, "7" is a digit. | |
ea449505 | 308 | " " =~ /\s/ # Match, a space is whitespace. |
8a118206 RGS |
309 | "a" =~ /\D/ # Match, "a" is a non-digit. |
310 | "7" =~ /\D/ # No match, "7" is not a non-digit. | |
ea449505 | 311 | " " =~ /\S/ # No match, a space is not non-whitespace. |
8a118206 | 312 | |
ea449505 KW |
313 | " " =~ /\h/ # Match, space is horizontal whitespace. |
314 | " " =~ /\v/ # No match, space is not vertical whitespace. | |
315 | "\r" =~ /\v/ # Match, a return is vertical whitespace. | |
8a118206 RGS |
316 | |
317 | "a" =~ /\pL/ # Match, "a" is a letter. | |
318 | "a" =~ /\p{Lu}/ # No match, /\p{Lu}/ matches upper case letters. | |
319 | ||
320 | "\x{0e0b}" =~ /\p{Thai}/ # Match, \x{0e0b} is the character | |
321 | # 'THAI CHARACTER SO SO', and that's in | |
322 | # Thai Unicode class. | |
ea449505 | 323 | "a" =~ /\P{Lao}/ # Match, as "a" is not a Laotian character. |
8a118206 RGS |
324 | |
325 | ||
326 | =head2 Bracketed Character Classes | |
327 | ||
328 | The third form of character class you can use in Perl regular expressions | |
6b83a163 | 329 | is the bracketed character class. In its simplest form, it lists the characters |
c1c4ae3a | 330 | that may be matched, surrounded by square brackets, like this: C<[aeiou]>. |
ea449505 | 331 | This matches one of C<a>, C<e>, C<i>, C<o> or C<u>. Like the other |
8a118206 | 332 | character classes, exactly one character will be matched. To match |
ea449505 | 333 | a longer string consisting of characters mentioned in the character |
6b83a163 KW |
334 | class, follow the character class with a L<quantifier|perlre/Quantifiers>. For |
335 | instance, C<[aeiou]+> matches a string of one or more lowercase English vowels. | |
8a118206 RGS |
336 | |
337 | Repeating a character in a character class has no | |
338 | effect; it's considered to be in the set only once. | |
339 | ||
340 | Examples: | |
341 | ||
342 | "e" =~ /[aeiou]/ # Match, as "e" is listed in the class. | |
343 | "p" =~ /[aeiou]/ # No match, "p" is not listed in the class. | |
344 | "ae" =~ /^[aeiou]$/ # No match, a character class only matches | |
345 | # a single character. | |
346 | "ae" =~ /^[aeiou]+$/ # Match, due to the quantifier. | |
347 | ||
348 | =head3 Special Characters Inside a Bracketed Character Class | |
349 | ||
350 | Most characters that are meta characters in regular expressions (that | |
df225385 | 351 | is, characters that carry a special meaning like C<.>, C<*>, or C<(>) lose |
8a118206 RGS |
352 | their special meaning and can be used inside a character class without |
353 | the need to escape them. For instance, C<[()]> matches either an opening | |
354 | parenthesis, or a closing parenthesis, and the parens inside the character | |
355 | class don't group or capture. | |
356 | ||
357 | Characters that may carry a special meaning inside a character class are: | |
358 | C<\>, C<^>, C<->, C<[> and C<]>, and are discussed below. They can be | |
359 | escaped with a backslash, although this is sometimes not needed, in which | |
360 | case the backslash may be omitted. | |
361 | ||
362 | The sequence C<\b> is special inside a bracketed character class. While | |
6b83a163 | 363 | outside the character class, C<\b> is an assertion indicating a point |
8a118206 RGS |
364 | that does not have either two word characters or two non-word characters |
365 | on either side, inside a bracketed character class, C<\b> matches a | |
366 | backspace character. | |
367 | ||
df225385 KW |
368 | The sequences |
369 | C<\a>, | |
370 | C<\c>, | |
371 | C<\e>, | |
372 | C<\f>, | |
373 | C<\n>, | |
e526e8bb | 374 | C<\N{I<NAME>}>, |
765fa144 | 375 | C<\N{U+I<hex char>}>, |
df225385 KW |
376 | C<\r>, |
377 | C<\t>, | |
378 | and | |
379 | C<\x> | |
06ee63cd KW |
380 | are also special and have the same meanings as they do outside a |
381 | bracketed character class. (However, inside a bracketed character | |
382 | class, if C<\N{I<NAME>}> expands to a sequence of characters, only the first | |
383 | one in the sequence is used, with a warning.) | |
df225385 | 384 | |
ea449505 KW |
385 | Also, a backslash followed by two or three octal digits is considered an octal |
386 | number. | |
df225385 | 387 | |
6b83a163 KW |
388 | A C<[> is not special inside a character class, unless it's the start of a |
389 | POSIX character class (see L</POSIX Character Classes> below). It normally does | |
390 | not need escaping. | |
8a118206 | 391 | |
6b83a163 KW |
392 | A C<]> is normally either the end of a POSIX character class (see |
393 | L</POSIX Character Classes> below), or it signals the end of the bracketed | |
394 | character class. If you want to include a C<]> in the set of characters, you | |
395 | must generally escape it. | |
8a118206 RGS |
396 | However, if the C<]> is the I<first> (or the second if the first |
397 | character is a caret) character of a bracketed character class, it | |
398 | does not denote the end of the class (as you cannot have an empty class) | |
399 | and is considered part of the set of characters that can be matched without | |
400 | escaping. | |
401 | ||
402 | Examples: | |
403 | ||
404 | "+" =~ /[+?*]/ # Match, "+" in a character class is not special. | |
405 | "\cH" =~ /[\b]/ # Match, \b inside a character class | |
c1c4ae3a | 406 | # is equivalent to a backspace. |
8a118206 RGS |
407 | "]" =~ /[][]/ # Match, as the character class contains |
408 | # both [ and ]. | |
409 | "[]" =~ /[[]]/ # Match, the pattern contains a character class | |
410 | # containing just [, and the character class is | |
411 | # followed by a ]. | |
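As a further sketch of these rules, the two patterns below describe the same set of three characters, once relying on position and once on escaping. The variable names are illustrative.

```perl
use strict;
use warnings;

# "]" first, "^" not first, and "-" last: none is special in these positions.
my $unescaped = qr/^[]^-]$/;
# The same set with every potentially special character escaped.
my $escaped   = qr/^[\]\^\-]$/;

my @results;
for my $char (']', '^', '-', 'a') {
    my $u = $char =~ $unescaped ? 1 : 0;
    my $e = $char =~ $escaped   ? 1 : 0;
    push @results, $char . ':' . $u . $e;
}
print "@results\n";
```

The escaped form is longer, but it keeps working if the characters in the class are later reordered.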
412 | ||
413 | =head3 Character Ranges | |
414 | ||
415 | It is not uncommon to want to match a range of characters. Luckily, instead | |
416 | of listing all the characters in the range, one may use the hyphen (C<->). | |
417 | If inside a bracketed character class you have two characters separated | |
418 | by a hyphen, it's treated as if all the characters between the two are in | |
419 | the class. For instance, C<[0-9]> matches any ASCII digit, and C<[a-m]> | |
420 | matches any lowercase letter from the first half of the ASCII alphabet. | |
421 | ||
422 | Note that the two characters on either side of the hyphen are not | |
765fa144 | 423 | necessarily both letters or both digits. Any character is possible, |
8a118206 RGS |
424 | although not advisable. C<['-?]> contains a range of characters, but |
425 | most people will not know which characters that will be. Furthermore, | |
426 | such ranges may lead to portability problems if the code has to run on | |
427 | a platform that uses a different character set, such as EBCDIC. | |
428 | ||
ea449505 KW |
429 | If a hyphen in a character class cannot syntactically be part of a range, for |
430 | instance because it is the first or the last character of the character class, | |
8a118206 | 431 | or if it immediately follows a range, the hyphen isn't special, and will be |
6b83a163 | 432 | considered a character that is to be matched literally. You have to escape the |
c1c4ae3a KW |
433 | hyphen with a backslash if you want to have a hyphen in your set of characters |
434 | to be matched, and its position in the class is such that it could be | |
435 | considered part of a range. | |
8a118206 RGS |
436 | |
437 | Examples: | |
438 | ||
439 | [a-z] # Matches a character that is a lower case ASCII letter. | |
c1c4ae3a KW |
440 | [a-fz] # Matches any letter between 'a' and 'f' (inclusive) or |
441 | # the letter 'z'. | |
8a118206 RGS |
442 | [-z] # Matches either a hyphen ('-') or the letter 'z'. |
443 | [a-f-m] # Matches any letter between 'a' and 'f' (inclusive), the | |
444 | # hyphen ('-'), or the letter 'm'. | |
445 | ['-?] # Matches any of the characters '()*+,-./0123456789:;<=>? | |
446 | # (But not on an EBCDIC platform). | |
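To see exactly which characters an unfamiliar range covers, one can enumerate them. This sketch assumes an ASCII platform and only looks at code points 0 to 127; the variable names are illustrative.

```perl
use strict;
use warnings;

# Collect every ASCII character matched by the range from "'" to "?".
my $matched = join '', grep { /['-?]/ } map { chr } 0x00 .. 0x7f;
print "$matched\n";

# A hyphen that immediately follows a range is taken literally.
my $hyphen = "-" =~ /^[a-f-m]$/ ? 1 : 0;   # the literal hyphen
my $g      = "g" =~ /^[a-f-m]$/ ? 1 : 0;   # not part of any range here
print "hyphen: $hyphen, g: $g\n";
```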
447 | ||
448 | ||
449 | =head3 Negation | |
450 | ||
451 | It is also possible to instead list the characters you do not want to | |
452 | match. You can do so by using a caret (C<^>) as the first character in the | |
453 | character class. For instance, C<[^a-z]> matches a character that is not a | |
454 | lowercase ASCII letter. | |
455 | ||
456 | This syntax makes the caret a special character inside a bracketed character | |
457 | class, but only if it is the first character of the class. So if you want | |
458 | to have the caret as one of the characters you want to match, you either | |
459 | have to escape the caret, or not list it first. | |
460 | ||
461 | Examples: | |
462 | ||
463 | "e" =~ /[^aeiou]/ # No match, the 'e' is listed. | |
464 | "x" =~ /[^aeiou]/ # Match, as 'x' isn't a lowercase vowel. | |
465 | "^" =~ /[^^]/ # No match, matches anything that isn't a caret. | |
466 | "^" =~ /[x^]/ # Match, caret is not special here. | |
467 | ||
468 | =head3 Backslash Sequences | |
469 | ||
ea449505 | 470 | You can put any backslash sequence character class (with the exception of |
765fa144 | 471 | C<\N> and C<\R>) inside a bracketed character class, and it will act just |
df225385 | 472 | as if you put all the characters matched by the backslash sequence inside the |
6b83a163 KW |
473 | character class. For instance, C<[a-f\d]> will match any decimal digit, or any |
474 | of the lowercase letters between 'a' and 'f' inclusive. | |
475 | ||
476 | C<\N> within a bracketed character class must be of the forms C<\N{I<name>}> | |
765fa144 | 477 | or C<\N{U+I<hex char>}>, and NOT be the form that matches non-newlines, |
6b83a163 KW |
478 | for the same reason that a dot C<.> inside a bracketed character class loses |
479 | its special meaning: it matches nearly anything, which generally isn't what you | |
480 | want to happen. | |
df225385 | 481 | |
8a118206 RGS |
482 | |
483 | Examples: | |
484 | ||
485 | /[\p{Thai}\d]/ # Matches a character that is either a Thai | |
486 | # character, or a digit. | |
487 | /[^\p{Arabic}()]/ # Matches a character that is neither an Arabic | |
488 | # character, nor a parenthesis. | |
489 | ||
490 | Backslash sequence character classes cannot form one of the endpoints | |
6b83a163 KW |
491 | of a range. Thus, you can't say: |
492 | ||
493 | /[\p{Thai}-\d]/ # Wrong! | |
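Backslash sequences also combine with negation. The sketch below contrasts C<[a-f\d]> with the idiom C<[^\W\d_]>, which matches a word character that is neither a digit nor an underscore. The hash names are illustrative.

```perl
use strict;
use warnings;

my (%hexish, %letter);
for my $char ('c', '7', 'g', '_') {
    # Lowercase 'a' to 'f', or any decimal digit.
    $hexish{$char} = $char =~ /^[a-f\d]$/  ? 1 : 0;
    # Not a non-word character, not a digit, not an underscore.
    $letter{$char} = $char =~ /^[^\W\d_]$/ ? 1 : 0;
    print "$char  [a-f\\d]=$hexish{$char}  [^\\W\\d_]=$letter{$char}\n";
}
```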
8a118206 | 494 | |
6b83a163 | 495 | =head3 POSIX Character Classes |
ea449505 | 496 | X<character class> X<\p> X<\p{}> |
ea449505 KW |
497 | X<alpha> X<alnum> X<ascii> X<blank> X<cntrl> X<digit> X<graph> |
498 | X<lower> X<print> X<punct> X<space> X<upper> X<word> X<xdigit> | |
8a118206 | 499 | |
6b83a163 KW |
500 | POSIX character classes have the form C<[:class:]>, where I<class> is the |
501 | name, and C<[:> and C<:]> are the delimiters. POSIX character classes only appear | |
8a118206 | 502 | I<inside> bracketed character classes, and are a convenient and descriptive |
f7d1198f | 503 | way of listing a group of characters, though they can suffer from |
6b83a163 KW |
504 | portability issues (see below and L</Locale, EBCDIC, Unicode and UTF-8>). |
505 | ||
506 | Be careful about the syntax: | |
8a118206 RGS |
507 | |
508 | # Correct: | |
509 | $string =~ /[[:alpha:]]/ | |
510 | ||
511 | # Incorrect (will warn): | |
512 | $string =~ /[:alpha:]/ | |
513 | ||
514 | The latter pattern would be a character class consisting of a colon, | |
515 | and the letters C<a>, C<l>, C<p> and C<h>. | |
6b83a163 | 516 | POSIX character classes can be part of a larger bracketed character class. For |
ea449505 KW |
517 | example, |
518 | ||
519 | [01[:alpha:]%] | |
520 | ||
521 | is valid and matches '0', '1', any alphabetic character, and the percent sign. | |
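A quick way to confirm what such a combined class accepts is to test a few single characters against it, as in this sketch (the names are illustrative):

```perl
use strict;
use warnings;

# '0', '1', any alphabetic character, or the percent sign.
my $class = qr/^[01[:alpha:]%]$/;

my %accepted;
for my $char ('0', '1', '2', 'x', '%', ':') {
    $accepted{$char} = $char =~ $class ? 1 : 0;
    print "'$char' => $accepted{$char}\n";
}
```

The colon is rejected: inside the outer brackets, C<[:alpha:]> is read as a POSIX class, not as a list of literal characters.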
8a118206 RGS |
522 | |
523 | Perl recognizes the following POSIX character classes: | |
524 | ||
ea449505 KW |
525 | alpha Any alphabetical character ("[A-Za-z]"). |
526 | alnum Any alphanumerical character ("[A-Za-z0-9]"). | |
527 | ascii Any character in the ASCII character set. | |
ea8b8ad2 | 528 | blank A GNU extension, equal to a space or a horizontal tab ("\t"). |
ea449505 KW |
529 | cntrl Any control character. See Note [2] below. |
530 | digit Any decimal digit ("[0-9]"), equivalent to "\d". | |
531 | graph Any printable character, excluding a space. See Note [3] below. | |
532 | lower Any lowercase character ("[a-z]"). | |
533 | print Any printable character, including a space. See Note [4] below. | |
c1c4ae3a | 534 | punct Any graphical character excluding "word" characters. Note [5]. |
ea449505 KW |
535 | space Any whitespace character. "\s" plus the vertical tab ("\cK"). |
536 | upper Any uppercase character ("[A-Z]"). | |
537 | word A Perl extension ("[A-Za-z0-9_]"), equivalent to "\w". | |
538 | xdigit Any hexadecimal digit ("[0-9a-fA-F]"). | |
539 | ||
540 | Most POSIX character classes have two Unicode-style C<\p> property | |
541 | counterparts. (They are not official Unicode properties, but Perl extensions | |
542 | derived from official Unicode properties.) The table below shows the relation | |
543 | between POSIX character classes and these counterparts. | |
544 | ||
545 | One counterpart, in the column labelled "ASCII-range Unicode" in | |
6b83a163 | 546 | the table, will only match characters in the ASCII character set. |
ea449505 KW |
547 | |
548 | The other counterpart, in the column labelled "Full-range Unicode", matches any | |
549 | appropriate characters in the full Unicode character set. For example, | |
550 | C<\p{Alpha}> will match not just the ASCII alphabetic characters, but any | |
551 | character in the entire Unicode character set that is considered to be | |
765fa144 | 552 | alphabetic. The column labelled "backslash sequence" is a (short) synonym for |
cbc24f92 | 553 | the Full-range Unicode form. |
ea449505 KW |
554 | |
555 | (Each of the counterparts has various synonyms as well. | |
556 | L<perluniprops/Properties accessible through \p{} and \P{}> lists all the | |
557 | synonyms, plus all the characters matched by each of the ASCII-range | |
558 | properties. For example C<\p{AHex}> is a synonym for C<\p{ASCII_Hex_Digit}>, | |
559 | and any C<\p> property name can be prefixed with "Is" such as C<\p{IsAlpha}>.) | |
560 | ||
561 | Both the C<\p> forms are unaffected by any locale that is in effect, or whether | |
562 | the string is in UTF-8 format or not, or whether the platform is EBCDIC or not. | |
f7d1198f KW |
563 | In contrast, the POSIX character classes are affected, unless the |
564 | regular expression is compiled with the C<"a"> modifier. If the C<"a"> | |
565 | modifier is not in effect, and the source string is in UTF-8 format, the | |
566 | POSIX classes behave like their "Full-range" Unicode counterparts. If the | |
567 | C<"a"> modifier is in effect, or if the source string is not in UTF-8 | |
568 | format, and no locale is in effect, and the platform is not EBCDIC, all | |
569 | the POSIX classes behave like their ASCII-range counterparts. | |
570 | Otherwise, they behave based on the rules of the locale or EBCDIC code | |
571 | page. | |
6b83a163 | 572 | |
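The rules above can be checked with a small script. This is only a sketch: it assumes Perl 5.14 or later (for the C<"a"> modifier), an ASCII platform, and no C<use locale> or C<use feature "unicode_strings"> in scope. The helper name C<matches> and the hash keys are made up for illustration.

```perl
use strict;
use warnings;

# U+00E9 (e with acute) stored two ways: as a native 8-bit string and
# upgraded to Perl's internal UTF-8 representation.
my $native = "\xE9";
utf8::downgrade($native);
my $upgraded = "\xE9";
utf8::upgrade($upgraded);

# Hypothetical helper: returns 1 or 0 so results are easy to compare.
sub matches {
    my ($str, $re) = @_;
    return $str =~ $re ? 1 : 0;
}

my %result = (
    # The POSIX class depends on the UTF-8-ness of the source string...
    posix_native     => matches($native,   qr/[[:alpha:]]/),
    posix_upgraded   => matches($upgraded, qr/[[:alpha:]]/),
    # ...unless the "a" modifier restricts it to ASCII.
    posix_a_upgraded => matches($upgraded, qr/[[:alpha:]]/a),
    # The \p forms do not depend on the encoding of the string.
    ascii_p_upgraded => matches($upgraded, qr/\p{PosixAlpha}/),
    full_p_native    => matches($native,   qr/\p{XPosixAlpha}/),
    full_p_upgraded  => matches($upgraded, qr/\p{XPosixAlpha}/),
);

printf "%-17s %d\n", $_, $result{$_} for sort keys %result;
```

Under those assumptions, only the C<posix_upgraded> and the two C<full_p_*> entries should report a match.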
ea449505 | 573 | It is proposed to change this behavior in a future release of Perl so that the |
765fa144 | 574 | UTF-8-ness of the source string will be irrelevant to the behavior of the |
ea449505 KW |
575 | POSIX character classes. This means they will always behave in strict |
576 | accordance with the official POSIX standard. That is, if either locale or | |
577 | EBCDIC code page is present, they will behave in accordance with those; if | |
578 | absent, the classes will match only their ASCII-range counterparts. If you | |
765fa144 | 579 | wish to comment on this proposal, send email to C<perl5-porters@perl.org>. |
ea449505 | 580 | |
cbc24f92 KW |
581 | [[:...:]] ASCII-range Full-range backslash Note |
582 | Unicode Unicode sequence | |
ea449505 | 583 | ----------------------------------------------------- |
cbc24f92 KW |
584 | alpha \p{PosixAlpha} \p{XPosixAlpha} |
585 | alnum \p{PosixAlnum} \p{XPosixAlnum} | |
ea449505 | 586 | ascii \p{ASCII} |
cbc24f92 KW |
587 | blank \p{PosixBlank} \p{XPosixBlank} \h [1] |
588 | or \p{HorizSpace} [1] | |
589 | cntrl \p{PosixCntrl} \p{XPosixCntrl} [2] | |
590 | digit \p{PosixDigit} \p{XPosixDigit} \d | |
591 | graph \p{PosixGraph} \p{XPosixGraph} [3] | |
592 | lower \p{PosixLower} \p{XPosixLower} | |
593 | print \p{PosixPrint} \p{XPosixPrint} [4] | |
594 | punct \p{PosixPunct} \p{XPosixPunct} [5] | |
595 | \p{PerlSpace} \p{XPerlSpace} \s [6] | |
596 | space \p{PosixSpace} \p{XPosixSpace} [6] | |
597 | upper \p{PosixUpper} \p{XPosixUpper} | |
598 | word \p{PosixWord} \p{XPosixWord} \w | |
599 | xdigit \p{ASCII_Hex_Digit} \p{XPosixXDigit} | |
8a118206 RGS |
600 | |
601 | =over 4 | |
602 | ||
ea449505 KW |
603 | =item [1] |
604 | ||
605 | C<\p{Blank}> and C<\p{HorizSpace}> are synonyms. | |
606 | ||
607 | =item [2] | |
8a118206 | 608 | |
ea449505 KW |
609 | Control characters don't produce output as such, but instead usually control |
610 | the terminal somehow: for example newline and backspace are control characters. | |
611 | In the ASCII range, characters whose ordinals are between 0 and 31 inclusive, | |
612 | plus 127 (C<DEL>) are control characters. | |
8a118206 | 613 | |
c1c4ae3a KW |
614 | On EBCDIC platforms, it is likely that the code page will define C<[[:cntrl:]]> |
615 | to be the EBCDIC equivalents of the ASCII controls, plus the controls | |
6b83a163 | 616 | that in Unicode have ordinals from 128 through 159. |
ea449505 KW |
617 | |
618 | =item [3] | |
8a118206 RGS |
619 | |
620 | Any character that is I<graphical>, that is, visible. This class consists | |
621 | of all the alphanumerical characters and all punctuation characters. | |
622 | ||
ea449505 | 623 | =item [4] |
8a118206 RGS |
624 | |
625 | All printable characters, which is the set of all the graphical characters | |
ea449505 KW |
626 | plus whitespace characters that are not also controls. |
627 | ||
b6dac59a | 628 | =item [5] |
ea449505 KW |
629 | |
630 | C<\p{PosixPunct}> and C<[[:punct:]]> in the ASCII range match all the | |
631 | non-controls, non-alphanumeric, non-space characters: | |
632 | C<[-!"#$%&'()*+,./:;<=E<gt>?@[\\\]^_`{|}~]> (although if a locale is in effect, | |
633 | it could alter the behavior of C<[[:punct:]]>). | |
634 | ||
cbc24f92 KW |
635 | The similarly named property, C<\p{Punct}>, matches a somewhat different |
636 | set in the ASCII range, namely | |
6c5a041f KW |
637 | C<[-!"#%&'()*,./:;?@[\\\]_{}]>. That is, it is missing C<[$+E<lt>=E<gt>^`|~]>. |
638 | This is because Unicode splits what POSIX considers to be punctuation into two | |
639 | categories, Punctuation and Symbols. | |
640 | ||
765fa144 KW |
641 | C<\p{XPosixPunct}> and (in Unicode mode) C<[[:punct:]]>, match what |
642 | C<\p{PosixPunct}> matches in the ASCII range, plus what C<\p{Punct}> | |
643 | matches. This is different from strictly matching according to | |
644 | C<\p{Punct}>. Another way to say it is that | |
6c5a041f KW |
645 | for a UTF-8 string, C<[[:punct:]]> matches all the characters that Unicode |
646 | considers to be punctuation, plus all the ASCII-range characters that Unicode | |
647 | considers to be symbols. | |
8a118206 | 648 | |
ea449505 | 649 | =item [6] |
8a118206 | 650 | |
ea449505 KW |
651 | C<\p{SpacePerl}> and C<\p{Space}> differ only in that C<\p{Space}> additionally |
652 | matches the vertical tab, C<\cK>. The same holds for the two ASCII-range forms. | |
8a118206 RGS |
653 | |
654 | =back | |
655 | ||
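The difference described in Note [5] between C<\p{PosixPunct}> and C<\p{Punct}> can be seen on a few ASCII characters. The following is a minimal sketch; the sample characters and the variable name C<%is_punct> are chosen here only for illustration.

```perl
use strict;
use warnings;

# For each sample character, record whether it matches the POSIX-style
# property and whether it matches the Unicode Punct property.
my %is_punct;
for my $char ('!', '$', '+', '_', '~') {
    $is_punct{$char} = {
        posix   => ($char =~ /\p{PosixPunct}/ ? 1 : 0),
        unicode => ($char =~ /\p{Punct}/      ? 1 : 0),
    };
}

for my $char (sort keys %is_punct) {
    printf "%s  PosixPunct=%d  Punct=%d\n",
        $char, $is_punct{$char}{posix}, $is_punct{$char}{unicode};
}
```

All five characters match C<\p{PosixPunct}>, but C<$>, C<+> and C<~> are Symbols in Unicode, so C<\p{Punct}> does not match them.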
cbc24f92 KW |
656 | There are various other synonyms that can be used for these besides |
657 | C<\p{HorizSpace}> and C<\p{XPosixBlank}>. For example | |
658 | C<\p{XPosixAlpha}> can be written as C<\p{Alpha}>. All are listed | |
659 | in L<perluniprops/Properties accessible through \p{} and \P{}>. | |
660 | ||
8a118206 | 661 | =head4 Negation |
ea449505 | 662 | X<character class, negation> |
8a118206 RGS |
663 | |
664 | A Perl extension to the POSIX character class is the ability to | |
665 | negate it. This is done by prefixing the class name with a caret (C<^>). | |
666 | Some examples: | |
667 | ||
ea449505 KW |
668 | POSIX ASCII-range Full-range backslash |
669 | Unicode Unicode sequence | |
670 | ----------------------------------------------------- | |
cbc24f92 KW |
671 | [[:^digit:]] \P{PosixDigit} \P{XPosixDigit} \D |
672 | [[:^space:]] \P{PosixSpace} \P{XPosixSpace} | |
673 | \P{PerlSpace} \P{XPerlSpace} \S | |
674 | [[:^word:]] \P{PosixWord} \P{XPosixWord} \W | |
675 | ||
765fa144 KW |
676 | The backslash sequence can mean either ASCII- or Full-range Unicode, |
677 | depending on various factors. See L</Locale, EBCDIC, Unicode and UTF-8> | |
678 | below. | |
8a118206 RGS |
679 | |
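As an illustration of negation, the sketch below extracts the non-digits from a plain ASCII string in three equivalent ways, and then shows a negated POSIX class combined with another element inside the same brackets. The sample string and variable names are arbitrary.

```perl
use strict;
use warnings;

my $text = "a1b22c";

# Three ways of asking for "a character that is not a digit".
# On this ASCII-only input they agree.
my $posix = join '', $text =~ /[[:^digit:]]/g;
my $prop  = join '', $text =~ /\P{PosixDigit}/g;
my $bs    = join '', $text =~ /\D/g;

# A negated POSIX class is still just one element of the bracketed
# class, so it can be combined with others: any non-digit, or a '2'.
my $mixed = join '', $text =~ /[[:^digit:]2]/g;

print "$posix $prop $bs $mixed\n";
```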
680 | =head4 [= =] and [. .] | |
681 | ||
682 | Perl will recognize the POSIX character classes C<[=class=]>, and | |
ea449505 | 683 | C<[.class.]>, but does not (yet?) support them. Use of |
740bae87 | 684 | such a construct will lead to an error. |
8a118206 RGS |
685 | |
686 | ||
687 | =head4 Examples | |
688 | ||
689 | /[[:digit:]]/ # Matches a character that is a digit. | |
690 | /[01[:lower:]]/ # Matches a character that is either a | |
691 | # lowercase letter, or '0' or '1'. | |
c1c4ae3a KW |
692 | /[[:digit:][:^xdigit:]]/ # Matches a character that can be anything |
693 | # except the letters 'a' to 'f' and 'A' to 'F'. This is |
694 | # because the main character class is composed | |
695 | # of two POSIX character classes that are ORed | |
696 | # together, one that matches any digit, and | |
697 | # the other that matches anything that isn't a | |
698 | # hex digit. The result matches all | |
699 | # characters except the letters 'a' to 'f' and | |
700 | # 'A' to 'F'. | |
8a118206 RGS |
701 | |
702 | ||
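The last example can be verified by brute force over the ASCII range. This sketch assumes Perl 5.14 or later, since it uses the C<"a"> modifier to keep the result independent of locale and string encoding; the variable names are arbitrary.

```perl
use strict;
use warnings;

# Collect every ASCII character that the combined class rejects.
# Digits are accepted by [:digit:], and everything that is not a hex
# digit is accepted by [:^xdigit:], so only the hex letters remain.
my @rejected = grep { $_ !~ /[[:digit:][:^xdigit:]]/a }
               map  { chr($_) } 0 .. 127;

my $rejected = join '', @rejected;
print "$rejected\n";
```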
ea449505 | 703 | =head2 Locale, EBCDIC, Unicode and UTF-8 |
8a118206 | 704 | |
f7d1198f KW |
705 | Some of the character classes have a somewhat different behaviour |
706 | depending on the internal encoding of the source string, whether the regular |
707 | expression is marked as having Unicode semantics, the locale that is in |
708 | effect, and whether the program is running on an EBCDIC platform. |
709 | ||
710 | C<\w>, C<\d>, C<\s> and the POSIX character classes (and their | |
711 | negations, including C<\W>, C<\D>, C<\S>) have this behaviour. (Since | |
712 | the backslash sequences C<\b> and C<\B> are defined in terms of C<\w> | |
713 | and C<\W>, they also are affected.) | |
714 | ||
715 | Starting in Perl 5.14, if the regular expression is compiled with the | |
716 | C<"a"> modifier, the behavior doesn't differ regardless of any other | |
717 | factors. C<\d> matches the 10 digits 0-9; C<\D> any character but those | |
718 | 10; C<\s>, exactly the five characters "[ \f\n\r\t]"; C<\w> only the 63 | |
719 | characters "[A-Za-z0-9_]"; and the C<"[[:posix:]]"> classes only the | |
720 | appropriate ASCII characters, the same characters as are matched by the | |
721 | corresponding C<\p{}> property given in the "ASCII-range Unicode" column | |
722 | in the table above. (The behavior of all of their complements follows | |
723 | the same paradigm.) | |
724 | ||
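The effect of the C<"a"> modifier can be sketched with three non-ASCII characters that Unicode classifies as a digit, a word character and whitespace. The code assumes Perl 5.14 or later; the hash names are arbitrary.

```perl
use strict;
use warnings;

# One non-ASCII sample per backslash sequence. Each code point is
# above 255, so these strings are in UTF-8 format.
my %sample = (
    digit => "\x{0660}",    # ARABIC-INDIC DIGIT ZERO
    word  => "\x{03B1}",    # GREEK SMALL LETTER ALPHA
    space => "\x{2003}",    # EM SPACE
);

my %re_default = (digit => qr/^\d$/,  word => qr/^\w$/,  space => qr/^\s$/);
my %re_ascii   = (digit => qr/^\d$/a, word => qr/^\w$/a, space => qr/^\s$/a);

my (%default, %ascii);
for my $kind (sort keys %sample) {
    $default{$kind} = $sample{$kind} =~ $re_default{$kind} ? 1 : 0;
    $ascii{$kind}   = $sample{$kind} =~ $re_ascii{$kind}   ? 1 : 0;
    printf "%-5s default=%d /a=%d\n", $kind, $default{$kind}, $ascii{$kind};
}
```

Without the modifier each sample matches; with it, none does, because the sequences are restricted to their ASCII-range meanings.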
725 | Otherwise, a regular expression is marked for Unicode semantics if it is | |
726 | encoded in utf8 (usually as a result of including a literal character | |
727 | whose code point is above 255), or if it contains a C<\N{U+...}> or | |
728 | C<\N{I<name>}> construct, or (starting in Perl 5.14) if it was compiled | |
729 | in the scope of a C<S<use feature "unicode_strings">> pragma and not in | |
730 | the scope of a C<S<use locale>> pragma, or has the C<"u"> regular | |
b6dac59a | 731 | expression modifier. |
17657a39 | 732 | |
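Two of the ways of marking a pattern for Unicode semantics can be sketched as follows. The example assumes Perl 5.14 or later, an ASCII platform, and that neither C<use locale> nor an outer C<use feature "unicode_strings"> is in effect; the variable names are arbitrary.

```perl
use strict;
use warnings;

# U+00DF (sharp s) as a native 8-bit string, not in UTF-8 format.
my $sharp_s = "\xDF";
utf8::downgrade($sharp_s);

# Default rules: the pattern is not marked for Unicode semantics.
my $default = $sharp_s =~ /^\w/ ? 1 : 0;

# The "u" modifier marks this one pattern.
my $with_u = $sharp_s =~ /^\w/u ? 1 : 0;

# The feature marks every pattern compiled within its scope.
my $with_feature;
{
    use feature 'unicode_strings';
    $with_feature = $sharp_s =~ /^\w/ ? 1 : 0;
}

print "default=$default u=$with_u feature=$with_feature\n";
```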
f7d1198f KW |
733 | Note that one can make any regular expression modifier the default, for |
734 | example with C<"use re '/l'">, and this takes precedence over either of the | |
735 | C<S<use feature "unicode_strings">> or C<S<use locale>> pragmas. | |
736 | ||
17657a39 KW |
737 | The differences in behavior between locale and non-locale semantics |
738 | can affect any character whose code point is 255 or less. The | |
739 | differences in behavior between Unicode and non-Unicode semantics | |
740 | affect only ASCII platforms, and only when matching against characters |
741 | whose code points are between 128 and 255 inclusive. See | |
742 | L<perlunicode/The "Unicode Bug">. | |
8a118206 | 743 | |
f7d1198f KW |
744 | For portability reasons, unless the C<"a"> modifier is specified, |
745 | it may be better to not use C<\w>, C<\d>, C<\s> or the POSIX character | |
746 | classes and use the Unicode properties instead. | |
a12cf05f KW |
747 | That way you can control whether you want matching of just characters in |
748 | the ASCII character set, or any Unicode characters. | |
749 | C<S<use feature "unicode_strings">> will allow seamless Unicode behavior | |
750 | no matter what the internal encodings are, but won't allow restricting | |
751 | to just the ASCII characters. | |
8a118206 RGS |
752 | |
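One way to follow this advice is to name the intended range explicitly with the C<\p> forms. The two helper functions below, C<is_ascii_word> and C<is_unicode_word>, are hypothetical names used only for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helpers: each states explicitly which range is meant,
# so the answer does not depend on locale or on string encoding.
sub is_ascii_word   { return $_[0] =~ /\A\p{PosixWord}+\z/  ? 1 : 0 }
sub is_unicode_word { return $_[0] =~ /\A\p{XPosixWord}+\z/ ? 1 : 0 }

# "cafe" with an e-acute, once as a native 8-bit string and once
# upgraded to UTF-8 format.
my $native = "caf\xE9";
utf8::downgrade($native);
my $upgraded = "caf\xE9";
utf8::upgrade($upgraded);

for my $str ("cafe", $native, $upgraded) {
    printf "ascii=%d unicode=%d\n",
        is_ascii_word($str), is_unicode_word($str);
}
```

The native and upgraded strings give the same answers, which is the point of preferring the properties over C<\w> here.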
753 | =head4 Examples | |
754 | ||
755 | $str = "\xDF"; # $str is not in UTF-8 format. | |
756 | $str =~ /^\w/; # No match, as $str isn't in UTF-8 format. | |
757 | $str .= "\x{0e0b}"; # Now $str is in UTF-8 format. | |
758 | $str =~ /^\w/; # Match! $str is now in UTF-8 format. | |
759 | chop $str; | |
760 | $str =~ /^\w/; # Still a match! $str remains in UTF-8 format. | |
761 | ||
762 | =cut |