=head1 NAME

perlrecharclass - Perl Regular Expression Character Classes

=head1 DESCRIPTION

The top level documentation about Perl regular expressions
is found in L<perlre>.

This manual page discusses the syntax and use of character
classes in Perl Regular Expressions.

A character class is a way of denoting a set of characters,
in such a way that one character of the set is matched.
It's important to remember that matching a character class
consumes exactly one character in the source string. (The source
string is the string the regular expression is matched against.)

There are three types of character classes in Perl regular
expressions: the dot, backslashed sequences, and the bracketed form.

=head2 The dot

The dot (or period), C<.>, is probably the most used, and certainly
the best-known, character class. By default, a dot matches any
character except the newline. The default can be changed to
also match the newline with the I<single line> modifier: either
for the entire regular expression using the C</s> modifier, or
locally using C<(?s)>.

Here are some examples:

 "a"  =~  /./       # Match
 "."  =~  /./       # Match
 ""   =~  /./       # No match (dot has to match a character)
 "\n" =~  /./       # No match (dot does not match a newline)
 "\n" =~  /./s      # Match (global 'single line' modifier)
 "\n" =~  /(?s:.)/  # Match (local 'single line' modifier)
 "ab" =~  /^.$/     # No match (dot matches one character)

=head2 Backslashed sequences

Perl regular expressions contain many backslashed sequences that
constitute a character class. That is, they will match a single
character, if that character belongs to a specific set of characters
(defined by the sequence). A backslashed sequence is a sequence of
characters starting with a backslash. Not all backslashed sequences
are character classes; for a full list, see L<perlrebackslash>.

Here's a list of the backslashed sequences that are character classes.
They are discussed in more detail below.

 \d             Match a digit character.
 \D             Match a non-digit character.
 \w             Match a "word" character.
 \W             Match a non-"word" character.
 \s             Match a white space character.
 \S             Match a non-white space character.
 \h             Match a horizontal white space character.
 \H             Match a character that isn't horizontal white space.
 \N             Match a character that isn't a newline.
 \v             Match a vertical white space character.
 \V             Match a character that isn't vertical white space.
 \pP, \p{Prop}  Match a character matching a Unicode property.
 \PP, \P{Prop}  Match a character that doesn't match a Unicode property.

=head3 Digits

C<\d> matches a single character that is considered to be a I<digit>.
What is considered a digit depends on the internal encoding of
the source string. If the source string is in UTF-8 format, C<\d>
not only matches the digits '0' - '9', but also Arabic and Devanagari
digits, and digits from other scripts. Otherwise, if there is a locale
in effect, it will match whatever characters the locale considers digits.
Without a locale, C<\d> matches the digits '0' to '9'.
See L</Locale, Unicode and UTF-8>.

Any character that isn't matched by C<\d> will be matched by C<\D>.

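As a small illustration of the UTF-8 case, the sketch below compares C<\d>
with the explicit range C<[0-9]> on a non-ASCII digit. The variable names
are made up for the example, and it assumes that no locale and no
C</a> modifier are in effect.

```perl
use strict;
use warnings;

# ARABIC-INDIC DIGIT THREE. Its code point is above 255, so perl
# stores the string in UTF-8 format and applies Unicode rules.
my $arabic_three = "\x{0663}";

my $matches_d     = $arabic_three =~ /\d/    ? 1 : 0;  # Unicode digit
my $matches_range = $arabic_three =~ /[0-9]/ ? 1 : 0;  # ASCII digits only
my $matches_D     = $arabic_three =~ /\D/    ? 1 : 0;  # complement of \d

print "\\d: $matches_d, [0-9]: $matches_range, \\D: $matches_D\n";
```

If only the ASCII digits are wanted, spelling out C<[0-9]> avoids the
dependency on the string's encoding.
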
=head3 Word characters

C<\w> matches a single I<word> character: an alphanumeric character
(that is, an alphabetic character, or a digit), or the underscore (C<_>).
What is considered a word character depends on the internal encoding
of the string. If it's in UTF-8 format, C<\w> matches those characters
that are considered word characters in the Unicode database. That is, it
not only matches ASCII letters, but also Thai letters, Greek letters, etc.
If the source string isn't in UTF-8 format, C<\w> matches those characters
that are considered word characters by the current locale. Without
a locale in effect, C<\w> matches the ASCII letters, digits and the
underscore.

Any character that isn't matched by C<\w> will be matched by C<\W>.

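The following sketch shows the ASCII case: the underscore counts as a word
character, while the hyphen does not. The string and variable names are
hypothetical.

```perl
use strict;
use warnings;

my $identifier = "foo_bar-baz";

# \w+ consumes letters, digits and the underscore, and stops at the hyphen.
my ($leading_word) = $identifier =~ /^(\w+)/;

# Every maximal run of word characters in the string.
my @runs = $identifier =~ /(\w+)/g;

print "$leading_word\n";    # foo_bar
print "@runs\n";            # foo_bar baz
```
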
=head3 White space

C<\s> matches any single character that is considered white space. In the
ASCII range, C<\s> matches the horizontal tab (C<\t>), the new line
(C<\n>), the form feed (C<\f>), the carriage return (C<\r>), and the
space. (The vertical tab, C<\cK>, is not matched by C<\s>.) The exact set
of characters matched by C<\s> depends on whether the source string is
in UTF-8 format. If it is, C<\s> matches what is considered white space
in the Unicode database. Otherwise, if there is a locale in effect, C<\s>
matches whatever is considered white space by the current locale. Without
a locale, C<\s> matches the five characters mentioned at the beginning
of this paragraph. Perhaps the most notable difference is that C<\s>
matches a non-breaking space only if the non-breaking space is in a
UTF-8 encoded string.

Any character that isn't matched by C<\s> will be matched by C<\S>.

C<\h> will match any character that is considered horizontal white space;
this includes the space and the tab characters. C<\H> will match any character
that is not considered horizontal white space.

C<\N>, like the dot, will match any character that is not a newline. The
difference is that C<\N> will not be influenced by the single line C</s>
regular expression modifier. (Note that, since C<\N{}> is also used for
Unicode named characters, if C<\N> is followed by an opening brace and
by a letter, perl will assume that a Unicode character name is coming.)

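The difference between the dot and C<\N> under C</s> can be sketched as
follows. The variable names are illustrative only, and the example assumes
a perl recent enough to support C<\N> as a character class.

```perl
use strict;
use warnings;

my $newline = "\n";

my $dot_default = $newline =~ /./   ? 1 : 0;  # 0: dot skips the newline
my $dot_single  = $newline =~ /./s  ? 1 : 0;  # 1: /s lets dot match it
my $N_single    = $newline =~ /\N/s ? 1 : 0;  # 0: \N ignores /s

# Even under /s, \N+ stops at the end of the first line.
my ($first_line) = "one\ntwo" =~ /(\N+)/s;

print "$dot_default $dot_single $N_single $first_line\n";
```
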
C<\v> will match any character that is considered vertical white space;
this includes the carriage return and line feed characters (newline).
C<\V> will match any character that is not considered vertical white space.

C<\R> matches anything that can be considered a newline under Unicode
rules. It's not a character class, as it can match a multi-character
sequence. Therefore, it cannot be used inside a bracketed character
class. Details are discussed in L<perlrebackslash>.

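To see why C<\R> is not a character class, the sketch below splits the same
(made-up) text on C<\R> and on C<\v>. Because C<\R> can consume the two
character sequence C<"\r\n"> in one go, the two splits give a different
number of fields.

```perl
use strict;
use warnings;

my $text = "first\r\nsecond\nthird\rfourth";

# \R treats "\r\n" as a single line break.
my @by_R = split /\R/, $text;

# \v matches one character at a time, so "\r\n" yields an empty field.
my @by_v = split /\v/, $text;

print scalar(@by_R), " fields with \\R\n";   # 4
print scalar(@by_v), " fields with \\v\n";   # 5
```
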
C<\h>, C<\H>, C<\v>, C<\V>, and C<\R> are new in perl 5.10.0.

Note that unlike C<\s>, C<\d> and C<\w>, C<\h> and C<\v> always match
the same characters, regardless of whether the source string is in UTF-8
format or not. The set of characters they match is also not influenced
by locale.

One might think that C<\s> is equivalent to C<[\h\v]>. This is not true.
The vertical tab (C<"\x0b">) is not matched by C<\s>; it is, however,
considered vertical white space. Furthermore, if the source string is
not in UTF-8 format, the next line (C<"\x85">) and the no-break space
(C<"\xA0">) are not matched by C<\s>, but are by C<\v> and C<\h> respectively.
If the source string is in UTF-8 format, both the next line and the
no-break space are matched by C<\s>.

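The encoding independence of C<\h> and C<\v> can be checked with a sketch
like the one below. It uses the core function C<utf8::upgrade>, which is not
otherwise discussed on this page, to obtain a UTF-8 format copy of the same
character. The sketch deliberately leaves C<\s> out, as its treatment of the
vertical tab differs between perl versions.

```perl
use strict;
use warnings;

my $nbsp = "\xA0";        # NO-BREAK SPACE, not in UTF-8 format
my $vtab = "\x0b";        # LINE TABULATION (vertical tab)

my $nbsp_upgraded = $nbsp;
utf8::upgrade($nbsp_upgraded);   # same character, now in UTF-8 format

my $h_native   = $nbsp          =~ /\h/ ? 1 : 0;  # horizontal white space
my $h_upgraded = $nbsp_upgraded =~ /\h/ ? 1 : 0;  # same answer after upgrade
my $vtab_is_v  = $vtab          =~ /\v/ ? 1 : 0;  # vertical white space
my $vtab_is_h  = $vtab          =~ /\h/ ? 1 : 0;  # not horizontal

print "$h_native $h_upgraded $vtab_is_v $vtab_is_h\n";
```
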
The following table is a complete listing of characters matched by
C<\s>, C<\h> and C<\v>.

The first column gives the code point of the character (in hex format),
the second column gives the (Unicode) name. The third column indicates
by which class(es) the character is matched.

 0x00009  CHARACTER TABULATION       h s
 0x0000a  LINE FEED (LF)              vs
 0x0000b  LINE TABULATION             v
 0x0000c  FORM FEED (FF)              vs
 0x0000d  CARRIAGE RETURN (CR)        vs
 0x00020  SPACE                      h s
 0x00085  NEXT LINE (NEL)             vs  [1]
 0x000a0  NO-BREAK SPACE             h s  [1]
 0x01680  OGHAM SPACE MARK           h s
 0x0180e  MONGOLIAN VOWEL SEPARATOR  h s
 0x02000  EN QUAD                    h s
 0x02001  EM QUAD                    h s
 0x02002  EN SPACE                   h s
 0x02003  EM SPACE                   h s
 0x02004  THREE-PER-EM SPACE         h s
 0x02005  FOUR-PER-EM SPACE          h s
 0x02006  SIX-PER-EM SPACE           h s
 0x02007  FIGURE SPACE               h s
 0x02008  PUNCTUATION SPACE          h s
 0x02009  THIN SPACE                 h s
 0x0200a  HAIR SPACE                 h s
 0x02028  LINE SEPARATOR              vs
 0x02029  PARAGRAPH SEPARATOR         vs
 0x0202f  NARROW NO-BREAK SPACE      h s
 0x0205f  MEDIUM MATHEMATICAL SPACE  h s
 0x03000  IDEOGRAPHIC SPACE          h s

=over 4

=item [1]

NEXT LINE and NO-BREAK SPACE only match C<\s> if the source string is in
UTF-8 format.

=back

It is worth noting that C<\d>, C<\w>, etc, match single characters, not
complete numbers or words. To match a number (that consists of digits),
use C<\d+>; to match a word, use C<\w+>.

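A short sketch of the difference between a single class and a quantified
one, using a made-up input string:

```perl
use strict;
use warnings;

my $text = "Order 66 shipped 2 crates";

my @numbers = $text =~ /(\d+)/g;   # whole numbers
my @digits  = $text =~ /(\d)/g;    # one digit per match
my @words   = $text =~ /(\w+)/g;   # whole words (digits count as well)

print "@numbers\n";   # 66 2
print "@digits\n";    # 6 6 2
print "@words\n";     # Order 66 shipped 2 crates
```
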
=head3 Unicode Properties

C<\pP> and C<\p{Prop}> are character classes to match characters that
fit given Unicode classes. One-letter classes can be used in the C<\pP>
form, with the class name following the C<\p>; otherwise, the property
name is enclosed in braces and follows the C<\p>. For instance, a
match for a number can be written as C</\pN/> or as C</\p{Number}/>.
Lowercase letters are matched by the property I<LowercaseLetter>, which
has the short form I<Ll>. They have to be written as C</\p{Ll}/> or
C</\p{LowercaseLetter}/>. C</\pLl/> is valid, but means something different.
It matches a two character string: a letter (Unicode property C<\pL>),
followed by a lowercase C<l>.

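The C</\pLl/> pitfall can be demonstrated with a sketch such as this one
(the variable names are hypothetical):

```perl
use strict;
use warnings;

my $short = "a";
my $long  = "al";

my $braces_short = $short =~ /^\p{Ll}$/ ? 1 : 0;  # 1: one lowercase letter
my $bare_short   = $short =~ /^\pLl$/   ? 1 : 0;  # 0: needs a letter plus "l"
my $bare_long    = $long  =~ /^\pLl$/   ? 1 : 0;  # 1: letter "a", then "l"

print "$braces_short $bare_short $bare_long\n";
```
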
For a list of possible properties, see
L<perlunicode/Unicode Character Properties>. It is also possible to
define your own properties. This is discussed in
L<perlunicode/User-Defined Character Properties>.

=head4 Examples

 "a"  =~  /\w/      # Match, "a" is a 'word' character.
 "7"  =~  /\w/      # Match, "7" is a 'word' character as well.
 "a"  =~  /\d/      # No match, "a" isn't a digit.
 "7"  =~  /\d/      # Match, "7" is a digit.
 " "  =~  /\s/      # Match, a space is white space.
 "a"  =~  /\D/      # Match, "a" is a non-digit.
 "7"  =~  /\D/      # No match, "7" is not a non-digit.
 " "  =~  /\S/      # No match, a space is not non-white space.

 " "  =~  /\h/      # Match, space is horizontal white space.
 " "  =~  /\v/      # No match, space is not vertical white space.
 "\r" =~  /\v/      # Match, a return is vertical white space.

 "a"  =~  /\pL/     # Match, "a" is a letter.
 "a"  =~  /\p{Lu}/  # No match, /\p{Lu}/ matches upper case letters.

 "\x{0e0b}" =~ /\p{Thai}/  # Match, \x{0e0b} is the character
                           # 'THAI CHARACTER SO SO', and that's in
                           # the Thai Unicode class.
 "a"  =~  /\P{Lao}/ # Match, as "a" is not a Lao character.

=head2 Bracketed Character Classes

The third form of character class you can use in Perl regular expressions
is the bracketed form. In its simplest form, it lists the characters
that may be matched inside square brackets, like this: C<[aeiou]>.
This matches one of C<a>, C<e>, C<i>, C<o> or C<u>. Just as with the other
character classes, exactly one character will be matched. To match
a longer string consisting of characters mentioned in the character
class, follow the character class with a quantifier. For instance,
C<[aeiou]+> matches a string of one or more lowercase ASCII vowels.

Repeating a character in a character class has no
effect; it's considered to be in the set only once.

Examples:

 "e"  =~  /[aeiou]/        # Match, as "e" is listed in the class.
 "p"  =~  /[aeiou]/        # No match, "p" is not listed in the class.
 "ae" =~  /^[aeiou]$/      # No match, a character class only matches
                           # a single character.
 "ae" =~  /^[aeiou]+$/     # Match, due to the quantifier.

=head3 Special Characters Inside a Bracketed Character Class

Most characters that are meta characters in regular expressions (that
is, characters that carry a special meaning like C<*> or C<(>) lose
their special meaning and can be used inside a character class without
the need to escape them. For instance, C<[()]> matches either an opening
parenthesis, or a closing parenthesis, and the parens inside the character
class don't group or capture.

Characters that may carry a special meaning inside a character class are:
C<\>, C<^>, C<->, C<[> and C<]>, and are discussed below. They can be
escaped with a backslash, although this is sometimes not needed, in which
case the backslash may be omitted.

The sequence C<\b> is special inside a bracketed character class. While
outside the character class C<\b> is an assertion indicating a point
that does not have either two word characters or two non-word characters
on either side, inside a bracketed character class, C<\b> matches a
backspace character.

A C<[> is not special inside a character class, unless it's the start
of a POSIX character class (see below). It normally does not need escaping.

A C<]> is either the end of a POSIX character class (see below), or it
signals the end of the bracketed character class. Normally it needs
escaping if you want to include a C<]> in the set of characters.
However, if the C<]> is the I<first> (or the second if the first
character is a caret) character of a bracketed character class, it
does not denote the end of the class (as you cannot have an empty class)
and is considered part of the set of characters that can be matched without
escaping.

Examples:

 "+"   =~ /[+?*]/     #  Match, "+" in a character class is not special.
 "\cH" =~ /[\b]/      #  Match, \b inside a character class
                      #  is equivalent to a backspace.
 "]"   =~ /[][]/      #  Match, as the character class contains
                      #  both [ and ].
 "[]"  =~ /[[]]/      #  Match, the pattern contains a character class
                      #  containing just [, and the character class is
                      #  followed by a ].

=head3 Character Ranges

It is not uncommon to want to match a range of characters. Luckily, instead
of listing all the characters in the range, one may use the hyphen (C<->).
If inside a bracketed character class you have two characters separated
by a hyphen, it's treated as if all the characters between the two are in
the class. For instance, C<[0-9]> matches any ASCII digit, and C<[a-m]>
matches any lowercase letter from the first half of the ASCII alphabet.

Note that the two characters on either side of the hyphen are not
necessarily both letters or both digits. Any character is possible,
although not advisable. C<['-?]> contains a range of characters, but
most people will not know which characters that will be. Furthermore,
such ranges may lead to portability problems if the code has to run on
a platform that uses a different character set, such as EBCDIC.

If a hyphen in a character class cannot be part of a range, for instance
because it is the first or the last character of the character class,
or if it immediately follows a range, the hyphen isn't special, and will be
considered a character that may be matched. You have to escape the hyphen
with a backslash if you want to have a hyphen in your set of characters to
be matched, and its position in the class is such that it can be considered
part of a range.

Examples:

 [a-z]       #  Matches a character that is a lower case ASCII letter.
 [a-fz]      #  Matches any letter between 'a' and 'f' (inclusive) or the
             #  letter 'z'.
 [-z]        #  Matches either a hyphen ('-') or the letter 'z'.
 [a-f-m]     #  Matches any letter between 'a' and 'f' (inclusive), the
             #  hyphen ('-'), or the letter 'm'.
 ['-?]       #  Matches any of the characters  '()*+,-./0123456789:;<=>?
             #  (But not on an EBCDIC platform).

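The effect of the hyphen's position, and of escaping it, can be sketched
by filtering a small, made-up list of candidate characters through three
classes:

```perl
use strict;
use warnings;

my @candidates = ('a', 'm', 'z', '-');

# A range: every lowercase ASCII letter, but not the hyphen.
my @in_range   = grep { /^[a-z]$/  } @candidates;

# Escaped hyphen: only 'a', 'z' and the hyphen itself.
my @in_escaped = grep { /^[a\-z]$/ } @candidates;

# Leading hyphen: cannot form a range, so the same set as above.
my @in_leading = grep { /^[-az]$/  } @candidates;

print "@in_range\n";     # a m z
print "@in_escaped\n";   # a z -
print "@in_leading\n";   # a z -
```
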
=head3 Negation

It is also possible to instead list the characters you do not want to
match. You can do so by using a caret (C<^>) as the first character in the
character class. For instance, C<[^a-z]> matches a character that is not a
lowercase ASCII letter.

This syntax makes the caret a special character inside a bracketed character
class, but only if it is the first character of the class. So if you want
to have the caret as one of the characters you want to match, you either
have to escape the caret, or not list it first.

Examples:

 "e"  =~  /[^aeiou]/   #  No match, the 'e' is listed.
 "x"  =~  /[^aeiou]/   #  Match, as 'x' isn't a lowercase vowel.
 "^"  =~  /[^^]/       #  No match, matches anything that isn't a caret.
 "^"  =~  /[x^]/       #  Match, caret is not special here.

=head3 Backslash Sequences

You can put a backslash sequence character class inside a bracketed character
class, and it will act just as if you put all the characters matched by
the backslash sequence inside the character class. For instance,
C<[a-f\d]> will match any digit, or any of the lowercase letters between
'a' and 'f' inclusive.

Examples:

 /[\p{Thai}\d]/     # Matches a character that is either a Thai
                    # character, or a digit.
 /[^\p{Arabic}()]/  # Matches a character that is neither an Arabic
                    # character, nor a parenthesis.

Backslash sequence character classes cannot form one of the endpoints
of a range.

=head3 Posix Character Classes

POSIX character classes have the form C<[:class:]>, where I<class> is
the name of the class, and C<[:> and C<:]> are the delimiters. POSIX
character classes appear I<inside> bracketed character classes, and are
a convenient and descriptive way of listing a group of characters.
Be careful about the syntax:

 # Correct:
 $string =~ /[[:alpha:]]/

 # Incorrect (will warn):
 $string =~ /[:alpha:]/

The latter pattern would be a character class consisting of a colon,
and the letters C<a>, C<l>, C<p> and C<h>.

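The difference between the two forms can be sketched as below. The warning
for the incorrect form is silenced with C<no warnings 'regexp'> so that the
sketch runs quietly; that pragma and the variable names are additions for
the purpose of the example, and the sketch assumes C<use re 'strict'> is not
in effect.

```perl
use strict;
use warnings;

my ($inside_b, $inside_colon, $outside_b, $outside_colon);
{
    # The form without the outer brackets warns; keep the demo quiet.
    no warnings 'regexp';

    $inside_b      = "b" =~ /[[:alpha:]]/ ? 1 : 0;  # POSIX class: a letter
    $inside_colon  = ":" =~ /[[:alpha:]]/ ? 1 : 0;  # a colon is no letter
    $outside_b     = "b" =~ /[:alpha:]/   ? 1 : 0;  # set of ':' a l p h
    $outside_colon = ":" =~ /[:alpha:]/   ? 1 : 0;  # the colon is in the set
}

print "$inside_b $inside_colon $outside_b $outside_colon\n";
```
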
Perl recognizes the following POSIX character classes:

 alpha  Any alphabetical character.
 alnum  Any alphanumerical character.
 ascii  Any ASCII character.
 blank  A GNU extension, equal to a space or a horizontal tab ("\t").
 cntrl  Any control character.
 digit  Any digit, equivalent to "\d".
 graph  Any printable character, excluding a space.
 lower  Any lowercase character.
 print  Any printable character, including a space.
 punct  Any punctuation character.
 space  Any white space character. "\s" plus the vertical tab ("\cK").
 upper  Any uppercase character.
 word   Any "word" character, equivalent to "\w".
 xdigit Any hexadecimal digit, '0' - '9', 'a' - 'f', 'A' - 'F'.

The exact set of characters matched depends on whether the source string
is internally in UTF-8 format or not. See L</Locale, Unicode and UTF-8>.

Most POSIX character classes have C<\p> counterparts. The difference
is that the C<\p> classes will always match according to the Unicode
properties, regardless of whether the string is in UTF-8 format or not.

The following table shows the relation between POSIX character classes
and the Unicode properties:

 [[:...:]]   \p{...}      backslash

 alpha       IsAlpha
 alnum       IsAlnum
 ascii       IsASCII
 blank
 cntrl       IsCntrl
 digit       IsDigit      \d
 graph       IsGraph
 lower       IsLower
 print       IsPrint
 punct       IsPunct
 space       IsSpace
             IsSpacePerl  \s
 upper       IsUpper
 word        IsWord
 xdigit      IsXDigit

Some character classes may have a non-obvious name:

=over 4

=item cntrl

Any control character. Usually, control characters don't produce output
as such, but instead control the terminal somehow: for example newline
and backspace are control characters. All characters with C<ord()> less
than 32 are usually classified as control characters (in ASCII, the ISO
Latin character sets, and Unicode), as is the character with the C<ord()>
value of 127 (C<DEL>).

=item graph

Any character that is I<graphical>, that is, visible. This class consists
of all the alphanumerical characters and all punctuation characters.

=item print

All printable characters, which is the set of all the graphical characters
plus the space.

=item punct

Any punctuation (special) character.

=back

=head4 Negation

A Perl extension to the POSIX character class is the ability to
negate it. This is done by prefixing the class name with a caret (C<^>).
Some examples:

 POSIX         Unicode       Backslash
 [[:^digit:]]  \P{IsDigit}   \D
 [[:^space:]]  \P{IsSpace}   \S
 [[:^word:]]   \P{IsWord}    \W

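A sketch of negated POSIX classes in use, with made-up input strings:

```perl
use strict;
use warnings;

my $text = "abc 123";

# Delete everything that is not a digit.
(my $digits_only = $text) =~ s/[[:^digit:]]//g;

# Collect the runs of characters that are not white space.
my @tokens = "a b\tc" =~ /([[:^space:]]+)/g;

print "$digits_only\n";   # 123
print "@tokens\n";        # a b c
```
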
=head4 [= =] and [. .]

Perl will recognize the POSIX character classes C<[=class=]>, and
C<[.class.]>, but does not (yet?) support these constructs. Use of
such a construct will lead to an error.

=head4 Examples

 /[[:digit:]]/            # Matches a character that is a digit.
 /[01[:lower:]]/          # Matches a character that is either a
                          # lowercase letter, or '0' or '1'.
 /[[:digit:][:^xdigit:]]/ # Matches a character that can be anything
                          # except the letters 'a' to 'f' in either case.
                          # This is because the character class contains
                          # all digits, and anything that isn't a
                          # hex digit, resulting in a class containing
                          # all characters except the letters 'a' to 'f'
                          # and 'A' to 'F'.

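The last example above can be checked with a sketch like this one, which
filters a few hypothetical ASCII test characters through the class:

```perl
use strict;
use warnings;

my @chars = ('5', 'a', 'F', 'g', '-');

# Digits are let in by [:digit:]; 'g' and '-' by [:^xdigit:].
# The hex letters 'a' and 'F' are in neither part of the class.
my @matched = grep { /^[[:digit:][:^xdigit:]]$/ } @chars;

print "@matched\n";   # 5 g -
```
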
=head2 Locale, Unicode and UTF-8

Some of the character classes have a somewhat different behaviour depending
on the internal encoding of the source string, and the locale that is
in effect.

C<\w>, C<\d>, C<\s> and the POSIX character classes (and their negations,
including C<\W>, C<\D>, C<\S>) are affected by this behaviour.

The rule is that if the source string is in UTF-8 format, the character
classes match according to the Unicode properties. If the source string
isn't, then the character classes match according to whatever locale is
in effect. If there is no locale, they match the ASCII defaults
(52 letters, 10 digits and underscore for C<\w>, 0 to 9 for C<\d>, etc).

This usually means that if you are matching against characters whose C<ord()>
values are between 128 and 255 inclusive, your character class may match
or not depending on the current locale, and whether the source string is
in UTF-8 format. The string will be in UTF-8 format if it contains
characters whose C<ord()> value exceeds 255. But a string may be in UTF-8
format without it having such characters.

For portability reasons, it may be better to not use C<\w>, C<\d>, C<\s>
or the POSIX character classes, and use the Unicode properties instead.

=head4 Examples

 $str =  "\xDF";      # $str is not in UTF-8 format.
 $str =~ /^\w/;       # No match, as $str isn't in UTF-8 format.
 $str .= "\x{0e0b}";  # Now $str is in UTF-8 format.
 $str =~ /^\w/;       # Match! $str is now in UTF-8 format.
 chop $str;
 $str =~ /^\w/;       # Still a match! $str remains in UTF-8 format.

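Whether a string is in UTF-8 format can be inspected directly. The sketch
below uses the core functions C<utf8::is_utf8> and C<utf8::upgrade>, which
are not otherwise discussed on this page, and assumes that no locale and no
feature changing the matching rules is in effect. Under those assumptions
the first C<\w> test is expected to fail and the second to succeed, as in
the examples above.

```perl
use strict;
use warnings;

my $str = "\xDF";    # LATIN SMALL LETTER SHARP S, not in UTF-8 format

my $flag_before = utf8::is_utf8($str) ? 1 : 0;
my $word_before = $str =~ /^\w/       ? 1 : 0;

# Switch the internal encoding; the character itself stays the same.
utf8::upgrade($str);

my $flag_after = utf8::is_utf8($str) ? 1 : 0;
my $word_after = $str =~ /^\w/       ? 1 : 0;

print "before: UTF-8 format=$flag_before, \\w matched=$word_before\n";
print "after:  UTF-8 format=$flag_after, \\w matched=$word_after\n";
```
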
=cut