=head1 NAME

perlrebackslash - Perl Regular Expression Backslash Sequences and Escapes

=head1 DESCRIPTION

The top level documentation about Perl regular expressions
is found in L<perlre>.

This document describes all backslash and escape sequences. After
explaining the role of the backslash, it lists all the sequences that have
a special meaning in Perl regular expressions (in alphabetical order),
then describes each of them.

Most sequences are described in detail in different documents; the primary
purpose of this document is to have a quick reference guide describing all
backslash and escape sequences.


=head2 The backslash

In a regular expression, the backslash can perform one of two tasks:
it either takes away the special meaning of the character following it
(for instance, C<\|> matches a vertical bar, it's not an alternation),
or it is the start of a backslash or escape sequence.

The rules determining what it is are quite simple: if the character
following the backslash is a punctuation (non-word) character (that is,
anything that is not a letter, digit or underscore), then the backslash
just takes away the special meaning (if any) of the character following
it.

If the character following the backslash is a letter or a digit, then the
sequence may be special; if so, it's listed below. A few letters have not
been used yet, and escaping them with a backslash is safe for now, but a
future version of Perl may assign a special meaning to them. However, if you
have warnings turned on, Perl will issue a warning if you use such a
sequence [1].

It is however guaranteed that backslash or escape sequences never have a
punctuation character following the backslash, not now, and not in a future
version of Perl 5. So it is safe to put a backslash in front of a non-word
character.

Note that the backslash itself is special; if you want to match a backslash,
you have to escape the backslash with a backslash: C</\\/> matches a single
backslash.

=over 4

=item [1]

There is one exception. If you use an alphanumerical character as the
delimiter of your pattern (which you probably shouldn't do for readability
reasons), you will have to escape the delimiter if you want to match
it. Perl won't warn then. See also L<perlop/Gory details of parsing
quoted constructs>.

=back
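
The rules above can be checked with a few small matches. This is only a
minimal sketch; the variable names are illustrative and not part of any API.

```perl
use strict;
use warnings;

# A backslash before a non-word character removes its special meaning.
my $bar_is_literal = "a|b" =~ /^a\|b$/ ? 1 : 0;   # 1: \| is a literal bar
my $alternation    = "a"   =~ /^a|b$/  ? 1 : 0;   # 1: unescaped | alternates
my $dot_is_literal = "axc" =~ /^a\.c$/ ? 1 : 0;   # 0: \. only matches "."

# The backslash itself has to be escaped to be matched.
my $has_backslash  = 'C:\temp' =~ /\\/ ? 1 : 0;   # 1

print "$bar_is_literal $alternation $dot_is_literal $has_backslash\n";
```

The single-quoted string C<'C:\temp'> is used so that the backslash reaches
the regex engine unchanged.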


=head2 All the sequences and escapes

 \000              Octal escape sequence.
 \1                Absolute backreference.
 \a                Alarm or bell.
 \A                Beginning of string.
 \b                Word/non-word boundary. (Backspace in a char class).
 \B                Not a word/non-word boundary.
 \cX               Control-X (X can be any ASCII character).
 \C                Single octet, even under UTF-8.
 \d                Character class for digits.
 \D                Character class for non-digits.
 \e                Escape character.
 \E                Turn off \Q, \L and \U processing.
 \f                Form feed.
 \g{}, \g1         Named, absolute or relative backreference.
 \G                Pos assertion.
 \h                Character class for horizontal white space.
 \H                Character class for non horizontal white space.
 \k{}, \k<>, \k''  Named backreference.
 \K                Keep the stuff left of \K.
 \l                Lowercase next character.
 \L                Lowercase till \E.
 \n                (Logical) newline character.
 \N                Any character but newline.
 \N{}              Named (Unicode) character.
 \p{}, \pP         Character with the given Unicode property.
 \P{}, \PP         Character without the given Unicode property.
 \Q                Quotemeta till \E.
 \r                Return character.
 \R                Generic new line.
 \s                Character class for white space.
 \S                Character class for non white space.
 \t                Tab character.
 \u                Titlecase next character.
 \U                Uppercase till \E.
 \v                Character class for vertical white space.
 \V                Character class for non vertical white space.
 \w                Character class for word characters.
 \W                Character class for non-word characters.
 \x{}, \x00        Hexadecimal escape sequence.
 \X                Unicode "extended grapheme cluster".
 \z                End of string.
 \Z                End of string, or before a trailing newline.

=head2 Character Escapes

=head3 Fixed characters

A handful of characters have a dedicated I<character escape>. The following
table shows them, along with their code points (in decimal and hex), their
ASCII name, the control escape (see below) and a short description.

 Seq.  Code Point  ASCII   Cntr    Description.
       Dec    Hex
  \a     7     07    BEL    \cG    alarm or bell
  \b     8     08     BS    \cH    backspace [1]
  \e    27     1B    ESC    \c[    escape character
  \f    12     0C     FF    \cL    form feed
  \n    10     0A     LF    \cJ    line feed [2]
  \r    13     0D     CR    \cM    carriage return
  \t     9     09    TAB    \cI    tab

=over 4

=item [1]

C<\b> is only the backspace character inside a character class. Outside a
character class, C<\b> is a word/non-word boundary.

=item [2]

C<\n> matches a logical newline. Perl will convert between C<\n> and your
OS's native newline character when reading from or writing to text files.

=back

=head4 Example

 $str =~ /\t/;   # Matches if $str contains a (horizontal) tab.

=head3 Control characters

C<\c> is used to denote a control character; the character following C<\c>
is the name of the control character. For instance, C</\cM/> matches the
character I<control-M> (a carriage return, code point 13). The case of the
character following C<\c> doesn't matter: C<\cM> and C<\cm> match the same
character.

Mnemonic: I<c>ontrol character.

=head4 Example

 $str =~ /\cK/;  # Matches if $str contains a vertical tab (control-K).
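
As a small illustration of the control escapes, the following sketch
compares them with the code points from the table of fixed characters. It
assumes an ASCII platform, and the variable names are made up for the
example.

```perl
use strict;
use warnings;

my $cr = chr(13);    # carriage return, i.e. control-M
my $vt = chr(11);    # vertical tab, i.e. control-K

my $upper = $cr =~ /\A\cM\z/ ? 1 : 0;   # 1
my $lower = $cr =~ /\A\cm\z/ ? 1 : 0;   # 1: the case does not matter
my $vtab  = $vt =~ /\A\cK\z/ ? 1 : 0;   # 1
my $same  = "\cM" eq "\r"    ? 1 : 0;   # 1: \cM and \r are the same character

print "$upper $lower $vtab $same\n";
```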

=head3 Named characters

All Unicode characters have a Unicode name, and characters in various scripts
have names as well. It is even possible to give your own names to characters.
You can use a character by name by using the C<\N{}> construct; the name of
the character goes between the curly braces. You do have to C<use charnames>
to load the names of the characters, otherwise Perl will complain that you
use a name it doesn't know about. For more details, see L<charnames>.

Mnemonic: I<N>amed character.

=head4 Example

 use charnames ':full';               # Loads the Unicode names.
 $str =~ /\N{THAI CHARACTER SO SO}/;  # Matches the Thai SO SO character

 use charnames 'Cyrillic';            # Loads Cyrillic names.
 $str =~ /\N{ZHE}\N{KA}/;             # Match "ZHE" followed by "KA".
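
To see that a named character is just another way of writing a code point,
the sketch below compares C<\N{}> with the equivalent C<\x{}> escape. The
choice of GREEK SMALL LETTER ALPHA (code point 3B1 hex) is only for
illustration.

```perl
use strict;
use warnings;
use charnames ':full';    # Loads the Unicode names.

my $alpha = "\x{3B1}";    # GREEK SMALL LETTER ALPHA

# The name can be used inside a pattern ...
my $in_pattern = $alpha =~ /\A\N{GREEK SMALL LETTER ALPHA}\z/ ? 1 : 0;  # 1

# ... and denotes the same character as the hexadecimal escape.
my $same_char = "\N{GREEK SMALL LETTER ALPHA}" eq $alpha ? 1 : 0;        # 1

print "$in_pattern $same_char\n";
```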

=head3 Octal escapes

Octal escapes consist of a backslash followed by two or three octal digits
matching the code point of the character you want to use. This allows for
512 characters (C<\00> up to C<\777>) that can be expressed this way.
Enough in pre-Unicode days, but most Unicode characters cannot be escaped
this way.

Note that a character that is expressed as an octal escape is considered
as a character without special meaning by the regex engine, and will match
"as is".

=head4 Examples

 $str = "Perl";
 $str =~ /\120/;    # Match, "\120" is "P".
 $str =~ /\120+/;   # Match, "\120" is "P", it is repeated at least once.
 $str =~ /P\053/;   # No match, "\053" is "+" and taken literally.

=head4 Caveat

Octal escapes potentially clash with backreferences. They both consist
of a backslash followed by digits. So Perl has to use heuristics to
determine whether it is a backreference or an octal escape. Perl uses
the following rules:

=over 4

=item 1

If the backslash is followed by a single digit, it's a backreference.

=item 2

If the first digit following the backslash is a 0, it's an octal escape.

=item 3

If the number following the backslash is N (decimal), and Perl already has
seen N capture groups, Perl will consider this to be a backreference.
Otherwise, it will consider it to be an octal escape. Note that if N > 999,
Perl only takes the first three digits for the octal escape; the rest is
matched as is.

 my $pat  = "(" x 999;
    $pat .= "a";
    $pat .= ")" x 999;
 /^($pat)\1000$/;   #  Matches 'aa'; there are 1000 capture groups.
 /^$pat\1000$/;     #  Matches 'a@0'; there are 999 capture groups
                    #    and \1000 is seen as \100 (a '@') and a '0'.

=back
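
The three rules can be exercised with much smaller patterns than the one
above. The following is a sketch only, assuming an ASCII platform; the
variable names are illustrative.

```perl
use strict;
use warnings;

# Rule 1: a single digit is a backreference.
my $rule1 = "aa" =~ /\A(a)\1\z/ ? 1 : 0;         # 1

# Rule 2: a leading 0 makes it an octal escape; \012 is a line feed.
my $rule2 = "\n" =~ /\A\012\z/ ? 1 : 0;          # 1

# Rule 3: only one capture group has been seen, so \10 cannot refer to
# group 10 and is taken as octal 10 (code point 8, backspace).
my $with_backspace = "a" . chr(8);
my $rule3 = $with_backspace =~ /\A(a)\10\z/ ? 1 : 0;   # 1

print "$rule1 $rule2 $rule3\n";
```

Writing the backreference as C<\g{1}> avoids the ambiguity altogether in
perl 5.10.0 and later.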

=head3 Hexadecimal escapes

Hexadecimal escapes start with C<\x> and are then followed by either a
two-digit hexadecimal number, or a hexadecimal number of arbitrary length
surrounded by curly braces. The hexadecimal number is the code point of
the character you want to express.

Note that a character that is expressed as a hexadecimal escape is considered
as a character without special meaning by the regex engine, and will match
"as is".

Mnemonic: heI<x>adecimal.

=head4 Examples

 $str = "Perl";
 $str =~ /\x50/;    # Match, "\x50" is "P".
 $str =~ /\x50+/;   # Match, "\x50" is "P", it is repeated at least once.
 $str =~ /P\x2B/;   # No match, "\x2B" is "+" and taken literally.

 /\x{2603}\x{2602}/ # Snowman with an umbrella.
                    # The Unicode character 2603 is a snowman,
                    # the Unicode character 2602 is an umbrella.
 /\x{263B}/         # Black smiling face.
 /\x{263b}/         # Same, the hex digits A - F are case insensitive.
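
The "as is" behaviour is easiest to see with a character that would
otherwise be a quantifier. A minimal sketch, with illustrative variable
names:

```perl
use strict;
use warnings;

my $two_digit    = "P"  =~ /\A\x50\z/  ? 1 : 0;   # 1: \x50 is "P"
my $literal_plus = "P+" =~ /\AP\x2B\z/ ? 1 : 0;   # 1: \x2B is a literal "+"
my $no_quantify  = "PP" =~ /\AP\x2B\z/ ? 1 : 0;   # 0: it does not repeat "P"

# The braced form takes any number of digits, in either case.
my $any_case = "\x{263B}" =~ /\A\x{263b}\z/ ? 1 : 0;   # 1

print "$two_digit $literal_plus $no_quantify $any_case\n";
```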

=head2 Modifiers

A number of backslash sequences have to do with changing the character,
or characters following them. C<\l> will lowercase the character following
it, while C<\u> will uppercase (or, more accurately, titlecase) the
character following it. (They behave similarly to the functions
C<lcfirst> and C<ucfirst>).

To uppercase or lowercase several characters, one might want to use
C<\L> or C<\U>, which will lowercase/uppercase all characters following
them, until either the end of the pattern, or the next occurrence of
C<\E>, whichever comes first. They behave similarly to the functions
C<lc> and C<uc>.

C<\Q> is used to escape all characters following, up to the next C<\E>
or the end of the pattern. C<\Q> adds a backslash to any character that
isn't a letter, digit or underscore. This will ensure that any character
between C<\Q> and C<\E> is matched literally, and will not be interpreted
by the regexp engine.

Mnemonic: I<L>owercase, I<U>ppercase, I<Q>uotemeta, I<E>nd.

=head4 Examples

 $sid     = "sid";
 $greg    = "GrEg";
 $miranda = "(Miranda)";
 $str     =~ /\u$sid/;        # Matches 'Sid'
 $str     =~ /\L$greg/;       # Matches 'greg'
 $str     =~ /\Q$miranda\E/;  # Matches '(Miranda)', as if the pattern
                              #   had been written as /\(Miranda\)/
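
The effect of C<\Q> is clearest when the interpolated string contains a
metacharacter. The sketch below contrasts quoted and unquoted
interpolation; the variable names and sample strings are illustrative.

```perl
use strict;
use warnings;

my $literal = "a.c";

# Without \Q the dot is a metacharacter and matches any character.
my $unquoted = "abc" =~ /\A$literal\z/      ? 1 : 0;   # 1

# With \Q ... \E the dot is escaped and only matches itself.
my $quoted   = "abc" =~ /\A\Q$literal\E\z/  ? 1 : 0;   # 0
my $exact    = "a.c" =~ /\A\Q$literal\E\z/  ? 1 : 0;   # 1

# Case modification applies to the interpolated text as well.
my $name  = "GrEg";
my $lower = "greg" =~ /\A\L$name\E\z/ ? 1 : 0;         # 1
my $upper = "GREG" =~ /\A\U$name\E\z/ ? 1 : 0;         # 1

print "$unquoted $quoted $exact $lower $upper\n";
```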

=head2 Character classes

Perl regular expressions have a large range of character classes. Some of
the character classes are written as a backslash sequence. We will briefly
discuss those here; full details of character classes can be found in
L<perlrecharclass>.

C<\w> is a character class that matches any I<word> character (letters,
digits, underscore). C<\d> is a character class that matches any digit,
while the character class C<\s> matches any white space character.
New in perl 5.10.0 are the classes C<\h> and C<\v> which match horizontal
and vertical white space characters.

The uppercase variants (C<\W>, C<\D>, C<\S>, C<\H>, and C<\V>) are
character classes that match any character that isn't a word character,
digit, white space, horizontal white space or vertical white space,
respectively.

Mnemonics: I<w>ord, I<d>igit, I<s>pace, I<h>orizontal, I<v>ertical.
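
A few single-character matches show how the classes divide up the
characters. This is a sketch that needs perl 5.10.0 or later for C<\h> and
C<\v>; the variable names are illustrative.

```perl
use strict;
use warnings;
use 5.010;    # \h and \v need perl 5.10.0 or later

my $underscore_is_word = "_"  =~ /\A\w\z/ ? 1 : 0;   # 1
my $hyphen_is_nonword  = "-"  =~ /\A\W\z/ ? 1 : 0;   # 1
my $seven_is_digit     = "7"  =~ /\A\d\z/ ? 1 : 0;   # 1
my $tab_is_horizontal  = "\t" =~ /\A\h\z/ ? 1 : 0;   # 1
my $nl_is_horizontal   = "\n" =~ /\A\h\z/ ? 1 : 0;   # 0
my $nl_is_vertical     = "\n" =~ /\A\v\z/ ? 1 : 0;   # 1

print "$underscore_is_word $hyphen_is_nonword $seven_is_digit ",
      "$tab_is_horizontal $nl_is_horizontal $nl_is_vertical\n";
```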

=head3 Unicode classes

C<\pP> (where C<P> is a single letter) and C<\p{Property}> are used to
match a character that matches the given Unicode property; properties
include things like "letter", or "thai character". Capitalizing the
sequence to C<\PP> and C<\P{Property}> makes the sequence match a character
that doesn't match the given Unicode property. For more details, see
L<perlrecharclass/Backslashed sequences> and
L<perlunicode/Unicode Character Properties>.

Mnemonic: I<p>roperty.
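
The two properties named above can be tried on a single Thai letter. In
this sketch the property names C<L> (letter) and C<Thai> are the standard
Unicode ones; the variable names are illustrative.

```perl
use strict;
use warnings;

my $so_so = "\x{0E0B}";    # THAI CHARACTER SO SO

my $is_letter  = $so_so =~ /\A\pL\z/      ? 1 : 0;   # 1: single-letter form
my $is_thai    = $so_so =~ /\A\p{Thai}\z/ ? 1 : 0;   # 1: braced form
my $is_nonthai = $so_so =~ /\A\P{Thai}\z/ ? 1 : 0;   # 0: negated property
my $latin_thai = "A"    =~ /\A\p{Thai}\z/ ? 1 : 0;   # 0

print "$is_letter $is_thai $is_nonthai $latin_thai\n";
```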


=head2 Referencing

If capturing parentheses are used in a regular expression, we can refer
to the part of the source string that was matched, and match exactly the
same thing. There are three ways of referring to such a I<backreference>:
absolutely, relatively, and by name.

=for later add link to perlrecapture

=head3 Absolute referencing

A backslash sequence that starts with a backslash and is followed by a
number is an absolute reference (but be aware of the caveat mentioned above).
If the number is I<N>, it refers to the Nth set of parentheses - whatever
has been matched by that set of parentheses has to be matched by the C<\N>
as well.

=head4 Examples

 /(\w+) \1/;    # Finds a duplicated word, (e.g. "cat cat").
 /(.)(.)\2\1/;  # Match a four letter palindrome (e.g. "ABBA").


=head3 Relative referencing

New in perl 5.10.0 is a different way of referring to capture buffers: C<\g>.
C<\g> takes a number as argument, with the number in curly braces (the
braces are optional). If the number (N) does not have a sign, it's a reference
to the Nth capture group (so C<\g{2}> is equivalent to C<\2> - except that
C<\g> always refers to a capture group and will never be seen as an octal
escape). If the number is negative, the reference is relative, referring to
the Nth group before the C<\g{-N}>.

The big advantage of C<\g{-N}> is that it makes it much easier to write
patterns with references that can be interpolated in larger patterns,
even if the larger pattern also contains capture groups.

Mnemonic: I<g>roup.

=head4 Examples

 /(A)        # Buffer 1
  (          # Buffer 2
    (B)      # Buffer 3
    \g{-1}   # Refers to buffer 3 (B)
    \g{-3}   # Refers to buffer 1 (A)
  )
 /x;         # Matches "ABBA".

 my $qr = qr /(.)(.)\g{-2}\g{-1}/;  # Matches 'abab', 'cdcd', etc.
 /$qr$qr/                           # Matches 'ababcdcd'.
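
The advantage of relative references shows up when a pattern is
interpolated after another capture group. The sketch below builds the same
"doubled character" pattern twice, once with C<\1> and once with
C<\g{-1}>; the variable names are illustrative.

```perl
use strict;
use warnings;
use 5.010;    # \g needs perl 5.10.0 or later

my $absolute = qr/(\w)\1/;        # doubled character, absolute reference
my $relative = qr/(\w)\g{-1}/;    # doubled character, relative reference

# On their own, both patterns behave the same.
my $abs_alone = "aa" =~ /\A$absolute\z/ ? 1 : 0;           # 1
my $rel_alone = "aa" =~ /\A$relative\z/ ? 1 : 0;           # 1

# Interpolated after another group, \1 refers to (x) instead of (\w).
my $abs_embedded = "xaa" =~ /\A(x)$absolute\z/ ? 1 : 0;    # 0
my $rel_embedded = "xaa" =~ /\A(x)$relative\z/ ? 1 : 0;    # 1

print "$abs_alone $rel_alone $abs_embedded $rel_embedded\n";
```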

=head3 Named referencing

Also new in perl 5.10.0 is the use of named capture buffers, which can be
referred to by name. This is done with C<\g{name}>, which is a
backreference to the capture buffer with the name I<name>.

To be compatible with .NET regular expressions, C<\g{name}> may also be
written as C<\k{name}>, C<< \k<name> >> or C<\k'name'>.

Note that C<\g{}> has the potential to be ambiguous, as it could be a named
reference, or an absolute or relative reference (if its argument is numeric).
However, names are not allowed to start with digits, nor are they allowed to
contain a hyphen, so there is no ambiguity.

=head4 Examples

 /(?<word>\w+) \g{word}/ # Finds duplicated word, (e.g. "cat cat")
 /(?<word>\w+) \k{word}/ # Same.
 /(?<word>\w+) \k<word>/ # Same.
 /(?<letter1>.)(?<letter2>.)\g{letter2}\g{letter1}/
                         # Match a four letter palindrome (e.g. "ABBA")
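
All four spellings of a named backreference can be checked against the
same input. A minimal sketch, requiring perl 5.10.0 or later; the variable
names are illustrative.

```perl
use strict;
use warnings;
use 5.010;    # named captures need perl 5.10.0 or later

my $text = "cat cat";

my $g_brace = $text =~ /\A(?<word>\w+) \g{word}\z/ ? 1 : 0;   # 1
my $k_brace = $text =~ /\A(?<word>\w+) \k{word}\z/ ? 1 : 0;   # 1
my $k_angle = $text =~ /\A(?<word>\w+) \k<word>\z/ ? 1 : 0;   # 1
my $k_quote = $text =~ /\A(?<word>\w+) \k'word'\z/ ? 1 : 0;   # 1

# A different second word does not satisfy the backreference.
my $differs = "cat dog" =~ /\A(?<word>\w+) \g{word}\z/ ? 1 : 0;   # 0

print "$g_brace $k_brace $k_angle $k_quote $differs\n";
```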

=head2 Assertions

Assertions are conditions that have to be true; they don't actually
match parts of the string. There are six assertions that are written as
backslash sequences.

=over 4

=item \A

C<\A> only matches at the beginning of the string. If the C</m> modifier
isn't used, then C</\A/> is equivalent to C</^/>. However, if the C</m>
modifier is used, then C</^/> also matches after internal newlines, but the
meaning of C</\A/> isn't changed by the C</m> modifier. C<\A> matches at the
beginning of the string regardless of whether the C</m> modifier is used.

=item \z, \Z

C<\z> and C<\Z> match at the end of the string. If the C</m> modifier isn't
used, then C</\Z/> is equivalent to C</$/>, that is, it matches at the
end of the string, or before the newline at the end of the string. If the
C</m> modifier is used, then C</$/> matches at internal newlines, but the
meaning of C</\Z/> isn't changed by the C</m> modifier. C<\Z> matches at
the end of the string (or just before a trailing newline) regardless of
whether the C</m> modifier is used.

C<\z> is just like C<\Z>, except that it will not match before a trailing
newline. C<\z> will only match at the end of the string - regardless of the
modifiers used, and not before a newline.

=item \G

C<\G> is usually only used in combination with the C</g> modifier. If the
C</g> modifier is used (and the match is done in scalar context), Perl will
remember where in the source string the last match ended, and the next time,
it will start the match from where it ended the previous time.

C<\G> matches the point where the previous match ended, or the beginning
of the string if there was no previous match.

=for later add link to perlremodifiers

Mnemonic: I<G>lobal.

=item \b, \B

C<\b> matches at any place between a word and a non-word character; C<\B>
matches at any place between characters where C<\b> doesn't match. C<\b>
and C<\B> assume there's a non-word character before the beginning and after
the end of the source string; so C<\b> will match at the beginning (or end)
of the source string if the source string begins (or ends) with a word
character. Otherwise, C<\B> will match.

Mnemonic: I<b>oundary.

=back

=head4 Examples

 "cat"   =~ /\Acat/;     # Match.
 "cat"   =~ /cat\Z/;     # Match.
 "cat\n" =~ /cat\Z/;     # Match.
 "cat\n" =~ /cat\z/;     # No match.

 "cat"   =~ /\bcat\b/;   # Matches.
 "cats"  =~ /\bcat\b/;   # No match.
 "cat"   =~ /\bcat\B/;   # No match.
 "cats"  =~ /\bcat\B/;   # Match.

 while ("cat dog" =~ /(\w+)/g) {
     print $1;           # Prints 'catdog'
 }
 while ("cat dog" =~ /\G(\w+)/g) {
     print $1;           # Prints 'cat'
 }
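
The interaction with the C</m> modifier, and the way C<\G> follows
C<pos()>, can be seen in the following sketch. The sample strings and
variable names are illustrative.

```perl
use strict;
use warnings;

my $text = "one\ntwo\n";

# Under /m, ^ also matches after an internal newline; \A does not.
my $caret_m = $text =~ /^two/m  ? 1 : 0;   # 1
my $big_a_m = $text =~ /\Atwo/m ? 1 : 0;   # 0

# \Z tolerates one trailing newline; \z does not.
my $big_z   = $text =~ /two\Z/   ? 1 : 0;  # 1
my $small_z = $text =~ /two\z/   ? 1 : 0;  # 0
my $with_nl = $text =~ /two\n\z/ ? 1 : 0;  # 1

# \G anchors each attempt at the position where the last match ended.
my $str = "aab";
my @ends;
while ($str =~ /\Ga/g) {
    push @ends, pos($str);                 # collects 1, then 2
}

print "$caret_m $big_a_m $big_z $small_z $with_nl @ends\n";
```

The loop stops after two iterations because the third attempt is anchored
at position 2, where the string holds a "b".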

=head2 Misc

Here we document the backslash sequences that don't fall in one of the
categories above. They are:

=over 4

=item \C

C<\C> always matches a single octet, even if the source string is encoded
in UTF-8 format, and the character to be matched is a multi-octet character.
C<\C> was introduced in perl 5.6.

Mnemonic: oI<C>tet.

=item \K

This is new in perl 5.10.0. Anything that is matched left of C<\K> is
not included in C<$&> - and will not be replaced if the pattern is
used in a substitution. This will allow you to write C<s/PAT1 \K PAT2/REPL/x>
instead of C<s/(PAT1) PAT2/${1}REPL/x> or C<s/(?<=PAT1) PAT2/REPL/x>.

Mnemonic: I<K>eep.

=item \R

C<\R> matches a I<generic newline>, that is, anything that is considered
a newline by Unicode. This includes all characters matched by C<\v>
(vertical white space), and the multi character sequence C<"\x0D\x0A">
(carriage return followed by a line feed, aka the network newline, or
the newline used in Windows text files). C<\R> is equivalent to
C<< (?>\x0D\x0A|\v) >>. Since C<\R> can match more than one character,
it cannot be put inside a bracketed character class; C</[\R]/> is an error.
C<\R> was introduced in perl 5.10.0.

Mnemonic: none really. C<\R> was picked because PCRE already uses C<\R>,
and more importantly because Unicode recommends such a regular expression
metacharacter, and suggests C<\R> as the notation.

=item \X

This matches a Unicode I<extended grapheme cluster>.

C<\X> matches quite well what normal (non-Unicode-programmer) usage
would consider a single character. As an example, consider a G with some sort
of diacritic mark, such as an arrow. There is no such single character in
Unicode, but one can be composed using a G followed by a Unicode "COMBINING
UPWARDS ARROW BELOW", and would be displayed by Unicode-aware software as if it
were a single character.

Mnemonic: eI<X>tended Unicode character.

=back

=head4 Examples

 "\x{256}" =~ /^\C\C$/;    # Match as chr (0x256) takes 2 octets in UTF-8.

 $str =~ s/foo\Kbar/baz/g; # Change any 'bar' following a 'foo' to 'baz'.
 $str =~ s/(.)\K\1//g;     # Delete duplicated characters.

 "\n"   =~ /^\R$/;         # Match, \n   is a generic newline.
 "\r"   =~ /^\R$/;         # Match, \r   is a generic newline.
 "\r\n" =~ /^\R$/;         # Match, \r\n is a generic newline.

 "P\x{0307}" =~ /^\X$/     # \X matches a P with a dot above.
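
As a further sketch of C<\K>, C<\R> and C<\X> in use (perl 5.10.0 or
later), the following applies them to small sample strings. The strings
and variable names are illustrative.

```perl
use strict;
use warnings;
use 5.010;    # \K and \R need perl 5.10.0 or later

# \K keeps 'foo' out of the replaced text; the lone 'bar' is untouched.
(my $kept = "foobar barfoo") =~ s/foo\Kbar/baz/g;    # "foobaz barfoo"

# \R treats "\r\n" as a single newline, so no empty fields appear.
my @lines = split /\R/, "a\r\nb\nc\rd";              # a, b, c, d

# Counting matches confirms that "\r\n" is matched only once.
my $crlf_count = () = "\r\n" =~ /\R/g;               # 1

# \X takes a base character together with its combining mark.
my $cluster = "P\x{0307}" =~ /\A\X\z/ ? 1 : 0;       # 1

print "$kept | @lines | $crlf_count | $cluster\n";
```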

=cut