=head1 NAME

perlrebackslash - Perl Regular Expression Backslash Sequences and Escapes

=head1 DESCRIPTION

The top level documentation about Perl regular expressions
is found in L<perlre>.

This document describes all backslash and escape sequences. After
explaining the role of the backslash, it lists all the sequences that have
a special meaning in Perl regular expressions (in alphabetical order),
then describes each of them.

Most sequences are described in detail in different documents; the primary
purpose of this document is to have a quick reference guide describing all
backslash and escape sequences.

=head2 The backslash

In a regular expression, the backslash can perform one of two tasks:
it either takes away the special meaning of the character following it
(for instance, C<\|> matches a vertical bar; it's not an alternation),
or it is the start of a backslash or escape sequence.

The rules determining what it is are quite simple: if the character
following the backslash is an ASCII punctuation (non-word) character (that is,
anything that is not a letter, digit, or underscore), then the backslash just
takes away any special meaning of the character following it.

If the character following the backslash is an ASCII letter or an ASCII digit,
then the sequence may be special; if so, it's listed below. A few letters have
not been used yet, so escaping them with a backslash doesn't change them to be
special. A future version of Perl may assign a special meaning to them, so if
you have warnings turned on, Perl issues a warning if you use such a
sequence [1].

It is, however, guaranteed that backslash or escape sequences never have a
punctuation character following the backslash, not now, and not in a future
version of Perl 5. So it is safe to put a backslash in front of a non-word
character.

Note that the backslash itself is special; if you want to match a backslash,
you have to escape the backslash with a backslash: C</\\/> matches a single
backslash.

=over 4

=item [1]

There is one exception. If you use an alphanumeric character as the
delimiter of your pattern (which you probably shouldn't do for readability
reasons), you have to escape the delimiter if you want to match
it. Perl won't warn then. See also L<perlop/Gory details of parsing
quoted constructs>.

=back

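The two roles of the backslash described above can be seen in a minimal
sketch. The variable names and sample strings are illustrative only and do
not come from the original text.

```perl
use strict;
use warnings;

# An unescaped '|' is alternation; an escaped one is a literal bar.
my $alternation = "a"   =~ /a|b/  ? 1 : 0;   # matches via the 'a' branch
my $literal_bar = "a"   =~ /a\|b/ ? 1 : 0;   # needs the text 'a|b', fails
my $bar_match   = "a|b" =~ /a\|b/ ? 1 : 0;   # the literal bar is present

# The backslash itself must be escaped to be matched literally.
my $backslash = 'C:\\temp' =~ /\\/ ? 1 : 0;
```
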
=head2 All the sequences and escapes

Those not usable within a bracketed character class (like C<[\da-z]>) are marked
as C<Not in [].>

 \000              Octal escape sequence.  See also \o{}.
 \1                Absolute backreference.  Not in [].
 \a                Alarm or bell.
 \A                Beginning of string.  Not in [].
 \b                Word/non-word boundary.  (Backspace in []).
 \B                Not a word/non-word boundary.  Not in [].
 \cX               Control-X.
 \C                Single octet, even under UTF-8.  Not in [].
 \d                Character class for digits.
 \D                Character class for non-digits.
 \e                Escape character.
 \E                Turn off \Q, \L and \U processing.  Not in [].
 \f                Form feed.
 \g{}, \g1         Named, absolute or relative backreference.  Not in [].
 \G                Pos assertion.  Not in [].
 \h                Character class for horizontal whitespace.
 \H                Character class for non horizontal whitespace.
 \k{}, \k<>, \k''  Named backreference.  Not in [].
 \K                Keep the stuff left of \K.  Not in [].
 \l                Lowercase next character.  Not in [].
 \L                Lowercase till \E.  Not in [].
 \n                (Logical) newline character.
 \N                Any character but newline.  Experimental.  Not in [].
 \N{}              Named or numbered (Unicode) character or sequence.
 \o{}              Octal escape sequence.
 \p{}, \pP         Character with the given Unicode property.
 \P{}, \PP         Character without the given Unicode property.
 \Q                Quotemeta till \E.  Not in [].
 \r                Return character.
 \R                Generic new line.  Not in [].
 \s                Character class for whitespace.
 \S                Character class for non whitespace.
 \t                Tab character.
 \u                Titlecase next character.  Not in [].
 \U                Uppercase till \E.  Not in [].
 \v                Character class for vertical whitespace.
 \V                Character class for non vertical whitespace.
 \w                Character class for word characters.
 \W                Character class for non-word characters.
 \x{}, \x00        Hexadecimal escape sequence.
 \X                Unicode "extended grapheme cluster".  Not in [].
 \z                End of string.  Not in [].
 \Z                End of string.  Not in [].

=head2 Character Escapes

=head3 Fixed characters

A handful of characters have a dedicated I<character escape>. The following
table shows them, along with their ASCII code points (in decimal and hex),
their ASCII name, the control escape on ASCII platforms and a short
description. (For EBCDIC platforms, see L<perlebcdic/OPERATOR DIFFERENCES>.)

 Seq.  Code Point  ASCII   Cntrl   Description.
       Dec    Hex
  \a     7     07    BEL    \cG    alarm or bell
  \b     8     08     BS    \cH    backspace [1]
  \e    27     1B    ESC    \c[    escape character
  \f    12     0C     FF    \cL    form feed
  \n    10     0A     LF    \cJ    line feed [2]
  \r    13     0D     CR    \cM    carriage return
  \t     9     09    TAB    \cI    tab

=over 4

=item [1]

C<\b> is the backspace character only inside a character class. Outside a
character class, C<\b> is a word/non-word boundary.

=item [2]

C<\n> matches a logical newline. Perl converts between C<\n> and your
OS's native newline character when reading from or writing to text files.

=back

=head4 Example

 $str =~ /\t/;   # Matches if $str contains a (horizontal) tab.

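As a rough check of the table and of footnote [1], the following sketch
compares each escape with its decimal code point. It assumes an ASCII
platform; the hash and variable names are illustrative.

```perl
use strict;
use warnings;

# Decimal code points from the table (ASCII platforms only).
my %code_point = (
    "\a" => 7,  "\e" => 27, "\f" => 12,
    "\n" => 10, "\r" => 13, "\t" => 9,
);
my $table_ok = !grep { ord($_) != $code_point{$_} } keys %code_point;

# \b is a backspace only inside a bracketed character class.
my $bs_in_class = "\x08" =~ /[\b]/ ? 1 : 0;   # backspace character
my $bs_outside  = "\x08" =~ /\b/   ? 1 : 0;   # word boundary, none here
```
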
=head3 Control characters

C<\c> is used to denote a control character; the character following C<\c>
determines the value of the construct. For example the value of C<\cA> is
C<chr(1)>, and the value of C<\cb> is C<chr(2)>, etc.
The gory details are in L<perlop/"Regexp Quote-Like Operators">. A complete
list of what C<chr(1)>, etc. means for ASCII and EBCDIC platforms is in
L<perlebcdic/OPERATOR DIFFERENCES>.

Note that C<\c\> alone at the end of a regular expression (or double-quoted
string) is not valid. The backslash must be followed by another character.
That is, C<\c\I<X>> means C<chr(28) . 'I<X>'> for all characters I<X>.

To write platform-independent code, you must use C<\N{I<NAME>}> instead, like
C<\N{ESCAPE}> or C<\N{U+001B}>; see L<charnames>.

Mnemonic: I<c>ontrol character.

=head4 Example

 $str =~ /\cK/;  # Matches if $str contains a vertical tab (control-K).

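A short sketch of the C<\c> values mentioned above, again assuming an ASCII
platform. The variable names are illustrative.

```perl
use strict;
use warnings;

my $ctrl_a = "\cA" eq chr(1)  ? 1 : 0;   # control-A
my $ctrl_b = "\cb" eq chr(2)  ? 1 : 0;   # the letter's case does not matter
my $escape = "\c[" eq "\e"    ? 1 : 0;   # control-[ is the escape character

# In a pattern, \cK matches a vertical tab (code point 11).
my $vtab = "one\x0Btwo" =~ /\cK/ ? 1 : 0;
```
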
=head3 Named or numbered characters and character sequences

Unicode characters have a Unicode name and numeric ordinal value. Use the
C<\N{}> construct to specify a character by either of these values.
Certain sequences of characters also have names.

To specify by name, the name of the character or character sequence goes
between the curly braces. In this case, you have to C<use charnames> to
load the Unicode names of the characters; otherwise Perl will complain.

To specify a character by Unicode code point, use the form C<\N{U+I<code
point>}>, where I<code point> is a number in hexadecimal that gives the
ordinal number that Unicode has assigned to the desired character. It is
customary but not required to use leading zeros to pad the number to 4
digits. Thus C<\N{U+0041}> means C<LATIN CAPITAL LETTER A>, and you will
rarely see it written without the two leading zeros. C<\N{U+0041}> means
"A" even on EBCDIC machines (where the ordinal value of "A" is not 0x41).

It is even possible to give your own names to characters and character
sequences. For details, see L<charnames>.

(There is an expanded internal form that you may see in debug output:
C<\N{U+I<code point>.I<code point>...}>.
The C<...> means any number of these I<code point>s separated by dots.
This represents the sequence formed by the characters. This is an internal
form only, subject to change, and you should not try to use it yourself.)

Mnemonic: I<N>amed character.

Note that a character or character sequence expressed as a named
or numbered character is considered a character without special
meaning by the regex engine, and will match "as is".

=head4 Example

 use charnames ':full';               # Loads the Unicode names.
 $str =~ /\N{THAI CHARACTER SO SO}/;  # Matches the Thai SO SO character

 use charnames 'Cyrillic';            # Loads Cyrillic names.
 $str =~ /\N{ZHE}\N{KA}/;             # Match "ZHE" followed by "KA".

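The following sketch shows the by-name and by-number forms side by side,
and the "as is" matching noted above. The sample characters and variable
names are chosen for illustration only.

```perl
use strict;
use warnings;
use charnames ':full';    # load the full Unicode character names

my $by_number = "A" =~ /^\N{U+0041}$/ ? 1 : 0;
my $by_name   = "A" =~ /^\N{LATIN CAPITAL LETTER A}$/ ? 1 : 0;
my $same_char = "\N{GREEK SMALL LETTER ALPHA}" eq "\x{3B1}" ? 1 : 0;

# U+002B is "+". Written as \N{U+002B} it is matched literally
# and is not treated as a quantifier.
my $literal_plus = "a+" =~ /^a\N{U+002B}$/ ? 1 : 0;
```
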
=head3 Octal escapes

There are two forms of octal escapes. Each is used to specify a character by
its ordinal, specified in octal notation.

One form, available starting in Perl 5.14, looks like C<\o{...}>, where the dots
represent one or more octal digits. It can be used for any Unicode character.

It was introduced to avoid the potential problems with the other form,
available in all Perls. That form consists of a backslash followed by three
octal digits. One problem with this form is that it can look exactly like an
old-style backreference (see
L</Disambiguation rules between old-style octal escapes and backreferences>
below). You can avoid this by making the first of the three digits always a
zero, but that makes \077 the largest code point specifiable.

In some contexts, a backslash followed by two or even one octal digit may be
interpreted as an octal escape, sometimes with a warning, and because of some
bugs, sometimes with surprising results. Also, if you are creating a regex
out of smaller snippets concatenated together, and you use fewer than three
digits, the beginning of one snippet may be interpreted as adding digits to the
ending of the snippet before it. See L</Absolute referencing> for more
discussion and examples of the snippet problem.

Note that a character expressed as an octal escape is considered
a character without special meaning by the regex engine, and will match
"as is".

To summarize, the C<\o{}> form is always safe to use, and the other form is
safe to use for ordinals up through \077 when you use exactly three digits to
specify them.

Mnemonic: I<0>ctal or I<o>ctal.

=head4 Examples (assuming an ASCII platform)

 $str = "Perl";
 $str =~ /\o{120}/;  # Match, "\120" is "P".
 $str =~ /\120/;     # Same.
 $str =~ /\o{120}+/; # Match, "\120" is "P", it's repeated at least once
 $str =~ /\120+/;    # Same.
 $str =~ /P\053/;    # No match, "\053" is "+" and taken literally.
 /\o{23073}/         # Black foreground, white background smiling face.
 /\o{4801234567}/    # Raises a warning, and yields chr(4)

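The two octal forms can be compared in a runnable sketch. It assumes an
ASCII platform and Perl 5.14 or later (for C<\o{}>); the variable names are
illustrative.

```perl
use strict;
use warnings;

my $str = "Perl";
my $braced     = $str =~ /\o{120}/ ? 1 : 0;   # octal 120 is "P"
my $old_style  = $str =~ /\120/    ? 1 : 0;   # same character, older form
my $no_plus    = $str =~ /P\053/   ? 1 : 0;   # octal 053 is a literal "+"
my $round_trip = chr(oct("120")) eq "P" ? 1 : 0;
```
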
=head4 Disambiguation rules between old-style octal escapes and backreferences

Octal escapes of the C<\000> form outside of bracketed character classes
potentially clash with old-style backreferences (see L</Absolute referencing>
below). They both consist of a backslash followed by digits. So Perl has to
use heuristics to determine whether it is a backreference or an octal escape.
Perl uses the following rules to disambiguate:

=over 4

=item 1

If the backslash is followed by a single digit, it's a backreference.

=item 2

If the first digit following the backslash is a 0, it's an octal escape.

=item 3

If the number following the backslash is N (in decimal), and Perl already
has seen N capture groups, Perl considers this a backreference. Otherwise,
it considers it an octal escape. If N has more than three digits, Perl
takes only the first three for the octal escape; the rest are matched as is.

 my $pat  = "(" x 999;
    $pat .= "a";
    $pat .= ")" x 999;
 /^($pat)\1000$/;   #  Matches 'aa'; there are 1000 capture groups.
 /^$pat\1000$/;     #  Matches 'a@0'; there are 999 capture groups
                    #  and \1000 is seen as \100 (a '@') and a '0'

=back

You can always force a backreference interpretation by using the C<\g{...}>
form. You can always force an octal interpretation by using the C<\o{...}>
form, or for numbers up through \077 (= 63 decimal), by using three digits,
beginning with a "0".

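A small sketch of rules 1 and 3 and of the two unambiguous forms. It assumes
an ASCII platform and Perl 5.14 or later; the patterns and variable names are
illustrative.

```perl
use strict;
use warnings;

# Rule 1: a single digit is always a backreference.
my $single = "aa" =~ /^(a)\1$/ ? 1 : 0;

# Rule 3: only one group has been seen, so \101 cannot be
# backreference 101 and is read as octal 101, which is "A".
my $octal = "aA" =~ /^(a)\101$/ ? 1 : 0;

# \g{...} is always a backreference; \o{...} is always octal.
my $forced_ref = "aa01" =~ /^(a)\g{1}01$/ ? 1 : 0;
my $forced_oct = "aA"   =~ /^(a)\o{101}$/ ? 1 : 0;
```
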
=head3 Hexadecimal escapes

Like octal escapes, there are two forms of hexadecimal escapes, but both start
with the same thing, C<\x>. This is followed by either exactly two hexadecimal
digits forming a number, or a hexadecimal number of arbitrary length surrounded
by curly braces. The hexadecimal number is the code point of the character you
want to express.

Note that a character expressed as one of these escapes is considered a
character without special meaning by the regex engine, and will match
"as is".

Mnemonic: heI<x>adecimal.

=head4 Examples (assuming an ASCII platform)

 $str = "Perl";
 $str =~ /\x50/;    # Match, "\x50" is "P".
 $str =~ /\x50+/;   # Match, "\x50" is "P", it is repeated at least once
 $str =~ /P\x2B/;   # No match, "\x2B" is "+" and taken literally.

 /\x{2603}\x{2602}/ # Snowman with an umbrella.
                    # The Unicode character 2603 is a snowman,
                    # the Unicode character 2602 is an umbrella.
 /\x{263B}/         # Black smiling face.
 /\x{263b}/         # Same, the hex digits A - F are case insensitive.

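The same points can be checked in a runnable sketch, assuming an ASCII
platform. The variable names are illustrative.

```perl
use strict;
use warnings;

my $two_digit = "Perl" =~ /\x50/ ? 1 : 0;           # hex 50 is "P"
my $literal   = "P+"   =~ /^P\x2B$/ ? 1 : 0;        # hex 2B is a literal "+"
my $braced    = "\x{2603}" =~ /^\x{2603}$/ ? 1 : 0; # braces allow any length
my $any_case  = "\x{263B}" =~ /^\x{263b}$/ ? 1 : 0; # digit case is ignored
```
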
=head2 Modifiers

A number of backslash sequences have to do with changing the character,
or characters following them. C<\l> will lowercase the character following
it, while C<\u> will uppercase (or, more accurately, titlecase) the
character following it. They provide functionality similar to the
functions C<lcfirst> and C<ucfirst>.

To uppercase or lowercase several characters, one might want to use
C<\L> or C<\U>, which will lowercase/uppercase all characters following
them, until either the end of the pattern or the next occurrence of
C<\E>, whichever comes first. They provide functionality similar to what
the functions C<lc> and C<uc> provide.

C<\Q> is used to escape all characters following, up to the next C<\E>
or the end of the pattern. C<\Q> adds a backslash to any character that
isn't a letter, digit, or underscore. This ensures that any character
between C<\Q> and C<\E> is matched literally, not interpreted
as a metacharacter by the regex engine.

Mnemonic: I<L>owercase, I<U>ppercase, I<Q>uotemeta, I<E>nd.

=head4 Examples

 $sid     = "sid";
 $greg    = "GrEg";
 $miranda = "(Miranda)";
 $str     =~ /\u$sid/;        # Matches 'Sid'
 $str     =~ /\L$greg/;       # Matches 'greg'
 $str     =~ /\Q$miranda\E/;  # Matches '(Miranda)', as if the pattern
                              #   had been written as /\(Miranda\)/

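The examples above leave C<$str> undefined; the following sketch fills in
concrete strings so the effect of each modifier can be checked. The test
strings and result variables are illustrative additions.

```perl
use strict;
use warnings;

my ($sid, $greg, $miranda) = ("sid", "GrEg", "(Miranda)");

my $title = "Sid"  =~ /\A\u$sid\z/     ? 1 : 0;   # \u titlecases the 's'
my $lower = "greg" =~ /\A\L$greg\E\z/  ? 1 : 0;   # \L lowercases up to \E

# With \Q the parentheses are literal; without it they form a group.
my $quoted   = "(Miranda)" =~ /\A\Q$miranda\E\z/ ? 1 : 0;
my $unquoted = "(Miranda)" =~ /\A$miranda\z/     ? 1 : 0;

# \Q...\E applies the same escaping as the quotemeta function.
my $same = quotemeta($miranda) eq '\(Miranda\)' ? 1 : 0;
```
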
=head2 Character classes

Perl regular expressions have a large range of character classes. Some of
the character classes are written as a backslash sequence. We will briefly
discuss those here; full details of character classes can be found in
L<perlrecharclass>.

C<\w> is a character class that matches any single I<word> character
(letters, digits, Unicode marks, and connector punctuation (like the
underscore)). C<\d> is a character class that matches any decimal
digit, while the character class C<\s> matches any whitespace character.
New in perl 5.10.0 are the classes C<\h> and C<\v> which match horizontal
and vertical whitespace characters.

The exact set of characters matched by C<\d>, C<\s>, and C<\w> varies
depending on various pragmas and regular expression modifiers. It is
possible to restrict the match to the ASCII range by using the C</a>
regular expression modifier. See L<perlrecharclass>.

The uppercase variants (C<\W>, C<\D>, C<\S>, C<\H>, and C<\V>) are
character classes that match, respectively, any character that isn't a
word character, digit, whitespace, horizontal whitespace, or vertical
whitespace.

Mnemonics: I<w>ord, I<d>igit, I<s>pace, I<h>orizontal, I<v>ertical.

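A brief sketch of these classes in use. It assumes Perl 5.14 or later
because of the C</a> modifier; the sample strings and variable names are
illustrative.

```perl
use strict;
use warnings;

# \w covers letters, digits and the underscore, but not '-' or space.
my @words = "foo_1 bar-2" =~ /\w+/g;

my $h_tab     = "\t" =~ /\A\h\z/ ? 1 : 0;   # tab is horizontal whitespace
my $h_newline = "\n" =~ /\A\h\z/ ? 1 : 0;   # newline is not
my $v_newline = "\n" =~ /\A\v\z/ ? 1 : 0;   # newline is vertical whitespace

# U+0663 is ARABIC-INDIC DIGIT THREE: a digit to \d, unless /a is used.
my $unicode_digit = "\x{663}" =~ /\A\d\z/  ? 1 : 0;
my $ascii_only    = "\x{663}" =~ /\A\d\z/a ? 1 : 0;
```
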
=head3 Unicode classes

C<\pP> (where C<P> is a single letter) and C<\p{Property}> are used to
match a character that matches the given Unicode property; properties
include things like "letter", or "thai character". Capitalizing the
sequence to C<\PP> and C<\P{Property}> makes the sequence match a character
that doesn't match the given Unicode property. For more details, see
L<perlrecharclass/Backslash sequences> and
L<perlunicode/Unicode Character Properties>.

Mnemonic: I<p>roperty.

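As an illustration of the property forms, here is a minimal sketch using the
"letter" (C<L>) and C<Thai> properties. The sample characters and variable
names are chosen for this example only.

```perl
use strict;
use warnings;

my $short_form = "a" =~ /\A\pL\z/   ? 1 : 0;   # single-letter property
my $brace_form = "a" =~ /\A\p{L}\z/ ? 1 : 0;   # same property in braces
my $non_letter = "1" =~ /\A\PL\z/   ? 1 : 0;   # \P negates the property

# U+0E0B is THAI CHARACTER SO SO.
my $thai     = "\x{E0B}" =~ /\A\p{Thai}\z/ ? 1 : 0;
my $not_thai = "a"       =~ /\A\P{Thai}\z/ ? 1 : 0;
```
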
=head2 Referencing

If capturing parentheses are used in a regular expression, we can refer
to the part of the source string that was matched, and match exactly the
same thing. There are three ways of referring to such I<backreferences>:
absolutely, relatively, and by name.

=for later add link to perlrecapture

=head3 Absolute referencing

Either C<\gI<N>> (starting in Perl 5.10.0), or C<\I<N>> (old-style), where I<N>
is a positive (unsigned) decimal number of any length, is an absolute reference
to a capturing group.

I<N> refers to the Nth set of parentheses, so C<\gI<N>> refers to whatever has
been matched by that set of parentheses. Thus C<\g1> refers to the first
capture group in the regex.

The C<\gI<N>> form can be equivalently written as C<\g{I<N>}>
which avoids ambiguity when building a regex by concatenating shorter
strings. Otherwise if you had a regex C<qr/$a$b/>, and C<$a> contained
C<"\g1">, and C<$b> contained C<"37">, you would get C</\g137/> which is
probably not what you intended.

In the C<\I<N>> form, I<N> must not begin with a "0", and there must be at
least I<N> capturing groups, or else I<N> is considered an octal escape
(but something like C<\18> is the same as C<\0018>; that is, the octal escape
C<"\001"> followed by a literal digit C<"8">).

Mnemonic: I<g>roup.

=head4 Examples

 /(\w+) \g1/;    # Finds a duplicated word, (e.g. "cat cat").
 /(\w+) \1/;     # Same thing; written old-style
 /(.)(.)\g2\g1/; # Match a four letter palindrome (e.g. "ABBA").

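The following sketch applies these patterns to concrete strings and shows
the braced form surviving concatenation. The strings and the names C<$head>
and C<$tail> are illustrative stand-ins for the C<$a> and C<$b> of the text.

```perl
use strict;
use warnings;

my ($word)     = "the cat cat sat" =~ /\b(\w+) \g1\b/;   # captures 'cat'
my $old_style  = "cat cat" =~ /\A(\w+) \1\z/     ? 1 : 0;
my $palindrome = "ABBA"    =~ /\A(.)(.)\g2\g1\z/ ? 1 : 0;

# With braces, digits from the next snippet cannot run into the
# group number: the combined pattern is (\w)\g{1}37, not (\w)\g137.
my ($head, $tail) = ('(\w)\g{1}', '37');
my $joined = "aa37" =~ /\A$head$tail\z/ ? 1 : 0;
```
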
=head3 Relative referencing

C<\g-I<N>> (starting in Perl 5.10.0) is used for relative addressing. (It can
be written as C<\g{-I<N>}>.) It refers to the I<N>th group before the
C<\g{-I<N>}>.

The big advantage of this form is that it makes it much easier to write
patterns with references that can be interpolated in larger patterns,
even if the larger pattern also contains capture groups.

=head4 Examples

 /(A)        # Group 1
  (          # Group 2
    (B)      # Group 3
    \g{-1}   # Refers to group 3 (B)
    \g{-3}   # Refers to group 1 (A)
  )
 /x;         # Matches "ABBA".

 my $qr = qr /(.)(.)\g{-2}\g{-1}/;  # Matches 'abab', 'cdcd', etc.
 /$qr$qr/                           # Matches 'ababcdcd'.

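To make the advantage concrete, this sketch interpolates the same snippet
twice, once with relative and once with absolute references. The names
C<$rel> and C<$abs> and the test strings are illustrative.

```perl
use strict;
use warnings;

my $rel = qr/(.)(.)\g{-2}\g{-1}/;   # refers to its own two groups
my $abs = qr/(.)(.)\1\2/;           # refers to groups 1 and 2 of the whole

# Relative references still point at the nearest groups after reuse.
my $rel_twice = "ababcdcd" =~ /\A$rel$rel\z/ ? 1 : 0;

# In the second copy, \1\2 still means the first copy's groups, so
# 'cd' must be followed by 'ab' rather than by 'cd'.
my $abs_twice = "ababcdcd" =~ /\A$abs$abs\z/ ? 1 : 0;
my $abs_first = "ababcdab" =~ /\A$abs$abs\z/ ? 1 : 0;
```
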
=head3 Named referencing

C<\g{I<name>}> (starting in Perl 5.10.0) can be used to back refer to a
named capture group, dispensing completely with having to think about capture
group positions.

To be compatible with .Net regular expressions, C<\g{name}> may also be
written as C<\k{name}>, C<< \k<name> >> or C<\k'name'>.

To prevent any ambiguity, I<name> must neither start with a digit nor contain
a hyphen.

=head4 Examples

 /(?<word>\w+) \g{word}/ # Finds duplicated word, (e.g. "cat cat")
 /(?<word>\w+) \k{word}/ # Same.
 /(?<word>\w+) \k<word>/ # Same.
 /(?<letter1>.)(?<letter2>.)\g{letter2}\g{letter1}/
                         # Match a four letter palindrome (e.g. "ABBA")

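A runnable sketch of the four equivalent spellings, assuming Perl 5.10 or
later. The test string and variable names are illustrative.

```perl
use strict;
use warnings;

my $target  = "cat cat";
my $g_brace = $target =~ /\A(?<word>\w+) \g{word}\z/ ? 1 : 0;
my $k_brace = $target =~ /\A(?<word>\w+) \k{word}\z/ ? 1 : 0;
my $k_angle = $target =~ /\A(?<word>\w+) \k<word>\z/ ? 1 : 0;
my $k_quote = $target =~ /\A(?<word>\w+) \k'word'\z/ ? 1 : 0;

# After a successful match the named capture is available in %+.
my $captured = $target =~ /(?<word>\w+) \k<word>/ ? $+{word} : undef;
```
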
=head2 Assertions

Assertions are conditions that have to be true; they don't actually
match parts of the substring. There are six assertions that are written as
backslash sequences.

=over 4

=item \A

C<\A> only matches at the beginning of the string. If the C</m> modifier
isn't used, then C</\A/> is equivalent to C</^/>. However, if the C</m>
modifier is used, then C</^/> also matches after internal newlines, but the
meaning of C</\A/> isn't changed by the C</m> modifier. C<\A> matches at the
beginning of the string regardless of whether the C</m> modifier is used.

=item \z, \Z

C<\z> and C<\Z> match at the end of the string. If the C</m> modifier isn't
used, then C</\Z/> is equivalent to C</$/>; that is, it matches at the
end of the string, or just before the newline at the end of the string. If the
C</m> modifier is used, then C</$/> matches at internal newlines, but the
meaning of C</\Z/> isn't changed by the C</m> modifier. C<\Z> matches at
the end of the string (or just before a trailing newline) regardless of
whether the C</m> modifier is used.

C<\z> is just like C<\Z>, except that it does not match before a trailing
newline. C<\z> matches at the end of the string only, regardless of the
modifiers used, and not just before a newline. It is how to anchor the
match to the true end of the string under all conditions.

=item \G

C<\G> is usually used only in combination with the C</g> modifier. If the
C</g> modifier is used and the match is done in scalar context, Perl
remembers where in the source string the last match ended, and the next time,
it will start the match from where it ended the previous time.

C<\G> matches the point where the previous match on that string ended,
or the beginning of that string if there was no previous match.

=for later add link to perlremodifiers

Mnemonic: I<G>lobal.

=item \b, \B

C<\b> matches at any place between a word and a non-word character; C<\B>
matches at any place between characters where C<\b> doesn't match. C<\b>
and C<\B> assume there's a non-word character before the beginning and after
the end of the source string; so C<\b> will match at the beginning (or end)
of the source string if the source string begins (or ends) with a word
character. Otherwise, C<\B> will match.

Do not use something like C<\b=head\d\b> and expect it to match the
beginning of a line. It can't, because for there to be a boundary before
the non-word "=", there must be a word character immediately before it.
All boundary determinations look for word characters alone, not for
non-word characters nor for string ends. It may help to understand how
C<\b> and C<\B> work by equating them as follows:

    \b  really means    (?:(?<=\w)(?!\w)|(?<!\w)(?=\w))
    \B  really means    (?:(?<=\w)(?=\w)|(?<!\w)(?!\w))

Mnemonic: I<b>oundary.

=back

=head4 Examples

  "cat"   =~ /\Acat/;     # Match.
  "cat"   =~ /cat\Z/;     # Match.
  "cat\n" =~ /cat\Z/;     # Match.
  "cat\n" =~ /cat\z/;     # No match.

  "cat"   =~ /\bcat\b/;   # Match.
  "cats"  =~ /\bcat\b/;   # No match.
  "cat"   =~ /\bcat\B/;   # No match.
  "cats"  =~ /\bcat\B/;   # Match.

  while ("cat dog" =~ /(\w+)/g) {
      print $1;           # Prints 'catdog'
  }
  while ("cat dog" =~ /\G(\w+)/g) {
      print $1;           # Prints 'cat'
  }

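The sketch below collects results instead of printing them, so the
differences between the assertions can be inspected. The strings and
variable names are illustrative.

```perl
use strict;
use warnings;

# /m changes ^ but never \A.
my $caret_m = "dog\ncat" =~ /^cat/m   ? 1 : 0;   # matches after the newline
my $big_a_m = "dog\ncat" =~ /\Acat/m  ? 1 : 0;   # start of string only

# \Z tolerates one trailing newline; \z does not.
my $big_z   = "cat\n" =~ /cat\Z/ ? 1 : 0;
my $small_z = "cat\n" =~ /cat\z/ ? 1 : 0;

# There is no word character before '=', hence no boundary there.
my $no_boundary = "=head1" =~ /\b=head\d\b/ ? 1 : 0;

# Without \G the scan skips the space; with \G it must resume exactly
# where the previous match ended, so it stops after 'cat'.
my $text = "cat dog";
my (@free, @anchored);
push @free,     $1 while $text =~ /(\w+)/g;
push @anchored, $1 while $text =~ /\G(\w+)/g;
```
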
=head2 Misc

Here we document the backslash sequences that don't fall in one of the
categories above. These are:

=over 4

=item \C

C<\C> always matches a single octet, even if the source string is encoded
in UTF-8 format, and the character to be matched is a multi-octet character.
C<\C> was introduced in perl 5.6. This is very dangerous, because it violates
the logical character abstraction and can cause UTF-8 sequences to become
malformed.

Mnemonic: oI<C>tet.

=item \K

This appeared in perl 5.10.0. Anything matched left of C<\K> is
not included in C<$&>, and will not be replaced if the pattern is
used in a substitution. This lets you write C<s/PAT1 \K PAT2/REPL/x>
instead of C<s/(PAT1) PAT2/${1}REPL/x> or C<s/(?<=PAT1) PAT2/REPL/x>.

Mnemonic: I<K>eep.

=item \N

This is an experimental feature new to perl 5.12.0. It matches any character
that is B<not> a newline. It is a short-hand for writing C<[^\n]>, and is
identical to the C<.> metasymbol, except under the C</s> flag, which changes
the meaning of C<.>, but not C<\N>.

Note that C<\N{...}> can mean a
L<named or numbered character|/Named or numbered characters and character sequences>.

Mnemonic: Complement of I<\n>.

=item \R
X<\R>

C<\R> matches a I<generic newline>; that is, anything considered a
linebreak sequence by Unicode. This includes all characters matched by
C<\v> (vertical whitespace), and the multi character sequence C<"\x0D\x0A">
(carriage return followed by a line feed, sometimes called the network
newline; it's the end of line sequence used in Microsoft text files opened
in binary mode). C<\R> is equivalent to C<< (?>\x0D\x0A|\v) >>. Since
C<\R> can match a sequence of more than one character, it cannot be put
inside a bracketed character class; C</[\R]/> is an error; use C<\v>
instead. C<\R> was introduced in perl 5.10.0.

Mnemonic: none really. C<\R> was picked because PCRE already uses C<\R>,
and more importantly because Unicode recommends such a regular expression
metacharacter, and suggests C<\R> as its notation.

=item \X
X<\X>

This matches a Unicode I<extended grapheme cluster>.

C<\X> matches quite well what normal (non-Unicode-programmer) usage
would consider a single character. As an example, consider a G with some sort
of diacritic mark, such as an arrow. There is no such single character in
Unicode, but one can be composed by using a G followed by a Unicode "COMBINING
UPWARDS ARROW BELOW", and would be displayed by Unicode-aware software as if it
were a single character.

Mnemonic: eI<X>tended Unicode character.

=back

=head4 Examples

 "\x{256}" =~ /^\C\C$/;    # Match as chr (0x256) takes 2 octets in UTF-8.

 $str =~ s/foo\Kbar/baz/g; # Change any 'bar' following a 'foo' to 'baz'
 $str =~ s/(.)\K\g1//g;    # Delete duplicated characters.

 "\n"   =~ /^\R$/;         # Match, \n   is a generic newline.
 "\r"   =~ /^\R$/;         # Match, \r   is a generic newline.
 "\r\n" =~ /^\R$/;         # Match, \r\n is a generic newline.

 "P\x{307}" =~ /^\X$/;     # \X matches a P with a dot above.

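A sketch of C<\K>, C<\N>, C<\R> and C<\X> on concrete input, assuming Perl
5.12 or later (for C<\N>). The sample strings and variable names are
illustrative; C<\C> is left out deliberately because of the dangers noted
above.

```perl
use strict;
use warnings;

# \K keeps 'foo' and replaces only the 'bar' that follows it.
(my $kept = "foobar bar") =~ s/foo\Kbar/baz/g;

# Under /s the dot matches a newline, but \N still does not.
my $dot_s = "a\nb" =~ /a.b/s  ? 1 : 0;
my $n_s   = "a\nb" =~ /a\Nb/s ? 1 : 0;

# \R treats "\r\n" as a single line break, giving four fields here.
my @lines = split /\R/, "one\ntwo\r\nthree\rfour";

# \X consumes a base character together with its combining marks,
# so "P" plus U+0307 counts as one cluster and "q" as another.
my @clusters = "P\x{307}q" =~ /\X/g;
```
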
=cut