=head1 NAME

perlrebackslash - Perl Regular Expression Backslash Sequences and Escapes

=head1 DESCRIPTION

The top level documentation about Perl regular expressions
is found in L<perlre>.

This document describes all backslash and escape sequences. After
explaining the role of the backslash, it lists all the sequences that have
a special meaning in Perl regular expressions (in alphabetical order),
then describes each of them.

Most sequences are described in detail in different documents; the primary
purpose of this document is to have a quick reference guide describing all
backslash and escape sequences.

=head2 The backslash

In a regular expression, the backslash can perform one of two tasks:
it either takes away the special meaning of the character following it
(for instance, C<\|> matches a vertical bar; it's not an alternation),
or it is the start of a backslash or escape sequence.

The rules determining what it is are quite simple: if the character
following the backslash is an ASCII punctuation (non-word) character (that is,
anything that is not a letter, digit, or underscore), then the backslash just
takes away any special meaning of the character following it.

If the character following the backslash is an ASCII letter or an ASCII digit,
then the sequence may be special; if so, it's listed below. A few letters have
not been used yet, so escaping them with a backslash doesn't change them to be
special. A future version of Perl may assign a special meaning to them, so if
you have warnings turned on, Perl issues a warning if you use such a
sequence [1].

It is however guaranteed that backslash or escape sequences never have a
punctuation character following the backslash, not now, and not in a future
version of Perl 5. So it is safe to put a backslash in front of a non-word
character.

Note that the backslash itself is special; if you want to match a backslash,
you have to escape the backslash with a backslash: C</\\/> matches a single
backslash.

=over 4

=item [1]

There is one exception. If you use an alphanumeric character as the
delimiter of your pattern (which you probably shouldn't do for readability
reasons), you have to escape the delimiter if you want to match
it. Perl won't warn then. See also L<perlop/Gory details of parsing
quoted constructs>.

=back

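The escaping rules for punctuation can be sketched as follows. This is a
minimal illustration, not part of the original reference; the variable names
are made up for the example.

```perl
use strict;
use warnings;

# A backslash in front of ASCII punctuation removes any special meaning.
my $literal_bar = "a|b" =~ /^a\|b$/ ? 1 : 0;   # \| is a literal vertical bar
my $alternation = "a"   =~ /^a\|b$/ ? 1 : 0;   # so "a" alone does not match

# The backslash itself must be escaped to be matched.
my $path      = 'C:\temp';                     # single quotes: one backslash
my $backslash = $path =~ /\\/ ? 1 : 0;

print "literal bar: $literal_bar\n";   # prints "literal bar: 1"
print "alternation: $alternation\n";   # prints "alternation: 0"
print "backslash:   $backslash\n";     # prints "backslash:   1"
```
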
=head2 All the sequences and escapes

Those not usable within a bracketed character class (like C<[\da-z]>) are marked
as C<Not in [].>

 \000              Octal escape sequence. See also \o{}.
 \1                Absolute backreference. Not in [].
 \a                Alarm or bell.
 \A                Beginning of string. Not in [].
 \b{}, \b          Boundary. (\b is a backspace in []).
 \B{}, \B          Not a boundary.
 \cX               Control-X.
 \C                Single octet, even under UTF-8. Not in [].
                   (Deprecated)
 \d                Character class for digits.
 \D                Character class for non-digits.
 \e                Escape character.
 \E                Turn off \Q, \L and \U processing. Not in [].
 \f                Form feed.
 \F                Foldcase till \E. Not in [].
 \g{}, \g1         Named, absolute or relative backreference.
                   Not in [].
 \G                Pos assertion. Not in [].
 \h                Character class for horizontal whitespace.
 \H                Character class for non horizontal whitespace.
 \k{}, \k<>, \k''  Named backreference. Not in [].
 \K                Keep the stuff left of \K. Not in [].
 \l                Lowercase next character. Not in [].
 \L                Lowercase till \E. Not in [].
 \n                (Logical) newline character.
 \N                Any character but newline. Not in [].
 \N{}              Named or numbered (Unicode) character or sequence.
 \o{}              Octal escape sequence.
 \p{}, \pP         Character with the given Unicode property.
 \P{}, \PP         Character without the given Unicode property.
 \Q                Quote (disable) pattern metacharacters till \E. Not
                   in [].
 \r                Return character.
 \R                Generic new line. Not in [].
 \s                Character class for whitespace.
 \S                Character class for non whitespace.
 \t                Tab character.
 \u                Titlecase next character. Not in [].
 \U                Uppercase till \E. Not in [].
 \v                Character class for vertical whitespace.
 \V                Character class for non vertical whitespace.
 \w                Character class for word characters.
 \W                Character class for non-word characters.
 \x{}, \x00        Hexadecimal escape sequence.
 \X                Unicode "extended grapheme cluster". Not in [].
 \z                End of string. Not in [].
 \Z                End of string. Not in [].

=head2 Character Escapes

=head3 Fixed characters

A handful of characters have a dedicated I<character escape>. The following
table shows them, along with their ASCII code points (in decimal and hex),
their ASCII name, the control escape on ASCII platforms and a short
description. (For EBCDIC platforms, see L<perlebcdic/OPERATOR DIFFERENCES>.)

 Seq.  Code Point  ASCII   Cntrl   Description.
       Dec    Hex
  \a     7     07    BEL    \cG    alarm or bell
  \b     8     08     BS    \cH    backspace [1]
  \e    27     1B    ESC    \c[    escape character
  \f    12     0C     FF    \cL    form feed
  \n    10     0A     LF    \cJ    line feed [2]
  \r    13     0D     CR    \cM    carriage return
  \t     9     09    TAB    \cI    tab

=over 4

=item [1]

C<\b> is the backspace character only inside a character class. Outside a
character class, C<\b> alone is a word-character/non-word-character
boundary, and C<\b{}> is some other type of boundary.

=item [2]

C<\n> matches a logical newline. Perl converts between C<\n> and your
OS's native newline character when reading from or writing to text files.

=back

=head4 Example

 $str =~ /\t/;   # Matches if $str contains a (horizontal) tab.

=head3 Control characters

C<\c> is used to denote a control character; the character following C<\c>
determines the value of the construct. For example, the value of C<\cA> is
C<chr(1)>, and the value of C<\cb> is C<chr(2)>, etc.
The gory details are in L<perlop/"Regexp Quote-Like Operators">. A complete
list of what C<chr(1)>, etc. means for ASCII and EBCDIC platforms is in
L<perlebcdic/OPERATOR DIFFERENCES>.

Note that C<\c\> alone at the end of a regular expression (or double-quoted
string) is not valid. The backslash must be followed by another character.
That is, C<\c\I<X>> means C<chr(28) . 'I<X>'> for all characters I<X>.

To write platform-independent code, you must use C<\N{I<NAME>}> instead, like
C<\N{ESCAPE}> or C<\N{U+001B}>; see L<charnames>.

Mnemonic: I<c>ontrol character.

=head4 Example

 $str =~ /\cK/;  # Matches if $str contains a vertical tab (control-K).

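A short sketch of the control escapes described above, assuming an ASCII
platform. The variable names are illustrative only.

```perl
use strict;
use warnings;

# Assumes an ASCII platform.
my $ctrl_a = chr(1) =~ /^\cA$/ ? 1 : 0;   # \cA is chr(1)
my $ctrl_b = chr(2) =~ /^\cb$/ ? 1 : 0;   # a lower case letter works too
my $ctrl_k = "\x0B" =~ /^\cK$/ ? 1 : 0;   # vertical tab, chr(11)

# The portable spelling of ESCAPE denotes the same character as \e.
my $escape = "\e" =~ /^\N{U+001B}$/ ? 1 : 0;

print "$ctrl_a $ctrl_b $ctrl_k $escape\n";   # prints "1 1 1 1"
```
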
=head3 Named or numbered characters and character sequences

Unicode characters have a Unicode name and numeric code point (ordinal)
value. Use the
C<\N{}> construct to specify a character by either of these values.
Certain sequences of characters also have names.

To specify by name, the name of the character or character sequence goes
between the curly braces.

To specify a character by Unicode code point, use the form C<\N{U+I<code
point>}>, where I<code point> is a number in hexadecimal that gives the
code point that Unicode has assigned to the desired character. It is
customary but not required to use leading zeros to pad the number to 4
digits. Thus C<\N{U+0041}> means C<LATIN CAPITAL LETTER A>, and you will
rarely see it written without the two leading zeros. C<\N{U+0041}> means
"A" even on EBCDIC machines (where the ordinal value of "A" is not 0x41).

It is even possible to give your own names to characters and character
sequences. For details, see L<charnames>.

(There is an expanded internal form that you may see in debug output:
C<\N{U+I<code point>.I<code point>...}>.
The C<...> means any number of these I<code point>s separated by dots.
This represents the sequence formed by the characters. This is an internal
form only, subject to change, and you should not try to use it yourself.)

Mnemonic: I<N>amed character.

Note that a character or character sequence expressed as a named
or numbered character is considered a character without special
meaning by the regex engine, and will match "as is".

=head4 Example

 $str =~ /\N{THAI CHARACTER SO SO}/;  # Matches the Thai SO SO character

 use charnames 'Cyrillic';            # Loads Cyrillic names.
 $str =~ /\N{ZHE}\N{KA}/;             # Match "ZHE" followed by "KA".

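The following sketch illustrates the numbered form and the "as is" matching
rule. It assumes Perl 5.12 or later; the variable names are made up for the
example.

```perl
use strict;
use warnings;
use charnames ':full';   # needed for \N{name} on older Perls

my $by_number = "A" =~ /^\N{U+0041}$/                 ? 1 : 0;
my $by_name   = "A" =~ /^\N{LATIN CAPITAL LETTER A}$/ ? 1 : 0;

# \N{U+002B} is "+". It is matched as is, not treated as a quantifier.
my $literal_plus = "a+" =~ /^a\N{U+002B}$/ ? 1 : 0;
my $quantifier   = "aa" =~ /^a\N{U+002B}$/ ? 1 : 0;

print "$by_number $by_name $literal_plus $quantifier\n";   # prints "1 1 1 0"
```
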
=head3 Octal escapes

There are two forms of octal escapes. Each is used to specify a character by
its code point specified in octal notation.

One form, available starting in Perl 5.14, looks like C<\o{...}>, where the dots
represent one or more octal digits. It can be used for any Unicode character.

It was introduced to avoid the potential problems with the other form,
available in all Perls. That form consists of a backslash followed by three
octal digits. One problem with this form is that it can look exactly like an
old-style backreference (see
L</Disambiguation rules between old-style octal escapes and backreferences>
below). You can avoid this by making the first of the three digits always a
zero, but that makes \077 the largest code point specifiable.

In some contexts, a backslash followed by two or even one octal digit may be
interpreted as an octal escape, sometimes with a warning, and because of some
bugs, sometimes with surprising results. Also, if you are creating a regex
out of smaller snippets concatenated together, and you use fewer than three
digits, the beginning of one snippet may be interpreted as adding digits to the
ending of the snippet before it. See L</Absolute referencing> for more
discussion and examples of the snippet problem.

Note that a character expressed as an octal escape is considered
a character without special meaning by the regex engine, and will match
"as is".

To summarize, the C<\o{}> form is always safe to use, and the other form is
safe to use for code points through \077 when you use exactly three digits to
specify them.

Mnemonic: I<0>ctal or I<o>ctal.

=head4 Examples (assuming an ASCII platform)

 $str = "Perl";
 $str =~ /\o{120}/;  # Match, "\120" is "P".
 $str =~ /\120/;     # Same.
 $str =~ /\o{120}+/; # Match, "\120" is "P",
                     # it's repeated at least once.
 $str =~ /\120+/;    # Same.
 $str =~ /P\053/;    # No match, "\053" is "+" and taken literally.
 /\o{23073}/         # Black foreground, white background smiling face.
 /\o{4801234567}/    # Raises a warning, and yields chr(4).

=head4 Disambiguation rules between old-style octal escapes and backreferences

Octal escapes of the C<\000> form outside of bracketed character classes
potentially clash with old-style backreferences (see L</Absolute referencing>
below). They both consist of a backslash followed by digits. So Perl has to
use heuristics to determine whether it is a backreference or an octal escape.
Perl uses the following rules to disambiguate:

=over 4

=item 1

If the backslash is followed by a single digit, it's a backreference.

=item 2

If the first digit following the backslash is a 0, it's an octal escape.

=item 3

If the number following the backslash is N (in decimal), and Perl already
has seen N capture groups, Perl considers this a backreference. Otherwise,
it considers it an octal escape. If N has more than three digits, Perl
takes only the first three for the octal escape; the rest are matched as is.

 my $pat  = "(" x 999;
 $pat    .= "a";
 $pat    .= ")" x 999;
 /^($pat)\1000$/;   #  Matches 'aa'; there are 1000 capture groups.
 /^$pat\1000$/;     #  Matches 'a@0'; there are 999 capture groups
                    #  and \1000 is seen as \100 (a '@') and a '0'.

=back

You can always force a backreference interpretation by using the C<\g{...}>
form. You can always force an octal interpretation by using the C<\o{...}>
form, or for numbers up through \077 (= 63 decimal), by using three digits,
beginning with a "0".

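The two unambiguous forms, and one ambiguous case resolved by rule 3, can be
sketched like this. The example assumes an ASCII platform and Perl 5.14 or
later (for C<\o{}>); the variable names are illustrative.

```perl
use strict;
use warnings;

# \g{1} is always a backreference to group 1.
my $backref = "aa"  =~ /^(a)\g{1}$/  ? 1 : 0;

# \o{11} is always the character with octal code point 11 (a tab in ASCII).
my $octal   = "a\t" =~ /^(a)\o{11}$/ ? 1 : 0;

# \11 is ambiguous: only one capture group has been seen, so rule 3
# makes it an octal escape, and it matches the tab as well.
my $guessed = "a\t" =~ /^(a)\11$/    ? 1 : 0;

print "$backref $octal $guessed\n";   # prints "1 1 1"
```
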
=head3 Hexadecimal escapes

Like octal escapes, there are two forms of hexadecimal escapes, but both start
with the same thing, C<\x>. This is followed by either exactly two hexadecimal
digits forming a number, or a hexadecimal number of arbitrary length surrounded
by curly braces. The hexadecimal number is the code point of the character you
want to express.

Note that a character expressed as one of these escapes is considered a
character without special meaning by the regex engine, and will match
"as is".

Mnemonic: heI<x>adecimal.

=head4 Examples (assuming an ASCII platform)

 $str = "Perl";
 $str =~ /\x50/;    # Match, "\x50" is "P".
 $str =~ /\x50+/;   # Match, "\x50" is "P", it is repeated at least once
 $str =~ /P\x2B/;   # No match, "\x2B" is "+" and taken literally.

 /\x{2603}\x{2602}/ # Snowman with an umbrella.
                    # The Unicode character 2603 is a snowman,
                    # the Unicode character 2602 is an umbrella.
 /\x{263B}/         # Black smiling face.
 /\x{263b}/         # Same, the hex digits A - F are case insensitive.

=head2 Modifiers

A number of backslash sequences have to do with changing the character
or characters following them. C<\l> will lowercase the character following
it, while C<\u> will uppercase (or, more accurately, titlecase) the
character following it. They provide functionality similar to the
functions C<lcfirst> and C<ucfirst>.

To uppercase or lowercase several characters, one might want to use
C<\L> or C<\U>, which will lowercase/uppercase all characters following
them, until either the end of the pattern or the next occurrence of
C<\E>, whichever comes first. They provide functionality similar to what
the functions C<lc> and C<uc> provide.

C<\Q> is used to quote (disable) pattern metacharacters, up to the next
C<\E> or the end of the pattern. C<\Q> adds a backslash to any character
that could have special meaning to Perl. In the ASCII range, it quotes
every character that isn't a letter, digit, or underscore. See
L<perlfunc/quotemeta> for details on what gets quoted for non-ASCII
code points. Using this ensures that any character between C<\Q> and
C<\E> will be matched literally, not interpreted as a metacharacter by
the regex engine.

C<\F> can be used to casefold all characters following, up to the next C<\E>
or the end of the pattern. It provides functionality similar to
the C<fc> function.

Mnemonic: I<L>owercase, I<U>ppercase, I<F>old-case, I<Q>uotemeta, I<E>nd.

=head4 Examples

 $sid     = "sid";
 $greg    = "GrEg";
 $miranda = "(Miranda)";
 $str     =~ /\u$sid/;        # Matches 'Sid'
 $str     =~ /\L$greg/;       # Matches 'greg'
 $str     =~ /\Q$miranda\E/;  # Matches '(Miranda)', as if the pattern
                              #   had been written as /\(Miranda\)/

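The effect of C<\Q> on interpolated text, and of C<\U> up to C<\E>, can be
sketched as follows. The contents of C<$input> and C<$name> are hypothetical
values chosen for the illustration.

```perl
use strict;
use warnings;

my $input = "a.c";   # hypothetical user-supplied text containing "."

my $unquoted = "abc" =~ /^$input$/     ? 1 : 0;   # "." acts as a wildcard
my $quoted   = "abc" =~ /^\Q$input\E$/ ? 1 : 0;   # "." must match itself
my $exact    = "a.c" =~ /^\Q$input\E$/ ? 1 : 0;

# \U uppercases the interpolated text up to \E; the rest is unchanged.
my $name  = "perl";
my $upper = "PERL rocks" =~ /^\U$name\E rocks$/ ? 1 : 0;

print "$unquoted $quoted $exact $upper\n";   # prints "1 0 1 1"
```

Quoting interpolated text this way is the usual choice whenever the text is
meant to be matched literally.
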
=head2 Character classes

Perl regular expressions have a large range of character classes. Some of
the character classes are written as a backslash sequence. We will briefly
discuss those here; full details of character classes can be found in
L<perlrecharclass>.

C<\w> is a character class that matches any single I<word> character
(letters, digits, Unicode marks, and connector punctuation (like the
underscore)). C<\d> is a character class that matches any decimal
digit, while the character class C<\s> matches any whitespace character.
New in perl 5.10.0 are the classes C<\h> and C<\v> which match horizontal
and vertical whitespace characters.

The exact set of characters matched by C<\d>, C<\s>, and C<\w> varies
depending on various pragmas and regular expression modifiers. It is
possible to restrict the match to the ASCII range by using the C</a>
regular expression modifier. See L<perlrecharclass>.

The uppercase variants (C<\W>, C<\D>, C<\S>, C<\H>, and C<\V>) are
character classes that match, respectively, any character that isn't a
word character, digit, whitespace, horizontal whitespace, or vertical
whitespace.

Mnemonics: I<w>ord, I<d>igit, I<s>pace, I<h>orizontal, I<v>ertical.

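As a rough illustration of how the C</a> modifier narrows C<\d>, and of the
difference between C<\h> and C<\v>, consider the sketch below. It assumes
Perl 5.14 or later; the test character and variable names are chosen only for
the example.

```perl
use 5.014;       # the /a modifier needs Perl 5.14 or later
use warnings;

my $digit = "\x{0663}";   # ARABIC-INDIC DIGIT THREE

my $unicode_digit = $digit =~ /^\d$/  ? 1 : 0;   # Unicode rules apply
my $ascii_digit   = $digit =~ /^\d$/a ? 1 : 0;   # /a restricts \d to 0-9

my $tab_is_h = "\t" =~ /^\h$/ ? 1 : 0;           # a tab is horizontal space
my $tab_is_v = "\t" =~ /^\v$/ ? 1 : 0;           # but not vertical space

print "$unicode_digit $ascii_digit $tab_is_h $tab_is_v\n";   # prints "1 0 1 0"
```
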
=head3 Unicode classes

C<\pP> (where C<P> is a single letter) and C<\p{Property}> are used to
match a character that matches the given Unicode property; properties
include things like "letter", or "thai character". Capitalizing the
sequence to C<\PP> and C<\P{Property}> makes the sequence match a character
that doesn't match the given Unicode property. For more details, see
L<perlrecharclass/Backslash sequences> and
L<perlunicode/Unicode Character Properties>.

Mnemonic: I<p>roperty.

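A small sketch of the three spellings. The property names C<L> (letter) and
C<Thai> are used here as examples of the "letter" and "thai character"
properties mentioned above; the variable names are illustrative.

```perl
use strict;
use warnings;
use charnames ':full';

my $so_so = "\N{THAI CHARACTER SO SO}";

my $is_letter  = $so_so =~ /^\pL$/      ? 1 : 0;   # single-letter form
my $is_thai    = $so_so =~ /^\p{Thai}$/ ? 1 : 0;   # braced form
my $a_is_thai  = "A"    =~ /^\p{Thai}$/ ? 1 : 0;
my $a_not_thai = "A"    =~ /^\P{Thai}$/ ? 1 : 0;   # negated form

print "$is_letter $is_thai $a_is_thai $a_not_thai\n";   # prints "1 1 0 1"
```
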
=head2 Referencing

If capturing parentheses are used in a regular expression, we can refer
to the part of the source string that was matched, and match exactly the
same thing. There are three ways of making such a I<backreference>:
absolutely, relatively, and by name.

=for later add link to perlrecapture

=head3 Absolute referencing

Either C<\gI<N>> (starting in Perl 5.10.0), or C<\I<N>> (old-style) where I<N>
is a positive (unsigned) decimal number of any length is an absolute reference
to a capturing group.

I<N> refers to the Nth set of parentheses, so C<\gI<N>> refers to whatever has
been matched by that set of parentheses. Thus C<\g1> refers to the first
capture group in the regex.

The C<\gI<N>> form can be equivalently written as C<\g{I<N>}>
which avoids ambiguity when building a regex by concatenating shorter
strings. Otherwise if you had a regex C<qr/$a$b/>, and C<$a> contained
C<"\g1">, and C<$b> contained C<"37">, you would get C</\g137/> which is
probably not what you intended.

In the C<\I<N>> form, I<N> must not begin with a "0", and there must be at
least I<N> capturing groups, or else I<N> is considered an octal escape
(but something like C<\18> is the same as C<\0018>; that is, the octal escape
C<"\001"> followed by a literal digit C<"8">).

Mnemonic: I<g>roup.

=head4 Examples

 /(\w+) \g1/;    # Finds a duplicated word, (e.g. "cat cat").
 /(\w+) \1/;     # Same thing; written old-style.
 /(.)(.)\g2\g1/; # Match a four letter palindrome (e.g. "ABBA").

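The concatenation problem described above can be reproduced with a short
sketch. The snippet contents and variable names are hypothetical, and the
example assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

my $tail = '0';                       # a snippet that begins with a digit

my $bare   = '(x)\g1'   . $tail;      # reads as (x)\g10, that is, group 10
my $braced = '(x)\g{1}' . $tail;      # reads as (x)\g{1}0: group 1, then "0"

# The bare form refers to a group that does not exist, so compiling it
# fails; eval traps the error and leaves $bare_re undefined.
my $bare_re   = eval { qr/$bare/ };
my $bare_ok   = defined $bare_re ? 1 : 0;

my $braced_ok = "xx0" =~ /^$braced$/ ? 1 : 0;

print "$bare_ok $braced_ok\n";        # prints "0 1"
```
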
=head3 Relative referencing

C<\g-I<N>> (starting in Perl 5.10.0) is used for relative addressing. (It can
be written as C<\g{-I<N>}>.) It refers to the I<N>th group before the
C<\g{-I<N>}>.

The big advantage of this form is that it makes it much easier to write
patterns with references that can be interpolated in larger patterns,
even if the larger pattern also contains capture groups.

=head4 Examples

 /(A)        # Group 1
  (          # Group 2
    (B)      # Group 3
    \g{-1}   # Refers to group 3 (B)
    \g{-3}   # Refers to group 1 (A)
  )
 /x;         # Matches "ABBA".

 my $qr = qr /(.)(.)\g{-2}\g{-1}/;  # Matches 'abab', 'cdcd', etc.
 /$qr$qr/                           # Matches 'ababcdcd'.

=head3 Named referencing

C<\g{I<name>}> (starting in Perl 5.10.0) can be used to back refer to a
named capture group, dispensing completely with having to think about capture
buffer positions.

To be compatible with .Net regular expressions, C<\g{name}> may also be
written as C<\k{name}>, C<< \k<name> >> or C<\k'name'>.

To prevent any ambiguity, I<name> must not start with a digit nor contain a
hyphen.

=head4 Examples

 /(?<word>\w+) \g{word}/ # Finds duplicated word, (e.g. "cat cat")
 /(?<word>\w+) \k{word}/ # Same.
 /(?<word>\w+) \k<word>/ # Same.
 /(?<letter1>.)(?<letter2>.)\g{letter2}\g{letter1}/
                         # Match a four letter palindrome (e.g. "ABBA")

=head2 Assertions

Assertions are conditions that have to be true; they don't actually
match any part of the string. There are six assertions that are written as
backslash sequences.

=over 4

=item \A

C<\A> only matches at the beginning of the string. If the C</m> modifier
isn't used, then C</\A/> is equivalent to C</^/>. However, if the C</m>
modifier is used, then C</^/> matches after internal newlines, but the meaning
of C</\A/> isn't changed by the C</m> modifier. C<\A> matches at the beginning
of the string regardless of whether the C</m> modifier is used.

=item \z, \Z

C<\z> and C<\Z> match at the end of the string. If the C</m> modifier isn't
used, then C</\Z/> is equivalent to C</$/>; that is, it matches at the
end of the string, or just before the newline at the end of the string. If the
C</m> modifier is used, then C</$/> matches at internal newlines, but the
meaning of C</\Z/> isn't changed by the C</m> modifier. C<\Z> matches at
the end of the string (or just before a trailing newline) regardless of whether
the C</m> modifier is used.

C<\z> is just like C<\Z>, except that it does not match before a trailing
newline. C<\z> matches at the end of the string only, regardless of the
modifiers used, and not just before a newline. It is how to anchor the
match to the true end of the string under all conditions.

=item \G

C<\G> is usually used only in combination with the C</g> modifier. If the
C</g> modifier is used and the match is done in scalar context, Perl
remembers where in the source string the last match ended, and the next time,
it will start the match from where it ended the previous time.

C<\G> matches the point where the previous match on that string ended,
or the beginning of that string if there was no previous match.

=for later add link to perlremodifiers

Mnemonic: I<G>lobal.

=item \b{}, \b, \B{}, \B

C<\b{...}>, available starting in v5.22, matches a boundary (between two
characters, or before the first character of the string, or after the
final character of the string) based on the Unicode rules for the
boundary type specified inside the braces. The currently known boundary
types are given a few paragraphs below. C<\B{...}> matches at any place
between characters where C<\b{...}> of the same type doesn't match.

C<\b> when not immediately followed by a C<"{"> matches at any place
between a word character (something matched by C<\w>) and a non-word
character (C<\W>); C<\B> when not immediately followed by a C<"{"> matches
at any place between characters where C<\b> doesn't match.

C<\b>
and C<\B> assume there's a non-word character before the beginning and after
the end of the source string; so C<\b> will match at the beginning (or end)
of the source string if the source string begins (or ends) with a word
character. Otherwise, C<\B> will match.

Do not use something like C<\b=head\d\b> and expect it to match the
beginning of a line. It can't, because for there to be a boundary before
the non-word "=", there must be a word character immediately previous.
All plain C<\b> and C<\B> boundary determinations look for word
characters alone, not for
non-word characters nor for string ends. It may help to understand how
C<\b> and C<\B> work by equating them as follows:

    \b  really means    (?:(?<=\w)(?!\w)|(?<!\w)(?=\w))
    \B  really means    (?:(?<=\w)(?=\w)|(?<!\w)(?!\w))

In contrast, C<\b{...}> always matches at the beginning and end of the
line (and C<\B{...}> never does). The only boundary type currently
available is C<\b{gcb}>, the Unicode "Grapheme Cluster Boundary". (Actually
Perl always uses the improved "extended" grapheme cluster.) These are
explained below under C<\X>.
In fact, C<\X> is another way to get the same functionality. It is
equivalent to C</.+?\b{gcb}/>. Use whichever is most convenient for
your situation.

Mnemonic: I<b>oundary.

=back

=head4 Examples

  "cat"   =~ /\Acat/;     # Match.
  "cat"   =~ /cat\Z/;     # Match.
  "cat\n" =~ /cat\Z/;     # Match.
  "cat\n" =~ /cat\z/;     # No match.

  "cat"   =~ /\bcat\b/;   # Matches.
  "cats"  =~ /\bcat\b/;   # No match.
  "cat"   =~ /\bcat\B/;   # No match.
  "cats"  =~ /\bcat\B/;   # Match.

  while ("cat dog" =~ /(\w+)/g) {
      print $1;           # Prints 'catdog'
  }
  while ("cat dog" =~ /\G(\w+)/g) {
      print $1;           # Prints 'cat'
  }

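The interaction between the C</m> modifier and the string anchors can be
sketched as follows. The sample text and variable names are made up for the
illustration.

```perl
use strict;
use warnings;

my $text = "one\ntwo\n";

my $caret_m  = $text =~ /^two/m   ? 1 : 0;   # ^ matches after a newline under /m
my $big_a_m  = $text =~ /\Atwo/m  ? 1 : 0;   # \A still means start of string
my $dollar_m = $text =~ /one$/m   ? 1 : 0;   # $ matches before a newline under /m
my $big_z_m  = $text =~ /one\Z/m  ? 1 : 0;   # \Z still means end of string

my $big_z    = $text =~ /two\Z/   ? 1 : 0;   # \Z allows a trailing newline
my $small_z  = $text =~ /two\z/   ? 1 : 0;   # \z does not
my $true_end = $text =~ /two\n\z/ ? 1 : 0;   # \z after the final newline

print "$caret_m $big_a_m $dollar_m $big_z_m\n";   # prints "1 0 1 0"
print "$big_z $small_z $true_end\n";              # prints "1 0 1"
```
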
=head2 Misc

Here we document the backslash sequences that don't fall in one of the
categories above. These are:

=over 4

=item \C

(Deprecated.) C<\C> always matches a single octet, even if the source
string is encoded in UTF-8 format, and the character to be matched is a
multi-octet character. This is very dangerous, because it violates
the logical character abstraction and can cause UTF-8 sequences to become
malformed.

Use C<utf8::encode()> instead.

Mnemonic: oI<C>tet.

=item \K

This appeared in perl 5.10.0. Anything matched left of C<\K> is
not included in C<$&>, and will not be replaced if the pattern is
used in a substitution. This lets you write C<s/PAT1 \K PAT2/REPL/x>
instead of C<s/(PAT1) PAT2/${1}REPL/x> or C<s/(?<=PAT1) PAT2/REPL/x>.

Mnemonic: I<K>eep.

=item \N

This feature, available starting in v5.12, matches any character
that is B<not> a newline. It is a short-hand for writing C<[^\n]>, and is
identical to the C<.> metasymbol, except under the C</s> flag, which changes
the meaning of C<.>, but not C<\N>.

Note that C<\N{...}> can mean a
L<named or numbered character
|/Named or numbered characters and character sequences>.

Mnemonic: Complement of I<\n>.

=item \R
X<\R>

C<\R> matches a I<generic newline>; that is, anything considered a
linebreak sequence by Unicode. This includes all characters matched by
C<\v> (vertical whitespace), and the multi character sequence C<"\x0D\x0A">
(carriage return followed by a line feed, sometimes called the network
newline; it's the end of line sequence used in Microsoft text files opened
in binary mode). C<\R> is equivalent to C<< (?>\x0D\x0A|\v) >>. (The
reason it doesn't backtrack is that the sequence is considered
inseparable. That means that

 "\x0D\x0A" =~ /^\R\x0A$/   # No match

fails, because the C<\R> matches the entire string, and won't backtrack
to match just the C<"\x0D">.) Since
C<\R> can match a sequence of more than one character, it cannot be put
inside a bracketed character class; C</[\R]/> is an error; use C<\v>
instead. C<\R> was introduced in perl 5.10.0.

Note that this does not respect any locale that might be in effect; it
matches according to the platform's native character set.

Mnemonic: none really. C<\R> was picked because PCRE already uses C<\R>,
and more importantly because Unicode recommends such a regular expression
metacharacter, and suggests C<\R> as its notation.

=item \X
X<\X>

This matches a Unicode I<extended grapheme cluster>.

C<\X> matches quite well what normal (non-Unicode-programmer) usage
would consider a single character. As an example, consider a G with some sort
of diacritic mark, such as an arrow. There is no such single character in
Unicode, but one can be composed by using a G followed by a Unicode "COMBINING
UPWARDS ARROW BELOW", and would be displayed by Unicode-aware software as if it
were a single character.

The match is greedy and non-backtracking, so that the cluster is never
broken up into smaller components.

See also L<C<\b{gcb}>|/\b{}, \b, \B{}, \B>.

Mnemonic: eI<X>tended Unicode character.

=back

=head4 Examples

 $str =~ s/foo\Kbar/baz/g; # Change any 'bar' following a 'foo' to 'baz'
 $str =~ s/(.)\K\g1//g;    # Delete duplicated characters.

 "\n"   =~ /^\R$/;         # Match, \n   is a generic newline.
 "\r"   =~ /^\R$/;         # Match, \r   is a generic newline.
 "\r\n" =~ /^\R$/;         # Match, \r\n is a generic newline.

 "P\x{307}" =~ /^\X$/      # \X matches a P with a dot above.

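One more sketch, contrasting C<\N> with C<.> under the C</s> flag and showing
what C<\K> leaves untouched in a substitution. It assumes Perl 5.12 or later;
the sample strings and variable names are illustrative only.

```perl
use 5.012;       # \N needs Perl 5.12 or later
use warnings;

my $text = "a\nb";

my $dot_s = $text =~ /^a.b$/s  ? 1 : 0;   # under /s, "." matches the newline
my $big_n = $text =~ /^a\Nb$/s ? 1 : 0;   # \N never matches a newline

# \K keeps the "foo" to its left out of the replaced text.
(my $kept = "foobar foobaz") =~ s/foo\Kbar/BAR/;

print "$dot_s $big_n $kept\n";   # prints "1 0 fooBAR foobaz"
```
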
=cut