=head1 NAME
X<regular expression> X<regex> X<regexp>

perlre - Perl regular expressions

=head1 DESCRIPTION

This page describes the syntax of regular expressions in Perl.

If you haven't used regular expressions before, a quick-start
introduction is available in L<perlrequick>, and a longer tutorial
introduction is available in L<perlretut>.

For reference on how regular expressions are used in matching
operations, plus various examples of the same, see discussions of
C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/"Regexp Quote-Like
Operators">.

=head2 Modifiers

Matching operations can have various modifiers. Modifiers
that relate to the interpretation of the regular expression inside
are listed below. Modifiers that alter the way a regular expression
is used by Perl are detailed in L<perlop/"Regexp Quote-Like Operators"> and
L<perlop/"Gory details of parsing quoted constructs">.

=over 4

=item m
X</m> X<regex, multiline> X<regexp, multiline> X<regular expression, multiline>

Treat string as multiple lines. That is, change "^" and "$" from matching
the start or end of the string to matching the start or end of any
line anywhere within the string.

=item s
X</s> X<regex, single-line> X<regexp, single-line>
X<regular expression, single-line>

Treat string as single line. That is, change "." to match any character
whatsoever, even a newline, which normally it would not match.

Used together, as C</ms>, they let the "." match any character whatsoever,
while still allowing "^" and "$" to match, respectively, just after
and just before newlines within the string.

=item i
X</i> X<regex, case-insensitive> X<regexp, case-insensitive>
X<regular expression, case-insensitive>

Do case-insensitive pattern matching.

If C<use locale> is in effect, the case map is taken from the current
locale for code points less than 256, and from Unicode rules for larger
code points. However, matches that would cross the Unicode
rules/non-Unicode rules boundary (ords 255/256) will not succeed. See
L<perllocale>.

There are a number of Unicode characters that match multiple characters
under C</i>. For example, C<LATIN SMALL LIGATURE FI>
should match the sequence C<fi>. Perl is not
currently able to do this when the multiple characters are in the pattern and
are split between groupings, or when one or more are quantified. Thus

    "\N{LATIN SMALL LIGATURE FI}" =~ /fi/i;          # Matches
    "\N{LATIN SMALL LIGATURE FI}" =~ /[fi][fi]/i;    # Doesn't match!
    "\N{LATIN SMALL LIGATURE FI}" =~ /fi*/i;         # Doesn't match!

    # The below doesn't match, and it isn't clear what $1 and $2 would
    # be even if it did!!
    "\N{LATIN SMALL LIGATURE FI}" =~ /(f)(i)/i;      # Doesn't match!

Also, this matching doesn't fully conform to the current Unicode
recommendations, which ask that the matching be made upon the NFD
(Normalization Form Decomposed) of the text. However, Unicode is
in the process of reconsidering and revising its recommendations.

=item x
X</x>

Extend your pattern's legibility by permitting whitespace and comments.
Details in L</"/x">.

=item p
X</p> X<regex, preserve> X<regexp, preserve>

Preserve the string matched such that ${^PREMATCH}, ${^MATCH}, and
${^POSTMATCH} are available for use after matching.

=item g and c
X</g> X</c>

Global matching, and keep the Current position after failed matching.
Unlike i, m, s and x, these two flags affect the way the regex is used
rather than the regex itself. See
L<perlretut/"Using regular expressions in Perl"> for further explanation
of the g and c modifiers.

=item a, d, l and u
X</a> X</d> X</l> X</u>

These modifiers, new in 5.14, affect which character-set semantics
(Unicode, ASCII, etc.) are used, as described below in
L</Character set modifiers>.

=back

These are usually written as "the C</x> modifier", even though the delimiter
in question might not really be a slash. The modifiers C</imsxadlup>
may also be embedded within the regular expression itself using
the C<(?...)> construct; see L</Extended Patterns> below.

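As a small illustration of the C</m> and C</s> modifiers described in
the list above, the following sketch runs the same patterns with and
without each modifier. The sample string and variable names are made up
for the example.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# Without /m, "^" matches only at the very start of the string.
my @plain = $text =~ /^(\w+)/g;

# With /m, "^" also matches just after each embedded newline.
my @multi = $text =~ /^(\w+)/mg;

# Without /s, "." stops at a newline; with /s it can cross one.
my ($no_s)   = $text =~ /(first.*)/;
my ($with_s) = $text =~ /(first.*)/s;

print "without /m: @plain\n";
print "with /m:    @multi\n";
print "without /s: $no_s\n";
print "with /s spans ", length($with_s), " characters\n";
```
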
The C</x>, C</l>, C</u>, C</a> and C</d> modifiers need a little more
explanation.

=head3 /x

C</x> tells
the regular expression parser to ignore most whitespace that is neither
backslashed nor within a character class. You can use this to break up
your regular expression into (slightly) more readable parts. The C<#>
character is also treated as a metacharacter introducing a comment,
just as in ordinary Perl code. This also means that if you want real
whitespace or C<#> characters in the pattern (outside a character
class, where they are unaffected by C</x>), then you'll either have to
escape them (using backslashes or C<\Q...\E>) or encode them using octal,
hex, or C<\N{}> escapes. Taken together, these features go a long way towards
making Perl's regular expressions more readable. Note that you have to
be careful not to include the pattern delimiter in the comment--perl has
no way of knowing you did not intend to close the pattern early. See
the C-comment deletion code in L<perlop>. Also note that anything inside
a C<\Q...\E> stays unaffected by C</x>. And note that C</x> doesn't affect
space interpretation within a single multi-character construct. For
example in C<\x{...}>, regardless of the C</x> modifier, there can be no
spaces. Same for a L<quantifier|/Quantifiers> such as C<{3}> or
C<{5,}>. Similarly, C<(?:...)> can't have a space between the C<?> and C<:>,
but can between the C<(> and C<?>. Within any delimiters for such a
construct, allowed spaces are not affected by C</x>, and depend on the
construct. For example, C<\x{...}> can't have spaces because hexadecimal
numbers don't have spaces in them. But, Unicode properties can have spaces, so
in C<\p{...}> there can be spaces that follow the Unicode rules, for which see
L<perluniprops/Properties accessible through \p{} and \P{}>.
X</x>

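The following sketch shows one way a pattern might be laid out under
C</x>. The date format and the variable names are assumptions chosen
for illustration; note that none of the comments contains the pattern
delimiter, and that the literal space is written inside a character
class so that C</x> leaves it alone.

```perl
use strict;
use warnings;

my $date_re = qr/
    ^ \s*
    (\d{4})     # year
    -           # literal hyphen
    (\d{2})     # month
    -
    (\d{2})     # day
    \s* $
/x;

my ($y, $m, $d) = "2011-05-14" =~ $date_re;

# Whitespace inside a bracketed character class is still significant.
my $two_words = qr/ (\w+) [ ] (\w+) /x;
my ($first, $second) = "hello world" =~ $two_words;

print "year=$y month=$m day=$d\n";
print "first=$first second=$second\n";
```
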
=head3 Character set modifiers

C</d>, C</u>, C</a>, and C</l>, available starting in 5.14, are called
the character set modifiers; they affect the character set semantics
used for the regular expression.

At any given time, exactly one of these modifiers is in effect. Once
compiled, the behavior doesn't change regardless of what rules are in
effect when the regular expression is executed. And if a regular
expression is interpolated into a larger one, the original's rules
continue to apply to it, and only it.

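The interpolation rule can be sketched as follows, assuming Perl 5.14
or later. A fragment compiled with C</a> is interpolated into a pattern
compiled with C</u>; the variable names and test strings are
illustrative only.

```perl
use strict;
use warnings;
use 5.014;    # the /a and /u modifiers need Perl 5.14 or later

my $ascii_digit = qr/\d/a;                  # compiled with /a
my $combined    = qr/^$ascii_digit\d$/u;    # outer pattern uses /u

my $arabic_four = "\x{0664}";               # ARABIC-INDIC DIGIT FOUR

# The first digit must be ASCII (the inner /a still applies);
# the second may be any Unicode decimal digit (the outer /u).
my $ascii_first  = "4$arabic_four"    =~ $combined ? 1 : 0;
my $arabic_first = "${arabic_four}4"  =~ $combined ? 1 : 0;

print "ASCII digit first:  $ascii_first\n";
print "Arabic digit first: $arabic_first\n";
```
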
=head4 /l

means to use the current locale's rules (see L<perllocale>) when pattern
matching. For example, C<\w> will match the "word" characters of that
locale, and C</i> case-insensitive matching will match according to
the locale's case folding rules. The locale used will be the one in
effect at the time of execution of the pattern match. This may not be
the same as the compilation-time locale, and can differ from one match
to another if there is an intervening call of the
L<setlocale() function|perllocale/The setlocale function>.

Perl only supports single-byte locales. This means that code points
above 255 are treated as Unicode no matter what locale is in effect.
Under Unicode rules, there are a few case-insensitive matches that cross
the 255/256 boundary. These are disallowed under C</l>. For example,
0xFF does not caselessly match the character at 0x178, C<LATIN CAPITAL
LETTER Y WITH DIAERESIS>, because 0xFF may not be C<LATIN SMALL LETTER Y
WITH DIAERESIS> in the current locale, and Perl has no way of knowing if
that character even exists in the locale, much less what code point it
is.

This modifier may be specified to be the default by C<use locale>, but
see L</Which character set modifier is in effect?>.
X</l>

=head4 /u

means to use Unicode rules when pattern matching. On ASCII platforms,
this means that the code points between 128 and 255 take on their
Latin-1 (ISO-8859-1) meanings (which are the same as Unicode's), whereas
in strict ASCII their meanings are undefined. Thus the platform
effectively becomes a Unicode platform, hence, for example, C<\w> will
match any of the more than 100_000 word characters in Unicode.

Unlike most locales, which are specific to a language and country pair,
Unicode classifies all the characters that are letters I<somewhere> as
C<\w>. For example, your locale might not think that C<LATIN SMALL
LETTER ETH> is a letter (unless you happen to speak Icelandic), but
Unicode does. Similarly, all the characters that are decimal digits
somewhere in the world will match C<\d>; this is hundreds, not 10,
possible matches. And some of those digits look like some of the 10
ASCII digits, but mean a different number, so a human could easily think
a number is a different quantity than it really is. For example,
C<BENGALI DIGIT FOUR> (U+09EA) looks very much like an
C<ASCII DIGIT EIGHT> (U+0038). And C<\d+> may match strings of digits
that are a mixture from different writing systems, creating a security
issue. L<Unicode::UCD/num()> can be used to sort this out.

Also, case-insensitive matching works on the full set of Unicode
characters. The C<KELVIN SIGN>, for example, matches the letters "k" and
"K"; and C<LATIN SMALL LIGATURE FF> matches the sequence "ff", which,
if you're not prepared, might make it look like a hexadecimal constant,
presenting another potential security issue. See
L<http://unicode.org/reports/tr36> for a detailed discussion of Unicode
security issues.

On EBCDIC platforms, which already are equivalent to Latin-1 (at least
the ones that Perl handles), this modifier changes behavior only when
the C</i> modifier is also specified, and it turns out it affects only
two characters, giving them full Unicode semantics: the C<MICRO SIGN>
will match the Greek capital and small letters C<MU>, otherwise not; and
the C<LATIN CAPITAL LETTER SHARP S> will match any of C<SS>, C<Ss>,
C<sS>, and C<ss>, otherwise not.

This modifier may be specified to be the default by C<use feature
'unicode_strings'>, but see
L</Which character set modifier is in effect?>.
X</u>

=head4 /a

is the same as C</u>, except that C<\d>, C<\s>, C<\w>, and the
Posix character classes are restricted to matching in the ASCII range
only. That is, with this modifier, C<\d> always means precisely the
digits C<"0"> to C<"9">; C<\s> means the five characters C<[ \f\n\r\t]>;
C<\w> means the 63 characters C<[A-Za-z0-9_]>; and likewise, all the
Posix classes such as C<[[:print:]]> match only the appropriate
ASCII-range characters.

This modifier is useful for people who only incidentally use Unicode.
With it, you can write C<\d> with confidence that it will only match
ASCII characters, and should the need arise to match beyond ASCII, you
can use C<\p{Digit}>, or C<\p{Word}> for C<\w>. There are similar
C<\p{...}> constructs that can match white space and Posix classes
beyond ASCII. See L<perlrecharclass>.

As you would expect, this modifier causes, for example, C<\D> to mean
the same thing as C<[^0-9]>; in fact, all non-ASCII characters match
C<\D>, C<\S>, and C<\W>. C<\b> still means to match at the boundary
between C<\w> and C<\W>, using the C</a> definitions of them (similarly
for C<\B>).

Otherwise, C</a> behaves like the C</u> modifier, in that
case-insensitive matching uses Unicode semantics; for example, "k" will
match the Unicode C<\N{KELVIN SIGN}> under C</i> matching, and code
points in the Latin-1 range, above ASCII, will have Unicode rules when it
comes to case-insensitive matching.

To forbid ASCII/non-ASCII matches (like "k" with C<\N{KELVIN SIGN}>),
specify the "a" twice, for example C</aai> or C</aia>.

To reiterate, this modifier provides protection for applications that
don't wish to be exposed to all of Unicode. Specifying it twice
gives added protection.

This modifier may be specified to be the default by C<use re '/a'>
or C<use re '/aa'>, but see
L</Which character set modifier is in effect?>.
X</a>
X</aa>

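A minimal sketch of the difference between C</u>, C</a> and C</aa>,
assuming Perl 5.14 or later. The two code points are given by number so
that no C<use charnames> is needed; the variable names are illustrative.

```perl
use strict;
use warnings;
use 5.014;    # character set modifiers need Perl 5.14 or later

my $arabic_four = "\x{0664}";    # ARABIC-INDIC DIGIT FOUR
my $kelvin      = "\x{212A}";    # KELVIN SIGN

# \d under /u accepts any Unicode decimal digit; under /a only 0-9.
my $u_digit = $arabic_four =~ /^\d$/u ? 1 : 0;
my $a_digit = $arabic_four =~ /^\d$/a ? 1 : 0;

# Case folding under /a still follows Unicode rules, so "k" matches
# KELVIN SIGN; doubling the "a" forbids ASCII/non-ASCII folds.
my $a_fold  = $kelvin =~ /^k$/ai  ? 1 : 0;
my $aa_fold = $kelvin =~ /^k$/aai ? 1 : 0;

print "\\d /u: $u_digit, \\d /a: $a_digit\n";
print "k /ai: $a_fold, k /aai: $aa_fold\n";
```
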
=head4 /d

This modifier means to use the "Default" native rules of the platform
except when there is cause to use Unicode rules instead, as follows:

=over 4

=item 1

the target string is encoded in UTF-8; or

=item 2

the pattern is encoded in UTF-8; or

=item 3

the pattern explicitly mentions a code point that is above 255 (say by
C<\x{100}>); or

=item 4

the pattern uses a Unicode name (C<\N{...}>); or

=item 5

the pattern uses a Unicode property (C<\p{...}>)

=back

Another mnemonic for this modifier is "Depends", as the rules actually
used depend on various things, and as a result you can get unexpected
results. See L<perlunicode/The "Unicode Bug">.

On ASCII platforms, the native rules are ASCII, and on EBCDIC platforms
(at least the ones that Perl handles), they are Latin-1.

Here are some examples of how that works on an ASCII platform:

    $str =  "\xDF";      # $str is not in UTF-8 format.
    $str =~ /^\w/;       # No match, as $str isn't in UTF-8 format.
    $str .= "\x{0e0b}";  # Now $str is in UTF-8 format.
    $str =~ /^\w/;       # Match! $str is now in UTF-8 format.
    chop $str;
    $str =~ /^\w/;       # Still a match! $str remains in UTF-8 format.

=head4 Which character set modifier is in effect?

Which of these modifiers is in effect at any given point in a regular
expression depends on a fairly complex set of interactions. As
explained below in L</Extended Patterns>, it is possible to explicitly
specify modifiers that apply only to portions of a regular expression.
The innermost always has priority over any outer ones, and one applying
to the whole expression has priority over any default settings that are
described in the next few paragraphs.

The C<L<use re 'E<sol>foo'|re/"'E<sol>flags' mode">> pragma can be used to set
default modifiers (including these) for regular expressions compiled
within its scope. This pragma has precedence over the other pragmas
that change the defaults, as listed below.

Otherwise, C<L<use locale|perllocale>> sets the default modifier to C</l>;
and C<L<use feature 'unicode_strings'|feature>> or
C<L<use 5.012|perlfunc/use VERSION>> (or higher) set the default to
C</u> when not in the same scope as either C<L<use locale|perllocale>>
or C<L<use bytes|bytes>>.

If none of the above apply, for backwards compatibility reasons, the
C</d> modifier is the one in effect by default. As this can lead to
unexpected results, it is best to specify which other rule set should be
used.

=head4 Character set modifier behavior prior to Perl 5.14

Prior to 5.14, there were no explicit modifiers, but C</l> was implied
for regexes compiled within the scope of C<use locale>, and C</d> was
implied otherwise. However, interpolating a regex into a larger regex
would ignore the original compilation in favor of whatever was in effect
at the time of the second compilation. There were a number of
inconsistencies (bugs) with the C</d> modifier, where Unicode rules
would be used when inappropriate, and vice versa. C<\p{}> did not imply
Unicode rules, and neither did all occurrences of C<\N{}>, until 5.12.

=head2 Regular Expressions

=head3 Metacharacters

The patterns used in Perl pattern matching evolved from those supplied in
the Version 8 regex routines. (The routines are derived
(distantly) from Henry Spencer's freely redistributable reimplementation
of the V8 routines.) See L</Version 8 Regular Expressions> for
details.

In particular the following metacharacters have their standard I<egrep>-ish
meanings:
X<metacharacter>
X<\> X<^> X<.> X<$> X<|> X<(> X<()> X<[> X<[]>

    \        Quote the next metacharacter
    ^        Match the beginning of the line
    .        Match any character (except newline)
    $        Match the end of the line (or before newline at the end)
    |        Alternation
    ()       Grouping
    []       Bracketed Character class

By default, the "^" character is guaranteed to match only the
beginning of the string, the "$" character only the end (or before the
newline at the end), and Perl does certain optimizations with the
assumption that the string contains only one line. Embedded newlines
will not be matched by "^" or "$". You may, however, wish to treat a
string as a multi-line buffer, such that the "^" will match after any
newline within the string (except if the newline is the last character in
the string), and "$" will match before any newline. At the
cost of a little more overhead, you can do this by using the C</m> modifier
on the pattern match operator. (Older programs did this by setting C<$*>,
but this option was removed in perl 5.10.)
X<^> X<$> X</m>

To simplify multi-line substitutions, the "." character never matches a
newline unless you use the C</s> modifier, which in effect tells Perl to pretend
the string is a single line--even if it isn't.
X<.> X</s>

=head3 Quantifiers

The following standard quantifiers are recognized:
X<metacharacter> X<quantifier> X<*> X<+> X<?> X<{n}> X<{n,}> X<{n,m}>

    *           Match 0 or more times
    +           Match 1 or more times
    ?           Match 1 or 0 times
    {n}         Match exactly n times
    {n,}        Match at least n times
    {n,m}       Match at least n but not more than m times

(If a curly bracket occurs in any other context and does not form part of
a backslashed sequence like C<\x{...}>, it is treated
as a regular character. In particular, the lower bound
is not optional.) The "*" quantifier is equivalent to C<{0,}>, the "+"
quantifier to C<{1,}>, and the "?" quantifier to C<{0,1}>. n and m are limited
to non-negative integral values less than a preset limit defined when perl is built.
This is usually 32766 on the most common platforms. The actual limit can
be seen in the error message generated by code such as this:

    $_ **= $_ , / {$_} / for 2 .. 42;

By default, a quantified subpattern is "greedy", that is, it will match as
many times as possible (given a particular starting location) while still
allowing the rest of the pattern to match. If you want it to match the
minimum number of times possible, follow the quantifier with a "?". Note
that the meanings don't change, just the "greediness":
X<metacharacter> X<greedy> X<greediness>
X<?> X<*?> X<+?> X<??> X<{n}?> X<{n,}?> X<{n,m}?>

    *?        Match 0 or more times, not greedily
    +?        Match 1 or more times, not greedily
    ??        Match 0 or 1 time, not greedily
    {n}?      Match exactly n times, not greedily (redundant)
    {n,}?     Match at least n times, not greedily
    {n,m}?    Match at least n but not more than m times, not greedily

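A short sketch of the difference between the greedy and non-greedy
forms, using a made-up sample string. Both patterns start matching at
the first "<"; only the amount consumed by the quantified part differs.

```perl
use strict;
use warnings;

my $html = '<b>bold</b> and <i>italic</i>';

# Greedy: ".+" takes as much as it can, then gives back just enough
# for the final ">" to match, so it runs to the last ">" in the string.
my ($greedy) = $html =~ /<(.+)>/;

# Non-greedy: ".+?" takes as little as it can, stopping at the first ">".
my ($lazy) = $html =~ /<(.+?)>/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```
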
By default, when a quantified subpattern does not allow the rest of the
overall pattern to match, Perl will backtrack. However, this behaviour is
sometimes undesirable. Thus Perl provides the "possessive" quantifier form
as well.

    *+     Match 0 or more times and give nothing back
    ++     Match 1 or more times and give nothing back
    ?+     Match 0 or 1 time and give nothing back
    {n}+   Match exactly n times and give nothing back (redundant)
    {n,}+  Match at least n times and give nothing back
    {n,m}+ Match at least n but not more than m times and give nothing back

For instance,

    'aaaa' =~ /a++a/

will never match, as the C<a++> will gobble up all the C<a>'s in the
string and won't leave any for the remaining part of the pattern. This
feature can be extremely useful to give perl hints about where it
shouldn't backtrack. For instance, the typical "match a double-quoted
string" problem can be most efficiently performed when written as:

    /"(?:[^"\\]++|\\.)*+"/

as we know that if the final quote does not match, backtracking will not
help. See the independent subexpression
L</C<< (?>pattern) >>> for more details;
possessive quantifiers are just syntactic sugar for that construct. For
instance the above example could also be written as follows:

    /"(?>(?:(?>[^"\\]+)|\\.)*)"/

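The two patterns above can be exercised as in the following sketch,
which assumes Perl 5.10 or later for the possessive syntax. The test
strings and variable names are invented for the example.

```perl
use strict;
use warnings;

# The possessive a++ keeps every "a", so the trailing "a" cannot match;
# the ordinary greedy a+ backtracks by one character and succeeds.
my $possessive = 'aaaa' =~ /a++a/ ? 1 : 0;
my $greedy     = 'aaaa' =~ /a+a/  ? 1 : 0;

# The double-quoted string matcher from the text.
my $quoted = qr/"(?:[^"\\]++|\\.)*+"/;

# In a single-quoted Perl string, \" stays as backslash plus quote.
my ($found) = 'say "hi \"there\"" now' =~ /($quoted)/;

# With no closing quote the match fails without futile backtracking.
my $unterminated = '"no closing quote' =~ $quoted ? 1 : 0;

print "possessive=$possessive greedy=$greedy\n";
print "found: $found\n";
print "unterminated=$unterminated\n";
```
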
=head3 Escape sequences

Because patterns are processed as double-quoted strings, the following
also work:

    \t          tab                   (HT, TAB)
    \n          newline               (LF, NL)
    \r          return                (CR)
    \f          form feed             (FF)
    \a          alarm (bell)          (BEL)
    \e          escape (think troff)  (ESC)
    \cK         control char          (example: VT)
    \x{}, \x00  character whose ordinal is the given hexadecimal number
    \N{name}    named Unicode character or character sequence
    \N{U+263D}  Unicode character     (example: FIRST QUARTER MOON)
    \o{}, \000  character whose ordinal is the given octal number
    \l          lowercase next char (think vi)
    \u          uppercase next char (think vi)
    \L          lowercase till \E (think vi)
    \U          uppercase till \E (think vi)
    \Q          quote (disable) pattern metacharacters till \E
    \E          end either case modification or quoted section, think vi

Details are in L<perlop/Quote and Quote-like Operators>.

=head3 Character Classes and other Special Escapes

In addition, Perl defines the following:
X<\g> X<\k> X<\K> X<backreference>

 Sequence   Note    Description
  [...]     [1]  Match a character according to the rules of the
                   bracketed character class defined by the "...".
                   Example: [a-z] matches "a" or "b" or "c" ... or "z"
  [[:...:]] [2]  Match a character according to the rules of the POSIX
                   character class "..." within the outer bracketed
                   character class. Example: [[:upper:]] matches any
                   uppercase character.
  \w        [3]  Match a "word" character (alphanumeric plus "_", plus
                   other connector punctuation chars plus Unicode
                   marks)
  \W        [3]  Match a non-"word" character
  \s        [3]  Match a whitespace character
  \S        [3]  Match a non-whitespace character
  \d        [3]  Match a decimal digit character
  \D        [3]  Match a non-digit character
  \pP       [3]  Match P, named property. Use \p{Prop} for longer names
  \PP       [3]  Match non-P
  \X        [4]  Match Unicode "eXtended grapheme cluster"
  \C             Match a single C-language char (octet) even if that is
                   part of a larger UTF-8 character. Thus it breaks up
                   characters into their UTF-8 bytes, so you may end up
                   with malformed pieces of UTF-8. Unsupported in
                   lookbehind.
  \1        [5]  Backreference to a specific capture group or buffer.
                   '1' may actually be any positive integer.
  \g1       [5]  Backreference to a specific or previous group,
  \g{-1}    [5]  The number may be negative indicating a relative
                   previous group and may optionally be wrapped in
                   curly brackets for safer parsing.
  \g{name}  [5]  Named backreference
  \k<name>  [5]  Named backreference
  \K        [6]  Keep the stuff left of the \K, don't include it in $&
  \N        [7]  Any character but \n (experimental). Not affected by
                   /s modifier
  \v        [3]  Vertical whitespace
  \V        [3]  Not vertical whitespace
  \h        [3]  Horizontal whitespace
  \H        [3]  Not horizontal whitespace
  \R        [4]  Linebreak

=over 4

=item [1]

See L<perlrecharclass/Bracketed Character Classes> for details.

=item [2]

See L<perlrecharclass/POSIX Character Classes> for details.

=item [3]

See L<perlrecharclass/Backslash sequences> for details.

=item [4]

See L<perlrebackslash/Misc> for details.

=item [5]

See L</Capture groups> below for details.

=item [6]

See L</Extended Patterns> below for details.

=item [7]

Note that C<\N> has two meanings. When of the form C<\N{NAME}>, it matches the
character or character sequence whose name is C<NAME>; and similarly
when of the form C<\N{U+I<hex>}>, it matches the character whose Unicode
code point is I<hex>. Otherwise it matches any character but C<\n>.

=back

=head3 Assertions

Perl defines the following zero-width assertions:
X<zero-width assertion> X<assertion> X<regex, zero-width assertion>
X<regexp, zero-width assertion>
X<regular expression, zero-width assertion>
X<\b> X<\B> X<\A> X<\Z> X<\z> X<\G>

    \b  Match a word boundary
    \B  Match except at a word boundary
    \A  Match only at beginning of string
    \Z  Match only at end of string, or before newline at the end
    \z  Match only at end of string
    \G  Match only at pos() (e.g. at the end-of-match position
        of prior m//g)

A word boundary (C<\b>) is a spot between two characters
that has a C<\w> on one side of it and a C<\W> on the other side
of it (in either order), counting the imaginary characters off the
beginning and end of the string as matching a C<\W>. (Within
character classes C<\b> represents backspace rather than a word
boundary, just as it normally does in any double-quoted string.)
The C<\A> and C<\Z> are just like "^" and "$", except that they
won't match multiple times when the C</m> modifier is used, while
"^" and "$" will match at every internal line boundary. To match
the actual end of the string and not ignore an optional trailing
newline, use C<\z>.
X<\b> X<\A> X<\Z> X<\z> X</m>

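The differences between these assertions can be seen in a small sketch
such as the following; the strings and variable names are arbitrary.

```perl
use strict;
use warnings;

my $text = "foo\nbar\n";

# \Z tolerates one trailing newline; \z insists on the true end.
my $Z_match = $text =~ /bar\Z/ ? 1 : 0;
my $z_match = $text =~ /bar\z/ ? 1 : 0;

# Under /m, "^" matches after the embedded newline, but \A does not.
my $A_multi = $text =~ /\Abar/m ? 1 : 0;
my $caret_m = $text =~ /^bar/m  ? 1 : 0;

# \b needs a \w on one side and a \W (or a string edge) on the other,
# so the "cat" buried inside "concatenate" is skipped.
my @cats = "cat concatenate cat's" =~ /\bcat\b/g;

print "\\Z=$Z_match \\z=$z_match \\A/m=$A_multi ^/m=$caret_m\n";
print "whole-word cats: ", scalar(@cats), "\n";
```
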
The C<\G> assertion can be used to chain global matches (using
C<m//g>), as described in L<perlop/"Regexp Quote-Like Operators">.
It is also useful when writing C<lex>-like scanners, when you have
several patterns that you want to match against consecutive substrings
of your string; see the previous reference. The actual location
where C<\G> will match can also be influenced by using C<pos()> as
an lvalue: see L<perlfunc/pos>. Note that the rule for zero-length
matches (see L</"Repeated Patterns Matching a Zero-length Substring">)
is modified somewhat, in that contents to the left of C<\G> are
not counted when determining the length of the match. Thus the following
will not match forever:
X<\G>

    my $string = 'ABC';
    pos($string) = 1;
    while ($string =~ /(.\G)/g) {
        print $1;
    }

It will print 'A' and then terminate, as it considers the match to
be zero-width, and thus will not match at the same position twice in a
row.

It is worth noting that C<\G> improperly used can result in an infinite
loop. Take care when using patterns that include C<\G> in an alternation.

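A C<lex>-like scanner of the kind mentioned above might be sketched as
follows. The token classes and the input string are assumptions made up
for the example; the C</c> modifier keeps C<pos()> in place when one
alternative fails, so that the next one is tried at the same offset.

```perl
use strict;
use warnings;

my $input = "x = 42 + y";
my @tokens;

pos($input) = 0;
while (pos($input) < length $input) {
    if    ($input =~ /\G\s+/gc)            { next }
    elsif ($input =~ /\G([A-Za-z_]\w*)/gc) { push @tokens, "IDENT($1)" }
    elsif ($input =~ /\G(\d+)/gc)          { push @tokens, "NUM($1)" }
    elsif ($input =~ /\G([=+])/gc)         { push @tokens, "OP($1)" }
    else {
        die "Unexpected character at offset " . pos($input) . "\n";
    }
}

print "@tokens\n";
```

Each branch anchors its pattern with C<\G>, so a token can only be
recognized exactly where the previous one ended.
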
=head3 Capture groups

The bracketing construct C<( ... )> creates capture groups (also referred to as
capture buffers). To refer to the current contents of a group later on, within
the same pattern, use C<\g1> (or C<\g{1}>) for the first, C<\g2> (or C<\g{2}>)
for the second, and so on.
This is called a I<backreference>.
X<regex, capture buffer> X<regexp, capture buffer>
X<regex, capture group> X<regexp, capture group>
X<regular expression, capture buffer> X<backreference>
X<regular expression, capture group> X<backreference>
X<\g{1}> X<\g{-1}> X<\g{name}> X<relative backreference> X<named backreference>
X<named capture buffer> X<regular expression, named capture buffer>
X<named capture group> X<regular expression, named capture group>
X<%+> X<$+{name}> X<< \k<name> >>
There is no limit to the number of captured substrings that you may use.
Groups are numbered with the leftmost open parenthesis being number 1, etc. If
a group did not match, the associated backreference won't match either. (This
can happen if the group is optional, or in a different branch of an
alternation.)
You can omit the C<"g">, and write C<"\1">, etc, but there are some issues with
this form, described below.

You can also refer to capture groups relatively, by using a negative number, so
that C<\g-1> and C<\g{-1}> both refer to the immediately preceding capture
group, and C<\g-2> and C<\g{-2}> both refer to the group before it. For
example:

    /
     (Y)            # group 1
     (              # group 2
        (X)         # group 3
        \g{-1}      # backref to group 3
        \g{-3}      # backref to group 1
     )
    /x

would match the same as C</(Y) ( (X) \g3 \g1 )/x>. This allows you to
interpolate regexes into larger regexes and not have to worry about the
capture groups being renumbered.

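The renumbering problem can be sketched like this, assuming Perl 5.10
or later. Two fragments that look for a doubled character are
interpolated after another capture group; the names and the sample line
are illustrative.

```perl
use strict;
use warnings;

my $doubled_rel = qr/(\w)\g{-1}/;    # relative: "the group just before me"
my $doubled_abs = qr/(\w)\g{1}/;     # absolute: always group 1

my $line = '12:balloon';

# Once interpolated, (\w) becomes group 2.  The relative form still
# points at it; the absolute form now points at (\d+) instead.
my ($num, $char) = $line =~ /(\d+):.*?$doubled_rel/;
my $abs_matched  = $line =~ /(\d+):.*?$doubled_abs/ ? 1 : 0;

print "relative: num=$num doubled char=$char\n";
print "absolute matched: $abs_matched\n";
```
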
666You can dispense with numbers altogether and create named capture groups.
667The notation is C<(?E<lt>I<name>E<gt>...)> to declare and C<\g{I<name>}> to
668reference. (To be compatible with .Net regular expressions, C<\g{I<name>}> may
669also be written as C<\k{I<name>}>, C<\kE<lt>I<name>E<gt>> or C<\k'I<name>'>.)
670I<name> must not begin with a number, nor contain hyphens.
671When different groups within the same pattern have the same name, any reference
672to that name assumes the leftmost defined group. Named groups count in
673absolute and relative numbering, and so can also be referred to by those
674numbers.
675(It's possible to do things with named capture groups that would otherwise
676require C<(??{})>.)
677
678Capture group contents are dynamically scoped and available to you outside the
679pattern until the end of the enclosing block or until the next successful
680match, whichever comes first. (See L<perlsyn/"Compound Statements">.)
681You can refer to them by absolute number (using C<"$1"> instead of C<"\g1">,
682etc); or by name via the C<%+> hash, using C<"$+{I<name>}">.
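
The following sketch illustrates both ways of reading a capture after a
match.  The date string and the group names C<year>, C<month>, C<day> and
C<ch> are made up for the example:

```perl
use strict;
use warnings;

my $date = '2011-05-14';   # sample input, chosen for illustration
my %parts;

if ($date =~ /^(?<year>\d{4})-(?<month>\d\d)-(?<day>\d\d)\z/) {
    # Named groups are still numbered, so $1 and $+{year} agree.
    %parts = (
        year  => $+{year},
        month => $+{month},
        day   => $+{day},
        first => $1,
    );
}

# \k<ch> refers back to the named group inside the pattern itself.
my $doubled = 'hello' =~ /(?<ch>.)\k<ch>/ ? $+{ch} : undef;   # 'l'
```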

Braces are required in referring to named capture groups, but are optional for
absolute or relative numbered ones.  Braces are safer when creating a regex by
concatenating smaller strings.  For example if you have C<qr/$a$b/>, and C<$a>
contained C<"\g1">, and C<$b> contained C<"37">, you would get C</\g137/> which
is probably not what you intended.

The C<\g> and C<\k> notations were introduced in Perl 5.10.0.  Prior to that
there were neither named nor relative numbered capture groups.  Absolute
numbered groups were referred to using C<\1>, C<\2>, etc., and this notation
is still accepted (and likely always will be).  But it leads to some
ambiguities if there are more than 9 capture groups, as C<\10> could mean
either the tenth capture group, or the character whose ordinal in octal is
010 (a backspace in ASCII).  Perl resolves this ambiguity by interpreting
C<\10> as a backreference only if at least 10 left parentheses have opened
before it.  Likewise C<\11> is a backreference only if at least 11 left
parentheses have opened before it.  And so on.  C<\1> through C<\9> are
always interpreted as backreferences.
There are several examples below that illustrate these perils.  You can avoid
the ambiguity by always using C<\g{}> or C<\g> if you mean capturing groups;
and for octal constants always using C<\o{}>, or for C<\077> and below, using 3
digits padded with leading zeros, since a leading zero implies an octal
constant.

The C<\I<digit>> notation also works in certain circumstances outside
the pattern.  See L</Warning on \1 Instead of $1> below for details.

Examples:

    s/^([^ ]*) *([^ ]*)/$2 $1/;     # swap first two words

    /(.)\g1/                        # find first doubled char
         and print "'$1' is the first doubled character\n";

    /(?<char>.)\k<char>/            # ... a different way
         and print "'$+{char}' is the first doubled character\n";

    /(?'char'.)\g1/                 # ... mix and match
         and print "'$1' is the first doubled character\n";

    if (/Time: (..):(..):(..)/) {   # parse out values
        $hours = $1;
        $minutes = $2;
        $seconds = $3;
    }

    /(.)(.)(.)(.)(.)(.)(.)(.)(.)\g10/   # \g10 is a backreference
    /(.)(.)(.)(.)(.)(.)(.)(.)(.)\10/    # \10 is octal
    /((.)(.)(.)(.)(.)(.)(.)(.)(.))\10/  # \10 is a backreference
    /((.)(.)(.)(.)(.)(.)(.)(.)(.))\010/ # \010 is octal

    $a = '(.)\1';        # Creates problems when concatenated.
    $b = '(.)\g{1}';     # Avoids the problems.
    "aa" =~ /${a}/;      # True
    "aa" =~ /${b}/;      # True
    "aa0" =~ /${a}0/;    # False!
    "aa0" =~ /${b}0/;    # True
    "aa\x08" =~ /${a}0/;  # True!
    "aa\x08" =~ /${b}0/;  # False

Several special variables also refer back to portions of the previous
match.  C<$+> returns whatever the last bracket match matched.
C<$&> returns the entire matched string.  (At one point C<$0> did
also, but now it returns the name of the program.)  C<$`> returns
everything before the matched string.  C<$'> returns everything
after the matched string. And C<$^N> contains whatever was matched by
the most-recently closed group (submatch). C<$^N> can be used in
extended patterns (see below), for example to assign a submatch to a
variable.
X<$+> X<$^N> X<$&> X<$`> X<$'>

These special variables, like the C<%+> hash and the numbered match variables
(C<$1>, C<$2>, C<$3>, etc.) are dynamically scoped
until the end of the enclosing block or until the next successful
match, whichever comes first.  (See L<perlsyn/"Compound Statements">.)
X<$+> X<$^N> X<$&> X<$`> X<$'>
X<$1> X<$2> X<$3> X<$4> X<$5> X<$6> X<$7> X<$8> X<$9>

B<NOTE>: Failed matches in Perl do not reset the match variables,
which makes it easier to write code that tests for a series of more
specific cases and remembers the best match.

B<WARNING>: Once Perl sees that you need one of C<$&>, C<$`>, or
C<$'> anywhere in the program, it has to provide them for every
pattern match.  This may substantially slow your program.  Perl
uses the same mechanism to produce C<$1>, C<$2>, etc, so you also pay a
price for each pattern that contains capturing parentheses.  (To
avoid this cost while retaining the grouping behaviour, use the
extended regular expression C<(?: ... )> instead.)  But if you never
use C<$&>, C<$`> or C<$'>, then patterns I<without> capturing
parentheses will not be penalized.  So avoid C<$&>, C<$'>, and C<$`>
if you can, but if you can't (and some algorithms really appreciate
them), once you've used them once, use them at will, because you've
already paid the price.  As of 5.005, C<$&> is not so costly as the
other two.
X<$&> X<$`> X<$'>

As a workaround for this problem, Perl 5.10.0 introduces C<${^PREMATCH}>,
C<${^MATCH}> and C<${^POSTMATCH}>, which are equivalent to C<$`>, C<$&>
and C<$'>, B<except> that they are only guaranteed to be defined after a
successful match that was executed with the C</p> (preserve) modifier.
Unlike their punctuation character equivalents, these variables incur no
global performance penalty; the trade-off is that you have to tell perl
when you want to use them.
X</p> X<p modifier>
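
A short sketch of the C</p> variables in use follows.  The input string
is an assumption chosen only to make the three pieces easy to see:

```perl
use strict;
use warnings;

my $text = 'key=value';   # sample input, chosen for illustration
my ($before, $matched, $after);

# /p asks Perl to preserve the pre-match, match and post-match strings
# for this one pattern only.
if ($text =~ /=/p) {
    ($before, $matched, $after) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});
}
```

After the match, C<$before> holds C<key>, C<$matched> holds C<=> and
C<$after> holds C<value>.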

=head2 Quoting metacharacters

Backslashed metacharacters in Perl are alphanumeric, such as C<\b>,
C<\w>, C<\n>.  Unlike some other regular expression languages, there
are no backslashed symbols that aren't alphanumeric.  So anything
that looks like \\, \(, \), \<, \>, \{, or \} is always
interpreted as a literal character, not a metacharacter.  This was
once used in a common idiom to disable or quote the special meanings
of regular expression metacharacters in a string that you want to
use for a pattern. Simply quote all non-"word" characters:

    $pattern =~ s/(\W)/\\$1/g;

(If C<use locale> is set, then this depends on the current locale.)
Today it is more common to use the quotemeta() function or the C<\Q>
metaquoting escape sequence to disable all metacharacters' special
meanings like this:

    /$unquoted\Q$quoted\E$unquoted/

Beware that if you put literal backslashes (those not inside
interpolated variables) between C<\Q> and C<\E>, double-quotish
backslash interpolation may lead to confusing results.  If you
I<need> to use literal backslashes within C<\Q...\E>,
consult L<perlop/"Gory details of parsing quoted constructs">.
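
The difference between quoted and unquoted interpolation can be sketched
as follows; the sample strings are assumptions made up for the example:

```perl
use strict;
use warnings;

my $literal = 'a.b*c';              # contains regex metacharacters
my $quoted  = quotemeta $literal;   # 'a\.b\*c'

# \Q...\E and quotemeta() both make "." and "*" match themselves.
my $exact    = 'xx a.b*c yy' =~ /\Q$literal\E/ ? 1 : 0;   # 1
my $no_false = 'xx aXbbc yy' =~ /$quoted/      ? 1 : 0;   # 0

# Without quoting, "." matches any character and "*" repeats "b".
my $unquoted = 'xx aXbbc yy' =~ /$literal/     ? 1 : 0;   # 1
```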

=head2 Extended Patterns

Perl also defines a consistent extension syntax for features not
found in standard tools like B<awk> and
B<lex>.  The syntax for most of these is a
pair of parentheses with a question mark as the first thing within
the parentheses.  The character after the question mark indicates
the extension.

The stability of these extensions varies widely.  Some have been
part of the core language for many years.  Others are experimental
and may change without warning or be completely removed.  Check
the documentation on an individual feature to verify its current
status.

A question mark was chosen for this and for the minimal-matching
construct because 1) question marks are rare in older regular
expressions, and 2) whenever you see one, you should stop and
"question" exactly what is going on.  That's psychology....

=over 10

=item C<(?#text)>
X<(?#)>

A comment.  The text is ignored.  If the C</x> modifier enables
whitespace formatting, a simple C<#> will suffice.  Note that Perl closes
the comment as soon as it sees a C<)>, so there is no way to put a literal
C<)> in the comment.

=item C<(?adlupimsx-imsx)>

=item C<(?^alupimsx)>
X<(?)> X<(?^)>

One or more embedded pattern-match modifiers, to be turned on (or
turned off, if preceded by C<->) for the remainder of the pattern or
the remainder of the enclosing pattern group (if any).

This is particularly useful for dynamic patterns, such as those read in from a
configuration file, taken from an argument, or specified in a table
somewhere.  Consider the case where some patterns want to be
case-sensitive and some do not:  The case-insensitive ones merely need to
include C<(?i)> at the front of the pattern.  For example:

    $pattern = "foobar";
    if ( /$pattern/i ) { }

    # more flexible:

    $pattern = "(?i)foobar";
    if ( /$pattern/ ) { }

These modifiers are restored at the end of the enclosing group. For example,

    ( (?i) blah ) \s+ \g1

will match C<blah> in any case, some spaces, and an exact (I<including the case>!)
repetition of the previous word, assuming the C</x> modifier, and no C</i>
modifier outside this group.
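
That scoping can be checked with a small sketch.  The anchors and the
sample strings are additions made for the example:

```perl
use strict;
use warnings;

# (?i) applies only inside group 1; \g1 outside it is case-sensitive.
my $re = qr/^ ( (?i) blah ) \s+ \g1 \z/x;

my $same_case  = 'BLAH BLAH' =~ $re ? 1 : 0;   # 1: exact repetition
my $mixed_case = 'BLAH blah' =~ $re ? 1 : 0;   # 0: case of the repeat differs
```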

These modifiers do not carry over into named subpatterns called in the
enclosing group. In other words, a pattern such as C<((?i)(?&NAME))> does not
change the case-sensitivity of the "NAME" pattern.

Any of these modifiers can be set to apply globally to all regular
expressions compiled within the scope of a C<use re>.  See
L<re/"'/flags' mode">.

Starting in Perl 5.14, a C<"^"> (caret or circumflex accent) immediately
after the C<"?"> is a shorthand equivalent to C<d-imsx>.  Flags (except
C<"d">) may follow the caret to override it.
But a minus sign is not legal with it.

Note that the C<a>, C<d>, C<l>, C<p>, and C<u> modifiers are special in
that they can only be enabled, not disabled, and the C<a>, C<d>, C<l>, and
C<u> modifiers are mutually exclusive: specifying one de-specifies the
others, and a maximum of one (or two C<a>'s) may appear in the
construct.  Thus, for
example, C<(?-p)> will warn when compiled under C<use warnings>;
C<(?-d:...)> and C<(?dl:...)> are fatal errors.

Note also that the C<p> modifier is special in that its presence
anywhere in a pattern has a global effect.

=item C<(?:pattern)>
X<(?:)>

=item C<(?adluimsx-imsx:pattern)>

=item C<(?^aluimsx:pattern)>
X<(?^:)>

This is for clustering, not capturing; it groups subexpressions like
"()", but doesn't make backreferences as "()" does.  So

    @fields = split(/\b(?:a|b|c)\b/)

is like

    @fields = split(/\b(a|b|c)\b/)

but doesn't spit out extra fields.  It's also cheaper not to capture
characters if you don't need to.

Any letters between C<?> and C<:> act as flag modifiers as with
C<(?adluimsx-imsx)>.  For example,

    /(?s-i:more.*than).*million/i

is equivalent to the more verbose

    /(?:(?s-i)more.*than).*million/i

Starting in Perl 5.14, a C<"^"> (caret or circumflex accent) immediately
after the C<"?"> is a shorthand equivalent to C<d-imsx>.  Any positive
flags (except C<"d">) may follow the caret, so

    (?^x:foo)

is equivalent to

    (?x-ims:foo)

The caret tells Perl that this cluster doesn't inherit the flags of any
surrounding pattern, but uses the system defaults (C<d-imsx>),
modified by any flags specified.

The caret allows for simpler stringification of compiled regular
expressions.  These look like

    (?^:pattern)

with any non-default flags appearing between the caret and the colon.
A test that looks at such stringification thus doesn't need to have the
system default flags hard-coded in it, just the caret.  If new flags are
added to Perl, the meaning of the caret's expansion will change to include
the default for those flags, so the test will still work, unchanged.
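
The stringified form can be inspected directly.  This sketch assumes
Perl 5.14 or later and a script that does not enable Unicode string
semantics (which would add a C<u> after the caret); the variable names
are illustrative:

```perl
use strict;
use warnings;

my $cs = qr/foo/;    # no flags
my $ci = qr/foo/i;   # case-insensitive

my $cs_text = "$cs";   # '(?^:foo)' on Perl 5.14 or later
my $ci_text = "$ci";   # '(?^i:foo)'

# The caret stops the embedded pattern from inheriting the outer /i.
my $inherits = 'FOO' =~ /^$cs\z/i ? 1 : 0;   # 0
```

Because C<$cs> interpolates as C<(?^:foo)>, the outer C</i> does not reach
inside it, so the upper-case string is rejected.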

Specifying a negative flag after the caret is an error, as the flag is
redundant.

Mnemonic for C<(?^...)>:  A fresh beginning since the usual use of a caret is
to match at the beginning.

=item C<(?|pattern)>
X<(?|)> X<Branch reset>

This is the "branch reset" pattern, which has the special property
that the capture groups are numbered from the same starting point
in each alternation branch. It is available starting from perl 5.10.0.

Capture groups are numbered from left to right, but inside this
construct the numbering is restarted for each branch.

The numbering within each branch will be as normal, and any groups
following this construct will be numbered as though the construct
contained only one branch, that being the one with the most capture
groups in it.

This construct is useful when you want to capture one of a
number of alternative matches.

Consider the following pattern.  The numbers underneath show in
which group the captured content will be stored.

    # before  ---------------branch-reset----------- after
    / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
    # 1            2         2  3        2     3     4
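
As a smaller, runnable sketch of the same idea, the pattern below accepts
two hypothetical time formats (both invented for this example) and leaves
the hour in group 1 and the minute in group 2 whichever branch matched:

```perl
use strict;
use warnings;

# Hypothetical input formats: "12:30" or "1230h".  Numbering restarts
# in each branch of (?| ... ), so both branches fill groups 1 and 2.
my $time_re = qr/^(?| (\d\d):(\d\d) | (\d\d)(\d\d)h )\z/x;

my @colon = '12:30' =~ $time_re;   # ('12', '30')
my @plain = '0745h' =~ $time_re;   # ('07', '45')
```

Without the branch reset, the second branch would fill groups 3 and 4,
and the caller would have to test which pair was defined.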

Be careful when using the branch reset pattern in combination with
named captures. Named captures are implemented as being aliases to
numbered groups holding the captures, and that interferes with the
implementation of the branch reset pattern. If you are using named
captures in a branch reset pattern, it's best to use the same names,
in the same order, in each of the alternations:

   /(?|  (?<a> x ) (?<b> y )
      |  (?<a> z ) (?<b> w )) /x

Not doing so may lead to surprises:

  "12" =~ /(?| (?<a> \d+ ) | (?<b> \D+))/x;
  say $+{a};   # Prints '12'
  say $+{b};   # *Also* prints '12'.

The problem here is that both the group named C<< a >> and the group
named C<< b >> are aliases for the group belonging to C<< $1 >>.

=item Look-Around Assertions
X<look-around assertion> X<lookaround assertion> X<look-around> X<lookaround>

Look-around assertions are zero-width patterns which match a specific
pattern without including it in C<$&>. Positive assertions match when
their subpattern matches; negative assertions match when their subpattern
fails. Look-behind matches text up to the current match position;
look-ahead matches text following the current match position.

=over 4

=item C<(?=pattern)>
X<(?=)> X<look-ahead, positive> X<lookahead, positive>

A zero-width positive look-ahead assertion.  For example, C</\w+(?=\t)/>
matches a word followed by a tab, without including the tab in C<$&>.

=item C<(?!pattern)>
X<(?!)> X<look-ahead, negative> X<lookahead, negative>

A zero-width negative look-ahead assertion.  For example C</foo(?!bar)/>
matches any occurrence of "foo" that isn't followed by "bar".  Note
however that look-ahead and look-behind are NOT the same thing.  You cannot
use this for look-behind.

If you are looking for a "bar" that isn't preceded by a "foo", C</(?!foo)bar/>
will not do what you want.  That's because the C<(?!foo)> is just saying that
the next thing cannot be "foo" (and it's not, it's a "bar"), so "foobar" will
match.  Use look-behind instead (see below).
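
The pitfall is easy to reproduce.  In this sketch the strings C<foobar>
and C<bazbar> are sample inputs chosen for illustration:

```perl
use strict;
use warnings;

my $s = 'foobar';

# Look-ahead at the position of "bar": the next text is "bar", not "foo".
my $ahead  = $s =~ /(?!foo)bar/  ? 1 : 0;   # 1: matches, probably unintended

# Look-behind checks what precedes "bar".
my $behind = $s =~ /(?<!foo)bar/ ? 1 : 0;         # 0: "bar" follows "foo"
my $other  = 'bazbar' =~ /(?<!foo)bar/ ? 1 : 0;   # 1: "bar" follows "baz"
```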

=item C<(?<=pattern)> C<\K>
X<(?<=)> X<look-behind, positive> X<lookbehind, positive> X<\K>

A zero-width positive look-behind assertion.  For example, C</(?<=\t)\w+/>
matches a word that follows a tab, without including the tab in C<$&>.
Works only for fixed-width look-behind.

There is a special form of this construct, called C<\K>, which causes the
regex engine to "keep" everything it had matched prior to the C<\K> and
not include it in C<$&>. This effectively provides variable-length
look-behind. The use of C<\K> inside of another look-around assertion
is allowed, but the behaviour is currently not well defined.

For various reasons C<\K> may be significantly more efficient than the
equivalent C<< (?<=...) >> construct, and it is especially useful in
situations where you want to efficiently remove something following
something else in a string. For instance

  s/(foo)bar/$1/g;

can be rewritten as the much more efficient

  s/foo\Kbar//g;
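
Two further uses of C<\K> are sketched below.  The file name and the
C<id123: secret> string are invented for the example; the second
substitution relies on a variable-length prefix (C<\d+:\s*>), which a
fixed-width look-behind could not express:

```perl
use strict;
use warnings;

# Remove a trailing ".bak" only when it follows ".txt".
(my $restored = 'dir/file.txt.bak') =~ s/\.txt\K\.bak\z//;

# Keep a variable-length prefix and replace only what follows it.
(my $masked = 'id123: secret') =~ s/\d+:\s*\K\S+/****/;
```

After these statements C<$restored> is C<dir/file.txt> and C<$masked> is
C<id123: ****>.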

=item C<(?<!pattern)>
X<(?<!)> X<look-behind, negative> X<lookbehind, negative>

A zero-width negative look-behind assertion.  For example C</(?<!bar)foo/>
matches any occurrence of "foo" that does not follow "bar".  Works
only for fixed-width look-behind.

=back

=item C<(?'NAME'pattern)>

=item C<< (?<NAME>pattern) >>
X<< (?<NAME>) >> X<(?'NAME')> X<named capture> X<capture>

A named capture group. Identical in every respect to normal capturing
parentheses C<()> but for the additional fact that the group
can be referred to by name in various regular expression
constructs (like C<\g{NAME}>) and can be accessed by name
after a successful match via C<%+> or C<%->. See L<perlvar>
for more details on the C<%+> and C<%-> hashes.

If multiple distinct capture groups have the same name then
C<$+{NAME}> will refer to the leftmost defined group in the match.

The forms C<(?'NAME'pattern)> and C<< (?<NAME>pattern) >> are equivalent.

B<NOTE:> While the notation of this construct is the same as the similar
function in .NET regexes, the behavior is not. In Perl the groups are
numbered sequentially regardless of being named or not. Thus in the
pattern

  /(x)(?<foo>y)(z)/

C<$+{foo}> will be the same as C<$2>, and C<$3> will contain 'z' instead of
the opposite which is what a .NET regex hacker might expect.

Currently NAME is restricted to simple identifiers only.
In other words, it must match C</^[_A-Za-z][_A-Za-z0-9]*\z/> or
its Unicode extension (see L<utf8>),
though it isn't extended by the locale (see L<perllocale>).

B<NOTE:> In order to make things easier for programmers with experience
with the Python or PCRE regex engines, the pattern C<< (?PE<lt>NAMEE<gt>pattern) >>
may be used instead of C<< (?<NAME>pattern) >>; however this form does not
support the use of single quotes as a delimiter for the name.

=item C<< \k<NAME> >>

=item C<< \k'NAME' >>

Named backreference. Similar to numeric backreferences, except that
the group is designated by name and not number. If multiple groups
have the same name then it refers to the leftmost defined group in
the current match.

It is an error to refer to a name not defined by a C<< (?<NAME>) >>
earlier in the pattern.

Both forms are equivalent.

B<NOTE:> In order to make things easier for programmers with experience
with the Python or PCRE regex engines, the pattern C<< (?P=NAME) >>
may be used instead of C<< \k<NAME> >>.

=item C<(?{ code })>
X<(?{})> X<regex, code in> X<regexp, code in> X<regular expression, code in>

B<WARNING>: This extended regular expression feature is considered
experimental, and may be changed without notice. Code executed that
has side effects may not perform identically from version to version
due to the effect of future optimisations in the regex engine.

This zero-width assertion evaluates any embedded Perl code.  It
always succeeds, and its C<code> is not interpolated.  Currently,
the rules to determine where the C<code> ends are somewhat convoluted.

This feature can be used together with the special variable C<$^N> to
capture the results of submatches in variables without having to keep
track of the number of nested parentheses. For example:

  $_ = "The brown fox jumps over the lazy dog";
  /the (\S+)(?{ $color = $^N }) (\S+)(?{ $animal = $^N })/i;
  print "color = $color, animal = $animal\n";

Inside the C<(?{...})> block, C<$_> refers to the string the regular
expression is matching against. You can also use C<pos()> to know what is
the current position of matching within this string.

The C<code> is properly scoped in the following sense: If the assertion
is backtracked (compare L<"Backtracking">), all changes introduced after
C<local>ization are undone, so that

  $_ = 'a' x 8;
  m<
     (?{ $cnt = 0 })                   # Initialize $cnt.
     (
       a
       (?{
           local $cnt = $cnt + 1;      # Update $cnt, backtracking-safe.
       })
     )*
     aaaa
     (?{ $res = $cnt })                # On success copy to
                                       # non-localized location.
   >x;

will set C<$res = 4>.  Note that after the match, C<$cnt> returns to the globally
introduced value, because the scopes that restrict C<local> operators
are unwound.

This assertion may be used as a C<(?(condition)yes-pattern|no-pattern)>
switch.  If I<not> used in this way, the result of evaluation of
C<code> is put into the special variable C<$^R>.  This happens
immediately, so C<$^R> can be used from other C<(?{ code })> assertions
inside the same regular expression.

The assignment to C<$^R> above is properly localized, so the old
value of C<$^R> is restored if the assertion is backtracked; compare
L<"Backtracking">.

For reasons of security, this construct is forbidden if the regular
expression involves run-time interpolation of variables, unless the
perilous C<use re 'eval'> pragma has been used (see L<re>), or the
variables contain results of the C<qr//> operator (see
L<perlop/"qrE<sol>STRINGE<sol>msixpodual">).

This restriction is due to the wide-spread and remarkably convenient
custom of using run-time determined strings as patterns.  For example:

    $re = <>;
    chomp $re;
    $string =~ /$re/;

Before Perl knew how to execute interpolated code within a pattern,
this operation was completely safe from a security point of view,
although it could raise an exception from an illegal pattern.  If
you turn on the C<use re 'eval'>, though, it is no longer secure,
so you should only do so if you are also using taint checking.
Better yet, use the carefully constrained evaluation within a Safe
compartment.  See L<perlsec> for details about both these mechanisms.

B<WARNING>: Use of lexical (C<my>) variables in these blocks is
broken. The result is unpredictable and will make perl unstable. The
workaround is to use global (C<our>) variables.

B<WARNING>: In perl 5.12.x and earlier, the regex engine
was not re-entrant, so interpolated code could not
safely invoke the regex engine either directly with
C<m//> or C<s///>, or indirectly with functions such as
C<split>. Invoking the regex engine in these blocks would make perl
unstable.

=item C<(??{ code })>
X<(??{})>
X<regex, postponed> X<regexp, postponed> X<regular expression, postponed>

B<WARNING>: This extended regular expression feature is considered
experimental, and may be changed without notice. Code executed that
has side effects may not perform identically from version to version
due to the effect of future optimisations in the regex engine.

This is a "postponed" regular subexpression.  The C<code> is evaluated
at run time, at the moment this subexpression may match.  The result
of evaluation is considered a regular expression and matched as
if it were inserted instead of this construct.  Note that this means
that the contents of capture groups defined inside an eval'ed pattern
are not available outside of the pattern, and vice versa, there is no
way for the inner pattern to refer to a capture group defined outside.
Thus,

    ('a' x 100)=~/(??{'(.)' x 100})/

B<will> match, but it will B<not> set C<$1>.

The C<code> is not interpolated.  As before, the rules to determine
where the C<code> ends are currently somewhat convoluted.

The following pattern matches a parenthesized group:

  $re = qr{
             \(
             (?:
                (?> [^()]+ )       # Non-parens without backtracking
              |
                (??{ $re })        # Group with matching parens
             )*
             \)
          }x;

See also C<(?PARNO)> for a different, more efficient way to accomplish
the same task.

For reasons of security, this construct is forbidden if the regular
expression involves run-time interpolation of variables, unless the
perilous C<use re 'eval'> pragma has been used (see L<re>), or the
variables contain results of the C<qr//> operator (see
L<perlop/"qrE<sol>STRINGE<sol>msixpodual">).

In perl 5.12.x and earlier, because the regex engine was not re-entrant,
delayed code could not safely invoke the regex engine either directly with
C<m//> or C<s///>, or indirectly with functions such as C<split>.

Recursing deeper than 50 times without consuming any input string will
result in a fatal error.  The maximum depth is compiled into perl, so
changing it requires a custom build.

=item C<(?PARNO)> C<(?-PARNO)> C<(?+PARNO)> C<(?R)> C<(?0)>
X<(?PARNO)> X<(?1)> X<(?R)> X<(?0)> X<(?-1)> X<(?+1)> X<(?-PARNO)> X<(?+PARNO)>
X<regex, recursive> X<regexp, recursive> X<regular expression, recursive>
X<regex, relative recursion>

Similar to C<(??{ code })> except it does not involve compiling any code;
instead it treats the contents of a capture group as an independent
pattern that must match at the current position.  Capture groups
contained by the pattern will have the value as determined by the
outermost recursion.

PARNO is a sequence of digits (not starting with 0) whose value reflects
the paren-number of the capture group to recurse to. C<(?R)> recurses to
the beginning of the whole pattern. C<(?0)> is an alternate syntax for
C<(?R)>. If PARNO is preceded by a plus or minus sign then it is assumed
to be relative, with negative numbers indicating preceding capture groups
and positive ones following. Thus C<(?-1)> refers to the most recently
declared group, and C<(?+1)> indicates the next group to be declared.
Note that the counting for relative recursion differs from that of
relative backreferences, in that with recursion unclosed groups B<are>
included.

The following pattern matches a function foo() which may contain
balanced parentheses as the argument.

  $re = qr{ (                    # paren group 1 (full function)
              foo
              (                  # paren group 2 (parens)
                \(
                  (              # paren group 3 (contents of parens)
                  (?:
                   (?> [^()]+ )  # Non-parens without backtracking
                  |
                   (?2)          # Recurse to start of paren group 2
                  )*
                  )
                \)
              )
            )
          }x;

If the pattern was used as follows

    'foo(bar(baz)+baz(bop))'=~/$re/
        and print "\$1 = $1\n",
                  "\$2 = $2\n",
                  "\$3 = $3\n";

the output produced should be the following:

    $1 = foo(bar(baz)+baz(bop))
    $2 = (bar(baz)+baz(bop))
    $3 = bar(baz)+baz(bop)

If there is no corresponding capture group defined, then it is a
fatal error.  Recursing deeper than 50 times without consuming any input
string will also result in a fatal error.  The maximum depth is compiled
into perl, so changing it requires a custom build.

The following shows how using negative indexing can make it
easier to embed recursive patterns inside of a C<qr//> construct
for later use:

    my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;
    if (/foo $parens \s+ \+ \s+ bar $parens/x) {
       # do something here...
    }

B<Note> that this pattern does not behave the same way as the equivalent
PCRE or Python construct of the same form. In Perl you can backtrack into
a recursed group; in PCRE and Python the recursed-into group is treated
as atomic. Also, modifiers are resolved at compile time, so constructs
like C<(?i:(?1))> or C<(?:(?i)(?1))> do not affect how the sub-pattern will
be processed.
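
A self-contained sketch of relative recursion follows.  The surrounding
pattern, the anchors and the test strings are assumptions added for
illustration:

```perl
use strict;
use warnings;

# (?-1) recurses into the group that contains it, wherever that group
# ends up being numbered once the fragment is embedded.
my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;
my $sum_re = qr/^foo \s* $parens \s+ \+ \s+ bar \s* $parens\z/x;

my @args       = 'foo(a(b)c) + bar(d)' =~ $sum_re;           # ('(a(b)c)', '(d)')
my $unbalanced = 'foo(a(b c) + bar(d)' =~ $sum_re ? 1 : 0;   # 0
```

The fragment is used twice, as groups 1 and 2 of the larger pattern, and
each copy recurses into itself rather than into the other.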

=item C<(?&NAME)>
X<(?&NAME)>

Recurse to a named subpattern. Identical to C<(?PARNO)> except that the
parenthesis to recurse to is determined by name. If multiple parentheses have
the same name, then it recurses to the leftmost.

It is an error to refer to a name that is not declared somewhere in the
pattern.

B<NOTE:> In order to make things easier for programmers with experience
with the Python or PCRE regex engines the pattern C<< (?P>NAME) >>
may be used instead of C<< (?&NAME) >>.
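
As a hedged illustration, balanced parentheses can be matched by
recursing to a named group.  The group name C<group> and the sample
strings are made up for this sketch:

```perl
use strict;
use warnings;

# The named group calls itself for every nested pair of parentheses.
my $balanced = qr/^(?<group> \( (?: [^()]++ | (?&group) )*+ \) )\z/x;

my $ok  = '(a(b)(c(d)))' =~ $balanced ? 1 : 0;   # 1: every "(" is closed
my $bad = '(a(b)'        =~ $balanced ? 1 : 0;   # 0: one ")" is missing
```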
1f1031fe 1353
e2e6a0f1
YO
1354=item C<(?(condition)yes-pattern|no-pattern)>
1355X<(?()>
286f584a 1356
e2e6a0f1 1357=item C<(?(condition)yes-pattern)>
286f584a 1358
41ef34de
ML
1359Conditional expression. Matches C<yes-pattern> if C<condition> yields
1360a true value, matches C<no-pattern> otherwise. A missing pattern always
1361matches.
1362
1363C<(condition)> should be either an integer in
e2e6a0f1
YO
1364parentheses (which is valid if the corresponding pair of parentheses
1365matched), a look-ahead/look-behind/evaluate zero-width assertion, a
c27a5cfe 1366name in angle brackets or single quotes (which is valid if a group
e2e6a0f1
YO
1367with the given name matched), or the special symbol (R) (true when
1368evaluated inside of recursion or eval). Additionally the R may be
1369followed by a number (which will be true when evaluated while recursing
1370inside of the appropriate group), or by C<&NAME>, in which case it will
1371be true only when evaluated during recursion in the named group.
1372
1373Here's a summary of the possible predicates:
1374
1375=over 4
1376
1377=item (1) (2) ...
1378
c27a5cfe 1379Checks if the numbered capturing group has matched something.
e2e6a0f1
YO
1380
1381=item (<NAME>) ('NAME')
1382
c27a5cfe 1383Checks if a group with the given name has matched something.
e2e6a0f1 1384
f01cd190
FC
1385=item (?=...) (?!...) (?<=...) (?<!...)
1386
1387Checks whether the pattern matches (or does not match, for the '!'
1388variants).
1389
e2e6a0f1
YO
1390=item (?{ CODE })
1391
f01cd190 1392Treats the return value of the code block as the condition.
e2e6a0f1
YO
1393
1394=item (R)
1395
1396Checks if the expression has been evaluated inside of recursion.
1397
1398=item (R1) (R2) ...
1399
1400Checks if the expression has been evaluated while executing directly
1401inside of the n-th capture group. This check is the regex equivalent of
1402
1403 if ((caller(0))[3] eq 'subname') { ... }
1404
1405In other words, it does not check the full recursion stack.
1406
1407=item (R&NAME)
1408
1409Similar to C<(R1)>, this predicate checks to see if we're executing
1410directly inside of the leftmost group with a given name (this is the same
1411logic used by C<(?&NAME)> to disambiguate). It does not check the full
1412stack, but only the name of the innermost active recursion.
1413
1414=item (DEFINE)
1415
1416In this case, the yes-pattern is never directly executed, and no
1417no-pattern is allowed. Similar in spirit to C<(?{0})> but more efficient.
1418See below for details.
1419
1420=back
1421
1422For example:
1423
1424 m{ ( \( )?
1425 [^()]+
1426 (?(1) \) )
1427 }x
1428
1429matches a chunk of non-parentheses, possibly enclosed in parentheses
1430themselves.
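
The conditional can be exercised with a short script. This is only a sketch: the C<\A> and C<\z> anchors and the sample strings are additions made here so that a stray parenthesis causes an outright failure rather than being skipped over.

```perl
use strict;
use warnings;

# Same pattern as above, anchored at both ends for this demonstration.
my $chunk = qr{ \A ( \( )? [^()]+ (?(1) \) ) \z }x;

my %ok;
for my $candidate ('abc', '(abc)', '(abc', 'abc)') {
    # The closing parenthesis is required only if group 1 matched.
    $ok{$candidate} = $candidate =~ $chunk ? 1 : 0;
}
print "$_ => $ok{$_}\n" for sort keys %ok;
```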
1431
0b928c2f
FC
1432A special form is the C<(DEFINE)> predicate, which never executes its
1433yes-pattern directly, and does not allow a no-pattern. This allows one to
1434define subpatterns which will be executed only by the recursion mechanism.
e2e6a0f1
YO
1435This way, you can define a set of regular expression rules that can be
1436bundled into any pattern you choose.
1437
1438It is recommended that for this usage you put the DEFINE block at the
1439end of the pattern, and that you name any subpatterns defined within it.
1440
1441Also, it's worth noting that patterns defined this way probably will
1442not be as efficient, as the optimiser is not very clever about
1443handling them.
1444
1445An example of how this might be used is as follows:
1446
2bf803e2 1447 /(?<NAME>(?&NAME_PAT))(?<ADDR>(?&ADDRESS_PAT))
e2e6a0f1 1448 (?(DEFINE)
2bf803e2
YO
1449 (?<NAME_PAT>....)
1450 (?<ADDRESS_PAT>....)
e2e6a0f1
YO
1451 )/x
1452
c27a5cfe
KW
1453Note that capture groups matched inside of recursion are not accessible
1454after the recursion returns, so the extra layer of capturing groups is
e2e6a0f1
YO
1455necessary. Thus C<$+{NAME_PAT}> would not be defined even though
1456C<$+{NAME}> would be.
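
A runnable variant of the idea might look like the following. The two rules are deliberately simplistic placeholders invented for this sketch (they are not a real name or address grammar), and the input string is made up.

```perl
use strict;
use warnings;

# Hypothetical rules: NAME_PAT and ADDRESS_PAT are toy definitions.
my $entry = qr{
    \A (?<NAME> (?&NAME_PAT) ) \s* < (?<ADDR> (?&ADDRESS_PAT) ) > \z
    (?(DEFINE)
        (?<NAME_PAT>    [A-Za-z]+ (?: \s [A-Za-z]+ )* )
        (?<ADDRESS_PAT> [\w.]+ \@ [\w.]+ )
    )
}x;

my ($name, $addr, $inner);
if ('Jane Doe <jane@example.com>' =~ $entry) {
    # Copy the results immediately; %+ is reset by the next match.
    ($name, $addr, $inner) = ($+{NAME}, $+{ADDR}, $+{NAME_PAT});
}
print "name=$name addr=$addr\n" if defined $name;
print "NAME_PAT itself captured nothing\n" unless defined $inner;
```

The outer C<NAME> and C<ADDR> groups carry the results, while the groups inside the C<(DEFINE)> block stay undefined.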
286f584a 1457
c47ff5f1 1458=item C<< (?>pattern) >>
6bda09f9 1459X<backtrack> X<backtracking> X<atomic> X<possessive>
5a964f20 1460
19799a22
GS
1461An "independent" subexpression, one which matches the substring
1462that a I<standalone> C<pattern> would match if anchored at the given
9da458fc 1463position, and it matches I<nothing other than this substring>. This
19799a22
GS
1464construct is useful for optimizations of what would otherwise be
1465"eternal" matches, because it will not backtrack (see L<"Backtracking">).
9da458fc
IZ
1466It may also be useful in places where the "grab all you can, and do not
1467give anything back" semantic is desirable.
19799a22 1468
c47ff5f1 1469For example: C<< ^(?>a*)ab >> will never match, since C<< (?>a*) >>
19799a22
GS
1470(anchored at the beginning of string, as above) will match I<all>
1471characters C<a> at the beginning of string, leaving no C<a> for
1472C<ab> to match. In contrast, C<a*ab> will match the same as C<a+b>,
1473since the match of the subgroup C<a*> is influenced by the following
1474group C<ab> (see L<"Backtracking">). In particular, C<a*> inside
1475C<a*ab> will match fewer characters than a standalone C<a*>, since
1476this makes the tail match.
1477
0b928c2f
FC
1478C<< (?>pattern) >> does not disable backtracking altogether once it has
1479matched. It is still possible to backtrack past the construct, but not
1480into it. So C<< ((?>a*)|(?>b*))ar >> will still match "bar".
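
The three cases discussed so far can be checked directly. This sketch only restates the patterns from the text; the variable names are arbitrary.

```perl
use strict;
use warnings;

# (?>a*) keeps every "a", so nothing is left for the following "ab".
my $atomic = 'aaab' =~ /^(?>a*)ab/         ? 1 : 0;

# Plain a* gives one "a" back so that the tail can match.
my $greedy = 'aaab' =~ /^a*ab/             ? 1 : 0;

# The engine may backtrack past an atomic group into the other branch.
my $past   = 'bar'  =~ /((?>a*)|(?>b*))ar/ ? 1 : 0;

print "atomic=$atomic greedy=$greedy past=$past\n";
```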
1481
c47ff5f1 1482An effect similar to C<< (?>pattern) >> may be achieved by writing
0b928c2f
FC
1483C<(?=(pattern))\g{-1}>. This matches the same substring as a standalone
1484C<pattern>, and the following C<\g{-1}> eats the matched string; it therefore
c47ff5f1 1485makes a zero-length assertion into an analogue of C<< (?>...) >>.
19799a22
GS
1486(The difference between these two constructs is that the second one
1487uses a capturing group, thus shifting ordinals of backreferences
1488in the rest of a regular expression.)
1489
1490Consider this pattern:
c277df42 1491
871b0233 1492 m{ \(
e2e6a0f1 1493 (
f793d64a 1494 [^()]+ # x+
e2e6a0f1 1495 |
871b0233
IZ
1496 \( [^()]* \)
1497 )+
e2e6a0f1 1498 \)
871b0233 1499 }x
5a964f20 1500
19799a22
GS
1501That will efficiently match a nonempty group with matching parentheses
1502two levels deep or less. However, if there is no such group, it
1503will take virtually forever on a long string. That's because there
1504are so many different ways to split a long string into several
1505substrings. This is what C<(.+)+> is doing, and C<(.+)+> is similar
1506to a subpattern of the above pattern. Consider how the pattern
1507above detects no-match on C<((()aaaaaaaaaaaaaaaaaa> in several
1508seconds, but that each extra letter doubles this time. This
1509exponential performance will make it appear that your program has
14218588 1510hung. However, a tiny change to this pattern
5a964f20 1511
e2e6a0f1
YO
1512 m{ \(
1513 (
f793d64a 1514 (?> [^()]+ ) # change x+ above to (?> x+ )
e2e6a0f1 1515 |
871b0233
IZ
1516 \( [^()]* \)
1517 )+
e2e6a0f1 1518 \)
871b0233 1519 }x
c277df42 1520
c47ff5f1 1521which uses C<< (?>...) >> matches exactly when the one above does (verifying
5a964f20
TC
1522this yourself would be a productive exercise), but finishes in a quarter of
1523the time when used on a similar string with 1000000 C<a>s. Be aware,
0b928c2f
FC
1524however, that, when this construct is followed by a
1525quantifier, it currently triggers a warning message under
9f1b1f2d 1526the C<use warnings> pragma or B<-w> switch saying it
6bab786b 1527C<"matches null string many times in regex">.
c277df42 1528
c47ff5f1 1529On simple groups, such as the pattern C<< (?> [^()]+ ) >>, a comparable
19799a22 1530effect may be achieved by negative look-ahead, as in C<[^()]+ (?! [^()] )>.
c277df42
IZ
1531This was only 4 times slower on a string with 1000000 C<a>s.
1532
9da458fc
IZ
1533The "grab all you can, and do not give anything back" semantic is desirable
1534in many situations where at first sight a simple C<()*> looks like
1535the correct solution. Suppose we parse text with comments being delimited
1536by C<#> followed by some optional (horizontal) whitespace. Contrary to
4375e838 1537its appearance, C<#[ \t]*> I<is not> the correct subexpression to match
9da458fc
IZ
1538the comment delimiter, because it may "give up" some whitespace if
1539the remainder of the pattern can be made to match that way. The correct
1540answer is either one of these:
1541
1542 (?>#[ \t]*)
1543 #[ \t]*(?![ \t])
1544
1545For example, to grab non-empty comments into $1, one should use either
1546one of these:
1547
1548 / (?> \# [ \t]* ) ( .+ ) /x;
1549 / \# [ \t]* ( [^ \t] .* ) /x;
1550
1551Which one you pick depends on which of these expressions better reflects
1552the above specification of comments.
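
To see the difference in practice, consider a comment that consists of the delimiter and whitespace only. The following sketch (with made-up input strings) compares the naive pattern with the atomic one:

```perl
use strict;
use warnings;

my $blank_comment = "#   ";    # delimiter followed only by whitespace

# Naive version: [ \t]* hands back one space so that ( .+ ) can match.
my ($loose)  = $blank_comment =~ / \# [ \t]* ( .+ ) /x;

# Atomic version: the whitespace is never given back, so the match fails.
my ($strict) = $blank_comment =~ / (?> \# [ \t]* ) ( .+ ) /x;

# Both forms agree on an ordinary comment.
my ($text)   = "#   hello" =~ / (?> \# [ \t]* ) ( .+ ) /x;

print "naive captured a lone space\n" if defined $loose && $loose eq ' ';
print "atomic rejected the empty comment\n" unless defined $strict;
print "text=$text\n";
```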
1553
6bda09f9
YO
1554In some literature this construct is called "atomic matching" or
1555"possessive matching".
1556
b9b4dddf
YO
1557Possessive quantifiers are equivalent to putting the item they are applied
1558to inside of one of these constructs. The following equivalences apply:
1559
1560 Quantifier Form     Bracketing Form
1561 ---------------     ---------------
1562 PAT*+               (?>PAT*)
1563 PAT++               (?>PAT+)
1564 PAT?+               (?>PAT?)
1565 PAT{min,max}+       (?>PAT{min,max})
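
As a quick check of the first two rows, the sketch below runs both spellings, plus the ordinary greedy form for contrast, against the same arbitrary string:

```perl
use strict;
use warnings;

my @forms = (
    [ 'a++a',    qr/^a++a/    ],   # possessive quantifier
    [ '(?>a+)a', qr/^(?>a+)a/ ],   # equivalent bracketing form
    [ 'a+a',     qr/^a+a/     ],   # ordinary greedy quantifier
);

my %matched;
for my $form (@forms) {
    my ($label, $re) = @$form;
    $matched{$label} = 'aaa' =~ $re ? 1 : 0;
}
print "$_ => $matched{$_}\n" for sort keys %matched;
```

Both possessive spellings fail because no C<a> is given back; the greedy form succeeds.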
1566
e2e6a0f1
YO
1567=back
1568
1569=head2 Special Backtracking Control Verbs
1570
1571B<WARNING:> These patterns are experimental and subject to change or
0d017f4d 1572removal in a future version of Perl. Their usage in production code should
e2e6a0f1
YO
1573be noted to avoid problems during upgrades.
1574
1575These special patterns are generally of the form C<(*VERB:ARG)>. Unless
1576otherwise stated the ARG argument is optional; in some cases, it is
1577forbidden.
1578
1579Any pattern containing a special backtracking verb that allows an argument
e1020413 1580has the special behaviour that when executed it sets the current package's
5d458dd8
YO
1581C<$REGERROR> and C<$REGMARK> variables. When doing so the following
1582rules apply:
e2e6a0f1 1583
5d458dd8
YO
1584On failure, the C<$REGERROR> variable will be set to the ARG value of the
1585verb pattern, if the verb was involved in the failure of the match. If the
1586ARG part of the pattern was omitted, then C<$REGERROR> will be set to the
1587name of the last C<(*MARK:NAME)> pattern executed, or to TRUE if there was
1588none. Also, the C<$REGMARK> variable will be set to FALSE.
e2e6a0f1 1589
5d458dd8
YO
1590On a successful match, the C<$REGERROR> variable will be set to FALSE, and
1591the C<$REGMARK> variable will be set to the name of the last
1592C<(*MARK:NAME)> pattern executed. See the explanation for the
1593C<(*MARK:NAME)> verb below for more details.
e2e6a0f1 1594
5d458dd8 1595B<NOTE:> C<$REGERROR> and C<$REGMARK> are not magic variables like C<$1>
0b928c2f 1596and most other regex-related variables. They are not local to a scope, nor
5d458dd8
YO
1597readonly, but instead are volatile package variables similar to C<$AUTOLOAD>.
1598Use C<local> to localize changes to them to a specific scope if necessary.
e2e6a0f1
YO
1599
1600If a pattern does not contain a special backtracking verb that allows an
5d458dd8 1601argument, then C<$REGERROR> and C<$REGMARK> are not touched at all.
e2e6a0f1
YO
1602
1603=over 4
1604
1605=item Verbs that take an argument
1606
1607=over 4
1608
5d458dd8 1609=item C<(*PRUNE)> C<(*PRUNE:NAME)>
f7819f85 1610X<(*PRUNE)> X<(*PRUNE:NAME)>
54612592 1611
5d458dd8
YO
1612This zero-width pattern prunes the backtracking tree at the current point
1613when backtracked into on failure. Consider the pattern C<A (*PRUNE) B>,
1614where A and B are complex patterns. Until the C<(*PRUNE)> verb is reached,
1615A may backtrack as necessary to match. Once it is reached, matching
1616continues in B, which may also backtrack as necessary; however, should B
1617not match, then no further backtracking will take place, and the pattern
1618will fail outright at the current starting position.
54612592
YO
1619
1620The following example counts all the possible matching strings in a
1621pattern (without actually matching any of them).
1622
e2e6a0f1 1623 'aaab' =~ /a+b?(?{print "$&\n"; $count++})(*FAIL)/;
54612592
YO
1624 print "Count=$count\n";
1625
1626which produces:
1627
1628 aaab
1629 aaa
1630 aa
1631 a
1632 aab
1633 aa
1634 a
1635 ab
1636 a
1637 Count=9
1638
5d458dd8 1639If we add a C<(*PRUNE)> before the count like the following
54612592 1640
5d458dd8 1641 'aaab' =~ /a+b?(*PRUNE)(?{print "$&\n"; $count++})(*FAIL)/;
54612592
YO
1642 print "Count=$count\n";
1643
0b928c2f 1644we prevent backtracking and find the count of the longest matching string
353c6505 1645at each matching starting point like so:
54612592
YO
1646
1647 aaab
1648 aab
1649 ab
1650 Count=3
1651
5d458dd8 1652Any number of C<(*PRUNE)> assertions may be used in a pattern.
54612592 1653
5d458dd8
YO
1654See also C<< (?>pattern) >> and possessive quantifiers for other ways to
1655control backtracking. In some cases, the use of C<(*PRUNE)> can be
1656replaced with a C<< (?>pattern) >> with no functional difference; however,
1657C<(*PRUNE)> can be used to handle cases that cannot be expressed using a
1658C<< (?>pattern) >> alone.
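
The two runs above can also be compared programmatically. This sketch collects the candidates in an array instead of printing them; the array name is an arbitrary choice, and a package variable is used so that the embedded code block can reach it.

```perl
use strict;
use warnings;

our @seen;    # filled in by the embedded code blocks below

@seen = ();
'aaab' =~ /a+b?(?{ push @seen, $& })(*FAIL)/;
my @without_prune = @seen;

@seen = ();
'aaab' =~ /a+b?(*PRUNE)(?{ push @seen, $& })(*FAIL)/;
my @with_prune = @seen;

print scalar(@without_prune), " candidates without (*PRUNE), ",
      scalar(@with_prune), " with it: @with_prune\n";
```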
54612592 1659
e2e6a0f1 1660
5d458dd8
YO
1661=item C<(*SKIP)> C<(*SKIP:NAME)>
1662X<(*SKIP)>
e2e6a0f1 1663
5d458dd8 1664This zero-width pattern is similar to C<(*PRUNE)>, except that on
e2e6a0f1 1665failure it also signifies that whatever text was matched leading up
5d458dd8
YO
1666to the C<(*SKIP)> pattern being executed cannot be part of I<any> match
1667of this pattern. This effectively means that the regex engine "skips" forward
1668to this position on failure and tries to match again (assuming that
1669there is sufficient room to match).
1670
1671The name of the C<(*SKIP:NAME)> pattern has special significance. If a
1672C<(*MARK:NAME)> was encountered while matching, then it is that position
1673which is used as the "skip point". If no C<(*MARK)> of that name was
1674encountered, then the C<(*SKIP)> operator has no effect. When used
1675without a name the "skip point" is where the match point was when
1676executing the C<(*SKIP)> pattern.
1677
0b928c2f 1678Compare the following to the examples in C<(*PRUNE)>; note the string
24b23f37
YO
1679is twice as long:
1680
5d458dd8 1681 'aaabaaab' =~ /a+b?(*SKIP)(?{print "$&\n"; $count++})(*FAIL)/;
24b23f37
YO
1682 print "Count=$count\n";
1683
1684outputs
1685
1686 aaab
1687 aaab
1688 Count=2
1689
5d458dd8 1690Once the 'aaab' at the start of the string has matched, and the C<(*SKIP)>
353c6505 1691executed, the next starting point will be where the cursor was when the
5d458dd8
YO
1692C<(*SKIP)> was executed.
1693
5d458dd8
YO
1694=item C<(*MARK:NAME)> C<(*:NAME)>
1695X<(*MARK)> X<(*MARK:NAME)> X<(*:NAME)>
1696
1697This zero-width pattern can be used to mark the point reached in a string
1698when a certain part of the pattern has been successfully matched. This
1699mark may be given a name. A later C<(*SKIP)> pattern will then skip
1700forward to that point if backtracked into on failure. Any number of
b4222fa9 1701C<(*MARK)> patterns are allowed, and the NAME portion may be duplicated.
5d458dd8
YO
1702
1703In addition to interacting with the C<(*SKIP)> pattern, C<(*MARK:NAME)>
1704can be used to "label" a pattern branch, so that after matching, the
1705program can determine which branches of the pattern were involved in the
1706match.
1707
1708When a match is successful, the C<$REGMARK> variable will be set to the
1709name of the most recently executed C<(*MARK:NAME)> that was involved
1710in the match.
1711
1712This can be used to determine which branch of a pattern was matched
c27a5cfe 1713without using a separate capture group for each branch, which in turn
5d458dd8
YO
1714can result in a performance improvement, as perl cannot optimize
1715C</(?:(x)|(y)|(z))/> as efficiently as something like
1716C</(?:x(*MARK:x)|y(*MARK:y)|z(*MARK:z))/>.
1717
1718When a match has failed, and unless another verb has been involved in
1719failing the match and has provided its own name to use, the C<$REGERROR>
1720variable will be set to the name of the most recently executed
1721C<(*MARK:NAME)>.
1722
1723See C<(*SKIP)> for more details.
1724
b62d2d15
YO
1725As a shortcut C<(*MARK:NAME)> can be written C<(*:NAME)>.
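
The branch-labelling use can be sketched as a tiny classifier. The function name C<classify> and the three token kinds are invented for this example; only C<(*MARK:NAME)> and C<$REGMARK> come from the description above.

```perl
use strict;
use warnings;

our $REGMARK;    # set by the regex engine in the current package

# Hypothetical helper: returns the label of the branch that matched,
# or undef if no branch matched the whole token.
sub classify {
    my ($token) = @_;
    return $token =~ m{
        \A (?:
              \d+            (*MARK:number)
            | [A-Za-z_] \w*  (*MARK:identifier)
            | \s+            (*MARK:space)
        ) \z
    }x ? $REGMARK : undef;
}

for my $token ('42', 'foo_1', '+') {
    my $kind = classify($token);
    print "'$token' => ", (defined $kind ? $kind : 'no match'), "\n";
}
```

No capture groups are needed to find out which branch succeeded.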
1726
5d458dd8
YO
1727=item C<(*THEN)> C<(*THEN:NAME)>
1728
241e7389 1729This is similar to the "cut group" operator C<::> from Perl 6. Like
5d458dd8
YO
1730C<(*PRUNE)>, this verb always matches, and when backtracked into on
1731failure, it causes the regex engine to try the next alternation in the
1732innermost enclosing group (capturing or otherwise).
1733
1734Its name comes from the observation that this operation combined with the
1735alternation operator (C<|>) can be used to create what is essentially a
1736pattern-based if/then/else block:
1737
1738 ( COND (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ )
1739
1740Note that if this operator is used and NOT inside of an alternation then
1741it acts exactly like the C<(*PRUNE)> operator.
1742
1743 / A (*PRUNE) B /
1744
1745is the same as
1746
1747 / A (*THEN) B /
1748
1749but
1750
1751 / ( A (*THEN) B | C (*THEN) D ) /
1752
1753is not the same as
1754
1755 / ( A (*PRUNE) B | C (*PRUNE) D ) /
1756
1757as after matching the A but failing on the B the C<(*THEN)> verb will
1758backtrack and try C; but the C<(*PRUNE)> verb will simply fail.
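
A concrete instance of that difference, using single letters in place of A, B, C and D (the letters and the test string are arbitrary choices for this sketch):

```perl
use strict;
use warnings;

# After "a" matches and "b" fails, (*THEN) moves on to the second branch.
my $then  = 'ad' =~ /^(a(*THEN)b|a(*THEN)d)$/   ? 1 : 0;

# (*PRUNE) instead abandons the attempt at this starting position.
my $prune = 'ad' =~ /^(a(*PRUNE)b|a(*PRUNE)d)$/ ? 1 : 0;

print "then=$then prune=$prune\n";
```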
24b23f37 1759
e2e6a0f1
YO
1760=item C<(*COMMIT)>
1761X<(*COMMIT)>
24b23f37 1762
241e7389 1763This is the Perl 6 "commit pattern" C<< <commit> >> or C<:::>. It's a
5d458dd8
YO
1764zero-width pattern similar to C<(*SKIP)>, except that when backtracked
1765into on failure it causes the match to fail outright. No further attempts
1766to find a valid match by advancing the start pointer will occur.
1767For example,
24b23f37 1768
e2e6a0f1 1769 'aaabaaab' =~ /a+b?(*COMMIT)(?{print "$&\n"; $count++})(*FAIL)/;
24b23f37
YO
1770 print "Count=$count\n";
1771
1772outputs
1773
1774 aaab
1775 Count=1
1776
e2e6a0f1
YO
1777In other words, once the C<(*COMMIT)> has been entered, and if the pattern
1778does not match, the regex engine will not try any further matching on the
1779rest of the string.
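
For comparison, the C<(*SKIP)> and C<(*COMMIT)> runs can be placed side by side. This sketch reuses the patterns shown in the text and stores the candidates in an array (whose name is an arbitrary choice) rather than printing them.

```perl
use strict;
use warnings;

our @seen;    # filled in by the embedded code blocks below

@seen = ();
'aaabaaab' =~ /a+b?(*SKIP)(?{ push @seen, $& })(*FAIL)/;
my @skip = @seen;      # one candidate per "aaab" run

@seen = ();
'aaabaaab' =~ /a+b?(*COMMIT)(?{ push @seen, $& })(*FAIL)/;
my @commit = @seen;    # matching stops after the first candidate

print "skip: @skip\ncommit: @commit\n";
```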
c277df42 1780
e2e6a0f1 1781=back
9af228c6 1782
e2e6a0f1 1783=item Verbs without an argument
9af228c6
YO
1784
1785=over 4
1786
e2e6a0f1
YO
1787=item C<(*FAIL)> C<(*F)>
1788X<(*FAIL)> X<(*F)>
9af228c6 1789
e2e6a0f1
YO
1790This pattern matches nothing and always fails. It can be used to force the
1791engine to backtrack. It is equivalent to C<(?!)>, but easier to read. In
1792fact, C<(?!)> gets optimised into C<(*FAIL)> internally.
9af228c6 1793
e2e6a0f1 1794It is probably useful only when combined with C<(?{})> or C<(??{})>.
9af228c6 1795
e2e6a0f1
YO
1796=item C<(*ACCEPT)>
1797X<(*ACCEPT)>
9af228c6 1798
e2e6a0f1
YO
1799B<WARNING:> This feature is highly experimental. It is not recommended
1800for production code.
9af228c6 1801
e2e6a0f1
YO
1802This pattern matches nothing and causes the end of successful matching at
1803the point at which the C<(*ACCEPT)> pattern was encountered, regardless of
1804whether there is actually more to match in the string. When inside of a
0d017f4d 1805nested pattern, such as recursion, or in a subpattern dynamically generated
e2e6a0f1 1806via C<(??{})>, only the innermost pattern is ended immediately.
9af228c6 1807
c27a5cfe 1808If the C<(*ACCEPT)> is inside of capturing groups then the groups are
e2e6a0f1
YO
1809marked as ended at the point at which the C<(*ACCEPT)> was encountered.
1810For instance:
9af228c6 1811
e2e6a0f1 1812 'AB' =~ /(A (A|B(*ACCEPT)|C) D)(E)/x;
9af228c6 1813
e2e6a0f1 1814will match, and C<$1> will be C<AB> and C<$2> will be C<B>; C<$3> will not
0b928c2f 1815be set. If another branch in the inner parentheses was matched, such as in the
e2e6a0f1 1816string 'ACDE', then the C<D> and C<E> would have to be matched as well.
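
Both cases can be checked with a short script. This is only a sketch; the variable names are arbitrary.

```perl
use strict;
use warnings;

# Branch with (*ACCEPT): the match ends right after "B".
my @accepted;
if ('AB' =~ /(A (A|B(*ACCEPT)|C) D)(E)/x) {
    @accepted = ($1, $2, $3);
}

# Another branch: "D" and "E" must be matched as usual.
my @full;
if ('ACDE' =~ /(A (A|B(*ACCEPT)|C) D)(E)/x) {
    @full = ($1, $2, $3);
}

print "accepted: $accepted[0], $accepted[1], ",
      (defined $accepted[2] ? $accepted[2] : 'undef'), "\n";
print "full: @full\n";
```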
9af228c6
YO
1817
1818=back
c277df42 1819
a0d0e21e
LW
1820=back
1821
c07a80fd 1822=head2 Backtracking
d74e8afc 1823X<backtrack> X<backtracking>
c07a80fd 1824
35a734be
IZ
1825NOTE: This section presents an abstract approximation of regular
1826expression behavior. For a more rigorous (and complicated) view of
1827the rules involved in selecting a match among possible alternatives,
0d017f4d 1828see L</Combining RE Pieces>.
35a734be 1829
c277df42 1830A fundamental feature of regular expression matching involves the
5a964f20 1831notion called I<backtracking>, which is currently used (when needed)
0d017f4d 1832by all non-possessive regular expression quantifiers, namely C<*>, C<*?>, C<+>,
9da458fc
IZ
1833C<+?>, C<{n,m}>, and C<{n,m}?>. Backtracking is often optimized
1834internally, but the general principle outlined here is valid.
c07a80fd 1835
1836For a regular expression to match, the I<entire> regular expression must
1837match, not just part of it. So if the beginning of a pattern containing a
1838quantifier succeeds in a way that causes later parts in the pattern to
1839fail, the matching engine backs up and recalculates the beginning
1840part--that's why it's called backtracking.
1841
1842Here is an example of backtracking: Let's say you want to find the
1843word following "foo" in the string "Food is on the foo table.":
1844
1845 $_ = "Food is on the foo table.";
1846 if ( /\b(foo)\s+(\w+)/i ) {
f793d64a 1847 print "$2 follows $1.\n";
c07a80fd 1848 }
1849
1850When the match runs, the first part of the regular expression (C<\b(foo)>)
1851finds a possible match right at the beginning of the string, and loads up
1852$1 with "Foo". However, as soon as the matching engine sees that there's
1853no whitespace following the "Foo" that it had saved in $1, it realizes its
68dc0745 1854mistake and starts over again one character after where it had the
c07a80fd 1855tentative match. This time it goes all the way until the next occurrence
1856of "foo". The complete regular expression matches this time, and you get
1857the expected output of "table follows foo."
1858
1859Sometimes minimal matching can help a lot. Imagine you'd like to match
1860everything between "foo" and "bar". Initially, you write something
1861like this:
1862
1863 $_ = "The food is under the bar in the barn.";
1864 if ( /foo(.*)bar/ ) {
f793d64a 1865 print "got <$1>\n";
c07a80fd 1866 }
1867
1868Which perhaps unexpectedly yields:
1869
1870 got <d is under the bar in the >
1871
1872That's because C<.*> was greedy, so you get everything between the
14218588 1873I<first> "foo" and the I<last> "bar". Here it's more effective
c07a80fd 1874to use minimal matching to make sure you get the text between a "foo"
1875and the first "bar" thereafter.
1876
1877 if ( /foo(.*?)bar/ ) { print "got <$1>\n" }
1878 got <d is under the >
1879
0d017f4d 1880Here's another example. Let's say you'd like to match a number at the end
b6e13d97 1881of a string, and you also want to keep the preceding part of the match.
c07a80fd 1882So you write this:
1883
1884 $_ = "I have 2 numbers: 53147";
f793d64a
KW
1885 if ( /(.*)(\d*)/ ) { # Wrong!
1886 print "Beginning is <$1>, number is <$2>.\n";
c07a80fd 1887 }
1888
1889That won't work at all, because C<.*> was greedy and gobbled up the
1890whole string. As C<\d*> can match on an empty string the complete
1891regular expression matched successfully.
1892
8e1088bc 1893 Beginning is <I have 2 numbers: 53147>, number is <>.
c07a80fd 1894
1895Here are some variants, most of which don't work:
1896
1897 $_ = "I have 2 numbers: 53147";
1898 @pats = qw{
f793d64a
KW
1899 (.*)(\d*)
1900 (.*)(\d+)
1901 (.*?)(\d*)
1902 (.*?)(\d+)
1903 (.*)(\d+)$
1904 (.*?)(\d+)$
1905 (.*)\b(\d+)$
1906 (.*\D)(\d+)$
c07a80fd 1907 };
1908
1909 for $pat (@pats) {
f793d64a
KW
1910 printf "%-12s ", $pat;
1911 if ( /$pat/ ) {
1912 print "<$1> <$2>\n";
1913 } else {
1914 print "FAIL\n";
1915 }
c07a80fd 1916 }
1917
1918That will print out:
1919
1920 (.*)(\d*) <I have 2 numbers: 53147> <>
1921 (.*)(\d+) <I have 2 numbers: 5314> <7>
1922 (.*?)(\d*) <> <>
1923 (.*?)(\d+) <I have > <2>
1924 (.*)(\d+)$ <I have 2 numbers: 5314> <7>
1925 (.*?)(\d+)$ <I have 2 numbers: > <53147>
1926 (.*)\b(\d+)$ <I have 2 numbers: > <53147>
1927 (.*\D)(\d+)$ <I have 2 numbers: > <53147>
1928
1929As you see, this can be a bit tricky. It's important to realize that a
1930regular expression is merely a set of assertions that gives a definition
1931of success. There may be 0, 1, or several different ways that the
1932definition might succeed against a particular string. And if there are
5a964f20
TC
1933multiple ways it might succeed, you need to understand backtracking to
1934know which variety of success you will achieve.
c07a80fd 1935
19799a22 1936When using look-ahead assertions and negations, this can all get even
8b19b778 1937trickier. Imagine you'd like to find a sequence of non-digits not
c07a80fd 1938followed by "123". You might try to write that as
1939
871b0233 1940 $_ = "ABC123";
f793d64a
KW
1941 if ( /^\D*(?!123)/ ) { # Wrong!
1942 print "Yup, no 123 in $_\n";
871b0233 1943 }
c07a80fd 1944
1945But that isn't going to match; at least, not the way you're hoping. It
1946claims that there is no 123 in the string. Here's a clearer picture of
9b9391b2 1947why that pattern matches, contrary to popular expectations:
c07a80fd 1948
4358a253
SS
1949 $x = 'ABC123';
1950 $y = 'ABC445';
c07a80fd 1951
4358a253
SS
1952 print "1: got $1\n" if $x =~ /^(ABC)(?!123)/;
1953 print "2: got $1\n" if $y =~ /^(ABC)(?!123)/;
c07a80fd 1954
4358a253
SS
1955 print "3: got $1\n" if $x =~ /^(\D*)(?!123)/;
1956 print "4: got $1\n" if $y =~ /^(\D*)(?!123)/;
c07a80fd 1957
1958This prints
1959
1960 2: got ABC
1961 3: got AB
1962 4: got ABC
1963
5f05dabc 1964You might have expected test 3 to fail because it seems to be a more
c07a80fd 1965general purpose version of test 1. The important difference between
1966them is that test 3 contains a quantifier (C<\D*>) and so can use
1967backtracking, whereas test 1 will not. What's happening is
1968that you've asked "Is it true that at the start of $x, following 0 or more
5f05dabc 1969non-digits, you have something that's not 123?" If the pattern matcher had
c07a80fd 1970let C<\D*> expand to "ABC", this would have caused the whole pattern to
54310121 1971fail.
14218588 1972
c07a80fd 1973The search engine will initially match C<\D*> with "ABC". Then it will
0b928c2f 1974try to match C<(?!123)> with "123", which fails. But because
c07a80fd 1975a quantifier (C<\D*>) has been used in the regular expression, the
1976search engine can backtrack and retry the match differently
54310121 1977in the hope of matching the complete regular expression.
c07a80fd 1978
5a964f20
TC
1979The pattern really, I<really> wants to succeed, so it uses the
1980standard pattern back-off-and-retry and lets C<\D*> expand to just "AB" this
c07a80fd 1981time. Now there's indeed something following "AB" that is not
14218588 1982"123". It's "C123", which suffices.
c07a80fd 1983
14218588
GS
1984We can deal with this by using both an assertion and a negation.
1985We'll say that the first part in $1 must be followed both by a digit
1986and by something that's not "123". Remember that the look-aheads
1987are zero-width expressions--they only look, but don't consume any
1988of the string in their match. So rewriting this way produces what
c07a80fd 1989you'd expect; that is, case 5 will fail, but case 6 succeeds:
1990
4358a253
SS
1991 print "5: got $1\n" if $x =~ /^(\D*)(?=\d)(?!123)/;
1992 print "6: got $1\n" if $y =~ /^(\D*)(?=\d)(?!123)/;
c07a80fd 1993
1994 6: got ABC
1995
5a964f20 1996In other words, the two zero-width assertions next to each other work as though
19799a22 1997they're ANDed together, just as you'd use any built-in assertions: C</^$/>
c07a80fd 1998matches only if you're at the beginning of the line AND the end of the
1999line simultaneously. The deeper underlying truth is that juxtaposition in
2000regular expressions always means AND, except when you write an explicit OR
2001using the vertical bar. C</ab/> means match "a" AND (then) match "b",
2002although the attempted matches are made at different positions because "a"
2003is not a zero-width assertion, but a one-width assertion.
2004
0d017f4d 2005B<WARNING>: Particularly complicated regular expressions can take
14218588 2006exponential time to solve because of the immense number of possible
0d017f4d 2007ways they can use backtracking to try for a match. For example, without
9da458fc
IZ
2008internal optimizations done by the regular expression engine, this will
2009take a painfully long time to run:
c07a80fd 2010
e1901655
IZ
2011 'aaaaaaaaaaaa' =~ /((a{0,5}){0,5})*[c]/
2012
2013And if you used C<*>'s in the internal groups instead of limiting them
2014to 0 through 5 matches, then it would take forever--or until you ran
2015out of stack space. Moreover, these internal optimizations are not
2016always applicable. For example, if you put C<{0,5}> instead of C<*>
2017on the external group, no current optimization is applicable, and the
2018match takes a long time to finish.
c07a80fd 2019
9da458fc
IZ
2020A powerful tool for optimizing such beasts is what is known as an
2021"independent group",
96090e4f 2022which does not backtrack (see L</C<< (?>pattern) >>>). Note also that
9da458fc 2023zero-length look-ahead/look-behind assertions will not backtrack to make
5d458dd8 2024the tail match, since they are in "logical" context: only
14218588 2025whether they match is considered relevant. For an example
9da458fc 2026where side-effects of look-ahead I<might> have influenced the
96090e4f 2027following match, see L</C<< (?>pattern) >>>.
c277df42 2028
a0d0e21e 2029=head2 Version 8 Regular Expressions
d74e8afc 2030X<regular expression, version 8> X<regex, version 8> X<regexp, version 8>
a0d0e21e 2031
5a964f20 2032In case you're not familiar with the "regular" Version 8 regex
a0d0e21e
LW
2033routines, here are the pattern-matching rules not described above.
2034
54310121 2035Any single character matches itself, unless it is a I<metacharacter>
a0d0e21e 2036with a special meaning described here or above. You can cause
5a964f20 2037characters that normally function as metacharacters to be interpreted
5f05dabc 2038literally by prefixing them with a "\" (e.g., "\." matches a ".", not any
0d017f4d
WL
2039character; "\\" matches a "\"). This escape mechanism is also required
2040for the character used as the pattern delimiter.
2041
2042A series of characters matches that series of characters in the target
0b928c2f 2043string, so the pattern C<blurfl> would match "blurfl" in the target
0d017f4d 2044string.
a0d0e21e
LW
2045
2046You can specify a character class, by enclosing a list of characters
5d458dd8 2047in C<[]>, which will match any character from the list. If the
a0d0e21e 2048first character after the "[" is "^", the class matches any character not
14218588 2049in the list. Within a list, the "-" character specifies a
5a964f20 2050range, so that C<a-z> represents all characters between "a" and "z",
8a4f6ac2
GS
2051inclusive. If you want either "-" or "]" itself to be a member of a
2052class, put it at the start of the list (possibly after a "^"), or
2053escape it with a backslash. "-" is also taken literally when it is
2054at the end of the list, just before the closing "]". (The
84850974
DD
2055following all specify the same class of three characters: C<[-az]>,
2056C<[az-]>, and C<[a\-z]>. All are different from C<[a-z]>, which
5d458dd8
YO
2057specifies a class containing twenty-six characters, even on EBCDIC-based
2058character sets.) Also, if you try to use the character
2059classes C<\w>, C<\W>, C<\s>, C<\S>, C<\d>, or C<\D> as endpoints of
2060a range, the "-" is understood literally.
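
The equivalence of the three spellings, and their difference from a real range, can be seen by collecting what each class matches in a small sample string (the string is an arbitrary choice for this sketch):

```perl
use strict;
use warnings;

my $sample = 'a-z m';

my %hits;
for my $class ('[-az]', '[az-]', '[a\-z]', '[a-z]') {
    # In list context, /g returns every character the class matches.
    $hits{$class} = join '', $sample =~ /$class/g;
}
print "$_ matches '$hits{$_}'\n" for sort keys %hits;
```

The first three classes pick out C<a>, C<-> and C<z>; the range also picks up C<m> but not the hyphen.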
a0d0e21e 2061
8ada0baa
JH
2062Note also that the whole range idea is rather unportable between
2063character sets--and even within character sets ranges may cause results
2064you probably didn't expect. A sound principle is to use only ranges
0d017f4d 2065that begin from and end at either alphabetics of equal case ([a-e],
8ada0baa
JH
2066[A-E]), or digits ([0-9]). Anything else is unsafe. If in doubt,
2067spell out the character sets in full.
2068
Characters may be specified using a metacharacter syntax much like that
used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,
"\f" a form feed, etc. More generally, \I<nnn>, where I<nnn> is a string
of three octal digits, matches the character whose coded character set value
is I<nnn>. Similarly, \xI<nn>, where I<nn> are hexadecimal digits,
matches the character whose ordinal is I<nn>. The expression \cI<x>
matches the character control-I<x>. Finally, the "." metacharacter
matches any character except "\n" (unless you use C</s>).

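A small sketch of these escapes follows. It assumes an ASCII-based
platform, where "A" has ordinal 65 (octal 101, hexadecimal 41); the
variable names are illustrative.

```perl
use strict;
use warnings;

# "A" is octal 101 and hexadecimal 41 on ASCII-based platforms.
my $octal_ok = "A" =~ /\101/ ? 1 : 0;
my $hex_ok   = "A" =~ /\x41/ ? 1 : 0;

# \cA is control-A, the character with ordinal 1.
my $ctrl_ok  = chr(1) =~ /\cA/ ? 1 : 0;

# "." rejects a newline by default and accepts it under /s.
my $dot_plain = "\n" =~ /./  ? 1 : 0;
my $dot_s     = "\n" =~ /./s ? 1 : 0;
```

All of the flags end up as 1 except C<$dot_plain>, which stays 0
because "." does not match "\n" without C</s>.
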
You can specify a series of alternatives for a pattern using "|" to
separate them, so that C<fee|fie|foe> will match any of "fee", "fie",
or "foe" in the target string (as would C<f(e|i|o)e>). The
first alternative includes everything from the last pattern delimiter
("(", "(?:", etc. or the beginning of the pattern) up to the first "|", and
the last alternative contains everything from the last "|" to the next
closing pattern delimiter. That's why it's common practice to include
alternatives in parentheses: to minimize confusion about where they
start and end.

Alternatives are tried from left to right, so the first
alternative found for which the entire expression matches is the one that
is chosen. This means that alternatives are not necessarily greedy. For
example: when matching C<foo|foot> against "barefoot", only the "foo"
part will match, as that is the first alternative tried, and it successfully
matches the target string. (This might not seem important, but it is
important when you are capturing matched text using parentheses.)

Also remember that "|" is interpreted as a literal within square brackets,
so if you write C<[fee|fie|foe]> you're really only matching C<[feio|]>.

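The points about alternation made above can be sketched as follows.
The test strings and variable names are illustrative only.

```perl
use strict;
use warnings;

# The leftmost alternative that lets the whole pattern succeed wins.
my ($short) = "barefoot" =~ /(foo|foot)/;
my ($long)  = "barefoot" =~ /(foot|foo)/;

# Without grouping, "^" belongs only to "fee" and "$" only to "foe",
# so the bare "fie" alternative matches inside "fiend".
my $loose   = "fiend" =~ /^fee|fie|foe$/     ? 1 : 0;
my $grouped = "fiend" =~ /^(?:fee|fie|foe)$/ ? 1 : 0;

# Inside square brackets "|" is just another member of the class.
my $pipe = "|" =~ /^[fee|fie|foe]$/ ? 1 : 0;
```

Here C<$short> is "foo" and C<$long> is "foot". The ungrouped pattern
matches "fiend" while the grouped one does not, and the bracketed
version accepts a lone "|".
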
Within a pattern, you may designate subpatterns for later reference
by enclosing them in parentheses, and you may refer back to the
I<n>th subpattern later in the pattern using the metacharacter
\I<n> or \gI<n>. Subpatterns are numbered based on the left to right order
of their opening parenthesis. A backreference matches whatever
actually matched the subpattern in the string being examined, not
the rules for that subpattern. Therefore, C<(0|0x)\d*\s\g1\d*> will
match "0x1234 0x4321", but not "0x1234 01234", because subpattern
1 matched "0x", even though the rule C<0|0x> could potentially match
the leading 0 in the second number.

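The backreference example from the paragraph above can be run
directly. The variable names in this sketch are illustrative.

```perl
use strict;
use warnings;

my $re = qr/(0|0x)\d*\s\g1\d*/;

# Both numbers carry the same "0x" prefix, so \g1 can repeat it.
my $same  = "0x1234 0x4321" =~ $re ? 1 : 0;

# The second number starts with "0" followed by a digit, not "0x".
my $mixed = "0x1234 01234" =~ $re ? 1 : 0;

# In list context the match returns what group 1 actually captured.
my ($prefix) = "0x1234 0x4321" =~ $re;
```

C<$same> is 1, C<$mixed> is 0, and C<$prefix> holds "0x": the
backreference repeats the captured text, not the rule C<0|0x>.
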
=head2 Warning on \1 Instead of $1

Some people get too used to writing things like:

    $pattern =~ s/(\W)/\\\1/g;

This is grandfathered (for \1 to \9) for the RHS of a substitute to avoid
shocking the B<sed> addicts, but it's a dirty habit to get into. That's
because in PerlThink, the righthand side of an C<s///> is a double-quoted
string. C<\1> in the usual double-quoted string means a control-A. The
customary Unix meaning of C<\1> is kludged in for C<s///>. However, if you
get into the habit of doing that, you get yourself into trouble if you then
add an C</e> modifier.

    s/(\d+)/ \1 + 1 /eg;            # causes warning under -w

Or if you try to do

    s/(\d+)/\1000/;

You can't disambiguate that by saying C<\{1}000>, whereas you can fix it with
C<${1}000>. The operation of interpolation should not be confused
with the operation of matching a backreference. Certainly they mean two
different things on the I<left> side of the C<s///>.

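A minimal sketch of the recommended forms, using C<$1> and C<${1}> on
the right-hand side; the sample string and variable names are
illustrative.

```perl
use strict;
use warnings;

my $price = "cost: 7";

# ${1}000 keeps the capture variable separate from the digits after it.
(my $scaled = $price) =~ s/(\d+)/${1}000/;

# With /e the replacement is Perl code, so $1 is an ordinary variable.
(my $bumped = $price) =~ s/(\d+)/ $1 + 1 /e;
```

After this runs, C<$scaled> is "cost: 7000" and C<$bumped> is
"cost: 8".
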
=head2 Repeated Patterns Matching a Zero-length Substring

B<WARNING>: Difficult material (and prose) ahead. This section needs a rewrite.

Regular expressions provide a terse and powerful programming language. As
with most other power tools, power comes together with the ability
to wreak havoc.

A common abuse of this power stems from the ability to make infinite
loops using regular expressions, with something as innocuous as:

    'foo' =~ m{ ( o? )* }x;

The C<o?> matches at the beginning of C<'foo'>, and since the position
in the string is not moved by the match, C<o?> would match again and again
because of the C<*> quantifier. Another common way to create a similar cycle
is with the looping modifier C<//g>:

    @matches = ( 'foo' =~ m{ o? }xg );

or

    print "match: <$&>\n" while 'foo' =~ m{ o? }xg;

or the loop implied by split().

However, long experience has shown that many programming tasks may
be significantly simplified by using repeated subexpressions that
may match zero-length substrings. Here's a simple example:

    @chars = split //, $string;           # // is not magic in split
    ($whitewashed = $string) =~ s/()/ /g; # parens avoid magic s// /

Thus Perl allows such constructs, by I<forcefully breaking
the infinite loop>. The rules for this are different for lower-level
loops given by the greedy quantifiers C<*+{}>, and for higher-level
ones like the C</g> modifier or split() operator.

The lower-level loops are I<interrupted> (that is, the loop is
broken) when Perl detects that a repeated expression matched a
zero-length substring. Thus

   m{ (?: NON_ZERO_LENGTH | ZERO_LENGTH )* }x;

is made equivalent to

   m{ (?: NON_ZERO_LENGTH )* (?: ZERO_LENGTH )? }x;

For example, this program

   #!perl -l
   "aaaaab" =~ /
     (?:
        a                 # non-zero
        |                 # or
       (?{print "hello"}) # print hello whenever this
                          #    branch is tried
       (?=(b))            # zero-width assertion
     )*  # any number of times
    /x;
   print $&;
   print $1;

prints

   hello
   aaaaa
   b

Notice that "hello" is only printed once, as when Perl sees that the sixth
iteration of the outermost C<(?:)*> matches a zero-length string, it stops
the C<*>.

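The behaviour of the lower-level loops can be observed without
printing from inside the pattern. The following is a sketch with
illustrative variable names. The C<no warnings 'regexp'> line is a
precaution, on the assumption that some Perl versions may warn about
quantified subpatterns that can match the empty string.

```perl
use strict;
use warnings;
no warnings 'regexp';   # precaution against null-string notices

# The quantified group can match the empty string, yet the match ends.
my $terminated = 'foo' =~ m{ ( o? )* }x ? 1 : 0;
my $where      = $-[0];           # offset where the match starts
my $length     = $+[0] - $-[0];   # length of the overall match

# One zero-length iteration is allowed after the non-empty ones.
my ($run, $peek);
if ("aaaaab" =~ / ( (?: a | (?=(b)) )* ) /x) {
    ($run, $peek) = ($1, $2);
}

# An empty pattern in split yields the individual characters.
my @chars = split //, "abc";
```

The first match succeeds with an empty match at offset 0. In the
second, C<$run> is "aaaaa" and C<$peek> is "b", which agrees with the
output of the program shown above.
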
The higher-level loops preserve an additional state between iterations:
whether the last match was zero-length. To break the loop, the match
that follows a zero-length match is not allowed to have a length of zero.
This prohibition interacts with backtracking (see L<"Backtracking">),
and so the I<second best> match is chosen if the I<best> match is of
zero length.

For example:

    $_ = 'bar';
    s/\w??/<$&>/g;

results in C<< <><b><><a><><r><> >>. At each position of the string the best
match given by non-greedy C<??> is the zero-length match, and the I<second
best> match is what is matched by C<\w>. Thus zero-length matches
alternate with one-character-long matches.

Similarly, for repeated C<m/()/g> the second-best match is the match at the
position one notch further in the string.

The additional state of being I<matched with zero-length> is associated with
the matched string, and is reset by each assignment to pos().
Zero-length matches at the end of the previous match are ignored
during C<split>.

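The higher-level rules can be sketched in the same way. The variable
names are illustrative, and the values described afterwards follow
from the rules given above.

```perl
use strict;
use warnings;

# Zero-length and one-character matches alternate under s///g.
(my $marked = 'bar') =~ s/\w??/<$&>/g;

# In list context, //g collects every match, including the empty ones.
my @pieces = 'foo' =~ m{ o? }xg;

# Repeated m/()/g steps through the string one position at a time.
my $text = 'ab';
my @stops;
push @stops, pos($text) while $text =~ /()/g;
```

C<$marked> becomes C<< <><b><><a><><r><> >>. C<@pieces> holds four
elements: an empty match before "f", the two "o" characters, and an
empty match at the end of the string. C<@stops> holds the offsets 0,
1 and 2.
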
=head2 Combining RE Pieces

Each of the elementary pieces of regular expressions which were described
before (such as C<ab> or C<\Z>) could match at most one substring
at the given position of the input string. However, in a typical regular
expression these elementary pieces are combined into more complicated
patterns using combining operators C<ST>, C<S|T>, C<S*> etc.
(in these examples C<S> and C<T> are regular subexpressions).

Such combinations can include alternatives, leading to a problem of choice:
if we match a regular expression C<a|ab> against C<"abc">, will it match
substring C<"a"> or C<"ab">? One way to describe which substring is
actually matched is the concept of backtracking (see L<"Backtracking">).
However, this description is too low-level and makes you think
in terms of a particular implementation.

Another description starts with notions of "better"/"worse". All the
substrings which may be matched by the given regular expression can be
sorted from the "best" match to the "worst" match, and it is the "best"
match which is chosen. This replaces the question of "what is chosen?"
with the question of "which matches are better, and which are worse?".

Again, for elementary pieces there is no such question, since at most
one match at a given position is possible. This section describes the
notion of better/worse for combining operators. In the description
below C<S> and C<T> are regular subexpressions.

=over 4

=item C<ST>

Consider two possible matches, C<AB> and C<A'B'>, where C<A> and C<A'> are
substrings which can be matched by C<S>, and C<B> and C<B'> are substrings
which can be matched by C<T>.

If C<A> is a better match for C<S> than C<A'>, C<AB> is a better
match than C<A'B'>.

If C<A> and C<A'> coincide: C<AB> is a better match than C<AB'> if
C<B> is a better match for C<T> than C<B'>.

=item C<S|T>

When C<S> can match, it is a better match than when only C<T> can match.

The ordering of two matches for C<S> is the same as for C<S> alone.
Similarly for two matches for C<T>.

=item C<S{REPEAT_COUNT}>

Matches as C<SSS...S> (repeated as many times as necessary).

=item C<S{min,max}>

Matches as C<S{max}|S{max-1}|...|S{min+1}|S{min}>.

=item C<S{min,max}?>

Matches as C<S{min}|S{min+1}|...|S{max-1}|S{max}>.

=item C<S?>, C<S*>, C<S+>

Same as C<S{0,1}>, C<S{0,BIG_NUMBER}>, C<S{1,BIG_NUMBER}> respectively.

=item C<S??>, C<S*?>, C<S+?>

Same as C<S{0,1}?>, C<S{0,BIG_NUMBER}?>, C<S{1,BIG_NUMBER}?> respectively.

=item C<< (?>S) >>

Matches the best match for C<S> and only that.

=item C<(?=S)>, C<(?<=S)>

Only the best match for C<S> is considered. (This is important only if
C<S> has capturing parentheses, and backreferences are used somewhere
else in the whole regular expression.)

=item C<(?!S)>, C<(?<!S)>

For this grouping operator there is no need to describe the ordering, since
only whether or not C<S> can match is important.

=item C<(??{ EXPR })>, C<(?PARNO)>

The ordering is the same as for the regular expression which is
the result of EXPR, or the pattern contained by capture group PARNO.

=item C<(?(condition)yes-pattern|no-pattern)>

Recall that which of C<yes-pattern> or C<no-pattern> actually matches is
already determined. The ordering of the matches is the same as for the
chosen subexpression.

=back

The above recipes describe the ordering of matches I<at a given position>.
One more rule is needed to understand how a match is determined for the
whole regular expression: a match at an earlier position is always better
than a match at a later position.

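Several of the ordering rules above can be observed with a small
script. This is a sketch only: the helper C<matched> and the keys of
C<%got> are illustrative names.

```perl
use strict;
use warnings;

# Return the substring matched by $re in $str, or undef on failure.
sub matched {
    my ($str, $re) = @_;
    return $str =~ $re ? substr($str, $-[0], $+[0] - $-[0]) : undef;
}

my %got = (
    alt_first => matched("abc",  qr/a|ab/),      # S|T prefers S
    alt_swap  => matched("abc",  qr/ab|a/),      # order of alternatives
    greedy    => matched("aaa",  qr/a{1,2}/),    # tries max first
    lazy      => matched("aaa",  qr/a{1,2}?/),   # tries min first
    atomic    => matched("aaab", qr/(?>a*)ab/),  # only the best a*
    plain     => matched("aaab", qr/a*ab/),      # may give back an "a"
    earliest  => matched("xab",  qr/b|ab/),      # earlier position wins
);

# In ST, the best match for S is settled before T is considered.
my ($s_part, $t_part) = "aab" =~ /(a*)(a?)/;
```

The results are "a", "ab", "aa" and "a" for the first four entries.
The atomic pattern fails (C<undef>) while the plain one matches
"aaab". C<earliest> is "ab" because a match starting at offset 1
beats the better-ranked alternative C<b> starting at offset 2.
C<$s_part> is "aa" and C<$t_part> is empty.
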
=head2 Creating Custom RE Engines

As of Perl 5.10.0, one can create custom regular expression engines. This
is not for the faint of heart, as they have to plug in at the C level. See
L<perlreapi> for more details.

As an alternative, overloaded constants (see L<overload>) provide a simple
way to extend the functionality of the RE engine, by substituting one
pattern for another.

Suppose that we want to enable a new RE escape-sequence C<\Y|> which
matches at a boundary between whitespace characters and non-whitespace
characters. Note that C<(?=\S)(?<!\S)|(?!\S)(?<=\S)> matches exactly
at these positions, so we want to have each C<\Y|> in the place of the
more complicated version. We can create a module C<customre> to do
this:

    package customre;
    use overload;

    # Install convert() as the handler for constant regex pieces.
    sub import {
      shift;
      die "No argument to customre::import allowed" if @_;
      overload::constant 'qr' => \&convert;
    }

    sub invalid { die "/$_[0]/: invalid escape '\\$_[1]'"}

    # We must also take care of not escaping the legitimate \\Y|
    # sequence, hence the presence of '\\' in the conversion rules.
    my %rules = ( '\\' => '\\\\',
                  'Y|' => qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/ );

    # Replace each \Y| in the pattern text; die on any other \Y escape.
    sub convert {
      my $re = shift;
      $re =~ s{
                \\ ( \\ | Y . )
              }
              { $rules{$1} or invalid($re,$1) }sgex;
      return $re;
    }

Now C<use customre> enables the new escape in constant regular
expressions, i.e., those without any runtime variable interpolations.
As documented in L<overload>, this conversion will work only over
literal parts of regular expressions. For C<\Y|$re\Y|> the variable
part of this regular expression needs to be converted explicitly
(but only if the special meaning of C<\Y|> should be enabled inside $re):

    use customre;
    $re = <>;
    chomp $re;
    $re = customre::convert $re;
    /\Y|$re\Y|/;

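The pattern that C<\Y|> stands for can be checked on its own, without
the C<customre> module. In this sketch the sample string and the
variable names are illustrative.

```perl
use strict;
use warnings;

# The pattern that \Y| is meant to stand for.
my $boundary = qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/;

# Collect every offset in the sample string where the pattern matches.
my $sample = "ab  cd";
my @offsets;
while ($sample =~ /$boundary/g) {
    push @offsets, pos($sample);
}
```

For "ab  cd" (two spaces in the middle) the offsets are 0, 2, 4 and
6: the start and end of each run of non-whitespace characters.
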
=head2 PCRE/Python Support

As of Perl 5.10.0, Perl supports several Python/PCRE-specific extensions
to the regex syntax. While Perl programmers are encouraged to use the
Perl-specific syntax, the following are also accepted:

=over 4

=item C<< (?PE<lt>NAMEE<gt>pattern) >>

Define a named capture group. Equivalent to C<< (?<NAME>pattern) >>.

=item C<< (?P=NAME) >>

Backreference to a named capture group. Equivalent to C<< \g{NAME} >>.

=item C<< (?P>NAME) >>

Subroutine call to a named capture group. Equivalent to C<< (?&NAME) >>.

=back

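The three forms listed above can be tried side by side with their
Perl-specific equivalents. This is a sketch that needs Perl 5.10 or
later; the group names C<pair> and C<num> are illustrative.

```perl
use strict;
use warnings;

# Python-style named capture and named backreference.
my $py   = "abab" =~ /^(?P<pair>ab)(?P=pair)$/ ? $+{pair} : undef;

# The Perl-specific spelling of the same pattern.
my $perl = "abab" =~ /^(?<pair>ab)\g{pair}$/   ? $+{pair} : undef;

# (?P>num) re-runs the pattern of group "num" instead of repeating
# the text that the group captured.
my $call = "2024-05" =~ /^(?P<num>\d+)-(?P>num)$/ ? $+{num} : undef;
```

Both C<$py> and C<$perl> are "ab". C<$call> is "2024": the
subroutine call matches "05" using the group's pattern, which a
backreference to the captured "2024" could not do.
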
=head1 BUGS

Many regular expression constructs don't work on EBCDIC platforms.

There are a number of issues with regard to case-insensitive matching
in Unicode rules. See C<i> under L</Modifiers> above.

This document varies from difficult to understand to completely
and utterly opaque. The wandering prose riddled with jargon is
hard to fathom in several places.

This document needs a rewrite that separates the tutorial content
from the reference content.

=head1 SEE ALSO

L<perlrequick>.

L<perlretut>.

L<perlop/"Regexp Quote-Like Operators">.

L<perlop/"Gory details of parsing quoted constructs">.

L<perlfaq6>.

L<perlfunc/pos>.

L<perllocale>.

L<perlebcdic>.

I<Mastering Regular Expressions> by Jeffrey Friedl, published
by O'Reilly and Associates.