=head1 NAME
X<regular expression> X<regex> X<regexp>

perlre - Perl regular expressions

=head1 DESCRIPTION

This page describes the syntax of regular expressions in Perl.

If you haven't used regular expressions before, a quick-start
introduction is available in L<perlrequick>, and a longer tutorial
introduction is available in L<perlretut>.

For reference on how regular expressions are used in matching
operations, plus various examples of the same, see discussions of
C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/"Regexp Quote-Like
Operators">.

=head2 Modifiers

Matching operations can have various modifiers. Modifiers
that relate to the interpretation of the regular expression inside
are listed below. Modifiers that alter the way a regular expression
is used by Perl are detailed in L<perlop/"Regexp Quote-Like Operators"> and
L<perlop/"Gory details of parsing quoted constructs">.

=over 4

=item m
X</m> X<regex, multiline> X<regexp, multiline> X<regular expression, multiline>

Treat string as multiple lines. That is, change "^" and "$" from matching
the start or end of the string to matching the start or end of any
line anywhere within the string.

=item s
X</s> X<regex, single-line> X<regexp, single-line>
X<regular expression, single-line>

Treat string as single line. That is, change "." to match any character
whatsoever, even a newline, which normally it would not match.

Used together, as C</ms>, they let the "." match any character whatsoever,
while still allowing "^" and "$" to match, respectively, just after
and just before newlines within the string.

=item i
X</i> X<regex, case-insensitive> X<regexp, case-insensitive>
X<regular expression, case-insensitive>

Do case-insensitive pattern matching.

If locale matching rules are in effect, the case map is taken from the
current locale for code points less than 256, and from Unicode rules for
larger code points. However, matches that would cross the Unicode
rules/non-Unicode rules boundary (ords 255/256) will not succeed. See
L<perllocale>.

There are a number of Unicode characters that match multiple characters
under C</i>. For example, C<LATIN SMALL LIGATURE FI>
should match the sequence C<fi>. Perl is not
currently able to do this when the multiple characters are in the pattern and
are split between groupings, or when one or more are quantified. Thus

    "\N{LATIN SMALL LIGATURE FI}" =~ /fi/i;          # Matches
    "\N{LATIN SMALL LIGATURE FI}" =~ /[fi][fi]/i;    # Doesn't match!
    "\N{LATIN SMALL LIGATURE FI}" =~ /fi*/i;         # Doesn't match!

    # The below doesn't match, and it isn't clear what $1 and $2 would
    # be even if it did!!
    "\N{LATIN SMALL LIGATURE FI}" =~ /(f)(i)/i;      # Doesn't match!

Perl doesn't match multiple characters in an inverted bracketed
character class, which otherwise could be highly confusing. See
L<perlrecharclass/Negation>.

Also, Perl matching doesn't fully conform to the current Unicode C</i>
recommendations, which ask that the matching be made upon the NFD
(Normalization Form Decomposed) of the text. However, Unicode is
in the process of reconsidering and revising its recommendations.

=item x
X</x>

Extend your pattern's legibility by permitting whitespace and comments.
Details in L</"/x">.

=item p
X</p> X<regex, preserve> X<regexp, preserve>

Preserve the string matched such that ${^PREMATCH}, ${^MATCH}, and
${^POSTMATCH} are available for use after matching.

=item g and c
X</g> X</c>

Global matching, and keep the Current position after failed matching.
Unlike i, m, s and x, these two flags affect the way the regex is used
rather than the regex itself. See
L<perlretut/"Using regular expressions in Perl"> for further explanation
of the g and c modifiers.

=item a, d, l and u
X</a> X</d> X</l> X</u>

These modifiers, all new in 5.14, affect which character-set semantics
(Unicode, etc.) are used, as described below in
L</Character set modifiers>.

=back

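As a rough sketch of how the C</m>, C</s> and C</p> modifiers listed above
behave (the sample strings and variable names are invented purely for
illustration):

    use strict;
    use warnings;

    my $text = "alpha\nbeta\ngamma";

    # Without /m, "^" matches only at the very start of the string.
    my @first  = $text =~ /^(\w+)/g;      # ("alpha")

    # With /m, "^" also matches just after each embedded newline.
    my @starts = $text =~ /^(\w+)/mg;     # ("alpha", "beta", "gamma")

    # Without /s, "." stops at a newline; with /s it can cross one.
    my ($one_line) = $text =~ /(a.*a)/;   # "alpha"
    my ($spanning) = $text =~ /(a.*a)/s;  # the whole of $text

    # /p makes ${^PREMATCH}, ${^MATCH} and ${^POSTMATCH} available.
    my ($pre, $hit, $post);
    if ("key=value" =~ /=/p) {
        ($pre, $hit, $post) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});
    }
    # $pre is "key", $hit is "=", $post is "value"

C<@first> holds a single element because, without C</m>, "^" cannot
match again after the first line.
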
Regular expression modifiers are usually written in documentation
as e.g., "the C</x> modifier", even though the delimiter
in question might not really be a slash. The modifiers C</imsxadlup>
may also be embedded within the regular expression itself using
the C<(?...)> construct, see L</Extended Patterns> below.

=head3 /x

C</x> tells
the regular expression parser to ignore most whitespace that is neither
backslashed nor within a character class. You can use this to break up
your regular expression into (slightly) more readable parts. The C<#>
character is also treated as a metacharacter introducing a comment,
just as in ordinary Perl code. This also means that if you want real
whitespace or C<#> characters in the pattern (outside a character
class, where they are unaffected by C</x>), then you'll either have to
escape them (using backslashes or C<\Q...\E>) or encode them using octal,
hex, or C<\N{}> escapes. Taken together, these features go a long way towards
making Perl's regular expressions more readable.

Note that you have to
be careful not to include the pattern delimiter in the comment--perl has
no way of knowing you did not intend to close the pattern early. See
the C-comment deletion code in L<perlop>. Also note that anything inside
a C<\Q...\E> stays unaffected by C</x>.

And note that C</x> doesn't affect
space interpretation within a single multi-character construct. For
example in C<\x{...}>, regardless of the C</x> modifier, there can be no
spaces. Same for a L<quantifier|/Quantifiers> such as C<{3}> or
C<{5,}>. Similarly, C<(?:...)> can't have a space between the C<?> and C<:>,
but can between the C<(> and C<?>. Within any delimiters for such a
construct, allowed spaces are not affected by C</x>, and depend on the
construct. For example, C<\x{...}> can't have spaces because hexadecimal
numbers don't have spaces in them. But, Unicode properties can have spaces, so
in C<\p{...}> there can be spaces that follow the Unicode rules, for which see
L<perluniprops/Properties accessible through \p{} and \P{}>.
X</x>

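The following is a small sketch of a pattern written with C</x>; the
pattern, the sample string, and the names C<$date_re> and C<@parts> are
assumptions made up for this example. Note that none of the comments
contains the C</> delimiter.

    use strict;
    use warnings;

    my $date_re = qr/
        (\d{4})     # year
        -           # literal hyphen
        (\d{2})     # month
        -
        (\d{2})     # day
        [ ]         # a space inside a character class is still literal
        \#\d+       # an escaped "#" followed by digits
    /x;

    my @parts = "2011-05-14 #42" =~ $date_re;   # ("2011", "05", "14")

The space and the C<#> have to be protected (here with a character class
and a backslash) precisely because C</x> would otherwise discard the one
and treat the other as the start of a comment.
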
=head3 Character set modifiers

C</d>, C</u>, C</a>, and C</l>, available starting in 5.14, are called
the character set modifiers; they affect the character set semantics
used for the regular expression.

At any given time, exactly one of these modifiers is in effect. Once
compiled, the behavior doesn't change regardless of what rules are in
effect when the regular expression is executed. And if a regular
expression is interpolated into a larger one, the original's rules
continue to apply to it, and only it.

Note that the modifiers affect only pattern matching, and do not extend
to any replacement done. For example,

    s/foo/\Ubar/l

will uppercase "bar", but the C</l> does not affect how the C<\U>
operates. If C<use locale> is in effect, the C<\U> will use locale
rules; if C<use feature 'unicode_strings'> is in effect, it will
use Unicode rules, etc.

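To illustrate the point that an interpolated regular expression keeps its
own rules, here is a minimal sketch (the variable names are ours, and a
Perl of at least 5.14 is assumed):

    use strict;
    use warnings;

    my $ascii_digit  = qr/\d/a;      # compiled under /a
    my $arabic_three = "\x{663}";    # ARABIC-INDIC DIGIT THREE

    # Under /u alone, \d matches any Unicode decimal digit.
    my $plain    = $arabic_three =~ /^\d$/u           ? 1 : 0;   # 1

    # The interpolated qr// object still applies its own /a rules.
    my $embedded = $arabic_three =~ /^$ascii_digit$/u ? 1 : 0;   # 0

The outer C</u> governs only the anchors here; the C<\d> inside
C<$ascii_digit> remains restricted to ASCII.
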
=head4 /l

means to use the current locale's rules (see L<perllocale>) when pattern
matching. For example, C<\w> will match the "word" characters of that
locale, and C<"/i"> case-insensitive matching will match according to
the locale's case folding rules. The locale used will be the one in
effect at the time of execution of the pattern match. This may not be
the same as the compilation-time locale, and can differ from one match
to another if there is an intervening call of the
L<setlocale() function|perllocale/The setlocale function>.

Perl only supports single-byte locales. This means that code points
above 255 are treated as Unicode no matter what locale is in effect.
Under Unicode rules, there are a few case-insensitive matches that cross
the 255/256 boundary. These are disallowed under C</l>. For example,
0xFF (on ASCII platforms) does not caselessly match the character at
0x178, C<LATIN CAPITAL LETTER Y WITH DIAERESIS>, because 0xFF may not be
C<LATIN SMALL LETTER Y WITH DIAERESIS> in the current locale, and Perl
has no way of knowing if that character even exists in the locale, much
less what code point it is.

This modifier may be specified to be the default by C<use locale>, but
see L</Which character set modifier is in effect?>.
X</l>

=head4 /u

means to use Unicode rules when pattern matching. On ASCII platforms,
this means that the code points between 128 and 255 take on their
Latin-1 (ISO-8859-1) meanings (which are the same as Unicode's), whereas
in strict ASCII their meanings are undefined. Thus the platform
effectively becomes a Unicode platform; hence, for example, C<\w> will
match any of the more than 100_000 word characters in Unicode.

Unlike most locales, which are specific to a language and country pair,
Unicode classifies all the characters that are letters I<somewhere> in
the world as
C<\w>. For example, your locale might not think that C<LATIN SMALL
LETTER ETH> is a letter (unless you happen to speak Icelandic), but
Unicode does. Similarly, all the characters that are decimal digits
somewhere in the world will match C<\d>; this means hundreds of
possible matches, not 10. And some of those digits look like some of the 10
ASCII digits, but mean a different number, so a human could easily think
a number is a different quantity than it really is. For example,
C<BENGALI DIGIT FOUR> (U+09EA) looks very much like an
C<ASCII DIGIT EIGHT> (U+0038). And C<\d+> may match strings of digits
that are a mixture from different writing systems, creating a security
issue. L<Unicode::UCDE<sol>num()|Unicode::UCD/num> can be used to sort
this out. Or the C</a> modifier can be used to force C<\d> to match
just the ASCII 0 through 9.

Also, under this modifier, case-insensitive matching works on the full
set of Unicode
characters. The C<KELVIN SIGN>, for example, matches the letters "k" and
"K"; and C<LATIN SMALL LIGATURE FF> matches the sequence "ff", which,
if you're not prepared, might make it look like a hexadecimal constant,
presenting another potential security issue. See
L<http://unicode.org/reports/tr36> for a detailed discussion of Unicode
security issues.

On the EBCDIC platforms that Perl handles, the native character set is
equivalent to Latin-1. Thus this modifier changes behavior only when
the C<"/i"> modifier is also specified, and it turns out it affects only
two characters, giving them full Unicode semantics: the C<MICRO SIGN>
will match the Greek capital and small letters C<MU>, otherwise not; and
the C<LATIN CAPITAL LETTER SHARP S> will match any of C<SS>, C<Ss>,
C<sS>, and C<ss>, otherwise not.

This modifier may be specified to be the default by C<use feature
'unicode_strings'>, but see
L</Which character set modifier is in effect?>.
X</u>

=head4 /a

is the same as C</u>, except that C<\d>, C<\s>, C<\w>, and the
Posix character classes are restricted to matching in the ASCII range
only. That is, with this modifier, C<\d> always means precisely the
digits C<"0"> to C<"9">; C<\s> means the five characters C<[ \f\n\r\t]>;
C<\w> means the 63 characters C<[A-Za-z0-9_]>; and likewise, all the
Posix classes such as C<[[:print:]]> match only the appropriate
ASCII-range characters.

This modifier is useful for people who only incidentally use Unicode.
With it, you can write C<\d> with confidence that it will only match
ASCII characters, and should the need arise to match beyond ASCII, you
can use C<\p{Digit}>, or C<\p{Word}> for C<\w>. There are similar
C<\p{...}> constructs that can match white space and Posix classes
beyond ASCII. See L<perlrecharclass/POSIX Character Classes>.

As you would expect, this modifier causes, for example, C<\D> to mean
the same thing as C<[^0-9]>; in fact, all non-ASCII characters match
C<\D>, C<\S>, and C<\W>. C<\b> still means to match at the boundary
between C<\w> and C<\W>, using the C</a> definitions of them (similarly
for C<\B>).

Otherwise, C</a> behaves like the C</u> modifier, in that
case-insensitive matching uses Unicode semantics; for example, "k" will
match the Unicode C<\N{KELVIN SIGN}> under C</i> matching, and code
points in the Latin1 range, above ASCII, will have Unicode rules when it
comes to case-insensitive matching.

To forbid ASCII/non-ASCII matches (like "k" with C<\N{KELVIN SIGN}>),
specify the "a" twice, for example C</aai> or C</aia>.

To reiterate, this modifier provides protection for applications that
don't wish to be exposed to all of Unicode. Specifying it twice
gives added protection.

This modifier may be specified to be the default by C<use re '/a'>
or C<use re '/aa'>, but see
L</Which character set modifier is in effect?>.
X</a>
X</aa>

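A short sketch of the difference between C</a> and C</aa>, and of C</a>
versus an explicit Unicode property (the variable names are invented for
this example; the code points are written in hex so that no C<charnames>
loading is needed):

    use strict;
    use warnings;

    my $kelvin = "\x{212A}";          # KELVIN SIGN

    # /a restricts \d, \s, \w, but case folding still follows Unicode.
    my $fold_a  = $kelvin =~ /^k$/ai  ? 1 : 0;   # 1

    # /aa additionally forbids ASCII/non-ASCII case-insensitive matches.
    my $fold_aa = $kelvin =~ /^k$/aai ? 1 : 0;   # 0

    my $bengali_four = "\x{9EA}";     # BENGALI DIGIT FOUR

    my $ascii_only = $bengali_four =~ /^\d$/a       ? 1 : 0;   # 0
    my $any_digit  = $bengali_four =~ /^\p{Digit}$/ ? 1 : 0;   # 1

Code that validates numeric input is one place where the stricter
C</a> behavior of C<\d> is likely to be what is wanted.
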
=head4 /d

This modifier means to use the "Default" native rules of the platform
except when there is cause to use Unicode rules instead, as follows:

=over 4

=item 1

the target string is encoded in UTF-8; or

=item 2

the pattern is encoded in UTF-8; or

=item 3

the pattern explicitly mentions a code point that is above 255 (say by
C<\x{100}>); or

=item 4

the pattern uses a Unicode name (C<\N{...}>); or

=item 5

the pattern uses a Unicode property (C<\p{...}>)

=back

Another mnemonic for this modifier is "Depends", as the rules actually
used depend on various things, and as a result you can get unexpected
results. See L<perlunicode/The "Unicode Bug">.

On ASCII platforms, the native rules are ASCII, and on EBCDIC platforms
(at least the ones that Perl handles), they are Latin-1.

Here are some examples of how that works on an ASCII platform:

    $str =  "\xDF";      # $str is not in UTF-8 format.
    $str =~ /^\w/;       # No match, as $str isn't in UTF-8 format.
    $str .= "\x{0e0b}";  # Now $str is in UTF-8 format.
    $str =~ /^\w/;       # Match! $str is now in UTF-8 format.
    chop $str;
    $str =~ /^\w/;       # Still a match! $str remains in UTF-8 format.

=head4 Which character set modifier is in effect?

Which of these modifiers is in effect at any given point in a regular
expression depends on a fairly complex set of interactions. As
explained below in L</Extended Patterns> it is possible to explicitly
specify modifiers that apply only to portions of a regular expression.
The innermost always has priority over any outer ones, and one applying
to the whole expression has priority over any of the default settings that are
described in the remainder of this section.

The C<L<use re 'E<sol>foo'|re/"'/flags' mode">> pragma can be used to set
default modifiers (including these) for regular expressions compiled
within its scope. This pragma has precedence over the other pragmas
listed below that also change the defaults.

Otherwise, C<L<use locale|perllocale>> sets the default modifier to C</l>;
and C<L<use feature 'unicode_strings'|feature>> or
C<L<use 5.012|perlfunc/use VERSION>> (or higher) set the default to
C</u> when not in the same scope as either C<L<use locale|perllocale>>
or C<L<use bytes|bytes>>. Unlike the mechanisms mentioned above, these
affect operations besides regular expression pattern matching, and so
give more consistent results with other operators, including using
C<\U>, C<\l>, etc. in substitution replacements.

If none of the above apply, for backwards compatibility reasons, the
C</d> modifier is the one in effect by default. As this can lead to
unexpected results, it is best to specify which other rule set should be
used.

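As a hedged illustration of why stating the rule set explicitly can
matter, consider a string that is not stored internally as UTF-8 (the
variable names are ours, and the results noted are for an ASCII platform):

    use strict;
    use warnings;

    # One native byte: LATIN SMALL LETTER E WITH ACUTE in Latin-1.
    my $e_acute = "\xE9";

    # Under /d the native (ASCII) rules apply, so this is not a word char.
    my $under_d = $e_acute =~ /^\w$/d ? 1 : 0;   # 0

    # Under /u the Latin-1 meaning is used, so it is.
    my $under_u = $e_acute =~ /^\w$/u ? 1 : 0;   # 1

Writing the modifier on the pattern itself takes priority over any of the
defaults described in this section.
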
=head4 Character set modifier behavior prior to Perl 5.14

Prior to 5.14, there were no explicit modifiers, but C</l> was implied
for regexes compiled within the scope of C<use locale>, and C</d> was
implied otherwise. However, interpolating a regex into a larger regex
would ignore the original compilation in favor of whatever was in effect
at the time of the second compilation. There were a number of
inconsistencies (bugs) with the C</d> modifier, where Unicode rules
would be used when inappropriate, and vice versa. C<\p{}> did not imply
Unicode rules, and neither did all occurrences of C<\N{}>, until 5.12.

=head2 Regular Expressions

=head3 Metacharacters

The patterns used in Perl pattern matching evolved from those supplied in
the Version 8 regex routines. (The routines are derived
(distantly) from Henry Spencer's freely redistributable reimplementation
of the V8 routines.) See L</Version 8 Regular Expressions> for
details.

In particular the following metacharacters have their standard I<egrep>-ish
meanings:
X<metacharacter>
X<\> X<^> X<.> X<$> X<|> X<(> X<()> X<[> X<[]>

    \        Quote the next metacharacter
    ^        Match the beginning of the line
    .        Match any character (except newline)
    $        Match the end of the line (or before newline at the end)
    |        Alternation
    ()       Grouping
    []       Bracketed Character class

By default, the "^" character is guaranteed to match only the
beginning of the string, the "$" character only the end (or before the
newline at the end), and Perl does certain optimizations with the
assumption that the string contains only one line. Embedded newlines
will not be matched by "^" or "$". You may, however, wish to treat a
string as a multi-line buffer, such that the "^" will match after any
newline within the string (except if the newline is the last character in
the string), and "$" will match before any newline. At the
cost of a little more overhead, you can do this by using the C</m> modifier
on the pattern match operator. (Older programs did this by setting C<$*>,
but this option was removed in perl 5.9.)
X<^> X<$> X</m>

To simplify multi-line substitutions, the "." character never matches a
newline unless you use the C</s> modifier, which in effect tells Perl to pretend
the string is a single line--even if it isn't.
X<.> X</s>

=head3 Quantifiers

The following standard quantifiers are recognized:
X<metacharacter> X<quantifier> X<*> X<+> X<?> X<{n}> X<{n,}> X<{n,m}>

    *           Match 0 or more times
    +           Match 1 or more times
    ?           Match 1 or 0 times
    {n}         Match exactly n times
    {n,}        Match at least n times
    {n,m}       Match at least n but not more than m times

(If a curly bracket occurs in any other context and does not form part of
a backslashed sequence like C<\x{...}>, it is treated
as a regular character. In particular, the lower bound
is not optional.) The "*" quantifier is equivalent to C<{0,}>, the "+"
quantifier to C<{1,}>, and the "?" quantifier to C<{0,1}>. n and m are limited
to non-negative integral values less than a preset limit defined when perl is built.
This is usually 32766 on the most common platforms. The actual limit can
be seen in the error message generated by code such as this:

    $_ **= $_ , / {$_} / for 2 .. 42;

By default, a quantified subpattern is "greedy", that is, it will match as
many times as possible (given a particular starting location) while still
allowing the rest of the pattern to match. If you want it to match the
minimum number of times possible, follow the quantifier with a "?". Note
that the meanings don't change, just the "greediness":
X<metacharacter> X<greedy> X<greediness>
X<?> X<*?> X<+?> X<??> X<{n}?> X<{n,}?> X<{n,m}?>

    *?        Match 0 or more times, not greedily
    +?        Match 1 or more times, not greedily
    ??        Match 0 or 1 time, not greedily
    {n}?      Match exactly n times, not greedily (redundant)
    {n,}?     Match at least n times, not greedily
    {n,m}?    Match at least n but not more than m times, not greedily

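A minimal sketch contrasting a greedy quantifier with its non-greedy
counterpart (the sample string and variable names are made up for
illustration):

    use strict;
    use warnings;

    my $markup = "<b>bold</b> and <i>italic</i>";

    # Greedy: ".+" runs to the end, then backs off to the last ">".
    my ($greedy) = $markup =~ /<(.+)>/;    # "b>bold</b> and <i>italic</i"

    # Non-greedy: ".+?" stops at the first ">" that lets the match succeed.
    my ($lazy)   = $markup =~ /<(.+?)>/;   # "b"

Both patterns start matching at the same place; only the amount consumed
by the quantified subpattern differs.
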
By default, when a quantified subpattern does not allow the rest of the
overall pattern to match, Perl will backtrack. However, this behaviour is
sometimes undesirable. Thus Perl provides the "possessive" quantifier form
as well.

    *+        Match 0 or more times and give nothing back
    ++        Match 1 or more times and give nothing back
    ?+        Match 0 or 1 time and give nothing back
    {n}+      Match exactly n times and give nothing back (redundant)
    {n,}+     Match at least n times and give nothing back
    {n,m}+    Match at least n but not more than m times and give nothing back

For instance,

    'aaaa' =~ /a++a/

will never match, as the C<a++> will gobble up all the C<a>'s in the
string and won't leave any for the remaining part of the pattern. This
feature can be extremely useful to give perl hints about where it
shouldn't backtrack. For instance, the typical "match a double-quoted
string" problem can be most efficiently performed when written as:

    /"(?:[^"\\]++|\\.)*+"/

as we know that if the final quote does not match, backtracking will not
help. See the independent subexpression
L</C<< (?>pattern) >>> for more details;
possessive quantifiers are just syntactic sugar for that construct. For
instance the above example could also be written as follows:

    /"(?>(?:(?>[^"\\]+)|\\.)*)"/

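The two possessive examples above can be exercised with a sketch along
these lines (the test strings and variable names are assumptions of this
example, not part of the original text):

    use strict;
    use warnings;

    # "a++" keeps every "a", so nothing is left for the final "a".
    my $possessive = 'aaaa' =~ /a++a/ ? 1 : 0;   # 0
    my $ordinary   = 'aaaa' =~ /a+a/  ? 1 : 0;   # 1

    # The double-quoted-string pattern from the text, stored with qr//.
    my $dq_string = qr/"(?:[^"\\]++|\\.)*+"/;

    # Single quotes keep the backslashes, so the data contains \" twice.
    my ($quoted) = 'say "hi \"there\"" now' =~ /($dq_string)/;
    # $quoted is: "hi \"there\""

The possessive form gives the same result as the ordinary one whenever a
match exists; the difference is that it fails faster when the closing
quote is missing.
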
=head3 Escape sequences

Because patterns are processed as double-quoted strings, the following
also work:

    \t          tab                   (HT, TAB)
    \n          newline               (LF, NL)
    \r          return                (CR)
    \f          form feed             (FF)
    \a          alarm (bell)          (BEL)
    \e          escape (think troff)  (ESC)
    \cK         control char          (example: VT)
    \x{}, \x00  character whose ordinal is the given hexadecimal number
    \N{name}    named Unicode character or character sequence
    \N{U+263D}  Unicode character     (example: FIRST QUARTER MOON)
    \o{}, \000  character whose ordinal is the given octal number
    \l          lowercase next char (think vi)
    \u          uppercase next char (think vi)
    \L          lowercase till \E (think vi)
    \U          uppercase till \E (think vi)
    \Q          quote (disable) pattern metacharacters till \E
    \E          end either case modification or quoted section, think vi

Details are in L<perlop/Quote and Quote-like Operators>.

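As a small illustration of C<\Q> and C<\E> from the table above, the
following sketch matches an interpolated string literally (the variable
names and sample data are invented):

    use strict;
    use warnings;

    my $literal = "a.b*c";

    # Interpolated as-is, "." and "*" keep their special meanings.
    my $loose  = "aXbc" =~ /$literal/     ? 1 : 0;   # 1

    # Between \Q and \E the metacharacters are quoted.
    my $strict = "aXbc" =~ /\Q$literal\E/ ? 1 : 0;   # 0
    my $exact  = "xa.b*cx" =~ /\Q$literal\E/ ? 1 : 0;   # 1

This is the usual way to search for user-supplied text without letting it
be interpreted as a pattern.
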
=head3 Character Classes and other Special Escapes

In addition, Perl defines the following:
X<\g> X<\k> X<\K> X<backreference>

    Sequence   Note  Description
    [...]      [1]   Match a character according to the rules of the
                      bracketed character class defined by the "...".
                      Example: [a-z] matches "a" or "b" or "c" ... or "z"
    [[:...:]]  [2]   Match a character according to the rules of the POSIX
                      character class "..." within the outer bracketed
                      character class. Example: [[:upper:]] matches any
                      uppercase character.
    \w         [3]   Match a "word" character (alphanumeric plus "_", plus
                      other connector punctuation chars plus Unicode
                      marks)
    \W         [3]   Match a non-"word" character
    \s         [3]   Match a whitespace character
    \S         [3]   Match a non-whitespace character
    \d         [3]   Match a decimal digit character
    \D         [3]   Match a non-digit character
    \pP        [3]   Match P, named property. Use \p{Prop} for longer names
    \PP        [3]   Match non-P
    \X         [4]   Match Unicode "eXtended grapheme cluster"
    \C               Match a single C-language char (octet) even if that is
                      part of a larger UTF-8 character. Thus it breaks up
                      characters into their UTF-8 bytes, so you may end up
                      with malformed pieces of UTF-8. Unsupported in
                      lookbehind.
    \1         [5]   Backreference to a specific capture group or buffer.
                      '1' may actually be any positive integer.
    \g1        [5]   Backreference to a specific or previous group,
    \g{-1}     [5]   The number may be negative indicating a relative
                      previous group and may optionally be wrapped in
                      curly brackets for safer parsing.
    \g{name}   [5]   Named backreference
    \k<name>   [5]   Named backreference
    \K         [6]   Keep the stuff left of the \K, don't include it in $&
    \N         [7]   Any character but \n (experimental). Not affected by
                      /s modifier
    \v         [3]   Vertical whitespace
    \V         [3]   Not vertical whitespace
    \h         [3]   Horizontal whitespace
    \H         [3]   Not horizontal whitespace
    \R         [4]   Linebreak

=over 4

=item [1]

See L<perlrecharclass/Bracketed Character Classes> for details.

=item [2]

See L<perlrecharclass/POSIX Character Classes> for details.

=item [3]

See L<perlrecharclass/Backslash sequences> for details.

=item [4]

See L<perlrebackslash/Misc> for details.

=item [5]

See L</Capture groups> below for details.

=item [6]

See L</Extended Patterns> below for details.

=item [7]

Note that C<\N> has two meanings. When of the form C<\N{NAME}>, it matches the
character or character sequence whose name is C<NAME>; and similarly
when of the form C<\N{U+I<hex>}>, it matches the character whose Unicode
code point is I<hex>. Otherwise it matches any character but C<\n>.

=back

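A few of the escapes in the table can be tried out with a sketch like the
following (file name, sample strings and variable names are all invented
for illustration; C<\K>, C<\R> and C<\h> need Perl 5.10 or later):

    use strict;
    use warnings;

    # \K: everything matched to its left is kept out of the replaced text.
    (my $backup = "notes.tar.gz") =~ s{\.\K\w+\z}{bak};   # "notes.tar.bak"

    # \R: a generic linebreak; "\r\n" counts as a single one.
    my @lines  = split /\R/, "one\r\ntwo\nthree";    # ("one", "two", "three")

    # \h: horizontal whitespace such as spaces and tabs.
    my @fields = split /\h+/, "a \t b";              # ("a", "b")

With C<\K> the final dot stays in place even though it took part in the
match, which avoids having to capture it and put it back.
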
=head3 Assertions

Perl defines the following zero-width assertions:
X<zero-width assertion> X<assertion> X<regex, zero-width assertion>
X<regexp, zero-width assertion>
X<regular expression, zero-width assertion>
X<\b> X<\B> X<\A> X<\Z> X<\z> X<\G>

    \b  Match a word boundary
    \B  Match except at a word boundary
    \A  Match only at beginning of string
    \Z  Match only at end of string, or before newline at the end
    \z  Match only at end of string
    \G  Match only at pos() (e.g. at the end-of-match position
        of prior m//g)

A word boundary (C<\b>) is a spot between two characters
that has a C<\w> on one side of it and a C<\W> on the other side
of it (in either order), counting the imaginary characters off the
beginning and end of the string as matching a C<\W>. (Within
character classes C<\b> represents backspace rather than a word
boundary, just as it normally does in any double-quoted string.)
The C<\A> and C<\Z> are just like "^" and "$", except that they
won't match multiple times when the C</m> modifier is used, while
"^" and "$" will match at every internal line boundary. To match
the actual end of the string and not ignore an optional trailing
newline, use C<\z>.
X<\b> X<\A> X<\Z> X<\z> X</m>

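The difference between C<$>, C<\Z> and C<\z>, and the behavior of C<\b>,
can be seen in a sketch such as this one (sample strings and variable
names are ours):

    use strict;
    use warnings;

    my $line = "foo\n";

    my $dollar  = $line =~ /foo$/  ? 1 : 0;   # 1: "$" allows a final newline
    my $big_z   = $line =~ /foo\Z/ ? 1 : 0;   # 1: so does \Z
    my $small_z = $line =~ /foo\z/ ? 1 : 0;   # 0: \z is the very end only

    # \b matches between a \W (or the string edge) and a \w.
    my @initials = "one two" =~ /\b(\w)/g;    # ("o", "t")

When validating input that must not carry a trailing newline, C<\z> is
therefore the anchor to reach for.
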
The C<\G> assertion can be used to chain global matches (using
C<m//g>), as described in L<perlop/"Regexp Quote-Like Operators">.
It is also useful when writing C<lex>-like scanners, when you have
several patterns that you want to match against consecutive substrings
of your string; see the previous reference. The actual location
where C<\G> will match can also be influenced by using C<pos()> as
an lvalue: see L<perlfunc/pos>. Note that the rule for zero-length
matches (see L</"Repeated Patterns Matching a Zero-length Substring">)
is modified somewhat, in that contents to the left of C<\G> are
not counted when determining the length of the match. Thus the following
will not match forever:
X<\G>

    my $string = 'ABC';
    pos($string) = 1;
    while ($string =~ /(.\G)/g) {
        print $1;
    }

It will print 'A' and then terminate, as it considers the match to
be zero-width, and thus will not match at the same position twice in a
row.

It is worth noting that C<\G> improperly used can result in an infinite
loop. Take care when using patterns that include C<\G> in an alternation.

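The C<lex>-like scanner mentioned above might be sketched roughly as
follows. The token classes, the input string and the variable names are
assumptions made up for this example; the C</c> modifier keeps C<pos()>
in place when one of the alternatives fails.

    use strict;
    use warnings;

    my $input = "x = 42 + y";
    my @tokens;

    pos($input) = 0;
    while (pos($input) < length($input)) {
        if    ($input =~ /\G\s+/gc)             { next }
        elsif ($input =~ /\G(\d+)/gc)           { push @tokens, "NUM:$1" }
        elsif ($input =~ /\G([A-Za-z_]\w*)/gc)  { push @tokens, "ID:$1" }
        elsif ($input =~ /\G([=+])/gc)          { push @tokens, "OP:$1" }
        else  { die "unexpected character at offset ", pos($input), "\n" }
    }
    # @tokens is ("ID:x", "OP:=", "NUM:42", "OP:+", "ID:y")

Each pattern is anchored with C<\G>, so it can only match where the
previous one left off, and the final C<else> guarantees that the loop
cannot spin without making progress.
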
c27a5cfe 645=head3 Capture groups
04838cea 646
c27a5cfe
KW
647The bracketing construct C<( ... )> creates capture groups (also referred to as
648capture buffers). To refer to the current contents of a group later on, within
d8b950dc
KW
649the same pattern, use C<\g1> (or C<\g{1}>) for the first, C<\g2> (or C<\g{2}>)
650for the second, and so on.
651This is called a I<backreference>.
d74e8afc 652X<regex, capture buffer> X<regexp, capture buffer>
c27a5cfe 653X<regex, capture group> X<regexp, capture group>
d74e8afc 654X<regular expression, capture buffer> X<backreference>
c27a5cfe 655X<regular expression, capture group> X<backreference>
1f1031fe 656X<\g{1}> X<\g{-1}> X<\g{name}> X<relative backreference> X<named backreference>
d8b950dc
KW
657X<named capture buffer> X<regular expression, named capture buffer>
658X<named capture group> X<regular expression, named capture group>
659X<%+> X<$+{name}> X<< \k<name> >>
660There is no limit to the number of captured substrings that you may use.
661Groups are numbered with the leftmost open parenthesis being number 1, etc. If
662a group did not match, the associated backreference won't match either. (This
663can happen if the group is optional, or in a different branch of an
664alternation.)
665You can omit the C<"g">, and write C<"\1">, etc, but there are some issues with
666this form, described below.
667
668You can also refer to capture groups relatively, by using a negative number, so
669that C<\g-1> and C<\g{-1}> both refer to the immediately preceding capture
670group, and C<\g-2> and C<\g{-2}> both refer to the group before it. For
671example:
5624f11d
YO
672
673 /
c27a5cfe
KW
674 (Y) # group 1
675 ( # group 2
676 (X) # group 3
677 \g{-1} # backref to group 3
678 \g{-3} # backref to group 1
5624f11d
YO
679 )
680 /x
681
d8b950dc
KW
682would match the same as C</(Y) ( (X) \g3 \g1 )/x>. This allows you to
683interpolate regexes into larger regexes and not have to worry about the
684capture groups being renumbered.
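
As a minimal sketch of that last point (the fragment name C<$doubled> and the
sample string are made up for illustration), a relative backreference keeps
pointing at its own group even when the fragment is interpolated after
another capture group:

```perl
use strict;
use warnings;

# Hypothetical reusable fragment: a word character followed by itself.
# \g{-1} refers to "the group just before this point", so it does not
# depend on how many groups the surrounding pattern has already opened.
my $doubled = qr/(\w)\g{-1}/;

my $line = "id=42 hello";
my ($id, $char);

# Here the fragment's group ends up as group 2, not group 1.
if ($line =~ /id=(\d+) .*?$doubled/) {
    ($id, $char) = ($1, $2);
    print "id=$id, doubled char='$char'\n";
}
```

Had the fragment been written with C<\1>, it would have referred to the digits
captured by the outer pattern rather than to its own group.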

You can dispense with numbers altogether and create named capture groups.
The notation is C<(?E<lt>I<name>E<gt>...)> to declare and C<\g{I<name>}> to
reference.  (To be compatible with .NET regular expressions, C<\g{I<name>}> may
also be written as C<\k{I<name>}>, C<\kE<lt>I<name>E<gt>> or C<\k'I<name>'>.)
I<name> must not begin with a number, nor contain hyphens.
When different groups within the same pattern have the same name, any reference
to that name assumes the leftmost defined group.  Named groups count in
absolute and relative numbering, and so can also be referred to by those
numbers.
(It's possible to do things with named capture groups that would otherwise
require C<(??{})>.)

Capture group contents are dynamically scoped and available to you outside the
pattern until the end of the enclosing block or until the next successful
match, whichever comes first.  (See L<perlsyn/"Compound Statements">.)
You can refer to them by absolute number (using C<"$1"> instead of C<"\g1">,
etc); or by name via the C<%+> hash, using C<"$+{I<name>}">.
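
A short sketch of both access styles, assuming a made-up timestamp string; the
hash C<%time> is not part of any API and only keeps a copy of the captures:

```perl
use strict;
use warnings;

my $stamp = "Time: 12:34:56";
my %time;

if ($stamp =~ /Time: (?<hours>\d\d):(?<minutes>\d\d):(?<seconds>\d\d)/) {
    # Named groups are numbered too, so $1 and $+{hours} hold the same text.
    print "hours=$+{hours} (also \$1=$1)\n";

    # Copy the captures before a later successful match replaces them.
    %time = %+;
}

print "minutes=$time{minutes}, seconds=$time{seconds}\n";
```

Copying C<%+> into an ordinary hash is one way to keep the values beyond the
next successful match.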

Braces are required in referring to named capture groups, but are optional for
absolute or relative numbered ones.  Braces are safer when creating a regex by
concatenating smaller strings.  For example if you have C<qr/$a$b/>, and C<$a>
contained C<"\g1">, and C<$b> contained C<"37">, you would get C</\g137/> which
is probably not what you intended.

The C<\g> and C<\k> notations were introduced in Perl 5.10.0.  Prior to that
there were neither named nor relative numbered capture groups.  Absolute
numbered groups were referred to using C<\1>, C<\2>, etc., and this notation
is still accepted (and likely always will be).  But it leads to some
ambiguities if there are more than 9 capture groups, as C<\10> could mean
either the tenth capture group, or the character whose ordinal in octal is
010 (a backspace in ASCII).  Perl resolves this ambiguity by interpreting
C<\10> as a backreference only if at least 10 left parentheses have opened
before it.  Likewise C<\11> is a backreference only if at least 11 left
parentheses have opened before it.
And so on.  C<\1> through C<\9> are always interpreted as backreferences.
There are several examples below that illustrate these perils.  You can avoid
the ambiguity by always using C<\g{}> or C<\g> if you mean capturing groups;
and for octal constants always using C<\o{}>, or for C<\077> and below, using 3
digits padded with leading zeros, since a leading zero implies an octal
constant.

The C<\I<digit>> notation also works in certain circumstances outside
the pattern.  See L</Warning on \1 Instead of $1> below for details.

Examples:

    s/^([^ ]*) *([^ ]*)/$2 $1/;     # swap first two words

    /(.)\g1/                        # find first doubled char
         and print "'$1' is the first doubled character\n";

    /(?<char>.)\k<char>/            # ... a different way
         and print "'$+{char}' is the first doubled character\n";

    /(?'char'.)\g1/                 # ... mix and match
         and print "'$1' is the first doubled character\n";

    if (/Time: (..):(..):(..)/) {   # parse out values
        $hours = $1;
        $minutes = $2;
        $seconds = $3;
    }

    /(.)(.)(.)(.)(.)(.)(.)(.)(.)\g10/   # \g10 is a backreference
    /(.)(.)(.)(.)(.)(.)(.)(.)(.)\10/    # \10 is octal
    /((.)(.)(.)(.)(.)(.)(.)(.)(.))\10/  # \10 is a backreference
    /((.)(.)(.)(.)(.)(.)(.)(.)(.))\010/ # \010 is octal

    $a = '(.)\1';        # Creates problems when concatenated.
    $b = '(.)\g{1}';     # Avoids the problems.
    "aa" =~ /${a}/;      # True
    "aa" =~ /${b}/;      # True
    "aa0" =~ /${a}0/;    # False!
    "aa0" =~ /${b}0/;    # True
    "aa\x08" =~ /${a}0/;  # True!
    "aa\x08" =~ /${b}0/;  # False

Several special variables also refer back to portions of the previous
match.  C<$+> returns whatever the last bracket match matched.
C<$&> returns the entire matched string.  (At one point C<$0> did
also, but now it returns the name of the program.)  C<$`> returns
everything before the matched string.  C<$'> returns everything
after the matched string. And C<$^N> contains whatever was matched by
the most-recently closed group (submatch). C<$^N> can be used in
extended patterns (see below), for example to assign a submatch to a
variable.
X<$+> X<$^N> X<$&> X<$`> X<$'>

These special variables, like the C<%+> hash and the numbered match variables
(C<$1>, C<$2>, C<$3>, etc.) are dynamically scoped
until the end of the enclosing block or until the next successful
match, whichever comes first.  (See L<perlsyn/"Compound Statements">.)
X<$+> X<$^N> X<$&> X<$`> X<$'>
X<$1> X<$2> X<$3> X<$4> X<$5> X<$6> X<$7> X<$8> X<$9>

B<NOTE>: Failed matches in Perl do not reset the match variables,
which makes it easier to write code that tests for a series of more
specific cases and remembers the best match.

B<WARNING>: Once Perl sees that you need one of C<$&>, C<$`>, or
C<$'> anywhere in the program, it has to provide them for every
pattern match.  This may substantially slow your program.  Perl
uses the same mechanism to produce C<$1>, C<$2>, etc, so you also pay a
price for each pattern that contains capturing parentheses.  (To
avoid this cost while retaining the grouping behaviour, use the
extended regular expression C<(?: ... )> instead.)  But if you never
use C<$&>, C<$`> or C<$'>, then patterns I<without> capturing
parentheses will not be penalized.  So avoid C<$&>, C<$'>, and C<$`>
if you can, but if you can't (and some algorithms really appreciate
them), once you've used them once, use them at will, because you've
already paid the price.  As of 5.005, C<$&> is not so costly as the
other two.
X<$&> X<$`> X<$'>

As a workaround for this problem, Perl 5.10.0 introduces C<${^PREMATCH}>,
C<${^MATCH}> and C<${^POSTMATCH}>, which are equivalent to C<$`>, C<$&>
and C<$'>, B<except> that they are only guaranteed to be defined after a
successful match that was executed with the C</p> (preserve) modifier.
Unlike their punctuation character equivalents, these variables incur no
global performance penalty; the trade-off is that you have to tell perl
when you want to use them.
X</p> X<p modifier>
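
A minimal sketch of the C</p> variant, assuming Perl 5.10.0 or later and a
made-up input string:

```perl
use strict;
use warnings;

my $text = "alpha beta gamma";
my ($pre, $match, $post);

# /p asks perl to preserve the pieces for this one match only.
if ($text =~ /beta/p) {
    ($pre, $match, $post) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});
    print "[$pre][$match][$post]\n";
}
```

The three pieces are copied into lexicals straight away, since a later
successful match would replace them.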

=head2 Quoting metacharacters

Backslashed metacharacters in Perl are alphanumeric, such as C<\b>,
C<\w>, C<\n>.  Unlike some other regular expression languages, there
are no backslashed symbols that aren't alphanumeric.  So anything
that looks like \\, \(, \), \<, \>, \{, or \} is always
interpreted as a literal character, not a metacharacter.  This was
once used in a common idiom to disable or quote the special meanings
of regular expression metacharacters in a string that you want to
use for a pattern. Simply quote all non-"word" characters:

    $pattern =~ s/(\W)/\\$1/g;

(If C<use locale> is set, then this depends on the current locale.)
Today it is more common to use the quotemeta() function or the C<\Q>
metaquoting escape sequence to disable all metacharacters' special
meanings like this:

    /$unquoted\Q$quoted\E$unquoted/

Beware that if you put literal backslashes (those not inside
interpolated variables) between C<\Q> and C<\E>, double-quotish
backslash interpolation may lead to confusing results.  If you
I<need> to use literal backslashes within C<\Q...\E>,
consult L<perlop/"Gory details of parsing quoted constructs">.
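
The difference between interpolating a string raw and quoting it can be
sketched as follows; the string C<a.b*c> stands in for hypothetical
user-supplied text:

```perl
use strict;
use warnings;

my $literal = 'a.b*c';               # hypothetical user-supplied text
my $quoted  = quotemeta($literal);   # non-word characters get a backslash

# Interpolated raw, "." and "*" keep their special meanings.
my $raw_hits_other  = ('aXbbc' =~ /\A$literal\z/)     ? 1 : 0;

# Quoted with \Q...\E, only the literal text itself matches.
my $safe_hits_other = ('aXbbc' =~ /\A\Q$literal\E\z/) ? 1 : 0;
my $safe_hits_self  = ('a.b*c' =~ /\A\Q$literal\E\z/) ? 1 : 0;

print "quoted form: $quoted\n";
print "raw=$raw_hits_other safe=$safe_hits_other self=$safe_hits_self\n";
```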

=head2 Extended Patterns

Perl also defines a consistent extension syntax for features not
found in standard tools like B<awk> and
B<lex>.  The syntax for most of these is a
pair of parentheses with a question mark as the first thing within
the parentheses.  The character after the question mark indicates
the extension.

The stability of these extensions varies widely.  Some have been
part of the core language for many years.  Others are experimental
and may change without warning or be completely removed.  Check
the documentation on an individual feature to verify its current
status.

A question mark was chosen for this and for the minimal-matching
construct because 1) question marks are rare in older regular
expressions, and 2) whenever you see one, you should stop and
"question" exactly what is going on.  That's psychology....

=over 4

=item C<(?#text)>
X<(?#)>

A comment.  The text is ignored.  If the C</x> modifier enables
whitespace formatting, a simple C<#> will suffice.  Note that Perl closes
the comment as soon as it sees a C<)>, so there is no way to put a literal
C<)> in the comment.

=item C<(?adlupimsx-imsx)>

=item C<(?^alupimsx)>
X<(?)> X<(?^)>

One or more embedded pattern-match modifiers, to be turned on (or
turned off, if preceded by C<->) for the remainder of the pattern or
the remainder of the enclosing pattern group (if any).

This is particularly useful for dynamic patterns, such as those read in from a
configuration file, taken from an argument, or specified in a table
somewhere.  Consider the case where some patterns want to be
case-sensitive and some do not:  The case-insensitive ones merely need to
include C<(?i)> at the front of the pattern.  For example:

    $pattern = "foobar";
    if ( /$pattern/i ) { }

    # more flexible:

    $pattern = "(?i)foobar";
    if ( /$pattern/ ) { }

These modifiers are restored at the end of the enclosing group. For example,

    ( (?i) blah ) \s+ \g1

will match C<blah> in any case, some spaces, and an exact (I<including the case>!)
repetition of the previous word, assuming the C</x> modifier, and no C</i>
modifier outside this group.
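
Both behaviours described above can be checked with a small sketch; the
pattern table C<@table> is hypothetical data:

```perl
use strict;
use warnings;

# (?i) is switched on inside group 1 only; \g1 is matched case-sensitively.
my $re = qr/ ( (?i) blah ) \s+ \g1 /x;

my $same_case  = ("BLAH BLAH" =~ $re) ? 1 : 0;   # repeat has the same case
my $other_case = ("BLAH blah" =~ $re) ? 1 : 0;   # repeat differs in case

# Each pattern string in a table decides its own case sensitivity.
my @table   = ('(?i)foobar', 'FooBar');
my @matched = grep { "FOOBAR" =~ /$_/ } @table;

print "same=$same_case other=$other_case matched=@matched\n";
```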

These modifiers do not carry over into named subpatterns called in the
enclosing group. In other words, a pattern such as C<((?i)(?&NAME))> does not
change the case-sensitivity of the "NAME" pattern.

Any of these modifiers can be set to apply globally to all regular
expressions compiled within the scope of a C<use re>.  See
L<re/"'/flags' mode">.

Starting in Perl 5.14, a C<"^"> (caret or circumflex accent) immediately
after the C<"?"> is a shorthand equivalent to C<d-imsx>.  Flags (except
C<"d">) may follow the caret to override it.
But a minus sign is not legal with it.

Note that the C<a>, C<d>, C<l>, C<p>, and C<u> modifiers are special in
that they can only be enabled, not disabled, and the C<a>, C<d>, C<l>, and
C<u> modifiers are mutually exclusive: specifying one de-specifies the
others, and a maximum of one (or two C<a>'s) may appear in the
construct.  Thus, for
example, C<(?-p)> will warn when compiled under C<use warnings>;
C<(?-d:...)> and C<(?dl:...)> are fatal errors.

Note also that the C<p> modifier is special in that its presence
anywhere in a pattern has a global effect.

=item C<(?:pattern)>
X<(?:)>

=item C<(?adluimsx-imsx:pattern)>

=item C<(?^aluimsx:pattern)>
X<(?^:)>

This is for clustering, not capturing; it groups subexpressions like
"()", but doesn't make backreferences as "()" does.  So

    @fields = split(/\b(?:a|b|c)\b/)

is like

    @fields = split(/\b(a|b|c)\b/)

but doesn't spit out extra fields.  It's also cheaper not to capture
characters if you don't need to.

Any letters between C<?> and C<:> act as flags modifiers as with
C<(?adluimsx-imsx)>.  For example,

    /(?s-i:more.*than).*million/i

is equivalent to the more verbose

    /(?:(?s-i)more.*than).*million/i

Starting in Perl 5.14, a C<"^"> (caret or circumflex accent) immediately
after the C<"?"> is a shorthand equivalent to C<d-imsx>.  Any positive
flags (except C<"d">) may follow the caret, so

    (?^x:foo)

is equivalent to

    (?x-ims:foo)

The caret tells Perl that this cluster doesn't inherit the flags of any
surrounding pattern, but uses the system defaults (C<d-imsx>),
modified by any flags specified.

The caret allows for simpler stringification of compiled regular
expressions.  These look like

    (?^:pattern)

with any non-default flags appearing between the caret and the colon.
A test that looks at such stringification thus doesn't need to have the
system default flags hard-coded in it, just the caret.  If new flags are
added to Perl, the meaning of the caret's expansion will change to include
the default for those flags, so the test will still work, unchanged.
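
A small sketch of the stringified forms, assuming Perl 5.14 or later and no
pragmas that add further flags:

```perl
use strict;
use warnings;

my $plain  = qr/foo/;
my $nocase = qr/foo/i;

print "$plain\n";     # default flags: nothing between caret and colon
print "$nocase\n";    # the non-default flag follows the caret

# Interpolated into a pattern compiled with /i, $plain still carries its
# own default flags because of the caret, so it stays case-sensitive.
my $inherits = ("FOO" =~ /\A$plain\z/i) ? 1 : 0;
print "inherits /i: $inherits\n";
```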

Specifying a negative flag after the caret is an error, as the flag is
redundant.

Mnemonic for C<(?^...)>:  A fresh beginning since the usual use of a caret is
to match at the beginning.

=item C<(?|pattern)>
X<(?|)> X<Branch reset>

This is the "branch reset" pattern, which has the special property
that the capture groups are numbered from the same starting point
in each alternation branch. It is available starting from perl 5.10.0.

Capture groups are numbered from left to right, but inside this
construct the numbering is restarted for each branch.

The numbering within each branch will be as normal, and any groups
following this construct will be numbered as though the construct
contained only one branch, that being the one with the most capture
groups in it.

This construct is useful when you want to capture one of a
number of alternative matches.

Consider the following pattern.  The numbers underneath show in
which group the captured content will be stored.

    # before  ---------------branch-reset----------- after
    / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
    # 1            2         2  3        2     3     4

Be careful when using the branch reset pattern in combination with
named captures. Named captures are implemented as being aliases to
numbered groups holding the captures, and that interferes with the
implementation of the branch reset pattern. If you are using named
captures in a branch reset pattern, it's best to use the same names,
in the same order, in each of the alternations:

    /(?|  (?<a> x ) (?<b> y )
       |  (?<a> z ) (?<b> w )) /x

Not doing so may lead to surprises:

    "12" =~ /(?| (?<a> \d+ ) | (?<b> \D+))/x;
    say $+ {a};   # Prints '12'
    say $+ {b};   # *Also* prints '12'.

The problem here is that both the group named C<< a >> and the group
named C<< b >> are aliases for the group belonging to C<< $1 >>.
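
As an illustration of capturing one of several alternatives, here is a sketch
assuming Perl 5.10.0 or later; the pattern name C<$value_re> and the inputs
are invented for the example:

```perl
use strict;
use warnings;

# Three alternatives, each with one group; branch reset makes every one
# of them group 1, so the caller only ever looks at $1.
my $value_re = qr/\A (?| "([^"]*)" | '([^']*)' | (\S+) )/x;

my @got;
for my $input (q{"double quoted"}, q{'single quoted'}, q{bare}) {
    push @got, $1 if $input =~ $value_re;
}

print join(", ", @got), "\n";
```

Without C<(?|...)> the three groups would be numbered 1, 2 and 3, and the
caller would have to check which of them was defined.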

=item Look-Around Assertions
X<look-around assertion> X<lookaround assertion> X<look-around> X<lookaround>

Look-around assertions are zero-width patterns which match a specific
pattern without including it in C<$&>. Positive assertions match when
their subpattern matches; negative assertions match when their subpattern
fails. Look-behind matches text up to the current match position;
look-ahead matches text following the current match position.

=over 4

=item C<(?=pattern)>
X<(?=)> X<look-ahead, positive> X<lookahead, positive>

A zero-width positive look-ahead assertion.  For example, C</\w+(?=\t)/>
matches a word followed by a tab, without including the tab in C<$&>.

=item C<(?!pattern)>
X<(?!)> X<look-ahead, negative> X<lookahead, negative>

A zero-width negative look-ahead assertion.  For example C</foo(?!bar)/>
matches any occurrence of "foo" that isn't followed by "bar".  Note
however that look-ahead and look-behind are NOT the same thing.  You cannot
use this for look-behind.

If you are looking for a "bar" that isn't preceded by a "foo", C</(?!foo)bar/>
will not do what you want.  That's because the C<(?!foo)> is just saying that
the next thing cannot be "foo"--and it's not, it's a "bar", so "foobar" will
match.  Use look-behind instead (see below).

=item C<(?<=pattern)> C<\K>
X<(?<=)> X<look-behind, positive> X<lookbehind, positive> X<\K>

A zero-width positive look-behind assertion.  For example, C</(?<=\t)\w+/>
matches a word that follows a tab, without including the tab in C<$&>.
Works only for fixed-width look-behind.

There is a special form of this construct, called C<\K>, which causes the
regex engine to "keep" everything it had matched prior to the C<\K> and
not include it in C<$&>. This effectively provides variable-length
look-behind. The use of C<\K> inside of another look-around assertion
is allowed, but the behaviour is currently not well defined.

For various reasons C<\K> may be significantly more efficient than the
equivalent C<< (?<=...) >> construct, and it is especially useful in
situations where you want to efficiently remove something following
something else in a string. For instance

    s/(foo)bar/$1/g;

can be rewritten as the much more efficient

    s/foo\Kbar//g;
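
The two substitutions above, plus a variable-length use of C<\K>, can be
sketched like this (Perl 5.10.0 or later is assumed; the sample strings are
made up):

```perl
use strict;
use warnings;

my $s = "foobar foobaz foobar";

(my $with_capture = $s) =~ s/(foo)bar/$1/g;   # capture "foo" and put it back
(my $with_keep    = $s) =~ s/foo\Kbar//g;     # keep "foo", remove only "bar"

# The text before \K may have any length, which (?<=...) does not allow here.
(my $masked = "user=alice token=12345") =~ s/\b\w+=\K\S+/?/g;

print "$with_capture\n$with_keep\n$masked\n";
```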

=item C<(?<!pattern)>
X<(?<!)> X<look-behind, negative> X<lookbehind, negative>

A zero-width negative look-behind assertion.  For example C</(?<!bar)foo/>
matches any occurrence of "foo" that does not follow "bar".  Works
only for fixed-width look-behind.
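
The contrast between look-ahead and look-behind drawn in the items above can
be checked with a short sketch on made-up strings:

```perl
use strict;
use warnings;

# Look-ahead tried at "bar" only checks that "foo" does not start there.
my $ahead_foobar  = ("foobar" =~ /(?!foo)bar/)  ? 1 : 0;

# Look-behind checks the text just before "bar".
my $behind_foobar = ("foobar" =~ /(?<!foo)bar/) ? 1 : 0;
my $behind_bazbar = ("bazbar" =~ /(?<!foo)bar/) ? 1 : 0;

# Positive look-ahead: the tab must follow but is not part of the match.
my ($word) = "key\tvalue" =~ /(\w+)(?=\t)/;

print "ahead=$ahead_foobar behind=$behind_foobar/$behind_bazbar word=$word\n";
```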

=back

=item C<(?'NAME'pattern)>

=item C<< (?<NAME>pattern) >>
X<< (?<NAME>) >> X<(?'NAME')> X<named capture> X<capture>

A named capture group. Identical in every respect to normal capturing
parentheses C<()> but for the additional fact that the group
can be referred to by name in various regular expression
constructs (like C<\g{NAME}>) and can be accessed by name
after a successful match via C<%+> or C<%->. See L<perlvar>
for more details on the C<%+> and C<%-> hashes.

If multiple distinct capture groups have the same name then
C<$+{NAME}> will refer to the leftmost defined group in the match.

The forms C<(?'NAME'pattern)> and C<< (?<NAME>pattern) >> are equivalent.

B<NOTE:> While the notation of this construct is the same as the similar
function in .NET regexes, the behavior is not. In Perl the groups are
numbered sequentially regardless of being named or not. Thus in the
pattern

    /(x)(?<foo>y)(z)/

C<$+{foo}> will be the same as C<$2>, and C<$3> will contain 'z' instead of
the opposite which is what a .NET regex hacker might expect.

Currently NAME is restricted to simple identifiers only.
In other words, it must match C</^[_A-Za-z][_A-Za-z0-9]*\z/> or
its Unicode extension (see L<utf8>),
though it isn't extended by the locale (see L<perllocale>).

B<NOTE:> In order to make things easier for programmers with experience
with the Python or PCRE regex engines, the pattern C<< (?PE<lt>NAMEE<gt>pattern) >>
may be used instead of C<< (?<NAME>pattern) >>; however this form does not
support the use of single quotes as a delimiter for the name.

=item C<< \k<NAME> >>

=item C<< \k'NAME' >>

Named backreference. Similar to numeric backreferences, except that
the group is designated by name and not number. If multiple groups
have the same name then it refers to the leftmost defined group in
the current match.

It is an error to refer to a name not defined by a C<< (?<NAME>) >>
earlier in the pattern.

Both forms are equivalent.

B<NOTE:> In order to make things easier for programmers with experience
with the Python or PCRE regex engines, the pattern C<< (?P=NAME) >>
may be used instead of C<< \k<NAME> >>.

=item C<(?{ code })>
X<(?{})> X<regex, code in> X<regexp, code in> X<regular expression, code in>

B<WARNING>: This extended regular expression feature is considered
experimental, and may be changed without notice. Code executed that
has side effects may not perform identically from version to version
due to the effect of future optimisations in the regex engine.

This zero-width assertion evaluates any embedded Perl code.  It
always succeeds, and its C<code> is not interpolated.  Currently,
the rules to determine where the C<code> ends are somewhat convoluted.

This feature can be used together with the special variable C<$^N> to
capture the results of submatches in variables without having to keep
track of the number of nested parentheses. For example:

    $_ = "The brown fox jumps over the lazy dog";
    /the (\S+)(?{ $color = $^N }) (\S+)(?{ $animal = $^N })/i;
    print "color = $color, animal = $animal\n";

Inside the C<(?{...})> block, C<$_> refers to the string the regular
expression is matching against. You can also use C<pos()> to find
the current matching position within this string.

The C<code> is properly scoped in the following sense: If the assertion
is backtracked (compare L<"Backtracking">), all changes introduced after
C<local>ization are undone, so that

    $_ = 'a' x 8;
    m<
       (?{ $cnt = 0 })                   # Initialize $cnt.
       (
         a
         (?{
             local $cnt = $cnt + 1;      # Update $cnt,
                                         # backtracking-safe.
         })
       )*
       aaaa
       (?{ $res = $cnt })                # On success copy to
                                         # non-localized location.
     >x;

will set C<$res = 4>.  Note that after the match, C<$cnt> returns to the globally
introduced value, because the scopes that restrict C<local> operators
are unwound.

This assertion may be used as a C<(?(condition)yes-pattern|no-pattern)>
switch.  If I<not> used in this way, the result of evaluation of
C<code> is put into the special variable C<$^R>.  This happens
immediately, so C<$^R> can be used from other C<(?{ code })> assertions
inside the same regular expression.

The assignment to C<$^R> above is properly localized, so the old
value of C<$^R> is restored if the assertion is backtracked; compare
L<"Backtracking">.
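
A minimal sketch of C<$^R> being passed from one code block to the next,
subject to the experimental status noted above; the blocks use no lexical
variables and the pattern contains no interpolated variables:

```perl
use strict;
use warnings;

# Each block's value lands in $^R and is visible to the next block.
my $total;
if ("abc" =~ /a(?{ 1 })b(?{ $^R + 1 })c(?{ $^R * 10 })/) {
    $total = $^R;    # value left by the last block that ran
    print "total = $total\n";
}
```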

For reasons of security, this construct is forbidden if the regular
expression involves run-time interpolation of variables, unless the
perilous C<use re 'eval'> pragma has been used (see L<re>), or the
variables contain results of the C<qr//> operator (see
L<perlop/"qrE<sol>STRINGE<sol>msixpodual">).

This restriction is due to the wide-spread and remarkably convenient
custom of using run-time determined strings as patterns.  For example:

    $re = <>;
    chomp $re;
    $string =~ /$re/;

Before Perl knew how to execute interpolated code within a pattern,
this operation was completely safe from a security point of view,
although it could raise an exception from an illegal pattern.  If
you turn on C<use re 'eval'>, though, it is no longer secure,
so you should only do so if you are also using taint checking.
Better yet, use the carefully constrained evaluation within a Safe
compartment.  See L<perlsec> for details about both these mechanisms.

B<WARNING>: Use of lexical (C<my>) variables in these blocks is
broken. The result is unpredictable and will make perl unstable. The
workaround is to use global (C<our>) variables.

B<WARNING>: In perl 5.12.x and earlier, the regex engine
was not re-entrant, so interpolated code could not
safely invoke the regex engine either directly with
C<m//> or C<s///>, or indirectly with functions such as
C<split>. Invoking the regex engine in these blocks would make perl
unstable.

=item C<(??{ code })>
X<(??{})>
X<regex, postponed> X<regexp, postponed> X<regular expression, postponed>

B<WARNING>: This extended regular expression feature is considered
experimental, and may be changed without notice. Code executed that
has side effects may not perform identically from version to version
due to the effect of future optimisations in the regex engine.

This is a "postponed" regular subexpression.  The C<code> is evaluated
at run time, at the moment this subexpression may match.  The result
of evaluation is considered a regular expression and matched as
if it were inserted instead of this construct.  Note that this means
that the contents of capture groups defined inside an eval'ed pattern
are not available outside of the pattern, and vice versa, there is no
way for the inner pattern to refer to a capture group defined outside.
Thus,

    ('a' x 100)=~/(??{'(.)' x 100})/

B<will> match, but it will B<not> set C<$1>.

The C<code> is not interpolated.  As before, the rules to determine
where the C<code> ends are currently somewhat convoluted.

The following pattern matches a parenthesized group:

    $re = qr{
               \(
               (?:
                  (?> [^()]+ )  # Non-parens without backtracking
                |
                  (??{ $re })   # Group with matching parens
               )*
               \)
            }x;

See also C<(?PARNO)> for a different, more efficient way to accomplish
the same task.

For reasons of security, this construct is forbidden if the regular
expression involves run-time interpolation of variables, unless the
perilous C<use re 'eval'> pragma has been used (see L<re>), or the
variables contain results of the C<qr//> operator (see
L<perlop/"qrE<sol>STRINGE<sol>msixpodual">).

In perl 5.12.x and earlier, because the regex engine was not re-entrant,
delayed code could not safely invoke the regex engine either directly with
C<m//> or C<s///>, or indirectly with functions such as C<split>.

Recursing deeper than 50 times without consuming any input string will
result in a fatal error.  The maximum depth is compiled into perl, so
changing it requires a custom build.

=item C<(?PARNO)> C<(?-PARNO)> C<(?+PARNO)> C<(?R)> C<(?0)>
X<(?PARNO)> X<(?1)> X<(?R)> X<(?0)> X<(?-1)> X<(?+1)> X<(?-PARNO)> X<(?+PARNO)>
X<regex, recursive> X<regexp, recursive> X<regular expression, recursive>
X<regex, relative recursion>

Similar to C<(??{ code })> except it does not involve compiling any code;
instead it treats the contents of a capture group as an independent
pattern that must match at the current position.  Capture groups
contained by the pattern will have the value as determined by the
outermost recursion.

PARNO is a sequence of digits (not starting with 0) whose value reflects
the paren-number of the capture group to recurse to. C<(?R)> recurses to
the beginning of the whole pattern. C<(?0)> is an alternate syntax for
C<(?R)>. If PARNO is preceded by a plus or minus sign then it is assumed
to be relative, with negative numbers indicating preceding capture groups
and positive ones following. Thus C<(?-1)> refers to the most recently
declared group, and C<(?+1)> indicates the next group to be declared.
Note that the counting for relative recursion differs from that of
relative backreferences, in that with recursion unclosed groups B<are>
included.

The following pattern matches a function foo() which may contain
balanced parentheses as the argument.

    $re = qr{ (                   # paren group 1 (full function)
                foo
                (                 # paren group 2 (parens)
                  \(
                    (             # paren group 3 (contents of parens)
                    (?:
                     (?> [^()]+ ) # Non-parens without backtracking
                    |
                     (?2)         # Recurse to start of paren group 2
                    )*
                    )
                  \)
                )
              )
            }x;

If the pattern was used as follows

    'foo(bar(baz)+baz(bop))'=~/$re/
        and print "\$1 = $1\n",
                  "\$2 = $2\n",
                  "\$3 = $3\n";

the output produced should be the following:

    $1 = foo(bar(baz)+baz(bop))
    $2 = (bar(baz)+baz(bop))
    $3 = bar(baz)+baz(bop)

If there is no corresponding capture group defined, then it is a
fatal error.  Recursing deeper than 50 times without consuming any input
string will also result in a fatal error.  The maximum depth is compiled
into perl, so changing it requires a custom build.
1344
542fa716
YO
1345The following shows how using negative indexing can make it
1346easier to embed recursive patterns inside of a C<qr//> construct
1347for later use:
1348
  my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;
  if (/foo $parens \s+ \+ \s+ bar $parens/x) {
     # do something here...
  }
1353
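As a rough, self-contained sketch of the same idea, the following embeds
C<$parens> twice in one pattern and checks what each copy captured. The
sample string and the C<@found> array are assumptions made purely for
illustration; they are not part of the original example.

```perl
use strict;
use warnings;

# One balanced, parenthesised chunk. (?-1) recurses into the capture
# group that encloses it, whatever number that group ends up with
# once $parens is interpolated into a larger pattern.
my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;

# Hypothetical sample input, chosen only for illustration.
my $text = 'foo (a(b)c) + bar (d(e(f)))';

my @found;
if ($text =~ /foo \s* $parens \s+ \+ \s+ bar \s* $parens/x) {
    @found = ($1, $2);
    print "first:  $found[0]\n";
    print "second: $found[1]\n";
}
```

Because the recursion is relative, the second copy of C<$parens> recurses
into group 2 rather than group 1, which is what makes the C<qr//> object
reusable.
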
B<Note> that this pattern does not behave the same way as the equivalent
PCRE or Python construct of the same form. In Perl you can backtrack into
a recursed group; in PCRE and Python the recursed-into group is treated
as atomic. Also, modifiers are resolved at compile time, so constructs
like C<(?i:(?1))> or C<(?:(?i)(?1))> do not affect how the sub-pattern will
be processed.
6bda09f9 1360
894be9b7
YO
1361=item C<(?&NAME)>
1362X<(?&NAME)>
1363
0d017f4d
WL
1364Recurse to a named subpattern. Identical to C<(?PARNO)> except that the
1365parenthesis to recurse to is determined by name. If multiple parentheses have
894be9b7
YO
1366the same name, then it recurses to the leftmost.
1367
1368It is an error to refer to a name that is not declared somewhere in the
1369pattern.
1370
1f1031fe
YO
1371B<NOTE:> In order to make things easier for programmers with experience
1372with the Python or PCRE regex engines the pattern C<< (?P>NAME) >>
64c5a566 1373may be used instead of C<< (?&NAME) >>.
1f1031fe 1374
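A minimal sketch of named recursion follows. The group name C<group>, the
variable names, and the test strings are hypothetical and chosen only for
illustration.

```perl
use strict;
use warnings;

# Hypothetical pattern: a named group that recurses into itself to
# match nested parentheses.
my $balanced = qr{
    (?<group>                       # the group that is recursed into
        \(
        (?: [^()]++ | (?&group) )*  # plain text or a nested group
        \)
    )
}x;

my %ok;
for my $s ('(a(b)c)', '(a(b)c', '()') {
    $ok{$s} = ($s =~ /\A $balanced \z/x) ? 1 : 0;
    print "$s => $ok{$s}\n";
}
```

Only the strings whose parentheses balance are accepted; the second one
is rejected because its outer parenthesis is never closed.
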
e2e6a0f1
YO
1375=item C<(?(condition)yes-pattern|no-pattern)>
1376X<(?()>
286f584a 1377
e2e6a0f1 1378=item C<(?(condition)yes-pattern)>
286f584a 1379
41ef34de
ML
1380Conditional expression. Matches C<yes-pattern> if C<condition> yields
1381a true value, matches C<no-pattern> otherwise. A missing pattern always
1382matches.
1383
C<(condition)> should be one of: an integer in parentheses (which is
valid if the corresponding pair of parentheses matched); a
look-ahead/look-behind/evaluate zero-width assertion; a name in angle
brackets or single quotes (which is valid if a group with the given name
matched); or the special symbol C<(R)> (true when evaluated inside of
recursion or eval). Additionally the C<R> may be followed by a number
(in which case it is true only when evaluated while recursing inside of
the appropriate group), or by C<&NAME>, in which case it is true only
when evaluated during recursion in the named group.
1393
1394Here's a summary of the possible predicates:
1395
1396=over 4
1397
1398=item (1) (2) ...
1399
c27a5cfe 1400Checks if the numbered capturing group has matched something.
e2e6a0f1
YO
1401
1402=item (<NAME>) ('NAME')
1403
c27a5cfe 1404Checks if a group with the given name has matched something.
e2e6a0f1 1405
f01cd190
FC
1406=item (?=...) (?!...) (?<=...) (?<!...)
1407
1408Checks whether the pattern matches (or does not match, for the '!'
1409variants).
1410
e2e6a0f1
YO
1411=item (?{ CODE })
1412
f01cd190 1413Treats the return value of the code block as the condition.
e2e6a0f1
YO
1414
1415=item (R)
1416
1417Checks if the expression has been evaluated inside of recursion.
1418
1419=item (R1) (R2) ...
1420
1421Checks if the expression has been evaluated while executing directly
1422inside of the n-th capture group. This check is the regex equivalent of
1423
1424 if ((caller(0))[3] eq 'subname') { ... }
1425
1426In other words, it does not check the full recursion stack.
1427
1428=item (R&NAME)
1429
1430Similar to C<(R1)>, this predicate checks to see if we're executing
1431directly inside of the leftmost group with a given name (this is the same
1432logic used by C<(?&NAME)> to disambiguate). It does not check the full
1433stack, but only the name of the innermost active recursion.
1434
1435=item (DEFINE)
1436
1437In this case, the yes-pattern is never directly executed, and no
1438no-pattern is allowed. Similar in spirit to C<(?{0})> but more efficient.
1439See below for details.
1440
1441=back
1442
1443For example:
1444
1445 m{ ( \( )?
1446 [^()]+
1447 (?(1) \) )
1448 }x
1449
1450matches a chunk of non-parentheses, possibly included in parentheses
1451themselves.
1452
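The behaviour can be checked with a short script. This is only a sketch:
the C<\A> and C<\z> anchors are an addition made here so that partial
matches do not count, and the test strings are arbitrary.

```perl
use strict;
use warnings;

# Group 1 captures an optional opening parenthesis; the conditional
# (?(1) \) ) then demands a closing one only if group 1 matched.
my $chunk = qr{ \A ( \( )? [^()]+ (?(1) \) ) \z }x;

my %result;
for my $s ('abc', '(abc)', '(abc', 'abc)') {
    $result{$s} = ($s =~ $chunk) ? 'match' : 'no match';
    print "$s: $result{$s}\n";
}
```

The bare and fully parenthesised chunks match, while the two strings with
a lone parenthesis do not.
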
0b928c2f
FC
1453A special form is the C<(DEFINE)> predicate, which never executes its
1454yes-pattern directly, and does not allow a no-pattern. This allows one to
1455define subpatterns which will be executed only by the recursion mechanism.
e2e6a0f1
YO
1456This way, you can define a set of regular expression rules that can be
1457bundled into any pattern you choose.
1458
1459It is recommended that for this usage you put the DEFINE block at the
1460end of the pattern, and that you name any subpatterns defined within it.
1461
1462Also, it's worth noting that patterns defined this way probably will
1463not be as efficient, as the optimiser is not very clever about
1464handling them.
1465
1466An example of how this might be used is as follows:
1467
  /(?<NAME>(?&NAME_PAT))(?<ADDR>(?&ADDRESS_PAT))
   (?(DEFINE)
     (?<NAME_PAT>....)
     (?<ADDRESS_PAT>....)
   )/x
1473
c27a5cfe
KW
1474Note that capture groups matched inside of recursion are not accessible
1475after the recursion returns, so the extra layer of capturing groups is
e2e6a0f1
YO
1476necessary. Thus C<$+{NAME_PAT}> would not be defined even though
1477C<$+{NAME}> would be.
286f584a 1478
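To make this concrete, here is a hedged sketch with the elided rules
filled in by deliberately simplistic, hypothetical definitions (a name is
one or more words, an address is a loose C<user@host> form). The input
string and the variable names are likewise assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical rules, for illustration only.
my $re = qr{
    \A (?<NAME> (?&NAME_PAT) ) \s* < (?<ADDR> (?&ADDRESS_PAT) ) > \z
    (?(DEFINE)
        (?<NAME_PAT>    [A-Za-z]+ (?: \s+ [A-Za-z]+ )* )
        (?<ADDRESS_PAT> [\w.]+ \@ [\w.]+ )
    )
}x;

my ($name, $addr, $inner_seen);
if ('Jane Doe <jane@example.org>' =~ $re) {
    ($name, $addr) = ($+{NAME}, $+{ADDR});
    # Groups matched only inside the recursion are not kept.
    $inner_seen = defined $+{NAME_PAT} ? 1 : 0;
    print "name=$name addr=$addr inner=$inner_seen\n";
}
```

The outer C<NAME> and C<ADDR> groups carry the results out of the
recursion; C<$+{NAME_PAT}> stays undefined.
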
c47ff5f1 1479=item C<< (?>pattern) >>
6bda09f9 1480X<backtrack> X<backtracking> X<atomic> X<possessive>
5a964f20 1481
19799a22
GS
1482An "independent" subexpression, one which matches the substring
1483that a I<standalone> C<pattern> would match if anchored at the given
9da458fc 1484position, and it matches I<nothing other than this substring>. This
19799a22
GS
1485construct is useful for optimizations of what would otherwise be
1486"eternal" matches, because it will not backtrack (see L<"Backtracking">).
9da458fc
IZ
1487It may also be useful in places where the "grab all you can, and do not
1488give anything back" semantic is desirable.
19799a22 1489
c47ff5f1 1490For example: C<< ^(?>a*)ab >> will never match, since C<< (?>a*) >>
19799a22
GS
1491(anchored at the beginning of string, as above) will match I<all>
1492characters C<a> at the beginning of string, leaving no C<a> for
1493C<ab> to match. In contrast, C<a*ab> will match the same as C<a+b>,
1494since the match of the subgroup C<a*> is influenced by the following
1495group C<ab> (see L<"Backtracking">). In particular, C<a*> inside
1496C<a*ab> will match fewer characters than a standalone C<a*>, since
1497this makes the tail match.
1498
0b928c2f
FC
1499C<< (?>pattern) >> does not disable backtracking altogether once it has
1500matched. It is still possible to backtrack past the construct, but not
1501into it. So C<< ((?>a*)|(?>b*))ar >> will still match "bar".
1502
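The three claims above can be verified directly. The following is a small
sketch; the C<@cases> and C<@outcome> names are illustrative only.

```perl
use strict;
use warnings;

my @cases = (
    [ 'aaab', qr/^(?>a*)ab/         ],  # atomic a* keeps every "a"
    [ 'aaab', qr/^a*ab/             ],  # plain a* gives one "a" back
    [ 'bar',  qr/((?>a*)|(?>b*))ar/ ],  # the alternation is still retried
);

my @outcome = map { $_->[0] =~ $_->[1] ? 1 : 0 } @cases;
print "@outcome\n";
```

Only the first case fails, which is the atomic group refusing to hand
back an C<a>.
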
An effect similar to C<< (?>pattern) >> may be achieved by writing
C<(?=(pattern))\g{-1}>. The look-ahead matches the same substring as a
standalone C<pattern>, and the following C<\g{-1}> eats the matched
string; it therefore makes a zero-length assertion into an analogue of
C<< (?>...) >>. (The difference between these two constructs is that the
second one uses a capturing group, thus shifting ordinals of
backreferences in the rest of a regular expression.)
1510
1511Consider this pattern:
c277df42 1512
    m{ \(
          (
            [^()]+           # x+
          |
            \( [^()]* \)
          )+
       \)
     }x
5a964f20 1521
19799a22
GS
1522That will efficiently match a nonempty group with matching parentheses
1523two levels deep or less. However, if there is no such group, it
1524will take virtually forever on a long string. That's because there
1525are so many different ways to split a long string into several
1526substrings. This is what C<(.+)+> is doing, and C<(.+)+> is similar
1527to a subpattern of the above pattern. Consider how the pattern
1528above detects no-match on C<((()aaaaaaaaaaaaaaaaaa> in several
1529seconds, but that each extra letter doubles this time. This
1530exponential performance will make it appear that your program has
14218588 1531hung. However, a tiny change to this pattern
5a964f20 1532
    m{ \(
          (
            (?> [^()]+ )        # change x+ above to (?> x+ )
          |
            \( [^()]* \)
          )+
       \)
     }x
c277df42 1541
which uses C<< (?>...) >> matches exactly when the one above does (verifying
this yourself would be a productive exercise), but finishes in a quarter
of the time when used on a similar string with 1000000 C<a>s. Be aware,
however, that when this construct is followed by a quantifier, it
currently triggers a warning message under the C<use warnings> pragma or
B<-w> switch saying it C<"matches null string many times in regex">.
c277df42 1549
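One way to carry out that exercise is sketched below. It compares the two
patterns on a handful of short, hypothetical inputs (kept small so that
the non-atomic form stays fast) and makes no attempt to measure timing.

```perl
use strict;
use warnings;

my $plain  = qr{ \( (     [^()]+   | \( [^()]* \) )+ \) }x;
my $atomic = qr{ \( ( (?> [^()]+ ) | \( [^()]* \) )+ \) }x;

# Hypothetical short inputs, chosen only for illustration.
my @inputs = ('(a(b)c)', '(abc', '((()aaaaaaaa', 'x(y)z');

my $disagreements = 0;
for my $s (@inputs) {
    my $p_ok = ($s =~ $plain)  ? 1 : 0;
    my $a_ok = ($s =~ $atomic) ? 1 : 0;
    $disagreements++ if $p_ok != $a_ok;
    print "$s: plain=$p_ok atomic=$a_ok\n";
}
print "disagreements: $disagreements\n";
```

Agreement on a few samples is of course not a proof, only a sanity check
that the rewrite did not change which strings match.
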
c47ff5f1 1550On simple groups, such as the pattern C<< (?> [^()]+ ) >>, a comparable
19799a22 1551effect may be achieved by negative look-ahead, as in C<[^()]+ (?! [^()] )>.
c277df42
IZ
1552This was only 4 times slower on a string with 1000000 C<a>s.
1553
The "grab all you can, and do not give anything back" semantic is desirable
in many situations where at first sight a simple C<()*> looks like
the correct solution. Suppose we parse text with comments being delimited
by C<#> followed by some optional (horizontal) whitespace. Contrary to
its appearance, C<#[ \t]*> I<is not> the correct subexpression to match
the comment delimiter, because it may "give up" some whitespace if
the remainder of the pattern can be made to match that way. The correct
answer is either one of these:
1562
1563 (?>#[ \t]*)
1564 #[ \t]*(?![ \t])
1565
1566For example, to grab non-empty comments into $1, one should use either
1567one of these:
1568
1569 / (?> \# [ \t]* ) ( .+ ) /x;
1570 / \# [ \t]* ( [^ \t] .* ) /x;
1571
1572Which one you pick depends on which of these expressions better reflects
1573the above specification of comments.
1574
6bda09f9
YO
1575In some literature this construct is called "atomic matching" or
1576"possessive matching".
1577
b9b4dddf
YO
1578Possessive quantifiers are equivalent to putting the item they are applied
1579to inside of one of these constructs. The following equivalences apply:
1580
1581 Quantifier Form Bracketing Form
1582 --------------- ---------------
1583 PAT*+ (?>PAT*)
1584 PAT++ (?>PAT+)
1585 PAT?+ (?>PAT?)
1586 PAT{min,max}+ (?>PAT{min,max})
1587
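A quick check of the equivalence, as a sketch with illustrative names:
the possessive and bracketing forms should agree with each other and
differ from the ordinary greedy quantifier.

```perl
use strict;
use warnings;

my %matched = (
    greedy     => ('aaa' =~ /\Aa+a\z/)     ? 1 : 0,
    possessive => ('aaa' =~ /\Aa++a\z/)    ? 1 : 0,
    atomic     => ('aaa' =~ /\A(?>a+)a\z/) ? 1 : 0,
);
print "$_: $matched{$_}\n" for sort keys %matched;
```

The greedy form matches by giving back one C<a>; the other two consume
all three and leave nothing for the trailing C<a>.
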
e2e6a0f1
YO
1588=back
1589
1590=head2 Special Backtracking Control Verbs
1591
1592B<WARNING:> These patterns are experimental and subject to change or
0d017f4d 1593removal in a future version of Perl. Their usage in production code should
e2e6a0f1
YO
1594be noted to avoid problems during upgrades.
1595
1596These special patterns are generally of the form C<(*VERB:ARG)>. Unless
1597otherwise stated the ARG argument is optional; in some cases, it is
1598forbidden.
1599
1600Any pattern containing a special backtracking verb that allows an argument
e1020413 1601has the special behaviour that when executed it sets the current package's
5d458dd8
YO
1602C<$REGERROR> and C<$REGMARK> variables. When doing so the following
1603rules apply:
e2e6a0f1 1604
5d458dd8
YO
1605On failure, the C<$REGERROR> variable will be set to the ARG value of the
1606verb pattern, if the verb was involved in the failure of the match. If the
1607ARG part of the pattern was omitted, then C<$REGERROR> will be set to the
1608name of the last C<(*MARK:NAME)> pattern executed, or to TRUE if there was
1609none. Also, the C<$REGMARK> variable will be set to FALSE.
e2e6a0f1 1610
5d458dd8
YO
1611On a successful match, the C<$REGERROR> variable will be set to FALSE, and
1612the C<$REGMARK> variable will be set to the name of the last
1613C<(*MARK:NAME)> pattern executed. See the explanation for the
1614C<(*MARK:NAME)> verb below for more details.
e2e6a0f1 1615
5d458dd8 1616B<NOTE:> C<$REGERROR> and C<$REGMARK> are not magic variables like C<$1>
0b928c2f 1617and most other regex-related variables. They are not local to a scope, nor
5d458dd8
YO
1618readonly, but instead are volatile package variables similar to C<$AUTOLOAD>.
1619Use C<local> to localize changes to them to a specific scope if necessary.
e2e6a0f1
YO
1620
1621If a pattern does not contain a special backtracking verb that allows an
5d458dd8 1622argument, then C<$REGERROR> and C<$REGMARK> are not touched at all.
e2e6a0f1 1623
70ca8714 1624=over 3
e2e6a0f1
YO
1625
1626=item Verbs that take an argument
1627
1628=over 4
1629
5d458dd8 1630=item C<(*PRUNE)> C<(*PRUNE:NAME)>
f7819f85 1631X<(*PRUNE)> X<(*PRUNE:NAME)>
54612592 1632
5d458dd8
YO
1633This zero-width pattern prunes the backtracking tree at the current point
1634when backtracked into on failure. Consider the pattern C<A (*PRUNE) B>,
1635where A and B are complex patterns. Until the C<(*PRUNE)> verb is reached,
1636A may backtrack as necessary to match. Once it is reached, matching
1637continues in B, which may also backtrack as necessary; however, should B
1638not match, then no further backtracking will take place, and the pattern
1639will fail outright at the current starting position.
54612592
YO
1640
1641The following example counts all the possible matching strings in a
1642pattern (without actually matching any of them).
1643
    'aaab' =~ /a+b?(?{print "$&\n"; $count++})(*FAIL)/;
    print "Count=$count\n";
1646
1647which produces:
1648
1649 aaab
1650 aaa
1651 aa
1652 a
1653 aab
1654 aa
1655 a
1656 ab
1657 a
1658 Count=9
1659
5d458dd8 1660If we add a C<(*PRUNE)> before the count like the following
54612592 1661
    'aaab' =~ /a+b?(*PRUNE)(?{print "$&\n"; $count++})(*FAIL)/;
    print "Count=$count\n";
1664
0b928c2f 1665we prevent backtracking and find the count of the longest matching string
353c6505 1666at each matching starting point like so:
54612592
YO
1667
1668 aaab
1669 aab
1670 ab
1671 Count=3
1672
5d458dd8 1673Any number of C<(*PRUNE)> assertions may be used in a pattern.
54612592 1674
5d458dd8
YO
1675See also C<< (?>pattern) >> and possessive quantifiers for other ways to
1676control backtracking. In some cases, the use of C<(*PRUNE)> can be
1677replaced with a C<< (?>pattern) >> with no functional difference; however,
1678C<(*PRUNE)> can be used to handle cases that cannot be expressed using a
1679C<< (?>pattern) >> alone.
54612592 1680
5d458dd8
YO
1681=item C<(*SKIP)> C<(*SKIP:NAME)>
1682X<(*SKIP)>
e2e6a0f1 1683
This zero-width pattern is similar to C<(*PRUNE)>, except that on
failure it also signifies that whatever text was matched leading up
to the C<(*SKIP)> pattern being executed cannot be part of I<any> match
of this pattern. This effectively means that the regex engine "skips" forward
to this position on failure and tries to match again (assuming that
there is sufficient room to match).
1690
1691The name of the C<(*SKIP:NAME)> pattern has special significance. If a
1692C<(*MARK:NAME)> was encountered while matching, then it is that position
1693which is used as the "skip point". If no C<(*MARK)> of that name was
1694encountered, then the C<(*SKIP)> operator has no effect. When used
1695without a name the "skip point" is where the match point was when
1696executing the (*SKIP) pattern.
1697
0b928c2f 1698Compare the following to the examples in C<(*PRUNE)>; note the string
24b23f37
YO
1699is twice as long:
1700
    'aaabaaab' =~ /a+b?(*SKIP)(?{print "$&\n"; $count++})(*FAIL)/;
    print "Count=$count\n";
1703
1704outputs
1705
1706 aaab
1707 aaab
1708 Count=2
1709
5d458dd8 1710Once the 'aaab' at the start of the string has matched, and the C<(*SKIP)>
353c6505 1711executed, the next starting point will be where the cursor was when the
5d458dd8
YO
1712C<(*SKIP)> was executed.
1713
5d458dd8 1714=item C<(*MARK:NAME)> C<(*:NAME)>
b16db30f 1715X<(*MARK)> X<(*MARK:NAME)> X<(*:NAME)>
5d458dd8
YO
1716
1717This zero-width pattern can be used to mark the point reached in a string
1718when a certain part of the pattern has been successfully matched. This
1719mark may be given a name. A later C<(*SKIP)> pattern will then skip
1720forward to that point if backtracked into on failure. Any number of
b4222fa9 1721C<(*MARK)> patterns are allowed, and the NAME portion may be duplicated.
5d458dd8
YO
1722
1723In addition to interacting with the C<(*SKIP)> pattern, C<(*MARK:NAME)>
1724can be used to "label" a pattern branch, so that after matching, the
1725program can determine which branches of the pattern were involved in the
1726match.
1727
1728When a match is successful, the C<$REGMARK> variable will be set to the
1729name of the most recently executed C<(*MARK:NAME)> that was involved
1730in the match.
1731
1732This can be used to determine which branch of a pattern was matched
c27a5cfe 1733without using a separate capture group for each branch, which in turn
5d458dd8
YO
1734can result in a performance improvement, as perl cannot optimize
1735C</(?:(x)|(y)|(z))/> as efficiently as something like
1736C</(?:x(*MARK:x)|y(*MARK:y)|z(*MARK:z))/>.
1737
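The branch-labelling use can be sketched as follows. The C<%branch> hash
and the loop are illustrative assumptions; the C<our> declaration is
needed only because the example runs under C<use strict>.

```perl
use strict;
use warnings;

# The verbs set these package variables, so declare them under strict.
our ($REGMARK, $REGERROR);

my %branch;
for my $s (qw(x y z)) {
    if ($s =~ /(?:x(*MARK:x)|y(*MARK:y)|z(*MARK:z))/) {
        $branch{$s} = $REGMARK;   # name of the last (*MARK) executed
        print "$s matched via branch $branch{$s}\n";
    }
}
```

After each successful match, C<$REGMARK> names the branch that was taken,
with no capture groups involved.
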
1738When a match has failed, and unless another verb has been involved in
1739failing the match and has provided its own name to use, the C<$REGERROR>
1740variable will be set to the name of the most recently executed
1741C<(*MARK:NAME)>.
1742
42ac7c82 1743See L</(*SKIP)> for more details.
5d458dd8 1744
b62d2d15
YO
1745As a shortcut C<(*MARK:NAME)> can be written C<(*:NAME)>.
1746
5d458dd8
YO
1747=item C<(*THEN)> C<(*THEN:NAME)>
1748
241e7389 1749This is similar to the "cut group" operator C<::> from Perl 6. Like
5d458dd8
YO
1750C<(*PRUNE)>, this verb always matches, and when backtracked into on
1751failure, it causes the regex engine to try the next alternation in the
1752innermost enclosing group (capturing or otherwise).
1753
1754Its name comes from the observation that this operation combined with the
1755alternation operator (C<|>) can be used to create what is essentially a
1756pattern-based if/then/else block:
1757
1758 ( COND (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ )
1759
1760Note that if this operator is used and NOT inside of an alternation then
1761it acts exactly like the C<(*PRUNE)> operator.
1762
1763 / A (*PRUNE) B /
1764
1765is the same as
1766
1767 / A (*THEN) B /
1768
1769but
1770
1771 / ( A (*THEN) B | C (*THEN) D ) /
1772
1773is not the same as
1774
1775 / ( A (*PRUNE) B | C (*PRUNE) D ) /
1776
1777as after matching the A but failing on the B the C<(*THEN)> verb will
1778backtrack and try C; but the C<(*PRUNE)> verb will simply fail.
24b23f37 1779
cbeadc21
JV
1780=back
1781
1782=item Verbs without an argument
1783
1784=over 4
1785
e2e6a0f1
YO
1786=item C<(*COMMIT)>
1787X<(*COMMIT)>
24b23f37 1788
241e7389 1789This is the Perl 6 "commit pattern" C<< <commit> >> or C<:::>. It's a
5d458dd8
YO
1790zero-width pattern similar to C<(*SKIP)>, except that when backtracked
1791into on failure it causes the match to fail outright. No further attempts
1792to find a valid match by advancing the start pointer will occur again.
1793For example,
24b23f37 1794
    'aaabaaab' =~ /a+b?(*COMMIT)(?{print "$&\n"; $count++})(*FAIL)/;
    print "Count=$count\n";
1797
1798outputs
1799
1800 aaab
1801 Count=1
1802
e2e6a0f1
YO
1803In other words, once the C<(*COMMIT)> has been entered, and if the pattern
1804does not match, the regex engine will not try any further matching on the
1805rest of the string.
c277df42 1806
e2e6a0f1
YO
1807=item C<(*FAIL)> C<(*F)>
1808X<(*FAIL)> X<(*F)>
9af228c6 1809
e2e6a0f1
YO
1810This pattern matches nothing and always fails. It can be used to force the
1811engine to backtrack. It is equivalent to C<(?!)>, but easier to read. In
1812fact, C<(?!)> gets optimised into C<(*FAIL)> internally.
9af228c6 1813
e2e6a0f1 1814It is probably useful only when combined with C<(?{})> or C<(??{})>.
9af228c6 1815
e2e6a0f1
YO
1816=item C<(*ACCEPT)>
1817X<(*ACCEPT)>
9af228c6 1818
e2e6a0f1
YO
1819B<WARNING:> This feature is highly experimental. It is not recommended
1820for production code.
9af228c6 1821
e2e6a0f1
YO
1822This pattern matches nothing and causes the end of successful matching at
1823the point at which the C<(*ACCEPT)> pattern was encountered, regardless of
1824whether there is actually more to match in the string. When inside of a
0d017f4d 1825nested pattern, such as recursion, or in a subpattern dynamically generated
e2e6a0f1 1826via C<(??{})>, only the innermost pattern is ended immediately.
9af228c6 1827
c27a5cfe 1828If the C<(*ACCEPT)> is inside of capturing groups then the groups are
e2e6a0f1
YO
1829marked as ended at the point at which the C<(*ACCEPT)> was encountered.
1830For instance:
9af228c6 1831
e2e6a0f1 1832 'AB' =~ /(A (A|B(*ACCEPT)|C) D)(E)/x;
9af228c6 1833
e2e6a0f1 1834will match, and C<$1> will be C<AB> and C<$2> will be C<B>, C<$3> will not
0b928c2f 1835be set. If another branch in the inner parentheses was matched, such as in the
e2e6a0f1 1836string 'ACDE', then the C<D> and C<E> would have to be matched as well.
9af228c6
YO
1837
1838=back
c277df42 1839
a0d0e21e
LW
1840=back
1841
c07a80fd 1842=head2 Backtracking
d74e8afc 1843X<backtrack> X<backtracking>
c07a80fd 1844
NOTE: This section presents an abstract approximation of regular
expression behavior. For a more rigorous (and complicated) view of
the rules involved in selecting a match among possible alternatives,
see L</Combining RE Pieces>.
35a734be 1849
A fundamental feature of regular expression matching involves the
notion called I<backtracking>, which is currently used (when needed)
by all non-possessive regular expression quantifiers, namely C<*>, C<*?>,
C<+>, C<+?>, C<{n,m}>, and C<{n,m}?>. Backtracking is often optimized
internally, but the general principle outlined here is valid.
c07a80fd
PP
1855
1856For a regular expression to match, the I<entire> regular expression must
1857match, not just part of it. So if the beginning of a pattern containing a
1858quantifier succeeds in a way that causes later parts in the pattern to
1859fail, the matching engine backs up and recalculates the beginning
1860part--that's why it's called backtracking.
1861
1862Here is an example of backtracking: Let's say you want to find the
1863word following "foo" in the string "Food is on the foo table.":
1864
    $_ = "Food is on the foo table.";
    if ( /\b(foo)\s+(\w+)/i ) {
        print "$2 follows $1.\n";
    }
1869
1870When the match runs, the first part of the regular expression (C<\b(foo)>)
1871finds a possible match right at the beginning of the string, and loads up
1872$1 with "Foo". However, as soon as the matching engine sees that there's
1873no whitespace following the "Foo" that it had saved in $1, it realizes its
68dc0745 1874mistake and starts over again one character after where it had the
c07a80fd
PP
1875tentative match. This time it goes all the way until the next occurrence
1876of "foo". The complete regular expression matches this time, and you get
1877the expected output of "table follows foo."
1878
1879Sometimes minimal matching can help a lot. Imagine you'd like to match
1880everything between "foo" and "bar". Initially, you write something
1881like this:
1882
    $_ = "The food is under the bar in the barn.";
    if ( /foo(.*)bar/ ) {
        print "got <$1>\n";
    }
1887
1888Which perhaps unexpectedly yields:
1889
1890 got <d is under the bar in the >
1891
1892That's because C<.*> was greedy, so you get everything between the
14218588 1893I<first> "foo" and the I<last> "bar". Here it's more effective
c07a80fd
PP
1894to use minimal matching to make sure you get the text between a "foo"
1895and the first "bar" thereafter.
1896
1897 if ( /foo(.*?)bar/ ) { print "got <$1>\n" }
1898 got <d is under the >
1899
0d017f4d 1900Here's another example. Let's say you'd like to match a number at the end
b6e13d97 1901of a string, and you also want to keep the preceding part of the match.
c07a80fd
PP
1902So you write this:
1903
    $_ = "I have 2 numbers: 53147";
    if ( /(.*)(\d*)/ ) {                                # Wrong!
        print "Beginning is <$1>, number is <$2>.\n";
    }
1908
1909That won't work at all, because C<.*> was greedy and gobbled up the
1910whole string. As C<\d*> can match on an empty string the complete
1911regular expression matched successfully.
1912
8e1088bc 1913 Beginning is <I have 2 numbers: 53147>, number is <>.
c07a80fd
PP
1914
1915Here are some variants, most of which don't work:
1916
    $_ = "I have 2 numbers: 53147";
    @pats = qw{
        (.*)(\d*)
        (.*)(\d+)
        (.*?)(\d*)
        (.*?)(\d+)
        (.*)(\d+)$
        (.*?)(\d+)$
        (.*)\b(\d+)$
        (.*\D)(\d+)$
    };

    for $pat (@pats) {
        printf "%-12s ", $pat;
        if ( /$pat/ ) {
            print "<$1> <$2>\n";
        } else {
            print "FAIL\n";
        }
    }
1937
1938That will print out:
1939
1940 (.*)(\d*) <I have 2 numbers: 53147> <>
1941 (.*)(\d+) <I have 2 numbers: 5314> <7>
1942 (.*?)(\d*) <> <>
1943 (.*?)(\d+) <I have > <2>
1944 (.*)(\d+)$ <I have 2 numbers: 5314> <7>
1945 (.*?)(\d+)$ <I have 2 numbers: > <53147>
1946 (.*)\b(\d+)$ <I have 2 numbers: > <53147>
1947 (.*\D)(\d+)$ <I have 2 numbers: > <53147>
1948
1949As you see, this can be a bit tricky. It's important to realize that a
1950regular expression is merely a set of assertions that gives a definition
1951of success. There may be 0, 1, or several different ways that the
1952definition might succeed against a particular string. And if there are
5a964f20
TC
1953multiple ways it might succeed, you need to understand backtracking to
1954know which variety of success you will achieve.
c07a80fd 1955
19799a22 1956When using look-ahead assertions and negations, this can all get even
8b19b778 1957trickier. Imagine you'd like to find a sequence of non-digits not
c07a80fd
PP
1958followed by "123". You might try to write that as
1959
    $_ = "ABC123";
    if ( /^\D*(?!123)/ ) {                # Wrong!
        print "Yup, no 123 in $_\n";
    }
c07a80fd
PP
1964
1965But that isn't going to match; at least, not the way you're hoping. It
1966claims that there is no 123 in the string. Here's a clearer picture of
9b9391b2 1967why that pattern matches, contrary to popular expectations:
c07a80fd 1968
    $x = 'ABC123';
    $y = 'ABC445';

    print "1: got $1\n" if $x =~ /^(ABC)(?!123)/;
    print "2: got $1\n" if $y =~ /^(ABC)(?!123)/;

    print "3: got $1\n" if $x =~ /^(\D*)(?!123)/;
    print "4: got $1\n" if $y =~ /^(\D*)(?!123)/;
c07a80fd
PP
1977
1978This prints
1979
1980 2: got ABC
1981 3: got AB
1982 4: got ABC
1983
You might have expected test 3 to fail because it seems to be a more
general-purpose version of test 1. The important difference between
them is that test 3 contains a quantifier (C<\D*>) and so can use
backtracking, whereas test 1 will not. What's happening is
that you've asked "Is it true that at the start of $x, following 0 or more
non-digits, you have something that's not 123?" If the pattern matcher had
let C<\D*> expand to "ABC", this would have caused the whole pattern to
fail.
14218588 1992
c07a80fd 1993The search engine will initially match C<\D*> with "ABC". Then it will
0b928c2f 1994try to match C<(?!123)> with "123", which fails. But because
c07a80fd
PP
1995a quantifier (C<\D*>) has been used in the regular expression, the
1996search engine can backtrack and retry the match differently
54310121 1997in the hope of matching the complete regular expression.
c07a80fd 1998
5a964f20
TC
1999The pattern really, I<really> wants to succeed, so it uses the
2000standard pattern back-off-and-retry and lets C<\D*> expand to just "AB" this
c07a80fd 2001time. Now there's indeed something following "AB" that is not
14218588 2002"123". It's "C123", which suffices.
c07a80fd 2003
14218588
GS
2004We can deal with this by using both an assertion and a negation.
2005We'll say that the first part in $1 must be followed both by a digit
2006and by something that's not "123". Remember that the look-aheads
2007are zero-width expressions--they only look, but don't consume any
2008of the string in their match. So rewriting this way produces what
c07a80fd
PP
2009you'd expect; that is, case 5 will fail, but case 6 succeeds:
2010
    print "5: got $1\n" if $x =~ /^(\D*)(?=\d)(?!123)/;
    print "6: got $1\n" if $y =~ /^(\D*)(?=\d)(?!123)/;

    6: got ABC
2015
5a964f20 2016In other words, the two zero-width assertions next to each other work as though
19799a22 2017they're ANDed together, just as you'd use any built-in assertions: C</^$/>
c07a80fd
PP
2018matches only if you're at the beginning of the line AND the end of the
2019line simultaneously. The deeper underlying truth is that juxtaposition in
2020regular expressions always means AND, except when you write an explicit OR
2021using the vertical bar. C</ab/> means match "a" AND (then) match "b",
2022although the attempted matches are made at different positions because "a"
2023is not a zero-width assertion, but a one-width assertion.
2024
0d017f4d 2025B<WARNING>: Particularly complicated regular expressions can take
14218588 2026exponential time to solve because of the immense number of possible
0d017f4d 2027ways they can use backtracking to try for a match. For example, without
9da458fc
IZ
2028internal optimizations done by the regular expression engine, this will
2029take a painfully long time to run:
c07a80fd 2030
e1901655
IZ
2031 'aaaaaaaaaaaa' =~ /((a{0,5}){0,5})*[c]/
2032
2033And if you used C<*>'s in the internal groups instead of limiting them
2034to 0 through 5 matches, then it would take forever--or until you ran
2035out of stack space. Moreover, these internal optimizations are not
2036always applicable. For example, if you put C<{0,5}> instead of C<*>
2037on the external group, no current optimization is applicable, and the
2038match takes a long time to finish.
c07a80fd 2039
9da458fc
IZ
2040A powerful tool for optimizing such beasts is what is known as an
2041"independent group",
96090e4f 2042which does not backtrack (see L</C<< (?>pattern) >>>). Note also that
9da458fc 2043zero-length look-ahead/look-behind assertions will not backtrack to make
5d458dd8 2044the tail match, since they are in "logical" context: only
14218588 2045whether they match is considered relevant. For an example
9da458fc 2046where side-effects of look-ahead I<might> have influenced the
96090e4f 2047following match, see L</C<< (?>pattern) >>>.
c277df42 2048
=head2 Version 8 Regular Expressions
X<regular expression, version 8> X<regex, version 8> X<regexp, version 8>

In case you're not familiar with the "regular" Version 8 regex
routines, here are the pattern-matching rules not described above.

Any single character matches itself, unless it is a I<metacharacter>
with a special meaning described here or above. You can cause
characters that normally function as metacharacters to be interpreted
literally by prefixing them with a "\" (e.g., "\." matches a ".", not any
character; "\\" matches a "\"). This escape mechanism is also required
for the character used as the pattern delimiter.

A series of characters matches that series of characters in the target
string, so the pattern C<blurfl> would match "blurfl" in the target
string.

You can specify a character class by enclosing a list of characters
in C<[]>, which will match any character from the list. If the
first character after the "[" is "^", the class matches any character not
in the list. Within a list, the "-" character specifies a
range, so that C<a-z> represents all characters between "a" and "z",
inclusive. If you want either "-" or "]" itself to be a member of a
class, put it at the start of the list (possibly after a "^"), or
escape it with a backslash. "-" is also taken literally when it is
at the end of the list, just before the closing "]". (The
following all specify the same class of three characters: C<[-az]>,
C<[az-]>, and C<[a\-z]>. All are different from C<[a-z]>, which
specifies a class containing twenty-six characters, even on EBCDIC-based
character sets.) Also, if you try to use the character
classes C<\w>, C<\W>, C<\s>, C<\S>, C<\d>, or C<\D> as endpoints of
a range, the "-" is understood literally.

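
The rules for a literal "-" inside a class can be checked with a short script. This is only a sketch: the probe characters and the C<%accepted> hash are illustrative names, not part of any API.

```perl
use strict;
use warnings;

# Three spellings of the same three-character class: "-", "a", "z".
my @classes = (qr/[-az]/, qr/[az-]/, qr/[a\-z]/);

# For each probe character, record which of the three classes accept it.
our %accepted;
for my $char ('-', 'a', 'z', 'm') {
    $accepted{$char} = [ map { $char =~ $_ ? 1 : 0 } @classes ];
    print "'$char': @{ $accepted{$char} }\n";
}

# By contrast, [a-z] is a range: it accepts "m" but rejects "-".
our $range_m    = ('m' =~ /[a-z]/) ? 1 : 0;
our $range_dash = ('-' =~ /[a-z]/) ? 1 : 0;
print "[a-z] matches 'm': $range_m, '-': $range_dash\n";
```

All three spellings agree on every probe, while C<[a-z]> behaves differently on both "m" and "-".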
Note also that ranges are rather unportable between character sets,
and even within a character set they may produce results
you probably didn't expect. A sound principle is to use only ranges
that begin from and end at either alphabetics of equal case (C<[a-e]>,
C<[A-E]>), or digits (C<[0-9]>). Anything else is unsafe. If in doubt,
spell out the character sets in full.

Characters may be specified using a metacharacter syntax much like that
used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,
"\f" a form feed, etc. More generally, \I<nnn>, where I<nnn> is a string
of three octal digits, matches the character whose coded character set value
is I<nnn>. Similarly, \xI<nn>, where I<nn> are hexadecimal digits,
matches the character whose ordinal is I<nn>. The expression \cI<x>
matches the character control-I<x>. Finally, the "." metacharacter
matches any character except "\n" (unless you use C</s>).

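
A quick way to see the numeric escapes in action is sketched below. It assumes an ASCII-based platform, where "A" has ordinal 65 (octal 101, hex 41); the ordinals differ on EBCDIC.

```perl
use strict;
use warnings;

# Each entry is 1 if the pattern matches the string, 0 otherwise.
# The ordinals assume an ASCII-based platform.
our @results = (
    ("A"    =~ /\101/ ? 1 : 0),   # three octal digits
    ("A"    =~ /\x41/ ? 1 : 0),   # hexadecimal digits
    ("\x01" =~ /\cA/  ? 1 : 0),   # control-A has ordinal 1
    ("\n"   =~ /./    ? 1 : 0),   # "." skips newline by default
    ("\n"   =~ /./s   ? 1 : 0),   # /s lets "." match newline
);
print "@results\n";
```

Only the fourth test fails to match, which is the default behavior of "." that C</s> changes.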
You can specify a series of alternatives for a pattern using "|" to
separate them, so that C<fee|fie|foe> will match any of "fee", "fie",
or "foe" in the target string (as would C<f(e|i|o)e>). The
first alternative includes everything from the last pattern delimiter
("(", "(?:", etc. or the beginning of the pattern) up to the first "|", and
the last alternative contains everything from the last "|" to the next
closing pattern delimiter. That's why it's common practice to include
alternatives in parentheses: to minimize confusion about where they
start and end.

Alternatives are tried from left to right, so the first
alternative found for which the entire expression matches is the one that
is chosen. This means that alternatives are not necessarily greedy. For
example: when matching C<foo|foot> against "barefoot", only the "foo"
part will match, as that is the first alternative tried, and it successfully
matches the target string. (This might not seem important, but it is
important when you are capturing matched text using parentheses.)

Also remember that "|" is interpreted as a literal within square brackets,
so if you write C<[fee|fie|foe]> you're really only matching C<[feio|]>.

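
The left-to-right rule for alternatives described above can be observed by capturing what each pattern matches. The following is a minimal sketch; the variable names are illustrative.

```perl
use strict;
use warnings;

our ($first, $reordered, $anchored);

# "foo" is tried first and succeeds, so "foot" is never attempted.
$first     = $1 if "barefoot" =~ /(foo|foot)/;

# Swapping the alternatives lets the longer one win.
$reordered = $1 if "barefoot" =~ /(foot|foo)/;

# With "$" the whole expression fails after "foo", so Perl
# falls back to the second alternative.
$anchored  = $1 if "barefoot" =~ /(foo|foot)$/;

print "$first $reordered $anchored\n";
```

The third case shows why the rule is phrased in terms of the I<entire expression> matching, not just the alternative itself.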
Within a pattern, you may designate subpatterns for later reference
by enclosing them in parentheses, and you may refer back to the
I<n>th subpattern later in the pattern using the metacharacter
\I<n> or \gI<n>. Subpatterns are numbered based on the left to right order
of their opening parenthesis. A backreference matches whatever
actually matched the subpattern in the string being examined, not
the rules for that subpattern. Therefore, C<(0|0x)\d*\s\g1\d*> will
match "0x1234 0x4321", but not "0x1234 01234", because subpattern
1 matched "0x", even though the rule C<0|0x> could potentially match
the leading 0 in the second number.

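
The backreference example from the paragraph above can be run as written. In this sketch only the variable names are additions.

```perl
use strict;
use warnings;

my $re = qr/(0|0x)\d*\s\g1\d*/;

# Group 1 ends up holding "0x", and \g1 must then see "0x" again.
our $same_prefix  = "0x1234 0x4321" =~ $re ? $1 : undef;

# Here the second number starts with "01", not "0x", so \g1 fails.
our $mixed_prefix = "0x1234 01234"  =~ $re ? 1 : 0;

print "captured: $same_prefix; mixed prefixes match: $mixed_prefix\n";
```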
=head2 Warning on \1 Instead of $1

Some people get too used to writing things like:

    $pattern =~ s/(\W)/\\\1/g;

This is grandfathered (for \1 to \9) for the RHS of a substitute to avoid
shocking the
B<sed> addicts, but it's a dirty habit to get into. That's because in
PerlThink, the righthand side of an C<s///> is a double-quoted string. C<\1> in
the usual double-quoted string means a control-A. The customary Unix
meaning of C<\1> is kludged in for C<s///>. However, if you get into the habit
of doing that, you get yourself into trouble if you then add an C</e>
modifier.

    s/(\d+)/ \1 + 1 /eg;            # causes warning under -w

Or if you try to do

    s/(\d+)/\1000/;

You can't disambiguate that by saying C<\{1}000>, whereas you can fix it with
C<${1}000>. The operation of interpolation should not be confused
with the operation of matching a backreference. Certainly they mean two
different things on the I<left> side of the C<s///>.

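
For comparison, here is a sketch of the two problem cases rewritten with C<$1>. The sample strings are made up for illustration.

```perl
use strict;
use warnings;

# ${1}000 keeps the capture variable separate from the digits after it.
our $appended = "item 42";
$appended =~ s/(\d+)/${1}000/;

# With /e the replacement is Perl code, where $1 is an ordinary variable.
our $incremented = "a1 b22";
$incremented =~ s/(\d+)/$1 + 1/eg;

print "$appended\n$incremented\n";
```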
=head2 Repeated Patterns Matching a Zero-length Substring

B<WARNING>: Difficult material (and prose) ahead. This section needs a rewrite.

Regular expressions provide a terse and powerful programming language. As
with most other power tools, power comes together with the ability
to wreak havoc.

A common abuse of this power stems from the ability to make infinite
loops using regular expressions, with something as innocuous as:

    'foo' =~ m{ ( o? )* }x;

The C<o?> matches at the beginning of C<'foo'>, and since the position
in the string is not moved by the match, C<o?> would match again and again
because of the C<*> quantifier. Another common way to create a similar cycle
is with the looping modifier C<//g>:

    @matches = ( 'foo' =~ m{ o? }xg );

or

    print "match: <$&>\n" while 'foo' =~ m{ o? }xg;

or the loop implied by split().

However, long experience has shown that many programming tasks may
be significantly simplified by using repeated subexpressions that
may match zero-length substrings. Here is a simple example:

    @chars = split //, $string;           # // is not magic in split
    ($whitewashed = $string) =~ s/()/ /g; # parens avoid magic s// /

Thus Perl allows such constructs, by I<forcefully breaking
the infinite loop>. The rules for this are different for lower-level
loops given by the greedy quantifiers C<*+{}>, and for higher-level
ones like the C</g> modifier or split() operator.

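
The two zero-length idioms shown earlier in this section terminate as promised. A small sketch, using a sample string chosen for illustration:

```perl
use strict;
use warnings;

my $string = 'abc';

# An empty pattern in split matches between every pair of characters.
our @chars = split //, $string;

# s/()/ /g matches the empty string at every position, including the end.
our $whitewashed = $string;
$whitewashed =~ s/()/ /g;

print join(',', @chars), "\n";
print "[$whitewashed]\n";
```

Both statements finish after one pass over the string even though the pattern can match the empty string at every position.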
The lower-level loops are I<interrupted> (that is, the loop is
broken) when Perl detects that a repeated expression matched a
zero-length substring. Thus

    m{ (?: NON_ZERO_LENGTH | ZERO_LENGTH )* }x;

is made equivalent to

    m{ (?: NON_ZERO_LENGTH )* (?: ZERO_LENGTH )? }x;

For example, this program

    #!perl -l
    "aaaaab" =~ /
      (?:
         a                 # non-zero
         |                 # or
        (?{print "hello"}) # print hello whenever this
                           #    branch is tried
        (?=(b))            # zero-width assertion
      )*  # any number of times
     /x;
    print $&;
    print $1;

prints

    hello
    aaaaa
    b

Notice that "hello" is printed only once: when Perl sees that the sixth
iteration of the outermost C<(?:)*> matches a zero-length string, it stops
the C<*>.

The higher-level loops preserve an additional state between iterations:
whether the last match was zero-length. To break the loop, the match
following a zero-length match is prohibited from having a length of zero.
This prohibition interacts with backtracking (see L<"Backtracking">),
and so the I<second best> match is chosen if the I<best> match is of
zero length.

For example:

    $_ = 'bar';
    s/\w??/<$&>/g;

results in C<< <><b><><a><><r><> >>. At each position of the string the best
match given by non-greedy C<??> is the zero-length match, and the I<second
best> match is what is matched by C<\w>. Thus zero-length matches
alternate with one-character-long matches.

Similarly, for repeated C<m/()/g> the second-best match is the match at the
position one notch further in the string.

The additional state of being I<matched with zero-length> is associated with
the matched string, and is reset by each assignment to pos().
Zero-length matches at the end of the previous match are ignored
during C<split>.

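
Both higher-level examples above can be checked directly. In this sketch the C<@positions> array is an illustrative addition that records pos() after each successful C<m/()/g>.

```perl
use strict;
use warnings;

# s///g: zero-length and one-character matches alternate.
our $bar = 'bar';
$bar =~ s/\w??/<$&>/g;
print "$bar\n";

# m//g in scalar context: each empty match lands one position further on.
our @positions;
my $str = 'abc';
while ($str =~ m/()/g) {
    push @positions, pos($str);
}
print "@positions\n";
```

The loop ends on its own once the empty match has been found at the end of the string.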
=head2 Combining RE Pieces

Each of the elementary pieces of regular expressions which were described
before (such as C<ab> or C<\Z>) could match at most one substring
at the given position of the input string. However, in a typical regular
expression these elementary pieces are combined into more complicated
patterns using combining operators C<ST>, C<S|T>, C<S*> etc.
(in these examples C<S> and C<T> are regular subexpressions).

Such combinations can include alternatives, leading to a problem of choice:
if we match a regular expression C<a|ab> against C<"abc">, will it match
substring C<"a"> or C<"ab">? One way to describe which substring is
actually matched is the concept of backtracking (see L<"Backtracking">).
However, this description is too low-level and makes you think
in terms of a particular implementation.

Another description starts with notions of "better"/"worse". All the
substrings which may be matched by the given regular expression can be
sorted from the "best" match to the "worst" match, and it is the "best"
match which is chosen. This replaces the question of "what is chosen?"
with the question of "which matches are better, and which are worse?".

Again, for elementary pieces there is no such question, since at most
one match at a given position is possible. This section describes the
notion of better/worse for combining operators. In the description
below C<S> and C<T> are regular subexpressions.

=over 4

=item C<ST>

Consider two possible matches, C<AB> and C<A'B'>, where C<A> and C<A'> are
substrings which can be matched by C<S>, and C<B> and C<B'> are substrings
which can be matched by C<T>.

If C<A> is a better match for C<S> than C<A'>, C<AB> is a better
match than C<A'B'>.

If C<A> and C<A'> coincide: C<AB> is a better match than C<AB'> if
C<B> is a better match for C<T> than C<B'>.

=item C<S|T>

When C<S> can match, it is a better match than when only C<T> can match.

The ordering of two matches for C<S> is the same as it is for C<S> alone.
The same holds for two matches for C<T>.

=item C<S{REPEAT_COUNT}>

Matches as C<SSS...S> (repeated as many times as necessary).

=item C<S{min,max}>

Matches as C<S{max}|S{max-1}|...|S{min+1}|S{min}>.

=item C<S{min,max}?>

Matches as C<S{min}|S{min+1}|...|S{max-1}|S{max}>.

=item C<S?>, C<S*>, C<S+>

Same as C<S{0,1}>, C<S{0,BIG_NUMBER}>, C<S{1,BIG_NUMBER}> respectively.

=item C<S??>, C<S*?>, C<S+?>

Same as C<S{0,1}?>, C<S{0,BIG_NUMBER}?>, C<S{1,BIG_NUMBER}?> respectively.

=item C<< (?>S) >>

Matches the best match for C<S> and only that.

=item C<(?=S)>, C<(?<=S)>

Only the best match for C<S> is considered. (This is important only if
C<S> has capturing parentheses, and backreferences are used somewhere
else in the whole regular expression.)

=item C<(?!S)>, C<(?<!S)>

For this grouping operator there is no need to describe the ordering, since
only whether or not C<S> can match is important.

=item C<(??{ EXPR })>, C<(?PARNO)>

The ordering is the same as for the regular expression which is
the result of EXPR, or the pattern contained by capture group PARNO.

=item C<(?(condition)yes-pattern|no-pattern)>

Recall that which of C<yes-pattern> or C<no-pattern> actually matches is
already determined. The ordering of the matches is the same as for the
chosen subexpression.

=back

The above recipes describe the ordering of matches I<at a given position>.
One more rule is needed to understand how a match is determined for the
whole regular expression: a match at an earlier position is always better
than a match at a later position.

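
A few of the ordering rules above can be observed by asking Perl what each pattern actually matches. This is a sketch only: C<matched> is a hypothetical helper written for this example, and the test strings are arbitrary.

```perl
use strict;
use warnings;

# Hypothetical helper: return the substring a pattern matches, or undef.
sub matched {
    my ($string, $re) = @_;
    return $string =~ /($re)/ ? $1 : undef;
}

our %got = (
    'a|ab'    => matched('abc', qr/a|ab/),     # S|T: S wins when it can match
    'ab|a'    => matched('abc', qr/ab|a/),     # same pieces, other order
    'a{1,3}'  => matched('aaa', qr/a{1,3}/),   # ordered as aaa|aa|a
    'a{1,3}?' => matched('aaa', qr/a{1,3}?/),  # ordered as a|aa|aaa
    'b|ab'    => matched('xab', qr/b|ab/),     # earlier position beats
                                               # the first alternative
);
print "$_ => $got{$_}\n" for sort keys %got;
```

The last entry illustrates the final rule: C<ab> starts one character earlier than any match for C<b>, so it is chosen even though C<b> is listed first.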
=head2 Creating Custom RE Engines

As of Perl 5.10.0, one can create custom regular expression engines. This
is not for the faint of heart, as they have to plug in at the C level. See
L<perlreapi> for more details.

As an alternative, overloaded constants (see L<overload>) provide a simple
way to extend the functionality of the RE engine, by substituting one
pattern for another.

Suppose that we want to enable a new RE escape-sequence C<\Y|> which
matches at a boundary between whitespace characters and non-whitespace
characters. Note that C<(?=\S)(?<!\S)|(?!\S)(?<=\S)> matches exactly
at these positions, so we want to have each C<\Y|> in the place of the
more complicated version. We can create a module C<customre> to do
this:

    package customre;
    use overload;

    sub import {
      shift;
      die "No argument to customre::import allowed" if @_;
      overload::constant 'qr' => \&convert;
    }

    sub invalid { die "/$_[0]/: invalid escape '\\$_[1]'"}

    # We must also take care of not escaping the legitimate \\Y|
    # sequence, hence the presence of '\\' in the conversion rules.
    my %rules = ( '\\' => '\\\\',
                  'Y|' => qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/ );
    sub convert {
      my $re = shift;
      $re =~ s{
                \\ ( \\ | Y . )
              }
              { $rules{$1} or invalid($re,$1) }sgex;
      return $re;
    }

Now C<use customre> enables the new escape in constant regular
expressions, i.e., those without any runtime variable interpolations.
As documented in L<overload>, this conversion will work only over
literal parts of regular expressions. For C<\Y|$re\Y|> the variable
part of this regular expression needs to be converted explicitly
(but only if the special meaning of C<\Y|> should be enabled inside $re):

    use customre;
    $re = <>;
    chomp $re;
    $re = customre::convert $re;
    /\Y|$re\Y|/;

=head2 PCRE/Python Support

As of Perl 5.10.0, Perl supports several Python/PCRE-specific extensions
to the regex syntax. While Perl programmers are encouraged to use the
Perl-specific syntax, the following are also accepted:

=over 4

=item C<< (?PE<lt>NAMEE<gt>pattern) >>

Define a named capture group. Equivalent to C<< (?<NAME>pattern) >>.

=item C<< (?P=NAME) >>

Backreference to a named capture group. Equivalent to C<< \g{NAME} >>.

=item C<< (?P>NAME) >>

Subroutine call to a named capture group. Equivalent to C<< (?&NAME) >>.

=back

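
The three Python/PCRE spellings listed above can be tried side by side with a Perl-native equivalent. This is a sketch with made-up group names and test strings, and it needs Perl 5.10.0 or later.

```perl
use strict;
use warnings;

# (?P<NAME>...) with (?P=NAME): the backreference must repeat
# the text that the group captured.
our $year = '1999-1999' =~ /(?P<year>\d{4})-(?P=year)/
          ? $+{year} : undef;

# (?P>NAME) re-runs the group's pattern, so the digits after the
# dot need not be the same as the captured ones.
our $digits = '12.345' =~ /^(?P<digits>\d+)\.(?P>digits)$/
            ? $+{digits} : undef;

# The Perl-native spelling of the previous pattern.
our $native = '12.345' =~ /^(?<digits>\d+)\.(?&digits)$/
            ? $+{digits} : undef;

print "$year $digits $native\n";
```

The second and third patterns capture the same text, which is the equivalence the list above describes.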
=head1 BUGS

Many regular expression constructs don't work on EBCDIC platforms.

There are a number of issues with regard to case-insensitive matching
in Unicode rules. See C<i> under L</Modifiers> above.

This document varies from difficult to understand to completely
and utterly opaque. The wandering prose riddled with jargon is
hard to fathom in several places.

This document needs a rewrite that separates the tutorial content
from the reference content.

=head1 SEE ALSO

L<perlrequick>.

L<perlretut>.

L<perlop/"Regexp Quote-Like Operators">.

L<perlop/"Gory details of parsing quoted constructs">.

L<perlfaq6>.

L<perlfunc/pos>.

L<perllocale>.

L<perlebcdic>.

I<Mastering Regular Expressions> by Jeffrey Friedl, published
by O'Reilly and Associates.