=head1 NAME
X<regular expression> X<regex> X<regexp>

perlre - Perl regular expressions

=head1 DESCRIPTION

This page describes the syntax of regular expressions in Perl.

If you haven't used regular expressions before, a quick-start
introduction is available in L<perlrequick>, and a longer tutorial
introduction is available in L<perlretut>.

For reference on how regular expressions are used in matching
operations, plus various examples of the same, see discussions of
C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/"Regexp Quote-Like
Operators">.

=head2 Modifiers

Matching operations can have various modifiers. Modifiers
that relate to the interpretation of the regular expression inside
are listed below. Modifiers that alter the way a regular expression
is used by Perl are detailed in L<perlop/"Regexp Quote-Like Operators"> and
L<perlop/"Gory details of parsing quoted constructs">.

=over 4

=item m
X</m> X<regex, multiline> X<regexp, multiline> X<regular expression, multiline>

Treat string as multiple lines. That is, change "^" and "$" from matching
the start or end of the string to matching the start or end of any
line anywhere within the string.

=item s
X</s> X<regex, single-line> X<regexp, single-line>
X<regular expression, single-line>

Treat string as single line. That is, change "." to match any character
whatsoever, even a newline, which normally it would not match.

Used together, as C</ms>, they let the "." match any character whatsoever,
while still allowing "^" and "$" to match, respectively, just after
and just before newlines within the string.

=item i
X</i> X<regex, case-insensitive> X<regexp, case-insensitive>
X<regular expression, case-insensitive>

Do case-insensitive pattern matching.

If locale matching rules are in effect, the case map is taken from the
current locale for code points less than 256, and from Unicode rules for
larger code points. However, matches that would cross the Unicode
rules/non-Unicode rules boundary (ords 255/256) will not succeed. See
L<perllocale>.

There are a number of Unicode characters that match multiple characters
under C</i>. For example, C<LATIN SMALL LIGATURE FI>
should match the sequence C<fi>. Perl is not
currently able to do this when the multiple characters are in the pattern and
are split between groupings, or when one or more are quantified. Thus

    "\N{LATIN SMALL LIGATURE FI}" =~ /fi/i;          # Matches
    "\N{LATIN SMALL LIGATURE FI}" =~ /[fi][fi]/i;    # Doesn't match!
    "\N{LATIN SMALL LIGATURE FI}" =~ /fi*/i;         # Doesn't match!

    # The below doesn't match, and it isn't clear what $1 and $2 would
    # be even if it did!!
    "\N{LATIN SMALL LIGATURE FI}" =~ /(f)(i)/i;      # Doesn't match!

Perl doesn't match multiple characters in an inverted bracketed
character class, which otherwise could be highly confusing. See
L<perlrecharclass/Negation>.

Also, Perl matching doesn't fully conform to the current Unicode C</i>
recommendations, which ask that the matching be made upon the NFD
(Normalization Form Decomposed) of the text. However, Unicode is
in the process of reconsidering and revising their recommendations.

=item x
X</x>

Extend your pattern's legibility by permitting whitespace and comments.
Details in L</"/x">.

=item p
X</p> X<regex, preserve> X<regexp, preserve>

Preserve the string matched such that ${^PREMATCH}, ${^MATCH}, and
${^POSTMATCH} are available for use after matching.

=item g and c
X</g> X</c>

Global matching, and keep the Current position after failed matching.
Unlike i, m, s and x, these two flags affect the way the regex is used
rather than the regex itself. See
L<perlretut/"Using regular expressions in Perl"> for further explanation
of the g and c modifiers.

=item a, d, l and u
X</a> X</d> X</l> X</u>

These modifiers, new in 5.14, affect which character-set semantics
(Unicode, ASCII, etc.) are used, as described below in
L</Character set modifiers>.

=back

These are usually written as "the C</x> modifier", even though the delimiter
in question might not really be a slash. The modifiers C</imsxadlup>
may also be embedded within the regular expression itself using
the C<(?...)> construct; see L</Extended Patterns> below.
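
As a rough illustration of the C</m> and C</s> modifiers and of an
embedded modifier, consider the following sketch. The sample string and
the variable names are made up for this example.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# Without /m, "^" matches only at the very start of the string.
my @plain = $text =~ /^(\w+)/g;

# With /m, "^" also matches just after each embedded newline.
my @multi = $text =~ /^(\w+)/mg;

# Without /s, "." refuses to cross a newline; with /s it will.
my $dot_plain = $text =~ /first.*second/  ? 1 : 0;
my $dot_s     = $text =~ /first.*second/s ? 1 : 0;

# A modifier can also be embedded in the pattern itself.
my $embedded = "HELLO" =~ /(?i)hello/ ? 1 : 0;

print "plain: @plain\n";    # plain: first
print "multi: @multi\n";    # multi: first second
print "dot: $dot_plain $dot_s, embedded: $embedded\n";
```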

The C</x>, C</l>, C</u>, C</a> and C</d> modifiers need a little more
explanation.

=head3 /x

C</x> tells
the regular expression parser to ignore most whitespace that is neither
backslashed nor within a character class. You can use this to break up
your regular expression into (slightly) more readable parts. The C<#>
character is also treated as a metacharacter introducing a comment,
just as in ordinary Perl code. This also means that if you want real
whitespace or C<#> characters in the pattern (outside a character
class, where they are unaffected by C</x>), then you'll either have to
escape them (using backslashes or C<\Q...\E>) or encode them using octal,
hex, or C<\N{}> escapes. Taken together, these features go a long way towards
making Perl's regular expressions more readable. Note that you have to
be careful not to include the pattern delimiter in the comment--perl has
no way of knowing you did not intend to close the pattern early. See
the C-comment deletion code in L<perlop>. Also note that anything inside
a C<\Q...\E> stays unaffected by C</x>. And note that C</x> doesn't affect
space interpretation within a single multi-character construct. For
example in C<\x{...}>, regardless of the C</x> modifier, there can be no
spaces. Same for a L<quantifier|/Quantifiers> such as C<{3}> or
C<{5,}>. Similarly, C<(?:...)> can't have a space between the C<?> and C<:>,
but can between the C<(> and C<?>. Within any delimiters for such a
construct, allowed spaces are not affected by C</x>, and depend on the
construct. For example, C<\x{...}> can't have spaces because hexadecimal
numbers don't have spaces in them. But, Unicode properties can have spaces, so
in C<\p{...}> there can be spaces that follow the Unicode rules, for which see
L<perluniprops/Properties accessible through \p{} and \P{}>.
X</x>
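
The following sketch shows one way C</x> might be used. The date
pattern and the variable names are hypothetical and serve only to
illustrate the whitespace and comment rules described above.

```perl
use strict;
use warnings;

# A hypothetical pattern for dates such as "2011-05-14".
# Note that none of the comments contains the pattern delimiter.
my $date_re = qr/
    ^
    (\d{4})  -      # year, then a literal hyphen
    (\d{2})  -      # month, then a literal hyphen
    (\d{2})         # day
    \z
/x;

my @parts = "2011-05-14" =~ $date_re;

# Under /x an unescaped space in the pattern is ignored ...
my $unescaped = "a b" =~ /a b/x   ? 1 : 0;    # pattern is really "ab"
# ... so a literal space must be backslashed or put in a bracketed class.
my $escaped   = "a b" =~ /a\ b/x  ? 1 : 0;
my $in_class  = "a b" =~ /a[ ]b/x ? 1 : 0;

print join("/", @parts), "\n";                 # 2011/05/14
print "$unescaped $escaped $in_class\n";       # 0 1 1
```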

=head3 Character set modifiers

C</d>, C</u>, C</a>, and C</l>, available starting in 5.14, are called
the character set modifiers; they affect the character set semantics
used for the regular expression.

At any given time, exactly one of these modifiers is in effect. Once
compiled, the behavior doesn't change regardless of what rules are in
effect when the regular expression is executed. And if a regular
expression is interpolated into a larger one, the original's rules
continue to apply to it, and only it.
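
A minimal sketch of the interpolation rule, assuming Perl 5.14 or later.
The variable names and the choice of Arabic-Indic digits as test data
are ours, not part of the original text.

```perl
use strict;
use warnings;

my $ascii_digits = qr/\d+/a;                     # \d is ASCII-only here
my $mixed        = qr/^${ascii_digits}-\d+\z/u;  # outer \d uses Unicode rules

my $arabic = "\x{0663}\x{0664}";   # ARABIC-INDIC DIGIT THREE and FOUR

# The interpolated part keeps its /a rules; the rest follows /u.
my $outer_unicode = "12-${arabic}" =~ $mixed ? 1 : 0;
my $inner_ascii   = "${arabic}-12" =~ $mixed ? 1 : 0;

print "$outer_unicode $inner_ascii\n";           # 1 0
```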

Note that the modifiers affect only pattern matching, and do not extend
to any replacement done. For example,

    s/foo/\Ubar/l

will uppercase "bar", but the C</l> does not affect how the C<\U>
operates. If C<use locale> is in effect, the C<\U> will use locale
rules; if C<use feature 'unicode_strings'> is in effect, it will
use Unicode rules, etc.

=head4 /l

means to use the current locale's rules (see L<perllocale>) when pattern
matching. For example, C<\w> will match the "word" characters of that
locale, and C<"/i"> case-insensitive matching will match according to
the locale's case folding rules. The locale used will be the one in
effect at the time of execution of the pattern match. This may not be
the same as the compilation-time locale, and can differ from one match
to another if there is an intervening call of the
L<setlocale() function|perllocale/The setlocale function>.

Perl only supports single-byte locales. This means that code points
above 255 are treated as Unicode no matter what locale is in effect.
Under Unicode rules, there are a few case-insensitive matches that cross
the 255/256 boundary. These are disallowed under C</l>. For example,
0xFF does not caselessly match the character at 0x178, C<LATIN CAPITAL
LETTER Y WITH DIAERESIS>, because 0xFF may not be C<LATIN SMALL LETTER Y
WITH DIAERESIS> in the current locale, and Perl has no way of knowing if
that character even exists in the locale, much less what code point it
is.

This modifier may be specified to be the default by C<use locale>, but
see L</Which character set modifier is in effect?>.
X</l>

=head4 /u

means to use Unicode rules when pattern matching. On ASCII platforms,
this means that the code points between 128 and 255 take on their
Latin-1 (ISO-8859-1) meanings (which are the same as Unicode's), whereas
in strict ASCII their meanings are undefined. Thus the platform
effectively becomes a Unicode platform, hence, for example, C<\w> will
match any of the more than 100_000 word characters in Unicode.

Unlike most locales, which are specific to a language and country pair,
Unicode classifies all the characters that are letters I<somewhere> as
C<\w>. For example, your locale might not think that C<LATIN SMALL
LETTER ETH> is a letter (unless you happen to speak Icelandic), but
Unicode does. Similarly, all the characters that are decimal digits
somewhere in the world will match C<\d>; this is hundreds, not 10,
possible matches. And some of those digits look like some of the 10
ASCII digits, but mean a different number, so a human could easily think
a number is a different quantity than it really is. For example,
C<BENGALI DIGIT FOUR> (U+09EA) looks very much like an
C<ASCII DIGIT EIGHT> (U+0038). And C<\d+> may match strings of digits
that are a mixture from different writing systems, creating a security
issue. L<Unicode::UCDE<sol>num()|Unicode::UCD/num> can be used to sort this out.

Also, case-insensitive matching works on the full set of Unicode
characters. The C<KELVIN SIGN>, for example, matches the letters "k" and
"K"; and C<LATIN SMALL LIGATURE FF> matches the sequence "ff", which,
if you're not prepared, might make it look like a hexadecimal constant,
presenting another potential security issue. See
L<http://unicode.org/reports/tr36> for a detailed discussion of Unicode
security issues.

On the EBCDIC platforms that Perl handles, the native character set is
equivalent to Latin-1. Thus this modifier changes behavior only when
the C<"/i"> modifier is also specified, and it turns out it affects only
two characters, giving them full Unicode semantics: the C<MICRO SIGN>
will match the Greek capital and small letters C<MU>, otherwise not; and
the C<LATIN CAPITAL LETTER SHARP S> will match any of C<SS>, C<Ss>,
C<sS>, and C<ss>, otherwise not.

This modifier may be specified to be the default by C<use feature
'unicode_strings'>, but see
L</Which character set modifier is in effect?>.
X</u>

=head4 /a

is the same as C</u>, except that C<\d>, C<\s>, C<\w>, and the
Posix character classes are restricted to matching in the ASCII range
only. That is, with this modifier, C<\d> always means precisely the
digits C<"0"> to C<"9">; C<\s> means the five characters C<[ \f\n\r\t]>;
C<\w> means the 63 characters C<[A-Za-z0-9_]>; and likewise, all the
Posix classes such as C<[[:print:]]> match only the appropriate
ASCII-range characters.

This modifier is useful for people who only incidentally use Unicode.
With it, one can write C<\d> with confidence that it will only match
ASCII characters, and should the need arise to match beyond ASCII, you
can use C<\p{Digit}>, or C<\p{Word}> for C<\w>. There are similar
C<\p{...}> constructs that can match white space and Posix classes
beyond ASCII. See L<perlrecharclass/POSIX Character Classes>.

As you would expect, this modifier causes, for example, C<\D> to mean
the same thing as C<[^0-9]>; in fact, all non-ASCII characters match
C<\D>, C<\S>, and C<\W>. C<\b> still means to match at the boundary
between C<\w> and C<\W>, using the C</a> definitions of them (similarly
for C<\B>).

Otherwise, C</a> behaves like the C</u> modifier, in that
case-insensitive matching uses Unicode semantics; for example, "k" will
match the Unicode C<\N{KELVIN SIGN}> under C</i> matching, and code
points in the Latin1 range, above ASCII, will have Unicode rules when it
comes to case-insensitive matching.

To forbid ASCII/non-ASCII matches (like "k" with C<\N{KELVIN SIGN}>),
specify the "a" twice, for example C</aai> or C</aia>.

To reiterate, this modifier provides protection for applications that
don't wish to be exposed to all of Unicode. Specifying it twice
gives added protection.

This modifier may be specified to be the default by C<use re '/a'>
or C<use re '/aa'>, but see
L</Which character set modifier is in effect?>.
X</a>
X</aa>
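
The difference between C</u>, C</a> and C</aa> can be sketched as
follows, assuming Perl 5.14 or later. The variable names and the two
sample characters are chosen for this example only.

```perl
use strict;
use warnings;

my $arabic_three = "\x{0663}";    # ARABIC-INDIC DIGIT THREE
my $kelvin       = "\x{212A}";    # KELVIN SIGN

# \d matches any Unicode decimal digit under /u, but only 0-9 under /a.
my $d_unicode = $arabic_three =~ /^\d\z/u ? 1 : 0;
my $d_ascii   = $arabic_three =~ /^\d\z/a ? 1 : 0;

# /a still folds case by Unicode rules; /aa forbids ASCII/non-ASCII matches.
my $k_a  = $kelvin =~ /^k\z/ai  ? 1 : 0;
my $k_aa = $kelvin =~ /^k\z/aai ? 1 : 0;

print "$d_unicode $d_ascii $k_a $k_aa\n";    # 1 0 1 0
```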

=head4 /d

This modifier means to use the "Default" native rules of the platform
except when there is cause to use Unicode rules instead, as follows:

=over 4

=item 1

the target string is encoded in UTF-8; or

=item 2

the pattern is encoded in UTF-8; or

=item 3

the pattern explicitly mentions a code point that is above 255 (say by
C<\x{100}>); or

=item 4

the pattern uses a Unicode name (C<\N{...}>); or

=item 5

the pattern uses a Unicode property (C<\p{...}>)

=back

Another mnemonic for this modifier is "Depends", as the rules actually
used depend on various things, and as a result you can get unexpected
results. See L<perlunicode/The "Unicode Bug">.

On ASCII platforms, the native rules are ASCII, and on EBCDIC platforms
(at least the ones that Perl handles), they are Latin-1.

Here are some examples of how that works on an ASCII platform:

    $str =  "\xDF";      # $str is not in UTF-8 format.
    $str =~ /^\w/;       # No match, as $str isn't in UTF-8 format.
    $str .= "\x{0e0b}";  # Now $str is in UTF-8 format.
    $str =~ /^\w/;       # Match! $str is now in UTF-8 format.
    chop $str;
    $str =~ /^\w/;       # Still a match! $str remains in UTF-8 format.

=head4 Which character set modifier is in effect?

Which of these modifiers is in effect at any given point in a regular
expression depends on a fairly complex set of interactions. As
explained below in L</Extended Patterns> it is possible to explicitly
specify modifiers that apply only to portions of a regular expression.
The innermost always has priority over any outer ones, and one applying
to the whole expression has priority over any of the default settings that are
described in the remainder of this section.

The C<L<use re 'E<sol>foo'|re/"'/flags' mode">> pragma can be used to set
default modifiers (including these) for regular expressions compiled
within its scope. This pragma has precedence over the other pragmas
listed below that change the defaults.

Otherwise, C<L<use locale|perllocale>> sets the default modifier to C</l>;
and C<L<use feature 'unicode_strings'|feature>> or
C<L<use 5.012|perlfunc/use VERSION>> (or higher) set the default to
C</u> when not in the same scope as either C<L<use locale|perllocale>>
or C<L<use bytes|bytes>>. Unlike the mechanisms mentioned above, these
affect operations besides regular expression pattern matching, and so
give more consistent results with other operators, including using
C<\U>, C<\l>, etc. in substitution replacements.

If none of the above apply, for backwards compatibility reasons, the
C</d> modifier is the one in effect by default. As this can lead to
unexpected results, it is best to specify which other rule set should be
used.

=head4 Character set modifier behavior prior to Perl 5.14

Prior to 5.14, there were no explicit modifiers, but C</l> was implied
for regexes compiled within the scope of C<use locale>, and C</d> was
implied otherwise. However, interpolating a regex into a larger regex
would ignore the original compilation in favor of whatever was in effect
at the time of the second compilation. There were a number of
inconsistencies (bugs) with the C</d> modifier, where Unicode rules
would be used when inappropriate, and vice versa. C<\p{}> did not imply
Unicode rules, and neither did all occurrences of C<\N{}>, until 5.12.

=head2 Regular Expressions

=head3 Metacharacters

The patterns used in Perl pattern matching evolved from those supplied in
the Version 8 regex routines. (The routines are derived
(distantly) from Henry Spencer's freely redistributable reimplementation
of the V8 routines.) See L</Version 8 Regular Expressions> for
details.

In particular the following metacharacters have their standard I<egrep>-ish
meanings:
X<metacharacter>
X<\> X<^> X<.> X<$> X<|> X<(> X<()> X<[> X<[]>

    \        Quote the next metacharacter
    ^        Match the beginning of the line
    .        Match any character (except newline)
    $        Match the end of the line (or before newline at the end)
    |        Alternation
    ()       Grouping
    []       Bracketed Character class

By default, the "^" character is guaranteed to match only the
beginning of the string, the "$" character only the end (or before the
newline at the end), and Perl does certain optimizations with the
assumption that the string contains only one line. Embedded newlines
will not be matched by "^" or "$". You may, however, wish to treat a
string as a multi-line buffer, such that the "^" will match after any
newline within the string (except if the newline is the last character in
the string), and "$" will match before any newline. At the
cost of a little more overhead, you can do this by using the C</m> modifier
on the pattern match operator. (Older programs did this by setting C<$*>,
but this option was removed in perl 5.9.)
X<^> X<$> X</m>

To simplify multi-line substitutions, the "." character never matches a
newline unless you use the C</s> modifier, which in effect tells Perl to pretend
the string is a single line--even if it isn't.
X<.> X</s>

=head3 Quantifiers

The following standard quantifiers are recognized:
X<metacharacter> X<quantifier> X<*> X<+> X<?> X<{n}> X<{n,}> X<{n,m}>

    *           Match 0 or more times
    +           Match 1 or more times
    ?           Match 1 or 0 times
    {n}         Match exactly n times
    {n,}        Match at least n times
    {n,m}       Match at least n but not more than m times

(If a curly bracket occurs in any other context and does not form part of
a backslashed sequence like C<\x{...}>, it is treated
as a regular character. In particular, the lower bound
is not optional.) The "*" quantifier is equivalent to C<{0,}>, the "+"
quantifier to C<{1,}>, and the "?" quantifier to C<{0,1}>. n and m are limited
to non-negative integral values less than a preset limit defined when perl is built.
This is usually 32766 on the most common platforms. The actual limit can
be seen in the error message generated by code such as this:

    $_ **= $_ , / {$_} / for 2 .. 42;

By default, a quantified subpattern is "greedy", that is, it will match as
many times as possible (given a particular starting location) while still
allowing the rest of the pattern to match. If you want it to match the
minimum number of times possible, follow the quantifier with a "?". Note
that the meanings don't change, just the "greediness":
X<metacharacter> X<greedy> X<greediness>
X<?> X<*?> X<+?> X<??> X<{n}?> X<{n,}?> X<{n,m}?>

    *?        Match 0 or more times, not greedily
    +?        Match 1 or more times, not greedily
    ??        Match 0 or 1 time, not greedily
    {n}?      Match exactly n times, not greedily (redundant)
    {n,}?     Match at least n times, not greedily
    {n,m}?    Match at least n but not more than m times, not greedily
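
As a small illustration of greedy versus non-greedy matching, the sketch
below applies both forms to the same made-up string.

```perl
use strict;
use warnings;

my $html = "<b>bold</b> and <i>italic</i>";

# Greedy: .+ runs to the end and backs off only to the last ">".
my ($greedy) = $html =~ /<(.+)>/;

# Non-greedy: .+? stops at the first ">" that lets the match succeed.
my ($lazy) = $html =~ /<(.+?)>/;

print "greedy: $greedy\n";    # greedy: b>bold</b> and <i>italic</i
print "lazy:   $lazy\n";      # lazy:   b
```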

By default, when a quantified subpattern does not allow the rest of the
overall pattern to match, Perl will backtrack. However, this behaviour is
sometimes undesirable. Thus Perl provides the "possessive" quantifier form
as well.

    *+     Match 0 or more times and give nothing back
    ++     Match 1 or more times and give nothing back
    ?+     Match 0 or 1 time and give nothing back
    {n}+   Match exactly n times and give nothing back (redundant)
    {n,}+  Match at least n times and give nothing back
    {n,m}+ Match at least n but not more than m times and give nothing back

For instance,

    'aaaa' =~ /a++a/

will never match, as the C<a++> will gobble up all the C<a>'s in the
string and won't leave any for the remaining part of the pattern. This
feature can be extremely useful to give perl hints about where it
shouldn't backtrack. For instance, the typical "match a double-quoted
string" problem can be most efficiently performed when written as:

    /"(?:[^"\\]++|\\.)*+"/

as we know that if the final quote does not match, backtracking will not
help. See the independent subexpression
L</C<< (?>pattern) >>> for more details;
possessive quantifiers are just syntactic sugar for that construct. For
instance the above example could also be written as follows:

    /"(?>(?:(?>[^"\\]+)|\\.)*)"/
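
The two patterns above can be tried out with a sketch like the
following, which assumes Perl 5.10 or later. The sample input string
and variable names are invented for this example.

```perl
use strict;
use warnings;

# An ordinary quantifier backtracks; a possessive one does not.
my $backtracks = 'aaaa' =~ /a+a/  ? 1 : 0;
my $possessive = 'aaaa' =~ /a++a/ ? 1 : 0;

# The double-quoted-string pattern from the text, applied to an
# input containing backslash-escaped quotes.
my $quoted   = qr/"(?:[^"\\]++|\\.)*+"/;
my $input    = 'say "he said \"hi\"" now';
my ($string) = $input =~ /($quoted)/;

print "$backtracks $possessive\n";    # 1 0
print "$string\n";                    # "he said \"hi\""
```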

=head3 Escape sequences

Because patterns are processed as double-quoted strings, the following
also work:

    \t          tab                   (HT, TAB)
    \n          newline               (LF, NL)
    \r          return                (CR)
    \f          form feed             (FF)
    \a          alarm (bell)          (BEL)
    \e          escape (think troff)  (ESC)
    \cK         control char          (example: VT)
    \x{}, \x00  character whose ordinal is the given hexadecimal number
    \N{name}    named Unicode character or character sequence
    \N{U+263D}  Unicode character     (example: FIRST QUARTER MOON)
    \o{}, \000  character whose ordinal is the given octal number
    \l          lowercase next char (think vi)
    \u          uppercase next char (think vi)
    \L          lowercase till \E (think vi)
    \U          uppercase till \E (think vi)
    \Q          quote (disable) pattern metacharacters till \E
    \E          end either case modification or quoted section, think vi

Details are in L<perlop/Quote and Quote-like Operators>.
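
For example, C<\Q...\E> is handy when interpolating text that should be
matched literally. In this sketch, C<$literal> and the test strings are
made up for illustration.

```perl
use strict;
use warnings;

my $literal = "a.b*c";    # contains the metacharacters "." and "*"

# Interpolated as-is, "." and "*" keep their special meanings.
my $as_pattern = "aXbbbc" =~ /^$literal\z/ ? 1 : 0;

# With \Q...\E the interpolated text is matched literally.
my $quoted_miss = "aXbbbc" =~ /^\Q$literal\E\z/ ? 1 : 0;
my $quoted_hit  = "a.b*c"  =~ /^\Q$literal\E\z/ ? 1 : 0;

print "$as_pattern $quoted_miss $quoted_hit\n";    # 1 0 1
```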

=head3 Character Classes and other Special Escapes

In addition, Perl defines the following:
X<\g> X<\k> X<\K> X<backreference>

    Sequence   Note  Description
    [...]      [1]   Match a character according to the rules of the
                     bracketed character class defined by the "...".
                     Example: [a-z] matches "a" or "b" or "c" ... or "z"
    [[:...:]]  [2]   Match a character according to the rules of the POSIX
                     character class "..." within the outer bracketed
                     character class. Example: [[:upper:]] matches any
                     uppercase character.
    \w         [3]   Match a "word" character (alphanumeric plus "_", plus
                     other connector punctuation chars plus Unicode
                     marks)
    \W         [3]   Match a non-"word" character
    \s         [3]   Match a whitespace character
    \S         [3]   Match a non-whitespace character
    \d         [3]   Match a decimal digit character
    \D         [3]   Match a non-digit character
    \pP        [3]   Match P, named property. Use \p{Prop} for longer names
    \PP        [3]   Match non-P
    \X         [4]   Match Unicode "eXtended grapheme cluster"
    \C               Match a single C-language char (octet) even if that is
                     part of a larger UTF-8 character. Thus it breaks up
                     characters into their UTF-8 bytes, so you may end up
                     with malformed pieces of UTF-8. Unsupported in
                     lookbehind.
    \1         [5]   Backreference to a specific capture group or buffer.
                     '1' may actually be any positive integer.
    \g1        [5]   Backreference to a specific or previous group,
    \g{-1}     [5]   The number may be negative indicating a relative
                     previous group and may optionally be wrapped in
                     curly brackets for safer parsing.
    \g{name}   [5]   Named backreference
    \k<name>   [5]   Named backreference
    \K         [6]   Keep the stuff left of the \K, don't include it in $&
    \N         [7]   Any character but \n (experimental). Not affected by
                     /s modifier
    \v         [3]   Vertical whitespace
    \V         [3]   Not vertical whitespace
    \h         [3]   Horizontal whitespace
    \H         [3]   Not horizontal whitespace
    \R         [4]   Linebreak

=over 4

=item [1]

See L<perlrecharclass/Bracketed Character Classes> for details.

=item [2]

See L<perlrecharclass/POSIX Character Classes> for details.

=item [3]

See L<perlrecharclass/Backslash sequences> for details.

=item [4]

See L<perlrebackslash/Misc> for details.

=item [5]

See L</Capture groups> below for details.

=item [6]

See L</Extended Patterns> below for details.

=item [7]

Note that C<\N> has two meanings. When of the form C<\N{NAME}>, it matches the
character or character sequence whose name is C<NAME>; and similarly
when of the form C<\N{U+I<hex>}>, it matches the character whose Unicode
code point is I<hex>. Otherwise it matches any character but C<\n>.

=back
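
A few of the escapes in the table can be exercised with the sketch
below, which assumes Perl 5.12 or later (for C<\N>). All strings and
variable names are invented for this example.

```perl
use strict;
use warnings;

# \K: what is matched to its left is kept out of the replaced text.
my $setting = "foo=bar";
(my $changed = $setting) =~ s/foo=\Kbar/baz/;

# \R: a generic linebreak; "\r\n" is treated as a single unit.
my @lines = split /\R/, "one\r\ntwo\nthree";

# \h: horizontal whitespace such as space and tab.
my $horizontal = "a \t b" =~ /^a\h+b\z/ ? 1 : 0;

# \N: any character but "\n", even under /s.
my $not_newline = "a\nb" =~ /^a\Nb\z/s ? 1 : 0;

print "$changed\n";                      # foo=baz
print join(",", @lines), "\n";           # one,two,three
print "$horizontal $not_newline\n";      # 1 0
```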

=head3 Assertions

Perl defines the following zero-width assertions:
X<zero-width assertion> X<assertion> X<regex, zero-width assertion>
X<regexp, zero-width assertion>
X<regular expression, zero-width assertion>
X<\b> X<\B> X<\A> X<\Z> X<\z> X<\G>

    \b  Match a word boundary
    \B  Match except at a word boundary
    \A  Match only at beginning of string
    \Z  Match only at end of string, or before newline at the end
    \z  Match only at end of string
    \G  Match only at pos() (e.g. at the end-of-match position
        of prior m//g)

A word boundary (C<\b>) is a spot between two characters
that has a C<\w> on one side of it and a C<\W> on the other side
of it (in either order), counting the imaginary characters off the
beginning and end of the string as matching a C<\W>. (Within
character classes C<\b> represents backspace rather than a word
boundary, just as it normally does in any double-quoted string.)
The C<\A> and C<\Z> are just like "^" and "$", except that they
won't match multiple times when the C</m> modifier is used, while
"^" and "$" will match at every internal line boundary. To match
the actual end of the string and not ignore an optional trailing
newline, use C<\z>.
X<\b> X<\A> X<\Z> X<\z> X</m>
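
The differences between these anchors can be seen in a short sketch.
The sample strings and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $line = "end\n";

# "$" and \Z tolerate a trailing newline; \z does not.
my $dollar  = $line =~ /end$/  ? 1 : 0;
my $big_z   = $line =~ /end\Z/ ? 1 : 0;
my $small_z = $line =~ /end\z/ ? 1 : 0;

# Under /m, "^" matches after an embedded newline, but \A never does.
my $caret_m = "a\nb" =~ /^b/m  ? 1 : 0;
my $big_a_m = "a\nb" =~ /\Ab/m ? 1 : 0;

# \b requires a \w on one side and a \W (or a string edge) on the other,
# so the "cat" inside "concat" is not counted.
my @cats  = "cat concat cat." =~ /\bcat\b/g;
my $count = @cats;

print "$dollar $big_z $small_z $caret_m $big_a_m $count\n";   # 1 1 0 1 0 2
```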

The C<\G> assertion can be used to chain global matches (using
C<m//g>), as described in L<perlop/"Regexp Quote-Like Operators">.
It is also useful when writing C<lex>-like scanners, when you have
several patterns that you want to match against consecutive substrings
of your string; see the previous reference. The actual location
where C<\G> will match can also be influenced by using C<pos()> as
an lvalue: see L<perlfunc/pos>. Note that the rule for zero-length
matches (see L</"Repeated Patterns Matching a Zero-length Substring">)
is modified somewhat, in that contents to the left of C<\G> are
not counted when determining the length of the match. Thus the following
will not match forever:
X<\G>

    my $string = 'ABC';
    pos($string) = 1;
    while ($string =~ /(.\G)/g) {
        print $1;
    }

It will print 'A' and then terminate, as it considers the match to
be zero-width, and thus will not match at the same position twice in a
row.

It is worth noting that C<\G> improperly used can result in an infinite
loop. Take care when using patterns that include C<\G> in an alternation.
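
A C<lex>-like scanner of the kind mentioned above might be sketched as
follows. The token names and the tiny input language are hypothetical;
the C</c> modifier keeps C<pos()> in place when one alternative fails so
that the next one can be tried at the same position.

```perl
use strict;
use warnings;

my $input = "x = 42 + y";
my @tokens;

pos($input) = 0;
while (pos($input) < length($input)) {
    if    ($input =~ /\G\s+/gc)             { next }
    elsif ($input =~ /\G(\d+)/gc)           { push @tokens, "NUM($1)" }
    elsif ($input =~ /\G([A-Za-z_]\w*)/gc)  { push @tokens, "ID($1)" }
    elsif ($input =~ /\G([=+])/gc)          { push @tokens, "OP($1)" }
    else  { die "unexpected character at offset " . pos($input) }
}

print "@tokens\n";    # ID(x) OP(=) NUM(42) OP(+) ID(y)
```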
642
c27a5cfe 643=head3 Capture groups
04838cea 644
c27a5cfe
KW
645The bracketing construct C<( ... )> creates capture groups (also referred to as
646capture buffers). To refer to the current contents of a group later on, within
d8b950dc
KW
647the same pattern, use C<\g1> (or C<\g{1}>) for the first, C<\g2> (or C<\g{2}>)
648for the second, and so on.
649This is called a I<backreference>.
d74e8afc 650X<regex, capture buffer> X<regexp, capture buffer>
c27a5cfe 651X<regex, capture group> X<regexp, capture group>
d74e8afc 652X<regular expression, capture buffer> X<backreference>
c27a5cfe 653X<regular expression, capture group> X<backreference>
1f1031fe 654X<\g{1}> X<\g{-1}> X<\g{name}> X<relative backreference> X<named backreference>
d8b950dc
KW
655X<named capture buffer> X<regular expression, named capture buffer>
656X<named capture group> X<regular expression, named capture group>
657X<%+> X<$+{name}> X<< \k<name> >>
658There is no limit to the number of captured substrings that you may use.
659Groups are numbered with the leftmost open parenthesis being number 1, etc. If
660a group did not match, the associated backreference won't match either. (This
661can happen if the group is optional, or in a different branch of an
662alternation.)
663You can omit the C<"g">, and write C<"\1">, etc, but there are some issues with
664this form, described below.
665
You can also refer to capture groups relatively, by using a negative number, so
that C<\g-1> and C<\g{-1}> both refer to the immediately preceding capture
group, and C<\g-2> and C<\g{-2}> both refer to the group before it. For
example:

        /
         (Y)            # group 1
         (              # group 2
            (X)         # group 3
            \g{-1}      # backref to group 3
            \g{-3}      # backref to group 1
         )
        /x

would match the same as C</(Y) ( (X) \g3 \g1 )/x>. This allows you to
interpolate regexes into larger regexes and not have to worry about the
capture groups being renumbered.
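As a rough sketch of the renumbering point (the names C<$doubled_rel> and C<$doubled_abs> are made up for this illustration), compare a relative and an absolute backreference once each is interpolated after another capture group:

```perl
use strict;
use warnings;

# Hypothetical fragments: each is meant to match a doubled word character.
my $doubled_rel = qr/(\w)\g{-1}/;   # relative: "the group just before me"
my $doubled_abs = qr/(\w)\g{1}/;    # absolute: always group 1

# Embedding after another group shifts the numbering: (\w) is now group 2,
# so the absolute form ends up referring to (\d+) instead.
my $rel_ok = "42-oops" =~ /(\d+)-$doubled_rel/ ? 1 : 0;
my $abs_ok = "42-oops" =~ /(\d+)-$doubled_abs/ ? 1 : 0;

print "relative: $rel_ok, absolute: $abs_ok\n";
```

Only the relative form still finds the doubled "o" after being embedded.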
683
You can dispense with numbers altogether and create named capture groups.
The notation is C<(?E<lt>I<name>E<gt>...)> to declare and C<\g{I<name>}> to
reference. (To be compatible with .NET regular expressions, C<\g{I<name>}> may
also be written as C<\k{I<name>}>, C<\kE<lt>I<name>E<gt>> or C<\k'I<name>'>.)
I<name> must not begin with a number, nor contain hyphens.
When different groups within the same pattern have the same name, any reference
to that name assumes the leftmost defined group. Named groups count in
absolute and relative numbering, and so can also be referred to by those
numbers.
(It's possible to do things with named capture groups that would otherwise
require C<(??{})>.)
695
Capture group contents are dynamically scoped and available to you outside the
pattern until the end of the enclosing block or until the next successful
match, whichever comes first. (See L<perlsyn/"Compound Statements">.)
You can refer to them by absolute number (using C<"$1"> instead of C<"\g1">,
etc); or by name via the C<%+> hash, using C<"$+{I<name>}">.
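A minimal sketch of both access styles after a match, assuming a made-up date string and group names chosen only for this example:

```perl
use strict;
use warnings;

my $stamp = "2011-05-14";   # sample input, chosen for this sketch
my ($by_name, $by_number);

if ($stamp =~ /(?<year>\d{4})-(?<month>\d\d)-(?<day>\d\d)/) {
    $by_name   = "$+{day}/$+{month}/$+{year}";   # named access via %+
    $by_number = "$3/$2/$1";                     # the same groups by number
}

print "$by_name\n$by_number\n";
```

Because named groups also count in the absolute numbering, both lines print the same text.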
701
Braces are required in referring to named capture groups, but are optional for
absolute or relative numbered ones. Braces are safer when creating a regex by
concatenating smaller strings. For example if you have C<qr/$a$b/>, and C<$a>
contained C<"\g1">, and C<$b> contained C<"37">, you would get C</\g137/> which
is probably not what you intended.
707
The C<\g> and C<\k> notations were introduced in Perl 5.10.0. Prior to that
there were no named nor relative numbered capture groups. Absolute numbered
groups were referred to using C<\1>, C<\2>, etc., and this notation is still
accepted (and likely always will be). But it leads to some ambiguities if
there are more than 9 capture groups, as C<\10> could mean either the tenth
capture group, or the character whose ordinal in octal is 010 (a backspace in
ASCII). Perl resolves this ambiguity by interpreting C<\10> as a backreference
only if at least 10 left parentheses have opened before it. Likewise C<\11> is
a backreference only if at least 11 left parentheses have opened before it.
And so on. C<\1> through C<\9> are always interpreted as backreferences.
There are several examples below that illustrate these perils. You can avoid
the ambiguity by always using C<\g{}> or C<\g> if you mean capturing groups;
and for octal constants always using C<\o{}>, or for C<\077> and below, using 3
digits padded with leading zeros, since a leading zero implies an octal
constant.
724
The C<\I<digit>> notation also works in certain circumstances outside
the pattern. See L</Warning on \1 Instead of $1> below for details.
81714fb9 727
Examples:

    s/^([^ ]*) *([^ ]*)/$2 $1/;     # swap first two words

    /(.)\g1/                        # find first doubled char
         and print "'$1' is the first doubled character\n";

    /(?<char>.)\k<char>/            # ... a different way
         and print "'$+{char}' is the first doubled character\n";

    /(?'char'.)\g1/                 # ... mix and match
         and print "'$1' is the first doubled character\n";

    if (/Time: (..):(..):(..)/) {   # parse out values
        $hours = $1;
        $minutes = $2;
        $seconds = $3;
    }

    /(.)(.)(.)(.)(.)(.)(.)(.)(.)\g10/   # \g10 is a backreference
    /(.)(.)(.)(.)(.)(.)(.)(.)(.)\10/    # \10 is octal
    /((.)(.)(.)(.)(.)(.)(.)(.)(.))\10/  # \10 is a backreference
    /((.)(.)(.)(.)(.)(.)(.)(.)(.))\010/ # \010 is octal

    $a = '(.)\1';        # Creates problems when concatenated.
    $b = '(.)\g{1}';     # Avoids the problems.
    "aa" =~ /${a}/;      # True
    "aa" =~ /${b}/;      # True
    "aa0" =~ /${a}0/;    # False!
    "aa0" =~ /${b}0/;    # True
    "aa\x08" =~ /${a}0/;  # True!
    "aa\x08" =~ /${b}0/;  # False
9d860678 760
Several special variables also refer back to portions of the previous
match. C<$+> returns whatever the last bracket match matched.
C<$&> returns the entire matched string. (At one point C<$0> did
also, but now it returns the name of the program.) C<$`> returns
everything before the matched string. C<$'> returns everything
after the matched string. And C<$^N> contains whatever was matched by
the most-recently closed group (submatch). C<$^N> can be used in
extended patterns (see below), for example to assign a submatch to a
variable.
X<$+> X<$^N> X<$&> X<$`> X<$'>
14218588 771
These special variables, like the C<%+> hash and the numbered match variables
(C<$1>, C<$2>, C<$3>, etc.) are dynamically scoped
until the end of the enclosing block or until the next successful
match, whichever comes first. (See L<perlsyn/"Compound Statements">.)
X<$+> X<$^N> X<$&> X<$`> X<$'>
X<$1> X<$2> X<$3> X<$4> X<$5> X<$6> X<$7> X<$8> X<$9>
778
B<NOTE>: Failed matches in Perl do not reset the match variables,
which makes it easier to write code that tests for a series of more
specific cases and remembers the best match.
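A small sketch of that behaviour, using a made-up input line: the second match fails, so C<$1> still holds what the first match captured.

```perl
use strict;
use warnings;

my $line = "error: disk full";   # sample input for this sketch

$line =~ /^(\w+):/;              # succeeds; $1 is "error"
$line =~ /^(\d+):/;              # fails; $1 is left untouched

my $kept = $1;
print "still have: $kept\n";
```

Code that relies on this should still check the return value of each match, since a stale C<$1> is easy to mistake for a fresh one.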
782
B<WARNING>: Once Perl sees that you need one of C<$&>, C<$`>, or
C<$'> anywhere in the program, it has to provide them for every
pattern match. This may substantially slow your program. Perl
uses the same mechanism to produce C<$1>, C<$2>, etc, so you also pay a
price for each pattern that contains capturing parentheses. (To
avoid this cost while retaining the grouping behaviour, use the
extended regular expression C<(?: ... )> instead.) But if you never
use C<$&>, C<$`> or C<$'>, then patterns I<without> capturing
parentheses will not be penalized. So avoid C<$&>, C<$'>, and C<$`>
if you can, but if you can't (and some algorithms really appreciate
them), once you've used them once, use them at will, because you've
already paid the price. As of 5.005, C<$&> is not so costly as the
other two.
X<$&> X<$`> X<$'>
68dc0745 797
As a workaround for this problem, Perl 5.10.0 introduces C<${^PREMATCH}>,
C<${^MATCH}> and C<${^POSTMATCH}>, which are equivalent to C<$`>, C<$&>
and C<$'>, B<except> that they are only guaranteed to be defined after a
successful match that was executed with the C</p> (preserve) modifier.
The use of these variables incurs no global performance penalty, unlike
their punctuation character equivalents; the trade-off is that you
have to tell perl when you want to use them.
X</p> X<p modifier>
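A minimal sketch of the C</p> variables in use, with a made-up C<key=value> string as input:

```perl
use strict;
use warnings;

my $pair = "key=value";          # sample input for this sketch
my ($before, $matched, $after);

if ($pair =~ /=/p) {             # /p asks perl to preserve the pieces
    ($before, $matched, $after)
        = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});
}

print "[$before] [$matched] [$after]\n";
```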
cde0cee5 806
=head2 Quoting metacharacters

Backslashed metacharacters in Perl are alphanumeric, such as C<\b>,
C<\w>, C<\n>. Unlike some other regular expression languages, there
are no backslashed symbols that aren't alphanumeric. So anything
that looks like \\, \(, \), \<, \>, \{, or \} is always
interpreted as a literal character, not a metacharacter. This was
once used in a common idiom to disable or quote the special meanings
of regular expression metacharacters in a string that you want to
use for a pattern. Simply quote all non-"word" characters:

    $pattern =~ s/(\W)/\\$1/g;
819
(If C<use locale> is set, then this depends on the current locale.)
Today it is more common to use the quotemeta() function or the C<\Q>
metaquoting escape sequence to disable all metacharacters' special
meanings like this:

    /$unquoted\Q$quoted\E$unquoted/

Beware that if you put literal backslashes (those not inside
interpolated variables) between C<\Q> and C<\E>, double-quotish
backslash interpolation may lead to confusing results. If you
I<need> to use literal backslashes within C<\Q...\E>,
consult L<perlop/"Gory details of parsing quoted constructs">.
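To illustrate the difference quoting makes, here is a small sketch (the strings are invented for the example) that matches the same text with and without C<\Q>, and shows what quotemeta() produces:

```perl
use strict;
use warnings;

my $literal = "1+1=2 (really?)";         # text to be matched verbatim
my $subject = "they said 1+1=2 (really?) twice";

my $raw     = $subject =~ /$literal/     ? 1 : 0;  # metacharacters active
my $quoted  = $subject =~ /\Q$literal\E/ ? 1 : 0;  # metacharacters disabled
my $escaped = quotemeta $literal;                  # same escaping as \Q

print "raw: $raw, quoted: $quoted\n";
print "$escaped\n";
```

The unquoted pattern treats C<+>, C<?> and the parentheses as regex syntax and so fails to find the literal text.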
832
=head2 Extended Patterns

Perl also defines a consistent extension syntax for features not
found in standard tools like B<awk> and
B<lex>. The syntax for most of these is a
pair of parentheses with a question mark as the first thing within
the parentheses. The character after the question mark indicates
the extension.

The stability of these extensions varies widely. Some have been
part of the core language for many years. Others are experimental
and may change without warning or be completely removed. Check
the documentation on an individual feature to verify its current
status.

A question mark was chosen for this and for the minimal-matching
construct because 1) question marks are rare in older regular
expressions, and 2) whenever you see one, you should stop and
"question" exactly what is going on. That's psychology....
a0d0e21e 852
=over 4

=item C<(?#text)>
X<(?#)>

A comment. The text is ignored. If the C</x> modifier enables
whitespace formatting, a simple C<#> will suffice. Note that Perl closes
the comment as soon as it sees a C<)>, so there is no way to put a literal
C<)> in the comment.
a0d0e21e 862
=item C<(?adlupimsx-imsx)>

=item C<(?^alupimsx)>
X<(?)> X<(?^)>

One or more embedded pattern-match modifiers, to be turned on (or
turned off, if preceded by C<->) for the remainder of the pattern or
the remainder of the enclosing pattern group (if any).
871
This is particularly useful for dynamic patterns, such as those read in from a
configuration file, taken from an argument, or specified in a table
somewhere. Consider the case where some patterns want to be
case-sensitive and some do not: The case-insensitive ones merely need to
include C<(?i)> at the front of the pattern. For example:

    $pattern = "foobar";
    if ( /$pattern/i ) { }

    # more flexible:

    $pattern = "(?i)foobar";
    if ( /$pattern/ ) { }

These modifiers are restored at the end of the enclosing group. For example,

    ( (?i) blah ) \s+ \g1

will match C<blah> in any case, some spaces, and an exact (I<including the case>!)
repetition of the previous word, assuming the C</x> modifier, and no C</i>
modifier outside this group.
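The scoping of C<(?i)> described above can be checked with a short sketch (the test strings are invented for the example):

```perl
use strict;
use warnings;

# (?i) is confined to group 1; the backreference outside stays case-sensitive.
my $re = qr/ ( (?i) blah ) \s+ \g1 /x;

my $same_case  = "BLAH BLAH" =~ $re ? 1 : 0;   # repetition matches exactly
my $mixed_case = "BLAH blah" =~ $re ? 1 : 0;   # repetition differs in case

print "same: $same_case, mixed: $mixed_case\n";
```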
19799a22 893
These modifiers do not carry over into named subpatterns called in the
enclosing group. In other words, a pattern such as C<((?i)(?&NAME))> does not
change the case-sensitivity of the "NAME" pattern.

Any of these modifiers can be set to apply globally to all regular
expressions compiled within the scope of a C<use re>. See
L<re/"'/flags' mode">.

Starting in Perl 5.14, a C<"^"> (caret or circumflex accent) immediately
after the C<"?"> is a shorthand equivalent to C<d-imsx>. Flags (except
C<"d">) may follow the caret to override it.
But a minus sign is not legal with it.

Note that the C<a>, C<d>, C<l>, C<p>, and C<u> modifiers are special in
that they can only be enabled, not disabled, and the C<a>, C<d>, C<l>, and
C<u> modifiers are mutually exclusive: specifying one de-specifies the
others, and a maximum of one (or two C<a>'s) may appear in the
construct. Thus, for example, C<(?-p)> will warn when compiled under
C<use warnings>; C<(?-d:...)> and C<(?dl:...)> are fatal errors.

Note also that the C<p> modifier is special in that its presence
anywhere in a pattern has a global effect.
cde0cee5 917
=item C<(?:pattern)>
X<(?:)>

=item C<(?adluimsx-imsx:pattern)>

=item C<(?^aluimsx:pattern)>
X<(?^:)>

This is for clustering, not capturing; it groups subexpressions like
"()", but doesn't make backreferences as "()" does. So

    @fields = split(/\b(?:a|b|c)\b/)

is like

    @fields = split(/\b(a|b|c)\b/)

but doesn't spit out extra fields. It's also cheaper not to capture
characters if you don't need to.
a0d0e21e 937
Any letters between C<?> and C<:> act as flags modifiers as with
C<(?adluimsx-imsx)>. For example,

    /(?s-i:more.*than).*million/i

is equivalent to the more verbose

    /(?:(?s-i)more.*than).*million/i
946
Starting in Perl 5.14, a C<"^"> (caret or circumflex accent) immediately
after the C<"?"> is a shorthand equivalent to C<d-imsx>. Any positive
flags (except C<"d">) may follow the caret, so

    (?^x:foo)

is equivalent to

    (?x-ims:foo)

The caret tells Perl that this cluster doesn't inherit the flags of any
surrounding pattern, but uses the system defaults (C<d-imsx>),
modified by any flags specified.

The caret allows for simpler stringification of compiled regular
expressions. These look like

    (?^:pattern)

with any non-default flags appearing between the caret and the colon.
A test that looks at such stringification thus doesn't need to have the
system default flags hard-coded in it, just the caret. If new flags are
added to Perl, the meaning of the caret's expansion will change to include
the default for those flags, so the test will still work, unchanged.
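The stringified form can be inspected directly. The sketch below assumes Perl 5.14 or later and no pragma that changes the default character-set modifier (such as the C<unicode_strings> feature):

```perl
use strict;
use warnings;

# Stringifying a compiled pattern shows the caret form; only the
# non-default flags appear between the caret and the colon.
my $plain = "" . qr/foo/;
my $flags = "" . qr/foo/i;

print "$plain\n$flags\n";
```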
971
Specifying a negative flag after the caret is an error, as the flag is
redundant.

Mnemonic for C<(?^...)>: A fresh beginning since the usual use of a caret is
to match at the beginning.
977
=item C<(?|pattern)>
X<(?|)> X<Branch reset>

This is the "branch reset" pattern, which has the special property
that the capture groups are numbered from the same starting point
in each alternation branch. It is available starting from perl 5.10.0.

Capture groups are numbered from left to right, but inside this
construct the numbering is restarted for each branch.

The numbering within each branch will be as normal, and any groups
following this construct will be numbered as though the construct
contained only one branch, that being the one with the most capture
groups in it.

This construct is useful when you want to capture one of a
number of alternative matches.
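As one possible use, the sketch below (a hypothetical mini-parser, not taken from the Perl sources) accepts a value that is double-quoted, single-quoted, or bare, and always finds it in C<$1>:

```perl
use strict;
use warnings;

# Hypothetical mini-parser: a value may be double-quoted, single-quoted,
# or bare. Branch reset puts the value in $1 whichever branch matched.
my $value = qr{ (?| "([^"]*)" | '([^']*)' | (\S+) ) }x;

my @got;
for my $input (q{"two words"}, q{'x y'}, q{bare}) {
    push @got, $1 if $input =~ $value;
}

print join("|", @got), "\n";
```

Without C<(?|...)> the three branches would fill C<$1>, C<$2> and C<$3> respectively, and the caller would have to test each one.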
995
Consider the following pattern. The numbers underneath show in
which group the captured content will be stored.

    # before  ---------------branch-reset----------- after
    / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
    # 1            2         2  3        2     3     4
1003
Be careful when using the branch reset pattern in combination with
named captures. Named captures are implemented as being aliases to
numbered groups holding the captures, and that interferes with the
implementation of the branch reset pattern. If you are using named
captures in a branch reset pattern, it's best to use the same names,
in the same order, in each of the alternations:

    /(?|  (?<a> x ) (?<b> y )
       |  (?<a> z ) (?<b> w )) /x

Not doing so may lead to surprises:

    "12" =~ /(?| (?<a> \d+ ) | (?<b> \D+))/x;
    say $+{a};    # Prints '12'
    say $+{b};    # *Also* prints '12'.

The problem here is that both the group named C<< a >> and the group
named C<< b >> are aliases for the group belonging to C<< $1 >>.
90a18110 1022
=item Look-Around Assertions
X<look-around assertion> X<lookaround assertion> X<look-around> X<lookaround>

Look-around assertions are zero-width patterns which match a specific
pattern without including it in C<$&>. Positive assertions match when
their subpattern matches, negative assertions match when their subpattern
fails. Look-behind matches text up to the current match position,
look-ahead matches text following the current match position.
1031
=over 4

=item C<(?=pattern)>
X<(?=)> X<look-ahead, positive> X<lookahead, positive>

A zero-width positive look-ahead assertion. For example, C</\w+(?=\t)/>
matches a word followed by a tab, without including the tab in C<$&>.
1039
=item C<(?!pattern)>
X<(?!)> X<look-ahead, negative> X<lookahead, negative>

A zero-width negative look-ahead assertion. For example C</foo(?!bar)/>
matches any occurrence of "foo" that isn't followed by "bar". Note
however that look-ahead and look-behind are NOT the same thing. You cannot
use this for look-behind.

If you are looking for a "bar" that isn't preceded by a "foo", C</(?!foo)bar/>
will not do what you want. That's because the C<(?!foo)> is just saying that
the next thing cannot be "foo"--and it's not, it's a "bar", so "foobar" will
match. Use look-behind instead (see below).
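The pitfall can be demonstrated in a few lines; the test strings here are chosen only for illustration:

```perl
use strict;
use warnings;

# Look-ahead at the wrong place: (?!foo) is tested where "bar" begins.
my $ahead  = "foobar" =~ /(?!foo)bar/  ? 1 : 0;

# Look-behind checks the text before "bar" instead.
my $behind = "foobar" =~ /(?<!foo)bar/ ? 1 : 0;
my $other  = "bazbar" =~ /(?<!foo)bar/ ? 1 : 0;

print "ahead: $ahead, behind: $behind, other: $other\n";
```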
c277df42 1052
=item C<(?<=pattern)> C<\K>
X<(?<=)> X<look-behind, positive> X<lookbehind, positive> X<\K>

A zero-width positive look-behind assertion. For example, C</(?<=\t)\w+/>
matches a word that follows a tab, without including the tab in C<$&>.
Works only for fixed-width look-behind.

There is a special form of this construct, called C<\K>, which causes the
regex engine to "keep" everything it had matched prior to the C<\K> and
not include it in C<$&>. This effectively provides variable-length
look-behind. The use of C<\K> inside of another look-around assertion
is allowed, but the behaviour is currently not well defined.
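A brief sketch of C<\K> acting as a variable-length look-behind, using a made-up configuration string:

```perl
use strict;
use warnings;

my $config = "user=alice password=hunter2 host=example";  # made-up input

# The text before \K must match but is kept out of the replaced portion,
# and it may be of variable length (here \w*word=).
(my $masked = $config) =~ s/\b\w*word=\K\S+/****/g;

print "$masked\n";
```

Only the value after C<password=> is replaced; the key itself is left in place because it lies before the C<\K>.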
1065
For various reasons C<\K> may be significantly more efficient than the
equivalent C<< (?<=...) >> construct, and it is especially useful in
situations where you want to efficiently remove something following
something else in a string. For instance

    s/(foo)bar/$1/g;

can be rewritten as the much more efficient

    s/foo\Kbar//g;
1076
=item C<(?<!pattern)>
X<(?<!)> X<look-behind, negative> X<lookbehind, negative>

A zero-width negative look-behind assertion. For example C</(?<!bar)foo/>
matches any occurrence of "foo" that does not follow "bar". Works
only for fixed-width look-behind.
c277df42 1083
=back

=item C<(?'NAME'pattern)>

=item C<< (?<NAME>pattern) >>
X<< (?<NAME>) >> X<(?'NAME')> X<named capture> X<capture>

A named capture group. Identical in every respect to normal capturing
parentheses C<()> but for the additional fact that the group
can be referred to by name in various regular expression
constructs (like C<\g{NAME}>) and can be accessed by name
after a successful match via C<%+> or C<%->. See L<perlvar>
for more details on the C<%+> and C<%-> hashes.

If multiple distinct capture groups have the same name then
C<$+{NAME}> will refer to the leftmost defined group in the match.
81714fb9 1100
The forms C<(?'NAME'pattern)> and C<< (?<NAME>pattern) >> are equivalent.

B<NOTE:> While the notation of this construct is the same as the similar
function in .NET regexes, the behavior is not. In Perl the groups are
numbered sequentially regardless of being named or not. Thus in the
pattern

    /(x)(?<foo>y)(z)/

C<$+{foo}> will be the same as C<$2>, and C<$3> will contain 'z' instead of
the opposite which is what a .NET regex hacker might expect.

Currently NAME is restricted to simple identifiers only.
In other words, it must match C</^[_A-Za-z][_A-Za-z0-9]*\z/> or
its Unicode extension (see L<utf8>),
though it isn't extended by the locale (see L<perllocale>).
81714fb9 1117
B<NOTE:> In order to make things easier for programmers with experience
with the Python or PCRE regex engines, the pattern C<< (?PE<lt>NAMEE<gt>pattern) >>
may be used instead of C<< (?<NAME>pattern) >>; however this form does not
support the use of single quotes as a delimiter for the name.
81714fb9 1122
=item C<< \k<NAME> >>

=item C<< \k'NAME' >>

Named backreference. Similar to numeric backreferences, except that
the group is designated by name and not number. If multiple groups
have the same name then it refers to the leftmost defined group in
the current match.

It is an error to refer to a name not defined by a C<< (?<NAME>) >>
earlier in the pattern.

Both forms are equivalent.

B<NOTE:> In order to make things easier for programmers with experience
with the Python or PCRE regex engines, the pattern C<< (?P=NAME) >>
may be used instead of C<< \k<NAME> >>.
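For instance, a named backreference can require that a quoted string is closed by the same kind of quote that opened it. The input and the group names C<q> and C<body> below are invented for this sketch:

```perl
use strict;
use warnings;

my $source = q{title = "it's here" and more};   # made-up input
my ($quote, $body);

# The closing delimiter must be whatever the group named "q" captured.
if ($source =~ /(?<q>["'])(?<body>.*?)\k<q>/) {
    ($quote, $body) = ($+{q}, $+{body});
}

print "quote: $quote, body: $body\n";
```

The apostrophe inside the double-quoted text does not end the match, because C<< \k<q> >> only accepts a double quote here.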
1f1031fe 1140
=item C<(?{ code })>
X<(?{})> X<regex, code in> X<regexp, code in> X<regular expression, code in>

B<WARNING>: This extended regular expression feature is considered
experimental, and may be changed without notice. Code executed that
has side effects may not perform identically from version to version
due to the effect of future optimisations in the regex engine.

This zero-width assertion evaluates any embedded Perl code. It
always succeeds, and its C<code> is not interpolated. Currently,
the rules to determine where the C<code> ends are somewhat convoluted.

This feature can be used together with the special variable C<$^N> to
capture the results of submatches in variables without having to keep
track of the number of nested parentheses. For example:

    $_ = "The brown fox jumps over the lazy dog";
    /the (\S+)(?{ $color = $^N }) (\S+)(?{ $animal = $^N })/i;
    print "color = $color, animal = $animal\n";

Inside the C<(?{...})> block, C<$_> refers to the string the regular
expression is matching against. You can also use C<pos()> to know what is
the current position of matching within this string.
754091cb 1164
The C<code> is properly scoped in the following sense: If the assertion
is backtracked (compare L<"Backtracking">), all changes introduced after
C<local>ization are undone, so that

    $_ = 'a' x 8;
    m<
       (?{ $cnt = 0 })                   # Initialize $cnt.
       (
         a
         (?{
             local $cnt = $cnt + 1;      # Update $cnt,
                                         # backtracking-safe.
         })
       )*
       aaaa
       (?{ $res = $cnt })                # On success copy to
                                         # non-localized location.
     >x;

will set C<$res = 4>. Note that after the match, C<$cnt> returns to the globally
introduced value, because the scopes that restrict C<local> operators
are unwound.
19799a22
GS
1188This assertion may be used as a C<(?(condition)yes-pattern|no-pattern)>
1189switch. If I<not> used in this way, the result of evaluation of
1190C<code> is put into the special variable C<$^R>. This happens
1191immediately, so C<$^R> can be used from other C<(?{ code })> assertions
1192inside the same regular expression.
b9ac3b5b 1193
19799a22
GS
1194The assignment to C<$^R> above is properly localized, so the old
1195value of C<$^R> is restored if the assertion is backtracked; compare
1196L<"Backtracking">.
b9ac3b5b 1197
19799a22
GS
1198For reasons of security, this construct is forbidden if the regular
1199expression involves run-time interpolation of variables, unless the
1200perilous C<use re 'eval'> pragma has been used (see L<re>), or the
0b928c2f 1201variables contain results of the C<qr//> operator (see
b6fa137b 1202L<perlop/"qr/STRINGE<sol>msixpodual">).
871b0233 1203
0d017f4d 1204This restriction is due to the wide-spread and remarkably convenient
19799a22 1205custom of using run-time determined strings as patterns. For example:
871b0233
IZ
1206
1207 $re = <>;
1208 chomp $re;
1209 $string =~ /$re/;
1210
14218588
GS
1211Before Perl knew how to execute interpolated code within a pattern,
1212this operation was completely safe from a security point of view,
1213although it could raise an exception from an illegal pattern. If
1214you turn on the C<use re 'eval'>, though, it is no longer secure,
1215so you should only do so if you are also using taint checking.
1216Better yet, use the carefully constrained evaluation within a Safe
cc46d5f2 1217compartment. See L<perlsec> for details about both these mechanisms.
871b0233 1218
B<WARNING>: Use of lexical (C<my>) variables in these blocks is
broken. The result is unpredictable and will make perl unstable. The
workaround is to use global (C<our>) variables.

B<WARNING>: In perl 5.12.x and earlier, the regex engine
was not re-entrant, so interpolated code could not
safely invoke the regex engine either directly with
C<m//> or C<s///>, or indirectly with functions such as
C<split>. Invoking the regex engine in these blocks would make perl
unstable.
8988a1bb 1229
=item C<(??{ code })>
X<(??{})>
X<regex, postponed> X<regexp, postponed> X<regular expression, postponed>

B<WARNING>: This extended regular expression feature is considered
experimental, and may be changed without notice. Code executed that
has side effects may not perform identically from version to version
due to the effect of future optimisations in the regex engine.

This is a "postponed" regular subexpression. The C<code> is evaluated
at run time, at the moment this subexpression may match. The result
of evaluation is considered a regular expression and matched as
if it were inserted instead of this construct. Note that this means
that the contents of capture groups defined inside an eval'ed pattern
are not available outside of the pattern, and vice versa, there is no
way for the inner pattern to refer to a capture group defined outside.
Thus,

    ('a' x 100)=~/(??{'(.)' x 100})/

B<will> match, but it will B<not> set C<$1>.

The C<code> is not interpolated. As before, the rules to determine
where the C<code> ends are currently somewhat convoluted.

The following pattern matches a parenthesized group:

    $re = qr{
               \(
               (?:
                  (?> [^()]+ )  # Non-parens without backtracking
                |
                  (??{ $re })   # Group with matching parens
               )*
               \)
            }x;
0f5d15d6 1266
See also C<(?PARNO)> for a different, more efficient way to accomplish
the same task.

For reasons of security, this construct is forbidden if the regular
expression involves run-time interpolation of variables, unless the
perilous C<use re 'eval'> pragma has been used (see L<re>), or the
variables contain results of the C<qr//> operator (see
L<perlop/"qrE<sol>STRINGE<sol>msixpodual">).

In perl 5.12.x and earlier, because the regex engine was not re-entrant,
delayed code could not safely invoke the regex engine either directly with
C<m//> or C<s///>, or indirectly with functions such as C<split>.

Recursing deeper than 50 times without consuming any input string will
result in a fatal error. The maximum depth is compiled into perl, so
changing it requires a custom build.
1283
=item C<(?PARNO)> C<(?-PARNO)> C<(?+PARNO)> C<(?R)> C<(?0)>
X<(?PARNO)> X<(?1)> X<(?R)> X<(?0)> X<(?-1)> X<(?+1)> X<(?-PARNO)> X<(?+PARNO)>
X<regex, recursive> X<regexp, recursive> X<regular expression, recursive>
X<regex, relative recursion>

Similar to C<(??{ code })> except it does not involve compiling any code,
instead it treats the contents of a capture group as an independent
pattern that must match at the current position. Capture groups
contained by the pattern will have the value as determined by the
outermost recursion.

PARNO is a sequence of digits (not starting with 0) whose value reflects
the paren-number of the capture group to recurse to. C<(?R)> recurses to
the beginning of the whole pattern. C<(?0)> is an alternate syntax for
C<(?R)>. If PARNO is preceded by a plus or minus sign then it is assumed
to be relative, with negative numbers indicating preceding capture groups
and positive ones following. Thus C<(?-1)> refers to the most recently
declared group, and C<(?+1)> indicates the next group to be declared.
Note that the counting for relative recursion differs from that of
relative backreferences, in that with recursion unclosed groups B<are>
included.
6bda09f9 1305
The following pattern matches a function foo() which may contain
balanced parentheses as the argument.

    $re = qr{ (                    # paren group 1 (full function)
                foo
                (                  # paren group 2 (parens)
                  \(
                    (              # paren group 3 (contents of parens)
                    (?:
                     (?> [^()]+ )  # Non-parens without backtracking
                    |
                     (?2)          # Recurse to start of paren group 2
                    )*
                    )
                  \)
                )
              )
            }x;

If the pattern was used as follows

    'foo(bar(baz)+baz(bop))'=~/$re/
        and print "\$1 = $1\n",
                  "\$2 = $2\n",
                  "\$3 = $3\n";

the output produced should be the following:

    $1 = foo(bar(baz)+baz(bop))
    $2 = (bar(baz)+baz(bop))
    $3 = bar(baz)+baz(bop)
6bda09f9 1337
If there is no corresponding capture group defined, then it is a
fatal error. Recursing deeper than 50 times without consuming any input
string will also result in a fatal error. The maximum depth is compiled
into perl, so changing it requires a custom build.

The following shows how using negative indexing can make it
easier to embed recursive patterns inside of a C<qr//> construct
for later use:

    my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;
    if (/foo $parens \s+ \+ \s+ bar $parens/x) {
       # do something here...
    }
1351
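As a rough, runnable illustration of the same idea, the sketch below applies the fragment to a made-up input string. The sample text and the variable names C<$text>, C<$first> and C<$second> are assumptions chosen only for this example; the literal plus sign is written as C<\+>.

```perl
use strict;
use warnings;

# (?-1) recurses into the group that encloses it, so the fragment keeps
# working no matter how many groups precede it in the final pattern.
my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;

# Hypothetical sample input, chosen only for illustration.
my $text = 'foo(a(b)) + bar(c)';

my ($first, $second);
if ($text =~ /foo $parens \s+ \+ \s+ bar $parens/x) {
    # Copy the captures into lexicals while they are known to be valid.
    ($first, $second) = ($1, $2);
    print "first=$first second=$second\n";
}
```

Because each interpolated copy of C<$parens> refers to itself relatively, the second copy recurses into group 2 rather than group 1.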
B<Note> that this pattern does not behave the same way as the equivalent
PCRE or Python construct of the same form.  In Perl you can backtrack into
a recursed group, whereas in PCRE and Python the recursed-into group is
treated as atomic.  Also, modifiers are resolved at compile time, so constructs
like C<(?i:(?1))> or C<(?:(?i)(?1))> do not affect how the sub-pattern will
be processed.
6bda09f9 1358
894be9b7
YO
1359=item C<(?&NAME)>
1360X<(?&NAME)>
1361
0d017f4d
WL
1362Recurse to a named subpattern. Identical to C<(?PARNO)> except that the
1363parenthesis to recurse to is determined by name. If multiple parentheses have
894be9b7
YO
1364the same name, then it recurses to the leftmost.
1365
1366It is an error to refer to a name that is not declared somewhere in the
1367pattern.
1368
1f1031fe
YO
1369B<NOTE:> In order to make things easier for programmers with experience
1370with the Python or PCRE regex engines the pattern C<< (?P>NAME) >>
64c5a566 1371may be used instead of C<< (?&NAME) >>.
1f1031fe 1372
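A minimal sketch of named recursion follows. The group name C<paren> and the sample strings are hypothetical; the pattern simply recurses into its own named group to accept nested parentheses.

```perl
use strict;
use warnings;

# A named group that recurses into itself via (?&paren).
my $balanced = qr{
    (?<paren>
        \(
        (?: [^()]++ | (?&paren) )*
        \)
    )
}x;

my %verdict;
for my $s ('(a(b)c)', '(a(b)c', 'no parens') {
    # Anchor at both ends so the whole string must be one balanced group.
    $verdict{$s} = $s =~ /\A $balanced \z/x ? 'balanced' : 'not balanced';
    print "$s: $verdict{$s}\n";
}
```

Writing C<< (?P>paren) >> in place of C<(?&paren)> should behave the same way.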
e2e6a0f1
YO
1373=item C<(?(condition)yes-pattern|no-pattern)>
1374X<(?()>
286f584a 1375
e2e6a0f1 1376=item C<(?(condition)yes-pattern)>
286f584a 1377
Conditional expression.  Matches C<yes-pattern> if C<condition> yields
a true value, matches C<no-pattern> otherwise.  A missing pattern always
matches.

C<(condition)> should be one of: an integer in
parentheses (which is valid if the corresponding pair of parentheses
matched); a look-ahead/look-behind/evaluate zero-width assertion; a
name in angle brackets or single quotes (which is valid if a group
with the given name matched); or the special symbol C<(R)> (true when
evaluated inside of recursion or eval).  Additionally, the C<R> may be
followed by a number (in which case it is true when evaluated while
recursing inside the corresponding group), or by C<&NAME>, in which case
it is true only when evaluated during recursion in the named group.
1391
1392Here's a summary of the possible predicates:
1393
1394=over 4
1395
1396=item (1) (2) ...
1397
c27a5cfe 1398Checks if the numbered capturing group has matched something.
e2e6a0f1
YO
1399
1400=item (<NAME>) ('NAME')
1401
c27a5cfe 1402Checks if a group with the given name has matched something.
e2e6a0f1 1403
f01cd190
FC
1404=item (?=...) (?!...) (?<=...) (?<!...)
1405
1406Checks whether the pattern matches (or does not match, for the '!'
1407variants).
1408
e2e6a0f1
YO
1409=item (?{ CODE })
1410
f01cd190 1411Treats the return value of the code block as the condition.
e2e6a0f1
YO
1412
1413=item (R)
1414
1415Checks if the expression has been evaluated inside of recursion.
1416
1417=item (R1) (R2) ...
1418
1419Checks if the expression has been evaluated while executing directly
1420inside of the n-th capture group. This check is the regex equivalent of
1421
1422 if ((caller(0))[3] eq 'subname') { ... }
1423
1424In other words, it does not check the full recursion stack.
1425
1426=item (R&NAME)
1427
1428Similar to C<(R1)>, this predicate checks to see if we're executing
1429directly inside of the leftmost group with a given name (this is the same
1430logic used by C<(?&NAME)> to disambiguate). It does not check the full
1431stack, but only the name of the innermost active recursion.
1432
1433=item (DEFINE)
1434
1435In this case, the yes-pattern is never directly executed, and no
1436no-pattern is allowed. Similar in spirit to C<(?{0})> but more efficient.
1437See below for details.
1438
1439=back
1440
1441For example:
1442
1443 m{ ( \( )?
1444 [^()]+
1445 (?(1) \) )
1446 }x
1447
matches a chunk of non-parentheses, possibly enclosed in parentheses.
1450
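The conditional pattern above can be tried out with a sketch like the following. The anchors C<\A> and C<\z> and the sample strings are additions made for this illustration, so that a partial match cannot hide the effect of the condition.

```perl
use strict;
use warnings;

# Anchored variant of the pattern above: the closing parenthesis is
# required only if the optional opening one (group 1) matched.
my $re = qr{ \A ( \( )? [^()]+ (?(1) \) ) \z }x;

my %matched;
for my $s ('abc', '(abc)', '(abc', 'abc)') {
    $matched{$s} = $s =~ $re ? 1 : 0;
    printf "%-6s %s\n", $s, $matched{$s} ? 'matches' : 'fails';
}
```

Only the strings whose parentheses are either both present or both absent are accepted.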
0b928c2f
FC
1451A special form is the C<(DEFINE)> predicate, which never executes its
1452yes-pattern directly, and does not allow a no-pattern. This allows one to
1453define subpatterns which will be executed only by the recursion mechanism.
e2e6a0f1
YO
1454This way, you can define a set of regular expression rules that can be
1455bundled into any pattern you choose.
1456
1457It is recommended that for this usage you put the DEFINE block at the
1458end of the pattern, and that you name any subpatterns defined within it.
1459
1460Also, it's worth noting that patterns defined this way probably will
1461not be as efficient, as the optimiser is not very clever about
1462handling them.
1463
1464An example of how this might be used is as follows:
1465
    /(?<NAME>(?&NAME_PAT))(?<ADDR>(?&ADDRESS_PAT))
     (?(DEFINE)
       (?<NAME_PAT>....)
       (?<ADDRESS_PAT>....)
     )/x
1471
c27a5cfe
KW
1472Note that capture groups matched inside of recursion are not accessible
1473after the recursion returns, so the extra layer of capturing groups is
e2e6a0f1
YO
1474necessary. Thus C<$+{NAME_PAT}> would not be defined even though
1475C<$+{NAME}> would be.
286f584a 1476
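The following is a small sketch of a C<(DEFINE)> block in use. The rule names C<WORD> and C<NUMBER>, the outer group names C<key> and C<value>, and the input string are all hypothetical; they only serve to show that the outer groups capture while the rules inside the C<(DEFINE)> block do not.

```perl
use strict;
use warnings;

# WORD and NUMBER are hypothetical rule names used only for this sketch.
my $re = qr{
    \A (?<key> (?&WORD) ) \s* = \s* (?<value> (?&NUMBER) ) \z

    (?(DEFINE)
        (?<WORD>   [A-Za-z_] \w* )
        (?<NUMBER> \d+ (?: \. \d+ )? )
    )
}x;

my ($key, $value, $inner_seen);
if ('timeout = 2.5' =~ $re) {
    ($key, $value) = ($+{key}, $+{value});
    # Groups matched only through recursion are not visible afterwards.
    $inner_seen = defined $+{WORD} ? 1 : 0;
    print "$key => $value (WORD visible: $inner_seen)\n";
}
```

Keeping the rules at the end of the pattern, as recommended above, leaves the numbering of the outer groups easy to follow.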
c47ff5f1 1477=item C<< (?>pattern) >>
6bda09f9 1478X<backtrack> X<backtracking> X<atomic> X<possessive>
5a964f20 1479
19799a22
GS
1480An "independent" subexpression, one which matches the substring
1481that a I<standalone> C<pattern> would match if anchored at the given
9da458fc 1482position, and it matches I<nothing other than this substring>. This
19799a22
GS
1483construct is useful for optimizations of what would otherwise be
1484"eternal" matches, because it will not backtrack (see L<"Backtracking">).
9da458fc
IZ
1485It may also be useful in places where the "grab all you can, and do not
1486give anything back" semantic is desirable.
19799a22 1487
c47ff5f1 1488For example: C<< ^(?>a*)ab >> will never match, since C<< (?>a*) >>
19799a22
GS
1489(anchored at the beginning of string, as above) will match I<all>
1490characters C<a> at the beginning of string, leaving no C<a> for
1491C<ab> to match. In contrast, C<a*ab> will match the same as C<a+b>,
1492since the match of the subgroup C<a*> is influenced by the following
1493group C<ab> (see L<"Backtracking">). In particular, C<a*> inside
1494C<a*ab> will match fewer characters than a standalone C<a*>, since
1495this makes the tail match.
1496
0b928c2f
FC
1497C<< (?>pattern) >> does not disable backtracking altogether once it has
1498matched. It is still possible to backtrack past the construct, but not
1499into it. So C<< ((?>a*)|(?>b*))ar >> will still match "bar".
1500
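The three claims above can be checked with a short sketch. The hash C<%result> and its key names are illustrative only.

```perl
use strict;
use warnings;

my %result = (
    # (?>a*) swallows every leading "a" and never gives one back.
    atomic      => ('aaab' =~ /^(?>a*)ab/         ? 1 : 0),
    # The plain quantifier backtracks by one "a" so that "ab" can match.
    greedy      => ('aaab' =~ /^a*ab/             ? 1 : 0),
    # Backtracking past an atomic group into the next alternative is fine.
    alternation => ('bar'  =~ /((?>a*)|(?>b*))ar/ ? 1 : 0),
);

print "$_: $result{$_}\n" for sort keys %result;
```

Here only the C<atomic> entry should be 0; the other two patterns match.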
An effect similar to C<< (?>pattern) >> may be achieved by writing
C<(?=(pattern))\g{-1}>.  The look-ahead matches the same substring as a
standalone C<pattern>, and the following C<\g{-1}> eats the matched string;
it therefore makes a zero-length assertion into an analogue of C<< (?>...) >>.
(The difference between these two constructs is that the second one
uses a capturing group, thus shifting ordinals of backreferences
in the rest of a regular expression.)
1508
1509Consider this pattern:
c277df42 1510
    m{ \(
          (
            [^()]+              # x+
          |
            \( [^()]* \)
          )+
       \)
     }x
5a964f20 1519
That will efficiently match a nonempty group with matching parentheses
two levels deep or less.  However, if there is no such group, it
will take virtually forever on a long string.  That's because there
are so many different ways to split a long string into several
substrings.  This is what C<(.+)+> is doing, and C<(.+)+> is similar
to a subpattern of the above pattern.  The pattern above takes several
seconds to detect that C<((()aaaaaaaaaaaaaaaaaa> does not match, and each
extra letter doubles this time.  This exponential performance will make
it appear that your program has hung.  However, a tiny change to this pattern
5a964f20 1530
    m{ \(
          (
            (?> [^()]+ )        # change x+ above to (?> x+ )
          |
            \( [^()]* \)
          )+
       \)
     }x
c277df42 1539
which uses C<< (?>...) >> matches exactly when the one above does (verifying
this yourself would be a productive exercise), but finishes in a quarter of
the time when used on a similar string with 1000000 C<a>s.  Be aware,
however, that when this construct is followed by a
quantifier, it currently triggers a warning message under
the C<use warnings> pragma or B<-w> switch saying it
C<"matches null string many times in regex">.
c277df42 1547
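As a starting point for that exercise, the sketch below runs both patterns over a handful of short strings and counts any disagreement. The sample strings and variable names are assumptions; the inputs are kept short on purpose, since long non-matching strings would make the first pattern very slow.

```perl
use strict;
use warnings;

my $plain  = qr{ \( ( [^()]+       | \( [^()]* \) )+ \) }x;
my $atomic = qr{ \( ( (?> [^()]+ ) | \( [^()]* \) )+ \) }x;

# Short, hypothetical test strings.
my @samples = ('(abc)', '(a(b)c)', '((()aaaa', '()', 'abc');

my $disagreements = 0;
my %plain_result;
for my $s (@samples) {
    my $got_plain  = $s =~ $plain  ? 1 : 0;
    my $got_atomic = $s =~ $atomic ? 1 : 0;
    $plain_result{$s} = $got_plain;
    $disagreements++ if $got_plain != $got_atomic;
    print "$s: plain=$got_plain atomic=$got_atomic\n";
}
print "disagreements: $disagreements\n";
```

A handful of samples is of course not a proof, only a quick sanity check.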
c47ff5f1 1548On simple groups, such as the pattern C<< (?> [^()]+ ) >>, a comparable
19799a22 1549effect may be achieved by negative look-ahead, as in C<[^()]+ (?! [^()] )>.
c277df42
IZ
1550This was only 4 times slower on a string with 1000000 C<a>s.
1551
The "grab all you can, and do not give anything back" semantic is desirable
in many situations where at first sight a simple C<()*> looks like
the correct solution.  Suppose we parse text with comments being delimited
by C<#> followed by some optional (horizontal) whitespace.  Contrary to
its appearance, C<#[ \t]*> I<is not> the correct subexpression to match
the comment delimiter, because it may "give up" some whitespace if
the remainder of the pattern can be made to match that way.  The correct
answer is either one of these:
1560
1561 (?>#[ \t]*)
1562 #[ \t]*(?![ \t])
1563
1564For example, to grab non-empty comments into $1, one should use either
1565one of these:
1566
1567 / (?> \# [ \t]* ) ( .+ ) /x;
1568 / \# [ \t]* ( [^ \t] .* ) /x;
1569
1570Which one you pick depends on which of these expressions better reflects
1571the above specification of comments.
1572
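To see the "give up" effect concretely, the sketch below compares a naive pattern with the two forms shown above on a comment that contains only whitespace. The hash names and the two sample lines are made up for this illustration.

```perl
use strict;
use warnings;

my %pattern = (
    naive    => qr/ \# [ \t]* ( .+ ) /x,         # may give whitespace back
    atomic   => qr/ (?> \# [ \t]* ) ( .+ ) /x,   # delimiter keeps its whitespace
    explicit => qr/ \# [ \t]* ( [^ \t] .* ) /x,  # text must start with non-blank
);

my %capture;
for my $name (sort keys %pattern) {
    for my $line ("#  note", "#   ") {
        # Store the captured comment text, or undef when there is no match.
        $capture{$name}{$line} = $line =~ $pattern{$name} ? $1 : undef;
    }
}

for my $name (sort keys %capture) {
    my $blank = $capture{$name}{"#   "};
    printf "%-8s blank comment: %s\n", $name,
        defined $blank ? "captured '$blank'" : 'no match';
}
```

The naive form reports a single space as the comment text, while the other two treat the blank comment as empty and do not match.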
6bda09f9
YO
1573In some literature this construct is called "atomic matching" or
1574"possessive matching".
1575
b9b4dddf
YO
1576Possessive quantifiers are equivalent to putting the item they are applied
1577to inside of one of these constructs. The following equivalences apply:
1578
1579 Quantifier Form Bracketing Form
1580 --------------- ---------------
1581 PAT*+ (?>PAT*)
1582 PAT++ (?>PAT+)
1583 PAT?+ (?>PAT?)
1584 PAT{min,max}+ (?>PAT{min,max})
1585
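The equivalences in the table can be spot-checked with a sketch such as the one below. The particular patterns and input strings are arbitrary choices for this example; each pair puts the possessive form next to its bracketed counterpart and compares the outcome.

```perl
use strict;
use warnings;

# Each pair: [ quantifier form, bracketing form ].
my @pairs = (
    [ qr/^a*+ab/,     qr/^(?>a*)ab/     ],
    [ qr/^\d++\./,    qr/^(?>\d+)\./    ],
    [ qr/^x?+xy/,     qr/^(?>x?)xy/     ],
    [ qr/^b{1,2}+bc/, qr/^(?>b{1,2})bc/ ],
);
my @inputs = ('aaab', 'ab', '12.5', '125', 'xy', 'xxy', 'bbc', 'bbbc');

my ($checked, $mismatches) = (0, 0);
for my $pair (@pairs) {
    my ($quantifier_form, $bracket_form) = @$pair;
    for my $input (@inputs) {
        my $lhs = $input =~ $quantifier_form ? 1 : 0;
        my $rhs = $input =~ $bracket_form    ? 1 : 0;
        $checked++;
        $mismatches++ if $lhs != $rhs;
    }
}
print "checked $checked combinations, $mismatches mismatches\n";
```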
e2e6a0f1
YO
1586=back
1587
1588=head2 Special Backtracking Control Verbs
1589
1590B<WARNING:> These patterns are experimental and subject to change or
0d017f4d 1591removal in a future version of Perl. Their usage in production code should
e2e6a0f1
YO
1592be noted to avoid problems during upgrades.
1593
1594These special patterns are generally of the form C<(*VERB:ARG)>. Unless
1595otherwise stated the ARG argument is optional; in some cases, it is
1596forbidden.
1597
1598Any pattern containing a special backtracking verb that allows an argument
e1020413 1599has the special behaviour that when executed it sets the current package's
5d458dd8
YO
1600C<$REGERROR> and C<$REGMARK> variables. When doing so the following
1601rules apply:
e2e6a0f1 1602
5d458dd8
YO
1603On failure, the C<$REGERROR> variable will be set to the ARG value of the
1604verb pattern, if the verb was involved in the failure of the match. If the
1605ARG part of the pattern was omitted, then C<$REGERROR> will be set to the
1606name of the last C<(*MARK:NAME)> pattern executed, or to TRUE if there was
1607none. Also, the C<$REGMARK> variable will be set to FALSE.
e2e6a0f1 1608
5d458dd8
YO
1609On a successful match, the C<$REGERROR> variable will be set to FALSE, and
1610the C<$REGMARK> variable will be set to the name of the last
1611C<(*MARK:NAME)> pattern executed. See the explanation for the
1612C<(*MARK:NAME)> verb below for more details.
e2e6a0f1 1613
5d458dd8 1614B<NOTE:> C<$REGERROR> and C<$REGMARK> are not magic variables like C<$1>
0b928c2f 1615and most other regex-related variables. They are not local to a scope, nor
5d458dd8
YO
1616readonly, but instead are volatile package variables similar to C<$AUTOLOAD>.
1617Use C<local> to localize changes to them to a specific scope if necessary.
e2e6a0f1
YO
1618
1619If a pattern does not contain a special backtracking verb that allows an
5d458dd8 1620argument, then C<$REGERROR> and C<$REGMARK> are not touched at all.
e2e6a0f1 1621
70ca8714 1622=over 3
e2e6a0f1
YO
1623
1624=item Verbs that take an argument
1625
1626=over 4
1627
5d458dd8 1628=item C<(*PRUNE)> C<(*PRUNE:NAME)>
f7819f85 1629X<(*PRUNE)> X<(*PRUNE:NAME)>
54612592 1630
5d458dd8
YO
1631This zero-width pattern prunes the backtracking tree at the current point
1632when backtracked into on failure. Consider the pattern C<A (*PRUNE) B>,
1633where A and B are complex patterns. Until the C<(*PRUNE)> verb is reached,
1634A may backtrack as necessary to match. Once it is reached, matching
1635continues in B, which may also backtrack as necessary; however, should B
1636not match, then no further backtracking will take place, and the pattern
1637will fail outright at the current starting position.
54612592
YO
1638
1639The following example counts all the possible matching strings in a
1640pattern (without actually matching any of them).
1641
e2e6a0f1 1642 'aaab' =~ /a+b?(?{print "$&\n"; $count++})(*FAIL)/;
54612592
YO
1643 print "Count=$count\n";
1644
1645which produces:
1646
1647 aaab
1648 aaa
1649 aa
1650 a
1651 aab
1652 aa
1653 a
1654 ab
1655 a
1656 Count=9
1657
5d458dd8 1658If we add a C<(*PRUNE)> before the count like the following
54612592 1659
5d458dd8 1660 'aaab' =~ /a+b?(*PRUNE)(?{print "$&\n"; $count++})(*FAIL)/;
54612592
YO
1661 print "Count=$count\n";
1662
0b928c2f 1663we prevent backtracking and find the count of the longest matching string
353c6505 1664at each matching starting point like so:
54612592
YO
1665
1666 aaab
1667 aab
1668 ab
1669 Count=3
1670
5d458dd8 1671Any number of C<(*PRUNE)> assertions may be used in a pattern.
54612592 1672
5d458dd8
YO
1673See also C<< (?>pattern) >> and possessive quantifiers for other ways to
1674control backtracking. In some cases, the use of C<(*PRUNE)> can be
1675replaced with a C<< (?>pattern) >> with no functional difference; however,
1676C<(*PRUNE)> can be used to handle cases that cannot be expressed using a
1677C<< (?>pattern) >> alone.
54612592 1678
e2e6a0f1 1679
5d458dd8
YO
1680=item C<(*SKIP)> C<(*SKIP:NAME)>
1681X<(*SKIP)>
e2e6a0f1 1682
This zero-width pattern is similar to C<(*PRUNE)>, except that on
failure it also signifies that whatever text was matched leading up
to the C<(*SKIP)> pattern being executed cannot be part of I<any> match
of this pattern.  This effectively means that the regex engine "skips" forward
to this position on failure and tries to match again (assuming that
there is sufficient room to match).
1689
1690The name of the C<(*SKIP:NAME)> pattern has special significance. If a
1691C<(*MARK:NAME)> was encountered while matching, then it is that position
1692which is used as the "skip point". If no C<(*MARK)> of that name was
encountered, then the C<(*SKIP)> operator has no effect.  When used
without a name, the "skip point" is where the match point was when
executing the C<(*SKIP)> pattern.
1696
0b928c2f 1697Compare the following to the examples in C<(*PRUNE)>; note the string
24b23f37
YO
1698is twice as long:
1699
d1fbf752
KW
1700 'aaabaaab' =~ /a+b?(*SKIP)(?{print "$&\n"; $count++})(*FAIL)/;
1701 print "Count=$count\n";
24b23f37
YO
1702
1703outputs
1704
1705 aaab
1706 aaab
1707 Count=2
1708
5d458dd8 1709Once the 'aaab' at the start of the string has matched, and the C<(*SKIP)>
353c6505 1710executed, the next starting point will be where the cursor was when the
5d458dd8
YO
1711C<(*SKIP)> was executed.
1712
5d458dd8 1713=item C<(*MARK:NAME)> C<(*:NAME)>
b16db30f 1714X<(*MARK)> X<(*MARK:NAME)> X<(*:NAME)>
5d458dd8
YO
1715
1716This zero-width pattern can be used to mark the point reached in a string
1717when a certain part of the pattern has been successfully matched. This
1718mark may be given a name. A later C<(*SKIP)> pattern will then skip
1719forward to that point if backtracked into on failure. Any number of
b4222fa9 1720C<(*MARK)> patterns are allowed, and the NAME portion may be duplicated.
5d458dd8
YO
1721
1722In addition to interacting with the C<(*SKIP)> pattern, C<(*MARK:NAME)>
1723can be used to "label" a pattern branch, so that after matching, the
1724program can determine which branches of the pattern were involved in the
1725match.
1726
1727When a match is successful, the C<$REGMARK> variable will be set to the
1728name of the most recently executed C<(*MARK:NAME)> that was involved
1729in the match.
1730
1731This can be used to determine which branch of a pattern was matched
c27a5cfe 1732without using a separate capture group for each branch, which in turn
5d458dd8
YO
1733can result in a performance improvement, as perl cannot optimize
1734C</(?:(x)|(y)|(z))/> as efficiently as something like
1735C</(?:x(*MARK:x)|y(*MARK:y)|z(*MARK:z))/>.
1736
1737When a match has failed, and unless another verb has been involved in
1738failing the match and has provided its own name to use, the C<$REGERROR>
1739variable will be set to the name of the most recently executed
1740C<(*MARK:NAME)>.
1741
42ac7c82 1742See L</(*SKIP)> for more details.
5d458dd8 1743
b62d2d15
YO
1744As a shortcut C<(*MARK:NAME)> can be written C<(*:NAME)>.
1745
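A minimal sketch of branch labelling with C<(*MARK:NAME)> is given below. The labels C<animal> and C<tree>, the sample words and the C<%label> hash are hypothetical. Note that under C<use strict> the C<$REGMARK> and C<$REGERROR> package variables need to be declared with C<our> (or fully qualified) before use.

```perl
use strict;
use warnings;

# $REGMARK and $REGERROR are ordinary package variables, so they must be
# declared (or fully qualified) under "use strict".
our ($REGMARK, $REGERROR);

# The branch labels "animal" and "tree" are hypothetical names.
my $re = qr/\A(?:cat(*MARK:animal)|oak(*MARK:tree))\z/;

my %label;
for my $word ('cat', 'oak') {
    # On success $REGMARK holds the name of the last mark executed.
    $label{$word} = $REGMARK if $word =~ $re;
    print "$word was matched by branch '$label{$word}'\n";
}
```

No capture groups are needed to tell the branches apart, which is the point of the technique described above.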
5d458dd8
YO
1746=item C<(*THEN)> C<(*THEN:NAME)>
1747
241e7389 1748This is similar to the "cut group" operator C<::> from Perl 6. Like
5d458dd8
YO
1749C<(*PRUNE)>, this verb always matches, and when backtracked into on
1750failure, it causes the regex engine to try the next alternation in the
1751innermost enclosing group (capturing or otherwise).
1752
1753Its name comes from the observation that this operation combined with the
1754alternation operator (C<|>) can be used to create what is essentially a
1755pattern-based if/then/else block:
1756
1757 ( COND (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ )
1758
1759Note that if this operator is used and NOT inside of an alternation then
1760it acts exactly like the C<(*PRUNE)> operator.
1761
1762 / A (*PRUNE) B /
1763
1764is the same as
1765
1766 / A (*THEN) B /
1767
1768but
1769
1770 / ( A (*THEN) B | C (*THEN) D ) /
1771
1772is not the same as
1773
1774 / ( A (*PRUNE) B | C (*PRUNE) D ) /
1775
1776as after matching the A but failing on the B the C<(*THEN)> verb will
1777backtrack and try C; but the C<(*PRUNE)> verb will simply fail.
24b23f37 1778
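The difference can be observed with a sketch along these lines. The literal branches and the input string C<'ac'> are assumptions chosen so that the first branch matches C<a> and then fails on C<b>; the pattern is anchored so that a later starting position cannot rescue the match.

```perl
use strict;
use warnings;

# With (*THEN), failing on "b" moves on to the second alternative.
my $then_matches  = 'ac' =~ /\A(?: a (*THEN)  b | a c )\z/x ? 1 : 0;

# With (*PRUNE), the same failure abandons this starting position entirely.
my $prune_matches = 'ac' =~ /\A(?: a (*PRUNE) b | a c )\z/x ? 1 : 0;

print "THEN:  ", ($then_matches  ? 'match' : 'no match'), "\n";
print "PRUNE: ", ($prune_matches ? 'match' : 'no match'), "\n";
```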
cbeadc21
JV
1779=back
1780
1781=item Verbs without an argument
1782
1783=over 4
1784
e2e6a0f1
YO
1785=item C<(*COMMIT)>
1786X<(*COMMIT)>
24b23f37 1787
This is the Perl 6 "commit pattern" C<< <commit> >> or C<:::>.  It's a
zero-width pattern similar to C<(*SKIP)>, except that when backtracked
into on failure it causes the match to fail outright.  No further attempts
to find a valid match by advancing the start pointer will occur.
For example,
24b23f37 1793
d1fbf752
KW
1794 'aaabaaab' =~ /a+b?(*COMMIT)(?{print "$&\n"; $count++})(*FAIL)/;
1795 print "Count=$count\n";
24b23f37
YO
1796
1797outputs
1798
1799 aaab
1800 Count=1
1801
e2e6a0f1
YO
1802In other words, once the C<(*COMMIT)> has been entered, and if the pattern
1803does not match, the regex engine will not try any further matching on the
1804rest of the string.
c277df42 1805
e2e6a0f1
YO
1806=item C<(*FAIL)> C<(*F)>
1807X<(*FAIL)> X<(*F)>
9af228c6 1808
e2e6a0f1
YO
1809This pattern matches nothing and always fails. It can be used to force the
1810engine to backtrack. It is equivalent to C<(?!)>, but easier to read. In
1811fact, C<(?!)> gets optimised into C<(*FAIL)> internally.
9af228c6 1812
e2e6a0f1 1813It is probably useful only when combined with C<(?{})> or C<(??{})>.
9af228c6 1814
e2e6a0f1
YO
1815=item C<(*ACCEPT)>
1816X<(*ACCEPT)>
9af228c6 1817
e2e6a0f1
YO
1818B<WARNING:> This feature is highly experimental. It is not recommended
1819for production code.
9af228c6 1820
e2e6a0f1
YO
1821This pattern matches nothing and causes the end of successful matching at
1822the point at which the C<(*ACCEPT)> pattern was encountered, regardless of
1823whether there is actually more to match in the string. When inside of a
0d017f4d 1824nested pattern, such as recursion, or in a subpattern dynamically generated
e2e6a0f1 1825via C<(??{})>, only the innermost pattern is ended immediately.
9af228c6 1826
c27a5cfe 1827If the C<(*ACCEPT)> is inside of capturing groups then the groups are
e2e6a0f1
YO
1828marked as ended at the point at which the C<(*ACCEPT)> was encountered.
1829For instance:
9af228c6 1830
e2e6a0f1 1831 'AB' =~ /(A (A|B(*ACCEPT)|C) D)(E)/x;
9af228c6 1832
will match, and C<$1> will be C<AB> and C<$2> will be C<B>; C<$3> will not
be set.  If another branch in the inner parentheses was matched, such as in the
string 'ACDE', then the C<D> and C<E> would have to be matched as well.
9af228c6
YO
1836
1837=back
c277df42 1838
a0d0e21e
LW
1839=back
1840
c07a80fd 1841=head2 Backtracking
d74e8afc 1842X<backtrack> X<backtracking>
c07a80fd 1843
35a734be
IZ
1844NOTE: This section presents an abstract approximation of regular
1845expression behavior. For a more rigorous (and complicated) view of
1846the rules involved in selecting a match among possible alternatives,
0d017f4d 1847see L<Combining RE Pieces>.
35a734be 1848
A fundamental feature of regular expression matching involves the
notion called I<backtracking>, which is currently used (when needed)
by all non-possessive regular expression quantifiers, namely C<*>, C<*?>, C<+>,
C<+?>, C<{n,m}>, and C<{n,m}?>.  Backtracking is often optimized
internally, but the general principle outlined here is valid.
c07a80fd 1854
1855For a regular expression to match, the I<entire> regular expression must
1856match, not just part of it. So if the beginning of a pattern containing a
1857quantifier succeeds in a way that causes later parts in the pattern to
1858fail, the matching engine backs up and recalculates the beginning
1859part--that's why it's called backtracking.
1860
1861Here is an example of backtracking: Let's say you want to find the
1862word following "foo" in the string "Food is on the foo table.":
1863
1864 $_ = "Food is on the foo table.";
1865 if ( /\b(foo)\s+(\w+)/i ) {
f793d64a 1866 print "$2 follows $1.\n";
c07a80fd 1867 }
1868
1869When the match runs, the first part of the regular expression (C<\b(foo)>)
1870finds a possible match right at the beginning of the string, and loads up
1871$1 with "Foo". However, as soon as the matching engine sees that there's
1872no whitespace following the "Foo" that it had saved in $1, it realizes its
68dc0745 1873mistake and starts over again one character after where it had the
c07a80fd 1874tentative match. This time it goes all the way until the next occurrence
1875of "foo". The complete regular expression matches this time, and you get
1876the expected output of "table follows foo."
1877
1878Sometimes minimal matching can help a lot. Imagine you'd like to match
1879everything between "foo" and "bar". Initially, you write something
1880like this:
1881
1882 $_ = "The food is under the bar in the barn.";
1883 if ( /foo(.*)bar/ ) {
f793d64a 1884 print "got <$1>\n";
c07a80fd 1885 }
1886
1887Which perhaps unexpectedly yields:
1888
1889 got <d is under the bar in the >
1890
1891That's because C<.*> was greedy, so you get everything between the
14218588 1892I<first> "foo" and the I<last> "bar". Here it's more effective
c07a80fd 1893to use minimal matching to make sure you get the text between a "foo"
1894and the first "bar" thereafter.
1895
1896 if ( /foo(.*?)bar/ ) { print "got <$1>\n" }
1897 got <d is under the >
1898
0d017f4d 1899Here's another example. Let's say you'd like to match a number at the end
b6e13d97 1900of a string, and you also want to keep the preceding part of the match.
c07a80fd 1901So you write this:
1902
    $_ = "I have 2 numbers: 53147";
    if ( /(.*)(\d*)/ ) {                                # Wrong!
        print "Beginning is <$1>, number is <$2>.\n";
    }
1907
1908That won't work at all, because C<.*> was greedy and gobbled up the
1909whole string. As C<\d*> can match on an empty string the complete
1910regular expression matched successfully.
1911
8e1088bc 1912 Beginning is <I have 2 numbers: 53147>, number is <>.
c07a80fd 1913
1914Here are some variants, most of which don't work:
1915
    $_ = "I have 2 numbers: 53147";
    @pats = qw{
        (.*)(\d*)
        (.*)(\d+)
        (.*?)(\d*)
        (.*?)(\d+)
        (.*)(\d+)$
        (.*?)(\d+)$
        (.*)\b(\d+)$
        (.*\D)(\d+)$
    };

    for $pat (@pats) {
        printf "%-12s ", $pat;
        if ( /$pat/ ) {
            print "<$1> <$2>\n";
        } else {
            print "FAIL\n";
        }
    }
1936
1937That will print out:
1938
    (.*)(\d*)    <I have 2 numbers: 53147> <>
    (.*)(\d+)    <I have 2 numbers: 5314> <7>
    (.*?)(\d*)   <> <>
    (.*?)(\d+)   <I have > <2>
    (.*)(\d+)$   <I have 2 numbers: 5314> <7>
    (.*?)(\d+)$  <I have 2 numbers: > <53147>
    (.*)\b(\d+)$ <I have 2 numbers: > <53147>
    (.*\D)(\d+)$ <I have 2 numbers: > <53147>
1947
1948As you see, this can be a bit tricky. It's important to realize that a
1949regular expression is merely a set of assertions that gives a definition
1950of success. There may be 0, 1, or several different ways that the
1951definition might succeed against a particular string. And if there are
5a964f20
TC
1952multiple ways it might succeed, you need to understand backtracking to
1953know which variety of success you will achieve.
c07a80fd 1954
19799a22 1955When using look-ahead assertions and negations, this can all get even
8b19b778 1956trickier. Imagine you'd like to find a sequence of non-digits not
c07a80fd 1957followed by "123". You might try to write that as
1958
    $_ = "ABC123";
    if ( /^\D*(?!123)/ ) {                # Wrong!
        print "Yup, no 123 in $_\n";
    }
c07a80fd 1963
1964But that isn't going to match; at least, not the way you're hoping. It
1965claims that there is no 123 in the string. Here's a clearer picture of
9b9391b2 1966why that pattern matches, contrary to popular expectations:
c07a80fd 1967
    $x = 'ABC123';
    $y = 'ABC445';

    print "1: got $1\n" if $x =~ /^(ABC)(?!123)/;
    print "2: got $1\n" if $y =~ /^(ABC)(?!123)/;

    print "3: got $1\n" if $x =~ /^(\D*)(?!123)/;
    print "4: got $1\n" if $y =~ /^(\D*)(?!123)/;
c07a80fd 1976
1977This prints
1978
1979 2: got ABC
1980 3: got AB
1981 4: got ABC
1982
You might have expected test 3 to fail because it seems to be a more
general-purpose version of test 1.  The important difference between
them is that test 3 contains a quantifier (C<\D*>) and so can use
backtracking, whereas test 1 will not.  What's happening is
that you've asked "Is it true that at the start of $x, following 0 or more
non-digits, you have something that's not 123?"  If the pattern matcher had
let C<\D*> expand to "ABC", this would have caused the whole pattern to
fail.
14218588 1991
c07a80fd 1992The search engine will initially match C<\D*> with "ABC". Then it will
0b928c2f 1993try to match C<(?!123)> with "123", which fails. But because
c07a80fd 1994a quantifier (C<\D*>) has been used in the regular expression, the
1995search engine can backtrack and retry the match differently
54310121 1996in the hope of matching the complete regular expression.
c07a80fd 1997
5a964f20
TC
1998The pattern really, I<really> wants to succeed, so it uses the
1999standard pattern back-off-and-retry and lets C<\D*> expand to just "AB" this
c07a80fd 2000time. Now there's indeed something following "AB" that is not
14218588 2001"123". It's "C123", which suffices.
c07a80fd 2002
14218588
GS
2003We can deal with this by using both an assertion and a negation.
2004We'll say that the first part in $1 must be followed both by a digit
2005and by something that's not "123". Remember that the look-aheads
2006are zero-width expressions--they only look, but don't consume any
2007of the string in their match. So rewriting this way produces what
c07a80fd 2008you'd expect; that is, case 5 will fail, but case 6 succeeds:
2009
    print "5: got $1\n" if $x =~ /^(\D*)(?=\d)(?!123)/;
    print "6: got $1\n" if $y =~ /^(\D*)(?=\d)(?!123)/;

    6: got ABC
2014
5a964f20 2015In other words, the two zero-width assertions next to each other work as though
19799a22 2016they're ANDed together, just as you'd use any built-in assertions: C</^$/>
c07a80fd 2017matches only if you're at the beginning of the line AND the end of the
2018line simultaneously. The deeper underlying truth is that juxtaposition in
2019regular expressions always means AND, except when you write an explicit OR
2020using the vertical bar. C</ab/> means match "a" AND (then) match "b",
2021although the attempted matches are made at different positions because "a"
2022is not a zero-width assertion, but a one-width assertion.
2023
0d017f4d 2024B<WARNING>: Particularly complicated regular expressions can take
14218588 2025exponential time to solve because of the immense number of possible
0d017f4d 2026ways they can use backtracking to try for a match. For example, without
9da458fc
IZ
2027internal optimizations done by the regular expression engine, this will
2028take a painfully long time to run:
c07a80fd 2029
e1901655
IZ
2030 'aaaaaaaaaaaa' =~ /((a{0,5}){0,5})*[c]/
2031
2032And if you used C<*>'s in the internal groups instead of limiting them
2033to 0 through 5 matches, then it would take forever--or until you ran
2034out of stack space. Moreover, these internal optimizations are not
2035always applicable. For example, if you put C<{0,5}> instead of C<*>
2036on the external group, no current optimization is applicable, and the
2037match takes a long time to finish.
c07a80fd 2038
9da458fc
IZ
2039A powerful tool for optimizing such beasts is what is known as an
2040"independent group",
96090e4f 2041which does not backtrack (see L</C<< (?>pattern) >>>). Note also that
9da458fc 2042zero-length look-ahead/look-behind assertions will not backtrack to make
5d458dd8 2043the tail match, since they are in "logical" context: only
14218588 2044whether they match is considered relevant. For an example
9da458fc 2045where side-effects of look-ahead I<might> have influenced the
96090e4f 2046following match, see L</C<< (?>pattern) >>>.
c277df42 2047
a0d0e21e 2048=head2 Version 8 Regular Expressions
d74e8afc 2049X<regular expression, version 8> X<regex, version 8> X<regexp, version 8>
a0d0e21e 2050
5a964f20 2051In case you're not familiar with the "regular" Version 8 regex
a0d0e21e
LW
2052routines, here are the pattern-matching rules not described above.
2053
54310121 2054Any single character matches itself, unless it is a I<metacharacter>
a0d0e21e 2055with a special meaning described here or above. You can cause
5a964f20 2056characters that normally function as metacharacters to be interpreted
5f05dabc 2057literally by prefixing them with a "\" (e.g., "\." matches a ".", not any
0d017f4d
WL
2058character; "\\" matches a "\"). This escape mechanism is also required
2059for the character used as the pattern delimiter.
2060
2061A series of characters matches that series of characters in the target
0b928c2f 2062string, so the pattern C<blurfl> would match "blurfl" in the target
0d017f4d 2063string.
a0d0e21e
LW
2064
You can specify a character class by enclosing a list of characters
in C<[]>, which will match any character from the list. If the
first character after the "[" is "^", the class matches any character not
in the list. Within a list, the "-" character specifies a
range, so that C<a-z> represents all characters between "a" and "z",
inclusive. If you want either "-" or "]" itself to be a member of a
class, put it at the start of the list (possibly after a "^"), or
escape it with a backslash. "-" is also taken literally when it is
at the end of the list, just before the closing "]". (The
following all specify the same class of three characters: C<[-az]>,
C<[az-]>, and C<[a\-z]>. All are different from C<[a-z]>, which
specifies a class containing twenty-six characters, even on EBCDIC-based
character sets.) Also, if you try to use the character
classes C<\w>, C<\W>, C<\s>, C<\S>, C<\d>, or C<\D> as endpoints of
a range, the "-" is understood literally.

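As a small illustration of the rules about "-" in a class, the following
sketch checks that the three spellings listed above accept the same
characters, and that C<[a-z]> is a different class. The variable names are
made up for the example.

```perl
use strict;
use warnings;

# Three spellings of the same class: "-", "a", and "z".
my @same = (qr/^[-az]+$/, qr/^[az-]+$/, qr/^[a\-z]+$/);

# "a-z-" consists only of those three characters, so all three match.
my @hits = grep { "a-z-" =~ $_ } @same;
print scalar(@hits), " of 3 classes match 'a-z-'\n";

# "m" lies inside the range a-z but is not one of the three characters.
my $in_range   = "m" =~ /^[a-z]$/ ? 1 : 0;
my $in_literal = "m" =~ /^[-az]$/ ? 1 : 0;
print "range: $in_range, literal list: $in_literal\n";
```
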
Note also that ranges are rather unportable between character sets,
and even within a character set they may produce results
you probably didn't expect. A sound principle is to use only ranges
that begin from and end at either alphabetics of equal case (C<[a-e]>,
C<[A-E]>), or digits (C<[0-9]>). Anything else is unsafe. If in doubt,
spell out the character sets in full.

Characters may be specified using a metacharacter syntax much like that
used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,
"\f" a form feed, etc. More generally, \I<nnn>, where I<nnn> is a string
of three octal digits, matches the character whose coded character set value
is I<nnn>. Similarly, \xI<nn>, where I<nn> are hexadecimal digits,
matches the character whose ordinal is I<nn>. The expression \cI<x>
matches the character control-I<x>. Finally, the "." metacharacter
matches any character except "\n" (unless you use C</s>).

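The escapes described above can be tried out with a short script such as
this one. It is only a sketch, and it assumes an ASCII-based platform,
where "A" has ordinal 65 and control-A has ordinal 1; the variable names
are made up for the example.

```perl
use strict;
use warnings;

# "A" has ordinal 65: octal 101, hexadecimal 41 (ASCII assumed).
my $octal = "A" =~ /^\101$/ ? 1 : 0;
my $hex   = "A" =~ /^\x41$/ ? 1 : 0;

# \cA is control-A, the character with ordinal 1 (ASCII assumed).
my $ctrl  = chr(1) =~ /^\cA$/ ? 1 : 0;

# "." does not match a newline unless /s is in effect.
my $dot_plain = "a\nb" =~ /^a.b$/  ? 1 : 0;
my $dot_s     = "a\nb" =~ /^a.b$/s ? 1 : 0;

# A backslash makes a metacharacter literal.
my $literal_dot = "axb" =~ /^a\.b$/ ? 1 : 0;

print "octal=$octal hex=$hex ctrl=$ctrl\n";
print "dot=$dot_plain dot/s=$dot_s escaped-dot=$literal_dot\n";
```
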
You can specify a series of alternatives for a pattern using "|" to
separate them, so that C<fee|fie|foe> will match any of "fee", "fie",
or "foe" in the target string (as would C<f(e|i|o)e>). The
first alternative includes everything from the last pattern delimiter
("(", "(?:", etc. or the beginning of the pattern) up to the first "|", and
the last alternative contains everything from the last "|" to the next
closing pattern delimiter. That's why it's common practice to include
alternatives in parentheses: to minimize confusion about where they
start and end.

Alternatives are tried from left to right, so the first
alternative for which the entire expression matches is the one that
is chosen. This means that alternatives are not necessarily greedy. For
example, when matching C<foo|foot> against "barefoot", only the "foo"
part will match, as that is the first alternative tried, and it successfully
matches the target string. (This might not seem important, but it is
important when you are capturing matched text using parentheses.)

Also remember that "|" is interpreted as a literal within square brackets,
so if you write C<[fee|fie|foe]> you're really only matching C<[feio|]>.

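The effect of alternative order on captured text can be seen with the
"barefoot" example from above. This is a minimal sketch; the variable
names are invented for the illustration.

```perl
use strict;
use warnings;

# Alternatives are tried left to right: "foo" wins over "foot".
my ($first)  = "barefoot" =~ /(foo|foot)/;
# Listing the longer alternative first changes what is captured.
my ($second) = "barefoot" =~ /(foot|foo)/;
print "$first $second\n";

# Inside a class "|" is literal, so this is just [feio|]:
# it matches one character, never the word "fee".
my $pipe = "|"   =~ /^[fee|fie|foe]$/ ? 1 : 0;
my $word = "fee" =~ /^[fee|fie|foe]$/ ? 1 : 0;
print "pipe=$pipe word=$word\n";
```

If the longest alternative is wanted, either list it first or anchor the
pattern so that the shorter alternative cannot succeed.
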
Within a pattern, you may designate subpatterns for later reference
by enclosing them in parentheses, and you may refer back to the
I<n>th subpattern later in the pattern using the metacharacter
\I<n> or \gI<n>. Subpatterns are numbered based on the left to right order
of their opening parenthesis. A backreference matches whatever
actually matched the subpattern in the string being examined, not
the rules for that subpattern. Therefore, C<(0|0x)\d*\s\g1\d*> will
match "0x1234 0x4321", but not "0x1234 01234", because subpattern
1 matched "0x", even though the rule C<0|0x> could potentially match
the leading 0 in the second number.

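The backreference example can be checked directly. The sketch below
assumes Perl 5.10 or later for the C<\g1> syntax; the variable names are
made up.

```perl
use strict;
use warnings;

my $re = qr/(0|0x)\d*\s\g1\d*/;

# Both numbers carry the same "0x" prefix, so \g1 can repeat it.
my $same_prefix = "0x1234 0x4321" =~ $re ? 1 : 0;
my $captured    = $1;    # text matched by group 1 in the match above

# Here group 1 must be "0x" for \s to be reached, but "01" follows.
my $mixed_prefix = "0x1234 01234" =~ $re ? 1 : 0;

print "same=$same_prefix captured=$captured mixed=$mixed_prefix\n";
```
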
=head2 Warning on \1 Instead of $1

Some people get too used to writing things like:

    $pattern =~ s/(\W)/\\\1/g;

This is grandfathered (for \1 to \9) for the RHS of a substitute to avoid
shocking the B<sed> addicts, but it's a dirty habit to get into. That's
because in PerlThink, the right-hand side of an C<s///> is a double-quoted
string. C<\1> in
the usual double-quoted string means a control-A. The customary Unix
meaning of C<\1> is kludged in for C<s///>. However, if you get into the habit
of doing that, you get yourself into trouble if you then add an C</e>
modifier.

    s/(\d+)/ \1 + 1 /eg;            # causes warning under -w

Or if you try to do

    s/(\d+)/\1000/;

You can't disambiguate that by saying C<\{1}000>, whereas you can fix it with
C<${1}000>. The operation of interpolation should not be confused
with the operation of matching a backreference. Certainly they mean two
different things on the I<left> side of the C<s///>.

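For comparison, here is a sketch of the same two substitutions written
with C<$1>. The input strings and variable names are invented for the
example.

```perl
use strict;
use warnings;

# ${1} delimits the variable name, so the zeros are literal text.
(my $thousands = "price 42") =~ s/(\d+)/${1}000/;

# With /e the replacement is Perl code, where $1 is an ordinary variable.
(my $bumped = "a1 b22") =~ s/(\d+)/$1 + 1/eg;

print "$thousands\n$bumped\n";
```
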
=head2 Repeated Patterns Matching a Zero-length Substring

B<WARNING>: Difficult material (and prose) ahead. This section needs a rewrite.

Regular expressions provide a terse and powerful programming language. As
with most other power tools, power comes together with the ability
to wreak havoc.

A common abuse of this power stems from the ability to make infinite
loops using regular expressions, with something as innocuous as:

    'foo' =~ m{ ( o? )* }x;

The C<o?> matches at the beginning of C<'foo'>, and since the position
in the string is not moved by the match, C<o?> would match again and again
because of the C<*> quantifier. Another common way to create a similar cycle
is with the looping modifier C<//g>:

    @matches = ( 'foo' =~ m{ o? }xg );

or

    print "match: <$&>\n" while 'foo' =~ m{ o? }xg;

or the loop implied by split().

However, long experience has shown that many programming tasks may
be significantly simplified by using repeated subexpressions that
may match zero-length substrings. Here are two simple examples:

    @chars = split //, $string;           # // is not magic in split
    ($whitewashed = $string) =~ s/()/ /g; # parens avoid magic s// /

Thus Perl allows such constructs, by I<forcefully breaking
the infinite loop>. The rules for this are different for lower-level
loops given by the greedy quantifiers C<*+{}>, and for higher-level
ones like the C</g> modifier or split() operator.

The lower-level loops are I<interrupted> (that is, the loop is
broken) when Perl detects that a repeated expression matched a
zero-length substring. Thus

    m{ (?: NON_ZERO_LENGTH | ZERO_LENGTH )* }x;

is made equivalent to

    m{ (?: NON_ZERO_LENGTH )* (?: ZERO_LENGTH )? }x;

For example, this program

    #!perl -l
    "aaaaab" =~ /
      (?:
         a                 # non-zero
         |                 # or
        (?{print "hello"}) # print hello whenever this
                           #    branch is tried
        (?=(b))            # zero-width assertion
      )*  # any number of times
     /x;
    print $&;
    print $1;

prints

    hello
    aaaaa
    b

Notice that "hello" is only printed once, as when Perl sees that the sixth
iteration of the outermost C<(?:)*> matches a zero-length string, it stops
the C<*>.

The higher-level loops preserve an additional state between iterations:
whether the last match was zero-length. To break the loop, the
match that follows a zero-length match is not allowed to have zero length.
This prohibition interacts with backtracking (see L<"Backtracking">),
and so the I<second best> match is chosen if the I<best> match is of
zero length.

For example:

    $_ = 'bar';
    s/\w??/<$&>/g;

results in C<< <><b><><a><><r><> >>. At each position of the string the best
match given by non-greedy C<??> is the zero-length match, and the I<second
best> match is what is matched by C<\w>. Thus zero-length matches
alternate with one-character-long matches.

Similarly, for repeated C<m/()/g> the second-best match is the match at the
position one notch further in the string.

The additional state of being I<matched with zero-length> is associated with
the matched string, and is reset by each assignment to pos().
Zero-length matches at the end of the previous match are ignored
during C<split>.

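The following sketch shows the higher-level rule at work on the C<'foo'>
patterns from the start of this section: the C<//g> loops terminate, and
pos() shows where each match ended. The variable names are made up for the
illustration.

```perl
use strict;
use warnings;

# List-context //g terminates although o? can match the empty string.
my @matches = ( 'foo' =~ m{ o? }xg );
print scalar(@matches), " matches: ",
      join(",", map { "<$_>" } @matches), "\n";

# Record where each match ends; pos() is the position that //g advances.
my $string = 'foo';
my @ends;
push @ends, pos($string) while $string =~ m{ o? }xg;
print "end positions: @ends\n";

# The substitution example from the text.
(my $marked = 'bar') =~ s/\w??/<$&>/g;
print "$marked\n";
```

The first match is the empty string before "f". A second empty match at
the same position is not allowed, so the engine moves on and matches each
"o" in turn, and finally the empty string at the end of the input.
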
=head2 Combining RE Pieces

Each of the elementary pieces of regular expressions which were described
before (such as C<ab> or C<\Z>) could match at most one substring
at the given position of the input string. However, in a typical regular
expression these elementary pieces are combined into more complicated
patterns using combining operators C<ST>, C<S|T>, C<S*> etc.
(in these examples C<S> and C<T> are regular subexpressions).

Such combinations can include alternatives, leading to a problem of choice:
if we match a regular expression C<a|ab> against C<"abc">, will it match
substring C<"a"> or C<"ab">? One way to describe which substring is
actually matched is the concept of backtracking (see L<"Backtracking">).
However, this description is too low-level and makes you think
in terms of a particular implementation.

Another description starts with notions of "better"/"worse". All the
substrings which may be matched by the given regular expression can be
sorted from the "best" match to the "worst" match, and it is the "best"
match which is chosen. This replaces the question of "what is chosen?"
with the question of "which matches are better, and which are worse?".

Again, for elementary pieces there is no such question, since at most
one match at a given position is possible. This section describes the
notion of better/worse for combining operators. In the description
below C<S> and C<T> are regular subexpressions.

=over 4

=item C<ST>

Consider two possible matches, C<AB> and C<A'B'>, where C<A> and C<A'> are
substrings which can be matched by C<S>, and C<B> and C<B'> are substrings
which can be matched by C<T>.

If C<A> is a better match for C<S> than C<A'>, C<AB> is a better
match than C<A'B'>.

If C<A> and C<A'> coincide: C<AB> is a better match than C<AB'> if
C<B> is a better match for C<T> than C<B'>.

=item C<S|T>

When C<S> can match, it is a better match than when only C<T> can match.

The ordering of two matches for C<S> is the same as for C<S> on its own.
Likewise for two matches for C<T>.

=item C<S{REPEAT_COUNT}>

Matches as C<SSS...S> (repeated as many times as necessary).

=item C<S{min,max}>

Matches as C<S{max}|S{max-1}|...|S{min+1}|S{min}>.

=item C<S{min,max}?>

Matches as C<S{min}|S{min+1}|...|S{max-1}|S{max}>.

=item C<S?>, C<S*>, C<S+>

Same as C<S{0,1}>, C<S{0,BIG_NUMBER}>, C<S{1,BIG_NUMBER}> respectively.

=item C<S??>, C<S*?>, C<S+?>

Same as C<S{0,1}?>, C<S{0,BIG_NUMBER}?>, C<S{1,BIG_NUMBER}?> respectively.

=item C<< (?>S) >>

Matches the best match for C<S> and only that.

=item C<(?=S)>, C<(?<=S)>

Only the best match for C<S> is considered. (This is important only if
C<S> has capturing parentheses, and backreferences are used somewhere
else in the whole regular expression.)

=item C<(?!S)>, C<(?<!S)>

For this grouping operator there is no need to describe the ordering, since
only whether or not C<S> can match is important.

=item C<(??{ EXPR })>, C<(?PARNO)>

The ordering is the same as for the regular expression which is
the result of EXPR, or the pattern contained by capture group PARNO.

=item C<(?(condition)yes-pattern|no-pattern)>

Recall that which of C<yes-pattern> or C<no-pattern> actually matches is
already determined. The ordering of the matches is the same as for the
chosen subexpression.

=back

The above recipes describe the ordering of matches I<at a given position>.
One more rule is needed to understand how a match is determined for the
whole regular expression: a match at an earlier position is always better
than a match at a later position.

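A few of the ordering rules above can be observed directly by capturing
what was matched. This is only a sketch with made-up variable names; it
covers C<S|T>, C<S{min,max}>, C<S{min,max}?>, C<ST>, and the
earlier-position rule.

```perl
use strict;
use warnings;

# S|T: a match through S is better than one through T.
my ($alt_a)  = "abc" =~ /(a|ab)/;
my ($alt_ab) = "abc" =~ /(ab|a)/;

# S{min,max} prefers more repetitions; S{min,max}? prefers fewer.
my ($greedy) = "aaaa" =~ /(a{1,3})/;
my ($lazy)   = "aaaa" =~ /(a{1,3}?)/;

# ST: the better match for S is chosen first, T takes what is left.
my ($s, $t) = "aaab" =~ /(a*)(a*b)/;

# An earlier position beats a preferred alternative that starts later:
# "ab" only matches at position 1, but "a" matches at position 0.
my ($early) = "aab" =~ /(ab|a)/;

print "$alt_a $alt_ab $greedy $lazy $s $t $early\n";
```
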
=head2 Creating Custom RE Engines

As of Perl 5.10.0, one can create custom regular expression engines. This
is not for the faint of heart, as they have to plug in at the C level. See
L<perlreapi> for more details.

As an alternative, overloaded constants (see L<overload>) provide a simple
way to extend the functionality of the RE engine, by substituting one
pattern for another.

Suppose that we want to enable a new RE escape-sequence C<\Y|> which
matches at a boundary between whitespace characters and non-whitespace
characters. Note that C<(?=\S)(?<!\S)|(?!\S)(?<=\S)> matches exactly
at these positions, so we want to have each C<\Y|> in the place of the
more complicated version. We can create a module C<customre> to do
this:

    package customre;
    use overload;

    sub import {
      shift;
      die "No argument to customre::import allowed" if @_;
      overload::constant 'qr' => \&convert;
    }

    sub invalid { die "/$_[0]/: invalid escape '\\$_[1]'"}

    # We must also take care of not escaping the legitimate \\Y|
    # sequence, hence the presence of '\\' in the conversion rules.
    my %rules = ( '\\' => '\\\\',
                  'Y|' => qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/ );
    sub convert {
      my $re = shift;
      $re =~ s{
                \\ ( \\ | Y . )
              }
              { $rules{$1} or invalid($re,$1) }sgex;
      return $re;
    }

Now C<use customre> enables the new escape in constant regular
expressions, i.e., those without any runtime variable interpolations.
As documented in L<overload>, this conversion will work only over
literal parts of regular expressions. For C<\Y|$re\Y|> the variable
part of this regular expression needs to be converted explicitly
(but only if the special meaning of C<\Y|> should be enabled inside $re):

    use customre;
    $re = <>;
    chomp $re;
    $re = customre::convert $re;
    /\Y|$re\Y|/;

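To see which positions the expansion of C<\Y|> selects, the pattern can be
applied on its own, without the C<customre> module. The sketch below is an
illustration only; the sample text and the variable names are made up.

```perl
use strict;
use warnings;

# The pattern that \Y| stands for in the text above.
my $boundary = qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/;

# Collect every position in "ab cd" where the pattern matches.
my $text = "ab cd";
my @positions;
push @positions, pos($text) while $text =~ /$boundary/g;
print "@positions\n";
```

The matches fall at the start and end of each run of non-whitespace
characters: before "a", after "b", before "c", and after "d".
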
=head2 PCRE/Python Support

As of Perl 5.10.0, Perl supports several Python/PCRE-specific extensions
to the regex syntax. While Perl programmers are encouraged to use the
Perl-specific syntax, the following are also accepted:

=over 4

=item C<< (?PE<lt>NAMEE<gt>pattern) >>

Define a named capture group. Equivalent to C<< (?<NAME>pattern) >>.

=item C<< (?P=NAME) >>

Backreference to a named capture group. Equivalent to C<< \g{NAME} >>.

=item C<< (?P>NAME) >>

Subroutine call to a named capture group. Equivalent to C<< (?&NAME) >>.

=back

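The three Python/PCRE-style constructs can be exercised as follows. This
is a minimal sketch that assumes Perl 5.10 or later; the group names
C<word> and C<paren> and the sample strings are invented for the example.

```perl
use strict;
use warnings;

# (?P<NAME>...) defines a group; (?P=NAME) refers back to its text.
my $doubled = "hello hello" =~ /^(?P<word>\w+)\s(?P=word)$/ ? 1 : 0;
my $word    = $+{word};

# (?P>NAME) re-enters the named group, here to match nested parentheses.
my $balanced = qr/^(?P<paren>\((?:[^()]++|(?P>paren))*\))$/;
my $nested   = "(a(b)c)" =~ $balanced ? 1 : 0;
my $broken   = "(a(b c)" =~ $balanced ? 1 : 0;

print "doubled=$doubled word=$word nested=$nested broken=$broken\n";
```
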
=head1 BUGS

Many regular expression constructs don't work on EBCDIC platforms.

There are a number of issues with regard to case-insensitive matching
in Unicode rules. See C<i> under L</Modifiers> above.

This document varies from difficult to understand to completely
and utterly opaque. The wandering prose riddled with jargon is
hard to fathom in several places.

This document needs a rewrite that separates the tutorial content
from the reference content.

=head1 SEE ALSO

L<perlrequick>.

L<perlretut>.

L<perlop/"Regexp Quote-Like Operators">.

L<perlop/"Gory details of parsing quoted constructs">.

L<perlfaq6>.

L<perlfunc/pos>.

L<perllocale>.

L<perlebcdic>.

I<Mastering Regular Expressions> by Jeffrey Friedl, published
by O'Reilly and Associates.