=head1 NAME
X<regular expression> X<regex> X<regexp>

perlre - Perl regular expressions

=head1 DESCRIPTION

This page describes the syntax of regular expressions in Perl.

If you haven't used regular expressions before, a quick-start
introduction is available in L<perlrequick>, and a longer tutorial
introduction is available in L<perlretut>.

For reference on how regular expressions are used in matching
operations, plus various examples of the same, see discussions of
C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/"Regexp Quote-Like
Operators">.

New in v5.22, L<C<use re 'strict'>|re/'strict' mode> applies stricter
rules than otherwise when compiling regular expression patterns. It can
find things that, while legal, may not be what you intended.

=head2 Modifiers

Matching operations can have various modifiers. Modifiers
that relate to the interpretation of the regular expression inside
are listed below. Modifiers that alter the way a regular expression
is used by Perl are detailed in L<perlop/"Regexp Quote-Like Operators"> and
L<perlop/"Gory details of parsing quoted constructs">.

=over 4

=item m
X</m> X<regex, multiline> X<regexp, multiline> X<regular expression, multiline>

Treat string as multiple lines. That is, change "^" and "$" from matching
the start of the string's first line and the end of its last line to
matching the start and end of each line within the string.

=item s
X</s> X<regex, single-line> X<regexp, single-line>
X<regular expression, single-line>

Treat string as single line. That is, change "." to match any character
whatsoever, even a newline, which normally it would not match.

Used together, as C</ms>, they let the "." match any character whatsoever,
while still allowing "^" and "$" to match, respectively, just after
and just before newlines within the string.

=item i
X</i> X<regex, case-insensitive> X<regexp, case-insensitive>
X<regular expression, case-insensitive>

Do case-insensitive pattern matching.

If locale matching rules are in effect, the case map is taken from the
current locale for code points less than 256, and from Unicode rules for
larger code points. However, matches that would cross the Unicode
rules/non-Unicode rules boundary (ords 255/256) will not succeed. See
L<perllocale>.

There are a number of Unicode characters that match multiple characters
under C</i>. For example, C<LATIN SMALL LIGATURE FI>
should match the sequence C<fi>. Perl is not
currently able to do this when the multiple characters are in the pattern and
are split between groupings, or when one or more are quantified. Thus

    "\N{LATIN SMALL LIGATURE FI}" =~ /fi/i;          # Matches
    "\N{LATIN SMALL LIGATURE FI}" =~ /[fi][fi]/i;    # Doesn't match!
    "\N{LATIN SMALL LIGATURE FI}" =~ /fi*/i;         # Doesn't match!

    # The below doesn't match, and it isn't clear what $1 and $2 would
    # be even if it did!!
    "\N{LATIN SMALL LIGATURE FI}" =~ /(f)(i)/i;      # Doesn't match!

Perl doesn't match multiple characters in a bracketed
character class unless the character that maps to them is explicitly
mentioned, and it doesn't match them at all if the character class is
inverted, which otherwise could be highly confusing. See
L<perlrecharclass/Bracketed Character Classes>, and
L<perlrecharclass/Negation>.

=item x
X</x>

Extend your pattern's legibility by permitting whitespace and comments.
Details in L</"/x">.

=item p
X</p> X<regex, preserve> X<regexp, preserve>

Preserve the string matched such that C<${^PREMATCH}>, C<${^MATCH}>, and
C<${^POSTMATCH}> are available for use after matching.

In Perl 5.20 and higher this is ignored. Due to a new copy-on-write
mechanism, C<${^PREMATCH}>, C<${^MATCH}>, and C<${^POSTMATCH}> will be
available after the match regardless of the modifier.

=item a, d, l and u
X</a> X</d> X</l> X</u>

These modifiers, all new in 5.14, affect which character-set rules
(Unicode, etc.) are used, as described below in
L</Character set modifiers>.

=item n
X</n> X<regex, non-capture> X<regexp, non-capture>
X<regular expression, non-capture>

Prevent the grouping metacharacters C<()> from capturing. This modifier,
new in 5.22, will stop C<$1>, C<$2>, etc. from being filled in.

    "hello" =~ /(hi|hello)/;   # $1 is "hello"
    "hello" =~ /(hi|hello)/n;  # $1 is undef

This is equivalent to putting C<?:> at the beginning of every capturing group:

    "hello" =~ /(?:hi|hello)/; # $1 is undef

C</n> can be negated on a per-group basis. Alternatively, named captures
may still be used.

    "hello" =~ /(?-n:(hi|hello))/n;   # $1 is "hello"
    "hello" =~ /(?<greet>hi|hello)/n; # $1 is "hello", $+{greet} is
                                      # "hello"

=item Other Modifiers

There are a number of flags that can be found at the end of regular
expression constructs that are I<not> generic regular expression flags, but
apply to the operation being performed, like matching or substitution (C<m//>
or C<s///> respectively).

Flags described further in
L<perlretut/"Using regular expressions in Perl"> are:

    c  - keep the current position during repeated matching
    g  - globally match the pattern repeatedly in the string

Substitution-specific modifiers described in
L<perlop/"s/PATTERN/REPLACEMENT/msixpodualngcer"> are:

    e  - evaluate the right-hand side as an expression
    ee - evaluate the right side as a string then eval the result
    o  - pretend to optimize your code, but actually introduce bugs
    r  - perform non-destructive substitution and return the new value

=back

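As a minimal sketch of the operation-level flags listed under "Other
Modifiers", the following combines C</g>, C</e> and C</r>. The sample
string and variable names are illustrative only, and C</r> assumes Perl
5.14 or later.

```perl
use strict;
use warnings;

my $text = "3 apples and 4 pears";

# /g in list context returns every captured match.
my @numbers = $text =~ /(\d+)/g;             # (3, 4)

# /e evaluates the replacement as code; /r returns the changed copy
# and leaves $text itself untouched.
my $doubled = $text =~ s/(\d+)/$1 * 2/ger;   # "6 apples and 8 pears"
```

Using C</r> here keeps the original string available for later matches.
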
Regular expression modifiers are usually written in documentation
as e.g., "the C</x> modifier", even though the delimiter
in question might not really be a slash. The modifiers C</imsxadlup>
may also be embedded within the regular expression itself using
the C<(?...)> construct, see L</Extended Patterns> below.

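A short sketch of embedding a modifier in the pattern itself, using
C</i> as the example. The strings are made up for illustration.

```perl
use strict;
use warnings;

# (?i) turns on case-insensitivity for the rest of the enclosing group,
# which here is the whole pattern.
my $whole  = "HELLO world" =~ /(?i)hello WORLD/   ? 1 : 0;   # 1

# (?i:...) limits the modifier to the parenthesized part only.
my $part   = "HELLO world" =~ /(?i:hello) world/  ? 1 : 0;   # 1
my $strict = "HELLO WORLD" =~ /(?i:hello) world/  ? 1 : 0;   # 0
```
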
=head3 /x

C</x> tells
the regular expression parser to ignore most whitespace that is neither
backslashed nor within a bracketed character class. You can use this to
break up your regular expression into (slightly) more readable parts.
Also, the C<#> character is treated as a metacharacter introducing a
comment that runs up to the pattern's closing delimiter, or to the end
of the current line if the pattern extends onto the next line. Hence,
this is very much like an ordinary Perl code comment. (You can include
the closing delimiter within the comment only if you precede it with a
backslash, so be careful!)

Use of C</x> means that if you want real
whitespace or C<#> characters in the pattern (outside a bracketed character
class, which is unaffected by C</x>), then you'll either have to
escape them (using backslashes or C<\Q...\E>) or encode them using octal,
hex, or C<\N{}> escapes.
It is ineffective to try to continue a comment onto the next line by
escaping the C<\n> with a backslash or C<\Q>.

You can use L</(?#text)> to create a comment that ends earlier than the
end of the current line, but C<text> also can't contain the closing
delimiter unless escaped with a backslash.

Taken together, these features go a long way towards
making Perl's regular expressions more readable. Here's an example:

    # Delete (most) C comments.
    $program =~ s {
        /\*     # Match the opening delimiter.
        .*?     # Match a minimal number of characters.
        \*/     # Match the closing delimiter.
    } []gsx;

Note that anything inside
a C<\Q...\E> stays unaffected by C</x>. And note that C</x> doesn't affect
space interpretation within a single multi-character construct. For
example in C<\x{...}>, regardless of the C</x> modifier, there can be no
spaces. Same for a L<quantifier|/Quantifiers> such as C<{3}> or
C<{5,}>. Similarly, C<(?:...)> can't have a space between the C<(>,
C<?>, and C<:>. Within any delimiters for such a
construct, allowed spaces are not affected by C</x>, and depend on the
construct. For example, C<\x{...}> can't have spaces because hexadecimal
numbers don't have spaces in them. But, Unicode properties can have spaces, so
in C<\p{...}> there can be spaces that follow the Unicode rules, for which see
L<perluniprops/Properties accessible through \p{} and \P{}>.
X</x>

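The following sketch shows two of the ways described above to get literal
whitespace and a literal C<#> into a C</x> pattern: a bracketed character
class and a backslash. The input line and the variable names are
hypothetical.

```perl
use strict;
use warnings;

my $line = "key = value # trailing note";

my ($key, $value, $note) = $line =~ m{
    \A (\w+)     # the key
    [ ] = [ ]    # literal spaces, written as a bracketed class
    (\w+)        # the value
    \ \# \       # escaped space, escaped hash sign, escaped space
    (.*) \z      # the rest of the line
}x;
```

Note that the comments avoid braces, since an unescaped closing delimiter
inside a comment would end the pattern early.
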
The characters that are deemed whitespace are those that Unicode
calls "Pattern White Space", namely:

    U+0009 CHARACTER TABULATION
    U+000A LINE FEED
    U+000B LINE TABULATION
    U+000C FORM FEED
    U+000D CARRIAGE RETURN
    U+0020 SPACE
    U+0085 NEXT LINE
    U+200E LEFT-TO-RIGHT MARK
    U+200F RIGHT-TO-LEFT MARK
    U+2028 LINE SEPARATOR
    U+2029 PARAGRAPH SEPARATOR

=head3 Character set modifiers

C</d>, C</u>, C</a>, and C</l>, available starting in 5.14, are called
the character set modifiers; they affect the character set rules
used for the regular expression.

The C</d>, C</u>, and C</l> modifiers are not likely to be of much use
to you, and so you need not worry about them very much. They exist for
Perl's internal use, so that complex regular expression data structures
can be automatically serialized and later exactly reconstituted,
including all their nuances. But, since Perl can't keep a secret, and
there may be rare instances where they are useful, they are documented
here.

The C</a> modifier, on the other hand, may be useful. Its purpose is to
allow code that is to work mostly on ASCII data to not have to concern
itself with Unicode.

Briefly, C</l> sets the character set to that of whatever B<L>ocale is in
effect at the time of the execution of the pattern match.

C</u> sets the character set to B<U>nicode.

C</a> also sets the character set to Unicode, BUT adds several
restrictions for B<A>SCII-safe matching.

C</d> is the old, problematic, pre-5.14 B<D>efault character set
behavior. Its only use is to force that old behavior.

At any given time, exactly one of these modifiers is in effect. Their
existence allows Perl to keep the originally compiled behavior of a
regular expression, regardless of what rules are in effect when it is
actually executed. And if it is interpolated into a larger regex, the
original's rules continue to apply to it, and only it.

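As an illustration of a compiled pattern keeping its own rules when
interpolated, here is a small sketch assuming Perl 5.14 or later. The
variable names are made up for the example.

```perl
use strict;
use warnings;

my $bengali_four = "\x{09EA}";       # BENGALI DIGIT FOUR

my $ascii_digit = qr/\d/a;           # compiled with /a
my $any_digit   = qr/\d/u;           # compiled with /u

# The outer pattern is /u, but the interpolated piece stays ASCII-only.
my $inner_a = $bengali_four =~ /\A${ascii_digit}\z/u ? 1 : 0;   # 0

# The outer pattern is /a, but the interpolated piece stays Unicode.
my $inner_u = $bengali_four =~ /\A${any_digit}\z/a   ? 1 : 0;   # 1
```
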
The C</l> and C</u> modifiers are automatically selected for
regular expressions compiled within the scope of various pragmas,
and we recommend that in general, you use those pragmas instead of
specifying these modifiers explicitly. For one thing, the modifiers
affect only pattern matching, and do not extend to even any replacement
done, whereas using the pragmas gives consistent results for all
appropriate operations within their scopes. For example,

    s/foo/\Ubar/il

will match "foo" using the locale's rules for case-insensitive matching,
but the C</l> does not affect how the C<\U> operates. Most likely you
want both of them to use locale rules. To do this, instead compile the
regular expression within the scope of C<use locale>. This both
implicitly adds the C</l>, and applies locale rules to the C<\U>. The
lesson is to C<use locale>, and not C</l> explicitly.

Similarly, it would be better to use C<use feature 'unicode_strings'>
instead of,

    s/foo/\Lbar/iu

to get Unicode rules, as the C<\L> in the former (but not necessarily
the latter) would also use Unicode rules.

More detail on each of the modifiers follows. Most likely you don't
need to know this detail for C</l>, C</u>, and C</d>, and can skip ahead
to L<E<sol>a|/E<sol>a (and E<sol>aa)>.

=head4 /l

means to use the current locale's rules (see L<perllocale>) when pattern
matching. For example, C<\w> will match the "word" characters of that
locale, and C<"/i"> case-insensitive matching will match according to
the locale's case folding rules. The locale used will be the one in
effect at the time of execution of the pattern match. This may not be
the same as the compilation-time locale, and can differ from one match
to another if there is an intervening call of the
L<setlocale() function|perllocale/The setlocale function>.

The only non-single-byte locale Perl supports is (starting in v5.20)
UTF-8. This means that code points above 255 are treated as Unicode no
matter what locale is in effect (since UTF-8 implies Unicode).

Under Unicode rules, there are a few case-insensitive matches that cross
the 255/256 boundary. Except for UTF-8 locales in Perls v5.20 and
later, these are disallowed under C</l>. For example, 0xFF (on ASCII
platforms) does not caselessly match the character at 0x178, C<LATIN
CAPITAL LETTER Y WITH DIAERESIS>, because 0xFF may not be C<LATIN SMALL
LETTER Y WITH DIAERESIS> in the current locale, and Perl has no way of
knowing if that character even exists in the locale, much less what code
point it is.

In a UTF-8 locale in v5.20 and later, the only visible difference
between locale and non-locale in regular expressions should be tainting
(see L<perlsec>).

This modifier may be specified to be the default by C<use locale>, but
see L</Which character set modifier is in effect?>.
X</l>

=head4 /u

means to use Unicode rules when pattern matching. On ASCII platforms,
this means that the code points between 128 and 255 take on their
Latin-1 (ISO-8859-1) meanings (which are the same as Unicode's).
(Otherwise Perl considers their meanings to be undefined.) Thus,
under this modifier, the ASCII platform effectively becomes a Unicode
platform; and hence, for example, C<\w> will match any of the more than
100_000 word characters in Unicode.

Unlike most locales, which are specific to a language and country pair,
Unicode classifies all the characters that are letters I<somewhere> in
the world as
C<\w>. For example, your locale might not think that C<LATIN SMALL
LETTER ETH> is a letter (unless you happen to speak Icelandic), but
Unicode does. Similarly, all the characters that are decimal digits
somewhere in the world will match C<\d>; this is hundreds, not 10,
possible matches. And some of those digits look like some of the 10
ASCII digits, but mean a different number, so a human could easily think
a number is a different quantity than it really is. For example,
C<BENGALI DIGIT FOUR> (U+09EA) looks very much like an
C<ASCII DIGIT EIGHT> (U+0038). And C<\d+> may match strings of digits
that are a mixture from different writing systems, creating a security
issue. L<Unicode::UCD/num()> can be used to sort
this out. Or the C</a> modifier can be used to force C<\d> to match
just the ASCII 0 through 9.

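The digit issue described above can be sketched as follows, using the
core C<Unicode::UCD> module's C<num()> function. The sample strings are
invented for the example.

```perl
use strict;
use warnings;
use Unicode::UCD qw(num);

my $bengali = "\x{09EA}\x{09E8}";    # BENGALI DIGIT FOUR, BENGALI DIGIT TWO
my $mixed   = "4\x{09E8}";           # ASCII digit followed by a Bengali one

# Under /u both strings consist entirely of \d characters;
# under /a only the ASCII digits count.
my $u_bengali = $bengali =~ /\A\d+\z/u ? 1 : 0;   # 1
my $u_mixed   = $mixed   =~ /\A\d+\z/u ? 1 : 0;   # 1
my $a_bengali = $bengali =~ /\A\d+\z/a ? 1 : 0;   # 0

# num() gives the value of a single-script run of digits, and undef
# when digits from different scripts are mixed.
my $value       = num($bengali);     # 42
my $mixed_value = num($mixed);       # undef
```
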
Also, under this modifier, case-insensitive matching works on the full
set of Unicode
characters. The C<KELVIN SIGN>, for example, matches the letters "k" and
"K"; and C<LATIN SMALL LIGATURE FF> matches the sequence "ff", which,
if you're not prepared, might make it look like a hexadecimal constant,
presenting another potential security issue. See
L<http://unicode.org/reports/tr36> for a detailed discussion of Unicode
security issues.

This modifier may be specified to be the default by C<use feature
'unicode_strings'>, C<use locale ':not_characters'>, or
C<L<use 5.012|perlfunc/use VERSION>> (or higher),
but see L</Which character set modifier is in effect?>.
X</u>

=head4 /d

This modifier means to use the "Default" native rules of the platform
except when there is cause to use Unicode rules instead, as follows:

=over 4

=item 1

the target string is encoded in UTF-8; or

=item 2

the pattern is encoded in UTF-8; or

=item 3

the pattern explicitly mentions a code point that is above 255 (say by
C<\x{100}>); or

=item 4

the pattern uses a Unicode name (C<\N{...}>); or

=item 5

the pattern uses a Unicode property (C<\p{...}> or C<\P{...}>); or

=item 6

the pattern uses a Unicode break (C<\b{...}> or C<\B{...}>); or

=item 7

the pattern uses L</C<(?[ ])>>

=back

Another mnemonic for this modifier is "Depends", as the rules actually
used depend on various things, and as a result you can get unexpected
results. See L<perlunicode/The "Unicode Bug">. The Unicode Bug has
become rather infamous, leading to yet another (printable) name for this
modifier, "Dodgy".

Unless the pattern or string is encoded in UTF-8, only ASCII characters
can match positively.

Here are some examples of how that works on an ASCII platform:

    $str =  "\xDF";      # $str is not in UTF-8 format.
    $str =~ /^\w/;       # No match, as $str isn't in UTF-8 format.
    $str .= "\x{0e0b}";  # Now $str is in UTF-8 format.
    $str =~ /^\w/;       # Match! $str is now in UTF-8 format.
    chop $str;
    $str =~ /^\w/;       # Still a match! $str remains in UTF-8 format.

This modifier is automatically selected by default when none of the
others are, so yet another name for it is "Default".

Because of the unexpected behaviors associated with this modifier, you
probably should only use it to maintain weird backward compatibilities.

=head4 /a (and /aa)

This modifier stands for ASCII-restrict (or ASCII-safe). This modifier,
unlike the others, may be doubled-up to increase its effect.

When it appears singly, it causes the sequences C<\d>, C<\s>, C<\w>, and
the Posix character classes to match only in the ASCII range. They thus
revert to their pre-5.6, pre-Unicode meanings. Under C</a>, C<\d>
always means precisely the digits C<"0"> to C<"9">; C<\s> means the five
characters C<[ \f\n\r\t]>, and starting in Perl v5.18, the vertical tab;
C<\w> means the 63 characters
C<[A-Za-z0-9_]>; and likewise, all the Posix classes such as
C<[[:print:]]> match only the appropriate ASCII-range characters.

This modifier is useful for people who only incidentally use Unicode,
and who do not wish to be burdened with its complexities and security
concerns.

With C</a>, one can write C<\d> with confidence that it will only match
ASCII characters, and should the need arise to match beyond ASCII, you
can instead use C<\p{Digit}> (or C<\p{Word}> for C<\w>). There are
similar C<\p{...}> constructs that can match beyond ASCII both white
space (see L<perlrecharclass/Whitespace>), and Posix classes (see
L<perlrecharclass/POSIX Character Classes>). Thus, this modifier
doesn't mean you can't use Unicode, it means that to get Unicode
matching you must explicitly use a construct (C<\p{}>, C<\P{}>) that
signals Unicode.

As you would expect, this modifier causes, for example, C<\D> to mean
the same thing as C<[^0-9]>; in fact, all non-ASCII characters match
C<\D>, C<\S>, and C<\W>. C<\b> still means to match at the boundary
between C<\w> and C<\W>, using the C</a> definitions of them (similarly
for C<\B>).

Otherwise, C</a> behaves like the C</u> modifier, in that
case-insensitive matching uses Unicode rules; for example, "k" will
match the Unicode C<\N{KELVIN SIGN}> under C</i> matching, and code
points in the Latin1 range, above ASCII, will have Unicode rules when it
comes to case-insensitive matching.

To forbid ASCII/non-ASCII matches (like "k" with C<\N{KELVIN SIGN}>),
specify the "a" twice, for example C</aai> or C</aia>. (The first
occurrence of "a" restricts the C<\d>, etc., and the second occurrence
adds the C</i> restrictions.) But, note that code points outside the
ASCII range will use Unicode rules for C</i> matching, so the modifier
doesn't really restrict things to just ASCII; it just forbids the
intermixing of ASCII and non-ASCII.

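The difference between C</a> and C</aa> under C</i> can be sketched with
the C<KELVIN SIGN> case mentioned above. The Greek sigma pair is an
additional illustration chosen here, not taken from the text.

```perl
use strict;
use warnings;

my $kelvin = "\x{212A}";             # KELVIN SIGN

# A single /a restricts \d, \s, \w but leaves /i folding alone.
my $single = $kelvin =~ /\Ak\z/ai  ? 1 : 0;   # 1

# Doubling it forbids an ASCII "k" from matching the non-ASCII sign.
my $double = $kelvin =~ /\Ak\z/aai ? 1 : 0;   # 0

# Folding between two non-ASCII characters is still allowed under /aa:
# GREEK CAPITAL LETTER SIGMA against GREEK SMALL LETTER SIGMA.
my $greek  = "\x{03A3}" =~ /\A\x{03C3}\z/aai ? 1 : 0;   # 1
```
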
To summarize, this modifier provides protection for applications that
don't wish to be exposed to all of Unicode. Specifying it twice
gives added protection.

This modifier may be specified to be the default by C<use re '/a'>
or C<use re '/aa'>. If you do so, you may actually have occasion to use
the C</u> modifier explicitly if there are a few regular expressions
where you do want full Unicode rules (but even here, it's best if
everything were under feature C<"unicode_strings">, along with the
C<use re '/aa'>). Also see L</Which character set modifier is in
effect?>.
X</a>
X</aa>

=head4 Which character set modifier is in effect?

Which of these modifiers is in effect at any given point in a regular
expression depends on a fairly complex set of interactions. These have
been designed so that in general you don't have to worry about it, but
this section gives the gory details. As
explained below in L</Extended Patterns> it is possible to explicitly
specify modifiers that apply only to portions of a regular expression.
The innermost always has priority over any outer ones, and one applying
to the whole expression has priority over any of the default settings that are
described in the remainder of this section.

The C<L<use re 'E<sol>foo'|re/"'/flags' mode">> pragma can be used to set
default modifiers (including these) for regular expressions compiled
within its scope. This pragma has precedence over the other pragmas
listed below that also change the defaults.

Otherwise, C<L<use locale|perllocale>> sets the default modifier to C</l>;
and C<L<use feature 'unicode_strings'|feature>>, or
C<L<use 5.012|perlfunc/use VERSION>> (or higher) set the default to
C</u> when not in the same scope as either C<L<use locale|perllocale>>
or C<L<use bytes|bytes>>.
(C<L<use locale ':not_characters'|perllocale/Unicode and UTF-8>> also
sets the default to C</u>, overriding any plain C<use locale>.)
Unlike the mechanisms mentioned above, these
affect operations besides regular expression pattern matching, and so
give more consistent results with other operators, including using
C<\U>, C<\l>, etc. in substitution replacements.

If none of the above apply, for backwards compatibility reasons, the
C</d> modifier is the one in effect by default. As this can lead to
unexpected results, it is best to specify which other rule set should be
used.

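A minimal sketch of the precedence described in this section, assuming
Perl 5.14 or later: a pragma sets the default for its scope, and a
modifier written on the pattern itself outranks that default. The test
character is an arbitrary choice.

```perl
use strict;
use warnings;

my $arabic_one = "\x{0661}";         # ARABIC-INDIC DIGIT ONE

my ($pragma, $override);
{
    use re '/a';                     # default becomes /a in this block
    $pragma   = $arabic_one =~ /\A\d\z/  ? 1 : 0;      # 0

    # An explicit modifier on the pattern outranks the pragma default.
    $override = $arabic_one =~ /\A\d\z/u ? 1 : 0;      # 1
}

# Outside the block the pragma no longer applies.
my $outside = $arabic_one =~ /\A\d\z/ ? 1 : 0;         # 1
```
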
=head4 Character set modifier behavior prior to Perl 5.14

Prior to 5.14, there were no explicit modifiers, but C</l> was implied
for regexes compiled within the scope of C<use locale>, and C</d> was
implied otherwise. However, interpolating a regex into a larger regex
would ignore the original compilation in favor of whatever was in effect
at the time of the second compilation. There were a number of
inconsistencies (bugs) with the C</d> modifier, where Unicode rules
would be used when inappropriate, and vice versa. C<\p{}> did not imply
Unicode rules, and neither did all occurrences of C<\N{}>, until 5.12.

=head2 Regular Expressions

=head3 Metacharacters

The patterns used in Perl pattern matching evolved from those supplied in
the Version 8 regex routines. (The routines are derived
(distantly) from Henry Spencer's freely redistributable reimplementation
of the V8 routines.) See L</Version 8 Regular Expressions> for
details.

In particular the following metacharacters have their standard I<egrep>-ish
meanings:
X<metacharacter>
X<\> X<^> X<.> X<$> X<|> X<(> X<()> X<[> X<[]>

    \        Quote the next metacharacter
    ^        Match the beginning of the string (or line, if /m is used)
    .        Match any character (except newline)
    $        Match the end of the string (or before newline at the end
             of the string)
    |        Alternation
    ()       Grouping
    []       Bracketed Character class

By default, the "^" character is guaranteed to match only the
beginning of the string, the "$" character only the end (or before the
newline at the end), and Perl does certain optimizations with the
assumption that the string contains only one line. Embedded newlines
will not be matched by "^" or "$". You may, however, wish to treat a
string as a multi-line buffer, such that the "^" will match after any
newline within the string (except if the newline is the last character in
the string), and "$" will match before any newline. At the
cost of a little more overhead, you can do this by using the C</m> modifier
on the pattern match operator. (Older programs did this by setting C<$*>,
but this option was removed in perl 5.10.)
X<^> X<$> X</m>

To simplify multi-line substitutions, the "." character never matches a
newline unless you use the C</s> modifier, which in effect tells Perl to pretend
the string is a single line--even if it isn't.
X<.> X</s>

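The default anchoring rules, and the effect of C</m> and C</s> on them,
can be sketched like this. The strings are illustrative only.

```perl
use strict;
use warnings;

my $one_line = "alpha\n";
my $two_line = "alpha\nbeta\n";

# "$" matches at the very end, or just before a final newline.
my $trailing = $one_line =~ /alpha$/     ? 1 : 0;   # 1

# It does not match before an embedded newline unless /m is given.
my $embedded = $two_line =~ /alpha$/     ? 1 : 0;   # 0
my $with_m   = $two_line =~ /alpha$/m    ? 1 : 0;   # 1

# With /m, "^" also matches just after each embedded newline.
my @starts   = $two_line =~ /^(\w+)/mg;             # ("alpha", "beta")

# "." refuses to cross the newline unless /s is given.
my $dot      = $two_line =~ /alpha.beta/  ? 1 : 0;  # 0
my $dot_s    = $two_line =~ /alpha.beta/s ? 1 : 0;  # 1
```
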
=head3 Quantifiers

The following standard quantifiers are recognized:
X<metacharacter> X<quantifier> X<*> X<+> X<?> X<{n}> X<{n,}> X<{n,m}>

    *           Match 0 or more times
    +           Match 1 or more times
    ?           Match 1 or 0 times
    {n}         Match exactly n times
    {n,}        Match at least n times
    {n,m}       Match at least n but not more than m times

(If a curly bracket occurs in any other context and does not form part of
a backslashed sequence like C<\x{...}>, it is treated as a regular
character. However, a deprecation warning is raised for all such
occurrences, and in Perl v5.26, literal uses of a curly bracket will be
required to be escaped, say by preceding them with a backslash (C<"\{">)
or enclosing them within square brackets (C<"[{]">). This change will
allow for future syntax extensions (like making the lower bound of a
quantifier optional), and better error checking of quantifiers.)

The "*" quantifier is equivalent to C<{0,}>, the "+"
quantifier to C<{1,}>, and the "?" quantifier to C<{0,1}>. n and m are limited
to non-negative integral values less than a preset limit defined when perl is built.
This is usually 32766 on the most common platforms. The actual limit can
be seen in the error message generated by code such as this:

    $_ **= $_ , / {$_} / for 2 .. 42;

By default, a quantified subpattern is "greedy", that is, it will match as
many times as possible (given a particular starting location) while still
allowing the rest of the pattern to match. If you want it to match the
minimum number of times possible, follow the quantifier with a "?". Note
that the meanings don't change, just the "greediness":
X<metacharacter> X<greedy> X<greediness>
X<?> X<*?> X<+?> X<??> X<{n}?> X<{n,}?> X<{n,m}?>

    *?        Match 0 or more times, not greedily
    +?        Match 1 or more times, not greedily
    ??        Match 0 or 1 time, not greedily
    {n}?      Match exactly n times, not greedily (redundant)
    {n,}?     Match at least n times, not greedily
    {n,m}?    Match at least n but not more than m times, not greedily

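A small sketch contrasting greedy and non-greedy matching from the same
starting location. The input strings are invented for the example.

```perl
use strict;
use warnings;

my $html = "<b>bold</b> and <i>italic</i>";

# Greedy: ".+" runs to the last ">" that still lets the pattern match.
my ($greedy) = $html =~ /<(.+)>/;    # "b>bold</b> and <i>italic</i"

# Non-greedy: ".+?" stops at the first ">" it can.
my ($lazy)   = $html =~ /<(.+?)>/;   # "b"

# The same applies to counted quantifiers.
my ($max) = "aaaa" =~ /(a{2,3})/;    # "aaa"
my ($min) = "aaaa" =~ /(a{2,3}?)/;   # "aa"
```
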
Normally when a quantified subpattern does not allow the rest of the
overall pattern to match, Perl will backtrack. However, this behaviour is
sometimes undesirable. Thus Perl provides the "possessive" quantifier form
as well.

    *+     Match 0 or more times and give nothing back
    ++     Match 1 or more times and give nothing back
    ?+     Match 0 or 1 time and give nothing back
    {n}+   Match exactly n times and give nothing back (redundant)
    {n,}+  Match at least n times and give nothing back
    {n,m}+ Match at least n but not more than m times and give nothing back

For instance,

    'aaaa' =~ /a++a/

will never match, as the C<a++> will gobble up all the C<a>'s in the
string and won't leave any for the remaining part of the pattern. This
feature can be extremely useful to give perl hints about where it
shouldn't backtrack. For instance, the typical "match a double-quoted
string" problem can be most efficiently performed when written as:

    /"(?:[^"\\]++|\\.)*+"/

as we know that if the final quote does not match, backtracking will not
help. See the independent subexpression
L</C<< (?>pattern) >>> for more details;
possessive quantifiers are just syntactic sugar for that construct. For
instance the above example could also be written as follows:

    /"(?>(?:(?>[^"\\]+)|\\.)*)"/

Note that the possessive quantifier modifier cannot be combined
with the non-greedy modifier. This is because it would make no sense.
Consider the following equivalency table:

    Illegal         Legal
    ------------    ------
    X??+            X{0}
    X+?+            X{1}
    X{min,max}?+    X{min}

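The possessive forms can be tried out with a sketch like the following.
The C<$quoted> pattern is the double-quoted-string pattern from the text;
the sample inputs are hypothetical.

```perl
use strict;
use warnings;

# a++ keeps every "a", so nothing is left for the final "a".
my $possessive = 'aaaa' =~ /a++a/ ? 1 : 0;   # 0
my $greedy     = 'aaaa' =~ /a+a/  ? 1 : 0;   # 1

my $quoted = qr/"(?:[^"\\]++|\\.)*+"/;

# Backslash-escaped quotes inside the string are consumed by "\\.".
my ($string) = 'say "a \"quoted\" word" now' =~ /($quoted)/;

# With no closing quote the match fails without retrying shorter runs.
my $unterminated = 'say "no closing quote' =~ $quoted ? 1 : 0;   # 0
```
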
=head3 Escape sequences

Because patterns are processed as double-quoted strings, the following
also work:

 \t          tab                   (HT, TAB)
 \n          newline               (LF, NL)
 \r          return                (CR)
 \f          form feed             (FF)
 \a          alarm (bell)          (BEL)
 \e          escape (think troff)  (ESC)
 \cK         control char          (example: VT)
 \x{}, \x00  character whose ordinal is the given hexadecimal number
 \N{name}    named Unicode character or character sequence
 \N{U+263D}  Unicode character     (example: FIRST QUARTER MOON)
 \o{}, \000  character whose ordinal is the given octal number
 \l          lowercase next char (think vi)
 \u          uppercase next char (think vi)
 \L          lowercase until \E (think vi)
 \U          uppercase until \E (think vi)
 \Q          quote (disable) pattern metacharacters until \E
 \E          end either case modification or quoted section, think vi

Details are in L<perlop/Quote and Quote-like Operators>.

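A few of these escapes can be checked directly. This is only a sketch:
the hash name C<%ok> and the sample strings are assumptions chosen for
illustration, and C<\o{}> assumes Perl 5.14 or later.

```perl
use strict;
use warnings;

# Each entry records whether one escape matched the character it denotes.
my %ok = (
    tab      => ("a\tb"     =~ /^a\tb$/       ? 1 : 0),  # \t inside the pattern
    hex      => ("A"        =~ /^\x41$/       ? 1 : 0),  # two-digit hexadecimal
    hexbrace => ("A"        =~ /^\x{41}$/     ? 1 : 0),  # braced hexadecimal
    octbrace => ("A"        =~ /^\o{101}$/    ? 1 : 0),  # braced octal
    control  => ("\x0B"     =~ /^\cK$/        ? 1 : 0),  # \cK is VT, ordinal 11
    moon     => ("\x{263D}" =~ /^\N{U+263D}$/ ? 1 : 0),  # code point by number
    upper    => ("HELLO"    =~ /^\Uhello\E$/  ? 1 : 0),  # \U...\E uppercases the text
);

print "$_: $ok{$_}\n" for sort keys %ok;
```

Every entry should report 1, since each pattern describes exactly the
character or text on its left.
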
=head3 Character Classes and other Special Escapes

In addition, Perl defines the following:
X<\g> X<\k> X<\K> X<backreference>

  Sequence  Note Description
  [...]     [1]  Match a character according to the rules of the
                   bracketed character class defined by the "...".
                   Example: [a-z] matches "a" or "b" or "c" ... or "z"
  [[:...:]] [2]  Match a character according to the rules of the POSIX
                   character class "..." within the outer bracketed
                   character class. Example: [[:upper:]] matches any
                   uppercase character.
  (?[...])  [8]  Extended bracketed character class
  \w        [3]  Match a "word" character (alphanumeric plus "_", plus
                   other connector punctuation chars plus Unicode
                   marks)
  \W        [3]  Match a non-"word" character
  \s        [3]  Match a whitespace character
  \S        [3]  Match a non-whitespace character
  \d        [3]  Match a decimal digit character
  \D        [3]  Match a non-digit character
  \pP       [3]  Match P, named property. Use \p{Prop} for longer names
  \PP       [3]  Match non-P
  \X        [4]  Match Unicode "eXtended grapheme cluster"
  \1        [5]  Backreference to a specific capture group or buffer.
                   '1' may actually be any positive integer.
  \g1       [5]  Backreference to a specific or previous group,
  \g{-1}    [5]  The number may be negative indicating a relative
                   previous group and may optionally be wrapped in
                   curly brackets for safer parsing.
  \g{name}  [5]  Named backreference
  \k<name>  [5]  Named backreference
  \K        [6]  Keep the stuff left of the \K, don't include it in $&
  \N        [7]  Any character but \n. Not affected by /s modifier
  \v        [3]  Vertical whitespace
  \V        [3]  Not vertical whitespace
  \h        [3]  Horizontal whitespace
  \H        [3]  Not horizontal whitespace
  \R        [4]  Linebreak

=over 4

=item [1]

See L<perlrecharclass/Bracketed Character Classes> for details.

=item [2]

See L<perlrecharclass/POSIX Character Classes> for details.

=item [3]

See L<perlrecharclass/Backslash sequences> for details.

=item [4]

See L<perlrebackslash/Misc> for details.

=item [5]

See L</Capture groups> below for details.

=item [6]

See L</Extended Patterns> below for details.

=item [7]

Note that C<\N> has two meanings. When of the form C<\N{NAME}>, it matches the
character or character sequence whose name is C<NAME>; and similarly
when of the form C<\N{U+I<hex>}>, it matches the character whose Unicode
code point is I<hex>. Otherwise it matches any character but C<\n>.

=item [8]

See L<perlrecharclass/Extended Bracketed Character Classes> for details.

=back

=head3 Assertions

Perl defines the following zero-width assertions:
X<zero-width assertion> X<assertion> X<regex, zero-width assertion>
X<regexp, zero-width assertion>
X<regular expression, zero-width assertion>
X<\b> X<\B> X<\A> X<\Z> X<\z> X<\G>

 \b{}   Match at Unicode boundary of specified type
 \B{}   Match where corresponding \b{} doesn't match
 \b     Match a word boundary
 \B     Match except at a word boundary
 \A     Match only at beginning of string
 \Z     Match only at end of string, or before newline at the end
 \z     Match only at end of string
 \G     Match only at pos() (e.g. at the end-of-match position
        of prior m//g)

A Unicode boundary (C<\b{}>), available starting in v5.22, is a spot
between two characters, or before the first character in the string, or
after the final character in the string where certain criteria defined
by Unicode are met. See L<perlrebackslash/\b{}, \b, \B{}, \B> for
details.

A word boundary (C<\b>) is a spot between two characters
that has a C<\w> on one side of it and a C<\W> on the other side
of it (in either order), counting the imaginary characters off the
beginning and end of the string as matching a C<\W>. (Within
character classes C<\b> represents backspace rather than a word
boundary, just as it normally does in any double-quoted string.)
The C<\A> and C<\Z> are just like "^" and "$", except that they
won't match multiple times when the C</m> modifier is used, while
"^" and "$" will match at every internal line boundary. To match
the actual end of the string and not ignore an optional trailing
newline, use C<\z>.
X<\b> X<\A> X<\Z> X<\z> X</m>

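The differences between these anchors are easiest to see on a string
with a trailing newline. The sample text and variable names below are
assumptions made for this sketch:

```perl
use strict;
use warnings;

my $text = "line one\nline two\n";

my $end_Z = $text =~ /two\Z/ ? 1 : 0;   # \Z tolerates the trailing newline
my $end_z = $text =~ /two\z/ ? 1 : 0;   # \z demands the very end of the string

my @caret  = $text =~ /^(\w+)/mg;       # ^ matches at each line start under /m
my @anchor = $text =~ /\A(\w+)/mg;      # \A matches only at the string start

my @cats = "cat concat cat." =~ /\bcat\b/g;   # whole words only

print "Z=$end_Z z=$end_z caret=", scalar(@caret),
      " anchor=", scalar(@anchor), " cats=", scalar(@cats), "\n";
```

Here C<\Z> matches and C<\z> does not, C<^> finds two line starts where
C<\A> finds one, and C<\b> rejects the C<cat> embedded in C<concat>.
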
The C<\G> assertion can be used to chain global matches (using
C<m//g>), as described in L<perlop/"Regexp Quote-Like Operators">.
It is also useful when writing C<lex>-like scanners, when you have
several patterns that you want to match against consecutive substrings
of your string; see the previous reference. The actual location
where C<\G> will match can also be influenced by using C<pos()> as
an lvalue: see L<perlfunc/pos>. Note that the rule for zero-length
matches (see L</"Repeated Patterns Matching a Zero-length Substring">)
is modified somewhat, in that contents to the left of C<\G> are
not counted when determining the length of the match. Thus the following
will not match forever:
X<\G>

    my $string = 'ABC';
    pos($string) = 1;
    while ($string =~ /(.\G)/g) {
        print $1;
    }

It will print 'A' and then terminate, as it considers the match to
be zero-width, and thus will not match at the same position twice in a
row.

It is worth noting that C<\G> improperly used can result in an infinite
loop. Take care when using patterns that include C<\G> in an alternation.

Note also that C<s///> will refuse to overwrite part of a substitution
that has already been replaced; so for example this will stop after the
first iteration, rather than iterating its way backwards through the
string:

    $_ = "123456789";
    pos = 6;
    s/.(?=.\G)/X/g;
    print;      # prints 1234X6789, not XXXXX6789

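The C<lex>-like use of C<\G> mentioned above can be sketched as follows.
The token names and the input string are assumptions for illustration;
the C</c> modifier (described in L<perlop>) keeps C<pos()> in place when
an alternative fails, so the next pattern is tried at the same offset.

```perl
use strict;
use warnings;

my $src = 'x = 42 + y';
my @tokens;

pos($src) = 0;
while (pos($src) < length $src) {
    # Each pattern is anchored with \G at the current position.
    if    ($src =~ /\G\s+/gc)            { next }
    elsif ($src =~ /\G(\d+)/gc)          { push @tokens, "NUM:$1" }
    elsif ($src =~ /\G([A-Za-z_]\w*)/gc) { push @tokens, "ID:$1" }
    elsif ($src =~ /\G([=+])/gc)         { push @tokens, "OP:$1" }
    else  { die "unexpected input at offset " . pos($src) }
}

print join(' ', @tokens), "\n";
```

Because every branch either advances C<pos()> or dies, this loop cannot
spin forever, which is the main risk when C<\G> is used carelessly.
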
=head3 Capture groups

The bracketing construct C<( ... )> creates capture groups (also referred to as
capture buffers). To refer to the current contents of a group later on, within
the same pattern, use C<\g1> (or C<\g{1}>) for the first, C<\g2> (or C<\g{2}>)
for the second, and so on.
This is called a I<backreference>.
X<regex, capture buffer> X<regexp, capture buffer>
X<regex, capture group> X<regexp, capture group>
X<regular expression, capture buffer> X<backreference>
X<regular expression, capture group> X<backreference>
X<\g{1}> X<\g{-1}> X<\g{name}> X<relative backreference> X<named backreference>
X<named capture buffer> X<regular expression, named capture buffer>
X<named capture group> X<regular expression, named capture group>
X<%+> X<$+{name}> X<< \k<name> >>
There is no limit to the number of captured substrings that you may use.
Groups are numbered with the leftmost open parenthesis being number 1, etc. If
a group did not match, the associated backreference won't match either. (This
can happen if the group is optional, or in a different branch of an
alternation.)
You can omit the C<"g">, and write C<"\1">, etc, but there are some issues with
this form, described below.

You can also refer to capture groups relatively, by using a negative number, so
that C<\g-1> and C<\g{-1}> both refer to the immediately preceding capture
group, and C<\g-2> and C<\g{-2}> both refer to the group before it. For
example:

        /
         (Y)            # group 1
         (              # group 2
            (X)         # group 3
            \g{-1}      # backref to group 3
            \g{-3}      # backref to group 1
         )
        /x

would match the same as C</(Y) ( (X) \g3 \g1 )/x>. This allows you to
interpolate regexes into larger regexes and not have to worry about the
capture groups being renumbered.

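The renumbering problem can be illustrated with two small C<qr//>
fragments interpolated into a larger pattern. The variable names and the
test string are assumptions made for this sketch:

```perl
use strict;
use warnings;

# Both fragments mean "a word character followed by itself" when used alone.
my $relative = qr/(\w)\g{-1}/;
my $absolute = qr/(\w)\g{1}/;

my $str = 'xyzz';

# Inside the larger pattern the fragment's group becomes group 3.
# \g{-1} still refers to it, so the doubled 'z' is found.
my $rel_ok = $str =~ /(x)(y)$relative/ ? 1 : 0;

# \g{1} now refers to (x), so the pattern would need "xyzx" to match.
my $abs_ok = $str =~ /(x)(y)$absolute/ ? 1 : 0;

print "relative=$rel_ok absolute=$abs_ok\n";
```

The relative form keeps working after interpolation; the absolute form
silently changes meaning.
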
You can dispense with numbers altogether and create named capture groups.
The notation is C<(?E<lt>I<name>E<gt>...)> to declare and C<\g{I<name>}> to
reference. (To be compatible with .NET regular expressions, C<\g{I<name>}> may
also be written as C<\k{I<name>}>, C<\kE<lt>I<name>E<gt>> or C<\k'I<name>'>.)
I<name> must not begin with a number, nor contain hyphens.
When different groups within the same pattern have the same name, any reference
to that name assumes the leftmost defined group. Named groups count in
absolute and relative numbering, and so can also be referred to by those
numbers.
(It's possible to do things with named capture groups that would otherwise
require C<(??{})>.)

Capture group contents are dynamically scoped and available to you outside the
pattern until the end of the enclosing block or until the next successful
match, whichever comes first. (See L<perlsyn/"Compound Statements">.)
You can refer to them by absolute number (using C<"$1"> instead of C<"\g1">,
etc); or by name via the C<%+> hash, using C<"$+{I<name>}">.

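The dynamic scoping described above can be observed with a nested block.
This is a sketch only; the group name C<word> and the variable names are
assumptions:

```perl
use strict;
use warnings;

my ($outer_before, $inner, $inner_named, $outer_after);

if ('outer' =~ /(?<word>\w+)/) {
    $outer_before = $1;
    {
        # A successful match in a nested block sets its own $1 and %+ ...
        'inner' =~ /(?<word>\w+)/;
        $inner       = $1;
        $inner_named = $+{word};
    }
    # ... and the enclosing block's values are visible again afterwards.
    $outer_after = $1;
}

print "$outer_before $inner $inner_named $outer_after\n";
```

Once the inner block ends, C<$1> again reflects the match made in the
enclosing block.
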
Braces are required in referring to named capture groups, but are optional for
absolute or relative numbered ones. Braces are safer when creating a regex by
concatenating smaller strings. For example if you have C<qr/$a$b/>, and C<$a>
contained C<"\g1">, and C<$b> contained C<"37">, you would get C</\g137/> which
is probably not what you intended.

The C<\g> and C<\k> notations were introduced in Perl 5.10.0. Prior to that
there were no named nor relative numbered capture groups. Absolute numbered
groups were referred to using C<\1>,
C<\2>, etc., and this notation is still
accepted (and likely always will be). But it leads to some ambiguities if
there are more than 9 capture groups, as C<\10> could mean either the tenth
capture group, or the character whose ordinal in octal is 010 (a backspace in
ASCII). Perl resolves this ambiguity by interpreting C<\10> as a backreference
only if at least 10 left parentheses have opened before it. Likewise C<\11> is
a backreference only if at least 11 left parentheses have opened before it.
And so on. C<\1> through C<\9> are always interpreted as backreferences.
There are several examples below that illustrate these perils. You can avoid
the ambiguity by always using C<\g{}> or C<\g> if you mean capturing groups;
and for octal constants always using C<\o{}>, or for C<\077> and below, using 3
digits padded with leading zeros, since a leading zero implies an octal
constant.

The C<\I<digit>> notation also works in certain circumstances outside
the pattern. See L</Warning on \1 Instead of $1> below for details.

Examples:

    s/^([^ ]*) *([^ ]*)/$2 $1/;     # swap first two words

    /(.)\g1/                        # find first doubled char
         and print "'$1' is the first doubled character\n";

    /(?<char>.)\k<char>/            # ... a different way
         and print "'$+{char}' is the first doubled character\n";

    /(?'char'.)\g1/                 # ... mix and match
         and print "'$1' is the first doubled character\n";

    if (/Time: (..):(..):(..)/) {   # parse out values
        $hours = $1;
        $minutes = $2;
        $seconds = $3;
    }

    /(.)(.)(.)(.)(.)(.)(.)(.)(.)\g10/   # \g10 is a backreference
    /(.)(.)(.)(.)(.)(.)(.)(.)(.)\10/    # \10 is octal
    /((.)(.)(.)(.)(.)(.)(.)(.)(.))\10/  # \10 is a backreference
    /((.)(.)(.)(.)(.)(.)(.)(.)(.))\010/ # \010 is octal

    $a = '(.)\1';        # Creates problems when concatenated.
    $b = '(.)\g{1}';     # Avoids the problems.
    "aa" =~ /${a}/;      # True
    "aa" =~ /${b}/;      # True
    "aa0" =~ /${a}0/;    # False!
    "aa0" =~ /${b}0/;    # True
    "aa\x08" =~ /${a}0/; # True!
    "aa\x08" =~ /${b}0/; # False

Several special variables also refer back to portions of the previous
match. C<$+> returns whatever the last bracket match matched.
C<$&> returns the entire matched string. (At one point C<$0> did
also, but now it returns the name of the program.) C<$`> returns
everything before the matched string. C<$'> returns everything
after the matched string. And C<$^N> contains whatever was matched by
the most-recently closed group (submatch). C<$^N> can be used in
extended patterns (see below), for example to assign a submatch to a
variable.
X<$+> X<$^N> X<$&> X<$`> X<$'>

These special variables, like the C<%+> hash and the numbered match variables
(C<$1>, C<$2>, C<$3>, etc.) are dynamically scoped
until the end of the enclosing block or until the next successful
match, whichever comes first. (See L<perlsyn/"Compound Statements">.)
X<$+> X<$^N> X<$&> X<$`> X<$'>
X<$1> X<$2> X<$3> X<$4> X<$5> X<$6> X<$7> X<$8> X<$9>

B<NOTE>: Failed matches in Perl do not reset the match variables,
which makes it easier to write code that tests for a series of more
specific cases and remembers the best match.

B<WARNING>: If your code is to run on Perl 5.16 or earlier,
beware that once Perl sees that you need one of C<$&>, C<$`>, or
C<$'> anywhere in the program, it has to provide them for every
pattern match. This may substantially slow your program.

Perl uses the same mechanism to produce C<$1>, C<$2>, etc, so you also
pay a price for each pattern that contains capturing parentheses.
(To avoid this cost while retaining the grouping behaviour, use the
extended regular expression C<(?: ... )> instead.) But if you never
use C<$&>, C<$`> or C<$'>, then patterns I<without> capturing
parentheses will not be penalized. So avoid C<$&>, C<$'>, and C<$`>
if you can, but if you can't (and some algorithms really appreciate
them), once you've used them once, use them at will, because you've
already paid the price.
X<$&> X<$`> X<$'>

Perl 5.16 introduced a slightly more efficient mechanism that notes
separately whether each of C<$`>, C<$&>, and C<$'> has been seen, and
thus may only need to copy part of the string. Perl 5.20 introduced a
much more efficient copy-on-write mechanism which eliminates any slowdown.

As another workaround for this problem, Perl 5.10.0 introduced C<${^PREMATCH}>,
C<${^MATCH}> and C<${^POSTMATCH}>, which are equivalent to C<$`>, C<$&>
and C<$'>, B<except> that they are only guaranteed to be defined after a
successful match that was executed with the C</p> (preserve) modifier.
Unlike their punctuation character equivalents, these variables incur no
global performance penalty; the trade-off is that you
have to tell perl when you want to use them. As of Perl 5.20, these three
variables are equivalent to C<$`>, C<$&> and C<$'>, and C</p> is ignored.
X</p> X<p modifier>

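A short sketch of the C</p> variables follows; the sample string and
variable names are assumptions. Including C</p> keeps the code correct
on Perl versions before 5.20 and is harmless afterwards.

```perl
use strict;
use warnings;

my $s = 'hello world';
my ($pre, $match, $post);

# /p asks perl to preserve the three pieces for this match only.
if ($s =~ /o w/p) {
    ($pre, $match, $post) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});
}

print "[$pre][$match][$post]\n";   # [hell][o w][orld]
```
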
=head2 Quoting metacharacters

Backslashed metacharacters in Perl are alphanumeric, such as C<\b>,
C<\w>, C<\n>. Unlike some other regular expression languages, there
are no backslashed symbols that aren't alphanumeric. So anything
that looks like \\, \(, \), \[, \], \{, or \} is always
interpreted as a literal character, not a metacharacter. This was
once used in a common idiom to disable or quote the special meanings
of regular expression metacharacters in a string that you want to
use for a pattern. Simply quote all non-"word" characters:

    $pattern =~ s/(\W)/\\$1/g;

(If C<use locale> is set, then this depends on the current locale.)
Today it is more common to use the quotemeta() function or the C<\Q>
metaquoting escape sequence to disable all metacharacters' special
meanings like this:

    /$unquoted\Q$quoted\E$unquoted/

Beware that if you put literal backslashes (those not inside
interpolated variables) between C<\Q> and C<\E>, double-quotish
backslash interpolation may lead to confusing results. If you
I<need> to use literal backslashes within C<\Q...\E>,
consult L<perlop/"Gory details of parsing quoted constructs">.

C<quotemeta()> and C<\Q> are fully described in L<perlfunc/quotemeta>.

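The effect of quoting can be seen by matching the same string with and
without C<\Q>. The strings and variable names here are assumptions for
illustration:

```perl
use strict;
use warnings;

my $literal = 'a.b*c';                 # contains two metacharacters
my $quoted  = quotemeta $literal;      # every non-word character is escaped

# With \Q...\E only the exact text matches.
my $exact_hit  = 'xa.b*cx' =~ /\Q$literal\E/ ? 1 : 0;
my $exact_miss = 'aXbbbc'  =~ /\Q$literal\E/ ? 1 : 0;

# Without quoting, '.' and '*' keep their special meanings.
my $loose_hit  = 'aXbbbc'  =~ /$literal/     ? 1 : 0;

print "$quoted $exact_hit $exact_miss $loose_hit\n";
```

The unquoted pattern matches C<aXbbbc> because C<.> matches C<X> and
C<b*> matches C<bbb>, which is rarely what is wanted for user-supplied
text.
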
=head2 Extended Patterns

Perl also defines a consistent extension syntax for features not
found in standard tools like B<awk> and
B<lex>. The syntax for most of these is a
pair of parentheses with a question mark as the first thing within
the parentheses. The character after the question mark indicates
the extension.

The stability of these extensions varies widely. Some have been
part of the core language for many years. Others are experimental
and may change without warning or be completely removed. Check
the documentation on an individual feature to verify its current
status.

A question mark was chosen for this and for the minimal-matching
construct because 1) question marks are rare in older regular
expressions, and 2) whenever you see one, you should stop and
"question" exactly what is going on. That's psychology....

=over 4

=item C<(?#text)>
X<(?#)>

A comment. The text is ignored.
Note that Perl closes
the comment as soon as it sees a C<)>, so there is no way to put a literal
C<)> in the comment. The pattern's closing delimiter must be escaped by
a backslash if it appears in the comment.

See L</E<sol>x> for another way to have comments in patterns.

=item C<(?adlupimsx-imsx)>

=item C<(?^alupimsx)>
X<(?)> X<(?^)>

One or more embedded pattern-match modifiers, to be turned on (or
turned off, if preceded by C<->) for the remainder of the pattern or
the remainder of the enclosing pattern group (if any).

This is particularly useful for dynamic patterns, such as those read in from a
configuration file, taken from an argument, or specified in a table
somewhere. Consider the case where some patterns want to be
case-sensitive and some do not: The case-insensitive ones merely need to
include C<(?i)> at the front of the pattern. For example:

    $pattern = "foobar";
    if ( /$pattern/i ) { }

    # more flexible:

    $pattern = "(?i)foobar";
    if ( /$pattern/ ) { }

These modifiers are restored at the end of the enclosing group. For example,

    ( (?i) blah ) \s+ \g1

will match C<blah> in any case, some spaces, and an exact (I<including the case>!)
repetition of the previous word, assuming the C</x> modifier, and no C</i>
modifier outside this group.

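Both points can be tried out in a few lines. In this sketch the list
C<@config> stands in for patterns read from a configuration file, and
all names and sample strings are assumptions:

```perl
use strict;
use warnings;

# Patterns as they might arrive from a configuration file.
my @config = ('(?i)foobar', 'foobar');
my @hits   = map { 'FooBar' =~ /$_/ ? 1 : 0 } @config;

# (?i) applies only inside the group; \g1 outside it is case-sensitive.
my $repeat     = qr/ ( (?i) blah ) \s+ \g1 /x;
my $same_case  = 'BLAH BLAH' =~ $repeat ? 1 : 0;
my $other_case = 'BLAH blah' =~ $repeat ? 1 : 0;

print "@hits $same_case $other_case\n";   # 1 0 1 0
```

Only the pattern carrying its own C<(?i)> matches C<FooBar>, and the
backreference insists on the same case as the text the group captured.
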
These modifiers do not carry over into named subpatterns called in the
enclosing group. In other words, a pattern such as C<((?i)(?&NAME))> does not
change the case-sensitivity of the "NAME" pattern.

Any of these modifiers can be set to apply globally to all regular
expressions compiled within the scope of a C<use re>. See
L<re/"'/flags' mode">.

Starting in Perl 5.14, a C<"^"> (caret or circumflex accent) immediately
after the C<"?"> is a shorthand equivalent to C<d-imsx>. Flags (except
C<"d">) may follow the caret to override it.
But a minus sign is not legal with it.

Note that the C<a>, C<d>, C<l>, C<p>, and C<u> modifiers are special in
that they can only be enabled, not disabled, and the C<a>, C<d>, C<l>, and
C<u> modifiers are mutually exclusive: specifying one de-specifies the
others, and a maximum of one (or two C<a>'s) may appear in the
construct. Thus, for
example, C<(?-p)> will warn when compiled under C<use warnings>;
C<(?-d:...)> and C<(?dl:...)> are fatal errors.

Note also that the C<p> modifier is special in that its presence
anywhere in a pattern has a global effect.

=item C<(?:pattern)>
X<(?:)>

=item C<(?adluimsx-imsx:pattern)>

=item C<(?^aluimsx:pattern)>
X<(?^:)>

This is for clustering, not capturing; it groups subexpressions like
"()", but doesn't make backreferences as "()" does. So

    @fields = split(/\b(?:a|b|c)\b/)

is like

    @fields = split(/\b(a|b|c)\b/)

but doesn't spit out extra fields. It's also cheaper not to capture
characters if you don't need to.

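The "extra fields" are the captured separators, as this sketch shows.
The input string is an assumption, and C<\s*> has been added around the
separator so that the fields come out without surrounding spaces:

```perl
use strict;
use warnings;

my $str = 'x a y b z';

# Clustering only: the separators are discarded.
my @plain = split /\s*\b(?:a|b)\b\s*/, $str;

# Capturing: split also returns whatever the group matched.
my @with_seps = split /\s*\b(a|b)\b\s*/, $str;

print scalar(@plain), ": @plain\n";          # 3: x y z
print scalar(@with_seps), ": @with_seps\n";  # 5: x a y b z
```
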
Any letters between C<?> and C<:> act as flags modifiers as with
C<(?adluimsx-imsx)>. For example,

    /(?s-i:more.*than).*million/i

is equivalent to the more verbose

    /(?:(?s-i)more.*than).*million/i

Starting in Perl 5.14, a C<"^"> (caret or circumflex accent) immediately
after the C<"?"> is a shorthand equivalent to C<d-imsx>. Any positive
flags (except C<"d">) may follow the caret, so

    (?^x:foo)

is equivalent to

    (?x-ims:foo)

The caret tells Perl that this cluster doesn't inherit the flags of any
surrounding pattern, but uses the system defaults (C<d-imsx>),
modified by any flags specified.

The caret allows for simpler stringification of compiled regular
expressions. These look like

    (?^:pattern)

with any non-default flags appearing between the caret and the colon.
A test that looks at such stringification thus doesn't need to have the
system default flags hard-coded in it, just the caret. If new flags are
added to Perl, the meaning of the caret's expansion will change to include
the default for those flags, so the test will still work, unchanged.

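The stringified form can be inspected directly. This sketch assumes
Perl 5.14 or later and a pattern with no Unicode-specific features;
other circumstances may add further flags after the caret.

```perl
use strict;
use warnings;

my $plain   = qr/foo/;
my $flagged = qr/foo/i;

my $plain_text   = "$plain";     # (?^:foo)
my $flagged_text = "$flagged";   # (?^i:foo)

# The flags travel with the object when it is interpolated elsewhere.
my $still_insensitive = 'FOO' =~ /\A$flagged\z/ ? 1 : 0;

print "$plain_text $flagged_text $still_insensitive\n";
```
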
Specifying a negative flag after the caret is an error, as the flag is
redundant.

Mnemonic for C<(?^...)>: A fresh beginning since the usual use of a caret is
to match at the beginning.

=item C<(?|pattern)>
X<(?|)> X<Branch reset>

This is the "branch reset" pattern, which has the special property
that the capture groups are numbered from the same starting point
in each alternation branch. It is available starting from perl 5.10.0.

Capture groups are numbered from left to right, but inside this
construct the numbering is restarted for each branch.

The numbering within each branch will be as normal, and any groups
following this construct will be numbered as though the construct
contained only one branch, that being the one with the most capture
groups in it.

This construct is useful when you want to capture one of a
number of alternative matches.

Consider the following pattern. The numbers underneath show in
which group the captured content will be stored.

    # before  ---------------branch-reset----------- after
    / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
    # 1            2         2  3        2     3     4

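In practice this means the code after the match need not know which
branch succeeded. The following sketch parses two hypothetical
C<key/value> notations; the input strings and variable names are
assumptions:

```perl
use strict;
use warnings;

my @found;
for my $input ('key=value', 'key: value') {
    # With branch reset, both branches fill $1 and $2.
    if ($input =~ /(?| (\w+) = (\w+) | (\w+) : \s* (\w+) )/x) {
        push @found, "$1=>$2";
    }
}

# With ordinary clustering, the second branch fills $3 and $4 instead.
'key: value' =~ /(?: (\w+) = (\w+) | (\w+) : \s* (\w+) )/x;
my ($g1, $g3) = ($1, $3);

print "@found\n";
print defined $g1 ? "g1=$g1" : "g1 undefined", ", g3=$g3\n";
```
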
Be careful when using the branch reset pattern in combination with
named captures. Named captures are implemented as being aliases to
numbered groups holding the captures, and that interferes with the
implementation of the branch reset pattern. If you are using named
captures in a branch reset pattern, it's best to use the same names,
in the same order, in each of the alternations:

   /(?|  (?<a> x ) (?<b> y )
      |  (?<a> z ) (?<b> w )) /x

Not doing so may lead to surprises:

  "12" =~ /(?| (?<a> \d+ ) | (?<b> \D+))/x;
  say $+ {a};   # Prints '12'
  say $+ {b};   # *Also* prints '12'.

The problem here is that both the group named C<< a >> and the group
named C<< b >> are aliases for the group belonging to C<< $1 >>.

=item Look-Around Assertions
X<look-around assertion> X<lookaround assertion> X<look-around> X<lookaround>

Look-around assertions are zero-width patterns which match a specific
pattern without including it in C<$&>. Positive assertions match when
their subpattern matches, negative assertions match when their subpattern
fails. Look-behind matches text up to the current match position,
look-ahead matches text following the current match position.

=over 4

=item C<(?=pattern)>
X<(?=)> X<look-ahead, positive> X<lookahead, positive>

A zero-width positive look-ahead assertion. For example, C</\w+(?=\t)/>
matches a word followed by a tab, without including the tab in C<$&>.

=item C<(?!pattern)>
X<(?!)> X<look-ahead, negative> X<lookahead, negative>

A zero-width negative look-ahead assertion. For example C</foo(?!bar)/>
matches any occurrence of "foo" that isn't followed by "bar". Note
however that look-ahead and look-behind are NOT the same thing. You cannot
use this for look-behind.

If you are looking for a "bar" that isn't preceded by a "foo", C</(?!foo)bar/>
will not do what you want. That's because the C<(?!foo)> is just saying that
the next thing cannot be "foo"--and it's not, it's a "bar", so "foobar" will
match. Use look-behind instead (see below).

=item C<(?<=pattern)> C<\K>
X<(?<=)> X<look-behind, positive> X<lookbehind, positive> X<\K>

A zero-width positive look-behind assertion. For example, C</(?<=\t)\w+/>
matches a word that follows a tab, without including the tab in C<$&>.
Works only for fixed-width look-behind.

There is a special form of this construct, called C<\K> (available since
Perl 5.10.0), which causes the
regex engine to "keep" everything it had matched prior to the C<\K> and
not include it in C<$&>. This effectively provides variable-length
look-behind. The use of C<\K> inside of another look-around assertion
is allowed, but the behaviour is currently not well defined.

For various reasons C<\K> may be significantly more efficient than the
equivalent C<< (?<=...) >> construct, and it is especially useful in
situations where you want to efficiently remove something following
something else in a string. For instance

  s/(foo)bar/$1/g;

can be rewritten as the much more efficient

  s/foo\Kbar//g;

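The two substitutions give the same result, and C<\K> also copes with a
variable-length prefix. This sketch uses made-up sample strings and
variable names, and makes no claim about relative speed:

```perl
use strict;
use warnings;

# Same edit, written with a capture and with \K.
(my $captured = 'foobar foobaz barfoo') =~ s/(foo)bar/$1/g;
(my $kept     = 'foobar foobaz barfoo') =~ s/foo\Kbar//g;

# The prefix before \K may vary in length; $& holds only what follows it.
my $number;
if ('total:   42 items' =~ /total:\s*\K\d+/) {
    $number = $&;
}

print "$captured | $kept | $number\n";
```
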
=item C<(?<!pattern)>
X<(?<!)> X<look-behind, negative> X<lookbehind, negative>

A zero-width negative look-behind assertion. For example C</(?<!bar)foo/>
matches any occurrence of "foo" that does not follow "bar". Works
only for fixed-width look-behind.

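The difference between negative look-ahead and negative look-behind,
using the "foo"/"bar" case discussed above, can be checked with a short
sketch (the subject strings and variable names are assumptions):

```perl
use strict;
use warnings;

my @subjects = ('foobar', 'bazbar');

# (?!foo) only inspects what comes next, which is "bar" in both strings.
my @ahead  = grep { /(?!foo)bar/  } @subjects;

# (?<!foo) inspects what precedes "bar", so "foobar" is rejected.
my @behind = grep { /(?<!foo)bar/ } @subjects;

# Positive look-ahead: the tab is required but not part of the match.
my ($word) = "key\tvalue" =~ /(\w+)(?=\t)/;

print "ahead: @ahead; behind: @behind; word: $word\n";
```
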
=back

=item C<(?'NAME'pattern)>

=item C<< (?<NAME>pattern) >>
X<< (?<NAME>) >> X<(?'NAME')> X<named capture> X<capture>

A named capture group. Identical in every respect to normal capturing
parentheses C<()> but for the additional fact that the group
can be referred to by name in various regular expression
constructs (like C<\g{NAME}>) and can be accessed by name
after a successful match via C<%+> or C<%->. See L<perlvar>
for more details on the C<%+> and C<%-> hashes.

If multiple distinct capture groups have the same name then the
$+{NAME} will refer to the leftmost defined group in the match.

The forms C<(?'NAME'pattern)> and C<< (?<NAME>pattern) >> are equivalent.

B<NOTE:> While the notation of this construct is the same as the similar
function in .NET regexes, the behavior is not. In Perl the groups are
numbered sequentially regardless of being named or not. Thus in the
pattern

  /(x)(?<foo>y)(z)/

$+{foo} will be the same as $2, and $3 will contain 'z' instead of
the opposite which is what a .NET regex hacker might expect.

Currently NAME is restricted to simple identifiers only.
In other words, it must match C</^[_A-Za-z][_A-Za-z0-9]*\z/> or
its Unicode extension (see L<utf8>),
though it isn't extended by the locale (see L<perllocale>).

B<NOTE:> In order to make things easier for programmers with experience
with the Python or PCRE regex engines, the pattern C<< (?PE<lt>NAMEE<gt>pattern) >>
may be used instead of C<< (?<NAME>pattern) >>; however this form does not
support the use of single quotes as a delimiter for the name.

=item C<< \k<NAME> >>

=item C<< \k'NAME' >>

Named backreference. Similar to numeric backreferences, except that
the group is designated by name and not number. If multiple groups
have the same name then it refers to the leftmost defined group in
the current match.

It is an error to refer to a name not defined by a C<< (?<NAME>) >>
earlier in the pattern.

Both forms are equivalent.

B<NOTE:> In order to make things easier for programmers with experience
with the Python or PCRE regex engines, the pattern C<< (?P=NAME) >>
may be used instead of C<< \k<NAME> >>.

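The equivalent spellings, and the sequential numbering of named groups,
can be confirmed with a short sketch. The group names C<pair> and C<foo>
and the sample strings are assumptions:

```perl
use strict;
use warnings;

# Three spellings of "a named group followed by a reference to it".
my $angle  = 'abab' =~ /(?<pair>ab)\k<pair>/   ? 1 : 0;
my $quote  = 'abab' =~ /(?'pair'ab)\k'pair'/   ? 1 : 0;
my $python = 'abab' =~ /(?P<pair>ab)(?P=pair)/ ? 1 : 0;

# Named groups are numbered along with the unnamed ones.
my ($named, $second, $third);
if ('xyz' =~ /(x)(?<foo>y)(z)/) {
    ($named, $second, $third) = ($+{foo}, $2, $3);
}

print "$angle $quote $python $named $second $third\n";   # 1 1 1 y y z
```
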
=item C<(?{ code })>
X<(?{})> X<regex, code in> X<regexp, code in> X<regular expression, code in>

B<WARNING>: Using this feature safely requires that you understand its
limitations. Code executed that has side effects may not perform identically
from version to version due to the effect of future optimisations in the regex
engine. For more information on this, see L</Embedded Code Execution
Frequency>.

This zero-width assertion executes any embedded Perl code. It always
succeeds, and its return value is set as C<$^R>.

In literal patterns, the code is parsed at the same time as the
surrounding code. While within the pattern, control is passed temporarily
back to the perl parser, until the logically-balancing closing brace is
encountered. This is similar to the way that an array index expression in
a literal string is handled, for example

    "abc$array[ 1 + f('[') + g()]def"

In particular, braces do not need to be balanced:

    s/abc(?{ f('{'); })/def/

Even in a pattern that is interpolated and compiled at run-time, literal
code blocks will be compiled once, at perl compile time; the following
prints "ABCD":

    print "D";
    my $qr = qr/(?{ BEGIN { print "A" } })/;
    my $foo = "foo";
    /$foo$qr(?{ BEGIN { print "B" } })/;
    BEGIN { print "C" }

In patterns where the text of the code is derived from run-time
information rather than appearing literally in a source code /pattern/,
the code is compiled at the same time that the pattern is compiled, and
for reasons of security, C<use re 'eval'> must be in scope. This is to
stop user-supplied patterns containing code snippets from being
executable.

In situations where you need to enable this with C<use re 'eval'>, you should
also have taint checking enabled. Better yet, use the carefully
constrained evaluation within a Safe compartment. See L<perlsec> for
details about both these mechanisms.

From the viewpoint of parsing, lexical variable scope and closures,

    /AAA(?{ BBB })CCC/
1398
1399behaves approximately like
1400
1401 /AAA/ && do { BBB } && /CCC/
1402
1403Similarly,
1404
1405 qr/AAA(?{ BBB })CCC/
1406
1407behaves approximately like
77ea4f6d 1408
e128ab2c
DM
1409 sub { /AAA/ && do { BBB } && /CCC/ }
1410
1411In particular:
1412
1413 { my $i = 1; $r = qr/(?{ print $i })/ }
1414 my $i = 2;
1415 /$r/; # prints "1"
1416
1417Inside a C<(?{...})> block, C<$_> refers to the string the regular
754091cb 1418expression is matching against. You can also use C<pos()> to know what is
fa11829f 1419the current position of matching within this string.
754091cb 1420
e128ab2c
DM
1421The code block introduces a new scope from the perspective of lexical
1422variable declarations, but B<not> from the perspective of C<local> and
1423similar localizing behaviours. So later code blocks within the same
1424pattern will still see the values which were localized in earlier blocks.
1425These accumulated localizations are undone either at the end of a
1426successful match, or if the assertion is backtracked (compare
1427L<"Backtracking">). For example,
b9ac3b5b
GS
1428
1429 $_ = 'a' x 8;
5d458dd8 1430 m<
d1fbf752 1431 (?{ $cnt = 0 }) # Initialize $cnt.
b9ac3b5b 1432 (
5d458dd8 1433 a
b9ac3b5b 1434 (?{
d1fbf752
KW
1435 local $cnt = $cnt + 1; # Update $cnt,
1436 # backtracking-safe.
b9ac3b5b 1437 })
5d458dd8 1438 )*
b9ac3b5b 1439 aaaa
d1fbf752
KW
1440 (?{ $res = $cnt }) # On success copy to
1441 # non-localized location.
b9ac3b5b
GS
1442 >x;
1443
e128ab2c
DM
1444will initially increment C<$cnt> up to 8; then during backtracking, its
1445value will be unwound back to 4, which is the value assigned to C<$res>.
1446At the end of the regex execution, $cnt will be wound back to its initial
1447value of 0.
1448
1449This assertion may be used as the condition in a
b9ac3b5b 1450
e128ab2c
DM
1451 (?(condition)yes-pattern|no-pattern)
1452
1453switch. If I<not> used in this way, the result of evaluation of C<code>
1454is put into the special variable C<$^R>. This happens immediately, so
1455C<$^R> can be used from other C<(?{ code })> assertions inside the same
1456regular expression.
b9ac3b5b 1457
19799a22
GS
1458The assignment to C<$^R> above is properly localized, so the old
1459value of C<$^R> is restored if the assertion is backtracked; compare
1460L<"Backtracking">.
b9ac3b5b 1461
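A small sketch of the behaviour just described: the first code block
returns a value, and a later block in the same pattern reads it back
through C<$^R>. The variable C<$seen> and the value 42 are arbitrary
choices for illustration.

```perl
use strict;
use warnings;

my $seen;
if ('ab' =~ /a(?{ 42 })b(?{ $seen = $^R })/) {
    # The second block ran after the first one had set $^R to 42.
    print "seen=$seen\n";
}
```
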
e128ab2c
DM
1462Note that the special variable C<$^N> is particularly useful with code
1463blocks to capture the results of submatches in variables without having to
1464keep track of the number of nested parentheses. For example:
1465
1466 $_ = "The brown fox jumps over the lazy dog";
1467 /the (\S+)(?{ $color = $^N }) (\S+)(?{ $animal = $^N })/i;
1468 print "color = $color, animal = $animal\n";
1469
8988a1bb 1470
14455d6c 1471=item C<(??{ code })>
d74e8afc
ITB
1472X<(??{})>
1473X<regex, postponed> X<regexp, postponed> X<regular expression, postponed>
0f5d15d6 1474
83f32aba
RS
1475B<WARNING>: Using this feature safely requires that you understand its
1476limitations. Code executed that has side effects may not perform
1477identically from version to version due to the effect of future
1478optimisations in the regex engine. For more information on this, see
1479L</Embedded Code Execution Frequency>.
0f5d15d6 1480
This is a "postponed" regular subexpression. It behaves in I<exactly> the
same way as a C<(?{ code })> code block as described above, except that
its return value, rather than being assigned to C<$^R>, is treated as a
pattern, compiled if it's a string (or used as-is if it's a C<qr//> object),
then matched as if it were inserted instead of this construct.
6bda09f9 1486
e128ab2c
DM
1487During the matching of this sub-pattern, it has its own set of
1488captures which are valid during the sub-match, but are discarded once
1489control returns to the main pattern. For example, the following matches,
1490with the inner pattern capturing "B" and matching "BB", while the outer
1491pattern captures "A";
1492
1493 my $inner = '(.)\1';
1494 "ABBA" =~ /^(.)(??{ $inner })\1/;
1495 print $1; # prints "A";
6bda09f9 1496
e128ab2c
DM
1497Note that this means that there is no way for the inner pattern to refer
1498to a capture group defined outside. (The code block itself can use C<$1>,
1499etc., to refer to the enclosing pattern's capture groups.) Thus, although
0f5d15d6 1500
e128ab2c
DM
1501 ('a' x 100)=~/(??{'(.)' x 100})/
1502
1503I<will> match, it will I<not> set $1 on exit.
19799a22
GS
1504
1505The following pattern matches a parenthesized group:
0f5d15d6 1506
d1fbf752
KW
1507 $re = qr{
1508 \(
1509 (?:
1510 (?> [^()]+ ) # Non-parens without backtracking
1511 |
1512 (??{ $re }) # Group with matching parens
1513 )*
1514 \)
1515 }x;
0f5d15d6 1516
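One way the pattern above might be exercised is sketched below. It
assumes C<$re> is a lexical that is declared before the C<qr{}> is
compiled, so that the code block can refer to it; the test strings and
the C<%ok> hash are illustrative only.

```perl
use strict;
use warnings;

my $re;    # declared first so that the code block below can see it
$re = qr{
    \(
    (?:
        (?> [^()]+ )
      |
        (??{ $re })
    )*
    \)
}x;

my %ok;
for my $text ('(a(b)c)', '(a(b c)') {
    $ok{$text} = $text =~ /\A$re\z/ ? 1 : 0;
    print "$text: ", ($ok{$text} ? "balanced" : "not balanced"), "\n";
}
```

The first string is balanced and matches; the second is missing a
closing parenthesis and does not.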
93f313ef
KW
1517See also
1518L<C<(?I<PARNO>)>|/(?PARNO) (?-PARNO) (?+PARNO) (?R) (?0)>
1519for a different, more efficient way to accomplish
6bda09f9
YO
1520the same task.
1521
e128ab2c
DM
1522Executing a postponed regular expression 50 times without consuming any
1523input string will result in a fatal error. The maximum depth is compiled
1524into perl, so changing it requires a custom build.
6bda09f9 1525
93f313ef 1526=item C<(?I<PARNO>)> C<(?-I<PARNO>)> C<(?+I<PARNO>)> C<(?R)> C<(?0)>
542fa716 1527X<(?PARNO)> X<(?1)> X<(?R)> X<(?0)> X<(?-1)> X<(?+1)> X<(?-PARNO)> X<(?+PARNO)>
6bda09f9 1528X<regex, recursive> X<regexp, recursive> X<regular expression, recursive>
d1b2014a
YO
1529X<regex, relative recursion> X<GOSUB> X<GOSTART>
1530
1531Recursive subpattern. Treat the contents of a given capture buffer in the
1532current pattern as an independent subpattern and attempt to match it at
1533the current position in the string. Information about capture state from
1534the caller for things like backreferences is available to the subpattern,
1535but capture buffers set by the subpattern are not visible to the caller.
6bda09f9 1536
Similar to C<(??{ code })> except that it does not involve executing any
code or potentially compiling a returned pattern string; instead it treats
the part of the current pattern contained within a specified capture group
as an independent pattern that must match at the current position. The
treatment of capture buffers also differs: unlike C<(??{ code })>,
recursive patterns have access to their caller's match state, so one can
use backreferences safely.
6bda09f9 1544
93f313ef 1545I<PARNO> is a sequence of digits (not starting with 0) whose value reflects
c27a5cfe 1546the paren-number of the capture group to recurse to. C<(?R)> recurses to
894be9b7 1547the beginning of the whole pattern. C<(?0)> is an alternate syntax for
93f313ef 1548C<(?R)>. If I<PARNO> is preceded by a plus or minus sign then it is assumed
c27a5cfe 1549to be relative, with negative numbers indicating preceding capture groups
542fa716 1550and positive ones following. Thus C<(?-1)> refers to the most recently
c27a5cfe 1551declared group, and C<(?+1)> indicates the next group to be declared.
c74340f9 1552Note that the counting for relative recursion differs from that of
c27a5cfe 1553relative backreferences, in that with recursion unclosed groups B<are>
c74340f9 1554included.
6bda09f9 1555
81714fb9 1556The following pattern matches a function foo() which may contain
f145b7e9 1557balanced parentheses as the argument.
6bda09f9 1558
d1fbf752 1559 $re = qr{ ( # paren group 1 (full function)
81714fb9 1560 foo
d1fbf752 1561 ( # paren group 2 (parens)
6bda09f9 1562 \(
d1fbf752 1563 ( # paren group 3 (contents of parens)
6bda09f9 1564 (?:
d1fbf752 1565 (?> [^()]+ ) # Non-parens without backtracking
6bda09f9 1566 |
d1fbf752 1567 (?2) # Recurse to start of paren group 2
6bda09f9
YO
1568 )*
1569 )
1570 \)
1571 )
1572 )
1573 }x;
1574
1575If the pattern was used as follows
1576
1577 'foo(bar(baz)+baz(bop))'=~/$re/
1578 and print "\$1 = $1\n",
1579 "\$2 = $2\n",
1580 "\$3 = $3\n";
1581
1582the output produced should be the following:
1583
1584 $1 = foo(bar(baz)+baz(bop))
1585 $2 = (bar(baz)+baz(bop))
81714fb9 1586 $3 = bar(baz)+baz(bop)
6bda09f9 1587
c27a5cfe 1588If there is no corresponding capture group defined, then it is a
61528107 1589fatal error. Recursing deeper than 50 times without consuming any input
81714fb9 1590string will also result in a fatal error. The maximum depth is compiled
6bda09f9
YO
1591into perl, so changing it requires a custom build.
1592
542fa716
YO
1593The following shows how using negative indexing can make it
1594easier to embed recursive patterns inside of a C<qr//> construct
1595for later use:
1596
1597 my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;
c77257ed 1598 if (/foo $parens \s+ \+ \s+ bar $parens/x) {
542fa716
YO
1599 # do something here...
1600 }
1601
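To make the effect of the relative reference concrete, here is a sketch
that runs the pattern above against a sample string (the string and the
C<@groups> array are illustrative). Each interpolated copy of C<$parens>
gets its own group number, and C<(?-1)> recurses into whichever copy it
sits in.

```perl
use strict;
use warnings;

my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;

my @groups;
if ('foo(a(b)) + bar(c)' =~ /foo $parens \s+ \+ \s+ bar $parens/x) {
    @groups = ($1, $2);
    print "first=$groups[0] second=$groups[1]\n";   # first=(a(b)) second=(c)
}
```

Had the pattern used the absolute form C<(?1)>, the second copy would
recurse into the first group rather than into its own.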
B<Note> that this pattern does not behave the same way as the equivalent
PCRE or Python construct of the same form. In Perl you can backtrack into
a recursed group; in PCRE and Python the recursed-into group is treated
as atomic. Also, modifiers are resolved at compile time, so constructs
like C<(?i:(?1))> or C<(?:(?i)(?1))> do not affect how the sub-pattern will
be processed.
6bda09f9 1608
894be9b7
YO
1609=item C<(?&NAME)>
1610X<(?&NAME)>
1611
93f313ef 1612Recurse to a named subpattern. Identical to C<(?I<PARNO>)> except that the
0d017f4d 1613parenthesis to recurse to is determined by name. If multiple parentheses have
894be9b7
YO
1614the same name, then it recurses to the leftmost.
1615
1616It is an error to refer to a name that is not declared somewhere in the
1617pattern.
1618
1f1031fe
YO
1619B<NOTE:> In order to make things easier for programmers with experience
1620with the Python or PCRE regex engines the pattern C<< (?P>NAME) >>
64c5a566 1621may be used instead of C<< (?&NAME) >>.
1f1031fe 1622
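As a sketch, the balanced-parentheses pattern can be written with a
named group instead of a number. The name C<group>, the sample string
and C<$found> are assumptions made for this example.

```perl
use strict;
use warnings;

# (?&group) recurses into the group named "group".
my $balanced = qr/(?<group>\((?:[^()]++|(?&group))*+\))/;

my $found;
if ('f(a(b)c) + 1' =~ $balanced) {
    $found = $+{group};
    print "found $found\n";   # found (a(b)c)
}
```
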
e2e6a0f1
YO
1623=item C<(?(condition)yes-pattern|no-pattern)>
1624X<(?()>
286f584a 1625
e2e6a0f1 1626=item C<(?(condition)yes-pattern)>
286f584a 1627
41ef34de
ML
1628Conditional expression. Matches C<yes-pattern> if C<condition> yields
1629a true value, matches C<no-pattern> otherwise. A missing pattern always
1630matches.
1631
C<(condition)> should be one of: 1) an integer in
parentheses (which is valid if the corresponding pair of parentheses
matched); 2) a look-ahead/look-behind/evaluate zero-width assertion; 3) a
name in angle brackets or single quotes (which is valid if a group
with the given name matched); or 4) the special symbol C<(R)> (true when
evaluated inside of recursion or eval). Additionally, the C<R> may be
followed by a number (which will be true when evaluated while recursing
inside of the appropriate group), or by C<&NAME>, in which case it will
be true only when evaluated during recursion in the named group.
1641
1642Here's a summary of the possible predicates:
1643
1644=over 4
1645
1646=item (1) (2) ...
1647
c27a5cfe 1648Checks if the numbered capturing group has matched something.
e2e6a0f1
YO
1649
1650=item (<NAME>) ('NAME')
1651
c27a5cfe 1652Checks if a group with the given name has matched something.
e2e6a0f1 1653
f01cd190
FC
1654=item (?=...) (?!...) (?<=...) (?<!...)
1655
1656Checks whether the pattern matches (or does not match, for the '!'
1657variants).
1658
e2e6a0f1
YO
1659=item (?{ CODE })
1660
f01cd190 1661Treats the return value of the code block as the condition.
e2e6a0f1
YO
1662
1663=item (R)
1664
1665Checks if the expression has been evaluated inside of recursion.
1666
1667=item (R1) (R2) ...
1668
1669Checks if the expression has been evaluated while executing directly
1670inside of the n-th capture group. This check is the regex equivalent of
1671
1672 if ((caller(0))[3] eq 'subname') { ... }
1673
1674In other words, it does not check the full recursion stack.
1675
1676=item (R&NAME)
1677
1678Similar to C<(R1)>, this predicate checks to see if we're executing
1679directly inside of the leftmost group with a given name (this is the same
1680logic used by C<(?&NAME)> to disambiguate). It does not check the full
1681stack, but only the name of the innermost active recursion.
1682
1683=item (DEFINE)
1684
1685In this case, the yes-pattern is never directly executed, and no
1686no-pattern is allowed. Similar in spirit to C<(?{0})> but more efficient.
1687See below for details.
1688
1689=back
1690
1691For example:
1692
1693 m{ ( \( )?
1694 [^()]+
1695 (?(1) \) )
1696 }x
1697
1698matches a chunk of non-parentheses, possibly included in parentheses
1699themselves.
1700
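The sketch below anchors that pattern with C<\A> and C<\z> (an addition
made here so that partial matches do not hide the effect) and tries it
on four illustrative strings:

```perl
use strict;
use warnings;

# The closing parenthesis is required only if group 1 (the opening one) matched.
my $chunk = qr{ \A ( \( )? [^()]+ (?(1) \) ) \z }x;

my %result;
for my $text ('abc', '(abc)', '(abc', 'abc)') {
    $result{$text} = $text =~ $chunk ? 'match' : 'no match';
    print "$text: $result{$text}\n";
}
```

The bare and the fully parenthesized strings match; the two strings with
a single unpaired parenthesis do not.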
0b928c2f
FC
1701A special form is the C<(DEFINE)> predicate, which never executes its
1702yes-pattern directly, and does not allow a no-pattern. This allows one to
1703define subpatterns which will be executed only by the recursion mechanism.
e2e6a0f1
YO
1704This way, you can define a set of regular expression rules that can be
1705bundled into any pattern you choose.
1706
1707It is recommended that for this usage you put the DEFINE block at the
1708end of the pattern, and that you name any subpatterns defined within it.
1709
1710Also, it's worth noting that patterns defined this way probably will
31dc26d6 1711not be as efficient, as the optimizer is not very clever about
e2e6a0f1
YO
1712handling them.
1713
1714An example of how this might be used is as follows:
1715
2bf803e2 1716 /(?<NAME>(?&NAME_PAT))(?<ADDR>(?&ADDRESS_PAT))
e2e6a0f1 1717 (?(DEFINE)
2bf803e2 1718 (?<NAME_PAT>....)
22dc6719 1719 (?<ADDRESS_PAT>....)
e2e6a0f1
YO
1720 )/x
1721
c27a5cfe
KW
1722Note that capture groups matched inside of recursion are not accessible
1723after the recursion returns, so the extra layer of capturing groups is
e2e6a0f1
YO
1724necessary. Thus C<$+{NAME_PAT}> would not be defined even though
1725C<$+{NAME}> would be.
286f584a 1726
51a1303c
BF
1727Finally, keep in mind that subpatterns created inside a DEFINE block
1728count towards the absolute and relative number of captures, so this:
1729
1730 my @captures = "a" =~ /(.) # First capture
1731 (?(DEFINE)
1732 (?<EXAMPLE> 1 ) # Second capture
1733 )/x;
1734 say scalar @captures;
1735
will output 2, not 1. This is particularly important if you intend to
compile the definitions with the C<qr//> operator, and later
interpolate them in another pattern.
1739
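A complete, if small, sketch of the C<(DEFINE)> idiom follows. The rule
names C<WORD> and C<NUMBER>, the wrapper groups C<KEY> and C<VALUE>, and
the input string are all hypothetical and serve only to show which
captures survive.

```perl
use strict;
use warnings;

my $pair = qr{
    \A (?<KEY> (?&WORD) ) = (?<VALUE> (?&NUMBER) ) \z
    (?(DEFINE)
        (?<WORD>   [A-Za-z_]+ )
        (?<NUMBER> [0-9]+ )
    )
}x;

my ($key, $value, $word_defined);
if ('width=42' =~ $pair) {
    ($key, $value) = ($+{KEY}, $+{VALUE});
    # Groups that matched only during recursion are not set afterwards.
    $word_defined = defined $+{WORD} ? 1 : 0;
    print "key=$key value=$value WORD defined? $word_defined\n";
}
```

The wrapper groups hold C<width> and C<42>, while C<$+{WORD}> stays
undefined because C<WORD> was only ever entered through recursion.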
c47ff5f1 1740=item C<< (?>pattern) >>
6bda09f9 1741X<backtrack> X<backtracking> X<atomic> X<possessive>
5a964f20 1742
19799a22
GS
1743An "independent" subexpression, one which matches the substring
1744that a I<standalone> C<pattern> would match if anchored at the given
9da458fc 1745position, and it matches I<nothing other than this substring>. This
19799a22
GS
1746construct is useful for optimizations of what would otherwise be
1747"eternal" matches, because it will not backtrack (see L<"Backtracking">).
9da458fc
IZ
1748It may also be useful in places where the "grab all you can, and do not
1749give anything back" semantic is desirable.
19799a22 1750
c47ff5f1 1751For example: C<< ^(?>a*)ab >> will never match, since C<< (?>a*) >>
19799a22
GS
1752(anchored at the beginning of string, as above) will match I<all>
1753characters C<a> at the beginning of string, leaving no C<a> for
1754C<ab> to match. In contrast, C<a*ab> will match the same as C<a+b>,
1755since the match of the subgroup C<a*> is influenced by the following
1756group C<ab> (see L<"Backtracking">). In particular, C<a*> inside
1757C<a*ab> will match fewer characters than a standalone C<a*>, since
1758this makes the tail match.
1759
0b928c2f
FC
1760C<< (?>pattern) >> does not disable backtracking altogether once it has
1761matched. It is still possible to backtrack past the construct, but not
1762into it. So C<< ((?>a*)|(?>b*))ar >> will still match "bar".
1763
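The three claims above can be checked with a short sketch (the array
C<@matched> is illustrative):

```perl
use strict;
use warnings;

my @matched = (
    ('aaab' =~ /^(?>a*)ab/            ? 1 : 0),  # atomic group keeps every "a"
    ('aaab' =~ /^a*ab/                ? 1 : 0),  # plain a* gives one "a" back
    ('bar'  =~ /((?>a*)|(?>b*))ar/    ? 1 : 0),  # backtracking past the group
);
print "@matched\n";   # 0 1 1
```
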
An effect similar to C<< (?>pattern) >> may be achieved by writing
C<(?=(pattern))\g{-1}>. This matches the same substring as a standalone
C<pattern>, and the following C<\g{-1}> eats the matched string; it therefore
makes a zero-length assertion into an analogue of C<< (?>...) >>.
(The difference between these two constructs is that the second one
uses a capturing group, thus shifting ordinals of backreferences
in the rest of a regular expression.)
1771
1772Consider this pattern:
c277df42 1773
871b0233 1774 m{ \(
e2e6a0f1 1775 (
f793d64a 1776 [^()]+ # x+
e2e6a0f1 1777 |
871b0233
IZ
1778 \( [^()]* \)
1779 )+
e2e6a0f1 1780 \)
871b0233 1781 }x
5a964f20 1782
19799a22
GS
1783That will efficiently match a nonempty group with matching parentheses
1784two levels deep or less. However, if there is no such group, it
1785will take virtually forever on a long string. That's because there
1786are so many different ways to split a long string into several
1787substrings. This is what C<(.+)+> is doing, and C<(.+)+> is similar
1788to a subpattern of the above pattern. Consider how the pattern
1789above detects no-match on C<((()aaaaaaaaaaaaaaaaaa> in several
1790seconds, but that each extra letter doubles this time. This
1791exponential performance will make it appear that your program has
14218588 1792hung. However, a tiny change to this pattern
5a964f20 1793
e2e6a0f1
YO
1794 m{ \(
1795 (
f793d64a 1796 (?> [^()]+ ) # change x+ above to (?> x+ )
e2e6a0f1 1797 |
871b0233
IZ
1798 \( [^()]* \)
1799 )+
e2e6a0f1 1800 \)
871b0233 1801 }x
c277df42 1802
which uses C<< (?>...) >> matches exactly when the one above does (verifying
this yourself would be a productive exercise), but finishes in a quarter of
the time when used on a similar string with 1000000 C<a>s. Be aware,
however, that, when this construct is followed by a
quantifier, it currently triggers a warning message under
the C<use warnings> pragma or B<-w> switch saying it
C<"matches null string many times in regex">.
c277df42 1810
c47ff5f1 1811On simple groups, such as the pattern C<< (?> [^()]+ ) >>, a comparable
19799a22 1812effect may be achieved by negative look-ahead, as in C<[^()]+ (?! [^()] )>.
c277df42
IZ
1813This was only 4 times slower on a string with 1000000 C<a>s.
1814
The "grab all you can, and do not give anything back" semantic is desirable
in many situations where at first sight a simple C<()*> looks like
the correct solution. Suppose we parse text with comments being delimited
by C<#> followed by some optional (horizontal) whitespace. Contrary to
its appearance, C<#[ \t]*> I<is not> the correct subexpression to match
the comment delimiter, because it may "give up" some whitespace if
the remainder of the pattern can be made to match that way. The correct
answer is either one of these:
1823
1824 (?>#[ \t]*)
1825 #[ \t]*(?![ \t])
1826
1827For example, to grab non-empty comments into $1, one should use either
1828one of these:
1829
1830 / (?> \# [ \t]* ) ( .+ ) /x;
1831 / \# [ \t]* ( [^ \t] .* ) /x;
1832
1833Which one you pick depends on which of these expressions better reflects
1834the above specification of comments.
1835
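To see the "give up" behaviour in practice, consider a line that holds a
comment marker followed only by spaces. In this sketch (the variable
names and the input are illustrative) the naive pattern reports a
one-space "comment", while the atomic version correctly finds none:

```perl
use strict;
use warnings;

my $line = "#   ";    # "#" followed by three spaces and nothing else

# Naive form: [ \t]* hands one space back so that (.+) can match it.
my ($plain)  = $line =~ / \# [ \t]* ( .+ ) /x;

# Atomic form: the delimiter keeps all of its whitespace, so (.+) fails.
my ($atomic) = $line =~ / (?> \# [ \t]* ) ( .+ ) /x;

print "plain:  ", (defined $plain  ? "'$plain'"  : "no match"), "\n";
print "atomic: ", (defined $atomic ? "'$atomic'" : "no match"), "\n";
```
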
6bda09f9
YO
1836In some literature this construct is called "atomic matching" or
1837"possessive matching".
1838
b9b4dddf
YO
1839Possessive quantifiers are equivalent to putting the item they are applied
1840to inside of one of these constructs. The following equivalences apply:
1841
1842 Quantifier Form Bracketing Form
1843 --------------- ---------------
1844 PAT*+ (?>PAT*)
1845 PAT++ (?>PAT+)
1846 PAT?+ (?>PAT?)
1847 PAT{min,max}+ (?>PAT{min,max})
1848
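The first row of the table can be illustrated with C<+> in place of
C<*>. In this sketch (the hash C<%matched> is illustrative) the ordinary
quantifier gives back one C<a> so the trailing C<a> can match, whereas
the possessive and bracketed forms both refuse to:

```perl
use strict;
use warnings;

my %matched = (
    'a+a'     => ('aaa' =~ /^a+a/     ? 1 : 0),
    'a++a'    => ('aaa' =~ /^a++a/    ? 1 : 0),
    '(?>a+)a' => ('aaa' =~ /^(?>a+)a/ ? 1 : 0),
);
print "$_ => $matched{$_}\n" for sort keys %matched;
```
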
9d1a5160 1849=item C<(?[ ])>
f4f5fe57 1850
572224ce 1851See L<perlrecharclass/Extended Bracketed Character Classes>.
9d1a5160 1852
e2e6a0f1
YO
1853=back
1854
1855=head2 Special Backtracking Control Verbs
1856
e2e6a0f1
YO
1857These special patterns are generally of the form C<(*VERB:ARG)>. Unless
1858otherwise stated the ARG argument is optional; in some cases, it is
fee50582 1859mandatory.
e2e6a0f1
YO
1860
1861Any pattern containing a special backtracking verb that allows an argument
e1020413 1862has the special behaviour that when executed it sets the current package's
5d458dd8
YO
1863C<$REGERROR> and C<$REGMARK> variables. When doing so the following
1864rules apply:
e2e6a0f1 1865
5d458dd8
YO
1866On failure, the C<$REGERROR> variable will be set to the ARG value of the
1867verb pattern, if the verb was involved in the failure of the match. If the
1868ARG part of the pattern was omitted, then C<$REGERROR> will be set to the
1869name of the last C<(*MARK:NAME)> pattern executed, or to TRUE if there was
1870none. Also, the C<$REGMARK> variable will be set to FALSE.
e2e6a0f1 1871
5d458dd8
YO
1872On a successful match, the C<$REGERROR> variable will be set to FALSE, and
1873the C<$REGMARK> variable will be set to the name of the last
1874C<(*MARK:NAME)> pattern executed. See the explanation for the
1875C<(*MARK:NAME)> verb below for more details.
e2e6a0f1 1876
5d458dd8 1877B<NOTE:> C<$REGERROR> and C<$REGMARK> are not magic variables like C<$1>
0b928c2f 1878and most other regex-related variables. They are not local to a scope, nor
5d458dd8
YO
1879readonly, but instead are volatile package variables similar to C<$AUTOLOAD>.
1880Use C<local> to localize changes to them to a specific scope if necessary.
e2e6a0f1
YO
1881
1882If a pattern does not contain a special backtracking verb that allows an
5d458dd8 1883argument, then C<$REGERROR> and C<$REGMARK> are not touched at all.
e2e6a0f1 1884
70ca8714 1885=over 3
e2e6a0f1 1886
fee50582 1887=item Verbs
e2e6a0f1
YO
1888
1889=over 4
1890
5d458dd8 1891=item C<(*PRUNE)> C<(*PRUNE:NAME)>
f7819f85 1892X<(*PRUNE)> X<(*PRUNE:NAME)>
54612592 1893
5d458dd8
YO
1894This zero-width pattern prunes the backtracking tree at the current point
1895when backtracked into on failure. Consider the pattern C<A (*PRUNE) B>,
1896where A and B are complex patterns. Until the C<(*PRUNE)> verb is reached,
1897A may backtrack as necessary to match. Once it is reached, matching
1898continues in B, which may also backtrack as necessary; however, should B
1899not match, then no further backtracking will take place, and the pattern
1900will fail outright at the current starting position.
54612592
YO
1901
1902The following example counts all the possible matching strings in a
1903pattern (without actually matching any of them).
1904
e2e6a0f1 1905 'aaab' =~ /a+b?(?{print "$&\n"; $count++})(*FAIL)/;
54612592
YO
1906 print "Count=$count\n";
1907
1908which produces:
1909
1910 aaab
1911 aaa
1912 aa
1913 a
1914 aab
1915 aa
1916 a
1917 ab
1918 a
1919 Count=9
1920
5d458dd8 1921If we add a C<(*PRUNE)> before the count like the following
54612592 1922
5d458dd8 1923 'aaab' =~ /a+b?(*PRUNE)(?{print "$&\n"; $count++})(*FAIL)/;
54612592
YO
1924 print "Count=$count\n";
1925
0b928c2f 1926we prevent backtracking and find the count of the longest matching string
353c6505 1927at each matching starting point like so:
54612592
YO
1928
1929 aaab
1930 aab
1931 ab
1932 Count=3
1933
5d458dd8 1934Any number of C<(*PRUNE)> assertions may be used in a pattern.
54612592 1935
5d458dd8
YO
1936See also C<< (?>pattern) >> and possessive quantifiers for other ways to
1937control backtracking. In some cases, the use of C<(*PRUNE)> can be
1938replaced with a C<< (?>pattern) >> with no functional difference; however,
1939C<(*PRUNE)> can be used to handle cases that cannot be expressed using a
1940C<< (?>pattern) >> alone.
54612592 1941
5d458dd8
YO
1942=item C<(*SKIP)> C<(*SKIP:NAME)>
1943X<(*SKIP)>
e2e6a0f1 1944
This zero-width pattern is similar to C<(*PRUNE)>, except that on
failure it also signifies that whatever text was matched leading up
to the C<(*SKIP)> pattern being executed cannot be part of I<any> match
of this pattern. This effectively means that the regex engine "skips" forward
to this position on failure and tries to match again (assuming that
there is sufficient room to match).
1951
1952The name of the C<(*SKIP:NAME)> pattern has special significance. If a
1953C<(*MARK:NAME)> was encountered while matching, then it is that position
1954which is used as the "skip point". If no C<(*MARK)> of that name was
1955encountered, then the C<(*SKIP)> operator has no effect. When used
1956without a name the "skip point" is where the match point was when
1957executing the (*SKIP) pattern.
1958
0b928c2f 1959Compare the following to the examples in C<(*PRUNE)>; note the string
24b23f37
YO
1960is twice as long:
1961
d1fbf752
KW
1962 'aaabaaab' =~ /a+b?(*SKIP)(?{print "$&\n"; $count++})(*FAIL)/;
1963 print "Count=$count\n";
24b23f37
YO
1964
1965outputs
1966
1967 aaab
1968 aaab
1969 Count=2
1970
5d458dd8 1971Once the 'aaab' at the start of the string has matched, and the C<(*SKIP)>
353c6505 1972executed, the next starting point will be where the cursor was when the
5d458dd8
YO
1973C<(*SKIP)> was executed.
1974
5d458dd8 1975=item C<(*MARK:NAME)> C<(*:NAME)>
b16db30f 1976X<(*MARK)> X<(*MARK:NAME)> X<(*:NAME)>
5d458dd8
YO
1977
1978This zero-width pattern can be used to mark the point reached in a string
1979when a certain part of the pattern has been successfully matched. This
1980mark may be given a name. A later C<(*SKIP)> pattern will then skip
1981forward to that point if backtracked into on failure. Any number of
b4222fa9 1982C<(*MARK)> patterns are allowed, and the NAME portion may be duplicated.
5d458dd8
YO
1983
1984In addition to interacting with the C<(*SKIP)> pattern, C<(*MARK:NAME)>
1985can be used to "label" a pattern branch, so that after matching, the
1986program can determine which branches of the pattern were involved in the
1987match.
1988
1989When a match is successful, the C<$REGMARK> variable will be set to the
1990name of the most recently executed C<(*MARK:NAME)> that was involved
1991in the match.
1992
1993This can be used to determine which branch of a pattern was matched
c27a5cfe 1994without using a separate capture group for each branch, which in turn
5d458dd8
YO
1995can result in a performance improvement, as perl cannot optimize
1996C</(?:(x)|(y)|(z))/> as efficiently as something like
1997C</(?:x(*MARK:x)|y(*MARK:y)|z(*MARK:z))/>.
1998
1999When a match has failed, and unless another verb has been involved in
2000failing the match and has provided its own name to use, the C<$REGERROR>
2001variable will be set to the name of the most recently executed
2002C<(*MARK:NAME)>.
2003
42ac7c82 2004See L</(*SKIP)> for more details.
5d458dd8 2005
b62d2d15
YO
2006As a shortcut C<(*MARK:NAME)> can be written C<(*:NAME)>.
2007
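A sketch of the branch-labelling use described above. The mark names
C<feline> and C<canine>, the input words and the C<%label> hash are
invented for the example; C<$REGMARK> is declared with C<our> only so
that the code runs under C<use strict>.

```perl
use strict;
use warnings;

our $REGMARK;    # package variable set by the regex engine

my %label;
for my $animal (qw(cat dog)) {
    if ($animal =~ /(?:cat(*MARK:feline)|dog(*MARK:canine))/) {
        $label{$animal} = $REGMARK;
        print "$animal took the '$REGMARK' branch\n";
    }
}
```
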
5d458dd8
YO
2008=item C<(*THEN)> C<(*THEN:NAME)>
2009
ac9d8485 2010This is similar to the "cut group" operator C<::> from Perl 6. Like
5d458dd8
YO
2011C<(*PRUNE)>, this verb always matches, and when backtracked into on
2012failure, it causes the regex engine to try the next alternation in the
ac9d8485
FC
2013innermost enclosing group (capturing or otherwise) that has alternations.
2014The two branches of a C<(?(condition)yes-pattern|no-pattern)> do not
2015count as an alternation, as far as C<(*THEN)> is concerned.
5d458dd8
YO
2016
2017Its name comes from the observation that this operation combined with the
2018alternation operator (C<|>) can be used to create what is essentially a
2019pattern-based if/then/else block:
2020
2021 ( COND (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ )
2022
2023Note that if this operator is used and NOT inside of an alternation then
2024it acts exactly like the C<(*PRUNE)> operator.
2025
2026 / A (*PRUNE) B /
2027
2028is the same as
2029
2030 / A (*THEN) B /
2031
2032but
2033
25e26d77 2034 / ( A (*THEN) B | C ) /
5d458dd8
YO
2035
2036is not the same as
2037
25e26d77 2038 / ( A (*PRUNE) B | C ) /
5d458dd8
YO
2039
2040as after matching the A but failing on the B the C<(*THEN)> verb will
2041backtrack and try C; but the C<(*PRUNE)> verb will simply fail.
24b23f37 2042
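The difference can be sketched with concrete pieces: here A is C<a>, B
is C<b> and C is C<ac> (choices made purely for illustration), matched
against the string C<ac>.

```perl
use strict;
use warnings;

# After "a" matches and "b" fails, (*THEN) moves on to the next alternative...
my $then  = 'ac' =~ /^(?: a (*THEN)  b | a c )$/x ? 1 : 0;

# ...whereas (*PRUNE) makes the attempt at this start position fail.
my $prune = 'ac' =~ /^(?: a (*PRUNE) b | a c )$/x ? 1 : 0;

print "THEN=$then PRUNE=$prune\n";   # THEN=1 PRUNE=0
```
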
fee50582 2043=item C<(*COMMIT)> C<(*COMMIT:args)>
e2e6a0f1 2044X<(*COMMIT)>
24b23f37 2045
This is the Perl 6 "commit pattern" C<< <commit> >> or C<:::>. It's a
zero-width pattern similar to C<(*SKIP)>, except that when backtracked
into on failure it causes the match to fail outright. No further attempts
to find a valid match by advancing the start pointer will occur.
For example,
24b23f37 2051
d1fbf752
KW
2052 'aaabaaab' =~ /a+b?(*COMMIT)(?{print "$&\n"; $count++})(*FAIL)/;
2053 print "Count=$count\n";
24b23f37
YO
2054
2055outputs
2056
2057 aaab
2058 Count=1
2059
e2e6a0f1
YO
2060In other words, once the C<(*COMMIT)> has been entered, and if the pattern
2061does not match, the regex engine will not try any further matching on the
2062rest of the string.
c277df42 2063
fee50582 2064=item C<(*FAIL)> C<(*F)> C<(*FAIL:arg)>
e2e6a0f1 2065X<(*FAIL)> X<(*F)>
9af228c6 2066
This pattern matches nothing and always fails. It can be used to force the
engine to backtrack. It is equivalent to C<(?!)>, but easier to read. In
fact, C<(?!)> gets optimised into C<(*FAIL)> internally. You can provide
an argument so that if the match fails because of this C<FAIL> directive
the argument can be obtained from C<$REGERROR>.
9af228c6 2072
e2e6a0f1 2073It is probably useful only when combined with C<(?{})> or C<(??{})>.
9af228c6 2074
fee50582 2075=item C<(*ACCEPT)> C<(*ACCEPT:arg)>
e2e6a0f1 2076X<(*ACCEPT)>
9af228c6 2077
e2e6a0f1
YO
2078This pattern matches nothing and causes the end of successful matching at
2079the point at which the C<(*ACCEPT)> pattern was encountered, regardless of
2080whether there is actually more to match in the string. When inside of a
0d017f4d 2081nested pattern, such as recursion, or in a subpattern dynamically generated
e2e6a0f1 2082via C<(??{})>, only the innermost pattern is ended immediately.
9af228c6 2083
c27a5cfe 2084If the C<(*ACCEPT)> is inside of capturing groups then the groups are
e2e6a0f1
YO
2085marked as ended at the point at which the C<(*ACCEPT)> was encountered.
2086For instance:
9af228c6 2087
e2e6a0f1 2088 'AB' =~ /(A (A|B(*ACCEPT)|C) D)(E)/x;
9af228c6 2089
will match, and C<$1> will be C<AB>, C<$2> will be C<B>, and C<$3> will not
be set. If another branch in the inner parentheses had matched, such as in the
string 'ACDE', then the C<D> and C<E> would have to be matched as well.

You can provide an argument, which will be available in the variable
C<$REGMARK> after the match completes.
2096
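As a rough illustration of the capture behaviour just described, the
following sketch runs the same pattern against both strings and collects the
groups in list context. The variable names (C<$re>, C<@first>, C<@second>)
are made up for this example.

```perl
use strict;
use warnings;

# Same pattern as in the text; /x lets us space it out.
my $re = qr/(A (A|B(*ACCEPT)|C) D)(E)/x;

# 'AB' takes the B(*ACCEPT) branch: matching ends at once, so groups
# 1 and 2 are closed where they stand and group 3 is never entered.
my @first = 'AB' =~ $re;

# 'ACDE' takes the C branch, so D and E have to match as usual.
my @second = 'ACDE' =~ $re;

print join(',', map { $_ // 'undef' } @first),  "\n";   # AB,B,undef
print join(',', map { $_ // 'undef' } @second), "\n";   # ACD,C,E
```
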
9af228c6 2097=back
c277df42 2098
a0d0e21e
LW
2099=back
2100
=head2 Backtracking
X<backtrack> X<backtracking>
c07a80fd 2103
35a734be
IZ
2104NOTE: This section presents an abstract approximation of regular
2105expression behavior. For a more rigorous (and complicated) view of
2106the rules involved in selecting a match among possible alternatives,
0d017f4d 2107see L<Combining RE Pieces>.
35a734be 2108
A fundamental feature of regular expression matching involves the
notion called I<backtracking>, which is currently used (when needed)
by all non-possessive regular expression quantifiers, namely C<*>, C<*?>, C<+>,
C<+?>, C<{n,m}>, and C<{n,m}?>. Backtracking is often optimized
internally, but the general principle outlined here is valid.
c07a80fd
PP
2114
2115For a regular expression to match, the I<entire> regular expression must
2116match, not just part of it. So if the beginning of a pattern containing a
2117quantifier succeeds in a way that causes later parts in the pattern to
2118fail, the matching engine backs up and recalculates the beginning
2119part--that's why it's called backtracking.
2120
2121Here is an example of backtracking: Let's say you want to find the
2122word following "foo" in the string "Food is on the foo table.":
2123
 $_ = "Food is on the foo table.";
 if ( /\b(foo)\s+(\w+)/i ) {
     print "$2 follows $1.\n";
 }
2128
2129When the match runs, the first part of the regular expression (C<\b(foo)>)
2130finds a possible match right at the beginning of the string, and loads up
2131$1 with "Foo". However, as soon as the matching engine sees that there's
2132no whitespace following the "Foo" that it had saved in $1, it realizes its
68dc0745 2133mistake and starts over again one character after where it had the
c07a80fd
PP
2134tentative match. This time it goes all the way until the next occurrence
2135of "foo". The complete regular expression matches this time, and you get
2136the expected output of "table follows foo."
2137
2138Sometimes minimal matching can help a lot. Imagine you'd like to match
2139everything between "foo" and "bar". Initially, you write something
2140like this:
2141
2142 $_ = "The food is under the bar in the barn.";
2143 if ( /foo(.*)bar/ ) {
f793d64a 2144 print "got <$1>\n";
c07a80fd
PP
2145 }
2146
2147Which perhaps unexpectedly yields:
2148
2149 got <d is under the bar in the >
2150
2151That's because C<.*> was greedy, so you get everything between the
14218588 2152I<first> "foo" and the I<last> "bar". Here it's more effective
c07a80fd
PP
2153to use minimal matching to make sure you get the text between a "foo"
2154and the first "bar" thereafter.
2155
2156 if ( /foo(.*?)bar/ ) { print "got <$1>\n" }
2157 got <d is under the >
2158
0d017f4d 2159Here's another example. Let's say you'd like to match a number at the end
b6e13d97 2160of a string, and you also want to keep the preceding part of the match.
c07a80fd
PP
2161So you write this:
2162
 $_ = "I have 2 numbers: 53147";
 if ( /(.*)(\d*)/ ) {                                # Wrong!
     print "Beginning is <$1>, number is <$2>.\n";
 }
2167
2168That won't work at all, because C<.*> was greedy and gobbled up the
2169whole string. As C<\d*> can match on an empty string the complete
2170regular expression matched successfully.
2171
8e1088bc 2172 Beginning is <I have 2 numbers: 53147>, number is <>.
c07a80fd
PP
2173
2174Here are some variants, most of which don't work:
2175
 $_ = "I have 2 numbers: 53147";
 @pats = qw{
     (.*)(\d*)
     (.*)(\d+)
     (.*?)(\d*)
     (.*?)(\d+)
     (.*)(\d+)$
     (.*?)(\d+)$
     (.*)\b(\d+)$
     (.*\D)(\d+)$
 };

 for $pat (@pats) {
     printf "%-12s ", $pat;
     if ( /$pat/ ) {
         print "<$1> <$2>\n";
     } else {
         print "FAIL\n";
     }
 }
2196
2197That will print out:
2198
2199 (.*)(\d*) <I have 2 numbers: 53147> <>
2200 (.*)(\d+) <I have 2 numbers: 5314> <7>
2201 (.*?)(\d*) <> <>
2202 (.*?)(\d+) <I have > <2>
2203 (.*)(\d+)$ <I have 2 numbers: 5314> <7>
2204 (.*?)(\d+)$ <I have 2 numbers: > <53147>
2205 (.*)\b(\d+)$ <I have 2 numbers: > <53147>
2206 (.*\D)(\d+)$ <I have 2 numbers: > <53147>
2207
2208As you see, this can be a bit tricky. It's important to realize that a
2209regular expression is merely a set of assertions that gives a definition
2210of success. There may be 0, 1, or several different ways that the
2211definition might succeed against a particular string. And if there are
5a964f20
TC
2212multiple ways it might succeed, you need to understand backtracking to
2213know which variety of success you will achieve.
c07a80fd 2214
19799a22 2215When using look-ahead assertions and negations, this can all get even
8b19b778 2216trickier. Imagine you'd like to find a sequence of non-digits not
c07a80fd
PP
2217followed by "123". You might try to write that as
2218
 $_ = "ABC123";
 if ( /^\D*(?!123)/ ) {                # Wrong!
     print "Yup, no 123 in $_\n";
 }
2223
2224But that isn't going to match; at least, not the way you're hoping. It
2225claims that there is no 123 in the string. Here's a clearer picture of
9b9391b2 2226why that pattern matches, contrary to popular expectations:
c07a80fd 2227
 $x = 'ABC123';
 $y = 'ABC445';

 print "1: got $1\n" if $x =~ /^(ABC)(?!123)/;
 print "2: got $1\n" if $y =~ /^(ABC)(?!123)/;

 print "3: got $1\n" if $x =~ /^(\D*)(?!123)/;
 print "4: got $1\n" if $y =~ /^(\D*)(?!123)/;
2236
2237This prints
2238
2239 2: got ABC
2240 3: got AB
2241 4: got ABC
2242
You might have expected test 3 to fail because it seems to be a more
general-purpose version of test 1. The important difference between
them is that test 3 contains a quantifier (C<\D*>) and so can use
backtracking, whereas test 1 cannot. What's happening is
that you've asked "Is it true that at the start of $x, following 0 or more
non-digits, you have something that's not 123?" If the pattern matcher had
let C<\D*> expand to "ABC", this would have caused the whole pattern to
fail.
14218588 2251
c07a80fd 2252The search engine will initially match C<\D*> with "ABC". Then it will
0b928c2f 2253try to match C<(?!123)> with "123", which fails. But because
c07a80fd
PP
2254a quantifier (C<\D*>) has been used in the regular expression, the
2255search engine can backtrack and retry the match differently
54310121 2256in the hope of matching the complete regular expression.
c07a80fd 2257
5a964f20
TC
2258The pattern really, I<really> wants to succeed, so it uses the
2259standard pattern back-off-and-retry and lets C<\D*> expand to just "AB" this
c07a80fd 2260time. Now there's indeed something following "AB" that is not
14218588 2261"123". It's "C123", which suffices.
c07a80fd 2262
14218588
GS
2263We can deal with this by using both an assertion and a negation.
2264We'll say that the first part in $1 must be followed both by a digit
2265and by something that's not "123". Remember that the look-aheads
2266are zero-width expressions--they only look, but don't consume any
2267of the string in their match. So rewriting this way produces what
c07a80fd
PP
2268you'd expect; that is, case 5 will fail, but case 6 succeeds:
2269
 print "5: got $1\n" if $x =~ /^(\D*)(?=\d)(?!123)/;
 print "6: got $1\n" if $y =~ /^(\D*)(?=\d)(?!123)/;

 6: got ABC
2274
5a964f20 2275In other words, the two zero-width assertions next to each other work as though
19799a22 2276they're ANDed together, just as you'd use any built-in assertions: C</^$/>
c07a80fd
PP
2277matches only if you're at the beginning of the line AND the end of the
2278line simultaneously. The deeper underlying truth is that juxtaposition in
2279regular expressions always means AND, except when you write an explicit OR
2280using the vertical bar. C</ab/> means match "a" AND (then) match "b",
2281although the attempted matches are made at different positions because "a"
2282is not a zero-width assertion, but a one-width assertion.
2283
0d017f4d 2284B<WARNING>: Particularly complicated regular expressions can take
14218588 2285exponential time to solve because of the immense number of possible
0d017f4d 2286ways they can use backtracking to try for a match. For example, without
9da458fc
IZ
2287internal optimizations done by the regular expression engine, this will
2288take a painfully long time to run:
c07a80fd 2289
e1901655
IZ
2290 'aaaaaaaaaaaa' =~ /((a{0,5}){0,5})*[c]/
2291
2292And if you used C<*>'s in the internal groups instead of limiting them
2293to 0 through 5 matches, then it would take forever--or until you ran
2294out of stack space. Moreover, these internal optimizations are not
2295always applicable. For example, if you put C<{0,5}> instead of C<*>
2296on the external group, no current optimization is applicable, and the
2297match takes a long time to finish.
c07a80fd 2298
9da458fc
IZ
2299A powerful tool for optimizing such beasts is what is known as an
2300"independent group",
96090e4f 2301which does not backtrack (see L</C<< (?>pattern) >>>). Note also that
9da458fc 2302zero-length look-ahead/look-behind assertions will not backtrack to make
5d458dd8 2303the tail match, since they are in "logical" context: only
14218588 2304whether they match is considered relevant. For an example
9da458fc 2305where side-effects of look-ahead I<might> have influenced the
96090e4f 2306following match, see L</C<< (?>pattern) >>>.
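
To make the effect of an independent group concrete, here is a small sketch
comparing an ordinary greedy quantifier with C<< (?>...) >> and with the
possessive form C<a*+>. The string and the variable names are invented for
the example.

```perl
use strict;
use warnings;

my $text = 'aaab';

# Ordinary greedy quantifier: a* first takes "aaa", then gives one
# "a" back so that the trailing "ab" can match.
my $plain = $text =~ /^a*ab/ ? 1 : 0;

# Independent group: (?>a*) takes "aaa" and never gives anything back,
# so the "a" that must follow cannot be found.
my $atomic = $text =~ /^(?>a*)ab/ ? 1 : 0;

# The possessive quantifier a*+ behaves the same way here.
my $possessive = $text =~ /^a*+ab/ ? 1 : 0;

print "plain=$plain atomic=$atomic possessive=$possessive\n";
# plain=1 atomic=0 possessive=0
```

Because the independent group refuses to backtrack, the failing cases fail
quickly instead of trying every way of splitting up the run of "a"s.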
c277df42 2307
=head2 Version 8 Regular Expressions
X<regular expression, version 8> X<regex, version 8> X<regexp, version 8>
a0d0e21e 2310
5a964f20 2311In case you're not familiar with the "regular" Version 8 regex
a0d0e21e
LW
2312routines, here are the pattern-matching rules not described above.
2313
54310121 2314Any single character matches itself, unless it is a I<metacharacter>
a0d0e21e 2315with a special meaning described here or above. You can cause
5a964f20 2316characters that normally function as metacharacters to be interpreted
5f05dabc 2317literally by prefixing them with a "\" (e.g., "\." matches a ".", not any
0d017f4d
WL
2318character; "\\" matches a "\"). This escape mechanism is also required
2319for the character used as the pattern delimiter.
2320
2321A series of characters matches that series of characters in the target
0b928c2f 2322string, so the pattern C<blurfl> would match "blurfl" in the target
0d017f4d 2323string.
a0d0e21e
LW
2324
2325You can specify a character class, by enclosing a list of characters
5d458dd8 2326in C<[]>, which will match any character from the list. If the
a0d0e21e 2327first character after the "[" is "^", the class matches any character not
14218588 2328in the list. Within a list, the "-" character specifies a
5a964f20 2329range, so that C<a-z> represents all characters between "a" and "z",
8a4f6ac2
GS
2330inclusive. If you want either "-" or "]" itself to be a member of a
2331class, put it at the start of the list (possibly after a "^"), or
2332escape it with a backslash. "-" is also taken literally when it is
2333at the end of the list, just before the closing "]". (The
84850974
DD
2334following all specify the same class of three characters: C<[-az]>,
2335C<[az-]>, and C<[a\-z]>. All are different from C<[a-z]>, which
5d458dd8
YO
2336specifies a class containing twenty-six characters, even on EBCDIC-based
2337character sets.) Also, if you try to use the character
2338classes C<\w>, C<\W>, C<\s>, C<\S>, C<\d>, or C<\D> as endpoints of
2339a range, the "-" is understood literally.
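
The rules for a literal "-" or "]" inside a class can be checked with a few
lines of code. In this sketch, C<members> is a hypothetical helper (not part
of Perl) that returns those characters of a test string that a given class
accepts.

```perl
use strict;
use warnings;

# Return the characters of $candidates accepted by the class $class.
sub members {
    my ($class, $candidates) = @_;
    return join '', grep { /$class/ } split //, $candidates;
}

my $candidates = 'a-mz]';

my $lead   = members(qr/[-az]/,  $candidates);   # "-" first:   a-z
my $trail  = members(qr/[az-]/,  $candidates);   # "-" last:    a-z
my $escape = members(qr/[a\-z]/, $candidates);   # "-" escaped: a-z
my $range  = members(qr/[a-z]/,  $candidates);   # real range:  amz
my $brack  = members(qr/[]a]/,   $candidates);   # "]" first:   a]

print "$lead $trail $escape $range $brack\n";
```
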
a0d0e21e 2340
Note also that the whole range idea is rather unportable between
character sets, except for four situations that Perl handles specially.
Any subset of the ranges C<[A-Z]>, C<[a-z]>, and C<[0-9]> is guaranteed
to match the expected subset of ASCII characters, no matter what
character set the platform is running. The fourth portable way to
specify ranges is to use the C<\N{...}> syntax to specify either end
point of the range. For example, C<[\N{U+04}-\N{U+07}]> means to match
the Unicode code points C<\N{U+04}>, C<\N{U+05}>, C<\N{U+06}>, and
C<\N{U+07}>, whatever their native values may be on the platform. Under
L<use re 'strict'|re/'strict' mode> or within a L</C<(?[ ])>>, a warning
is raised, if enabled, when the other end point of a range which has a
C<\N{...}> endpoint is not portably specified. For example,
2353
2354 [\N{U+00}-\x06] # Warning under "use re 'strict'".
2355
Without some digging, it is hard to tell exactly what is matched by ranges
other than subsets of C<[A-Z]>, C<[a-z]>, and C<[0-9]>. A sound
principle is to use only ranges that begin from and end at either
alphabetics of equal case (C<[a-e]>, C<[A-E]>), or digits (C<[0-9]>). Anything
else is unsafe or unclear. If in doubt, spell out the range in full.
8ada0baa 2361
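If you are unsure what a range covers, you can probe it with C<chr>. The
following is only a sketch: the names C<$range> and C<@hits> are invented
here, and the numbers in the comment assume an ASCII platform, where native
code points and Unicode code points coincide in this part of the table.

```perl
use strict;
use warnings;

# A range whose end points are both given as Unicode code points.
my $range = qr/[\N{U+04}-\N{U+07}]/;

# Probe the first few native code points against the class.
my @hits = grep { chr($_) =~ $range } 0 .. 10;

print "@hits\n";    # 4 5 6 7 on an ASCII platform
```
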
54310121 2362Characters may be specified using a metacharacter syntax much like that
a0d0e21e
LW
2363used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,
2364"\f" a form feed, etc. More generally, \I<nnn>, where I<nnn> is a string
dc0d9c48 2365of three octal digits, matches the character whose coded character set value
5d458dd8 2366is I<nnn>. Similarly, \xI<nn>, where I<nn> are hexadecimal digits,
dc0d9c48 2367matches the character whose ordinal is I<nn>. The expression \cI<x>
5d458dd8 2368matches the character control-I<x>. Finally, the "." metacharacter
fb55449c 2369matches any character except "\n" (unless you use C</s>).
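
The different spellings of a single character can be compared directly. This
sketch matches a newline against its octal, hexadecimal and control-character
forms, then against "." with and without C</s>. The variable names are
illustrative, and the numeric values assume an ASCII platform.

```perl
use strict;
use warnings;

my $newline = "\n";

# Three spellings of the same character: octal, hex, and control-J.
my $octal   = $newline =~ /\A\012\z/ ? 1 : 0;
my $hex     = $newline =~ /\A\x0A\z/ ? 1 : 0;
my $control = $newline =~ /\A\cJ\z/  ? 1 : 0;

# "." skips newlines unless /s is in effect.
my $dot   = $newline =~ /\A.\z/  ? 1 : 0;
my $dot_s = $newline =~ /\A.\z/s ? 1 : 0;

print "octal=$octal hex=$hex control=$control dot=$dot dot_s=$dot_s\n";
# octal=1 hex=1 control=1 dot=0 dot_s=1
```
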
a0d0e21e
LW
2370
2371You can specify a series of alternatives for a pattern using "|" to
2372separate them, so that C<fee|fie|foe> will match any of "fee", "fie",
5a964f20 2373or "foe" in the target string (as would C<f(e|i|o)e>). The
a0d0e21e 2374first alternative includes everything from the last pattern delimiter
0b928c2f 2375("(", "(?:", etc. or the beginning of the pattern) up to the first "|", and
a0d0e21e 2376the last alternative contains everything from the last "|" to the next
0b928c2f 2377closing pattern delimiter. That's why it's common practice to include
14218588 2378alternatives in parentheses: to minimize confusion about where they
a3cb178b
GS
2379start and end.
2380
5a964f20 2381Alternatives are tried from left to right, so the first
a3cb178b
GS
2382alternative found for which the entire expression matches, is the one that
2383is chosen. This means that alternatives are not necessarily greedy. For
628afcb5 2384example: when matching C<foo|foot> against "barefoot", only the "foo"
a3cb178b
GS
2385part will match, as that is the first alternative tried, and it successfully
2386matches the target string. (This might not seem important, but it is
2387important when you are capturing matched text using parentheses.)
2388
5a964f20 2389Also remember that "|" is interpreted as a literal within square brackets,
a3cb178b 2390so if you write C<[fee|fie|foe]> you're really only matching C<[feio|]>.
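
A short sketch of the points above, using the "barefoot" example from the
text. The variable names are made up; the last two tests show that "|"
inside square brackets is just another member of the class.

```perl
use strict;
use warnings;

# Alternatives are tried left to right; the first that lets the whole
# pattern match wins, even if a later one would match more text.
my ($first)  = 'barefoot' =~ /(foo|foot)/;     # foo
my ($second) = 'barefoot' =~ /(foot|foo)/;     # foot

# If the rest of the pattern fails after "foo", the next alternative
# is tried at the same position.
my ($forced) = 'barefoot' =~ /(foo|foot)$/;    # foot

# Inside brackets, "|" is a literal and the class matches one character.
my $pipe = '|'   =~ /^[fee|fie|foe]$/ ? 1 : 0;  # 1
my $word = 'fee' =~ /^[fee|fie|foe]$/ ? 1 : 0;  # 0

print "$first $second $forced $pipe $word\n";
```
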
a0d0e21e 2391
14218588
GS
2392Within a pattern, you may designate subpatterns for later reference
2393by enclosing them in parentheses, and you may refer back to the
2394I<n>th subpattern later in the pattern using the metacharacter
0b928c2f 2395\I<n> or \gI<n>. Subpatterns are numbered based on the left to right order
14218588
GS
2396of their opening parenthesis. A backreference matches whatever
2397actually matched the subpattern in the string being examined, not
d8b950dc 2398the rules for that subpattern. Therefore, C<(0|0x)\d*\s\g1\d*> will
14218588
GS
2399match "0x1234 0x4321", but not "0x1234 01234", because subpattern
24001 matched "0x", even though the rule C<0|0x> could potentially match
2401the leading 0 in the second number.
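
The backreference example can be run as follows. This is a sketch with
invented variable names; the third string is an extra case added here to
show the pattern succeeding when both numbers use the plain "0" prefix.

```perl
use strict;
use warnings;

# \g1 must match the same text that group 1 actually matched.
my $re = qr/(0|0x)\d*\s\g1\d*/;

my $same  = '0x1234 0x4321' =~ $re ? 1 : 0;   # both start with "0x"
my $mixed = '0x1234 01234'  =~ $re ? 1 : 0;   # "0x" then "0": no match
my $octal = '01234 0777'    =~ $re ? 1 : 0;   # both start with "0"

print "same=$same mixed=$mixed octal=$octal\n";
# same=1 mixed=0 octal=1
```
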
cb1a09d0 2402
=head2 Warning on \1 Instead of $1
cb1a09d0 2404
5a964f20 2405Some people get too used to writing things like:
cb1a09d0
AD
2406
2407 $pattern =~ s/(\W)/\\\1/g;
2408
3ff1c45a
KW
2409This is grandfathered (for \1 to \9) for the RHS of a substitute to avoid
2410shocking the
cb1a09d0 2411B<sed> addicts, but it's a dirty habit to get into. That's because in
d1be9408 2412PerlThink, the righthand side of an C<s///> is a double-quoted string. C<\1> in
cb1a09d0
AD
2413the usual double-quoted string means a control-A. The customary Unix
2414meaning of C<\1> is kludged in for C<s///>. However, if you get into the habit
2415of doing that, you get yourself into trouble if you then add an C</e>
2416modifier.
2417
f793d64a 2418 s/(\d+)/ \1 + 1 /eg; # causes warning under -w
cb1a09d0
AD
2419
2420Or if you try to do
2421
2422 s/(\d+)/\1000/;
2423
2424You can't disambiguate that by saying C<\{1}000>, whereas you can fix it with
14218588 2425C<${1}000>. The operation of interpolation should not be confused
cb1a09d0
AD
2426with the operation of matching a backreference. Certainly they mean two
2427different things on the I<left> side of the C<s///>.
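
The following sketch shows the recommended spellings side by side. The
sample strings and variable names are invented for the example.

```perl
use strict;
use warnings;

my $price = 'cost: 42';

# ${1} keeps the variable name apart from the digits that follow it.
(my $thousands = $price) =~ s/(\d+)/${1}000/;

# With /e the replacement is Perl code, where $1 is an ordinary variable.
(my $next = $price) =~ s/(\d+)/$1 + 1/e;

# On the left side, \1 is a backreference: it matches the text that
# group 1 matched, here a repeated word.
(my $dedup = 'the the cat') =~ s/\b(\w+) \1\b/$1/;

print "$thousands\n";   # cost: 42000
print "$next\n";        # cost: 43
print "$dedup\n";       # the cat
```
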
9fa51da4 2428
=head2 Repeated Patterns Matching a Zero-length Substring
c84d73f1 2430
19799a22 2431B<WARNING>: Difficult material (and prose) ahead. This section needs a rewrite.
c84d73f1
IZ
2432
2433Regular expressions provide a terse and powerful programming language. As
2434with most other power tools, power comes together with the ability
2435to wreak havoc.
2436
2437A common abuse of this power stems from the ability to make infinite
628afcb5 2438loops using regular expressions, with something as innocuous as:
c84d73f1
IZ
2439
2440 'foo' =~ m{ ( o? )* }x;
2441
0d017f4d 2442The C<o?> matches at the beginning of C<'foo'>, and since the position
c84d73f1 2443in the string is not moved by the match, C<o?> would match again and again
527e91da 2444because of the C<*> quantifier. Another common way to create a similar cycle
c84d73f1
IZ
2445is with the looping modifier C<//g>:
2446
2447 @matches = ( 'foo' =~ m{ o? }xg );
2448
2449or
2450
2451 print "match: <$&>\n" while 'foo' =~ m{ o? }xg;
2452
2453or the loop implied by split().
2454
However, long experience has shown that many programming tasks may
be significantly simplified by using repeated subexpressions that
may match zero-length substrings. Here's a simple example:
c84d73f1 2458
d1fbf752 2459 @chars = split //, $string; # // is not magic in split
c84d73f1
IZ
2460 ($whitewashed = $string) =~ s/()/ /g; # parens avoid magic s// /
2461
9da458fc 2462Thus Perl allows such constructs, by I<forcefully breaking
c84d73f1 2463the infinite loop>. The rules for this are different for lower-level
527e91da 2464loops given by the greedy quantifiers C<*+{}>, and for higher-level
c84d73f1
IZ
2465ones like the C</g> modifier or split() operator.
2466
19799a22
GS
2467The lower-level loops are I<interrupted> (that is, the loop is
2468broken) when Perl detects that a repeated expression matched a
2469zero-length substring. Thus
c84d73f1
IZ
2470
2471 m{ (?: NON_ZERO_LENGTH | ZERO_LENGTH )* }x;
2472
5d458dd8 2473is made equivalent to
c84d73f1 2474
0b928c2f
FC
2475 m{ (?: NON_ZERO_LENGTH )* (?: ZERO_LENGTH )? }x;
2476
2477For example, this program
2478
 #!perl -l
 "aaaaab" =~ /
   (?:
      a                 # non-zero
      |                 # or
     (?{print "hello"}) # print hello whenever this
                        #    branch is tried
     (?=(b))            # zero-width assertion
   )*  # any number of times
  /x;
 print $&;
 print $1;
c84d73f1 2491
0b928c2f
FC
2492prints
2493
2494 hello
2495 aaaaa
2496 b
2497
2498Notice that "hello" is only printed once, as when Perl sees that the sixth
2499iteration of the outermost C<(?:)*> matches a zero-length string, it stops
2500the C<*>.
2501
The higher-level loops preserve an additional state between iterations:
whether the last match was zero-length. To break the loop, the match
following a zero-length match is prohibited from having a length of zero.
This prohibition interacts with backtracking (see L<"Backtracking">),
and so the I<second best> match is chosen if the I<best> match is of
zero length.
2508
19799a22 2509For example:
c84d73f1
IZ
2510
2511 $_ = 'bar';
2512 s/\w??/<$&>/g;
2513
20fb949f 2514results in C<< <><b><><a><><r><> >>. At each position of the string the best
5d458dd8 2515match given by non-greedy C<??> is the zero-length match, and the I<second
c84d73f1
IZ
2516best> match is what is matched by C<\w>. Thus zero-length matches
2517alternate with one-character-long matches.
2518
5d458dd8 2519Similarly, for repeated C<m/()/g> the second-best match is the match at the
c84d73f1
IZ
2520position one notch further in the string.
2521
19799a22 2522The additional state of being I<matched with zero-length> is associated with
c84d73f1 2523the matched string, and is reset by each assignment to pos().
9da458fc
IZ
2524Zero-length matches at the end of the previous match are ignored
2525during C<split>.
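
The behaviour of the higher-level loops can be observed with C<pos>. The
sketch below (variable names are illustrative) walks an empty pattern
through a string, splits the same string into characters, and repeats the
C<\w??> substitution from above.

```perl
use strict;
use warnings;

my $string = 'abc';

# An empty pattern under /g matches once at every position, the end
# of the string included, and then fails rather than looping forever.
my @positions;
while ($string =~ /()/g) {
    push @positions, pos($string);
}

# split with an empty pattern breaks the string into characters.
my @chars = split //, $string;

# Non-greedy \w?? prefers the empty match; the one-character match is
# taken only when an empty match at that position is prohibited.
(my $marked = 'bar') =~ s/\w??/<$&>/g;

print "@positions\n";   # 0 1 2 3
print "@chars\n";       # a b c
print "$marked\n";      # <><b><><a><><r><>
```
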
c84d73f1 2526
=head2 Combining RE Pieces
35a734be
IZ
2528
2529Each of the elementary pieces of regular expressions which were described
2530before (such as C<ab> or C<\Z>) could match at most one substring
2531at the given position of the input string. However, in a typical regular
2532expression these elementary pieces are combined into more complicated
0b928c2f 2533patterns using combining operators C<ST>, C<S|T>, C<S*> etc.
35a734be
IZ
2534(in these examples C<S> and C<T> are regular subexpressions).
2535
2536Such combinations can include alternatives, leading to a problem of choice:
2537if we match a regular expression C<a|ab> against C<"abc">, will it match
2538substring C<"a"> or C<"ab">? One way to describe which substring is
2539actually matched is the concept of backtracking (see L<"Backtracking">).
2540However, this description is too low-level and makes you think
2541in terms of a particular implementation.
2542
2543Another description starts with notions of "better"/"worse". All the
2544substrings which may be matched by the given regular expression can be
2545sorted from the "best" match to the "worst" match, and it is the "best"
2546match which is chosen. This substitutes the question of "what is chosen?"
2547by the question of "which matches are better, and which are worse?".
2548
2549Again, for elementary pieces there is no such question, since at most
2550one match at a given position is possible. This section describes the
2551notion of better/worse for combining operators. In the description
2552below C<S> and C<T> are regular subexpressions.
2553
13a2d996 2554=over 4
35a734be
IZ
2555
2556=item C<ST>
2557
Consider two possible matches, C<AB> and C<A'B'>, where C<A> and C<A'> are
substrings which can be matched by C<S>, and C<B> and C<B'> are substrings
which can be matched by C<T>.
35a734be 2561
0b928c2f 2562If C<A> is a better match for C<S> than C<A'>, C<AB> is a better
35a734be
IZ
2563match than C<A'B'>.
2564
2565If C<A> and C<A'> coincide: C<AB> is a better match than C<AB'> if
0b928c2f 2566C<B> is a better match for C<T> than C<B'>.
35a734be
IZ
2567
2568=item C<S|T>
2569
2570When C<S> can match, it is a better match than when only C<T> can match.
2571
The ordering of two matches for C<S> is the same as for C<S> alone. The same
holds for two matches for C<T>.
2574
2575=item C<S{REPEAT_COUNT}>
2576
2577Matches as C<SSS...S> (repeated as many times as necessary).
2578
2579=item C<S{min,max}>
2580
2581Matches as C<S{max}|S{max-1}|...|S{min+1}|S{min}>.
2582
2583=item C<S{min,max}?>
2584
2585Matches as C<S{min}|S{min+1}|...|S{max-1}|S{max}>.
2586
2587=item C<S?>, C<S*>, C<S+>
2588
2589Same as C<S{0,1}>, C<S{0,BIG_NUMBER}>, C<S{1,BIG_NUMBER}> respectively.
2590
2591=item C<S??>, C<S*?>, C<S+?>
2592
2593Same as C<S{0,1}?>, C<S{0,BIG_NUMBER}?>, C<S{1,BIG_NUMBER}?> respectively.
2594
c47ff5f1 2595=item C<< (?>S) >>
35a734be
IZ
2596
2597Matches the best match for C<S> and only that.
2598
2599=item C<(?=S)>, C<(?<=S)>
2600
2601Only the best match for C<S> is considered. (This is important only if
2602C<S> has capturing parentheses, and backreferences are used somewhere
2603else in the whole regular expression.)
2604
2605=item C<(?!S)>, C<(?<!S)>
2606
2607For this grouping operator there is no need to describe the ordering, since
2608only whether or not C<S> can match is important.
2609
93f313ef 2610=item C<(??{ EXPR })>, C<(?I<PARNO>)>
35a734be
IZ
2611
2612The ordering is the same as for the regular expression which is
93f313ef 2613the result of EXPR, or the pattern contained by capture group I<PARNO>.
35a734be
IZ
2614
2615=item C<(?(condition)yes-pattern|no-pattern)>
2616
2617Recall that which of C<yes-pattern> or C<no-pattern> actually matches is
2618already determined. The ordering of the matches is the same as for the
2619chosen subexpression.
2620
2621=back
2622
2623The above recipes describe the ordering of matches I<at a given position>.
2624One more rule is needed to understand how a match is determined for the
2625whole regular expression: a match at an earlier position is always better
2626than a match at a later position.
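
A few of these ordering rules can be observed directly. In this sketch,
C<matched> is a hypothetical helper that returns the substring a pattern
matched; the patterns and strings are chosen only for illustration.

```perl
use strict;
use warnings;

# Return the substring matched by $re in $text, or undef on failure.
sub matched {
    my ($re, $text) = @_;
    return $text =~ $re ? $& : undef;
}

# S|T: a match using S is better than one using only T ...
my $alt_first = matched(qr/a|ab/, 'abc');         # a

# ... but the whole expression must match, so T is used if S then fails.
my $alt_whole = matched(qr/(?:a|ab)c/, 'abc');    # abc

# S{min,max} is ordered as S{max}|...|S{min}; the ? form is reversed.
my $greedy = matched(qr/a{1,3}/,  'aaaa');        # aaa
my $lazy   = matched(qr/a{1,3}?/, 'aaaa');        # a

# A match at an earlier position beats any match at a later one,
# whatever the order of the alternatives.
my $leftmost = matched(qr/b|ab/, 'xxab');         # ab

print "$alt_first $alt_whole $greedy $lazy $leftmost\n";
```
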
2627
=head2 Creating Custom RE Engines
c84d73f1 2629
0b928c2f
FC
2630As of Perl 5.10.0, one can create custom regular expression engines. This
2631is not for the faint of heart, as they have to plug in at the C level. See
2632L<perlreapi> for more details.
2633
2634As an alternative, overloaded constants (see L<overload>) provide a simple
2635way to extend the functionality of the RE engine, by substituting one
2636pattern for another.
c84d73f1
IZ
2637
2638Suppose that we want to enable a new RE escape-sequence C<\Y|> which
0d017f4d 2639matches at a boundary between whitespace characters and non-whitespace
c84d73f1
IZ
2640characters. Note that C<(?=\S)(?<!\S)|(?!\S)(?<=\S)> matches exactly
2641at these positions, so we want to have each C<\Y|> in the place of the
2642more complicated version. We can create a module C<customre> to do
2643this:
2644
 package customre;
 use overload;

 sub import {
   shift;
   die "No argument to customre::import allowed" if @_;
   overload::constant 'qr' => \&convert;
 }

 sub invalid { die "/$_[0]/: invalid escape '\\$_[1]'"}

 # We must also take care of not escaping the legitimate \\Y|
 # sequence, hence the presence of '\\' in the conversion rules.
 my %rules = ( '\\' => '\\\\',
               'Y|' => qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/ );
 sub convert {
   my $re = shift;
   $re =~ s{
             \\ ( \\ | Y . )
           }
           { $rules{$1} or invalid($re,$1) }sgex;
   return $re;
 }
2668
2669Now C<use customre> enables the new escape in constant regular
2670expressions, i.e., those without any runtime variable interpolations.
2671As documented in L<overload>, this conversion will work only over
2672literal parts of regular expressions. For C<\Y|$re\Y|> the variable
2673part of this regular expression needs to be converted explicitly
2674(but only if the special meaning of C<\Y|> should be enabled inside $re):
2675
 use customre;
 $re = <>;
 chomp $re;
 $re = customre::convert $re;
 /\Y|$re\Y|/;
2681
=head2 Embedded Code Execution Frequency
2683
The exact rules for how often C<(??{})> and C<(?{})> are executed in a pattern
are unspecified. In the case of a successful match you can assume that
they DWIM and will be executed in left to right order the appropriate
number of times in the accepting path of the pattern, as would any other
meta-pattern. How non-accepting pathways and match failures affect the
number of times a pattern is executed is deliberately left unspecified; it
may vary depending on what optimizations can be applied to the pattern,
and is likely to change from version to version.
2692
2693For instance in
2694
2695 "aaabcdeeeee"=~/a(?{print "a"})b(?{print "b"})cde/;
2696
the exact number of times "a" or "b" are printed out is unspecified for
failure, but you may assume they will be printed at least once during
a successful match. Additionally, you may assume that if "b" is printed,
it will be preceded by at least one "a".
2701
2702In the case of branching constructs like the following:
2703
2704 /a(b|(?{ print "a" }))c(?{ print "c" })/;
2705
2706you can assume that the input "ac" will output "ac", and that "abc"
2707will output only "c".
2708
2709When embedded code is quantified, successful matches will call the
2710code once for each matched iteration of the quantifier. For
2711example:
2712
2713 "good" =~ /g(?:o(?{print "o"}))*d/;
2714
2715will output "o" twice.
2716
=head2 PCRE/Python Support
1f1031fe 2718
0b928c2f 2719As of Perl 5.10.0, Perl supports several Python/PCRE-specific extensions
1f1031fe 2720to the regex syntax. While Perl programmers are encouraged to use the
0b928c2f 2721Perl-specific syntax, the following are also accepted:
1f1031fe
YO
2722
2723=over 4
2724
ae5648b3 2725=item C<< (?PE<lt>NAMEE<gt>pattern) >>
1f1031fe 2726
c27a5cfe 2727Define a named capture group. Equivalent to C<< (?<NAME>pattern) >>.
1f1031fe
YO
2728
2729=item C<< (?P=NAME) >>
2730
c27a5cfe 2731Backreference to a named capture group. Equivalent to C<< \g{NAME} >>.
1f1031fe
YO
2732
2733=item C<< (?P>NAME) >>
2734
c27a5cfe 2735Subroutine call to a named capture group. Equivalent to C<< (?&NAME) >>.
1f1031fe 2736
ee9b8eae 2737=back
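
A brief sketch of the three forms in use. The group names C<word> and
C<pair>, and the sample strings, are made up for the example.

```perl
use strict;
use warnings;

# (?P<word>...) names the group; (?P=word) must match the same text again.
my $doubled = 'hello hello world' =~ /(?P<word>\w+) (?P=word)/
            ? $+{word} : undef;

# (?P>pair) re-runs the subpattern of the named group, so any two
# digits will do; the outer capture keeps its own value.
my $call = '12-34' =~ /^(?P<pair>\d\d)-(?P>pair)$/ ? $+{pair} : undef;

# A backreference would need the very same digits and so fails here.
my $backref = '12-34' =~ /^(?P<pair>\d\d)-(?P=pair)$/ ? 1 : 0;

print "$doubled $call $backref\n";   # hello 12 0
```
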
1f1031fe 2738
=head1 BUGS
2740
88c9975e
KW
2741Many regular expression constructs don't work on EBCDIC platforms.
2742
ed7efc79
KW
2743There are a number of issues with regard to case-insensitive matching
2744in Unicode rules. See C<i> under L</Modifiers> above.
2745
9da458fc
IZ
2746This document varies from difficult to understand to completely
2747and utterly opaque. The wandering prose riddled with jargon is
2748hard to fathom in several places.
2749
2750This document needs a rewrite that separates the tutorial content
2751from the reference content.
19799a22
GS
2752
2753=head1 SEE ALSO
9fa51da4 2754
91e0c79e
MJD
2755L<perlrequick>.
2756
2757L<perlretut>.
2758
9b599b2a
GS
2759L<perlop/"Regexp Quote-Like Operators">.
2760
1e66bd83
PP
2761L<perlop/"Gory details of parsing quoted constructs">.
2762
14218588
GS
2763L<perlfaq6>.
2764
9b599b2a
GS
2765L<perlfunc/pos>.
2766
2767L<perllocale>.
2768
fb55449c
JH
2769L<perlebcdic>.
2770
14218588
GS
2771I<Mastering Regular Expressions> by Jeffrey Friedl, published
2772by O'Reilly and Associates.