=head1 NAME
X<regular expression> X<regex> X<regexp>

perlre - Perl regular expressions

=head1 DESCRIPTION

This page describes the syntax of regular expressions in Perl.

If you haven't used regular expressions before, a quick-start
introduction is available in L<perlrequick>, and a longer tutorial
introduction is available in L<perlretut>.

For reference on how regular expressions are used in matching
operations, plus various examples of the same, see discussions of
C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/"Regexp Quote-Like
Operators">.

=head2 Modifiers

Matching operations can have various modifiers. Modifiers
that relate to the interpretation of the regular expression inside
are listed below. Modifiers that alter the way a regular expression
is used by Perl are detailed in L<perlop/"Regexp Quote-Like Operators"> and
L<perlop/"Gory details of parsing quoted constructs">.

=over 4

=item m
X</m> X<regex, multiline> X<regexp, multiline> X<regular expression, multiline>

Treat string as multiple lines. That is, change "^" and "$" from matching
the start of the string's first line and the end of its last line to
matching the start and end of each line within the string.

=item s
X</s> X<regex, single-line> X<regexp, single-line>
X<regular expression, single-line>

Treat string as single line. That is, change "." to match any character
whatsoever, even a newline, which normally it would not match.

Used together, as C</ms>, they let the "." match any character whatsoever,
while still allowing "^" and "$" to match, respectively, just after
and just before newlines within the string.

=item i
X</i> X<regex, case-insensitive> X<regexp, case-insensitive>
X<regular expression, case-insensitive>

Do case-insensitive pattern matching.

If locale matching rules are in effect, the case map is taken from the
current locale for code points less than 255, and from Unicode rules for
larger code points. However, matches that would cross the Unicode
rules/non-Unicode rules boundary (ords 255/256) will not succeed. See
L<perllocale>.

There are a number of Unicode characters that match multiple characters
under C</i>. For example, C<LATIN SMALL LIGATURE FI>
should match the sequence C<fi>. Perl is not
currently able to do this when the multiple characters are in the pattern and
are split between groupings, or when one or more are quantified. Thus

    "\N{LATIN SMALL LIGATURE FI}" =~ /fi/i;          # Matches
    "\N{LATIN SMALL LIGATURE FI}" =~ /[fi][fi]/i;    # Doesn't match!
    "\N{LATIN SMALL LIGATURE FI}" =~ /fi*/i;         # Doesn't match!

    # The below doesn't match, and it isn't clear what $1 and $2 would
    # be even if it did!!
    "\N{LATIN SMALL LIGATURE FI}" =~ /(f)(i)/i;      # Doesn't match!

Perl doesn't match multiple characters in a bracketed
character class unless the character that maps to them is explicitly
mentioned, and it doesn't match them at all if the character class is
inverted, which otherwise could be highly confusing. See
L<perlrecharclass/Bracketed Character Classes>, and
L<perlrecharclass/Negation>.

=item x
X</x>

Extend your pattern's legibility by permitting whitespace and comments.
Details in L</"/x">.

=item p
X</p> X<regex, preserve> X<regexp, preserve>

Preserve the string matched such that ${^PREMATCH}, ${^MATCH}, and
${^POSTMATCH} are available for use after matching.

In Perl 5.20 and higher this is ignored. Due to a new copy-on-write
mechanism, ${^PREMATCH}, ${^MATCH}, and ${^POSTMATCH} will be available
after the match regardless of the modifier.

=item a, d, l and u
X</a> X</d> X</l> X</u>

These modifiers, all new in 5.14, affect which character-set rules
(Unicode, etc.) are used, as described below in
L</Character set modifiers>.

=item Other Modifiers

There are a number of flags that can be found at the end of regular
expression constructs that are I<not> generic regular expression flags, but
apply to the operation being performed, like matching or substitution (C<m//>
or C<s///> respectively).

Flags described further in
L<perlretut/"Using regular expressions in Perl"> are:

    c  - keep the current position during repeated matching
    g  - globally match the pattern repeatedly in the string

Substitution-specific modifiers described in
L<perlop/"s/PATTERN/REPLACEMENT/msixpodualgcer"> are:

    e  - evaluate the right-hand side as an expression
    ee - evaluate the right side as a string then eval the result
    o  - pretend to optimize your code, but actually introduce bugs
    r  - perform non-destructive substitution and return the new value

=back

Regular expression modifiers are usually written in documentation
as e.g., "the C</x> modifier", even though the delimiter
in question might not really be a slash. The modifiers C</imsxadlup>
may also be embedded within the regular expression itself using
the C<(?...)> construct, see L</Extended Patterns> below.

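As a rough illustration of the operation-level flags listed under "Other Modifiers" above, the following sketch applies C</g>, C</r> and C</e> to one string. The variable names and the sample text are made up for the example, and C</r> assumes Perl 5.14 or later.

```perl
use strict;
use warnings;

my $text = "3 apples and 5 pears";

# /g in list context returns every capture from every match.
my @numbers = $text =~ /(\d+)/g;

# /r leaves $text untouched and returns the modified copy.
my $shouting = $text =~ s/apples/APPLES/r;

# /e evaluates the replacement as Perl code; here each number is doubled.
(my $doubled = $text) =~ s/(\d+)/$1 * 2/ge;

print "@numbers\n$shouting\n$doubled\n";
```

Because C</r> returns a copy, C<$text> still holds the original string after all three operations.
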
=head3 /x

C</x> tells
the regular expression parser to ignore most whitespace that is neither
backslashed nor within a bracketed character class. You can use this to
break up your regular expression into (slightly) more readable parts.
Also, the C<#> character is treated as a metacharacter introducing a
comment that runs up to the pattern's closing delimiter, or to the end
of the current line if the pattern extends onto the next line. Hence,
this is very much like an ordinary Perl code comment. (You can include
the closing delimiter within the comment only if you precede it with a
backslash, so be careful!)

Use of C</x> means that if you want real
whitespace or C<#> characters in the pattern (outside a bracketed character
class, which is unaffected by C</x>), then you'll either have to
escape them (using backslashes or C<\Q...\E>) or encode them using octal,
hex, or C<\N{}> escapes.
It is ineffective to try to continue a comment onto the next line by
escaping the C<\n> with a backslash or C<\Q>.

You can use L</(?#text)> to create a comment that ends earlier than the
end of the current line, but C<text> also can't contain the closing
delimiter unless escaped with a backslash.

Taken together, these features go a long way towards
making Perl's regular expressions more readable. Here's an example:

    # Delete (most) C comments.
    $program =~ s {
        /\*     # Match the opening delimiter.
        .*?     # Match a minimal number of characters.
        \*/     # Match the closing delimiter.
    } []gsx;

Note that anything inside
a C<\Q...\E> stays unaffected by C</x>. And note that C</x> doesn't affect
space interpretation within a single multi-character construct. For
example in C<\x{...}>, regardless of the C</x> modifier, there can be no
spaces. Same for a L<quantifier|/Quantifiers> such as C<{3}> or
C<{5,}>. Similarly, C<(?:...)> can't have a space between the C<(>,
C<?>, and C<:>. Within any delimiters for such a
construct, allowed spaces are not affected by C</x>, and depend on the
construct. For example, C<\x{...}> can't have spaces because hexadecimal
numbers don't have spaces in them. But, Unicode properties can have spaces, so
in C<\p{...}> there can be spaces that follow the Unicode rules, for which see
L<perluniprops/Properties accessible through \p{} and \P{}>.
X</x>

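The rules above for literal whitespace and C<#> under C</x> can be sketched as follows. The pattern, the variable names and the sample strings are invented for the illustration; a backslashed space and a bracketed C<[#]> are just two of the escaping options the text describes.

```perl
use strict;
use warnings;

# Under /x, unescaped whitespace and "# ..." comments are ignored, so a
# literal space or "#" must be escaped or placed in a bracketed class.
my $re = qr/
    ^ (\w+)     # a key
    \ = \       # a literal space, "=", and another literal space
    [#] (\d+)   # a literal "#" inside a class, followed by digits
    $
/x;

my ($key, $num) = "issue = #42" =~ $re;
my $no_spaces   = "issue=#42"   =~ $re ? 1 : 0;

print "$key $num $no_spaces\n";
```

The second string fails to match because the two backslashed spaces are real parts of the pattern, unlike the layout whitespace around them.
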
=head3 Character set modifiers

C</d>, C</u>, C</a>, and C</l>, available starting in 5.14, are called
the character set modifiers; they affect the character set rules
used for the regular expression.

The C</d>, C</u>, and C</l> modifiers are not likely to be of much use
to you, and so you need not worry about them very much. They exist for
Perl's internal use, so that complex regular expression data structures
can be automatically serialized and later exactly reconstituted,
including all their nuances. But, since Perl can't keep a secret, and
there may be rare instances where they are useful, they are documented
here.

The C</a> modifier, on the other hand, may be useful. Its purpose is to
allow code that is to work mostly on ASCII data to not have to concern
itself with Unicode.

Briefly, C</l> sets the character set to that of whatever B<L>ocale is in
effect at the time of the execution of the pattern match.

C</u> sets the character set to B<U>nicode.

C</a> also sets the character set to Unicode, BUT adds several
restrictions for B<A>SCII-safe matching.

C</d> is the old, problematic, pre-5.14 B<D>efault character set
behavior. Its only use is to force that old behavior.

At any given time, exactly one of these modifiers is in effect. Their
existence allows Perl to keep the originally compiled behavior of a
regular expression, regardless of what rules are in effect when it is
actually executed. And if it is interpolated into a larger regex, the
original's rules continue to apply to it, and only it.

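The point about interpolation can be sketched with a small example: a pattern compiled with C</a> should keep its ASCII-only C<\d> even inside a larger pattern compiled with C</u>. The variable names are hypothetical, and the test character (ARABIC-INDIC DIGIT ONE) is just one convenient non-ASCII digit.

```perl
use strict;
use warnings;

# Compiled once with /a, so \d here means only "0" through "9".
my $ascii_digit = qr/\d/a;
my $arabic_one  = "\x{0661}";   # ARABIC-INDIC DIGIT ONE

# The outer pattern uses /u, under which a bare \d matches any Unicode digit.
my $outer_u      = $arabic_one =~ /^\d$/u           ? 1 : 0;

# The interpolated piece still follows its own /a rules.
my $interpolated = $arabic_one =~ /^$ascii_digit$/u ? 1 : 0;

print "outer=$outer_u interpolated=$interpolated\n";
```
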
The C</l> and C</u> modifiers are automatically selected for
regular expressions compiled within the scope of various pragmas,
and we recommend that in general, you use those pragmas instead of
specifying these modifiers explicitly. For one thing, the modifiers
affect only pattern matching, and do not extend to even any replacement
done, whereas using the pragmas gives consistent results for all
appropriate operations within their scopes. For example,

    s/foo/\Ubar/il

will match "foo" using the locale's rules for case-insensitive matching,
but the C</l> does not affect how the C<\U> operates. Most likely you
want both of them to use locale rules. To do this, instead compile the
regular expression within the scope of C<use locale>. This both
implicitly adds the C</l> and applies locale rules to the C<\U>. The
lesson is to C<use locale> and not C</l> explicitly.

Similarly, it would be better to use C<use feature 'unicode_strings'>
instead of,

    s/foo/\Lbar/iu

to get Unicode rules, as the C<\L> in the former (but not necessarily
the latter) would also use Unicode rules.

More detail on each of the modifiers follows. Most likely you don't
need to know this detail for C</l>, C</u>, and C</d>, and can skip ahead
to L<E<sol>a|/E<sol>a (and E<sol>aa)>.

=head4 /l

means to use the current locale's rules (see L<perllocale>) when pattern
matching. For example, C<\w> will match the "word" characters of that
locale, and C<"/i"> case-insensitive matching will match according to
the locale's case folding rules. The locale used will be the one in
effect at the time of execution of the pattern match. This may not be
the same as the compilation-time locale, and can differ from one match
to another if there is an intervening call of the
L<setlocale() function|perllocale/The setlocale function>.

The only non-single-byte locale Perl supports is (starting in v5.20)
UTF-8. This means that code points above 255 are treated as Unicode no
matter what locale is in effect (since UTF-8 implies Unicode).

Under Unicode rules, there are a few case-insensitive matches that cross
the 255/256 boundary. Except for UTF-8 locales in Perls v5.20 and
later, these are disallowed under C</l>. For example, 0xFF (on ASCII
platforms) does not caselessly match the character at 0x178, C<LATIN
CAPITAL LETTER Y WITH DIAERESIS>, because 0xFF may not be C<LATIN SMALL
LETTER Y WITH DIAERESIS> in the current locale, and Perl has no way of
knowing if that character even exists in the locale, much less what code
point it is.

In a UTF-8 locale in v5.20 and later, the only visible difference
between locale and non-locale in regular expressions should be tainting
(see L<perlsec>).

This modifier may be specified to be the default by C<use locale>, but
see L</Which character set modifier is in effect?>.
X</l>

=head4 /u

means to use Unicode rules when pattern matching. On ASCII platforms,
this means that the code points between 128 and 255 take on their
Latin-1 (ISO-8859-1) meanings (which are the same as Unicode's).
(Otherwise Perl considers their meanings to be undefined.) Thus,
under this modifier, the ASCII platform effectively becomes a Unicode
platform; and hence, for example, C<\w> will match any of the more than
100_000 word characters in Unicode.

Unlike most locales, which are specific to a language and country pair,
Unicode classifies all the characters that are letters I<somewhere> in
the world as
C<\w>. For example, your locale might not think that C<LATIN SMALL
LETTER ETH> is a letter (unless you happen to speak Icelandic), but
Unicode does. Similarly, all the characters that are decimal digits
somewhere in the world will match C<\d>; this is hundreds, not 10,
possible matches. And some of those digits look like some of the 10
ASCII digits, but mean a different number, so a human could easily think
a number is a different quantity than it really is. For example,
C<BENGALI DIGIT FOUR> (U+09EA) looks very much like an
C<ASCII DIGIT EIGHT> (U+0038). And C<\d+> may match strings of digits
that are a mixture from different writing systems, creating a security
issue. L<Unicode::UCD/num()> can be used to sort
this out. Or the C</a> modifier can be used to force C<\d> to match
just the ASCII 0 through 9.

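The digit look-alike problem described above can be sketched as follows. The variable names are made up; C<num()> is imported from the core C<Unicode::UCD> module, and the claim that it returns C<undef> for digits from mixed scripts reflects its documented behavior as best understood here, so treat that part as an assumption to verify on your Perl.

```perl
use strict;
use warnings;
use Unicode::UCD qw(num);

my $bengali_four = "\x{09EA}";    # BENGALI DIGIT FOUR, easily mistaken for "8"

my $u_match = $bengali_four =~ /^\d$/u ? 1 : 0;   # Unicode rules: a digit
my $a_match = $bengali_four =~ /^\d$/a ? 1 : 0;   # ASCII-restricted: not one

# num() gives the numeric value of a run of digits from a single script;
# mixing an ASCII digit with a Bengali one is expected to yield undef.
my $value = num($bengali_four);
my $mixed = num("1$bengali_four");

printf "u=%d a=%d value=%s mixed=%s\n",
    $u_match, $a_match, $value, defined $mixed ? $mixed : 'undef';
```
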
Also, under this modifier, case-insensitive matching works on the full
set of Unicode
characters. The C<KELVIN SIGN>, for example matches the letters "k" and
"K"; and C<LATIN SMALL LIGATURE FF> matches the sequence "ff", which,
if you're not prepared, might make it look like a hexadecimal constant,
presenting another potential security issue. See
L<http://unicode.org/reports/tr36> for a detailed discussion of Unicode
security issues.

This modifier may be specified to be the default by C<use feature
'unicode_strings'>, C<use locale ':not_characters'>, or
C<L<use 5.012|perlfunc/use VERSION>> (or higher),
but see L</Which character set modifier is in effect?>.
X</u>

=head4 /d

This modifier means to use the "Default" native rules of the platform
except when there is cause to use Unicode rules instead, as follows:

=over 4

=item 1

the target string is encoded in UTF-8; or

=item 2

the pattern is encoded in UTF-8; or

=item 3

the pattern explicitly mentions a code point that is above 255 (say by
C<\x{100}>); or

=item 4

the pattern uses a Unicode name (C<\N{...}>); or

=item 5

the pattern uses a Unicode property (C<\p{...}>); or

=item 6

the pattern uses L</C<(?[ ])>>

=back

Another mnemonic for this modifier is "Depends", as the rules actually
used depend on various things, and as a result you can get unexpected
results. See L<perlunicode/The "Unicode Bug">. The Unicode Bug has
become rather infamous, leading to yet another (printable) name for this
modifier, "Dodgy".

Unless the pattern or string is encoded in UTF-8, only ASCII characters
can match positively.

Here are some examples of how that works on an ASCII platform:

    $str =  "\xDF";      # $str is not in UTF-8 format.
    $str =~ /^\w/;       # No match, as $str isn't in UTF-8 format.
    $str .= "\x{0e0b}";  # Now $str is in UTF-8 format.
    $str =~ /^\w/;       # Match! $str is now in UTF-8 format.
    chop $str;
    $str =~ /^\w/;       # Still a match! $str remains in UTF-8 format.

This modifier is automatically selected by default when none of the
others are, so yet another name for it is "Default".

Because of the unexpected behaviors associated with this modifier, you
probably should only use it to maintain weird backward compatibilities.

=head4 /a (and /aa)

This modifier stands for ASCII-restrict (or ASCII-safe). This modifier,
unlike the others, may be doubled-up to increase its effect.

When it appears singly, it causes the sequences C<\d>, C<\s>, C<\w>, and
the Posix character classes to match only in the ASCII range. They thus
revert to their pre-5.6, pre-Unicode meanings. Under C</a>, C<\d>
always means precisely the digits C<"0"> to C<"9">; C<\s> means the five
characters C<[ \f\n\r\t]>, and starting in Perl v5.18, experimentally,
the vertical tab; C<\w> means the 63 characters
C<[A-Za-z0-9_]>; and likewise, all the Posix classes such as
C<[[:print:]]> match only the appropriate ASCII-range characters.

This modifier is useful for people who only incidentally use Unicode,
and who do not wish to be burdened with its complexities and security
concerns.

With C</a>, one can write C<\d> with confidence that it will only match
ASCII characters, and should the need arise to match beyond ASCII, you
can instead use C<\p{Digit}> (or C<\p{Word}> for C<\w>). There are
similar C<\p{...}> constructs that can match beyond ASCII both white
space (see L<perlrecharclass/Whitespace>), and Posix classes (see
L<perlrecharclass/POSIX Character Classes>). Thus, this modifier
doesn't mean you can't use Unicode, it means that to get Unicode
matching you must explicitly use a construct (C<\p{}>, C<\P{}>) that
signals Unicode.

As you would expect, this modifier causes, for example, C<\D> to mean
the same thing as C<[^0-9]>; in fact, all non-ASCII characters match
C<\D>, C<\S>, and C<\W>. C<\b> still means to match at the boundary
between C<\w> and C<\W>, using the C</a> definitions of them (similarly
for C<\B>).

Otherwise, C</a> behaves like the C</u> modifier, in that
case-insensitive matching uses Unicode rules; for example, "k" will
match the Unicode C<\N{KELVIN SIGN}> under C</i> matching, and code
points in the Latin-1 range above ASCII will follow Unicode rules when it
comes to case-insensitive matching.

To forbid ASCII/non-ASCII matches (like "k" with C<\N{KELVIN SIGN}>),
specify the "a" twice, for example C</aai> or C</aia>. (The first
occurrence of "a" restricts the C<\d>, etc., and the second occurrence
adds the C</i> restrictions.) But, note that code points outside the
ASCII range will use Unicode rules for C</i> matching, so the modifier
doesn't really restrict things to just ASCII; it just forbids the
intermixing of ASCII and non-ASCII.

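A minimal sketch of the difference between C</a> and C</aa> under C</i>, using the KELVIN SIGN case discussed above. The variable names are invented for the example, and the character is written as C<\x{212A}> rather than by name simply to keep the snippet short.

```perl
use strict;
use warnings;

my $kelvin = "\x{212A}";    # KELVIN SIGN

# With /i alone, or with a single "a", Unicode case folding applies,
# so the ASCII letter "k" matches the non-ASCII KELVIN SIGN.
my $plain_i  = $kelvin =~ /^k$/i   ? 1 : 0;
my $single_a = $kelvin =~ /^k$/ai  ? 1 : 0;

# Doubling the "a" forbids ASCII/non-ASCII case-insensitive matches.
my $double_a = $kelvin =~ /^k$/aai ? 1 : 0;

print "i=$plain_i ai=$single_a aai=$double_a\n";
```
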
To summarize, this modifier provides protection for applications that
don't wish to be exposed to all of Unicode. Specifying it twice
gives added protection.

This modifier may be specified to be the default by C<use re '/a'>
or C<use re '/aa'>. If you do so, you may actually have occasion to use
the C</u> modifier explicitly if there are a few regular expressions
where you do want full Unicode rules (but even here, it's best if
everything were under feature C<"unicode_strings">, along with the
C<use re '/aa'>). Also see L</Which character set modifier is in
effect?>.
X</a>
X</aa>

=head4 Which character set modifier is in effect?

Which of these modifiers is in effect at any given point in a regular
expression depends on a fairly complex set of interactions. These have
been designed so that in general you don't have to worry about it, but
this section gives the gory details. As
explained below in L</Extended Patterns> it is possible to explicitly
specify modifiers that apply only to portions of a regular expression.
The innermost always has priority over any outer ones, and one applying
to the whole expression has priority over any of the default settings that are
described in the remainder of this section.

The C<L<use re 'E<sol>foo'|re/"'/flags' mode">> pragma can be used to set
default modifiers (including these) for regular expressions compiled
within its scope. This pragma has precedence over the other pragmas
listed below that also change the defaults.

Otherwise, C<L<use locale|perllocale>> sets the default modifier to C</l>;
and C<L<use feature 'unicode_strings'|feature>>, or
C<L<use 5.012|perlfunc/use VERSION>> (or higher) set the default to
C</u> when not in the same scope as either C<L<use locale|perllocale>>
or C<L<use bytes|bytes>>.
(C<L<use locale ':not_characters'|perllocale/Unicode and UTF-8>> also
sets the default to C</u>, overriding any plain C<use locale>.)
Unlike the mechanisms mentioned above, these
affect operations besides regular expressions pattern matching, and so
give more consistent results with other operators, including using
C<\U>, C<\l>, etc. in substitution replacements.

If none of the above apply, for backwards compatibility reasons, the
C</d> modifier is the one in effect by default. As this can lead to
unexpected results, it is best to specify which other rule set should be
used.

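The precedence described in this section can be sketched with C<use re '/a'> as a lexically scoped default. The variable names are hypothetical, and the example assumes it is run without C<use locale> or C<use feature 'unicode_strings'> in effect, so that C</d> is the fallback outside the block.

```perl
use strict;
use warnings;

my $arabic_one = "\x{0661}";    # ARABIC-INDIC DIGIT ONE

my ($default_a, $explicit_u);
{
    use re '/a';                # /a becomes the default inside this block
    $default_a  = $arabic_one =~ /^\d$/  ? 1 : 0;
    $explicit_u = $arabic_one =~ /^\d$/u ? 1 : 0;   # explicit /u takes priority
}

# Outside the block /d applies again; the string is in UTF-8 format,
# so Unicode rules are used and the digit matches.
my $outside = $arabic_one =~ /^\d$/ ? 1 : 0;

print "default_a=$default_a explicit_u=$explicit_u outside=$outside\n";
```
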
=head4 Character set modifier behavior prior to Perl 5.14

Prior to 5.14, there were no explicit modifiers, but C</l> was implied
for regexes compiled within the scope of C<use locale>, and C</d> was
implied otherwise. However, interpolating a regex into a larger regex
would ignore the original compilation in favor of whatever was in effect
at the time of the second compilation. There were a number of
inconsistencies (bugs) with the C</d> modifier, where Unicode rules
would be used when inappropriate, and vice versa. C<\p{}> did not imply
Unicode rules, and neither did all occurrences of C<\N{}>, until 5.12.

=head2 Regular Expressions

=head3 Metacharacters

The patterns used in Perl pattern matching evolved from those supplied in
the Version 8 regex routines. (The routines are derived
(distantly) from Henry Spencer's freely redistributable reimplementation
of the V8 routines.) See L<Version 8 Regular Expressions> for
details.

In particular the following metacharacters have their standard I<egrep>-ish
meanings:
X<metacharacter>
X<\> X<^> X<.> X<$> X<|> X<(> X<()> X<[> X<[]>

    \        Quote the next metacharacter
    ^        Match the beginning of the line
    .        Match any character (except newline)
    $        Match the end of the string (or before newline at the end
             of the string)
    |        Alternation
    ()       Grouping
    []       Bracketed Character class

By default, the "^" character is guaranteed to match only the
beginning of the string, the "$" character only the end (or before the
newline at the end), and Perl does certain optimizations with the
assumption that the string contains only one line. Embedded newlines
will not be matched by "^" or "$". You may, however, wish to treat a
string as a multi-line buffer, such that the "^" will match after any
newline within the string (except if the newline is the last character in
the string), and "$" will match before any newline. At the
cost of a little more overhead, you can do this by using the C</m> modifier
on the pattern match operator. (Older programs did this by setting C<$*>,
but this option was removed in perl 5.10.)
X<^> X<$> X</m>

To simplify multi-line substitutions, the "." character never matches a
newline unless you use the C</s> modifier, which in effect tells Perl to pretend
the string is a single line--even if it isn't.
X<.> X</s>

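The default and multi-line behaviors of "^", "$" and "." described above can be sketched on a two-line buffer. The string and the variable names are made up for the illustration.

```perl
use strict;
use warnings;

my $buffer = "first\nsecond\n";

# Without /m, "^" only matches at the very start of the string.
my $no_m   = $buffer =~ /^second$/  ? 1 : 0;
# With /m, "^" also matches just after the embedded newline.
my $with_m = $buffer =~ /^second$/m ? 1 : 0;

# "$" matches before the newline that ends the string, even without /m.
my $before_final_nl = $buffer =~ /second$/ ? 1 : 0;

# "." skips newlines unless /s is given.
my $no_s   = $buffer =~ /first.second/  ? 1 : 0;
my $with_s = $buffer =~ /first.second/s ? 1 : 0;

print "$no_m $with_m $before_final_nl $no_s $with_s\n";
```
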
=head3 Quantifiers

The following standard quantifiers are recognized:
X<metacharacter> X<quantifier> X<*> X<+> X<?> X<{n}> X<{n,}> X<{n,m}>

    *           Match 0 or more times
    +           Match 1 or more times
    ?           Match 1 or 0 times
    {n}         Match exactly n times
    {n,}        Match at least n times
    {n,m}       Match at least n but not more than m times

(If a curly bracket occurs in any other context and does not form part of
a backslashed sequence like C<\x{...}>, it is treated as a regular
character. In particular, the lower quantifier bound is not optional,
and a typo in a quantifier silently causes it to be treated as the
literal characters. For example,

    /o{4,a}/

compiles to match the sequence of six characters
S<C<"o { 4 , a }">>. It is planned to eventually require literal uses
of curly brackets to be escaped, say by preceding them with a backslash
or enclosing them within square brackets, (C<"\{"> or C<"[{]">). This
change will allow for future syntax extensions (like making the lower
bound of a quantifier optional), and better error checking. In the
meantime, you should get in the habit of escaping all instances where
you mean a literal "{".)

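Following the advice above, a literal brace is best written in escaped form. The sketch below contrasts an escaped literal with a well-formed quantifier; the sample strings and variable names are invented, and the unescaped C</o{4,a}/> form is deliberately not used because its handling has varied between Perl versions.

```perl
use strict;
use warnings;

# Escaped braces are always literal characters.
my $literal    = "o{4,a}" =~ /^o\{4,a\}$/ ? 1 : 0;

# A well-formed {n} is a quantifier: exactly four "o" characters.
my $quantified = "oooo"   =~ /^o{4}$/     ? 1 : 0;

# The quantifier form does not match the literal text "o{4}".
my $text_form  = "o{4}"   =~ /^o{4}$/     ? 1 : 0;

print "$literal $quantified $text_form\n";
```
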
The "*" quantifier is equivalent to C<{0,}>, the "+"
quantifier to C<{1,}>, and the "?" quantifier to C<{0,1}>. n and m are limited
to non-negative integral values less than a preset limit defined when perl is built.
This is usually 32766 on the most common platforms. The actual limit can
be seen in the error message generated by code such as this:

    $_ **= $_ , / {$_} / for 2 .. 42;

By default, a quantified subpattern is "greedy", that is, it will match as
many times as possible (given a particular starting location) while still
allowing the rest of the pattern to match. If you want it to match the
minimum number of times possible, follow the quantifier with a "?". Note
that the meanings don't change, just the "greediness":
X<metacharacter> X<greedy> X<greediness>
X<?> X<*?> X<+?> X<??> X<{n}?> X<{n,}?> X<{n,m}?>

    *?        Match 0 or more times, not greedily
    +?        Match 1 or more times, not greedily
    ??        Match 0 or 1 time, not greedily
    {n}?      Match exactly n times, not greedily (redundant)
    {n,}?     Match at least n times, not greedily
    {n,m}?    Match at least n but not more than m times, not greedily

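A short sketch of greedy versus non-greedy matching, using an invented C<key=value=more> string. Both patterns succeed; only the amount of text taken by the quantified part differs.

```perl
use strict;
use warnings;

my $line = "key=value=more";

# Greedy: .+ takes as much as it can while still leaving an "=" to match.
my ($greedy) = $line =~ /^(.+)=/;

# Non-greedy: .+? stops at the first place the rest of the pattern can match.
my ($lazy)   = $line =~ /^(.+?)=/;

print "greedy=$greedy lazy=$lazy\n";
```
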
Normally when a quantified subpattern does not allow the rest of the
overall pattern to match, Perl will backtrack. However, this behaviour is
sometimes undesirable. Thus Perl provides the "possessive" quantifier form
as well.

    *+      Match 0 or more times and give nothing back
    ++      Match 1 or more times and give nothing back
    ?+      Match 0 or 1 time and give nothing back
    {n}+    Match exactly n times and give nothing back (redundant)
    {n,}+   Match at least n times and give nothing back
    {n,m}+  Match at least n but not more than m times and give nothing back

For instance,

    'aaaa' =~ /a++a/

will never match, as the C<a++> will gobble up all the C<a>'s in the
string and won't leave any for the remaining part of the pattern. This
feature can be extremely useful to give perl hints about where it
shouldn't backtrack. For instance, the typical "match a double-quoted
string" problem can be most efficiently performed when written as:

    /"(?:[^"\\]++|\\.)*+"/

as we know that if the final quote does not match, backtracking will not
help. See the independent subexpression
L</C<< (?>pattern) >>> for more details;
possessive quantifiers are just syntactic sugar for that construct. For
instance the above example could also be written as follows:

    /"(?>(?:(?>[^"\\]+)|\\.)*)"/

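The two examples above can be exercised with a small sketch. The test strings and variable names are made up; the double-quoted-string pattern is the one given in the text.

```perl
use strict;
use warnings;

# The possessive a++ keeps every "a", so the trailing "a" can never match.
my $possessive = 'aaaa' =~ /a++a/ ? 1 : 0;
# The ordinary greedy form backtracks one character and succeeds.
my $greedy     = 'aaaa' =~ /a+a/  ? 1 : 0;

# The double-quoted-string pattern from the text above.
my $quoted_re = qr/"(?:[^"\\]++|\\.)*+"/;

# Single quotes keep the backslashes, so the string contains \" sequences.
my ($quoted)     = 'say "a \"quoted\" word" now' =~ /($quoted_re)/;
my $unterminated = '"no closing quote' =~ $quoted_re ? 1 : 0;

print "$possessive $greedy $quoted $unterminated\n";
```

On the unterminated input the possessive form fails without retrying shorter runs of characters, which is the efficiency gain the text refers to.
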
Note that the possessive quantifier modifier cannot be combined
with the non-greedy modifier. This is because it would make no sense.
Consider the following equivalency table:

    Illegal         Legal
    ------------    ------
    X??+            X{0}
    X+?+            X{1}
    X{min,max}?+    X{min}

=head3 Escape sequences

Because patterns are processed as double-quoted strings, the following
also work:

    \t          tab                   (HT, TAB)
    \n          newline               (LF, NL)
    \r          return                (CR)
    \f          form feed             (FF)
    \a          alarm (bell)          (BEL)
    \e          escape (think troff)  (ESC)
    \cK         control char          (example: VT)
    \x{}, \x00  character whose ordinal is the given hexadecimal number
    \N{name}    named Unicode character or character sequence
    \N{U+263D}  Unicode character     (example: FIRST QUARTER MOON)
    \o{}, \000  character whose ordinal is the given octal number
    \l          lowercase next char (think vi)
    \u          uppercase next char (think vi)
    \L          lowercase till \E (think vi)
    \U          uppercase till \E (think vi)
    \Q          quote (disable) pattern metacharacters till \E
    \E          end either case modification or quoted section, think vi

Details are in L<perlop/Quote and Quote-like Operators>.

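As an illustration of C<\Q...\E> from the table above, the sketch below interpolates the same string with and without quoting. The variable names and sample strings are invented for the example.

```perl
use strict;
use warnings;

my $literal = "a.b*c";

# With \Q...\E the "." and "*" are matched as ordinary characters.
my $exact  = "a.b*c"  =~ /^\Q$literal\E$/ ? 1 : 0;
my $reject = "aXbbbc" =~ /^\Q$literal\E$/ ? 1 : 0;

# Without quoting, the interpolated text is treated as a pattern:
# "." matches any character and "b*" matches a run of "b".
my $loose  = "aXbbbc" =~ /^$literal$/     ? 1 : 0;

print "$exact $reject $loose\n";
```
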
=head3 Character Classes and other Special Escapes

In addition, Perl defines the following:
X<\g> X<\k> X<\K> X<backreference>

    Sequence   Note  Description
    [...]      [1]   Match a character according to the rules of the
                     bracketed character class defined by the "...".
                     Example: [a-z] matches "a" or "b" or "c" ... or "z"
    [[:...:]]  [2]   Match a character according to the rules of the POSIX
                     character class "..." within the outer bracketed
                     character class. Example: [[:upper:]] matches any
                     uppercase character.
    (?[...])   [8]   Extended bracketed character class
    \w         [3]   Match a "word" character (alphanumeric plus "_", plus
                     other connector punctuation chars plus Unicode
                     marks)
    \W         [3]   Match a non-"word" character
    \s         [3]   Match a whitespace character
    \S         [3]   Match a non-whitespace character
    \d         [3]   Match a decimal digit character
    \D         [3]   Match a non-digit character
    \pP        [3]   Match P, named property. Use \p{Prop} for longer names
    \PP        [3]   Match non-P
    \X         [4]   Match Unicode "eXtended grapheme cluster"
    \C               Match a single C-language char (octet) even if that is
                     part of a larger UTF-8 character. Thus it breaks up
                     characters into their UTF-8 bytes, so you may end up
                     with malformed pieces of UTF-8. Unsupported in
                     lookbehind. (Deprecated.)
    \1         [5]   Backreference to a specific capture group or buffer.
                     '1' may actually be any positive integer.
    \g1        [5]   Backreference to a specific or previous group,
    \g{-1}     [5]   The number may be negative indicating a relative
                     previous group and may optionally be wrapped in
                     curly brackets for safer parsing.
    \g{name}   [5]   Named backreference
    \k<name>   [5]   Named backreference
    \K         [6]   Keep the stuff left of the \K, don't include it in $&
    \N         [7]   Any character but \n. Not affected by /s modifier
    \v         [3]   Vertical whitespace
    \V         [3]   Not vertical whitespace
    \h         [3]   Horizontal whitespace
    \H         [3]   Not horizontal whitespace
    \R         [4]   Linebreak

=over 4

=item [1]

See L<perlrecharclass/Bracketed Character Classes> for details.

=item [2]

See L<perlrecharclass/POSIX Character Classes> for details.

=item [3]

See L<perlrecharclass/Backslash sequences> for details.

=item [4]

See L<perlrebackslash/Misc> for details.

=item [5]

See L</Capture groups> below for details.

=item [6]

See L</Extended Patterns> below for details.

=item [7]

Note that C<\N> has two meanings. When of the form C<\N{NAME}>, it matches the
character or character sequence whose name is C<NAME>; and similarly
when of the form C<\N{U+I<hex>}>, it matches the character whose Unicode
code point is I<hex>. Otherwise it matches any character but C<\n>.

=item [8]

See L<perlrecharclass/Extended Bracketed Character Classes> for details.

=back

=head3 Assertions

Perl defines the following zero-width assertions:
X<zero-width assertion> X<assertion> X<regex, zero-width assertion>
X<regexp, zero-width assertion>
X<regular expression, zero-width assertion>
X<\b> X<\B> X<\A> X<\Z> X<\z> X<\G>

    \b  Match a word boundary
    \B  Match except at a word boundary
    \A  Match only at beginning of string
    \Z  Match only at end of string, or before newline at the end
    \z  Match only at end of string
    \G  Match only at pos() (e.g. at the end-of-match position
        of prior m//g)

A word boundary (C<\b>) is a spot between two characters
that has a C<\w> on one side of it and a C<\W> on the other side
of it (in either order), counting the imaginary characters off the
beginning and end of the string as matching a C<\W>. (Within
character classes C<\b> represents backspace rather than a word
boundary, just as it normally does in any double-quoted string.)
The C<\A> and C<\Z> are just like "^" and "$", except that they
won't match multiple times when the C</m> modifier is used, while
"^" and "$" will match at every internal line boundary. To match
the actual end of the string and not ignore an optional trailing
newline, use C<\z>.
X<\b> X<\A> X<\Z> X<\z> X</m>

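The difference between these anchors is easiest to see on a string that
ends in a newline. The following is a minimal sketch; the sample text and
variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = "first\nsecond\n";

# Under /m, "^" and "$" match at every line boundary ...
my @caret_m  = $text =~ /^(\w+)/mg;    # ("first", "second")
my @dollar_m = $text =~ /(\w+)$/mg;    # ("first", "second")

# ... while \A, \Z and \z ignore /m entirely.
my @big_a    = $text =~ /\A(\w+)/mg;   # ("first")
my @big_z    = $text =~ /(\w+)\Z/mg;   # ("second"): final newline tolerated
my @small_z  = $text =~ /(\w+)\z/mg;   # (): the very end follows "\n"
```
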
The C<\G> assertion can be used to chain global matches (using
C<m//g>), as described in L<perlop/"Regexp Quote-Like Operators">.
It is also useful when writing C<lex>-like scanners, when you have
several patterns that you want to match against consecutive substrings
of your string; see the previous reference. The actual location
where C<\G> will match can also be influenced by using C<pos()> as
an lvalue: see L<perlfunc/pos>. Note that the rule for zero-length
matches (see L</"Repeated Patterns Matching a Zero-length Substring">)
is modified somewhat, in that contents to the left of C<\G> are
not counted when determining the length of the match. Thus the following
will not match forever:
X<\G>

     my $string = 'ABC';
     pos($string) = 1;
     while ($string =~ /(.\G)/g) {
         print $1;
     }

It will print 'A' and then terminate, as it considers the match to
be zero-width, and thus will not match at the same position twice in a
row.

It is worth noting that C<\G> improperly used can result in an infinite
loop. Take care when using patterns that include C<\G> in an alternation.

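A C<lex>-like scanner of the kind mentioned above might be sketched as
follows. The token classes and the input string are assumptions made for
the example; the C</c> modifier keeps C<pos()> in place when one
alternative fails, so that the next pattern can try at the same spot.

```perl
use strict;
use warnings;

my $src = "x = 42 + y";              # hypothetical input
my @tokens;

for ($src) {                         # alias $_ to $src
    while (1) {
        if    (/\G\s+/gc)            { next }                    # skip blanks
        elsif (/\G(\d+)/gc)          { push @tokens, "NUM:$1" }
        elsif (/\G([A-Za-z_]\w*)/gc) { push @tokens, "ID:$1" }
        elsif (/\G([=+])/gc)         { push @tokens, "OP:$1" }
        else                         { last }    # end of input or unknown char
    }
}
# @tokens is now ("ID:x", "OP:=", "NUM:42", "OP:+", "ID:y")
```

Each branch anchors at C<\G>, so no input can be skipped silently: if none
of the patterns matches at the current position, the loop stops there.
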
Note also that C<s///> will refuse to overwrite part of a substitution
that has already been replaced; so for example this will stop after the
first iteration, rather than iterating its way backwards through the
string:

    $_ = "123456789";
    pos = 6;
    s/.(?=.\G)/X/g;
    print;      # prints 1234X6789, not XXXXX6789

=head3 Capture groups

The bracketing construct C<( ... )> creates capture groups (also referred to as
capture buffers). To refer to the current contents of a group later on, within
the same pattern, use C<\g1> (or C<\g{1}>) for the first, C<\g2> (or C<\g{2}>)
for the second, and so on.
This is called a I<backreference>.
X<regex, capture buffer> X<regexp, capture buffer>
X<regex, capture group> X<regexp, capture group>
X<regular expression, capture buffer> X<backreference>
X<regular expression, capture group> X<backreference>
X<\g{1}> X<\g{-1}> X<\g{name}> X<relative backreference> X<named backreference>
X<named capture buffer> X<regular expression, named capture buffer>
X<named capture group> X<regular expression, named capture group>
X<%+> X<$+{name}> X<< \k<name> >>
There is no limit to the number of captured substrings that you may use.
Groups are numbered with the leftmost open parenthesis being number 1, etc. If
a group did not match, the associated backreference won't match either. (This
can happen if the group is optional, or in a different branch of an
alternation.)
You can omit the C<"g">, and write C<"\1">, etc, but there are some issues with
this form, described below.

You can also refer to capture groups relatively, by using a negative number, so
that C<\g-1> and C<\g{-1}> both refer to the immediately preceding capture
group, and C<\g-2> and C<\g{-2}> both refer to the group before it. For
example:

        /
         (Y)            # group 1
         (              # group 2
            (X)         # group 3
            \g{-1}      # backref to group 3
            \g{-3}      # backref to group 1
         )
        /x

would match the same as C</(Y) ( (X) \g3 \g1 )/x>. This allows you to
interpolate regexes into larger regexes and not have to worry about the
capture groups being renumbered.

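To illustrate the point about interpolation, here is a minimal sketch in
which a fragment using a relative backreference is embedded in a larger
pattern. The fragment, the surrounding pattern and the test strings are
invented for the example.

```perl
use strict;
use warnings;

# A reusable fragment: one word character followed by the same character.
# \g{-1} means "the group just before this point", whatever its number.
my $doubled = qr/(\w)\g{-1}/;

# Embedded after another group, the fragment's group becomes number 2,
# but the relative reference still points at it.
my $re = qr/(\d+)-$doubled/;

my ($num, $char) = "42-oo" =~ $re;       # $num is "42", $char is "o"
my $miss         = "42-ox" =~ $re ? 1 : 0;   # 0: "o" is not doubled
```

Had the fragment used C<\g1> instead, the combined pattern would have
compared against the digits captured by the first group.
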
You can dispense with numbers altogether and create named capture groups.
The notation is C<(?E<lt>I<name>E<gt>...)> to declare and C<\g{I<name>}> to
reference. (To be compatible with .NET regular expressions, C<\g{I<name>}> may
also be written as C<\k{I<name>}>, C<\kE<lt>I<name>E<gt>> or C<\k'I<name>'>.)
I<name> must not begin with a number, nor contain hyphens.
When different groups within the same pattern have the same name, any reference
to that name assumes the leftmost defined group. Named groups count in
absolute and relative numbering, and so can also be referred to by those
numbers.
(It's possible to do things with named capture groups that would otherwise
require C<(??{})>.)

Capture group contents are dynamically scoped and available to you outside the
pattern until the end of the enclosing block or until the next successful
match, whichever comes first. (See L<perlsyn/"Compound Statements">.)
You can refer to them by absolute number (using C<"$1"> instead of C<"\g1">,
etc); or by name via the C<%+> hash, using C<"$+{I<name>}">.

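For instance, a named group can be read back after the match either by
name or by its position. This is only a sketch; the input string and the
group names are assumptions for the example.

```perl
use strict;
use warnings;

my ($key, $value, $same);
if ("colour=red" =~ /(?<key>\w+)=(?<value>\w+)/) {
    $key   = $+{key};              # by name, via the %+ hash
    $value = $2;                   # by absolute number
    $same  = $1 eq $+{key} ? 1 : 0;    # 1: named groups are numbered too
}
```
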
Braces are required in referring to named capture groups, but are optional for
absolute or relative numbered ones. Braces are safer when creating a regex by
concatenating smaller strings. For example if you have C<qr/$a$b/>, and C<$a>
contained C<"\g1">, and C<$b> contained C<"37">, you would get C</\g137/> which
is probably not what you intended.

The C<\g> and C<\k> notations were introduced in Perl 5.10.0. Prior to that
there were no named or relative numbered capture groups. Absolute numbered
groups were referred to using C<\1>, C<\2>, etc., and this notation is still
accepted (and likely always will be). But it leads to some ambiguities if
there are more than 9 capture groups, as C<\10> could mean either the tenth
capture group, or the character whose ordinal in octal is 010 (a backspace in
ASCII). Perl resolves this ambiguity by interpreting C<\10> as a backreference
only if at least 10 left parentheses have opened before it. Likewise C<\11> is
a backreference only if at least 11 left parentheses have opened before it.
And so on. C<\1> through C<\9> are always interpreted as backreferences.
There are several examples below that illustrate these perils. You can avoid
the ambiguity by always using C<\g{}> or C<\g> if you mean capturing groups;
and for octal constants always using C<\o{}>, or for C<\077> and below, using 3
digits padded with leading zeros, since a leading zero implies an octal
constant.

The C<\I<digit>> notation also works in certain circumstances outside
the pattern. See L</Warning on \1 Instead of $1> below for details.

Examples:

    s/^([^ ]*) *([^ ]*)/$2 $1/;     # swap first two words

    /(.)\g1/                        # find first doubled char
         and print "'$1' is the first doubled character\n";

    /(?<char>.)\k<char>/            # ... a different way
         and print "'$+{char}' is the first doubled character\n";

    /(?'char'.)\g1/                 # ... mix and match
         and print "'$1' is the first doubled character\n";

    if (/Time: (..):(..):(..)/) {   # parse out values
        $hours = $1;
        $minutes = $2;
        $seconds = $3;
    }

    /(.)(.)(.)(.)(.)(.)(.)(.)(.)\g10/   # \g10 is a backreference
    /(.)(.)(.)(.)(.)(.)(.)(.)(.)\10/    # \10 is octal
    /((.)(.)(.)(.)(.)(.)(.)(.)(.))\10/  # \10 is a backreference
    /((.)(.)(.)(.)(.)(.)(.)(.)(.))\010/ # \010 is octal

    $a = '(.)\1';        # Creates problems when concatenated.
    $b = '(.)\g{1}';     # Avoids the problems.
    "aa" =~ /${a}/;      # True
    "aa" =~ /${b}/;      # True
    "aa0" =~ /${a}0/;    # False!
    "aa0" =~ /${b}0/;    # True
    "aa\x08" =~ /${a}0/;  # True!
    "aa\x08" =~ /${b}0/;  # False

Several special variables also refer back to portions of the previous
match. C<$+> returns whatever the last bracket match matched.
C<$&> returns the entire matched string. (At one point C<$0> did
also, but now it returns the name of the program.) C<$`> returns
everything before the matched string. C<$'> returns everything
after the matched string. And C<$^N> contains whatever was matched by
the most-recently closed group (submatch). C<$^N> can be used in
extended patterns (see below), for example to assign a submatch to a
variable.
X<$+> X<$^N> X<$&> X<$`> X<$'>

These special variables, like the C<%+> hash and the numbered match variables
(C<$1>, C<$2>, C<$3>, etc.) are dynamically scoped
until the end of the enclosing block or until the next successful
match, whichever comes first. (See L<perlsyn/"Compound Statements">.)
X<$+> X<$^N> X<$&> X<$`> X<$'>
X<$1> X<$2> X<$3> X<$4> X<$5> X<$6> X<$7> X<$8> X<$9>

B<NOTE>: Failed matches in Perl do not reset the match variables,
which makes it easier to write code that tests for a series of more
specific cases and remembers the best match.

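The following sketch shows what this means in practice, and why it is
usually wise to read C<$1> only after checking that the match succeeded.
The input string and variable names are made up for the example.

```perl
use strict;
use warnings;

my $input = "id=7";                  # hypothetical input

$input =~ /id=(\d+)/;                # succeeds: $1 is "7"
my $after_success = $1;

$input =~ /name=(\w+)/;              # fails: $1 keeps its old value
my $after_failure = $1;              # still "7"

# Safer habit: consult $1 only when the match itself reported success.
my $name = $input =~ /name=(\w+)/ ? $1 : undef;   # undef
```
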
B<WARNING>: If your code is to run on Perl 5.16 or earlier,
beware that once Perl sees that you need one of C<$&>, C<$`>, or
C<$'> anywhere in the program, it has to provide them for every
pattern match. This may substantially slow your program.

Perl uses the same mechanism to produce C<$1>, C<$2>, etc, so you also
pay a price for each pattern that contains capturing parentheses.
(To avoid this cost while retaining the grouping behaviour, use the
extended regular expression C<(?: ... )> instead.) But if you never
use C<$&>, C<$`> or C<$'>, then patterns I<without> capturing
parentheses will not be penalized. So avoid C<$&>, C<$'>, and C<$`>
if you can, but if you can't (and some algorithms really appreciate
them), once you've used them once, use them at will, because you've
already paid the price.
X<$&> X<$`> X<$'>

Perl 5.16 introduced a slightly more efficient mechanism that notes
separately whether each of C<$`>, C<$&>, and C<$'> have been seen, and
thus may only need to copy part of the string. Perl 5.20 introduced a
much more efficient copy-on-write mechanism which eliminates any slowdown.

As another workaround for this problem, Perl 5.10.0 introduced C<${^PREMATCH}>,
C<${^MATCH}> and C<${^POSTMATCH}>, which are equivalent to C<$`>, C<$&>
and C<$'>, B<except> that they are only guaranteed to be defined after a
successful match that was executed with the C</p> (preserve) modifier.
The use of these variables incurs no global performance penalty, unlike
their punctuation char equivalents, however at the trade-off that you
have to tell perl when you want to use them. As of Perl 5.20, these three
variables are equivalent to C<$`>, C<$&> and C<$'>, and C</p> is ignored.
X</p> X<p modifier>

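A minimal sketch of the C</p> variables, using an invented sample string:

```perl
use strict;
use warnings;

my ($pre, $match, $post);
if ("hello world" =~ /o w/p) {       # /p asks Perl to preserve the pieces
    ($pre, $match, $post) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});
}
# $pre is "hell", $match is "o w", $post is "orld"
```

Writing the C</p> explicitly keeps the code portable to the Perl versions
in which these variables are only guaranteed to be set under that modifier.
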
=head2 Quoting metacharacters

Backslashed metacharacters in Perl are alphanumeric, such as C<\b>,
C<\w>, C<\n>. Unlike some other regular expression languages, there
are no backslashed symbols that aren't alphanumeric. So anything
that looks like \\, \(, \), \[, \], \{, or \} is always
interpreted as a literal character, not a metacharacter. This was
once used in a common idiom to disable or quote the special meanings
of regular expression metacharacters in a string that you want to
use for a pattern. Simply quote all non-"word" characters:

    $pattern =~ s/(\W)/\\$1/g;

(If C<use locale> is set, then this depends on the current locale.)
Today it is more common to use the quotemeta() function or the C<\Q>
metaquoting escape sequence to disable all metacharacters' special
meanings like this:

    /$unquoted\Q$quoted\E$unquoted/

Beware that if you put literal backslashes (those not inside
interpolated variables) between C<\Q> and C<\E>, double-quotish
backslash interpolation may lead to confusing results. If you
I<need> to use literal backslashes within C<\Q...\E>,
consult L<perlop/"Gory details of parsing quoted constructs">.

C<quotemeta()> and C<\Q> are fully described in L<perlfunc/quotemeta>.

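As a sketch of why quoting matters, consider matching a piece of text that
happens to contain metacharacters. The strings below are invented for the
example.

```perl
use strict;
use warnings;

my $user_input = "1+1=2 (really?)";  # hypothetical untrusted text
my $haystack   = "is 1+1=2 (really?) so";

my $quoted = quotemeta($user_input); # backslashes every non-word character

my $hit_q  = $haystack =~ /\Q$user_input\E/ ? 1 : 0;   # 1
my $hit_fn = $haystack =~ /$quoted/         ? 1 : 0;   # 1
my $naive  = $haystack =~ /$user_input/     ? 1 : 0;   # 0: "+", "(", "?" are special
```
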
=head2 Extended Patterns

Perl also defines a consistent extension syntax for features not
found in standard tools like B<awk> and
B<lex>. The syntax for most of these is a
pair of parentheses with a question mark as the first thing within
the parentheses. The character after the question mark indicates
the extension.

The stability of these extensions varies widely. Some have been
part of the core language for many years. Others are experimental
and may change without warning or be completely removed. Check
the documentation on an individual feature to verify its current
status.

A question mark was chosen for this and for the minimal-matching
construct because 1) question marks are rare in older regular
expressions, and 2) whenever you see one, you should stop and
"question" exactly what is going on. That's psychology....

=over 4

=item C<(?#text)>
X<(?#)>

A comment. The text is ignored.
Note that Perl closes
the comment as soon as it sees a C<)>, so there is no way to put a literal
C<)> in the comment. The pattern's closing delimiter must be escaped by
a backslash if it appears in the comment.

See L</E<sol>x> for another way to have comments in patterns.

=item C<(?adlupimsx-imsx)>

=item C<(?^alupimsx)>
X<(?)> X<(?^)>

One or more embedded pattern-match modifiers, to be turned on (or
turned off, if preceded by C<->) for the remainder of the pattern or
the remainder of the enclosing pattern group (if any).

This is particularly useful for dynamic patterns, such as those read in from a
configuration file, taken from an argument, or specified in a table
somewhere. Consider the case where some patterns want to be
case-sensitive and some do not: The case-insensitive ones merely need to
include C<(?i)> at the front of the pattern. For example:

    $pattern = "foobar";
    if ( /$pattern/i ) { }

    # more flexible:

    $pattern = "(?i)foobar";
    if ( /$pattern/ ) { }

These modifiers are restored at the end of the enclosing group. For example,

    ( (?i) blah ) \s+ \g1

will match C<blah> in any case, some spaces, and an exact (I<including the case>!)
repetition of the previous word, assuming the C</x> modifier, and no C</i>
modifier outside this group.

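That behaviour can be checked with a short sketch such as the following,
where the test strings are invented for the example:

```perl
use strict;
use warnings;

# (?i) applies only inside the group; the backreference is outside it.
my $re = qr/ ( (?i) blah ) \s+ \g1 /x;

my $same_case  = "BlAh BlAh" =~ $re ? 1 : 0;   # 1: exact repetition
my $other_case = "BlAh blah" =~ $re ? 1 : 0;   # 0: \g1 is case-sensitive here
```
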
These modifiers do not carry over into named subpatterns called in the
enclosing group. In other words, a pattern such as C<((?i)(?&NAME))> does not
change the case-sensitivity of the "NAME" pattern.

Any of these modifiers can be set to apply globally to all regular
expressions compiled within the scope of a C<use re>. See
L<re/"'/flags' mode">.

Starting in Perl 5.14, a C<"^"> (caret or circumflex accent) immediately
after the C<"?"> is a shorthand equivalent to C<d-imsx>. Flags (except
C<"d">) may follow the caret to override it.
But a minus sign is not legal with it.

Note that the C<a>, C<d>, C<l>, C<p>, and C<u> modifiers are special in
that they can only be enabled, not disabled, and the C<a>, C<d>, C<l>, and
C<u> modifiers are mutually exclusive: specifying one de-specifies the
others, and a maximum of one (or two C<a>'s) may appear in the
construct. Thus, for
example, C<(?-p)> will warn when compiled under C<use warnings>;
C<(?-d:...)> and C<(?dl:...)> are fatal errors.

Note also that the C<p> modifier is special in that its presence
anywhere in a pattern has a global effect.

=item C<(?:pattern)>
X<(?:)>

=item C<(?adluimsx-imsx:pattern)>

=item C<(?^aluimsx:pattern)>
X<(?^:)>

This is for clustering, not capturing; it groups subexpressions like
"()", but doesn't make backreferences as "()" does. So

    @fields = split(/\b(?:a|b|c)\b/)

is like

    @fields = split(/\b(a|b|c)\b/)

but doesn't spit out extra fields. It's also cheaper not to capture
characters if you don't need to.

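The "extra fields" are the captured separators, which C<split> returns
along with the pieces. A small sketch, with an invented input string and a
slightly simplified separator pattern:

```perl
use strict;
use warnings;

my $str = "1 a 2 b 3";               # hypothetical input

my @plain    = split /\s+(?:a|b|c)\s+/, $str;   # ("1", "2", "3")
my @captured = split /\s+(a|b|c)\s+/,   $str;   # ("1", "a", "2", "b", "3")
```
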
Any letters between C<?> and C<:> act as flag modifiers as with
C<(?adluimsx-imsx)>. For example,

    /(?s-i:more.*than).*million/i

is equivalent to the more verbose

    /(?:(?s-i)more.*than).*million/i

Starting in Perl 5.14, a C<"^"> (caret or circumflex accent) immediately
after the C<"?"> is a shorthand equivalent to C<d-imsx>. Any positive
flags (except C<"d">) may follow the caret, so

    (?^x:foo)

is equivalent to

    (?x-ims:foo)

The caret tells Perl that this cluster doesn't inherit the flags of any
surrounding pattern, but uses the system defaults (C<d-imsx>),
modified by any flags specified.

The caret allows for simpler stringification of compiled regular
expressions. These look like

    (?^:pattern)

with any non-default flags appearing between the caret and the colon.
A test that looks at such stringification thus doesn't need to have the
system default flags hard-coded in it, just the caret. If new flags are
added to Perl, the meaning of the caret's expansion will change to include
the default for those flags, so the test will still work, unchanged.

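The stringified form can be inspected directly. The sketch below assumes
Perl 5.14 or later and no pragma in scope that changes the default
character set modifier (such a pragma would add a further flag after the
caret).

```perl
use strict;
use warnings;

my $plain = qr/foo/;
my $flags = qr/foo/i;

my $plain_str = "$plain";            # "(?^:foo)"
my $flags_str = "$flags";            # "(?^i:foo)"

# Because of the caret, the /i travels with the fragment when it is
# interpolated, and does not leak into the surrounding pattern.
my $combined = qr/^$flags-bar$/;
my $upper_ok = "FOO-bar" =~ $combined ? 1 : 0;   # 1
my $bar_case = "foo-BAR" =~ $combined ? 1 : 0;   # 0
```
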
Specifying a negative flag after the caret is an error, as the flag is
redundant.

Mnemonic for C<(?^...)>: A fresh beginning since the usual use of a caret is
to match at the beginning.

=item C<(?|pattern)>
X<(?|)> X<Branch reset>

This is the "branch reset" pattern, which has the special property
that the capture groups are numbered from the same starting point
in each alternation branch. It is available starting from perl 5.10.0.

Capture groups are numbered from left to right, but inside this
construct the numbering is restarted for each branch.

The numbering within each branch will be as normal, and any groups
following this construct will be numbered as though the construct
contained only one branch, that being the one with the most capture
groups in it.

This construct is useful when you want to capture one of a
number of alternative matches.

Consider the following pattern. The numbers underneath show in
which group the captured content will be stored.

    # before  --------------branch-reset------------ after
    / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
    # 1            2         2  3        2     3     4

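In practice this means the caller can read the same variable whichever
branch matched. A minimal sketch, with invented units and inputs:

```perl
use strict;
use warnings;

# Both alternatives store their number in group 1.
my $re = qr/(?| (\d+)\s*kg | (\d+)\s*lb )/x;

my @amounts;
for my $s ("12 kg", "30lb") {        # hypothetical inputs
    push @amounts, $1 if $s =~ $re;
}
# @amounts is (12, 30)
```

Without the branch reset, the second alternative would have filled C<$2>
and left C<$1> undefined.
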
Be careful when using the branch reset pattern in combination with
named captures. Named captures are implemented as being aliases to
numbered groups holding the captures, and that interferes with the
implementation of the branch reset pattern. If you are using named
captures in a branch reset pattern, it's best to use the same names,
in the same order, in each of the alternations:

   /(?|  (?<a> x ) (?<b> y )
      |  (?<a> z ) (?<b> w )) /x

Not doing so may lead to surprises:

  "12" =~ /(?| (?<a> \d+ ) | (?<b> \D+))/x;
  say $+ {a};   # Prints '12'
  say $+ {b};   # *Also* prints '12'.

The problem here is that both the group named C<< a >> and the group
named C<< b >> are aliases for the group belonging to C<< $1 >>.

=item Look-Around Assertions
X<look-around assertion> X<lookaround assertion> X<look-around> X<lookaround>

Look-around assertions are zero-width patterns which match a specific
pattern without including it in C<$&>. Positive assertions match when
their subpattern matches, negative assertions match when their subpattern
fails. Look-behind matches text up to the current match position,
look-ahead matches text following the current match position.

=over 4

=item C<(?=pattern)>
X<(?=)> X<look-ahead, positive> X<lookahead, positive>

A zero-width positive look-ahead assertion. For example, C</\w+(?=\t)/>
matches a word followed by a tab, without including the tab in C<$&>.

=item C<(?!pattern)>
X<(?!)> X<look-ahead, negative> X<lookahead, negative>

A zero-width negative look-ahead assertion. For example C</foo(?!bar)/>
matches any occurrence of "foo" that isn't followed by "bar". Note
however that look-ahead and look-behind are NOT the same thing. You cannot
use this for look-behind.

If you are looking for a "bar" that isn't preceded by a "foo", C</(?!foo)bar/>
will not do what you want. That's because the C<(?!foo)> is just saying that
the next thing cannot be "foo"--and it's not, it's a "bar", so "foobar" will
match. Use look-behind instead (see below).

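The difference can be seen in a short sketch; the test strings are the
ones used in the discussion above plus one invented counter-example.

```perl
use strict;
use warnings;

# Look-ahead placed before "bar" only inspects "bar" itself.
my $ahead  = "foobar" =~ /(?!foo)bar/  ? 1 : 0;   # 1, perhaps unexpectedly

# Look-behind inspects what precedes "bar".
my $behind = "foobar" =~ /(?<!foo)bar/ ? 1 : 0;   # 0
my $other  = "bazbar" =~ /(?<!foo)bar/ ? 1 : 0;   # 1
```
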
=item C<(?<=pattern)> C<\K>
X<(?<=)> X<look-behind, positive> X<lookbehind, positive> X<\K>

A zero-width positive look-behind assertion. For example, C</(?<=\t)\w+/>
matches a word that follows a tab, without including the tab in C<$&>.
Works only for fixed-width look-behind.

There is a special form of this construct, called C<\K>, which causes the
regex engine to "keep" everything it had matched prior to the C<\K> and
not include it in C<$&>. This effectively provides variable-length
look-behind. The use of C<\K> inside of another look-around assertion
is allowed, but the behaviour is currently not well defined.

For various reasons C<\K> may be significantly more efficient than the
equivalent C<< (?<=...) >> construct, and it is especially useful in
situations where you want to efficiently remove something following
something else in a string. For instance

  s/(foo)bar/$1/g;

can be rewritten as the much more efficient

  s/foo\Kbar//g;

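The two substitutions can be compared side by side, and C<\K> also copes
with a prefix of varying length. This is a sketch only; the sample strings
are invented.

```perl
use strict;
use warnings;

(my $with_capture = "foobar foobaz") =~ s/(foo)bar/$1/g;   # "foo foobaz"
(my $with_keep    = "foobar foobaz") =~ s/foo\Kbar//g;     # "foo foobaz"

# The part before \K may vary in length, unlike a (?<=...) look-behind.
(my $masked = "password =  hunter2") =~ s/password\s*=\s*\K\S+/****/;
# $masked is "password =  ****"
```
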
=item C<(?<!pattern)>
X<(?<!)> X<look-behind, negative> X<lookbehind, negative>

A zero-width negative look-behind assertion. For example C</(?<!bar)foo/>
matches any occurrence of "foo" that does not follow "bar". Works
only for fixed-width look-behind.

=back

=item C<(?'NAME'pattern)>

=item C<< (?<NAME>pattern) >>
X<< (?<NAME>) >> X<(?'NAME')> X<named capture> X<capture>

A named capture group. Identical in every respect to normal capturing
parentheses C<()> but for the additional fact that the group
can be referred to by name in various regular expression
constructs (like C<\g{NAME}>) and can be accessed by name
after a successful match via C<%+> or C<%->. See L<perlvar>
for more details on the C<%+> and C<%-> hashes.

If multiple distinct capture groups have the same name, then
C<$+{NAME}> will refer to the leftmost defined group in the match.

The forms C<(?'NAME'pattern)> and C<< (?<NAME>pattern) >> are equivalent.

B<NOTE:> While the notation of this construct is the same as the similar
function in .NET regexes, the behavior is not. In Perl the groups are
numbered sequentially regardless of being named or not. Thus in the
pattern

  /(x)(?<foo>y)(z)/

C<$+{foo}> will be the same as C<$2>, and C<$3> will contain 'z' instead of
the opposite which is what a .NET regex hacker might expect.

Currently NAME is restricted to simple identifiers only.
In other words, it must match C</^[_A-Za-z][_A-Za-z0-9]*\z/> or
its Unicode extension (see L<utf8>),
though it isn't extended by the locale (see L<perllocale>).

B<NOTE:> In order to make things easier for programmers with experience
with the Python or PCRE regex engines, the pattern C<< (?PE<lt>NAMEE<gt>pattern) >>
may be used instead of C<< (?<NAME>pattern) >>; however this form does not
support the use of single quotes as a delimiter for the name.

=item C<< \k<NAME> >>

=item C<< \k'NAME' >>

Named backreference. Similar to numeric backreferences, except that
the group is designated by name and not number. If multiple groups
have the same name then it refers to the leftmost defined group in
the current match.

It is an error to refer to a name not defined by a C<< (?<NAME>) >>
earlier in the pattern.

Both forms are equivalent.

B<NOTE:> In order to make things easier for programmers with experience
with the Python or PCRE regex engines, the pattern C<< (?P=NAME) >>
may be used instead of C<< \k<NAME> >>.

=item C<(?{ code })>
X<(?{})> X<regex, code in> X<regexp, code in> X<regular expression, code in>

B<WARNING>: Using this feature safely requires that you understand its
limitations. Code executed that has side effects may not perform identically
from version to version due to the effect of future optimisations in the regex
engine. For more information on this, see L</Embedded Code Execution
Frequency>.

This zero-width assertion executes any embedded Perl code. It always
succeeds, and its return value is set as C<$^R>.

DM
1322In literal patterns, the code is parsed at the same time as the
1323surrounding code. While within the pattern, control is passed temporarily
1324back to the perl parser, until the logically-balancing closing brace is
1325encountered. This is similar to the way that an array index expression in
1326a literal string is handled, for example
77ea4f6d 1327
e128ab2c
DM
1328 "abc$array[ 1 + f('[') + g()]def"
1329
1330In particular, braces do not need to be balanced:
1331
576fa024 1332 s/abc(?{ f('{'); })/def/
e128ab2c
DM
1333
1334Even in a pattern that is interpolated and compiled at run-time, literal
1335code blocks will be compiled once, at perl compile time; the following
1336prints "ABCD":
1337
1338 print "D";
1339 my $qr = qr/(?{ BEGIN { print "A" } })/;
1340 my $foo = "foo";
1341 /$foo$qr(?{ BEGIN { print "B" } })/;
1342 BEGIN { print "C" }
1343
1344In patterns where the text of the code is derived from run-time
1345information rather than appearing literally in a source code /pattern/,
1346the code is compiled at the same time that the pattern is compiled, and
5771dda0 1347for reasons of security, C<use re 'eval'> must be in scope. This is to
e128ab2c
DM
1348stop user-supplied patterns containing code snippets from being
1349executable.
1350
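The rule above can be sketched as follows. The variable C<$untrusted> is a hypothetical stand-in for pattern text that arrives at run time; the sketch assumes nothing beyond core Perl.

```perl
use strict;
use warnings;

# Hypothetical run-time pattern text; imagine it came from user input.
my $untrusted = '(?{ 1 })';

# Without "use re 'eval'", compiling the interpolated pattern dies.
my $ok_without = eval { "x" =~ /x$untrusted/; 1 };
my $error      = $@;

# With the pragma in lexical scope, the same pattern compiles and runs.
my $ok_with = do {
    use re 'eval';
    "x" =~ /x$untrusted/ ? 1 : 0;
};

print "without pragma: ", ($ok_without ? "compiled" : "refused"), "\n";
print "with pragma:    ", ($ok_with    ? "matched"  : "no match"), "\n";
```

Because the pragma is lexically scoped, it can be confined to the smallest block that really needs it.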
5771dda0 1351In situations where you need to enable this with C<use re 'eval'>, you should
e128ab2c
DM
1352also have taint checking enabled. Better yet, use the carefully
1353constrained evaluation within a Safe compartment. See L<perlsec> for
1354details about both these mechanisms.
1355
1356From the viewpoint of parsing, lexical variable scope and closures,
1357
1358 /AAA(?{ BBB })CCC/
1359
1360behaves approximately like
1361
1362 /AAA/ && do { BBB } && /CCC/
1363
1364Similarly,
1365
1366 qr/AAA(?{ BBB })CCC/
1367
1368behaves approximately like
77ea4f6d 1369
e128ab2c
DM
1370 sub { /AAA/ && do { BBB } && /CCC/ }
1371
1372In particular:
1373
1374 { my $i = 1; $r = qr/(?{ print $i })/ }
1375 my $i = 2;
1376 /$r/; # prints "1"
1377
1378Inside a C<(?{...})> block, C<$_> refers to the string the regular
754091cb 1379expression is matching against. You can also use C<pos()> to find
fa11829f 1380the current match position within this string.
754091cb 1381
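As a small sketch of C<pos()> inside a code block, the following records the offset reached after each C<o> in the target string. The array name C<@offsets> is illustrative, and C<(*FAIL)> is used only to make the engine visit every candidate; how often a block runs can vary with engine optimisations.

```perl
use strict;
use warnings;

my @offsets;

# (*FAIL) forces the engine to keep looking, so the block runs at every "o".
"hello world" =~ /o(?{ push @offsets, pos() })(*FAIL)/;

print "offsets after each 'o': @offsets\n";
```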
e128ab2c
DM
1382The code block introduces a new scope from the perspective of lexical
1383variable declarations, but B<not> from the perspective of C<local> and
1384similar localizing behaviours. So later code blocks within the same
1385pattern will still see the values which were localized in earlier blocks.
1386These accumulated localizations are undone either at the end of a
1387successful match, or if the assertion is backtracked (compare
1388L<"Backtracking">). For example,
b9ac3b5b
GS
1389
1390 $_ = 'a' x 8;
5d458dd8 1391 m<
d1fbf752 1392 (?{ $cnt = 0 }) # Initialize $cnt.
b9ac3b5b 1393 (
5d458dd8 1394 a
b9ac3b5b 1395 (?{
d1fbf752
KW
1396 local $cnt = $cnt + 1; # Update $cnt,
1397 # backtracking-safe.
b9ac3b5b 1398 })
5d458dd8 1399 )*
b9ac3b5b 1400 aaaa
d1fbf752
KW
1401 (?{ $res = $cnt }) # On success copy to
1402 # non-localized location.
b9ac3b5b
GS
1403 >x;
1404
e128ab2c
DM
1405will initially increment C<$cnt> up to 8; then during backtracking, its
1406value will be unwound back to 4, which is the value assigned to C<$res>.
1407At the end of the regex execution, C<$cnt> will be wound back to its initial
1408value of 0.
1409
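To see why C<local> matters here, the sketch below runs the same pattern twice, once with a localized update and once with a plain assignment. The names C<$res_local> and C<$res_plain> are illustrative. The plain variant is not unwound on backtracking, so its count typically ends up larger (8 on the perls this was written against, though that depends on how the engine iterates).

```perl
use strict;
use warnings;

our ($cnt, $res_local, $res_plain);
my $str = 'a' x 8;

# Localized updates are unwound when the engine backtracks.
$str =~ m<
    (?{ $cnt = 0 })
    ( a (?{ local $cnt = $cnt + 1 }) )*
    aaaa
    (?{ $res_local = $cnt })
>x;

# Plain assignments survive backtracking, so the count overshoots.
$str =~ m<
    (?{ $cnt = 0 })
    ( a (?{ $cnt = $cnt + 1 }) )*
    aaaa
    (?{ $res_plain = $cnt })
>x;

print "localized: $res_local, plain: $res_plain\n";
```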
1410This assertion may be used as the condition in a
b9ac3b5b 1411
e128ab2c
DM
1412 (?(condition)yes-pattern|no-pattern)
1413
1414switch. If I<not> used in this way, the result of evaluation of C<code>
1415is put into the special variable C<$^R>. This happens immediately, so
1416C<$^R> can be used from other C<(?{ code })> assertions inside the same
1417regular expression.
b9ac3b5b 1418
19799a22
GS
1419The assignment to C<$^R> above is properly localized, so the old
1420value of C<$^R> is restored if the assertion is backtracked; compare
1421L<"Backtracking">.
b9ac3b5b 1422
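A minimal sketch of this chaining: each block below returns a value that becomes C<$^R>, and the next block builds on it. The variable C<$total> is illustrative.

```perl
use strict;
use warnings;

my $total;

# Each block's return value becomes $^R, which the next block can read.
if ("abc" =~ /a(?{ 1 })b(?{ $^R + 10 })c(?{ $^R + 100 })/) {
    $total = $^R;
}

print "total = $total\n";
```

Returning a new value from each block, rather than mutating shared state, keeps the accumulated result safe under backtracking.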
e128ab2c
DM
1423Note that the special variable C<$^N> is particularly useful with code
1424blocks to capture the results of submatches in variables without having to
1425keep track of the number of nested parentheses. For example:
1426
1427 $_ = "The brown fox jumps over the lazy dog";
1428 /the (\S+)(?{ $color = $^N }) (\S+)(?{ $animal = $^N })/i;
1429 print "color = $color, animal = $animal\n";
1430
8988a1bb 1431
14455d6c 1432=item C<(??{ code })>
d74e8afc
ITB
1433X<(??{})>
1434X<regex, postponed> X<regexp, postponed> X<regular expression, postponed>
0f5d15d6 1435
83f32aba
RS
1436B<WARNING>: Using this feature safely requires that you understand its
1437limitations. Code executed that has side effects may not perform
1438identically from version to version due to the effect of future
1439optimisations in the regex engine. For more information on this, see
1440L</Embedded Code Execution Frequency>.
0f5d15d6 1441
e128ab2c
DM
1442This is a "postponed" regular subexpression. It behaves in I<exactly> the
1443same way as a C<(?{ code })> code block as described above, except that
1444its return value, rather than being assigned to C<$^R>, is treated as a
1445pattern, compiled if it's a string (or used as-is if it's a C<qr//> object),
1446then matched as if it were inserted instead of this construct.
6bda09f9 1447
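As an illustration, a postponed subexpression can build part of the pattern from text matched earlier. In this sketch, which assumes a made-up input format, a leading digit states how many characters must follow it; C<$counted> is an illustrative name.

```perl
use strict;
use warnings;

# A digit says how many characters follow it.
my $counted = qr/^(\d)((??{ ".{$1}" }))$/;

for my $input ("3abc", "3ab") {
    if ($input =~ $counted) {
        print "$input: payload '$2'\n";
    }
    else {
        print "$input: no match\n";
    }
}
```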
e128ab2c
DM
1448During the matching of this sub-pattern, it has its own set of
1449captures which are valid during the sub-match, but are discarded once
1450control returns to the main pattern. For example, the following matches,
1451with the inner pattern capturing "B" and matching "BB", while the outer
1452pattern captures "A":
1453
1454 my $inner = '(.)\1';
1455 "ABBA" =~ /^(.)(??{ $inner })\1/;
1456 print $1; # prints "A";
6bda09f9 1457
e128ab2c
DM
1458Note that this means that there is no way for the inner pattern to refer
1459to a capture group defined outside. (The code block itself can use C<$1>,
1460etc., to refer to the enclosing pattern's capture groups.) Thus, although
0f5d15d6 1461
e128ab2c
DM
1462 ('a' x 100)=~/(??{'(.)' x 100})/
1463
1464I<will> match, it will I<not> set C<$1> on exit.
19799a22
GS
1465
1466The following pattern matches a parenthesized group:
0f5d15d6 1467
d1fbf752
KW
1468 $re = qr{
1469 \(
1470 (?:
1471 (?> [^()]+ ) # Non-parens without backtracking
1472 |
1473 (??{ $re }) # Group with matching parens
1474 )*
1475 \)
1476 }x;
0f5d15d6 1477
93f313ef
KW
1478See also
1479L<C<(?I<PARNO>)>|/(?PARNO) (?-PARNO) (?+PARNO) (?R) (?0)>
1480for a different, more efficient way to accomplish
6bda09f9
YO
1481the same task.
1482
e128ab2c
DM
1483Executing a postponed regular expression 50 times without consuming any
1484input string will result in a fatal error. The maximum depth is compiled
1485into perl, so changing it requires a custom build.
6bda09f9 1486
93f313ef 1487=item C<(?I<PARNO>)> C<(?-I<PARNO>)> C<(?+I<PARNO>)> C<(?R)> C<(?0)>
542fa716 1488X<(?PARNO)> X<(?1)> X<(?R)> X<(?0)> X<(?-1)> X<(?+1)> X<(?-PARNO)> X<(?+PARNO)>
6bda09f9 1489X<regex, recursive> X<regexp, recursive> X<regular expression, recursive>
d1b2014a
YO
1490X<regex, relative recursion> X<GOSUB> X<GOSTART>
1491
1492Recursive subpattern. Treat the contents of a given capture buffer in the
1493current pattern as an independent subpattern and attempt to match it at
1494the current position in the string. Information about capture state from
1495the caller for things like backreferences is available to the subpattern,
1496but capture buffers set by the subpattern are not visible to the caller.
6bda09f9 1497
e128ab2c
DM
1498Similar to C<(??{ code })> except that it does not involve executing any
1499code or potentially compiling a returned pattern string; instead it treats
1500the part of the current pattern contained within a specified capture group
d1b2014a
YO
1501as an independent pattern that must match at the current position. The
1502treatment of capture buffers also differs: unlike C<(??{ code })>,
1503recursive patterns have access to their caller's match state, so one can
1504use backreferences safely.
6bda09f9 1505
93f313ef 1506I<PARNO> is a sequence of digits (not starting with 0) whose value reflects
c27a5cfe 1507the paren-number of the capture group to recurse to. C<(?R)> recurses to
894be9b7 1508the beginning of the whole pattern. C<(?0)> is an alternate syntax for
93f313ef 1509C<(?R)>. If I<PARNO> is preceded by a plus or minus sign then it is assumed
c27a5cfe 1510to be relative, with negative numbers indicating preceding capture groups
542fa716 1511and positive ones following. Thus C<(?-1)> refers to the most recently
c27a5cfe 1512declared group, and C<(?+1)> indicates the next group to be declared.
c74340f9 1513Note that the counting for relative recursion differs from that of
c27a5cfe 1514relative backreferences, in that with recursion unclosed groups B<are>
c74340f9 1515included.
6bda09f9 1516
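The different ways of naming the recursion target can be compared side by side. This is a sketch with illustrative variable names. Note that C<$nested> is matched directly rather than interpolated into a larger pattern, because C<(?R)> always refers to the whole pattern it ends up in.

```perl
use strict;
use warnings;

# (?1): absolute reference to group 1.
my $absolute = qr/^(\d+)-(?1)$/;

# (?-1): the most recently declared group, here also group 1.
my $previous = qr/^(\d+)-(?-1)$/;

# (?+1): the next group to be declared, here group 1.
my $next = qr/^(?+1)-(\d+)$/;

# (?R): recurse into the whole pattern to match nested brackets.
my $nested = qr/\[(?:[^\[\]]++|(?R))*\]/;

for my $re ($absolute, $previous, $next) {
    print "12-34 matched, \$1 = $1\n" if "12-34" =~ $re;
}
print "nested part: $&\n" if "x[a[b]c]y" =~ $nested;
```

Only the text matched by the group itself is captured; what the recursion matches is not stored in C<$1>.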
81714fb9 1517The following pattern matches a function foo() which may contain
f145b7e9 1518balanced parentheses as the argument.
6bda09f9 1519
d1fbf752 1520 $re = qr{ ( # paren group 1 (full function)
81714fb9 1521 foo
d1fbf752 1522 ( # paren group 2 (parens)
6bda09f9 1523 \(
d1fbf752 1524 ( # paren group 3 (contents of parens)
6bda09f9 1525 (?:
d1fbf752 1526 (?> [^()]+ ) # Non-parens without backtracking
6bda09f9 1527 |
d1fbf752 1528 (?2) # Recurse to start of paren group 2
6bda09f9
YO
1529 )*
1530 )
1531 \)
1532 )
1533 )
1534 }x;
1535
1536If the pattern was used as follows
1537
1538 'foo(bar(baz)+baz(bop))'=~/$re/
1539 and print "\$1 = $1\n",
1540 "\$2 = $2\n",
1541 "\$3 = $3\n";
1542
1543the output produced should be the following:
1544
1545 $1 = foo(bar(baz)+baz(bop))
1546 $2 = (bar(baz)+baz(bop))
81714fb9 1547 $3 = bar(baz)+baz(bop)
6bda09f9 1548
c27a5cfe 1549If there is no corresponding capture group defined, then it is a
61528107 1550fatal error. Recursing deeper than 50 times without consuming any input
81714fb9 1551string will also result in a fatal error. The maximum depth is compiled
6bda09f9
YO
1552into perl, so changing it requires a custom build.
1553
542fa716
YO
1554The following shows how using negative indexing can make it
1555easier to embed recursive patterns inside of a C<qr//> construct
1556for later use:
1557
1558 my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;
c77257ed 1559 if (/foo $parens \s+ \+ \s+ bar $parens/x) {
542fa716
YO
1560 # do something here...
1561 }
1562
81714fb9 1563B<Note> that this pattern does not behave the same way as the equivalent
0d017f4d 1564PCRE or Python construct of the same form. In Perl you can backtrack into
6bda09f9 1565a recursed group; in PCRE and Python the recursed-into group is treated
542fa716
YO
1566as atomic. Also, modifiers are resolved at compile time, so constructs
1567like C<(?i:(?1))> or C<(?:(?i)(?1))> do not affect how the sub-pattern will
1568be processed.
6bda09f9 1569
894be9b7
YO
1570=item C<(?&NAME)>
1571X<(?&NAME)>
1572
93f313ef 1573Recurse to a named subpattern. Identical to C<(?I<PARNO>)> except that the
0d017f4d 1574parenthesis to recurse to is determined by name. If multiple parentheses have
894be9b7
YO
1575the same name, then it recurses to the leftmost.
1576
1577It is an error to refer to a name that is not declared somewhere in the
1578pattern.
1579
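A short sketch of named recursion, assuming a made-up list syntax of words and nested bracketed lists separated by commas. The group name C<list> is illustrative; the anchors sit outside the named group so that the recursive calls are not anchored.

```perl
use strict;
use warnings;

# The group named "list" calls itself for each nested sub-list.
my $list = qr/^(?<list>\[(?:\w+|(?&list))(?:,(?:\w+|(?&list)))*\])$/;

for my $input ("[a,[b,c],d]", "[a,[b,c,d]") {
    printf "%-12s %s\n", $input, ($input =~ $list ? "ok" : "rejected");
}
```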
1f1031fe
YO
1580B<NOTE:> In order to make things easier for programmers with experience
1581with the Python or PCRE regex engines, the pattern C<< (?P>NAME) >>
64c5a566 1582may be used instead of C<< (?&NAME) >>.
1f1031fe 1583
e2e6a0f1
YO
1584=item C<(?(condition)yes-pattern|no-pattern)>
1585X<(?()>
286f584a 1586
e2e6a0f1 1587=item C<(?(condition)yes-pattern)>
286f584a 1588
41ef34de
ML
1589Conditional expression. Matches C<yes-pattern> if C<condition> yields
1590a true value, matches C<no-pattern> otherwise. A missing pattern always
1591matches.
1592
25e26d77 1593C<(condition)> should be one of: 1) an integer in
e2e6a0f1 1594parentheses (which is valid if the corresponding pair of parentheses
25e26d77 1595matched); 2) a look-ahead/look-behind/evaluate zero-width assertion; 3) a
c27a5cfe 1596name in angle brackets or single quotes (which is valid if a group
25e26d77 1597with the given name matched); or 4) the special symbol (R) (true when
e2e6a0f1
YO
1598evaluated inside of recursion or eval). Additionally the R may be
1599followed by a number (which will be true when evaluated while recursing
1600inside of the appropriate group), or by C<&NAME>, in which case it will
1601be true only when evaluated during recursion in the named group.
1602
1603Here's a summary of the possible predicates:
1604
1605=over 4
1606
1607=item (1) (2) ...
1608
c27a5cfe 1609Checks if the numbered capturing group has matched something.
e2e6a0f1
YO
1610
1611=item (<NAME>) ('NAME')
1612
c27a5cfe 1613Checks if a group with the given name has matched something.
e2e6a0f1 1614
f01cd190
FC
1615=item (?=...) (?!...) (?<=...) (?<!...)
1616
1617Checks whether the pattern matches (or does not match, for the '!'
1618variants).
1619
e2e6a0f1
YO
1620=item (?{ CODE })
1621
f01cd190 1622Treats the return value of the code block as the condition.
e2e6a0f1
YO
1623
1624=item (R)
1625
1626Checks if the expression has been evaluated inside of recursion.
1627
1628=item (R1) (R2) ...
1629
1630Checks if the expression has been evaluated while executing directly
1631inside of the n-th capture group. This check is the regex equivalent of
1632
1633 if ((caller(0))[3] eq 'subname') { ... }
1634
1635In other words, it does not check the full recursion stack.
1636
1637=item (R&NAME)
1638
1639Similar to C<(R1)>, this predicate checks to see if we're executing
1640directly inside of the leftmost group with a given name (this is the same
1641logic used by C<(?&NAME)> to disambiguate). It does not check the full
1642stack, but only the name of the innermost active recursion.
1643
1644=item (DEFINE)
1645
1646In this case, the yes-pattern is never directly executed, and no
1647no-pattern is allowed. Similar in spirit to C<(?{0})> but more efficient.
1648See below for details.
1649
1650=back
1651
1652For example:
1653
1654 m{ ( \( )?
1655 [^()]+
1656 (?(1) \) )
1657 }x
1658
1659matches a chunk of non-parentheses, possibly included in parentheses
1660themselves.
1661
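The effect of the condition is easier to see once the pattern is anchored. The following sketch adds C<^> and C<$> to the example above, which is an addition for illustration, and tries four inputs.

```perl
use strict;
use warnings;

# A closing paren is required only if group 1 matched an opening one.
my $chunk = qr{ ^ ( \( )? [^()]+ (?(1) \) ) $ }x;

for my $input ("abc", "(abc)", "(abc", "abc)") {
    printf "%-6s %s\n", $input, ($input =~ $chunk ? "match" : "no match");
}
```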
0b928c2f
FC
1662A special form is the C<(DEFINE)> predicate, which never executes its
1663yes-pattern directly, and does not allow a no-pattern. This allows one to
1664define subpatterns which will be executed only by the recursion mechanism.
e2e6a0f1
YO
1665This way, you can define a set of regular expression rules that can be
1666bundled into any pattern you choose.
1667
1668It is recommended that for this usage you put the DEFINE block at the
1669end of the pattern, and that you name any subpatterns defined within it.
1670
1671Also, it's worth noting that patterns defined this way probably will
31dc26d6 1672not be as efficient, as the optimizer is not very clever about
e2e6a0f1
YO
1673handling them.
1674
1675An example of how this might be used is as follows:
1676
2bf803e2 1677 /(?<NAME>(?&NAME_PAT))(?<ADDR>(?&ADDRESS_PAT))
e2e6a0f1 1678 (?(DEFINE)
2bf803e2
YO
1679 (?<NAME_PAT>....)
1680 (?<ADDRESS_PAT>....)
e2e6a0f1
YO
1681 )/x
1682
c27a5cfe
KW
1683Note that capture groups matched inside of recursion are not accessible
1684after the recursion returns, so the extra layer of capturing groups is
e2e6a0f1
YO
1685necessary. Thus C<$+{NAME_PAT}> would not be defined even though
1686C<$+{NAME}> would be.
286f584a 1687
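A runnable sketch of the same idea, with hypothetical rule names (C<WORD>, C<NUMBER>) and wrapper groups (C<KEY>, C<VAL>) chosen for this illustration only:

```perl
use strict;
use warnings;

# KEY, VAL, WORD and NUMBER are hypothetical names for this sketch.
my $pair = qr{
    ^ (?<KEY> (?&WORD) ) = (?<VAL> (?&NUMBER) ) $
    (?(DEFINE)
        (?<WORD>   [A-Za-z_]\w* )
        (?<NUMBER> \d+ )
    )
}x;

my %seen;
if ("width=42" =~ $pair) {
    %seen = %+;    # copy the named captures before another match resets them
}
print join(", ", map { "$_=$seen{$_}" } sort keys %seen), "\n";
```

Only the wrapper groups appear in C<%+>; the rules inside the DEFINE block are used for matching but leave no captures behind.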
51a1303c
BF
1688Finally, keep in mind that subpatterns created inside a DEFINE block
1689count towards the absolute and relative number of captures, so this:
1690
1691 my @captures = "a" =~ /(.) # First capture
1692 (?(DEFINE)
1693 (?<EXAMPLE> 1 ) # Second capture
1694 )/x;
1695 say scalar @captures;
1696
1697will output 2, not 1. This is particularly important if you intend to
1698compile the definitions with the C<qr//> operator, and later
1699interpolate them in another pattern.
1700
c47ff5f1 1701=item C<< (?>pattern) >>
6bda09f9 1702X<backtrack> X<backtracking> X<atomic> X<possessive>
5a964f20 1703
19799a22
GS
1704An "independent" subexpression, one which matches the substring
1705that a I<standalone> C<pattern> would match if anchored at the given
9da458fc 1706position, and it matches I<nothing other than this substring>. This
19799a22
GS
1707construct is useful for optimizations of what would otherwise be
1708"eternal" matches, because it will not backtrack (see L<"Backtracking">).
9da458fc
IZ
1709It may also be useful in places where the "grab all you can, and do not
1710give anything back" semantic is desirable.
19799a22 1711
c47ff5f1 1712For example: C<< ^(?>a*)ab >> will never match, since C<< (?>a*) >>
19799a22
GS
1713(anchored at the beginning of string, as above) will match I<all>
1714characters C<a> at the beginning of string, leaving no C<a> for
1715C<ab> to match. In contrast, C<a*ab> will match the same as C<a+b>,
1716since the match of the subgroup C<a*> is influenced by the following
1717group C<ab> (see L<"Backtracking">). In particular, C<a*> inside
1718C<a*ab> will match fewer characters than a standalone C<a*>, since
1719this makes the tail match.
1720
0b928c2f
FC
1721C<< (?>pattern) >> does not disable backtracking altogether once it has
1722matched. It is still possible to backtrack past the construct, but not
1723into it. So C<< ((?>a*)|(?>b*))ar >> will still match "bar".
1724
c47ff5f1 1725An effect similar to C<< (?>pattern) >> may be achieved by writing
0b928c2f
FC
1726C<(?=(pattern))\g{-1}>. This matches the same substring as a standalone
1727C<pattern>, and the following C<\g{-1}> eats the matched string; it therefore
c47ff5f1 1728makes a zero-length assertion into an analogue of C<< (?>...) >>.
19799a22
GS
1729(The difference between these two constructs is that the second one
1730uses a capturing group, thus shifting ordinals of backreferences
1731in the rest of a regular expression.)
1732
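The claims in the preceding paragraphs can be checked with a few one-line matches. This is a sketch; the variable names are illustrative.

```perl
use strict;
use warnings;

my $plain     = "aaab" =~ /^a*ab/             ? 1 : 0;  # a* gives one "a" back
my $atomic    = "aaab" =~ /^(?>a*)ab/         ? 1 : 0;  # (?>a*) keeps every "a"
my $lookahead = "aaab" =~ /^(?=(a*))\g{-1}ab/ ? 1 : 0;  # same effect via a capture
my $past      = "bar"  =~ /((?>a*)|(?>b*))ar/ ? 1 : 0;  # backtracking past is fine

print "plain=$plain atomic=$atomic lookahead=$lookahead past=$past\n";
```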
1733Consider this pattern:
c277df42 1734
871b0233 1735 m{ \(
e2e6a0f1 1736 (
f793d64a 1737 [^()]+ # x+
e2e6a0f1 1738 |
871b0233
IZ
1739 \( [^()]* \)
1740 )+
e2e6a0f1 1741 \)
871b0233 1742 }x
5a964f20 1743
19799a22
GS
1744That will efficiently match a nonempty group with matching parentheses
1745two levels deep or less. However, if there is no such group, it
1746will take virtually forever on a long string. That's because there
1747are so many different ways to split a long string into several
1748substrings. This is what C<(.+)+> is doing, and C<(.+)+> is similar
1749to a subpattern of the above pattern. Consider how the pattern
1750above detects no-match on C<((()aaaaaaaaaaaaaaaaaa> in several
1751seconds, but that each extra letter doubles this time. This
1752exponential performance will make it appear that your program has
14218588 1753hung. However, a tiny change to this pattern
5a964f20 1754
e2e6a0f1
YO
1755 m{ \(
1756 (
f793d64a 1757 (?> [^()]+ ) # change x+ above to (?> x+ )
e2e6a0f1 1758 |
871b0233
IZ
1759 \( [^()]* \)
1760 )+
e2e6a0f1 1761 \)
871b0233 1762 }x
c277df42 1763
c47ff5f1 1764which uses C<< (?>...) >> matches exactly when the one above does (verifying
5a964f20
TC
1765this yourself would be a productive exercise), but finishes in a fourth of
1766the time when used on a similar string with 1000000 C<a>s. Be aware,
0b928c2f
FC
1767however, that, when this construct is followed by a
1768quantifier, it currently triggers a warning message under
9f1b1f2d 1769the C<use warnings> pragma or B<-w> switch saying it
6bab786b 1770C<"matches null string many times in regex">.
c277df42 1771
c47ff5f1 1772On simple groups, such as the pattern C<< (?> [^()]+ ) >>, a comparable
19799a22 1773effect may be achieved by negative look-ahead, as in C<[^()]+ (?! [^()] )>.
c277df42
IZ
1774This was only 4 times slower on a string with 1000000 C<a>s.
1775
9da458fc
IZ
1776The "grab all you can, and do not give anything back" semantic is desirable
1777in many situations where at first sight a simple C<()*> looks like
1778the correct solution. Suppose we parse text with comments being delimited
1779by C<#> followed by some optional (horizontal) whitespace. Contrary to
4375e838 1780its appearance, C<#[ \t]*> I<is not> the correct subexpression to match
9da458fc
IZ
1781the comment delimiter, because it may "give up" some whitespace if
1782the remainder of the pattern can be made to match that way. The correct
1783answer is either one of these:
1784
1785 (?>#[ \t]*)
1786 #[ \t]*(?![ \t])
1787
1788For example, to grab non-empty comments into $1, one should use either
1789one of these:
1790
1791 / (?> \# [ \t]* ) ( .+ ) /x;
1792 / \# [ \t]* ( [^ \t] .* ) /x;
1793
1794Which one you pick depends on which of these expressions better reflects
1795the above specification of comments.
1796
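The difference only shows up on edge cases, such as a delimiter followed by nothing but whitespace. The sketch below compares the naive pattern with the two recommended ones; the variable names are illustrative.

```perl
use strict;
use warnings;

my $blank_comment = "#   ";    # a delimiter followed only by spaces

# The plain quantifier gives one space back so that ( .+ ) can succeed.
my ($leaky) = $blank_comment =~ / \# [ \t]* ( .+ ) /x;

# Both recommended forms refuse to treat whitespace as comment text.
my ($atomic)   = $blank_comment =~ / (?> \# [ \t]* ) ( .+ ) /x;
my ($explicit) = $blank_comment =~ / \# [ \t]* ( [^ \t] .* ) /x;

# On a real comment the atomic form captures the text after the delimiter.
my ($text) = "# fix me" =~ / (?> \# [ \t]* ) ( .+ ) /x;

printf "leaky:    %s\n", defined $leaky    ? "'$leaky'"    : "no match";
printf "atomic:   %s\n", defined $atomic   ? "'$atomic'"   : "no match";
printf "explicit: %s\n", defined $explicit ? "'$explicit'" : "no match";
printf "text:     %s\n", defined $text     ? "'$text'"     : "no match";
```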
6bda09f9
YO
1797In some literature this construct is called "atomic matching" or
1798"possessive matching".
1799
b9b4dddf
YO
1800Possessive quantifiers are equivalent to putting the item they are applied
1801to inside of one of these constructs. The following equivalences apply:
1802
1803 Quantifier Form Bracketing Form
1804 --------------- ---------------
1805 PAT*+ (?>PAT*)
1806 PAT++ (?>PAT+)
1807 PAT?+ (?>PAT?)
1808 PAT{min,max}+ (?>PAT{min,max})
1809
9d1a5160 1810=item C<(?[ ])>
f4f5fe57 1811
572224ce 1812See L<perlrecharclass/Extended Bracketed Character Classes>.
9d1a5160 1813
e2e6a0f1
YO
1814=back
1815
1816=head2 Special Backtracking Control Verbs
1817
e2e6a0f1
YO
1818These special patterns are generally of the form C<(*VERB:ARG)>. Unless
1819otherwise stated the ARG argument is optional; in some cases, it is
1820forbidden.
1821
1822Any pattern containing a special backtracking verb that allows an argument
e1020413 1823has the special behaviour that when executed it sets the current package's
5d458dd8
YO
1824C<$REGERROR> and C<$REGMARK> variables. When doing so the following
1825rules apply:
e2e6a0f1 1826
5d458dd8
YO
1827On failure, the C<$REGERROR> variable will be set to the ARG value of the
1828verb pattern, if the verb was involved in the failure of the match. If the
1829ARG part of the pattern was omitted, then C<$REGERROR> will be set to the
1830name of the last C<(*MARK:NAME)> pattern executed, or to TRUE if there was
1831none. Also, the C<$REGMARK> variable will be set to FALSE.
e2e6a0f1 1832
5d458dd8
YO
1833On a successful match, the C<$REGERROR> variable will be set to FALSE, and
1834the C<$REGMARK> variable will be set to the name of the last
1835C<(*MARK:NAME)> pattern executed. See the explanation for the
1836C<(*MARK:NAME)> verb below for more details.
e2e6a0f1 1837
5d458dd8 1838B<NOTE:> C<$REGERROR> and C<$REGMARK> are not magic variables like C<$1>
0b928c2f 1839and most other regex-related variables. They are not local to a scope, nor
5d458dd8
YO
1840readonly, but instead are volatile package variables similar to C<$AUTOLOAD>.
1841Use C<local> to localize changes to them to a specific scope if necessary.
e2e6a0f1
YO
1842
1843If a pattern does not contain a special backtracking verb that allows an
5d458dd8 1844argument, then C<$REGERROR> and C<$REGMARK> are not touched at all.
e2e6a0f1 1845
70ca8714 1846=over 3
e2e6a0f1
YO
1847
1848=item Verbs that take an argument
1849
1850=over 4
1851
5d458dd8 1852=item C<(*PRUNE)> C<(*PRUNE:NAME)>
f7819f85 1853X<(*PRUNE)> X<(*PRUNE:NAME)>
54612592 1854
5d458dd8
YO
1855This zero-width pattern prunes the backtracking tree at the current point
1856when backtracked into on failure. Consider the pattern C<A (*PRUNE) B>,
1857where A and B are complex patterns. Until the C<(*PRUNE)> verb is reached,
1858A may backtrack as necessary to match. Once it is reached, matching
1859continues in B, which may also backtrack as necessary; however, should B
1860not match, then no further backtracking will take place, and the pattern
1861will fail outright at the current starting position.
54612592
YO
1862
1863The following example counts all the possible matching strings in a
1864pattern (without actually matching any of them).
1865
e2e6a0f1 1866 'aaab' =~ /a+b?(?{print "$&\n"; $count++})(*FAIL)/;
54612592
YO
1867 print "Count=$count\n";
1868
1869which produces:
1870
1871 aaab
1872 aaa
1873 aa
1874 a
1875 aab
1876 aa
1877 a
1878 ab
1879 a
1880 Count=9
1881
5d458dd8 1882If we add a C<(*PRUNE)> before the count like the following
54612592 1883
5d458dd8 1884 'aaab' =~ /a+b?(*PRUNE)(?{print "$&\n"; $count++})(*FAIL)/;
54612592
YO
1885 print "Count=$count\n";
1886
0b928c2f 1887we prevent backtracking and find the count of the longest matching string
353c6505 1888at each matching starting point like so:
54612592
YO
1889
1890 aaab
1891 aab
1892 ab
1893 Count=3
1894
5d458dd8 1895Any number of C<(*PRUNE)> assertions may be used in a pattern.
54612592 1896
5d458dd8
YO
1897See also C<< (?>pattern) >> and possessive quantifiers for other ways to
1898control backtracking. In some cases, the use of C<(*PRUNE)> can be
1899replaced with a C<< (?>pattern) >> with no functional difference; however,
1900C<(*PRUNE)> can be used to handle cases that cannot be expressed using a
1901C<< (?>pattern) >> alone.
54612592 1902
5d458dd8
YO
1903=item C<(*SKIP)> C<(*SKIP:NAME)>
1904X<(*SKIP)>
e2e6a0f1 1905
5d458dd8 1906This zero-width pattern is similar to C<(*PRUNE)>, except that on
e2e6a0f1 1907failure it also signifies that whatever text that was matched leading up
5d458dd8
YO
1908to the C<(*SKIP)> pattern being executed cannot be part of I<any> match
1909of this pattern. This effectively means that the regex engine "skips" forward
1910to this position on failure and tries to match again (assuming that
1911there is sufficient room to match).
1912
1913The name of the C<(*SKIP:NAME)> pattern has special significance. If a
1914C<(*MARK:NAME)> was encountered while matching, then it is that position
1915which is used as the "skip point". If no C<(*MARK)> of that name was
1916encountered, then the C<(*SKIP)> operator has no effect. When used
1917without a name the "skip point" is where the match point was when
1918executing the C<(*SKIP)> pattern.
1919
0b928c2f 1920Compare the following to the examples in C<(*PRUNE)>; note the string
24b23f37
YO
1921is twice as long:
1922
d1fbf752
KW
1923 'aaabaaab' =~ /a+b?(*SKIP)(?{print "$&\n"; $count++})(*FAIL)/;
1924 print "Count=$count\n";
24b23f37
YO
1925
1926outputs
1927
1928 aaab
1929 aaab
1930 Count=2
1931
5d458dd8 1932Once the 'aaab' at the start of the string has matched, and the C<(*SKIP)>
353c6505 1933executed, the next starting point will be where the cursor was when the
5d458dd8
YO
1934C<(*SKIP)> was executed.
1935
5d458dd8 1936=item C<(*MARK:NAME)> C<(*:NAME)>
b16db30f 1937X<(*MARK)> X<(*MARK:NAME)> X<(*:NAME)>
5d458dd8
YO
1938
1939This zero-width pattern can be used to mark the point reached in a string
1940when a certain part of the pattern has been successfully matched. This
1941mark may be given a name. A later C<(*SKIP)> pattern will then skip
1942forward to that point if backtracked into on failure. Any number of
b4222fa9 1943C<(*MARK)> patterns are allowed, and the NAME portion may be duplicated.
5d458dd8
YO
1944
1945In addition to interacting with the C<(*SKIP)> pattern, C<(*MARK:NAME)>
1946can be used to "label" a pattern branch, so that after matching, the
1947program can determine which branches of the pattern were involved in the
1948match.
1949
1950When a match is successful, the C<$REGMARK> variable will be set to the
1951name of the most recently executed C<(*MARK:NAME)> that was involved
1952in the match.
1953
1954This can be used to determine which branch of a pattern was matched
c27a5cfe 1955without using a separate capture group for each branch, which in turn
5d458dd8
YO
1956can result in a performance improvement, as perl cannot optimize
1957C</(?:(x)|(y)|(z))/> as efficiently as something like
1958C</(?:x(*MARK:x)|y(*MARK:y)|z(*MARK:z))/>.
1959
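A minimal sketch of branch labelling, assuming the code runs in package C<main>. The hash C<%branch_for> is illustrative; C<$REGMARK> has to be declared with C<our> under C<use strict> because it is an ordinary package variable.

```perl
use strict;
use warnings;

our $REGMARK;    # plain package variable set by the regex engine

my %branch_for;
for my $input (qw(x y z)) {
    if ($input =~ /(?:x(*MARK:x)|y(*MARK:y)|z(*MARK:z))/) {
        $branch_for{$input} = $REGMARK;
    }
}

print "$_ matched via branch $branch_for{$_}\n" for sort keys %branch_for;
```

Copying C<$REGMARK> straight after the match is the safe habit, since the next match that uses a verb with an argument overwrites it.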
1960When a match has failed, and unless another verb has been involved in
1961failing the match and has provided its own name to use, the C<$REGERROR>
1962variable will be set to the name of the most recently executed
1963C<(*MARK:NAME)>.
1964
42ac7c82 1965See L</(*SKIP)> for more details.
5d458dd8 1966
b62d2d15
YO
1967As a shortcut C<(*MARK:NAME)> can be written C<(*:NAME)>.
1968
5d458dd8
YO
1969=item C<(*THEN)> C<(*THEN:NAME)>
1970
ac9d8485 1971This is similar to the "cut group" operator C<::> from Perl 6. Like
5d458dd8
YO
1972C<(*PRUNE)>, this verb always matches, and when backtracked into on
1973failure, it causes the regex engine to try the next alternation in the
ac9d8485
FC
1974innermost enclosing group (capturing or otherwise) that has alternations.
1975The two branches of a C<(?(condition)yes-pattern|no-pattern)> do not
1976count as an alternation, as far as C<(*THEN)> is concerned.
5d458dd8
YO
1977
1978Its name comes from the observation that this operation combined with the
1979alternation operator (C<|>) can be used to create what is essentially a
1980pattern-based if/then/else block:
1981
1982 ( COND (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ )
1983
1984Note that if this operator is used and NOT inside of an alternation then
1985it acts exactly like the C<(*PRUNE)> operator.
1986
1987 / A (*PRUNE) B /
1988
1989is the same as
1990
1991 / A (*THEN) B /
1992
1993but
1994
25e26d77 1995 / ( A (*THEN) B | C ) /
5d458dd8
YO
1996
1997is not the same as
1998
25e26d77 1999 / ( A (*PRUNE) B | C ) /
5d458dd8
YO
2000
2001as after matching the A but failing on the B the C<(*THEN)> verb will
2002backtrack and try C; but the C<(*PRUNE)> verb will simply fail.
24b23f37 2003
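The contrast can be sketched with a concrete string. The patterns are anchored with C<^> so that only one starting position is tried; that is an addition made for this illustration.

```perl
use strict;
use warnings;

# After "a" matches and "b" fails, (*THEN) moves on to the second branch.
my $with_then = "ad" =~ /^(?:a(*THEN)b|a.)$/ ? 1 : 0;

# (*PRUNE) instead fails the attempt at this starting position outright.
my $with_prune = "ad" =~ /^(?:a(*PRUNE)b|a.)$/ ? 1 : 0;

print "THEN: $with_then, PRUNE: $with_prune\n";
```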
cbeadc21
JV
2004=back
2005
2006=item Verbs without an argument
2007
2008=over 4
2009
e2e6a0f1
YO
2010=item C<(*COMMIT)>
2011X<(*COMMIT)>
24b23f37 2012
241e7389 2013This is the Perl 6 "commit pattern" C<< <commit> >> or C<:::>. It's a
5d458dd8
YO
2014zero-width pattern similar to C<(*SKIP)>, except that when backtracked
2015into on failure it causes the match to fail outright. No further attempts
2016to find a valid match by advancing the start pointer will occur again.
2017For example,
24b23f37 2018
d1fbf752
KW
2019 'aaabaaab' =~ /a+b?(*COMMIT)(?{print "$&\n"; $count++})(*FAIL)/;
2020 print "Count=$count\n";
24b23f37
YO
2021
2022outputs
2023
2024 aaab
2025 Count=1
2026
e2e6a0f1
YO
2027In other words, once the C<(*COMMIT)> has been entered, and if the pattern
2028does not match, the regex engine will not try any further matching on the
2029rest of the string.
c277df42 2030
e2e6a0f1
YO
2031=item C<(*FAIL)> C<(*F)>
2032X<(*FAIL)> X<(*F)>
9af228c6 2033
e2e6a0f1
YO
2034This pattern matches nothing and always fails. It can be used to force the
2035engine to backtrack. It is equivalent to C<(?!)>, but easier to read. In
2036fact, C<(?!)> gets optimised into C<(*FAIL)> internally.
9af228c6 2037
e2e6a0f1 2038It is probably useful only when combined with C<(?{})> or C<(??{})>.
9af228c6 2039
e2e6a0f1
YO
2040=item C<(*ACCEPT)>
2041X<(*ACCEPT)>
9af228c6 2042
e2e6a0f1
YO
2043This pattern matches nothing and causes the end of successful matching at
2044the point at which the C<(*ACCEPT)> pattern was encountered, regardless of
2045whether there is actually more to match in the string. When inside of a
0d017f4d 2046nested pattern, such as recursion, or in a subpattern dynamically generated
e2e6a0f1 2047via C<(??{})>, only the innermost pattern is ended immediately.
9af228c6 2048
c27a5cfe 2049If the C<(*ACCEPT)> is inside of capturing groups then the groups are
e2e6a0f1
YO
2050marked as ended at the point at which the C<(*ACCEPT)> was encountered.
2051For instance:
9af228c6 2052
e2e6a0f1 2053 'AB' =~ /(A (A|B(*ACCEPT)|C) D)(E)/x;
9af228c6 2054
e2e6a0f1 2055will match, and C<$1> will be C<AB> and C<$2> will be C<B>, C<$3> will not
0b928c2f 2056be set. If another branch in the inner parentheses was matched, such as in the
e2e6a0f1 2057string 'ACDE', then the C<D> and C<E> would have to be matched as well.
9af228c6
YO
2058
2059=back
c277df42 2060
a0d0e21e
LW
2061=back
2062
c07a80fd 2063=head2 Backtracking
d74e8afc 2064X<backtrack> X<backtracking>
c07a80fd 2065
35a734be
IZ
2066NOTE: This section presents an abstract approximation of regular
2067expression behavior. For a more rigorous (and complicated) view of
2068the rules involved in selecting a match among possible alternatives,
0d017f4d 2069see L</Combining RE Pieces>.
35a734be 2070
c277df42 2071A fundamental feature of regular expression matching involves the
5a964f20 2072notion called I<backtracking>, which is currently used (when needed)
0d017f4d 2073by all non-possessive regular expression quantifiers, namely C<*>, C<*?>, C<+>,
9da458fc
IZ
2074C<+?>, C<{n,m}>, and C<{n,m}?>. Backtracking is often optimized
2075internally, but the general principle outlined here is valid.
c07a80fd
PP
2076
For a regular expression to match, the I<entire> regular expression must
match, not just part of it. So if the beginning of a pattern containing a
quantifier succeeds in a way that causes later parts in the pattern to
fail, the matching engine backs up and recalculates the beginning
part--that's why it's called backtracking.

Here is an example of backtracking: Let's say you want to find the
word following "foo" in the string "Food is on the foo table.":

    $_ = "Food is on the foo table.";
    if ( /\b(foo)\s+(\w+)/i ) {
        print "$2 follows $1.\n";
    }

When the match runs, the first part of the regular expression (C<\b(foo)>)
finds a possible match right at the beginning of the string, and loads up
$1 with "Foo". However, as soon as the matching engine sees that there's
no whitespace following the "Foo" that it had saved in $1, it realizes its
mistake and starts over again one character after where it had the
tentative match. This time it goes all the way until the next occurrence
of "foo". The complete regular expression matches this time, and you get
the expected output of "table follows foo."

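To see where the successful attempt actually starts, one can inspect the
C<@-> array after the match. This is only a sketch; the variable names
C<$text>, C<$offset>, C<$word> and C<$next> are introduced here for
illustration.

```perl
use strict;
use warnings;

my $text = "Food is on the foo table.";
my ( $offset, $word, $next );

if ( $text =~ /\b(foo)\s+(\w+)/i ) {
    # @- holds the start offsets of the last successful match;
    # $-[0] is where the whole match began.
    ( $offset, $word, $next ) = ( $-[0], $1, $2 );
    print "match starts at offset $offset: $next follows $word.\n";
}
```

The reported offset is that of the second "foo", not of the "Foo" in
"Food", which is the position the engine had to abandon.
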
Sometimes minimal matching can help a lot. Imagine you'd like to match
everything between "foo" and "bar". Initially, you write something
like this:

    $_ = "The food is under the bar in the barn.";
    if ( /foo(.*)bar/ ) {
        print "got <$1>\n";
    }

This, perhaps unexpectedly, yields:

    got <d is under the bar in the >

That's because C<.*> was greedy, so you get everything between the
I<first> "foo" and the I<last> "bar". Here it's more effective
to use minimal matching to make sure you get the text between a "foo"
and the first "bar" thereafter.

    if ( /foo(.*?)bar/ ) { print "got <$1>\n" }
    got <d is under the >

Here's another example. Let's say you'd like to match a number at the end
of a string, and you also want to keep the preceding part of the match.
So you write this:

    $_ = "I have 2 numbers: 53147";
    if ( /(.*)(\d*)/ ) {                                # Wrong!
        print "Beginning is <$1>, number is <$2>.\n";
    }

That won't work at all, because C<.*> was greedy and gobbled up the
whole string. As C<\d*> can match on an empty string, the complete
regular expression matched successfully.

    Beginning is <I have 2 numbers: 53147>, number is <>.

Here are some variants, most of which don't work:

    $_ = "I have 2 numbers: 53147";
    @pats = qw{
        (.*)(\d*)
        (.*)(\d+)
        (.*?)(\d*)
        (.*?)(\d+)
        (.*)(\d+)$
        (.*?)(\d+)$
        (.*)\b(\d+)$
        (.*\D)(\d+)$
    };

    for $pat (@pats) {
        printf "%-12s ", $pat;
        if ( /$pat/ ) {
            print "<$1> <$2>\n";
        } else {
            print "FAIL\n";
        }
    }

That will print out:

    (.*)(\d*)    <I have 2 numbers: 53147> <>
    (.*)(\d+)    <I have 2 numbers: 5314> <7>
    (.*?)(\d*)   <> <>
    (.*?)(\d+)   <I have > <2>
    (.*)(\d+)$   <I have 2 numbers: 5314> <7>
    (.*?)(\d+)$  <I have 2 numbers: > <53147>
    (.*)\b(\d+)$ <I have 2 numbers: > <53147>
    (.*\D)(\d+)$ <I have 2 numbers: > <53147>

As you see, this can be a bit tricky. It's important to realize that a
regular expression is merely a set of assertions that gives a definition
of success. There may be 0, 1, or several different ways that the
definition might succeed against a particular string. And if there are
multiple ways it might succeed, you need to understand backtracking to
know which variety of success you will achieve.

When using look-ahead assertions and negations, this can all get even
trickier. Imagine you'd like to find a sequence of non-digits not
followed by "123". You might try to write that as

    $_ = "ABC123";
    if ( /^\D*(?!123)/ ) {                # Wrong!
        print "Yup, no 123 in $_\n";
    }

But that isn't going to match; at least, not the way you're hoping. It
claims that there is no 123 in the string. Here's a clearer picture of
why that pattern matches, contrary to popular expectations:

    $x = 'ABC123';
    $y = 'ABC445';

    print "1: got $1\n" if $x =~ /^(ABC)(?!123)/;
    print "2: got $1\n" if $y =~ /^(ABC)(?!123)/;

    print "3: got $1\n" if $x =~ /^(\D*)(?!123)/;
    print "4: got $1\n" if $y =~ /^(\D*)(?!123)/;

This prints

    2: got ABC
    3: got AB
    4: got ABC

You might have expected test 3 to fail because it seems to be a more
general-purpose version of test 1. The important difference between
them is that test 3 contains a quantifier (C<\D*>) and so can use
backtracking, whereas test 1 cannot. What's happening is
that you've asked "Is it true that at the start of $x, following 0 or more
non-digits, you have something that's not 123?" If the pattern matcher had
let C<\D*> expand to "ABC", this would have caused the whole pattern to
fail.

The search engine will initially match C<\D*> with "ABC". Then it will
try to match C<(?!123)> with "123", which fails. But because
a quantifier (C<\D*>) has been used in the regular expression, the
search engine can backtrack and retry the match differently
in the hope of matching the complete regular expression.

The pattern really, I<really> wants to succeed, so it uses the
standard pattern back-off-and-retry and lets C<\D*> expand to just "AB" this
time. Now there's indeed something following "AB" that is not
"123". It's "C123", which suffices.

We can deal with this by using both an assertion and a negation.
We'll say that the first part in $1 must be followed both by a digit
and by something that's not "123". Remember that the look-aheads
are zero-width expressions--they only look, but don't consume any
of the string in their match. So rewriting this way produces what
you'd expect; that is, case 5 will fail, but case 6 succeeds:

    print "5: got $1\n" if $x =~ /^(\D*)(?=\d)(?!123)/;
    print "6: got $1\n" if $y =~ /^(\D*)(?=\d)(?!123)/;

    6: got ABC

In other words, the two zero-width assertions next to each other work as though
they're ANDed together, just as you'd use any built-in assertions: C</^$/>
matches only if you're at the beginning of the line AND the end of the
line simultaneously. The deeper underlying truth is that juxtaposition in
regular expressions always means AND, except when you write an explicit OR
using the vertical bar. C</ab/> means match "a" AND (then) match "b",
although the attempted matches are made at different positions because "a"
is not a zero-width assertion, but a one-width assertion.

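For these two test strings, a different way to stop C<\D*> from backing
off to "AB" is to make the quantifier possessive (C<*+>). This is a
sketch of an alternative, not the approach shown above, and it is not
equivalent in general: unlike the C<(?=\d)> version, it does not require a
digit to follow. The C<%result> hash is an illustrative name.

```perl
use strict;
use warnings;

my %result;
for my $str ( 'ABC123', 'ABC445' ) {
    # \D*+ is possessive: once it has taken "ABC" it will not back off
    # to "AB", so the look-ahead is only ever tested right after "ABC".
    $result{$str} = $str =~ /^(\D*+)(?!123)/ ? "got $1" : "no match";
    print "$str: $result{$str}\n";
}
```
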
B<WARNING>: Particularly complicated regular expressions can take
exponential time to solve because of the immense number of possible
ways they can use backtracking to try for a match. For example, without
internal optimizations done by the regular expression engine, this will
take a painfully long time to run:

    'aaaaaaaaaaaa' =~ /((a{0,5}){0,5})*[c]/

And if you used C<*>'s in the internal groups instead of limiting them
to 0 through 5 matches, then it would take forever--or until you ran
out of stack space. Moreover, these internal optimizations are not
always applicable. For example, if you put C<{0,5}> instead of C<*>
on the external group, no current optimization is applicable, and the
match takes a long time to finish.

A powerful tool for optimizing such beasts is what is known as an
"independent group",
which does not backtrack (see L</C<< (?>pattern) >>>). Note also that
zero-length look-ahead/look-behind assertions will not backtrack to make
the tail match, since they are in "logical" context: only
whether they match is considered relevant. For an example
where side-effects of look-ahead I<might> have influenced the
following match, see L</C<< (?>pattern) >>>.

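The effect of an independent group can be seen on a much smaller
pattern than the one above. The following sketch (the string and the
variable names are made up for illustration) compares an ordinary
quantifier with the same quantifier wrapped in C<< (?>...) >>.

```perl
use strict;
use warnings;

my $str = 'aaab';

# Ordinary quantifier: a+ first takes "aaa", then gives one "a" back
# so that the trailing "ab" can match.
my $plain = $str =~ /a+ab/ ? 'match' : 'no match';

# Independent group: (?>a+) keeps whatever it took, so "ab" can never
# be satisfied and the match fails at every starting position.
my $independent = $str =~ /(?>a+)ab/ ? 'match' : 'no match';

print "a+ab:     $plain\n";
print "(?>a+)ab: $independent\n";
```

Because the group never hands characters back, the engine has far fewer
combinations to explore, which is what makes the construct useful for
taming patterns like the one in the warning.
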
=head2 Version 8 Regular Expressions
X<regular expression, version 8> X<regex, version 8> X<regexp, version 8>

In case you're not familiar with the "regular" Version 8 regex
routines, here are the pattern-matching rules not described above.

Any single character matches itself, unless it is a I<metacharacter>
with a special meaning described here or above. You can cause
characters that normally function as metacharacters to be interpreted
literally by prefixing them with a "\" (e.g., "\." matches a ".", not any
character; "\\" matches a "\"). This escape mechanism is also required
for the character used as the pattern delimiter.

A series of characters matches that series of characters in the target
string, so the pattern C<blurfl> would match "blurfl" in the target
string.

You can specify a character class, by enclosing a list of characters
in C<[]>, which will match any character from the list. If the
first character after the "[" is "^", the class matches any character not
in the list. Within a list, the "-" character specifies a
range, so that C<a-z> represents all characters between "a" and "z",
inclusive. If you want either "-" or "]" itself to be a member of a
class, put it at the start of the list (possibly after a "^"), or
escape it with a backslash. "-" is also taken literally when it is
at the end of the list, just before the closing "]". (The
following all specify the same class of three characters: C<[-az]>,
C<[az-]>, and C<[a\-z]>. All are different from C<[a-z]>, which
specifies a class containing twenty-six characters, even on EBCDIC-based
character sets.) Also, if you try to use the character
classes C<\w>, C<\W>, C<\s>, C<\S>, C<\d>, or C<\D> as endpoints of
a range, the "-" is understood literally.

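The difference between a literal "-" and a range can be checked
directly. In this sketch the candidate characters and the C<%hits> hash
are chosen purely for illustration.

```perl
use strict;
use warnings;

my @candidates = ( 'a', 'm', 'z', '-' );
my %hits;

for my $class ( '[-az]', '[az-]', '[a\-z]', '[a-z]' ) {
    # Anchor the class so each one-character candidate is tested whole.
    $hits{$class} = join ' ', grep { /^$class$/ } @candidates;
    print "$class matches: $hits{$class}\n";
}
```

The first three classes accept "a", "z" and "-" but not "m"; only
C<[a-z]> accepts "m", and it rejects "-".
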
Note also that ranges are rather unportable between character sets, and
even within a character set they may produce results you probably didn't
expect. A sound principle is to use only ranges that begin and end at
either alphabetics of equal case ([a-e], [A-E]), or digits ([0-9]).
Anything else is unsafe. If in doubt, spell out the character sets in
full.

Characters may be specified using a metacharacter syntax much like that
used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,
"\f" a form feed, etc. More generally, \I<nnn>, where I<nnn> is a string
of three octal digits, matches the character whose coded character set value
is I<nnn>. Similarly, \xI<nn>, where I<nn> are hexadecimal digits,
matches the character whose ordinal is I<nn>. The expression \cI<x>
matches the character control-I<x>. Finally, the "." metacharacter
matches any character except "\n" (unless you use C</s>).

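A few of these escapes can be exercised as follows. This is only a
sketch: it assumes an ASCII-based platform (where "A" is octal 101 and
hexadecimal 41), and the C<%ok> hash is an illustrative name.

```perl
use strict;
use warnings;

# Each test is forced to 1 or 0 so the hash stays well-formed.
my %ok = (
    octal       => ( "A"  =~ /^\101$/ ? 1 : 0 ),   # \101 is "A" in ASCII
    hex         => ( "A"  =~ /^\x41$/ ? 1 : 0 ),   # \x41 is "A" in ASCII
    control     => ( "\t" =~ /^\cI$/  ? 1 : 0 ),   # control-I is a tab
    dot_newline => ( "\n" =~ /^.$/    ? 1 : 0 ),   # "." skips "\n" ...
    dot_with_s  => ( "\n" =~ /^.$/s   ? 1 : 0 ),   # ... unless /s is used
);

print "$_: ", ( $ok{$_} ? 'matches' : 'does not match' ), "\n"
    for sort keys %ok;
```
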
You can specify a series of alternatives for a pattern using "|" to
separate them, so that C<fee|fie|foe> will match any of "fee", "fie",
or "foe" in the target string (as would C<f(e|i|o)e>). The
first alternative includes everything from the last pattern delimiter
("(", "(?:", etc. or the beginning of the pattern) up to the first "|", and
the last alternative contains everything from the last "|" to the next
closing pattern delimiter. That's why it's common practice to include
alternatives in parentheses: to minimize confusion about where they
start and end.

Alternatives are tried from left to right, so the first
alternative found for which the entire expression matches is the one that
is chosen. This means that alternatives are not necessarily greedy. For
example: when matching C<foo|foot> against "barefoot", only the "foo"
part will match, as that is the first alternative tried, and it successfully
matches the target string. (This might not seem important, but it is
important when you are capturing matched text using parentheses.)

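The "barefoot" case is easy to reproduce. In the sketch below the only
thing that changes between the two patterns is the order of the
alternatives; C<$first> and C<$second> are illustrative names.

```perl
use strict;
use warnings;

# List-context matches return the captured text.
my ($first)  = "barefoot" =~ /(foo|foot)/;   # "foo" is tried first and succeeds
my ($second) = "barefoot" =~ /(foot|foo)/;   # here "foot" is tried first

print "foo|foot captured: $first\n";
print "foot|foo captured: $second\n";
```
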
Also remember that "|" is interpreted as a literal within square brackets,
so if you write C<[fee|fie|foe]> you're really only matching C<[feio|]>.

Within a pattern, you may designate subpatterns for later reference
by enclosing them in parentheses, and you may refer back to the
I<n>th subpattern later in the pattern using the metacharacter
\I<n> or \gI<n>. Subpatterns are numbered based on the left to right order
of their opening parenthesis. A backreference matches whatever
actually matched the subpattern in the string being examined, not
the rules for that subpattern. Therefore, C<(0|0x)\d*\s\g1\d*> will
match "0x1234 0x4321", but not "0x1234 01234", because subpattern
1 matched "0x", even though the rule C<0|0x> could potentially match
the leading 0 in the second number.

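The two strings from the paragraph above can be tested like this. It is
a minimal sketch; C<$re> and C<%matched> are names introduced for the
example.

```perl
use strict;
use warnings;

my $re = qr/(0|0x)\d*\s\g1\d*/;
my %matched;

for my $str ( "0x1234 0x4321", "0x1234 01234" ) {
    # \g1 must repeat the text that group 1 actually captured.
    $matched{$str} =
        $str =~ $re ? "matches (group 1 is '$1')" : "does not match";
    print "'$str' $matched{$str}\n";
}
```
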
=head2 Warning on \1 Instead of $1

Some people get too used to writing things like:

    $pattern =~ s/(\W)/\\\1/g;

This is grandfathered (for \1 to \9) for the RHS of a substitute to avoid
shocking the B<sed> addicts, but it's a dirty habit to get into. That's
because in PerlThink, the righthand side of an C<s///> is a double-quoted
string. C<\1> in the usual double-quoted string means a control-A. The
customary Unix meaning of C<\1> is kludged in for C<s///>. However, if you
get into the habit of doing that, you get yourself into trouble if you then
add an C</e> modifier.

    s/(\d+)/ \1 + 1 /eg;            # causes warning under -w

Or if you try to do

    s/(\d+)/\1000/;

You can't disambiguate that by saying C<\{1}000>, whereas you can fix it with
C<${1}000>. The operation of interpolation should not be confused
with the operation of matching a backreference. Certainly they mean two
different things on the I<left> side of the C<s///>.

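Written with C<$1>, both of the troublesome substitutions above behave
as intended. The sample strings and variable names below are made up for
the sketch.

```perl
use strict;
use warnings;

# ${1}000 keeps the group number separate from the digits that follow.
( my $padded = "price 42" ) =~ s/(\d+)/${1}000/;

# With /e the replacement is Perl code, so $1 is just a variable.
( my $bumped = "7 8 9" ) =~ s/(\d+)/$1 + 1/eg;

print "$padded\n";
print "$bumped\n";
```
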
=head2 Repeated Patterns Matching a Zero-length Substring

B<WARNING>: Difficult material (and prose) ahead. This section needs a rewrite.

Regular expressions provide a terse and powerful programming language. As
with most other power tools, power comes together with the ability
to wreak havoc.

A common abuse of this power stems from the ability to make infinite
loops using regular expressions, with something as innocuous as:

    'foo' =~ m{ ( o? )* }x;

The C<o?> matches at the beginning of C<'foo'>, and since the position
in the string is not moved by the match, C<o?> would match again and again
because of the C<*> quantifier. Another common way to create a similar cycle
is with the looping modifier C<//g>:

    @matches = ( 'foo' =~ m{ o? }xg );

or

    print "match: <$&>\n" while 'foo' =~ m{ o? }xg;

or the loop implied by split().

However, long experience has shown that many programming tasks may
be significantly simplified by using repeated subexpressions that
may match zero-length substrings. Here's a simple example:

    @chars = split //, $string;                  # // is not magic in split
    ($whitewashed = $string) =~ s/()/ /g;        # parens avoid magic s// /

Thus Perl allows such constructs, by I<forcefully breaking
the infinite loop>. The rules for this are different for lower-level
loops given by the greedy quantifiers C<*+{}>, and for higher-level
ones like the C</g> modifier or split() operator.

The lower-level loops are I<interrupted> (that is, the loop is
broken) when Perl detects that a repeated expression matched a
zero-length substring. Thus

    m{ (?: NON_ZERO_LENGTH | ZERO_LENGTH )* }x;

is made equivalent to

    m{ (?: NON_ZERO_LENGTH )* (?: ZERO_LENGTH )? }x;

For example, this program

    #!perl -l
    "aaaaab" =~ /
      (?:
         a                 # non-zero
         |                 # or
        (?{print "hello"}) # print hello whenever this
                           #    branch is tried
        (?=(b))            # zero-width assertion
      )*  # any number of times
     /x;
    print $&;
    print $1;

prints

    hello
    aaaaa
    b

Notice that "hello" is only printed once, as when Perl sees that the sixth
iteration of the outermost C<(?:)*> matches a zero-length string, it stops
the C<*>.

The higher-level loops preserve an additional state between iterations:
whether the last match was zero-length. To break the loop, the following
match after a zero-length match is prohibited from having a length of zero.
This prohibition interacts with backtracking (see L<"Backtracking">),
and so the I<second best> match is chosen if the I<best> match is of
zero length.

For example:

    $_ = 'bar';
    s/\w??/<$&>/g;

results in C<< <><b><><a><><r><> >>. At each position of the string the best
match given by non-greedy C<??> is the zero-length match, and the I<second
best> match is what is matched by C<\w>. Thus zero-length matches
alternate with one-character-long matches.

Similarly, for repeated C<m/()/g> the second-best match is the match at the
position one notch further in the string.

The additional state of being I<matched with zero-length> is associated with
the matched string, and is reset by each assignment to pos().
Zero-length matches at the end of the previous match are ignored
during C<split>.

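The higher-level rules can be observed with a few one-liners. The
following is a sketch using the strings from this section; the variable
names are illustrative, and a capture group is used in place of C<$&>
purely as a matter of style.

```perl
use strict;
use warnings;

# Each element is one match of o? against 'foo'; the loop ends instead
# of matching the empty string forever.
my @matches = 'foo' =~ /o?/g;
print scalar(@matches), " matches: ",
    join( ',', map { "<$_>" } @matches ), "\n";

# Zero-length and one-character matches alternate, as described above.
( my $marked = 'bar' ) =~ s/(\w??)/<$1>/g;
print "$marked\n";

# An empty pattern in split yields the individual characters.
my @chars = split //, 'bar';
print "@chars\n";
```
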
=head2 Combining RE Pieces

Each of the elementary pieces of regular expressions which were described
before (such as C<ab> or C<\Z>) could match at most one substring
at the given position of the input string. However, in a typical regular
expression these elementary pieces are combined into more complicated
patterns using combining operators C<ST>, C<S|T>, C<S*> etc.
(in these examples C<S> and C<T> are regular subexpressions).

Such combinations can include alternatives, leading to a problem of choice:
if we match a regular expression C<a|ab> against C<"abc">, will it match
substring C<"a"> or C<"ab">? One way to describe which substring is
actually matched is the concept of backtracking (see L<"Backtracking">).
However, this description is too low-level and makes you think
in terms of a particular implementation.

Another description starts with notions of "better"/"worse". All the
substrings which may be matched by the given regular expression can be
sorted from the "best" match to the "worst" match, and it is the "best"
match which is chosen. This replaces the question of "what is chosen?"
with the question of "which matches are better, and which are worse?".

Again, for elementary pieces there is no such question, since at most
one match at a given position is possible. This section describes the
notion of better/worse for combining operators. In the description
below C<S> and C<T> are regular subexpressions.

=over 4

=item C<ST>

Consider two possible matches, C<AB> and C<A'B'>, where C<A> and C<A'> are
substrings which can be matched by C<S>, and C<B> and C<B'> are substrings
which can be matched by C<T>.

If C<A> is a better match for C<S> than C<A'>, C<AB> is a better
match than C<A'B'>.

If C<A> and C<A'> coincide: C<AB> is a better match than C<AB'> if
C<B> is a better match for C<T> than C<B'>.

=item C<S|T>

When C<S> can match, it is a better match than when only C<T> can match.

The ordering of two matches for C<S> is the same as it is for C<S> alone.
Similarly for two matches for C<T>.

=item C<S{REPEAT_COUNT}>

Matches as C<SSS...S> (repeated as many times as necessary).

=item C<S{min,max}>

Matches as C<S{max}|S{max-1}|...|S{min+1}|S{min}>.

=item C<S{min,max}?>

Matches as C<S{min}|S{min+1}|...|S{max-1}|S{max}>.

=item C<S?>, C<S*>, C<S+>

Same as C<S{0,1}>, C<S{0,BIG_NUMBER}>, C<S{1,BIG_NUMBER}> respectively.

=item C<S??>, C<S*?>, C<S+?>

Same as C<S{0,1}?>, C<S{0,BIG_NUMBER}?>, C<S{1,BIG_NUMBER}?> respectively.

=item C<< (?>S) >>

Matches the best match for C<S> and only that.

=item C<(?=S)>, C<(?<=S)>

Only the best match for C<S> is considered. (This is important only if
C<S> has capturing parentheses, and backreferences are used somewhere
else in the whole regular expression.)

=item C<(?!S)>, C<(?<!S)>

For this grouping operator there is no need to describe the ordering, since
only whether or not C<S> can match is important.

=item C<(??{ EXPR })>, C<(?I<PARNO>)>

The ordering is the same as for the regular expression which is
the result of EXPR, or the pattern contained by capture group I<PARNO>.

=item C<(?(condition)yes-pattern|no-pattern)>

Recall that which of C<yes-pattern> or C<no-pattern> actually matches is
already determined. The ordering of the matches is the same as for the
chosen subexpression.

=back

The above recipes describe the ordering of matches I<at a given position>.
One more rule is needed to understand how a match is determined for the
whole regular expression: a match at an earlier position is always better
than a match at a later position.

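A few of the ordering rules above can be checked with short patterns.
This sketch uses made-up strings; C<@cases> and C<@got> are illustrative
names, and each pattern is wrapped in a capture group only to record
what was matched.

```perl
use strict;
use warnings;

my @cases = (
    [ 'abc',  qr/a|ab/    ],   # S|T: a match for S beats a match for T
    [ 'abc',  qr/ab|a/    ],   # same strings, alternatives swapped
    [ 'aaaa', qr/a{1,3}/  ],   # S{min,max}: larger counts are better
    [ 'aaaa', qr/a{1,3}?/ ],   # S{min,max}?: smaller counts are better
    [ 'xab',  qr/b|a/     ],   # an earlier position beats a later one
);

my @got;
for my $case (@cases) {
    my ( $str, $re ) = @$case;
    my ($m) = $str =~ /($re)/;
    push @got, $m;
    print "'$str' =~ $re gives '$m'\n";
}
```

The last case is the one to note: "b" is the first alternative, but "a"
occurs earlier in the string, so "a" is what gets matched.
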
=head2 Creating Custom RE Engines

As of Perl 5.10.0, one can create custom regular expression engines. This
is not for the faint of heart, as they have to plug in at the C level. See
L<perlreapi> for more details.

As an alternative, overloaded constants (see L<overload>) provide a simple
way to extend the functionality of the RE engine, by substituting one
pattern for another.

Suppose that we want to enable a new RE escape-sequence C<\Y|> which
matches at a boundary between whitespace characters and non-whitespace
characters. Note that C<(?=\S)(?<!\S)|(?!\S)(?<=\S)> matches exactly
at these positions, so we want to have each C<\Y|> in the place of the
more complicated version. We can create a module C<customre> to do
this:

    package customre;
    use overload;

    sub import {
      shift;
      die "No argument to customre::import allowed" if @_;
      overload::constant 'qr' => \&convert;
    }

    sub invalid { die "/$_[0]/: invalid escape '\\$_[1]'"}

    # We must also take care of not escaping the legitimate \\Y|
    # sequence, hence the presence of '\\' in the conversion rules.
    my %rules = ( '\\' => '\\\\',
                  'Y|' => qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/ );
    sub convert {
      my $re = shift;
      $re =~ s{
                \\ ( \\ | Y . )
              }
              { $rules{$1} or invalid($re,$1) }sgex;
      return $re;
    }

Now C<use customre> enables the new escape in constant regular
expressions, i.e., those without any runtime variable interpolations.
As documented in L<overload>, this conversion will work only over
literal parts of regular expressions. For C<\Y|$re\Y|> the variable
part of this regular expression needs to be converted explicitly
(but only if the special meaning of C<\Y|> should be enabled inside $re):

    use customre;
    $re = <>;
    chomp $re;
    $re = customre::convert $re;
    /\Y|$re\Y|/;

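Independently of the C<customre> module, the look-around pattern that
C<\Y|> stands for can be tried on its own. The sketch below does not use
the module at all; the sample string and the names C<$boundary> and
C<@positions> are assumptions made for the example.

```perl
use strict;
use warnings;

my $boundary = qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/;
my $str      = "ab  cd";
my @positions;

while ( $str =~ /$boundary/g ) {
    # The match is zero-width, so pos() is the boundary itself.
    push @positions, pos($str);
}
print "boundaries at: @positions\n";
```

The reported offsets are the edges of the two words, including the very
start and end of the string; no boundary is reported between the two
spaces.
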
=head2 Embedded Code Execution Frequency

The exact rules for how often C<(??{})> and C<(?{})> are executed in a
pattern are unspecified. In the case of a successful match you can assume
that they DWIM and will be executed in left to right order the appropriate
number of times in the accepting path of the pattern, as would any other
meta-pattern. How non-accepting pathways and match failures affect the
number of times a pattern is executed is deliberately left unspecified: it
may vary depending on what optimizations can be applied to the pattern,
and is likely to change from version to version.

For instance in

    "aaabcdeeeee"=~/a(?{print "a"})b(?{print "b"})cde/;

the exact number of times "a" or "b" are printed out is unspecified for
failure, but you may assume they will be printed at least once during
a successful match. Additionally, you may assume that if "b" is printed,
it will be preceded by at least one "a".

In the case of branching constructs like the following:

    /a(b|(?{ print "a" }))c(?{ print "c" })/;

you can assume that the input "ac" will output "ac", and that "abc"
will output only "c".

When embedded code is quantified, successful matches will call the
code once for each matched iteration of the quantifier. For
example:

    "good" =~ /g(?:o(?{print "o"}))*d/;

will output "o" twice.

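The guarantees for successful matches can be observed by appending to a
string instead of printing. This is a sketch: the package variable
C<$trace> and the other names are introduced here, and only the
successful-match cases described above are exercised.

```perl
use strict;
use warnings;

our $trace;    # package variable, so the embedded code blocks can append to it

$trace = '';
"ac" =~ /a(b|(?{ $trace .= "a" }))c(?{ $trace .= "c" })/;
my $for_ac = $trace;

$trace = '';
"abc" =~ /a(b|(?{ $trace .= "a" }))c(?{ $trace .= "c" })/;
my $for_abc = $trace;

$trace = '';
"good" =~ /g(?:o(?{ $trace .= "o" }))*d/;
my $for_good = $trace;

print "ac: $for_ac, abc: $for_abc, good: $for_good\n";
```
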
=head2 PCRE/Python Support

As of Perl 5.10.0, Perl supports several Python/PCRE-specific extensions
to the regex syntax. While Perl programmers are encouraged to use the
Perl-specific syntax, the following are also accepted:

=over 4

=item C<< (?PE<lt>NAMEE<gt>pattern) >>

Define a named capture group. Equivalent to C<< (?<NAME>pattern) >>.

=item C<< (?P=NAME) >>

Backreference to a named capture group. Equivalent to C<< \g{NAME} >>.

=item C<< (?P>NAME) >>

Subroutine call to a named capture group. Equivalent to C<< (?&NAME) >>.

=back

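The three forms listed above can be tried together. In this sketch the
group names C<pair> and C<d> and the sample strings are invented for
illustration.

```perl
use strict;
use warnings;

my ( $pair, $digits );

# (?P<pair>...) defines the group; (?P=pair) must repeat its text.
if ( "abab" =~ /(?P<pair>ab)(?P=pair)/ ) {
    $pair = $+{pair};    # named captures are available in %+
    print "pair=$pair\n";
}

# (?P>d) re-runs the sub-pattern of group "d" rather than its text.
if ( "2024" =~ /^(?P<d>\d)(?P>d)+$/ ) {
    $digits = $&;
    print "digits=$digits\n";
}
```
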
=head1 BUGS

Many regular expression constructs don't work on EBCDIC platforms.

There are a number of issues with regard to case-insensitive matching
in Unicode rules. See C<i> under L</Modifiers> above.

This document varies from difficult to understand to completely
and utterly opaque. The wandering prose riddled with jargon is
hard to fathom in several places.

This document needs a rewrite that separates the tutorial content
from the reference content.

=head1 SEE ALSO

L<perlrequick>.

L<perlretut>.

L<perlop/"Regexp Quote-Like Operators">.

L<perlop/"Gory details of parsing quoted constructs">.

L<perlfaq6>.

L<perlfunc/pos>.

L<perllocale>.

L<perlebcdic>.

I<Mastering Regular Expressions> by Jeffrey Friedl, published
by O'Reilly and Associates.