=head1 NAME
X<regular expression> X<regex> X<regexp>

perlre - Perl regular expressions

=head1 DESCRIPTION

This page describes the syntax of regular expressions in Perl.

If you haven't used regular expressions before, a quick-start
introduction is available in L<perlrequick>, and a longer tutorial
introduction is available in L<perlretut>.

For reference on how regular expressions are used in matching
operations, plus various examples of the same, see discussions of
C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/"Regexp Quote-Like
Operators">.

=head2 Modifiers

Matching operations can have various modifiers. Modifiers
that relate to the interpretation of the regular expression inside
are listed below. Modifiers that alter the way a regular expression
is used by Perl are detailed in L<perlop/"Regexp Quote-Like Operators"> and
L<perlop/"Gory details of parsing quoted constructs">.

=over 4

=item m
X</m> X<regex, multiline> X<regexp, multiline> X<regular expression, multiline>

Treat string as multiple lines. That is, change "^" and "$" from matching
the start or end of the string to matching the start or end of any
line anywhere within the string.

=item s
X</s> X<regex, single-line> X<regexp, single-line>
X<regular expression, single-line>

Treat string as single line. That is, change "." to match any character
whatsoever, even a newline, which normally it would not match.

Used together, as /ms, they let the "." match any character whatsoever,
while still allowing "^" and "$" to match, respectively, just after
and just before newlines within the string.

=item i
X</i> X<regex, case-insensitive> X<regexp, case-insensitive>
X<regular expression, case-insensitive>

Do case-insensitive pattern matching.

If C<use locale> is in effect, the case map is taken from the current
locale. See L<perllocale>.

=item x
X</x>

Extend your pattern's legibility by permitting whitespace and comments.

=item p
X</p> X<regex, preserve> X<regexp, preserve>

Preserve the string matched such that ${^PREMATCH}, ${^MATCH}, and
${^POSTMATCH} are available for use after matching.

=item g and c
X</g> X</c>

Global matching, and keep the Current position after failed matching.
Unlike i, m, s and x, these two flags affect the way the regex is used
rather than the regex itself. See
L<perlretut/"Using regular expressions in Perl"> for further explanation
of the g and c modifiers.

=back

These are usually written as "the C</x> modifier", even though the delimiter
in question might not really be a slash. Any of these
modifiers may also be embedded within the regular expression itself using
the C<(?...)> construct. See below.

The C</x> modifier itself needs a little more explanation. It tells
the regular expression parser to ignore whitespace that is neither
backslashed nor within a character class. You can use this to break up
your regular expression into (slightly) more readable parts. The C<#>
character is also treated as a metacharacter introducing a comment,
just as in ordinary Perl code. This also means that if you want real
whitespace or C<#> characters in the pattern (outside a character
class, where they are unaffected by C</x>), then you'll either have to
escape them (using backslashes or C<\Q...\E>) or encode them using octal
or hex escapes. Taken together, these features go a long way towards
making Perl's regular expressions more readable. Note that you have to
be careful not to include the pattern delimiter in the comment--perl has
no way of knowing you did not intend to close the pattern early. See
the C-comment deletion code in L<perlop>. Also note that anything inside
a C<\Q...\E> stays unaffected by C</x>.
X</x>
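
As a rough illustration of the C</x> rules just described, the following
sketch spreads a date pattern over several lines. The variable names and the
sample input are invented for this example. Braces are used as the pattern
delimiter, and the comments inside the pattern avoid unbalanced braces so
that none of them can close the pattern early.

```perl
use strict;
use warnings;

# Hypothetical date pattern; the names and input are illustrative only.
my $date_re = qr{
    (\d{4})    # year
    -
    (\d{2})    # month
    -
    (\d{2})    # day
    [ ]        # a space inside a character class is kept
    \#         # an escaped hash is a literal, not a comment
}x;

my $input = "released 2007-12-18 # final";
my ($year, $month, $day) = $input =~ $date_re;
print "year=$year month=$month day=$day\n" if defined $year;
```

The unescaped whitespace and the trailing comments are ignored by the
parser, while the bracketed space and the backslashed C<#> still have to
appear in the input for the match to succeed.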

=head2 Regular Expressions

=head3 Metacharacters

The patterns used in Perl pattern matching evolved from those supplied in
the Version 8 regex routines. (The routines are derived
(distantly) from Henry Spencer's freely redistributable reimplementation
of the V8 routines.) See L<Version 8 Regular Expressions> for
details.

In particular the following metacharacters have their standard I<egrep>-ish
meanings:
X<metacharacter>
X<\> X<^> X<.> X<$> X<|> X<(> X<()> X<[> X<[]>


    \   Quote the next metacharacter
    ^   Match the beginning of the line
    .   Match any character (except newline)
    $   Match the end of the line (or before newline at the end)
    |   Alternation
    ()  Grouping
    []  Character class

By default, the "^" character is guaranteed to match only the
beginning of the string, the "$" character only the end (or before the
newline at the end), and Perl does certain optimizations with the
assumption that the string contains only one line. Embedded newlines
will not be matched by "^" or "$". You may, however, wish to treat a
string as a multi-line buffer, such that the "^" will match after any
newline within the string (except if the newline is the last character in
the string), and "$" will match before any newline. At the
cost of a little more overhead, you can do this by using the /m modifier
on the pattern match operator. (Older programs did this by setting C<$*>,
but support for that was removed in perl 5.9.)
X<^> X<$> X</m>

To simplify multi-line substitutions, the "." character never matches a
newline unless you use the C</s> modifier, which in effect tells Perl to pretend
the string is a single line--even if it isn't.
X<.> X</s>
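
The effect of C</m> and C</s> can be seen in a small sketch like the one
below. The two-line sample string and the variable names are made up for the
illustration.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# Without /m, "^" matches only at the very start of the string.
my @plain = $text =~ /^(\w+)/g;

# With /m, "^" also matches just after each embedded newline
# (but not after the newline that ends the string).
my @multi = $text =~ /^(\w+)/mg;

# Without /s, "." stops at a newline; with /s it crosses it.
my ($no_s)   = $text =~ /(first.*line)/;
my ($with_s) = $text =~ /(first.*line)/s;

print scalar(@plain), " match without /m, ", scalar(@multi), " with /m\n";
```

Here C<@plain> holds only C<first>, C<@multi> holds C<first> and C<second>,
and only the C</s> version lets C<.*> run across the embedded newline.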

=head3 Quantifiers

The following standard quantifiers are recognized:
X<metacharacter> X<quantifier> X<*> X<+> X<?> X<{n}> X<{n,}> X<{n,m}>

    *      Match 0 or more times
    +      Match 1 or more times
    ?      Match 1 or 0 times
    {n}    Match exactly n times
    {n,}   Match at least n times
    {n,m}  Match at least n but not more than m times

(If a curly bracket occurs in any other context, it is treated
as a regular character. In particular, the lower bound
is not optional.) The "*" quantifier is equivalent to C<{0,}>, the "+"
quantifier to C<{1,}>, and the "?" quantifier to C<{0,1}>. n and m are limited
to integral values less than a preset limit defined when perl is built.
This is usually 32766 on the most common platforms. The actual limit can
be seen in the error message generated by code such as this:

    $_ **= $_ , / {$_} / for 2 .. 42;

By default, a quantified subpattern is "greedy", that is, it will match as
many times as possible (given a particular starting location) while still
allowing the rest of the pattern to match. If you want it to match the
minimum number of times possible, follow the quantifier with a "?". Note
that the meanings don't change, just the "greediness":
X<metacharacter> X<greedy> X<greediness>
X<?> X<*?> X<+?> X<??> X<{n}?> X<{n,}?> X<{n,m}?>

    *?     Match 0 or more times, not greedily
    +?     Match 1 or more times, not greedily
    ??     Match 0 or 1 time, not greedily
    {n}?   Match exactly n times, not greedily
    {n,}?  Match at least n times, not greedily
    {n,m}? Match at least n but not more than m times, not greedily
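
A short sketch of the difference between a greedy and a non-greedy
quantifier, using an invented markup string:

```perl
use strict;
use warnings;

my $html = "<b>bold</b> and <i>italic</i>";

# Greedy: ".+" runs to the end and backs off only to the last ">".
my ($greedy) = $html =~ /<(.+)>/;

# Non-greedy: ".+?" stops at the first ">" that lets the match succeed.
my ($lazy) = $html =~ /<(.+?)>/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

Both patterns start at the same place; only the amount of text consumed by
the quantified C<.> differs.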

By default, when a quantified subpattern does not allow the rest of the
overall pattern to match, Perl will backtrack. However, this behaviour is
sometimes undesirable. Thus Perl provides the "possessive" quantifier form
as well.

    *+     Match 0 or more times and give nothing back
    ++     Match 1 or more times and give nothing back
    ?+     Match 0 or 1 time and give nothing back
    {n}+   Match exactly n times and give nothing back (redundant)
    {n,}+  Match at least n times and give nothing back
    {n,m}+ Match at least n but not more than m times and give nothing back

For instance,

    'aaaa' =~ /a++a/

will never match, as the C<a++> will gobble up all the C<a>'s in the
string and won't leave any for the remaining part of the pattern. This
feature can be extremely useful to give perl hints about where it
shouldn't backtrack. For instance, the typical "match a double-quoted
string" problem can be most efficiently performed when written as:

    /"(?:[^"\\]++|\\.)*+"/

as we know that if the final quote does not match, backtracking will not
help. See the independent subexpression C<< (?>...) >> for more details;
possessive quantifiers are just syntactic sugar for that construct. For
instance the above example could also be written as follows:

    /"(?>(?:(?>[^"\\]+)|\\.)*)"/
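
The following sketch, which assumes perl 5.10 or later, compares the
possessive form with its greedy and C<< (?>...) >> counterparts, and then
applies the double-quoted-string pattern to a made-up input.

```perl
use strict;
use warnings;

# The same four-character subject under three quantifier styles.
my $greedy     = 'aaaa' =~ /a+a/     ? 1 : 0;   # backtracks, so it matches
my $possessive = 'aaaa' =~ /a++a/    ? 1 : 0;   # gives nothing back
my $atomic     = 'aaaa' =~ /(?>a+)a/ ? 1 : 0;   # same as the possessive form

# The double-quoted-string pattern, with a capture group added.
my $src = 'say "he said \"hi\"" now';
my ($quoted) = $src =~ /("(?:[^"\\]++|\\.)*+")/;

print "greedy=$greedy possessive=$possessive atomic=$atomic\n";
print "quoted: $quoted\n";
```

The backslash-escaped quotes inside the sample are consumed by the C<\\.>
branch, so the match ends only at the real closing quote.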

=head3 Escape sequences

Because patterns are processed as double quoted strings, the following
also work:
X<\t> X<\n> X<\r> X<\f> X<\e> X<\a> X<\l> X<\u> X<\L> X<\U> X<\E> X<\Q>
X<\0> X<\c> X<\N> X<\x>

    \t          tab                   (HT, TAB)
    \n          newline               (LF, NL)
    \r          return                (CR)
    \f          form feed             (FF)
    \a          alarm (bell)          (BEL)
    \e          escape (think troff)  (ESC)
    \033        octal char            (example: ESC)
    \x1B        hex char              (example: ESC)
    \x{263a}    long hex char         (example: Unicode SMILEY)
    \cK         control char          (example: VT)
    \N{name}    named Unicode character
    \N{U+263D}  Unicode character     (example: FIRST QUARTER MOON)
    \l          lowercase next char (think vi)
    \u          uppercase next char (think vi)
    \L          lowercase till \E (think vi)
    \U          uppercase till \E (think vi)
    \E          end case modification (think vi)
    \Q          quote (disable) pattern metacharacters till \E

If C<use locale> is in effect, the case map used by C<\l>, C<\L>, C<\u>
and C<\U> is taken from the current locale. See L<perllocale>. For
documentation of C<\N{name}>, see L<charnames>.

You cannot include a literal C<$> or C<@> within a C<\Q> sequence.
An unescaped C<$> or C<@> interpolates the corresponding variable,
while escaping will cause the literal string C<\$> to be matched.
You'll need to write something like C<m/\Quser\E\@\Qhost/>.
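
A minimal sketch of that idiom with interpolated variables follows. The
names C<$user> and C<$host> and their values are placeholders chosen for the
example. The literal C<@> sits outside the C<\Q...\E> spans, while the
variables are interpolated and quoted inside them.

```perl
use strict;
use warnings;

my $user = 'j.doe';
my $host = 'example.com';

# The "." in each variable is quoted; the "@" is escaped separately.
my $address_re = qr/\A\Q$user\E\@\Q$host\E\z/;

my $exact = 'j.doe@example.com' =~ $address_re ? 1 : 0;
my $loose = 'jxdoe@example.com' =~ $address_re ? 1 : 0;

print "exact=$exact loose=$loose\n";
```

Because the dots are quoted, the second address fails to match even though
an unquoted C<.> would have accepted the C<x>.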

=head3 Character Classes and other Special Escapes

In addition, Perl defines the following:
X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\X> X<\p> X<\P> X<\C>
X<\g> X<\k> X<\N> X<\K> X<\v> X<\V> X<\h> X<\H>
X<word> X<whitespace> X<character class> X<backreference>

    \w       Match a "word" character (alphanumeric plus "_")
    \W       Match a non-"word" character
    \s       Match a whitespace character
    \S       Match a non-whitespace character
    \d       Match a digit character
    \D       Match a non-digit character
    \pP      Match P, named property. Use \p{Prop} for longer names.
    \PP      Match non-P
    \X       Match Unicode "eXtended grapheme cluster"
    \C       Match a single C char (octet) even under Unicode.
             NOTE: breaks up characters into their UTF-8 bytes,
             so you may end up with malformed pieces of UTF-8.
             Unsupported in lookbehind.
    \1       Backreference to a specific group.
             '1' may actually be any positive integer.
    \g1      Backreference to a specific or previous group,
    \g{-1}   number may be negative indicating a previous buffer and may
             optionally be wrapped in curly brackets for safer parsing.
    \g{name} Named backreference
    \k<name> Named backreference
    \K       Keep the stuff left of the \K, don't include it in $&
    \N       Any character but \n
    \v       Vertical whitespace
    \V       Not vertical whitespace
    \h       Horizontal whitespace
    \H       Not horizontal whitespace
    \R       Linebreak

A C<\w> matches a single alphanumeric character (an alphabetic
character, or a decimal digit) or C<_>, not a whole word. Use C<\w+>
to match a string of Perl-identifier characters (which isn't the same
as matching an English word). If C<use locale> is in effect, the list
of alphabetic characters generated by C<\w> is taken from the current
locale. See L<perllocale>. You may use C<\w>, C<\W>, C<\s>, C<\S>,
C<\d>, and C<\D> within character classes, but they aren't usable
as either end of a range. If any of them precedes or follows a "-",
the "-" is understood literally. If Unicode is in effect, C<\s> also
matches "\x{85}", "\x{2028}", and "\x{2029}". See L<perlunicode> for more
details about C<\pP>, C<\PP>, C<\X> and the possibility of defining
your own C<\p> and C<\P> properties, and L<perluniintro> about Unicode
in general.
X<\w> X<\W> X<word>

C<\R> will atomically match a linebreak, including the network line-ending
"\x0D\x0A". Specifically, C<\R> is exactly equivalent to

    (?>\x0D\x0A?|[\x0A-\x0C\x85\x{2028}\x{2029}])

B<Note:> C<\R> has no special meaning inside of a character class;
use C<\v> instead (vertical whitespace).
X<\R>

Note that C<\N> has two meanings. When of the form C<\N{NAME}>, it matches the
character whose name is C<NAME>; and similarly when of the form
C<\N{U+I<wide hex char>}>, it matches the character whose Unicode ordinal is
I<wide hex char>. Otherwise it matches any character but C<\n>.

The POSIX character class syntax
X<character class>

    [:class:]

is also available. Note that the C<[> and C<]> brackets are I<literal>;
they must always be used within a character class expression.

    # this is correct:
    $string =~ /[[:alpha:]]/;

    # this is not, and will generate a warning:
    $string =~ /[:alpha:]/;

The following table shows the mapping of POSIX character class
names, common escapes, literal escape sequences and their equivalent
Unicode style property names.
X<character class> X<\p> X<\p{}>
X<alpha> X<alnum> X<ascii> X<blank> X<cntrl> X<digit> X<graph>
X<lower> X<print> X<punct> X<space> X<upper> X<word> X<xdigit>

B<Note:> up to Perl 5.10 the property names used were shared with
standard Unicode properties; this was changed in Perl 5.11. See
L<perl5110delta> for details.

    POSIX   Esc  Class             Property      Note
    --------------------------------------------------------
    alnum        [0-9A-Za-z]       IsPosixAlnum
    alpha        [A-Za-z]          IsPosixAlpha
    ascii        [\000-\177]       IsASCII
    blank        [\011 ]           IsPosixBlank  [1]
    cntrl        [\0-\37\177]      IsPosixCntrl
    digit   \d   [0-9]             IsPosixDigit
    graph        [!-~]             IsPosixGraph
    lower        [a-z]             IsPosixLower
    print        [ -~]             IsPosixPrint
    punct        [!-/:-@[-`{-~]    IsPosixPunct
    space        [\11-\15 ]        IsPosixSpace  [2]
            \s   [\11\12\14\15 ]   IsPerlSpace   [2]
    upper        [A-Z]             IsPosixUpper
    word    \w   [0-9A-Z_a-z]      IsPerlWord    [3]
    xdigit       [0-9A-Fa-f]       IsXDigit

=over

=item [1]

A GNU extension equivalent to C<[ \t]>, "all horizontal whitespace".

=item [2]

Note that C<\s> and C<[[:space:]]> are B<not> equivalent, as C<[[:space:]]>
also includes the (very rare) "vertical tabulator", "\cK" or chr(11) in
ASCII.

=item [3]

A Perl extension, see above.

=back

For example, use C<[:upper:]> to match all the uppercase characters.
Note that the C<[]> are part of the C<[::]> construct, not part of the
whole character class. For example:

    [01[:alpha:]%]

matches zero, one, any alphabetic character, and the percent sign.

The other named classes are:

=over 4

=item cntrl
X<cntrl>

Any control character. Usually characters that don't produce output as
such but instead control the terminal somehow: for example newline and
backspace are control characters. All characters with ord() less than
32 are usually classified as control characters (assuming ASCII,
the ISO Latin character sets, and Unicode), as is the character with
the ord() value of 127 (C<DEL>).

=item graph
X<graph>

Any alphanumeric or punctuation (special) character.

=item print
X<print>

Any alphanumeric or punctuation (special) character or the space character.

=item punct
X<punct>

Any punctuation (special) character.

=item xdigit
X<xdigit>

Any hexadecimal digit. Though this may feel silly ([0-9A-Fa-f] would
work just fine) it is included for completeness.

=back

You can negate the [::] character classes by prefixing the class name
with a '^'. This is a Perl extension. For example:
X<character class, negation>

    POSIX         traditional  Unicode

    [[:^digit:]]    \D         \P{IsPosixDigit}
    [[:^space:]]    \S         \P{IsPosixSpace}
    [[:^word:]]     \W         \P{IsPerlWord}

Perl respects the POSIX standard in that POSIX character classes are
only supported within a character class. The POSIX character classes
[.cc.] and [=cc=] are recognized but B<not> supported and trying to
use them will cause an error.

=head3 Assertions

Perl defines the following zero-width assertions:
X<zero-width assertion> X<assertion> X<regex, zero-width assertion>
X<regexp, zero-width assertion>
X<regular expression, zero-width assertion>
X<\b> X<\B> X<\A> X<\Z> X<\z> X<\G>

    \b  Match a word boundary
    \B  Match except at a word boundary
    \A  Match only at beginning of string
    \Z  Match only at end of string, or before newline at the end
    \z  Match only at end of string
    \G  Match only at pos() (e.g. at the end-of-match position
        of prior m//g)

A word boundary (C<\b>) is a spot between two characters
that has a C<\w> on one side of it and a C<\W> on the other side
of it (in either order), counting the imaginary characters off the
beginning and end of the string as matching a C<\W>. (Within
character classes C<\b> represents backspace rather than a word
boundary, just as it normally does in any double-quoted string.)
The C<\A> and C<\Z> are just like "^" and "$", except that they
won't match multiple times when the C</m> modifier is used, while
"^" and "$" will match at every internal line boundary. To match
the actual end of the string and not ignore an optional trailing
newline, use C<\z>.
X<\b> X<\A> X<\Z> X<\z> X</m>
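
The differences between these anchors can be sketched as follows; the
sample strings are invented for the illustration.

```perl
use strict;
use warnings;

my $s = "foo\nbar\n";

# Under /m, "$" matches before every newline.
my @dollar_m = $s =~ /(\w+)$/mg;

# \Z tolerates one trailing newline; \z does not.
my $big_z   = $s =~ /bar\Z/ ? 1 : 0;
my $small_z = $s =~ /bar\z/ ? 1 : 0;

# \A matches only at the true start, even under /m.
my @caret_m = $s =~ /^(\w+)/mg;
my @big_a   = $s =~ /\A(\w+)/mg;

# \b sits between a \w and a \W, so the apostrophe splits "don't".
my @words = "don't stop" =~ /\b(\w+)\b/g;

print "dollar_m=@dollar_m Z=$big_z z=$small_z words=@words\n";
```

In this sketch C<@caret_m> collects both lines while C<@big_a> collects only
the first, which is the multiple-match behaviour described above.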

The C<\G> assertion can be used to chain global matches (using
C<m//g>), as described in L<perlop/"Regexp Quote-Like Operators">.
It is also useful when writing C<lex>-like scanners, when you have
several patterns that you want to match against consecutive substrings
of your string; see the previous reference. The actual location
where C<\G> will match can also be influenced by using C<pos()> as
an lvalue: see L<perlfunc/pos>. Note that the rule for zero-length
matches is modified somewhat, in that contents to the left of C<\G> are
not counted when determining the length of the match. Thus the following
will not match forever:
X<\G>

    $str = 'ABC';
    pos($str) = 1;
    while ($str =~ /.\G/g) {
        print $&;
    }

It will print 'A' and then terminate, as it considers the match to
be zero-width, and thus will not match at the same position twice in a
row.

It is worth noting that C<\G> improperly used can result in an infinite
loop. Take care when using patterns that include C<\G> in an alternation.
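
As a rough sketch of the C<lex>-like use of C<\G>, the loop below tokenizes
a tiny made-up expression. The token labels and the input are assumptions of
this example. Each branch is anchored with C<\G>, and the C</c> modifier
keeps C<pos()> in place when a branch fails so that the next one can try at
the same offset.

```perl
use strict;
use warnings;

my $src = "x = 42 + y";
my @tokens;

pos($src) = 0;
while (pos($src) < length $src) {
    if    ($src =~ /\G\s+/gc)            { next }                       # skip blanks
    elsif ($src =~ /\G(\d+)/gc)          { push @tokens, "NUM:$1" }
    elsif ($src =~ /\G([A-Za-z_]\w*)/gc) { push @tokens, "ID:$1" }
    elsif ($src =~ /\G([=+])/gc)         { push @tokens, "OP:$1" }
    else  { die "unexpected character at offset " . pos($src) . "\n" }
}

print join(" ", @tokens), "\n";
```

Every branch either consumes at least one character or dies, so the loop
cannot spin in place, which is the kind of care the warning above asks for.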

=head3 Capture buffers

The bracketing construct C<( ... )> creates capture buffers. To refer
to the current contents of a buffer later on, within the same pattern,
use \1 for the first, \2 for the second, and so on.
Outside the match use "$" instead of "\". (The
\<digit> notation works in certain circumstances outside
the match. See the warning below about \1 vs $1 for details.)
Referring back to another part of the match is called a
I<backreference>.
X<regex, capture buffer> X<regexp, capture buffer>
X<regular expression, capture buffer> X<backreference>

There is no limit to the number of captured substrings that you may
use. However Perl also uses \10, \11, etc. as aliases for \010,
\011, etc. (Recall that 0 means octal, so \011 is the character at
number 9 in your coded character set; which would be the 10th character,
a horizontal tab under ASCII.) Perl resolves this
ambiguity by interpreting \10 as a backreference only if at least 10
left parentheses have opened before it. Likewise \11 is a
backreference only if at least 11 left parentheses have opened
before it. And so on. \1 through \9 are always interpreted as
backreferences.
If the bracketing group did not match, the associated backreference won't
match either. (This can happen if the bracketing group is optional, or
in a different branch of an alternation.)

X<\g{1}> X<\g{-1}> X<\g{name}> X<relative backreference> X<named backreference>
In order to provide a safer and easier way to construct patterns using
backreferences, Perl provides the C<\g{N}> notation (starting with perl
5.10.0). The curly brackets are optional, however omitting them is less
safe as the meaning of the pattern can be changed by text (such as digits)
following it. When N is a positive integer the C<\g{N}> notation is
exactly equivalent to using normal backreferences. When N is a negative
integer then it is a relative backreference referring to the previous N'th
capturing group. When the bracket form is used and N is not an integer, it
is treated as a reference to a named buffer.

Thus C<\g{-1}> refers to the last buffer, C<\g{-2}> refers to the
buffer before that. For example:

    /
     (Y)            # buffer 1
     (              # buffer 2
        (X)         # buffer 3
        \g{-1}      # backref to buffer 3
        \g{-3}      # backref to buffer 1
     )
    /x

This would match the same as C</(Y) ( (X) \3 \1 )/x>.
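
One place where relative backreferences may help is when a pattern fragment
is interpolated into a larger pattern, since the absolute buffer numbers
then shift. The sketch below assumes perl 5.10 or later; the fragment names
are invented for the example.

```perl
use strict;
use warnings;

# Two hypothetical building blocks for "a doubled word character".
my $relative = qr/(\w)\g{-1}/;   # refers to the group just before it
my $absolute = qr/(\w)\1/;       # always refers to group 1

# Embedding them after another group shifts the numbering:
# (\d+) is now group 1 and (\w) is group 2.
my $rel_ok = "12-aa" =~ /(\d+)-$relative/ ? 1 : 0;
my $abs_ok = "12-aa" =~ /(\d+)-$absolute/ ? 1 : 0;

print "relative: $rel_ok, absolute: $abs_ok\n";
```

The relative form still finds the doubled C<a>, while the absolute form now
compares against the digits captured by the first group and fails.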

Additionally, as of Perl 5.10.0 you may use named capture buffers and named
backreferences. The notation is C<< (?<name>...) >> to declare and C<< \k<name> >>
to reference. You may also use apostrophes instead of angle brackets to delimit the
name; and you may use the bracketed C<< \g{name} >> backreference syntax.
It's possible to refer to a named capture buffer by absolute and relative number as well.
Outside the pattern, a named capture buffer is available via the C<%+> hash.
When different buffers within the same pattern have the same name, C<$+{name}>
and C<< \k<name> >> refer to the leftmost defined group. (Thus it's possible
to do things with named capture buffers that would otherwise require C<(??{})>
code to accomplish.)
X<named capture buffer> X<regular expression, named capture buffer>
X<%+> X<$+{name}> X<< \k<name> >>

Examples:

    s/^([^ ]*) *([^ ]*)/$2 $1/;     # swap first two words

    /(.)\1/                         # find first doubled char
         and print "'$1' is the first doubled character\n";

    /(?<char>.)\k<char>/            # ... a different way
         and print "'$+{char}' is the first doubled character\n";

    /(?'char'.)\1/                  # ... mix and match
         and print "'$1' is the first doubled character\n";

    if (/Time: (..):(..):(..)/) {   # parse out values
        $hours = $1;
        $minutes = $2;
        $seconds = $3;
    }

Several special variables also refer back to portions of the previous
match. C<$+> returns whatever the last bracket match matched.
C<$&> returns the entire matched string. (At one point C<$0> did
also, but now it returns the name of the program.) C<$`> returns
everything before the matched string. C<$'> returns everything
after the matched string. And C<$^N> contains whatever was matched by
the most-recently closed group (submatch). C<$^N> can be used in
extended patterns (see below), for example to assign a submatch to a
variable.
X<$+> X<$^N> X<$&> X<$`> X<$'>

The numbered match variables ($1, $2, $3, etc.) and the related punctuation
set (C<$+>, C<$&>, C<$`>, C<$'>, and C<$^N>) are all dynamically scoped
until the end of the enclosing block or until the next successful
match, whichever comes first. (See L<perlsyn/"Compound Statements">.)
X<$+> X<$^N> X<$&> X<$`> X<$'>
X<$1> X<$2> X<$3> X<$4> X<$5> X<$6> X<$7> X<$8> X<$9>


B<NOTE>: Failed matches in Perl do not reset the match variables,
which makes it easier to write code that tests for a series of more
specific cases and remembers the best match.

B<WARNING>: Once Perl sees that you need one of C<$&>, C<$`>, or
C<$'> anywhere in the program, it has to provide them for every
pattern match. This may substantially slow your program. Perl
uses the same mechanism to produce $1, $2, etc, so you also pay a
price for each pattern that contains capturing parentheses. (To
avoid this cost while retaining the grouping behaviour, use the
extended regular expression C<(?: ... )> instead.) But if you never
use C<$&>, C<$`> or C<$'>, then patterns I<without> capturing
parentheses will not be penalized. So avoid C<$&>, C<$'>, and C<$`>
if you can, but if you can't (and some algorithms really appreciate
them), once you've used them once, use them at will, because you've
already paid the price. As of 5.005, C<$&> is not so costly as the
other two.
X<$&> X<$`> X<$'>

As a workaround for this problem, Perl 5.10.0 introduces C<${^PREMATCH}>,
C<${^MATCH}> and C<${^POSTMATCH}>, which are equivalent to C<$`>, C<$&>
and C<$'>, B<except> that they are only guaranteed to be defined after a
successful match that was executed with the C</p> (preserve) modifier.
The use of these variables incurs no global performance penalty, unlike
their punctuation char equivalents, however at the trade-off that you
have to tell perl when you want to use them.
X</p> X<p modifier>
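
A minimal sketch of the C</p> variables, assuming perl 5.10 or later and an
invented C<key=value> string:

```perl
use strict;
use warnings;

my $pair = "key=value";
my ($pre, $match, $post);

# /p asks perl to preserve the pieces for this one match only.
if ($pair =~ /=/p) {
    ($pre, $match, $post) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});
}

print "before '$pre', matched '$match', after '$post'\n";
```

The values are copied into lexicals straight away, so later matches cannot
change what the program sees.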

Backslashed metacharacters in Perl are alphanumeric, such as C<\b>,
C<\w>, C<\n>. Unlike some other regular expression languages, there
are no backslashed symbols that aren't alphanumeric. So anything
that looks like \\, \(, \), \<, \>, \{, or \} is always
interpreted as a literal character, not a metacharacter. This was
once used in a common idiom to disable or quote the special meanings
of regular expression metacharacters in a string that you want to
use for a pattern. Simply quote all non-"word" characters:

    $pattern =~ s/(\W)/\\$1/g;

(If C<use locale> is set, then this depends on the current locale.)
Today it is more common to use the quotemeta() function or the C<\Q>
metaquoting escape sequence to disable all metacharacters' special
meanings like this:

    /$unquoted\Q$quoted\E$unquoted/

Beware that if you put literal backslashes (those not inside
interpolated variables) between C<\Q> and C<\E>, double-quotish
backslash interpolation may lead to confusing results. If you
I<need> to use literal backslashes within C<\Q...\E>,
consult L<perlop/"Gory details of parsing quoted constructs">.
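
The effect of quoting can be sketched with a string that contains two
metacharacters; the variable names and sample subjects are illustrative.

```perl
use strict;
use warnings;

my $literal = 'a.b*c';

# quotemeta() backslashes every non-"word" character.
my $quoted = quotemeta $literal;

# Unquoted, "." and "*" keep their special meanings.
my $loose_hit = "aXbbbc" =~ /$literal/ ? 1 : 0;

# With \Q...\E only the exact text matches.
my $exact_miss = "aXbbbc"    =~ /\Q$literal\E/ ? 1 : 0;
my $exact_hit  = "xxa.b*cxx" =~ /\Q$literal\E/ ? 1 : 0;

print "$quoted $loose_hit $exact_miss $exact_hit\n";
```

Interpolating C<$quoted> directly and writing C<\Q$literal\E> in the pattern
should behave the same way for a string like this one.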

=head2 Extended Patterns

Perl also defines a consistent extension syntax for features not
found in standard tools like B<awk> and B<lex>. The syntax is a
pair of parentheses with a question mark as the first thing within
the parentheses. The character after the question mark indicates
the extension.

The stability of these extensions varies widely. Some have been
part of the core language for many years. Others are experimental
and may change without warning or be completely removed. Check
the documentation on an individual feature to verify its current
status.

A question mark was chosen for this and for the minimal-matching
construct because 1) question marks are rare in older regular
expressions, and 2) whenever you see one, you should stop and
"question" exactly what is going on. That's psychology...

=over 10

=item C<(?#text)>
X<(?#)>

A comment. The text is ignored. If the C</x> modifier enables
whitespace formatting, a simple C<#> will suffice. Note that Perl closes
the comment as soon as it sees a C<)>, so there is no way to put a literal
C<)> in the comment.

=item C<(?pimsx-imsx)>
X<(?)>

One or more embedded pattern-match modifiers, to be turned on (or
turned off, if preceded by C<->) for the remainder of the pattern or
the remainder of the enclosing pattern group (if any). This is
particularly useful for dynamic patterns, such as those read in from a
configuration file, taken from an argument, or specified in a table
somewhere. Consider the case where some patterns want to be case
sensitive and some do not: The case insensitive ones merely need to
include C<(?i)> at the front of the pattern. For example:

    $pattern = "foobar";
    if ( /$pattern/i ) { }

    # more flexible:

    $pattern = "(?i)foobar";
    if ( /$pattern/ ) { }

These modifiers are restored at the end of the enclosing group. For example,

    ( (?i) blah ) \s+ \1

will match C<blah> in any case, some spaces, and an exact (I<including the case>!)
repetition of the previous word, assuming the C</x> modifier, and no C</i>
modifier outside this group.
19799a22 696
8eb5594e
DR
697These modifiers do not carry over into named subpatterns called in the
698enclosing group. In other words, a pattern such as C<((?i)(&NAME))> does not
699change the case-sensitivity of the "NAME" pattern.
700
5530442b 701Note that the C<p> modifier is special in that it can only be enabled,
cde0cee5 702not disabled, and that its presence anywhere in a pattern has a global
5530442b 703effect. Thus C<(?-p)> and C<(?-p:...)> are meaningless and will warn
cde0cee5
YO
704when executed under C<use warnings>.
705
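The scoping rules above can be checked with a short script. This is only a
sketch: the pattern table and the test strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical pattern table: only the second entry asks for
# case-insensitive matching, via an embedded (?i).
my @patterns = ('foobar', '(?i)foobar');

my %result;
for my $pattern (@patterns) {
    for my $text ('foobar', 'FooBar') {
        $result{$pattern}{$text} = $text =~ /$pattern/ ? 1 : 0;
        print "/$pattern/ vs '$text': $result{$pattern}{$text}\n";
    }
}

# (?i) inside a group is switched off again at the closing parenthesis,
# so \1 must repeat the captured word with identical case.
my $same_case  = 'BLAH BLAH' =~ /((?i)blah)\s+\1/ ? 1 : 0;
my $other_case = 'BLAH blah' =~ /((?i)blah)\s+\1/ ? 1 : 0;
print "same case: $same_case, different case: $other_case\n";
```

Only the C<(?i)foobar> entry matches C<FooBar>, and the backreference
accepts C<BLAH BLAH> but rejects C<BLAH blah>.
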
=item C<(?:pattern)>
X<(?:)>

=item C<(?imsx-imsx:pattern)>

This is for clustering, not capturing; it groups subexpressions like
"()", but doesn't make backreferences as "()" does. So

    @fields = split(/\b(?:a|b|c)\b/)

is like

    @fields = split(/\b(a|b|c)\b/)

but doesn't spit out extra fields. It's also cheaper not to capture
characters if you don't need to.

Any letters between C<?> and C<:> act as flag modifiers, as with
C<(?imsx-imsx)>. For example,

    /(?s-i:more.*than).*million/i

is equivalent to the more verbose

    /(?:(?s-i)more.*than).*million/i

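The difference between clustering and capturing in C<split> can be seen
directly. The input string below is an arbitrary example, and the C<\s*>
around the separator is added only to keep the fields tidy.

```perl
use strict;
use warnings;

my $line = 'x a y b z';

# Clustering only: the separators are discarded.
my @clustered = split /\s*\b(?:a|b|c)\b\s*/, $line;

# Capturing: split also returns whatever the group captured.
my @captured  = split /\s*\b(a|b|c)\b\s*/, $line;

print "clustered: @clustered\n";    # x y z
print "captured:  @captured\n";     # x a y b z
```
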
=item C<(?|pattern)>
X<(?|)> X<Branch reset>

This is the "branch reset" pattern, which has the special property
that the capture buffers are numbered from the same starting point
in each alternation branch. It is available starting from perl 5.10.0.

Capture buffers are numbered from left to right, but inside this
construct the numbering is restarted for each branch.

The numbering within each branch will be as normal, and any buffers
following this construct will be numbered as though the construct
contained only one branch, that being the one with the most capture
buffers in it.

This construct will be useful when you want to capture one of a
number of alternative matches.

Consider the following pattern. The numbers underneath show in
which buffer the captured content will be stored.

    # before  ---------------branch-reset----------- after
    / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
    # 1            2         2  3        2     3     4

Be careful when using the branch reset pattern in combination with
named captures. Named captures are implemented as being aliases to
numbered buffers holding the captures, and that interferes with the
implementation of the branch reset pattern. If you are using named
captures in a branch reset pattern, it's best to use the same names,
in the same order, in each of the alternations:

    /(?| (?<a> x ) (?<b> y )
       | (?<a> z ) (?<b> w )) /x

Not doing so may lead to surprises:

    "12" =~ /(?| (?<a> \d+ ) | (?<b> \D+))/x;
    say $+{a};   # Prints '12'
    say $+{b};   # *Also* prints '12'.

The problem here is that both the buffer named C<< a >> and the buffer
named C<< b >> are aliases for the buffer belonging to C<< $1 >>.

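As a small illustration of why branch reset is convenient, the sketch below
parses two hypothetical "key/value" notations and reads the result from
C<$1> and C<$2> whichever branch matched. The input strings and the variable
names are assumptions made for the example.

```perl
use strict;
use warnings;

# Both branches fill buffers 1 and 2; without (?| ... ) the second
# branch would fill buffers 3 and 4 instead.
my $pair = qr/\A (?| (\w+) = (\S+) | (\w+) : \s* (\S+) ) \z/x;

my @parsed;
for my $input ('colour=red', 'colour: blue') {
    if ($input =~ $pair) {
        push @parsed, "$1 => $2";
        print "$1 => $2\n";
    }
}
```
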
=item Look-Around Assertions
X<look-around assertion> X<lookaround assertion> X<look-around> X<lookaround>

Look-around assertions are zero width patterns which match a specific
pattern without including it in C<$&>. Positive assertions match when
their subpattern matches; negative assertions match when their subpattern
fails. Look-behind matches text up to the current match position;
look-ahead matches text following the current match position.

=over 4

=item C<(?=pattern)>
X<(?=)> X<look-ahead, positive> X<lookahead, positive>

A zero-width positive look-ahead assertion. For example, C</\w+(?=\t)/>
matches a word followed by a tab, without including the tab in C<$&>.

=item C<(?!pattern)>
X<(?!)> X<look-ahead, negative> X<lookahead, negative>

A zero-width negative look-ahead assertion. For example, C</foo(?!bar)/>
matches any occurrence of "foo" that isn't followed by "bar". Note,
however, that look-ahead and look-behind are NOT the same thing. You cannot
use this for look-behind.

If you are looking for a "bar" that isn't preceded by a "foo", C</(?!foo)bar/>
will not do what you want. That's because the C<(?!foo)> is just saying that
the next thing cannot be "foo"--and it's not, it's a "bar", so "foobar" will
match. You would have to do something like C</(?!foo)...bar/> for that. We
say "like" because there's the case of your "bar" not having three characters
before it. You could cover that this way: C</(?:(?!foo)...|^.{0,2})bar/>.
Sometimes it's still easier just to say:

    if (/bar/ && $` !~ /foo$/)

For look-behind see below.

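The pitfall described above is easy to reproduce. The following sketch runs
the naive look-ahead, the padded work-around, and the negative look-behind
C<(?<!foo)> (described below) against three made-up strings.

```perl
use strict;
use warnings;

my %seen;
for my $text ('foobar', 'bazbar', 'bar') {
    $seen{$text} = {
        naive  => ($text =~ /(?!foo)bar/                ? 1 : 0),
        padded => ($text =~ /(?:(?!foo)...|^.{0,2})bar/ ? 1 : 0),
        behind => ($text =~ /(?<!foo)bar/               ? 1 : 0),
    };
    printf "%-7s naive=%d padded=%d behind=%d\n",
        $text, @{ $seen{$text} }{qw(naive padded behind)};
}
```

The naive pattern matches all three strings, including "foobar"; the other
two reject "foobar" and accept "bazbar" and "bar".
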
=item C<(?<=pattern)> C<\K>
X<(?<=)> X<look-behind, positive> X<lookbehind, positive> X<\K>

A zero-width positive look-behind assertion. For example, C</(?<=\t)\w+/>
matches a word that follows a tab, without including the tab in C<$&>.
Works only for fixed-width look-behind.

There is a special form of this construct, called C<\K>, which causes the
regex engine to "keep" everything it had matched prior to the C<\K> and
not include it in C<$&>. This effectively provides variable length
look-behind. The use of C<\K> inside of another look-around assertion
is allowed, but the behaviour is currently not well defined.

For various reasons C<\K> may be significantly more efficient than the
equivalent C<< (?<=...) >> construct, and it is especially useful in
situations where you want to efficiently remove something following
something else in a string. For instance

    s/(foo)bar/$1/g;

can be rewritten as the much more efficient

    s/foo\Kbar//g;

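The two substitutions above produce the same string, and C<\K> also copes
with a variable-length prefix that a fixed-width look-behind could not
express. The sample strings below are invented for this sketch.

```perl
use strict;
use warnings;

my $text = 'foobar foobaz foobar';

# Capture-and-restore versus "keep" with \K: same result.
(my $via_capture = $text) =~ s/(foo)bar/$1/g;
(my $via_keep    = $text) =~ s/foo\Kbar//g;
print "$via_capture\n$via_keep\n";

# The text before \K has no fixed width here (\w+ and \s*), yet only
# the part after \K is replaced.
my $setting = 'password   = hunter2';
(my $masked = $setting) =~ s/^\w+\s*=\s*\K.*/****/;
print "$masked\n";
```
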
=item C<(?<!pattern)>
X<(?<!)> X<look-behind, negative> X<lookbehind, negative>

A zero-width negative look-behind assertion. For example, C</(?<!bar)foo/>
matches any occurrence of "foo" that does not follow "bar". Works
only for fixed-width look-behind.

=back

=item C<(?'NAME'pattern)>

=item C<< (?<NAME>pattern) >>
X<< (?<NAME>) >> X<(?'NAME')> X<named capture> X<capture>

A named capture buffer. Identical in every respect to normal capturing
parentheses C<()> but for the additional fact that C<%+> or C<%-> may be
used after a successful match to refer to a named buffer. See L<perlvar>
for more details on the C<%+> and C<%-> hashes.

If multiple distinct capture buffers have the same name then
C<$+{NAME}> will refer to the leftmost defined buffer in the match.

The forms C<(?'NAME'pattern)> and C<< (?<NAME>pattern) >> are equivalent.

B<NOTE:> While the notation of this construct is the same as the similar
function in .NET regexes, the behavior is not. In Perl the buffers are
numbered sequentially regardless of being named or not. Thus in the
pattern

    /(x)(?<foo>y)(z)/

C<$+{foo}> will be the same as C<$2>, and C<$3> will contain 'z' instead of
the opposite which is what a .NET regex hacker might expect.

Currently NAME is restricted to simple identifiers only.
In other words, it must match C</^[_A-Za-z][_A-Za-z0-9]*\z/> or
its Unicode extension (see L<utf8>),
though it isn't extended by the locale (see L<perllocale>).

B<NOTE:> In order to make things easier for programmers with experience
with the Python or PCRE regex engines, the pattern C<< (?PE<lt>NAMEE<gt>pattern) >>
may be used instead of C<< (?<NAME>pattern) >>; however this form does not
support the use of single quotes as a delimiter for the name.

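A brief sketch of the three spellings and of the sequential numbering
follows. The date string and the hash C<%date> are assumptions made for the
example.

```perl
use strict;
use warnings;

# The three spellings of a named capture, side by side.
my %date;
if ('2024-05-17' =~ /(?<year>\d{4})-(?'month'\d\d)-(?P<day>\d\d)/) {
    %date = %+;    # copy the named captures before the next match
    print "year=$date{year} month=$date{month} day=$date{day}\n";
    print "month is also \$2: $2\n";
}

# Named buffers are numbered in sequence with the unnamed ones.
my ($named, $second, $third);
if ('xyz' =~ /(x)(?<foo>y)(z)/) {
    ($named, $second, $third) = ($+{foo}, $2, $3);
    print "\$+{foo}=$named \$2=$second \$3=$third\n";
}
```
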
=item C<< \k<NAME> >>

=item C<< \k'NAME' >>

Named backreference. Similar to numeric backreferences, except that
the group is designated by name and not number. If multiple groups
have the same name then it refers to the leftmost defined group in
the current match.

It is an error to refer to a name not defined by a C<< (?<NAME>) >>
earlier in the pattern.

Both forms are equivalent.

B<NOTE:> In order to make things easier for programmers with experience
with the Python or PCRE regex engines, the pattern C<< (?P=NAME) >>
may be used instead of C<< \k<NAME> >>.

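For instance, a named backreference can detect an accidentally doubled
word. The sentences and the group name C<word> are illustrative only.

```perl
use strict;
use warnings;

my %doubled;
for my $text ('it was was late', 'it was late') {
    # \k<word> must match exactly what (?<word>...) captured.
    if ($text =~ /\b(?<word>\w+)\s+\k<word>\b/) {
        $doubled{$text} = $+{word};
        print "'$text': doubled word '$+{word}'\n";
    }
    else {
        print "'$text': no doubled word\n";
    }
}
```
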
=item C<(?{ code })>
X<(?{})> X<regex, code in> X<regexp, code in> X<regular expression, code in>

B<WARNING>: This extended regular expression feature is considered
experimental, and may be changed without notice. Code executed that
has side effects may not perform identically from version to version
due to the effect of future optimisations in the regex engine.

This zero-width assertion evaluates any embedded Perl code. It
always succeeds, and its C<code> is not interpolated. Currently,
the rules to determine where the C<code> ends are somewhat convoluted.

This feature can be used together with the special variable C<$^N> to
capture the results of submatches in variables without having to keep
track of the number of nested parentheses. For example:

    $_ = "The brown fox jumps over the lazy dog";
    /the (\S+)(?{ $color = $^N }) (\S+)(?{ $animal = $^N })/i;
    print "color = $color, animal = $animal\n";

Inside the C<(?{...})> block, C<$_> refers to the string the regular
expression is matching against. You can also use C<pos()> to find
the current position of matching within this string.

The C<code> is properly scoped in the following sense: If the assertion
is backtracked (compare L<"Backtracking">), all changes introduced after
C<local>ization are undone, so that

    $_ = 'a' x 8;
    m<
       (?{ $cnt = 0 })                    # Initialize $cnt.
       (
         a
         (?{
             local $cnt = $cnt + 1;       # Update $cnt, backtracking-safe.
         })
       )*
       aaaa
       (?{ $res = $cnt })                 # On success copy to non-localized
                                          # location.
     >x;

will set C<$res = 4>. Note that after the match, C<$cnt> returns to the globally
introduced value, because the scopes that restrict C<local> operators
are unwound.

This assertion may be used as a C<(?(condition)yes-pattern|no-pattern)>
switch. If I<not> used in this way, the result of evaluation of
C<code> is put into the special variable C<$^R>. This happens
immediately, so C<$^R> can be used from other C<(?{ code })> assertions
inside the same regular expression.

The assignment to C<$^R> above is properly localized, so the old
value of C<$^R> is restored if the assertion is backtracked; compare
L<"Backtracking">.

For reasons of security, this construct is forbidden if the regular
expression involves run-time interpolation of variables, unless the
perilous C<use re 'eval'> pragma has been used (see L<re>), or the
variables contain results of the C<qr//> operator (see
L<perlop/"qr/STRING/imosx">).

This restriction is due to the wide-spread and remarkably convenient
custom of using run-time determined strings as patterns. For example:

    $re = <>;
    chomp $re;
    $string =~ /$re/;

Before Perl knew how to execute interpolated code within a pattern,
this operation was completely safe from a security point of view,
although it could raise an exception from an illegal pattern. If
you turn on C<use re 'eval'>, though, it is no longer secure,
so you should only do so if you are also using taint checking.
Better yet, use the carefully constrained evaluation within a Safe
compartment. See L<perlsec> for details about both these mechanisms.

B<WARNING>: Use of lexical (C<my>) variables in these blocks is
broken. The result is unpredictable and will make perl unstable. The
workaround is to use global (C<our>) variables.

B<WARNING>: Because Perl's regex engine is currently not re-entrant,
interpolated code may not invoke the regex engine either directly (with
C<m//> or C<s///>), or indirectly with functions such as
C<split>. Invoking the regex engine in these blocks will make perl
unstable.

=item C<(??{ code })>
X<(??{})>
X<regex, postponed> X<regexp, postponed> X<regular expression, postponed>

B<WARNING>: This extended regular expression feature is considered
experimental, and may be changed without notice. Code executed that
has side effects may not perform identically from version to version
due to the effect of future optimisations in the regex engine.

This is a "postponed" regular subexpression. The C<code> is evaluated
at run time, at the moment this subexpression may match. The result
of evaluation is considered as a regular expression and matched as
if it were inserted instead of this construct. Note that this means
that the contents of capture buffers defined inside an eval'ed pattern
are not available outside of the pattern, and vice versa, there is no
way for the inner pattern to refer to a capture buffer defined outside.
Thus,

    ('a' x 100)=~/(??{'(.)' x 100})/

B<will> match, but it will B<not> set C<$1>.

The C<code> is not interpolated. As before, the rules to determine
where the C<code> ends are currently somewhat convoluted.

The following pattern matches a parenthesized group:

    $re = qr{
               \(
               (?:
                  (?> [^()]+ )    # Non-parens without backtracking
                |
                  (??{ $re })     # Group with matching parens
               )*
               \)
            }x;

See also C<(?PARNO)> for a different, more efficient way to accomplish
the same task.

For reasons of security, this construct is forbidden if the regular
expression involves run-time interpolation of variables, unless the
perilous C<use re 'eval'> pragma has been used (see L<re>), or the
variables contain results of the C<qr//> operator (see
L<perlop/"qr/STRING/imosx">).

Because perl's regex engine is not currently re-entrant, delayed
code may not invoke the regex engine either directly (with C<m//> or C<s///>),
or indirectly with functions such as C<split>.

Recursing deeper than 50 times without consuming any input string will
result in a fatal error. The maximum depth is compiled into perl, so
changing it requires a custom build.

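The parenthesized-group pattern above might be exercised as follows. This is
a sketch only: C<$re> is declared with C<our>, in line with the warning about
lexical variables in code blocks, and the test strings are made up.

```perl
use strict;
use warnings;

our $re;    # package variable, looked up again each time (??{ ... }) runs
$re = qr{
    \(
    (?:
        (?> [^()]+ )     # non-parens without backtracking
      |
        (??{ $re })      # a nested group with matching parens
    )*
    \)
}x;

my %verdict;
for my $text ('(a(b)c)', '((()))', '(a(b)c') {
    # $re holds the result of qr//, so interpolating it is permitted
    # without "use re 'eval'".
    $verdict{$text} = $text =~ /\A$re\z/ ? 'balanced' : 'not balanced';
    print "$text: $verdict{$text}\n";
}
```
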
=item C<(?PARNO)> C<(?-PARNO)> C<(?+PARNO)> C<(?R)> C<(?0)>
X<(?PARNO)> X<(?1)> X<(?R)> X<(?0)> X<(?-1)> X<(?+1)> X<(?-PARNO)> X<(?+PARNO)>
X<regex, recursive> X<regexp, recursive> X<regular expression, recursive>
X<regex, relative recursion>

Similar to C<(??{ code })> except it does not involve compiling any code;
instead it treats the contents of a capture buffer as an independent
pattern that must match at the current position. Capture buffers
contained by the pattern will have the value as determined by the
outermost recursion.

PARNO is a sequence of digits (not starting with 0) whose value reflects
the paren-number of the capture buffer to recurse to. C<(?R)> recurses to
the beginning of the whole pattern. C<(?0)> is an alternate syntax for
C<(?R)>. If PARNO is preceded by a plus or minus sign then it is assumed
to be relative, with negative numbers indicating preceding capture buffers
and positive ones following. Thus C<(?-1)> refers to the most recently
declared buffer, and C<(?+1)> indicates the next buffer to be declared.
Note that the counting for relative recursion differs from that of
relative backreferences, in that with recursion unclosed buffers B<are>
included.

The following pattern matches a function C<foo()> which may contain
balanced parentheses as the argument.

    $re = qr{ (                    # paren group 1 (full function)
                foo
                (                  # paren group 2 (parens)
                  \(
                    (              # paren group 3 (contents of parens)
                    (?:
                     (?> [^()]+ )  # Non-parens without backtracking
                    |
                     (?2)          # Recurse to start of paren group 2
                    )*
                    )
                  \)
                )
              )
            }x;

If the pattern was used as follows

    'foo(bar(baz)+baz(bop))'=~/$re/
        and print "\$1 = $1\n",
                  "\$2 = $2\n",
                  "\$3 = $3\n";

the output produced should be the following:

    $1 = foo(bar(baz)+baz(bop))
    $2 = (bar(baz)+baz(bop))
    $3 = bar(baz)+baz(bop)

If there is no corresponding capture buffer defined, then it is a
fatal error. Recursing deeper than 50 times without consuming any input
string will also result in a fatal error. The maximum depth is compiled
into perl, so changing it requires a custom build.

The following shows how using negative indexing can make it
easier to embed recursive patterns inside of a C<qr//> construct
for later use:

    my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;
    if (/foo $parens \s+ \+ \s+ bar $parens/x) {
       # do something here...
    }

B<Note> that this pattern does not behave the same way as the equivalent
PCRE or Python construct of the same form. In Perl you can backtrack into
a recursed group; in PCRE and Python the recursed-into group is treated
as atomic. Also, modifiers are resolved at compile time, so constructs
like C<(?i:(?1))> or C<(?:(?i)(?1))> do not affect how the sub-pattern will
be processed.

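The sketch below shows C<(?R)> and the relative form C<(?-1)> in use. The
input strings are invented, and the patterns assume a perl that supports
possessive quantifiers (5.10 or later).

```perl
use strict;
use warnings;

# (?R) recurses into the whole pattern, so the pattern is used on its
# own here rather than interpolated into a larger one.
my $text = 'max(a, min(b, c)) + 1';
my ($whole) = $text =~ /(\((?:[^()]++|(?R))*\))/;
print "$whole\n";

# (?-1) refers to the nearest group opened before it, so the same qr//
# object works wherever it is embedded, whatever its group number.
my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;
my @groups = 'foo (a(b)) + bar ((c))'
    =~ /foo \s+ $parens \s+ \+ \s+ bar \s+ $parens/x;
print "@groups\n";
```

The first match yields C<(a, min(b, c))>; the second yields C<(a(b))> and
C<((c))> in C<$1> and C<$2> respectively.
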
=item C<(?&NAME)>
X<(?&NAME)>

Recurse to a named subpattern. Identical to C<(?PARNO)> except that the
parenthesis to recurse to is determined by name. If multiple parentheses have
the same name, then it recurses to the leftmost.

It is an error to refer to a name that is not declared somewhere in the
pattern.

B<NOTE:> In order to make things easier for programmers with experience
with the Python or PCRE regex engines, the pattern C<< (?P>NAME) >>
may be used instead of C<< (?&NAME) >>.

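Recursion by name reads much like the numbered form. In this sketch the
group name C<group> and the input string are arbitrary choices; both the
Perl and the Python/PCRE spellings are shown.

```perl
use strict;
use warnings;

my $nested       = qr/(?<group> \( (?: [^()]++ | (?&group)  )* \) )/x;
my $nested_pcre  = qr/(?<group> \( (?: [^()]++ | (?P>group) )* \) )/x;

my $input = 'f(a, g(b), (c))';
my ($perl_style, $pcre_style);
$perl_style = $+{group} if $input =~ $nested;
$pcre_style = $+{group} if $input =~ $nested_pcre;
print "$perl_style\n$pcre_style\n";
```
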
=item C<(?(condition)yes-pattern|no-pattern)>
X<(?()>

=item C<(?(condition)yes-pattern)>

Conditional expression. C<(condition)> should be either an integer in
parentheses (which is valid if the corresponding pair of parentheses
matched), a look-ahead/look-behind/evaluate zero-width assertion, a
name in angle brackets or single quotes (which is valid if a buffer
with the given name matched), or the special symbol C<(R)> (true when
evaluated inside of recursion or eval). Additionally the C<R> may be
followed by a number (which will be true when evaluated when recursing
inside of the appropriate group), or by C<&NAME>, in which case it will
be true only when evaluated during recursion in the named group.

Here's a summary of the possible predicates:

=over 4

=item (1) (2) ...

Checks if the numbered capturing buffer has matched something.

=item (<NAME>) ('NAME')

Checks if a buffer with the given name has matched something.

=item (?{ CODE })

Treats the code block as the condition.

=item (R)

Checks if the expression has been evaluated inside of recursion.

=item (R1) (R2) ...

Checks if the expression has been evaluated while executing directly
inside of the n-th capture group. This check is the regex equivalent of

    if ((caller(0))[3] eq 'subname') { ... }

In other words, it does not check the full recursion stack.

=item (R&NAME)

Similar to C<(R1)>, this predicate checks to see if we're executing
directly inside of the leftmost group with a given name (this is the same
logic used by C<(?&NAME)> to disambiguate). It does not check the full
stack, but only the name of the innermost active recursion.

=item (DEFINE)

In this case, the yes-pattern is never directly executed, and no
no-pattern is allowed. Similar in spirit to C<(?{0})> but more efficient.
See below for details.

=back

For example:

    m{ ( \( )?
       [^()]+
       (?(1) \) )
     }x

matches a chunk of non-parentheses, possibly enclosed in parentheses
themselves.

A special form is the C<(DEFINE)> predicate, which never executes its
yes-pattern directly, and does not allow a no-pattern. This allows you to
define subpatterns which will be executed only by using the recursion
mechanism. This way, you can define a set of regular expression rules that
can be bundled into any pattern you choose.

It is recommended that for this usage you put the DEFINE block at the
end of the pattern, and that you name any subpatterns defined within it.

Also, it's worth noting that patterns defined this way probably will
not be as efficient, as the optimiser is not very clever about
handling them.

An example of how this might be used is as follows:

    /(?<NAME>(?&NAME_PAT))(?<ADDR>(?&ADDRESS_PAT))
     (?(DEFINE)
       (?<NAME_PAT>....)
       (?<ADDRESS_PAT>....)
     )/x

Note that capture buffers matched inside of recursion are not accessible
after the recursion returns, so the extra layer of capturing buffers is
necessary. Thus C<$+{NAME_PAT}> would not be defined even though
C<$+{NAME}> would be.

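Both forms can be tried out with the sketch below. The first pattern is the
conditional from the example above with anchors added; the second fills in
the elided C<....> parts of the C<(DEFINE)> example with deliberately
simplistic, made-up rules for a name and an address.

```perl
use strict;
use warnings;

# Closing parenthesis required only if the opening one matched.
my $chunk = qr{ \A ( \( )? [^()]+ (?(1) \) ) \z }x;
my %ok = map { $_ => ($_ =~ $chunk ? 1 : 0) } 'abc', '(abc)', '(abc', 'abc)';
print "$_: $ok{$_}\n" for sort keys %ok;

# Hypothetical rules: a name is space-separated words, an address is a
# number followed by one or more words.
my $record = qr{
    \A (?<NAME> (?&NAME_PAT) ) \s* , \s* (?<ADDR> (?&ADDRESS_PAT) ) \z
    (?(DEFINE)
        (?<NAME_PAT>    [A-Za-z]+ (?: [ ] [A-Za-z]+ )* )
        (?<ADDRESS_PAT> \d+ (?: [ ] [A-Za-z]+ )+ )
    )
}x;

my %field;
if ('Jane Doe, 12 Elm Street' =~ $record) {
    %field = (NAME => $+{NAME}, ADDR => $+{ADDR}, INNER => $+{NAME_PAT});
    print "name=$field{NAME} addr=$field{ADDR}\n";
}
```

Only the outer C<NAME> and C<ADDR> buffers are filled in; C<$+{NAME_PAT}>
stays undefined, as described above.
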
=item C<< (?>pattern) >>
X<backtrack> X<backtracking> X<atomic> X<possessive>

An "independent" subexpression, one which matches the substring
that a I<standalone> C<pattern> would match if anchored at the given
position, and it matches I<nothing other than this substring>. This
construct is useful for optimizations of what would otherwise be
"eternal" matches, because it will not backtrack (see L<"Backtracking">).
It may also be useful in places where the "grab all you can, and do not
give anything back" semantic is desirable.

For example: C<< ^(?>a*)ab >> will never match, since C<< (?>a*) >>
(anchored at the beginning of string, as above) will match I<all>
characters C<a> at the beginning of string, leaving no C<a> for
C<ab> to match. In contrast, C<a*ab> will match the same as C<a+b>,
since the match of the subgroup C<a*> is influenced by the following
group C<ab> (see L<"Backtracking">). In particular, C<a*> inside
C<a*ab> will match fewer characters than a standalone C<a*>, since
this makes the tail match.

An effect similar to C<< (?>pattern) >> may be achieved by writing
C<(?=(pattern))\1>. This matches the same substring as a standalone
C<pattern>, and the following C<\1> eats the matched string; it therefore
makes a zero-length assertion into an analogue of C<< (?>...) >>.
(The difference between these two constructs is that the second one
uses a capturing group, thus shifting ordinals of backreferences
in the rest of a regular expression.)

Consider this pattern:

    m{ \(
          (
            [^()]+              # x+
          |
            \( [^()]* \)
          )+
       \)
     }x

That will efficiently match a nonempty group with matching parentheses
two levels deep or less. However, if there is no such group, it
will take virtually forever on a long string. That's because there
are so many different ways to split a long string into several
substrings. This is what C<(.+)+> is doing, and C<(.+)+> is similar
to a subpattern of the above pattern. Consider how the pattern
above detects no-match on C<((()aaaaaaaaaaaaaaaaaa> in several
seconds, but that each extra letter doubles this time. This
exponential performance will make it appear that your program has
hung. However, a tiny change to this pattern

    m{ \(
          (
            (?> [^()]+ )        # change x+ above to (?> x+ )
          |
            \( [^()]* \)
          )+
       \)
     }x

which uses C<< (?>...) >> matches exactly when the one above does (verifying
this yourself would be a productive exercise), but finishes in a quarter of
the time when used on a similar string with 1000000 C<a>s. Be aware,
however, that this pattern currently triggers a warning message under
the C<use warnings> pragma or B<-w> switch saying it
C<"matches null string many times in regex">.

On simple groups, such as the pattern C<< (?> [^()]+ ) >>, a comparable
effect may be achieved by negative look-ahead, as in C<[^()]+ (?! [^()] )>.
This was only 4 times slower on a string with 1000000 C<a>s.

The "grab all you can, and do not give anything back" semantic is desirable
in many situations where at first sight a simple C<()*> looks like
the correct solution. Suppose we parse text with comments being delimited
by C<#> followed by some optional (horizontal) whitespace. Contrary to
its appearance, C<#[ \t]*> I<is not> the correct subexpression to match
the comment delimiter, because it may "give up" some whitespace if
the remainder of the pattern can be made to match that way. The correct
answer is either one of these:

    (?>#[ \t]*)
    #[ \t]*(?![ \t])

For example, to grab non-empty comments into C<$1>, one should use either
one of these:

    / (?> \# [ \t]* ) (        .+ ) /x;
    /     \# [ \t]*   ( [^ \t] .* ) /x;

Which one you pick depends on which of these expressions better reflects
the above specification of comments.

In some literature this construct is called "atomic matching" or
"possessive matching".

Possessive quantifiers are equivalent to putting the item they are applied
to inside of one of these constructs. The following equivalences apply:

    Quantifier Form     Bracketing Form
    ---------------     ---------------
    PAT*+               (?>PAT*)
    PAT++               (?>PAT+)
    PAT?+               (?>PAT?)
    PAT{min,max}+       (?>PAT{min,max})

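The behaviour described in this section can be confirmed with a few
one-line matches. The strings below are made up for the sketch.

```perl
use strict;
use warnings;

# a* may give back an "a"; the atomic and possessive forms may not.
my $plain      = 'aaab' =~ /^a*ab/     ? 1 : 0;
my $atomic     = 'aaab' =~ /^(?>a*)ab/ ? 1 : 0;
my $possessive = 'aaab' =~ /^a*+ab/    ? 1 : 0;
print "plain=$plain atomic=$atomic possessive=$possessive\n";

# An "empty" comment: '#' followed only by whitespace.
my $empty = "#   ";

# Without the atomic group, [ \t]* gives up one blank so that .+ can match.
my ($leaky)  = $empty =~ /     \# [ \t]*   ( .+ ) /x;

# With it, the delimiter keeps all the whitespace and the match fails.
my ($strict) = $empty =~ / (?> \# [ \t]* ) ( .+ ) /x;

print "leaky captured '", $leaky, "'\n";
print "strict ", (defined $strict ? "captured '$strict'" : "did not match"), "\n";
```
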
=back

=head2 Special Backtracking Control Verbs

B<WARNING:> These patterns are experimental and subject to change or
removal in a future version of Perl. Their usage in production code should
be noted to avoid problems during upgrades.

These special patterns are generally of the form C<(*VERB:ARG)>. Unless
otherwise stated the ARG argument is optional; in some cases, it is
forbidden.

Any pattern containing a special backtracking verb that allows an argument
has the special behaviour that when executed it sets the current package's
C<$REGERROR> and C<$REGMARK> variables. When doing so the following
rules apply:

On failure, the C<$REGERROR> variable will be set to the ARG value of the
verb pattern, if the verb was involved in the failure of the match. If the
ARG part of the pattern was omitted, then C<$REGERROR> will be set to the
name of the last C<(*MARK:NAME)> pattern executed, or to TRUE if there was
none. Also, the C<$REGMARK> variable will be set to FALSE.

On a successful match, the C<$REGERROR> variable will be set to FALSE, and
the C<$REGMARK> variable will be set to the name of the last
C<(*MARK:NAME)> pattern executed. See the explanation for the
C<(*MARK:NAME)> verb below for more details.

B<NOTE:> C<$REGERROR> and C<$REGMARK> are not magic variables like C<$1>
and most other regex related variables. They are not local to a scope, nor
readonly, but instead are volatile package variables similar to C<$AUTOLOAD>.
Use C<local> to localize changes to them to a specific scope if necessary.

If a pattern does not contain a special backtracking verb that allows an
argument, then C<$REGERROR> and C<$REGMARK> are not touched at all.

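As a rough sketch of the success case, the following reads C<$REGMARK>
after a match to learn which C<(*MARK:NAME)> (described below) was executed
last. The mark names and input strings are invented, and since these verbs
are experimental the behaviour may differ between perl versions.

```perl
use strict;
use warnings;

our $REGMARK;    # package variable set by the regex engine

my %label;
for my $text ('apple', 'banana') {
    if ($text =~ / \A (?: a (*MARK:starts_with_a) | b (*MARK:starts_with_b) ) /x) {
        $label{$text} = $REGMARK;    # copy it: the next match overwrites it
        print "$text: $REGMARK\n";
    }
}
```
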
=over 4

=item Verbs that take an argument

=over 4

=item C<(*PRUNE)> C<(*PRUNE:NAME)>
X<(*PRUNE)> X<(*PRUNE:NAME)>

This zero-width pattern prunes the backtracking tree at the current point
when backtracked into on failure. Consider the pattern C<A (*PRUNE) B>,
where A and B are complex patterns. Until the C<(*PRUNE)> verb is reached,
A may backtrack as necessary to match. Once it is reached, matching
continues in B, which may also backtrack as necessary; however, should B
not match, then no further backtracking will take place, and the pattern
will fail outright at the current starting position.

The following example counts all the possible matching strings in a
pattern (without actually matching any of them).

    'aaab' =~ /a+b?(?{print "$&\n"; $count++})(*FAIL)/;
    print "Count=$count\n";

which produces:

    aaab
    aaa
    aa
    a
    aab
    aa
    a
    ab
    a
    Count=9

If we add a C<(*PRUNE)> before the count like the following

    'aaab' =~ /a+b?(*PRUNE)(?{print "$&\n"; $count++})(*FAIL)/;
    print "Count=$count\n";

we prevent backtracking and find only the longest match
at each starting point, like so:

    aaab
    aab
    ab
    Count=3

Any number of C<(*PRUNE)> assertions may be used in a pattern.

See also C<< (?>pattern) >> and possessive quantifiers for other ways to
control backtracking. In some cases, the use of C<(*PRUNE)> can be
replaced with a C<< (?>pattern) >> with no functional difference; however,
C<(*PRUNE)> can be used to handle cases that cannot be expressed using a
C<< (?>pattern) >> alone.

5d458dd8
YO
1423=item C<(*SKIP)> C<(*SKIP:NAME)>
1424X<(*SKIP)>
e2e6a0f1 1425
This zero-width pattern is similar to C<(*PRUNE)>, except that on
failure it also signifies that whatever text was matched leading up
to the C<(*SKIP)> pattern being executed cannot be part of I<any> match
of this pattern. This effectively means that the regex engine "skips" forward
to this position on failure and tries to match again (assuming that
there is sufficient room to match).
1432
1433The name of the C<(*SKIP:NAME)> pattern has special significance. If a
1434C<(*MARK:NAME)> was encountered while matching, then it is that position
1435which is used as the "skip point". If no C<(*MARK)> of that name was
1436encountered, then the C<(*SKIP)> operator has no effect. When used
without a name, the "skip point" is where the match point was when
executing the C<(*SKIP)> pattern.

Compare the following to the examples in C<(*PRUNE)>; note that the string
is twice as long:
1442
    'aaabaaab' =~ /a+b?(*SKIP)(?{print "$&\n"; $count++})(*FAIL)/;
    print "Count=$count\n";
1445
1446outputs
1447
1448 aaab
1449 aaab
1450 Count=2
1451
Once the 'aaab' at the start of the string has matched, and the C<(*SKIP)>
has executed, the next starting point will be where the cursor was when the
C<(*SKIP)> was executed.
1455
5d458dd8
YO
1456=item C<(*MARK:NAME)> C<(*:NAME)>
X<(*MARK)> X<(*MARK:NAME)> X<(*:NAME)>
1458
1459This zero-width pattern can be used to mark the point reached in a string
1460when a certain part of the pattern has been successfully matched. This
1461mark may be given a name. A later C<(*SKIP)> pattern will then skip
1462forward to that point if backtracked into on failure. Any number of
b4222fa9 1463C<(*MARK)> patterns are allowed, and the NAME portion may be duplicated.
5d458dd8
YO
1464
1465In addition to interacting with the C<(*SKIP)> pattern, C<(*MARK:NAME)>
1466can be used to "label" a pattern branch, so that after matching, the
1467program can determine which branches of the pattern were involved in the
1468match.
1469
1470When a match is successful, the C<$REGMARK> variable will be set to the
1471name of the most recently executed C<(*MARK:NAME)> that was involved
1472in the match.
1473
1474This can be used to determine which branch of a pattern was matched
c62285ac 1475without using a separate capture buffer for each branch, which in turn
5d458dd8
YO
1476can result in a performance improvement, as perl cannot optimize
1477C</(?:(x)|(y)|(z))/> as efficiently as something like
1478C</(?:x(*MARK:x)|y(*MARK:y)|z(*MARK:z))/>.
1479
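As a minimal sketch of this branch-labelling idea, the following records
which alternative matched by reading C<$REGMARK> after each successful
match. The hash C<%branch_for>, the sample words, and the mark names are
illustrative choices, not part of any API.

```perl
use strict;
use warnings;

# $REGMARK is an ordinary package variable set by the regex engine,
# so it has to be declared under "use strict".
our $REGMARK;

my %branch_for;    # illustrative: input word => name of the mark that fired

for my $word (qw(cat dog emu)) {
    if ( $word =~ /(?:cat(*MARK:feline)|dog(*MARK:canine)|emu(*MARK:bird))/ ) {
        $branch_for{$word} = $REGMARK;
    }
}

print "$_ matched via branch '$branch_for{$_}'\n" for sort keys %branch_for;
```

No capture buffers are needed here; the mark name alone identifies the
branch.
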
1480When a match has failed, and unless another verb has been involved in
1481failing the match and has provided its own name to use, the C<$REGERROR>
1482variable will be set to the name of the most recently executed
1483C<(*MARK:NAME)>.
1484
1485See C<(*SKIP)> for more details.
1486
b62d2d15
YO
1487As a shortcut C<(*MARK:NAME)> can be written C<(*:NAME)>.
1488
5d458dd8
YO
1489=item C<(*THEN)> C<(*THEN:NAME)>
1490
241e7389 1491This is similar to the "cut group" operator C<::> from Perl 6. Like
5d458dd8
YO
1492C<(*PRUNE)>, this verb always matches, and when backtracked into on
1493failure, it causes the regex engine to try the next alternation in the
1494innermost enclosing group (capturing or otherwise).
1495
1496Its name comes from the observation that this operation combined with the
1497alternation operator (C<|>) can be used to create what is essentially a
1498pattern-based if/then/else block:
1499
1500 ( COND (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ )
1501
Note that if this operator is used outside of an alternation, then
it acts exactly like the C<(*PRUNE)> operator.
1504
1505 / A (*PRUNE) B /
1506
1507is the same as
1508
1509 / A (*THEN) B /
1510
1511but
1512
1513 / ( A (*THEN) B | C (*THEN) D ) /
1514
1515is not the same as
1516
1517 / ( A (*PRUNE) B | C (*PRUNE) D ) /
1518
1519as after matching the A but failing on the B the C<(*THEN)> verb will
1520backtrack and try C; but the C<(*PRUNE)> verb will simply fail.
24b23f37 1521
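The difference can be sketched with a concrete string. In the following
illustration (the variable names are assumptions made for the example),
both patterns match the leading "a" and then fail on "b"; only the
C<(*THEN)> version goes on to try the second alternative.

```perl
use strict;
use warnings;

my $text = 'ac';

# (*THEN): after "a" matches and "b" fails, the engine moves on to the
# next alternative of the enclosing group, which matches "ac".
my $then_matched = $text =~ /^(?: a (*THEN) b | a c )$/x ? 1 : 0;

# (*PRUNE): the same failure abandons this starting position entirely,
# so the second alternative is never tried.
my $prune_matched = $text =~ /^(?: a (*PRUNE) b | a c )$/x ? 1 : 0;

print "THEN matched: $then_matched, PRUNE matched: $prune_matched\n";
```
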
e2e6a0f1
YO
1522=item C<(*COMMIT)>
1523X<(*COMMIT)>
24b23f37 1524
241e7389 1525This is the Perl 6 "commit pattern" C<< <commit> >> or C<:::>. It's a
5d458dd8
YO
1526zero-width pattern similar to C<(*SKIP)>, except that when backtracked
1527into on failure it causes the match to fail outright. No further attempts
1528to find a valid match by advancing the start pointer will occur again.
1529For example,
24b23f37 1530
    'aaabaaab' =~ /a+b?(*COMMIT)(?{print "$&\n"; $count++})(*FAIL)/;
    print "Count=$count\n";
1533
1534outputs
1535
1536 aaab
1537 Count=1
1538
e2e6a0f1
YO
1539In other words, once the C<(*COMMIT)> has been entered, and if the pattern
1540does not match, the regex engine will not try any further matching on the
1541rest of the string.
c277df42 1542
e2e6a0f1 1543=back
9af228c6 1544
e2e6a0f1 1545=item Verbs without an argument
9af228c6
YO
1546
1547=over 4
1548
e2e6a0f1
YO
1549=item C<(*FAIL)> C<(*F)>
1550X<(*FAIL)> X<(*F)>
9af228c6 1551
e2e6a0f1
YO
1552This pattern matches nothing and always fails. It can be used to force the
1553engine to backtrack. It is equivalent to C<(?!)>, but easier to read. In
1554fact, C<(?!)> gets optimised into C<(*FAIL)> internally.
9af228c6 1555
e2e6a0f1 1556It is probably useful only when combined with C<(?{})> or C<(??{})>.
9af228c6 1557
e2e6a0f1
YO
1558=item C<(*ACCEPT)>
1559X<(*ACCEPT)>
9af228c6 1560
e2e6a0f1
YO
1561B<WARNING:> This feature is highly experimental. It is not recommended
1562for production code.
9af228c6 1563
e2e6a0f1
YO
1564This pattern matches nothing and causes the end of successful matching at
1565the point at which the C<(*ACCEPT)> pattern was encountered, regardless of
1566whether there is actually more to match in the string. When inside of a
0d017f4d 1567nested pattern, such as recursion, or in a subpattern dynamically generated
e2e6a0f1 1568via C<(??{})>, only the innermost pattern is ended immediately.
9af228c6 1569
e2e6a0f1
YO
1570If the C<(*ACCEPT)> is inside of capturing buffers then the buffers are
1571marked as ended at the point at which the C<(*ACCEPT)> was encountered.
1572For instance:
9af228c6 1573
e2e6a0f1 1574 'AB' =~ /(A (A|B(*ACCEPT)|C) D)(E)/x;
9af228c6 1575
e2e6a0f1 1576will match, and C<$1> will be C<AB> and C<$2> will be C<B>, C<$3> will not
0d017f4d 1577be set. If another branch in the inner parentheses were matched, such as in the
e2e6a0f1 1578string 'ACDE', then the C<D> and C<E> would have to be matched as well.
9af228c6
YO
1579
1580=back
c277df42 1581
a0d0e21e
LW
1582=back
1583
c07a80fd 1584=head2 Backtracking
d74e8afc 1585X<backtrack> X<backtracking>
c07a80fd 1586
1587NOTE: This section presents an abstract approximation of regular
1588expression behavior. For a more rigorous (and complicated) view of
1589the rules involved in selecting a match among possible alternatives,
0d017f4d 1590see L<Combining RE Pieces>.
35a734be 1591
A fundamental feature of regular expression matching involves the
notion called I<backtracking>, which is currently used (when needed)
by all non-possessive regular expression quantifiers, namely C<*>, C<*?>, C<+>,
C<+?>, C<{n,m}>, and C<{n,m}?>. Backtracking is often optimized
internally, but the general principle outlined here is valid.
c07a80fd 1597
1598For a regular expression to match, the I<entire> regular expression must
1599match, not just part of it. So if the beginning of a pattern containing a
1600quantifier succeeds in a way that causes later parts in the pattern to
1601fail, the matching engine backs up and recalculates the beginning
1602part--that's why it's called backtracking.
1603
1604Here is an example of backtracking: Let's say you want to find the
1605word following "foo" in the string "Food is on the foo table.":
1606
1607 $_ = "Food is on the foo table.";
1608 if ( /\b(foo)\s+(\w+)/i ) {
1609 print "$2 follows $1.\n";
1610 }
1611
1612When the match runs, the first part of the regular expression (C<\b(foo)>)
1613finds a possible match right at the beginning of the string, and loads up
1614$1 with "Foo". However, as soon as the matching engine sees that there's
1615no whitespace following the "Foo" that it had saved in $1, it realizes its
68dc0745 1616mistake and starts over again one character after where it had the
c07a80fd 1617tentative match. This time it goes all the way until the next occurrence
1618of "foo". The complete regular expression matches this time, and you get
1619the expected output of "table follows foo."
1620
1621Sometimes minimal matching can help a lot. Imagine you'd like to match
1622everything between "foo" and "bar". Initially, you write something
1623like this:
1624
1625 $_ = "The food is under the bar in the barn.";
1626 if ( /foo(.*)bar/ ) {
1627 print "got <$1>\n";
1628 }
1629
1630Which perhaps unexpectedly yields:
1631
1632 got <d is under the bar in the >
1633
1634That's because C<.*> was greedy, so you get everything between the
14218588 1635I<first> "foo" and the I<last> "bar". Here it's more effective
c07a80fd 1636to use minimal matching to make sure you get the text between a "foo"
1637and the first "bar" thereafter.
1638
1639 if ( /foo(.*?)bar/ ) { print "got <$1>\n" }
1640 got <d is under the >
1641
0d017f4d 1642Here's another example. Let's say you'd like to match a number at the end
b6e13d97 1643of a string, and you also want to keep the preceding part of the match.
c07a80fd 1644So you write this:
1645
1646 $_ = "I have 2 numbers: 53147";
1647 if ( /(.*)(\d*)/ ) { # Wrong!
1648 print "Beginning is <$1>, number is <$2>.\n";
1649 }
1650
1651That won't work at all, because C<.*> was greedy and gobbled up the
1652whole string. As C<\d*> can match on an empty string the complete
1653regular expression matched successfully.
1654
8e1088bc 1655 Beginning is <I have 2 numbers: 53147>, number is <>.
c07a80fd 1656
1657Here are some variants, most of which don't work:
1658
1659 $_ = "I have 2 numbers: 53147";
1660 @pats = qw{
1661 (.*)(\d*)
1662 (.*)(\d+)
1663 (.*?)(\d*)
1664 (.*?)(\d+)
1665 (.*)(\d+)$
1666 (.*?)(\d+)$
1667 (.*)\b(\d+)$
1668 (.*\D)(\d+)$
1669 };
1670
1671 for $pat (@pats) {
1672 printf "%-12s ", $pat;
1673 if ( /$pat/ ) {
1674 print "<$1> <$2>\n";
1675 } else {
1676 print "FAIL\n";
1677 }
1678 }
1679
1680That will print out:
1681
1682 (.*)(\d*) <I have 2 numbers: 53147> <>
1683 (.*)(\d+) <I have 2 numbers: 5314> <7>
1684 (.*?)(\d*) <> <>
1685 (.*?)(\d+) <I have > <2>
1686 (.*)(\d+)$ <I have 2 numbers: 5314> <7>
1687 (.*?)(\d+)$ <I have 2 numbers: > <53147>
1688 (.*)\b(\d+)$ <I have 2 numbers: > <53147>
1689 (.*\D)(\d+)$ <I have 2 numbers: > <53147>
1690
1691As you see, this can be a bit tricky. It's important to realize that a
1692regular expression is merely a set of assertions that gives a definition
1693of success. There may be 0, 1, or several different ways that the
1694definition might succeed against a particular string. And if there are
5a964f20
TC
1695multiple ways it might succeed, you need to understand backtracking to
1696know which variety of success you will achieve.
c07a80fd 1697
19799a22 1698When using look-ahead assertions and negations, this can all get even
8b19b778 1699trickier. Imagine you'd like to find a sequence of non-digits not
c07a80fd 1700followed by "123". You might try to write that as
1701
871b0233
IZ
1702 $_ = "ABC123";
1703 if ( /^\D*(?!123)/ ) { # Wrong!
1704 print "Yup, no 123 in $_\n";
1705 }
c07a80fd 1706
1707But that isn't going to match; at least, not the way you're hoping. It
1708claims that there is no 123 in the string. Here's a clearer picture of
9b9391b2 1709why that pattern matches, contrary to popular expectations:
c07a80fd 1710
    $x = 'ABC123';
    $y = 'ABC445';

    print "1: got $1\n" if $x =~ /^(ABC)(?!123)/;
    print "2: got $1\n" if $y =~ /^(ABC)(?!123)/;

    print "3: got $1\n" if $x =~ /^(\D*)(?!123)/;
    print "4: got $1\n" if $y =~ /^(\D*)(?!123)/;
c07a80fd 1719
1720This prints
1721
1722 2: got ABC
1723 3: got AB
1724 4: got ABC
1725
You might have expected test 3 to fail because it seems to be a more
general-purpose version of test 1. The important difference between
them is that test 3 contains a quantifier (C<\D*>) and so can use
backtracking, whereas test 1 cannot. What's happening is
that you've asked "Is it true that at the start of $x, following 0 or more
non-digits, you have something that's not 123?" If the pattern matcher had
let C<\D*> expand to "ABC", this would have caused the whole pattern to
fail.

The search engine will initially match C<\D*> with "ABC". Then it will
try to match C<(?!123)> against "123", which fails. But because
a quantifier (C<\D*>) has been used in the regular expression, the
search engine can backtrack and retry the match differently
in the hope of matching the complete regular expression.
c07a80fd 1740
5a964f20
TC
1741The pattern really, I<really> wants to succeed, so it uses the
1742standard pattern back-off-and-retry and lets C<\D*> expand to just "AB" this
c07a80fd 1743time. Now there's indeed something following "AB" that is not
14218588 1744"123". It's "C123", which suffices.
c07a80fd 1745
14218588
GS
1746We can deal with this by using both an assertion and a negation.
1747We'll say that the first part in $1 must be followed both by a digit
1748and by something that's not "123". Remember that the look-aheads
1749are zero-width expressions--they only look, but don't consume any
1750of the string in their match. So rewriting this way produces what
c07a80fd 1751you'd expect; that is, case 5 will fail, but case 6 succeeds:
1752
4358a253
SS
1753 print "5: got $1\n" if $x =~ /^(\D*)(?=\d)(?!123)/;
1754 print "6: got $1\n" if $y =~ /^(\D*)(?=\d)(?!123)/;
c07a80fd 1755
1756 6: got ABC
1757
5a964f20 1758In other words, the two zero-width assertions next to each other work as though
19799a22 1759they're ANDed together, just as you'd use any built-in assertions: C</^$/>
c07a80fd 1760matches only if you're at the beginning of the line AND the end of the
1761line simultaneously. The deeper underlying truth is that juxtaposition in
1762regular expressions always means AND, except when you write an explicit OR
1763using the vertical bar. C</ab/> means match "a" AND (then) match "b",
1764although the attempted matches are made at different positions because "a"
1765is not a zero-width assertion, but a one-width assertion.
1766
0d017f4d 1767B<WARNING>: Particularly complicated regular expressions can take
14218588 1768exponential time to solve because of the immense number of possible
0d017f4d 1769ways they can use backtracking to try for a match. For example, without
9da458fc
IZ
1770internal optimizations done by the regular expression engine, this will
1771take a painfully long time to run:
c07a80fd 1772
e1901655
IZ
1773 'aaaaaaaaaaaa' =~ /((a{0,5}){0,5})*[c]/
1774
1775And if you used C<*>'s in the internal groups instead of limiting them
1776to 0 through 5 matches, then it would take forever--or until you ran
1777out of stack space. Moreover, these internal optimizations are not
1778always applicable. For example, if you put C<{0,5}> instead of C<*>
1779on the external group, no current optimization is applicable, and the
1780match takes a long time to finish.
c07a80fd 1781
9da458fc
IZ
1782A powerful tool for optimizing such beasts is what is known as an
1783"independent group",
c47ff5f1 1784which does not backtrack (see L<C<< (?>pattern) >>>). Note also that
9da458fc 1785zero-length look-ahead/look-behind assertions will not backtrack to make
5d458dd8 1786the tail match, since they are in "logical" context: only
14218588 1787whether they match is considered relevant. For an example
9da458fc 1788where side-effects of look-ahead I<might> have influenced the
c47ff5f1 1789following match, see L<C<< (?>pattern) >>>.
c277df42 1790
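To illustrate how an independent group suppresses backtracking, here is a
small sketch on a deliberately tiny input. The variable names are
illustrative; the point is only that C<< (?>a+) >> and the possessive
C<a++> refuse to give back characters that an ordinary C<a+> would
release.

```perl
use strict;
use warnings;

my $text = 'aaab';

# Ordinary greedy quantifier: a+ first takes "aaa", then gives one "a"
# back so that the trailing "ab" can match.
my $greedy = $text =~ /^a+ab/ ? 1 : 0;

# Independent group: once (?>a+) has taken "aaa" it never gives any back,
# so the trailing "ab" cannot match.
my $atomic = $text =~ /^(?>a+)ab/ ? 1 : 0;

# Possessive quantifier: shorthand for the same behaviour.
my $possessive = $text =~ /^a++ab/ ? 1 : 0;

print "greedy=$greedy atomic=$atomic possessive=$possessive\n";
```
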
a0d0e21e 1791=head2 Version 8 Regular Expressions
d74e8afc 1792X<regular expression, version 8> X<regex, version 8> X<regexp, version 8>
a0d0e21e 1793
5a964f20 1794In case you're not familiar with the "regular" Version 8 regex
a0d0e21e
LW
1795routines, here are the pattern-matching rules not described above.
1796
54310121 1797Any single character matches itself, unless it is a I<metacharacter>
a0d0e21e 1798with a special meaning described here or above. You can cause
5a964f20 1799characters that normally function as metacharacters to be interpreted
5f05dabc 1800literally by prefixing them with a "\" (e.g., "\." matches a ".", not any
0d017f4d
WL
1801character; "\\" matches a "\"). This escape mechanism is also required
1802for the character used as the pattern delimiter.
1803
1804A series of characters matches that series of characters in the target
1805string, so the pattern C<blurfl> would match "blurfl" in the target
1806string.
a0d0e21e
LW
1807
1808You can specify a character class, by enclosing a list of characters
5d458dd8 1809in C<[]>, which will match any character from the list. If the
a0d0e21e 1810first character after the "[" is "^", the class matches any character not
14218588 1811in the list. Within a list, the "-" character specifies a
5a964f20 1812range, so that C<a-z> represents all characters between "a" and "z",
8a4f6ac2
GS
1813inclusive. If you want either "-" or "]" itself to be a member of a
1814class, put it at the start of the list (possibly after a "^"), or
1815escape it with a backslash. "-" is also taken literally when it is
1816at the end of the list, just before the closing "]". (The
84850974
DD
1817following all specify the same class of three characters: C<[-az]>,
1818C<[az-]>, and C<[a\-z]>. All are different from C<[a-z]>, which
5d458dd8
YO
1819specifies a class containing twenty-six characters, even on EBCDIC-based
1820character sets.) Also, if you try to use the character
1821classes C<\w>, C<\W>, C<\s>, C<\S>, C<\d>, or C<\D> as endpoints of
1822a range, the "-" is understood literally.
a0d0e21e 1823
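The claim about the literal "-" can be checked with a short sketch. It
assumes an ASCII-based platform and tests only lowercase letters and the
hyphen; C<%members> and C<@candidates> are names made up for the example.

```perl
use strict;
use warnings;

my @candidates = ( 'a' .. 'z', '-' );
my %members;    # illustrative: class text => string of matching characters

for my $class ( '[-az]', '[az-]', '[a\-z]', '[a-z]' ) {
    my $re = qr/^$class$/;
    $members{$class} = join '', sort grep { /$re/ } @candidates;
}

printf "%-8s matches %2d of the candidates\n", $_, length $members{$_}
    for sort keys %members;
```

The first three classes accept exactly the same three characters, while
C<[a-z]> accepts all twenty-six letters and rejects the hyphen.
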
8ada0baa
JH
1824Note also that the whole range idea is rather unportable between
1825character sets--and even within character sets they may cause results
1826you probably didn't expect. A sound principle is to use only ranges
0d017f4d 1827that begin from and end at either alphabetics of equal case ([a-e],
8ada0baa
JH
1828[A-E]), or digits ([0-9]). Anything else is unsafe. If in doubt,
1829spell out the character sets in full.
1830
54310121 1831Characters may be specified using a metacharacter syntax much like that
a0d0e21e
LW
1832used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,
1833"\f" a form feed, etc. More generally, \I<nnn>, where I<nnn> is a string
5d458dd8
YO
1834of octal digits, matches the character whose coded character set value
1835is I<nnn>. Similarly, \xI<nn>, where I<nn> are hexadecimal digits,
1836matches the character whose numeric value is I<nn>. The expression \cI<x>
1837matches the character control-I<x>. Finally, the "." metacharacter
fb55449c 1838matches any character except "\n" (unless you use C</s>).
a0d0e21e
LW
1839
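A brief sketch of these escapes follows. It assumes an ASCII-based
platform, where code point 0x41 (octal 101) is "A"; the hash C<%check> is
purely illustrative.

```perl
use strict;
use warnings;

my %check = (
    hex_escape   => ( "A"    =~ /^\x41$/ ? 1 : 0 ),   # 0x41 is "A" in ASCII
    octal_escape => ( "A"    =~ /^\101$/ ? 1 : 0 ),   # octal 101 is also "A"
    control_char => ( "\t"   =~ /^\cI$/  ? 1 : 0 ),   # control-I is a tab
    dot_newline  => ( "a\nb" =~ /^a.b$/  ? 1 : 0 ),   # "." does not match "\n"
    dot_with_s   => ( "a\nb" =~ /^a.b$/s ? 1 : 0 ),   # unless /s is in effect
);

printf "%-13s %s\n", $_, $check{$_} ? 'matches' : 'does not match'
    for sort keys %check;
```
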
1840You can specify a series of alternatives for a pattern using "|" to
1841separate them, so that C<fee|fie|foe> will match any of "fee", "fie",
5a964f20 1842or "foe" in the target string (as would C<f(e|i|o)e>). The
a0d0e21e
LW
1843first alternative includes everything from the last pattern delimiter
1844("(", "[", or the beginning of the pattern) up to the first "|", and
1845the last alternative contains everything from the last "|" to the next
14218588
GS
1846pattern delimiter. That's why it's common practice to include
1847alternatives in parentheses: to minimize confusion about where they
a3cb178b
GS
1848start and end.
1849
5a964f20 1850Alternatives are tried from left to right, so the first
a3cb178b
GS
1851alternative found for which the entire expression matches, is the one that
1852is chosen. This means that alternatives are not necessarily greedy. For
628afcb5 1853example: when matching C<foo|foot> against "barefoot", only the "foo"
a3cb178b
GS
1854part will match, as that is the first alternative tried, and it successfully
1855matches the target string. (This might not seem important, but it is
1856important when you are capturing matched text using parentheses.)
1857
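The effect of alternative order on a capture can be seen in this small
sketch (the variable names are illustrative):

```perl
use strict;
use warnings;

# The first alternative that lets the whole pattern succeed wins,
# even when a later alternative would have matched more text.
my ($short_first) = 'barefoot' =~ /(foo|foot)/;
my ($long_first)  = 'barefoot' =~ /(foot|foo)/;

print "foo|foot captured '$short_first'\n";
print "foot|foo captured '$long_first'\n";
```

Listing the longer alternative first is one way to capture "foot" here.
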
5a964f20 1858Also remember that "|" is interpreted as a literal within square brackets,
a3cb178b 1859so if you write C<[fee|fie|foe]> you're really only matching C<[feio|]>.
a0d0e21e 1860
14218588
GS
1861Within a pattern, you may designate subpatterns for later reference
1862by enclosing them in parentheses, and you may refer back to the
1863I<n>th subpattern later in the pattern using the metacharacter
1864\I<n>. Subpatterns are numbered based on the left to right order
1865of their opening parenthesis. A backreference matches whatever
1866actually matched the subpattern in the string being examined, not
1867the rules for that subpattern. Therefore, C<(0|0x)\d*\s\1\d*> will
1868match "0x1234 0x4321", but not "0x1234 01234", because subpattern
18691 matched "0x", even though the rule C<0|0x> could potentially match
1870the leading 0 in the second number.
cb1a09d0 1871
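The backreference example above can be run as follows; this is a sketch
in which C<$re> and the two result variables are illustrative names.

```perl
use strict;
use warnings;

my $re = qr/(0|0x)\d*\s\1\d*/;

# Subpattern 1 matches "0x", so \1 demands a literal "0x" in the
# second number as well.
my $same_prefix  = "0x1234 0x4321" =~ $re ? 1 : 0;
my $mixed_prefix = "0x1234 01234"  =~ $re ? 1 : 0;

print "same prefix:  ", ( $same_prefix  ? "match" : "no match" ), "\n";
print "mixed prefix: ", ( $mixed_prefix ? "match" : "no match" ), "\n";
```
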
0d017f4d 1872=head2 Warning on \1 Instead of $1
cb1a09d0 1873
5a964f20 1874Some people get too used to writing things like:
cb1a09d0
AD
1875
1876 $pattern =~ s/(\W)/\\\1/g;
1877
1878This is grandfathered for the RHS of a substitute to avoid shocking the
1879B<sed> addicts, but it's a dirty habit to get into. That's because in
d1be9408 1880PerlThink, the righthand side of an C<s///> is a double-quoted string. C<\1> in
cb1a09d0
AD
1881the usual double-quoted string means a control-A. The customary Unix
1882meaning of C<\1> is kludged in for C<s///>. However, if you get into the habit
1883of doing that, you get yourself into trouble if you then add an C</e>
1884modifier.
1885
5a964f20 1886 s/(\d+)/ \1 + 1 /eg; # causes warning under -w
cb1a09d0
AD
1887
1888Or if you try to do
1889
1890 s/(\d+)/\1000/;
1891
1892You can't disambiguate that by saying C<\{1}000>, whereas you can fix it with
14218588 1893C<${1}000>. The operation of interpolation should not be confused
cb1a09d0
AD
1894with the operation of matching a backreference. Certainly they mean two
1895different things on the I<left> side of the C<s///>.
9fa51da4 1896
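For comparison, a sketch of the recommended C<$1> forms is shown below.
The sample string and variable names are illustrative.

```perl
use strict;
use warnings;

my $text = 'item 42';

# ${1} delimits the capture variable, so the digits that follow are
# plain text appended to the captured number.
( my $scaled = $text ) =~ s/(\d+)/${1}000/;

# With /e the replacement is Perl code, where $1 is an ordinary variable.
( my $incremented = $text ) =~ s/(\d+)/$1 + 1/e;

print "$scaled\n";
print "$incremented\n";
```
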
0d017f4d 1897=head2 Repeated Patterns Matching a Zero-length Substring
c84d73f1 1898
19799a22 1899B<WARNING>: Difficult material (and prose) ahead. This section needs a rewrite.
c84d73f1
IZ
1900
1901Regular expressions provide a terse and powerful programming language. As
1902with most other power tools, power comes together with the ability
1903to wreak havoc.
1904
1905A common abuse of this power stems from the ability to make infinite
628afcb5 1906loops using regular expressions, with something as innocuous as:
c84d73f1
IZ
1907
1908 'foo' =~ m{ ( o? )* }x;
1909
0d017f4d 1910The C<o?> matches at the beginning of C<'foo'>, and since the position
c84d73f1 1911in the string is not moved by the match, C<o?> would match again and again
527e91da 1912because of the C<*> quantifier. Another common way to create a similar cycle
c84d73f1
IZ
1913is with the looping modifier C<//g>:
1914
1915 @matches = ( 'foo' =~ m{ o? }xg );
1916
1917or
1918
1919 print "match: <$&>\n" while 'foo' =~ m{ o? }xg;
1920
1921or the loop implied by split().
1922
However, long experience has shown that many programming tasks may
be significantly simplified by using repeated subexpressions that
may match zero-length substrings. Here's a simple example:
c84d73f1
IZ
1926
1927 @chars = split //, $string; # // is not magic in split
1928 ($whitewashed = $string) =~ s/()/ /g; # parens avoid magic s// /
1929
9da458fc 1930Thus Perl allows such constructs, by I<forcefully breaking
c84d73f1 1931the infinite loop>. The rules for this are different for lower-level
527e91da 1932loops given by the greedy quantifiers C<*+{}>, and for higher-level
c84d73f1
IZ
1933ones like the C</g> modifier or split() operator.
1934
19799a22
GS
1935The lower-level loops are I<interrupted> (that is, the loop is
1936broken) when Perl detects that a repeated expression matched a
1937zero-length substring. Thus
c84d73f1
IZ
1938
1939 m{ (?: NON_ZERO_LENGTH | ZERO_LENGTH )* }x;
1940
5d458dd8 1941is made equivalent to
c84d73f1 1942
5d458dd8
YO
1943 m{ (?: NON_ZERO_LENGTH )*
1944 |
1945 (?: ZERO_LENGTH )?
c84d73f1
IZ
1946 }x;
1947
The higher-level loops preserve an additional state between iterations:
whether the last match was zero-length. To break the loop, the match
following a zero-length match is prohibited from having a length of zero.
This prohibition interacts with backtracking (see L<"Backtracking">),
and so the I<second best> match is chosen if the I<best> match is of
zero length.
1954
19799a22 1955For example:
c84d73f1
IZ
1956
1957 $_ = 'bar';
1958 s/\w??/<$&>/g;
1959
20fb949f 1960results in C<< <><b><><a><><r><> >>. At each position of the string the best
5d458dd8 1961match given by non-greedy C<??> is the zero-length match, and the I<second
c84d73f1
IZ
1962best> match is what is matched by C<\w>. Thus zero-length matches
1963alternate with one-character-long matches.
1964
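The loop-breaking rules can be observed directly. The following sketch
(with illustrative variable names) shows zero-length and one-character
matches alternating under C<s///g>, a list-context C<//g> match that
terminates, and C<split //> breaking a string into characters.

```perl
use strict;
use warnings;

# Non-greedy ?? prefers the empty match; the rule against two
# consecutive zero-length matches forces the one-character match next.
( my $marked = 'bar' ) =~ s/\w??/<$&>/g;

# o? can match the empty string, yet the //g loop still terminates.
my @pieces = 'foo' =~ m{ o? }xg;

# An empty pattern in split yields the individual characters.
my @chars = split //, 'bar';

print "$marked\n";
print scalar(@pieces), " matches: ", join( '|', map {"<$_>"} @pieces ), "\n";
print join( ' ', @chars ), "\n";
```
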
5d458dd8 1965Similarly, for repeated C<m/()/g> the second-best match is the match at the
c84d73f1
IZ
1966position one notch further in the string.
1967
19799a22 1968The additional state of being I<matched with zero-length> is associated with
c84d73f1 1969the matched string, and is reset by each assignment to pos().
9da458fc
IZ
1970Zero-length matches at the end of the previous match are ignored
1971during C<split>.
c84d73f1 1972
0d017f4d 1973=head2 Combining RE Pieces
35a734be
IZ
1974
1975Each of the elementary pieces of regular expressions which were described
1976before (such as C<ab> or C<\Z>) could match at most one substring
1977at the given position of the input string. However, in a typical regular
1978expression these elementary pieces are combined into more complicated
1979patterns using combining operators C<ST>, C<S|T>, C<S*> etc
1980(in these examples C<S> and C<T> are regular subexpressions).
1981
1982Such combinations can include alternatives, leading to a problem of choice:
1983if we match a regular expression C<a|ab> against C<"abc">, will it match
1984substring C<"a"> or C<"ab">? One way to describe which substring is
1985actually matched is the concept of backtracking (see L<"Backtracking">).
1986However, this description is too low-level and makes you think
1987in terms of a particular implementation.
1988
1989Another description starts with notions of "better"/"worse". All the
1990substrings which may be matched by the given regular expression can be
1991sorted from the "best" match to the "worst" match, and it is the "best"
match which is chosen. This replaces the question of "what is chosen?"
with the question of "which matches are better, and which are worse?".
1994
1995Again, for elementary pieces there is no such question, since at most
1996one match at a given position is possible. This section describes the
1997notion of better/worse for combining operators. In the description
1998below C<S> and C<T> are regular subexpressions.
1999
13a2d996 2000=over 4
35a734be
IZ
2001
2002=item C<ST>
2003
Consider two possible matches, C<AB> and C<A'B'>, where C<A> and C<A'> are
substrings which can be matched by C<S>, and C<B> and C<B'> are substrings
which can be matched by C<T>.

If C<A> is a better match for C<S> than C<A'>, C<AB> is a better
match than C<A'B'>.

If C<A> and C<A'> coincide: C<AB> is a better match than C<AB'> if
C<B> is a better match for C<T> than C<B'>.
2013
2014=item C<S|T>
2015
2016When C<S> can match, it is a better match than when only C<T> can match.
2017
The ordering of two matches for C<S> is the same as for C<S> alone. The
same holds for two matches for C<T>.
2020
2021=item C<S{REPEAT_COUNT}>
2022
2023Matches as C<SSS...S> (repeated as many times as necessary).
2024
2025=item C<S{min,max}>
2026
2027Matches as C<S{max}|S{max-1}|...|S{min+1}|S{min}>.
2028
2029=item C<S{min,max}?>
2030
2031Matches as C<S{min}|S{min+1}|...|S{max-1}|S{max}>.
2032
2033=item C<S?>, C<S*>, C<S+>
2034
2035Same as C<S{0,1}>, C<S{0,BIG_NUMBER}>, C<S{1,BIG_NUMBER}> respectively.
2036
2037=item C<S??>, C<S*?>, C<S+?>
2038
2039Same as C<S{0,1}?>, C<S{0,BIG_NUMBER}?>, C<S{1,BIG_NUMBER}?> respectively.
2040
c47ff5f1 2041=item C<< (?>S) >>
35a734be
IZ
2042
2043Matches the best match for C<S> and only that.
2044
2045=item C<(?=S)>, C<(?<=S)>
2046
2047Only the best match for C<S> is considered. (This is important only if
2048C<S> has capturing parentheses, and backreferences are used somewhere
2049else in the whole regular expression.)
2050
2051=item C<(?!S)>, C<(?<!S)>
2052
2053For this grouping operator there is no need to describe the ordering, since
2054only whether or not C<S> can match is important.
2055
6bda09f9 2056=item C<(??{ EXPR })>, C<(?PARNO)>
35a734be
IZ
2057
2058The ordering is the same as for the regular expression which is
6bda09f9 2059the result of EXPR, or the pattern contained by capture buffer PARNO.
35a734be
IZ
2060
2061=item C<(?(condition)yes-pattern|no-pattern)>
2062
2063Recall that which of C<yes-pattern> or C<no-pattern> actually matches is
2064already determined. The ordering of the matches is the same as for the
2065chosen subexpression.
2066
2067=back
2068
2069The above recipes describe the ordering of matches I<at a given position>.
2070One more rule is needed to understand how a match is determined for the
2071whole regular expression: a match at an earlier position is always better
2072than a match at a later position.
2073
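A few of these ordering rules can be checked with a short sketch. The
hash C<%best> and the sample strings are illustrative; each entry records
what the engine actually chose as the best match.

```perl
use strict;
use warnings;

my %best;    # illustrative: pattern text => substring actually matched

( $best{'a|ab'} )    = 'abc'  =~ /(a|ab)/;       # S|T: S is tried first
( $best{'ab|a'} )    = 'abc'  =~ /(ab|a)/;       # same rule, order swapped
( $best{'a{1,3}'} )  = 'aaaa' =~ /(a{1,3})/;     # greedy: S{max} preferred
( $best{'a{1,3}?'} ) = 'aaaa' =~ /(a{1,3}?)/;    # non-greedy: S{min} preferred
( $best{'ab|x'} )    = 'xab'  =~ /(ab|x)/;       # earlier position wins

printf "%-8s => %s\n", $_, $best{$_} for sort keys %best;
```

In the last case the second alternative wins, because a match at an
earlier position is better than any match at a later one.
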
0d017f4d 2074=head2 Creating Custom RE Engines
c84d73f1
IZ
2075
2076Overloaded constants (see L<overload>) provide a simple way to extend
2077the functionality of the RE engine.
2078
2079Suppose that we want to enable a new RE escape-sequence C<\Y|> which
0d017f4d 2080matches at a boundary between whitespace characters and non-whitespace
c84d73f1
IZ
2081characters. Note that C<(?=\S)(?<!\S)|(?!\S)(?<=\S)> matches exactly
2082at these positions, so we want to have each C<\Y|> in the place of the
2083more complicated version. We can create a module C<customre> to do
2084this:
2085
    package customre;
    use overload;

    sub import {
      shift;
      die "No argument to customre::import allowed" if @_;
      overload::constant 'qr' => \&convert;
    }

    sub invalid { die "/$_[0]/: invalid escape '\\$_[1]'"}

    # We must also take care of not escaping the legitimate \\Y|
    # sequence, hence the presence of '\\' in the conversion rules.
    my %rules = ( '\\' => '\\\\',
                  'Y|' => qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/ );
    sub convert {
      my $re = shift;
      $re =~ s{
                \\ ( \\ | Y . )
              }
              { $rules{$1} or invalid($re,$1) }sgex;
      return $re;
    }
2109
2110Now C<use customre> enables the new escape in constant regular
2111expressions, i.e., those without any runtime variable interpolations.
2112As documented in L<overload>, this conversion will work only over
2113literal parts of regular expressions. For C<\Y|$re\Y|> the variable
2114part of this regular expression needs to be converted explicitly
2115(but only if the special meaning of C<\Y|> should be enabled inside $re):
2116
2117 use customre;
2118 $re = <>;
2119 chomp $re;
2120 $re = customre::convert $re;
2121 /\Y|$re\Y|/;
2122
1f1031fe
YO
2123=head1 PCRE/Python Support
2124
As of Perl 5.10.0, Perl supports several Python/PCRE-specific extensions
to the regex syntax. While Perl programmers are encouraged to use the
Perl-specific syntax, the following are also accepted:
2128
2129=over 4
2130
ae5648b3 2131=item C<< (?PE<lt>NAMEE<gt>pattern) >>
1f1031fe
YO
2132
2133Define a named capture buffer. Equivalent to C<< (?<NAME>pattern) >>.
2134
2135=item C<< (?P=NAME) >>
2136
2137Backreference to a named capture buffer. Equivalent to C<< \g{NAME} >>.
2138
2139=item C<< (?P>NAME) >>
2140
2141Subroutine call to a named capture buffer. Equivalent to C<< (?&NAME) >>.
2142
ee9b8eae 2143=back
1f1031fe 2144
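The three forms can be tried out as follows. This is a sketch with
made-up input strings and variable names; it contrasts the backreference
C<(?P=NAME)>, which must match the same text again, with the subroutine
call C<< (?P>NAME) >>, which only reuses the pattern.

```perl
use strict;
use warnings;

# (?P<year>...) defines a named buffer; (?P=year) must match the
# same text a second time.
my $year;
if ( '2024-2024' =~ /^(?P<year>\d{4})-(?P=year)$/ ) {
    $year = $+{year};
}

# (?P>num) re-runs the pattern \d+, so different digits are accepted.
my $recursed = '12-345' =~ /^(?P<num>\d+)-(?P>num)$/ ? 1 : 0;

# (?P=num) requires the identical text, so this does not match.
my $backref = '12-345' =~ /^(?P<num>\d+)-(?P=num)$/ ? 1 : 0;

print "year=$year recursed=$recursed backref=$backref\n";
```
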
19799a22
GS
2145=head1 BUGS
2146
9da458fc
IZ
2147This document varies from difficult to understand to completely
2148and utterly opaque. The wandering prose riddled with jargon is
2149hard to fathom in several places.
2150
2151This document needs a rewrite that separates the tutorial content
2152from the reference content.
19799a22
GS
2153
2154=head1 SEE ALSO
9fa51da4 2155
91e0c79e
MJD
2156L<perlrequick>.
2157
2158L<perlretut>.
2159
9b599b2a
GS
2160L<perlop/"Regexp Quote-Like Operators">.
2161
1e66bd83
PP
2162L<perlop/"Gory details of parsing quoted constructs">.
2163
14218588
GS
2164L<perlfaq6>.
2165
9b599b2a
GS
2166L<perlfunc/pos>.
2167
2168L<perllocale>.
2169
fb55449c
JH
2170L<perlebcdic>.
2171
14218588
GS
2172I<Mastering Regular Expressions> by Jeffrey Friedl, published
2173by O'Reilly and Associates.