=head1 NAME

perlre - Perl regular expressions

=head1 DESCRIPTION

This page describes the syntax of regular expressions in Perl.

If you haven't used regular expressions before, a quick-start
introduction is available in L<perlrequick>, and a longer tutorial
introduction is available in L<perlretut>.

For reference on how regular expressions are used in matching
operations, plus various examples of the same, see discussions of
C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/"Regexp Quote-Like
Operators">.

Matching operations can have various modifiers. Modifiers
that relate to the interpretation of the regular expression inside
are listed below. Modifiers that alter the way a regular expression
is used by Perl are detailed in L<perlop/"Regexp Quote-Like Operators"> and
L<perlop/"Gory details of parsing quoted constructs">.

=over 4

=item i

Do case-insensitive pattern matching.

If C<use locale> is in effect, the case map is taken from the current
locale. See L<perllocale>.

=item m

Treat string as multiple lines. That is, change "^" and "$" from matching
the start or end of the string to matching the start or end of any
line anywhere within the string.

=item s

Treat string as single line. That is, change "." to match any character
whatsoever, even a newline, which normally it would not match.

The C</s> and C</m> modifiers both override the C<$*> setting. That
is, no matter what C<$*> contains, C</s> without C</m> will force
"^" to match only at the beginning of the string and "$" to match
only at the end (or just before a newline at the end) of the string.
Together, as /ms, they let the "." match any character whatsoever,
while still allowing "^" and "$" to match, respectively, just after
and just before newlines within the string.

=item x

Extend your pattern's legibility by permitting whitespace and comments.

=back
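As a minimal sketch of the C</i>, C</m> and C</s> modifiers described above
(the sample string and variable names are made up for illustration):

```perl
use strict;
use warnings;

my $text = "first line\nSecond line\n";

# /i: case-insensitive matching.
my $ci = $text =~ /second/i ? 1 : 0;

# Without /m, "^" matches only at the very start of the string.
my @plain = $text =~ /^(\w+)/g;

# With /m, "^" also matches just after each embedded newline.
my @multi = $text =~ /^(\w+)/mg;

# Without /s, "." refuses to cross a newline; with /s it will.
my $no_s   = $text =~ /first.*Second/  ? 1 : 0;
my $with_s = $text =~ /first.*Second/s ? 1 : 0;

print "plain: @plain\n";
print "multi: @multi\n";
print "no_s=$no_s with_s=$with_s ci=$ci\n";
```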

These are usually written as "the C</x> modifier", even though the delimiter
in question might not really be a slash. Any of these
modifiers may also be embedded within the regular expression itself using
the C<(?...)> construct. See below.

The C</x> modifier itself needs a little more explanation. It tells
the regular expression parser to ignore whitespace that is neither
backslashed nor within a character class. You can use this to break up
your regular expression into (slightly) more readable parts. The C<#>
character is also treated as a metacharacter introducing a comment,
just as in ordinary Perl code. This also means that if you want real
whitespace or C<#> characters in the pattern (outside a character
class, where they are unaffected by C</x>), you'll either have to
escape them or encode them using octal or hex escapes. Taken together,
these features go a long way towards making Perl's regular expressions
more readable. Note that you have to be careful not to include the
pattern delimiter in the comment--perl has no way of knowing you did
not intend to close the pattern early. See the C-comment deletion code
in L<perlop>.
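The following sketch shows one way a pattern might be laid out under
C</x>. The "key = value" format, the C<$pair> pattern and the sample
strings are hypothetical; they only illustrate the whitespace and comment
rules described above.

```perl
use strict;
use warnings;

# A hypothetical "key = value" line parser, laid out with /x.
my $pair = qr{
    ^ \s*
    ( \w+ )           # capture 1: the key
    \s* = \s*
    ( [^\#\s] .*? )   # capture 2: the value, starting at a non-blank
    \s* \z
}x;

my ($key, $value) = "  colour = dark red  " =~ $pair;
print "$key => $value\n";

# Literal whitespace under /x has to be escaped (or put in a class).
my $spaced = "dark red" =~ /dark\ red/x ? 1 : 0;
my $eaten  = "dark red" =~ /dark red/x  ? 1 : 0;   # same as /darkred/
print "spaced=$spaced eaten=$eaten\n";
```

Note that the comments avoid the pattern delimiter and anything that looks
like a variable, since interpolation happens before comments are stripped.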

=head2 Regular Expressions

The patterns used in Perl pattern matching derive from those supplied in
the Version 8 regex routines. (The routines are derived
(distantly) from Henry Spencer's freely redistributable reimplementation
of the V8 routines.) See L<Version 8 Regular Expressions> for
details.

In particular the following metacharacters have their standard I<egrep>-ish
meanings:

    \   Quote the next metacharacter
    ^   Match the beginning of the line
    .   Match any character (except newline)
    $   Match the end of the line (or before newline at the end)
    |   Alternation
    ()  Grouping
    []  Character class

By default, the "^" character is guaranteed to match only the
beginning of the string, the "$" character only the end (or before the
newline at the end), and Perl does certain optimizations with the
assumption that the string contains only one line. Embedded newlines
will not be matched by "^" or "$". You may, however, wish to treat a
string as a multi-line buffer, such that the "^" will match after any
newline within the string, and "$" will match before any newline. At the
cost of a little more overhead, you can do this by using the /m modifier
on the pattern match operator. (Older programs did this by setting C<$*>,
but this practice is now deprecated.)

To simplify multi-line substitutions, the "." character never matches a
newline unless you use the C</s> modifier, which in effect tells Perl to pretend
the string is a single line--even if it isn't. The C</s> modifier also
overrides the setting of C<$*>, in case you have some (badly behaved) older
code that sets it in another module.

The following standard quantifiers are recognized:

    *      Match 0 or more times
    +      Match 1 or more times
    ?      Match 1 or 0 times
    {n}    Match exactly n times
    {n,}   Match at least n times
    {n,m}  Match at least n but not more than m times

(If a curly bracket occurs in any other context, it is treated
as a regular character.) The "*" quantifier is equivalent to C<{0,}>, the "+"
quantifier to C<{1,}>, and the "?" quantifier to C<{0,1}>. n and m are limited
to integral values less than a preset limit defined when perl is built.
This is usually 32766 on the most common platforms. The actual limit can
be seen in the error message generated by code such as this:

    $_ **= $_ , / {$_} / for 2 .. 42;

By default, a quantified subpattern is "greedy", that is, it will match as
many times as possible (given a particular starting location) while still
allowing the rest of the pattern to match. If you want it to match the
minimum number of times possible, follow the quantifier with a "?". Note
that the meanings don't change, just the "greediness":

    *?      Match 0 or more times
    +?      Match 1 or more times
    ??      Match 0 or 1 time
    {n}?    Match exactly n times
    {n,}?   Match at least n times
    {n,m}?  Match at least n but not more than m times
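A small sketch of the difference between greedy and minimal quantifiers,
using made-up sample strings:

```perl
use strict;
use warnings;

my $html = "<b>bold</b> and <i>italic</i>";

# Greedy: ".*" runs to the last ">" that still lets the match succeed.
my ($greedy) = $html =~ /<(.*)>/;

# Non-greedy: ".*?" stops at the first ">" that lets the match succeed.
my ($lazy) = $html =~ /<(.*?)>/;

# {n,m} is greedy too; {n,m}? takes as few repetitions as it can.
my ($most)  = "aaaaaa" =~ /^(a{2,4})/;
my ($least) = "aaaaaa" =~ /^(a{2,4}?)/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
print "most=$most least=$least\n";
```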

Because patterns are processed as double quoted strings, the following
also work:

    \t          tab                   (HT, TAB)
    \n          newline               (LF, NL)
    \r          return                (CR)
    \f          form feed             (FF)
    \a          alarm (bell)          (BEL)
    \e          escape (think troff)  (ESC)
    \033        octal char (think of a PDP-11)
    \x1B        hex char
    \x{263a}    wide hex char         (Unicode SMILEY)
    \c[         control char
    \N{name}    named char
    \l          lowercase next char (think vi)
    \u          uppercase next char (think vi)
    \L          lowercase till \E (think vi)
    \U          uppercase till \E (think vi)
    \E          end case modification (think vi)
    \Q          quote (disable) pattern metacharacters till \E

If C<use locale> is in effect, the case map used by C<\l>, C<\L>, C<\u>
and C<\U> is taken from the current locale. See L<perllocale>. For
documentation of C<\N{name}>, see L<charnames>.

You cannot include a literal C<$> or C<@> within a C<\Q> sequence.
An unescaped C<$> or C<@> interpolates the corresponding variable,
while escaping will cause the literal string C<\$> to be matched.
You'll need to write something like C<m/\Quser\E\@\Qhost/>.

In addition, Perl defines the following:

    \w  Match a "word" character (alphanumeric plus "_")
    \W  Match a non-"word" character
    \s  Match a whitespace character
    \S  Match a non-whitespace character
    \d  Match a digit character
    \D  Match a non-digit character
    \pP Match P, named property.  Use \p{Prop} for longer names.
    \PP Match non-P
    \X  Match eXtended Unicode "combining character sequence",
        equivalent to C<(?:\PM\pM*)>
    \C  Match a single C char (octet) even under utf8.

A C<\w> matches a single alphanumeric character or C<_>, not a whole word.
Use C<\w+> to match a string of Perl-identifier characters (which isn't
the same as matching an English word). If C<use locale> is in effect, the
list of alphabetic characters generated by C<\w> is taken from the
current locale. See L<perllocale>. You may use C<\w>, C<\W>, C<\s>, C<\S>,
C<\d>, and C<\D> within character classes, but if you try to use them
as endpoints of a range, the result is not a range: the "-" is understood
literally. See L<utf8> for details about C<\pP>, C<\PP>, and C<\X>.

The POSIX character class syntax

    [:class:]

is also available. The available classes and their backslash
equivalents (if available) are as follows:

    alpha
    alnum
    ascii
    blank               [1]
    cntrl
    digit       \d
    graph
    lower
    print
    punct
    space       \s      [2]
    upper
    word        \w      [3]
    xdigit

    [1] A GNU extension equivalent to C<[ \t]>, `all horizontal whitespace'.
    [2] Not I<exactly equivalent> to C<\s> since the C<[[:space:]]> includes
        also the (very rare) `vertical tabulator', "\ck", chr(11).
    [3] A Perl extension.

For example use C<[:upper:]> to match all the uppercase characters.
Note that the C<[]> are part of the C<[::]> construct, not part of the
whole character class. For example:

    [01[:alpha:]%]

matches zero, one, any alphabetic character, or the percent sign.
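A brief sketch of POSIX classes used inside a bracketed class, with
arbitrary test values chosen for illustration:

```perl
use strict;
use warnings;

# [:alpha:] only has meaning inside an enclosing bracketed class.
my @hits = grep { /^[01[:alpha:]%]$/ } ('0', '1', 'x', '%', '2', '-');

# A negated POSIX class (a Perl extension) beside its backslash form.
my $s = "id42";
(my $posix = $s) =~ s/[[:^digit:]]//g;
(my $trad  = $s) =~ s/\D//g;

print "hits: @hits\n";
print "posix=$posix trad=$trad\n";
```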

If the C<utf8> pragma is used, the following equivalences to Unicode
\p{} constructs and equivalent backslash character classes (if available),
will hold:

    alpha       IsAlpha
    alnum       IsAlnum
    ascii       IsASCII
    blank       IsSpace
    cntrl       IsCntrl
    digit       IsDigit        \d
    graph       IsGraph
    lower       IsLower
    print       IsPrint
    punct       IsPunct
    space       IsSpace
                IsSpacePerl    \s
    upper       IsUpper
    word        IsWord
    xdigit      IsXDigit

For example C<[:lower:]> and C<\p{IsLower}> are equivalent.

If the C<utf8> pragma is not used but the C<locale> pragma is, the
classes correlate with the usual isalpha(3) interface (except for
`word' and `blank').

The classes whose names may not be obvious are:

=over 4

=item cntrl

Any control character. Usually characters that don't produce output as
such but instead control the terminal somehow: for example newline and
backspace are control characters. All characters with ord() less than
32 are most often classified as control characters (assuming ASCII,
the ISO Latin character sets, and Unicode), as is the character with
the ord() value of 127 (C<DEL>).

=item graph

Any alphanumeric or punctuation (special) character.

=item print

Any alphanumeric or punctuation (special) character or the space character.

=item punct

Any punctuation (special) character.

=item xdigit

Any hexadecimal digit. Though this may feel silly ([0-9A-Fa-f] would
work just fine) it is included for completeness.

=back

You can negate the [::] character classes by prefixing the class name
with a '^'. This is a Perl extension. For example:

    POSIX         trad. Perl  utf8 Perl

    [:^digit:]    \D          \P{IsDigit}
    [:^space:]    \S          \P{IsSpace}
    [:^word:]     \W          \P{IsWord}

The POSIX character classes [.cc.] and [=cc=] are recognized but
B<not> supported and trying to use them will cause an error.

Perl defines the following zero-width assertions:

    \b  Match a word boundary
    \B  Match a non-(word boundary)
    \A  Match only at beginning of string
    \Z  Match only at end of string, or before newline at the end
    \z  Match only at end of string
    \G  Match only at pos() (e.g. at the end-of-match position
        of prior m//g)

A word boundary (C<\b>) is a spot between two characters
that has a C<\w> on one side of it and a C<\W> on the other side
of it (in either order), counting the imaginary characters off the
beginning and end of the string as matching a C<\W>. (Within
character classes C<\b> represents backspace rather than a word
boundary, just as it normally does in any double-quoted string.)
The C<\A> and C<\Z> are just like "^" and "$", except that they
won't match multiple times when the C</m> modifier is used, while
"^" and "$" will match at every internal line boundary. To match
the actual end of the string and not ignore an optional trailing
newline, use C<\z>.
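The differences between "$", C<\Z>, C<\z> and C<\b> can be sketched as
follows; the sample strings are invented for the purpose:

```perl
use strict;
use warnings;

my $lines = "alpha\nbeta\n";

# "$" under /m matches before every newline; \Z only before the last.
my $dollar_m = () = $lines =~ /a$/mg;
my $big_z    = () = $lines =~ /a\Z/mg;

# \z insists on the true end of the string, trailing newline included.
my $z_plain   = $lines =~ /beta\z/   ? 1 : 0;
my $z_newline = $lines =~ /beta\n\z/ ? 1 : 0;

# \b separates \w from \W, so "cat" inside "concatenate" is skipped.
my @cats = "cat concatenate cat." =~ /\bcat\b/g;

print "dollar_m=$dollar_m big_z=$big_z\n";
print "z_plain=$z_plain z_newline=$z_newline\n";
print "cats: ", scalar(@cats), "\n";
```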

The C<\G> assertion can be used to chain global matches (using
C<m//g>), as described in L<perlop/"Regexp Quote-Like Operators">.
It is also useful when writing C<lex>-like scanners, when you have
several patterns that you want to match against consecutive substrings
of your string; see the previous reference. The actual location
where C<\G> will match can also be influenced by using C<pos()> as
an lvalue. See L<perlfunc/pos>.
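A C<lex>-like scanner along these lines might look like the sketch below.
The C<tokenize> function and its token names are hypothetical; the
technique is C<\G> combined with C<m//gc>, where C</c> keeps C<pos()>
unchanged when a match fails.

```perl
use strict;
use warnings;

# A hypothetical tokenizer for tiny arithmetic expressions.
sub tokenize {
    my ($src) = @_;
    my @tokens;
    pos($src) = 0;                       # pos() used as an lvalue
    while (pos($src) < length $src) {
        if    ($src =~ /\G\s+/gc)         { next }   # skip blanks
        elsif ($src =~ /\G(\d+)/gc)       { push @tokens, "NUM($1)" }
        elsif ($src =~ /\G([-+*\/()])/gc) { push @tokens, "OP($1)" }
        else  { die "unexpected input at offset " . pos($src) . "\n" }
    }
    return @tokens;
}

my @tokens = tokenize("12 + (3*4)");
print "@tokens\n";
```

Each branch is anchored with C<\G>, so a pattern can only match at the
point where the previous token ended.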

The bracketing construct C<( ... )> creates capture buffers. To
refer to the digit'th buffer use \<digit> within the
match. Outside the match use "$" instead of "\". (The
\<digit> notation works in certain circumstances outside
the match. See the warning below about \1 vs $1 for details.)
Referring back to another part of the match is called a
I<backreference>.

There is no limit to the number of captured substrings that you may
use. However Perl also uses \10, \11, etc. as aliases for \010,
\011, etc. (Recall that 0 means octal, so \011 is the character at
number 9 in your coded character set; which would be the 10th character,
a horizontal tab under ASCII.) Perl resolves this
ambiguity by interpreting \10 as a backreference only if at least 10
left parentheses have opened before it. Likewise \11 is a
backreference only if at least 11 left parentheses have opened
before it. And so on. \1 through \9 are always interpreted as
backreferences.

Examples:

    s/^([^ ]*) *([^ ]*)/$2 $1/;     # swap first two words

    if (/(.)\1/) {                  # find first doubled char
        print "'$1' is the first doubled character\n";
    }

    if (/Time: (..):(..):(..)/) {   # parse out values
        $hours = $1;
        $minutes = $2;
        $seconds = $3;
    }

Several special variables also refer back to portions of the previous
match. C<$+> returns whatever the last bracket match matched.
C<$&> returns the entire matched string. (At one point C<$0> did
also, but now it returns the name of the program.) C<$`> returns
everything before the matched string. And C<$'> returns everything
after the matched string.

The numbered variables ($1, $2, $3, etc.) and the related punctuation
set (C<$+>, C<$&>, C<$`>, and C<$'>) are all dynamically scoped
until the end of the enclosing block or until the next successful
match, whichever comes first. (See L<perlsyn/"Compound Statements">.)

B<WARNING>: Once Perl sees that you need one of C<$&>, C<$`>, or
C<$'> anywhere in the program, it has to provide them for every
pattern match. This may substantially slow your program. Perl
uses the same mechanism to produce $1, $2, etc, so you also pay a
price for each pattern that contains capturing parentheses. (To
avoid this cost while retaining the grouping behaviour, use the
extended regular expression C<(?: ... )> instead.) But if you never
use C<$&>, C<$`> or C<$'>, then patterns I<without> capturing
parentheses will not be penalized. So avoid C<$&>, C<$'>, and C<$`>
if you can, but if you can't (and some algorithms really appreciate
them), once you've used them once, use them at will, because you've
already paid the price. As of 5.005, C<$&> is not so costly as the
other two.

Backslashed metacharacters in Perl are alphanumeric, such as C<\b>,
C<\w>, C<\n>. Unlike some other regular expression languages, there
are no backslashed symbols that aren't alphanumeric. So anything
that looks like \\, \(, \), \<, \>, \{, or \} is always
interpreted as a literal character, not a metacharacter. This was
once used in a common idiom to disable or quote the special meanings
of regular expression metacharacters in a string that you want to
use for a pattern. Simply quote all non-"word" characters:

    $pattern =~ s/(\W)/\\$1/g;

(If C<use locale> is set, then this depends on the current locale.)
Today it is more common to use the quotemeta() function or the C<\Q>
metaquoting escape sequence to disable all metacharacters' special
meanings like this:

    /$unquoted\Q$quoted\E$unquoted/

Beware that if you put literal backslashes (those not inside
interpolated variables) between C<\Q> and C<\E>, double-quotish
backslash interpolation may lead to confusing results. If you
I<need> to use literal backslashes within C<\Q...\E>,
consult L<perlop/"Gory details of parsing quoted constructs">.
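To illustrate why quoting matters, here is a small sketch comparing an
unquoted interpolated string with its C<\Q...\E> and quotemeta() forms.
The C<$needle> and C<$haystack> values are invented for the example.

```perl
use strict;
use warnings;

my $needle   = "a.b*c";          # meant literally, not as a pattern
my $haystack = "xxa.b*cxx aXbbc";

# Unquoted, "." and "*" keep their special meanings.
my @loose = $haystack =~ /($needle)/g;

# \Q...\E and quotemeta() both backslash every non-"word" character.
my @strict = $haystack =~ /(\Q$needle\E)/g;
my $quoted = quotemeta $needle;

print "loose:  @loose\n";
print "strict: @strict\n";
print "quoted: $quoted\n";
```

Unquoted, the pattern does not even match the literal text it was built
from, because C<b*> followed by C<c> cannot match "b*c".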

=head2 Extended Patterns

Perl also defines a consistent extension syntax for features not
found in standard tools like B<awk> and B<lex>. The syntax is a
pair of parentheses with a question mark as the first thing within
the parentheses. The character after the question mark indicates
the extension.

The stability of these extensions varies widely. Some have been
part of the core language for many years. Others are experimental
and may change without warning or be completely removed. Check
the documentation on an individual feature to verify its current
status.

A question mark was chosen for this and for the minimal-matching
construct because 1) question marks are rare in older regular
expressions, and 2) whenever you see one, you should stop and
"question" exactly what is going on. That's psychology...

=over 10

=item C<(?#text)>

A comment. The text is ignored. If the C</x> modifier enables
whitespace formatting, a simple C<#> will suffice. Note that Perl closes
the comment as soon as it sees a C<)>, so there is no way to put a literal
C<)> in the comment.

=item C<(?imsx-imsx)>

One or more embedded pattern-match modifiers. This is particularly
useful for dynamic patterns, such as those read in from a configuration
file, passed in as an argument, or specified in a table somewhere.
Consider the case where some patterns need to be case sensitive
and some do not. The case insensitive ones merely need to include
C<(?i)> at the front of the pattern. For example:

    $pattern = "foobar";
    if ( /$pattern/i ) { }

    # more flexible:

    $pattern = "(?i)foobar";
    if ( /$pattern/ ) { }

Letters after a C<-> turn those modifiers off. These modifiers are
localized inside an enclosing group (if any). For example,

    ( (?i) blah ) \s+ \1

will match a repeated (I<including the case>!) word C<blah> in any
case, assuming C<x> modifier, and no C<i> modifier outside this
group.
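A short sketch of both points, assuming made-up test strings: the
modifier stays local to its group, and patterns kept in a table can carry
their own flags.

```perl
use strict;
use warnings;

# (?i) applies inside the group only; \1 then has to repeat the
# captured text exactly, case included.
my $dup = qr/ ( (?i) blah ) \s+ \1 /x;

my $same  = "Blah Blah" =~ $dup ? 1 : 0;
my $mixed = "Blah blah" =~ $dup ? 1 : 0;

# Patterns from a table can carry their own flags.
my @patterns = ('(?i)foobar', 'FooBar');
my @hits = grep { "FOOBAR" =~ /$_/ } @patterns;

print "same=$same mixed=$mixed\n";
print "hits: @hits\n";
```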

=item C<(?:pattern)>

=item C<(?imsx-imsx:pattern)>

This is for clustering, not capturing; it groups subexpressions like
"()", but doesn't make backreferences as "()" does. So

    @fields = split(/\b(?:a|b|c)\b/)

is like

    @fields = split(/\b(a|b|c)\b/)

but doesn't spit out extra fields. It's also cheaper not to capture
characters if you don't need to.

Any letters between C<?> and C<:> act as flags modifiers as with
C<(?imsx-imsx)>. For example,

    /(?s-i:more.*than).*million/i

is equivalent to the more verbose

    /(?:(?s-i)more.*than).*million/i
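The effect on split() can be seen in this sketch. The record string is
made up, and the C<\s*> on either side of the separator is an addition so
that the resulting fields come out trimmed.

```perl
use strict;
use warnings;

my $record = "1 a 2 b 3";

# Capturing parentheses make split() return the separators too.
my @with_seps = split /\s*\b(a|b|c)\b\s*/, $record;

# Clustering with (?:...) groups the alternatives without capturing.
my @fields = split /\s*\b(?:a|b|c)\b\s*/, $record;

print "with separators: @with_seps\n";
print "fields only:     @fields\n";
```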

=item C<(?=pattern)>

A zero-width positive look-ahead assertion. For example, C</\w+(?=\t)/>
matches a word followed by a tab, without including the tab in C<$&>.

=item C<(?!pattern)>

A zero-width negative look-ahead assertion. For example C</foo(?!bar)/>
matches any occurrence of "foo" that isn't followed by "bar". Note
however that look-ahead and look-behind are NOT the same thing. You cannot
use this for look-behind.

If you are looking for a "bar" that isn't preceded by a "foo", C</(?!foo)bar/>
will not do what you want. That's because the C<(?!foo)> is just saying that
the next thing cannot be "foo"--and it's not, it's a "bar", so "foobar" will
match. You would have to do something like C</(?!foo)...bar/> for that. We
say "like" because there's the case of your "bar" not having three characters
before it. You could cover that this way: C</(?:(?!foo)...|^.{0,2})bar/>.
Sometimes it's still easier just to say:

    if (/bar/ && $` !~ /foo$/)

For look-behind see below.

=item C<(?<=pattern)>

A zero-width positive look-behind assertion. For example, C</(?<=\t)\w+/>
matches a word that follows a tab, without including the tab in C<$&>.
Works only for fixed-width look-behind.

=item C<(?<!pattern)>

A zero-width negative look-behind assertion. For example C</(?<!bar)foo/>
matches any occurrence of "foo" that does not follow "bar". Works
only for fixed-width look-behind.
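The look-ahead pitfall described above, and the look-behind alternative,
can be checked with a sketch like this one (the sample string is
arbitrary):

```perl
use strict;
use warnings;

my $s = "foobar bazbar xbar";

# (?!foo)bar still matches the "bar" in "foobar": at that position
# the next three characters are "bar", not "foo".
my $naive = () = $s =~ /(?!foo)bar/g;

# A fixed-width negative look-behind gives the intended answer.
my $behind = () = $s =~ /(?<!foo)bar/g;

# Look-ahead tests what follows without consuming it.
my ($word) = "key\tvalue" =~ /(\w+)(?=\t)/;

print "naive=$naive behind=$behind word=$word\n";
```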

=item C<(?{ code })>

B<WARNING>: This extended regular expression feature is considered
highly experimental, and may be changed or deleted without notice.

This zero-width assertion evaluates any embedded Perl code. It
always succeeds, and its C<code> is not interpolated. Currently,
the rules to determine where the C<code> ends are somewhat convoluted.

The C<code> is properly scoped in the following sense: If the assertion
is backtracked (compare L<"Backtracking">), all changes introduced after
C<local>ization are undone, so that

  $_ = 'a' x 8;
  m<
     (?{ $cnt = 0 })                    # Initialize $cnt.
     (
       a
       (?{
           local $cnt = $cnt + 1;       # Update $cnt, backtracking-safe.
       })
     )*
     aaaa
     (?{ $res = $cnt })                 # On success copy to non-localized
                                        # location.
   >x;

will set C<$res = 4>. Note that after the match, $cnt returns to the globally
introduced value, because the scopes that restrict C<local> operators
are unwound.

This assertion may be used as a C<(?(condition)yes-pattern|no-pattern)>
switch. If I<not> used in this way, the result of evaluation of
C<code> is put into the special variable C<$^R>. This happens
immediately, so C<$^R> can be used from other C<(?{ code })> assertions
inside the same regular expression.

The assignment to C<$^R> above is properly localized, so the old
value of C<$^R> is restored if the assertion is backtracked; compare
L<"Backtracking">.

For reasons of security, this construct is forbidden if the regular
expression involves run-time interpolation of variables, unless the
perilous C<use re 'eval'> pragma has been used (see L<re>), or the
variables contain results of the C<qr//> operator (see
L<perlop/"qr/STRING/imosx">).

This restriction is because of the wide-spread and remarkably convenient
custom of using run-time determined strings as patterns. For example:

    $re = <>;
    chomp $re;
    $string =~ /$re/;

Before Perl knew how to execute interpolated code within a pattern,
this operation was completely safe from a security point of view,
although it could raise an exception from an illegal pattern. If
you turn on the C<use re 'eval'>, though, it is no longer secure,
so you should only do so if you are also using taint checking.
Better yet, use the carefully constrained evaluation within a Safe
module. See L<perlsec> for details about both these mechanisms.

=item C<(??{ code })>

B<WARNING>: This extended regular expression feature is considered
highly experimental, and may be changed or deleted without notice.
A simplified version of the syntax may be introduced for commonly
used idioms.

This is a "postponed" regular subexpression. The C<code> is evaluated
at run time, at the moment this subexpression may match. The result
of evaluation is considered as a regular expression and matched as
if it were inserted instead of this construct.

The C<code> is not interpolated. As before, the rules to determine
where the C<code> ends are currently somewhat convoluted.

The following pattern matches a parenthesized group:

    $re = qr{
             \(
             (?:
                (?> [^()]+ )    # Non-parens without backtracking
              |
                (??{ $re })     # Group with matching parens
             )*
             \)
          }x;
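As a usage sketch, the same pattern can be applied as shown below. This
assumes a perl on which C<(??{ code })> is available and behaves as
described here; the test strings are invented, and C<$re> is declared
before the assignment so that the code block can refer to it under
C<use strict>.

```perl
use strict;
use warnings;

# Declare $re first so the postponed subexpression can see it.
my $re;
$re = qr{
    \(
    (?:
        (?> [^()]+ )    # Non-parens without backtracking
      |
        (??{ $re })     # Group with matching parens
    )*
    \)
}x;

my ($group)  = "f(a, (b + c) * d) + e)" =~ /($re)/;
my $balanced = "((x)(y))" =~ /^$re\z/ ? 1 : 0;
my $unclosed = "((x)(y)"  =~ /^$re\z/ ? 1 : 0;

print "group: $group\n";
print "balanced=$balanced unclosed=$unclosed\n";
```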

=item C<< (?>pattern) >>

B<WARNING>: This extended regular expression feature is considered
highly experimental, and may be changed or deleted without notice.

An "independent" subexpression, one which matches the substring
that a I<standalone> C<pattern> would match if anchored at the given
position, and it matches I<nothing other than this substring>. This
construct is useful for optimizations of what would otherwise be
"eternal" matches, because it will not backtrack (see L<"Backtracking">).
It may also be useful in places where the "grab all you can, and do not
give anything back" semantic is desirable.

For example: C<< ^(?>a*)ab >> will never match, since C<< (?>a*) >>
(anchored at the beginning of string, as above) will match I<all>
characters C<a> at the beginning of string, leaving no C<a> for
C<ab> to match. In contrast, C<a*ab> will match the same as C<a+b>,
since the match of the subgroup C<a*> is influenced by the following
group C<ab> (see L<"Backtracking">). In particular, C<a*> inside
C<a*ab> will match fewer characters than a standalone C<a*>, since
this makes the tail match.

An effect similar to C<< (?>pattern) >> may be achieved by writing
C<(?=(pattern))\1>. This matches the same substring as a standalone
C<pattern>, and the following C<\1> eats the matched string; it therefore
makes a zero-length assertion into an analogue of C<< (?>...) >>.
(The difference between these two constructs is that the second one
uses a capturing group, thus shifting ordinals of backreferences
in the rest of a regular expression.)
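The three forms discussed above can be compared directly in a small
sketch (the subject string is chosen for illustration):

```perl
use strict;
use warnings;

my $s = "aaab";

# Ordinary a* hands back one "a" so that the trailing "ab" can match.
my $plain = $s =~ /^a*ab/ ? 1 : 0;

# (?>a*) keeps every "a" it grabbed, so "ab" has nothing left to match.
my $atomic = $s =~ /^(?>a*)ab/ ? 1 : 0;

# The look-ahead-plus-backreference form behaves the same way.
my $emulated = $s =~ /^(?=(a*))\1ab/ ? 1 : 0;

print "plain=$plain atomic=$atomic emulated=$emulated\n";
```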
649
650Consider this pattern:
c277df42 651
871b0233
IZ
652 m{ \(
653 (
9da458fc 654 [^()]+ # x+
871b0233
IZ
655 |
656 \( [^()]* \)
657 )+
658 \)
659 }x
5a964f20 660
19799a22
GS
661That will efficiently match a nonempty group with matching parentheses
662two levels deep or less. However, if there is no such group, it
663will take virtually forever on a long string. That's because there
664are so many different ways to split a long string into several
665substrings. This is what C<(.+)+> is doing, and C<(.+)+> is similar
666to a subpattern of the above pattern. Consider how the pattern
667above detects no-match on C<((()aaaaaaaaaaaaaaaaaa> in several
668seconds, but that each extra letter doubles this time. This
669exponential performance will make it appear that your program has
14218588 670hung. However, a tiny change to this pattern
5a964f20 671
871b0233
IZ
672 m{ \(
673 (
9da458fc 674 (?> [^()]+ ) # change x+ above to (?> x+ )
871b0233
IZ
675 |
676 \( [^()]* \)
677 )+
678 \)
679 }x
c277df42 680
c47ff5f1 681which uses C<< (?>...) >> matches exactly when the one above does (verifying
5a964f20
TC
682this yourself would be a productive exercise), but finishes in a fourth
683the time when used on a similar string with 1000000 C<a>s. Be aware,
684however, that this pattern currently triggers a warning message under
9f1b1f2d 685the C<use warnings> pragma or B<-w> switch saying it
6bab786b 686C<"matches null string many times in regex">.
c277df42 687
c47ff5f1 688On simple groups, such as the pattern C<< (?> [^()]+ ) >>, a comparable
19799a22 689effect may be achieved by negative look-ahead, as in C<[^()]+ (?! [^()] )>.
c277df42
IZ
690This was only 4 times slower on a string with 1000000 C<a>s.
691
9da458fc
IZ
692The "grab all you can, and do not give anything back" semantic is desirable
693in many situations where on the first sight a simple C<()*> looks like
694the correct solution. Suppose we parse text with comments being delimited
695by C<#> followed by some optional (horizontal) whitespace. Contrary to
4375e838 696its appearance, C<#[ \t]*> I<is not> the correct subexpression to match
9da458fc
IZ
697the comment delimiter, because it may "give up" some whitespace if
698the remainder of the pattern can be made to match that way. The correct
699answer is either one of these:
700
701 (?>#[ \t]*)
702 #[ \t]*(?![ \t])
703
704For example, to grab non-empty comments into $1, one should use either
705one of these:
706
707 / (?> \# [ \t]* ) ( .+ ) /x;
708 / \# [ \t]* ( [^ \t] .* ) /x;
709
710Which one you pick depends on which of these expressions better reflects
711the above specification of comments.

=item C<(?(condition)yes-pattern|no-pattern)>

=item C<(?(condition)yes-pattern)>

B<WARNING>: This extended regular expression feature is considered
highly experimental, and may be changed or deleted without notice.

Conditional expression. C<(condition)> should be either an integer in
parentheses (which is valid if the corresponding pair of parentheses
matched), or a look-ahead/look-behind/evaluate zero-width assertion.

For example:

    m{ ( \( )?
       [^()]+
       (?(1) \) )
     }x

matches a chunk of non-parentheses, possibly enclosed in parentheses
themselves.
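As a rough sketch of how the conditional behaves, the pattern above can
be anchored at both ends (an addition made here so that the whole string
has to fit) and tried against a few made-up strings:

```perl
use strict;
use warnings;

# Same pattern as above, with ^ and $ added for this illustration.
my $chunk = qr{ ^ ( \( )? [^()]+ (?(1) \) ) $ }x;

my %ok;
for my $str ("abc", "(abc)", "(abc", "abc)") {
    $ok{$str} = $str =~ $chunk ? 1 : 0;
    printf "%-6s %s\n", $str, $ok{$str} ? "match" : "no match";
}
```

The closing parenthesis is required only when the opening one was
matched, so the two unbalanced strings are rejected.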

=back

=head2 Backtracking

NOTE: This section presents an abstract approximation of regular
expression behavior. For a more rigorous (and complicated) view of
the rules involved in selecting a match among possible alternatives,
see L<Combining pieces together>.

A fundamental feature of regular expression matching involves the
notion called I<backtracking>, which is currently used (when needed)
by all regular expression quantifiers, namely C<*>, C<*?>, C<+>,
C<+?>, C<{n,m}>, and C<{n,m}?>. Backtracking is often optimized
internally, but the general principle outlined here is valid.

For a regular expression to match, the I<entire> regular expression must
match, not just part of it. So if the beginning of a pattern containing a
quantifier succeeds in a way that causes later parts in the pattern to
fail, the matching engine backs up and recalculates the beginning
part--that's why it's called backtracking.

Here is an example of backtracking: Let's say you want to find the
word following "foo" in the string "Food is on the foo table.":

    $_ = "Food is on the foo table.";
    if ( /\b(foo)\s+(\w+)/i ) {
        print "$2 follows $1.\n";
    }

When the match runs, the first part of the regular expression (C<\b(foo)>)
finds a possible match right at the beginning of the string, and loads up
$1 with "Foo". However, as soon as the matching engine sees that there's
no whitespace following the "Foo" that it had saved in $1, it realizes its
mistake and starts over again one character after where it had the
tentative match. This time it goes all the way until the next occurrence
of "foo". The complete regular expression matches this time, and you get
the expected output of "table follows foo."

Sometimes minimal matching can help a lot. Imagine you'd like to match
everything between "foo" and "bar". Initially, you write something
like this:

    $_ = "The food is under the bar in the barn.";
    if ( /foo(.*)bar/ ) {
        print "got <$1>\n";
    }

Which perhaps unexpectedly yields:

    got <d is under the bar in the >

That's because C<.*> was greedy, so you get everything between the
I<first> "foo" and the I<last> "bar". Here it's more effective
to use minimal matching to make sure you get the text between a "foo"
and the first "bar" thereafter.

    if ( /foo(.*?)bar/ ) { print "got <$1>\n" }
    got <d is under the >

Here's another example: let's say you'd like to match a number at the end
of a string, and you also want to keep the preceding part of the match.
So you write this:

    $_ = "I have 2 numbers: 53147";
    if ( /(.*)(\d*)/ ) {                                # Wrong!
        print "Beginning is <$1>, number is <$2>.\n";
    }

That won't work at all, because C<.*> was greedy and gobbled up the
whole string. As C<\d*> can match on an empty string, the complete
regular expression matched successfully.

    Beginning is <I have 2 numbers: 53147>, number is <>.

Here are some variants, most of which don't work:

    $_ = "I have 2 numbers: 53147";
    @pats = qw{
        (.*)(\d*)
        (.*)(\d+)
        (.*?)(\d*)
        (.*?)(\d+)
        (.*)(\d+)$
        (.*?)(\d+)$
        (.*)\b(\d+)$
        (.*\D)(\d+)$
    };

    for $pat (@pats) {
        printf "%-12s ", $pat;
        if ( /$pat/ ) {
            print "<$1> <$2>\n";
        } else {
            print "FAIL\n";
        }
    }

That will print out:

    (.*)(\d*)    <I have 2 numbers: 53147> <>
    (.*)(\d+)    <I have 2 numbers: 5314> <7>
    (.*?)(\d*)   <> <>
    (.*?)(\d+)   <I have > <2>
    (.*)(\d+)$   <I have 2 numbers: 5314> <7>
    (.*?)(\d+)$  <I have 2 numbers: > <53147>
    (.*)\b(\d+)$ <I have 2 numbers: > <53147>
    (.*\D)(\d+)$ <I have 2 numbers: > <53147>

As you see, this can be a bit tricky. It's important to realize that a
regular expression is merely a set of assertions that gives a definition
of success. There may be 0, 1, or several different ways that the
definition might succeed against a particular string. And if there are
multiple ways it might succeed, you need to understand backtracking to
know which variety of success you will achieve.

When using look-ahead assertions and negations, this can all get even
trickier. Imagine you'd like to find a sequence of non-digits not
followed by "123". You might try to write that as

    $_ = "ABC123";
    if ( /^\D*(?!123)/ ) {              # Wrong!
        print "Yup, no 123 in $_\n";
    }

But that isn't going to match; at least, not the way you're hoping. It
claims that there is no 123 in the string. Here's a clearer picture of
why that pattern matches, contrary to popular expectations:

    $x = 'ABC123' ;
    $y = 'ABC445' ;

    print "1: got $1\n" if $x =~ /^(ABC)(?!123)/ ;
    print "2: got $1\n" if $y =~ /^(ABC)(?!123)/ ;

    print "3: got $1\n" if $x =~ /^(\D*)(?!123)/ ;
    print "4: got $1\n" if $y =~ /^(\D*)(?!123)/ ;

This prints

    2: got ABC
    3: got AB
    4: got ABC

You might have expected test 3 to fail because it seems to be a more
general purpose version of test 1. The important difference between
them is that test 3 contains a quantifier (C<\D*>) and so can use
backtracking, whereas test 1 will not. What's happening is
that you've asked "Is it true that at the start of $x, following 0 or more
non-digits, you have something that's not 123?" If the pattern matcher had
let C<\D*> expand to "ABC", this would have caused the whole pattern to
fail.

The search engine will initially match C<\D*> with "ABC". Then it will
try to match C<(?!123)> with "123", which fails. But because
a quantifier (C<\D*>) has been used in the regular expression, the
search engine can backtrack and retry the match differently
in the hope of matching the complete regular expression.

The pattern really, I<really> wants to succeed, so it uses the
standard pattern back-off-and-retry and lets C<\D*> expand to just "AB" this
time. Now there's indeed something following "AB" that is not
"123". It's "C123", which suffices.

We can deal with this by using both an assertion and a negation.
We'll say that the first part in $1 must be followed both by a digit
and by something that's not "123". Remember that the look-aheads
are zero-width expressions--they only look, but don't consume any
of the string in their match. So rewriting this way produces what
you'd expect; that is, case 5 will fail, but case 6 succeeds:

    print "5: got $1\n" if $x =~ /^(\D*)(?=\d)(?!123)/ ;
    print "6: got $1\n" if $y =~ /^(\D*)(?=\d)(?!123)/ ;

    6: got ABC

In other words, the two zero-width assertions next to each other work as though
they're ANDed together, just as you'd use any built-in assertions: C</^$/>
matches only if you're at the beginning of the line AND the end of the
line simultaneously. The deeper underlying truth is that juxtaposition in
regular expressions always means AND, except when you write an explicit OR
using the vertical bar. C</ab/> means match "a" AND (then) match "b",
although the attempted matches are made at different positions because "a"
is not a zero-width assertion, but a one-width assertion.
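As a small sketch of the ANDed look-aheads, the pattern from cases 5 and
6 can be stored once and tried against several strings. The third string
and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Non-digits that are followed by a digit AND by something other than "123".
my $re = qr/^(\D*)(?=\d)(?!123)/;

my %got;
for my $str ('ABC123', 'ABC445', 'ABC') {
    $got{$str} = $str =~ $re ? $1 : undef;
    printf "%-6s => %s\n", $str,
        defined $got{$str} ? "got $got{$str}" : "no match";
}
```

Note that "ABC" fails as well, because C<(?=\d)> insists on a digit.
If that is not wanted, an independent group such as
C<< /^(?>\D*)(?!123)/ >> is another way to stop C<\D*> from giving
characters back.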

B<WARNING>: particularly complicated regular expressions can take
exponential time to solve because of the immense number of possible
ways they can use backtracking to try to match. For example, without
internal optimizations done by the regular expression engine, this will
take a painfully long time to run:

    'aaaaaaaaaaaa' =~ /((a{0,5}){0,5})*[c]/

And if you used C<*>'s in the internal groups instead of limiting them
to 0 through 5 matches, then it would take forever--or until you ran
out of stack space. Moreover, these internal optimizations are not
always applicable. For example, if you put C<{0,5}> instead of C<*>
on the external group, no current optimization is applicable, and the
match takes a long time to finish.

A powerful tool for optimizing such beasts is what is known as an
"independent group",
which does not backtrack (see L<C<< (?>pattern) >>>). Note also that
zero-length look-ahead/look-behind assertions will not backtrack to make
the tail match, since they are in "logical" context: only
whether they match is considered relevant. For an example
where side-effects of look-ahead I<might> have influenced the
following match, see L<C<< (?>pattern) >>>.
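A minimal sketch of the independent-group idea follows. It uses a
simpler nested quantifier than the pattern above, chosen here only for
illustration; the test strings are made up, and no timings are claimed.

```perl
use strict;
use warnings;

# In /^(?:a+)+$/ a run of "a"s can be divided between the inner and the
# outer "+" in many ways, and a failing match may have to try them all.
# With an independent group the inner a+ keeps everything it grabbed.
my $atomic = qr/^(?>a+)+$/;

my $good = 'a' x 30;
my $bad  = ('a' x 30) . '!';

my $good_ok = $good =~ $atomic ? 1 : 0;
my $bad_ok  = $bad  =~ $atomic ? 1 : 0;

print "good: $good_ok, bad: $bad_ok\n";
```

The independent group accepts the same strings as the plain nested
version in this case; it only removes the alternative ways of splitting
the run that backtracking would otherwise explore.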

=head2 Version 8 Regular Expressions

In case you're not familiar with the "regular" Version 8 regex
routines, here are the pattern-matching rules not described above.

Any single character matches itself, unless it is a I<metacharacter>
with a special meaning described here or above. You can cause
characters that normally function as metacharacters to be interpreted
literally by prefixing them with a "\" (e.g., "\." matches a ".", not any
character; "\\" matches a "\"). A series of characters matches that
series of characters in the target string, so the pattern C<blurfl>
would match "blurfl" in the target string.

You can specify a character class by enclosing a list of characters
in C<[]>, which will match any one character from the list. If the
first character after the "[" is "^", the class matches any character not
in the list. Within a list, the "-" character specifies a
range, so that C<a-z> represents all characters between "a" and "z",
inclusive. If you want either "-" or "]" itself to be a member of a
class, put it at the start of the list (possibly after a "^"), or
escape it with a backslash. "-" is also taken literally when it is
at the end of the list, just before the closing "]". (The
following all specify the same class of three characters: C<[-az]>,
C<[az-]>, and C<[a\-z]>. All are different from C<[a-z]>, which
specifies a class containing twenty-six characters, even on EBCDIC
based coded character sets.) Also, if you try to use the character
classes C<\w>, C<\W>, C<\s>, C<\S>, C<\d>, or C<\D> as endpoints of
a range, that's not a range; the "-" is understood literally.
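The three spellings of the literal "-" can be compared with the real
range in a short sketch. The sample text is made up for illustration.

```perl
use strict;
use warnings;

my $str = "a-z and b";   # made-up sample text

my %found;
for my $class ('[-az]', '[az-]', '[a\-z]', '[a-z]') {
    # Collect every single character the class matches, in order.
    $found{$class} = join '', $str =~ /$class/g;
    printf "%-7s matches: %s\n", $class, $found{$class};
}
```

The first three classes pick out only "a", "-" and "z", while C<[a-z]>
picks out every lowercase letter and skips the "-".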

Note also that the whole range idea is rather unportable between
character sets--and even within character sets ranges may cause results
you probably didn't expect. A sound principle is to use only ranges
that begin from and end at either letters of the same case ([a-e],
[A-E]), or digits ([0-9]). Anything else is unsafe. If in doubt,
spell out the character sets in full.

Characters may be specified using a metacharacter syntax much like that
used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,
"\f" a form feed, etc. More generally, \I<nnn>, where I<nnn> is a string
of octal digits, matches the character whose coded character set value
is I<nnn>. Similarly, \xI<nn>, where I<nn> are hexadecimal digits,
matches the character whose numeric value is I<nn>. The expression \cI<x>
matches the character control-I<x>. Finally, the "." metacharacter
matches any character except "\n" (unless you use C</s>).
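The escapes above can be checked against a tab character in a small
sketch. It assumes an ASCII-based platform, where tab has the value 9;
the hash and its keys are illustrative only.

```perl
use strict;
use warnings;

my $tab = "\t";

# Assumes ASCII: tab is octal 011, hex 09, and control-I.
my %matches = (
    octal   => ($tab =~ /^\011$/ ? 1 : 0),
    hex     => ($tab =~ /^\x09$/ ? 1 : 0),
    control => ($tab =~ /^\cI$/  ? 1 : 0),
    dot     => ("\n" =~ /^.$/    ? 1 : 0),   # "." does not match "\n"
    dot_s   => ("\n" =~ /^.$/s   ? 1 : 0),   # unless /s is used
);

print "$_: $matches{$_}\n" for sort keys %matches;
```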

You can specify a series of alternatives for a pattern using "|" to
separate them, so that C<fee|fie|foe> will match any of "fee", "fie",
or "foe" in the target string (as would C<f(e|i|o)e>). The
first alternative includes everything from the last pattern delimiter
("(", "[", or the beginning of the pattern) up to the first "|", and
the last alternative contains everything from the last "|" to the next
pattern delimiter. That's why it's common practice to include
alternatives in parentheses: to minimize confusion about where they
start and end.

Alternatives are tried from left to right, so the first
alternative found for which the entire expression matches is the one that
is chosen. This means that alternatives are not necessarily greedy. For
example: when matching C<foo|foot> against "barefoot", only the "foo"
part will match, as that is the first alternative tried, and it successfully
matches the target string. (This might not seem important, but it is
important when you are capturing matched text using parentheses.)
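The "barefoot" case can be sketched with capturing parentheses. The
second and third patterns are variations added here for comparison.

```perl
use strict;
use warnings;

# First alternative that lets the whole pattern match is chosen.
my ($first)    = "barefoot" =~ /(foo|foot)/;

# Reordering the alternatives changes what is captured.
my ($reversed) = "barefoot" =~ /(foot|foo)/;

# With $ added, "foo" no longer lets the whole pattern match,
# so the second alternative is used.
my ($anchored) = "barefoot" =~ /(foo|foot)$/;

print "$first $reversed $anchored\n";
```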

Also remember that "|" is interpreted as a literal within square brackets,
so if you write C<[fee|fie|foe]> you're really only matching C<[feio|]>.

Within a pattern, you may designate subpatterns for later reference
by enclosing them in parentheses, and you may refer back to the
I<n>th subpattern later in the pattern using the metacharacter
\I<n>. Subpatterns are numbered based on the left to right order
of their opening parenthesis. A backreference matches whatever
actually matched the subpattern in the string being examined, not
the rules for that subpattern. Therefore, C<(0|0x)\d*\s\1\d*> will
match "0x1234 0x4321", but not "0x1234 01234", because subpattern
1 matched "0x", even though the rule C<0|0x> could potentially match
the leading 0 in the second number.
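The backreference example can be run directly. This sketch only wraps
the pattern from the text in C<qr//>; the variable names are
illustrative.

```perl
use strict;
use warnings;

my $re = qr/(0|0x)\d*\s\1\d*/;

# \1 must repeat the text that group 1 actually matched ("0x" here).
my ($prefix) = "0x1234 0x4321" =~ $re;
my $same     = "0x1234 0x4321" =~ $re ? 1 : 0;
my $mixed    = "0x1234 01234"  =~ $re ? 1 : 0;

print "prefix=$prefix same=$same mixed=$mixed\n";
```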

=head2 Warning on \1 vs $1

Some people get too used to writing things like:

    $pattern =~ s/(\W)/\\\1/g;

This is grandfathered for the RHS of a substitute to avoid shocking the
B<sed> addicts, but it's a dirty habit to get into. That's because in
PerlThink, the righthand side of a C<s///> is a double-quoted string. C<\1> in
the usual double-quoted string means a control-A. The customary Unix
meaning of C<\1> is kludged in for C<s///>. However, if you get into the habit
of doing that, you get yourself into trouble if you then add an C</e>
modifier.

    s/(\d+)/ \1 + 1 /eg;    # causes warning under -w

Or if you try to do

    s/(\d+)/\1000/;

You can't disambiguate that by saying C<\{1}000>, whereas you can fix it with
C<${1}000>. The operation of interpolation should not be confused
with the operation of matching a backreference. Certainly they mean two
different things on the I<left> side of the C<s///>.
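A short sketch of the recommended C<$1> forms, using a made-up input
string:

```perl
use strict;
use warnings;

my $text = "7 apples";   # made-up sample text

# ${1}000 is group 1 followed by the literal digits "000".
(my $padded = $text) =~ s/(\d+)/${1}000/;

# With /e the replacement is Perl code, so $1 is an ordinary variable.
(my $bumped = $text) =~ s/(\d+)/$1 + 1/e;

print "$padded\n$bumped\n";
```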

=head2 Repeated patterns matching zero-length substring

B<WARNING>: Difficult material (and prose) ahead. This section needs a rewrite.

Regular expressions provide a terse and powerful programming language. As
with most other power tools, power comes together with the ability
to wreak havoc.

A common abuse of this power stems from the ability to make infinite
loops using regular expressions, with something as innocuous as:

    'foo' =~ m{ ( o? )* }x;

The C<o?> can match at the beginning of C<'foo'>, and since the position
in the string is not moved by the match, C<o?> would match again and again
because of the C<*> quantifier. Another common way to create a similar cycle
is with the looping modifier C<//g>:

    @matches = ( 'foo' =~ m{ o? }xg );

or

    print "match: <$&>\n" while 'foo' =~ m{ o? }xg;

or the loop implied by split().

However, long experience has shown that many programming tasks may
be significantly simplified by using repeated subexpressions that
may match zero-length substrings. A simple example is:

    @chars = split //, $string;           # // is not magic in split
    ($whitewashed = $string) =~ s/()/ /g; # parens avoid magic s// /

Thus Perl allows such constructs, by I<forcefully breaking
the infinite loop>. The rules for this are different for lower-level
loops given by the greedy quantifiers C<*+{}>, and for higher-level
ones like the C</g> modifier or split() operator.

The lower-level loops are I<interrupted> (that is, the loop is
broken) when Perl detects that a repeated expression matched a
zero-length substring. Thus

    m{ (?: NON_ZERO_LENGTH | ZERO_LENGTH )* }x;

is made equivalent to

    m{   (?: NON_ZERO_LENGTH )*
       |
         (?: ZERO_LENGTH )?
     }x;

The higher-level loops preserve an additional state between iterations:
whether the last match was zero-length. To break the loop, the following
match after a zero-length match is prohibited to have a length of zero.
This prohibition interacts with backtracking (see L<"Backtracking">),
and so the I<second best> match is chosen if the I<best> match is of
zero length.

For example:

    $_ = 'bar';
    s/\w??/<$&>/g;

results in C<< <><b><><a><><r><> >>. At each position of the string the best
match given by non-greedy C<??> is the zero-length match, and the I<second
best> match is what is matched by C<\w>. Thus zero-length matches
alternate with one-character-long matches.

Similarly, for repeated C<m/()/g> the second-best match is the match at the
position one notch further in the string.

The additional state of being I<matched with zero-length> is associated with
the matched string, and is reset by each assignment to pos().
Zero-length matches at the end of the previous match are ignored
during C<split>.
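The higher-level loops described in this section can be observed in a
brief sketch that reuses the patterns from the text on literal strings:

```perl
use strict;
use warnings;

# List-context //g: a zero-length match may not follow a zero-length
# match at the same position, so the loop terminates.
my @matches = ( 'foo' =~ m{ o? }xg );
print join('|', map { "<$_>" } @matches), "\n";

# Non-greedy ?? prefers the empty match, so the two kinds alternate.
(my $marked = 'bar') =~ s/\w??/<$&>/g;
print "$marked\n";

# split with an empty pattern yields the individual characters.
my @chars = split //, 'bar';
print scalar(@chars), " characters\n";
```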

=head2 Combining pieces together

Each of the elementary pieces of regular expressions which were described
before (such as C<ab> or C<\Z>) could match at most one substring
at the given position of the input string. However, in a typical regular
expression these elementary pieces are combined into more complicated
patterns using combining operators C<ST>, C<S|T>, C<S*> etc.
(in these examples C<S> and C<T> are regular subexpressions).

Such combinations can include alternatives, leading to a problem of choice:
if we match a regular expression C<a|ab> against C<"abc">, will it match
substring C<"a"> or C<"ab">? One way to describe which substring is
actually matched is the concept of backtracking (see L<"Backtracking">).
However, this description is too low-level and makes you think
in terms of a particular implementation.

Another description starts with notions of "better"/"worse". All the
substrings which may be matched by the given regular expression can be
sorted from the "best" match to the "worst" match, and it is the "best"
match which is chosen. This substitutes the question of "what is chosen?"
by the question of "which matches are better, and which are worse?".

Again, for elementary pieces there is no such question, since at most
one match at a given position is possible. This section describes the
notion of better/worse for combining operators. In the description
below C<S> and C<T> are regular subexpressions.

=over 4

=item C<ST>

Consider two possible matches, C<AB> and C<A'B'>, where C<A> and C<A'> are
substrings which can be matched by C<S>, and C<B> and C<B'> are substrings
which can be matched by C<T>.

If C<A> is a better match for C<S> than C<A'>, C<AB> is a better
match than C<A'B'>.

If C<A> and C<A'> coincide: C<AB> is a better match than C<AB'> if
C<B> is a better match for C<T> than C<B'>.

=item C<S|T>

When C<S> can match, it is a better match than when only C<T> can match.

Ordering of two matches for C<S> is the same as for C<S>. Similarly for
two matches for C<T>.

=item C<S{REPEAT_COUNT}>

Matches as C<SSS...S> (repeated as many times as necessary).

=item C<S{min,max}>

Matches as C<S{max}|S{max-1}|...|S{min+1}|S{min}>.

=item C<S{min,max}?>

Matches as C<S{min}|S{min+1}|...|S{max-1}|S{max}>.

=item C<S?>, C<S*>, C<S+>

Same as C<S{0,1}>, C<S{0,BIG_NUMBER}>, C<S{1,BIG_NUMBER}> respectively.

=item C<S??>, C<S*?>, C<S+?>

Same as C<S{0,1}?>, C<S{0,BIG_NUMBER}?>, C<S{1,BIG_NUMBER}?> respectively.

=item C<< (?>S) >>

Matches the best match for C<S> and only that.

=item C<(?=S)>, C<(?<=S)>

Only the best match for C<S> is considered. (This is important only if
C<S> has capturing parentheses, and backreferences are used somewhere
else in the whole regular expression.)

=item C<(?!S)>, C<(?<!S)>

For this grouping operator there is no need to describe the ordering, since
only whether or not C<S> can match is important.

=item C<(??{ EXPR })>

The ordering is the same as for the regular expression which is
the result of EXPR.

=item C<(?(condition)yes-pattern|no-pattern)>

Recall that which of C<yes-pattern> or C<no-pattern> actually matches is
already determined. The ordering of the matches is the same as for the
chosen subexpression.

=back

The above recipes describe the ordering of matches I<at a given position>.
One more rule is needed to understand how a match is determined for the
whole regular expression: a match at an earlier position is always better
than a match at a later position.
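Several of these ordering rules can be observed in a brief sketch. The
test strings are made up for illustration, and each comment names the
rule the line is meant to show.

```perl
use strict;
use warnings;

my ($alt)    = "abc"  =~ /(a|ab)/;         # S|T: a match for S is better
my ($greedy) = "aaaa" =~ /(a{1,3})/;       # S{min,max}: highest count first
my ($lazy)   = "aaaa" =~ /(a{1,3}?)/;      # S{min,max}?: lowest count first
my ($early)  = "xaab" =~ /(a+b|x)/;        # earlier position is better
my @st       = "abcd" =~ /(a|ab)(c|bcd)/;  # ST: best match for S decides first

print "$alt $greedy $lazy $early @st\n";
```

In the last line C<S> keeps its best match "a", and C<T> then has to
settle for "bcd", its second alternative.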

=head2 Creating custom RE engines

Overloaded constants (see L<overload>) provide a simple way to extend
the functionality of the RE engine.

Suppose that we want to enable a new RE escape-sequence C<\Y|> which
matches at a boundary between white-space characters and non-whitespace
characters. Note that C<(?=\S)(?<!\S)|(?!\S)(?<=\S)> matches exactly
at these positions, so we want to have each C<\Y|> in the place of the
more complicated version. We can create a module C<customre> to do
this:

    package customre;
    use overload;

    sub import {
      shift;
      die "No argument to customre::import allowed" if @_;
      overload::constant 'qr' => \&convert;
    }

    sub invalid { die "/$_[0]/: invalid escape '\\$_[1]'"}

    # An escaped backslash must map to itself, so that a legitimate
    # \\ in the pattern is left unchanged.
    my %rules = ( '\\' => '\\\\',
                  'Y|' => qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/ );
    sub convert {
      my $re = shift;
      $re =~ s{
                \\ ( \\ | Y . )
              }
              { $rules{$1} or invalid($re,$1) }sgex;
      return $re;
    }

Now C<use customre> enables the new escape in constant regular
expressions, i.e., those without any runtime variable interpolations.
As documented in L<overload>, this conversion will work only over
literal parts of regular expressions. For C<\Y|$re\Y|> the variable
part of this regular expression needs to be converted explicitly
(but only if the special meaning of C<\Y|> should be enabled inside $re):

    use customre;
    $re = <>;
    chomp $re;
    $re = customre::convert $re;
    /\Y|$re\Y|/;
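Independently of the C<customre> module, the expanded assertion can be
tried on its own to see where it matches. This is only a sketch; the
sample text and variable names are made up.

```perl
use strict;
use warnings;

# The expansion that stands behind \Y| in the text above.
my $boundary = qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/;

my $text = "ab  cd";     # made-up sample text
my @positions;
while ($text =~ /$boundary/g) {
    # Every match is zero-length; pos() reports where it occurred.
    push @positions, pos($text);
}

print "boundaries at: @positions\n";
```

The positions reported are the start and end of each run of
non-whitespace characters.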

=head1 BUGS

This document varies from difficult to understand to completely
and utterly opaque. The wandering prose riddled with jargon is
hard to fathom in several places.

This document needs a rewrite that separates the tutorial content
from the reference content.

=head1 SEE ALSO

L<perlrequick>.

L<perlretut>.

L<perlop/"Regexp Quote-Like Operators">.

L<perlop/"Gory details of parsing quoted constructs">.

L<perlfaq6>.

L<perlfunc/pos>.

L<perllocale>.

L<perlebcdic>.

I<Mastering Regular Expressions> by Jeffrey Friedl, published
by O'Reilly and Associates.