=head1 NAME

perlre - Perl regular expressions

=head1 DESCRIPTION

This page describes the syntax of regular expressions in Perl. For a
description of how to I<use> regular expressions in matching
operations, plus various examples of the same, see discussions
of C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/"Regexp Quote-Like Operators">.

Matching operations can have various modifiers. Modifiers
that relate to the interpretation of the regular expression inside
are listed below. Modifiers that alter the way a regular expression
is used by Perl are detailed in L<perlop/"Regexp Quote-Like Operators"> and
L<perlop/"Gory details of parsing quoted constructs">.

=over 4

=item i

Do case-insensitive pattern matching.

If C<use locale> is in effect, the case map is taken from the current
locale. See L<perllocale>.

=item m

Treat string as multiple lines. That is, change "^" and "$" from matching
at only the very start or end of the string to the start or end of any
line anywhere within the string.

=item s

Treat string as single line. That is, change "." to match any character
whatsoever, even a newline, which normally it would not match.

The C</s> and C</m> modifiers both override the C<$*> setting. That
is, no matter what C<$*> contains, C</s> without C</m> will force
"^" to match only at the beginning of the string and "$" to match
only at the end (or just before a newline at the end) of the string.
Together, as /ms, they let the "." match any character whatsoever,
while still allowing "^" and "$" to match, respectively, just after
and just before newlines within the string.

=item x

Extend your pattern's legibility by permitting whitespace and comments.

=back

These are usually written as "the C</x> modifier", even though the delimiter
in question might not actually be a slash. In fact, any of these
modifiers may also be embedded within the regular expression itself using
the new C<(?...)> construct. See below.

The C</x> modifier itself needs a little more explanation. It tells
the regular expression parser to ignore whitespace that is neither
backslashed nor within a character class. You can use this to break up
your regular expression into (slightly) more readable parts. The C<#>
character is also treated as a metacharacter introducing a comment,
just as in ordinary Perl code. This also means that if you want real
whitespace or C<#> characters in the pattern (outside of a character
class, where they are unaffected by C</x>), you'll either have to
escape them or encode them using octal or hex escapes. Taken together,
these features go a long way towards making Perl's regular expressions
more readable. Note that you have to be careful not to include the
pattern delimiter in the comment--perl has no way of knowing you did
not intend to close the pattern early. See the C-comment deletion code
in L<perlop>.
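
As a rough illustration of the C</x> rules just described, the sketch
below lays a pattern out over several lines with comments. The input
string and the variable names C<$version_re> and C<@parts> are invented
for this example; the point is only that the literal space and the
literal C<#> have to be backslashed once C</x> is in effect.

```perl
use strict;
use warnings;

# Match a string such as "v1.2 #beta", using /x for layout.
my $version_re = qr{
    ^ v (\d+)      # major number
    \. (\d+)       # minor number
    \ \#           # a literal space and a literal hash sign, both escaped
    (\w+)          # tag name
    $
}x;

my @parts = 'v1.2 #beta' =~ $version_re;
print "@parts\n";   # prints "1 2 beta"
```

Note that none of the comments contains the closing delimiter, in
keeping with the caution above.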

=head2 Regular Expressions

The patterns used in Perl pattern matching derive from those supplied in
the Version 8 regex routines. (In fact, the routines are derived
(distantly) from Henry Spencer's freely redistributable reimplementation
of the V8 routines.) See L<Version 8 Regular Expressions> for
details.

In particular the following metacharacters have their standard I<egrep>-ish
meanings:

    \    Quote the next metacharacter
    ^    Match the beginning of the line
    .    Match any character (except newline)
    $    Match the end of the line (or before newline at the end)
    |    Alternation
    ()   Grouping
    []   Character class

By default, the "^" character is guaranteed to match at only the
beginning of the string, the "$" character at only the end (or before the
newline at the end) and Perl does certain optimizations with the
assumption that the string contains only one line. Embedded newlines
will not be matched by "^" or "$". You may, however, wish to treat a
string as a multi-line buffer, such that the "^" will match after any
newline within the string, and "$" will match before any newline. At the
cost of a little more overhead, you can do this by using the /m modifier
on the pattern match operator. (Older programs did this by setting C<$*>,
but this practice is now deprecated.)

To facilitate multi-line substitutions, the "." character never matches a
newline unless you use the C</s> modifier, which in effect tells Perl to pretend
the string is a single line--even if it isn't. The C</s> modifier also
overrides the setting of C<$*>, in case you have some (badly behaved) older
code that sets it in another module.
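
The effect of C</m> and C</s> can be checked with a small experiment
along the following lines. This is only a sketch; the sample text and
variable names are made up for the illustration.

```perl
use strict;
use warnings;

my $text = "alpha\nbeta\ngamma\n";

# Without /m, "^" anchors only at the very start of the string.
my @plain = $text =~ /^(\w+)/g;       # ("alpha")

# With /m, "^" also matches just after each embedded newline.
my @multi = $text =~ /^(\w+)/mg;      # ("alpha", "beta", "gamma")

# "." stops at a newline unless /s is given.
my ($dot)   = $text =~ /(a.*a)/;      # stays on the first line
my ($dot_s) = $text =~ /(a.*a)/s;     # runs across the newlines

print scalar(@plain), " ", scalar(@multi), "\n";
print length($dot), " ", length($dot_s), "\n";
```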

The following standard quantifiers are recognized:

    *      Match 0 or more times
    +      Match 1 or more times
    ?      Match 1 or 0 times
    {n}    Match exactly n times
    {n,}   Match at least n times
    {n,m}  Match at least n but not more than m times

(If a curly bracket occurs in any other context, it is treated
as a regular character.) The "*" quantifier is equivalent to C<{0,}>, the "+"
quantifier to C<{1,}>, and the "?" quantifier to C<{0,1}>. n and m are limited
to integral values less than a preset limit defined when perl is built.
This is usually 32766 on the most common platforms. The actual limit can
be seen in the error message generated by code such as this:

    $_ **= $_ , / {$_} / for 2 .. 42;

By default, a quantified subpattern is "greedy", that is, it will match as
many times as possible (given a particular starting location) while still
allowing the rest of the pattern to match. If you want it to match the
minimum number of times possible, follow the quantifier with a "?". Note
that the meanings don't change, just the "greediness":

    *?     Match 0 or more times
    +?     Match 1 or more times
    ??     Match 0 or 1 time
    {n}?   Match exactly n times
    {n,}?  Match at least n times
    {n,m}? Match at least n but not more than m times
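
A minimal sketch of the difference, using a made-up date string: both
patterns below accept two to four digits, and only the preference
between "as many as possible" and "as few as possible" changes.

```perl
use strict;
use warnings;

my $date = "2001-09-09";

# Greedy: take as many digits as allowed (up to four).
my ($greedy)  = $date =~ /(\d{2,4})/;

# Minimal: stop as soon as the lower bound (two) is satisfied.
my ($minimal) = $date =~ /(\d{2,4}?)/;

print "greedy=$greedy minimal=$minimal\n";   # greedy=2001 minimal=20
```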

Because patterns are processed as double quoted strings, the following
also work:

    \t          tab                   (HT, TAB)
    \n          newline               (LF, NL)
    \r          return                (CR)
    \f          form feed             (FF)
    \a          alarm (bell)          (BEL)
    \e          escape (think troff)  (ESC)
    \033        octal char (think of a PDP-11)
    \x1B        hex char
    \x{263a}    wide hex char         (Unicode SMILEY)
    \c[         control char
    \l          lowercase next char (think vi)
    \u          uppercase next char (think vi)
    \L          lowercase till \E (think vi)
    \U          uppercase till \E (think vi)
    \E          end case modification (think vi)
    \Q          quote (disable) pattern metacharacters till \E

If C<use locale> is in effect, the case map used by C<\l>, C<\L>, C<\u>
and C<\U> is taken from the current locale. See L<perllocale>.

You cannot include a literal C<$> or C<@> within a C<\Q> sequence.
An unescaped C<$> or C<@> interpolates the corresponding variable,
while escaping will cause the literal string C<\$> to be matched.
You'll need to write something like C<m/\Quser\E\@\Qhost/>.

In addition, Perl defines the following:

    \w  Match a "word" character (alphanumeric plus "_")
    \W  Match a non-word character
    \s  Match a whitespace character
    \S  Match a non-whitespace character
    \d  Match a digit character
    \D  Match a non-digit character
    \pP Match P, named property. Use \p{Prop} for longer names.
    \PP Match non-P
    \X  Match eXtended Unicode "combining character sequence",
        equivalent to C<(?:\PM\pM*)>
    \C  Match a single C char (octet) even under utf8.

A C<\w> matches a single alphanumeric character, not a whole word.
To match a word you'd need to say C<\w+>. If C<use locale> is in
effect, the list of alphabetic characters generated by C<\w> is
taken from the current locale. See L<perllocale>. You may use
C<\w>, C<\W>, C<\s>, C<\S>, C<\d>, and C<\D> within character classes
(though not as either end of a range). See L<utf8> for details
about C<\pP>, C<\PP>, and C<\X>.

Perl defines the following zero-width assertions:

    \b  Match a word boundary
    \B  Match a non-(word boundary)
    \A  Match only at beginning of string
    \Z  Match only at end of string, or before newline at the end
    \z  Match only at end of string
    \G  Match only where previous m//g left off (works only with /g)

A word boundary (C<\b>) is defined as a spot between two characters
that has a C<\w> on one side of it and a C<\W> on the other side
of it (in either order), counting the imaginary characters off the
beginning and end of the string as matching a C<\W>. (Within
character classes C<\b> represents backspace rather than a word
boundary, just as it normally does in any double-quoted string.)
The C<\A> and C<\Z> are just like "^" and "$", except that they
won't match multiple times when the C</m> modifier is used, while
"^" and "$" will match at every internal line boundary. To match
the actual end of the string and not ignore an optional trailing
newline, use C<\z>.
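
The distinctions among these anchors are easy to get wrong, so here is
a small sketch that probes them on an invented two-line string. The
variable names are illustrative only.

```perl
use strict;
use warnings;

my $line = "first\nlast\n";

# \Z (like "$" without /m) tolerates one trailing newline.
my $with_Z = $line =~ /last\Z/ ? 1 : 0;

# \z insists on the true end of the string, so the newline blocks it.
my $with_z = $line =~ /last\z/ ? 1 : 0;

# Under /m, "^" matches at each line start, but \A still matches once.
my $caret_count = () = $line =~ /^/mg;
my $A_count     = () = $line =~ /\A/mg;

print "Z=$with_Z z=$with_z ^=$caret_count A=$A_count\n";
```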

The C<\G> assertion can be used to chain global matches (using
C<m//g>), as described in L<perlop/"Regexp Quote-Like Operators">.
It is also useful when writing C<lex>-like scanners, when you have
several patterns that you want to match against successive substrings
of your string; see the previous reference. The actual location
where C<\G> will match can also be influenced by using C<pos()> as
an lvalue. See L<perlfunc/pos>.
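
A C<lex>-like scanner of the kind mentioned above might be sketched as
follows. The token classes, the input string and the names C<$src> and
C<@tokens> are assumptions made for the example; it relies on the C</c>
modifier (documented in L<perlop>) to keep the match position when one
alternative fails.

```perl
use strict;
use warnings;

my $src = "x1 = 42 + y";
my @tokens;

# Each branch resumes where the previous match ended (\G); /c keeps
# pos() intact when a branch fails, so the next branch can try.
for ($src) {
    while (1) {
        if    (/\G\s+/gc)            { next }
        elsif (/\G(\d+)/gc)          { push @tokens, "NUM:$1" }
        elsif (/\G([A-Za-z_]\w*)/gc) { push @tokens, "ID:$1" }
        elsif (/\G([=+])/gc)         { push @tokens, "OP:$1" }
        else                         { last }
    }
}

print join(' ', @tokens), "\n";
```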

When the bracketing construct C<( ... )> is used to create a capture
buffer, \E<lt>digitE<gt> matches the digit'th substring. Outside
of the pattern, always use "$" instead of "\" in front of the digit.
(While the \E<lt>digitE<gt> notation can on rare occasion work
outside the current pattern, this should not be relied upon. See
the WARNING below.) The scope of $E<lt>digitE<gt> (and C<$`>,
C<$&>, and C<$'>) extends to the end of the enclosing BLOCK or eval
string, or to the next successful pattern match, whichever comes
first. If you want to use parentheses to delimit a subpattern
(e.g., a set of alternatives) without saving it as a subpattern,
follow the ( with a ?:.

You may have as many parentheses as you wish. If you have more
than 9 captured substrings, the variables $10, $11, ... refer to
the corresponding substring. Within the pattern, \10, \11, etc.
refer back to substrings if there have been at least that many left
parentheses before the backreference. Otherwise (for backward
compatibility) \10 is the same as \010, a backspace, and \11 the
same as \011, a tab. And so on. (\1 through \9 are always
backreferences.)
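
To make the "\digit inside, $digit outside" rule concrete, here is a
short sketch. The sample strings and the names C<$doubled> and C<$year>
are invented for the illustration.

```perl
use strict;
use warnings;

my $text = "this is is a test";
my $doubled;

# \1 inside the pattern refers back to what the first group captured.
if ($text =~ /\b(\w+)\s+\1\b/) {
    # Outside the pattern, the same substring is available as $1.
    $doubled = $1;
}

# (?:...) groups without capturing, so the year is still group 1 here.
my ($year) = "due Mar 2024" =~ /(?:Jan|Feb|Mar)\s+(\d{4})/;

print "doubled word: $doubled, year: $year\n";
```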

C<$+> returns whatever the last bracket match matched. C<$&> returns the
entire matched string. (C<$0> used to return the same thing, but not any
more.) C<$`> returns everything before the matched string. C<$'> returns
everything after the matched string. Examples:

    s/^([^ ]*) *([^ ]*)/$2 $1/;     # swap first two words

    if (/Time: (..):(..):(..)/) {
        $hours = $1;
        $minutes = $2;
        $seconds = $3;
    }

Once Perl sees that you need one of C<$&>, C<$`> or C<$'> anywhere in
the program, it has to provide them on each and every pattern match.
This can slow your program down. The same mechanism that handles
these provides for the use of $1, $2, etc., so you pay the same price
for each pattern that contains capturing parentheses. But if you never
use $&, etc., in your script, then patterns I<without> capturing
parentheses won't be penalized. So avoid $&, $', and $` if you can,
but if you can't (and some algorithms really appreciate them), once
you've used them once, use them at will, because you've already paid
the price. As of 5.005, $& is not so costly as the other two.

Backslashed metacharacters in Perl are alphanumeric, such as C<\b>,
C<\w>, C<\n>. Unlike some other regular expression languages, there
are no backslashed symbols that aren't alphanumeric. So anything
that looks like \\, \(, \), \E<lt>, \E<gt>, \{, or \} is always
interpreted as a literal character, not a metacharacter. This was
once used in a common idiom to disable or quote the special meanings
of regular expression metacharacters in a string that you want to
use for a pattern. Simply quote all non-alphanumeric characters:

    $pattern =~ s/(\W)/\\$1/g;

In modern days, it is more common to see either the quotemeta()
function or the C<\Q> metaquoting escape sequence used to disable
all metacharacters' special meanings like this:

    /$unquoted\Q$quoted\E$unquoted/
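
The following sketch contrasts quoted and unquoted interpolation of the
same string. The string C<a.b*c> and all variable names are assumptions
chosen for the example.

```perl
use strict;
use warnings;

my $needle = 'a.b*c';            # contains regex metacharacters
my $quoted = quotemeta $needle;  # backslashes the "." and the "*"

# Both forms treat $needle as literal text rather than as a pattern.
my $hit_qm = 'xxa.b*cxx' =~ /$quoted/     ? 1 : 0;
my $hit_Q  = 'xxa.b*cxx' =~ /\Q$needle\E/ ? 1 : 0;

# Unquoted, "." and "*" keep their special meaning, so this matches too.
my $loose  = 'aXbbbc' =~ /$needle/     ? 1 : 0;
my $strict = 'aXbbbc' =~ /\Q$needle\E/ ? 1 : 0;

print "$hit_qm $hit_Q $loose $strict\n";   # prints "1 1 1 0"
```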

=head2 Extended Patterns

For those situations where simple regular expression patterns are
not enough, Perl defines a consistent extension syntax for venturing
beyond simple patterns such as are found in standard tools like
B<awk> and B<lex>. That syntax is a pair of parentheses with a
question mark as the first thing within the parentheses (this was
a syntax error in older versions of Perl). The character after the
question mark gives the function of the extension.

Many extensions are already supported, some for almost five years
now. Other, more exotic forms are very new, and should be considered
highly experimental, and are so marked.

A question mark was chosen for this and for the new minimal-matching
construct because 1) question mark is pretty rare in older regular
expressions, and 2) whenever you see one, you should stop and "question"
exactly what is going on. That's psychology...

=over 10

=item C<(?#text)>

A comment. The text is ignored. If the C</x> modifier is used to enable
whitespace formatting, a simple C<#> will suffice. Note that Perl closes
the comment as soon as it sees a C<)>, so there is no way to put a literal
C<)> in the comment.

=item C<(?imsx-imsx)>

One or more embedded pattern-match modifiers. This is particularly
useful for dynamic patterns, such as those read in from a configuration
file, read in as an argument, or specified in a table somewhere.
Consider the case where some patterns want to be case-sensitive
and some do not. The case-insensitive ones merely need to include
C<(?i)> at the front of the pattern. For example:

    $pattern = "foobar";
    if ( /$pattern/i ) { }

    # more flexible:

    $pattern = "(?i)foobar";
    if ( /$pattern/ ) { }

Letters after a C<-> turn those modifiers off. These modifiers are
localized inside an enclosing group (if any). For example,

    ( (?i) blah ) \s+ \1

will match a repeated (I<including the case>!) word C<blah> in any
case, assuming C<x> modifier, and no C<i> modifier outside of this
group.

=item C<(?:pattern)>

=item C<(?imsx-imsx:pattern)>

This is for clustering, not capturing; it groups subexpressions like
"()", but doesn't make backreferences as "()" does. So

    @fields = split(/\b(?:a|b|c)\b/)

is like

    @fields = split(/\b(a|b|c)\b/)

but doesn't spit out extra fields. It's also cheaper not to capture
characters if you don't need to.
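
The "extra fields" can be seen directly in a sketch like this one, where
the record string and array names are made up for the purpose:

```perl
use strict;
use warnings;

my $record = "1 a 2 b 3";

# Clustering only: the separators are discarded.
my @plain = split /\s*(?:a|b)\s*/, $record;

# Capturing: split also returns each captured separator.
my @extra = split /\s*(a|b)\s*/, $record;

print scalar(@plain), " fields: @plain\n";   # 3 fields: 1 2 3
print scalar(@extra), " fields: @extra\n";   # 5 fields: 1 a 2 b 3
```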

Any letters between C<?> and C<:> act as flag modifiers as with
C<(?imsx-imsx)>. For example,

    /(?s-i:more.*than).*million/i

is equivalent to the more verbose

    /(?:(?s-i)more.*than).*million/i

=item C<(?=pattern)>

A zero-width positive look-ahead assertion. For example, C</\w+(?=\t)/>
matches a word followed by a tab, without including the tab in C<$&>.

=item C<(?!pattern)>

A zero-width negative look-ahead assertion. For example C</foo(?!bar)/>
matches any occurrence of "foo" that isn't followed by "bar". Note
however that look-ahead and look-behind are NOT the same thing. You cannot
use this for look-behind.

If you are looking for a "bar" that isn't preceded by a "foo", C</(?!foo)bar/>
will not do what you want. That's because the C<(?!foo)> is just saying that
the next thing cannot be "foo"--and it's not, it's a "bar", so "foobar" will
match. You would have to do something like C</(?!foo)...bar/> for that. We
say "like" because there's the case of your "bar" not having three characters
before it. You could cover that this way: C</(?:(?!foo)...|^.{0,2})bar/>.
Sometimes it's still easier just to say:

    if (/bar/ && $` !~ /foo$/)

For look-behind see below.

=item C<(?E<lt>=pattern)>

A zero-width positive look-behind assertion. For example, C</(?E<lt>=\t)\w+/>
matches a word that follows a tab, without including the tab in C<$&>.
Works only for fixed-width look-behind.

=item C<(?<!pattern)>

A zero-width negative look-behind assertion. For example C</(?<!bar)foo/>
matches any occurrence of "foo" that does not follow "bar". Works
only for fixed-width look-behind.
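
The look-ahead and look-behind forms can be compared side by side in a
sketch such as the following. The test strings and variable names are
invented for the illustration.

```perl
use strict;
use warnings;

my $text = "foobar bar foobaz";

# Negative look-ahead: "foo" not followed by "bar" (only in "foobaz").
my $foo_not_before_bar = () = $text =~ /foo(?!bar)/g;

# Negative look-behind: "bar" not preceded by "foo" (the standalone one).
my $bar_not_after_foo = () = $text =~ /(?<!foo)bar/g;

# Positive look-behind: require a "$" before the digits without matching it.
my ($price) = "cost: \$42" =~ /(?<=\$)(\d+)/;

print "$foo_not_before_bar $bar_not_after_foo $price\n";   # prints "1 1 42"
```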

=item C<(?{ code })>

B<WARNING>: This extended regular expression feature is considered
highly experimental, and may be changed or deleted without notice.

This zero-width assertion evaluates any embedded Perl code. It
always succeeds, and its C<code> is not interpolated. Currently,
the rules to determine where the C<code> ends are somewhat convoluted.

The C<code> is properly scoped in the following sense: If the assertion
is backtracked (compare L<"Backtracking">), all changes introduced after
C<local>ization are undone, so that

    $_ = 'a' x 8;
    m<
       (?{ $cnt = 0 })                    # Initialize $cnt.
       (
         a
         (?{
             local $cnt = $cnt + 1;       # Update $cnt, backtracking-safe.
         })
       )*
       aaaa
       (?{ $res = $cnt })                 # On success copy to non-localized
                                          # location.
     >x;

will set C<$res = 4>. Note that after the match, $cnt returns to the globally
introduced value, since the scopes which restrict C<local> operators
are unwound.
19799a22
GS
426This assertion may be used as a C<(?(condition)yes-pattern|no-pattern)>
427switch. If I<not> used in this way, the result of evaluation of
428C<code> is put into the special variable C<$^R>. This happens
429immediately, so C<$^R> can be used from other C<(?{ code })> assertions
430inside the same regular expression.
b9ac3b5b 431
19799a22
GS
432The assignment to C<$^R> above is properly localized, so the old
433value of C<$^R> is restored if the assertion is backtracked; compare
434L<"Backtracking">.
b9ac3b5b 435
19799a22
GS
436For reasons of security, this construct is forbidden if the regular
437expression involves run-time interpolation of variables, unless the
438perilous C<use re 'eval'> pragma has been used (see L<re>), or the
439variables contain results of C<qr//> operator (see
440L<perlop/"qr/STRING/imosx">).
871b0233 441
19799a22
GS
442This restriction is due to the wide-spread and remarkably convenient
443custom of using run-time determined strings as patterns. For example:
871b0233
IZ
444
445 $re = <>;
446 chomp $re;
447 $string =~ /$re/;
448
19799a22
GS
449Prior to the execution of code in a pattern, this was completely
450safe from a security point of view, although it could of course
451raise an exception from an illegal pattern. If you turn on the
452C<use re 'eval'>, though, it is no longer secure, so you should
453only do so if you are also using taint checking. Better yet, use
454the carefully constrained evaluation within a Safe module. See
455L<perlsec> for details about both these mechanisms.

=item C<(?p{ code })>

B<WARNING>: This extended regular expression feature is considered
highly experimental, and may be changed or deleted without notice.

This is a "postponed" regular subexpression. The C<code> is evaluated
at run time, at the moment this subexpression may match. The result
of evaluation is considered as a regular expression and matched as
if it were inserted instead of this construct.

C<code> is not interpolated. As before, the rules to determine
where the C<code> ends are currently somewhat convoluted.

The following pattern matches a parenthesized group:

    $re = qr{
               \(
               (?:
                  (?> [^()]+ )    # Non-parens without backtracking
                |
                  (?p{ $re })     # Group with matching parens
               )*
               \)
            }x;

=item C<(?E<gt>pattern)>

B<WARNING>: This extended regular expression feature is considered
highly experimental, and may be changed or deleted without notice.

An "independent" subexpression, one which matches the substring
that a I<standalone> C<pattern> would match if anchored at the given
position -- but it matches no more than this substring. This
construct is useful for optimizations of what would otherwise be
"eternal" matches, because it will not backtrack (see L<"Backtracking">).

For example: C<^(?E<gt>a*)ab> will never match, since C<(?E<gt>a*)>
(anchored at the beginning of string, as above) will match I<all>
characters C<a> at the beginning of string, leaving no C<a> for
C<ab> to match. In contrast, C<a*ab> will match the same as C<a+b>,
since the match of the subgroup C<a*> is influenced by the following
group C<ab> (see L<"Backtracking">). In particular, C<a*> inside
C<a*ab> will match fewer characters than a standalone C<a*>, since
this makes the tail match.
502An effect similar to C<(?E<gt>pattern)> may be achieved by writing
503C<(?=(pattern))\1>. This matches the same substring as a standalone
504C<a+>, and the following C<\1> eats the matched string; it therefore
505makes a zero-length assertion into an analogue of C<(?E<gt>...)>.
506(The difference between these two constructs is that the second one
507uses a capturing group, thus shifting ordinals of backreferences
508in the rest of a regular expression.)
509
510Consider this pattern:
c277df42 511
871b0233
IZ
512 m{ \(
513 (
514 [^()]+
515 |
516 \( [^()]* \)
517 )+
518 \)
519 }x
5a964f20 520
19799a22
GS
521That will efficiently match a nonempty group with matching parentheses
522two levels deep or less. However, if there is no such group, it
523will take virtually forever on a long string. That's because there
524are so many different ways to split a long string into several
525substrings. This is what C<(.+)+> is doing, and C<(.+)+> is similar
526to a subpattern of the above pattern. Consider how the pattern
527above detects no-match on C<((()aaaaaaaaaaaaaaaaaa> in several
528seconds, but that each extra letter doubles this time. This
529exponential performance will make it appear that your program has
530hung. However, a tiny modification of this pattern
5a964f20 531
871b0233
IZ
532 m{ \(
533 (
534 (?> [^()]+ )
535 |
536 \( [^()]* \)
537 )+
538 \)
539 }x
c277df42 540
5a964f20
TC
541which uses C<(?E<gt>...)> matches exactly when the one above does (verifying
542this yourself would be a productive exercise), but finishes in a fourth
543the time when used on a similar string with 1000000 C<a>s. Be aware,
544however, that this pattern currently triggers a warning message under
545B<-w> saying it C<"matches the null string many times">):
c277df42 546
8d300b32 547On simple groups, such as the pattern C<(?E<gt> [^()]+ )>, a comparable
19799a22 548effect may be achieved by negative look-ahead, as in C<[^()]+ (?! [^()] )>.
c277df42
IZ
549This was only 4 times slower on a string with 1000000 C<a>s.

=item C<(?(condition)yes-pattern|no-pattern)>

=item C<(?(condition)yes-pattern)>

B<WARNING>: This extended regular expression feature is considered
highly experimental, and may be changed or deleted without notice.

Conditional expression. C<(condition)> should be either an integer in
parentheses (which is valid if the corresponding pair of parentheses
matched), or a look-ahead/look-behind/evaluate zero-width assertion.

For example:

    m{ ( \( )?
       [^()]+
       (?(1) \) )
     }x

matches a chunk of non-parentheses, possibly enclosed in parentheses
themselves.
a0d0e21e
LW
572=back
573
c07a80fd 574=head2 Backtracking
575
c277df42 576A fundamental feature of regular expression matching involves the
5a964f20 577notion called I<backtracking>, which is currently used (when needed)
c277df42
IZ
578by all regular expression quantifiers, namely C<*>, C<*?>, C<+>,
579C<+?>, C<{n,m}>, and C<{n,m}?>.
c07a80fd 580
581For a regular expression to match, the I<entire> regular expression must
582match, not just part of it. So if the beginning of a pattern containing a
583quantifier succeeds in a way that causes later parts in the pattern to
584fail, the matching engine backs up and recalculates the beginning
585part--that's why it's called backtracking.
586
587Here is an example of backtracking: Let's say you want to find the
588word following "foo" in the string "Food is on the foo table.":
589
590 $_ = "Food is on the foo table.";
591 if ( /\b(foo)\s+(\w+)/i ) {
592 print "$2 follows $1.\n";
593 }
594
595When the match runs, the first part of the regular expression (C<\b(foo)>)
596finds a possible match right at the beginning of the string, and loads up
597$1 with "Foo". However, as soon as the matching engine sees that there's
598no whitespace following the "Foo" that it had saved in $1, it realizes its
68dc0745 599mistake and starts over again one character after where it had the
c07a80fd 600tentative match. This time it goes all the way until the next occurrence
601of "foo". The complete regular expression matches this time, and you get
602the expected output of "table follows foo."
603
604Sometimes minimal matching can help a lot. Imagine you'd like to match
605everything between "foo" and "bar". Initially, you write something
606like this:
607
608 $_ = "The food is under the bar in the barn.";
609 if ( /foo(.*)bar/ ) {
610 print "got <$1>\n";
611 }
612
613Which perhaps unexpectedly yields:
614
615 got <d is under the bar in the >
616
617That's because C<.*> was greedy, so you get everything between the
618I<first> "foo" and the I<last> "bar". In this case, it's more effective
619to use minimal matching to make sure you get the text between a "foo"
620and the first "bar" thereafter.
621
622 if ( /foo(.*?)bar/ ) { print "got <$1>\n" }
623 got <d is under the >
624
625Here's another example: let's say you'd like to match a number at the end
626of a string, and you also want to keep the preceding part the match.
627So you write this:
628
629 $_ = "I have 2 numbers: 53147";
630 if ( /(.*)(\d*)/ ) { # Wrong!
631 print "Beginning is <$1>, number is <$2>.\n";
632 }
633
634That won't work at all, because C<.*> was greedy and gobbled up the
635whole string. As C<\d*> can match on an empty string the complete
636regular expression matched successfully.
637
8e1088bc 638 Beginning is <I have 2 numbers: 53147>, number is <>.
c07a80fd 639
640Here are some variants, most of which don't work:
641
642 $_ = "I have 2 numbers: 53147";
643 @pats = qw{
644 (.*)(\d*)
645 (.*)(\d+)
646 (.*?)(\d*)
647 (.*?)(\d+)
648 (.*)(\d+)$
649 (.*?)(\d+)$
650 (.*)\b(\d+)$
651 (.*\D)(\d+)$
652 };
653
654 for $pat (@pats) {
655 printf "%-12s ", $pat;
656 if ( /$pat/ ) {
657 print "<$1> <$2>\n";
658 } else {
659 print "FAIL\n";
660 }
661 }
662
663That will print out:
664
665 (.*)(\d*) <I have 2 numbers: 53147> <>
666 (.*)(\d+) <I have 2 numbers: 5314> <7>
667 (.*?)(\d*) <> <>
668 (.*?)(\d+) <I have > <2>
669 (.*)(\d+)$ <I have 2 numbers: 5314> <7>
670 (.*?)(\d+)$ <I have 2 numbers: > <53147>
671 (.*)\b(\d+)$ <I have 2 numbers: > <53147>
672 (.*\D)(\d+)$ <I have 2 numbers: > <53147>

As you see, this can be a bit tricky. It's important to realize that a
regular expression is merely a set of assertions that gives a definition
of success. There may be 0, 1, or several different ways that the
definition might succeed against a particular string. And if there are
multiple ways it might succeed, you need to understand backtracking to
know which variety of success you will achieve.

When using look-ahead assertions and negations, this can all get even
trickier. Imagine you'd like to find a sequence of non-digits not
followed by "123". You might try to write that as

    $_ = "ABC123";
    if ( /^\D*(?!123)/ ) {              # Wrong!
        print "Yup, no 123 in $_\n";
    }

But that isn't going to match; at least, not the way you're hoping. It
claims that there is no 123 in the string. Here's a clearer picture of
why that pattern matches, contrary to popular expectations:

    $x = 'ABC123' ;
    $y = 'ABC445' ;

    print "1: got $1\n" if $x =~ /^(ABC)(?!123)/ ;
    print "2: got $1\n" if $y =~ /^(ABC)(?!123)/ ;

    print "3: got $1\n" if $x =~ /^(\D*)(?!123)/ ;
    print "4: got $1\n" if $y =~ /^(\D*)(?!123)/ ;

This prints

    2: got ABC
    3: got AB
    4: got ABC
708
5f05dabc 709You might have expected test 3 to fail because it seems to a more
c07a80fd 710general purpose version of test 1. The important difference between
711them is that test 3 contains a quantifier (C<\D*>) and so can use
712backtracking, whereas test 1 will not. What's happening is
713that you've asked "Is it true that at the start of $x, following 0 or more
5f05dabc 714non-digits, you have something that's not 123?" If the pattern matcher had
c07a80fd 715let C<\D*> expand to "ABC", this would have caused the whole pattern to
54310121 716fail.
c07a80fd 717The search engine will initially match C<\D*> with "ABC". Then it will
5a964f20 718try to match C<(?!123> with "123", which of course fails. But because
c07a80fd 719a quantifier (C<\D*>) has been used in the regular expression, the
720search engine can backtrack and retry the match differently
54310121 721in the hope of matching the complete regular expression.
c07a80fd 722
5a964f20
TC
723The pattern really, I<really> wants to succeed, so it uses the
724standard pattern back-off-and-retry and lets C<\D*> expand to just "AB" this
c07a80fd 725time. Now there's indeed something following "AB" that is not
726"123". It's in fact "C123", which suffices.
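The back-off described above can be observed directly. The following is a minimal sketch, not part of the original examples; the variable names C<$captured> and C<$rest> are chosen here for illustration. It records what C<\D*> finally settled on and what text follows the end of the match.

```perl
use strict;
use warnings;

my $x = 'ABC123';
my ($captured, $rest);

if ($x =~ /^(\D*)(?!123)/) {
    $captured = $1;                  # what \D* kept after backing off
    $rest     = substr($x, $+[0]);   # text just past the end of the match
    print "captured '$captured', followed by '$rest'\n";
}
```

Because the look-ahead consumes nothing, the match ends right after "AB", and the remainder "C123" is what satisfied C<(?!123)>.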
727
728We can deal with this by using both an assertion and a negation. We'll
729say that the first part in $1 must be followed by a digit, and in fact, it
730must also be followed by something that's not "123". Remember that the
19799a22 731look-aheads are zero-width expressions--they only look, but don't consume
c07a80fd 732any of the string in their match. So rewriting this way produces what
733you'd expect; that is, case 5 will fail, but case 6 succeeds:
734
735 print "5: got $1\n" if $x =~ /^(\D*)(?=\d)(?!123)/ ;
736 print "6: got $1\n" if $y =~ /^(\D*)(?=\d)(?!123)/ ;
737
738 6: got ABC
739
5a964f20 740In other words, the two zero-width assertions next to each other work as though
19799a22 741they're ANDed together, just as you'd use any built-in assertions: C</^$/>
c07a80fd 742matches only if you're at the beginning of the line AND the end of the
743line simultaneously. The deeper underlying truth is that juxtaposition in
744regular expressions always means AND, except when you write an explicit OR
745using the vertical bar. C</ab/> means match "a" AND (then) match "b",
746although the attempted matches are made at different positions because "a"
747is not a zero-width assertion, but a one-width assertion.
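As an illustration of assertions being ANDed, here is a hedged sketch that stacks three look-aheads at the same position. The rule itself (a digit, a lowercase letter, and no whitespace) and the names C<$rule> and C<%verdict> are assumptions made up for this example.

```perl
use strict;
use warnings;

# Hypothetical rule: the string must contain a digit AND a lowercase
# letter AND must NOT contain whitespace. All three assertions are
# tested at the same position (the start of the string).
my $rule = qr/^(?=.*\d)(?=.*[a-z])(?!.*\s)/;

my %verdict;
for my $candidate ('abc123', 'abcdef', '123456', 'abc 123') {
    $verdict{$candidate} = $candidate =~ $rule ? 'ok' : 'rejected';
}

print "$_: $verdict{$_}\n" for sort keys %verdict;
```

Only 'abc123' passes, since each of the other strings fails at least one of the three assertions.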
748
19799a22
GS
749B<WARNING>: particularly complicated regular expressions can take
750exponential time to solve due to the immense number of possible
751ways they can use backtracking to try to match. For example, this will
752take a very long time to run:
c07a80fd 753
754 /((a{0,5}){0,5}){0,5}/
755
756And if you used C<*>'s instead of limiting it to 0 through 5 matches, then
757it would take literally forever--or until you ran out of stack space.
758
c277df42 759A powerful tool for optimizing such beasts is "independent" groups,
5a964f20 760which do not backtrack (see L<C<(?E<gt>pattern)>>). Note also that
19799a22 761zero-length look-ahead/look-behind assertions will not backtrack to make
c277df42
IZ
762the tail match, since they are in "logical" context: only whether
763they match is considered relevant. For an example
19799a22 764where side-effects of a look-ahead I<might> have influenced the
5a964f20 765following match, see L<C<(?E<gt>pattern)>>.
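A small sketch of the difference an independent group makes; the sample string and variable names are assumptions for illustration only. An ordinary C<a+> gives back a character so the rest of the pattern can match, while C<(?E<gt>a+)> refuses to.

```perl
use strict;
use warnings;

my $text = 'aaab';

# Ordinary group: a+ first grabs "aaa", then gives one "a" back
# so that the trailing "ab" can match.
my $plain = $text =~ /a+ab/ ? 1 : 0;

# Independent group: (?>a+) keeps everything it grabbed, so at every
# starting position nothing is left for the literal "a" that follows.
my $independent = $text =~ /(?>a+)ab/ ? 1 : 0;

print "plain: $plain, independent: $independent\n";
```

The same property that makes the second pattern fail is what removes the backtracking possibilities in pathological patterns.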
c277df42 766
a0d0e21e
LW
767=head2 Version 8 Regular Expressions
768
5a964f20 769In case you're not familiar with the "regular" Version 8 regex
a0d0e21e
LW
770routines, here are the pattern-matching rules not described above.
771
54310121 772Any single character matches itself, unless it is a I<metacharacter>
a0d0e21e 773with a special meaning described here or above. You can cause
5a964f20 774characters that normally function as metacharacters to be interpreted
5f05dabc 775literally by prefixing them with a "\" (e.g., "\." matches a ".", not any
a0d0e21e
LW
776character; "\\" matches a "\"). A series of characters matches that
777series of characters in the target string, so the pattern C<blurfl>
778would match "blurfl" in the target string.
779
780You can specify a character class, by enclosing a list of characters
5a964f20 781in C<[]>, which will match any one character from the list. If the
a0d0e21e
LW
782first character after the "[" is "^", the class matches any character not
783in the list. Within a list, the "-" character is used to specify a
5a964f20 784range, so that C<a-z> represents all characters between "a" and "z",
84850974
DD
785inclusive. If you want "-" itself to be a member of a class, put it
786at the start or end of the list, or escape it with a backslash. (The
787following all specify the same class of three characters: C<[-az]>,
788C<[az-]>, and C<[a\-z]>. All are different from C<[a-z]>, which
789specifies a class containing twenty-six characters.)
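The claim about the three equivalent classes can be checked with a short sketch that lists which ASCII characters each class accepts. The names C<@ascii> and C<%members> are introduced here for illustration.

```perl
use strict;
use warnings;

my @ascii = map { chr } 0 .. 127;
my %members;

# Single quotes keep the backslash in '[a\-z]' intact.
for my $class ('[-az]', '[az-]', '[a\-z]', '[a-z]') {
    my $re = qr/$class/;
    $members{$class} = join '', grep { $_ =~ $re } @ascii;
}

printf "%-8s matches %d character(s)\n", $_, length $members{$_}
    for sort keys %members;
```

The first three classes each accept exactly "-", "a" and "z"; only C<[a-z]> accepts the full lowercase range.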
a0d0e21e 790
8ada0baa
JH
791Note also that the whole range idea is rather unportable between
792character sets--and even within a character set ranges may give results
793you probably didn't expect. A sound principle is to use only ranges
794that begin and end with letters of the same case (C<[a-e]>,
795C<[A-E]>) or with digits (C<[0-9]>). Anything else is unsafe. If in doubt,
796spell out the character sets in full.
797
54310121 798Characters may be specified using a metacharacter syntax much like that
a0d0e21e
LW
799used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,
800"\f" a form feed, etc. More generally, \I<nnn>, where I<nnn> is a string
801of octal digits, matches the character whose ASCII value is I<nnn>.
0f36ee90 802Similarly, \xI<nn>, where I<nn> are hexadecimal digits, matches the
a0d0e21e 803character whose ASCII value is I<nn>. The expression \cI<x> matches the
54310121 804ASCII character control-I<x>. Finally, the "." metacharacter matches any
a0d0e21e
LW
805character except "\n" (unless you use C</s>).
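A brief sketch exercising the escapes just described, assuming an ASCII platform; the variable names are illustrative only. None of the patterns contains a capture group, so C<\101> is read as an octal escape rather than a backreference.

```perl
use strict;
use warnings;

my $octal   = 'A'    =~ /^\101$/ ? 1 : 0;   # octal 101 is "A" in ASCII
my $hex     = 'A'    =~ /^\x41$/ ? 1 : 0;   # hex 41 is also "A"
my $control = chr(1) =~ /^\cA$/  ? 1 : 0;   # control-A is character 1
my $dot     = "\n"   =~ /^.$/    ? 1 : 0;   # "." refuses a newline
my $dot_s   = "\n"   =~ /^.$/s   ? 1 : 0;   # unless /s is in effect

print "octal=$octal hex=$hex control=$control dot=$dot dot_s=$dot_s\n";
```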
806
807You can specify a series of alternatives for a pattern using "|" to
808separate them, so that C<fee|fie|foe> will match any of "fee", "fie",
5a964f20 809or "foe" in the target string (as would C<f(e|i|o)e>). The
a0d0e21e
LW
810first alternative includes everything from the last pattern delimiter
811("(", "[", or the beginning of the pattern) up to the first "|", and
812the last alternative contains everything from the last "|" to the next
813pattern delimiter. For this reason, it's common practice to include
814alternatives in parentheses, to minimize confusion about where they
a3cb178b
GS
815start and end.
816
5a964f20 817Alternatives are tried from left to right, so the first
a3cb178b
GS
818alternative found for which the entire expression matches is the one that
819is chosen. This means that alternatives are not necessarily greedy. For
628afcb5 820example: when matching C<foo|foot> against "barefoot", only the "foo"
a3cb178b
GS
821part will match, as that is the first alternative tried, and it successfully
822matches the target string. (This might not seem important, but it is
823important when you are capturing matched text using parentheses.)
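The effect of alternative order on captured text can be seen in a short sketch; the variable names are assumptions for illustration. Swapping the order of the alternatives changes what the parentheses capture.

```perl
use strict;
use warnings;

my $text = 'barefoot';

# "foo" is tried first and succeeds, so "foot" is never attempted.
my ($short_first) = $text =~ /(foo|foot)/;

# With the longer alternative first, the whole word ending is captured.
my ($long_first)  = $text =~ /(foot|foo)/;

print "foo|foot captured '$short_first'; foot|foo captured '$long_first'\n";
```

When the longer alternative is the one wanted, listing it first is one simple way to get it.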
824
5a964f20 825Also remember that "|" is interpreted as a literal within square brackets,
a3cb178b 826so if you write C<[fee|fie|foe]> you're really only matching C<[feio|]>.
a0d0e21e 827
54310121 828Within a pattern, you may designate subpatterns for later reference by
a0d0e21e 829enclosing them in parentheses, and you may refer back to the I<n>th
54310121 830subpattern later in the pattern using the metacharacter \I<n>.
831Subpatterns are numbered based on the left-to-right order of their
5a964f20 832opening parentheses. A backreference matches whatever
54310121 833actually matched the subpattern in the string being examined, not the
834rules for that subpattern. Therefore, C<(0|0x)\d*\s\1\d*> will
5a964f20 835match "0x1234 0x4321", but not "0x1234 01234", because subpattern 1
748a9306 836actually matched "0x", even though the rule C<0|0x> could
a0d0e21e 837potentially match the leading 0 in the second number.
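The backreference example can be run as is. This sketch simply wraps it in two tests; the variable names are chosen here for illustration.

```perl
use strict;
use warnings;

my $re = qr/(0|0x)\d*\s\1\d*/;

# Both numbers start with "0x", so \1 (which holds "0x") matches again.
my $same_prefix  = '0x1234 0x4321' =~ $re ? 1 : 0;

# Here \1 still holds "0x", and the second number starts with "01".
my $mixed_prefix = '0x1234 01234'  =~ $re ? 1 : 0;

print "same prefix: $same_prefix, mixed prefix: $mixed_prefix\n";
```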
cb1a09d0 838
19799a22 839=head2 Warning on \1 vs $1
cb1a09d0 840
5a964f20 841Some people get too used to writing things like:
cb1a09d0
AD
842
843 $pattern =~ s/(\W)/\\\1/g;
844
845This is grandfathered for the RHS of a substitute to avoid shocking the
846B<sed> addicts, but it's a dirty habit to get into. That's because in
5f05dabc 847PerlThink, the righthand side of a C<s///> is a double-quoted string. C<\1> in
cb1a09d0
AD
848the usual double-quoted string means a control-A. The customary Unix
849meaning of C<\1> is kludged in for C<s///>. However, if you get into the habit
850of doing that, you get yourself into trouble if you then add an C</e>
851modifier.
852
5a964f20 853 s/(\d+)/ \1 + 1 /eg; # causes warning under -w
cb1a09d0
AD
854
855Or if you try to do
856
857 s/(\d+)/\1000/;
858
859You can't disambiguate that by saying C<\{1}000>, whereas you can fix it with
860C<${1}000>. Basically, the operation of interpolation should not be confused
861with the operation of matching a backreference. Certainly they mean two
862different things on the I<left> side of the C<s///>.
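For comparison, here is a sketch of the same two substitutions written with C<$1>, which behaves as an ordinary interpolated variable. The sample strings and variable names are assumptions for illustration.

```perl
use strict;
use warnings;

# ${1}000 is unambiguous: the contents of $1 followed by "000".
my $padded = 'item 5';
$padded =~ s/(\d+)/${1}000/;

# Under /e the replacement is Perl code, where $1 is a plain variable.
my $bumped = 'a5 b9';
$bumped =~ s/(\d+)/$1 + 1/eg;

print "$padded\n$bumped\n";
```

The first substitution yields "item 5000" and the second "a6 b10", with no warnings under C<-w>.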
9fa51da4 863
c84d73f1
IZ
864=head2 Repeated patterns matching zero-length substring
865
19799a22 866B<WARNING>: Difficult material (and prose) ahead. This section needs a rewrite.
c84d73f1
IZ
867
868Regular expressions provide a terse and powerful programming language. As
869with most other power tools, power comes together with the ability
870to wreak havoc.
871
872A common abuse of this power stems from the ability to make infinite
628afcb5 873loops using regular expressions, with something as innocuous as:
c84d73f1
IZ
874
875 'foo' =~ m{ ( o? )* }x;
876
877The C<o?> can match at the beginning of C<'foo'>, and since the position
878in the string is not moved by the match, C<o?> would match again and again
879due to the C<*> quantifier. Another common way to create a similar cycle
880is with the looping modifier C<//g>:
881
882 @matches = ( 'foo' =~ m{ o? }xg );
883
884or
885
886 print "match: <$&>\n" while 'foo' =~ m{ o? }xg;
887
888or the loop implied by split().
889
890However, long experience has shown that many programming tasks may
891be significantly simplified by using repeated subexpressions which
892may match zero-length substrings, with a simple example being:
893
894 @chars = split //, $string; # // is not magic in split
895 ($whitewashed = $string) =~ s/()/ /g; # parens avoid magic s// /
896
897Thus Perl allows the C</()/> construct, which I<forcefully breaks
898the infinite loop>. The rules for this are different for lower-level
899loops given by the greedy modifiers C<*+{}>, and for higher-level
900ones like the C</g> modifier or split() operator.
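A minimal sketch showing that both zero-length constructs terminate and what they produce; the sample string and variable names are assumptions for illustration.

```perl
use strict;
use warnings;

my $string = 'abc';

# An empty pattern in split matches between every pair of characters.
my @chars = split //, $string;

# The empty group matches once at every position, including the end.
(my $spaced = $string) =~ s/()/ /g;

print join(',', @chars), "\n";
print "[$spaced]\n";
```

The substitution inserts a space at each of the four positions of "abc", giving " a b c ".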
901
19799a22
GS
902The lower-level loops are I<interrupted> (that is, the loop is
903broken) when Perl detects that a repeated expression matched a
904zero-length substring. Thus
c84d73f1
IZ
905
906 m{ (?: NON_ZERO_LENGTH | ZERO_LENGTH )* }x;
907
908is made equivalent to
909
910     m{   (?: NON_ZERO_LENGTH )*
911          (?: ZERO_LENGTH )?
913     }x;
914
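That a lower-level loop is interrupted rather than spinning forever can be checked with the innocuous-looking pattern from the start of this section. This is a small sketch; the variable names are illustrative.

```perl
use strict;
use warnings;

# o? matches the empty string in front of the "f"; Perl notices the
# zero-length iteration and breaks out of the * loop.
my $matched = 'foo' =~ m{ ( o? )* }x ? 1 : 0;
my $length  = $+[0] - $-[0];    # length of the overall match

print "matched: $matched, length: $length\n";
```

The match succeeds at the very start of the string with a length of zero.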
915The higher-level loops preserve an additional state between iterations:
916whether the last match was zero-length. To break the loop, the following
917match after a zero-length match is prohibited from having a length of zero.
918This prohibition interacts with backtracking (see L<"Backtracking">),
919and so the I<second best> match is chosen if the I<best> match is of
920zero length.
921
19799a22 922For example:
c84d73f1
IZ
923
924 $_ = 'bar';
925 s/\w??/<$&>/g;
926
927results in C<"<><b><><a><><r><>">. At each position of the string the best
928match given by non-greedy C<??> is the zero-length match, and the I<second
929best> match is what is matched by C<\w>. Thus zero-length matches
930alternate with one-character-long matches.
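The alternation between zero-length and non-zero-length matches can be reproduced with a short sketch. It reruns the C<s/\w??/E<lt>$&E<gt>/g> example and also collects the matches of C<m{ o? }xg> from earlier in this section; the variable names are assumptions for illustration.

```perl
use strict;
use warnings;

# Non-greedy \w?? prefers the empty match; after each empty match the
# next one must be non-empty, so the two kinds alternate.
(my $marked = 'bar') =~ s/\w??/<$&>/g;

# Greedy o? against 'foo': empty before "f", then each "o", then
# empty at the end of the string.
my @pieces = 'foo' =~ m{ o? }xg;

print "$marked\n";
print join('|', map { "<$_>" } @pieces), "\n";
```

The second line of output shows four matches, the first and last of them empty.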
931
932Similarly, for repeated C<m/()/g> the second-best match is the match at the
933position one notch further in the string.
934
19799a22 935The additional state of being I<matched with zero-length> is associated with
c84d73f1
IZ
936the matched string, and is reset by each assignment to pos().
937
938=head2 Creating custom RE engines
939
940Overloaded constants (see L<overload>) provide a simple way to extend
941the functionality of the RE engine.
942
943Suppose that we want to enable a new RE escape-sequence C<\Y|> which
944matches at a boundary between whitespace characters and non-whitespace
945characters. Note that C<(?=\S)(?<!\S)|(?!\S)(?<=\S)> matches exactly
946at these positions, so we want to have each C<\Y|> in the place of the
947more complicated version. We can create a module C<customre> to do
948this:
949
950     package customre;
951     use overload;
952
953     sub import {
954       shift;
955       die "No argument to customre::import allowed" if @_;
956       overload::constant 'qr' => \&convert;
957     }
958
959     sub invalid { die "/$_[0]/: invalid escape '\\$_[1]'"}
960
961     my %rules = ( '\\' => '\\',
962                   'Y|' => qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/ );
963     sub convert {
964       my $re = shift;
965       $re =~ s{
966                 \\ ( \\ | Y . )
967               }
968               { $rules{$1} or invalid($re,$1) }sgex;
969       return $re;
970     }
971
972Now C<use customre> enables the new escape in constant regular
973expressions, i.e., those without any runtime variable interpolations.
974As documented in L<overload>, this conversion will work only over
975literal parts of regular expressions. For C<\Y|$re\Y|> the variable
976part of this regular expression needs to be converted explicitly
977(but only if the special meaning of C<\Y|> should be enabled inside $re):
978
979 use customre;
980 $re = <>;
981 chomp $re;
982 $re = customre::convert $re;
983 /\Y|$re\Y|/;
984
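Independently of the C<customre> module, the underlying pattern can be tried on its own. The following sketch, with an assumed sample string and illustrative variable names, lists the positions at which the expression that C<\Y|> stands for matches.

```perl
use strict;
use warnings;

# The expression that \Y| is meant to abbreviate.
my $boundary = qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/;

my $text = 'ab cd';
my @positions;
push @positions, pos($text) while $text =~ /$boundary/g;

print "boundaries at: @positions\n";
```

For 'ab cd' the boundaries fall at offsets 0, 2, 3 and 5, that is, at both edges of each run of non-whitespace characters.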
19799a22
GS
985=head1 BUGS
986
987This manpage varies from difficult to understand to completely
988and utterly opaque.
989
990=head1 SEE ALSO
9fa51da4 991
9b599b2a
GS
992L<perlop/"Regexp Quote-Like Operators">.
993
1e66bd83
PP
994L<perlop/"Gory details of parsing quoted constructs">.
995
9b599b2a
GS
996L<perlfunc/pos>.
997
998L<perllocale>.
999
19799a22 1000I<Mastering Regular Expressions> by Jeffrey Friedl.