X-Git-Url: https://perl5.git.perl.org/perl5.git/blobdiff_plain/572224ce94992fbe6a12bf2ed576193649b15d48..d97906123bcd8c325c65db4f67e8c96e2cdafaec:/pod/perlre.pod
diff --git a/pod/perlre.pod b/pod/perlre.pod
index 1609819..e9a5e5f 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -7,60 +7,351 @@ perlre - Perl regular expressions
This page describes the syntax of regular expressions in Perl.
-If you haven't used regular expressions before, a quick-start
-introduction is available in L, C, C
@@ -915,7 +1292,8 @@ X
X
Backslashed metacharacters in Perl are alphanumeric, such as C<\b>,
C<\w>, C<\n>. Unlike some other regular expression languages, there
are no backslashed symbols that aren't alphanumeric. So anything
-that looks like \\, \(, \), \[, \], \{, or \} is always
+that looks like C<\\>, C<\(>, C<\)>, C<\[>, C<\]>, C<\{>, or C<\}> is
+always
interpreted as a literal character, not a metacharacter. This was
once used in a common idiom to disable or quote the special meanings
of regular expression metacharacters in a string that you want to
@@ -924,9 +1302,9 @@ use for a pattern. Simply quote all non-"word" characters:
$pattern =~ s/(\W)/\\$1/g;
(If C , and C modifiers are special in
-that they can only be enabled, not disabled, and the C, C modifier is special in that its presence
+Note also that the C<"p"> modifier is special in that its presence
anywhere in a pattern has a global effect.
=item C<(?:pattern)>
X<(?:)>
-=item C<(?adluimsx-imsx:pattern)>
+=item C<(?adluimnsx-imnsx:pattern)>
-=item C<(?^aluimsx:pattern)>
+=item C<(?^aluimnsx:pattern)>
X<(?^:)>
This is for clustering, not capturing; it groups subexpressions like
-"()", but doesn't make backreferences as "()" does. So
+C<"()">, but doesn't make backreferences as C<"()"> does. So
@fields = split(/\b(?:a|b|c)\b/)
-is like
+matches the same field delimiters as
@fields = split(/\b(a|b|c)\b/)
-but doesn't spit out extra fields. It's also cheaper not to capture
+but doesn't spit out the delimiters themselves as extra fields (even though
+that's the behaviour of L
})>
Treats the return value of the code block as the condition.
+Full syntax: C<< (?(?{ code })then|else) >>
-=item (R)
+=item C<(R)>
Checks if the expression has been evaluated inside of recursion.
+Full syntax: C<< (?(R)then|else) >>
-=item (R1) (R2) ...
+=item C<(R1)> C<(R2)> ...
Checks if the expression has been evaluated while executing directly
inside of the n-th capture group. This check is the regex equivalent of
@@ -1566,18 +2027,22 @@ inside of the n-th capture group. This check is the regex equivalent of
In other words, it does not check the full recursion stack.
-=item (R&NAME)
+Full syntax: C<< (?(R1)then|else) >>
+
+=item C<(R&I<name>)>
Similar to C<(R1)>, this predicate checks to see if we're executing
directly inside of the leftmost group with a given name (this is the same
-logic used by C<(?&NAME)> to disambiguate). It does not check the full
+logic used by C<(?&I<name>)> to disambiguate). It does not check the full
stack, but only the name of the innermost active recursion.
+Full syntax: C<< (?(R&name)then|else) >>
-=item (DEFINE)
+=item C<(DEFINE)>
In this case, the yes-pattern is never directly executed, and no
no-pattern is allowed. Similar in spirit to C<(?{0})> but more efficient.
See below for details.
+Full syntax: C<< (?(DEFINE)definitions...) >>
=back
@@ -1601,7 +2066,7 @@ It is recommended that for this usage you put the DEFINE block at the
end of the pattern, and that you name any subpatterns defined within it.
Also, it's worth noting that patterns defined this way probably will
-not be as efficient, as the optimiser is not very clever about
+not be as efficient, as the optimizer is not very clever about
handling them.
An example of how this might be used is as follows:
@@ -1609,7 +2074,7 @@ An example of how this might be used is as follows:
/(?<NAME>(?&NAME_PAT))(?<ADDR>(?&ADDRESS_PAT))
(?(DEFINE)
(?<NAME_PAT>....)
- (?<ADRESS_PAT>....)
+ (?<ADDRESS_PAT>....)
)/x
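As a runnable sketch of the DEFINE layout above: the bodies given here for C<NAME_PAT> and C<ADDRESS_PAT>, the sample input, and the angle-bracket framing are illustrative assumptions, not definitions from this document. Only the overall shape (named subpatterns defined at the end of the pattern and invoked with C<(?&name)>) follows the text.

```perl
use strict;
use warnings;

# Hypothetical subpattern bodies; a real name or address grammar would be
# far more involved than these two character-class sketches.
my $re = qr/
    ^ (?<NAME> (?&NAME_PAT) ) \s* < (?<ADDR> (?&ADDRESS_PAT) ) > \z
    (?(DEFINE)
        (?<NAME_PAT>    [A-Za-z]+ (?: \s [A-Za-z]+ )* )
        (?<ADDRESS_PAT> [\w.]+ \@ [\w.]+ )
    )
/x;

my ($name, $addr);
if ('Jane Doe <jane@example.org>' =~ $re) {
    # Only the outer NAME and ADDR groups are visible after the match;
    # the groups inside the DEFINE block are reached via recursion.
    ($name, $addr) = ($+{NAME}, $+{ADDR});
    print "name=$name addr=$addr\n";
}
```

The outer C<NAME> and C<ADDR> groups wrap the recursive calls so that the matched text is still available in C<%+> afterwards.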
Note that capture groups matched inside of recursion are not accessible
@@ -1637,16 +2102,16 @@ An "independent" subexpression, one which matches the substring
that a I<standalone> C<pattern> would match if anchored at the given
position, and it matches I<nothing other than this substring>. This
construct is useful for optimizations of what would otherwise be
-"eternal" matches, because it will not backtrack (see L<"Backtracking">).
+"eternal" matches, because it will not backtrack (see L</"Backtracking">).
It may also be useful in places where the "grab all you can, and do not
give anything back" semantic is desirable.
For example: C<< ^(?>a*)ab >> will never match, since C<< (?>a*) >>
(anchored at the beginning of string, as above) will match I<all>
-characters C<a> at the beginning of string, leaving no C<a> for
+characters C<"a"> at the beginning of string, leaving no C<"a"> for
C<ab> to match. In contrast, C<a*ab> will match the same as C<a+b>,
since the match of the subgroup C<a*> is influenced by the following
-group C<ab> (see L<"Backtracking">). In particular, C<a*> inside
+group C<ab> (see L</"Backtracking">). In particular, C<a*> inside
C<a*ab> will match fewer characters than a standalone C<a*>, since
this makes the tail match.
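The contrast described above can be checked directly. This is a minimal sketch; the sample string C<"aaab"> and the variable names are chosen here for illustration.

```perl
use strict;
use warnings;

my $s = 'aaab';

# The independent group grabs every "a" and never gives one back,
# so the trailing "ab" has no "a" left to match.
my $atomic = ($s =~ /^(?>a*)ab/) ? 1 : 0;

# The plain quantifier backtracks by one "a" so that "ab" can match.
my $plain  = ($s =~ /^a*ab/) ? 1 : 0;

# Equivalent formulation mentioned in the text.
my $plus   = ($s =~ /^a+b/) ? 1 : 0;

print "atomic=$atomic plain=$plain plus=$plus\n";
```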
@@ -1695,20 +2160,20 @@ hung. However, a tiny change to this pattern
which uses C<< (?>...) >> matches exactly when the one above does (verifying
this yourself would be a productive exercise), but finishes in a fourth
-the time when used on a similar string with 1000000 C<a>s. Be aware,
+the time when used on a similar string with 1000000 C<"a">s. Be aware,
however, that, when this construct is followed by a
quantifier, it currently triggers a warning message under
the C<use warnings> pragma or B<-w> switch saying it
C<"matches null string many times in regex">.
On simple groups, such as the pattern C<< (?> [^()]+ ) >>, a comparable
-effect may be achieved by negative look-ahead, as in C<[^()]+ (?! [^()] )>.
-This was only 4 times slower on a string with 1000000 C<a>s.
+effect may be achieved by negative lookahead, as in C<[^()]+ (?! [^()] )>.
+This was only 4 times slower on a string with 1000000 C<"a">s.
The "grab all you can, and do not give anything back" semantic is desirable
in many situations where on the first sight a simple C<()*> looks like
the correct solution. Suppose we parse text with comments being delimited
-by C<#> followed by some optional (horizontal) whitespace. Contrary to
+by C<"#"> followed by some optional (horizontal) whitespace. Contrary to
its appearance, C<#[ \t]*> I<is not> the correct subexpression to match
the comment delimiter, because it may "give up" some whitespace if
the remainder of the pattern can be made to match that way. The correct
@@ -1717,7 +2182,7 @@ answer is either one of these:
(?>#[ \t]*)
#[ \t]*(?![ \t])
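To see the "give up" behaviour concretely, consider a comment that contains nothing but whitespace. The sketch below uses a sample line invented for illustration: the naive delimiter hands one space back so that the capture group can succeed, while the independent group refuses to do so.

```perl
use strict;
use warnings;

my $line = "#   ";    # a comment marker followed only by three spaces

# Naive delimiter: [ \t]* backtracks, leaving one space for ( .+ ).
my ($loose) = $line =~ / \# [ \t]* ( .+ ) /x;

# Independent group: the whitespace is never given back, so ( .+ )
# finds nothing to match and the whole pattern fails.
my ($tight) = $line =~ / (?> \# [ \t]* ) ( .+ ) /x;

printf "loose=%s tight=%s\n",
    defined $loose ? "<$loose>" : 'no match',
    defined $tight ? "<$tight>" : 'no match';
```

With the naive form, a comment that is really empty is reported as containing a single space.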
-For example, to grab non-empty comments into $1, one should use either
+For example, to grab non-empty comments into C<$1>, one should use either
one of these:
/ (?> \# [ \t]* ) ( .+ ) /x;
@@ -1743,101 +2208,400 @@ to inside of one of these constructs. The following equivalences apply:
See L.
-=back
+Note that this feature is currently L;
+using it yields a warning in the C category.
-=head2 Special Backtracking Control Verbs
+=back
-B<WARNING:> These patterns are experimental and subject to change or
-removal in a future version of Perl. Their usage in production code should
-be noted to avoid problems during upgrades.
+=head2 Backtracking
+X X
-These special patterns are generally of the form C<(*VERB:ARG)>. Unless
-otherwise stated the ARG argument is optional; in some cases, it is
-forbidden.
+NOTE: This section presents an abstract approximation of regular
+expression behavior. For a more rigorous (and complicated) view of
+the rules involved in selecting a match among possible alternatives,
+see L.
-Any pattern containing a special backtracking verb that allows an argument
-has the special behaviour that when executed it sets the current package's
-C<$REGERROR> and C<$REGMARK> variables. When doing so the following
-rules apply:
+A fundamental feature of regular expression matching involves the
+notion called I<backtracking>, which is currently used (when needed)
+by all non-possessive regular expression quantifiers, namely C<"*">, C<*?>, C<"+">,
+C<+?>, C<{n,m}>, and C<{n,m}?>. Backtracking is often optimized
+internally, but the general principle outlined here is valid.
-On failure, the C<$REGERROR> variable will be set to the ARG value of the
-verb pattern, if the verb was involved in the failure of the match. If the
-ARG part of the pattern was omitted, then C<$REGERROR> will be set to the
-name of the last C<(*MARK:NAME)> pattern executed, or to TRUE if there was
-none. Also, the C<$REGMARK> variable will be set to FALSE.
+For a regular expression to match, the I<entire> regular expression must
+match, not just part of it. So if the beginning of a pattern containing a
+quantifier succeeds in a way that causes later parts in the pattern to
+fail, the matching engine backs up and recalculates the beginning
+part--that's why it's called backtracking.
-On a successful match, the C<$REGERROR> variable will be set to FALSE, and
-the C<$REGMARK> variable will be set to the name of the last
-C<(*MARK:NAME)> pattern executed. See the explanation for the
-C<(*MARK:NAME)> verb below for more details.
+Here is an example of backtracking: Let's say you want to find the
+word following "foo" in the string "Food is on the foo table.":
-B<NOTE:> C<$REGERROR> and C<$REGMARK> are not magic variables like C<$1>
-and most other regex-related variables. They are not local to a scope, nor
-readonly, but instead are volatile package variables similar to C<$AUTOLOAD>.
-Use C<local> to localize changes to them to a specific scope if necessary.
+ $_ = "Food is on the foo table.";
+ if ( /\b(foo)\s+(\w+)/i ) {
+ print "$2 follows $1.\n";
+ }
-If a pattern does not contain a special backtracking verb that allows an
-argument, then C<$REGERROR> and C<$REGMARK> are not touched at all.
+When the match runs, the first part of the regular expression (C<\b(foo)>)
+finds a possible match right at the beginning of the string, and loads up
+C<$1> with "Foo". However, as soon as the matching engine sees that there's
+no whitespace following the "Foo" that it had saved in C<$1>, it realizes its
+mistake and starts over again one character after where it had the
+tentative match. This time it goes all the way until the next occurrence
+of "foo". The complete regular expression matches this time, and you get
+the expected output of "table follows foo."
-=over 3
+Sometimes minimal matching can help a lot. Imagine you'd like to match
+everything between "foo" and "bar". Initially, you write something
+like this:
-=item Verbs that take an argument
+ $_ = "The food is under the bar in the barn.";
+ if ( /foo(.*)bar/ ) {
+ print "got <$1>\n";
+ }
-=over 4
+Which perhaps unexpectedly yields:
-=item C<(*PRUNE)> C<(*PRUNE:NAME)>
-X<(*PRUNE)> X<(*PRUNE:NAME)>
+ got <d is under the bar in the >
-This zero-width pattern prunes the backtracking tree at the current point
-when backtracked into on failure. Consider the pattern C<A (*PRUNE) B>,
-where A and B are complex patterns. Until the C<(*PRUNE)> verb is reached,
-A may backtrack as necessary to match. Once it is reached, matching
-continues in B, which may also backtrack as necessary; however, should B
-not match, then no further backtracking will take place, and the pattern
-will fail outright at the current starting position.
+That's because C<.*> was greedy, so you get everything between the
+I<first> "foo" and the I<last> "bar". Here it's more effective
+to use minimal matching to make sure you get the text between a "foo"
+and the first "bar" thereafter.
-The following example counts all the possible matching strings in a
-pattern (without actually matching any of them).
+ if ( /foo(.*?)bar/ ) { print "got <$1>\n" }
+ got <d is under the >
- 'aaab' =~ /a+b?(?{print "$&\n"; $count++})(*FAIL)/;
- print "Count=$count\n";
+Here's another example. Let's say you'd like to match a number at the end
+of a string, and you also want to keep the preceding part of the match.
+So you write this:
-which produces:
+ $_ = "I have 2 numbers: 53147";
+ if ( /(.*)(\d*)/ ) { # Wrong!
+ print "Beginning is <$1>, number is <$2>.\n";
+ }
- aaab
- aaa
- aa
- a
- aab
- aa
- a
- ab
- a
- Count=9
+That won't work at all, because C<.*> was greedy and gobbled up the
+whole string. As C<\d*> can match on an empty string the complete
+regular expression matched successfully.
-If we add a C<(*PRUNE)> before the count like the following
+ Beginning is <I have 2 numbers: 53147>, number is <>.
- 'aaab' =~ /a+b?(*PRUNE)(?{print "$&\n"; $count++})(*FAIL)/;
- print "Count=$count\n";
+Here are some variants, most of which don't work:
-we prevent backtracking and find the count of the longest matching string
-at each matching starting point like so:
+ $_ = "I have 2 numbers: 53147";
+ @pats = qw{
+ (.*)(\d*)
+ (.*)(\d+)
+ (.*?)(\d*)
+ (.*?)(\d+)
+ (.*)(\d+)$
+ (.*?)(\d+)$
+ (.*)\b(\d+)$
+ (.*\D)(\d+)$
+ };
- aaab
- aab
- ab
- Count=3
+ for $pat (@pats) {
+ printf "%-12s ", $pat;
+ if ( /$pat/ ) {
+ print "<$1> <$2>\n";
+ } else {
+ print "FAIL\n";
+ }
+ }
-Any number of C<(*PRUNE)> assertions may be used in a pattern.
+That will print out:
-See also C<< (?>pattern) >> and possessive quantifiers for other ways to
-control backtracking. In some cases, the use of C<(*PRUNE)> can be
-replaced with a C<< (?>pattern) >> with no functional difference; however,
-C<(*PRUNE)> can be used to handle cases that cannot be expressed using a
-C<< (?>pattern) >> alone.
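+ (.*)(\d*)    <I have 2 numbers: 53147> <>
+ (.*)(\d+)    <I have 2 numbers: 5314> <7>
+ (.*?)(\d*)   <> <>
+ (.*?)(\d+)   <I have > <2>
+ (.*)(\d+)$   <I have 2 numbers: 5314> <7>
+ (.*?)(\d+)$  <I have 2 numbers: > <53147>
+ (.*)\b(\d+)$ <I have 2 numbers: > <53147>
+ (.*\D)(\d+)$ <I have 2 numbers: > <53147>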
-=item C<(*SKIP)> C<(*SKIP:NAME)>
-X<(*SKIP)>
+As you see, this can be a bit tricky. It's important to realize that a
+regular expression is merely a set of assertions that gives a definition
+of success. There may be 0, 1, or several different ways that the
+definition might succeed against a particular string. And if there are
+multiple ways it might succeed, you need to understand backtracking to
+know which variety of success you will achieve.
+
+When using lookahead assertions and negations, this can all get even
+trickier. Imagine you'd like to find a sequence of non-digits not
+followed by "123". You might try to write that as
+
+ $_ = "ABC123";
+ if ( /^\D*(?!123)/ ) { # Wrong!
+ print "Yup, no 123 in $_\n";
+ }
+
+But that isn't going to match; at least, not the way you're hoping. It
+claims that there is no 123 in the string. Here's a clearer picture of
+why that pattern matches, contrary to popular expectations:
+
+ $x = 'ABC123';
+ $y = 'ABC445';
+
+ print "1: got $1\n" if $x =~ /^(ABC)(?!123)/;
+ print "2: got $1\n" if $y =~ /^(ABC)(?!123)/;
+
+ print "3: got $1\n" if $x =~ /^(\D*)(?!123)/;
+ print "4: got $1\n" if $y =~ /^(\D*)(?!123)/;
+
+This prints
+
+ 2: got ABC
+ 3: got AB
+ 4: got ABC
+
+You might have expected test 3 to fail because it seems to be a more
+general-purpose version of test 1. The important difference between
+them is that test 3 contains a quantifier (C<\D*>) and so can use
+backtracking, whereas test 1 will not. What's happening is
+that you've asked "Is it true that at the start of C<$x>, following 0 or more
+non-digits, you have something that's not 123?" If the pattern matcher had
+let C<\D*> expand to "ABC", this would have caused the whole pattern to
+fail.
+
+The search engine will initially match C<\D*> with "ABC". Then it will
+try to match C<(?!123)> with "123", which fails. But because
+a quantifier (C<\D*>) has been used in the regular expression, the
+search engine can backtrack and retry the match differently
+in the hope of matching the complete regular expression.
+
+The pattern really, I<really> wants to succeed, so it uses the
+standard pattern back-off-and-retry and lets C<\D*> expand to just "AB" this
+time. Now there's indeed something following "AB" that is not
+"123". It's "C123", which suffices.
+
+We can deal with this by using both an assertion and a negation.
+We'll say that the first part in C<$1> must be followed both by a digit
+and by something that's not "123". Remember that the lookaheads
+are zero-width expressions--they only look, but don't consume any
+of the string in their match. So rewriting this way produces what
+you'd expect; that is, case 5 will fail, but case 6 succeeds:
+
+ print "5: got $1\n" if $x =~ /^(\D*)(?=\d)(?!123)/;
+ print "6: got $1\n" if $y =~ /^(\D*)(?=\d)(?!123)/;
+
+ 6: got ABC
+
+In other words, the two zero-width assertions next to each other work as though
+they're ANDed together, just as you'd use any built-in assertions: C</^$/>
+matches only if you're at the beginning of the line AND the end of the
+line simultaneously. The deeper underlying truth is that juxtaposition in
+regular expressions always means AND, except when you write an explicit OR
+using the vertical bar. C</ab/> means match "a" AND (then) match "b",
+although the attempted matches are made at different positions because "a"
+is not a zero-width assertion, but a one-width assertion.
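Another way to stop C<\D*> from backing off is to forbid it from giving characters back at all. The sketch below assumes Perl 5.10 or later, where the possessive quantifier C<*+> is available. It is not identical to the paired assertions above, since it does not require a digit to follow, but it rejects C<"ABC123"> for the same reason.

```perl
use strict;
use warnings;

my ($x, $y) = ('ABC123', 'ABC445');
my @got;

# \D*+ takes "ABC" and keeps it, so (?!123) is tested only at that
# one position and cannot be satisfied by shrinking the match to "AB".
push @got, "x:$1" if $x =~ /^(\D*+)(?!123)/;
push @got, "y:$1" if $y =~ /^(\D*+)(?!123)/;

print "@got\n";
```

The same effect can be had with an independent group, C<< (?>\D*) >>, which is what the possessive quantifier abbreviates.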
+
+B<WARNING>: Particularly complicated regular expressions can take
+exponential time to solve because of the immense number of possible
+ways they can use backtracking to try for a match. For example, without
+internal optimizations done by the regular expression engine, this will
+take a painfully long time to run:
+
+ 'aaaaaaaaaaaa' =~ /((a{0,5}){0,5})*[c]/
+
+And if you used C<"*">'s in the internal groups instead of limiting them
+to 0 through 5 matches, then it would take forever--or until you ran
+out of stack space. Moreover, these internal optimizations are not
+always applicable. For example, if you put C<{0,5}> instead of C<"*">
+on the external group, no current optimization is applicable, and the
+match takes a long time to finish.
+
+A powerful tool for optimizing such beasts is what is known as an
+"independent group",
+which does not backtrack (see L