X</m> X<regex, multiline> X<regexp, multiline> X<regular expression, multiline>
Treat string as multiple lines. That is, change "^" and "$" from matching
-the start or end of the string to matching the start or end of any
-line anywhere within the string.
+the start or end of line only at the left and right ends of the string to
+matching them anywhere within the string.
=item s
X</s> X<regex, single-line> X<regexp, single-line>
# be even if it did!!
"\N{LATIN SMALL LIGATURE FI}" =~ /(f)(i)/i; # Doesn't match!
-Perl doesn't match multiple characters in an inverted bracketed
-character class, which otherwise could be highly confusing. See
+Perl doesn't match multiple characters in a bracketed
+character class unless the character that maps to them is explicitly
+mentioned, and it doesn't match them at all if the character class is
+inverted, which otherwise could be highly confusing. See
+L<perlrecharclass/Bracketed Character Classes>, and
L<perlrecharclass/Negation>.
-Also, Perl matching doesn't fully conform to the current Unicode C</i>
-recommendations, which ask that the matching be made upon the NFD
-(Normalization Form Decomposed) of the text. However, Unicode is
-in the process of reconsidering and revising their recommendations.
-
=item x
X</x>
Preserve the string matched such that ${^PREMATCH}, ${^MATCH}, and
${^POSTMATCH} are available for use after matching.
+In Perl 5.18 and higher this is ignored. ${^PREMATCH}, ${^MATCH}, and
+${^POSTMATCH} will be available after the match regardless of the modifier.
+
=item g and c
X</g> X</c>
=item a, d, l and u
X</a> X</d> X</l> X</u>
-These modifiers, new in 5.14, affect which character-set semantics
-(Unicode, ASCII, etc.) are used, as described below in
+These modifiers, all new in 5.14, affect which character-set semantics
+(Unicode, etc.) are used, as described below in
L</Character set modifiers>.
=back
-These are usually written as "the C</x> modifier", even though the delimiter
+Regular expression modifiers are usually written in documentation
+as e.g., "the C</x> modifier", even though the delimiter
in question might not really be a slash. The modifiers C</imsxadlup>
may also be embedded within the regular expression itself using
the C<(?...)> construct, see L</Extended Patterns> below.
-The C</x>, C</l>, C</u>, C</a> and C</d> modifiers need a little more
-explanation.
-
=head3 /x
C</x> tells
the character set modifiers; they affect the character set semantics
used for the regular expression.
-At any given time, exactly one of these modifiers is in effect. Once
-compiled, the behavior doesn't change regardless of what rules are in
-effect when the regular expression is executed. And if a regular
-expression is interpolated into a larger one, the original's rules
-continue to apply to it, and only it.
+The C</d>, C</u>, and C</l> modifiers are not likely to be of much use
+to you, and so you need not worry about them very much. They exist for
+Perl's internal use, so that complex regular expression data structures
+can be automatically serialized and later exactly reconstituted,
+including all their nuances. But, since Perl can't keep a secret, and
+there may be rare instances where they are useful, they are documented
+here.
+
+The C</a> modifier, on the other hand, may be useful. Its purpose is to
+allow code that is to work mostly on ASCII data to not have to concern
+itself with Unicode.
+
+Briefly, C</l> sets the character set to that of whatever B<L>ocale is in
+effect at the time of the execution of the pattern match.
+
+C</u> sets the character set to B<U>nicode.
+
+C</a> also sets the character set to Unicode, BUT adds several
+restrictions for B<A>SCII-safe matching.
+
+C</d> is the old, problematic, pre-5.14 B<D>efault character set
+behavior. Its only use is to force that old behavior.
-Note that the modifiers affect only pattern matching, and do not extend
-to any replacement done. For example,
+At any given time, exactly one of these modifiers is in effect. Their
+existence allows Perl to keep the originally compiled behavior of a
+regular expression, regardless of what rules are in effect when it is
+actually executed. And if it is interpolated into a larger regex, the
+original's rules continue to apply to it, and only it.
- s/foo/\Ubar/l
+The C</l> and C</u> modifiers are automatically selected for
+regular expressions compiled within the scope of various pragmas,
+and we recommend that in general, you use those pragmas instead of
+specifying these modifiers explicitly. For one thing, the modifiers
+affect only pattern matching, and do not extend even to any replacement
+done, whereas using the pragmas gives consistent results for all
+appropriate operations within their scopes. For example,
-will uppercase "bar", but the C</l> does not affect how the C<\U>
-operates. If C<use locale> is in effect, the C<\U> will use locale
-rules; if C<use feature 'unicode_strings'> is in effect, it will
-use Unicode rules, etc.
+ s/foo/\Ubar/il
+
+will match "foo" using the locale's rules for case-insensitive matching,
+but the C</l> does not affect how the C<\U> operates. Most likely you
+want both of them to use locale rules. To do this, instead compile the
+regular expression within the scope of C<use locale>. This both
+implicitly adds the C</l> and applies locale rules to the C<\U>. The
+lesson is to C<use locale> and not C</l> explicitly.
+
+Similarly, it would be better to use C<use feature 'unicode_strings'>
+instead of,
+
+ s/foo/\Lbar/iu
+
+to get Unicode rules, as the C<\L> in the former (but not necessarily
+the latter) would also use Unicode rules.
+
+More detail on each of the modifiers follows. Most likely you don't
+need to know this detail for C</l>, C</u>, and C</d>, and can skip ahead
+to L<E<sol>a|/E<sol>a (and E<sol>aa)>.
=head4 /l
above 255 are treated as Unicode no matter what locale is in effect.
Under Unicode rules, there are a few case-insensitive matches that cross
the 255/256 boundary. These are disallowed under C</l>. For example,
-0xFF does not caselessly match the character at 0x178, C<LATIN CAPITAL
-LETTER Y WITH DIAERESIS>, because 0xFF may not be C<LATIN SMALL LETTER Y
-WITH DIAERESIS> in the current locale, and Perl has no way of knowing if
-that character even exists in the locale, much less what code point it
-is.
+0xFF (on ASCII platforms) does not caselessly match the character at
+0x178, C<LATIN CAPITAL LETTER Y WITH DIAERESIS>, because 0xFF may not be
+C<LATIN SMALL LETTER Y WITH DIAERESIS> in the current locale, and Perl
+has no way of knowing if that character even exists in the locale, much
+less what code point it is.
This modifier may be specified to be the default by C<use locale>, but
see L</Which character set modifier is in effect?>.
means to use Unicode rules when pattern matching. On ASCII platforms,
this means that the code points between 128 and 255 take on their
-Latin-1 (ISO-8859-1) meanings (which are the same as Unicode's), whereas
-in strict ASCII their meanings are undefined. Thus the platform
-effectively becomes a Unicode platform, hence, for example, C<\w> will
-match any of the more than 100_000 word characters in Unicode.
+Latin-1 (ISO-8859-1) meanings (which are the same as Unicode's).
+(Otherwise Perl considers their meanings to be undefined.) Thus,
+under this modifier, the ASCII platform effectively becomes a Unicode
+platform; and hence, for example, C<\w> will match any of the more than
+100_000 word characters in Unicode.
Unlike most locales, which are specific to a language and country pair,
-Unicode classifies all the characters that are letters I<somewhere> as
+Unicode classifies all the characters that are letters I<somewhere> in
+the world as
C<\w>. For example, your locale might not think that C<LATIN SMALL
LETTER ETH> is a letter (unless you happen to speak Icelandic), but
Unicode does. Similarly, all the characters that are decimal digits
C<BENGALI DIGIT FOUR> (U+09EA) looks very much like an
C<ASCII DIGIT EIGHT> (U+0038). And, C<\d+>, may match strings of digits
that are a mixture from different writing systems, creating a security
-issue. L<Unicode::UCDE<sol>num()|Unicode::UCD/num> can be used to sort this out.
+issue. L<Unicode::UCD/num()> can be used to sort
+this out. Or the C</a> modifier can be used to force C<\d> to match
+just the ASCII 0 through 9.
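The mixed-script concern can be illustrated with a small sketch. It assumes Perl 5.14 or later (for C</u> and C</a>), and the input string and variable names are invented for the example:

```perl
use strict;
use warnings;

# Hypothetical input: ASCII "4" followed by BENGALI DIGIT FOUR (U+09EA).
my $mixed = "4\x{09EA}";

# Under /u, \d accepts decimal digits from any script.
my $unicode_ok = ($mixed =~ /\A\d+\z/u) ? 1 : 0;

# Under /a, \d is restricted to the ASCII digits 0 through 9.
my $ascii_ok = ($mixed =~ /\A\d+\z/a) ? 1 : 0;

print "/u: $unicode_ok, /a: $ascii_ok\n";    # prints "/u: 1, /a: 0"
```

A validation routine that expects plain ASCII numbers would therefore probably want the C</a> form.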
-Also, case-insensitive matching works on the full set of Unicode
+Also, under this modifier, case-insensitive matching works on the full
+set of Unicode
characters. The C<KELVIN SIGN>, for example matches the letters "k" and
"K"; and C<LATIN SMALL LIGATURE FF> matches the sequence "ff", which,
if you're not prepared, might make it look like a hexadecimal constant,
C<sS>, and C<ss>, otherwise not.
This modifier may be specified to be the default by C<use feature
-'unicode_strings>, but see
-L</Which character set modifier is in effect?>.
+'unicode_strings'>, C<use locale ':not_characters'>, or
+C<L<use 5.012|perlfunc/use VERSION>> (or higher),
+but see L</Which character set modifier is in effect?>.
X</u>
-=head4 /a
-
-is the same as C</u>, except that C<\d>, C<\s>, C<\w>, and the
-Posix character classes are restricted to matching in the ASCII range
-only. That is, with this modifier, C<\d> always means precisely the
-digits C<"0"> to C<"9">; C<\s> means the five characters C<[ \f\n\r\t]>;
-C<\w> means the 63 characters C<[A-Za-z0-9_]>; and likewise, all the
-Posix classes such as C<[[:print:]]> match only the appropriate
-ASCII-range characters.
-
-This modifier is useful for people who only incidentally use Unicode.
-With it, one can write C<\d> with confidence that it will only match
-ASCII characters, and should the need arise to match beyond ASCII, you
-can use C<\p{Digit}>, or C<\p{Word}> for C<\w>. There are similar
-C<\p{...}> constructs that can match white space and Posix classes
-beyond ASCII. See L<perlrecharclass/POSIX Character Classes>.
-
-As you would expect, this modifier causes, for example, C<\D> to mean
-the same thing as C<[^0-9]>; in fact, all non-ASCII characters match
-C<\D>, C<\S>, and C<\W>. C<\b> still means to match at the boundary
-between C<\w> and C<\W>, using the C</a> definitions of them (similarly
-for C<\B>).
-
-Otherwise, C</a> behaves like the C</u> modifier, in that
-case-insensitive matching uses Unicode semantics; for example, "k" will
-match the Unicode C<\N{KELVIN SIGN}> under C</i> matching, and code
-points in the Latin1 range, above ASCII will have Unicode rules when it
-comes to case-insensitive matching.
-
-To forbid ASCII/non-ASCII matches (like "k" with C<\N{KELVIN SIGN}>),
-specify the "a" twice, for example C</aai> or C</aia>
-
-To reiterate, this modifier provides protection for applications that
-don't wish to be exposed to all of Unicode. Specifying it twice
-gives added protection.
-
-This modifier may be specified to be the default by C<use re '/a'>
-or C<use re '/aa'>, but see
-L</Which character set modifier is in effect?>.
-X</a>
-X</aa>
-
=head4 /d
This modifier means to use the "Default" native rules of the platform
=item 5
-the pattern uses a Unicode property (C<\p{...}>)
+the pattern uses a Unicode property (C<\p{...}>); or
+
+=item 6
+
+the pattern uses L</C<(?[ ])>>
=back
Another mnemonic for this modifier is "Depends", as the rules actually
used depend on various things, and as a result you can get unexpected
-results. See L<perlunicode/The "Unicode Bug">.
+results. See L<perlunicode/The "Unicode Bug">. The Unicode Bug has
+become rather infamous, leading to yet another (printable) name for this
+modifier, "Dodgy".
On ASCII platforms, the native rules are ASCII, and on EBCDIC platforms
(at least the ones that Perl handles), they are Latin-1.
chop $str;
$str =~ /^\w/; # Still a match! $str remains in UTF-8 format.
+This modifier is automatically selected by default when none of the
+others are, so yet another name for it is "Default".
+
+Because of the unexpected behaviors associated with this modifier, you
+probably should only use it to maintain weird backward compatibilities.
+
+=head4 /a (and /aa)
+
+This modifier stands for ASCII-restrict (or ASCII-safe). This modifier,
+unlike the others, may be doubled-up to increase its effect.
+
+When it appears singly, it causes the sequences C<\d>, C<\s>, C<\w>, and
+the Posix character classes to match only in the ASCII range. They thus
+revert to their pre-5.6, pre-Unicode meanings. Under C</a>, C<\d>
+always means precisely the digits C<"0"> to C<"9">; C<\s> means the five
+characters C<[ \f\n\r\t]>, and starting in Perl v5.18, experimentally,
+the vertical tab; C<\w> means the 63 characters
+C<[A-Za-z0-9_]>; and likewise, all the Posix classes such as
+C<[[:print:]]> match only the appropriate ASCII-range characters.
+
+This modifier is useful for people who only incidentally use Unicode,
+and who do not wish to be burdened with its complexities and security
+concerns.
+
+With C</a>, one can write C<\d> with confidence that it will only match
+ASCII characters, and should the need arise to match beyond ASCII, you
+can instead use C<\p{Digit}> (or C<\p{Word}> for C<\w>). There are
+similar C<\p{...}> constructs that can match beyond ASCII both white
+space (see L<perlrecharclass/Whitespace>), and Posix classes (see
+L<perlrecharclass/POSIX Character Classes>). Thus, this modifier
+doesn't mean you can't use Unicode, it means that to get Unicode
+matching you must explicitly use a construct (C<\p{}>, C<\P{}>) that
+signals Unicode.
+
+As you would expect, this modifier causes, for example, C<\D> to mean
+the same thing as C<[^0-9]>; in fact, all non-ASCII characters match
+C<\D>, C<\S>, and C<\W>. C<\b> still means to match at the boundary
+between C<\w> and C<\W>, using the C</a> definitions of them (similarly
+for C<\B>).
+
+Otherwise, C</a> behaves like the C</u> modifier, in that
+case-insensitive matching uses Unicode semantics; for example, "k" will
+match the Unicode C<\N{KELVIN SIGN}> under C</i> matching, and code
+points in the Latin1 range above ASCII will have Unicode rules when it
+comes to case-insensitive matching.
+
+To forbid ASCII/non-ASCII matches (like "k" with C<\N{KELVIN SIGN}>),
+specify the "a" twice, for example C</aai> or C</aia>. (The first
+occurrence of "a" restricts the C<\d>, etc., and the second occurrence
+adds the C</i> restrictions.) But, note that code points outside the
+ASCII range will use Unicode rules for C</i> matching, so the modifier
+doesn't really restrict things to just ASCII; it just forbids the
+intermixing of ASCII and non-ASCII.
+
+To summarize, this modifier provides protection for applications that
+don't wish to be exposed to all of Unicode. Specifying it twice
+gives added protection.
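A minimal sketch of the single and doubled forms follows. It assumes Perl 5.14 or later; the variable names are illustrative only, and the characters are written as code points so that no C<charnames> loading is needed:

```perl
use strict;
use warnings;

my $kelvin  = "\x{212A}";    # KELVIN SIGN
my $e_acute = "\x{E9}";      # LATIN SMALL LETTER E WITH ACUTE

# \w: Unicode rules see a letter, /a restricts \w to [A-Za-z0-9_].
my $w_under_u = ($e_acute =~ /\w/u) ? 1 : 0;
my $w_under_a = ($e_acute =~ /\w/a) ? 1 : 0;

# /i: a single "a" still allows ASCII "k" to match KELVIN SIGN,
# a doubled "a" forbids that ASCII/non-ASCII pairing.
my $k_under_ai  = ($kelvin =~ /k/ai)  ? 1 : 0;
my $k_under_aai = ($kelvin =~ /k/aai) ? 1 : 0;

print "\\w: /u=$w_under_u /a=$w_under_a\n";          # prints "\w: /u=1 /a=0"
print "k:  /ai=$k_under_ai /aai=$k_under_aai\n";     # prints "k:  /ai=1 /aai=0"
```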
+
+This modifier may be specified to be the default by C<use re '/a'>
+or C<use re '/aa'>. If you do so, you may actually have occasion to use
+the C</u> modifier explicitly if there are a few regular expressions
+where you do want full Unicode rules (but even here, it's best if
+everything were under feature C<"unicode_strings">, along with the
+C<use re '/aa'>). Also see L</Which character set modifier is in
+effect?>.
+X</a>
+X</aa>
+
=head4 Which character set modifier is in effect?
Which of these modifiers is in effect at any given point in a regular
-expression depends on a fairly complex set of interactions. As
+expression depends on a fairly complex set of interactions. These have
+been designed so that in general you don't have to worry about it, but
+this section gives the gory details. As
explained below in L</Extended Patterns> it is possible to explicitly
specify modifiers that apply only to portions of a regular expression.
The innermost always has priority over any outer ones, and one applying
The C<L<use re 'E<sol>foo'|re/"'/flags' mode">> pragma can be used to set
default modifiers (including these) for regular expressions compiled
within its scope. This pragma has precedence over the other pragmas
-listed below that change the defaults.
+listed below that also change the defaults.
Otherwise, C<L<use locale|perllocale>> sets the default modifier to C</l>;
-and C<L<use feature 'unicode_strings|feature>> or
+and C<L<use feature 'unicode_strings'|feature>>, or
C<L<use 5.012|perlfunc/use VERSION>> (or higher) set the default to
C</u> when not in the same scope as either C<L<use locale|perllocale>>
-or C<L<use bytes|bytes>>. Unlike the mechanisms mentioned above, these
+or C<L<use bytes|bytes>>.
+(C<L<use locale ':not_characters'|perllocale/Unicode and UTF-8>> also
+sets the default to C</u>, overriding any plain C<use locale>.)
+Unlike the mechanisms mentioned above, these
affect operations besides regular expressions pattern matching, and so
give more consistent results with other operators, including using
C<\U>, C<\l>, etc. in substitution replacements.
the string), and "$" will match before any newline. At the
cost of a little more overhead, you can do this by using the /m modifier
on the pattern match operator. (Older programs did this by setting C<$*>,
-but this option was removed in perl 5.9.)
+but this option was removed in perl 5.10.)
X<^> X<$> X</m>
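As a small sketch of the difference (the sample text and variable names are invented for illustration), the following collects the word at each position where C<^> can match, first without and then with C</m>:

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\nthird line\n";

# Without /m, ^ anchors only at the very start of the string.
my @plain = $text =~ /^(\w+)/g;

# With /m, ^ also matches immediately after each embedded newline.
my @multi = $text =~ /^(\w+)/mg;

print "without /m: @plain\n";    # prints "without /m: first"
print "with /m:    @multi\n";    # prints "with /m:    first second third"
```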
To simplify multi-line substitutions, the "." character never matches a
{n,m} Match at least n but not more than m times
(If a curly bracket occurs in any other context and does not form part of
-a backslashed sequence like C<\x{...}>, it is treated
-as a regular character. In particular, the lower bound
-is not optional.) The "*" quantifier is equivalent to C<{0,}>, the "+"
+a backslashed sequence like C<\x{...}>, it is treated as a regular
+character. In particular, the lower quantifier bound is not optional,
+and a typo in a quantifier silently causes it to be treated as the
+literal characters. For example,
+
+ /o{4,3}/
+
+looks like a quantifier that matches 0 times, since 4 is greater than 3,
+but it really means to match the sequence of six characters
+S<C<"o { 4 , 3 }">>. It is planned to eventually require literal uses
+of curly brackets to be escaped, say by preceding them with a backslash
+or enclosing them within square brackets, (C<"\{"> or C<"[{]">). This
+change will allow for future syntax extensions (like making the lower
+bound of a quantifier optional), and better error checking. In the
+meantime, you should get in the habit of escaping all instances where
+you mean a literal "{".)
+
+The "*" quantifier is equivalent to C<{0,}>, the "+"
quantifier to C<{1,}>, and the "?" quantifier to C<{0,1}>. n and m are limited
to non-negative integral values less than a preset limit defined when perl is built.
This is usually 32766 on the most common platforms. The actual limit can
character class "..." within the outer bracketed
character class. Example: [[:upper:]] matches any
uppercase character.
+ (?[...]) [8] Extended bracketed character class
\w [3] Match a "word" character (alphanumeric plus "_", plus
other connector punctuation chars plus Unicode
marks)
\g{name} [5] Named backreference
\k<name> [5] Named backreference
\K [6] Keep the stuff left of the \K, don't include it in $&
- \N [7] Any character but \n (experimental). Not affected by
- /s modifier
+ \N [7] Any character but \n. Not affected by /s modifier
\v [3] Vertical whitespace
\V [3] Not vertical whitespace
\h [3] Horizontal whitespace
when of the form C<\N{U+I<hex>}>, it matches the character whose Unicode
code point is I<hex>. Otherwise it matches any character but C<\n>.
+=item [8]
+
+See L<perlrecharclass/Extended Bracketed Character Classes> for details.
+
=back
=head3 Assertions
which makes it easier to write code that tests for a series of more
specific cases and remembers the best match.
-B<WARNING>: Once Perl sees that you need one of C<$&>, C<$`>, or
+B<WARNING>: If your code is to run on Perl 5.16 or earlier,
+beware that once Perl sees that you need one of C<$&>, C<$`>, or
C<$'> anywhere in the program, it has to provide them for every
-pattern match. This may substantially slow your program. Perl
+pattern match. This may substantially slow your program. (In Perl 5.18 a
+more efficient mechanism is used, eliminating any slowdown.) Perl
uses the same mechanism to produce C<$1>, C<$2>, etc, so you also pay a
price for each pattern that contains capturing parentheses. (To
avoid this cost while retaining the grouping behaviour, use the
parentheses will not be penalized. So avoid C<$&>, C<$'>, and C<$`>
if you can, but if you can't (and some algorithms really appreciate
them), once you've used them once, use them at will, because you've
-already paid the price. As of 5.005, C<$&> is not so costly as the
-other two.
+already paid the price.
X<$&> X<$`> X<$'>
-As a workaround for this problem, Perl 5.10.0 introduces C<${^PREMATCH}>,
+As a workaround for this problem, Perl 5.10.0 introduced C<${^PREMATCH}>,
C<${^MATCH}> and C<${^POSTMATCH}>, which are equivalent to C<$`>, C<$&>
and C<$'>, B<except> that they are only guaranteed to be defined after a
successful match that was executed with the C</p> (preserve) modifier.
The use of these variables incurs no global performance penalty, unlike
their punctuation char equivalents, however at the trade-off that you
-have to tell perl when you want to use them.
+have to tell perl when you want to use them. As of Perl 5.18, these three
+variables are equivalent to C<$`>, C<$&> and C<$'>, and C</p> is ignored.
X</p> X<p modifier>
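A minimal usage sketch, assuming Perl 5.10 or later; the input string and variable names are made up for the example. Writing C</p> explicitly keeps the code correct on versions before 5.18 as well:

```perl
use strict;
use warnings;

my $pair = "key=value";
my ($pre, $match, $post);

# /p asks perl to preserve the pieces around this particular match.
if ($pair =~ /=/p) {
    ($pre, $match, $post) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});
}

print "[$pre][$match][$post]\n";    # prints "[key][=][value]"
```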
=head2 Quoting metacharacters
Backslashed metacharacters in Perl are alphanumeric, such as C<\b>,
C<\w>, C<\n>. Unlike some other regular expression languages, there
are no backslashed symbols that aren't alphanumeric. So anything
-that looks like \\, \(, \), \<, \>, \{, or \} is always
+that looks like \\, \(, \), \[, \], \{, or \} is always
interpreted as a literal character, not a metacharacter. This was
once used in a common idiom to disable or quote the special meanings
of regular expression metacharacters in a string that you want to
I<need> to use literal backslashes within C<\Q...\E>,
consult L<perlop/"Gory details of parsing quoted constructs">.
+C<quotemeta()> and C<\Q> are fully described in L<perlfunc/quotemeta>.
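For illustration, here is a short sketch of the effect of quoting an interpolated string; C<$needle> and the test strings are hypothetical:

```perl
use strict;
use warnings;

my $needle = "a.c";

# Interpolated as-is, the "." is a metacharacter and matches any character.
my $loose = ("abc" =~ /\A$needle\z/) ? 1 : 0;

# Inside \Q...\E the "." is quoted and matches only a literal dot.
my $quoted_abc = ("abc" =~ /\A\Q$needle\E\z/) ? 1 : 0;
my $quoted_dot = ("a.c" =~ /\A\Q$needle\E\z/) ? 1 : 0;

# quotemeta() returns the quoted string itself.
my $escaped = quotemeta($needle);

print "$loose $quoted_abc $quoted_dot $escaped\n";    # prints "1 0 1 a\.c"
```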
+
=head2 Extended Patterns
Perl also defines a consistent extension syntax for features not
expressions, and 2) whenever you see one, you should stop and
"question" exactly what is going on. That's psychology....
-=over 10
+=over 4
=item C<(?#text)>
X<(?#)>
modifier outside this group.
These modifiers do not carry over into named subpatterns called in the
-enclosing group. In other words, a pattern such as C<((?i)(&NAME))> does not
+enclosing group. In other words, a pattern such as C<((?i)(?&NAME))> does not
change the case-sensitivity of the "NAME" pattern.
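The following sketch shows this with a hypothetical C<NAME> subpattern defined in a C<(?(DEFINE)...)> block; it assumes Perl 5.10 or later for named subpatterns:

```perl
use strict;
use warnings;

# The (?i) is in effect only inside group 1. The NAME subpattern is
# defined outside that group, so it stays case-sensitive even when it
# is called from inside it.
my $re = qr/
    \A ( (?i) (?&NAME) ) \z
    (?(DEFINE) (?<NAME> abc ) )
/x;

my $lower = ("abc" =~ $re) ? 1 : 0;
my $upper = ("ABC" =~ $re) ? 1 : 0;

print "abc: $lower, ABC: $upper\n";    # prints "abc: 1, ABC: 0"
```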
Any of these modifiers can be set to apply globally to all regular
B<WARNING>: This extended regular expression feature is considered
experimental, and may be changed without notice. Code executed that
has side effects may not perform identically from version to version
-due to the effect of future optimisations in the regex engine.
+due to the effect of future optimisations in the regex engine. The
+implementation of this feature was radically overhauled for the 5.18.0
+release, and its behaviour in earlier versions of perl was much buggier,
+especially in relation to parsing, lexical vars, scoping, recursion and
+reentrancy.
-This zero-width assertion evaluates any embedded Perl code. It
-always succeeds, and its C<code> is not interpolated. Currently,
-the rules to determine where the C<code> ends are somewhat convoluted.
+This zero-width assertion executes any embedded Perl code. It always
+succeeds, and its return value is set as C<$^R>.
-This feature can be used together with the special variable C<$^N> to
-capture the results of submatches in variables without having to keep
-track of the number of nested parentheses. For example:
+In literal patterns, the code is parsed at the same time as the
+surrounding code. While within the pattern, control is passed temporarily
+back to the perl parser, until the logically-balancing closing brace is
+encountered. This is similar to the way that an array index expression in
+a literal string is handled, for example
- $_ = "The brown fox jumps over the lazy dog";
- /the (\S+)(?{ $color = $^N }) (\S+)(?{ $animal = $^N })/i;
- print "color = $color, animal = $animal\n";
+ "abc$array[ 1 + f('[') + g()]def"
+
+In particular, braces do not need to be balanced:
+
+ s/abc(?{ f('{'); })/def/
+
+Even in a pattern that is interpolated and compiled at run-time, literal
+code blocks will be compiled once, at perl compile time; the following
+prints "ABCD":
+
+ print "D";
+ my $qr = qr/(?{ BEGIN { print "A" } })/;
+ my $foo = "foo";
+ /$foo$qr(?{ BEGIN { print "B" } })/;
+ BEGIN { print "C" }
+
+In patterns where the text of the code is derived from run-time
+information rather than appearing literally in a source code /pattern/,
+the code is compiled at the same time that the pattern is compiled, and
+for reasons of security, C<use re 'eval'> must be in scope. This is to
+stop user-supplied patterns containing code snippets from being
+executable.
+
+In situations where you need to enable this with C<use re 'eval'>, you should
+also have taint checking enabled. Better yet, use the carefully
+constrained evaluation within a Safe compartment. See L<perlsec> for
+details about both these mechanisms.
+
+From the viewpoint of parsing, lexical variable scope and closures,
+
+ /AAA(?{ BBB })CCC/
+
+behaves approximately like
+
+ /AAA/ && do { BBB } && /CCC/
+
+Similarly,
-Inside the C<(?{...})> block, C<$_> refers to the string the regular
+ qr/AAA(?{ BBB })CCC/
+
+behaves approximately like
+
+ sub { /AAA/ && do { BBB } && /CCC/ }
+
+In particular:
+
+ { my $i = 1; $r = qr/(?{ print $i })/ }
+ my $i = 2;
+ /$r/; # prints "1"
+
+Inside a C<(?{...})> block, C<$_> refers to the string the regular
expression is matching against. You can also use C<pos()> to know what is
the current position of matching within this string.
-The C<code> is properly scoped in the following sense: If the assertion
-is backtracked (compare L<"Backtracking">), all changes introduced after
-C<local>ization are undone, so that
+The code block introduces a new scope from the perspective of lexical
+variable declarations, but B<not> from the perspective of C<local> and
+similar localizing behaviours. So later code blocks within the same
+pattern will still see the values which were localized in earlier blocks.
+These accumulated localizations are undone either at the end of a
+successful match, or if the assertion is backtracked (compare
+L<"Backtracking">). For example,
$_ = 'a' x 8;
m<
- (?{ $cnt = 0 }) # Initialize $cnt.
+ (?{ $cnt = 0 }) # Initialize $cnt.
(
a
(?{
- local $cnt = $cnt + 1; # Update $cnt, backtracking-safe.
+ local $cnt = $cnt + 1; # Update $cnt,
+ # backtracking-safe.
})
)*
aaaa
- (?{ $res = $cnt }) # On success copy to
- # non-localized location.
+ (?{ $res = $cnt }) # On success copy to
+ # non-localized location.
>x;
-will set C<$res = 4>. Note that after the match, C<$cnt> returns to the globally
-introduced value, because the scopes that restrict C<local> operators
-are unwound.
+will initially increment C<$cnt> up to 8; then during backtracking, its
+value will be unwound back to 4, which is the value assigned to C<$res>.
+At the end of the regex execution, C<$cnt> will be wound back to its initial
+value of 0.
+
+This assertion may be used as the condition in a
-This assertion may be used as a C<(?(condition)yes-pattern|no-pattern)>
-switch. If I<not> used in this way, the result of evaluation of
-C<code> is put into the special variable C<$^R>. This happens
-immediately, so C<$^R> can be used from other C<(?{ code })> assertions
-inside the same regular expression.
+ (?(condition)yes-pattern|no-pattern)
+
+switch. If I<not> used in this way, the result of evaluation of C<code>
+is put into the special variable C<$^R>. This happens immediately, so
+C<$^R> can be used from other C<(?{ code })> assertions inside the same
+regular expression.
The assignment to C<$^R> above is properly localized, so the old
value of C<$^R> is restored if the assertion is backtracked; compare
L<"Backtracking">.
-For reasons of security, this construct is forbidden if the regular
-expression involves run-time interpolation of variables, unless the
-perilous C<use re 'eval'> pragma has been used (see L<re>), or the
-variables contain results of the C<qr//> operator (see
-L<perlop/"qr/STRINGE<sol>msixpodual">).
+Note that the special variable C<$^N> is particularly useful with code
+blocks to capture the results of submatches in variables without having to
+keep track of the number of nested parentheses. For example:
-This restriction is due to the wide-spread and remarkably convenient
-custom of using run-time determined strings as patterns. For example:
+ $_ = "The brown fox jumps over the lazy dog";
+ /the (\S+)(?{ $color = $^N }) (\S+)(?{ $animal = $^N })/i;
+ print "color = $color, animal = $animal\n";
- $re = <>;
- chomp $re;
- $string =~ /$re/;
-
-Before Perl knew how to execute interpolated code within a pattern,
-this operation was completely safe from a security point of view,
-although it could raise an exception from an illegal pattern. If
-you turn on the C<use re 'eval'>, though, it is no longer secure,
-so you should only do so if you are also using taint checking.
-Better yet, use the carefully constrained evaluation within a Safe
-compartment. See L<perlsec> for details about both these mechanisms.
-
-B<WARNING>: Use of lexical (C<my>) variables in these blocks is
-broken. The result is unpredictable and will make perl unstable. The
-workaround is to use global (C<our>) variables.
-
-B<WARNING>: In perl 5.12.x and earlier, the regex engine
-was not re-entrant, so interpolated code could not
-safely invoke the regex engine either directly with
-C<m//> or C<s///>), or indirectly with functions such as
-C<split>. Invoking the regex engine in these blocks would make perl
-unstable.
=item C<(??{ code })>
X<(??{})>
has side effects may not perform identically from version to version
due to the effect of future optimisations in the regex engine.
-This is a "postponed" regular subexpression. The C<code> is evaluated
-at run time, at the moment this subexpression may match. The result
-of evaluation is considered a regular expression and matched as
-if it were inserted instead of this construct. Note that this means
-that the contents of capture groups defined inside an eval'ed pattern
-are not available outside of the pattern, and vice versa, there is no
-way for the inner pattern to refer to a capture group defined outside.
-Thus,
+This is a "postponed" regular subexpression. It behaves in I<exactly> the
+same way as a C<(?{ code })> code block as described above, except that
+its return value, rather than being assigned to C<$^R>, is treated as a
+pattern, compiled if it's a string (or used as-is if it's a C<qr//> object),
+then matched as if it were inserted instead of this construct.
- ('a' x 100)=~/(??{'(.)' x 100})/
+During the matching of this sub-pattern, it has its own set of
+captures which are valid during the sub-match, but are discarded once
+control returns to the main pattern. For example, the following matches,
+with the inner pattern capturing "B" and matching "BB", while the outer
+pattern captures "A":
+
+ my $inner = '(.)\1';
+ "ABBA" =~ /^(.)(??{ $inner })\1/;
+ print $1; # prints "A";
-B<will> match, it will B<not> set $1.
+Note that this means that there is no way for the inner pattern to refer
+to a capture group defined outside. (The code block itself can use C<$1>,
+etc., to refer to the enclosing pattern's capture groups.) Thus, although
-The C<code> is not interpolated. As before, the rules to determine
-where the C<code> ends are currently somewhat convoluted.
+ ('a' x 100)=~/(??{'(.)' x 100})/
+
+I<will> match, it will I<not> set $1 on exit.
The following pattern matches a parenthesized group:
- $re = qr{
- \(
- (?:
- (?> [^()]+ ) # Non-parens without backtracking
- |
- (??{ $re }) # Group with matching parens
- )*
- \)
- }x;
+ $re = qr{
+ \(
+ (?:
+ (?> [^()]+ ) # Non-parens without backtracking
+ |
+ (??{ $re }) # Group with matching parens
+ )*
+ \)
+ }x;
See also C<(?PARNO)> for a different, more efficient way to accomplish
the same task.
-For reasons of security, this construct is forbidden if the regular
-expression involves run-time interpolation of variables, unless the
-perilous C<use re 'eval'> pragma has been used (see L<re>), or the
-variables contain results of the C<qr//> operator (see
-L<perlop/"qrE<sol>STRINGE<sol>msixpodual">).
-
-In perl 5.12.x and earlier, because the regex engine was not re-entrant,
-delayed code could not safely invoke the regex engine either directly with
-C<m//> or C<s///>), or indirectly with functions such as C<split>.
-
-Recursing deeper than 50 times without consuming any input string will
-result in a fatal error. The maximum depth is compiled into perl, so
-changing it requires a custom build.
+Executing a postponed regular expression 50 times without consuming any
+input string will result in a fatal error. The maximum depth is compiled
+into perl, so changing it requires a custom build.
=item C<(?PARNO)> C<(?-PARNO)> C<(?+PARNO)> C<(?R)> C<(?0)>
X<(?PARNO)> X<(?1)> X<(?R)> X<(?0)> X<(?-1)> X<(?+1)> X<(?-PARNO)> X<(?+PARNO)>
X<regex, recursive> X<regexp, recursive> X<regular expression, recursive>
X<regex, relative recursion>
-Similar to C<(??{ code })> except it does not involve compiling any code,
-instead it treats the contents of a capture group as an independent
-pattern that must match at the current position. Capture groups
-contained by the pattern will have the value as determined by the
-outermost recursion.
+Similar to C<(??{ code })> except that it does not involve executing any
+code or potentially compiling a returned pattern string; instead it treats
+the part of the current pattern contained within a specified capture group
+as an independent pattern that must match at the current position.
+Capture groups contained by the pattern will have the value as determined
+by the outermost recursion.
PARNO is a sequence of digits (not starting with 0) whose value reflects
the paren-number of the capture group to recurse to. C<(?R)> recurses to
The following pattern matches a function foo() which may contain
balanced parentheses as the argument.
- $re = qr{ ( # paren group 1 (full function)
+ $re = qr{ ( # paren group 1 (full function)
foo
- ( # paren group 2 (parens)
+ ( # paren group 2 (parens)
\(
- ( # paren group 3 (contents of parens)
+ ( # paren group 3 (contents of parens)
(?:
- (?> [^()]+ ) # Non-parens without backtracking
+ (?> [^()]+ ) # Non-parens without backtracking
|
- (?2) # Recurse to start of paren group 2
+ (?2) # Recurse to start of paren group 2
)*
)
\)
for later use:
my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;
- if (/foo $parens \s+ + \s+ bar $parens/x) {
+ if (/foo $parens \s+ \+ \s+ bar $parens/x) {
# do something here...
}
a true value, matches C<no-pattern> otherwise. A missing pattern always
matches.
-C<(condition)> should be either an integer in
+C<(condition)> should be one of: 1) an integer in
parentheses (which is valid if the corresponding pair of parentheses
-matched), a look-ahead/look-behind/evaluate zero-width assertion, a
+matched); 2) a look-ahead/look-behind/evaluate zero-width assertion; 3) a
name in angle brackets or single quotes (which is valid if a group
-with the given name matched), or the special symbol (R) (true when
+with the given name matched); or 4) the special symbol (R) (true when
evaluated inside of recursion or eval). Additionally the R may be
followed by a number, (which will be true when evaluated when recursing
inside of the appropriate group), or by C<&NAME>, in which case it will
necessary. Thus C<$+{NAME_PAT}> would not be defined even though
C<$+{NAME}> would be.
+Finally, keep in mind that subpatterns created inside a DEFINE block
+count towards the absolute and relative number of captures, so this:
+
+ my @captures = "a" =~ /(.) # First capture
+ (?(DEFINE)
+ (?<EXAMPLE> 1 ) # Second capture
+ )/x;
+ say scalar @captures;
+
+will output 2, not 1. This is particularly important if you intend to
+compile the definitions with the C<qr//> operator, and later
+interpolate them in another pattern.
+
=item C<< (?>pattern) >>
X<backtrack> X<backtracking> X<atomic> X<possessive>
PAT?+ (?>PAT?)
PAT{min,max}+ (?>PAT{min,max})
+=item C<(?[ ])>
+
+See L<perlrecharclass/Extended Bracketed Character Classes>.
+
=back
=head2 Special Backtracking Control Verbs
If a pattern does not contain a special backtracking verb that allows an
argument, then C<$REGERROR> and C<$REGMARK> are not touched at all.
-=over 4
+=over 3
=item Verbs that take an argument
C<(*PRUNE)> can be used to handle cases that cannot be expressed using a
C<< (?>pattern) >> alone.
-
=item C<(*SKIP)> C<(*SKIP:NAME)>
X<(*SKIP)>
Compare the following to the examples in C<(*PRUNE)>; note the string
is twice as long:
- 'aaabaaab' =~ /a+b?(*SKIP)(?{print "$&\n"; $count++})(*FAIL)/;
- print "Count=$count\n";
+ 'aaabaaab' =~ /a+b?(*SKIP)(?{print "$&\n"; $count++})(*FAIL)/;
+ print "Count=$count\n";
outputs
C<(*SKIP)> was executed.
=item C<(*MARK:NAME)> C<(*:NAME)>
-X<(*MARK)> C<(*MARK:NAME)> C<(*:NAME)>
+X<(*MARK)> X<(*MARK:NAME)> X<(*:NAME)>
This zero-width pattern can be used to mark the point reached in a string
when a certain part of the pattern has been successfully matched. This
variable will be set to the name of the most recently executed
C<(*MARK:NAME)>.
-See C<(*SKIP)> for more details.
+See L</(*SKIP)> for more details.
As a shortcut C<(*MARK:NAME)> can be written C<(*:NAME)>.
=item C<(*THEN)> C<(*THEN:NAME)>
-This is similar to the "cut group" operator C<::> from Perl 6. Like
+This is similar to the "cut group" operator C<::> from Perl 6. Like
C<(*PRUNE)>, this verb always matches, and when backtracked into on
failure, it causes the regex engine to try the next alternation in the
-innermost enclosing group (capturing or otherwise).
+innermost enclosing group (capturing or otherwise) that has alternations.
+The two branches of a C<(?(condition)yes-pattern|no-pattern)> do not
+count as an alternation, as far as C<(*THEN)> is concerned.
Its name comes from the observation that this operation combined with the
alternation operator (C<|>) can be used to create what is essentially a
but
- / ( A (*THEN) B | C (*THEN) D ) /
+ / ( A (*THEN) B | C ) /
is not the same as
- / ( A (*PRUNE) B | C (*PRUNE) D ) /
+ / ( A (*PRUNE) B | C ) /
as after matching the A but failing on the B the C<(*THEN)> verb will
backtrack and try C; but the C<(*PRUNE)> verb will simply fail.
to find a valid match by advancing the start pointer will occur again.
For example,
- 'aaabaaab' =~ /a+b?(*COMMIT)(?{print "$&\n"; $count++})(*FAIL)/;
- print "Count=$count\n";
+ 'aaabaaab' =~ /a+b?(*COMMIT)(?{print "$&\n"; $count++})(*FAIL)/;
+ print "Count=$count\n";
outputs
be significantly simplified by using repeated subexpressions that
may match zero-length substrings. Here's a simple example being:
- @chars = split //, $string; # // is not magic in split
+ @chars = split //, $string; # // is not magic in split
($whitewashed = $string) =~ s/()/ /g; # parens avoid magic s// /
Thus Perl allows such constructs, by I<forcefully breaking