X</m> X<regex, multiline> X<regexp, multiline> X<regular expression, multiline>
Treat string as multiple lines. That is, change "^" and "$" from matching
-the start or end of the string to matching the start or end of any
-line anywhere within the string.
+the start or end of line only at the left and right ends of the string to
+matching them anywhere within the string.
=item s
X</s> X<regex, single-line> X<regexp, single-line>
C<sS>, and C<ss>, otherwise not.
This modifier may be specified to be the default by C<use feature
-'unicode_strings> or C<L<use 5.012|perlfunc/use VERSION>> (or higher),
+'unicode_strings'>, C<use locale ':not_characters'>, or
+C<L<use 5.012|perlfunc/use VERSION>> (or higher),
but see L</Which character set modifier is in effect?>.
X</u>
listed below that also change the defaults.
Otherwise, C<L<use locale|perllocale>> sets the default modifier to C</l>;
-and C<L<use feature 'unicode_strings|feature>> or
+and C<L<use feature 'unicode_strings'|feature>> or
C<L<use 5.012|perlfunc/use VERSION>> (or higher) set the default to
C</u> when not in the same scope as either C<L<use locale|perllocale>>
-or C<L<use bytes|bytes>>. Unlike the mechanisms mentioned above, these
+or C<L<use bytes|bytes>>.
+(C<L<use locale ':not_characters'|perllocale/Unicode and UTF-8>> also
+sets the default to C</u>, overriding any plain C<use locale>.)
+Unlike the mechanisms mentioned above, these
affect operations besides regular expressions pattern matching, and so
give more consistent results with other operators, including using
C<\U>, C<\l>, etc. in substitution replacements.
the string), and "$" will match before any newline. At the
cost of a little more overhead, you can do this by using the /m modifier
on the pattern match operator. (Older programs did this by setting C<$*>,
-but this option was removed in perl 5.9.)
+but this option was removed in perl 5.10.)
X<^> X<$> X</m>
To simplify multi-line substitutions, the "." character never matches a
(If a curly bracket occurs in any other context and does not form part of
a backslashed sequence like C<\x{...}>, it is treated
-as a regular character. In particular, the lower bound
-is not optional.) The "*" quantifier is equivalent to C<{0,}>, the "+"
+as a regular character. In particular, the lower quantifier bound
+is not optional. However, in Perl v5.18, it is planned to issue a
+deprecation warning for all such occurrences, and in Perl v5.20 to
+require literal uses of a curly bracket to be escaped, say by preceding
+them with a backslash or enclosing them within square brackets (C<"\{">
+or C<"[{]">). This change will allow for future syntax extensions (like
+making the lower bound of a quantifier optional), and better error
+checking of quantifiers. Currently, a typo in a quantifier silently causes
+it to be treated as the literal characters. For example,
+
+ /o{4,3}/
+
+looks like a quantifier that matches 0 times, since 4 is greater than 3,
+but it really means to match the sequence of six characters
+S<C<"o { 4 , 3 }">>.)
+
+The "*" quantifier is equivalent to C<{0,}>, the "+"
quantifier to C<{1,}>, and the "?" quantifier to C<{0,1}>. n and m are limited
to non-negative integral values less than a preset limit defined when perl is built.
This is usually 32766 on the most common platforms. The actual limit can
I<need> to use literal backslashes within C<\Q...\E>,
consult L<perlop/"Gory details of parsing quoted constructs">.
+C<quotemeta()> and C<\Q> are fully described in L<perlfunc/quotemeta>.
+
=head2 Extended Patterns
Perl also defines a consistent extension syntax for features not
modifier outside this group.
These modifiers do not carry over into named subpatterns called in the
-enclosing group. In other words, a pattern such as C<((?i)(&NAME))> does not
+enclosing group. In other words, a pattern such as C<((?i)(?&NAME))> does not
change the case-sensitivity of the "NAME" pattern.
Any of these modifiers can be set to apply globally to all regular
B<WARNING>: This extended regular expression feature is considered
experimental, and may be changed without notice. Code executed that
has side effects may not perform identically from version to version
-due to the effect of future optimisations in the regex engine.
+due to the effect of future optimisations in the regex engine. The
+implementation of this feature was radically overhauled for the 5.18.0
+release, and its behaviour in earlier versions of perl was much buggier,
+especially in relation to parsing, lexical vars, scoping, recursion and
+reentrancy.
-This zero-width assertion evaluates any embedded Perl code. It
-always succeeds, and its C<code> is not interpolated. Currently,
-the rules to determine where the C<code> ends are somewhat convoluted.
+This zero-width assertion executes any embedded Perl code. It always
+succeeds, and its return value is set as C<$^R>.
-This feature can be used together with the special variable C<$^N> to
-capture the results of submatches in variables without having to keep
-track of the number of nested parentheses. For example:
+In literal patterns, the code is parsed at the same time as the
+surrounding code. While within the pattern, control is passed temporarily
+back to the perl parser, until the logically-balancing closing brace is
+encountered. This is similar to the way that an array index expression in
+a literal string is handled, for example:
- $_ = "The brown fox jumps over the lazy dog";
- /the (\S+)(?{ $color = $^N }) (\S+)(?{ $animal = $^N })/i;
- print "color = $color, animal = $animal\n";
+ "abc$array[ 1 + f('[') + g()]def"
+
+In particular, braces do not need to be balanced:
+
+ s/abc(?{ f('{'); })/def/
+
+Even in a pattern that is interpolated and compiled at run-time, literal
+code blocks will be compiled once, at perl compile time; the following
+prints "ABCD":
+
+ print "D";
+ my $qr = qr/(?{ BEGIN { print "A" } })/;
+ my $foo = "foo";
+ /$foo$qr(?{ BEGIN { print "B" } })/;
+ BEGIN { print "C" }
+
+In patterns where the text of the code is derived from run-time
+information rather than appearing literally in a source code /pattern/,
+the code is compiled at the same time that the pattern is compiled, and
+for reasons of security, C<use re 'eval'> must be in scope. This is to
+stop user-supplied patterns containing code snippets from being
+executable.
+
+In situations where you need to enable this with C<use re 'eval'>, you should
+also have taint checking enabled. Better yet, use the carefully
+constrained evaluation within a Safe compartment. See L<perlsec> for
+details about both these mechanisms.
+
+From the viewpoint of parsing, lexical variable scope and closures,
+
+ /AAA(?{ BBB })CCC/
+
+behaves approximately like
+
+ /AAA/ && do { BBB } && /CCC/
+
+Similarly,
+
+ qr/AAA(?{ BBB })CCC/
+
+behaves approximately like
+
+ sub { /AAA/ && do { BBB } && /CCC/ }
-Inside the C<(?{...})> block, C<$_> refers to the string the regular
+In particular:
+
+ { my $i = 1; $r = qr/(?{ print $i })/ }
+ my $i = 2;
+ /$r/; # prints "1"
+
+Inside a C<(?{...})> block, C<$_> refers to the string the regular
expression is matching against. You can also use C<pos()> to know what is
the current position of matching within this string.
-The C<code> is properly scoped in the following sense: If the assertion
-is backtracked (compare L<"Backtracking">), all changes introduced after
-C<local>ization are undone, so that
+The code block introduces a new scope from the perspective of lexical
+variable declarations, but B<not> from the perspective of C<local> and
+similar localizing behaviours. So later code blocks within the same
+pattern will still see the values which were localized in earlier blocks.
+These accumulated localizations are undone either at the end of a
+successful match, or if the assertion is backtracked (compare
+L<"Backtracking">). For example,
$_ = 'a' x 8;
m<
# non-localized location.
>x;
-will set C<$res = 4>. Note that after the match, C<$cnt> returns to the globally
-introduced value, because the scopes that restrict C<local> operators
-are unwound.
+will initially increment C<$cnt> up to 8; then during backtracking, its
+value will be unwound back to 4, which is the value assigned to C<$res>.
+At the end of the regex execution, C<$cnt> will be wound back to its initial
+value of 0.
+
+This assertion may be used as the condition in a
-This assertion may be used as a C<(?(condition)yes-pattern|no-pattern)>
-switch. If I<not> used in this way, the result of evaluation of
-C<code> is put into the special variable C<$^R>. This happens
-immediately, so C<$^R> can be used from other C<(?{ code })> assertions
-inside the same regular expression.
+ (?(condition)yes-pattern|no-pattern)
+
+switch. If I<not> used in this way, the result of evaluation of C<code>
+is put into the special variable C<$^R>. This happens immediately, so
+C<$^R> can be used from other C<(?{ code })> assertions inside the same
+regular expression.
The assignment to C<$^R> above is properly localized, so the old
value of C<$^R> is restored if the assertion is backtracked; compare
L<"Backtracking">.
-For reasons of security, this construct is forbidden if the regular
-expression involves run-time interpolation of variables, unless the
-perilous C<use re 'eval'> pragma has been used (see L<re>), or the
-variables contain results of the C<qr//> operator (see
-L<perlop/"qr/STRINGE<sol>msixpodual">).
+Note that the special variable C<$^N> is particularly useful with code
+blocks to capture the results of submatches in variables without having to
+keep track of the number of nested parentheses. For example:
-This restriction is due to the wide-spread and remarkably convenient
-custom of using run-time determined strings as patterns. For example:
+ $_ = "The brown fox jumps over the lazy dog";
+ /the (\S+)(?{ $color = $^N }) (\S+)(?{ $animal = $^N })/i;
+ print "color = $color, animal = $animal\n";
- $re = <>;
- chomp $re;
- $string =~ /$re/;
-
-Before Perl knew how to execute interpolated code within a pattern,
-this operation was completely safe from a security point of view,
-although it could raise an exception from an illegal pattern. If
-you turn on the C<use re 'eval'>, though, it is no longer secure,
-so you should only do so if you are also using taint checking.
-Better yet, use the carefully constrained evaluation within a Safe
-compartment. See L<perlsec> for details about both these mechanisms.
-
-B<WARNING>: Use of lexical (C<my>) variables in these blocks is
-broken. The result is unpredictable and will make perl unstable. The
-workaround is to use global (C<our>) variables.
-
-B<WARNING>: In perl 5.12.x and earlier, the regex engine
-was not re-entrant, so interpolated code could not
-safely invoke the regex engine either directly with
-C<m//> or C<s///>), or indirectly with functions such as
-C<split>. Invoking the regex engine in these blocks would make perl
-unstable.
=item C<(??{ code })>
X<(??{})>
has side effects may not perform identically from version to version
due to the effect of future optimisations in the regex engine.
-This is a "postponed" regular subexpression. The C<code> is evaluated
-at run time, at the moment this subexpression may match. The result
-of evaluation is considered a regular expression and matched as
-if it were inserted instead of this construct. Note that this means
-that the contents of capture groups defined inside an eval'ed pattern
-are not available outside of the pattern, and vice versa, there is no
-way for the inner pattern to refer to a capture group defined outside.
-Thus,
+This is a "postponed" regular subexpression. It behaves in I<exactly> the
+same way as a C<(?{ code })> code block as described above, except that
+its return value, rather than being assigned to C<$^R>, is treated as a
+pattern, compiled if it's a string (or used as-is if it's a C<qr//> object),
+then matched as if it were inserted instead of this construct.
- ('a' x 100)=~/(??{'(.)' x 100})/
+During the matching of this sub-pattern, it has its own set of
+captures which are valid during the sub-match, but are discarded once
+control returns to the main pattern. For example, the following matches,
+with the inner pattern capturing "B" and matching "BB", while the outer
+pattern captures "A":
-B<will> match, it will B<not> set $1.
+ my $inner = '(.)\1';
+ "ABBA" =~ /^(.)(??{ $inner })\1/;
+ print $1; # prints "A";
-The C<code> is not interpolated. As before, the rules to determine
-where the C<code> ends are currently somewhat convoluted.
+Note that this means that there is no way for the inner pattern to refer
+to a capture group defined outside. (The code block itself can use C<$1>,
+etc., to refer to the enclosing pattern's capture groups.) Thus, although
+
+ ('a' x 100)=~/(??{'(.)' x 100})/
+
+I<will> match, it will I<not> set C<$1> on exit.
The following pattern matches a parenthesized group:
See also C<(?PARNO)> for a different, more efficient way to accomplish
the same task.
-For reasons of security, this construct is forbidden if the regular
-expression involves run-time interpolation of variables, unless the
-perilous C<use re 'eval'> pragma has been used (see L<re>), or the
-variables contain results of the C<qr//> operator (see
-L<perlop/"qrE<sol>STRINGE<sol>msixpodual">).
-
-In perl 5.12.x and earlier, because the regex engine was not re-entrant,
-delayed code could not safely invoke the regex engine either directly with
-C<m//> or C<s///>), or indirectly with functions such as C<split>.
-
-Recursing deeper than 50 times without consuming any input string will
-result in a fatal error. The maximum depth is compiled into perl, so
-changing it requires a custom build.
+Executing a postponed regular expression 50 times without consuming any
+input string will result in a fatal error. The maximum depth is compiled
+into perl, so changing it requires a custom build.
=item C<(?PARNO)> C<(?-PARNO)> C<(?+PARNO)> C<(?R)> C<(?0)>
X<(?PARNO)> X<(?1)> X<(?R)> X<(?0)> X<(?-1)> X<(?+1)> X<(?-PARNO)> X<(?+PARNO)>
X<regex, recursive> X<regexp, recursive> X<regular expression, recursive>
X<regex, relative recursion>
-Similar to C<(??{ code })> except it does not involve compiling any code,
-instead it treats the contents of a capture group as an independent
-pattern that must match at the current position. Capture groups
-contained by the pattern will have the value as determined by the
-outermost recursion.
+Similar to C<(??{ code })> except that it does not involve executing any
+code or potentially compiling a returned pattern string; instead it treats
+the part of the current pattern contained within a specified capture group
+as an independent pattern that must match at the current position.
+Capture groups contained by the pattern will have the value as determined
+by the outermost recursion.
PARNO is a sequence of digits (not starting with 0) whose value reflects
the paren-number of the capture group to recurse to. C<(?R)> recurses to
for later use:
my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;
- if (/foo $parens \s+ + \s+ bar $parens/x) {
+ if (/foo $parens \s+ \+ \s+ bar $parens/x) {
# do something here...
}
a true value, matches C<no-pattern> otherwise. A missing pattern always
matches.
-C<(condition)> should be either an integer in
+C<(condition)> should be one of: 1) an integer in
parentheses (which is valid if the corresponding pair of parentheses
-matched), a look-ahead/look-behind/evaluate zero-width assertion, a
+matched); 2) a look-ahead/look-behind/evaluate zero-width assertion; 3) a
name in angle brackets or single quotes (which is valid if a group
-with the given name matched), or the special symbol (R) (true when
+with the given name matched); or 4) the special symbol (R) (true when
evaluated inside of recursion or eval). Additionally the R may be
followed by a number, (which will be true when evaluated when recursing
inside of the appropriate group), or by C<&NAME>, in which case it will
necessary. Thus C<$+{NAME_PAT}> would not be defined even though
C<$+{NAME}> would be.
+Finally, keep in mind that subpatterns created inside a DEFINE block
+count towards the absolute and relative number of captures, so this:
+
+ my @captures = "a" =~ /(.) # First capture
+ (?(DEFINE)
+ (?<EXAMPLE> 1 ) # Second capture
+ )/x;
+ say scalar @captures;
+
+will output 2, not 1. This is particularly important if you intend to
+compile the definitions with the C<qr//> operator, and later
+interpolate them in another pattern.
+
=item C<< (?>pattern) >>
X<backtrack> X<backtracking> X<atomic> X<possessive>
but
- / ( A (*THEN) B | C (*THEN) D ) /
+ / ( A (*THEN) B | C ) /
is not the same as
- / ( A (*PRUNE) B | C (*PRUNE) D ) /
+ / ( A (*PRUNE) B | C ) /
as after matching the A but failing on the B the C<(*THEN)> verb will
backtrack and try C; but the C<(*PRUNE)> verb will simply fail.