X</m> X<regex, multiline> X<regexp, multiline> X<regular expression, multiline>
Treat string as multiple lines. That is, change "^" and "$" from matching
-the start or end of the string to matching the start or end of any
-line anywhere within the string.
+the start or end of line only at the left and right ends of the string to
+matching them anywhere within the string.
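+
+For instance, in this minimal sketch (the sample string is made up for
+illustration):
+
+ my $text = "one\ntwo\nthree";
+ my @whole = $text =~ /^(\w+)$/g;    # empty list: anchors see one string
+ my @lines = $text =~ /^(\w+)$/mg;   # ("one", "two", "three")
+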
=item s
X</s> X<regex, single-line> X<regexp, single-line>
Do case-insensitive pattern matching.
-If C<use locale> is in effect, the case map is taken from the current
-locale. See L<perllocale>.
+If locale matching rules are in effect, the case map is taken from the
+current
+locale for code points less than 255, and from Unicode rules for larger
+code points. However, matches that would cross the Unicode
+rules/non-Unicode rules boundary (ords 255/256) will not succeed. See
+L<perllocale>.
+
+There are a number of Unicode characters that match multiple characters
+under C</i>. For example, C<LATIN SMALL LIGATURE FI>
+should match the sequence C<fi>. Perl is not
+currently able to do this when the multiple characters are in the pattern and
+are split between groupings, or when one or more are quantified. Thus
+
+ "\N{LATIN SMALL LIGATURE FI}" =~ /fi/i; # Matches
+ "\N{LATIN SMALL LIGATURE FI}" =~ /[fi][fi]/i; # Doesn't match!
+ "\N{LATIN SMALL LIGATURE FI}" =~ /fi*/i; # Doesn't match!
+
+ # The below doesn't match, and it isn't clear what $1 and $2 would
+ # be even if it did!!
+ "\N{LATIN SMALL LIGATURE FI}" =~ /(f)(i)/i; # Doesn't match!
+
+Perl doesn't match multiple characters in an inverted bracketed
+character class, which otherwise could be highly confusing. See
+L<perlrecharclass/Negation>.
+
+Another bug involves character classes that match both a sequence of
+multiple characters, and an initial sub-string of that sequence. For
+example,
+
+ /[s\xDF]/i
+
+should match both a single and a double "s", since C<\xDF> (on ASCII
+platforms) matches "ss". However, this bug
+(L<[perl #89774]|https://rt.perl.org/rt3/Ticket/Display.html?id=89774>)
+causes it to only match a single "s", even if the final larger match
+fails, and matching the double "ss" would have succeeded.
+
+Also, Perl matching doesn't fully conform to the current Unicode C</i>
+recommendations, which ask that the matching be made upon the NFD
+(Normalization Form Decomposed) of the text. However, Unicode is
+in the process of reconsidering and revising their recommendations.
=item x
X</x>
Extend your pattern's legibility by permitting whitespace and comments.
+Details in L</"/x">
=item p
X</p> X<regex, preserve> X<regexp, preserve>
L<perlretut/"Using regular expressions in Perl"> for further explanation
of the g and c modifiers.
+=item a, d, l and u
+X</a> X</d> X</l> X</u>
+
+These modifiers, all new in 5.14, affect which character-set semantics
+(Unicode, etc.) are used, as described below in
+L</Character set modifiers>.
+
=back
-These are usually written as "the C</x> modifier", even though the delimiter
-in question might not really be a slash. The modifiers C</imsx>
+Regular expression modifiers are usually written in documentation
+as e.g., "the C</x> modifier", even though the delimiter
+in question might not really be a slash. The modifiers C</imsxadlup>
may also be embedded within the regular expression itself using
-the C<(?...)> construct. Also are new (in 5.14) character set semantics
-modifiers B<C<<"a">>, B<C<"d">>, B<C<"l">> and B<C<"u">>, which, in 5.14
-only, must be used embedded in the regular expression, and not after the
-trailing delimiter. All this is discussed below in
-L</Extended Patterns>.
-X</a> X</d> X</l> X</u>
+the C<(?...)> construct, see L</Extended Patterns> below.
+
+=head3 /x
-The C</x> modifier itself needs a little more explanation. It tells
+C</x> tells
the regular expression parser to ignore most whitespace that is neither
backslashed nor within a character class. You can use this to break up
your regular expression into (slightly) more readable parts. The C<#>
no way of knowing you did not intend to close the pattern early. See
the C-comment deletion code in L<perlop>. Also note that anything inside
a C<\Q...\E> stays unaffected by C</x>. And note that C</x> doesn't affect
-whether space interpretation within a single multi-character construct. For
+space interpretation within a single multi-character construct. For
example in C<\x{...}>, regardless of the C</x> modifier, there can be no
spaces. Same for a L<quantifier|/Quantifiers> such as C<{3}> or
C<{5,}>. Similarly, C<(?:...)> can't have a space between the C<?> and C<:>,
construct, allowed spaces are not affected by C</x>, and depend on the
construct. For example, C<\x{...}> can't have spaces because hexadecimal
numbers don't have spaces in them. But, Unicode properties can have spaces, so
-in C<\p{...}> there can be spaces that follow the Unicode rules, for which see
+in C<\p{...}> there can be spaces that follow the Unicode rules, for which see
L<perluniprops/Properties accessible through \p{} and \P{}>.
X</x>
+=head3 Character set modifiers
+
+C</d>, C</u>, C</a>, and C</l>, available starting in 5.14, are called
+the character set modifiers; they affect the character set semantics
+used for the regular expression.
+
+The C</d>, C</u>, and C</l> modifiers are not likely to be of much use
+to you, and so you need not worry about them very much. They exist for
+Perl's internal use, so that complex regular expression data structures
+can be automatically serialized and later exactly reconstituted,
+including all their nuances. But, since Perl can't keep a secret, and
+there may be rare instances where they are useful, they are documented
+here.
+
+The C</a> modifier, on the other hand, may be useful. Its purpose is to
+allow code that is to work mostly on ASCII data to not have to concern
+itself with Unicode.
+
+Briefly, C</l> sets the character set to that of whatever B<L>ocale is in
+effect at the time of the execution of the pattern match.
+
+C</u> sets the character set to B<U>nicode.
+
+C</a> also sets the character set to Unicode, BUT adds several
+restrictions for B<A>SCII-safe matching.
+
+C</d> is the old, problematic, pre-5.14 B<D>efault character set
+behavior. Its only use is to force that old behavior.
+
+At any given time, exactly one of these modifiers is in effect. Their
+existence allows Perl to keep the originally compiled behavior of a
+regular expression, regardless of what rules are in effect when it is
+actually executed. And if it is interpolated into a larger regex, the
+original's rules continue to apply to it, and only it.
+
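+As an illustrative sketch (the variable name is hypothetical), a pattern
+compiled with C</a> keeps its ASCII-only C<\d> even when interpolated
+into a pattern compiled with C</u>:
+
+ my $ascii_digits = qr/\d+/a;
+ "\x{0663}" =~ /^\d+$/u;              # Matches (ARABIC-INDIC DIGIT THREE)
+ "\x{0663}" =~ /^$ascii_digits$/u;    # Doesn't match
+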
+The C</l> and C</u> modifiers are automatically selected for
+regular expressions compiled within the scope of various pragmas,
+and we recommend that in general, you use those pragmas instead of
+specifying these modifiers explicitly. For one thing, the modifiers
+affect only pattern matching, and do not extend even to any replacement
+done, whereas using the pragmas gives consistent results for all
+appropriate operations within their scopes. For example,
+
+ s/foo/\Ubar/il
+
+will match "foo" using the locale's rules for case-insensitive matching,
+but the C</l> does not affect how the C<\U> operates. Most likely you
+want both of them to use locale rules. To do this, instead compile the
+regular expression within the scope of C<use locale>. This both
+implicitly adds the C</l> and applies locale rules to the C<\U>. The
+lesson is to C<use locale> and not C</l> explicitly.
+
+Similarly, it would be better to use C<use feature 'unicode_strings'>
+instead of,
+
+ s/foo/\Lbar/iu
+
+to get Unicode rules, as the C<\L> in the former (but not necessarily
+the latter) would also use Unicode rules.
+
+More detail on each of the modifiers follows. Most likely you don't
+need to know this detail for C</l>, C</u>, and C</d>, and can skip ahead
+to L<E<sol>a|/E<sol>a (and E<sol>aa)>.
+
+=head4 /l
+
+means to use the current locale's rules (see L<perllocale>) when pattern
+matching. For example, C<\w> will match the "word" characters of that
+locale, and C<"/i"> case-insensitive matching will match according to
+the locale's case folding rules. The locale used will be the one in
+effect at the time of execution of the pattern match. This may not be
+the same as the compilation-time locale, and can differ from one match
+to another if there is an intervening call of the
+L<setlocale() function|perllocale/The setlocale function>.
+
+Perl only supports single-byte locales. This means that code points
+above 255 are treated as Unicode no matter what locale is in effect.
+Under Unicode rules, there are a few case-insensitive matches that cross
+the 255/256 boundary. These are disallowed under C</l>. For example,
+0xFF (on ASCII platforms) does not caselessly match the character at
+0x178, C<LATIN CAPITAL LETTER Y WITH DIAERESIS>, because 0xFF may not be
+C<LATIN SMALL LETTER Y WITH DIAERESIS> in the current locale, and Perl
+has no way of knowing if that character even exists in the locale, much
+less what code point it is.
+
+This modifier may be specified to be the default by C<use locale>, but
+see L</Which character set modifier is in effect?>.
+X</l>
+
+=head4 /u
+
+means to use Unicode rules when pattern matching. On ASCII platforms,
+this means that the code points between 128 and 255 take on their
+Latin-1 (ISO-8859-1) meanings (which are the same as Unicode's).
+(Otherwise Perl considers their meanings to be undefined.) Thus,
+under this modifier, the ASCII platform effectively becomes a Unicode
+platform; and hence, for example, C<\w> will match any of the more than
+100_000 word characters in Unicode.
+
+Unlike most locales, which are specific to a language and country pair,
+Unicode classifies all the characters that are letters I<somewhere> in
+the world as
+C<\w>. For example, your locale might not think that C<LATIN SMALL
+LETTER ETH> is a letter (unless you happen to speak Icelandic), but
+Unicode does. Similarly, all the characters that are decimal digits
+somewhere in the world will match C<\d>; this is hundreds, not 10,
+possible matches. And some of those digits look like some of the 10
+ASCII digits, but mean a different number, so a human could easily think
+a number is a different quantity than it really is. For example,
+C<BENGALI DIGIT FOUR> (U+09EA) looks very much like an
+C<ASCII DIGIT EIGHT> (U+0038). And C<\d+> may match strings of digits
+that are a mixture from different writing systems, creating a security
+issue. L<Unicode::UCD/num()> can be used to sort
+this out. Or the C</a> modifier can be used to force C<\d> to match
+just the ASCII 0 through 9.
+
+Also, under this modifier, case-insensitive matching works on the full
+set of Unicode
+characters. The C<KELVIN SIGN>, for example, matches the letters "k" and
+"K"; and C<LATIN SMALL LIGATURE FF> matches the sequence "ff", which,
+if you're not prepared, might make it look like a hexadecimal constant,
+presenting another potential security issue. See
+L<http://unicode.org/reports/tr36> for a detailed discussion of Unicode
+security issues.
+
+On the EBCDIC platforms that Perl handles, the native character set is
+equivalent to Latin-1. Thus this modifier changes behavior only when
+the C<"/i"> modifier is also specified, and it turns out it affects only
+two characters, giving them full Unicode semantics: the C<MICRO SIGN>
+will match the Greek capital and small letters C<MU>, otherwise not; and
+the C<LATIN CAPITAL LETTER SHARP S> will match any of C<SS>, C<Ss>,
+C<sS>, and C<ss>, otherwise not.
+
+This modifier may be specified to be the default by C<use feature
+'unicode_strings'>, C<use locale ':not_characters'>, or
+C<L<use 5.012|perlfunc/use VERSION>> (or higher),
+but see L</Which character set modifier is in effect?>.
+X</u>
+
+=head4 /d
+
+This modifier means to use the "Default" native rules of the platform
+except when there is cause to use Unicode rules instead, as follows:
+
+=over 4
+
+=item 1
+
+the target string is encoded in UTF-8; or
+
+=item 2
+
+the pattern is encoded in UTF-8; or
+
+=item 3
+
+the pattern explicitly mentions a code point that is above 255 (say by
+C<\x{100}>); or
+
+=item 4
+
+the pattern uses a Unicode name (C<\N{...}>); or
+
+=item 5
+
+the pattern uses a Unicode property (C<\p{...}>)
+
+=back
+
+Another mnemonic for this modifier is "Depends", as the rules actually
+used depend on various things, and as a result you can get unexpected
+results. See L<perlunicode/The "Unicode Bug">. The Unicode Bug has
+become rather infamous, leading to yet another (printable) name for this
+modifier, "Dodgy".
+
+On ASCII platforms, the native rules are ASCII, and on EBCDIC platforms
+(at least the ones that Perl handles), they are Latin-1.
+
+Here are some examples of how that works on an ASCII platform:
+
+ $str = "\xDF"; # $str is not in UTF-8 format.
+ $str =~ /^\w/; # No match, as $str isn't in UTF-8 format.
+ $str .= "\x{0e0b}"; # Now $str is in UTF-8 format.
+ $str =~ /^\w/; # Match! $str is now in UTF-8 format.
+ chop $str;
+ $str =~ /^\w/; # Still a match! $str remains in UTF-8 format.
+
+This modifier is automatically selected by default when none of the
+others are, so yet another name for it is "Default".
+
+Because of the unexpected behaviors associated with this modifier, you
+probably should only use it to maintain weird backward compatibilities.
+
+=head4 /a (and /aa)
+
+This modifier stands for ASCII-restrict (or ASCII-safe). This modifier,
+unlike the others, may be doubled-up to increase its effect.
+
+When it appears singly, it causes the sequences C<\d>, C<\s>, C<\w>, and
+the Posix character classes to match only in the ASCII range. They thus
+revert to their pre-5.6, pre-Unicode meanings. Under C</a>, C<\d>
+always means precisely the digits C<"0"> to C<"9">; C<\s> means the five
+characters C<[ \f\n\r\t]>; C<\w> means the 63 characters
+C<[A-Za-z0-9_]>; and likewise, all the Posix classes such as
+C<[[:print:]]> match only the appropriate ASCII-range characters.
+
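+A minimal sketch of the difference (the code point is chosen only for
+illustration):
+
+ my $four = "\x{09EA}";    # BENGALI DIGIT FOUR
+ $four =~ /^\d$/u;         # Matches
+ $four =~ /^\d$/a;         # Doesn't match
+ $four =~ /^\p{Digit}$/a;  # Matches; \p{} is not restricted by /a
+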
+This modifier is useful for people who only incidentally use Unicode,
+and who do not wish to be burdened with its complexities and security
+concerns.
+
+With C</a>, one can write C<\d> with confidence that it will only match
+ASCII characters, and should the need arise to match beyond ASCII, you
+can instead use C<\p{Digit}> (or C<\p{Word}> for C<\w>). There are
+similar C<\p{...}> constructs that can match beyond ASCII both white
+space (see L<perlrecharclass/Whitespace>), and Posix classes (see
+L<perlrecharclass/POSIX Character Classes>). Thus, this modifier
+doesn't mean you can't use Unicode, it means that to get Unicode
+matching you must explicitly use a construct (C<\p{}>, C<\P{}>) that
+signals Unicode.
+
+As you would expect, this modifier causes, for example, C<\D> to mean
+the same thing as C<[^0-9]>; in fact, all non-ASCII characters match
+C<\D>, C<\S>, and C<\W>. C<\b> still means to match at the boundary
+between C<\w> and C<\W>, using the C</a> definitions of them (similarly
+for C<\B>).
+
+Otherwise, C</a> behaves like the C</u> modifier, in that
+case-insensitive matching uses Unicode semantics; for example, "k" will
+match the Unicode C<\N{KELVIN SIGN}> under C</i> matching, and code
+points in the Latin1 range, above ASCII, will have Unicode rules when it
+comes to case-insensitive matching.
+
+To forbid ASCII/non-ASCII matches (like "k" with C<\N{KELVIN SIGN}>),
+specify the "a" twice, for example C</aai> or C</aia>. (The first
+occurrence of "a" restricts the C<\d>, etc., and the second occurrence
+adds the C</i> restrictions.) But, note that code points outside the
+ASCII range will use Unicode rules for C</i> matching, so the modifier
+doesn't really restrict things to just ASCII; it just forbids the
+intermixing of ASCII and non-ASCII.
+
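+For example (a minimal sketch; U+212A is the C<KELVIN SIGN>):
+
+ "\x{212A}" =~ /k/ai;           # Matches
+ "\x{212A}" =~ /k/aai;          # Doesn't match
+ "\x{212A}" =~ /\x{212A}/aai;   # Matches; both sides are non-ASCII
+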
+To summarize, this modifier provides protection for applications that
+don't wish to be exposed to all of Unicode. Specifying it twice
+gives added protection.
+
+This modifier may be specified to be the default by C<use re '/a'>
+or C<use re '/aa'>. If you do so, you may actually have occasion to use
+the C</u> modifier explicitly if there are a few regular expressions
+where you do want full Unicode rules (but even here, it's best if
+everything were under feature C<"unicode_strings">, along with the
+C<use re '/aa'>). Also see L</Which character set modifier is in
+effect?>.
+X</a>
+X</aa>
+
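+A sketch of how the default and an explicit override might be combined
+(the code point is chosen only for illustration):
+
+ use re '/aa';
+ "\x{0663}" =~ /^\d$/;     # Doesn't match; /aa is the default here
+ "\x{0663}" =~ /^\d$/u;    # Matches; the explicit /u takes precedence
+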
+=head4 Which character set modifier is in effect?
+
+Which of these modifiers is in effect at any given point in a regular
+expression depends on a fairly complex set of interactions. These have
+been designed so that in general you don't have to worry about it, but
+this section gives the gory details. As
+explained below in L</Extended Patterns> it is possible to explicitly
+specify modifiers that apply only to portions of a regular expression.
+The innermost always has priority over any outer ones, and one applying
+to the whole expression has priority over any of the default settings that are
+described in the remainder of this section.
+
+The C<L<use re 'E<sol>foo'|re/"'/flags' mode">> pragma can be used to set
+default modifiers (including these) for regular expressions compiled
+within its scope. This pragma has precedence over the other pragmas
+listed below that also change the defaults.
+
+Otherwise, C<L<use locale|perllocale>> sets the default modifier to C</l>;
+and C<L<use feature 'unicode_strings'|feature>>, or
+C<L<use 5.012|perlfunc/use VERSION>> (or higher) set the default to
+C</u> when not in the same scope as either C<L<use locale|perllocale>>
+or C<L<use bytes|bytes>>.
+(C<L<use locale ':not_characters'|perllocale/Unicode and UTF-8>> also
+sets the default to C</u>, overriding any plain C<use locale>.)
+Unlike the mechanisms mentioned above, these
+affect operations besides regular expression pattern matching, and so
+give more consistent results with other operators, including using
+C<\U>, C<\l>, etc. in substitution replacements.
+
+If none of the above apply, for backwards compatibility reasons, the
+C</d> modifier is the one in effect by default. As this can lead to
+unexpected results, it is best to specify which other rule set should be
+used.
+
+=head4 Character set modifier behavior prior to Perl 5.14
+
+Prior to 5.14, there were no explicit modifiers, but C</l> was implied
+for regexes compiled within the scope of C<use locale>, and C</d> was
+implied otherwise. However, interpolating a regex into a larger regex
+would ignore the original compilation in favor of whatever was in effect
+at the time of the second compilation. There were a number of
+inconsistencies (bugs) with the C</d> modifier, where Unicode rules
+would be used when inappropriate, and vice versa. C<\p{}> did not imply
+Unicode rules, and neither did all occurrences of C<\N{}>, until 5.12.
+
=head2 Regular Expressions
=head3 Metacharacters
the string), and "$" will match before any newline. At the
cost of a little more overhead, you can do this by using the /m modifier
on the pattern match operator. (Older programs did this by setting C<$*>,
-but this practice has been removed in perl 5.9.)
+but this option was removed in perl 5.10.)
X<^> X<$> X</m>
To simplify multi-line substitutions, the "." character never matches a
{n,} Match at least n times
{n,m} Match at least n but not more than m times
-(If a curly bracket occurs in any other context, it is treated
-as a regular character. In particular, the lower bound
-is not optional.) The "*" quantifier is equivalent to C<{0,}>, the "+"
+(If a curly bracket occurs in any other context and does not form part of
+a backslashed sequence like C<\x{...}>, it is treated
+as a regular character. In particular, the lower quantifier bound
+is not optional. However, in Perl v5.18, it is planned to issue a
+deprecation warning for all such occurrences, and in Perl v5.20 to
+require literal uses of a curly bracket to be escaped, say by preceding
+them with a backslash or enclosing them within square brackets, (C<"\{">
+or C<"[{]">). This change will allow for future syntax extensions (like
+making the lower bound of a quantifier optional), and better error
+checking of quantifiers. Now, a typo in a quantifier silently causes
+it to be treated as the literal characters. For example,
+
+ /o{4,3}/
+
+looks like a quantifier that matches 0 times, since 4 is greater than 3,
+but it really means to match the sequence of six characters
+S<C<"o { 4 , 3 }">>.)
+
+The "*" quantifier is equivalent to C<{0,}>, the "+"
quantifier to C<{1,}>, and the "?" quantifier to C<{0,1}>. n and m are limited
to non-negative integral values less than a preset limit defined when perl is built.
This is usually 32766 on the most common platforms. The actual limit can
*? Match 0 or more times, not greedily
+? Match 1 or more times, not greedily
?? Match 0 or 1 time, not greedily
- {n}? Match exactly n times, not greedily
+ {n}? Match exactly n times, not greedily (redundant)
{n,}? Match at least n times, not greedily
{n,m}? Match at least n but not more than m times, not greedily
/"(?:[^"\\]++|\\.)*+"/
as we know that if the final quote does not match, backtracking will not
-help. See the independent subexpression C<< (?>...) >> for more details;
+help. See the independent subexpression
+L</C<< (?>pattern) >>> for more details;
possessive quantifiers are just syntactic sugar for that construct. For
instance the above example could also be written as follows:
=head3 Escape sequences
-Because patterns are processed as double quoted strings, the following
+Because patterns are processed as double-quoted strings, the following
also work:
\t tab (HT, TAB)
uppercase character.
\w [3] Match a "word" character (alphanumeric plus "_", plus
other connector punctuation chars plus Unicode
- marks
+ marks)
\W [3] Match a non-"word" character
\s [3] Match a whitespace character
\S [3] Match a non-whitespace character
C<m//g>), as described in L<perlop/"Regexp Quote-Like Operators">.
It is also useful when writing C<lex>-like scanners, when you have
several patterns that you want to match against consequent substrings
-of your string, see the previous reference. The actual location
+of your string; see the previous reference. The actual location
where C<\G> will match can also be influenced by using C<pos()> as
an lvalue: see L<perlfunc/pos>. Note that the rule for zero-length
-matches is modified somewhat, in that contents to the left of C<\G> is
+matches (see L</"Repeated Patterns Matching a Zero-length Substring">)
+is modified somewhat, in that contents to the left of C<\G> are
not counted when determining the length of the match. Thus the following
will not match forever:
X<\G>
The C<\g> and C<\k> notations were introduced in Perl 5.10.0. Prior to that
there were no named nor relative numbered capture groups. Absolute numbered
-groups were referred to using C<\1>, C<\2>, etc, and this notation is still
+groups were referred to using C<\1>,
+C<\2>, etc., and this notation is still
accepted (and likely always will be). But it leads to some ambiguities if
there are more than 9 capture groups, as C<\10> could mean either the tenth
capture group, or the character whose ordinal in octal is 010 (a backspace in
constant.
The C<\I<digit>> notation also works in certain circumstances outside
-the pattern. See L</Warning on \1 Instead of $1> below for details.)
+the pattern. See L</Warning on \1 Instead of $1> below for details.
Examples:
I<need> to use literal backslashes within C<\Q...\E>,
consult L<perlop/"Gory details of parsing quoted constructs">.
+C<quotemeta()> and C<\Q> are fully described in L<perlfunc/quotemeta>.
+
=head2 Extended Patterns
Perl also defines a consistent extension syntax for features not
-found in standard tools like B<awk> and B<lex>. The syntax is a
+found in standard tools like B<awk> and
+B<lex>. The syntax for most of these is a
pair of parentheses with a question mark as the first thing within
the parentheses. The character after the question mark indicates
the extension.
A question mark was chosen for this and for the minimal-matching
construct because 1) question marks are rare in older regular
expressions, and 2) whenever you see one, you should stop and
-"question" exactly what is going on. That's psychology...
+"question" exactly what is going on. That's psychology....
-=over 10
+=over 4
=item C<(?#text)>
X<(?#)>
This is particularly useful for dynamic patterns, such as those read in from a
configuration file, taken from an argument, or specified in a table
-somewhere. Consider the case where some patterns want to be case
-sensitive and some do not: The case insensitive ones merely need to
+somewhere. Consider the case where some patterns want to be
+case-sensitive and some do not: The case-insensitive ones merely need to
include C<(?i)> at the front of the pattern. For example:
$pattern = "foobar";
modifier outside this group.
These modifiers do not carry over into named subpatterns called in the
-enclosing group. In other words, a pattern such as C<((?i)(&NAME))> does not
+enclosing group. In other words, a pattern such as C<((?i)(?&NAME))> does not
change the case-sensitivity of the "NAME" pattern.
Any of these modifiers can be set to apply globally to all regular
C<"d">) may follow the caret to override it.
But a minus sign is not legal with it.
-Also, starting in Perl 5.14, are modifiers C<"a">, C<"d">, C<"l">, and
-C<"u">, which for 5.14 may not be used as suffix modifiers.
-
-C<"l"> means to use a locale (see L<perllocale>) when pattern matching.
-The locale used will be the one in effect at the time of execution of
-the pattern match. This may not be the same as the compilation-time
-locale, and can differ from one match to another if there is an
-intervening call of the
-L<setlocale() function|perllocale/The setlocale function>.
-This modifier is automatically set if the regular expression is compiled
-within the scope of a C<"use locale"> pragma. Results are not
-well-defined when using this and matching against a utf8-encoded string.
-
-C<"u"> means to use Unicode semantics when pattern matching. It is
-automatically set if the regular expression is encoded in utf8, or is
-compiled within the scope of a
-L<C<"use feature 'unicode_strings">|feature> pragma (and isn't also in
-the scope of L<C<"use locale">|locale> nor L<C<"use bytes">|bytes>
-pragmas. On ASCII platforms, the code points between 128 and 255 take on their
-Latin-1 (ISO-8859-1) meanings (which are the same as Unicode's), whereas
-in strict ASCII their meanings are undefined. Thus the platform
-effectively becomes a Unicode platform. The ASCII characters remain as
-ASCII characters (since ASCII is a subset of Latin-1 and Unicode). For
-example, when this option is not on, on a non-utf8 string, C<"\w">
-matches precisely C<[A-Za-z0-9_]>. When the option is on, it matches
-not just those, but all the Latin-1 word characters (such as an "n" with
-a tilde). On EBCDIC platforms, which already are equivalent to Latin-1,
-this modifier changes behavior only when the C<"/i"> modifier is also
-specified, and affects only two characters, giving them full Unicode
-semantics: the C<MICRO SIGN> will match the Greek capital and
-small letters C<MU>; otherwise not; and the C<LATIN CAPITAL LETTER SHARP
-S> will match any of C<SS>, C<Ss>, C<sS>, and C<ss>, otherwise not.
-(This last case is buggy, however.)
-
-C<"a"> is the same as C<"u">, except that C<\d>, C<\s>, C<\w>, and the
-Posix character classes are restricted to matching in the ASCII range
-only. That is, with this modifier, C<\d> always means precisely the
-digits C<"0"> to C<"9">; C<\s> means the five characters C<[ \f\n\r\t]>;
-C<\w> means the 63 characters C<[A-Za-z0-9_]>; and likewise, all the
-Posix classes such as C<[[:print:]]> match only the appropriate
-ASCII-range characters. As you would expect, this modifier causes, for
-example, C<\D> to mean the same thing as C<[^0-9]>; in fact, all
-non-ASCII characters match C<\D>, C<\S>, and C<\W>. C<\b> still means
-to match at the boundary between C<\w> and C<\W>, using the C<"a">
-definitions of them (similarly for C<\B>). Otherwise, C<"a"> behaves
-like the C<"u"> modifier, in that case-insensitive matching uses Unicode
-semantics; for example, "k" will match the Unicode C<\N{KELVIN SIGN}>
-under C</i> matching, and code points in the Latin1 range, above ASCII
-will have Unicode semantics when it comes to case-insensitive matching.
-But writing two in "a"'s in a row will increase its effect, causing the
-Kelvin sign and all other non-ASCII characters to not match any ASCII
-character under C</i> matching.
-
-C<"d"> means to use the traditional Perl pattern matching behavior.
-This is dualistic (hence the name C<"d">, which also could stand for
-"depends"). When this is in effect, Perl matches according to the
-platform's native character set rules unless there is something that
-indicates to use Unicode rules. If either the target string or the
-pattern itself is encoded in UTF-8, Unicode rules are used. Also, if
-the pattern contains Unicode-only features, such as code points above
-255, C<\p()> Unicode properties or C<\N{}> Unicode names, Unicode rules
-will be used. It is automatically selected by default if the regular
-expression is compiled neither within the scope of a C<"use locale">
-pragma nor a <C<"use feature 'unicode_strings"> pragma.
-This behavior causes a number of glitches, see
-L<perlunicode/The "Unicode Bug">.
-
Note that the C<a>, C<d>, C<l>, C<p>, and C<u> modifiers are special in
that they can only be enabled, not disabled, and the C<a>, C<d>, C<l>, and
C<u> modifiers are mutually exclusive: specifying one de-specifies the
-others, and a maximum of one may appear in the construct. Thus, for
-example, C<(?-p)>, C<(?-d:...)>, and C<(?dl:...)> will warn when
-compiled under C<use warnings>.
+others, and a maximum of one (or two C<a>'s) may appear in the
+construct. Thus, for
+example, C<(?-p)> will warn when compiled under C<use warnings>;
+C<(?-d:...)> and C<(?dl:...)> are fatal errors.
Note also that the C<p> modifier is special in that its presence
anywhere in a pattern has a global effect.
(?x-ims:foo)
The caret tells Perl that this cluster doesn't inherit the flags of any
-surrounding pattern, but to go back to the system defaults (C<d-imsx>),
+surrounding pattern, but uses the system defaults (C<d-imsx>),
modified by any flags specified.
The caret allows for simpler stringification of compiled regular
contained only one branch, that being the one with the most capture
groups in it.
-This construct will be useful when you want to capture one of a
+This construct is useful when you want to capture one of a
number of alternative matches.
Consider the following pattern. The numbers underneath show in
=item Look-Around Assertions
X<look-around assertion> X<lookaround assertion> X<look-around> X<lookaround>
-Look-around assertions are zero width patterns which match a specific
+Look-around assertions are zero-width patterns which match a specific
pattern without including it in C<$&>. Positive assertions match when
their subpattern matches, negative assertions match when their subpattern
fails. Look-behind matches text up to the current match position,
If you are looking for a "bar" that isn't preceded by a "foo", C</(?!foo)bar/>
will not do what you want. That's because the C<(?!foo)> is just saying that
the next thing cannot be "foo"--and it's not, it's a "bar", so "foobar" will
-match. You would have to do something like C</(?!foo)...bar/> for that. We
-say "like" because there's the case of your "bar" not having three characters
-before it. You could cover that this way: C</(?:(?!foo)...|^.{0,2})bar/>.
-Sometimes it's still easier just to say:
-
- if (/bar/ && $` !~ /foo$/)
-
-For look-behind see below.
+match. Use look-behind instead (see below).
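+
+A brief sketch of the difference (the sample strings are made up for
+illustration):
+
+ "foobar" =~ /(?!foo)bar/;     # Matches, which is probably not intended
+ "foobar" =~ /(?<!foo)bar/;    # Doesn't match
+ "bazbar" =~ /(?<!foo)bar/;    # Matches
+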
=item C<(?<=pattern)> C<\K>
X<(?<=)> X<look-behind, positive> X<lookbehind, positive> X<\K>
There is a special form of this construct, called C<\K>, which causes the
regex engine to "keep" everything it had matched prior to the C<\K> and
-not include it in C<$&>. This effectively provides variable length
+not include it in C<$&>. This effectively provides variable-length
look-behind. The use of C<\K> inside of another look-around assertion
is allowed, but the behaviour is currently not well defined.
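A minimal sketch of C<\K> in a substitution, assuming a hypothetical path string; the prefix before C<\K> has variable length, which an ordinary look-behind would not allow:

```perl
use strict;
use warnings;

my $path = "foo/bar/baz.txt";

# Everything up to and including the last "/" must match, but \K drops
# it from the matched text, so only the base name is replaced.
(my $renamed = $path) =~ s{.*/\K[^.]+}{report};

print "$renamed\n";
```

Only the text matched after the C<\K> is replaced; the directory part is left untouched.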
X<< (?<NAME>) >> X<(?'NAME')> X<named capture> X<capture>
A named capture group. Identical in every respect to normal capturing
-parentheses C<()> but for the additional fact that C<%+> or C<%-> may be
-used after a successful match to refer to a named group. See L<perlvar>
+parentheses C<()> but for the additional fact that the group
+can be referred to by name in various regular expression
+constructs (like C<\g{NAME}>) and can be accessed by name
+after a successful match via C<%+> or C<%->. See L<perlvar>
for more details on the C<%+> and C<%-> hashes.
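The two ways of referring to a named group can be sketched as follows (the group names C<year>, C<month>, C<day> and C<pair> are chosen purely for illustration):

```perl
use strict;
use warnings;

my %date;
if ("2012-05-20" =~ /^(?<year>\d{4})-(?<month>\d\d)-(?<day>\d\d)$/) {
    %date = %+;    # copy the named captures before the next match
}

# \g{NAME} refers back to the named group inside the pattern itself.
my $doubled = "abab" =~ /^(?<pair>ab)\g{pair}$/ ? 1 : 0;

print "$date{year}/$date{month}/$date{day} doubled=$doubled\n";
```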
If multiple distinct capture groups have the same name then the
B<WARNING>: This extended regular expression feature is considered
experimental, and may be changed without notice. Code executed that
has side effects may not perform identically from version to version
-due to the effect of future optimisations in the regex engine.
+due to the effect of future optimisations in the regex engine. The
+implementation of this feature was radically overhauled for the 5.18.0
+release, and its behaviour in earlier versions of perl was much buggier,
+especially in relation to parsing, lexical vars, scoping, recursion and
+reentrancy.
-This zero-width assertion evaluates any embedded Perl code. It
-always succeeds, and its C<code> is not interpolated. Currently,
-the rules to determine where the C<code> ends are somewhat convoluted.
+This zero-width assertion executes any embedded Perl code. It always
+succeeds, and its return value is set as C<$^R>.
-This feature can be used together with the special variable C<$^N> to
-capture the results of submatches in variables without having to keep
-track of the number of nested parentheses. For example:
+In literal patterns, the code is parsed at the same time as the
+surrounding code. While within the pattern, control is passed temporarily
+back to the perl parser, until the logically-balancing closing brace is
+encountered. This is similar to the way that an array index expression in
+a literal string is handled, for example
- $_ = "The brown fox jumps over the lazy dog";
- /the (\S+)(?{ $color = $^N }) (\S+)(?{ $animal = $^N })/i;
- print "color = $color, animal = $animal\n";
+ "abc$array[ 1 + f('[') + g()]def"
+
+In particular, braces do not need to be balanced:
+
+ s/abc(?{ f('{'); })/def/
-Inside the C<(?{...})> block, C<$_> refers to the string the regular
+Even in a pattern that is interpolated and compiled at run-time, literal
+code blocks will be compiled once, at perl compile time; the following
+prints "ABCD":
+
+ print "D";
+ my $qr = qr/(?{ BEGIN { print "A" } })/;
+ my $foo = "foo";
+ /$foo$qr(?{ BEGIN { print "B" } })/;
+ BEGIN { print "C" }
+
+In patterns where the text of the code is derived from run-time
+information rather than appearing literally in a source code /pattern/,
+the code is compiled at the same time that the pattern is compiled, and
+for reasons of security, C<use re 'eval'> must be in scope. This is to
+stop user-supplied patterns containing code snippets from being
+executable.
+
+In situations where you need to enable this with C<use re 'eval'>, you should
+also have taint checking enabled. Better yet, use the carefully
+constrained evaluation within a Safe compartment. See L<perlsec> for
+details about both these mechanisms.
+
+From the viewpoint of parsing, lexical variable scope and closures,
+
+ /AAA(?{ BBB })CCC/
+
+behaves approximately like
+
+ /AAA/ && do { BBB } && /CCC/
+
+Similarly,
+
+ qr/AAA(?{ BBB })CCC/
+
+behaves approximately like
+
+ sub { /AAA/ && do { BBB } && /CCC/ }
+
+In particular:
+
+ { my $i = 1; $r = qr/(?{ print $i })/ }
+ my $i = 2;
+ /$r/; # prints "1"
+
+Inside a C<(?{...})> block, C<$_> refers to the string the regular
expression is matching against. You can also use C<pos()> to know what is
the current position of matching within this string.
-The C<code> is properly scoped in the following sense: If the assertion
-is backtracked (compare L<"Backtracking">), all changes introduced after
-C<local>ization are undone, so that
+The code block introduces a new scope from the perspective of lexical
+variable declarations, but B<not> from the perspective of C<local> and
+similar localizing behaviours. So later code blocks within the same
+pattern will still see the values which were localized in earlier blocks.
+These accumulated localizations are undone either at the end of a
+successful match, or if the assertion is backtracked (compare
+L<"Backtracking">). For example,
$_ = 'a' x 8;
m<
- (?{ $cnt = 0 }) # Initialize $cnt.
+ (?{ $cnt = 0 }) # Initialize $cnt.
(
a
(?{
- local $cnt = $cnt + 1; # Update $cnt, backtracking-safe.
+ local $cnt = $cnt + 1; # Update $cnt,
+ # backtracking-safe.
})
)*
aaaa
- (?{ $res = $cnt }) # On success copy to
- # non-localized location.
+ (?{ $res = $cnt }) # On success copy to
+ # non-localized location.
>x;
-will set C<$res = 4>. Note that after the match, C<$cnt> returns to the globally
-introduced value, because the scopes that restrict C<local> operators
-are unwound.
+will initially increment C<$cnt> up to 8; then during backtracking, its
+value will be unwound back to 4, which is the value assigned to C<$res>.
+At the end of the regex execution, C<$cnt> will be wound back to its initial
+value of 0.
+
+This assertion may be used as the condition in a
+
+ (?(condition)yes-pattern|no-pattern)
-This assertion may be used as a C<(?(condition)yes-pattern|no-pattern)>
-switch. If I<not> used in this way, the result of evaluation of
-C<code> is put into the special variable C<$^R>. This happens
-immediately, so C<$^R> can be used from other C<(?{ code })> assertions
-inside the same regular expression.
+switch. If I<not> used in this way, the result of evaluation of C<code>
+is put into the special variable C<$^R>. This happens immediately, so
+C<$^R> can be used from other C<(?{ code })> assertions inside the same
+regular expression.
The assignment to C<$^R> above is properly localized, so the old
value of C<$^R> is restored if the assertion is backtracked; compare
L<"Backtracking">.
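A minimal sketch of how C<$^R> is passed from one code block to the next. It assumes perl 5.18 or later, since it uses a lexical variable inside a code block; C<$total> is an illustrative name only:

```perl
use strict;
use warnings;

my $total;

# The first block sets $^R to 1, the second sees that value and sets
# $^R to 2, and the third copies the result out of the pattern.
"abc" =~ /a(?{ 1 })b(?{ $^R + 1 })c(?{ $total = $^R })/;

print "total = $total\n";
```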
-For reasons of security, this construct is forbidden if the regular
-expression involves run-time interpolation of variables, unless the
-perilous C<use re 'eval'> pragma has been used (see L<re>), or the
-variables contain results of C<qr//> operator (see
-L<perlop/"qr/STRINGE<sol>msixpo">).
+Note that the special variable C<$^N> is particularly useful with code
+blocks to capture the results of submatches in variables without having to
+keep track of the number of nested parentheses. For example:
-This restriction is due to the wide-spread and remarkably convenient
-custom of using run-time determined strings as patterns. For example:
+ $_ = "The brown fox jumps over the lazy dog";
+ /the (\S+)(?{ $color = $^N }) (\S+)(?{ $animal = $^N })/i;
+ print "color = $color, animal = $animal\n";
- $re = <>;
- chomp $re;
- $string =~ /$re/;
-
-Before Perl knew how to execute interpolated code within a pattern,
-this operation was completely safe from a security point of view,
-although it could raise an exception from an illegal pattern. If
-you turn on the C<use re 'eval'>, though, it is no longer secure,
-so you should only do so if you are also using taint checking.
-Better yet, use the carefully constrained evaluation within a Safe
-compartment. See L<perlsec> for details about both these mechanisms.
-
-B<WARNING>: Use of lexical (C<my>) variables in these blocks is
-broken. The result is unpredictable and will make perl unstable. The
-workaround is to use global (C<our>) variables.
-
-B<WARNING>: In perl 5.12.x and earlier, the regex engine
-was not re-entrant, so interpolated code could not
-safely invoke the regex engine either directly with
-C<m//> or C<s///>), or indirectly with functions such as
-C<split>. Invoking the regex engine in these blocks would make perl
-unstable.
=item C<(??{ code })>
X<(??{})>
has side effects may not perform identically from version to version
due to the effect of future optimisations in the regex engine.
-This is a "postponed" regular subexpression. The C<code> is evaluated
-at run time, at the moment this subexpression may match. The result
-of evaluation is considered as a regular expression and matched as
-if it were inserted instead of this construct. Note that this means
-that the contents of capture groups defined inside an eval'ed pattern
-are not available outside of the pattern, and vice versa, there is no
-way for the inner pattern to refer to a capture group defined outside.
-Thus,
+This is a "postponed" regular subexpression. It behaves in I<exactly> the
+same way as a C<(?{ code })> code block as described above, except that
+its return value, rather than being assigned to C<$^R>, is treated as a
+pattern, compiled if it's a string (or used as-is if it's a qr// object),
+then matched as if it were inserted instead of this construct.
- ('a' x 100)=~/(??{'(.)' x 100})/
+During the matching of this sub-pattern, it has its own set of
+captures which are valid during the sub-match, but are discarded once
+control returns to the main pattern. For example, the following matches,
+with the inner pattern capturing "B" and matching "BB", while the outer
+pattern captures "A":
-B<will> match, it will B<not> set $1.
+ my $inner = '(.)\1';
+ "ABBA" =~ /^(.)(??{ $inner })\1/;
+ print $1; # prints "A";
-The C<code> is not interpolated. As before, the rules to determine
-where the C<code> ends are currently somewhat convoluted.
+Note that this means that there is no way for the inner pattern to refer
+to a capture group defined outside. (The code block itself can use C<$1>,
+etc., to refer to the enclosing pattern's capture groups.) Thus, although
+
+ ('a' x 100)=~/(??{'(.)' x 100})/
+
+I<will> match, it will I<not> set C<$1> on exit.
The following pattern matches a parenthesized group:
- $re = qr{
- \(
- (?:
- (?> [^()]+ ) # Non-parens without backtracking
- |
- (??{ $re }) # Group with matching parens
- )*
- \)
- }x;
+ $re = qr{
+ \(
+ (?:
+ (?> [^()]+ ) # Non-parens without backtracking
+ |
+ (??{ $re }) # Group with matching parens
+ )*
+ \)
+ }x;
See also C<(?PARNO)> for a different, more efficient way to accomplish
the same task.
-For reasons of security, this construct is forbidden if the regular
-expression involves run-time interpolation of variables, unless the
-perilous C<use re 'eval'> pragma has been used (see L<re>), or the
-variables contain results of C<qr//> operator (see
-L<perlop/"qrE<sol>STRINGE<sol>msixpo">).
-
-In perl 5.12.x and earlier, because the regex engine was not re-entrant,
-delayed code could not safely invoke the regex engine either directly with
-C<m//> or C<s///>), or indirectly with functions such as C<split>.
-
-Recursing deeper than 50 times without consuming any input string will
-result in a fatal error. The maximum depth is compiled into perl, so
-changing it requires a custom build.
+Executing a postponed regular expression 50 times without consuming any
+input string will result in a fatal error. The maximum depth is compiled
+into perl, so changing it requires a custom build.
=item C<(?PARNO)> C<(?-PARNO)> C<(?+PARNO)> C<(?R)> C<(?0)>
X<(?PARNO)> X<(?1)> X<(?R)> X<(?0)> X<(?-1)> X<(?+1)> X<(?-PARNO)> X<(?+PARNO)>
X<regex, recursive> X<regexp, recursive> X<regular expression, recursive>
X<regex, relative recursion>
-Similar to C<(??{ code })> except it does not involve compiling any code,
-instead it treats the contents of a capture group as an independent
-pattern that must match at the current position. Capture groups
-contained by the pattern will have the value as determined by the
-outermost recursion.
+Similar to C<(??{ code })> except that it does not involve executing any
+code or potentially compiling a returned pattern string; instead it treats
+the part of the current pattern contained within a specified capture group
+as an independent pattern that must match at the current position.
+Capture groups contained by the pattern will have the value as determined
+by the outermost recursion.
PARNO is a sequence of digits (not starting with 0) whose value reflects
the paren-number of the capture group to recurse to. C<(?R)> recurses to
The following pattern matches a function foo() which may contain
balanced parentheses as the argument.
- $re = qr{ ( # paren group 1 (full function)
+ $re = qr{ ( # paren group 1 (full function)
foo
- ( # paren group 2 (parens)
+ ( # paren group 2 (parens)
\(
- ( # paren group 3 (contents of parens)
+ ( # paren group 3 (contents of parens)
(?:
- (?> [^()]+ ) # Non-parens without backtracking
+ (?> [^()]+ ) # Non-parens without backtracking
|
- (?2) # Recurse to start of paren group 2
+ (?2) # Recurse to start of paren group 2
)*
)
\)
for later use:
my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;
- if (/foo $parens \s+ + \s+ bar $parens/x) {
+ if (/foo $parens \s+ \+ \s+ bar $parens/x) {
# do something here...
}
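As a small self-contained sketch of recursion into a numbered group (the test strings are made up for illustration), the following anchored pattern accepts only strings that consist of one balanced parenthesized group:

```perl
use strict;
use warnings;

# Group 1 is the parenthesized group; (?1) recurses into it.
my $balanced = qr/^(\((?:[^()]++|(?1))*+\))$/;

my @candidates = ("(a(b)c)", "(a(b)c", "((()))", "())");
my @ok = grep { /$balanced/ } @candidates;

print join(", ", @ok), "\n";
```

Because C<(?1)> names the group by its absolute number, this pattern is not safe to interpolate into a larger pattern that has earlier capture groups; the relative form C<(?-1)> shown above avoids that problem.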
=item C<(?(condition)yes-pattern)>
-Conditional expression. C<(condition)> should be either an integer in
+Conditional expression. Matches C<yes-pattern> if C<condition> yields
+a true value, matches C<no-pattern> otherwise. A missing pattern always
+matches.
+
+C<(condition)> should be one of: 1) an integer in
parentheses (which is valid if the corresponding pair of parentheses
-matched), a look-ahead/look-behind/evaluate zero-width assertion, a
+matched); 2) a look-ahead/look-behind/evaluate zero-width assertion; 3) a
name in angle brackets or single quotes (which is valid if a group
-with the given name matched), or the special symbol (R) (true when
+with the given name matched); or 4) the special symbol (R) (true when
evaluated inside of recursion or eval). Additionally the R may be
followed by a number, (which will be true when evaluated when recursing
inside of the appropriate group), or by C<&NAME>, in which case it will
Checks if a group with the given name has matched something.
+=item (?=...) (?!...) (?<=...) (?<!...)
+
+Checks whether the pattern matches (or does not match, for the '!'
+variants).
+
=item (?{ CODE })
-Treats the code block as the condition.
+Treats the return value of the code block as the condition.
=item (R)
matches a chunk of non-parentheses, possibly included in parentheses
themselves.
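An anchored variant of the same idea, as a sketch with made-up input strings, shows the condition accepting either both parentheses or neither:

```perl
use strict;
use warnings;

# Group 1 is the optional opening parenthesis; the condition (1)
# requires a closing one only if group 1 took part in the match.
my $re = qr/^(\()?[^()]+(?(1)\))$/;

my @matched = grep { /$re/ } ("(text)", "text", "(text", "text)");

print join(", ", @matched), "\n";
```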
-A special form is the C<(DEFINE)> predicate, which never executes directly
-its yes-pattern, and does not allow a no-pattern. This allows to define
-subpatterns which will be executed only by using the recursion mechanism.
+A special form is the C<(DEFINE)> predicate, which never executes its
+yes-pattern directly, and does not allow a no-pattern. This allows one to
+define subpatterns which will be executed only by the recursion mechanism.
This way, you can define a set of regular expression rules that can be
bundled into any pattern you choose.
necessary. Thus C<$+{NAME_PAT}> would not be defined even though
C<$+{NAME}> would be.
+Finally, keep in mind that subpatterns created inside a DEFINE block
+count towards the absolute and relative number of captures, so this:
+
+ my @captures = "a" =~ /(.) # First capture
+ (?(DEFINE)
+ (?<EXAMPLE> 1 ) # Second capture
+ )/x;
+ say scalar @captures;
+
+will output 2, not 1. This is particularly important if you intend to
+compile the definitions with the C<qr//> operator, and later
+interpolate them in another pattern.
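The following sketch bundles a rule into a C<(DEFINE)> block and calls it by name. The rule name C<octet> and the sample addresses are assumptions made for the example:

```perl
use strict;
use warnings;

my $ipv4 = qr/
    (?(DEFINE)
        (?<octet> 25[0-5] | 2[0-4]\d | 1\d\d | [1-9]?\d )
    )
    \A (?&octet) (?: \. (?&octet) ){3} \z
/x;

my @valid = grep { /$ipv4/ } ("192.168.0.1", "256.1.1.1", "10.0.0");

# The named group inside DEFINE is capture group 1, even though it
# is only ever reached through recursion.
my @captures = "10.0.0.1" =~ $ipv4;

print "valid: @valid; capture groups: ", scalar(@captures), "\n";
```

The match in list context returns one element for the C<octet> group, which illustrates that groups inside a DEFINE block are counted like any other capture group.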
+
=item C<< (?>pattern) >>
X<backtrack> X<backtracking> X<atomic> X<possessive>
C<a*ab> will match fewer characters than a standalone C<a*>, since
this makes the tail match.
+C<< (?>pattern) >> does not disable backtracking altogether once it has
+matched. It is still possible to backtrack past the construct, but not
+into it. So C<< ((?>a*)|(?>b*))ar >> will still match "bar".
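A short sketch of both behaviours, with illustrative variable names: the atomic group refuses to give back characters it has matched, yet the engine can still backtrack past it to another alternative.

```perl
use strict;
use warnings;

# The atomic group keeps every "a" it matched, so the literal "ab"
# that follows can never match.
my $atomic = "aaab" =~ /(?>a*)ab/ ? 1 : 0;

# An ordinary a* gives back one "a" to let the tail match.
my $plain  = "aaab" =~ /a*ab/ ? 1 : 0;

# Backtracking past the first atomic group to the second alternative
# is still allowed.
my $past   = "bar" =~ /((?>a*)|(?>b*))ar/ ? 1 : 0;

print "atomic: $atomic, plain: $plain, past: $past\n";
```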
+
An effect similar to C<< (?>pattern) >> may be achieved by writing
-C<(?=(pattern))\g1>. This matches the same substring as a standalone
-C<a+>, and the following C<\g1> eats the matched string; it therefore
+C<(?=(pattern))\g{-1}>. This matches the same substring as a standalone
+C<a+>, and the following C<\g{-1}> eats the matched string; it therefore
makes a zero-length assertion into an analogue of C<< (?>...) >>.
(The difference between these two constructs is that the second one
uses a capturing group, thus shifting ordinals of backreferences
which uses C<< (?>...) >> matches exactly when the one above does (verifying
this yourself would be a productive exercise), but finishes in a fourth
the time when used on a similar string with 1000000 C<a>s. Be aware,
-however, that this pattern currently triggers a warning message under
+however, that, when this construct is followed by a
+quantifier, it currently triggers a warning message under
the C<use warnings> pragma or B<-w> switch saying it
C<"matches null string many times in regex">.
C<(*MARK:NAME)> verb below for more details.
B<NOTE:> C<$REGERROR> and C<$REGMARK> are not magic variables like C<$1>
-and most other regex related variables. They are not local to a scope, nor
+and most other regex-related variables. They are not local to a scope, nor
readonly, but instead are volatile package variables similar to C<$AUTOLOAD>.
Use C<local> to localize changes to them to a specific scope if necessary.
If a pattern does not contain a special backtracking verb that allows an
argument, then C<$REGERROR> and C<$REGMARK> are not touched at all.
-=over 4
+=over 3
=item Verbs that take an argument
'aaab' =~ /a+b?(*PRUNE)(?{print "$&\n"; $count++})(*FAIL)/;
print "Count=$count\n";
-we prevent backtracking and find the count of the longest matching
+we prevent backtracking and find the count of the longest matching string
at each matching starting point like so:
aaab
C<(*PRUNE)> can be used to handle cases that cannot be expressed using a
C<< (?>pattern) >> alone.
-
=item C<(*SKIP)> C<(*SKIP:NAME)>
X<(*SKIP)>
without a name the "skip point" is where the match point was when
executing the (*SKIP) pattern.
-Compare the following to the examples in C<(*PRUNE)>, note the string
+Compare the following to the examples in C<(*PRUNE)>; note the string
is twice as long:
- 'aaabaaab' =~ /a+b?(*SKIP)(?{print "$&\n"; $count++})(*FAIL)/;
- print "Count=$count\n";
+ 'aaabaaab' =~ /a+b?(*SKIP)(?{print "$&\n"; $count++})(*FAIL)/;
+ print "Count=$count\n";
outputs
C<(*SKIP)> was executed.
=item C<(*MARK:NAME)> C<(*:NAME)>
-X<(*MARK)> C<(*MARK:NAME)> C<(*:NAME)>
+X<(*MARK)> X<(*MARK:NAME)> X<(*:NAME)>
This zero-width pattern can be used to mark the point reached in a string
when a certain part of the pattern has been successfully matched. This
variable will be set to the name of the most recently executed
C<(*MARK:NAME)>.
-See C<(*SKIP)> for more details.
+See L</(*SKIP)> for more details.
As a shortcut C<(*MARK:NAME)> can be written C<(*:NAME)>.
=item C<(*THEN)> C<(*THEN:NAME)>
-This is similar to the "cut group" operator C<::> from Perl 6. Like
+This is similar to the "cut group" operator C<::> from Perl 6. Like
C<(*PRUNE)>, this verb always matches, and when backtracked into on
failure, it causes the regex engine to try the next alternation in the
-innermost enclosing group (capturing or otherwise).
+innermost enclosing group (capturing or otherwise) that has alternations.
+The two branches of a C<(?(condition)yes-pattern|no-pattern)> do not
+count as an alternation, as far as C<(*THEN)> is concerned.
Its name comes from the observation that this operation combined with the
alternation operator (C<|>) can be used to create what is essentially a
but
- / ( A (*THEN) B | C (*THEN) D ) /
+ / ( A (*THEN) B | C ) /
is not the same as
- / ( A (*PRUNE) B | C (*PRUNE) D ) /
+ / ( A (*PRUNE) B | C ) /
as after matching the A but failing on the B the C<(*THEN)> verb will
backtrack and try C; but the C<(*PRUNE)> verb will simply fail.
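A concrete sketch of that difference, using literal letters in place of A, B and C (the input string is made up for the example):

```perl
use strict;
use warnings;

# After "a" matches and "b" fails, (*THEN) moves on to the next
# alternative in the group, which matches "ac".
my $then  = "ac" =~ /^(?:a(*THEN)b|ac)$/  ? 1 : 0;

# (*PRUNE) instead fails the whole attempt at this starting position.
my $prune = "ac" =~ /^(?:a(*PRUNE)b|ac)$/ ? 1 : 0;

print "then: $then, prune: $prune\n";
```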
+=back
+
+=item Verbs without an argument
+
+=over 4
+
=item C<(*COMMIT)>
X<(*COMMIT)>
to find a valid match by advancing the start pointer will occur again.
For example,
- 'aaabaaab' =~ /a+b?(*COMMIT)(?{print "$&\n"; $count++})(*FAIL)/;
- print "Count=$count\n";
+ 'aaabaaab' =~ /a+b?(*COMMIT)(?{print "$&\n"; $count++})(*FAIL)/;
+ print "Count=$count\n";
outputs
does not match, the regex engine will not try any further matching on the
rest of the string.
-=back
-
-=item Verbs without an argument
-
-=over 4
-
=item C<(*FAIL)> C<(*F)>
X<(*FAIL)> X<(*F)>
'AB' =~ /(A (A|B(*ACCEPT)|C) D)(E)/x;
will match, and C<$1> will be C<AB> and C<$2> will be C<B>, C<$3> will not
-be set. If another branch in the inner parentheses were matched, such as in the
+be set. If another branch in the inner parentheses was matched, such as in the
string 'ACDE', then the C<D> and C<E> would have to be matched as well.
=back
fail.
The search engine will initially match C<\D*> with "ABC". Then it will
-try to match C<(?!123> with "123", which fails. But because
+try to match C<(?!123)> with "123", which fails. But because
a quantifier (C<\D*>) has been used in the regular expression, the
search engine can backtrack and retry the match differently
in the hope of matching the complete regular expression.
A powerful tool for optimizing such beasts is what is known as an
"independent group",
-which does not backtrack (see L<C<< (?>pattern) >>>). Note also that
+which does not backtrack (see L</C<< (?>pattern) >>>). Note also that
zero-length look-ahead/look-behind assertions will not backtrack to make
the tail match, since they are in "logical" context: only
whether they match is considered relevant. For an example
where side-effects of look-ahead I<might> have influenced the
-following match, see L<C<< (?>pattern) >>>.
+following match, see L</C<< (?>pattern) >>>.
=head2 Version 8 Regular Expressions
X<regular expression, version 8> X<regex, version 8> X<regexp, version 8>
for the character used as the pattern delimiter.
A series of characters matches that series of characters in the target
-string, so the pattern C<blurfl> would match "blurfl" in the target
+string, so the pattern C<blurfl> would match "blurfl" in the target
string.
You can specify a character class, by enclosing a list of characters
separate them, so that C<fee|fie|foe> will match any of "fee", "fie",
or "foe" in the target string (as would C<f(e|i|o)e>). The
first alternative includes everything from the last pattern delimiter
-("(", "[", or the beginning of the pattern) up to the first "|", and
+("(", "(?:", etc. or the beginning of the pattern) up to the first "|", and
the last alternative contains everything from the last "|" to the next
-pattern delimiter. That's why it's common practice to include
+closing pattern delimiter. That's why it's common practice to include
alternatives in parentheses: to minimize confusion about where they
start and end.
Within a pattern, you may designate subpatterns for later reference
by enclosing them in parentheses, and you may refer back to the
I<n>th subpattern later in the pattern using the metacharacter
-\I<n>. Subpatterns are numbered based on the left to right order
+\I<n> or \gI<n>. Subpatterns are numbered based on the left to right order
of their opening parenthesis. A backreference matches whatever
actually matched the subpattern in the string being examined, not
the rules for that subpattern. Therefore, C<(0|0x)\d*\s\g1\d*> will
be significantly simplified by using repeated subexpressions that
may match zero-length substrings. Here's a simple example:
- @chars = split //, $string; # // is not magic in split
+ @chars = split //, $string; # // is not magic in split
($whitewashed = $string) =~ s/()/ /g; # parens avoid magic s// /
Thus Perl allows such constructs, by I<forcefully breaking
is made equivalent to
- m{ (?: NON_ZERO_LENGTH )*
- |
- (?: ZERO_LENGTH )?
- }x;
+ m{ (?: NON_ZERO_LENGTH )* (?: ZERO_LENGTH )? }x;
+
+For example, this program
+
+ #!perl -l
+ "aaaaab" =~ /
+ (?:
+ a # non-zero
+ | # or
+ (?{print "hello"}) # print hello whenever this
+ # branch is tried
+ (?=(b)) # zero-width assertion
+ )* # any number of times
+ /x;
+ print $&;
+ print $1;
-The higher level-loops preserve an additional state between iterations:
+prints
+
+ hello
+ aaaaa
+ b
+
+Notice that "hello" is printed only once: when Perl sees that the sixth
+iteration of the outermost C<(?:)*> matches a zero-length string, it stops
+the C<*>.
+
+The higher-level loops preserve an additional state between iterations:
whether the last match was zero-length. To break the loop, the following
match after a zero-length match is prohibited from having a length of zero.
This prohibition interacts with backtracking (see L<"Backtracking">),
before (such as C<ab> or C<\Z>) could match at most one substring
at the given position of the input string. However, in a typical regular
expression these elementary pieces are combined into more complicated
-patterns using combining operators C<ST>, C<S|T>, C<S*> etc
+patterns using combining operators C<ST>, C<S|T>, C<S*> etc.
(in these examples C<S> and C<T> are regular subexpressions).
Such combinations can include alternatives, leading to a problem of choice:
substrings which can be matched by C<S>, C<B> and C<B'> are substrings
which can be matched by C<T>.
-If C<A> is better match for C<S> than C<A'>, C<AB> is a better
+If C<A> is a better match for C<S> than C<A'>, C<AB> is a better
match than C<A'B'>.
If C<A> and C<A'> coincide: C<AB> is a better match than C<AB'> if
-C<B> is better match for C<T> than C<B'>.
+C<B> is a better match for C<T> than C<B'>.
=item C<S|T>
=head2 Creating Custom RE Engines
-Overloaded constants (see L<overload>) provide a simple way to extend
-the functionality of the RE engine.
+As of Perl 5.10.0, one can create custom regular expression engines. This
+is not for the faint of heart, as they have to plug in at the C level. See
+L<perlreapi> for more details.
+
+As an alternative, overloaded constants (see L<overload>) provide a simple
+way to extend the functionality of the RE engine, by substituting one
+pattern for another.
Suppose that we want to enable a new RE escape-sequence C<\Y|> which
matches at a boundary between whitespace characters and non-whitespace
$re = customre::convert $re;
/\Y|$re\Y|/;
-=head1 PCRE/Python Support
+=head2 PCRE/Python Support
-As of Perl 5.10.0, Perl supports several Python/PCRE specific extensions
+As of Perl 5.10.0, Perl supports several Python/PCRE-specific extensions
to the regex syntax. While Perl programmers are encouraged to use the
-Perl specific syntax, the following are also accepted:
+Perl-specific syntax, the following are also accepted:
=over 4
=head1 BUGS
-There are numerous problems with case insensitive matching of characters
-outside the ASCII range, especially with those whose folds are multiple
-characters, such as ligatures like C<LATIN SMALL LIGATURE FF>.
-
-In a bracketed character class with case insensitive matching, ranges only work
-for ASCII characters. For example,
-C<m/[\N{CYRILLIC CAPITAL LETTER A}-\N{CYRILLIC CAPITAL LETTER YA}]/i>
-doesn't match all the Russian upper and lower case letters.
-
Many regular expression constructs don't work on EBCDIC platforms.
+There are a number of issues with regard to case-insensitive matching
+in Unicode rules. See C<i> under L</Modifiers> above.
+
This document varies from difficult to understand to completely
and utterly opaque. The wandering prose riddled with jargon is
hard to fathom in several places.