characters. These are called "bracketed character classes" when we are
being precise, but often the word "bracketed" is dropped. (Dropping it
usually doesn't cause confusion.) This means that the C<"["> character
-is another metacharacter. It doesn't match anything just by itelf; it
+is another metacharacter. It doesn't match anything just by itself; it
is used only to tell Perl that what follows it is a bracketed character
class. If you want to match a literal left square bracket, you must
escape it, like C<"\[">. The matching C<"]"> is also a metacharacter;
The list of characters within the character class gives the set of
characters matched by the class. C<"[abc]"> matches a single "a" or "b"
or "c". But if the first character after the C<"["> is C<"^">, the
-class matches any character not in the list. Within a list, the C<"-">
-character specifies a range of characters, so that C<a-z> represents all
-characters between "a" and "z", inclusive. If you want either C<"-"> or
-C<"]"> itself to be a member of a class, put it at the start of the list
-(possibly after a C<"^">), or escape it with a backslash. C<"-"> is
-also taken literally when it is at the end of the list, just before the
-closing C<"]">. (The following all specify the same class of three
-characters: C<[-az]>, C<[az-]>, and C<[a\-z]>. All are different from
-C<[a-z]>, which specifies a class containing twenty-six characters, even
-on EBCDIC-based character sets.)
+class instead matches any character not in the list. Within a list, the
+C<"-"> character specifies a range of characters, so that C<a-z>
+represents all characters between "a" and "z", inclusive. If you want
+either C<"-"> or C<"]"> itself to be a member of a class, put it at the
+start of the list (possibly after a C<"^">), or escape it with a
+backslash. C<"-"> is also taken literally when it is at the end of the
+list, just before the closing C<"]">. (The following all specify the
+same class of three characters: C<[-az]>, C<[az-]>, and C<[a\-z]>. All
+are different from C<[a-z]>, which specifies a class containing
+twenty-six characters, even on EBCDIC-based character sets.)
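+
+As a minimal sketch (the helper C<class_members> is hypothetical, and an
+ASCII platform is assumed), the following lists which printable ASCII
+characters each of those classes matches:
+
+    use strict;
+    use warnings;
+
+    # Hypothetical helper: return the printable ASCII characters that a
+    # precompiled single-character class matches.
+    sub class_members {
+        my ($re) = @_;
+        return join '', grep { /^$re$/ } map { chr } 0x20 .. 0x7E;
+    }
+
+    print class_members(qr/[-az]/),  "\n";          # -az
+    print class_members(qr/[az-]/),  "\n";          # -az
+    print class_members(qr/[a\-z]/), "\n";          # -az
+    print length(class_members(qr/[a-z]/)), "\n";   # 26
+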
There is lots more to bracketed character classes; full details are in
L<perlrecharclass/Bracketed Character Classes>.
sequence
} End sequence started by {
- Indicates a range Only in [] interior
+ # Beginning of comment, extends to line end Only with /x modifier
Notice that most of the metacharacters lose their special meaning when
they occur in a bracketed character class, except C<"^"> has a different
space interpretation within a single multi-character construct. For
example in C<\x{...}>, regardless of the C</x> modifier, there can be no
spaces. Same for a L<quantifier|/Quantifiers> such as C<{3}> or
-C<{5,}>. Similarly, C<(?:...)> can't have a space between the C<"{">,
+C<{5,}>. Similarly, C<(?:...)> can't have a space between the C<"(">,
C<"?">, and C<":">. Within any delimiters for such a
construct, allowed spaces are not affected by C</x>, and depend on the
construct. For example, C<\x{...}> can't have spaces because hexadecimal
ASCII digits, but mean a different number, so a human could easily think
a number is a different quantity than it really is. For example,
C<BENGALI DIGIT FOUR> (U+09EA) looks very much like an
-C<ASCII DIGIT EIGHT> (U+0038). And, C<\d+>, may match strings of digits
-that are a mixture from different writing systems, creating a security
-issue. L<Unicode::UCD/num()> can be used to sort
-this out. Or the C</a> modifier can be used to force C<\d> to match
-just the ASCII 0 through 9.
+C<ASCII DIGIT EIGHT> (U+0038), and C<LEPCHA DIGIT SIX> (U+1C46) looks
+very much like an C<ASCII DIGIT FIVE> (U+0035). And C<\d+> may match
+strings of digits that are a mixture from different writing systems,
+creating a security issue. A fraudulent website, for example, could
+display the price of something using U+1C46, and it would appear to the
+user that something cost 500 units, but it really costs 600. A browser
+that enforced script runs (L</Script Runs>) would prevent that
+fraudulent display. L<Unicode::UCD/num()> can also be used to sort this
+out. Or the C</a> modifier can be used to force C<\d> to match just the
+ASCII 0 through 9.
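+
+A minimal sketch of the difference, assuming a Perl recent enough to
+support the C</a> modifier (5.14 or later); the variable names are
+illustrative only:
+
+    use strict;
+    use warnings;
+
+    my $bengali_four = "\x{09EA}";   # BENGALI DIGIT FOUR
+
+    # Under Unicode rules, \d matches any decimal digit.
+    my $unicode_match = $bengali_four =~ /^\d$/  ? 1 : 0;
+
+    # Under /a, \d matches only ASCII 0 through 9.
+    my $ascii_match   = $bengali_four =~ /^\d$/a ? 1 : 0;
+
+    print "$unicode_match $ascii_match\n";   # 1 0
+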
Also, under this modifier, case-insensitive matching works on the full
set of Unicode
the pattern uses L</C<(?[ ])>>
+=item 8
+
+the pattern uses L<C<(*script_run: ...)>|/Script Runs>
+
=back
Another mnemonic for this modifier is "Depends", as the rules actually
/"(?>(?:(?>[^"\\]+)|\\.)*)"/
-Note that the possessive quantifier modifier can not be be combined
+Note that the possessive quantifier modifier can not be combined
with the non-greedy modifier. This is because it would make no sense.
Consider the follow equivalency table:
=item [3]
-See L<perlrecharclass/Backslash sequences> for details.
+See L<perlunicode/Unicode Character Properties> for details.
=item [4]
# for the closing ')' to match
qr/\(?#the backslash means this isn't a comment)p{Any}/
+ # Comments can be used to fold long patterns into multiple lines
+ qr/First part of a long regex(?#
+ )remaining part/
+
=item C<(?adlupimnsx-imnsx)>
=item C<(?^alupimnsx)>
=over 4
=item C<(?=pattern)>
-X<(?=)> X<look-ahead, positive> X<lookahead, positive>
+
+=item C<(*pla:pattern)>
+
+=item C<(*positive_lookahead:pattern)>
+X<(?=)>
+X<(*pla>
+X<(*positive_lookahead>
+X<look-ahead, positive> X<lookahead, positive>
A zero-width positive lookahead assertion. For example, C</\w+(?=\t)/>
matches a word followed by a tab, without including the tab in C<$&>.
+The alphabetic forms are experimental; using them yields a warning in the
+C<experimental::alpha_assertions> category.
+
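+As a rough sketch, assuming Perl 5.28 or later, the symbolic and
+alphabetic spellings can be compared like this (the sample string is
+made up for illustration):
+
+    use 5.028;
+    use warnings;
+    no warnings 'experimental::alpha_assertions';
+
+    my $line = "name\tvalue";
+
+    # Both patterns capture the word that precedes the tab.
+    my ($symbolic)   = $line =~ /(\w+)(?=\t)/;
+    my ($alphabetic) = $line =~ /(\w+)(*pla:\t)/;
+
+    print "$symbolic $alphabetic\n";   # name name
+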
=item C<(?!pattern)>
-X<(?!)> X<look-ahead, negative> X<lookahead, negative>
+
+=item C<(*nla:pattern)>
+
+=item C<(*negative_lookahead:pattern)>
+X<(?!)>
+X<(*nla>
+X<(*negative_lookahead>
+X<look-ahead, negative> X<lookahead, negative>
A zero-width negative lookahead assertion. For example C</foo(?!bar)/>
matches any occurrence of "foo" that isn't followed by "bar". Note
the next thing cannot be "foo"--and it's not, it's a "bar", so "foobar" will
match. Use lookbehind instead (see below).
+The alphabetic forms are experimental; using them yields a warning in the
+C<experimental::alpha_assertions> category.
+
=item C<(?<=pattern)>
=item C<\K>
-X<(?<=)> X<look-behind, positive> X<lookbehind, positive> X<\K>
+
+=item C<(*plb:pattern)>
+
+=item C<(*positive_lookbehind:pattern)>
+X<(?<=)>
+X<(*plb>
+X<(*positive_lookbehind>
+X<look-behind, positive> X<lookbehind, positive> X<\K>
A zero-width positive lookbehind assertion. For example, C</(?<=\t)\w+/>
matches a word that follows a tab, without including the tab in C<$&>.
-Works only for fixed-width lookbehind.
+Works only for fixed-width lookbehind of up to 255 characters. Note
+that a compilation error will be generated if the assertion contains a
+multi-character match under C</i>, as that could match a single
+character, or it could match two or three, and that makes it variable
+length, which is forbidden.
There is a special form of this construct, called C<\K> (available since
Perl 5.10.0), which causes the
s/foo\Kbar//g;
+The alphabetic forms (not including C<\K>) are experimental; using them
+yields a warning in the C<experimental::alpha_assertions> category.
+
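+A short sketch of both constructs (the sample data is illustrative):
+
+    use strict;
+    use warnings;
+
+    # Lookbehind: capture a word that follows a tab, excluding the tab.
+    my ($word) = "key\tvalue" =~ /(?<=\t)(\w+)/;
+
+    # \K: keep "foo" and delete only the "bar" that follows it.
+    (my $text = "foobar foobaz") =~ s/foo\Kbar//g;
+
+    print "$word|$text\n";   # value|foo foobaz
+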
=item C<(?<!pattern)>
-X<(?<!)> X<look-behind, negative> X<lookbehind, negative>
+
+=item C<(*nlb:pattern)>
+
+=item C<(*negative_lookbehind:pattern)>
+X<(?<!)>
+X<(*nlb>
+X<(*negative_lookbehind>
+X<look-behind, negative> X<lookbehind, negative>
A zero-width negative lookbehind assertion. For example C</(?<!bar)foo/>
matches any occurrence of "foo" that does not follow "bar". Works
-only for fixed-width lookbehind.
+only for fixed-width lookbehind of up to 255 characters. Note that a
+compilation error will be generated if the assertion contains a
+multi-character match under C</i>, as that could match a single
+character, or it could match two or three, and that makes it variable
+length, which is forbidden.
+
+The alphabetic forms are experimental; using them yields a warning in the
+C<experimental::alpha_assertions> category.
=back
after a successful match via C<%+> or C<%->. See L<perlvar>
for more details on the C<%+> and C<%-> hashes.
-If multiple distinct capture groups have the same name then the
+If multiple distinct capture groups have the same name, then
C<$+{NAME}> will refer to the leftmost defined group in the match.
The forms C<(?'NAME'pattern)> and C<< (?<NAME>pattern) >> are equivalent.
Checks whether the pattern matches (or does not match, for the C<"!">
variants).
-Full syntax: C<< (?(?=lookahead)then|else) >>
+Full syntax: C<< (?(?=I<lookahead>)I<then>|I<else>) >>
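+
+A minimal sketch of a lookahead condition (the pattern and inputs are
+invented for illustration): if the next character is a digit, exactly
+three digits are required; otherwise one or more lowercase letters are
+required.
+
+    use strict;
+    use warnings;
+
+    my $re = qr/^(?(?=\d)\d{3}|[a-z]+)$/;
+
+    # "123" and "abc" match; "12" and "1bc" take the digit branch and fail.
+    for my $input (qw(123 abc 12 1bc)) {
+        printf "%-3s %s\n", $input, $input =~ $re ? "match" : "no match";
+    }
+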
=item C<(?{ I<CODE> })>
interpolate them in another pattern.
=item C<< (?>pattern) >>
+
+=item C<< (*atomic:pattern) >>
+X<(?E<gt>pattern)>
+X<(*atomic>
X<backtrack> X<backtracking> X<atomic> X<possessive>
An "independent" subexpression, one which matches the substring
PAT?+ (?>PAT?)
PAT{min,max}+ (?>PAT{min,max})
+Nested C<(?E<gt>...)> constructs are not no-ops, even if at first glance
+they might seem to be. This is because the nested C<(?E<gt>...)> can
+restrict internal backtracking that otherwise might occur. For example,
+
+ "abc" =~ /(?>a[bc]*c)/
+
+matches, but
+
+ "abc" =~ /(?>a(?>[bc]*)c)/
+
+does not.
+
+The alphabetic form (C<(*atomic:...)>) is experimental; using it
+yields a warning in the C<experimental::alpha_assertions> category.
+
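+The two examples above can be checked directly. This sketch uses only
+the symbolic form, so no experimental warning is involved:
+
+    use strict;
+    use warnings;
+
+    # Backtracking inside the group lets [bc]* give back the final "c".
+    my $outer_only = "abc" =~ /(?>a[bc]*c)/     ? 1 : 0;
+
+    # The inner (?>...) keeps "bc", so the trailing "c" cannot match.
+    my $nested     = "abc" =~ /(?>a(?>[bc]*)c)/ ? 1 : 0;
+
+    print "$outer_only $nested\n";   # 1 0
+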
=item C<(?[ ])>
See L<perlrecharclass/Extended Bracketed Character Classes>.
where side-effects of lookahead I<might> have influenced the
following match, see L</C<< (?>pattern) >>>.
+=head2 Script Runs
+X<(*script_run:...)> X<(*sr:...)>
+X<(*atomic_script_run:...)> X<(*asr:...)>
+
+A script run is basically a sequence of characters, all from the same
+Unicode script (see L<perlunicode/Scripts>), such as Latin or Greek. In
+most places a single word would never be written in multiple scripts,
+unless it is a spoofing attack. An infamous example is
+
+ paypal.com
+
+Those letters could all be Latin (as in the example just above), or they
+could be all Cyrillic (except for the dot), or they could be a mixture
+of the two. In the case of an internet address the C<.com> would be in
+Latin, and any Cyrillic ones would cause it to be a mixture, not a
+script run. Someone clicking on such a link would not be directed to
+the real Paypal website, but an attacker would craft a look-alike one to
+attempt to gather sensitive information from the person.
+
+Starting in Perl 5.28, it is now easy to detect strings that aren't
+script runs. Simply enclose just about any pattern like either of
+these:
+
+ (*script_run:pattern)
+ (*sr:pattern)
+
+What happens is that after I<pattern> succeeds in matching, it is
+subjected to the additional criterion that every character in it must be
+from the same script (see exceptions below). If this isn't true,
+backtracking occurs until something all in the same script is found that
+matches, or all possibilities are exhausted. This can cause a lot of
+backtracking, but generally, only malicious input will result in this,
+though the slowdown could cause a denial-of-service attack. If your
+needs permit, it is best to make the pattern atomic to cut down on the
+amount of backtracking. This is so likely to be what you want, that
+instead of writing this:
+
+ (*script_run:(?>pattern))
+
+you can write either of these:
+
+ (*atomic_script_run:pattern)
+ (*asr:pattern)
+
+(See L</C<(?E<gt>pattern)>>.)
+
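+As a rough sketch of the spoofing check, assuming Perl 5.28 or later
+(the second string is an invented look-alike, and C<\x{0430}> is
+CYRILLIC SMALL LETTER A):
+
+    use 5.028;
+    use warnings;
+    no warnings 'experimental::script_run';
+
+    my $latin = "paypal";
+    my $mixed = "p\x{0430}yp\x{0430}l";   # Cyrillic look-alikes for "a"
+
+    my $latin_ok = $latin =~ /^(*asr:\w+)$/ ? 1 : 0;
+    my $mixed_ok = $mixed =~ /^(*asr:\w+)$/ ? 1 : 0;
+
+    print "$latin_ok $mixed_ok\n";   # 1 0
+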
+In Taiwan, Japan, and Korea, it is common for text to have a mixture of
+characters from their native scripts and base Chinese. Perl follows
+Unicode's UTS 39 (L<http://unicode.org/reports/tr39/>) Unicode Security
+Mechanisms in allowing such mixtures. For example, the Japanese scripts
+Katakana and Hiragana are commonly mixed together in practice, along
+with some Chinese characters, and hence are treated as being in a single
+script run by Perl.
+
+The rules used for matching decimal digits are somewhat different. Many
+scripts have their own sets of digits equivalent to the Western C<0>
+through C<9> ones. A few, such as Arabic, have more than one set. For
+a string to be considered a script run, all digits in it must come from
+the same set, as determined by the first digit encountered. The ASCII
+C<[0-9]> are accepted as being in any script, even those that have their
+own set. This is because these are often used in commerce even in such
+scripts. But any mixing of the ASCII and other digits will cause the
+sequence to not be a script run, failing the match. As an example,
+
+ qr/(*script_run: \d+ \b )/x
+
+guarantees that the digits matched will all be from the same set of 10.
+You won't get a look-alike digit from a different script that has a
+different value than what it appears to be.
+
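+A sketch along the same lines, again assuming Perl 5.28 or later;
+C<\x{1C46}> is LEPCHA DIGIT SIX and C<\x{1C40}> is LEPCHA DIGIT ZERO.
+By the rules described above, only the last string, which mixes Lepcha
+and ASCII digits, should fail to match:
+
+    use 5.028;
+    use warnings;
+    no warnings 'experimental::script_run';
+
+    my $re = qr/^(*script_run: \d+ )$/x;
+
+    my @results = map { $_ =~ $re ? 1 : 0 } (
+        "600",                        # all ASCII digits
+        "\x{1C46}\x{1C40}\x{1C40}",   # all Lepcha digits
+        "\x{1C46}00",                 # Lepcha six, then ASCII zeros
+    );
+
+    print "@results\n";
+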
+Unicode has three pseudo scripts that are handled specially.
+
+"Unknown" is applied to code points whose meaning has yet to be
+determined.  Perl currently matches as a script run any
+single-character string consisting of one of these code points.  But any string
+longer than one code point containing one of these will not be
+considered a script run.
+
+"Inherited" is applied to characters that modify another, such as an
+accent of some type. These are considered to be in the script of the
+master character, and so never cause a script run to not match.
+
+The other one is "Common". This consists of mostly punctuation, emoji,
+and characters used in mathematics and music, and the ASCII digits C<0>
+through C<9>. These characters can appear intermixed in text in many of
+the world's scripts. These also don't cause a script run to not match,
+except any ASCII digits encountered have to obey the decimal digit rules
+described above.
+
+This construct is non-capturing. You can add parentheses to I<pattern>
+to capture, if desired. You will have to do this if you plan to use
+L</(*ACCEPT) (*ACCEPT:arg)> and not have it bypass the script run
+checking.
+
+This feature is experimental, and the exact syntax and details of
+operation are subject to change; using it yields a warning in the
+C<experimental::script_run> category.
+
+The C<Script_Extensions> property as modified by UTS 39
+(L<http://unicode.org/reports/tr39/>) is used as the basis for this
+feature.
+
+To summarize,
+
+=over 4
+
+=item *
+
+All length 0 or length 1 sequences are script runs.
+
+=item *
+
+A longer sequence is a script run if and only if B<all> of the following
+conditions are met:
+
+Z<>
+
+=over
+
+=item 1
+
+No code point in the sequence has the C<Script_Extensions> property of
+C<Unknown>.
+
+This currently means that all code points in the sequence have been
+assigned by Unicode to be characters that are neither private use nor
+surrogate code points.
+
+=item 2
+
+All characters in the sequence come from the Common script and/or the
+Inherited script and/or a single other script.
+
+The script of a character is determined by the C<Script_Extensions>
+property as modified by UTS 39 (L<http://unicode.org/reports/tr39/>), as
+described above.
+
+=item 3
+
+All decimal digits in the sequence come from the same block of 10
+consecutive digits.
+
+=back
+
+=back
+
=head2 Special Backtracking Control Verbs
These special patterns are generally of the form C<(*I<VERB>:I<ARG>)>. Unless
Any number of C<(*PRUNE)> assertions may be used in a pattern.
-See also C<< (?>pattern) >> and possessive quantifiers for other ways to
+See also C<<< L<< /(?>pattern) >> >>> and possessive quantifiers for
+other ways to
control backtracking. In some cases, the use of C<(*PRUNE)> can be
replaced with a C<< (?>pattern) >> with no functional difference; however,
C<(*PRUNE)> can be used to handle cases that cannot be expressed using a