characters. These are called "bracketed character classes" when we are
being precise, but often the word "bracketed" is dropped. (Dropping it
usually doesn't cause confusion.) This means that the C<"["> character
-is another metacharacter. It doesn't match anything just by itelf; it
+is another metacharacter. It doesn't match anything just by itself; it
is used only to tell Perl that what follows it is a bracketed character
class. If you want to match a literal left square bracket, you must
escape it, like C<"\[">. The matching C<"]"> is also a metacharacter;
ASCII digits, but mean a different number, so a human could easily think
a number is a different quantity than it really is. For example,
C<BENGALI DIGIT FOUR> (U+09EA) looks very much like an
-C<ASCII DIGIT EIGHT> (U+0038). And, C<\d+>, may match strings of digits
-that are a mixture from different writing systems, creating a security
-issue. L<Unicode::UCD/num()> can be used to sort
-this out. Or the C</a> modifier can be used to force C<\d> to match
-just the ASCII 0 through 9.
+C<ASCII DIGIT EIGHT> (U+0038), and C<LEPCHA DIGIT SIX> (U+1C46) looks
+very much like an C<ASCII DIGIT FIVE> (U+0035). And C<\d+> may match
+strings of digits that are a mixture from different writing systems,
+creating a security issue. A fraudulent website, for example, could
+display the price of something using U+1C46, and it would appear to the
+user that something costs 500 units, when it really costs 600. A browser
+that enforced script runs (L</Script Runs>) would prevent that
+fraudulent display. L<Unicode::UCD/num()> can also be used to sort this
+out. Or the C</a> modifier can be used to force C<\d> to match just the
+ASCII 0 through 9.
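
As a minimal sketch of the two remedies named above (the variable names are illustrative only, and this is not part of the patch itself), the following compares C<\d> with and without C</a> on a Bengali digit, and uses C<Unicode::UCD::num()> to reject a string that mixes digit sets:

```perl
use strict;
use warnings;
use Unicode::UCD 'num';

my $bengali_four = "\x{09EA}";     # BENGALI DIGIT FOUR
my $mixed        = "1\x{09EA}";    # ASCII digit followed by a Bengali digit

# Without /a, \d matches any Unicode decimal digit.
my $unicode_match = $bengali_four =~ /\A\d\z/  ? 1 : 0;

# With /a, \d is restricted to ASCII 0 through 9.
my $ascii_match   = $bengali_four =~ /\A\d\z/a ? 1 : 0;

# num() gives the numeric value of a string of digits from one script,
# and undef when the digits come from different scripts.
my $value       = num($bengali_four);
my $mixed_value = num($mixed);

print "default \\d: $unicode_match, \\d under /a: $ascii_match\n";
print "num(bengali four) = $value\n";
print "num(mixed) is ", (defined $mixed_value ? $mixed_value : "undef"), "\n";
```

Which remedy fits depends on whether non-ASCII digits are legitimate input: C</a> rejects them outright, while C<num()> accepts them as long as they all come from one script.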
Also, under this modifier, case-insensitive matching works on the full
set of Unicode
=item 8
-the pattern uses L<C<(+script_run: ...)>|/Script Runs>
+the pattern uses L<C<(*script_run: ...)>|/Script Runs>
=back
/"(?>(?:(?>[^"\\]+)|\\.)*)"/
-Note that the possessive quantifier modifier can not be be combined
+Note that the possessive quantifier modifier can not be combined
with the non-greedy modifier. This is because it would make no sense.
Consider the following equivalency table:
# for the closing ')' to match
qr/\(?#the backslash means this isn't a comment)p{Any}/
+ # Comments can be used to fold long patterns into multiple lines
+ qr/First part of a long regex(?#
+ )remaining part/
+
=item C<(?adlupimnsx-imnsx)>
=item C<(?^alupimnsx)>
=over 4
=item C<(?=pattern)>
-X<(?=)> X<look-ahead, positive> X<lookahead, positive>
+
+=item C<(*pla:pattern)>
+
+=item C<(*positive_lookahead:pattern)>
+X<(?=)>
+X<(*pla>
+X<(*positive_lookahead>
+X<look-ahead, positive> X<lookahead, positive>
A zero-width positive lookahead assertion. For example, C</\w+(?=\t)/>
matches a word followed by a tab, without including the tab in C<$&>.
+The alphabetic forms are experimental; using them yields a warning in the
+C<experimental::alpha_assertions> category.
+
=item C<(?!pattern)>
-X<(?!)> X<look-ahead, negative> X<lookahead, negative>
+
+=item C<(*nla:pattern)>
+
+=item C<(*negative_lookahead:pattern)>
+X<(?!)>
+X<(*nla>
+X<(*negative_lookahead>
+X<look-ahead, negative> X<lookahead, negative>
A zero-width negative lookahead assertion. For example C</foo(?!bar)/>
matches any occurrence of "foo" that isn't followed by "bar". Note
the next thing cannot be "foo"--and it's not, it's a "bar", so "foobar" will
match. Use lookbehind instead (see below).
+The alphabetic forms are experimental; using them yields a warning in the
+C<experimental::alpha_assertions> category.
+
=item C<(?<=pattern)>
=item C<\K>
-X<(?<=)> X<look-behind, positive> X<lookbehind, positive> X<\K>
+
+=item C<(*plb:pattern)>
+
+=item C<(*positive_lookbehind:pattern)>
+X<(?<=)>
+X<(*plb>
+X<(*positive_lookbehind>
+X<look-behind, positive> X<lookbehind, positive> X<\K>
A zero-width positive lookbehind assertion. For example, C</(?<=\t)\w+/>
matches a word that follows a tab, without including the tab in C<$&>.
s/foo\Kbar//g;
+The alphabetic forms (not including C<\K>) are experimental; using them
+yields a warning in the C<experimental::alpha_assertions> category.
+
=item C<(?<!pattern)>
-X<(?<!)> X<look-behind, negative> X<lookbehind, negative>
+
+=item C<(*nlb:pattern)>
+
+=item C<(*negative_lookbehind:pattern)>
+X<(?<!)>
+X<(*nlb>
+X<(*negative_lookbehind>
+X<look-behind, negative> X<lookbehind, negative>
A zero-width negative lookbehind assertion. For example C</(?<!bar)foo/>
matches any occurrence of "foo" that does not follow "bar". Works
only for fixed-width lookbehind.
+The alphabetic forms are experimental; using them yields a warning in the
+C<experimental::alpha_assertions> category.
+
=back
=item C<< (?<NAME>pattern) >>
interpolate them in another pattern.
=item C<< (?>pattern) >>
+
+=item C<< (*atomic:pattern) >>
+X<(?E<gt>pattern)>
+X<(*atomic>
X<backtrack> X<backtracking> X<atomic> X<possessive>
An "independent" subexpression, one which matches the substring
PAT?+ (?>PAT?)
PAT{min,max}+ (?>PAT{min,max})
+Nested C<(?E<gt>...)> constructs are not no-ops, even if at first glance
+they might seem to be. This is because the nested C<(?E<gt>...)> can
+restrict internal backtracking that otherwise might occur. For example,
+
+ "abc" =~ /(?>a[bc]*c)/
+
+matches, but
+
+ "abc" =~ /(?>a(?>[bc]*)c)/
+
+does not.
+
+The alphabetic form (C<(*atomic:...)>) is experimental; using it
+yields a warning in the C<experimental::alpha_assertions> category.
+
=item C<(?[ ])>
See L<perlrecharclass/Extended Bracketed Character Classes>.
following match, see L</C<< (?>pattern) >>>.
=head2 Script Runs
+X<(*script_run:...)> X<(*sr:...)>
+X<(*atomic_script_run:...)> X<(*asr:...)>
A script run is basically a sequence of characters, all from the same
Unicode script (see L<perlunicode/Scripts>), such as Latin or Greek. In
attempt to gather sensitive information from the person.
Starting in Perl 5.28, it is now easy to detect strings that aren't
-script runs. Simply enclose just about any pattern like this:
+script runs. Simply enclose just about any pattern like either of
+these:
- (+script_run:pattern)
+ (*script_run:pattern)
+ (*sr:pattern)
What happens is that after I<pattern> succeeds in matching, it is
subjected to the additional criterion that every character in it must be
matches, or all possibilities are exhausted. This can cause a lot of
backtracking, but generally, only malicious input will result in this,
though the slow down could cause a denial of service attack. If your
-needs permit, it is best to make the pattern atomic.
+needs permit, it is best to make the pattern atomic to cut down on the
+amount of backtracking. This is so likely to be what you want that
+instead of writing this:
+
+ (*script_run:(?>pattern))
- (+script_run:(?>pattern))
+you can write either of these:
+
+ (*atomic_script_run:pattern)
+ (*asr:pattern)
(See L</C<(?E<gt>pattern)>>.)
In Taiwan, Japan, and Korea, it is common for text to have a mixture of
characters from their native scripts and base Chinese. Perl follows
Unicode's UTS 39 (L<http://unicode.org/reports/tr39/>) Unicode Security
-Mechanisms in allowing such mixtures.
+Mechanisms in allowing such mixtures. For example, the Japanese scripts
+Katakana and Hiragana are commonly mixed together in practice, along
+with some Chinese characters, and hence are treated as being in a single
+script run by Perl.
The rules used for matching decimal digits are somewhat different. Many
scripts have their own sets of digits equivalent to the Western C<0>
scripts. But any mixing of the ASCII and other digits will cause the
sequence to not be a script run, failing the match. As an example,
- qr/(?script_run: \d+ \b )/x
+ qr/(*script_run: \d+ \b )/x
guarantees that the digits matched will all be from the same set of 10.
You won't get a look-alike digit from a different script that has a
operation are subject to change; using it yields a warning in the
C<experimental::script_run> category.
-The C<Script_Extensions> property is used as the basis for this feature.
+The C<Script_Extensions> property as modified by UTS 39
+(L<http://unicode.org/reports/tr39/>) is used as the basis for this
+feature.
+
+To summarize,
+
+=over 4
+
+=item *
+
+All length 0 or length 1 sequences are script runs.
+
+=item *
+
+A longer sequence is a script run if and only if B<all> of the following
+conditions are met:
+
+Z<>
+
+=over
+
+=item 1
+
+No code point in the sequence has the C<Script_Extensions> property of
+C<Unknown>.
+
+This currently means that all code points in the sequence have been
+assigned by Unicode to be characters that are neither private use nor
+surrogate code points.
+
+=item 2
+
+All characters in the sequence come from the Common script and/or the
+Inherited script and/or a single other script.
+
+The script of a character is determined by the C<Script_Extensions>
+property as modified by UTS 39 (L<http://unicode.org/reports/tr39/>), as
+described above.
+
+=item 3
+
+All decimal digits in the sequence come from the same block of 10
+consecutive digits.
+
+=back
+
+=back
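
The summary above can be exercised with a short sketch. It assumes Perl 5.28 or later, the sample strings and the C<%is_run> hash are illustrative only, and the pattern is anchored so that backtracking to a shorter prefix cannot rescue the match:

```perl
use strict;
use warnings;
no warnings 'experimental';   # script runs are experimental in 5.28

my %sample = (
    ascii   => "123",
    bengali => "\x{09E7}\x{09E8}\x{09E9}",   # Bengali digits one, two, three
    mixed   => "12\x{09E9}",                 # ASCII digits plus a Bengali digit
);

my %is_run;
for my $name (sort keys %sample) {
    # The whole string must be digits AND must form a single script run.
    $is_run{$name} = $sample{$name} =~ /\A(*script_run:\d+)\z/ ? 1 : 0;
    print "$name: ", ($is_run{$name} ? "script run" : "not a script run"), "\n";
}
```

The two unmixed strings should pass and the mixed one should fail on condition 3, since its digits come from two different blocks of 10.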
=head2 Special Backtracking Control Verbs