X-Git-Url: https://perl5.git.perl.org/perl5.git/blobdiff_plain/3337229ca19b517df4e7e43fd4312f0ee9e5b6c0..12e1284a67e5e3404c704c3f864749fd9f04c7c4:/pod/perlre.pod diff --git a/pod/perlre.pod b/pod/perlre.pod index 29082a6..cc71707 100644 --- a/pod/perlre.pod +++ b/pod/perlre.pod @@ -192,7 +192,7 @@ want in the set. You do this by enclosing the list within C<[]> bracket characters. These are called "bracketed character classes" when we are being precise, but often the word "bracketed" is dropped. (Dropping it usually doesn't cause confusion.) This means that the C<"["> character -is another metacharacter. It doesn't match anything just by itelf; it +is another metacharacter. It doesn't match anything just by itself; it is used only to tell Perl that what follows it is a bracketed character class. If you want to match a literal left square bracket, you must escape it, like C<"\[">. The matching C<"]"> is also a metacharacter; @@ -563,7 +563,8 @@ At any given time, exactly one of these modifiers is in effect. Their existence allows Perl to keep the originally compiled behavior of a regular expression, regardless of what rules are in effect when it is actually executed. And if it is interpolated into a larger regex, the -original's rules continue to apply to it, and only it. +original's rules continue to apply to it, and don't affect the other +parts. The C and C modifiers are automatically selected for regular expressions compiled within the scope of various pragmas, @@ -649,11 +650,16 @@ possible matches. And some of those digits look like some of the 10 ASCII digits, but mean a different number, so a human could easily think a number is a different quantity than it really is. For example, C (U+09EA) looks very much like an -C (U+0038). And, C<\d+>, may match strings of digits -that are a mixture from different writing systems, creating a security -issue. L can be used to sort -this out. 
Or the C modifier can be used to force C<\d> to match -just the ASCII 0 through 9. +C (U+0038), and C (U+1C46) looks +very much like an C (U+0035). And, C<\d+>, may match +strings of digits that are a mixture from different writing systems, +creating a security issue. A fraudulent website, for example, could +display the price of something using U+1C46, and it would appear to the +user that something cost 500 units, but it really costs 600. A browser +that enforced script runs (L) would prevent that +fraudulent display. L can also be used to sort this +out. Or the C modifier can be used to force C<\d> to match just the +ASCII 0 through 9. Also, under this modifier, case-insensitive matching works on the full set of Unicode @@ -715,8 +721,8 @@ the pattern uses L|/Script Runs> Another mnemonic for this modifier is "Depends", as the rules actually used depend on various things, and as a result you can get unexpected results. See L. The Unicode Bug has -become rather infamous, leading to yet another (printable) name for this -modifier, "Dodgy". +become rather infamous, leading to yet another (without swearing) name +for this modifier, "Dodgy". Unless the pattern or string are encoded in UTF-8, only ASCII characters can match positively. @@ -920,13 +926,13 @@ string" problem can be most efficiently performed when written as: as we know that if the final quote does not match, backtracking will not help. See the independent subexpression -Lpattern) >>> for more details; +LI) >>> for more details; possessive quantifiers are just syntactic sugar for that construct. For instance the above example could also be written as follows: /"(?>(?:(?>[^"\\]+)|\\.)*)"/ -Note that the possessive quantifier modifier can not be be combined +Note that the possessive quantifier modifier can not be combined with the non-greedy modifier. This is because it would make no sense. Consider the follow equivalency table: @@ -1014,7 +1020,7 @@ See L for details. =item [3] -See L for details. 
+See L for details =item [4] @@ -1030,8 +1036,9 @@ See L below for details. =item [7] -Note that C<\N> has two meanings. When of the form C<\N{NAME}>, it matches the -character or character sequence whose name is C; and similarly +Note that C<\N> has two meanings. When of the form C<\N{I}>, it +matches the character or character sequence whose name is I; and +similarly when of the form C<\N{U+I}>, it matches the character whose Unicode code point is I. Otherwise it matches any character but C<\n>. @@ -1332,10 +1339,10 @@ expressions, and 2) whenever you see one, you should stop and =over 4 -=item C<(?#text)> +=item C<(?#I)> X<(?#)> -A comment. The text is ignored. +A comment. The I is ignored. Note that Perl closes the comment as soon as it sees a C<")">, so there is no way to put a literal C<")"> in the comment. The pattern's closing delimiter must be escaped by @@ -1369,7 +1376,7 @@ an escape sequence. Examples: =item C<(?^alupimnsx)> X<(?)> X<(?^)> -One or more embedded pattern-match modifiers, to be turned on (or +Zero or more embedded pattern-match modifiers, to be turned on (or turned off if preceded by C<"-">) for the remainder of the pattern or the remainder of the enclosing pattern group (if any). @@ -1397,8 +1404,8 @@ repetition of the previous word, assuming the C modifier, and no C modifier outside this group. These modifiers do not carry over into named subpatterns called in the -enclosing group. In other words, a pattern such as C<((?i)(?&NAME))> does not -change the case-sensitivity of the C<"NAME"> pattern. +enclosing group. In other words, a pattern such as C<((?i)(?&I))> does not +change the case-sensitivity of the I pattern. A modifier is overridden by later occurrences of this construct in the same scope containing the same modifier, so that @@ -1443,12 +1450,16 @@ C<(?-d:...)> and C<(?dl:...)> are fatal errors. Note also that the C<"p"> modifier is special in that its presence anywhere in a pattern has a global effect. 
-=item C<(?:pattern)> +Having zero modifiers makes this a no-op (so why did you specify it, +unless it's generated code), and starting in v5.30, warns under L|re/'strict' mode>. + +=item C<(?:I)> X<(?:)> -=item C<(?adluimnsx-imnsx:pattern)> +=item C<(?adluimnsx-imnsx:I)> -=item C<(?^aluimnsx:pattern)> +=item C<(?^aluimnsx:I)> X<(?^:)> This is for clustering, not capturing; it groups subexpressions like @@ -1513,7 +1524,7 @@ redundant. Mnemonic for C<(?^...)>: A fresh beginning since the usual use of a caret is to match at the beginning. -=item C<(?|pattern)> +=item C<(?|I)> X<(?|)> X This is the "branch reset" pattern, which has the special property @@ -1569,11 +1580,11 @@ lookahead matches text following the current match position. =over 4 -=item C<(?=pattern)> +=item C<(?=I)> -=item C<(*pla:pattern)> +=item C<(*pla:I)> -=item C<(*positive_lookahead:pattern)> +=item C<(*positive_lookahead:I)> X<(?=)> X<(*pla> X<(*positive_lookahead> @@ -1585,11 +1596,11 @@ matches a word followed by a tab, without including the tab in C<$&>. The alphabetic forms are experimental; using them yields a warning in the C category. -=item C<(?!pattern)> +=item C<(?!I)> -=item C<(*nla:pattern)> +=item C<(*nla:I)> -=item C<(*negative_lookahead:pattern)> +=item C<(*negative_lookahead:I)> X<(?!)> X<(*nla> X<(*negative_lookahead> @@ -1608,13 +1619,13 @@ match. Use lookbehind instead (see below). The alphabetic forms are experimental; using them yields a warning in the C category. -=item C<(?<=pattern)> +=item C<(?<=I)> =item C<\K> -=item C<(*plb:pattern)> +=item C<(*plb:I)> -=item C<(*positive_lookbehind:pattern)> +=item C<(*positive_lookbehind:I)> X<(?<=)> X<(*plb> X<(*positive_lookbehind> @@ -1622,13 +1633,38 @@ X X X<\K> A zero-width positive lookbehind assertion. For example, C matches a word that follows a tab, without including the tab in C<$&>. -Works only for fixed-width lookbehind. 
-There is a special form of this construct, called C<\K> (available since -Perl 5.10.0), which causes the +Prior to Perl 5.30, it worked only for fixed-width lookbehind, but +starting in that release, it can handle variable lengths from 1 to 255 +characters as an experimental feature. The feature is enabled +automatically if you use a variable length lookbehind assertion, but +will raise a warning at pattern compilation time, unless turned off, in +the C category. This is to warn you that the exact +behavior is subject to change should feedback from actual use in the +field indicate to do so; or even complete removal if the problems found +are not practically surmountable. You can achieve close to pre-5.30 +behavior by fatalizing warnings in this category. + +There is a special form of this construct, called C<\K> +(available since Perl 5.10.0), which causes the regex engine to "keep" everything it had matched prior to the C<\K> and -not include it in C<$&>. This effectively provides variable-length -lookbehind. The use of C<\K> inside of another lookaround assertion +not include it in C<$&>. This effectively provides non-experimental +variable-length lookbehind of any length. + +And, there is a technique that can be used to handle variable length +lookbehinds on earlier releases, and longer than 255 characters. It is +described in +L. + +Note that under C, a few single characters match two or three other +characters. This makes them variable length, and the 255 length applies +to the maximum number of characters in the match. For +example C matches the sequence +C<"ss">. Your lookbehind assertion could contain 127 Sharp S +characters under C, but adding a 128th would generate a compilation +error, as that could match 256 C<"s"> characters in a row. + +The use of C<\K> inside of another lookaround assertion is allowed, but the behaviour is currently not well defined. 
For various reasons C<\K> may be significantly more efficient than the @@ -1642,44 +1678,74 @@ can be rewritten as the much more efficient s/foo\Kbar//g; +Use of the non-greedy modifier C<"?"> may not give you the expected +results if it is within a capturing group within the construct. + The alphabetic forms (not including C<\K> are experimental; using them yields a warning in the C category. -=item C<(? +=item C<(?)> -=item C<(*nlb:pattern)> +=item C<(*nlb:I)> -=item C<(*negative_lookbehind:pattern)> +=item C<(*negative_lookbehind:I)> X<(? X<(*nlb> X<(*negative_lookbehind> X X A zero-width negative lookbehind assertion. For example C -matches any occurrence of "foo" that does not follow "bar". Works -only for fixed-width lookbehind. +matches any occurrence of "foo" that does not follow "bar". + +Prior to Perl 5.30, it worked only for fixed-width lookbehind, but +starting in that release, it can handle variable lengths from 1 to 255 +characters as an experimental feature. The feature is enabled +automatically if you use a variable length lookbehind assertion, but +will raise a warning at pattern compilation time, unless turned off, in +the C category. This is to warn you that the exact +behavior is subject to change should feedback from actual use in the +field indicate to do so; or even complete removal if the problems found +are not practically surmountable. You can achieve close to pre-5.30 +behavior by fatalizing warnings in this category. + +There is a technique that can be used to handle variable length +lookbehinds on earlier releases, and longer than 255 characters. It is +described in +L. + +Note that under C, a few single characters match two or three other +characters. This makes them variable length, and the 255 length applies +to the maximum number of characters in the match. For +example C matches the sequence +C<"ss">. 
Your lookbehind assertion could contain 127 Sharp S +characters under C, but adding a 128th would generate a compilation +error, as that could match 256 C<"s"> characters in a row. + +Use of the non-greedy modifier C<"?"> may not give you the expected +results if it is within a capturing group within the construct. The alphabetic forms are experimental; using them yields a warning in the C category. =back -=item C<< (?pattern) >> +=item C<< (?>I) >> -=item C<(?'NAME'pattern)> +=item C<(?'I'I)> X<< (?) >> X<(?'NAME')> X X A named capture group. Identical in every respect to normal capturing parentheses C<()> but for the additional fact that the group can be referred to by name in various regular expression -constructs (like C<\g{NAME}>) and can be accessed by name +constructs (like C<\g{I}>) and can be accessed by name after a successful match via C<%+> or C<%->. See L for more details on the C<%+> and C<%-> hashes. If multiple distinct capture groups have the same name, then -C<$+{NAME}> will refer to the leftmost defined group in the match. +C<$+{I}> will refer to the leftmost defined group in the match. -The forms C<(?'NAME'pattern)> and C<< (?pattern) >> are equivalent. +The forms C<(?'I'I)> and C<< (?>I) >> +are equivalent. B While the notation of this construct is the same as the similar function in .NET regexes, the behavior is not. In Perl the groups are @@ -1688,7 +1754,7 @@ pattern /(x)(?y)(z)/ -C<$+{I}> will be the same as C<$2>, and C<$3> will contain 'z' instead of +C<$+{foo}> will be the same as C<$2>, and C<$3> will contain 'z' instead of the opposite which is what a .NET regex hacker might expect. Currently I is restricted to simple identifiers only. @@ -1697,29 +1763,30 @@ its Unicode extension (see L), though it isn't extended by the locale (see L). 
B In order to make things easier for programmers with experience -with the Python or PCRE regex engines, the pattern C<< (?PENAMEEpattern) >> -may be used instead of C<< (?pattern) >>; however this form does not +with the Python or PCRE regex engines, the pattern C<< +(?PEIEI) >> +may be used instead of C<< (?>I) >>; however this form does not support the use of single quotes as a delimiter for the name. -=item C<< \k >> +=item C<< \k> >> -=item C<< \k'NAME' >> +=item C<< \k'I' >> Named backreference. Similar to numeric backreferences, except that the group is designated by name and not number. If multiple groups have the same name then it refers to the leftmost defined group in the current match. -It is an error to refer to a name not defined by a C<< (?) >> +It is an error to refer to a name not defined by a C<< (?>) >> earlier in the pattern. Both forms are equivalent. B In order to make things easier for programmers with experience -with the Python or PCRE regex engines, the pattern C<< (?P=NAME) >> -may be used instead of C<< \k >>. +with the Python or PCRE regex engines, the pattern C<< (?P=I) >> +may be used instead of C<< \k> >>. -=item C<(?{ code })> +=item C<(?{ I })> X<(?{})> X X X B: Using this feature safely requires that you understand its @@ -1823,9 +1890,9 @@ This assertion may be used as the condition in a (?(condition)yes-pattern|no-pattern) -switch. If I used in this way, the result of evaluation of C +switch. If I used in this way, the result of evaluation of I is put into the special variable C<$^R>. This happens immediately, so -C<$^R> can be used from other C<(?{ code })> assertions inside the same +C<$^R> can be used from other C<(?{ I })> assertions inside the same regular expression. The assignment to C<$^R> above is properly localized, so the old @@ -1841,7 +1908,7 @@ keep track of the number of nested parentheses. 
For example: print "color = $color, animal = $animal\n"; -=item C<(??{ code })> +=item C<(??{ I })> X<(??{})> X X X @@ -1852,7 +1919,7 @@ optimisations in the regex engine. For more information on this, see L. This is a "postponed" regular subexpression. It behaves in I the -same way as a C<(?{ code })> code block as described above, except that +same way as a C<(?{ I })> code block as described above, except that its return value, rather than being assigned to C<$^R>, is treated as a pattern, compiled if it's a string (or used as-is if its a qr// object), then matched as if it were inserted instead of this construct. @@ -1888,7 +1955,7 @@ The following pattern matches a parenthesized group: }x; See also -L)>|/(?PARNO) (?-PARNO) (?+PARNO) (?R) (?0)> +L)>|/(?I) (?-I) (?+I) (?R) (?0)> for a different, more efficient way to accomplish the same task. @@ -1908,11 +1975,11 @@ the current position in the string. Information about capture state from the caller for things like backreferences is available to the subpattern, but capture buffers set by the subpattern are not visible to the caller. -Similar to C<(??{ code })> except that it does not involve executing any +Similar to C<(??{ I })> except that it does not involve executing any code or potentially compiling a returned pattern string; instead it treats the part of the current pattern contained within a specified capture group as an independent pattern that must match at the current position. Also -different is the treatment of capture buffers, unlike C<(??{ code })> +different is the treatment of capture buffers, unlike C<(??{ I })> recursive patterns have access to their caller's match state, so one can use backreferences safely. @@ -1980,7 +2047,7 @@ as atomic. Also, modifiers are resolved at compile time, so constructs like C<(?i:(?1))> or C<(?:(?i)(?1))> do not affect how the sub-pattern will be processed. -=item C<(?&NAME)> +=item C<(?&I)> X<(?&NAME)> Recurse to a named subpattern. 
Identical to C<(?I)> except that the @@ -1991,19 +2058,19 @@ It is an error to refer to a name that is not declared somewhere in the pattern. B In order to make things easier for programmers with experience -with the Python or PCRE regex engines the pattern C<< (?P>NAME) >> -may be used instead of C<< (?&NAME) >>. +with the Python or PCRE regex engines the pattern C<< (?P>I) >> +may be used instead of C<< (?&I) >>. -=item C<(?(condition)yes-pattern|no-pattern)> +=item C<(?(I)I|I)> X<(?()> -=item C<(?(condition)yes-pattern)> +=item C<(?(I)I)> -Conditional expression. Matches C if C yields -a true value, matches C otherwise. A missing pattern always +Conditional expression. Matches I if I yields +a true value, matches I otherwise. A missing pattern always matches. -C<(condition)> should be one of: +C<(I)> should be one of: =over 4 @@ -2023,7 +2090,7 @@ matched); (true when evaluated inside of recursion or eval). Additionally the C<"R"> may be followed by a number, (which will be true when evaluated when recursing -inside of the appropriate group), or by C<&NAME>, in which case it will +inside of the appropriate group), or by C<&I>, in which case it will be true only when evaluated during recursion in the named group. =back @@ -2046,17 +2113,17 @@ Full syntax: C<< (?()then|else) >> Checks whether the pattern matches (or does not match, for the C<"!"> variants). -Full syntax: C<< (?(?=lookahead)then|else) >> +Full syntax: C<< (?(?=I)I|I) >> =item C<(?{ I })> Treats the return value of the code block as the condition. -Full syntax: C<< (?(?{ code })then|else) >> +Full syntax: C<< (?(?{ I })I|I) >> =item C<(R)> Checks if the expression has been evaluated inside of recursion. -Full syntax: C<< (?(R)then|else) >> +Full syntax: C<< (?(R)I|I) >> =item C<(R1)> C<(R2)> ... @@ -2067,7 +2134,7 @@ inside of the n-th capture group. This check is the regex equivalent of In other words, it does not check the full recursion stack. 
-Full syntax: C<< (?(R1)then|else) >> +Full syntax: C<< (?(R1)I|I) >> =item C<(R&I)> @@ -2075,14 +2142,14 @@ Similar to C<(R1)>, this predicate checks to see if we're executing directly inside of the leftmost group with a given name (this is the same logic used by C<(?&I)> to disambiguate). It does not check the full stack, but only the name of the innermost active recursion. -Full syntax: C<< (?(R&name)then|else) >> +Full syntax: C<< (?(R&I)I|I) >> =item C<(DEFINE)> In this case, the yes-pattern is never directly executed, and no no-pattern is allowed. Similar in spirit to C<(?{0})> but more efficient. See below for details. -Full syntax: C<< (?(DEFINE)definitions...) >> +Full syntax: C<< (?(DEFINE)I...) >> =back @@ -2135,15 +2202,15 @@ Will output 2, not 1. This is particularly important if you intend to compile the definitions with the C operator, and later interpolate them in another pattern. -=item C<< (?>pattern) >> +=item C<< (?>I) >> -=item C<< (*atomic:pattern) >> +=item C<< (*atomic:I) >> X<(?Epattern)> X<(*atomic> X X X X An "independent" subexpression, one which matches the substring -that a I C would match if anchored at the given +that a standalone I would match if anchored at the given position, and it matches I. This construct is useful for optimizations of what would otherwise be "eternal" matches, because it will not backtrack (see L). @@ -2159,12 +2226,12 @@ group C (see L). In particular, C inside C will match fewer characters than a standalone C, since this makes the tail match. -C<< (?>pattern) >> does not disable backtracking altogether once it has +C<< (?>I) >> does not disable backtracking altogether once it has matched. It is still possible to backtrack past the construct, but not into it. So C<< ((?>a*)|(?>b*))ar >> will still match "bar". -An effect similar to C<< (?>pattern) >> may be achieved by writing -C<(?=(pattern))\g{-1}>. 
This matches the same substring as a standalone +An effect similar to C<< (?>I) >> may be achieved by writing +C<(?=(I))\g{-1}>. This matches the same substring as a standalone C, and the following C<\g{-1}> eats the matched string; it therefore makes a zero-length assertion into an analogue of C<< (?>...) >>. (The difference between these two constructs is that the second one @@ -2248,6 +2315,18 @@ to inside of one of these constructs. The following equivalences apply: PAT?+ (?>PAT?) PAT{min,max}+ (?>PAT{min,max}) +Nested C<(?E...)> constructs are not no-ops, even if at first glance +they might seem to be. This is because the nested C<(?E...)> can +restrict internal backtracking that otherwise might occur. For example, + + "abc" =~ /(?>a[bc]*c)/ + +matches, but + + "abc" =~ /(?>a(?>[bc]*)c)/ + +does not. + The alphabetic form (C<(*atomic:...)>) is experimental; using it yields a warning in the C category. @@ -2500,8 +2579,9 @@ backtracking occurs until something all in the same script is found that matches, or all possibilities are exhausted. This can cause a lot of backtracking, but generally, only malicious input will result in this, though the slow down could cause a denial of service attack. If your -needs permit, it is best to make the pattern atomic. This is so likely -to be what you want, that instead of writing this: +needs permit, it is best to make the pattern atomic to cut down on the +amount of backtracking. This is so likely to be what you want, that +instead of writing this: (*script_run:(?>pattern)) @@ -2510,22 +2590,22 @@ you can write either of these: (*atomic_script_run:pattern) (*asr:pattern) -(See Lpattern)>>.) +(See LI)>>.) In Taiwan, Japan, and Korea, it is common for text to have a mixture of characters from their native scripts and base Chinese. Perl follows Unicode's UTS 39 (L) Unicode Security -Mechanisms in allowing such mixtures. +Mechanisms in allowing such mixtures. 
For example, the Japanese scripts +Katakana and Hiragana are commonly mixed together in practice, along +with some Chinese characters, and hence are treated as being in a single +script run by Perl. -The rules used for matching decimal digits are somewhat different. Many +The rules used for matching decimal digits are slightly stricter. Many scripts have their own sets of digits equivalent to the Western C<0> through C<9> ones. A few, such as Arabic, have more than one set. For a string to be considered a script run, all digits in it must come from -the same set, as determined by the first digit encountered. The ASCII -C<[0-9]> are accepted as being in any script, even those that have their -own set. This is because these are often used in commerce even in such -scripts. But any mixing of the ASCII and other digits will cause the -sequence to not be a script run, failing the match. As an example, +the same set of ten, as determined by the first digit encountered. +As an example, qr/(*script_run: \d+ \b )/x @@ -2546,11 +2626,11 @@ accent of some type. These are considered to be in the script of the master character, and so never cause a script run to not match. The other one is "Common". This consists of mostly punctuation, emoji, -and characters used in mathematics and music, and the ASCII digits C<0> -through C<9>. These characters can appear intermixed in text in many of -the world's scripts. These also don't cause a script run to not match, -except any ASCII digits encountered have to obey the decimal digit rules -described above. +and characters used in mathematics and music, the ASCII digits C<0> +through C<9>, and full-width forms of these digits. These characters +can appear intermixed in text in many of the world's scripts. These +also don't cause a script run to not match. But like other scripts, all +digits in a run must come from the same set of 10. This construct is non-capturing. You can add parentheses to I to capture, if desired. 
You will have to do this if you plan to use @@ -2561,12 +2641,58 @@ This feature is experimental, and the exact syntax and details of operation are subject to change; using it yields a warning in the C category. -The C property is used as the basis for this feature. +The C property as modified by UTS 39 +(L) is used as the basis for this +feature. + +To summarize, + +=over 4 + +=item * + +All length 0 or length 1 sequences are script runs. + +=item * + +A longer sequence is a script run if and only if B of the following +conditions are met: + +Z<> + +=over + +=item 1 + +No code point in the sequence has the C property of +C. + +This currently means that all code points in the sequence have been +assigned by Unicode to be characters that aren't private use nor +surrogate code points. + +=item 2 + +All characters in the sequence come from the Common script and/or the +Inherited script and/or a single other script. + +The script of a character is determined by the C +property as modified by UTS 39 (L), as +described above. + +=item 3 + +All decimal digits in the sequence come from the same block of 10 +consecutive digits. + +=back + +=back =head2 Special Backtracking Control Verbs -These special patterns are generally of the form C<(*I:I)>. Unless -otherwise stated the I argument is optional; in some cases, it is +These special patterns are generally of the form C<(*I:I)>. Unless +otherwise stated the I argument is optional; in some cases, it is mandatory. Any pattern containing a special backtracking verb that allows an argument @@ -2574,16 +2700,16 @@ has the special behaviour that when executed it sets the current package's C<$REGERROR> and C<$REGMARK> variables. When doing so the following rules apply: -On failure, the C<$REGERROR> variable will be set to the I value of the +On failure, the C<$REGERROR> variable will be set to the I value of the verb pattern, if the verb was involved in the failure of the match. 
If the -I part of the pattern was omitted, then C<$REGERROR> will be set to the -name of the last C<(*MARK:NAME)> pattern executed, or to TRUE if there was +I part of the pattern was omitted, then C<$REGERROR> will be set to the +name of the last C<(*MARK:I)> pattern executed, or to TRUE if there was none. Also, the C<$REGMARK> variable will be set to FALSE. On a successful match, the C<$REGERROR> variable will be set to FALSE, and the C<$REGMARK> variable will be set to the name of the last -C<(*MARK:NAME)> pattern executed. See the explanation for the -C<(*MARK:NAME)> verb below for more details. +C<(*MARK:I)> pattern executed. See the explanation for the +C<(*MARK:I)> verb below for more details. B C<$REGERROR> and C<$REGMARK> are not magic variables like C<$1> and most other regex-related variables. They are not local to a scope, nor @@ -2602,7 +2728,7 @@ argument, then C<$REGERROR> and C<$REGMARK> are not touched at all. =over 4 -=item C<(*PRUNE)> C<(*PRUNE:NAME)> +=item C<(*PRUNE)> C<(*PRUNE:I)> X<(*PRUNE)> X<(*PRUNE:NAME)> This zero-width pattern prunes the backtracking tree at the current point @@ -2647,14 +2773,14 @@ at each matching starting point like so: Any number of C<(*PRUNE)> assertions may be used in a pattern. -See also C<<< L<< /(?>pattern) >> >>> and possessive quantifiers for +See also C<<< L<< /(?>I) >> >>> and possessive quantifiers for other ways to control backtracking. In some cases, the use of C<(*PRUNE)> can be replaced with a C<< (?>pattern) >> with no functional difference; however, C<(*PRUNE)> can be used to handle cases that cannot be expressed using a C<< (?>pattern) >> alone. -=item C<(*SKIP)> C<(*SKIP:NAME)> +=item C<(*SKIP)> C<(*SKIP:I)> X<(*SKIP)> This zero-width pattern is similar to C<(*PRUNE)>, except that on @@ -2664,8 +2790,8 @@ of this pattern. This effectively means that the regex engine "skips" forward to this position on failure and tries to match again, (assuming that there is sufficient room to match). 
-The name of the C<(*SKIP:NAME)> pattern has special significance. If a -C<(*MARK:NAME)> was encountered while matching, then it is that position +The name of the C<(*SKIP:I)> pattern has special significance. If a +C<(*MARK:I)> was encountered while matching, then it is that position which is used as the "skip point". If no C<(*MARK)> of that name was encountered, then the C<(*SKIP)> operator has no effect. When used without a name the "skip point" is where the match point was when @@ -2687,7 +2813,7 @@ Once the 'aaab' at the start of the string has matched, and the C<(*SKIP)> executed, the next starting point will be where the cursor was when the C<(*SKIP)> was executed. -=item C<(*MARK:NAME)> C<(*:NAME)> +=item C<(*MARK:I)> C<(*:I)> X<(*MARK)> X<(*MARK:NAME)> X<(*:NAME)> This zero-width pattern can be used to mark the point reached in a string @@ -2696,13 +2822,13 @@ mark may be given a name. A later C<(*SKIP)> pattern will then skip forward to that point if backtracked into on failure. Any number of C<(*MARK)> patterns are allowed, and the I portion may be duplicated. -In addition to interacting with the C<(*SKIP)> pattern, C<(*MARK:NAME)> +In addition to interacting with the C<(*SKIP)> pattern, C<(*MARK:I)> can be used to "label" a pattern branch, so that after matching, the program can determine which branches of the pattern were involved in the match. When a match is successful, the C<$REGMARK> variable will be set to the -name of the most recently executed C<(*MARK:NAME)> that was involved +name of the most recently executed C<(*MARK:I)> that was involved in the match. This can be used to determine which branch of a pattern was matched @@ -2714,19 +2840,19 @@ C. When a match has failed, and unless another verb has been involved in failing the match and has provided its own name to use, the C<$REGERROR> variable will be set to the name of the most recently executed -C<(*MARK:NAME)>. +C<(*MARK:I)>. See L for more details. 
-As a shortcut C<(*MARK:NAME)> can be written C<(*:NAME)>. +As a shortcut C<(*MARK:I)> can be written C<(*:I)>. -=item C<(*THEN)> C<(*THEN:NAME)> +=item C<(*THEN)> C<(*THEN:I)> This is similar to the "cut group" operator C<::> from Perl 6. Like C<(*PRUNE)>, this verb always matches, and when backtracked into on failure, it causes the regex engine to try the next alternation in the innermost enclosing group (capturing or otherwise) that has alternations. -The two branches of a C<(?(condition)yes-pattern|no-pattern)> do not +The two branches of a C<(?(I)I|I)> do not count as an alternation, as far as C<(*THEN)> is concerned. Its name comes from the observation that this operation combined with the @@ -2755,7 +2881,7 @@ is not the same as as after matching the I but failing on the I the C<(*THEN)> verb will backtrack and try I; but the C<(*PRUNE)> verb will simply fail. -=item C<(*COMMIT)> C<(*COMMIT:args)> +=item C<(*COMMIT)> C<(*COMMIT:I)> X<(*COMMIT)> This is the Perl 6 "commit pattern" C<< >> or C<:::>. It's a @@ -2776,7 +2902,7 @@ In other words, once the C<(*COMMIT)> has been entered, and if the pattern does not match, the regex engine will not try any further matching on the rest of the string. -=item C<(*FAIL)> C<(*F)> C<(*FAIL:arg)> +=item C<(*FAIL)> C<(*F)> C<(*FAIL:I)> X<(*FAIL)> X<(*F)> This pattern matches nothing and always fails. It can be used to force the @@ -2787,7 +2913,7 @@ the argument can be obtained from C<$REGERROR>. It is probably useful only when combined with C<(?{})> or C<(??{})>. -=item C<(*ACCEPT)> C<(*ACCEPT:arg)> +=item C<(*ACCEPT)> C<(*ACCEPT:I)> X<(*ACCEPT)> This pattern matches nothing and causes the end of successful matching at @@ -3020,14 +3146,14 @@ else in the whole regular expression.) For this grouping operator there is no need to describe the ordering, since only whether or not C<"S"> can match is important. 
-=item C<(??{ EXPR })>, C<(?I)> +=item C<(??{ I })>, C<(?I)> The ordering is the same as for the regular expression which is -the result of EXPR, or the pattern contained by capture group I. +the result of I, or the pattern contained by capture group I. -=item C<(?(condition)yes-pattern|no-pattern)> +=item C<(?(I)I|I)> -Recall that which of C or C actually matches is +Recall that which of I or I actually matches is already determined. The ordering of the matches is the same as for the chosen subexpression. @@ -3135,17 +3261,17 @@ Perl-specific syntax, the following are also accepted: =over 4 -=item C<< (?PENAMEEpattern) >> +=item C<< (?PEIEI) >> -Define a named capture group. Equivalent to C<< (?pattern) >>. +Define a named capture group. Equivalent to C<< (?>I) >>. -=item C<< (?P=NAME) >> +=item C<< (?P=I) >> -Backreference to a named capture group. Equivalent to C<< \g{NAME} >>. +Backreference to a named capture group. Equivalent to C<< \g{I} >>. -=item C<< (?P>NAME) >> +=item C<< (?P>I) >> -Subroutine call to a named capture group. Equivalent to C<< (?&NAME) >>. +Subroutine call to a named capture group. Equivalent to C<< (?&I) >>. =back
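The equivalences listed in the final hunk can be checked with a short script. This is a minimal sketch, not part of the patch: the sample strings and the group names C<year>, C<word> and C<paren> are made up for illustration, and it assumes Perl 5.10 or later, where named captures and recursion are available.

```perl
use strict;
use warnings;

# Hypothetical sample data; the group names (year, word, paren) are
# illustrative only and do not come from the patch above.

# (?P<NAME>pattern) defines a named group, just like (?<NAME>pattern).
my ($y_pcre, $y_perl);
$y_pcre = $+{year} if '2019-05-22' =~ /(?P<year>\d{4})-\d\d-\d\d/;
$y_perl = $+{year} if '2019-05-22' =~ /(?<year>\d{4})-\d\d-\d\d/;

# (?P=NAME) is a named backreference, like \g{NAME}.
my $rep_pcre = 'hello hello' =~ /(?P<word>\w+) (?P=word)/ ? 1 : 0;
my $rep_perl = 'hello hello' =~ /(?<word>\w+) \g{word}/   ? 1 : 0;

# (?P>NAME) recurses into a named group, like (?&NAME).
my $bal_pcre = '((a)(b))' =~ /^(?P<paren>\((?:[^()]++|(?P>paren))*\))$/ ? 1 : 0;
my $bal_perl = '((a)(b))' =~ /^(?<paren>\((?:[^()]++|(?&paren))*\))$/   ? 1 : 0;

print "year: $y_pcre / $y_perl\n";
print "repeated word: $rep_pcre / $rep_perl\n";
print "balanced parens: $bal_pcre / $bal_perl\n";
```

Each pair should agree. If a pair differs, the two spellings are not being treated as equivalent by the perl in use.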