X-Git-Url: https://perl5.git.perl.org/perl5.git/blobdiff_plain/0d017f4d564175907ce6698d1a162341a850ea9d..4ed3fda49b8590b1f2536acfe87ecdec36a6d516:/pod/perlre.pod diff --git a/pod/perlre.pod b/pod/perlre.pod index d913c80..ee1c2cb 100644 --- a/pod/perlre.pod +++ b/pod/perlre.pod @@ -27,15 +27,6 @@ L. =over 4 -=item i -X X X -X - -Do case-insensitive pattern matching. - -If C is in effect, the case map is taken from the current -locale. See L. - =item m X X X X @@ -54,11 +45,35 @@ Used together, as /ms, they let the "." match any character whatsoever, while still allowing "^" and "$" to match, respectively, just after and just before newlines within the string. +=item i +X X X +X + +Do case-insensitive pattern matching. + +If C is in effect, the case map is taken from the current +locale. See L. + =item x X Extend your pattern's legibility by permitting whitespace and comments. +=item p +X
X X + +Preserve the string matched such that ${^PREMATCH}, ${^MATCH}, and +${^POSTMATCH} are available for use after matching. + +=item g and c +X X + +Global matching, and keep the Current position after failed matching. +Unlike i, m, s and x, these two flags affect the way the regex is used +rather than the regex itself. See +L for further explanation +of the g and c modifiers. + =back These are usually written as "the C modifier", even though the delimiter @@ -87,7 +102,7 @@ X =head3 Metacharacters -The patterns used in Perl pattern matching evolved from the ones supplied in +The patterns used in Perl pattern matching evolved from those supplied in the Version 8 regex routines. (The routines are derived (distantly) from Henry Spencer's freely redistributable reimplementation of the V8 routines.) See L for @@ -139,8 +154,8 @@ X X X<*> X<+> X X<{n}> X<{n,}> X<{n,m}> (If a curly bracket occurs in any other context, it is treated as a regular character. In particular, the lower bound -is not optional.) The "*" modifier is equivalent to C<{0,}>, the "+" -modifier to C<{1,}>, and the "?" modifier to C<{0,1}>. n and m are limited +is not optional.) The "*" quantifier is equivalent to C<{0,}>, the "+" +quantifier to C<{1,}>, and the "?" quantifier to C<{0,1}>. n and m are limited to integral values less than a preset limit defined when perl is built. This is usually 32766 on the most common platforms.
The actual limit can be seen in the error message generated by code such as this: @@ -208,9 +223,9 @@ X<\0> X<\c> X<\N> X<\x> \e escape (think troff) (ESC) \033 octal char (example: ESC) \x1B hex char (example: ESC) - \x{263a} wide hex char (example: Unicode SMILEY) + \x{263a} long hex char (example: Unicode SMILEY) \cK control char (example: VT) - \N{name} named char + \N{name} named Unicode character \l lowercase next char (think vi) \u uppercase next char (think vi) \L lowercase till \E (think vi) @@ -227,11 +242,11 @@ An unescaped C<$> or C<@> interpolates the corresponding variable, while escaping will cause the literal string C<\$> to be matched. You'll need to write something like C. -=head3 Character classes +=head3 Character Classes and other Special Escapes In addition, Perl defines the following: X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\X> X<\p> X<\P> X<\C> -X<\g> X<\k> X<\N> X<\K> X<\v> X<\V> +X<\g> X<\k> X<\N> X<\K> X<\v> X<\V> X<\h> X<\H> X X X X \w Match a "word" character (alphanumeric plus "_") @@ -243,7 +258,7 @@ X X X X \pP Match P, named property. Use \p{Prop} for longer names. \PP Match non-P \X Match eXtended Unicode "combining character sequence", - equivalent to (?:\PM\pM*) + equivalent to (?>\PM\pM*) \C Match a single C char (octet) even under Unicode. NOTE: breaks up characters into their UTF-8 bytes, so you may end up with malformed pieces of UTF-8. @@ -255,12 +270,13 @@ X X X X optionally be wrapped in curly brackets for safer parsing. 
\g{name} Named backreference \k Named backreference - \N{name} Named unicode character, or unicode escape - \x12 Hexadecimal escape sequence - \x{1234} Long hexadecimal escape sequence \K Keep the stuff left of the \K, don't include it in $& - \v Shortcut for (*PRUNE) - \V Shortcut for (*SKIP) + \N Any character but \n + \v Vertical whitespace + \V Not vertical whitespace + \h Horizontal whitespace + \H Not horizontal whitespace + \R Linebreak A C<\w> matches a single alphanumeric character (an alphabetic character, or a decimal digit) or C<_>, not a whole word. Use C<\w+> @@ -271,12 +287,21 @@ locale. See L. You may use C<\w>, C<\W>, C<\s>, C<\S>, C<\d>, and C<\D> within character classes, but they aren't usable as either end of a range. If any of them precedes or follows a "-", the "-" is understood literally. If Unicode is in effect, C<\s> matches -also "\x{85}", "\x{2028}, and "\x{2029}". See L for more +also "\x{85}", "\x{2028}", and "\x{2029}". See L for more details about C<\pP>, C<\PP>, C<\X> and the possibility of defining your own C<\p> and C<\P> properties, and L about Unicode in general. X<\w> X<\W> X +C<\R> will atomically match a linebreak, including the network line-ending +"\x0D\x0A". Specifically, X<\R> is exactly equivalent to + + (?>\x0D\x0A?|[\x0A-\x0C\x85\x{2028}\x{2029}]) + +B C<\R> has no special meaning inside of a character class; +use C<\v> instead (vertical whitespace). +X<\R> + The POSIX character class syntax X @@ -351,21 +376,61 @@ X X<\p> X<\p{}> digit IsDigit \d graph IsGraph lower IsLower - print IsPrint - punct IsPunct + print IsPrint (but see [2] below) + punct IsPunct (but see [3] below) space IsSpace IsSpacePerl \s upper IsUpper - word IsWord + word IsWord \w xdigit IsXDigit For example C<[[:lower:]]> and C<\p{IsLower}> are equivalent. +However, the equivalence between C<[[:xxxxx:]]> and C<\p{IsXxxxx}> +is not exact. 
+ +=over 4 + +=item [1] + If the C pragma is not used but the C pragma is, the classes correlate with the usual isalpha(3) interface (except for "word" and "blank"). -The assumedly non-obviously named classes are: +But if the C or C pragmas are not used and +the string is not C, then C<[[:xxxxx:]]> (and C<\w>, etc.) +will not match characters 0x80-0xff; whereas C<\p{IsXxxxx}> will +force the string to C and can match these characters +(as Unicode). + +=item [2] + +C<\p{IsPrint}> matches characters 0x09-0x0d but C<[[:print:]]> does not. + +=item [3] + +C<[[:punct:]]> matches the following but C<\p{IsPunct}> does not, +because they are classed as symbols (not punctuation) in Unicode. + +=over 4 + +=item C<$> + +Currency symbol + +=item C<+> C<< < >> C<=> C<< > >> C<|> C<~> + +Mathematical symbols + +=item C<^> C<`> + +Modifier symbols (accents) + +=back + +=back + +The other named classes are: =over 4 @@ -497,14 +562,14 @@ backreferences. X<\g{1}> X<\g{-1}> X<\g{name}> X X In order to provide a safer and easier way to construct patterns using -backreferences, Perl 5.10 provides the C<\g{N}> notation. The curly -brackets are optional, however omitting them is less safe as the meaning -of the pattern can be changed by text (such as digits) following it. -When N is a positive integer the C<\g{N}> notation is exactly equivalent -to using normal backreferences. When N is a negative integer then it is -a relative backreference referring to the previous N'th capturing group. -When the bracket form is used and N is not an integer, it is treated as a -reference to a named buffer. +backreferences, Perl provides the C<\g{N}> notation (starting with perl +5.10.0). The curly brackets are optional, however omitting them is less +safe as the meaning of the pattern can be changed by text (such as digits) +following it. When N is a positive integer the C<\g{N}> notation is +exactly equivalent to using normal backreferences.
When N is a negative +integer then it is a relative backreference referring to the previous N'th +capturing group. When the bracket form is used and N is not an integer, it +is treated as a reference to a named buffer. Thus C<\g{-1}> refers to the last buffer, C<\g{-2}> refers to the buffer before that. For example: @@ -520,7 +585,7 @@ buffer before that. For example: and would match the same as C. -Additionally, as of Perl 5.10 you may use named capture buffers and named +Additionally, as of Perl 5.10.0 you may use named capture buffers and named backreferences. The notation is C<< (?...) >> to declare and C<< \k >> to reference. You may also use apostrophes instead of angle brackets to delimit the name; and you may use the bracketed C<< \g{name} >> backreference syntax. @@ -531,7 +596,7 @@ and C<< \k >> refer to the leftmost defined group. (Thus it's possible to do things with named capture buffers that would otherwise require C<(??{})> code to accomplish.) X X -X<%+> X<$+{name}> X<\k{name}> +X<%+> X<$+{name}> X<< \k >> Examples: @@ -590,14 +655,14 @@ already paid the price. As of 5.005, C<$&> is not so costly as the other two. X<$&> X<$`> X<$'> -As a workaround for this problem, Perl 5.10 introduces C<${^PREMATCH}>, +As a workaround for this problem, Perl 5.10.0 introduces C<${^PREMATCH}>, C<${^MATCH}> and C<${^POSTMATCH}>, which are equivalent to C<$`>, C<$&> and C<$'>, B that they are only guaranteed to be defined after a -successful match that was executed with the C (keep-copy) modifier. +successful match that was executed with the C
(preserve) modifier. The use of these variables incurs no global performance penalty, unlike their punctuation char equivalents, however at the trade-off that you have to tell perl when you want to use them. -X X +X
X
Backslashed metacharacters in Perl are alphanumeric, such as C<\b>, C<\w>, C<\n>. Unlike some other regular expression languages, there @@ -652,7 +717,7 @@ whitespace formatting, a simple C<#> will suffice. Note that Perl closes the comment as soon as it sees a C<)>, so there is no way to put a literal C<)> in the comment. -=item C<(?kimsx-imsx)> +=item C<(?pimsx-imsx)> X<(?)> One or more embedded pattern-match modifiers, to be turned on (or @@ -680,9 +745,9 @@ will match C in any case, some spaces, and an exact (I repetition of the previous word, assuming the C modifier, and no C modifier outside this group. -Note that the C modifier is special in that it can only be enabled, +Note that the C
modifier is special in that it can only be enabled, not disabled, and that its presence anywhere in a pattern has a global -effect. Thus C<(?-k)> and C<(?-k:...)> are meaningless and will warn +effect. Thus C<(?-p)> and C<(?-p:...)> are meaningless and will warn when executed under C. =item C<(?:pattern)> @@ -711,6 +776,35 @@ is equivalent to the more verbose /(?:(?s-i)more.*than).*million/i +=item C<(?|pattern)> +X<(?|)> X + +This is the "branch reset" pattern, which has the special property +that the capture buffers are numbered from the same starting point +in each alternation branch. It is available starting from perl 5.10.0. + +Capture buffers are numbered from left to right, but inside this +construct the numbering is restarted for each branch. + +The numbering within each branch will be as normal, and any buffers +following this construct will be numbered as though the construct +contained only one branch, that being the one with the most capture +buffers in it. + +This construct will be useful when you want to capture one of a +number of alternative matches. + +Consider the following pattern. The numbers underneath show in +which buffer the captured content will be stored. + + + # before ---------------branch-reset----------- after + / ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x + # 1 2 2 3 2 3 4 + +Note: as of Perl 5.10.0, branch resets interfere with the contents of +the C<%+> hash, that holds named captures. Consider using C<%-> instead. + =item Look-Around Assertions X X X X @@ -761,7 +855,7 @@ not include it in C<$&>. This effectively provides variable length look-behind. The use of C<\K> inside of another look-around assertion is allowed, but the behaviour is currently not well defined. -For various reasons C<\K> may be signifigantly more efficient than the +For various reasons C<\K> may be significantly more efficient than the equivalent C<< (?<=...) 
>> construct, and it is especially useful in situations where you want to efficiently remove something following something else in a string. For instance @@ -787,9 +881,9 @@ only for fixed-width look-behind. X<< (?) >> X<(?'NAME')> X X A named capture buffer. Identical in every respect to normal capturing -parentheses C<()> but for the additional fact that C<%+> may be used after -a succesful match to refer to a named buffer. See C for more -details on the C<%+> hash. +parentheses C<()> but for the additional fact that C<%+> or C<%-> may be +used after a successful match to refer to a named buffer. See C +for more details on the C<%+> and C<%-> hashes. If multiple distinct capture buffers have the same name then the $+{NAME} will refer to the leftmost defined buffer in the match. @@ -814,8 +908,7 @@ though it isn't extended by the locale (see L). B In order to make things easier for programmers with experience with the Python or PCRE regex engines, the pattern C<< (?PENAMEEpattern) >> may be used instead of C<< (?pattern) >>; however this form does not -support the use of single quotes as a delimiter for the name. This is -only available in Perl 5.10 or later. +support the use of single quotes as a delimiter for the name. =item C<< \k >> @@ -833,7 +926,7 @@ Both forms are equivalent. B In order to make things easier for programmers with experience with the Python or PCRE regex engines, the pattern C<< (?P=NAME) >> -may be used instead of C<< \k >> in Perl 5.10 or later. +may be used instead of C<< \k >>. =item C<(?{ code })> X<(?{})> X X X @@ -1058,7 +1151,7 @@ pattern. B In order to make things easier for programmers with experience with the Python or PCRE regex engines the pattern C<< (?P>NAME) >> -may be used instead of C<< (?&NAME) >> in Perl 5.10 or later. +may be used instead of C<< (?&NAME) >>. =item C<(?(condition)yes-pattern|no-pattern)> X<(?()> @@ -1302,7 +1395,7 @@ argument, then C<$REGERROR> and C<$REGMARK> are not touched at all. 
=over 4 =item C<(*PRUNE)> C<(*PRUNE:NAME)> -X<(*PRUNE)> X<(*PRUNE:NAME)> X<\v> +X<(*PRUNE)> X<(*PRUNE:NAME)> This zero-width pattern prunes the backtracking tree at the current point when backtracked into on failure. Consider the pattern C, @@ -1312,8 +1405,6 @@ continues in B, which may also backtrack as necessary; however, should B not match, then no further backtracking will take place, and the pattern will fail outright at the current starting position. -As a shortcut, C<\v> is exactly equivalent to C<(*PRUNE)>. - The following example counts all the possible matching strings in a pattern (without actually matching any of them). @@ -1339,7 +1430,7 @@ If we add a C<(*PRUNE)> before the count like the following print "Count=$count\n"; we prevent backtracking and find the count of the longest matching -at each matching startpoint like so: +at each matching starting point like so: aaab aab @@ -1365,8 +1456,6 @@ of this pattern. This effectively means that the regex engine "skips" forward to this position on failure and tries to match again, (assuming that there is sufficient room to match). -As a shortcut C<\V> is exactly equivalent to C<(*SKIP)>. - The name of the C<(*SKIP:NAME)> pattern has special significance. If a C<(*MARK:NAME)> was encountered while matching, then it is that position which is used as the "skip point". If no C<(*MARK)> of that name was @@ -1387,7 +1476,7 @@ outputs Count=2 Once the 'aaab' at the start of the string has matched, and the C<(*SKIP)> -executed, the next startpoint will be where the cursor was when the +executed, the next starting point will be where the cursor was when the C<(*SKIP)> was executed. =item C<(*MARK:NAME)> C<(*:NAME)> @@ -1410,7 +1499,7 @@ name of the most recently executed C<(*MARK:NAME)> that was involved in the match. 
This can be used to determine which branch of a pattern was matched -without using a seperate capture buffer for each branch, which in turn +without using a separate capture buffer for each branch, which in turn can result in a performance improvement, as perl cannot optimize C as efficiently as something like C. @@ -1426,7 +1515,7 @@ As a shortcut C<(*MARK:NAME)> can be written C<(*:NAME)>. =item C<(*THEN)> C<(*THEN:NAME)> -This is similar to the "cut group" operator C<::> from Perl6. Like +This is similar to the "cut group" operator C<::> from Perl 6. Like C<(*PRUNE)>, this verb always matches, and when backtracked into on failure, it causes the regex engine to try the next alternation in the innermost enclosing group (capturing or otherwise). @@ -1460,7 +1549,7 @@ backtrack and try C; but the C<(*PRUNE)> verb will simply fail. =item C<(*COMMIT)> X<(*COMMIT)> -This is the Perl6 "commit pattern" C<< >> or C<:::>. It's a +This is the Perl 6 "commit pattern" C<< >> or C<:::>. It's a zero-width pattern similar to C<(*SKIP)>, except that when backtracked into on failure it causes the match to fail outright. No further attempts to find a valid match by advancing the start pointer will occur again. @@ -1847,7 +1936,7 @@ loops using regular expressions, with something as innocuous as: The C matches at the beginning of C<'foo'>, and since the position in the string is not moved by the match, C would match again and again -because of the C<*> modifier. Another common way to create a similar cycle +because of the C<*> quantifier. Another common way to create a similar cycle is with the looping modifier C: @matches = ( 'foo' =~ m{ o? }xg ); @@ -1867,7 +1956,7 @@ may match zero-length substrings. Here's a simple example being: Thus Perl allows such constructs, by I. 
The rules for this are different for lower-level -loops given by the greedy modifiers C<*+{}>, and for higher-level +loops given by the greedy quantifiers C<*+{}>, and for higher-level ones like the C modifier or split() operator. The lower-level loops are I (that is, the loop is @@ -2060,9 +2149,9 @@ part of this regular expression needs to be converted explicitly =head1 PCRE/Python Support -As of Perl 5.10 Perl supports several Python/PCRE specific extensions +As of Perl 5.10.0, Perl supports several Python/PCRE specific extensions to the regex syntax. While Perl programmers are encouraged to use the -Perl specific syntax, the following are legal in Perl 5.10: +Perl specific syntax, the following are also accepted: =over 4