X-Git-Url: https://perl5.git.perl.org/perl5.git/blobdiff_plain/8da7c437debfa7ce07dc0239bfcdb1127cf972ee..4ee2b8db537d28b77d127a86307e426289e5c8b5:/pod/perlreref.pod

diff --git a/pod/perlreref.pod b/pod/perlreref.pod
index 08cd227..db7c173 100644
--- a/pod/perlreref.pod
+++ b/pod/perlreref.pod
@@ -6,61 +6,85 @@ perlreref - Perl Regular Expressions Reference
 
 This is a quick reference to Perl's regular expressions.
 For full information see L and L, as well
-as the L section in this document.
-
-=head1 OPERATORS
-
-  =~ determines to which variable the regex is applied.
-     In its absence, $_ is used.
-
-        $var =~ /foo/;
-
-  m/pattern/igmsoxc searches a string for a pattern match,
-     applying the given options.
-
-        i  case-Insensitive
-        g  Global - all occurrences
-        m  Multiline mode - ^ and $ match internal lines
-        s  match as a Single line - . matches \n
-        o  compile pattern Once
-        x  eXtended legibility - free whitespace and comments
-        c  don't reset pos on fails when using /g
-
-     If 'pattern' is an empty string, the last I match
-     regex is used. Delimiters other than '/' may be used for both this
-     operator and the following ones.
-
-  qr/pattern/imsox lets you store a regex in a variable,
-     or pass one around. Modifiers as for m// and are stored
-     within the regex.
-
-  s/pattern/replacement/igmsoxe substitutes matches of
-     'pattern' with 'replacement'. Modifiers as for m//
-     with one addition:
-
-        e  Evaluate replacement as an expression
-
-     'e' may be specified multiple times. 'replacement' is interpreted
-     as a double quoted string unless a single-quote (') is the delimiter.
-
-  ?pattern? is like m/pattern/ but matches only once. No alternate
-      delimiters can be used. Must be reset with 'reset'.
-
-=head1 SYNTAX
-
-   \       Escapes the character(s) immediately following it
-   .       Matches any single character except a newline (unless /s is used)
-   ^       Matches at the beginning of the string (or line, if /m is used)
-   $       Matches at the end of the string (or line, if /m is used)
-   *       Matches the preceding element 0 or more times
-   +       Matches the preceding element 1 or more times
-   ?       Matches the preceding element 0 or 1 times
-   {...}   Specifies a range of occurrences for the element preceding it
-   [...]   Matches any one of the characters contained within the brackets
-   (...)   Groups subexpressions for capturing to $1, $2...
-   (?:...) Groups subexpressions without capturing (cluster)
-   |       Matches either the expression preceding or following it
-   \1, \2 ...  The text from the Nth group
+as the L section in this document.
+
+=head2 OPERATORS
+
+C<=~> determines to which variable the regex is applied.
+In its absence, $_ is used.
+
+    $var =~ /foo/;
+
+C determines to which variable the regex is applied,
+and negates the result of the match; it returns
+false if the match succeeds, and true if it fails.
+
+    $var !~ /foo/;
+
+C searches a string for a pattern match,
+applying the given options.
+
+    m  Multiline mode - ^ and $ match internal lines
+    s  match as a Single line - . matches \n
+    i  case-Insensitive
+    x  eXtended legibility - free whitespace and comments
+    p  Preserve a copy of the matched string -
+       ${^PREMATCH}, ${^MATCH}, ${^POSTMATCH} will be defined.
+    o  compile pattern Once
+    g  Global - all occurrences
+    c  don't reset pos on failed matches when using /g
+    a  restrict \d, \s, \w and [:posix:] to match ASCII only
+    aa (two a's) also /i matches exclude ASCII/non-ASCII
+    l  match according to current locale
+    u  match according to Unicode rules
+    d  match according to native rules unless something indicates
+       Unicode
+    n  Non-capture mode. Don't let () fill in $1, $2, etc...
+
+If 'pattern' is an empty string, the last I matched
+regex is used. Delimiters other than '/' may be used for both this
+operator and the following ones. The leading C can be omitted
+if the delimiter is '/'.
+
+C lets you store a regex in a variable,
+or pass one around. Modifiers as for C, and are stored
+within the regex.
+
+C substitutes matches of
+'pattern' with 'replacement'. Modifiers as for C,
+with two additions:
+
+    e  Evaluate 'replacement' as an expression
+    r  Return substitution and leave the original string untouched.
+
+'e' may be specified multiple times. 'replacement' is interpreted
+as a double quoted string unless a single-quote (C<'>) is the delimiter.
+
+C is like C but matches only once. No alternate
+delimiters can be used. Must be reset with reset().
+
+=head2 SYNTAX
+
+ \       Escapes the character immediately following it
+ .       Matches any single character except a newline (unless /s is
+           used)
+ ^       Matches at the beginning of the string (or line, if /m is used)
+ $       Matches at the end of the string (or line, if /m is used)
+ *       Matches the preceding element 0 or more times
+ +       Matches the preceding element 1 or more times
+ ?       Matches the preceding element 0 or 1 times
+ {...}   Specifies a range of occurrences for the element preceding it
+ [...]   Matches any one of the characters contained within the brackets
+ (...)   Groups subexpressions for capturing to $1, $2...
+ (?:...) Groups subexpressions without capturing (cluster)
+ |       Matches either the subexpression preceding or following it
+ \g1 or \g{1}, \g2 ...    Matches the text from the Nth group
+ \1, \2, \3 ...           Matches the text from the Nth group
+ \g-1 or \g{-1}, \g-2 ... Matches the text from the Nth previous group
+ \g{name}     Named backreference
+ \k           Named backreference
+ \k'name'     Named backreference
+ (?P=name)    Named backreference (python syntax)
 
 =head2 ESCAPE SEQUENCES
 
@@ -72,18 +96,23 @@ These work as in normal strings.
    \n       Newline
    \r       Carriage return
    \t       Tab
-   \038     Any octal ASCII value
-   \x7f     Any hexadecimal ASCII value
-   \x{263a} A wide hexadecimal value
+   \037     Char whose ordinal is the 3 octal digits, max \777
+   \o{2307} Char whose ordinal is the octal number, unrestricted
+   \x7f     Char whose ordinal is the 2 hex digits, max \xFF
+   \x{263a} Char whose ordinal is the hex number, unrestricted
    \cx      Control-x
-   \N{name} A named character
+   \N{name} A named Unicode character or character sequence
+   \N{U+263D} A Unicode character by hex ordinal
 
-   \l  Lowercase until next character
-   \u  Uppercase until next character
+   \l  Lowercase next character
+   \u  Titlecase next character
    \L  Lowercase until \E
    \U  Uppercase until \E
+   \F  Foldcase until \E
    \Q  Disable pattern metacharacters until \E
-   \E  End case modification
+   \E  End modification
+
+For Titlecase, see L.
 
 This one works differently from normal strings:
 
@@ -94,46 +123,74 @@ This one works differently from normal strings:
    [amy]    Match 'a', 'm' or 'y'
    [f-j]    Dash specifies "range"
    [f-j-]   Dash escaped or at start or end means 'dash'
-   [^f-j]   Caret indicates "match char any _except_ these"
+   [^f-j]   Caret indicates "match any character _except_ these"
+
+The following sequences (except C<\N>) work within or without a character class.
+The first six are locale aware, all are Unicode aware. See L
+and L for details.
+
+   \d      A digit
+   \D      A nondigit
+   \w      A word character
+   \W      A non-word character
+   \s      A whitespace character
+   \S      A non-whitespace character
+   \h      An horizontal whitespace
+   \H      A non horizontal whitespace
+   \N      A non newline (when not followed by '{NAME}';;
+           not valid in a character class; equivalent to [^\n]; it's
+           like '.' without /s modifier)
+   \v      A vertical whitespace
+   \V      A non vertical whitespace
+   \R      A generic newline           (?>\v|\x0D\x0A)
 
-The following work within or without a character class:
-
-   \d      A digit, same as [0-9]
-   \D      A nondigit, same as [^0-9]
-   \w      A word character (alphanumeric), same as [a-zA-Z_0-9]
-   \W      A non-word character, [^a-zA-Z_0-9]
-   \s      A whitespace character, same as [ \t\n\r\f]
-   \S      A non-whitespace character, [^ \t\n\r\f]
-   \C      Match a byte (with Unicode. '.' matches char)
    \pP     Match P-named (Unicode) property
-   \p{...} Match Unicode property with long name
+   \p{...} Match Unicode property with name longer than 1 character
    \PP     Match non-P
-   \P{...} Match lack of Unicode property with long name
-   \X      Match extended unicode sequence
+   \P{...} Match lack of Unicode property with name longer than 1 char
+   \X      Match Unicode extended grapheme cluster
 
 POSIX character classes and their Unicode and Perl equivalents:
 
-   alnum   IsAlnum              Alphanumeric
-   alpha   IsAlpha              Alphabetic
-   ascii   IsASCII              Any ASCII char
-   blank   IsSpace  [ \t]       Horizontal whitespace (GNU)
-   cntrl   IsCntrl              Control characters
-   digit   IsDigit  \d          Digits
-   graph   IsGraph              Alphanumeric and punctuation
-   lower   IsLower              Lowercase chars (locale aware)
-   print   IsPrint              Alphanumeric, punct, and space
-   punct   IsPunct              Punctuation
-   space   IsSpace  [\s\ck]     Whitespace
-           IsSpacePerl \s       Perl's whitespace definition
-   upper   IsUpper              Uppercase chars (locale aware)
-   word    IsWord   \w          Alphanumeric plus _ (Perl)
-   xdigit  IsXDigit [\dA-Fa-f]  Hexadecimal digit
+            ASCII-            Full-
+   POSIX    range             range        backslash
+ [[:...:]]  \p{...}           \p{...}      sequence   Description
+
+ -----------------------------------------------------------------------
+ alnum      PosixAlnum        XPosixAlnum             'alpha' plus 'digit'
+ alpha      PosixAlpha        XPosixAlpha             Alphabetic characters
+ ascii      ASCII                                     Any ASCII character
+ blank      PosixBlank        XPosixBlank  \h         Horizontal whitespace;
+                                                        full-range also
+                                                        written as
+                                                        \p{HorizSpace} (GNU
+                                                        extension)
+ cntrl      PosixCntrl        XPosixCntrl             Control characters
+ digit      PosixDigit        XPosixDigit  \d         Decimal digits
+ graph      PosixGraph        XPosixGraph             'alnum' plus 'punct'
+ lower      PosixLower        XPosixLower             Lowercase characters
+ print      PosixPrint        XPosixPrint             'graph' plus 'space',
+                                                        but not any Controls
+ punct      PosixPunct        XPosixPunct             Punctuation and Symbols
+                                                        in ASCII-range; just
+                                                        punct outside it
+ space      PosixSpace        XPosixSpace  \s         Whitespace
+ upper      PosixUpper        XPosixUpper             Uppercase characters
+ word       PosixWord         XPosixWord   \w         'alnum' + Unicode marks
+                                                        + connectors, like
+                                                        '_' (Perl extension)
+ xdigit     ASCII_Hex_Digit   XPosixDigit             Hexadecimal digit,
+                                                        ASCII-range is
+                                                        [0-9A-Fa-f]
+
+Also, various synonyms like C<\p{Alpha}> for C<\p{XPosixAlpha}>; all listed
+in L
 
 Within a character class:
 
-    POSIX       traditional   Unicode
-    [:digit:]       \d        \p{IsDigit}
-    [:^digit:]      \D        \P{IsDigit}
+    POSIX      traditional   Unicode
+  [:digit:]       \d        \p{Digit}
+  [:^digit:]      \D        \P{Digit}
 
 =head2 ANCHORS
 
@@ -141,81 +198,141 @@ All are zero-width assertions.
 
    ^  Match string start (or line, if /m is used)
    $  Match string end (or line, if /m is used) or before newline
+   \b{} Match boundary of type specified within the braces
+   \B{} Match wherever \b{} doesn't match
    \b Match word boundary (between \w and \W)
-   \B Match except at word boundary
+   \B Match except at word boundary (between \w and \w or \W and \W)
    \A Match string start (regardless of /m)
-   \Z Match string end (preceding optional newline)
+   \Z Match string end (before optional newline)
    \z Match absolute string end
    \G Match where previous m//g left off
-   \c Suppresses resetting of search position when used with /g.
-      Without \c, search pattern is reset to the beginning of the string
+   \K Keep the stuff left of the \K, don't include it in $&
 
 =head2 QUANTIFIERS
 
-Quantifiers are greedy by default --- match the B leftmost.
+Quantifiers are greedy by default and match the B leftmost.
 
-   Maximal Minimal Allowed range
-   ------- ------- -------------
-   {n,m}   {n,m}?  Must occur at least n times but no more than m times
-   {n,}    {n,}?   Must occur at least n times
-   {n}     {n}?    Must match exactly n times
-   *       *?      0 or more times (same as {0,})
-   +       +?      1 or more times (same as {1,})
-   ?       ??      0 or 1 time (same as {0,1})
+   Maximal Minimal Possessive Allowed range
+   ------- ------- ---------- -------------
+   {n,m}   {n,m}?  {n,m}+     Must occur at least n times
+                              but no more than m times
+   {n,}    {n,}?   {n,}+      Must occur at least n times
+   {n}     {n}?    {n}+       Must occur exactly n times
+   *       *?      *+         0 or more times (same as {0,})
+   +       +?      ++         1 or more times (same as {1,})
+   ?       ??      ?+         0 or 1 time (same as {0,1})
 
-=head2 EXTENDED CONSTRUCTS
+The possessive forms (new in Perl 5.10) prevent backtracking: what gets
+matched by a pattern with a possessive quantifier will not be backtracked
+into, even if that causes the whole match to fail.
 
-   (?#text)         A comment
-   (?imxs-imsx:...) Enable/disable option (as per m//)
-   (?=...)          Zero-width positive lookahead assertion
-   (?!...)          Zero-width negative lookahead assertion
-   (?<...)          Zero-width positive lookbehind assertion
-   (?...)           Grab what we can, prohibit backtracking
-   (?{ code })      Embedded code, return value becomes $^R
-   (??{ code })     Dynamic regex, return value used as regex
-   (?(cond)yes|no)  cond being integer corresponding to capturing parens
-   (?(cond)yes)     or a lookaround/eval zero-width assertion
+There is no quantifier C<{,n}>. That's interpreted as a literal string.
+
+=head2 EXTENDED CONSTRUCTS
 
-=head1 VARIABLES
+   (?#text)          A comment
+   (?:...)           Groups subexpressions without capturing (cluster)
+   (?pimsx-imsx:...) Enable/disable option (as per m// modifiers)
+   (?=...)           Zero-width positive lookahead assertion
+   (?!...)           Zero-width negative lookahead assertion
+   (?<=...)          Zero-width positive lookbehind assertion
+   (?...)            Grab what we can, prohibit backtracking
+   (?|...)           Branch reset
+   (?...)            Named capture
+   (?'name'...)      Named capture
+   (?P...)           Named capture (python syntax)
+   (?[...])          Extended bracketed character class
+   (?{ code })       Embedded code, return value becomes $^R
+   (??{ code })      Dynamic regex, return value used as regex
+   (?N)              Recurse into subpattern number N
+   (?-N), (?+N)      Recurse into Nth previous/next subpattern
+   (?R), (?0)        Recurse at the beginning of the whole pattern
+   (?&name)          Recurse into a named subpattern
+   (?P>name)         Recurse into a named subpattern (python syntax)
+   (?(cond)yes|no)
+   (?(cond)yes)      Conditional expression, where "cond" can be:
+                     (?=pat)   lookahead
+                     (?!pat)   negative lookahead
+                     (?<=pat)  lookbehind
+                     (?)       named subpattern has matched something
+                     ('name')  named subpattern has matched something
+                     (?{code}) code condition
+                     (R)       true if recursing
+                     (RN)      true if recursing into Nth subpattern
+                     (R&name)  true if recursing into named subpattern
+                     (DEFINE)  always false, no no-pattern allowed
+
+=head2 VARIABLES
 
    $_    Default variable for operators to use
-   $*    Enable multiline matching (deprecated; not in 5.9.0 or later)
 
-   $&    Entire matched string
    $`    Everything prior to matched string
+   $&    Entire matched string
    $'    Everything after to matched string
 
-The use of those last three will slow down B regex use
-within your program. Consult L for C<@LAST_MATCH_START>
+   ${^PREMATCH}   Everything prior to matched string
+   ${^MATCH}      Entire matched string
+   ${^POSTMATCH}  Everything after to matched string
+
+Note to those still using Perl 5.18 or earlier:
+The use of C<$`>, C<$&> or C<$'> will slow down B regex use
+within your program. Consult L for C<@->
 to see equivalent expressions that won't cause slow down.
-See also L.
+See also L. Starting with Perl 5.10, you
+can also use the equivalent variables C<${^PREMATCH}>, C<${^MATCH}>
+and C<${^POSTMATCH}>, but for them to be defined, you have to
+specify the C (preserve) modifier on your regular expression.
+In Perl 5.20, the use of C<$`>, C<$&> and C<$'> makes no speed difference.
 
    $1, $2 ...  hold the Xth captured expr
    $+    Last parenthesized pattern match
    $^N   Holds the most recently closed capture
    $^R   Holds the result of the last (?{...}) expr
-   @-    Offsets of starts of groups. [0] holds start of whole match
-   @+    Offsets of ends of groups. [0] holds end of whole match
+   @-    Offsets of starts of groups. $-[0] holds start of whole match
+   @+    Offsets of ends of groups. $+[0] holds end of whole match
+   %+    Named capture groups
+   %-    Named capture groups, as array refs
 
-Capture groups are numbered according to their I paren.
+Captured groups are numbered according to their I paren.
 
-=head1 FUNCTIONS
+=head2 FUNCTIONS
 
    lc          Lowercase a string
    lcfirst     Lowercase first char of a string
    uc          Uppercase a string
    ucfirst     Titlecase first char of a string
+   fc          Foldcase a string
+
    pos         Return or set current match position
    quotemeta   Quote metacharacters
    reset       Reset ?pattern? status
    study       Analyze string for optimizing matching
 
-   split       Use regex to split a string into parts
+   split       Use a regex to split a string into parts
+
+The first five of these are like the escape sequences C<\L>, C<\l>,
+C<\U>, C<\u>, and C<\F>. For Titlecase, see L; For
+Foldcase, see L.
+
+=head2 TERMINOLOGY
+
+=head3 Titlecase
+
+Unicode concept which most often is equal to uppercase, but for
+certain characters like the German "sharp s" there is a difference.
+
+=head3 Foldcase
+
+Unicode form that is useful when comparing strings regardless of case,
+as certain characters have complex one-to-many case mappings. Primarily a
+variant of lowercase.
 
 =head1 AUTHOR
 
-Iain Truskett.
+Iain Truskett. Updated by the Perl 5 Porters.
 
 This document may be distributed under the same terms as Perl
 itself.
@@ -253,22 +370,30 @@ L for FAQs on regular expressions.
 
 =item *
 
+L for a reference on backslash sequences.
+
+=item *
+
+L for a reference on character classes.
+
+=item *
+
 The L module to alter behaviour and aid debugging.
 
 =item *
 
-L
+L
 
 =item *
 
-L, L, L and L
+L, L, L and L
 for details on regexes and internationalisation.
 
 =item *
 
 I by Jeffrey Friedl
-(F) for a thorough grounding and
+(F) for a thorough grounding and
 reference on the topic.
 
 =back
 
@@ -283,3 +408,5 @@ Jim Cromie,
 and
 Jeffrey Goff
 for useful advice.
+
+=cut