X</m> X<regex, multiline> X<regexp, multiline> X<regular expression, multiline>
Treat string as multiple lines. That is, change "^" and "$" from matching
-the start or end of line only at the left and right ends of the string to
-matching them anywhere within the string.
+the start of the string's first line and the end of its last line to
+matching the start and end of each line within the string.
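As a minimal sketch of the difference (the sample string and variable names are illustrative assumptions, not taken from the text above), the same anchored pattern can be run with and without C</m>:

```perl
use strict;
use warnings;

# Hypothetical three-line sample string.
my $text = "alpha\nbeta\ngamma\n";

# Without /m, "^" can match only at the very start of the string,
# so the global match finds a single word.
my @first_only = $text =~ /^(\w+)/g;

# With /m, "^" also matches after each embedded newline,
# so every line contributes a word.
my @every_line = $text =~ /^(\w+)/mg;

print scalar(@first_only), " vs ", scalar(@every_line), "\n";
```

Only the modifier differs between the two matches, so any change in the result comes from how C<^> is anchored.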
=item s
X</s> X<regex, single-line> X<regexp, single-line>
mechanism, ${^PREMATCH}, ${^MATCH}, and ${^POSTMATCH} will be available
after the match regardless of the modifier.
-=item g and c
-X</g> X</c>
-
-Global matching, and keep the Current position after failed matching.
-Unlike i, m, s and x, these two flags affect the way the regex is used
-rather than the regex itself. See
-L<perlretut/"Using regular expressions in Perl"> for further explanation
-of the g and c modifiers.
-
=item a, d, l and u
X</a> X</d> X</l> X</u>
-These modifiers, all new in 5.14, affect which character-set semantics
+These modifiers, all new in 5.14, affect which character-set rules
(Unicode, etc.) are used, as described below in
L</Character set modifiers>.
+=item n
+X</n> X<regex, non-capture> X<regexp, non-capture>
+X<regular expression, non-capture>
+
+Prevent the grouping metacharacters C<()> from capturing. This modifier,
+new in 5.22, will stop C<$1>, C<$2>, etc. from being filled in.
+
+ "hello" =~ /(hi|hello)/; # $1 is "hello"
+ "hello" =~ /(hi|hello)/n; # $1 is undef
+
+This is equivalent to putting C<?:> at the beginning of every capturing group:
+
+ "hello" =~ /(?:hi|hello)/; # $1 is undef
+
+C</n> can be negated on a per-group basis. Alternatively, named captures
+may still be used.
+
+ "hello" =~ /(?-n:(hi|hello))/n; # $1 is "hello"
+ "hello" =~ /(?<greet>hi|hello)/n; # $1 is "hello", $+{greet} is
+ # "hello"
+
+=item Other Modifiers
+
+There are a number of flags that can be found at the end of regular
+expression constructs that are I<not> generic regular expression flags, but
+apply to the operation being performed, like matching or substitution (C<m//>
+or C<s///> respectively).
+
+Flags described further in
+L<perlretut/"Using regular expressions in Perl"> are:
+
+ c - keep the current position during repeated matching
+ g - globally match the pattern repeatedly in the string
+
+Substitution-specific modifiers described in
+L<perlop/"s/PATTERN/REPLACEMENT/msixpodualngcer"> are:
+
+ e - evaluate the right-hand side as an expression
+ ee - evaluate the right side as a string then eval the result
+ o - pretend to optimize your code, but actually introduce bugs
+ r - perform non-destructive substitution and return the new value
+
=back
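The operation-level flags listed above can be sketched as follows. This is only an illustration under assumed sample data; the variable names are hypothetical, and C</r> requires Perl 5.14 or later.

```perl
use strict;
use warnings;

my $original = "3 apples";

# /r leaves $original alone and returns the modified copy.
my $copy = $original =~ s/apples/pears/r;

# /e treats the replacement as Perl code; here it doubles the number.
(my $doubled = $original) =~ s/(\d+)/$1 * 2/e;

# /g remembers where the last match ended; /c keeps that position
# even when a later match fails.
my $scan = "aab";
$scan =~ /a/g;      # pos($scan) is now 1
$scan =~ /z/gc;     # fails, but pos($scan) is kept
my $kept = pos($scan);

print "$copy | $doubled | $kept\n";
```

Without C</c>, the failed second match would reset C<pos($scan)> to undef.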
Regular expression modifiers are usually written in documentation
C</x> tells
the regular expression parser to ignore most whitespace that is neither
-backslashed nor within a character class. You can use this to break up
-your regular expression into (slightly) more readable parts. The C<#>
-character is also treated as a metacharacter introducing a comment,
-just as in ordinary Perl code. This also means that if you want real
-whitespace or C<#> characters in the pattern (outside a character
-class, where they are unaffected by C</x>), then you'll either have to
+backslashed nor within a bracketed character class. You can use this to
+break up your regular expression into (slightly) more readable parts.
+Also, the C<#> character is treated as a metacharacter introducing a
+comment that runs up to the pattern's closing delimiter, or to the end
+of the current line if the pattern extends onto the next line. Hence,
+this is very much like an ordinary Perl code comment. (You can include
+the closing delimiter within the comment only if you precede it with a
+backslash, so be careful!)
+
+Use of C</x> means that if you want real
+whitespace or C<#> characters in the pattern (outside a bracketed character
+class, which is unaffected by C</x>), then you'll either have to
escape them (using backslashes or C<\Q...\E>) or encode them using octal,
-hex, or C<\N{}> escapes. Taken together, these features go a long way towards
-making Perl's regular expressions more readable. Note that you have to
-be careful not to include the pattern delimiter in the comment--perl has
-no way of knowing you did not intend to close the pattern early. See
-the C-comment deletion code in L<perlop>. Also note that anything inside
+hex, or C<\N{}> escapes.
+It is ineffective to try to continue a comment onto the next line by
+escaping the C<\n> with a backslash or C<\Q>.
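The ways of keeping a literal space under C</x> can be compared in a short sketch. The subject string and variable names here are assumptions chosen for illustration.

```perl
use strict;
use warnings;

my $subject = "a b";

# Under /x the bare space is ignored, so this pattern is really /ab/.
my $bare      = $subject =~ /a b/x    ? 1 : 0;

# A backslashed space is kept as a literal space.
my $escaped   = $subject =~ /a\ b/x   ? 1 : 0;

# Whitespace inside a bracketed character class is unaffected by /x.
my $bracketed = $subject =~ /a[ ]b/x  ? 1 : 0;

# A hex escape is another way to encode the space.
my $hex       = $subject =~ /a\x20b/x ? 1 : 0;

print "$bare $escaped $bracketed $hex\n";
```

Only the first pattern fails to match, because it is the only one in which the space is dropped from the pattern.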
+
+You can use L</(?#text)> to create a comment that ends earlier than the
+end of the current line, but C<text> also can't contain the closing
+delimiter unless escaped with a backslash.
+
+Taken together, these features go a long way towards
+making Perl's regular expressions more readable. Here's an example:
+
+ # Delete (most) C comments.
+ $program =~ s {
+ /\* # Match the opening delimiter.
+ .*? # Match a minimal number of characters.
+ \*/ # Match the closing delimiter.
+ } []gsx;
+
+Note that anything inside
a C<\Q...\E> stays unaffected by C</x>. And note that C</x> doesn't affect
space interpretation within a single multi-character construct. For
example in C<\x{...}>, regardless of the C</x> modifier, there can be no
L<perluniprops/Properties accessible through \p{} and \P{}>.
X</x>
+The characters that are deemed whitespace are those that Unicode
+calls "Pattern White Space", namely:
+
+ U+0009 CHARACTER TABULATION
+ U+000A LINE FEED
+ U+000B LINE TABULATION
+ U+000C FORM FEED
+ U+000D CARRIAGE RETURN
+ U+0020 SPACE
+ U+0085 NEXT LINE
+ U+200E LEFT-TO-RIGHT MARK
+ U+200F RIGHT-TO-LEFT MARK
+ U+2028 LINE SEPARATOR
+ U+2029 PARAGRAPH SEPARATOR
+
=head3 Character set modifiers
C</d>, C</u>, C</a>, and C</l>, available starting in 5.14, are called
-the character set modifiers; they affect the character set semantics
+the character set modifiers; they affect the character set rules
used for the regular expression.
The C</d>, C</u>, and C</l> modifiers are not likely to be of much use
to another if there is an intervening call of the
L<setlocale() function|perllocale/The setlocale function>.
-Perl only supports single-byte locales. This means that code points
-above 255 are treated as Unicode no matter what locale is in effect.
+The only non-single-byte locale Perl supports is (starting in v5.20)
+UTF-8. This means that code points above 255 are treated as Unicode no
+matter what locale is in effect (since UTF-8 implies Unicode).
+
Under Unicode rules, there are a few case-insensitive matches that cross
-the 255/256 boundary. These are disallowed under C</l>. For example,
-0xFF (on ASCII platforms) does not caselessly match the character at
-0x178, C<LATIN CAPITAL LETTER Y WITH DIAERESIS>, because 0xFF may not be
-C<LATIN SMALL LETTER Y WITH DIAERESIS> in the current locale, and Perl
-has no way of knowing if that character even exists in the locale, much
-less what code point it is.
+the 255/256 boundary. Except for UTF-8 locales in Perl v5.20 and
+later, these are disallowed under C</l>. For example, 0xFF (on ASCII
+platforms) does not caselessly match the character at 0x178, C<LATIN
+CAPITAL LETTER Y WITH DIAERESIS>, because 0xFF may not be C<LATIN SMALL
+LETTER Y WITH DIAERESIS> in the current locale, and Perl has no way of
+knowing if that character even exists in the locale, much less what code
+point it is.
+
+In a UTF-8 locale in v5.20 and later, the only visible difference
+between locale and non-locale in regular expressions should be tainting
+(see L<perlsec>).
This modifier may be specified to be the default by C<use locale>, but
see L</Which character set modifier is in effect?>.
for C<\B>).
Otherwise, C</a> behaves like the C</u> modifier, in that
-case-insensitive matching uses Unicode semantics; for example, "k" will
+case-insensitive matching uses Unicode rules; for example, "k" will
match the Unicode C<\N{KELVIN SIGN}> under C</i> matching, and code
points in the Latin1 range, above ASCII will have Unicode rules when it
comes to case-insensitive matching.
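A small sketch of the C<\N{KELVIN SIGN}> case follows. The comparison with the doubled C</aa> form is added here for contrast and is not taken from the passage above; the variable names are hypothetical.

```perl
use strict;
use warnings;
use charnames ':full';   # needed for \N{...} names before Perl 5.16

my $kelvin = "\N{KELVIN SIGN}";   # U+212A

# /a keeps Unicode rules for case-insensitive matching,
# so ASCII "k" matches the Kelvin sign.
my $with_a  = $kelvin =~ /k/ai  ? 1 : 0;

# Doubling the modifier (/aa) forbids ASCII/non-ASCII caseless matches.
my $with_aa = $kelvin =~ /k/aai ? 1 : 0;

print "$with_a $with_aa\n";
```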
\ Quote the next metacharacter
^ Match the beginning of the line
. Match any character (except newline)
- $ Match the end of the line (or before newline at the end)
+ $ Match the end of the string (or before newline at the end
+ of the string)
| Alternation
() Grouping
[] Bracketed Character class
(If a curly bracket occurs in any other context and does not form part of
a backslashed sequence like C<\x{...}>, it is treated as a regular
-character. In particular, the lower quantifier bound is not optional,
-and a typo in a quantifier silently causes it to be treated as the
-literal characters. For example,
-
- /o{4,3}/
-
-looks like a quantifier that matches 0 times, since 4 is greater than 3,
-but it really means to match the sequence of six characters
-S<C<"o { 4 , 3 }">>. It is planned to eventually require literal uses
-of curly brackets to be escaped, say by preceding them with a backslash
-or enclosing them within square brackets, (C<"\{"> or C<"[{]">). This
-change will allow for future syntax extensions (like making the lower
-bound of a quantifier optional), and better error checking. In the
-meantime, you should get in the habit of escaping all instances where
-you mean a literal "{".)
+character. However, a deprecation warning is raised for all such
+occurrences, and in Perl v5.26, literal uses of a curly bracket will be
+required to be escaped, say by preceding them with a backslash (C<"\{">)
+or enclosing them within square brackets (C<"[{]">). This change will
+allow for future syntax extensions (like making the lower bound of a
+quantifier optional), and better error checking of quantifiers.)
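The two escaping styles mentioned above can be sketched like this; the sample strings and variable names are assumptions for illustration only.

```perl
use strict;
use warnings;

my $literal = "o{4,3}";

# Backslash-escaped braces are matched literally.
my $backslashed = $literal =~ /o\{4,3\}/   ? 1 : 0;

# Braces inside bracketed character classes are also literal.
my $bracketed   = $literal =~ /o[{]4,3[}]/ ? 1 : 0;

# The escaped form is not a quantifier, so a run of "o" does not match.
my $not_quant   = "oooo"   =~ /o\{4,3\}/   ? 1 : 0;

print "$backslashed $bracketed $not_quant\n";
```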
The "*" quantifier is equivalent to C<{0,}>, the "+"
quantifier to C<{1,}>, and the "?" quantifier to C<{0,1}>. n and m are limited
\o{}, \000 character whose ordinal is the given octal number
\l lowercase next char (think vi)
\u uppercase next char (think vi)
- \L lowercase till \E (think vi)
- \U uppercase till \E (think vi)
- \Q quote (disable) pattern metacharacters till \E
+ \L lowercase until \E (think vi)
+ \U uppercase until \E (think vi)
+ \Q quote (disable) pattern metacharacters until \E
\E end either case modification or quoted section, think vi
Details are in L<perlop/Quote and Quote-like Operators>.
part of a larger UTF-8 character. Thus it breaks up
characters into their UTF-8 bytes, so you may end up
with malformed pieces of UTF-8. Unsupported in
- lookbehind.
+ lookbehind. (Deprecated.)
\1 [5] Backreference to a specific capture group or buffer.
'1' may actually be any positive integer.
\g1 [5] Backreference to a specific or previous group,
It is worth noting that C<\G> improperly used can result in an infinite
loop. Take care when using patterns that include C<\G> in an alternation.
+Note also that C<s///> will refuse to overwrite part of a substitution
+that has already been replaced; so for example this will stop after the
+first iteration, rather than iterating its way backwards through the
+string:
+
+ $_ = "123456789";
+ pos = 6;
+ s/.(?=.\G)/X/g;
+ print; # prints 1234X6789, not XXXXX6789
+
+
=head3 Capture groups
The bracketing construct C<( ... )> creates capture groups (also referred to as
=item C<(?#text)>
X<(?#)>
-A comment. The text is ignored. If the C</x> modifier enables
-whitespace formatting, a simple C<#> will suffice. Note that Perl closes
+A comment. The text is ignored.
+Note that Perl closes
the comment as soon as it sees a C<)>, so there is no way to put a literal
-C<)> in the comment.
+C<)> in the comment. The pattern's closing delimiter must be escaped by
+a backslash if it appears in the comment.
+
+See L</E<sol>x> for another way to have comments in patterns.
=item C<(?adlupimsx-imsx)>
matches a word that follows a tab, without including the tab in C<$&>.
Works only for fixed-width look-behind.
-There is a special form of this construct, called C<\K>, which causes the
+There is a special form of this construct, called C<\K> (available since
+Perl 5.10.0), which causes the
regex engine to "keep" everything it had matched prior to the C<\K> and
not include it in C<$&>. This effectively provides variable-length
look-behind. The use of C<\K> inside of another look-around assertion
=item C<(?I<PARNO>)> C<(?-I<PARNO>)> C<(?+I<PARNO>)> C<(?R)> C<(?0)>
X<(?PARNO)> X<(?1)> X<(?R)> X<(?0)> X<(?-1)> X<(?+1)> X<(?-PARNO)> X<(?+PARNO)>
X<regex, recursive> X<regexp, recursive> X<regular expression, recursive>
-X<regex, relative recursion>
+X<regex, relative recursion> X<GOSUB> X<GOSTART>
+
+Recursive subpattern. Treat the contents of a given capture buffer in the
+current pattern as an independent subpattern and attempt to match it at
+the current position in the string. Information about capture state from
+the caller for things like backreferences is available to the subpattern,
+but capture buffers set by the subpattern are not visible to the caller.
Similar to C<(??{ code })> except that it does not involve executing any
code or potentially compiling a returned pattern string; instead it treats
the part of the current pattern contained within a specified capture group
-as an independent pattern that must match at the current position.
-Capture groups contained by the pattern will have the value as determined
-by the outermost recursion.
+as an independent pattern that must match at the current position. The
+treatment of capture buffers also differs: unlike C<(??{ code })>,
+recursive patterns have access to their caller's match state, so one can
+use backreferences safely.
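As a hedged illustration of recursion into a capture group, the following sketch matches balanced parentheses by recursing into group 1 with C<(?1)>. The pattern, its name, and the test strings are assumptions made for this example, not part of the original text.

```perl
use strict;
use warnings;

# Group 1 is an opening paren, then any mix of non-paren runs and
# nested copies of group 1, then a closing paren.
my $balanced = qr/^ ( \( (?: [^()]++ | (?1) )* \) ) $/x;

my $nested = "(a(b)c)" =~ $balanced ? 1 : 0;   # properly nested
my $empty  = "()"      =~ $balanced ? 1 : 0;   # simplest balanced case
my $broken = "(a(b c"  =~ $balanced ? 1 : 0;   # missing closers

print "$nested $empty $broken\n";
```

No code is executed during the match; the engine simply re-enters the subpattern of group 1 at the current position.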
I<PARNO> is a sequence of digits (not starting with 0) whose value reflects
the paren-number of the capture group to recurse to. C<(?R)> recurses to
/(?<NAME>(?&NAME_PAT))(?<ADDR>(?&ADDRESS_PAT))
(?(DEFINE)
(?<NAME_PAT>....)
- (?<ADRESS_PAT>....)
+ (?<ADDRESS_PAT>....)
)/x
Note that capture groups matched inside of recursion are not accessible
character sets--and even within character sets they may cause results
you probably didn't expect. A sound principle is to use only ranges
that begin from and end at either alphabetics of equal case ([a-e],
-[A-E]), or digits ([0-9]). Anything else is unsafe. If in doubt,
-spell out the character sets in full.
+[A-E]), or digits ([0-9]). Anything else is unsafe or unclear. If in
+doubt, spell out the character sets in full. Specifying the end points
+of the range with the C<\N{...}> syntax, using Unicode names or code
+points, makes the range portable, but it is still unlikely to be easily
+understood by someone reading the code. For example,
+C<[\N{U+04}-\N{U+07}]> means to match the Unicode code points
+C<\N{U+04}>, C<\N{U+05}>, C<\N{U+06}>, and C<\N{U+07}>, whatever their
+native values may be on the platform.
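A minimal sketch of such a range follows. The subject strings are also written with C<\N{U+...}>, so the check does not depend on the platform's native code points; the variable names are hypothetical.

```perl
use strict;
use warnings;

# U+0005 lies inside the range U+0004..U+0007.
my $in_range  = "\N{U+05}" =~ /[\N{U+04}-\N{U+07}]/ ? 1 : 0;

# U+0008 lies just outside it.
my $out_range = "\N{U+08}" =~ /[\N{U+04}-\N{U+07}]/ ? 1 : 0;

print "$in_range $out_range\n";
```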
Characters may be specified using a metacharacter syntax much like that
used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,