perlpolicy: clarify what "feature can be replaced" means

diff --git a/pod/perlre.pod b/pod/perlre.pod

index 0119fc5..dfd47cd 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -31,8 +31,8 @@ L<perlop/"Gory details of parsing quoted constructs">.
  X</m> X<regex, multiline> X<regexp, multiline> X<regular expression, multiline>
  
  Treat string as multiple lines.  That is, change "^" and "$" from matching
-the start or end of line only at the left and right ends of the string to
-matching them anywhere within the string.
+the start of the string's first line and the end of its last line to
+matching the start and end of each line within the string.
  
  =item s
  X</s> X<regex, single-line> X<regexp, single-line>
@@ -95,22 +95,56 @@ In Perl 5.20 and higher this is ignored. Due to a new copy-on-write
  mechanism, ${^PREMATCH}, ${^MATCH}, and ${^POSTMATCH} will be available
  after the match regardless of the modifier.
  
-=item g and c
-X</g> X</c>
-
-Global matching, and keep the Current position after failed matching.
-Unlike i, m, s and x, these two flags affect the way the regex is used
-rather than the regex itself. See
-L<perlretut/"Using regular expressions in Perl"> for further explanation
-of the g and c modifiers.
-
  =item a, d, l and u
  X</a> X</d> X</l> X</u>
  
-These modifiers, all new in 5.14, affect which character-set semantics
+These modifiers, all new in 5.14, affect which character-set rules
  (Unicode, etc.) are used, as described below in
  L</Character set modifiers>.
  
+=item n
+X</n> X<regex, non-capture> X<regexp, non-capture>
+X<regular expression, non-capture>
+
+Prevent the grouping metacharacters C<()> from capturing. This modifier,
+new in 5.22, will stop C<$1>, C<$2>, etc. from being filled in.
+
+  "hello" =~ /(hi|hello)/;   # $1 is "hello"
+  "hello" =~ /(hi|hello)/n;  # $1 is undef
+
+This is equivalent to putting C<?:> at the beginning of every capturing group:
+
+  "hello" =~ /(?:hi|hello)/; # $1 is undef
+
+C</n> can be negated on a per-group basis. Alternatively, named captures
+may still be used.
+
+  "hello" =~ /(?-n:(hi|hello))/n;   # $1 is "hello"
+  "hello" =~ /(?<greet>hi|hello)/n; # $1 is "hello", $+{greet} is
+                                    # "hello"
+
+=item Other Modifiers
+
+There are a number of flags that can be found at the end of regular
+expression constructs that are I<not> generic regular expression flags, but
+apply to the operation being performed, like matching or substitution (C<m//>
+or C<s///> respectively).
+
+Flags described further in
+L<perlretut/"Using regular expressions in Perl"> are:
+
+  c  - keep the current position during repeated matching
+  g  - globally match the pattern repeatedly in the string
+
+Substitution-specific modifiers described in
+
+L<perlop/"s/PATTERN/REPLACEMENT/msixpodualngcer"> are:
+
+  e  - evaluate the right-hand side as an expression
+  ee - evaluate the right side as a string then eval the result
+  o  - pretend to optimize your code, but actually introduce bugs
+  r  - perform non-destructive substitution and return the new value
+
  =back
  
  Regular expression modifiers are usually written in documentation
@@ -123,18 +157,38 @@ the C<(?...)> construct, see L</Extended Patterns> below.
  
  C</x> tells
  the regular expression parser to ignore most whitespace that is neither
-backslashed nor within a character class.  You can use this to break up
-your regular expression into (slightly) more readable parts.  The C<#>
-character is also treated as a metacharacter introducing a comment,
-just as in ordinary Perl code.  This also means that if you want real
-whitespace or C<#> characters in the pattern (outside a character
-class, where they are unaffected by C</x>), then you'll either have to
+backslashed nor within a bracketed character class.  You can use this to
+break up your regular expression into (slightly) more readable parts.
+Also, the C<#> character is treated as a metacharacter introducing a
+comment that runs up to the pattern's closing delimiter, or to the end
+of the current line if the pattern extends onto the next line.  Hence,
+this is very much like an ordinary Perl code comment.  (You can include
+the closing delimiter within the comment only if you precede it with a
+backslash, so be careful!)
+
+Use of C</x> means that if you want real
+whitespace or C<#> characters in the pattern (outside a bracketed character
+class, which is unaffected by C</x>), then you'll either have to
  escape them (using backslashes or C<\Q...\E>) or encode them using octal,
-hex, or C<\N{}> escapes.  Taken together, these features go a long way towards
-making Perl's regular expressions more readable.  Note that you have to
-be careful not to include the pattern delimiter in the comment--perl has
-no way of knowing you did not intend to close the pattern early.  See
-the C-comment deletion code in L<perlop>.  Also note that anything inside
+hex, or C<\N{}> escapes.
+It is ineffective to try to continue a comment onto the next line by
+escaping the C<\n> with a backslash or C<\Q>.
+
+You can use L</(?#text)> to create a comment that ends earlier than the
+end of the current line, but C<text> also can't contain the closing
+delimiter unless escaped with a backslash.
+
+Taken together, these features go a long way towards
+making Perl's regular expressions more readable.  Here's an example:
+
+    # Delete (most) C comments.
+    $program =~ s {
+       /\*     # Match the opening delimiter.
+       .*?     # Match a minimal number of characters.
+       \*/     # Match the closing delimiter.
+    } []gsx;
+
+Note that anything inside
  a C<\Q...\E> stays unaffected by C</x>.  And note that C</x> doesn't affect
  space interpretation within a single multi-character construct.  For
  example in C<\x{...}>, regardless of the C</x> modifier, there can be no
@@ -148,10 +202,25 @@ in C<\p{...}> there can be spaces that follow the Unicode rules, for which see
  L<perluniprops/Properties accessible through \p{} and \P{}>.
  X</x>
  
+The characters that are deemed whitespace are those that Unicode
+calls "Pattern White Space", namely:
+
+ U+0009 CHARACTER TABULATION
+ U+000A LINE FEED
+ U+000B LINE TABULATION
+ U+000C FORM FEED
+ U+000D CARRIAGE RETURN
+ U+0020 SPACE
+ U+0085 NEXT LINE
+ U+200E LEFT-TO-RIGHT MARK
+ U+200F RIGHT-TO-LEFT MARK
+ U+2028 LINE SEPARATOR
+ U+2029 PARAGRAPH SEPARATOR
+
  =head3 Character set modifiers
  
  C</d>, C</u>, C</a>, and C</l>, available starting in 5.14, are called
-the character set modifiers; they affect the character set semantics
+the character set modifiers; they affect the character set rules
  used for the regular expression.
  
  The C</d>, C</u>, and C</l> modifiers are not likely to be of much use
@@ -223,15 +292,22 @@ the same as the compilation-time locale, and can differ from one match
  to another if there is an intervening call of the
  L<setlocale() function|perllocale/The setlocale function>.
  
-Perl only supports single-byte locales.  This means that code points
-above 255 are treated as Unicode no matter what locale is in effect.
+The only non-single-byte locale Perl supports is (starting in v5.20)
+UTF-8.  This means that code points above 255 are treated as Unicode no
+matter what locale is in effect (since UTF-8 implies Unicode).
+
  Under Unicode rules, there are a few case-insensitive matches that cross
-the 255/256 boundary.  These are disallowed under C</l>.  For example,
-0xFF (on ASCII platforms) does not caselessly match the character at
-0x178, C<LATIN CAPITAL LETTER Y WITH DIAERESIS>, because 0xFF may not be
-C<LATIN SMALL LETTER Y WITH DIAERESIS> in the current locale, and Perl
-has no way of knowing if that character even exists in the locale, much
-less what code point it is.
+the 255/256 boundary.  Except for UTF-8 locales in Perls v5.20 and
+later, these are disallowed under C</l>.  For example, 0xFF (on ASCII
+platforms) does not caselessly match the character at 0x178, C<LATIN
+CAPITAL LETTER Y WITH DIAERESIS>, because 0xFF may not be C<LATIN SMALL
+LETTER Y WITH DIAERESIS> in the current locale, and Perl has no way of
+knowing if that character even exists in the locale, much less what code
+point it is.
+
+In a UTF-8 locale in v5.20 and later, the only visible difference
+between locale and non-locale in regular expressions should be tainting
+(see L<perlsec>).
  
  This modifier may be specified to be the default by C<use locale>, but
  see L</Which character set modifier is in effect?>.
@@ -372,7 +448,7 @@ between C<\w> and C<\W>, using the C</a> definitions of them (similarly
  for C<\B>).
  
  Otherwise, C</a> behaves like the C</u> modifier, in that
-case-insensitive matching uses Unicode semantics; for example, "k" will
+case-insensitive matching uses Unicode rules; for example, "k" will
  match the Unicode C<\N{KELVIN SIGN}> under C</i> matching, and code
  points in the Latin1 range, above ASCII will have Unicode rules when it
  comes to case-insensitive matching.
@@ -463,7 +539,8 @@ X<\> X<^> X<.> X<$> X<|> X<(> X<()> X<[> X<[]>
      \        Quote the next metacharacter
      ^        Match the beginning of the line
      .        Match any character (except newline)
-    $        Match the end of the line (or before newline at the end)
+    $        Match the end of the string (or before newline at the end
+             of the string)
      |        Alternation
      ()       Grouping
      []       Bracketed Character class
@@ -500,21 +577,12 @@ X<metacharacter> X<quantifier> X<*> X<+> X<?> X<{n}> X<{n,}> X<{n,m}>
  
  (If a curly bracket occurs in any other context and does not form part of
  a backslashed sequence like C<\x{...}>, it is treated as a regular
-character.  In particular, the lower quantifier bound is not optional,
-and a typo in a quantifier silently causes it to be treated as the
-literal characters.  For example,
-
-    /o{4,3}/
-
-looks like a quantifier that matches 0 times, since 4 is greater than 3,
-but it really means to match the sequence of six characters
-S<C<"o { 4 , 3 }">>.  It is planned to eventually require literal uses
-of curly brackets to be escaped, say by preceding them with a backslash
-or enclosing them within square brackets, (C<"\{"> or C<"[{]">).  This
-change will allow for future syntax extensions (like making the lower
-bound of a quantifier optional), and better error checking.  In the
-meantime, you should get in the habit of escaping all instances where
-you mean a literal "{".)
+character.  However, a deprecation warning is raised for all such
+occurrences, and in Perl v5.26, literal uses of a curly bracket will be
+required to be escaped, say by preceding them with a backslash (C<"\{">)
+or enclosing them within square brackets (C<"[{]">).  This change will
+allow for future syntax extensions (like making the lower bound of a
+quantifier optional), and better error checking of quantifiers.)
  
  The "*" quantifier is equivalent to C<{0,}>, the "+"
  quantifier to C<{1,}>, and the "?" quantifier to C<{0,1}>.  n and m are limited
@@ -599,9 +667,9 @@ also work:
   \o{}, \000  character whose ordinal is the given octal number
   \l          lowercase next char (think vi)
   \u          uppercase next char (think vi)
- \L          lowercase till \E (think vi)
- \U          uppercase till \E (think vi)
- \Q          quote (disable) pattern metacharacters till \E
+ \L          lowercase until \E (think vi)
+ \U          uppercase until \E (think vi)
+ \Q          quote (disable) pattern metacharacters until \E
   \E          end either case modification or quoted section, think vi
  
  Details are in L<perlop/Quote and Quote-like Operators>.
@@ -635,7 +703,7 @@ X<\g> X<\k> X<\K> X<backreference>
                     part of a larger UTF-8 character.  Thus it breaks up
                     characters into their UTF-8 bytes, so you may end up
                     with malformed pieces of UTF-8.  Unsupported in
-                   lookbehind.
+                   lookbehind. (Deprecated.)
    \1        [5]  Backreference to a specific capture group or buffer.
                     '1' may actually be any positive integer.
    \g1       [5]  Backreference to a specific or previous group,
@@ -746,6 +814,17 @@ row.
  It is worth noting that C<\G> improperly used can result in an infinite
  loop. Take care when using patterns that include C<\G> in an alternation.
  
+Note also that C<s///> will refuse to overwrite part of a substitution
+that has already been replaced; so for example this will stop after the
+first iteration, rather than iterating its way backwards through the
+string:
+
+    $_ = "123456789";
+    pos = 6;
+    s/.(?=.\G)/X/g;
+    print;     # prints 1234X6789, not XXXXX6789
+
+
  =head3 Capture groups
  
  The bracketing construct C<( ... )> creates capture groups (also referred to as
@@ -970,10 +1049,13 @@ expressions, and 2) whenever you see one, you should stop and
  =item C<(?#text)>
  X<(?#)>
  
-A comment.  The text is ignored.  If the C</x> modifier enables
-whitespace formatting, a simple C<#> will suffice.  Note that Perl closes
+A comment.  The text is ignored.
+Note that Perl closes
  the comment as soon as it sees a C<)>, so there is no way to put a literal
-C<)> in the comment.
+C<)> in the comment.  The pattern's closing delimiter must be escaped by
+a backslash if it appears in the comment.
+
+See L</E<sol>x> for another way to have comments in patterns.
  
  =item C<(?adlupimsx-imsx)>
  
@@ -1172,7 +1254,8 @@ A zero-width positive look-behind assertion.  For example, C</(?<=\t)\w+/>
  matches a word that follows a tab, without including the tab in C<$&>.
  Works only for fixed-width look-behind.
  
-There is a special form of this construct, called C<\K>, which causes the
+There is a special form of this construct, called C<\K> (available since
+Perl 5.10.0), which causes the
  regex engine to "keep" everything it had matched prior to the C<\K> and
  not include it in C<$&>. This effectively provides variable-length
  look-behind. The use of C<\K> inside of another look-around assertion
@@ -1433,14 +1516,21 @@ into perl, so changing it requires a custom build.
  =item C<(?I<PARNO>)> C<(?-I<PARNO>)> C<(?+I<PARNO>)> C<(?R)> C<(?0)>
  X<(?PARNO)> X<(?1)> X<(?R)> X<(?0)> X<(?-1)> X<(?+1)> X<(?-PARNO)> X<(?+PARNO)>
  X<regex, recursive> X<regexp, recursive> X<regular expression, recursive>
-X<regex, relative recursion>
+X<regex, relative recursion> X<GOSUB> X<GOSTART>
+
+Recursive subpattern. Treat the contents of a given capture buffer in the
+current pattern as an independent subpattern and attempt to match it at
+the current position in the string. Information about capture state from
+the caller for things like backreferences is available to the subpattern,
+but capture buffers set by the subpattern are not visible to the caller.
  
  Similar to C<(??{ code })> except that it does not involve executing any
  code or potentially compiling a returned pattern string; instead it treats
  the part of the current pattern contained within a specified capture group
-as an independent pattern that must match at the current position.
-Capture groups contained by the pattern will have the value as determined
-by the outermost recursion.
+as an independent pattern that must match at the current position. Also
+different is the treatment of capture buffers: unlike C<(??{ code })>,
+recursive patterns have access to their caller's match state, so one can
+use backreferences safely.
  
  I<PARNO> is a sequence of digits (not starting with 0) whose value reflects
  the paren-number of the capture group to recurse to. C<(?R)> recurses to
@@ -1616,7 +1706,7 @@ An example of how this might be used is as follows:
    /(?<NAME>(?&NAME_PAT))(?<ADDR>(?&ADDRESS_PAT))
     (?(DEFINE)
       (?<NAME_PAT>....)
-     (?<ADRESS_PAT>....)
+     (?<ADDRESS_PAT>....)
     )/x
  
  Note that capture groups matched inside of recursion are not accessible
@@ -2243,8 +2333,14 @@ Note also that the whole range idea is rather unportable between
  character sets--and even within character sets they may cause results
  you probably didn't expect.  A sound principle is to use only ranges
  that begin from and end at either alphabetics of equal case ([a-e],
-[A-E]), or digits ([0-9]).  Anything else is unsafe.  If in doubt,
-spell out the character sets in full.
+[A-E]), or digits ([0-9]).  Anything else is unsafe or unclear.  If in
+doubt, spell out the character sets in full.  Specifying the end points
+of the range using the C<\N{...}> syntax, with Unicode names or code
+points, makes the range portable, but still likely not easily
+understandable to someone reading the code.  For example,
+C<[\N{U+04}-\N{U+07}]> means to match the Unicode code points
+C<\N{U+04}>, C<\N{U+05}>, C<\N{U+06}>, and C<\N{U+07}>, whatever their
+native values may be on the platform.
  
  Characters may be specified using a metacharacter syntax much like that
  used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,