Implement facility to plug in syntax triggered by keywords

diff --git a/pod/perlre.pod b/pod/perlre.pod
index a876211..df627ff 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -16,6 +16,9 @@ operations, plus various examples of the same, see discussions of
  C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/"Regexp Quote-Like
  Operators">.
  
+
+=head2 Modifiers
+
  Matching operations can have various modifiers.  Modifiers
  that relate to the interpretation of the regular expression inside
  are listed below.  Modifiers that alter the way a regular expression
@@ -24,15 +27,6 @@ L<perlop/"Gory details of parsing quoted constructs">.
  
  =over 4
  
-=item i
-X</i> X<regex, case-insensitive> X<regexp, case-insensitive>
-X<regular expression, case-insensitive>
-
-Do case-insensitive pattern matching.
-
-If C<use locale> is in effect, the case map is taken from the current
-locale.  See L<perllocale>.
-
  =item m
  X</m> X<regex, multiline> X<regexp, multiline> X<regular expression, multiline>
  
@@ -51,11 +45,35 @@ Used together, as /ms, they let the "." match any character whatsoever,
  while still allowing "^" and "$" to match, respectively, just after
  and just before newlines within the string.
  
+=item i
+X</i> X<regex, case-insensitive> X<regexp, case-insensitive>
+X<regular expression, case-insensitive>
+
+Do case-insensitive pattern matching.
+
+If C<use locale> is in effect, the case map is taken from the current
+locale.  See L<perllocale>.
+
  =item x
  X</x>
  
  Extend your pattern's legibility by permitting whitespace and comments.
  
+=item p
+X</p> X<regex, preserve> X<regexp, preserve>
+
+Preserve the string matched such that C<${^PREMATCH}>, C<${^MATCH}>, and
+C<${^POSTMATCH}> are available for use after matching.
+
+=item g and c
+X</g> X</c>
+
+Global matching, and keep the Current position after failed matching.
+Unlike i, m, s and x, these two flags affect the way the regex is used
+rather than the regex itself. See
+L<perlretut/"Using regular expressions in Perl"> for further explanation
+of the g and c modifiers.
+
  =back
  
  These are usually written as "the C</x> modifier", even though the delimiter
@@ -84,7 +102,7 @@ X</x>
  
  =head3 Metacharacters
  
-The patterns used in Perl pattern matching derive from supplied in
+The patterns used in Perl pattern matching evolved from those supplied in
  the Version 8 regex routines.  (The routines are derived
  (distantly) from Henry Spencer's freely redistributable reimplementation
  of the V8 routines.)  See L<Version 8 Regular Expressions> for
@@ -136,8 +154,8 @@ X<metacharacter> X<quantifier> X<*> X<+> X<?> X<{n}> X<{n,}> X<{n,m}>
  
  (If a curly bracket occurs in any other context, it is treated
  as a regular character.  In particular, the lower bound
-is not optional.)  The "*" modifier is equivalent to C<{0,}>, the "+"
-modifier to C<{1,}>, and the "?" modifier to C<{0,1}>.  n and m are limited
+is not optional.)  The "*" quantifier is equivalent to C<{0,}>, the "+"
+quantifier to C<{1,}>, and the "?" quantifier to C<{0,1}>.  n and m are limited
  to integral values less than a preset limit defined when perl is built.
  This is usually 32766 on the most common platforms.  The actual limit can
  be seen in the error message generated by code such as this:
@@ -149,24 +167,24 @@ many times as possible (given a particular starting location) while still
  allowing the rest of the pattern to match.  If you want it to match the
  minimum number of times possible, follow the quantifier with a "?".  Note
  that the meanings don't change, just the "greediness":
-X<metacharacter> X<greedy> X<greedyness>
+X<metacharacter> X<greedy> X<greediness>
  X<?> X<*?> X<+?> X<??> X<{n}?> X<{n,}?> X<{n,m}?>
  
-    *?    Match 0 or more times
-    +?    Match 1 or more times
-    ??    Match 0 or 1 time
-    {n}?   Match exactly n times
-    {n,}?  Match at least n times
-    {n,m}? Match at least n but not more than m times
+    *?     Match 0 or more times, not greedily
+    +?     Match 1 or more times, not greedily
+    ??     Match 0 or 1 time, not greedily
+    {n}?   Match exactly n times, not greedily
+    {n,}?  Match at least n times, not greedily
+    {n,m}? Match at least n but not more than m times, not greedily
  
  By default, when a quantified subpattern does not allow the rest of the
  overall pattern to match, Perl will backtrack. However, this behaviour is
-sometimes undesirable. Thus Perl provides the "possesive" quantifier form
+sometimes undesirable. Thus Perl provides the "possessive" quantifier form
  as well.
  
-    *+    Match 0 or more times and give nothing back
-    ++    Match 1 or more times and give nothing back
-    ?+    Match 0 or 1 time and give nothing back
+    *+     Match 0 or more times and give nothing back
+    ++     Match 1 or more times and give nothing back
+    ?+     Match 0 or 1 time and give nothing back
      {n}+   Match exactly n times and give nothing back (redundant)
      {n,}+  Match at least n times and give nothing back
      {n,m}+ Match at least n but not more than m times and give nothing back
@@ -183,7 +201,7 @@ string" problem can be most efficiently performed when written as:
  
     /"(?:[^"\\]++|\\.)*+"/
  
-as we know that if the final quote does not match, bactracking will not
+as we know that if the final quote does not match, backtracking will not
  help. See the independent subexpression C<< (?>...) >> for more details;
  possessive quantifiers are just syntactic sugar for that construct. For
  instance the above example could also be written as follows:
@@ -194,7 +212,7 @@ instance the above example could also be written as follows:
  
  Because patterns are processed as double quoted strings, the following
  also work:
-X<\t> X<\n> X<\r> X<\f> X<\a> X<\l> X<\u> X<\L> X<\U> X<\E> X<\Q>
+X<\t> X<\n> X<\r> X<\f> X<\e> X<\a> X<\l> X<\u> X<\L> X<\U> X<\E> X<\Q>
  X<\0> X<\c> X<\N> X<\x>
  
      \t         tab                   (HT, TAB)
@@ -203,11 +221,11 @@ X<\0> X<\c> X<\N> X<\x>
      \f         form feed             (FF)
      \a         alarm (bell)          (BEL)
      \e         escape (think troff)  (ESC)
-    \033       octal char (think of a PDP-11)
-    \x1B       hex char
-    \x{263a}   wide hex char         (Unicode SMILEY)
-    \c[                control char
-    \N{name}   named char
+    \033       octal char            (example: ESC)
+    \x1B       hex char              (example: ESC)
+    \x{263a}   long hex char         (example: Unicode SMILEY)
+    \cK        control char          (example: VT)
+    \N{name}   named Unicode character
      \l         lowercase next char (think vi)
      \u         uppercase next char (think vi)
      \L         lowercase till \E (think vi)
@@ -224,12 +242,12 @@ An unescaped C<$> or C<@> interpolates the corresponding variable,
  while escaping will cause the literal string C<\$> to be matched.
  You'll need to write something like C<m/\Quser\E\@\Qhost/>.
  
-=head3 Character classes
+=head3 Character Classes and other Special Escapes
  
  In addition, Perl defines the following:
-X<metacharacter>
  X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\X> X<\p> X<\P> X<\C>
-X<word> X<whitespace>
+X<\g> X<\k> X<\N> X<\K> X<\v> X<\V> X<\h> X<\H>
+X<word> X<whitespace> X<character class> X<backreference>
  
      \w      Match a "word" character (alphanumeric plus "_")
      \W      Match a non-"word" character
@@ -240,7 +258,7 @@ X<word> X<whitespace>
      \pP             Match P, named property.  Use \p{Prop} for longer names.
      \PP             Match non-P
      \X      Match eXtended Unicode "combining character sequence",
-             equivalent to (?:\PM\pM*)
+             equivalent to (?>\PM\pM*)
      \C      Match a single C char (octet) even under Unicode.
              NOTE: breaks up characters into their UTF-8 bytes,
              so you may end up with malformed pieces of UTF-8.
@@ -250,10 +268,15 @@ X<word> X<whitespace>
      \g1      Backreference to a specific or previous group,
      \g{-1}   number may be negative indicating a previous buffer and may
               optionally be wrapped in curly brackets for safer parsing.
+    \g{name} Named backreference
      \k<name> Named backreference
-    \N{name} Named unicode character, or unicode escape
-    \x12     Hexadecimal escape sequence
-    \x{1234} Long hexadecimal escape sequence
+    \K       Keep the stuff left of the \K, don't include it in $&
+    \N       Any character but \n
+    \v       Vertical whitespace
+    \V       Not vertical whitespace
+    \h       Horizontal whitespace
+    \H       Not horizontal whitespace
+    \R       Linebreak
  
  A C<\w> matches a single alphanumeric character (an alphabetic
  character, or a decimal digit) or C<_>, not a whole word.  Use C<\w+>
@@ -261,20 +284,30 @@ to match a string of Perl-identifier characters (which isn't the same
  as matching an English word).  If C<use locale> is in effect, the list
  of alphabetic characters generated by C<\w> is taken from the current
  locale.  See L<perllocale>.  You may use C<\w>, C<\W>, C<\s>, C<\S>,
-C<\d>, and C<\D> within character classes, but if you try to use them
-as endpoints of a range, that's not a range, the "-" is understood
-literally.  If Unicode is in effect, C<\s> matches also "\x{85}",
-"\x{2028}, and "\x{2029}", see L<perlunicode> for more details about
-C<\pP>, C<\PP>, and C<\X>, and L<perluniintro> about Unicode in general.
-You can define your own C<\p> and C<\P> properties, see L<perlunicode>.
+C<\d>, and C<\D> within character classes, but they aren't usable
+as either end of a range. If any of them precedes or follows a "-",
+the "-" is understood literally. If Unicode is in effect, C<\s> matches
+also "\x{85}", "\x{2028}", and "\x{2029}". See L<perlunicode> for more
+details about C<\pP>, C<\PP>, C<\X> and the possibility of defining
+your own C<\p> and C<\P> properties, and L<perluniintro> about Unicode
+in general.
  X<\w> X<\W> X<word>
  
+C<\R> will atomically match a linebreak, including the network line-ending
+"\x0D\x0A".  Specifically, C<\R> is exactly equivalent to
+
+  (?>\x0D\x0A?|[\x0A-\x0C\x85\x{2028}\x{2029}])
+
+B<Note:> C<\R> has no special meaning inside of a character class;
+use C<\v> instead (vertical whitespace).
+X<\R>
+
  The POSIX character class syntax
  X<character class>
  
      [:class:]
  
-is also available.  Note that the C<[> and C<]> braces are I<literal>;
+is also available.  Note that the C<[> and C<]> brackets are I<literal>;
  they must always be used within a character class expression.
  
      # this is correct:
@@ -283,26 +316,34 @@ they must always be used within a character class expression.
      # this is not, and will generate a warning:
      $string =~ /[:alpha:]/;
  
-The available classes and their backslash equivalents (if available) are
-as follows:
-X<character class>
+The following table shows the mapping of POSIX character class
+names, common escapes, literal escape sequences and their equivalent
+Unicode style property names.
+X<character class> X<\p> X<\p{}>
  X<alpha> X<alnum> X<ascii> X<blank> X<cntrl> X<digit> X<graph>
  X<lower> X<print> X<punct> X<space> X<upper> X<word> X<xdigit>
  
-    alpha
-    alnum
-    ascii
-    blank              [1]
-    cntrl
-    digit       \d
-    graph
-    lower
-    print
-    punct
-    space       \s     [2]
-    upper
-    word        \w     [3]
-    xdigit
+B<Note:> Up to Perl 5.10 the property names used were shared with
+standard Unicode properties.  This was changed in Perl 5.11; see
+L<perl5110delta> for details.
+
+    POSIX  Esc  Class               Property           Note
+    --------------------------------------------------------
+    alnum       [0-9A-Za-z]         IsPosixAlnum
+    alpha       [A-Za-z]            IsPosixAlpha
+    ascii       [\000-\177]         IsASCII
+    blank       [\011 ]             IsPosixBlank       [1]
+    cntrl       [\0-\37\177]        IsPosixCntrl
+    digit   \d  [0-9]               IsPosixDigit
+    graph       [!-~]               IsPosixGraph
+    lower       [a-z]               IsPosixLower
+    print       [ -~]               IsPosixPrint
+    punct       [!-/:-@[-`{-~]      IsPosixPunct
+    space       [\11-\15 ]          IsPosixSpace       [2]
+            \s  [\11\12\14\15 ]     IsPerlSpace        [2]
+    upper       [A-Z]               IsPosixUpper
+    word    \w  [0-9A-Z_a-z]        IsPerlWord         [3]
+    xdigit      [0-9A-Fa-f]         IsXDigit
  
  =over
  
@@ -312,8 +353,9 @@ A GNU extension equivalent to C<[ \t]>, "all horizontal whitespace".
  
  =item [2]
  
-Not exactly equivalent to C<\s> since the C<[[:space:]]> includes
-also the (very rare) "vertical tabulator", "\ck", chr(11).
+Note that C<\s> and C<[[:space:]]> are B<not> equivalent as C<[[:space:]]>
+includes also the (very rare) "vertical tabulator", "\cK" or chr(11) in
+ASCII.
  
  =item [3]
  
@@ -327,37 +369,26 @@ whole character class.  For example:
  
      [01[:alpha:]%]
  
-matches zero, one, any alphabetic character, and the percentage sign.
+matches zero, one, any alphabetic character, and the percent sign.
  
-The following equivalences to Unicode \p{} constructs and equivalent
-backslash character classes (if available), will hold:
-X<character class> X<\p> X<\p{}>
+=over 4
+
+=item C<$>
+
+Currency symbol
+
+=item C<+> C<< < >> C<=> C<< > >> C<|> C<~>
  
-    [[:...:]]  \p{...}         backslash
+Mathematical symbols
  
-    alpha       IsAlpha
-    alnum       IsAlnum
-    ascii       IsASCII
-    blank       IsSpace
-    cntrl       IsCntrl
-    digit       IsDigit        \d
-    graph       IsGraph
-    lower       IsLower
-    print       IsPrint
-    punct       IsPunct
-    space       IsSpace
-                IsSpacePerl    \s
-    upper       IsUpper
-    word        IsWord
-    xdigit      IsXDigit
+=item C<^> C<`>
  
-For example C<[[:lower:]]> and C<\p{IsLower}> are equivalent.
+Modifier symbols (accents)
  
-If the C<utf8> pragma is not used but the C<locale> pragma is, the
-classes correlate with the usual isalpha(3) interface (except for
-"word" and "blank").
  
-The assumedly non-obviously named classes are:
+=back
+
+The other named classes are:
  
  =over 4
  
@@ -367,7 +398,7 @@ X<cntrl>
  Any control character.  Usually characters that don't produce output as
  such but instead control the terminal somehow: for example newline and
  backspace are control characters.  All characters with ord() less than
-32 are most often classified as control characters (assuming ASCII,
+32 are usually classified as control characters (assuming ASCII,
  the ISO Latin character sets, and Unicode), as is the character with
  the ord() value of 127 (C<DEL>).
  
@@ -400,9 +431,9 @@ X<character class, negation>
  
      POSIX         traditional  Unicode
  
-    [[:^digit:]]    \D         \P{IsDigit}
-    [[:^space:]]    \S         \P{IsSpace}
-    [[:^word:]]            \W         \P{IsWord}
+    [[:^digit:]]    \D         \P{IsPosixDigit}
+    [[:^space:]]    \S         \P{IsPosixSpace}
+    [[:^word:]]     \W         \P{IsPerlWord}
  
  Perl respects the POSIX standard in that POSIX character classes are
  only supported within a character class.  The POSIX character classes
@@ -418,7 +449,7 @@ X<regular expression, zero-width assertion>
  X<\b> X<\B> X<\A> X<\Z> X<\z> X<\G>
  
      \b Match a word boundary
-    \B Match a non-(word boundary)
+    \B Match except at a word boundary
      \A Match only at beginning of string
      \Z Match only at end of string, or before newline at the end
      \z Match only at end of string
@@ -465,9 +496,10 @@ loop. Take care when using patterns that include C<\G> in an alternation.
  
  =head3 Capture buffers
  
-The bracketing construct C<( ... )> creates capture buffers.  To
-refer to the digit'th buffer use \<digit> within the
-match.  Outside the match use "$" instead of "\".  (The
+The bracketing construct C<( ... )> creates capture buffers. To refer
+to the current contents of a buffer later on, within the same pattern,
+use \1 for the first, \2 for the second, and so on.
+Outside the match use "$" instead of "\".  (The
  \<digit> notation works in certain circumstances outside
  the match.  See the warning below about \1 vs $1 for details.)
  Referring back to another part of the match is called a
@@ -486,14 +518,16 @@ backreference only if at least 11 left parentheses have opened
  before it.  And so on.  \1 through \9 are always interpreted as
  backreferences.
  
-X<\g{1}> X<\g{-1}> X<relative backreference>
+X<\g{1}> X<\g{-1}> X<\g{name}> X<relative backreference> X<named backreference>
  In order to provide a safer and easier way to construct patterns using
-backrefs, in Perl 5.10 the C<\g{N}> notation is provided. The curly
-brackets are optional, however omitting them is less safe as the meaning
-of the pattern can be changed by text (such as digits) following it.
-When N is a positive integer the C<\g{N}> notation is exactly equivalent
-to using normal backreferences. When N is a negative integer then it is
-a relative backreference referring to the previous N'th capturing group.
+backreferences, Perl provides the C<\g{N}> notation (starting with perl
+5.10.0). The curly brackets are optional, however omitting them is less
+safe as the meaning of the pattern can be changed by text (such as digits)
+following it. When N is a positive integer the C<\g{N}> notation is
+exactly equivalent to using normal backreferences. When N is a negative
+integer then it is a relative backreference referring to the previous N'th
+capturing group. When the bracket form is used and N is not an integer, it
+is treated as a reference to a named buffer.
  
  Thus C<\g{-1}> refers to the last buffer, C<\g{-2}> refers to the
  buffer before that. For example:
@@ -509,18 +543,18 @@ buffer before that. For example:
  
  and would match the same as C</(Y) ( (X) \3 \1 )/x>.
  
-Additionally, as of Perl 5.10 you may use named capture buffers and named
-backreferences. The notation is C<< (?<name>...) >> and C<< \k<name> >>
-(you may also use single quotes instead of angle brackets to quote the
-name). The only difference with named capture buffers and unnamed ones is
-that multiple buffers may have the same name and that the contents of
-named capture buffers is available via the C<%+> hash. When multiple
-groups share the same name C<$+{name}> and C<< \k<name> >> refer to the
-leftmost defined group, thus it's possible to do things with named capture
-buffers that would otherwise require C<(??{})> code to accomplish. Named
-capture buffers are numbered just as normal capture buffers are and may be
-referenced via the magic numeric variables or via numeric backreferences
-as well as by name.
+Additionally, as of Perl 5.10.0 you may use named capture buffers and named
+backreferences. The notation is C<< (?<name>...) >> to declare and C<< \k<name> >>
+to reference. You may also use apostrophes instead of angle brackets to delimit the
+name; and you may use the bracketed C<< \g{name} >> backreference syntax.
+It's possible to refer to a named capture buffer by absolute and relative number as well.
+Outside the pattern, a named capture buffer is available via the C<%+> hash.
+When different buffers within the same pattern have the same name, C<$+{name}>
+and C<< \k<name> >> refer to the leftmost defined group. (Thus it's possible
+to do things with named capture buffers that would otherwise require C<(??{})>
+code to accomplish.)
+X<named capture buffer> X<regular expression, named capture buffer>
+X<%+> X<$+{name}> X<< \k<name> >>
  
  Examples:
  
@@ -532,7 +566,7 @@ Examples:
      /(?<char>.)\k<char>/            # ... a different way
           and print "'$+{char}' is the first doubled character\n";
  
-    /(?<char>.)\1/                  # ... mix and match
+    /(?'char'.)\1/                  # ... mix and match
           and print "'$1' is the first doubled character\n";
  
      if (/Time: (..):(..):(..)/) {   # parse out values
@@ -560,7 +594,7 @@ X<$+> X<$^N> X<$&> X<$`> X<$'>
  X<$1> X<$2> X<$3> X<$4> X<$5> X<$6> X<$7> X<$8> X<$9>
  
  
-B<NOTE>: failed matches in Perl do not reset the match variables,
+B<NOTE>: Failed matches in Perl do not reset the match variables,
  which makes it easier to write code that tests for a series of more
  specific cases and remembers the best match.
  
@@ -579,6 +613,15 @@ already paid the price.  As of 5.005, C<$&> is not so costly as the
  other two.
  X<$&> X<$`> X<$'>
  
+As a workaround for this problem, Perl 5.10.0 introduces C<${^PREMATCH}>,
+C<${^MATCH}> and C<${^POSTMATCH}>, which are equivalent to C<$`>, C<$&>
+and C<$'>, B<except> that they are only guaranteed to be defined after a
+successful match that was executed with the C</p> (preserve) modifier.
+The use of these variables incurs no global performance penalty, unlike
+their punctuation character equivalents; the trade-off is that you
+have to tell perl when you want to use them.
+X</p> X<p modifier>
+
  Backslashed metacharacters in Perl are alphanumeric, such as C<\b>,
  C<\w>, C<\n>.  Unlike some other regular expression languages, there
  are no backslashed symbols that aren't alphanumeric.  So anything
@@ -632,17 +675,17 @@ whitespace formatting, a simple C<#> will suffice.  Note that Perl closes
  the comment as soon as it sees a C<)>, so there is no way to put a literal
  C<)> in the comment.
  
-=item C<(?imsx-imsx)>
+=item C<(?pimsx-imsx)>
  X<(?)>
  
  One or more embedded pattern-match modifiers, to be turned on (or
  turned off, if preceded by C<->) for the remainder of the pattern or
  the remainder of the enclosing pattern group (if any). This is
  particularly useful for dynamic patterns, such as those read in from a
-configuration file, read in as an argument, are specified in a table
-somewhere, etc.  Consider the case that some of which want to be case
-sensitive and some do not.  The case insensitive ones need to include
-merely C<(?i)> at the front of the pattern.  For example:
+configuration file, taken from an argument, or specified in a table
+somewhere.  Consider the case where some patterns want to be case
+sensitive and some do not:  The case insensitive ones merely need to
+include C<(?i)> at the front of the pattern.  For example:
  
      $pattern = "foobar";
      if ( /$pattern/i ) { }
@@ -656,9 +699,18 @@ These modifiers are restored at the end of the enclosing group. For example,
  
      ( (?i) blah ) \s+ \1
  
-will match a repeated (I<including the case>!) word C<blah> in any
-case, assuming C<x> modifier, and no C<i> modifier outside this
-group.
+will match C<blah> in any case, some spaces, and an exact (I<including the case>!)
+repetition of the previous word, assuming the C</x> modifier, and no C</i>
+modifier outside this group.
+
+These modifiers do not carry over into named subpatterns called in the
+enclosing group. In other words, a pattern such as C<((?i)(?&NAME))> does not
+change the case-sensitivity of the "NAME" pattern.
+
+Note that the C<p> modifier is special in that it can only be enabled,
+not disabled, and that its presence anywhere in a pattern has a global
+effect. Thus C<(?-p)> and C<(?-p:...)> are meaningless and will warn
+when executed under C<use warnings>.
  
  =item C<(?:pattern)>
  X<(?:)>
@@ -686,6 +738,46 @@ is equivalent to the more verbose
  
      /(?:(?s-i)more.*than).*million/i
  
+=item C<(?|pattern)>
+X<(?|)> X<Branch reset>
+
+This is the "branch reset" pattern, which has the special property
+that the capture buffers are numbered from the same starting point
+in each alternation branch. It is available starting from perl 5.10.0.
+
+Capture buffers are numbered from left to right, but inside this
+construct the numbering is restarted for each branch.
+
+The numbering within each branch will be as normal, and any buffers
+following this construct will be numbered as though the construct
+contained only one branch, that being the one with the most capture
+buffers in it.
+
+This construct will be useful when you want to capture one of a
+number of alternative matches.
+
+Consider the following pattern.  The numbers underneath show in
+which buffer the captured content will be stored.
+
+
+    # before  ---------------branch-reset----------- after        
+    / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
+    # 1            2         2  3        2     3     4  
+
+Note: as of Perl 5.10.0, branch resets interfere with the contents of
+the C<%+> hash, which holds named captures. Consider using C<%-> instead.
+
+=item Look-Around Assertions
+X<look-around assertion> X<lookaround assertion> X<look-around> X<lookaround>
+
+Look-around assertions are zero width patterns which match a specific
+pattern without including it in C<$&>. Positive assertions match when
+their subpattern matches; negative assertions match when their subpattern
+fails. Look-behind matches text up to the current match position;
+look-ahead matches text following the current match position.
+
+=over 4
+
  =item C<(?=pattern)>
  X<(?=)> X<look-ahead, positive> X<lookahead, positive>
  
@@ -712,13 +804,30 @@ Sometimes it's still easier just to say:
  
  For look-behind see below.
  
-=item C<(?<=pattern)>
-X<(?<=)> X<look-behind, positive> X<lookbehind, positive>
+=item C<(?<=pattern)> C<\K>
+X<(?<=)> X<look-behind, positive> X<lookbehind, positive> X<\K>
  
  A zero-width positive look-behind assertion.  For example, C</(?<=\t)\w+/>
  matches a word that follows a tab, without including the tab in C<$&>.
  Works only for fixed-width look-behind.
  
+There is a special form of this construct, called C<\K>, which causes the
+regex engine to "keep" everything it had matched prior to the C<\K> and
+not include it in C<$&>. This effectively provides variable length
+look-behind. The use of C<\K> inside of another look-around assertion
+is allowed, but the behaviour is currently not well defined.
+
+For various reasons C<\K> may be significantly more efficient than the
+equivalent C<< (?<=...) >> construct, and it is especially useful in
+situations where you want to efficiently remove something following
+something else in a string. For instance
+
+  s/(foo)bar/$1/g;
+
+can be rewritten as the much more efficient
+
+  s/foo\Kbar//g;
+
  =item C<(?<!pattern)>
  X<(?<!)> X<look-behind, negative> X<lookbehind, negative>
  
@@ -726,23 +835,25 @@ A zero-width negative look-behind assertion.  For example C</(?<!bar)foo/>
  matches any occurrence of "foo" that does not follow "bar".  Works
  only for fixed-width look-behind.
  
+=back
+
  =item C<(?'NAME'pattern)>
  
  =item C<< (?<NAME>pattern) >>
  X<< (?<NAME>) >> X<(?'NAME')> X<named capture> X<capture>
  
  A named capture buffer. Identical in every respect to normal capturing
-parens C<()> but for the additional fact that C<%+> may be used after
-a succesful match to refer to a named buffer. See C<perlvar> for more
-details on the C<%+> hash.
+parentheses C<()> but for the additional fact that C<%+> or C<%-> may be
+used after a successful match to refer to a named buffer. See L<perlvar>
+for more details on the C<%+> and C<%-> hashes.
  
  If multiple distinct capture buffers have the same name then the
  $+{NAME} will refer to the leftmost defined buffer in the match.
  
-The forms C<(?'NAME'pattern)> and C<(?<NAME>pattern)> are equivalent.
+The forms C<(?'NAME'pattern)> and C<< (?<NAME>pattern) >> are equivalent.
  
  B<NOTE:> While the notation of this construct is the same as the similar
-function in .NET regexes, the behavior is not, in Perl the buffers are
+function in .NET regexes, the behavior is not. In Perl the buffers are
  numbered sequentially regardless of being named or not. Thus in the
  pattern
  
@@ -751,23 +862,34 @@ pattern
  $+{foo} will be the same as $2, and $3 will contain 'z' instead of
  the opposite which is what a .NET regex hacker might expect.
  
-Currently NAME is restricted to word chars only. In other words, it
-must match C</^\w+$/>.
+Currently NAME is restricted to simple identifiers only.
+In other words, it must match C</^[_A-Za-z][_A-Za-z0-9]*\z/> or
+its Unicode extension (see L<utf8>),
+though it isn't extended by the locale (see L<perllocale>).
  
-=item C<< \k<name> >>
+B<NOTE:> In order to make things easier for programmers with experience
+with the Python or PCRE regex engines, the pattern C<< (?PE<lt>NAMEE<gt>pattern) >>
+may be used instead of C<< (?<NAME>pattern) >>; however this form does not
+support the use of single quotes as a delimiter for the name.
  
-=item C<< \k'name' >>
+=item C<< \k<NAME> >>
+
+=item C<< \k'NAME' >>
  
  Named backreference. Similar to numeric backreferences, except that
  the group is designated by name and not number. If multiple groups
  have the same name then it refers to the leftmost defined group in
  the current match.
  
-It is an error to refer to a name not defined by a C<(?<NAME>)>
+It is an error to refer to a name not defined by a C<< (?<NAME>) >>
  earlier in the pattern.
  
  Both forms are equivalent.
  
+B<NOTE:> In order to make things easier for programmers with experience
+with the Python or PCRE regex engines, the pattern C<< (?P=NAME) >>
+may be used instead of C<< \k<NAME> >>.
+
  =item C<(?{ code })>
  X<(?{})> X<regex, code in> X<regexp, code in> X<regular expression, code in>
  
@@ -810,7 +932,7 @@ C<local>ization are undone, so that
                                         # location.
     >x;
  
-will set C<$res = 4>.  Note that after the match, $cnt returns to the globally
+will set C<$res = 4>.  Note that after the match, C<$cnt> returns to the globally
  introduced value, because the scopes that restrict C<local> operators
  are unwound.
  
@@ -824,20 +946,13 @@ The assignment to C<$^R> above is properly localized, so the old
  value of C<$^R> is restored if the assertion is backtracked; compare
  L<"Backtracking">.
  
-Due to an unfortunate implementation issue, the Perl code contained in these
-blocks is treated as a compile time closure that can have seemingly bizarre
-consequences when used with lexically scoped variables inside of subroutines
-or loops.  There are various workarounds for this, including simply using
-global variables instead.  If you are using this construct and strange results
-occur then check for the use of lexically scoped variables.
-
  For reasons of security, this construct is forbidden if the regular
  expression involves run-time interpolation of variables, unless the
  perilous C<use re 'eval'> pragma has been used (see L<re>), or the
  variables contain results of C<qr//> operator (see
  L<perlop/"qr/STRING/imosx">).
  
-This restriction is because of the wide-spread and remarkably convenient
+This restriction is due to the wide-spread and remarkably convenient
  custom of using run-time determined strings as patterns.  For example:
  
      $re = <>;
@@ -852,9 +967,15 @@ so you should only do so if you are also using taint checking.
  Better yet, use the carefully constrained evaluation within a Safe
  compartment.  See L<perlsec> for details about both these mechanisms.
  
-Because perl's regex engine is not currently re-entrant, interpolated
-code may not invoke the regex engine either directly with C<m//> or C<s///>),
-or indirectly with functions such as C<split>.
+B<WARNING>: Use of lexical (C<my>) variables in these blocks is
+broken. The result is unpredictable and will make perl unstable. The
+workaround is to use global (C<our>) variables.
+
+B<WARNING>: Because Perl's regex engine is currently not re-entrant,
+interpolated code may not invoke the regex engine either directly with
+C<m//> or C<s///>, or indirectly with functions such as
+C<split>. Invoking the regex engine in these blocks will make perl
+unstable.
  
  =item C<(??{ code })>
  X<(??{})>
@@ -973,7 +1094,7 @@ for later use:
      }
  
  B<Note> that this pattern does not behave the same way as the equivalent
-PCRE or Python construct of the same form. In perl you can backtrack into
+PCRE or Python construct of the same form. In Perl you can backtrack into
  a recursed group, in PCRE and Python the recursed into group is treated
  as atomic. Also, modifiers are resolved at compile time, so constructs
  like (?i:(?1)) or (?:(?i)(?1)) do not affect how the sub-pattern will
@@ -982,13 +1103,17 @@ be processed.
  =item C<(?&NAME)>
  X<(?&NAME)>
  
-Recurse to a named subpattern. Identical to (?PARNO) except that the
-parenthesis to recurse to is determined by name. If multiple parens have
+Recurse to a named subpattern. Identical to C<(?PARNO)> except that the
+parenthesis to recurse to is determined by name. If multiple parentheses have
  the same name, then it recurses to the leftmost.
  
  It is an error to refer to a name that is not declared somewhere in the
  pattern.
  
+B<NOTE:> In order to make things easier for programmers with experience
+with the Python or PCRE regex engines, the pattern C<< (?P>NAME) >>
+may be used instead of C<< (?&NAME) >>.
+
  =item C<(?(condition)yes-pattern|no-pattern)>
  X<(?()>
  
@@ -1080,7 +1205,7 @@ An example of how this might be used is as follows:
     )/x
  
  Note that capture buffers matched inside of recursion are not accessible
-after the recursion returns, so the extra layer of capturing buffers are
+after the recursion returns, so the extra layer of capturing buffers is
  necessary. Thus C<$+{NAME_PAT}> would not be defined even though
  C<$+{NAME}> would be.
  
@@ -1193,7 +1318,7 @@ to inside of one of these constructs. The following equivalences apply:
  =head2 Special Backtracking Control Verbs
  
  B<WARNING:> These patterns are experimental and subject to change or
-removal in a future version of perl. Their usage in production code should
+removal in a future version of Perl. Their usage in production code should
  be noted to avoid problems during upgrades.
  
  These special patterns are generally of the form C<(*VERB:ARG)>. Unless
@@ -1266,7 +1391,7 @@ If we add a C<(*PRUNE)> before the count like the following
      print "Count=$count\n";
  
  we prevent backtracking and find the count of the longest matching
-at each matching startpoint like so:
+at each matching starting point like so:
  
      aaab
      aab
@@ -1312,7 +1437,7 @@ outputs
      Count=2
  
  Once the 'aaab' at the start of the string has matched, and the C<(*SKIP)>
-executed, the next startpoint will be where the cursor was when the
+executed, the next starting point will be where the cursor was when the
  C<(*SKIP)> was executed.
  
  =item C<(*MARK:NAME)> C<(*:NAME)>
@@ -1335,7 +1460,7 @@ name of the most recently executed C<(*MARK:NAME)> that was involved
  in the match.
  
  This can be used to determine which branch of a pattern was matched
-without using a seperate capture buffer for each branch, which in turn
+without using a separate capture buffer for each branch, which in turn
  can result in a performance improvement, as perl cannot optimize
  C</(?:(x)|(y)|(z))/> as efficiently as something like
  C</(?:x(*MARK:x)|y(*MARK:y)|z(*MARK:z))/>.
@@ -1351,7 +1476,7 @@ As a shortcut C<(*MARK:NAME)> can be written C<(*:NAME)>.
  
  =item C<(*THEN)> C<(*THEN:NAME)>
  
-This is similar to the "cut group" operator C<::> from Perl6. Like
+This is similar to the "cut group" operator C<::> from Perl 6. Like
  C<(*PRUNE)>, this verb always matches, and when backtracked into on
  failure, it causes the regex engine to try the next alternation in the
  innermost enclosing group (capturing or otherwise).
@@ -1385,7 +1510,7 @@ backtrack and try C; but the C<(*PRUNE)> verb will simply fail.
  =item C<(*COMMIT)>
  X<(*COMMIT)>
  
-This is the Perl6 "commit pattern" C<< <commit> >> or C<:::>. It's a
+This is the Perl 6 "commit pattern" C<< <commit> >> or C<:::>. It's a
  zero-width pattern similar to C<(*SKIP)>, except that when backtracked
  into on failure it causes the match to fail outright. No further attempts
  to find a valid match by advancing the start pointer will occur again.
@@ -1427,7 +1552,7 @@ for production code.
  This pattern matches nothing and causes the end of successful matching at
  the point at which the C<(*ACCEPT)> pattern was encountered, regardless of
  whether there is actually more to match in the string. When inside of a
-nested pattern, such as recursion or a dynamically generated subbpattern
+nested pattern, such as recursion, or in a subpattern dynamically generated
  via C<(??{})>, only the innermost pattern is ended immediately.
  
  If the C<(*ACCEPT)> is inside of capturing buffers then the buffers are
@@ -1437,7 +1562,7 @@ For instance:
    'AB' =~ /(A (A|B(*ACCEPT)|C) D)(E)/x;
  
  will match, and C<$1> will be C<AB> and C<$2> will be C<B>, C<$3> will not
-be set. If another branch in the inner parens were matched, such as in the
+be set. If another branch in the inner parentheses were matched, such as in the
  string 'ACDE', then the C<D> and C<E> would have to be matched as well.
  
  =back
@@ -1450,11 +1575,11 @@ X<backtrack> X<backtracking>
  NOTE: This section presents an abstract approximation of regular
  expression behavior.  For a more rigorous (and complicated) view of
  the rules involved in selecting a match among possible alternatives,
-see L<Combining pieces together>.
+see L<Combining RE Pieces>.
  
  A fundamental feature of regular expression matching involves the
  notion called I<backtracking>, which is currently used (when needed)
-by all regular expression quantifiers, namely C<*>, C<*?>, C<+>,
+by all non-possessive regular expression quantifiers, namely C<*>, C<*?>, C<+>,
  C<+?>, C<{n,m}>, and C<{n,m}?>.  Backtracking is often optimized
  internally, but the general principle outlined here is valid.
  
@@ -1502,7 +1627,7 @@ and the first "bar" thereafter.
      if ( /foo(.*?)bar/ ) { print "got <$1>\n" }
    got <d is under the >
  
-Here's another example: let's say you'd like to match a number at the end
+Here's another example. Let's say you'd like to match a number at the end
  of a string, and you also want to keep the preceding part of the match.
  So you write this:
  
@@ -1627,9 +1752,9 @@ using the vertical bar.  C</ab/> means match "a" AND (then) match "b",
  although the attempted matches are made at different positions because "a"
  is not a zero-width assertion, but a one-width assertion.
  
-B<WARNING>: particularly complicated regular expressions can take
+B<WARNING>: Particularly complicated regular expressions can take
  exponential time to solve because of the immense number of possible
-ways they can use backtracking to try match.  For example, without
+ways they can use backtracking to try for a match.  For example, without
  internal optimizations done by the regular expression engine, this will
  take a painfully long time to run:
  
@@ -1661,9 +1786,12 @@ Any single character matches itself, unless it is a I<metacharacter>
  with a special meaning described here or above.  You can cause
  characters that normally function as metacharacters to be interpreted
  literally by prefixing them with a "\" (e.g., "\." matches a ".", not any
-character; "\\" matches a "\").  A series of characters matches that
-series of characters in the target string, so the pattern C<blurfl>
-would match "blurfl" in the target string.
+character; "\\" matches a "\"). This escape mechanism is also required
+for the character used as the pattern delimiter.
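+
+As a small illustration (the variable C<$path> is only an example), a
+slash inside a slash-delimited pattern has to be escaped, whereas
+choosing a different delimiter avoids the escape:
+
+    my $path = 'usr/bin';
+    print "escaped\n"   if $path =~ /usr\/bin/;    # delimiter escaped
+    print "unescaped\n" if $path =~ m{usr/bin};    # braces as delimiter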
+
+A series of characters matches that series of characters in the target
+string, so the pattern C<blurfl> would match "blurfl" in the target
+string.
  
  You can specify a character class, by enclosing a list of characters
  in C<[]>, which will match any character from the list.  If the
@@ -1684,7 +1812,7 @@ a range, the "-" is understood literally.
  Note also that the whole range idea is rather unportable between
  character sets--and even within character sets they may cause results
  you probably didn't expect.  A sound principle is to use only ranges
-that begin from and end at either alphabets of equal case ([a-e],
+that begin from and end at either alphabetics of equal case ([a-e],
  [A-E]), or digits ([0-9]).  Anything else is unsafe.  If in doubt,
  spell out the character sets in full.
  
@@ -1729,7 +1857,7 @@ match "0x1234 0x4321", but not "0x1234 01234", because subpattern
  1 matched "0x", even though the rule C<0|0x> could potentially match
  the leading 0 in the second number.
  
-=head2 Warning on \1 vs $1
+=head2 Warning on \1 Instead of $1
  
  Some people get too used to writing things like:
  
@@ -1754,7 +1882,7 @@ C<${1}000>.  The operation of interpolation should not be confused
  with the operation of matching a backreference.  Certainly they mean two
  different things on the I<left> side of the C<s///>.
  
-=head2 Repeated patterns matching zero-length substring
+=head2 Repeated Patterns Matching a Zero-length Substring
  
  B<WARNING>: Difficult material (and prose) ahead.  This section needs a rewrite.
  
@@ -1767,9 +1895,9 @@ loops using regular expressions, with something as innocuous as:
  
      'foo' =~ m{ ( o? )* }x;
  
-The C<o?> can match at the beginning of C<'foo'>, and since the position
+The C<o?> matches at the beginning of C<'foo'>, and since the position
  in the string is not moved by the match, C<o?> would match again and again
-because of the C<*> modifier.  Another common way to create a similar cycle
+because of the C<*> quantifier.  Another common way to create a similar cycle
  is with the looping modifier C<//g>:
  
      @matches = ( 'foo' =~ m{ o? }xg );
@@ -1789,7 +1917,7 @@ may match zero-length substrings.  Here's a simple example being:
  
  Thus Perl allows such constructs, by I<forcefully breaking
  the infinite loop>.  The rules for this are different for lower-level
-loops given by the greedy modifiers C<*+{}>, and for higher-level
+loops given by the greedy quantifiers C<*+{}>, and for higher-level
  ones like the C</g> modifier or split() operator.
  
  The lower-level loops are I<interrupted> (that is, the loop is
@@ -1830,7 +1958,7 @@ the matched string, and is reset by each assignment to pos().
  Zero-length matches at the end of the previous match are ignored
  during C<split>.
  
-=head2 Combining pieces together
+=head2 Combining RE Pieces
  
  Each of the elementary pieces of regular expressions which were described
  before (such as C<ab> or C<\Z>) could match at most one substring
@@ -1931,13 +2059,13 @@ One more rule is needed to understand how a match is determined for the
  whole regular expression: a match at an earlier position is always better
  than a match at a later position.
  
-=head2 Creating custom RE engines
+=head2 Creating Custom RE Engines
  
  Overloaded constants (see L<overload>) provide a simple way to extend
  the functionality of the RE engine.
  
  Suppose that we want to enable a new RE escape-sequence C<\Y|> which
-matches at boundary between whitespace characters and non-whitespace
+matches at a boundary between whitespace characters and non-whitespace
  characters.  Note that C<(?=\S)(?<!\S)|(?!\S)(?<=\S)> matches exactly
  at these positions, so we want to have each C<\Y|> in the place of the
  more complicated version.  We can create a module C<customre> to do
@@ -1980,6 +2108,28 @@ part of this regular expression needs to be converted explicitly
      $re = customre::convert $re;
      /\Y|$re\Y|/;
  
+=head1 PCRE/Python Support
+
+As of Perl 5.10.0, Perl supports several Python/PCRE-specific extensions
+to the regex syntax. While Perl programmers are encouraged to use the
+Perl-specific syntax, the following are also accepted:
+
+=over 4
+
+=item C<< (?PE<lt>NAMEE<gt>pattern) >>
+
+Define a named capture buffer. Equivalent to C<< (?<NAME>pattern) >>.
+
+=item C<< (?P=NAME) >>
+
+Backreference to a named capture buffer. Equivalent to C<< \g{NAME} >>.
+
+=item C<< (?P>NAME) >>
+
+Subroutine call to a named capture buffer. Equivalent to C<< (?&NAME) >>.
+
+=back
+
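+A short sketch of the three forms used together; the group names
+C<pair> and C<paren> and the sample strings are chosen only for
+illustration:
+
+    my $str = 'abab';
+    if ( $str =~ /(?P<pair>ab)(?P=pair)/ ) {
+        print "pair=$+{pair}\n";         # same capture as (?<pair>ab)
+    }
+
+    # Balanced parentheses via a Python/PCRE-style subroutine call.
+    my $balanced = qr/^(?P<paren>\((?:[^()]++|(?P>paren))*\))$/;
+    print "balanced\n" if '(a(b)c)' =~ $balanced;
+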
  =head1 BUGS
  
  This document varies from difficult to understand to completely