Document CVf_UNIQUE flag better

diff --git a/pod/perlretut.pod b/pod/perlretut.pod
index 65dfb47..be4693d 100644
--- a/pod/perlretut.pod
+++ b/pod/perlretut.pod
@@ -158,13 +158,14 @@ that a metacharacter can be matched by putting a backslash before it:
      "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
      "The interval is [0,1)." =~ /[0,1)./     # is a syntax error!
      "The interval is [0,1)." =~ /\[0,1\)\./  # matches
-    "/usr/bin/perl" =~ /\/usr\/local\/bin\/perl/;  # matches
+    "/usr/bin/perl" =~ /\/usr\/bin\/perl/;  # matches
  
  In the last regexp, the forward slash C<'/'> is also backslashed,
  because it is used to delimit the regexp.  This can lead to LTS
  (leaning toothpick syndrome), however, and it is often more readable
  to change delimiters.
  
+    "/usr/bin/perl" =~ m!/usr/bin/perl!;    # easier to read
  
  The backslash character C<'\'> is a metacharacter itself and needs to
  be backslashed:
@@ -550,7 +551,7 @@ to give them a chance to match.
  
  The last example points out that character classes are like
  alternations of characters.  At a given character position, the first
-alternative that allows the regexp match to succeed wil be the one
+alternative that allows the regexp match to succeed will be the one
  that matches.
  
  =head2 Grouping things and hierarchical matching
@@ -587,7 +588,7 @@ are
  
  Alternations behave the same way in groups as out of them: at a given
  string position, the leftmost alternative that allows the regexp to
-match is taken.  So in the last example at tth first string position,
+match is taken.  So in the last example at the first string position,
  C<"20"> matches the second alternative, but there is nothing left over
  to match the next two digits C<\d\d>.  So perl moves on to the next
  alternative, which is the null alternative and that works, since
@@ -689,10 +690,11 @@ inside goes into the special variables C<$1>, C<$2>, etc.  They can be
  used just as ordinary variables:
  
      # extract hours, minutes, seconds
-    $time =~ /(\d\d):(\d\d):(\d\d)/;  # match hh:mm:ss format
-    $hours = $1;
-    $minutes = $2;
-    $seconds = $3;
+    if ($time =~ /(\d\d):(\d\d):(\d\d)/) {    # match hh:mm:ss format
+       $hours = $1;
+       $minutes = $2;
+       $seconds = $3;
+    }
  
  Now, we know that in scalar context,
  S<C<$time =~ /(\d\d):(\d\d):(\d\d)/> > returns a true or false
@@ -1403,6 +1405,8 @@ off.  C<\G> allows us to easily do context-sensitive matching:
  
  The combination of C<//g> and C<\G> allows us to process the string a
  bit at a time and use arbitrary Perl logic to decide what to do next.
+Currently, the C<\G> anchor is only fully supported when used to anchor
+to the start of the pattern.
  
  C<\G> is also invaluable in processing fixed length records with
  regexps.  Suppose we have a snippet of coding region DNA, encoded as
@@ -1653,12 +1657,11 @@ Unicode characters in the range of 128-255 use two hexadecimal digits
  with braces: C<\x{ab}>.  Note that this is different than C<\xab>,
  which is just a hexadecimal byte with no Unicode significance.
  
-B<NOTE>: in perl 5.6.0 it used to be that one needed to say C<use utf8>
-to use any Unicode features.  This is no more the case: for almost all
-Unicode processing, the explicit C<utf8> pragma is not needed.
-(The only case where it matters is if your Perl script is in Unicode,
-that is, encoded in UTF-8/UTF-16/UTF-EBCDIC: then an explicit C<use utf8>
-is needed.)
+B<NOTE>: in Perl 5.6.0 it used to be that one needed to say C<use
+utf8> to use any Unicode features.  This is no more the case: for
+almost all Unicode processing, the explicit C<utf8> pragma is not
+needed.  (The only case where it matters is if your Perl script is in
+Unicode and encoded in UTF-8, then an explicit C<use utf8> is needed.)
  
  Figuring out the hexadecimal sequence of a Unicode character you want
  or deciphering someone else's hexadecimal Unicode regexp is about as
@@ -1706,7 +1709,7 @@ it matches I<any> byte 0-255.  So
  The last regexp matches, but is dangerous because the string
  I<character> position is no longer synchronized to the string I<byte>
  position.  This generates the warning 'Malformed UTF-8
-character'.  C<\C> is best used for matching the binary data in strings
+character'.  The C<\C> is best used for matching the binary data in strings
  with binary data intermixed with Unicode characters.
  
  Let us now discuss the rest of the character classes.  Just as with
@@ -1739,7 +1742,7 @@ traditional Unicode classes:
      IsPrint          /^([LMNPS]|Co|Zs)/
      IsPunct          /^P/
      IsSpace          /^Z/ || ($code =~ /^(0009|000A|000B|000C|000D)$/
-    IsSpacePerl      /^Z/ || ($code =~ /^(0009|000A|000C|000D)$/
+    IsSpacePerl      /^Z/ || ($code =~ /^(0009|000A|000C|000D|0085|2028|2029)$/
      IsUpper          /^L[ut]/
      IsWord           /^[LMN]/ || $code eq "005F"
      IsXDigit         $code =~ /^00(3[0-9]|[46][1-6])$/
@@ -1753,7 +1756,7 @@ For the full list see L<perlunicode>.
  
  The Unicode has also been separated into various sets of charaters
  which you can test with C<\p{In...}> (in) and C<\P{In...}> (not in),
-for example C<\p{InLatin}>, C<\p{InGreek}>, or C<\P{InKatakana}>.
+for example C<\p{Latin}>, C<\p{Greek}>, or C<\P{Katakana}>.
  For the full list see L<perlunicode>.
  
  C<\X> is an abbreviation for a character class sequence that includes
@@ -1783,10 +1786,11 @@ C<[:space:]> correspond to the familiar C<\d>, C<\w>, and C<\s>
  character classes.  To negate a POSIX class, put a C<^> in front of
  the name, so that, e.g., C<[:^digit:]> corresponds to C<\D> and under
  C<utf8>, C<\P{IsDigit}>.  The Unicode and POSIX character classes can
-be used just like C<\d>, both inside and outside of character classes:
+be used just like C<\d>, with the exception that POSIX character
+classes can only be used inside of a character class:
  
      /\s+[abc[:digit:]xyz]\s*/;  # match a,b,c,x,y,z, or a digit
-    /^=item\s[:digit:]/;        # match '=item',
+    /^=item\s[[:digit:]]/;      # match '=item',
                                  # followed by a space and a digit
      use charnames ":full";
      /\s+[abc\p{IsDigit}xyz]\s+/;  # match a,b,c,x,y,z, or a digit
@@ -2002,6 +2006,10 @@ They evaluate true if the regexps do I<not> match:
      $x =~ /foo(?!baz)/;  # matches, 'baz' doesn't follow 'foo'
      $x =~ /(?<!\s)foo/;  # matches, there is no \s before 'foo'
  
+The C<\C> is unsupported in lookbehind, because the already
+treacherous definition of C<\C> would become even more so
+when going backwards.
+
  =head2 Using independent subexpressions to prevent backtracking
  
  The last few extended patterns in this tutorial are experimental as of
@@ -2060,7 +2068,7 @@ the first alternative C<[^()]+> matching a substring with no
  parentheses and the second alternative C<\([^()]*\)>  matching a
  substring delimited by parentheses.  The problem with this regexp is
  that it is pathological: it has nested indeterminate quantifiers
- of the form C<(a+|b)+>.  We discussed in Part 1 how nested quantifiers
+of the form C<(a+|b)+>.  We discussed in Part 1 how nested quantifiers
  like this could take an exponentially long time to execute if there
  was no match possible.  To prevent the exponential blowup, we need to
  prevent useless backtracking at some point.  This can be done by