perldelta: Remove duplicate entry; fix typo

diff --git a/pod/perlrecharclass.pod b/pod/perlrecharclass.pod

index 4ab99ac..89f4a7e 100644
--- a/pod/perlrecharclass.pod
+++ b/pod/perlrecharclass.pod
@@ -214,7 +214,7 @@ C<\s> matches any single character considered whitespace.
  In all Perl versions, C<\s> matches the 5 characters [\t\n\f\r ]; that
  is, the horizontal tab,
  the newline, the form feed, the carriage return, and the space.
-Starting in Perl v5.18, experimentally, it also matches the vertical tab, C<\cK>.
+Starting in Perl v5.18, it also matches the vertical tab, C<\cK>.
  See note C<[1]> below for a discussion of this.
  
  =item otherwise ...
@@ -241,7 +241,7 @@ table below.
  
  =item otherwise ...
  
-C<\s> matches [\t\n\f\r ] and, starting, experimentally in Perl
+C<\s> matches [\t\n\f\r ] and, starting in Perl
  v5.18, the vertical tab, C<\cK>.
  (See note C<[1]> below for a discussion of this.)
  Note that this list doesn't include the non-breaking space.
@@ -271,9 +271,9 @@ They use the platform's native character set, and do not consider any
  locale that may otherwise be in use.
  
  C<\R> matches anything that can be considered a newline under Unicode
-rules. It's not a character class, as it can match a multi-character
-sequence. Therefore, it cannot be used inside a bracketed character
-class; use C<\v> instead (vertical whitespace).  It uses the platform's
+rules. It can match a multi-character sequence. It cannot be used inside
+a bracketed character class; use C<\v> instead (vertical whitespace).
+It uses the platform's
  native character set, and does not consider any locale that may
  otherwise be in use.
  Details are discussed in L<perlrebackslash>.
@@ -324,13 +324,8 @@ effect that changes the C<\s> matching).
  
  =item [1]
  
-Prior to Perl v5.18, C<\s> did not match the vertical tab.  The change
-in v5.18 is considered an experiment, which means it could be backed out
-in v5.22 if experience indicates that it breaks too much
-existing code.  If this change adversely affects you, send email to
-C<perlbug@perl.org>; if it affects you positively, email
-C<perlthanks@perl.org>.  In the meantime, C<[^\S\cK]> (obscurely)
-matches what C<\s> traditionally did.
+Prior to Perl v5.18, C<\s> did not match the vertical tab.
+C<[^\S\cK]> (obscurely) matches what C<\s> traditionally did.
  
  =item [2]
  
@@ -480,10 +475,10 @@ and the character must be explicitly specified, and not be part of a
  multi-character range (not even as one of its endpoints).  (L</Character
  Ranges> will be explained shortly.) Therefore,
  
- 'ss' =~ /\A[\0-\x{ff}]\z/i        # Doesn't match
- 'ss' =~ /\A[\0-\N{LATIN SMALL LETTER SHARP S}]\z/i    # No match
- 'ss' =~ /\A[\xDF-\xDF]\z/i    # Matches on ASCII platforms, since
-                               # \XDF is LATIN SMALL LETTER SHARP S,
+ 'ss' =~ /\A[\0-\x{ff}]\z/ui       # Doesn't match
+ 'ss' =~ /\A[\0-\N{LATIN SMALL LETTER SHARP S}]\z/ui   # No match
+ 'ss' =~ /\A[\xDF-\xDF]\z/ui   # Matches on ASCII platforms, since
+                               # \xDF is LATIN SMALL LETTER SHARP S,
                                 # and the range is just a single
                                 # element
  
@@ -500,7 +495,7 @@ the class, the entire sequence is matched.  For example,
  
  matches, because C<\N{TAMIL SYLLABLE KAU}> is a named sequence
  consisting of the two characters matched against.  Like the other
-instance where a bracketed class can match multi characters, and for
+instance where a bracketed class can match multiple characters, and for
  similar reasons, the class must not be inverted, and the named sequence
  may not appear in a range, even one where it is both endpoints.  If
  these happen, it is a fatal error if the character class is within an
@@ -543,9 +538,7 @@ C<\t>,
  and
  C<\x>
  are also special and have the same meanings as they do outside a
-bracketed character class.  (However, inside a bracketed character
-class, if C<\N{I<NAME>}> expands to a sequence of characters, only the first
-one in the sequence is used, with a warning.)
+bracketed character class.
  
  Also, a backslash followed by two or three octal digits is considered an octal
  number.
@@ -568,12 +561,12 @@ escaping.
  Examples:
  
   "+"   =~ /[+?*]/     #  Match, "+" in a character class is not special.
- "\cH" =~ /[\b]/      #  Match, \b inside in a character class.
+ "\cH" =~ /[\b]/      #  Match, \b inside a character class
                        #  is equivalent to a backspace.
- "]"   =~ /[][]/      #  Match, as the character class contains.
+ "]"   =~ /[][]/      #  Match, as the character class contains
                        #  both [ and ].
   "[]"  =~ /[[]]/      #  Match, the pattern contains a character class
-                      #  containing just ], and the character class is
+                      #  containing just [, and the character class is
                        #  followed by a ].
  
  =head3 Character Ranges
@@ -610,10 +603,33 @@ Examples:
               #  hyphen ('-'), or the letter 'm'.
   ['-?]       #  Matches any of the characters  '()*+,-./0123456789:;<=>?
               #  (But not on an EBCDIC platform).
-
-Perl guarantees that the ranges C<A-Z>, C<a-z>, C<0-9>, and any
+ [\N{APOSTROPHE}-\N{QUESTION MARK}]
+             #  Matches any of the characters  '()*+,-./0123456789:;<=>?
+             #  even on an EBCDIC platform.
+ [\N{U+27}-\N{U+3F}] # Same. (U+27 is "'", and U+3F is "?")
+
+As the final two examples above show, you can achieve portability to
+non-ASCII platforms by using the C<\N{...}> form for the range
+endpoints.  These indicate that the specified range is to be interpreted
+using Unicode values, so C<[\N{U+27}-\N{U+3F}]> means to match
+C<\N{U+27}>, C<\N{U+28}>, C<\N{U+29}>, ..., C<\N{U+3D}>, C<\N{U+3E}>,
+and C<\N{U+3F}>, whatever the native code point versions for those are.
+These are called "Unicode" ranges.  If either end is of the C<\N{...}>
+form, the range is considered Unicode.  A C<regexp> warning is raised
+under C<S<"use re 'strict'">> if the other endpoint is specified
+non-portably:
+
+ [\N{U+00}-\x09]    # Warning under re 'strict'; \x09 is non-portable
+ [\N{U+00}-\t]      # No warning; \t is portable
+
+Both of the above match the characters C<\N{U+00}>, C<\N{U+01}>, ...,
+C<\N{U+08}>, C<\N{U+09}>, but the C<\x09> looks like it could be a
+mistake so the warning is raised (under C<re 'strict'>) for it.
+
+Perl also guarantees that the ranges C<A-Z>, C<a-z>, C<0-9>, and any
  subranges of these match what an English-only speaker would expect them
-to match.  That is, C<[A-Z]> matches the 26 ASCII uppercase letters;
+to match on any platform.  That is, C<[A-Z]> matches the 26 ASCII
+uppercase letters;
  C<[a-z]> matches the 26 lowercase letters; and C<[0-9]> matches the 10
  digits.  Subranges, like C<[h-k]>, match correspondingly, in this case
  just the four letters C<"h">, C<"i">, C<"j">, and C<"k">.  This is the
@@ -750,6 +766,12 @@ Perl recognizes the following POSIX character classes:
   word   A Perl extension ("[A-Za-z0-9_]"), equivalent to "\w".
   xdigit Any hexadecimal digit ("[0-9a-fA-F]").
  
+Like the L<Unicode properties|/Unicode Properties>, most of the POSIX
+properties match the same regardless of whether case-insensitive (C</i>)
+matching is in effect or not.  The two exceptions are C<[:upper:]> and
+C<[:lower:]>.  Under C</i>, they each match the union of C<[:upper:]> and
+C<[:lower:]>.
+
  Most POSIX character classes have two Unicode-style C<\p> property
  counterparts.  (They are not official Unicode properties, but Perl extensions
  derived from official Unicode properties.)  The table below shows the relation
@@ -795,8 +817,9 @@ C<\p{Blank}> and C<\p{HorizSpace}> are synonyms.
  
  Control characters don't produce output as such, but instead usually control
  the terminal somehow: for example, newline and backspace are control characters.
-In the ASCII range, characters whose code points are between 0 and 31 inclusive,
-plus 127 (C<DEL>) are control characters.
+On ASCII platforms, in the ASCII range, characters whose code points are
+between 0 and 31 inclusive, plus 127 (C<DEL>) are control characters; on
+EBCDIC platforms, their counterparts are control characters.
  
  =item [3]
  
@@ -834,13 +857,13 @@ Unicode considers symbols.
  
  C<\p{XPerlSpace}> and C<\p{Space}> match identically starting with Perl
  v5.18.  In earlier versions, these differ only in that in non-locale
-matching, C<\p{XPerlSpace}> does not match the vertical tab, C<\cK>.
+matching, C<\p{XPerlSpace}> did not match the vertical tab, C<\cK>.
  Same for the two ASCII-only range forms.
  
  =back
  
  There are various other synonyms that can be used besides the names
-listed in the table.  For example, C<\p{PosixAlpha}> can be written as
+listed in the table.  For example, C<\p{XPosixAlpha}> can be written as
  C<\p{Alpha}>.  All are listed in
  L<perluniprops/Properties accessible through \p{} and \P{}>.
  
@@ -976,6 +999,9 @@ use it will raise a warning, unless disabled via
  Comments on this feature are welcome; send email to
  C<perl5-porters@perl.org>.
  
+The rules used by L<C<use re 'strict'>|re/'strict' mode> apply to this
+construct.
+
  We can extend the example above:
  
   /(?[ ( \p{Thai} + \p{Lao} ) & \p{Digit} ])/
@@ -1002,14 +1028,11 @@ There is one unary operator:
  
   !    complement
  
-All the binary operators left associate, and are of equal precedence.
-The unary operator right associates, and has higher precedence.  Use
-parentheses to override the default associations.  Some feedback we've
-received indicates a desire for intersection to have higher precedence
-than union.  This is something that feedback from the field may cause us
-to change in future releases; you may want to parenthesize copiously to
-avoid such changes affecting your code, until this feature is no longer
-considered experimental.
+All the binary operators left associate; C<"&"> has higher precedence
+than the others, which all have equal precedence.  The unary operator
+right associates, and has highest precedence.  Thus this follows the
+normal Perl precedence rules for logical operators.  Use parentheses to
+override the default precedence and associativity.
  
  The main restriction is that everything is a metacharacter.  Thus,
  you cannot refer to single characters by doing something like this:
@@ -1031,8 +1054,9 @@ C<\N{...}>, etc.)
  
  This last example shows the use of this construct to specify an ordinary
  bracketed character class without additional set operations.  Note the
-white space within it; C<E<sol>x> is turned on even within bracketed
-character classes, except you can't have comments inside them.  Hence,
+white space within it; a limited version of C<E<sol>x> is turned on even
+within bracketed character classes, with only the SPACE and TAB (C<\t>)
+characters allowed, and no comments.  Hence,
  
   (?[ [#] ])
  
@@ -1086,8 +1110,12 @@ just three limitations:
  
  =item 1
  
-This construct cannot be used within the scope of
-C<use locale> (or the C<E<sol>l> regex modifier).
+When compiled within the scope of C<use locale> (or the C<E<sol>l> regex
+modifier), this construct assumes that the execution-time locale will be
+a UTF-8 one, and the generated pattern always uses Unicode rules.  What
+gets matched or not thus isn't dependent on the actual runtime locale, so
+tainting is not enabled.  But a C<locale> category warning is raised
+if the runtime locale turns out to not be UTF-8.
  
  =item 2