Add regex nodes for locale

[perl5.git] / pod / perlrecharclass.pod
diff --git a/pod/perlrecharclass.pod b/pod/perlrecharclass.pod

index 3a38e56..fb5868d 100644 (file)
--- a/pod/perlrecharclass.pod
+++ b/pod/perlrecharclass.pod
@@ -480,9 +480,9 @@ and the character must be explicitly specified, and not be part of a
  multi-character range (not even as one of its endpoints).  (L</Character
  Ranges> will be explained shortly.) Therefore,
  
- 'ss' =~ /\A[\0-\x{ff}]\z/i        # Doesn't match
- 'ss' =~ /\A[\0-\N{LATIN SMALL LETTER SHARP S}]\z/i    # No match
- 'ss' =~ /\A[\xDF-\xDF]\z/i    # Matches on ASCII platforms, since
+ 'ss' =~ /\A[\0-\x{ff}]\z/ui       # Doesn't match
+ 'ss' =~ /\A[\0-\N{LATIN SMALL LETTER SHARP S}]\z/ui   # No match
+ 'ss' =~ /\A[\xDF-\xDF]\z/ui   # Matches on ASCII platforms, since
                                 # \XDF is LATIN SMALL LETTER SHARP S,
                                 # and the range is just a single
                                 # element
@@ -500,7 +500,7 @@ the class, the entire sequence is matched.  For example,
  
  matches, because C<\N{TAMIL SYLLABLE KAU}> is a named sequence
  consisting of the two characters matched against.  Like the other
-instance where a bracketed class can match multi characters, and for
+instance where a bracketed class can match multiple characters, and for
  similar reasons, the class must not be inverted, and the named sequence
  may not appear in a range, even one where it is both endpoints.  If
  these happen, it is a fatal error if the character class is within an
@@ -543,9 +543,7 @@ C<\t>,
  and
  C<\x>
  are also special and have the same meanings as they do outside a
-bracketed character class.  (However, inside a bracketed character
-class, if C<\N{I<NAME>}> expands to a sequence of characters, only the first
-one in the sequence is used, with a warning.)
+bracketed character class.
  
  Also, a backslash followed by two or three octal digits is considered an octal
  number.
@@ -600,11 +598,6 @@ your set of characters to be matched and its position in the class is such
  that it could be considered part of a range, you must escape that hyphen
  with a backslash.
  
-The classes C<< [A-Z] >> and C<< [a-z] >> are special cased, in the sense
-they always match exactly the 26 upper/lower case letters, regardless
-of the platform (this only effects EBCDIC, which would otherwise include 
-some non-letters).
-
  Examples:
  
   [a-z]       #  Matches a character that is a lower case ASCII letter.
@@ -615,7 +608,49 @@ Examples:
               #  hyphen ('-'), or the letter 'm'.
   ['-?]       #  Matches any of the characters  '()*+,-./0123456789:;<=>?
               #  (But not on an EBCDIC platform).
-
+ [\N{APOSTROPHE}-\N{QUESTION MARK}]
+             #  Matches any of the characters  '()*+,-./0123456789:;<=>?
+             #  even on an EBCDIC platform.
+ [\N{U+27}-\N{U+3F}] # Same. (U+27 is "'", and U+3F is "?"
+
+As the final two examples above show, you can achieve portablity to
+non-ASCII platforms by using the C<\N{...}> form for the range
+endpoints.  These indicate that the specified range is to be interpreted
+using Unicode values, so C<[\N{U+27}-\N{U+3F}]> means to match
+C<\N{U+27}>, C<\N{U+28}>, C<\N{U+29}>, ..., C<\N{U+3D}>, C<\N{U+3E}>,
+and C<\N{U+3F}>, whatever the native code point versions for those are.
+
+Perl also guarantees that the ranges C<A-Z>, C<a-z>, C<0-9>, and any
+subranges of these match what an English-only speaker would expect them
+to match on any platform.  That is, C<[A-Z]> matches the 26 ASCII
+uppercase letters;
+C<[a-z]> matches the 26 lowercase letters; and C<[0-9]> matches the 10
+digits.  Subranges, like C<[h-k]>, match correspondingly, in this case
+just the four letters C<"h">, C<"i">, C<"j">, and C<"k">.  This is the
+natural behavior on ASCII platforms where the code points (ordinal
+values) for C<"h"> through C<"k"> are consecutive integers (0x68 through
+0x6B).  But special handling to achieve this may be needed on platforms
+with a non-ASCII native character set.  For example, on EBCDIC
+platforms, the code point for C<"h"> is 0x88, C<"i"> is 0x89, C<"j"> is
+0x91, and C<"k"> is 0x92.   Perl specially treats C<[h-k]> to exclude the
+seven code points in the gap: 0x8A through 0x90.  This special handling is
+only invoked when the range is a subrange of one of the ASCII uppercase,
+lowercase, and digit ranges, AND each end of the range is expressed
+either as a literal, like C<"A">, or as a named character (C<\N{...}>,
+including the C<\N{U+...> form).
+
+EBCDIC Examples:
+
+ [i-j]               #  Matches either "i" or "j"
+ [i-\N{LATIN SMALL LETTER J}]  # Same
+ [i-\N{U+6A}]        #  Same
+ [\N{U+69}-\N{U+6A}] #  Same
+ [\x{89}-\x{91}]     #  Matches 0x89 ("i"), 0x8A .. 0x90, 0x91 ("j")
+ [i-\x{91}]          #  Same
+ [\x{89}-j]          #  Same
+ [i-J]               #  Matches, 0x89 ("i") .. 0xC1 ("J"); special
+                     #  handling doesn't apply because range is mixed
+                     #  case
  
  =head3 Negation