perlunicode, perluniprops: \p{Title} is Perl extension

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod

index b193273..f00b110 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -260,11 +260,12 @@ complement B<and> the full character-wide bit complement.
  
  =item *
  
-You can define your own mappings to be used in C<lc()>,
-C<lcfirst()>, C<uc()>, and C<ucfirst()> (or their double-quoted string inlined
-versions such as C<\U>). See
-L<User-Defined Case-Mappings|/"User-Defined Case Mappings (for serious hackers only)">
-for more details.
+There is a CPAN module, L<Unicode::Casing>, which allows you to define
+your own mappings to be used in C<lc()>, C<lcfirst()>, C<uc()>, and
+C<ucfirst()> (or their double-quoted string inlined versions such as
+C<\U>).  (Prior to Perl 5.16, this functionality was partially provided
+in the Perl core, but suffered from a number of insurmountable
+drawbacks, so the CPAN module was written instead.)
  
  =back
  
@@ -301,7 +302,8 @@ This formality is needed when properties are not binary; that is, if they can
  take on more values than just True and False.  For example, the Bidi_Class (see
  L</"Bidirectional Character Types"> below), can take on several different
  values, such as Left, Right, Whitespace, and others.  To match these, one needs
-to specify the property name (Bidi_Class), AND the value being matched against
+to specify both the property name (Bidi_Class), AND the value being
+matched against
  (Left, Right, etc.).  This is done, as in the examples above, by having the
  two components separated by an equal sign (or interchangeably, a colon), like
  C<\p{Bidi_Class: Left}>.
@@ -469,11 +471,63 @@ The world's languages are written in many different scripts.  This sentence
  written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in
  Hiragana or Katakana.  There are many more.
  
-The Unicode Script property gives what script a given character is in,
-and the property can be specified with the compound form like
-C<\p{Script=Hebrew}> (short: C<\p{sc=hebr}>).  Perl furnishes shortcuts for all
-script names.  You can omit everything up through the equals (or colon), and
-simply write C<\p{Latin}> or C<\P{Cyrillic}>.
+The Unicode Script and Script_Extensions properties give what script a
+given character is in.  Either property can be specified with the
+compound form like
+C<\p{Script=Hebrew}> (short: C<\p{sc=hebr}>), or
+C<\p{Script_Extensions=Javanese}> (short: C<\p{scx=java}>).
+In addition, Perl furnishes shortcuts for all
+C<Script> property names.  You can omit everything up through the equals
+(or colon), and simply write C<\p{Latin}> or C<\P{Cyrillic}>.
+(This is not true for C<Script_Extensions>, which is required to be
+written in the compound form.)
+
+The difference between these two properties involves characters that are
+used in multiple scripts.  For example, the digits '0' through '9' are
+used in many parts of the world.  These are placed in a script named
+C<Common>.  Other characters are used in just a few scripts.  For
+example, the "KATAKANA-HIRAGANA DOUBLE HYPHEN" is used in both Japanese
+scripts, Katakana and Hiragana, but nowhere else.  The C<Script>
+property places all characters that are used in multiple scripts in the
+C<Common> script, while the C<Script_Extensions> property places those
+that are used in only a few scripts into each of those scripts, while
+still using C<Common> for those used in many scripts.  Thus both these
+match:
+
+ "0" =~ /\p{sc=Common}/     # Matches
+ "0" =~ /\p{scx=Common}/    # Matches
+
+And only the first of these matches:
+
+ "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Common}/  # Matches
+ "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Common}/ # No match
+
+And only the last two of these match:
+
+ "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Hiragana}  # No match
+ "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Katakana}  # No match
+ "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Hiragana} # Matches
+ "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Katakana} # Matches
+
+C<Script_Extensions> is thus an improved C<Script>, in which there are
+fewer characters in the C<Common> script, and correspondingly more in
+other scripts.  It is new in Unicode version 6.0, and its data are likely
+to change significantly in later releases, as things get sorted out.
+
+(Actually, besides C<Common>, the C<Inherited> script contains
+characters that are used in multiple scripts.  These are modifier
+characters which modify other characters, and inherit the script value
+of the controlling character.  Some of these are used in many scripts,
+and so go into C<Inherited> in both C<Script> and C<Script_Extensions>.
+Others are used in just a few scripts, so are in C<Inherited> in
+C<Script>, but not in C<Script_Extensions>.)
+
+It is worth stressing that there are several different sets of digits in
+Unicode that are equivalent to 0-9 and are matchable by C<\d> in a
+regular expression.  If they are used in a single language only, they
+are in that language's C<Script> and C<Script_Extensions>.  If they are
+used in more than one script, they will be in C<sc=Common>, but only
+if they are used in many scripts should they be in C<scx=Common>.
  
  A complete list of scripts and their shortcuts is in L<perluniprops>.
  
@@ -496,20 +550,14 @@ other words, the ASCII characters.  The "Latin" script contains some letters
  from this as well as several other blocks, like "Latin-1 Supplement",
  "Latin Extended-A", etc., but it does not contain all the characters from
  those blocks. It does not, for example, contain the digits 0-9, because
-those digits are shared across many scripts. The digits 0-9 and similar groups,
-like punctuation, are in the script called C<Common>.  There is also a
-script called C<Inherited> for characters that modify other characters,
-and inherit the script value of the controlling character.  (Note that
-there are several different sets of digits in Unicode that are
-equivalent to 0-9 and are matchable by C<\d> in a regular expression.
-If they are used in a single language only, they are in that language's
-script.  Only sets are used across several languages are in the
-C<Common> script.)
+those digits are shared across many scripts, and hence are in the
+C<Common> script.
  
  For more about scripts versus blocks, see UAX#24 "Unicode Script Property":
  L<http://www.unicode.org/reports/tr24>
  
-The Script property is likely to be the one you want to use when processing
+The C<Script> or C<Script_Extensions> properties are likely to be the
+ones you want to use when processing
  natural language; the Block property may occasionally be useful in working
  with the nuts and bolts of Unicode.
  
@@ -557,8 +605,9 @@ Unicode ones, but some are genuine extensions, including several that are in
  the compound form.  And quite a few of these are actually recommended by Unicode
  (in L<http://www.unicode.org/reports/tr18>).
  
-This section gives some details on all extensions that aren't synonyms for
-compound-form Unicode properties (for those, you'll have to refer to the
+This section gives some details on all extensions that aren't just
+synonyms for compound-form Unicode properties
+(for those properties, you'll have to refer to the
  L<Unicode Standard|http://www.unicode.org/reports/tr44>.
  
  =over
@@ -719,6 +768,13 @@ This is the same as C<\s>, including beyond ASCII.
  Mnemonic: Space, as modified by Perl.  (It doesn't include the vertical tab
  which both the Posix standard and Unicode consider white space.)
  
+=item B<C<\p{Title}>> and B<C<\p{Titlecase}>>
+
+Under case-sensitive matching, these both match the same code points as
+C<\p{General Category=Titlecase_Letter}> (C<\p{gc=lt}>).  The difference
+is that under C</i> caseless matching, these match the same as
+C<\p{Cased}>, whereas C<\p{gc=lt}> matches C<\p{Cased_Letter}>.
+
  =item B<C<\p{VertSpace}>>
  
  This is the same as C<\v>:  A character that changes the spacing vertically.
@@ -867,189 +923,12 @@ would be intersecting with nothing, resulting in an empty set.
  
  =head2 User-Defined Case Mappings (for serious hackers only)
  
-B<This featured is deprecated and is scheduled to be removed in Perl
-5.16.>
-The CPAN module L<Unicode::Casing> provides better functionality
-without the drawbacks described below.
-
-You can define your own mappings to be used in C<lc()>,
-C<lcfirst()>, C<uc()>, and C<ucfirst()> (or their string-inlined versions,
-C<\L>, C<\l>, C<\U>, and C<\u>).  The mappings are currently only valid
-on strings encoded in UTF-8, but see below for a partial workaround for
-this restriction.
-
-The principle is similar to that of user-defined character
-properties: define subroutines that do the mappings.
-C<ToLower> is used for C<lc()>, C<\L>, C<lcfirst()>, and C<\l>; C<ToTitle> for
-C<ucfirst()> and C<\u>; and C<ToUpper> for C<uc()> and C<\U>.
-
-C<ToUpper()> should look something like this:
-
-    sub ToUpper {
-        return <<END;
-    0061\t007A\t0041
-    0101\t\t0100
-    END
-    }
-
-This sample C<ToUpper()> has the effect of mapping "a-z" to "A-Z", 0x101
-to 0x100, and all other characters map to themselves.  The first
-returned line means to map the code point at 0x61 ("a") to 0x41 ("A"),
-the code point at 0x62 ("b") to 0x42 ("B"),  ..., 0x7A ("z") to 0x5A
-("Z").  The second line maps just the code point 0x101 to 0x100.  Since
-there are no other mappings defined, all other code points map to
-themselves.
-
-This mechanism is not well behaved as far as affecting other packages
-and scopes.  All non-threaded programs have exactly one uppercasing
-behavior, one lowercasing behavior, and one titlecasing behavior in
-effect for utf8-encoded strings for the duration of the program.  Each
-of these behaviors is irrevocably determined the first time the
-corresponding function is called to change a utf8-encoded string's case.
-If a corresponding C<To-> function has been defined in the package that
-makes that first call, the mapping defined by that function will be the
-mapping used for the duration of the program's execution across all
-packages and scopes.  If no corresponding C<To-> function has been
-defined in that package, the standard official mapping will be used for
-all packages and scopes, and any corresponding C<To-> function anywhere
-will be ignored.  Threaded programs have similar behavior.  If the
-program's casing behavior has been decided at the time of a thread's
-creation, the thread will inherit that behavior.  But, if the behavior
-hasn't been decided, the thread gets to decide for itself, and its
-decision does not affect other threads nor its creator.
-
-As shown by the example above, you have to furnish a complete mapping;
-you can't just override a couple of characters and leave the rest
-unchanged.  You can find all the official mappings in the directory
-C<$Config{privlib}>F</unicore/To/>.  The mapping data is returned as the
-here-document.  The C<utf8::ToSpecI<Foo>> hashes in those files are special
-exception mappings derived from
-C<$Config{privlib}>F</unicore/SpecialCasing.txt>.  (The "Digit" and
-"Fold" mappings that one can see in the directory are not directly
-user-accessible, one can use either the L<Unicode::UCD> module, or just match
-case-insensitively, which is what uses the "Fold" mapping.  Neither are user
-overridable.)
-
-If you have many mappings to change, you can take the official mapping data,
-change by hand the affected code points, and place the whole thing into your
-subroutine.  But this will only be valid on Perls that use the same Unicode
-version.  Another option would be to have your subroutine read the official
-mapping files and overwrite the affected code points.
-
-If you have only a few mappings to change, starting in 5.14 you can use the
-following trick, here illustrated for Turkish.
-
-    use Config;
-    use charnames ":full";
-
-    sub ToUpper {
-        my $official = do "$Config{privlib}/unicore/To/Upper.pl";
-        $utf8::ToSpecUpper{'i'} =
-                           "\N{LATIN CAPITAL LETTER I WITH DOT ABOVE}";
-        return $official;
-    }
-
-This takes the official mappings and overrides just one, for "LATIN SMALL
-LETTER I".  The keys to the hash must be the bytes that form the UTF-8
-(on EBCDIC platforms, UTF-EBCDIC) of the character, as illustrated by
-the inverse function.
-
-    sub ToLower {
-        my $official = do $lower;
-        $utf8::ToSpecLower{"\xc4\xb0"} = "i";
-        return $official;
-    }
-
-This example is for an ASCII platform, and C<\xc4\xb0> is the string of
-bytes that together form the UTF-8 that represents C<\N{LATIN CAPITAL
-LETTER I WITH DOT ABOVE}>, C<U+0130>.  You can avoid having to figure out
-these bytes, and at the same time make it work on all platforms by
-instead writing:
-
-    sub ToLower {
-        my $official = do $lower;
-        my $sequence = "\N{LATIN CAPITAL LETTER I WITH DOT ABOVE}";
-        utf8::encode($sequence);
-        $utf8::ToSpecLower{$sequence} = "i";
-        return $official;
-    }
-
-This works because C<utf8::encode()> takes the single character and
-converts it to the sequence of bytes that constitute it.  Note that we took
-advantage of the fact that C<"i"> is the same in UTF-8 or UTF_EBCIDIC as not;
-otherwise we would have had to write
-
-        $utf8::ToSpecLower{$sequence} = "\N{LATIN SMALL LETTER I}";
-
-in the ToLower example, and in the ToUpper example, use
-
-        my $sequence = "\N{LATIN SMALL LETTER I}";
-        utf8::encode($sequence);
-
-A big caveat to the above trick and to this whole mechanism in general,
-is that they work only on strings encoded in UTF-8.  You can partially
-get around this by using C<use subs>.  (But better to just convert to
-use L<Unicode::Casing>.)  For example:
-(The trick illustrated here does work in earlier releases, but only if all the
-characters you want to override have ordinal values of 256 or higher, or
-if you use the other tricks given just below.)
-
-The mappings are in effect only for the package they are defined in, and only
-on scalars that have been marked as having Unicode characters, for example by
-using C<utf8::upgrade()>.  Although probably not advisable, you can
-cause the mappings to be used globally by importing into C<CORE::GLOBAL>
-(see L<CORE>).
-
-You can partially get around the restriction that the source strings
-must be in utf8 by using C<use subs> (or by importing into C<CORE::GLOBAL>) by:
-
- use subs qw(uc ucfirst lc lcfirst);
-
- sub uc($) {
-     my $string = shift;
-     utf8::upgrade($string);
-     return CORE::uc($string);
- }
-
- sub lc($) {
-     my $string = shift;
-     utf8::upgrade($string);
-
-     # Unless an I is before a dot_above, it turns into a dotless i.
-     # (The character class with the combining classes matches non-above
-     # marks following the I.  Any number of these may be between the 'I' and
-     # the dot_above, and the dot_above will still apply to the 'I'.
-     use charnames ":full";
-     $string =~
-             s/I
-               (?! [^\p{ccc=0}\p{ccc=Above}]* \N{COMBINING DOT ABOVE} )
-              /\N{LATIN SMALL LETTER DOTLESS I}/gx;
-
-     # But when the I is followed by a dot_above, remove the
-     # dot_above so the end result will be i.
-     $string =~ s/I
-                    ([^\p{ccc=0}\p{ccc=Above}]* )
-                    \N{COMBINING DOT ABOVE}
-                 /i$1/gx;
-     return CORE::lc($string);
- }
-
-These examples (also for Turkish) make sure the input is in UTF-8, and then
-call the corresponding official function, which will use the C<ToUpper()> and
-C<ToLower()> functions you have defined.
-(For Turkish, there are other required functions: C<ucfirst>, C<lcfirst>,
-and C<ToTitle>. These are very similar to the ones given above.)
-
-The reason this is only a partial fix is that it doesn't affect the C<\l>,
-C<\L>, C<\u>, and C<\U> case-change operations in regular expressions,
-which still require the source to be encoded in utf8 (see L</The "Unicode
-Bug">). (Again, use L<Unicode::Casing> instead.)
-
-The C<lc()> example shows how you can add context-dependent casing. Note
-that context-dependent casing suffers from the problem that the string
-passed to the casing function may not have sufficient context to make
-the proper choice. Also, it will not be called for C<\l>, C<\L>, C<\u>,
-and C<\U>.
+B<This feature has been removed as of Perl 5.16.>
+The CPAN module L<Unicode::Casing> provides better functionality without
+the drawbacks that this feature had.  If you are using a Perl earlier
+than 5.16, this feature was most fully documented in the 5.14 version of
+this pod:
+L<http://perldoc.perl.org/5.14.0/perlunicode.html#User-Defined-Case-Mappings-%28for-serious-hackers-only%29>
  
  =head2 Character Encodings for Input and Output
  
@@ -1068,41 +947,41 @@ and the section numbers refer to the Unicode Technical Standard #18,
  
  Level 1 - Basic Unicode Support
  
-        RL1.1   Hex Notation                     - done          [1]
-        RL1.2   Properties                       - done          [2][3]
-        RL1.2a  Compatibility Properties         - done          [4]
-        RL1.3   Subtraction and Intersection     - MISSING       [5]
-        RL1.4   Simple Word Boundaries           - done          [6]
-        RL1.5   Simple Loose Matches             - done          [7]
-        RL1.6   Line Boundaries                  - MISSING       [8][9]
-        RL1.7   Supplementary Code Points        - done          [10]
-
-        [1]  \x{...}
-        [2]  \p{...} \P{...}
-        [3]  supports not only minimal list, but all Unicode character
-             properties (see L</Unicode Character Properties>)
-        [4]  \d \D \s \S \w \W \X [:prop:] [:^prop:]
-        [5]  can use regular expression look-ahead [a] or
-             user-defined character properties [b] to emulate set
-             operations
-        [6]  \b \B
-        [7]  note that Perl does Full case-folding in matching (but with
-             bugs), not Simple: for example U+1F88 is equivalent to
-             U+1F00 U+03B9, not with 1F80.  This difference matters
-             mainly for certain Greek capital letters with certain
-             modifiers: the Full case-folding decomposes the letter,
-             while the Simple case-folding would map it to a single
-             character.
-        [8]  should do ^ and $ also on U+000B (\v in C), FF (\f), CR
-             (\r), CRLF (\r\n), NEL (U+0085), LS (U+2028), and PS
-             (U+2029); should also affect <>, $., and script line
-             numbers; should not split lines within CRLF [c] (i.e. there
-             is no empty line between \r and \n)
-       [9]  Linebreaking conformant with UAX#14 "Unicode Line Breaking
-            Algorithm" is available through the Unicode::LineBreaking
-            module.
-       [10]  UTF-8/UTF-EBDDIC used in Perl allows not only U+10000 to
-             U+10FFFF but also beyond U+10FFFF
+ RL1.1   Hex Notation                     - done          [1]
+ RL1.2   Properties                       - done          [2][3]
+ RL1.2a  Compatibility Properties         - done          [4]
+ RL1.3   Subtraction and Intersection     - MISSING       [5]
+ RL1.4   Simple Word Boundaries           - done          [6]
+ RL1.5   Simple Loose Matches             - done          [7]
+ RL1.6   Line Boundaries                  - MISSING       [8][9]
+ RL1.7   Supplementary Code Points        - done          [10]
+
+ [1]  \x{...}
+ [2]  \p{...} \P{...}
+ [3]  supports not only minimal list, but all Unicode character
+      properties (see Unicode Character Properties above)
+ [4]  \d \D \s \S \w \W \X [:prop:] [:^prop:]
+ [5]  can use regular expression look-ahead [a] or
+      user-defined character properties [b] to emulate set
+      operations
+ [6]  \b \B
+ [7]  note that Perl does Full case-folding in matching (but with
+      bugs), not Simple: for example U+1F88 is equivalent to
+      U+1F00 U+03B9, instead of just U+1F80.  This difference
+      matters mainly for certain Greek capital letters with certain
+      modifiers: the Full case-folding decomposes the letter,
+      while the Simple case-folding would map it to a single
+      character.
+ [8]  should do ^ and $ also on U+000B (\v in C), FF (\f), CR
+      (\r), CRLF (\r\n), NEL (U+0085), LS (U+2028), and PS
+      (U+2029); should also affect <>, $., and script line
+      numbers; should not split lines within CRLF [c] (i.e. there
+      is no empty line between \r and \n)
+ [9]  Linebreaking conformant with UAX#14 "Unicode Line Breaking
+      Algorithm" is available through the Unicode::LineBreaking
+      module.
+ [10] UTF-8/UTF-EBCDIC used in Perl allows not only U+10000 to
+      U+10FFFF but also beyond U+10FFFF
  
  [a] You can mimic class subtraction using lookahead.
  For example, what UTS#18 might write as
@@ -1120,7 +999,7 @@ But in this particular example, you probably really want
  
  which will match assigned characters known to be part of the Greek script.
  
-Also see the Unicode::Regex::Set module, it does implement the full
+Also see the L<Unicode::Regex::Set> module; it does implement the full
  UTS#18 grouping, intersection, union, and removal (subtraction) syntax.
  
  [b] '+' for union, '-' for removal (set-difference), '&' for intersection
@@ -1132,43 +1011,43 @@ UTS#18 grouping, intersection, union, and removal (subtraction) syntax.
  
  Level 2 - Extended Unicode Support
  
-        RL2.1   Canonical Equivalents           - MISSING       [10][11]
-        RL2.2   Default Grapheme Clusters       - MISSING       [12]
-        RL2.3   Default Word Boundaries         - MISSING       [14]
-        RL2.4   Default Loose Matches           - MISSING       [15]
-        RL2.5   Name Properties                 - DONE
-        RL2.6   Wildcard Properties             - MISSING
+ RL2.1   Canonical Equivalents           - MISSING       [10][11]
+ RL2.2   Default Grapheme Clusters       - MISSING       [12]
+ RL2.3   Default Word Boundaries         - MISSING       [14]
+ RL2.4   Default Loose Matches           - MISSING       [15]
+ RL2.5   Name Properties                 - DONE
+ RL2.6   Wildcard Properties             - MISSING
  
  
-        [10] see UAX#15 "Unicode Normalization Forms"
-        [11] have Unicode::Normalize but not integrated to regexes
-        [12] have \X but we don't have a "Grapheme Cluster Mode"
-        [14] see UAX#29, Word Boundaries
-        [15] see UAX#21 "Case Mappings"
+ [10] see UAX#15 "Unicode Normalization Forms"
+ [11] have Unicode::Normalize but not integrated to regexes
+ [12] have \X but we don't have a "Grapheme Cluster Mode"
+ [14] see UAX#29, Word Boundaries
+ [15] see UAX#21 "Case Mappings"
  
  =item *
  
  Level 3 - Tailored Support
  
-        RL3.1   Tailored Punctuation            - MISSING
-        RL3.2   Tailored Grapheme Clusters      - MISSING       [17][18]
-        RL3.3   Tailored Word Boundaries        - MISSING
-        RL3.4   Tailored Loose Matches          - MISSING
-        RL3.5   Tailored Ranges                 - MISSING
-        RL3.6   Context Matching                - MISSING       [19]
-        RL3.7   Incremental Matches             - MISSING
+ RL3.1   Tailored Punctuation            - MISSING
+ RL3.2   Tailored Grapheme Clusters      - MISSING       [17][18]
+ RL3.3   Tailored Word Boundaries        - MISSING
+ RL3.4   Tailored Loose Matches          - MISSING
+ RL3.5   Tailored Ranges                 - MISSING
+ RL3.6   Context Matching                - MISSING       [19]
+ RL3.7   Incremental Matches             - MISSING
        ( RL3.8   Unicode Set Sharing )
-        RL3.9   Possible Match Sets             - MISSING
-        RL3.10  Folded Matching                 - MISSING       [20]
-        RL3.11  Submatchers                     - MISSING
-
-        [17] see UAX#10 "Unicode Collation Algorithms"
-        [18] have Unicode::Collate but not integrated to regexes
-        [19] have (?<=x) and (?=x), but look-aheads or look-behinds
-             should see outside of the target substring
-        [20] need insensitive matching for linguistic features other
-             than case; for example, hiragana to katakana, wide and
-             narrow, simplified Han to traditional Han (see UTR#30
-             "Character Foldings")
+ RL3.9   Possible Match Sets             - MISSING
+ RL3.10  Folded Matching                 - MISSING       [20]
+ RL3.11  Submatchers                     - MISSING
+
+ [17] see UAX#10 "Unicode Collation Algorithms"
+ [18] have Unicode::Collate but not integrated to regexes
+ [19] have (?<=x) and (?=x), but look-aheads or look-behinds
+      should see outside of the target substring
+ [20] need insensitive matching for linguistic features other
+      than case; for example, hiragana to katakana, wide and
+      narrow, simplified Han to traditional Han (see UTR#30
+      "Character Foldings")
  
  =back
  
@@ -1189,14 +1068,14 @@ encoding. For ASCII (and we really do mean 7-bit ASCII, not another
  
  The following table is from Unicode 3.2.
  
- Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte
+ Code Points            1st Byte  2nd Byte  3rd Byte 4th Byte
  
     U+0000..U+007F       00..7F
     U+0080..U+07FF     * C2..DF    80..BF
     U+0800..U+0FFF       E0      * A0..BF    80..BF
     U+1000..U+CFFF       E1..EC    80..BF    80..BF
     U+D000..U+D7FF       ED        80..9F    80..BF
-   U+D800..U+DFFF       +++++++ utf16 surrogates, not legal utf8 +++++++
+   U+D800..U+DFFF       +++++ utf16 surrogates, not legal utf8 +++++
     U+E000..U+FFFF       EE..EF    80..BF    80..BF
    U+10000..U+3FFFF      F0      * 90..BF    80..BF    80..BF
    U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
@@ -1210,12 +1089,12 @@ explicitly forbidden, and the shortest possible encoding should always be used
  
  Another way to look at it is via bits:
  
- Code Points                    1st Byte   2nd Byte  3rd Byte  4th Byte
+                Code Points  1st Byte  2nd Byte  3rd Byte  4th Byte
  
  
-                    0aaaaaaa     0aaaaaaa
-            00000bbbbbaaaaaa     110bbbbb  10aaaaaa
-            ccccbbbbbbaaaaaa     1110cccc  10bbbbbb  10aaaaaa
-  00000dddccccccbbbbbbaaaaaa     11110ddd  10cccccc  10bbbbbb  10aaaaaa
+                   0aaaaaaa  0aaaaaaa
+           00000bbbbbaaaaaa  110bbbbb  10aaaaaa
+           ccccbbbbbbaaaaaa  1110cccc  10bbbbbb  10aaaaaa
+ 00000dddccccccbbbbbbaaaaaa  11110ddd  10cccccc  10bbbbbb  10aaaaaa
  
  As you can see, the continuation bytes all begin with "10", and the
  leading bits of the start byte tell how many bytes there are in the
@@ -1508,12 +1387,6 @@ In C<quotemeta> or its inline equivalent C<\Q>, no characters
  code points above 127 are quoted in UTF-8 encoded strings, but in
  byte encoded strings, code points between 128-255 are always quoted.
  
-=item *
-
-User-defined case change mappings.  You can create a C<ToUpper()> function, for
-example, which overrides Perl's built-in case mappings.  The scalar must be
-encoded in utf8 for your function to actually be invoked.
-
  =back
  
  This behavior can lead to unexpected results in which a string's semantics
@@ -1812,7 +1685,7 @@ working with 5.6, you will need some of the following adjustments to
  your code. The examples are written such that the code will continue
  to work under 5.6, so you should be safe to try them out.
  
-=over 4
+=over 3
  
  =item *