Add Unicode property wildcards

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod

index edeb37d..8f09a18 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -36,8 +36,8 @@ Unicode support is an extensive requirement. While Perl does not
  implement the Unicode standard or the accompanying technical reports
  from cover to cover, Perl does support many Unicode features.
  
-Also, the use of Unicode may present security issues that aren't obvious.
-Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.
+Also, the use of Unicode may present security issues that aren't
+obvious; see L</Security Implications of Unicode> below.
  
  =over 4
  
@@ -60,10 +60,11 @@ filenames.
  Use the C<:encoding(...)> layer  to read from and write to
  filehandles using the specified encoding.  (See L<open>.)
  
-=item You should convert your non-ASCII, non-UTF-8 Perl scripts to be
+=item You must convert your non-ASCII, non-UTF-8 Perl scripts to be
  UTF-8.
  
-See L<encoding>.
+The L<encoding> module has been deprecated since perl 5.18 and the
+perl internals it requires have been removed with perl 5.26.
  
  =item C<use utf8> still needed to enable L<UTF-8|/Unicode Encodings> in scripts
  
@@ -73,14 +74,16 @@ recognition of that (in string or regular expression literals, or in
  identifier names).  B<This is the only time when an explicit S<C<use
  utf8>> is needed.>  (See L<utf8>).
  
-=item C<BOM>-marked scripts and L<UTF-16|/Unicode Encodings> scripts autodetected
+If a Perl script begins with the bytes that form the UTF-8 encoding of
+the Unicode BYTE ORDER MARK (C<BOM>, see L</Unicode Encodings>), those
+bytes are completely ignored.
  
-However, if a Perl script begins with the Unicode C<BOM> (UTF-16LE,
-UTF16-BE, or UTF-8), or if the script looks like non-C<BOM>-marked
+=item L<UTF-16|/Unicode Encodings> scripts autodetected
+
+If a Perl script begins with the Unicode C<BOM> (UTF-16LE,
+UTF-16BE), or if the script looks like non-C<BOM>-marked
  UTF-16 of either endianness, Perl will correctly read in the script as
-the appropriate Unicode encoding.  (C<BOM>-less UTF-8 cannot be
-effectively recognized or differentiated from ISO 8859-1 or other
-eight-bit encodings.)
+the appropriate Unicode encoding.
  
  =back
  
@@ -162,7 +165,7 @@ contain characters that have ordinal values larger than 255.
  
  If you use a Unicode editor to edit your program, Unicode characters may
  occur directly within the literal strings in UTF-8 encoding, or UTF-16.
-(The former requires a C<BOM> or C<use utf8>, the latter requires a C<BOM>.)
+(The former requires a C<use utf8>, the latter may require a C<BOM>.)
  
  L<perluniintro/Creating Unicode> gives other ways to place non-ASCII
  characters in your strings.
@@ -189,11 +192,12 @@ C<scalar reverse()> reverses by character rather than by byte.
  =item *
  
  The bit string operators, C<& | ^ ~> and (starting in v5.22)
-C<&. |. ^.  ~.> can operate on characters that don't fit into a byte.
-However, the current behavior is likely to change.  You should not use
-these operators on strings that are encoded in UTF-8.  If you're not
-sure about the encoding of a string, downgrade it before using any of
-these operators; you can use
+C<&. |. ^.  ~.> can operate on bit strings encoded in UTF-8, but this
+can give unexpected results if any of the strings contain code points
+above 0xFF.  Starting in v5.28, it is a fatal error to have such an
+operand.  Otherwise, the operation is performed on a non-UTF-8 copy of
+the operand.  If you're not sure about the encoding of a string,
+downgrade it before using any of these operators; you can use
  L<C<utf8::utf8_downgrade()>|utf8/Utility functions>.
  
  =back
@@ -206,7 +210,8 @@ Semantics".
  
  Before Unicode, when a character was a byte was a character,
  Perl knew only about the 128 characters defined by ASCII, code points 0
-through 127 (except for under S<C<use locale>>).  That left the code
+through 127 (except for under L<S<C<use locale>>|perllocale>).  That
+left the code
  points 128 to 255 as unassigned, and available for whatever use a
  program might want.  The only semantics they have is their ordinal
  numbers, and that they are members of none of the non-negative character
@@ -229,7 +234,7 @@ Unicode:
  Within the scope of S<C<use utf8>>
  
  If the whole program is Unicode (signified by using 8-bit B<U>nicode
-B<T>ransformation B<F>ormat), then all strings within it must be
+B<T>ransformation B<F>ormat), then all literal strings within it must be
  Unicode.
  
  =item *
@@ -389,7 +394,7 @@ other.
  
  You may be presented with strings in any of these equivalent forms.
  There is currently nothing in Perl 5 that ignores the differences.  So
-you'll have to specially hanlde it.  The usual advice is to convert your
+you'll have to specially handle it.  The usual advice is to convert your
  inputs to C<NFD> before processing further.
  
  For more detailed information, see L<http://unicode.org/reports/tr15/>.
@@ -592,7 +597,7 @@ C<L</General_Category>> property, this
  property can have more values added in a future Unicode release.  Those
  listed above comprised the complete set for many Unicode releases, but
  others were added in Unicode 6.3; you can always find what the
-current ones are in in L<perluniprops>.  And
+current ones are in L<perluniprops>.  And
  L<http://www.unicode.org/reports/tr9/> describes how to use them.
  
  =head3 B<Scripts>
@@ -602,16 +607,19 @@ The world's languages are written in many different scripts.  This sentence
  written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in
  Hiragana or Katakana.  There are many more.
  
-The Unicode C<Script> and C<Script_Extensions> properties give what script a
-given character is in.  Either property can be specified with the
-compound form like
+The Unicode C<Script> and C<Script_Extensions> properties give what
+script a given character is in.  The C<Script_Extensions> property is an
+improved version of C<Script>, as demonstrated below.  Either property
+can be specified with the compound form like
  C<\p{Script=Hebrew}> (short: C<\p{sc=hebr}>), or
  C<\p{Script_Extensions=Javanese}> (short: C<\p{scx=java}>).
  In addition, Perl furnishes shortcuts for all
-C<Script> property names.  You can omit everything up through the equals
-(or colon), and simply write C<\p{Latin}> or C<\P{Cyrillic}>.
-(This is not true for C<Script_Extensions>, which is required to be
-written in the compound form.)
+C<Script_Extensions> property names.  You can omit everything up through
+the equals (or colon), and simply write C<\p{Latin}> or C<\P{Cyrillic}>.
+(This is not true for C<Script>, which is required to be
+written in the compound form.  Prior to Perl v5.26, the single form
+returned the plain old C<Script> version, but was changed because
+C<Script_Extensions> gives better results.)
  
  The difference between these two properties involves characters that are
  used in multiple scripts.  For example the digits '0' through '9' are
@@ -645,7 +653,11 @@ fewer characters in the C<Common> script, and correspondingly more in
  other scripts.  It is new in Unicode version 6.0, and its data are likely
  to change significantly in later releases, as things get sorted out.
  New code should probably be using C<Script_Extensions> and not plain
-C<Script>.
+C<Script>.  If you compile perl with a Unicode release that doesn't have
+C<Script_Extensions>, the single form Perl extensions will instead refer
+to the plain C<Script> property.  If you compile with a version of
+Unicode that doesn't have the C<Script> property, these extensions will
+not be defined at all.
  
  (Actually, besides C<Common>, the C<Inherited> script, contains
  characters that are used in multiple scripts.  These are modifier
@@ -658,15 +670,18 @@ C<Script>, but not in C<Script_Extensions>.)
  It is worth stressing that there are several different sets of digits in
  Unicode that are equivalent to 0-9 and are matchable by C<\d> in a
  regular expression.  If they are used in a single language only, they
-are in that language's C<Script> and C<Script_Extension>.  If they are
+are in that language's C<Script> and C<Script_Extensions>.  If they are
  used in more than one script, they will be in C<sc=Common>, but only
  if they are used in many scripts should they be in C<scx=Common>.
  
+The explanation above has omitted some detail; refer to UAX#24 "Unicode
+Script Property": L<http://www.unicode.org/reports/tr24>.
+
  A complete list of scripts and their shortcuts is in L<perluniprops>.
  
  =head3 B<Use of the C<"Is"> Prefix>
  
-For backward compatibility (with Perl 5.6), all properties writable
+For backward compatibility (with ancient Perl 5.6), all properties writable
  without using the compound form mentioned
  so far may have C<Is> or C<Is_> prepended to their name, so C<\P{Is_Lu}>, for
  example, is equal to C<\P{Lu}>, and C<\p{IsScript:Arabic}> is equal to
@@ -690,37 +705,42 @@ C<Common> script.
  For more about scripts versus blocks, see UAX#24 "Unicode Script Property":
  L<http://www.unicode.org/reports/tr24>
  
-The C<Script> or C<Script_Extensions> properties are likely to be the
+The C<Script_Extensions> or C<Script> properties are likely to be the
  ones you want to use when processing
  natural language; the C<Block> property may occasionally be useful in working
  with the nuts and bolts of Unicode.
  
  Block names are matched in the compound form, like C<\p{Block: Arrows}> or
  C<\p{Blk=Hebrew}>.  Unlike most other properties, only a few block names have a
-Unicode-defined short name.  But Perl does provide a (slight, no longer
-recommended) shortcut:  You can say, for example C<\p{In_Arrows}> or
-C<\p{In_Hebrew}>.
-
-For backwards compatibility, the C<In> prefix may be
-omitted if there is no naming conflict with a script or any other
-property, and you can even use an C<Is> prefix instead in those cases.
-But don't do this for new code because your code could break in new
-releases, and this has already happened: There was a time in very
-early Unicode releases when C<\p{Hebrew}> would have matched the
-I<block> Hebrew; now it doesn't.
-
-Using the C<In> prefix avoids this ambiguity, so far.  But new versions
-of Unicode continue to add new properties whose names begin with C<In>.
-There is a possibility that one of them someday will conflict with your
-usage.  Since this is just a Perl extension, Unicode's name will take
-precedence and your code will become broken.  Also, Unicode is free to
-add a script whose name begins with C<In>; that would cause problems.
-
-So it's clearer and best to use the compound form when specifying
-blocks.  And be sure that is what you really really want to do.  In most
-cases scripts are what you want instead.
-
-A complete list of blocks and their shortcuts is in L<perluniprops>.
+Unicode-defined short name.
+
+Perl also defines single form synonyms for the block property in cases
+where these do not conflict with something else.  But don't use any of
+these, because they are unstable.  Since these are Perl extensions, they
+are subordinate to official Unicode property names; Unicode doesn't know
+nor care about Perl's extensions.  It may happen that a name that
+currently means the Perl extension will later be changed without warning
+to mean a different Unicode property in a future version of the perl
+interpreter that uses a later Unicode release, and your code would no
+longer work.  The extensions are mentioned here for completeness:  Take
+the block name and prefix it with one of: C<In> (for example
+C<\p{Blk=Arrows}> can currently be written as C<\p{In_Arrows}>); or
+sometimes C<Is> (like C<\p{Is_Arrows}>); or sometimes no prefix at all
+(C<\p{Arrows}>).  As of this writing (Unicode 9.0) there are no
+conflicts with using the C<In_> prefix, but there are plenty with the
+other two forms.  For example, C<\p{Is_Hebrew}> and C<\p{Hebrew}> mean
+C<\p{Script_Extensions=Hebrew}> which is NOT the same thing as
+C<\p{Blk=Hebrew}>.  Our
+advice used to be to use the C<In_> prefix as a single form way of
+specifying a block.  But Unicode 8.0 added properties whose names begin
+with C<In>, and it's now clear that it's only luck that's so far
+prevented a conflict.  Using C<In> is only marginally less typing than
+C<Blk:>, and the latter's meaning is clearer anyway, and guaranteed to
+never conflict.  So don't take chances.  Use C<\p{Blk=foo}> for new
+code.  And be sure that block is what you really really want to do.  In
+most cases scripts are what you want instead.
+
+A complete list of blocks is in L<perluniprops>.
  
  =head3 B<Other Properties>
  
@@ -833,8 +853,8 @@ L<perlrecharclass/POSIX Character Classes>.
  This property is used when you need to know in what Unicode version(s) a
  character is.
  
-The "*" above stands for some two digit Unicode version number, such as
-C<1.1> or C<4.0>; or the "*" can also be C<Unassigned>.  This property will
+The "*" above stands for some Unicode version number, such as
+C<1.1> or C<12.0>; or the "*" can also be C<Unassigned>.  This property will
  match the code points whose final disposition has been settled as of the
  Unicode release given by the version number; C<\p{Present_In: Unassigned}>
  will match those code points whose meaning has yet to be assigned.
@@ -901,6 +921,145 @@ L<perlrecharclass/POSIX Character Classes>.
  
  =back
  
+=head2 Wildcards in Property Values
+
+Starting in Perl 5.30, it is possible to do something like this:
+
+ qr!\p{numeric_value=/\A[0-5]\z/}!
+
+or, by abbreviating and adding C</x>,
+
+ qr! \p{nv= /(?x) \A [0-5] \z / }!
+
+This matches all code points whose numeric value is one of 0, 1, 2, 3,
+4, or 5.  This particular example could instead have been written as
+
+ qr! \A [ \p{nv=0}\p{nv=1}\p{nv=2}\p{nv=3}\p{nv=4}\p{nv=5} ] \z !xx
+
+in earlier perls, so in this case this feature just makes things easier
+and shorter to write.  If we hadn't included the C<\A> and C<\z>, these
+would have matched things like C<1E<sol>2> because that contains a 1 (as
+well as a 2).  As written, it matches things like subscripts that have
+these numeric values.  If we only wanted the decimal digits with those
+numeric values, we could say,
+
+ qr! (?[ \d & \p{nv=/[0-5]/} ]) !x
+
+The C<\d> gets rid of needing to anchor the pattern, since it forces the
+result to only match C<[0-9]>, and the C<[0-5]> further restricts it.
+
+The text in the above examples enclosed between the C<"E<sol>">
+characters can be just about any regular expression.  It is independent
+of the main pattern, so doesn't share any capturing groups, I<etc>.  The
+delimiters for it must be ASCII punctuation, but it may NOT be
+delimited by C<"{">, nor C<"}"> nor contain a literal C<"}">, as that
+delimits the end of the enclosing C<\p{}>.  Like any pattern, certain
+other delimiters are terminated by their mirror images.  These are
+C<"(">, C<"[">, and C<"E<lt>">.  If the delimiter is any of C<"-">,
+C<"_">, C<"+">, or C<"\">, or is the same delimiter as is used for the
+enclosing pattern, it must be preceded by a backslash escape, both
+fore and aft.
+
+Beware of using C<"$"> to match the end of the string.  It
+can too easily be interpreted as being a punctuation variable, like
+C<$/>.
+
+No modifiers may follow the final delimiter.  Instead, use
+L<perlre/(?adlupimnsx-imnsx)> and/or
+L<perlre/(?adluimnsx-imnsx:pattern)> to specify modifiers.
+
+This feature is not available when the left-hand side is prefixed by
+C<Is_>, nor for any form that is marked as "Discouraged" in
+L<perluniprops/Discouraged>.
+
+Perl wraps your pattern with C<(?iaa: ... )>.  This is because nothing
+outside ASCII can match the Unicode property values available in this
+release, and they should match caselessly.  If your pattern has a syntax
+error, this wrapping will be shown in the error message, even though you
+didn't specify it yourself.  This could be confusing if you don't know
+about this.
+
+This experimental feature has been added to begin to implement
+L<https://www.unicode.org/reports/tr18/#Wildcard_Properties>.  Using it
+will raise a (default-on) warning in the
+C<experimental::uniprop_wildcards> category.  We reserve the right to
+change its operation as we gain experience.
+
+Your subpattern can be just about anything, but for it to have some
+utility, it should match when called with either or both of
+a) the full name of the property value with underscores (and/or spaces
+in the Block property) and some things uppercase; or b) the property
+value in all lowercase with spaces and underscores squeezed out.  For
+example,
+
+ qr!\p{Blk=/Old I.*/}!
+ qr!\p{Blk=/oldi.*/}!
+
+would match the same things.
+
+A warning is issued if none of the legal values for a property are
+matched by your pattern.  It's likely that a future release will raise a
+warning if your pattern ends up causing every possible code point to
+match.
+
+Another example that shows that within C<\p{...}>, C</x> isn't needed to
+have spaces:
+
+ qr!\p{scx= /Hebrew|Greek/ }!
+
+To be safe, we should have anchored the above example, to prevent
+matches for something like C<Hebrew_Braille>, but there aren't
+any script names like that.
+
+There are certain properties that it doesn't currently work with.  These
+are:
+
+ Bidi Mirroring Glyph
+ Bidi Paired Bracket
+ Case Folding
+ Decomposition Mapping
+ Equivalent Unified Ideograph
+ Name
+ Name Alias
+ Lowercase Mapping
+ NFKC Case Fold
+ Titlecase Mapping
+ Uppercase Mapping
+
+Nor is the C<@I<unicode_property>@> form implemented.
+
+Here's a complete example of matching IPv4 internet protocol addresses
+in any (single) script:
+
+ no warnings 'experimental::script_run';
+ no warnings 'experimental::regex_sets';
+ no warnings 'experimental::uniprop_wildcards';
+
+ # Can match a substring, so this intermediate regex needs to have
+ # context or anchoring in its final use.  Using nt=de yields decimal
+ # digits.  When specifying a subset of these, we must include \d to
+ # prevent things like U+00B2 SUPERSCRIPT TWO from matching
+ my $zero_through_255 =
+  qr/ \b (*sr:                                  # All from same script
+            (?[ \p{nv=0} & \d ])*               # Optional leading zeros
+        (                                       # Then one of:
+                                  \d{1,2}       #   0 - 99
+            | (?[ \p{nv=1} & \d ])  \d{2}       #   100 - 199
+            | (?[ \p{nv=2} & \d ])
+               (  (?[ \p{nv=:[0-4]:} & \d ]) \d #   200 - 249
+                | (?[ \p{nv=5}     & \d ])
+                  (?[ \p{nv=:[0-5]:} & \d ])    #   250 - 255
+               )
+        )
+      )
+    \b
+  /x;
+
+ my $ipv4 = qr/ \A (*sr:         $zero_through_255
+                         (?: [.] $zero_through_255 ) {3}
+                   )
+                \z
+            /x;
  
  =head2 User-Defined Character Properties
  
@@ -945,7 +1104,8 @@ A single hexadecimal number denoting a code point to include.
  =item *
  
  Two hexadecimal numbers separated by horizontal whitespace (space or
-tabular characters) denoting a range of code points to include.
+tabular characters) denoting a range of code points to include.  The
+second number must not be smaller than the first.
  
  =item *
  
@@ -1043,7 +1203,7 @@ C<&utf8::Any> must be the last line in the definition.
  Intersection is used generally for getting the common characters matched
  by two (or more) classes.  It's important to remember not to use C<"&"> for
  the first set; that would be intersecting with nothing, resulting in an
-empty set.
+empty set.  (Similarly, using C<"-"> for the first set does nothing.)
  
  Unlike non-user-defined C<\p{}> property matches, no warning is ever
  generated if these properties are matched against a non-Unicode code
@@ -1065,38 +1225,40 @@ See L<Encode>.
  =head2 Unicode Regular Expression Support Level
  
  The following list of Unicode supported features for regular expressions describes
-all features currently directly supported by core Perl.  The references to "Level N"
-and the section numbers refer to the Unicode Technical Standard #18,
-"Unicode Regular Expressions", version 13, from August 2008.
-
-=over 4
-
-=item *
-
-Level 1 - Basic Unicode Support
-
- RL1.1   Hex Notation                     - done          [1]
- RL1.2   Properties                       - done          [2][3]
- RL1.2a  Compatibility Properties         - done          [4]
- RL1.3   Subtraction and Intersection     - experimental  [5]
- RL1.4   Simple Word Boundaries           - done          [6]
- RL1.5   Simple Loose Matches             - done          [7]
- RL1.6   Line Boundaries                  - MISSING       [8][9]
- RL1.7   Supplementary Code Points        - done          [10]
+all features currently directly supported by core Perl.  The references
+to "Level I<N>" and the section numbers refer to
+L<UTS#18 "Unicode Regular Expressions"|http://www.unicode.org/reports/tr18>,
+version 18, October 2016.
+
+=head3 Level 1 - Basic Unicode Support
+
+ RL1.1   Hex Notation                     - Done          [1]
+ RL1.2   Properties                       - Done          [2]
+ RL1.2a  Compatibility Properties         - Done          [3]
+ RL1.3   Subtraction and Intersection     - Experimental  [4]
+ RL1.4   Simple Word Boundaries           - Done          [5]
+ RL1.5   Simple Loose Matches             - Done          [6]
+ RL1.6   Line Boundaries                  - Partial       [7]
+ RL1.7   Supplementary Code Points        - Done          [8]
  
  =over 4
  
  =item [1] C<\N{U+...}> and C<\x{...}>
  
-=item [2] C<\p{...}> C<\P{...}>
+=item [2]
+C<\p{...}> C<\P{...}>.  This requirement is for a minimal list of
+properties.  Perl supports these and all other Unicode character
+properties, as RL2.7 asks (see L</"Unicode Character Properties"> above).
  
-=item [3] supports not only minimal list, but all Unicode character
-properties (see Unicode Character Properties above)
+=item [3]
+Perl has C<\d> C<\D> C<\s> C<\S> C<\w> C<\W> C<\X> C<[:I<prop>:]>
+C<[:^I<prop>:]>, plus all the properties specified by
+L<http://www.unicode.org/reports/tr18/#Compatibility_Properties>.  These
+are described above in L</Other Properties>.
  
-=item [4] C<\d> C<\D> C<\s> C<\S> C<\w> C<\W> C<\X> C<[:I<prop>:]>
-C<[:^I<prop>:]>
+=item [4]
  
-=item [5] The experimental feature starting in v5.18 C<"(?[...])"> accomplishes
+The experimental feature C<"(?[...])"> starting in v5.18 accomplishes
  this.
  
  See L<perlre/(?[ ])>.  If you don't want to use an experimental
@@ -1105,8 +1267,7 @@ feature, you can use one of the following:
  =over 4
  
  =item *
-
-Regular expression look-ahead
+Regular expression lookahead
  
  You can mimic class subtraction using lookahead.
  For example, what UTS#18 might write as
@@ -1139,9 +1300,12 @@ C<"+"> for union, C<"-"> for removal (set-difference), C<"&"> for intersection
  
  =back
  
-=item [6] C<\b> C<\B>
+=item [5]
+C<\b> C<\B> meet most, but not all, of the details of this requirement;
+C<\b{wb}> and C<\B{wb}> meet all of them, as well as the stricter RL2.3.
+
+=item [6]
  
-=item [7]
  Note that Perl does Full case-folding in matching, not Simple:
  
  For example C<U+1F88> is equivalent to C<U+1F00 U+03B9>, instead of just
@@ -1150,9 +1314,18 @@ letters with certain modifiers: the Full case-folding decomposes the
  letter, while the Simple case-folding would map it to a single
  character.
  
-=item [8]
-Perl treats C<\n> as the start- and end-line delimiter.  Unicode
-specifies more characters that should be so-interpreted.
+=item [7]
+
+The reason this is considered to be only partially implemented is that
+Perl has L<C<qrE<sol>\b{lb}E<sol>>|perlrebackslash/\b{lb}> and
+C<L<Unicode::LineBreak>> that are conformant with
+L<UAX#14 "Unicode Line Breaking Algorithm"|http://www.unicode.org/reports/tr14>.
+The regular expression construct provides default behavior, while the
+heavier-weight module provides customizable line breaking.
+
+But Perl treats C<\n> as the start- and end-line
+delimiter, whereas Unicode specifies more characters that should be
+so-interpreted.
  
  These are:
  
@@ -1172,59 +1345,77 @@ Also, lines should not be split within C<CRLF> (i.e. there is no
  empty line between C<\r> and C<\n>).  For C<CRLF>, try the C<:crlf>
  layer (see L<PerlIO>).
  
-=item [9] But C<L<Unicode::LineBreak>> is available.
+=item [8]
+UTF-8/UTF-EBCDIC used in Perl allows not only C<U+10000> to
+C<U+10FFFF> but also beyond C<U+10FFFF>.
+
+=back
+
+=head3 Level 2 - Extended Unicode Support
  
-This module supplies line breaking conformant with
-L<UAX#14 "Unicode Line Breaking Algorithm"|http://www.unicode.org/reports/tr14>.
+ RL2.1   Canonical Equivalents           - Retracted     [9]
+                                           by Unicode
+ RL2.2   Extended Grapheme Clusters      - Partial       [10]
+ RL2.3   Default Word Boundaries         - Done          [11]
+ RL2.4   Default Case Conversion         - Done
+ RL2.5   Name Properties                 - Done
+ RL2.6   Wildcards in Property Values    - Partial       [12]
+ RL2.7   Full Properties                 - Done
+
+=over 4
+
+=item [9]
+Unicode has rewritten this portion of UTS#18 to say that getting
+canonical equivalence (see UAX#15
+L<"Unicode Normalization Forms"|http://www.unicode.org/reports/tr15>)
+is basically to be done at the programmer level.  Use NFD to write
+both your regular expressions and text to match them against (you
+can use L<Unicode::Normalize>).
  
  =item [10]
-UTF-8/UTF-EBDDIC used in Perl allows not only C<U+10000> to
-C<U+10FFFF> but also beyond C<U+10FFFF>
+Perl has C<\X> and C<\b{gcb}> but we don't have a "Grapheme Cluster Mode".
+
+=item [11] see
+L<UAX#29 "Unicode Text Segmentation"|http://www.unicode.org/reports/tr29>.
+
+=item [12] see
+L</Wildcards in Property Values> above.
  
  =back
  
-=item *
+=head3 Level 3 - Tailored Support
  
-Level 2 - Extended Unicode Support
+ RL3.1   Tailored Punctuation            - Missing
+ RL3.2   Tailored Grapheme Clusters      - Missing       [13]
+ RL3.3   Tailored Word Boundaries        - Missing
+ RL3.4   Tailored Loose Matches          - Retracted by Unicode
+ RL3.5   Tailored Ranges                 - Retracted by Unicode
+ RL3.6   Context Matching                - Partial       [14]
+ RL3.7   Incremental Matches             - Missing
+ RL3.8   Unicode Set Sharing             - Retracted by Unicode
+ RL3.9   Possible Match Sets             - Missing
+ RL3.10  Folded Matching                 - Retracted by Unicode
+ RL3.11  Submatchers                     - Partial       [15]
  
- RL2.1   Canonical Equivalents           - MISSING       [10][11]
- RL2.2   Default Grapheme Clusters       - MISSING       [12]
- RL2.3   Default Word Boundaries         - DONE          [14]
- RL2.4   Default Loose Matches           - MISSING       [15]
- RL2.5   Name Properties                 - DONE
- RL2.6   Wildcard Properties             - MISSING
+=over 4
  
- [10] see UAX#15 "Unicode Normalization Forms"
- [11] have Unicode::Normalize but not integrated to regexes
- [12] have \X and \b{gcb} but we don't have a "Grapheme Cluster
-      Mode"
- [14] see UAX#29, Word Boundaries
- [15] This is covered in Chapter 3.13 (in Unicode 6.0)
+=item [13]
+Perl has L<Unicode::Collate>, but it isn't integrated with regular
+expressions.  See
+L<UTS#10 "Unicode Collation Algorithm"|http://www.unicode.org/reports/tr10>.
  
-=item *
+=item [14]
+Perl has C<(?<=x)> and C<(?=x)>, but this requirement says that it
+should be possible to specify that matches may occur only in a substring
+with the lookaheads and lookbehinds able to see beyond that matchable
+portion.
  
-Level 3 - Tailored Support
-
- RL3.1   Tailored Punctuation            - MISSING
- RL3.2   Tailored Grapheme Clusters      - MISSING       [17][18]
- RL3.3   Tailored Word Boundaries        - MISSING
- RL3.4   Tailored Loose Matches          - MISSING
- RL3.5   Tailored Ranges                 - MISSING
- RL3.6   Context Matching                - MISSING       [19]
- RL3.7   Incremental Matches             - MISSING
-      ( RL3.8   Unicode Set Sharing )
- RL3.9   Possible Match Sets             - MISSING
- RL3.10  Folded Matching                 - MISSING       [20]
- RL3.11  Submatchers                     - MISSING
-
- [17] see UAX#10 "Unicode Collation Algorithms"
- [18] have Unicode::Collate but not integrated to regexes
- [19] have (?<=x) and (?=x), but look-aheads or look-behinds
-      should see outside of the target substring
- [20] need insensitive matching for linguistic features other
-      than case; for example, hiragana to katakana, wide and
-      narrow, simplified Han to traditional Han (see UTR#30
-      "Character Foldings")
+=item [15]
+Perl has user-defined properties (L</"User-Defined Character
+Properties">) to look at single code points in ways beyond Unicode, and
+it might be possible, though probably not very clean, to use code blocks
+and things like C<(?(DEFINE)...)> (see L<perlre>) to do more specialized
+matching.
  
  =back
  
@@ -1285,7 +1476,10 @@ encoding of numbers up to C<0x7FFF_FFFF>.  Perl continues to allow those,
  and has extended that up to 13 bytes to encode code points up to what
  can fit in a 64-bit word.  However, Perl will warn if you output any of
  these as being non-portable; and under strict UTF-8 input protocols,
-they are forbidden.
+they are forbidden.  In addition, it is now illegal to use a code point
+larger than what a signed integer variable on your system can hold.  On
+32-bit ASCII systems, this means C<0x7FFF_FFFF> is the legal maximum
+(much higher on 64-bit systems).
  
  =item *
  
@@ -1296,10 +1490,11 @@ This means that all the basic characters (which includes all
  those that have ASCII equivalents (like C<"A">, C<"0">, C<"%">, I<etc.>)
  are the same in both EBCDIC and UTF-EBCDIC.)
  
-UTF-EBCDIC is used on EBCDIC platforms.  The largest Unicode code points
-take 5 bytes to represent (instead of 4 in UTF-8), and Perl extends it
-to a maximum of 7 bytes to encode pode points up to what can fit in a
-32-bit word (instead of 13 bytes and a 64-bit word in UTF-8).
+UTF-EBCDIC is used on EBCDIC platforms.  It generally requires more
+bytes to represent a given code point than UTF-8 does; the largest
+Unicode code points take 5 bytes to represent (instead of 4 in UTF-8),
+and, extended for 64-bit words, it uses 14 bytes instead of 13 bytes in
+UTF-8.
  
  =item *
  
@@ -1468,7 +1663,7 @@ noncharacters.
  
  The maximum Unicode code point is C<U+10FFFF>, and Unicode only defines
  operations on code points up through that.  But Perl works on code
-points up to the maximum permissible unsigned number available on the
+points up to the maximum permissible signed number available on the
  platform.  However, Perl will not accept these from input streams unless
  lax rules are being used, and will warn (using the warning category
  C<"non_unicode">, which is a sub-category of C<"utf8">) if any are output.
@@ -1495,7 +1690,7 @@ became generally reliable) through v5.18.  The difference is that Perl
  treated all C<\p{}> matches as failing, but all C<\P{}> matches as
  succeeding.
  
-One problem with this is that it leads to unexpected, and confusting
+One problem with this is that it leads to unexpected, and confusing
  results in some cases:
  
   chr(0x110000) =~ \p{ASCII_Hex_Digit=True}      # Failed on <= v5.18
@@ -1591,15 +1786,23 @@ Also, note the following:
  
  Malformed UTF-8
  
-Unfortunately, the original specification of UTF-8 leaves some room for
-interpretation of how many bytes of encoded output one should generate
-from one input Unicode character.  Strictly speaking, the shortest
-possible sequence of UTF-8 bytes should be generated,
-because otherwise there is potential for an input buffer overflow at
-the receiving end of a UTF-8 connection.  Perl always generates the
-shortest length UTF-8, and with warnings on, Perl will warn about
-non-shortest length UTF-8 along with other malformations, such as the
-surrogates, which are not Unicode code points valid for interchange.
+UTF-8 is very structured, so many combinations of bytes are invalid.  In
+the past, Perl tried to soldier on and make some sense of invalid
+combinations, but this can lead to security holes, so now, if the Perl
+core needs to process an invalid combination, it will either raise a
+fatal error, or will replace those bytes by the sequence that forms the
+Unicode REPLACEMENT CHARACTER, for which purpose Unicode created it.
+
+Every code point can be represented by more than one possible
+syntactically valid UTF-8 sequence.  Early on, both Unicode and Perl
+considered any of these to be valid, but now, all sequences longer
+than the shortest possible one are considered to be malformed.
+
+Unicode considers many code points to be illegal, or to be avoided.
+Perl generally accepts them, once they have passed through any input
+filters that may try to exclude them.  These have been discussed above
+(see "Surrogates" under UTF-16 in L</Unicode Encodings>,
+L</Noncharacter code points>, and L</Beyond Unicode code points>).
  
  =item *
  
@@ -1627,7 +1830,7 @@ See L<perlebcdic/Unicode and UTF>.
  
  Because UTF-EBCDIC is so similar to UTF-8, the differences are mostly
  hidden from you; S<C<use utf8>> (and NOT something like
-S<C<use utfebcdic>>) declares the the script is in the platform's
+S<C<use utfebcdic>>) declares the script is in the platform's
  "native" 8-bit encoding of Unicode.  (Similarly for the C<":utf8">
  layer.)
  
@@ -1707,7 +1910,7 @@ it, which changes the rules from ASCII to Unicode.  As an
  example, consider the following program and its output:
  
   $ perl -le'
-     no feature 'unicode_strings';
+     no feature "unicode_strings";
       $s1 = "\xC2";
       $s2 = "\x{2660}";
       for ($s1, $s2, $s1.$s2) {
@@ -1775,6 +1978,27 @@ Prior to that, or outside its scope, no code points above 127 are quoted
  in UTF-8 encoded strings, but in byte encoded strings, code points
  between 128-255 are always quoted.
  
+=item *
+
+In the C<..> or L<range|perlop/Range Operators> operator.
+
+Starting in Perl 5.26.0, the range operator on strings treats their lengths
+consistently within the scope of C<unicode_strings>. Prior to that, or
+outside its scope, it could produce strings whose length in characters
+exceeded that of the right-hand side, where the right-hand side took up more
+bytes than the correct range endpoint.
+
+=item *
+
+In L<< C<split>'s special-case whitespace splitting|perlfunc/split >>.
+
+Starting in Perl 5.28.0, the C<split> function with a pattern specified as
+a string containing a single space handles whitespace characters consistently
+within the scope of C<unicode_strings>. Prior to that, or outside its scope,
+characters that are whitespace according to Unicode rules but not according to
+ASCII rules were treated as field contents rather than field separators when
+they appeared in byte-encoded strings.
+
  =back
  
  You can see from the above that the effect of C<unicode_strings>
@@ -1815,7 +2039,7 @@ the XS level, and L<perlapi/Unicode Support> for the API details.
  Perl by default comes with the latest supported Unicode version built-in, but
  the goal is to allow you to change to use any earlier one.  In Perls
  v5.20 and v5.22, however, the earliest usable version is Unicode 5.1.
-Perl v5.18 is able to handle all earlier versions.
+Perl v5.18 and v5.24 are able to handle all earlier versions.
  
  Download the files in the desired version of Unicode from the Unicode web
  site L<http://www.unicode.org>).  These should replace the existing files in
@@ -1840,7 +2064,7 @@ work under 5.6, so you should be safe to try them out.
  A filehandle that should read or write UTF-8
  
    if ($] > 5.008) {
-    binmode $fh, ":encoding(utf8)";
+    binmode $fh, ":encoding(UTF-8)";
    }
  
  =item *
@@ -1855,7 +2079,7 @@ check the documentation to verify if this is still true.
  
    if ($] > 5.008) {
      require Encode;
-    $val = Encode::encode_utf8($val); # make octets
+    $val = Encode::encode("UTF-8", $val); # make octets
    }
  
  =item *
@@ -1867,7 +2091,7 @@ want the UTF8 flag restored:
  
    if ($] > 5.008) {
      require Encode;
-    $val = Encode::decode_utf8($val);
+    $val = Encode::decode("UTF-8", $val);
    }
  
  =item *
@@ -1968,8 +2192,8 @@ Perl's internal representation like so:
      sub my_escape_html ($) {
          my($what) = shift;
          return unless defined $what;
-        Encode::decode_utf8(Foo::Bar::escape_html(
-                                         Encode::encode_utf8($what)));
+        Encode::decode("UTF-8", Foo::Bar::escape_html(
+                                     Encode::encode("UTF-8", $what)));
      }
  
  Sometimes, when the extension does not convert data but just stores