=item C<BOM>-marked scripts and L<UTF-16|/Unicode Encodings> scripts autodetected
-However, if a Perl script begins with the Unicode C<BOM> (UTF-16LE,
+If a Perl script begins with the Unicode C<BOM> (UTF-16LE,
UTF-16BE, or UTF-8), or if the script looks like non-C<BOM>-marked
UTF-16 of either endianness, Perl will correctly read in the script as
the appropriate Unicode encoding. (C<BOM>-less UTF-8 cannot be
written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in
Hiragana or Katakana. There are many more.
-The Unicode C<Script> and C<Script_Extensions> properties give what script a
-given character is in. Either property can be specified with the
-compound form like
+The Unicode C<Script> and C<Script_Extensions> properties identify the
+script a given character belongs to. The C<Script_Extensions> property
+is an improved version of C<Script>, as demonstrated below. Either
+property can be specified with the compound form like
C<\p{Script=Hebrew}> (short: C<\p{sc=hebr}>), or
C<\p{Script_Extensions=Javanese}> (short: C<\p{scx=java}>).
In addition, Perl furnishes shortcuts for all
-C<Script> property names. You can omit everything up through the equals
-(or colon), and simply write C<\p{Latin}> or C<\P{Cyrillic}>.
-(This is not true for C<Script_Extensions>, which is required to be
-written in the compound form.)
+C<Script_Extensions> property names. You can omit everything up through
+the equals (or colon), and simply write C<\p{Latin}> or C<\P{Cyrillic}>.
+(This is not true for C<Script>, which is required to be
+written in the compound form. Prior to Perl v5.26, the single form
+meant the plain old C<Script> property, but this was changed because
+C<Script_Extensions> gives better results.)
The difference between these two properties involves characters that are
used in multiple scripts. For example the digits '0' through '9' are
other scripts. It is new in Unicode version 6.0, and its data are likely
to change significantly in later releases, as things get sorted out.
New code should probably be using C<Script_Extensions> and not plain
-C<Script>.
+C<Script>. If you compile perl with a Unicode release that doesn't have
+C<Script_Extensions>, the single form Perl extensions will instead refer
+to the plain C<Script> property. If you compile with a version of
+Unicode that doesn't have the C<Script> property, these extensions will
+not be defined at all.
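The difference between the two properties can be sketched as follows. This is a minimal illustration, not part of the original text: it uses only the compound forms, so it does not depend on which property the single form refers to in a given Perl release. The choice of U+30A0 (KATAKANA-HIRAGANA DOUBLE HYPHEN) as the sample character and the variable names are assumptions made for the example.

```perl
use strict;
use warnings;

# U+30A0 KATAKANA-HIRAGANA DOUBLE HYPHEN is used in both Japanese
# syllabaries, so the two properties classify it differently.
my $hyphen = "\x{30A0}";

# Record, for each property expression, whether the character matches.
my %hyphen_is = (
    'sc=Common'    => ($hyphen =~ /\p{Script=Common}/              ? 1 : 0),
    'sc=Hiragana'  => ($hyphen =~ /\p{Script=Hiragana}/            ? 1 : 0),
    'scx=Common'   => ($hyphen =~ /\p{Script_Extensions=Common}/   ? 1 : 0),
    'scx=Hiragana' => ($hyphen =~ /\p{Script_Extensions=Hiragana}/ ? 1 : 0),
    'scx=Katakana' => ($hyphen =~ /\p{Script_Extensions=Katakana}/ ? 1 : 0),
);

# The digit "0" is used in many scripts, so both properties say Common.
my $zero_is_common
    = ("0" =~ /\p{sc=Common}/ && "0" =~ /\p{scx=Common}/) ? 1 : 0;

printf "%-12s %s\n", $_, ($hyphen_is{$_} ? 'matches' : 'no match')
    for sort keys %hyphen_is;
print "digit 0 is Common under both: ",
    ($zero_is_common ? 'yes' : 'no'), "\n";
```

With C<Script>, the double hyphen is lumped into C<Common>; with C<Script_Extensions>, it is placed in each of the few scripts that actually use it.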
(Actually, besides C<Common>, the C<Inherited> script also contains
characters that are used in multiple scripts. These are modifier
It is worth stressing that there are several different sets of digits in
Unicode that are equivalent to 0-9 and are matchable by C<\d> in a
regular expression. If they are used in a single language only, they
-are in that language's C<Script> and C<Script_Extension>. If they are
+are in that language's C<Script> and C<Script_Extensions>. If they are
used in more than one script, they will be in C<sc=Common>, but only
if they are used in many scripts should they be in C<scx=Common>.
+The explanation above has omitted some detail; refer to UAX#24 "Unicode
+Script Property": L<http://www.unicode.org/reports/tr24>.
+
A complete list of scripts and their shortcuts is in L<perluniprops>.
=head3 B<Use of the C<"Is"> Prefix>
For more about scripts versus blocks, see UAX#24 "Unicode Script Property":
L<http://www.unicode.org/reports/tr24>
-The C<Script> or C<Script_Extensions> properties are likely to be the
+The C<Script_Extensions> or C<Script> properties are likely to be the
ones you want to use when processing
natural language; the C<Block> property may occasionally be useful in working
with the nuts and bolts of Unicode.
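To make the script-versus-block distinction concrete, here is a small sketch. The two sample characters and the hash names are choices made for this illustration only; the compound property forms are used so the result does not depend on how a single form is resolved.

```perl
use strict;
use warnings;

# U+05D0 HEBREW LETTER ALEF is encoded in the Hebrew block.
# U+FB1D HEBREW LETTER YOD WITH HIRIQ is a Hebrew-script letter that is
# encoded in the Alphabetic Presentation Forms block instead.
my %sample = (
    alef           => "\x{05D0}",
    yod_with_hiriq => "\x{FB1D}",
);

# For each sample, record membership in the Hebrew script and block.
my %report;
for my $name (sort keys %sample) {
    my $char = $sample{$name};
    $report{$name} = {
        script => ($char =~ /\p{Script=Hebrew}/ ? 1 : 0),
        block  => ($char =~ /\p{Blk=Hebrew}/    ? 1 : 0),
    };
    printf "%-16s script=%d block=%d\n",
        $name, $report{$name}{script}, $report{$name}{block};
}
```

A block is just a contiguous range of code points, so a letter of a script can sit outside the block that bears the script's name.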
the block name and prefix it with one of: C<In> (for example
C<\p{Blk=Arrows}> can currently be written as C<\p{In_Arrows}>); or
sometimes C<Is> (like C<\p{Is_Arrows}>); or sometimes no prefix at all
-(C<\p{Arrows}>). As of this writing (Unicode 8.0) there are no
+(C<\p{Arrows}>). As of this writing (Unicode 9.0) there are no
conflicts with using the C<In_> prefix, but there are plenty with the
other two forms. For example, C<\p{Is_Hebrew}> and C<\p{Hebrew}> mean
-C<\p{Script=Hebrew}> which is NOT the same thing as C<\p{Blk=Hebrew}>. Our
+C<\p{Script_Extensions=Hebrew}> which is NOT the same thing as
+C<\p{Blk=Hebrew}>. Our
advice used to be to use the C<In_> prefix as a single form way of
specifying a block. But Unicode 8.0 added properties whose names begin
with C<In>, and it's now clear that it's only luck that's so far
=head2 Unicode Regular Expression Support Level
The following list of Unicode supported features for regular expressions describes
-all features currently directly supported by core Perl. The references to "Level N"
-and the section numbers refer to the Unicode Technical Standard #18,
-"Unicode Regular Expressions", version 13, from August 2008.
-
-=over 4
-
-=item *
-
-Level 1 - Basic Unicode Support
-
- RL1.1 Hex Notation - done [1]
- RL1.2 Properties - done [2][3]
- RL1.2a Compatibility Properties - done [4]
- RL1.3 Subtraction and Intersection - experimental [5]
- RL1.4 Simple Word Boundaries - done [6]
- RL1.5 Simple Loose Matches - done [7]
- RL1.6 Line Boundaries - MISSING [8][9]
- RL1.7 Supplementary Code Points - done [10]
+all features currently directly supported by core Perl. The references
+to "Level I<N>" and the section numbers refer to
+L<UTS#18 "Unicode Regular Expressions"|http://www.unicode.org/reports/tr18>,
+version 13, November 2013.
+
+=head3 Level 1 - Basic Unicode Support
+
+ RL1.1 Hex Notation - Done [1]
+ RL1.2 Properties - Done [2]
+ RL1.2a Compatibility Properties - Done [3]
+ RL1.3 Subtraction and Intersection - Experimental [4]
+ RL1.4 Simple Word Boundaries - Done [5]
+ RL1.5 Simple Loose Matches - Done [6]
+ RL1.6 Line Boundaries - Partial [7]
+ RL1.7 Supplementary Code Points - Done [8]
=over 4
=item [1] C<\N{U+...}> and C<\x{...}>
-=item [2] C<\p{...}> C<\P{...}>
+=item [2]
+C<\p{...}> C<\P{...}>. This requirement is for a minimal list of
+properties. Perl supports these and all other Unicode character
+properties, as RL2.7 asks (see L</"Unicode Character Properties"> above).
-=item [3] supports not only minimal list, but all Unicode character
-properties (see Unicode Character Properties above)
+=item [3]
+Perl has C<\d> C<\D> C<\s> C<\S> C<\w> C<\W> C<\X> C<[:I<prop>:]>
+C<[:^I<prop>:]>, plus all the properties specified by
+L<http://www.unicode.org/reports/tr18/#Compatibility_Properties>. These
+are described above in L</Other Properties>.
-=item [4] C<\d> C<\D> C<\s> C<\S> C<\w> C<\W> C<\X> C<[:I<prop>:]>
-C<[:^I<prop>:]>
+=item [4]
-=item [5] The experimental feature starting in v5.18 C<"(?[...])"> accomplishes
+The experimental feature C<"(?[...])"> starting in v5.18 accomplishes
this.
See L<perlre/(?[ ])>. If you don't want to use an experimental
=over 4
=item *
-
-Regular expression look-ahead
+Regular expression lookahead
You can mimic class subtraction using lookahead.
For example, what UTS#18 might write as
=back
-=item [6] C<\b> C<\B>
+=item [5]
+C<\b> and C<\B> meet most, but not all, of the details of this
+requirement; C<\b{wb}> and C<\B{wb}> meet all of them, as well as the
+stricter RL2.3.
+
+=item [6]
-=item [7]
Note that Perl does Full case-folding in matching, not Simple:
For example C<U+1F88> is equivalent to C<U+1F00 U+03B9>, instead of just
letter, while the Simple case-folding would map it to a single
character.
-=item [8]
-Perl treats C<\n> as the start- and end-line delimiter. Unicode
-specifies more characters that should be so-interpreted.
+=item [7]
+
+This is considered only partially implemented. On the one hand, Perl
+has L<C<qrE<sol>\b{lb}E<sol>>|perlrebackslash/\b{lb}> and
+C<L<Unicode::LineBreak>>, both of which conform to
+L<UAX#14 "Unicode Line Breaking Algorithm"|http://www.unicode.org/reports/tr14>.
+The regular expression construct provides default behavior, while the
+heavier-weight module provides customizable line breaking.
+
+On the other hand, Perl treats only C<\n> as the start- and end-line
+delimiter, whereas Unicode specifies more characters that should be
+so interpreted.
These are:
empty line between C<\r> and C<\n>). For C<CRLF>, try the C<:crlf>
layer (see L<PerlIO>).
-=item [9] But C<L<Unicode::LineBreak>> is available.
-
-This module supplies line breaking conformant with
-L<UAX#14 "Unicode Line Breaking Algorithm"|http://www.unicode.org/reports/tr14>.
-
-=item [10]
+=item [8]
UTF-8/UTF-EBCDIC used in Perl allows not only C<U+10000> to
C<U+10FFFF> but also code points beyond C<U+10FFFF>.
=back
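Two of the notes above, class subtraction through lookahead and Full case folding under C</i>, can be sketched in a few lines. This is an illustration under stated assumptions rather than text from UTS#18: the particular subtraction ("Latin-script characters that are not uppercase") and the variable names are invented for the example.

```perl
use strict;
use warnings;

# Class subtraction via lookahead: "Latin-script characters that are
# not uppercase".  The negative lookahead rejects a character before
# the property that follows gets a chance to consume it.
my $latin_not_upper = qr/(?!\p{Uppercase})\p{Script=Latin}/;

my $lower_ok = "a" =~ /\A$latin_not_upper\z/ ? 1 : 0;
my $upper_ok = "A" =~ /\A$latin_not_upper\z/ ? 1 : 0;

# Full case folding: the single character U+1F88 folds to the
# two-character sequence U+1F00 U+03B9, so the two are equivalent
# under /i even though their lengths differ.
my $full_fold = "\x{1F88}" =~ /\A\x{1F00}\x{03B9}\z/i ? 1 : 0;

print "lowercase a accepted: $lower_ok\n";
print "uppercase A accepted: $upper_ok\n";
print "U+1F88 matches U+1F00 U+03B9 under /i: $full_fold\n";
```

The lookahead form works on any Perl that supports the properties involved and does not rely on the experimental C<(?[...])> construct.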
-=item *
+=head3 Level 2 - Extended Unicode Support
-Level 2 - Extended Unicode Support
+ RL2.1 Canonical Equivalents - Retracted [9]
+ by Unicode
+ RL2.2 Extended Grapheme Clusters - Partial [10]
+ RL2.3 Default Word Boundaries - Done [11]
+ RL2.4 Default Case Conversion - Done
+ RL2.5 Name Properties - Done
+ RL2.6 Wildcard Properties - Missing
+ RL2.7 Full Properties - Done
- RL2.1 Canonical Equivalents - MISSING [10][11]
- RL2.2 Default Grapheme Clusters - MISSING [12]
- RL2.3 Default Word Boundaries - DONE [14]
- RL2.4 Default Loose Matches - MISSING [15]
- RL2.5 Name Properties - DONE
- RL2.6 Wildcard Properties - MISSING
+=over 4
- [10] see UAX#15 "Unicode Normalization Forms"
- [11] have Unicode::Normalize but not integrated to regexes
- [12] have \X and \b{gcb} but we don't have a "Grapheme Cluster
- Mode"
- [14] see UAX#29, Word Boundaries
- [15] This is covered in Chapter 3.13 (in Unicode 6.0)
+=item [9]
+Unicode has rewritten this portion of UTS#18 to say that getting
+canonical equivalence (see UAX#15
+L<"Unicode Normalization Forms"|http://www.unicode.org/reports/tr15>)
+is basically to be done at the programmer level. Use NFD to write
+both your regular expressions and text to match them against (you
+can use L<Unicode::Normalize>).
-=item *
+=item [10]
+Perl has C<\X> and C<\b{gcb}> but doesn't have a "Grapheme Cluster Mode".
-Level 3 - Tailored Support
-
- RL3.1 Tailored Punctuation - MISSING
- RL3.2 Tailored Grapheme Clusters - MISSING [17][18]
- RL3.3 Tailored Word Boundaries - MISSING
- RL3.4 Tailored Loose Matches - MISSING
- RL3.5 Tailored Ranges - MISSING
- RL3.6 Context Matching - MISSING [19]
- RL3.7 Incremental Matches - MISSING
- ( RL3.8 Unicode Set Sharing )
- RL3.9 Possible Match Sets - MISSING
- RL3.10 Folded Matching - MISSING [20]
- RL3.11 Submatchers - MISSING
-
- [17] see UAX#10 "Unicode Collation Algorithms"
- [18] have Unicode::Collate but not integrated to regexes
- [19] have (?<=x) and (?=x), but look-aheads or look-behinds
- should see outside of the target substring
- [20] need insensitive matching for linguistic features other
- than case; for example, hiragana to katakana, wide and
- narrow, simplified Han to traditional Han (see UTR#30
- "Character Foldings")
+=item [11]
+See L<UAX#29 "Unicode Text Segmentation"|http://www.unicode.org/reports/tr29>.
+
+=back
+
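The advice in note [9], normalizing both the pattern and the target text to NFD, might look like the sketch below. It assumes the core L<Unicode::Normalize> module; the sample strings and variable names are made up for the illustration. The last line also shows C<\X> from note [10] treating a base letter plus its combining mark as one unit.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

my $text   = "caf\x{E9}";     # ends in precomposed U+00E9
my $needle = "cafe\x{301}";   # ends in "e" + U+0301 COMBINING ACUTE ACCENT

# Without normalization the two spellings do not match each other.
my $raw_match = $text =~ /\A\Q$needle\E\z/ ? 1 : 0;

# Normalize both sides to NFD before building and applying the pattern.
my ($nfd_text, $nfd_needle) = (NFD($text), NFD($needle));
my $nfd_match = $nfd_text =~ /\A\Q$nfd_needle\E\z/ ? 1 : 0;

# \X matches one extended grapheme cluster, so the decomposed text has
# five code points but only four clusters.
my @clusters = $nfd_text =~ /\X/g;

print "raw match: $raw_match\n";
print "NFD match: $nfd_match\n";
print "code points: ", length($nfd_text),
    ", clusters: ", scalar(@clusters), "\n";
```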
+=head3 Level 3 - Tailored Support
+
+ RL3.1 Tailored Punctuation - Missing
+ RL3.2 Tailored Grapheme Clusters - Missing [12]
+ RL3.3 Tailored Word Boundaries - Missing
+ RL3.4 Tailored Loose Matches - Retracted by Unicode
+ RL3.5 Tailored Ranges - Retracted by Unicode
+ RL3.6 Context Matching - Missing [13]
+ RL3.7 Incremental Matches - Missing
+ RL3.8 Unicode Set Sharing - Unicode is proposing
+ to retract this
+ RL3.9 Possible Match Sets - Missing
+ RL3.10 Folded Matching - Retracted by Unicode
+ RL3.11 Submatchers - Missing
+
+=over 4
+
+=item [12]
+Perl has L<Unicode::Collate>, but it isn't integrated with regular
+expressions. See
+L<UTS#10 "Unicode Collation Algorithm"|http://www.unicode.org/reports/tr10>.
+
+=item [13]
+Perl has C<(?<=x)> and C<(?=x)>, but lookaheads or lookbehinds should
+see outside of the target substring.
=back
and has extended that up to 13 bytes to encode code points up to what
can fit in a 64-bit word. However, Perl will warn if you output any of
these as being non-portable; and under strict UTF-8 input protocols,
-they are forbidden.
+they are forbidden. In addition, it is deprecated to use a code point
+larger than what a signed integer variable on your system can hold. On
+32-bit ASCII systems, this means C<0x7FFF_FFFF> is the legal maximum
+going forward (much higher on 64-bit systems).
=item *
those that have ASCII equivalents (like C<"A">, C<"0">, C<"%">, I<etc.>)
are the same in both EBCDIC and UTF-EBCDIC.)
-UTF-EBCDIC is used on EBCDIC platforms. The largest Unicode code points
-take 5 bytes to represent (instead of 4 in UTF-8), and Perl extends it
-to a maximum of 7 bytes to encode pode points up to what can fit in a
-32-bit word (instead of 13 bytes and a 64-bit word in UTF-8).
+UTF-EBCDIC is used on EBCDIC platforms. It generally requires more
+bytes to represent a given code point than UTF-8 does; the largest
+Unicode code points take 5 bytes to represent (instead of 4 in UTF-8),
+and, extended for 64-bit words, it uses 14 bytes instead of 13 bytes in
+UTF-8.
=item *
treated all C<\p{}> matches as failing, but all C<\P{}> matches as
succeeding.
-One problem with this is that it leads to unexpected, and confusting
+One problem with this is that it leads to unexpected and confusing
results in some cases:
chr(0x110000) =~ /\p{ASCII_Hex_Digit=True}/     # Failed on <= v5.18
example, consider the following program and its output:
$ perl -le'
- no feature 'unicode_strings';
+ no feature "unicode_strings";
$s1 = "\xC2";
$s2 = "\x{2660}";
for ($s1, $s2, $s1.$s2) {
Perl by default comes with the latest supported Unicode version built-in, but
the goal is to allow you to change to use any earlier one. In Perls
v5.20 and v5.22, however, the earliest usable version is Unicode 5.1.
-Perl v5.18 is able to handle all earlier versions.
+Perl v5.18 and v5.24 are able to handle all earlier versions.
Download the files in the desired version of Unicode from the Unicode web
site (L<http://www.unicode.org>). These should replace the existing files in