This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5
perlunicode: Update text about malformed UTF-8
[perl5.git] / pod / perlunicode.pod
Commit Line Data
393fec97
GS
1=head1 NAME
2
3perlunicode - Unicode support in Perl
4
5=head1 DESCRIPTION
6
a6a7eedc
KW
7If you haven't already, before reading this document, you should become
8familiar with both L<perlunitut> and L<perluniintro>.
9
10Unicode aims to B<UNI>-fy the en-B<CODE>-ings of all the world's
11character sets into a single Standard. For quite a few of the various
12coding standards that existed when Unicode was first created, converting
13from each to Unicode essentially meant adding a constant to each code
14point in the original standard, and converting back meant just
15subtracting that same constant. For ASCII and ISO-8859-1, the constant
16is 0. For ISO-8859-5 (Cyrillic), the constant is 864; for Hebrew
17(ISO-8859-8), it's 1264; Thai (ISO-8859-11), 3424; and so forth. This
18made it easy to do the conversions, and facilitated the adoption of
19Unicode.
20
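As a rough illustration of the constant-offset idea, the sketch below maps one
ISO-8859-5 byte to its Unicode character by adding 864. The helper name
C<cyrillic_byte_to_unicode> is hypothetical, and the offset is assumed to hold
only for the basic letter range checked in the code; real conversions are
better left to an C<:encoding(...)> layer or to L<Encode>.

```perl
use strict;
use warnings;

# Hypothetical helper: map one ISO-8859-5 byte in the basic Cyrillic
# letter range to the corresponding Unicode character by adding the
# constant mentioned in the text.
sub cyrillic_byte_to_unicode {
    my ($byte) = @_;
    my $ord = ord $byte;

    # Assumption: this sketch only handles the letters at 0xB0..0xEF.
    die sprintf("byte 0x%02X is outside the range handled by this sketch\n", $ord)
        unless $ord >= 0xB0 && $ord <= 0xEF;

    return chr($ord + 864);
}

# 0xB0 is CYRILLIC CAPITAL LETTER A in ISO-8859-5, U+0410 in Unicode.
printf "U+%04X\n", ord cyrillic_byte_to_unicode("\xB0");
```

Converting back is the same operation with the constant subtracted.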
21And it worked; nowadays, those legacy standards are rarely used. Most
22everyone uses Unicode.
23
24Unicode is a comprehensive standard. It specifies many things outside
25the scope of Perl, such as how to display sequences of characters. For
26a full discussion of all aspects of Unicode, see
27L<http://www.unicode.org>.
28
0a1f2d14 29=head2 Important Caveats
21bad921 30
a6a7eedc
KW
31Even though some of this section may not be understandable to you on
32first reading, we think it's important enough to highlight some of the
33gotchas before delving further, so here goes:
34
376d9008 35Unicode support is an extensive requirement. While Perl does not
c349b1b9
JH
36implement the Unicode standard or the accompanying technical reports
37from cover to cover, Perl does support many Unicode features.
21bad921 38
f57d8456
KW
39Also, the use of Unicode may present security issues that aren't
40obvious; see L</Security Implications of Unicode>.
9d1c51c1 41
13a2d996 42=over 4
21bad921 43
a9130ea9 44=item Safest if you C<use feature 'unicode_strings'>
42581d5d
KW
45
46In order to preserve backward compatibility, Perl does not turn
47on full internal Unicode support unless the pragma
b65e6125
KW
48L<S<C<use feature 'unicode_strings'>>|feature/The 'unicode_strings' feature>
49is specified. (This is automatically
50selected if you S<C<use 5.012>> or higher.) Failure to do this can
42581d5d
KW
51trigger surprises. See L</The "Unicode Bug"> below.
52
2269d15c
KW
53This pragma doesn't affect I/O. Nor does it change the internal
54representation of strings, only their interpretation. There are still
55several places where Unicode isn't fully supported, such as in
56filenames.
42581d5d 57
fae2c0fb 58=item Input and Output Layers
21bad921 59
a6a7eedc
KW
60Use the C<:encoding(...)> layer to read from and write to
61filehandles using the specified encoding. (See L<open>.)
c349b1b9 62
a6a7eedc
KW
63=item You should convert your non-ASCII, non-UTF-8 Perl scripts to be
64UTF-8.
21bad921 65
a6a7eedc 66See L<encoding>.
21bad921 67
a6a7eedc 68=item C<use utf8> still needed to enable L<UTF-8|/Unicode Encodings> in scripts
21bad921 69
a6a7eedc
KW
70If your Perl script is itself encoded in L<UTF-8|/Unicode Encodings>,
71the S<C<use utf8>> pragma must be explicitly included to enable
72recognition of that (in string or regular expression literals, or in
73identifier names). B<This is the only time when an explicit S<C<use
74utf8>> is needed.> (See L<utf8>.)
7aa207d6 75
27c74dfd
KW
76If a Perl script begins with the bytes that form the UTF-8 encoding of
77the Unicode BYTE ORDER MARK (C<BOM>, see L</Unicode Encodings>), those
78bytes are completely ignored.
79
80=item L<UTF-16|/Unicode Encodings> scripts autodetected
7aa207d6 81
fea12a3e 82If a Perl script begins with the Unicode C<BOM> (UTF-16LE,
27c74dfd 83UTF-16BE), or if the script looks like non-C<BOM>-marked
a6a7eedc 84UTF-16 of either endianness, Perl will correctly read in the script as
27c74dfd 85the appropriate Unicode encoding.
990e18f7 86
21bad921
GS
87=back
88
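The effect of the first caveat can be seen with a small experiment. This is
only a sketch: the two subroutine names are made up for illustration, and the
character E9 (LATIN SMALL LETTER E WITH ACUTE in Unicode) is an arbitrary
choice of a code point between 128 and 255.

```perl
use strict;
use warnings;

# Each sub returns [ordinal of uc($s), whether \w matched] for the
# one-character string "\xE9".

sub without_feature {
    no feature 'unicode_strings';    # make the default rules explicit
    my $s = "\xE9";
    return [ ord(uc $s), ($s =~ /\w/ ? 1 : 0) ];
}

sub with_feature {
    use feature 'unicode_strings';
    my $s = "\xE9";
    return [ ord(uc $s), ($s =~ /\w/ ? 1 : 0) ];
}

printf "without: uc gives 0x%02X, \\w matched: %d\n", @{ without_feature() };
printf "with:    uc gives 0x%02X, \\w matched: %d\n", @{ with_feature() };
```

Only inside the scope of the feature is the character uppercased to C9 and
treated as a word character.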
376d9008 89=head2 Byte and Character Semantics
393fec97 90
a6a7eedc
KW
91Before Unicode, most encodings used 8 bits (a single byte) to encode
92each character. Thus a character was a byte, and a byte was a
93character, and there could be only 256 or fewer possible characters.
94"Byte Semantics" in the title of this section refers to
95this behavior. There was no need to distinguish between "Byte" and
96"Character".
97
98Then along came Unicode, which has room for over a million characters
99(and Perl allows for even more). This means that a character may
100require more than a single byte to represent it, and so the two terms
101are no longer equivalent. What matter are the characters as whole
102entities, and not usually the bytes that comprise them. That's what the
103term "Character Semantics" in the title of this section refers to.
104
105Perl had to change internally to decouple "bytes" from "characters".
106It is important that you too change your ideas, if you haven't already,
107so that "byte" and "character" no longer mean the same thing in your
108mind.
109
110The basic building block of Perl strings has always been a "character".
111The changes basically come down to this: the implementation no longer
112assumes that a character is always just a single byte.
113
114There are various things to note:
393fec97
GS
115
116=over 4
117
118=item *
119
a6a7eedc
KW
120String handling functions, for the most part, continue to operate in
121terms of characters. C<length()>, for example, returns the number of
122characters in a string, just as before. But that number no longer is
123necessarily the same as the number of bytes in the string (there may be
124more bytes than characters). The other such functions include
125C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
126C<sort()>, C<sprintf()>, and C<write()>.
127
128The exceptions are:
129
130=over 4
131
132=item *
133
134the bit-oriented C<vec>
135
136E<nbsp>
137
138=item *
139
140the byte-oriented C<pack>/C<unpack> C<"C"> format
141
142However, the C<W> specifier does operate on whole characters, as does the
143C<U> specifier.
144
145=item *
146
147some operators that interact with the platform's operating system
148
149Operators dealing with filenames are examples.
150
151=item *
152
153when the functions are called from within the scope of the
154S<C<L<use bytes|bytes>>> pragma
155
156Likely, you should use this only for debugging anyway.
157
158=back
159
160=item *
161
376d9008 162Strings--including hash keys--and regular expression patterns may
b65e6125 163contain characters that have ordinal values larger than 255.
393fec97 164
2575c402
JW
165If you use a Unicode editor to edit your program, Unicode characters may
166occur directly within the literal strings in UTF-8 encoding, or UTF-16.
27c74dfd 167(The former requires a C<use utf8>, the latter may require a C<BOM>.)
3e4dbfed 168
a6a7eedc
KW
169L<perluniintro/Creating Unicode> gives other ways to place non-ASCII
170characters in your strings.
6f335b04 171
a6a7eedc 172=item *
fbb93542 173
a6a7eedc 174The C<chr()> and C<ord()> functions work on whole characters.
376d9008 175
393fec97
GS
176=item *
177
a6a7eedc
KW
178Regular expressions match whole characters. For example, C<"."> matches
179a whole character instead of only a single byte.
393fec97 180
393fec97
GS
181=item *
182
a6a7eedc
KW
183The C<tr///> operator translates whole characters. (Note that the
184C<tr///CU> functionality has been removed. For similar functionality to
185that, see C<pack('U0', ...)> and C<pack('C0', ...)>.)
393fec97 186
393fec97
GS
187=item *
188
a6a7eedc 189C<scalar reverse()> reverses by character rather than by byte.
393fec97 190
393fec97
GS
191=item *
192
a6a7eedc
KW
193The bit string operators, C<& | ^ ~> and (starting in v5.22)
194C<&. |. ^. ~.> can operate on characters that don't fit into a byte.
195However, the current behavior is likely to change. You should not use
196these operators on strings that are encoded in UTF-8. If you're not
197sure about the encoding of a string, downgrade it before using any of
198these operators; you can use
199L<C<utf8::downgrade()>|utf8/Utility functions>.
822502e5 200
a6a7eedc 201=back
822502e5 202
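A short sketch of the difference between counting characters and counting
bytes follows. The helper name C<utf8_byte_count> is hypothetical; it uses
C<encode> from the core L<Encode> module to obtain the bytes of one particular
encoding (UTF-8) so that they can be counted.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character: WHITE SMILING FACE, code point U+263A.
my $string = "\x{263A}";

# Hypothetical helper: number of bytes in the UTF-8 encoding of a string.
sub utf8_byte_count {
    my ($chars) = @_;
    return length encode('UTF-8', $chars);
}

printf "characters:  %d\n", length $string;
printf "UTF-8 bytes: %d\n", utf8_byte_count($string);
printf "ord:         0x%X\n", ord $string;

# reverse in scalar context works on whole characters, not bytes.
my $reversed = scalar reverse "a\x{263A}b";
printf "middle character after reverse: 0x%X\n", ord substr($reversed, 1, 1);
```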
a6a7eedc
KW
203The bottom line is that Perl has always practiced "Character Semantics",
204but with the advent of Unicode, that is now different from "Byte
205Semantics".
206
207=head2 ASCII Rules versus Unicode Rules
208
209Before Unicode, when a character was a byte was a character,
210Perl knew only about the 128 characters defined by ASCII, code points 0
b57dd509
KW
211through 127 (except for under L<S<C<use locale>>|perllocale>). That
212left the code
a6a7eedc
KW
213points 128 to 255 as unassigned, and available for whatever use a
214program might want. The only semantics they have is their ordinal
215numbers, and that they are members of none of the non-negative character
216classes. None are considered to match C<\w> for example, but all match
217C<\W>.
822502e5 218
a6a7eedc
KW
219Unicode, of course, assigns each of those code points a particular
220meaning (along with ones above 255). To preserve backward
221compatibility, Perl only uses the Unicode meanings when there is some
222indication that Unicode is what is intended; otherwise the non-ASCII
223code points remain treated as if they are unassigned.
224
225Here are the ways that Perl knows that a string should be treated as
226Unicode:
227
228=over
822502e5
TS
229
230=item *
231
a6a7eedc
KW
232Within the scope of S<C<use utf8>>
233
234If the whole program is Unicode (signified by using 8-bit B<U>nicode
235B<T>ransformation B<F>ormat), then all strings within it must be
236Unicode.
822502e5
TS
237
238=item *
239
a6a7eedc
KW
240Within the scope of
241L<S<C<use feature 'unicode_strings'>>|feature/The 'unicode_strings' feature>
242
243This pragma was created so you can explicitly tell Perl that operations
244executed within its scope are to use Unicode rules. More operations are
245affected with newer perls. See L</The "Unicode Bug">.
822502e5
TS
246
247=item *
248
a6a7eedc
KW
249Within the scope of S<C<use 5.012>> or higher
250
251This implicitly turns on S<C<use feature 'unicode_strings'>>.
822502e5
TS
252
253=item *
254
a6a7eedc
KW
255Within the scope of
256L<S<C<use locale 'not_characters'>>|perllocale/Unicode and UTF-8>,
257or L<S<C<use locale>>|perllocale> and the current
258locale is a UTF-8 locale.
822502e5 259
a6a7eedc
KW
260The former is defined to imply Unicode handling; and the latter
261indicates a Unicode locale, hence a Unicode interpretation of all
262strings within it.
822502e5
TS
263
264=item *
265
a6a7eedc
KW
266When the string contains a Unicode-only code point
267
268Perl has never accepted code points above 255 without them being
269Unicode, so their use implies Unicode for the whole string.
822502e5
TS
270
271=item *
272
a6a7eedc
KW
273When the string contains a Unicode named code point C<\N{...}>
274
275The C<\N{...}> construct explicitly refers to a Unicode code point,
276even if it is one that is also in ASCII. Therefore the string
277containing it must be Unicode.
822502e5
TS
278
279=item *
280
a6a7eedc
KW
281When the string has come from an external source marked as
282Unicode
283
284The L<C<-C>|perlrun/-C [numberE<sol>list]> command line option can
285specify that certain inputs to the program are Unicode, and the values
286of this can be read by your Perl code; see L<perlvar/"${^UNICODE}">.
287
288=item * When the string has been upgraded to UTF-8
289
290The function L<C<utf8::utf8_upgrade()>|utf8/Utility functions>
291can be explicitly used to permanently (unless a subsequent
292C<utf8::utf8_downgrade()> is called) cause a string to be treated as
293Unicode.
294
295=item * There are additional methods for regular expression patterns
296
f6cf4627
KW
297A pattern that is compiled with the C<< /u >> or C<< /a >> modifiers is
298treated as Unicode (though there are some restrictions with C<< /a >>).
299Under the C<< /d >> and C<< /l >> modifiers, there are several other
300indications for Unicode; see L<perlre/Character set modifiers>.
822502e5
TS
301
302=back
303
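Two of the triggers listed above can be observed directly. The sketch below
deliberately turns C<unicode_strings> off so that the default rules apply, and
it uses the Perl-level function C<utf8::upgrade()> documented in L<utf8>. The
subroutine names are made up for illustration.

```perl
use strict;
use warnings;

# Run without 'unicode_strings' so that the default rules apply.
no feature 'unicode_strings';

# Reports whether \w matches the character E9 before and after the
# string holding it has been upgraded to UTF-8 internally.
sub word_match_before_and_after_upgrade {
    my $s = "\xE9";
    my $before = $s =~ /\w/ ? 1 : 0;
    utf8::upgrade($s);    # same characters, different internal form
    my $after = $s =~ /\w/ ? 1 : 0;
    return ($before, $after);
}

# A code point above 255 makes the whole string Unicode, so the
# leading E9 is then treated as a word character too.
sub word_match_with_wide_neighbor {
    my $s = "\xE9\x{100}";
    return $s =~ /^\w/ ? 1 : 0;
}

printf "before upgrade: %d, after upgrade: %d\n",
    word_match_before_and_after_upgrade();
printf "next to U+0100: %d\n", word_match_with_wide_neighbor();
```

The characters in the string never change; only the rules used to interpret
them do. See L</The "Unicode Bug"> for why this matters.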
a6a7eedc
KW
304Note that all of the above are overridden within the scope of
305C<L<use bytes|bytes>>; but you should be using this pragma only for
306debugging.
307
308Note also that some interactions with the platform's operating system
309never use Unicode rules.
310
311When Unicode rules are in effect:
312
822502e5
TS
313=over 4
314
315=item *
316
a6a7eedc
KW
317Case translation operators use the Unicode case translation tables.
318
319Note that C<uc()>, or C<\U> in interpolated strings, translates to
320uppercase, while C<ucfirst>, or C<\u> in interpolated strings,
321translates to titlecase in languages that make the distinction (which is
322equivalent to uppercase in languages without the distinction).
323
324There is a CPAN module, C<L<Unicode::Casing>>, which allows you to
325define your own mappings to be used in C<lc()>, C<lcfirst()>, C<uc()>,
326C<ucfirst()>, and C<fc> (or their double-quoted string inlined versions
327such as C<\U>). (Prior to Perl 5.16, this functionality was partially
328provided in the Perl core, but suffered from a number of insurmountable
329drawbacks, so the CPAN module was written instead.)
330
331=item *
332
333Character classes in regular expressions match based on the character
334properties specified in the Unicode properties database.
335
336C<\w> can be used to match a Japanese ideograph, for instance; and
337C<[[:digit:]]> a Bengali number.
338
339=item *
340
341Named Unicode properties, scripts, and block ranges may be used (like
342bracketed character classes) by using the C<\p{}> "matches property"
343construct and the C<\P{}> negation, "doesn't match property".
344
345See L</"Unicode Character Properties"> for more details.
346
347You can define your own character properties and use them
348in the regular expression with the C<\p{}> or C<\P{}> construct.
349See L</"User-Defined Character Properties"> for more details.
822502e5
TS
350
351=back
352
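The following sketch exercises the first two points. The code points were
picked for illustration only: U+01C6 (LATIN SMALL LETTER DZ WITH CARON) has
distinct uppercase and titlecase forms, U+65E5 is a CJK ideograph, and U+09E7
is BENGALI DIGIT ONE. All are above 255, so Unicode rules apply regardless of
which pragmas are in effect. The subroutine names are hypothetical.

```perl
use strict;
use warnings;

# uc() maps to uppercase (U+01C4); ucfirst() maps to titlecase (U+01C5).
sub upper_and_title_of_dz {
    my $dz = "\x{1C6}";
    return (ord(uc $dz), ord(ucfirst $dz));
}

# Character classes follow the Unicode properties database.
sub class_matches {
    my $ideograph_is_word = "\x{65E5}" =~ /^\w$/          ? 1 : 0;
    my $bengali_is_digit  = "\x{9E7}"  =~ /^[[:digit:]]$/ ? 1 : 0;
    return ($ideograph_is_word, $bengali_is_digit);
}

printf "uc: U+%04X, ucfirst: U+%04X\n", upper_and_title_of_dz();
printf "ideograph \\w: %d, Bengali [[:digit:]]: %d\n", class_matches();
```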
a6a7eedc
KW
353=head2 Extended Grapheme Clusters (Logical characters)
354
355Consider a character, say C<H>. It could appear with various marks around it,
356such as an acute accent, or a circumflex, or various hooks, circles, arrows,
357I<etc.>, above, below, to one side or the other, I<etc>. There are many
358possibilities among the world's languages. The number of combinations is
359astronomical, and if there were a character for each combination, it would
360soon exhaust Unicode's more than a million possible characters. So Unicode
361took a different approach: there is a character for the base C<H>, and a
362character for each of the possible marks, and these can be variously combined
363to get a final logical character. So a logical character--what appears to be a
364single character--can be a sequence of more than one individual character.
365The Unicode standard calls these "extended grapheme clusters" (which
366is an improved version of the no-longer much used "grapheme cluster");
367Perl furnishes the C<\X> regular expression construct to match such
368sequences in their entirety.
369
370But Unicode's intent is to unify the existing character set standards and
371practices, and several pre-existing standards have single characters that
372mean the same thing as some of these combinations, like ISO-8859-1,
373which has quite a few of them. For example, C<"LATIN CAPITAL LETTER E
374WITH ACUTE"> was already in this standard when Unicode came along.
375Unicode therefore added it to its repertoire as that single character.
376But this character is considered by Unicode to be equivalent to the
377sequence consisting of the character C<"LATIN CAPITAL LETTER E">
378followed by the character C<"COMBINING ACUTE ACCENT">.
379
380C<"LATIN CAPITAL LETTER E WITH ACUTE"> is called a "pre-composed"
381character, and its equivalence with the "E" and the "COMBINING ACUTE ACCENT"
382sequence is called canonical equivalence. All pre-composed characters
383are said to have a decomposition (into the equivalent sequence), and the
384decomposition type is also called canonical. A string may be composed
385as much as possible of precomposed characters, or it may be composed
386entirely of decomposed characters. Unicode calls these, respectively,
387"Normalization Form Composed" (NFC) and "Normalization Form Decomposed" (NFD).
388The C<L<Unicode::Normalize>> module contains functions that convert
389between the two. A string may also have both composed characters and
390decomposed characters; this module can be used to make it all one or the
391other.
392
393You may be presented with strings in any of these equivalent forms.
394There is currently nothing in Perl 5 that ignores the differences. So
395you'll have to handle them specially. The usual advice is to convert your
396inputs to C<NFD> before processing further.
397
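A minimal sketch of that advice, using the C<NFC> and C<NFD> functions
exported by the core L<Unicode::Normalize> module. The helper names
C<canonically_equal> and C<cluster_count> are hypothetical.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $precomposed = "\x{C9}";      # LATIN CAPITAL LETTER E WITH ACUTE
my $decomposed  = "E\x{301}";    # E followed by COMBINING ACUTE ACCENT

# Compare two strings after bringing both to the same normalization form.
sub canonically_equal {
    my ($left, $right) = @_;
    return NFD($left) eq NFD($right) ? 1 : 0;
}

# Count extended grapheme clusters (via \X) rather than code points.
sub cluster_count {
    my ($string) = @_;
    my $count = () = $string =~ /\X/g;
    return $count;
}

printf "plain eq:            %d\n", ($precomposed eq $decomposed ? 1 : 0);
printf "canonically equal:   %d\n", canonically_equal($precomposed, $decomposed);
printf "length / clusters:   %d / %d\n",
    length $decomposed, cluster_count($decomposed);
```

A plain C<eq> sees two different strings; after normalization they compare
equal, and C<\X> treats the decomposed pair as one logical character.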
398For more detailed information, see L<http://unicode.org/reports/tr15/>.
399
822502e5
TS
400=head2 Unicode Character Properties
401
ee88f7b6 402(The only time that Perl considers a sequence of individual code
9d1c51c1
KW
403points as a single logical character is in the C<\X> construct, already
404mentioned above. Therefore "character" in this discussion means a single
ee88f7b6
KW
405Unicode code point.)
406
407Very nearly all Unicode character properties are accessible through
408regular expressions by using the C<\p{}> "matches property" construct
409and the C<\P{}> "doesn't match property" for its negation.
51f494cc 410
9d1c51c1 411For instance, C<\p{Uppercase}> matches any single character with the Unicode
a9130ea9
KW
412C<"Uppercase"> property, while C<\p{L}> matches any character with a
413C<General_Category> of C<"L"> (letter) property (see
414L</General_Category> below). Braces are not
9d1c51c1 415required for single letter property names, so C<\p{L}> is equivalent to C<\pL>.
51f494cc 416
9d1c51c1 417More formally, C<\p{Uppercase}> matches any single character whose Unicode
a9130ea9
KW
418C<Uppercase> property value is C<True>, and C<\P{Uppercase}> matches any character
419whose C<Uppercase> property value is C<False>, and they could have been written as
9d1c51c1 420C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively.
51f494cc 421
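The forms described so far can be tried out as follows. This is just a sketch;
the subroutine name C<describe> and its hash keys are invented for the
example.

```perl
use strict;
use warnings;

# Reports how a single character fares against a few equivalent and
# opposite ways of writing the Uppercase and Letter properties.
sub describe {
    my ($char) = @_;
    return {
        uppercase     => ($char =~ /\p{Uppercase}/      ? 1 : 0),
        not_uppercase => ($char =~ /\P{Uppercase}/      ? 1 : 0),
        explicit_true => ($char =~ /\p{Uppercase=True}/ ? 1 : 0),
        letter_braces => ($char =~ /\p{L}/              ? 1 : 0),
        letter_short  => ($char =~ /\pL/                ? 1 : 0),
    };
}

for my $char ('A', 'a', '5') {
    my $result = describe($char);
    printf "%s: %s\n", $char,
        join ', ', map { "$_=$result->{$_}" } sort keys %$result;
}
```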
b19eb496 422This formality is needed when properties are not binary; that is, if they can
a9130ea9
KW
423take on more values than just C<True> and C<False>. For example, the
424C<Bidi_Class> property (see L</"Bidirectional Character Types"> below),
425can take on several different
426values, such as C<Left>, C<Right>, C<Whitespace>, and others. To match these, one needs
427to specify both the property name (C<Bidi_Class>), AND the value being
5bff2035 428matched against
b65e6125 429(C<Left>, C<Right>, I<etc.>). This is done, as in the examples above, by having the
9f815e24 430two components separated by an equal sign (or interchangeably, a colon), like
51f494cc
KW
431C<\p{Bidi_Class: Left}>.
432
433All Unicode-defined character properties may be written in these compound forms
a9130ea9 434of C<\p{I<property>=I<value>}> or C<\p{I<property>:I<value>}>, but Perl provides some
51f494cc
KW
435additional properties that are written only in the single form, as well as
436single-form short-cuts for all binary properties and certain others described
437below, in which you may omit the property name and the equals or colon
438separator.
439
440Most Unicode character properties have at least two synonyms (or aliases if you
b19eb496 441prefer): a short one that is easier to type and a longer one that is more
a9130ea9
KW
442descriptive and hence easier to understand. Thus the C<"L"> and
443C<"Letter"> properties above are equivalent and can be used
444interchangeably. Likewise, C<"Upper"> is a synonym for C<"Uppercase">,
445and we could have written C<\p{Uppercase}> equivalently as C<\p{Upper}>.
446Also, there are typically various synonyms for the values the property
447can be. For binary properties, C<"True"> has 3 synonyms: C<"T">,
448C<"Yes">, and C<"Y">; and C<"False"> has correspondingly C<"F">,
449C<"No">, and C<"N">. But be careful. A short form of a value for one
450property may not mean the same thing as the same short form for another.
451Thus, for the C<L</General_Category>> property, C<"L"> means
452C<"Letter">, but for the L<C<Bidi_Class>|/Bidirectional Character Types>
453property, C<"L"> means C<"Left">. A complete list of properties and
454synonyms is in L<perluniprops>.
51f494cc 455
b19eb496 456Upper/lower case differences in property names and values are irrelevant;
51f494cc
KW
457thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>.
458Similarly, you can add or subtract underscores anywhere in the middle of a
459word, so that these are also equivalent to C<\p{U_p_p_e_r}>. And white space
460is irrelevant adjacent to non-word characters, such as the braces and the equals
b19eb496
TC
461or colon separators, so C<\p{ Upper }> and C<\p{ Upper_case : Y }> are
462equivalent to these as well. In fact, white space and even
463hyphens can usually be added or deleted anywhere. So even C<\p{ Up-per case = Yes}> is
51f494cc 464equivalent. All this is called "loose-matching" by Unicode. The few places
b19eb496 465where stricter matching is used are in the middle of numbers, and in the Perl
51f494cc 466extension properties that begin or end with an underscore. Stricter matching
b19eb496 467cares about white space (except adjacent to non-word characters),
51f494cc 468hyphens, and non-interior underscores.
4193bef7 469
376d9008 470You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
a9130ea9 471(C<^>) between the first brace and the property name: C<\p{^Tamil}> is
eb0cc9e3 472equal to C<\P{Tamil}>.
4193bef7 473
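Loose matching and caret negation can be checked with a sketch like the one
below. The spellings are a subset of those discussed above, the subroutine
names are hypothetical, and U+0B85 (TAMIL LETTER A) is just a convenient Tamil
character.

```perl
use strict;
use warnings;

# Spellings that the text above describes as equivalent.
sub uppercase_spellings {
    return (
        qr/\p{Uppercase}/,
        qr/\p{Upper}/,
        qr/\p{upper}/,
        qr/\p{UpPeR}/,
        qr/\p{ Upper }/,
        qr/\p{Uppercase=True}/,
        qr/\p{Uppercase: Y}/,
    );
}

# True if every spelling matches the given character.
sub matches_all_spellings {
    my ($char) = @_;
    return !grep { $char !~ $_ } uppercase_spellings();
}

# \p{^Tamil} and \P{Tamil} are two ways to write the same negation.
sub negations_agree {
    my ($char) = @_;
    my $caret   = $char =~ /\p{^Tamil}/ ? 1 : 0;
    my $capital = $char =~ /\P{Tamil}/  ? 1 : 0;
    return $caret == $capital ? 1 : 0;
}

printf "all spellings match 'A': %d\n", matches_all_spellings('A') ? 1 : 0;
printf "negations agree on 'A' and on U+0B85: %d, %d\n",
    negations_agree('A'), negations_agree("\x{B85}");
```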
56ca34ca
KW
474Almost all properties are immune to case-insensitive matching. That is,
475adding a C</i> regular expression modifier does not change what they
476match. There are two sets that are affected.
477The first set is
478C<Uppercase_Letter>,
479C<Lowercase_Letter>,
480and C<Titlecase_Letter>,
481all of which match C<Cased_Letter> under C</i> matching.
482And the second set is
483C<Uppercase>,
484C<Lowercase>,
485and C<Titlecase>,
486all of which match C<Cased> under C</i> matching.
487This set also includes its subsets C<PosixUpper> and C<PosixLower> both
a9130ea9 488of which under C</i> match C<PosixAlpha>.
56ca34ca 489(The difference between these sets is that some things, such as Roman
b65e6125
KW
490numerals, come in both upper and lower case so they are C<Cased>, but
491aren't considered letters, so they aren't C<Cased_Letter>'s.)
56ca34ca 492
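The two sets can be told apart with a character that is cased but is not a
letter. The sketch below uses U+2170 (SMALL ROMAN NUMERAL ONE) for that
purpose; the subroutine name and hash keys are invented for the example.

```perl
use strict;
use warnings;

# Matches one character against \p{Lu} and \p{Uppercase}, with and
# without the /i modifier.
sub matches_under_i {
    my ($char) = @_;
    return {
        Lu_plain    => ($char =~ /\p{Lu}/         ? 1 : 0),
        Lu_fold     => ($char =~ /\p{Lu}/i        ? 1 : 0),
        Upper_plain => ($char =~ /\p{Uppercase}/  ? 1 : 0),
        Upper_fold  => ($char =~ /\p{Uppercase}/i ? 1 : 0),
    };
}

for my $case ([ 'a', 'a' ], [ 'U+2170', "\x{2170}" ]) {
    my ($label, $char) = @$case;
    my $result = matches_under_i($char);
    printf "%-7s %s\n", $label,
        join ', ', map { "$_=$result->{$_}" } sort keys %$result;
}
```

The lowercase letter matches both properties once C</i> is added, while the
Roman numeral matches only C<\p{Uppercase}> under C</i>, because it is
C<Cased> but not a C<Cased_Letter>.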
2d88a86a
KW
493See L</Beyond Unicode code points> for special considerations when
494matching Unicode properties against non-Unicode code points.
94b42e47 495
51f494cc 496=head3 B<General_Category>
14bb0a9a 497
51f494cc
KW
498Every Unicode character is assigned a general category, which is the "most
499usual categorization of a character" (from
500L<http://www.unicode.org/reports/tr44>).
822502e5 501
9f815e24 502The compound way of writing these is like C<\p{General_Category=Number}>
b65e6125 503(short: C<\p{gc:n}>). But Perl furnishes shortcuts in which everything up
51f494cc
KW
504through the equal or colon separator is omitted. So you can instead just write
505C<\pN>.
822502e5 506
a9130ea9
KW
507Here are the short and long forms of the values the C<General_Category> property
508can have:
393fec97 509
d73e5302
JH
510 Short Long
511
512 L Letter
51f494cc
KW
513 LC, L& Cased_Letter (that is: [\p{Ll}\p{Lu}\p{Lt}])
514 Lu Uppercase_Letter
515 Ll Lowercase_Letter
516 Lt Titlecase_Letter
517 Lm Modifier_Letter
518 Lo Other_Letter
d73e5302
JH
519
520 M Mark
51f494cc
KW
521 Mn Nonspacing_Mark
522 Mc Spacing_Mark
523 Me Enclosing_Mark
d73e5302
JH
524
525 N Number
51f494cc
KW
526 Nd Decimal_Number (also Digit)
527 Nl Letter_Number
528 No Other_Number
529
530 P Punctuation (also Punct)
531 Pc Connector_Punctuation
532 Pd Dash_Punctuation
533 Ps Open_Punctuation
534 Pe Close_Punctuation
535 Pi Initial_Punctuation
d73e5302 536 (may behave like Ps or Pe depending on usage)
51f494cc 537 Pf Final_Punctuation
d73e5302 538 (may behave like Ps or Pe depending on usage)
51f494cc 539 Po Other_Punctuation
d73e5302
JH
540
541 S Symbol
51f494cc
KW
542 Sm Math_Symbol
543 Sc Currency_Symbol
544 Sk Modifier_Symbol
545 So Other_Symbol
d73e5302
JH
546
547 Z Separator
51f494cc
KW
548 Zs Space_Separator
549 Zl Line_Separator
550 Zp Paragraph_Separator
d73e5302
JH
551
552 C Other
d88362ca 553 Cc Control (also Cntrl)
e150c829 554 Cf Format
6d4f9cf2 555 Cs Surrogate
51f494cc 556 Co Private_Use
e150c829 557 Cn Unassigned
1ac13f9a 558
376d9008 559Single-letter properties match all characters in any of the
3e4dbfed 560two-letter sub-properties starting with the same letter.
b19eb496 561C<LC> and C<L&> are special: both are aliases for the set consisting of everything matched by C<Ll>, C<Lu>, and C<Lt>.
32293815 562
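As a sketch of how the single-letter categories relate to their two-letter
sub-properties, the checks below use a digit, U+2160 (ROMAN NUMERAL ONE, a
C<Letter_Number>), a lowercase letter, and U+65E5 (a CJK ideograph, an
C<Other_Letter>). The subroutine name and hash keys are hypothetical.

```perl
use strict;
use warnings;

sub category_checks {
    return {
        # Four ways to ask whether "7" is a Number of some kind.
        digit_single   => ("7" =~ /^\pN$/                         ? 1 : 0),
        digit_compound => ("7" =~ /^\p{General_Category=Number}$/ ? 1 : 0),
        digit_short    => ("7" =~ /^\p{gc:n}$/                    ? 1 : 0),
        digit_is_Nd    => ("7" =~ /^\p{Nd}$/                      ? 1 : 0),

        # N covers Nl as well, but Nd does not.
        roman_is_N     => ("\x{2160}" =~ /^\pN$/     ? 1 : 0),
        roman_is_Nl    => ("\x{2160}" =~ /^\p{Nl}$/  ? 1 : 0),
        roman_is_Nd    => ("\x{2160}" =~ /^\p{Nd}$/  ? 1 : 0),

        # LC and L& cover only Ll, Lu, and Lt.
        lower_is_LC    => ("a" =~ /^\p{LC}$/         ? 1 : 0),
        lower_is_Land  => ("a" =~ /^\p{L&}$/         ? 1 : 0),
        cjk_is_L       => ("\x{65E5}" =~ /^\pL$/     ? 1 : 0),
        cjk_is_LC      => ("\x{65E5}" =~ /^\p{LC}$/  ? 1 : 0),
    };
}

my $checks = category_checks();
printf "%-15s %d\n", $_, $checks->{$_} for sort keys %$checks;
```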
51f494cc 563=head3 B<Bidirectional Character Types>
822502e5 564
b19eb496 565Because scripts differ in their directionality (Hebrew and Arabic are
a9130ea9 566written right to left, for example) Unicode supplies a C<Bidi_Class> property.
1850f57f 567Some of the values this property can have are:
32293815 568
88af3b93 569 Value Meaning
92e830a9 570
12ac2576
JP
571 L Left-to-Right
572 LRE Left-to-Right Embedding
573 LRO Left-to-Right Override
574 R Right-to-Left
51f494cc 575 AL Arabic Letter
12ac2576
JP
576 RLE Right-to-Left Embedding
577 RLO Right-to-Left Override
578 PDF Pop Directional Format
579 EN European Number
51f494cc
KW
580 ES European Separator
581 ET European Terminator
12ac2576 582 AN Arabic Number
51f494cc 583 CS Common Separator
12ac2576
JP
584 NSM Non-Spacing Mark
585 BN Boundary Neutral
586 B Paragraph Separator
587 S Segment Separator
588 WS Whitespace
589 ON Other Neutrals
590
51f494cc
KW
591This property is always written in the compound form.
592For example, C<\p{Bidi_Class:R}> matches characters that are normally
1850f57f 593written right to left. Unlike the
a9130ea9 594C<L</General_Category>> property, this
1850f57f
KW
595property can have more values added in a future Unicode release. Those
596listed above comprised the complete set for many Unicode releases, but
597others were added in Unicode 6.3; you can always find what the
20ada7da 598current ones are in L<perluniprops>. And
1850f57f 599L<http://www.unicode.org/reports/tr9/> describes how to use them.
eb0cc9e3 600
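A small sketch of the compound form in use follows. It checks only a handful
of the values from the table, written with their short names, and the
subroutine name C<bidi_class_of> is made up for the example.

```perl
use strict;
use warnings;

# Returns the short Bidi_Class value for a character, looking at only a
# few of the classes listed above; anything else is reported as 'other'.
sub bidi_class_of {
    my ($char) = @_;
    my @checks = (
        [ L  => qr/\p{Bidi_Class=L}/  ],
        [ R  => qr/\p{Bidi_Class:R}/  ],
        [ AL => qr/\p{Bidi_Class=AL}/ ],
        [ EN => qr/\p{Bidi_Class=EN}/ ],
        [ AN => qr/\p{Bidi_Class=AN}/ ],
        [ WS => qr/\p{Bidi_Class=WS}/ ],
    );
    for my $check (@checks) {
        my ($name, $pattern) = @$check;
        return $name if $char =~ $pattern;
    }
    return 'other';
}

printf "%-12s %s\n", $_->[0], bidi_class_of($_->[1])
    for [ 'A', 'A' ], [ 'Hebrew alef', "\x{5D0}" ],
        [ 'Arabic alef', "\x{627}" ], [ 'five', '5' ], [ 'space', ' ' ];
```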
51f494cc
KW
601=head3 B<Scripts>
602
b19eb496 603The world's languages are written in many different scripts. This sentence
e1b711da 604(unless you're reading it in translation) is written in Latin, while Russian is
c69ca1d4 605written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in
e1b711da 606Hiragana or Katakana. There are many more.
51f494cc 607
48791bf1
KW
608The Unicode C<Script> and C<Script_Extensions> properties give what
609script a given character is in. The C<Script_Extensions> property is an
610improved version of C<Script>, as demonstrated below. Either property
611can be specified with the compound form like
82aed44a
KW
612C<\p{Script=Hebrew}> (short: C<\p{sc=hebr}>), or
613C<\p{Script_Extensions=Javanese}> (short: C<\p{scx=java}>).
614In addition, Perl furnishes shortcuts for all
48791bf1
KW
615C<Script_Extensions> property names. You can omit everything up through
616the equals (or colon), and simply write C<\p{Latin}> or C<\P{Cyrillic}>.
617(This is not true for C<Script>, which is required to be
618written in the compound form. Prior to Perl v5.26, the single form
619returned the plain old C<Script> version, but was changed because
620C<Script_Extensions> gives better results.)
82aed44a
KW
621
622The difference between these two properties involves characters that are
623used in multiple scripts. For example the digits '0' through '9' are
624used in many parts of the world. These are placed in a script named
625C<Common>. Other characters are used in just a few scripts. For
a9130ea9 626example, the C<"KATAKANA-HIRAGANA DOUBLE HYPHEN"> is used in both Japanese
82aed44a
KW
627scripts, Katakana and Hiragana, but nowhere else. The C<Script>
628property places all characters that are used in multiple scripts in the
629C<Common> script, while the C<Script_Extensions> property places those
630that are used in only a few scripts into each of those scripts; while
631still using C<Common> for those used in many scripts. Thus both these
632match:
633
634 "0" =~ /\p{sc=Common}/ # Matches
635 "0" =~ /\p{scx=Common}/ # Matches
636
637and only the first of these matches:
638
639 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Common}/ # Matches
640 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Common}/ # No match
641
642And only the last two of these match:
643
644 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Hiragana}/ # No match
645 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Katakana}/ # No match
646 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Hiragana}/ # Matches
647 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Katakana}/ # Matches
648
649C<Script_Extensions> is thus an improved C<Script>, in which there are
650fewer characters in the C<Common> script, and correspondingly more in
651other scripts. It is new in Unicode version 6.0, and its data are likely
652to change significantly in later releases, as things get sorted out.
b65e6125 653New code should probably be using C<Script_Extensions> and not plain
48791bf1
KW
654C<Script>. If you compile perl with a Unicode release that doesn't have
655C<Script_Extensions>, the single form Perl extensions will instead refer
656to the plain C<Script> property. If you compile with a version of
657Unicode that doesn't have the C<Script> property, these extensions will
658not be defined at all.
82aed44a
KW
659
660(Actually, besides C<Common>, the C<Inherited> script contains
661characters that are used in multiple scripts. These are modifier
b65e6125 662characters which inherit the script value
82aed44a
KW
663of the controlling character. Some of these are used in many scripts,
664and so go into C<Inherited> in both C<Script> and C<Script_Extensions>.
665Others are used in just a few scripts, so are in C<Inherited> in
666C<Script>, but not in C<Script_Extensions>.)
667
668It is worth stressing that there are several different sets of digits in
669Unicode that are equivalent to 0-9 and are matchable by C<\d> in a
670regular expression. If they are used in a single language only, they
48791bf1 671are in that language's C<Script> and C<Script_Extensions>. If they are
82aed44a
KW
672used in more than one script, they will be in C<sc=Common>, but only
673if they are used in many scripts should they be in C<scx=Common>.
51f494cc 674
48791bf1
KW
675The explanation above has omitted some detail; refer to UAX#24 "Unicode
676Script Property": L<http://www.unicode.org/reports/tr24>.
677
51f494cc
KW
678A complete list of scripts and their shortcuts is in L<perluniprops>.
679
a9130ea9 680=head3 B<Use of the C<"Is"> Prefix>
822502e5 681
b65e6125
KW
682For backward compatibility (with Perl 5.6), all properties writable
683without using the compound form mentioned
51f494cc
KW
684so far may have C<Is> or C<Is_> prepended to their name, so C<\P{Is_Lu}>, for
685example, is equal to C<\P{Lu}>, and C<\p{IsScript:Arabic}> is equal to
686C<\p{Arabic}>.
eb0cc9e3 687
51f494cc 688=head3 B<Blocks>
2796c109 689
1bfb14c4
JH
690In addition to B<scripts>, Unicode also defines B<blocks> of
691characters. The difference between scripts and blocks is that the
692concept of scripts is closer to natural languages, while the concept
51f494cc 693of blocks is more of an artificial grouping based on groups of Unicode
a9130ea9 694characters with consecutive ordinal values. For example, the C<"Basic Latin">
b65e6125 695block is all the characters whose ordinals are between 0 and 127, inclusive; in
a9130ea9
KW
696other words, the ASCII characters. The C<"Latin"> script contains some letters
697from this as well as several other blocks, like C<"Latin-1 Supplement">,
b65e6125 698C<"Latin Extended-A">, I<etc.>, but it does not contain all the characters from
7be67b37 699those blocks. It does not, for example, contain the digits 0-9, because
82aed44a
KW
700those digits are shared across many scripts, and hence are in the
701C<Common> script.
51f494cc
KW
702
703For more about scripts versus blocks, see UAX#24 "Unicode Script Property":
704L<http://www.unicode.org/reports/tr24>
705
48791bf1 706The C<Script_Extensions> or C<Script> properties are likely to be the
82aed44a 707ones you want to use when processing
a9130ea9 708natural language; the C<Block> property may occasionally be useful in working
b19eb496 709with the nuts and bolts of Unicode.
51f494cc
KW
710
711Block names are matched in the compound form, like C<\p{Block: Arrows}> or
b19eb496 712C<\p{Blk=Hebrew}>. Unlike most other properties, only a few block names have a
6b5cf123
KW
713Unicode-defined short name.
714
715Perl also defines single form synonyms for the block property in cases
716where these do not conflict with something else. But don't use any of
717these, because they are unstable. Since these are Perl extensions, they
718are subordinate to official Unicode property names; Unicode doesn't know
719nor care about Perl's extensions. It may happen that a name that
720currently means the Perl extension will later be changed without warning
721to mean a different Unicode property in a future version of the perl
722interpreter that uses a later Unicode release, and your code would no
723longer work. The extensions are mentioned here for completeness: Take
724the block name and prefix it with one of: C<In> (for example
725C<\p{Blk=Arrows}> can currently be written as C<\p{In_Arrows}>); or
726sometimes C<Is> (like C<\p{Is_Arrows}>); or sometimes no prefix at all
48791bf1 727(C<\p{Arrows}>). As of this writing (Unicode 9.0) there are no
6b5cf123
KW
728conflicts with using the C<In_> prefix, but there are plenty with the
729other two forms. For example, C<\p{Is_Hebrew}> and C<\p{Hebrew}> mean
48791bf1
KW
730C<\p{Script_Extensions=Hebrew}> which is NOT the same thing as
731C<\p{Blk=Hebrew}>. Our
6b5cf123
KW
732advice used to be to use the C<In_> prefix as a single form way of
733specifying a block. But Unicode 8.0 added properties whose names begin
734with C<In>, and it's now clear that it's only luck that's so far
735prevented a conflict. Using C<In> is only marginally less typing than
736C<Blk:>, and the latter's meaning is clearer anyway, and guaranteed to
737never conflict. So don't take chances. Use C<\p{Blk=foo}> for new
code. And be sure that a block is really what you want. In
most cases scripts are what you want instead.
740
741A complete list of blocks is in L<perluniprops>.
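
The difference between blocks and scripts can be sketched with a small, hypothetical helper (C<classify> is not part of Perl; it is defined here only for illustration). It assumes a Perl recent enough to know the C<Script> and C<Block> properties in their compound forms.

```perl
use strict;
use warnings;

# Hypothetical helper: report block and script membership for one character.
sub classify {
    my ($char) = @_;
    return {
        basic_latin_block => ($char =~ /\p{Blk=Basic_Latin}/ ? 1 : 0),
        latin_script      => ($char =~ /\p{Script=Latin}/    ? 1 : 0),
        common_script     => ($char =~ /\p{Script=Common}/   ? 1 : 0),
    };
}

for my $char ("A", "5") {
    my $r = classify($char);
    printf "%s: block=%d latin=%d common=%d\n",
        $char, @{$r}{qw(basic_latin_block latin_script common_script)};
}
```

Both characters are in the C<"Basic Latin"> block, but only C<"A"> is in the C<Latin> script; the digit belongs to C<Common>, as described above.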
51f494cc 742
9f815e24
KW
743=head3 B<Other Properties>
744
745There are many more properties than the very basic ones described here.
746A complete list is in L<perluniprops>.
747
748Unicode defines all its properties in the compound form, so all single-form
b19eb496
TC
749properties are Perl extensions. Most of these are just synonyms for the
750Unicode ones, but some are genuine extensions, including several that are in
9f815e24
KW
751the compound form. And quite a few of these are actually recommended by Unicode
752(in L<http://www.unicode.org/reports/tr18>).
753
This section gives some details on all extensions that aren't just
synonyms for compound-form Unicode properties
(for those properties, you'll have to refer to the
L<Unicode Standard|http://www.unicode.org/reports/tr44>).
758
759=over
760
761=item B<C<\p{All}>>
762
This matches every possible code point. It is equivalent to C<qr/./s>.
Unlike all the other non-user-defined C<\p{}> property matches, no
warning is ever generated if this property is matched against a
non-Unicode code point (see L</Beyond Unicode code points> below).
9f815e24
KW
767
768=item B<C<\p{Alnum}>>
769
770This matches any C<\p{Alphabetic}> or C<\p{Decimal_Number}> character.
771
772=item B<C<\p{Any}>>
773
2d88a86a
KW
774This matches any of the 1_114_112 Unicode code points. It is a synonym
775for C<\p{Unicode}>.
9f815e24 776
42581d5d
KW
777=item B<C<\p{ASCII}>>
778
779This matches any of the 128 characters in the US-ASCII character set,
780which is a subset of Unicode.
781
9f815e24
KW
782=item B<C<\p{Assigned}>>
783
a9130ea9
KW
784This matches any assigned code point; that is, any code point whose L<general
785category|/General_Category> is not C<Unassigned> (or equivalently, not C<Cn>).
9f815e24
KW
786
787=item B<C<\p{Blank}>>
788
789This is the same as C<\h> and C<\p{HorizSpace}>: A character that changes the
790spacing horizontally.
791
792=item B<C<\p{Decomposition_Type: Non_Canonical}>> (Short: C<\p{Dt=NonCanon}>)
793
794Matches a character that has a non-canonical decomposition.
795
a6a7eedc
KW
796The L</Extended Grapheme Clusters (Logical characters)> section above
797talked about canonical decompositions. However, many more characters
798have a different type of decomposition, a "compatible" or
799"non-canonical" decomposition. The sequences that form these
800decompositions are not considered canonically equivalent to the
801pre-composed character. An example is the C<"SUPERSCRIPT ONE">. It is
802somewhat like a regular digit 1, but not exactly; its decomposition into
803the digit 1 is called a "compatible" decomposition, specifically a
9f815e24 804"super" decomposition. There are several such compatibility
b65e6125
KW
805decompositions (see L<http://www.unicode.org/reports/tr44>), including
806one called "compat", which means some miscellaneous type of
807decomposition that doesn't fit into the other decomposition categories
808that Unicode has chosen.
9f815e24
KW
809
810Note that most Unicode characters don't have a decomposition, so their
a9130ea9 811decomposition type is C<"None">.
9f815e24 812
b19eb496
TC
813For your convenience, Perl has added the C<Non_Canonical> decomposition
814type to mean any of the several compatibility decompositions.
9f815e24
KW
815
816=item B<C<\p{Graph}>>
817
818Matches any character that is graphic. Theoretically, this means a character
819that on a printer would cause ink to be used.
820
821=item B<C<\p{HorizSpace}>>
822
b19eb496 823This is the same as C<\h> and C<\p{Blank}>: a character that changes the
9f815e24
KW
824spacing horizontally.
825
42581d5d 826=item B<C<\p{In=*}>>
9f815e24
KW
827
This is a synonym for C<\p{Present_In=*}>.
829
830=item B<C<\p{PerlSpace}>>
831
This is the same as C<\s>, restricted to ASCII, namely C<S<[ \f\n\r\t]>>
and, starting in Perl v5.18, a vertical tab.

Mnemonic: Perl's (original) space.

=item B<C<\p{PerlWord}>>

This is the same as C<\w>, restricted to ASCII, namely C<[A-Za-z0-9_]>.

Mnemonic: Perl's (original) word.
842
42581d5d 843=item B<C<\p{Posix...}>>
9f815e24 844
b65e6125
KW
845There are several of these, which are equivalents, using the C<\p{}>
846notation, for Posix classes and are described in
42581d5d 847L<perlrecharclass/POSIX Character Classes>.
9f815e24
KW
848
849=item B<C<\p{Present_In: *}>> (Short: C<\p{In=*}>)
850
851This property is used when you need to know in what Unicode version(s) a
852character is.
853
The "*" above stands for a Unicode version number of the form
I<major>.I<minor>, such as C<1.1> or C<4.0>; or the "*" can also be
C<Unassigned>. This property will match the code points whose final
disposition has been settled as of the Unicode release given by the
version number; C<\p{Present_In: Unassigned}> will match those code
points whose meaning has yet to be assigned.

For example, C<U+0041> C<"LATIN CAPITAL LETTER A"> was present in the very first
Unicode release available, which is C<1.1>, so this property is true for all
valid "*" versions. On the other hand, C<U+1EFF> was not assigned until version
5.1 when it became C<"LATIN SMALL LETTER Y WITH LOOP">, so the only "*" values
that match it are 5.1, 5.2, and later.
865
866Unicode furnishes the C<Age> property from which this is derived. The problem
867with Age is that a strict interpretation of it (which Perl takes) has it
868matching the precise release a code point's meaning is introduced in. Thus
869C<U+0041> would match only 1.1; and C<U+1EFF> only 5.1. This is not usually what
870you want.
871
872Some non-Perl implementations of the Age property may change its meaning to be
a9130ea9 873the same as the Perl C<Present_In> property; just be aware of that.
9f815e24
KW
874
Another confusion with both these properties is that the definition is not
that the code point has been I<assigned>, but that the meaning of the code point
has been I<determined>. This is because 66 code points will always be
unassigned, and so the C<Age> for them is the Unicode version in which the decision
to make them so was made. For example, C<U+FDD0> is to be permanently
unassigned to a character, and the decision to do that was made in version 3.1,
so C<\p{Age=3.1}> matches this character, as do C<\p{Present_In: 3.1}> and up.
9f815e24
KW
882
883=item B<C<\p{Print}>>
884
ae5b72c8 885This matches any character that is graphical or blank, except controls.
9f815e24
KW
886
887=item B<C<\p{SpacePerl}>>
888
889This is the same as C<\s>, including beyond ASCII.
890
Mnemonic: Space, as modified by Perl. (Before v5.18, it didn't include the
vertical tab, which both the Posix standard and Unicode consider white space.)
9f815e24 893
4364919a
KW
894=item B<C<\p{Title}>> and B<C<\p{Titlecase}>>
895
Under case-sensitive matching, these both match the same code points as
C<\p{General Category=Titlecase_Letter}> (C<\p{gc=lt}>). The difference
is that under C</i> caseless matching, these match the same as
C<\p{Cased}>, whereas C<\p{gc=lt}> matches C<\p{Cased_Letter}>.
900
2d88a86a
KW
901=item B<C<\p{Unicode}>>
902
This matches any of the 1_114_112 Unicode code points. It is a synonym
for C<\p{Any}>.
905
9f815e24
KW
906=item B<C<\p{VertSpace}>>
907
908This is the same as C<\v>: A character that changes the spacing vertically.
909
910=item B<C<\p{Word}>>
911
b19eb496 912This is the same as C<\w>, including over 100_000 characters beyond ASCII.
9f815e24 913
42581d5d
KW
914=item B<C<\p{XPosix...}>>
915
b19eb496 916There are several of these, which are the standard Posix classes
42581d5d
KW
917extended to the full Unicode range. They are described in
918L<perlrecharclass/POSIX Character Classes>.
919
9f815e24
KW
920=back
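
The difference between C<Age> and C<Present_In> described in the list above can be sketched as follows. The helper C<version_matches> is hypothetical, and the sketch assumes a Perl whose bundled Unicode is at least 6.0, so that all the versions named are known to it.

```perl
use strict;
use warnings;

# Hypothetical helper: test one code point against Age (exact release)
# and Present_In (that release or any earlier one).
sub version_matches {
    my ($cp) = @_;
    my $c = chr $cp;
    return {
        age_1_1 => ($c =~ /\p{Age=1.1}/         ? 1 : 0),
        age_5_1 => ($c =~ /\p{Age=5.1}/         ? 1 : 0),
        in_5_0  => ($c =~ /\p{Present_In: 5.0}/ ? 1 : 0),
        in_5_1  => ($c =~ /\p{Present_In: 5.1}/ ? 1 : 0),
        in_6_0  => ($c =~ /\p{Present_In: 6.0}/ ? 1 : 0),
    };
}

for my $cp (0x0041, 0x1EFF) {
    my $r = version_matches($cp);
    printf "U+%04X: %s\n", $cp,
        join ", ", map { "$_=$r->{$_}" } sort keys %$r;
}
```

C<U+0041> matches only C<Age=1.1> but every C<Present_In> version tested, while C<U+1EFF> matches only C<Age=5.1> and the C<Present_In> versions from 5.1 onward.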
921
a9130ea9 922
376d9008 923=head2 User-Defined Character Properties
491fd90a 924
51f494cc 925You can define your own binary character properties by defining subroutines
a9130ea9 926whose names begin with C<"In"> or C<"Is">. (The experimental feature
9d1a5160
KW
927L<perlre/(?[ ])> provides an alternative which allows more complex
928definitions.) The subroutines can be defined in any
51f494cc 929package. The user-defined properties can be used in the regular expression
a9130ea9 930C<\p{}> and C<\P{}> constructs; if you are using a user-defined property from a
51f494cc 931package other than the one you are in, you must specify its package in the
a9130ea9 932C<\p{}> or C<\P{}> construct.
bac0b425 933
    # assuming property IsForeign defined in Lang::
    package main;  # property package name required
    if ($txt =~ /\p{Lang::IsForeign}+/) { ... }

    package Lang;  # property package name not required
    if ($txt =~ /\p{IsForeign}+/) { ... }
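
A runnable sketch of the package rule follows. The definition of C<Lang::IsForeign> here is an assumption made up for the illustration (it treats everything outside ASCII as "foreign"); the two C<has_foreign> helpers are likewise hypothetical.

```perl
use strict;
use warnings;

package Lang;

# Hypothetical property: every code point above ASCII counts as "foreign".
sub IsForeign { return "0080\t10FFFF\n" }

# Inside package Lang the property needs no package qualifier.
sub has_foreign {
    my ($txt) = @_;
    return $txt =~ /\p{IsForeign}/ ? 1 : 0;
}

package main;

# Outside package Lang the property must be package-qualified.
sub has_foreign {
    my ($txt) = @_;
    return $txt =~ /\p{Lang::IsForeign}/ ? 1 : 0;
}

print has_foreign("caf\x{E9}"), "\n";    # contains U+00E9
print Lang::has_foreign("cafe"), "\n";   # pure ASCII
```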
940
941
942Note that the effect is compile-time and immutable once defined.
b19eb496
TC
943However, the subroutines are passed a single parameter, which is 0 if
944case-sensitive matching is in effect and non-zero if caseless matching
56ca34ca
KW
945is in effect. The subroutine may return different values depending on
946the value of the flag, and one set of values will immutably be in effect
b19eb496 947for all case-sensitive matches, and the other set for all case-insensitive
56ca34ca 948matches.
491fd90a 949
b19eb496 950Note that if the regular expression is tainted, then Perl will die rather
a9130ea9 951than calling the subroutine when the name of the subroutine is
0e9be77f
DM
952determined by the tainted data.
953
376d9008
JB
954The subroutines must return a specially-formatted string, with one
955or more newline-separated lines. Each line must be one of the following:
491fd90a
JH
956
957=over 4
958
959=item *
960
df9e1087 961A single hexadecimal number denoting a code point to include.
510254c9
A
962
963=item *
964
Two hexadecimal numbers separated by horizontal whitespace (space or
tab characters) denoting a range of code points to include.
491fd90a
JH
967
968=item *
969
a9130ea9
KW
970Something to include, prefixed by C<"+">: a built-in character
971property (prefixed by C<"utf8::">) or a fully qualified (including package
830137a2 972name) user-defined character property,
bac0b425
JP
973to represent all the characters in that property; two hexadecimal code
974points for a range; or a single hexadecimal code point.
491fd90a
JH
975
976=item *
977
a9130ea9
KW
978Something to exclude, prefixed by C<"-">: an existing character
979property (prefixed by C<"utf8::">) or a fully qualified (including package
830137a2 980name) user-defined character property,
bac0b425
JP
981to represent all the characters in that property; two hexadecimal code
982points for a range; or a single hexadecimal code point.
491fd90a
JH
983
984=item *
985
Something to negate, prefixed by C<"!">: an existing character
property (prefixed by C<"utf8::">) or a fully qualified (including package
name) user-defined character property,
to represent all the characters in that property; two hexadecimal code
points for a range; or a single hexadecimal code point.
991
992=item *
993
Something to intersect with, prefixed by C<"&">: an existing character
property (prefixed by C<"utf8::">) or a fully qualified (including package
name) user-defined character property,
to keep only the characters that are also in that property (all other
characters are removed); two hexadecimal code points for a range; or a
single hexadecimal code point.
491fd90a
JH
999
1000=back
1001
1002For example, to define a property that covers both the Japanese
1003syllabaries (hiragana and katakana), you can define
1004
    sub InKana {
        return <<END;
    3040\t309F
    30A0\t30FF
    END
    }
1011
d5822f25
A
1012Imagine that the here-doc end marker is at the beginning of the line.
1013Now you can use C<\p{InKana}> and C<\P{InKana}>.
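
A self-contained sketch of the same property in use follows. It builds the string with explicit newlines instead of a here-doc, and the helper C<kana_only> is hypothetical.

```perl
use strict;
use warnings;

# The same two ranges as above, written as a plain string so that no
# here-doc terminator placement is involved.
sub InKana {
    return "3040\t309F\n"     # Hiragana block
         . "30A0\t30FF\n";    # Katakana block
}

# Hypothetical helper: true if the string consists only of kana.
sub kana_only {
    my ($str) = @_;
    return $str =~ /\A\p{InKana}+\z/ ? 1 : 0;
}

print kana_only("\x{3072}\x{3089}\x{304C}\x{306A}"), "\n";  # hiragana
print kana_only("\x{30AB}\x{30BF}\x{30AB}\x{30CA}"), "\n";  # katakana
print kana_only("kana"), "\n";                              # ASCII letters
```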
491fd90a
JH
1014
1015You could also have used the existing block property names:
1016
    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    END
    }
1023
1024Suppose you wanted to match only the allocated characters,
d5822f25 1025not the raw block ranges: in other words, you want to remove
b65e6125 1026the unassigned characters:
491fd90a
JH
1027
    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    -utf8::IsCn
    END
    }
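
The effect of removing the unassigned characters can be sketched by comparing the two definitions side by side. The names C<InKanaBlocks> and C<InKanaAssigned> are hypothetical, chosen so both variants can coexist in one program; the sketch assumes C<U+3040> is still unassigned in the Unicode version your Perl uses.

```perl
use strict;
use warnings;

# Raw block ranges, including unassigned code points.
sub InKanaBlocks {
    return "+utf8::InHiragana\n"
         . "+utf8::InKatakana\n";
}

# Same blocks with the unassigned (Cn) code points removed.
sub InKanaAssigned {
    return "+utf8::InHiragana\n"
         . "+utf8::InKatakana\n"
         . "-utf8::IsCn\n";
}

my $unassigned = chr 0x3040;   # in the Hiragana block, assumed unassigned
my $hiragana_a = chr 0x3042;   # HIRAGANA LETTER A

printf "U+3040 blocks=%d assigned=%d\n",
    ($unassigned =~ /\p{InKanaBlocks}/   ? 1 : 0),
    ($unassigned =~ /\p{InKanaAssigned}/ ? 1 : 0);
printf "U+3042 blocks=%d assigned=%d\n",
    ($hiragana_a =~ /\p{InKanaBlocks}/   ? 1 : 0),
    ($hiragana_a =~ /\p{InKanaAssigned}/ ? 1 : 0);
```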
1035
1036The negation is useful for defining (surprise!) negated classes.
1037
    sub InNotKana {
        return <<'END';
    !utf8::InHiragana
    -utf8::InKatakana
    +utf8::IsCn
    END
    }
1045
461020ad
KW
1046This will match all non-Unicode code points, since every one of them is
1047not in Kana. You can use intersection to exclude these, if desired, as
1048this modified example shows:
bac0b425 1049
    sub InNotKana {
        return <<'END';
    !utf8::InHiragana
    -utf8::InKatakana
    +utf8::IsCn
    &utf8::Any
    END
    }
1058
461020ad
KW
1059C<&utf8::Any> must be the last line in the definition.
1060
1061Intersection is used generally for getting the common characters matched
a9130ea9 1062by two (or more) classes. It's important to remember not to use C<"&"> for
461020ad
KW
1063the first set; that would be intersecting with nothing, resulting in an
1064empty set.
1065
2d88a86a
KW
1066Unlike non-user-defined C<\p{}> property matches, no warning is ever
1067generated if these properties are matched against a non-Unicode code
1068point (see L</Beyond Unicode code points> below).
bac0b425 1069
68585b5e 1070=head2 User-Defined Case Mappings (for serious hackers only)
822502e5 1071
5d1892be 1072B<This feature has been removed as of Perl 5.16.>
a9130ea9 1073The CPAN module C<L<Unicode::Casing>> provides better functionality without
5d1892be
KW
1074the drawbacks that this feature had. If you are using a Perl earlier
1075than 5.16, this feature was most fully documented in the 5.14 version of
1076this pod:
1077L<http://perldoc.perl.org/5.14.0/perlunicode.html#User-Defined-Case-Mappings-%28for-serious-hackers-only%29>
3a2263fe 1078
376d9008 1079=head2 Character Encodings for Input and Output
8cbd9a7a 1080
7221edc9 1081See L<Encode>.
8cbd9a7a 1082
c29a771d 1083=head2 Unicode Regular Expression Support Level
776f8809 1084
b19eb496 1085The following list of Unicode supported features for regular expressions describes
fea12a3e
KW
1086all features currently directly supported by core Perl. The references
1087to "Level I<N>" and the section numbers refer to
1088L<UTS#18 "Unicode Regular Expressions"|http://www.unicode.org/reports/tr18>,
1089version 13, November 2013.
1090
1091=head3 Level 1 - Basic Unicode Support
1092
 RL1.1   Hex Notation                     - Done          [1]
 RL1.2   Properties                       - Done          [2]
 RL1.2a  Compatibility Properties         - Done          [3]
 RL1.3   Subtraction and Intersection     - Experimental  [4]
 RL1.4   Simple Word Boundaries           - Done          [5]
 RL1.5   Simple Loose Matches             - Done          [6]
 RL1.6   Line Boundaries                  - Partial       [7]
 RL1.7   Supplementary Code Points        - Done          [8]
755789c0 1101
6f33e417
KW
1102=over 4
1103
a6a7eedc 1104=item [1] C<\N{U+...}> and C<\x{...}>
6f33e417 1105
=item [2]
C<\p{...}> C<\P{...}>. This requirement is for a minimal list of
properties. Perl supports these and all other Unicode character
properties, as RL2.7 asks (see L</"Unicode Character Properties"> above).
6f33e417 1110
=item [3]
Perl has C<\d> C<\D> C<\s> C<\S> C<\w> C<\W> C<\X> C<[:I<prop>:]>
C<[:^I<prop>:]>, plus all the properties specified by
L<http://www.unicode.org/reports/tr18/#Compatibility_Properties>. These
are described above in L</Other Properties>.
6f33e417 1116
fea12a3e 1117=item [4]
6f33e417 1118
fea12a3e 1119The experimental feature C<"(?[...])"> starting in v5.18 accomplishes
a6a7eedc 1120this.
6f33e417 1121
a6a7eedc
KW
1122See L<perlre/(?[ ])>. If you don't want to use an experimental
1123feature, you can use one of the following:
6f33e417
KW
1124
1125=over 4
1126
a6a7eedc 1127=item *
f67a5002 1128Regular expression lookahead
6f33e417
KW
1129
1130You can mimic class subtraction using lookahead.
8158862b 1131For example, what UTS#18 might write as
29bdacb8 1132
209c9685 1133 [{Block=Greek}-[{UNASSIGNED}]]
dbe420b4
JH
1134
1135in Perl can be written as:
1136
209c9685
KW
1137 (?!\p{Unassigned})\p{Block=Greek}
1138 (?=\p{Assigned})\p{Block=Greek}
dbe420b4
JH
1139
1140But in this particular example, you probably really want
1141
209c9685 1142 \p{Greek}
dbe420b4
JH
1143
1144which will match assigned characters known to be part of the Greek script.
29bdacb8 1145
a6a7eedc
KW
1146=item *
1147
1148CPAN module C<L<Unicode::Regex::Set>>
8158862b 1149
6f33e417
KW
1150It does implement the full UTS#18 grouping, intersection, union, and
1151removal (subtraction) syntax.
8158862b 1152
a6a7eedc
KW
1153=item *
1154
1155L</"User-Defined Character Properties">
6f33e417 1156
a9130ea9 1157C<"+"> for union, C<"-"> for removal (set-difference), C<"&"> for intersection
6f33e417
KW
1158
1159=back
1160
=item [5]
C<\b> C<\B> meet most, but not all, the details of this requirement, but
C<\b{wb}> and C<\B{wb}> do, as well as the stricter RL2.3.
1164
1165=item [6]
6f33e417 1166
a6a7eedc 1167Note that Perl does Full case-folding in matching, not Simple:
6f33e417 1168
a6a7eedc
KW
1169For example C<U+1F88> is equivalent to C<U+1F00 U+03B9>, instead of just
1170C<U+1F80>. This difference matters mainly for certain Greek capital
a9130ea9
KW
1171letters with certain modifiers: the Full case-folding decomposes the
1172letter, while the Simple case-folding would map it to a single
1173character.
6f33e417 1174
fea12a3e
KW
1175=item [7]
1176
1177The reason this is considered to be only partially implemented is that
1178Perl has L<C<qrE<sol>\b{lb}E<sol>>|perlrebackslash/\b{lb}> and
1179C<L<Unicode::LineBreak>> that are conformant with
1180L<UAX#14 "Unicode Line Breaking Algorithm"|http://www.unicode.org/reports/tr14>.
1181The regular expression construct provides default behavior, while the
1182heavier-weight module provides customizable line breaking.
1183
1184But Perl treats C<\n> as the start- and end-line
1185delimiter, whereas Unicode specifies more characters that should be
1186so-interpreted.
6f33e417 1187
a6a7eedc 1188These are:
6f33e417 1189
 VT   U+000B  (\v in C)
 FF   U+000C  (\f)
 CR   U+000D  (\r)
 NEL  U+0085
 LS   U+2028
 PS   U+2029
6f33e417 1196
a6a7eedc
KW
1197C<^> and C<$> in regular expression patterns are supposed to match all
1198these, but don't.
1199These characters also don't, but should, affect C<< <> >> C<$.>, and
1200script line numbers.
6f33e417 1201
a6a7eedc
KW
1202Also, lines should not be split within C<CRLF> (i.e. there is no
1203empty line between C<\r> and C<\n>). For C<CRLF>, try the C<:crlf>
1204layer (see L<PerlIO>).
1205
=item [8]
UTF-8/UTF-EBCDIC used in Perl allows not only C<U+10000> to
C<U+10FFFF> but also beyond C<U+10FFFF>.
1209
1210=back
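
The lookahead technique from note [4] above can be sketched as follows. The helper name C<assigned_in_greek_block> is hypothetical, and the sketch assumes C<U+0378> is still unassigned in the Unicode version your Perl uses.

```perl
use strict;
use warnings;

# Class subtraction mimicked with a negative lookahead:
# "in the Greek block, minus the unassigned code points".
my $assigned_greek = qr/(?!\p{Unassigned})\p{Block=Greek}/;

# Hypothetical helper wrapping the pattern for a single character.
sub assigned_in_greek_block {
    my ($char) = @_;
    return $char =~ /\A$assigned_greek\z/ ? 1 : 0;
}

my $alpha = chr 0x03B1;   # GREEK SMALL LETTER ALPHA
my $hole  = chr 0x0378;   # in the Greek block, assumed unassigned

printf "U+03B1: %d\n", assigned_in_greek_block($alpha);
printf "U+0378: %d (plain block match: %d)\n",
    assigned_in_greek_block($hole),
    ($hole =~ /\p{Block=Greek}/ ? 1 : 0);
```

The lookahead consumes nothing, so the pattern still matches exactly one character; it simply refuses positions where that character is unassigned.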
5ca1ac52 1211
fea12a3e 1212=head3 Level 2 - Extended Unicode Support
776f8809 1213
 RL2.1   Canonical Equivalents           - Retracted     [9]
                                           by Unicode
 RL2.2   Extended Grapheme Clusters      - Partial       [10]
 RL2.3   Default Word Boundaries         - Done          [11]
 RL2.4   Default Case Conversion         - Done
 RL2.5   Name Properties                 - Done
 RL2.6   Wildcard Properties             - Missing
 RL2.7   Full Properties                 - Done
776f8809 1222
fea12a3e 1223=over 4
8158862b 1224
fea12a3e
KW
1225=item [9]
1226Unicode has rewritten this portion of UTS#18 to say that getting
1227canonical equivalence (see UAX#15
1228L<"Unicode Normalization Forms"|http://www.unicode.org/reports/tr15>)
1229is basically to be done at the programmer level. Use NFD to write
1230both your regular expressions and text to match them against (you
1231can use L<Unicode::Normalize>).
776f8809 1232
fea12a3e
KW
1233=item [10]
1234Perl has C<\X> and C<\b{gcb}> but we don't have a "Grapheme Cluster Mode".
1235
=item [11]
See L<UAX#29 "Unicode Text Segmentation"|http://www.unicode.org/reports/tr29>.
1238
1239=back
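
The advice in note [9] above, to normalize both the pattern and the text yourself, can be sketched like this. C<matches_canonically> is a hypothetical helper; C<NFD> comes from the core module L<Unicode::Normalize>.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: literal substring match that treats canonically
# equivalent sequences as equal, by putting both sides into NFD first.
sub matches_canonically {
    my ($text, $literal) = @_;
    my $pattern = quotemeta NFD($literal);
    return NFD($text) =~ /$pattern/ ? 1 : 0;
}

my $precomposed = "caf\x{E9}";      # e-acute as one code point
my $decomposed  = "cafe\x{301}";    # e followed by COMBINING ACUTE ACCENT

# Without normalization the two spellings do not match each other.
print(($precomposed =~ /\Q$decomposed\E/ ? 1 : 0), "\n");

# With both sides in NFD they do.
print matches_canonically($precomposed, $decomposed), "\n";
```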
1240
1241=head3 Level 3 - Tailored Support
1242
 RL3.1   Tailored Punctuation            - Missing
 RL3.2   Tailored Grapheme Clusters      - Missing       [12]
 RL3.3   Tailored Word Boundaries        - Missing
 RL3.4   Tailored Loose Matches          - Retracted by Unicode
 RL3.5   Tailored Ranges                 - Retracted by Unicode
 RL3.6   Context Matching                - Missing       [13]
 RL3.7   Incremental Matches             - Missing
 RL3.8   Unicode Set Sharing             - Unicode is proposing
                                           to retract this
 RL3.9   Possible Match Sets             - Missing
 RL3.10  Folded Matching                 - Retracted by Unicode
 RL3.11  Submatchers                     - Missing
1255
1256=over 4
1257
=item [12]
Perl has L<Unicode::Collate>, but it isn't integrated with regular
expressions. See
L<UTS#10 "Unicode Collation Algorithm"|http://www.unicode.org/reports/tr10>.
776f8809 1262
=item [13]
Perl has C<(?<=x)> and C<(?=x)>, but lookaheads or lookbehinds should
see outside of the target substring.
776f8809
JH
1266
1267=back
1268
c349b1b9
JH
1269=head2 Unicode Encodings
1270
376d9008
JB
1271Unicode characters are assigned to I<code points>, which are abstract
1272numbers. To use these numbers, various encodings are needed.
c349b1b9
JH
1273
1274=over 4
1275
c29a771d 1276=item *
5cb3728c
RB
1277
1278UTF-8
c349b1b9 1279
6d4f9cf2 1280UTF-8 is a variable-length (1 to 4 bytes), byte-order independent
a6a7eedc
KW
1281encoding. In most of Perl's documentation, including elsewhere in this
1282document, the term "UTF-8" means also "UTF-EBCDIC". But in this section,
1283"UTF-8" refers only to the encoding used on ASCII platforms. It is a
1284superset of 7-bit US-ASCII, so anything encoded in ASCII has the
1285identical representation when encoded in UTF-8.
c349b1b9 1286
8c007b5a 1287The following table is from Unicode 3.2.
05632f9a 1288
 Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte

   U+0000..U+007F       00..7F
   U+0080..U+07FF     * C2..DF    80..BF
   U+0800..U+0FFF       E0      * A0..BF    80..BF
   U+1000..U+CFFF       E1..EC    80..BF    80..BF
   U+D000..U+D7FF       ED        80..9F    80..BF
   U+D800..U+DFFF       +++++ utf16 surrogates, not legal utf8 +++++
   U+E000..U+FFFF       EE..EF    80..BF    80..BF
  U+10000..U+3FFFF      F0      * 90..BF    80..BF    80..BF
  U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
 U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF
e1b711da 1301
b19eb496 1302Note the gaps marked by "*" before several of the byte entries above. These are
e1b711da
KW
1303caused by legal UTF-8 avoiding non-shortest encodings: it is technically
1304possible to UTF-8-encode a single code point in different ways, but that is
1305explicitly forbidden, and the shortest possible encoding should always be used
1306(and that is what Perl does).
37361303 1307
376d9008 1308Another way to look at it is via bits:
05632f9a 1309
                Code Points  1st Byte  2nd Byte  3rd Byte  4th Byte

                   0aaaaaaa  0aaaaaaa
           00000bbbbbaaaaaa  110bbbbb  10aaaaaa
           ccccbbbbbbaaaaaa  1110cccc  10bbbbbb  10aaaaaa
 00000dddccccccbbbbbbaaaaaa  11110ddd  10cccccc  10bbbbbb  10aaaaaa
05632f9a 1316
a9130ea9 1317As you can see, the continuation bytes all begin with C<"10">, and the
e1b711da 1318leading bits of the start byte tell how many bytes there are in the
05632f9a
JH
1319encoded character.
1320
6d4f9cf2 1321The original UTF-8 specification allowed up to 6 bytes, to allow
a9130ea9 1322encoding of numbers up to C<0x7FFF_FFFF>. Perl continues to allow those,
6d4f9cf2
KW
1323and has extended that up to 13 bytes to encode code points up to what
1324can fit in a 64-bit word. However, Perl will warn if you output any of
b19eb496 1325these as being non-portable; and under strict UTF-8 input protocols,
760c7c2f
KW
1326they are forbidden. In addition, it is deprecated to use a code point
1327larger than what a signed integer variable on your system can hold. On
132832-bit ASCII systems, this means C<0x7FFF_FFFF> is the legal maximum
1329going forward (much higher on 64-bit systems).
6d4f9cf2 1330
c29a771d 1331=item *
5cb3728c
RB
1332
1333UTF-EBCDIC
dbe420b4 1334
Like UTF-8, but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.
This means that all the basic characters (which include all
those that have ASCII equivalents, like C<"A">, C<"0">, C<"%">, I<etc.>)
are the same in both EBCDIC and UTF-EBCDIC.
1339
c0236afe
KW
1340UTF-EBCDIC is used on EBCDIC platforms. It generally requires more
1341bytes to represent a given code point than UTF-8 does; the largest
1342Unicode code points take 5 bytes to represent (instead of 4 in UTF-8),
1343and, extended for 64-bit words, it uses 14 bytes instead of 13 bytes in
1344UTF-8.
dbe420b4 1345
c29a771d 1346=item *
5cb3728c 1347
b65e6125 1348UTF-16, UTF-16BE, UTF-16LE, Surrogates, and C<BOM>'s (Byte Order Marks)
c349b1b9 1349
The following items are mostly for reference and general Unicode
knowledge; Perl doesn't use these constructs internally.
dbe420b4 1352
b19eb496
TC
1353Like UTF-8, UTF-16 is a variable-width encoding, but where
1354UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units.
1355All code points occupy either 2 or 4 bytes in UTF-16: code points
1356C<U+0000..U+FFFF> are stored in a single 16-bit unit, and code
1bfb14c4 1357points C<U+10000..U+10FFFF> in two 16-bit units. The latter case is
c349b1b9
JH
1358using I<surrogates>, the first 16-bit unit being the I<high
1359surrogate>, and the second being the I<low surrogate>.
1360
376d9008 1361Surrogates are code points set aside to encode the C<U+10000..U+10FFFF>
c349b1b9 1362range of Unicode code points in pairs of 16-bit units. The I<high
9f815e24 1363surrogates> are the range C<U+D800..U+DBFF> and the I<low surrogates>
376d9008 1364are the range C<U+DC00..U+DFFF>. The surrogate encoding is
c349b1b9 1365
    $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
    $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
1368
1369and the decoding is
1370
d88362ca 1371 $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
c349b1b9 1372
376d9008 1373Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16
c349b1b9 1374itself can be used for in-memory computations, but if storage or
376d9008
JB
1375transfer is required either UTF-16BE (big-endian) or UTF-16LE
1376(little-endian) encodings must be chosen.
c349b1b9
JH
1377
1378This introduces another problem: what if you just know that your data
376d9008 1379is UTF-16, but you don't know which endianness? Byte Order Marks, or
b65e6125 1380C<BOM>'s, are a solution to this. A special character has been reserved
86bbd6d1 1381in Unicode to function as a byte order marker: the character with the
a9130ea9 1382code point C<U+FEFF> is the C<BOM>.
042da322 1383
a9130ea9 1384The trick is that if you read a C<BOM>, you will know the byte order,
376d9008
JB
1385since if it was written on a big-endian platform, you will read the
1386bytes C<0xFE 0xFF>, but if it was written on a little-endian platform,
1387you will read the bytes C<0xFF 0xFE>. (And if the originating platform
b65e6125
KW
1388was writing in ASCII platform UTF-8, you will read the bytes
1389C<0xEF 0xBB 0xBF>.)
042da322 1390
The way this trick works is that the character with the code point
C<U+FFFE> is not supposed to be in input streams, so the
sequence of bytes C<0xFF 0xFE> is unambiguously "C<BOM>, represented in
little-endian format" and cannot be "C<U+FFFE>, represented in big-endian
format".
1396
Surrogates have no meaning in Unicode outside their use in pairs to
represent other code points. However, Perl allows them to be
represented individually internally, for example by saying
C<chr(0xD801)>, so that all code points, not just those valid for open
interchange, are
representable. Unicode does define semantics for them, such as their
C<L</General_Category>> is C<"Cs">. But because their use is somewhat dangerous,
Perl will warn (using the warning category C<"surrogate">, which is a
sub-category of C<"utf8">) if an attempt is made
to do things like take the lower case of one, or match
case-insensitively, or to output them. (But don't try this on Perls
before 5.14.)

=item *

UTF-32, UTF-32BE, UTF-32LE

The UTF-32 family is pretty much like the UTF-16 family, except that
the units are 32-bit, and therefore the surrogate scheme is not
needed. UTF-32 is a fixed-width encoding. The C<BOM> signatures are
C<0x00 0x00 0xFE 0xFF> for BE and C<0xFF 0xFE 0x00 0x00> for LE.

=item *

UCS-2, UCS-4

Legacy, fixed-width encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit
encoding. Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
because it does not use surrogates. UCS-4 is a 32-bit encoding,
functionally identical to UTF-32 (the difference being that
UCS-4 forbids neither surrogates nor code points larger than C<0x10_FFFF>).

=item *

UTF-7

A seven-bit safe (non-eight-bit) encoding, which is useful if the
transport or storage is not eight-bit safe. Defined by RFC 2152.

=back

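The C<BOM> signatures given for the encodings in the list above can be
checked mechanically. What follows is only a minimal sketch, not
something Perl provides: the function name C<guess_encoding_from_bom>
and the table C<@BOMS> are invented for this illustration.

```perl
use strict;
use warnings;

# Byte signatures taken from the text above.  Longer signatures come
# first, because the UTF-32LE BOM (FF FE 00 00) begins with the same two
# bytes as the UTF-16LE BOM (FF FE).
my @BOMS = (
    [ "\x00\x00\xFE\xFF", 'UTF-32BE' ],
    [ "\xFF\xFE\x00\x00", 'UTF-32LE' ],
    [ "\xEF\xBB\xBF",     'UTF-8'    ],
    [ "\xFE\xFF",         'UTF-16BE' ],
    [ "\xFF\xFE",         'UTF-16LE' ],
);

# Hypothetical helper: returns an encoding name, or undef if the byte
# string does not start with any of the signatures above.
sub guess_encoding_from_bom {
    my ($bytes) = @_;
    for my $entry (@BOMS) {
        my ($signature, $name) = @$entry;
        return $name
            if substr($bytes, 0, length $signature) eq $signature;
    }
    return undef;
}
```

Note that a guess of this kind is inherently ambiguous in one case:
UTF-16LE text whose first character after the C<BOM> is C<U+0000> starts
with the same four bytes as the UTF-32LE C<BOM>.
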
=head2 Noncharacter code points

66 code points are set aside in Unicode as "noncharacter code points".
These all have the C<Unassigned> (C<Cn>) C<L</General_Category>>, and
no character will ever be assigned to any of them. They are the 32 code
points between C<U+FDD0> and C<U+FDEF> inclusive, and the 34 code
points:

 U+FFFE   U+FFFF
 U+1FFFE  U+1FFFF
 U+2FFFE  U+2FFFF
 ...
 U+EFFFE  U+EFFFF
 U+FFFFE  U+FFFFF
 U+10FFFE U+10FFFF

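The noncharacters follow a simple pattern, so membership can be tested
arithmetically. The sketch below assumes the code point is held as a
plain integer; the name C<is_noncharacter> is made up for this example
and is not a Perl built-in.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $cp is one of the 66 noncharacters.
sub is_noncharacter {
    my ($cp) = @_;

    # The contiguous block U+FDD0 .. U+FDEF (32 code points).
    return 1 if $cp >= 0xFDD0 && $cp <= 0xFDEF;

    # The last two code points of each of the 17 planes (34 code points):
    # the low 16 bits are FFFE or FFFF.
    return 1 if $cp <= 0x10FFFF && ($cp & 0xFFFE) == 0xFFFE;

    return 0;
}
```

Counting the code points from 0 through C<0x10FFFF> for which this
returns true is a quick way to confirm the total of 66 given above.
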
Until Unicode 7.0, the noncharacters were "B<forbidden> for use in open
interchange of Unicode text data", so that code that processed those
streams could use these code points as sentinels that could be mixed in
with character data, and would always be distinguishable from that data.
(Emphasis above and in the next paragraph is added in this document.)

Unicode 7.0 changed the wording so that they are "B<not recommended> for
use in open interchange of Unicode text data". The 7.0 Standard goes on
to say:

=over 4

"If a noncharacter is received in open interchange, an application is
not required to interpret it in any way. It is good practice, however,
to recognize it as a noncharacter and to take appropriate action, such
as replacing it with C<U+FFFD> replacement character, to indicate the
problem in the text. It is not recommended to simply delete
noncharacter code points from such text, because of the potential
security issues caused by deleting uninterpreted characters. (See
conformance clause C7 in Section 3.2, Conformance Requirements, and
L<Unicode Technical Report #36, "Unicode Security
Considerations"|http://www.unicode.org/reports/tr36/#Substituting_for_Ill_Formed_Subsequences>)."

=back

This change was made because it was found that various commercial tools,
such as editors and source code control systems, had been written
so that they would not handle program files that used these code points,
effectively precluding their use almost entirely! And that was never
the intent. They've always been meant to be usable within an
application, or cooperating set of applications, at will.

If you're writing code, such as an editor, that is supposed to be able
to handle any Unicode text data, then you shouldn't be using these code
points yourself, and should instead allow them in the input. If you need
sentinels, they should instead be something that isn't legal Unicode.
For UTF-8 data, you can use the bytes 0xC0 and 0xC1 as sentinels, as
they never appear in well-formed UTF-8. (There are equivalents for
UTF-EBCDIC.) You can also store your Unicode code points in integer
variables and use negative values as sentinels.

If you're not writing such a tool, then whether you accept noncharacters
as input is up to you (though the Standard recommends that you not). If
you do strict input stream checking with Perl, these code points
continue to be forbidden. This is to maintain backward compatibility
(otherwise potential security holes could open up, as an unsuspecting
application that was written assuming the noncharacters would be
filtered out before getting to it, could now, without warning, start
getting them). To do strict checking, you can use the layer
C<:encoding('UTF-8')>.

Perl continues to warn (using the warning category C<"nonchar">, which
is a sub-category of C<"utf8">) if an attempt is made to output
noncharacters.

=head2 Beyond Unicode code points

The maximum Unicode code point is C<U+10FFFF>, and Unicode only defines
operations on code points up through that. But Perl works on code
points up to the maximum permissible unsigned number available on the
platform. However, Perl will not accept these from input streams unless
lax rules are being used, and will warn (using the warning category
C<"non_unicode">, which is a sub-category of C<"utf8">) if any are output.

Since Unicode rules are not defined on these code points, if a
Unicode-defined operation is done on them, Perl uses what we believe are
sensible rules, while generally warning, using the C<"non_unicode">
category. For example, C<uc("\x{11_0000}")> will generate such a
warning, returning the input parameter as its result, since Perl defines
the uppercase of every non-Unicode code point to be the code point
itself. (All the case changing operations, not just uppercasing, work
this way.)

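The C<uc()> behavior just described can be observed directly. This is
a small sketch, assuming a Perl of v5.14 or later with warnings
enabled; the variable names are arbitrary, and the warning is captured
with a local C<$SIG{__WARN__}> handler purely so that it can be
inspected.

```perl
use strict;
use warnings;

my $above = chr(0x110000);    # one past the last Unicode code point
my @warnings;
my $upper;
{
    # Collect warnings instead of letting them go to STDERR.
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    $upper = uc($above);
}

print "uc() returned its argument unchanged\n" if $upper eq $above;
print "warning: $_" for @warnings;
```
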
The situation with matching Unicode properties in regular expressions,
the C<\p{}> and C<\P{}> constructs, against these code points is not as
clear cut, and how these are handled has changed as we've gained
experience.

One possibility is to treat any match against these code points as
undefined. But since Perl doesn't have the concept of a match being
undefined, it converts this to failing or C<FALSE>. This is almost, but
not quite, what Perl did from v5.14 (when use of these code points
became generally reliable) through v5.18. The difference is that Perl
treated all C<\p{}> matches as failing, but all C<\P{}> matches as
succeeding.

One problem with this is that it leads to unexpected and confusing
results in some cases:

 chr(0x110000) =~ /\p{ASCII_Hex_Digit=True}/   # Failed on <= v5.18
 chr(0x110000) =~ /\p{ASCII_Hex_Digit=False}/  # Failed! on <= v5.18

That is, it treated both matches as undefined, and converted that to
false (raising a warning on each). The first case is the expected
result, but the second is likely counterintuitive: "How could both be
false when they are complements?" Another problem was that the
implementation optimized many Unicode property matches down to already
existing simpler, faster operations, which don't raise the warning. We
chose to not forgo those optimizations, which help the vast majority of
matches, just to generate a warning for the unlikely event that an
above-Unicode code point is being matched against.

As a result of these problems, starting in v5.20, what Perl does is
to treat non-Unicode code points as just typical unassigned Unicode
characters, and match accordingly. (Note: Unicode has atypical
unassigned code points. For example, it has noncharacter code points,
and ones that, when they do get assigned, are destined to be written
Right-to-left, as Arabic and Hebrew are. Perl assumes that no
non-Unicode code point has any atypical properties.)

Perl, in most cases, will raise a warning when matching an above-Unicode
code point against a Unicode property when the result is C<TRUE> for
C<\p{}>, and C<FALSE> for C<\P{}>. For example:

 chr(0x110000) =~ /\p{ASCII_Hex_Digit=True}/   # Fails, no warning
 chr(0x110000) =~ /\p{ASCII_Hex_Digit=False}/  # Succeeds, with warning

In both these examples, the character being matched is non-Unicode, so
Unicode doesn't define how it should match. It clearly isn't an ASCII
hex digit, so the first example clearly should fail, and so it does,
with no warning. But it is arguable that the second example should have
an undefined, hence C<FALSE>, result. So a warning is raised for it.

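The v5.20 rules can be tried out with a few matches. The following is
a sketch that assumes Perl v5.20 or later; on earlier releases the
results differ, as explained above. Warnings are collected in
C<@warnings> rather than printed, and all the variable names are
arbitrary.

```perl
use strict;
use warnings;

my $above = chr(0x110000);    # an above-Unicode code point
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# Treated like a typical unassigned code point: it is not a hex digit ...
my $is_hex  = $above =~ /\p{ASCII_Hex_Digit=True}/  ? 1 : 0;

# ... so the complementary property matches (this is the arguable case).
my $not_hex = $above =~ /\p{ASCII_Hex_Digit=False}/ ? 1 : 0;

# All non-Unicode code points are considered Unassigned.
my $unassigned = $above =~ /\p{Unassigned}/ ? 1 : 0;

# \p{All} is a Perl extension that is true for every code point.
my $all = $above =~ /\p{All}/ ? 1 : 0;

print "is_hex=$is_hex not_hex=$not_hex ",
      "unassigned=$unassigned all=$all\n";
print "number of warnings collected: ", scalar(@warnings), "\n";
```
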
Thus the warning is raised for many fewer cases than in earlier Perls,
and only when what the result is could be arguable. It turns out that
none of the optimizations made by Perl (or ever likely to be made)
cause the warning to be skipped, so it solves both problems of Perl's
earlier approach. The most commonly used property that is affected by
this change is C<\p{Unassigned}> which is a short form for
C<\p{General_Category=Unassigned}>. Starting in v5.20, all non-Unicode
code points are considered C<Unassigned>. In earlier releases the
matches failed because the result was considered undefined.

The only place where the warning is not raised when it arguably should
have been is if optimizations cause the whole pattern match to not even
be attempted. For example, Perl may figure out that for a string to
match a certain regular expression pattern, the string has to contain
the substring C<"foobar">. Before attempting the match, Perl may look
for that substring, and if not found, immediately fail the match without
actually trying it; so no warning gets generated even if the string
contains an above-Unicode code point.

This behavior is more "Do what I mean" than in earlier Perls for most
applications. But it catches fewer issues for code that needs to be
strictly Unicode compliant. Therefore there is an additional mode of
operation available to accommodate such code. This mode is enabled if a
regular expression pattern is compiled within the lexical scope where
the C<"non_unicode"> warning class has been made fatal, say by:

 use warnings FATAL => "non_unicode";

(see L<warnings>). In this mode of operation, Perl will raise the
warning for all matches against a non-Unicode code point (not just the
arguable ones), and it skips the optimizations that might cause the
warning to not be output. (It currently still won't warn if the match
isn't even attempted, like in the C<"foobar"> example above.)

In summary, Perl now normally treats non-Unicode code points as typical
Unicode unassigned code points for regular expression matches, raising a
warning only when it is arguable what the result should be. However, if
this warning has been made fatal, it isn't skipped.

There is one exception to all this. C<\p{All}> looks like a Unicode
property, but it is a Perl extension that is defined to be true for all
possible code points, Unicode or not, so no warning is ever generated
when matching this against a non-Unicode code point. (Prior to v5.20,
it was an exact synonym for C<\p{Any}>, matching code points C<0>
through C<0x10FFFF>.)

=head2 Security Implications of Unicode

First, read
L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.

Also, note the following:

=over 4

=item *

Malformed UTF-8

UTF-8 is very structured, so many combinations of bytes are invalid. In
the past, Perl tried to soldier on and make some sense of invalid
combinations, but this can lead to security holes, so now, if the Perl
core needs to process an invalid combination, it will either raise a
fatal error, or will replace those bytes by the sequence that forms the
Unicode REPLACEMENT CHARACTER, for which purpose Unicode created it.

Every code point can be represented by more than one possible
syntactically valid UTF-8 sequence. Early on, both Unicode and Perl
considered any of these to be valid, but now, all sequences longer
than the shortest possible one are considered to be malformed.

Unicode considers many code points to be illegal, or to be avoided.
Perl generally accepts them, once they have passed through any input
filters that may try to exclude them. These have been discussed above
(see "Surrogates" under UTF-16 in L</Unicode Encodings>,
L</Noncharacter code points>, and L</Beyond Unicode code points>).

=item *

Regular expression pattern matching may surprise you if you're not
accustomed to Unicode. Starting in Perl 5.14, several pattern
modifiers are available to control this, called the character set
modifiers. Details are given in L<perlre/Character set modifiers>.

=back

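One way to keep malformed UTF-8 of the kinds listed above out of a
program is to decode input strictly at the boundary. The sketch below
uses the L<Encode> module's strict C<"UTF-8"> encoding together with
C<Encode::FB_CROAK>; the wrapper name C<strict_decode> is made up for
this example, and returning C<undef> on failure is merely one possible
policy.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns the decoded character string, or undef
# if $octets is not acceptable under Encode's strict "UTF-8" rules.
sub strict_decode {
    my ($octets) = @_;

    # decode() may modify its argument when a CHECK value is given,
    # so work on a copy.
    my $copy  = $octets;
    my $chars = eval {
        Encode::decode('UTF-8', $copy, Encode::FB_CROAK());
    };
    return $chars;    # undef if decode() croaked
}

for my $octets ("caf\xC3\xA9", "\xC0\x80", "\xED\xA0\x80") {
    my $hex = join ' ', map { sprintf '%02X', ord } split //, $octets;
    printf "%-12s %s\n", $hex,
        defined strict_decode($octets) ? 'accepted' : 'rejected';
}
```

Here C<C0 80> is an overlong (non-shortest) encoding of C<U+0000>, and
C<ED A0 80> would be the encoding of the surrogate C<U+D800>.
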
As discussed elsewhere, Perl has one foot (two hooves?) planted in
each of two worlds: the old world of ASCII and single-byte locales, and
the new world of Unicode, upgrading when necessary.
If your legacy code does not explicitly use Unicode, no automatic
switch-over to Unicode should happen.

=head2 Unicode in Perl on EBCDIC

Unicode is supported on EBCDIC platforms. See L<perlebcdic>.

Unless ASCII vs. EBCDIC issues are specifically being discussed,
references to UTF-8 encoding in this document and elsewhere should be
read as meaning UTF-EBCDIC on EBCDIC platforms.
See L<perlebcdic/Unicode and UTF>.

Because UTF-EBCDIC is so similar to UTF-8, the differences are mostly
hidden from you; S<C<use utf8>> (and NOT something like
S<C<use utfebcdic>>) declares that the script is in the platform's
"native" 8-bit encoding of Unicode. (Similarly for the C<":utf8">
layer.)

=head2 Locales

See L<perllocale/Unicode and UTF-8>.

=head2 When Unicode Does Not Happen

There are still many places where Unicode (in some encoding or
another) could be given as arguments or received as results, or both in
Perl, but it is not, in spite of Perl having extensive ways to input and
output in Unicode, and a few other "entry points" like the C<@ARGV>
array (which can sometimes be interpreted as UTF-8).

The following are such interfaces. Also, see L</The "Unicode Bug">.
For all of these interfaces Perl
currently (as of v5.16.0) simply assumes byte strings both as arguments
and results, or UTF-8 strings if the (deprecated) C<encoding> pragma has been used.

One reason that Perl does not attempt to resolve the role of Unicode in
these situations is that the answers are highly dependent on the operating
system and the file system(s). For example, whether filenames can be
in Unicode and in exactly what kind of encoding, is not exactly a
portable concept. Similarly for C<qx> and C<system>: how well will the
"command-line interface" (and which of them?) handle Unicode?

=over 4

=item *

C<chdir>, C<chmod>, C<chown>, C<chroot>, C<exec>, C<link>, C<lstat>, C<mkdir>,
C<rename>, C<rmdir>, C<stat>, C<symlink>, C<truncate>, C<unlink>, C<utime>, C<-X>

=item *

C<%ENV>

=item *

C<glob> (aka the C<E<lt>*E<gt>>)

=item *

C<open>, C<opendir>, C<sysopen>

=item *

C<qx> (aka the backtick operator), C<system>

=item *

C<readdir>, C<readlink>

=back

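Because the interfaces listed above deal in byte strings, one common
approach is to do the encoding and decoding yourself. The sketch below
is only an illustration and rests on an assumption that Perl cannot
check for you: that the file system stores names as UTF-8, as is
typical on many Unix-like systems. The file name and the variable names
are made up for the example.

```perl
use strict;
use warnings;
use Encode ();
use File::Temp ();

my $dir  = File::Temp::tempdir(CLEANUP => 1);
my $name = "\x{2660}.txt";    # a character string (BLACK SPADE SUIT)

# Encode explicitly before handing the name to open().
my $path = "$dir/" . Encode::encode('UTF-8', $name);
open(my $fh, '>', $path) or die "Cannot create file: $!";
close($fh)               or die "Cannot close file: $!";

# readdir() returns byte strings; decode them explicitly.
opendir(my $dh, $dir) or die "Cannot open directory: $!";
my @decoded = map  { Encode::decode('UTF-8', $_) }
              grep { !/^\.\.?\z/ } readdir($dh);
closedir($dh);

print "round trip succeeded\n"
    if @decoded == 1 && $decoded[0] eq $name;
```
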
=head2 The "Unicode Bug"

The term "Unicode bug" has been applied to an inconsistency with the
code points in the C<Latin-1 Supplement> block, that is, between
128 and 255. Without a locale specified, unlike all other characters or
code points, these characters can have very different semantics
depending on the rules in effect. (Characters whose code points are
above 255 force Unicode rules; whereas the rules for ASCII characters
are the same under both ASCII and Unicode rules.)

Under Unicode rules, these upper-Latin1 characters are interpreted as
Unicode code points, which means they have the same semantics as Latin-1
(ISO-8859-1) and C1 controls.

As explained in L</ASCII Rules versus Unicode Rules>, under ASCII rules,
they are considered to be unassigned characters.

This can lead to unexpected results. For example, a string's
semantics can suddenly change if a code point above 255 is appended to
it, which changes the rules from ASCII to Unicode. As an
example, consider the following program and its output:

 $ perl -le'
     no feature "unicode_strings";
     $s1 = "\xC2";
     $s2 = "\x{2660}";
     for ($s1, $s2, $s1.$s2) {
         print /\w/ || 0;
     }
 '
 0
 0
 1

If there's no C<\w> in C<$s1> nor in C<$s2>, why does their concatenation
have one?

This anomaly stems from Perl's attempt to not disturb older programs that
didn't use Unicode, along with Perl's desire to add Unicode support
seamlessly. But the result turned out to not be seamless. (By the way,
you can choose to be warned when things like this happen. See
C<L<encoding::warnings>>.)

L<S<C<use feature 'unicode_strings'>>|feature/The 'unicode_strings' feature>
was added, starting in Perl v5.12, to address this problem. It affects
these things:

=over 4

=item *

Changing the case of a scalar, that is, using C<uc()>, C<ucfirst()>, C<lc()>,
and C<lcfirst()>, or C<\L>, C<\U>, C<\u> and C<\l> in double-quotish
contexts, such as regular expression substitutions.

Under C<unicode_strings> starting in Perl 5.12.0, Unicode rules are
generally used. See L<perlfunc/lc> for details on how this works
in combination with various other pragmas.

=item *

Using caseless (C</i>) regular expression matching.

Starting in Perl 5.14.0, regular expressions compiled within
the scope of C<unicode_strings> use Unicode rules
even when executed or compiled into larger
regular expressions outside the scope.

=item *

Matching any of several properties in regular expressions.

These properties are C<\b> (without braces), C<\B> (without braces),
C<\s>, C<\S>, C<\w>, C<\W>, and all the Posix character classes
I<except> C<[[:ascii:]]>.

Starting in Perl 5.14.0, regular expressions compiled within
the scope of C<unicode_strings> use Unicode rules
even when executed or compiled into larger
regular expressions outside the scope.

=item *

In C<quotemeta> or its inline equivalent C<\Q>.

Starting in Perl 5.16.0, consistent quoting rules are used within the
scope of C<unicode_strings>, as described in L<perlfunc/quotemeta>.
Prior to that, or outside its scope, no code points above 127 are quoted
in UTF-8 encoded strings, but in byte encoded strings, code points
between 128-255 are always quoted.

=item *

In the C<..> or L<range|perlop/Range Operators> operator.

Starting in Perl 5.26.0, the range operator on strings treats their lengths
consistently within the scope of C<unicode_strings>. Prior to that, or
outside its scope, it could produce strings whose length in characters
exceeded that of the right-hand side, where the right-hand side took up more
bytes than the correct range endpoint.

=back

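The effect of C<unicode_strings> on the C<\w> anomaly shown earlier in
this section can be seen side by side. This is a minimal sketch,
assuming Perl v5.14 or later and no locale in effect; the variable
names are arbitrary.

```perl
use strict;
use warnings;

my $s = "\xC2";    # LATIN CAPITAL LETTER A WITH CIRCUMFLEX, one byte

# Outside the feature, a non-UTF-8-encoded string gets ASCII rules.
my $without = do { no feature 'unicode_strings';  $s =~ /\w/ ? 1 : 0 };

# A pattern compiled inside the feature's scope uses Unicode rules.
my $with    = do { use feature 'unicode_strings'; $s =~ /\w/ ? 1 : 0 };

# Without the feature, upgrading the string also switches the rules,
# which is the inconsistency that the "Unicode bug" refers to.
my $upgraded = $s;
utf8::upgrade($upgraded);
my $after_upgrade =
    do { no feature 'unicode_strings'; $upgraded =~ /\w/ ? 1 : 0 };

print "$without $with $after_upgrade\n";
```

Under the stated assumptions this should print C<0 1 1>: the same
character matches C<\w> or not depending on the feature and on the
string's internal encoding.
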
You can see from the above that the effect of C<unicode_strings>
increased over several Perl releases. (And Perl's support for Unicode
continues to improve; it's best to use the latest available release in
order to get the most complete and accurate results possible.) Note that
C<unicode_strings> is automatically chosen if you S<C<use 5.012>> or
higher.

For Perls earlier than those described above, or when a string is passed
to a function outside the scope of C<unicode_strings>, see the next section.

=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)

Sometimes (see L</"When Unicode Does Not Happen"> or L</The "Unicode Bug">)
there are situations where you simply need to force a byte
string into UTF-8, or vice versa. The standard module L<Encode> can be
used for this, or the low-level calls
L<C<utf8::upgrade($bytestring)>|utf8/Utility functions> and
L<C<utf8::downgrade($utf8string[, FAIL_OK])>|utf8/Utility functions>.

Note that C<utf8::downgrade()> can fail if the string contains characters
that don't fit into a byte.

Calling either function on a string that already is in the desired state is a
no-op.

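The behavior of the two low-level calls can be summarized with a short
sketch. The variable names are arbitrary; C<utf8::is_utf8()> is used
only to peek at the internal encoding, which ordinary code rarely needs
to do.

```perl
use strict;
use warnings;

my $latin1 = "caf\xE9";    # four characters, all below 256

# Upgrading changes the internal encoding, not the characters.
utf8::upgrade($latin1);
my $flag_on     = utf8::is_utf8($latin1) ? 1 : 0;
my $same_length = length($latin1) == 4   ? 1 : 0;

# Downgrading succeeds because every character fits into a byte.
my $down_ok = utf8::downgrade($latin1) ? 1 : 0;

# A character above 255 cannot be downgraded.
my $wide      = "\x{263A}";
my $soft_fail = utf8::downgrade($wide, 1) ? 1 : 0;   # FAIL_OK: false
my $died      = eval { utf8::downgrade($wide); 1 } ? 0 : 1;

print "flag_on=$flag_on same_length=$same_length down_ok=$down_ok ",
      "soft_fail=$soft_fail died=$died\n";
```
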
L</ASCII Rules versus Unicode Rules> gives all the ways that a string is
made to use Unicode rules.

=head2 Using Unicode in XS

See L<perlguts/"Unicode Support"> for an introduction to Unicode at
the XS level, and L<perlapi/Unicode Support> for the API details.

=head2 Hacking Perl to work on earlier Unicode versions (for very serious hackers only)

Perl by default comes with the latest supported Unicode version built-in, but
the goal is to allow you to change to use any earlier one. In Perls
v5.20 and v5.22, however, the earliest usable version is Unicode 5.1.
Perl v5.18 and v5.24 are able to handle all earlier versions.

Download the files in the desired version of Unicode from the Unicode web
site (L<http://www.unicode.org>). These should replace the existing files in
F<lib/unicore> in the Perl source tree. Follow the instructions in
F<README.perl> in that directory to change some of their names, and then build
perl (see L<INSTALL>).

=head2 Porting code from perl-5.6.X

Perls starting in 5.8 have a different Unicode model from 5.6. In 5.6 the
programmer was required to use the C<utf8> pragma to declare that a
given scope expected to deal with Unicode data and had to make sure that
only Unicode data were reaching that scope. If you have code that is
working with 5.6, you will need some of the following adjustments to
your code. The examples are written such that the code will continue to
work under 5.6, so you should be safe to try them out.

=over 3

=item *

A filehandle that should read or write UTF-8

  if ($] >= 5.008) {
    binmode $fh, ":encoding(UTF-8)";
  }

=item *

A scalar that is going to be passed to some extension

Be it C<Compress::Zlib>, C<Apache::Request> or any extension that has no
mention of Unicode in the manpage, you need to make sure that the
UTF8 flag is stripped off. Note that at the time of this writing
(January 2012) the mentioned modules are not UTF-8-aware. Please
check the documentation to verify if this is still true.

  if ($] >= 5.008) {
    require Encode;
    $val = Encode::encode("UTF-8", $val); # make octets
  }

=item *

A scalar we got back from an extension

If you believe the scalar comes back as UTF-8, you will most likely
want the UTF8 flag restored:

  if ($] >= 5.008) {
    require Encode;
    $val = Encode::decode("UTF-8", $val);
  }

=item *

Same thing, if you are really sure it is UTF-8

  if ($] >= 5.008) {
    require Encode;
    Encode::_utf8_on($val);
  }

=item *

A wrapper for L<DBI> C<fetchrow_array> and C<fetchrow_hashref>

When the database contains only UTF-8, a wrapper function or method is
a convenient way to replace all your C<fetchrow_array> and
C<fetchrow_hashref> calls. A wrapper function will also make it easier to
adapt to future enhancements in your database driver. Note that at the
time of this writing (January 2012), the DBI has no standardized way
to deal with UTF-8 data. Please check the L<DBI documentation|DBI> to verify if
that is still true.

  sub fetchrow {
    # $what is one of fetchrow_{array,hashref}
    my($self, $sth, $what) = @_;
    if ($] < 5.008) {
      return $sth->$what;
    } else {
      require Encode;
      if (wantarray) {
        my @arr = $sth->$what;
        for (@arr) {
          defined && /[^\000-\177]/ && Encode::_utf8_on($_);
        }
        return @arr;
      } else {
        my $ret = $sth->$what;
        if (ref $ret) {
          for my $k (keys %$ret) {
            defined
            && /[^\000-\177]/
            && Encode::_utf8_on($_) for $ret->{$k};
          }
          return $ret;
        } else {
          defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
          return $ret;
        }
      }
    }
  }

=item *

A large scalar that you know can only contain ASCII

Scalars that contain only ASCII and are marked as UTF-8 are sometimes
a drag to your program. If you recognize such a situation, just remove
the UTF8 flag:

  utf8::downgrade($val) if $] >= 5.008;

=back

=head1 BUGS

See also L</The "Unicode Bug"> above.

=head2 Interaction with Extensions

When Perl exchanges data with an extension, the extension should be
able to understand the UTF8 flag and act accordingly. If the
extension doesn't recognize that flag, it's likely that the extension
will return incorrectly-flagged data.

So if you're working with Unicode data, consult the documentation of
every module you're using to see if there are any issues with Unicode data
exchange. If the documentation does not talk about Unicode at all,
suspect the worst and probably look at the source to learn how the
module is implemented. Modules written completely in Perl shouldn't
cause problems. Modules that directly or indirectly access code written
in other programming languages are at risk.

For affected functions, the simple strategy to avoid data corruption is
to always make the encoding of the exchanged data explicit. Choose an
encoding that you know the extension can handle. Convert arguments passed
to the extensions to that encoding and convert results back from that
encoding. Write wrapper functions that do the conversions for you, so
you can later change the functions when the extension catches up.

To provide an example, let's say the popular C<Foo::Bar::escape_html>
function doesn't deal with Unicode data yet. The wrapper function
would convert the argument to raw UTF-8 and convert the result back to
Perl's internal representation like so:

  sub my_escape_html ($) {
    my($what) = shift;
    return unless defined $what;
    require Encode;
    Encode::decode("UTF-8", Foo::Bar::escape_html(
                              Encode::encode("UTF-8", $what)));
  }

Sometimes, when the extension does not convert data but just stores
and retrieves it, you will be able to use the otherwise
dangerous L<C<Encode::_utf8_on()>|Encode/_utf8_on> function. Let's say
the popular C<Foo::Bar> extension, written in C, provides a C<param>
method that lets you store and retrieve data according to these prototypes:

  $self->param($name, $value);            # set a scalar
  $value = $self->param($name);           # retrieve a scalar

If it does not yet provide support for any encoding, one could write a
derived class with such a C<param> method:

  sub param {
    my($self,$name,$value) = @_;
    utf8::upgrade($name);     # make sure it is UTF-8 encoded
    if (defined $value) {
      utf8::upgrade($value);  # make sure it is UTF-8 encoded
      return $self->SUPER::param($name,$value);
    } else {
      my $ret = $self->SUPER::param($name);
      Encode::_utf8_on($ret); # we know, it is UTF-8 encoded
      return $ret;
    }
  }

Some extensions provide filters on data entry/exit points, such as
C<DB_File::filter_store_key> and family. Look out for such filters in
the documentation of your extensions; they can make the transition to
Unicode data much easier.

=head2 Speed

Some functions are slower when working on UTF-8 encoded strings than
on byte encoded strings. All functions that need to hop over
characters such as C<length()>, C<substr()> or C<index()>, or matching
regular expressions can work B<much> faster when the underlying data are
byte-encoded.

In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
a caching scheme was introduced which improved the situation. In general,
operations with UTF-8 encoded strings are still slower. As an example,
the Unicode properties (character classes) like C<\p{Nd}> are known to
be quite a bit slower (5-20 times) than their simpler counterparts
like C<[0-9]> (then again, there are hundreds of Unicode characters matching
C<Nd> compared with the 10 ASCII characters matching C<[0-9]>).

=head1 SEE ALSO

L<perlunitut>, L<perluniintro>, L<perluniprops>, L<Encode>, L<open>, L<utf8>, L<bytes>,
L<perlretut>, L<perlvar/"${^UNICODE}">,
L<http://www.unicode.org/reports/tr44>.

=cut