This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5
perlunicode: Nits, minor fixes
=head1 NAME

perlunicode - Unicode support in Perl

=head1 DESCRIPTION

=head2 Important Caveats

Unicode support is an extensive requirement. While Perl does not
implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.

People who want to learn to use Unicode in Perl should probably read
the L<Perl Unicode tutorial, perlunitut|perlunitut> and
L<perluniintro> before reading
this reference document.

Also, the use of Unicode may present security issues that aren't obvious.
Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.

=over 4

=item Safest if you C<use feature 'unicode_strings'>

In order to preserve backward compatibility, Perl does not turn
on full internal Unicode support unless the pragma
L<S<C<use feature 'unicode_strings'>>|feature/The 'unicode_strings' feature>
is specified. (This is automatically
selected if you S<C<use 5.012>> or higher.) Failure to do this can
trigger surprises. See L</The "Unicode Bug"> below.

This pragma doesn't affect I/O. Nor does it change the internal
representation of strings, only their interpretation. There are still
several places where Unicode isn't fully supported, such as in
filenames.

=item Input and Output Layers

Perl knows when a filehandle uses Perl's internal Unicode encodings
(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
the C<:encoding(utf8)> layer. Other encodings can be converted to Perl's
encoding on input or from Perl's encoding on output by use of the
C<:encoding(...)> layer. See L<open>.

To indicate that Perl source itself is in UTF-8, use C<use utf8;>.

=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts

As a compatibility measure, the C<use utf8> pragma must be explicitly
included to enable recognition of UTF-8 in the Perl scripts themselves
(in string or regular expression literals, or in identifier names) on
ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
machines. B<These are the only times when an explicit C<use utf8>
is needed.> See L<utf8>.

=item C<BOM>-marked scripts and UTF-16 scripts autodetected

If a Perl script begins marked with the Unicode C<BOM> (UTF-16LE, UTF-16BE,
or UTF-8), or if the script looks like non-C<BOM>-marked UTF-16 of either
endianness, Perl will correctly read in the script as Unicode.
(C<BOM>less UTF-8 cannot be effectively recognized or differentiated from
ISO 8859-1 or other eight-bit encodings.)

=item C<use encoding> needed to upgrade non-Latin-1 byte strings

By default, there is a fundamental asymmetry in Perl's Unicode model:
implicit upgrading from byte strings to Unicode strings assumes that
they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
downgraded with UTF-8 encoding. This happens because the first 256
code points in Unicode happen to agree with Latin-1.

See L</"Byte and Character Semantics"> for more details.

=back

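The first caveat in the list above can be seen in a few lines. This is a
minimal sketch, assuming an ASCII platform, Perl 5.14 or later, and no
C<use locale> in effect; the variable names are illustrative only. It applies
C<\w> and C<uc> to the same one-byte string with and without
C<unicode_strings>.

```perl
use strict;
use warnings;

# U+00E9 (e with acute accent) written as a single native byte; on an
# ASCII platform this literal is not stored internally as UTF-8.
my $e_acute = "\xE9";

my ($word_without, $upper_without);
{
    no feature 'unicode_strings';       # the historical default behavior
    $word_without  = $e_acute =~ /\w/ ? 1 : 0;
    $upper_without = uc $e_acute;
}

my ($word_with, $upper_with);
{
    use feature 'unicode_strings';      # Unicode rules for all code points
    $word_with  = $e_acute =~ /\w/ ? 1 : 0;
    $upper_with = uc $e_acute;
}

printf "without: \\w=%d uc=%02X\n", $word_without, ord $upper_without;
printf "with:    \\w=%d uc=%02X\n", $word_with,    ord $upper_with;
```

Under these assumptions the first block treats the byte as having no word or
case properties, while the second treats it as the Latin-1 letter it
represents.
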
=head2 Byte and Character Semantics

Perl uses logically-wide characters to represent strings internally.

Starting in Perl 5.14, Perl-level operations work with
characters rather than bytes within the scope of a
C<L<use feature 'unicode_strings'|feature>> (or equivalently
C<use 5.012> or higher). (This is not true if bytes have been
explicitly requested by C<L<use bytes|bytes>>, nor necessarily true
for interactions with the platform's operating system.)

For earlier Perls, and when C<unicode_strings> is not in effect, Perl
provides a fairly safe environment that can handle both types of
semantics in programs. For operations where Perl can unambiguously
decide that the input data are characters, Perl switches to character
semantics. For operations where this determination cannot be made
without additional information from the user, Perl decides in favor of
compatibility and chooses to use byte semantics.

When C<use locale> (but not C<use locale ':not_characters'>) is in
effect, Perl uses the rules associated with the current locale.
(C<use locale> overrides C<use feature 'unicode_strings'> in the same scope;
while C<use locale ':not_characters'> effectively also selects
C<use feature 'unicode_strings'> in its scope; see L<perllocale>.)
Otherwise, Perl uses the platform's native
byte semantics for characters whose code points are less than 256, and
Unicode rules for those greater than 255. Under byte semantics, non-ASCII
characters below 256 are undefined except for their
ordinal numbers. This means that none have case (upper and lower), nor are any
a member of character classes, like C<[:alpha:]> or C<\w>. (But all do belong
to the C<\W> class or the Perl regular expression extension C<[:^alpha:]>.)

This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if
none of the program's inputs were marked as being a source of Unicode
character data. Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as C<%ENV>),
or from literals and constants in the source text.

The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
Note that this pragma is only required while Perl defaults to byte
semantics; when character semantics become the default, this pragma
may become a no-op. See L<utf8>.

If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will have
character semantics. This can cause surprises: See L</BUGS>, below.
You can choose to be warned when this happens. See C<L<encoding::warnings>>.

Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
logically just a number ranging from 0 to 2**31 or so. Larger
characters may encode into longer sequences of bytes internally, but
this internal detail is mostly hidden from Perl code.
See L<perluniintro> for more.

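To illustrate the difference between a character and the bytes that may
encode it, here is a small sketch. It uses the core L<Encode> module to
produce the UTF-8 octets explicitly; the variable names are made up for the
example.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $smiley = "\x{263A}";                 # one character, WHITE SMILING FACE
my $octets = encode('UTF-8', $smiley);   # the same character as UTF-8 bytes

my $char_count = length $smiley;         # counts characters
my $byte_count = length $octets;         # counts bytes of the encoded form

printf "%d character, %d bytes\n", $char_count, $byte_count;
```

C<length> on the original string reports one character, even though its UTF-8
form occupies three bytes.
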
=head2 Effects of Character Semantics

Character semantics have the following effects:

=over 4

=item *

Strings--including hash keys--and regular expression patterns may
contain characters that have ordinal values larger than 255.

If you use a Unicode editor to edit your program, Unicode characters may
occur directly within the literal strings in UTF-8 encoding, or UTF-16.
(The former requires a C<BOM> or C<use utf8>, the latter requires a C<BOM>.)

Unicode characters can also be added to a string by using the C<\N{U+...}>
notation. The Unicode code point for the desired character, in hexadecimal,
should be placed in the braces, after the C<U+>. For instance, a smiley face is
C<\N{U+263A}>.

Alternatively, you can use the C<\x{...}> notation for characters C<0x100> and
above. For characters below C<0x100> you may get byte semantics instead of
character semantics; see L</The "Unicode Bug">. On EBCDIC machines there is
the additional problem that the value for such characters gives the EBCDIC
character rather than the Unicode one, thus it is more portable to use
C<\N{U+...}> instead.

Additionally, you can use the C<\N{...}> notation and put the official
Unicode character name within the braces, such as
C<\N{WHITE SMILING FACE}>. This automatically loads the L<charnames>
module with the C<:full> and C<:short> options. If you prefer different
options for this module, you can instead, before the C<\N{...}>,
explicitly load it with your desired options; for example,

   use charnames ':loose';

=item *

If an appropriate L<encoding> is specified, identifiers within the
Perl script may contain Unicode alphanumeric characters, including
ideographs. Perl does not currently attempt to canonicalize variable
names.

=item *

Regular expressions match characters instead of bytes. C<"."> matches
a character instead of a byte.

=item *

Bracketed character classes in regular expressions match characters instead of
bytes and match against the character properties specified in the
Unicode properties database. C<\w> can be used to match a Japanese
ideograph, for instance.

=item *

Named Unicode properties, scripts, and block ranges may be used (like bracketed
character classes) by using the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".
See L</"Unicode Character Properties"> for more details.

You can define your own character properties and use them
in the regular expression with the C<\p{}> or C<\P{}> construct.
See L</"User-Defined Character Properties"> for more details.

=item *

The special pattern C<\X> matches a logical character, an "extended grapheme
cluster" in Standardese. In Unicode what appears to the user to be a single
character, for example an accented C<G>, may in fact be composed of a sequence
of characters, in this case a C<G> followed by an accent character. C<\X>
will match the entire sequence.

=item *

The C<tr///> operator translates characters instead of bytes. Note
that the C<tr///CU> functionality has been removed. For similar
functionality see C<pack('U0', ...)> and C<pack('C0', ...)>.

=item *

Case translation operators use the Unicode case translation tables
when character input is provided. Note that C<uc()>, or C<\U> in
interpolated strings, translates to uppercase, while C<ucfirst>,
or C<\u> in interpolated strings, translates to titlecase in languages
that make the distinction (which is equivalent to uppercase in languages
without the distinction).

=item *

Most operators that deal with positions or lengths in a string will
automatically switch to using character positions, including
C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
C<sprintf()>, C<write()>, and C<length()>. An operator that
specifically does not switch is C<vec()>. Operators that really don't
care include operators that treat strings as a bucket of bits such as
C<sort()>, and operators dealing with filenames.

=item *

The C<pack()>/C<unpack()> letter C<C> does I<not> change, since it is often
used for byte-oriented formats. Think C<char> in the C language.

There is a new C<U> specifier that converts between Unicode characters
and code points. There is also a C<W> specifier that is the equivalent of
C<chr>/C<ord> and properly handles character values even if they are above 255.

=item *

The C<chr()> and C<ord()> functions work on characters, similar to
C<pack("W")> and C<unpack("W")>, I<not> C<pack("C")> and
C<unpack("C")>. C<pack("C")> and C<unpack("C")> are methods for
emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.
While these methods reveal the internal encoding of Unicode strings,
that is not something one normally needs to care about at all.

=item *

The bit string operators, C<& | ^ ~>, can operate on character data.
However, for backward compatibility, such as when using bit string
operations when characters are all less than 256 in ordinal value, one
should not use C<~> (the bit complement) with characters of both
values less than 256 and values greater than 255. Most importantly,
DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
will not hold. The reason for this mathematical I<faux pas> is that
the complement cannot return B<both> the 8-bit (byte-wide) bit
complement B<and> the full character-wide bit complement.

=item *

There is a CPAN module, C<L<Unicode::Casing>>, which allows you to define
your own mappings to be used in C<lc()>, C<lcfirst()>, C<uc()>,
C<ucfirst()>, and C<fc()> (or their double-quoted string inlined
versions such as C<\U>).
(Prior to Perl 5.16, this functionality was partially provided
in the Perl core, but suffered from a number of insurmountable
drawbacks, so the CPAN module was written instead.)

=item *

And finally, C<scalar reverse()> reverses by character rather than by byte.

=back

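Several of the effects listed above can be checked with a short script. The
following is only a sketch, assuming Perl 5.16 or later (so that
C<\N{...}> names load L<charnames> automatically); the variable names and the
choice of sample characters are illustrative.

```perl
use v5.16;    # turns on strict and the unicode_strings feature
use warnings;

# Three spellings of the same character, U+263A.
my $by_code = "\N{U+263A}";
my $by_name = "\N{WHITE SMILING FACE}";
my $by_hex  = "\x{263A}";
my $same    = ($by_code eq $by_name && $by_name eq $by_hex) ? 1 : 0;

# "G" followed by COMBINING ACUTE ACCENT: two code points that \X
# treats as one extended grapheme cluster.
my $g_acute  = "G\N{U+0301}";
my $points   = length $g_acute;
my @clusters = $g_acute =~ /(\X)/g;

# uc() maps to uppercase and ucfirst() to titlecase; the two differ
# for LATIN SMALL LETTER DZ WITH CARON (U+01C6).
my $dz       = "\N{U+01C6}";
my $upper_cp = ord(uc $dz);          # U+01C4
my $title_cp = ord(ucfirst $dz);     # U+01C5

# reverse() in scalar context works character by character.
my $reversed = scalar reverse "a\N{U+263A}b";

printf "same=%d points=%d clusters=%d uc=U+%04X ucfirst=U+%04X\n",
    $same, $points, scalar(@clusters), $upper_cp, $title_cp;
```

None of these checks depends on how Perl stores the strings internally; they
only use the character-level operators described in the list.
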
=head2 Unicode Character Properties

(The only time that Perl considers a sequence of individual code
points as a single logical character is in the C<\X> construct, already
mentioned above. Therefore "character" in this discussion means a single
Unicode code point.)

Very nearly all Unicode character properties are accessible through
regular expressions by using the C<\p{}> "matches property" construct
and the C<\P{}> "doesn't match property" for its negation.

For instance, C<\p{Uppercase}> matches any single character with the Unicode
C<"Uppercase"> property, while C<\p{L}> matches any character with a
C<General_Category> of C<"L"> (letter) (see
L</General_Category> below). Braces are not
required for single letter property names, so C<\p{L}> is equivalent to C<\pL>.

More formally, C<\p{Uppercase}> matches any single character whose Unicode
C<Uppercase> property value is C<True>, and C<\P{Uppercase}> matches any character
whose C<Uppercase> property value is C<False>, and they could have been written as
C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively.

This formality is needed when properties are not binary; that is, if they can
take on more values than just C<True> and C<False>. For example, the
C<Bidi_Class> property (see L</"Bidirectional Character Types"> below)
can take on several different
values, such as C<Left>, C<Right>, C<Whitespace>, and others. To match these, one needs
to specify both the property name (C<Bidi_Class>), AND the value being
matched against
(C<Left>, C<Right>, I<etc.>). This is done, as in the examples above, by having the
two components separated by an equal sign (or interchangeably, a colon), like
C<\p{Bidi_Class: Left}>.

All Unicode-defined character properties may be written in these compound forms
of C<\p{I<property>=I<value>}> or C<\p{I<property>:I<value>}>, but Perl provides some
additional properties that are written only in the single form, as well as
single-form short-cuts for all binary properties and certain others described
below, in which you may omit the property name and the equals or colon
separator.

Most Unicode character properties have at least two synonyms (or aliases if you
prefer): a short one that is easier to type and a longer one that is more
descriptive and hence easier to understand. Thus the C<"L"> and
C<"Letter"> properties above are equivalent and can be used
interchangeably. Likewise, C<"Upper"> is a synonym for C<"Uppercase">,
and we could have written C<\p{Uppercase}> equivalently as C<\p{Upper}>.
Also, there are typically various synonyms for the values the property
can be. For binary properties, C<"True"> has 3 synonyms: C<"T">,
C<"Yes">, and C<"Y">; and C<"False"> has correspondingly C<"F">,
C<"No">, and C<"N">. But be careful. A short form of a value for one
property may not mean the same thing as the same short form for another.
Thus, for the C<L</General_Category>> property, C<"L"> means
C<"Letter">, but for the L<C<Bidi_Class>|/Bidirectional Character Types>
property, C<"L"> means C<"Left">. A complete list of properties and
synonyms is in L<perluniprops>.

Upper/lower case differences in property names and values are irrelevant;
thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>.
Similarly, you can add or subtract underscores anywhere in the middle of a
word, so that these are also equivalent to C<\p{U_p_p_e_r}>. And white space
is irrelevant adjacent to non-word characters, such as the braces and the equals
or colon separators, so C<\p{ Upper }> and C<\p{ Upper_case : Y }> are
equivalent to these as well. In fact, white space and even
hyphens can usually be added or deleted anywhere. So even C<\p{ Up-per case = Yes}> is
equivalent. All this is called "loose-matching" by Unicode. The few places
where stricter matching is used are in the middle of numbers, and in the Perl
extension properties that begin or end with an underscore. Stricter matching
cares about white space (except adjacent to non-word characters),
hyphens, and non-interior underscores.

You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
(C<^>) between the first brace and the property name: C<\p{^Tamil}> is
equal to C<\P{Tamil}>.

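The equivalences described above can be tried out directly. This sketch
compiles a handful of the spellings mentioned in the text and checks that they
agree on two sample characters; the array and variable names are invented for
the example, and only a few of the possible loose-matching variants are shown.

```perl
use strict;
use warnings;

# Several spellings that the text describes as equivalent.
my @upper_forms = (
    qr/\p{Uppercase}/,
    qr/\p{Upper}/,
    qr/\p{upper}/,
    qr/\p{Uppercase=True}/,
    qr/\p{Uppercase: Y}/,
);

# Two ways to negate a property.
my @not_upper_forms = (qr/\P{Uppercase}/, qr/\p{^Uppercase}/);

# True when no pattern in the list disagrees with the expectation.
my $all_match_A   = !grep { "A" !~ $_ } @upper_forms;
my $none_match_a  = !grep { "a" =~ $_ } @upper_forms;
my $negated_match = !grep { "a" !~ $_ } @not_upper_forms;

# Single-letter General_Category values need no braces.
my $letter_forms_agree = ("a" =~ /\pL/ && "a" =~ /\p{L}/ && "1" !~ /\pL/);

print "all forms agree\n"
    if $all_match_A && $none_match_a && $negated_match && $letter_forms_agree;
```
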
Almost all properties are immune to case-insensitive matching. That is,
adding a C</i> regular expression modifier does not change what they
match. There are two sets that are affected.
The first set is
C<Uppercase_Letter>,
C<Lowercase_Letter>,
and C<Titlecase_Letter>,
all of which match C<Cased_Letter> under C</i> matching.
And the second set is
C<Uppercase>,
C<Lowercase>,
and C<Titlecase>,
all of which match C<Cased> under C</i> matching.
This set also includes its subsets C<PosixUpper> and C<PosixLower>, both
of which under C</i> match C<PosixAlpha>.
(The difference between these sets is that some things, such as Roman
numerals, come in both upper and lower case so they are C<Cased>, but
aren't considered letters, so they aren't C<Cased_Letter>s.)

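As a hedged illustration of the two affected sets, the sketch below uses a
lowercase letter and a Roman numeral as sample characters. The expectations in
the comments follow from the description above; the variable names are
illustrative only.

```perl
use strict;
use warnings;

my $small_a = "a";           # General_Category Ll
my $roman_1 = "\x{2160}";    # ROMAN NUMERAL ONE: Uppercase, but a letter number (Nl)

# First set: Lu matches like Cased_Letter once /i is added.
my $lu_plain  = $small_a =~ /\p{Lu}/  ? 1 : 0;
my $lu_folded = $small_a =~ /\p{Lu}/i ? 1 : 0;

# Second set: Lowercase matches like Cased under /i, which includes the
# Roman numeral; Ll under /i stays restricted to cased letters.
my $roman_cased  = $roman_1 =~ /\p{Lowercase}/i ? 1 : 0;
my $roman_letter = $roman_1 =~ /\p{Ll}/i        ? 1 : 0;

printf "Lu=%d Lu/i=%d Lowercase/i=%d Ll/i=%d\n",
    $lu_plain, $lu_folded, $roman_cased, $roman_letter;
```
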
See L</Beyond Unicode code points> for special considerations when
matching Unicode properties against non-Unicode code points.

=head3 B<General_Category>

Every Unicode character is assigned a general category, which is the "most
usual categorization of a character" (from
L<http://www.unicode.org/reports/tr44>).

The compound way of writing these is like C<\p{General_Category=Number}>
(short: C<\p{gc:n}>). But Perl furnishes shortcuts in which everything up
through the equal or colon separator is omitted. So you can instead just write
C<\pN>.

Here are the short and long forms of the values the C<General_Category> property
can have:

    Short       Long

    L           Letter
    LC, L&      Cased_Letter (that is: [\p{Ll}\p{Lu}\p{Lt}])
    Lu          Uppercase_Letter
    Ll          Lowercase_Letter
    Lt          Titlecase_Letter
    Lm          Modifier_Letter
    Lo          Other_Letter

    M           Mark
    Mn          Nonspacing_Mark
    Mc          Spacing_Mark
    Me          Enclosing_Mark

    N           Number
    Nd          Decimal_Number (also Digit)
    Nl          Letter_Number
    No          Other_Number

    P           Punctuation (also Punct)
    Pc          Connector_Punctuation
    Pd          Dash_Punctuation
    Ps          Open_Punctuation
    Pe          Close_Punctuation
    Pi          Initial_Punctuation
                (may behave like Ps or Pe depending on usage)
    Pf          Final_Punctuation
                (may behave like Ps or Pe depending on usage)
    Po          Other_Punctuation

    S           Symbol
    Sm          Math_Symbol
    Sc          Currency_Symbol
    Sk          Modifier_Symbol
    So          Other_Symbol

    Z           Separator
    Zs          Space_Separator
    Zl          Line_Separator
    Zp          Paragraph_Separator

    C           Other
    Cc          Control (also Cntrl)
    Cf          Format
    Cs          Surrogate
    Co          Private_Use
    Cn          Unassigned

Single-letter properties match all characters in any of the
two-letter sub-properties starting with the same letter.
C<LC> and C<L&> are special: both are aliases for the set consisting of
everything matched by C<Ll>, C<Lu>, and C<Lt>.

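One way to explore the table above is to ask which two-letter category a given
character falls into. The C<general_category> helper below is hypothetical
(it is not part of Perl or of any module); it simply tries each short name
from the table in turn.

```perl
use strict;
use warnings;

# The two-letter General_Category values from the table above.
my @categories = qw(
    Lu Ll Lt Lm Lo  Mn Mc Me  Nd Nl No
    Pc Pd Ps Pe Pi Pf Po  Sm Sc Sk So
    Zs Zl Zp  Cc Cf Cs Co Cn
);

# Hypothetical helper: return the first two-letter category that matches
# a single-character string, or nothing if none does.
sub general_category {
    my ($char) = @_;
    for my $gc (@categories) {
        my $pattern = '\A\p{' . $gc . '}\z';   # e.g. \A\p{Lu}\z
        return $gc if $char =~ /$pattern/;
    }
    return;
}

for my $char ("A", "5", "_", "+", " ", "\x{0301}") {
    printf "U+%04X => %s\n", ord $char, general_category($char);
}
```

Because every code point has exactly one general category, the loop stops at
the first match. A single-letter pattern such as C<\pN> would accept anything
that this helper reports as C<Nd>, C<Nl>, or C<No>.
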
=head3 B<Bidirectional Character Types>

Because scripts differ in their directionality (Hebrew and Arabic are
written right to left, for example) Unicode supplies a C<Bidi_Class> property.
Some of the values this property can have are:

    Value       Meaning

    L           Left-to-Right
    LRE         Left-to-Right Embedding
    LRO         Left-to-Right Override
    R           Right-to-Left
    AL          Arabic Letter
    RLE         Right-to-Left Embedding
    RLO         Right-to-Left Override
    PDF         Pop Directional Format
    EN          European Number
    ES          European Separator
    ET          European Terminator
    AN          Arabic Number
    CS          Common Separator
    NSM         Non-Spacing Mark
    BN          Boundary Neutral
    B           Paragraph Separator
    S           Segment Separator
    WS          Whitespace
    ON          Other Neutrals

This property is always written in the compound form.
For example, C<\p{Bidi_Class:R}> matches characters that are normally
written right to left. Unlike the
C<L</General_Category>> property, this
property can have more values added in a future Unicode release. Those
listed above comprised the complete set for many Unicode releases, but
others were added in Unicode 6.3; you can always find what the
current ones are in L<perluniprops>. And
L<http://www.unicode.org/reports/tr9/> describes how to use them.

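A brief sketch of the compound form in use, with one sample character for each
of four values from the table. The sample characters and variable names are
chosen purely for illustration.

```perl
use strict;
use warnings;

my $hebrew_alef = "\x{05D0}";   # HEBREW LETTER ALEF
my $arabic_alef = "\x{0627}";   # ARABIC LETTER ALEF

my $is_r  = $hebrew_alef =~ /\p{Bidi_Class:R}/  ? 1 : 0;   # Right-to-Left
my $is_al = $arabic_alef =~ /\p{Bidi_Class:AL}/ ? 1 : 0;   # Arabic Letter
my $is_l  = "A"          =~ /\p{Bidi_Class:L}/  ? 1 : 0;   # Left-to-Right
my $is_en = "7"          =~ /\p{Bc=EN}/         ? 1 : 0;   # European Number

printf "R=%d AL=%d L=%d EN=%d\n", $is_r, $is_al, $is_l, $is_en;
```

Note that the Arabic letter has the value C<AL> rather than C<R>, so a check
for right-to-left text usually needs to test for both.
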
=head3 B<Scripts>

The world's languages are written in many different scripts. This sentence
(unless you're reading it in translation) is written in Latin, while Russian is
written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in
Hiragana or Katakana. There are many more.

The Unicode C<Script> and C<Script_Extensions> properties give what script a
given character is in. Either property can be specified with the
compound form like
C<\p{Script=Hebrew}> (short: C<\p{sc=hebr}>), or
C<\p{Script_Extensions=Javanese}> (short: C<\p{scx=java}>).
In addition, Perl furnishes shortcuts for all
C<Script> property names. You can omit everything up through the equals
(or colon), and simply write C<\p{Latin}> or C<\P{Cyrillic}>.
(This is not true for C<Script_Extensions>, which is required to be
written in the compound form.)

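The different spellings can be compared on a single character. This is a
sketch only, assuming a Perl recent enough to know C<Script_Extensions>; the
variable names are illustrative.

```perl
use strict;
use warnings;

my $alef = "\x{05D0}";    # HEBREW LETTER ALEF
my $zhe  = "\x{0416}";    # CYRILLIC CAPITAL LETTER ZHE

# Compound forms, long and short, plus the single-form shortcut.
my @hebrew_forms = (
    qr/\p{Script=Hebrew}/,
    qr/\p{sc=hebr}/,
    qr/\p{Hebrew}/,
    qr/\p{Script_Extensions=Hebrew}/,
    qr/\p{scx=hebr}/,
);
my $alef_matches_all = !grep { $alef !~ $_ } @hebrew_forms;

# \P{...} negates: ZHE is Cyrillic, so \P{Cyrillic} must not match it.
my $zhe_is_cyrillic = ($zhe =~ /\p{Cyrillic}/ && $zhe !~ /\P{Cyrillic}/) ? 1 : 0;

print "ok\n" if $alef_matches_all && $zhe_is_cyrillic;
```
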
The difference between these two properties involves characters that are
used in multiple scripts. For example the digits '0' through '9' are
used in many parts of the world. These are placed in a script named
C<Common>. Other characters are used in just a few scripts. For
example, the C<"KATAKANA-HIRAGANA DOUBLE HYPHEN"> is used in both Japanese
scripts, Katakana and Hiragana, but nowhere else. The C<Script>
property places all characters that are used in multiple scripts in the
C<Common> script, while the C<Script_Extensions> property places those
that are used in only a few scripts into each of those scripts; while
still using C<Common> for those used in many scripts. Thus both these
match:

 "0" =~ /\p{sc=Common}/     # Matches
 "0" =~ /\p{scx=Common}/    # Matches

and only the first of these matches:

 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Common}/  # Matches
 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Common}/ # No match

And only the last two of these match:

 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Hiragana}/  # No match
 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Katakana}/  # No match
 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Hiragana}/ # Matches
 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Katakana}/ # Matches

C<Script_Extensions> is thus an improved C<Script>, in which there are
fewer characters in the C<Common> script, and correspondingly more in
other scripts. It is new in Unicode version 6.0, and its data are likely
to change significantly in later releases, as things get sorted out.
New code should probably be using C<Script_Extensions> and not plain
C<Script>.

(Actually, besides C<Common>, the C<Inherited> script contains
characters that are used in multiple scripts. These are modifier
characters which inherit the script value
of the controlling character. Some of these are used in many scripts,
and so go into C<Inherited> in both C<Script> and C<Script_Extensions>.
Others are used in just a few scripts, so are in C<Inherited> in
C<Script>, but not in C<Script_Extensions>.)

It is worth stressing that there are several different sets of digits in
Unicode that are equivalent to 0-9 and are matchable by C<\d> in a
regular expression. If they are used in a single language only, they
are in that language's C<Script> and C<Script_Extensions>. If they are
used in more than one script, they will be in C<sc=Common>, but only
if they are used in many scripts should they be in C<scx=Common>.

A complete list of scripts and their shortcuts is in L<perluniprops>.

=head3 B<Use of the C<"Is"> Prefix>

For backward compatibility (with Perl 5.6), all properties writable
without using the compound form mentioned
so far may have C<Is> or C<Is_> prepended to their name, so C<\P{Is_Lu}>, for
example, is equal to C<\P{Lu}>, and C<\p{IsScript:Arabic}> is equal to
C<\p{Arabic}>.

=head3 B<Blocks>

In addition to B<scripts>, Unicode also defines B<blocks> of
characters. The difference between scripts and blocks is that the
concept of scripts is closer to natural languages, while the concept
of blocks is more of an artificial grouping based on groups of Unicode
characters with consecutive ordinal values. For example, the C<"Basic Latin">
block is all the characters whose ordinals are between 0 and 127, inclusive; in
other words, the ASCII characters. The C<"Latin"> script contains some letters
from this as well as several other blocks, like C<"Latin-1 Supplement">,
C<"Latin Extended-A">, I<etc.>, but it does not contain all the characters from
those blocks. It does not, for example, contain the digits 0-9, because
those digits are shared across many scripts, and hence are in the
C<Common> script.

For more about scripts versus blocks, see UAX#24 "Unicode Script Property":
L<http://www.unicode.org/reports/tr24>

The C<Script> or C<Script_Extensions> properties are likely to be the
ones you want to use when processing
natural language; the C<Block> property may occasionally be useful in working
with the nuts and bolts of Unicode.

Block names are matched in the compound form, like C<\p{Block: Arrows}> or
C<\p{Blk=Hebrew}>. Unlike most other properties, only a few block names have a
Unicode-defined short name. But Perl does provide a (slight) shortcut: You
can say, for example C<\p{In_Arrows}> or C<\p{In_Hebrew}>. For backwards
compatibility, the C<In> prefix may be omitted if there is no naming conflict
with a script or any other property, and you can even use an C<Is> prefix
instead in those cases. But it is not a good idea to do this, for a couple
of reasons:

=over 4

=item 1

It is confusing. There are many naming conflicts, and you may forget some.
For example, C<\p{Hebrew}> means the I<script> Hebrew, and NOT the I<block>
Hebrew. But would you remember that 6 months from now?

=item 2

It is unstable. A new version of Unicode may preempt the current meaning by
creating a property with the same name. There was a time in very early Unicode
releases when C<\p{Hebrew}> would have matched the I<block> Hebrew; now it
doesn't.

=back

Some people prefer to always use C<\p{Block: foo}> and C<\p{Script: bar}>
instead of the shortcuts, whether for clarity, because they can't remember the
difference between 'In' and 'Is' anyway, or they aren't confident that those who
eventually will read their code will know that difference.

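The block and script distinctions discussed above can be made concrete with
two sample characters. This is a sketch with illustrative variable names; it
uses the explicit compound forms alongside the C<In_> shortcut.

```perl
use strict;
use warnings;

my $arrow = "\x{2190}";    # LEFTWARDS ARROW, in the Arrows block

# Compound forms and the Perl "In_" shortcut name the same block.
my $arrow_in_block =
    ($arrow =~ /\p{Block: Arrows}/
  && $arrow =~ /\p{Blk=Arrows}/
  && $arrow =~ /\p{In_Arrows}/) ? 1 : 0;

# The digit "0" lies in the Basic Latin block, yet its script is Common,
# not Latin.
my $zero_in_basic_latin = "0" =~ /\p{Block=Basic_Latin}/ ? 1 : 0;
my $zero_is_latin       = "0" =~ /\p{Script=Latin}/      ? 1 : 0;
my $zero_is_common      = "0" =~ /\p{Script=Common}/     ? 1 : 0;

printf "arrow=%d block=%d latin=%d common=%d\n",
    $arrow_in_block, $zero_in_basic_latin, $zero_is_latin, $zero_is_common;
```

Spelling out C<Block> and C<Script> as done here avoids having to remember
which meaning a bare name would take.
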
A complete list of blocks and their shortcuts is in L<perluniprops>.

=head3 B<Other Properties>

There are many more properties than the very basic ones described here.
A complete list is in L<perluniprops>.

Unicode defines all its properties in the compound form, so all single-form
properties are Perl extensions. Most of these are just synonyms for the
Unicode ones, but some are genuine extensions, including several that are in
the compound form. And quite a few of these are actually recommended by Unicode
(in L<http://www.unicode.org/reports/tr18>).

This section gives some details on all extensions that aren't just
synonyms for compound-form Unicode properties
(for those properties, you'll have to refer to the
L<Unicode Standard|http://www.unicode.org/reports/tr44>).

=over

=item B<C<\p{All}>>

This matches every possible code point. It is equivalent to C<qr/./s>.
Unlike all the other non-user-defined C<\p{}> property matches, no
warning is ever generated if this property is matched against a
non-Unicode code point (see L</Beyond Unicode code points> below).

=item B<C<\p{Alnum}>>

This matches any C<\p{Alphabetic}> or C<\p{Decimal_Number}> character.

=item B<C<\p{Any}>>

This matches any of the 1_114_112 Unicode code points. It is a synonym
for C<\p{Unicode}>.

=item B<C<\p{ASCII}>>

This matches any of the 128 characters in the US-ASCII character set,
which is a subset of Unicode.

=item B<C<\p{Assigned}>>

This matches any assigned code point; that is, any code point whose L<general
category|/General_Category> is not C<Unassigned> (or equivalently, not C<Cn>).

=item B<C<\p{Blank}>>

This is the same as C<\h> and C<\p{HorizSpace}>: A character that changes the
spacing horizontally.

665=item B<C<\p{Decomposition_Type: Non_Canonical}>> (Short: C<\p{Dt=NonCanon}>)
666
667Matches a character that has a non-canonical decomposition.
668
a9130ea9 669To understand the use of this rarely used I<property=value> combination, it is
9f815e24
KW
670necessary to know some basics about decomposition.
Consider a character, say H. It could appear with various marks around it,
such as an acute accent, or a circumflex, or various hooks, circles, arrows,
I<etc.>, above, below, or to one side or the other. There are many
9f815e24
KW
674possibilities among the world's languages. The number of combinations is
675astronomical, and if there were a character for each combination, it would
676soon exhaust Unicode's more than a million possible characters. So Unicode
677took a different approach: there is a character for the base H, and a
b19eb496 678character for each of the possible marks, and these can be variously combined
9f815e24
KW
to get a final logical character. So a logical character--what appears to be a
single character--can be a sequence of more than one individual character.
b19eb496
TC
681This is called an "extended grapheme cluster"; Perl furnishes the C<\X>
682regular expression construct to match such sequences.
9f815e24
KW
683
684But Unicode's intent is to unify the existing character set standards and
b19eb496 685practices, and several pre-existing standards have single characters that
9f815e24 686mean the same thing as some of these combinations. An example is ISO-8859-1,
a9130ea9
KW
687which has quite a few of these in the Latin-1 range, an example being C<"LATIN
688CAPITAL LETTER E WITH ACUTE">. Because this character was in this pre-existing
9f815e24 689standard, Unicode added it to its repertoire. But this character is considered
b19eb496 690by Unicode to be equivalent to the sequence consisting of the character
a9130ea9 691C<"LATIN CAPITAL LETTER E"> followed by the character C<"COMBINING ACUTE ACCENT">.
9f815e24 692
a9130ea9 693C<"LATIN CAPITAL LETTER E WITH ACUTE"> is called a "pre-composed" character, and
b19eb496 694its equivalence with the sequence is called canonical equivalence. All
9f815e24 695pre-composed characters are said to have a decomposition (into the equivalent
b19eb496 696sequence), and the decomposition type is also called canonical.
9f815e24
KW
697
698However, many more characters have a different type of decomposition, a
699"compatible" or "non-canonical" decomposition. The sequences that form these
700decompositions are not considered canonically equivalent to the pre-composed
a9130ea9 701character. An example, again in the Latin-1 range, is the C<"SUPERSCRIPT ONE">.
b19eb496 702It is somewhat like a regular digit 1, but not exactly; its decomposition
9f815e24
KW
703into the digit 1 is called a "compatible" decomposition, specifically a
704"super" decomposition. There are several such compatibility
b65e6125
KW
705decompositions (see L<http://www.unicode.org/reports/tr44>), including
706one called "compat", which means some miscellaneous type of
707decomposition that doesn't fit into the other decomposition categories
708that Unicode has chosen.
9f815e24
KW
709
710Note that most Unicode characters don't have a decomposition, so their
a9130ea9 711decomposition type is C<"None">.
9f815e24 712
b19eb496
TC
713For your convenience, Perl has added the C<Non_Canonical> decomposition
714type to mean any of the several compatibility decompositions.
9f815e24
KW
715
716=item B<C<\p{Graph}>>
717
718Matches any character that is graphic. Theoretically, this means a character
719that on a printer would cause ink to be used.
720
721=item B<C<\p{HorizSpace}>>
722
b19eb496 723This is the same as C<\h> and C<\p{Blank}>: a character that changes the
9f815e24
KW
724spacing horizontally.
725
42581d5d 726=item B<C<\p{In=*}>>
9f815e24
KW
727
This is a synonym for C<\p{Present_In=*}>.
729
730=item B<C<\p{PerlSpace}>>
731
This is the same as C<\s>, restricted to ASCII, namely C<S<[ \f\n\r\t]>>
and, starting in Perl v5.18, a vertical tab.

Mnemonic: Perl's (original) space.

=item B<C<\p{PerlWord}>>

This is the same as C<\w>, restricted to ASCII, namely C<[A-Za-z0-9_]>.
740
741Mnemonic: Perl's (original) word.
742
42581d5d 743=item B<C<\p{Posix...}>>
9f815e24 744
b65e6125
KW
745There are several of these, which are equivalents, using the C<\p{}>
746notation, for Posix classes and are described in
42581d5d 747L<perlrecharclass/POSIX Character Classes>.
9f815e24
KW
748
749=item B<C<\p{Present_In: *}>> (Short: C<\p{In=*}>)
750
This property is used when you need to know in which Unicode version(s) a
character is present.
753
754The "*" above stands for some two digit Unicode version number, such as
755C<1.1> or C<4.0>; or the "*" can also be C<Unassigned>. This property will
756match the code points whose final disposition has been settled as of the
757Unicode release given by the version number; C<\p{Present_In: Unassigned}>
758will match those code points whose meaning has yet to be assigned.
759
a9130ea9 760For example, C<U+0041> C<"LATIN CAPITAL LETTER A"> was present in the very first
9f815e24
KW
761Unicode release available, which is C<1.1>, so this property is true for all
762valid "*" versions. On the other hand, C<U+1EFF> was not assigned until version
5.1 when it became C<"LATIN SMALL LETTER Y WITH LOOP">, so the only "*" values
that match it are 5.1, 5.2, and later.
765
766Unicode furnishes the C<Age> property from which this is derived. The problem
767with Age is that a strict interpretation of it (which Perl takes) has it
768matching the precise release a code point's meaning is introduced in. Thus
769C<U+0041> would match only 1.1; and C<U+1EFF> only 5.1. This is not usually what
770you want.
771
772Some non-Perl implementations of the Age property may change its meaning to be
a9130ea9 773the same as the Perl C<Present_In> property; just be aware of that.
9f815e24
KW
774
775Another confusion with both these properties is that the definition is not
b19eb496
TC
776that the code point has been I<assigned>, but that the meaning of the code point
777has been I<determined>. This is because 66 code points will always be
a9130ea9 778unassigned, and so the C<Age> for them is the Unicode version in which the decision
b19eb496 779to make them so was made. For example, C<U+FDD0> is to be permanently
9f815e24 780unassigned to a character, and the decision to do that was made in version 3.1,
b19eb496 781so C<\p{Age=3.1}> matches this character, as also does C<\p{Present_In: 3.1}> and up.
9f815e24
KW
782
783=item B<C<\p{Print}>>
784
ae5b72c8 785This matches any character that is graphical or blank, except controls.
9f815e24
KW
786
787=item B<C<\p{SpacePerl}>>
788
789This is the same as C<\s>, including beyond ASCII.
790
Mnemonic: Space, as modified by Perl. (Until v5.18 it did not include the
vertical tab, which both the Posix standard and Unicode consider white space.)
9f815e24 793
4364919a
KW
794=item B<C<\p{Title}>> and B<C<\p{Titlecase}>>
795
Under case-sensitive matching, these both match the same code points as
C<\p{General Category=Titlecase_Letter}> (C<\p{gc=lt}>). The difference
is that under C</i> caseless matching, these match the same as
C<\p{Cased}>, whereas C<\p{gc=lt}> matches C<\p{Cased_Letter}>.
800
2d88a86a
KW
801=item B<C<\p{Unicode}>>
802
This matches any of the 1_114_112 Unicode code points. It is a synonym
for C<\p{Any}>.
805
9f815e24
KW
806=item B<C<\p{VertSpace}>>
807
808This is the same as C<\v>: A character that changes the spacing vertically.
809
810=item B<C<\p{Word}>>
811
b19eb496 812This is the same as C<\w>, including over 100_000 characters beyond ASCII.
9f815e24 813
42581d5d
KW
814=item B<C<\p{XPosix...}>>
815
b19eb496 816There are several of these, which are the standard Posix classes
42581d5d
KW
817extended to the full Unicode range. They are described in
818L<perlrecharclass/POSIX Character Classes>.
819
9f815e24
KW
820=back
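
As a rough illustration of a few of the extensions listed above, the
following sketch contrasts C<\p{Present_In: ...}> with C<\p{Age=...}> and
shows C<\p{Dt=NonCanon}> on the two Latin-1 characters used as examples
in this section. The variable names and the C<%match> hash are ours, and
the results assume a Perl built with Unicode 5.1 or later.

```perl
use strict;
use warnings;

my $latin_a = "\x{41}";      # LATIN CAPITAL LETTER A, in Unicode since 1.1
my $y_loop  = "\x{1EFF}";    # LATIN SMALL LETTER Y WITH LOOP, added in 5.1
my $sup_one = "\x{B9}";      # SUPERSCRIPT ONE, compatibility decomposition
my $e_acute = "\x{C9}";      # LATIN CAPITAL LETTER E WITH ACUTE, canonical

# Each entry records whether one property match succeeded (1) or not (0).
my %match = (
    # Present_In is true from the introducing release onwards ...
    a_present_in_5_1  => ($latin_a =~ /\p{Present_In: 5.1}/ ? 1 : 0),
    y_present_in_5_1  => ($y_loop  =~ /\p{Present_In: 5.1}/ ? 1 : 0),
    y_present_in_4_0  => ($y_loop  =~ /\p{Present_In: 4.0}/ ? 1 : 0),

    # ... whereas Age is true only for the introducing release itself.
    a_age_1_1         => ($latin_a =~ /\p{Age=1.1}/ ? 1 : 0),
    a_age_5_1         => ($latin_a =~ /\p{Age=5.1}/ ? 1 : 0),
    y_age_5_1         => ($y_loop  =~ /\p{Age=5.1}/ ? 1 : 0),

    # Non_Canonical covers every compatibility decomposition type.
    sup_one_noncanon  => ($sup_one =~ /\p{Dt=NonCanon}/  ? 1 : 0),
    e_acute_noncanon  => ($e_acute =~ /\p{Dt=NonCanon}/  ? 1 : 0),
    e_acute_canonical => ($e_acute =~ /\p{Dt=Canonical}/ ? 1 : 0),
);

printf "%-18s %d\n", $_, $match{$_} for sort keys %match;
```

Note that C<U+0041> matches C<\p{Present_In: 5.1}> but not C<\p{Age=5.1}>,
which is the distinction drawn in the C<Present_In> entry.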
821
a9130ea9 822
376d9008 823=head2 User-Defined Character Properties
491fd90a 824
51f494cc 825You can define your own binary character properties by defining subroutines
a9130ea9 826whose names begin with C<"In"> or C<"Is">. (The experimental feature
9d1a5160
KW
827L<perlre/(?[ ])> provides an alternative which allows more complex
828definitions.) The subroutines can be defined in any
51f494cc 829package. The user-defined properties can be used in the regular expression
a9130ea9 830C<\p{}> and C<\P{}> constructs; if you are using a user-defined property from a
51f494cc 831package other than the one you are in, you must specify its package in the
a9130ea9 832C<\p{}> or C<\P{}> construct.
bac0b425 833
    # assuming property IsForeign defined in Lang::
    package main;  # property package name required
    if ($txt =~ /\p{Lang::IsForeign}+/) { ... }

    package Lang;  # property package name not required
    if ($txt =~ /\p{IsForeign}+/) { ... }
840
841
842Note that the effect is compile-time and immutable once defined.
b19eb496
TC
843However, the subroutines are passed a single parameter, which is 0 if
844case-sensitive matching is in effect and non-zero if caseless matching
56ca34ca
KW
845is in effect. The subroutine may return different values depending on
846the value of the flag, and one set of values will immutably be in effect
b19eb496 847for all case-sensitive matches, and the other set for all case-insensitive
56ca34ca 848matches.
491fd90a 849
b19eb496 850Note that if the regular expression is tainted, then Perl will die rather
a9130ea9 851than calling the subroutine when the name of the subroutine is
0e9be77f
DM
852determined by the tainted data.
853
376d9008
JB
854The subroutines must return a specially-formatted string, with one
855or more newline-separated lines. Each line must be one of the following:
491fd90a
JH
856
857=over 4
858
859=item *
860
df9e1087 861A single hexadecimal number denoting a code point to include.
510254c9
A
862
863=item *
864
Two hexadecimal numbers separated by horizontal whitespace (space or
tab characters) denoting a range of code points to include.
491fd90a
JH
867
868=item *
869
a9130ea9
KW
870Something to include, prefixed by C<"+">: a built-in character
871property (prefixed by C<"utf8::">) or a fully qualified (including package
830137a2 872name) user-defined character property,
bac0b425
JP
873to represent all the characters in that property; two hexadecimal code
874points for a range; or a single hexadecimal code point.
491fd90a
JH
875
876=item *
877
a9130ea9
KW
878Something to exclude, prefixed by C<"-">: an existing character
879property (prefixed by C<"utf8::">) or a fully qualified (including package
830137a2 880name) user-defined character property,
bac0b425
JP
881to represent all the characters in that property; two hexadecimal code
882points for a range; or a single hexadecimal code point.
491fd90a
JH
883
884=item *
885
a9130ea9
KW
Something to negate, prefixed by C<"!">: an existing character
887property (prefixed by C<"utf8::">) or a fully qualified (including package
830137a2 888name) user-defined character property,
bac0b425
JP
889to represent all the characters in that property; two hexadecimal code
890points for a range; or a single hexadecimal code point.
891
892=item *
893
a9130ea9
KW
Something to intersect with, prefixed by C<"&">: an existing character
property (prefixed by C<"utf8::">) or a fully qualified (including package
name) user-defined character property,
to represent all the characters in that property; two
hexadecimal code points for a range; or a single hexadecimal code point.
491fd90a
JH
899
900=back
901
902For example, to define a property that covers both the Japanese
903syllabaries (hiragana and katakana), you can define
904
905 sub InKana {
d88362ca 906 return <<END;
d5822f25
A
907 3040\t309F
908 30A0\t30FF
491fd90a
JH
909 END
910 }
911
d5822f25
A
912Imagine that the here-doc end marker is at the beginning of the line.
913Now you can use C<\p{InKana}> and C<\P{InKana}>.
491fd90a
JH
914
915You could also have used the existing block property names:
916
917 sub InKana {
d88362ca 918 return <<'END';
491fd90a
JH
919 +utf8::InHiragana
920 +utf8::InKatakana
921 END
922 }
923
924Suppose you wanted to match only the allocated characters,
d5822f25 925not the raw block ranges: in other words, you want to remove
b65e6125 926the unassigned characters:
491fd90a
JH
927
928 sub InKana {
d88362ca 929 return <<'END';
491fd90a
JH
930 +utf8::InHiragana
931 +utf8::InKatakana
932 -utf8::IsCn
933 END
934 }
935
936The negation is useful for defining (surprise!) negated classes.
937
938 sub InNotKana {
d88362ca 939 return <<'END';
491fd90a
JH
940 !utf8::InHiragana
941 -utf8::InKatakana
942 +utf8::IsCn
943 END
944 }
945
461020ad
KW
946This will match all non-Unicode code points, since every one of them is
947not in Kana. You can use intersection to exclude these, if desired, as
948this modified example shows:
bac0b425 949
461020ad 950 sub InNotKana {
bac0b425 951 return <<'END';
461020ad
KW
952 !utf8::InHiragana
953 -utf8::InKatakana
954 +utf8::IsCn
955 &utf8::Any
bac0b425
JP
956 END
957 }
958
461020ad
KW
959C<&utf8::Any> must be the last line in the definition.
960
961Intersection is used generally for getting the common characters matched
a9130ea9 962by two (or more) classes. It's important to remember not to use C<"&"> for
461020ad
KW
963the first set; that would be intersecting with nothing, resulting in an
964empty set.
965
2d88a86a
KW
966Unlike non-user-defined C<\p{}> property matches, no warning is ever
967generated if these properties are matched against a non-Unicode code
968point (see L</Beyond Unicode code points> below).
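
The pieces described in this section can be put together in a small,
self-contained script. This is only a sketch: the property names
C<InKanaBlocks> and C<InAssignedKana> are made up for the illustration,
and the result for C<U+3040> assumes that this code point is still
unassigned in the Unicode version your Perl was built with.

```perl
use strict;
use warnings;

# Hypothetical property: the raw Hiragana and Katakana block ranges.
# The here-doc interpolates, so "\t" becomes a real tab separator.
sub InKanaBlocks {
    return <<END;
3040\t309F
30A0\t30FF
END
}

# Hypothetical property: the same blocks minus unassigned code points.
sub InAssignedKana {
    return <<'END';
+utf8::InHiragana
+utf8::InKatakana
-utf8::IsCn
END
}

my $hiragana_a = "\x{3042}";   # HIRAGANA LETTER A
my $reserved   = "\x{3040}";   # in the Hiragana block, but unassigned
my $latin_a    = "A";

my %in_blocks   = map { $_ => ($_ =~ /\p{InKanaBlocks}/   ? 1 : 0) }
                  ($hiragana_a, $reserved, $latin_a);
my %in_assigned = map { $_ => ($_ =~ /\p{InAssignedKana}/ ? 1 : 0) }
                  ($hiragana_a, $reserved, $latin_a);

for my $char ($hiragana_a, $reserved, $latin_a) {
    printf "U+%04X blocks=%d assigned=%d\n",
        ord $char, $in_blocks{$char}, $in_assigned{$char};
}
```

Both subroutines live in package C<main> here, so the patterns can name
them without a package qualifier.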
bac0b425 969
68585b5e 970=head2 User-Defined Case Mappings (for serious hackers only)
822502e5 971
5d1892be 972B<This feature has been removed as of Perl 5.16.>
a9130ea9 973The CPAN module C<L<Unicode::Casing>> provides better functionality without
5d1892be
KW
974the drawbacks that this feature had. If you are using a Perl earlier
975than 5.16, this feature was most fully documented in the 5.14 version of
976this pod:
977L<http://perldoc.perl.org/5.14.0/perlunicode.html#User-Defined-Case-Mappings-%28for-serious-hackers-only%29>
3a2263fe 978
376d9008 979=head2 Character Encodings for Input and Output
8cbd9a7a 980
7221edc9 981See L<Encode>.
8cbd9a7a 982
c29a771d 983=head2 Unicode Regular Expression Support Level
776f8809 984
b19eb496
TC
985The following list of Unicode supported features for regular expressions describes
986all features currently directly supported by core Perl. The references to "Level N"
8158862b 987and the section numbers refer to the Unicode Technical Standard #18,
b19eb496 988"Unicode Regular Expressions", version 13, from August 2008.
776f8809
JH
989
990=over 4
991
992=item *
993
994Level 1 - Basic Unicode Support
995
755789c0
KW
 RL1.1   Hex Notation                     - done          [1]
 RL1.2   Properties                       - done          [2][3]
 RL1.2a  Compatibility Properties         - done          [4]
 RL1.3   Subtraction and Intersection     - experimental  [5]
 RL1.4   Simple Word Boundaries           - done          [6]
 RL1.5   Simple Loose Matches             - done          [7]
 RL1.6   Line Boundaries                  - MISSING       [8][9]
 RL1.7   Supplementary Code Points        - done          [10]
1004
6f33e417
KW
1005=over 4
1006
1007=item [1]
1008
a9130ea9 1009C<\x{...}>
6f33e417
KW
1010
1011=item [2]
1012
a9130ea9 1013C<\p{...}> C<\P{...}>
6f33e417
KW
1014
1015=item [3]
1016
Perl supports not only the minimal list, but all Unicode character
properties (see Unicode Character Properties above).
1018
1019=item [4]
1020
a9130ea9 1021C<\d> C<\D> C<\s> C<\S> C<\w> C<\W> C<\X> C<[:I<prop>:]> C<[:^I<prop>:]>
6f33e417
KW
1022
1023=item [5]
1024
df9e1087 1025The experimental feature in v5.18 C<"(?[...])"> accomplishes this. See
9d1a5160
KW
1026L<perlre/(?[ ])>. If you don't want to use an experimental feature,
1027you can use one of the following:
6f33e417
KW
1028
1029=over 4
1030
1031=item * Regular expression look-ahead
1032
1033You can mimic class subtraction using lookahead.
8158862b 1034For example, what UTS#18 might write as
29bdacb8 1035
209c9685 1036 [{Block=Greek}-[{UNASSIGNED}]]
dbe420b4
JH
1037
1038in Perl can be written as:
1039
209c9685
KW
1040 (?!\p{Unassigned})\p{Block=Greek}
1041 (?=\p{Assigned})\p{Block=Greek}
dbe420b4
JH
1042
1043But in this particular example, you probably really want
1044
209c9685 1045 \p{Greek}
dbe420b4
JH
1046
1047which will match assigned characters known to be part of the Greek script.
29bdacb8 1048
a9130ea9 1049=item * CPAN module C<L<Unicode::Regex::Set>>
8158862b 1050
6f33e417
KW
1051It does implement the full UTS#18 grouping, intersection, union, and
1052removal (subtraction) syntax.
8158862b 1053
6f33e417
KW
1054=item * L</"User-Defined Character Properties">
1055
a9130ea9 1056C<"+"> for union, C<"-"> for removal (set-difference), C<"&"> for intersection
6f33e417
KW
1057
1058=back
1059
1060=item [6]
1061
a9130ea9 1062C<\b> C<\B>
6f33e417
KW
1063
1064=item [7]
1065
a9130ea9
KW
1066Note that Perl does Full case-folding in matching (but with bugs), not
1067Simple: for example C<U+1F88> is equivalent to C<U+1F00 U+03B9>, instead of
1068just C<U+1F80>. This difference matters mainly for certain Greek capital
1069letters with certain modifiers: the Full case-folding decomposes the
1070letter, while the Simple case-folding would map it to a single
1071character.
6f33e417
KW
1072
1073=item [8]
1074
6bc50c7f
KW
1075Should do C<^> and C<$> also on C<U+000B> (C<\v> in C), C<FF> (C<\f>),
1076C<CR> (C<\r>), C<CRLF> (C<\r\n>), C<NEL> (C<U+0085>), C<LS> (C<U+2028>),
1077and C<PS> (C<U+2029>); should also affect C<E<lt>E<gt>>, C<$.>, and
1078script line numbers; should not split lines within C<CRLF> (i.e. there
1079is no empty line between C<\r> and C<\n>). For C<CRLF>, try the
6f33e417
KW
1080C<:crlf> layer (see L<PerlIO>).
1081
1082=item [9]
1083
a9130ea9
KW
1084Linebreaking conformant with L<UAX#14 "Unicode Line Breaking
1085Algorithm"|http://www.unicode.org/reports/tr14>
1086is available through the C<L<Unicode::LineBreak>> module.
6f33e417
KW
1087
1088=item [10]
1089
a9130ea9
KW
UTF-8/UTF-EBCDIC used in Perl allows not only C<U+10000> to
C<U+10FFFF> but also beyond C<U+10FFFF>.
6f33e417
KW
1092
1093=back
5ca1ac52 1094
776f8809
JH
1095=item *
1096
1097Level 2 - Extended Unicode Support
1098
755789c0
KW
 RL2.1   Canonical Equivalents            - MISSING       [10][11]
 RL2.2   Default Grapheme Clusters        - MISSING       [12]
 RL2.3   Default Word Boundaries          - DONE          [14]
 RL2.4   Default Loose Matches            - MISSING       [15]
 RL2.5   Name Properties                  - DONE
 RL2.6   Wildcard Properties              - MISSING

 [10] see UAX#15 "Unicode Normalization Forms"
 [11] have Unicode::Normalize but not integrated to regexes
 [12] have \X and \b{gcb} but we don't have a "Grapheme Cluster
      Mode"
 [14] see UAX#29, Word Boundaries
 [15] This is covered in Chapter 3.13 (in Unicode 6.0)
776f8809
JH
1112
1113=item *
1114
8158862b
TS
1115Level 3 - Tailored Support
1116
755789c0
KW
 RL3.1   Tailored Punctuation             - MISSING
 RL3.2   Tailored Grapheme Clusters       - MISSING       [17][18]
 RL3.3   Tailored Word Boundaries         - MISSING
 RL3.4   Tailored Loose Matches           - MISSING
 RL3.5   Tailored Ranges                  - MISSING
 RL3.6   Context Matching                 - MISSING       [19]
 RL3.7   Incremental Matches              - MISSING
       ( RL3.8   Unicode Set Sharing )
 RL3.9   Possible Match Sets              - MISSING
 RL3.10  Folded Matching                  - MISSING       [20]
 RL3.11  Submatchers                      - MISSING

 [17] see UTS#10 "Unicode Collation Algorithm"
 [18] have Unicode::Collate but not integrated to regexes
 [19] have (?<=x) and (?=x), but look-aheads or look-behinds
      should see outside of the target substring
 [20] need insensitive matching for linguistic features other
      than case; for example, hiragana to katakana, wide and
      narrow, simplified Han to traditional Han (see UTR#30
      "Character Foldings")
776f8809
JH
1137
1138=back
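
The look-ahead workaround for class subtraction mentioned in note [5]
above can be tried out directly. In this sketch the variable names are
ours, and the results assume that C<U+0378> is still an unassigned code
point inside the Greek block in your Perl's Unicode version.

```perl
use strict;
use warnings;

my $alpha = "\x{3B1}";   # GREEK SMALL LETTER ALPHA
my $hole  = "\x{378}";   # unassigned code point inside the Greek block

# The whole block, including unassigned code points.
my $greek_block = qr/\p{Block=Greek}/;

# "Block minus Unassigned", emulated with a look-ahead assertion.
my $assigned_greek_block = qr/(?=\p{Assigned})\p{Block=Greek}/;

# The script property, which is usually what is really wanted.
my $greek_script = qr/\p{Greek}/;

my %result;
for my $name (qw(alpha hole)) {
    my $char = $name eq 'alpha' ? $alpha : $hole;
    $result{$name} = {
        block    => ($char =~ $greek_block          ? 1 : 0),
        assigned => ($char =~ $assigned_greek_block ? 1 : 0),
        script   => ($char =~ $greek_script         ? 1 : 0),
    };
    printf "%-5s block=%d assigned=%d script=%d\n",
        $name, @{ $result{$name} }{qw(block assigned script)};
}
```

The look-ahead does not consume anything; it only vetoes positions where
the next character is unassigned, which is why it behaves like a set
difference for a single-character class.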
1139
c349b1b9
JH
1140=head2 Unicode Encodings
1141
376d9008
JB
1142Unicode characters are assigned to I<code points>, which are abstract
1143numbers. To use these numbers, various encodings are needed.
c349b1b9
JH
1144
1145=over 4
1146
c29a771d 1147=item *
5cb3728c
RB
1148
1149UTF-8
c349b1b9 1150
6d4f9cf2
KW
1151UTF-8 is a variable-length (1 to 4 bytes), byte-order independent
1152encoding. For ASCII (and we really do mean 7-bit ASCII, not another
11538-bit encoding), UTF-8 is transparent.
c349b1b9 1154
8c007b5a 1155The following table is from Unicode 3.2.
05632f9a 1156
 Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte

   U+0000..U+007F       00..7F
   U+0080..U+07FF     * C2..DF    80..BF
   U+0800..U+0FFF       E0      * A0..BF    80..BF
   U+1000..U+CFFF       E1..EC    80..BF    80..BF
   U+D000..U+D7FF       ED        80..9F    80..BF
   U+D800..U+DFFF       +++++ utf16 surrogates, not legal utf8 +++++
   U+E000..U+FFFF       EE..EF    80..BF    80..BF
  U+10000..U+3FFFF      F0      * 90..BF    80..BF    80..BF
  U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
 U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF
e1b711da 1169
b19eb496 1170Note the gaps marked by "*" before several of the byte entries above. These are
e1b711da
KW
1171caused by legal UTF-8 avoiding non-shortest encodings: it is technically
1172possible to UTF-8-encode a single code point in different ways, but that is
1173explicitly forbidden, and the shortest possible encoding should always be used
1174(and that is what Perl does).
37361303 1175
376d9008 1176Another way to look at it is via bits:
05632f9a 1177
                Code Points  1st Byte  2nd Byte  3rd Byte  4th Byte

                   0aaaaaaa  0aaaaaaa
           00000bbbbbaaaaaa  110bbbbb  10aaaaaa
           ccccbbbbbbaaaaaa  1110cccc  10bbbbbb  10aaaaaa
 00000dddccccccbbbbbbaaaaaa  11110ddd  10cccccc  10bbbbbb  10aaaaaa
05632f9a 1184
a9130ea9 1185As you can see, the continuation bytes all begin with C<"10">, and the
e1b711da 1186leading bits of the start byte tell how many bytes there are in the
05632f9a
JH
1187encoded character.
1188
6d4f9cf2 1189The original UTF-8 specification allowed up to 6 bytes, to allow
a9130ea9 1190encoding of numbers up to C<0x7FFF_FFFF>. Perl continues to allow those,
6d4f9cf2
KW
1191and has extended that up to 13 bytes to encode code points up to what
1192can fit in a 64-bit word. However, Perl will warn if you output any of
b19eb496 1193these as being non-portable; and under strict UTF-8 input protocols,
6d4f9cf2
KW
1194they are forbidden.
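
To see the byte patterns from the tables above for concrete code points,
something like the following can be used. It is a minimal sketch built on
the core C<Encode> module; the helper name C<utf8_hex> is ours.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the UTF-8 encoding of one code point as space-separated hex bytes.
sub utf8_hex {
    my ($code_point) = @_;
    my $bytes = encode('UTF-8', chr($code_point));
    return join ' ', map { sprintf '%02X', $_ } unpack('C*', $bytes);
}

# One sample from each length class of the table.
for my $code_point (0x41, 0xE9, 0x20AC, 0x1F600) {
    printf "U+%04X -> %s\n", $code_point, utf8_hex($code_point);
}
```

The lead byte of each result shows the C<0>, C<110>, C<1110> or C<11110>
prefix from the bit table, and every following byte starts with C<10>.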
1195
c29a771d 1196=item *
5cb3728c
RB
1197
1198UTF-EBCDIC
dbe420b4 1199
b65e6125 1200Like UTF-8, but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.
dbe420b4 1201
c29a771d 1202=item *
5cb3728c 1203
b65e6125 1204UTF-16, UTF-16BE, UTF-16LE, Surrogates, and C<BOM>'s (Byte Order Marks)
c349b1b9 1205
1bfb14c4
JH
The following items are mostly for reference and general Unicode
knowledge; Perl doesn't use these constructs internally.
dbe420b4 1208
b19eb496
TC
1209Like UTF-8, UTF-16 is a variable-width encoding, but where
1210UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units.
1211All code points occupy either 2 or 4 bytes in UTF-16: code points
1212C<U+0000..U+FFFF> are stored in a single 16-bit unit, and code
1bfb14c4 1213points C<U+10000..U+10FFFF> in two 16-bit units. The latter case is
c349b1b9
JH
1214using I<surrogates>, the first 16-bit unit being the I<high
1215surrogate>, and the second being the I<low surrogate>.
1216
376d9008 1217Surrogates are code points set aside to encode the C<U+10000..U+10FFFF>
c349b1b9 1218range of Unicode code points in pairs of 16-bit units. The I<high
9f815e24 1219surrogates> are the range C<U+D800..U+DBFF> and the I<low surrogates>
376d9008 1220are the range C<U+DC00..U+DFFF>. The surrogate encoding is
c349b1b9 1221
d88362ca
KW
    $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
    $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
c349b1b9
JH
1224
1225and the decoding is
1226
d88362ca 1227 $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
c349b1b9 1228
376d9008 1229Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16
c349b1b9 1230itself can be used for in-memory computations, but if storage or
376d9008
JB
1231transfer is required either UTF-16BE (big-endian) or UTF-16LE
1232(little-endian) encodings must be chosen.
c349b1b9
JH
1233
1234This introduces another problem: what if you just know that your data
376d9008 1235is UTF-16, but you don't know which endianness? Byte Order Marks, or
b65e6125 1236C<BOM>'s, are a solution to this. A special character has been reserved
86bbd6d1 1237in Unicode to function as a byte order marker: the character with the
a9130ea9 1238code point C<U+FEFF> is the C<BOM>.
042da322 1239
a9130ea9 1240The trick is that if you read a C<BOM>, you will know the byte order,
376d9008
JB
1241since if it was written on a big-endian platform, you will read the
1242bytes C<0xFE 0xFF>, but if it was written on a little-endian platform,
1243you will read the bytes C<0xFF 0xFE>. (And if the originating platform
b65e6125
KW
was writing UTF-8 on an ASCII platform, you will read the bytes
C<0xEF 0xBB 0xBF>.)
042da322 1246
The way this trick works is that the character with the code point
C<U+FFFE> is not supposed to be in input streams, so the
sequence of bytes C<0xFF 0xFE> is unambiguously "C<BOM>, represented in
little-endian format" and cannot be "C<U+FFFE>, represented in big-endian
format".
1252
1253Surrogates have no meaning in Unicode outside their use in pairs to
1254represent other code points. However, Perl allows them to be
1255represented individually internally, for example by saying
f651977e
TC
1256C<chr(0xD801)>, so that all code points, not just those valid for open
1257interchange, are
6d4f9cf2 1258representable. Unicode does define semantics for them, such as their
a9130ea9
KW
1259C<L</General_Category>> is C<"Cs">. But because their use is somewhat dangerous,
1260Perl will warn (using the warning category C<"surrogate">, which is a
1261sub-category of C<"utf8">) if an attempt is made
6d4f9cf2
KW
1262to do things like take the lower case of one, or match
1263case-insensitively, or to output them. (But don't try this on Perls
1264before 5.14.)
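
The surrogate arithmetic given earlier in this item can be wrapped in a
pair of small helpers. This is only a sketch: the function names
C<to_surrogates> and C<from_surrogates> are ours, and bit shifts and
masks are used in place of division and modulus so that the results stay
integers.

```perl
use strict;
use warnings;

# Split a supplementary code point (U+10000..U+10FFFF) into a
# (high surrogate, low surrogate) pair.
sub to_surrogates {
    my ($code_point) = @_;
    die "not a supplementary code point"
        if $code_point < 0x10000 || $code_point > 0x10FFFF;
    my $offset = $code_point - 0x10000;
    return (0xD800 + ($offset >> 10), 0xDC00 + ($offset & 0x3FF));
}

# Combine a (high, low) surrogate pair back into a code point.
sub from_surrogates {
    my ($high, $low) = @_;
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}

for my $code_point (0x10000, 0x1F600, 0x10FFFF) {
    my ($high, $low) = to_surrogates($code_point);
    printf "U+%04X -> %04X %04X -> U+%04X\n",
        $code_point, $high, $low, from_surrogates($high, $low);
}
```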
c349b1b9 1265
c29a771d 1266=item *
5cb3728c 1267
1e54db1a 1268UTF-32, UTF-32BE, UTF-32LE
c349b1b9 1269
b65e6125 1270The UTF-32 family is pretty much like the UTF-16 family, except that
042da322 1271the units are 32-bit, and therefore the surrogate scheme is not
a9130ea9 1272needed. UTF-32 is a fixed-width encoding. The C<BOM> signatures are
b19eb496 1273C<0x00 0x00 0xFE 0xFF> for BE and C<0xFF 0xFE 0x00 0x00> for LE.
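
The C<BOM> signatures listed for the UTF-16 and UTF-32 families (and
for UTF-8 on ASCII platforms) can be checked with a few anchored matches
on the raw bytes. The helper name C<detect_bom> below is ours, and the
sketch is a heuristic rather than a complete encoding detector.

```perl
use strict;
use warnings;

# Guess an encoding from a leading byte order mark, or return nothing.
# The UTF-32LE test must come before UTF-16LE because both start with
# FF FE; UTF-16LE text that begins with BOM followed by U+0000 is
# therefore misreported as UTF-32LE.
sub detect_bom {
    my ($bytes) = @_;
    return 'UTF-32BE' if $bytes =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-32LE' if $bytes =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return;
}

my %sample = (
    "\xFE\xFF\x00\x41"                 => 'UTF-16BE',
    "\xFF\xFE\x41\x00"                 => 'UTF-16LE',
    "\x00\x00\xFE\xFF\x00\x00\x00\x41" => 'UTF-32BE',
    "\xFF\xFE\x00\x00\x41\x00\x00\x00" => 'UTF-32LE',
    "\xEF\xBB\xBFA"                    => 'UTF-8',
);

for my $bytes (sort keys %sample) {
    printf "%-24s %s\n", unpack('H*', $bytes), detect_bom($bytes) // 'none';
}
```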
c349b1b9 1274
c29a771d 1275=item *
5cb3728c
RB
1276
1277UCS-2, UCS-4
c349b1b9 1278
b19eb496 1279Legacy, fixed-width encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit
376d9008 1280encoding. Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
339cfa0e 1281because it does not use surrogates. UCS-4 is a 32-bit encoding,
b19eb496 1282functionally identical to UTF-32 (the difference being that
a9130ea9 1283UCS-4 forbids neither surrogates nor code points larger than C<0x10_FFFF>).
c349b1b9 1284
c29a771d 1285=item *
5cb3728c
RB
1286
1287UTF-7
c349b1b9 1288
376d9008
JB
1289A seven-bit safe (non-eight-bit) encoding, which is useful if the
1290transport or storage is not eight-bit safe. Defined by RFC 2152.
c349b1b9 1291
95a1a48b
JH
1292=back
1293
57e88091 1294=head2 Noncharacter code points
6d4f9cf2 1295
57e88091 129666 code points are set aside in Unicode as "noncharacter code points".
a9130ea9 1297These all have the C<Unassigned> (C<Cn>) C<L</General_Category>>, and
57e88091
KW
1298no character will ever be assigned to any of them. They are the 32 code
1299points between C<U+FDD0> and C<U+FDEF> inclusive, and the 34 code
1300points:
1301
1302 U+FFFE U+FFFF
1303 U+1FFFE U+1FFFF
1304 U+2FFFE U+2FFFF
1305 ...
1306 U+EFFFE U+EFFFF
1307 U+FFFFE U+FFFFF
1308 U+10FFFE U+10FFFF
1309
1310Until Unicode 7.0, the noncharacters were "B<forbidden> for use in open
1311interchange of Unicode text data", so that code that processed those
1312streams could use these code points as sentinels that could be mixed in
1313with character data, and would always be distinguishable from that data.
1314(Emphasis above and in the next paragraph are added in this document.)
1315
1316Unicode 7.0 changed the wording so that they are "B<not recommended> for
1317use in open interchange of Unicode text data". The 7.0 Standard goes on
1318to say:
1319
1320=over 4
1321
1322"If a noncharacter is received in open interchange, an application is
1323not required to interpret it in any way. It is good practice, however,
1324to recognize it as a noncharacter and to take appropriate action, such
1325as replacing it with C<U+FFFD> replacement character, to indicate the
1326problem in the text. It is not recommended to simply delete
1327noncharacter code points from such text, because of the potential
1328security issues caused by deleting uninterpreted characters. (See
1329conformance clause C7 in Section 3.2, Conformance Requirements, and
1330L<Unicode Technical Report #36, "Unicode Security
1331Considerations"|http://www.unicode.org/reports/tr36/#Substituting_for_Ill_Formed_Subsequences>)."
1332
1333=back
1334
1335This change was made because it was found that various commercial tools
1336like editors, or for things like source code control, had been written
1337so that they would not handle program files that used these code points,
1338effectively precluding their use almost entirely! And that was never
1339the intent. They've always been meant to be usable within an
1340application, or cooperating set of applications, at will.
1341
If you're writing code, such as an editor, that is supposed to be able
to handle any Unicode text data, then you shouldn't be using these code
points yourself, and should instead allow them in the input. If you need
sentinels, they should instead be something that isn't legal Unicode.
For UTF-8 data, you can use the bytes 0xC0 and 0xC1 as sentinels, as
they never appear in well-formed UTF-8. (There are equivalents for
UTF-EBCDIC.) You can also store your Unicode code points in integer
variables and use negative values as sentinels.
1350
1351If you're not writing such a tool, then whether you accept noncharacters
1352as input is up to you (though the Standard recommends that you not). If
1353you do strict input stream checking with Perl, these code points
1354continue to be forbidden. This is to maintain backward compatibility
1355(otherwise potential security holes could open up, as an unsuspecting
1356application that was written assuming the noncharacters would be
1357filtered out before getting to it, could now, without warning, start
1358getting them). To do strict checking, you can use the layer
1359C<:encoding('UTF-8')>.
1360
1361Perl continues to warn (using the warning category C<"nonchar">, which
1362is a sub-category of C<"utf8">) if an attempt is made to output
1363noncharacters.
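
If you decide to follow the Standard's suggestion of replacing
noncharacters rather than deleting them, one possible approach is
sketched below. It relies on the Unicode C<Noncharacter_Code_Point>
property; the helper name C<replace_noncharacters> is ours.

```perl
use strict;
use warnings;

# Return a copy of the text with every noncharacter replaced by
# U+FFFD REPLACEMENT CHARACTER; the argument itself is left unchanged.
sub replace_noncharacters {
    my ($text) = @_;
    (my $clean = $text) =~ s/\p{Noncharacter_Code_Point}/\x{FFFD}/g;
    return $clean;
}

my $input = "ab\x{FDD0}cd\x{FFFE}";
my $count = () = $input =~ /\p{Noncharacter_Code_Point}/g;
my $clean = replace_noncharacters($input);

printf "%d noncharacter(s) replaced\n", $count;
printf "code points now: %s\n",
    join ' ', map { sprintf 'U+%04X', ord } split //, $clean;
```

Only the code points are printed here, so the script itself never
outputs a noncharacter or a wide character.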
42581d5d
KW
1364
=head2 Beyond Unicode code points

The maximum Unicode code point is C<U+10FFFF>, and Unicode only defines
operations on code points up through that. But Perl works on code
points up to the maximum permissible unsigned number available on the
platform. However, Perl will not accept these from input streams unless
lax rules are being used, and will warn (using the warning category
C<"non_unicode">, which is a sub-category of C<"utf8">) if any are output.

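The difference between strict and lax rules can be sketched with the
L<Encode> module, whose C<UTF-8> (strict) and C<utf8> (lax) encodings
correspond to the layer names. The byte string below is an assumption
made for illustration: it is Perl's extended encoding of code point
C<0x110000>, the first one beyond Unicode.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Perl's extended (lax) encoding of code point 0x110000, one past U+10FFFF.
my $bytes = "\xF4\x90\x80\x80";

# decode() may modify its input when a CHECK value is given, so use copies.
my ($copy1, $copy2) = ($bytes, $bytes);

# Strict decoding is expected to die: the sequence is beyond Unicode.
my $strict_ok = eval { decode('UTF-8', $copy1, FB_CROAK); 1 } ? 1 : 0;

# Lax decoding accepts the extended sequence.
my $lax = decode('utf8', $copy2, FB_CROAK);

printf "strict accepted: %d, lax gave 0x%X\n", $strict_ok, ord($lax);
```
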
Since Unicode rules are not defined on these code points, if a
Unicode-defined operation is done on them, Perl uses what we believe are
sensible rules, while generally warning, using the C<"non_unicode">
category. For example, C<uc("\x{11_0000}")> will generate such a
warning, returning the input parameter as its result, since Perl defines
the uppercase of every non-Unicode code point to be the code point
itself. (All the case changing operations, not just uppercasing, work
this way.)

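A short sketch of this behavior follows. The warning handler is only
there to count warnings; whether and how often a warning is raised may
depend on the Perl version, so the code checks only the returned values:

```perl
use strict;
use warnings;

my @warnings;
$SIG{__WARN__} = sub { push @warnings, $_[0] };

my $above = chr(0x11_0000);      # one past the last Unicode code point
my $upper = uc($above);          # may warn in the "non_unicode" category
my $lower = lc($above);

print "uc() and lc() returned their argument unchanged\n"
    if $upper eq $above && $lower eq $above;
print "warnings seen: ", scalar(@warnings), "\n";
```
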
The situation with matching Unicode properties in regular expressions,
the C<\p{}> and C<\P{}> constructs, against these code points is not as
clear cut, and how these are handled has changed as we've gained
experience.

One possibility is to treat any match against these code points as
undefined. But since Perl doesn't have the concept of a match being
undefined, it converts this to failing or C<FALSE>. This is almost, but
not quite, what Perl did from v5.14 (when use of these code points
became generally reliable) through v5.18. The difference is that Perl
treated all C<\p{}> matches as failing, but all C<\P{}> matches as
succeeding.

One problem with this is that it leads to unexpected and confusing
results in some cases:

    chr(0x110000) =~ /\p{ASCII_Hex_Digit=True}/   # Failed on <= v5.18
    chr(0x110000) =~ /\p{ASCII_Hex_Digit=False}/  # Failed! on <= v5.18

That is, it treated both matches as undefined, and converted that to
false (raising a warning on each). The first case is the expected
result, but the second is likely counterintuitive: "How could both be
false when they are complements?" Another problem was that the
implementation optimized many Unicode property matches down to already
existing simpler, faster operations, which don't raise the warning. We
chose to not forgo those optimizations, which help the vast majority of
matches, just to generate a warning for the unlikely event that an
above-Unicode code point is being matched against.

As a result of these problems, starting in v5.20, Perl treats
non-Unicode code points as just typical unassigned Unicode
characters, and matches accordingly. (Note: Unicode has atypical
unassigned code points. For example, it has noncharacter code points,
and ones that, when they do get assigned, are destined to be written
right-to-left, as Arabic and Hebrew are. Perl assumes that no
non-Unicode code point has any atypical properties.)

Perl, in most cases, will raise a warning when matching an above-Unicode
code point against a Unicode property when the result is C<TRUE> for
C<\p{}>, and C<FALSE> for C<\P{}>. For example:

    chr(0x110000) =~ /\p{ASCII_Hex_Digit=True}/   # Fails, no warning
    chr(0x110000) =~ /\p{ASCII_Hex_Digit=False}/  # Succeeds, with warning

In both these examples, the character being matched is non-Unicode, so
Unicode doesn't define how it should match. It clearly isn't an ASCII
hex digit, so the first example clearly should fail, and so it does,
with no warning. But it is arguable that the second example should have
an undefined, hence C<FALSE>, result. So a warning is raised for it.

Thus the warning is raised for many fewer cases than in earlier Perls,
and only when the result is arguable. It turns out that
none of the optimizations made by Perl (or that are ever likely to be made)
cause the warning to be skipped, so it solves both problems of Perl's
earlier approach. The most commonly used property that is affected by
this change is C<\p{Unassigned}>, which is a short form for
C<\p{General_Category=Unassigned}>. Starting in v5.20, all non-Unicode
code points are considered C<Unassigned>. In earlier releases the
matches failed because the result was considered undefined.

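The v5.20 rules described above can be checked with a few lines. This
sketch assumes Perl v5.20 or later; the C<"non_unicode"> category is
disabled only to keep the output quiet, and the variable names are
illustrative:

```perl
use strict;
use warnings;
no warnings 'non_unicode';   # keep the "arguable result" warning quiet

my $above = chr(0x110000);   # not a Unicode code point

my $is_hex     = $above =~ /\p{ASCII_Hex_Digit=True}/  ? 1 : 0;
my $is_not_hex = $above =~ /\p{ASCII_Hex_Digit=False}/ ? 1 : 0;
my $unassigned = $above =~ /\p{Unassigned}/            ? 1 : 0;

print "AHex=True: $is_hex, AHex=False: $is_not_hex, ",
      "Unassigned: $unassigned\n";
```
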
The only place where the warning is not raised when it arguably should
have been is if optimizations cause the whole pattern match to not even
be attempted. For example, Perl may figure out that for a string to
match a certain regular expression pattern, the string has to contain
the substring C<"foobar">. Before attempting the match, Perl may look
for that substring, and if not found, immediately fail the match without
actually trying it; so no warning gets generated even if the string
contains an above-Unicode code point.

This behavior is more "Do what I mean" than in earlier Perls for most
applications. But it catches fewer issues for code that needs to be
strictly Unicode compliant. Therefore there is an additional mode of
operation available to accommodate such code. This mode is enabled if a
regular expression pattern is compiled within the lexical scope where
the C<"non_unicode"> warning class has been made fatal, say by:

    use warnings FATAL => "non_unicode";

(see L<warnings>). In this mode of operation, Perl will raise the
warning for all matches against a non-Unicode code point (not just the
arguable ones), and it skips the optimizations that might cause the
warning to not be output. (It currently still won't warn if the match
isn't even attempted, like in the C<"foobar"> example above.)

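A hedged sketch of the stricter mode, assuming Perl v5.20 or later: the
pattern is compiled and run inside a block where C<"non_unicode"> is
fatal, and C<eval> is used to observe the resulting exception. The
variable names are illustrative only.

```perl
use strict;
use warnings;

my $above = chr(0x110000);
my ($matched, $error);
{
    # Patterns compiled in this scope use the stricter mode.
    use warnings FATAL => 'non_unicode';
    $matched = eval { $above =~ /\p{ASCII_Hex_Digit=True}/ ? 1 : 0 };
    $error   = $@;
}

if (defined $matched) {
    print "match completed with result $matched\n";
} else {
    print "match died: $error";
}
```

In the default mode this particular match simply fails without a
warning; here the same match is expected to raise the (now fatal)
warning instead.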
In summary, Perl now normally treats non-Unicode code points as typical
Unicode unassigned code points for regular expression matches, raising a
warning only when it is arguable what the result should be. However, if
this warning has been made fatal, it isn't skipped.

There is one exception to all this. C<\p{All}> looks like a Unicode
property, but it is a Perl extension that is defined to be true for all
possible code points, Unicode or not, so no warning is ever generated
when matching this against a non-Unicode code point. (Prior to v5.20,
it was an exact synonym for C<\p{Any}>, matching code points C<0>
through C<0x10FFFF>.)

=head2 Security Implications of Unicode

First, read
L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.

Also, note the following:

=over 4

=item *

Malformed UTF-8

Unfortunately, the original specification of UTF-8 leaves some room for
interpretation of how many bytes of encoded output one should generate
from one input Unicode character. Strictly speaking, the shortest
possible sequence of UTF-8 bytes should be generated,
because otherwise there is potential for an input buffer overflow at
the receiving end of a UTF-8 connection. Perl always generates the
shortest length UTF-8, and with warnings on, Perl will warn about
non-shortest length UTF-8 along with other malformations, such as the
surrogates, which are not Unicode code points valid for interchange.

=item *

Regular expression pattern matching may surprise you if you're not
accustomed to Unicode. Starting in Perl 5.14, several pattern
modifiers are available to control this, called the character set
modifiers. Details are given in L<perlre/Character set modifiers>.

=back

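To make the shortest-form rule concrete, here is a small sketch using
the C<utf8::encode> and C<utf8::decode> utility functions. The overlong
byte pair C<"\xC0\xAF"> is an illustrative example of a non-shortest
encoding (of C</>, C<U+002F>); it is not taken from any particular
attack.

```perl
use strict;
use warnings;

# Perl encodes U+00E9 in the shortest form: two bytes, C3 A9.
my $encoded = "\x{E9}";
utf8::encode($encoded);
my $shortest = join " ", map { sprintf "%02X", ord($_) } split //, $encoded;

# "\xC0\xAF" is a non-shortest ("overlong") encoding of "/" (U+002F).
my $overlong = "\xC0\xAF";
my $accepted = utf8::decode($overlong) ? 1 : 0;

print "U+00E9 encodes as: $shortest\n";
print "overlong sequence accepted: $accepted\n";
```
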
As discussed elsewhere, Perl has one foot (two hooves?) planted in
each of two worlds: the old world of bytes and the new world of
characters, upgrading from bytes to characters when necessary.
If your legacy code does not explicitly use Unicode, no automatic
switch-over to characters should happen. Characters shouldn't get
downgraded to bytes, either. It is possible to accidentally mix bytes
and characters, however (see L<perluniintro>), in which case C<\w> in
regular expressions might start behaving differently (unless the C</a>
modifier is in effect). Review your code. Use warnings and the C<strict> pragma.

=head2 Unicode in Perl on EBCDIC

The way Unicode is handled on EBCDIC platforms is still
experimental. On such platforms, references to UTF-8 encoding in this
document and elsewhere should be read as meaning the UTF-EBCDIC
specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues
are specifically discussed. There is no C<utfebcdic> pragma or
C<":utfebcdic"> layer; rather, C<"utf8"> and C<":utf8"> are reused to mean
the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic>
for more discussion of the issues.

=head2 Locales

See L<perllocale/Unicode and UTF-8>.

=head2 When Unicode Does Not Happen

There are still many places where Unicode (in some encoding or
another) could be given as arguments or received as results, or both, in
Perl, but it is not, in spite of Perl having extensive ways to input and
output in Unicode, and a few other "entry points" like the C<@ARGV>
array (which can sometimes be interpreted as UTF-8).

The following are such interfaces. Also, see L</The "Unicode Bug">.
For all of these interfaces Perl
currently (as of v5.16.0) simply assumes byte strings both as arguments
and results, or UTF-8 strings if the (deprecated) C<encoding> pragma has been used.

One reason that Perl does not attempt to resolve the role of Unicode in
these situations is that the answers are highly dependent on the operating
system and the file system(s). For example, whether filenames can be
in Unicode and in exactly what kind of encoding, is not exactly a
portable concept. Similarly for C<qx> and C<system>: how well will the
"command-line interface" (and which of them?) handle Unicode?

=over 4

=item *

C<chdir>, C<chmod>, C<chown>, C<chroot>, C<exec>, C<link>, C<lstat>, C<mkdir>,
C<rename>, C<rmdir>, C<stat>, C<symlink>, C<truncate>, C<unlink>, C<utime>, C<-X>

=item *

C<%ENV>

=item *

C<glob> (aka the C<E<lt>*E<gt>>)

=item *

C<open>, C<opendir>, C<sysopen>

=item *

C<qx> (aka the backtick operator), C<system>

=item *

C<readdir>, C<readlink>

=back

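Because these interfaces deal in bytes, one common approach is to encode
and decode names explicitly. The sketch below assumes that the file
system stores names as UTF-8 bytes, which is typical on many Unix-like
systems but is not guaranteed; C<$fs_encoding> and the file name are
hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use File::Temp qw(tempdir);

# Assumption: this file system stores names as UTF-8 bytes.
my $fs_encoding = 'UTF-8';

my $dir  = tempdir(CLEANUP => 1);
my $name = "sn\x{2603}w.txt";        # a name with a non-ASCII character

# Encode the character string to bytes before handing it to open().
my $path = "$dir/" . encode($fs_encoding, $name);
open(my $fh, '>', $path) or die "Cannot create file: $!";
close $fh;

# readdir() returns bytes; decode them back into characters.
opendir(my $dh, $dir) or die "Cannot open directory: $!";
my @names = map  { decode($fs_encoding, $_) }
            grep { $_ ne '.' && $_ ne '..' } readdir $dh;
closedir $dh;

print "round trip ok\n" if @names == 1 && $names[0] eq $name;
```
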
=head2 The "Unicode Bug"

The term "Unicode bug" has been applied to an inconsistency
on ASCII platforms with the
Unicode code points in the C<Latin-1 Supplement> block, that
is, between 128 and 255. Without a locale specified, unlike all other
characters or code points, these characters behave very differently
under byte semantics than under character semantics, unless
C<use feature 'unicode_strings'> is specified, directly or indirectly.
(It is indirectly specified by a C<use v5.12> or higher.)

In character semantics these upper-Latin1 characters are interpreted as
Unicode code points, which means
they have the same semantics as Latin-1 (ISO-8859-1).

In byte semantics (without C<unicode_strings>), they are considered to
be unassigned characters, meaning that the only semantics they have is
their ordinal numbers, and that they are
not members of various character classes. None are considered to match C<\w>
for example, but all match C<\W>.

Perl 5.12.0 added C<unicode_strings> to force character semantics on
these code points in some circumstances, which fixed portions of the
bug; Perl 5.14.0 fixed almost all of it; and Perl 5.16.0 fixed the
remainder (so far as we know, anyway). The lesson here is to enable
C<unicode_strings> to avoid the headaches described below.

The old, problematic behavior affects these areas:

=over 4

=item *

Changing the case of a scalar, that is, using C<uc()>, C<ucfirst()>, C<lc()>,
and C<lcfirst()>, or C<\L>, C<\U>, C<\u> and C<\l> in double-quotish
contexts, such as regular expression substitutions.
Under C<unicode_strings> starting in Perl 5.12.0, character semantics are
generally used. See L<perlfunc/lc> for details on how this works
in combination with various other pragmas.

=item *

Using caseless (C</i>) regular expression matching.
Starting in Perl 5.14.0, regular expressions compiled within
the scope of C<unicode_strings> use character semantics
even when executed or compiled into larger
regular expressions outside the scope.

=item *

Matching any of several properties in regular expressions, namely
C<\b> (without braces), C<\B> (without braces), C<\s>, C<\S>, C<\w>,
C<\W>, and all the Posix character classes
I<except> C<[[:ascii:]]>.
Starting in Perl 5.14.0, regular expressions compiled within
the scope of C<unicode_strings> use character semantics
even when executed or compiled into larger
regular expressions outside the scope.

=item *

In C<quotemeta> or its inline equivalent C<\Q>, no code points above 127
are quoted in UTF-8 encoded strings, but in byte encoded strings, code
points between 128-255 are always quoted.
Starting in Perl 5.16.0, consistent quoting rules are used within the
scope of C<unicode_strings>, as described in L<perlfunc/quotemeta>.

=back

This behavior can lead to unexpected results in which a string's semantics
suddenly change if a code point above 255 is appended to or removed from it,
which changes the string's semantics from byte to character or vice versa. As
an example, consider the following program and its output:

    $ perl -le'
        no feature "unicode_strings";
        $s1 = "\xC2";
        $s2 = "\x{2660}";
        for ($s1, $s2, $s1.$s2) {
            print /\w/ || 0;
        }
    '
    0
    0
    1

If there's no C<\w> in C<$s1> or in C<$s2>, why does their concatenation have one?

This anomaly stems from Perl's attempt to not disturb older programs that
didn't use Unicode, and hence had no semantics for characters outside of the
ASCII range (except in a locale), along with Perl's desire to add Unicode
support seamlessly. The result wasn't seamless: these characters were
orphaned.

For Perls earlier than those described above, or when a string is passed
to a function outside the scope of C<unicode_strings>, a workaround is to always
call L<C<utf8::upgrade($string)>|utf8/Utility functions>,
or to use the standard module L<Encode>. Also, a scalar that has any characters
whose ordinal is C<0x100> or above, or which were specified using either of the
C<\N{...}> notations, will automatically have character semantics.

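The two remedies can be compared side by side. This is a minimal sketch,
assuming an ASCII platform, no locale in effect, and Perl 5.14 or later;
the variable names are illustrative:

```perl
use strict;
use warnings;
no feature 'unicode_strings';    # make the old behavior explicit

my $s = "\xE9";                  # a byte string holding e-acute (Latin-1)
my $before = $s =~ /\w/ ? 1 : 0; # byte semantics: not a word character

utf8::upgrade($s);               # same characters, different internal form
my $after = $s =~ /\w/ ? 1 : 0;  # character semantics: a word character

my $with_feature;
{
    use feature 'unicode_strings';
    my $t = "\xE9";              # still stored as a single byte
    $with_feature = $t =~ /\w/ ? 1 : 0;
}

print "before: $before, after upgrade: $after, ",
      "with feature: $with_feature\n";
```
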
=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)

Sometimes (see L</"When Unicode Does Not Happen"> or L</The "Unicode Bug">)
there are situations where you simply need to force a byte
string into UTF-8, or vice versa. The low-level calls
L<C<utf8::upgrade($bytestring)>|utf8/Utility functions> and
L<C<utf8::downgrade($utf8string[, FAIL_OK])>|utf8/Utility functions> are
the answers.

Note that C<utf8::downgrade()> can fail if the string contains characters
that don't fit into a byte.

Calling either function on a string that already is in the desired state is a
no-op.

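A brief sketch of both calls, including the failure case. The strings
are arbitrary examples chosen for illustration:

```perl
use strict;
use warnings;

my $s = "caf\xE9";                       # four characters, stored as bytes
utf8::upgrade($s);                       # now stored internally as UTF-8
my $upgraded = utf8::is_utf8($s) ? 1 : 0;
my $length   = length $s;                # still four characters

my $downgraded = utf8::downgrade($s) ? 1 : 0;   # back to one byte each

# U+263A does not fit in a byte, so downgrading cannot succeed.
my $wide    = "\x{263A}";
my $fail_ok = utf8::downgrade($wide, 1) ? 1 : 0;          # returns false
my $died    = eval { utf8::downgrade($wide); 1 } ? 0 : 1; # otherwise dies

print "upgraded=$upgraded length=$length downgraded=$downgraded ",
      "fail_ok=$fail_ok died=$died\n";
```

Neither call changes the characters in the string; only the internal
representation differs, which is why C<length> is unaffected.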
=head2 Using Unicode in XS

See L<perlguts/"Unicode Support"> for an introduction to Unicode at
the XS level, and L<perlapi/Unicode Support> for the API details.

=head2 Hacking Perl to work on earlier Unicode versions (for very serious hackers only)

Perl by default comes with the latest supported Unicode version built in, but
you can change to use any earlier one.

Download the files in the desired version of Unicode from the Unicode web
site (L<http://www.unicode.org>). These should replace the existing files in
F<lib/unicore> in the Perl source tree. Follow the instructions in
F<README.perl> in that directory to change some of their names, and then build
perl (see L<INSTALL>).

=head1 BUGS

=head2 Interaction with Locales

See L<perllocale/Unicode and UTF-8>.

=head2 Problems with characters in the Latin-1 Supplement range

See L</The "Unicode Bug">.

=head2 Interaction with Extensions

When Perl exchanges data with an extension, the extension should be
able to understand the UTF8 flag and act accordingly. If the
extension doesn't recognize that flag, it's likely that the extension
will return incorrectly-flagged data.

So if you're working with Unicode data, consult the documentation of
every module you're using to see whether there are any issues with Unicode
data exchange. If the documentation does not talk about Unicode at all,
suspect the worst and probably look at the source to learn how the
module is implemented. Modules written completely in Perl shouldn't
cause problems. Modules that directly or indirectly access code written
in other programming languages are at risk.

For affected functions, the simple strategy to avoid data corruption is
to always make the encoding of the exchanged data explicit. Choose an
encoding that you know the extension can handle. Convert arguments passed
to the extensions to that encoding and convert results back from that
encoding. Write wrapper functions that do the conversions for you, so
you can later change the functions when the extension catches up.

To provide an example, let's say the popular C<Foo::Bar::escape_html>
function doesn't deal with Unicode data yet. The wrapper function
would convert the argument to raw UTF-8 and convert the result back to
Perl's internal representation like so:

    sub my_escape_html ($) {
        my($what) = shift;
        return unless defined $what;
        Encode::decode_utf8(Foo::Bar::escape_html(
                                     Encode::encode_utf8($what)));
    }

Sometimes, when the extension does not convert data but just stores
and retrieves them, you will be able to use the otherwise
dangerous L<C<Encode::_utf8_on()>|Encode/_utf8_on> function. Let's say
the popular C<Foo::Bar> extension, written in C, provides a C<param>
method that lets you store and retrieve data according to these prototypes:

    $self->param($name, $value);            # set a scalar
    $value = $self->param($name);           # retrieve a scalar

If it does not yet provide support for any encoding, one could write a
derived class with such a C<param> method:

    sub param {
        my($self,$name,$value) = @_;
        utf8::upgrade($name);     # make sure it is UTF-8 encoded
        if (defined $value) {
            utf8::upgrade($value);  # make sure it is UTF-8 encoded
            return $self->SUPER::param($name,$value);
        } else {
            my $ret = $self->SUPER::param($name);
            Encode::_utf8_on($ret); # we know, it is UTF-8 encoded
            return $ret;
        }
    }

Some extensions provide filters on data entry/exit points, such as
C<DB_File::filter_store_key> and family. Look out for such filters in
the documentation of your extensions; they can make the transition to
Unicode data much easier.

=head2 Speed

Some functions are slower when working on UTF-8 encoded strings than
on byte encoded strings. All functions that need to hop over
characters such as C<length()>, C<substr()> or C<index()>, or matching
regular expressions can work B<much> faster when the underlying data are
byte-encoded.

In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
a caching scheme was introduced which will hopefully make the slowness
somewhat less spectacular, at least for some operations. In general,
operations with UTF-8 encoded strings are still slower. As an example,
the Unicode properties (character classes) like C<\p{Nd}> are known to
be quite a bit slower (5-20 times) than their simpler counterparts
like C<\d> (then again, there are hundreds of Unicode characters matching C<Nd>
compared with the 10 ASCII characters matching C<\d>).

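Actual costs depend heavily on the Perl version, the platform, and the
operation, so it is better to measure than to assume. The following is a
rough sketch using the core L<Benchmark> module; the string size,
iteration count, and labels are arbitrary choices for illustration, and
no particular result is implied:

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Two strings with identical characters but different internal encodings.
my $bytes = "x" x 100_000 . "needle";
my $chars = $bytes;
utf8::upgrade($chars);           # same content, stored as UTF-8

# Time the same character-offset lookup on each representation.
timethese(2_000, {
    'substr on bytes' => sub { my $c = substr($bytes, 99_999, 1) },
    'substr on UTF-8' => sub { my $c = substr($chars, 99_999, 1) },
});
```
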
=head2 Porting code from perl-5.6.X

Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer
was required to use the C<utf8> pragma to declare that a given scope
expected to deal with Unicode data and had to make sure that only
Unicode data were reaching that scope. If you have code that is
working with 5.6, you will need some of the following adjustments to
your code. The examples are written such that the code will continue
to work under 5.6, so you should be safe to try them out.

=over 3

=item *

A filehandle that should read or write UTF-8

    if ($] >= 5.008) {
        binmode $fh, ":encoding(utf8)";
    }

=item *

A scalar that is going to be passed to some extension

Be it C<Compress::Zlib>, C<Apache::Request> or any extension that has no
mention of Unicode in the manpage, you need to make sure that the
UTF8 flag is stripped off. Note that at the time of this writing
(January 2012) the mentioned modules are not UTF-8-aware. Please
check the documentation to verify if this is still true.

    if ($] >= 5.008) {
        require Encode;
        $val = Encode::encode_utf8($val); # make octets
    }

=item *

A scalar we got back from an extension

If you believe the scalar comes back as UTF-8, you will most likely
want the UTF8 flag restored:

    if ($] >= 5.008) {
        require Encode;
        $val = Encode::decode_utf8($val);
    }

=item *

Same thing, if you are really sure it is UTF-8

    if ($] >= 5.008) {
        require Encode;
        Encode::_utf8_on($val);
    }

=item *

A wrapper for L<DBI> C<fetchrow_array> and C<fetchrow_hashref>

When the database contains only UTF-8, a wrapper function or method is
a convenient way to replace all your C<fetchrow_array> and
C<fetchrow_hashref> calls. A wrapper function will also make it easier to
adapt to future enhancements in your database driver. Note that at the
time of this writing (January 2012), the DBI has no standardized way
to deal with UTF-8 data. Please check the L<DBI documentation|DBI> to verify if
that is still true.

    sub fetchrow {
        # $what is one of fetchrow_{array,hashref}
        my($self, $sth, $what) = @_;
        if ($] < 5.008) {
            return $sth->$what;
        } else {
            require Encode;
            if (wantarray) {
                my @arr = $sth->$what;
                for (@arr) {
                    defined && /[^\000-\177]/ && Encode::_utf8_on($_);
                }
                return @arr;
            } else {
                my $ret = $sth->$what;
                if (ref $ret) {
                    for my $k (keys %$ret) {
                        defined
                        && /[^\000-\177]/
                        && Encode::_utf8_on($_) for $ret->{$k};
                    }
                    return $ret;
                } else {
                    defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
                    return $ret;
                }
            }
        }
    }

=item *

A large scalar that you know can only contain ASCII

Scalars that contain only ASCII and are marked as UTF-8 are sometimes
a drag to your program. If you recognize such a situation, just remove
the UTF8 flag:

    utf8::downgrade($val) if $] >= 5.008;

=back

=head1 SEE ALSO

L<perlunitut>, L<perluniintro>, L<perluniprops>, L<Encode>, L<open>, L<utf8>, L<bytes>,
L<perlretut>, L<perlvar/"${^UNICODE}">,
L<http://www.unicode.org/reports/tr44>.

=cut