=head1 NAME

perlunicode - Unicode support in Perl

=head1 DESCRIPTION

=head2 Important Caveats

Unicode support is an extensive requirement. While Perl does not
implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.

People who want to learn to use Unicode in Perl should probably read
the L<Perl Unicode tutorial, perlunitut|perlunitut> and
L<perluniintro> before reading
this reference document.

Also, the use of Unicode may present security issues that aren't obvious.
Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.

=over 4

=item Safest if you C<use feature 'unicode_strings'>

In order to preserve backward compatibility, Perl does not turn
on full internal Unicode support unless the pragma
C<use feature 'unicode_strings'> is specified. (This is automatically
selected if you use C<use 5.012> or higher.) Failure to do this can
trigger surprises. See L</The "Unicode Bug"> below.

This pragma doesn't affect I/O. Nor does it change the internal
representation of strings, only their interpretation. There are still
several places where Unicode isn't fully supported, such as in
filenames.

=item Input and Output Layers

Perl knows when a filehandle uses Perl's internal Unicode encodings
(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
the C<:encoding(utf8)> layer. Other encodings can be converted to Perl's
encoding on input or from Perl's encoding on output by use of the
C<:encoding(...)> layer. See L<open>.

To indicate that Perl source itself is in UTF-8, use C<use utf8;>.

=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts

As a compatibility measure, the C<use utf8> pragma must be explicitly
included to enable recognition of UTF-8 in the Perl scripts themselves
(in string or regular expression literals, or in identifier names) on
ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
machines. B<These are the only times when an explicit C<use utf8>
is needed.> See L<utf8>.

=item C<BOM>-marked scripts and UTF-16 scripts autodetected

If a Perl script begins marked with the Unicode C<BOM> (UTF-16LE, UTF-16BE,
or UTF-8), or if the script looks like non-C<BOM>-marked UTF-16 of either
endianness, Perl will correctly read in the script as Unicode.
(C<BOM>less UTF-8 cannot be effectively recognized or differentiated from
ISO 8859-1 or other eight-bit encodings.)

=item C<use encoding> needed to upgrade non-Latin-1 byte strings

By default, there is a fundamental asymmetry in Perl's Unicode model:
implicit upgrading from byte strings to Unicode strings assumes that
they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
downgraded with UTF-8 encoding. This happens because the first 256
code points in Unicode happen to agree with Latin-1.

See L</"Byte and Character Semantics"> for more details.

=back
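
As a rough sketch of the first two caveats (the variable names are only
illustrative, and an ASCII platform is assumed), the following contrasts
C<uc()> with and without C<unicode_strings>, then round-trips one
character through an explicit C<:encoding(UTF-8)> layer on an in-memory
filehandle:

    use strict;
    use warnings;

    my $e_acute = "\xE9";    # LATIN SMALL LETTER E WITH ACUTE, a byte string

    my ($without, $with);
    {
        no feature 'unicode_strings';
        $without = uc $e_acute;     # native byte rules: left unchanged
    }
    {
        use feature 'unicode_strings';
        $with = uc $e_acute;        # Unicode rules: becomes "\xC9"
    }
    printf "without: %02X, with: %02X\n", ord($without), ord($with);

    # Encode on output and decode on input through explicit layers.
    my $encoded = '';
    open my $out, '>:encoding(UTF-8)', \$encoded or die "open: $!";
    print {$out} "\x{263A}";
    close $out or die "close: $!";

    open my $in, '<:encoding(UTF-8)', \$encoded or die "open: $!";
    my $decoded = <$in>;
    close $in;

    printf "%d encoded byte(s), %d character(s) in Perl\n",
        length($encoded), length($decoded);

On such a platform this should report C<E9> without the feature and
C<C9> with it, and three encoded bytes for the single character.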

=head2 Byte and Character Semantics

Perl uses logically-wide characters to represent strings internally.

Starting in Perl 5.14, Perl-level operations work with
characters rather than bytes within the scope of a
C<L<use feature 'unicode_strings'|feature>> (or equivalently
C<use 5.012> or higher). (This is not true if bytes have been
explicitly requested by C<L<use bytes|bytes>>, nor necessarily true
for interactions with the platform's operating system.)

For earlier Perls, and when C<unicode_strings> is not in effect, Perl
provides a fairly safe environment that can handle both types of
semantics in programs. For operations where Perl can unambiguously
decide that the input data are characters, Perl switches to character
semantics. For operations where this determination cannot be made
without additional information from the user, Perl decides in favor of
compatibility and chooses to use byte semantics.

When C<use locale> (but not C<use locale ':not_characters'>) is in
effect, Perl uses the rules associated with the current locale.
(C<use locale> overrides C<use feature 'unicode_strings'> in the same scope,
while C<use locale ':not_characters'> effectively also selects
C<use feature 'unicode_strings'> in its scope; see L<perllocale>.)
Otherwise, Perl uses the platform's native
byte semantics for characters whose code points are less than 256, and
Unicode rules for those greater than 255. That means that non-ASCII
characters below 256 are undefined except for their
ordinal numbers. This means that none have case (upper and lower), nor is any
a member of character classes, like C<[:alpha:]> or C<\w>. (But all do belong
to the C<\W> class or the Perl regular expression extension C<[:^alpha:]>.)

This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if
none of the program's inputs were marked as being a source of Unicode
character data. Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as C<%ENV>),
or from literals and constants in the source text.

The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
Note that this pragma is only required while Perl defaults to byte
semantics; when character semantics become the default, this pragma
may become a no-op. See L<utf8>.

If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will have
character semantics. This can cause surprises: See L</BUGS>, below.
You can choose to be warned when this happens. See C<L<encoding::warnings>>.

Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
logically just a number ranging from 0 to 2**31 or so. Larger
characters may encode into longer sequences of bytes internally, but
this internal detail is mostly hidden from Perl code.
See L<perluniintro> for more.
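
The concatenation rule and the "character is just a number" view can be
illustrated with a small sketch. The variable names are made up for the
example, and C<utf8::is_utf8()> is used only to make the internal upgrade
visible, not as something ordinary code should test:

    use strict;
    use warnings;

    my $bytes  = "caf\xE9";     # four characters, all below 256
    my $wide   = "\x{263A}";    # WHITE SMILING FACE, above 255
    my $joined = $bytes . $wide;

    # length() counts characters, whatever the internal storage is.
    printf "characters: %d\n", length($joined);

    # The internal flag before and after the concatenation.
    printf "flag before: %d, after: %d\n",
        (utf8::is_utf8($bytes)  ? 1 : 0),
        (utf8::is_utf8($joined) ? 1 : 0);

    # Encoding a copy shows how many bytes the same text needs as UTF-8.
    my $octets = $joined;
    utf8::encode($octets);
    printf "UTF-8 bytes: %d\n", length($octets);

The five characters occupy eight bytes once encoded, which is the kind
of internal detail that C<length()> and friends normally hide.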

=head2 Effects of Character Semantics

Character semantics have the following effects:

=over 4

=item *

Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.

If you use a Unicode editor to edit your program, Unicode characters may
occur directly within the literal strings in UTF-8 encoding, or UTF-16.
(The former requires a C<BOM> or C<use utf8>, the latter requires a C<BOM>.)

Unicode characters can also be added to a string by using the C<\N{U+...}>
notation. The Unicode code point for the desired character, in hexadecimal,
should be placed in the braces, after the C<U+>. For instance, a smiley face is
C<\N{U+263A}>.

Alternatively, you can use the C<\x{...}> notation for characters C<0x100> and
above. For characters below C<0x100> you may get byte semantics instead of
character semantics; see L</The "Unicode Bug">. On EBCDIC machines there is
the additional problem that the value for such characters gives the EBCDIC
character rather than the Unicode one, thus it is more portable to use
C<\N{U+...}> instead.

Additionally, you can use the C<\N{...}> notation and put the official
Unicode character name within the braces, such as
C<\N{WHITE SMILING FACE}>. This automatically loads the L<charnames>
module with the C<:full> and C<:short> options. If you prefer different
options for this module, you can instead, before the C<\N{...}>,
explicitly load it with your desired options; for example,

   use charnames ':loose';

=item *

If an appropriate L<encoding> is specified, identifiers within the
Perl script may contain Unicode alphanumeric characters, including
ideographs. Perl does not currently attempt to canonicalize variable
names.

=item *

Regular expressions match characters instead of bytes. C<"."> matches
a character instead of a byte.

=item *

Bracketed character classes in regular expressions match characters instead of
bytes and match against the character properties specified in the
Unicode properties database. C<\w> can be used to match a Japanese
ideograph, for instance.

=item *

Named Unicode properties, scripts, and block ranges may be used (like bracketed
character classes) by using the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".
See L</"Unicode Character Properties"> for more details.

You can define your own character properties and use them
in the regular expression with the C<\p{}> or C<\P{}> construct.
See L</"User-Defined Character Properties"> for more details.

=item *

The special pattern C<\X> matches a logical character, an "extended grapheme
cluster" in Standardese. In Unicode what appears to the user to be a single
character, for example an accented C<G>, may in fact be composed of a sequence
of characters, in this case a C<G> followed by an accent character. C<\X>
will match the entire sequence.

=item *

The C<tr///> operator translates characters instead of bytes. Note
that the C<tr///CU> functionality has been removed. For similar
functionality see C<pack('U0', ...)> and C<pack('C0', ...)>.

=item *

Case translation operators use the Unicode case translation tables
when character input is provided. Note that C<uc()>, or C<\U> in
interpolated strings, translates to uppercase, while C<ucfirst>,
or C<\u> in interpolated strings, translates to titlecase in languages
that make the distinction (which is equivalent to uppercase in languages
without the distinction).

=item *

Most operators that deal with positions or lengths in a string will
automatically switch to using character positions, including
C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
C<sprintf()>, C<write()>, and C<length()>. An operator that
specifically does not switch is C<vec()>. Operators that really don't
care include operators that treat strings as a bucket of bits such as
C<sort()>, and operators dealing with filenames.

=item *

The C<pack()>/C<unpack()> letter C<C> does I<not> change, since it is often
used for byte-oriented formats. Again, think C<char> in the C language.

There is a new C<U> specifier that converts between Unicode characters
and code points. There is also a C<W> specifier that is the equivalent of
C<chr>/C<ord> and properly handles character values even if they are above 255.

=item *

The C<chr()> and C<ord()> functions work on characters, similar to
C<pack("W")> and C<unpack("W")>, I<not> C<pack("C")> and
C<unpack("C")>. C<pack("C")> and C<unpack("C")> are methods for
emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.
While these methods reveal the internal encoding of Unicode strings,
that is not something one normally needs to care about at all.

=item *

The bit string operators, C<& | ^ ~>, can operate on character data.
However, for backward compatibility, such as when using bit string
operations when characters are all less than 256 in ordinal value, one
should not use C<~> (the bit complement) with characters of both
values less than 256 and values greater than 255. Most importantly,
DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
will not hold. The reason for this mathematical I<faux pas> is that
the complement cannot return B<both> the 8-bit (byte-wide) bit
complement B<and> the full character-wide bit complement.

=item *

There is a CPAN module, C<L<Unicode::Casing>>, which allows you to define
your own mappings to be used in C<lc()>, C<lcfirst()>, C<uc()>,
C<ucfirst()>, and C<fc> (or their double-quoted string inlined
versions such as C<\U>).
(Prior to Perl 5.16, this functionality was partially provided
in the Perl core, but suffered from a number of insurmountable
drawbacks, so the CPAN module was written instead.)

=back
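
Several of the effects listed above can be seen in a few lines. This is
only a sketch with illustrative variable names; C<charnames> is loaded
explicitly, which is harmless on Perls that would load it automatically:

    use strict;
    use warnings;
    use charnames ':full';

    # Three spellings of the same character.
    my $by_code = "\N{U+263A}";
    my $by_hex  = "\x{263A}";
    my $by_name = "\N{WHITE SMILING FACE}";
    print "same character\n"
        if $by_code eq $by_hex && $by_hex eq $by_name;

    # "G" followed by COMBINING ACUTE ACCENT: two code points, one cluster.
    my $accented = "G\x{301}";
    my @clusters = $accented =~ /(\X)/g;
    printf "%d code points, %d cluster(s)\n",
        length($accented), scalar(@clusters);

    # U+01C6 (LATIN SMALL LETTER DZ WITH CARON) has distinct uppercase
    # and titlecase forms, so uc() and ucfirst() give different results.
    my $dz = "\x{1C6}";
    printf "uc: U+%04X, ucfirst: U+%04X\n",
        ord(uc $dz), ord(ucfirst $dz);

The last line distinguishes the uppercase form U+01C4 from the titlecase
form U+01C5.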

=over 4

=item *

And finally, C<scalar reverse()> reverses by character rather than by byte.

=back
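
The position-based operators, C<reverse>, and the C<W> template can be
checked against each other with a short sketch (the three Greek letters
and the variable names are arbitrary choices for the example):

    use strict;
    use warnings;

    my $greek = "\x{3B1}\x{3B2}\x{3B3}";    # alpha, beta, gamma

    my $length   = length($greek);              # counts characters
    my $middle   = substr($greek, 1, 1);        # beta
    my $position = index($greek, "\x{3B3}");    # character offset of gamma
    my $reversed = reverse $greek;              # scalar context: by character

    # chr()/ord() agree with the "W" template rather than with "C".
    my $alpha       = chr(0x3B1);
    my $packed      = pack('W', 0x3B1);
    my @code_points = unpack('W*', $greek);

    printf "length %d, middle U+%04X, gamma at %d\n",
        $length, ord($middle), $position;
    printf "code points: %s\n",
        join(' ', map { sprintf 'U+%04X', $_ } @code_points);

Here C<$alpha> and C<$packed> compare equal, and C<$reversed> starts
with gamma.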

=head2 Unicode Character Properties

(The only time that Perl considers a sequence of individual code
points as a single logical character is in the C<\X> construct, already
mentioned above. Therefore "character" in this discussion means a single
Unicode code point.)

Very nearly all Unicode character properties are accessible through
regular expressions by using the C<\p{}> "matches property" construct
and the C<\P{}> "doesn't match property" for its negation.

For instance, C<\p{Uppercase}> matches any single character with the Unicode
C<"Uppercase"> property, while C<\p{L}> matches any character with a
C<General_Category> of C<"L"> (letter) property (see
L</General_Category> below). Braces are not
required for single letter property names, so C<\p{L}> is equivalent to C<\pL>.

More formally, C<\p{Uppercase}> matches any single character whose Unicode
C<Uppercase> property value is C<True>, and C<\P{Uppercase}> matches any character
whose C<Uppercase> property value is C<False>, and they could have been written as
C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively.
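
A minimal sketch of these equivalences, using sample characters and
variable names chosen only for illustration:

    use strict;
    use warnings;

    my $omega = "\x{3A9}";    # GREEK CAPITAL LETTER OMEGA

    my %match = (
        upper_single  => ("A"    =~ /\p{Uppercase}/       ? 1 : 0),
        upper_true    => ("A"    =~ /\p{Uppercase=True}/  ? 1 : 0),
        lower_negated => ("a"    =~ /\P{Uppercase}/       ? 1 : 0),
        lower_false   => ("a"    =~ /\p{Uppercase=False}/ ? 1 : 0),
        letter_braced => ($omega =~ /\p{L}/               ? 1 : 0),
        letter_bare   => ($omega =~ /\pL/                 ? 1 : 0),
        digit_letter  => ("7"    =~ /\p{L}/               ? 1 : 0),
    );

    printf "%-14s %d\n", $_, $match{$_} for sort keys %match;

Every entry except C<digit_letter> is expected to be 1, since a digit is
not a letter.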

This formality is needed when properties are not binary; that is, if they can
take on more values than just C<True> and C<False>. For example, the
C<Bidi_Class> property (see L</"Bidirectional Character Types"> below)
can take on several different
values, such as C<Left>, C<Right>, C<Whitespace>, and others. To match these, one needs
to specify both the property name (C<Bidi_Class>) AND the value being
matched against
(C<Left>, C<Right>, etc.). This is done, as in the examples above, by having the
two components separated by an equal sign (or interchangeably, a colon), like
C<\p{Bidi_Class: Left}>.

All Unicode-defined character properties may be written in these compound forms
of C<\p{I<property>=I<value>}> or C<\p{I<property>:I<value>}>, but Perl provides some
additional properties that are written only in the single form, as well as
single-form short-cuts for all binary properties and certain others described
below, in which you may omit the property name and the equals or colon
separator.

Most Unicode character properties have at least two synonyms (or aliases if you
prefer): a short one that is easier to type and a longer one that is more
descriptive and hence easier to understand. Thus the C<"L"> and
C<"Letter"> properties above are equivalent and can be used
interchangeably. Likewise, C<"Upper"> is a synonym for C<"Uppercase">,
and we could have written C<\p{Uppercase}> equivalently as C<\p{Upper}>.
Also, there are typically various synonyms for the values the property
can be. For binary properties, C<"True"> has 3 synonyms: C<"T">,
C<"Yes">, and C<"Y">; and C<"False"> has correspondingly C<"F">,
C<"No">, and C<"N">. But be careful. A short form of a value for one
property may not mean the same thing as the same short form for another.
Thus, for the C<L</General_Category>> property, C<"L"> means
C<"Letter">, but for the L<C<Bidi_Class>|/Bidirectional Character Types>
property, C<"L"> means C<"Left">. A complete list of properties and
synonyms is in L<perluniprops>.

Upper/lower case differences in property names and values are irrelevant;
thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>.
Similarly, you can add or subtract underscores anywhere in the middle of a
word, so that these are also equivalent to C<\p{U_p_p_e_r}>. And white space
is irrelevant adjacent to non-word characters, such as the braces and the equals
or colon separators, so C<\p{ Upper }> and C<\p{ Upper_case : Y }> are
equivalent to these as well. In fact, white space and even
hyphens can usually be added or deleted anywhere. So even C<\p{ Up-per case = Yes}> is
equivalent. All this is called "loose-matching" by Unicode. The few places
where stricter matching is used are in the middle of numbers, and in the Perl
extension properties that begin or end with an underscore. Stricter matching
cares about white space (except adjacent to non-word characters),
hyphens, and non-interior underscores.
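
Assuming loose matching behaves as just described, the spellings above
should all compile to the same property. A small sketch (the array name
is arbitrary) that counts how many of them treat C<"A"> as uppercase and
C<"a"> as not:

    use strict;
    use warnings;

    my @spellings = (
        qr/\p{Upper}/,
        qr/\p{upper}/,
        qr/\p{UpPeR}/,
        qr/\p{U_p_p_e_r}/,
        qr/\p{ Upper }/,
        qr/\p{ Upper_case : Y }/,
    );

    # Each pattern should accept "A" and reject "a".
    my $agree = grep { "A" =~ $_ && "a" !~ $_ } @spellings;

    printf "%d of %d spellings agree\n", $agree, scalar(@spellings);

If any spelling were rejected, the program would fail at compile time
rather than report a lower count.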

You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
(C<^>) between the first brace and the property name: C<\p{^Tamil}> is
equal to C<\P{Tamil}>.
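
For illustration (the variable names are invented for this sketch), the
caret form and the capital-C<P> form can be compared directly, and the
two negations cancel when combined:

    use strict;
    use warnings;

    my $tamil_a = "\x{B85}";    # TAMIL LETTER A

    my $caret_form = "a"      =~ /\p{^Tamil}/ ? 1 : 0;   # not Tamil
    my $big_p_form = "a"      =~ /\P{Tamil}/  ? 1 : 0;   # same test
    my $double_neg = $tamil_a =~ /\P{^Tamil}/ ? 1 : 0;   # negated twice

    print "caret: $caret_form, \\P: $big_p_form, double: $double_neg\n";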

Almost all properties are immune to case-insensitive matching. That is,
adding a C</i> regular expression modifier does not change what they
match. There are two sets that are affected.
The first set is
C<Uppercase_Letter>,
C<Lowercase_Letter>,
and C<Titlecase_Letter>,
all of which match C<Cased_Letter> under C</i> matching.
And the second set is
C<Uppercase>,
C<Lowercase>,
and C<Titlecase>,
all of which match C<Cased> under C</i> matching.
This set also includes its subsets C<PosixUpper> and C<PosixLower>, both
of which under C</i> match C<PosixAlpha>.
(The difference between these sets is that some things, such as Roman
numerals, come in both upper and lower case so they are C<Cased>, but aren't considered
letters, so they aren't C<Cased_Letter>s.)
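
A sketch of the two affected sets, using a lowercase letter and a
lowercase Roman numeral as sample characters (both choices, and the
variable names, are only for illustration):

    use strict;
    use warnings;

    my $small_a     = "a";
    my $small_roman = "\x{2170}";   # SMALL ROMAN NUMERAL ONE, category Nl

    my $lu_plain  = $small_a     =~ /\p{Lu}/         ? 1 : 0;
    my $lu_folded = $small_a     =~ /\p{Lu}/i        ? 1 : 0;
    my $up_plain  = $small_roman =~ /\p{Uppercase}/  ? 1 : 0;
    my $up_folded = $small_roman =~ /\p{Uppercase}/i ? 1 : 0;
    my $lu_roman  = $small_roman =~ /\p{Lu}/i        ? 1 : 0;

    print "a vs Lu:           $lu_plain, with /i $lu_folded\n";
    print "numeral vs Upper:  $up_plain, with /i $up_folded\n";
    print "numeral vs Lu /i:  $lu_roman\n";

If the rules above hold, C</i> turns both plain mismatches into matches,
while the numeral still fails C<\p{Lu}> under C</i> because it is cased
but not a letter.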

See L</Beyond Unicode code points> for special considerations when
matching Unicode properties against non-Unicode code points.

=head3 B<General_Category>

Every Unicode character is assigned a general category, which is the "most
usual categorization of a character" (from
L<http://www.unicode.org/reports/tr44>).

The compound way of writing these is like C<\p{General_Category=Number}>
(short, C<\p{gc:n}>). But Perl furnishes shortcuts in which everything up
through the equal or colon separator is omitted. So you can instead just write
C<\pN>.

Here are the short and long forms of the values the C<General_Category> property
can have:

    Short       Long

    L           Letter
    LC, L&      Cased_Letter (that is: [\p{Ll}\p{Lu}\p{Lt}])
    Lu          Uppercase_Letter
    Ll          Lowercase_Letter
    Lt          Titlecase_Letter
    Lm          Modifier_Letter
    Lo          Other_Letter

    M           Mark
    Mn          Nonspacing_Mark
    Mc          Spacing_Mark
    Me          Enclosing_Mark

    N           Number
    Nd          Decimal_Number (also Digit)
    Nl          Letter_Number
    No          Other_Number

    P           Punctuation (also Punct)
    Pc          Connector_Punctuation
    Pd          Dash_Punctuation
    Ps          Open_Punctuation
    Pe          Close_Punctuation
    Pi          Initial_Punctuation
                (may behave like Ps or Pe depending on usage)
    Pf          Final_Punctuation
                (may behave like Ps or Pe depending on usage)
    Po          Other_Punctuation

    S           Symbol
    Sm          Math_Symbol
    Sc          Currency_Symbol
    Sk          Modifier_Symbol
    So          Other_Symbol

    Z           Separator
    Zs          Space_Separator
    Zl          Line_Separator
    Zp          Paragraph_Separator

    C           Other
    Cc          Control (also Cntrl)
    Cf          Format
    Cs          Surrogate
    Co          Private_Use
    Cn          Unassigned

Single-letter properties match all characters in any of the
two-letter sub-properties starting with the same letter.
C<LC> and C<L&> are special: both are aliases for the set consisting of
everything matched by C<Ll>, C<Lu>, and C<Lt>.
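
The relationship between the compound form, the shortcuts, and the
one- and two-letter values can be sketched as follows (the sample
characters and hash keys are illustrative only):

    use strict;
    use warnings;

    my $digit       = "5";
    my $roman_eight = "\x{2167}";    # ROMAN NUMERAL EIGHT, category Nl
    my $alef        = "\x{5D0}";     # HEBREW LETTER ALEF, category Lo

    my %match = (
        compound     => ($digit =~ /\p{General_Category=Number}/ ? 1 : 0),
        short        => ($digit =~ /\p{gc:n}/                    ? 1 : 0),
        shortcut     => ($digit =~ /\pN/                         ? 1 : 0),
        roman_is_n   => ($roman_eight =~ /\pN/                   ? 1 : 0),
        roman_is_nd  => ($roman_eight =~ /\p{Nd}/                ? 1 : 0),
        alef_is_l    => ($alef =~ /\pL/                          ? 1 : 0),
        alef_is_lc   => ($alef =~ /\p{LC}/                       ? 1 : 0),
        cap_a_is_lc  => ("A"   =~ /\p{L&}/                       ? 1 : 0),
    );

    printf "%-12s %d\n", $_, $match{$_} for sort keys %match;

The Roman numeral is a number (C<N>) without being a decimal digit
(C<Nd>), and the caseless Hebrew letter is a letter (C<L>) without being
a cased letter (C<LC>).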

=head3 B<Bidirectional Character Types>

Because scripts differ in their directionality (Hebrew and Arabic are
written right to left, for example), Unicode supplies a C<Bidi_Class> property.
Some of the values this property can have are:

    Value       Meaning

    L           Left-to-Right
    LRE         Left-to-Right Embedding
    LRO         Left-to-Right Override
    R           Right-to-Left
    AL          Arabic Letter
    RLE         Right-to-Left Embedding
    RLO         Right-to-Left Override
    PDF         Pop Directional Format
    EN          European Number
    ES          European Separator
    ET          European Terminator
    AN          Arabic Number
    CS          Common Separator
    NSM         Non-Spacing Mark
    BN          Boundary Neutral
    B           Paragraph Separator
    S           Segment Separator
    WS          Whitespace
    ON          Other Neutrals

This property is always written in the compound form.
For example, C<\p{Bidi_Class:R}> matches characters that are normally
written right to left. Unlike the
C<L</General_Category>> property, this
property can have more values added in a future Unicode release. Those
listed above comprised the complete set for many Unicode releases, but
others were added in Unicode 6.3; you can always find what the
current ones are in L<perluniprops>. And
L<http://www.unicode.org/reports/tr9/> describes how to use them.
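
As a brief sketch using the short values from the table above (the
sample characters and hash keys are chosen only for illustration):

    use strict;
    use warnings;

    my $hebrew_alef = "\x{5D0}";    # HEBREW LETTER ALEF
    my $arabic_alef = "\x{627}";    # ARABIC LETTER ALEF

    my %match = (
        latin_ltr   => ("a"          =~ /\p{Bidi_Class: L}/ ? 1 : 0),
        hebrew_rtl  => ($hebrew_alef =~ /\p{Bidi_Class:R}/  ? 1 : 0),
        hebrew_ltr  => ($hebrew_alef =~ /\p{Bidi_Class: L}/ ? 1 : 0),
        arabic_al   => ($arabic_alef =~ /\p{bc=AL}/         ? 1 : 0),
        digit_en    => ("7"          =~ /\p{bc=EN}/         ? 1 : 0),
    );

    printf "%-11s %d\n", $_, $match{$_} for sort keys %match;

Only C<hebrew_ltr> is expected to be 0. Note that C<"L"> here means
left-to-right, not "letter" as it does for C<General_Category>.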

=head3 B<Scripts>

The world's languages are written in many different scripts. This sentence
(unless you're reading it in translation) is written in Latin, while Russian is
written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in
Hiragana or Katakana. There are many more.

The Unicode C<Script> and C<Script_Extensions> properties give what script a
given character is in. Either property can be specified with the
compound form like
C<\p{Script=Hebrew}> (short: C<\p{sc=hebr}>), or
C<\p{Script_Extensions=Javanese}> (short: C<\p{scx=java}>).
In addition, Perl furnishes shortcuts for all
C<Script> property names. You can omit everything up through the equals
(or colon), and simply write C<\p{Latin}> or C<\P{Cyrillic}>.
(This is not true for C<Script_Extensions>, which is required to be
written in the compound form.)

The difference between these two properties involves characters that are
used in multiple scripts. For example the digits '0' through '9' are
used in many parts of the world. These are placed in a script named
C<Common>. Other characters are used in just a few scripts. For
example, the C<"KATAKANA-HIRAGANA DOUBLE HYPHEN"> is used in both Japanese
scripts, Katakana and Hiragana, but nowhere else. The C<Script>
property places all characters that are used in multiple scripts in the
C<Common> script, while the C<Script_Extensions> property places those
that are used in only a few scripts into each of those scripts, while
still using C<Common> for those used in many scripts. Thus both these
match:

 "0" =~ /\p{sc=Common}/     # Matches
 "0" =~ /\p{scx=Common}/    # Matches

and only the first of these matches:

 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Common}/  # Matches
 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Common}/ # No match

And only the last two of these match:

 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Hiragana}/  # No match
 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Katakana}/  # No match
 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Hiragana}/ # Matches
 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Katakana}/ # Matches

C<Script_Extensions> is thus an improved C<Script>, in which there are
fewer characters in the C<Common> script, and correspondingly more in
other scripts. It is new in Unicode version 6.0, and its data are likely
to change significantly in later releases, as things get sorted out.

(Actually, besides C<Common>, the C<Inherited> script contains
characters that are used in multiple scripts. These are modifier
characters which modify other characters, and inherit the script value
of the controlling character. Some of these are used in many scripts,
and so go into C<Inherited> in both C<Script> and C<Script_Extensions>.
Others are used in just a few scripts, so are in C<Inherited> in
C<Script>, but not in C<Script_Extensions>.)

It is worth stressing that there are several different sets of digits in
Unicode that are equivalent to 0-9 and are matchable by C<\d> in a
regular expression. If they are used in a single language only, they
are in that language's C<Script> and C<Script_Extensions>. If they are
used in more than one script, they will be in C<sc=Common>, but only
if they are used in many scripts should they be in C<scx=Common>.
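
A short sketch of the script shortcuts and of C<\d> matching digits from
more than one script. The sample characters and names are illustrative,
and only C<Script> values that are the same under C<Script_Extensions>
are used, so the result should not depend on which of the two a shortcut
selects:

    use strict;
    use warnings;

    my %match = (
        latin_a      => ("a"       =~ /\p{Latin}/         ? 1 : 0),
        cyrillic_a   => ("\x{430}" =~ /\p{Cyrillic}/      ? 1 : 0),
        hebrew_long  => ("\x{5D0}" =~ /\p{Script=Hebrew}/ ? 1 : 0),
        hebrew_short => ("\x{5D0}" =~ /\p{sc=hebr}/       ? 1 : 0),
        digit_common => ("0"       =~ /\p{sc=Common}/     ? 1 : 0),
        digit_latin  => ("0"       =~ /\p{Latin}/         ? 1 : 0),
    );

    # ASCII 3, ARABIC-INDIC DIGIT THREE, DEVANAGARI DIGIT THREE
    my $mixed       = "3\x{663}\x{969}";
    my $digit_count = () = $mixed =~ /\d/g;

    printf "%-13s %d\n", $_, $match{$_} for sort keys %match;
    print  "digits matched by \\d: $digit_count\n";

The ASCII digit is in C<Common> rather than C<Latin>, and all three
digits in C<$mixed> match C<\d>.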

A complete list of scripts and their shortcuts is in L<perluniprops>.

=head3 B<Use of the C<"Is"> Prefix>

For backward compatibility (with Perl 5.6), all properties mentioned
so far may have C<Is> or C<Is_> prepended to their name, so C<\P{Is_Lu}>, for
example, is equal to C<\P{Lu}>, and C<\p{IsScript:Arabic}> is equal to
C<\p{Arabic}>.

=head3 B<Blocks>

In addition to B<scripts>, Unicode also defines B<blocks> of
characters. The difference between scripts and blocks is that the
concept of scripts is closer to natural languages, while the concept
of blocks is more of an artificial grouping based on groups of Unicode
characters with consecutive ordinal values. For example, the C<"Basic Latin">
block is all characters whose ordinals are between 0 and 127, inclusive; in
other words, the ASCII characters. The C<"Latin"> script contains some letters
from this as well as several other blocks, like C<"Latin-1 Supplement">,
C<"Latin Extended-A">, etc., but it does not contain all the characters from
those blocks. It does not, for example, contain the digits 0-9, because
those digits are shared across many scripts, and hence are in the
C<Common> script.

For more about scripts versus blocks, see UAX#24 "Unicode Script Property":
L<http://www.unicode.org/reports/tr24>

The C<Script> or C<Script_Extensions> properties are likely to be the
ones you want to use when processing
natural language; the C<Block> property may occasionally be useful in working
with the nuts and bolts of Unicode.

Block names are matched in the compound form, like C<\p{Block: Arrows}> or
C<\p{Blk=Hebrew}>. Unlike most other properties, only a few block names have a
Unicode-defined short name. But Perl does provide a (slight) shortcut: You
can say, for example C<\p{In_Arrows}> or C<\p{In_Hebrew}>. For backwards
compatibility, the C<In> prefix may be omitted if there is no naming conflict
with a script or any other property, and you can even use an C<Is> prefix
instead in those cases. But it is not a good idea to do this, for a couple of
reasons:

=over 4

=item 1

It is confusing. There are many naming conflicts, and you may forget some.
For example, C<\p{Hebrew}> means the I<script> Hebrew, and NOT the I<block>
Hebrew. But would you remember that 6 months from now?

=item 2

It is unstable. A new version of Unicode may preempt the current meaning by
creating a property with the same name. There was a time in very early Unicode
releases when C<\p{Hebrew}> would have matched the I<block> Hebrew; now it
doesn't.

=back

Some people prefer to always use C<\p{Block: foo}> and C<\p{Script: bar}>
instead of the shortcuts, whether for clarity, because they can't remember the
difference between 'In' and 'Is' anyway, or because they aren't confident that those who
eventually will read their code will know that difference.
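
To make the block-versus-script distinction concrete, here is a small
sketch that spells out both property names in full (the sample
characters and hash keys are chosen only for illustration):

    use strict;
    use warnings;

    my $arrow = "\x{2190}";    # LEFTWARDS ARROW
    my $alef  = "\x{5D0}";     # HEBREW LETTER ALEF

    my %match = (
        zero_in_block  => ("0"    =~ /\p{Block: Basic_Latin}/ ? 1 : 0),
        zero_is_latin  => ("0"    =~ /\p{Script: Latin}/      ? 1 : 0),
        zero_is_common => ("0"    =~ /\p{Script: Common}/     ? 1 : 0),
        arrow_block    => ($arrow =~ /\p{Block: Arrows}/      ? 1 : 0),
        arrow_in       => ($arrow =~ /\p{In_Arrows}/          ? 1 : 0),
        alef_block     => ($alef  =~ /\p{Blk=Hebrew}/         ? 1 : 0),
        alef_script    => ($alef  =~ /\p{Script: Hebrew}/     ? 1 : 0),
    );

    printf "%-15s %d\n", $_, $match{$_} for sort keys %match;

The digit sits in the C<Basic Latin> block but belongs to the C<Common>
script, so only C<zero_is_latin> is expected to be 0.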

A complete list of blocks and their shortcuts is in L<perluniprops>.

=head3 B<Other Properties>

There are many more properties than the very basic ones described here.
A complete list is in L<perluniprops>.

Unicode defines all its properties in the compound form, so all single-form
properties are Perl extensions. Most of these are just synonyms for the
Unicode ones, but some are genuine extensions, including several that are in
the compound form. And quite a few of these are actually recommended by Unicode
(in L<http://www.unicode.org/reports/tr18>).

This section gives some details on all extensions that aren't just
synonyms for compound-form Unicode properties
(for those properties, you'll have to refer to the
L<Unicode Standard|http://www.unicode.org/reports/tr44>).

=over

=item B<C<\p{All}>>

This matches every possible code point. It is equivalent to C<qr/./s>.
Unlike all the other non-user-defined C<\p{}> property matches, no
warning is ever generated if this property is matched against a
non-Unicode code point (see L</Beyond Unicode code points> below).

=item B<C<\p{Alnum}>>

This matches any C<\p{Alphabetic}> or C<\p{Decimal_Number}> character.

=item B<C<\p{Any}>>

This matches any of the 1_114_112 Unicode code points. It is a synonym
for C<\p{Unicode}>.

=item B<C<\p{ASCII}>>

This matches any of the 128 characters in the US-ASCII character set,
which is a subset of Unicode.

=item B<C<\p{Assigned}>>

This matches any assigned code point; that is, any code point whose L<general
category|/General_Category> is not C<Unassigned> (or equivalently, not C<Cn>).

=item B<C<\p{Blank}>>

This is the same as C<\h> and C<\p{HorizSpace}>: A character that changes the
spacing horizontally.

661=item B<C<\p{Decomposition_Type: Non_Canonical}>> (Short: C<\p{Dt=NonCanon}>)
662
663Matches a character that has a non-canonical decomposition.
664
a9130ea9 665To understand the use of this rarely used I<property=value> combination, it is
9f815e24
KW
666necessary to know some basics about decomposition.
667Consider a character, say H. It could appear with various marks around it,
668such as an acute accent, or a circumflex, or various hooks, circles, arrows,
b19eb496 669I<etc.>, above, below, or to one side or the other. There are many
9f815e24
KW
670possibilities among the world's languages. The number of combinations is
671astronomical, and if there were a character for each combination, it would
672soon exhaust Unicode's more than a million possible characters. So Unicode
673took a different approach: there is a character for the base H, and a
b19eb496 674character for each of the possible marks, and these can be variously combined
9f815e24
KW
675to get a final logical character. So a logical character--what appears to be a
676single character--can be a sequence of more than one individual character.
b19eb496
TC
677This is called an "extended grapheme cluster"; Perl furnishes the C<\X>
678regular expression construct to match such sequences.
9f815e24
KW
679
680But Unicode's intent is to unify the existing character set standards and
b19eb496 681practices, and several pre-existing standards have single characters that
9f815e24 682mean the same thing as some of these combinations. An example is ISO-8859-1,
a9130ea9
KW
683which has quite a few of these in the Latin-1 range, an example being C<"LATIN
684CAPITAL LETTER E WITH ACUTE">. Because this character was in this pre-existing
9f815e24 685standard, Unicode added it to its repertoire. But this character is considered
b19eb496 686by Unicode to be equivalent to the sequence consisting of the character
a9130ea9 687C<"LATIN CAPITAL LETTER E"> followed by the character C<"COMBINING ACUTE ACCENT">.
9f815e24 688
a9130ea9 689C<"LATIN CAPITAL LETTER E WITH ACUTE"> is called a "pre-composed" character, and
b19eb496 690its equivalence with the sequence is called canonical equivalence. All
9f815e24 691pre-composed characters are said to have a decomposition (into the equivalent
b19eb496 692sequence), and the decomposition type is also called canonical.
9f815e24
KW
693
694However, many more characters have a different type of decomposition, a
695"compatible" or "non-canonical" decomposition. The sequences that form these
696decompositions are not considered canonically equivalent to the pre-composed
a9130ea9 697character. An example, again in the Latin-1 range, is the C<"SUPERSCRIPT ONE">.
b19eb496 698It is somewhat like a regular digit 1, but not exactly; its decomposition
9f815e24
KW
699into the digit 1 is called a "compatible" decomposition, specifically a
700"super" decomposition. There are several such compatibility
701decompositions (see L<http://www.unicode.org/reports/tr44>), including one
b19eb496 702called "compat", which means some miscellaneous type of decomposition
42581d5d 703that doesn't fit into the decomposition categories that Unicode has chosen.
9f815e24
KW
704
705Note that most Unicode characters don't have a decomposition, so their
a9130ea9 706decomposition type is C<"None">.
9f815e24 707
b19eb496
TC
708For your convenience, Perl has added the C<Non_Canonical> decomposition
709type to mean any of the several compatibility decompositions.
9f815e24
KW
710
711=item B<C<\p{Graph}>>
712
713Matches any character that is graphic. Theoretically, this means a character
714that on a printer would cause ink to be used.
715
716=item B<C<\p{HorizSpace}>>
717
b19eb496 718This is the same as C<\h> and C<\p{Blank}>: a character that changes the
9f815e24
KW
719spacing horizontally.
720
42581d5d 721=item B<C<\p{In=*}>>
9f815e24
KW
722
723This is a synonym for C<\p{Present_In=*}>.
724
725=item B<C<\p{PerlSpace}>>
726
d28d8023 727This is the same as C<\s>, restricted to ASCII, namely C<S<[ \f\n\r\t]>>
779cf272 728and starting in Perl v5.18, a vertical tab.
9f815e24
KW
729
730Mnemonic: Perl's (original) space
731
732=item B<C<\p{PerlWord}>>
733
734This is the same as C<\w>, restricted to ASCII, namely C<[A-Za-z0-9_]>.
735
736Mnemonic: Perl's (original) word.
737
42581d5d 738=item B<C<\p{Posix...}>>
9f815e24 739
a9130ea9 740There are several of these, which are equivalents using the C<\p{}>
b19eb496 741notation for Posix classes and are described in
42581d5d 742L<perlrecharclass/POSIX Character Classes>.
9f815e24
KW
743
744=item B<C<\p{Present_In: *}>> (Short: C<\p{In=*}>)
745
746This property is used when you need to know in what Unicode version(s) a
747character is.
748
749The "*" above stands for some two digit Unicode version number, such as
750C<1.1> or C<4.0>; or the "*" can also be C<Unassigned>. This property will
751match the code points whose final disposition has been settled as of the
752Unicode release given by the version number; C<\p{Present_In: Unassigned}>
753will match those code points whose meaning has yet to be assigned.
754
a9130ea9 755For example, C<U+0041> C<"LATIN CAPITAL LETTER A"> was present in the very first
9f815e24
KW
756Unicode release available, which is C<1.1>, so this property is true for all
757valid "*" versions. On the other hand, C<U+1EFF> was not assigned until version
a9130ea9 7585.1 when it became C<"LATIN SMALL LETTER Y WITH LOOP">, so the only "*" values that
9f815e24
KW
759would match it are 5.1, 5.2, and later.
760
761Unicode furnishes the C<Age> property from which this is derived. The problem
762with Age is that a strict interpretation of it (which Perl takes) has it
763matching the precise release a code point's meaning is introduced in. Thus
764C<U+0041> would match only 1.1; and C<U+1EFF> only 5.1. This is not usually what
765you want.
766
767Some non-Perl implementations of the Age property may change its meaning to be
a9130ea9 768the same as the Perl C<Present_In> property; just be aware of that.
9f815e24
KW
769
770Another confusion with both these properties is that the definition is not
b19eb496
TC
771that the code point has been I<assigned>, but that the meaning of the code point
772has been I<determined>. This is because 66 code points will always be
a9130ea9 773unassigned, and so the C<Age> for them is the Unicode version in which the decision
b19eb496 774to make them so was made. For example, C<U+FDD0> is to be permanently
9f815e24 775unassigned to a character, and the decision to do that was made in version 3.1,
b19eb496 776so C<\p{Age=3.1}> matches this character, as do C<\p{Present_In: 3.1}> and up.
9f815e24
KW
777
778=item B<C<\p{Print}>>
779
ae5b72c8 780This matches any character that is graphical or blank, except controls.
9f815e24
KW
781
782=item B<C<\p{SpacePerl}>>
783
784This is the same as C<\s>, including beyond ASCII.
785
4d4acfba 786Mnemonic: Space, as modified by Perl. (It doesn't include the vertical tab
779cf272 787until v5.18, which both the Posix standard and Unicode consider white space.)
9f815e24 788
4364919a
KW
789=item B<C<\p{Title}>> and B<C<\p{Titlecase}>>
790
791Under case-sensitive matching, these both match the same code points as
792C<\p{General Category=Titlecase_Letter}> (C<\p{gc=lt}>). The difference
793is that under C</i> caseless matching, these match the same as
794C<\p{Cased}>, whereas C<\p{gc=lt}> matches C<\p{Cased_Letter}>.
795
2d88a86a
KW
796=item B<C<\p{Unicode}>>
797
798This matches any of the 1_114_112 Unicode code points.
799It is a synonym for C<\p{Any}>.
800
9f815e24
KW
801=item B<C<\p{VertSpace}>>
802
803This is the same as C<\v>: A character that changes the spacing vertically.
804
805=item B<C<\p{Word}>>
806
b19eb496 807This is the same as C<\w>, including over 100_000 characters beyond ASCII.
9f815e24 808
42581d5d
KW
809=item B<C<\p{XPosix...}>>
810
b19eb496 811There are several of these, which are the standard Posix classes
42581d5d
KW
812extended to the full Unicode range. They are described in
813L<perlrecharclass/POSIX Character Classes>.
814
9f815e24
KW
815=back
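
The decomposition types described under C<\p{Decomposition_Type: Non_Canonical}>
above can be explored with a short script. This is only a sketch: the helper
names C<decomposition_kind> and C<is_single_cluster> are made up for
illustration, and the use of the core C<Unicode::Normalize> module to show
canonical equivalence is an addition not discussed in the text above.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Hypothetical helper: classify one character by its decomposition type.
sub decomposition_kind {
    my ($char) = @_;
    return 'canonical'     if $char =~ /\A\p{Dt=Canonical}\z/;
    return 'non-canonical' if $char =~ /\A\p{Dt=NonCanon}\z/;
    return 'none';
}

# Hypothetical helper: is the string a single extended grapheme cluster?
sub is_single_cluster {
    my ($string) = @_;
    return $string =~ /\A\X\z/ ? 1 : 0;
}

my $e_acute  = "\x{C9}";      # LATIN CAPITAL LETTER E WITH ACUTE
my $sequence = "E\x{301}";    # E followed by COMBINING ACUTE ACCENT

printf "U+00C9: %s\n", decomposition_kind($e_acute);
printf "U+00B9: %s\n", decomposition_kind("\x{B9}");    # SUPERSCRIPT ONE
printf "U+0048: %s\n", decomposition_kind("H");

# The pre-composed character and the sequence are canonically equivalent.
print "NFD of U+00C9 is the two-character sequence\n"
    if NFD($e_acute) eq $sequence;
print "NFC of the sequence is U+00C9\n"
    if NFC($sequence) eq $e_acute;
print "the sequence is one grapheme cluster\n"
    if is_single_cluster($sequence);
```

The pre-composed letter should report a canonical decomposition, the
superscript a non-canonical one, and the plain H none at all.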
816
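
The difference between C<Age> and C<Present_In> described above can be seen
by matching the same character against several versions. The helper name
C<version_report> is hypothetical, and the sketch assumes a Perl whose
Unicode database is at least version 5.2.

```perl
use strict;
use warnings;

# Hypothetical helper: list which of a few fixed property values match.
sub version_report {
    my ($char) = @_;
    return join ' ',
        ($char =~ /\p{Age=1.1}/        ? 'Age=1.1'        : ()),
        ($char =~ /\p{Age=5.1}/        ? 'Age=5.1'        : ()),
        ($char =~ /\p{Present_In=1.1}/ ? 'Present_In=1.1' : ()),
        ($char =~ /\p{Present_In=5.0}/ ? 'Present_In=5.0' : ()),
        ($char =~ /\p{Present_In=5.1}/ ? 'Present_In=5.1' : ()),
        ($char =~ /\p{Present_In=5.2}/ ? 'Present_In=5.2' : ());
}

print "U+0041: ", version_report("A"),        "\n";
print "U+1EFF: ", version_report("\x{1EFF}"), "\n";
```

C<Age> matches only the release in which the code point was introduced,
while C<Present_In> matches that release and every later one.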
a9130ea9 817
376d9008 818=head2 User-Defined Character Properties
491fd90a 819
51f494cc 820You can define your own binary character properties by defining subroutines
a9130ea9 821whose names begin with C<"In"> or C<"Is">. (The experimental feature
9d1a5160
KW
822L<perlre/(?[ ])> provides an alternative which allows more complex
823definitions.) The subroutines can be defined in any
51f494cc 824package. The user-defined properties can be used in the regular expression
a9130ea9 825C<\p{}> and C<\P{}> constructs; if you are using a user-defined property from a
51f494cc 826package other than the one you are in, you must specify its package in the
a9130ea9 827C<\p{}> or C<\P{}> construct.
bac0b425 828
51f494cc 829 # assuming property IsForeign defined in Lang::
bac0b425
JP
830 package main; # property package name required
831 if ($txt =~ /\p{Lang::IsForeign}+/) { ... }
832
833 package Lang; # property package name not required
834 if ($txt =~ /\p{IsForeign}+/) { ... }
835
836
837Note that the effect is compile-time and immutable once defined.
b19eb496
TC
838However, the subroutines are passed a single parameter, which is 0 if
839case-sensitive matching is in effect and non-zero if caseless matching
56ca34ca
KW
840is in effect. The subroutine may return different values depending on
841the value of the flag, and one set of values will immutably be in effect
b19eb496 842for all case-sensitive matches, and the other set for all case-insensitive
56ca34ca 843matches.
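
As a minimal sketch of the caseless flag described above, the following
hypothetical property C<IsAsciiCapital> returns one set for case-sensitive
matching and a wider set for caseless matching. The property and the two
wrapper subroutines are illustrative names only.

```perl
use strict;
use warnings;

# Hypothetical user-defined property. The single parameter is 0 for
# case-sensitive matching and non-zero for caseless matching.
sub IsAsciiCapital {
    my ($caseless) = @_;
    return $caseless
        ? "41\t5A\n61\t7A\n"    # A-Z and a-z
        : "41\t5A\n";           # A-Z only
}

sub matches_sensitive {
    my ($string) = @_;
    return $string =~ /\A\p{IsAsciiCapital}+\z/ ? 1 : 0;
}

sub matches_caseless {
    my ($string) = @_;
    return $string =~ /\A\p{IsAsciiCapital}+\z/i ? 1 : 0;
}

for my $string (qw(ABC abc 123)) {
    printf "%s: sensitive=%d caseless=%d\n",
        $string, matches_sensitive($string), matches_caseless($string);
}
```

Because the result is fixed once the property is first used, the subroutine
should depend only on the flag and not on any other program state.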
491fd90a 844
b19eb496 845Note that if the regular expression is tainted, then Perl will die rather
a9130ea9 846than calling the subroutine when the name of the subroutine is
0e9be77f
DM
847determined by the tainted data.
848
376d9008
JB
849The subroutines must return a specially-formatted string, with one
850or more newline-separated lines. Each line must be one of the following:
491fd90a
JH
851
852=over 4
853
854=item *
855
df9e1087 856A single hexadecimal number denoting a code point to include.
510254c9
A
857
858=item *
859
99a6b1f0 860Two hexadecimal numbers separated by horizontal whitespace (space or
df9e1087 861tab characters) denoting a range of code points to include.
491fd90a
JH
862
863=item *
864
a9130ea9
KW
865Something to include, prefixed by C<"+">: a built-in character
866property (prefixed by C<"utf8::">) or a fully qualified (including package
830137a2 867name) user-defined character property,
bac0b425
JP
868to represent all the characters in that property; two hexadecimal code
869points for a range; or a single hexadecimal code point.
491fd90a
JH
870
871=item *
872
a9130ea9
KW
873Something to exclude, prefixed by C<"-">: an existing character
874property (prefixed by C<"utf8::">) or a fully qualified (including package
830137a2 875name) user-defined character property,
bac0b425
JP
876to represent all the characters in that property; two hexadecimal code
877points for a range; or a single hexadecimal code point.
491fd90a
JH
878
879=item *
880
a9130ea9
KW
881Something to negate, prefixed by C<"!">: an existing character
882property (prefixed by C<"utf8::">) or a fully qualified (including package
830137a2 883name) user-defined character property,
bac0b425
JP
884to represent all the characters in that property; two hexadecimal code
885points for a range; or a single hexadecimal code point.
886
887=item *
888
a9130ea9
KW
889Something to intersect with, prefixed by C<"&">: an existing character
890property (prefixed by C<"utf8::">) or a fully qualified (including package
830137a2 891name) user-defined character property,
bac0b425
JP
892for all the characters except the characters in the property; two
893hexadecimal code points for a range; or a single hexadecimal code point.
491fd90a
JH
894
895=back
896
897For example, to define a property that covers both the Japanese
898syllabaries (hiragana and katakana), you can define
899
900 sub InKana {
d88362ca 901 return <<END;
d5822f25
A
902 3040\t309F
903 30A0\t30FF
491fd90a
JH
904 END
905 }
906
d5822f25
A
907Imagine that the here-doc end marker is at the beginning of the line.
908Now you can use C<\p{InKana}> and C<\P{InKana}>.
491fd90a
JH
909
910You could also have used the existing block property names:
911
912 sub InKana {
d88362ca 913 return <<'END';
491fd90a
JH
914 +utf8::InHiragana
915 +utf8::InKatakana
916 END
917 }
918
919Suppose you wanted to match only the allocated characters,
d5822f25 920not the raw block ranges: in other words, you want to remove
491fd90a
JH
921the non-characters:
922
923 sub InKana {
d88362ca 924 return <<'END';
491fd90a
JH
925 +utf8::InHiragana
926 +utf8::InKatakana
927 -utf8::IsCn
928 END
929 }
930
931The negation is useful for defining (surprise!) negated classes.
932
933 sub InNotKana {
d88362ca 934 return <<'END';
491fd90a
JH
935 !utf8::InHiragana
936 -utf8::InKatakana
937 +utf8::IsCn
938 END
939 }
940
461020ad
KW
941This will match all non-Unicode code points, since every one of them is
942not in Kana. You can use intersection to exclude these, if desired, as
943this modified example shows:
bac0b425 944
461020ad 945 sub InNotKana {
bac0b425 946 return <<'END';
461020ad
KW
947 !utf8::InHiragana
948 -utf8::InKatakana
949 +utf8::IsCn
950 &utf8::Any
bac0b425
JP
951 END
952 }
953
461020ad
KW
954C<&utf8::Any> must be the last line in the definition.
955
956Intersection is used generally for getting the common characters matched
a9130ea9 957by two (or more) classes. It's important to remember not to use C<"&"> for
461020ad
KW
958the first set; that would be intersecting with nothing, resulting in an
959empty set.
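
A minimal sketch of intersection, assuming a hypothetical property named
C<IsGreekCapital>: the first line establishes a set, and the C<"&"> line
then keeps only the members that are also uppercase.

```perl
use strict;
use warnings;

# Hypothetical property: characters that are both Greek and uppercase.
# The "+" line must come first; starting with "&" would intersect with
# nothing and give an empty set.
sub IsGreekCapital {
    return <<'END';
+utf8::IsGreek
&utf8::IsUppercase
END
}

sub is_greek_capital {
    my ($char) = @_;
    return $char =~ /\A\p{IsGreekCapital}\z/ ? 1 : 0;
}

printf "U+0391: %d\n", is_greek_capital("\x{391}");   # GREEK CAPITAL LETTER ALPHA
printf "U+03B1: %d\n", is_greek_capital("\x{3B1}");   # GREEK SMALL LETTER ALPHA
printf "U+0041: %d\n", is_greek_capital("A");         # LATIN CAPITAL LETTER A
```

Only the capital alpha is in both sets, so it is the only one of the three
that should match.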
960
2d88a86a
KW
961Unlike non-user-defined C<\p{}> property matches, no warning is ever
962generated if these properties are matched against a non-Unicode code
963point (see L</Beyond Unicode code points> below).
bac0b425 964
68585b5e 965=head2 User-Defined Case Mappings (for serious hackers only)
822502e5 966
5d1892be 967B<This feature has been removed as of Perl 5.16.>
a9130ea9 968The CPAN module C<L<Unicode::Casing>> provides better functionality without
5d1892be
KW
969the drawbacks that this feature had. If you are using a Perl earlier
970than 5.16, this feature was most fully documented in the 5.14 version of
971this pod:
972L<http://perldoc.perl.org/5.14.0/perlunicode.html#User-Defined-Case-Mappings-%28for-serious-hackers-only%29>
3a2263fe 973
376d9008 974=head2 Character Encodings for Input and Output
8cbd9a7a 975
7221edc9 976See L<Encode>.
8cbd9a7a 977
c29a771d 978=head2 Unicode Regular Expression Support Level
776f8809 979
b19eb496
TC
980The following list of Unicode supported features for regular expressions describes
981all features currently directly supported by core Perl. The references to "Level N"
8158862b 982and the section numbers refer to the Unicode Technical Standard #18,
b19eb496 983"Unicode Regular Expressions", version 13, from August 2008.
776f8809
JH
984
985=over 4
986
987=item *
988
989Level 1 - Basic Unicode Support
990
755789c0
KW
991 RL1.1 Hex Notation - done [1]
992 RL1.2 Properties - done [2][3]
993 RL1.2a Compatibility Properties - done [4]
9d1a5160 994 RL1.3 Subtraction and Intersection - experimental [5]
755789c0
KW
995 RL1.4 Simple Word Boundaries - done [6]
996 RL1.5 Simple Loose Matches - done [7]
997 RL1.6 Line Boundaries - MISSING [8][9]
998 RL1.7 Supplementary Code Points - done [10]
999
6f33e417
KW
1000=over 4
1001
1002=item [1]
1003
a9130ea9 1004C<\x{...}>
6f33e417
KW
1005
1006=item [2]
1007
a9130ea9 1008C<\p{...}> C<\P{...}>
6f33e417
KW
1009
1010=item [3]
1011
1012Supports not only the minimal list, but all Unicode character properties (see Unicode Character Properties above).
1013
1014=item [4]
1015
a9130ea9 1016C<\d> C<\D> C<\s> C<\S> C<\w> C<\W> C<\X> C<[:I<prop>:]> C<[:^I<prop>:]>
6f33e417
KW
1017
1018=item [5]
1019
df9e1087 1020The experimental feature in v5.18 C<"(?[...])"> accomplishes this. See
9d1a5160
KW
1021L<perlre/(?[ ])>. If you don't want to use an experimental feature,
1022you can use one of the following:
6f33e417
KW
1023
1024=over 4
1025
1026=item * Regular expression look-ahead
1027
1028You can mimic class subtraction using lookahead.
8158862b 1029For example, what UTS#18 might write as
29bdacb8 1030
209c9685 1031 [{Block=Greek}-[{UNASSIGNED}]]
dbe420b4
JH
1032
1033in Perl can be written as:
1034
209c9685
KW
1035 (?!\p{Unassigned})\p{Block=Greek}
1036 (?=\p{Assigned})\p{Block=Greek}
dbe420b4
JH
1037
1038But in this particular example, you probably really want
1039
209c9685 1040 \p{Greek}
dbe420b4
JH
1041
1042which will match assigned characters known to be part of the Greek script.
29bdacb8 1043
a9130ea9 1044=item * CPAN module C<L<Unicode::Regex::Set>>
8158862b 1045
6f33e417
KW
1046It does implement the full UTS#18 grouping, intersection, union, and
1047removal (subtraction) syntax.
8158862b 1048
6f33e417
KW
1049=item * L</"User-Defined Character Properties">
1050
a9130ea9 1051C<"+"> for union, C<"-"> for removal (set-difference), C<"&"> for intersection
6f33e417
KW
1052
1053=back
1054
1055=item [6]
1056
a9130ea9 1057C<\b> C<\B>
6f33e417
KW
1058
1059=item [7]
1060
a9130ea9
KW
1061Note that Perl does Full case-folding in matching (but with bugs), not
1062Simple: for example C<U+1F88> is equivalent to C<U+1F00 U+03B9>, instead of
1063just C<U+1F80>. This difference matters mainly for certain Greek capital
1064letters with certain modifiers: the Full case-folding decomposes the
1065letter, while the Simple case-folding would map it to a single
1066character.
6f33e417
KW
1067
1068=item [8]
1069
6bc50c7f
KW
1070Should do C<^> and C<$> also on C<U+000B> (C<\v> in C), C<FF> (C<\f>),
1071C<CR> (C<\r>), C<CRLF> (C<\r\n>), C<NEL> (C<U+0085>), C<LS> (C<U+2028>),
1072and C<PS> (C<U+2029>); should also affect C<E<lt>E<gt>>, C<$.>, and
1073script line numbers; should not split lines within C<CRLF> (i.e. there
1074is no empty line between C<\r> and C<\n>). For C<CRLF>, try the
6f33e417
KW
1075C<:crlf> layer (see L<PerlIO>).
1076
1077=item [9]
1078
a9130ea9
KW
1079Linebreaking conformant with L<UAX#14 "Unicode Line Breaking
1080Algorithm"|http://www.unicode.org/reports/tr14>
1081is available through the C<L<Unicode::LineBreak>> module.
6f33e417
KW
1082
1083=item [10]
1084
a9130ea9
KW
1085UTF-8/UTF-EBCDIC used in Perl allows not only C<U+10000> to
1086C<U+10FFFF> but also code points beyond C<U+10FFFF>.
6f33e417
KW
1087
1088=back
5ca1ac52 1089
776f8809
JH
1090=item *
1091
1092Level 2 - Extended Unicode Support
1093
755789c0
KW
1094 RL2.1 Canonical Equivalents - MISSING [10][11]
1095 RL2.2 Default Grapheme Clusters - MISSING [12]
ae3bb8ea 1096 RL2.3 Default Word Boundaries - DONE [14]
755789c0
KW
1097 RL2.4 Default Loose Matches - MISSING [15]
1098 RL2.5 Name Properties - DONE
1099 RL2.6 Wildcard Properties - MISSING
8158862b 1100
755789c0
KW
1101 [10] see UAX#15 "Unicode Normalization Forms"
1102 [11] have Unicode::Normalize but not integrated to regexes
64935bc6
KW
1103 [12] have \X and \b{gcb} but we don't have a "Grapheme Cluster
1104 Mode"
755789c0 1105 [14] see UAX#29, Word Boundaries
902b08d0 1106 [15] This is covered in Chapter 3.13 (in Unicode 6.0)
776f8809
JH
1107
1108=item *
1109
8158862b
ST
1110Level 3 - Tailored Support
1111
755789c0
KW
1112 RL3.1 Tailored Punctuation - MISSING
1113 RL3.2 Tailored Grapheme Clusters - MISSING [17][18]
1114 RL3.3 Tailored Word Boundaries - MISSING
1115 RL3.4 Tailored Loose Matches - MISSING
1116 RL3.5 Tailored Ranges - MISSING
1117 RL3.6 Context Matching - MISSING [19]
1118 RL3.7 Incremental Matches - MISSING
8158862b 1119 ( RL3.8 Unicode Set Sharing )
755789c0
KW
1120 RL3.9 Possible Match Sets - MISSING
1121 RL3.10 Folded Matching - MISSING [20]
1122 RL3.11 Submatchers - MISSING
1123
1124 [17] see UTS#10 "Unicode Collation Algorithm"
1125 [18] have Unicode::Collate but not integrated to regexes
1126 [19] have (?<=x) and (?=x), but look-aheads or look-behinds
1127 should see outside of the target substring
1128 [20] need insensitive matching for linguistic features other
1129 than case; for example, hiragana to katakana, wide and
1130 narrow, simplified Han to traditional Han (see UTR#30
1131 "Character Foldings")
776f8809
JH
1132
1133=back
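
The look-ahead emulation of class subtraction mentioned in note [5] above
can be sketched as follows. The subroutine names are hypothetical; C<U+03A2>
is used because it is a reserved, unassigned code point inside the Greek
block.

```perl
use strict;
use warnings;

# The whole block, including code points with no character assigned.
sub in_greek_block {
    my ($char) = @_;
    return $char =~ /\A\p{Block=Greek}\z/ ? 1 : 0;
}

# The block minus unassigned code points: the negative look-ahead rejects
# a position before \p{Block=Greek} gets a chance to consume the character.
sub in_assigned_greek_block {
    my ($char) = @_;
    return $char =~ /\A(?!\p{Unassigned})\p{Block=Greek}\z/ ? 1 : 0;
}

for my $cp (0x3B1, 0x3A2) {
    printf "U+%04X: block=%d assigned-in-block=%d\n",
        $cp, in_greek_block(chr $cp), in_assigned_greek_block(chr $cp);
}
```

The look-ahead is zero-width, so the combined pattern still consumes exactly
one character when it succeeds.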
1134
c349b1b9
JH
1135=head2 Unicode Encodings
1136
376d9008
JB
1137Unicode characters are assigned to I<code points>, which are abstract
1138numbers. To use these numbers, various encodings are needed.
c349b1b9
JH
1139
1140=over 4
1141
c29a771d 1142=item *
5cb3728c
RB
1143
1144UTF-8
c349b1b9 1145
6d4f9cf2
KW
1146UTF-8 is a variable-length (1 to 4 bytes), byte-order independent
1147encoding. For ASCII (and we really do mean 7-bit ASCII, not another
11488-bit encoding), UTF-8 is transparent.
c349b1b9 1149
8c007b5a 1150The following table is from Unicode 3.2.
05632f9a 1151
755789c0 1152 Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte
05632f9a 1153
d88362ca 1154 U+0000..U+007F 00..7F
e1b711da 1155 U+0080..U+07FF * C2..DF 80..BF
d88362ca 1156 U+0800..U+0FFF E0 * A0..BF 80..BF
ec90690f
ST
1157 U+1000..U+CFFF E1..EC 80..BF 80..BF
1158 U+D000..U+D7FF ED 80..9F 80..BF
755789c0 1159 U+D800..U+DFFF +++++ utf16 surrogates, not legal utf8 +++++
ec90690f 1160 U+E000..U+FFFF EE..EF 80..BF 80..BF
d88362ca
KW
1161 U+10000..U+3FFFF F0 * 90..BF 80..BF 80..BF
1162 U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF
1163 U+100000..U+10FFFF F4 80..8F 80..BF 80..BF
e1b711da 1164
b19eb496 1165Note the gaps marked by "*" before several of the byte entries above. These are
e1b711da
KW
1166caused by legal UTF-8 avoiding non-shortest encodings: it is technically
1167possible to UTF-8-encode a single code point in different ways, but that is
1168explicitly forbidden, and the shortest possible encoding should always be used
1169(and that is what Perl does).
37361303 1170
376d9008 1171Another way to look at it is via bits:
05632f9a 1172
755789c0 1173 Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte
05632f9a 1174
755789c0
KW
1175 0aaaaaaa 0aaaaaaa
1176 00000bbbbbaaaaaa 110bbbbb 10aaaaaa
1177 ccccbbbbbbaaaaaa 1110cccc 10bbbbbb 10aaaaaa
1178 00000dddccccccbbbbbbaaaaaa 11110ddd 10cccccc 10bbbbbb 10aaaaaa
05632f9a 1179
a9130ea9 1180As you can see, the continuation bytes all begin with C<"10">, and the
e1b711da 1181leading bits of the start byte tell how many bytes there are in the
05632f9a
JH
1182encoded character.
1183
6d4f9cf2 1184The original UTF-8 specification allowed up to 6 bytes, to allow
a9130ea9 1185encoding of numbers up to C<0x7FFF_FFFF>. Perl continues to allow those,
6d4f9cf2
KW
1186and has extended that up to 13 bytes to encode code points up to what
1187can fit in a 64-bit word. However, Perl will warn if you output any of
b19eb496 1188these as being non-portable; and under strict UTF-8 input protocols,
6d4f9cf2
KW
1189they are forbidden.
1190
c29a771d 1191=item *
5cb3728c
RB
1192
1193UTF-EBCDIC
dbe420b4 1194
376d9008 1195Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.
dbe420b4 1196
c29a771d 1197=item *
5cb3728c 1198
a9130ea9 1199UTF-16, UTF-16BE, UTF-16LE, Surrogates, and C<BOM>s (Byte Order Marks)
c349b1b9 1200
1bfb14c4
JH
1201The following items are mostly for reference and general Unicode
1202knowledge; Perl doesn't use these constructs internally.
dbe420b4 1203
b19eb496
TC
1204Like UTF-8, UTF-16 is a variable-width encoding, but where
1205UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units.
1206All code points occupy either 2 or 4 bytes in UTF-16: code points
1207C<U+0000..U+FFFF> are stored in a single 16-bit unit, and code
1bfb14c4 1208points C<U+10000..U+10FFFF> in two 16-bit units. The latter case is
c349b1b9
JH
1209using I<surrogates>, the first 16-bit unit being the I<high
1210surrogate>, and the second being the I<low surrogate>.
1211
376d9008 1212Surrogates are code points set aside to encode the C<U+10000..U+10FFFF>
c349b1b9 1213range of Unicode code points in pairs of 16-bit units. The I<high
9f815e24 1214surrogates> are the range C<U+D800..U+DBFF> and the I<low surrogates>
376d9008 1215are the range C<U+DC00..U+DFFF>. The surrogate encoding is
c349b1b9 1216
d88362ca
KW
1217 $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
1218 $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
c349b1b9
JH
1219
1220and the decoding is
1221
d88362ca 1222 $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
c349b1b9 1223
376d9008 1224Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16
c349b1b9 1225itself can be used for in-memory computations, but if storage or
376d9008
JB
1226transfer is required either UTF-16BE (big-endian) or UTF-16LE
1227(little-endian) encodings must be chosen.
c349b1b9
JH
1228
1229This introduces another problem: what if you just know that your data
376d9008 1230is UTF-16, but you don't know which endianness? Byte Order Marks, or
a9130ea9 1231C<BOM>s, are a solution to this. A special character has been reserved
86bbd6d1 1232in Unicode to function as a byte order marker: the character with the
a9130ea9 1233code point C<U+FEFF> is the C<BOM>.
042da322 1234
a9130ea9 1235The trick is that if you read a C<BOM>, you will know the byte order,
376d9008
JB
1236since if it was written on a big-endian platform, you will read the
1237bytes C<0xFE 0xFF>, but if it was written on a little-endian platform,
1238you will read the bytes C<0xFF 0xFE>. (And if the originating platform
1239was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.)
042da322 1240
86bbd6d1 1241The way this trick works is that the character with the code point
6d4f9cf2 1242C<U+FFFE> is not supposed to be in input streams, so the
a9130ea9 1243sequence of bytes C<0xFF 0xFE> is unambiguously "C<BOM>, represented in
1bfb14c4 1244little-endian format" and cannot be "C<U+FFFE>, represented in big-endian
6d4f9cf2
KW
1245format".
1246
1247Surrogates have no meaning in Unicode outside their use in pairs to
1248represent other code points. However, Perl allows them to be
1249represented individually internally, for example by saying
f651977e
TC
1250C<chr(0xD801)>, so that all code points, not just those valid for open
1251interchange, are
6d4f9cf2 1252representable. Unicode does define semantics for them, such as their
a9130ea9
KW
1253C<L</General_Category>> is C<"Cs">. But because their use is somewhat dangerous,
1254Perl will warn (using the warning category C<"surrogate">, which is a
1255sub-category of C<"utf8">) if an attempt is made
6d4f9cf2
KW
1256to do things like take the lower case of one, or match
1257case-insensitively, or to output them. (But don't try this on Perls
1258before 5.14.)
c349b1b9 1259
c29a771d 1260=item *
5cb3728c 1261
1e54db1a 1262UTF-32, UTF-32BE, UTF-32LE
c349b1b9
JH
1263
1264The UTF-32 family is pretty much like the UTF-16 family, except that
042da322 1265the units are 32-bit, and therefore the surrogate scheme is not
a9130ea9 1266needed. UTF-32 is a fixed-width encoding. The C<BOM> signatures are
b19eb496 1267C<0x00 0x00 0xFE 0xFF> for BE and C<0xFF 0xFE 0x00 0x00> for LE.
c349b1b9 1268
c29a771d 1269=item *
5cb3728c
RB
1270
1271UCS-2, UCS-4
c349b1b9 1272
b19eb496 1273Legacy, fixed-width encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit
376d9008 1274encoding. Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
339cfa0e 1275because it does not use surrogates. UCS-4 is a 32-bit encoding,
b19eb496 1276functionally identical to UTF-32 (the difference being that
a9130ea9 1277UCS-4 forbids neither surrogates nor code points larger than C<0x10_FFFF>).
c349b1b9 1278
c29a771d 1279=item *
5cb3728c
RB
1280
1281UTF-7
c349b1b9 1282
376d9008
JB
1283A seven-bit safe (non-eight-bit) encoding, which is useful if the
1284transport or storage is not eight-bit safe. Defined by RFC 2152.
c349b1b9 1285
95a1a48b
JH
1286=back
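
The UTF-8 bit layout tabulated in the list above can be checked by encoding
a code point by hand and comparing the result with what Perl produces. This
is a sketch for ASCII platforms only; C<utf8_bytes> and C<perl_utf8_bytes>
are hypothetical names, and the hand encoder deliberately handles only
Unicode scalar values, not Perl's extended range.

```perl
use strict;
use warnings;

# Encode one code point following the bit patterns in the table.
sub utf8_bytes {
    my ($cp) = @_;
    die "not a Unicode scalar value\n"
        if $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF);

    return pack('C', $cp) if $cp < 0x80;
    return pack('C2', 0xC0 | ($cp >> 6),
                      0x80 | ($cp & 0x3F)) if $cp < 0x800;
    return pack('C3', 0xE0 | ($cp >> 12),
                      0x80 | (($cp >> 6) & 0x3F),
                      0x80 | ($cp & 0x3F)) if $cp < 0x10000;
    return pack('C4', 0xF0 | ($cp >> 18),
                      0x80 | (($cp >> 12) & 0x3F),
                      0x80 | (($cp >> 6) & 0x3F),
                      0x80 | ($cp & 0x3F));
}

# What Perl itself generates for the same code point.
sub perl_utf8_bytes {
    my ($cp) = @_;
    my $string = chr $cp;
    utf8::encode($string);
    return $string;
}

for my $cp (0x41, 0xE9, 0x20AC, 0x1F600) {
    printf "U+%04X: %s\n", $cp,
        join ' ', map { sprintf '%02X', $_ } unpack 'C*', utf8_bytes($cp);
}
```

Always choosing the first branch that fits is what guarantees the shortest
encoding, which is the only legal one.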
1287
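
The surrogate arithmetic and the C<BOM> signatures given in the list above
can be sketched in a few lines. The subroutine names are hypothetical, and
C<sniff_bom> only guesses from the leading bytes; it is not a substitute for
knowing the encoding of your data.

```perl
use strict;
use warnings;

# Split a supplementary code point (U+10000..U+10FFFF) into a surrogate pair.
sub to_surrogates {
    my ($cp) = @_;
    my $offset = $cp - 0x10000;
    return (0xD800 + ($offset >> 10), 0xDC00 + ($offset & 0x3FF));
}

# Recombine a high and a low surrogate into the original code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
}

# Guess an encoding from a leading byte order mark, or return nothing.
# The UTF-32 tests come first because the UTF-32LE signature begins with
# the same two bytes as the UTF-16LE one.
sub sniff_bom {
    my ($bytes) = @_;
    return 'UTF-32BE' if $bytes =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-32LE' if $bytes =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return;
}

my ($hi, $lo) = to_surrogates(0x1F600);
printf "U+1F600 -> %04X %04X -> U+%X\n", $hi, $lo, from_surrogates($hi, $lo);
```

Shifting by ten bits is the same as the integer division by C<0x400> in the
formulas, and masking with C<0x3FF> is the same as the remainder.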
57e88091 1288=head2 Noncharacter code points
6d4f9cf2 1289
57e88091 129066 code points are set aside in Unicode as "noncharacter code points".
a9130ea9 1291These all have the C<Unassigned> (C<Cn>) C<L</General_Category>>, and
57e88091
KW
1292no character will ever be assigned to any of them. They are the 32 code
1293points between C<U+FDD0> and C<U+FDEF> inclusive, and the 34 code
1294points:
1295
1296 U+FFFE U+FFFF
1297 U+1FFFE U+1FFFF
1298 U+2FFFE U+2FFFF
1299 ...
1300 U+EFFFE U+EFFFF
1301 U+FFFFE U+FFFFF
1302 U+10FFFE U+10FFFF
1303
1304Until Unicode 7.0, the noncharacters were "B<forbidden> for use in open
1305interchange of Unicode text data", so that code that processed those
1306streams could use these code points as sentinels that could be mixed in
1307with character data, and would always be distinguishable from that data.
1308(Emphasis above and in the next paragraph is added in this document.)
1309
1310Unicode 7.0 changed the wording so that they are "B<not recommended> for
1311use in open interchange of Unicode text data". The 7.0 Standard goes on
1312to say:
1313
1314=over 4
1315
1316"If a noncharacter is received in open interchange, an application is
1317not required to interpret it in any way. It is good practice, however,
1318to recognize it as a noncharacter and to take appropriate action, such
1319as replacing it with C<U+FFFD> replacement character, to indicate the
1320problem in the text. It is not recommended to simply delete
1321noncharacter code points from such text, because of the potential
1322security issues caused by deleting uninterpreted characters. (See
1323conformance clause C7 in Section 3.2, Conformance Requirements, and
1324L<Unicode Technical Report #36, "Unicode Security
1325Considerations"|http://www.unicode.org/reports/tr36/#Substituting_for_Ill_Formed_Subsequences>)."
1326
1327=back
1328
1329This change was made because it was found that various commercial tools
1330like editors, or for things like source code control, had been written
1331so that they would not handle program files that used these code points,
1332effectively precluding their use almost entirely! And that was never
1333the intent. They've always been meant to be usable within an
1334application, or cooperating set of applications, at will.
1335
1336If you're writing code, such as an editor, that is supposed to be able
1337to handle any Unicode text data, then you shouldn't be using these code
1338points yourself, and instead allow them in the input. If you need
1339sentinels, they should instead be something that isn't legal Unicode.
1340For UTF-8 data, you can use the bytes 0xC0 and 0xC1 as sentinels, as
1341they never appear in well-formed UTF-8. (There are equivalents for
1342UTF-EBCDIC). You can also store your Unicode code points in integer
1343variables and use negative values as sentinels.
1344
1345If you're not writing such a tool, then whether you accept noncharacters
1346as input is up to you (though the Standard recommends that you not). If
1347you do strict input stream checking with Perl, these code points
1348continue to be forbidden. This is to maintain backward compatibility
1349(otherwise potential security holes could open up, as an unsuspecting
1350application that was written assuming the noncharacters would be
1351filtered out before getting to it, could now, without warning, start
1352getting them). To do strict checking, you can use the layer
1353C<:encoding('UTF-8')>.
1354
1355Perl continues to warn (using the warning category C<"nonchar">, which
1356is a sub-category of C<"utf8">) if an attempt is made to output
1357noncharacters.
42581d5d
KW
1358
=head2 Beyond Unicode code points

The maximum Unicode code point is C<U+10FFFF>, and Unicode only defines
operations on code points up through that. But Perl works on code
points up to the maximum permissible unsigned number available on the
platform. However, Perl will not accept these from input streams unless
lax rules are being used, and will warn (using the warning category
C<"non_unicode">, which is a sub-category of C<"utf8">) if any are output.

Since Unicode rules are not defined on these code points, if a
Unicode-defined operation is done on them, Perl uses what we believe are
sensible rules, while generally warning, using the C<"non_unicode">
category. For example, C<uc("\x{11_0000}")> will generate such a
warning, returning the input parameter as its result, since Perl defines
the uppercase of every non-Unicode code point to be the code point
itself. In fact, all the case changing operations, not just
uppercasing, work this way.

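A small sketch of the case-changing rule described above: it uppercases an
above-Unicode code point, checks that the result is unchanged, and counts
any warnings raised along the way. The variable names are illustrative, and
the number of warnings may depend on the Perl version.

```perl
use strict;
use warnings;

# Capture warnings so they can be counted rather than printed.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

my $above = chr(0x110000);    # first code point beyond Unicode
my $upper = uc $above;

print $upper eq $above ? "uc() returned its argument\n"
                       : "uc() changed the code point\n";
printf "warnings seen: %d\n", scalar @warnings;
```
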
The situation with matching Unicode properties in regular expressions,
the C<\p{}> and C<\P{}> constructs, against these code points is not as
clear cut, and how these are handled has changed as we've gained
experience.

One possibility is to treat any match against these code points as
undefined. But since Perl doesn't have the concept of a match being
undefined, it converts this to failing or C<FALSE>. This is almost, but
not quite, what Perl did from v5.14 (when use of these code points
became generally reliable) through v5.18. The difference is that Perl
treated all C<\p{}> matches as failing, but all C<\P{}> matches as
succeeding.

One problem with this is that it leads to unexpected and confusing
results in some cases:

 chr(0x110000) =~ /\p{ASCII_Hex_Digit=True}/   # Failed on <= v5.18
 chr(0x110000) =~ /\p{ASCII_Hex_Digit=False}/  # Failed! on <= v5.18

That is, it treated both matches as undefined, and converted that to
false (raising a warning on each). The first case is the expected
result, but the second is likely counterintuitive: "How could both be
false when they are complements?" Another problem was that the
implementation optimized many Unicode property matches down to already
existing simpler, faster operations, which don't raise the warning. We
chose to not forgo those optimizations, which help the vast majority of
matches, just to generate a warning for the unlikely event that an
above-Unicode code point is being matched against.

As a result of these problems, starting in v5.20, what Perl does is
to treat non-Unicode code points as just typical unassigned Unicode
characters, and matches accordingly. (Note: Unicode has atypical
unassigned code points. For example, it has noncharacter code points,
and ones that, when they do get assigned, are destined to be written
Right-to-left, as Arabic and Hebrew are. Perl assumes that no
non-Unicode code point has any atypical properties.)

Perl, in most cases, will raise a warning when matching an above-Unicode
code point against a Unicode property when the result is C<TRUE> for
C<\p{}>, and C<FALSE> for C<\P{}>. For example:

 chr(0x110000) =~ /\p{ASCII_Hex_Digit=True}/   # Fails, no warning
 chr(0x110000) =~ /\p{ASCII_Hex_Digit=False}/  # Succeeds, with warning

In both these examples, the character being matched is non-Unicode, so
Unicode doesn't define how it should match. It clearly isn't an ASCII
hex digit, so the first example clearly should fail, and so it does,
with no warning. But it is arguable that the second example should have
an undefined, hence C<FALSE>, result. So a warning is raised for it.

Thus the warning is raised for many fewer cases than in earlier Perls,
and only when the result is arguable. It turns out that
none of the optimizations made by Perl (or ever likely to be made)
cause the warning to be skipped, so it solves both problems of Perl's
earlier approach. The most commonly used property that is affected by
this change is C<\p{Unassigned}> which is a short form for
C<\p{General_Category=Unassigned}>. Starting in v5.20, all non-Unicode
code points are considered C<Unassigned>. In earlier releases the
matches failed because the result was considered undefined.

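The following sketch checks the two properties discussed above against an
above-Unicode code point. It assumes Perl v5.20 or later for the
C<\p{Unassigned}> result, and it disables the C<non_unicode> warning so that
only the match results are shown. The variable names are illustrative.

```perl
use strict;
use warnings;
no warnings 'non_unicode';    # silence the "arguable result" warning

my $above = chr(0x110000);    # not a Unicode code point

# On v5.20 and later this is expected to match, as the text describes.
my $is_unassigned = $above =~ /\p{Unassigned}/      ? 1 : 0;

# An above-Unicode code point is never an ASCII hex digit.
my $is_hex_digit  = $above =~ /\p{ASCII_Hex_Digit}/ ? 1 : 0;

print "Unassigned: $is_unassigned, ASCII_Hex_Digit: $is_hex_digit\n";
```
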
The only place where the warning is not raised when it arguably should
have been is if optimizations cause the whole pattern match to not even
be attempted. For example, Perl may figure out that for a string to
match a certain regular expression pattern, the string has to contain
the substring C<"foobar">. Before attempting the match, Perl may look
for that substring, and if not found, immediately fail the match without
actually trying it; so no warning gets generated even if the string
contains an above-Unicode code point.

This behavior is more "Do what I mean" than in earlier Perls for most
applications. But it catches fewer issues for code that needs to be
strictly Unicode compliant. Therefore there is an additional mode of
operation available to accommodate such code. This mode is enabled if a
regular expression pattern is compiled within the lexical scope where
the C<"non_unicode"> warning class has been made fatal, say by:

 use warnings FATAL => "non_unicode";

(see L<warnings>). In this mode of operation, Perl will raise the
warning for all matches against a non-Unicode code point (not just the
arguable ones), and it skips the optimizations that might cause the
warning to not be output. (It currently still won't warn if the match
isn't even attempted, like in the C<"foobar"> example above.)

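A hedged sketch of the strict mode follows. It compiles a pattern inside a
scope where C<non_unicode> is fatal and uses C<eval> to trap the exception
that the text leads one to expect on v5.20 and later. The outcome is reported
rather than assumed, since it may vary with the Perl version; C<$outcome> is
an illustrative name.

```perl
use strict;
use warnings;

my $outcome;
{
    # Patterns compiled in this scope use the strict mode of operation.
    use warnings FATAL => 'non_unicode';

    my $matched = eval {
        chr(0x110000) =~ /\p{ASCII_Hex_Digit=True}/ ? 1 : 0;
    };

    # undef from eval means the fatal warning was thrown as an exception.
    $outcome = !defined $matched ? 'died'
             : $matched          ? 'matched'
             :                     'failed';
}

print "strict-mode match against an above-Unicode code point: $outcome\n";
```
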
In summary, Perl now normally treats non-Unicode code points as typical
Unicode unassigned code points for regular expression matches, raising a
warning only when it is arguable what the result should be. However, if
this warning has been made fatal, it isn't skipped.

There is one exception to all this. C<\p{All}> looks like a Unicode
property, but it is a Perl extension that is defined to be true for all
possible code points, Unicode or not, so no warning is ever generated
when matching this against a non-Unicode code point. (Prior to v5.20,
it was an exact synonym for C<\p{Any}>, matching code points C<0>
through C<0x10FFFF>.)

=head2 Security Implications of Unicode

Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.
Also, note the following:

=over 4

=item *

Malformed UTF-8

Unfortunately, the original specification of UTF-8 leaves some room for
interpretation of how many bytes of encoded output one should generate
from one input Unicode character. Strictly speaking, the shortest
possible sequence of UTF-8 bytes should be generated,
because otherwise there is potential for an input buffer overflow at
the receiving end of a UTF-8 connection. Perl always generates the
shortest length UTF-8, and with warnings on, Perl will warn about
non-shortest length UTF-8 along with other malformations, such as the
surrogates, which are not Unicode code points valid for interchange.

=item *

Regular expression pattern matching may surprise you if you're not
accustomed to Unicode. Starting in Perl 5.14, several pattern
modifiers are available to control this, called the character set
modifiers. Details are given in L<perlre/Character set modifiers>.

=back

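To illustrate the shortest-form rule from the first item, this sketch uses
the standard L<Encode> module. It encodes one character, then asks for a
strict decode of an overlong (non-shortest) two-byte sequence, which should
be rejected. The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode ();

# Perl emits the shortest form: U+00E9 becomes the two bytes C3 A9.
my $shortest = Encode::encode('UTF-8', "\x{E9}");

# C0 80 is an overlong encoding of U+0000 and is not well-formed UTF-8.
my $overlong = "\xC0\x80";

# FB_CROAK makes decode() die on malformed input instead of substituting.
my $decoded = eval {
    Encode::decode('UTF-8', $overlong, Encode::FB_CROAK());
};
my $error = $@;

printf "shortest form is %d bytes\n", length $shortest;
print defined $decoded ? "overlong input accepted\n"
                       : "overlong input rejected\n";
```
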
As discussed elsewhere, Perl has one foot (two hooves?) planted in
each of two worlds: the old world of bytes and the new world of
characters, upgrading from bytes to characters when necessary.
If your legacy code does not explicitly use Unicode, no automatic
switch-over to characters should happen. Characters shouldn't get
downgraded to bytes, either. It is possible to accidentally mix bytes
and characters, however (see L<perluniintro>), in which case C<\w> in
regular expressions might start behaving differently (unless the C</a>
modifier is in effect). Review your code. Use warnings and the C<strict> pragma.

=head2 Unicode in Perl on EBCDIC

The way Unicode is handled on EBCDIC platforms is still
experimental. On such platforms, references to UTF-8 encoding in this
document and elsewhere should be read as meaning the UTF-EBCDIC
specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues
are specifically discussed. There is no C<utfebcdic> pragma or
C<":utfebcdic"> layer; rather, C<"utf8"> and C<":utf8"> are reused to mean
the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic>
for more discussion of the issues.

=head2 Locales

See L<perllocale/Unicode and UTF-8>

=head2 When Unicode Does Not Happen

While Perl does have extensive ways to input and output in Unicode,
and a few other "entry points" like the C<@ARGV> array (which can sometimes be
interpreted as UTF-8), there are still many places where Unicode
(in some encoding or another) could be given as arguments or received as
results, or both, but it is not.

The following are such interfaces. Also, see L</The "Unicode Bug">.
For all of these interfaces Perl
currently (as of v5.16.0) simply assumes byte strings both as arguments
and results, or UTF-8 strings if the (problematic) C<encoding> pragma has been used.

One reason that Perl does not attempt to resolve the role of Unicode in
these situations is that the answers are highly dependent on the operating
system and the file system(s). For example, whether filenames can be
in Unicode and in exactly what kind of encoding, is not exactly a
portable concept. Similarly for C<qx> and C<system>: how well will the
"command-line interface" (and which of them?) handle Unicode?

=over 4

=item *

C<chdir>, C<chmod>, C<chown>, C<chroot>, C<exec>, C<link>, C<lstat>, C<mkdir>,
C<rename>, C<rmdir>, C<stat>, C<symlink>, C<truncate>, C<unlink>, C<utime>, C<-X>

=item *

C<%ENV>

=item *

C<glob> (aka the C<E<lt>*E<gt>>)

=item *

C<open>, C<opendir>, C<sysopen>

=item *

C<qx> (aka the backtick operator), C<system>

=item *

C<readdir>, C<readlink>

=back

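Since these interfaces work on byte strings, one way to cope is to do the
encoding and decoding yourself. The sketch below assumes a file system that
stores names as UTF-8 octets, which is platform dependent and not something
Perl can guarantee. The file name is invented for the example, and
C<File::Temp> is used only to get a scratch directory.

```perl
use strict;
use warnings;
use Encode ();
use File::Temp ();

my $dir  = File::Temp::tempdir(CLEANUP => 1);

# A character string, explicitly encoded to octets before it reaches open().
my $name = "caf\x{E9}-\x{2660}.txt";
my $path = "$dir/" . Encode::encode('UTF-8', $name);

open my $fh, '>', $path or die "Cannot create file: $!";
print {$fh} "ok\n";
close $fh or die "Cannot close file: $!";

# readdir() returns octets too; decode them back into character strings.
opendir my $dh, $dir or die "Cannot read $dir: $!";
my @names = map  { Encode::decode('UTF-8', $_) }
            grep { !/^\.\.?\z/ } readdir $dh;
closedir $dh;

print scalar(@names), " entry found\n";
```

Some file systems normalize names, so the decoded name may not be
character-for-character identical to the one that was written.
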
=head2 The "Unicode Bug"

The term "Unicode bug" has been applied to an inconsistency
on ASCII platforms with the
Unicode code points in the C<Latin-1 Supplement> block, that
is, between 128 and 255. Without a locale specified, unlike all other
characters or code points, these characters behave very differently
under byte semantics than under character semantics, unless
C<use feature 'unicode_strings'> is specified, directly or indirectly.
(It is indirectly specified by a C<use v5.12> or higher.)

In character semantics these upper-Latin1 characters are interpreted as
Unicode code points, which means
they have the same semantics as Latin-1 (ISO-8859-1).

In byte semantics (without C<unicode_strings>), they are considered to
be unassigned characters, meaning that the only semantics they have is
their ordinal numbers, and that they are
not members of various character classes. None are considered to match C<\w>
for example, but all match C<\W>.

Perl 5.12.0 added C<unicode_strings> to force character semantics on
these code points in some circumstances, which fixed portions of the
bug; Perl 5.14.0 fixed almost all of it; and Perl 5.16.0 fixed the
remainder (so far as we know, anyway). The lesson here is to enable
C<unicode_strings> to avoid the headaches described below.

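The difference between the two semantics can be seen with a short sketch
that runs the same operations on one Latin-1 character with the feature off
and then on. It assumes Perl 5.14 or later and no locale in effect; the
variable names are illustrative.

```perl
use strict;
use warnings;

# LATIN SMALL LETTER E WITH ACUTE, stored as a single byte.
my $e_acute = "\xE9";

my ($byte_uc, $byte_is_word);
{
    no feature 'unicode_strings';     # byte semantics for 128..255
    $byte_uc      = uc $e_acute;
    $byte_is_word = $e_acute =~ /\w/ ? 1 : 0;
}

my ($char_uc, $char_is_word);
{
    use feature 'unicode_strings';    # character semantics
    $char_uc      = uc $e_acute;
    $char_is_word = $e_acute =~ /\w/ ? 1 : 0;
}

printf "without feature: uc -> 0x%02X, \\w -> %d\n",
    ord($byte_uc), $byte_is_word;
printf "with feature:    uc -> 0x%02X, \\w -> %d\n",
    ord($char_uc), $char_is_word;
```
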
The old, problematic behavior affects these areas:

=over 4

=item *

Changing the case of a scalar, that is, using C<uc()>, C<ucfirst()>, C<lc()>,
and C<lcfirst()>, or C<\L>, C<\U>, C<\u> and C<\l> in double-quotish
contexts, such as regular expression substitutions.
Under C<unicode_strings> starting in Perl 5.12.0, character semantics are
generally used. See L<perlfunc/lc> for details on how this works
in combination with various other pragmas.

=item *

Using caseless (C</i>) regular expression matching.
Starting in Perl 5.14.0, regular expressions compiled within
the scope of C<unicode_strings> use character semantics
even when executed or compiled into larger
regular expressions outside the scope.

=item *

Matching any of several properties in regular expressions, namely
C<\b> (without braces), C<\B> (without braces), C<\s>, C<\S>, C<\w>,
C<\W>, and all the Posix character classes
I<except> C<[[:ascii:]]>.
Starting in Perl 5.14.0, regular expressions compiled within
the scope of C<unicode_strings> use character semantics
even when executed or compiled into larger
regular expressions outside the scope.

=item *

In C<quotemeta> or its inline equivalent C<\Q>, no code points above 127
are quoted in UTF-8 encoded strings, but in byte encoded strings, code
points between 128-255 are always quoted.
Starting in Perl 5.16.0, consistent quoting rules are used within the
scope of C<unicode_strings>, as described in L<perlfunc/quotemeta>.

=back

This behavior can lead to unexpected results in which a string's semantics
suddenly change if a code point above 255 is appended to or removed from it,
which changes the string's semantics from byte to character or vice versa. As
an example, consider the following program and its output:

 $ perl -le'
     no feature "unicode_strings";
     $s1 = "\xC2";
     $s2 = "\x{2660}";
     for ($s1, $s2, $s1.$s2) {
         print /\w/ || 0;
     }
 '
 0
 0
 1

If there's no C<\w> in C<$s1> or in C<$s2>, why does their concatenation have one?

This anomaly stems from Perl's attempt to not disturb older programs that
didn't use Unicode, and hence had no semantics for characters outside of the
ASCII range (except in a locale), along with Perl's desire to add Unicode
support seamlessly. The result wasn't seamless: these characters were
orphaned.

For Perls earlier than those described above, or when a string is passed
to a function outside the scope of C<unicode_strings>, a workaround is to
always call L<C<utf8::upgrade($string)>|utf8/Utility functions>,
or to use the standard module L<Encode>. Also, a scalar that has any characters
whose ordinal is C<0x100> or above, or which were specified using either of the
C<\N{...}> notations, will automatically have character semantics.

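The C<utf8::upgrade> workaround can be sketched as follows. With
C<unicode_strings> off, the same one-character string is tested against
C<\w> before and after being upgraded. The variable names are illustrative.

```perl
use strict;
use warnings;
no feature 'unicode_strings';    # reproduce the old behavior

my $string = "\xE9";             # e-acute, stored as a single byte

# Byte semantics: 0xE9 has no character class membership.
my $before = $string =~ /\w/ ? 1 : 0;

# Switch the internal representation, which brings character semantics.
utf8::upgrade($string);
my $after  = $string =~ /\w/ ? 1 : 0;

print "before upgrade: $before, after upgrade: $after\n";
```

The string still contains the same single character; only its internal
representation, and hence the semantics applied to it, has changed.
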
=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)

Sometimes (see L</"When Unicode Does Not Happen"> or L</The "Unicode Bug">)
there are situations where you simply need to force a byte
string into UTF-8, or vice versa. The low-level calls
L<C<utf8::upgrade($bytestring)>|utf8/Utility functions> and
L<C<utf8::downgrade($utf8string[, FAIL_OK])>|utf8/Utility functions> are
the answers.

Note that C<utf8::downgrade()> can fail if the string contains characters
that don't fit into a byte.

Calling either function on a string that already is in the desired state is a
no-op.

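A brief sketch of both calls, using C<utf8::is_utf8> to observe the internal
flag. The second part passes a true C<FAIL_OK> argument so that a failed
downgrade returns false instead of dying. The variable names are
illustrative.

```perl
use strict;
use warnings;

my $string = "caf\xE9";

utf8::upgrade($string);                       # force the UTF-8 representation
my $flag_after_upgrade   = utf8::is_utf8($string) ? 1 : 0;

my $downgraded_ok        = utf8::downgrade($string) ? 1 : 0;
my $flag_after_downgrade = utf8::is_utf8($string) ? 1 : 0;

# U+2660 cannot fit in a byte, so the downgrade cannot succeed.
my $wide    = "\x{2660}";
my $wide_ok = utf8::downgrade($wide, 1) ? 1 : 0;   # FAIL_OK is true

print "upgrade flag: $flag_after_upgrade, ",
      "downgrade flag: $flag_after_downgrade, ",
      "wide downgrade ok: $wide_ok\n";
```
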
=head2 Using Unicode in XS

See L<perlguts/"Unicode Support"> for an introduction to Unicode at
the XS level, and L<perlapi/Unicode Support> for the API details.

=head2 Hacking Perl to work on earlier Unicode versions (for very serious hackers only)

Perl by default comes with the latest supported Unicode version built in, but
you can change to use any earlier one.

Download the files in the desired version of Unicode from the Unicode web
site (L<http://www.unicode.org>). These should replace the existing files in
F<lib/unicore> in the Perl source tree. Follow the instructions in
F<README.perl> in that directory to change some of their names, and then build
perl (see L<INSTALL>).

=head1 BUGS

=head2 Interaction with Locales

See L<perllocale/Unicode and UTF-8>

=head2 Problems with characters in the Latin-1 Supplement range

See L</The "Unicode Bug">

=head2 Interaction with Extensions

When Perl exchanges data with an extension, the extension should be
able to understand the UTF8 flag and act accordingly. If the
extension doesn't recognize that flag, it's likely that the extension
will return incorrectly-flagged data.

So if you're working with Unicode data, consult the documentation of
every module you're using to see whether there are any issues with Unicode
data exchange. If the documentation does not talk about Unicode at all,
suspect the worst and probably look at the source to learn how the
module is implemented. Modules written completely in Perl shouldn't
cause problems. Modules that directly or indirectly access code written
in other programming languages are at risk.

For affected functions, the simple strategy to avoid data corruption is
to always make the encoding of the exchanged data explicit. Choose an
encoding that you know the extension can handle. Convert arguments passed
to the extensions to that encoding and convert results back from that
encoding. Write wrapper functions that do the conversions for you, so
you can later change the functions when the extension catches up.

To provide an example, let's say the popular C<Foo::Bar::escape_html>
function doesn't deal with Unicode data yet. The wrapper function
would convert the argument to raw UTF-8 and convert the result back to
Perl's internal representation like so:

 sub my_escape_html ($) {
     my($what) = shift;
     return unless defined $what;
     require Encode;    # provides encode_utf8() and decode_utf8()
     Encode::decode_utf8(Foo::Bar::escape_html(
                                      Encode::encode_utf8($what)));
 }

Sometimes, when the extension does not convert data but just stores
and retrieves them, you will be able to use the otherwise
dangerous L<C<Encode::_utf8_on()>|Encode/_utf8_on> function. Let's say
the popular C<Foo::Bar> extension, written in C, provides a C<param>
method that lets you store and retrieve data according to these prototypes:

 $self->param($name, $value);            # set a scalar
 $value = $self->param($name);           # retrieve a scalar

If it does not yet provide support for any encoding, one could write a
derived class with such a C<param> method:

 sub param {
     my($self,$name,$value) = @_;
     require Encode;               # needed for Encode::_utf8_on()
     utf8::upgrade($name);         # make sure it is UTF-8 encoded
     if (defined $value) {
         utf8::upgrade($value);    # make sure it is UTF-8 encoded
         return $self->SUPER::param($name,$value);
     } else {
         my $ret = $self->SUPER::param($name);
         Encode::_utf8_on($ret);   # we know, it is UTF-8 encoded
         return $ret;
     }
 }

Some extensions provide filters on data entry/exit points, such as
C<DB_File::filter_store_key> and family. Look out for such filters in
the documentation of your extensions; they can make the transition to
Unicode data much easier.

=head2 Speed

Some functions are slower when working on UTF-8 encoded strings than
on byte encoded strings. All functions that need to hop over
characters such as C<length()>, C<substr()> or C<index()>, or matching
regular expressions can work B<much> faster when the underlying data are
byte-encoded.

In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
a caching scheme was introduced which will hopefully make the slowness
somewhat less spectacular, at least for some operations. In general,
operations with UTF-8 encoded strings are still slower. As an example,
the Unicode properties (character classes) like C<\p{Nd}> are known to
be quite a bit slower (5-20 times) than their simpler counterparts
like C<\d> (then again, there are hundreds of Unicode characters matching C<Nd>
compared with the 10 ASCII characters matching C<\d>).

=head2 Porting code from perl-5.6.X

Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer
was required to use the C<utf8> pragma to declare that a given scope
expected to deal with Unicode data and had to make sure that only
Unicode data were reaching that scope. If you have code that is
working with 5.6, you will need some of the following adjustments to
your code. The examples are written such that the code will continue
to work under 5.6, so you should be safe to try them out.

=over 3

=item *

A filehandle that should read or write UTF-8

 if ($] >= 5.008) {
     binmode $fh, ":encoding(utf8)";
 }

=item *

A scalar that is going to be passed to some extension

Be it C<Compress::Zlib>, C<Apache::Request> or any extension that has no
mention of Unicode in the manpage, you need to make sure that the
UTF8 flag is stripped off. Note that at the time of this writing
(January 2012) the mentioned modules are not UTF-8-aware. Please
check the documentation to verify if this is still true.

 if ($] >= 5.008) {
     require Encode;
     $val = Encode::encode_utf8($val); # make octets
 }

=item *

A scalar we got back from an extension

If you believe the scalar comes back as UTF-8, you will most likely
want the UTF8 flag restored:

 if ($] >= 5.008) {
     require Encode;
     $val = Encode::decode_utf8($val);
 }

=item *

Same thing, if you are really sure it is UTF-8

 if ($] >= 5.008) {
     require Encode;
     Encode::_utf8_on($val);
 }

=item *

A wrapper for L<DBI> C<fetchrow_array> and C<fetchrow_hashref>

When the database contains only UTF-8, a wrapper function or method is
a convenient way to replace all your C<fetchrow_array> and
C<fetchrow_hashref> calls. A wrapper function will also make it easier to
adapt to future enhancements in your database driver. Note that at the
time of this writing (January 2012), the DBI has no standardized way
to deal with UTF-8 data. Please check the L<DBI documentation|DBI> to verify if
that is still true.

 sub fetchrow {
     # $what is one of fetchrow_{array,hashref}
     my($self, $sth, $what) = @_;
     if ($] < 5.008) {
         return $sth->$what;
     } else {
         require Encode;
         if (wantarray) {
             my @arr = $sth->$what;
             for (@arr) {
                 defined && /[^\000-\177]/ && Encode::_utf8_on($_);
             }
             return @arr;
         } else {
             my $ret = $sth->$what;
             if (ref $ret) {
                 for my $k (keys %$ret) {
                     defined
                     && /[^\000-\177]/
                     && Encode::_utf8_on($_) for $ret->{$k};
                 }
                 return $ret;
             } else {
                 defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
                 return $ret;
             }
         }
     }
 }

1900
1901A large scalar that you know can only contain ASCII
1902
1903Scalars that contain only ASCII and are marked as UTF-8 are sometimes
1904a drag to your program. If you recognize such a situation, just remove
2575c402 1905the UTF8 flag:
c8d992ba 1906
b9cedb1b 1907 utf8::downgrade($val) if $] > 5.008;
c8d992ba
A
1908
1909=back
1910
393fec97
GS
1911=head1 SEE ALSO
1912
51f494cc 1913L<perlunitut>, L<perluniintro>, L<perluniprops>, L<Encode>, L<open>, L<utf8>, L<bytes>,
a05d7ebb 1914L<perlretut>, L<perlvar/"${^UNICODE}">
51f494cc 1915L<http://www.unicode.org/reports/tr44>).
393fec97
GS
1916
1917=cut