=head1 NAME

perlunicode - Unicode support in Perl

=head1 DESCRIPTION

=head2 Important Caveats

Unicode support is an extensive requirement. While Perl does not
implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.

People who want to learn to use Unicode in Perl should probably read
the L<Perl Unicode tutorial, perlunitut|perlunitut> and
L<perluniintro> before reading
this reference document.

Also, the use of Unicode may present security issues that aren't obvious.
Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.

=over 4

=item Safest if you C<use feature 'unicode_strings'>

In order to preserve backward compatibility, Perl does not turn
on full internal Unicode support unless the pragma
C<use feature 'unicode_strings'> is specified. (This is automatically
selected if you C<use 5.012> or higher.) Failure to do this can
trigger surprises. See L</The "Unicode Bug"> below.

This pragma doesn't affect I/O. Nor does it change the internal
representation of strings, only their interpretation. There are still
several places where Unicode isn't fully supported, such as in
filenames.

=item Input and Output Layers

Perl knows when a filehandle uses Perl's internal Unicode encodings
(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
the C<:encoding(utf8)> layer. Other encodings can be converted to Perl's
encoding on input or from Perl's encoding on output by use of the
C<:encoding(...)> layer. See L<open>.

To indicate that Perl source itself is in UTF-8, use C<use utf8;>.

=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts

As a compatibility measure, the C<use utf8> pragma must be explicitly
included to enable recognition of UTF-8 in the Perl scripts themselves
(in string or regular expression literals, or in identifier names) on
ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
machines. B<These are the only times when an explicit C<use utf8>
is needed.> See L<utf8>.

=item C<BOM>-marked scripts and UTF-16 scripts autodetected

If a Perl script begins marked with the Unicode C<BOM> (UTF-16LE, UTF-16BE,
or UTF-8), or if the script looks like non-C<BOM>-marked UTF-16 of either
endianness, Perl will correctly read in the script as Unicode.
(C<BOM>less UTF-8 cannot be effectively recognized or differentiated from
ISO 8859-1 or other eight-bit encodings.)

=item C<use encoding> needed to upgrade non-Latin-1 byte strings

By default, there is a fundamental asymmetry in Perl's Unicode model:
implicit upgrading from byte strings to Unicode strings assumes that
they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
downgraded with UTF-8 encoding. This happens because the first 256
code points in Unicode happen to agree with Latin-1.

See L</"Byte and Character Semantics"> for more details.

=back

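As a rough illustration of the input and output layers described in the
caveats above, the following sketch writes one character through an
encoding layer and reads it back. The in-memory filehandle and the
variable names are assumptions made for the example, and the strict
C<:encoding(UTF-8)> spelling stands in for the C<:encoding(utf8)> layer
named above.

```perl
use strict;
use warnings;

# Write U+263A through an encoding layer into an in-memory "file".
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "Cannot open for writing: $!";
print {$out} "\x{263A}";
close($out) or die "Cannot close: $!";

# The buffer now holds the encoded bytes, not the character itself.
my $byte_count = length $buffer;

# Read the bytes back through the same layer to recover the character.
open(my $in, '<:encoding(UTF-8)', \$buffer) or die "Cannot open for reading: $!";
my $text = <$in>;
close($in) or die "Cannot close: $!";
my $char_count = length $text;

print "bytes in buffer: $byte_count, characters in Perl: $char_count\n";
```

The layer does the conversion at the boundary, so the rest of the program
only ever sees characters.
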
=head2 Byte and Character Semantics

Perl uses logically-wide characters to represent strings internally.

Starting in Perl 5.14, Perl-level operations work with
characters rather than bytes within the scope of a
C<L<use feature 'unicode_strings'|feature>> (or equivalently
C<use 5.012> or higher). (This is not true if bytes have been
explicitly requested by C<L<use bytes|bytes>>, nor necessarily true
for interactions with the platform's operating system.)

For earlier Perls, and when C<unicode_strings> is not in effect, Perl
provides a fairly safe environment that can handle both types of
semantics in programs. For operations where Perl can unambiguously
decide that the input data are characters, Perl switches to character
semantics. For operations where this determination cannot be made
without additional information from the user, Perl decides in favor of
compatibility and chooses to use byte semantics.

When C<use locale> (but not C<use locale ':not_characters'>) is in
effect, Perl uses the semantics associated with the current locale.
(C<use locale> overrides C<use feature 'unicode_strings'> in the same scope;
while C<use locale ':not_characters'> effectively also selects
C<use feature 'unicode_strings'> in its scope; see L<perllocale>.)
Otherwise, Perl uses the platform's native
byte semantics for characters whose code points are less than 256, and
Unicode semantics for those greater than 255. Under native byte
semantics, the non-ASCII characters below 256 are undefined except for
their ordinal numbers. This means that none have case (upper and lower),
nor are any a member of character classes, like C<[:alpha:]> or C<\w>.
(But all do belong to the C<\W> class or the Perl regular expression
extension C<[:^alpha:]>.)

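The difference between native byte semantics and Unicode semantics for
code points 128 to 255 can be sketched as follows. This is only an
illustration: it assumes an ASCII platform, Perl 5.14 or later, and no
C<use locale> in effect, and the variable names are made up for the
example.

```perl
use strict;
use warnings;

# U+00E9 LATIN SMALL LETTER E WITH ACUTE, held as a single native byte.
my $e_acute = "\xE9";

my ($is_word_native, $upper_native);
{
    no feature 'unicode_strings';    # native byte semantics for 128..255
    $is_word_native = ($e_acute =~ /\w/) ? 1 : 0;
    $upper_native   = uc $e_acute;
}

my ($is_word_unicode, $upper_unicode);
{
    use feature 'unicode_strings';   # Unicode semantics for all code points
    $is_word_unicode = ($e_acute =~ /\w/) ? 1 : 0;
    $upper_unicode   = uc $e_acute;
}

printf "native:  \\w=%d uc=U+%04X\n", $is_word_native,  ord $upper_native;
printf "unicode: \\w=%d uc=U+%04X\n", $is_word_unicode, ord $upper_unicode;
```

The string is identical in both blocks; only the interpretation of its one
character changes with the pragma.
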
This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if
none of the program's inputs were marked as being a source of Unicode
character data. Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as C<%ENV>),
or from literals and constants in the source text.

The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
Note that this pragma is only required while Perl defaults to byte
semantics; when character semantics become the default, this pragma
may become a no-op. See L<utf8>.

If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will have
character semantics. This can cause surprises: See L</BUGS>, below.
You can choose to be warned when this happens. See C<L<encoding::warnings>>.

Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
logically just a number ranging from 0 to 2**31 or so. Larger
characters may encode into longer sequences of bytes internally, but
this internal detail is mostly hidden for Perl code.
See L<perluniintro> for more.

=head2 Effects of Character Semantics

Character semantics have the following effects:

=over 4

=item *

Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.

If you use a Unicode editor to edit your program, Unicode characters may
occur directly within the literal strings in UTF-8 encoding, or UTF-16.
(The former requires a C<BOM> or C<use utf8>, the latter requires a C<BOM>.)

Unicode characters can also be added to a string by using the C<\N{U+...}>
notation. The Unicode code for the desired character, in hexadecimal,
should be placed in the braces, after the C<U>. For instance, a smiley face is
C<\N{U+263A}>.

Alternatively, you can use the C<\x{...}> notation for characters C<0x100> and
above. For characters below C<0x100> you may get byte semantics instead of
character semantics; see L</The "Unicode Bug">. On EBCDIC machines there is
the additional problem that the value for such characters gives the EBCDIC
character rather than the Unicode one, thus it is more portable to use
C<\N{U+...}> instead.

Additionally, you can use the C<\N{...}> notation and put the official
Unicode character name within the braces, such as
C<\N{WHITE SMILING FACE}>. This automatically loads the L<charnames>
module with the C<:full> and C<:short> options. If you prefer different
options for this module, you can instead, before the C<\N{...}>,
explicitly load it with your desired options; for example,

    use charnames ':loose';

=item *

If an appropriate L<encoding> is specified, identifiers within the
Perl script may contain Unicode alphanumeric characters, including
ideographs. Perl does not currently attempt to canonicalize variable
names.

=item *

Regular expressions match characters instead of bytes. C<"."> matches
a character instead of a byte.

=item *

Bracketed character classes in regular expressions match characters instead of
bytes and match against the character properties specified in the
Unicode properties database. C<\w> can be used to match a Japanese
ideograph, for instance.

=item *

Named Unicode properties, scripts, and block ranges may be used (like bracketed
character classes) by using the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".
See L</"Unicode Character Properties"> for more details.

You can define your own character properties and use them
in the regular expression with the C<\p{}> or C<\P{}> construct.
See L</"User-Defined Character Properties"> for more details.

=item *

The special pattern C<\X> matches a logical character, an "extended grapheme
cluster" in Standardese. In Unicode what appears to the user to be a single
character, for example an accented C<G>, may in fact be composed of a sequence
of characters, in this case a C<G> followed by an accent character. C<\X>
will match the entire sequence.

=item *

The C<tr///> operator translates characters instead of bytes. Note
that the C<tr///CU> functionality has been removed. For similar
functionality see C<pack('U0', ...)> and C<pack('C0', ...)>.

=item *

Case translation operators use the Unicode case translation tables
when character input is provided. Note that C<uc()>, or C<\U> in
interpolated strings, translates to uppercase, while C<ucfirst>,
or C<\u> in interpolated strings, translates to titlecase in languages
that make the distinction (which is equivalent to uppercase in languages
without the distinction).

=item *

Most operators that deal with positions or lengths in a string will
automatically switch to using character positions, including
C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
C<sprintf()>, C<write()>, and C<length()>. An operator that
specifically does not switch is C<vec()>. Operators that really don't
care include operators that treat strings as a bucket of bits such as
C<sort()>, and operators dealing with filenames.

=item *

The C<pack()>/C<unpack()> letter C<C> does I<not> change, since it is often
used for byte-oriented formats. Again, think C<char> in the C language.

There is a new C<U> specifier that converts between Unicode characters
and code points. There is also a C<W> specifier that is the equivalent of
C<chr>/C<ord> and properly handles character values even if they are above 255.

=item *

The C<chr()> and C<ord()> functions work on characters, similar to
C<pack("W")> and C<unpack("W")>, I<not> C<pack("C")> and
C<unpack("C")>. C<pack("C")> and C<unpack("C")> are methods for
emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.
While these methods reveal the internal encoding of Unicode strings,
that is not something one normally needs to care about at all.

=item *

The bit string operators, C<& | ^ ~>, can operate on character data.
However, for backward compatibility, such as when using bit string
operations when characters are all less than 256 in ordinal value, one
should not use C<~> (the bit complement) with characters of both
values less than 256 and values greater than 255. Most importantly,
DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
will not hold. The reason for this mathematical I<faux pas> is that
the complement cannot return B<both> the 8-bit (byte-wide) bit
complement B<and> the full character-wide bit complement.

=item *

There is a CPAN module, C<L<Unicode::Casing>>, which allows you to define
your own mappings to be used in C<lc()>, C<lcfirst()>, C<uc()>,
C<ucfirst()>, and C<fc()> (or their double-quoted string inlined
versions such as C<\U>).
(Prior to Perl 5.16, this functionality was partially provided
in the Perl core, but suffered from a number of insurmountable
drawbacks, so the CPAN module was written instead.)

=back

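Several of the effects listed above can be seen in a few lines. The
following is a minimal sketch, not a complete tour; the variable names are
invented for the example, and C<charnames> is loaded explicitly only so
that older Perls accept the named form as well.

```perl
use strict;
use warnings;
use charnames ':full';    # explicit load, so older Perls also accept \N{name}

# Three spellings of the same character.
my $by_code_point = "\N{U+263A}";
my $by_hex_escape = "\x{263A}";
my $by_name       = "\N{WHITE SMILING FACE}";
my $same = ($by_code_point eq $by_hex_escape && $by_hex_escape eq $by_name) ? 1 : 0;

# length(), ord() and chr() count and convert characters, not bytes.
my $length     = length $by_name;
my $code_point = ord $by_name;
my $round_trip = (chr($code_point) eq pack('W', $code_point)) ? 1 : 0;

# uc() gives the uppercase form and ucfirst() the titlecase form;
# the two differ for U+01C6 LATIN SMALL LETTER DZ WITH CARON.
my $dz        = "\x{1C6}";
my $upper     = sprintf 'U+%04X', ord uc $dz;
my $titlecase = sprintf 'U+%04X', ord ucfirst $dz;

print "same=$same length=$length round_trip=$round_trip ",
      "upper=$upper titlecase=$titlecase\n";
```

Only code points and counts are printed, so the sketch does not depend on
how the output handle is encoded.
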
=over 4

=item *

And finally, C<scalar reverse()> reverses by character rather than by byte.

=back

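The difference between code points and logical characters, and the
character-wise behavior of C<reverse> and C<tr///>, can be sketched like
this. The strings and variable names are chosen for the example only.

```perl
use strict;
use warnings;

# "G" followed by U+0301 COMBINING ACUTE ACCENT:
# two code points, but one logical character.
my $accented_g    = "G\x{301}";
my $code_points   = length $accented_g;
my @clusters      = $accented_g =~ /(\X)/g;
my $cluster_count = scalar @clusters;

# reverse() in scalar context reverses by character, so U+263A stays intact.
my $reversed   = scalar reverse "ab\x{263A}";
my $first_char = sprintf 'U+%04X', ord $reversed;

# tr/// also works on characters: count the non-ASCII ones.
my $non_ascii = ($reversed =~ tr/\x{80}-\x{10FFFF}//);

print "code points=$code_points clusters=$cluster_count ",
      "first after reverse=$first_char non-ASCII=$non_ascii\n";
```
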
=head2 Unicode Character Properties

(The only time that Perl considers a sequence of individual code
points as a single logical character is in the C<\X> construct, already
mentioned above. Therefore "character" in this discussion means a single
Unicode code point.)

Very nearly all Unicode character properties are accessible through
regular expressions by using the C<\p{}> "matches property" construct
and the C<\P{}> "doesn't match property" for its negation.

For instance, C<\p{Uppercase}> matches any single character with the Unicode
C<"Uppercase"> property, while C<\p{L}> matches any character with a
C<General_Category> of C<"L"> (letter) property (see
L</General_Category> below). Braces are not
required for single letter property names, so C<\p{L}> is equivalent to C<\pL>.

More formally, C<\p{Uppercase}> matches any single character whose Unicode
C<Uppercase> property value is C<True>, and C<\P{Uppercase}> matches any character
whose C<Uppercase> property value is C<False>, and they could have been written as
C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively.

This formality is needed when properties are not binary; that is, if they can
take on more values than just C<True> and C<False>. For example, the
C<Bidi_Class> property (see L</"Bidirectional Character Types"> below),
can take on several different
values, such as C<Left>, C<Right>, C<Whitespace>, and others. To match these, one needs
to specify both the property name (C<Bidi_Class>), AND the value being
matched against
(C<Left>, C<Right>, etc.). This is done, as in the examples above, by having the
two components separated by an equal sign (or interchangeably, a colon), like
C<\p{Bidi_Class: Left}>.

All Unicode-defined character properties may be written in these compound forms
of C<\p{I<property>=I<value>}> or C<\p{I<property>:I<value>}>, but Perl provides some
additional properties that are written only in the single form, as well as
single-form short-cuts for all binary properties and certain others described
below, in which you may omit the property name and the equals or colon
separator.

Most Unicode character properties have at least two synonyms (or aliases if you
prefer): a short one that is easier to type and a longer one that is more
descriptive and hence easier to understand. Thus the C<"L"> and
C<"Letter"> properties above are equivalent and can be used
interchangeably. Likewise, C<"Upper"> is a synonym for C<"Uppercase">,
and we could have written C<\p{Uppercase}> equivalently as C<\p{Upper}>.
Also, there are typically various synonyms for the values the property
can be. For binary properties, C<"True"> has 3 synonyms: C<"T">,
C<"Yes">, and C<"Y">; and C<"False"> has correspondingly C<"F">,
C<"No">, and C<"N">. But be careful. A short form of a value for one
property may not mean the same thing as the same short form for another.
Thus, for the C<L</General_Category>> property, C<"L"> means
C<"Letter">, but for the L<C<Bidi_Class>|/Bidirectional Character Types>
property, C<"L"> means C<"Left">. A complete list of properties and
synonyms is in L<perluniprops>.

Upper/lower case differences in property names and values are irrelevant;
thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>.
Similarly, you can add or subtract underscores anywhere in the middle of a
word, so that these are also equivalent to C<\p{U_p_p_e_r}>. And white space
is irrelevant adjacent to non-word characters, such as the braces and the equals
or colon separators, so C<\p{ Upper }> and C<\p{ Upper_case : Y }> are
equivalent to these as well. In fact, white space and even
hyphens can usually be added or deleted anywhere. So even C<\p{ Up-per case = Yes}> is
equivalent. All this is called "loose-matching" by Unicode. The few places
where stricter matching is used are in the middle of numbers, and in the Perl
extension properties that begin or end with an underscore. Stricter matching
cares about white space (except adjacent to non-word characters),
hyphens, and non-interior underscores.

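Loose matching can be checked mechanically. The sketch below compiles
several of the spellings mentioned above at run time and confirms that
each one matches C<"A"> but not C<"a">. The list of spellings and the
variable names are chosen for the example; the patterns are built from
strings only so that the spellings can be kept in a list.

```perl
use strict;
use warnings;

# Spellings taken from the text above; each should name the same property.
my @spellings = (
    'Uppercase', 'Upper', 'upper', 'UpPeR', 'U_p_p_e_r',
    ' Upper ', ' Upper_case : Y ', 'Uppercase=True',
);

my %behaves_like_uppercase;
for my $spelling (@spellings) {
    # Build \A\p{...}\z around the spelling and compile it.
    my $source = '\A\p{' . $spelling . '}\z';
    my $re     = qr/$source/;
    $behaves_like_uppercase{$spelling} = ("A" =~ $re && "a" !~ $re) ? 1 : 0;
}

my @odd_ones_out = grep { !$behaves_like_uppercase{$_} } @spellings;
printf "%d of %d spellings behave like \\p{Uppercase}\n",
    @spellings - @odd_ones_out, scalar @spellings;
```
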
You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
(C<^>) between the first brace and the property name: C<\p{^Tamil}> is
equal to C<\P{Tamil}>.

Almost all properties are immune to case-insensitive matching. That is,
adding a C</i> regular expression modifier does not change what they
match. There are two sets that are affected.
The first set is
C<Uppercase_Letter>,
C<Lowercase_Letter>,
and C<Titlecase_Letter>,
all of which match C<Cased_Letter> under C</i> matching.
And the second set is
C<Uppercase>,
C<Lowercase>,
and C<Titlecase>,
all of which match C<Cased> under C</i> matching.
This set also includes its subsets C<PosixUpper> and C<PosixLower> both
of which under C</i> match C<PosixAlpha>.
(The difference between these sets is that some things, such as Roman
numerals, come in both upper and lower case so they are C<Cased>, but aren't considered
letters, so they aren't C<Cased_Letter>s.)

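The two sets affected by C</i> can be told apart with a lowercase Roman
numeral, as the parenthetical above suggests. This is a small sketch
assuming Perl 5.14 or later; U+2170 SMALL ROMAN NUMERAL ONE is used
because it is C<Lowercase> but has a C<General_Category> of C<Nl>, and the
variable names are illustrative.

```perl
use strict;
use warnings;

# Negation with a caret is the same as \P{...}.
my $caret_negation = ("a" =~ /\p{^Tamil}/ && "a" =~ /\P{Tamil}/) ? 1 : 0;

# \p{Uppercase_Letter} matches Cased_Letter under /i.
my $lu_plain = ("a" =~ /\p{Uppercase_Letter}/)  ? 1 : 0;
my $lu_fold  = ("a" =~ /\p{Uppercase_Letter}/i) ? 1 : 0;

# U+2170 is Cased, but it is not a letter.
my $roman_one        = "\x{2170}";
my $roman_upper_fold = ($roman_one =~ /\p{Uppercase}/i)        ? 1 : 0;
my $roman_lu_fold    = ($roman_one =~ /\p{Uppercase_Letter}/i) ? 1 : 0;

print "caret=$caret_negation Lu=$lu_plain Lu/i=$lu_fold ",
      "roman Uppercase/i=$roman_upper_fold roman Lu/i=$roman_lu_fold\n";
```
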
The result is undefined if you try to match a non-Unicode code point
(that is, one above 0x10FFFF) against a Unicode property. Currently, a
warning is raised, and the match will fail. In some cases, this is
counterintuitive, as both these fail:

    chr(0x110000) =~ /\p{ASCII_Hex_Digit=True}/     # Fails.
    chr(0x110000) =~ /\p{ASCII_Hex_Digit=False}/    # Fails!

=head3 B<General_Category>

Every Unicode character is assigned a general category, which is the "most
usual categorization of a character" (from
L<http://www.unicode.org/reports/tr44>).

The compound way of writing these is like C<\p{General_Category=Number}>
(short, C<\p{gc:n}>). But Perl furnishes shortcuts in which everything up
through the equal or colon separator is omitted. So you can instead just write
C<\pN>.

Here are the short and long forms of the values the C<General_Category> property
can have:

    Short       Long

    L           Letter
    LC, L&      Cased_Letter (that is: [\p{Ll}\p{Lu}\p{Lt}])
    Lu          Uppercase_Letter
    Ll          Lowercase_Letter
    Lt          Titlecase_Letter
    Lm          Modifier_Letter
    Lo          Other_Letter

    M           Mark
    Mn          Nonspacing_Mark
    Mc          Spacing_Mark
    Me          Enclosing_Mark

    N           Number
    Nd          Decimal_Number (also Digit)
    Nl          Letter_Number
    No          Other_Number

    P           Punctuation (also Punct)
    Pc          Connector_Punctuation
    Pd          Dash_Punctuation
    Ps          Open_Punctuation
    Pe          Close_Punctuation
    Pi          Initial_Punctuation
                (may behave like Ps or Pe depending on usage)
    Pf          Final_Punctuation
                (may behave like Ps or Pe depending on usage)
    Po          Other_Punctuation

    S           Symbol
    Sm          Math_Symbol
    Sc          Currency_Symbol
    Sk          Modifier_Symbol
    So          Other_Symbol

    Z           Separator
    Zs          Space_Separator
    Zl          Line_Separator
    Zp          Paragraph_Separator

    C           Other
    Cc          Control (also Cntrl)
    Cf          Format
    Cs          Surrogate
    Co          Private_Use
    Cn          Unassigned

Single-letter properties match all characters in any of the
two-letter sub-properties starting with the same letter.
C<LC> and C<L&> are special: both are aliases for the set consisting of everything matched by C<Ll>, C<Lu>, and C<Lt>.

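The equivalence of the compound and shortcut forms, and the way a
single-letter category covers its two-letter sub-categories, can be
sketched as follows. The sample characters and variable names are chosen
for the example.

```perl
use strict;
use warnings;

# Four ways of writing "General_Category is Number".
my @number_forms = (
    qr/\p{General_Category=Number}/,
    qr/\p{gc:n}/,
    qr/\p{N}/,
    qr/\pN/,
);
my $digit         = "5";
my $forms_matched = grep { $digit =~ $_ } @number_forms;

# U+2167 ROMAN NUMERAL EIGHT is a Letter_Number (Nl):
# it is in N, but not in the Nd sub-category.
my $roman_eight      = "\x{2167}";
my $is_number        = ($roman_eight =~ /\pN/)     ? 1 : 0;
my $is_letter_number = ($roman_eight =~ /\p{Nl}/)  ? 1 : 0;
my $is_decimal       = ($roman_eight =~ /\p{Nd}/)  ? 1 : 0;

# LC and L& cover Ll, Lu and Lt; U+05D0 HEBREW LETTER ALEF is a letter
# (Lo) but has no case.
my $a_is_cased    = ("a" =~ /\p{LC}/ && "a" =~ /\p{L&}/)           ? 1 : 0;
my $alef_is_cased = ("\x{5D0}" =~ /\p{L}/ && "\x{5D0}" =~ /\p{LC}/) ? 1 : 0;

print "forms=$forms_matched N=$is_number Nl=$is_letter_number Nd=$is_decimal ",
      "a cased=$a_is_cased alef cased=$alef_is_cased\n";
```
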
=head3 B<Bidirectional Character Types>

Because scripts differ in their directionality (Hebrew and Arabic are
written right to left, for example) Unicode supplies a C<Bidi_Class> property.
Some of the values this property can have are:

    Value       Meaning

    L           Left-to-Right
    LRE         Left-to-Right Embedding
    LRO         Left-to-Right Override
    R           Right-to-Left
    AL          Arabic Letter
    RLE         Right-to-Left Embedding
    RLO         Right-to-Left Override
    PDF         Pop Directional Format
    EN          European Number
    ES          European Separator
    ET          European Terminator
    AN          Arabic Number
    CS          Common Separator
    NSM         Non-Spacing Mark
    BN          Boundary Neutral
    B           Paragraph Separator
    S           Segment Separator
    WS          Whitespace
    ON          Other Neutrals

This property is always written in the compound form.
For example, C<\p{Bidi_Class:R}> matches characters that are normally
written right to left. Unlike the
C<L</General_Category>> property, this
property can have more values added in a future Unicode release. Those
listed above comprised the complete set for many Unicode releases, but
others were added in Unicode 6.3; you can always find what the
current ones are in L<perluniprops>. And
L<http://www.unicode.org/reports/tr9/> describes how to use them.

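As an illustration of the compound form, the sketch below classifies a few
characters by C<Bidi_Class>, using the short values from the table above.
The helper C<bidi_class> is hypothetical and exists only for this example;
it checks just a handful of the possible values.

```perl
use strict;
use warnings;

# A few Bidi_Class short values from the table, each compiled once.
my @classes = qw(L R AL EN AN WS);
my %class_re;
for my $class (@classes) {
    my $source = '\A\p{Bidi_Class: ' . $class . '}\z';
    $class_re{$class} = qr/$source/;
}

# Hypothetical helper: name the first listed class a character belongs to.
sub bidi_class {
    my ($char) = @_;
    for my $class (@classes) {
        return $class if $char =~ $class_re{$class};
    }
    return 'other';
}

my $latin_a     = bidi_class("A");          # LATIN CAPITAL LETTER A
my $hebrew_alef = bidi_class("\x{5D0}");    # HEBREW LETTER ALEF
my $arabic_alef = bidi_class("\x{627}");    # ARABIC LETTER ALEF
my $digit_seven = bidi_class("7");          # DIGIT SEVEN
my $space       = bidi_class(" ");          # SPACE

print "A=$latin_a alef(he)=$hebrew_alef alef(ar)=$arabic_alef ",
      "7=$digit_seven space=$space\n";
```
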
=head3 B<Scripts>

The world's languages are written in many different scripts. This sentence
(unless you're reading it in translation) is written in Latin, while Russian is
written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in
Hiragana or Katakana. There are many more.

The Unicode Script and Script_Extensions properties give what script a
given character is in. Either property can be specified with the
compound form like
C<\p{Script=Hebrew}> (short: C<\p{sc=hebr}>), or
C<\p{Script_Extensions=Javanese}> (short: C<\p{scx=java}>).
In addition, Perl furnishes shortcuts for all
C<Script> property names. You can omit everything up through the equals
(or colon), and simply write C<\p{Latin}> or C<\P{Cyrillic}>.
(This is not true for C<Script_Extensions>, which is required to be
written in the compound form.)

The difference between these two properties involves characters that are
used in multiple scripts. For example the digits '0' through '9' are
used in many parts of the world. These are placed in a script named
C<Common>. Other characters are used in just a few scripts. For
example, the C<"KATAKANA-HIRAGANA DOUBLE HYPHEN"> is used in both Japanese
scripts, Katakana and Hiragana, but nowhere else. The C<Script>
property places all characters that are used in multiple scripts in the
C<Common> script, while the C<Script_Extensions> property places those
that are used in only a few scripts into each of those scripts; while
still using C<Common> for those used in many scripts. Thus both these
match:

    "0" =~ /\p{sc=Common}/     # Matches
    "0" =~ /\p{scx=Common}/    # Matches

and only the first of these matches:

    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Common}/   # Matches
    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Common}/  # No match

And only the last two of these match:

    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Hiragana}/   # No match
    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Katakana}/   # No match
    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Hiragana}/  # Matches
    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Katakana}/  # Matches

C<Script_Extensions> is thus an improved C<Script>, in which there are
fewer characters in the C<Common> script, and correspondingly more in
other scripts. It is new in Unicode version 6.0, and its data are likely
to change significantly in later releases, as things get sorted out.

(Actually, besides C<Common>, the C<Inherited> script contains
characters that are used in multiple scripts. These are modifier
characters which modify other characters, and inherit the script value
of the controlling character. Some of these are used in many scripts,
and so go into C<Inherited> in both C<Script> and C<Script_Extensions>.
Others are used in just a few scripts, so are in C<Inherited> in
C<Script>, but not in C<Script_Extensions>.)

It is worth stressing that there are several different sets of digits in
Unicode that are equivalent to 0-9 and are matchable by C<\d> in a
regular expression. If they are used in a single language only, they
are in that language's C<Script> and C<Script_Extensions>. If they are
used in more than one script, they will be in C<sc=Common>, but only
if they are used in many scripts should they be in C<scx=Common>.

A complete list of scripts and their shortcuts is in L<perluniprops>.

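To complement the matches shown above, here is a small sketch that counts
the characters of a mixed string by C<Script>. The sample string and the
variable names are made up for the example; it uses only the C<Script>
property, since C<Script_Extensions> needs a Perl built on Unicode 6.0 or
later.

```perl
use strict;
use warnings;

# "Perl", a space, the Cyrillic letters U+041F U+0435 U+0440 U+043B,
# another space, and the digit 5.
my $mixed = "Perl \x{41F}\x{435}\x{440}\x{43B} 5";

# Count matches with the "countof" idiom: list assignment in scalar context.
my $latin    = () = $mixed =~ /\p{Latin}/g;
my $cyrillic = () = $mixed =~ /\p{Script=Cyrillic}/g;
my $common   = () = $mixed =~ /\p{sc=Common}/g;

print "Latin=$latin Cyrillic=$cyrillic Common=$common\n";
```

The two spaces and the digit are shared by many scripts, which is why they
are counted under C<Common> rather than under either alphabet.
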
=head3 B<Use of the C<"Is"> Prefix>

For backward compatibility (with Perl 5.6), all properties mentioned
so far may have C<Is> or C<Is_> prepended to their name, so C<\P{Is_Lu}>, for
example, is equal to C<\P{Lu}>, and C<\p{IsScript:Arabic}> is equal to
C<\p{Arabic}>.

=head3 B<Blocks>

In addition to B<scripts>, Unicode also defines B<blocks> of
characters. The difference between scripts and blocks is that the
concept of scripts is closer to natural languages, while the concept
of blocks is more of an artificial grouping based on groups of Unicode
characters with consecutive ordinal values. For example, the C<"Basic Latin">
block is all characters whose ordinals are between 0 and 127, inclusive; in
other words, the ASCII characters. The C<"Latin"> script contains some letters
from this as well as several other blocks, like C<"Latin-1 Supplement">,
C<"Latin Extended-A">, etc., but it does not contain all the characters from
those blocks. It does not, for example, contain the digits 0-9, because
those digits are shared across many scripts, and hence are in the
C<Common> script.

For more about scripts versus blocks, see UAX#24 "Unicode Script Property":
L<http://www.unicode.org/reports/tr24>

The C<Script> or C<Script_Extensions> properties are likely to be the
ones you want to use when processing
natural language; the C<Block> property may occasionally be useful in working
with the nuts and bolts of Unicode.

Block names are matched in the compound form, like C<\p{Block: Arrows}> or
C<\p{Blk=Hebrew}>. Unlike most other properties, only a few block names have a
Unicode-defined short name. But Perl does provide a (slight) shortcut: You
can say, for example C<\p{In_Arrows}> or C<\p{In_Hebrew}>. For backwards
compatibility, the C<In> prefix may be omitted if there is no naming conflict
with a script or any other property, and you can even use an C<Is> prefix
instead in those cases. But it is not a good idea to do this, for a couple
of reasons:

=over 4

=item 1

It is confusing. There are many naming conflicts, and you may forget some.
For example, C<\p{Hebrew}> means the I<script> Hebrew, and NOT the I<block>
Hebrew. But would you remember that 6 months from now?

=item 2

It is unstable. A new version of Unicode may preempt the current meaning by
creating a property with the same name. There was a time in very early Unicode
releases when C<\p{Hebrew}> would have matched the I<block> Hebrew; now it
doesn't.

=back

Some people prefer to always use C<\p{Block: foo}> and C<\p{Script: bar}>
instead of the shortcuts, whether for clarity, because they can't remember the
difference between 'In' and 'Is' anyway, or they aren't confident that those who
eventually will read their code will know that difference.

A complete list of blocks and their shortcuts is in L<perluniprops>.

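The distinction between blocks and scripts drawn above can be sketched with
an arrow and a digit. The sample characters and variable names are chosen
for the example.

```perl
use strict;
use warnings;

# U+2190 LEFTWARDS ARROW lies in the Arrows block; three ways to say so.
my $arrow = "\x{2190}";
my $block_forms_matched = grep { $arrow =~ $_ } (
    qr/\p{Block: Arrows}/,
    qr/\p{Blk=Arrows}/,
    qr/\p{In_Arrows}/,
);

# The digit 0 is in the Basic Latin block, but its script is Common,
# not Latin.
my $digit                = "0";
my $in_basic_latin_block = ($digit =~ /\p{Block=Basic_Latin}/) ? 1 : 0;
my $in_latin_script      = ($digit =~ /\p{Script=Latin}/)      ? 1 : 0;
my $in_common_script     = ($digit =~ /\p{Script=Common}/)     ? 1 : 0;

print "arrow forms=$block_forms_matched block=$in_basic_latin_block ",
      "Latin=$in_latin_script Common=$in_common_script\n";
```

Spelling out C<Block> and C<Script> in the patterns avoids having to
remember which shortcut names refer to which property.
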
=head3 B<Other Properties>

There are many more properties than the very basic ones described here.
A complete list is in L<perluniprops>.

Unicode defines all its properties in the compound form, so all single-form
properties are Perl extensions. Most of these are just synonyms for the
Unicode ones, but some are genuine extensions, including several that are in
the compound form. And quite a few of these are actually recommended by Unicode
(in L<http://www.unicode.org/reports/tr18>).

This section gives some details on all extensions that aren't just
synonyms for compound-form Unicode properties
(for those properties, you'll have to refer to the
L<Unicode Standard|http://www.unicode.org/reports/tr44>).

=over

=item B<C<\p{All}>>

This matches any of the 1_114_112 Unicode code points. It is a synonym for
C<\p{Any}>.

=item B<C<\p{Alnum}>>

This matches any C<\p{Alphabetic}> or C<\p{Decimal_Number}> character.

=item B<C<\p{Any}>>

This matches any of the 1_114_112 Unicode code points. It is a synonym for
C<\p{All}>.

=item B<C<\p{ASCII}>>

This matches any of the 128 characters in the US-ASCII character set,
which is a subset of Unicode.

=item B<C<\p{Assigned}>>

This matches any assigned code point; that is, any code point whose L<general
category|/General_Category> is not C<Unassigned> (or equivalently, not C<Cn>).

=item B<C<\p{Blank}>>

This is the same as C<\h> and C<\p{HorizSpace}>: A character that changes the
spacing horizontally.

664=item B<C<\p{Decomposition_Type: Non_Canonical}>> (Short: C<\p{Dt=NonCanon}>)
665
666Matches a character that has a non-canonical decomposition.
667
a9130ea9 668To understand the use of this rarely used I<property=value> combination, it is
9f815e24
KW
669necessary to know some basics about decomposition.
670Consider a character, say H. It could appear with various marks around it,
671such as an acute accent, or a circumflex, or various hooks, circles, arrows,
b19eb496 672I<etc.>, above, below, to one side or the other, etc. There are many
9f815e24
KW
673possibilities among the world's languages. The number of combinations is
674astronomical, and if there were a character for each combination, it would
675soon exhaust Unicode's more than a million possible characters. So Unicode
676took a different approach: there is a character for the base H, and a
b19eb496 677character for each of the possible marks, and these can be variously combined
9f815e24
KW
678to get a final logical character. So a logical character--what appears to be a
679single character--can be a sequence of more than one individual character.
b19eb496
TC
680This is called an "extended grapheme cluster"; Perl furnishes the C<\X>
681regular expression construct to match such sequences.
9f815e24
KW
682
683But Unicode's intent is to unify the existing character set standards and
b19eb496 684practices, and several pre-existing standards have single characters that
9f815e24 685mean the same thing as some of these combinations. An example is ISO-8859-1,
a9130ea9
KW
686which has quite a few of these in the Latin-1 range, an example being C<"LATIN
687CAPITAL LETTER E WITH ACUTE">. Because this character was in this pre-existing
9f815e24 688standard, Unicode added it to its repertoire. But this character is considered
b19eb496 689by Unicode to be equivalent to the sequence consisting of the character
a9130ea9 690C<"LATIN CAPITAL LETTER E"> followed by the character C<"COMBINING ACUTE ACCENT">.
9f815e24 691
a9130ea9 692C<"LATIN CAPITAL LETTER E WITH ACUTE"> is called a "pre-composed" character, and
b19eb496 693its equivalence with the sequence is called canonical equivalence. All
9f815e24 694pre-composed characters are said to have a decomposition (into the equivalent
b19eb496 695sequence), and the decomposition type is also called canonical.
9f815e24
KW
696
697However, many more characters have a different type of decomposition, a
698"compatible" or "non-canonical" decomposition. The sequences that form these
699decompositions are not considered canonically equivalent to the pre-composed
a9130ea9 700character. An example, again in the Latin-1 range, is the C<"SUPERSCRIPT ONE">.
b19eb496 701It is somewhat like a regular digit 1, but not exactly; its decomposition
9f815e24
KW
702into the digit 1 is called a "compatible" decomposition, specifically a
703"super" decomposition. There are several such compatibility
704decompositions (see L<http://www.unicode.org/reports/tr44>), including one
b19eb496 705called "compat", which means some miscellaneous type of decomposition
42581d5d 706that doesn't fit into the decomposition categories that Unicode has chosen.
9f815e24
KW
707
708Note that most Unicode characters don't have a decomposition, so their
a9130ea9 709decomposition type is C<"None">.
9f815e24 710
b19eb496
TC
711For your convenience, Perl has added the C<Non_Canonical> decomposition
712type to mean any of the several compatibility decompositions.
9f815e24
KW
713
714=item B<C<\p{Graph}>>
715
716Matches any character that is graphic. Theoretically, this means a character
717that on a printer would cause ink to be used.
718
719=item B<C<\p{HorizSpace}>>
720
b19eb496 721This is the same as C<\h> and C<\p{Blank}>: a character that changes the
9f815e24
KW
722spacing horizontally.
723
42581d5d 724=item B<C<\p{In=*}>>
9f815e24
KW
725
726This is a synonym for C<\p{Present_In=*}>.
727
728=item B<C<\p{PerlSpace}>>
729
d28d8023
KW
730This is the same as C<\s>, restricted to ASCII, namely C<S<[ \f\n\r\t]>>
731and starting in Perl v5.18, experimentally, a vertical tab.
9f815e24
KW
732
733Mnemonic: Perl's (original) space.
734
735=item B<C<\p{PerlWord}>>
736
737This is the same as C<\w>, restricted to ASCII, namely C<[A-Za-z0-9_]>.
738
739Mnemonic: Perl's (original) word.
740
42581d5d 741=item B<C<\p{Posix...}>>
9f815e24 742
a9130ea9 743There are several of these, which are equivalents using the C<\p{}>
b19eb496 744notation for Posix classes and are described in
42581d5d 745L<perlrecharclass/POSIX Character Classes>.
9f815e24
KW
746
747=item B<C<\p{Present_In: *}>> (Short: C<\p{In=*}>)
748
749This property is used when you need to know in what Unicode version(s) a
750character is.
751
752The "*" above stands for some two digit Unicode version number, such as
753C<1.1> or C<4.0>; or the "*" can also be C<Unassigned>. This property will
754match the code points whose final disposition has been settled as of the
755Unicode release given by the version number; C<\p{Present_In: Unassigned}>
756will match those code points whose meaning has yet to be assigned.
757
a9130ea9 758For example, C<U+0041> C<"LATIN CAPITAL LETTER A"> was present in the very first
9f815e24
KW
759Unicode release available, which is C<1.1>, so this property is true for all
760valid "*" versions. On the other hand, C<U+1EFF> was not assigned until version
a9130ea9 7615.1 when it became C<"LATIN SMALL LETTER Y WITH LOOP">, so the only "*" values that
9f815e24
KW
762would match it are 5.1, 5.2, and later.
763
764Unicode furnishes the C<Age> property from which this is derived. The problem
765with Age is that a strict interpretation of it (which Perl takes) has it
766matching the precise release a code point's meaning is introduced in. Thus
767C<U+0041> would match only 1.1; and C<U+1EFF> only 5.1. This is not usually what
768you want.
769
770Some non-Perl implementations of the Age property may change its meaning to be
a9130ea9 771the same as the Perl C<Present_In> property; just be aware of that.
9f815e24
KW
772
773Another confusion with both these properties is that the definition is not
b19eb496
TC
774that the code point has been I<assigned>, but that the meaning of the code point
775has been I<determined>. This is because 66 code points will always be
a9130ea9 776unassigned, and so the C<Age> for them is the Unicode version in which the decision
b19eb496 777to make them so was made. For example, C<U+FDD0> is to be permanently
9f815e24 778unassigned to a character, and the decision to do that was made in version 3.1,
b19eb496 779so C<\p{Age=3.1}> matches this character, as do C<\p{Present_In: 3.1}> and every later version.
9f815e24
KW
780
781=item B<C<\p{Print}>>
782
ae5b72c8 783This matches any character that is graphical or blank, except controls.
9f815e24
KW
784
785=item B<C<\p{SpacePerl}>>
786
787This is the same as C<\s>, including beyond ASCII.
788
4d4acfba 789Mnemonic: Space, as modified by Perl. (It doesn't include the vertical tab
b19eb496 790which both the Posix standard and Unicode consider white space.)
9f815e24 791
4364919a
KW
792=item B<C<\p{Title}>> and B<C<\p{Titlecase}>>
793
794Under case-sensitive matching, these both match the same code points as
795C<\p{General Category=Titlecase_Letter}> (C<\p{gc=lt}>). The difference
796is that under C</i> caseless matching, these match the same as
797C<\p{Cased}>, whereas C<\p{gc=lt}> matches C<\p{Cased_Letter}>.
798
9f815e24
KW
799=item B<C<\p{VertSpace}>>
800
801This is the same as C<\v>: A character that changes the spacing vertically.
802
803=item B<C<\p{Word}>>
804
b19eb496 805This is the same as C<\w>, including over 100_000 characters beyond ASCII.
9f815e24 806
42581d5d
KW
807=item B<C<\p{XPosix...}>>
808
b19eb496 809There are several of these, which are the standard Posix classes
42581d5d
KW
810extended to the full Unicode range. They are described in
811L<perlrecharclass/POSIX Character Classes>.
812
9f815e24
KW
813=back
814
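The decomposition and version properties described in the list above can be checked directly. The following is only a sketch, assuming a Perl of at least v5.14 (so that the Unicode tables include version 5.2) and the core C<Unicode::Normalize> module; the variable names are illustrative and not part of any API.

```perl
use v5.14;
use warnings;
use Unicode::Normalize qw(NFD);

my $e_acute  = "\x{C9}";     # LATIN CAPITAL LETTER E WITH ACUTE (pre-composed)
my $sequence = "E\x{301}";   # LATIN CAPITAL LETTER E + COMBINING ACUTE ACCENT
my $sup_one  = "\x{B9}";     # SUPERSCRIPT ONE
my $y_loop   = "\x{1EFF}";   # LATIN SMALL LETTER Y WITH LOOP, new in Unicode 5.1

# Canonical decomposition turns the pre-composed character into the sequence.
my $decomposes = NFD($e_acute) eq $sequence ? 1 : 0;

# \X treats the two-character sequence as a single extended grapheme cluster.
my $one_cluster = (length($sequence) == 2 && $sequence =~ /\A\X\z/) ? 1 : 0;

# Only compatibility decompositions (here a "super" one) are Non_Canonical.
my $sup_noncanon = $sup_one =~ /\p{Decomposition_Type: Non_Canonical}/ ? 1 : 0;
my $e_noncanon   = $e_acute =~ /\p{Dt=NonCanon}/ ? 1 : 0;

# Present_In is cumulative; Age names only the introducing release.
my $in_5_0  = $y_loop =~ /\p{Present_In: 5.0}/ ? 1 : 0;
my $in_5_1  = $y_loop =~ /\p{Present_In: 5.1}/ ? 1 : 0;
my $in_5_2  = $y_loop =~ /\p{In=5.2}/          ? 1 : 0;
my $age_5_1 = $y_loop =~ /\p{Age: 5.1}/        ? 1 : 0;
my $age_5_2 = $y_loop =~ /\p{Age: 5.2}/        ? 1 : 0;

say "decomposes=$decomposes one_cluster=$one_cluster";
say "sup_noncanon=$sup_noncanon e_noncanon=$e_noncanon";
say "Present_In 5.0/5.1/5.2: $in_5_0 $in_5_1 $in_5_2";
say "Age 5.1/5.2: $age_5_1 $age_5_2";
```

The last two lines of output show why C<Present_In> is usually the more convenient of the two: it stays true for every release from 5.1 onwards, while C<Age> is true for 5.1 only.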
a9130ea9 815
376d9008 816=head2 User-Defined Character Properties
491fd90a 817
51f494cc 818You can define your own binary character properties by defining subroutines
a9130ea9 819whose names begin with C<"In"> or C<"Is">. (The experimental feature
9d1a5160
KW
820L<perlre/(?[ ])> provides an alternative which allows more complex
821definitions.) The subroutines can be defined in any
51f494cc 822package. The user-defined properties can be used in the regular expression
a9130ea9 823C<\p{}> and C<\P{}> constructs; if you are using a user-defined property from a
51f494cc 824package other than the one you are in, you must specify its package in the
a9130ea9 825C<\p{}> or C<\P{}> construct.
bac0b425 826
51f494cc 827 # assuming property IsForeign defined in Lang::
bac0b425
JP
828 package main; # property package name required
829 if ($txt =~ /\p{Lang::IsForeign}+/) { ... }
830
831 package Lang; # property package name not required
832 if ($txt =~ /\p{IsForeign}+/) { ... }
833
834
835Note that the effect is compile-time and immutable once defined.
b19eb496
TC
836However, the subroutines are passed a single parameter, which is 0 if
837case-sensitive matching is in effect and non-zero if caseless matching
56ca34ca
KW
838is in effect. The subroutine may return different values depending on
839the value of the flag, and one set of values will immutably be in effect
b19eb496 840for all case-sensitive matches, and the other set for all case-insensitive
56ca34ca 841matches.
491fd90a 842
b19eb496 843Note that if the regular expression is tainted, then Perl will die rather
a9130ea9 844than calling the subroutine when the name of the subroutine is
0e9be77f
DM
845determined by the tainted data.
846
376d9008
JB
847The subroutines must return a specially-formatted string, with one
848or more newline-separated lines. Each line must be one of the following:
491fd90a
JH
849
850=over 4
851
852=item *
853
df9e1087 854A single hexadecimal number denoting a code point to include.
510254c9
A
855
856=item *
857
99a6b1f0 858Two hexadecimal numbers separated by horizontal whitespace (space or
df9e1087 859tab characters) denoting a range of code points to include.
491fd90a
JH
860
861=item *
862
a9130ea9
KW
863Something to include, prefixed by C<"+">: a built-in character
864property (prefixed by C<"utf8::">) or a fully qualified (including package
830137a2 865name) user-defined character property,
bac0b425
JP
866to represent all the characters in that property; two hexadecimal code
867points for a range; or a single hexadecimal code point.
491fd90a
JH
868
869=item *
870
a9130ea9
KW
871Something to exclude, prefixed by C<"-">: an existing character
872property (prefixed by C<"utf8::">) or a fully qualified (including package
830137a2 873name) user-defined character property,
bac0b425
JP
874to represent all the characters in that property; two hexadecimal code
875points for a range; or a single hexadecimal code point.
491fd90a
JH
876
877=item *
878
a9130ea9
KW
879Something to negate, prefixed by C<"!">: an existing character
880property (prefixed by C<"utf8::">) or a fully qualified (including package
830137a2 881name) user-defined character property,
bac0b425
JP
882to represent all the characters in that property; two hexadecimal code
883points for a range; or a single hexadecimal code point.
884
885=item *
886
a9130ea9
KW
887Something to intersect with, prefixed by C<"&">: an existing character
888property (prefixed by C<"utf8::">) or a fully qualified (including package
830137a2 889name) user-defined character property,
bac0b425
JP
890to represent all the characters in that property; two
891hexadecimal code points for a range; or a single hexadecimal code point.
491fd90a
JH
892
893=back
894
895For example, to define a property that covers both the Japanese
896syllabaries (hiragana and katakana), you can define
897
898 sub InKana {
d88362ca 899 return <<END;
d5822f25
A
900 3040\t309F
901 30A0\t30FF
491fd90a
JH
902 END
903 }
904
d5822f25
A
905Imagine that the here-doc end marker is at the beginning of the line.
906Now you can use C<\p{InKana}> and C<\P{InKana}>.
491fd90a
JH
907
908You could also have used the existing block property names:
909
910 sub InKana {
d88362ca 911 return <<'END';
491fd90a
JH
912 +utf8::InHiragana
913 +utf8::InKatakana
914 END
915 }
916
917Suppose you wanted to match only the allocated characters,
d5822f25 918not the raw block ranges: in other words, you want to remove
491fd90a
JH
919the unassigned code points:
920
921 sub InKana {
d88362ca 922 return <<'END';
491fd90a
JH
923 +utf8::InHiragana
924 +utf8::InKatakana
925 -utf8::IsCn
926 END
927 }
928
929The negation is useful for defining (surprise!) negated classes.
930
931 sub InNotKana {
d88362ca 932 return <<'END';
491fd90a
JH
933 !utf8::InHiragana
934 -utf8::InKatakana
935 +utf8::IsCn
936 END
937 }
938
461020ad
KW
939This will match all non-Unicode code points, since every one of them is
940not in Kana. You can use intersection to exclude these, if desired, as
941this modified example shows:
bac0b425 942
461020ad 943 sub InNotKana {
bac0b425 944 return <<'END';
461020ad
KW
945 !utf8::InHiragana
946 -utf8::InKatakana
947 +utf8::IsCn
948 &utf8::Any
bac0b425
JP
949 END
950 }
951
461020ad
KW
952C<&utf8::Any> must be the last line in the definition.
953
954Intersection is used generally for getting the common characters matched
a9130ea9 955by two (or more) classes. It's important to remember not to use C<"&"> for
461020ad
KW
956the first set; that would be intersecting with nothing, resulting in an
957empty set.
958
959(Note that official Unicode properties differ from these in that they
960automatically exclude non-Unicode code points and a warning is raised if
961a match is attempted on one of those.)
bac0b425 962
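The difference between raw block ranges and a definition that subtracts unassigned code points can be seen in a small, self-contained script. This is a sketch only: C<InAssignedKana> is a hypothetical name chosen here to keep the two definitions apart, and the test characters are picked on the assumption that C<U+3040> remains unassigned in the Unicode version your Perl ships with.

```perl
use v5.14;
use warnings;

# Raw block ranges, as in the first InKana definition above.
sub InKana {
    return <<"END";
3040\t309F
30A0\t30FF
END
}

# Hypothetical name: the same two blocks minus unassigned code points.
sub InAssignedKana {
    return <<'END';
+utf8::InHiragana
+utf8::InKatakana
-utf8::IsCn
END
}

my $hiragana_a = "\x{3042}";   # HIRAGANA LETTER A
my $reserved   = "\x{3040}";   # inside the Hiragana block, but unassigned

my $a_in_kana        = $hiragana_a =~ /\p{InKana}/         ? 1 : 0;
my $a_in_assigned    = $hiragana_a =~ /\p{InAssignedKana}/ ? 1 : 0;
my $gap_in_kana      = $reserved   =~ /\p{InKana}/         ? 1 : 0;
my $gap_in_assigned  = $reserved   =~ /\p{InAssignedKana}/ ? 1 : 0;
my $latin_not_kana   = "A"         =~ /\P{InKana}/         ? 1 : 0;

say "U+3042: InKana=$a_in_kana InAssignedKana=$a_in_assigned";
say "U+3040: InKana=$gap_in_kana InAssignedKana=$gap_in_assigned";
say "A is outside InKana: $latin_not_kana";
```

Both subroutines live in package C<main> here, so no package qualifier is needed inside C<\p{}>.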
68585b5e 963=head2 User-Defined Case Mappings (for serious hackers only)
822502e5 964
5d1892be 965B<This feature has been removed as of Perl 5.16.>
a9130ea9 966The CPAN module C<L<Unicode::Casing>> provides better functionality without
5d1892be
KW
967the drawbacks that this feature had. If you are using a Perl earlier
968than 5.16, this feature was most fully documented in the 5.14 version of
969this pod:
970L<http://perldoc.perl.org/5.14.0/perlunicode.html#User-Defined-Case-Mappings-%28for-serious-hackers-only%29>
3a2263fe 971
376d9008 972=head2 Character Encodings for Input and Output
8cbd9a7a 973
7221edc9 974See L<Encode>.
8cbd9a7a 975
c29a771d 976=head2 Unicode Regular Expression Support Level
776f8809 977
b19eb496
TC
978The following list of Unicode supported features for regular expressions describes
979all features currently directly supported by core Perl. The references to "Level N"
8158862b 980and the section numbers refer to the Unicode Technical Standard #18,
b19eb496 981"Unicode Regular Expressions", version 13, from August 2008.
776f8809
JH
982
983=over 4
984
985=item *
986
987Level 1 - Basic Unicode Support
988
755789c0
KW
989 RL1.1 Hex Notation - done [1]
990 RL1.2 Properties - done [2][3]
991 RL1.2a Compatibility Properties - done [4]
9d1a5160 992 RL1.3 Subtraction and Intersection - experimental [5]
755789c0
KW
993 RL1.4 Simple Word Boundaries - done [6]
994 RL1.5 Simple Loose Matches - done [7]
995 RL1.6 Line Boundaries - MISSING [8][9]
996 RL1.7 Supplementary Code Points - done [10]
997
6f33e417
KW
998=over 4
999
1000=item [1]
1001
a9130ea9 1002C<\x{...}>
6f33e417
KW
1003
1004=item [2]
1005
a9130ea9 1006C<\p{...}> C<\P{...}>
6f33e417
KW
1007
1008=item [3]
1009
1010Perl supports not only the minimal list, but all Unicode character properties (see Unicode Character Properties above).
1011
1012=item [4]
1013
a9130ea9 1014C<\d> C<\D> C<\s> C<\S> C<\w> C<\W> C<\X> C<[:I<prop>:]> C<[:^I<prop>:]>
6f33e417
KW
1015
1016=item [5]
1017
df9e1087 1018The experimental feature in v5.18 C<"(?[...])"> accomplishes this. See
9d1a5160
KW
1019L<perlre/(?[ ])>. If you don't want to use an experimental feature,
1020you can use one of the following:
6f33e417
KW
1021
1022=over 4
1023
1024=item * Regular expression look-ahead
1025
1026You can mimic class subtraction using lookahead.
8158862b 1027For example, what UTS#18 might write as
29bdacb8 1028
209c9685 1029 [{Block=Greek}-[{UNASSIGNED}]]
dbe420b4
JH
1030
1031in Perl can be written as:
1032
209c9685
KW
1033 (?!\p{Unassigned})\p{Block=Greek}
1034 (?=\p{Assigned})\p{Block=Greek}
dbe420b4
JH
1035
1036But in this particular example, you probably really want
1037
209c9685 1038 \p{Greek}
dbe420b4
JH
1039
1040which will match assigned characters known to be part of the Greek script.
29bdacb8 1041
a9130ea9 1042=item * CPAN module C<L<Unicode::Regex::Set>>
8158862b 1043
6f33e417
KW
1044It does implement the full UTS#18 grouping, intersection, union, and
1045removal (subtraction) syntax.
8158862b 1046
6f33e417
KW
1047=item * L</"User-Defined Character Properties">
1048
a9130ea9 1049C<"+"> for union, C<"-"> for removal (set-difference), C<"&"> for intersection
6f33e417
KW
1050
1051=back
1052
1053=item [6]
1054
a9130ea9 1055C<\b> C<\B>
6f33e417
KW
1056
1057=item [7]
1058
a9130ea9
KW
1059Note that Perl does Full case-folding in matching (but with bugs), not
1060Simple: for example C<U+1F88> is equivalent to C<U+1F00 U+03B9>, instead of
1061just C<U+1F80>. This difference matters mainly for certain Greek capital
1062letters with certain modifiers: the Full case-folding decomposes the
1063letter, while the Simple case-folding would map it to a single
1064character.
6f33e417
KW
1065
1066=item [8]
1067
a9130ea9
KW
1068Should do C<^> and C<$> also on C<U+000B> (C<\v> in C), C<FF> (C<\f>), C<CR> (C<\r>), C<CRLF>
1069(C<\r\n>), C<NEL> (C<U+0085>), C<LS> (C<U+2028>), and C<PS> (C<U+2029>); should also affect
1070C<E<lt>E<gt>>, C<$.>, and script line numbers; should not split lines within C<CRLF>
1071(i.e. there is no empty line between C<\r> and C<\n>). For C<CRLF>, try the
6f33e417
KW
1072C<:crlf> layer (see L<PerlIO>).
1073
1074=item [9]
1075
a9130ea9
KW
1076Linebreaking conformant with L<UAX#14 "Unicode Line Breaking
1077Algorithm"|http://www.unicode.org/reports/tr14>
1078is available through the C<L<Unicode::LineBreak>> module.
6f33e417
KW
1079
1080=item [10]
1081
a9130ea9
KW
1082UTF-8/UTF-EBCDIC used in Perl allows not only C<U+10000> to
1083C<U+10FFFF> but also beyond C<U+10FFFF>
6f33e417
KW
1084
1085=back
5ca1ac52 1086
776f8809
JH
1087=item *
1088
1089Level 2 - Extended Unicode Support
1090
755789c0
KW
1091 RL2.1 Canonical Equivalents - MISSING [10][11]
1092 RL2.2 Default Grapheme Clusters - MISSING [12]
1093 RL2.3 Default Word Boundaries - MISSING [14]
1094 RL2.4 Default Loose Matches - MISSING [15]
1095 RL2.5 Name Properties - DONE
1096 RL2.6 Wildcard Properties - MISSING
8158862b 1097
755789c0
KW
1098 [10] see UAX#15 "Unicode Normalization Forms"
1099 [11] have Unicode::Normalize but not integrated to regexes
1100 [12] have \X but we don't have a "Grapheme Cluster Mode"
1101 [14] see UAX#29, Word Boundaries
902b08d0 1102 [15] This is covered in Chapter 3.13 (in Unicode 6.0)
776f8809
JH
1103
1104=item *
1105
8158862b
TS
1106Level 3 - Tailored Support
1107
755789c0
KW
1108 RL3.1 Tailored Punctuation - MISSING
1109 RL3.2 Tailored Grapheme Clusters - MISSING [17][18]
1110 RL3.3 Tailored Word Boundaries - MISSING
1111 RL3.4 Tailored Loose Matches - MISSING
1112 RL3.5 Tailored Ranges - MISSING
1113 RL3.6 Context Matching - MISSING [19]
1114 RL3.7 Incremental Matches - MISSING
8158862b 1115 ( RL3.8 Unicode Set Sharing )
755789c0
KW
1116 RL3.9 Possible Match Sets - MISSING
1117 RL3.10 Folded Matching - MISSING [20]
1118 RL3.11 Submatchers - MISSING
1119
1120 [17] see UTS#10 "Unicode Collation Algorithm"
1121 [18] have Unicode::Collate but not integrated to regexes
1122 [19] have (?<=x) and (?=x), but look-aheads or look-behinds
1123 should see outside of the target substring
1124 [20] need insensitive matching for linguistic features other
1125 than case; for example, hiragana to katakana, wide and
1126 narrow, simplified Han to traditional Han (see UTR#30
1127 "Character Foldings")
776f8809
JH
1128
1129=back
1130
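Two of the Level 1 notes above lend themselves to a quick check: class subtraction by look-ahead (note [5]) and Full case-folding (note [7]). The following is a sketch under the assumption that C<U+0378> is still an unassigned code point inside the Greek block in your Perl's Unicode tables; the variable names are illustrative only.

```perl
use v5.14;
use warnings;

my $alpha    = "\x{3B1}";   # GREEK SMALL LETTER ALPHA
my $reserved = "\x{378}";   # inside the Greek block, but unassigned

# The whole block, and the block with unassigned code points subtracted.
my $block       = qr/\A\p{Block=Greek}\z/;
my $block_minus = qr/\A(?!\p{Unassigned})\p{Block=Greek}\z/;

my $alpha_block    = $alpha    =~ $block       ? 1 : 0;
my $alpha_minus    = $alpha    =~ $block_minus ? 1 : 0;
my $reserved_block = $reserved =~ $block       ? 1 : 0;
my $reserved_minus = $reserved =~ $block_minus ? 1 : 0;

# Full case-folding: one character in the target matches a
# two-character sequence in the pattern under /i.
my $full_fold = "\x{1F88}" =~ /\A\x{1F00}\x{3B9}\z/i ? 1 : 0;

say "alpha:    block=$alpha_block minus=$alpha_minus";
say "reserved: block=$reserved_block minus=$reserved_minus";
say "full fold matched: $full_fold";
```

The negative look-ahead consumes nothing, so the pattern still matches exactly one character; it merely vetoes candidates that are unassigned.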
c349b1b9
JH
1131=head2 Unicode Encodings
1132
376d9008
JB
1133Unicode characters are assigned to I<code points>, which are abstract
1134numbers. To use these numbers, various encodings are needed.
c349b1b9
JH
1135
1136=over 4
1137
c29a771d 1138=item *
5cb3728c
RB
1139
1140UTF-8
c349b1b9 1141
6d4f9cf2
KW
1142UTF-8 is a variable-length (1 to 4 bytes), byte-order independent
1143encoding. For ASCII (and we really do mean 7-bit ASCII, not another
11448-bit encoding), UTF-8 is transparent.
c349b1b9 1145
8c007b5a 1146The following table is from Unicode 3.2.
05632f9a 1147
755789c0 1148 Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte
05632f9a 1149
d88362ca 1150 U+0000..U+007F 00..7F
e1b711da 1151 U+0080..U+07FF * C2..DF 80..BF
d88362ca 1152 U+0800..U+0FFF E0 * A0..BF 80..BF
ec90690f
TS
1153 U+1000..U+CFFF E1..EC 80..BF 80..BF
1154 U+D000..U+D7FF ED 80..9F 80..BF
755789c0 1155 U+D800..U+DFFF +++++ utf16 surrogates, not legal utf8 +++++
ec90690f 1156 U+E000..U+FFFF EE..EF 80..BF 80..BF
d88362ca
KW
1157 U+10000..U+3FFFF F0 * 90..BF 80..BF 80..BF
1158 U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF
1159 U+100000..U+10FFFF F4 80..8F 80..BF 80..BF
e1b711da 1160
b19eb496 1161Note the gaps marked by "*" before several of the byte entries above. These are
e1b711da
KW
1162caused by legal UTF-8 avoiding non-shortest encodings: it is technically
1163possible to UTF-8-encode a single code point in different ways, but that is
1164explicitly forbidden, and the shortest possible encoding should always be used
1165(and that is what Perl does).
37361303 1166
376d9008 1167Another way to look at it is via bits:
05632f9a 1168
755789c0 1169 Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte
05632f9a 1170
755789c0
KW
1171 0aaaaaaa 0aaaaaaa
1172 00000bbbbbaaaaaa 110bbbbb 10aaaaaa
1173 ccccbbbbbbaaaaaa 1110cccc 10bbbbbb 10aaaaaa
1174 00000dddccccccbbbbbbaaaaaa 11110ddd 10cccccc 10bbbbbb 10aaaaaa
05632f9a 1175
a9130ea9 1176As you can see, the continuation bytes all begin with C<"10">, and the
e1b711da 1177leading bits of the start byte tell how many bytes there are in the
05632f9a
JH
1178encoded character.
1179
6d4f9cf2 1180The original UTF-8 specification allowed up to 6 bytes, to allow
a9130ea9 1181encoding of numbers up to C<0x7FFF_FFFF>. Perl continues to allow those,
6d4f9cf2
KW
1182and has extended that up to 13 bytes to encode code points up to what
1183can fit in a 64-bit word. However, Perl will warn if you output any of
b19eb496 1184these as being non-portable; and under strict UTF-8 input protocols,
6d4f9cf2
KW
1185they are forbidden.
1186
1187The Unicode non-character code points are also disallowed in UTF-8 in
1188"open interchange". See L</Non-character code points>.
1189
c29a771d 1190=item *
5cb3728c
RB
1191
1192UTF-EBCDIC
dbe420b4 1193
376d9008 1194Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.
dbe420b4 1195
c29a771d 1196=item *
5cb3728c 1197
a9130ea9 1198UTF-16, UTF-16BE, UTF-16LE, Surrogates, and C<BOM>s (Byte Order Marks)
c349b1b9 1199
1bfb14c4
JH
1200The following items are mostly for reference and general Unicode
1201knowledge; Perl doesn't use these constructs internally.
dbe420b4 1202
b19eb496
TC
1203Like UTF-8, UTF-16 is a variable-width encoding, but where
1204UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units.
1205All code points occupy either 2 or 4 bytes in UTF-16: code points
1206C<U+0000..U+FFFF> are stored in a single 16-bit unit, and code
1bfb14c4 1207points C<U+10000..U+10FFFF> in two 16-bit units. The latter case is
c349b1b9
JH
1208using I<surrogates>, the first 16-bit unit being the I<high
1209surrogate>, and the second being the I<low surrogate>.
1210
376d9008 1211Surrogates are code points set aside to encode the C<U+10000..U+10FFFF>
c349b1b9 1212range of Unicode code points in pairs of 16-bit units. The I<high
9f815e24 1213surrogates> are the range C<U+D800..U+DBFF> and the I<low surrogates>
376d9008 1214are the range C<U+DC00..U+DFFF>. The surrogate encoding is
c349b1b9 1215
d88362ca
KW
1216 $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
1217 $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
c349b1b9
JH
1218
1219and the decoding is
1220
d88362ca 1221 $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
c349b1b9 1222
376d9008 1223Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16
c349b1b9 1224itself can be used for in-memory computations, but if storage or
376d9008
JB
1225transfer is required either UTF-16BE (big-endian) or UTF-16LE
1226(little-endian) encodings must be chosen.
c349b1b9
JH
1227
1228This introduces another problem: what if you just know that your data
376d9008 1229is UTF-16, but you don't know which endianness? Byte Order Marks, or
a9130ea9 1230C<BOM>s, are a solution to this. A special character has been reserved
86bbd6d1 1231in Unicode to function as a byte order marker: the character with the
a9130ea9 1232code point C<U+FEFF> is the C<BOM>.
042da322 1233
a9130ea9 1234The trick is that if you read a C<BOM>, you will know the byte order,
376d9008
JB
1235since if it was written on a big-endian platform, you will read the
1236bytes C<0xFE 0xFF>, but if it was written on a little-endian platform,
1237you will read the bytes C<0xFF 0xFE>. (And if the originating platform
1238was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.)
042da322 1239
86bbd6d1 1240The way this trick works is that the character with the code point
6d4f9cf2 1241C<U+FFFE> is not supposed to be in input streams, so the
a9130ea9 1242sequence of bytes C<0xFF 0xFE> is unambiguously "C<BOM>, represented in
1bfb14c4 1243little-endian format" and cannot be "C<U+FFFE>, represented in big-endian
6d4f9cf2
KW
1244format".
1245
1246Surrogates have no meaning in Unicode outside their use in pairs to
1247represent other code points. However, Perl allows them to be
1248represented individually internally, for example by saying
f651977e
TC
1249C<chr(0xD801)>, so that all code points, not just those valid for open
1250interchange, are
6d4f9cf2 1251representable. Unicode does define semantics for them, such as their
a9130ea9
KW
1252C<L</General_Category>> being C<"Cs">. But because their use is somewhat dangerous,
1253Perl will warn (using the warning category C<"surrogate">, which is a
1254sub-category of C<"utf8">) if an attempt is made
6d4f9cf2
KW
1255to do things like take the lower case of one, or match
1256case-insensitively, or to output them. (But don't try this on Perls
1257before 5.14.)
c349b1b9 1258
c29a771d 1259=item *
5cb3728c 1260
1e54db1a 1261UTF-32, UTF-32BE, UTF-32LE
c349b1b9
JH
1262
1263The UTF-32 family is pretty much like the UTF-16 family, except that
042da322 1264the units are 32-bit, and therefore the surrogate scheme is not
a9130ea9 1265needed. UTF-32 is a fixed-width encoding. The C<BOM> signatures are
b19eb496 1266C<0x00 0x00 0xFE 0xFF> for BE and C<0xFF 0xFE 0x00 0x00> for LE.
c349b1b9 1267
c29a771d 1268=item *
5cb3728c
RB
1269
1270UCS-2, UCS-4
c349b1b9 1271
b19eb496 1272Legacy, fixed-width encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit
376d9008 1273encoding. Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
339cfa0e 1274because it does not use surrogates. UCS-4 is a 32-bit encoding,
b19eb496 1275functionally identical to UTF-32 (the difference being that
a9130ea9 1276UCS-4 forbids neither surrogates nor code points larger than C<0x10_FFFF>).
c349b1b9 1277
c29a771d 1278=item *
5cb3728c
RB
1279
1280UTF-7
c349b1b9 1281
376d9008
JB
1282A seven-bit safe (non-eight-bit) encoding, which is useful if the
1283transport or storage is not eight-bit safe. Defined by RFC 2152.
c349b1b9 1284
95a1a48b
JH
1285=back
1286
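The UTF-8 bit layout and the surrogate arithmetic given in the list above can be exercised with a few lines of Perl. This is a minimal sketch, not how Perl encodes strings internally: the helper names (C<utf8_bytes>, C<perl_utf8_bytes>, C<to_surrogates>, C<from_surrogates>) are made up for illustration, and the hand-written encoder assumes a code point in the range C<U+0000..U+10FFFF> that is not a surrogate.

```perl
use v5.14;
use warnings;

# Encode one code point by hand, following the bit table for UTF-8.
sub utf8_bytes {
    my ($cp) = @_;
    return ($cp) if $cp < 0x80;
    return (0xC0 | ($cp >> 6), 0x80 | ($cp & 0x3F)) if $cp < 0x800;
    return (0xE0 | ($cp >> 12),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F)) if $cp < 0x10000;
    return (0xF0 | ($cp >> 18),
            0x80 | (($cp >> 12) & 0x3F),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F));
}

# Perl's own encoder, for comparison.
sub perl_utf8_bytes {
    my $s = chr shift;
    utf8::encode($s);            # characters -> UTF-8 bytes, in place
    return unpack 'C*', $s;
}

# Surrogate pair for a code point in U+10000..U+10FFFF, and back again.
sub to_surrogates {
    my $offset = shift() - 0x10000;
    return (0xD800 + ($offset >> 10), 0xDC00 + ($offset & 0x3FF));
}

sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
}

for my $cp (0x41, 0xE9, 0x20AC, 0x1F600) {
    my $by_hand = join ' ', map { sprintf '%02X', $_ } utf8_bytes($cp);
    my $by_perl = join ' ', map { sprintf '%02X', $_ } perl_utf8_bytes($cp);
    printf "U+%04X  hand: %-11s  perl: %s\n", $cp, $by_hand, $by_perl;
}

my ($hi, $lo) = to_surrogates(0x1F600);
printf "U+1F600 -> %04X %04X -> U+%X\n", $hi, $lo, from_surrogates($hi, $lo);
```

Shifts and masks are used instead of C</> and C<%> so that every intermediate value stays an integer.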
6d4f9cf2
KW
1287=head2 Non-character code points
1288
128966 code points are set aside in Unicode as "non-character code points".
a9130ea9
KW
1290These all have the C<Unassigned> (C<Cn>) C<L</General_Category>>, and
1291they never will
6d4f9cf2
KW
1292be assigned. These are never supposed to be in legal Unicode input
1293streams, so that code can use them as sentinels that can be mixed in
1294with character data, and they always will be distinguishable from that data.
1295To keep them out of Perl input streams, strict UTF-8 should be
1296specified, such as by using the layer C<:encoding('UTF-8')>. The
a9130ea9
KW
1297non-character code points are the 32 between C<U+FDD0> and C<U+FDEF>, and the
129834 code points C<U+FFFE>, C<U+FFFF>, C<U+1FFFE>, C<U+1FFFF>, ... C<U+10FFFE>, C<U+10FFFF>.
6d4f9cf2
KW
1299Some people are under the mistaken impression that these are "illegal",
1300but that is not true. An application or cooperating set of applications
1301can legally use them at will internally; but these code points are
42581d5d
KW
1302"illegal for open interchange". Therefore, Perl will not accept these
1303from input streams unless lax rules are being used, and will warn
a9130ea9 1304(using the warning category C<"nonchar">, which is a sub-category of C<"utf8">) if
42581d5d
KW
1305an attempt is made to output them.
1306
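The count of 66 follows directly from the two ranges just described, and it can be cross-checked against Perl's property tables. A short sketch, assuming Perl v5.14 or later (variable names are illustrative):

```perl
use v5.14;
use warnings;

# 32 code points in U+FDD0..U+FDEF, plus the last two code points
# of each of the 17 planes (U+FFFE, U+FFFF, U+1FFFE, ... U+10FFFF).
my @nonchars = (
    0xFDD0 .. 0xFDEF,
    map { my $plane = $_ << 16; ($plane | 0xFFFE, $plane | 0xFFFF) } 0 .. 0x10,
);

my $count = scalar @nonchars;

# Every one of them should carry the Noncharacter_Code_Point property
# and have the Unassigned (Cn) General_Category.
my $with_property = grep { chr($_) =~ /\p{Noncharacter_Code_Point}/ } @nonchars;
my $unassigned    = grep { chr($_) =~ /\p{Cn}/ } @nonchars;

say "non-character code points: $count";
say "matching \\p{Noncharacter_Code_Point}: $with_property";
say "matching \\p{Cn}: $unassigned";
```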
1307=head2 Beyond Unicode code points
1308
a9130ea9
KW
1309The maximum Unicode code point is C<U+10FFFF>, and Unicode only defines
1310operations on code points up through that. But Perl works on code
42581d5d
KW
1311points up to the maximum permissible unsigned number available on the
1312platform. However, Perl will not accept these from input streams unless
1313lax rules are being used, and will warn (using the warning category
b19eb496 1314C<"non_unicode">, which is a sub-category of C<"utf8">) if an attempt is made to
42581d5d
KW
1315operate on or output them. For example, C<uc(chr(0x11_0000))> will generate
1316this warning, returning the input parameter as its result, as the upper
ee88f7b6 1317case of every non-Unicode code point is the code point itself.
6d4f9cf2 1318
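A code point above the Unicode maximum behaves like any other character inside a Perl string. The sketch below assumes Perl v5.14 or later, where the C<non_unicode> warning category exists; the warning is switched off so that the script behaves the same whether or not a given Perl version emits it for case changes.

```perl
use v5.14;
use warnings;
no warnings 'non_unicode';   # silence the warning on Perls that raise it

my $beyond = chr(0x11_0000); # one past the maximum Unicode code point

my $length     = length $beyond;
my $code_point = ord $beyond;
my $upper_same = uc($beyond) eq $beyond ? 1 : 0;
my $lower_same = lc($beyond) eq $beyond ? 1 : 0;

printf "length=%d ord=0x%X uc-unchanged=%d lc-unchanged=%d\n",
    $length, $code_point, $upper_same, $lower_same;
```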
0d7c09bb
JH
1319=head2 Security Implications of Unicode
1320
e1b711da
KW
1321Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.
1322Also, note the following:
1323
0d7c09bb
JH
1324=over 4
1325
1326=item *
1327
1328Malformed UTF-8
bf0fa0b2 1329
42581d5d 1330Unfortunately, the original specification of UTF-8 leaves some room for
bf0fa0b2 1331interpretation of how many bytes of encoded output one should generate
376d9008
JB
1332from one input Unicode character. Strictly speaking, the shortest
1333possible sequence of UTF-8 bytes should be generated,
1334because otherwise there is potential for an input buffer overflow at
feda178f 1335the receiving end of a UTF-8 connection. Perl always generates the
e1b711da 1336shortest length UTF-8, and with warnings on, Perl will warn about
376d9008 1337non-shortest length UTF-8 along with other malformations, such as the
b19eb496 1338surrogates, which are not Unicode code points valid for interchange.
bf0fa0b2 1339
0d7c09bb
JH
1340=item *
1341
68693f9e 1342Regular expression pattern matching may surprise you if you're not
b19eb496
TC
1343accustomed to Unicode. Starting in Perl 5.14, several pattern
1344modifiers are available to control this, called the character set
42581d5d
KW
1345modifiers. Details are given in L<perlre/Character set modifiers>.
1346
1347=back
0d7c09bb 1348
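One way to keep malformed input out of a program is to decode with the strict C<UTF-8> encoding and ask C<Encode> to die on errors. The following is a sketch only; C<strict_decode> is a hypothetical helper written for this example, not a function provided by C<Encode>.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: returns the decoded string, or undef if the
# bytes are not well-formed strict UTF-8.
sub strict_decode {
    my ($bytes) = @_;   # work on a copy; decode() may modify its argument
    return eval { decode('UTF-8', $bytes, FB_CROAK) };
}

my $good      = strict_decode("\xC3\xA9");       # shortest form of U+00E9
my $overlong  = strict_decode("\xC0\x80");       # non-shortest encoding of U+0000
my $surrogate = strict_decode("\xED\xA0\x81");   # encodes the surrogate U+D801

printf "good:      %s\n", defined $good      ? sprintf('U+%04X', ord $good) : 'rejected';
printf "overlong:  %s\n", defined $overlong  ? 'accepted' : 'rejected';
printf "surrogate: %s\n", defined $surrogate ? 'accepted' : 'rejected';
```

Note the spelling: C<UTF-8> (with the hyphen) selects the strict encoding, whereas C<utf8> is Perl's lax variant.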
As discussed elsewhere, Perl has one foot (two hooves?) planted in
each of two worlds: the old world of bytes and the new world of
characters, upgrading from bytes to characters when necessary.
If your legacy code does not explicitly use Unicode, no automatic
switch-over to characters should happen. Characters shouldn't get
downgraded to bytes, either. It is possible to accidentally mix bytes
and characters, however (see L<perluniintro>), in which case C<\w> in
regular expressions might start behaving differently (unless the C</a>
modifier is in effect). Review your code. Use warnings and the C<strict> pragma.

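The effect of the character set modifiers can be sketched with a single
non-ASCII digit. This assumes Perl 5.14 or later, where C</u> and C</a>
are available; the variable names are illustrative:

```perl
use strict;
use warnings;

my $digit = "\x{0661}";    # ARABIC-INDIC DIGIT ONE, general category Nd

my $under_u = $digit =~ /^\d$/u ? 1 : 0;    # Unicode rules: matches
my $under_a = $digit =~ /^\d$/a ? 1 : 0;    # ASCII-restricted rules: no match

print "u=$under_u a=$under_a\n";
```

Code that validates input with C<\d> or C<\w> and then treats the result
as ASCII is the typical place where the C</a> modifier is worth
considering.
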
=head2 Unicode in Perl on EBCDIC

The way Unicode is handled on EBCDIC platforms is still
experimental. On such platforms, references to UTF-8 encoding in this
document and elsewhere should be read as meaning the UTF-EBCDIC
specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues
are specifically discussed. There is no C<utfebcdic> pragma or
C<":utfebcdic"> layer; rather, C<"utf8"> and C<":utf8"> are reused to mean
the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic>
for more discussion of the issues.

=head2 Locales

See L<perllocale/Unicode and UTF-8>.

=head2 When Unicode Does Not Happen

While Perl does have extensive ways to input and output in Unicode,
and a few other "entry points" like the C<@ARGV> array (which can sometimes be
interpreted as UTF-8), there are still many places where Unicode
(in some encoding or another) could be given as arguments or received as
results, or both, but it is not.

The following are such interfaces. Also, see L</The "Unicode Bug">.
For all of these interfaces Perl
currently (as of v5.16.0) simply assumes byte strings both as arguments
and results, or UTF-8 strings if the (problematic) C<encoding> pragma has been used.

One reason that Perl does not attempt to resolve the role of Unicode in
these situations is that the answers are highly dependent on the operating
system and the file system(s). For example, whether filenames can be
in Unicode, and in exactly what kind of encoding, is not exactly a
portable concept. Similarly for C<qx> and C<system>: how well will the
"command-line interface" (and which of them?) handle Unicode?

=over 4

=item *

C<chdir>, C<chmod>, C<chown>, C<chroot>, C<exec>, C<link>, C<lstat>, C<mkdir>,
C<rename>, C<rmdir>, C<stat>, C<symlink>, C<truncate>, C<unlink>, C<utime>, C<-X>

=item *

C<%ENV>

=item *

C<glob> (aka the C<E<lt>*E<gt>> operator)

=item *

C<open>, C<opendir>, C<sysopen>

=item *

C<qx> (aka the backtick operator), C<system>

=item *

C<readdir>, C<readlink>

=back

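Because these interfaces deal in byte strings, one common approach is to
encode names explicitly before handing them to the system and to decode
whatever comes back. The following is only a sketch: it assumes a file
system that stores names as UTF-8 octets, which is not true everywhere,
and the variable C<$fs_encoding> is a name chosen for illustration.

```perl
use strict;
use warnings;
use Encode ();
use File::Temp ();

# Assumption: the file system stores names as UTF-8 octets.
my $fs_encoding = 'UTF-8';

my $dir  = File::Temp::tempdir(CLEANUP => 1);
my $name = "\x{2660}.txt";                                  # characters
my $path = "$dir/" . Encode::encode($fs_encoding, $name);   # octets

# open() receives octets, not characters
open(my $fh, '>', $path) or die "open: $!";
close($fh) or die "close: $!";

# readdir() returns octets, which are decoded back into characters
opendir(my $dh, $dir) or die "opendir: $!";
my @names = map  { Encode::decode($fs_encoding, $_) }
            grep { $_ ne '.' && $_ ne '..' } readdir($dh);
closedir($dh);
```

Keeping the encoding in one variable makes it easier to adapt the code to
a platform whose file names use a different encoding.
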
=head2 The "Unicode Bug"

The term "Unicode bug" has been applied to an inconsistency
on ASCII platforms with the
Unicode code points in the C<Latin-1 Supplement> block, that
is, between 128 and 255. Without a locale specified, unlike all other
characters or code points, these characters behave very differently under
byte semantics than under character semantics, unless
C<use feature 'unicode_strings'> is specified, directly or indirectly.
(It is indirectly specified by a C<use v5.12> or higher.)

In character semantics these upper-Latin1 characters are interpreted as
Unicode code points, which means
they have the same semantics as Latin-1 (ISO-8859-1).

In byte semantics (without C<unicode_strings>), they are considered to
be unassigned characters, meaning that the only semantics they have is
their ordinal numbers, and that they are
not members of various character classes. None are considered to match C<\w>
for example, but all match C<\W>.

Perl 5.12.0 added C<unicode_strings> to force character semantics on
these code points in some circumstances, which fixed portions of the
bug; Perl 5.14.0 fixed almost all of it; and Perl 5.16.0 fixed the
remainder (so far as we know, anyway). The lesson here is to enable
C<unicode_strings> to avoid the headaches described below.

The old, problematic behavior affects these areas:

=over 4

=item *

Changing the case of a scalar, that is, using C<uc()>, C<ucfirst()>, C<lc()>,
and C<lcfirst()>, or C<\L>, C<\U>, C<\u> and C<\l> in double-quotish
contexts, such as regular expression substitutions.
Under C<unicode_strings> starting in Perl 5.12.0, character semantics are
generally used. See L<perlfunc/lc> for details on how this works
in combination with various other pragmas.

=item *

Using caseless (C</i>) regular expression matching.
Starting in Perl 5.14.0, regular expressions compiled within
the scope of C<unicode_strings> use character semantics
even when executed or compiled into larger
regular expressions outside the scope.

=item *

Matching any of several properties in regular expressions, namely C<\b>,
C<\B>, C<\s>, C<\S>, C<\w>, C<\W>, and all the Posix character classes
I<except> C<[[:ascii:]]>.
Starting in Perl 5.14.0, regular expressions compiled within
the scope of C<unicode_strings> use character semantics
even when executed or compiled into larger
regular expressions outside the scope.

=item *

In C<quotemeta> or its inline equivalent C<\Q>, no code points above 127
are quoted in UTF-8 encoded strings, but in byte encoded strings, code
points between 128 and 255 are always quoted.
Starting in Perl 5.16.0, consistent quoting rules are used within the
scope of C<unicode_strings>, as described in L<perlfunc/quotemeta>.

=back

This behavior can lead to unexpected results in which a string's semantics
suddenly change if a code point above 255 is appended to or removed from it,
which changes the string's semantics from byte to character or vice versa. As
an example, consider the following program and its output:

    $ perl -le'
        no feature "unicode_strings";
        $s1 = "\xC2";
        $s2 = "\x{2660}";
        for ($s1, $s2, $s1.$s2) {
            print /\w/ || 0;
        }
    '
    0
    0
    1

If there's no C<\w> in C<$s1> or in C<$s2>, why does their concatenation have one?

This anomaly stems from Perl's attempt to not disturb older programs that
didn't use Unicode, and hence had no semantics for characters outside of the
ASCII range (except in a locale), along with Perl's desire to add Unicode
support seamlessly. The result wasn't seamless: these characters were
orphaned.

For Perls earlier than those described above, or when a string is passed
to a function outside the subpragma's scope, a workaround is to always
call L<C<utf8::upgrade($string)>|utf8/Utility functions>,
or to use the standard module L<Encode>. Also, a scalar that has any characters
whose ordinal is C<0x100> or above, or which were specified using either of the
C<\N{...}> notations, will automatically have character semantics.

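The workaround and the pragma can be compared side by side in a short
sketch. It assumes an ASCII platform, no locale in effect, and Perl 5.14
or later; the variable names are illustrative:

```perl
use strict;
use warnings;
no feature 'unicode_strings';

my $s = "\xE9";    # LATIN SMALL LETTER E WITH ACUTE, stored as one byte

my $native = $s =~ /\w/ ? 1 : 0;      # byte semantics
utf8::upgrade($s);                    # same character, UTF-8 internally
my $upgraded = $s =~ /\w/ ? 1 : 0;    # character semantics

my $scoped = do {
    use feature 'unicode_strings';
    "\xE9" =~ /\w/ ? 1 : 0;           # character semantics without upgrading
};

print "native=$native upgraded=$upgraded scoped=$scoped\n";
```

Under those assumptions this is expected to print
C<native=0 upgraded=1 scoped=1>: the same character fails to match C<\w>
only while it is stored as a byte outside the scope of C<unicode_strings>.
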
=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)

Sometimes (see L</"When Unicode Does Not Happen"> or L</The "Unicode Bug">)
there are situations where you simply need to force a byte
string into UTF-8, or vice versa. The low-level calls
L<C<utf8::upgrade($bytestring)>|utf8/Utility functions> and
L<C<utf8::downgrade($utf8string[, FAIL_OK])>|utf8/Utility functions> are
the answers.

Note that C<utf8::downgrade()> can fail if the string contains characters
that don't fit into a byte.

Calling either function on a string that already is in the desired state is a
no-op.

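A brief sketch of both calls follows. It uses C<utf8::is_utf8()> from the
same group of L<utf8> utility functions only to make the internal state
visible; the variable names are illustrative:

```perl
use strict;
use warnings;

my $str = "caf\xE9";
my $was_utf8 = utf8::is_utf8($str) ? 1 : 0;    # stored as bytes

utf8::upgrade($str);
my $is_utf8 = utf8::is_utf8($str) ? 1 : 0;     # now stored as UTF-8
my $same    = $str eq "caf\xE9" ? 1 : 0;       # the characters are unchanged

utf8::downgrade($str);                         # back to one byte per character
my $back = utf8::is_utf8($str) ? 1 : 0;

my $wide = "\x{2660}";                         # does not fit into a byte
my $ok   = utf8::downgrade($wide, 1) ? 1 : 0;  # FAIL_OK: returns false, no exception
```

Without the C<FAIL_OK> argument, the last call would die instead of
returning false, so it is worth passing when the content of the string is
not known in advance.
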
=head2 Using Unicode in XS

If you want to handle Perl Unicode in XS extensions, you may find the
following C APIs useful. See also L<perlguts/"Unicode Support"> for an
explanation about Unicode at the XS level, and L<perlapi> for the API
details.

=over 4

=item *

C<DO_UTF8(sv)> returns true if the C<UTF8> flag is on and the C<bytes>
pragma is not in effect. C<SvUTF8(sv)> returns true if the C<UTF8>
flag is on; the C<bytes> pragma is ignored. The C<UTF8> flag being on
does B<not> mean that there are any characters of code points greater
than 255 (or 127) in the scalar or that there are even any characters
in the scalar. What the C<UTF8> flag means is that the sequence of
octets in the representation of the scalar is the sequence of UTF-8
encoded code points of the characters of a string. The C<UTF8> flag
being off means that each octet in this representation encodes a
single character with code point 0..255 within the string. Perl's
Unicode model is not to use UTF-8 until it is absolutely necessary.

=item *

C<uvchr_to_utf8(buf, chr)> writes a Unicode character code point into
a buffer encoding the code point as UTF-8, and returns a pointer
pointing after the UTF-8 bytes. It works appropriately on EBCDIC machines.

=item *

C<utf8_to_uvchr_buf(buf, bufend, lenp)> reads UTF-8 encoded bytes from a
buffer and
returns the Unicode character code point and, optionally, the length of
the UTF-8 byte sequence. It works appropriately on EBCDIC machines.

=item *

C<utf8_length(start, end)> returns the length of the UTF-8 encoded buffer
in characters. C<sv_len_utf8(sv)> returns the length, in characters, of
the UTF-8 encoded scalar.

=item *

C<sv_utf8_upgrade(sv)> converts the string of the scalar to its UTF-8
encoded form. C<sv_utf8_downgrade(sv)> does the opposite, if
possible. C<sv_utf8_encode(sv)> is like C<sv_utf8_upgrade()> except that
it does not set the C<UTF8> flag. C<sv_utf8_decode()> does the
opposite of C<sv_utf8_encode()>. Note that none of these are to be
used as general-purpose encoding or decoding interfaces: C<use Encode>
for that. C<sv_utf8_upgrade()> is affected by the encoding pragma
but C<sv_utf8_downgrade()> is not (since the encoding pragma is
designed to be a one-way street).

=item *

C<is_utf8_string(buf, len)> returns true if C<len> bytes of the buffer
are valid UTF-8.

=item *

C<is_utf8_char_buf(buf, buf_end)> returns true if the pointer points to
a valid UTF-8 character.

=item *

C<UTF8SKIP(buf)> will return the number of bytes in the UTF-8 encoded
character in the buffer. C<UNISKIP(chr)> will return the number of bytes
required to UTF-8-encode the Unicode character code point. C<UTF8SKIP()>
is useful, for example, for iterating over the characters of a UTF-8
encoded buffer; C<UNISKIP()> is useful, for example, in computing
the size required for a UTF-8 encoded buffer.

=item *

C<utf8_distance(a, b)> will tell the distance in characters between the
two pointers pointing to the same UTF-8 encoded buffer.

=item *

C<utf8_hop(s, off)> will return a pointer to a UTF-8 encoded buffer
that is C<off> (positive or negative) Unicode characters displaced
from the UTF-8 buffer C<s>. Be careful not to overstep the buffer:
C<utf8_hop()> will merrily run off the end or the beginning of the
buffer if told to do so.

=item *

C<pv_uni_display(dsv, spv, len, pvlim, flags)> and
C<sv_uni_display(dsv, ssv, pvlim, flags)> are useful for debugging the
output of Unicode strings and scalars. By default they are useful
only for debugging--they display B<all> characters as hexadecimal code
points--but with the flags C<UNI_DISPLAY_ISPRINT>,
C<UNI_DISPLAY_BACKSLASH>, and C<UNI_DISPLAY_QQ> you can make the
output more readable.

=item *

C<foldEQ_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to
compare two strings case-insensitively in Unicode. For case-sensitive
comparisons you can just use C<memEQ()> and C<memNE()> as usual, except
if one string is in UTF-8 and the other isn't.

=back

For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h>
in the Perl source code distribution.

=head2 Hacking Perl to work on earlier Unicode versions (for very serious hackers only)

Perl by default comes with the latest supported Unicode version built in, but
you can change to use any earlier one.

Download the files in the desired version of Unicode from the Unicode web
site (L<http://www.unicode.org>). These should replace the existing files in
F<lib/unicore> in the Perl source tree. Follow the instructions in
F<README.perl> in that directory to change some of their names, and then build
perl (see L<INSTALL>).

=head1 BUGS

=head2 Interaction with Locales

See L<perllocale/Unicode and UTF-8>.

=head2 Problems with characters in the Latin-1 Supplement range

See L</The "Unicode Bug">.

=head2 Interaction with Extensions

When Perl exchanges data with an extension, the extension should be
able to understand the UTF8 flag and act accordingly. If the
extension doesn't recognize that flag, it's likely that the extension
will return incorrectly-flagged data.

So if you're working with Unicode data, consult the documentation of
every module you're using to see if there are any issues with Unicode data
exchange. If the documentation does not talk about Unicode at all,
suspect the worst and probably look at the source to learn how the
module is implemented. Modules written completely in Perl shouldn't
cause problems. Modules that directly or indirectly access code written
in other programming languages are at risk.

For affected functions, the simple strategy to avoid data corruption is
to always make the encoding of the exchanged data explicit. Choose an
encoding that you know the extension can handle. Convert arguments passed
to the extensions to that encoding and convert results back from that
encoding. Write wrapper functions that do the conversions for you, so
you can later change the functions when the extension catches up.

To provide an example, let's say the popular C<Foo::Bar::escape_html>
function doesn't deal with Unicode data yet. The wrapper function
would convert the argument to raw UTF-8 and convert the result back to
Perl's internal representation like so:

    sub my_escape_html ($) {
        my($what) = shift;
        return unless defined $what;
        Encode::decode_utf8(Foo::Bar::escape_html(
                                     Encode::encode_utf8($what)));
    }

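To try the wrapper pattern without a real extension, the following sketch
substitutes a stand-in for the byte-oriented function. The sub
C<legacy_escape_html> is hypothetical and exists only so that the example
runs on its own; it is not how any actual module is implemented.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for an extension function that only knows bytes.
sub legacy_escape_html {
    my ($octets) = @_;
    $octets =~ s/&/&amp;/g;
    $octets =~ s/</&lt;/g;
    $octets =~ s/>/&gt;/g;
    return $octets;
}

# Wrapper: characters in, characters out, UTF-8 octets in between.
sub my_escape_html {
    my ($what) = @_;
    return unless defined $what;
    return Encode::decode_utf8(
        legacy_escape_html(Encode::encode_utf8($what)));
}

my $escaped = my_escape_html("<\x{2660}>");
```

Because the conversion lives in one place, the wrapper can later be
reduced to a plain call once the underlying function handles character
strings itself.
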
Sometimes, when the extension does not convert data but just stores
and retrieves them, you will be able to use the otherwise
dangerous L<C<Encode::_utf8_on()>|Encode/_utf8_on> function. Let's say
the popular C<Foo::Bar> extension, written in C, provides a C<param>
method that lets you store and retrieve data according to these prototypes:

    $self->param($name, $value);            # set a scalar
    $value = $self->param($name);           # retrieve a scalar

If it does not yet provide support for any encoding, one could write a
derived class with such a C<param> method:

    sub param {
        my($self,$name,$value) = @_;
        utf8::upgrade($name);     # make sure it is UTF-8 encoded
        if (defined $value) {
            utf8::upgrade($value);  # make sure it is UTF-8 encoded
            return $self->SUPER::param($name,$value);
        } else {
            my $ret = $self->SUPER::param($name);
            Encode::_utf8_on($ret); # we know, it is UTF-8 encoded
            return $ret;
        }
    }

Some extensions provide filters on data entry/exit points, such as
C<DB_File::filter_store_key> and family. Look out for such filters in
the documentation of your extensions; they can make the transition to
Unicode data much easier.

=head2 Speed

Some functions are slower when working on UTF-8 encoded strings than
on byte encoded strings. All functions that need to hop over
characters such as C<length()>, C<substr()> or C<index()>, or matching
regular expressions can work B<much> faster when the underlying data are
byte-encoded.

In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
a caching scheme was introduced which will hopefully make the slowness
somewhat less spectacular, at least for some operations. In general,
operations with UTF-8 encoded strings are still slower. As an example,
the Unicode properties (character classes) like C<\p{Nd}> are known to
be quite a bit slower (5-20 times) than their simpler counterparts
like C<\d> (then again, there are hundreds of Unicode characters matching C<Nd>
compared with the 10 ASCII characters matching C<d>).

=head2 Problems on EBCDIC platforms

There are several known problems with Perl on EBCDIC platforms. If you
want to use Perl there, send email to perlbug@perl.org.

In earlier versions, when byte and character data were concatenated,
the new string was sometimes created by
decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
old Unicode string used EBCDIC.

If you find any of these, please report them as bugs.

=head2 Porting code from perl-5.6.X

Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer
was required to use the C<utf8> pragma to declare that a given scope
expected to deal with Unicode data and had to make sure that only
Unicode data were reaching that scope. If you have code that is
working with 5.6, you will need some of the following adjustments to
your code. The examples are written such that the code will continue
to work under 5.6, so you should be safe to try them out.

=over 3

=item *

A filehandle that should read or write UTF-8

    if ($] >= 5.008) {
        binmode $fh, ":encoding(utf8)";
    }

=item *

A scalar that is going to be passed to some extension

Be it C<Compress::Zlib>, C<Apache::Request> or any extension that has no
mention of Unicode in the manpage, you need to make sure that the
UTF8 flag is stripped off. Note that at the time of this writing
(January 2012) the mentioned modules are not UTF-8-aware. Please
check the documentation to verify if this is still true.

    if ($] >= 5.008) {
        require Encode;
        $val = Encode::encode_utf8($val); # make octets
    }

=item *

A scalar we got back from an extension

If you believe the scalar comes back as UTF-8, you will most likely
want the UTF8 flag restored:

    if ($] >= 5.008) {
        require Encode;
        $val = Encode::decode_utf8($val);
    }

=item *

Same thing, if you are really sure it is UTF-8

    if ($] >= 5.008) {
        require Encode;
        Encode::_utf8_on($val);
    }

1817
a9130ea9 1818A wrapper for L<DBI> C<fetchrow_array> and C<fetchrow_hashref>
c8d992ba
A
1819
1820When the database contains only UTF-8, a wrapper function or method is
a9130ea9
KW
1821a convenient way to replace all your C<fetchrow_array> and
1822C<fetchrow_hashref> calls. A wrapper function will also make it easier to
c8d992ba 1823adapt to future enhancements in your database driver. Note that at the
b9cedb1b 1824time of this writing (January 2012), the DBI has no standardized way
a9130ea9 1825to deal with UTF-8 data. Please check the L<DBI documentation|DBI> to verify if
c8d992ba
A
1826that is still true.
1827
1828 sub fetchrow {
d88362ca
KW
1829 # $what is one of fetchrow_{array,hashref}
1830 my($self, $sth, $what) = @_;
b9cedb1b 1831 if ($] < 5.008) {
c8d992ba
A
1832 return $sth->$what;
1833 } else {
1834 require Encode;
1835 if (wantarray) {
1836 my @arr = $sth->$what;
1837 for (@arr) {
1838 defined && /[^\000-\177]/ && Encode::_utf8_on($_);
1839 }
1840 return @arr;
1841 } else {
1842 my $ret = $sth->$what;
1843 if (ref $ret) {
1844 for my $k (keys %$ret) {
d88362ca
KW
1845 defined
1846 && /[^\000-\177]/
1847 && Encode::_utf8_on($_) for $ret->{$k};
c8d992ba
A
1848 }
1849 return $ret;
1850 } else {
1851 defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
1852 return $ret;
1853 }
1854 }
1855 }
1856 }
1857
1858
1859=item *
1860
1861A large scalar that you know can only contain ASCII
1862
1863Scalars that contain only ASCII and are marked as UTF-8 are sometimes
1864a drag to your program. If you recognize such a situation, just remove
2575c402 1865the UTF8 flag:
c8d992ba 1866
b9cedb1b 1867 utf8::downgrade($val) if $] > 5.008;
c8d992ba
A
1868
1869=back
1870
=head1 SEE ALSO

L<perlunitut>, L<perluniintro>, L<perluniprops>, L<Encode>, L<open>, L<utf8>, L<bytes>,
L<perlretut>, L<perlvar/"${^UNICODE}">,
L<http://www.unicode.org/reports/tr44>.

=cut