=head1 NAME

perlunicode - Unicode support in Perl

=head1 DESCRIPTION

=head2 Important Caveats

Unicode support is an extensive requirement. While Perl does not
implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.

People who want to learn to use Unicode in Perl should probably read
the L<Perl Unicode tutorial, perlunitut|perlunitut> and
L<perluniintro> before reading
this reference document.

Also, the use of Unicode may present security issues that aren't obvious.
Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.

=over 4

=item Safest if you "use feature 'unicode_strings'"

In order to preserve backward compatibility, Perl does not turn
on full internal Unicode support unless the pragma
C<use feature 'unicode_strings'> is specified. (This is automatically
selected if you use C<use 5.012> or higher.) Failure to do this can
cause surprises. See L</The "Unicode Bug"> below.

This pragma doesn't affect I/O, and there are still several places
where Unicode isn't fully supported, such as in filenames.

=item Input and Output Layers

Perl knows when a filehandle uses Perl's internal Unicode encodings
(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
the ":encoding(utf8)" layer. Other encodings can be converted to Perl's
encoding on input or from Perl's encoding on output by use of the
":encoding(...)" layer. See L<open>.

To indicate that Perl source itself is in UTF-8, use C<use utf8;>.

=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts

As a compatibility measure, the C<use utf8> pragma must be explicitly
included to enable recognition of UTF-8 in the Perl scripts themselves
(in string or regular expression literals, or in identifier names) on
ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
machines. B<These are the only times when an explicit C<use utf8>
is needed.> See L<utf8>.

=item BOM-marked scripts and UTF-16 scripts autodetected

If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF-16BE,
or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either
endianness, Perl will correctly read in the script as Unicode.
(BOMless UTF-8 cannot be effectively recognized or differentiated from
ISO 8859-1 or other eight-bit encodings.)

=item C<use encoding> needed to upgrade non-Latin-1 byte strings

By default, there is a fundamental asymmetry in Perl's Unicode model:
implicit upgrading from byte strings to Unicode strings assumes that
they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
downgraded with UTF-8 encoding. This happens because the first 256
codepoints in Unicode happen to agree with Latin-1.

See L</"Byte and Character Semantics"> for more details.

=back

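As a minimal sketch of why the C<unicode_strings> feature matters, the
following compares case mapping and C<\w> matching on the same native byte
string with the feature off and on. It assumes an ASCII platform, Perl 5.14
or later, and no C<use locale> in scope; the variable names are illustrative
only.

```perl
use strict;
use warnings;

# LATIN SMALL LETTER E WITH ACUTE, stored as a one-byte native string
# (assumption: ASCII platform, where 0xE9 is the Latin-1 e-acute).
my $e_acute = "\xE9";

my (%off, %on);
{
    no feature 'unicode_strings';    # native byte semantics for 128..255
    $off{upper}   = uc $e_acute;
    $off{is_word} = $e_acute =~ /\w/ ? 1 : 0;
}
{
    use feature 'unicode_strings';   # Unicode semantics for all code points
    $on{upper}   = uc $e_acute;
    $on{is_word} = $e_acute =~ /\w/ ? 1 : 0;
}

printf "feature off: uc gives 0x%02X, \\w match: %d\n",
    ord $off{upper}, $off{is_word};
printf "feature on:  uc gives 0x%02X, \\w match: %d\n",
    ord $on{upper}, $on{is_word};
```

With the feature off, the character keeps its value and is not a word
character; with it on, it uppercases to 0xC9 and matches C<\w>.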
=head2 Byte and Character Semantics

Beginning with version 5.6, Perl uses logically-wide characters to
represent strings internally.

Starting in Perl 5.14, Perl-level operations work with
characters rather than bytes within the scope of a
C<L<use feature 'unicode_strings'|feature>> (or equivalently
C<use 5.012> or higher). (This is not true if bytes have been
explicitly requested by C<L<use bytes|bytes>>, nor necessarily true
for interactions with the platform's operating system.)

For earlier Perls, and when C<unicode_strings> is not in effect, Perl
provides a fairly safe environment that can handle both types of
semantics in programs. For operations where Perl can unambiguously
decide that the input data are characters, Perl switches to character
semantics. For operations where this determination cannot be made
without additional information from the user, Perl decides in favor of
compatibility and chooses to use byte semantics.

When C<use locale> (but not C<use locale ':not_characters'>) is in
effect, Perl uses the semantics associated with the current locale.
(C<use locale> overrides C<use feature 'unicode_strings'> in the same scope,
while C<use locale ':not_characters'> effectively also selects
C<use feature 'unicode_strings'> in its scope; see L<perllocale>.)
Otherwise, Perl uses the platform's native
byte semantics for characters whose code points are less than 256, and
Unicode semantics for those greater than 255. On EBCDIC platforms, this
is almost seamless, as the EBCDIC code pages that Perl handles are
equivalent to Unicode's first 256 code points. (The exception is that
EBCDIC regular expression case-insensitive matching rules are not
as robust as Unicode's.) But on ASCII platforms, Perl uses US-ASCII
(or Basic Latin in Unicode terminology) byte semantics, meaning that characters
whose ordinal numbers are in the range 128 - 255 are undefined except for their
ordinal numbers. This means that none have case (upper and lower), nor are any
a member of character classes, like C<[:alpha:]> or C<\w>. (But all do belong
to the C<\W> class or the Perl regular expression extension C<[:^alpha:]>.)

This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if
none of the program's inputs were marked as being a source of Unicode
character data. Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as %ENV),
or from literals and constants in the source text.

The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
Note that this pragma is only required while Perl defaults to byte
semantics; when character semantics become the default, this pragma
may become a no-op. See L<utf8>.

If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will have
character semantics. This can cause surprises: See L</BUGS>, below.
You can choose to be warned when this happens. See L<encoding::warnings>.

Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
logically just a number ranging from 0 to 2**31 or so. Larger
characters may encode into longer sequences of bytes internally, but
this internal detail is mostly hidden from Perl code.
See L<perluniintro> for more.

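The difference between counting bytes and counting characters can be
sketched as follows. This is an illustration only: it uses the core
L<Encode> module to turn UTF-8 bytes into characters, and the variable
names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Two bytes: the UTF-8 encoding of U+00E9 (LATIN SMALL LETTER E WITH ACUTE).
my $bytes      = "\xC3\xA9";
my $byte_count = length $bytes;          # undecoded input: one unit per byte

# Decoding yields a string holding a single character.
my $chars      = decode('UTF-8', $bytes);
my $char_count = length $chars;
my $code_point = ord $chars;

# A character above 255 is still one character to Perl, however many
# bytes its encoded form needs.
my $smiley       = "\x{263A}";
my $smiley_len   = length $smiley;
my $smiley_bytes = length encode('UTF-8', $smiley);

printf "%d bytes decode to %d character (U+%04X)\n",
    $byte_count, $char_count, $code_point;
printf "U+263A: length %d, %d bytes when encoded as UTF-8\n",
    $smiley_len, $smiley_bytes;
```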
=head2 Effects of Character Semantics

Character semantics have the following effects:

=over 4

=item *

Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.

If you use a Unicode editor to edit your program, Unicode characters may
occur directly within the literal strings in UTF-8 encoding, or UTF-16.
(The former requires a BOM or C<use utf8>, the latter requires a BOM.)

Unicode characters can also be added to a string by using the C<\N{U+...}>
notation. The Unicode code point of the desired character, in hexadecimal,
should be placed in the braces, after the C<U+>. For instance, a smiley face is
C<\N{U+263A}>.

Alternatively, you can use the C<\x{...}> notation for characters 0x100 and
above. For characters below 0x100 you may get byte semantics instead of
character semantics; see L</The "Unicode Bug">. On EBCDIC machines there is
the additional problem that the value for such characters gives the EBCDIC
character rather than the Unicode one, thus it is more portable to use
C<\N{U+...}> instead.

Additionally, you can use the C<\N{...}> notation and put the official
Unicode character name within the braces, such as
C<\N{WHITE SMILING FACE}>. This automatically loads the L<charnames>
module with the C<:full> and C<:short> options. If you prefer different
options for this module, you can instead, before the C<\N{...}>,
explicitly load it with your desired options; for example,

 use charnames ':loose';

=item *

If an appropriate L<encoding> is specified, identifiers within the
Perl script may contain Unicode alphanumeric characters, including
ideographs. Perl does not currently attempt to canonicalize variable
names.

=item *

Regular expressions match characters instead of bytes. "." matches
a character instead of a byte.

=item *

Bracketed character classes in regular expressions match characters instead of
bytes and match against the character properties specified in the
Unicode properties database. C<\w> can be used to match a Japanese
ideograph, for instance.

=item *

Named Unicode properties, scripts, and block ranges may be used (like bracketed
character classes) by using the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".
See L</"Unicode Character Properties"> for more details.

You can define your own character properties and use them
in the regular expression with the C<\p{}> or C<\P{}> construct.
See L</"User-Defined Character Properties"> for more details.

=item *

The special pattern C<\X> matches a logical character, an "extended grapheme
cluster" in Standardese. In Unicode what appears to the user to be a single
character, for example an accented C<G>, may in fact be composed of a sequence
of characters, in this case a C<G> followed by an accent character. C<\X>
will match the entire sequence.

=item *

The C<tr///> operator translates characters instead of bytes. Note
that the C<tr///CU> functionality has been removed. For similar
functionality see C<pack('U0', ...)> and C<pack('C0', ...)>.

=item *

Case translation operators use the Unicode case translation tables
when character input is provided. Note that C<uc()>, or C<\U> in
interpolated strings, translates to uppercase, while C<ucfirst>,
or C<\u> in interpolated strings, translates to titlecase in languages
that make the distinction (which is equivalent to uppercase in languages
without the distinction).

=item *

Most operators that deal with positions or lengths in a string will
automatically switch to using character positions, including
C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
C<sprintf()>, C<write()>, and C<length()>. An operator that
specifically does not switch is C<vec()>. Operators that really don't
care include operators that treat strings as a bucket of bits such as
C<sort()>, and operators dealing with filenames.

=item *

The C<pack()>/C<unpack()> letter C<C> does I<not> change, since it is often
used for byte-oriented formats. Think C<char> in the C language.

There is a new C<U> specifier that converts between Unicode characters
and code points. There is also a C<W> specifier that is the equivalent of
C<chr>/C<ord> and properly handles character values even if they are above 255.

=item *

The C<chr()> and C<ord()> functions work on characters, similar to
C<pack("W")> and C<unpack("W")>, I<not> C<pack("C")> and
C<unpack("C")>. C<pack("C")> and C<unpack("C")> are methods for
emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.
While these methods reveal the internal encoding of Unicode strings,
that is not something one normally needs to care about at all.

=item *

The bit string operators, C<& | ^ ~>, can operate on character data.
However, for backward compatibility, such as when using bit string
operations when characters are all less than 256 in ordinal value, one
should not use C<~> (the bit complement) with characters of both
values less than 256 and values greater than 255. Most importantly,
DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
will not hold. The reason for this mathematical I<faux pas> is that
the complement cannot return B<both> the 8-bit (byte-wide) bit
complement B<and> the full character-wide bit complement.

=item *

There is a CPAN module, L<Unicode::Casing>, which allows you to define
your own mappings to be used in C<lc()>, C<lcfirst()>, C<uc()>,
C<ucfirst()>, and C<fc> (or their double-quoted string inlined
versions such as C<\U>).
(Prior to Perl 5.16, this functionality was partially provided
in the Perl core, but suffered from a number of insurmountable
drawbacks, so the CPAN module was written instead.)

=back

=over 4

=item *

And finally, C<scalar reverse()> reverses by character rather than by byte.

=back

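A few of the effects listed above can be seen in a short sketch. It is an
illustration, not a complete tour: the variable names are invented for the
example, and C<charnames> is loaded explicitly only so that the code also
runs on Perls that do not load it automatically.

```perl
use strict;
use warnings;
use charnames ':full';    # explicit load; newer Perls do this on demand

# Three spellings of the same character, U+263A.
my $by_number = "\N{U+263A}";
my $by_name   = "\N{WHITE SMILING FACE}";
my $by_hex    = "\x{263A}";

# "G" followed by COMBINING ACUTE ACCENT: two code points that a reader
# sees as one accented letter.
my $g_acute     = "G\N{U+0301}";
my $code_points = length $g_acute;          # length() counts code points
my @clusters    = $g_acute =~ /(\X)/g;      # \X takes the whole cluster

# reverse() in scalar context works per character, not per byte.
my $reversed = scalar reverse "a\N{U+263A}b";

printf "code points: %d, grapheme clusters: %d\n",
    $code_points, scalar @clusters;
```

Note that C<scalar reverse()> reverses code points, so reversing
C<$g_acute> itself would move the accent in front of the C<G>.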
=head2 Unicode Character Properties

(The only time that Perl considers a sequence of individual code
points as a single logical character is in the C<\X> construct, already
mentioned above. Therefore "character" in this discussion means a single
Unicode code point.)

Very nearly all Unicode character properties are accessible through
regular expressions by using the C<\p{}> "matches property" construct
and the C<\P{}> "doesn't match property" for its negation.

For instance, C<\p{Uppercase}> matches any single character with the Unicode
"Uppercase" property, while C<\p{L}> matches any character with a
General_Category of "L" (letter) property. Braces are not
required for single letter property names, so C<\p{L}> is equivalent to C<\pL>.

More formally, C<\p{Uppercase}> matches any single character whose Unicode
Uppercase property value is True, and C<\P{Uppercase}> matches any character
whose Uppercase property value is False, and they could have been written as
C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively.

This formality is needed when properties are not binary; that is, if they can
take on more values than just True and False. For example, the Bidi_Class
property (see L</"Bidirectional Character Types"> below) can take on several
different values, such as Left, Right, Whitespace, and others. To match these,
one needs to specify both the property name (Bidi_Class), AND the value being
matched against (Left, Right, etc.). This is done, as in the examples above,
by having the two components separated by an equal sign (or interchangeably,
a colon), like C<\p{Bidi_Class: Left}>.

All Unicode-defined character properties may be written in these compound forms
of C<\p{property=value}> or C<\p{property:value}>, but Perl provides some
additional properties that are written only in the single form, as well as
single-form short-cuts for all binary properties and certain others described
below, in which you may omit the property name and the equals or colon
separator.

Most Unicode character properties have at least two synonyms (or aliases if you
prefer): a short one that is easier to type and a longer one that is more
descriptive and hence easier to understand. Thus the "L" and "Letter" properties
above are equivalent and can be used interchangeably. Likewise,
"Upper" is a synonym for "Uppercase", and we could have written
C<\p{Uppercase}> equivalently as C<\p{Upper}>. Also, there are typically
various synonyms for the values the property can be. For binary properties,
"True" has 3 synonyms: "T", "Yes", and "Y"; and "False" correspondingly has "F",
"No", and "N". But be careful. A short form of a value for one property may
not mean the same thing as the same short form for another. Thus, for the
General_Category property, "L" means "Letter", but for the Bidi_Class property,
"L" means "Left". A complete list of properties and synonyms is in
L<perluniprops>.

Upper/lower case differences in property names and values are irrelevant;
thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>.
Similarly, you can add or subtract underscores anywhere in the middle of a
word, so that these are also equivalent to C<\p{U_p_p_e_r}>. And white space
is irrelevant adjacent to non-word characters, such as the braces and the equals
or colon separators, so C<\p{ Upper }> and C<\p{ Upper_case : Y }> are
equivalent to these as well. In fact, white space and even
hyphens can usually be added or deleted anywhere. So even C<\p{ Up-per case = Yes}> is
equivalent. All this is called "loose-matching" by Unicode. The few places
where stricter matching is used are in the middle of numbers, and in the Perl
extension properties that begin or end with an underscore. Stricter matching
cares about white space (except adjacent to non-word characters),
hyphens, and non-interior underscores.

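To make the synonym and loose-matching rules concrete, here is a small
sketch that checks several spellings of the Uppercase property against one
uppercase and one lowercase letter. Only a subset of the spellings discussed
above is exercised, and the sample characters and variable names are chosen
for illustration.

```perl
use strict;
use warnings;

# Different spellings of the same binary property, per the rules above.
my @spellings = (
    qr/\p{Uppercase}/,
    qr/\p{Upper}/,
    qr/\p{upper}/,
    qr/\p{Uppercase=True}/,
    qr/\p{Uppercase:Yes}/,
    qr/\p{Uppercase: Y}/,
);

my $sample_upper = "\N{U+0416}";    # CYRILLIC CAPITAL LETTER ZHE
my $sample_lower = "\N{U+0436}";    # CYRILLIC SMALL LETTER ZHE

# grep in scalar context counts how many spellings match each sample.
my $upper_hits = grep { $sample_upper =~ $_ } @spellings;
my $lower_hits = grep { $sample_lower =~ $_ } @spellings;

# Single-letter names need no braces; "L" and "Letter" are synonyms.
my $letter_forms_agree =
    ("x" =~ /\pL/ && "x" =~ /\p{L}/ && "x" =~ /\p{Letter}/) ? 1 : 0;

printf "%d of %d spellings match the capital letter, %d match the small one\n",
    $upper_hits, scalar @spellings, $lower_hits;
```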
You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
(^) between the first brace and the property name: C<\p{^Tamil}> is
equal to C<\P{Tamil}>.

Almost all properties are immune to case-insensitive matching. That is,
adding a C</i> regular expression modifier does not change what they
match. There are two sets that are affected.
The first set is
C<Uppercase_Letter>,
C<Lowercase_Letter>,
and C<Titlecase_Letter>,
all of which match C<Cased_Letter> under C</i> matching.
And the second set is
C<Uppercase>,
C<Lowercase>,
and C<Titlecase>,
all of which match C<Cased> under C</i> matching.
This set also includes its subsets C<PosixUpper> and C<PosixLower> both
of which under C</i> matching match C<PosixAlpha>.
(The difference between these sets is that some things, such as Roman
numerals, come in both upper and lower case so they are C<Cased>, but aren't considered
letters, so they aren't C<Cased_Letter>s.)

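The caret negation and the effect of C</i> on the two affected sets might be
checked along these lines. This is a sketch under the assumption of Perl 5.14
or later; the hash keys and sample characters are illustrative, with a small
Roman numeral standing in for "cased but not a letter".

```perl
use strict;
use warnings;

my $small_a     = "a";
my $small_roman = "\N{U+2170}";    # SMALL ROMAN NUMERAL ONE: cased, gc=Nl

my %match = (
    # Without /i, a lowercase letter is not an Uppercase_Letter ...
    lu_plain    => ($small_a     =~ /\p{Lu}/         ? 1 : 0),
    # ... but under /i, \p{Lu} matches any Cased_Letter.
    lu_nocase   => ($small_a     =~ /\p{Lu}/i        ? 1 : 0),
    # The numeral is not a letter, so \p{Lu} fails even under /i,
    roman_lu    => ($small_roman =~ /\p{Lu}/i        ? 1 : 0),
    # while \p{Uppercase} under /i matches anything Cased.
    roman_upper => ($small_roman =~ /\p{Uppercase}/i ? 1 : 0),
    # \p{^...} means the same as \P{...}.
    caret       => ($small_a     =~ /\p{^Lu}/        ? 1 : 0),
    big_p       => ($small_a     =~ /\P{Lu}/         ? 1 : 0),
);

printf "%-12s %d\n", $_, $match{$_} for sort keys %match;
```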
The result is undefined if you try to match a non-Unicode code point
(that is, one above 0x10FFFF) against a Unicode property. Currently, a
warning is raised, and the match will fail. In some cases, this is
counterintuitive, as both these fail:

 chr(0x110000) =~ /\p{ASCII_Hex_Digit=True}/      # Fails.
 chr(0x110000) =~ /\p{ASCII_Hex_Digit=False}/     # Fails!

=head3 B<General_Category>

Every Unicode character is assigned a general category, which is the "most
usual categorization of a character" (from
L<http://www.unicode.org/reports/tr44>).

The compound way of writing these is like C<\p{General_Category=Number}>
(short, C<\p{gc:n}>). But Perl furnishes shortcuts in which everything up
through the equal or colon separator is omitted. So you can instead just write
C<\pN>.

Here are the short and long forms of the General Category properties:

    Short       Long

    L           Letter
    LC, L&      Cased_Letter (that is: [\p{Ll}\p{Lu}\p{Lt}])
    Lu          Uppercase_Letter
    Ll          Lowercase_Letter
    Lt          Titlecase_Letter
    Lm          Modifier_Letter
    Lo          Other_Letter

    M           Mark
    Mn          Nonspacing_Mark
    Mc          Spacing_Mark
    Me          Enclosing_Mark

    N           Number
    Nd          Decimal_Number (also Digit)
    Nl          Letter_Number
    No          Other_Number

    P           Punctuation (also Punct)
    Pc          Connector_Punctuation
    Pd          Dash_Punctuation
    Ps          Open_Punctuation
    Pe          Close_Punctuation
    Pi          Initial_Punctuation
                (may behave like Ps or Pe depending on usage)
    Pf          Final_Punctuation
                (may behave like Ps or Pe depending on usage)
    Po          Other_Punctuation

    S           Symbol
    Sm          Math_Symbol
    Sc          Currency_Symbol
    Sk          Modifier_Symbol
    So          Other_Symbol

    Z           Separator
    Zs          Space_Separator
    Zl          Line_Separator
    Zp          Paragraph_Separator

    C           Other
    Cc          Control (also Cntrl)
    Cf          Format
    Cs          Surrogate
    Co          Private_Use
    Cn          Unassigned

Single-letter properties match all characters in any of the
two-letter sub-properties starting with the same letter.
C<LC> and C<L&> are special: both are aliases for the set consisting of everything matched by C<Ll>, C<Lu>, and C<Lt>.

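The relationship between the single-letter categories and their two-letter
sub-properties can be illustrated with a brief sketch. The sample characters
and variable names are picked for the example only.

```perl
use strict;
use warnings;

my @numbers = (
    "5",             # DIGIT FIVE, Nd
    "\N{U+2163}",    # ROMAN NUMERAL FOUR, Nl
    "\N{U+00BD}",    # VULGAR FRACTION ONE HALF, No
);

# The compound form, its short form, and the Perl shortcut all agree:
# every sample is some kind of Number.
my $all_forms_agree = !grep {
    !( /\pN/ && /\p{gc:n}/ && /\p{General_Category=Number}/ )
} @numbers;

# Only one of the samples is in the Nd sub-category.
my @decimal = grep { /\p{Nd}/ } @numbers;

# L& (or LC) covers Lu, Ll and Lt, but not Lo.
my $alef          = "\N{U+05D0}";    # HEBREW LETTER ALEF, Lo
my $alef_is_L     = $alef =~ /\pL/     ? 1 : 0;
my $alef_is_cased = $alef =~ /\p{L&}/  ? 1 : 0;
my $a_is_cased    = "a"   =~ /\p{LC}/  ? 1 : 0;

printf "numbers in Nd: %d of %d\n", scalar @decimal, scalar @numbers;
```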
=head3 B<Bidirectional Character Types>

Because scripts differ in their directionality (Hebrew and Arabic are
written right to left, for example) Unicode supplies these properties in
the Bidi_Class class:

    Property    Meaning

    L           Left-to-Right
    LRE         Left-to-Right Embedding
    LRO         Left-to-Right Override
    R           Right-to-Left
    AL          Arabic Letter
    RLE         Right-to-Left Embedding
    RLO         Right-to-Left Override
    PDF         Pop Directional Format
    EN          European Number
    ES          European Separator
    ET          European Terminator
    AN          Arabic Number
    CS          Common Separator
    NSM         Non-Spacing Mark
    BN          Boundary Neutral
    B           Paragraph Separator
    S           Segment Separator
    WS          Whitespace
    ON          Other Neutrals

This property is always written in the compound form.
For example, C<\p{Bidi_Class:R}> matches characters that are normally
written right to left.

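As a hedged illustration of the compound form, and of why the same short
value can mean different things for different properties, consider the
sketch below. The sample characters and hash keys are chosen for the
example.

```perl
use strict;
use warnings;

my $latin_a = "A";
my $alef    = "\N{U+05D0}";    # HEBREW LETTER ALEF
my $arabic  = "\N{U+0627}";    # ARABIC LETTER ALEF
my $digit   = "7";

my %bidi = (
    latin_L   => ($latin_a =~ /\p{Bidi_Class:L}/  ? 1 : 0),
    alef_R    => ($alef    =~ /\p{Bidi_Class:R}/  ? 1 : 0),
    alef_bc_R => ($alef    =~ /\p{bc=R}/          ? 1 : 0),   # short name
    arabic_AL => ($arabic  =~ /\p{Bidi_Class:AL}/ ? 1 : 0),
    digit_EN  => ($digit   =~ /\p{Bidi_Class:EN}/ ? 1 : 0),
    # "L" is Letter for General_Category but Left-to-Right for Bidi_Class:
    alef_gc_L => ($alef    =~ /\p{gc=L}/          ? 1 : 0),
    alef_bc_L => ($alef    =~ /\p{bc=L}/          ? 1 : 0),
);

printf "%-10s %d\n", $_, $bidi{$_} for sort keys %bidi;
```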
=head3 B<Scripts>

The world's languages are written in many different scripts. This sentence
(unless you're reading it in translation) is written in Latin, while Russian is
written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in
Hiragana or Katakana. There are many more.

The Unicode Script and Script_Extensions properties give what script a
given character is in. Either property can be specified with the
compound form like
C<\p{Script=Hebrew}> (short: C<\p{sc=hebr}>), or
C<\p{Script_Extensions=Javanese}> (short: C<\p{scx=java}>).
In addition, Perl furnishes shortcuts for all
C<Script> property names. You can omit everything up through the equals
(or colon), and simply write C<\p{Latin}> or C<\P{Cyrillic}>.
(This is not true for C<Script_Extensions>, which is required to be
written in the compound form.)

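The forms just described could be exercised as in the following sketch. It
assumes a Perl recent enough to know C<Script_Extensions>; the sample
characters and hash keys are illustrative.

```perl
use strict;
use warnings;

my $latin_a = "A";
my $alef    = "\N{U+05D0}";    # HEBREW LETTER ALEF
my $zhe     = "\N{U+0416}";    # CYRILLIC CAPITAL LETTER ZHE

my %script = (
    # Compound form, long and short spellings.
    alef_long     => ($alef    =~ /\p{Script=Hebrew}/            ? 1 : 0),
    alef_short    => ($alef    =~ /\p{sc=hebr}/                  ? 1 : 0),
    alef_scx      => ($alef    =~ /\p{Script_Extensions=Hebrew}/ ? 1 : 0),
    # Perl's single-form shortcuts.
    a_latin       => ($latin_a =~ /\p{Latin}/                    ? 1 : 0),
    a_not_cyrl    => ($latin_a =~ /\P{Cyrillic}/                 ? 1 : 0),
    zhe_not_latin => ($zhe     =~ /\P{Latin}/                    ? 1 : 0),
);

printf "%-14s %d\n", $_, $script{$_} for sort keys %script;
```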
The difference between these two properties involves characters that are
used in multiple scripts. For example the digits '0' through '9' are
used in many parts of the world. These are placed in a script named
C<Common>. Other characters are used in just a few scripts. For
example, the "KATAKANA-HIRAGANA DOUBLE HYPHEN" is used in both Japanese
scripts, Katakana and Hiragana, but nowhere else. The C<Script>
property places all characters that are used in multiple scripts in the
C<Common> script, while the C<Script_Extensions> property places those
that are used in only a few scripts into each of those scripts, while
still using C<Common> for those used in many scripts. Thus both these
match:

 "0" =~ /\p{sc=Common}/     # Matches
 "0" =~ /\p{scx=Common}/    # Matches

and only the first of these matches:

 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Common}/  # Matches
 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Common}/ # No match

And only the last two of these match:

 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Hiragana}/  # No match
 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Katakana}/  # No match
 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Hiragana}/ # Matches
 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Katakana}/ # Matches

C<Script_Extensions> is thus an improved C<Script>, in which there are
fewer characters in the C<Common> script, and correspondingly more in
other scripts. It is new in Unicode version 6.0, and its data are likely
to change significantly in later releases, as things get sorted out.

(Actually, besides C<Common>, the C<Inherited> script contains
characters that are used in multiple scripts. These are modifier
characters which modify other characters, and inherit the script value
of the controlling character. Some of these are used in many scripts,
and so go into C<Inherited> in both C<Script> and C<Script_Extensions>.
Others are used in just a few scripts, so are in C<Inherited> in
C<Script>, but not in C<Script_Extensions>.)

It is worth stressing that there are several different sets of digits in
Unicode that are equivalent to 0-9 and are matchable by C<\d> in a
regular expression. If they are used in a single language only, they
are in that language's C<Script> and C<Script_Extensions>. If they are
used in more than one script, they will be in C<sc=Common>, but only
if they are used in many scripts should they be in C<scx=Common>.

A complete list of scripts and their shortcuts is in L<perluniprops>.

=head3 B<Use of "Is" Prefix>

For backward compatibility (with Perl 5.6), all properties mentioned
so far may have C<Is> or C<Is_> prepended to their name, so C<\P{Is_Lu}>, for
example, is equal to C<\P{Lu}>, and C<\p{IsScript:Arabic}> is equal to
C<\p{Arabic}>.

=head3 B<Blocks>

In addition to B<scripts>, Unicode also defines B<blocks> of
characters. The difference between scripts and blocks is that the
concept of scripts is closer to natural languages, while the concept
of blocks is more of an artificial grouping based on groups of Unicode
characters with consecutive ordinal values. For example, the "Basic Latin"
block is all characters whose ordinals are between 0 and 127, inclusive; in
other words, the ASCII characters. The "Latin" script contains some letters
from this as well as several other blocks, like "Latin-1 Supplement",
"Latin Extended-A", etc., but it does not contain all the characters from
those blocks. It does not, for example, contain the digits 0-9, because
those digits are shared across many scripts, and hence are in the
C<Common> script.

For more about scripts versus blocks, see UAX#24 "Unicode Script Property":
L<http://www.unicode.org/reports/tr24>

The C<Script> or C<Script_Extensions> properties are likely to be the
ones you want to use when processing
natural language; the Block property may occasionally be useful in working
with the nuts and bolts of Unicode.

Block names are matched in the compound form, like C<\p{Block: Arrows}> or
C<\p{Blk=Hebrew}>. Unlike most other properties, only a few block names have a
Unicode-defined short name. But Perl does provide a (slight) shortcut: You
can say, for example C<\p{In_Arrows}> or C<\p{In_Hebrew}>. For backwards
compatibility, the C<In> prefix may be omitted if there is no naming conflict
with a script or any other property, and you can even use an C<Is> prefix
instead in those cases. But it is not a good idea to do this, for a couple
of reasons:

=over 4

=item 1

It is confusing. There are many naming conflicts, and you may forget some.
For example, C<\p{Hebrew}> means the I<script> Hebrew, and NOT the I<block>
Hebrew. But would you remember that 6 months from now?

=item 2

It is unstable. A new version of Unicode may pre-empt the current meaning by
creating a property with the same name. There was a time in very early Unicode
releases when C<\p{Hebrew}> would have matched the I<block> Hebrew; now it
doesn't.

=back

Some people prefer to always use C<\p{Block: foo}> and C<\p{Script: bar}>
instead of the shortcuts, whether for clarity, because they can't remember the
difference between 'In' and 'Is' anyway, or they aren't confident that those who
eventually will read their code will know that difference.

A complete list of blocks and their shortcuts is in L<perluniprops>.

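The contrast between blocks and scripts can be made concrete with a small
sketch. It uses the digit example from the text above plus one character
from the Arrows block; the hash keys are invented for the illustration.

```perl
use strict;
use warnings;

my $arrow = "\N{U+2190}";    # LEFTWARDS ARROW
my $digit = "5";

my %block = (
    # Compound forms and the In_ shortcut name the same block.
    arrow_block    => ($arrow =~ /\p{Block: Arrows}/      ? 1 : 0),
    arrow_blk      => ($arrow =~ /\p{Blk=Arrows}/         ? 1 : 0),
    arrow_in       => ($arrow =~ /\p{In_Arrows}/          ? 1 : 0),
    # A digit sits in the Basic Latin block by code point ...
    digit_in_block => ($digit =~ /\p{Block: Basic_Latin}/ ? 1 : 0),
    # ... but belongs to the Common script, not to Latin.
    digit_latin    => ($digit =~ /\p{Script=Latin}/       ? 1 : 0),
    digit_common   => ($digit =~ /\p{Script=Common}/      ? 1 : 0),
);

printf "%-15s %d\n", $_, $block{$_} for sort keys %block;
```

Spelling out C<Block> and C<Script> in full, as here, sidesteps the naming
conflicts described in the list above.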
9f815e24
KW
610=head3 B<Other Properties>
611
612There are many more properties than the very basic ones described here.
613A complete list is in L<perluniprops>.
614
615Unicode defines all its properties in the compound form, so all single-form
b19eb496
TC
616properties are Perl extensions. Most of these are just synonyms for the
617Unicode ones, but some are genuine extensions, including several that are in
9f815e24
KW
618the compound form. And quite a few of these are actually recommended by Unicode
619(in L<http://www.unicode.org/reports/tr18>).
620
5bff2035
KW
621This section gives some details on all extensions that aren't just
622synonyms for compound-form Unicode properties
623(for those properties, you'll have to refer to the
9f815e24
KW
624L<Unicode Standard|http://www.unicode.org/reports/tr44>.
625
626=over
627
628=item B<C<\p{All}>>
629
630This matches any of the 1_114_112 Unicode code points. It is a synonym for
631C<\p{Any}>.
632
633=item B<C<\p{Alnum}>>
634
635This matches any C<\p{Alphabetic}> or C<\p{Decimal_Number}> character.
636
637=item B<C<\p{Any}>>
638
639This matches any of the 1_114_112 Unicode code points. It is a synonym for
640C<\p{All}>.
641
42581d5d
KW
642=item B<C<\p{ASCII}>>
643
644This matches any of the 128 characters in the US-ASCII character set,
645which is a subset of Unicode.
646
9f815e24
KW
647=item B<C<\p{Assigned}>>
648
649This matches any assigned code point; that is, any code point whose general
650category is not Unassigned (or equivalently, not Cn).
651
652=item B<C<\p{Blank}>>
653
654This is the same as C<\h> and C<\p{HorizSpace}>: A character that changes the
655spacing horizontally.
656
657=item B<C<\p{Decomposition_Type: Non_Canonical}>> (Short: C<\p{Dt=NonCanon}>)
658
659Matches a character that has a non-canonical decomposition.
660
To understand the use of this rarely used property=value combination, it is
necessary to know some basics about decomposition.
Consider a character, say H. It could appear with various marks around it,
such as an acute accent, or a circumflex, or various hooks, circles, arrows,
I<etc.>, above, below, to one side or the other. There are many
possibilities among the world's languages. The number of combinations is
astronomical, and if there were a character for each combination, it would
soon exhaust Unicode's more than a million possible characters. So Unicode
took a different approach: there is a character for the base H, and a
character for each of the possible marks, and these can be variously combined
to get a final logical character. So a logical character--what appears to be a
single character--can be a sequence of more than one individual character.
Such a sequence is called an "extended grapheme cluster"; Perl furnishes the
C<\X> regular expression construct to match such sequences.
9f815e24
KW
675
But Unicode's intent is to unify the existing character set standards and
practices, and several pre-existing standards have single characters that
mean the same thing as some of these combinations. ISO-8859-1, for instance,
has quite a few of these in the Latin-1 range, one being "LATIN
CAPITAL LETTER E WITH ACUTE". Because this character was in this pre-existing
standard, Unicode added it to its repertoire. But this character is considered
by Unicode to be equivalent to the sequence consisting of the character
"LATIN CAPITAL LETTER E" followed by the character "COMBINING ACUTE ACCENT".
9f815e24
KW
684
685"LATIN CAPITAL LETTER E WITH ACUTE" is called a "pre-composed" character, and
b19eb496 686its equivalence with the sequence is called canonical equivalence. All
9f815e24 687pre-composed characters are said to have a decomposition (into the equivalent
b19eb496 688sequence), and the decomposition type is also called canonical.
9f815e24
KW
689
690However, many more characters have a different type of decomposition, a
691"compatible" or "non-canonical" decomposition. The sequences that form these
692decompositions are not considered canonically equivalent to the pre-composed
693character. An example, again in the Latin-1 range, is the "SUPERSCRIPT ONE".
b19eb496 694It is somewhat like a regular digit 1, but not exactly; its decomposition
9f815e24
KW
695into the digit 1 is called a "compatible" decomposition, specifically a
696"super" decomposition. There are several such compatibility
697decompositions (see L<http://www.unicode.org/reports/tr44>), including one
b19eb496 698called "compat", which means some miscellaneous type of decomposition
42581d5d 699that doesn't fit into the decomposition categories that Unicode has chosen.
9f815e24
KW
700
701Note that most Unicode characters don't have a decomposition, so their
702decomposition type is "None".
703
b19eb496
TC
704For your convenience, Perl has added the C<Non_Canonical> decomposition
705type to mean any of the several compatibility decompositions.
9f815e24
KW
706
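As a rough illustration of the distinction drawn above, the following
sketch tests three characters against C<\p{Dt=NonCanon}> and uses C<\X>
on a decomposed sequence. The variable names are ours, chosen only for
this example.

```perl
use strict;
use warnings;

# Sample characters, written as escapes so the source stays ASCII.
my $e_acute = "\x{C9}";   # LATIN CAPITAL LETTER E WITH ACUTE (canonical)
my $super_1 = "\x{B9}";   # SUPERSCRIPT ONE ("super" decomposition)
my $plain_e = "E";        # no decomposition at all

# 1 if the character has a compatibility (non-canonical) decomposition.
my $super_is_non_canon = $super_1 =~ /\p{Dt=NonCanon}/ ? 1 : 0;
my $acute_is_non_canon = $e_acute =~ /\p{Dt=NonCanon}/ ? 1 : 0;
my $plain_is_non_canon = $plain_e =~ /\p{Dt=NonCanon}/ ? 1 : 0;

# \X treats a base character plus a combining mark as one logical character.
my $decomposed = "E\x{301}";               # E + COMBINING ACUTE ACCENT
my @clusters   = $decomposed =~ /(\X)/g;   # one cluster holding two code points
```

Only SUPERSCRIPT ONE should match here: the accented E has a canonical
decomposition, and the plain E has none.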
707=item B<C<\p{Graph}>>
708
709Matches any character that is graphic. Theoretically, this means a character
710that on a printer would cause ink to be used.
711
712=item B<C<\p{HorizSpace}>>
713
b19eb496 714This is the same as C<\h> and C<\p{Blank}>: a character that changes the
9f815e24
KW
715spacing horizontally.
716
42581d5d 717=item B<C<\p{In=*}>>
9f815e24
KW
718
This is a synonym for C<\p{Present_In=*}>.
720
721=item B<C<\p{PerlSpace}>>
722
723This is the same as C<\s>, restricted to ASCII, namely C<S<[ \f\n\r\t]>>.
724
725Mnemonic: Perl's (original) space
726
727=item B<C<\p{PerlWord}>>
728
This is the same as C<\w>, restricted to ASCII, namely C<[A-Za-z0-9_]>.
730
731Mnemonic: Perl's (original) word.
732
42581d5d 733=item B<C<\p{Posix...}>>
9f815e24 734
b19eb496
TC
735There are several of these, which are equivalents using the C<\p>
736notation for Posix classes and are described in
42581d5d 737L<perlrecharclass/POSIX Character Classes>.
9f815e24
KW
738
739=item B<C<\p{Present_In: *}>> (Short: C<\p{In=*}>)
740
741This property is used when you need to know in what Unicode version(s) a
742character is.
743
744The "*" above stands for some two digit Unicode version number, such as
745C<1.1> or C<4.0>; or the "*" can also be C<Unassigned>. This property will
746match the code points whose final disposition has been settled as of the
747Unicode release given by the version number; C<\p{Present_In: Unassigned}>
748will match those code points whose meaning has yet to be assigned.
749
For example, C<U+0041> "LATIN CAPITAL LETTER A" was present in the very first
Unicode release available, which is C<1.1>, so this property is true for all
valid "*" versions. On the other hand, C<U+1EFF> was not assigned until version
5.1 when it became "LATIN SMALL LETTER Y WITH LOOP", so the only "*" values that
match it are 5.1, 5.2, and later.
755
756Unicode furnishes the C<Age> property from which this is derived. The problem
757with Age is that a strict interpretation of it (which Perl takes) has it
758matching the precise release a code point's meaning is introduced in. Thus
759C<U+0041> would match only 1.1; and C<U+1EFF> only 5.1. This is not usually what
760you want.
761
762Some non-Perl implementations of the Age property may change its meaning to be
763the same as the Perl Present_In property; just be aware of that.
764
Another confusion with both these properties is that the definition is not
that the code point has been I<assigned>, but that the meaning of the code point
has been I<determined>. This is because 66 code points will always be
unassigned, and so the Age for them is the Unicode version in which the decision
to make them so was made. For example, C<U+FDD0> is to be permanently
unassigned to a character, and the decision to do that was made in version 3.1,
so C<\p{Age=3.1}> matches this character, as do C<\p{Present_In: 3.1}> and
every later C<Present_In> version.
9f815e24
KW
772
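The difference between C<Present_In> and C<Age> can be seen in a short
sketch using the two code points discussed above. The variable names are
illustrative only.

```perl
use strict;
use warnings;

my $a_cap  = "\x{41}";     # LATIN CAPITAL LETTER A, present since 1.1
my $y_loop = "\x{1EFF}";   # LATIN SMALL LETTER Y WITH LOOP, added in 5.1

# Present_In is cumulative: true for the introducing release and later ones.
my $a_in_51 = $a_cap  =~ /\p{Present_In: 5.1}/ ? 1 : 0;
my $y_in_51 = $y_loop =~ /\p{In=5.1}/          ? 1 : 0;
my $y_in_50 = $y_loop =~ /\p{In=5.0}/          ? 1 : 0;

# Age names only the release that introduced the code point.
my $a_age_51 = $a_cap  =~ /\p{Age=5.1}/ ? 1 : 0;
my $y_age_51 = $y_loop =~ /\p{Age=5.1}/ ? 1 : 0;
```

Following the rules described above, C<$a_in_51>, C<$y_in_51> and
C<$y_age_51> should be true, while C<$y_in_50> and C<$a_age_51> should be
false.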
773=item B<C<\p{Print}>>
774
ae5b72c8 775This matches any character that is graphical or blank, except controls.
9f815e24
KW
776
777=item B<C<\p{SpacePerl}>>
778
779This is the same as C<\s>, including beyond ASCII.
780
4d4acfba 781Mnemonic: Space, as modified by Perl. (It doesn't include the vertical tab
b19eb496 782which both the Posix standard and Unicode consider white space.)
9f815e24 783
4364919a
KW
784=item B<C<\p{Title}>> and B<C<\p{Titlecase}>>
785
Under case-sensitive matching, these both match the same code points as
C<\p{General_Category=Titlecase_Letter}> (C<\p{gc=lt}>). The difference
is that under C</i> caseless matching, these match the same as
C<\p{Cased}>, whereas C<\p{gc=lt}> matches C<\p{Cased_Letter}>.
790
9f815e24
KW
791=item B<C<\p{VertSpace}>>
792
793This is the same as C<\v>: A character that changes the spacing vertically.
794
795=item B<C<\p{Word}>>
796
b19eb496 797This is the same as C<\w>, including over 100_000 characters beyond ASCII.
9f815e24 798
42581d5d
KW
799=item B<C<\p{XPosix...}>>
800
b19eb496 801There are several of these, which are the standard Posix classes
42581d5d
KW
802extended to the full Unicode range. They are described in
803L<perlrecharclass/POSIX Character Classes>.
804
9f815e24
KW
805=back
806
376d9008 807=head2 User-Defined Character Properties
491fd90a 808
51f494cc
KW
809You can define your own binary character properties by defining subroutines
810whose names begin with "In" or "Is". The subroutines can be defined in any
811package. The user-defined properties can be used in the regular expression
812C<\p> and C<\P> constructs; if you are using a user-defined property from a
813package other than the one you are in, you must specify its package in the
814C<\p> or C<\P> construct.
bac0b425 815
 # assuming property IsForeign defined in Lang::
 package main;  # property package name required
 if ($txt =~ /\p{Lang::IsForeign}+/) { ... }

 package Lang;  # property package name not required
 if ($txt =~ /\p{IsForeign}+/) { ... }
822
823
824Note that the effect is compile-time and immutable once defined.
b19eb496
TC
825However, the subroutines are passed a single parameter, which is 0 if
826case-sensitive matching is in effect and non-zero if caseless matching
56ca34ca
KW
827is in effect. The subroutine may return different values depending on
828the value of the flag, and one set of values will immutably be in effect
b19eb496 829for all case-sensitive matches, and the other set for all case-insensitive
56ca34ca 830matches.
491fd90a 831
Note that if the regular expression is tainted, then Perl will die rather
than calling the subroutine when the name of the subroutine is
determined by the tainted data.
835
376d9008
JB
836The subroutines must return a specially-formatted string, with one
837or more newline-separated lines. Each line must be one of the following:
491fd90a
JH
838
839=over 4
840
841=item *
842
510254c9
A
843A single hexadecimal number denoting a Unicode code point to include.
844
845=item *
846
Two hexadecimal numbers separated by horizontal whitespace (space or
tab characters) denoting a range of Unicode code points to include.
491fd90a
JH
849
850=item *
851
376d9008 852Something to include, prefixed by "+": a built-in character
830137a2
KW
853property (prefixed by "utf8::") or a fully qualified (including package
854name) user-defined character property,
bac0b425
JP
855to represent all the characters in that property; two hexadecimal code
856points for a range; or a single hexadecimal code point.
491fd90a
JH
857
858=item *
859
376d9008 860Something to exclude, prefixed by "-": an existing character
830137a2
KW
861property (prefixed by "utf8::") or a fully qualified (including package
862name) user-defined character property,
bac0b425
JP
863to represent all the characters in that property; two hexadecimal code
864points for a range; or a single hexadecimal code point.
491fd90a
JH
865
866=item *
867
Something to negate, prefixed by "!": an existing character
property (prefixed by "utf8::") or a fully qualified (including package
name) user-defined character property,
to represent all the characters except those in that property; two
hexadecimal code points for a range; or a single hexadecimal code point.
873
874=item *
875
Something to intersect with, prefixed by "&": an existing character
property (prefixed by "utf8::") or a fully qualified (including package
name) user-defined character property,
to keep only the characters that are also in that property; two
hexadecimal code points for a range; or a single hexadecimal code point.
491fd90a
JH
881
882=back
883
884For example, to define a property that covers both the Japanese
885syllabaries (hiragana and katakana), you can define
886
887 sub InKana {
d88362ca 888 return <<END;
d5822f25
A
889 3040\t309F
890 30A0\t30FF
491fd90a
JH
891 END
892 }
893
d5822f25
A
894Imagine that the here-doc end marker is at the beginning of the line.
895Now you can use C<\p{InKana}> and C<\P{InKana}>.
491fd90a
JH
896
897You could also have used the existing block property names:
898
899 sub InKana {
d88362ca 900 return <<'END';
491fd90a
JH
901 +utf8::InHiragana
902 +utf8::InKatakana
903 END
904 }
905
906Suppose you wanted to match only the allocated characters,
d5822f25 907not the raw block ranges: in other words, you want to remove
491fd90a
JH
908the non-characters:
909
910 sub InKana {
d88362ca 911 return <<'END';
491fd90a
JH
912 +utf8::InHiragana
913 +utf8::InKatakana
914 -utf8::IsCn
915 END
916 }
917
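A usage sketch for the definition just shown follows. The sample string
and the variable names are made up for illustration, and C<U+3040> is
used as an example of a code point that lies inside the Hiragana block
but is unassigned (at the time of writing).

```perl
use strict;
use warnings;

# Kana blocks minus unassigned code points, as in the definition above.
sub InKana {
    return <<'END';
+utf8::InHiragana
+utf8::InKatakana
-utf8::IsCn
END
}

my $hiragana_a = "\x{3042}";   # HIRAGANA LETTER A
my $katakana_a = "\x{30A2}";   # KATAKANA LETTER A
my $reserved   = "\x{3040}";   # inside the Hiragana block, but unassigned

my $text = "abc" . $hiragana_a . $katakana_a . $reserved . "xyz";

# Collect the runs of kana; the unassigned code point is left out.
my @kana_runs = $text =~ /(\p{InKana}+)/g;

# Count everything else: a, b, c, U+3040, x, y, z.
my $non_kana = () = $text =~ /\P{InKana}/g;
```

Because the property is looked up in the current package, no package
qualifier is needed in the C<\p{InKana}> construct here.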
918The negation is useful for defining (surprise!) negated classes.
919
920 sub InNotKana {
d88362ca 921 return <<'END';
491fd90a
JH
922 !utf8::InHiragana
923 -utf8::InKatakana
924 +utf8::IsCn
925 END
926 }
927
461020ad
KW
928This will match all non-Unicode code points, since every one of them is
929not in Kana. You can use intersection to exclude these, if desired, as
930this modified example shows:
bac0b425 931
461020ad 932 sub InNotKana {
bac0b425 933 return <<'END';
461020ad
KW
934 !utf8::InHiragana
935 -utf8::InKatakana
936 +utf8::IsCn
937 &utf8::Any
bac0b425
JP
938 END
939 }
940
461020ad
KW
941C<&utf8::Any> must be the last line in the definition.
942
943Intersection is used generally for getting the common characters matched
944by two (or more) classes. It's important to remember not to use "&" for
945the first set; that would be intersecting with nothing, resulting in an
946empty set.
947
948(Note that official Unicode properties differ from these in that they
949automatically exclude non-Unicode code points and a warning is raised if
950a match is attempted on one of those.)
bac0b425 951
68585b5e 952=head2 User-Defined Case Mappings (for serious hackers only)
822502e5 953
5d1892be
KW
954B<This feature has been removed as of Perl 5.16.>
955The CPAN module L<Unicode::Casing> provides better functionality without
956the drawbacks that this feature had. If you are using a Perl earlier
957than 5.16, this feature was most fully documented in the 5.14 version of
958this pod:
959L<http://perldoc.perl.org/5.14.0/perlunicode.html#User-Defined-Case-Mappings-%28for-serious-hackers-only%29>
3a2263fe 960
376d9008 961=head2 Character Encodings for Input and Output
8cbd9a7a 962
7221edc9 963See L<Encode>.
8cbd9a7a 964
c29a771d 965=head2 Unicode Regular Expression Support Level
776f8809 966
b19eb496
TC
967The following list of Unicode supported features for regular expressions describes
968all features currently directly supported by core Perl. The references to "Level N"
8158862b 969and the section numbers refer to the Unicode Technical Standard #18,
b19eb496 970"Unicode Regular Expressions", version 13, from August 2008.
776f8809
JH
971
972=over 4
973
974=item *
975
976Level 1 - Basic Unicode Support
977
755789c0
KW
 RL1.1   Hex Notation                     - done          [1]
 RL1.2   Properties                       - done          [2][3]
 RL1.2a  Compatibility Properties         - done          [4]
 RL1.3   Subtraction and Intersection     - MISSING       [5]
 RL1.4   Simple Word Boundaries           - done          [6]
 RL1.5   Simple Loose Matches             - done          [7]
 RL1.6   Line Boundaries                  - MISSING       [8][9]
 RL1.7   Supplementary Code Points        - done          [10]

 [1]  \x{...}
 [2]  \p{...} \P{...}
 [3]  supports not only minimal list, but all Unicode character
      properties (see Unicode Character Properties above)
 [4]  \d \D \s \S \w \W \X [:prop:] [:^prop:]
 [5]  can use regular expression look-ahead [a] or
      user-defined character properties [b] to emulate set
      operations
 [6]  \b \B
 [7]  note that Perl does Full case-folding in matching (but with
      bugs), not Simple: for example U+1F88 is equivalent to
      U+1F00 U+03B9, instead of just U+1F80.  This difference
      matters mainly for certain Greek capital letters with certain
      modifiers: the Full case-folding decomposes the letter,
      while the Simple case-folding would map it to a single
      character.
 [8]  should do ^ and $ also on U+000B (\v in C), FF (\f), CR
      (\r), CRLF (\r\n), NEL (U+0085), LS (U+2028), and PS
      (U+2029); should also affect <>, $., and script line
      numbers; should not split lines within CRLF [c] (i.e. there
      is no empty line between \r and \n)
 [9]  Linebreaking conformant with UAX#14 "Unicode Line Breaking
      Algorithm" is available through the Unicode::LineBreak
      module.
 [10] UTF-8/UTF-EBCDIC used in Perl allows not only U+10000 to
      U+10FFFF but also beyond U+10FFFF
7207e29d 1013
237bad5b 1014[a] You can mimic class subtraction using lookahead.
8158862b 1015For example, what UTS#18 might write as
29bdacb8 1016
dbe420b4
JH
1017 [{Greek}-[{UNASSIGNED}]]
1018
1019in Perl can be written as:
1020
1d81abf3
JH
1021 (?!\p{Unassigned})\p{InGreekAndCoptic}
1022 (?=\p{Assigned})\p{InGreekAndCoptic}
dbe420b4
JH
1023
But in this particular example, you probably really want

 \p{Greek}

which will match assigned characters known to be part of the Greek script.
29bdacb8 1029
Also see the L<Unicode::Regex::Set> module; it implements the full
UTS#18 grouping, intersection, union, and removal (subtraction) syntax.
1032
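The look-ahead emulation of subtraction described in [a] might be
exercised as in the sketch below. It uses the compound form
C<\p{Block: Greek_And_Coptic}> for the block; the variable names are
illustrative, and C<U+0378> is assumed to still be unassigned in the
Unicode version your Perl supports.

```perl
use strict;
use warnings;

# Emulate [{Greek}-[{UNASSIGNED}]]: a code point in the Greek and Coptic
# block, provided the look-ahead confirms it is not unassigned.
my $assigned_greek = qr/(?!\p{Unassigned})\p{Block: Greek_And_Coptic}/;

my $alpha    = "\x{3B1}";   # GREEK SMALL LETTER ALPHA
my $reserved = "\x{378}";   # in the block, but unassigned (assumption)
my $latin_a  = "a";         # outside the block entirely

my @matched = grep { /\A$assigned_greek\z/ } ($alpha, $reserved, $latin_a);
```

Only the alpha should survive the C<grep>: the reserved code point is
rejected by the negative look-ahead, and the Latin letter by the block
test.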
1033[b] '+' for union, '-' for removal (set-difference), '&' for intersection
1034(see L</"User-Defined Character Properties">)
1035
1036[c] Try the C<:crlf> layer (see L<PerlIO>).
5ca1ac52 1037
776f8809
JH
1038=item *
1039
1040Level 2 - Extended Unicode Support
1041
755789c0
KW
1042 RL2.1 Canonical Equivalents - MISSING [10][11]
1043 RL2.2 Default Grapheme Clusters - MISSING [12]
1044 RL2.3 Default Word Boundaries - MISSING [14]
1045 RL2.4 Default Loose Matches - MISSING [15]
1046 RL2.5 Name Properties - DONE
1047 RL2.6 Wildcard Properties - MISSING
8158862b 1048
755789c0
KW
1049 [10] see UAX#15 "Unicode Normalization Forms"
1050 [11] have Unicode::Normalize but not integrated to regexes
1051 [12] have \X but we don't have a "Grapheme Cluster Mode"
1052 [14] see UAX#29, Word Boundaries
902b08d0 1053 [15] This is covered in Chapter 3.13 (in Unicode 6.0)
776f8809
JH
1054
1055=item *
1056
8158862b
TS
1057Level 3 - Tailored Support
1058
755789c0
KW
1059 RL3.1 Tailored Punctuation - MISSING
1060 RL3.2 Tailored Grapheme Clusters - MISSING [17][18]
1061 RL3.3 Tailored Word Boundaries - MISSING
1062 RL3.4 Tailored Loose Matches - MISSING
1063 RL3.5 Tailored Ranges - MISSING
1064 RL3.6 Context Matching - MISSING [19]
1065 RL3.7 Incremental Matches - MISSING
8158862b 1066 ( RL3.8 Unicode Set Sharing )
755789c0
KW
1067 RL3.9 Possible Match Sets - MISSING
1068 RL3.10 Folded Matching - MISSING [20]
1069 RL3.11 Submatchers - MISSING
1070
1071 [17] see UAX#10 "Unicode Collation Algorithms"
1072 [18] have Unicode::Collate but not integrated to regexes
1073 [19] have (?<=x) and (?=x), but look-aheads or look-behinds
1074 should see outside of the target substring
1075 [20] need insensitive matching for linguistic features other
1076 than case; for example, hiragana to katakana, wide and
1077 narrow, simplified Han to traditional Han (see UTR#30
1078 "Character Foldings")
776f8809
JH
1079
1080=back
1081
c349b1b9
JH
1082=head2 Unicode Encodings
1083
376d9008
JB
1084Unicode characters are assigned to I<code points>, which are abstract
1085numbers. To use these numbers, various encodings are needed.
c349b1b9
JH
1086
1087=over 4
1088
c29a771d 1089=item *
5cb3728c
RB
1090
1091UTF-8
c349b1b9 1092
6d4f9cf2
KW
UTF-8 is a variable-length (1 to 4 bytes), byte-order independent
encoding. For ASCII (and we really do mean 7-bit ASCII, not another
8-bit encoding), UTF-8 is transparent.
c349b1b9 1096
8c007b5a 1097The following table is from Unicode 3.2.
05632f9a 1098
   Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte

   U+0000..U+007F          00..7F
   U+0080..U+07FF        * C2..DF    80..BF
   U+0800..U+0FFF          E0      * A0..BF    80..BF
   U+1000..U+CFFF          E1..EC    80..BF    80..BF
   U+D000..U+D7FF          ED        80..9F    80..BF
   U+D800..U+DFFF          +++++ utf16 surrogates, not legal utf8 +++++
   U+E000..U+FFFF          EE..EF    80..BF    80..BF
  U+10000..U+3FFFF         F0      * 90..BF    80..BF    80..BF
  U+40000..U+FFFFF         F1..F3    80..BF    80..BF    80..BF
 U+100000..U+10FFFF        F4        80..8F    80..BF    80..BF
e1b711da 1111
b19eb496 1112Note the gaps marked by "*" before several of the byte entries above. These are
e1b711da
KW
1113caused by legal UTF-8 avoiding non-shortest encodings: it is technically
1114possible to UTF-8-encode a single code point in different ways, but that is
1115explicitly forbidden, and the shortest possible encoding should always be used
1116(and that is what Perl does).
37361303 1117
376d9008 1118Another way to look at it is via bits:
05632f9a 1119
                Code Points      1st Byte  2nd Byte  3rd Byte  4th Byte

                    0aaaaaaa     0aaaaaaa
            00000bbbbbaaaaaa     110bbbbb  10aaaaaa
            ccccbbbbbbaaaaaa     1110cccc  10bbbbbb  10aaaaaa
  00000dddccccccbbbbbbaaaaaa     11110ddd  10cccccc  10bbbbbb  10aaaaaa
05632f9a 1126
9f815e24 1127As you can see, the continuation bytes all begin with "10", and the
e1b711da 1128leading bits of the start byte tell how many bytes there are in the
05632f9a
JH
1129encoded character.
1130
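The bit layout above can be turned into a small encoder. The sketch
below is for illustration only: it handles Unicode code points up to
C<U+10FFFF>, makes no attempt to reject surrogates, and the function
names are ours. It compares its result with what C<utf8::encode>
produces on an ASCII platform.

```perl
use strict;
use warnings;

# Hand-rolled encoder following the bit layout in the table above.
sub utf8_bytes_by_hand {
    my ($cp) = @_;
    return ($cp) if $cp < 0x80;
    return (0xC0 | ($cp >> 6),
            0x80 | ($cp & 0x3F)) if $cp < 0x800;
    return (0xE0 | ($cp >> 12),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F)) if $cp < 0x10000;
    return (0xF0 | ($cp >> 18),
            0x80 | (($cp >> 12) & 0x3F),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F));
}

# What Perl itself produces, for comparison.
sub utf8_bytes_by_perl {
    my ($cp) = @_;
    my $str = chr $cp;
    utf8::encode($str);          # characters -> UTF-8 bytes, in place
    return unpack 'C*', $str;
}

# EURO SIGN, U+20AC, encodes as the three bytes E2 82 AC.
my @euro = map { sprintf '%02X', $_ } utf8_bytes_by_hand(0x20AC);
```

Because each branch is chosen by the size of the code point, the encoder
always emits the shortest form, which is the requirement noted above.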
6d4f9cf2
KW
1131The original UTF-8 specification allowed up to 6 bytes, to allow
1132encoding of numbers up to 0x7FFF_FFFF. Perl continues to allow those,
1133and has extended that up to 13 bytes to encode code points up to what
1134can fit in a 64-bit word. However, Perl will warn if you output any of
b19eb496 1135these as being non-portable; and under strict UTF-8 input protocols,
6d4f9cf2
KW
1136they are forbidden.
1137
1138The Unicode non-character code points are also disallowed in UTF-8 in
1139"open interchange". See L</Non-character code points>.
1140
c29a771d 1141=item *
5cb3728c
RB
1142
1143UTF-EBCDIC
dbe420b4 1144
376d9008 1145Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.
dbe420b4 1146
c29a771d 1147=item *
5cb3728c 1148
1e54db1a 1149UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)
c349b1b9 1150
1bfb14c4
JH
The following items are mostly for reference and general Unicode
knowledge; Perl doesn't use these constructs internally.
dbe420b4 1153
b19eb496
TC
1154Like UTF-8, UTF-16 is a variable-width encoding, but where
1155UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units.
1156All code points occupy either 2 or 4 bytes in UTF-16: code points
1157C<U+0000..U+FFFF> are stored in a single 16-bit unit, and code
1bfb14c4 1158points C<U+10000..U+10FFFF> in two 16-bit units. The latter case is
c349b1b9
JH
1159using I<surrogates>, the first 16-bit unit being the I<high
1160surrogate>, and the second being the I<low surrogate>.
1161
376d9008 1162Surrogates are code points set aside to encode the C<U+10000..U+10FFFF>
c349b1b9 1163range of Unicode code points in pairs of 16-bit units. The I<high
9f815e24 1164surrogates> are the range C<U+D800..U+DBFF> and the I<low surrogates>
376d9008 1165are the range C<U+DC00..U+DFFF>. The surrogate encoding is
c349b1b9 1166
d88362ca
KW
 $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
 $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
c349b1b9
JH
1169
1170and the decoding is
1171
d88362ca 1172 $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
c349b1b9 1173
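As a worked example of the two formulas, the sketch below converts a
supplementary code point to a surrogate pair and back. It uses shifts
and masks, which are the integer equivalents of dividing by and taking
the remainder modulo C<0x400>; the function names are illustrative.

```perl
use strict;
use warnings;

# Split a supplementary code point (U+10000..U+10FFFF) into a surrogate pair.
sub to_surrogates {
    my ($uni) = @_;
    my $offset = $uni - 0x10000;
    my $hi = ($offset >> 10)   + 0xD800;   # integer form of "/ 0x400"
    my $lo = ($offset & 0x3FF) + 0xDC00;   # same as "% 0x400"
    return ($hi, $lo);
}

# Recombine a high and a low surrogate into the original code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
}

my ($hi, $lo)  = to_surrogates(0x1F600);      # 0xD83D, 0xDE00
my $round_trip = from_surrogates($hi, $lo);   # 0x1F600 again
```

The integer conversion matters in Perl, because C</> is floating-point
division unless C<use integer> is in effect.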
376d9008 1174Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16
c349b1b9 1175itself can be used for in-memory computations, but if storage or
376d9008
JB
1176transfer is required either UTF-16BE (big-endian) or UTF-16LE
1177(little-endian) encodings must be chosen.
c349b1b9
JH
1178
1179This introduces another problem: what if you just know that your data
376d9008
JB
1180is UTF-16, but you don't know which endianness? Byte Order Marks, or
1181BOMs, are a solution to this. A special character has been reserved
86bbd6d1 1182in Unicode to function as a byte order marker: the character with the
376d9008 1183code point C<U+FEFF> is the BOM.
042da322 1184
c349b1b9 1185The trick is that if you read a BOM, you will know the byte order,
376d9008
JB
1186since if it was written on a big-endian platform, you will read the
1187bytes C<0xFE 0xFF>, but if it was written on a little-endian platform,
1188you will read the bytes C<0xFF 0xFE>. (And if the originating platform
1189was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.)
042da322 1190
The way this trick works is that the character with the code point
C<U+FFFE> is not supposed to be in input streams, so the
sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in
little-endian format" and cannot be "C<U+FFFE>, represented in big-endian
format".
1196
Surrogates have no meaning in Unicode outside their use in pairs to
represent other code points. However, Perl allows them to be
represented individually internally, for example by saying
C<chr(0xD801)>, so that all code points, not just those valid for open
interchange, are
representable. Unicode does define semantics for them; for example, their
General Category is "Cs". But because their use is somewhat dangerous,
Perl will warn (using the warning category "surrogate", which is a
sub-category of "utf8") if an attempt is made
to do things like take the lower case of one, or match
case-insensitively, or to output them. (But don't try this on Perls
before 5.14.)
c349b1b9 1209
c29a771d 1210=item *
5cb3728c 1211
1e54db1a 1212UTF-32, UTF-32BE, UTF-32LE
c349b1b9
JH
1213
The UTF-32 family is pretty much like the UTF-16 family, except that
the units are 32-bit, and therefore the surrogate scheme is not
needed. UTF-32 is a fixed-width encoding. The BOM signatures are
C<0x00 0x00 0xFE 0xFF> for BE and C<0xFF 0xFE 0x00 0x00> for LE.
c349b1b9 1218
c29a771d 1219=item *
5cb3728c
RB
1220
1221UCS-2, UCS-4
c349b1b9 1222
b19eb496 1223Legacy, fixed-width encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit
376d9008 1224encoding. Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
339cfa0e 1225because it does not use surrogates. UCS-4 is a 32-bit encoding,
b19eb496 1226functionally identical to UTF-32 (the difference being that
ee88f7b6 1227UCS-4 forbids neither surrogates nor code points larger than 0x10_FFFF).
c349b1b9 1228
c29a771d 1229=item *
5cb3728c
RB
1230
1231UTF-7
c349b1b9 1232
376d9008
JB
1233A seven-bit safe (non-eight-bit) encoding, which is useful if the
1234transport or storage is not eight-bit safe. Defined by RFC 2152.
c349b1b9 1235
95a1a48b
JH
1236=back
1237
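The BOM signatures listed above for UTF-8, UTF-16 and UTF-32 can be
checked with a few lines of code. The following is only a sketch: the
function name C<sniff_bom> is hypothetical, and a real application would
more likely rely on an encoding layer or the L<Encode> module.

```perl
use strict;
use warnings;

# Guess an encoding from the leading bytes of a buffer; returns nothing
# (undef in scalar context) when no BOM is found.  The 4-byte UTF-32
# signatures are tested first because the UTF-32LE BOM begins with the
# same two bytes as the UTF-16LE BOM.
sub sniff_bom {
    my ($bytes) = @_;
    my @signatures = (
        [ "\x00\x00\xFE\xFF", 'UTF-32BE' ],
        [ "\xFF\xFE\x00\x00", 'UTF-32LE' ],
        [ "\xEF\xBB\xBF",     'UTF-8'    ],
        [ "\xFE\xFF",         'UTF-16BE' ],
        [ "\xFF\xFE",         'UTF-16LE' ],
    );
    for my $sig (@signatures) {
        my ($bom, $name) = @$sig;
        return $name if index($bytes, $bom) == 0;
    }
    return;
}

my $guess = sniff_bom("\xFF\xFEh\x00i\x00");   # UTF-16LE
```

Note that a UTF-16LE stream whose first character after the BOM is
C<U+0000> is indistinguishable from UTF-32LE by this test, so the result
is a guess rather than a guarantee.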
6d4f9cf2
KW
1238=head2 Non-character code points
1239
66 code points are set aside in Unicode as "non-character code points".
These all have the Unassigned (Cn) General Category, and they never will
be assigned. These are never supposed to be in legal Unicode input
streams, so that code can use them as sentinels that can be mixed in
with character data, and they always will be distinguishable from that data.
To keep them out of Perl input streams, strict UTF-8 should be
specified, such as by using the layer C<:encoding('UTF-8')>. The
non-character code points are the 32 between U+FDD0 and U+FDEF, and the
34 code points U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF.
Some people are under the mistaken impression that these are "illegal",
but that is not true. An application or cooperating set of applications
can legally use them at will internally; but these code points are
"illegal for open interchange". Therefore, Perl will not accept these
from input streams unless lax rules are being used, and will warn
(using the warning category "nonchar", which is a sub-category of "utf8") if
an attempt is made to output them.
1256
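The two groups of non-character code points follow a simple arithmetic
pattern, which the sketch below checks by counting them. The function
name C<is_nonchar> is illustrative and not part of Perl.

```perl
use strict;
use warnings;

# True for the non-character code points described above.
sub is_nonchar {
    my ($cp) = @_;
    return 0 if $cp > 0x10FFFF;                   # beyond Unicode
    return 1 if $cp >= 0xFDD0 && $cp <= 0xFDEF;   # the contiguous run of 32
    return ($cp & 0xFFFE) == 0xFFFE ? 1 : 0;      # last two of each plane
}

# Count them across the whole Unicode range: 32 + 2 * 17 planes = 66.
my $count = 0;
for my $cp (0 .. 0x10FFFF) {
    $count++ if is_nonchar($cp);
}
```

The mask test works because the last two code points of every plane end
in C<FFFE> and C<FFFF>, whatever the plane number is.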
1257=head2 Beyond Unicode code points
1258
The maximum Unicode code point is U+10FFFF. But Perl accepts code
points up to the maximum permissible unsigned number available on the
platform. However, Perl will not accept these from input streams unless
lax rules are being used, and will warn (using the warning category
"non_unicode", which is a sub-category of "utf8") if an attempt is made to
operate on or output them. For example, C<uc(chr(0x11_0000))> will generate
this warning, returning the input parameter as its result, as the upper
case of every non-Unicode code point is the code point itself.
6d4f9cf2 1267
0d7c09bb
JH
1268=head2 Security Implications of Unicode
1269
e1b711da
KW
1270Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.
1271Also, note the following:
1272
0d7c09bb
JH
1273=over 4
1274
1275=item *
1276
1277Malformed UTF-8
bf0fa0b2 1278
42581d5d 1279Unfortunately, the original specification of UTF-8 leaves some room for
bf0fa0b2 1280interpretation of how many bytes of encoded output one should generate
376d9008
JB
1281from one input Unicode character. Strictly speaking, the shortest
1282possible sequence of UTF-8 bytes should be generated,
1283because otherwise there is potential for an input buffer overflow at
feda178f 1284the receiving end of a UTF-8 connection. Perl always generates the
e1b711da 1285shortest length UTF-8, and with warnings on, Perl will warn about
376d9008 1286non-shortest length UTF-8 along with other malformations, such as the
b19eb496 1287surrogates, which are not Unicode code points valid for interchange.
bf0fa0b2 1288
0d7c09bb
JH
1289=item *
1290
68693f9e 1291Regular expression pattern matching may surprise you if you're not
b19eb496
TC
1292accustomed to Unicode. Starting in Perl 5.14, several pattern
1293modifiers are available to control this, called the character set
42581d5d
KW
1294modifiers. Details are given in L<perlre/Character set modifiers>.
1295
1296=back
0d7c09bb 1297
376d9008 1298As discussed elsewhere, Perl has one foot (two hooves?) planted in
1bfb14c4
JH
1299each of two worlds: the old world of bytes and the new world of
1300characters, upgrading from bytes to characters when necessary.
376d9008
JB
1301If your legacy code does not explicitly use Unicode, no automatic
1302switch-over to characters should happen. Characters shouldn't get
1bfb14c4
JH
1303downgraded to bytes, either. It is possible to accidentally mix bytes
1304and characters, however (see L<perluniintro>), in which case C<\w> in
42581d5d 1305regular expressions might start behaving differently (unless the C</a>
b19eb496 1306modifier is in effect). Review your code. Use warnings and the C<strict> pragma.
0d7c09bb 1307
c349b1b9
JH
1308=head2 Unicode in Perl on EBCDIC
1309
376d9008
JB
1310The way Unicode is handled on EBCDIC platforms is still
1311experimental. On such platforms, references to UTF-8 encoding in this
1312document and elsewhere should be read as meaning the UTF-EBCDIC
1313specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues
c349b1b9 1314are specifically discussed. There is no C<utfebcdic> pragma or
376d9008 1315":utfebcdic" layer; rather, "utf8" and ":utf8" are reused to mean
86bbd6d1
PN
1316the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic>
1317for more discussion of the issues.
c349b1b9 1318
b310b053
JH
1319=head2 Locales
1320
See L<perllocale/Unicode and UTF-8>.
b310b053 1322
1aad1664
JH
1323=head2 When Unicode Does Not Happen
1324
1325While Perl does have extensive ways to input and output in Unicode,
b19eb496
TC
1326and a few other "entry points" like the @ARGV array (which can sometimes be
1327interpreted as UTF-8), there are still many places where Unicode
1328(in some encoding or another) could be given as arguments or received as
1aad1664
JH
1329results, or both, but it is not.
1330
e1b711da
KW
1331The following are such interfaces. Also, see L</The "Unicode Bug">.
1332For all of these interfaces Perl
6cd4dd6c 1333currently (as of 5.8.3) simply assumes byte strings both as arguments
b19eb496 1334and results, or UTF-8 strings if the (problematic) C<encoding> pragma has been used.
1aad1664 1335
b19eb496
TC
1336One reason that Perl does not attempt to resolve the role of Unicode in
1337these situations is that the answers are highly dependent on the operating
1aad1664 1338system and the file system(s). For example, whether filenames can be
b19eb496
TC
1339in Unicode and in exactly what kind of encoding, is not exactly a
1340portable concept. Similarly for C<qx> and C<system>: how well will the
1341"command-line interface" (and which of them?) handle Unicode?
1aad1664
JH
1342
1343=over 4
1344
557a2462
RB
1345=item *
1346
51f494cc 1347chdir, chmod, chown, chroot, exec, link, lstat, mkdir,
1e8e8236 1348rename, rmdir, stat, symlink, truncate, unlink, utime, -X
557a2462
RB
1349
1350=item *
1351
1352%ENV
1353
1354=item *
1355
1356glob (aka the <*>)
1357
1358=item *
1aad1664 1359
557a2462 1360open, opendir, sysopen
1aad1664 1361
557a2462 1362=item *
1aad1664 1363
557a2462 1364qx (aka the backtick operator), system
1aad1664 1365
557a2462 1366=item *
1aad1664 1367
557a2462 1368readdir, readlink
1aad1664
JH
1369
1370=back
1371
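For file names, one way to cope is to make the encoding explicit
yourself. The following is only a sketch: it assumes that the file system
stores names as UTF-8 octets (common on Unix-like systems, but not
guaranteed anywhere), and the variable names are purely illustrative.

    use strict;
    use warnings;
    use Encode qw(encode decode);
    use File::Temp qw(tempdir);

    # Assumption: names on this file system are UTF-8 octets.
    my $fs_encoding = 'UTF-8';

    my $dir    = tempdir(CLEANUP => 1);
    my $name   = "r\x{E9}sum\x{E9}.txt";        # 10 characters
    my $octets = encode($fs_encoding, $name);   # 12 octets

    # Pass octets, not characters, to open().
    open(my $fh, '>', "$dir/$octets") or die "open: $!";
    close $fh;

    # readdir() returns octets too; decode them before use.
    opendir(my $dh, $dir) or die "opendir: $!";
    my @names = map  { decode($fs_encoding, $_) }
                grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;

Some file systems normalize names, so a decoded name is not
guaranteed to compare equal to the string that was written.
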
e1b711da
KW
1372=head2 The "Unicode Bug"
1373
The term "Unicode bug" has been applied to an inconsistency
on ASCII platforms with the
Unicode code points in the Latin-1 Supplement block, that
is, between 128 and 255. Without a locale specified, unlike all other
characters or code points, these characters behave very differently under
byte semantics than under character semantics, unless
C<use feature 'unicode_strings'> is specified, directly or indirectly.
(It is indirectly specified by a C<use v5.12> or higher.)
e1b711da 1382
2e2b2571
KW
1383In character semantics these upper-Latin1 characters are interpreted as
1384Unicode code points, which means
e1b711da
KW
1385they have the same semantics as Latin-1 (ISO-8859-1).
1386
In byte semantics (without C<unicode_strings>), they are considered to
be unassigned characters, meaning that the only semantics they have is
their ordinal numbers, and that they are
not members of various character classes. None are considered to match C<\w>,
for example, but all match C<\W>.
e1b711da 1392
2e2b2571
KW
1393Perl 5.12.0 added C<unicode_strings> to force character semantics on
1394these code points in some circumstances, which fixed portions of the
1395bug; Perl 5.14.0 fixed almost all of it; and Perl 5.16.0 fixed the
1396remainder (so far as we know, anyway). The lesson here is to enable
1397C<unicode_strings> to avoid the headaches described below.
1398
1399The old, problematic behavior affects these areas:
e1b711da
KW
1400
1401=over 4
1402
1403=item *
1404
1405Changing the case of a scalar, that is, using C<uc()>, C<ucfirst()>, C<lc()>,
2e2b2571
KW
1406and C<lcfirst()>, or C<\L>, C<\U>, C<\u> and C<\l> in double-quotish
1407contexts, such as regular expression substitutions.
1408Under C<unicode_strings> starting in Perl 5.12.0, character semantics are
1409generally used. See L<perlfunc/lc> for details on how this works
1410in combination with various other pragmas.
e1b711da
KW
1411
1412=item *
1413
Using caseless (C</i>) regular expression matching.
Starting in Perl 5.14.0, regular expressions compiled within
the scope of C<unicode_strings> use character semantics
even when executed or compiled into larger
regular expressions outside the scope.
e1b711da
KW
1419
1420=item *
1421
Matching any of several properties in regular expressions, namely C<\b>,
C<\B>, C<\s>, C<\S>, C<\w>, C<\W>, and all the Posix character classes
I<except> C<[[:ascii:]]>.
Starting in Perl 5.14.0, regular expressions compiled within
the scope of C<unicode_strings> use character semantics
even when executed or compiled into larger
regular expressions outside the scope.
e1b711da
KW
1429
1430=item *
1431
In C<quotemeta> or its inline equivalent C<\Q>, no code points above 127
are quoted in UTF-8 encoded strings, but in byte encoded strings, code
points between 128 and 255 are always quoted.
Starting in Perl 5.16.0, consistent quoting rules are used within the
scope of C<unicode_strings>, as described in L<perlfunc/quotemeta>.
eb88ed9e 1437
e1b711da
KW
1438=back
1439
1440This behavior can lead to unexpected results in which a string's semantics
1441suddenly change if a code point above 255 is appended to or removed from it,
1442which changes the string's semantics from byte to character or vice versa. As
1443an example, consider the following program and its output:
1444
    $ perl -le'
        no feature "unicode_strings";
        $s1 = "\xC2";
        $s2 = "\x{2660}";
        for ($s1, $s2, $s1.$s2) {
            print /\w/ || 0;
        }
    '
    0
    0
    1
1456
If there's no C<\w> in C<$s1> or in C<$s2>, why does their concatenation have one?
e1b711da
KW
1458
1459This anomaly stems from Perl's attempt to not disturb older programs that
1460didn't use Unicode, and hence had no semantics for characters outside of the
1461ASCII range (except in a locale), along with Perl's desire to add Unicode
1462support seamlessly. The result wasn't seamless: these characters were
1463orphaned.
1464
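The difference that C<unicode_strings> makes can be seen in a short
sketch. It assumes Perl 5.14 or later (so that the regular expression
fix is present) and no locale in effect. The variable names are
illustrative only.

    use strict;
    use warnings;

    # LATIN SMALL LETTER E WITH ACUTE, stored as a single byte
    my $e_acute = "\xE9";

    my ($uc_bytes, $w_bytes, $uc_chars, $w_chars);
    {
        no feature 'unicode_strings';
        $uc_bytes = uc $e_acute;                  # unchanged: "\xE9"
        $w_bytes  = $e_acute =~ /\w/ ? 1 : 0;     # 0
    }
    {
        use feature 'unicode_strings';
        $uc_chars = uc $e_acute;                  # "\xC9"
        $w_chars  = $e_acute =~ /\w/ ? 1 : 0;     # 1
    }

    # prints "E9 0 / C9 1"
    printf "%02X %d / %02X %d\n",
        ord($uc_bytes), $w_bytes, ord($uc_chars), $w_chars;

The string itself is identical in both blocks; only the pragma in
whose scope the operation is compiled differs.
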
2e2b2571
KW
For Perls earlier than those described above, or when a string is passed
to a function outside the scope of C<unicode_strings>, a workaround is to always
call C<utf8::upgrade($string)>,
or to use the standard module L<Encode>. Also, a scalar that has any characters
whose ordinal is 0x100 or above, or which were specified using either of the
C<\N{...}> notations, will automatically have character semantics.
e1b711da 1471
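As a minimal sketch of the C<utf8::upgrade()> workaround (again
assuming no locale is in effect), the same one-character string
changes its answer to C<\w> once it is upgraded:

    use strict;
    use warnings;
    no feature 'unicode_strings';    # make the old behavior explicit

    my $s = "\xC2";                  # one byte, no UTF8 flag
    my $before = $s =~ /\w/ ? 1 : 0; # 0: byte semantics

    utf8::upgrade($s);               # same character, now stored as UTF-8
    my $after  = $s =~ /\w/ ? 1 : 0; # 1: character semantics

    print "$before $after\n";        # prints "0 1"
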
1aad1664
JH
1472=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)
1473
Sometimes (see L</"When Unicode Does Not Happen"> or L</The "Unicode Bug">)
there are situations where you simply need to force a byte
string into UTF-8, or vice versa. The low-level calls
C<utf8::upgrade($bytestring)> and C<utf8::downgrade($utf8string[, FAIL_OK])> are
the answers.

Note that C<utf8::downgrade()> can fail if the string contains characters
that don't fit into a byte. In that case it dies, unless C<FAIL_OK> is
true, in which case it returns false.
1aad1664 1482
e1b711da
KW
1483Calling either function on a string that already is in the desired state is a
1484no-op.
1485
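The following sketch walks through both calls. C<utf8::is_utf8()> is
used here only to make the internal state visible; the variable names
are illustrative.

    use strict;
    use warnings;

    my $latin1 = "caf\xE9";                  # 4 characters, byte string
    my @flags  = (utf8::is_utf8($latin1) ? 1 : 0);

    utf8::upgrade($latin1);                  # internal form becomes UTF-8
    push @flags, utf8::is_utf8($latin1) ? 1 : 0;
    my $len_after_upgrade = length $latin1;  # still 4 characters

    utf8::downgrade($latin1);                # back to one byte per character
    push @flags, utf8::is_utf8($latin1) ? 1 : 0;

    my $wide = "\x{263A}";                   # does not fit into a byte
    my $ok   = utf8::downgrade($wide, 1);    # FAIL_OK set: returns false
    my $died = !eval { utf8::downgrade($wide); 1 };  # no FAIL_OK: dies

    print "@flags\n";                        # prints "0 1 0"

Neither call changes the characters in the string, only how they are
stored.
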
95a1a48b
JH
1486=head2 Using Unicode in XS
1487
3a2263fe
RGS
1488If you want to handle Perl Unicode in XS extensions, you may find the
1489following C APIs useful. See also L<perlguts/"Unicode Support"> for an
1490explanation about Unicode at the XS level, and L<perlapi> for the API
1491details.
95a1a48b
JH
1492
1493=over 4
1494
1495=item *
1496
1bfb14c4 1497C<DO_UTF8(sv)> returns true if the C<UTF8> flag is on and the bytes
2bbc8d55 1498pragma is not in effect. C<SvUTF8(sv)> returns true if the C<UTF8>
1bfb14c4
JH
1499flag is on; the bytes pragma is ignored. The C<UTF8> flag being on
1500does B<not> mean that there are any characters of code points greater
1501than 255 (or 127) in the scalar or that there are even any characters
1502in the scalar. What the C<UTF8> flag means is that the sequence of
1503octets in the representation of the scalar is the sequence of UTF-8
1504encoded code points of the characters of a string. The C<UTF8> flag
1505being off means that each octet in this representation encodes a
1506single character with code point 0..255 within the string. Perl's
1507Unicode model is not to use UTF-8 until it is absolutely necessary.
95a1a48b
JH
1508
1509=item *
1510
2bbc8d55 1511C<uvchr_to_utf8(buf, chr)> writes a Unicode character code point into
1bfb14c4 1512a buffer encoding the code point as UTF-8, and returns a pointer
2bbc8d55 1513pointing after the UTF-8 bytes. It works appropriately on EBCDIC machines.
95a1a48b
JH
1514
1515=item *
1516
2bbc8d55 1517C<utf8_to_uvchr(buf, lenp)> reads UTF-8 encoded bytes from a buffer and
376d9008 1518returns the Unicode character code point and, optionally, the length of
2bbc8d55 1519the UTF-8 byte sequence. It works appropriately on EBCDIC machines.
95a1a48b
JH
1520
1521=item *
1522
376d9008
JB
1523C<utf8_length(start, end)> returns the length of the UTF-8 encoded buffer
1524in characters. C<sv_len_utf8(sv)> returns the length of the UTF-8 encoded
95a1a48b
JH
1525scalar.
1526
1527=item *
1528
376d9008
JB
C<sv_utf8_upgrade(sv)> converts the string of the scalar to its UTF-8
encoded form. C<sv_utf8_downgrade(sv)> does the opposite, if
possible. C<sv_utf8_encode(sv)> is like C<sv_utf8_upgrade()> except that
it does not set the C<UTF8> flag. C<sv_utf8_decode()> does the
opposite of C<sv_utf8_encode()>. Note that none of these are to be
used as general-purpose encoding or decoding interfaces: C<use Encode>
for that. C<sv_utf8_upgrade()> is affected by the encoding pragma
but C<sv_utf8_downgrade()> is not (since the encoding pragma is
designed to be a one-way street).
95a1a48b
JH
1538
1539=item *
1540
dbfbbfa1
KW
1541C<is_utf8_string(buf, len)> returns true if C<len> bytes of the buffer
1542are valid UTF-8.
95a1a48b
JH
1543
1544=item *
1545
dbfbbfa1
KW
1546C<is_utf8_char(s)> returns true if the pointer points to a valid UTF-8
1547character. However, this function should not be used because of
1548security concerns. Instead, use C<is_utf8_string()>.
95a1a48b
JH
1549
1550=item *
1551
376d9008
JB
C<UTF8SKIP(buf)> will return the number of bytes in the UTF-8 encoded
character in the buffer. C<UNISKIP(chr)> will return the number of bytes
required to UTF-8-encode the Unicode character code point. C<UTF8SKIP()>
is useful, for example, for iterating over the characters of a UTF-8
encoded buffer; C<UNISKIP()> is useful, for example, in computing
the size required for a UTF-8 encoded buffer.
95a1a48b
JH
1558
1559=item *
1560
376d9008 1561C<utf8_distance(a, b)> will tell the distance in characters between the
95a1a48b
JH
1562two pointers pointing to the same UTF-8 encoded buffer.
1563
1564=item *
1565
2bbc8d55 1566C<utf8_hop(s, off)> will return a pointer to a UTF-8 encoded buffer
376d9008
JB
1567that is C<off> (positive or negative) Unicode characters displaced
1568from the UTF-8 buffer C<s>. Be careful not to overstep the buffer:
1569C<utf8_hop()> will merrily run off the end or the beginning of the
1570buffer if told to do so.
95a1a48b 1571
d2cc3551
JH
1572=item *
1573
376d9008
JB
1574C<pv_uni_display(dsv, spv, len, pvlim, flags)> and
1575C<sv_uni_display(dsv, ssv, pvlim, flags)> are useful for debugging the
1576output of Unicode strings and scalars. By default they are useful
1577only for debugging--they display B<all> characters as hexadecimal code
1bfb14c4
JH
1578points--but with the flags C<UNI_DISPLAY_ISPRINT>,
1579C<UNI_DISPLAY_BACKSLASH>, and C<UNI_DISPLAY_QQ> you can make the
1580output more readable.
d2cc3551
JH
1581
1582=item *
1583
66615a54 1584C<foldEQ_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to
376d9008 1585compare two strings case-insensitively in Unicode. For case-sensitive
66615a54
KW
1586comparisons you can just use C<memEQ()> and C<memNE()> as usual, except
1587if one string is in utf8 and the other isn't.
d2cc3551 1588
c349b1b9
JH
1589=back
1590
95a1a48b
JH
1591For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h>
1592in the Perl source code distribution.
1593
e1b711da
KW
1594=head2 Hacking Perl to work on earlier Unicode versions (for very serious hackers only)
1595
1596Perl by default comes with the latest supported Unicode version built in, but
1597you can change to use any earlier one.
1598
Download the files in the desired version of Unicode from the Unicode web
site (L<http://www.unicode.org>). These should replace the existing files in
F<lib/unicore> in the Perl source tree. Follow the instructions in
F<README.perl> in that directory to change some of their names, and then build
perl (see L<INSTALL>).
116693e8
DL
1604
It is even possible to copy the built files to a different directory, and then
change F<utf8_heavy.pl> in the directory C<$Config{privlib}> to point to the
new directory. Alternatively, you could make a copy of that directory before
making the change, and then use C<@INC> or the C<-I> run-time flag to switch
between versions at will (though, because of caching, not in the middle of a
process). All this is beyond the scope of these instructions.
1611
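After such a rebuild, one way to confirm which Unicode version a
given perl actually uses is to ask the core L<Unicode::UCD> module.
This is a minimal sketch, and the output depends on your build:

    use strict;
    use warnings;
    use Unicode::UCD ();

    # Version of the Unicode Character Database this perl was built with
    my $version = Unicode::UCD::UnicodeVersion();
    print "This perl uses Unicode $version\n";
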
c29a771d
JH
1612=head1 BUGS
1613
376d9008 1614=head2 Interaction with Locales
7eabb34d 1615
See L<perllocale/Unicode and UTF-8>.
c29a771d 1617
9f815e24 1618=head2 Problems with characters in the Latin-1 Supplement range
2bbc8d55 1619
e1b711da
KW
See L</The "Unicode Bug">.
1621
376d9008 1622=head2 Interaction with Extensions
7eabb34d 1623
376d9008 1624When Perl exchanges data with an extension, the extension should be
2575c402 1625able to understand the UTF8 flag and act accordingly. If the
b19eb496 1626extension doesn't recognize that flag, it's likely that the extension
376d9008 1627will return incorrectly-flagged data.
7eabb34d
A
1628
So if you're working with Unicode data, consult the documentation of
every module you're using to see whether there are any issues with Unicode data
exchange. If the documentation does not talk about Unicode at all,
suspect the worst and probably look at the source to learn how the
module is implemented. Modules written completely in Perl shouldn't
cause problems. Modules that directly or indirectly access code written
in other programming languages are at risk.
7eabb34d 1636
376d9008 1637For affected functions, the simple strategy to avoid data corruption is
7eabb34d 1638to always make the encoding of the exchanged data explicit. Choose an
376d9008 1639encoding that you know the extension can handle. Convert arguments passed
7eabb34d
A
1640to the extensions to that encoding and convert results back from that
1641encoding. Write wrapper functions that do the conversions for you, so
1642you can later change the functions when the extension catches up.
1643
376d9008 1644To provide an example, let's say the popular Foo::Bar::escape_html
7eabb34d
A
1645function doesn't deal with Unicode data yet. The wrapper function
1646would convert the argument to raw UTF-8 and convert the result back to
376d9008 1647Perl's internal representation like so:
7eabb34d
A
1648
    sub my_escape_html ($) {
        my($what) = shift;
        return unless defined $what;
        require Encode;
        # hand UTF-8 octets to the extension, then decode its result
        Encode::decode_utf8(Foo::Bar::escape_html(
                                         Encode::encode_utf8($what)));
    }
1655
1656Sometimes, when the extension does not convert data but just stores
b19eb496 1657and retrieves them, you will be able to use the otherwise
7eabb34d 1658dangerous Encode::_utf8_on() function. Let's say the popular
66b79f27 1659C<Foo::Bar> extension, written in C, provides a C<param> method that
7eabb34d
A
1660lets you store and retrieve data according to these prototypes:
1661
1662 $self->param($name, $value); # set a scalar
1663 $value = $self->param($name); # retrieve a scalar
1664
1665If it does not yet provide support for any encoding, one could write a
1666derived class with such a C<param> method:
1667
    sub param {
        my($self,$name,$value) = @_;
        utf8::upgrade($name);     # make sure it is UTF-8 encoded
        if (defined $value) {
            utf8::upgrade($value);  # make sure it is UTF-8 encoded
            return $self->SUPER::param($name,$value);
        } else {
            my $ret = $self->SUPER::param($name);
            require Encode;
            Encode::_utf8_on($ret); # we know, it is UTF-8 encoded
            return $ret;
        }
    }
1680
a73d23f6
RGS
Some extensions provide filters on data entry/exit points, such as
C<DB_File::filter_store_key> and family. Look out for such filters in
the documentation of your extensions; they can make the transition to
Unicode data much easier.
1685
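As a hedged sketch of how such filters might look: L<DB_File> filters
receive the datum in C<$_> and modify it in place. The C<apply_filter>
helper below is hypothetical and exists only to exercise the two
filters without needing a database. The commented-out calls show where
the filters would be installed on a tied C<DB_File> object, assumed to
be in C<$db>.

    use strict;
    use warnings;
    use Encode ();

    # Filters of this kind read and rewrite $_.
    my $to_octets   = sub { $_ = Encode::encode_utf8($_) };
    my $from_octets = sub { $_ = Encode::decode_utf8($_) };

    # With a tied DB_File object in $db (not created here):
    #
    #   $db->filter_store_key  ($to_octets);
    #   $db->filter_store_value($to_octets);
    #   $db->filter_fetch_key  ($from_octets);
    #   $db->filter_fetch_value($from_octets);

    # Hypothetical helper: run one filter on one datum.
    sub apply_filter {
        my ($filter, $datum) = @_;
        local $_ = $datum;
        $filter->();
        my $result = $_;
        return $result;
    }

    my $stored = apply_filter($to_octets,   "\x{263A}");  # 3 octets
    my $back   = apply_filter($from_octets, $stored);     # 1 character

With the filters in place, the rest of the program can work with
character strings while the database only ever sees octets.
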
376d9008 1686=head2 Speed
7eabb34d 1687
c29a771d 1688Some functions are slower when working on UTF-8 encoded strings than
574c8022 1689on byte encoded strings. All functions that need to hop over
7c17141f
JH
1690characters such as length(), substr() or index(), or matching regular
1691expressions can work B<much> faster when the underlying data are
1692byte-encoded.
1693
In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
a caching scheme was introduced which will hopefully make the slowness
somewhat less spectacular, at least for some operations. In general,
operations with UTF-8 encoded strings are still slower. As an example,
the Unicode properties (character classes) like C<\p{Nd}> are known to
be quite a bit slower (5-20 times) than their simpler counterparts
like C<\d> (then again, there are hundreds of Unicode characters matching C<Nd>
compared with the 10 ASCII characters matching C<\d>).
666f95b9 1702
e1b711da
KW
1703=head2 Problems on EBCDIC platforms
1704
f651977e 1705There are several known problems with Perl on EBCDIC platforms. If you
e1b711da 1706want to use Perl there, send email to perlbug@perl.org.
fe749c9a
KW
1707
1708In earlier versions, when byte and character data were concatenated,
1709the new string was sometimes created by
1710decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
1711old Unicode string used EBCDIC.
1712
1713If you find any of these, please report them as bugs.
1714
c8d992ba
A
1715=head2 Porting code from perl-5.6.X
1716
1717Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer
1718was required to use the C<utf8> pragma to declare that a given scope
1719expected to deal with Unicode data and had to make sure that only
1720Unicode data were reaching that scope. If you have code that is
1721working with 5.6, you will need some of the following adjustments to
1722your code. The examples are written such that the code will continue
1723to work under 5.6, so you should be safe to try them out.
1724
755789c0 1725=over 3
c8d992ba
A
1726
1727=item *
1728
1729A filehandle that should read or write UTF-8
1730
    if ($] > 5.007) {
        binmode $fh, ":encoding(utf8)";
    }
1734
1735=item *
1736
1737A scalar that is going to be passed to some extension
1738
1739Be it Compress::Zlib, Apache::Request or any extension that has no
1740mention of Unicode in the manpage, you need to make sure that the
2575c402 1741UTF8 flag is stripped off. Note that at the time of this writing
c8d992ba
A
1742(October 2002) the mentioned modules are not UTF-8-aware. Please
1743check the documentation to verify if this is still true.
1744
1745 if ($] > 5.007) {
1746 require Encode;
1747 $val = Encode::encode_utf8($val); # make octets
1748 }
1749
1750=item *
1751
1752A scalar we got back from an extension
1753
1754If you believe the scalar comes back as UTF-8, you will most likely
2575c402 1755want the UTF8 flag restored:
c8d992ba
A
1756
1757 if ($] > 5.007) {
1758 require Encode;
1759 $val = Encode::decode_utf8($val);
1760 }
1761
1762=item *
1763
1764Same thing, if you are really sure it is UTF-8
1765
1766 if ($] > 5.007) {
1767 require Encode;
1768 Encode::_utf8_on($val);
1769 }
1770
1771=item *
1772
1773A wrapper for fetchrow_array and fetchrow_hashref
1774
1775When the database contains only UTF-8, a wrapper function or method is
1776a convenient way to replace all your fetchrow_array and
1777fetchrow_hashref calls. A wrapper function will also make it easier to
1778adapt to future enhancements in your database driver. Note that at the
1779time of this writing (October 2002), the DBI has no standardized way
1780to deal with UTF-8 data. Please check the documentation to verify if
1781that is still true.
1782
    sub fetchrow {
        # $what is one of fetchrow_{array,hashref}
        my($self, $sth, $what) = @_;
        if ($] < 5.007) {
            return $sth->$what;
        } else {
            require Encode;
            if (wantarray) {
                my @arr = $sth->$what;
                for (@arr) {
                    defined && /[^\000-\177]/ && Encode::_utf8_on($_);
                }
                return @arr;
            } else {
                my $ret = $sth->$what;
                if (ref $ret) {
                    for my $k (keys %$ret) {
                        defined
                        && /[^\000-\177]/
                        && Encode::_utf8_on($_) for $ret->{$k};
                    }
                    return $ret;
                } else {
                    defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
                    return $ret;
                }
            }
        }
    }
1812
1813
1814=item *
1815
1816A large scalar that you know can only contain ASCII
1817
1818Scalars that contain only ASCII and are marked as UTF-8 are sometimes
1819a drag to your program. If you recognize such a situation, just remove
2575c402 1820the UTF8 flag:
c8d992ba
A
1821
1822 utf8::downgrade($val) if $] > 5.007;
1823
1824=back
1825
393fec97
GS
1826=head1 SEE ALSO
1827
L<perlunitut>, L<perluniintro>, L<perluniprops>, L<Encode>, L<open>, L<utf8>, L<bytes>,
L<perlretut>, L<perlvar/"${^UNICODE}">,
L<http://www.unicode.org/reports/tr44>.
393fec97
GS
1831
1832=cut