This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5
perlunicode: Correct false statement
[perl5.git] / pod / perlunicode.pod
1=head1 NAME
2
3perlunicode - Unicode support in Perl
4
5=head1 DESCRIPTION
6
0a1f2d14 7=head2 Important Caveats
21bad921 8
376d9008 9Unicode support is an extensive requirement. While Perl does not
c349b1b9
JH
10implement the Unicode standard or the accompanying technical reports
11from cover to cover, Perl does support many Unicode features.
21bad921 12
People who want to learn to use Unicode in Perl should probably read
the L<Perl Unicode tutorial, perlunitut|perlunitut> and
L<perluniintro> before reading this reference document.
2575c402 17
9d1c51c1
KW
18Also, the use of Unicode may present security issues that aren't obvious.
19Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.
20
13a2d996 21=over 4
21bad921 22
42581d5d
KW
23=item Safest if you "use feature 'unicode_strings'"
24
In order to preserve backward compatibility, Perl does not turn
on full internal Unicode support unless the pragma
C<use feature 'unicode_strings'> is specified. (This is automatically
selected if you C<use 5.012> or higher.) Failure to do this can
cause surprises. See L</The "Unicode Bug"> below.
30
2269d15c
KW
31This pragma doesn't affect I/O. Nor does it change the internal
32representation of strings, only their interpretation. There are still
33several places where Unicode isn't fully supported, such as in
34filenames.
42581d5d 35
fae2c0fb 36=item Input and Output Layers
21bad921 37
376d9008 38Perl knows when a filehandle uses Perl's internal Unicode encodings
1bfb14c4 39(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
4ee7c0ea 40the ":encoding(utf8)" layer. Other encodings can be converted to Perl's
1bfb14c4
JH
41encoding on input or from Perl's encoding on output by use of the
42":encoding(...)" layer. See L<open>.
c349b1b9 43
2575c402 44To indicate that Perl source itself is in UTF-8, use C<use utf8;>.
21bad921 45
ad0029c4 46=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts
21bad921 47
376d9008
JB
48As a compatibility measure, the C<use utf8> pragma must be explicitly
49included to enable recognition of UTF-8 in the Perl scripts themselves
1bfb14c4
JH
50(in string or regular expression literals, or in identifier names) on
51ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
376d9008 52machines. B<These are the only times when an explicit C<use utf8>
8f8cf39c 53is needed.> See L<utf8>.
21bad921 54
7aa207d6
JH
55=item BOM-marked scripts and UTF-16 scripts autodetected
56
If a Perl script begins with the Unicode BOM (UTF-16LE, UTF-16BE,
or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either
endianness, Perl will correctly read in the script as Unicode.
(BOMless UTF-8 cannot be effectively recognized or differentiated from
ISO 8859-1 or other eight-bit encodings.)
62
990e18f7
AT
63=item C<use encoding> needed to upgrade non-Latin-1 byte strings
64
By default, there is a fundamental asymmetry in Perl's Unicode model:
implicit upgrading from byte strings to Unicode strings assumes that
they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
downgraded with UTF-8 encoding. This happens because the first 256
code points in Unicode happen to agree with Latin-1.
990e18f7 70
990e18f7
AT
71See L</"Byte and Character Semantics"> for more details.
72
21bad921
GS
73=back
74
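As a rough illustration of the "Input and Output Layers" item above, the
following sketch reads the same two bytes twice: once with no layer and once
through a decoding layer. The in-memory filehandle and the variable names are
assumptions made for the example, and the strict C<UTF-8> spelling is used in
place of C<utf8>.

```perl
use strict;
use warnings;

# Two bytes: the UTF-8 encoding of U+00E9 (LATIN SMALL LETTER E WITH ACUTE).
# (Illustrative data; an in-memory filehandle stands in for a real file.)
my $utf8_bytes = "\xC3\xA9";

# No layer: Perl sees two separate one-byte characters.
open(my $raw_fh, '<', \$utf8_bytes)
    or die "Cannot open in-memory file: $!";
my $raw = <$raw_fh>;
close($raw_fh);

# With a decoding layer: the two bytes become a single character.
open(my $utf8_fh, '<:encoding(UTF-8)', \$utf8_bytes)
    or die "Cannot open in-memory file: $!";
my $decoded = <$utf8_fh>;
close($utf8_fh);

printf "raw length: %d, decoded length: %d, decoded code point: U+%04X\n",
    length($raw), length($decoded), ord($decoded);
```

The same layer syntax applies when opening real files for reading or writing.
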
376d9008 75=head2 Byte and Character Semantics
393fec97 76
b9cedb1b 77Perl uses logically-wide characters to represent strings internally.
393fec97 78
42581d5d
KW
79Starting in Perl 5.14, Perl-level operations work with
80characters rather than bytes within the scope of a
81C<L<use feature 'unicode_strings'|feature>> (or equivalently
82C<use 5.012> or higher). (This is not true if bytes have been
b19eb496 83explicitly requested by C<L<use bytes|bytes>>, nor necessarily true
42581d5d
KW
84for interactions with the platform's operating system.)
85
86For earlier Perls, and when C<unicode_strings> is not in effect, Perl
87provides a fairly safe environment that can handle both types of
88semantics in programs. For operations where Perl can unambiguously
89decide that the input data are characters, Perl switches to character
90semantics. For operations where this determination cannot be made
91without additional information from the user, Perl decides in favor of
92compatibility and chooses to use byte semantics.
93
66cbab2c
KW
94When C<use locale> (but not C<use locale ':not_characters'>) is in
95effect, Perl uses the semantics associated with the current locale.
96(C<use locale> overrides C<use feature 'unicode_strings'> in the same scope;
97while C<use locale ':not_characters'> effectively also selects
98C<use feature 'unicode_strings'> in its scope; see L<perllocale>.)
99Otherwise, Perl uses the platform's native
42581d5d
KW
100byte semantics for characters whose code points are less than 256, and
101Unicode semantics for those greater than 255. On EBCDIC platforms, this
102is almost seamless, as the EBCDIC code pages that Perl handles are
103equivalent to Unicode's first 256 code points. (The exception is that
EBCDIC regular expression case-insensitive matching rules are not
as robust as Unicode's.) But on ASCII platforms, Perl uses US-ASCII
(or Basic Latin in Unicode terminology) byte semantics, meaning that characters
whose ordinal numbers are in the range 128 - 255 are undefined except for their
ordinal numbers. This means that none of them have case (upper and lower), nor
is any a member of character classes such as C<[:alpha:]> or C<\w>. (But all do
belong to the C<\W> class or the Perl regular expression extension C<[:^alpha:]>.)
2bbc8d55 111
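The difference described above can be seen with a single character in the
128 - 255 range. This is only a sketch: it assumes an ASCII platform, Perl
5.14 or later (so that the feature also affects pattern matching), and no
C<use locale> in effect. The variable names are illustrative.

```perl
use strict;
use warnings;

my $e_acute = "\xE9";    # code point 0xE9, held as a single byte

# Native (byte) semantics: no case mapping, and not a word character.
my ($native_uc, $native_is_word);
{
    no feature 'unicode_strings';
    $native_uc      = uc($e_acute);
    $native_is_word = $e_acute =~ /\w/ ? 1 : 0;
}

# Unicode semantics: U+00E9 uppercases to U+00C9 and matches \w.
my ($unicode_uc, $unicode_is_word);
{
    use feature 'unicode_strings';
    $unicode_uc      = uc($e_acute);
    $unicode_is_word = $e_acute =~ /\w/ ? 1 : 0;
}

printf "native:  uc gives 0x%02X, word character: %d\n",
    ord($native_uc), $native_is_word;
printf "unicode: uc gives 0x%02X, word character: %d\n",
    ord($unicode_uc), $unicode_is_word;
```
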
8cbd9a7a 112This behavior preserves compatibility with earlier versions of Perl,
376d9008 113which allowed byte semantics in Perl operations only if
e1b711da 114none of the program's inputs were marked as being a source of Unicode
8cbd9a7a
GS
115character data. Such data may come from filehandles, from calls to
116external programs, from information provided by the system (such as %ENV),
21bad921 117or from literals and constants in the source text.
8cbd9a7a 118
8cbd9a7a 119The C<utf8> pragma is primarily a compatibility device that enables
75daf61c 120recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
376d9008
JB
121Note that this pragma is only required while Perl defaults to byte
122semantics; when character semantics become the default, this pragma
123may become a no-op. See L<utf8>.
124
376d9008 125If strings operating under byte semantics and strings with Unicode
51f494cc 126character data are concatenated, the new string will have
d9b01026
KW
127character semantics. This can cause surprises: See L</BUGS>, below.
128You can choose to be warned when this happens. See L<encoding::warnings>.
7dedd01f 129
feda178f 130Under character semantics, many operations that formerly operated on
376d9008 131bytes now operate on characters. A character in Perl is
feda178f 132logically just a number ranging from 0 to 2**31 or so. Larger
376d9008
JB
133characters may encode into longer sequences of bytes internally, but
134this internal detail is mostly hidden for Perl code.
135See L<perluniintro> for more.
393fec97 136
376d9008 137=head2 Effects of Character Semantics
393fec97
GS
138
139Character semantics have the following effects:
140
141=over 4
142
143=item *
144
376d9008 145Strings--including hash keys--and regular expression patterns may
574c8022 146contain characters that have an ordinal value larger than 255.
393fec97 147
2575c402
JW
148If you use a Unicode editor to edit your program, Unicode characters may
149occur directly within the literal strings in UTF-8 encoding, or UTF-16.
150(The former requires a BOM or C<use utf8>, the latter requires a BOM.)
3e4dbfed 151
195e542a
KW
152Unicode characters can also be added to a string by using the C<\N{U+...}>
153notation. The Unicode code for the desired character, in hexadecimal,
154should be placed in the braces, after the C<U>. For instance, a smiley face is
6f335b04
KW
155C<\N{U+263A}>.
156
195e542a
KW
157Alternatively, you can use the C<\x{...}> notation for characters 0x100 and
158above. For characters below 0x100 you may get byte semantics instead of
6f335b04 159character semantics; see L</The "Unicode Bug">. On EBCDIC machines there is
195e542a 160the additional problem that the value for such characters gives the EBCDIC
0bd42786
KW
161character rather than the Unicode one, thus it is more portable to use
162C<\N{U+...}> instead.
3e4dbfed 163
fbb93542
KW
164Additionally, you can use the C<\N{...}> notation and put the official
165Unicode character name within the braces, such as
166C<\N{WHITE SMILING FACE}>. This automatically loads the L<charnames>
167module with the C<:full> and C<:short> options. If you prefer different
168options for this module, you can instead, before the C<\N{...}>,
169explicitly load it with your desired options; for example,
170
171 use charnames ':loose';
376d9008 172
393fec97
GS
173=item *
174
574c8022
JH
175If an appropriate L<encoding> is specified, identifiers within the
176Perl script may contain Unicode alphanumeric characters, including
376d9008
JB
177ideographs. Perl does not currently attempt to canonicalize variable
178names.
393fec97 179
393fec97
GS
180=item *
181
1bfb14c4 182Regular expressions match characters instead of bytes. "." matches
2575c402 183a character instead of a byte.
393fec97 184
393fec97
GS
185=item *
186
9d1c51c1 187Bracketed character classes in regular expressions match characters instead of
376d9008 188bytes and match against the character properties specified in the
1bfb14c4 189Unicode properties database. C<\w> can be used to match a Japanese
75daf61c 190ideograph, for instance.
393fec97 191
393fec97
GS
192=item *
193
9d1c51c1
KW
194Named Unicode properties, scripts, and block ranges may be used (like bracketed
195character classes) by using the C<\p{}> "matches property" construct and
822502e5 196the C<\P{}> negation, "doesn't match property".
2575c402 197See L</"Unicode Character Properties"> for more details.
822502e5
TS
198
199You can define your own character properties and use them
200in the regular expression with the C<\p{}> or C<\P{}> construct.
822502e5
TS
201See L</"User-Defined Character Properties"> for more details.
202
203=item *
204
9f815e24
KW
205The special pattern C<\X> matches a logical character, an "extended grapheme
206cluster" in Standardese. In Unicode what appears to the user to be a single
51f494cc
KW
207character, for example an accented C<G>, may in fact be composed of a sequence
208of characters, in this case a C<G> followed by an accent character. C<\X>
209will match the entire sequence.
822502e5
TS
210
211=item *
212
The C<tr///> operator translates characters instead of bytes. Note
that the C<tr///CU> functionality has been removed. For similar
functionality see C<pack('U0', ...)> and C<pack('C0', ...)>.
216
217=item *
218
219Case translation operators use the Unicode case translation tables
220when character input is provided. Note that C<uc()>, or C<\U> in
221interpolated strings, translates to uppercase, while C<ucfirst>,
222or C<\u> in interpolated strings, translates to titlecase in languages
e1b711da
KW
223that make the distinction (which is equivalent to uppercase in languages
224without the distinction).
822502e5
TS
225
226=item *
227
228Most operators that deal with positions or lengths in a string will
229automatically switch to using character positions, including
230C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
231C<sprintf()>, C<write()>, and C<length()>. An operator that
51f494cc
KW
232specifically does not switch is C<vec()>. Operators that really don't
233care include operators that treat strings as a bucket of bits such as
822502e5
TS
234C<sort()>, and operators dealing with filenames.
235
236=item *
237
51f494cc 238The C<pack()>/C<unpack()> letter C<C> does I<not> change, since it is often
822502e5
TS
239used for byte-oriented formats. Again, think C<char> in the C language.
240
241There is a new C<U> specifier that converts between Unicode characters
242and code points. There is also a C<W> specifier that is the equivalent of
243C<chr>/C<ord> and properly handles character values even if they are above 255.
244
245=item *
246
247The C<chr()> and C<ord()> functions work on characters, similar to
248C<pack("W")> and C<unpack("W")>, I<not> C<pack("C")> and
249C<unpack("C")>. C<pack("C")> and C<unpack("C")> are methods for
250emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.
251While these methods reveal the internal encoding of Unicode strings,
252that is not something one normally needs to care about at all.
253
254=item *
255
The bit string operators, C<& | ^ ~>, can operate on character data.
However, for backward compatibility (such as when bit string operations
are applied to characters that are all less than 256 in ordinal value), one
should not use C<~> (the bit complement) on strings that mix characters
whose ordinal values are less than 256 with characters whose ordinal
values are 256 or greater. Most importantly,
DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
will not hold. The reason for this mathematical I<faux pas> is that
the complement cannot return B<both> the 8-bit (byte-wide) bit
complement B<and> the full character-wide bit complement.
265
266=item *
267
5d1892be 268There is a CPAN module, L<Unicode::Casing>, which allows you to define
628253b8
BF
269your own mappings to be used in C<lc()>, C<lcfirst()>, C<uc()>,
270C<ucfirst()>, and C<fc> (or their double-quoted string inlined
271versions such as C<\U>).
272(Prior to Perl 5.16, this functionality was partially provided
5d1892be
KW
273in the Perl core, but suffered from a number of insurmountable
274drawbacks, so the CPAN module was written instead.)
822502e5
TS
275
276=back
277
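Several of the effects listed above can be checked directly. The following is
a minimal sketch; the variable names are illustrative, and C<charnames> is
loaded explicitly as a precaution for Perls that do not load it automatically.

```perl
use strict;
use warnings;
use charnames ':full';    # explicit load (assumption: harmless where automatic)

# Three spellings of the same character, U+263A.
my $by_code = "\N{U+263A}";
my $by_hex  = "\x{263A}";
my $by_name = "\N{WHITE SMILING FACE}";

# Lengths and positions count characters, not bytes.
my $string = "ab${by_code}cd";
my $len    = length($string);
my $where  = index($string, $by_code);

# uc() gives uppercase and ucfirst() gives titlecase. U+01C6 is a digraph
# whose uppercase (U+01C4) and titlecase (U+01C5) forms differ.
my $upper = uc("\x{1C6}");
my $title = ucfirst("\x{1C6}");

# pack/unpack "W" behave like chr/ord, including above 255.
my $packed     = pack("W", 0x263A);
my ($unpacked) = unpack("W", $by_code);

printf "length %d, index %d, upper U+%04X, title U+%04X\n",
    $len, $where, ord($upper), ord($title);
```
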
278=over 4
279
280=item *
281
282And finally, C<scalar reverse()> reverses by character rather than by byte.
283
284=back
285
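Because C<scalar reverse()> works on individual characters (code points), it
can separate a combining mark from its base character. The sketch below,
using illustrative variable names, contrasts that with reversing the
C<\X> clusters described earlier.

```perl
use strict;
use warnings;

# "G" followed by COMBINING ACUTE ACCENT: two code points, one logical character.
my $accented_g  = "G\x{301}";
my $code_points = length($accented_g);

# \X picks out logical characters: "x", "G" plus accent, "y".
my @clusters = "x${accented_g}y" =~ /(\X)/g;

# reverse() in scalar context reverses character by character.
my $reversed = scalar reverse("ab\x{263A}");

# A plain reverse moves the accent in front of the "G"; reversing the
# \X clusters keeps the pair together.
my $naive   = scalar reverse("x${accented_g}y");
my $grouped = join '', reverse @clusters;

printf "%d code points, %d clusters\n", $code_points, scalar @clusters;
```
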
286=head2 Unicode Character Properties
287
ee88f7b6 288(The only time that Perl considers a sequence of individual code
9d1c51c1
KW
289points as a single logical character is in the C<\X> construct, already
290mentioned above. Therefore "character" in this discussion means a single
ee88f7b6
KW
291Unicode code point.)
292
293Very nearly all Unicode character properties are accessible through
294regular expressions by using the C<\p{}> "matches property" construct
295and the C<\P{}> "doesn't match property" for its negation.
51f494cc 296
9d1c51c1 297For instance, C<\p{Uppercase}> matches any single character with the Unicode
51f494cc
KW
298"Uppercase" property, while C<\p{L}> matches any character with a
299General_Category of "L" (letter) property. Brackets are not
9d1c51c1 300required for single letter property names, so C<\p{L}> is equivalent to C<\pL>.
51f494cc 301
9d1c51c1
KW
302More formally, C<\p{Uppercase}> matches any single character whose Unicode
303Uppercase property value is True, and C<\P{Uppercase}> matches any character
304whose Uppercase property value is False, and they could have been written as
305C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively.
51f494cc 306
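The forms just described can be compared side by side. This is a small
sketch with illustrative variable names; each variable holds 1 if the match
succeeds and 0 otherwise.

```perl
use strict;
use warnings;

# Binary property, short and explicit forms.
my $short_form     = "A" =~ /^\p{Uppercase}$/       ? 1 : 0;
my $explicit_true  = "A" =~ /^\p{Uppercase=True}$/  ? 1 : 0;
my $negated        = "a" =~ /^\P{Uppercase}$/       ? 1 : 0;
my $explicit_false = "a" =~ /^\p{Uppercase=False}$/ ? 1 : 0;

# Single-letter property names may omit the braces.
my $letter_braces = "a" =~ /^\p{L}$/ ? 1 : 0;
my $letter_bare   = "a" =~ /^\pL$/   ? 1 : 0;
my $digit_letter  = "7" =~ /^\pL$/   ? 1 : 0;    # a digit is not a letter

print "Uppercase forms agree\n"
    if $short_form == $explicit_true && $negated == $explicit_false;
```
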
b19eb496 307This formality is needed when properties are not binary; that is, if they can
51f494cc 308take on more values than just True and False. For example, the Bidi_Class (see
b19eb496 309L</"Bidirectional Character Types"> below), can take on several different
51f494cc 310values, such as Left, Right, Whitespace, and others. To match these, one needs
5bff2035
KW
311to specify both the property name (Bidi_Class), AND the value being
312matched against
9d1c51c1 313(Left, Right, etc.). This is done, as in the examples above, by having the
9f815e24 314two components separated by an equal sign (or interchangeably, a colon), like
51f494cc
KW
315C<\p{Bidi_Class: Left}>.
316
317All Unicode-defined character properties may be written in these compound forms
318of C<\p{property=value}> or C<\p{property:value}>, but Perl provides some
319additional properties that are written only in the single form, as well as
320single-form short-cuts for all binary properties and certain others described
321below, in which you may omit the property name and the equals or colon
322separator.
323
324Most Unicode character properties have at least two synonyms (or aliases if you
b19eb496
TC
325prefer): a short one that is easier to type and a longer one that is more
326descriptive and hence easier to understand. Thus the "L" and "Letter" properties
327above are equivalent and can be used interchangeably. Likewise,
51f494cc
KW
328"Upper" is a synonym for "Uppercase", and we could have written
329C<\p{Uppercase}> equivalently as C<\p{Upper}>. Also, there are typically
330various synonyms for the values the property can be. For binary properties,
"True" has 3 synonyms: "T", "Yes", and "Y"; and "False" correspondingly has "F",
"No", and "N". But be careful. A short form of a value for one property may
e1b711da 333not mean the same thing as the same short form for another. Thus, for the
51f494cc
KW
334General_Category property, "L" means "Letter", but for the Bidi_Class property,
335"L" means "Left". A complete list of properties and synonyms is in
336L<perluniprops>.
337
b19eb496 338Upper/lower case differences in property names and values are irrelevant;
51f494cc
KW
339thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>.
340Similarly, you can add or subtract underscores anywhere in the middle of a
341word, so that these are also equivalent to C<\p{U_p_p_e_r}>. And white space
342is irrelevant adjacent to non-word characters, such as the braces and the equals
b19eb496
TC
343or colon separators, so C<\p{ Upper }> and C<\p{ Upper_case : Y }> are
344equivalent to these as well. In fact, white space and even
345hyphens can usually be added or deleted anywhere. So even C<\p{ Up-per case = Yes}> is
equivalent. All this is called "loose-matching" by Unicode. The few places
where stricter matching is used are in the middle of numbers, and in the Perl
extension properties that begin or end with an underscore. Stricter matching
cares about white space (except adjacent to non-word characters),
hyphens, and non-interior underscores.
4193bef7 351
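A few of the equivalent spellings can be tried out as follows. This sketch
uses only a subset of the loose-matching variations described above, and the
variable names are illustrative. The last four checks show that the short
value "L" means different things for different properties.

```perl
use strict;
use warnings;

# Several spellings that should all name the Uppercase property.
my @spellings = (
    qr/\p{Uppercase}/,
    qr/\p{Upper}/,
    qr/\p{upper}/,
    qr/\p{UpPeR}/,
    qr/\p{ Upper }/,
    qr/\p{Uppercase=Yes}/,
    qr/\p{Uppercase:Y}/,
);

my $upper_hits = grep { "A" =~ $_ } @spellings;    # count matching "A"
my $lower_hits = grep { "a" =~ $_ } @spellings;    # count matching "a"

# "L" is Letter for General_Category, but Left-to-Right for Bidi_Class.
my $alef      = "\x{5D0}";    # HEBREW LETTER ALEF: a letter, written right to left
my $a_gc_l    = "A"   =~ /\p{General_Category=L}/ ? 1 : 0;
my $a_bidi_l  = "A"   =~ /\p{Bidi_Class=L}/       ? 1 : 0;
my $alef_gc_l = $alef =~ /\p{General_Category=L}/ ? 1 : 0;
my $alef_bc_l = $alef =~ /\p{Bidi_Class=L}/       ? 1 : 0;

printf "%d of %d spellings match 'A'\n", $upper_hits, scalar @spellings;
```
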
376d9008
JB
352You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
353(^) between the first brace and the property name: C<\p{^Tamil}> is
eb0cc9e3 354equal to C<\P{Tamil}>.
4193bef7 355
56ca34ca
KW
356Almost all properties are immune to case-insensitive matching. That is,
357adding a C</i> regular expression modifier does not change what they
358match. There are two sets that are affected.
359The first set is
360C<Uppercase_Letter>,
361C<Lowercase_Letter>,
362and C<Titlecase_Letter>,
363all of which match C<Cased_Letter> under C</i> matching.
364And the second set is
365C<Uppercase>,
366C<Lowercase>,
367and C<Titlecase>,
368all of which match C<Cased> under C</i> matching.
369This set also includes its subsets C<PosixUpper> and C<PosixLower> both
370of which under C</i> matching match C<PosixAlpha>.
371(The difference between these sets is that some things, such as Roman
b19eb496 372numerals, come in both upper and lower case so they are C<Cased>, but aren't considered
56ca34ca 373letters, so they aren't C<Cased_Letter>s.)
56ca34ca 374
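The caret negation and the two C</i> exceptions can be sketched as below.
The example assumes a Perl recent enough to implement the C</i> behavior
described above (5.14 or later is assumed here); the variable names and the
choice of sample characters are illustrative.

```perl
use strict;
use warnings;

# Caret negation: \p{^Tamil} behaves like \P{Tamil}.
my $tamil_a        = "\x{B85}";    # TAMIL LETTER A
my $plain_match    = $tamil_a =~ /\p{Tamil}/  ? 1 : 0;
my $caret_negation = $tamil_a =~ /\p{^Tamil}/ ? 1 : 0;
my $big_p_negation = $tamil_a =~ /\P{Tamil}/  ? 1 : 0;

# Under /i, Uppercase_Letter matches any Cased_Letter.
my $lu_plain = "a" =~ /\p{Uppercase_Letter}/  ? 1 : 0;
my $lu_fold  = "a" =~ /\p{Uppercase_Letter}/i ? 1 : 0;

# SMALL ROMAN NUMERAL ONE is Cased, but it is a number, not a letter.
my $roman            = "\x{2170}";
my $roman_upper_fold = $roman =~ /\p{Uppercase}/i        ? 1 : 0;
my $roman_lu_fold    = $roman =~ /\p{Uppercase_Letter}/i ? 1 : 0;

print "caret and \\P negation agree\n" if $caret_negation == $big_p_negation;
```
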
94b42e47
KW
375The result is undefined if you try to match a non-Unicode code point
376(that is, one above 0x10FFFF) against a Unicode property. Currently, a
377warning is raised, and the match will fail. In some cases, this is
378counterintuitive, as both these fail:
379
 chr(0x110000) =~ /\p{ASCII_Hex_Digit=True}/      # Fails.
 chr(0x110000) =~ /\p{ASCII_Hex_Digit=False}/     # Fails!
382
51f494cc 383=head3 B<General_Category>
14bb0a9a 384
51f494cc
KW
385Every Unicode character is assigned a general category, which is the "most
386usual categorization of a character" (from
387L<http://www.unicode.org/reports/tr44>).
822502e5 388
9f815e24 389The compound way of writing these is like C<\p{General_Category=Number}>
51f494cc
KW
390(short, C<\p{gc:n}>). But Perl furnishes shortcuts in which everything up
391through the equal or colon separator is omitted. So you can instead just write
392C<\pN>.
822502e5 393
51f494cc 394Here are the short and long forms of the General Category properties:
393fec97 395
d73e5302
JH
 Short       Long

 L           Letter
 LC, L&      Cased_Letter (that is: [\p{Ll}\p{Lu}\p{Lt}])
 Lu          Uppercase_Letter
 Ll          Lowercase_Letter
 Lt          Titlecase_Letter
 Lm          Modifier_Letter
 Lo          Other_Letter

 M           Mark
 Mn          Nonspacing_Mark
 Mc          Spacing_Mark
 Me          Enclosing_Mark

 N           Number
 Nd          Decimal_Number (also Digit)
 Nl          Letter_Number
 No          Other_Number

 P           Punctuation (also Punct)
 Pc          Connector_Punctuation
 Pd          Dash_Punctuation
 Ps          Open_Punctuation
 Pe          Close_Punctuation
 Pi          Initial_Punctuation
             (may behave like Ps or Pe depending on usage)
 Pf          Final_Punctuation
             (may behave like Ps or Pe depending on usage)
 Po          Other_Punctuation

 S           Symbol
 Sm          Math_Symbol
 Sc          Currency_Symbol
 Sk          Modifier_Symbol
 So          Other_Symbol

 Z           Separator
 Zs          Space_Separator
 Zl          Line_Separator
 Zp          Paragraph_Separator

 C           Other
 Cc          Control (also Cntrl)
 Cf          Format
 Cs          Surrogate
 Co          Private_Use
 Cn          Unassigned
1ac13f9a 444
376d9008 445Single-letter properties match all characters in any of the
3e4dbfed 446two-letter sub-properties starting with the same letter.
b19eb496 447C<LC> and C<L&> are special: both are aliases for the set consisting of everything matched by C<Ll>, C<Lu>, and C<Lt>.
32293815 448
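The relationship between the long, short, and single-letter forms, and
between a single-letter category and its two-letter sub-categories, can be
sketched like this. The sample characters and variable names are illustrative.

```perl
use strict;
use warnings;

# Three ways of asking "is this a Number?"
my $long_form   = "7" =~ /\p{General_Category=Number}/ ? 1 : 0;
my $short_form  = "7" =~ /\p{gc:n}/                    ? 1 : 0;
my $single_form = "7" =~ /\pN/                         ? 1 : 0;

# A single-letter property covers all of its two-letter sub-properties.
my $roman_five = "\x{2164}";    # ROMAN NUMERAL FIVE, category Nl
my $five_is_n  = $roman_five =~ /\pN/    ? 1 : 0;
my $five_is_nl = $roman_five =~ /\p{Nl}/ ? 1 : 0;
my $five_is_nd = $roman_five =~ /\p{Nd}/ ? 1 : 0;

# LC and L& cover only Ll, Lu, and Lt; an Lo letter is in L but not in LC.
my $alef       = "\x{5D0}";     # HEBREW LETTER ALEF, category Lo
my $a_is_lc    = "a"   =~ /\p{LC}/ ? 1 : 0;
my $a_is_l_amp = "a"   =~ /\p{L&}/ ? 1 : 0;
my $alef_is_l  = $alef =~ /\pL/    ? 1 : 0;
my $alef_is_lc = $alef =~ /\p{LC}/ ? 1 : 0;

print "all three Number forms agree\n"
    if $long_form == $short_form && $short_form == $single_form;
```
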
51f494cc 449=head3 B<Bidirectional Character Types>
822502e5 450
Because scripts differ in their directionality (Hebrew and Arabic are
written right to left, for example), Unicode supplies these values for
the Bidi_Class property:
32293815 454
 Property    Meaning

 L           Left-to-Right
 LRE         Left-to-Right Embedding
 LRO         Left-to-Right Override
 R           Right-to-Left
 AL          Arabic Letter
 RLE         Right-to-Left Embedding
 RLO         Right-to-Left Override
 PDF         Pop Directional Format
 EN          European Number
 ES          European Separator
 ET          European Terminator
 AN          Arabic Number
 CS          Common Separator
 NSM         Non-Spacing Mark
 BN          Boundary Neutral
 B           Paragraph Separator
 S           Segment Separator
 WS          Whitespace
 ON          Other Neutrals
476
51f494cc
KW
477This property is always written in the compound form.
478For example, C<\p{Bidi_Class:R}> matches characters that are normally
eb0cc9e3
JH
479written right to left.
480
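A few values from the table can be checked as follows. This is a small
sketch using the short value names shown above; the sample characters and
variable names are illustrative.

```perl
use strict;
use warnings;

my $hebrew_alef = "\x{5D0}";    # HEBREW LETTER ALEF
my $arabic_alef = "\x{627}";    # ARABIC LETTER ALEF

my $alef_is_r    = $hebrew_alef =~ /\p{Bidi_Class:R}/  ? 1 : 0;
my $alef_is_l    = $hebrew_alef =~ /\p{Bidi_Class:L}/  ? 1 : 0;
my $latin_is_l   = "A"          =~ /\p{Bidi_Class:L}/  ? 1 : 0;
my $arabic_is_al = $arabic_alef =~ /\p{Bidi_Class=AL}/ ? 1 : 0;
my $digit_is_en  = "1"          =~ /\p{Bidi_Class=EN}/ ? 1 : 0;
my $space_is_ws  = " "          =~ /\p{Bidi_Class=WS}/ ? 1 : 0;

print "Hebrew alef is right-to-left\n" if $alef_is_r && !$alef_is_l;
```
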
51f494cc
KW
481=head3 B<Scripts>
482
b19eb496 483The world's languages are written in many different scripts. This sentence
e1b711da 484(unless you're reading it in translation) is written in Latin, while Russian is
c69ca1d4 485written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in
e1b711da 486Hiragana or Katakana. There are many more.
51f494cc 487
82aed44a
KW
488The Unicode Script and Script_Extensions properties give what script a
489given character is in. Either property can be specified with the
490compound form like
491C<\p{Script=Hebrew}> (short: C<\p{sc=hebr}>), or
492C<\p{Script_Extensions=Javanese}> (short: C<\p{scx=java}>).
493In addition, Perl furnishes shortcuts for all
494C<Script> property names. You can omit everything up through the equals
495(or colon), and simply write C<\p{Latin}> or C<\P{Cyrillic}>.
496(This is not true for C<Script_Extensions>, which is required to be
497written in the compound form.)
498
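The compound forms and the C<Script> shortcut can be compared as in this
sketch. It assumes a Perl whose Unicode data includes C<Script_Extensions>;
the sample characters and variable names are illustrative.

```perl
use strict;
use warnings;

my $alef = "\x{5D0}";    # HEBREW LETTER ALEF
my $zhe  = "\x{436}";    # CYRILLIC SMALL LETTER ZHE

# Long compound form, short compound form, and single-form shortcut.
my $long_form  = $alef =~ /\p{Script=Hebrew}/ ? 1 : 0;
my $short_form = $alef =~ /\p{sc=hebr}/       ? 1 : 0;
my $shortcut   = $alef =~ /\p{Hebrew}/        ? 1 : 0;

# Script_Extensions must be written in the compound form.
my $scx_form = $alef =~ /\p{Script_Extensions=Hebrew}/ ? 1 : 0;

# Shortcuts for other scripts, including a negated one.
my $a_is_latin     = "a"  =~ /\p{Latin}/    ? 1 : 0;
my $a_not_cyrillic = "a"  =~ /\P{Cyrillic}/ ? 1 : 0;
my $zhe_cyrillic   = $zhe =~ /\p{Cyrillic}/ ? 1 : 0;

print "all Hebrew spellings agree\n"
    if $long_form && $short_form && $shortcut && $scx_form;
```
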
499The difference between these two properties involves characters that are
500used in multiple scripts. For example the digits '0' through '9' are
501used in many parts of the world. These are placed in a script named
502C<Common>. Other characters are used in just a few scripts. For
503example, the "KATAKANA-HIRAGANA DOUBLE HYPHEN" is used in both Japanese
504scripts, Katakana and Hiragana, but nowhere else. The C<Script>
505property places all characters that are used in multiple scripts in the
506C<Common> script, while the C<Script_Extensions> property places those
507that are used in only a few scripts into each of those scripts; while
508still using C<Common> for those used in many scripts. Thus both these
509match:
510
511 "0" =~ /\p{sc=Common}/ # Matches
512 "0" =~ /\p{scx=Common}/ # Matches
513
and only the first of these matches:

 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Common}/   # Matches
 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Common}/  # No match

And only the last two of these match:

 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Hiragana}/   # No match
 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Katakana}/   # No match
 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Hiragana}/  # Matches
 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Katakana}/  # Matches
525
526C<Script_Extensions> is thus an improved C<Script>, in which there are
527fewer characters in the C<Common> script, and correspondingly more in
528other scripts. It is new in Unicode version 6.0, and its data are likely
529to change significantly in later releases, as things get sorted out.
530
(Actually, besides C<Common>, the C<Inherited> script contains
characters that are used in multiple scripts. These are modifier
characters which modify other characters, and inherit the script value
of the controlling character. Some of these are used in many scripts,
and so go into C<Inherited> in both C<Script> and C<Script_Extensions>.
Others are used in just a few scripts, so are in C<Inherited> in
C<Script>, but not in C<Script_Extensions>.)
538
It is worth stressing that there are several different sets of digits in
Unicode that are equivalent to 0-9 and are matchable by C<\d> in a
regular expression. If they are used in a single script only, they
are in that script's C<Script> and C<Script_Extensions>. If they are
used in more than one script, they will be in C<sc=Common>, but only
if they are used in many scripts should they be in C<scx=Common>.
51f494cc
KW
545
546A complete list of scripts and their shortcuts is in L<perluniprops>.
547
51f494cc 548=head3 B<Use of "Is" Prefix>
822502e5 549
1bfb14c4 550For backward compatibility (with Perl 5.6), all properties mentioned
51f494cc
KW
551so far may have C<Is> or C<Is_> prepended to their name, so C<\P{Is_Lu}>, for
552example, is equal to C<\P{Lu}>, and C<\p{IsScript:Arabic}> is equal to
553C<\p{Arabic}>.
eb0cc9e3 554
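The optional prefix can be checked with a few single-form properties, as in
this sketch. The variable names and sample characters are illustrative.

```perl
use strict;
use warnings;

# With and without the "Is" prefix, and with the optional underscore.
my $plain          = "A" =~ /\p{Lu}/    ? 1 : 0;
my $with_is        = "A" =~ /\p{IsLu}/  ? 1 : 0;
my $with_is_uscore = "A" =~ /\p{Is_Lu}/ ? 1 : 0;

# The prefix works in the negated form too.
my $negated_plain = "a" =~ /\P{Lu}/    ? 1 : 0;
my $negated_is    = "a" =~ /\P{Is_Lu}/ ? 1 : 0;

# Script shortcuts accept the prefix as well.
my $arabic_alef  = "\x{627}";    # ARABIC LETTER ALEF
my $script_plain = $arabic_alef =~ /\p{Arabic}/   ? 1 : 0;
my $script_is    = $arabic_alef =~ /\p{IsArabic}/ ? 1 : 0;

print "prefixed and unprefixed forms agree\n"
    if $plain == $with_is && $with_is == $with_is_uscore;
```
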
51f494cc 555=head3 B<Blocks>
2796c109 556
1bfb14c4
JH
557In addition to B<scripts>, Unicode also defines B<blocks> of
558characters. The difference between scripts and blocks is that the
559concept of scripts is closer to natural languages, while the concept
51f494cc 560of blocks is more of an artificial grouping based on groups of Unicode
9f815e24 561characters with consecutive ordinal values. For example, the "Basic Latin"
b19eb496 562block is all characters whose ordinals are between 0 and 127, inclusive; in
9f815e24 563other words, the ASCII characters. The "Latin" script contains some letters
b19eb496 564from this as well as several other blocks, like "Latin-1 Supplement",
9d1c51c1 565"Latin Extended-A", etc., but it does not contain all the characters from
7be67b37 566those blocks. It does not, for example, contain the digits 0-9, because
82aed44a
KW
567those digits are shared across many scripts, and hence are in the
568C<Common> script.
51f494cc
KW
569
570For more about scripts versus blocks, see UAX#24 "Unicode Script Property":
571L<http://www.unicode.org/reports/tr24>
572
82aed44a
KW
573The C<Script> or C<Script_Extensions> properties are likely to be the
574ones you want to use when processing
b19eb496
TC
575natural language; the Block property may occasionally be useful in working
576with the nuts and bolts of Unicode.
51f494cc
KW
577
578Block names are matched in the compound form, like C<\p{Block: Arrows}> or
b19eb496 579C<\p{Blk=Hebrew}>. Unlike most other properties, only a few block names have a
51f494cc
KW
Unicode-defined short name. But Perl does provide a (slight) shortcut: You
can say, for example, C<\p{In_Arrows}> or C<\p{In_Hebrew}>. For backwards
compatibility, the C<In> prefix may be omitted if there is no naming conflict
with a script or any other property, and you can even use an C<Is> prefix
instead in those cases. But it is not a good idea to do this, for a couple
of reasons:
586
587=over 4
588
589=item 1
590
591It is confusing. There are many naming conflicts, and you may forget some.
9f815e24 592For example, C<\p{Hebrew}> means the I<script> Hebrew, and NOT the I<block>
51f494cc
KW
593Hebrew. But would you remember that 6 months from now?
594
595=item 2
596
597It is unstable. A new version of Unicode may pre-empt the current meaning by
598creating a property with the same name. There was a time in very early Unicode
9f815e24 599releases when C<\p{Hebrew}> would have matched the I<block> Hebrew; now it
51f494cc 600doesn't.
32293815 601
393fec97
GS
602=back
603
b19eb496
TC
604Some people prefer to always use C<\p{Block: foo}> and C<\p{Script: bar}>
605instead of the shortcuts, whether for clarity, because they can't remember the
606difference between 'In' and 'Is' anyway, or they aren't confident that those who
607eventually will read their code will know that difference.
51f494cc
KW
608
609A complete list of blocks and their shortcuts is in L<perluniprops>.
610
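The block forms, and the difference between a block and a script, can be
sketched as follows. The sample characters and variable names are
illustrative.

```perl
use strict;
use warnings;

# Three ways of naming the Arrows block.
my $arrow         = "\x{2190}";    # LEFTWARDS ARROW
my $block_long    = $arrow =~ /\p{Block: Arrows}/ ? 1 : 0;
my $block_short   = $arrow =~ /\p{Blk=Arrows}/    ? 1 : 0;
my $block_in_form = $arrow =~ /\p{In_Arrows}/     ? 1 : 0;

# Block versus script: the digit "0" lies in the Basic Latin block, but
# its script is Common, not Latin.
my $zero_in_block  = "0" =~ /\p{Block=Basic_Latin}/ ? 1 : 0;
my $zero_is_latin  = "0" =~ /\p{Script=Latin}/      ? 1 : 0;
my $zero_is_common = "0" =~ /\p{Script=Common}/     ? 1 : 0;

# The letter "A" is in both the block and the script.
my $a_in_block = "A" =~ /\p{Block=Basic_Latin}/ ? 1 : 0;
my $a_is_latin = "A" =~ /\p{Script=Latin}/      ? 1 : 0;

print "'0' is in the Basic Latin block but not the Latin script\n"
    if $zero_in_block && !$zero_is_latin;
```
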
9f815e24
KW
611=head3 B<Other Properties>
612
613There are many more properties than the very basic ones described here.
614A complete list is in L<perluniprops>.
615
616Unicode defines all its properties in the compound form, so all single-form
b19eb496
TC
617properties are Perl extensions. Most of these are just synonyms for the
618Unicode ones, but some are genuine extensions, including several that are in
9f815e24
KW
619the compound form. And quite a few of these are actually recommended by Unicode
620(in L<http://www.unicode.org/reports/tr18>).
621
5bff2035
KW
This section gives some details on all extensions that aren't just
synonyms for compound-form Unicode properties
(for those properties, you'll have to refer to the
L<Unicode Standard|http://www.unicode.org/reports/tr44>).
626
627=over
628
629=item B<C<\p{All}>>
630
631This matches any of the 1_114_112 Unicode code points. It is a synonym for
632C<\p{Any}>.
633
634=item B<C<\p{Alnum}>>
635
636This matches any C<\p{Alphabetic}> or C<\p{Decimal_Number}> character.
637
638=item B<C<\p{Any}>>
639
640This matches any of the 1_114_112 Unicode code points. It is a synonym for
641C<\p{All}>.
642
42581d5d
KW
643=item B<C<\p{ASCII}>>
644
645This matches any of the 128 characters in the US-ASCII character set,
646which is a subset of Unicode.
647
9f815e24
KW
648=item B<C<\p{Assigned}>>
649
650This matches any assigned code point; that is, any code point whose general
651category is not Unassigned (or equivalently, not Cn).
652
653=item B<C<\p{Blank}>>
654
655This is the same as C<\h> and C<\p{HorizSpace}>: A character that changes the
656spacing horizontally.
657
658=item B<C<\p{Decomposition_Type: Non_Canonical}>> (Short: C<\p{Dt=NonCanon}>)
659
660Matches a character that has a non-canonical decomposition.
661
To understand the use of this rarely used property=value combination, it is
necessary to know some basics about decomposition.
Consider a character, say H. It could appear with various marks around it,
such as an acute accent, or a circumflex, or various hooks, circles, arrows,
I<etc.>, above, below, or to one side or the other. There are many
possibilities among the world's languages. The number of combinations is
astronomical, and if there were a character for each combination, it would
soon exhaust Unicode's more than a million possible characters. So Unicode
took a different approach: there is a character for the base H, and a
character for each of the possible marks, and these can be variously combined
to get a final logical character. So a logical character--what appears to be a
single character--can be a sequence of more than one individual character.
This is called an "extended grapheme cluster"; Perl furnishes the C<\X>
regular expression construct to match such sequences.
9f815e24
KW
676
677But Unicode's intent is to unify the existing character set standards and
b19eb496 678practices, and several pre-existing standards have single characters that
9f815e24
KW
679mean the same thing as some of these combinations. An example is ISO-8859-1,
680which has quite a few of these in the Latin-1 range, an example being "LATIN
681CAPITAL LETTER E WITH ACUTE". Because this character was in this pre-existing
682standard, Unicode added it to its repertoire. But this character is considered
b19eb496
TC
683by Unicode to be equivalent to the sequence consisting of the character
684"LATIN CAPITAL LETTER E" followed by the character "COMBINING ACUTE ACCENT".
9f815e24
KW
685
686"LATIN CAPITAL LETTER E WITH ACUTE" is called a "pre-composed" character, and
b19eb496 687its equivalence with the sequence is called canonical equivalence. All
9f815e24 688pre-composed characters are said to have a decomposition (into the equivalent
b19eb496 689sequence), and the decomposition type is also called canonical.
9f815e24
KW
690
691However, many more characters have a different type of decomposition, a
692"compatible" or "non-canonical" decomposition. The sequences that form these
693decompositions are not considered canonically equivalent to the pre-composed
694character. An example, again in the Latin-1 range, is the "SUPERSCRIPT ONE".
b19eb496 695It is somewhat like a regular digit 1, but not exactly; its decomposition
9f815e24
KW
696into the digit 1 is called a "compatible" decomposition, specifically a
697"super" decomposition. There are several such compatibility
698decompositions (see L<http://www.unicode.org/reports/tr44>), including one
b19eb496 699called "compat", which means some miscellaneous type of decomposition
42581d5d 700that doesn't fit into the decomposition categories that Unicode has chosen.
9f815e24
KW
701
702Note that most Unicode characters don't have a decomposition, so their
703decomposition type is "None".
704
b19eb496
TC
705For your convenience, Perl has added the C<Non_Canonical> decomposition
706type to mean any of the several compatibility decompositions.
9f815e24
KW
707
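As a rough illustration of the three cases described above, the following
sketch classifies a character by its decomposition type. The helper name
C<decomposition_kind> is made up for this example; C<\p{Dt=Canonical}> is
the standard Unicode compound form, and C<\p{Dt=NonCanon}> is the Perl
extension described in this item.

```perl
use strict;
use warnings;

# Hypothetical helper: report which kind of decomposition a single
# character has, using the properties discussed in this item.
sub decomposition_kind {
    my ($char) = @_;
    return 'non-canonical' if $char =~ /\A\p{Dt=NonCanon}\z/;
    return 'canonical'     if $char =~ /\A\p{Dt=Canonical}\z/;
    return 'none';
}

# SUPERSCRIPT ONE, LATIN CAPITAL LETTER E WITH ACUTE, and plain H
for my $char ("\x{B9}", "\x{C9}", "H") {
    printf "U+%04X: %s\n", ord($char), decomposition_kind($char);
}
```

The C<\p{...}> constructs use Unicode rules even for characters in the
Latin-1 range, so no pragma is needed for this to work.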
708=item B<C<\p{Graph}>>
709
710Matches any character that is graphic. Theoretically, this means a character
711that on a printer would cause ink to be used.
712
713=item B<C<\p{HorizSpace}>>
714
b19eb496 715This is the same as C<\h> and C<\p{Blank}>: a character that changes the
9f815e24
KW
716spacing horizontally.
717
42581d5d 718=item B<C<\p{In=*}>>
9f815e24
KW
719
This is a synonym for C<\p{Present_In=*}>.
721
722=item B<C<\p{PerlSpace}>>
723
724This is the same as C<\s>, restricted to ASCII, namely C<S<[ \f\n\r\t]>>.
725
726Mnemonic: Perl's (original) space
727
728=item B<C<\p{PerlWord}>>
729
This is the same as C<\w>, restricted to ASCII, namely C<[A-Za-z0-9_]>.
731
732Mnemonic: Perl's (original) word.
733
42581d5d 734=item B<C<\p{Posix...}>>
9f815e24 735
b19eb496
TC
736There are several of these, which are equivalents using the C<\p>
737notation for Posix classes and are described in
42581d5d 738L<perlrecharclass/POSIX Character Classes>.
9f815e24
KW
739
740=item B<C<\p{Present_In: *}>> (Short: C<\p{In=*}>)
741
742This property is used when you need to know in what Unicode version(s) a
743character is.
744
The "*" above stands for some two-digit Unicode version number, such as
746C<1.1> or C<4.0>; or the "*" can also be C<Unassigned>. This property will
747match the code points whose final disposition has been settled as of the
748Unicode release given by the version number; C<\p{Present_In: Unassigned}>
749will match those code points whose meaning has yet to be assigned.
750
751For example, C<U+0041> "LATIN CAPITAL LETTER A" was present in the very first
752Unicode release available, which is C<1.1>, so this property is true for all
753valid "*" versions. On the other hand, C<U+1EFF> was not assigned until version
7545.1 when it became "LATIN SMALL LETTER Y WITH LOOP", so the only "*" that
755would match it are 5.1, 5.2, and later.
756
757Unicode furnishes the C<Age> property from which this is derived. The problem
758with Age is that a strict interpretation of it (which Perl takes) has it
759matching the precise release a code point's meaning is introduced in. Thus
760C<U+0041> would match only 1.1; and C<U+1EFF> only 5.1. This is not usually what
761you want.
762
763Some non-Perl implementations of the Age property may change its meaning to be
764the same as the Perl Present_In property; just be aware of that.
765
766Another confusion with both these properties is that the definition is not
b19eb496
TC
767that the code point has been I<assigned>, but that the meaning of the code point
768has been I<determined>. This is because 66 code points will always be
769unassigned, and so the Age for them is the Unicode version in which the decision
770to make them so was made. For example, C<U+FDD0> is to be permanently
9f815e24 771unassigned to a character, and the decision to do that was made in version 3.1,
b19eb496 772so C<\p{Age=3.1}> matches this character, as also does C<\p{Present_In: 3.1}> and up.
9f815e24
KW
773
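The difference between C<Age> and C<Present_In> can be seen with the two
code points used as examples above. This is only a sketch; the subroutine
names are invented for the illustration.

```perl
use strict;
use warnings;

# Age matches only the release in which the code point was introduced;
# Present_In matches that release and every later one.
sub introduced_in_5_1 { return $_[0] =~ /\A\p{Age=5.1}\z/        ? 1 : 0 }
sub present_in_5_1    { return $_[0] =~ /\A\p{Present_In=5.1}\z/ ? 1 : 0 }

for my $char ("A", "\x{1EFF}") {
    printf "U+%04X  Age=5.1: %d  Present_In=5.1: %d\n",
        ord($char), introduced_in_5_1($char), present_in_5_1($char);
}
```

U+0041 has been present since 1.1, so only C<Present_In> matches it for
version 5.1, while U+1EFF, introduced in 5.1, matches both.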
774=item B<C<\p{Print}>>
775
ae5b72c8 776This matches any character that is graphical or blank, except controls.
9f815e24
KW
777
778=item B<C<\p{SpacePerl}>>
779
780This is the same as C<\s>, including beyond ASCII.
781
4d4acfba 782Mnemonic: Space, as modified by Perl. (It doesn't include the vertical tab
b19eb496 783which both the Posix standard and Unicode consider white space.)
9f815e24 784
785=item B<C<\p{Title}>> and B<C<\p{Titlecase}>>
786
Under case-sensitive matching, these both match the same code points as
C<\p{General Category=Titlecase_Letter}> (C<\p{gc=lt}>). The difference
is that under C</i> caseless matching, these match the same as
C<\p{Cased}>, whereas C<\p{gc=lt}> matches C<\p{Cased_Letter}>.
791
792=item B<C<\p{VertSpace}>>
793
794This is the same as C<\v>: A character that changes the spacing vertically.
795
796=item B<C<\p{Word}>>
797
b19eb496 798This is the same as C<\w>, including over 100_000 characters beyond ASCII.
9f815e24 799
42581d5d
KW
800=item B<C<\p{XPosix...}>>
801
b19eb496 802There are several of these, which are the standard Posix classes
42581d5d
KW
803extended to the full Unicode range. They are described in
804L<perlrecharclass/POSIX Character Classes>.
805
9f815e24
KW
806=back
807
376d9008 808=head2 User-Defined Character Properties
491fd90a 809
51f494cc
KW
810You can define your own binary character properties by defining subroutines
811whose names begin with "In" or "Is". The subroutines can be defined in any
812package. The user-defined properties can be used in the regular expression
813C<\p> and C<\P> constructs; if you are using a user-defined property from a
814package other than the one you are in, you must specify its package in the
815C<\p> or C<\P> construct.
bac0b425 816
    # assuming property IsForeign defined in Lang::
    package main;  # property package name required
    if ($txt =~ /\p{Lang::IsForeign}+/) { ... }

    package Lang;  # property package name not required
    if ($txt =~ /\p{IsForeign}+/) { ... }
823
824
Note that the effect is compile-time and immutable once defined.
However, the subroutines are passed a single parameter, which is 0 if
case-sensitive matching is in effect and non-zero if caseless matching
is in effect. The subroutine may return different values depending on
the value of the flag, and one set of values will immutably be in effect
for all case-sensitive matches, and the other set for all case-insensitive
matches.
491fd90a 832
Note that if the regular expression is tainted, and the name of the
subroutine is therefore determined by tainted data, Perl will die rather
than call the subroutine.
836
376d9008
JB
837The subroutines must return a specially-formatted string, with one
838or more newline-separated lines. Each line must be one of the following:
491fd90a
JH
839
840=over 4
841
842=item *
843
510254c9
A
844A single hexadecimal number denoting a Unicode code point to include.
845
846=item *
847
Two hexadecimal numbers separated by horizontal whitespace (space or
tab characters) denoting a range of Unicode code points to include.
491fd90a
JH
850
851=item *
852
376d9008 853Something to include, prefixed by "+": a built-in character
830137a2
KW
854property (prefixed by "utf8::") or a fully qualified (including package
855name) user-defined character property,
bac0b425
JP
856to represent all the characters in that property; two hexadecimal code
857points for a range; or a single hexadecimal code point.
491fd90a
JH
858
859=item *
860
376d9008 861Something to exclude, prefixed by "-": an existing character
830137a2
KW
862property (prefixed by "utf8::") or a fully qualified (including package
863name) user-defined character property,
bac0b425
JP
864to represent all the characters in that property; two hexadecimal code
865points for a range; or a single hexadecimal code point.
491fd90a
JH
866
867=item *
868
Something to negate, prefixed by "!": an existing character
property (prefixed by "utf8::") or a fully qualified (including package
name) user-defined character property,
to represent all the characters except those in that property; two
hexadecimal code points for a range; or a single hexadecimal code point.

=item *

Something to intersect with, prefixed by "&": an existing character
property (prefixed by "utf8::") or a fully qualified (including package
name) user-defined character property,
to keep only those characters collected so far that are also in the
property; two hexadecimal code points for a range; or a single
hexadecimal code point.
491fd90a
JH
882
883=back
884
885For example, to define a property that covers both the Japanese
886syllabaries (hiragana and katakana), you can define
887
    sub InKana {
        return <<END;
    3040\t309F
    30A0\t30FF
    END
    }
894
d5822f25
A
895Imagine that the here-doc end marker is at the beginning of the line.
896Now you can use C<\p{InKana}> and C<\P{InKana}>.
491fd90a
JH
897
898You could also have used the existing block property names:
899
    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    END
    }
906
Suppose you wanted to match only the allocated characters,
not the raw block ranges: in other words, you want to remove
the unassigned code points:
910
    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    -utf8::IsCn
    END
    }
918
919The negation is useful for defining (surprise!) negated classes.
920
    sub InNotKana {
        return <<'END';
    !utf8::InHiragana
    -utf8::InKatakana
    +utf8::IsCn
    END
    }
928
461020ad
KW
929This will match all non-Unicode code points, since every one of them is
930not in Kana. You can use intersection to exclude these, if desired, as
931this modified example shows:
bac0b425 932
    sub InNotKana {
        return <<'END';
    !utf8::InHiragana
    -utf8::InKatakana
    +utf8::IsCn
    &utf8::Any
    END
    }
941
461020ad
KW
942C<&utf8::Any> must be the last line in the definition.
943
944Intersection is used generally for getting the common characters matched
945by two (or more) classes. It's important to remember not to use "&" for
946the first set; that would be intersecting with nothing, resulting in an
947empty set.
948
949(Note that official Unicode properties differ from these in that they
950automatically exclude non-Unicode code points and a warning is raised if
951a match is attempted on one of those.)
bac0b425 952
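Putting the pieces of this section together, here is a small
self-contained sketch that defines the C<InKana> property from the
examples above and uses it in a match. The wrapper C<is_kana> is a
hypothetical helper added only for the demonstration; note that the
here-doc terminator really is in column one here.

```perl
use strict;
use warnings;

# User-defined property: the assigned code points of the Hiragana and
# Katakana blocks.  The name must begin with "In" or "Is".
sub InKana {
    return <<'END';
+utf8::InHiragana
+utf8::InKatakana
-utf8::IsCn
END
}

# Hypothetical helper: true if the whole string consists of kana.
sub is_kana {
    my ($str) = @_;
    return $str =~ /\A\p{InKana}+\z/ ? 1 : 0;
}

print is_kana("\x{3042}\x{30A2}") ? "kana\n" : "not kana\n";  # HIRAGANA A, KATAKANA A
print is_kana("\x{3040}")         ? "kana\n" : "not kana\n";  # unassigned code point in the Hiragana block
```

Because of the C<-utf8::IsCn> line, U+3040, which lies inside the
Hiragana block but is not assigned to any character, does not match.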
68585b5e 953=head2 User-Defined Case Mappings (for serious hackers only)
822502e5 954
5d1892be
KW
955B<This feature has been removed as of Perl 5.16.>
956The CPAN module L<Unicode::Casing> provides better functionality without
957the drawbacks that this feature had. If you are using a Perl earlier
958than 5.16, this feature was most fully documented in the 5.14 version of
959this pod:
960L<http://perldoc.perl.org/5.14.0/perlunicode.html#User-Defined-Case-Mappings-%28for-serious-hackers-only%29>
3a2263fe 961
376d9008 962=head2 Character Encodings for Input and Output
8cbd9a7a 963
7221edc9 964See L<Encode>.
8cbd9a7a 965
c29a771d 966=head2 Unicode Regular Expression Support Level
776f8809 967
b19eb496
TC
968The following list of Unicode supported features for regular expressions describes
969all features currently directly supported by core Perl. The references to "Level N"
8158862b 970and the section numbers refer to the Unicode Technical Standard #18,
b19eb496 971"Unicode Regular Expressions", version 13, from August 2008.
776f8809
JH
972
973=over 4
974
975=item *
976
977Level 1 - Basic Unicode Support
978
    RL1.1   Hex Notation                     - done          [1]
    RL1.2   Properties                       - done          [2][3]
    RL1.2a  Compatibility Properties         - done          [4]
    RL1.3   Subtraction and Intersection     - MISSING       [5]
    RL1.4   Simple Word Boundaries           - done          [6]
    RL1.5   Simple Loose Matches             - done          [7]
    RL1.6   Line Boundaries                  - MISSING       [8][9]
    RL1.7   Supplementary Code Points        - done          [10]

    [1]  \x{...}
    [2]  \p{...} \P{...}
    [3]  supports not only minimal list, but all Unicode character
         properties (see Unicode Character Properties above)
    [4]  \d \D \s \S \w \W \X [:prop:] [:^prop:]
    [5]  can use regular expression look-ahead [a] or
         user-defined character properties [b] to emulate set
         operations
    [6]  \b \B
    [7]  note that Perl does Full case-folding in matching (but with
         bugs), not Simple: for example U+1F88 is equivalent to
         U+1F00 U+03B9, instead of just U+1F80.  This difference
         matters mainly for certain Greek capital letters with certain
         modifiers: the Full case-folding decomposes the letter,
         while the Simple case-folding would map it to a single
         character.
    [8]  should do ^ and $ also on U+000B (\v in C), FF (\f), CR
         (\r), CRLF (\r\n), NEL (U+0085), LS (U+2028), and PS
         (U+2029); should also affect <>, $., and script line
         numbers; should not split lines within CRLF [c] (i.e. there
         is no empty line between \r and \n)
    [9]  Linebreaking conformant with UAX#14 "Unicode Line Breaking
         Algorithm" is available through the Unicode::LineBreak
         module.
    [10] UTF-8/UTF-EBCDIC used in Perl allows not only U+10000 to
         U+10FFFF but also beyond U+10FFFF
7207e29d 1014
237bad5b 1015[a] You can mimic class subtraction using lookahead.
8158862b 1016For example, what UTS#18 might write as
29bdacb8 1017
209c9685 1018 [{Block=Greek}-[{UNASSIGNED}]]
dbe420b4
JH
1019
1020in Perl can be written as:
1021
209c9685
KW
1022 (?!\p{Unassigned})\p{Block=Greek}
1023 (?=\p{Assigned})\p{Block=Greek}
dbe420b4
JH
1024
1025But in this particular example, you probably really want
1026
209c9685 1027 \p{Greek}
dbe420b4
JH
1028
1029which will match assigned characters known to be part of the Greek script.
29bdacb8 1030
4ce498ef 1031Also see the L<Unicode::Regex::Set> module; it does implement the full
8158862b
TS
1032UTS#18 grouping, intersection, union, and removal (subtraction) syntax.
1033
1034[b] '+' for union, '-' for removal (set-difference), '&' for intersection
1035(see L</"User-Defined Character Properties">)
1036
1037[c] Try the C<:crlf> layer (see L<PerlIO>).
5ca1ac52 1038
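The look-ahead emulation of set subtraction from note [a] can be tried
out as follows. This is a sketch only; the three subroutine names are
invented for the example, and U+0378 is used as a code point that lies
in the Greek block but is unassigned.

```perl
use strict;
use warnings;

# Block membership alone, block minus unassigned (via look-ahead),
# and the Greek script property, each tested on a single character.
sub in_greek_block {
    return $_[0] =~ /\A\p{Block=Greek}\z/ ? 1 : 0;
}
sub assigned_in_greek_block {
    return $_[0] =~ /\A(?!\p{Unassigned})\p{Block=Greek}\z/ ? 1 : 0;
}
sub in_greek_script {
    return $_[0] =~ /\A\p{Greek}\z/ ? 1 : 0;
}

for my $char ("\x{3B1}", "\x{378}") {   # GREEK SMALL LETTER ALPHA, unassigned
    printf "U+%04X  block: %d  block minus unassigned: %d  script: %d\n",
        ord($char), in_greek_block($char),
        assigned_in_greek_block($char), in_greek_script($char);
}
```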
776f8809
JH
1039=item *
1040
1041Level 2 - Extended Unicode Support
1042
    RL2.1   Canonical Equivalents           - MISSING       [10][11]
    RL2.2   Default Grapheme Clusters       - MISSING       [12]
    RL2.3   Default Word Boundaries         - MISSING       [14]
    RL2.4   Default Loose Matches           - MISSING       [15]
    RL2.5   Name Properties                 - DONE
    RL2.6   Wildcard Properties             - MISSING

    [10] see UAX#15 "Unicode Normalization Forms"
    [11] have Unicode::Normalize but not integrated to regexes
    [12] have \X but we don't have a "Grapheme Cluster Mode"
    [14] see UAX#29, Word Boundaries
    [15] This is covered in Chapter 3.13 (in Unicode 6.0)
776f8809
JH
1055
1056=item *
1057
8158862b
TS
1058Level 3 - Tailored Support
1059
    RL3.1   Tailored Punctuation            - MISSING
    RL3.2   Tailored Grapheme Clusters      - MISSING       [17][18]
    RL3.3   Tailored Word Boundaries        - MISSING
    RL3.4   Tailored Loose Matches          - MISSING
    RL3.5   Tailored Ranges                 - MISSING
    RL3.6   Context Matching                - MISSING       [19]
    RL3.7   Incremental Matches             - MISSING
  ( RL3.8   Unicode Set Sharing )
    RL3.9   Possible Match Sets             - MISSING
    RL3.10  Folded Matching                 - MISSING       [20]
    RL3.11  Submatchers                     - MISSING

    [17] see UTS#10 "Unicode Collation Algorithm"
    [18] have Unicode::Collate but not integrated to regexes
    [19] have (?<=x) and (?=x), but look-aheads or look-behinds
         should see outside of the target substring
    [20] need insensitive matching for linguistic features other
         than case; for example, hiragana to katakana, wide and
         narrow, simplified Han to traditional Han (see UTR#30
         "Character Foldings")
776f8809
JH
1080
1081=back
1082
c349b1b9
JH
1083=head2 Unicode Encodings
1084
376d9008
JB
1085Unicode characters are assigned to I<code points>, which are abstract
1086numbers. To use these numbers, various encodings are needed.
c349b1b9
JH
1087
1088=over 4
1089
c29a771d 1090=item *
5cb3728c
RB
1091
1092UTF-8
c349b1b9 1093
6d4f9cf2
KW
1094UTF-8 is a variable-length (1 to 4 bytes), byte-order independent
1095encoding. For ASCII (and we really do mean 7-bit ASCII, not another
10968-bit encoding), UTF-8 is transparent.
c349b1b9 1097
8c007b5a 1098The following table is from Unicode 3.2.
05632f9a 1099
 Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte

   U+0000..U+007F       00..7F
   U+0080..U+07FF     * C2..DF    80..BF
   U+0800..U+0FFF       E0      * A0..BF    80..BF
   U+1000..U+CFFF       E1..EC    80..BF    80..BF
   U+D000..U+D7FF       ED        80..9F    80..BF
   U+D800..U+DFFF       +++++ utf16 surrogates, not legal utf8 +++++
   U+E000..U+FFFF       EE..EF    80..BF    80..BF
  U+10000..U+3FFFF      F0      * 90..BF    80..BF    80..BF
  U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
 U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF
e1b711da 1112
b19eb496 1113Note the gaps marked by "*" before several of the byte entries above. These are
e1b711da
KW
1114caused by legal UTF-8 avoiding non-shortest encodings: it is technically
1115possible to UTF-8-encode a single code point in different ways, but that is
1116explicitly forbidden, and the shortest possible encoding should always be used
1117(and that is what Perl does).
37361303 1118
376d9008 1119Another way to look at it is via bits:
05632f9a 1120
                Code Points  1st Byte  2nd Byte  3rd Byte  4th Byte

                   0aaaaaaa  0aaaaaaa
           00000bbbbbaaaaaa  110bbbbb  10aaaaaa
           ccccbbbbbbaaaaaa  1110cccc  10bbbbbb  10aaaaaa
 00000dddccccccbbbbbbaaaaaa  11110ddd  10cccccc  10bbbbbb  10aaaaaa
05632f9a 1127
9f815e24 1128As you can see, the continuation bytes all begin with "10", and the
e1b711da 1129leading bits of the start byte tell how many bytes there are in the
05632f9a
JH
1130encoded character.
1131
6d4f9cf2
KW
1132The original UTF-8 specification allowed up to 6 bytes, to allow
1133encoding of numbers up to 0x7FFF_FFFF. Perl continues to allow those,
1134and has extended that up to 13 bytes to encode code points up to what
1135can fit in a 64-bit word. However, Perl will warn if you output any of
b19eb496 1136these as being non-portable; and under strict UTF-8 input protocols,
6d4f9cf2
KW
1137they are forbidden.
1138
1139The Unicode non-character code points are also disallowed in UTF-8 in
1140"open interchange". See L</Non-character code points>.
1141
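To see the byte patterns from the tables above for concrete code points,
one can encode a character and dump its bytes. The following is a minimal
sketch; C<utf8_bytes_hex> is a name made up for this example, while
C<utf8::encode> is the built-in function that converts a character string
to its UTF-8 byte sequence in place.

```perl
use strict;
use warnings;

# Return the UTF-8 encoding of a code point as space-separated hex bytes.
sub utf8_bytes_hex {
    my ($code_point) = @_;
    my $str = chr $code_point;
    utf8::encode($str);            # characters -> UTF-8 bytes, in place
    return join ' ', map { sprintf '%02X', ord($_) } split //, $str;
}

# One example from each of the 1-, 2-, 3- and 4-byte ranges.
printf "U+%04X => %s\n", $_, utf8_bytes_hex($_)
    for 0x41, 0xE9, 0x20AC, 0x1F600;
```

For these four code points the sketch prints C<41>, C<C3 A9>, C<E2 82 AC>
and C<F0 9F 98 80> respectively, matching the rows of the first table.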
c29a771d 1142=item *
5cb3728c
RB
1143
1144UTF-EBCDIC
dbe420b4 1145
376d9008 1146Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.
dbe420b4 1147
c29a771d 1148=item *
5cb3728c 1149
1e54db1a 1150UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)
c349b1b9 1151
1bfb14c4
JH
The following items are mostly for reference and general Unicode
knowledge; Perl doesn't use these constructs internally.
dbe420b4 1154
b19eb496
TC
1155Like UTF-8, UTF-16 is a variable-width encoding, but where
1156UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units.
1157All code points occupy either 2 or 4 bytes in UTF-16: code points
1158C<U+0000..U+FFFF> are stored in a single 16-bit unit, and code
1bfb14c4 1159points C<U+10000..U+10FFFF> in two 16-bit units. The latter case is
c349b1b9
JH
1160using I<surrogates>, the first 16-bit unit being the I<high
1161surrogate>, and the second being the I<low surrogate>.
1162
376d9008 1163Surrogates are code points set aside to encode the C<U+10000..U+10FFFF>
c349b1b9 1164range of Unicode code points in pairs of 16-bit units. The I<high
9f815e24 1165surrogates> are the range C<U+D800..U+DBFF> and the I<low surrogates>
376d9008 1166are the range C<U+DC00..U+DFFF>. The surrogate encoding is
c349b1b9 1167
    $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
    $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
c349b1b9
JH
1170
1171and the decoding is
1172
d88362ca 1173 $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
c349b1b9 1174
376d9008 1175Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16
c349b1b9 1176itself can be used for in-memory computations, but if storage or
376d9008
JB
1177transfer is required either UTF-16BE (big-endian) or UTF-16LE
1178(little-endian) encodings must be chosen.
c349b1b9
JH
1179
1180This introduces another problem: what if you just know that your data
376d9008
JB
1181is UTF-16, but you don't know which endianness? Byte Order Marks, or
1182BOMs, are a solution to this. A special character has been reserved
86bbd6d1 1183in Unicode to function as a byte order marker: the character with the
376d9008 1184code point C<U+FEFF> is the BOM.
042da322 1185
c349b1b9 1186The trick is that if you read a BOM, you will know the byte order,
376d9008
JB
1187since if it was written on a big-endian platform, you will read the
1188bytes C<0xFE 0xFF>, but if it was written on a little-endian platform,
1189you will read the bytes C<0xFF 0xFE>. (And if the originating platform
1190was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.)
042da322 1191
The way this trick works is that the character with the code point
C<U+FFFE> is not supposed to be in input streams, so the
sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in
little-endian format" and cannot be "C<U+FFFE>, represented in big-endian
format".
1197
1198Surrogates have no meaning in Unicode outside their use in pairs to
1199represent other code points. However, Perl allows them to be
1200represented individually internally, for example by saying
f651977e
TC
1201C<chr(0xD801)>, so that all code points, not just those valid for open
1202interchange, are
6d4f9cf2
KW
representable. Unicode does define semantics for them; for example, their
General Category is "Cs". But because their use is somewhat dangerous,
b19eb496
TC
1205Perl will warn (using the warning category "surrogate", which is a
1206sub-category of "utf8") if an attempt is made
6d4f9cf2
KW
1207to do things like take the lower case of one, or match
1208case-insensitively, or to output them. (But don't try this on Perls
1209before 5.14.)
c349b1b9 1210
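The surrogate arithmetic given earlier in this item can be checked with a
short round trip. This is a sketch under the assumption that the input is
a supplementary code point; the names C<to_surrogates> and
C<from_surrogates> are invented for the example. Note that Perl's C</>
operator is not integer division, hence the C<int>.

```perl
use strict;
use warnings;

# Split a supplementary code point into its (high, low) surrogate pair.
sub to_surrogates {
    my ($uni) = @_;
    die sprintf("U+%04X is not in U+10000..U+10FFFF\n", $uni)
        if $uni < 0x10000 || $uni > 0x10FFFF;
    my $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
    my $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
    return ($hi, $lo);
}

# Combine a (high, low) surrogate pair back into a code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
}

my ($hi, $lo) = to_surrogates(0x1F600);
printf "U+1F600 => %04X %04X => U+%04X\n", $hi, $lo, from_surrogates($hi, $lo);
```

For U+1F600 this yields the pair D83D DE00, which decodes back to the
original code point.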
c29a771d 1211=item *
5cb3728c 1212
1e54db1a 1213UTF-32, UTF-32BE, UTF-32LE
c349b1b9
JH
1214
The UTF-32 family is pretty much like the UTF-16 family, except that
042da322 1216the units are 32-bit, and therefore the surrogate scheme is not
f651977e 1217needed. UTF-32 is a fixed-width encoding. The BOM signatures are
b19eb496 1218C<0x00 0x00 0xFE 0xFF> for BE and C<0xFF 0xFE 0x00 0x00> for LE.
c349b1b9 1219
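The BOM signatures listed for UTF-8, UTF-16 and UTF-32 can be turned into
a simple sniffing routine. The sketch below is only an illustration;
C<guess_encoding_from_bom> is a hypothetical helper, not a core function.
The UTF-32 signatures are tested first because the UTF-32LE signature
begins with the same two bytes as the UTF-16LE one, so the sequence
C<0xFF 0xFE 0x00 0x00> is inherently ambiguous (it could also be a
UTF-16LE BOM followed by U+0000); this sketch assumes UTF-32LE in that
case.

```perl
use strict;
use warnings;

# Guess an encoding from the first bytes of a byte string.
# Returns undef when no BOM is recognized.
sub guess_encoding_from_bom {
    my ($bytes) = @_;
    return 'UTF-32BE' if $bytes =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-32LE' if $bytes =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return undef;
}

for my $sample ("\xFE\xFF\x00A", "\xFF\xFEA\x00", "\xEF\xBB\xBFA", "A") {
    my $guess = guess_encoding_from_bom($sample);
    print defined $guess ? "$guess\n" : "no BOM\n";
}
```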
c29a771d 1220=item *
5cb3728c
RB
1221
1222UCS-2, UCS-4
c349b1b9 1223
b19eb496 1224Legacy, fixed-width encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit
376d9008 1225encoding. Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
339cfa0e 1226because it does not use surrogates. UCS-4 is a 32-bit encoding,
b19eb496 1227functionally identical to UTF-32 (the difference being that
ee88f7b6 1228UCS-4 forbids neither surrogates nor code points larger than 0x10_FFFF).
c349b1b9 1229
c29a771d 1230=item *
5cb3728c
RB
1231
1232UTF-7
c349b1b9 1233
376d9008
JB
1234A seven-bit safe (non-eight-bit) encoding, which is useful if the
1235transport or storage is not eight-bit safe. Defined by RFC 2152.
c349b1b9 1236
95a1a48b
JH
1237=back
1238
6d4f9cf2
KW
1239=head2 Non-character code points
1240
124166 code points are set aside in Unicode as "non-character code points".
1242These all have the Unassigned (Cn) General Category, and they never will
1243be assigned. These are never supposed to be in legal Unicode input
1244streams, so that code can use them as sentinels that can be mixed in
1245with character data, and they always will be distinguishable from that data.
1246To keep them out of Perl input streams, strict UTF-8 should be
1247specified, such as by using the layer C<:encoding('UTF-8')>. The
1248non-character code points are the 32 between U+FDD0 and U+FDEF, and the
124934 code points U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF.
1250Some people are under the mistaken impression that these are "illegal",
1251but that is not true. An application or cooperating set of applications
1252can legally use them at will internally; but these code points are
42581d5d
KW
1253"illegal for open interchange". Therefore, Perl will not accept these
1254from input streams unless lax rules are being used, and will warn
b19eb496 1255(using the warning category "nonchar", which is a sub-category of "utf8") if
42581d5d
KW
1256an attempt is made to output them.
1257
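The count of 66 can be verified by building the list from the ranges
given above and checking each entry against the Unicode
C<Noncharacter_Code_Point> property. This is a sketch; the variable and
subroutine names are chosen only for the illustration.

```perl
use strict;
use warnings;

# The 32 code points U+FDD0..U+FDEF, plus the last two code points
# of each of the 17 planes (U+FFFE, U+FFFF, ... U+10FFFE, U+10FFFF).
my @nonchars = (0xFDD0 .. 0xFDEF);
for my $plane (0x00 .. 0x10) {
    push @nonchars, ($plane << 16) | 0xFFFE, ($plane << 16) | 0xFFFF;
}

sub is_nonchar {
    my ($code_point) = @_;
    return chr($code_point) =~ /\A\p{Noncharacter_Code_Point}\z/ ? 1 : 0;
}

my $confirmed = grep { is_nonchar($_) } @nonchars;
printf "%d listed, %d confirmed by the property\n",
    scalar(@nonchars), $confirmed;
```

Merely holding these code points in a string and matching against them
is fine; the warnings described above concern output.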
1258=head2 Beyond Unicode code points
1259
1260The maximum Unicode code point is U+10FFFF. But Perl accepts code
1261points up to the maximum permissible unsigned number available on the
1262platform. However, Perl will not accept these from input streams unless
1263lax rules are being used, and will warn (using the warning category
b19eb496 1264"non_unicode", which is a sub-category of "utf8") if an attempt is made to
42581d5d
KW
operate on or output them. For example, C<uc(chr(0x11_0000))> will generate
1266this warning, returning the input parameter as its result, as the upper
ee88f7b6 1267case of every non-Unicode code point is the code point itself.
6d4f9cf2 1268
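The behavior described above can be sketched as follows. The helper name
C<uc_of_code_point> is hypothetical; the C<non_unicode> warning category
is the one named in this section, and it is disabled locally here so that
the example runs quietly.

```perl
use strict;
use warnings;

# Upper-case the character with the given code point and return the
# code point of the result.
sub uc_of_code_point {
    my ($code_point) = @_;
    no warnings 'non_unicode';     # silence the warning discussed above
    my $char = chr $code_point;
    return ord uc $char;
}

printf "uc of U+%X gives U+%X\n", 0x11_0000, uc_of_code_point(0x11_0000);
printf "uc of U+%X gives U+%X\n", 0x61,      uc_of_code_point(0x61);
```

The code point above U+10FFFF comes back unchanged, whereas U+0061 is
mapped to U+0041 as usual.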
0d7c09bb
JH
1269=head2 Security Implications of Unicode
1270
e1b711da
KW
1271Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.
1272Also, note the following:
1273
0d7c09bb
JH
1274=over 4
1275
1276=item *
1277
1278Malformed UTF-8
bf0fa0b2 1279
42581d5d 1280Unfortunately, the original specification of UTF-8 leaves some room for
bf0fa0b2 1281interpretation of how many bytes of encoded output one should generate
376d9008
JB
1282from one input Unicode character. Strictly speaking, the shortest
1283possible sequence of UTF-8 bytes should be generated,
1284because otherwise there is potential for an input buffer overflow at
feda178f 1285the receiving end of a UTF-8 connection. Perl always generates the
e1b711da 1286shortest length UTF-8, and with warnings on, Perl will warn about
376d9008 1287non-shortest length UTF-8 along with other malformations, such as the
b19eb496 1288surrogates, which are not Unicode code points valid for interchange.
bf0fa0b2 1289
0d7c09bb
JH
1290=item *
1291
68693f9e 1292Regular expression pattern matching may surprise you if you're not
b19eb496
TC
1293accustomed to Unicode. Starting in Perl 5.14, several pattern
1294modifiers are available to control this, called the character set
42581d5d
KW
1295modifiers. Details are given in L<perlre/Character set modifiers>.
1296
1297=back
0d7c09bb 1298
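One way to guard against the malformations mentioned above is to decode
untrusted input with the strict C<UTF-8> encoding of the L<Encode> module
and to treat any failure as fatal. The following is a minimal sketch, not
a complete input-validation policy; C<is_strict_utf8> is a name invented
for this example.

```perl
use strict;
use warnings;
use Encode ();

# True if the byte string is well-formed under strict UTF-8 rules.
sub is_strict_utf8 {
    my ($bytes) = @_;
    my $ok = eval {
        Encode::decode('UTF-8', $bytes,
                       Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $ok ? 1 : 0;
}

my %samples = (
    'e with acute'      => "\xC3\xA9",
    'overlong NUL'      => "\xC0\x80",       # non-shortest form
    'encoded surrogate' => "\xED\xA0\x81",   # U+D801
);
printf "%-18s %s\n", $_, is_strict_utf8($samples{$_}) ? 'accepted' : 'rejected'
    for sort keys %samples;
```

Note the spelling: C<UTF-8> (with the hyphen) selects the strict
encoding, whereas C<utf8> selects Perl's lax internal variant.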
376d9008 1299As discussed elsewhere, Perl has one foot (two hooves?) planted in
1bfb14c4
JH
1300each of two worlds: the old world of bytes and the new world of
1301characters, upgrading from bytes to characters when necessary.
376d9008
JB
1302If your legacy code does not explicitly use Unicode, no automatic
1303switch-over to characters should happen. Characters shouldn't get
1bfb14c4
JH
1304downgraded to bytes, either. It is possible to accidentally mix bytes
1305and characters, however (see L<perluniintro>), in which case C<\w> in
42581d5d 1306regular expressions might start behaving differently (unless the C</a>
b19eb496 1307modifier is in effect). Review your code. Use warnings and the C<strict> pragma.
0d7c09bb 1308
c349b1b9
JH
1309=head2 Unicode in Perl on EBCDIC
1310
376d9008
JB
1311The way Unicode is handled on EBCDIC platforms is still
1312experimental. On such platforms, references to UTF-8 encoding in this
1313document and elsewhere should be read as meaning the UTF-EBCDIC
1314specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues
c349b1b9 1315are specifically discussed. There is no C<utfebcdic> pragma or
376d9008 1316":utfebcdic" layer; rather, "utf8" and ":utf8" are reused to mean
86bbd6d1
PN
1317the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic>
1318for more discussion of the issues.
c349b1b9 1319
b310b053
JH
1320=head2 Locales
1321
See L<perllocale/Unicode and UTF-8>.
b310b053 1323
1aad1664
JH
1324=head2 When Unicode Does Not Happen
1325
1326While Perl does have extensive ways to input and output in Unicode,
b19eb496
TC
1327and a few other "entry points" like the @ARGV array (which can sometimes be
1328interpreted as UTF-8), there are still many places where Unicode
1329(in some encoding or another) could be given as arguments or received as
1aad1664
JH
1330results, or both, but it is not.
1331
e1b711da
KW
1332The following are such interfaces. Also, see L</The "Unicode Bug">.
1333For all of these interfaces Perl
b9cedb1b 1334currently (as of v5.16.0) simply assumes byte strings both as arguments
b19eb496 1335and results, or UTF-8 strings if the (problematic) C<encoding> pragma has been used.
1aad1664 1336
b19eb496
TC
1337One reason that Perl does not attempt to resolve the role of Unicode in
1338these situations is that the answers are highly dependent on the operating
1aad1664 1339system and the file system(s). For example, whether filenames can be
b19eb496
TC
1340in Unicode and in exactly what kind of encoding, is not exactly a
1341portable concept. Similarly for C<qx> and C<system>: how well will the
1342"command-line interface" (and which of them?) handle Unicode?
1aad1664
JH
1343
1344=over 4
1345
557a2462
RB
1346=item *
1347
51f494cc 1348chdir, chmod, chown, chroot, exec, link, lstat, mkdir,
1e8e8236 1349rename, rmdir, stat, symlink, truncate, unlink, utime, -X
557a2462
RB
1350
1351=item *
1352
1353%ENV
1354
1355=item *
1356
1357glob (aka the <*>)
1358
1359=item *
1aad1664 1360
557a2462 1361open, opendir, sysopen
1aad1664 1362
557a2462 1363=item *
1aad1664 1364
557a2462 1365qx (aka the backtick operator), system
1aad1664 1366
557a2462 1367=item *
1aad1664 1368
557a2462 1369readdir, readlink
1aad1664
JH
1370
1371=back
1372
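As a minimal sketch of coping with the file-name interfaces listed above,
the program below encodes names to octets before handing them to C<open>
and decodes what C<readdir> returns. It assumes (and this is only an
assumption) that the file system stores names as UTF-8. The helper names
C<create_file> and C<list_names> and the variable C<$FS_ENCODING> are
illustrative and not part of Perl.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use File::Temp qw(tempdir);

# Assumption: this file system stores names as UTF-8 octets.
my $FS_ENCODING = 'UTF-8';

# Create an empty file whose name is given as a character string.
# $dir_octets is expected to be a byte string already.
sub create_file {
    my ($dir_octets, $name) = @_;
    my $path = "$dir_octets/" . encode($FS_ENCODING, $name);
    open(my $fh, '>', $path) or die "Cannot create file: $!";
    close($fh) or die "Cannot close file: $!";
    return $path;
}

# Return the directory's entries decoded back into character strings.
sub list_names {
    my ($dir_octets) = @_;
    opendir(my $dh, $dir_octets) or die "Cannot open directory: $!";
    my @names = map  { decode($FS_ENCODING, $_) }
                grep { $_ ne '.' && $_ ne '..' } readdir($dh);
    closedir($dh);
    return @names;
}

my $dir = tempdir(CLEANUP => 1);
create_file($dir, "spade-\x{2660}.txt");
my @names = list_names($dir);
```

Keeping the encoding in one variable makes it easy to change when the
program runs on a system with a different file-name convention.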
e1b711da
KW
1373=head2 The "Unicode Bug"
1374
2e2b2571 1375The term "Unicode bug" has been applied to an inconsistency
42581d5d
KW
1376on ASCII platforms with the
1377Unicode code points in the Latin-1 Supplement block, that
e1b711da
KW
1378is, between 128 and 255. Without a locale specified, unlike all other
1379characters or code points, these characters behave very differently under
20db7501 1380byte semantics than under character semantics, unless
2e2b2571
KW
1381C<use feature 'unicode_strings'> is specified, directly or indirectly.
1382(It is indirectly specified by a C<use v5.12> or higher.)
e1b711da 1383
2e2b2571
KW
1384In character semantics these upper-Latin1 characters are interpreted as
1385Unicode code points, which means
e1b711da
KW
1386they have the same semantics as Latin-1 (ISO-8859-1).
1387
2e2b2571
KW
1388In byte semantics (without C<unicode_strings>), they are considered to
1389be unassigned characters, meaning that the only semantics they have is
1390their ordinal numbers, and that they are
e1b711da 1391not members of various character classes. None are considered to match C<\w>
42581d5d 1392for example, but all match C<\W>.
e1b711da 1393
2e2b2571
KW
1394Perl 5.12.0 added C<unicode_strings> to force character semantics on
1395these code points in some circumstances, which fixed portions of the
1396bug; Perl 5.14.0 fixed almost all of it; and Perl 5.16.0 fixed the
1397remainder (so far as we know, anyway). The lesson here is to enable
1398C<unicode_strings> to avoid the headaches described below.
1399
1400The old, problematic behavior affects these areas:
e1b711da
KW
1401
1402=over 4
1403
1404=item *
1405
1406Changing the case of a scalar, that is, using C<uc()>, C<ucfirst()>, C<lc()>,
2e2b2571
KW
1407and C<lcfirst()>, or C<\L>, C<\U>, C<\u> and C<\l> in double-quotish
1408contexts, such as regular expression substitutions.
1409Under C<unicode_strings> starting in Perl 5.12.0, character semantics are
1410generally used. See L<perlfunc/lc> for details on how this works
1411in combination with various other pragmas.
e1b711da
KW
1412
1413=item *
1414
2e2b2571
KW
1415Using caseless (C</i>) regular expression matching.
1416Starting in Perl 5.14.0, regular expressions compiled within
c43ca372 1417the scope of C<unicode_strings> use character semantics
2e2b2571
KW
1418even when executed or compiled into larger
1419regular expressions outside the scope.
e1b711da
KW
1420
1421=item *
1422
b19eb496 1423Matching any of several properties in regular expressions, namely C<\b>,
630d17dc
KW
1424C<\B>, C<\s>, C<\S>, C<\w>, C<\W>, and all the Posix character classes
1425I<except> C<[[:ascii:]]>.
2e2b2571 1426Starting in Perl 5.14.0, regular expressions compiled within
c43ca372 1427the scope of C<unicode_strings> use character semantics
2e2b2571
KW
1428even when executed or compiled into larger
1429regular expressions outside the scope.
e1b711da
KW
1430
1431=item *
1432
91faff93
KW
1433In C<quotemeta> or its inline equivalent C<\Q>, no code points above 127
1434are quoted in UTF-8 encoded strings, but in byte encoded strings, code
1435points between 128-255 are always quoted.
2e2b2571
KW
1436Starting in Perl 5.16.0, consistent quoting rules are used within the
1437scope of C<unicode_strings>, as described in L<perlfunc/quotemeta>.
eb88ed9e 1438
e1b711da
KW
1439=back
1440
1441This behavior can lead to unexpected results: a string's semantics
1442suddenly change if a code point above 255 is appended to or removed from it,
1443because that switches the string from byte to character semantics or vice versa. As
1444an example, consider the following program and its output:
1445
1446 $ perl -le'
42581d5d 1447 no feature "unicode_strings";
e1b711da
KW
1448 $s1 = "\xC2";
1449 $s2 = "\x{2660}";
1450 for ($s1, $s2, $s1.$s2) {
1451 print /\w/ || 0;
1452 }
1453 '
1454 0
1455 0
1456 1
1457
9f815e24 1458If there's no C<\w> in C<$s1> or in C<$s2>, why does their concatenation have one?
e1b711da
KW
1459
1460This anomaly stems from Perl's attempt to not disturb older programs that
1461didn't use Unicode, and hence had no semantics for characters outside of the
1462ASCII range (except in a locale), along with Perl's desire to add Unicode
1463support seamlessly. The result wasn't seamless: these characters were
1464orphaned.
1465
2e2b2571
KW
1466For Perls earlier than those described above, or when a string is passed
1467to a function outside the subpragma's scope, a workaround is to always
1468call C<utf8::upgrade($string)>,
20db7501 1469or to use the standard module L<Encode>. Also, a scalar that has any characters
6f335b04 1470whose ordinal is 0x100 or above, or which were specified using either of the
b19eb496 1471C<\N{...}> notations, will automatically have character semantics.
e1b711da 1472
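The following sketch illustrates the difference for a single Latin-1
Supplement character on an ASCII platform. It applies the same two checks
(C<\w> matching and C<uc()>) with the feature off, with the feature on,
and with the feature off after the C<utf8::upgrade> workaround. The
function names C<probe_without_feature> and C<probe_with_feature> are
made up for this illustration, and the results assume Perl 5.14 or later
with no locale in effect.

```perl
use strict;
use warnings;

# Report whether the string contains a \w character and whether uc()
# maps it to U+00C9, with the feature explicitly switched off.
sub probe_without_feature {
    no feature 'unicode_strings';
    my ($s) = @_;
    return (($s =~ /\w/ ? 1 : 0), (uc($s) eq "\xC9" ? 1 : 0));
}

# The same two checks with the feature switched on.
sub probe_with_feature {
    use feature 'unicode_strings';
    my ($s) = @_;
    return (($s =~ /\w/ ? 1 : 0), (uc($s) eq "\xC9" ? 1 : 0));
}

my $e_acute  = "\xE9";        # U+00E9, stored as a single byte
my $upgraded = $e_acute;
utf8::upgrade($upgraded);     # same character, held as UTF-8 internally

my @plain    = probe_without_feature($e_acute);
my @featured = probe_with_feature($e_acute);
my @worked   = probe_without_feature($upgraded);

print "without feature: @plain\n";      # 0 0
print "with feature:    @featured\n";   # 1 1
print "after upgrade:   @worked\n";     # 1 1
```

The string compares equal before and after the upgrade; only its internal
representation, and therefore the semantics applied outside the feature's
scope, differ.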
1aad1664
JH
1473=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)
1474
e1b711da
KW
1475Sometimes (see L</"When Unicode Does Not Happen"> or L</The "Unicode Bug">)
1476there are situations where you simply need to force a byte
2bbc8d55
SP
1477string into UTF-8, or vice versa. The low-level calls
1478C<utf8::upgrade($bytestring)> and C<utf8::downgrade($utf8string[, FAIL_OK])> are
1aad1664
JH
1479the answers.
1480
2bbc8d55
SP
1481Note that utf8::downgrade() can fail if the string contains characters
1482that don't fit into a byte.
1aad1664 1483
e1b711da
KW
1484Calling either function on a string that already is in the desired state is a
1485no-op.
1486
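A short sketch of both calls, including the two ways a failed downgrade
can be reported. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $text = "caf\xE9";              # four characters, one byte each
utf8::upgrade($text);              # now held as UTF-8 internally
my $flag_after_upgrade   = utf8::is_utf8($text) ? 1 : 0;
my $length_after_upgrade = length($text);    # still four characters

utf8::downgrade($text);            # every character fits in a byte
my $flag_after_downgrade = utf8::is_utf8($text) ? 1 : 0;

my $wide = "\x{263A}";             # cannot be represented in one byte

# With a true FAIL_OK argument, failure is reported by a false return.
my $quiet_ok = utf8::downgrade($wide, 1) ? 1 : 0;

# Without FAIL_OK, failure raises an exception.
my $died = eval { utf8::downgrade($wide); 1 } ? 0 : 1;
```

Neither call changes which characters the string contains; only the
internal representation changes.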
95a1a48b
JH
1487=head2 Using Unicode in XS
1488
3a2263fe
RGS
1489If you want to handle Perl Unicode in XS extensions, you may find the
1490following C APIs useful. See also L<perlguts/"Unicode Support"> for an
1491explanation about Unicode at the XS level, and L<perlapi> for the API
1492details.
95a1a48b
JH
1493
1494=over 4
1495
1496=item *
1497
1bfb14c4 1498C<DO_UTF8(sv)> returns true if the C<UTF8> flag is on and the bytes
2bbc8d55 1499pragma is not in effect. C<SvUTF8(sv)> returns true if the C<UTF8>
1bfb14c4
JH
1500flag is on; the bytes pragma is ignored. The C<UTF8> flag being on
1501does B<not> mean that there are any characters with code points greater
1502than 255 (or 127) in the scalar or that there are even any characters
1503in the scalar. What the C<UTF8> flag means is that the sequence of
1504octets in the representation of the scalar is the sequence of UTF-8
1505encoded code points of the characters of a string. The C<UTF8> flag
1506being off means that each octet in this representation encodes a
1507single character with code point 0..255 within the string. Perl's
1508Unicode model is not to use UTF-8 until it is absolutely necessary.
95a1a48b
JH
1509
1510=item *
1511
2bbc8d55 1512C<uvchr_to_utf8(buf, chr)> writes a Unicode character code point into
1bfb14c4 1513a buffer encoding the code point as UTF-8, and returns a pointer
2bbc8d55 1514pointing after the UTF-8 bytes. It works appropriately on EBCDIC machines.
95a1a48b
JH
1515
1516=item *
1517
4b88fb76
KW
1518C<utf8_to_uvchr_buf(buf, bufend, lenp)> reads UTF-8 encoded bytes from a
1519buffer and
376d9008 1520returns the Unicode character code point and, optionally, the length of
2bbc8d55 1521the UTF-8 byte sequence. It works appropriately on EBCDIC machines.
95a1a48b
JH
1522
1523=item *
1524
376d9008
JB
1525C<utf8_length(start, end)> returns the length of the UTF-8 encoded buffer
1526in characters. C<sv_len_utf8(sv)> returns the length of the UTF-8 encoded
95a1a48b
JH
1527scalar.
1528
1529=item *
1530
376d9008
JB
1531C<sv_utf8_upgrade(sv)> converts the string of the scalar to its UTF-8
1532encoded form. C<sv_utf8_downgrade(sv)> does the opposite, if
1533possible. C<sv_utf8_encode(sv)> is like C<sv_utf8_upgrade()> except that
1534it does not set the C<UTF8> flag. C<sv_utf8_decode()> does the
1535opposite of C<sv_utf8_encode()>. Note that none of these are to be
1536used as general-purpose encoding or decoding interfaces: C<use Encode>
1537for that. C<sv_utf8_upgrade()> is affected by the encoding pragma
1538but C<sv_utf8_downgrade()> is not (since the encoding pragma is
1539designed to be a one-way street).
95a1a48b
JH
1540
1541=item *
1542
dbfbbfa1
KW
1543C<is_utf8_string(buf, len)> returns true if C<len> bytes of the buffer
1544are valid UTF-8.
95a1a48b
JH
1545
1546=item *
1547
49f4c4e4
KW
1548C<is_utf8_char_buf(buf, buf_end)> returns true if the pointer points to
1549a valid UTF-8 character.
95a1a48b
JH
1550
1551=item *
1552
376d9008
JB
1553C<UTF8SKIP(buf)> will return the number of bytes in the UTF-8 encoded
1554character in the buffer. C<UNISKIP(chr)> will return the number of bytes
1555required to UTF-8-encode the Unicode character code point. C<UTF8SKIP()>
90f968e0 1556is useful, for example, for iterating over the characters of a UTF-8
376d9008 1557encoded buffer; C<UNISKIP()> is useful, for example, in computing
90f968e0 1558the size required for a UTF-8 encoded buffer.
95a1a48b
JH
1559
1560=item *
1561
376d9008 1562C<utf8_distance(a, b)> will tell the distance in characters between the
95a1a48b
JH
1563two pointers pointing to the same UTF-8 encoded buffer.
1564
1565=item *
1566
2bbc8d55 1567C<utf8_hop(s, off)> will return a pointer to a UTF-8 encoded buffer
376d9008
JB
1568that is C<off> (positive or negative) Unicode characters displaced
1569from the UTF-8 buffer C<s>. Be careful not to overstep the buffer:
1570C<utf8_hop()> will merrily run off the end or the beginning of the
1571buffer if told to do so.
95a1a48b 1572
d2cc3551
JH
1573=item *
1574
376d9008
JB
1575C<pv_uni_display(dsv, spv, len, pvlim, flags)> and
1576C<sv_uni_display(dsv, ssv, pvlim, flags)> are useful for debugging the
1577output of Unicode strings and scalars. By default they are useful
1578only for debugging--they display B<all> characters as hexadecimal code
1bfb14c4
JH
1579points--but with the flags C<UNI_DISPLAY_ISPRINT>,
1580C<UNI_DISPLAY_BACKSLASH>, and C<UNI_DISPLAY_QQ> you can make the
1581output more readable.
d2cc3551
JH
1582
1583=item *
1584
66615a54 1585C<foldEQ_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to
376d9008 1586compare two strings case-insensitively in Unicode. For case-sensitive
66615a54
KW
1587comparisons you can just use C<memEQ()> and C<memNE()> as usual, except
1588if one string is in UTF-8 and the other isn't.
d2cc3551 1589
c349b1b9
JH
1590=back
1591
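The meaning of the C<UTF8> flag described above can also be observed from
Perl itself. The sketch below is only an illustration using Perl-level
functions (C<utf8::is_utf8>, C<utf8::upgrade>, C<bytes::length>), not the
C API, and the octet counts assume an ASCII platform. The C<%report> hash
is a name chosen for this example.

```perl
use strict;
use warnings;
require bytes;    # makes bytes::length() available without "use bytes"

my $native   = "\xE9";        # U+00E9 with the UTF8 flag off
my $upgraded = $native;
utf8::upgrade($upgraded);     # U+00E9 with the UTF8 flag on

my %report = (
    same_string     => ($native eq $upgraded ? 1 : 0),
    native_flag     => (utf8::is_utf8($native)   ? 1 : 0),
    upgraded_flag   => (utf8::is_utf8($upgraded) ? 1 : 0),
    native_chars    => length($native),
    upgraded_chars  => length($upgraded),
    native_octets   => bytes::length($native),
    upgraded_octets => bytes::length($upgraded),
);

print "$_ = $report{$_}\n" for sort keys %report;
```

Both scalars hold the same one-character string with a code point below
256; the flag only says how the octets in the representation are to be
read.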
95a1a48b
JH
1592For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h>
1593in the Perl source code distribution.
1594
e1b711da
KW
1595=head2 Hacking Perl to work on earlier Unicode versions (for very serious hackers only)
1596
1597Perl by default comes with the latest supported Unicode version built in, but
1598you can change it to use any earlier one.
1599
42581d5d 1600Download the files in the desired version of Unicode from the Unicode web
e1b711da 1601site (L<http://www.unicode.org>). These should replace the existing files in
b19eb496 1602F<lib/unicore> in the Perl source tree. Follow the instructions in
116693e8 1603F<README.perl> in that directory to change some of their names, and then build
26e391dd 1604perl (see L<INSTALL>).
116693e8 1605
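After such a build, one way to check which Unicode version the resulting
perl reports is to ask the standard L<Unicode::UCD> module. This is a
small sketch; the variable name is illustrative.

```perl
use strict;
use warnings;
use Unicode::UCD ();

# Version of the Unicode Character Database this perl was built with.
my $unicode_version = Unicode::UCD::UnicodeVersion();
printf "perl %vd uses Unicode %s\n", $^V, $unicode_version;
```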
c29a771d
JH
1606=head1 BUGS
1607
376d9008 1608=head2 Interaction with Locales
7eabb34d 1609
42581d5d 1610See L<perllocale/Unicode and UTF-8>.
c29a771d 1611
9f815e24 1612=head2 Problems with characters in the Latin-1 Supplement range
2bbc8d55 1613
e1b711da
KW
1614See L</The "Unicode Bug">.
1615
376d9008 1616=head2 Interaction with Extensions
7eabb34d 1617
376d9008 1618When Perl exchanges data with an extension, the extension should be
2575c402 1619able to understand the UTF8 flag and act accordingly. If the
b19eb496 1620extension doesn't recognize that flag, it's likely that the extension
376d9008 1621will return incorrectly-flagged data.
7eabb34d
A
1622
1623So if you're working with Unicode data, consult the documentation of
1624every module you're using to see if there are any issues with Unicode data
1625exchange. If the documentation does not talk about Unicode at all,
a73d23f6 1626suspect the worst and probably look at the source to learn how the
376d9008 1627module is implemented. Modules written completely in Perl shouldn't
a73d23f6
RGS
1628cause problems. Modules that directly or indirectly access code written
1629in other programming languages are at risk.
7eabb34d 1630
376d9008 1631For affected functions, the simple strategy to avoid data corruption is
7eabb34d 1632to always make the encoding of the exchanged data explicit. Choose an
376d9008 1633encoding that you know the extension can handle. Convert arguments passed
7eabb34d
A
1634to the extensions to that encoding and convert results back from that
1635encoding. Write wrapper functions that do the conversions for you, so
1636you can later change the functions when the extension catches up.
1637
376d9008 1638To provide an example, let's say the popular Foo::Bar::escape_html
7eabb34d
A
1639function doesn't deal with Unicode data yet. The wrapper function
1640would convert the argument to raw UTF-8 and convert the result back to
376d9008 1641Perl's internal representation like so:
7eabb34d
A
1642
1643 sub my_escape_html ($) {
d88362ca
KW
1644 my($what) = shift;
1645 return unless defined $what;
1646 Encode::decode_utf8(Foo::Bar::escape_html(
1647 Encode::encode_utf8($what)));
7eabb34d
A
1648 }
1649
1650Sometimes, when the extension does not convert data but just stores
b19eb496 1651and retrieves them, you will be able to use the otherwise
7eabb34d 1652dangerous Encode::_utf8_on() function. Let's say the popular
66b79f27 1653C<Foo::Bar> extension, written in C, provides a C<param> method that
7eabb34d
A
1654lets you store and retrieve data according to these prototypes:
1655
1656 $self->param($name, $value); # set a scalar
1657 $value = $self->param($name); # retrieve a scalar
1658
1659If it does not yet provide support for any encoding, one could write a
1660derived class with such a C<param> method:
1661
1662 sub param {
1663 my($self,$name,$value) = @_;
1664 utf8::upgrade($name); # make sure it is UTF-8 encoded
af55fc6a 1665 if (defined $value) {
7eabb34d
A
1666 utf8::upgrade($value); # make sure it is UTF-8 encoded
1667 return $self->SUPER::param($name,$value);
1668 } else {
1669 my $ret = $self->SUPER::param($name);
1670 Encode::_utf8_on($ret); # we know, it is UTF-8 encoded
1671 return $ret;
1672 }
1673 }
1674
a73d23f6
RGS
1675Some extensions provide filters on data entry/exit points, such as
1676DB_File::filter_store_key and family. Look out for such filters in
66b79f27 1677the documentation of your extensions; they can make the transition to
7eabb34d
A
1678Unicode data much easier.
1679
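As a sketch of how such filters can be used, the program below installs
the four DBM filter hooks so that keys and values are encoded to UTF-8
octets on the way in and decoded on the way out. It uses the core
C<SDBM_File> module rather than C<DB_File> purely as an assumption to
keep the example self-contained; L<perldbmfilter> documents the same
C<filter_*> methods for the DBM modules shipped with Perl. The file name
C<demo> is arbitrary.

```perl
use strict;
use warnings;
use Encode ();
use Fcntl qw(O_RDWR O_CREAT);
use File::Temp qw(tempdir);
use SDBM_File;

my $dir = tempdir(CLEANUP => 1);
my $db  = tie(my %store, 'SDBM_File', "$dir/demo", O_RDWR | O_CREAT, 0666)
    or die "Cannot tie SDBM file: $!";

# Inside a filter, $_ holds the key or value being stored or fetched.
$db->filter_store_key(  sub { $_ = Encode::encode_utf8($_) });
$db->filter_store_value(sub { $_ = Encode::encode_utf8($_) });
$db->filter_fetch_key(  sub { $_ = Encode::decode_utf8($_) });
$db->filter_fetch_value(sub { $_ = Encode::decode_utf8($_) });

# Character strings go in and come back out unchanged.
$store{"\x{263A}"} = "caf\x{E9} \x{2660}";
my $value = $store{"\x{263A}"};
my ($key) = keys %store;

undef $db;       # drop the extra reference before untying
untie %store;
```

With the filters in place the rest of the program can treat the tied hash
as if it stored character strings, while the file itself only ever sees
octets.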
376d9008 1680=head2 Speed
7eabb34d 1681
c29a771d 1682Some functions are slower when working on UTF-8 encoded strings than
574c8022 1683on byte encoded strings. All functions that need to hop over
7c17141f
JH
1684characters such as length(), substr() or index(), or matching regular
1685expressions can work B<much> faster when the underlying data are
1686byte-encoded.
1687
1688In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
1689a caching scheme was introduced which will hopefully make the slowness
a104b433
JH
1690somewhat less spectacular, at least for some operations. In general,
1691operations with UTF-8 encoded strings are still slower. As an example,
1692the Unicode properties (character classes) like C<\p{Nd}> are known to
1693be quite a bit slower (5-20 times) than their simpler counterparts
42581d5d 1694like C<\d> (then again, there are hundreds of Unicode characters matching C<Nd>
a104b433 1695compared with the 10 ASCII characters matching C<\d>).
666f95b9 1696
e1b711da
KW
1697=head2 Problems on EBCDIC platforms
1698
f651977e 1699There are several known problems with Perl on EBCDIC platforms. If you
e1b711da 1700want to use Perl there, send email to perlbug@perl.org.
fe749c9a
KW
1701
1702In earlier versions, when byte and character data were concatenated,
1703the new string was sometimes created by
1704decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
1705old Unicode string used EBCDIC.
1706
1707If you find any of these, please report them as bugs.
1708
c8d992ba
A
1709=head2 Porting code from perl-5.6.X
1710
1711Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer
1712was required to use the C<utf8> pragma to declare that a given scope
1713expected to deal with Unicode data and had to make sure that only
1714Unicode data were reaching that scope. If you have code that is
1715working with 5.6, you will need some of the following adjustments to
1716your code. The examples are written such that the code will continue
1717to work under 5.6, so you should be safe to try them out.
1718
755789c0 1719=over 3
c8d992ba
A
1720
1721=item *
1722
1723A filehandle that should read or write UTF-8
1724
b9cedb1b 1725 if ($] >= 5.008) {
740d4bb2 1726 binmode $fh, ":encoding(utf8)";
c8d992ba
A
1727 }
1728
1729=item *
1730
1731A scalar that is going to be passed to some extension
1732
1733Be it Compress::Zlib, Apache::Request or any extension that has no
1734mention of Unicode in the manpage, you need to make sure that the
2575c402 1735UTF8 flag is stripped off. Note that at the time of this writing
b9cedb1b 1736(January 2012) the mentioned modules are not UTF-8-aware. Please
c8d992ba
A
1737check the documentation to verify if this is still true.
1738
b9cedb1b 1739 if ($] >= 5.008) {
c8d992ba
A
1740 require Encode;
1741 $val = Encode::encode_utf8($val); # make octets
1742 }
1743
1744=item *
1745
1746A scalar we got back from an extension
1747
1748If you believe the scalar comes back as UTF-8, you will most likely
2575c402 1749want the UTF8 flag restored:
c8d992ba 1750
b9cedb1b 1751 if ($] >= 5.008) {
c8d992ba
A
1752 require Encode;
1753 $val = Encode::decode_utf8($val);
1754 }
1755
1756=item *
1757
1758Same thing, if you are really sure it is UTF-8
1759
b9cedb1b 1760 if ($] >= 5.008) {
c8d992ba
A
1761 require Encode;
1762 Encode::_utf8_on($val);
1763 }
1764
1765=item *
1766
1767A wrapper for fetchrow_array and fetchrow_hashref
1768
1769When the database contains only UTF-8, a wrapper function or method is
1770a convenient way to replace all your fetchrow_array and
1771fetchrow_hashref calls. A wrapper function will also make it easier to
1772adapt to future enhancements in your database driver. Note that at the
b9cedb1b 1773time of this writing (January 2012), the DBI has no standardized way
c8d992ba
A
1774to deal with UTF-8 data. Please check the documentation to verify if
1775that is still true.
1776
1777 sub fetchrow {
d88362ca
KW
1778 # $what is one of fetchrow_{array,hashref}
1779 my($self, $sth, $what) = @_;
b9cedb1b 1780 if ($] < 5.008) {
c8d992ba
A
1781 return $sth->$what;
1782 } else {
1783 require Encode;
1784 if (wantarray) {
1785 my @arr = $sth->$what;
1786 for (@arr) {
1787 defined && /[^\000-\177]/ && Encode::_utf8_on($_);
1788 }
1789 return @arr;
1790 } else {
1791 my $ret = $sth->$what;
1792 if (ref $ret) {
1793 for my $k (keys %$ret) {
d88362ca
KW
1794 defined
1795 && /[^\000-\177]/
1796 && Encode::_utf8_on($_) for $ret->{$k};
c8d992ba
A
1797 }
1798 return $ret;
1799 } else {
1800 defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
1801 return $ret;
1802 }
1803 }
1804 }
1805 }
1806
1807
1808=item *
1809
1810A large scalar that you know can only contain ASCII
1811
1812Scalars that contain only ASCII and are marked as UTF-8 are sometimes
1813a drag to your program. If you recognize such a situation, just remove
2575c402 1814the UTF8 flag:
c8d992ba 1815
b9cedb1b 1816 utf8::downgrade($val) if $] >= 5.008;
c8d992ba
A
1817
1818=back
1819
393fec97
GS
1820=head1 SEE ALSO
1821
51f494cc 1822L<perlunitut>, L<perluniintro>, L<perluniprops>, L<Encode>, L<open>, L<utf8>, L<bytes>,
a05d7ebb 1823L<perlretut>, L<perlvar/"${^UNICODE}">,
51f494cc 1824L<http://www.unicode.org/reports/tr44>.
393fec97
GS
1825
1826=cut