This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5
=head1 NAME

perlunicode - Unicode support in Perl

=head1 DESCRIPTION

=head2 Important Caveats

Unicode support is an extensive requirement. While Perl does not
implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.

People who want to learn to use Unicode in Perl should probably read
the L<Perl Unicode tutorial, perlunitut|perlunitut> and
L<perluniintro> before reading
this reference document.

Also, the use of Unicode may present security issues that aren't obvious.
Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.

=over 4

=item Safest if you "use feature 'unicode_strings'"

In order to preserve backward compatibility, Perl does not turn
on full internal Unicode support unless the pragma
C<use feature 'unicode_strings'> is specified. (This is automatically
selected if you use C<use 5.012> or higher.) Failure to do this can
trigger surprises. See L</The "Unicode Bug"> below.

This pragma doesn't affect I/O. Nor does it change the internal
representation of strings, only their interpretation. There are still
several places where Unicode isn't fully supported, such as in
filenames.

=item Input and Output Layers

Perl knows when a filehandle uses Perl's internal Unicode encodings
(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
the ":encoding(utf8)" layer. Other encodings can be converted to Perl's
encoding on input or from Perl's encoding on output by use of the
":encoding(...)" layer. See L<open>.

To indicate that Perl source itself is in UTF-8, use C<use utf8;>.

=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts

As a compatibility measure, the C<use utf8> pragma must be explicitly
included to enable recognition of UTF-8 in the Perl scripts themselves
(in string or regular expression literals, or in identifier names) on
ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
machines. B<These are the only times when an explicit C<use utf8>
is needed.> See L<utf8>.

=item BOM-marked scripts and UTF-16 scripts autodetected

If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF-16BE,
or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either
endianness, Perl will correctly read in the script as Unicode.
(BOMless UTF-8 cannot be effectively recognized or differentiated from
ISO 8859-1 or other eight-bit encodings.)

=item C<use encoding> needed to upgrade non-Latin-1 byte strings

By default, there is a fundamental asymmetry in Perl's Unicode model:
implicit upgrading from byte strings to Unicode strings assumes that
they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
downgraded with UTF-8 encoding. This happens because the first 256
code points in Unicode happen to agree with Latin-1.

See L</"Byte and Character Semantics"> for more details.

=back

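As a minimal sketch of the layer mechanism described under "Input and
Output Layers" above, the following writes one character through an
encoding layer and reads it back. The in-memory filehandle, the variable
names, and the strict C<UTF-8> spelling of the layer name are assumptions
made for illustration; they are not taken from this document.

```perl
use strict;
use warnings;

# One character: WHITE SMILING FACE (U+263A).
my $text = "\x{263A}";

# Write through an encoding layer into an in-memory buffer
# (an illustrative stand-in for a real file).
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "open for write: $!";
print {$out} $text;
close($out) or die "close: $!";

# The buffer now holds encoded bytes, not characters.
my $byte_count = length $buffer;

# Read the bytes back through the same layer to recover the character.
open(my $in, '<:encoding(UTF-8)', \$buffer) or die "open for read: $!";
my $round_trip = <$in>;
close($in);

my $char_count = length $round_trip;
```

Here the single character occupies three bytes in the buffer, and
decoding on input restores a string of length one.
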
=head2 Byte and Character Semantics

Perl uses logically-wide characters to represent strings internally.

Starting in Perl 5.14, Perl-level operations work with
characters rather than bytes within the scope of a
C<L<use feature 'unicode_strings'|feature>> (or equivalently
C<use 5.012> or higher). (This is not true if bytes have been
explicitly requested by C<L<use bytes|bytes>>, nor necessarily true
for interactions with the platform's operating system.)

For earlier Perls, and when C<unicode_strings> is not in effect, Perl
provides a fairly safe environment that can handle both types of
semantics in programs. For operations where Perl can unambiguously
decide that the input data are characters, Perl switches to character
semantics. For operations where this determination cannot be made
without additional information from the user, Perl decides in favor of
compatibility and chooses to use byte semantics.

When C<use locale> (but not C<use locale ':not_characters'>) is in
effect, Perl uses the semantics associated with the current locale.
(C<use locale> overrides C<use feature 'unicode_strings'> in the same scope;
while C<use locale ':not_characters'> effectively also selects
C<use feature 'unicode_strings'> in its scope; see L<perllocale>.)
Otherwise, Perl uses the platform's native
byte semantics for characters whose code points are less than 256, and
Unicode semantics for those greater than 255. That means that non-ASCII
characters are undefined except for their
ordinal numbers. This means that none have case (upper and lower), nor are any
a member of character classes, like C<[:alpha:]> or C<\w>. (But all do belong
to the C<\W> class or the Perl regular expression extension C<[:^alpha:]>.)

This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if
none of the program's inputs were marked as being a source of Unicode
character data. Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as %ENV),
or from literals and constants in the source text.

The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
Note that this pragma is only required while Perl defaults to byte
semantics; when character semantics become the default, this pragma
may become a no-op. See L<utf8>.

If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will have
character semantics. This can cause surprises: See L</BUGS>, below.
You can choose to be warned when this happens. See L<encoding::warnings>.

Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
logically just a number ranging from 0 to 2**31 or so. Larger
characters may encode into longer sequences of bytes internally, but
this internal detail is mostly hidden from Perl code.
See L<perluniintro> for more.

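The difference between counting characters and counting bytes can be
sketched as follows. The variable names are illustrative, and the copy
is encoded explicitly with the core C<utf8::encode> function only to
make the byte view visible; ordinary code rarely needs to do this.

```perl
use strict;
use warnings;

# A two-character string: U+00E9 (e with acute) and U+263A (smiley).
my $chars = "\x{E9}\x{263A}";
my $char_length = length $chars;            # counts characters

# Make an explicitly UTF-8 encoded copy to see the byte view.
my $bytes = $chars;
utf8::encode($bytes);                       # encodes the copy in place
my $byte_length = length $bytes;            # counts bytes

# ord() reports a code point, not an encoded byte.
my $code_point = ord substr($chars, 1, 1);
```

The string has two characters, while its UTF-8 encoded copy has five
bytes (two for U+00E9 and three for U+263A).
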
=head2 Effects of Character Semantics

Character semantics have the following effects:

=over 4

=item *

Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.

If you use a Unicode editor to edit your program, Unicode characters may
occur directly within the literal strings in UTF-8 encoding, or UTF-16.
(The former requires a BOM or C<use utf8>, the latter requires a BOM.)

Unicode characters can also be added to a string by using the C<\N{U+...}>
notation. The Unicode code for the desired character, in hexadecimal,
should be placed in the braces, after the C<U>. For instance, a smiley face is
C<\N{U+263A}>.

Alternatively, you can use the C<\x{...}> notation for characters 0x100 and
above. For characters below 0x100 you may get byte semantics instead of
character semantics; see L</The "Unicode Bug">. On EBCDIC machines there is
the additional problem that the value for such characters gives the EBCDIC
character rather than the Unicode one; thus it is more portable to use
C<\N{U+...}> instead.

Additionally, you can use the C<\N{...}> notation and put the official
Unicode character name within the braces, such as
C<\N{WHITE SMILING FACE}>. This automatically loads the L<charnames>
module with the C<:full> and C<:short> options. If you prefer different
options for this module, you can instead, before the C<\N{...}>,
explicitly load it with your desired options; for example,

    use charnames ':loose';

=item *

If an appropriate L<encoding> is specified, identifiers within the
Perl script may contain Unicode alphanumeric characters, including
ideographs. Perl does not currently attempt to canonicalize variable
names.

=item *

Regular expressions match characters instead of bytes. "." matches
a character instead of a byte.

=item *

Bracketed character classes in regular expressions match characters instead of
bytes and match against the character properties specified in the
Unicode properties database. C<\w> can be used to match a Japanese
ideograph, for instance.

=item *

Named Unicode properties, scripts, and block ranges may be used (like bracketed
character classes) by using the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".
See L</"Unicode Character Properties"> for more details.

You can define your own character properties and use them
in the regular expression with the C<\p{}> or C<\P{}> construct.
See L</"User-Defined Character Properties"> for more details.

=item *

The special pattern C<\X> matches a logical character, an "extended grapheme
cluster" in Standardese. In Unicode what appears to the user to be a single
character, for example an accented C<G>, may in fact be composed of a sequence
of characters, in this case a C<G> followed by an accent character. C<\X>
will match the entire sequence.

=item *

The C<tr///> operator translates characters instead of bytes. Note
that the C<tr///CU> functionality has been removed. For similar
functionality see C<pack('U0', ...)> and C<pack('C0', ...)>.

=item *

Case translation operators use the Unicode case translation tables
when character input is provided. Note that C<uc()>, or C<\U> in
interpolated strings, translates to uppercase, while C<ucfirst>,
or C<\u> in interpolated strings, translates to titlecase in languages
that make the distinction (which is equivalent to uppercase in languages
without the distinction).

=item *

Most operators that deal with positions or lengths in a string will
automatically switch to using character positions, including
C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
C<sprintf()>, C<write()>, and C<length()>. An operator that
specifically does not switch is C<vec()>. Operators that really don't
care include operators that treat strings as a bucket of bits such as
C<sort()>, and operators dealing with filenames.

=item *

The C<pack()>/C<unpack()> letter C<C> does I<not> change, since it is often
used for byte-oriented formats. Think C<char> in the C language.

There is a new C<U> specifier that converts between Unicode characters
and code points. There is also a C<W> specifier that is the equivalent of
C<chr>/C<ord> and properly handles character values even if they are above 255.

=item *

The C<chr()> and C<ord()> functions work on characters, similar to
C<pack("W")> and C<unpack("W")>, I<not> C<pack("C")> and
C<unpack("C")>. C<pack("C")> and C<unpack("C")> are methods for
emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.
While these methods reveal the internal encoding of Unicode strings,
that is not something one normally needs to care about at all.

=item *

The bit string operators, C<& | ^ ~>, can operate on character data.
However, for backward compatibility (for example, with code that applies
bit string operations to characters that are all less than 256 in
ordinal value), one should not use C<~> (the bit complement) on a mix of
characters with values less than 256 and values greater than 255. Most
importantly, DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and
C<~($x&$y) eq ~$x|~$y>) will not hold. The reason for this mathematical
I<faux pas> is that the complement cannot return B<both> the 8-bit
(byte-wide) bit complement B<and> the full character-wide bit complement.

=item *

There is a CPAN module, L<Unicode::Casing>, which allows you to define
your own mappings to be used in C<lc()>, C<lcfirst()>, C<uc()>,
C<ucfirst()>, and C<fc()> (or their double-quoted string inlined
versions such as C<\U>).
(Prior to Perl 5.16, this functionality was partially provided
in the Perl core, but suffered from a number of insurmountable
drawbacks, so the CPAN module was written instead.)

=item *

And finally, C<scalar reverse()> reverses by character rather than by byte.

=back

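A few of the effects listed above can be checked with a short script.
This is only a sketch: the sample strings and variable names are
illustrative, and the digraph U+01C6 is chosen as an assumption-free
example only in the sense that its uppercase and titlecase forms are
distinct code points.

```perl
use strict;
use warnings;
use charnames ':full';   # explicit load; newer Perls load it on demand

# \N{U+...} and \N{name} denote the same character.
my $same_char = ("\N{U+263A}" eq "\N{WHITE SMILING FACE}") ? 1 : 0;

# \X matches a whole extended grapheme cluster: "G" plus a combining acute.
my $cluster_length = 0;
if ("G\x{301}x" =~ /^(\X)/) {
    $cluster_length = length $1;
}

# uc() gives uppercase and ucfirst() gives titlecase; the two differ for
# U+01C6 LATIN SMALL LETTER DZ WITH CARON.
my $upper = sprintf '%04X', ord(uc("\x{1C6}"));
my $title = sprintf '%04X', ord(ucfirst("\x{1C6}"));

# scalar reverse() reverses by character, so wide characters stay intact.
my $reversed = scalar reverse "ab\x{263A}";
```

The cluster captured by C<\X> is two code points long, and the uppercase
and titlecase results are U+01C4 and U+01C5 respectively.
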
=head2 Unicode Character Properties

(The only time that Perl considers a sequence of individual code
points as a single logical character is in the C<\X> construct, already
mentioned above. Therefore "character" in this discussion means a single
Unicode code point.)

Very nearly all Unicode character properties are accessible through
regular expressions by using the C<\p{}> "matches property" construct
and the C<\P{}> "doesn't match property" for its negation.

For instance, C<\p{Uppercase}> matches any single character with the Unicode
"Uppercase" property, while C<\p{L}> matches any character whose
General_Category property value is "L" (Letter). Braces are not
required for single letter property names, so C<\p{L}> is equivalent to C<\pL>.

More formally, C<\p{Uppercase}> matches any single character whose Unicode
Uppercase property value is True, and C<\P{Uppercase}> matches any character
whose Uppercase property value is False, and they could have been written as
C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively.

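The equivalences just described can be tried directly. The following is
a small sketch; the hash and its key names are illustrative only.

```perl
use strict;
use warnings;

# Each entry records whether the match succeeded (1) or failed (0).
my %result = (
    short_form     => ("A" =~ /\p{Uppercase}/       ? 1 : 0),
    explicit_true  => ("A" =~ /\p{Uppercase=True}/  ? 1 : 0),
    negated        => ("a" =~ /\P{Uppercase}/       ? 1 : 0),
    explicit_false => ("a" =~ /\p{Uppercase=False}/ ? 1 : 0),
    braced_letter  => ("z" =~ /\p{L}/               ? 1 : 0),
    bare_letter    => ("z" =~ /\pL/                 ? 1 : 0),
    digit_is_not_L => ("7" =~ /\pL/                 ? 1 : 0),
);
```

Every entry is 1 except C<digit_is_not_L>, because a digit has a
General_Category of Number rather than Letter.
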
This formality is needed when properties are not binary; that is, if they can
take on more values than just True and False. For example, the Bidi_Class
property (see L</"Bidirectional Character Types"> below) can take on several
different values, such as Left, Right, Whitespace, and others. To match these,
one needs to specify both the property name (Bidi_Class), AND the value being
matched against
(Left, Right, etc.). This is done, as in the examples above, by having the
two components separated by an equal sign (or interchangeably, a colon), like
C<\p{Bidi_Class: Left}>.

All Unicode-defined character properties may be written in these compound forms
of C<\p{property=value}> or C<\p{property:value}>, but Perl provides some
additional properties that are written only in the single form, as well as
single-form short-cuts for all binary properties and certain others described
below, in which you may omit the property name and the equals or colon
separator.

Most Unicode character properties have at least two synonyms (or aliases if you
prefer): a short one that is easier to type and a longer one that is more
descriptive and hence easier to understand. Thus the "L" and "Letter" properties
above are equivalent and can be used interchangeably. Likewise,
"Upper" is a synonym for "Uppercase", and we could have written
C<\p{Uppercase}> equivalently as C<\p{Upper}>. Also, there are typically
various synonyms for the values the property can be. For binary properties,
"True" has 3 synonyms: "T", "Yes", and "Y"; and "False" correspondingly has
"F", "No", and "N". But be careful. A short form of a value for one property
may not mean the same thing as the same short form for another. Thus, for the
General_Category property, "L" means "Letter", but for the Bidi_Class property,
"L" means "Left". A complete list of properties and synonyms is in
L<perluniprops>.

Upper/lower case differences in property names and values are irrelevant;
thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>.
Similarly, you can add or subtract underscores anywhere in the middle of a
word, so that these are also equivalent to C<\p{U_p_p_e_r}>. And white space
is irrelevant adjacent to non-word characters, such as the braces and the equals
or colon separators, so C<\p{ Upper }> and C<\p{ Upper_case : Y }> are
equivalent to these as well. In fact, white space and even
hyphens can usually be added or deleted anywhere. So even C<\p{ Up-per case = Yes}> is
equivalent. All this is called "loose-matching" by Unicode. The few places
where stricter matching is used are in the middle of numbers, and in the Perl
extension properties that begin or end with an underscore. Stricter matching
cares about white space (except adjacent to non-word characters),
hyphens, and non-interior underscores.

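Loose matching can be observed by compiling several of the spellings
mentioned above and testing them against the same characters. This is a
sketch only; the list and counter names are illustrative.

```perl
use strict;
use warnings;

# Spellings of the Uppercase property taken from the paragraph above.
my @spellings = (
    '\p{Upper}',
    '\p{upper}',
    '\p{UpPeR}',
    '\p{U_p_p_e_r}',
    '\p{ Upper }',
    '\p{ Upper_case : Y }',
);

# Count how many spellings match "A" and how many match "a".
my ($matched_upper, $matched_lower) = (0, 0);
for my $spelling (@spellings) {
    $matched_upper++ if "A" =~ /$spelling/;
    $matched_lower++ if "a" =~ /$spelling/;
}
```

If the spellings are all equivalent, each one matches C<"A"> and none
matches C<"a">.
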
You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
(^) between the first brace and the property name: C<\p{^Tamil}> is
equal to C<\P{Tamil}>.

Almost all properties are immune to case-insensitive matching. That is,
adding a C</i> regular expression modifier does not change what they
match. There are two sets that are affected.
The first set is
C<Uppercase_Letter>,
C<Lowercase_Letter>,
and C<Titlecase_Letter>,
all of which match C<Cased_Letter> under C</i> matching.
And the second set is
C<Uppercase>,
C<Lowercase>,
and C<Titlecase>,
all of which match C<Cased> under C</i> matching.
This set also includes its subsets C<PosixUpper> and C<PosixLower>, both
of which under C</i> matching match C<PosixAlpha>.
(The difference between these sets is that some things, such as Roman
numerals, come in both upper and lower case so they are C<Cased>, but aren't
considered letters, so they aren't C<Cased_Letter>s.)

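The caret negation and the two C</i> exceptions can be sketched like
this. The variable names are illustrative, and U+2170 (SMALL ROMAN
NUMERAL ONE) is picked here as one example of a cased character that is
not a letter.

```perl
use strict;
use warnings;

# A caret inside the braces negates, so \p{^Tamil} behaves like \P{Tamil}.
my $caret_negates = ("A" =~ /\p{^Tamil}/ && "A" =~ /\P{Tamil}/) ? 1 : 0;

# Without /i, a lowercase letter is not an Uppercase_Letter ...
my $plain  = ("a" =~ /\p{Uppercase_Letter}/)  ? 1 : 0;
# ... but under /i the property widens to Cased_Letter.
my $folded = ("a" =~ /\p{Uppercase_Letter}/i) ? 1 : 0;

# U+2170 is Cased (so it matches Uppercase under /i) but is not a letter
# (so it does not match Uppercase_Letter under /i).
my $numeral = "\x{2170}";
my $numeral_cased  = ($numeral =~ /\p{Uppercase}/i)        ? 1 : 0;
my $numeral_letter = ($numeral =~ /\p{Uppercase_Letter}/i) ? 1 : 0;
```
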
The result is undefined if you try to match a non-Unicode code point
(that is, one above 0x10FFFF) against a Unicode property. Currently, a
warning is raised, and the match will fail. In some cases, this is
counterintuitive, as both these fail:

    chr(0x110000) =~ /\p{ASCII_Hex_Digit=True}/      # Fails.
    chr(0x110000) =~ /\p{ASCII_Hex_Digit=False}/     # Fails!

=head3 B<General_Category>

Every Unicode character is assigned a general category, which is the "most
usual categorization of a character" (from
L<http://www.unicode.org/reports/tr44>).

The compound way of writing these is like C<\p{General_Category=Number}>
(short, C<\p{gc:n}>). But Perl furnishes shortcuts in which everything up
through the equal or colon separator is omitted. So you can instead just write
C<\pN>.

Here are the short and long forms of the General Category properties:

    Short       Long

    L           Letter
    LC, L&      Cased_Letter (that is: [\p{Ll}\p{Lu}\p{Lt}])
    Lu          Uppercase_Letter
    Ll          Lowercase_Letter
    Lt          Titlecase_Letter
    Lm          Modifier_Letter
    Lo          Other_Letter

    M           Mark
    Mn          Nonspacing_Mark
    Mc          Spacing_Mark
    Me          Enclosing_Mark

    N           Number
    Nd          Decimal_Number (also Digit)
    Nl          Letter_Number
    No          Other_Number

    P           Punctuation (also Punct)
    Pc          Connector_Punctuation
    Pd          Dash_Punctuation
    Ps          Open_Punctuation
    Pe          Close_Punctuation
    Pi          Initial_Punctuation
                (may behave like Ps or Pe depending on usage)
    Pf          Final_Punctuation
                (may behave like Ps or Pe depending on usage)
    Po          Other_Punctuation

    S           Symbol
    Sm          Math_Symbol
    Sc          Currency_Symbol
    Sk          Modifier_Symbol
    So          Other_Symbol

    Z           Separator
    Zs          Space_Separator
    Zl          Line_Separator
    Zp          Paragraph_Separator

    C           Other
    Cc          Control (also Cntrl)
    Cf          Format
    Cs          Surrogate
    Co          Private_Use
    Cn          Unassigned

Single-letter properties match all characters in any of the
two-letter sub-properties starting with the same letter.
C<LC> and C<L&> are special: both are aliases for the set consisting of
everything matched by C<Ll>, C<Lu>, and C<Lt>.

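As a brief sketch of the General_Category forms above, the following
checks the three spellings of the Number category and the relationship
between C<L>, C<LC>, and C<L&>. The variable names are illustrative;
U+05D0 (HEBREW LETTER ALEF) is used as an example of an C<Lo> character.

```perl
use strict;
use warnings;

# Three spellings of the same General_Category test.
my $long_form  = ("7" =~ /\p{General_Category=Number}/) ? 1 : 0;
my $short_form = ("7" =~ /\p{gc:n}/)                    ? 1 : 0;
my $shortcut   = ("7" =~ /\pN/)                         ? 1 : 0;

# A single-letter category covers its two-letter sub-categories:
# "a" is Ll, "A" is Lu, and both are L.
my $letters = 0;
for my $char ("a", "A") {
    $letters++ if $char =~ /\A\p{L}\z/;
}

# LC and L& cover Ll, Lu and Lt, but not Lo.
my $alef = "\x{5D0}";
my $alef_is_letter = ($alef =~ /\p{L}/)  ? 1 : 0;
my $alef_is_cased  = ($alef =~ /\p{LC}/) ? 1 : 0;
my $amp_alias      = ("a" =~ /\p{L&}/)   ? 1 : 0;
```
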
=head3 B<Bidirectional Character Types>

Because scripts differ in their directionality (Hebrew and Arabic are
written right to left, for example), Unicode supplies these properties in
the Bidi_Class class:

    Property    Meaning

    L           Left-to-Right
    LRE         Left-to-Right Embedding
    LRO         Left-to-Right Override
    R           Right-to-Left
    AL          Arabic Letter
    RLE         Right-to-Left Embedding
    RLO         Right-to-Left Override
    PDF         Pop Directional Format
    EN          European Number
    ES          European Separator
    ET          European Terminator
    AN          Arabic Number
    CS          Common Separator
    NSM         Non-Spacing Mark
    BN          Boundary Neutral
    B           Paragraph Separator
    S           Segment Separator
    WS          Whitespace
    ON          Other Neutrals

This property is always written in the compound form.
For example, C<\p{Bidi_Class:R}> matches characters that are normally
written right to left.

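A short sketch of the compound form with a handful of the values from
the table above follows. The sample characters and variable names are
illustrative choices, not taken from this document.

```perl
use strict;
use warnings;

# Each pair is a character and the Bidi_Class value expected to match it.
my @cases = (
    [ "A",       'L'  ],   # Latin letter: Left-to-Right
    [ "\x{5D0}", 'R'  ],   # HEBREW LETTER ALEF: Right-to-Left
    [ "\x{627}", 'AL' ],   # ARABIC LETTER ALEF: Arabic Letter
    [ "7",       'EN' ],   # ASCII digit: European Number
    [ " ",       'WS' ],   # space: Whitespace
);

my $all_match = 1;
for my $case (@cases) {
    my ($char, $class) = @$case;
    $all_match = 0 unless $char =~ /\p{Bidi_Class:$class}/;
}

# The compound form matters: \p{L} alone means General_Category Letter,
# so it matches a Hebrew letter even though its Bidi_Class is R.
my $hebrew_is_letter = ("\x{5D0}" =~ /\p{L}/) ? 1 : 0;
```
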
=head3 B<Scripts>

The world's languages are written in many different scripts. This sentence
(unless you're reading it in translation) is written in Latin, while Russian is
written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in
Hiragana or Katakana. There are many more.

The Unicode Script and Script_Extensions properties give what script a
given character is in. Either property can be specified with the
compound form like
C<\p{Script=Hebrew}> (short: C<\p{sc=hebr}>), or
C<\p{Script_Extensions=Javanese}> (short: C<\p{scx=java}>).
In addition, Perl furnishes shortcuts for all
C<Script> property names. You can omit everything up through the equals
(or colon), and simply write C<\p{Latin}> or C<\P{Cyrillic}>.
(This is not true for C<Script_Extensions>, which is required to be
written in the compound form.)

The difference between these two properties involves characters that are
used in multiple scripts. For example the digits '0' through '9' are
used in many parts of the world. These are placed in a script named
C<Common>. Other characters are used in just a few scripts. For
example, the "KATAKANA-HIRAGANA DOUBLE HYPHEN" is used in both Japanese
scripts, Katakana and Hiragana, but nowhere else. The C<Script>
property places all characters that are used in multiple scripts in the
C<Common> script, while the C<Script_Extensions> property places those
that are used in only a few scripts into each of those scripts; while
still using C<Common> for those used in many scripts. Thus both these
match:

    "0" =~ /\p{sc=Common}/     # Matches
    "0" =~ /\p{scx=Common}/    # Matches

And only the first of these matches:

    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Common}/   # Matches
    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Common}/  # No match

And only the last two of these match:

    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Hiragana}/   # No match
    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Katakana}/   # No match
    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Hiragana}/  # Matches
    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Katakana}/  # Matches

C<Script_Extensions> is thus an improved C<Script>, in which there are
fewer characters in the C<Common> script, and correspondingly more in
other scripts. It is new in Unicode version 6.0, and its data are likely
to change significantly in later releases, as things get sorted out.

(Actually, besides C<Common>, the C<Inherited> script contains
characters that are used in multiple scripts. These are modifier
characters which modify other characters, and inherit the script value
of the controlling character. Some of these are used in many scripts,
and so go into C<Inherited> in both C<Script> and C<Script_Extensions>.
Others are used in just a few scripts, so are in C<Inherited> in
C<Script>, but not in C<Script_Extensions>.)

It is worth stressing that there are several different sets of digits in
Unicode that are equivalent to 0-9 and are matchable by C<\d> in a
regular expression. If they are used in a single language only, they
are in that language's C<Script> and C<Script_Extensions>. If they are
used in more than one script, they will be in C<sc=Common>, but only
if they are used in many scripts should they be in C<scx=Common>.

A complete list of scripts and their shortcuts is in L<perluniprops>.

=head3 B<Use of "Is" Prefix>

For backward compatibility (with Perl 5.6), all properties mentioned
so far may have C<Is> or C<Is_> prepended to their name, so C<\P{Is_Lu}>, for
example, is equal to C<\P{Lu}>, and C<\p{IsScript:Arabic}> is equal to
C<\p{Arabic}>.

=head3 B<Blocks>

In addition to B<scripts>, Unicode also defines B<blocks> of
characters. The difference between scripts and blocks is that the
concept of scripts is closer to natural languages, while the concept
of blocks is more of an artificial grouping based on groups of Unicode
characters with consecutive ordinal values. For example, the "Basic Latin"
block is all characters whose ordinals are between 0 and 127, inclusive; in
other words, the ASCII characters. The "Latin" script contains some letters
from this as well as several other blocks, like "Latin-1 Supplement",
"Latin Extended-A", etc., but it does not contain all the characters from
those blocks. It does not, for example, contain the digits 0-9, because
those digits are shared across many scripts, and hence are in the
C<Common> script.

For more about scripts versus blocks, see UAX#24 "Unicode Script Property":
L<http://www.unicode.org/reports/tr24>

The C<Script> or C<Script_Extensions> properties are likely to be the
ones you want to use when processing
natural language; the Block property may occasionally be useful in working
with the nuts and bolts of Unicode.

Block names are matched in the compound form, like C<\p{Block: Arrows}> or
C<\p{Blk=Hebrew}>. Unlike most other properties, only a few block names have a
Unicode-defined short name. But Perl does provide a (slight) shortcut: You
can say, for example C<\p{In_Arrows}> or C<\p{In_Hebrew}>. For backwards
compatibility, the C<In> prefix may be omitted if there is no naming conflict
with a script or any other property, and you can even use an C<Is> prefix
instead in those cases. But it is not a good idea to do this, for a couple
of reasons:

=over 4

=item 1

It is confusing. There are many naming conflicts, and you may forget some.
For example, C<\p{Hebrew}> means the I<script> Hebrew, and NOT the I<block>
Hebrew. But would you remember that 6 months from now?

=item 2

It is unstable. A new version of Unicode may preempt the current meaning by
creating a property with the same name. There was a time in very early Unicode
releases when C<\p{Hebrew}> would have matched the I<block> Hebrew; now it
doesn't.

=back

Some people prefer to always use C<\p{Block: foo}> and C<\p{Script: bar}>
instead of the shortcuts, whether for clarity, because they can't remember the
difference between 'In' and 'Is' anyway, or because they aren't confident that
those who eventually will read their code will know that difference.

A complete list of blocks and their shortcuts is in L<perluniprops>.

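The block forms and the script-versus-block distinction described in
this section can be sketched as follows. The sample characters and
variable names are illustrative; U+2190 (LEFTWARDS ARROW) is used as a
member of the Arrows block.

```perl
use strict;
use warnings;

# U+2190 LEFTWARDS ARROW lies in the Arrows block.
my $arrow = "\x{2190}";
my $compound_form = ($arrow =~ /\p{Block: Arrows}/) ? 1 : 0;
my $short_name    = ($arrow =~ /\p{Blk=Arrows}/)    ? 1 : 0;
my $in_prefix     = ($arrow =~ /\p{In_Arrows}/)     ? 1 : 0;

# The digit "5" is in the Basic Latin block, yet its script is Common,
# not Latin, because digits are shared across many scripts.
my $digit_in_block  = ("5" =~ /\p{Block: Basic_Latin}/) ? 1 : 0;
my $digit_is_latin  = ("5" =~ /\p{Script: Latin}/)      ? 1 : 0;
my $digit_is_common = ("5" =~ /\p{Script: Common}/)     ? 1 : 0;
```

Writing the property name out in full, as here, avoids having to
remember which bare names refer to a script and which to a block.
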
=head3 B<Other Properties>

There are many more properties than the very basic ones described here.
A complete list is in L<perluniprops>.

Unicode defines all its properties in the compound form, so all single-form
properties are Perl extensions. Most of these are just synonyms for the
Unicode ones, but some are genuine extensions, including several that are in
the compound form. And quite a few of these are actually recommended by Unicode
(in L<http://www.unicode.org/reports/tr18>).

This section gives some details on all extensions that aren't just
synonyms for compound-form Unicode properties
(for those properties, you'll have to refer to the
L<Unicode Standard|http://www.unicode.org/reports/tr44>).

=over

=item B<C<\p{All}>>

This matches any of the 1_114_112 Unicode code points. It is a synonym for
C<\p{Any}>.

=item B<C<\p{Alnum}>>

This matches any C<\p{Alphabetic}> or C<\p{Decimal_Number}> character.

=item B<C<\p{Any}>>

This matches any of the 1_114_112 Unicode code points. It is a synonym for
C<\p{All}>.

=item B<C<\p{ASCII}>>

This matches any of the 128 characters in the US-ASCII character set,
which is a subset of Unicode.

=item B<C<\p{Assigned}>>

This matches any assigned code point; that is, any code point whose general
category is not Unassigned (or equivalently, not Cn).

=item B<C<\p{Blank}>>

This is the same as C<\h> and C<\p{HorizSpace}>: A character that changes the
spacing horizontally.

=item B<C<\p{Decomposition_Type: Non_Canonical}>>    (Short: C<\p{Dt=NonCanon}>)

Matches a character that has a non-canonical decomposition.

To understand the use of this rarely used property=value combination, it is
necessary to know some basics about decomposition.
Consider a character, say H. It could appear with various marks around it,
such as an acute accent, or a circumflex, or various hooks, circles, arrows,
I<etc.>, above, below, to one side or the other, etc. There are many
possibilities among the world's languages. The number of combinations is
astronomical, and if there were a character for each combination, it would
soon exhaust Unicode's more than a million possible characters. So Unicode
took a different approach: there is a character for the base H, and a
character for each of the possible marks, and these can be variously combined
to get a final logical character. So a logical character--what appears to be a
single character--can be a sequence of more than one individual character.
This is called an "extended grapheme cluster"; Perl furnishes the C<\X>
regular expression construct to match such sequences.

But Unicode's intent is to unify the existing character set standards and
practices, and several pre-existing standards have single characters that
mean the same thing as some of these combinations. ISO-8859-1, for example,
has quite a few of these in the Latin-1 range, one of which is "LATIN
CAPITAL LETTER E WITH ACUTE". Because this character was in this pre-existing
standard, Unicode added it to its repertoire. But this character is considered
by Unicode to be equivalent to the sequence consisting of the character
"LATIN CAPITAL LETTER E" followed by the character "COMBINING ACUTE ACCENT".
9f815e24
KW
680
681"LATIN CAPITAL LETTER E WITH ACUTE" is called a "pre-composed" character, and
b19eb496 682its equivalence with the sequence is called canonical equivalence. All
9f815e24 683pre-composed characters are said to have a decomposition (into the equivalent
b19eb496 684sequence), and the decomposition type is also called canonical.
9f815e24
KW
685
686However, many more characters have a different type of decomposition, a
687"compatible" or "non-canonical" decomposition. The sequences that form these
688decompositions are not considered canonically equivalent to the pre-composed
689character. An example, again in the Latin-1 range, is the "SUPERSCRIPT ONE".
b19eb496 690It is somewhat like a regular digit 1, but not exactly; its decomposition
9f815e24
KW
691into the digit 1 is called a "compatible" decomposition, specifically a
692"super" decomposition. There are several such compatibility
693decompositions (see L<http://www.unicode.org/reports/tr44>), including one
b19eb496 694called "compat", which means some miscellaneous type of decomposition
42581d5d 695that doesn't fit into the decomposition categories that Unicode has chosen.
9f815e24
KW
696
697Note that most Unicode characters don't have a decomposition, so their
698decomposition type is "None".
699
b19eb496
TC
700For your convenience, Perl has added the C<Non_Canonical> decomposition
701type to mean any of the several compatibility decompositions.
9f815e24
KW
702
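The distinction between canonical and compatibility decompositions can be
checked with a small sketch. It assumes the core module L<Unicode::Normalize>
is available and uses the two Latin-1 characters discussed above; the variable
names are only illustrative.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFKD);

my $e_acute = "\x{C9}";   # LATIN CAPITAL LETTER E WITH ACUTE
my $sup_one = "\x{B9}";   # SUPERSCRIPT ONE

# Canonical decomposition: E followed by COMBINING ACUTE ACCENT.
my $canon_ok = NFD($e_acute) eq "E\x{301}";

# SUPERSCRIPT ONE has only a compatibility decomposition, so NFD leaves
# it alone, while NFKD maps it to the plain digit 1.
my $nfd_same = NFD($sup_one) eq $sup_one;
my $nfkd_one = NFKD($sup_one) eq "1";

# Only the character with the compatibility decomposition should match.
my $e_is_noncanon   = $e_acute =~ /\p{Dt=NonCanon}/ ? 1 : 0;
my $sup_is_noncanon = $sup_one =~ /\p{Dt=NonCanon}/ ? 1 : 0;

print "E acute: $e_is_noncanon, superscript one: $sup_is_noncanon\n";
```

Here C<\p{Dt=NonCanon}> should match only SUPERSCRIPT ONE, because the
decomposition of E WITH ACUTE is canonical.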
703=item B<C<\p{Graph}>>
704
705Matches any character that is graphic. Theoretically, this means a character
706that on a printer would cause ink to be used.
707
708=item B<C<\p{HorizSpace}>>
709
b19eb496 710This is the same as C<\h> and C<\p{Blank}>: a character that changes the
9f815e24
KW
711spacing horizontally.
712
42581d5d 713=item B<C<\p{In=*}>>
9f815e24
KW
714
This is a synonym for C<\p{Present_In=*}>.
716
717=item B<C<\p{PerlSpace}>>
718
d28d8023
KW
This is the same as C<\s>, restricted to ASCII, namely C<S<[ \f\n\r\t]>>
and, starting in Perl v5.18, experimentally, a vertical tab.

Mnemonic: Perl's (original) space.

=item B<C<\p{PerlWord}>>

This is the same as C<\w>, restricted to ASCII, namely C<[A-Za-z0-9_]>.
727
728Mnemonic: Perl's (original) word.
729
42581d5d 730=item B<C<\p{Posix...}>>
9f815e24 731
b19eb496
TC
732There are several of these, which are equivalents using the C<\p>
733notation for Posix classes and are described in
42581d5d 734L<perlrecharclass/POSIX Character Classes>.
9f815e24
KW
735
736=item B<C<\p{Present_In: *}>> (Short: C<\p{In=*}>)
737
738This property is used when you need to know in what Unicode version(s) a
739character is.
740
The "*" above stands for some two-digit Unicode version number, such as
C<1.1> or C<4.0>; or the "*" can also be C<Unassigned>. This property will
match the code points whose final disposition has been settled as of the
Unicode release given by the version number; C<\p{Present_In: Unassigned}>
will match those code points whose meaning has yet to be assigned.

For example, C<U+0041> "LATIN CAPITAL LETTER A" was present in the very first
Unicode release available, which is C<1.1>, so this property is true for all
valid "*" versions. On the other hand, C<U+1EFF> was not assigned until version
5.1 when it became "LATIN SMALL LETTER Y WITH LOOP", so the only "*" values
that match it are 5.1, 5.2, and later.
752
753Unicode furnishes the C<Age> property from which this is derived. The problem
754with Age is that a strict interpretation of it (which Perl takes) has it
755matching the precise release a code point's meaning is introduced in. Thus
756C<U+0041> would match only 1.1; and C<U+1EFF> only 5.1. This is not usually what
757you want.
758
759Some non-Perl implementations of the Age property may change its meaning to be
760the same as the Perl Present_In property; just be aware of that.
761
Another confusion with both these properties is that the definition is not
that the code point has been I<assigned>, but that the meaning of the code point
has been I<determined>. This is because 66 code points will always be
unassigned, and so the Age for them is the Unicode version in which the decision
to make them so was made. For example, C<U+FDD0> is to be permanently
unassigned to a character, and the decision to do that was made in version 3.1,
so C<\p{Age=3.1}> matches this character, as do C<\p{Present_In: 3.1}> and
every later C<Present_In> version.
9f815e24
KW
769
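The difference between C<Age> and C<Present_In> described above can be seen in
a short sketch. It assumes a Perl whose Unicode tables include version 5.2 or
later; the hash keys are only illustrative labels.

```perl
use strict;
use warnings;

my $a_cap  = "\x{41}";     # LATIN CAPITAL LETTER A, present since 1.1
my $y_loop = "\x{1EFF}";   # LATIN SMALL LETTER Y WITH LOOP, added in 5.1

my %result = (
    # Age matches only the release that introduced the code point.
    a_age_1_1 => ($a_cap  =~ /\p{Age=1.1}/         ? 1 : 0),
    a_age_5_1 => ($a_cap  =~ /\p{Age=5.1}/         ? 1 : 0),

    # Present_In matches that release and every later one.
    a_in_5_1  => ($a_cap  =~ /\p{Present_In: 5.1}/ ? 1 : 0),
    y_in_5_0  => ($y_loop =~ /\p{Present_In: 5.0}/ ? 1 : 0),
    y_in_5_1  => ($y_loop =~ /\p{Present_In: 5.1}/ ? 1 : 0),
    y_in_5_2  => ($y_loop =~ /\p{In=5.2}/          ? 1 : 0),
);

print "$_ => $result{$_}\n" for sort keys %result;
```

Only C<a_age_5_1> and C<y_in_5_0> should come out as 0.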
770=item B<C<\p{Print}>>
771
ae5b72c8 772This matches any character that is graphical or blank, except controls.
9f815e24
KW
773
774=item B<C<\p{SpacePerl}>>
775
776This is the same as C<\s>, including beyond ASCII.
777
4d4acfba 778Mnemonic: Space, as modified by Perl. (It doesn't include the vertical tab
b19eb496 779which both the Posix standard and Unicode consider white space.)
9f815e24 780
4364919a
KW
781=item B<C<\p{Title}>> and B<C<\p{Titlecase}>>
782
Under case-sensitive matching, these both match the same code points as
C<\p{General Category=Titlecase_Letter}> (C<\p{gc=lt}>). The difference
is that under C</i> caseless matching, these match the same as
C<\p{Cased}>, whereas C<\p{gc=lt}> matches C<\p{Cased_Letter}>.
787
9f815e24
KW
788=item B<C<\p{VertSpace}>>
789
790This is the same as C<\v>: A character that changes the spacing vertically.
791
792=item B<C<\p{Word}>>
793
b19eb496 794This is the same as C<\w>, including over 100_000 characters beyond ASCII.
9f815e24 795
42581d5d
KW
796=item B<C<\p{XPosix...}>>
797
b19eb496 798There are several of these, which are the standard Posix classes
42581d5d
KW
799extended to the full Unicode range. They are described in
800L<perlrecharclass/POSIX Character Classes>.
801
9f815e24
KW
802=back
803
376d9008 804=head2 User-Defined Character Properties
491fd90a 805
51f494cc 806You can define your own binary character properties by defining subroutines
9d1a5160
KW
807whose names begin with "In" or "Is". (The experimental feature
808L<perlre/(?[ ])> provides an alternative which allows more complex
809definitions.) The subroutines can be defined in any
51f494cc
KW
810package. The user-defined properties can be used in the regular expression
811C<\p> and C<\P> constructs; if you are using a user-defined property from a
812package other than the one you are in, you must specify its package in the
813C<\p> or C<\P> construct.
bac0b425 814
    # assuming property IsForeign defined in Lang::
    package main;  # property package name required
    if ($txt =~ /\p{Lang::IsForeign}+/) { ... }

    package Lang;  # property package name not required
    if ($txt =~ /\p{IsForeign}+/) { ... }

823Note that the effect is compile-time and immutable once defined.
b19eb496
TC
824However, the subroutines are passed a single parameter, which is 0 if
825case-sensitive matching is in effect and non-zero if caseless matching
56ca34ca
KW
826is in effect. The subroutine may return different values depending on
827the value of the flag, and one set of values will immutably be in effect
b19eb496 828for all case-sensitive matches, and the other set for all case-insensitive
56ca34ca 829matches.
491fd90a 830
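As a minimal sketch of the flag parameter described above, the hypothetical
property C<IsShout> below (the name is invented for this illustration) returns
one set for case-sensitive matching and a wider set for caseless matching:

```perl
use strict;
use warnings;

# Hypothetical property: ASCII upper case letters normally, and any ASCII
# letter when the pattern is compiled for caseless matching.
sub IsShout {
    my ($caseless) = @_;
    return $caseless
        ? "0041\t005A\n0061\t007A\n"
        : "0041\t005A\n";
}

my $plain_match    = "a" =~ /\p{IsShout}/  ? 1 : 0;
my $caseless_match = "a" =~ /\p{IsShout}/i ? 1 : 0;

print "plain: $plain_match, caseless: $caseless_match\n";
```

The lower case "a" should be rejected by the case-sensitive pattern and
accepted by the C</i> one.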
Note that if the regular expression is tainted, then Perl will die rather
than call the subroutine when the name of the subroutine is
determined by the tainted data.
834
376d9008
JB
835The subroutines must return a specially-formatted string, with one
836or more newline-separated lines. Each line must be one of the following:
491fd90a
JH
837
838=over 4
839
840=item *
841
510254c9
A
842A single hexadecimal number denoting a Unicode code point to include.
843
844=item *
845
Two hexadecimal numbers separated by horizontal whitespace (space or
tab characters) denoting a range of Unicode code points to include.
491fd90a
JH
848
849=item *
850
376d9008 851Something to include, prefixed by "+": a built-in character
830137a2
KW
852property (prefixed by "utf8::") or a fully qualified (including package
853name) user-defined character property,
bac0b425
JP
854to represent all the characters in that property; two hexadecimal code
855points for a range; or a single hexadecimal code point.
491fd90a
JH
856
857=item *
858
376d9008 859Something to exclude, prefixed by "-": an existing character
830137a2
KW
860property (prefixed by "utf8::") or a fully qualified (including package
861name) user-defined character property,
bac0b425
JP
862to represent all the characters in that property; two hexadecimal code
863points for a range; or a single hexadecimal code point.
491fd90a
JH
864
865=item *
866
Something to negate, prefixed by "!": an existing character
830137a2
KW
868property (prefixed by "utf8::") or a fully qualified (including package
869name) user-defined character property,
bac0b425
JP
870to represent all the characters in that property; two hexadecimal code
871points for a range; or a single hexadecimal code point.
872
873=item *
874
Something to intersect with, prefixed by "&": an existing character
property (prefixed by "utf8::") or a fully qualified (including package
name) user-defined character property,
to keep only those characters that are also in that property; two
hexadecimal code points for a range; or a single hexadecimal code point.
491fd90a
JH
880
881=back
882
883For example, to define a property that covers both the Japanese
884syllabaries (hiragana and katakana), you can define
885
    sub InKana {
        return <<END;
    3040\t309F
    30A0\t30FF
    END
    }
892
d5822f25
A
893Imagine that the here-doc end marker is at the beginning of the line.
894Now you can use C<\p{InKana}> and C<\P{InKana}>.
491fd90a
JH
895
896You could also have used the existing block property names:
897
    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    END
    }
904
905Suppose you wanted to match only the allocated characters,
d5822f25 906not the raw block ranges: in other words, you want to remove
491fd90a
JH
907the non-characters:
908
    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    -utf8::IsCn
    END
    }
916
917The negation is useful for defining (surprise!) negated classes.
918
    sub InNotKana {
        return <<'END';
    !utf8::InHiragana
    -utf8::InKatakana
    +utf8::IsCn
    END
    }
926
461020ad
KW
927This will match all non-Unicode code points, since every one of them is
928not in Kana. You can use intersection to exclude these, if desired, as
929this modified example shows:
bac0b425 930
    sub InNotKana {
        return <<'END';
    !utf8::InHiragana
    -utf8::InKatakana
    +utf8::IsCn
    &utf8::Any
    END
    }
939
461020ad
KW
940C<&utf8::Any> must be the last line in the definition.
941
942Intersection is used generally for getting the common characters matched
943by two (or more) classes. It's important to remember not to use "&" for
944the first set; that would be intersecting with nothing, resulting in an
945empty set.
946
947(Note that official Unicode properties differ from these in that they
948automatically exclude non-Unicode code points and a warning is raised if
949a match is attempted on one of those.)
bac0b425 950
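The Kana definitions above can be put into a complete script along these
lines. This is only a sketch: it returns plain strings instead of here-docs,
so that the end-marker placement is not an issue, and C<InKanaAssigned> is a
name invented here for the variant that removes unassigned code points.

```perl
use strict;
use warnings;

# Both syllabary blocks, given as raw hexadecimal ranges.
sub InKana {
    return "3040\t309F\n30A0\t30FF\n";
}

# The same blocks by name, minus the unassigned code points.
sub InKanaAssigned {
    return "+utf8::InHiragana\n+utf8::InKatakana\n-utf8::IsCn\n";
}

my %matches = (
    hiragana_a      => ("\x{3042}" =~ /\p{InKana}/         ? 1 : 0),
    latin_a         => ("A"        =~ /\p{InKana}/         ? 1 : 0),
    latin_a_negated => ("A"        =~ /\P{InKana}/         ? 1 : 0),
    # U+3040 lies inside the Hiragana block but is unassigned.
    unassigned_raw  => ("\x{3040}" =~ /\p{InKana}/         ? 1 : 0),
    unassigned_trim => ("\x{3040}" =~ /\p{InKanaAssigned}/ ? 1 : 0),
);

print "$_ => $matches{$_}\n" for sort keys %matches;
```

The raw-range property accepts U+3040, while the variant that subtracts
C<utf8::IsCn> should not.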
68585b5e 951=head2 User-Defined Case Mappings (for serious hackers only)
822502e5 952
5d1892be
KW
953B<This feature has been removed as of Perl 5.16.>
954The CPAN module L<Unicode::Casing> provides better functionality without
955the drawbacks that this feature had. If you are using a Perl earlier
956than 5.16, this feature was most fully documented in the 5.14 version of
957this pod:
958L<http://perldoc.perl.org/5.14.0/perlunicode.html#User-Defined-Case-Mappings-%28for-serious-hackers-only%29>
3a2263fe 959
376d9008 960=head2 Character Encodings for Input and Output
8cbd9a7a 961
7221edc9 962See L<Encode>.
8cbd9a7a 963
c29a771d 964=head2 Unicode Regular Expression Support Level
776f8809 965
b19eb496
TC
966The following list of Unicode supported features for regular expressions describes
967all features currently directly supported by core Perl. The references to "Level N"
8158862b 968and the section numbers refer to the Unicode Technical Standard #18,
b19eb496 969"Unicode Regular Expressions", version 13, from August 2008.
776f8809
JH
970
971=over 4
972
973=item *
974
975Level 1 - Basic Unicode Support
976
 RL1.1   Hex Notation                     - done          [1]
 RL1.2   Properties                       - done          [2][3]
 RL1.2a  Compatibility Properties         - done          [4]
 RL1.3   Subtraction and Intersection     - experimental  [5]
 RL1.4   Simple Word Boundaries           - done          [6]
 RL1.5   Simple Loose Matches             - done          [7]
 RL1.6   Line Boundaries                  - MISSING       [8][9]
 RL1.7   Supplementary Code Points        - done          [10]
985
6f33e417
KW
986=over 4
987
988=item [1]
989
990\x{...}
991
992=item [2]
993
994\p{...} \P{...}
995
996=item [3]
997
998supports not only minimal list, but all Unicode character properties (see Unicode Character Properties above)
999
1000=item [4]
1001
1002\d \D \s \S \w \W \X [:prop:] [:^prop:]
1003
1004=item [5]
1005
9d1a5160
KW
1006The experimental feature in v5.18 "(?[...])" accomplishes this. See
1007L<perlre/(?[ ])>. If you don't want to use an experimental feature,
1008you can use one of the following:
6f33e417
KW
1009
1010=over 4
1011
1012=item * Regular expression look-ahead
1013
1014You can mimic class subtraction using lookahead.
8158862b 1015For example, what UTS#18 might write as
29bdacb8 1016
209c9685 1017 [{Block=Greek}-[{UNASSIGNED}]]
dbe420b4
JH
1018
1019in Perl can be written as:
1020
209c9685
KW
1021 (?!\p{Unassigned})\p{Block=Greek}
1022 (?=\p{Assigned})\p{Block=Greek}
dbe420b4
JH
1023
1024But in this particular example, you probably really want
1025
209c9685 1026 \p{Greek}
dbe420b4
JH
1027
1028which will match assigned characters known to be part of the Greek script.
29bdacb8 1029
6f33e417 1030=item * CPAN module L<Unicode::Regex::Set>
8158862b 1031
6f33e417
KW
1032It does implement the full UTS#18 grouping, intersection, union, and
1033removal (subtraction) syntax.
8158862b 1034
6f33e417
KW
1035=item * L</"User-Defined Character Properties">
1036
1037'+' for union, '-' for removal (set-difference), '&' for intersection
1038
1039=back
1040
1041=item [6]
1042
1043\b \B
1044
1045=item [7]
1046
1047Note that Perl does Full case-folding in matching (but with bugs), not Simple: for example U+1F88 is equivalent to U+1F00 U+03B9, instead of just U+1F80. This difference matters mainly for certain Greek capital letters with certain modifiers: the Full case-folding decomposes the letter, while the Simple case-folding would map it to a single character.
1048
1049=item [8]
1050
1051Should do ^ and $ also on U+000B (\v in C), FF (\f), CR (\r), CRLF
1052(\r\n), NEL (U+0085), LS (U+2028), and PS (U+2029); should also affect
1053<>, $., and script line numbers; should not split lines within CRLF
1054(i.e. there is no empty line between \r and \n). For CRLF, try the
1055C<:crlf> layer (see L<PerlIO>).
1056
1057=item [9]
1058
Linebreaking conformant with UAX#14 "Unicode Line Breaking Algorithm" is available through the L<Unicode::LineBreak> module.
1060
1061=item [10]
1062
UTF-8/UTF-EBCDIC used in Perl allows not only U+10000 to
U+10FFFF but also beyond U+10FFFF.
1065
1066=back
5ca1ac52 1067
776f8809
JH
1068=item *
1069
1070Level 2 - Extended Unicode Support
1071
 RL2.1   Canonical Equivalents           - MISSING  [10][11]
 RL2.2   Default Grapheme Clusters       - MISSING  [12]
 RL2.3   Default Word Boundaries         - MISSING  [14]
 RL2.4   Default Loose Matches           - MISSING  [15]
 RL2.5   Name Properties                 - DONE
 RL2.6   Wildcard Properties             - MISSING

 [10] see UAX#15 "Unicode Normalization Forms"
 [11] have Unicode::Normalize but not integrated to regexes
 [12] have \X but we don't have a "Grapheme Cluster Mode"
 [14] see UAX#29, Word Boundaries
 [15] This is covered in Chapter 3.13 (in Unicode 6.0)
776f8809
JH
1084
1085=item *
1086
8158862b
TS
1087Level 3 - Tailored Support
1088
 RL3.1   Tailored Punctuation            - MISSING
 RL3.2   Tailored Grapheme Clusters      - MISSING  [17][18]
 RL3.3   Tailored Word Boundaries        - MISSING
 RL3.4   Tailored Loose Matches          - MISSING
 RL3.5   Tailored Ranges                 - MISSING
 RL3.6   Context Matching                - MISSING  [19]
 RL3.7   Incremental Matches             - MISSING
      ( RL3.8   Unicode Set Sharing )
 RL3.9   Possible Match Sets             - MISSING
 RL3.10  Folded Matching                 - MISSING  [20]
 RL3.11  Submatchers                     - MISSING

 [17] see UTS#10 "Unicode Collation Algorithm"
 [18] have Unicode::Collate but not integrated to regexes
 [19] have (?<=x) and (?=x), but look-aheads or look-behinds
      should see outside of the target substring
 [20] need insensitive matching for linguistic features other
      than case; for example, hiragana to katakana, wide and
      narrow, simplified Han to traditional Han (see UTR#30
      "Character Foldings")
776f8809
JH
1109
1110=back
1111
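The look-ahead workaround for class subtraction mentioned in note [5] above
can be tried with a sketch like the following. It assumes that C<U+0378>,
which lies inside the Greek block, is still unassigned in the Unicode version
your Perl ships with.

```perl
use strict;
use warnings;

my $alpha      = "\x{3B1}";   # GREEK SMALL LETTER ALPHA
my $unassigned = "\x{378}";   # inside the Greek block, assumed unassigned

my $block_only = qr/\p{Block=Greek}/;
my $subtracted = qr/(?!\p{Unassigned})\p{Block=Greek}/;

my %r = (
    alpha_block      => ($alpha      =~ $block_only ? 1 : 0),
    alpha_subtracted => ($alpha      =~ $subtracted ? 1 : 0),
    hole_block       => ($unassigned =~ $block_only ? 1 : 0),
    hole_subtracted  => ($unassigned =~ $subtracted ? 1 : 0),
);

print "$_ => $r{$_}\n" for sort keys %r;
```

The negative look-ahead rejects the unassigned code point before the block
test is attempted, which is what emulates the subtraction.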
c349b1b9
JH
1112=head2 Unicode Encodings
1113
376d9008
JB
1114Unicode characters are assigned to I<code points>, which are abstract
1115numbers. To use these numbers, various encodings are needed.
c349b1b9
JH
1116
1117=over 4
1118
c29a771d 1119=item *
5cb3728c
RB
1120
1121UTF-8
c349b1b9 1122
6d4f9cf2
KW
1123UTF-8 is a variable-length (1 to 4 bytes), byte-order independent
1124encoding. For ASCII (and we really do mean 7-bit ASCII, not another
11258-bit encoding), UTF-8 is transparent.
c349b1b9 1126
8c007b5a 1127The following table is from Unicode 3.2.
05632f9a 1128
 Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte

   U+0000..U+007F       00..7F
   U+0080..U+07FF     * C2..DF    80..BF
   U+0800..U+0FFF       E0      * A0..BF    80..BF
   U+1000..U+CFFF       E1..EC    80..BF    80..BF
   U+D000..U+D7FF       ED        80..9F    80..BF
   U+D800..U+DFFF       +++++ utf16 surrogates, not legal utf8 +++++
   U+E000..U+FFFF       EE..EF    80..BF    80..BF
  U+10000..U+3FFFF      F0      * 90..BF    80..BF    80..BF
  U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
 U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF
e1b711da 1141
b19eb496 1142Note the gaps marked by "*" before several of the byte entries above. These are
e1b711da
KW
1143caused by legal UTF-8 avoiding non-shortest encodings: it is technically
1144possible to UTF-8-encode a single code point in different ways, but that is
1145explicitly forbidden, and the shortest possible encoding should always be used
1146(and that is what Perl does).
37361303 1147
376d9008 1148Another way to look at it is via bits:
05632f9a 1149
                Code Points  1st Byte  2nd Byte  3rd Byte  4th Byte

                   0aaaaaaa  0aaaaaaa
           00000bbbbbaaaaaa  110bbbbb  10aaaaaa
           ccccbbbbbbaaaaaa  1110cccc  10bbbbbb  10aaaaaa
 00000dddccccccbbbbbbaaaaaa  11110ddd  10cccccc  10bbbbbb  10aaaaaa
05632f9a 1156
9f815e24 1157As you can see, the continuation bytes all begin with "10", and the
e1b711da 1158leading bits of the start byte tell how many bytes there are in the
05632f9a
JH
1159encoded character.
1160
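The bit layout above can be turned into a small hand-rolled encoder and
compared against L<Encode>. This is a sketch for illustration only:
C<utf8_bytes> is a hypothetical helper, it assumes a code point in the range
0 to 0x10FFFF, and it does not reject surrogates. Real code should simply use
L<Encode>.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Build the UTF-8 byte sequence for one code point from the bit table.
sub utf8_bytes {
    my ($cp) = @_;
    if ($cp < 0x80) {
        return pack('C', $cp);
    }
    elsif ($cp < 0x800) {
        return pack('C2', 0xC0 | ($cp >> 6),
                          0x80 | ($cp & 0x3F));
    }
    elsif ($cp < 0x10000) {
        return pack('C3', 0xE0 | ($cp >> 12),
                          0x80 | (($cp >> 6) & 0x3F),
                          0x80 | ($cp & 0x3F));
    }
    return pack('C4', 0xF0 | ($cp >> 18),
                      0x80 | (($cp >> 12) & 0x3F),
                      0x80 | (($cp >> 6) & 0x3F),
                      0x80 | ($cp & 0x3F));
}

my @samples = (0x41, 0xE9, 0x20AC, 0x1F600);
my %hex = map { $_ => uc unpack('H*', utf8_bytes($_)) } @samples;

printf "U+%04X => %s\n", $_, $hex{$_} for @samples;
```

For these four samples the sequences are one, two, three and four bytes long
respectively, matching the rows of the table.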
6d4f9cf2
KW
1161The original UTF-8 specification allowed up to 6 bytes, to allow
1162encoding of numbers up to 0x7FFF_FFFF. Perl continues to allow those,
1163and has extended that up to 13 bytes to encode code points up to what
1164can fit in a 64-bit word. However, Perl will warn if you output any of
b19eb496 1165these as being non-portable; and under strict UTF-8 input protocols,
6d4f9cf2
KW
1166they are forbidden.
1167
1168The Unicode non-character code points are also disallowed in UTF-8 in
1169"open interchange". See L</Non-character code points>.
1170
c29a771d 1171=item *
5cb3728c
RB
1172
1173UTF-EBCDIC
dbe420b4 1174
376d9008 1175Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.
dbe420b4 1176
c29a771d 1177=item *
5cb3728c 1178
1e54db1a 1179UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)
c349b1b9 1180
The following items are mostly for reference and general Unicode
knowledge; Perl doesn't use these constructs internally.
dbe420b4 1183
b19eb496
TC
1184Like UTF-8, UTF-16 is a variable-width encoding, but where
1185UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units.
1186All code points occupy either 2 or 4 bytes in UTF-16: code points
1187C<U+0000..U+FFFF> are stored in a single 16-bit unit, and code
1bfb14c4 1188points C<U+10000..U+10FFFF> in two 16-bit units. The latter case is
c349b1b9
JH
1189using I<surrogates>, the first 16-bit unit being the I<high
1190surrogate>, and the second being the I<low surrogate>.
1191
376d9008 1192Surrogates are code points set aside to encode the C<U+10000..U+10FFFF>
c349b1b9 1193range of Unicode code points in pairs of 16-bit units. The I<high
9f815e24 1194surrogates> are the range C<U+D800..U+DBFF> and the I<low surrogates>
376d9008 1195are the range C<U+DC00..U+DFFF>. The surrogate encoding is
c349b1b9 1196
    # int() is needed because Perl's "/" is floating-point division
    $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
    $lo = ($uni - 0x10000) % 0x400 + 0xDC00;

and the decoding is

    $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
c349b1b9 1203
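As a worked example of the surrogate arithmetic, the sketch below splits
C<U+1F600> into its surrogate pair and checks the result against L<Encode>.
The helper names C<to_surrogates> and C<from_surrogates> are invented for
this illustration, and the input is assumed to lie in C<U+10000..U+10FFFF>.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Split a supplementary code point into (high, low) surrogates.
sub to_surrogates {
    my ($uni) = @_;
    my $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
    my $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
    return ($hi, $lo);
}

# Recombine a surrogate pair into the original code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
}

my ($hi, $lo)  = to_surrogates(0x1F600);
my $round_trip = from_surrogates($hi, $lo);
my $via_encode = encode('UTF-16BE', chr(0x1F600));

printf "hi=%04X lo=%04X back=%X\n", $hi, $lo, $round_trip;
```

For C<U+1F600> the offset from 0x10000 is 0xF600, which gives a high
surrogate of 0xD83D and a low surrogate of 0xDE00.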
376d9008 1204Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16
c349b1b9 1205itself can be used for in-memory computations, but if storage or
376d9008
JB
1206transfer is required either UTF-16BE (big-endian) or UTF-16LE
1207(little-endian) encodings must be chosen.
c349b1b9
JH
1208
1209This introduces another problem: what if you just know that your data
376d9008
JB
1210is UTF-16, but you don't know which endianness? Byte Order Marks, or
1211BOMs, are a solution to this. A special character has been reserved
86bbd6d1 1212in Unicode to function as a byte order marker: the character with the
376d9008 1213code point C<U+FEFF> is the BOM.
042da322 1214
c349b1b9 1215The trick is that if you read a BOM, you will know the byte order,
376d9008
JB
1216since if it was written on a big-endian platform, you will read the
1217bytes C<0xFE 0xFF>, but if it was written on a little-endian platform,
1218you will read the bytes C<0xFF 0xFE>. (And if the originating platform
1219was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.)
042da322 1220
The way this trick works is that the character with the code point
C<U+FFFE> is not supposed to be in input streams, so the
sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in
little-endian format" and cannot be "C<U+FFFE>, represented in big-endian
format".
1226
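A BOM check along the lines described above might look like the following
sketch. C<bom_encoding> is a hypothetical helper, not part of Perl; it also
recognizes the UTF-32 signatures listed under UTF-32 in this section, and it
tests them first because the UTF-32LE signature begins with the same two
bytes as the UTF-16LE one.

```perl
use strict;
use warnings;

# Guess an encoding from the leading bytes of a buffer; returns undef
# (in scalar context) when no BOM is found.
sub bom_encoding {
    my ($bytes) = @_;
    return 'UTF-32BE' if $bytes =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-32LE' if $bytes =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return;
}

my $guess = bom_encoding("\xFF\xFEh\x00i\x00");
print defined $guess ? "$guess\n" : "no BOM\n";
```

One limitation of this approach: UTF-16LE data that starts with a BOM
followed by C<U+0000> is indistinguishable from UTF-32LE, so the result is a
guess rather than a guarantee.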
Surrogates have no meaning in Unicode outside their use in pairs to
represent other code points. However, Perl allows them to be
represented individually internally, for example by saying
C<chr(0xD801)>, so that all code points, not just those valid for open
interchange, are
representable. Unicode does define semantics for them; for example, their
General Category is "Cs". But because their use is somewhat dangerous,
Perl will warn (using the warning category "surrogate", which is a
sub-category of "utf8") if an attempt is made
to do things like take the lower case of one, or match
case-insensitively, or to output them. (But don't try this on Perls
before 5.14.)
c349b1b9 1239
c29a771d 1240=item *
5cb3728c 1241
1e54db1a 1242UTF-32, UTF-32BE, UTF-32LE
c349b1b9
JH
1243
The UTF-32 family is pretty much like the UTF-16 family, except that
the units are 32-bit, and therefore the surrogate scheme is not
needed. UTF-32 is a fixed-width encoding. The BOM signatures are
C<0x00 0x00 0xFE 0xFF> for BE and C<0xFF 0xFE 0x00 0x00> for LE.
c349b1b9 1248
c29a771d 1249=item *
5cb3728c
RB
1250
1251UCS-2, UCS-4
c349b1b9 1252
b19eb496 1253Legacy, fixed-width encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit
376d9008 1254encoding. Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
339cfa0e 1255because it does not use surrogates. UCS-4 is a 32-bit encoding,
b19eb496 1256functionally identical to UTF-32 (the difference being that
ee88f7b6 1257UCS-4 forbids neither surrogates nor code points larger than 0x10_FFFF).
c349b1b9 1258
c29a771d 1259=item *
5cb3728c
RB
1260
1261UTF-7
c349b1b9 1262
376d9008
JB
1263A seven-bit safe (non-eight-bit) encoding, which is useful if the
1264transport or storage is not eight-bit safe. Defined by RFC 2152.
c349b1b9 1265
95a1a48b
JH
1266=back
1267
6d4f9cf2
KW
1268=head2 Non-character code points
1269
66 code points are set aside in Unicode as "non-character code points".
These all have the Unassigned (Cn) General Category, and they never will
be assigned. These are never supposed to be in legal Unicode input
streams, so that code can use them as sentinels that can be mixed in
with character data, and they always will be distinguishable from that data.
To keep them out of Perl input streams, strict UTF-8 should be
specified, such as by using the layer C<:encoding('UTF-8')>. The
non-character code points are the 32 between U+FDD0 and U+FDEF, and the
34 code points U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF.
Some people are under the mistaken impression that these are "illegal",
but that is not true. An application or cooperating set of applications
can legally use them at will internally; but these code points are
"illegal for open interchange". Therefore, Perl will not accept these
from input streams unless lax rules are being used, and will warn
(using the warning category "nonchar", which is a sub-category of "utf8") if
an attempt is made to output them.
1286
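The two groups of non-character code points listed above can be recognized
arithmetically. The following sketch uses a hypothetical helper,
C<is_nonchar>, and counts how many code points in the Unicode range satisfy
it:

```perl
use strict;
use warnings;

# True for the 32 code points U+FDD0..U+FDEF and for the last two code
# points of each of the 17 planes (U+xxFFFE and U+xxFFFF).
sub is_nonchar {
    my ($cp) = @_;
    return 0 if $cp > 0x10FFFF;
    return 1 if $cp >= 0xFDD0 && $cp <= 0xFDEF;
    return (($cp & 0xFFFE) == 0xFFFE) ? 1 : 0;
}

my $count = 0;
for my $cp (0 .. 0x10FFFF) {
    $count++ if is_nonchar($cp);
}

print "non-character code points: $count\n";
```

The count works out to 32 + 2 * 17 = 66, which agrees with the figure given
in the text.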
1287=head2 Beyond Unicode code points
1288
The maximum Unicode code point is U+10FFFF. But Perl accepts code
points up to the maximum permissible unsigned number available on the
platform. However, Perl will not accept these from input streams unless
lax rules are being used, and will warn (using the warning category
"non_unicode", which is a sub-category of "utf8") if an attempt is made to
operate on or output them. For example, C<uc(chr(0x11_0000))> will generate
this warning, returning the input parameter as its result, as the upper
case of every non-Unicode code point is the code point itself.
6d4f9cf2 1297
0d7c09bb
JH
1298=head2 Security Implications of Unicode
1299
e1b711da
KW
1300Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.
1301Also, note the following:
1302
0d7c09bb
JH
1303=over 4
1304
1305=item *
1306
1307Malformed UTF-8
bf0fa0b2 1308
42581d5d 1309Unfortunately, the original specification of UTF-8 leaves some room for
bf0fa0b2 1310interpretation of how many bytes of encoded output one should generate
376d9008
JB
1311from one input Unicode character. Strictly speaking, the shortest
1312possible sequence of UTF-8 bytes should be generated,
1313because otherwise there is potential for an input buffer overflow at
feda178f 1314the receiving end of a UTF-8 connection. Perl always generates the
e1b711da 1315shortest length UTF-8, and with warnings on, Perl will warn about
376d9008 1316non-shortest length UTF-8 along with other malformations, such as the
b19eb496 1317surrogates, which are not Unicode code points valid for interchange.
bf0fa0b2 1318
0d7c09bb
JH
1319=item *
1320
68693f9e 1321Regular expression pattern matching may surprise you if you're not
b19eb496
TC
1322accustomed to Unicode. Starting in Perl 5.14, several pattern
1323modifiers are available to control this, called the character set
42581d5d
KW
1324modifiers. Details are given in L<perlre/Character set modifiers>.
1325
1326=back
0d7c09bb 1327
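To illustrate the point about non-shortest encodings made under "Malformed
UTF-8" above, the sketch below asks L<Encode> to decode strictly and to die
on malformed input. C<strict_decode_ok> is a hypothetical wrapper written for
this example; the byte pair C<0xC0 0xAF> is a non-shortest (two-byte) form of
"/".

```perl
use strict;
use warnings;
use Encode qw(decode);

# Returns 1 if the bytes are well-formed strict UTF-8, 0 otherwise.
# decode() may modify its input when a CHECK value is given; $bytes is a
# private copy, so the caller's data is left alone.
sub strict_decode_ok {
    my ($bytes) = @_;
    my $ok = eval { decode('UTF-8', $bytes, Encode::FB_CROAK()); 1 };
    return $ok ? 1 : 0;
}

my $shortest = strict_decode_ok("/");          # the single byte 0x2F
my $overlong = strict_decode_ok("\xC0\xAF");   # non-shortest form of "/"

print "shortest: $shortest, overlong: $overlong\n";
```

Rejecting the overlong form matters because a filter that looks only for the
byte C<0x2F> would otherwise miss a "/" smuggled in as C<0xC0 0xAF>.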
376d9008 1328As discussed elsewhere, Perl has one foot (two hooves?) planted in
1bfb14c4
JH
1329each of two worlds: the old world of bytes and the new world of
1330characters, upgrading from bytes to characters when necessary.
376d9008
JB
1331If your legacy code does not explicitly use Unicode, no automatic
1332switch-over to characters should happen. Characters shouldn't get
1bfb14c4
JH
1333downgraded to bytes, either. It is possible to accidentally mix bytes
1334and characters, however (see L<perluniintro>), in which case C<\w> in
42581d5d 1335regular expressions might start behaving differently (unless the C</a>
b19eb496 1336modifier is in effect). Review your code. Use warnings and the C<strict> pragma.
0d7c09bb 1337
c349b1b9
JH
1338=head2 Unicode in Perl on EBCDIC
1339
376d9008
JB
1340The way Unicode is handled on EBCDIC platforms is still
1341experimental. On such platforms, references to UTF-8 encoding in this
1342document and elsewhere should be read as meaning the UTF-EBCDIC
1343specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues
c349b1b9 1344are specifically discussed. There is no C<utfebcdic> pragma or
376d9008 1345":utfebcdic" layer; rather, "utf8" and ":utf8" are reused to mean
86bbd6d1
PN
1346the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic>
1347for more discussion of the issues.
c349b1b9 1348
b310b053
JH
1349=head2 Locales
1350
See L<perllocale/Unicode and UTF-8>.

=head2 When Unicode Does Not Happen

While Perl does have extensive ways to input and output in Unicode,
and a few other "entry points" like the @ARGV array (which can sometimes be
interpreted as UTF-8), there are still many places where Unicode
(in some encoding or another) could be given as arguments or received as
results, or both, but it is not.

The following are such interfaces. Also, see L</The "Unicode Bug">.
For all of these interfaces Perl
currently (as of v5.16.0) simply assumes byte strings both as arguments
and results, or UTF-8 strings if the (problematic) C<encoding> pragma has been used.

One reason that Perl does not attempt to resolve the role of Unicode in
these situations is that the answers are highly dependent on the operating
system and the file system(s). For example, whether filenames can be
in Unicode, and in exactly what kind of encoding, is not a
portable concept. Similarly for C<qx> and C<system>: how well will the
"command-line interface" (and which of them?) handle Unicode?

=over 4

=item *

chdir, chmod, chown, chroot, exec, link, lstat, mkdir,
rename, rmdir, stat, symlink, truncate, unlink, utime, -X

=item *

%ENV

=item *

glob (aka the <*>)

=item *

open, opendir, sysopen

=item *

qx (aka the backtick operator), system

=item *

readdir, readlink

=back
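
Because these interfaces deal in byte strings, one way to get predictable
results is to encode and decode explicitly at the boundary. The following
is only a sketch: it assumes a file system that stores names as UTF-8
bytes (typical on Unix-like systems, but not guaranteed anywhere), and
C<$fs_encoding>, C<$name> and the temporary directory are illustrative
names, not anything Perl provides.

    use strict;
    use warnings;
    use Encode qw(encode decode);
    use File::Temp qw(tempdir);

    # ASSUMPTION: this file system stores file names as UTF-8 bytes.
    my $fs_encoding = 'UTF-8';

    my $dir  = tempdir(CLEANUP => 1);
    my $name = "\x{3A9}-\x{263A}.txt";      # a character string

    # Encode the name ourselves; open() takes a byte string.
    my $path = "$dir/" . encode($fs_encoding, $name);
    open(my $fh, '>', $path) or die "open: $!";
    close($fh) or die "close: $!";

    # readdir() returns byte strings; decode them to get characters back.
    opendir(my $dh, $dir) or die "opendir: $!";
    my @names = map  { decode($fs_encoding, $_) }
                grep { $_ ne '.' && $_ ne '..' } readdir($dh);
    closedir($dh);

Keeping the encoding in a single variable makes it easy to change should
the assumption about the file system turn out to be wrong.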

=head2 The "Unicode Bug"

The term "Unicode bug" has been applied to an inconsistency
on ASCII platforms with the
Unicode code points in the Latin-1 Supplement block, that
is, between 128 and 255. Without a locale specified, unlike all other
characters or code points, these characters behave very differently under
byte semantics than under character semantics, unless
C<use feature 'unicode_strings'> is specified, directly or indirectly.
(It is indirectly specified by a C<use v5.12> or higher.)

In character semantics these upper-Latin1 characters are interpreted as
Unicode code points, which means
they have the same semantics as Latin-1 (ISO-8859-1).

In byte semantics (without C<unicode_strings>), they are considered to
be unassigned characters, meaning that the only semantics they have is
their ordinal numbers, and that they are
not members of various character classes. None are considered to match C<\w>
for example, but all match C<\W>.

Perl 5.12.0 added C<unicode_strings> to force character semantics on
these code points in some circumstances, which fixed portions of the
bug; Perl 5.14.0 fixed almost all of it; and Perl 5.16.0 fixed the
remainder (so far as we know, anyway). The lesson here is to enable
C<unicode_strings> to avoid the headaches described below.

The old, problematic behavior affects these areas:

=over 4

=item *

Changing the case of a scalar, that is, using C<uc()>, C<ucfirst()>, C<lc()>,
and C<lcfirst()>, or C<\L>, C<\U>, C<\u> and C<\l> in double-quotish
contexts, such as regular expression substitutions.
Under C<unicode_strings> starting in Perl 5.12.0, character semantics are
generally used. See L<perlfunc/lc> for details on how this works
in combination with various other pragmas.

=item *

Using caseless (C</i>) regular expression matching.
Starting in Perl 5.14.0, regular expressions compiled within
the scope of C<unicode_strings> use character semantics
even when executed or compiled into larger
regular expressions outside the scope.

=item *

Matching any of several properties in regular expressions, namely C<\b>,
C<\B>, C<\s>, C<\S>, C<\w>, C<\W>, and all the Posix character classes
I<except> C<[[:ascii:]]>.
Starting in Perl 5.14.0, regular expressions compiled within
the scope of C<unicode_strings> use character semantics
even when executed or compiled into larger
regular expressions outside the scope.

=item *

In C<quotemeta> or its inline equivalent C<\Q>, no code points above 127
are quoted in UTF-8 encoded strings, but in byte encoded strings, code
points between 128 and 255 are always quoted.
Starting in Perl 5.16.0, consistent quoting rules are used within the
scope of C<unicode_strings>, as described in L<perlfunc/quotemeta>.

=back

This behavior can lead to unexpected results in which a string's semantics
suddenly change if a code point above 255 is appended to or removed from it,
which changes the string's semantics from byte to character or vice versa. As
an example, consider the following program and its output:

 $ perl -le'
     no feature "unicode_strings";
     $s1 = "\xC2";
     $s2 = "\x{2660}";
     for ($s1, $s2, $s1.$s2) {
         print /\w/ || 0;
     }
 '
 0
 0
 1

If there's no C<\w> in C<$s1> or in C<$s2>, why does their concatenation have one?

This anomaly stems from Perl's attempt to not disturb older programs that
didn't use Unicode, and hence had no semantics for characters outside of the
ASCII range (except in a locale), along with Perl's desire to add Unicode
support seamlessly. The result wasn't seamless: these characters were
orphaned.

For Perls earlier than those described above, or when a string is passed
to a function outside the subpragma's scope, a workaround is to always
call C<utf8::upgrade($string)>,
or to use the standard module L<Encode>. Also, a scalar that has any characters
whose ordinal is 0x100 or above, or which were specified using either of the
C<\N{...}> notations, will automatically have character semantics.
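
The difference between the two semantics, and the C<utf8::upgrade()>
workaround, can be seen in a short script. This is only a sketch: it
assumes Perl 5.14.0 or later (so that the regular expression fixes are
present) and that no C<use locale> is in effect; the variable names are
illustrative.

    use strict;
    use warnings;

    my $e_acute = "\xE9";       # U+00E9, held as a single byte

    my ($plain_w, $plain_uc);
    {
        no feature 'unicode_strings';
        $plain_w  = $e_acute =~ /\w/ ? 1 : 0;   # byte semantics: no match
        $plain_uc = uc $e_acute;                # byte semantics: unchanged
    }

    my ($fixed_w, $fixed_uc);
    {
        use feature 'unicode_strings';
        $fixed_w  = $e_acute =~ /\w/ ? 1 : 0;   # character semantics: match
        $fixed_uc = uc $e_acute;                # character semantics: "\xC9"
    }

    # The workaround: upgrade a copy, then match outside the feature's scope.
    my $upgraded = $e_acute;
    utf8::upgrade($upgraded);
    my $upgraded_w;
    {
        no feature 'unicode_strings';
        $upgraded_w = $upgraded =~ /\w/ ? 1 : 0;
    }

Note that the upgraded copy still compares equal to the original with
C<eq>: only the internal representation, and hence the semantics applied
to it, has changed.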

=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)

Sometimes (see L</"When Unicode Does Not Happen"> or L</The "Unicode Bug">)
there are situations where you simply need to force a byte
string into UTF-8, or vice versa. The low-level calls
C<utf8::upgrade($bytestring)> and C<utf8::downgrade($utf8string[, FAIL_OK])> are
the answers.

Note that C<utf8::downgrade()> can fail if the string contains characters
that don't fit into a byte.

Calling either function on a string that already is in the desired state is a
no-op.
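
A brief sketch of both calls follows, including the failure case. The
strings and variable names are illustrative only, and C<utf8::is_utf8()>
is used here purely to inspect the internal flag.

    use strict;
    use warnings;

    my $s = "na\xEFve";             # every code point is below 256

    utf8::upgrade($s);              # UTF-8 internally; same characters
    my $flag_after_upgrade = utf8::is_utf8($s) ? 1 : 0;

    utf8::downgrade($s);            # back to one byte per character
    my $flag_after_downgrade = utf8::is_utf8($s) ? 1 : 0;

    my $wide = "snow \x{2603}";     # U+2603 does not fit into a byte

    # With a true FAIL_OK argument a failed downgrade returns false.
    my $ok = utf8::downgrade($wide, 1);

    # Without FAIL_OK a failed downgrade dies, so trap it with eval.
    my $died = !eval { my $copy = $wide; utf8::downgrade($copy); 1 };

Neither call changes the characters in the string, so C<eq> comparisons
and C<length()> give the same answers before and after.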

=head2 Using Unicode in XS

If you want to handle Perl Unicode in XS extensions, you may find the
following C APIs useful. See also L<perlguts/"Unicode Support"> for an
explanation about Unicode at the XS level, and L<perlapi> for the API
details.

=over 4

=item *

C<DO_UTF8(sv)> returns true if the C<UTF8> flag is on and the bytes
pragma is not in effect. C<SvUTF8(sv)> returns true if the C<UTF8>
flag is on; the bytes pragma is ignored. The C<UTF8> flag being on
does B<not> mean that there are any characters of code points greater
than 255 (or 127) in the scalar or that there are even any characters
in the scalar. What the C<UTF8> flag means is that the sequence of
octets in the representation of the scalar is the sequence of UTF-8
encoded code points of the characters of a string. The C<UTF8> flag
being off means that each octet in this representation encodes a
single character with code point 0..255 within the string. Perl's
Unicode model is not to use UTF-8 until it is absolutely necessary.

=item *

C<uvchr_to_utf8(buf, chr)> writes a Unicode character code point into
a buffer encoding the code point as UTF-8, and returns a pointer
pointing after the UTF-8 bytes. It works appropriately on EBCDIC machines.

=item *

C<utf8_to_uvchr_buf(buf, bufend, lenp)> reads UTF-8 encoded bytes from a
buffer and
returns the Unicode character code point and, optionally, the length of
the UTF-8 byte sequence. It works appropriately on EBCDIC machines.

=item *

C<utf8_length(start, end)> returns the length of the UTF-8 encoded buffer
in characters. C<sv_len_utf8(sv)> returns the length of the UTF-8 encoded
scalar.

=item *

C<sv_utf8_upgrade(sv)> converts the string of the scalar to its UTF-8
encoded form. C<sv_utf8_downgrade(sv)> does the opposite, if
possible. C<sv_utf8_encode(sv)> is like C<sv_utf8_upgrade()> except that
it does not set the C<UTF8> flag. C<sv_utf8_decode()> does the
opposite of C<sv_utf8_encode()>. Note that none of these are to be
used as general-purpose encoding or decoding interfaces: C<use Encode>
for that. C<sv_utf8_upgrade()> is affected by the encoding pragma
but C<sv_utf8_downgrade()> is not (since the encoding pragma is
designed to be a one-way street).

=item *

C<is_utf8_string(buf, len)> returns true if C<len> bytes of the buffer
are valid UTF-8.

=item *

C<is_utf8_char_buf(buf, buf_end)> returns true if the pointer points to
a valid UTF-8 character.

=item *

C<UTF8SKIP(buf)> will return the number of bytes in the UTF-8 encoded
character in the buffer. C<UNISKIP(chr)> will return the number of bytes
required to UTF-8-encode the Unicode character code point. C<UTF8SKIP()>
is useful, for example, for iterating over the characters of a UTF-8
encoded buffer; C<UNISKIP()> is useful, for example, in computing
the size required for a UTF-8 encoded buffer.

=item *

C<utf8_distance(a, b)> will tell the distance in characters between
two pointers pointing into the same UTF-8 encoded buffer.

=item *

C<utf8_hop(s, off)> will return a pointer to a UTF-8 encoded buffer
that is C<off> (positive or negative) Unicode characters displaced
from the UTF-8 buffer C<s>. Be careful not to overstep the buffer:
C<utf8_hop()> will merrily run off the end or the beginning of the
buffer if told to do so.

=item *

C<pv_uni_display(dsv, spv, len, pvlim, flags)> and
C<sv_uni_display(dsv, ssv, pvlim, flags)> are useful for debugging the
output of Unicode strings and scalars. By default they are useful
only for debugging--they display B<all> characters as hexadecimal code
points--but with the flags C<UNI_DISPLAY_ISPRINT>,
C<UNI_DISPLAY_BACKSLASH>, and C<UNI_DISPLAY_QQ> you can make the
output more readable.

=item *

C<foldEQ_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to
compare two strings case-insensitively in Unicode. For case-sensitive
comparisons you can just use C<memEQ()> and C<memNE()> as usual, except
if one string is in UTF-8 and the other isn't.

=back

For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h>
in the Perl source code distribution.

=head2 Hacking Perl to work on earlier Unicode versions (for very serious hackers only)

Perl by default comes with the latest supported Unicode version built in, but
you can change to use any earlier one.

Download the files in the desired version of Unicode from the Unicode web
site (L<http://www.unicode.org>). These should replace the existing files in
F<lib/unicore> in the Perl source tree. Follow the instructions in
F<README.perl> in that directory to change some of their names, and then build
perl (see L<INSTALL>).

=head1 BUGS

=head2 Interaction with Locales

See L<perllocale/Unicode and UTF-8>.

=head2 Problems with characters in the Latin-1 Supplement range

See L</The "Unicode Bug">.

=head2 Interaction with Extensions

When Perl exchanges data with an extension, the extension should be
able to understand the UTF8 flag and act accordingly. If the
extension doesn't recognize that flag, it's likely that the extension
will return incorrectly-flagged data.

So if you're working with Unicode data, consult the documentation of
every module you're using to see if there are any issues with Unicode data
exchange. If the documentation does not talk about Unicode at all,
suspect the worst and probably look at the source to learn how the
module is implemented. Modules written completely in Perl shouldn't
cause problems. Modules that directly or indirectly access code written
in other programming languages are at risk.

For affected functions, the simple strategy to avoid data corruption is
to always make the encoding of the exchanged data explicit. Choose an
encoding that you know the extension can handle. Convert arguments passed
to the extensions to that encoding and convert results back from that
encoding. Write wrapper functions that do the conversions for you, so
you can later change the functions when the extension catches up.

To provide an example, let's say the popular Foo::Bar::escape_html
function doesn't deal with Unicode data yet. The wrapper function
would convert the argument to raw UTF-8 and convert the result back to
Perl's internal representation like so:

    sub my_escape_html ($) {
        my($what) = shift;
        return unless defined $what;
        Encode::decode_utf8(Foo::Bar::escape_html(
                                Encode::encode_utf8($what)));
    }

Sometimes, when the extension does not convert data but just stores
and retrieves them, you will be able to use the otherwise
dangerous Encode::_utf8_on() function. Let's say the popular
C<Foo::Bar> extension, written in C, provides a C<param> method that
lets you store and retrieve data according to these prototypes:

    $self->param($name, $value);            # set a scalar
    $value = $self->param($name);           # retrieve a scalar

If it does not yet provide support for any encoding, one could write a
derived class with such a C<param> method:

    sub param {
        my($self,$name,$value) = @_;
        utf8::upgrade($name);       # make sure it is UTF-8 encoded
        if (defined $value) {
            utf8::upgrade($value);  # make sure it is UTF-8 encoded
            return $self->SUPER::param($name,$value);
        } else {
            my $ret = $self->SUPER::param($name);
            Encode::_utf8_on($ret); # we know, it is UTF-8 encoded
            return $ret;
        }
    }

Some extensions provide filters on data entry/exit points, such as
DB_File::filter_store_key and family. Look out for such filters in
the documentation of your extensions; they can make the transition to
Unicode data much easier.

=head2 Speed

Some functions are slower when working on UTF-8 encoded strings than
on byte encoded strings. All functions that need to hop over
characters such as length(), substr() or index(), or matching regular
expressions can work B<much> faster when the underlying data are
byte-encoded.

In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
a caching scheme was introduced which will hopefully make the slowness
somewhat less spectacular, at least for some operations. In general,
operations with UTF-8 encoded strings are still slower. As an example,
the Unicode properties (character classes) like C<\p{Nd}> are known to
be quite a bit slower (5-20 times) than their simpler counterparts
like C<\d> (then again, there are hundreds of Unicode characters matching C<Nd>
compared with the 10 ASCII characters matching C<\d>).

=head2 Problems on EBCDIC platforms

There are several known problems with Perl on EBCDIC platforms. If you
want to use Perl there, send email to perlbug@perl.org.

In earlier versions, when byte and character data were concatenated,
the new string was sometimes created by
decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
old Unicode string used EBCDIC.

If you find any of these, please report them as bugs.

=head2 Porting code from perl-5.6.X

Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer
was required to use the C<utf8> pragma to declare that a given scope
expected to deal with Unicode data and had to make sure that only
Unicode data were reaching that scope. If you have code that is
working with 5.6, you will need some of the following adjustments to
your code. The examples are written such that the code will continue
to work under 5.6, so you should be safe to try them out.

=over 3

=item *

A filehandle that should read or write UTF-8

    if ($] >= 5.008) {
        binmode $fh, ":encoding(utf8)";
    }

=item *

A scalar that is going to be passed to some extension

Be it Compress::Zlib, Apache::Request or any extension that has no
mention of Unicode in the manpage, you need to make sure that the
UTF8 flag is stripped off. Note that at the time of this writing
(January 2012) the mentioned modules are not UTF-8-aware. Please
check the documentation to verify if this is still true.

    if ($] >= 5.008) {
        require Encode;
        $val = Encode::encode_utf8($val); # make octets
    }

=item *

A scalar we got back from an extension

If you believe the scalar comes back as UTF-8, you will most likely
want the UTF8 flag restored:

    if ($] >= 5.008) {
        require Encode;
        $val = Encode::decode_utf8($val);
    }

=item *

Same thing, if you are really sure it is UTF-8

    if ($] >= 5.008) {
        require Encode;
        Encode::_utf8_on($val);
    }

=item *

A wrapper for fetchrow_array and fetchrow_hashref

When the database contains only UTF-8, a wrapper function or method is
a convenient way to replace all your fetchrow_array and
fetchrow_hashref calls. A wrapper function will also make it easier to
adapt to future enhancements in your database driver. Note that at the
time of this writing (January 2012), the DBI has no standardized way
to deal with UTF-8 data. Please check the documentation to verify if
that is still true.

    sub fetchrow {
        # $what is one of fetchrow_{array,hashref}
        my($self, $sth, $what) = @_;
        if ($] < 5.008) {
            return $sth->$what;
        } else {
            require Encode;
            if (wantarray) {
                my @arr = $sth->$what;
                for (@arr) {
                    defined && /[^\000-\177]/ && Encode::_utf8_on($_);
                }
                return @arr;
            } else {
                my $ret = $sth->$what;
                if (ref $ret) {
                    for my $k (keys %$ret) {
                        defined
                        && /[^\000-\177]/
                        && Encode::_utf8_on($_) for $ret->{$k};
                    }
                    return $ret;
                } else {
                    defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
                    return $ret;
                }
            }
        }
    }

=item *

A large scalar that you know can only contain ASCII

Scalars that contain only ASCII and are marked as UTF-8 are sometimes
a drag on your program. If you recognize such a situation, just remove
the UTF8 flag:

    utf8::downgrade($val) if $] >= 5.008;

=back

=head1 SEE ALSO

L<perlunitut>, L<perluniintro>, L<perluniprops>, L<Encode>, L<open>, L<utf8>, L<bytes>,
L<perlretut>, L<perlvar/"${^UNICODE}">,
L<http://www.unicode.org/reports/tr44>.

=cut