=head1 NAME

perlunicode - Unicode support in Perl

=head1 DESCRIPTION

=head2 Important Caveats

Unicode support is an extensive requirement. While Perl does not
implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.

People who want to learn to use Unicode in Perl should probably read
the L<Perl Unicode tutorial, perlunitut|perlunitut> and
L<perluniintro> before reading this reference document.

Also, the use of Unicode may present security issues that aren't obvious.
Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.

=over 4

=item Safest if you "use feature 'unicode_strings'"

In order to preserve backward compatibility, Perl does not turn
on full internal Unicode support unless the pragma
C<use feature 'unicode_strings'> is specified. (This is automatically
selected if you use C<use 5.012> or higher.) Failure to do this can
cause surprises. See L</The "Unicode Bug"> below.

This pragma doesn't affect I/O, and there are still several places
where Unicode isn't fully supported, such as in filenames.

=item Input and Output Layers

Perl knows when a filehandle uses Perl's internal Unicode encodings
(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
the ":encoding(utf8)" layer. Other encodings can be converted to Perl's
encoding on input or from Perl's encoding on output by use of the
":encoding(...)" layer. See L<open>.

To indicate that Perl source itself is in UTF-8, use C<use utf8;>.

=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts

As a compatibility measure, the C<use utf8> pragma must be explicitly
included to enable recognition of UTF-8 in the Perl scripts themselves
(in string or regular expression literals, or in identifier names) on
ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
machines. B<These are the only times when an explicit C<use utf8>
is needed.> See L<utf8>.

=item BOM-marked scripts and UTF-16 scripts autodetected

If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF-16BE,
or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either
endianness, Perl will correctly read in the script as Unicode.
(BOMless UTF-8 cannot be effectively recognized or differentiated from
ISO 8859-1 or other eight-bit encodings.)

=item C<use encoding> needed to upgrade non-Latin-1 byte strings

By default, there is a fundamental asymmetry in Perl's Unicode model:
implicit upgrading from byte strings to Unicode strings assumes that
they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
downgraded with UTF-8 encoding. This happens because the first 256
code points in Unicode happen to agree with Latin-1.

See L</"Byte and Character Semantics"> for more details.

=back

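As a minimal sketch of the layer behavior described in the caveats above,
the following round-trips a string through an C<:encoding(...)> layer. It
uses in-memory filehandles so that it is self-contained, and it names the
strict C<UTF-8> encoding rather than C<utf8>; both choices, and all variable
names, are assumptions made for illustration.

```perl
use strict;
use warnings;
use feature 'unicode_strings';

# "caf" followed by LATIN SMALL LETTER E WITH ACUTE: four characters.
my $string = "caf\N{U+E9}";

# Encode on output: the layer turns characters into UTF-8 bytes.
my $bytes = '';
open(my $out, '>:encoding(UTF-8)', \$bytes)
    or die "Cannot open for writing: $!";
print {$out} $string;
close($out) or die "Cannot close: $!";

# Decode on input: the same layer turns the bytes back into characters.
open(my $in, '<:encoding(UTF-8)', \$bytes)
    or die "Cannot open for reading: $!";
my $round_trip = <$in>;
close($in);

printf "characters: %d, encoded bytes: %d\n",
    length($string), length($bytes);
print "round trip ok\n" if $round_trip eq $string;
```

The character count and the byte count differ because the accented letter
occupies two bytes once encoded.
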
=head2 Byte and Character Semantics

Beginning with version 5.6, Perl uses logically-wide characters to
represent strings internally.

Starting in Perl 5.14, Perl-level operations work with
characters rather than bytes within the scope of a
C<L<use feature 'unicode_strings'|feature>> (or equivalently
C<use 5.012> or higher). (This is not true if bytes have been
explicitly requested by C<L<use bytes|bytes>>, nor necessarily true
for interactions with the platform's operating system.)

For earlier Perls, and when C<unicode_strings> is not in effect, Perl
provides a fairly safe environment that can handle both types of
semantics in programs. For operations where Perl can unambiguously
decide that the input data are characters, Perl switches to character
semantics. For operations where this determination cannot be made
without additional information from the user, Perl decides in favor of
compatibility and chooses to use byte semantics.

When C<use locale> is in effect (which overrides
C<use feature 'unicode_strings'> in the same scope), Perl uses the
semantics associated with the current locale. Otherwise, Perl uses the
platform's native byte semantics for characters whose code points are
less than 256, and Unicode semantics for those greater than 255. On
EBCDIC platforms, this is almost seamless, as the EBCDIC code pages that
Perl handles are equivalent to Unicode's first 256 code points. (The
exception is that EBCDIC regular expression case-insensitive matching
rules are not as robust as Unicode's.) But on ASCII platforms, Perl uses
US-ASCII (or Basic Latin in Unicode terminology) byte semantics, meaning
that characters whose ordinal numbers are in the range 128 - 255 are
undefined except for their ordinal numbers. This means that none have
case (upper and lower), nor are any a member of character classes, like
C<[:alpha:]> or C<\w>. (But all do belong to the C<\W> class or the Perl
regular expression extension C<[:^alpha:]>.)

This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if
none of the program's inputs were marked as being a source of Unicode
character data. Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as C<%ENV>),
or from literals and constants in the source text.

The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
Note that this pragma is only required while Perl defaults to byte
semantics; when character semantics become the default, this pragma
may become a no-op. See L<utf8>.

If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will have
character semantics. This can cause surprises: see L</BUGS>, below.
You can choose to be warned when this happens. See L<encoding::warnings>.

Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
logically just a number ranging from 0 to 2**31 or so. Larger
characters may encode into longer sequences of bytes internally, but
this internal detail is mostly hidden from Perl code.
See L<perluniintro> for more.

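The difference between native byte semantics and Unicode semantics for
code points 128 - 255 can be sketched as follows. This is an illustration
only: it assumes an ASCII platform and Perl 5.14 or later, and the variable
names are invented for the example.

```perl
use strict;
use warnings;

# Code point 233 (LATIN SMALL LETTER E WITH ACUTE), stored as one byte.
my $e_acute = "\xE9";

my ($native_word, $native_uc);
{
    no feature 'unicode_strings';    # native byte semantics
    $native_word = $e_acute =~ /\w/ ? 1 : 0;
    $native_uc   = uc($e_acute);
}

my ($unicode_word, $unicode_uc);
{
    use feature 'unicode_strings';   # Unicode semantics
    $unicode_word = $e_acute =~ /\w/ ? 1 : 0;
    $unicode_uc   = uc($e_acute);
}

printf "without unicode_strings: \\w %s, uc gives 0x%02X\n",
    ($native_word ? 'matches' : 'does not match'), ord($native_uc);
printf "with unicode_strings:    \\w %s, uc gives 0x%02X\n",
    ($unicode_word ? 'matches' : 'does not match'), ord($unicode_uc);
```

The same string behaves differently in the two blocks, which is why
enabling C<unicode_strings> is usually the safer choice.
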
=head2 Effects of Character Semantics

Character semantics have the following effects:

=over 4

=item *

Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.

If you use a Unicode editor to edit your program, Unicode characters may
occur directly within the literal strings in UTF-8 encoding, or UTF-16.
(The former requires a BOM or C<use utf8>, the latter requires a BOM.)

Unicode characters can also be added to a string by using the C<\N{U+...}>
notation. The Unicode code point of the desired character, in hexadecimal,
should be placed in the braces, after the C<U+>. For instance, a smiley face is
C<\N{U+263A}>.

Alternatively, you can use the C<\x{...}> notation for characters 0x100 and
above. For characters below 0x100 you may get byte semantics instead of
character semantics; see L</The "Unicode Bug">. On EBCDIC machines there is
the additional problem that the value for such characters gives the EBCDIC
character rather than the Unicode one, thus it is more portable to use
C<\N{U+...}> instead.

Additionally, you can use the C<\N{...}> notation and put the official
Unicode character name within the braces, such as
C<\N{WHITE SMILING FACE}>. This automatically loads the L<charnames>
module with the C<:full> and C<:short> options. If you prefer different
options for this module, you can instead, before the C<\N{...}>,
explicitly load it with your desired options; for example,

 use charnames ':loose';

=item *

If an appropriate L<encoding> is specified, identifiers within the
Perl script may contain Unicode alphanumeric characters, including
ideographs. Perl does not currently attempt to canonicalize variable
names.

=item *

Regular expressions match characters instead of bytes. "." matches
a character instead of a byte.

=item *

Bracketed character classes in regular expressions match characters instead of
bytes and match against the character properties specified in the
Unicode properties database. C<\w> can be used to match a Japanese
ideograph, for instance.

=item *

Named Unicode properties, scripts, and block ranges may be used (like bracketed
character classes) by using the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".
See L</"Unicode Character Properties"> for more details.

You can define your own character properties and use them
in the regular expression with the C<\p{}> or C<\P{}> construct.
See L</"User-Defined Character Properties"> for more details.

=item *

The special pattern C<\X> matches a logical character, an "extended grapheme
cluster" in Standardese. In Unicode what appears to the user to be a single
character, for example an accented C<G>, may in fact be composed of a sequence
of characters, in this case a C<G> followed by an accent character. C<\X>
will match the entire sequence.

=item *

The C<tr///> operator translates characters instead of bytes. Note
that the C<tr///CU> functionality has been removed. For similar
functionality see C<pack('U0', ...)> and C<pack('C0', ...)>.

=item *

Case translation operators use the Unicode case translation tables
when character input is provided. Note that C<uc()>, or C<\U> in
interpolated strings, translates to uppercase, while C<ucfirst()>,
or C<\u> in interpolated strings, translates to titlecase in languages
that make the distinction (which is equivalent to uppercase in languages
without the distinction).

=item *

Most operators that deal with positions or lengths in a string will
automatically switch to using character positions, including
C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
C<sprintf()>, C<write()>, and C<length()>. An operator that
specifically does not switch is C<vec()>. Operators that really don't
care include operators that treat strings as a bucket of bits such as
C<sort()>, and operators dealing with filenames.

=item *

The C<pack()>/C<unpack()> letter C<C> does I<not> change, since it is often
used for byte-oriented formats. Think C<char> in the C language.

There is a new C<U> specifier that converts between Unicode characters
and code points. There is also a C<W> specifier that is the equivalent of
C<chr>/C<ord> and properly handles character values even if they are above 255.

=item *

The C<chr()> and C<ord()> functions work on characters, similar to
C<pack("W")> and C<unpack("W")>, I<not> C<pack("C")> and
C<unpack("C")>. C<pack("C")> and C<unpack("C")> are methods for
emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.
While these methods reveal the internal encoding of Unicode strings,
that is not something one normally needs to care about at all.

=item *

The bit string operators, C<& | ^ ~>, can operate on character data.
However, for backward compatibility (such as when bit string operations
are applied to strings whose characters are all less than 256 in ordinal
value), one should not use C<~> (the bit complement) on data that mixes
characters with ordinals less than 256 and characters with ordinals
greater than 255. Most importantly,
DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
will not hold. The reason for this mathematical I<faux pas> is that
the complement cannot return B<both> the 8-bit (byte-wide) bit
complement B<and> the full character-wide bit complement.

=item *

There is a CPAN module, L<Unicode::Casing>, which allows you to define
your own mappings to be used in C<lc()>, C<lcfirst()>, C<uc()>, and
C<ucfirst()> (or their double-quoted string inlined versions such as
C<\U>). (Prior to Perl 5.16, this functionality was partially provided
in the Perl core, but suffered from a number of insurmountable
drawbacks, so the CPAN module was written instead.)

=back

=over 4

=item *

And finally, C<scalar reverse()> reverses by character rather than by byte.

=back

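Several of the effects listed above can be observed directly. The
following is a small sketch rather than an exhaustive demonstration; the
sample characters and variable names are chosen only for illustration.

```perl
use strict;
use warnings;
use feature 'unicode_strings';

my $smiley = "\N{U+263A}";                # WHITE SMILING FACE
my $length = length($smiley);             # counted in characters
my $ord    = ord($smiley);                # the code point
my $packed = pack("W", 0x263A);           # same character as chr(0x263A)

# "G" followed by COMBINING ACUTE ACCENT: two code points, but one
# extended grapheme cluster as far as \X is concerned.
my $accented_g = "G\N{U+0301}";
my @clusters   = $accented_g =~ /(\X)/g;

# LATIN SMALL LETTER DZ WITH CARON has distinct upper and title cases.
my $dz    = "\N{U+01C6}";
my $upper = uc($dz);
my $title = ucfirst($dz);

# Reversal works by character, not by byte.
my $reversed = scalar reverse("ab$smiley");

printf "length=%d ord=U+%04X clusters=%d\n",
    $length, $ord, scalar @clusters;
printf "uc=U+%04X ucfirst=U+%04X\n", ord($upper), ord($title);
printf "first character after reverse: U+%04X\n", ord($reversed);
```

Printing code points with C<printf> avoids writing wide characters to a
filehandle that has no encoding layer.
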
=head2 Unicode Character Properties

(The only time that Perl considers a sequence of individual code
points as a single logical character is in the C<\X> construct, already
mentioned above. Therefore "character" in this discussion means a single
Unicode code point.)

Very nearly all Unicode character properties are accessible through
regular expressions by using the C<\p{}> "matches property" construct
and the C<\P{}> "doesn't match property" for its negation.

For instance, C<\p{Uppercase}> matches any single character with the Unicode
"Uppercase" property, while C<\p{L}> matches any character with a
General_Category property value of "L" (letter). Braces are not
required for single letter property names, so C<\p{L}> is equivalent to C<\pL>.

More formally, C<\p{Uppercase}> matches any single character whose Unicode
Uppercase property value is True, and C<\P{Uppercase}> matches any character
whose Uppercase property value is False, and they could have been written as
C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively.

This formality is needed when properties are not binary; that is, if they can
take on more values than just True and False. For example, the Bidi_Class
property (see L</"Bidirectional Character Types"> below) can take on several
different values, such as Left, Right, Whitespace, and others. To match these,
one needs to specify both the property name (Bidi_Class), AND the value being
matched against (Left, Right, etc.). This is done, as in the examples above,
by having the two components separated by an equal sign (or interchangeably,
a colon), like C<\p{Bidi_Class: Left}>.

All Unicode-defined character properties may be written in these compound forms
of C<\p{property=value}> or C<\p{property:value}>, but Perl provides some
additional properties that are written only in the single form, as well as
single-form short-cuts for all binary properties and certain others described
below, in which you may omit the property name and the equals or colon
separator.

Most Unicode character properties have at least two synonyms (or aliases if you
prefer): a short one that is easier to type and a longer one that is more
descriptive and hence easier to understand. Thus the "L" and "Letter" properties
above are equivalent and can be used interchangeably. Likewise,
"Upper" is a synonym for "Uppercase", and we could have written
C<\p{Uppercase}> equivalently as C<\p{Upper}>. Also, there are typically
various synonyms for the values the property can take. For binary properties,
"True" has 3 synonyms: "T", "Yes", and "Y"; and "False" correspondingly has "F",
"No", and "N". But be careful. A short form of a value for one property may
not mean the same thing as the same short form for another. Thus, for the
General_Category property, "L" means "Letter", but for the Bidi_Class property,
"L" means "Left". A complete list of properties and synonyms is in
L<perluniprops>.

Upper/lower case differences in property names and values are irrelevant;
thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>.
Similarly, you can add or subtract underscores anywhere in the middle of a
word, so that these are also equivalent to C<\p{U_p_p_e_r}>. And white space
is irrelevant adjacent to non-word characters, such as the braces and the equals
or colon separators, so C<\p{ Upper }> and C<\p{ Upper_case : Y }> are
equivalent to these as well. In fact, white space and even
hyphens can usually be added or deleted anywhere. So even
C<\p{ Up-per case = Yes}> is
equivalent. All this is called "loose-matching" by Unicode. The few places
where stricter matching is used are in the middle of numbers, and in the Perl
extension properties that begin or end with an underscore. Stricter matching
cares about white space (except adjacent to non-word characters),
hyphens, and non-interior underscores.

You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
(^) between the first brace and the property name: C<\p{^Tamil}> is
equal to C<\P{Tamil}>.

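The single form, the compound form, synonyms, case-insensitive property
names, and caret negation described above can be compared side by side.
This is only a sketch: the list of spellings is a sample picked for the
example, not a complete enumeration.

```perl
use strict;
use warnings;

# Different spellings that should all mean "Uppercase is True".
my @equivalent = (
    qr/\p{Uppercase}/,
    qr/\p{Uppercase=True}/,
    qr/\p{Uppercase:Y}/,
    qr/\p{Upper}/,
    qr/\p{upper}/,
    qr/\p{UpPeR}/,
    qr/\P{^Upper}/,          # a negated negation
);

my $matching_upper = grep { "A" =~ $_ } @equivalent;
my $matching_lower = grep { "a" =~ $_ } @equivalent;

# Short and long names for General_Category=Letter.
my @letter_forms   = (qr/\pL/, qr/\p{L}/, qr/\p{Letter}/, qr/\p{gc=L}/);
my $letter_matches = grep { "a" =~ $_ } @letter_forms;

printf "%d of %d spellings match 'A'; %d match 'a'\n",
    $matching_upper, scalar @equivalent, $matching_lower;
printf "%d of %d letter forms match 'a'\n",
    $letter_matches, scalar @letter_forms;
```
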
Almost all properties are immune to case-insensitive matching. That is,
adding a C</i> regular expression modifier does not change what they
match. There are two sets that are affected.
The first set is
C<Uppercase_Letter>,
C<Lowercase_Letter>,
and C<Titlecase_Letter>,
all of which match C<Cased_Letter> under C</i> matching.
And the second set is
C<Uppercase>,
C<Lowercase>,
and C<Titlecase>,
all of which match C<Cased> under C</i> matching.
This set also includes its subsets C<PosixUpper> and C<PosixLower> both
of which under C</i> matching match C<PosixAlpha>.
(The difference between these sets is that some things, such as Roman
numerals, come in both upper and lower case so they are C<Cased>, but
aren't considered letters, so they aren't C<Cased_Letter>s.)

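The effect of C</i> on the two affected sets can be sketched as below. The
example assumes Perl 5.14 or later, and uses ROMAN NUMERAL ONE (U+2160) as
a character that is cased but whose general category is a number rather
than a letter; the variable names are illustrative.

```perl
use strict;
use warnings;

my $lower = "a";
my $roman = "\N{U+2160}";    # ROMAN NUMERAL ONE: Cased, but not a letter

# Without /i, Lu matches only uppercase letters.
my $lu_plain = $lower =~ /\p{Lu}/ ? 1 : 0;

# With /i, Lu behaves like Cased_Letter, so a lowercase letter matches.
my $lu_fold = $lower =~ /\p{Lu}/i ? 1 : 0;

# With /i, Lowercase behaves like Cased, which includes the numeral.
my $roman_cased = $roman =~ /\p{Lowercase}/i ? 1 : 0;

# The numeral is not a Cased_Letter, so Lu still fails under /i.
my $roman_letter = $roman =~ /\p{Lu}/i ? 1 : 0;

print "a vs Lu:            $lu_plain\n";
print "a vs Lu under /i:   $lu_fold\n";
print "U+2160 vs Lowercase under /i: $roman_cased\n";
print "U+2160 vs Lu under /i:        $roman_letter\n";
```
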
The result is undefined if you try to match a non-Unicode code point
(that is, one above 0x10FFFF) against a Unicode property. Currently, a
warning is raised, and the match will fail. In some cases, this is
counterintuitive, as both these fail:

 chr(0x110000) =~ /\p{ASCII_Hex_Digit=True}/      # Fails.
 chr(0x110000) =~ /\p{ASCII_Hex_Digit=False}/     # Fails!

=head3 B<General_Category>

Every Unicode character is assigned a general category, which is the "most
usual categorization of a character" (from
L<http://www.unicode.org/reports/tr44>).

The compound way of writing these is like C<\p{General_Category=Number}>
(short, C<\p{gc:n}>). But Perl furnishes shortcuts in which everything up
through the equal or colon separator is omitted. So you can instead just write
C<\pN>.

Here are the short and long forms of the General Category properties:

    Short       Long

    L           Letter
    LC, L&      Cased_Letter (that is: [\p{Ll}\p{Lu}\p{Lt}])
    Lu          Uppercase_Letter
    Ll          Lowercase_Letter
    Lt          Titlecase_Letter
    Lm          Modifier_Letter
    Lo          Other_Letter

    M           Mark
    Mn          Nonspacing_Mark
    Mc          Spacing_Mark
    Me          Enclosing_Mark

    N           Number
    Nd          Decimal_Number (also Digit)
    Nl          Letter_Number
    No          Other_Number

    P           Punctuation (also Punct)
    Pc          Connector_Punctuation
    Pd          Dash_Punctuation
    Ps          Open_Punctuation
    Pe          Close_Punctuation
    Pi          Initial_Punctuation
                (may behave like Ps or Pe depending on usage)
    Pf          Final_Punctuation
                (may behave like Ps or Pe depending on usage)
    Po          Other_Punctuation

    S           Symbol
    Sm          Math_Symbol
    Sc          Currency_Symbol
    Sk          Modifier_Symbol
    So          Other_Symbol

    Z           Separator
    Zs          Space_Separator
    Zl          Line_Separator
    Zp          Paragraph_Separator

    C           Other
    Cc          Control (also Cntrl)
    Cf          Format
    Cs          Surrogate
    Co          Private_Use
    Cn          Unassigned

Single-letter properties match all characters in any of the
two-letter sub-properties starting with the same letter.
C<LC> and C<L&> are special: both are aliases for the set consisting of
everything matched by C<Ll>, C<Lu>, and C<Lt>.

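As a quick illustration of the General Category shortcuts, the following
sketch counts how many characters of a sample string fall into a few
categories. The sample text and the hash names are invented for the
example.

```perl
use strict;
use warnings;

my $text = "Hello, World 42!";

# One pattern per category, using the short forms from the table above.
my %pattern = (
    Letter           => qr/\pL/,
    Uppercase_Letter => qr/\p{Lu}/,
    Number           => qr/\pN/,
    Punctuation      => qr/\pP/,
    Space_Separator  => qr/\p{Zs}/,
);

my %count;
for my $name (sort keys %pattern) {
    # A global match in list context returns every matched character.
    my @found = $text =~ /$pattern{$name}/g;
    $count{$name} = scalar @found;
    printf "%-17s %d\n", $name, $count{$name};
}
```

Because C<\pL> covers every two-letter sub-property that starts with C<L>,
its count includes the characters also counted under C<\p{Lu}>.
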
=head3 B<Bidirectional Character Types>

Because scripts differ in their directionality (Hebrew and Arabic are
written right to left, for example), Unicode supplies these values for
the Bidi_Class property:

    Value       Meaning

    L           Left-to-Right
    LRE         Left-to-Right Embedding
    LRO         Left-to-Right Override
    R           Right-to-Left
    AL          Arabic Letter
    RLE         Right-to-Left Embedding
    RLO         Right-to-Left Override
    PDF         Pop Directional Format
    EN          European Number
    ES          European Separator
    ET          European Terminator
    AN          Arabic Number
    CS          Common Separator
    NSM         Non-Spacing Mark
    BN          Boundary Neutral
    B           Paragraph Separator
    S           Segment Separator
    WS          Whitespace
    ON          Other Neutrals

This property is always written in the compound form.
For example, C<\p{Bidi_Class:R}> matches characters that are normally
written right to left.

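A few of the values in the table above can be checked against sample
characters. This is a sketch only; the characters were picked for
illustration and the variable names are hypothetical.

```perl
use strict;
use warnings;

my $hebrew_alef = "\N{U+05D0}";    # HEBREW LETTER ALEF
my $arabic_alef = "\N{U+0627}";    # ARABIC LETTER ALEF

# Each entry pairs a description with the result of one compound-form match.
my @checks = (
    [ 'Latin "A" is L',        ("A"          =~ /\p{Bidi_Class:L}/  ? 1 : 0) ],
    [ 'Hebrew alef is R',      ($hebrew_alef =~ /\p{Bidi_Class:R}/  ? 1 : 0) ],
    [ 'Hebrew alef is not L',  ($hebrew_alef =~ /\P{Bidi_Class:L}/  ? 1 : 0) ],
    [ 'Arabic alef is AL',     ($arabic_alef =~ /\p{Bidi_Class:AL}/ ? 1 : 0) ],
    [ 'digit "7" is EN',       ("7"          =~ /\p{Bc=EN}/         ? 1 : 0) ],
    [ 'space is WS',           (" "          =~ /\p{Bc=WS}/         ? 1 : 0) ],
);

my $passed = grep { $_->[1] } @checks;
printf "%-22s %s\n", $_->[0], ($_->[1] ? 'yes' : 'no') for @checks;
```

C<Bc> is the short name of C<Bidi_Class>, so the last two patterns use the
same property as the others.
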
=head3 B<Scripts>

The world's languages are written in many different scripts. This sentence
(unless you're reading it in translation) is written in Latin, while Russian is
written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in
Hiragana or Katakana. There are many more.

The Unicode Script and Script_Extensions properties give what script a
given character is in. Either property can be specified with the
compound form like
C<\p{Script=Hebrew}> (short: C<\p{sc=hebr}>), or
C<\p{Script_Extensions=Javanese}> (short: C<\p{scx=java}>).
In addition, Perl furnishes shortcuts for all
C<Script> property names. You can omit everything up through the equals
(or colon), and simply write C<\p{Latin}> or C<\P{Cyrillic}>.
(This is not true for C<Script_Extensions>, which is required to be
written in the compound form.)

The difference between these two properties involves characters that are
used in multiple scripts. For example the digits '0' through '9' are
used in many parts of the world. These are placed in a script named
C<Common>. Other characters are used in just a few scripts. For
example, the "KATAKANA-HIRAGANA DOUBLE HYPHEN" is used in both Japanese
scripts, Katakana and Hiragana, but nowhere else. The C<Script>
property places all characters that are used in multiple scripts in the
C<Common> script, while the C<Script_Extensions> property places those
that are used in only a few scripts into each of those scripts, while
still using C<Common> for those used in many scripts. Thus both these
match:

 "0" =~ /\p{sc=Common}/     # Matches
 "0" =~ /\p{scx=Common}/    # Matches

And only the first of these matches:

 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Common}/  # Matches
 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Common}/ # No match

And only the last two of these match:

 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Hiragana}/  # No match
 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Katakana}/  # No match
 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Hiragana}/ # Matches
 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Katakana}/ # Matches

C<Script_Extensions> is thus an improved C<Script>, in which there are
fewer characters in the C<Common> script, and correspondingly more in
other scripts. It is new in Unicode version 6.0, and its data are likely
to change significantly in later releases, as things get sorted out.

(Actually, besides C<Common>, the C<Inherited> script contains
characters that are used in multiple scripts. These are modifier
characters which modify other characters, and inherit the script value
of the controlling character. Some of these are used in many scripts,
and so go into C<Inherited> in both C<Script> and C<Script_Extensions>.
Others are used in just a few scripts, so are in C<Inherited> in
C<Script>, but not in C<Script_Extensions>.)

It is worth stressing that there are several different sets of digits in
Unicode that are equivalent to 0-9 and are matchable by C<\d> in a
regular expression. If they are used in a single language only, they
are in that language's C<Script> and C<Script_Extensions>. If they are
used in more than one script, they will be in C<sc=Common>, but only
if they are used in many scripts should they be in C<scx=Common>.

A complete list of scripts and their shortcuts is in L<perluniprops>.

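The difference between C<Script> and C<Script_Extensions> can be put into a
small runnable sketch. It assumes a Perl whose Unicode tables include
C<Script_Extensions> (Unicode 6.0 or later), and it spells the character as
C<\N{U+30A0}> so that no character-name lookup is needed; the hash name is
illustrative.

```perl
use strict;
use warnings;

my $double_hyphen = "\N{U+30A0}";    # KATAKANA-HIRAGANA DOUBLE HYPHEN

# 1 where the property matches, 0 where it does not.
my %matches = (
    'sc=Common'    => ($double_hyphen =~ /\p{sc=Common}/    ? 1 : 0),
    'scx=Common'   => ($double_hyphen =~ /\p{scx=Common}/   ? 1 : 0),
    'sc=Katakana'  => ($double_hyphen =~ /\p{sc=Katakana}/  ? 1 : 0),
    'scx=Katakana' => ($double_hyphen =~ /\p{scx=Katakana}/ ? 1 : 0),
    'scx=Hiragana' => ($double_hyphen =~ /\p{scx=Hiragana}/ ? 1 : 0),
);

# A digit is used in many scripts, so it is Common under both properties.
my $digit_common = ("0" =~ /\p{sc=Common}/ && "0" =~ /\p{scx=Common}/) ? 1 : 0;

printf "%-13s %d\n", $_, $matches{$_} for sort keys %matches;
print  "digit 0 Common under both: $digit_common\n";
```
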
=head3 B<Use of "Is" Prefix>

For backward compatibility (with Perl 5.6), all properties mentioned
so far may have C<Is> or C<Is_> prepended to their name, so C<\P{Is_Lu}>, for
example, is equal to C<\P{Lu}>, and C<\p{IsScript:Arabic}> is equal to
C<\p{Arabic}>.

=head3 B<Blocks>

In addition to B<scripts>, Unicode also defines B<blocks> of
characters. The difference between scripts and blocks is that the
concept of scripts is closer to natural languages, while the concept
of blocks is more of an artificial grouping based on groups of Unicode
characters with consecutive ordinal values. For example, the "Basic Latin"
block is all characters whose ordinals are between 0 and 127, inclusive; in
other words, the ASCII characters. The "Latin" script contains some letters
from this as well as several other blocks, like "Latin-1 Supplement",
"Latin Extended-A", etc., but it does not contain all the characters from
those blocks. It does not, for example, contain the digits 0-9, because
those digits are shared across many scripts, and hence are in the
C<Common> script.

For more about scripts versus blocks, see UAX#24 "Unicode Script Property":
L<http://www.unicode.org/reports/tr24>

The C<Script> or C<Script_Extensions> properties are likely to be the
ones you want to use when processing
natural language; the Block property may occasionally be useful in working
with the nuts and bolts of Unicode.

Block names are matched in the compound form, like C<\p{Block: Arrows}> or
C<\p{Blk=Hebrew}>. Unlike most other properties, only a few block names have a
Unicode-defined short name. But Perl does provide a (slight) shortcut: You
can say, for example C<\p{In_Arrows}> or C<\p{In_Hebrew}>. For backwards
compatibility, the C<In> prefix may be omitted if there is no naming conflict
with a script or any other property, and you can even use an C<Is> prefix
instead in those cases. But it is not a good idea to do this, for a couple
of reasons:

=over 4

=item 1

It is confusing. There are many naming conflicts, and you may forget some.
For example, C<\p{Hebrew}> means the I<script> Hebrew, and NOT the I<block>
Hebrew. But would you remember that 6 months from now?

=item 2

It is unstable. A new version of Unicode may pre-empt the current meaning by
creating a property with the same name. There was a time in very early Unicode
releases when C<\p{Hebrew}> would have matched the I<block> Hebrew; now it
doesn't.

=back

Some people prefer to always use C<\p{Block: foo}> and C<\p{Script: bar}>
instead of the shortcuts, whether for clarity, because they can't remember the
difference between 'In' and 'Is' anyway, or because they aren't confident that
those who eventually will read their code will know that difference.

A complete list of blocks and their shortcuts is in L<perluniprops>.

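The distinction between blocks and scripts drawn above can be sketched with
a few sample characters. The characters and variable names are chosen for
illustration only; HEBREW LETTER YOD WITH HIRIQ (U+FB1D) is used as an
example of a Hebrew-script character that lies outside the Hebrew block.

```perl
use strict;
use warnings;

my $digit      = "5";
my $arrow      = "\N{U+2190}";    # LEFTWARDS ARROW
my $alef       = "\N{U+05D0}";    # HEBREW LETTER ALEF
my $yod_hiriq  = "\N{U+FB1D}";    # HEBREW LETTER YOD WITH HIRIQ

# A digit is in the Basic Latin block, but in the Common script.
my $digit_in_block  = $digit =~ /\p{Block=Basic_Latin}/ ? 1 : 0;
my $digit_is_latin  = $digit =~ /\p{Script=Latin}/      ? 1 : 0;
my $digit_is_common = $digit =~ /\p{Script=Common}/     ? 1 : 0;

# The compound form and the In_ shortcut name the same block.
my $arrow_compound = $arrow =~ /\p{Block: Arrows}/ ? 1 : 0;
my $arrow_shortcut = $arrow =~ /\p{In_Arrows}/     ? 1 : 0;

# Script membership does not imply block membership.
my $alef_in_block = $alef      =~ /\p{Blk=Hebrew}/    ? 1 : 0;
my $yod_in_block  = $yod_hiriq =~ /\p{Blk=Hebrew}/    ? 1 : 0;
my $yod_is_hebrew = $yod_hiriq =~ /\p{Script=Hebrew}/ ? 1 : 0;

print "digit: block=$digit_in_block latin=$digit_is_latin ",
      "common=$digit_is_common\n";
print "arrow: compound=$arrow_compound shortcut=$arrow_shortcut\n";
print "alef in Hebrew block=$alef_in_block; ",
      "U+FB1D in Hebrew block=$yod_in_block, Hebrew script=$yod_is_hebrew\n";
```
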
=head3 B<Other Properties>

There are many more properties than the very basic ones described here.
A complete list is in L<perluniprops>.

Unicode defines all its properties in the compound form, so all single-form
properties are Perl extensions. Most of these are just synonyms for the
Unicode ones, but some are genuine extensions, including several that are in
the compound form. And quite a few of these are actually recommended by Unicode
(in L<http://www.unicode.org/reports/tr18>).

This section gives some details on all extensions that aren't just
synonyms for compound-form Unicode properties
(for those properties, you'll have to refer to the
L<Unicode Standard|http://www.unicode.org/reports/tr44>).

=over

=item B<C<\p{All}>>

This matches any of the 1_114_112 Unicode code points. It is a synonym for
C<\p{Any}>.

=item B<C<\p{Alnum}>>

This matches any C<\p{Alphabetic}> or C<\p{Decimal_Number}> character.

=item B<C<\p{Any}>>

This matches any of the 1_114_112 Unicode code points. It is a synonym for
C<\p{All}>.

=item B<C<\p{ASCII}>>

This matches any of the 128 characters in the US-ASCII character set,
which is a subset of Unicode.

=item B<C<\p{Assigned}>>

This matches any assigned code point; that is, any code point whose general
category is not Unassigned (or equivalently, not Cn).

=item B<C<\p{Blank}>>

This is the same as C<\h> and C<\p{HorizSpace}>: A character that changes the
spacing horizontally.

654=item B<C<\p{Decomposition_Type: Non_Canonical}>> (Short: C<\p{Dt=NonCanon}>)
655
656Matches a character that has a non-canonical decomposition.
657
To understand the use of this rarely used property=value combination, it is
necessary to know some basics about decomposition.
Consider a character, say H. It could appear with various marks around it,
such as an acute accent, or a circumflex, or various hooks, circles, arrows,
I<etc.>, above, below, or to one side or the other. There are many
possibilities among the world's languages. The number of combinations is
astronomical, and if there were a character for each combination, it would
soon exhaust Unicode's more than a million possible characters. So Unicode
took a different approach: there is a character for the base H, and a
character for each of the possible marks, and these can be variously combined
to get a final logical character. So a logical character--what appears to be a
single character--can be a sequence of more than one individual character.
This is called an "extended grapheme cluster"; Perl furnishes the C<\X>
regular expression construct to match such sequences.
672
673But Unicode's intent is to unify the existing character set standards and
b19eb496 674practices, and several pre-existing standards have single characters that
9f815e24
KW
675mean the same thing as some of these combinations. An example is ISO-8859-1,
676which has quite a few of these in the Latin-1 range, an example being "LATIN
677CAPITAL LETTER E WITH ACUTE". Because this character was in this pre-existing
678standard, Unicode added it to its repertoire. But this character is considered
b19eb496
TC
679by Unicode to be equivalent to the sequence consisting of the character
680"LATIN CAPITAL LETTER E" followed by the character "COMBINING ACUTE ACCENT".
9f815e24
KW
681
682"LATIN CAPITAL LETTER E WITH ACUTE" is called a "pre-composed" character, and
b19eb496 683its equivalence with the sequence is called canonical equivalence. All
9f815e24 684pre-composed characters are said to have a decomposition (into the equivalent
b19eb496 685sequence), and the decomposition type is also called canonical.
9f815e24
KW
686
687However, many more characters have a different type of decomposition, a
688"compatible" or "non-canonical" decomposition. The sequences that form these
689decompositions are not considered canonically equivalent to the pre-composed
690character. An example, again in the Latin-1 range, is the "SUPERSCRIPT ONE".
b19eb496 691It is somewhat like a regular digit 1, but not exactly; its decomposition
9f815e24
KW
692into the digit 1 is called a "compatible" decomposition, specifically a
693"super" decomposition. There are several such compatibility
694decompositions (see L<http://www.unicode.org/reports/tr44>), including one
b19eb496 695called "compat", which means some miscellaneous type of decomposition
42581d5d 696that doesn't fit into the decomposition categories that Unicode has chosen.
9f815e24
KW
697
698Note that most Unicode characters don't have a decomposition, so their
699decomposition type is "None".
700
b19eb496
TC
701For your convenience, Perl has added the C<Non_Canonical> decomposition
702type to mean any of the several compatibility decompositions.
9f815e24
KW
703
704=item B<C<\p{Graph}>>
705
706Matches any character that is graphic. Theoretically, this means a character
707that on a printer would cause ink to be used.
708
709=item B<C<\p{HorizSpace}>>
710
b19eb496 711This is the same as C<\h> and C<\p{Blank}>: a character that changes the
9f815e24
KW
712spacing horizontally.
713
42581d5d 714=item B<C<\p{In=*}>>
9f815e24
KW
715
This is a synonym for C<\p{Present_In=*}>.
717
718=item B<C<\p{PerlSpace}>>
719
720This is the same as C<\s>, restricted to ASCII, namely C<S<[ \f\n\r\t]>>.
721
722Mnemonic: Perl's (original) space
723
724=item B<C<\p{PerlWord}>>
725
This is the same as C<\w>, restricted to ASCII, namely C<[A-Za-z0-9_]>.
727
728Mnemonic: Perl's (original) word.
729
42581d5d 730=item B<C<\p{Posix...}>>
9f815e24 731
b19eb496
TC
732There are several of these, which are equivalents using the C<\p>
733notation for Posix classes and are described in
42581d5d 734L<perlrecharclass/POSIX Character Classes>.
9f815e24
KW
735
736=item B<C<\p{Present_In: *}>> (Short: C<\p{In=*}>)
737
This property is used when you need to know in which Unicode version(s) a
character is present.

The "*" above stands for some two-digit Unicode version number, such as
C<1.1> or C<4.0>; or the "*" can also be C<Unassigned>. This property will
match the code points whose final disposition has been settled as of the
Unicode release given by the version number; C<\p{Present_In: Unassigned}>
will match those code points whose meaning has yet to be assigned.

For example, C<U+0041> "LATIN CAPITAL LETTER A" was present in the very first
Unicode release available, which is C<1.1>, so this property is true for all
valid "*" versions. On the other hand, C<U+1EFF> was not assigned until version
5.1 when it became "LATIN SMALL LETTER Y WITH LOOP", so the only "*" values
that match it are 5.1, 5.2, and later.
752
753Unicode furnishes the C<Age> property from which this is derived. The problem
754with Age is that a strict interpretation of it (which Perl takes) has it
755matching the precise release a code point's meaning is introduced in. Thus
756C<U+0041> would match only 1.1; and C<U+1EFF> only 5.1. This is not usually what
757you want.
758
759Some non-Perl implementations of the Age property may change its meaning to be
760the same as the Perl Present_In property; just be aware of that.
761
762Another confusion with both these properties is that the definition is not
b19eb496
TC
763that the code point has been I<assigned>, but that the meaning of the code point
764has been I<determined>. This is because 66 code points will always be
765unassigned, and so the Age for them is the Unicode version in which the decision
766to make them so was made. For example, C<U+FDD0> is to be permanently
9f815e24 767unassigned to a character, and the decision to do that was made in version 3.1,
b19eb496 768so C<\p{Age=3.1}> matches this character, as also does C<\p{Present_In: 3.1}> and up.
9f815e24
KW
769
770=item B<C<\p{Print}>>
771
ae5b72c8 772This matches any character that is graphical or blank, except controls.
9f815e24
KW
773
774=item B<C<\p{SpacePerl}>>
775
776This is the same as C<\s>, including beyond ASCII.
777
4d4acfba 778Mnemonic: Space, as modified by Perl. (It doesn't include the vertical tab
b19eb496 779which both the Posix standard and Unicode consider white space.)
9f815e24 780
4364919a
KW
=item B<C<\p{Title}>> and B<C<\p{Titlecase}>>

Under case-sensitive matching, these both match the same code points as
C<\p{General Category=Titlecase_Letter}> (C<\p{gc=lt}>). The difference
is that under C</i> caseless matching, these match the same as
C<\p{Cased}>, whereas C<\p{gc=lt}> matches C<\p{Cased_Letter}>.
787
9f815e24
KW
788=item B<C<\p{VertSpace}>>
789
790This is the same as C<\v>: A character that changes the spacing vertically.
791
792=item B<C<\p{Word}>>
793
b19eb496 794This is the same as C<\w>, including over 100_000 characters beyond ASCII.
9f815e24 795
42581d5d
KW
796=item B<C<\p{XPosix...}>>
797
b19eb496 798There are several of these, which are the standard Posix classes
42581d5d
KW
799extended to the full Unicode range. They are described in
800L<perlrecharclass/POSIX Character Classes>.
801
9f815e24
KW
802=back
803
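Several of the properties described above can be checked directly. The
following is only a small sketch: the variable names are invented for
illustration, and it assumes a Perl recent enough to carry Unicode 5.2 data.

```perl
use strict;
use warnings;

# Decomposition types: SUPERSCRIPT ONE (U+00B9) has a compatibility
# ("super") decomposition, LATIN CAPITAL LETTER E WITH ACUTE (U+00C9)
# has a canonical one, and plain "A" has none.
my $sup_one_noncanon = "\x{B9}" =~ /\p{Dt=NonCanon}/  ? 1 : 0;
my $e_acute_noncanon = "\x{C9}" =~ /\p{Dt=NonCanon}/  ? 1 : 0;
my $e_acute_canon    = "\x{C9}" =~ /\p{Dt=Canonical}/ ? 1 : 0;
my $a_none           = "A"      =~ /\p{Dt=None}/      ? 1 : 0;

# \X matches a whole extended grapheme cluster: here "E" followed by
# COMBINING ACUTE ACCENT (two code points, one logical character).
my $cluster_is_one_X = "E\x{301}" =~ /\A\X\z/ ? 1 : 0;

# Present_In versus Age for U+1EFF, first assigned in Unicode 5.1.
my $in_5_0  = "\x{1EFF}" =~ /\p{Present_In: 5.0}/ ? 1 : 0;
my $in_5_2  = "\x{1EFF}" =~ /\p{Present_In: 5.2}/ ? 1 : 0;
my $age_5_1 = "\x{1EFF}" =~ /\p{Age: 5.1}/        ? 1 : 0;
my $age_5_2 = "\x{1EFF}" =~ /\p{Age: 5.2}/        ? 1 : 0;

print "NonCanon: B9=$sup_one_noncanon C9=$e_acute_noncanon\n";
print "Present_In 5.0/5.2: $in_5_0/$in_5_2  Age 5.1/5.2: $age_5_1/$age_5_2\n";
```

C<Present_In> is cumulative, so it is true for 5.2 as well as 5.1, whereas
C<Age> is true only for the release in which the code point was introduced.
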
376d9008 804=head2 User-Defined Character Properties
491fd90a 805
51f494cc
KW
806You can define your own binary character properties by defining subroutines
807whose names begin with "In" or "Is". The subroutines can be defined in any
808package. The user-defined properties can be used in the regular expression
809C<\p> and C<\P> constructs; if you are using a user-defined property from a
810package other than the one you are in, you must specify its package in the
811C<\p> or C<\P> construct.
bac0b425 812
    # assuming property IsForeign defined in Lang::
    package main;  # property package name required
    if ($txt =~ /\p{Lang::IsForeign}+/) { ... }

    package Lang;  # property package name not required
    if ($txt =~ /\p{IsForeign}+/) { ... }
819
820
821Note that the effect is compile-time and immutable once defined.
b19eb496
TC
822However, the subroutines are passed a single parameter, which is 0 if
823case-sensitive matching is in effect and non-zero if caseless matching
56ca34ca
KW
824is in effect. The subroutine may return different values depending on
825the value of the flag, and one set of values will immutably be in effect
b19eb496 826for all case-sensitive matches, and the other set for all case-insensitive
56ca34ca 827matches.
491fd90a 828
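As a rough sketch of how the caseless flag might be used, the property below
(C<IsLowerVowel> is a made-up name, not something Perl provides) returns a
larger set when it is called for a C</i> match:

```perl
use strict;
use warnings;

# Hypothetical property: the lowercase ASCII vowels.  Perl passes a
# true argument when the pattern is being compiled for caseless matching.
sub IsLowerVowel {
    my ($caseless) = @_;
    my $set = "61\n65\n69\n6F\n75\n";                 # a e i o u
    $set   .= "41\n45\n49\n4F\n55\n" if $caseless;    # A E I O U
    return $set;
}

my $lower_plain = "e" =~ /\p{IsLowerVowel}/  ? 1 : 0;
my $upper_plain = "E" =~ /\p{IsLowerVowel}/  ? 1 : 0;
my $upper_fold  = "E" =~ /\p{IsLowerVowel}/i ? 1 : 0;

print "e: $lower_plain, E: $upper_plain, E under /i: $upper_fold\n";
```

The subroutine has to list the extra code points itself; under C</i> Perl
uses whatever set is returned for the non-zero flag.
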
b19eb496 829Note that if the regular expression is tainted, then Perl will die rather
0e9be77f
DM
830than calling the subroutine, where the name of the subroutine is
831determined by the tainted data.
832
376d9008
JB
833The subroutines must return a specially-formatted string, with one
834or more newline-separated lines. Each line must be one of the following:
491fd90a
JH
835
836=over 4
837
838=item *
839
510254c9
A
840A single hexadecimal number denoting a Unicode code point to include.
841
842=item *
843
Two hexadecimal numbers separated by horizontal whitespace (space or
tab characters) denoting a range of Unicode code points to include.
491fd90a
JH
846
847=item *
848
376d9008 849Something to include, prefixed by "+": a built-in character
bac0b425
JP
850property (prefixed by "utf8::") or a user-defined character property,
851to represent all the characters in that property; two hexadecimal code
852points for a range; or a single hexadecimal code point.
491fd90a
JH
853
854=item *
855
376d9008 856Something to exclude, prefixed by "-": an existing character
bac0b425
JP
857property (prefixed by "utf8::") or a user-defined character property,
858to represent all the characters in that property; two hexadecimal code
859points for a range; or a single hexadecimal code point.
491fd90a
JH
860
861=item *
862
Something to negate, prefixed by "!": an existing character
bac0b425
JP
864property (prefixed by "utf8::") or a user-defined character property,
865to represent all the characters in that property; two hexadecimal code
866points for a range; or a single hexadecimal code point.
867
868=item *
869
Something to intersect with, prefixed by "&": an existing character
property (prefixed by "utf8::") or a user-defined character property,
to remove all the characters except those in the property; two
hexadecimal code points for a range; or a single hexadecimal code point.
491fd90a
JH
874
875=back
876
877For example, to define a property that covers both the Japanese
878syllabaries (hiragana and katakana), you can define
879
880 sub InKana {
d88362ca 881 return <<END;
d5822f25
A
882 3040\t309F
883 30A0\t30FF
491fd90a
JH
884 END
885 }
886
d5822f25
A
887Imagine that the here-doc end marker is at the beginning of the line.
888Now you can use C<\p{InKana}> and C<\P{InKana}>.
491fd90a
JH
889
890You could also have used the existing block property names:
891
892 sub InKana {
d88362ca 893 return <<'END';
491fd90a
JH
894 +utf8::InHiragana
895 +utf8::InKatakana
896 END
897 }
898
Suppose you wanted to match only the allocated characters,
not the raw block ranges: in other words, you want to remove
the unassigned characters:
902
903 sub InKana {
d88362ca 904 return <<'END';
491fd90a
JH
905 +utf8::InHiragana
906 +utf8::InKatakana
907 -utf8::IsCn
908 END
909 }
910
911The negation is useful for defining (surprise!) negated classes.
912
913 sub InNotKana {
d88362ca 914 return <<'END';
491fd90a
JH
915 !utf8::InHiragana
916 -utf8::InKatakana
917 +utf8::IsCn
918 END
919 }
920
461020ad
KW
921This will match all non-Unicode code points, since every one of them is
922not in Kana. You can use intersection to exclude these, if desired, as
923this modified example shows:
bac0b425 924
461020ad 925 sub InNotKana {
bac0b425 926 return <<'END';
461020ad
KW
927 !utf8::InHiragana
928 -utf8::InKatakana
929 +utf8::IsCn
930 &utf8::Any
bac0b425
JP
931 END
932 }
933
461020ad
KW
934C<&utf8::Any> must be the last line in the definition.
935
936Intersection is used generally for getting the common characters matched
937by two (or more) classes. It's important to remember not to use "&" for
938the first set; that would be intersecting with nothing, resulting in an
939empty set.
940
941(Note that official Unicode properties differ from these in that they
942automatically exclude non-Unicode code points and a warning is raised if
943a match is attempted on one of those.)
bac0b425 944
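To see the set operations at work, here is a small self-contained sketch.
The name C<InAssignedKana> is chosen only for this illustration; the
definition is the "blocks minus unassigned" example from above.

```perl
use strict;
use warnings;

# Both kana blocks, minus any unassigned (Cn) code points.
sub InAssignedKana {
    return <<'END';
+utf8::InHiragana
+utf8::InKatakana
-utf8::IsCn
END
}

my $hiragana_a = "\x{3042}" =~ /\p{InAssignedKana}/ ? 1 : 0;  # HIRAGANA LETTER A
my $katakana_a = "\x{30A2}" =~ /\p{InAssignedKana}/ ? 1 : 0;  # KATAKANA LETTER A
my $reserved   = "\x{3040}" =~ /\p{InAssignedKana}/ ? 1 : 0;  # unassigned slot in the Hiragana block
my $latin_a    = "A"        =~ /\P{InAssignedKana}/ ? 1 : 0;  # note the capital \P

print "3042=$hiragana_a 30A2=$katakana_a 3040=$reserved A(not kana)=$latin_a\n";
```

C<U+3040> lies inside the Hiragana block but has no character assigned to
it, so the C<-utf8::IsCn> line removes it from the set.
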
68585b5e 945=head2 User-Defined Case Mappings (for serious hackers only)
822502e5 946
5d1892be
KW
947B<This feature has been removed as of Perl 5.16.>
948The CPAN module L<Unicode::Casing> provides better functionality without
949the drawbacks that this feature had. If you are using a Perl earlier
950than 5.16, this feature was most fully documented in the 5.14 version of
951this pod:
952L<http://perldoc.perl.org/5.14.0/perlunicode.html#User-Defined-Case-Mappings-%28for-serious-hackers-only%29>
3a2263fe 953
376d9008 954=head2 Character Encodings for Input and Output
8cbd9a7a 955
7221edc9 956See L<Encode>.
8cbd9a7a 957
c29a771d 958=head2 Unicode Regular Expression Support Level
776f8809 959
b19eb496
TC
960The following list of Unicode supported features for regular expressions describes
961all features currently directly supported by core Perl. The references to "Level N"
8158862b 962and the section numbers refer to the Unicode Technical Standard #18,
b19eb496 963"Unicode Regular Expressions", version 13, from August 2008.
776f8809
JH
964
965=over 4
966
967=item *
968
969Level 1 - Basic Unicode Support
970
 RL1.1   Hex Notation                     - done          [1]
 RL1.2   Properties                       - done          [2][3]
 RL1.2a  Compatibility Properties         - done          [4]
 RL1.3   Subtraction and Intersection     - MISSING       [5]
 RL1.4   Simple Word Boundaries           - done          [6]
 RL1.5   Simple Loose Matches             - done          [7]
 RL1.6   Line Boundaries                  - MISSING       [8][9]
 RL1.7   Supplementary Code Points        - done          [10]

 [1]  \x{...}
 [2]  \p{...} \P{...}
 [3]  supports not only minimal list, but all Unicode character
      properties (see Unicode Character Properties above)
 [4]  \d \D \s \S \w \W \X [:prop:] [:^prop:]
 [5]  can use regular expression look-ahead [a] or
      user-defined character properties [b] to emulate set
      operations
 [6]  \b \B
 [7]  note that Perl does Full case-folding in matching (but with
      bugs), not Simple: for example U+1F88 is equivalent to
      U+1F00 U+03B9, instead of just U+1F80. This difference
      matters mainly for certain Greek capital letters with certain
      modifiers: the Full case-folding decomposes the letter,
      while the Simple case-folding would map it to a single
      character.
 [8]  should do ^ and $ also on U+000B (\v in C), FF (\f), CR
      (\r), CRLF (\r\n), NEL (U+0085), LS (U+2028), and PS
      (U+2029); should also affect <>, $., and script line
      numbers; should not split lines within CRLF [c] (i.e. there
      is no empty line between \r and \n)
 [9]  Linebreaking conformant with UAX#14 "Unicode Line Breaking
      Algorithm" is available through the Unicode::LineBreak
      module.
 [10] UTF-8/UTF-EBCDIC used in Perl allows not only U+10000 to
      U+10FFFF but also beyond U+10FFFF
7207e29d 1006
237bad5b 1007[a] You can mimic class subtraction using lookahead.
8158862b 1008For example, what UTS#18 might write as
29bdacb8 1009
dbe420b4
JH
1010 [{Greek}-[{UNASSIGNED}]]
1011
1012in Perl can be written as:
1013
1d81abf3
JH
1014 (?!\p{Unassigned})\p{InGreekAndCoptic}
1015 (?=\p{Assigned})\p{InGreekAndCoptic}
dbe420b4
JH
1016
1017But in this particular example, you probably really want
1018
        \p{Greek}
dbe420b4
JH
1020
1021which will match assigned characters known to be part of the Greek script.
29bdacb8 1022
Also see the L<Unicode::Regex::Set> module, which implements the full
UTS#18 grouping, intersection, union, and removal (subtraction) syntax.
1025
1026[b] '+' for union, '-' for removal (set-difference), '&' for intersection
1027(see L</"User-Defined Character Properties">)
1028
1029[c] Try the C<:crlf> layer (see L<PerlIO>).
5ca1ac52 1030
776f8809
JH
1031=item *
1032
1033Level 2 - Extended Unicode Support
1034
 RL2.1   Canonical Equivalents            - MISSING       [10][11]
 RL2.2   Default Grapheme Clusters        - MISSING       [12]
 RL2.3   Default Word Boundaries          - MISSING       [14]
 RL2.4   Default Loose Matches            - MISSING       [15]
 RL2.5   Name Properties                  - DONE
 RL2.6   Wildcard Properties              - MISSING

 [10] see UAX#15 "Unicode Normalization Forms"
 [11] have Unicode::Normalize but not integrated to regexes
 [12] have \X but we don't have a "Grapheme Cluster Mode"
 [14] see UAX#29, Word Boundaries
 [15] This is covered in Chapter 3.13 (in Unicode 6.0)
776f8809
JH
1047
1048=item *
1049
8158862b
TS
1050Level 3 - Tailored Support
1051
 RL3.1   Tailored Punctuation             - MISSING
 RL3.2   Tailored Grapheme Clusters       - MISSING       [17][18]
 RL3.3   Tailored Word Boundaries         - MISSING
 RL3.4   Tailored Loose Matches           - MISSING
 RL3.5   Tailored Ranges                  - MISSING
 RL3.6   Context Matching                 - MISSING       [19]
 RL3.7   Incremental Matches              - MISSING
      ( RL3.8   Unicode Set Sharing )
 RL3.9   Possible Match Sets              - MISSING
 RL3.10  Folded Matching                  - MISSING       [20]
 RL3.11  Submatchers                      - MISSING

 [17] see UTS#10 "Unicode Collation Algorithm"
 [18] have Unicode::Collate but not integrated to regexes
 [19] have (?<=x) and (?=x), but look-aheads or look-behinds
      should see outside of the target substring
 [20] need insensitive matching for linguistic features other
      than case; for example, hiragana to katakana, wide and
      narrow, simplified Han to traditional Han (see UTR#30
      "Character Foldings")
776f8809
JH
1072
1073=back
1074
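The look-ahead emulation of set subtraction mentioned in note [a] above can
be tried out as follows. This is just a sketch; C<$greek_assigned> is an
illustrative variable name.

```perl
use strict;
use warnings;

# "Greek and Coptic block, minus unassigned code points", written as a
# negative look-ahead followed by the block property.
my $greek_assigned = qr/(?!\p{Unassigned})\p{InGreekAndCoptic}/;

my $alpha_kept     = "\x{3B1}" =~ $greek_assigned       ? 1 : 0;  # GREEK SMALL LETTER ALPHA
my $hole_in_block  = "\x{3A2}" =~ /\p{InGreekAndCoptic}/ ? 1 : 0;  # reserved slot, still in the block
my $hole_kept      = "\x{3A2}" =~ $greek_assigned       ? 1 : 0;
my $latin_kept     = "a"       =~ $greek_assigned       ? 1 : 0;

print "alpha=$alpha_kept U+03A2 in block=$hole_in_block, after subtraction=$hole_kept\n";
```

The look-ahead consumes nothing, so both conditions are tested against the
same character; that is what makes it behave like a set difference.
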
c349b1b9
JH
1075=head2 Unicode Encodings
1076
376d9008
JB
1077Unicode characters are assigned to I<code points>, which are abstract
1078numbers. To use these numbers, various encodings are needed.
c349b1b9
JH
1079
1080=over 4
1081
c29a771d 1082=item *
5cb3728c
RB
1083
1084UTF-8
c349b1b9 1085
6d4f9cf2
KW
UTF-8 is a variable-length (1 to 4 bytes), byte-order independent
encoding. For ASCII (and we really do mean 7-bit ASCII, not another
8-bit encoding), UTF-8 is transparent.
c349b1b9 1089
8c007b5a 1090The following table is from Unicode 3.2.
05632f9a 1091
 Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte

   U+0000..U+007F       00..7F
   U+0080..U+07FF     * C2..DF    80..BF
   U+0800..U+0FFF       E0      * A0..BF    80..BF
   U+1000..U+CFFF       E1..EC    80..BF    80..BF
   U+D000..U+D7FF       ED        80..9F    80..BF
   U+D800..U+DFFF       +++++ utf16 surrogates, not legal utf8 +++++
   U+E000..U+FFFF       EE..EF    80..BF    80..BF
  U+10000..U+3FFFF      F0      * 90..BF    80..BF    80..BF
  U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
 U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF
e1b711da 1104
b19eb496 1105Note the gaps marked by "*" before several of the byte entries above. These are
e1b711da
KW
1106caused by legal UTF-8 avoiding non-shortest encodings: it is technically
1107possible to UTF-8-encode a single code point in different ways, but that is
1108explicitly forbidden, and the shortest possible encoding should always be used
1109(and that is what Perl does).
37361303 1110
376d9008 1111Another way to look at it is via bits:
05632f9a 1112
                Code Points  1st Byte  2nd Byte  3rd Byte  4th Byte

                   0aaaaaaa  0aaaaaaa
           00000bbbbbaaaaaa  110bbbbb  10aaaaaa
           ccccbbbbbbaaaaaa  1110cccc  10bbbbbb  10aaaaaa
 00000dddccccccbbbbbbaaaaaa  11110ddd  10cccccc  10bbbbbb  10aaaaaa
05632f9a 1119
9f815e24 1120As you can see, the continuation bytes all begin with "10", and the
e1b711da 1121leading bits of the start byte tell how many bytes there are in the
05632f9a
JH
1122encoded character.
1123
6d4f9cf2
KW
1124The original UTF-8 specification allowed up to 6 bytes, to allow
1125encoding of numbers up to 0x7FFF_FFFF. Perl continues to allow those,
1126and has extended that up to 13 bytes to encode code points up to what
1127can fit in a 64-bit word. However, Perl will warn if you output any of
b19eb496 1128these as being non-portable; and under strict UTF-8 input protocols,
6d4f9cf2
KW
1129they are forbidden.
1130
1131The Unicode non-character code points are also disallowed in UTF-8 in
1132"open interchange". See L</Non-character code points>.
1133
c29a771d 1134=item *
5cb3728c
RB
1135
1136UTF-EBCDIC
dbe420b4 1137
376d9008 1138Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.
dbe420b4 1139
c29a771d 1140=item *
5cb3728c 1141
1e54db1a 1142UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)
c349b1b9 1143
1bfb14c4
JH
The following items are mostly for reference and general Unicode
knowledge; Perl doesn't use these constructs internally.
dbe420b4 1146
b19eb496
TC
1147Like UTF-8, UTF-16 is a variable-width encoding, but where
1148UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units.
1149All code points occupy either 2 or 4 bytes in UTF-16: code points
1150C<U+0000..U+FFFF> are stored in a single 16-bit unit, and code
1bfb14c4 1151points C<U+10000..U+10FFFF> in two 16-bit units. The latter case is
c349b1b9
JH
1152using I<surrogates>, the first 16-bit unit being the I<high
1153surrogate>, and the second being the I<low surrogate>.
1154
376d9008 1155Surrogates are code points set aside to encode the C<U+10000..U+10FFFF>
c349b1b9 1156range of Unicode code points in pairs of 16-bit units. The I<high
9f815e24 1157surrogates> are the range C<U+D800..U+DBFF> and the I<low surrogates>
376d9008 1158are the range C<U+DC00..U+DFFF>. The surrogate encoding is
c349b1b9 1159
d88362ca
KW
    $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
    $lo = ($uni - 0x10000) % 0x400 + 0xDC00;

and the decoding is

    $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
c349b1b9 1166
376d9008 1167Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16
c349b1b9 1168itself can be used for in-memory computations, but if storage or
376d9008
JB
1169transfer is required either UTF-16BE (big-endian) or UTF-16LE
1170(little-endian) encodings must be chosen.
c349b1b9
JH
1171
1172This introduces another problem: what if you just know that your data
376d9008
JB
1173is UTF-16, but you don't know which endianness? Byte Order Marks, or
1174BOMs, are a solution to this. A special character has been reserved
86bbd6d1 1175in Unicode to function as a byte order marker: the character with the
376d9008 1176code point C<U+FEFF> is the BOM.
042da322 1177
c349b1b9 1178The trick is that if you read a BOM, you will know the byte order,
376d9008
JB
1179since if it was written on a big-endian platform, you will read the
1180bytes C<0xFE 0xFF>, but if it was written on a little-endian platform,
1181you will read the bytes C<0xFF 0xFE>. (And if the originating platform
1182was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.)
042da322 1183
The way this trick works is that the character with the code point
C<U+FFFE> is not supposed to be in input streams, so the
sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in
little-endian format" and cannot be "C<U+FFFE>, represented in big-endian
format".
1189
Surrogates have no meaning in Unicode outside their use in pairs to
represent other code points. However, Perl allows them to be
represented individually internally, for example by saying
C<chr(0xD801)>, so that all code points, not just those valid for open
interchange, are
representable. Unicode does define semantics for them, such as their
General Category being "Cs". But because their use is somewhat dangerous,
Perl will warn (using the warning category "surrogate", which is a
sub-category of "utf8") if an attempt is made
to do things like take the lower case of one, or match
case-insensitively, or to output them. (But don't try this on Perls
before 5.14.)
c349b1b9 1202
c29a771d 1203=item *
5cb3728c 1204
1e54db1a 1205UTF-32, UTF-32BE, UTF-32LE
c349b1b9
JH
1206
The UTF-32 family is pretty much like the UTF-16 family, except that
the units are 32-bit, and therefore the surrogate scheme is not
needed. UTF-32 is a fixed-width encoding. The BOM signatures are
C<0x00 0x00 0xFE 0xFF> for BE and C<0xFF 0xFE 0x00 0x00> for LE.
c349b1b9 1211
c29a771d 1212=item *
5cb3728c
RB
1213
1214UCS-2, UCS-4
c349b1b9 1215
b19eb496 1216Legacy, fixed-width encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit
376d9008 1217encoding. Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
339cfa0e 1218because it does not use surrogates. UCS-4 is a 32-bit encoding,
b19eb496 1219functionally identical to UTF-32 (the difference being that
ee88f7b6 1220UCS-4 forbids neither surrogates nor code points larger than 0x10_FFFF).
c349b1b9 1221
c29a771d 1222=item *
5cb3728c
RB
1223
1224UTF-7
c349b1b9 1225
376d9008
JB
1226A seven-bit safe (non-eight-bit) encoding, which is useful if the
1227transport or storage is not eight-bit safe. Defined by RFC 2152.
c349b1b9 1228
95a1a48b
JH
1229=back
1230
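The byte patterns, surrogate arithmetic, and BOM signatures described in the
list above can be reproduced with a few lines of Perl. The helper names
(C<utf8_hex>, C<to_surrogates>, C<from_surrogates>, C<bom_encoding>) are
invented for this sketch, and the BOM check is only a heuristic, since a
stream is not obliged to start with one.

```perl
use strict;
use warnings;

# UTF-8 bytes of one code point, as space-separated hex.
sub utf8_hex {
    my ($cp) = @_;
    my $bytes = chr($cp);
    utf8::encode($bytes);            # characters -> UTF-8 octets, in place
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# Split a code point in U+10000..U+10FFFF into (high, low) surrogates.
sub to_surrogates {
    my ($uni) = @_;
    my $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
    my $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
    return ($hi, $lo);
}

# Recombine a surrogate pair into the code point it stands for.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
}

# Guess an encoding from a leading BOM; returns nothing if none is found.
# The UTF-32 signatures are tried first because the UTF-32LE BOM starts
# with the same two bytes as the UTF-16LE one.
sub bom_encoding {
    my ($bytes) = @_;
    return 'UTF-32BE' if $bytes =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-32LE' if $bytes =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return;
}

printf "U+20AC  -> %s\n", utf8_hex(0x20AC);
printf "U+1F600 -> %s, surrogates %04X %04X\n",
    utf8_hex(0x1F600), to_surrogates(0x1F600);
```

For real conversions the L<Encode> module is the better tool; the arithmetic
here is only meant to mirror the formulas and tables in the text.
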
6d4f9cf2
KW
1231=head2 Non-character code points
1232
66 code points are set aside in Unicode as "non-character code points".
These all have the Unassigned (Cn) General Category, and they never will
be assigned. These are never supposed to be in legal Unicode input
streams, so that code can use them as sentinels that can be mixed in
with character data, and they always will be distinguishable from that data.
To keep them out of Perl input streams, strict UTF-8 should be
specified, such as by using the layer C<:encoding('UTF-8')>. The
non-character code points are the 32 between U+FDD0 and U+FDEF, and the
34 code points U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF.
Some people are under the mistaken impression that these are "illegal",
but that is not true. An application or cooperating set of applications
can legally use them at will internally; but these code points are
"illegal for open interchange". Therefore, Perl will not accept these
from input streams unless lax rules are being used, and will warn
(using the warning category "nonchar", which is a sub-category of "utf8") if
an attempt is made to output them.
1249
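The two groups of non-character code points follow a simple pattern, which
the sketch below checks by counting them. The function name C<is_nonchar> is
made up for this example.

```perl
use strict;
use warnings;

# True for the 66 non-character code points: U+FDD0..U+FDEF, plus the
# last two code points (xxFFFE and xxFFFF) of each of the 17 planes.
sub is_nonchar {
    my ($cp) = @_;
    return 0 if $cp > 0x10FFFF;
    return 1 if $cp >= 0xFDD0 && $cp <= 0xFDEF;
    return (($cp & 0xFFFE) == 0xFFFE) ? 1 : 0;
}

my $count = 0;
for my $cp (0 .. 0x10FFFF) {
    $count++ if is_nonchar($cp);
}

# The built-in property agrees, at least for this sample.
my $property_agrees = chr(0xFDD0) =~ /\p{Noncharacter_Code_Point}/ ? 1 : 0;

print "non-character code points: $count\n";
```
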
1250=head2 Beyond Unicode code points
1251
The maximum Unicode code point is U+10FFFF. But Perl accepts code
points up to the maximum permissible unsigned number available on the
platform. However, Perl will not accept these from input streams unless
lax rules are being used, and will warn (using the warning category
"non_unicode", which is a sub-category of "utf8") if an attempt is made to
operate on or output them. For example, C<uc(chr(0x11_0000))> will generate
this warning, returning the input parameter as its result, as the upper
case of every non-Unicode code point is the code point itself.
6d4f9cf2 1260
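A short sketch of the behavior just described, assuming a Perl that knows
the "non_unicode" warning category (5.14 or later); the variable names are
only illustrative:

```perl
use strict;
use warnings;

my $above = chr(0x11_0000);      # first code point past U+10FFFF

my $upper;
{
    # Silence the warning described above for this one operation.
    no warnings 'non_unicode';
    $upper = uc($above);
}

my $unchanged = $upper eq $above ? 1 : 0;
printf "uc() returned U+%X, unchanged: %d\n", ord($upper), $unchanged;
```
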
0d7c09bb
JH
1261=head2 Security Implications of Unicode
1262
e1b711da
KW
1263Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.
1264Also, note the following:
1265
0d7c09bb
JH
1266=over 4
1267
1268=item *
1269
1270Malformed UTF-8
bf0fa0b2 1271
42581d5d 1272Unfortunately, the original specification of UTF-8 leaves some room for
bf0fa0b2 1273interpretation of how many bytes of encoded output one should generate
376d9008
JB
1274from one input Unicode character. Strictly speaking, the shortest
1275possible sequence of UTF-8 bytes should be generated,
1276because otherwise there is potential for an input buffer overflow at
feda178f 1277the receiving end of a UTF-8 connection. Perl always generates the
e1b711da 1278shortest length UTF-8, and with warnings on, Perl will warn about
376d9008 1279non-shortest length UTF-8 along with other malformations, such as the
b19eb496 1280surrogates, which are not Unicode code points valid for interchange.
bf0fa0b2 1281
0d7c09bb
JH
1282=item *
1283
68693f9e 1284Regular expression pattern matching may surprise you if you're not
b19eb496
TC
1285accustomed to Unicode. Starting in Perl 5.14, several pattern
1286modifiers are available to control this, called the character set
42581d5d
KW
1287modifiers. Details are given in L<perlre/Character set modifiers>.
1288
1289=back
0d7c09bb 1290
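One way to reject malformed or non-interchangeable UTF-8 before it reaches
the rest of a program is to decode it strictly with L<Encode>. The helper
name C<is_strict_utf8> is an invention of this sketch:

```perl
use strict;
use warnings;
use Encode ();

# True if the octets are well-formed, interchange-valid UTF-8.
sub is_strict_utf8 {
    my ($octets) = @_;    # a private copy, since decode() may modify its input
    my $ok = eval {
        Encode::decode('UTF-8', $octets, Encode::FB_CROAK());
        1;
    };
    return $ok ? 1 : 0;
}

my $shortest  = is_strict_utf8("\xC3\xA9");      # U+00E9, shortest form
my $overlong  = is_strict_utf8("\xC0\x80");      # non-shortest encoding of U+0000
my $surrogate = is_strict_utf8("\xED\xA0\x80");  # an encoded surrogate, U+D800

print "shortest=$shortest overlong=$overlong surrogate=$surrogate\n";
```

Note the spelling C<UTF-8> with a hyphen: in L<Encode> that names the strict
encoding, while C<utf8> names Perl's laxer internal variant.
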
376d9008 1291As discussed elsewhere, Perl has one foot (two hooves?) planted in
1bfb14c4
JH
1292each of two worlds: the old world of bytes and the new world of
1293characters, upgrading from bytes to characters when necessary.
376d9008
JB
1294If your legacy code does not explicitly use Unicode, no automatic
1295switch-over to characters should happen. Characters shouldn't get
1bfb14c4
JH
1296downgraded to bytes, either. It is possible to accidentally mix bytes
1297and characters, however (see L<perluniintro>), in which case C<\w> in
42581d5d 1298regular expressions might start behaving differently (unless the C</a>
b19eb496 1299modifier is in effect). Review your code. Use warnings and the C<strict> pragma.
0d7c09bb 1300
c349b1b9
JH
1301=head2 Unicode in Perl on EBCDIC
1302
376d9008
JB
1303The way Unicode is handled on EBCDIC platforms is still
1304experimental. On such platforms, references to UTF-8 encoding in this
1305document and elsewhere should be read as meaning the UTF-EBCDIC
1306specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues
c349b1b9 1307are specifically discussed. There is no C<utfebcdic> pragma or
376d9008 1308":utfebcdic" layer; rather, "utf8" and ":utf8" are reused to mean
86bbd6d1
PN
1309the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic>
1310for more discussion of the issues.
c349b1b9 1311
b310b053
JH
1312=head2 Locales
1313
See L<perllocale/Unicode and UTF-8>.
b310b053 1315
1aad1664
JH
1316=head2 When Unicode Does Not Happen
1317
1318While Perl does have extensive ways to input and output in Unicode,
b19eb496
TC
1319and a few other "entry points" like the @ARGV array (which can sometimes be
1320interpreted as UTF-8), there are still many places where Unicode
1321(in some encoding or another) could be given as arguments or received as
1aad1664
JH
1322results, or both, but it is not.
1323
e1b711da
KW
1324The following are such interfaces. Also, see L</The "Unicode Bug">.
1325For all of these interfaces Perl
6cd4dd6c 1326currently (as of 5.8.3) simply assumes byte strings both as arguments
b19eb496 1327and results, or UTF-8 strings if the (problematic) C<encoding> pragma has been used.
1aad1664 1328
b19eb496
TC
1329One reason that Perl does not attempt to resolve the role of Unicode in
1330these situations is that the answers are highly dependent on the operating
1aad1664 1331system and the file system(s). For example, whether filenames can be
b19eb496
TC
1332in Unicode and in exactly what kind of encoding, is not exactly a
1333portable concept. Similarly for C<qx> and C<system>: how well will the
1334"command-line interface" (and which of them?) handle Unicode?
1aad1664
JH
1335
1336=over 4
1337
=item *

chdir, chmod, chown, chroot, exec, link, lstat, mkdir,
rename, rmdir, stat, symlink, truncate, unlink, utime, -X

=item *

%ENV

=item *

glob (aka the <*>)

=item *

open, opendir, sysopen

=item *

qx (aka the backtick operator), system

=item *

readdir, readlink

=back

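Because the interfaces listed above deal in byte strings, one common
approach is to encode names explicitly before passing them in and to
decode whatever comes back. The following is only a sketch: it assumes
that the file system stores names as UTF-8 octets (which Perl cannot
verify), and the file name and the C<$fs_encoding> variable are
illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use File::Temp qw(tempdir);

# Assumption: the file system stores names as UTF-8 octets.
my $fs_encoding = 'UTF-8';

my $dir  = tempdir(CLEANUP => 1);
my $name = "spade-\x{2660}.txt";        # a character string

# Encode the name to octets before handing it to open().
my $path = "$dir/" . encode($fs_encoding, $name);
open(my $fh, '>', $path) or die "Cannot create file: $!";
close($fh) or die "Cannot close file: $!";

# readdir() returns octets; decode them back into characters.
opendir(my $dh, $dir) or die "Cannot open directory: $!";
my @names = map  { decode($fs_encoding, $_) }
            grep { $_ ne '.' && $_ ne '..' } readdir($dh);
closedir($dh);
```

On systems whose file names are not UTF-8, or that normalize names
behind your back, the decoded name may differ from the one you wrote.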

=head2 The "Unicode Bug"

The term "Unicode bug" has been applied to an inconsistency
on ASCII platforms with the
Unicode code points in the Latin-1 Supplement block, that
is, between 128 and 255. Without a locale specified, unlike all other
characters or code points, these characters behave very differently under
byte semantics than under character semantics, unless
C<use feature 'unicode_strings'> is specified.
(The lesson here is to specify C<unicode_strings> to avoid the
headaches.)

In character semantics they are interpreted as Unicode code points, which means
they have the same semantics as Latin-1 (ISO-8859-1).

In byte semantics, they are considered to be unassigned characters, meaning
that the only semantics they have is their ordinal numbers, and that they are
not members of various character classes. None are considered to match C<\w>
for example, but all match C<\W>.

The behavior is known to have effects on these areas:

=over 4

=item *

Changing the case of a scalar, that is, using C<uc()>, C<ucfirst()>, C<lc()>,
and C<lcfirst()>, or C<\L>, C<\U>, C<\u> and C<\l> in regular expression
substitutions.

=item *

Using caseless (C</i>) regular expression matching.

=item *

Matching any of several properties in regular expressions, namely C<\b>,
C<\B>, C<\s>, C<\S>, C<\w>, C<\W>, and all the POSIX character classes
I<except> C<[[:ascii:]]>.

=item *

In C<quotemeta> or its inline equivalent C<\Q>, no code points above 127
are quoted in UTF-8 encoded strings, but in byte encoded strings, code
points between 128 and 255 are always quoted.

=back

This behavior can lead to unexpected results in which a string's semantics
suddenly change if a code point above 255 is appended to or removed from it,
which changes the string's semantics from byte to character or vice versa. As
an example, consider the following program and its output:

 $ perl -le'
     no feature "unicode_strings";
     $s1 = "\xC2";
     $s2 = "\x{2660}";
     for ($s1, $s2, $s1.$s2) {
         print /\w/ || 0;
     }
 '
 0
 0
 1

If there's no C<\w> in C<$s1> or in C<$s2>, why does their concatenation have one?

This anomaly stems from Perl's attempt to not disturb older programs that
didn't use Unicode, and hence had no semantics for characters outside of the
ASCII range (except in a locale), along with Perl's desire to add Unicode
support seamlessly. The result wasn't seamless: these characters were
orphaned.

Starting in Perl 5.14, C<use feature 'unicode_strings'> can be used to
cause Perl to use Unicode semantics on all string operations within the
scope of the feature subpragma. Regular expressions compiled in its
scope retain that behavior even when executed or compiled into larger
regular expressions outside the scope. (The pragma does not, however,
affect the C<quotemeta> behavior. Nor does it affect the deprecated
user-defined case changing operations--these still require a UTF-8
encoded string to operate.)

In Perl 5.12, the subpragma affected casing changes, but not regular
expressions. See L<perlfunc/lc> for details on how this pragma works in
combination with various others for casing.

For earlier Perls, or when a string is passed to a function outside the
subpragma's scope, a workaround is to always call C<utf8::upgrade($string)>,
or to use the standard module L<Encode>. Also, a scalar that has any characters
whose ordinal is 0x100 or above, or which were specified using either of the
C<\N{...}> notations, will automatically have character semantics.

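The C<utf8::upgrade()> workaround can be seen in a small experiment.
This is only a sketch for an ASCII platform with no locale in effect;
the variable names are illustrative. The same one-character string is
tested before and after the upgrade, with C<unicode_strings> explicitly
switched off.

```perl
use strict;
use warnings;
no feature 'unicode_strings';        # make the legacy rules explicit

my $s = "\xE9";                      # LATIN SMALL LETTER E WITH ACUTE

my $byte_w  = $s =~ /\w/ ? 1 : 0;    # tested under byte semantics
my $byte_uc = uc $s;

utf8::upgrade($s);                   # same character, UTF-8 internally

my $char_w  = $s =~ /\w/ ? 1 : 0;    # tested under character semantics
my $char_uc = uc $s;

printf "before upgrade: \\w=%d uc=U+%04X\n", $byte_w, ord $byte_uc;
printf "after upgrade:  \\w=%d uc=U+%04X\n", $char_w, ord $char_uc;
```

On such a platform the first line should report no C<\w> match and an
unchanged U+00E9, while the second should report a match and U+00C9,
even though the string compares equal before and after the upgrade.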

=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)

Sometimes (see L</"When Unicode Does Not Happen"> or L</The "Unicode Bug">)
there are situations where you simply need to force a byte
string into UTF-8, or vice versa. The low-level calls
C<utf8::upgrade($bytestring)> and C<utf8::downgrade($utf8string[, FAIL_OK])> are
the answers.

Note that C<utf8::downgrade()> can fail if the string contains characters
that don't fit into a byte.

Calling either function on a string that already is in the desired state is a
no-op.

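A short sketch of both calls, including the two ways a failed downgrade
can be reported, might look like this. The variable names are
illustrative.

```perl
use strict;
use warnings;

my $latin = "\xE9";            # fits in one byte
my $wide  = "\x{2660}";        # BLACK SPADE SUIT, does not fit in one byte

utf8::upgrade($latin);         # now stored as UTF-8 internally
my $flag_after_upgrade = utf8::is_utf8($latin) ? 1 : 0;
my $same_string        = $latin eq "\xE9" ? 1 : 0;   # the value is unchanged

utf8::downgrade($latin);       # back to one byte per character
my $flag_after_downgrade = utf8::is_utf8($latin) ? 1 : 0;

# With a true FAIL_OK argument, failure is reported by a false return
# value instead of an exception.
my $downgraded_wide = utf8::downgrade($wide, 1) ? 1 : 0;

# Without FAIL_OK, the same call dies.
my $died = eval { utf8::downgrade($wide); 1 } ? 0 : 1;
```

Neither call changes the characters in the string; only the internal
representation differs.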

=head2 Using Unicode in XS

If you want to handle Perl Unicode in XS extensions, you may find the
following C APIs useful. See also L<perlguts/"Unicode Support"> for an
explanation about Unicode at the XS level, and L<perlapi> for the API
details.

=over 4

=item *

C<DO_UTF8(sv)> returns true if the C<UTF8> flag is on and the C<bytes>
pragma is not in effect. C<SvUTF8(sv)> returns true if the C<UTF8>
flag is on; the C<bytes> pragma is ignored. The C<UTF8> flag being on
does B<not> mean that there are any characters with code points greater
than 255 (or 127) in the scalar or that there are even any characters
in the scalar. What the C<UTF8> flag means is that the sequence of
octets in the representation of the scalar is the sequence of UTF-8
encoded code points of the characters of a string. The C<UTF8> flag
being off means that each octet in this representation encodes a
single character with code point 0..255 within the string. Perl's
Unicode model is not to use UTF-8 until it is absolutely necessary.

=item *

C<uvchr_to_utf8(buf, chr)> writes a Unicode character code point into
a buffer encoding the code point as UTF-8, and returns a pointer
pointing after the UTF-8 bytes. It works appropriately on EBCDIC machines.

=item *

C<utf8_to_uvchr(buf, lenp)> reads UTF-8 encoded bytes from a buffer and
returns the Unicode character code point and, optionally, the length of
the UTF-8 byte sequence. It works appropriately on EBCDIC machines.

=item *

C<utf8_length(start, end)> returns the length of the UTF-8 encoded buffer
in characters. C<sv_len_utf8(sv)> returns the length of the UTF-8 encoded
scalar.

=item *

C<sv_utf8_upgrade(sv)> converts the string of the scalar to its UTF-8
encoded form. C<sv_utf8_downgrade(sv)> does the opposite, if
possible. C<sv_utf8_encode(sv)> is like C<sv_utf8_upgrade()> except that
it does not set the C<UTF8> flag. C<sv_utf8_decode()> does the
opposite of C<sv_utf8_encode()>. Note that none of these are to be
used as general-purpose encoding or decoding interfaces: C<use Encode>
for that. C<sv_utf8_upgrade()> is affected by the encoding pragma
but C<sv_utf8_downgrade()> is not (since the encoding pragma is
designed to be a one-way street).

=item *

C<is_utf8_char(s)> returns true if the pointer points to a valid UTF-8
character.

=item *

C<is_utf8_string(buf, len)> returns true if C<len> bytes of the buffer
are valid UTF-8.

=item *

C<UTF8SKIP(buf)> will return the number of bytes in the UTF-8 encoded
character in the buffer. C<UNISKIP(chr)> will return the number of bytes
required to UTF-8-encode the Unicode character code point. C<UTF8SKIP()>
is useful, for example, for iterating over the characters of a UTF-8
encoded buffer; C<UNISKIP()> is useful, for example, in computing
the size required for a UTF-8 encoded buffer.

=item *

C<utf8_distance(a, b)> will tell the distance in characters between the
two pointers pointing to the same UTF-8 encoded buffer.

=item *

C<utf8_hop(s, off)> will return a pointer to a UTF-8 encoded buffer
that is C<off> (positive or negative) Unicode characters displaced
from the UTF-8 buffer C<s>. Be careful not to overstep the buffer:
C<utf8_hop()> will merrily run off the end or the beginning of the
buffer if told to do so.

=item *

C<pv_uni_display(dsv, spv, len, pvlim, flags)> and
C<sv_uni_display(dsv, ssv, pvlim, flags)> are useful for debugging the
output of Unicode strings and scalars. By default they are useful
only for debugging--they display B<all> characters as hexadecimal code
points--but with the flags C<UNI_DISPLAY_ISPRINT>,
C<UNI_DISPLAY_BACKSLASH>, and C<UNI_DISPLAY_QQ> you can make the
output more readable.

=item *

C<foldEQ_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to
compare two strings case-insensitively in Unicode. For case-sensitive
comparisons you can just use C<memEQ()> and C<memNE()> as usual, except
if one string is in utf8 and the other isn't.

=back

For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h>
in the Perl source code distribution.

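Several of the behaviors described above can be observed from Perl
itself, which may help when checking what an XS routine did to a scalar.
The sketch below uses the C<utf8::> functions from L<utf8>; treating
them as Perl-level counterparts of C<SvUTF8()>, C<sv_utf8_encode()> and
C<sv_utf8_decode()> is an assumption made for illustration, so consult
L<perlapi> before relying on an exact correspondence.

```perl
use strict;
use warnings;

my $str = "\x{2660}";                 # one character, U+2660

# utf8::is_utf8() reports the UTF8 flag of the scalar.
my $flag_before = utf8::is_utf8($str) ? 1 : 0;
my $len_before  = length $str;        # length in characters

# utf8::encode() replaces the characters with their UTF-8 octets and
# leaves the UTF8 flag off.
utf8::encode($str);
my $flag_encoded = utf8::is_utf8($str) ? 1 : 0;
my $len_encoded  = length $str;       # now a count of octets

# utf8::decode() reverses that; it returns false on malformed input.
my $decoded_ok  = utf8::decode($str) ? 1 : 0;
my $len_decoded = length $str;
```

The octet count after encoding is three because U+2660 needs three bytes
in UTF-8.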

=head2 Hacking Perl to work on earlier Unicode versions (for very serious hackers only)

Perl by default comes with the latest supported Unicode version built in, but
you can change to use any earlier one.

Download the files in the desired version of Unicode from the Unicode web
site L<http://www.unicode.org>. These should replace the existing files in
F<lib/unicore> in the Perl source tree. Follow the instructions in
F<README.perl> in that directory to change some of their names, and then build
perl (see L<INSTALL>).

It is even possible to copy the built files to a different directory, and then
change F<utf8_heavy.pl> in the directory C<$Config{privlib}> to point to the
new directory, or maybe make a copy of that directory before making the change,
and using C<@INC> or the C<-I> run-time flag to switch between versions at will
(but because of caching, not in the middle of a process), but all this is
beyond the scope of these instructions.

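After such a rebuild it is worth confirming which Unicode version the
resulting perl actually uses. One way, sketched here, is to ask
L<Unicode::UCD>; the variable name is illustrative.

```perl
use strict;
use warnings;
use Unicode::UCD ();

# UnicodeVersion() returns a string of the form "major.minor.update".
my $version = Unicode::UCD::UnicodeVersion();
print "This perl uses Unicode $version\n";
```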

=head1 BUGS

=head2 Interaction with Locales

See L<perllocale/Unicode and UTF-8>.

=head2 Problems with characters in the Latin-1 Supplement range

See L</The "Unicode Bug">.

=head2 Interaction with Extensions

When Perl exchanges data with an extension, the extension should be
able to understand the UTF8 flag and act accordingly. If the
extension doesn't recognize that flag, it's likely that the extension
will return incorrectly-flagged data.

So if you're working with Unicode data, consult the documentation of
every module you're using to see whether there are any issues with Unicode data
exchange. If the documentation does not talk about Unicode at all,
suspect the worst and probably look at the source to learn how the
module is implemented. Modules written completely in Perl shouldn't
cause problems. Modules that directly or indirectly access code written
in other programming languages are at risk.

For affected functions, the simple strategy to avoid data corruption is
to always make the encoding of the exchanged data explicit. Choose an
encoding that you know the extension can handle. Convert arguments passed
to the extensions to that encoding and convert results back from that
encoding. Write wrapper functions that do the conversions for you, so
you can later change the functions when the extension catches up.

To provide an example, let's say the popular Foo::Bar::escape_html
function doesn't deal with Unicode data yet. The wrapper function
would convert the argument to raw UTF-8 and convert the result back to
Perl's internal representation like so:

    use Encode ();

    sub my_escape_html ($) {
        my($what) = shift;
        return unless defined $what;
        # Hand the extension UTF-8 octets, then decode its result.
        Encode::decode_utf8(Foo::Bar::escape_html(
                                     Encode::encode_utf8($what)));
    }

Sometimes, when the extension does not convert data but just stores
and retrieves them, you will be able to use the otherwise
dangerous C<Encode::_utf8_on()> function. Let's say the popular
C<Foo::Bar> extension, written in C, provides a C<param> method that
lets you store and retrieve data according to these prototypes:

    $self->param($name, $value);            # set a scalar
    $value = $self->param($name);           # retrieve a scalar

If it does not yet provide support for any encoding, one could write a
derived class with such a C<param> method:

    sub param {
      my($self,$name,$value) = @_;
      utf8::upgrade($name);     # make sure it is UTF-8 encoded
      if (defined $value) {
        utf8::upgrade($value);  # make sure it is UTF-8 encoded
        return $self->SUPER::param($name,$value);
      } else {
        my $ret = $self->SUPER::param($name);
        Encode::_utf8_on($ret); # we know, it is UTF-8 encoded
        return $ret;
      }
    }

Some extensions provide filters on data entry/exit points, such as
DB_File::filter_store_key and family. Look out for such filters in
the documentation of your extensions; they can make the transition to
Unicode data much easier.

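To try the wrapper strategy without a real extension at hand, a
byte-oriented stand-in can play the extension's part. In the sketch
below C<legacy_escape_html> is a hypothetical replacement for
C<Foo::Bar::escape_html>, written in Perl only so that the example is
self-contained; a real extension would of course be called instead.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for an extension that only understands octets.
sub legacy_escape_html {
    my ($octets) = @_;
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;');
    $octets =~ s/([&<>])/$entity{$1}/g;
    return $octets;
}

# The wrapper makes the encoding explicit in both directions.
sub my_escape_html {
    my ($what) = @_;
    return unless defined $what;
    my $octets  = Encode::encode_utf8($what);     # characters to octets
    my $escaped = legacy_escape_html($octets);    # extension sees octets only
    return Encode::decode_utf8($escaped);         # octets back to characters
}

my $out = my_escape_html("<\x{2660}>");
```

Callers keep working with character strings throughout, so the wrapper
can be dropped later without touching them.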

=head2 Speed

Some functions are slower when working on UTF-8 encoded strings than
on byte encoded strings. All functions that need to hop over
characters such as length(), substr() or index(), or matching regular
expressions can work B<much> faster when the underlying data are
byte-encoded.

In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
a caching scheme was introduced which will hopefully make the slowness
somewhat less spectacular, at least for some operations. In general,
operations with UTF-8 encoded strings are still slower. As an example,
the Unicode properties (character classes) like C<\p{Nd}> are known to
be quite a bit slower (5-20 times) than their simpler counterparts
like C<\d> (then again, there are hundreds of Unicode characters matching C<Nd>
compared with the 10 ASCII characters matching C<d>).

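Whether the difference matters for a given program is best measured
rather than assumed. The sketch below times an ASCII-only digit match
against C<\p{Nd}> on the same UTF-8 encoded string using the standard
L<Benchmark> module. The test string, the iteration count and the use of
the C</a> modifier (available from Perl 5.14) are choices made for this
illustration, and the timings will vary by perl version and machine.

```perl
use strict;
use warnings;
use Benchmark qw(timeit timestr);

# A UTF-8 encoded string: the snowman forces the internal upgrade.
my $text = ("abc 0123456789 " x 200) . "\x{2603}";

my ($ascii_digits, $unicode_digits) = (0, 0);

# Count the digits 200 times with each pattern and time the runs.
my $t_d  = timeit(200, sub { $ascii_digits   = () = $text =~ /\d/ag    });
my $t_nd = timeit(200, sub { $unicode_digits = () = $text =~ /\p{Nd}/g });

print '\d with /a: ', timestr($t_d),  "\n";
print '\p{Nd}:     ', timestr($t_nd), "\n";
```

Both patterns find the same digits in this string, so any difference in
the reported times comes from the matching itself.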

=head2 Problems on EBCDIC platforms

There are several known problems with Perl on EBCDIC platforms. If you
want to use Perl there, send email to perlbug@perl.org.

In earlier versions, when byte and character data were concatenated,
the new string was sometimes created by
decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
old Unicode string used EBCDIC.

If you find any of these, please report them as bugs.

=head2 Porting code from perl-5.6.X

Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer
was required to use the C<utf8> pragma to declare that a given scope
expected to deal with Unicode data and had to make sure that only
Unicode data were reaching that scope. If you have code that is
working with 5.6, you will need some of the following adjustments to
your code. The examples are written such that the code will continue
to work under 5.6, so you should be safe to try them out.

=over 3

=item *

A filehandle that should read or write UTF-8

  if ($] > 5.007) {
    binmode $fh, ":encoding(utf8)";
  }

=item *

A scalar that is going to be passed to some extension

Be it Compress::Zlib, Apache::Request or any extension that has no
mention of Unicode in the manpage, you need to make sure that the
UTF8 flag is stripped off. Note that at the time of this writing
(October 2002) the mentioned modules are not UTF-8-aware. Please
check the documentation to verify if this is still true.

  if ($] > 5.007) {
    require Encode;
    $val = Encode::encode_utf8($val); # make octets
  }

=item *

A scalar we got back from an extension

If you believe the scalar comes back as UTF-8, you will most likely
want the UTF8 flag restored:

  if ($] > 5.007) {
    require Encode;
    $val = Encode::decode_utf8($val);
  }

=item *

Same thing, if you are really sure it is UTF-8

  if ($] > 5.007) {
    require Encode;
    Encode::_utf8_on($val);
  }

=item *

A wrapper for fetchrow_array and fetchrow_hashref

When the database contains only UTF-8, a wrapper function or method is
a convenient way to replace all your fetchrow_array and
fetchrow_hashref calls. A wrapper function will also make it easier to
adapt to future enhancements in your database driver. Note that at the
time of this writing (October 2002), the DBI has no standardized way
to deal with UTF-8 data. Please check the documentation to verify if
that is still true.

  sub fetchrow {
    # $what is one of fetchrow_{array,hashref}
    my($self, $sth, $what) = @_;
    if ($] < 5.007) {
      return $sth->$what;
    } else {
      require Encode;
      if (wantarray) {
        my @arr = $sth->$what;
        for (@arr) {
          defined && /[^\000-\177]/ && Encode::_utf8_on($_);
        }
        return @arr;
      } else {
        my $ret = $sth->$what;
        if (ref $ret) {
          for my $k (keys %$ret) {
            defined
            && /[^\000-\177]/
            && Encode::_utf8_on($_) for $ret->{$k};
          }
          return $ret;
        } else {
          defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
          return $ret;
        }
      }
    }
  }

=item *

A large scalar that you know can only contain ASCII

Scalars that contain only ASCII and are marked as UTF-8 are sometimes
a drag to your program. If you recognize such a situation, just remove
the UTF8 flag:

  utf8::downgrade($val) if $] > 5.007;

=back

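The effect of an encoding layer on a filehandle, as in the first item
above, can be checked without touching the disk by using an in-memory
file. This is a sketch only: it needs Perl 5.8 or later, and it uses the
strict C<:encoding(UTF-8)> layer, which is a choice made for this
example rather than something the items above prescribe.

```perl
use strict;
use warnings;

my $text = "spade: \x{2660}\n";

# Write characters; the layer turns them into UTF-8 octets.
open(my $out, '>:encoding(UTF-8)', \my $buffer)
    or die "Cannot open in-memory file for writing: $!";
print {$out} $text;
close($out) or die "Cannot close: $!";

# Read the octets back; the layer decodes them into characters again.
open(my $in, '<:encoding(UTF-8)', \$buffer)
    or die "Cannot open in-memory file for reading: $!";
my $line = <$in>;
close($in);
```

The buffer holds eleven octets for the nine characters written, because
the spade takes three bytes in UTF-8.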

=head1 SEE ALSO

L<perlunitut>, L<perluniintro>, L<perluniprops>, L<Encode>, L<open>, L<utf8>, L<bytes>,
L<perlretut>, L<perlvar/"${^UNICODE}">,
L<http://www.unicode.org/reports/tr44>.

=cut