This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5
perlunicode: Fix example.
[perl5.git] / pod / perlunicode.pod
=head1 NAME

perlunicode - Unicode support in Perl

=head1 DESCRIPTION

=head2 Important Caveats

Unicode support is an extensive requirement. While Perl does not
implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.

People who want to learn to use Unicode in Perl should probably read
the L<Perl Unicode tutorial, perlunitut|perlunitut> and
L<perluniintro> before reading this reference document.

Also, the use of Unicode may present security issues that aren't obvious.
Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.

=over 4

=item Safest if you "use feature 'unicode_strings'"

In order to preserve backward compatibility, Perl does not turn
on full internal Unicode support unless the pragma
C<use feature 'unicode_strings'> is specified. (This is automatically
selected if you use C<use 5.012> or higher.) Failure to do this can
trigger surprises. See L</The "Unicode Bug"> below.

This pragma doesn't affect I/O, and there are still several places
where Unicode isn't fully supported, such as in filenames.

=item Input and Output Layers

Perl knows when a filehandle uses Perl's internal Unicode encodings
(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
the ":encoding(utf8)" layer. Other encodings can be converted to Perl's
encoding on input or from Perl's encoding on output by use of the
":encoding(...)" layer. See L<open>.

To indicate that Perl source itself is in UTF-8, use C<use utf8;>.

=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts

As a compatibility measure, the C<use utf8> pragma must be explicitly
included to enable recognition of UTF-8 in the Perl scripts themselves
(in string or regular expression literals, or in identifier names) on
ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
machines. B<These are the only times when an explicit C<use utf8>
is needed.> See L<utf8>.

=item BOM-marked scripts and UTF-16 scripts autodetected

If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF-16BE,
or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either
endianness, Perl will correctly read in the script as Unicode.
(BOMless UTF-8 cannot be effectively recognized or differentiated from
ISO 8859-1 or other eight-bit encodings.)

=item C<use encoding> needed to upgrade non-Latin-1 byte strings

By default, there is a fundamental asymmetry in Perl's Unicode model:
implicit upgrading from byte strings to Unicode strings assumes that
they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
downgraded with UTF-8 encoding. This happens because the first 256
code points in Unicode happen to agree with Latin-1.

See L</"Byte and Character Semantics"> for more details.

=back
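
As a rough illustration of the input and output layers described above,
the following sketch writes one character through an encoding layer and
reads it back. It assumes a Perl built with PerlIO, uses in-memory
filehandles so that no real file is touched, and picks the
C<:encoding(UTF-8)> spelling of the layer; the variable names are only
illustrative.

```perl
use strict;
use warnings;

# Encode on output: the layer turns one character into UTF-8 bytes.
my $bytes = '';
open(my $out, '>:encoding(UTF-8)', \$bytes) or die "open for write: $!";
print {$out} "\x{263A}";            # WHITE SMILING FACE, one character
close($out) or die "close: $!";

# Decode on input: the same layer turns those bytes back into a character.
open(my $in, '<:encoding(UTF-8)', \$bytes) or die "open for read: $!";
my $text = <$in>;
close($in);

printf "bytes written: %d, characters read: %d\n",
    length($bytes), length($text);
```

The byte count and the character count differ because the layer, not the
program, is responsible for the encoding.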

=head2 Byte and Character Semantics

Beginning with version 5.6, Perl uses logically-wide characters to
represent strings internally.

Starting in Perl 5.14, Perl-level operations work with
characters rather than bytes within the scope of a
C<L<use feature 'unicode_strings'|feature>> (or equivalently
C<use 5.012> or higher). (This is not true if bytes have been
explicitly requested by C<L<use bytes|bytes>>, nor necessarily true
for interactions with the platform's operating system.)

For earlier Perls, and when C<unicode_strings> is not in effect, Perl
provides a fairly safe environment that can handle both types of
semantics in programs. For operations where Perl can unambiguously
decide that the input data are characters, Perl switches to character
semantics. For operations where this determination cannot be made
without additional information from the user, Perl decides in favor of
compatibility and chooses to use byte semantics.
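
The difference between the two semantics can be seen with a character in
the range 128 - 255. The sketch below is only an illustration: it assumes
an ASCII platform, Perl 5.12 or later, and no C<use locale> in scope, and
the variable names are made up for the example.

```perl
use strict;
use warnings;

my $e_acute = "\xE9";    # code point 233, held as a single native byte

my ($uc_native, $uc_unicode);
{
    no feature 'unicode_strings';
    $uc_native = uc $e_acute;     # byte semantics: 0xE9 has no known case
}
{
    use feature 'unicode_strings';
    $uc_unicode = uc $e_acute;    # Unicode semantics: maps to U+00C9
}

printf "without unicode_strings: U+%04X\n", ord $uc_native;
printf "with unicode_strings:    U+%04X\n", ord $uc_unicode;
```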

When C<use locale> is in effect (which overrides
C<use feature 'unicode_strings'> in the same scope), Perl uses the
semantics associated
with the current locale. Otherwise, Perl uses the platform's native
byte semantics for characters whose code points are less than 256, and
Unicode semantics for those greater than 255. On EBCDIC platforms, this
is almost seamless, as the EBCDIC code pages that Perl handles are
equivalent to Unicode's first 256 code points. (The exception is that
EBCDIC regular expression case-insensitive matching rules are not
as robust as Unicode's.) But on ASCII platforms, Perl uses US-ASCII
(or Basic Latin in Unicode terminology) byte semantics, meaning that characters
whose ordinal numbers are in the range 128 - 255 are undefined except for their
ordinal numbers. This means that none have case (upper and lower), nor are any
a member of character classes, like C<[:alpha:]> or C<\w>. (But all do belong
to the C<\W> class or the Perl regular expression extension C<[:^alpha:]>.)

This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if
none of the program's inputs were marked as being a source of Unicode
character data. Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as %ENV),
or from literals and constants in the source text.

The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
Note that this pragma is only required while Perl defaults to byte
semantics; when character semantics become the default, this pragma
may become a no-op. See L<utf8>.

If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will have
character semantics. This can cause surprises: See L</BUGS>, below.
You can choose to be warned when this happens. See L<encoding::warnings>.

Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
logically just a number ranging from 0 to 2**31 or so. Larger
characters may encode into longer sequences of bytes internally, but
this internal detail is mostly hidden for Perl code.
See L<perluniintro> for more.

=head2 Effects of Character Semantics

Character semantics have the following effects:

=over 4

=item *

Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.

If you use a Unicode editor to edit your program, Unicode characters may
occur directly within the literal strings in UTF-8 encoding, or UTF-16.
(The former requires a BOM or C<use utf8>, the latter requires a BOM.)

Unicode characters can also be added to a string by using the C<\N{U+...}>
notation. The Unicode code point for the desired character, in hexadecimal,
should be placed in the braces, after the C<U>. For instance, a smiley face is
C<\N{U+263A}>.

Alternatively, you can use the C<\x{...}> notation for characters 0x100 and
above. For characters below 0x100 you may get byte semantics instead of
character semantics; see L</The "Unicode Bug">. On EBCDIC machines there is
the additional problem that the value for such characters gives the EBCDIC
character rather than the Unicode one.

Additionally, if you

 use charnames ':full';

you can use the C<\N{...}> notation and put the official Unicode
character name within the braces, such as C<\N{WHITE SMILING FACE}>.
See L<charnames>.

=item *

If an appropriate L<encoding> is specified, identifiers within the
Perl script may contain Unicode alphanumeric characters, including
ideographs. Perl does not currently attempt to canonicalize variable
names.

=item *

Regular expressions match characters instead of bytes. "." matches
a character instead of a byte.

=item *

Bracketed character classes in regular expressions match characters instead of
bytes and match against the character properties specified in the
Unicode properties database. C<\w> can be used to match a Japanese
ideograph, for instance.

=item *

Named Unicode properties, scripts, and block ranges may be used (like bracketed
character classes) by using the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".
See L</"Unicode Character Properties"> for more details.

You can define your own character properties and use them
in the regular expression with the C<\p{}> or C<\P{}> construct.
See L</"User-Defined Character Properties"> for more details.

=item *

The special pattern C<\X> matches a logical character, an "extended grapheme
cluster" in Standardese. In Unicode what appears to the user to be a single
character, for example an accented C<G>, may in fact be composed of a sequence
of characters, in this case a C<G> followed by an accent character. C<\X>
will match the entire sequence.

=item *

The C<tr///> operator translates characters instead of bytes. Note
that the C<tr///CU> functionality has been removed. For similar
functionality see C<pack('U0', ...)> and C<pack('C0', ...)>.

=item *

Case translation operators use the Unicode case translation tables
when character input is provided. Note that C<uc()>, or C<\U> in
interpolated strings, translates to uppercase, while C<ucfirst>,
or C<\u> in interpolated strings, translates to titlecase in languages
that make the distinction (which is equivalent to uppercase in languages
without the distinction).

=item *

Most operators that deal with positions or lengths in a string will
automatically switch to using character positions, including
C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
C<sprintf()>, C<write()>, and C<length()>. An operator that
specifically does not switch is C<vec()>. Operators that really don't
care include operators that treat strings as a bucket of bits such as
C<sort()>, and operators dealing with filenames.

=item *

The C<pack()>/C<unpack()> letter C<C> does I<not> change, since it is often
used for byte-oriented formats. Again, think C<char> in the C language.

There is a new C<U> specifier that converts between Unicode characters
and code points. There is also a C<W> specifier that is the equivalent of
C<chr>/C<ord> and properly handles character values even if they are above 255.

=item *

The C<chr()> and C<ord()> functions work on characters, similar to
C<pack("W")> and C<unpack("W")>, I<not> C<pack("C")> and
C<unpack("C")>. C<pack("C")> and C<unpack("C")> are methods for
emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.
While these methods reveal the internal encoding of Unicode strings,
that is not something one normally needs to care about at all.

=item *

The bit string operators, C<& | ^ ~>, can operate on character data.
However, for backward compatibility, such as when using bit string
operations when characters are all less than 256 in ordinal value, one
should not use C<~> (the bit complement) with characters of both
values less than 256 and values greater than 255. Most importantly,
DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
will not hold. The reason for this mathematical I<faux pas> is that
the complement cannot return B<both> the 8-bit (byte-wide) bit
complement B<and> the full character-wide bit complement.

=item *

There is a CPAN module, L<Unicode::Casing>, which allows you to define
your own mappings to be used in C<lc()>, C<lcfirst()>, C<uc()>, and
C<ucfirst()> (or their double-quoted string inlined versions such as
C<\U>). (Prior to Perl 5.16, this functionality was partially provided
in the Perl core, but suffered from a number of insurmountable
drawbacks, so the CPAN module was written instead.)

=item *

And finally, C<scalar reverse()> reverses by character rather than by byte.

=back
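
A few of the effects listed above can be checked directly. This is a
minimal sketch rather than an exhaustive test; the variable names are
illustrative, and only code points are printed so that no output layer is
needed.

```perl
use strict;
use warnings;
use charnames ':full';

my $smiley  = "\N{U+263A}";               # written by code point
my $named   = "\N{WHITE SMILING FACE}";   # written by official name
my $g_acute = "G\x{301}";                 # G + COMBINING ACUTE ACCENT

my @clusters = $g_acute =~ /\X/g;         # extended grapheme clusters
my $reversed = scalar reverse "ab$smiley";

printf "same character:   %s\n", ($smiley eq $named ? "yes" : "no");
printf "length(smiley):   %d\n", length $smiley;
printf "ord(smiley):      U+%04X\n", ord $smiley;
printf "length(G+accent): %d code points, %d cluster(s)\n",
    length $g_acute, scalar @clusters;
printf "reversed string starts with U+%04X\n", ord $reversed;
```

C<length()> counts code points, so the accented C<G> has length 2 even
though C<\X> treats it as one logical character.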

=head2 Unicode Character Properties

(The only time that Perl considers a sequence of individual code
points as a single logical character is in the C<\X> construct, already
mentioned above. Therefore "character" in this discussion means a single
Unicode code point.)

Very nearly all Unicode character properties are accessible through
regular expressions by using the C<\p{}> "matches property" construct
and the C<\P{}> "doesn't match property" for its negation.

For instance, C<\p{Uppercase}> matches any single character with the Unicode
"Uppercase" property, while C<\p{L}> matches any character with a
General_Category of "L" (letter) property. Braces are not
required for single letter property names, so C<\p{L}> is equivalent to C<\pL>.

More formally, C<\p{Uppercase}> matches any single character whose Unicode
Uppercase property value is True, and C<\P{Uppercase}> matches any character
whose Uppercase property value is False, and they could have been written as
C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively.
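
The equivalences just described can be exercised with a short script.
This is only a sketch; the table of checks and its labels are made up for
the illustration.

```perl
use strict;
use warnings;

# Each row: test character, pattern, label.
my @checks = (
    [ 'A', qr/\p{Uppercase}/,       'single form'          ],
    [ 'A', qr/\p{Uppercase=True}/,  'compound form, True'  ],
    [ 'a', qr/\P{Uppercase}/,       'negated single form'  ],
    [ 'a', qr/\p{Uppercase=False}/, 'compound form, False' ],
    [ 'A', qr/\p{L}/,               'General_Category L'   ],
    [ 'A', qr/\pL/,                 'same, without braces' ],
);

my $all_matched = 1;
for my $check (@checks) {
    my ($char, $regex, $label) = @$check;
    my $matched = $char =~ $regex ? 1 : 0;
    $all_matched = 0 unless $matched;
    printf "%-22s %s\n", $label, $matched ? 'matches' : 'no match';
}
```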

This formality is needed when properties are not binary; that is, if they can
take on more values than just True and False. For example, the Bidi_Class
property (see L</"Bidirectional Character Types"> below) can take on several
different values, such as Left, Right, Whitespace, and others. To match these,
one needs to specify both the property name (Bidi_Class), AND the value being
matched against
(Left, Right, etc.). This is done, as in the examples above, by having the
two components separated by an equal sign (or interchangeably, a colon), like
C<\p{Bidi_Class: Left}>.

All Unicode-defined character properties may be written in these compound forms
of C<\p{property=value}> or C<\p{property:value}>, but Perl provides some
additional properties that are written only in the single form, as well as
single-form short-cuts for all binary properties and certain others described
below, in which you may omit the property name and the equals or colon
separator.

Most Unicode character properties have at least two synonyms (or aliases if you
prefer): a short one that is easier to type and a longer one that is more
descriptive and hence easier to understand. Thus the "L" and "Letter" properties
above are equivalent and can be used interchangeably. Likewise,
"Upper" is a synonym for "Uppercase", and we could have written
C<\p{Uppercase}> equivalently as C<\p{Upper}>. Also, there are typically
various synonyms for the values the property can be. For binary properties,
"True" has 3 synonyms: "T", "Yes", and "Y"; and "False" has correspondingly "F",
"No", and "N". But be careful. A short form of a value for one property may
not mean the same thing as the same short form for another. Thus, for the
General_Category property, "L" means "Letter", but for the Bidi_Class property,
"L" means "Left". A complete list of properties and synonyms is in
L<perluniprops>.

Upper/lower case differences in property names and values are irrelevant;
thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>.
Similarly, you can add or subtract underscores anywhere in the middle of a
word, so that these are also equivalent to C<\p{U_p_p_e_r}>. And white space
is irrelevant adjacent to non-word characters, such as the braces and the equals
or colon separators, so C<\p{ Upper }> and C<\p{ Upper_case : Y }> are
equivalent to these as well. In fact, white space and even
hyphens can usually be added or deleted anywhere. So even C<\p{ Up-per case = Yes}> is
equivalent. All this is called "loose-matching" by Unicode. The few places
where stricter matching is used are in the middle of numbers, and in the Perl
extension properties that begin or end with an underscore. Stricter matching
cares about white space (except adjacent to non-word characters),
hyphens, and non-interior underscores.
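
To see loose matching in action, the spellings quoted above can be tried
one at a time. In this sketch the patterns are built at run time and each
match is wrapped in C<eval>, on the assumption that some Perl versions
might reject one of the more exotic spellings; the C<%result> hash is just
a convenient place to record what happened.

```perl
use strict;
use warnings;

my @spellings = (
    'Upper', 'upper', 'UpPeR', 'U_p_p_e_r',
    ' Upper ', ' Upper_case : Y ', ' Up-per case = Yes',
);

my %result;
for my $spelling (@spellings) {
    my $pattern = '\p{' . $spelling . '}';
    # eval guards against a spelling that this Perl refuses to compile.
    my $matched = eval { 'A' =~ /$pattern/ ? 1 : 0 };
    $result{$spelling} = defined $matched ? $matched : 'error';
    printf "%-24s %s\n", "\\p{$spelling}", $result{$spelling};
}
```

A value of 1 means that spelling matched the uppercase letter C<A>; 'error'
means the pattern did not compile on the Perl in use.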

You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
(^) between the first brace and the property name: C<\p{^Tamil}> is
equal to C<\P{Tamil}>.

Almost all properties are immune to case-insensitive matching. That is,
adding a C</i> regular expression modifier does not change what they
match. There are two sets that are affected.
The first set is
C<Uppercase_Letter>,
C<Lowercase_Letter>,
and C<Titlecase_Letter>,
all of which match C<Cased_Letter> under C</i> matching.
And the second set is
C<Uppercase>,
C<Lowercase>,
and C<Titlecase>,
all of which match C<Cased> under C</i> matching.
This set also includes its subsets C<PosixUpper> and C<PosixLower> both
of which under C</i> matching match C<PosixAlpha>.
(The difference between these sets is that some things, such as Roman
numerals, come in both upper and lower case so they are C<Cased>, but aren't considered
letters, so they aren't C<Cased_Letter>s.)
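
The effect of C</i> on these two sets can be sketched as follows. The test
characters are chosen for the illustration: a lowercase C<a>, and SMALL
ROMAN NUMERAL ONE (U+2170), which is cased but is not a letter.

```perl
use strict;
use warnings;

my $lower_a   = 'a';
my $small_one = "\x{2170}";   # SMALL ROMAN NUMERAL ONE

# Each row: description, actual result, expected result.
my @rows = (
    [ 'a      =~ /\p{Uppercase_Letter}/',  ($lower_a   =~ /\p{Uppercase_Letter}/  ? 1 : 0), 0 ],
    [ 'a      =~ /\p{Uppercase_Letter}/i', ($lower_a   =~ /\p{Uppercase_Letter}/i ? 1 : 0), 1 ],
    [ 'a      =~ /\p{Uppercase}/i',        ($lower_a   =~ /\p{Uppercase}/i        ? 1 : 0), 1 ],
    [ 'U+2170 =~ /\p{Uppercase}/i',        ($small_one =~ /\p{Uppercase}/i        ? 1 : 0), 1 ],
    [ 'U+2170 =~ /\p{Uppercase_Letter}/i', ($small_one =~ /\p{Uppercase_Letter}/i ? 1 : 0), 0 ],
);

my $as_expected = 1;
for my $row (@rows) {
    my ($label, $got, $expected) = @$row;
    $as_expected = 0 if $got != $expected;
    printf "%-36s %s\n", $label, $got ? 'matches' : 'no match';
}
```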

The result is undefined if you try to match a non-Unicode code point
(that is, one above 0x10FFFF) against a Unicode property. Currently, a
warning is raised, and the match will fail. In some cases, this is
counterintuitive, as both these fail:

 chr(0x110000) =~ /\p{ASCII_Hex_Digit=True}/      # Fails.
 chr(0x110000) =~ /\p{ASCII_Hex_Digit=False}/     # Fails!

=head3 B<General_Category>

Every Unicode character is assigned a general category, which is the "most
usual categorization of a character" (from
L<http://www.unicode.org/reports/tr44>).

The compound way of writing these is like C<\p{General_Category=Number}>
(short, C<\p{gc:n}>). But Perl furnishes shortcuts in which everything up
through the equal or colon separator is omitted. So you can instead just write
C<\pN>.

Here are the short and long forms of the General Category properties:

    Short       Long

    L           Letter
    LC, L&      Cased_Letter (that is: [\p{Ll}\p{Lu}\p{Lt}])
    Lu          Uppercase_Letter
    Ll          Lowercase_Letter
    Lt          Titlecase_Letter
    Lm          Modifier_Letter
    Lo          Other_Letter

    M           Mark
    Mn          Nonspacing_Mark
    Mc          Spacing_Mark
    Me          Enclosing_Mark

    N           Number
    Nd          Decimal_Number (also Digit)
    Nl          Letter_Number
    No          Other_Number

    P           Punctuation (also Punct)
    Pc          Connector_Punctuation
    Pd          Dash_Punctuation
    Ps          Open_Punctuation
    Pe          Close_Punctuation
    Pi          Initial_Punctuation
                (may behave like Ps or Pe depending on usage)
    Pf          Final_Punctuation
                (may behave like Ps or Pe depending on usage)
    Po          Other_Punctuation

    S           Symbol
    Sm          Math_Symbol
    Sc          Currency_Symbol
    Sk          Modifier_Symbol
    So          Other_Symbol

    Z           Separator
    Zs          Space_Separator
    Zl          Line_Separator
    Zp          Paragraph_Separator

    C           Other
    Cc          Control (also Cntrl)
    Cf          Format
    Cs          Surrogate
    Co          Private_Use
    Cn          Unassigned

Single-letter properties match all characters in any of the
two-letter sub-properties starting with the same letter.
C<LC> and C<L&> are special: both are aliases for the set consisting of
everything matched by C<Ll>, C<Lu>, and C<Lt>.
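
The relationship between the single-letter and two-letter categories can
be sketched with a handful of characters. The code points below were
picked for the illustration: ROMAN NUMERAL EIGHT (U+2167) is a
Letter_Number, and U+5B57 is a CJK ideograph in Other_Letter.

```perl
use strict;
use warnings;

# Each row: code point, pattern, whether a match is expected.
my @rows = (
    [ 0x0041, qr/\p{Lu}/,                      1 ],   # LATIN CAPITAL LETTER A
    [ 0x0041, qr/\pL/,                         1 ],
    [ 0x0041, qr/\p{L&}/,                      1 ],
    [ 0x0035, qr/\p{General_Category=Number}/, 1 ],   # DIGIT FIVE
    [ 0x0035, qr/\p{gc:n}/,                    1 ],
    [ 0x0035, qr/\p{Nd}/,                      1 ],
    [ 0x2167, qr/\pN/,                         1 ],   # ROMAN NUMERAL EIGHT
    [ 0x2167, qr/\p{Nd}/,                      0 ],
    [ 0x5B57, qr/\pL/,                         1 ],   # CJK ideograph
    [ 0x5B57, qr/\p{LC}/,                      0 ],
);

my $as_expected = 1;
for my $row (@rows) {
    my ($code_point, $regex, $expected) = @$row;
    my $matched = chr($code_point) =~ $regex ? 1 : 0;
    $as_expected = 0 if $matched != $expected;
    printf "U+%04X =~ %-40s %s\n", $code_point, $regex,
        $matched ? 'matches' : 'no match';
}
```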

=head3 B<Bidirectional Character Types>

Because scripts differ in their directionality (Hebrew and Arabic are
written right to left, for example) Unicode supplies these values for
the Bidi_Class property:

    Value       Meaning

    L           Left-to-Right
    LRE         Left-to-Right Embedding
    LRO         Left-to-Right Override
    R           Right-to-Left
    AL          Arabic Letter
    RLE         Right-to-Left Embedding
    RLO         Right-to-Left Override
    PDF         Pop Directional Format
    EN          European Number
    ES          European Separator
    ET          European Terminator
    AN          Arabic Number
    CS          Common Separator
    NSM         Non-Spacing Mark
    BN          Boundary Neutral
    B           Paragraph Separator
    S           Segment Separator
    WS          Whitespace
    ON          Other Neutrals

This property is always written in the compound form.
For example, C<\p{Bidi_Class:R}> matches characters that are normally
written right to left.
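
A small sketch of matching on Bidi_Class follows. The sample characters
are chosen for the illustration, and the table layout is not part of any
API.

```perl
use strict;
use warnings;

# Each row: code point, pattern, whether a match is expected.
my @rows = (
    [ 0x05D0, qr/\p{Bidi_Class:R}/,  1 ],   # HEBREW LETTER ALEF
    [ 0x05D0, qr/\p{Bidi_Class=L}/,  0 ],
    [ 0x0041, qr/\p{Bidi_Class=L}/,  1 ],   # LATIN CAPITAL LETTER A
    [ 0x0627, qr/\p{Bidi_Class:AL}/, 1 ],   # ARABIC LETTER ALEF
    [ 0x0627, qr/\p{Bidi_Class:R}/,  0 ],
    [ 0x0031, qr/\p{Bidi_Class:EN}/, 1 ],   # DIGIT ONE
    [ 0x0020, qr/\p{Bidi_Class:WS}/, 1 ],   # SPACE
);

my $as_expected = 1;
for my $row (@rows) {
    my ($code_point, $regex, $expected) = @$row;
    my $matched = chr($code_point) =~ $regex ? 1 : 0;
    $as_expected = 0 if $matched != $expected;
    printf "U+%04X =~ %-28s %s\n", $code_point, $regex,
        $matched ? 'matches' : 'no match';
}
```

Note that Arabic letters have the value AL rather than R, so a test for
"written right to left" usually needs to accept both values.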

=head3 B<Scripts>

The world's languages are written in many different scripts. This sentence
(unless you're reading it in translation) is written in Latin, while Russian is
written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in
Hiragana or Katakana. There are many more.

The Unicode Script and Script_Extensions properties give what script a
given character is in. Either property can be specified with the
compound form like
C<\p{Script=Hebrew}> (short: C<\p{sc=hebr}>), or
C<\p{Script_Extensions=Javanese}> (short: C<\p{scx=java}>).
In addition, Perl furnishes shortcuts for all
C<Script> property names. You can omit everything up through the equals
(or colon), and simply write C<\p{Latin}> or C<\P{Cyrillic}>.
(This is not true for C<Script_Extensions>, which is required to be
written in the compound form.)

The difference between these two properties involves characters that are
used in multiple scripts. For example the digits '0' through '9' are
used in many parts of the world. These are placed in a script named
C<Common>. Other characters are used in just a few scripts. For
example, the "KATAKANA-HIRAGANA DOUBLE HYPHEN" is used in both Japanese
scripts, Katakana and Hiragana, but nowhere else. The C<Script>
property places all characters that are used in multiple scripts in the
C<Common> script, while the C<Script_Extensions> property places those
that are used in only a few scripts into each of those scripts; while
still using C<Common> for those used in many scripts. Thus both these
match:

 "0" =~ /\p{sc=Common}/     # Matches
 "0" =~ /\p{scx=Common}/    # Matches

and only the first of these matches:

 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Common}/  # Matches
 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Common}/ # No match

And only the last two of these match:

 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Hiragana}/  # No match
 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Katakana}/  # No match
 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Hiragana}/ # Matches
 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Katakana}/ # Matches

C<Script_Extensions> is thus an improved C<Script>, in which there are
fewer characters in the C<Common> script, and correspondingly more in
other scripts. It is new in Unicode version 6.0, and its data are likely
to change significantly in later releases, as things get sorted out.

(Actually, besides C<Common>, the C<Inherited> script contains
characters that are used in multiple scripts. These are modifier
characters which modify other characters, and inherit the script value
of the controlling character. Some of these are used in many scripts,
and so go into C<Inherited> in both C<Script> and C<Script_Extensions>.
Others are used in just a few scripts, so are in C<Inherited> in
C<Script>, but not in C<Script_Extensions>.)

It is worth stressing that there are several different sets of digits in
Unicode that are equivalent to 0-9 and are matchable by C<\d> in a
regular expression. If they are used in a single language only, they
are in that language's C<Script> and C<Script_Extensions>. If they are
used in more than one script, they will be in C<sc=Common>, but only
if they are used in many scripts should they be in C<scx=Common>.

A complete list of scripts and their shortcuts is in L<perluniprops>.

=head3 B<Use of "Is" Prefix>

For backward compatibility (with Perl 5.6), all properties mentioned
so far may have C<Is> or C<Is_> prepended to their name, so C<\P{Is_Lu}>, for
example, is equal to C<\P{Lu}>, and C<\p{IsScript:Arabic}> is equal to
C<\p{Arabic}>.

=head3 B<Blocks>

In addition to B<scripts>, Unicode also defines B<blocks> of
characters. The difference between scripts and blocks is that the
concept of scripts is closer to natural languages, while the concept
of blocks is more of an artificial grouping based on groups of Unicode
characters with consecutive ordinal values. For example, the "Basic Latin"
block is all characters whose ordinals are between 0 and 127, inclusive; in
other words, the ASCII characters. The "Latin" script contains some letters
from this as well as several other blocks, like "Latin-1 Supplement",
"Latin Extended-A", etc., but it does not contain all the characters from
those blocks. It does not, for example, contain the digits 0-9, because
those digits are shared across many scripts, and hence are in the
C<Common> script.

For more about scripts versus blocks, see UAX#24 "Unicode Script Property":
L<http://www.unicode.org/reports/tr24>

The C<Script> or C<Script_Extensions> properties are likely to be the
ones you want to use when processing
natural language; the Block property may occasionally be useful in working
with the nuts and bolts of Unicode.

Block names are matched in the compound form, like C<\p{Block: Arrows}> or
C<\p{Blk=Hebrew}>. Unlike most other properties, only a few block names have a
Unicode-defined short name. But Perl does provide a (slight) shortcut: You
can say, for example C<\p{In_Arrows}> or C<\p{In_Hebrew}>. For backwards
compatibility, the C<In> prefix may be omitted if there is no naming conflict
with a script or any other property, and you can even use an C<Is> prefix
instead in those cases. But it is not a good idea to do this, for a couple
of reasons:

=over 4

=item 1

It is confusing. There are many naming conflicts, and you may forget some.
For example, C<\p{Hebrew}> means the I<script> Hebrew, and NOT the I<block>
Hebrew. But would you remember that 6 months from now?

=item 2

It is unstable. A new version of Unicode may pre-empt the current meaning by
creating a property with the same name. There was a time in very early Unicode
releases when C<\p{Hebrew}> would have matched the I<block> Hebrew; now it
doesn't.

=back

Some people prefer to always use C<\p{Block: foo}> and C<\p{Script: bar}>
instead of the shortcuts, whether for clarity, because they can't remember the
difference between 'In' and 'Is' anyway, or they aren't confident that those who
eventually will read their code will know that difference.
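
The distinction between a block and a script can be sketched with the
examples from the text: a Latin letter, an ASCII digit, and a Hebrew
letter. The table layout and the choice of HEBREW LETTER ALEF are
assumptions made for the illustration.

```perl
use strict;
use warnings;

# Each row: code point, pattern, whether a match is expected.
my @rows = (
    [ 0x0041, qr/\p{Latin}/,              1 ],   # 'A' is in the Latin script
    [ 0x0041, qr/\p{Block: Basic_Latin}/, 1 ],   # and in the Basic Latin block
    [ 0x0037, qr/\p{Latin}/,              0 ],   # '7' is not in the Latin script
    [ 0x0037, qr/\p{Block: Basic_Latin}/, 1 ],   # but is in the Basic Latin block
    [ 0x0037, qr/\p{Common}/,             1 ],   # its script is Common
    [ 0x05D0, qr/\p{Script: Hebrew}/,     1 ],   # HEBREW LETTER ALEF
    [ 0x05D0, qr/\p{Block: Hebrew}/,      1 ],
    [ 0x05D0, qr/\p{In_Hebrew}/,          1 ],   # Perl's block shortcut
);

my $as_expected = 1;
for my $row (@rows) {
    my ($code_point, $regex, $expected) = @$row;
    my $matched = chr($code_point) =~ $regex ? 1 : 0;
    $as_expected = 0 if $matched != $expected;
    printf "U+%04X =~ %-34s %s\n", $code_point, $regex,
        $matched ? 'matches' : 'no match';
}
```

Spelling out C<Block:> and C<Script:> as done here avoids having to
remember which meaning an unprefixed name has.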

A complete list of blocks and their shortcuts is in L<perluniprops>.

=head3 B<Other Properties>

There are many more properties than the very basic ones described here.
A complete list is in L<perluniprops>.

Unicode defines all its properties in the compound form, so all single-form
properties are Perl extensions. Most of these are just synonyms for the
Unicode ones, but some are genuine extensions, including several that are in
the compound form. And quite a few of these are actually recommended by Unicode
(in L<http://www.unicode.org/reports/tr18>).

This section gives some details on all extensions that aren't just
synonyms for compound-form Unicode properties
(for those properties, you'll have to refer to the
L<Unicode Standard|http://www.unicode.org/reports/tr44>).

=over

=item B<C<\p{All}>>

This matches any of the 1_114_112 Unicode code points. It is a synonym for
C<\p{Any}>.

=item B<C<\p{Alnum}>>

This matches any C<\p{Alphabetic}> or C<\p{Decimal_Number}> character.

=item B<C<\p{Any}>>

This matches any of the 1_114_112 Unicode code points. It is a synonym for
C<\p{All}>.

=item B<C<\p{ASCII}>>

This matches any of the 128 characters in the US-ASCII character set,
which is a subset of Unicode.

=item B<C<\p{Assigned}>>

This matches any assigned code point; that is, any code point whose general
category is not Unassigned (or equivalently, not Cn).

=item B<C<\p{Blank}>>

This is the same as C<\h> and C<\p{HorizSpace}>: A character that changes the
spacing horizontally.
651
652=item B<C<\p{Decomposition_Type: Non_Canonical}>> (Short: C<\p{Dt=NonCanon}>)
653
654Matches a character that has a non-canonical decomposition.
655
656To understand the use of this rarely used property=value combination, it is
657necessary to know some basics about decomposition.
658Consider a character, say H. It could appear with various marks around it,
659such as an acute accent, or a circumflex, or various hooks, circles, arrows,
b19eb496 660I<etc.>, above, below, to one side or the other, etc. There are many
9f815e24
KW
661possibilities among the world's languages. The number of combinations is
662astronomical, and if there were a character for each combination, it would
663soon exhaust Unicode's more than a million possible characters. So Unicode
664took a different approach: there is a character for the base H, and a
b19eb496 665character for each of the possible marks, and these can be variously combined
9f815e24
KW
666to get a final logical character. So a logical character--what appears to be a
667single character--can be a sequence of more than one individual character.
b19eb496
TC
668This is called an "extended grapheme cluster"; Perl furnishes the C<\X>
669regular expression construct to match such sequences.
9f815e24
KW
670
671But Unicode's intent is to unify the existing character set standards and
b19eb496 672practices, and several pre-existing standards have single characters that
9f815e24
KW
673mean the same thing as some of these combinations. An example is ISO-8859-1,
674which has quite a few of these in the Latin-1 range, an example being "LATIN
675CAPITAL LETTER E WITH ACUTE". Because this character was in this pre-existing
676standard, Unicode added it to its repertoire. But this character is considered
b19eb496
TC
677by Unicode to be equivalent to the sequence consisting of the character
678"LATIN CAPITAL LETTER E" followed by the character "COMBINING ACUTE ACCENT".
9f815e24
KW
679
680"LATIN CAPITAL LETTER E WITH ACUTE" is called a "pre-composed" character, and
b19eb496 681its equivalence with the sequence is called canonical equivalence. All
9f815e24 682pre-composed characters are said to have a decomposition (into the equivalent
b19eb496 683sequence), and the decomposition type is also called canonical.
9f815e24
KW
684
685However, many more characters have a different type of decomposition, a
686"compatible" or "non-canonical" decomposition. The sequences that form these
687decompositions are not considered canonically equivalent to the pre-composed
688character. An example, again in the Latin-1 range, is the "SUPERSCRIPT ONE".
b19eb496 689It is somewhat like a regular digit 1, but not exactly; its decomposition
9f815e24
KW
690into the digit 1 is called a "compatible" decomposition, specifically a
691"super" decomposition. There are several such compatibility
692decompositions (see L<http://www.unicode.org/reports/tr44>), including one
b19eb496 693called "compat", which means some miscellaneous type of decomposition
42581d5d 694that doesn't fit into the decomposition categories that Unicode has chosen.
9f815e24
KW
695
696Note that most Unicode characters don't have a decomposition, so their
697decomposition type is "None".
698
b19eb496
TC
699For your convenience, Perl has added the C<Non_Canonical> decomposition
700type to mean any of the several compatibility decompositions.
9f815e24
KW
701
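As a rough illustration of the decomposition types described above, the
following sketch tests three characters against C<\p{Dt=...}> and shows
C<\X> treating a base letter plus a combining mark as one cluster. The
variable names are invented for this example, and it assumes your Perl
accepts the value names C<Canonical> and C<None> alongside the
Perl-specific C<NonCanon>.

```perl
use strict;
use warnings;

my $e_acute  = "\x{C9}";      # LATIN CAPITAL LETTER E WITH ACUTE (pre-composed)
my $super_1  = "\x{B9}";      # SUPERSCRIPT ONE
my $sequence = "E\x{301}";    # E followed by COMBINING ACUTE ACCENT

# Each test asks which decomposition type the single character has.
my $canon     = $e_acute =~ /\A\p{Dt=Canonical}\z/ ? 1 : 0;
my $non_canon = $super_1 =~ /\A\p{Dt=NonCanon}\z/  ? 1 : 0;
my $none      = "E"      =~ /\A\p{Dt=None}\z/      ? 1 : 0;

# \X consumes the whole two-code-point sequence as one logical character.
my @clusters = $sequence =~ /(\X)/g;

print "canonical=$canon non_canonical=$non_canon none=$none\n";
printf "%d code points, %d cluster(s)\n", length($sequence), scalar @clusters;
```

The pre-composed letter reports a canonical decomposition, the superscript
digit a non-canonical one, and the plain letter none at all.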
702=item B<C<\p{Graph}>>
703
704Matches any character that is graphic. Theoretically, this means a character
705that on a printer would cause ink to be used.
706
707=item B<C<\p{HorizSpace}>>
708
b19eb496 709This is the same as C<\h> and C<\p{Blank}>: a character that changes the
9f815e24
KW
710spacing horizontally.
711
42581d5d 712=item B<C<\p{In=*}>>
9f815e24
KW
713
714This is a synonym for C<\p{Present_In=*}>.
715
716=item B<C<\p{PerlSpace}>>
717
718This is the same as C<\s>, restricted to ASCII, namely C<S<[ \f\n\r\t]>>.
719
720Mnemonic: Perl's (original) space.
721
722=item B<C<\p{PerlWord}>>
723
724This is the same as C<\w>, restricted to ASCII, namely C<[A-Za-z0-9_]>.
725
726Mnemonic: Perl's (original) word.
727
42581d5d 728=item B<C<\p{Posix...}>>
9f815e24 729
b19eb496
TC
730There are several of these, which are equivalents using the C<\p>
731notation for Posix classes and are described in
42581d5d 732L<perlrecharclass/POSIX Character Classes>.
9f815e24
KW
733
734=item B<C<\p{Present_In: *}>> (Short: C<\p{In=*}>)
735
736This property is used when you need to know in what Unicode version(s) a
737character is.
738
739The "*" above stands for some two digit Unicode version number, such as
740C<1.1> or C<4.0>; or the "*" can also be C<Unassigned>. This property will
741match the code points whose final disposition has been settled as of the
742Unicode release given by the version number; C<\p{Present_In: Unassigned}>
743will match those code points whose meaning has yet to be assigned.
744
745For example, C<U+0041> "LATIN CAPITAL LETTER A" was present in the very first
746Unicode release available, which is C<1.1>, so this property is true for all
747valid "*" versions. On the other hand, C<U+1EFF> was not assigned until version
7485.1 when it became "LATIN SMALL LETTER Y WITH LOOP", so the only "*" values
749that would match it are 5.1, 5.2, and later.
750
751Unicode furnishes the C<Age> property from which this is derived. The problem
752with Age is that a strict interpretation of it (which Perl takes) has it
753matching the precise release a code point's meaning is introduced in. Thus
754C<U+0041> would match only 1.1; and C<U+1EFF> only 5.1. This is not usually what
755you want.
756
757Some non-Perl implementations of the Age property may change its meaning to be
758the same as the Perl Present_In property; just be aware of that.
759
760Another confusion with both these properties is that the definition is not
b19eb496
TC
761that the code point has been I<assigned>, but that the meaning of the code point
762has been I<determined>. This is because 66 code points will always be
763unassigned, and so the Age for them is the Unicode version in which the decision
764to make them so was made. For example, C<U+FDD0> is to be permanently
9f815e24 765unassigned to a character, and the decision to do that was made in version 3.1,
b19eb496 766so C<\p{Age=3.1}> matches this character, as do C<\p{Present_In: 3.1}> and up.
9f815e24
KW
767
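The difference between C<Present_In> and C<Age> may be easier to see in a
small test. This is only a sketch: the hash keys are made up for the
example, and it assumes a Perl whose Unicode tables are at version 5.1 or
later.

```perl
use strict;
use warnings;

my $cap_a  = "\x{41}";      # U+0041, present since Unicode 1.1
my $y_loop = "\x{1EFF}";    # U+1EFF, first assigned in Unicode 5.1

my %result = (
    # Present_In is cumulative: true for the named release and all later ones.
    a_in_51  => ( $cap_a  =~ /\p{Present_In: 5.1}/ ? 1 : 0 ),
    y_in_40  => ( $y_loop =~ /\p{Present_In: 4.0}/ ? 1 : 0 ),
    y_in_51  => ( $y_loop =~ /\p{Present_In: 5.1}/ ? 1 : 0 ),

    # Age is exact: true only for the release that introduced the code point.
    a_age_11 => ( $cap_a  =~ /\p{Age=1.1}/ ? 1 : 0 ),
    a_age_51 => ( $cap_a  =~ /\p{Age=5.1}/ ? 1 : 0 ),
    y_age_51 => ( $y_loop =~ /\p{Age=5.1}/ ? 1 : 0 ),
);

print "$_ => $result{$_}\n" for sort keys %result;
```

Only C<a_age_51> and C<y_in_40> should come out false, which is why
C<Present_In> is usually the more convenient of the two.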
768=item B<C<\p{Print}>>
769
ae5b72c8 770This matches any character that is graphical or blank, except controls.
9f815e24
KW
771
772=item B<C<\p{SpacePerl}>>
773
774This is the same as C<\s>, including beyond ASCII.
775
4d4acfba 776Mnemonic: Space, as modified by Perl. (It doesn't include the vertical tab
b19eb496 777which both the Posix standard and Unicode consider white space.)
9f815e24 778
4364919a
KW
779=item B<C<\p{Title}>> and B<C<\p{Titlecase}>>
780
781Under case-sensitive matching, these both match the same code points as
782C<\p{General Category=Titlecase_Letter}> (C<\p{gc=lt}>). The difference
783is that under C</i> caseless matching, these match the same as
784C<\p{Cased}>, whereas C<\p{gc=lt}> matches C<\p{Cased_Letter}>).
785
9f815e24
KW
786=item B<C<\p{VertSpace}>>
787
788This is the same as C<\v>: A character that changes the spacing vertically.
789
790=item B<C<\p{Word}>>
791
b19eb496 792This is the same as C<\w>, including over 100_000 characters beyond ASCII.
9f815e24 793
42581d5d
KW
794=item B<C<\p{XPosix...}>>
795
b19eb496 796There are several of these, which are the standard Posix classes
42581d5d
KW
797extended to the full Unicode range. They are described in
798L<perlrecharclass/POSIX Character Classes>.
799
9f815e24
KW
800=back
801
376d9008 802=head2 User-Defined Character Properties
491fd90a 803
51f494cc
KW
804You can define your own binary character properties by defining subroutines
805whose names begin with "In" or "Is". The subroutines can be defined in any
806package. The user-defined properties can be used in the regular expression
807C<\p> and C<\P> constructs; if you are using a user-defined property from a
808package other than the one you are in, you must specify its package in the
809C<\p> or C<\P> construct.
bac0b425 810
51f494cc 811 # assuming property IsForeign defined in Lang::
bac0b425
JP
812 package main; # property package name required
813 if ($txt =~ /\p{Lang::IsForeign}+/) { ... }
814
815 package Lang; # property package name not required
816 if ($txt =~ /\p{IsForeign}+/) { ... }
817
818
819Note that the effect is compile-time and immutable once defined.
b19eb496
TC
820However, the subroutines are passed a single parameter, which is 0 if
821case-sensitive matching is in effect and non-zero if caseless matching
56ca34ca
KW
822is in effect. The subroutine may return different values depending on
823the value of the flag, and one set of values will immutably be in effect
b19eb496 824for all case-sensitive matches, and the other set for all case-insensitive
56ca34ca 825matches.
491fd90a 826
b19eb496 827Note that if the regular expression is tainted, then Perl will die rather
0e9be77f
DM
828than calling the subroutine, where the name of the subroutine is
829determined by the tainted data.
830
376d9008
JB
831The subroutines must return a specially-formatted string, with one
832or more newline-separated lines. Each line must be one of the following:
491fd90a
JH
833
834=over 4
835
836=item *
837
510254c9
A
838A single hexadecimal number denoting a Unicode code point to include.
839
840=item *
841
99a6b1f0 842Two hexadecimal numbers separated by horizontal whitespace (space or
376d9008 843tab characters) denoting a range of Unicode code points to include.
491fd90a
JH
844
845=item *
846
376d9008 847Something to include, prefixed by "+": a built-in character
bac0b425
JP
848property (prefixed by "utf8::") or a user-defined character property,
849to represent all the characters in that property; two hexadecimal code
850points for a range; or a single hexadecimal code point.
491fd90a
JH
851
852=item *
853
376d9008 854Something to exclude, prefixed by "-": an existing character
bac0b425
JP
855property (prefixed by "utf8::") or a user-defined character property,
856to represent all the characters in that property; two hexadecimal code
857points for a range; or a single hexadecimal code point.
491fd90a
JH
858
859=item *
860
376d9008 861Something to negate, prefixed by "!": an existing character
bac0b425
JP
862property (prefixed by "utf8::") or a user-defined character property,
863to represent all the characters in that property; two hexadecimal code
864points for a range; or a single hexadecimal code point.
865
866=item *
867
868Something to intersect with, prefixed by "&": an existing character
869property (prefixed by "utf8::") or a user-defined character property,
870to keep only those characters that are also in the property; two
871hexadecimal code points for a range; or a single hexadecimal code point.
491fd90a
JH
872
873=back
874
875For example, to define a property that covers both the Japanese
876syllabaries (hiragana and katakana), you can define
877
878 sub InKana {
d88362ca 879 return <<END;
d5822f25
A
880 3040\t309F
881 30A0\t30FF
491fd90a
JH
882 END
883 }
884
d5822f25
A
885Imagine that the here-doc end marker is at the beginning of the line.
886Now you can use C<\p{InKana}> and C<\P{InKana}>.
491fd90a
JH
887
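A minimal sketch of the range-based C<InKana> property in use follows. The
sample string and the variable names are invented for illustration; the
property body is written as an ordinary double-quoted string so that the
tabs and newlines are explicit.

```perl
use strict;
use warnings;

# Same ranges as above: the Hiragana and Katakana blocks.
sub InKana {
    return "3040\t309F\n30A0\t30FF\n";
}

# "abc", HIRAGANA LETTER A, KATAKANA LETTER KA, "xyz"
my $text = "abc\x{3042}\x{30AB}xyz";

# Count the characters inside and outside the property.
my $kana_count  = () = $text =~ /\p{InKana}/g;
my $other_count = () = $text =~ /\P{InKana}/g;

print "kana: $kana_count, other: $other_count\n";
```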
888You could also have used the existing block property names:
889
890 sub InKana {
d88362ca 891 return <<'END';
491fd90a
JH
892 +utf8::InHiragana
893 +utf8::InKatakana
894 END
895 }
896
897Suppose you wanted to match only the allocated characters,
d5822f25 898not the raw block ranges: in other words, you want to remove
491fd90a
JH
899the unassigned code points:
900
901 sub InKana {
d88362ca 902 return <<'END';
491fd90a
JH
903 +utf8::InHiragana
904 +utf8::InKatakana
905 -utf8::IsCn
906 END
907 }
908
909The negation is useful for defining (surprise!) negated classes.
910
911 sub InNotKana {
d88362ca 912 return <<'END';
491fd90a
JH
913 !utf8::InHiragana
914 -utf8::InKatakana
915 +utf8::IsCn
916 END
917 }
918
bac0b425
JP
919Intersection is useful for getting the common characters matched by
920two (or more) classes.
921
922 sub InFooAndBar {
923 return <<'END';
fe82fa26
KW
924 +main::InFoo
925 &main::InBar
bac0b425
JP
926 END
927 }
928
ac036724 929It's important to remember not to use "&" for the first set; that
b19eb496 930would be intersecting with nothing, resulting in an empty set.
bac0b425 931
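Here is one way the intersection might be tried out. The two input
properties are hypothetical stand-ins defined with plain ranges:
C<InFoo> is taken to be the ASCII capital letters and C<InBar> the ASCII
hexadecimal letters.

```perl
use strict;
use warnings;

sub InFoo { return "0041\t005A\n" }                 # A-Z
sub InBar { return "0041\t0046\n0061\t0066\n" }     # A-F and a-f

# Start from InFoo with "+", then keep only what is also in InBar.
sub InFooAndBar {
    return "+main::InFoo\n&main::InBar\n";
}

# Only characters in both sets (A-F) should be captured.
my @hits = "AbFzG" =~ /(\p{InFooAndBar})/g;

print "matched: @hits\n";
```

In this sketch "b" is rejected because it is only in C<InBar>, and "G"
because it is only in C<InFoo>.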
68585b5e 932=head2 User-Defined Case Mappings (for serious hackers only)
822502e5 933
5d1892be
KW
934B<This feature has been removed as of Perl 5.16.>
935The CPAN module L<Unicode::Casing> provides better functionality without
936the drawbacks that this feature had. If you are using a Perl earlier
937than 5.16, this feature was most fully documented in the 5.14 version of
938this pod:
939L<http://perldoc.perl.org/5.14.0/perlunicode.html#User-Defined-Case-Mappings-%28for-serious-hackers-only%29>
3a2263fe 940
376d9008 941=head2 Character Encodings for Input and Output
8cbd9a7a 942
7221edc9 943See L<Encode>.
8cbd9a7a 944
c29a771d 945=head2 Unicode Regular Expression Support Level
776f8809 946
b19eb496
TC
947The following list of Unicode supported features for regular expressions describes
948all features currently directly supported by core Perl. The references to "Level N"
8158862b 949and the section numbers refer to the Unicode Technical Standard #18,
b19eb496 950"Unicode Regular Expressions", version 13, from August 2008.
776f8809
JH
951
952=over 4
953
954=item *
955
956Level 1 - Basic Unicode Support
957
755789c0
KW
958 RL1.1 Hex Notation - done [1]
959 RL1.2 Properties - done [2][3]
960 RL1.2a Compatibility Properties - done [4]
961 RL1.3 Subtraction and Intersection - MISSING [5]
962 RL1.4 Simple Word Boundaries - done [6]
963 RL1.5 Simple Loose Matches - done [7]
964 RL1.6 Line Boundaries - MISSING [8][9]
965 RL1.7 Supplementary Code Points - done [10]
966
967 [1] \x{...}
968 [2] \p{...} \P{...}
969 [3] supports not only minimal list, but all Unicode character
d9742aa3 970 properties (see Unicode Character Properties above)
755789c0
KW
971 [4] \d \D \s \S \w \W \X [:prop:] [:^prop:]
972 [5] can use regular expression look-ahead [a] or
973 user-defined character properties [b] to emulate set
974 operations
975 [6] \b \B
976 [7] note that Perl does Full case-folding in matching (but with
977 bugs), not Simple: for example U+1F88 is equivalent to
e4d56f70
NC
978 U+1F00 U+03B9, instead of just U+1F80. This difference
979 matters mainly for certain Greek capital letters with certain
755789c0
KW
980 modifiers: the Full case-folding decomposes the letter,
981 while the Simple case-folding would map it to a single
982 character.
983 [8] should do ^ and $ also on U+000B (\v in C), FF (\f), CR
984 (\r), CRLF (\r\n), NEL (U+0085), LS (U+2028), and PS
985 (U+2029); should also affect <>, $., and script line
986 numbers; should not split lines within CRLF [c] (i.e. there
987 is no empty line between \r and \n)
988 [9] Linebreaking conformant with UAX#14 "Unicode Line Breaking
989 Algorithm" is available through the Unicode::LineBreak
990 module.
991 [10] UTF-8/UTF-EBCDIC used in Perl allows not only U+10000 to
992 U+10FFFF but also beyond U+10FFFF
7207e29d 993
237bad5b 994[a] You can mimic class subtraction using lookahead.
8158862b 995For example, what UTS#18 might write as
29bdacb8 996
dbe420b4
JH
997 [{Greek}-[{UNASSIGNED}]]
998
999in Perl can be written as:
1000
1d81abf3
JH
1001 (?!\p{Unassigned})\p{InGreekAndCoptic}
1002 (?=\p{Assigned})\p{InGreekAndCoptic}
dbe420b4
JH
1003
1004But in this particular example, you probably really want
1005
1bfb14c4 1006 \p{Greek}
dbe420b4
JH
1007
1008which will match assigned characters known to be part of the Greek script.
29bdacb8 1009
d9742aa3 1010Also see the L<Unicode::Regex::Set> module; it implements the full
8158862b
ST
1011UTS#18 grouping, intersection, union, and removal (subtraction) syntax.
1012
1013[b] '+' for union, '-' for removal (set-difference), '&' for intersection
1014(see L</"User-Defined Character Properties">)
1015
1016[c] Try the C<:crlf> layer (see L<PerlIO>).
5ca1ac52 1017
776f8809
JH
1018=item *
1019
1020Level 2 - Extended Unicode Support
1021
755789c0
KW
1022 RL2.1 Canonical Equivalents - MISSING [10][11]
1023 RL2.2 Default Grapheme Clusters - MISSING [12]
1024 RL2.3 Default Word Boundaries - MISSING [14]
1025 RL2.4 Default Loose Matches - MISSING [15]
1026 RL2.5 Name Properties - DONE
1027 RL2.6 Wildcard Properties - MISSING
8158862b 1028
755789c0
KW
1029 [10] see UAX#15 "Unicode Normalization Forms"
1030 [11] have Unicode::Normalize but not integrated to regexes
1031 [12] have \X but we don't have a "Grapheme Cluster Mode"
1032 [14] see UAX#29, Word Boundaries
1033 [15] see UAX#21 "Case Mappings"
776f8809
JH
1034
1035=item *
1036
8158862b
ST
1037Level 3 - Tailored Support
1038
755789c0
KW
1039 RL3.1 Tailored Punctuation - MISSING
1040 RL3.2 Tailored Grapheme Clusters - MISSING [17][18]
1041 RL3.3 Tailored Word Boundaries - MISSING
1042 RL3.4 Tailored Loose Matches - MISSING
1043 RL3.5 Tailored Ranges - MISSING
1044 RL3.6 Context Matching - MISSING [19]
1045 RL3.7 Incremental Matches - MISSING
8158862b 1046 ( RL3.8 Unicode Set Sharing )
755789c0
KW
1047 RL3.9 Possible Match Sets - MISSING
1048 RL3.10 Folded Matching - MISSING [20]
1049 RL3.11 Submatchers - MISSING
1050
1051 [17] see UTS#10 "Unicode Collation Algorithm"
1052 [18] have Unicode::Collate but not integrated to regexes
1053 [19] have (?<=x) and (?=x), but look-aheads or look-behinds
1054 should see outside of the target substring
1055 [20] need insensitive matching for linguistic features other
1056 than case; for example, hiragana to katakana, wide and
1057 narrow, simplified Han to traditional Han (see UTR#30
1058 "Character Foldings")
776f8809
JH
1059
1060=back
1061
c349b1b9
JH
1062=head2 Unicode Encodings
1063
376d9008
JB
1064Unicode characters are assigned to I<code points>, which are abstract
1065numbers. To use these numbers, various encodings are needed.
c349b1b9
JH
1066
1067=over 4
1068
c29a771d 1069=item *
5cb3728c
RB
1070
1071UTF-8
c349b1b9 1072
6d4f9cf2
KW
1073UTF-8 is a variable-length (1 to 4 bytes), byte-order independent
1074encoding. For ASCII (and we really do mean 7-bit ASCII, not another
10758-bit encoding), UTF-8 is transparent.
c349b1b9 1076
8c007b5a 1077The following table is from Unicode 3.2.
05632f9a 1078
755789c0 1079 Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte
05632f9a 1080
d88362ca 1081 U+0000..U+007F 00..7F
e1b711da 1082 U+0080..U+07FF * C2..DF 80..BF
d88362ca 1083 U+0800..U+0FFF E0 * A0..BF 80..BF
ec90690f
ST
1084 U+1000..U+CFFF E1..EC 80..BF 80..BF
1085 U+D000..U+D7FF ED 80..9F 80..BF
755789c0 1086 U+D800..U+DFFF +++++ utf16 surrogates, not legal utf8 +++++
ec90690f 1087 U+E000..U+FFFF EE..EF 80..BF 80..BF
d88362ca
KW
1088 U+10000..U+3FFFF F0 * 90..BF 80..BF 80..BF
1089 U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF
1090 U+100000..U+10FFFF F4 80..8F 80..BF 80..BF
e1b711da 1091
b19eb496 1092Note the gaps marked by "*" before several of the byte entries above. These are
e1b711da
KW
1093caused by legal UTF-8 avoiding non-shortest encodings: it is technically
1094possible to UTF-8-encode a single code point in different ways, but that is
1095explicitly forbidden, and the shortest possible encoding should always be used
1096(and that is what Perl does).
37361303 1097
376d9008 1098Another way to look at it is via bits:
05632f9a 1099
755789c0 1100 Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte
05632f9a 1101
755789c0
KW
1102 0aaaaaaa 0aaaaaaa
1103 00000bbbbbaaaaaa 110bbbbb 10aaaaaa
1104 ccccbbbbbbaaaaaa 1110cccc 10bbbbbb 10aaaaaa
1105 00000dddccccccbbbbbbaaaaaa 11110ddd 10cccccc 10bbbbbb 10aaaaaa
05632f9a 1106
9f815e24 1107As you can see, the continuation bytes all begin with "10", and the
e1b711da 1108leading bits of the start byte tell how many bytes there are in the
05632f9a
JH
1109encoded character.
1110
6d4f9cf2
KW
1111The original UTF-8 specification allowed up to 6 bytes, to allow
1112encoding of numbers up to 0x7FFF_FFFF. Perl continues to allow those,
1113and has extended that up to 13 bytes to encode code points up to what
1114can fit in a 64-bit word. However, Perl will warn if you output any of
b19eb496 1115these as being non-portable; and under strict UTF-8 input protocols,
6d4f9cf2
KW
1116they are forbidden.
1117
1118The Unicode non-character code points are also disallowed in UTF-8 in
1119"open interchange". See L</Non-character code points>.
1120
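The bit layout shown in the table above can be checked by hand for a single
code point. The sketch below assumes an ASCII (not EBCDIC) platform and
picks U+20AC as an arbitrary example from the three-byte range; it then
compares the result with what C<utf8::encode> produces.

```perl
use strict;
use warnings;

my $cp = 0x20AC;    # EURO SIGN, in the range U+0800..U+FFFF

# Assemble the bytes following the 1110cccc 10bbbbbb 10aaaaaa layout.
my @by_hand = (
    0xE0 | ( $cp >> 12 ),             # leading byte: top 4 bits
    0x80 | ( ( $cp >> 6 ) & 0x3F ),   # continuation: middle 6 bits
    0x80 | ( $cp & 0x3F ),            # continuation: low 6 bits
);

# utf8::encode() replaces the string, in place, with its UTF-8 bytes.
my $bytes = chr($cp);
utf8::encode($bytes);
my @by_perl = map { ord } split //, $bytes;

my $hand_hex = join ' ', map { sprintf '%02X', $_ } @by_hand;
my $perl_hex = join ' ', map { sprintf '%02X', $_ } @by_perl;

print "by hand: $hand_hex\n";
print "by perl: $perl_hex\n";
```

Both lines should show the same three bytes, with the leading byte
beginning "1110" and each continuation byte beginning "10".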
c29a771d 1121=item *
5cb3728c
RB
1122
1123UTF-EBCDIC
dbe420b4 1124
376d9008 1125Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.
dbe420b4 1126
c29a771d 1127=item *
5cb3728c 1128
1e54db1a 1129UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)
c349b1b9 1130
1bfb14c4
JH
1131The following items are mostly for reference and general Unicode
1132knowledge; Perl doesn't use these constructs internally.
dbe420b4 1133
b19eb496
TC
1134Like UTF-8, UTF-16 is a variable-width encoding, but where
1135UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units.
1136All code points occupy either 2 or 4 bytes in UTF-16: code points
1137C<U+0000..U+FFFF> are stored in a single 16-bit unit, and code
1bfb14c4 1138points C<U+10000..U+10FFFF> in two 16-bit units. The latter case is
c349b1b9
JH
1139using I<surrogates>, the first 16-bit unit being the I<high
1140surrogate>, and the second being the I<low surrogate>.
1141
376d9008 1142Surrogates are code points set aside to encode the C<U+10000..U+10FFFF>
c349b1b9 1143range of Unicode code points in pairs of 16-bit units. The I<high
9f815e24 1144surrogates> are the range C<U+D800..U+DBFF> and the I<low surrogates>
376d9008 1145are the range C<U+DC00..U+DFFF>. The surrogate encoding is
c349b1b9 1146
d88362ca
KW
1147 $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
1148 $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
c349b1b9
JH
1149
1150and the decoding is
1151
d88362ca 1152 $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
c349b1b9 1153
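As a worked example of the surrogate arithmetic, the sketch below runs one
arbitrarily chosen code point, U+1F600, through the encoding and back. Note
that it wraps the division in C<int()>, since Perl's C</> operator does
floating-point division unless C<use integer> is in effect.

```perl
use strict;
use warnings;

my $uni = 0x1F600;    # any code point in U+10000..U+10FFFF will do

# Split the 20-bit offset into a high and a low 10-bit half.
my $hi = int( ( $uni - 0x10000 ) / 0x400 ) + 0xD800;
my $lo = ( $uni - 0x10000 ) % 0x400 + 0xDC00;

# Reverse the process to recover the original code point.
my $back = 0x10000 + ( $hi - 0xD800 ) * 0x400 + ( $lo - 0xDC00 );

printf "U+%X -> %04X %04X -> U+%X\n", $uni, $hi, $lo, $back;
```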
376d9008 1154Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16
c349b1b9 1155itself can be used for in-memory computations, but if storage or
376d9008
JB
1156transfer is required either UTF-16BE (big-endian) or UTF-16LE
1157(little-endian) encodings must be chosen.
c349b1b9
JH
1158
1159This introduces another problem: what if you just know that your data
376d9008
JB
1160is UTF-16, but you don't know which endianness? Byte Order Marks, or
1161BOMs, are a solution to this. A special character has been reserved
86bbd6d1 1162in Unicode to function as a byte order marker: the character with the
376d9008 1163code point C<U+FEFF> is the BOM.
042da322 1164
c349b1b9 1165The trick is that if you read a BOM, you will know the byte order,
376d9008
JB
1166since if it was written on a big-endian platform, you will read the
1167bytes C<0xFE 0xFF>, but if it was written on a little-endian platform,
1168you will read the bytes C<0xFF 0xFE>. (And if the originating platform
1169was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.)
042da322 1170
86bbd6d1 1171The way this trick works is that the character with the code point
6d4f9cf2 1172C<U+FFFE> is not supposed to be in input streams, so the
376d9008 1173sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in
1bfb14c4 1174little-endian format" and cannot be "C<U+FFFE>, represented in big-endian
6d4f9cf2
KW
1175format".
1176
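The byte patterns described above lend themselves to a simple check on the
first few bytes of a buffer. The helper name C<sniff_bom> is hypothetical,
and the sketch handles only the three signatures mentioned here; it makes no
attempt to tell UTF-16LE from UTF-32LE, whose signature starts with the same
two bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: returns an encoding name, or undef if no BOM is found.
sub sniff_bom {
    my ($bytes) = @_;
    return 'UTF-8'    if substr( $bytes, 0, 3 ) eq "\xEF\xBB\xBF";
    return 'UTF-16BE' if substr( $bytes, 0, 2 ) eq "\xFE\xFF";
    return 'UTF-16LE' if substr( $bytes, 0, 2 ) eq "\xFF\xFE";
    return;
}

# The letter "A" preceded by a BOM in each form, then with no BOM at all.
my @samples = (
    "\xFE\xFF\x00\x41",
    "\xFF\xFE\x41\x00",
    "\xEF\xBB\xBF\x41",
    "\x41",
);

for my $sample (@samples) {
    my $guess = sniff_bom($sample);
    print defined $guess ? $guess : 'no BOM', "\n";
}
```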
1177Surrogates have no meaning in Unicode outside their use in pairs to
1178represent other code points. However, Perl allows them to be
1179represented individually internally, for example by saying
f651977e
TC
1180C<chr(0xD801)>, so that all code points, not just those valid for open
1181interchange, are
6d4f9cf2
KW
1182representable. Unicode does define semantics for them, such as their
1183General Category of "Cs". But because their use is somewhat dangerous,
b19eb496
TC
1184Perl will warn (using the warning category "surrogate", which is a
1185sub-category of "utf8") if an attempt is made
6d4f9cf2
KW
1186to do things like take the lower case of one, or match
1187case-insensitively, or to output them. (But don't try this on Perls
1188before 5.14.)
c349b1b9 1189
c29a771d 1190=item *
5cb3728c 1191
1e54db1a 1192UTF-32, UTF-32BE, UTF-32LE
c349b1b9
JH
1193
1194The UTF-32 family is pretty much like the UTF-16 family, except that
042da322 1195the units are 32-bit, and therefore the surrogate scheme is not
f651977e 1196needed. UTF-32 is a fixed-width encoding. The BOM signatures are
b19eb496 1197C<0x00 0x00 0xFE 0xFF> for BE and C<0xFF 0xFE 0x00 0x00> for LE.
c349b1b9 1198
c29a771d 1199=item *
5cb3728c
RB
1200
1201UCS-2, UCS-4
c349b1b9 1202
b19eb496 1203Legacy, fixed-width encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit
376d9008 1204encoding. Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
339cfa0e 1205because it does not use surrogates. UCS-4 is a 32-bit encoding,
b19eb496 1206functionally identical to UTF-32 (the difference being that
ee88f7b6 1207UCS-4 forbids neither surrogates nor code points larger than 0x10_FFFF).
c349b1b9 1208
c29a771d 1209=item *
5cb3728c
RB
1210
1211UTF-7
c349b1b9 1212
376d9008
JB
1213A seven-bit safe (non-eight-bit) encoding, which is useful if the
1214transport or storage is not eight-bit safe. Defined by RFC 2152.
c349b1b9 1215
95a1a48b
JH
1216=back
1217
6d4f9cf2
KW
1218=head2 Non-character code points
1219
122066 code points are set aside in Unicode as "non-character code points".
1221These all have the Unassigned (Cn) General Category, and they never will
1222be assigned. These are never supposed to be in legal Unicode input
1223streams, so that code can use them as sentinels that can be mixed in
1224with character data, and they always will be distinguishable from that data.
1225To keep them out of Perl input streams, strict UTF-8 should be
1226specified, such as by using the layer C<:encoding('UTF-8')>. The
1227non-character code points are the 32 between U+FDD0 and U+FDEF, and the
122834 code points U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF.
1229Some people are under the mistaken impression that these are "illegal",
1230but that is not true. An application or cooperating set of applications
1231can legally use them at will internally; but these code points are
42581d5d
KW
1232"illegal for open interchange". Therefore, Perl will not accept these
1233from input streams unless lax rules are being used, and will warn
b19eb496 1234(using the warning category "nonchar", which is a sub-category of "utf8") if
42581d5d
KW
1235an attempt is made to output them.
1236
1237=head2 Beyond Unicode code points
1238
1239The maximum Unicode code point is U+10FFFF. But Perl accepts code
1240points up to the maximum permissible unsigned number available on the
1241platform. However, Perl will not accept these from input streams unless
1242lax rules are being used, and will warn (using the warning category
b19eb496 1243"non_unicode", which is a sub-category of "utf8") if an attempt is made to
42581d5d
KW
1244operate on or output them. For example, C<uc(chr(0x11_0000))> will generate
1245this warning, returning the input parameter as its result, as the upper
ee88f7b6 1246case of every non-Unicode code point is the code point itself.
6d4f9cf2 1247
0d7c09bb
JH
1248=head2 Security Implications of Unicode
1249
e1b711da
KW
1250Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.
1251Also, note the following:
1252
0d7c09bb
JH
1253=over 4
1254
1255=item *
1256
1257Malformed UTF-8
bf0fa0b2 1258
42581d5d 1259Unfortunately, the original specification of UTF-8 leaves some room for
bf0fa0b2 1260interpretation of how many bytes of encoded output one should generate
376d9008
JB
1261from one input Unicode character. Strictly speaking, the shortest
1262possible sequence of UTF-8 bytes should be generated,
1263because otherwise there is potential for an input buffer overflow at
feda178f 1264the receiving end of a UTF-8 connection. Perl always generates the
e1b711da 1265shortest length UTF-8, and with warnings on, Perl will warn about
376d9008 1266non-shortest length UTF-8 along with other malformations, such as the
b19eb496 1267surrogates, which are not Unicode code points valid for interchange.
bf0fa0b2 1268
0d7c09bb
JH
1269=item *
1270
68693f9e 1271Regular expression pattern matching may surprise you if you're not
b19eb496
TC
1272accustomed to Unicode. Starting in Perl 5.14, several pattern
1273modifiers are available to control this, called the character set
42581d5d
KW
1274modifiers. Details are given in L<perlre/Character set modifiers>.
1275
1276=back
0d7c09bb 1277
376d9008 1278As discussed elsewhere, Perl has one foot (two hooves?) planted in
1bfb14c4
JH
1279each of two worlds: the old world of bytes and the new world of
1280characters, upgrading from bytes to characters when necessary.
376d9008
JB
1281If your legacy code does not explicitly use Unicode, no automatic
1282switch-over to characters should happen. Characters shouldn't get
1bfb14c4
JH
1283downgraded to bytes, either. It is possible to accidentally mix bytes
1284and characters, however (see L<perluniintro>), in which case C<\w> in
42581d5d 1285regular expressions might start behaving differently (unless the C</a>
b19eb496 1286modifier is in effect). Review your code. Use warnings and the C<strict> pragma.
0d7c09bb 1287
c349b1b9
JH
1288=head2 Unicode in Perl on EBCDIC
1289
376d9008
JB
1290The way Unicode is handled on EBCDIC platforms is still
1291experimental. On such platforms, references to UTF-8 encoding in this
1292document and elsewhere should be read as meaning the UTF-EBCDIC
1293specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues
c349b1b9 1294are specifically discussed. There is no C<utfebcdic> pragma or
376d9008 1295":utfebcdic" layer; rather, "utf8" and ":utf8" are reused to mean
86bbd6d1
PN
1296the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic>
1297for more discussion of the issues.
c349b1b9 1298
b310b053
JH
1299=head2 Locales
1300
42581d5d 1301See L<perllocale/Unicode and UTF-8>.
b310b053 1302
1aad1664
JH
1303=head2 When Unicode Does Not Happen
1304
1305While Perl does have extensive ways to input and output in Unicode,
b19eb496
TC
1306and a few other "entry points" like the @ARGV array (which can sometimes be
1307interpreted as UTF-8), there are still many places where Unicode
1308(in some encoding or another) could be given as arguments or received as
1aad1664
JH
1309results, or both, but it is not.
1310
e1b711da
KW
1311The following are such interfaces. Also, see L</The "Unicode Bug">.
1312For all of these interfaces Perl
6cd4dd6c 1313currently (as of 5.8.3) simply assumes byte strings both as arguments
b19eb496 1314and results, or UTF-8 strings if the (problematic) C<encoding> pragma has been used.
1aad1664 1315
b19eb496
TC
1316One reason that Perl does not attempt to resolve the role of Unicode in
1317these situations is that the answers are highly dependent on the operating
1aad1664 1318system and the file system(s). For example, whether filenames can be
b19eb496
TC
1319in Unicode and in exactly what kind of encoding, is not exactly a
1320portable concept. Similarly for C<qx> and C<system>: how well will the
1321"command-line interface" (and which of them?) handle Unicode?
1aad1664
JH
1322
1323=over 4
1324
557a2462
RB
1325=item *
1326
51f494cc 1327chdir, chmod, chown, chroot, exec, link, lstat, mkdir,
1e8e8236 1328rename, rmdir, stat, symlink, truncate, unlink, utime, -X
557a2462
RB
1329
1330=item *
1331
1332%ENV
1333
1334=item *
1335
1336glob (aka the <*>)
1337
1338=item *
1aad1664 1339
557a2462 1340open, opendir, sysopen
1aad1664 1341
557a2462 1342=item *
1aad1664 1343
557a2462 1344qx (aka the backtick operator), system
1aad1664 1345
557a2462 1346=item *
1aad1664 1347
557a2462 1348readdir, readlink
1aad1664
JH
1349
1350=back
1351
e1b711da
KW
1352=head2 The "Unicode Bug"
1353
42581d5d
KW
1354The term "Unicode bug" has been applied to an inconsistency
1355on ASCII platforms with the
1356Unicode code points in the Latin-1 Supplement block, that
e1b711da
KW
1357is, between 128 and 255. Without a locale specified, unlike all other
1358characters or code points, these characters have very different semantics in
20db7501
KW
1359byte semantics versus character semantics, unless
1360C<use feature 'unicode_strings'> is specified.
42581d5d
KW
1361(The lesson here is to specify C<unicode_strings> to avoid the
1362headaches.)
e1b711da
KW
1363
1364In character semantics they are interpreted as Unicode code points, which means
1365they have the same semantics as Latin-1 (ISO-8859-1).
1366
1367In byte semantics, they are considered to be unassigned characters, meaning
1368that the only semantics they have is their ordinal numbers, and that they are
1369not members of various character classes. None are considered to match C<\w>
42581d5d 1370for example, but all match C<\W>.
e1b711da
KW
1371
1372The behavior is known to have effects on these areas:
1373
1374=over 4
1375
1376=item *
1377
1378Changing the case of a scalar, that is, using C<uc()>, C<ucfirst()>, C<lc()>,
1379and C<lcfirst()>, or C<\L>, C<\U>, C<\u> and C<\l> in regular expression
1380substitutions.
1381
1382=item *
1383
1384Using caseless (C</i>) regular expression matching
1385
1386=item *
1387
b19eb496 1388Matching any of several properties in regular expressions, namely C<\b>,
630d17dc
KW
1389C<\B>, C<\s>, C<\S>, C<\w>, C<\W>, and all the Posix character classes
1390I<except> C<[[:ascii:]]>.
e1b711da
KW
1391
1392=item *
1393
In C<quotemeta> or its inline equivalent C<\Q>, no
code points above 127 are quoted in UTF-8 encoded strings, but in
byte encoded strings, code points between 128-255 are always quoted.
eb88ed9e 1397
e1b711da
KW
1398=back
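
The effect on case changing and on C<\w> can be seen with two scalars
that hold the same character but differ in their internal encoding.
This is a minimal sketch; it assumes an ASCII platform, no locale in
effect, and a Perl recent enough to know the C<unicode_strings> feature.
The variable names are illustrative only.

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # make the legacy behavior explicit

my $bytes = "\xE9";             # LATIN SMALL LETTER E WITH ACUTE, byte encoded
my $chars = "\xE9";
utf8::upgrade($chars);          # same character, now UTF-8 encoded internally

# The two scalars compare equal as strings...
my $same = $bytes eq $chars ? 1 : 0;

# ...yet the affected operations treat them differently.
my $bytes_is_word = $bytes =~ /\w/ ? 1 : 0;   # byte semantics
my $chars_is_word = $chars =~ /\w/ ? 1 : 0;   # character semantics
my $bytes_uc      = uc $bytes;                # left unchanged
my $chars_uc      = uc $chars;                # upper-cased to "\xC9"

printf "equal=%d  \\w: %d vs %d  uc: %vX vs %vX\n",
    $same, $bytes_is_word, $chars_is_word, $bytes_uc, $chars_uc;
```

Under those assumptions the two strings are equal, but only the
upgraded one matches C<\w> and is changed by C<uc()>.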
1399
1400This behavior can lead to unexpected results in which a string's semantics
1401suddenly change if a code point above 255 is appended to or removed from it,
1402which changes the string's semantics from byte to character or vice versa. As
1403an example, consider the following program and its output:
1404
 $ perl -le'
     no feature "unicode_strings";
     $s1 = "\xC2";
     $s2 = "\x{2660}";
     for ($s1, $s2, $s1.$s2) {
         print /\w/ || 0;
     }
 '
 0
 0
 1
1416
If there's no C<\w> in C<$s1> or in C<$s2>, why does their concatenation have one?
e1b711da
KW
1418
1419This anomaly stems from Perl's attempt to not disturb older programs that
1420didn't use Unicode, and hence had no semantics for characters outside of the
1421ASCII range (except in a locale), along with Perl's desire to add Unicode
1422support seamlessly. The result wasn't seamless: these characters were
1423orphaned.
1424
20db7501
KW
1425Starting in Perl 5.14, C<use feature 'unicode_strings'> can be used to
1426cause Perl to use Unicode semantics on all string operations within the
1427scope of the feature subpragma. Regular expressions compiled in its
1428scope retain that behavior even when executed or compiled into larger
1429regular expressions outside the scope. (The pragma does not, however,
42581d5d
KW
1430affect the C<quotemeta> behavior. Nor does it affect the deprecated
1431user-defined case changing operations--these still require a UTF-8
eb88ed9e 1432encoded string to operate.)
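
As a rough illustration, here is the earlier example again, this time
inside the scope of the feature.  It is a sketch that assumes Perl 5.14
or later, where the feature also affects regular expressions.

```perl
use strict;
use warnings;
use feature 'unicode_strings';

my $s1 = "\xC2";        # LATIN CAPITAL LETTER A WITH CIRCUMFLEX
my $s2 = "\x{2660}";    # BLACK SPADE SUIT, which is not a word character

# Test each string, and their concatenation, against \w.
my @matches = map { /\w/ ? 1 : 0 } $s1, $s2, $s1 . $s2;
print "@matches\n";
```

With the feature enabled, C<$s1> matches C<\w> on its own, so the result
for the concatenation is no longer a surprise.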
20db7501
KW
1433
1434In Perl 5.12, the subpragma affected casing changes, but not regular
1435expressions. See L<perlfunc/lc> for details on how this pragma works in
1436combination with various others for casing.
1437
For earlier Perls, or when a string is passed to a function outside the
subpragma's scope, a workaround is to always call C<utf8::upgrade($string)>,
or to use the standard module L<Encode>.  Also, a scalar that has any characters
whose ordinal is 0x100 or above, or which were specified using either of the
C<\N{...}> notations, will automatically have character semantics.
e1b711da 1443
1aad1664
JH
1444=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)
1445
Sometimes (see L</"When Unicode Does Not Happen"> or L</The "Unicode Bug">)
there are situations where you simply need to force a byte
string into UTF-8, or vice versa.  The low-level calls
C<utf8::upgrade($bytestring)> and C<utf8::downgrade($utf8string[, FAIL_OK])> are
the answers.
1451
2bbc8d55
SP
1452Note that utf8::downgrade() can fail if the string contains characters
1453that don't fit into a byte.
1aad1664 1454
e1b711da
KW
1455Calling either function on a string that already is in the desired state is a
1456no-op.
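
The calls can be tried out as follows.  This is just a sketch; the
variable names are invented for the example, and C<utf8::is_utf8()> is
used only to peek at the internal state.

```perl
use strict;
use warnings;

my $s = "\xE9";                       # starts out byte encoded
utf8::upgrade($s);                    # now UTF-8 encoded internally
my $after_upgrade = utf8::is_utf8($s) ? 1 : 0;
utf8::downgrade($s);                  # back to one byte per character
my $after_downgrade = utf8::is_utf8($s) ? 1 : 0;

my $wide = "\x{263A}";                # does not fit into a byte

# With FAIL_OK set, a failed downgrade returns false instead of dying.
my $ok = utf8::downgrade($wide, 1) ? 1 : 0;

# Without FAIL_OK, a failed downgrade dies; trap that with eval.
my $died = eval { utf8::downgrade($wide); 1 } ? 0 : 1;
```

Neither call changes the characters in the string; only the internal
representation differs, so C<$s> compares equal to C<"\xE9"> throughout.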
1457
95a1a48b
JH
1458=head2 Using Unicode in XS
1459
3a2263fe
RGS
1460If you want to handle Perl Unicode in XS extensions, you may find the
1461following C APIs useful. See also L<perlguts/"Unicode Support"> for an
1462explanation about Unicode at the XS level, and L<perlapi> for the API
1463details.
95a1a48b
JH
1464
1465=over 4
1466
1467=item *
1468
1bfb14c4 1469C<DO_UTF8(sv)> returns true if the C<UTF8> flag is on and the bytes
2bbc8d55 1470pragma is not in effect. C<SvUTF8(sv)> returns true if the C<UTF8>
1bfb14c4
JH
1471flag is on; the bytes pragma is ignored. The C<UTF8> flag being on
1472does B<not> mean that there are any characters of code points greater
1473than 255 (or 127) in the scalar or that there are even any characters
1474in the scalar. What the C<UTF8> flag means is that the sequence of
1475octets in the representation of the scalar is the sequence of UTF-8
1476encoded code points of the characters of a string. The C<UTF8> flag
1477being off means that each octet in this representation encodes a
1478single character with code point 0..255 within the string. Perl's
1479Unicode model is not to use UTF-8 until it is absolutely necessary.
95a1a48b
JH
1480
1481=item *
1482
2bbc8d55 1483C<uvchr_to_utf8(buf, chr)> writes a Unicode character code point into
1bfb14c4 1484a buffer encoding the code point as UTF-8, and returns a pointer
2bbc8d55 1485pointing after the UTF-8 bytes. It works appropriately on EBCDIC machines.
95a1a48b
JH
1486
1487=item *
1488
2bbc8d55 1489C<utf8_to_uvchr(buf, lenp)> reads UTF-8 encoded bytes from a buffer and
376d9008 1490returns the Unicode character code point and, optionally, the length of
2bbc8d55 1491the UTF-8 byte sequence. It works appropriately on EBCDIC machines.
95a1a48b
JH
1492
1493=item *
1494
376d9008
JB
1495C<utf8_length(start, end)> returns the length of the UTF-8 encoded buffer
1496in characters. C<sv_len_utf8(sv)> returns the length of the UTF-8 encoded
95a1a48b
JH
1497scalar.
1498
1499=item *
1500
376d9008
JB
1501C<sv_utf8_upgrade(sv)> converts the string of the scalar to its UTF-8
1502encoded form. C<sv_utf8_downgrade(sv)> does the opposite, if
1503possible. C<sv_utf8_encode(sv)> is like sv_utf8_upgrade except that
1504it does not set the C<UTF8> flag. C<sv_utf8_decode()> does the
1505opposite of C<sv_utf8_encode()>. Note that none of these are to be
1506used as general-purpose encoding or decoding interfaces: C<use Encode>
1507for that. C<sv_utf8_upgrade()> is affected by the encoding pragma
1508but C<sv_utf8_downgrade()> is not (since the encoding pragma is
1509designed to be a one-way street).
95a1a48b
JH
1510
1511=item *
1512
376d9008 1513C<is_utf8_char(s)> returns true if the pointer points to a valid UTF-8
90f968e0 1514character.
95a1a48b
JH
1515
1516=item *
1517
376d9008 1518C<is_utf8_string(buf, len)> returns true if C<len> bytes of the buffer
95a1a48b
JH
1519are valid UTF-8.
1520
1521=item *
1522
376d9008
JB
1523C<UTF8SKIP(buf)> will return the number of bytes in the UTF-8 encoded
1524character in the buffer. C<UNISKIP(chr)> will return the number of bytes
1525required to UTF-8-encode the Unicode character code point. C<UTF8SKIP()>
90f968e0 1526is useful for example for iterating over the characters of a UTF-8
376d9008 1527encoded buffer; C<UNISKIP()> is useful, for example, in computing
90f968e0 1528the size required for a UTF-8 encoded buffer.
95a1a48b
JH
1529
1530=item *
1531
376d9008 1532C<utf8_distance(a, b)> will tell the distance in characters between the
95a1a48b
JH
1533two pointers pointing to the same UTF-8 encoded buffer.
1534
1535=item *
1536
2bbc8d55 1537C<utf8_hop(s, off)> will return a pointer to a UTF-8 encoded buffer
376d9008
JB
1538that is C<off> (positive or negative) Unicode characters displaced
1539from the UTF-8 buffer C<s>. Be careful not to overstep the buffer:
1540C<utf8_hop()> will merrily run off the end or the beginning of the
1541buffer if told to do so.
95a1a48b 1542
d2cc3551
JH
1543=item *
1544
376d9008
JB
1545C<pv_uni_display(dsv, spv, len, pvlim, flags)> and
1546C<sv_uni_display(dsv, ssv, pvlim, flags)> are useful for debugging the
1547output of Unicode strings and scalars. By default they are useful
1548only for debugging--they display B<all> characters as hexadecimal code
1bfb14c4
JH
1549points--but with the flags C<UNI_DISPLAY_ISPRINT>,
1550C<UNI_DISPLAY_BACKSLASH>, and C<UNI_DISPLAY_QQ> you can make the
1551output more readable.
d2cc3551
JH
1552
1553=item *
1554
66615a54 1555C<foldEQ_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to
376d9008 1556compare two strings case-insensitively in Unicode. For case-sensitive
66615a54
KW
1557comparisons you can just use C<memEQ()> and C<memNE()> as usual, except
1558if one string is in utf8 and the other isn't.
d2cc3551 1559
c349b1b9
JH
1560=back
1561
95a1a48b
JH
1562For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h>
1563in the Perl source code distribution.
1564
e1b711da
KW
1565=head2 Hacking Perl to work on earlier Unicode versions (for very serious hackers only)
1566
1567Perl by default comes with the latest supported Unicode version built in, but
1568you can change to use any earlier one.
1569
Download the files in the desired version of Unicode from the Unicode web
site (L<http://www.unicode.org>).  These should replace the existing files in
F<lib/unicore> in the Perl source tree.  Follow the instructions in
F<README.perl> in that directory to change some of their names, and then build
perl (see L<INSTALL>).
116693e8
DL
1575
It is even possible to copy the built files to a different directory and then
change F<utf8_heavy.pl> in the directory C<$Config{privlib}> to point to the
new directory.  Alternatively, make a copy of that directory before making the
change, and use C<@INC> or the C<-I> run-time flag to switch between versions
at will (but, because of caching, not in the middle of a process).  All this is
beyond the scope of these instructions.
1582
c29a771d
JH
1583=head1 BUGS
1584
376d9008 1585=head2 Interaction with Locales
7eabb34d 1586
See L<perllocale/Unicode and UTF-8>.
c29a771d 1588
9f815e24 1589=head2 Problems with characters in the Latin-1 Supplement range
2bbc8d55 1590
e1b711da
KW
See L</The "Unicode Bug">.
1592
376d9008 1593=head2 Interaction with Extensions
7eabb34d 1594
376d9008 1595When Perl exchanges data with an extension, the extension should be
2575c402 1596able to understand the UTF8 flag and act accordingly. If the
b19eb496 1597extension doesn't recognize that flag, it's likely that the extension
376d9008 1598will return incorrectly-flagged data.
7eabb34d
A
1599
1600So if you're working with Unicode data, consult the documentation of
1601every module you're using if there are any issues with Unicode data
1602exchange. If the documentation does not talk about Unicode at all,
a73d23f6 1603suspect the worst and probably look at the source to learn how the
376d9008 1604module is implemented. Modules written completely in Perl shouldn't
a73d23f6
RGS
1605cause problems. Modules that directly or indirectly access code written
1606in other programming languages are at risk.
7eabb34d 1607
376d9008 1608For affected functions, the simple strategy to avoid data corruption is
7eabb34d 1609to always make the encoding of the exchanged data explicit. Choose an
376d9008 1610encoding that you know the extension can handle. Convert arguments passed
7eabb34d
A
1611to the extensions to that encoding and convert results back from that
1612encoding. Write wrapper functions that do the conversions for you, so
1613you can later change the functions when the extension catches up.
1614
376d9008 1615To provide an example, let's say the popular Foo::Bar::escape_html
7eabb34d
A
1616function doesn't deal with Unicode data yet. The wrapper function
1617would convert the argument to raw UTF-8 and convert the result back to
376d9008 1618Perl's internal representation like so:
7eabb34d
A
1619
    sub my_escape_html ($) {
        my($what) = shift;
        return unless defined $what;
        Encode::decode_utf8(Foo::Bar::escape_html(
                                         Encode::encode_utf8($what)));
    }
1626
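
To try the wrapper idea without the extension at hand, a self-contained
variant can help.  In this sketch C<escape_html_bytes> is a hypothetical
pure-Perl stand-in for the byte-oriented extension function; it is not
part of any real module.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# Hypothetical stand-in for an extension function that only knows bytes.
sub escape_html_bytes {
    my ($octets) = @_;
    my %entity = ('&' => '&amp;', '<' => '&lt;',
                  '>' => '&gt;',  '"' => '&quot;');
    $octets =~ s/([&<>"])/$entity{$1}/g;
    return $octets;
}

# Wrapper: characters in, characters out, UTF-8 octets in between.
sub my_escape_html {
    my ($what) = @_;
    return unless defined $what;
    return decode_utf8(escape_html_bytes(encode_utf8($what)));
}

my $escaped = my_escape_html("<b>\x{263A} & \x{E9}</b>");
```

The caller only ever sees character strings, so the wrapper body can be
changed later without touching the calling code.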
1627Sometimes, when the extension does not convert data but just stores
b19eb496 1628and retrieves them, you will be able to use the otherwise
7eabb34d 1629dangerous Encode::_utf8_on() function. Let's say the popular
66b79f27 1630C<Foo::Bar> extension, written in C, provides a C<param> method that
7eabb34d
A
1631lets you store and retrieve data according to these prototypes:
1632
1633 $self->param($name, $value); # set a scalar
1634 $value = $self->param($name); # retrieve a scalar
1635
1636If it does not yet provide support for any encoding, one could write a
1637derived class with such a C<param> method:
1638
    sub param {
      my($self,$name,$value) = @_;
      utf8::upgrade($name);     # make sure it is UTF-8 encoded
      if (defined $value) {
        utf8::upgrade($value);  # make sure it is UTF-8 encoded
        return $self->SUPER::param($name,$value);
      } else {
        my $ret = $self->SUPER::param($name);
        Encode::_utf8_on($ret); # we know, it is UTF-8 encoded
        return $ret;
      }
    }
1651
Some extensions provide filters on data entry/exit points, such as
C<DB_File::filter_store_key> and family.  Look out for such filters in
the documentation of your extensions; they can make the transition to
Unicode data much easier.
1656
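
As an illustration of such filters, the sketch below uses the core
module L<SDBM_File>, which offers the same C<filter_*> methods (see
L<perldbmfilter>).  Choosing SDBM_File instead of DB_File, and the file
name F<demo>, are assumptions made so that the example is
self-contained.

```perl
use strict;
use warnings;
use Fcntl;                       # O_RDWR, O_CREAT
use SDBM_File;
use Encode qw(encode_utf8 decode_utf8);
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);
my %h;
my $db = tie(%h, 'SDBM_File', "$dir/demo", O_RDWR | O_CREAT, 0666)
    or die "Cannot tie database: $!";

# Store UTF-8 octets on disk; hand character strings to the program.
$db->filter_store_key(  sub { $_ = encode_utf8($_) });
$db->filter_store_value(sub { $_ = encode_utf8($_) });
$db->filter_fetch_key(  sub { $_ = decode_utf8($_) });
$db->filter_fetch_value(sub { $_ = decode_utf8($_) });

$h{"\x{263A}"} = "sm\x{F6}rg\x{E5}s";
my $value = $h{"\x{263A}"};
my @keys  = keys %h;

undef $db;                       # drop the object before untie
untie %h;
```

The filters run on every store and fetch, so the rest of the program
never has to deal with the encoded octets.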
376d9008 1657=head2 Speed
7eabb34d 1658
c29a771d 1659Some functions are slower when working on UTF-8 encoded strings than
574c8022 1660on byte encoded strings. All functions that need to hop over
7c17141f
JH
1661characters such as length(), substr() or index(), or matching regular
1662expressions can work B<much> faster when the underlying data are
1663byte-encoded.
1664
1665In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
1666a caching scheme was introduced which will hopefully make the slowness
a104b433
JH
1667somewhat less spectacular, at least for some operations. In general,
1668operations with UTF-8 encoded strings are still slower. As an example,
the Unicode properties (character classes) like C<\p{Nd}> are known to
be quite a bit slower (5-20 times) than their simpler counterparts
like C<\d> (then again, there are hundreds of Unicode characters matching C<Nd>
compared with the 10 ASCII characters matching C<\d>).
666f95b9 1673
e1b711da
KW
1674=head2 Problems on EBCDIC platforms
1675
f651977e 1676There are several known problems with Perl on EBCDIC platforms. If you
e1b711da 1677want to use Perl there, send email to perlbug@perl.org.
fe749c9a
KW
1678
1679In earlier versions, when byte and character data were concatenated,
1680the new string was sometimes created by
1681decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
1682old Unicode string used EBCDIC.
1683
1684If you find any of these, please report them as bugs.
1685
c8d992ba
A
1686=head2 Porting code from perl-5.6.X
1687
1688Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer
1689was required to use the C<utf8> pragma to declare that a given scope
1690expected to deal with Unicode data and had to make sure that only
1691Unicode data were reaching that scope. If you have code that is
1692working with 5.6, you will need some of the following adjustments to
1693your code. The examples are written such that the code will continue
1694to work under 5.6, so you should be safe to try them out.
1695
755789c0 1696=over 3
c8d992ba
A
1697
1698=item *
1699
1700A filehandle that should read or write UTF-8
1701
    if ($] > 5.007) {
        binmode $fh, ":encoding(utf8)";
    }
1705
1706=item *
1707
1708A scalar that is going to be passed to some extension
1709
1710Be it Compress::Zlib, Apache::Request or any extension that has no
1711mention of Unicode in the manpage, you need to make sure that the
2575c402 1712UTF8 flag is stripped off. Note that at the time of this writing
c8d992ba
A
1713(October 2002) the mentioned modules are not UTF-8-aware. Please
1714check the documentation to verify if this is still true.
1715
1716 if ($] > 5.007) {
1717 require Encode;
1718 $val = Encode::encode_utf8($val); # make octets
1719 }
1720
1721=item *
1722
1723A scalar we got back from an extension
1724
1725If you believe the scalar comes back as UTF-8, you will most likely
2575c402 1726want the UTF8 flag restored:
c8d992ba
A
1727
1728 if ($] > 5.007) {
1729 require Encode;
1730 $val = Encode::decode_utf8($val);
1731 }
1732
1733=item *
1734
1735Same thing, if you are really sure it is UTF-8
1736
1737 if ($] > 5.007) {
1738 require Encode;
1739 Encode::_utf8_on($val);
1740 }
1741
1742=item *
1743
1744A wrapper for fetchrow_array and fetchrow_hashref
1745
1746When the database contains only UTF-8, a wrapper function or method is
1747a convenient way to replace all your fetchrow_array and
1748fetchrow_hashref calls. A wrapper function will also make it easier to
1749adapt to future enhancements in your database driver. Note that at the
1750time of this writing (October 2002), the DBI has no standardized way
1751to deal with UTF-8 data. Please check the documentation to verify if
1752that is still true.
1753
    sub fetchrow {
        # $what is one of fetchrow_{array,hashref}
        my($self, $sth, $what) = @_;
        if ($] < 5.007) {
            return $sth->$what;
        } else {
            require Encode;
            if (wantarray) {
                my @arr = $sth->$what;
                for (@arr) {
                    defined && /[^\000-\177]/ && Encode::_utf8_on($_);
                }
                return @arr;
            } else {
                my $ret = $sth->$what;
                if (ref $ret) {
                    for my $k (keys %$ret) {
                        defined
                        && /[^\000-\177]/
                        && Encode::_utf8_on($_) for $ret->{$k};
                    }
                    return $ret;
                } else {
                    defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
                    return $ret;
                }
            }
        }
    }
1783
1784
1785=item *
1786
1787A large scalar that you know can only contain ASCII
1788
1789Scalars that contain only ASCII and are marked as UTF-8 are sometimes
1790a drag to your program. If you recognize such a situation, just remove
2575c402 1791the UTF8 flag:
c8d992ba
A
1792
1793 utf8::downgrade($val) if $] > 5.007;
1794
1795=back
1796
393fec97
GS
1797=head1 SEE ALSO
1798
L<perlunitut>, L<perluniintro>, L<perluniprops>, L<Encode>, L<open>, L<utf8>, L<bytes>,
L<perlretut>, L<perlvar/"${^UNICODE}">,
L<http://www.unicode.org/reports/tr44>.
393fec97
GS
1802
1803=cut