This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5
=head1 NAME

perlunicode - Unicode support in Perl

=head1 DESCRIPTION

=head2 Important Caveats

Unicode support is an extensive requirement. While Perl does not
implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.

People who want to learn to use Unicode in Perl should probably read
the L<Perl Unicode tutorial, perlunitut|perlunitut> and
L<perluniintro> before reading
this reference document.

Also, the use of Unicode may present security issues that aren't obvious.
Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.

=over 4

=item Safest if you "use feature 'unicode_strings'"

In order to preserve backward compatibility, Perl does not turn
on full internal Unicode support unless the pragma
C<use feature 'unicode_strings'> is specified. (This is automatically
selected if you C<use 5.012> or higher.) Failure to do this can
lead to surprises. See L</The "Unicode Bug"> below.

This pragma doesn't affect I/O, and there are still several places
where Unicode isn't fully supported, such as in filenames.

=item Input and Output Layers

Perl knows when a filehandle uses Perl's internal Unicode encodings
(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
the ":encoding(utf8)" layer. Other encodings can be converted to Perl's
encoding on input or from Perl's encoding on output by use of the
":encoding(...)" layer. See L<open>.

To indicate that Perl source itself is in UTF-8, use C<use utf8;>.

=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts

As a compatibility measure, the C<use utf8> pragma must be explicitly
included to enable recognition of UTF-8 in the Perl scripts themselves
(in string or regular expression literals, or in identifier names) on
ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
machines. B<These are the only times when an explicit C<use utf8>
is needed.> See L<utf8>.

=item BOM-marked scripts and UTF-16 scripts autodetected

If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF-16BE,
or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either
endianness, Perl will correctly read in the script as Unicode.
(BOMless UTF-8 cannot be effectively recognized or differentiated from
ISO 8859-1 or other eight-bit encodings.)

=item C<use encoding> needed to upgrade non-Latin-1 byte strings

By default, there is a fundamental asymmetry in Perl's Unicode model:
implicit upgrading from byte strings to Unicode strings assumes that
they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
downgraded with UTF-8 encoding. This happens because the first 256
codepoints in Unicode happen to agree with Latin-1.

See L</"Byte and Character Semantics"> for more details.

=back

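The encoding layers described in the caveats above can be sketched as
follows. This is only an illustration: it assumes an in-memory filehandle
so that it is self-contained, it uses the strict C<:encoding(UTF-8)>
spelling of the layer, and the variable names are made up for the example.

```perl
use strict;
use warnings;

# One character: U+263A WHITE SMILING FACE.
my $text = "\x{263A}";

# Write the character through an encoding layer into an in-memory file.
my $bytes = '';
open(my $out, '>:encoding(UTF-8)', \$bytes) or die "open for write failed: $!";
print {$out} $text;
close($out) or die "close failed: $!";

# Read the encoded bytes back through the same layer.
open(my $in, '<:encoding(UTF-8)', \$bytes) or die "open for read failed: $!";
my $round_trip = <$in>;
close($in);

printf "characters: %d, encoded bytes: %d\n", length($text), length($bytes);
```

On output the layer turns one character into several bytes, and on input
it turns those bytes back into the same single character.
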
=head2 Byte and Character Semantics

Beginning with version 5.6, Perl uses logically-wide characters to
represent strings internally.

Starting in Perl 5.14, Perl-level operations work with
characters rather than bytes within the scope of a
C<L<use feature 'unicode_strings'|feature>> (or equivalently
C<use 5.012> or higher). (This is not true if bytes have been
explicitly requested by C<L<use bytes|bytes>>, nor necessarily true
for interactions with the platform's operating system.)

For earlier Perls, and when C<unicode_strings> is not in effect, Perl
provides a fairly safe environment that can handle both types of
semantics in programs. For operations where Perl can unambiguously
decide that the input data are characters, Perl switches to character
semantics. For operations where this determination cannot be made
without additional information from the user, Perl decides in favor of
compatibility and chooses to use byte semantics.

When C<use locale> is in effect (which overrides
C<use feature 'unicode_strings'> in the same scope), Perl uses the
semantics associated
with the current locale. Otherwise, Perl uses the platform's native
byte semantics for characters whose code points are less than 256, and
Unicode semantics for those greater than 255. On EBCDIC platforms, this
is almost seamless, as the EBCDIC code pages that Perl handles are
equivalent to Unicode's first 256 code points. (The exception is that
EBCDIC regular expression case-insensitive matching rules are not
as robust as Unicode's.) But on ASCII platforms, Perl uses US-ASCII
(or Basic Latin in Unicode terminology) byte semantics, meaning that characters
whose ordinal numbers are in the range 128 - 255 are undefined except for their
ordinal numbers. This means that none have case (upper and lower), nor are any
a member of character classes, like C<[:alpha:]> or C<\w>. (But all do belong
to the C<\W> class or the Perl regular expression extension C<[:^alpha:]>.)

This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if
none of the program's inputs were marked as being a source of Unicode
character data. Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as C<%ENV>),
or from literals and constants in the source text.

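The difference between native byte semantics and Unicode semantics for
code points 128 - 255 can be seen in a small sketch. It assumes an ASCII
platform with no C<use locale> in effect; the variable names are
illustrative only.

```perl
use strict;
use warnings;

# U+00E9 LATIN SMALL LETTER E WITH ACUTE, written as a single native byte.
my $e_acute = "\xE9";

# Outside the scope of the feature: native byte semantics.
my $is_word_native = ($e_acute =~ /\w/) ? 1 : 0;
my $upper_native   = uc $e_acute;

# Inside the scope of the feature: Unicode semantics.
my ($is_word_unicode, $upper_unicode);
{
    use feature 'unicode_strings';
    $is_word_unicode = ($e_acute =~ /\w/) ? 1 : 0;
    $upper_unicode   = uc $e_acute;
}

printf "without feature: \\w=%d uc=U+%04X\n", $is_word_native,  ord $upper_native;
printf "with feature:    \\w=%d uc=U+%04X\n", $is_word_unicode, ord $upper_unicode;
```

Without the feature the character has no case and is not a word
character; with it, the same string uppercases to U+00C9 and matches
C<\w>.
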
The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
Note that this pragma is only required while Perl defaults to byte
semantics; when character semantics become the default, this pragma
may become a no-op. See L<utf8>.

If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will have
character semantics. This can cause surprises: See L</BUGS>, below.
You can choose to be warned when this happens. See L<encoding::warnings>.

Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
logically just a number ranging from 0 to 2**31 or so. Larger
characters may encode into longer sequences of bytes internally, but
this internal detail is mostly hidden from Perl code.
See L<perluniintro> for more.

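As a minimal sketch of "a character is just a number", the following
compares a character string with its UTF-8 encoded bytes, and shows a
byte string being upgraded on concatenation. The variable names are
assumptions made for the example.

```perl
use strict;
use warnings;

# A character string holding the single code point U+263A.
my $chars = "\x{263A}";

# A copy converted in place to its UTF-8 encoded bytes.
my $bytes = $chars;
utf8::encode($bytes);

# Concatenating a byte string with a character string gives a character
# string; the byte 0xE9 is taken to be the Latin-1 character U+00E9.
my $mixed = "\xE9" . $chars;

printf "chars: length %d, first ordinal 0x%X\n", length($chars), ord($chars);
printf "bytes: length %d, first ordinal 0x%X\n", length($bytes), ord($bytes);
printf "mixed: length %d, first ordinal 0x%X\n", length($mixed), ord($mixed);
```
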
=head2 Effects of Character Semantics

Character semantics have the following effects:

=over 4

=item *

Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.

If you use a Unicode editor to edit your program, Unicode characters may
occur directly within the literal strings in UTF-8 encoding, or UTF-16.
(The former requires a BOM or C<use utf8>, the latter requires a BOM.)

Unicode characters can also be added to a string by using the C<\N{U+...}>
notation. The Unicode code point of the desired character, in hexadecimal,
should be placed in the braces, after the C<U+>. For instance, a smiley face is
C<\N{U+263A}>.

Alternatively, you can use the C<\x{...}> notation for characters 0x100 and
above. For characters below 0x100 you may get byte semantics instead of
character semantics; see L</The "Unicode Bug">. On EBCDIC machines there is
the additional problem that the value for such characters gives the EBCDIC
character rather than the Unicode one.

Additionally, if you

   use charnames ':full';

you can use the C<\N{...}> notation and put the official Unicode
character name within the braces, such as C<\N{WHITE SMILING FACE}>.
See L<charnames>.

=item *

If an appropriate L<encoding> is specified, identifiers within the
Perl script may contain Unicode alphanumeric characters, including
ideographs. Perl does not currently attempt to canonicalize variable
names.

=item *

Regular expressions match characters instead of bytes. "." matches
a character instead of a byte.

=item *

Bracketed character classes in regular expressions match characters instead of
bytes and match against the character properties specified in the
Unicode properties database. C<\w> can be used to match a Japanese
ideograph, for instance.

=item *

Named Unicode properties, scripts, and block ranges may be used (like bracketed
character classes) by using the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".
See L</"Unicode Character Properties"> for more details.

You can define your own character properties and use them
in the regular expression with the C<\p{}> or C<\P{}> construct.
See L</"User-Defined Character Properties"> for more details.

=item *

The special pattern C<\X> matches a logical character, an "extended grapheme
cluster" in Standardese. In Unicode what appears to the user to be a single
character, for example an accented C<G>, may in fact be composed of a sequence
of characters, in this case a C<G> followed by an accent character. C<\X>
will match the entire sequence.

=item *

The C<tr///> operator translates characters instead of bytes. Note
that the C<tr///CU> functionality has been removed. For similar
functionality see C<pack('U0', ...)> and C<pack('C0', ...)>.

=item *

Case translation operators use the Unicode case translation tables
when character input is provided. Note that C<uc()>, or C<\U> in
interpolated strings, translates to uppercase, while C<ucfirst>,
or C<\u> in interpolated strings, translates to titlecase in languages
that make the distinction (which is equivalent to uppercase in languages
without the distinction).

=item *

Most operators that deal with positions or lengths in a string will
automatically switch to using character positions, including
C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
C<sprintf()>, C<write()>, and C<length()>. An operator that
specifically does not switch is C<vec()>. Operators that really don't
care include operators that treat strings as a bucket of bits such as
C<sort()>, and operators dealing with filenames.

=item *

The C<pack()>/C<unpack()> letter C<C> does I<not> change, since it is often
used for byte-oriented formats. Think of C<char> in the C language.

There is a new C<U> specifier that converts between Unicode characters
and code points. There is also a C<W> specifier that is the equivalent of
C<chr>/C<ord> and properly handles character values even if they are above 255.

=item *

The C<chr()> and C<ord()> functions work on characters, similar to
C<pack("W")> and C<unpack("W")>, I<not> C<pack("C")> and
C<unpack("C")>. C<pack("C")> and C<unpack("C")> are methods for
emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.
While these methods reveal the internal encoding of Unicode strings,
that is not something one normally needs to care about at all.

=item *

The bit string operators, C<& | ^ ~>, can operate on character data.
However, for backward compatibility, such as when using bit string
operations when characters are all less than 256 in ordinal value, one
should not use C<~> (the bit complement) with characters of both
values less than 256 and values greater than 255. Most importantly,
De Morgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
will not hold. The reason for this mathematical I<faux pas> is that
the complement cannot return B<both> the 8-bit (byte-wide) bit
complement B<and> the full character-wide bit complement.

=item *

There is a CPAN module, L<Unicode::Casing>, which allows you to define
your own mappings to be used in C<lc()>, C<lcfirst()>, C<uc()>, and
C<ucfirst()> (or their double-quoted string inlined versions such as
C<\U>). (Prior to Perl 5.16, this functionality was partially provided
in the Perl core, but suffered from a number of insurmountable
drawbacks, so the CPAN module was written instead.)

=back

=over 4

=item *

And finally, C<scalar reverse()> reverses by character rather than by byte.

=back

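Several of the effects listed above can be sketched in one short program.
The sample strings and variable names are illustrative assumptions, not
part of this document's own examples.

```perl
use strict;
use warnings;
use feature 'unicode_strings';
use charnames ':full';

# Three spellings of U+263A WHITE SMILING FACE.
my $by_code = "\N{U+263A}";
my $by_hex  = "\x{263A}";
my $by_name = "\N{WHITE SMILING FACE}";

# "G" plus U+0301 COMBINING ACUTE ACCENT: two code points, one \X cluster.
my $accented_g = "G\N{U+0301}";
my @clusters   = ("${accented_g}o" =~ /(\X)/g);

# ucfirst() titlecases, uc() uppercases; for U+01C6 the results differ.
my $title = ucfirst "\N{U+01C6}";
my $upper = uc "\N{U+01C6}";

# length(), substr(), and reverse() work in characters, not bytes.
my $word     = "na\N{U+00EF}ve\N{U+263A}";
my $reversed = scalar reverse $word;

# pack "W" and unpack "W" agree with chr() and ord(), even above 255.
my $packed = pack "W", 0x263A;

printf "clusters: %d, length of word: %d\n", scalar(@clusters), length($word);
printf "titlecase U+%04X, uppercase U+%04X\n", ord($title), ord($upper);
```
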
=head2 Unicode Character Properties

(The only time that Perl considers a sequence of individual code
points as a single logical character is in the C<\X> construct, already
mentioned above. Therefore "character" in this discussion means a single
Unicode code point.)

Very nearly all Unicode character properties are accessible through
regular expressions by using the C<\p{}> "matches property" construct
and the C<\P{}> "doesn't match property" for its negation.

For instance, C<\p{Uppercase}> matches any single character with the Unicode
"Uppercase" property, while C<\p{L}> matches any character with a
General_Category of "L" (letter) property. Braces are not
required for single letter property names, so C<\p{L}> is equivalent to C<\pL>.

More formally, C<\p{Uppercase}> matches any single character whose Unicode
Uppercase property value is True, and C<\P{Uppercase}> matches any character
whose Uppercase property value is False, and they could have been written as
C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively.

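A short sketch of the single and compound forms just described; the test
characters and variable names are chosen only for illustration.

```perl
use strict;
use warnings;

# Single-letter property names may be written with or without braces.
my $letter_braced = ("x" =~ /\p{L}/) ? 1 : 0;
my $letter_bare   = ("x" =~ /\pL/)   ? 1 : 0;

# The single form of a binary property and its compound equivalent.
my $upper_single   = ("A" =~ /\p{Uppercase}/)      ? 1 : 0;
my $upper_compound = ("A" =~ /\p{Uppercase=True}/) ? 1 : 0;

# \P{...} negates: "a" is not uppercase.
my $not_upper_single   = ("a" =~ /\P{Uppercase}/)       ? 1 : 0;
my $not_upper_compound = ("a" =~ /\p{Uppercase=False}/) ? 1 : 0;

print "letter: $letter_braced $letter_bare\n";
print "upper: $upper_single $upper_compound\n";
print "not upper: $not_upper_single $not_upper_compound\n";
```
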
This formality is needed when properties are not binary; that is, if they can
take on more values than just True and False. For example, the Bidi_Class
property (see L</"Bidirectional Character Types"> below) can take on several
different values, such as Left, Right, Whitespace, and others. To match these,
one needs to specify both the property name (Bidi_Class), AND the value being
matched against
(Left, Right, etc.). This is done, as in the examples above, by having the
two components separated by an equal sign (or interchangeably, a colon), like
C<\p{Bidi_Class: Left}>.

All Unicode-defined character properties may be written in these compound forms
of C<\p{property=value}> or C<\p{property:value}>, but Perl provides some
additional properties that are written only in the single form, as well as
single-form short-cuts for all binary properties and certain others described
below, in which you may omit the property name and the equals or colon
separator.

Most Unicode character properties have at least two synonyms (or aliases if you
prefer): a short one that is easier to type and a longer one that is more
descriptive and hence easier to understand. Thus the "L" and "Letter" properties
above are equivalent and can be used interchangeably. Likewise,
"Upper" is a synonym for "Uppercase", and we could have written
C<\p{Uppercase}> equivalently as C<\p{Upper}>. Also, there are typically
various synonyms for the values the property can be. For binary properties,
"True" has 3 synonyms: "T", "Yes", and "Y"; and "False" has correspondingly "F",
"No", and "N". But be careful. A short form of a value for one property may
not mean the same thing as the same short form for another. Thus, for the
General_Category property, "L" means "Letter", but for the Bidi_Class property,
"L" means "Left". A complete list of properties and synonyms is in
L<perluniprops>.

Upper/lower case differences in property names and values are irrelevant;
thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>.
Similarly, you can add or subtract underscores anywhere in the middle of a
word, so that these are also equivalent to C<\p{U_p_p_e_r}>. And white space
is irrelevant adjacent to non-word characters, such as the braces and the equals
or colon separators, so C<\p{ Upper }> and C<\p{ Upper_case : Y }> are
equivalent to these as well. In fact, white space and even
hyphens can usually be added or deleted anywhere. So even C<\p{ Up-per case = Yes}> is
equivalent. All this is called "loose-matching" by Unicode. The few places
where stricter matching is used are in the middle of numbers, and in the Perl
extension properties that begin or end with an underscore. Stricter matching
cares about white space (except adjacent to non-word characters),
hyphens, and non-interior underscores.

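Loose matching can be explored with a sketch like the one below. The list
of spellings is taken from the text above. The C<eval> wrapper and the
variable names are assumptions added so that a spelling rejected by a
particular Perl release is reported rather than fatal.

```perl
use strict;
use warnings;

my @spellings = (
    'Upper', 'upper', 'UpPeR',          # case is ignored
    'U_p_p_e_r',                        # as are interior underscores
    ' Upper ',                          # and white space next to the braces
    'Uppercase=True', 'Uppercase: Y',   # compound forms with value synonyms
);

# For each spelling, record whether it matches "A" but not "a".
my %agrees;
for my $spelling (@spellings) {
    my $source = '\p{' . $spelling . '}';
    my $re     = eval { qr/\A$source\z/ };
    $agrees{$spelling} = (defined $re && "A" =~ $re && "a" !~ $re) ? 1 : 0;
}

printf "%-18s %s\n", "'$_'", $agrees{$_} ? "same as Uppercase" : "differs"
    for @spellings;
```
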
You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
(^) between the first brace and the property name: C<\p{^Tamil}> is
equal to C<\P{Tamil}>.

Almost all properties are immune to case-insensitive matching. That is,
adding a C</i> regular expression modifier does not change what they
match. There are two sets that are affected.
The first set is
C<Uppercase_Letter>,
C<Lowercase_Letter>,
and C<Titlecase_Letter>,
all of which match C<Cased_Letter> under C</i> matching.
And the second set is
C<Uppercase>,
C<Lowercase>,
and C<Titlecase>,
all of which match C<Cased> under C</i> matching.
This set also includes its subsets C<PosixUpper> and C<PosixLower> both
of which under C</i> matching match C<PosixAlpha>.
(The difference between these sets is that some things, such as Roman
numerals, come in both upper and lower case so they are C<Cased>, but aren't considered
letters, so they aren't C<Cased_Letter>s.)

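The caret negation and the effect of C</i> can be sketched as follows.
The example assumes a Perl recent enough to apply the C</i> rules
described above, and uses U+2170 SMALL ROMAN NUMERAL ONE as a character
that is cased but is not a letter; the variable names are illustrative.

```perl
use strict;
use warnings;

my $small_a         = "a";
my $small_roman_one = "\x{2170}";   # General_Category Nl, but Lowercase

# \p{^...} is the same as \P{...}.
my $caret_negation = ($small_a =~ /\p{^Lu}/) ? 1 : 0;
my $big_p_negation = ($small_a =~ /\P{Lu}/)  ? 1 : 0;

# Under /i, Uppercase_Letter matches any Cased_Letter.
my $lu_plain  = ($small_a =~ /\p{Lu}/)  ? 1 : 0;
my $lu_nocase = ($small_a =~ /\p{Lu}/i) ? 1 : 0;

# The Roman numeral is Cased but not a Cased_Letter.
my $roman_uppercase_nocase = ($small_roman_one =~ /\p{Uppercase}/i) ? 1 : 0;
my $roman_lu_nocase        = ($small_roman_one =~ /\p{Lu}/i)        ? 1 : 0;

print "negation: $caret_negation $big_p_negation\n";
print "Lu: plain $lu_plain, /i $lu_nocase\n";
print "roman: Uppercase/i $roman_uppercase_nocase, Lu/i $roman_lu_nocase\n";
```
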
=head3 B<General_Category>

Every Unicode character is assigned a general category, which is the "most
usual categorization of a character" (from
L<http://www.unicode.org/reports/tr44>).

The compound way of writing these is like C<\p{General_Category=Number}>
(short, C<\p{gc:n}>). But Perl furnishes shortcuts in which everything up
through the equal or colon separator is omitted. So you can instead just write
C<\pN>.

Here are the short and long forms of the General Category properties:

    Short       Long

    L           Letter
    LC, L&      Cased_Letter (that is: [\p{Ll}\p{Lu}\p{Lt}])
    Lu          Uppercase_Letter
    Ll          Lowercase_Letter
    Lt          Titlecase_Letter
    Lm          Modifier_Letter
    Lo          Other_Letter

    M           Mark
    Mn          Nonspacing_Mark
    Mc          Spacing_Mark
    Me          Enclosing_Mark

    N           Number
    Nd          Decimal_Number (also Digit)
    Nl          Letter_Number
    No          Other_Number

    P           Punctuation (also Punct)
    Pc          Connector_Punctuation
    Pd          Dash_Punctuation
    Ps          Open_Punctuation
    Pe          Close_Punctuation
    Pi          Initial_Punctuation
                (may behave like Ps or Pe depending on usage)
    Pf          Final_Punctuation
                (may behave like Ps or Pe depending on usage)
    Po          Other_Punctuation

    S           Symbol
    Sm          Math_Symbol
    Sc          Currency_Symbol
    Sk          Modifier_Symbol
    So          Other_Symbol

    Z           Separator
    Zs          Space_Separator
    Zl          Line_Separator
    Zp          Paragraph_Separator

    C           Other
    Cc          Control (also Cntrl)
    Cf          Format
    Cs          Surrogate
    Co          Private_Use
    Cn          Unassigned

Single-letter properties match all characters in any of the
two-letter sub-properties starting with the same letter.
C<LC> and C<L&> are special: both are aliases for the set consisting of
everything matched by C<Ll>, C<Lu>, and C<Lt>.

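The table above can be exercised with a small sketch that looks up the
two-letter category of a character. The helper C<general_category()> and
the other names are hypothetical, introduced only for this example.

```perl
use strict;
use warnings;

# Two-letter General_Category values, in the order of the table above.
my @categories = qw(Lu Ll Lt Lm Lo Mn Mc Me Nd Nl No
                    Pc Pd Ps Pe Pi Pf Po Sm Sc Sk So
                    Zs Zl Zp Cc Cf Cs Co Cn);

# Precompile one pattern per category, using the compound form.
my %pattern_for;
for my $category (@categories) {
    my $source = '\p{General_Category=' . $category . '}';
    $pattern_for{$category} = qr/$source/;
}

# Hypothetical helper: return the two-letter category of one character.
sub general_category {
    my ($char) = @_;
    for my $category (@categories) {
        return $category if $char =~ $pattern_for{$category};
    }
    return;
}

for my $char ("A", "a", "\x{1C5}", "5", "_", "(", "+", "\$", " ", "\t") {
    printf "U+%04X is %s\n", ord($char), general_category($char);
}

# A single-letter property covers all its two-letter sub-properties.
my $titlecase_is_letter = ("\x{1C5}" =~ /\pL/)     ? 1 : 0;
my $titlecase_is_cased  = ("\x{1C5}" =~ /\p{LC}/)  ? 1 : 0;
```
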
=head3 B<Bidirectional Character Types>

Because scripts differ in their directionality (Hebrew and Arabic are
written right to left, for example) Unicode supplies these values for
the Bidi_Class property:

    Property    Meaning

    L           Left-to-Right
    LRE         Left-to-Right Embedding
    LRO         Left-to-Right Override
    R           Right-to-Left
    AL          Arabic Letter
    RLE         Right-to-Left Embedding
    RLO         Right-to-Left Override
    PDF         Pop Directional Format
    EN          European Number
    ES          European Separator
    ET          European Terminator
    AN          Arabic Number
    CS          Common Separator
    NSM         Non-Spacing Mark
    BN          Boundary Neutral
    B           Paragraph Separator
    S           Segment Separator
    WS          Whitespace
    ON          Other Neutrals

This property is always written in the compound form.
For example, C<\p{Bidi_Class:R}> matches characters that are normally
written right to left.

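A brief sketch of matching on Bidi_Class values from the table above. The
sample characters and the hash name are illustrative assumptions; C<bc>
is the short name of the property.

```perl
use strict;
use warnings;

my $hebrew_alef = "\x{05D0}";   # HEBREW LETTER ALEF
my $arabic_alef = "\x{0627}";   # ARABIC LETTER ALEF

my %bidi = (
    latin_is_left     => ("A" =~ /\p{Bidi_Class:L}/)          ? 1 : 0,
    hebrew_is_right   => ($hebrew_alef =~ /\p{Bidi_Class:R}/) ? 1 : 0,
    hebrew_is_left    => ($hebrew_alef =~ /\p{Bidi_Class:L}/) ? 1 : 0,
    arabic_is_al      => ($arabic_alef =~ /\p{bc=AL}/)        ? 1 : 0,
    digit_is_european => ("1" =~ /\p{Bidi_Class=EN}/)         ? 1 : 0,
    space_is_ws       => (" " =~ /\p{bc=WS}/)                 ? 1 : 0,
);

printf "%-18s %d\n", $_, $bidi{$_} for sort keys %bidi;
```
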
=head3 B<Scripts>

The world's languages are written in many different scripts. This sentence
(unless you're reading it in translation) is written in Latin, while Russian is
written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in
Hiragana or Katakana. There are many more.

The Unicode Script and Script_Extensions properties give what script a
given character is in. Either property can be specified with the
compound form like
C<\p{Script=Hebrew}> (short: C<\p{sc=hebr}>), or
C<\p{Script_Extensions=Javanese}> (short: C<\p{scx=java}>).
In addition, Perl furnishes shortcuts for all
C<Script> property names. You can omit everything up through the equals
(or colon), and simply write C<\p{Latin}> or C<\P{Cyrillic}>.
(This is not true for C<Script_Extensions>, which is required to be
written in the compound form.)

The difference between these two properties involves characters that are
used in multiple scripts. For example the digits '0' through '9' are
used in many parts of the world. These are placed in a script named
C<Common>. Other characters are used in just a few scripts. For
example, the "KATAKANA-HIRAGANA DOUBLE HYPHEN" is used in both Japanese
scripts, Katakana and Hiragana, but nowhere else. The C<Script>
property places all characters that are used in multiple scripts in the
C<Common> script, while the C<Script_Extensions> property places those
that are used in only a few scripts into each of those scripts; while
still using C<Common> for those used in many scripts. Thus both these
match:

    "0" =~ /\p{sc=Common}/     # Matches
    "0" =~ /\p{scx=Common}/    # Matches

and only the first of these matches:

    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Common}/    # Matches
    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Common}/   # No match

And only the last two of these match:

    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Hiragana}/  # No match
    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Katakana}/  # No match
    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Hiragana}/ # Matches
    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Katakana}/ # Matches

C<Script_Extensions> is thus an improved C<Script>, in which there are
fewer characters in the C<Common> script, and correspondingly more in
other scripts. It is new in Unicode version 6.0, and its data are likely
to change significantly in later releases, as things get sorted out.

(Actually, besides C<Common>, the C<Inherited> script contains
characters that are used in multiple scripts. These are modifier
characters which modify other characters, and inherit the script value
of the controlling character. Some of these are used in many scripts,
and so go into C<Inherited> in both C<Script> and C<Script_Extensions>.
Others are used in just a few scripts, so are in C<Inherited> in
C<Script>, but not in C<Script_Extensions>.)

It is worth stressing that there are several different sets of digits in
Unicode that are equivalent to 0-9 and are matchable by C<\d> in a
regular expression. If they are used in a single language only, they
are in that language's C<Script> and C<Script_Extensions>. If they are
used in more than one script, they will be in C<sc=Common>, but only
if they are used in many scripts should they be in C<scx=Common>.

A complete list of scripts and their shortcuts is in L<perluniprops>.

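The shortcut and compound forms for scripts can be tried with a sketch
such as this one. The sample characters and the hash name are assumptions
made for the example.

```perl
use strict;
use warnings;

my $cyrillic_ya = "\x{044F}";   # CYRILLIC SMALL LETTER YA
my $greek_alpha = "\x{03B1}";   # GREEK SMALL LETTER ALPHA
my $hebrew_alef = "\x{05D0}";   # HEBREW LETTER ALEF
my $hiragana_a  = "\x{3042}";   # HIRAGANA LETTER A

my %script = (
    latin_shortcut    => ("a" =~ /\p{Latin}/)                ? 1 : 0,
    cyrillic_shortcut => ($cyrillic_ya =~ /\p{Cyrillic}/)    ? 1 : 0,
    greek_compound    => ($greek_alpha =~ /\p{Script=Greek}/) ? 1 : 0,
    hebrew_short_name => ($hebrew_alef =~ /\p{sc=hebr}/)     ? 1 : 0,
    hiragana_scx      => ($hiragana_a =~ /\p{Script_Extensions=Hiragana}/) ? 1 : 0,
    digit_is_latin    => ("0" =~ /\p{Latin}/)                ? 1 : 0,
    digit_is_common   => ("0" =~ /\p{Script=Common}/)        ? 1 : 0,
);

printf "%-18s %d\n", $_, $script{$_} for sort keys %script;
```
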
=head3 B<Use of "Is" Prefix>

For backward compatibility (with Perl 5.6), all properties mentioned
so far may have C<Is> or C<Is_> prepended to their name, so C<\P{Is_Lu}>, for
example, is equal to C<\P{Lu}>, and C<\p{IsScript:Arabic}> is equal to
C<\p{Arabic}>.

=head3 B<Blocks>

In addition to B<scripts>, Unicode also defines B<blocks> of
characters. The difference between scripts and blocks is that the
concept of scripts is closer to natural languages, while the concept
of blocks is more of an artificial grouping based on groups of Unicode
characters with consecutive ordinal values. For example, the "Basic Latin"
block is all characters whose ordinals are between 0 and 127, inclusive; in
other words, the ASCII characters. The "Latin" script contains some letters
from this as well as several other blocks, like "Latin-1 Supplement",
"Latin Extended-A", etc., but it does not contain all the characters from
those blocks. It does not, for example, contain the digits 0-9, because
those digits are shared across many scripts, and hence are in the
C<Common> script.

For more about scripts versus blocks, see UAX#24 "Unicode Script Property":
L<http://www.unicode.org/reports/tr24>

The C<Script> or C<Script_Extensions> properties are likely to be the
ones you want to use when processing
natural language; the Block property may occasionally be useful in working
with the nuts and bolts of Unicode.

Block names are matched in the compound form, like C<\p{Block: Arrows}> or
C<\p{Blk=Hebrew}>. Unlike most other properties, only a few block names have a
Unicode-defined short name. But Perl does provide a (slight) shortcut: You
can say, for example C<\p{In_Arrows}> or C<\p{In_Hebrew}>. For backwards
compatibility, the C<In> prefix may be omitted if there is no naming conflict
with a script or any other property, and you can even use an C<Is> prefix
instead in those cases. But it is not a good idea to do this, for a couple of
reasons:

=over 4

=item 1

It is confusing. There are many naming conflicts, and you may forget some.
For example, C<\p{Hebrew}> means the I<script> Hebrew, and NOT the I<block>
Hebrew. But would you remember that 6 months from now?

=item 2

It is unstable. A new version of Unicode may pre-empt the current meaning by
creating a property with the same name. There was a time in very early Unicode
releases when C<\p{Hebrew}> would have matched the I<block> Hebrew; now it
doesn't.

=back

Some people prefer to always use C<\p{Block: foo}> and C<\p{Script: bar}>
instead of the shortcuts, whether for clarity, because they can't remember the
difference between 'In' and 'Is' anyway, or they aren't confident that those who
eventually will read their code will know that difference.

A complete list of blocks and their shortcuts is in L<perluniprops>.

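The distinction between blocks and scripts can be sketched as follows.
The sample characters are assumptions chosen for the example: U+FB1D is a
Hebrew-script letter that lives outside the Hebrew block, in the
Alphabetic Presentation Forms block.

```perl
use strict;
use warnings;

my $left_arrow  = "\x{2190}";   # LEFTWARDS ARROW
my $hebrew_alef = "\x{05D0}";   # HEBREW LETTER ALEF
my $yod_hiriq   = "\x{FB1D}";   # HEBREW LETTER YOD WITH HIRIQ

my %block = (
    arrow_compound   => ($left_arrow =~ /\p{Block: Arrows}/)   ? 1 : 0,
    arrow_in_prefix  => ($left_arrow =~ /\p{In_Arrows}/)       ? 1 : 0,
    digit_in_block   => ("0" =~ /\p{Block=Basic_Latin}/)       ? 1 : 0,
    digit_in_script  => ("0" =~ /\p{Script=Latin}/)            ? 1 : 0,
    alef_in_block    => ($hebrew_alef =~ /\p{Blk=Hebrew}/)     ? 1 : 0,
    yod_in_block     => ($yod_hiriq =~ /\p{Block: Hebrew}/)    ? 1 : 0,
    yod_in_script    => ($yod_hiriq =~ /\p{Script: Hebrew}/)   ? 1 : 0,
);

printf "%-16s %d\n", $_, $block{$_} for sort keys %block;
```

The digit is in the Basic Latin block but not in the Latin script, and
the presentation-form letter is in the Hebrew script but not in the
Hebrew block.
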
9f815e24
KW
597=head3 B<Other Properties>
598
599There are many more properties than the very basic ones described here.
600A complete list is in L<perluniprops>.
601
602Unicode defines all its properties in the compound form, so all single-form
b19eb496
TC
603properties are Perl extensions. Most of these are just synonyms for the
604Unicode ones, but some are genuine extensions, including several that are in
9f815e24
KW
605the compound form. And quite a few of these are actually recommended by Unicode
606(in L<http://www.unicode.org/reports/tr18>).
607
5bff2035
KW
608This section gives some details on all extensions that aren't just
609synonyms for compound-form Unicode properties
610(for those properties, you'll have to refer to the
9f815e24
KW
611L<Unicode Standard|http://www.unicode.org/reports/tr44>.
612
613=over
614
615=item B<C<\p{All}>>
616
617This matches any of the 1_114_112 Unicode code points. It is a synonym for
618C<\p{Any}>.
619
620=item B<C<\p{Alnum}>>
621
622This matches any C<\p{Alphabetic}> or C<\p{Decimal_Number}> character.
623
624=item B<C<\p{Any}>>
625
626This matches any of the 1_114_112 Unicode code points. It is a synonym for
627C<\p{All}>.
628
42581d5d
KW
629=item B<C<\p{ASCII}>>
630
631This matches any of the 128 characters in the US-ASCII character set,
632which is a subset of Unicode.
633
9f815e24
KW
634=item B<C<\p{Assigned}>>
635
636This matches any assigned code point; that is, any code point whose general
637category is not Unassigned (or equivalently, not Cn).
638
639=item B<C<\p{Blank}>>
640
641This is the same as C<\h> and C<\p{HorizSpace}>: A character that changes the
642spacing horizontally.
643
644=item B<C<\p{Decomposition_Type: Non_Canonical}>> (Short: C<\p{Dt=NonCanon}>)
645
646Matches a character that has a non-canonical decomposition.
647
To understand the use of this rarely used property=value combination, it is
necessary to know some basics about decomposition.
Consider a character, say H. It could appear with various marks around it,
such as an acute accent, or a circumflex, or various hooks, circles, arrows,
I<etc.>, above, below, or to one side or the other. There are many
possibilities among the world's languages. The number of combinations is
astronomical, and if there were a character for each combination, it would
soon exhaust Unicode's more than a million possible characters. So Unicode
took a different approach: there is a character for the base H, and a
character for each of the possible marks, and these can be variously combined
to get a final logical character. So a logical character--what appears to be a
single character--can be a sequence of more than one individual character.
This is called an "extended grapheme cluster"; Perl furnishes the C<\X>
regular expression construct to match such sequences.
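As a small illustration of the point above (a sketch, not taken from the
Perl sources), the following compares the number of code points in a
string with the number of extended grapheme clusters that C<\X> finds.
The variable names are made up for this example.

```perl
use strict;
use warnings;

# "E" followed by COMBINING ACUTE ACCENT: two code points,
# but a single logical character.
my $s = "E\x{301}";

my $code_points = length $s;

# Count the matches of \X by assigning the match list to an empty list.
my $clusters = () = $s =~ /\X/g;

printf "%d code points, %d grapheme cluster(s)\n", $code_points, $clusters;
```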
9f815e24
KW
662
But Unicode's intent is to unify the existing character set standards and
practices, and several pre-existing standards have single characters that
mean the same thing as some of these combinations. One such standard is
ISO-8859-1, which has quite a few of these in the Latin-1 range; one is "LATIN
CAPITAL LETTER E WITH ACUTE". Because this character was in this pre-existing
standard, Unicode added it to its repertoire. But this character is considered
by Unicode to be equivalent to the sequence consisting of the character
"LATIN CAPITAL LETTER E" followed by the character "COMBINING ACUTE ACCENT".
KW
671
672"LATIN CAPITAL LETTER E WITH ACUTE" is called a "pre-composed" character, and
b19eb496 673its equivalence with the sequence is called canonical equivalence. All
9f815e24 674pre-composed characters are said to have a decomposition (into the equivalent
b19eb496 675sequence), and the decomposition type is also called canonical.
9f815e24
KW
676
677However, many more characters have a different type of decomposition, a
678"compatible" or "non-canonical" decomposition. The sequences that form these
679decompositions are not considered canonically equivalent to the pre-composed
680character. An example, again in the Latin-1 range, is the "SUPERSCRIPT ONE".
b19eb496 681It is somewhat like a regular digit 1, but not exactly; its decomposition
9f815e24
KW
682into the digit 1 is called a "compatible" decomposition, specifically a
683"super" decomposition. There are several such compatibility
684decompositions (see L<http://www.unicode.org/reports/tr44>), including one
b19eb496 685called "compat", which means some miscellaneous type of decomposition
42581d5d 686that doesn't fit into the decomposition categories that Unicode has chosen.
9f815e24
KW
687
688Note that most Unicode characters don't have a decomposition, so their
689decomposition type is "None".
690
b19eb496
TC
691For your convenience, Perl has added the C<Non_Canonical> decomposition
692type to mean any of the several compatibility decompositions.
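The distinction can be checked with a few sample characters. This is only
a sketch: the hash names are invented for the example, and the three
characters are the ones discussed above (one with a compatibility
decomposition, one with a canonical decomposition, and one with none).

```perl
use strict;
use warnings;

my %sample = (
    'SUPERSCRIPT ONE'                   => "\x{B9}",  # <super> decomposition
    'LATIN CAPITAL LETTER E WITH ACUTE' => "\x{C9}",  # canonical decomposition
    'LATIN CAPITAL LETTER E'            => "E",       # no decomposition
);

# 1 if the character has a non-canonical (compatibility) decomposition.
my %non_canonical =
    map { ($_ => ($sample{$_} =~ /\p{Dt=NonCanon}/ ? 1 : 0)) } keys %sample;

for my $name (sort keys %sample) {
    printf "%-36s %s\n", $name, $non_canonical{$name} ? "matches" : "no match";
}
```

Only SUPERSCRIPT ONE should match, since its decomposition type is one of
the compatibility types.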
9f815e24
KW
693
694=item B<C<\p{Graph}>>
695
696Matches any character that is graphic. Theoretically, this means a character
697that on a printer would cause ink to be used.
698
699=item B<C<\p{HorizSpace}>>
700
b19eb496 701This is the same as C<\h> and C<\p{Blank}>: a character that changes the
9f815e24
KW
702spacing horizontally.
703
42581d5d 704=item B<C<\p{In=*}>>
9f815e24
KW
705
This is a synonym for C<\p{Present_In=*}>.
707
708=item B<C<\p{PerlSpace}>>
709
710This is the same as C<\s>, restricted to ASCII, namely C<S<[ \f\n\r\t]>>.
711
Mnemonic: Perl's (original) space.

=item B<C<\p{PerlWord}>>

This is the same as C<\w>, restricted to ASCII, namely C<[A-Za-z0-9_]>.
717
718Mnemonic: Perl's (original) word.
719
42581d5d 720=item B<C<\p{Posix...}>>
9f815e24 721
b19eb496
TC
722There are several of these, which are equivalents using the C<\p>
723notation for Posix classes and are described in
42581d5d 724L<perlrecharclass/POSIX Character Classes>.
9f815e24
KW
725
726=item B<C<\p{Present_In: *}>> (Short: C<\p{In=*}>)
727
728This property is used when you need to know in what Unicode version(s) a
729character is.
730
731The "*" above stands for some two digit Unicode version number, such as
732C<1.1> or C<4.0>; or the "*" can also be C<Unassigned>. This property will
733match the code points whose final disposition has been settled as of the
734Unicode release given by the version number; C<\p{Present_In: Unassigned}>
735will match those code points whose meaning has yet to be assigned.
736
For example, C<U+0041> "LATIN CAPITAL LETTER A" was present in the very first
Unicode release available, which is C<1.1>, so this property is true for all
valid "*" versions. On the other hand, C<U+1EFF> was not assigned until version
5.1 when it became "LATIN SMALL LETTER Y WITH LOOP", so the only "*" values
that match it are 5.1, 5.2, and later.
742
743Unicode furnishes the C<Age> property from which this is derived. The problem
744with Age is that a strict interpretation of it (which Perl takes) has it
745matching the precise release a code point's meaning is introduced in. Thus
746C<U+0041> would match only 1.1; and C<U+1EFF> only 5.1. This is not usually what
747you want.
748
749Some non-Perl implementations of the Age property may change its meaning to be
750the same as the Perl Present_In property; just be aware of that.
751
752Another confusion with both these properties is that the definition is not
b19eb496
TC
753that the code point has been I<assigned>, but that the meaning of the code point
754has been I<determined>. This is because 66 code points will always be
755unassigned, and so the Age for them is the Unicode version in which the decision
756to make them so was made. For example, C<U+FDD0> is to be permanently
9f815e24 757unassigned to a character, and the decision to do that was made in version 3.1,
b19eb496 758so C<\p{Age=3.1}> matches this character, as also does C<\p{Present_In: 3.1}> and up.
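The difference between the two properties can be seen with the two code
points used as examples above. This is a sketch; the hash keys are just
labels chosen for the example.

```perl
use strict;
use warnings;

my $cap_a  = "A";          # U+0041, present since Unicode 1.1
my $y_loop = "\x{1EFF}";   # U+1EFF, introduced in Unicode 5.1

my %result = (
    # Age matches only the release in which the code point was introduced.
    'A      Age=5.1'        => ($cap_a  =~ /\p{Age=5.1}/         ? 1 : 0),
    'U+1EFF Age=5.1'        => ($y_loop =~ /\p{Age=5.1}/         ? 1 : 0),
    # Present_In matches that release and every later one.
    'A      Present_In=5.1' => ($cap_a  =~ /\p{Present_In: 5.1}/ ? 1 : 0),
    'U+1EFF Present_In=5.0' => ($y_loop =~ /\p{Present_In: 5.0}/ ? 1 : 0),
    'U+1EFF Present_In=5.2' => ($y_loop =~ /\p{In=5.2}/          ? 1 : 0),
);

printf "%-24s %d\n", $_, $result{$_} for sort keys %result;
```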
9f815e24
KW
759
760=item B<C<\p{Print}>>
761
ae5b72c8 762This matches any character that is graphical or blank, except controls.
9f815e24
KW
763
764=item B<C<\p{SpacePerl}>>
765
766This is the same as C<\s>, including beyond ASCII.
767
4d4acfba 768Mnemonic: Space, as modified by Perl. (It doesn't include the vertical tab
b19eb496 769which both the Posix standard and Unicode consider white space.)
9f815e24 770
4364919a
KW
771=item B<C<\p{Title}>> and B<C<\p{Titlecase}>>
772
Under case-sensitive matching, these both match the same code points as
C<\p{General Category=Titlecase_Letter}> (C<\p{gc=lt}>). The difference
is that under C</i> caseless matching, these match the same as
C<\p{Cased}>, whereas C<\p{gc=lt}> matches C<\p{Cased_Letter}>.
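A short check of the behavior just described, assuming a Perl recent
enough to treat C<\p{Title}> as documented here. The variable names are
invented for the example.

```perl
use strict;
use warnings;

my $titlecase = "\x{1C5}";   # LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
my $lowercase = "a";
my $digit     = "1";

# Case-sensitive: only titlecase letters match.
my $title_cs = ($titlecase =~ /\p{Title}/) ? 1 : 0;
my $lower_cs = ($lowercase =~ /\p{Title}/) ? 1 : 0;

# Caseless: any cased character matches, but uncased ones still do not.
my $lower_ci = ($lowercase =~ /\p{Title}/i) ? 1 : 0;
my $digit_ci = ($digit     =~ /\p{Title}/i) ? 1 : 0;

print "case-sensitive: titlecase=$title_cs lowercase=$lower_cs\n";
print "caseless:       lowercase=$lower_ci digit=$digit_ci\n";
```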
777
9f815e24
KW
778=item B<C<\p{VertSpace}>>
779
780This is the same as C<\v>: A character that changes the spacing vertically.
781
782=item B<C<\p{Word}>>
783
b19eb496 784This is the same as C<\w>, including over 100_000 characters beyond ASCII.
9f815e24 785
42581d5d
KW
786=item B<C<\p{XPosix...}>>
787
b19eb496 788There are several of these, which are the standard Posix classes
42581d5d
KW
789extended to the full Unicode range. They are described in
790L<perlrecharclass/POSIX Character Classes>.
791
9f815e24
KW
792=back
793
376d9008 794=head2 User-Defined Character Properties
491fd90a 795
51f494cc
KW
796You can define your own binary character properties by defining subroutines
797whose names begin with "In" or "Is". The subroutines can be defined in any
798package. The user-defined properties can be used in the regular expression
799C<\p> and C<\P> constructs; if you are using a user-defined property from a
800package other than the one you are in, you must specify its package in the
801C<\p> or C<\P> construct.
bac0b425 802
    # assuming property IsForeign defined in Lang::
    package main;  # property package name required
    if ($txt =~ /\p{Lang::IsForeign}+/) { ... }

    package Lang;  # property package name not required
    if ($txt =~ /\p{IsForeign}+/) { ... }
809
810
811Note that the effect is compile-time and immutable once defined.
b19eb496
TC
812However, the subroutines are passed a single parameter, which is 0 if
813case-sensitive matching is in effect and non-zero if caseless matching
56ca34ca
KW
814is in effect. The subroutine may return different values depending on
815the value of the flag, and one set of values will immutably be in effect
b19eb496 816for all case-sensitive matches, and the other set for all case-insensitive
56ca34ca 817matches.
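To make the role of that parameter concrete, here is a sketch of a
property subroutine that returns a different set depending on the flag.
The property name C<IsLowerVowel> and its contents are hypothetical,
chosen only for this illustration; the return format is the one described
in the following paragraphs.

```perl
use strict;
use warnings;

# Hypothetical property: lowercase ASCII vowels, plus their uppercase
# forms when the caller indicates caseless matching.
sub IsLowerVowel {
    my ($caseless) = @_;
    my $lower = "61\n65\n69\n6F\n75\n";   # a e i o u
    my $upper = "41\n45\n49\n4F\n55\n";   # A E I O U
    return $caseless ? $lower . $upper : $lower;
}

my $lower_hit    = ("e" =~ /\p{IsLowerVowel}/)  ? 1 : 0;
my $upper_strict = ("E" =~ /\p{IsLowerVowel}/)  ? 1 : 0;
my $upper_loose  = ("E" =~ /\p{IsLowerVowel}/i) ? 1 : 0;

print "e: $lower_hit, E: $upper_strict, E under /i: $upper_loose\n";
```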
491fd90a 818
Note that if the regular expression is tainted, then Perl will die rather
than call a subroutine whose name is determined by the tainted data.
822
376d9008
JB
823The subroutines must return a specially-formatted string, with one
824or more newline-separated lines. Each line must be one of the following:
491fd90a
JH
825
826=over 4
827
828=item *
829
510254c9
A
830A single hexadecimal number denoting a Unicode code point to include.
831
832=item *
833
Two hexadecimal numbers separated by horizontal whitespace (space or
tab characters) denoting a range of Unicode code points to include.
491fd90a
JH
836
837=item *
838
376d9008 839Something to include, prefixed by "+": a built-in character
bac0b425
JP
840property (prefixed by "utf8::") or a user-defined character property,
841to represent all the characters in that property; two hexadecimal code
842points for a range; or a single hexadecimal code point.
491fd90a
JH
843
844=item *
845
376d9008 846Something to exclude, prefixed by "-": an existing character
bac0b425
JP
847property (prefixed by "utf8::") or a user-defined character property,
848to represent all the characters in that property; two hexadecimal code
849points for a range; or a single hexadecimal code point.
491fd90a
JH
850
851=item *
852
Something to negate, prefixed "!": an existing character
property (prefixed by "utf8::") or a user-defined character property,
to represent all the characters not in that property; two hexadecimal code
points for a range; or a single hexadecimal code point.

=item *

Something to intersect with, prefixed by "&": an existing character
property (prefixed by "utf8::") or a user-defined character property,
to keep only the characters that are also in that property; two
hexadecimal code points for a range; or a single hexadecimal code point.
491fd90a
JH
864
865=back
866
867For example, to define a property that covers both the Japanese
868syllabaries (hiragana and katakana), you can define
869
870 sub InKana {
d88362ca 871 return <<END;
d5822f25
A
872 3040\t309F
873 30A0\t30FF
491fd90a
JH
874 END
875 }
876
d5822f25
A
877Imagine that the here-doc end marker is at the beginning of the line.
878Now you can use C<\p{InKana}> and C<\P{InKana}>.
491fd90a
JH
879
880You could also have used the existing block property names:
881
882 sub InKana {
d88362ca 883 return <<'END';
491fd90a
JH
884 +utf8::InHiragana
885 +utf8::InKatakana
886 END
887 }
888
889Suppose you wanted to match only the allocated characters,
d5822f25 890not the raw block ranges: in other words, you want to remove
491fd90a
JH
891the non-characters:
892
893 sub InKana {
d88362ca 894 return <<'END';
491fd90a
JH
895 +utf8::InHiragana
896 +utf8::InKatakana
897 -utf8::IsCn
898 END
899 }
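The following sketch puts the last definition to work. The here-doc end
marker is at the beginning of the line, as it must be in real code; the
hash of results is invented for the example. U+3040 lies inside the
Hiragana block but is unassigned, so the C<-utf8::IsCn> line should
exclude it.

```perl
use strict;
use warnings;

sub InKana {
    return <<'END';
+utf8::InHiragana
+utf8::InKatakana
-utf8::IsCn
END
}

my %is_kana = map { ($_ => (chr(hex $_) =~ /\p{InKana}/ ? 1 : 0)) }
              qw(3042 30A2 3040 0041);

# 3042 HIRAGANA LETTER A, 30A2 KATAKANA LETTER A,
# 3040 unassigned, 0041 LATIN CAPITAL LETTER A
printf "U+%s: %s\n", $_, $is_kana{$_} ? "kana" : "not kana"
    for sort keys %is_kana;
```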
900
901The negation is useful for defining (surprise!) negated classes.
902
903 sub InNotKana {
d88362ca 904 return <<'END';
491fd90a
JH
905 !utf8::InHiragana
906 -utf8::InKatakana
907 +utf8::IsCn
908 END
909 }
910
bac0b425
JP
911Intersection is useful for getting the common characters matched by
912two (or more) classes.
913
    sub InFooAndBar {
        return <<'END';
    +main::InFoo
    &main::InBar
    END
    }
920
ac036724 921It's important to remember not to use "&" for the first set; that
b19eb496 922would be intersecting with nothing, resulting in an empty set.
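Here is a self-contained sketch of intersection. All three property names
are hypothetical and exist only for this example; the first two are
defined with plain hexadecimal ranges so that nothing depends on the
Unicode database.

```perl
use strict;
use warnings;

# A-Z
sub IsUpperAscii { return "41\t5A\n" }

# 0-9, A-F, a-f
sub IsHexChar    { return "30\t39\n41\t46\n61\t66\n" }

# Start from the first set with "+", then intersect with "&".
sub IsUpperHex {
    return "+main::IsUpperAscii\n"
         . "&main::IsHexChar\n";
}

my %in_both = map { ($_ => (/\p{IsUpperHex}/ ? 1 : 0)) } qw(C G c 7);

print "$_: $in_both{$_}\n" for sort keys %in_both;
```

Only "C" is in both sets, so only it should match.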
bac0b425 923
68585b5e 924=head2 User-Defined Case Mappings (for serious hackers only)
822502e5 925
5d1892be
KW
926B<This feature has been removed as of Perl 5.16.>
927The CPAN module L<Unicode::Casing> provides better functionality without
928the drawbacks that this feature had. If you are using a Perl earlier
929than 5.16, this feature was most fully documented in the 5.14 version of
930this pod:
931L<http://perldoc.perl.org/5.14.0/perlunicode.html#User-Defined-Case-Mappings-%28for-serious-hackers-only%29>
3a2263fe 932
376d9008 933=head2 Character Encodings for Input and Output
8cbd9a7a 934
7221edc9 935See L<Encode>.
8cbd9a7a 936
c29a771d 937=head2 Unicode Regular Expression Support Level
776f8809 938
b19eb496
TC
939The following list of Unicode supported features for regular expressions describes
940all features currently directly supported by core Perl. The references to "Level N"
8158862b 941and the section numbers refer to the Unicode Technical Standard #18,
b19eb496 942"Unicode Regular Expressions", version 13, from August 2008.
776f8809
JH
943
944=over 4
945
946=item *
947
948Level 1 - Basic Unicode Support
949
755789c0
KW
950 RL1.1 Hex Notation - done [1]
951 RL1.2 Properties - done [2][3]
952 RL1.2a Compatibility Properties - done [4]
953 RL1.3 Subtraction and Intersection - MISSING [5]
954 RL1.4 Simple Word Boundaries - done [6]
955 RL1.5 Simple Loose Matches - done [7]
956 RL1.6 Line Boundaries - MISSING [8][9]
957 RL1.7 Supplementary Code Points - done [10]
958
959 [1] \x{...}
960 [2] \p{...} \P{...}
961 [3] supports not only minimal list, but all Unicode character
d9742aa3 962 properties (see Unicode Character Properties above)
755789c0
KW
963 [4] \d \D \s \S \w \W \X [:prop:] [:^prop:]
964 [5] can use regular expression look-ahead [a] or
965 user-defined character properties [b] to emulate set
966 operations
967 [6] \b \B
968 [7] note that Perl does Full case-folding in matching (but with
969 bugs), not Simple: for example U+1F88 is equivalent to
e4d56f70
NC
970 U+1F00 U+03B9, instead of just U+1F80. This difference
971 matters mainly for certain Greek capital letters with certain
755789c0
KW
972 modifiers: the Full case-folding decomposes the letter,
973 while the Simple case-folding would map it to a single
974 character.
975 [8] should do ^ and $ also on U+000B (\v in C), FF (\f), CR
976 (\r), CRLF (\r\n), NEL (U+0085), LS (U+2028), and PS
977 (U+2029); should also affect <>, $., and script line
978 numbers; should not split lines within CRLF [c] (i.e. there
979 is no empty line between \r and \n)
        [9] Linebreaking conformant with UAX#14 "Unicode Line Breaking
            Algorithm" is available through the Unicode::LineBreak
            module.
       [10] UTF-8/UTF-EBCDIC used in Perl allows not only U+10000 to
            U+10FFFF but also beyond U+10FFFF
7207e29d 985
237bad5b 986[a] You can mimic class subtraction using lookahead.
8158862b 987For example, what UTS#18 might write as
29bdacb8 988
dbe420b4
JH
989 [{Greek}-[{UNASSIGNED}]]
990
991in Perl can be written as:
992
1d81abf3
JH
993 (?!\p{Unassigned})\p{InGreekAndCoptic}
994 (?=\p{Assigned})\p{InGreekAndCoptic}
dbe420b4
JH
995
996But in this particular example, you probably really want
997
    \p{Greek}
999
1000which will match assigned characters known to be part of the Greek script.
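The lookahead emulation from [a] can be tried out as follows. This is a
sketch with invented variable names; U+0378 is used because it lies in
the Greek and Coptic block but is unassigned.

```perl
use strict;
use warnings;

my $alpha      = "\x{3B1}";   # GREEK SMALL LETTER ALPHA
my $unassigned = "\x{378}";   # in the block, but not assigned

# Block membership minus unassigned code points, via negative lookahead.
my $subtract = qr/(?!\p{Unassigned})\p{InGreekAndCoptic}/;

my $block_only      = ($unassigned =~ /\p{InGreekAndCoptic}/) ? 1 : 0;
my $alpha_kept      = ($alpha      =~ $subtract)              ? 1 : 0;
my $unassigned_kept = ($unassigned =~ $subtract)              ? 1 : 0;

print "block alone matches U+0378: $block_only\n";
print "with subtraction: alpha=$alpha_kept U+0378=$unassigned_kept\n";
```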
29bdacb8 1001
Also see the L<Unicode::Regex::Set> module, which implements the full
UTS#18 grouping, intersection, union, and removal (subtraction) syntax.
1004
1005[b] '+' for union, '-' for removal (set-difference), '&' for intersection
1006(see L</"User-Defined Character Properties">)
1007
1008[c] Try the C<:crlf> layer (see L<PerlIO>).
5ca1ac52 1009
776f8809
JH
1010=item *
1011
1012Level 2 - Extended Unicode Support
1013
755789c0
KW
1014 RL2.1 Canonical Equivalents - MISSING [10][11]
1015 RL2.2 Default Grapheme Clusters - MISSING [12]
1016 RL2.3 Default Word Boundaries - MISSING [14]
1017 RL2.4 Default Loose Matches - MISSING [15]
1018 RL2.5 Name Properties - DONE
1019 RL2.6 Wildcard Properties - MISSING
8158862b 1020
755789c0
KW
1021 [10] see UAX#15 "Unicode Normalization Forms"
1022 [11] have Unicode::Normalize but not integrated to regexes
1023 [12] have \X but we don't have a "Grapheme Cluster Mode"
1024 [14] see UAX#29, Word Boundaries
1025 [15] see UAX#21 "Case Mappings"
776f8809
JH
1026
1027=item *
1028
8158862b
TS
1029Level 3 - Tailored Support
1030
755789c0
KW
1031 RL3.1 Tailored Punctuation - MISSING
1032 RL3.2 Tailored Grapheme Clusters - MISSING [17][18]
1033 RL3.3 Tailored Word Boundaries - MISSING
1034 RL3.4 Tailored Loose Matches - MISSING
1035 RL3.5 Tailored Ranges - MISSING
1036 RL3.6 Context Matching - MISSING [19]
1037 RL3.7 Incremental Matches - MISSING
8158862b 1038 ( RL3.8 Unicode Set Sharing )
755789c0
KW
1039 RL3.9 Possible Match Sets - MISSING
1040 RL3.10 Folded Matching - MISSING [20]
1041 RL3.11 Submatchers - MISSING
1042
       [17] see UTS#10 "Unicode Collation Algorithm"
1044 [18] have Unicode::Collate but not integrated to regexes
1045 [19] have (?<=x) and (?=x), but look-aheads or look-behinds
1046 should see outside of the target substring
1047 [20] need insensitive matching for linguistic features other
1048 than case; for example, hiragana to katakana, wide and
1049 narrow, simplified Han to traditional Han (see UTR#30
1050 "Character Foldings")
776f8809
JH
1051
1052=back
1053
c349b1b9
JH
1054=head2 Unicode Encodings
1055
376d9008
JB
1056Unicode characters are assigned to I<code points>, which are abstract
1057numbers. To use these numbers, various encodings are needed.
c349b1b9
JH
1058
1059=over 4
1060
c29a771d 1061=item *
5cb3728c
RB
1062
1063UTF-8
c349b1b9 1064
6d4f9cf2
KW
1065UTF-8 is a variable-length (1 to 4 bytes), byte-order independent
1066encoding. For ASCII (and we really do mean 7-bit ASCII, not another
10678-bit encoding), UTF-8 is transparent.
c349b1b9 1068
8c007b5a 1069The following table is from Unicode 3.2.
05632f9a 1070
   Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte

     U+0000..U+007F       00..7F
     U+0080..U+07FF     * C2..DF    80..BF
     U+0800..U+0FFF       E0      * A0..BF    80..BF
     U+1000..U+CFFF       E1..EC    80..BF    80..BF
     U+D000..U+D7FF       ED        80..9F    80..BF
     U+D800..U+DFFF       +++++ utf16 surrogates, not legal utf8 +++++
     U+E000..U+FFFF       EE..EF    80..BF    80..BF
    U+10000..U+3FFFF      F0      * 90..BF    80..BF    80..BF
    U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
   U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF
e1b711da 1083
b19eb496 1084Note the gaps marked by "*" before several of the byte entries above. These are
e1b711da
KW
1085caused by legal UTF-8 avoiding non-shortest encodings: it is technically
1086possible to UTF-8-encode a single code point in different ways, but that is
1087explicitly forbidden, and the shortest possible encoding should always be used
1088(and that is what Perl does).
37361303 1089
376d9008 1090Another way to look at it is via bits:
05632f9a 1091
                  Code Points  1st Byte  2nd Byte  3rd Byte  4th Byte

                     0aaaaaaa  0aaaaaaa
             00000bbbbbaaaaaa  110bbbbb  10aaaaaa
             ccccbbbbbbaaaaaa  1110cccc  10bbbbbb  10aaaaaa
   00000dddccccccbbbbbbaaaaaa  11110ddd  10cccccc  10bbbbbb  10aaaaaa
05632f9a 1098
9f815e24 1099As you can see, the continuation bytes all begin with "10", and the
e1b711da 1100leading bits of the start byte tell how many bytes there are in the
05632f9a
JH
1101encoded character.
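The bit layout above can be turned into code directly. The following is
a sketch only: C<utf8_bytes> is a name invented for this example, it
handles just the standard one- to four-byte forms, and it makes no
attempt to reject surrogates. Its output is compared with what the core
L<Encode> module produces.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the UTF-8 bytes (as integers) for a code point, following
# the bit patterns in the table above.
sub utf8_bytes {
    my ($cp) = @_;
    return ($cp) if $cp < 0x80;
    return (0xC0 | ($cp >> 6),
            0x80 | ($cp & 0x3F)) if $cp < 0x800;
    return (0xE0 | ($cp >> 12),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F)) if $cp < 0x10000;
    return (0xF0 | ($cp >> 18),
            0x80 | (($cp >> 12) & 0x3F),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F));
}

sub hex_bytes { return join " ", map { sprintf "%02X", $_ } @_ }

for my $cp (0x41, 0xE9, 0x2660, 0x1F600) {
    my $by_hand   = hex_bytes(utf8_bytes($cp));
    my $by_encode = hex_bytes(unpack "C*", encode("UTF-8", chr $cp));
    printf "U+%04X  %-12s %s\n", $cp, $by_hand,
        $by_hand eq $by_encode ? "(agrees with Encode)" : "(differs)";
}
```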
1102
6d4f9cf2
KW
1103The original UTF-8 specification allowed up to 6 bytes, to allow
1104encoding of numbers up to 0x7FFF_FFFF. Perl continues to allow those,
1105and has extended that up to 13 bytes to encode code points up to what
1106can fit in a 64-bit word. However, Perl will warn if you output any of
b19eb496 1107these as being non-portable; and under strict UTF-8 input protocols,
6d4f9cf2
KW
1108they are forbidden.
1109
1110The Unicode non-character code points are also disallowed in UTF-8 in
1111"open interchange". See L</Non-character code points>.
1112
c29a771d 1113=item *
5cb3728c
RB
1114
1115UTF-EBCDIC
dbe420b4 1116
376d9008 1117Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.
dbe420b4 1118
c29a771d 1119=item *
5cb3728c 1120
1e54db1a 1121UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)
c349b1b9 1122
1bfb14c4
JH
The following items are mostly for reference and general Unicode
knowledge; Perl doesn't use these constructs internally.
dbe420b4 1125
b19eb496
TC
1126Like UTF-8, UTF-16 is a variable-width encoding, but where
1127UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units.
1128All code points occupy either 2 or 4 bytes in UTF-16: code points
1129C<U+0000..U+FFFF> are stored in a single 16-bit unit, and code
1bfb14c4 1130points C<U+10000..U+10FFFF> in two 16-bit units. The latter case is
c349b1b9
JH
1131using I<surrogates>, the first 16-bit unit being the I<high
1132surrogate>, and the second being the I<low surrogate>.
1133
376d9008 1134Surrogates are code points set aside to encode the C<U+10000..U+10FFFF>
c349b1b9 1135range of Unicode code points in pairs of 16-bit units. The I<high
9f815e24 1136surrogates> are the range C<U+D800..U+DBFF> and the I<low surrogates>
376d9008 1137are the range C<U+DC00..U+DFFF>. The surrogate encoding is
c349b1b9 1138
d88362ca
KW
    $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
    $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
c349b1b9
JH
1141
1142and the decoding is
1143
d88362ca 1144 $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
c349b1b9 1145
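As a worked example of the two formulas, the sketch below round-trips a
code point through its surrogate pair. The function names are invented
for this illustration, and shifts and masks are used in place of the
division and modulus; for non-negative integers they give the same
results.

```perl
use strict;
use warnings;

# Split a supplementary code point into (high, low) surrogates.
sub to_surrogates {
    my ($uni) = @_;
    die sprintf("U+%04X is outside U+10000..U+10FFFF\n", $uni)
        if $uni < 0x10000 || $uni > 0x10FFFF;
    my $offset = $uni - 0x10000;          # a 20-bit value
    return (0xD800 + ($offset >> 10),     # top 10 bits
            0xDC00 + ($offset & 0x3FF));  # bottom 10 bits
}

# Recombine a surrogate pair into the code point it encodes.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
}

my ($hi, $lo) = to_surrogates(0x1F600);
printf "U+1F600 -> %04X %04X -> U+%04X\n", $hi, $lo, from_surrogates($hi, $lo);
```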
376d9008 1146Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16
c349b1b9 1147itself can be used for in-memory computations, but if storage or
376d9008
JB
1148transfer is required either UTF-16BE (big-endian) or UTF-16LE
1149(little-endian) encodings must be chosen.
c349b1b9
JH
1150
1151This introduces another problem: what if you just know that your data
376d9008
JB
1152is UTF-16, but you don't know which endianness? Byte Order Marks, or
1153BOMs, are a solution to this. A special character has been reserved
86bbd6d1 1154in Unicode to function as a byte order marker: the character with the
376d9008 1155code point C<U+FEFF> is the BOM.
042da322 1156
c349b1b9 1157The trick is that if you read a BOM, you will know the byte order,
376d9008
JB
1158since if it was written on a big-endian platform, you will read the
1159bytes C<0xFE 0xFF>, but if it was written on a little-endian platform,
1160you will read the bytes C<0xFF 0xFE>. (And if the originating platform
1161was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.)
042da322 1162
The way this trick works is that the character with the code point
C<U+FFFE> is not supposed to be in input streams, so the
sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in
little-endian format" and cannot be "C<U+FFFE>, represented in big-endian
format".
1168
1169Surrogates have no meaning in Unicode outside their use in pairs to
1170represent other code points. However, Perl allows them to be
1171represented individually internally, for example by saying
f651977e
TC
1172C<chr(0xD801)>, so that all code points, not just those valid for open
1173interchange, are
6d4f9cf2
KW
representable. Unicode does define semantics for them, such as a
General Category of "Cs". But because their use is somewhat dangerous,
b19eb496
TC
1176Perl will warn (using the warning category "surrogate", which is a
1177sub-category of "utf8") if an attempt is made
6d4f9cf2
KW
1178to do things like take the lower case of one, or match
1179case-insensitively, or to output them. (But don't try this on Perls
1180before 5.14.)
c349b1b9 1181
c29a771d 1182=item *
5cb3728c 1183
1e54db1a 1184UTF-32, UTF-32BE, UTF-32LE
c349b1b9
JH
1185
The UTF-32 family is pretty much like the UTF-16 family, except that
042da322 1187the units are 32-bit, and therefore the surrogate scheme is not
f651977e 1188needed. UTF-32 is a fixed-width encoding. The BOM signatures are
b19eb496 1189C<0x00 0x00 0xFE 0xFF> for BE and C<0xFF 0xFE 0x00 0x00> for LE.
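The BOM signatures for UTF-8, UTF-16, and UTF-32 can be checked with a
few byte comparisons. The following is a sketch, not something Perl does
for you: C<sniff_bom> is a hypothetical helper, and it assumes that a
leading C<FF FE 00 00> is meant as UTF-32LE rather than a UTF-16LE BOM
followed by U+0000, which is why the four-byte signatures are tested
first.

```perl
use strict;
use warnings;

# Signatures in the order they must be tested: longest first, because
# the UTF-32LE signature begins with the UTF-16LE one.
my @signatures = (
    [ 'UTF-32BE' => "\x00\x00\xFE\xFF" ],
    [ 'UTF-32LE' => "\xFF\xFE\x00\x00" ],
    [ 'UTF-8'    => "\xEF\xBB\xBF"     ],
    [ 'UTF-16BE' => "\xFE\xFF"         ],
    [ 'UTF-16LE' => "\xFF\xFE"         ],
);

# Return the encoding name suggested by a leading BOM, or undef if none.
sub sniff_bom {
    my ($bytes) = @_;
    for my $sig (@signatures) {
        my ($name, $bom) = @$sig;
        return $name if index($bytes, $bom) == 0;
    }
    return undef;
}

for my $data ("\xFF\xFEa\x00", "\xFE\xFF\x00a", "\xEF\xBB\xBFabc", "abc") {
    printf "%-8s\n", sniff_bom($data) // "no BOM";
}
```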
c349b1b9 1190
c29a771d 1191=item *
5cb3728c
RB
1192
1193UCS-2, UCS-4
c349b1b9 1194
b19eb496 1195Legacy, fixed-width encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit
376d9008 1196encoding. Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
339cfa0e 1197because it does not use surrogates. UCS-4 is a 32-bit encoding,
b19eb496 1198functionally identical to UTF-32 (the difference being that
ee88f7b6 1199UCS-4 forbids neither surrogates nor code points larger than 0x10_FFFF).
c349b1b9 1200
c29a771d 1201=item *
5cb3728c
RB
1202
1203UTF-7
c349b1b9 1204
376d9008
JB
1205A seven-bit safe (non-eight-bit) encoding, which is useful if the
1206transport or storage is not eight-bit safe. Defined by RFC 2152.
c349b1b9 1207
95a1a48b
JH
1208=back
1209
6d4f9cf2
KW
1210=head2 Non-character code points
1211
121266 code points are set aside in Unicode as "non-character code points".
1213These all have the Unassigned (Cn) General Category, and they never will
1214be assigned. These are never supposed to be in legal Unicode input
1215streams, so that code can use them as sentinels that can be mixed in
1216with character data, and they always will be distinguishable from that data.
1217To keep them out of Perl input streams, strict UTF-8 should be
specified, such as by using the layer C<:encoding(UTF-8)>. The
1219non-character code points are the 32 between U+FDD0 and U+FDEF, and the
122034 code points U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF.
1221Some people are under the mistaken impression that these are "illegal",
1222but that is not true. An application or cooperating set of applications
1223can legally use them at will internally; but these code points are
42581d5d
KW
1224"illegal for open interchange". Therefore, Perl will not accept these
1225from input streams unless lax rules are being used, and will warn
b19eb496 1226(using the warning category "nonchar", which is a sub-category of "utf8") if
42581d5d
KW
1227an attempt is made to output them.
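The count of 66 can be verified by building the list from the description
above. This sketch assumes that the Unicode property
C<Noncharacter_Code_Point> is available under that name in C<\p{}>; the
variable names are invented for the example.

```perl
use strict;
use warnings;

my @nonchars = (
    0xFDD0 .. 0xFDEF,                                   # 32 code points
    # The last two code points of each of the 17 planes: 34 code points.
    map { (($_ << 16) | 0xFFFE, ($_ << 16) | 0xFFFF) } 0 .. 0x10,
);

my $count = @nonchars;

# True if every code point in the list has the property.
my $all_flagged =
    !grep { chr($_) !~ /\p{Noncharacter_Code_Point}/ } @nonchars;

printf "%d non-character code points; all have the property: %s\n",
    $count, $all_flagged ? "yes" : "no";
```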
1228
1229=head2 Beyond Unicode code points
1230
1231The maximum Unicode code point is U+10FFFF. But Perl accepts code
1232points up to the maximum permissible unsigned number available on the
1233platform. However, Perl will not accept these from input streams unless
1234lax rules are being used, and will warn (using the warning category
b19eb496 1235"non_unicode", which is a sub-category of "utf8") if an attempt is made to
42581d5d
KW
operate on or output them. For example, C<uc(chr(0x11_0000))> will generate
1237this warning, returning the input parameter as its result, as the upper
ee88f7b6 1238case of every non-Unicode code point is the code point itself.
6d4f9cf2 1239
0d7c09bb
JH
1240=head2 Security Implications of Unicode
1241
e1b711da
KW
1242Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.
1243Also, note the following:
1244
0d7c09bb
JH
1245=over 4
1246
1247=item *
1248
1249Malformed UTF-8
bf0fa0b2 1250
42581d5d 1251Unfortunately, the original specification of UTF-8 leaves some room for
bf0fa0b2 1252interpretation of how many bytes of encoded output one should generate
376d9008
JB
1253from one input Unicode character. Strictly speaking, the shortest
1254possible sequence of UTF-8 bytes should be generated,
1255because otherwise there is potential for an input buffer overflow at
feda178f 1256the receiving end of a UTF-8 connection. Perl always generates the
e1b711da 1257shortest length UTF-8, and with warnings on, Perl will warn about
376d9008 1258non-shortest length UTF-8 along with other malformations, such as the
b19eb496 1259surrogates, which are not Unicode code points valid for interchange.
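A sketch of how non-shortest sequences can be kept out of a program,
using the strict "UTF-8" encoding of the core L<Encode> module with
C<FB_CROAK>. The helper name C<strict_decode> is invented for this
example; it returns C<undef> for input that is not well-formed.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode bytes as strict UTF-8; return undef if they are malformed.
sub strict_decode {
    my ($bytes) = @_;
    my $copy = $bytes;   # decode() may modify its argument when CHECK is set
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
}

my $plain     = strict_decode("\x2F");           # "/" in its shortest form
my $overlong  = strict_decode("\xC0\xAF");       # "/" in a 2-byte form
my $surrogate = strict_decode("\xED\xA0\x80");   # U+D800 encoded directly

printf "shortest form:  %s\n", defined $plain     ? "accepted" : "rejected";
printf "overlong form:  %s\n", defined $overlong  ? "accepted" : "rejected";
printf "lone surrogate: %s\n", defined $surrogate ? "accepted" : "rejected";
```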
bf0fa0b2 1260
0d7c09bb
JH
1261=item *
1262
68693f9e 1263Regular expression pattern matching may surprise you if you're not
b19eb496
TC
1264accustomed to Unicode. Starting in Perl 5.14, several pattern
1265modifiers are available to control this, called the character set
42581d5d
KW
1266modifiers. Details are given in L<perlre/Character set modifiers>.
1267
1268=back
0d7c09bb 1269
376d9008 1270As discussed elsewhere, Perl has one foot (two hooves?) planted in
1bfb14c4
JH
1271each of two worlds: the old world of bytes and the new world of
1272characters, upgrading from bytes to characters when necessary.
376d9008
JB
1273If your legacy code does not explicitly use Unicode, no automatic
1274switch-over to characters should happen. Characters shouldn't get
1bfb14c4
JH
1275downgraded to bytes, either. It is possible to accidentally mix bytes
1276and characters, however (see L<perluniintro>), in which case C<\w> in
42581d5d 1277regular expressions might start behaving differently (unless the C</a>
b19eb496 1278modifier is in effect). Review your code. Use warnings and the C<strict> pragma.
0d7c09bb 1279
c349b1b9
JH
1280=head2 Unicode in Perl on EBCDIC
1281
376d9008
JB
1282The way Unicode is handled on EBCDIC platforms is still
1283experimental. On such platforms, references to UTF-8 encoding in this
1284document and elsewhere should be read as meaning the UTF-EBCDIC
1285specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues
c349b1b9 1286are specifically discussed. There is no C<utfebcdic> pragma or
376d9008 1287":utfebcdic" layer; rather, "utf8" and ":utf8" are reused to mean
86bbd6d1
PN
1288the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic>
1289for more discussion of the issues.
c349b1b9 1290
b310b053
JH
1291=head2 Locales
1292
See L<perllocale/Unicode and UTF-8>.
b310b053 1294
1aad1664
JH
1295=head2 When Unicode Does Not Happen
1296
1297While Perl does have extensive ways to input and output in Unicode,
b19eb496
TC
1298and a few other "entry points" like the @ARGV array (which can sometimes be
1299interpreted as UTF-8), there are still many places where Unicode
1300(in some encoding or another) could be given as arguments or received as
1aad1664
JH
1301results, or both, but it is not.
1302
e1b711da
KW
1303The following are such interfaces. Also, see L</The "Unicode Bug">.
1304For all of these interfaces Perl
6cd4dd6c 1305currently (as of 5.8.3) simply assumes byte strings both as arguments
b19eb496 1306and results, or UTF-8 strings if the (problematic) C<encoding> pragma has been used.
1aad1664 1307
b19eb496
TC
1308One reason that Perl does not attempt to resolve the role of Unicode in
1309these situations is that the answers are highly dependent on the operating
1aad1664 1310system and the file system(s). For example, whether filenames can be
b19eb496
TC
1311in Unicode and in exactly what kind of encoding, is not exactly a
1312portable concept. Similarly for C<qx> and C<system>: how well will the
1313"command-line interface" (and which of them?) handle Unicode?
1aad1664
JH
1314
1315=over 4
1316
557a2462
RB
1317=item *
1318
51f494cc 1319chdir, chmod, chown, chroot, exec, link, lstat, mkdir,
1e8e8236 1320rename, rmdir, stat, symlink, truncate, unlink, utime, -X
557a2462
RB
1321
1322=item *
1323
1324%ENV
1325
1326=item *
1327
1328glob (aka the <*>)
1329
1330=item *
1aad1664 1331
557a2462 1332open, opendir, sysopen
1aad1664 1333
557a2462 1334=item *
1aad1664 1335
557a2462 1336qx (aka the backtick operator), system
1aad1664 1337
557a2462 1338=item *
1aad1664 1339
557a2462 1340readdir, readlink
1aad1664
JH
1341
1342=back
1343
e1b711da
KW
1344=head2 The "Unicode Bug"
1345
42581d5d
KW
The term "Unicode bug" has been applied to an inconsistency
on ASCII platforms with the
Unicode code points in the Latin-1 Supplement block, that
is, between 128 and 255. Without a locale specified, unlike all other
characters or code points, these characters behave very differently under
byte semantics than under character semantics, unless
C<use feature 'unicode_strings'> is specified.
(The lesson here is to specify C<unicode_strings> to avoid the
headaches.)
KW
1355
1356In character semantics they are interpreted as Unicode code points, which means
1357they have the same semantics as Latin-1 (ISO-8859-1).
1358
1359In byte semantics, they are considered to be unassigned characters, meaning
1360that the only semantics they have is their ordinal numbers, and that they are
1361not members of various character classes. None are considered to match C<\w>
42581d5d 1362for example, but all match C<\W>.
e1b711da
KW

The behavior is known to have effects on these areas:

=over 4

=item *

Changing the case of a scalar, that is, using C<uc()>, C<ucfirst()>, C<lc()>,
and C<lcfirst()>, or C<\L>, C<\U>, C<\u> and C<\l> in regular expression
substitutions.

=item *

Using caseless (C</i>) regular expression matching.

=item *

Matching any of several properties in regular expressions, namely C<\b>,
C<\B>, C<\s>, C<\S>, C<\w>, C<\W>, and all the Posix character classes
I<except> C<[[:ascii:]]>.

=item *

In C<quotemeta> or its inline equivalent C<\Q>, no code points above 127
are quoted in UTF-8 encoded strings, but in byte encoded strings, code
points between 128 and 255 are always quoted.

=back

This behavior can lead to unexpected results in which a string's semantics
suddenly change if a code point above 255 is appended to or removed from it,
which changes the string's semantics from byte to character or vice versa. As
an example, consider the following program and its output:

    $ perl -le'
        no feature "unicode_strings";
        $s1 = "\xC2";
        $s2 = "\x{2660}";
        for ($s1, $s2, $s1.$s2) {
            print /\w/ || 0;
        }
    '
    0
    0
    1

If there's no C<\w> in C<$s1> or in C<$s2>, why does their concatenation have one?

This anomaly stems from Perl's attempt to not disturb older programs that
didn't use Unicode, and hence had no semantics for characters outside of the
ASCII range (except in a locale), along with Perl's desire to add Unicode
support seamlessly. The result wasn't seamless: these characters were
orphaned.

Starting in Perl 5.14, C<use feature 'unicode_strings'> can be used to
cause Perl to use Unicode semantics on all string operations within the
scope of the feature subpragma. Regular expressions compiled in its
scope retain that behavior even when executed or compiled into larger
regular expressions outside the scope. (The pragma does not, however,
affect the C<quotemeta> behavior. Nor does it affect the deprecated
user-defined case changing operations--these still require a UTF-8
encoded string to operate.)
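
As a small illustration of the feature's effect, here is a sketch of the
program shown earlier in this section, rewritten to run with
C<unicode_strings> enabled. It assumes Perl 5.14 or later; the variable
names C<@matches> and C<$lower> are chosen only for this example.

```perl
use strict;
use warnings;
use feature 'unicode_strings';    # affects regexes from Perl 5.14 on

my $s1 = "\xC2";        # U+00C2, a letter in the Latin-1 Supplement
my $s2 = "\x{2660}";    # U+2660, a symbol, so never a word character
my @matches = map { /\w/ ? 1 : 0 } ($s1, $s2, $s1 . $s2);
print "@matches\n";     # prints "1 0 1"

# Case mapping follows Unicode rules as well.
my $lower = lc($s1);    # U+00E2
```

With the feature on, C<$s1> matches C<\w> by itself, so concatenating it
with C<$s2> no longer changes the answer.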

In Perl 5.12, the subpragma affected casing changes, but not regular
expressions. See L<perlfunc/lc> for details on how this pragma works in
combination with various others for casing.

For earlier Perls, or when a string is passed to a function outside the
subpragma's scope, a workaround is to always call C<utf8::upgrade($string)>,
or to use the standard module L<Encode>. Also, a scalar that has any characters
whose ordinal is 0x100 or above, or which were specified using either of the
C<\N{...}> notations, will automatically have character semantics.
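
The C<utf8::upgrade> workaround can be sketched as follows. The example
turns C<unicode_strings> off explicitly so that the default behavior is
visible; C<$before> and C<$after> are names invented for the illustration.

```perl
use strict;
use warnings;
no feature 'unicode_strings';    # make the default explicit

my $s = "\xC2";
my $before = ($s =~ /\w/) ? 1 : 0;    # byte semantics: no match

utf8::upgrade($s);                    # same characters, UTF-8 internally
my $after  = ($s =~ /\w/) ? 1 : 0;    # character semantics: match

print "before=$before after=$after\n";    # prints "before=0 after=1"
```

The string compares equal to C<"\xC2"> before and after the call; only
its internal representation, and therefore the semantics Perl applies,
has changed.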

=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)

Sometimes (see L</"When Unicode Does Not Happen"> or L</The "Unicode Bug">)
there are situations where you simply need to force a byte
string into UTF-8, or vice versa. The low-level calls
C<utf8::upgrade($bytestring)> and C<utf8::downgrade($utf8string[, FAIL_OK])>
are the answers.

Note that C<utf8::downgrade()> can fail if the string contains characters
that don't fit into a byte. On failure it dies, or returns false if
C<FAIL_OK> is true.

Calling either function on a string that already is in the desired state is a
no-op.
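
A minimal sketch of both calls, including the two ways
C<utf8::downgrade()> can report failure, might look like this. The
variable names are illustrative only.

```perl
use strict;
use warnings;

my $str = "caf\xE9";              # four characters, stored as four bytes
utf8::upgrade($str);              # now stored as UTF-8 internally
my $flag_after_upgrade = utf8::is_utf8($str) ? 1 : 0;

my $ok = utf8::downgrade($str);   # succeeds: every character fits in a byte
my $flag_after_downgrade = utf8::is_utf8($str) ? 1 : 0;

my $wide = "\x{2660}";            # cannot be represented in a single byte
my $fail_ok = utf8::downgrade($wide, 1) ? 1 : 0;     # FAIL_OK: returns false
my $died = eval { utf8::downgrade($wide); 1 } ? 0 : 1;   # without it: dies

print "$flag_after_upgrade $flag_after_downgrade $fail_ok $died\n";
```

Neither call changes which characters the string holds, so C<$str> still
compares equal to C<"caf\xE9"> at every step.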

=head2 Using Unicode in XS

If you want to handle Perl Unicode in XS extensions, you may find the
following C APIs useful. See also L<perlguts/"Unicode Support"> for an
explanation about Unicode at the XS level, and L<perlapi> for the API
details.

=over 4

=item *

C<DO_UTF8(sv)> returns true if the C<UTF8> flag is on and the bytes
pragma is not in effect. C<SvUTF8(sv)> returns true if the C<UTF8>
flag is on; the bytes pragma is ignored. The C<UTF8> flag being on
does B<not> mean that there are any characters of code points greater
than 255 (or 127) in the scalar or that there are even any characters
in the scalar. What the C<UTF8> flag means is that the sequence of
octets in the representation of the scalar is the sequence of UTF-8
encoded code points of the characters of a string. The C<UTF8> flag
being off means that each octet in this representation encodes a
single character with code point 0..255 within the string. Perl's
Unicode model is not to use UTF-8 until it is absolutely necessary.

=item *

C<uvchr_to_utf8(buf, chr)> writes a Unicode character code point into
a buffer encoding the code point as UTF-8, and returns a pointer
pointing after the UTF-8 bytes. It works appropriately on EBCDIC machines.

=item *

C<utf8_to_uvchr(buf, lenp)> reads UTF-8 encoded bytes from a buffer and
returns the Unicode character code point and, optionally, the length of
the UTF-8 byte sequence. It works appropriately on EBCDIC machines.

=item *

C<utf8_length(start, end)> returns the length of the UTF-8 encoded buffer
in characters. C<sv_len_utf8(sv)> returns the length of the UTF-8 encoded
scalar.

=item *

C<sv_utf8_upgrade(sv)> converts the string of the scalar to its UTF-8
encoded form. C<sv_utf8_downgrade(sv)> does the opposite, if
possible. C<sv_utf8_encode(sv)> is like C<sv_utf8_upgrade()> except that
it does not set the C<UTF8> flag. C<sv_utf8_decode()> does the
opposite of C<sv_utf8_encode()>. Note that none of these are to be
used as general-purpose encoding or decoding interfaces: C<use Encode>
for that. C<sv_utf8_upgrade()> is affected by the encoding pragma
but C<sv_utf8_downgrade()> is not (since the encoding pragma is
designed to be a one-way street).

=item *

C<is_utf8_char(s)> returns true if the pointer points to a valid UTF-8
character.

=item *

C<is_utf8_string(buf, len)> returns true if C<len> bytes of the buffer
are valid UTF-8.

=item *

C<UTF8SKIP(buf)> will return the number of bytes in the UTF-8 encoded
character in the buffer. C<UNISKIP(chr)> will return the number of bytes
required to UTF-8-encode the Unicode character code point. C<UTF8SKIP()>
is useful, for example, for iterating over the characters of a UTF-8
encoded buffer; C<UNISKIP()> is useful, for example, in computing
the size required for a UTF-8 encoded buffer.

=item *

C<utf8_distance(a, b)> will tell the distance in characters between the
two pointers pointing to the same UTF-8 encoded buffer.

=item *

C<utf8_hop(s, off)> will return a pointer to a UTF-8 encoded buffer
that is C<off> (positive or negative) Unicode characters displaced
from the UTF-8 buffer C<s>. Be careful not to overstep the buffer:
C<utf8_hop()> will merrily run off the end or the beginning of the
buffer if told to do so.

=item *

C<pv_uni_display(dsv, spv, len, pvlim, flags)> and
C<sv_uni_display(dsv, ssv, pvlim, flags)> are useful for debugging the
output of Unicode strings and scalars. By default they are useful
only for debugging (they display B<all> characters as hexadecimal code
points), but with the flags C<UNI_DISPLAY_ISPRINT>,
C<UNI_DISPLAY_BACKSLASH>, and C<UNI_DISPLAY_QQ> you can make the
output more readable.

=item *

C<foldEQ_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to
compare two strings case-insensitively in Unicode. For case-sensitive
comparisons you can just use C<memEQ()> and C<memNE()> as usual, except
if one string is in UTF-8 and the other isn't.

=back

For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h>
in the Perl source code distribution.

=head2 Hacking Perl to work on earlier Unicode versions (for very serious hackers only)

Perl by default comes with the latest supported Unicode version built in, but
you can change to use any earlier one.

Download the files in the desired version of Unicode from the Unicode web
site (L<http://www.unicode.org>). These should replace the existing files in
F<lib/unicore> in the Perl source tree. Follow the instructions in
F<README.perl> in that directory to change some of their names, and then build
perl (see L<INSTALL>).

It is even possible to copy the built files to a different directory, and then
change F<utf8_heavy.pl> in the directory C<$Config{privlib}> to point to the
new directory. Alternatively, you could make a copy of that directory before
making the change, and use C<@INC> or the C<-I> run-time flag to switch
between versions at will (but, because of caching, not in the middle of a
process). All this is beyond the scope of these instructions.

=head1 BUGS

=head2 Interaction with Locales

See L<perllocale/Unicode and UTF-8>.

=head2 Problems with characters in the Latin-1 Supplement range

See L</The "Unicode Bug">.

=head2 Interaction with Extensions

When Perl exchanges data with an extension, the extension should be
able to understand the UTF8 flag and act accordingly. If the
extension doesn't recognize that flag, it's likely that the extension
will return incorrectly-flagged data.

So if you're working with Unicode data, consult the documentation of
every module you're using to see whether there are any issues with
Unicode data exchange. If the documentation does not talk about Unicode
at all, suspect the worst and probably look at the source to learn how the
module is implemented. Modules written completely in Perl shouldn't
cause problems. Modules that directly or indirectly access code written
in other programming languages are at risk.

For affected functions, the simple strategy to avoid data corruption is
to always make the encoding of the exchanged data explicit. Choose an
encoding that you know the extension can handle. Convert arguments passed
to the extensions to that encoding and convert results back from that
encoding. Write wrapper functions that do the conversions for you, so
you can later change the functions when the extension catches up.

To provide an example, let's say the popular Foo::Bar::escape_html
function doesn't deal with Unicode data yet. The wrapper function
would convert the argument to raw UTF-8 and convert the result back to
Perl's internal representation like so:

    use Encode ();    # for encode_utf8 and decode_utf8

    sub my_escape_html ($) {
        my($what) = shift;
        return unless defined $what;
        Encode::decode_utf8(Foo::Bar::escape_html(
                                Encode::encode_utf8($what)));
    }

Sometimes, when the extension does not convert data but just stores
and retrieves them, you will be able to use the otherwise
dangerous Encode::_utf8_on() function. Let's say the popular
C<Foo::Bar> extension, written in C, provides a C<param> method that
lets you store and retrieve data according to these prototypes:

    $self->param($name, $value);            # set a scalar
    $value = $self->param($name);           # retrieve a scalar

If it does not yet provide support for any encoding, one could write a
derived class with such a C<param> method:

    sub param {
        my($self,$name,$value) = @_;
        utf8::upgrade($name);     # make sure it is UTF-8 encoded
        if (defined $value) {
            utf8::upgrade($value);  # make sure it is UTF-8 encoded
            return $self->SUPER::param($name,$value);
        } else {
            my $ret = $self->SUPER::param($name);
            Encode::_utf8_on($ret); # we know, it is UTF-8 encoded
            return $ret;
        }
    }

Some extensions provide filters on data entry/exit points, such as
DB_File::filter_store_key and family. Look out for such filters in
the documentation of your extensions; they can make the transition to
Unicode data much easier.
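
To give a rough idea of how such filters can be used, the sketch below
installs encoding and decoding filters on a tied DBM file. It uses
C<SDBM_File>, which offers the same C<filter_*> methods, in place of
C<DB_File>, purely so that the example does not depend on Berkeley DB
being installed. The file name F<demo> and the stored data are made up.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);
use Fcntl qw(O_RDWR O_CREAT);
use File::Temp qw(tempdir);
use SDBM_File;

my $dir = tempdir(CLEANUP => 1);
my $db  = tie(my %h, 'SDBM_File', "$dir/demo", O_RDWR | O_CREAT, 0640)
    or die "tie failed: $!";

# Encode on the way into the database, decode on the way out.
$db->filter_store_key(  sub { $_ = encode_utf8($_) });
$db->filter_store_value(sub { $_ = encode_utf8($_) });
$db->filter_fetch_key(  sub { $_ = decode_utf8($_) });
$db->filter_fetch_value(sub { $_ = decode_utf8($_) });

$h{"\x{2660}"} = "spade \x{2660}";      # character strings in ...
my $value = $h{"\x{2660}"};             # ... character strings out
my ($key) = keys %h;

undef $db;
untie %h;
```

With the filters in place, the rest of the program can treat keys and
values as character strings and never handle the stored octets directly.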

=head2 Speed

Some functions are slower when working on UTF-8 encoded strings than
on byte encoded strings. All functions that need to hop over
characters, such as C<length()>, C<substr()> or C<index()>, or matching
regular expressions, can work B<much> faster when the underlying data are
byte-encoded.

In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
a caching scheme was introduced to make the slowness
somewhat less spectacular, at least for some operations. In general,
operations with UTF-8 encoded strings are still slower. As an example,
the Unicode properties (character classes) like C<\p{Nd}> are known to
be quite a bit slower (5-20 times) than their simpler counterparts
like C<\d> (then again, there are hundreds of Unicode characters matching
C<Nd> compared with the 10 ASCII characters matching C<d>).

=head2 Problems on EBCDIC platforms

There are several known problems with Perl on EBCDIC platforms. If you
want to use Perl there, send email to perlbug@perl.org.

In earlier versions, when byte and character data were concatenated,
the new string was sometimes created by
decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
old Unicode string used EBCDIC.

If you find any of these, please report them as bugs.

=head2 Porting code from perl-5.6.X

Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer
was required to use the C<utf8> pragma to declare that a given scope
expected to deal with Unicode data and had to make sure that only
Unicode data were reaching that scope. If you have code that is
working with 5.6, you will need some of the following adjustments to
your code. The examples are written such that the code will continue
to work under 5.6, so you should be safe to try them out.

=over 3

=item *

A filehandle that should read or write UTF-8

    if ($] > 5.007) {
        binmode $fh, ":encoding(utf8)";
    }

=item *

A scalar that is going to be passed to some extension

Be it Compress::Zlib, Apache::Request or any extension that has no
mention of Unicode in the manpage, you need to make sure that the
UTF8 flag is stripped off. Note that at the time of this writing
(October 2002) the mentioned modules are not UTF-8-aware. Please
check the documentation to verify if this is still true.

    if ($] > 5.007) {
        require Encode;
        $val = Encode::encode_utf8($val); # make octets
    }

=item *

A scalar we got back from an extension

If you believe the scalar comes back as UTF-8, you will most likely
want the UTF8 flag restored:

    if ($] > 5.007) {
        require Encode;
        $val = Encode::decode_utf8($val);
    }

=item *

Same thing, if you are really sure it is UTF-8

    if ($] > 5.007) {
        require Encode;
        Encode::_utf8_on($val);
    }

=item *

A wrapper for fetchrow_array and fetchrow_hashref

When the database contains only UTF-8, a wrapper function or method is
a convenient way to replace all your fetchrow_array and
fetchrow_hashref calls. A wrapper function will also make it easier to
adapt to future enhancements in your database driver. Note that at the
time of this writing (October 2002), the DBI has no standardized way
to deal with UTF-8 data. Please check the documentation to verify if
that is still true.

    sub fetchrow {
        # $what is one of fetchrow_{array,hashref}
        my($self, $sth, $what) = @_;
        if ($] < 5.007) {
            return $sth->$what;
        } else {
            require Encode;
            if (wantarray) {
                my @arr = $sth->$what;
                for (@arr) {
                    defined && /[^\000-\177]/ && Encode::_utf8_on($_);
                }
                return @arr;
            } else {
                my $ret = $sth->$what;
                if (ref $ret) {
                    for my $k (keys %$ret) {
                        defined
                        && /[^\000-\177]/
                        && Encode::_utf8_on($_) for $ret->{$k};
                    }
                    return $ret;
                } else {
                    defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
                    return $ret;
                }
            }
        }
    }

=item *

A large scalar that you know can only contain ASCII

Scalars that contain only ASCII and are marked as UTF-8 are sometimes
a drag to your program. If you recognize such a situation, just remove
the UTF8 flag:

    utf8::downgrade($val) if $] > 5.007;

=back

=head1 SEE ALSO

L<perlunitut>, L<perluniintro>, L<perluniprops>, L<Encode>, L<open>, L<utf8>, L<bytes>,
L<perlretut>, L<perlvar/"${^UNICODE}">,
L<http://www.unicode.org/reports/tr44>.

=cut