=head1 NAME

perlunicode - Unicode support in Perl

=head1 DESCRIPTION

=head2 Important Caveats

Unicode support is an extensive requirement. While Perl does not
implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.

People who want to learn to use Unicode in Perl should probably read
the L<Perl Unicode tutorial, perlunitut|perlunitut> and
L<perluniintro> before reading this reference document.

Also, the use of Unicode may present security issues that aren't obvious.
Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.

=over 4

=item Safest if you "use feature 'unicode_strings'"

In order to preserve backward compatibility, Perl does not turn
on full internal Unicode support unless the pragma
C<use feature 'unicode_strings'> is specified. (This is automatically
selected if you use C<use 5.012> or higher.) Failure to do this can
trigger surprises. See L</The "Unicode Bug"> below.

This pragma doesn't affect I/O, and there are still several places
where Unicode isn't fully supported, such as in filenames.

=item Input and Output Layers

Perl knows when a filehandle uses Perl's internal Unicode encodings
(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
the ":encoding(utf8)" layer. Other encodings can be converted to Perl's
encoding on input or from Perl's encoding on output by use of the
":encoding(...)" layer. See L<open>.

To indicate that Perl source itself is in UTF-8, use C<use utf8;>.

=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts

As a compatibility measure, the C<use utf8> pragma must be explicitly
included to enable recognition of UTF-8 in the Perl scripts themselves
(in string or regular expression literals, or in identifier names) on
ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
machines. B<These are the only times when an explicit C<use utf8>
is needed.> See L<utf8>.

=item BOM-marked scripts and UTF-16 scripts autodetected

If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF-16BE,
or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either
endianness, Perl will correctly read in the script as Unicode.
(BOMless UTF-8 cannot be effectively recognized or differentiated from
ISO 8859-1 or other eight-bit encodings.)

=item C<use encoding> needed to upgrade non-Latin-1 byte strings

By default, there is a fundamental asymmetry in Perl's Unicode model:
implicit upgrading from byte strings to Unicode strings assumes that
they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
downgraded with UTF-8 encoding. This happens because the first 256
codepoints in Unicode happen to agree with Latin-1.

See L</"Byte and Character Semantics"> for more details.

=back

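As an illustration of the "Input and Output Layers" caveat above, the
following minimal sketch writes one character through an encoding layer
and reads it back. It uses an in-memory filehandle so that it is
self-contained, and the strict C<:encoding(UTF-8)> spelling of the layer;
both choices, and all variable names, are assumptions made for this
example only.

```perl
use strict;
use warnings;

my $smiley = "\x{263A}";    # one character: WHITE SMILING FACE

# Encode on output: the layer turns the character into UTF-8 octets.
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer)
    or die "Cannot open buffer for writing: $!";
print {$out} $smiley;
close($out) or die "Cannot close buffer: $!";
my $octet_count = length $buffer;       # octets stored in the buffer

# Decode on input: the same layer turns the octets back into a character.
open(my $in, '<:encoding(UTF-8)', \$buffer)
    or die "Cannot open buffer for reading: $!";
my $round_trip = <$in>;
close($in) or die "Cannot close buffer: $!";
my $char_count = length $round_trip;    # characters seen by the program

printf "%d octets in the buffer, %d character in memory\n",
    $octet_count, $char_count;
```

The same layers can be given to C<open> on a real file, or set as
defaults with the L<open> pragma.
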
=head2 Byte and Character Semantics

Beginning with version 5.6, Perl uses logically-wide characters to
represent strings internally.

Starting in Perl 5.14, Perl-level operations work with
characters rather than bytes within the scope of a
C<L<use feature 'unicode_strings'|feature>> (or equivalently
C<use 5.012> or higher). (This is not true if bytes have been
explicitly requested by C<L<use bytes|bytes>>, nor necessarily true
for interactions with the platform's operating system.)

For earlier Perls, and when C<unicode_strings> is not in effect, Perl
provides a fairly safe environment that can handle both types of
semantics in programs. For operations where Perl can unambiguously
decide that the input data are characters, Perl switches to character
semantics. For operations where this determination cannot be made
without additional information from the user, Perl decides in favor of
compatibility and chooses to use byte semantics.

When C<use locale> is in effect (which overrides
C<use feature 'unicode_strings'> in the same scope), Perl uses the
semantics associated with the current locale. Otherwise, Perl uses the
platform's native byte semantics for characters whose code points are
less than 256, and Unicode semantics for those greater than 255. On
EBCDIC platforms, this is almost seamless, as the EBCDIC code pages that
Perl handles are equivalent to Unicode's first 256 code points. (The
exception is that EBCDIC regular expression case-insensitive matching
rules are not as robust as Unicode's.) But on ASCII platforms, Perl uses
US-ASCII (or Basic Latin in Unicode terminology) byte semantics, meaning
that characters whose ordinal numbers are in the range 128 - 255 are
undefined except for their ordinal numbers. This means that none have
case (upper and lower), nor are any a member of character classes, like
C<[:alpha:]> or C<\w>. (But all do belong to the C<\W> class or the Perl
regular expression extension C<[:^alpha:]>.)

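The difference between native byte semantics and Unicode semantics for
the range 128 - 255 can be sketched as follows. The example assumes an
ASCII platform, Perl 5.14 or later, and no C<use locale>; the variable
names are made up for illustration.

```perl
use strict;
use warnings;

# "\xE9" is LATIN SMALL LETTER E WITH ACUTE. Written this way, without
# "use utf8", Perl stores it as a single native byte.
my $e_acute = "\xE9";

my ($word_native, $upper_native);
{
    no feature 'unicode_strings';     # the default: native byte semantics
    $word_native  = ($e_acute =~ /\w/) ? 1 : 0;
    $upper_native = uc $e_acute;
}

my ($word_unicode, $upper_unicode);
{
    use feature 'unicode_strings';    # Unicode semantics for every string
    $word_unicode  = ($e_acute =~ /\w/) ? 1 : 0;
    $upper_unicode = uc $e_acute;
}

printf "native:  \\w=%d uc=U+%04X\n", $word_native,  ord $upper_native;
printf "unicode: \\w=%d uc=U+%04X\n", $word_unicode, ord $upper_unicode;
```

Under the stated assumptions the first block should leave the character
unchanged and not treat it as a word character, while the second block
should treat it as a letter and uppercase it to U+00C9.
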
This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if
none of the program's inputs were marked as being a source of Unicode
character data. Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as %ENV),
or from literals and constants in the source text.

The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
Note that this pragma is only required while Perl defaults to byte
semantics; when character semantics become the default, this pragma
may become a no-op. See L<utf8>.

If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will have
character semantics. This can cause surprises: See L</BUGS>, below.
You can choose to be warned when this happens. See L<encoding::warnings>.

Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
logically just a number ranging from 0 to 2**31 or so. Larger
characters may encode into longer sequences of bytes internally, but
this internal detail is mostly hidden from Perl code.
See L<perluniintro> for more.

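To make the "a character is just a number" point concrete, here is a
small sketch. It uses the core L<Encode> module only to expose the
encoded byte sequence, which ordinary code rarely needs to see; the
variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = chr(0x263A);                 # WHITE SMILING FACE

my $char_count = length $char;          # counted in characters
my $code_point = ord $char;             # the number behind the character
my $utf8_bytes = encode('UTF-8', $char);
my $byte_count = length $utf8_bytes;    # counted in bytes once encoded

printf "U+%04X: %d character, %d bytes as UTF-8\n",
    $code_point, $char_count, $byte_count;
```
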
=head2 Effects of Character Semantics

Character semantics have the following effects:

=over 4

=item *

Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.

If you use a Unicode editor to edit your program, Unicode characters may
occur directly within the literal strings in UTF-8 encoding, or UTF-16.
(The former requires a BOM or C<use utf8>, the latter requires a BOM.)

Unicode characters can also be added to a string by using the C<\N{U+...}>
notation. The Unicode code point for the desired character, in hexadecimal,
should be placed in the braces, after the C<U+>. For instance, a smiley
face is C<\N{U+263A}>.

Alternatively, you can use the C<\x{...}> notation for characters 0x100 and
above. For characters below 0x100 you may get byte semantics instead of
character semantics; see L</The "Unicode Bug">. On EBCDIC machines there is
the additional problem that the value for such characters gives the EBCDIC
character rather than the Unicode one.

Additionally, if you

    use charnames ':full';

you can use the C<\N{...}> notation and put the official Unicode
character name within the braces, such as C<\N{WHITE SMILING FACE}>.
See L<charnames>.

=item *

If an appropriate L<encoding> is specified, identifiers within the
Perl script may contain Unicode alphanumeric characters, including
ideographs. Perl does not currently attempt to canonicalize variable
names.

=item *

Regular expressions match characters instead of bytes. "." matches
a character instead of a byte.

=item *

Bracketed character classes in regular expressions match characters instead of
bytes and match against the character properties specified in the
Unicode properties database. C<\w> can be used to match a Japanese
ideograph, for instance.

=item *

Named Unicode properties, scripts, and block ranges may be used (like bracketed
character classes) by using the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".
See L</"Unicode Character Properties"> for more details.

You can define your own character properties and use them
in the regular expression with the C<\p{}> or C<\P{}> construct.
See L</"User-Defined Character Properties"> for more details.

=item *

The special pattern C<\X> matches a logical character, an "extended grapheme
cluster" in Standardese. In Unicode what appears to the user to be a single
character, for example an accented C<G>, may in fact be composed of a sequence
of characters, in this case a C<G> followed by an accent character. C<\X>
will match the entire sequence.

=item *

The C<tr///> operator translates characters instead of bytes. Note
that the C<tr///CU> functionality has been removed. For similar
functionality see C<pack('U0', ...)> and C<pack('C0', ...)>.

=item *

Case translation operators use the Unicode case translation tables
when character input is provided. Note that C<uc()>, or C<\U> in
interpolated strings, translates to uppercase, while C<ucfirst>,
or C<\u> in interpolated strings, translates to titlecase in languages
that make the distinction (which is equivalent to uppercase in languages
without the distinction).

=item *

Most operators that deal with positions or lengths in a string will
automatically switch to using character positions, including
C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
C<sprintf()>, C<write()>, and C<length()>. An operator that
specifically does not switch is C<vec()>. Operators that really don't
care include operators that treat strings as a bucket of bits such as
C<sort()>, and operators dealing with filenames.

=item *

The C<pack()>/C<unpack()> letter C<C> does I<not> change, since it is often
used for byte-oriented formats. Again, think C<char> in the C language.

There is a new C<U> specifier that converts between Unicode characters
and code points. There is also a C<W> specifier that is the equivalent of
C<chr>/C<ord> and properly handles character values even if they are above 255.

=item *

The C<chr()> and C<ord()> functions work on characters, similar to
C<pack("W")> and C<unpack("W")>, I<not> C<pack("C")> and
C<unpack("C")>. C<pack("C")> and C<unpack("C")> are methods for
emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.
While these methods reveal the internal encoding of Unicode strings,
that is not something one normally needs to care about at all.

=item *

The bit string operators, C<& | ^ ~>, can operate on character data.
However, for backward compatibility, such as when using bit string
operations when characters are all less than 256 in ordinal value, one
should not use C<~> (the bit complement) on strings that mix characters
less than 256 with characters greater than 255. Most importantly,
DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
will not hold. The reason for this mathematical I<faux pas> is that
the complement cannot return B<both> the 8-bit (byte-wide) bit
complement B<and> the full character-wide bit complement.

=item *

You can define your own mappings to be used in C<lc()>,
C<lcfirst()>, C<uc()>, and C<ucfirst()> (or their double-quoted string inlined
versions such as C<\U>). See
L<User-Defined Case-Mappings|/"User-Defined Case Mappings (for serious hackers only)">
for more details.

=item *

And finally, C<scalar reverse()> reverses by character rather than by byte.

=back

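Several of the effects listed above can be observed in one short script.
This is only a sketch: the sample characters and variable names are
arbitrary choices for illustration, and every string used contains a
character above 255, so character semantics apply regardless of
C<unicode_strings>.

```perl
use strict;
use warnings;
use charnames ':full';

# Three spellings of the same character, WHITE SMILING FACE.
my $by_code_point = "\N{U+263A}";
my $by_hex_escape = "\x{263A}";
my $by_name       = "\N{WHITE SMILING FACE}";
my $same_char     = ($by_code_point eq $by_hex_escape
                  && $by_hex_escape eq $by_name) ? 1 : 0;

# \X matches a whole extended grapheme cluster; "." matches one code point.
my $accented_g  = "G\x{301}";               # G + COMBINING ACUTE ACCENT
my @clusters    = $accented_g =~ /(\X)/g;
my @code_points = $accented_g =~ /(.)/g;

# length(), substr(), index(), and reverse() count characters.
my $greek    = "\x{3B1}\x{3B2}\x{3B3}";     # alpha, beta, gamma
my $length   = length $greek;
my $middle   = substr($greek, 1, 1);
my $position = index($greek, "\x{3B3}");
my $reversed = scalar reverse $greek;

# tr/// translates characters, not bytes.
(my $swapped = "\x{3B1}\x{3B2}") =~ tr/\x{3B1}\x{3B2}/\x{3B2}\x{3B1}/;

# uc() gives uppercase; ucfirst() gives titlecase where the two differ.
my $dz        = "\x{1C6}";      # LATIN SMALL LETTER DZ WITH CARON
my $uppercase = uc $dz;         # U+01C4
my $titlecase = ucfirst $dz;    # U+01C5

# pack("W")/unpack("W") agree with chr()/ord(), even above 255.
my $packed   = pack("W", 0x263A);
my $unpacked = unpack("W", $by_code_point);
```
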
=head2 Unicode Character Properties

(The only time that Perl considers a sequence of individual code
points as a single logical character is in the C<\X> construct, already
mentioned above. Therefore "character" in this discussion means a single
Unicode code point.)

Very nearly all Unicode character properties are accessible through
regular expressions by using the C<\p{}> "matches property" construct
and the C<\P{}> "doesn't match property" for its negation.

For instance, C<\p{Uppercase}> matches any single character with the Unicode
"Uppercase" property, while C<\p{L}> matches any character with a
General_Category of "L" (letter) property. Brackets are not
required for single letter property names, so C<\p{L}> is equivalent to C<\pL>.

More formally, C<\p{Uppercase}> matches any single character whose Unicode
Uppercase property value is True, and C<\P{Uppercase}> matches any character
whose Uppercase property value is False, and they could have been written as
C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively.

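The equivalences just described can be checked with a few matches. This
is a minimal sketch; the sample characters and variable names are
arbitrary.

```perl
use strict;
use warnings;

my ($upper, $lower) = ("A", "a");

# Single form and its explicit compound spelling.
my $single_true    = ($upper =~ /\p{Uppercase}/)       ? 1 : 0;
my $compound_true  = ($upper =~ /\p{Uppercase=True}/)  ? 1 : 0;

# \P{...} negates, which for a binary property means "value is False".
my $single_false   = ($lower =~ /\P{Uppercase}/)       ? 1 : 0;
my $compound_false = ($lower =~ /\p{Uppercase=False}/) ? 1 : 0;

# Braces are optional for single-letter property names.
my $braced   = ($lower =~ /\p{L}/) ? 1 : 0;
my $unbraced = ($lower =~ /\pL/)   ? 1 : 0;
```
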
This formality is needed when properties are not binary; that is, if they can
take on more values than just True and False. For example, the Bidi_Class
property (see L</"Bidirectional Character Types"> below) can take on several
different values, such as Left, Right, Whitespace, and others. To match these,
one needs to specify both the property name (Bidi_Class), AND the value being
matched against (Left, Right, etc.). This is done, as in the examples above,
by having the two components separated by an equal sign (or interchangeably,
a colon), like C<\p{Bidi_Class: Left}>.

All Unicode-defined character properties may be written in these compound forms
of C<\p{property=value}> or C<\p{property:value}>, but Perl provides some
additional properties that are written only in the single form, as well as
single-form short-cuts for all binary properties and certain others described
below, in which you may omit the property name and the equals or colon
separator.

Most Unicode character properties have at least two synonyms (or aliases if you
prefer): a short one that is easier to type and a longer one that is more
descriptive and hence easier to understand. Thus the "L" and "Letter" properties
above are equivalent and can be used interchangeably. Likewise,
"Upper" is a synonym for "Uppercase", and we could have written
C<\p{Uppercase}> equivalently as C<\p{Upper}>. Also, there are typically
various synonyms for the values the property can be. For binary properties,
"True" has 3 synonyms: "T", "Yes", and "Y"; and "False" correspondingly has
"F", "No", and "N". But be careful. A short form of a value for one property
may not mean the same thing as the same short form for another. Thus, for the
General_Category property, "L" means "Letter", but for the Bidi_Class property,
"L" means "Left". A complete list of properties and synonyms is in
L<perluniprops>.

Upper/lower case differences in property names and values are irrelevant;
thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>.
Similarly, you can add or subtract underscores anywhere in the middle of a
word, so that these are also equivalent to C<\p{U_p_p_e_r}>. And white space
is irrelevant adjacent to non-word characters, such as the braces and the equals
or colon separators, so C<\p{ Upper }> and C<\p{ Upper_case : Y }> are
equivalent to these as well. In fact, white space and even
hyphens can usually be added or deleted anywhere. So even
C<\p{ Up-per case = Yes}> is equivalent. All this is called "loose-matching"
by Unicode. The few places where stricter matching is used are in the middle
of numbers, and in the Perl extension properties that begin or end with an
underscore. Stricter matching cares about white space (except adjacent to
non-word characters), hyphens, and non-interior underscores.

You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
(^) between the first brace and the property name: C<\p{^Tamil}> is
equal to C<\P{Tamil}>.

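The loose-matching rules and the caret negation can be sketched as
follows. Only a handful of the spellings discussed above are exercised;
the list and the variable names are illustrative.

```perl
use strict;
use warnings;

# Several loosely matched spellings of the same binary property.
my @spellings = (
    qr/\p{Uppercase}/,
    qr/\p{Upper}/,
    qr/\p{upper}/,
    qr/\p{UpPeR}/,
    qr/\p{ Upper }/,
    qr/\p{Uppercase=Yes}/,
);

# Count how many spellings match "A" and how many match "a".
my $match_upper = grep { "A" =~ $_ } @spellings;
my $match_lower = grep { "a" =~ $_ } @spellings;

# A caret inside \p{} negates, just like \P{}.
my $caret_form   = ("a" =~ /\p{^Upper}/) ? 1 : 0;
my $capital_form = ("a" =~ /\P{Upper}/)  ? 1 : 0;

printf "%d of %d spellings match 'A'; %d match 'a'\n",
    $match_upper, scalar @spellings, $match_lower;
```
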
Almost all properties are immune to case-insensitive matching. That is,
adding a C</i> regular expression modifier does not change what they
match. There are two sets that are affected.
The first set is
C<Uppercase_Letter>,
C<Lowercase_Letter>,
and C<Titlecase_Letter>,
all of which match C<Cased_Letter> under C</i> matching.
And the second set is
C<Uppercase>,
C<Lowercase>,
and C<Titlecase>,
all of which match C<Cased> under C</i> matching.
This set also includes its subsets C<PosixUpper> and C<PosixLower> both
of which under C</i> matching match C<PosixAlpha>.
(The difference between these sets is that some things, such as Roman
numerals, come in both upper and lower case so they are C<Cased>, but
aren't considered letters, so they aren't C<Cased_Letter>s.)

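A short sketch of the two affected sets follows. It assumes a Perl
recent enough to implement the C</i> behavior described above; SMALL
ROMAN NUMERAL ONE (U+2170) is picked here as an example of a character
that is cased but is not a letter.

```perl
use strict;
use warnings;

# First set: Uppercase_Letter (Lu) behaves like Cased_Letter under /i.
my $lu_plain = ("a" =~ /\p{Lu}/)  ? 1 : 0;
my $lu_fold  = ("a" =~ /\p{Lu}/i) ? 1 : 0;

# Second set: Uppercase behaves like Cased under /i.
my $roman          = "\x{2170}";   # SMALL ROMAN NUMERAL ONE, category Nl
my $roman_cased    = ($roman =~ /\p{Uppercase}/i) ? 1 : 0;
my $roman_cased_lt = ($roman =~ /\p{Lu}/i)        ? 1 : 0;

# Most other properties are unaffected by /i.
my $digit_plain = ("5" =~ /\p{Nd}/)  ? 1 : 0;
my $digit_fold  = ("5" =~ /\p{Nd}/i) ? 1 : 0;
```
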
=head3 B<General_Category>

Every Unicode character is assigned a general category, which is the "most
usual categorization of a character" (from
L<http://www.unicode.org/reports/tr44>).

The compound way of writing these is like C<\p{General_Category=Number}>
(short, C<\p{gc:n}>). But Perl furnishes shortcuts in which everything up
through the equal or colon separator is omitted. So you can instead just write
C<\pN>.

Here are the short and long forms of the General Category properties:

    Short       Long

    L           Letter
    LC, L&      Cased_Letter (that is: [\p{Ll}\p{Lu}\p{Lt}])
    Lu          Uppercase_Letter
    Ll          Lowercase_Letter
    Lt          Titlecase_Letter
    Lm          Modifier_Letter
    Lo          Other_Letter

    M           Mark
    Mn          Nonspacing_Mark
    Mc          Spacing_Mark
    Me          Enclosing_Mark

    N           Number
    Nd          Decimal_Number (also Digit)
    Nl          Letter_Number
    No          Other_Number

    P           Punctuation (also Punct)
    Pc          Connector_Punctuation
    Pd          Dash_Punctuation
    Ps          Open_Punctuation
    Pe          Close_Punctuation
    Pi          Initial_Punctuation
                (may behave like Ps or Pe depending on usage)
    Pf          Final_Punctuation
                (may behave like Ps or Pe depending on usage)
    Po          Other_Punctuation

    S           Symbol
    Sm          Math_Symbol
    Sc          Currency_Symbol
    Sk          Modifier_Symbol
    So          Other_Symbol

    Z           Separator
    Zs          Space_Separator
    Zl          Line_Separator
    Zp          Paragraph_Separator

    C           Other
    Cc          Control (also Cntrl)
    Cf          Format
    Cs          Surrogate
    Co          Private_Use
    Cn          Unassigned

Single-letter properties match all characters in any of the
two-letter sub-properties starting with the same letter.
C<LC> and C<L&> are special: both are aliases for the set consisting of
everything matched by C<Ll>, C<Lu>, and C<Lt>.

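As a hedged illustration of how the one-letter and two-letter
categories relate, the sketch below tests three sample characters
chosen for this example: ROMAN NUMERAL ONE (category Nl), the ASCII
letter A (Lu), and HEBREW LETTER ALEF (Lo).

```perl
use strict;
use warnings;

my $roman_one = "\x{2160}";    # ROMAN NUMERAL ONE, General_Category Nl
my $alef      = "\x{5D0}";     # HEBREW LETTER ALEF, General_Category Lo

# N covers every N* sub-category, in any of its spellings.
my $n_short    = ($roman_one =~ /\pN/)                         ? 1 : 0;
my $n_compound = ($roman_one =~ /\p{General_Category=Number}/) ? 1 : 0;
my $n_gc       = ($roman_one =~ /\p{gc:n}/)                    ? 1 : 0;
my $nl         = ($roman_one =~ /\p{Nl}/)                      ? 1 : 0;
my $nd         = ($roman_one =~ /\p{Nd}/)                      ? 1 : 0;

# LC (or L&) is only Ll, Lu, and Lt, so it excludes Other_Letter.
my $a_is_lc     = ("A"   =~ /\p{LC}/) ? 1 : 0;
my $a_is_l_amp  = ("A"   =~ /\p{L&}/) ? 1 : 0;
my $alef_is_l   = ($alef =~ /\pL/)    ? 1 : 0;
my $alef_is_lc  = ($alef =~ /\p{LC}/) ? 1 : 0;
```
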
=head3 B<Bidirectional Character Types>

Because scripts differ in their directionality (Hebrew and Arabic are
written right to left, for example) Unicode supplies these properties in
the Bidi_Class class:

    Property    Meaning

    L           Left-to-Right
    LRE         Left-to-Right Embedding
    LRO         Left-to-Right Override
    R           Right-to-Left
    AL          Arabic Letter
    RLE         Right-to-Left Embedding
    RLO         Right-to-Left Override
    PDF         Pop Directional Format
    EN          European Number
    ES          European Separator
    ET          European Terminator
    AN          Arabic Number
    CS          Common Separator
    NSM         Non-Spacing Mark
    BN          Boundary Neutral
    B           Paragraph Separator
    S           Segment Separator
    WS          Whitespace
    ON          Other Neutrals

This property is always written in the compound form.
For example, C<\p{Bidi_Class:R}> matches characters that are normally
written right to left.

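The following sketch matches a few sample characters against the short
value names from the table above. The characters were chosen for
illustration, and C<Bc> is assumed here to be the short alias of
C<Bidi_Class>.

```perl
use strict;
use warnings;

my $hebrew_alef = "\x{5D0}";    # HEBREW LETTER ALEF
my $arabic_alef = "\x{627}";    # ARABIC LETTER ALEF

my $latin_is_l  = ("A"          =~ /\p{Bidi_Class=L}/)  ? 1 : 0;
my $hebrew_is_r = ($hebrew_alef =~ /\p{Bidi_Class:R}/)  ? 1 : 0;
my $arabic_is_al = ($arabic_alef =~ /\p{Bc=AL}/)        ? 1 : 0;
my $digit_is_en = ("5"          =~ /\p{Bidi_Class=EN}/) ? 1 : 0;
my $space_is_ws = (" "          =~ /\p{Bidi_Class=WS}/) ? 1 : 0;

# The same short value means different things for different properties:
# for General_Category, "L" is Letter, so a digit does not match either.
my $digit_is_l  = ("5" =~ /\p{Bidi_Class=L}/) ? 1 : 0;
```
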
=head3 B<Scripts>

The world's languages are written in many different scripts. This sentence
(unless you're reading it in translation) is written in Latin, while Russian is
written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in
Hiragana or Katakana. There are many more.

The Unicode Script property gives what script a given character is in,
and the property can be specified with the compound form like
C<\p{Script=Hebrew}> (short: C<\p{sc=hebr}>). Perl furnishes shortcuts for all
script names. You can omit everything up through the equals (or colon), and
simply write C<\p{Latin}> or C<\P{Cyrillic}>.

A complete list of scripts and their shortcuts is in L<perluniprops>.

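A minimal sketch of the compound, short, and shortcut forms of the
Script property, using three sample letters picked for this example:

```perl
use strict;
use warnings;

my $hebrew_alef = "\x{5D0}";    # HEBREW LETTER ALEF
my $cyrillic_ya = "\x{44F}";    # CYRILLIC SMALL LETTER YA

# Compound form, short compound form, and Perl's single-form shortcut.
my $compound = ($hebrew_alef =~ /\p{Script=Hebrew}/) ? 1 : 0;
my $short    = ($hebrew_alef =~ /\p{sc=hebr}/)       ? 1 : 0;
my $shortcut = ($hebrew_alef =~ /\p{Hebrew}/)        ? 1 : 0;

# Shortcuts work with negation too.
my $a_is_latin        = ("a"          =~ /\p{Latin}/)    ? 1 : 0;
my $a_not_cyrillic    = ("a"          =~ /\P{Cyrillic}/) ? 1 : 0;
my $ya_is_cyrillic    = ($cyrillic_ya =~ /\p{Cyrillic}/) ? 1 : 0;
```
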
=head3 B<Use of "Is" Prefix>

For backward compatibility (with Perl 5.6), all properties mentioned
so far may have C<Is> or C<Is_> prepended to their name, so C<\P{Is_Lu}>, for
example, is equal to C<\P{Lu}>, and C<\p{IsScript:Arabic}> is equal to
C<\p{Arabic}>.

=head3 B<Blocks>

In addition to B<scripts>, Unicode also defines B<blocks> of
characters. The difference between scripts and blocks is that the
concept of scripts is closer to natural languages, while the concept
of blocks is more of an artificial grouping based on groups of Unicode
characters with consecutive ordinal values. For example, the "Basic Latin"
block is all characters whose ordinals are between 0 and 127, inclusive; in
other words, the ASCII characters. The "Latin" script contains some letters
from this as well as several other blocks, like "Latin-1 Supplement",
"Latin Extended-A", etc., but it does not contain all the characters from
those blocks. It does not, for example, contain the digits 0-9, because
those digits are shared across many scripts. The digits 0-9 and similar groups,
like punctuation, are in the script called C<Common>. There is also a
script called C<Inherited> for characters that modify other characters,
and inherit the script value of the controlling character. (Note that
there are several different sets of digits in Unicode that are
equivalent to 0-9 and are matchable by C<\d> in a regular expression.
If they are used in a single language only, they are in that language's
script. Only sets that are used across several languages are in the
C<Common> script.)

For more about scripts versus blocks, see UAX#24 "Unicode Script Property":
L<http://www.unicode.org/reports/tr24>

The Script property is likely to be the one you want to use when processing
natural language; the Block property may occasionally be useful in working
with the nuts and bolts of Unicode.

Block names are matched in the compound form, like C<\p{Block: Arrows}> or
C<\p{Blk=Hebrew}>. Unlike most other properties, only a few block names have a
Unicode-defined short name. But Perl does provide a (slight) shortcut: You
can say, for example C<\p{In_Arrows}> or C<\p{In_Hebrew}>. For backwards
compatibility, the C<In> prefix may be omitted if there is no naming conflict
with a script or any other property, and you can even use an C<Is> prefix
instead in those cases. But it is not a good idea to do this, for a couple
of reasons:

=over 4

=item 1

It is confusing. There are many naming conflicts, and you may forget some.
For example, C<\p{Hebrew}> means the I<script> Hebrew, and NOT the I<block>
Hebrew. But would you remember that 6 months from now?

=item 2

It is unstable. A new version of Unicode may pre-empt the current meaning by
creating a property with the same name. There was a time in very early Unicode
releases when C<\p{Hebrew}> would have matched the I<block> Hebrew; now it
doesn't.

=back

Some people prefer to always use C<\p{Block: foo}> and C<\p{Script: bar}>
instead of the shortcuts, whether for clarity, because they can't remember the
difference between 'In' and 'Is' anyway, or they aren't confident that those who
eventually will read their code will know that difference.

A complete list of blocks and their shortcuts is in L<perluniprops>.

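The distinction between blocks and scripts can be sketched with the
digit example from the text. The block name is written with an
underscore (C<Basic_Latin>) here; the sample characters and variable
names are illustrative.

```perl
use strict;
use warnings;

# The digit 5 sits in the Basic Latin block, but its script is Common.
my $digit_in_block   = ("5" =~ /\p{Block=Basic_Latin}/) ? 1 : 0;
my $digit_is_latin   = ("5" =~ /\p{Script=Latin}/)      ? 1 : 0;
my $digit_is_common  = ("5" =~ /\p{Script=Common}/)     ? 1 : 0;

# Compound block forms and the In_ shortcut.
my $arrow        = "\x{2190}";    # LEFTWARDS ARROW
my $arrow_block  = ($arrow =~ /\p{Block: Arrows}/) ? 1 : 0;
my $arrow_in     = ($arrow =~ /\p{In_Arrows}/)     ? 1 : 0;

my $hebrew_alef  = "\x{5D0}";     # HEBREW LETTER ALEF
my $hebrew_block = ($hebrew_alef =~ /\p{Blk=Hebrew}/) ? 1 : 0;
my $hebrew_in    = ($hebrew_alef =~ /\p{In_Hebrew}/)  ? 1 : 0;
```
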
=head3 B<Other Properties>

There are many more properties than the very basic ones described here.
A complete list is in L<perluniprops>.

Unicode defines all its properties in the compound form, so all single-form
properties are Perl extensions. Most of these are just synonyms for the
Unicode ones, but some are genuine extensions, including several that are in
the compound form. And quite a few of these are actually recommended by Unicode
(in L<http://www.unicode.org/reports/tr18>).

This section gives some details on all extensions that aren't just
synonyms for compound-form Unicode properties
(for those properties, you'll have to refer to the
L<Unicode Standard|http://www.unicode.org/reports/tr44>).

=over

=item B<C<\p{All}>>

This matches any of the 1_114_112 Unicode code points. It is a synonym for
C<\p{Any}>.

=item B<C<\p{Alnum}>>

This matches any C<\p{Alphabetic}> or C<\p{Decimal_Number}> character.

=item B<C<\p{Any}>>

This matches any of the 1_114_112 Unicode code points. It is a synonym for
C<\p{All}>.

=item B<C<\p{ASCII}>>

This matches any of the 128 characters in the US-ASCII character set,
which is a subset of Unicode.

=item B<C<\p{Assigned}>>

This matches any assigned code point; that is, any code point whose general
category is not Unassigned (or equivalently, not Cn).

=item B<C<\p{Blank}>>

This is the same as C<\h> and C<\p{HorizSpace}>: a character that changes the
spacing horizontally.

=item B<C<\p{Decomposition_Type: Non_Canonical}>> (Short: C<\p{Dt=NonCanon}>)

Matches a character that has a non-canonical decomposition.

To understand the use of this rarely used property=value combination, it is
necessary to know some basics about decomposition.
Consider a character, say H. It could appear with various marks around it,
such as an acute accent, or a circumflex, or various hooks, circles, arrows,
I<etc.>, above, below, to one side or the other, etc. There are many
possibilities among the world's languages. The number of combinations is
astronomical, and if there were a character for each combination, it would
soon exhaust Unicode's more than a million possible characters. So Unicode
took a different approach: there is a character for the base H, and a
character for each of the possible marks, and these can be variously combined
to get a final logical character. So a logical character--what appears to be a
single character--can be a sequence of more than one individual character.
This is called an "extended grapheme cluster"; Perl furnishes the C<\X>
regular expression construct to match such sequences.

But Unicode's intent is to unify the existing character set standards and
practices, and several pre-existing standards have single characters that
mean the same thing as some of these combinations. An example is ISO-8859-1,
which has quite a few of these in the Latin-1 range, an example being "LATIN
CAPITAL LETTER E WITH ACUTE". Because this character was in this pre-existing
standard, Unicode added it to its repertoire. But this character is considered
by Unicode to be equivalent to the sequence consisting of the character
"LATIN CAPITAL LETTER E" followed by the character "COMBINING ACUTE ACCENT".

"LATIN CAPITAL LETTER E WITH ACUTE" is called a "pre-composed" character, and
its equivalence with the sequence is called canonical equivalence. All
pre-composed characters are said to have a decomposition (into the equivalent
sequence), and the decomposition type is also called canonical.

However, many more characters have a different type of decomposition, a
"compatible" or "non-canonical" decomposition. The sequences that form these
decompositions are not considered canonically equivalent to the pre-composed
character. An example, again in the Latin-1 range, is the "SUPERSCRIPT ONE".
It is somewhat like a regular digit 1, but not exactly; its decomposition
into the digit 1 is called a "compatible" decomposition, specifically a
"super" decomposition. There are several such compatibility
decompositions (see L<http://www.unicode.org/reports/tr44>), including one
called "compat", which means some miscellaneous type of decomposition
that doesn't fit into the decomposition categories that Unicode has chosen.

Note that most Unicode characters don't have a decomposition, so their
decomposition type is "None".

For your convenience, Perl has added the C<Non_Canonical> decomposition
type to mean any of the several compatibility decompositions.

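The two kinds of decomposition can be observed with the core
L<Unicode::Normalize> module, which is used here only to display the
decompositions and is not required for the property match itself. This
is a sketch using the two Latin-1 characters named above; the variable
names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFKD);

my $e_acute  = "\x{C9}";    # LATIN CAPITAL LETTER E WITH ACUTE
my $super_1  = "\x{B9}";    # SUPERSCRIPT ONE

# Canonical decomposition: E followed by COMBINING ACUTE ACCENT.
my $e_canonical = NFD($e_acute);

# SUPERSCRIPT ONE has no canonical decomposition, only a compatibility one.
my $one_canonical = NFD($super_1);
my $one_compat    = NFKD($super_1);

# The matching Decomposition_Type values.
my $e_is_canon      = ($e_acute =~ /\p{Dt=Canonical}/) ? 1 : 0;
my $e_is_noncanon   = ($e_acute =~ /\p{Dt=NonCanon}/)  ? 1 : 0;
my $one_is_noncanon =
    ($super_1 =~ /\p{Decomposition_Type: Non_Canonical}/) ? 1 : 0;
my $one_is_super    = ($super_1 =~ /\p{Dt=Super}/) ? 1 : 0;
my $plain_is_none   = ("E"      =~ /\p{Dt=None}/)  ? 1 : 0;
```
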
647=item B<C<\p{Graph}>>
648
649Matches any character that is graphic. Theoretically, this means a character
650that on a printer would cause ink to be used.
651
652=item B<C<\p{HorizSpace}>>
653
b19eb496 654This is the same as C<\h> and C<\p{Blank}>: a character that changes the
655spacing horizontally.
656
42581d5d 657=item B<C<\p{In=*}>>
658
This is a synonym for C<\p{Present_In=*}>.
660
661=item B<C<\p{PerlSpace}>>
662
663This is the same as C<\s>, restricted to ASCII, namely C<S<[ \f\n\r\t]>>.
664
665Mnemonic: Perl's (original) space
666
667=item B<C<\p{PerlWord}>>
668
This is the same as C<\w>, restricted to ASCII, namely C<[A-Za-z0-9_]>.
670
671Mnemonic: Perl's (original) word.
672
42581d5d 673=item B<C<\p{Posix...}>>
9f815e24 674
b19eb496
TC
675There are several of these, which are equivalents using the C<\p>
676notation for Posix classes and are described in
42581d5d 677L<perlrecharclass/POSIX Character Classes>.
9f815e24
KW
678
679=item B<C<\p{Present_In: *}>> (Short: C<\p{In=*}>)
680
Use this property when you need to know which Unicode version(s) a
character is present in.
683
684The "*" above stands for some two digit Unicode version number, such as
685C<1.1> or C<4.0>; or the "*" can also be C<Unassigned>. This property will
686match the code points whose final disposition has been settled as of the
687Unicode release given by the version number; C<\p{Present_In: Unassigned}>
688will match those code points whose meaning has yet to be assigned.
689
690For example, C<U+0041> "LATIN CAPITAL LETTER A" was present in the very first
691Unicode release available, which is C<1.1>, so this property is true for all
692valid "*" versions. On the other hand, C<U+1EFF> was not assigned until version
6935.1 when it became "LATIN SMALL LETTER Y WITH LOOP", so the only "*" that
694would match it are 5.1, 5.2, and later.
695
696Unicode furnishes the C<Age> property from which this is derived. The problem
697with Age is that a strict interpretation of it (which Perl takes) has it
698matching the precise release a code point's meaning is introduced in. Thus
699C<U+0041> would match only 1.1; and C<U+1EFF> only 5.1. This is not usually what
700you want.
701
702Some non-Perl implementations of the Age property may change its meaning to be
703the same as the Perl Present_In property; just be aware of that.
704
705Another confusion with both these properties is that the definition is not
b19eb496
TC
706that the code point has been I<assigned>, but that the meaning of the code point
707has been I<determined>. This is because 66 code points will always be
708unassigned, and so the Age for them is the Unicode version in which the decision
709to make them so was made. For example, C<U+FDD0> is to be permanently
9f815e24 710unassigned to a character, and the decision to do that was made in version 3.1,
b19eb496 711so C<\p{Age=3.1}> matches this character, as also does C<\p{Present_In: 3.1}> and up.
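The difference between C<Age> and C<Present_In> can be sketched with the
two code points discussed above. The helper C<matches> and the hash keys
are made up for this illustration; the property names are the ones this
section describes.

```perl
use strict;
use warnings;

# Return 1 or 0 so the results are easy to print or compare.
sub matches { my ($char, $re) = @_; return $char =~ $re ? 1 : 0 }

my $cap_a  = "\x{0041}";   # LATIN CAPITAL LETTER A, introduced in 1.1
my $y_loop = "\x{1EFF}";   # LATIN SMALL LETTER Y WITH LOOP, introduced in 5.1

my %result = (
    # Age matches only the release that introduced the code point.
    a_age_1_1     => matches($cap_a,  qr/\p{Age=1.1}/),
    a_age_5_1     => matches($cap_a,  qr/\p{Age=5.1}/),
    y_age_5_1     => matches($y_loop, qr/\p{Age=5.1}/),

    # Present_In matches that release and every later one.
    a_present_5_1 => matches($cap_a,  qr/\p{Present_In: 5.1}/),
    y_present_5_0 => matches($y_loop, qr/\p{Present_In: 5.0}/),
    y_present_5_2 => matches($y_loop, qr/\p{In=5.2}/),
);
```

With these definitions C<a_age_5_1> and C<y_present_5_0> are 0 and the
other entries are 1.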
9f815e24
KW
712
713=item B<C<\p{Print}>>
714
ae5b72c8 715This matches any character that is graphical or blank, except controls.
9f815e24
KW
716
717=item B<C<\p{SpacePerl}>>
718
719This is the same as C<\s>, including beyond ASCII.
720
4d4acfba 721Mnemonic: Space, as modified by Perl. (It doesn't include the vertical tab
b19eb496 722which both the Posix standard and Unicode consider white space.)
9f815e24
KW
723
724=item B<C<\p{VertSpace}>>
725
726This is the same as C<\v>: A character that changes the spacing vertically.
727
728=item B<C<\p{Word}>>
729
b19eb496 730This is the same as C<\w>, including over 100_000 characters beyond ASCII.
9f815e24 731
42581d5d
KW
732=item B<C<\p{XPosix...}>>
733
b19eb496 734There are several of these, which are the standard Posix classes
42581d5d
KW
735extended to the full Unicode range. They are described in
736L<perlrecharclass/POSIX Character Classes>.
737
9f815e24
KW
738=back
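The ASCII-restricted and full-range variants listed above can be compared
side by side. This is only a sketch; the helper C<matches> and the hash
keys are invented for the example.

```perl
use strict;
use warnings;

# Return 1 or 0 so the results are easy to print or compare.
sub matches { my ($char, $re) = @_; return $char =~ $re ? 1 : 0 }

my $e_acute = "\x{E9}";   # LATIN SMALL LETTER E WITH ACUTE
my $nbsp    = "\x{A0}";   # NO-BREAK SPACE

my %result = (
    # \p{Word} covers the full Unicode range; \p{PerlWord} is ASCII only.
    word_unicode    => matches($e_acute, qr/\p{Word}/),
    word_ascii_only => matches($e_acute, qr/\p{PerlWord}/),

    # Likewise for \p{SpacePerl} versus \p{PerlSpace}.
    space_unicode    => matches($nbsp, qr/\p{SpacePerl}/),
    space_ascii_only => matches($nbsp, qr/\p{PerlSpace}/),

    # Horizontal versus vertical spacing.
    tab_is_horiz     => matches("\t", qr/\p{HorizSpace}/),
    newline_is_horiz => matches("\n", qr/\p{HorizSpace}/),
    newline_is_vert  => matches("\n", qr/\p{VertSpace}/),

    # A space is printable but not graphic; a tab is a control.
    space_is_graph => matches(" ",  qr/\p{Graph}/),
    space_is_print => matches(" ",  qr/\p{Print}/),
    tab_is_print   => matches("\t", qr/\p{Print}/),
);
```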
739
376d9008 740=head2 User-Defined Character Properties
491fd90a 741
51f494cc
KW
742You can define your own binary character properties by defining subroutines
743whose names begin with "In" or "Is". The subroutines can be defined in any
744package. The user-defined properties can be used in the regular expression
745C<\p> and C<\P> constructs; if you are using a user-defined property from a
746package other than the one you are in, you must specify its package in the
747C<\p> or C<\P> construct.
bac0b425 748
    # assuming property IsForeign defined in Lang::
bac0b425
JP
750 package main; # property package name required
751 if ($txt =~ /\p{Lang::IsForeign}+/) { ... }
752
753 package Lang; # property package name not required
754 if ($txt =~ /\p{IsForeign}+/) { ... }
755
756
757Note that the effect is compile-time and immutable once defined.
b19eb496
TC
758However, the subroutines are passed a single parameter, which is 0 if
759case-sensitive matching is in effect and non-zero if caseless matching
56ca34ca
KW
760is in effect. The subroutine may return different values depending on
761the value of the flag, and one set of values will immutably be in effect
b19eb496 762for all case-sensitive matches, and the other set for all case-insensitive
56ca34ca 763matches.
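A minimal sketch of a property that uses the caseless flag follows. The
name C<IsAsciiUpper> is hypothetical; the subroutine returns a different
definition depending on whether C</i> matching is in effect.

```perl
use strict;
use warnings;

# Hypothetical property: ASCII uppercase letters, or any ASCII letter
# when the pattern is matched caselessly.
sub IsAsciiUpper {
    my ($caseless) = @_;                 # 0 or non-zero, supplied by Perl
    my $base = $caseless ? 'Alphabetic' : 'Uppercase';
    return "+utf8::$base\n&utf8::ASCII\n";
}

my $upper_sensitive = "A" =~ /\p{IsAsciiUpper}/  ? 1 : 0;
my $lower_sensitive = "a" =~ /\p{IsAsciiUpper}/  ? 1 : 0;
my $lower_caseless  = "a" =~ /\p{IsAsciiUpper}/i ? 1 : 0;
my $digit_caseless  = "1" =~ /\p{IsAsciiUpper}/i ? 1 : 0;
```

Only C<$upper_sensitive> and C<$lower_caseless> are expected to be true
here.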
491fd90a 764
b19eb496 765Note that if the regular expression is tainted, then Perl will die rather
0e9be77f
DM
766than calling the subroutine, where the name of the subroutine is
767determined by the tainted data.
768
376d9008
JB
769The subroutines must return a specially-formatted string, with one
770or more newline-separated lines. Each line must be one of the following:
491fd90a
JH
771
772=over 4
773
774=item *
775
510254c9
A
776A single hexadecimal number denoting a Unicode code point to include.
777
778=item *
779
99a6b1f0 780Two hexadecimal numbers separated by horizontal whitespace (space or
376d9008 781tabular characters) denoting a range of Unicode code points to include.
491fd90a
JH
782
783=item *
784
376d9008 785Something to include, prefixed by "+": a built-in character
bac0b425
JP
786property (prefixed by "utf8::") or a user-defined character property,
787to represent all the characters in that property; two hexadecimal code
788points for a range; or a single hexadecimal code point.
491fd90a
JH
789
790=item *
791
376d9008 792Something to exclude, prefixed by "-": an existing character
bac0b425
JP
793property (prefixed by "utf8::") or a user-defined character property,
794to represent all the characters in that property; two hexadecimal code
795points for a range; or a single hexadecimal code point.
491fd90a
JH
796
797=item *
798
Something to negate, prefixed by "!": an existing character
bac0b425
JP
800property (prefixed by "utf8::") or a user-defined character property,
801to represent all the characters in that property; two hexadecimal code
802points for a range; or a single hexadecimal code point.
803
804=item *
805
Something to intersect with, prefixed by "&": an existing character
property (prefixed by "utf8::") or a user-defined character property,
to keep only those characters that are also in the property; two
hexadecimal code points for a range; or a single hexadecimal code point.
491fd90a
JH
810
811=back
812
813For example, to define a property that covers both the Japanese
814syllabaries (hiragana and katakana), you can define
815
816 sub InKana {
d88362ca 817 return <<END;
d5822f25
A
818 3040\t309F
819 30A0\t30FF
491fd90a
JH
820 END
821 }
822
d5822f25
A
823Imagine that the here-doc end marker is at the beginning of the line.
824Now you can use C<\p{InKana}> and C<\P{InKana}>.
491fd90a
JH
825
826You could also have used the existing block property names:
827
828 sub InKana {
d88362ca 829 return <<'END';
491fd90a
JH
830 +utf8::InHiragana
831 +utf8::InKatakana
832 END
833 }
834
835Suppose you wanted to match only the allocated characters,
d5822f25 836not the raw block ranges: in other words, you want to remove
491fd90a
JH
837the non-characters:
838
839 sub InKana {
d88362ca 840 return <<'END';
491fd90a
JH
841 +utf8::InHiragana
842 +utf8::InKatakana
843 -utf8::IsCn
844 END
845 }
846
847The negation is useful for defining (surprise!) negated classes.
848
849 sub InNotKana {
d88362ca 850 return <<'END';
491fd90a
JH
851 !utf8::InHiragana
852 -utf8::InKatakana
853 +utf8::IsCn
854 END
855 }
856
bac0b425
JP
857Intersection is useful for getting the common characters matched by
858two (or more) classes.
859
860 sub InFooAndBar {
861 return <<'END';
862 +main::Foo
863 &main::Bar
864 END
865 }
866
ac036724 867It's important to remember not to use "&" for the first set; that
b19eb496 868would be intersecting with nothing, resulting in an empty set.
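Putting the pieces together, here is a small self-contained sketch that
defines C<InKana> as above and then intersects it with a built-in
property. The name C<InKanaLetter> and the sample text are assumptions
made for the example.

```perl
use strict;
use warnings;

# Assigned characters in the Hiragana and Katakana blocks.
sub InKana {
    return <<'END';
+utf8::InHiragana
+utf8::InKatakana
-utf8::IsCn
END
}

# Hypothetical property: the alphabetic characters of InKana only.
# The first line is a "+" line, so the "&" has something to intersect.
sub InKanaLetter {
    return <<'END';
+main::InKana
&utf8::Alphabetic
END
}

# HIRAGANA LETTER A, KATAKANA LETTER A, then some ASCII.
my $text = "\x{3042}\x{30A2}abc";
my @kana = $text =~ /(\p{InKana})/g;

my $middle_dot       = "\x{30FB}";   # KATAKANA MIDDLE DOT (punctuation)
my $dot_is_kana      = $middle_dot =~ /\p{InKana}/       ? 1 : 0;
my $dot_is_kana_lett = $middle_dot =~ /\p{InKanaLetter}/ ? 1 : 0;
```

The list C<@kana> receives the two kana characters, and the middle dot
matches C<InKana> but not C<InKanaLetter>.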
bac0b425 869
68585b5e 870=head2 User-Defined Case Mappings (for serious hackers only)
822502e5 871
0541896b
KW
B<This feature is deprecated and is scheduled to be removed in Perl
5.16.>
The CPAN module L<Unicode::Casing> provides better functionality
without the drawbacks described below.
876
877You can define your own mappings to be used in C<lc()>,
d5cd9e7b 878C<lcfirst()>, C<uc()>, and C<ucfirst()> (or their string-inlined versions,
68585b5e
KW
879C<\L>, C<\l>, C<\U>, and C<\u>). The mappings are currently only valid
880on strings encoded in UTF-8, but see below for a partial workaround for
881this restriction.
882
822502e5 883The principle is similar to that of user-defined character
68585b5e
KW
884properties: define subroutines that do the mappings.
885C<ToLower> is used for C<lc()>, C<\L>, C<lcfirst()>, and C<\l>; C<ToTitle> for
886C<ucfirst()> and C<\u>; and C<ToUpper> for C<uc()> and C<\U>.
3a2263fe 887
68585b5e 888C<ToUpper()> should look something like this:
3a2263fe
RGS
889
890 sub ToUpper {
d88362ca 891 return <<END;
68585b5e
KW
892 0061\t007A\t0041
893 0101\t\t0100
3a2263fe
RGS
894 END
895 }
896
68585b5e
KW
897This sample C<ToUpper()> has the effect of mapping "a-z" to "A-Z", 0x101
898to 0x100, and all other characters map to themselves. The first
899returned line means to map the code point at 0x61 ("a") to 0x41 ("A"),
900the code point at 0x62 ("b") to 0x42 ("B"), ..., 0x7A ("z") to 0x5A
901("Z"). The second line maps just the code point 0x101 to 0x100. Since
902there are no other mappings defined, all other code points map to
903themselves.
904
905This mechanism is not well behaved as far as affecting other packages
906and scopes. All non-threaded programs have exactly one uppercasing
907behavior, one lowercasing behavior, and one titlecasing behavior in
908effect for utf8-encoded strings for the duration of the program. Each
909of these behaviors is irrevocably determined the first time the
910corresponding function is called to change a utf8-encoded string's case.
911If a corresponding C<To-> function has been defined in the package that
912makes that first call, the mapping defined by that function will be the
913mapping used for the duration of the program's execution across all
914packages and scopes. If no corresponding C<To-> function has been
915defined in that package, the standard official mapping will be used for
916all packages and scopes, and any corresponding C<To-> function anywhere
917will be ignored. Threaded programs have similar behavior. If the
918program's casing behavior has been decided at the time of a thread's
919creation, the thread will inherit that behavior. But, if the behavior
920hasn't been decided, the thread gets to decide for itself, and its
921decision does not affect other threads nor its creator.
922
923As shown by the example above, you have to furnish a complete mapping;
924you can't just override a couple of characters and leave the rest
71648f9a 925unchanged. You can find all the official mappings in the directory
d5cd9e7b
KW
926C<$Config{privlib}>F</unicore/To/>. The mapping data is returned as the
927here-document. The C<utf8::ToSpecI<Foo>> hashes in those files are special
928exception mappings derived from
C<$Config{privlib}>F</unicore/SpecialCasing.txt>. (The "Digit" and
"Fold" mappings that one can see in the directory are not directly
user-accessible; instead, use the L<Unicode::UCD> module, or just match
case-insensitively, which is what uses the "Fold" mapping. Neither is
user-overridable.)
3a2263fe 934
71648f9a
KW
935If you have many mappings to change, you can take the official mapping data,
936change by hand the affected code points, and place the whole thing into your
937subroutine. But this will only be valid on Perls that use the same Unicode
938version. Another option would be to have your subroutine read the official
b19eb496 939mapping files and overwrite the affected code points.
3a2263fe 940
167630b6
KW
941If you have only a few mappings to change, starting in 5.14 you can use the
942following trick, here illustrated for Turkish.
71648f9a
KW
943
944 use Config;
70a5eb4a 945 use charnames ":full";
71648f9a
KW
946
947 sub ToUpper {
948 my $official = do "$Config{privlib}/unicore/To/Upper.pl";
70a5eb4a 949 $utf8::ToSpecUpper{'i'} =
71648f9a
KW
950 "\N{LATIN CAPITAL LETTER I WITH DOT ABOVE}";
951 return $official;
952 }
953
954This takes the official mappings and overrides just one, for "LATIN SMALL
167630b6 955LETTER I". The keys to the hash must be the bytes that form the UTF-8
70a5eb4a
KW
956(on EBCDIC platforms, UTF-EBCDIC) of the character, as illustrated by
957the inverse function.
71648f9a
KW
958
959 sub ToLower {
        my $official = do "$Config{privlib}/unicore/To/Lower.pl";
961 $utf8::ToSpecLower{"\xc4\xb0"} = "i";
962 return $official;
963 }
964
70a5eb4a
KW
965This example is for an ASCII platform, and C<\xc4\xb0> is the string of
966bytes that together form the UTF-8 that represents C<\N{LATIN CAPITAL
967LETTER I WITH DOT ABOVE}>, C<U+0130>. You can avoid having to figure out
968these bytes, and at the same time make it work on all platforms by
969instead writing:
71648f9a 970
70a5eb4a
KW
971 sub ToLower {
        my $official = do "$Config{privlib}/unicore/To/Lower.pl";
973 my $sequence = "\N{LATIN CAPITAL LETTER I WITH DOT ABOVE}";
974 utf8::encode($sequence);
975 $utf8::ToSpecLower{$sequence} = "i";
976 return $official;
71648f9a
KW
977 }
978
70a5eb4a
KW
979This works because C<utf8::encode()> takes the single character and
980converts it to the sequence of bytes that constitute it. Note that we took
advantage of the fact that C<"i"> is the same in UTF-8 or UTF-EBCDIC as not;
982otherwise we would have had to write
983
984 $utf8::ToSpecLower{$sequence} = "\N{LATIN SMALL LETTER I}";
985
986in the ToLower example, and in the ToUpper example, use
987
988 my $sequence = "\N{LATIN SMALL LETTER I}";
989 utf8::encode($sequence);
990
b19eb496 991A big caveat to the above trick and to this whole mechanism in general,
68585b5e 992is that they work only on strings encoded in UTF-8. You can partially
0541896b
KW
get around this by using C<use subs>, as shown further below. (But
better to just convert to use L<Unicode::Casing>.)
167630b6
KW
995(The trick illustrated here does work in earlier releases, but only if all the
996characters you want to override have ordinal values of 256 or higher, or
997if you use the other tricks given just below.)
998
999The mappings are in effect only for the package they are defined in, and only
1000on scalars that have been marked as having Unicode characters, for example by
1001using C<utf8::upgrade()>. Although probably not advisable, you can
1002cause the mappings to be used globally by importing into C<CORE::GLOBAL>
1003(see L<CORE>).
1004
You can partially get around the restriction that the source strings
must be in utf8 by using C<use subs> (or by importing into
C<CORE::GLOBAL>), like this:
70a5eb4a
KW
1007
1008 use subs qw(uc ucfirst lc lcfirst);
1009
1010 sub uc($) {
1011 my $string = shift;
1012 utf8::upgrade($string);
1013 return CORE::uc($string);
1014 }
1015
1016 sub lc($) {
1017 my $string = shift;
1018 utf8::upgrade($string);
1019
1020 # Unless an I is before a dot_above, it turns into a dotless i.
1021 # (The character class with the combining classes matches non-above
755789c0
KW
        # marks following the I.  Any number of these may be between the
        # 'I' and the dot_above, and the dot_above will still apply to the
        # 'I'.)
70a5eb4a
KW
1025 use charnames ":full";
1026 $string =~
1027 s/I
1028 (?! [^\p{ccc=0}\p{ccc=Above}]* \N{COMBINING DOT ABOVE} )
1029 /\N{LATIN SMALL LETTER DOTLESS I}/gx;
1030
1031 # But when the I is followed by a dot_above, remove the
1032 # dot_above so the end result will be i.
1033 $string =~ s/I
1034 ([^\p{ccc=0}\p{ccc=Above}]* )
1035 \N{COMBINING DOT ABOVE}
1036 /i$1/gx;
1037 return CORE::lc($string);
1038 }
71648f9a
KW
1039
1040These examples (also for Turkish) make sure the input is in UTF-8, and then
1041call the corresponding official function, which will use the C<ToUpper()> and
68585b5e 1042C<ToLower()> functions you have defined.
70a5eb4a
KW
1043(For Turkish, there are other required functions: C<ucfirst>, C<lcfirst>,
1044and C<ToTitle>. These are very similar to the ones given above.)
1045
b19eb496
TC
1046The reason this is only a partial fix is that it doesn't affect the C<\l>,
1047C<\L>, C<\u>, and C<\U> case-change operations in regular expressions,
167630b6
KW
1048which still require the source to be encoded in utf8 (see L</The "Unicode
1049Bug">). (Again, use L<Unicode::Casing> instead.)
70a5eb4a
KW
1050
1051The C<lc()> example shows how you can add context-dependent casing. Note
1052that context-dependent casing suffers from the problem that the string
1053passed to the casing function may not have sufficient context to make
b19eb496 1054the proper choice. Also, it will not be called for C<\l>, C<\L>, C<\u>,
167630b6 1055and C<\U>.
3a2263fe 1056
376d9008 1057=head2 Character Encodings for Input and Output
8cbd9a7a 1058
7221edc9 1059See L<Encode>.
8cbd9a7a 1060
c29a771d 1061=head2 Unicode Regular Expression Support Level
776f8809 1062
b19eb496
TC
1063The following list of Unicode supported features for regular expressions describes
1064all features currently directly supported by core Perl. The references to "Level N"
8158862b 1065and the section numbers refer to the Unicode Technical Standard #18,
b19eb496 1066"Unicode Regular Expressions", version 13, from August 2008.
776f8809
JH
1067
1068=over 4
1069
1070=item *
1071
1072Level 1 - Basic Unicode Support
1073
755789c0
KW
1074 RL1.1 Hex Notation - done [1]
1075 RL1.2 Properties - done [2][3]
1076 RL1.2a Compatibility Properties - done [4]
1077 RL1.3 Subtraction and Intersection - MISSING [5]
1078 RL1.4 Simple Word Boundaries - done [6]
1079 RL1.5 Simple Loose Matches - done [7]
1080 RL1.6 Line Boundaries - MISSING [8][9]
1081 RL1.7 Supplementary Code Points - done [10]
1082
1083 [1] \x{...}
1084 [2] \p{...} \P{...}
1085 [3] supports not only minimal list, but all Unicode character
d9742aa3 1086 properties (see Unicode Character Properties above)
755789c0
KW
1087 [4] \d \D \s \S \w \W \X [:prop:] [:^prop:]
1088 [5] can use regular expression look-ahead [a] or
1089 user-defined character properties [b] to emulate set
1090 operations
1091 [6] \b \B
1092 [7] note that Perl does Full case-folding in matching (but with
1093 bugs), not Simple: for example U+1F88 is equivalent to
e4d56f70
NC
1094 U+1F00 U+03B9, instead of just U+1F80. This difference
1095 matters mainly for certain Greek capital letters with certain
755789c0
KW
1096 modifiers: the Full case-folding decomposes the letter,
1097 while the Simple case-folding would map it to a single
1098 character.
1099 [8] should do ^ and $ also on U+000B (\v in C), FF (\f), CR
1100 (\r), CRLF (\r\n), NEL (U+0085), LS (U+2028), and PS
1101 (U+2029); should also affect <>, $., and script line
1102 numbers; should not split lines within CRLF [c] (i.e. there
1103 is no empty line between \r and \n)
   [9] Linebreaking conformant with UAX#14 "Unicode Line Breaking
       Algorithm" is available through the Unicode::LineBreak
       module.
  [10] UTF-8/UTF-EBCDIC used in Perl allows not only U+10000 to
1108 U+10FFFF but also beyond U+10FFFF
7207e29d 1109
237bad5b 1110[a] You can mimic class subtraction using lookahead.
8158862b 1111For example, what UTS#18 might write as
29bdacb8 1112
dbe420b4
JH
1113 [{Greek}-[{UNASSIGNED}]]
1114
1115in Perl can be written as:
1116
1d81abf3
JH
1117 (?!\p{Unassigned})\p{InGreekAndCoptic}
1118 (?=\p{Assigned})\p{InGreekAndCoptic}
dbe420b4
JH
1119
1120But in this particular example, you probably really want
1121
    \p{Greek}
dbe420b4
JH
1123
1124which will match assigned characters known to be part of the Greek script.
29bdacb8 1125
Also see the L<Unicode::Regex::Set> module; it implements the full
UTS#18 grouping, intersection, union, and removal (subtraction) syntax.
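The lookahead emulation of set subtraction described in [a] can be tried
out as follows. This is a sketch only: it spells the block with the
compound form C<\p{Block: Greek_And_Coptic}>, and it assumes that
C<U+0378> is still an unassigned code point inside that block in the
Unicode version your Perl supports.

```perl
use strict;
use warnings;

# "In the Greek and Coptic block, but not unassigned."
my $assigned_greek = qr/(?!\p{Unassigned})\p{Block: Greek_And_Coptic}/;

my $alpha    = "\x{03B1}";   # GREEK SMALL LETTER ALPHA
my $reserved = "\x{0378}";   # assumed unassigned, but inside the block

my $alpha_matches    = $alpha    =~ $assigned_greek ? 1 : 0;
my $reserved_matches = $reserved =~ $assigned_greek ? 1 : 0;

# Without the lookahead, the raw block range matches both.
my $reserved_in_block = $reserved =~ /\p{Block: Greek_And_Coptic}/ ? 1 : 0;
```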
1128
1129[b] '+' for union, '-' for removal (set-difference), '&' for intersection
1130(see L</"User-Defined Character Properties">)
1131
1132[c] Try the C<:crlf> layer (see L<PerlIO>).
5ca1ac52 1133
776f8809
JH
1134=item *
1135
1136Level 2 - Extended Unicode Support
1137
755789c0
KW
1138 RL2.1 Canonical Equivalents - MISSING [10][11]
1139 RL2.2 Default Grapheme Clusters - MISSING [12]
1140 RL2.3 Default Word Boundaries - MISSING [14]
1141 RL2.4 Default Loose Matches - MISSING [15]
1142 RL2.5 Name Properties - DONE
1143 RL2.6 Wildcard Properties - MISSING
8158862b 1144
755789c0
KW
1145 [10] see UAX#15 "Unicode Normalization Forms"
1146 [11] have Unicode::Normalize but not integrated to regexes
1147 [12] have \X but we don't have a "Grapheme Cluster Mode"
1148 [14] see UAX#29, Word Boundaries
1149 [15] see UAX#21 "Case Mappings"
776f8809
JH
1150
1151=item *
1152
8158862b
TS
1153Level 3 - Tailored Support
1154
755789c0
KW
1155 RL3.1 Tailored Punctuation - MISSING
1156 RL3.2 Tailored Grapheme Clusters - MISSING [17][18]
1157 RL3.3 Tailored Word Boundaries - MISSING
1158 RL3.4 Tailored Loose Matches - MISSING
1159 RL3.5 Tailored Ranges - MISSING
1160 RL3.6 Context Matching - MISSING [19]
1161 RL3.7 Incremental Matches - MISSING
8158862b 1162 ( RL3.8 Unicode Set Sharing )
755789c0
KW
1163 RL3.9 Possible Match Sets - MISSING
1164 RL3.10 Folded Matching - MISSING [20]
1165 RL3.11 Submatchers - MISSING
1166
  [17] see UTS#10 "Unicode Collation Algorithm"
1168 [18] have Unicode::Collate but not integrated to regexes
1169 [19] have (?<=x) and (?=x), but look-aheads or look-behinds
1170 should see outside of the target substring
1171 [20] need insensitive matching for linguistic features other
1172 than case; for example, hiragana to katakana, wide and
1173 narrow, simplified Han to traditional Han (see UTR#30
1174 "Character Foldings")
776f8809
JH
1175
1176=back
1177
c349b1b9
JH
1178=head2 Unicode Encodings
1179
376d9008
JB
1180Unicode characters are assigned to I<code points>, which are abstract
1181numbers. To use these numbers, various encodings are needed.
c349b1b9
JH
1182
1183=over 4
1184
c29a771d 1185=item *
5cb3728c
RB
1186
1187UTF-8
c349b1b9 1188
6d4f9cf2
KW
1189UTF-8 is a variable-length (1 to 4 bytes), byte-order independent
1190encoding. For ASCII (and we really do mean 7-bit ASCII, not another
11918-bit encoding), UTF-8 is transparent.
c349b1b9 1192
8c007b5a 1193The following table is from Unicode 3.2.
05632f9a 1194
   Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte

     U+0000..U+007F       00..7F
     U+0080..U+07FF     * C2..DF    80..BF
     U+0800..U+0FFF       E0      * A0..BF    80..BF
     U+1000..U+CFFF       E1..EC    80..BF    80..BF
     U+D000..U+D7FF       ED        80..9F    80..BF
     U+D800..U+DFFF       +++++ utf16 surrogates, not legal utf8 +++++
     U+E000..U+FFFF       EE..EF    80..BF    80..BF
    U+10000..U+3FFFF      F0      * 90..BF    80..BF    80..BF
    U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
   U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF
e1b711da 1207
b19eb496 1208Note the gaps marked by "*" before several of the byte entries above. These are
e1b711da
KW
1209caused by legal UTF-8 avoiding non-shortest encodings: it is technically
1210possible to UTF-8-encode a single code point in different ways, but that is
1211explicitly forbidden, and the shortest possible encoding should always be used
1212(and that is what Perl does).
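To see the shortest-form rule enforced, one can decode bytes with the
strict C<UTF-8> encoding of the core L<Encode> module. The helper name
C<decode_strict> is invented for this sketch, and C<"\xC0\xAF"> is used
as a well-known non-shortest encoding of C<"/">.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode strictly; return undef if the bytes are not well-formed UTF-8.
sub decode_strict {
    my ($bytes) = @_;
    # Work on a copy: with a CHECK argument, decode() may modify its input.
    my $copy  = $bytes;
    my $chars = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return $chars;
}

my $shortest = decode_strict("\xC3\xA9");   # valid two-byte form of U+00E9
my $overlong = decode_strict("\xC0\xAF");   # non-shortest form of "/"

# Perl itself always produces the shortest form when encoding.
my $slash = "/";
utf8::encode($slash);
my $slash_length = length $slash;
```

Here C<$shortest> holds the single character C<U+00E9>, C<$overlong> is
undefined because the decoder rejected the input, and C<$slash_length>
is 1.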
37361303 1213
376d9008 1214Another way to look at it is via bits:
05632f9a 1215
                Code Points                 1st Byte 2nd Byte 3rd Byte 4th Byte

                       0aaaaaaa             0aaaaaaa
               00000bbbbbaaaaaa             110bbbbb 10aaaaaa
               ccccbbbbbbaaaaaa             1110cccc 10bbbbbb 10aaaaaa
     00000dddccccccbbbbbbaaaaaa             11110ddd 10cccccc 10bbbbbb 10aaaaaa
05632f9a 1222
9f815e24 1223As you can see, the continuation bytes all begin with "10", and the
e1b711da 1224leading bits of the start byte tell how many bytes there are in the
05632f9a
JH
1225encoded character.
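The bit layout above translates directly into code. The following is a
sketch for illustration, not how Perl does it internally; the function
names are made up, and it handles only code points up to C<U+10FFFF>
without checking for surrogates.

```perl
use strict;
use warnings;

# Build the UTF-8 bytes of a code point by hand, following the bit table.
sub utf8_bytes {
    my ($cp) = @_;
    return ($cp) if $cp < 0x80;
    return (0xC0 | ($cp >> 6),
            0x80 | ($cp & 0x3F)) if $cp < 0x800;
    return (0xE0 | ($cp >> 12),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F)) if $cp < 0x10000;
    return (0xF0 | ($cp >> 18),
            0x80 | (($cp >> 12) & 0x3F),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F));
}

# The same bytes as produced by Perl's own encoder, for comparison.
sub perl_utf8_bytes {
    my ($cp) = @_;
    my $string = chr $cp;
    utf8::encode($string);
    return unpack 'C*', $string;
}

# Format a list of byte values as hex, e.g. "E2 82 AC".
sub hex_bytes { return join ' ', map { sprintf '%02X', $_ } @_ }

my $euro = hex_bytes(utf8_bytes(0x20AC));   # EURO SIGN
```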
1226
6d4f9cf2
KW
1227The original UTF-8 specification allowed up to 6 bytes, to allow
1228encoding of numbers up to 0x7FFF_FFFF. Perl continues to allow those,
1229and has extended that up to 13 bytes to encode code points up to what
1230can fit in a 64-bit word. However, Perl will warn if you output any of
b19eb496 1231these as being non-portable; and under strict UTF-8 input protocols,
6d4f9cf2
KW
1232they are forbidden.
1233
1234The Unicode non-character code points are also disallowed in UTF-8 in
1235"open interchange". See L</Non-character code points>.
1236
c29a771d 1237=item *
5cb3728c
RB
1238
1239UTF-EBCDIC
dbe420b4 1240
376d9008 1241Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.
dbe420b4 1242
c29a771d 1243=item *
5cb3728c 1244
1e54db1a 1245UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)
c349b1b9 1246
1bfb14c4
JH
The following items are mostly for reference and general Unicode
knowledge; Perl doesn't use these constructs internally.
dbe420b4 1249
b19eb496
TC
1250Like UTF-8, UTF-16 is a variable-width encoding, but where
1251UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units.
1252All code points occupy either 2 or 4 bytes in UTF-16: code points
1253C<U+0000..U+FFFF> are stored in a single 16-bit unit, and code
1bfb14c4 1254points C<U+10000..U+10FFFF> in two 16-bit units. The latter case is
c349b1b9
JH
1255using I<surrogates>, the first 16-bit unit being the I<high
1256surrogate>, and the second being the I<low surrogate>.
1257
376d9008 1258Surrogates are code points set aside to encode the C<U+10000..U+10FFFF>
c349b1b9 1259range of Unicode code points in pairs of 16-bit units. The I<high
9f815e24 1260surrogates> are the range C<U+D800..U+DBFF> and the I<low surrogates>
376d9008 1261are the range C<U+DC00..U+DFFF>. The surrogate encoding is
c349b1b9 1262
d88362ca
KW
1263 $hi = ($uni - 0x10000) / 0x400 + 0xD800;
1264 $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
c349b1b9
JH
1265
1266and the decoding is
1267
d88362ca 1268 $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
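A runnable round trip of the surrogate arithmetic might look like this.
The function names are hypothetical; the sketch uses shifts and masks,
which are equivalent to integer division and remainder by C<0x400>.

```perl
use strict;
use warnings;

# Split a supplementary code point into (high, low) surrogates.
sub to_surrogates {
    my ($uni) = @_;
    die sprintf("U+%04X is not in U+10000..U+10FFFF\n", $uni)
        if $uni < 0x10000 || $uni > 0x10FFFF;
    my $offset = $uni - 0x10000;             # a 20-bit value
    return (0xD800 + ($offset >> 10),        # top 10 bits
            0xDC00 + ($offset & 0x3FF));     # bottom 10 bits
}

# Combine a (high, low) surrogate pair back into a code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
}

my ($hi, $lo) = to_surrogates(0x1F600);
my $pair      = sprintf '%04X %04X', $hi, $lo;
my $restored  = from_surrogates($hi, $lo);
```

For C<U+1F600> this gives the pair C<D83D DE00>, and C<$restored> is
again C<0x1F600>.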
c349b1b9 1269
376d9008 1270Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16
c349b1b9 1271itself can be used for in-memory computations, but if storage or
376d9008
JB
1272transfer is required either UTF-16BE (big-endian) or UTF-16LE
1273(little-endian) encodings must be chosen.
c349b1b9
JH
1274
1275This introduces another problem: what if you just know that your data
376d9008
JB
1276is UTF-16, but you don't know which endianness? Byte Order Marks, or
1277BOMs, are a solution to this. A special character has been reserved
86bbd6d1 1278in Unicode to function as a byte order marker: the character with the
376d9008 1279code point C<U+FEFF> is the BOM.
042da322 1280
c349b1b9 1281The trick is that if you read a BOM, you will know the byte order,
376d9008
JB
1282since if it was written on a big-endian platform, you will read the
1283bytes C<0xFE 0xFF>, but if it was written on a little-endian platform,
1284you will read the bytes C<0xFF 0xFE>. (And if the originating platform
1285was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.)
042da322 1286
86bbd6d1 1287The way this trick works is that the character with the code point
6d4f9cf2 1288C<U+FFFE> is not supposed to be in input streams, so the
376d9008 1289sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in
little-endian format" and cannot be "C<U+FFFE>, represented in big-endian
format".
1292
1293Surrogates have no meaning in Unicode outside their use in pairs to
1294represent other code points. However, Perl allows them to be
1295represented individually internally, for example by saying
f651977e
TC
1296C<chr(0xD801)>, so that all code points, not just those valid for open
1297interchange, are
6d4f9cf2
KW
representable. Unicode does define semantics for them; for example, their
General Category is "Cs". But because their use is somewhat dangerous,
b19eb496
TC
1300Perl will warn (using the warning category "surrogate", which is a
1301sub-category of "utf8") if an attempt is made
6d4f9cf2
KW
1302to do things like take the lower case of one, or match
1303case-insensitively, or to output them. (But don't try this on Perls
1304before 5.14.)
c349b1b9 1305
c29a771d 1306=item *
5cb3728c 1307
1e54db1a 1308UTF-32, UTF-32BE, UTF-32LE
c349b1b9
JH
1309
The UTF-32 family is pretty much like the UTF-16 family, except that
042da322 1311the units are 32-bit, and therefore the surrogate scheme is not
f651977e 1312needed. UTF-32 is a fixed-width encoding. The BOM signatures are
b19eb496 1313C<0x00 0x00 0xFE 0xFF> for BE and C<0xFF 0xFE 0x00 0x00> for LE.
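The BOM signatures given above for UTF-8, UTF-16, and UTF-32 can be
checked with a few anchored patterns. This is a sketch with an invented
function name; note that it tests the UTF-32LE signature before the
UTF-16LE one, since the former begins with the latter, and that it
therefore cannot tell UTF-32LE apart from UTF-16LE data that starts with
a BOM followed by C<U+0000>.

```perl
use strict;
use warnings;

# Guess an encoding from the first bytes of some data; return nothing
# if no BOM is present.
sub sniff_bom {
    my ($bytes) = @_;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-32BE' if $bytes =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-32LE' if $bytes =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return;
}

my $guess = sniff_bom("\xFF\xFEa\x00");   # BOM plus "a" in UTF-16LE
```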
c349b1b9 1314
c29a771d 1315=item *
5cb3728c
RB
1316
1317UCS-2, UCS-4
c349b1b9 1318
b19eb496 1319Legacy, fixed-width encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit
376d9008 1320encoding. Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
339cfa0e 1321because it does not use surrogates. UCS-4 is a 32-bit encoding,
b19eb496 1322functionally identical to UTF-32 (the difference being that
ee88f7b6 1323UCS-4 forbids neither surrogates nor code points larger than 0x10_FFFF).
c349b1b9 1324
c29a771d 1325=item *
5cb3728c
RB
1326
1327UTF-7
c349b1b9 1328
376d9008
JB
1329A seven-bit safe (non-eight-bit) encoding, which is useful if the
1330transport or storage is not eight-bit safe. Defined by RFC 2152.
c349b1b9 1331
95a1a48b
JH
1332=back
1333
6d4f9cf2
KW
1334=head2 Non-character code points
1335
133666 code points are set aside in Unicode as "non-character code points".
1337These all have the Unassigned (Cn) General Category, and they never will
1338be assigned. These are never supposed to be in legal Unicode input
1339streams, so that code can use them as sentinels that can be mixed in
1340with character data, and they always will be distinguishable from that data.
1341To keep them out of Perl input streams, strict UTF-8 should be
1342specified, such as by using the layer C<:encoding('UTF-8')>. The
1343non-character code points are the 32 between U+FDD0 and U+FDEF, and the
134434 code points U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF.
1345Some people are under the mistaken impression that these are "illegal",
1346but that is not true. An application or cooperating set of applications
1347can legally use them at will internally; but these code points are
42581d5d
KW
1348"illegal for open interchange". Therefore, Perl will not accept these
1349from input streams unless lax rules are being used, and will warn
b19eb496 1350(using the warning category "nonchar", which is a sub-category of "utf8") if
42581d5d
KW
1351an attempt is made to output them.
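The two groups of non-character code points are easy to test for
arithmetically. The function name below is an assumption of this sketch;
the loop simply confirms that the total comes to 66.

```perl
use strict;
use warnings;

# True for the 32 code points U+FDD0..U+FDEF and for the last two code
# points of each of the 17 planes (U+xxFFFE and U+xxFFFF).
sub is_noncharacter {
    my ($cp) = @_;
    return 0 if $cp > 0x10FFFF;
    return 1 if $cp >= 0xFDD0 && $cp <= 0xFDEF;
    return (($cp & 0xFFFE) == 0xFFFE) ? 1 : 0;
}

my $count = 0;
for my $cp (0 .. 0x10FFFF) {
    $count++ if is_noncharacter($cp);
}
```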
1352
1353=head2 Beyond Unicode code points
1354
1355The maximum Unicode code point is U+10FFFF. But Perl accepts code
1356points up to the maximum permissible unsigned number available on the
1357platform. However, Perl will not accept these from input streams unless
1358lax rules are being used, and will warn (using the warning category
b19eb496 1359"non_unicode", which is a sub-category of "utf8") if an attempt is made to
42581d5d
KW
operate on or output them. For example, C<uc(chr(0x11_0000))> will generate
1361this warning, returning the input parameter as its result, as the upper
ee88f7b6 1362case of every non-Unicode code point is the code point itself.
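A short sketch of this behavior follows. It silences the "non_unicode"
warning category in a small block so that the result can be inspected
quietly; the variable names are illustrative.

```perl
use strict;
use warnings;

# One past the maximum Unicode code point.
my $above = chr(0x11_0000);

my $upper;
{
    # Without this, uc() would warn about the non-Unicode code point.
    no warnings 'non_unicode';
    $upper = uc $above;
}

my $unchanged = $upper eq $above ? 1 : 0;
```

The upper case of the above-Unicode character is the character itself,
so C<$unchanged> is 1.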
6d4f9cf2 1363
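As a small sketch of the behavior just described (assuming Perl 5.14 or
later, where the "non_unicode" warning category exists), the following
upper-cases a character just past the Unicode range with that warning
switched off, and shows that the result is the same code point.

```perl
use strict;
use warnings;
no warnings 'non_unicode';    # silence the warning described above

my $cp    = 0x11_0000;        # first code point above U+10FFFF
my $char  = chr($cp);
my $upper = uc($char);        # upper case of a non-Unicode code point

printf "uc(chr(0x%X)) is chr(0x%X)\n", $cp, ord($upper);
```
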
=head2 Security Implications of Unicode

Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.
Also, note the following:

=over 4

=item *

Malformed UTF-8

Unfortunately, the original specification of UTF-8 leaves some room for
interpretation of how many bytes of encoded output one should generate
from one input Unicode character. Strictly speaking, the shortest
possible sequence of UTF-8 bytes should be generated,
because otherwise there is potential for an input buffer overflow at
the receiving end of a UTF-8 connection. Perl always generates the
shortest length UTF-8, and with warnings on, Perl will warn about
non-shortest length UTF-8 along with other malformations, such as the
surrogates, which are not Unicode code points valid for interchange.

=item *

Regular expression pattern matching may surprise you if you're not
accustomed to Unicode. Starting in Perl 5.14, several pattern
modifiers are available to control this, called the character set
modifiers. Details are given in L<perlre/Character set modifiers>.

=back

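One way to see the non-shortest-form rule in action is to decode
untrusted octets with the L<Encode> module and ask it to die on
malformed input. This is only a sketch: the byte string C<"\xC0\x80">
is a two-byte (overlong) encoding of U+0000 chosen for illustration.

```perl
use strict;
use warnings;
use Encode ();

my $overlong = "\xC0\x80";    # non-shortest encoding of U+0000
my $shortest = "\x00";        # the shortest (valid) encoding

# FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
# decode() from modifying the string it is given.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

my $ok  = eval { Encode::decode('UTF-8', $shortest, $check); 1 } ? 1 : 0;
my $bad = eval { Encode::decode('UTF-8', $overlong, $check); 1 } ? 1 : 0;

print "shortest form accepted: $ok\n";
print "overlong form accepted: $bad\n";
```
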
As discussed elsewhere, Perl has one foot (two hooves?) planted in
each of two worlds: the old world of bytes and the new world of
characters, upgrading from bytes to characters when necessary.
If your legacy code does not explicitly use Unicode, no automatic
switch-over to characters should happen. Characters shouldn't get
downgraded to bytes, either. It is possible to accidentally mix bytes
and characters, however (see L<perluniintro>), in which case C<\w> in
regular expressions might start behaving differently (unless the C</a>
modifier is in effect). Review your code. Use warnings and the C<strict> pragma.

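The effect of the character set modifiers can be sketched as follows
(this needs Perl 5.14 or later). The two sample characters are chosen
for illustration only.

```perl
use strict;
use warnings;
use 5.014;    # the /a and /u modifiers first appeared in 5.14

my $arabic_zero = "\x{660}";    # ARABIC-INDIC DIGIT ZERO
my $e_acute     = "\xE9";       # LATIN SMALL LETTER E WITH ACUTE

my $digit_default = $arabic_zero =~ /\d/  ? 1 : 0;    # Unicode rules
my $digit_ascii   = $arabic_zero =~ /\d/a ? 1 : 0;    # ASCII-restricted
my $word_unicode  = $e_acute     =~ /\w/u ? 1 : 0;    # Unicode rules
my $word_ascii    = $e_acute     =~ /\w/a ? 1 : 0;    # ASCII-restricted

print "\\d  matches U+0660: $digit_default\n";
print "\\d/a matches U+0660: $digit_ascii\n";
print "\\w/u matches U+00E9: $word_unicode\n";
print "\\w/a matches U+00E9: $word_ascii\n";
```

With C</a> in effect, C<\d> and C<\w> match only ASCII characters, which
is often what input validation code intends.
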
=head2 Unicode in Perl on EBCDIC

The way Unicode is handled on EBCDIC platforms is still
experimental. On such platforms, references to UTF-8 encoding in this
document and elsewhere should be read as meaning the UTF-EBCDIC
specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues
are specifically discussed. There is no C<utfebcdic> pragma or
":utfebcdic" layer; rather, "utf8" and ":utf8" are reused to mean
the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic>
for more discussion of the issues.

=head2 Locales

See L<perllocale/Unicode and UTF-8>.

=head2 When Unicode Does Not Happen

While Perl does have extensive ways to input and output in Unicode,
and a few other "entry points" like the @ARGV array (which can sometimes be
interpreted as UTF-8), there are still many places where Unicode
(in some encoding or another) could be given as arguments or received as
results, or both, but it is not.

The following are such interfaces. Also, see L</The "Unicode Bug">.
For all of these interfaces Perl
currently (as of 5.8.3) simply assumes byte strings both as arguments
and results, or UTF-8 strings if the (problematic) C<encoding> pragma has been used.

One reason that Perl does not attempt to resolve the role of Unicode in
these situations is that the answers are highly dependent on the operating
system and the file system(s). For example, whether filenames can be
in Unicode and in exactly what kind of encoding, is not exactly a
portable concept. Similarly for C<qx> and C<system>: how well will the
"command-line interface" (and which of them?) handle Unicode?

=over 4

=item *

chdir, chmod, chown, chroot, exec, link, lstat, mkdir,
rename, rmdir, stat, symlink, truncate, unlink, utime, -X

=item *

%ENV

=item *

glob (aka the <*>)

=item *

open, opendir, sysopen

=item *

qx (aka the backtick operator), system

=item *

readdir, readlink

=back

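Because these interfaces deal in byte strings, a program that wants
Unicode file names has to do the conversion itself. The following is a
minimal sketch under a loudly stated assumption: that the file system
stores names as UTF-8 octets and returns them unchanged. The helpers
C<name_to_octets()> and C<octets_to_name()> are hypothetical names
introduced for this example.

```perl
use strict;
use warnings;
use Encode ();
use File::Temp ();

# Assumption: this file system stores names as UTF-8 octets.
my $FS_ENCODING = 'UTF-8';

# Hypothetical helpers: convert between characters and file-name octets.
sub name_to_octets { my ($name)   = @_; return Encode::encode($FS_ENCODING, $name) }
sub octets_to_name { my ($octets) = @_; return Encode::decode($FS_ENCODING, $octets) }

my $dir  = File::Temp::tempdir(CLEANUP => 1);
my $name = "\x{263A}.txt";    # WHITE SMILING FACE, then ".txt"

# open() is given octets, not characters.
open(my $fh, '>', "$dir/" . name_to_octets($name)) or die "open: $!";
close($fh) or die "close: $!";

# readdir() hands back octets, which are decoded explicitly.
opendir(my $dh, $dir) or die "opendir: $!";
my @names = map { octets_to_name($_) } grep { !/^\.\.?$/ } readdir($dh);
closedir($dh);

printf "found %d file(s)\n", scalar @names;
```

On systems with a different file-name encoding, only C<$FS_ENCODING>
would need to change, which is the reason for keeping the conversion in
one place.
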
=head2 The "Unicode Bug"

The term "Unicode bug" has been applied to an inconsistency
on ASCII platforms with the
Unicode code points in the Latin-1 Supplement block, that
is, between 128 and 255. Without a locale specified, unlike all other
characters or code points, these characters behave very differently under
byte semantics than under character semantics, unless
C<use feature 'unicode_strings'> is specified.
(The lesson here is to specify C<unicode_strings> to avoid the
headaches.)

In character semantics they are interpreted as Unicode code points, which means
they have the same semantics as Latin-1 (ISO-8859-1).

In byte semantics, they are considered to be unassigned characters, meaning
that the only semantics they have is their ordinal numbers, and that they are
not members of various character classes. None are considered to match C<\w>
for example, but all match C<\W>.

The behavior is known to have effects on these areas:

=over 4

=item *

Changing the case of a scalar, that is, using C<uc()>, C<ucfirst()>, C<lc()>,
and C<lcfirst()>, or C<\L>, C<\U>, C<\u> and C<\l> in regular expression
substitutions.

=item *

Using caseless (C</i>) regular expression matching

=item *

Matching any of several properties in regular expressions, namely C<\b>,
C<\B>, C<\s>, C<\S>, C<\w>, C<\W>, and all the Posix character classes
I<except> C<[[:ascii:]]>.

=item *

In C<quotemeta> or its inline equivalent C<\Q>, no
code points above 127 are quoted in UTF-8 encoded strings, but in
byte encoded strings, code points 128 through 255 are always quoted.

=item *

User-defined case change mappings. You can create a C<ToUpper()> function, for
example, which overrides Perl's built-in case mappings. The scalar must be
encoded in utf8 for your function to actually be invoked.

=back

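The first item in the list can be sketched as follows, for an ASCII
platform with no locale in effect. The same character is held twice,
once as a single byte and once upgraded to UTF-8; the two strings
compare equal, yet C<uc()> treats them differently when
C<unicode_strings> is off.

```perl
use strict;
use warnings;
no feature 'unicode_strings';    # demonstrate the legacy behavior

my $native   = "\xE9";           # e-acute, stored as one byte
my $upgraded = "\xE9";
utf8::upgrade($upgraded);        # same character, stored as UTF-8

my $same_string = $native eq $upgraded ? 1 : 0;
my $uc_native   = sprintf "%02X", ord uc $native;      # byte semantics
my $uc_upgraded = sprintf "%02X", ord uc $upgraded;    # character semantics

print "strings equal:      $same_string\n";
print "uc(native)   is U+00$uc_native\n";
print "uc(upgraded) is U+00$uc_upgraded\n";
```
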
This behavior can lead to unexpected results in which a string's semantics
suddenly change if a code point above 255 is appended to or removed from it,
which changes the string's semantics from byte to character or vice versa. As
an example, consider the following program and its output:

    $ perl -le'
        no feature "unicode_strings";
        $s1 = "\xC2";
        $s2 = "\x{2660}";
        for ($s1, $s2, $s1.$s2) {
            print /\w/ || 0;
        }
    '
    0
    0
    1

If there's no C<\w> in C<$s1> or in C<$s2>, why does their concatenation have one?

This anomaly stems from Perl's attempt to not disturb older programs that
didn't use Unicode, and hence had no semantics for characters outside of the
ASCII range (except in a locale), along with Perl's desire to add Unicode
support seamlessly. The result wasn't seamless: these characters were
orphaned.

Starting in Perl 5.14, C<use feature 'unicode_strings'> can be used to
cause Perl to use Unicode semantics on all string operations within the
scope of the feature subpragma. Regular expressions compiled in its
scope retain that behavior even when executed or compiled into larger
regular expressions outside the scope. (The pragma does not, however,
affect the C<quotemeta> behavior. Nor does it affect the deprecated
user-defined case changing operations--these still require a UTF-8
encoded string to operate.)

In Perl 5.12, the subpragma affected casing changes, but not regular
expressions. See L<perlfunc/lc> for details on how this pragma works in
combination with various others for casing.

For earlier Perls, or when a string is passed to a function outside the
subpragma's scope, a workaround is to always call C<utf8::upgrade($string)>,
or to use the standard module L<Encode>. Also, a scalar that has any characters
whose ordinal is 0x100 or above, or which were specified using either of the
C<\N{...}> notations, will automatically have character semantics.

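The two remedies just described can be compared side by side. This is
a sketch for Perl 5.14 or later on an ASCII platform; the variable
names are illustrative only.

```perl
use strict;
use warnings;

my $s = "\xE9";    # e-acute, stored as a single byte

my ($legacy, $after_upgrade, $with_feature);
{
    no feature 'unicode_strings';
    $legacy = $s =~ /\w/ ? 1 : 0;              # byte semantics: no match

    my $copy = $s;
    utf8::upgrade($copy);                      # the workaround
    $after_upgrade = $copy =~ /\w/ ? 1 : 0;    # character semantics
}
{
    use feature 'unicode_strings';
    $with_feature = $s =~ /\w/ ? 1 : 0;        # Unicode rules in this scope
}

print "legacy:          $legacy\n";
print "after upgrade:   $after_upgrade\n";
print "unicode_strings: $with_feature\n";
```

The feature is lexically scoped, so it can be introduced one block or
one file at a time in older code.
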
=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)

Sometimes (see L</"When Unicode Does Not Happen"> or L</The "Unicode Bug">)
there are situations where you simply need to force a byte
string into UTF-8, or vice versa. The low-level calls
C<utf8::upgrade($bytestring)> and
C<utf8::downgrade($utf8string[, FAIL_OK])> are
the answers.

Note that C<utf8::downgrade()> can fail if the string contains characters
that don't fit into a byte.

Calling either function on a string that already is in the desired state is a
no-op.

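A short sketch of both calls, including the failure case, is given
here. C<utf8::is_utf8()> is used only to inspect the internal flag; the
sample strings are arbitrary.

```perl
use strict;
use warnings;

my $str = "caf\xE9";

utf8::upgrade($str);                              # now UTF-8 internally
my $is_upgraded = utf8::is_utf8($str) ? 1 : 0;
my $length      = length($str);                   # still four characters

utf8::downgrade($str);                            # back to one byte each
my $is_downgraded = utf8::is_utf8($str) ? 0 : 1;

my $wide = "\x{263A}";                            # does not fit in a byte
my $soft = utf8::downgrade($wide, 1) ? 1 : 0;     # FAIL_OK: returns false
my $hard = eval { utf8::downgrade($wide); 1 } ? 1 : 0;    # dies instead

print "upgraded: $is_upgraded, length: $length, downgraded: $is_downgraded\n";
print "downgrade of wide string, FAIL_OK: $soft, without: $hard\n";
```

Neither call changes the characters in the string, only how they are
stored, which is why C<length()> is unaffected.
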
=head2 Using Unicode in XS

If you want to handle Perl Unicode in XS extensions, you may find the
following C APIs useful. See also L<perlguts/"Unicode Support"> for an
explanation about Unicode at the XS level, and L<perlapi> for the API
details.

=over 4

=item *

C<DO_UTF8(sv)> returns true if the C<UTF8> flag is on and the bytes
pragma is not in effect. C<SvUTF8(sv)> returns true if the C<UTF8>
flag is on; the bytes pragma is ignored. The C<UTF8> flag being on
does B<not> mean that there are any characters of code points greater
than 255 (or 127) in the scalar or that there are even any characters
in the scalar. What the C<UTF8> flag means is that the sequence of
octets in the representation of the scalar is the sequence of UTF-8
encoded code points of the characters of a string. The C<UTF8> flag
being off means that each octet in this representation encodes a
single character with code point 0..255 within the string. Perl's
Unicode model is not to use UTF-8 until it is absolutely necessary.

=item *

C<uvchr_to_utf8(buf, chr)> writes a Unicode character code point into
a buffer encoding the code point as UTF-8, and returns a pointer
pointing after the UTF-8 bytes. It works appropriately on EBCDIC machines.

=item *

C<utf8_to_uvchr(buf, lenp)> reads UTF-8 encoded bytes from a buffer and
returns the Unicode character code point and, optionally, the length of
the UTF-8 byte sequence. It works appropriately on EBCDIC machines.

=item *

C<utf8_length(start, end)> returns the length of the UTF-8 encoded buffer
in characters. C<sv_len_utf8(sv)> returns the length of the UTF-8 encoded
scalar.

=item *

C<sv_utf8_upgrade(sv)> converts the string of the scalar to its UTF-8
encoded form. C<sv_utf8_downgrade(sv)> does the opposite, if
possible. C<sv_utf8_encode(sv)> is like C<sv_utf8_upgrade()> except that
it does not set the C<UTF8> flag. C<sv_utf8_decode()> does the
opposite of C<sv_utf8_encode()>. Note that none of these are to be
used as general-purpose encoding or decoding interfaces: C<use Encode>
for that. C<sv_utf8_upgrade()> is affected by the encoding pragma
but C<sv_utf8_downgrade()> is not (since the encoding pragma is
designed to be a one-way street).

=item *

C<is_utf8_char(s)> returns true if the pointer points to a valid UTF-8
character.

=item *

C<is_utf8_string(buf, len)> returns true if C<len> bytes of the buffer
are valid UTF-8.

=item *

C<UTF8SKIP(buf)> will return the number of bytes in the UTF-8 encoded
character in the buffer. C<UNISKIP(chr)> will return the number of bytes
required to UTF-8-encode the Unicode character code point. C<UTF8SKIP()>
is useful, for example, for iterating over the characters of a UTF-8
encoded buffer; C<UNISKIP()> is useful, for example, in computing
the size required for a UTF-8 encoded buffer.

=item *

C<utf8_distance(a, b)> will tell the distance in characters between the
two pointers pointing to the same UTF-8 encoded buffer.

=item *

C<utf8_hop(s, off)> will return a pointer to a UTF-8 encoded buffer
that is C<off> (positive or negative) Unicode characters displaced
from the UTF-8 buffer C<s>. Be careful not to overstep the buffer:
C<utf8_hop()> will merrily run off the end or the beginning of the
buffer if told to do so.

=item *

C<pv_uni_display(dsv, spv, len, pvlim, flags)> and
C<sv_uni_display(dsv, ssv, pvlim, flags)> are useful for debugging the
output of Unicode strings and scalars. By default they are useful
only for debugging--they display B<all> characters as hexadecimal code
points--but with the flags C<UNI_DISPLAY_ISPRINT>,
C<UNI_DISPLAY_BACKSLASH>, and C<UNI_DISPLAY_QQ> you can make the
output more readable.

=item *

C<foldEQ_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to
compare two strings case-insensitively in Unicode. For case-sensitive
comparisons you can just use C<memEQ()> and C<memNE()> as usual, except
if one string is in utf8 and the other isn't.

=back

For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h>
in the Perl source code distribution.

=head2 Hacking Perl to work on earlier Unicode versions (for very serious hackers only)

Perl by default comes with the latest supported Unicode version built in, but
you can change it to use any earlier one.

Download the files in the desired version of Unicode from the Unicode web
site (L<http://www.unicode.org>). These should replace the existing files in
F<lib/unicore> in the Perl source tree. Follow the instructions in
F<README.perl> in that directory to change some of their names, and then build
perl (see L<INSTALL>).

It is even possible to copy the built files to a different directory, and then
change F<utf8_heavy.pl> in the directory C<$Config{privlib}> to point to the
new directory. Alternatively, make a copy of that directory before making the
change, and use C<@INC> or the C<-I> run-time flag to switch between versions
at will (but because of caching, not in the middle of a process). All of this
is beyond the scope of these instructions.

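Before or after such a change, it can be useful to confirm which
Unicode version a given perl is actually using. A minimal sketch with
the standard L<Unicode::UCD> module:

```perl
use strict;
use warnings;
use Unicode::UCD ();

# UnicodeVersion() reports the version of the Unicode data files
# that this perl was built with.
my $version = Unicode::UCD::UnicodeVersion();
print "This perl uses Unicode $version\n";
```
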
=head1 BUGS

=head2 Interaction with Locales

See L<perllocale/Unicode and UTF-8>.

=head2 Problems with characters in the Latin-1 Supplement range

See L</The "Unicode Bug">.

=head2 Interaction with Extensions

When Perl exchanges data with an extension, the extension should be
able to understand the UTF8 flag and act accordingly. If the
extension doesn't recognize that flag, it's likely that the extension
will return incorrectly-flagged data.

So if you're working with Unicode data, consult the documentation of
every module you're using to see whether there are any issues with
Unicode data exchange. If the documentation does not talk about Unicode
at all, suspect the worst and probably look at the source to learn how the
module is implemented. Modules written completely in Perl shouldn't
cause problems. Modules that directly or indirectly access code written
in other programming languages are at risk.

For affected functions, the simple strategy to avoid data corruption is
to always make the encoding of the exchanged data explicit. Choose an
encoding that you know the extension can handle. Convert arguments passed
to the extensions to that encoding and convert results back from that
encoding. Write wrapper functions that do the conversions for you, so
you can later change the functions when the extension catches up.

To provide an example, let's say the popular C<Foo::Bar::escape_html>
function doesn't deal with Unicode data yet. The wrapper function
would convert the argument to raw UTF-8 and convert the result back to
Perl's internal representation like so:

    sub my_escape_html ($) {
        my($what) = shift;
        return unless defined $what;
        Encode::decode_utf8(Foo::Bar::escape_html(
                                Encode::encode_utf8($what)));
    }

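Since C<Foo::Bar> is only an example name, the wrapper pattern can be
tried out with a stand-in. In the sketch below, C<legacy_escape_html()>
is a hypothetical byte-oriented function written for this illustration;
it plays the role of the extension.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for a byte-oriented extension function.
sub legacy_escape_html {
    my ($octets) = @_;
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');
    $octets =~ s/([&<>"])/$entity{$1}/g;
    return $octets;
}

# The wrapper: characters in, octets to the extension, characters out.
sub my_escape_html {
    my ($what) = @_;
    return unless defined $what;
    return Encode::decode_utf8(legacy_escape_html(Encode::encode_utf8($what)));
}

my $escaped = my_escape_html("\x{263A} < \x{2660}");

binmode(STDOUT, ':encoding(UTF-8)');
print "$escaped\n";
```

When the real extension learns to handle Unicode, only the body of the
wrapper needs to change; its callers are unaffected.
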
Sometimes, when the extension does not convert data but just stores
and retrieves them, you will be able to use the otherwise
dangerous C<Encode::_utf8_on()> function. Let's say the popular
C<Foo::Bar> extension, written in C, provides a C<param> method that
lets you store and retrieve data according to these prototypes:

    $self->param($name, $value);            # set a scalar
    $value = $self->param($name);           # retrieve a scalar

If it does not yet provide support for any encoding, one could write a
derived class with such a C<param> method:

    sub param {
        my($self,$name,$value) = @_;
        utf8::upgrade($name);     # make sure it is UTF-8 encoded
        if (defined $value) {
            utf8::upgrade($value);  # make sure it is UTF-8 encoded
            return $self->SUPER::param($name,$value);
        } else {
            my $ret = $self->SUPER::param($name);
            Encode::_utf8_on($ret); # we know, it is UTF-8 encoded
            return $ret;
        }
    }

Some extensions provide filters on data entry/exit points, such as
C<DB_File::filter_store_key> and family. Look out for such filters in
the documentation of your extensions; they can make the transition to
Unicode data much easier.

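As a sketch of how such filters might be used, the example below
installs encode-on-store and decode-on-fetch filters on a tied DBM
hash. It uses the core L<SDBM_File> module as a stand-in for
C<DB_File>, on the assumption that it is available and offers the same
four filter methods (see L<perldbmfilter>); the file name F<demo> is
arbitrary.

```perl
use strict;
use warnings;
use Encode ();
use Fcntl;
use File::Temp ();
use SDBM_File;

my $dir = File::Temp::tempdir(CLEANUP => 1);
my %db;
my $tied = tie(%db, 'SDBM_File', "$dir/demo", O_RDWR | O_CREAT, 0640)
    or die "Cannot tie: $!";

# Encode on the way in, decode on the way out.  Each filter works on $_.
$tied->filter_store_key  (sub { $_ = Encode::encode_utf8($_) });
$tied->filter_store_value(sub { $_ = Encode::encode_utf8($_) });
$tied->filter_fetch_key  (sub { $_ = Encode::decode_utf8($_) });
$tied->filter_fetch_value(sub { $_ = Encode::decode_utf8($_) });

# The rest of the program deals only in characters.
$db{"\x{263A}"} = "smiley \x{2660}";
my $value = $db{"\x{263A}"};
my ($key) = keys %db;

undef $tied;    # drop the extra reference before untie
untie %db;

print "round trip ok\n" if $key eq "\x{263A}" && $value eq "smiley \x{2660}";
```
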
=head2 Speed

Some functions are slower when working on UTF-8 encoded strings than
on byte encoded strings. All functions that need to hop over
characters such as length(), substr() or index(), or matching regular
expressions can work B<much> faster when the underlying data are
byte-encoded.

In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
a caching scheme was introduced which will hopefully make the slowness
somewhat less spectacular, at least for some operations. In general,
operations with UTF-8 encoded strings are still slower. As an example,
the Unicode properties (character classes) like C<\p{Nd}> are known to
be quite a bit slower (5-20 times) than their simpler counterparts
like C<\d> (then again, there are hundreds of Unicode characters matching C<Nd>
compared with the 10 ASCII characters matching C<\d>).

=head2 Problems on EBCDIC platforms

There are several known problems with Perl on EBCDIC platforms. If you
want to use Perl there, send email to perlbug@perl.org.

In earlier versions, when byte and character data were concatenated,
the new string was sometimes created by
decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
old Unicode string used EBCDIC.

If you find any of these, please report them as bugs.

=head2 Porting code from perl-5.6.X

Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer
was required to use the C<utf8> pragma to declare that a given scope
expected to deal with Unicode data and had to make sure that only
Unicode data were reaching that scope. If you have code that is
working with 5.6, you will need some of the following adjustments to
your code. The examples are written such that the code will continue
to work under 5.6, so you should be safe to try them out.

=over 3

=item *

A filehandle that should read or write UTF-8

    if ($] > 5.007) {
        binmode $fh, ":encoding(utf8)";
    }

=item *

A scalar that is going to be passed to some extension

Be it Compress::Zlib, Apache::Request or any extension that has no
mention of Unicode in the manpage, you need to make sure that the
UTF8 flag is stripped off. Note that at the time of this writing
(October 2002) the mentioned modules are not UTF-8-aware. Please
check the documentation to verify if this is still true.

    if ($] > 5.007) {
        require Encode;
        $val = Encode::encode_utf8($val); # make octets
    }

=item *

A scalar we got back from an extension

If you believe the scalar comes back as UTF-8, you will most likely
want the UTF8 flag restored:

    if ($] > 5.007) {
        require Encode;
        $val = Encode::decode_utf8($val);
    }

=item *

Same thing, if you are really sure it is UTF-8

    if ($] > 5.007) {
        require Encode;
        Encode::_utf8_on($val);
    }

=item *

A wrapper for fetchrow_array and fetchrow_hashref

When the database contains only UTF-8, a wrapper function or method is
a convenient way to replace all your fetchrow_array and
fetchrow_hashref calls. A wrapper function will also make it easier to
adapt to future enhancements in your database driver. Note that at the
time of this writing (October 2002), the DBI has no standardized way
to deal with UTF-8 data. Please check the documentation to verify if
that is still true.

    sub fetchrow {
        # $what is one of fetchrow_{array,hashref}
        my($self, $sth, $what) = @_;
        if ($] < 5.007) {
            return $sth->$what;
        } else {
            require Encode;
            if (wantarray) {
                my @arr = $sth->$what;
                for (@arr) {
                    defined && /[^\000-\177]/ && Encode::_utf8_on($_);
                }
                return @arr;
            } else {
                my $ret = $sth->$what;
                if (ref $ret) {
                    for my $k (keys %$ret) {
                        defined
                        && /[^\000-\177]/
                        && Encode::_utf8_on($_) for $ret->{$k};
                    }
                    return $ret;
                } else {
                    defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
                    return $ret;
                }
            }
        }
    }

=item *

A large scalar that you know can only contain ASCII

Scalars that contain only ASCII and are marked as UTF-8 are sometimes
a drag to your program. If you recognize such a situation, just remove
the UTF8 flag:

    utf8::downgrade($val) if $] > 5.007;

=back

=head1 SEE ALSO

L<perlunitut>, L<perluniintro>, L<perluniprops>, L<Encode>, L<open>, L<utf8>, L<bytes>,
L<perlretut>, L<perlvar/"${^UNICODE}">,
L<http://www.unicode.org/reports/tr44>.

=cut