1=head1 NAME
2
3perlunicode - Unicode support in Perl
4
5=head1 DESCRIPTION
6
7=head2 Important Caveats
8
9Unicode support is an extensive requirement. While Perl does not
10implement the Unicode standard or the accompanying technical reports
11from cover to cover, Perl does support many Unicode features.
12
People who want to learn to use Unicode in Perl should probably read
the L<Perl Unicode tutorial, perlunitut|perlunitut> and
L<perluniintro> before reading
this reference document.
17
18Also, the use of Unicode may present security issues that aren't obvious.
19Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.
20
21=over 4
22
23=item Safest if you "use feature 'unicode_strings'"
24
25In order to preserve backward compatibility, Perl does not turn
26on full internal Unicode support unless the pragma
27C<use feature 'unicode_strings'> is specified. (This is automatically
selected if you say C<use 5.012> or higher.) Failure to do this can
cause surprises. See L</The "Unicode Bug"> below.
30
31This pragma doesn't affect I/O. Nor does it change the internal
32representation of strings, only their interpretation. There are still
33several places where Unicode isn't fully supported, such as in
34filenames.
35
36=item Input and Output Layers
37
38Perl knows when a filehandle uses Perl's internal Unicode encodings
39(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
40the ":encoding(utf8)" layer. Other encodings can be converted to Perl's
41encoding on input or from Perl's encoding on output by use of the
42":encoding(...)" layer. See L<open>.
43
44To indicate that Perl source itself is in UTF-8, use C<use utf8;>.
45
46=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts
47
48As a compatibility measure, the C<use utf8> pragma must be explicitly
49included to enable recognition of UTF-8 in the Perl scripts themselves
50(in string or regular expression literals, or in identifier names) on
51ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
52machines. B<These are the only times when an explicit C<use utf8>
53is needed.> See L<utf8>.
54
55=item BOM-marked scripts and UTF-16 scripts autodetected
56
If a Perl script begins with a Unicode BOM (UTF-16LE, UTF-16BE,
58or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either
59endianness, Perl will correctly read in the script as Unicode.
60(BOMless UTF-8 cannot be effectively recognized or differentiated from
61ISO 8859-1 or other eight-bit encodings.)
62
63=item C<use encoding> needed to upgrade non-Latin-1 byte strings
64
65By default, there is a fundamental asymmetry in Perl's Unicode model:
66implicit upgrading from byte strings to Unicode strings assumes that
67they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
downgraded with UTF-8 encoding. This happens because the first 256
code points in Unicode happen to agree with Latin-1.
70
71See L</"Byte and Character Semantics"> for more details.
72
73=back
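
As a rough illustration of the first caveat above, the following sketch
matches the same one-character string against C<\w> with and without
C<unicode_strings>. It assumes an ASCII platform and that no locale
pragma is in scope; the variable names are illustrative only.

```perl
use strict;
use warnings;

# E WITH ACUTE as a single code point below 256. Perl stores it as a
# plain byte string, so the "Unicode Bug" rules apply to it.
my $e_acute = "\xE9";

my $matches_without = do {
    no feature 'unicode_strings';
    $e_acute =~ /\w/ ? 1 : 0;    # native byte semantics: not a word character
};

my $matches_with = do {
    use feature 'unicode_strings';
    $e_acute =~ /\w/ ? 1 : 0;    # Unicode semantics: a letter, so it matches
};

print "without: $matches_without, with: $matches_with\n";
```

The pragma is lexically scoped, so each C<do> block compiles its pattern
under its own rules.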
74
75=head2 Byte and Character Semantics
76
77Perl uses logically-wide characters to represent strings internally.
78
79Starting in Perl 5.14, Perl-level operations work with
80characters rather than bytes within the scope of a
81C<L<use feature 'unicode_strings'|feature>> (or equivalently
82C<use 5.012> or higher). (This is not true if bytes have been
83explicitly requested by C<L<use bytes|bytes>>, nor necessarily true
84for interactions with the platform's operating system.)
85
86For earlier Perls, and when C<unicode_strings> is not in effect, Perl
87provides a fairly safe environment that can handle both types of
88semantics in programs. For operations where Perl can unambiguously
89decide that the input data are characters, Perl switches to character
90semantics. For operations where this determination cannot be made
91without additional information from the user, Perl decides in favor of
92compatibility and chooses to use byte semantics.
93
94When C<use locale> (but not C<use locale ':not_characters'>) is in
95effect, Perl uses the semantics associated with the current locale.
96(C<use locale> overrides C<use feature 'unicode_strings'> in the same scope;
97while C<use locale ':not_characters'> effectively also selects
98C<use feature 'unicode_strings'> in its scope; see L<perllocale>.)
99Otherwise, Perl uses the platform's native
100byte semantics for characters whose code points are less than 256, and
Unicode semantics for those greater than 255. That means that non-ASCII
characters with code points below 256 are undefined except for their
ordinal numbers. This means that none have case (upper and lower), nor are any
104a member of character classes, like C<[:alpha:]> or C<\w>. (But all do belong
105to the C<\W> class or the Perl regular expression extension C<[:^alpha:]>.)
106
107This behavior preserves compatibility with earlier versions of Perl,
108which allowed byte semantics in Perl operations only if
109none of the program's inputs were marked as being a source of Unicode
110character data. Such data may come from filehandles, from calls to
111external programs, from information provided by the system (such as %ENV),
112or from literals and constants in the source text.
113
114The C<utf8> pragma is primarily a compatibility device that enables
115recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
116Note that this pragma is only required while Perl defaults to byte
117semantics; when character semantics become the default, this pragma
118may become a no-op. See L<utf8>.
119
120If strings operating under byte semantics and strings with Unicode
121character data are concatenated, the new string will have
122character semantics. This can cause surprises: See L</BUGS>, below.
123You can choose to be warned when this happens. See L<encoding::warnings>.
124
125Under character semantics, many operations that formerly operated on
126bytes now operate on characters. A character in Perl is
127logically just a number ranging from 0 to 2**31 or so. Larger
128characters may encode into longer sequences of bytes internally, but
129this internal detail is mostly hidden for Perl code.
130See L<perluniintro> for more.
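
The distinction between characters and their encoded bytes can be
sketched as follows. This is a minimal example, not a recommendation of
a particular workflow; it uses the core L<Encode> module, and the
variable names are made up for the illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Six characters: c, a, f, e-acute, space, smiley.
my $text  = "caf\N{U+E9} \N{U+263A}";

# The same text serialized as UTF-8 octets.
my $bytes = encode('UTF-8', $text);

my $char_count = length $text;     # counts characters
my $byte_count = length $bytes;    # counts octets

# Decoding the octets gives back the original character string.
my $round_trip = decode('UTF-8', $bytes) eq $text ? 1 : 0;

print "$char_count characters, $byte_count bytes\n";
```

Here the e-acute takes two octets and the smiley three, which is why the
two lengths differ.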
131
132=head2 Effects of Character Semantics
133
134Character semantics have the following effects:
135
136=over 4
137
138=item *
139
140Strings--including hash keys--and regular expression patterns may
141contain characters that have an ordinal value larger than 255.
142
143If you use a Unicode editor to edit your program, Unicode characters may
144occur directly within the literal strings in UTF-8 encoding, or UTF-16.
145(The former requires a BOM or C<use utf8>, the latter requires a BOM.)
146
147Unicode characters can also be added to a string by using the C<\N{U+...}>
148notation. The Unicode code for the desired character, in hexadecimal,
149should be placed in the braces, after the C<U>. For instance, a smiley face is
150C<\N{U+263A}>.
151
152Alternatively, you can use the C<\x{...}> notation for characters 0x100 and
153above. For characters below 0x100 you may get byte semantics instead of
154character semantics; see L</The "Unicode Bug">. On EBCDIC machines there is
155the additional problem that the value for such characters gives the EBCDIC
156character rather than the Unicode one, thus it is more portable to use
157C<\N{U+...}> instead.
158
159Additionally, you can use the C<\N{...}> notation and put the official
160Unicode character name within the braces, such as
161C<\N{WHITE SMILING FACE}>. This automatically loads the L<charnames>
162module with the C<:full> and C<:short> options. If you prefer different
163options for this module, you can instead, before the C<\N{...}>,
164explicitly load it with your desired options; for example,
165
166 use charnames ':loose';
167
168=item *
169
170If an appropriate L<encoding> is specified, identifiers within the
171Perl script may contain Unicode alphanumeric characters, including
172ideographs. Perl does not currently attempt to canonicalize variable
173names.
174
175=item *
176
177Regular expressions match characters instead of bytes. "." matches
178a character instead of a byte.
179
180=item *
181
182Bracketed character classes in regular expressions match characters instead of
183bytes and match against the character properties specified in the
184Unicode properties database. C<\w> can be used to match a Japanese
185ideograph, for instance.
186
187=item *
188
189Named Unicode properties, scripts, and block ranges may be used (like bracketed
190character classes) by using the C<\p{}> "matches property" construct and
191the C<\P{}> negation, "doesn't match property".
192See L</"Unicode Character Properties"> for more details.
193
194You can define your own character properties and use them
195in the regular expression with the C<\p{}> or C<\P{}> construct.
196See L</"User-Defined Character Properties"> for more details.
197
198=item *
199
200The special pattern C<\X> matches a logical character, an "extended grapheme
201cluster" in Standardese. In Unicode what appears to the user to be a single
202character, for example an accented C<G>, may in fact be composed of a sequence
203of characters, in this case a C<G> followed by an accent character. C<\X>
204will match the entire sequence.
205
206=item *
207
208The C<tr///> operator translates characters instead of bytes. Note
209that the C<tr///CU> functionality has been removed. For similar
functionality see C<pack('U0', ...)> and C<pack('C0', ...)>.
211
212=item *
213
214Case translation operators use the Unicode case translation tables
215when character input is provided. Note that C<uc()>, or C<\U> in
216interpolated strings, translates to uppercase, while C<ucfirst>,
217or C<\u> in interpolated strings, translates to titlecase in languages
218that make the distinction (which is equivalent to uppercase in languages
219without the distinction).
220
221=item *
222
223Most operators that deal with positions or lengths in a string will
224automatically switch to using character positions, including
225C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
226C<sprintf()>, C<write()>, and C<length()>. An operator that
227specifically does not switch is C<vec()>. Operators that really don't
228care include operators that treat strings as a bucket of bits such as
229C<sort()>, and operators dealing with filenames.
230
231=item *
232
233The C<pack()>/C<unpack()> letter C<C> does I<not> change, since it is often
used for byte-oriented formats. Think of C<char> in the C language.
235
236There is a new C<U> specifier that converts between Unicode characters
237and code points. There is also a C<W> specifier that is the equivalent of
238C<chr>/C<ord> and properly handles character values even if they are above 255.
239
240=item *
241
242The C<chr()> and C<ord()> functions work on characters, similar to
243C<pack("W")> and C<unpack("W")>, I<not> C<pack("C")> and
244C<unpack("C")>. C<pack("C")> and C<unpack("C")> are methods for
245emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.
246While these methods reveal the internal encoding of Unicode strings,
247that is not something one normally needs to care about at all.
248
249=item *
250
The bit string operators, C<& | ^ ~>, can operate on character data.
However, to stay backward compatible with bit string operations on
strings whose characters all have ordinal values less than 256, one
should not use C<~> (the bit complement) on a mix of characters with
values less than 256 and characters with values greater than 255. Most importantly,
256DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
257will not hold. The reason for this mathematical I<faux pas> is that
258the complement cannot return B<both> the 8-bit (byte-wide) bit
259complement B<and> the full character-wide bit complement.
260
261=item *
262
263There is a CPAN module, L<Unicode::Casing>, which allows you to define
264your own mappings to be used in C<lc()>, C<lcfirst()>, C<uc()>,
265C<ucfirst()>, and C<fc> (or their double-quoted string inlined
266versions such as C<\U>).
267(Prior to Perl 5.16, this functionality was partially provided
268in the Perl core, but suffered from a number of insurmountable
269drawbacks, so the CPAN module was written instead.)
270
271=back
272
273=over 4
274
275=item *
276
277And finally, C<scalar reverse()> reverses by character rather than by byte.
278
279=back
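
Several of the effects listed above can be checked with a short script.
The following is only a sketch under character semantics; the variable
names are illustrative, and C<charnames> is loaded explicitly in case
the running Perl does not load it automatically.

```perl
use strict;
use warnings;
use feature 'unicode_strings';
use charnames ':full';    # explicit load, in case \N{name} is not auto-loaded

# Three spellings of U+263A WHITE SMILING FACE.
my @smileys  = ("\N{U+263A}", "\x{263A}", "\N{WHITE SMILING FACE}");
my $all_same = ($smileys[0] eq $smileys[1] && $smileys[1] eq $smileys[2]) ? 1 : 0;

# "G" followed by COMBINING ACUTE ACCENT: two code points, one \X match.
my $accented_g  = "G\N{U+0301}";
my $code_points = length $accented_g;
my @clusters    = $accented_g =~ /(\X)/g;

# uc() maps to uppercase, ucfirst() to titlecase; U+01C6 distinguishes them.
my $dz    = "\N{U+01C6}";
my $upper = sprintf "U+%04X", ord(uc $dz);
my $title = sprintf "U+%04X", ord(ucfirst $dz);

# chr/ord agree with pack/unpack "W"; scalar reverse works by character.
my $w_agrees = (pack("W", 0x263A) eq chr(0x263A)
                && unpack("W", $smileys[0]) == ord($smileys[0])) ? 1 : 0;
my $reversed = scalar reverse("a\N{U+263A}b");
```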
280
281=head2 Unicode Character Properties
282
283(The only time that Perl considers a sequence of individual code
284points as a single logical character is in the C<\X> construct, already
285mentioned above. Therefore "character" in this discussion means a single
286Unicode code point.)
287
288Very nearly all Unicode character properties are accessible through
289regular expressions by using the C<\p{}> "matches property" construct
290and the C<\P{}> "doesn't match property" for its negation.
291
For instance, C<\p{Uppercase}> matches any single character with the Unicode
"Uppercase" property, while C<\p{L}> matches any character whose
General_Category property is "L" (letter). Braces are not
required for single-letter property names, so C<\p{L}> is equivalent to C<\pL>.
296
297More formally, C<\p{Uppercase}> matches any single character whose Unicode
298Uppercase property value is True, and C<\P{Uppercase}> matches any character
299whose Uppercase property value is False, and they could have been written as
300C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively.
301
This formality is needed when properties are not binary; that is, if they can
take on more values than just True and False. For example, the Bidi_Class
property (see L</"Bidirectional Character Types"> below) can take on several
different values, such as Left_To_Right, Right_To_Left, White_Space, and others.
To match these, one needs to specify both the property name (Bidi_Class) AND
the value being matched against (Left_To_Right, Right_To_Left, etc.). This is
done, as in the examples above, by having the two components separated by an
equal sign (or interchangeably, a colon), like
C<\p{Bidi_Class: Left_To_Right}>.
311
312All Unicode-defined character properties may be written in these compound forms
313of C<\p{property=value}> or C<\p{property:value}>, but Perl provides some
314additional properties that are written only in the single form, as well as
315single-form short-cuts for all binary properties and certain others described
316below, in which you may omit the property name and the equals or colon
317separator.
318
319Most Unicode character properties have at least two synonyms (or aliases if you
320prefer): a short one that is easier to type and a longer one that is more
321descriptive and hence easier to understand. Thus the "L" and "Letter" properties
322above are equivalent and can be used interchangeably. Likewise,
323"Upper" is a synonym for "Uppercase", and we could have written
324C<\p{Uppercase}> equivalently as C<\p{Upper}>. Also, there are typically
325various synonyms for the values the property can be. For binary properties,
"True" has 3 synonyms: "T", "Yes", and "Y"; and "False" has correspondingly "F",
327"No", and "N". But be careful. A short form of a value for one property may
328not mean the same thing as the same short form for another. Thus, for the
General_Category property, "L" means "Letter", but for the Bidi_Class property,
"L" means "Left_To_Right". A complete list of properties and synonyms is in
331L<perluniprops>.
332
333Upper/lower case differences in property names and values are irrelevant;
334thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>.
335Similarly, you can add or subtract underscores anywhere in the middle of a
336word, so that these are also equivalent to C<\p{U_p_p_e_r}>. And white space
337is irrelevant adjacent to non-word characters, such as the braces and the equals
338or colon separators, so C<\p{ Upper }> and C<\p{ Upper_case : Y }> are
339equivalent to these as well. In fact, white space and even
340hyphens can usually be added or deleted anywhere. So even C<\p{ Up-per case = Yes}> is
equivalent. All this is called "loose-matching" by Unicode. The few places
where stricter matching is used are in the middle of numbers, and in the Perl
343extension properties that begin or end with an underscore. Stricter matching
344cares about white space (except adjacent to non-word characters),
345hyphens, and non-interior underscores.
346
347You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
348(^) between the first brace and the property name: C<\p{^Tamil}> is
349equal to C<\P{Tamil}>.
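
The equivalences described above can be tried out directly. The sketch
below is illustrative only (the array and variable names are invented
here); it counts how many of several spellings match the same
character.

```perl
use strict;
use warnings;

# Equivalent spellings of the binary Uppercase property.
my @upper_forms = (
    qr/\p{Uppercase}/,
    qr/\p{Uppercase=True}/,
    qr/\p{Upper}/,
    qr/\p{upper}/,
    qr/\p{U_p_p_e_r}/,
);
my $upper_hits = grep { "A" =~ $_ } @upper_forms;

# Three ways to say "not Uppercase".
my @not_upper_forms = (
    qr/\P{Uppercase}/,
    qr/\p{^Uppercase}/,
    qr/\p{Uppercase=False}/,
);
my $not_upper_hits = grep { "a" =~ $_ } @not_upper_forms;

# Single-letter names need no braces; "L" and "Letter" are synonyms.
my $letter_hits = grep { "A" =~ $_ } (qr/\pL/, qr/\p{L}/, qr/\p{Letter}/);

# A non-binary property is written as name=value (or name:value).
my $bidi_left = "A" =~ /\p{Bidi_Class=L}/ ? 1 : 0;

print "$upper_hits $not_upper_hits $letter_hits $bidi_left\n";
```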
350
351Almost all properties are immune to case-insensitive matching. That is,
352adding a C</i> regular expression modifier does not change what they
353match. There are two sets that are affected.
354The first set is
355C<Uppercase_Letter>,
356C<Lowercase_Letter>,
357and C<Titlecase_Letter>,
358all of which match C<Cased_Letter> under C</i> matching.
359And the second set is
360C<Uppercase>,
361C<Lowercase>,
362and C<Titlecase>,
363all of which match C<Cased> under C</i> matching.
364This set also includes its subsets C<PosixUpper> and C<PosixLower> both
365of which under C</i> matching match C<PosixAlpha>.
366(The difference between these sets is that some things, such as Roman
367numerals, come in both upper and lower case so they are C<Cased>, but aren't considered
368letters, so they aren't C<Cased_Letter>s.)
369
370The result is undefined if you try to match a non-Unicode code point
371(that is, one above 0x10FFFF) against a Unicode property. Currently, a
372warning is raised, and the match will fail. In some cases, this is
373counterintuitive, as both these fail:
374
 chr(0x110000) =~ /\p{ASCII_Hex_Digit=True}/   # Fails.
 chr(0x110000) =~ /\p{ASCII_Hex_Digit=False}/  # Fails!
377
378=head3 B<General_Category>
379
380Every Unicode character is assigned a general category, which is the "most
381usual categorization of a character" (from
382L<http://www.unicode.org/reports/tr44>).
383
384The compound way of writing these is like C<\p{General_Category=Number}>
385(short, C<\p{gc:n}>). But Perl furnishes shortcuts in which everything up
386through the equal or colon separator is omitted. So you can instead just write
387C<\pN>.
388
389Here are the short and long forms of the General Category properties:
390
391 Short Long
392
393 L Letter
394 LC, L& Cased_Letter (that is: [\p{Ll}\p{Lu}\p{Lt}])
395 Lu Uppercase_Letter
396 Ll Lowercase_Letter
397 Lt Titlecase_Letter
398 Lm Modifier_Letter
399 Lo Other_Letter
400
401 M Mark
402 Mn Nonspacing_Mark
403 Mc Spacing_Mark
404 Me Enclosing_Mark
405
406 N Number
407 Nd Decimal_Number (also Digit)
408 Nl Letter_Number
409 No Other_Number
410
411 P Punctuation (also Punct)
412 Pc Connector_Punctuation
413 Pd Dash_Punctuation
414 Ps Open_Punctuation
415 Pe Close_Punctuation
416 Pi Initial_Punctuation
417 (may behave like Ps or Pe depending on usage)
418 Pf Final_Punctuation
419 (may behave like Ps or Pe depending on usage)
420 Po Other_Punctuation
421
422 S Symbol
423 Sm Math_Symbol
424 Sc Currency_Symbol
425 Sk Modifier_Symbol
426 So Other_Symbol
427
428 Z Separator
429 Zs Space_Separator
430 Zl Line_Separator
431 Zp Paragraph_Separator
432
433 C Other
434 Cc Control (also Cntrl)
435 Cf Format
436 Cs Surrogate
437 Co Private_Use
438 Cn Unassigned
439
440Single-letter properties match all characters in any of the
441two-letter sub-properties starting with the same letter.
442C<LC> and C<L&> are special: both are aliases for the set consisting of everything matched by C<Ll>, C<Lu>, and C<Lt>.
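
As a small sketch of how the short, long, and shortcut forms relate
(variable names are illustrative), the following checks a decimal digit
and a Roman numeral against the Number categories:

```perl
use strict;
use warnings;

# Compound, short compound, and single-form spellings of the same test.
my @number_forms = (qr/\p{General_Category=Number}/, qr/\p{gc:n}/, qr/\pN/);
my $digit_hits   = grep { "7" =~ $_ } @number_forms;

# ROMAN NUMERAL EIGHT has General_Category Nl (Letter_Number), so it is
# matched by the single-letter N but not by the sub-category Nd.
my $roman_eight = "\N{U+2167}";
my $is_number   = $roman_eight =~ /\pN/    ? 1 : 0;
my $is_decimal  = $roman_eight =~ /\p{Nd}/ ? 1 : 0;
my $is_letter_n = $roman_eight =~ /\p{Nl}/ ? 1 : 0;

# LC and L& both cover Lu, Ll, and Lt.
my $cased_hits = grep { "A" =~ $_ } (qr/\p{LC}/, qr/\p{L&}/, qr/\p{Cased_Letter}/);

print "$digit_hits $is_number $is_decimal $is_letter_n $cased_hits\n";
```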
443
444=head3 B<Bidirectional Character Types>
445
446Because scripts differ in their directionality (Hebrew and Arabic are
447written right to left, for example) Unicode supplies a Bidi_Class property.
448Some of the values this property can have are:
449
450 Value Meaning
451
452 L Left-to-Right
453 LRE Left-to-Right Embedding
454 LRO Left-to-Right Override
455 R Right-to-Left
456 AL Arabic Letter
457 RLE Right-to-Left Embedding
458 RLO Right-to-Left Override
459 PDF Pop Directional Format
460 EN European Number
461 ES European Separator
462 ET European Terminator
463 AN Arabic Number
464 CS Common Separator
465 NSM Non-Spacing Mark
466 BN Boundary Neutral
467 B Paragraph Separator
468 S Segment Separator
469 WS Whitespace
470 ON Other Neutrals
471
472This property is always written in the compound form.
473For example, C<\p{Bidi_Class:R}> matches characters that are normally
474written right to left. Unlike the
475General_Category property, this
476property can have more values added in a future Unicode release. Those
477listed above comprised the complete set for many Unicode releases, but
478others were added in Unicode 6.3; you can always find what the
479current ones are in in L<perluniprops>. And
480L<http://www.unicode.org/reports/tr9/> describes how to use them.
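
For illustration, a few well-known characters can be tested against
Bidi_Class values from the table above. This is a sketch with
illustrative variable names, not a bidirectional-text implementation.

```perl
use strict;
use warnings;

my $hebrew_alef = "\N{U+05D0}";    # HEBREW LETTER ALEF
my $arabic_alef = "\N{U+0627}";    # ARABIC LETTER ALEF

# Hebrew letters are R; Arabic letters have their own class, AL.
my $hebrew_is_r  = $hebrew_alef =~ /\p{Bidi_Class:R}/  ? 1 : 0;
my $arabic_is_r  = $arabic_alef =~ /\p{Bidi_Class:R}/  ? 1 : 0;
my $arabic_is_al = $arabic_alef =~ /\p{Bidi_Class:AL}/ ? 1 : 0;

# ASCII letters are L, ASCII digits are EN (European Number).
my $latin_is_l  = "a" =~ /\p{Bidi_Class:L}/ ? 1 : 0;
my $digit_is_en = "1" =~ /\p{Bc=EN}/        ? 1 : 0;

print "$hebrew_is_r $arabic_is_r $arabic_is_al $latin_is_l $digit_is_en\n";
```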
481
482=head3 B<Scripts>
483
484The world's languages are written in many different scripts. This sentence
485(unless you're reading it in translation) is written in Latin, while Russian is
486written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in
487Hiragana or Katakana. There are many more.
488
489The Unicode Script and Script_Extensions properties give what script a
490given character is in. Either property can be specified with the
491compound form like
492C<\p{Script=Hebrew}> (short: C<\p{sc=hebr}>), or
493C<\p{Script_Extensions=Javanese}> (short: C<\p{scx=java}>).
494In addition, Perl furnishes shortcuts for all
495C<Script> property names. You can omit everything up through the equals
496(or colon), and simply write C<\p{Latin}> or C<\P{Cyrillic}>.
497(This is not true for C<Script_Extensions>, which is required to be
498written in the compound form.)
499
500The difference between these two properties involves characters that are
501used in multiple scripts. For example the digits '0' through '9' are
502used in many parts of the world. These are placed in a script named
503C<Common>. Other characters are used in just a few scripts. For
504example, the "KATAKANA-HIRAGANA DOUBLE HYPHEN" is used in both Japanese
505scripts, Katakana and Hiragana, but nowhere else. The C<Script>
506property places all characters that are used in multiple scripts in the
507C<Common> script, while the C<Script_Extensions> property places those
508that are used in only a few scripts into each of those scripts; while
509still using C<Common> for those used in many scripts. Thus both these
510match:
511
512 "0" =~ /\p{sc=Common}/ # Matches
513 "0" =~ /\p{scx=Common}/ # Matches
514
and only the first of these matches:

 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Common}/  # Matches
 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Common}/ # No match
519
520And only the last two of these match:
521
 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Hiragana}/  # No match
 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Katakana}/  # No match
 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Hiragana}/ # Matches
 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Katakana}/ # Matches
526
527C<Script_Extensions> is thus an improved C<Script>, in which there are
528fewer characters in the C<Common> script, and correspondingly more in
529other scripts. It is new in Unicode version 6.0, and its data are likely
530to change significantly in later releases, as things get sorted out.
531
532(Actually, besides C<Common>, the C<Inherited> script, contains
533characters that are used in multiple scripts. These are modifier
534characters which modify other characters, and inherit the script value
535of the controlling character. Some of these are used in many scripts,
536and so go into C<Inherited> in both C<Script> and C<Script_Extensions>.
537Others are used in just a few scripts, so are in C<Inherited> in
538C<Script>, but not in C<Script_Extensions>.)
539
540It is worth stressing that there are several different sets of digits in
541Unicode that are equivalent to 0-9 and are matchable by C<\d> in a
542regular expression. If they are used in a single language only, they
543are in that language's C<Script> and C<Script_Extension>. If they are
544used in more than one script, they will be in C<sc=Common>, but only
545if they are used in many scripts should they be in C<scx=Common>.
546
547A complete list of scripts and their shortcuts is in L<perluniprops>.
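
The difference between the two properties can be put into a runnable
form along the following lines. The sketch assumes a Perl whose Unicode
data includes C<Script_Extensions> (Unicode 6.0 or later); the variable
names are illustrative.

```perl
use strict;
use warnings;

my $double_hyphen = "\N{U+30A0}";    # KATAKANA-HIRAGANA DOUBLE HYPHEN

# Script puts a character shared by a few scripts into Common ...
my $sc_common   = $double_hyphen =~ /\p{sc=Common}/   ? 1 : 0;
my $sc_hiragana = $double_hyphen =~ /\p{sc=Hiragana}/ ? 1 : 0;

# ... while Script_Extensions lists it under each script that uses it.
my $scx_common   = $double_hyphen =~ /\p{scx=Common}/   ? 1 : 0;
my $scx_hiragana = $double_hyphen =~ /\p{scx=Hiragana}/ ? 1 : 0;
my $scx_katakana = $double_hyphen =~ /\p{scx=Katakana}/ ? 1 : 0;

# Characters used in many scripts stay in Common under both properties.
my $digit_scx_common = "0" =~ /\p{scx=Common}/ ? 1 : 0;

# Ordinary letters belong to a single script.
my $latin_a      = "a"          =~ /\p{Script=Latin}/    ? 1 : 0;
my $cyrillic_zhe = "\N{U+0436}" =~ /\p{Script=Cyrillic}/ ? 1 : 0;
```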
548
549=head3 B<Use of "Is" Prefix>
550
551For backward compatibility (with Perl 5.6), all properties mentioned
552so far may have C<Is> or C<Is_> prepended to their name, so C<\P{Is_Lu}>, for
553example, is equal to C<\P{Lu}>, and C<\p{IsScript:Arabic}> is equal to
554C<\p{Arabic}>.
555
556=head3 B<Blocks>
557
558In addition to B<scripts>, Unicode also defines B<blocks> of
559characters. The difference between scripts and blocks is that the
560concept of scripts is closer to natural languages, while the concept
561of blocks is more of an artificial grouping based on groups of Unicode
562characters with consecutive ordinal values. For example, the "Basic Latin"
563block is all characters whose ordinals are between 0 and 127, inclusive; in
564other words, the ASCII characters. The "Latin" script contains some letters
565from this as well as several other blocks, like "Latin-1 Supplement",
566"Latin Extended-A", etc., but it does not contain all the characters from
567those blocks. It does not, for example, contain the digits 0-9, because
568those digits are shared across many scripts, and hence are in the
569C<Common> script.
570
571For more about scripts versus blocks, see UAX#24 "Unicode Script Property":
572L<http://www.unicode.org/reports/tr24>
573
574The C<Script> or C<Script_Extensions> properties are likely to be the
575ones you want to use when processing
576natural language; the Block property may occasionally be useful in working
577with the nuts and bolts of Unicode.
578
579Block names are matched in the compound form, like C<\p{Block: Arrows}> or
580C<\p{Blk=Hebrew}>. Unlike most other properties, only a few block names have a
581Unicode-defined short name. But Perl does provide a (slight) shortcut: You
582can say, for example C<\p{In_Arrows}> or C<\p{In_Hebrew}>. For backwards
583compatibility, the C<In> prefix may be omitted if there is no naming conflict
584with a script or any other property, and you can even use an C<Is> prefix
instead in those cases. But it is not a good idea to do this, for a couple of
reasons:
587
588=over 4
589
590=item 1
591
592It is confusing. There are many naming conflicts, and you may forget some.
593For example, C<\p{Hebrew}> means the I<script> Hebrew, and NOT the I<block>
594Hebrew. But would you remember that 6 months from now?
595
596=item 2
597
598It is unstable. A new version of Unicode may preempt the current meaning by
599creating a property with the same name. There was a time in very early Unicode
600releases when C<\p{Hebrew}> would have matched the I<block> Hebrew; now it
601doesn't.
602
603=back
604
605Some people prefer to always use C<\p{Block: foo}> and C<\p{Script: bar}>
606instead of the shortcuts, whether for clarity, because they can't remember the
607difference between 'In' and 'Is' anyway, or they aren't confident that those who
608eventually will read their code will know that difference.
609
610A complete list of blocks and their shortcuts is in L<perluniprops>.
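
To make the block-versus-script distinction concrete, here is a small
sketch using only the compound forms. The characters chosen and the
variable names are illustrative.

```perl
use strict;
use warnings;

# The digit 5 lies in the Basic Latin block, but its script is Common.
my $digit_in_block  = "5" =~ /\p{Block: Basic_Latin}/ ? 1 : 0;
my $digit_is_latin  = "5" =~ /\p{Script=Latin}/       ? 1 : 0;
my $digit_is_common = "5" =~ /\p{Script=Common}/      ? 1 : 0;

# Two characters from the Latin-1 Supplement block with different scripts.
my $e_acute = "\N{U+00E9}";    # LATIN SMALL LETTER E WITH ACUTE
my $times   = "\N{U+00D7}";    # MULTIPLICATION SIGN
my $both_in_block = ($e_acute =~ /\p{Block: Latin_1_Supplement}/
                     && $times =~ /\p{Block: Latin_1_Supplement}/) ? 1 : 0;
my $e_is_latin     = $e_acute =~ /\p{Script=Latin}/ ? 1 : 0;
my $times_is_latin = $times   =~ /\p{Script=Latin}/ ? 1 : 0;

# Long and short property names select the same block.
my $arrow       = "\N{U+2190}";    # LEFTWARDS ARROW
my $arrow_block = $arrow =~ /\p{Block: Arrows}/ ? 1 : 0;
my $arrow_blk   = $arrow =~ /\p{Blk=Arrows}/    ? 1 : 0;
```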
611
612=head3 B<Other Properties>
613
614There are many more properties than the very basic ones described here.
615A complete list is in L<perluniprops>.
616
617Unicode defines all its properties in the compound form, so all single-form
618properties are Perl extensions. Most of these are just synonyms for the
619Unicode ones, but some are genuine extensions, including several that are in
620the compound form. And quite a few of these are actually recommended by Unicode
621(in L<http://www.unicode.org/reports/tr18>).
622
623This section gives some details on all extensions that aren't just
624synonyms for compound-form Unicode properties
(for those properties, you'll have to refer to the
L<Unicode Standard|http://www.unicode.org/reports/tr44>).
627
628=over
629
630=item B<C<\p{All}>>
631
632This matches any of the 1_114_112 Unicode code points. It is a synonym for
633C<\p{Any}>.
634
635=item B<C<\p{Alnum}>>
636
637This matches any C<\p{Alphabetic}> or C<\p{Decimal_Number}> character.
638
639=item B<C<\p{Any}>>
640
641This matches any of the 1_114_112 Unicode code points. It is a synonym for
642C<\p{All}>.
643
644=item B<C<\p{ASCII}>>
645
646This matches any of the 128 characters in the US-ASCII character set,
647which is a subset of Unicode.
648
649=item B<C<\p{Assigned}>>
650
651This matches any assigned code point; that is, any code point whose general
652category is not Unassigned (or equivalently, not Cn).
653
654=item B<C<\p{Blank}>>
655
656This is the same as C<\h> and C<\p{HorizSpace}>: A character that changes the
657spacing horizontally.
658
659=item B<C<\p{Decomposition_Type: Non_Canonical}>> (Short: C<\p{Dt=NonCanon}>)
660
661Matches a character that has a non-canonical decomposition.
662
663To understand the use of this rarely used property=value combination, it is
664necessary to know some basics about decomposition.
665Consider a character, say H. It could appear with various marks around it,
666such as an acute accent, or a circumflex, or various hooks, circles, arrows,
667I<etc.>, above, below, to one side or the other, etc. There are many
668possibilities among the world's languages. The number of combinations is
669astronomical, and if there were a character for each combination, it would
670soon exhaust Unicode's more than a million possible characters. So Unicode
671took a different approach: there is a character for the base H, and a
672character for each of the possible marks, and these can be variously combined
to get a final logical character. So a logical character--what appears to be a
single character--can be a sequence of more than one individual character.
675This is called an "extended grapheme cluster"; Perl furnishes the C<\X>
676regular expression construct to match such sequences.
677
678But Unicode's intent is to unify the existing character set standards and
679practices, and several pre-existing standards have single characters that
680mean the same thing as some of these combinations. An example is ISO-8859-1,
681which has quite a few of these in the Latin-1 range, an example being "LATIN
682CAPITAL LETTER E WITH ACUTE". Because this character was in this pre-existing
683standard, Unicode added it to its repertoire. But this character is considered
684by Unicode to be equivalent to the sequence consisting of the character
685"LATIN CAPITAL LETTER E" followed by the character "COMBINING ACUTE ACCENT".
686
687"LATIN CAPITAL LETTER E WITH ACUTE" is called a "pre-composed" character, and
688its equivalence with the sequence is called canonical equivalence. All
689pre-composed characters are said to have a decomposition (into the equivalent
690sequence), and the decomposition type is also called canonical.
691
692However, many more characters have a different type of decomposition, a
693"compatible" or "non-canonical" decomposition. The sequences that form these
694decompositions are not considered canonically equivalent to the pre-composed
695character. An example, again in the Latin-1 range, is the "SUPERSCRIPT ONE".
696It is somewhat like a regular digit 1, but not exactly; its decomposition
697into the digit 1 is called a "compatible" decomposition, specifically a
698"super" decomposition. There are several such compatibility
699decompositions (see L<http://www.unicode.org/reports/tr44>), including one
700called "compat", which means some miscellaneous type of decomposition
701that doesn't fit into the decomposition categories that Unicode has chosen.
702
703Note that most Unicode characters don't have a decomposition, so their
704decomposition type is "None".
705
706For your convenience, Perl has added the C<Non_Canonical> decomposition
707type to mean any of the several compatibility decompositions.
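
The distinction between canonical and compatibility decompositions can be
illustrated with a short sketch. It assumes the core module
L<Unicode::Normalize> is available; the variable names are only illustrative.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC NFKD);

my $e_acute = "\x{C9}";   # LATIN CAPITAL LETTER E WITH ACUTE (pre-composed)
my $sup_one = "\x{B9}";   # SUPERSCRIPT ONE

# Canonical decomposition and recomposition
my $decomposed = NFD($e_acute);      # "E" followed by COMBINING ACUTE ACCENT
my $recomposed = NFC($decomposed);   # back to U+00C9

# Only the compatibility decomposition maps SUPERSCRIPT ONE to the digit 1
my $compat    = NFKD($sup_one);
my $canonical = NFD($sup_one);

my $e_is_canonical = $e_acute =~ /\p{Decomposition_Type=Canonical}/     ? 1 : 0;
my $one_is_super   = $sup_one =~ /\p{Decomposition_Type=Super}/         ? 1 : 0;
my $one_non_canon  = $sup_one =~ /\p{Decomposition_Type=Non_Canonical}/ ? 1 : 0;
my $e_non_canon    = $e_acute =~ /\p{Decomposition_Type=Non_Canonical}/ ? 1 : 0;
```

Here C<Non_Canonical> matches SUPERSCRIPT ONE, whose specific type is
"super", but not the canonically decomposable E WITH ACUTE.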
708
709=item B<C<\p{Graph}>>
710
711Matches any character that is graphic. Theoretically, this means a character
712that on a printer would cause ink to be used.
713
714=item B<C<\p{HorizSpace}>>
715
716This is the same as C<\h> and C<\p{Blank}>: a character that changes the
717spacing horizontally.
718
719=item B<C<\p{In=*}>>
720
This is a synonym for C<\p{Present_In=*}>.
722
723=item B<C<\p{PerlSpace}>>
724
725This is the same as C<\s>, restricted to ASCII, namely C<S<[ \f\n\r\t]>>
726and starting in Perl v5.18, experimentally, a vertical tab.
727
Mnemonic: Perl's (original) space.
729
730=item B<C<\p{PerlWord}>>
731
This is the same as C<\w>, restricted to ASCII, namely C<[A-Za-z0-9_]>.
733
734Mnemonic: Perl's (original) word.
735
736=item B<C<\p{Posix...}>>
737
738There are several of these, which are equivalents using the C<\p>
739notation for Posix classes and are described in
740L<perlrecharclass/POSIX Character Classes>.
741
742=item B<C<\p{Present_In: *}>> (Short: C<\p{In=*}>)
743
This property is used when you need to know in which Unicode version(s) a
character is present.

The "*" above stands for a Unicode version number of the form
I<major>.I<minor>, such as C<1.1> or C<4.0>; or the "*" can also be
C<Unassigned>. This property will match the code points whose final
disposition has been settled as of the Unicode release given by the version
number; C<\p{Present_In: Unassigned}> will match those code points whose
meaning has yet to be assigned.
752
For example, C<U+0041> "LATIN CAPITAL LETTER A" was present in the very first
Unicode release available, which is C<1.1>, so this property is true for all
valid "*" versions. On the other hand, C<U+1EFF> was not assigned until version
5.1, when it became "LATIN SMALL LETTER Y WITH LOOP", so the only values of "*"
that match it are 5.1, 5.2, and later.
758
759Unicode furnishes the C<Age> property from which this is derived. The problem
760with Age is that a strict interpretation of it (which Perl takes) has it
761matching the precise release a code point's meaning is introduced in. Thus
762C<U+0041> would match only 1.1; and C<U+1EFF> only 5.1. This is not usually what
763you want.
764
765Some non-Perl implementations of the Age property may change its meaning to be
766the same as the Perl Present_In property; just be aware of that.
767
768Another confusion with both these properties is that the definition is not
769that the code point has been I<assigned>, but that the meaning of the code point
770has been I<determined>. This is because 66 code points will always be
771unassigned, and so the Age for them is the Unicode version in which the decision
772to make them so was made. For example, C<U+FDD0> is to be permanently
773unassigned to a character, and the decision to do that was made in version 3.1,
774so C<\p{Age=3.1}> matches this character, as also does C<\p{Present_In: 3.1}> and up.
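
The difference between C<Present_In> and C<Age> can be checked directly. This
is a small sketch using the two code points discussed above; the hash and its
keys are only illustrative, and a Perl whose Unicode database is at least
version 5.2 is assumed.

```perl
use strict;
use warnings;

my $a_cap  = "\x{41}";     # LATIN CAPITAL LETTER A, present since 1.1
my $y_loop = "\x{1EFF}";   # LATIN SMALL LETTER Y WITH LOOP, added in 5.1

my %result = (
    a_age_1_1 => ($a_cap  =~ /\p{Age=1.1}/         ? 1 : 0),
    a_age_5_1 => ($a_cap  =~ /\p{Age=5.1}/         ? 1 : 0),
    a_in_5_1  => ($a_cap  =~ /\p{Present_In: 5.1}/ ? 1 : 0),
    y_in_5_0  => ($y_loop =~ /\p{Present_In: 5.0}/ ? 1 : 0),
    y_in_5_2  => ($y_loop =~ /\p{Present_In: 5.2}/ ? 1 : 0),
    y_age_5_2 => ($y_loop =~ /\p{Age=5.2}/         ? 1 : 0),
);

print "$_ => $result{$_}\n" for sort keys %result;
```

C<Age> is true only for the release that introduced the code point, whereas
C<Present_In> is true for that release and every later one.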
775
776=item B<C<\p{Print}>>
777
778This matches any character that is graphical or blank, except controls.
779
780=item B<C<\p{SpacePerl}>>
781
782This is the same as C<\s>, including beyond ASCII.
783
784Mnemonic: Space, as modified by Perl. (It doesn't include the vertical tab
785which both the Posix standard and Unicode consider white space.)
786
787=item B<C<\p{Title}>> and B<C<\p{Titlecase}>>
788
Under case-sensitive matching, these both match the same code points as
C<\p{General Category=Titlecase_Letter}> (C<\p{gc=lt}>). The difference
is that under C</i> caseless matching, these match the same as
C<\p{Cased}>, whereas C<\p{gc=lt}> matches C<\p{Cased_Letter}>.
793
794=item B<C<\p{VertSpace}>>
795
796This is the same as C<\v>: A character that changes the spacing vertically.
797
798=item B<C<\p{Word}>>
799
800This is the same as C<\w>, including over 100_000 characters beyond ASCII.
801
802=item B<C<\p{XPosix...}>>
803
804There are several of these, which are the standard Posix classes
805extended to the full Unicode range. They are described in
806L<perlrecharclass/POSIX Character Classes>.
807
808=back
809
810=head2 User-Defined Character Properties
811
812You can define your own binary character properties by defining subroutines
813whose names begin with "In" or "Is". (The experimental feature
814L<perlre/(?[ ])> provides an alternative which allows more complex
815definitions.) The subroutines can be defined in any
816package. The user-defined properties can be used in the regular expression
817C<\p> and C<\P> constructs; if you are using a user-defined property from a
818package other than the one you are in, you must specify its package in the
819C<\p> or C<\P> construct.
820
    # assuming property IsForeign defined in Lang::
    package main;  # property package name required
    if ($txt =~ /\p{Lang::IsForeign}+/) { ... }

    package Lang;  # property package name not required
    if ($txt =~ /\p{IsForeign}+/) { ... }
827
828
829Note that the effect is compile-time and immutable once defined.
830However, the subroutines are passed a single parameter, which is 0 if
831case-sensitive matching is in effect and non-zero if caseless matching
832is in effect. The subroutine may return different values depending on
833the value of the flag, and one set of values will immutably be in effect
834for all case-sensitive matches, and the other set for all case-insensitive
835matches.
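
The caseless flag described above can be sketched as follows. The property
name C<InAsciiCapital> is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical property: ASCII capitals, or all ASCII letters under /i.
sub InAsciiCapital {
    my ($caseless) = @_;        # true when caseless matching is in effect
    return $caseless
        ? "41\t5A\n61\t7A\n"    # A-Z and a-z
        : "41\t5A\n";           # A-Z only
}

my $lower_sensitive = "a" =~ /\p{InAsciiCapital}/  ? 1 : 0;
my $lower_caseless  = "a" =~ /\p{InAsciiCapital}/i ? 1 : 0;
my $upper_sensitive = "A" =~ /\p{InAsciiCapital}/  ? 1 : 0;

print "$lower_sensitive $lower_caseless $upper_sensitive\n";
```

The subroutine only tests the flag for truth, so it does not depend on the
exact non-zero value that Perl passes.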
836
Note that if the regular expression is tainted, and the name of the
subroutine would therefore be determined by tainted data, Perl will die
rather than call the subroutine.
840
841The subroutines must return a specially-formatted string, with one
842or more newline-separated lines. Each line must be one of the following:
843
844=over 4
845
846=item *
847
848A single hexadecimal number denoting a code point to include.
849
850=item *
851
Two hexadecimal numbers separated by horizontal whitespace (space or
tab characters) denoting a range of code points to include.
854
855=item *
856
857Something to include, prefixed by "+": a built-in character
858property (prefixed by "utf8::") or a fully qualified (including package
859name) user-defined character property,
860to represent all the characters in that property; two hexadecimal code
861points for a range; or a single hexadecimal code point.
862
863=item *
864
865Something to exclude, prefixed by "-": an existing character
866property (prefixed by "utf8::") or a fully qualified (including package
867name) user-defined character property,
868to represent all the characters in that property; two hexadecimal code
869points for a range; or a single hexadecimal code point.
870
871=item *
872
873Something to negate, prefixed "!": an existing character
874property (prefixed by "utf8::") or a fully qualified (including package
875name) user-defined character property,
876to represent all the characters in that property; two hexadecimal code
877points for a range; or a single hexadecimal code point.
878
879=item *
880
Something to intersect with, prefixed by "&": an existing character
property (prefixed by "utf8::") or a fully qualified (including package
name) user-defined character property,
to keep only those characters that are also in that property; two
hexadecimal code points for a range; or a single hexadecimal code point.
886
887=back
888
889For example, to define a property that covers both the Japanese
890syllabaries (hiragana and katakana), you can define
891
892 sub InKana {
893 return <<END;
894 3040\t309F
895 30A0\t30FF
896 END
897 }
898
899Imagine that the here-doc end marker is at the beginning of the line.
900Now you can use C<\p{InKana}> and C<\P{InKana}>.
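
For reference, here is the same definition as a self-contained sketch, with
the here-doc terminator actually at the beginning of the line. The test
strings are only illustrative.

```perl
use strict;
use warnings;

# Same definition as above; the here-doc interpolates "\t" into a tab.
sub InKana {
    return <<END;
3040\t309F
30A0\t30FF
END
}

my $hiragana_a = "\x{3042}";   # HIRAGANA LETTER A
my $latin_a    = "a";

my $kana_matches  = $hiragana_a =~ /\p{InKana}/ ? 1 : 0;
my $latin_matches = $latin_a    =~ /\p{InKana}/ ? 1 : 0;
my $latin_negated = $latin_a    =~ /\P{InKana}/ ? 1 : 0;

print "$kana_matches $latin_matches $latin_negated\n";
```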
901
902You could also have used the existing block property names:
903
904 sub InKana {
905 return <<'END';
906 +utf8::InHiragana
907 +utf8::InKatakana
908 END
909 }
910
Suppose you wanted to match only the allocated characters,
not the raw block ranges: in other words, you want to remove
the unassigned code points:
914
915 sub InKana {
916 return <<'END';
917 +utf8::InHiragana
918 +utf8::InKatakana
919 -utf8::IsCn
920 END
921 }
922
923The negation is useful for defining (surprise!) negated classes.
924
925 sub InNotKana {
926 return <<'END';
927 !utf8::InHiragana
928 -utf8::InKatakana
929 +utf8::IsCn
930 END
931 }
932
933This will match all non-Unicode code points, since every one of them is
934not in Kana. You can use intersection to exclude these, if desired, as
935this modified example shows:
936
937 sub InNotKana {
938 return <<'END';
939 !utf8::InHiragana
940 -utf8::InKatakana
941 +utf8::IsCn
942 &utf8::Any
943 END
944 }
945
946C<&utf8::Any> must be the last line in the definition.
947
948Intersection is used generally for getting the common characters matched
949by two (or more) classes. It's important to remember not to use "&" for
950the first set; that would be intersecting with nothing, resulting in an
951empty set.
952
953(Note that official Unicode properties differ from these in that they
954automatically exclude non-Unicode code points and a warning is raised if
955a match is attempted on one of those.)
956
957=head2 User-Defined Case Mappings (for serious hackers only)
958
959B<This feature has been removed as of Perl 5.16.>
960The CPAN module L<Unicode::Casing> provides better functionality without
961the drawbacks that this feature had. If you are using a Perl earlier
962than 5.16, this feature was most fully documented in the 5.14 version of
963this pod:
964L<http://perldoc.perl.org/5.14.0/perlunicode.html#User-Defined-Case-Mappings-%28for-serious-hackers-only%29>
965
966=head2 Character Encodings for Input and Output
967
968See L<Encode>.
969
970=head2 Unicode Regular Expression Support Level
971
972The following list of Unicode supported features for regular expressions describes
973all features currently directly supported by core Perl. The references to "Level N"
974and the section numbers refer to the Unicode Technical Standard #18,
975"Unicode Regular Expressions", version 13, from August 2008.
976
977=over 4
978
979=item *
980
981Level 1 - Basic Unicode Support
982
        RL1.1   Hex Notation                     - done          [1]
        RL1.2   Properties                       - done          [2][3]
        RL1.2a  Compatibility Properties         - done          [4]
        RL1.3   Subtraction and Intersection     - experimental  [5]
        RL1.4   Simple Word Boundaries           - done          [6]
        RL1.5   Simple Loose Matches             - done          [7]
        RL1.6   Line Boundaries                  - MISSING       [8][9]
        RL1.7   Supplementary Code Points        - done          [10]
991
992=over 4
993
994=item [1]
995
996\x{...}
997
998=item [2]
999
1000\p{...} \P{...}
1001
1002=item [3]
1003
1004supports not only minimal list, but all Unicode character properties (see Unicode Character Properties above)
1005
1006=item [4]
1007
1008\d \D \s \S \w \W \X [:prop:] [:^prop:]
1009
1010=item [5]
1011
1012The experimental feature in v5.18 C<"(?[...])"> accomplishes this. See
1013L<perlre/(?[ ])>. If you don't want to use an experimental feature,
1014you can use one of the following:
1015
1016=over 4
1017
1018=item * Regular expression look-ahead
1019
1020You can mimic class subtraction using lookahead.
1021For example, what UTS#18 might write as
1022
1023 [{Block=Greek}-[{UNASSIGNED}]]
1024
1025in Perl can be written as:
1026
1027 (?!\p{Unassigned})\p{Block=Greek}
1028 (?=\p{Assigned})\p{Block=Greek}
1029
1030But in this particular example, you probably really want
1031
1032 \p{Greek}
1033
1034which will match assigned characters known to be part of the Greek script.
1035
1036=item * CPAN module L<Unicode::Regex::Set>
1037
1038It does implement the full UTS#18 grouping, intersection, union, and
1039removal (subtraction) syntax.
1040
1041=item * L</"User-Defined Character Properties">
1042
1043'+' for union, '-' for removal (set-difference), '&' for intersection
1044
1045=back
1046
1047=item [6]
1048
1049\b \B
1050
1051=item [7]
1052
1053Note that Perl does Full case-folding in matching (but with bugs), not Simple: for example U+1F88 is equivalent to U+1F00 U+03B9, instead of just U+1F80. This difference matters mainly for certain Greek capital letters with certain modifiers: the Full case-folding decomposes the letter, while the Simple case-folding would map it to a single character.
1054
1055=item [8]
1056
1057Should do ^ and $ also on U+000B (\v in C), FF (\f), CR (\r), CRLF
1058(\r\n), NEL (U+0085), LS (U+2028), and PS (U+2029); should also affect
1059<>, $., and script line numbers; should not split lines within CRLF
1060(i.e. there is no empty line between \r and \n). For CRLF, try the
1061C<:crlf> layer (see L<PerlIO>).
1062
1063=item [9]
1064
Linebreaking conformant with UAX#14 "Unicode Line Breaking Algorithm" is
available through the L<Unicode::LineBreak> module.
1066
1067=item [10]
1068
UTF-8/UTF-EBCDIC used in Perl allows not only U+10000 to
U+10FFFF but also beyond U+10FFFF.
1071
1072=back
1073
1074=item *
1075
1076Level 2 - Extended Unicode Support
1077
        RL2.1   Canonical Equivalents           - MISSING       [10][11]
        RL2.2   Default Grapheme Clusters       - MISSING       [12]
        RL2.3   Default Word Boundaries         - MISSING       [14]
        RL2.4   Default Loose Matches           - MISSING       [15]
        RL2.5   Name Properties                 - DONE
        RL2.6   Wildcard Properties             - MISSING

        [10] see UAX#15 "Unicode Normalization Forms"
        [11] have Unicode::Normalize but not integrated to regexes
        [12] have \X but we don't have a "Grapheme Cluster Mode"
        [14] see UAX#29, Word Boundaries
        [15] This is covered in Chapter 3.13 (in Unicode 6.0)
1090
1091=item *
1092
1093Level 3 - Tailored Support
1094
        RL3.1   Tailored Punctuation            - MISSING
        RL3.2   Tailored Grapheme Clusters      - MISSING       [17][18]
        RL3.3   Tailored Word Boundaries        - MISSING
        RL3.4   Tailored Loose Matches          - MISSING
        RL3.5   Tailored Ranges                 - MISSING
        RL3.6   Context Matching                - MISSING       [19]
        RL3.7   Incremental Matches             - MISSING
      ( RL3.8   Unicode Set Sharing )
        RL3.9   Possible Match Sets             - MISSING
        RL3.10  Folded Matching                 - MISSING       [20]
        RL3.11  Submatchers                     - MISSING

        [17] see UTS#10 "Unicode Collation Algorithm"
        [18] have Unicode::Collate but not integrated to regexes
        [19] have (?<=x) and (?=x), but look-aheads or look-behinds
             should see outside of the target substring
        [20] need insensitive matching for linguistic features other
             than case; for example, hiragana to katakana, wide and
             narrow, simplified Han to traditional Han (see UTR#30
             "Character Foldings")
1115
1116=back
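
Two of the Level 1 notes above lend themselves to a quick check: the
look-ahead idiom from note [5] and the full case folding from note [7]. The
following is only a sketch; it assumes that U+03A2 is still unassigned in the
Unicode version your Perl was built with.

```perl
use strict;
use warnings;

# Note [5]: emulate [{Block=Greek}-[{UNASSIGNED}]] with a negative look-ahead.
my $assigned_greek = qr/(?!\p{Unassigned})\p{Block=Greek}/;

my $alpha    = "\x{3B1}";   # GREEK SMALL LETTER ALPHA
my $reserved = "\x{3A2}";   # in the Greek block, but unassigned (assumption)

my $alpha_in_block    = $alpha    =~ /\p{Block=Greek}/ ? 1 : 0;
my $alpha_assigned    = $alpha    =~ $assigned_greek   ? 1 : 0;
my $reserved_in_block = $reserved =~ /\p{Block=Greek}/ ? 1 : 0;
my $reserved_assigned = $reserved =~ $assigned_greek   ? 1 : 0;

# Note [7]: under full case folding the single character U+1F88 matches
# the two-character sequence U+1F00 U+03B9 when /i is in effect.
my $full_fold = "\x{1F88}" =~ /\x{1F00}\x{3B9}/i ? 1 : 0;

print "$alpha_assigned $reserved_assigned $full_fold\n";
```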
1117
1118=head2 Unicode Encodings
1119
1120Unicode characters are assigned to I<code points>, which are abstract
1121numbers. To use these numbers, various encodings are needed.
1122
1123=over 4
1124
1125=item *
1126
1127UTF-8
1128
1129UTF-8 is a variable-length (1 to 4 bytes), byte-order independent
1130encoding. For ASCII (and we really do mean 7-bit ASCII, not another
11318-bit encoding), UTF-8 is transparent.
1132
1133The following table is from Unicode 3.2.
1134
 Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte

   U+0000..U+007F       00..7F
   U+0080..U+07FF     * C2..DF    80..BF
   U+0800..U+0FFF       E0      * A0..BF    80..BF
   U+1000..U+CFFF       E1..EC    80..BF    80..BF
   U+D000..U+D7FF       ED        80..9F    80..BF
   U+D800..U+DFFF       +++++ utf16 surrogates, not legal utf8 +++++
   U+E000..U+FFFF       EE..EF    80..BF    80..BF
  U+10000..U+3FFFF      F0      * 90..BF    80..BF    80..BF
  U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
 U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF
1147
1148Note the gaps marked by "*" before several of the byte entries above. These are
1149caused by legal UTF-8 avoiding non-shortest encodings: it is technically
1150possible to UTF-8-encode a single code point in different ways, but that is
1151explicitly forbidden, and the shortest possible encoding should always be used
1152(and that is what Perl does).
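
To see the table in action, the following sketch prints the UTF-8 bytes that
Perl generates for a few code points. The helper name C<utf8_hex> is
hypothetical, and an ASCII (not EBCDIC) platform is assumed.

```perl
use strict;
use warnings;

# Return the UTF-8 encoding of a code point as space-separated hex bytes.
sub utf8_hex {
    my ($code_point) = @_;
    my $string = chr $code_point;
    utf8::encode($string);    # characters in, UTF-8 encoded bytes out
    return join ' ', map { sprintf '%02X', ord } split //, $string;
}

printf "U+%04X => %s\n", $_, utf8_hex($_) for 0x41, 0xE9, 0x20AC, 0x1F600;
```

Each result is the shortest possible encoding: one byte for U+0041, two for
U+00E9, three for U+20AC, and four for U+1F600.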
1153
1154Another way to look at it is via bits:
1155
                 Code Points  1st Byte  2nd Byte  3rd Byte  4th Byte

                    0aaaaaaa  0aaaaaaa
            00000bbbbbaaaaaa  110bbbbb  10aaaaaa
            ccccbbbbbbaaaaaa  1110cccc  10bbbbbb  10aaaaaa
  00000dddccccccbbbbbbaaaaaa  11110ddd  10cccccc  10bbbbbb  10aaaaaa
1162
1163As you can see, the continuation bytes all begin with "10", and the
1164leading bits of the start byte tell how many bytes there are in the
1165encoded character.
1166
1167The original UTF-8 specification allowed up to 6 bytes, to allow
1168encoding of numbers up to 0x7FFF_FFFF. Perl continues to allow those,
1169and has extended that up to 13 bytes to encode code points up to what
1170can fit in a 64-bit word. However, Perl will warn if you output any of
1171these as being non-portable; and under strict UTF-8 input protocols,
1172they are forbidden.
1173
1174The Unicode non-character code points are also disallowed in UTF-8 in
1175"open interchange". See L</Non-character code points>.
1176
1177=item *
1178
1179UTF-EBCDIC
1180
1181Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.
1182
1183=item *
1184
1185UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)
1186
The following items are mostly for reference and general Unicode
knowledge; Perl doesn't use these constructs internally.
1189
1190Like UTF-8, UTF-16 is a variable-width encoding, but where
1191UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units.
1192All code points occupy either 2 or 4 bytes in UTF-16: code points
1193C<U+0000..U+FFFF> are stored in a single 16-bit unit, and code
1194points C<U+10000..U+10FFFF> in two 16-bit units. The latter case is
1195using I<surrogates>, the first 16-bit unit being the I<high
1196surrogate>, and the second being the I<low surrogate>.
1197
1198Surrogates are code points set aside to encode the C<U+10000..U+10FFFF>
1199range of Unicode code points in pairs of 16-bit units. The I<high
1200surrogates> are the range C<U+D800..U+DBFF> and the I<low surrogates>
1201are the range C<U+DC00..U+DFFF>. The surrogate encoding is
1202
    # int() is needed because Perl's "/" is floating-point division
    $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
    $lo = ($uni - 0x10000) % 0x400 + 0xDC00;

and the decoding is

    $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
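
The surrogate arithmetic can be checked against the L<Encode> module. This is
a sketch only: the helper names C<to_surrogates> and C<from_surrogates> are
hypothetical, and bit shifts and masks are used in place of division and
modulus, which is equivalent for these ranges.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Split a supplementary code point into its high and low surrogates.
sub to_surrogates {
    my ($uni) = @_;
    my $offset = $uni - 0x10000;
    return (($offset >> 10) + 0xD800, ($offset & 0x3FF) + 0xDC00);
}

# Combine a surrogate pair back into a code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
}

my ($hi, $lo)  = to_surrogates(0x1F600);
my $round_trip = from_surrogates($hi, $lo);

# Cross-check: the UTF-16BE bytes of the same character, as hex digits
my $via_encode = uc unpack 'H*', encode('UTF-16BE', chr 0x1F600);

printf "%04X %04X (Encode says %s)\n", $hi, $lo, $via_encode;
```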
1209
1210Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16
1211itself can be used for in-memory computations, but if storage or
1212transfer is required either UTF-16BE (big-endian) or UTF-16LE
1213(little-endian) encodings must be chosen.
1214
1215This introduces another problem: what if you just know that your data
1216is UTF-16, but you don't know which endianness? Byte Order Marks, or
1217BOMs, are a solution to this. A special character has been reserved
1218in Unicode to function as a byte order marker: the character with the
1219code point C<U+FEFF> is the BOM.
1220
1221The trick is that if you read a BOM, you will know the byte order,
1222since if it was written on a big-endian platform, you will read the
1223bytes C<0xFE 0xFF>, but if it was written on a little-endian platform,
1224you will read the bytes C<0xFF 0xFE>. (And if the originating platform
1225was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.)
1226
The way this trick works is that the character with the code point
C<U+FFFE> is not supposed to be in input streams, so the
sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in
little-endian format" and cannot be "C<U+FFFE>, represented in big-endian
format".
1232
1233Surrogates have no meaning in Unicode outside their use in pairs to
1234represent other code points. However, Perl allows them to be
1235represented individually internally, for example by saying
1236C<chr(0xD801)>, so that all code points, not just those valid for open
1237interchange, are
1238representable. Unicode does define semantics for them, such as their
1239General Category is "Cs". But because their use is somewhat dangerous,
1240Perl will warn (using the warning category "surrogate", which is a
1241sub-category of "utf8") if an attempt is made
1242to do things like take the lower case of one, or match
1243case-insensitively, or to output them. (But don't try this on Perls
1244before 5.14.)
1245
1246=item *
1247
1248UTF-32, UTF-32BE, UTF-32LE
1249
The UTF-32 family is pretty much like the UTF-16 family, except that
the units are 32-bit, and therefore the surrogate scheme is not
needed. UTF-32 is a fixed-width encoding. The BOM signatures are
C<0x00 0x00 0xFE 0xFF> for BE and C<0xFF 0xFE 0x00 0x00> for LE.
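
The BOM signatures given above for UTF-16, UTF-8, and UTF-32 can be turned
into a simple sniffing routine. This is a sketch under stated assumptions,
not a robust detector: the function name C<detect_bom> is hypothetical, and
the byte sequence C<0xFF 0xFE 0x00 0x00> is treated as UTF-32LE even though
it could also be a UTF-16LE BOM followed by U+0000.

```perl
use strict;
use warnings;

# Guess an encoding from a leading BOM; returns undef when there is none.
# The UTF-32 signatures are tested first because the UTF-32LE BOM begins
# with the same two bytes as the UTF-16LE BOM.
sub detect_bom {
    my ($bytes) = @_;
    return 'UTF-32BE' if $bytes =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-32LE' if $bytes =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return;
}

my $guess = detect_bom("\xFF\xFE\x41\x00");   # BOM, then "A" in UTF-16LE
print defined $guess ? $guess : 'no BOM', "\n";
```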
1254
1255=item *
1256
1257UCS-2, UCS-4
1258
1259Legacy, fixed-width encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit
1260encoding. Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
1261because it does not use surrogates. UCS-4 is a 32-bit encoding,
1262functionally identical to UTF-32 (the difference being that
1263UCS-4 forbids neither surrogates nor code points larger than 0x10_FFFF).
1264
1265=item *
1266
1267UTF-7
1268
1269A seven-bit safe (non-eight-bit) encoding, which is useful if the
1270transport or storage is not eight-bit safe. Defined by RFC 2152.
1271
1272=back
1273
1274=head2 Non-character code points
1275
127666 code points are set aside in Unicode as "non-character code points".
1277These all have the Unassigned (Cn) General Category, and they never will
1278be assigned. These are never supposed to be in legal Unicode input
1279streams, so that code can use them as sentinels that can be mixed in
1280with character data, and they always will be distinguishable from that data.
1281To keep them out of Perl input streams, strict UTF-8 should be
1282specified, such as by using the layer C<:encoding('UTF-8')>. The
1283non-character code points are the 32 between U+FDD0 and U+FDEF, and the
128434 code points U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF.
1285Some people are under the mistaken impression that these are "illegal",
1286but that is not true. An application or cooperating set of applications
1287can legally use them at will internally; but these code points are
1288"illegal for open interchange". Therefore, Perl will not accept these
1289from input streams unless lax rules are being used, and will warn
1290(using the warning category "nonchar", which is a sub-category of "utf8") if
1291an attempt is made to output them.
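
The count of 66 can be reproduced with a few lines of Perl. This sketch
builds the list from the description above and then checks each code point
against the C<Noncharacter_Code_Point> property; the variable names are only
illustrative, and "utf8" warnings are disabled around the check purely as a
precaution.

```perl
use strict;
use warnings;

my @nonchars = (
    0xFDD0 .. 0xFDEF,                          # the 32 contiguous ones
    map {
        my $plane = $_ << 16;                  # planes 0 through 16
        ($plane | 0xFFFE, $plane | 0xFFFF);    # last two code points of each
    } 0 .. 0x10,
);
my $count = @nonchars;

my $mismatches;
{
    no warnings 'utf8';
    $mismatches = grep { chr($_) !~ /\p{Noncharacter_Code_Point}/ } @nonchars;
}

print "$count non-character code points, $mismatches mismatches\n";
```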
1292
1293=head2 Beyond Unicode code points
1294
The maximum Unicode code point is U+10FFFF. But Perl accepts code
points up to the maximum permissible unsigned number available on the
platform. However, Perl will not accept these from input streams unless
lax rules are being used, and will warn (using the warning category
"non_unicode", which is a sub-category of "utf8") if an attempt is made to
operate on or output them. For example, C<uc(chr(0x11_0000))> will generate
this warning, returning the input parameter as its result, as the upper
case of every non-Unicode code point is the code point itself.
1303
1304=head2 Security Implications of Unicode
1305
1306Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.
1307Also, note the following:
1308
1309=over 4
1310
1311=item *
1312
1313Malformed UTF-8
1314
1315Unfortunately, the original specification of UTF-8 leaves some room for
1316interpretation of how many bytes of encoded output one should generate
1317from one input Unicode character. Strictly speaking, the shortest
1318possible sequence of UTF-8 bytes should be generated,
1319because otherwise there is potential for an input buffer overflow at
1320the receiving end of a UTF-8 connection. Perl always generates the
1321shortest length UTF-8, and with warnings on, Perl will warn about
1322non-shortest length UTF-8 along with other malformations, such as the
1323surrogates, which are not Unicode code points valid for interchange.
1324
1325=item *
1326
1327Regular expression pattern matching may surprise you if you're not
1328accustomed to Unicode. Starting in Perl 5.14, several pattern
1329modifiers are available to control this, called the character set
1330modifiers. Details are given in L<perlre/Character set modifiers>.
1331
1332=back
1333
1334As discussed elsewhere, Perl has one foot (two hooves?) planted in
1335each of two worlds: the old world of bytes and the new world of
1336characters, upgrading from bytes to characters when necessary.
1337If your legacy code does not explicitly use Unicode, no automatic
1338switch-over to characters should happen. Characters shouldn't get
1339downgraded to bytes, either. It is possible to accidentally mix bytes
1340and characters, however (see L<perluniintro>), in which case C<\w> in
1341regular expressions might start behaving differently (unless the C</a>
1342modifier is in effect). Review your code. Use warnings and the C<strict> pragma.
1343
1344=head2 Unicode in Perl on EBCDIC
1345
1346The way Unicode is handled on EBCDIC platforms is still
1347experimental. On such platforms, references to UTF-8 encoding in this
1348document and elsewhere should be read as meaning the UTF-EBCDIC
1349specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues
1350are specifically discussed. There is no C<utfebcdic> pragma or
1351":utfebcdic" layer; rather, "utf8" and ":utf8" are reused to mean
1352the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic>
1353for more discussion of the issues.
1354
1355=head2 Locales
1356
1357See L<perllocale/Unicode and UTF-8>
1358
1359=head2 When Unicode Does Not Happen
1360
1361While Perl does have extensive ways to input and output in Unicode,
1362and a few other "entry points" like the @ARGV array (which can sometimes be
1363interpreted as UTF-8), there are still many places where Unicode
1364(in some encoding or another) could be given as arguments or received as
1365results, or both, but it is not.
1366
1367The following are such interfaces. Also, see L</The "Unicode Bug">.
1368For all of these interfaces Perl
1369currently (as of v5.16.0) simply assumes byte strings both as arguments
1370and results, or UTF-8 strings if the (problematic) C<encoding> pragma has been used.
1371
1372One reason that Perl does not attempt to resolve the role of Unicode in
1373these situations is that the answers are highly dependent on the operating
1374system and the file system(s). For example, whether filenames can be
1375in Unicode and in exactly what kind of encoding, is not exactly a
1376portable concept. Similarly for C<qx> and C<system>: how well will the
1377"command-line interface" (and which of them?) handle Unicode?
1378
1379=over 4
1380
1381=item *
1382
1383chdir, chmod, chown, chroot, exec, link, lstat, mkdir,
1384rename, rmdir, stat, symlink, truncate, unlink, utime, -X
1385
1386=item *
1387
1388%ENV
1389
1390=item *
1391
1392glob (aka the <*>)
1393
1394=item *
1395
1396open, opendir, sysopen
1397
1398=item *
1399
1400qx (aka the backtick operator), system
1401
1402=item *
1403
1404readdir, readlink
1405
1406=back
1407
1408=head2 The "Unicode Bug"
1409
The term "Unicode bug" has been applied to an inconsistency
on ASCII platforms with the
Unicode code points in the Latin-1 Supplement block, that
is, between 128 and 255. Without a locale specified, unlike all other
characters or code points, these characters have very different semantics
under byte semantics than under character semantics, unless
C<use feature 'unicode_strings'> is specified, directly or indirectly.
(It is indirectly specified by a C<use v5.12> or higher.)
1418
1419In character semantics these upper-Latin1 characters are interpreted as
1420Unicode code points, which means
1421they have the same semantics as Latin-1 (ISO-8859-1).
1422
1423In byte semantics (without C<unicode_strings>), they are considered to
1424be unassigned characters, meaning that the only semantics they have is
1425their ordinal numbers, and that they are
1426not members of various character classes. None are considered to match C<\w>
1427for example, but all match C<\W>.
1428
1429Perl 5.12.0 added C<unicode_strings> to force character semantics on
1430these code points in some circumstances, which fixed portions of the
1431bug; Perl 5.14.0 fixed almost all of it; and Perl 5.16.0 fixed the
1432remainder (so far as we know, anyway). The lesson here is to enable
1433C<unicode_strings> to avoid the headaches described below.
1434
1435The old, problematic behavior affects these areas:
1436
1437=over 4
1438
1439=item *
1440
1441Changing the case of a scalar, that is, using C<uc()>, C<ucfirst()>, C<lc()>,
1442and C<lcfirst()>, or C<\L>, C<\U>, C<\u> and C<\l> in double-quotish
1443contexts, such as regular expression substitutions.
1444Under C<unicode_strings> starting in Perl 5.12.0, character semantics are
1445generally used. See L<perlfunc/lc> for details on how this works
1446in combination with various other pragmas.
1447
1448=item *
1449
1450Using caseless (C</i>) regular expression matching.
1451Starting in Perl 5.14.0, regular expressions compiled within
1452the scope of C<unicode_strings> use character semantics
1453even when executed or compiled into larger
1454regular expressions outside the scope.
1455
1456=item *
1457
1458Matching any of several properties in regular expressions, namely C<\b>,
1459C<\B>, C<\s>, C<\S>, C<\w>, C<\W>, and all the Posix character classes
1460I<except> C<[[:ascii:]]>.
1461Starting in Perl 5.14.0, regular expressions compiled within
1462the scope of C<unicode_strings> use character semantics
1463even when executed or compiled into larger
1464regular expressions outside the scope.
1465
1466=item *
1467
1468In C<quotemeta> or its inline equivalent C<\Q>, no code points above 127
1469are quoted in UTF-8 encoded strings, but in byte encoded strings, code
1470points between 128-255 are always quoted.
1471Starting in Perl 5.16.0, consistent quoting rules are used within the
1472scope of C<unicode_strings>, as described in L<perlfunc/quotemeta>.
1473
1474=back
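
The first of these areas, case changing, is easy to observe. The following
sketch assumes Perl 5.12 or later on an ASCII platform with no locale in
effect; the variable names are only illustrative.

```perl
use strict;
use warnings;

my $string = "\xC2";   # LATIN CAPITAL LETTER A WITH CIRCUMFLEX, one byte

my ($without, $with);
{
    no feature 'unicode_strings';
    $without = lc $string;    # byte semantics: no case mapping is applied
}
{
    use feature 'unicode_strings';
    $with = lc $string;       # character semantics: U+00C2 becomes U+00E2
}

printf "without: %02X, with: %02X\n", ord $without, ord $with;
```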
1475
1476This behavior can lead to unexpected results in which a string's semantics
1477suddenly change if a code point above 255 is appended to or removed from it,
1478which changes the string's semantics from byte to character or vice versa. As
1479an example, consider the following program and its output:
1480
 $ perl -le'
     no feature "unicode_strings";
     $s1 = "\xC2";
     $s2 = "\x{2660}";
     for ($s1, $s2, $s1.$s2) {
         print /\w/ || 0;
     }
 '
 0
 0
 1

If there's no C<\w> in C<$s1> or in C<$s2>, why does their concatenation have one?
1494
1495This anomaly stems from Perl's attempt to not disturb older programs that
1496didn't use Unicode, and hence had no semantics for characters outside of the
1497ASCII range (except in a locale), along with Perl's desire to add Unicode
1498support seamlessly. The result wasn't seamless: these characters were
1499orphaned.
1500
1501For Perls earlier than those described above, or when a string is passed
1502to a function outside the subpragma's scope, a workaround is to always
1503call C<utf8::upgrade($string)>,
1504or to use the standard module L<Encode>. Also, a scalar that has any characters
1505whose ordinal is 0x100 or above, or which were specified using either of the
1506C<\N{...}> notations, will automatically have character semantics.
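
The C<utf8::upgrade> workaround can be sketched as follows. The example
turns C<unicode_strings> off explicitly so that it shows the old behavior
even where a C<use v5.12> or higher would otherwise enable the feature.

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # make the old behavior explicit

my $string = "\xC2";
my $before = $string =~ /\w/ ? 1 : 0;   # byte semantics: not a word character

utf8::upgrade($string);                 # same characters, UTF-8 representation
my $after  = $string =~ /\w/ ? 1 : 0;   # character semantics: a word character

print "before: $before, after: $after\n";
```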
1507
1508=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)
1509
Sometimes (see L</"When Unicode Does Not Happen"> or L</The "Unicode Bug">)
there are situations where you simply need to force a byte
string into UTF-8, or vice versa. The low-level calls
C<utf8::upgrade($bytestring)> and C<utf8::downgrade($utf8string[, FAIL_OK])> are
the answers.
1515
1516Note that utf8::downgrade() can fail if the string contains characters
1517that don't fit into a byte.
1518
1519Calling either function on a string that already is in the desired state is a
1520no-op.
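
The following sketch exercises both calls. The variable names are made up
for the illustration, and C<utf8::is_utf8()> is used only to peek at the
internal flag:

    use strict;
    use warnings;

    my $str  = "caf\xE9";                # every character is below 0x100
    my $copy = $str;

    utf8::upgrade($str);                 # force the internal UTF-8 form
    my $flag_after_upgrade = utf8::is_utf8($str) ? 1 : 0;
    my $same_string        = $str eq $copy ? 1 : 0;   # contents unchanged

    utf8::downgrade($str);               # back to one byte per character
    my $flag_after_downgrade = utf8::is_utf8($str) ? 1 : 0;

    # A character above 0xFF cannot be downgraded. With a true FAIL_OK
    # argument the call returns false instead of dying.
    my $wide = "\x{263A}";
    my $ok   = utf8::downgrade($wide, 1) ? 1 : 0;

    print "$flag_after_upgrade $same_string $flag_after_downgrade $ok\n";
    # prints "1 1 0 0"

Without the second argument, the failed downgrade of C<$wide> would raise
an exception rather than return false.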

=head2 Using Unicode in XS

If you want to handle Perl Unicode in XS extensions, you may find the
following C APIs useful. See also L<perlguts/"Unicode Support"> for an
explanation about Unicode at the XS level, and L<perlapi> for the API
details.

=over 4

=item *

C<DO_UTF8(sv)> returns true if the C<UTF8> flag is on and the bytes
pragma is not in effect. C<SvUTF8(sv)> returns true if the C<UTF8>
flag is on; the bytes pragma is ignored. The C<UTF8> flag being on
does B<not> mean that there are any characters of code points greater
than 255 (or 127) in the scalar or that there are even any characters
in the scalar. What the C<UTF8> flag means is that the sequence of
octets in the representation of the scalar is the sequence of UTF-8
encoded code points of the characters of a string. The C<UTF8> flag
being off means that each octet in this representation encodes a
single character with code point 0..255 within the string. Perl's
Unicode model is not to use UTF-8 until it is absolutely necessary.

=item *

C<uvchr_to_utf8(buf, chr)> writes a Unicode character code point into
a buffer encoding the code point as UTF-8, and returns a pointer
pointing after the UTF-8 bytes. It works appropriately on EBCDIC machines.

=item *

C<utf8_to_uvchr_buf(buf, bufend, lenp)> reads UTF-8 encoded bytes from a
buffer and
returns the Unicode character code point and, optionally, the length of
the UTF-8 byte sequence. It works appropriately on EBCDIC machines.

=item *

C<utf8_length(start, end)> returns the length of the UTF-8 encoded buffer
in characters. C<sv_len_utf8(sv)> returns the length of the UTF-8 encoded
scalar.

=item *

C<sv_utf8_upgrade(sv)> converts the string of the scalar to its UTF-8
encoded form. C<sv_utf8_downgrade(sv)> does the opposite, if
possible. C<sv_utf8_encode(sv)> is like C<sv_utf8_upgrade()> except that
it does not set the C<UTF8> flag. C<sv_utf8_decode()> does the
opposite of C<sv_utf8_encode()>. Note that none of these are to be
used as general-purpose encoding or decoding interfaces: C<use Encode>
for that. C<sv_utf8_upgrade()> is affected by the encoding pragma
but C<sv_utf8_downgrade()> is not (since the encoding pragma is
designed to be a one-way street).

=item *

C<is_utf8_string(buf, len)> returns true if C<len> bytes of the buffer
are valid UTF-8.

=item *

C<is_utf8_char_buf(buf, buf_end)> returns true if the pointer points to
a valid UTF-8 character.

=item *

C<UTF8SKIP(buf)> will return the number of bytes in the UTF-8 encoded
character in the buffer. C<UNISKIP(chr)> will return the number of bytes
required to UTF-8-encode the Unicode character code point. C<UTF8SKIP()>
is useful, for example, for iterating over the characters of a UTF-8
encoded buffer; C<UNISKIP()> is useful, for example, in computing
the size required for a UTF-8 encoded buffer.

=item *

C<utf8_distance(a, b)> will tell the distance in characters between the
two pointers pointing to the same UTF-8 encoded buffer.

=item *

C<utf8_hop(s, off)> will return a pointer to a UTF-8 encoded buffer
that is C<off> (positive or negative) Unicode characters displaced
from the UTF-8 buffer C<s>. Be careful not to overstep the buffer:
C<utf8_hop()> will merrily run off the end or the beginning of the
buffer if told to do so.

=item *

C<pv_uni_display(dsv, spv, len, pvlim, flags)> and
C<sv_uni_display(dsv, ssv, pvlim, flags)> are useful for debugging the
output of Unicode strings and scalars. By default they are useful
only for debugging--they display B<all> characters as hexadecimal code
points--but with the flags C<UNI_DISPLAY_ISPRINT>,
C<UNI_DISPLAY_BACKSLASH>, and C<UNI_DISPLAY_QQ> you can make the
output more readable.

=item *

C<foldEQ_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to
compare two strings case-insensitively in Unicode. For case-sensitive
comparisons you can just use C<memEQ()> and C<memNE()> as usual, except
if one string is in utf8 and the other isn't.

=back

For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h>
in the Perl source code distribution.
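
The meaning of the C<UTF8> flag described in the first item can also be
observed from Perl space. The sketch below is a Perl-level illustration
rather than XS code, and the helper C<utf8_octets> is a hypothetical name
introduced only for this example:

    use strict;
    use warnings;

    # Hypothetical helper: the number of octets a string occupies once
    # encoded as UTF-8. It works on a copy, so the argument is untouched.
    sub utf8_octets {
        my ($copy) = @_;
        utf8::encode($copy);
        return length $copy;
    }

    my $ascii = "abc";
    utf8::upgrade($ascii);      # flag on, yet no code point above 127

    my $latin1 = "\xE9";        # flag off: one octet holds one character

    my @report = (
        utf8::is_utf8($ascii)  ? 1 : 0,   # flag is on
        utf8::is_utf8($latin1) ? 1 : 0,   # flag is off
        length($latin1),                  # one character
        utf8_octets($latin1),             # two octets when UTF-8 encoded
    );
    print "@report\n";                    # prints "1 0 1 2"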

=head2 Hacking Perl to work on earlier Unicode versions (for very serious hackers only)

Perl by default comes with the latest supported Unicode version built in, but
you can change to use any earlier one.

Download the files in the desired version of Unicode from the Unicode web
site (L<http://www.unicode.org>). These should replace the existing files in
F<lib/unicore> in the Perl source tree. Follow the instructions in
F<README.perl> in that directory to change some of their names, and then build
perl (see L<INSTALL>).
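
One way to check which Unicode version a given perl was built with, for
instance after such a rebuild, is to ask the standard L<Unicode::UCD>
module. This is only a quick sanity check, not part of the build
procedure:

    use strict;
    use warnings;
    use Unicode::UCD ();

    # Reports the version of the Unicode Character Database in use.
    my $version = Unicode::UCD::UnicodeVersion();
    print "Unicode version: $version\n";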

=head1 BUGS

=head2 Interaction with Locales

See L<perllocale/Unicode and UTF-8>.

=head2 Problems with characters in the Latin-1 Supplement range

See L</The "Unicode Bug">.

=head2 Interaction with Extensions

When Perl exchanges data with an extension, the extension should be
able to understand the UTF8 flag and act accordingly. If the
extension doesn't recognize that flag, it's likely that the extension
will return incorrectly-flagged data.

So if you're working with Unicode data, consult the documentation of
every module you're using to see whether there are any issues with
Unicode data exchange. If the documentation does not talk about Unicode
at all, suspect the worst and probably look at the source to learn how the
module is implemented. Modules written completely in Perl shouldn't
cause problems. Modules that directly or indirectly access code written
in other programming languages are at risk.

For affected functions, the simple strategy to avoid data corruption is
to always make the encoding of the exchanged data explicit. Choose an
encoding that you know the extension can handle. Convert arguments passed
to the extensions to that encoding and convert results back from that
encoding. Write wrapper functions that do the conversions for you, so
you can later change the functions when the extension catches up.

To provide an example, let's say the popular C<Foo::Bar::escape_html>
function doesn't deal with Unicode data yet. The wrapper function
would convert the argument to raw UTF-8 and convert the result back to
Perl's internal representation like so:

    sub my_escape_html ($) {
        my($what) = shift;
        return unless defined $what;
        require Encode;
        Encode::decode_utf8(Foo::Bar::escape_html(
                                         Encode::encode_utf8($what)));
    }

Sometimes, when the extension does not convert data but just stores
and retrieves them, you will be able to use the otherwise
dangerous C<Encode::_utf8_on()> function. Let's say the popular
C<Foo::Bar> extension, written in C, provides a C<param> method that
lets you store and retrieve data according to these prototypes:

    $self->param($name, $value);            # set a scalar
    $value = $self->param($name);           # retrieve a scalar

If it does not yet provide support for any encoding, one could write a
derived class with such a C<param> method:

    sub param {
        my($self,$name,$value) = @_;
        utf8::upgrade($name);     # make sure it is UTF-8 encoded
        if (defined $value) {
            utf8::upgrade($value);  # make sure it is UTF-8 encoded
            return $self->SUPER::param($name,$value);
        } else {
            require Encode;
            my $ret = $self->SUPER::param($name);
            Encode::_utf8_on($ret); # we know, it is UTF-8 encoded
            return $ret;
        }
    }

Some extensions provide filters on data entry/exit points, such as
C<DB_File::filter_store_key> and family. Look out for such filters in
the documentation of your extensions; they can make the transition to
Unicode data much easier.
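
As a rough sketch of how such filters might be used, the following script
installs the four L<DB_File> filters so that keys and values are stored as
UTF-8 octets and come back as characters. It assumes DB_File is installed
and uses an in-memory database; the variable names and the test data are
made up for the illustration:

    use strict;
    use warnings;
    use Fcntl;
    use DB_File;

    # An undef filename asks DB_File for an in-memory database.
    my %db;
    my $handle = tie %db, 'DB_File', undef, O_RDWR|O_CREAT, 0666, $DB_HASH
        or die "Cannot tie in-memory database: $!";

    # Each filter sees the datum in $_ and modifies it in place.
    $handle->filter_store_key(  sub { utf8::encode($_) });
    $handle->filter_store_value(sub { utf8::encode($_) });
    $handle->filter_fetch_key(  sub { utf8::decode($_) });
    $handle->filter_fetch_value(sub { utf8::decode($_) });

    my $key   = "\x{263A}";
    my $value = "na\x{EF}ve \x{2660}";
    $db{$key} = $value;

    my ($stored_key) = keys %db;
    my $fetched      = $db{$key};
    print "round trip ok\n"
        if $stored_key eq $key && $fetched eq $value;

    undef $handle;
    untie %db;

Keeping the conversions in the filters means that the rest of the program
deals only in character strings.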

=head2 Speed

Some functions are slower when working on UTF-8 encoded strings than
on byte encoded strings. All functions that need to hop over
characters such as length(), substr() or index(), or matching regular
expressions can work B<much> faster when the underlying data are
byte-encoded.

In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
a caching scheme was introduced which will hopefully make the slowness
somewhat less spectacular, at least for some operations. In general,
operations with UTF-8 encoded strings are still slower. As an example,
the Unicode properties (character classes) like C<\p{Nd}> are known to
be quite a bit slower (5-20 times) than their simpler counterparts
like C<\d> (then again, there are hundreds of Unicode characters matching C<Nd>
compared with the 10 ASCII characters matching C<\d>).
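
To measure the difference on a particular perl, something along the
following lines can be used. It is only a sketch: the sample text is
invented, the standard L<Benchmark> module does the timing, and the
ratio it reports will vary with the perl version and the data:

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    # Sample data: ASCII letters and digits, upgraded so that both
    # patterns operate on a UTF-8 encoded string.
    my $text = join '', map { "item$_ " } 1 .. 2000;
    utf8::upgrade($text);

    # Both patterns find the same digits in this ASCII-only text.
    my $count_d  = () = $text =~ /\d/g;
    my $count_nd = () = $text =~ /\p{Nd}/g;
    print "matches: \\d=$count_d \\p{Nd}=$count_nd\n";

    # Run each pattern for at least one CPU second and compare the rates.
    cmpthese(-1, {
        '\d'     => sub { my $n = () = $text =~ /\d/g },
        '\p{Nd}' => sub { my $n = () = $text =~ /\p{Nd}/g },
    });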

=head2 Problems on EBCDIC platforms

There are several known problems with Perl on EBCDIC platforms. If you
want to use Perl there, send email to perlbug@perl.org.

In earlier versions, when byte and character data were concatenated,
the new string was sometimes created by
decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
old Unicode string used EBCDIC.

If you find any of these, please report them as bugs.

=head2 Porting code from perl-5.6.X

Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer
was required to use the C<utf8> pragma to declare that a given scope
expected to deal with Unicode data and had to make sure that only
Unicode data were reaching that scope. If you have code that is
working with 5.6, you will need some of the following adjustments to
your code. The examples are written such that the code will continue
to work under 5.6, so you should be safe to try them out.

=over 3

=item *

A filehandle that should read or write UTF-8

    if ($] >= 5.008) {
        binmode $fh, ":encoding(utf8)";
    }

=item *

A scalar that is going to be passed to some extension

Be it C<Compress::Zlib>, C<Apache::Request> or any extension that has no
mention of Unicode in the manpage, you need to make sure that the
UTF8 flag is stripped off. Note that at the time of this writing
(January 2012) the mentioned modules are not UTF-8-aware. Please
check the documentation to verify if this is still true.

    if ($] >= 5.008) {
        require Encode;
        $val = Encode::encode_utf8($val); # make octets
    }

=item *

A scalar we got back from an extension

If you believe the scalar comes back as UTF-8, you will most likely
want the UTF8 flag restored:

    if ($] >= 5.008) {
        require Encode;
        $val = Encode::decode_utf8($val);
    }

=item *

Same thing, if you are really sure it is UTF-8

    if ($] >= 5.008) {
        require Encode;
        Encode::_utf8_on($val);
    }

=item *

A wrapper for fetchrow_array and fetchrow_hashref

When the database contains only UTF-8, a wrapper function or method is
a convenient way to replace all your C<fetchrow_array> and
C<fetchrow_hashref> calls. A wrapper function will also make it easier to
adapt to future enhancements in your database driver. Note that at the
time of this writing (January 2012), the DBI has no standardized way
to deal with UTF-8 data. Please check the documentation to verify if
that is still true.

    sub fetchrow {
        # $what is one of fetchrow_{array,hashref}
        my($self, $sth, $what) = @_;
        if ($] < 5.008) {
            return $sth->$what;
        } else {
            require Encode;
            if (wantarray) {
                my @arr = $sth->$what;
                for (@arr) {
                    defined && /[^\000-\177]/ && Encode::_utf8_on($_);
                }
                return @arr;
            } else {
                my $ret = $sth->$what;
                if (ref $ret) {
                    for my $k (keys %$ret) {
                        defined
                        && /[^\000-\177]/
                        && Encode::_utf8_on($_) for $ret->{$k};
                    }
                    return $ret;
                } else {
                    defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
                    return $ret;
                }
            }
        }
    }

=item *

A large scalar that you know can only contain ASCII

Scalars that contain only ASCII and are marked as UTF-8 are sometimes
a drag on your program. If you recognize such a situation, just remove
the UTF8 flag:

    utf8::downgrade($val) if $] >= 5.008;

=back

=head1 SEE ALSO

L<perlunitut>, L<perluniintro>, L<perluniprops>, L<Encode>, L<open>, L<utf8>, L<bytes>,
L<perlretut>, L<perlvar/"${^UNICODE}">,
L<http://www.unicode.org/reports/tr44>.

=cut