=head1 NAME

perlunicode - Unicode support in Perl

=head1 DESCRIPTION

=head2 Important Caveats

Unicode support is an extensive requirement. While Perl does not
implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.

=over 4

=item Input and Output Layers

Perl knows when a filehandle uses Perl's internal Unicode encodings
(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
the ":utf8" layer. Other encodings can be converted to Perl's
encoding on input or from Perl's encoding on output by use of the
":encoding(...)" layer. See L<open>.

To indicate that Perl source itself is using a particular encoding,
see L<encoding>.

=item Regular Expressions

The regular expression compiler produces polymorphic opcodes. That is,
the pattern adapts to the data and automatically switches to the Unicode
character scheme when presented with Unicode data--or instead uses
a traditional byte scheme when presented with byte data.

=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts

As a compatibility measure, the C<use utf8> pragma must be explicitly
included to enable recognition of UTF-8 in the Perl scripts themselves
(in string or regular expression literals, or in identifier names) on
ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
machines. B<These are the only times when an explicit C<use utf8>
is needed.> See L<utf8>.

You can also use the C<encoding> pragma to change the default encoding
of the data in your script; see L<encoding>.

=item BOM-marked scripts and UTF-16 scripts autodetected

If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF-16BE,
or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either
endianness, Perl will correctly read in the script as Unicode.
(BOMless UTF-8 cannot be effectively recognized or differentiated from
ISO 8859-1 or other eight-bit encodings.)

=item C<use encoding> needed to upgrade non-Latin-1 byte strings

By default, there is a fundamental asymmetry in Perl's Unicode model:
implicit upgrading from byte strings to Unicode strings assumes that
they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
downgraded with UTF-8 encoding. This happens because the first 256
code points in Unicode happen to agree with Latin-1.

If you wish to interpret byte strings as UTF-8 instead, use the
C<encoding> pragma:

    use encoding 'utf8';

See L</"Byte and Character Semantics"> for more details.

=back

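As a minimal sketch of the layers described under "Input and Output
Layers" above, the following writes one character through an
C<:encoding(UTF-8)> layer and reads it back. The in-memory filehandles
and the variable names are assumptions made only to keep the example
self-contained; a real program would more likely open a file.

```perl
use strict;
use warnings;

# One character: WHITE SMILING FACE (U+263A).
my $char = "\x{263A}";

# Encode on output: the layer turns characters into UTF-8 bytes.
my $bytes = '';
open(my $out, '>:encoding(UTF-8)', \$bytes) or die "open for write: $!";
print {$out} $char;
close($out) or die "close: $!";

# Decode on input: the same layer turns UTF-8 bytes back into characters.
open(my $in, '<:encoding(UTF-8)', \$bytes) or die "open for read: $!";
my $read = <$in>;
close($in);

printf "bytes written: %d, characters read: %d\n",
    length($bytes), length($read);
```

The byte count and the character count differ because the layer, not
the program, is responsible for the conversion.
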
=head2 Byte and Character Semantics

Beginning with version 5.6, Perl uses logically-wide characters to
represent strings internally.

In future, Perl-level operations will be expected to work with
characters rather than bytes.

However, as an interim compatibility measure, Perl aims to
provide a safe migration path from byte semantics to character
semantics for programs. For operations where Perl can unambiguously
decide that the input data are characters, Perl switches to
character semantics. For operations where this determination cannot
be made without additional information from the user, Perl decides in
favor of compatibility and chooses to use byte semantics.

This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if
none of the program's inputs were marked as being a source of Unicode
character data. Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as %ENV),
or from literals and constants in the source text.

The C<bytes> pragma will always, regardless of platform, force byte
semantics in a particular lexical scope. See L<bytes>.

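The lexical scoping of C<bytes> can be illustrated with a short sketch.
It assumes an ASCII-based platform where the internal encoding is UTF-8;
the variable names are chosen here for illustration only.

```perl
use strict;
use warnings;

my $smiley = "\x{263A}";          # one character, U+263A

my $chars = length($smiley);      # character semantics

my $octets;
{
    use bytes;                    # byte semantics in this lexical scope only
    $octets = length($smiley);    # counts the bytes of the internal encoding
}

print "characters: $chars, bytes: $octets\n";
```

Outside the braces C<length()> goes back to counting characters, since
the pragma's effect ends with its enclosing block.
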
The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
Note that this pragma is only required while Perl defaults to byte
semantics; when character semantics become the default, this pragma
may become a no-op. See L<utf8>.

Unless explicitly stated, Perl operators use character semantics
for Unicode data and byte semantics for non-Unicode data.
The decision to use character semantics is made transparently. If
input data comes from a Unicode source--for example, if a character
encoding layer is added to a filehandle or a literal Unicode
string constant appears in a program--character semantics apply.
Otherwise, byte semantics are in effect. The C<bytes> pragma should
be used to force byte semantics on Unicode data.

If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will be created by
decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
old Unicode string used EBCDIC. This translation is done without
regard to the system's native 8-bit encoding. To change this for
systems with non-Latin-1 and non-EBCDIC native encodings, use the
C<encoding> pragma. See L<encoding>.

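The Latin-1 interpretation on concatenation can be observed directly.
This is only a sketch, assuming an ASCII-based platform and no
C<encoding> pragma in effect; the variable names are illustrative.

```perl
use strict;
use warnings;

my $byte_str = "\xE9";        # one byte; Latin-1 for E WITH ACUTE
my $uni_str  = "\x{263A}";    # one character above 255

my $joined = $byte_str . $uni_str;

# The byte 0xE9 is taken to be code point U+00E9, not the start of
# a UTF-8 sequence, so the result has exactly two characters.
printf "length=%d first=U+%04X second=U+%04X\n",
    length($joined),
    ord(substr($joined, 0, 1)),
    ord(substr($joined, 1, 1));
```
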
Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
logically just a number ranging from 0 to 2**31 or so. Larger
characters may encode into longer sequences of bytes internally, but
this internal detail is mostly hidden from Perl code.
See L<perluniintro> for more.

=head2 Effects of Character Semantics

Character semantics have the following effects:

=over 4

=item *

Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.

If you use a Unicode editor to edit your program, Unicode characters
may occur directly within the literal strings in one of the various
Unicode encodings (UTF-8, UTF-EBCDIC, UCS-2, etc.), but will be recognized
as such and converted to Perl's internal representation only if the
appropriate L<encoding> is specified.

Unicode characters can also be added to a string by using the
C<\x{...}> notation. The Unicode code for the desired character, in
hexadecimal, should be placed in the braces. For instance, a smiley
face is C<\x{263A}>. This encoding scheme only works for characters
with a code of 0x100 or above.

Additionally, if you

    use charnames ':full';

you can use the C<\N{...}> notation and put the official Unicode
character name within the braces, such as C<\N{WHITE SMILING FACE}>.

=item *

If an appropriate L<encoding> is specified, identifiers within the
Perl script may contain Unicode alphanumeric characters, including
ideographs. Perl does not currently attempt to canonicalize variable
names.

=item *

Regular expressions match characters instead of bytes. "." matches
a character instead of a byte. The C<\C> pattern is provided to force
a match of a single byte--a C<char> in C, hence C<\C>.

=item *

Character classes in regular expressions match characters instead of
bytes and match against the character properties specified in the
Unicode properties database. C<\w> can be used to match a Japanese
ideograph, for instance.

(However, and as a limitation of the current implementation, using
C<\w> or C<\W> I<inside> a C<[...]> character class will still match
with byte semantics.)

=item *

Named Unicode properties, scripts, and block ranges may be used like
character classes via the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".

See L</"Unicode Character Properties"> for more details.

You can define your own character properties and use them
in the regular expression with the C<\p{}> or C<\P{}> construct.

See L</"User-Defined Character Properties"> for more details.

=item *

The special pattern C<\X> matches any extended Unicode
sequence--"a combining character sequence" in Standardese--where the
first character is a base character and subsequent characters are mark
characters that apply to the base character. C<\X> is equivalent to
C<(?:\PM\pM*)>.

=item *

The C<tr///> operator translates characters instead of bytes. Note
that the C<tr///CU> functionality has been removed. For similar
functionality see C<pack('U0', ...)> and C<pack('C0', ...)>.

=item *

Case translation operators use the Unicode case translation tables
when character input is provided. Note that C<uc()>, or C<\U> in
interpolated strings, translates to uppercase, while C<ucfirst>,
or C<\u> in interpolated strings, translates to titlecase in languages
that make the distinction.

=item *

Most operators that deal with positions or lengths in a string will
automatically switch to using character positions, including
C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
C<sprintf()>, C<write()>, and C<length()>. An operator that
specifically does not switch is C<vec()>. Operators that really don't
care include operators that treat strings as a bucket of bits such as
C<sort()>, and operators dealing with filenames.

=item *

The C<pack()>/C<unpack()> letter C<C> does I<not> change, since it is often
used for byte-oriented formats. Again, think C<char> in the C language.

There is a new C<U> specifier that converts between Unicode characters
and code points. There is also a C<W> specifier that is the equivalent of
C<chr>/C<ord> and properly handles character values even if they are above 255.

=item *

The C<chr()> and C<ord()> functions work on characters, similar to
C<pack("W")> and C<unpack("W")>, I<not> C<pack("C")> and
C<unpack("C")>. C<pack("C")> and C<unpack("C")> are methods for
emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.
While these methods reveal the internal encoding of Unicode strings,
that is not something one normally needs to care about at all.

=item *

The bit string operators, C<& | ^ ~>, can operate on character data.
However, for backward compatibility, such as when using bit string
operations when characters are all less than 256 in ordinal value, one
should not use C<~> (the bit complement) with characters of both
values less than 256 and values greater than 256. Most importantly,
DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
will not hold. The reason for this mathematical I<faux pas> is that
the complement cannot return B<both> the 8-bit (byte-wide) bit
complement B<and> the full character-wide bit complement.

=item *

lc(), uc(), lcfirst(), and ucfirst() work for the following cases:

=over 8

=item *

the case mapping is from a single Unicode character to another
single Unicode character, or

=item *

the case mapping is from a single Unicode character to more
than one Unicode character.

=back

Things to do with locales (Lithuanian, Turkish, Azeri) do B<not> work
since Perl does not understand the concept of Unicode locales.

See the Unicode Technical Report #21, Case Mappings, for more details.

But you can also define your own mappings to be used in the lc(),
lcfirst(), uc(), and ucfirst() (or their string-inlined versions).

See L</"User-Defined Case Mappings"> for more details.

=back

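Several of the effects listed above can be checked in a few lines.
The following is a sketch rather than an exhaustive demonstration; the
sample characters and variable names are chosen here for illustration.

```perl
use strict;
use warnings;
use charnames ':full';

# \x{...} and \N{...} both yield a single character.
my $by_code = "\x{263A}";
my $by_name = "\N{WHITE SMILING FACE}";
my $same    = ($by_code eq $by_name) ? 1 : 0;

# Lengths and positions count characters, not bytes.
my $str = "a\x{263A}b";
my $len = length($str);
my $mid = substr($str, 1, 1);

# "." matches one character, and \X matches a base character together
# with the marks applied to it.
my $dot_ok   = ($by_code =~ /^.$/) ? 1 : 0;
my $combined = "e\x{301}";                # "e" + COMBINING ACUTE ACCENT
my $x_ok     = ($combined =~ /^\X$/) ? 1 : 0;

# tr/// translates characters: GREEK SMALL ALPHA to GREEK CAPITAL ALPHA.
(my $greek = "\x{3B1}") =~ tr/\x{3B1}/\x{391}/;

# uc() gives uppercase, ucfirst() gives titlecase; the two differ for
# U+01C6 (LATIN SMALL LETTER DZ WITH CARON).
my $upper = uc("\x{1C6}");
my $title = ucfirst("\x{1C6}");

# chr()/ord() and pack("W")/unpack("W") work on characters.
my $packed = pack("W", 0x263A);
my $code   = unpack("W", $by_code);

printf "len=%d upper=U+%04X title=U+%04X code=U+%04X\n",
    $len, ord($upper), ord($title), $code;
```
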
=over 4

=item *

And finally, C<scalar reverse()> reverses by character rather than by byte.

=back

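A brief sketch of character-wise reversal follows; the sample string is
an assumption of this example. The code points are printed in
hexadecimal so that no wide characters are written to the terminal.

```perl
use strict;
use warnings;

my $str      = "a\x{263A}b";
my $reversed = scalar reverse($str);

# The multi-byte internal encoding of U+263A stays intact; only
# the order of the characters changes.
my @codes = map { sprintf("U+%04X", ord($_)) } split(//, $reversed);
print join(" ", @codes), "\n";
```
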
=head2 Unicode Character Properties

Named Unicode properties, scripts, and block ranges may be used like
character classes via the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".

For instance, C<\p{Lu}> matches any character with the Unicode "Lu"
(Letter, uppercase) property, while C<\p{M}> matches any character
with an "M" (mark--accents and such) property. Braces are not
required for single letter properties, so C<\p{M}> is equivalent to
C<\pM>. Many predefined properties are available, such as
C<\p{Mirrored}> and C<\p{Tibetan}>.

The official Unicode script and block names have spaces and dashes as
separators, but for convenience you can use dashes, spaces, or
underbars, and case is unimportant. It is recommended, however, that
for consistency you use the following naming: the official Unicode
script, property, or block name (see below for the additional rules
that apply to block names) with whitespace and dashes removed, and the
words "uppercase-first-lowercase-rest". C<Latin-1 Supplement> thus
becomes C<Latin1Supplement>.

You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
(^) between the first brace and the property name: C<\p{^Tamil}> is
equal to C<\P{Tamil}>.

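The constructs just described can be tried out as follows. This is a
small sketch; the three sample characters are assumptions picked for
illustration.

```perl
use strict;
use warnings;

my $upper_a = "A";          # general category Lu
my $acute   = "\x{301}";    # COMBINING ACUTE ACCENT, a mark
my $tamil_a = "\x{B85}";    # TAMIL LETTER A

my $is_lu   = ($upper_a =~ /^\p{Lu}$/) ? 1 : 0;
my $m_brace = ($acute   =~ /^\p{M}$/)  ? 1 : 0;
my $m_bare  = ($acute   =~ /^\pM$/)    ? 1 : 0;   # same test, no braces

# \p{^Tamil} and \P{Tamil} are two spellings of the same negation,
# so neither matches a Tamil letter.
my $caret_neg = ($tamil_a =~ /\p{^Tamil}/) ? 1 : 0;
my $upper_neg = ($tamil_a =~ /\P{Tamil}/)  ? 1 : 0;
my $positive  = ($tamil_a =~ /\p{Tamil}/)  ? 1 : 0;

print "Lu=$is_lu M=$m_brace/$m_bare ",
      "negated=$caret_neg/$upper_neg positive=$positive\n";
```
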
B<NOTE: the properties, scripts, and blocks listed here are as of
Unicode 5.0.0 in July 2006.>

=over 4

=item General Category

Here are the basic Unicode General Category properties, followed by their
long form. You can use either; C<\p{Lu}> and C<\p{UppercaseLetter}>,
for instance, are identical.

    Short       Long

    L           Letter
    LC          CasedLetter
    Lu          UppercaseLetter
    Ll          LowercaseLetter
    Lt          TitlecaseLetter
    Lm          ModifierLetter
    Lo          OtherLetter

    M           Mark
    Mn          NonspacingMark
    Mc          SpacingMark
    Me          EnclosingMark

    N           Number
    Nd          DecimalNumber
    Nl          LetterNumber
    No          OtherNumber

    P           Punctuation
    Pc          ConnectorPunctuation
    Pd          DashPunctuation
    Ps          OpenPunctuation
    Pe          ClosePunctuation
    Pi          InitialPunctuation
                (may behave like Ps or Pe depending on usage)
    Pf          FinalPunctuation
                (may behave like Ps or Pe depending on usage)
    Po          OtherPunctuation

    S           Symbol
    Sm          MathSymbol
    Sc          CurrencySymbol
    Sk          ModifierSymbol
    So          OtherSymbol

    Z           Separator
    Zs          SpaceSeparator
    Zl          LineSeparator
    Zp          ParagraphSeparator

    C           Other
    Cc          Control
    Cf          Format
    Cs          Surrogate   (not usable)
    Co          PrivateUse
    Cn          Unassigned

Single-letter properties match all characters in any of the
two-letter sub-properties starting with the same letter.
C<LC> and C<L&> are special cases, which are aliases for the set of
C<Ll>, C<Lu>, and C<Lt>.

Because Perl hides the need for the user to understand the internal
representation of Unicode characters, there is no need to implement
the somewhat messy concept of surrogates. C<Cs> is therefore not
supported.

=item Bidirectional Character Types

Because scripts differ in their directionality--Hebrew is
written right to left, for example--Unicode supplies these properties in
the BidiClass class:

    Property    Meaning

    L           Left-to-Right
    LRE         Left-to-Right Embedding
    LRO         Left-to-Right Override
    R           Right-to-Left
    AL          Right-to-Left Arabic
    RLE         Right-to-Left Embedding
    RLO         Right-to-Left Override
    PDF         Pop Directional Format
    EN          European Number
    ES          European Number Separator
    ET          European Number Terminator
    AN          Arabic Number
    CS          Common Number Separator
    NSM         Non-Spacing Mark
    BN          Boundary Neutral
    B           Paragraph Separator
    S           Segment Separator
    WS          Whitespace
    ON          Other Neutrals

For example, C<\p{BidiClass:R}> matches characters that are normally
written right to left.

=item Scripts

The script names which can be used by C<\p{...}> and C<\P{...}>,
such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows:

    Arabic
    Armenian
    Balinese
    Bengali
    Bopomofo
    Braille
    Buginese
    Buhid
    CanadianAboriginal
    Cherokee
    Coptic
    Cuneiform
    Cypriot
    Cyrillic
    Deseret
    Devanagari
    Ethiopic
    Georgian
    Glagolitic
    Gothic
    Greek
    Gujarati
    Gurmukhi
    Han
    Hangul
    Hanunoo
    Hebrew
    Hiragana
    Inherited
    Kannada
    Katakana
    Kharoshthi
    Khmer
    Lao
    Latin
    Limbu
    LinearB
    Malayalam
    Mongolian
    Myanmar
    NewTaiLue
    Nko
    Ogham
    OldItalic
    OldPersian
    Oriya
    Osmanya
    PhagsPa
    Phoenician
    Runic
    Shavian
    Sinhala
    SylotiNagri
    Syriac
    Tagalog
    Tagbanwa
    TaiLe
    Tamil
    Telugu
    Thaana
    Thai
    Tibetan
    Tifinagh
    Ugaritic
    Yi

=item Extended property classes

Extended property classes can supplement the basic
properties, defined by the F<PropList> Unicode database:

    ASCIIHexDigit
    BidiControl
    Dash
    Deprecated
    Diacritic
    Extender
    HexDigit
    Hyphen
    Ideographic
    IDSBinaryOperator
    IDSTrinaryOperator
    JoinControl
    LogicalOrderException
    NoncharacterCodePoint
    OtherAlphabetic
    OtherDefaultIgnorableCodePoint
    OtherGraphemeExtend
    OtherIDStart
    OtherIDContinue
    OtherLowercase
    OtherMath
    OtherUppercase
    PatternSyntax
    PatternWhiteSpace
    QuotationMark
    Radical
    SoftDotted
    STerm
    TerminalPunctuation
    UnifiedIdeograph
    VariationSelector
    WhiteSpace

and there are further derived properties:

    Alphabetic  =  Lu + Ll + Lt + Lm + Lo + Nl + OtherAlphabetic
    Lowercase   =  Ll + OtherLowercase
    Uppercase   =  Lu + OtherUppercase
    Math        =  Sm + OtherMath

    IDStart     =  Lu + Ll + Lt + Lm + Lo + Nl + OtherIDStart
    IDContinue  =  IDStart + Mn + Mc + Nd + Pc + OtherIDContinue

    DefaultIgnorableCodePoint
                =  OtherDefaultIgnorableCodePoint
                   + Cf + Cc + Cs + Noncharacters + VariationSelector
                   - WhiteSpace - FFF9..FFFB (Annotation Characters)

    Any         =  Any code points (i.e. U+0000 to U+10FFFF)
    Assigned    =  Any non-Cn code points (i.e. synonym for \P{Cn})
    Unassigned  =  Synonym for \p{Cn}
    ASCII       =  ASCII (i.e. U+0000 to U+007F)

    Common      =  Any character (or unassigned code point)
                   not explicitly assigned to a script

=item Use of "Is" Prefix

For backward compatibility (with Perl 5.6), all properties mentioned
so far may have C<Is> prepended to their name, so C<\P{IsLu}>, for
example, is equal to C<\P{Lu}>.

=item Blocks

In addition to B<scripts>, Unicode also defines B<blocks> of
characters. The difference between scripts and blocks is that the
concept of scripts is closer to natural languages, while the concept
of blocks is more of an artificial grouping based on groups of 256
Unicode characters. For example, the C<Latin> script contains letters
from many blocks but does not contain all the characters from those
blocks. It does not, for example, contain digits, because digits are
shared across many scripts. Digits and similar groups, like
punctuation, are in a category called C<Common>.

For more about scripts, see the UAX#24 "Script Names":

    http://www.unicode.org/reports/tr24/

For more about blocks, see:

    http://www.unicode.org/Public/UNIDATA/Blocks.txt

Block names are given with the C<In> prefix. For example, the
Katakana block is referenced via C<\p{InKatakana}>. The C<In>
prefix may be omitted if there is no naming conflict with a script
or any other property, but it is recommended that C<In> always be used
for block tests to avoid confusion.

These block names are supported:

    InAegeanNumbers
    InAlphabeticPresentationForms
    InAncientGreekMusicalNotation
    InAncientGreekNumbers
    InArabic
    InArabicPresentationFormsA
    InArabicPresentationFormsB
    InArabicSupplement
    InArmenian
    InArrows
    InBalinese
    InBasicLatin
    InBengali
    InBlockElements
    InBopomofo
    InBopomofoExtended
    InBoxDrawing
    InBraillePatterns
    InBuginese
    InBuhid
    InByzantineMusicalSymbols
    InCJKCompatibility
    InCJKCompatibilityForms
    InCJKCompatibilityIdeographs
    InCJKCompatibilityIdeographsSupplement
    InCJKRadicalsSupplement
    InCJKStrokes
    InCJKSymbolsAndPunctuation
    InCJKUnifiedIdeographs
    InCJKUnifiedIdeographsExtensionA
    InCJKUnifiedIdeographsExtensionB
    InCherokee
    InCombiningDiacriticalMarks
    InCombiningDiacriticalMarksSupplement
    InCombiningDiacriticalMarksforSymbols
    InCombiningHalfMarks
    InControlPictures
    InCoptic
    InCountingRodNumerals
    InCuneiform
    InCuneiformNumbersAndPunctuation
    InCurrencySymbols
    InCypriotSyllabary
    InCyrillic
    InCyrillicSupplement
    InDeseret
    InDevanagari
    InDingbats
    InEnclosedAlphanumerics
    InEnclosedCJKLettersAndMonths
    InEthiopic
    InEthiopicExtended
    InEthiopicSupplement
    InGeneralPunctuation
    InGeometricShapes
    InGeorgian
    InGeorgianSupplement
    InGlagolitic
    InGothic
    InGreekExtended
    InGreekAndCoptic
    InGujarati
    InGurmukhi
    InHalfwidthAndFullwidthForms
    InHangulCompatibilityJamo
    InHangulJamo
    InHangulSyllables
    InHanunoo
    InHebrew
    InHighPrivateUseSurrogates
    InHighSurrogates
    InHiragana
    InIPAExtensions
    InIdeographicDescriptionCharacters
    InKanbun
    InKangxiRadicals
    InKannada
    InKatakana
    InKatakanaPhoneticExtensions
    InKharoshthi
    InKhmer
    InKhmerSymbols
    InLao
    InLatin1Supplement
    InLatinExtendedA
    InLatinExtendedAdditional
    InLatinExtendedB
    InLatinExtendedC
    InLatinExtendedD
    InLetterlikeSymbols
    InLimbu
    InLinearBIdeograms
    InLinearBSyllabary
    InLowSurrogates
    InMalayalam
    InMathematicalAlphanumericSymbols
    InMathematicalOperators
    InMiscellaneousMathematicalSymbolsA
    InMiscellaneousMathematicalSymbolsB
    InMiscellaneousSymbols
    InMiscellaneousSymbolsAndArrows
    InMiscellaneousTechnical
    InModifierToneLetters
    InMongolian
    InMusicalSymbols
    InMyanmar
    InNKo
    InNewTaiLue
    InNumberForms
    InOgham
    InOldItalic
    InOldPersian
    InOpticalCharacterRecognition
    InOriya
    InOsmanya
    InPhagspa
    InPhoenician
    InPhoneticExtensions
    InPhoneticExtensionsSupplement
    InPrivateUseArea
    InRunic
    InShavian
    InSinhala
    InSmallFormVariants
    InSpacingModifierLetters
    InSpecials
    InSuperscriptsAndSubscripts
    InSupplementalArrowsA
    InSupplementalArrowsB
    InSupplementalMathematicalOperators
    InSupplementalPunctuation
    InSupplementaryPrivateUseAreaA
    InSupplementaryPrivateUseAreaB
    InSylotiNagri
    InSyriac
    InTagalog
    InTagbanwa
    InTags
    InTaiLe
    InTaiXuanJingSymbols
    InTamil
    InTelugu
    InThaana
    InThai
    InTibetan
    InTifinagh
    InUgaritic
    InUnifiedCanadianAboriginalSyllabics
    InVariationSelectors
    InVariationSelectorsSupplement
    InVerticalForms
    InYiRadicals
    InYiSyllables
    InYijingHexagramSymbols

=back

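To tie the property families above together, here is a brief sketch
that exercises a General Category name in its short, long, and
C<Is>-prefixed forms, a bidirectional type, and the script/block
distinction. The sample characters are assumptions chosen for
illustration.

```perl
use strict;
use warnings;

# Short name, long name, and "Is" prefix select the same set.
my $short_ok = ("A" =~ /\p{Lu}/)              ? 1 : 0;
my $long_ok  = ("A" =~ /\p{UppercaseLetter}/) ? 1 : 0;
my $is_ok    = ("A" =~ /\p{IsLu}/)            ? 1 : 0;

# A single-letter category covers its two-letter sub-categories.
my $l_covers = ("A" =~ /\p{L}/ && "a" =~ /\p{L}/) ? 1 : 0;

# Bidirectional type: HEBREW LETTER ALEF is written right to left.
my $alef_rtl = ("\x{5D0}" =~ /\p{BidiClass:R}/) ? 1 : 0;

# Script versus block: the digit "5" sits in the Basic Latin block,
# but it is not part of the Latin script.
my $digit_in_block  = ("5" =~ /\p{InBasicLatin}/) ? 1 : 0;
my $digit_in_script = ("5" =~ /\p{Latin}/)        ? 1 : 0;
my $a_in_script     = ("A" =~ /\p{Latin}/)        ? 1 : 0;

print "block=$digit_in_block script=$digit_in_script\n";
```

Testing the digit against both C<\p{InBasicLatin}> and C<\p{Latin}>
shows why the C<In> prefix is worth keeping for block tests.
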
=head2 User-Defined Character Properties

You can define your own character properties by defining subroutines
whose names begin with "In" or "Is". The subroutines can be defined in
any package. The user-defined properties can be used in the regular
expression C<\p> and C<\P> constructs; if you are using a user-defined
property from a package other than the one you are in, you must specify
its package in the C<\p> or C<\P> construct.

    # assuming property IsForeign defined in Lang::
    package main;  # property package name required
    if ($txt =~ /\p{Lang::IsForeign}+/) { ... }

    package Lang;  # property package name not required
    if ($txt =~ /\p{IsForeign}+/) { ... }

Note that the effect is compile-time and immutable once defined.

The subroutines must return a specially-formatted string, with one
or more newline-separated lines. Each line must be one of the following:

=over 4

=item *

Two hexadecimal numbers separated by horizontal whitespace (space or
tab characters) denoting a range of Unicode code points to include.

=item *

Something to include, prefixed by "+": a built-in character
property (prefixed by "utf8::") or a user-defined character property,
to represent all the characters in that property; two hexadecimal code
points for a range; or a single hexadecimal code point.

=item *

Something to exclude, prefixed by "-": an existing character
property (prefixed by "utf8::") or a user-defined character property,
to represent all the characters in that property; two hexadecimal code
points for a range; or a single hexadecimal code point.

=item *

Something to negate, prefixed by "!": an existing character
property (prefixed by "utf8::") or a user-defined character property,
to represent all the characters in that property; two hexadecimal code
points for a range; or a single hexadecimal code point.

=item *

Something to intersect with, prefixed by "&": an existing character
property (prefixed by "utf8::") or a user-defined character property,
so that only the characters also in that property are kept; two
hexadecimal code points for a range; or a single hexadecimal code point.

=back

For example, to define a property that covers both the Japanese
syllabaries (hiragana and katakana), you can define

    sub InKana {
        return <<END;
    3040\t309F
    30A0\t30FF
    END
    }

Imagine that the here-doc end marker is at the beginning of the line.
Now you can use C<\p{InKana}> and C<\P{InKana}>.

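A runnable version of the range-based definition might look like the
sketch below. It uses the same two ranges, but writes the here-doc body
and its terminator in column one so the code works as typed; the test
characters are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Two tab-separated hexadecimal ranges: Hiragana and Katakana.
sub InKana {
    return <<"END";
3040\t309F
30A0\t30FF
END
}

my $hiragana_a = "\x{3042}";   # HIRAGANA LETTER A
my $katakana_a = "\x{30A2}";   # KATAKANA LETTER A
my $smiley     = "\x{263A}";   # outside both ranges

my $hira_ok  = ($hiragana_a =~ /^\p{InKana}$/) ? 1 : 0;
my $kata_ok  = ($katakana_a =~ /^\p{InKana}$/) ? 1 : 0;
my $other_ok = ($smiley     =~ /^\P{InKana}$/) ? 1 : 0;

print "hiragana=$hira_ok katakana=$kata_ok not-kana=$other_ok\n";
```
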
You could also have used the existing block property names:

    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    END
    }

Suppose you wanted to match only the allocated characters,
not the raw block ranges: in other words, you want to remove
the unassigned code points:

    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    -utf8::IsCn
    END
    }

The negation is useful for defining (surprise!) negated classes.

    sub InNotKana {
        return <<'END';
    !utf8::InHiragana
    -utf8::InKatakana
    +utf8::IsCn
    END
    }

Intersection is useful for getting the common characters matched by
two (or more) classes.

    sub InFooAndBar {
        return <<'END';
    +main::Foo
    &main::Bar
    END
    }

It's important to remember not to use "&" for the first set -- that
would be intersecting with nothing (resulting in an empty set).

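As a hedged illustration of "+" followed by "&", the sketch below
intersects two built-in properties instead of the placeholder
C<main::Foo> and C<main::Bar>. The property name C<InGreekUpper> is
hypothetical and exists only in this example.

```perl
use strict;
use warnings;

# Hypothetical property: Greek-script characters that are also
# uppercase letters. "+" seeds the set; "&" then narrows it.
sub InGreekUpper {
    return <<'END';
+utf8::IsGreek
&utf8::IsLu
END
}

my $capital_alpha = ("\x{391}"  =~ /\p{InGreekUpper}/) ? 1 : 0;
my $small_alpha   = ("\x{3B1}"  =~ /\p{InGreekUpper}/) ? 1 : 0;
my $smiley        = ("\x{263A}" =~ /\p{InGreekUpper}/) ? 1 : 0;

print "capital=$capital_alpha small=$small_alpha other=$smiley\n";
```
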
A final note on the user-defined property tests: they will be used
only if the scalar has been marked as having Unicode characters.
Old byte-style strings will not be affected.

=head2 User-Defined Case Mappings

You can also define your own mappings to be used in the lc(),
lcfirst(), uc(), and ucfirst() (or their string-inlined versions).
The principle is similar to that of user-defined character
properties: define subroutines in the C<main> package
with names like C<ToLower> (for lc() and lcfirst()), C<ToTitle> (for
the first character in ucfirst()), and C<ToUpper> (for uc(), and the
rest of the characters in ucfirst()).

The string returned by the subroutines must now consist of three
hexadecimal numbers separated by tabs: start of the source
range, end of the source range, and start of the destination range.
For example:

    sub ToUpper {
        return <<END;
    0061\t0063\t0041
    END
    }

defines a uc() mapping that causes only the characters "a", "b", and
"c" to be mapped to "A", "B", "C"; all other characters will remain
unchanged.

If there is no source range to speak of, that is, the mapping is from
a single character to another single character, leave the end of the
source range empty, but the two tab characters are still needed.
For example:

    sub ToLower {
        return <<END;
    0041\t\t0061
    END
    }

defines an lc() mapping that causes only "A" to be mapped to "a"; all
other characters will remain unchanged.

(For serious hackers only) If you want to introspect the default
mappings, you can find the data in the directory
C<$Config{privlib}>/F<unicore/To/>. The mapping data is returned as
the here-document, and the C<utf8::ToSpecFoo> are special exception
mappings derived from C<$Config{privlib}>/F<unicore/SpecialCasing.txt>.
The C<Digit> and C<Fold> mappings that one can see in the directory
are not directly user-accessible; one can use either the
C<Unicode::UCD> module, or just match case-insensitively (that's when
the C<Fold> mapping is used).

A final note on the user-defined case mappings: they will be used
only if the scalar has been marked as having Unicode characters.
Old byte-style strings will not be affected.

376d9008 916=head2 Character Encodings for Input and Output
8cbd9a7a 917
7221edc9 918See L<Encode>.
8cbd9a7a 919
c29a771d 920=head2 Unicode Regular Expression Support Level
776f8809 921
376d9008
JB
922The following list of Unicode support for regular expressions describes
923all the features currently supported. The references to "Level N"
8158862b
TS
924and the section numbers refer to the Unicode Technical Standard #18,
925"Unicode Regular Expressions", version 11, in May 2005.
776f8809
JH
926
927=over 4
928
929=item *
930
931Level 1 - Basic Unicode Support
932
8158862b
TS
933 RL1.1 Hex Notation - done [1]
934 RL1.2 Properties - done [2][3]
935 RL1.2a Compatibility Properties - done [4]
936 RL1.3 Subtraction and Intersection - MISSING [5]
937 RL1.4 Simple Word Boundaries - done [6]
938 RL1.5 Simple Loose Matches - done [7]
939 RL1.6 Line Boundaries - MISSING [8]
940 RL1.7 Supplementary Code Points - done [9]
941
942 [1] \x{...}
943 [2] \p{...} \P{...}
944 [3] supports not only minimal list (general category, scripts,
945 Alphabetic, Lowercase, Uppercase, WhiteSpace,
946 NoncharacterCodePoint, DefaultIgnorableCodePoint, Any,
947 ASCII, Assigned), but also bidirectional types, blocks, etc.
948 (see L</"Unicode Character Properties">)
949 [4] \d \D \s \S \w \W \X [:prop:] [:^prop:]
950 [5] can use regular expression look-ahead [a] or
951 user-defined character properties [b] to emulate set operations
952 [6] \b \B
953 [7] note that Perl does Full case-folding in matching, not Simple:
835863de 954 for example U+1F88 is equivalent with U+1F00 U+03B9,
e0f9d4a8 955 not with 1F80. This difference matters for certain Greek
376d9008
JB
956 capital letters with certain modifiers: the Full case-folding
957 decomposes the letter, while the Simple case-folding would map
e0f9d4a8 958 it to a single character.
8158862b
TS
959 [8] should do ^ and $ also on U+000B (\v in C), FF (\f), CR (\r),
960 CRLF (\r\n), NEL (U+0085), LS (U+2028), and PS (U+2029);
961 should also affect <>, $., and script line numbers;
962 should not split lines within CRLF [c] (i.e. there is no empty
963 line between \r and \n)
  [9] UTF-8/UTF-EBCDIC used in perl allows not only U+10000 to U+10FFFF
      but also beyond U+10FFFF [d]
7207e29d 966
237bad5b 967[a] You can mimic class subtraction using lookahead.
8158862b 968For example, what UTS#18 might write as
29bdacb8 969
dbe420b4
JH
970 [{Greek}-[{UNASSIGNED}]]
971
972in Perl can be written as:
973
1d81abf3
JH
974 (?!\p{Unassigned})\p{InGreekAndCoptic}
975 (?=\p{Assigned})\p{InGreekAndCoptic}
dbe420b4
JH
976
977But in this particular example, you probably really want
978
1bfb14c4 979 \p{GreekAndCoptic}
dbe420b4
JH
980
981which will match assigned characters known to be part of the Greek script.
29bdacb8 982
Also see the Unicode::Regex::Set module, which implements the full
UTS#18 grouping, intersection, union, and removal (subtraction) syntax.
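
The lookahead emulation described in note [a] can be tried out with a
short script. This is only a sketch: it spells the block with the explicit
C<\p{Block=...}> form rather than the C<In> prefix, the variable names are
made up for illustration, and it assumes that U+0378 is still unassigned
in the Unicode version shipped with your perl.

```perl
use strict;
use warnings;

# "Greek and Coptic block minus unassigned code points", emulated with
# a negative lookahead placed in front of the block property.
my $assigned_greek = qr/\A(?!\p{Unassigned})\p{Block=Greek_And_Coptic}\z/;

my $alpha   = chr(0x03B1);   # GREEK SMALL LETTER ALPHA, assigned
my $hole    = chr(0x0378);   # inside the block, assumed to be unassigned
my $latin_a = "a";           # assigned, but outside the block

my %matches = (
    alpha   => ($alpha   =~ $assigned_greek ? 1 : 0),
    hole    => ($hole    =~ $assigned_greek ? 1 : 0),
    latin_a => ($latin_a =~ $assigned_greek ? 1 : 0),
);

printf "%-8s %s\n", $_, $matches{$_} ? "matches" : "does not match"
    for sort keys %matches;
```

Only the assigned Greek letter should match. The lookahead rejects the
unassigned code point, and the block property rejects the Latin letter.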
985
986[b] '+' for union, '-' for removal (set-difference), '&' for intersection
987(see L</"User-Defined Character Properties">)
988
989[c] Try the C<:crlf> layer (see L<PerlIO>).
5ca1ac52 990
[d] Avoid C<use warnings 'utf8';> (or say C<no warnings 'utf8';>) to allow
U+FFFF (C<\x{FFFF}>).
237bad5b 993
776f8809
JH
994=item *
995
996Level 2 - Extended Unicode Support
997
8158862b
TS
998 RL2.1 Canonical Equivalents - MISSING [10][11]
999 RL2.2 Default Grapheme Clusters - MISSING [12][13]
1000 RL2.3 Default Word Boundaries - MISSING [14]
1001 RL2.4 Default Loose Matches - MISSING [15]
1002 RL2.5 Name Properties - MISSING [16]
1003 RL2.6 Wildcard Properties - MISSING
1004
1005 [10] see UAX#15 "Unicode Normalization Forms"
1006 [11] have Unicode::Normalize but not integrated to regexes
1007 [12] have \X but at this level . should equal that
1008 [13] UAX#29 "Text Boundaries" considers CRLF and Hangul syllable
1009 clusters as a single grapheme cluster.
1010 [14] see UAX#29, Word Boundaries
1011 [15] see UAX#21 "Case Mappings"
1012 [16] have \N{...} but neither compute names of CJK Ideographs
1013 and Hangul Syllables nor use a loose match [e]
1014
1015[e] C<\N{...}> allows namespaces (see L<charnames>).
776f8809
JH
1016
1017=item *
1018
8158862b
TS
1019Level 3 - Tailored Support
1020
1021 RL3.1 Tailored Punctuation - MISSING
1022 RL3.2 Tailored Grapheme Clusters - MISSING [17][18]
1023 RL3.3 Tailored Word Boundaries - MISSING
1024 RL3.4 Tailored Loose Matches - MISSING
1025 RL3.5 Tailored Ranges - MISSING
1026 RL3.6 Context Matching - MISSING [19]
1027 RL3.7 Incremental Matches - MISSING
1028 ( RL3.8 Unicode Set Sharing )
1029 RL3.9 Possible Match Sets - MISSING
1030 RL3.10 Folded Matching - MISSING [20]
1031 RL3.11 Submatchers - MISSING
1032
  [17] see UTS#10 "Unicode Collation Algorithm"
1034 [18] have Unicode::Collate but not integrated to regexes
1035 [19] have (?<=x) and (?=x), but look-aheads or look-behinds should see
1036 outside of the target substring
1037 [20] need insensitive matching for linguistic features other than case;
1038 for example, hiragana to katakana, wide and narrow, simplified Han
1039 to traditional Han (see UTR#30 "Character Foldings")
776f8809
JH
1040
1041=back
1042
c349b1b9
JH
1043=head2 Unicode Encodings
1044
376d9008
JB
1045Unicode characters are assigned to I<code points>, which are abstract
1046numbers. To use these numbers, various encodings are needed.
c349b1b9
JH
1047
1048=over 4
1049
c29a771d 1050=item *
5cb3728c
RB
1051
1052UTF-8
c349b1b9 1053
3e4dbfed 1054UTF-8 is a variable-length (1 to 6 bytes, current character allocations
376d9008
JB
1055require 4 bytes), byte-order independent encoding. For ASCII (and we
1056really do mean 7-bit ASCII, not another 8-bit encoding), UTF-8 is
1057transparent.
c349b1b9 1058
8c007b5a 1059The following table is from Unicode 3.2.
05632f9a
JH
1060
1061 Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte
1062
8c007b5a
JH
1063 U+0000..U+007F 00..7F
1064 U+0080..U+07FF C2..DF 80..BF
ec90690f
TS
1065 U+0800..U+0FFF E0 A0..BF 80..BF
1066 U+1000..U+CFFF E1..EC 80..BF 80..BF
1067 U+D000..U+D7FF ED 80..9F 80..BF
8c007b5a 1068 U+D800..U+DFFF ******* ill-formed *******
ec90690f 1069 U+E000..U+FFFF EE..EF 80..BF 80..BF
05632f9a
JH
1070 U+10000..U+3FFFF F0 90..BF 80..BF 80..BF
1071 U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF
1072 U+100000..U+10FFFF F4 80..8F 80..BF 80..BF
1073
Note the C<A0..BF> in C<U+0800..U+0FFF>, the C<80..9F> in
C<U+D000..U+D7FF>, the C<90..BF> in C<U+10000..U+3FFFF>, and the
C<80..8F> in C<U+100000..U+10FFFF>. The "gaps" are caused by legal
UTF-8 avoiding non-shortest encodings: it is technically possible to
UTF-8-encode a single code point in different ways, but that is
explicitly forbidden, and the shortest possible encoding should always
be used. So that's what Perl does.
37361303 1081
376d9008 1082Another way to look at it is via bits:
05632f9a
JH
1083
1084 Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte
1085
1086 0aaaaaaa 0aaaaaaa
1087 00000bbbbbaaaaaa 110bbbbb 10aaaaaa
1088 ccccbbbbbbaaaaaa 1110cccc 10bbbbbb 10aaaaaa
1089 00000dddccccccbbbbbbaaaaaa 11110ddd 10cccccc 10bbbbbb 10aaaaaa
1090
As you can see, the continuation bytes all begin with C<10>, and the
leading bits of the start byte tell how many bytes there are in the
encoded character.
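
The bit layout in the table above can be followed by hand. The sketch
below is illustrative only: C<utf8_bytes>, C<hex_bytes> and C<perl_bytes>
are made-up helper names, the encoder stops at four bytes (U+10FFFF), and
it does not reject surrogates. It cross-checks its result against Perl's
own C<utf8::encode()>.

```perl
use strict;
use warnings;

# Encode one code point (up to U+10FFFF) into UTF-8 byte values by hand.
# Note: surrogates (U+D800..U+DFFF) are not rejected in this sketch.
sub utf8_bytes {
    my ($cp) = @_;
    return ($cp) if $cp < 0x80;                       # 0aaaaaaa
    return (0xC0 | ($cp >> 6),                        # 110bbbbb
            0x80 | ($cp & 0x3F)) if $cp < 0x800;      # 10aaaaaa
    return (0xE0 | ($cp >> 12),                       # 1110cccc
            0x80 | (($cp >> 6) & 0x3F),               # 10bbbbbb
            0x80 | ($cp & 0x3F)) if $cp < 0x10000;    # 10aaaaaa
    return (0xF0 | ($cp >> 18),                       # 11110ddd
            0x80 | (($cp >> 12) & 0x3F),              # 10cccccc
            0x80 | (($cp >> 6) & 0x3F),               # 10bbbbbb
            0x80 | ($cp & 0x3F));                     # 10aaaaaa
}

# Format a list of byte values as two-digit hex.
sub hex_bytes { return join " ", map { sprintf "%02X", $_ } @_ }

# The same code point encoded by Perl itself, for comparison.
sub perl_bytes {
    my $s = chr shift;
    utf8::encode($s);
    return unpack "C*", $s;
}

for my $cp (0x41, 0xE9, 0x20AC, 0x1F600) {
    printf "U+%04X  by hand: %-12s perl: %s\n",
        $cp, hex_bytes(utf8_bytes($cp)), hex_bytes(perl_bytes($cp));
}
```

For U+20AC, for instance, both columns should read C<E2 82 AC>.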
1094
c29a771d 1095=item *
5cb3728c
RB
1096
1097UTF-EBCDIC
dbe420b4 1098
376d9008 1099Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.
dbe420b4 1100
c29a771d 1101=item *
5cb3728c 1102
1e54db1a 1103UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)
c349b1b9 1104
The following items are mostly for reference and general Unicode
knowledge; Perl doesn't use these constructs internally.
dbe420b4 1107
UTF-16 is a 2 or 4 byte encoding. The Unicode code points
C<U+0000..U+FFFF> are stored in a single 16-bit unit, and the code
points C<U+10000..U+10FFFF> in two 16-bit units. The latter case
uses I<surrogates>, the first 16-bit unit being the I<high
surrogate>, and the second being the I<low surrogate>.
1113
376d9008 1114Surrogates are code points set aside to encode the C<U+10000..U+10FFFF>
c349b1b9 1115range of Unicode code points in pairs of 16-bit units. The I<high
376d9008
JB
1116surrogates> are the range C<U+D800..U+DBFF>, and the I<low surrogates>
1117are the range C<U+DC00..U+DFFF>. The surrogate encoding is
c349b1b9
JH
1118
    $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
    $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
1121
1122and the decoding is
1123
1a3fa709 1124 $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
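
As a worked example, here is the surrogate arithmetic applied to
U+1D11E (MUSICAL SYMBOL G CLEF). This is a sketch, not code from the Perl
core: C<to_surrogates> and C<from_surrogates> are hypothetical helper
names, and the encoding step uses shifts and masks, which give the same
result as integer division and remainder by 0x400.

```perl
use strict;
use warnings;

# Split a supplementary code point (U+10000..U+10FFFF) into a pair.
sub to_surrogates {
    my ($uni) = @_;
    my $offset = $uni - 0x10000;
    my $hi = 0xD800 + ($offset >> 10);      # top 10 bits of the offset
    my $lo = 0xDC00 + ($offset & 0x3FF);    # bottom 10 bits of the offset
    return ($hi, $lo);
}

# Recombine a high/low surrogate pair into the original code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
}

my ($hi, $lo) = to_surrogates(0x1D11E);
printf "U+1D11E -> %04X %04X\n", $hi, $lo;          # D834 DD1E
printf "and back -> U+%X\n", from_surrogates($hi, $lo);
```

The offset is 0xD11E, whose top ten bits are 0x34 and whose bottom ten
bits are 0x11E, hence the pair C<D834 DD1E>.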
c349b1b9 1125
feda178f 1126If you try to generate surrogates (for example by using chr()), you
376d9008
JB
1127will get a warning if warnings are turned on, because those code
1128points are not valid for a Unicode character.
9466bab6 1129
376d9008 1130Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16
c349b1b9 1131itself can be used for in-memory computations, but if storage or
376d9008
JB
1132transfer is required either UTF-16BE (big-endian) or UTF-16LE
1133(little-endian) encodings must be chosen.
c349b1b9
JH
1134
1135This introduces another problem: what if you just know that your data
376d9008
JB
1136is UTF-16, but you don't know which endianness? Byte Order Marks, or
1137BOMs, are a solution to this. A special character has been reserved
86bbd6d1 1138in Unicode to function as a byte order marker: the character with the
376d9008 1139code point C<U+FEFF> is the BOM.
042da322 1140
c349b1b9 1141The trick is that if you read a BOM, you will know the byte order,
376d9008
JB
1142since if it was written on a big-endian platform, you will read the
1143bytes C<0xFE 0xFF>, but if it was written on a little-endian platform,
1144you will read the bytes C<0xFF 0xFE>. (And if the originating platform
1145was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.)
042da322 1146
The way this trick works is that the character with the code point
C<U+FFFE> is guaranteed not to be a valid Unicode character, so the
sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in
little-endian format" and cannot be "C<U+FFFE>, represented in big-endian
format".
c349b1b9 1152
c29a771d 1153=item *
5cb3728c 1154
1e54db1a 1155UTF-32, UTF-32BE, UTF-32LE
c349b1b9
JH
1156
The UTF-32 family is pretty much like the UTF-16 family, except that
the units are 32-bit, and therefore the surrogate scheme is not
needed. The BOM signatures will be C<0x00 0x00 0xFE 0xFF> for BE and
C<0xFF 0xFE 0x00 0x00> for LE.
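
The BOM signatures listed so far can be checked against the first bytes
of a buffer. The following is a minimal sketch: C<sniff_bom> is a
hypothetical helper, not part of Perl or Encode. Because the UTF-32LE
signature begins with the UTF-16LE one, the longer signatures have to be
tried first. A consequence is that UTF-16LE data whose first character is
U+0000 would be misreported as UTF-32LE.

```perl
use strict;
use warnings;

# Guess an encoding from a leading BOM. Returns the encoding name and the
# BOM length in bytes, or an empty list when no BOM is recognised.
sub sniff_bom {
    my ($bytes) = @_;
    # UTF-32LE must be tested before UTF-16LE: both start with FF FE.
    my @signatures = (
        [ "UTF-32BE", "\x00\x00\xFE\xFF" ],
        [ "UTF-32LE", "\xFF\xFE\x00\x00" ],
        [ "UTF-8",    "\xEF\xBB\xBF"     ],
        [ "UTF-16BE", "\xFE\xFF"         ],
        [ "UTF-16LE", "\xFF\xFE"         ],
    );
    for my $sig (@signatures) {
        my ($name, $bom) = @$sig;
        return ($name, length $bom)
            if substr($bytes, 0, length $bom) eq $bom;
    }
    return;
}

# "hi" in UTF-16LE, preceded by a BOM
my ($enc, $skip) = sniff_bom("\xFF\xFEh\x00i\x00");
print "$enc, skip $skip bytes\n";
```

Data without a BOM yields an empty list, in which case the encoding has
to be known by some other means.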
c349b1b9 1161
c29a771d 1162=item *
5cb3728c
RB
1163
1164UCS-2, UCS-4
c349b1b9 1165
86bbd6d1 1166Encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit
376d9008 1167encoding. Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
339cfa0e
JH
1168because it does not use surrogates. UCS-4 is a 32-bit encoding,
1169functionally identical to UTF-32.
c349b1b9 1170
c29a771d 1171=item *
5cb3728c
RB
1172
1173UTF-7
c349b1b9 1174
376d9008
JB
1175A seven-bit safe (non-eight-bit) encoding, which is useful if the
1176transport or storage is not eight-bit safe. Defined by RFC 2152.
c349b1b9 1177
95a1a48b
JH
1178=back
1179
0d7c09bb
JH
1180=head2 Security Implications of Unicode
1181
1182=over 4
1183
1184=item *
1185
1186Malformed UTF-8
bf0fa0b2
JH
1187
1188Unfortunately, the specification of UTF-8 leaves some room for
1189interpretation of how many bytes of encoded output one should generate
376d9008
JB
1190from one input Unicode character. Strictly speaking, the shortest
1191possible sequence of UTF-8 bytes should be generated,
1192because otherwise there is potential for an input buffer overflow at
feda178f 1193the receiving end of a UTF-8 connection. Perl always generates the
376d9008
JB
1194shortest length UTF-8, and with warnings on Perl will warn about
1195non-shortest length UTF-8 along with other malformations, such as the
1196surrogates, which are not real Unicode code points.
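
To illustrate the point about non-shortest forms, the sketch below feeds
a two-byte "overlong" encoding of C</> (U+002F) to a strict decoder. It
assumes the Encode module with its strict C<"UTF-8"> encoding and the
C<Encode::FB_CROAK> check value. C<strict_utf8_ok> is a made-up helper
name.

```perl
use strict;
use warnings;
use Encode ();

# Try a strict decode; return 1 if the octets are accepted, 0 if not.
sub strict_utf8_ok {
    my ($octets) = @_;    # a copy: decode() with a check may modify its input
    my $ok = eval { Encode::decode("UTF-8", $octets, Encode::FB_CROAK); 1 };
    return $ok ? 1 : 0;
}

my $slash    = "\x2F";        # "/" in its shortest (one-byte) form
my $overlong = "\xC0\xAF";    # the same code point padded out to two bytes

printf "%-10s %s\n", "shortest:", strict_utf8_ok($slash)    ? "accepted" : "rejected";
printf "%-10s %s\n", "overlong:", strict_utf8_ok($overlong) ? "accepted" : "rejected";
```

A decoder that accepted the second form would let a C</> slip past any
check that only looks for the byte C<0x2F>, which is why such sequences
are treated as malformed.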
bf0fa0b2 1197
0d7c09bb
JH
1198=item *
1199
Regular expressions behave slightly differently between byte data and
character (Unicode) data. For example, the "word character" character
class C<\w> will work differently depending on whether the data is
eight-bit bytes or Unicode.
0d7c09bb 1204
376d9008
JB
1205In the first case, the set of C<\w> characters is either small--the
1206default set of alphabetic characters, digits, and the "_"--or, if you
0d7c09bb
JH
1207are using a locale (see L<perllocale>), the C<\w> might contain a few
1208more letters according to your language and country.
1209
376d9008 1210In the second case, the C<\w> set of characters is much, much larger.
1bfb14c4
JH
1211Most importantly, even in the set of the first 256 characters, it will
1212probably match different characters: unlike most locales, which are
1213specific to a language and country pair, Unicode classifies all the
1214characters that are letters I<somewhere> as C<\w>. For example, your
1215locale might not think that LATIN SMALL LETTER ETH is a letter (unless
1216you happen to speak Icelandic), but Unicode does.
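
The difference can be seen with LATIN SMALL LETTER ETH itself. The sketch
below assumes an ASCII-based platform, no C<use locale>, and none of the
pragmas or regex modifiers that later perls offer to force one
interpretation. It stores the same character first as a byte string and
then, after C<utf8::upgrade()>, as character data.

```perl
use strict;
use warnings;

my $eth = "\xF0";                      # LATIN SMALL LETTER ETH, one byte
my $as_bytes = ($eth =~ /\w/) ? 1 : 0;

utf8::upgrade($eth);                   # same character, now stored as UTF-8
my $as_chars = ($eth =~ /\w/) ? 1 : 0;

print "byte string:      \\w ",
    ($as_bytes ? "matches" : "does not match"), "\n";
print "character string: \\w ",
    ($as_chars ? "matches" : "does not match"), "\n";
```

Under the assumptions above, only the second match is expected to
succeed, even though both strings compare equal with C<eq>.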
0d7c09bb 1217
376d9008 1218As discussed elsewhere, Perl has one foot (two hooves?) planted in
1bfb14c4
JH
1219each of two worlds: the old world of bytes and the new world of
1220characters, upgrading from bytes to characters when necessary.
376d9008
JB
1221If your legacy code does not explicitly use Unicode, no automatic
1222switch-over to characters should happen. Characters shouldn't get
1bfb14c4
JH
1223downgraded to bytes, either. It is possible to accidentally mix bytes
1224and characters, however (see L<perluniintro>), in which case C<\w> in
1225regular expressions might start behaving differently. Review your
1226code. Use warnings and the C<strict> pragma.
0d7c09bb
JH
1227
1228=back
1229
c349b1b9
JH
1230=head2 Unicode in Perl on EBCDIC
1231
376d9008
JB
1232The way Unicode is handled on EBCDIC platforms is still
1233experimental. On such platforms, references to UTF-8 encoding in this
1234document and elsewhere should be read as meaning the UTF-EBCDIC
1235specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues
c349b1b9 1236are specifically discussed. There is no C<utfebcdic> pragma or
376d9008 1237":utfebcdic" layer; rather, "utf8" and ":utf8" are reused to mean
86bbd6d1
PN
1238the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic>
1239for more discussion of the issues.
c349b1b9 1240
b310b053
JH
1241=head2 Locales
1242
4616122b 1243Usually locale settings and Unicode do not affect each other, but
b310b053
JH
1244there are a couple of exceptions:
1245
1246=over 4
1247
1248=item *
1249
8aa8f774
JH
1250You can enable automatic UTF-8-ification of your standard file
1251handles, default C<open()> layer, and C<@ARGV> by using either
1252the C<-C> command line switch or the C<PERL_UNICODE> environment
1253variable, see L<perlrun> for the documentation of the C<-C> switch.
b310b053
JH
1254
1255=item *
1256
376d9008
JB
1257Perl tries really hard to work both with Unicode and the old
1258byte-oriented world. Most often this is nice, but sometimes Perl's
1259straddling of the proverbial fence causes problems.
b310b053
JH
1260
1261=back
1262
1aad1664
JH
1263=head2 When Unicode Does Not Happen
1264
While Perl does have extensive ways to input and output in Unicode,
and a few other 'entry points' like @ARGV that can be interpreted
as Unicode (UTF-8), there still are many places where Unicode (in some
encoding or another) could be given as arguments or received as
results, or both, but it is not.
1270
6cd4dd6c
JH
1271The following are such interfaces. For all of these interfaces Perl
1272currently (as of 5.8.3) simply assumes byte strings both as arguments
1273and results, or UTF-8 strings if the C<encoding> pragma has been used.
1aad1664
JH
1274
One reason why Perl does not attempt to resolve the role of Unicode in
these cases is that the answers are highly dependent on the operating
system and the file system(s). For example, whether filenames can be
in Unicode, and in exactly what kind of encoding, is not exactly a
portable concept. Similarly for qx and system: how well will the
'command line interface' (and which of them?) handle Unicode?
1281
1282=over 4
1283
557a2462
RB
1284=item *
1285
254c2b64 1286chdir, chmod, chown, chroot, exec, link, lstat, mkdir,
1e8e8236 1287rename, rmdir, stat, symlink, truncate, unlink, utime, -X
557a2462
RB
1288
1289=item *
1290
1291%ENV
1292
1293=item *
1294
1295glob (aka the <*>)
1296
1297=item *
1aad1664 1298
557a2462 1299open, opendir, sysopen
1aad1664 1300
557a2462 1301=item *
1aad1664 1302
557a2462 1303qx (aka the backtick operator), system
1aad1664 1304
557a2462 1305=item *
1aad1664 1306
557a2462 1307readdir, readlink
1aad1664
JH
1308
1309=back
1310
1311=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)
1312
1313Sometimes (see L</"When Unicode Does Not Happen">) there are
1314situations where you simply need to force Perl to believe that a byte
1315string is UTF-8, or vice versa. The low-level calls
1316utf8::upgrade($bytestring) and utf8::downgrade($utf8string) are
1317the answers.
1318
Do not use them without careful thought, though: Perl may easily get
very confused, angry, or even crash, if you suddenly change the 'nature'
of a scalar like that. Be especially careful if you use
utf8::upgrade(): not every byte string is valid UTF-8.
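
A small sketch of what the two calls do to a scalar follows. It relies on
C<utf8::is_utf8()> to inspect the flag and on the optional second
(FAIL_OK) argument of C<utf8::downgrade()>, both of which are assumptions
about the perl you are running. The variable names are illustrative.

```perl
use strict;
use warnings;

my $str = "caf\xE9";                           # 4 characters, 4 bytes
my $flag_before = utf8::is_utf8($str) ? 1 : 0;

utf8::upgrade($str);                           # switch the internal format
my $flag_after = utf8::is_utf8($str) ? 1 : 0;
my $chars  = length $str;                      # counted in characters
my $octets = do { use bytes; length $str };    # counted in internal octets
my $same   = ($str eq "caf\xE9") ? 1 : 0;      # the string value itself

# A character above 255 cannot be stored one byte per character.
# With a true second argument, downgrade returns false instead of dying.
my $wide = "\x{263A}";
my $downgraded = utf8::downgrade($wide, 1) ? 1 : 0;

print "flag: $flag_before -> $flag_after, chars: $chars, octets: $octets\n";
print "value unchanged: $same, wide downgraded: $downgraded\n";
```

The flag and the internal octet count change, while the character count
and the string value do not.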
1323
95a1a48b
JH
1324=head2 Using Unicode in XS
1325
3a2263fe
RGS
1326If you want to handle Perl Unicode in XS extensions, you may find the
1327following C APIs useful. See also L<perlguts/"Unicode Support"> for an
1328explanation about Unicode at the XS level, and L<perlapi> for the API
1329details.
95a1a48b
JH
1330
1331=over 4
1332
1333=item *
1334
C<DO_UTF8(sv)> returns true if the C<UTF8> flag is on and the bytes
pragma is not in effect. C<SvUTF8(sv)> returns true if the C<UTF8>
flag is on; the bytes pragma is ignored. The C<UTF8> flag being on
does B<not> mean that there are any characters of code points greater
than 255 (or 127) in the scalar or that there are even any characters
in the scalar. What the C<UTF8> flag means is that the sequence of
octets in the representation of the scalar is the sequence of UTF-8
encoded code points of the characters of a string. The C<UTF8> flag
being off means that each octet in this representation encodes a
single character with code point 0..255 within the string. Perl's
Unicode model is not to use UTF-8 until it is absolutely necessary.
95a1a48b
JH
1346
1347=item *
1348
fb9cc174 1349C<uvuni_to_utf8(buf, chr)> writes a Unicode character code point into
1bfb14c4 1350a buffer encoding the code point as UTF-8, and returns a pointer
95a1a48b
JH
1351pointing after the UTF-8 bytes.
1352
1353=item *
1354
376d9008
JB
1355C<utf8_to_uvuni(buf, lenp)> reads UTF-8 encoded bytes from a buffer and
1356returns the Unicode character code point and, optionally, the length of
1357the UTF-8 byte sequence.
95a1a48b
JH
1358
1359=item *
1360
376d9008
JB
1361C<utf8_length(start, end)> returns the length of the UTF-8 encoded buffer
1362in characters. C<sv_len_utf8(sv)> returns the length of the UTF-8 encoded
95a1a48b
JH
1363scalar.
1364
1365=item *
1366
376d9008
JB
1367C<sv_utf8_upgrade(sv)> converts the string of the scalar to its UTF-8
1368encoded form. C<sv_utf8_downgrade(sv)> does the opposite, if
1369possible. C<sv_utf8_encode(sv)> is like sv_utf8_upgrade except that
1370it does not set the C<UTF8> flag. C<sv_utf8_decode()> does the
1371opposite of C<sv_utf8_encode()>. Note that none of these are to be
1372used as general-purpose encoding or decoding interfaces: C<use Encode>
1373for that. C<sv_utf8_upgrade()> is affected by the encoding pragma
1374but C<sv_utf8_downgrade()> is not (since the encoding pragma is
1375designed to be a one-way street).
95a1a48b
JH
1376
1377=item *
1378
376d9008 1379C<is_utf8_char(s)> returns true if the pointer points to a valid UTF-8
90f968e0 1380character.
95a1a48b
JH
1381
1382=item *
1383
376d9008 1384C<is_utf8_string(buf, len)> returns true if C<len> bytes of the buffer
95a1a48b
JH
1385are valid UTF-8.
1386
1387=item *
1388
376d9008
JB
1389C<UTF8SKIP(buf)> will return the number of bytes in the UTF-8 encoded
1390character in the buffer. C<UNISKIP(chr)> will return the number of bytes
1391required to UTF-8-encode the Unicode character code point. C<UTF8SKIP()>
90f968e0 1392is useful for example for iterating over the characters of a UTF-8
376d9008 1393encoded buffer; C<UNISKIP()> is useful, for example, in computing
90f968e0 1394the size required for a UTF-8 encoded buffer.
95a1a48b
JH
1395
1396=item *
1397
376d9008 1398C<utf8_distance(a, b)> will tell the distance in characters between the
95a1a48b
JH
1399two pointers pointing to the same UTF-8 encoded buffer.
1400
1401=item *
1402
C<utf8_hop(s, off)> will return a pointer to a UTF-8 encoded buffer
that is C<off> (positive or negative) Unicode characters displaced
from the UTF-8 buffer C<s>. Be careful not to overstep the buffer:
C<utf8_hop()> will merrily run off the end or the beginning of the
buffer if told to do so.
95a1a48b 1408
d2cc3551
JH
1409=item *
1410
376d9008
JB
1411C<pv_uni_display(dsv, spv, len, pvlim, flags)> and
1412C<sv_uni_display(dsv, ssv, pvlim, flags)> are useful for debugging the
1413output of Unicode strings and scalars. By default they are useful
1414only for debugging--they display B<all> characters as hexadecimal code
1bfb14c4
JH
1415points--but with the flags C<UNI_DISPLAY_ISPRINT>,
1416C<UNI_DISPLAY_BACKSLASH>, and C<UNI_DISPLAY_QQ> you can make the
1417output more readable.
d2cc3551
JH
1418
1419=item *
1420
C<ibcmp_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to
compare two strings case-insensitively in Unicode. For case-sensitive
comparisons you can just use C<memEQ()> and C<memNE()> as usual.
d2cc3551 1424
c349b1b9
JH
1425=back
1426
95a1a48b
JH
1427For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h>
1428in the Perl source code distribution.
1429
c29a771d
JH
1430=head1 BUGS
1431
376d9008 1432=head2 Interaction with Locales
7eabb34d 1433
376d9008
JB
1434Use of locales with Unicode data may lead to odd results. Currently,
1435Perl attempts to attach 8-bit locale info to characters in the range
14360..255, but this technique is demonstrably incorrect for locales that
1437use characters above that range when mapped into Unicode. Perl's
1438Unicode support will also tend to run slower. Use of locales with
1439Unicode is discouraged.
c29a771d 1440
376d9008 1441=head2 Interaction with Extensions
7eabb34d 1442
376d9008 1443When Perl exchanges data with an extension, the extension should be
7eabb34d 1444able to understand the UTF-8 flag and act accordingly. If the
376d9008
JB
1445extension doesn't know about the flag, it's likely that the extension
1446will return incorrectly-flagged data.
7eabb34d
A
1447
1448So if you're working with Unicode data, consult the documentation of
1449every module you're using if there are any issues with Unicode data
1450exchange. If the documentation does not talk about Unicode at all,
a73d23f6 1451suspect the worst and probably look at the source to learn how the
376d9008 1452module is implemented. Modules written completely in Perl shouldn't
a73d23f6
RGS
1453cause problems. Modules that directly or indirectly access code written
1454in other programming languages are at risk.
7eabb34d 1455
376d9008 1456For affected functions, the simple strategy to avoid data corruption is
7eabb34d 1457to always make the encoding of the exchanged data explicit. Choose an
376d9008 1458encoding that you know the extension can handle. Convert arguments passed
7eabb34d
A
1459to the extensions to that encoding and convert results back from that
1460encoding. Write wrapper functions that do the conversions for you, so
1461you can later change the functions when the extension catches up.
1462
376d9008 1463To provide an example, let's say the popular Foo::Bar::escape_html
7eabb34d
A
1464function doesn't deal with Unicode data yet. The wrapper function
1465would convert the argument to raw UTF-8 and convert the result back to
376d9008 1466Perl's internal representation like so:
7eabb34d
A
1467
1468 sub my_escape_html ($) {
1469 my($what) = shift;
1470 return unless defined $what;
1471 Encode::decode_utf8(Foo::Bar::escape_html(Encode::encode_utf8($what)));
1472 }
1473
1474Sometimes, when the extension does not convert data but just stores
1475and retrieves them, you will be in a position to use the otherwise
1476dangerous Encode::_utf8_on() function. Let's say the popular
66b79f27 1477C<Foo::Bar> extension, written in C, provides a C<param> method that
7eabb34d
A
1478lets you store and retrieve data according to these prototypes:
1479
1480 $self->param($name, $value); # set a scalar
1481 $value = $self->param($name); # retrieve a scalar
1482
1483If it does not yet provide support for any encoding, one could write a
1484derived class with such a C<param> method:
1485
    sub param {
        my($self,$name,$value) = @_;
        utf8::upgrade($name);       # make sure it is UTF-8 encoded
        if (defined $value) {
            utf8::upgrade($value);  # make sure it is UTF-8 encoded
            return $self->SUPER::param($name,$value);
        } else {
            my $ret = $self->SUPER::param($name);
            Encode::_utf8_on($ret); # we know, it is UTF-8 encoded
            return $ret;
        }
    }
1498
Some extensions provide filters on data entry/exit points, such as
DB_File::filter_store_key and family. Look out for such filters in
the documentation of your extensions; they can make the transition to
Unicode data much easier.
1503
376d9008 1504=head2 Speed
7eabb34d 1505
c29a771d 1506Some functions are slower when working on UTF-8 encoded strings than
574c8022 1507on byte encoded strings. All functions that need to hop over
7c17141f
JH
1508characters such as length(), substr() or index(), or matching regular
1509expressions can work B<much> faster when the underlying data are
1510byte-encoded.
1511
In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
a caching scheme was introduced which will hopefully make the slowness
somewhat less spectacular, at least for some operations. In general,
operations with UTF-8 encoded strings are still slower. As an example,
the Unicode properties (character classes) like C<\p{Nd}> are known to
be quite a bit slower (5-20 times) than their simpler counterparts
like C<\d> (then again, there are 268 Unicode characters matching C<Nd>
compared with the 10 ASCII characters matching C<\d>).
666f95b9 1520
c8d992ba
A
1521=head2 Porting code from perl-5.6.X
1522
1523Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer
1524was required to use the C<utf8> pragma to declare that a given scope
1525expected to deal with Unicode data and had to make sure that only
1526Unicode data were reaching that scope. If you have code that is
1527working with 5.6, you will need some of the following adjustments to
1528your code. The examples are written such that the code will continue
1529to work under 5.6, so you should be safe to try them out.
1530
1531=over 4
1532
1533=item *
1534
1535A filehandle that should read or write UTF-8
1536
1537 if ($] > 5.007) {
1538 binmode $fh, ":utf8";
1539 }
1540
1541=item *
1542
1543A scalar that is going to be passed to some extension
1544
1545Be it Compress::Zlib, Apache::Request or any extension that has no
1546mention of Unicode in the manpage, you need to make sure that the
1547UTF-8 flag is stripped off. Note that at the time of this writing
1548(October 2002) the mentioned modules are not UTF-8-aware. Please
1549check the documentation to verify if this is still true.
1550
1551 if ($] > 5.007) {
1552 require Encode;
1553 $val = Encode::encode_utf8($val); # make octets
1554 }
1555
1556=item *
1557
1558A scalar we got back from an extension
1559
1560If you believe the scalar comes back as UTF-8, you will most likely
1561want the UTF-8 flag restored:
1562
1563 if ($] > 5.007) {
1564 require Encode;
1565 $val = Encode::decode_utf8($val);
1566 }
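
The two directions of the conversion (characters to octets before
handing data to the extension, octets back to characters afterwards) can
be seen in isolation in the sketch below. It assumes only the Encode
module. The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode ();

my $chars  = "\x{263A}";                      # WHITE SMILING FACE
my $octets = Encode::encode_utf8($chars);     # what the extension would see
my $back   = Encode::decode_utf8($octets);    # what you want to get back

printf "characters: %d, octets: %d (%s)\n",
    length $chars, length $octets,
    join " ", map { sprintf "%02X", ord($_) } split //, $octets;
print "round trip ", ($back eq $chars ? "preserved" : "changed"),
    " the string\n";
```

One character becomes the three octets C<E2 98 BA>, and decoding them
gives back the original one-character string.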
1567
1568=item *
1569
1570Same thing, if you are really sure it is UTF-8
1571
1572 if ($] > 5.007) {
1573 require Encode;
1574 Encode::_utf8_on($val);
1575 }
1576
1577=item *
1578
1579A wrapper for fetchrow_array and fetchrow_hashref
1580
1581When the database contains only UTF-8, a wrapper function or method is
1582a convenient way to replace all your fetchrow_array and
1583fetchrow_hashref calls. A wrapper function will also make it easier to
1584adapt to future enhancements in your database driver. Note that at the
1585time of this writing (October 2002), the DBI has no standardized way
1586to deal with UTF-8 data. Please check the documentation to verify if
1587that is still true.
1588
1589 sub fetchrow {
1590 my($self, $sth, $what) = @_; # $what is one of fetchrow_{array,hashref}
1591 if ($] < 5.007) {
1592 return $sth->$what;
1593 } else {
1594 require Encode;
1595 if (wantarray) {
1596 my @arr = $sth->$what;
1597 for (@arr) {
1598 defined && /[^\000-\177]/ && Encode::_utf8_on($_);
1599 }
1600 return @arr;
1601 } else {
1602 my $ret = $sth->$what;
1603 if (ref $ret) {
1604 for my $k (keys %$ret) {
1605 defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret->{$k};
1606 }
1607 return $ret;
1608 } else {
1609 defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
1610 return $ret;
1611 }
1612 }
1613 }
1614 }
1615
1616
1617=item *
1618
1619A large scalar that you know can only contain ASCII
1620
Scalars that contain only ASCII and are marked as UTF-8 are sometimes
a drag on your program. If you recognize such a situation, just remove
the UTF-8 flag:
1624
1625 utf8::downgrade($val) if $] > 5.007;
1626
1627=back
1628
393fec97
GS
1629=head1 SEE ALSO
1630
72ff2908 1631L<perluniintro>, L<encoding>, L<Encode>, L<open>, L<utf8>, L<bytes>,
a05d7ebb 1632L<perlretut>, L<perlvar/"${^UNICODE}">
393fec97
GS
1633
1634=cut