=head1 NAME

perlunicode - Unicode support in Perl

=head1 DESCRIPTION

=head2 Important Caveats

Unicode support is an extensive requirement. While Perl does not
implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.

=over 4

=item Input and Output Layers

Perl knows when a filehandle uses Perl's internal Unicode encodings
(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
the ":utf8" layer. Other encodings can be converted to Perl's
encoding on input or from Perl's encoding on output by use of the
":encoding(...)" layer. See L<open>.

To indicate that Perl source itself is using a particular encoding,
see L<encoding>.

=item Regular Expressions

The regular expression compiler produces polymorphic opcodes. That is,
the pattern adapts to the data and automatically switches to the Unicode
character scheme when presented with Unicode data--or instead uses
a traditional byte scheme when presented with byte data.

=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts

As a compatibility measure, the C<use utf8> pragma must be explicitly
included to enable recognition of UTF-8 in the Perl scripts themselves
(in string or regular expression literals, or in identifier names) on
ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
machines. B<These are the only times when an explicit C<use utf8>
is needed.> See L<utf8>.

You can also use the C<encoding> pragma to change the default encoding
of the data in your script; see L<encoding>.

=item C<use encoding> needed to upgrade non-Latin-1 byte strings

By default, there is a fundamental asymmetry in Perl's Unicode model:
implicit upgrading from byte strings to Unicode strings assumes that
they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
downgraded with UTF-8 encoding. This happens because the first 256
code points in Unicode happen to agree with Latin-1.

If you wish to interpret byte strings as UTF-8 instead, use the
C<encoding> pragma:

    use encoding 'utf8';

See L</"Byte and Character Semantics"> for more details.

=back

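As a rough illustration of the layer behaviour described under
"Input and Output Layers", the sketch below writes one character through
an C<:encoding(UTF-8)> layer and reads it back. The in-memory filehandle
and the variable names are assumptions made to keep the example
self-contained; a real program would normally open a file instead.

```perl
use strict;
use warnings;

# One character: U+263A WHITE SMILING FACE.
my $char = "\x{263A}";

# Write it through an encoding layer into an in-memory buffer
# (an assumption made here so that no real file is needed).
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "open for write: $!";
print {$out} $char;
close($out) or die "close: $!";

# The buffer holds the encoded bytes, not the character.
my $byte_count = length($buffer);

# Reading through the same layer decodes the bytes back to a character.
open(my $in, '<:encoding(UTF-8)', \$buffer) or die "open for read: $!";
my $read_back = <$in>;
close($in);

my $char_count = length($read_back);
```

On an ASCII-based platform the buffer should hold three bytes, while the
string read back is a single character.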
=head2 Byte and Character Semantics

Beginning with version 5.6, Perl uses logically-wide characters to
represent strings internally.

In the future, Perl-level operations will be expected to work with
characters rather than bytes.

However, as an interim compatibility measure, Perl aims to
provide a safe migration path from byte semantics to character
semantics for programs. For operations where Perl can unambiguously
decide that the input data are characters, Perl switches to
character semantics. For operations where this determination cannot
be made without additional information from the user, Perl decides in
favor of compatibility and chooses to use byte semantics.

This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if
none of the program's inputs were marked as being a source of Unicode
character data. Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as %ENV),
or from literals and constants in the source text.

The C<bytes> pragma will always, regardless of platform, force byte
semantics in a particular lexical scope. See L<bytes>.

The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
Note that this pragma is only required while Perl defaults to byte
semantics; when character semantics become the default, this pragma
may become a no-op. See L<utf8>.

Unless explicitly stated, Perl operators use character semantics
for Unicode data and byte semantics for non-Unicode data.
The decision to use character semantics is made transparently. If
input data comes from a Unicode source--for example, if a character
encoding layer is added to a filehandle or a literal Unicode
string constant appears in a program--character semantics apply.
Otherwise, byte semantics are in effect. The C<bytes> pragma should
be used to force byte semantics on Unicode data.

If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will be created by
decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
old Unicode string used EBCDIC. This translation is done without
regard to the system's native 8-bit encoding. To change this for
systems with non-Latin-1 and non-EBCDIC native encodings, use the
C<encoding> pragma. See L<encoding>.

Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
logically just a number ranging from 0 to 2**31 or so. Larger
characters may encode into longer sequences of bytes internally, but
this internal detail is mostly hidden from Perl code.
See L<perluniintro> for more.

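The difference between the two semantics can be sketched with
C<length()>. The example below is only an illustration; the variable
names are made up, and the byte count assumes an ASCII-based platform
where the internal encoding is UTF-8.

```perl
use strict;
use warnings;

my $unicode = "\x{263A}";   # one character above 255
my $latin1  = "\xE9";       # one byte, read as Latin-1 when upgraded

# Character semantics: length() counts characters.
my $chars = length($unicode);

# Byte semantics inside the lexical scope of the bytes pragma:
# length() counts the bytes of the internal representation.
my $bytes;
{
    use bytes;
    $bytes = length($unicode);
}

# Concatenation upgrades the byte string as if it were Latin-1,
# so byte 0xE9 becomes the single character U+00E9.
my $joined     = $latin1 . $unicode;
my $joined_len = length($joined);
my $first_code = ord(substr($joined, 0, 1));
```

Here C<$chars> is 1 and C<$bytes> is 3, and the concatenated string has
two characters, the first of which is still 0xE9.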
=head2 Effects of Character Semantics

Character semantics have the following effects:

=over 4

=item *

Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.

If you use a Unicode editor to edit your program, Unicode characters
may occur directly within the literal strings in one of the various
Unicode encodings (UTF-8, UTF-EBCDIC, UCS-2, etc.), but will be recognized
as such and converted to Perl's internal representation only if the
appropriate L<encoding> is specified.

Unicode characters can also be added to a string by using the
C<\x{...}> notation. The Unicode code for the desired character, in
hexadecimal, should be placed in the braces. For instance, a smiley
face is C<\x{263A}>. This encoding scheme only works for characters
with a code of 0x100 or above.

Additionally, if you

    use charnames ':full';

you can use the C<\N{...}> notation and put the official Unicode
character name within the braces, such as C<\N{WHITE SMILING FACE}>.

=item *

If an appropriate L<encoding> is specified, identifiers within the
Perl script may contain Unicode alphanumeric characters, including
ideographs. Perl does not currently attempt to canonicalize variable
names.

=item *

Regular expressions match characters instead of bytes. "." matches
a character instead of a byte. The C<\C> pattern is provided to force
a match of a single byte--a C<char> in C, hence C<\C>.

=item *

Character classes in regular expressions match characters instead of
bytes and match against the character properties specified in the
Unicode properties database. C<\w> can be used to match a Japanese
ideograph, for instance.

(However, and as a limitation of the current implementation, using
C<\w> or C<\W> I<inside> a C<[...]> character class will still match
with byte semantics.)

=item *

Named Unicode properties, scripts, and block ranges may be used like
character classes via the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".

For instance, C<\p{Lu}> matches any character with the Unicode "Lu"
(Letter, uppercase) property, while C<\p{M}> matches any character
with an "M" (mark--accents and such) property. Brackets are not
required for single letter properties, so C<\p{M}> is equivalent to
C<\pM>. Many predefined properties are available, such as
C<\p{Mirrored}> and C<\p{Tibetan}>.

The official Unicode script and block names have spaces and dashes as
separators, but for convenience you can use dashes, spaces, or
underbars, and case is unimportant. It is recommended, however, that
for consistency you use the following naming: the official Unicode
script, property, or block name (see below for the additional rules
that apply to block names) with whitespace and dashes removed, and the
words "uppercase-first-lowercase-rest". C<Latin-1 Supplement> thus
becomes C<Latin1Supplement>.

You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
(^) between the first brace and the property name: C<\p{^Tamil}> is
equal to C<\P{Tamil}>.

B<NOTE: the properties, scripts, and blocks listed here are as of
Unicode 3.2.0, March 2002, or Perl 5.8.0, July 2002. Unicode 4.0.0
came out in April 2003, and Perl 5.8.1 in September 2003.>

Here are the basic Unicode General Category properties, followed by their
long form. You can use either; C<\p{Lu}> and C<\p{UppercaseLetter}>,
for instance, are identical.

    Short       Long

    L           Letter
    LC          CasedLetter
    Lu          UppercaseLetter
    Ll          LowercaseLetter
    Lt          TitlecaseLetter
    Lm          ModifierLetter
    Lo          OtherLetter

    M           Mark
    Mn          NonspacingMark
    Mc          SpacingMark
    Me          EnclosingMark

    N           Number
    Nd          DecimalNumber
    Nl          LetterNumber
    No          OtherNumber

    P           Punctuation
    Pc          ConnectorPunctuation
    Pd          DashPunctuation
    Ps          OpenPunctuation
    Pe          ClosePunctuation
    Pi          InitialPunctuation
                (may behave like Ps or Pe depending on usage)
    Pf          FinalPunctuation
                (may behave like Ps or Pe depending on usage)
    Po          OtherPunctuation

    S           Symbol
    Sm          MathSymbol
    Sc          CurrencySymbol
    Sk          ModifierSymbol
    So          OtherSymbol

    Z           Separator
    Zs          SpaceSeparator
    Zl          LineSeparator
    Zp          ParagraphSeparator

    C           Other
    Cc          Control
    Cf          Format
    Cs          Surrogate   (not usable)
    Co          PrivateUse
    Cn          Unassigned

Single-letter properties match all characters in any of the
two-letter sub-properties starting with the same letter.
C<LC> and C<L&> are special cases, which are aliases for the set of
C<Ll>, C<Lu>, and C<Lt>.

Because Perl hides the need for the user to understand the internal
representation of Unicode characters, there is no need to implement
the somewhat messy concept of surrogates. C<Cs> is therefore not
supported.

Because scripts differ in their directionality--Hebrew is
written right to left, for example--Unicode supplies these properties in
the BidiClass class:

    Property    Meaning

    L           Left-to-Right
    LRE         Left-to-Right Embedding
    LRO         Left-to-Right Override
    R           Right-to-Left
    AL          Right-to-Left Arabic
    RLE         Right-to-Left Embedding
    RLO         Right-to-Left Override
    PDF         Pop Directional Format
    EN          European Number
    ES          European Number Separator
    ET          European Number Terminator
    AN          Arabic Number
    CS          Common Number Separator
    NSM         Non-Spacing Mark
    BN          Boundary Neutral
    B           Paragraph Separator
    S           Segment Separator
    WS          Whitespace
    ON          Other Neutrals

For example, C<\p{BidiClass:R}> matches characters that are normally
written right to left.

=back

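The notations and property tests from the list above can be tried out
with a short script. This is only a sketch: the sample characters and
variable names are chosen for illustration, and each test is expected
to succeed on a Perl whose Unicode tables include these characters.

```perl
use strict;
use warnings;
use charnames ':full';

# The same character written by code point and by official name.
my $by_code = "\x{263A}";
my $by_name = "\N{WHITE SMILING FACE}";
my $same    = ($by_code eq $by_name) ? 1 : 0;

# U+0100 is LATIN CAPITAL LETTER A WITH MACRON,
# U+0301 is COMBINING ACUTE ACCENT.
my $upper = "\x{100}";
my $mark  = "\x{301}";

my $is_upper  = ($upper =~ /^\p{Lu}$/)              ? 1 : 0;
my $long_form = ($upper =~ /^\p{UppercaseLetter}$/) ? 1 : 0;
my $is_mark   = ($mark  =~ /^\pM$/)                 ? 1 : 0;  # short for \p{M}
my $not_tamil = ($upper =~ /^\p{^Tamil}$/)          ? 1 : 0;  # same as \P{Tamil}

# U+05D0 HEBREW LETTER ALEF is normally written right to left.
my $bidi_r = ("\x{5D0}" =~ /^\p{BidiClass:R}$/) ? 1 : 0;

# \w matches an ideograph (U+65E5) under character semantics.
my $is_word = ("\x{65E5}" =~ /^\w$/) ? 1 : 0;
```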
=head2 Scripts

The script names which can be used by C<\p{...}> and C<\P{...}>,
such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows:

    Arabic
    Armenian
    Bengali
    Bopomofo
    Buhid
    CanadianAboriginal
    Cherokee
    Cyrillic
    Deseret
    Devanagari
    Ethiopic
    Georgian
    Gothic
    Greek
    Gujarati
    Gurmukhi
    Han
    Hangul
    Hanunoo
    Hebrew
    Hiragana
    Inherited
    Kannada
    Katakana
    Khmer
    Lao
    Latin
    Malayalam
    Mongolian
    Myanmar
    Ogham
    OldItalic
    Oriya
    Runic
    Sinhala
    Syriac
    Tagalog
    Tagbanwa
    Tamil
    Telugu
    Thaana
    Thai
    Tibetan
    Yi

Extended property classes can supplement the basic
properties, defined by the F<PropList> Unicode database:

    ASCIIHexDigit
    BidiControl
    Dash
    Deprecated
    Diacritic
    Extender
    GraphemeLink
    HexDigit
    Hyphen
    Ideographic
    IDSBinaryOperator
    IDSTrinaryOperator
    JoinControl
    LogicalOrderException
    NoncharacterCodePoint
    OtherAlphabetic
    OtherDefaultIgnorableCodePoint
    OtherGraphemeExtend
    OtherLowercase
    OtherMath
    OtherUppercase
    QuotationMark
    Radical
    SoftDotted
    TerminalPunctuation
    UnifiedIdeograph
    WhiteSpace

and there are further derived properties:

    Alphabetic      Lu + Ll + Lt + Lm + Lo + OtherAlphabetic
    Lowercase       Ll + OtherLowercase
    Uppercase       Lu + OtherUppercase
    Math            Sm + OtherMath

    ID_Start        Lu + Ll + Lt + Lm + Lo + Nl
    ID_Continue     ID_Start + Mn + Mc + Nd + Pc

    Any             Any character
    Assigned        Any non-Cn character (i.e. synonym for \P{Cn})
    Unassigned      Synonym for \p{Cn}
    Common          Any character (or unassigned code point)
                    not explicitly assigned to a script

For backward compatibility (with Perl 5.6), all properties mentioned
so far may have C<Is> prepended to their name, so C<\P{IsLu}>, for
example, is equal to C<\P{Lu}>.

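A few of the script and derived properties above, and the optional
C<Is> prefix, can be checked as follows. This is a minimal sketch with
made-up variable names; it assumes GREEK CAPITAL LETTER OMEGA (U+03A9)
and the ASCII digit "7" as sample characters.

```perl
use strict;
use warnings;

my $omega = "\x{3A9}";   # GREEK CAPITAL LETTER OMEGA
my $digit = "7";

# Script, general category, and the backward-compatible Is prefix.
my $in_greek  = ($omega =~ /^\p{Greek}$/) ? 1 : 0;
my $plain_lu  = ($omega =~ /^\p{Lu}$/)    ? 1 : 0;
my $prefix_lu = ($omega =~ /^\p{IsLu}$/)  ? 1 : 0;

# Derived properties.
my $alphabetic = ($omega =~ /^\p{Alphabetic}$/) ? 1 : 0;
my $assigned   = ($omega =~ /^\p{Assigned}$/)   ? 1 : 0;

# Digits are shared between scripts, so they are Common, not Latin.
my $digit_common = ($digit =~ /^\p{Common}$/) ? 1 : 0;
my $digit_latin  = ($digit =~ /^\p{Latin}$/)  ? 1 : 0;
```

Every flag except C<$digit_latin> should come out as 1.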
=head2 Blocks

In addition to B<scripts>, Unicode also defines B<blocks> of
characters. The difference between scripts and blocks is that the
concept of scripts is closer to natural languages, while the concept
of blocks is more of an artificial grouping based on groups of 256
Unicode characters. For example, the C<Latin> script contains letters
from many blocks but does not contain all the characters from those
blocks. It does not, for example, contain digits, because digits are
shared across many scripts. Digits and similar groups, like
punctuation, are in a category called C<Common>.

For more about scripts, see the UTR #24:

    http://www.unicode.org/unicode/reports/tr24/

For more about blocks, see:

    http://www.unicode.org/Public/UNIDATA/Blocks.txt

Block names are given with the C<In> prefix. For example, the
Katakana block is referenced via C<\p{InKatakana}>. The C<In>
prefix may be omitted if there is no naming conflict with a script
or any other property, but it is recommended that C<In> always be used
for block tests to avoid confusion.

These block names are supported:

    InAlphabeticPresentationForms
    InArabic
    InArabicPresentationFormsA
    InArabicPresentationFormsB
    InArmenian
    InArrows
    InBasicLatin
    InBengali
    InBlockElements
    InBopomofo
    InBopomofoExtended
    InBoxDrawing
    InBraillePatterns
    InBuhid
    InByzantineMusicalSymbols
    InCJKCompatibility
    InCJKCompatibilityForms
    InCJKCompatibilityIdeographs
    InCJKCompatibilityIdeographsSupplement
    InCJKRadicalsSupplement
    InCJKSymbolsAndPunctuation
    InCJKUnifiedIdeographs
    InCJKUnifiedIdeographsExtensionA
    InCJKUnifiedIdeographsExtensionB
    InCherokee
    InCombiningDiacriticalMarks
    InCombiningDiacriticalMarksforSymbols
    InCombiningHalfMarks
    InControlPictures
    InCurrencySymbols
    InCyrillic
    InCyrillicSupplementary
    InDeseret
    InDevanagari
    InDingbats
    InEnclosedAlphanumerics
    InEnclosedCJKLettersAndMonths
    InEthiopic
    InGeneralPunctuation
    InGeometricShapes
    InGeorgian
    InGothic
    InGreekExtended
    InGreekAndCoptic
    InGujarati
    InGurmukhi
    InHalfwidthAndFullwidthForms
    InHangulCompatibilityJamo
    InHangulJamo
    InHangulSyllables
    InHanunoo
    InHebrew
    InHighPrivateUseSurrogates
    InHighSurrogates
    InHiragana
    InIPAExtensions
    InIdeographicDescriptionCharacters
    InKanbun
    InKangxiRadicals
    InKannada
    InKatakana
    InKatakanaPhoneticExtensions
    InKhmer
    InLao
    InLatin1Supplement
    InLatinExtendedA
    InLatinExtendedAdditional
    InLatinExtendedB
    InLetterlikeSymbols
    InLowSurrogates
    InMalayalam
    InMathematicalAlphanumericSymbols
    InMathematicalOperators
    InMiscellaneousMathematicalSymbolsA
    InMiscellaneousMathematicalSymbolsB
    InMiscellaneousSymbols
    InMiscellaneousTechnical
    InMongolian
    InMusicalSymbols
    InMyanmar
    InNumberForms
    InOgham
    InOldItalic
    InOpticalCharacterRecognition
    InOriya
    InPrivateUseArea
    InRunic
    InSinhala
    InSmallFormVariants
    InSpacingModifierLetters
    InSpecials
    InSuperscriptsAndSubscripts
    InSupplementalArrowsA
    InSupplementalArrowsB
    InSupplementalMathematicalOperators
    InSupplementaryPrivateUseAreaA
    InSupplementaryPrivateUseAreaB
    InSyriac
    InTagalog
    InTagbanwa
    InTags
    InTamil
    InTelugu
    InThaana
    InThai
    InTibetan
    InUnifiedCanadianAboriginalSyllabics
    InVariationSelectors
    InYiRadicals
    InYiSyllables

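To make the script/block distinction concrete, the sketch below tests
two Greek letters against the C<Greek> script and against two of the
blocks listed above. The sample characters and variable names are
chosen for illustration only.

```perl
use strict;
use warnings;

my $omega       = "\x{3A9}";    # GREEK CAPITAL LETTER OMEGA
my $alpha_psili = "\x{1F00}";   # GREEK SMALL LETTER ALPHA WITH PSILI

# Both letters belong to the Greek script ...
my $omega_script = ($omega       =~ /^\p{Greek}$/) ? 1 : 0;
my $psili_script = ($alpha_psili =~ /^\p{Greek}$/) ? 1 : 0;

# ... but they live in different blocks.
my $omega_basic    = ($omega       =~ /^\p{InGreekAndCoptic}$/) ? 1 : 0;
my $psili_basic    = ($alpha_psili =~ /^\p{InGreekAndCoptic}$/) ? 1 : 0;
my $psili_extended = ($alpha_psili =~ /^\p{InGreekExtended}$/)  ? 1 : 0;
```

Only C<$psili_basic> should be 0: the script spans both blocks, while
each block test is a plain code point range check.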
=over 4

=item *

The special pattern C<\X> matches any extended Unicode
sequence--"a combining character sequence" in Standardese--where the
first character is a base character and subsequent characters are mark
characters that apply to the base character. C<\X> is equivalent to
C<(?:\PM\pM*)>.

=item *

The C<tr///> operator translates characters instead of bytes. Note
that the C<tr///CU> functionality has been removed. For similar
functionality see C<pack('U0', ...)> and C<pack('C0', ...)>.

=item *

Case translation operators use the Unicode case translation tables
when character input is provided. Note that C<uc()>, or C<\U> in
interpolated strings, translates to uppercase, while C<ucfirst>,
or C<\u> in interpolated strings, translates to titlecase in languages
that make the distinction.

=item *

Most operators that deal with positions or lengths in a string will
automatically switch to using character positions, including
C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
C<sprintf()>, C<write()>, and C<length()>. Operators that
specifically do not switch include C<vec()>, C<pack()>, and
C<unpack()>. Operators that really don't care include
operators that treat strings as a bucket of bits, such as C<sort()>,
and operators dealing with filenames.

=item *

The C<pack()>/C<unpack()> letters C<c> and C<C> do I<not> change,
since they are often used for byte-oriented formats. Again, think
C<char> in the C language.

There is a new C<U> specifier that converts between Unicode characters
and code points.

=item *

The C<chr()> and C<ord()> functions work on characters, similar to
C<pack("U")> and C<unpack("U")>, I<not> C<pack("C")> and
C<unpack("C")>. C<pack("C")> and C<unpack("C")> are methods for
emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.
While these methods reveal the internal encoding of Unicode strings,
that is not something one normally needs to care about at all.

=item *

The bit string operators, C<& | ^ ~>, can operate on character data.
However, for backward compatibility, such as when using bit string
operations when characters are all less than 256 in ordinal value, one
should not use C<~> (the bit complement) with characters of both
values less than 256 and values greater than 256. Most importantly,
DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
will not hold. The reason for this mathematical I<faux pas> is that
the complement cannot return B<both> the 8-bit (byte-wide) bit
complement B<and> the full character-wide bit complement.

=item *

lc(), uc(), lcfirst(), and ucfirst() work for the following cases:

=over 8

=item *

the case mapping is from a single Unicode character to another
single Unicode character, or

=item *

the case mapping is from a single Unicode character to more
than one Unicode character.

=back

Things to do with locales (Lithuanian, Turkish, Azeri) do B<not> work
since Perl does not understand the concept of Unicode locales.

See the Unicode Technical Report #21, Case Mappings, for more details.

=back

=over 4

=item *

And finally, C<scalar reverse()> reverses by character rather than by byte.

=back

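Several of the character-level operations listed above can be
exercised together. The following is a sketch rather than a complete
survey; the sample strings and variable names are assumptions picked
for the example.

```perl
use strict;
use warnings;

# "e" plus COMBINING ACUTE ACCENT: two characters, one \X match.
my $combined  = "e\x{301}";
my $one_x     = ($combined =~ /^\X$/) ? 1 : 0;
my $two_chars = length($combined);

# U+01C6 has distinct uppercase (U+01C4) and titlecase (U+01C5) forms.
my $dz         = "\x{1C6}";
my $upper_code = ord(uc($dz));
my $title_code = ord(ucfirst($dz));

# A one-to-many case mapping: U+FB00 (ff ligature) uppercases to "FF".
my $ligature_uc = uc("\x{FB00}");

# tr/// translates whole characters.
my $text = "a\x{263A}b";
(my $copy = $text) =~ tr/\x{263A}/*/;

# index() and reverse() work in character positions.
my $position = index($text, "b");
my $reversed = scalar reverse($text);

# chr()/ord() agree with pack("U")/unpack("U").
my $packed_same = (pack("U", 0x263A) eq chr(0x263A)) ? 1 : 0;
my $unpacked    = unpack("U", "\x{263A}");
```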
=head2 User-Defined Character Properties

You can define your own character properties by defining subroutines
whose names begin with "In" or "Is". The subroutines can be defined in
any package. The user-defined properties can be used in the regular
expression C<\p> and C<\P> constructs; if you are using a user-defined
property from a package other than the one you are in, you must specify
its package in the C<\p> or C<\P> construct.

    # assuming property IsForeign defined in Lang::
    package main;  # property package name required
    if ($txt =~ /\p{Lang::IsForeign}+/) { ... }

    package Lang;  # property package name not required
    if ($txt =~ /\p{IsForeign}+/) { ... }

Note that the effect is compile-time and immutable once defined.

The subroutines must return a specially-formatted string, with one
or more newline-separated lines. Each line must be one of the following:

=over 4

=item *

Two hexadecimal numbers separated by horizontal whitespace (space or
tab characters) denoting a range of Unicode code points to include.

=item *

Something to include, prefixed by "+": a built-in character
property (prefixed by "utf8::") or a user-defined character property,
to represent all the characters in that property; two hexadecimal code
points for a range; or a single hexadecimal code point.

=item *

Something to exclude, prefixed by "-": an existing character
property (prefixed by "utf8::") or a user-defined character property,
to represent all the characters in that property; two hexadecimal code
points for a range; or a single hexadecimal code point.

=item *

Something to negate, prefixed by "!": an existing character
property (prefixed by "utf8::") or a user-defined character property,
to represent all the characters in that property; two hexadecimal code
points for a range; or a single hexadecimal code point.

=item *

Something to intersect with, prefixed by "&": an existing character
property (prefixed by "utf8::") or a user-defined character property,
so that only the characters that are also in that property are kept;
two hexadecimal code points for a range; or a single hexadecimal code
point.

=back

For example, to define a property that covers both the Japanese
syllabaries (hiragana and katakana), you can define

    sub InKana {
        return <<END;
    3040\t309F
    30A0\t30FF
    END
    }

Imagine that the here-doc end marker is at the beginning of the line.
Now you can use C<\p{InKana}> and C<\P{InKana}>.

You could also have used the existing block property names:

    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    END
    }

Suppose you wanted to match only the allocated characters,
not the raw block ranges: in other words, you want to remove
the non-characters:

    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    -utf8::IsCn
    END
    }

The negation is useful for defining (surprise!) negated classes.

    sub InNotKana {
        return <<'END';
    !utf8::InHiragana
    -utf8::InKatakana
    +utf8::IsCn
    END
    }

Intersection is useful for getting the common characters matched by
two (or more) classes.

    sub InFooAndBar {
        return <<'END';
    +main::Foo
    &main::Bar
    END
    }

It's important to remember not to use "&" for the first set -- that
would be intersecting with nothing (resulting in an empty set).

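Put together as a runnable script, a user-defined property might look
like the sketch below. C<InKana> follows the example above but builds
its string with C<join> so that no here-doc end marker is needed;
C<InKanaNoDot> is a hypothetical second property, added only to show
the "-" prefix with a single code point. Both subroutines are assumed
to live in package C<main>.

```perl
use strict;
use warnings;

# Both kana blocks, given as hexadecimal ranges.
sub InKana {
    return join "\n",
        "3040\t309F",    # Hiragana block
        "30A0\t30FF",    # Katakana block
        "";
}

# Hypothetical property: InKana minus U+30FB (KATAKANA MIDDLE DOT).
sub InKanaNoDot {
    return join "\n",
        "+main::InKana",
        "-30FB",
        "";
}

my $hiragana_a = "\x{3042}";   # HIRAGANA LETTER A
my $middle_dot = "\x{30FB}";   # KATAKANA MIDDLE DOT

my $a_is_kana     = ($hiragana_a =~ /\p{InKana}/)      ? 1 : 0;
my $ascii_is_kana = ("A"         =~ /\p{InKana}/)      ? 1 : 0;
my $dot_is_kana   = ($middle_dot =~ /\p{InKana}/)      ? 1 : 0;
my $dot_excluded  = ($middle_dot =~ /\P{InKanaNoDot}/) ? 1 : 0;
```

The middle dot matches C<\p{InKana}> because it falls inside the raw
block range, but it is excluded from C<InKanaNoDot>.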
You can also define your own mappings to be used in lc(),
lcfirst(), uc(), and ucfirst() (or their string-inlined versions).
The principle is the same: define subroutines in the C<main> package
with names like C<ToLower> (for lc() and lcfirst()), C<ToTitle> (for
the first character in ucfirst()), and C<ToUpper> (for uc(), and the
rest of the characters in ucfirst()).

The string returned by the subroutines must now be three
hexadecimal numbers separated by tab characters: start of the source
range, end of the source range, and start of the destination range.
For example:

    sub ToUpper {
        return <<END;
    0061\t0063\t0041
    END
    }

defines a uc() mapping that causes only the characters "a", "b", and
"c" to be mapped to "A", "B", and "C"; all other characters remain
unchanged.

If there is no source range to speak of, that is, the mapping is from
a single character to another single character, leave the end of the
source range empty, but the two tab characters are still needed.
For example:

    sub ToLower {
        return <<END;
    0041\t\t0061
    END
    }

defines a lc() mapping that causes only "A" to be mapped to "a"; all
other characters remain unchanged.

(For serious hackers only) If you want to introspect the default
mappings, you can find the data in the directory
C<$Config{privlib}>/F<unicore/To/>. The mapping data is returned as
the here-document, and the C<utf8::ToSpecFoo> are special exception
mappings derived from C<$Config{privlib}>/F<unicore/SpecialCasing.txt>.
The C<Digit> and C<Fold> mappings that one can see in the directory
are not directly user-accessible; one can use either the
C<Unicode::UCD> module, or just match case-insensitively (that's when
the C<Fold> mapping is used).

A final note on the user-defined property tests and mappings: they
will be used only if the scalar has been marked as having Unicode
characters. Old byte-style strings will not be affected.

=head2 Character Encodings for Input and Output

See L<Encode>.

=head2 Unicode Regular Expression Support Level

The following list of Unicode support for regular expressions describes
all the features currently supported. The references to "Level N"
and the section numbers refer to the Unicode Technical Report 18,
"Unicode Regular Expression Guidelines", version 6 (Unicode 3.2.0,
Perl 5.8.0).

=over 4

=item *

Level 1 - Basic Unicode Support

        2.1 Hex Notation                        - done          [1]
            Named Notation                      - done          [2]
        2.2 Categories                          - done          [3][4]
        2.3 Subtraction                         - MISSING       [5][6]
        2.4 Simple Word Boundaries              - done          [7]
        2.5 Simple Loose Matches                - done          [8]
        2.6 End of Line                         - MISSING       [9][10]

        [ 1] \x{...}
        [ 2] \N{...}
        [ 3] . \p{...} \P{...}
        [ 4] support for scripts (see UTR#24 Script Names), blocks,
             binary properties, enumerated non-binary properties, and
             numeric properties (as listed in UTR#18 Other Properties)
        [ 5] have negation
        [ 6] can use regular expression look-ahead [a]
             or user-defined character properties [b] to emulate subtraction
        [ 7] include Letters in word characters
        [ 8] note that Perl does Full case-folding in matching, not Simple:
             for example U+1F88 is equivalent to U+1F00 U+03B9,
             not to U+1F80. This difference matters for certain Greek
             capital letters with certain modifiers: the Full case-folding
             decomposes the letter, while the Simple case-folding would map
             it to a single character.
        [ 9] see UTR #13 Unicode Newline Guidelines
        [10] should do ^ and $ also on \x{85}, \x{2028} and \x{2029}
             (should also affect <>, $., and script line numbers)
             (the \x{85}, \x{2028} and \x{2029} do match \s)

[a] You can mimic class subtraction using lookahead.
For example, what UTR #18 might write as

    [{Greek}-[{UNASSIGNED}]]

in Perl can be written as:

    (?!\p{Unassigned})\p{InGreekAndCoptic}
    (?=\p{Assigned})\p{InGreekAndCoptic}

But in this particular example, you probably really want

    \p{Greek}

which will match assigned characters known to be part of the Greek script.

Also see the Unicode::Regex::Set module; it implements the full
UTR #18 grouping, intersection, union, and removal (subtraction) syntax.

[b] See L</"User-Defined Character Properties">.

776f8809
JH
869=item *
870
871Level 2 - Extended Unicode Support
872
63de3cb2
JH
873 3.1 Surrogates - MISSING [11]
874 3.2 Canonical Equivalents - MISSING [12][13]
875 3.3 Locale-Independent Graphemes - MISSING [14]
876 3.4 Locale-Independent Words - MISSING [15]
877 3.5 Locale-Independent Loose Matches - MISSING [16]
878
879 [11] Surrogates are solely a UTF-16 concept and Perl's internal
880 representation is UTF-8. The Encode module does UTF-16, though.
881 [12] see UTR#15 Unicode Normalization
882 [13] have Unicode::Normalize but not integrated to regexes
883 [14] have \X but at this level . should equal that
884 [15] need three classes, not just \w and \W
885 [16] see UTR#21 Case Mappings
776f8809
JH
886
887=item *
888
889Level 3 - Locale-Sensitive Support
890
        4.1 Locale-Dependent Categories         - MISSING
        4.2 Locale-Dependent Graphemes          - MISSING [17][18]
        4.3 Locale-Dependent Words              - MISSING
        4.4 Locale-Dependent Loose Matches      - MISSING
        4.5 Locale-Dependent Ranges             - MISSING

        [17] see UTR#10 Unicode Collation Algorithms
        [18] have Unicode::Collate but not integrated to regexes
899
900=back
901
c349b1b9
JH
902=head2 Unicode Encodings
903
376d9008
JB
904Unicode characters are assigned to I<code points>, which are abstract
905numbers. To use these numbers, various encodings are needed.
c349b1b9
JH
906
907=over 4
908
c29a771d 909=item *
5cb3728c
RB
910
911UTF-8
c349b1b9 912
UTF-8 is a variable-length (1 to 6 bytes; current character allocations
require at most 4 bytes), byte-order independent encoding.  For ASCII (and
we really do mean 7-bit ASCII, not another 8-bit encoding), UTF-8 is
transparent.
c349b1b9 917
The following table is from Unicode 3.2.

 Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte

   U+0000..U+007F       00..7F
   U+0080..U+07FF       C2..DF    80..BF
   U+0800..U+0FFF       E0        A0..BF    80..BF
   U+1000..U+CFFF       E1..EC    80..BF    80..BF
   U+D000..U+D7FF       ED        80..9F    80..BF
   U+D800..U+DFFF       ******* ill-formed *******
   U+E000..U+FFFF       EE..EF    80..BF    80..BF
  U+10000..U+3FFFF      F0        90..BF    80..BF    80..BF
  U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
 U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF
932
Note the C<A0..BF> in C<U+0800..U+0FFF>, the C<80..9F> in
C<U+D000..U+D7FF>, the C<90..BF> in C<U+10000..U+3FFFF>, and the
C<80..8F> in C<U+100000..U+10FFFF>.  The "gaps" are caused by legal
UTF-8 avoiding non-shortest encodings: it is technically possible to
UTF-8-encode a single code point in different ways, but that is
explicitly forbidden, and the shortest possible encoding should always
be used.  So that's what Perl does.
37361303 940
Another way to look at it is via bits:

 Code Points                    1st Byte   2nd Byte  3rd Byte  4th Byte

                    0aaaaaaa     0aaaaaaa
            00000bbbbbaaaaaa     110bbbbb  10aaaaaa
            ccccbbbbbbaaaaaa     1110cccc  10bbbbbb  10aaaaaa
  00000dddccccccbbbbbbaaaaaa     11110ddd  10cccccc  10bbbbbb  10aaaaaa
949
As you can see, the continuation bytes all begin with C<10>, and the
leading bits of the start byte tell how many bytes there are in the
encoded character.
953
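The bit layout above can be turned into code fairly directly.  The
following is only a sketch for illustration: C<utf8_bytes> is a
hypothetical helper, not something Perl provides, and in real programs
C<utf8::encode()> or the Encode module should do this work.  It covers
only the one- to four-byte forms shown in the tables.

```perl
use strict;
use warnings;

# Hypothetical helper: encode one code point into UTF-8 bytes by hand,
# following the bit patterns in the table above (1 to 4 bytes only).
sub utf8_bytes {
    my ($cp) = @_;
    die "code point out of range\n" if $cp < 0 || $cp > 0x10FFFF;
    die "surrogates are ill-formed in UTF-8\n"
        if $cp >= 0xD800 && $cp <= 0xDFFF;

    return ($cp) if $cp < 0x80;                       # 0aaaaaaa
    return (0xC0 | ($cp >> 6),                        # 110bbbbb
            0x80 | ($cp & 0x3F)) if $cp < 0x800;      # 10aaaaaa
    return (0xE0 | ($cp >> 12),                       # 1110cccc
            0x80 | (($cp >> 6) & 0x3F),               # 10bbbbbb
            0x80 | ($cp & 0x3F)) if $cp < 0x10000;    # 10aaaaaa
    return (0xF0 | ($cp >> 18),                       # 11110ddd
            0x80 | (($cp >> 12) & 0x3F),              # 10cccccc
            0x80 | (($cp >> 6) & 0x3F),               # 10bbbbbb
            0x80 | ($cp & 0x3F));                     # 10aaaaaa
}

# Format a list of byte values as hexadecimal for display.
sub hex_bytes { return join ' ', map { sprintf '%02X', $_ } @_ }

print hex_bytes(utf8_bytes(0x20AC)), "\n";    # EURO SIGN
```

Because each length is chosen by the smallest range that fits the code
point, the helper can only ever produce the shortest form.
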
c29a771d 954=item *
5cb3728c
RB
955
956UTF-EBCDIC
dbe420b4 957
376d9008 958Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.
dbe420b4 959
c29a771d 960=item *
5cb3728c 961
1e54db1a 962UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)
c349b1b9 963
The following items are mostly for reference and general Unicode
knowledge; Perl doesn't use these constructs internally.
dbe420b4 966
UTF-16 is a 2 or 4 byte encoding.  The Unicode code points
C<U+0000..U+FFFF> are stored in a single 16-bit unit, and the code
points C<U+10000..U+10FFFF> in two 16-bit units.  The latter case
uses I<surrogates>, the first 16-bit unit being the I<high
surrogate>, and the second being the I<low surrogate>.
972
376d9008 973Surrogates are code points set aside to encode the C<U+10000..U+10FFFF>
c349b1b9 974range of Unicode code points in pairs of 16-bit units. The I<high
376d9008
JB
975surrogates> are the range C<U+D800..U+DBFF>, and the I<low surrogates>
976are the range C<U+DC00..U+DFFF>. The surrogate encoding is
c349b1b9
JH
977
        $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
        $lo = ($uni - 0x10000) % 0x400 + 0xDC00;

and the decoding is

        $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
c349b1b9 984
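As a quick check of the arithmetic, here is a small round-trip sketch.
The function names C<to_surrogates> and C<from_surrogates> are made up
for this example.  Note the C<int()>: Perl's C</> is floating-point
division unless C<use integer> is in effect, so the quotient has to be
truncated explicitly.

```perl
use strict;
use warnings;

# Hypothetical helper: split a supplementary code point into a
# (high, low) surrogate pair.
sub to_surrogates {
    my ($uni) = @_;
    die "not in U+10000..U+10FFFF\n"
        if $uni < 0x10000 || $uni > 0x10FFFF;
    my $offset = $uni - 0x10000;
    my $hi = int($offset / 0x400) + 0xD800;   # top 10 bits
    my $lo = $offset % 0x400 + 0xDC00;        # bottom 10 bits
    return ($hi, $lo);
}

# Hypothetical helper: combine a surrogate pair back into a code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
}

my ($hi, $lo) = to_surrogates(0x1F600);
printf "U+1F600 -> %04X %04X -> U+%X\n", $hi, $lo, from_surrogates($hi, $lo);
```
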
feda178f 985If you try to generate surrogates (for example by using chr()), you
376d9008
JB
986will get a warning if warnings are turned on, because those code
987points are not valid for a Unicode character.
9466bab6 988
376d9008 989Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16
c349b1b9 990itself can be used for in-memory computations, but if storage or
376d9008
JB
991transfer is required either UTF-16BE (big-endian) or UTF-16LE
992(little-endian) encodings must be chosen.
c349b1b9
JH
993
994This introduces another problem: what if you just know that your data
376d9008
JB
995is UTF-16, but you don't know which endianness? Byte Order Marks, or
996BOMs, are a solution to this. A special character has been reserved
86bbd6d1 997in Unicode to function as a byte order marker: the character with the
376d9008 998code point C<U+FEFF> is the BOM.
042da322 999
c349b1b9 1000The trick is that if you read a BOM, you will know the byte order,
376d9008
JB
1001since if it was written on a big-endian platform, you will read the
1002bytes C<0xFE 0xFF>, but if it was written on a little-endian platform,
1003you will read the bytes C<0xFF 0xFE>. (And if the originating platform
1004was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.)
042da322 1005
The way this trick works is that the character with the code point
C<U+FFFE> is guaranteed not to be a valid Unicode character, so the
sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in
little-endian format" and cannot be "C<U+FFFE>, represented in big-endian
format".
c349b1b9 1011
c29a771d 1012=item *
5cb3728c 1013
1e54db1a 1014UTF-32, UTF-32BE, UTF-32LE
c349b1b9
JH
1015
The UTF-32 family is pretty much like the UTF-16 family, except that
the units are 32-bit, and therefore the surrogate scheme is not
needed.  The BOM signatures will be C<0x00 0x00 0xFE 0xFF> for BE and
C<0xFF 0xFE 0x00 0x00> for LE.
c349b1b9 1020
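The BOM signatures described for UTF-16, UTF-8 and UTF-32 can be
checked with a few anchored matches.  This is a minimal sketch:
C<sniff_bom> is a hypothetical helper, and the ordering is a design
assumption.  The UTF-32LE signature begins with the UTF-16LE one, so
the longer signatures are tested first, which means UTF-16LE data that
starts with a BOM followed by U+0000 would be reported as UTF-32LE.

```perl
use strict;
use warnings;

# Hypothetical helper: guess an encoding from a leading BOM.
# Takes a string of raw bytes; returns an encoding name, or nothing
# if no BOM is recognized.
sub sniff_bom {
    my ($bytes) = @_;
    return 'UTF-32BE' if $bytes =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-32LE' if $bytes =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return;
}

my $guess = sniff_bom("\xFF\xFEh\x00i\x00");
print defined $guess ? $guess : 'no BOM', "\n";
```
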
c29a771d 1021=item *
5cb3728c
RB
1022
1023UCS-2, UCS-4
c349b1b9 1024
86bbd6d1 1025Encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit
376d9008 1026encoding. Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
339cfa0e
JH
1027because it does not use surrogates. UCS-4 is a 32-bit encoding,
1028functionally identical to UTF-32.
c349b1b9 1029
c29a771d 1030=item *
5cb3728c
RB
1031
1032UTF-7
c349b1b9 1033
376d9008
JB
1034A seven-bit safe (non-eight-bit) encoding, which is useful if the
1035transport or storage is not eight-bit safe. Defined by RFC 2152.
c349b1b9 1036
95a1a48b
JH
1037=back
1038
0d7c09bb
JH
1039=head2 Security Implications of Unicode
1040
1041=over 4
1042
1043=item *
1044
1045Malformed UTF-8
bf0fa0b2
JH
1046
1047Unfortunately, the specification of UTF-8 leaves some room for
1048interpretation of how many bytes of encoded output one should generate
376d9008
JB
1049from one input Unicode character. Strictly speaking, the shortest
1050possible sequence of UTF-8 bytes should be generated,
1051because otherwise there is potential for an input buffer overflow at
feda178f 1052the receiving end of a UTF-8 connection. Perl always generates the
376d9008
JB
1053shortest length UTF-8, and with warnings on Perl will warn about
1054non-shortest length UTF-8 along with other malformations, such as the
1055surrogates, which are not real Unicode code points.
bf0fa0b2 1056
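One way to see the shortest-form rule in action is to ask the Encode
module to decode strictly.  The sketch below assumes that the C<UTF-8>
encoding name selects Encode's strict decoder and that C<FB_CROAK>
makes malformed input fatal; C<is_strict_utf8> is a hypothetical
helper written for this example.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if the octets decode as strict UTF-8.
sub is_strict_utf8 {
    my ($octets) = @_;   # private copy: decode() with a CHECK value
                         # may modify the string it is given
    my $ok = eval {
        Encode::decode('UTF-8', $octets, Encode::FB_CROAK());
        1;
    };
    return $ok ? 1 : 0;
}

# "/" is U+002F.  Its only legal UTF-8 form is the single byte 0x2F;
# 0xC0 0xAF carries the same bits in a non-shortest, two-byte form.
for my $octets ("\x2F", "\xC0\xAF") {
    printf "%-6s %s\n",
        join('', map { sprintf '%02X', ord } split //, $octets),
        is_strict_utf8($octets) ? 'accepted' : 'rejected';
}
```
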
0d7c09bb
JH
1057=item *
1058
Regular expressions behave slightly differently between byte data and
character (Unicode) data.  For example, the "word character" character
class C<\w> will work differently depending on whether the data is
eight-bit bytes or Unicode.
0d7c09bb 1063
376d9008
JB
1064In the first case, the set of C<\w> characters is either small--the
1065default set of alphabetic characters, digits, and the "_"--or, if you
0d7c09bb
JH
1066are using a locale (see L<perllocale>), the C<\w> might contain a few
1067more letters according to your language and country.
1068
376d9008 1069In the second case, the C<\w> set of characters is much, much larger.
1bfb14c4
JH
1070Most importantly, even in the set of the first 256 characters, it will
1071probably match different characters: unlike most locales, which are
1072specific to a language and country pair, Unicode classifies all the
1073characters that are letters I<somewhere> as C<\w>. For example, your
1074locale might not think that LATIN SMALL LETTER ETH is a letter (unless
1075you happen to speak Icelandic), but Unicode does.
0d7c09bb 1076
376d9008 1077As discussed elsewhere, Perl has one foot (two hooves?) planted in
1bfb14c4
JH
1078each of two worlds: the old world of bytes and the new world of
1079characters, upgrading from bytes to characters when necessary.
376d9008
JB
1080If your legacy code does not explicitly use Unicode, no automatic
1081switch-over to characters should happen. Characters shouldn't get
1bfb14c4
JH
1082downgraded to bytes, either. It is possible to accidentally mix bytes
1083and characters, however (see L<perluniintro>), in which case C<\w> in
1084regular expressions might start behaving differently. Review your
1085code. Use warnings and the C<strict> pragma.
0d7c09bb
JH
1086
1087=back
1088
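The C<\w> difference described above can be observed with a single
character.  This sketch assumes Perl's default regular expression
semantics: no C<use locale>, no C<unicode_strings> feature, and no
explicit character-set modifier on the pattern.  C<matches_word> is a
hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical helper: does the string contain a \w character?
sub matches_word {
    my ($s) = @_;
    return $s =~ /\w/ ? 1 : 0;
}

my $eth = "\xF0";       # LATIN SMALL LETTER ETH, stored as one byte
print "as bytes:      ", matches_word($eth), "\n";

utf8::upgrade($eth);    # same character, now stored as UTF-8
print "as characters: ", matches_word($eth), "\n";
```

The string compares equal before and after the upgrade; only the
internal representation, and with it the matching rules, has changed.
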
c349b1b9
JH
1089=head2 Unicode in Perl on EBCDIC
1090
376d9008
JB
1091The way Unicode is handled on EBCDIC platforms is still
1092experimental. On such platforms, references to UTF-8 encoding in this
1093document and elsewhere should be read as meaning the UTF-EBCDIC
1094specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues
c349b1b9 1095are specifically discussed. There is no C<utfebcdic> pragma or
376d9008 1096":utfebcdic" layer; rather, "utf8" and ":utf8" are reused to mean
86bbd6d1
PN
1097the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic>
1098for more discussion of the issues.
c349b1b9 1099
b310b053
JH
1100=head2 Locales
1101
4616122b 1102Usually locale settings and Unicode do not affect each other, but
b310b053
JH
1103there are a couple of exceptions:
1104
1105=over 4
1106
1107=item *
1108
You can enable automatic UTF-8-ification of your standard file
handles, default C<open()> layer, and C<@ARGV> by using either
the C<-C> command line switch or the C<PERL_UNICODE> environment
variable.  See L<perlrun> for the documentation of the C<-C> switch.
b310b053
JH
1113
1114=item *
1115
376d9008
JB
1116Perl tries really hard to work both with Unicode and the old
1117byte-oriented world. Most often this is nice, but sometimes Perl's
1118straddling of the proverbial fence causes problems.
b310b053
JH
1119
1120=back
1121
1aad1664
JH
1122=head2 When Unicode Does Not Happen
1123
While Perl does have extensive ways to input and output in Unicode,
and a few other 'entry points' like C<@ARGV> which can be interpreted
as Unicode (UTF-8), there still are many places where Unicode (in some
encoding or another) could be given as arguments or received as
results, or both, but it is not.
1129
6cd4dd6c
JH
1130The following are such interfaces. For all of these interfaces Perl
1131currently (as of 5.8.3) simply assumes byte strings both as arguments
1132and results, or UTF-8 strings if the C<encoding> pragma has been used.
1aad1664
JH
1133
One reason why Perl does not attempt to resolve the role of Unicode in
these cases is that the answers are highly dependent on the operating
system and the file system(s).  For example, whether filenames can be
in Unicode, and in exactly what kind of encoding, is not exactly a
portable concept.  Similarly for C<qx> and C<system>: how well will the
'command line interface' (and which of them?) handle Unicode?
1140
1141=over 4
1142
557a2462
RB
1143=item *
1144
chdir, chmod, chown, chroot, exec, link, lstat, mkdir,
rename, rmdir, stat, symlink, truncate, unlink, utime, -X
557a2462
RB
1147
1148=item *
1149
1150%ENV
1151
1152=item *
1153
1154glob (aka the <*>)
1155
1156=item *
1aad1664 1157
557a2462 1158open, opendir, sysopen
1aad1664 1159
557a2462 1160=item *
1aad1664 1161
557a2462 1162qx (aka the backtick operator), system
1aad1664 1163
557a2462 1164=item *
1aad1664 1165
557a2462 1166readdir, readlink
1aad1664
JH
1167
1168=back
1169
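Since these interfaces take byte strings, one defensive approach is to
encode names explicitly before passing them in.  The following sketch
assumes that the file system expects UTF-8 encoded names, which is
exactly the kind of platform-dependent detail discussed above;
C<open_encoded> is a hypothetical helper, not part of Perl.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: encode a character-string filename to octets
# before handing it to open().  The default of UTF-8 is an assumption
# about the file system, not something Perl can know.
sub open_encoded {
    my ($mode, $name, $encoding) = @_;
    $encoding = 'UTF-8' unless defined $encoding;
    my $octets = Encode::encode($encoding, $name);
    open(my $fh, $mode, $octets)
        or die "cannot open '$octets': $!\n";
    return $fh;
}
```

A call such as C<open_encoded('E<gt>', "smiley-\x{263A}.txt")> would
then create a file whose name is the UTF-8 encoding of that string.
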
1170=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)
1171
1172Sometimes (see L</"When Unicode Does Not Happen">) there are
1173situations where you simply need to force Perl to believe that a byte
1174string is UTF-8, or vice versa. The low-level calls
1175utf8::upgrade($bytestring) and utf8::downgrade($utf8string) are
1176the answers.
1177
Do not use them without careful thought, though: Perl may easily get
very confused, angry, or even crash, if you suddenly change the 'nature'
of a scalar like that.  You have to be especially careful if you use
C<utf8::upgrade()>: a random byte string is not necessarily valid UTF-8.
1182
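To make the effect of these two calls concrete, the sketch below
prints the internal C<UTF8> flag and the character length of a string
before and after each call.  C<describe> is a hypothetical helper; the
C<utf8::> functions themselves are available without loading any
module.

```perl
use strict;
use warnings;

# Hypothetical helper: report the internal UTF8 flag and the length
# in characters.
sub describe {
    my ($s) = @_;
    return sprintf "flag=%d length=%d",
        (utf8::is_utf8($s) ? 1 : 0), length $s;
}

my $s = "caf\xE9";         # four characters, stored one byte each
print describe($s), "\n";

utf8::upgrade($s);         # same four characters, stored as UTF-8
print describe($s), "\n";

utf8::downgrade($s);       # back to one byte per character
print describe($s), "\n";

my $wide = "\x{263A}";     # does not fit in a single byte
my $ok = utf8::downgrade($wide, 1);   # true second argument: return
                                      # false on failure, do not die
print "downgrade of wide string ", ($ok ? "succeeded" : "failed"), "\n";
```

The length stays the same throughout: both calls change how the
characters are stored, not which characters the string contains.
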
95a1a48b
JH
1183=head2 Using Unicode in XS
1184
3a2263fe
RGS
1185If you want to handle Perl Unicode in XS extensions, you may find the
1186following C APIs useful. See also L<perlguts/"Unicode Support"> for an
1187explanation about Unicode at the XS level, and L<perlapi> for the API
1188details.
95a1a48b
JH
1189
1190=over 4
1191
1192=item *
1193
C<DO_UTF8(sv)> returns true if the C<UTF8> flag is on and the bytes
pragma is not in effect.  C<SvUTF8(sv)> returns true if the C<UTF8>
flag is on; the bytes pragma is ignored.  The C<UTF8> flag being on
does B<not> mean that there are any characters of code points greater
than 255 (or 127) in the scalar or that there are even any characters
in the scalar.  What the C<UTF8> flag means is that the sequence of
octets in the representation of the scalar is the sequence of UTF-8
encoded code points of the characters of a string.  The C<UTF8> flag
being off means that each octet in this representation encodes a
single character with code point 0..255 within the string.  Perl's
Unicode model is not to use UTF-8 until it is absolutely necessary.
95a1a48b
JH
1205
1206=item *
1207
fb9cc174 1208C<uvuni_to_utf8(buf, chr)> writes a Unicode character code point into
1bfb14c4 1209a buffer encoding the code point as UTF-8, and returns a pointer
95a1a48b
JH
1210pointing after the UTF-8 bytes.
1211
1212=item *
1213
376d9008
JB
1214C<utf8_to_uvuni(buf, lenp)> reads UTF-8 encoded bytes from a buffer and
1215returns the Unicode character code point and, optionally, the length of
1216the UTF-8 byte sequence.
95a1a48b
JH
1217
1218=item *
1219
376d9008
JB
1220C<utf8_length(start, end)> returns the length of the UTF-8 encoded buffer
1221in characters. C<sv_len_utf8(sv)> returns the length of the UTF-8 encoded
95a1a48b
JH
1222scalar.
1223
1224=item *
1225
376d9008
JB
1226C<sv_utf8_upgrade(sv)> converts the string of the scalar to its UTF-8
1227encoded form. C<sv_utf8_downgrade(sv)> does the opposite, if
1228possible. C<sv_utf8_encode(sv)> is like sv_utf8_upgrade except that
1229it does not set the C<UTF8> flag. C<sv_utf8_decode()> does the
1230opposite of C<sv_utf8_encode()>. Note that none of these are to be
1231used as general-purpose encoding or decoding interfaces: C<use Encode>
1232for that. C<sv_utf8_upgrade()> is affected by the encoding pragma
1233but C<sv_utf8_downgrade()> is not (since the encoding pragma is
1234designed to be a one-way street).
95a1a48b
JH
1235
1236=item *
1237
376d9008 1238C<is_utf8_char(s)> returns true if the pointer points to a valid UTF-8
90f968e0 1239character.
95a1a48b
JH
1240
1241=item *
1242
376d9008 1243C<is_utf8_string(buf, len)> returns true if C<len> bytes of the buffer
95a1a48b
JH
1244are valid UTF-8.
1245
1246=item *
1247
376d9008
JB
1248C<UTF8SKIP(buf)> will return the number of bytes in the UTF-8 encoded
1249character in the buffer. C<UNISKIP(chr)> will return the number of bytes
1250required to UTF-8-encode the Unicode character code point. C<UTF8SKIP()>
90f968e0 1251is useful for example for iterating over the characters of a UTF-8
376d9008 1252encoded buffer; C<UNISKIP()> is useful, for example, in computing
90f968e0 1253the size required for a UTF-8 encoded buffer.
95a1a48b
JH
1254
1255=item *
1256
376d9008 1257C<utf8_distance(a, b)> will tell the distance in characters between the
95a1a48b
JH
1258two pointers pointing to the same UTF-8 encoded buffer.
1259
1260=item *
1261
C<utf8_hop(s, off)> will return a pointer to a UTF-8 encoded buffer
that is C<off> (positive or negative) Unicode characters displaced
from the UTF-8 buffer C<s>.  Be careful not to overstep the buffer:
C<utf8_hop()> will merrily run off the end or the beginning of the
buffer if told to do so.
95a1a48b 1267
d2cc3551
JH
1268=item *
1269
376d9008
JB
1270C<pv_uni_display(dsv, spv, len, pvlim, flags)> and
1271C<sv_uni_display(dsv, ssv, pvlim, flags)> are useful for debugging the
1272output of Unicode strings and scalars. By default they are useful
1273only for debugging--they display B<all> characters as hexadecimal code
1bfb14c4
JH
1274points--but with the flags C<UNI_DISPLAY_ISPRINT>,
1275C<UNI_DISPLAY_BACKSLASH>, and C<UNI_DISPLAY_QQ> you can make the
1276output more readable.
d2cc3551
JH
1277
1278=item *
1279
C<ibcmp_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to
compare two strings case-insensitively in Unicode.  For case-sensitive
comparisons you can just use C<memEQ()> and C<memNE()> as usual.
d2cc3551 1283
c349b1b9
JH
1284=back
1285
95a1a48b
JH
1286For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h>
1287in the Perl source code distribution.
1288
c29a771d
JH
1289=head1 BUGS
1290
376d9008 1291=head2 Interaction with Locales
7eabb34d 1292
376d9008
JB
1293Use of locales with Unicode data may lead to odd results. Currently,
1294Perl attempts to attach 8-bit locale info to characters in the range
12950..255, but this technique is demonstrably incorrect for locales that
1296use characters above that range when mapped into Unicode. Perl's
1297Unicode support will also tend to run slower. Use of locales with
1298Unicode is discouraged.
c29a771d 1299
376d9008 1300=head2 Interaction with Extensions
7eabb34d 1301
376d9008 1302When Perl exchanges data with an extension, the extension should be
7eabb34d 1303able to understand the UTF-8 flag and act accordingly. If the
376d9008
JB
1304extension doesn't know about the flag, it's likely that the extension
1305will return incorrectly-flagged data.
7eabb34d
A
1306
So if you're working with Unicode data, consult the documentation of
every module you're using to see whether there are any issues with
Unicode data exchange.  If the documentation does not talk about Unicode
at all, suspect the worst and probably look at the source to learn how
the module is implemented.  Modules written completely in Perl shouldn't
cause problems.  Modules that directly or indirectly access code written
in other programming languages are at risk.
7eabb34d 1314
376d9008 1315For affected functions, the simple strategy to avoid data corruption is
7eabb34d 1316to always make the encoding of the exchanged data explicit. Choose an
376d9008 1317encoding that you know the extension can handle. Convert arguments passed
7eabb34d
A
1318to the extensions to that encoding and convert results back from that
1319encoding. Write wrapper functions that do the conversions for you, so
1320you can later change the functions when the extension catches up.
1321
376d9008 1322To provide an example, let's say the popular Foo::Bar::escape_html
7eabb34d
A
1323function doesn't deal with Unicode data yet. The wrapper function
1324would convert the argument to raw UTF-8 and convert the result back to
376d9008 1325Perl's internal representation like so:
7eabb34d
A
1326
    sub my_escape_html ($) {
      my($what) = shift;
      return unless defined $what;
      Encode::decode_utf8(Foo::Bar::escape_html(Encode::encode_utf8($what)));
    }
1332
1333Sometimes, when the extension does not convert data but just stores
1334and retrieves them, you will be in a position to use the otherwise
1335dangerous Encode::_utf8_on() function. Let's say the popular
66b79f27 1336C<Foo::Bar> extension, written in C, provides a C<param> method that
7eabb34d
A
1337lets you store and retrieve data according to these prototypes:
1338
1339 $self->param($name, $value); # set a scalar
1340 $value = $self->param($name); # retrieve a scalar
1341
1342If it does not yet provide support for any encoding, one could write a
1343derived class with such a C<param> method:
1344
    sub param {
      my($self,$name,$value) = @_;
      utf8::upgrade($name);     # make sure it is UTF-8 encoded
      if (defined $value) {
        utf8::upgrade($value);  # make sure it is UTF-8 encoded
        return $self->SUPER::param($name,$value);
      } else {
        my $ret = $self->SUPER::param($name);
        Encode::_utf8_on($ret); # we know, it is UTF-8 encoded
        return $ret;
      }
    }
1357
Some extensions provide filters on data entry/exit points, such as
DB_File::filter_store_key and family.  Look out for such filters in
the documentation of your extensions; they can make the transition to
Unicode data much easier.
1362
376d9008 1363=head2 Speed
7eabb34d 1364
c29a771d 1365Some functions are slower when working on UTF-8 encoded strings than
574c8022 1366on byte encoded strings. All functions that need to hop over
7c17141f
JH
1367characters such as length(), substr() or index(), or matching regular
1368expressions can work B<much> faster when the underlying data are
1369byte-encoded.
1370
In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
a caching scheme was introduced which will hopefully make the slowness
somewhat less spectacular, at least for some operations.  In general,
operations with UTF-8 encoded strings are still slower.  As an example,
the Unicode properties (character classes) like C<\p{Nd}> are known to
be quite a bit slower (5-20 times) than their simpler counterparts
like C<\d> (then again, there are 268 Unicode characters matching C<Nd>
compared with the 10 ASCII characters matching C<\d>).
666f95b9 1379
c8d992ba
A
1380=head2 Porting code from perl-5.6.X
1381
1382Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer
1383was required to use the C<utf8> pragma to declare that a given scope
1384expected to deal with Unicode data and had to make sure that only
1385Unicode data were reaching that scope. If you have code that is
1386working with 5.6, you will need some of the following adjustments to
1387your code. The examples are written such that the code will continue
1388to work under 5.6, so you should be safe to try them out.
1389
1390=over 4
1391
1392=item *
1393
1394A filehandle that should read or write UTF-8
1395
    if ($] > 5.007) {
      binmode $fh, ":utf8";
    }
1399
1400=item *
1401
1402A scalar that is going to be passed to some extension
1403
1404Be it Compress::Zlib, Apache::Request or any extension that has no
1405mention of Unicode in the manpage, you need to make sure that the
1406UTF-8 flag is stripped off. Note that at the time of this writing
1407(October 2002) the mentioned modules are not UTF-8-aware. Please
1408check the documentation to verify if this is still true.
1409
    if ($] > 5.007) {
      require Encode;
      $val = Encode::encode_utf8($val); # make octets
    }
1414
1415=item *
1416
1417A scalar we got back from an extension
1418
1419If you believe the scalar comes back as UTF-8, you will most likely
1420want the UTF-8 flag restored:
1421
    if ($] > 5.007) {
      require Encode;
      $val = Encode::decode_utf8($val);
    }
1426
1427=item *
1428
1429Same thing, if you are really sure it is UTF-8
1430
    if ($] > 5.007) {
      require Encode;
      Encode::_utf8_on($val);
    }
1435
1436=item *
1437
1438A wrapper for fetchrow_array and fetchrow_hashref
1439
1440When the database contains only UTF-8, a wrapper function or method is
1441a convenient way to replace all your fetchrow_array and
1442fetchrow_hashref calls. A wrapper function will also make it easier to
1443adapt to future enhancements in your database driver. Note that at the
1444time of this writing (October 2002), the DBI has no standardized way
1445to deal with UTF-8 data. Please check the documentation to verify if
1446that is still true.
1447
    sub fetchrow {
      my($self, $sth, $what) = @_; # $what is one of fetchrow_{array,hashref}
      if ($] < 5.007) {
        return $sth->$what;
      } else {
        require Encode;
        if (wantarray) {
          my @arr = $sth->$what;
          for (@arr) {
            defined && /[^\000-\177]/ && Encode::_utf8_on($_);
          }
          return @arr;
        } else {
          my $ret = $sth->$what;
          if (ref $ret) {
            for my $k (keys %$ret) {
              defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret->{$k};
            }
            return $ret;
          } else {
            defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
            return $ret;
          }
        }
      }
    }
1474
1475
1476=item *
1477
1478A large scalar that you know can only contain ASCII
1479
1480Scalars that contain only ASCII and are marked as UTF-8 are sometimes
1481a drag to your program. If you recognize such a situation, just remove
1482the UTF-8 flag:
1483
    utf8::downgrade($val) if $] > 5.007;
1485
1486=back
1487
393fec97
GS
1488=head1 SEE ALSO
1489
72ff2908 1490L<perluniintro>, L<encoding>, L<Encode>, L<open>, L<utf8>, L<bytes>,
a05d7ebb 1491L<perlretut>, L<perlvar/"${^UNICODE}">
393fec97
GS
1492
1493=cut