=head1 NAME

perlunicode - Unicode support in Perl

=head1 DESCRIPTION

=head2 Important Caveats

Unicode support is an extensive requirement. While Perl does not
implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.

=over 4

=item Input and Output Layers

Perl knows when a filehandle uses Perl's internal Unicode encodings
(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
the ":utf8" layer. Other encodings can be converted to Perl's
encoding on input or from Perl's encoding on output by use of the
":encoding(...)" layer. See L<open>.

To indicate that Perl source itself is using a particular encoding,
see L<encoding>.

=item Regular Expressions

The regular expression compiler produces polymorphic opcodes. That is,
the pattern adapts to the data and automatically switches to the Unicode
character scheme when presented with Unicode data--or instead uses
a traditional byte scheme when presented with byte data.

=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts

As a compatibility measure, the C<use utf8> pragma must be explicitly
included to enable recognition of UTF-8 in the Perl scripts themselves
(in string or regular expression literals, or in identifier names) on
ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
machines. B<These are the only times when an explicit C<use utf8>
is needed.> See L<utf8>.

You can also use the C<encoding> pragma to change the default encoding
of the data in your script; see L<encoding>.

=item C<use encoding> needed to upgrade non-Latin-1 byte strings

By default, there is a fundamental asymmetry in Perl's Unicode model:
implicit upgrading from byte strings to Unicode strings assumes that
they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
downgraded with UTF-8 encoding. This happens because the first 256
code points in Unicode happen to agree with Latin-1.

If you wish to interpret byte strings as UTF-8 instead, use the
C<encoding> pragma:

    use encoding 'utf8';

See L</"Byte and Character Semantics"> for more details.

=back
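
The layer behavior described under "Input and Output Layers" can be
sketched as follows. This is a minimal illustration, not a recommended
recipe: it assumes a Perl built with PerlIO (so that in-memory
filehandles accept layers), and the variable names are made up for the
example.

```perl
use strict;
use warnings;

# One character outside Latin-1: U+263A WHITE SMILING FACE.
my $smiley = "\x{263A}";

# Write through an encoding layer into an in-memory file.
my $octets = '';
open(my $out, '>:encoding(UTF-8)', \$octets) or die "open failed: $!";
print {$out} $smiley;
close($out) or die "close failed: $!";

# Read the same octets back through the layer to get characters again.
open(my $in, '<:encoding(UTF-8)', \$octets) or die "open failed: $!";
my $text = <$in>;
close($in);

# The encoded form is three octets long, the decoded form one character.
printf "octets: %d, characters: %d\n", length($octets), length($text);
```

The layer does the conversion at the filehandle boundary, so the rest of
the program only ever sees characters.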

=head2 Byte and Character Semantics

Beginning with version 5.6, Perl uses logically-wide characters to
represent strings internally.

In the future, Perl-level operations will be expected to work with
characters rather than bytes.

However, as an interim compatibility measure, Perl aims to
provide a safe migration path from byte semantics to character
semantics for programs. For operations where Perl can unambiguously
decide that the input data are characters, Perl switches to
character semantics. For operations where this determination cannot
be made without additional information from the user, Perl decides in
favor of compatibility and chooses to use byte semantics.

This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if
none of the program's inputs were marked as being a source of Unicode
character data. Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as
C<%ENV>), or from literals and constants in the source text.

The C<bytes> pragma will always, regardless of platform, force byte
semantics in a particular lexical scope. See L<bytes>.

The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
Note that this pragma is only required while Perl defaults to byte
semantics; when character semantics become the default, this pragma
may become a no-op. See L<utf8>.

Unless explicitly stated, Perl operators use character semantics
for Unicode data and byte semantics for non-Unicode data.
The decision to use character semantics is made transparently. If
input data comes from a Unicode source--for example, if a character
encoding layer is added to a filehandle or a literal Unicode
string constant appears in a program--character semantics apply.
Otherwise, byte semantics are in effect. The C<bytes> pragma should
be used to force byte semantics on Unicode data.

If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will be created by
decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
old Unicode string used EBCDIC. This translation is done without
regard to the system's native 8-bit encoding. To change this for
systems with non-Latin-1 and non-EBCDIC native encodings, use the
C<encoding> pragma. See L<encoding>.

Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
logically just a number ranging from 0 to 2**31 or so. Larger
characters may encode into longer sequences of bytes internally, but
this internal detail is mostly hidden from Perl code.
See L<perluniintro> for more.
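
As a rough illustration of the difference between the two semantics, the
sketch below measures the same string both ways and then concatenates a
byte string with a Unicode string. It assumes an ASCII-based platform
where the internal encoding is UTF-8; the variable names are invented for
the example.

```perl
use strict;
use warnings;

my $unicode = "\x{263A}";   # one character with a code point above 255
my $latin1  = "\xE9";       # one byte, not marked as Unicode

# Character semantics: length counts characters.
my $chars = length($unicode);

# Byte semantics forced for one lexical scope by the bytes pragma.
my $bytes = do { use bytes; length($unicode) };

# Concatenation upgrades the byte string as if it were Latin-1,
# so 0xE9 becomes the character U+00E9.
my $joined = $latin1 . $unicode;

printf "chars=%d bytes=%d joined=%d first=U+%04X\n",
    $chars, $bytes, length($joined), ord($joined);
```

The byte count exposes the internal encoding, which is exactly the detail
that character semantics are meant to hide.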

=head2 Effects of Character Semantics

Character semantics have the following effects:

=over 4

=item *

Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.

If you use a Unicode editor to edit your program, Unicode characters
may occur directly within the literal strings in one of the various
Unicode encodings (UTF-8, UTF-EBCDIC, UCS-2, etc.), but will be recognized
as such and converted to Perl's internal representation only if the
appropriate L<encoding> is specified.

Unicode characters can also be added to a string by using the
C<\x{...}> notation. The Unicode code for the desired character, in
hexadecimal, should be placed in the braces. For instance, a smiley
face is C<\x{263A}>. This encoding scheme only works for characters
with a code of 0x100 or above.

Additionally, if you

    use charnames ':full';

you can use the C<\N{...}> notation and put the official Unicode
character name within the braces, such as C<\N{WHITE SMILING FACE}>.

=item *

If an appropriate L<encoding> is specified, identifiers within the
Perl script may contain Unicode alphanumeric characters, including
ideographs. Perl does not currently attempt to canonicalize variable
names.

=item *

Regular expressions match characters instead of bytes. "." matches
a character instead of a byte. The C<\C> pattern is provided to force
a match of a single byte--a C<char> in C, hence C<\C>.

=item *

Character classes in regular expressions match characters instead of
bytes and match against the character properties specified in the
Unicode properties database. C<\w> can be used to match a Japanese
ideograph, for instance.

=item *

Named Unicode properties, scripts, and block ranges may be used like
character classes via the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".

For instance, C<\p{Lu}> matches any character with the Unicode "Lu"
(Letter, uppercase) property, while C<\p{M}> matches any character
with an "M" (mark--accents and such) property. Braces are not
required for single letter properties, so C<\p{M}> is equivalent to
C<\pM>. Many predefined properties are available, such as
C<\p{Mirrored}> and C<\p{Tibetan}>.

The official Unicode script and block names have spaces and dashes as
separators, but for convenience you can use dashes, spaces, or
underbars, and case is unimportant. It is recommended, however, that
for consistency you use the following naming: the official Unicode
script, property, or block name (see below for the additional rules
that apply to block names) with whitespace and dashes removed, and
each word written "uppercase-first-lowercase-rest". C<Latin-1 Supplement>
thus becomes C<Latin1Supplement>.

You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
(^) between the first brace and the property name: C<\p{^Tamil}> is
equal to C<\P{Tamil}>.

B<NOTE: the properties, scripts, and blocks listed here are as of
Unicode 3.2.0, March 2002, or Perl 5.8.0, July 2002. Unicode 4.0.0
came out in April 2003, and Perl 5.8.1 in September 2003.>

Here are the basic Unicode General Category properties, followed by their
long form. You can use either; C<\p{Lu}> and C<\p{UppercaseLetter}>,
for instance, are identical.

    Short       Long

    L           Letter
    Lu          UppercaseLetter
    Ll          LowercaseLetter
    Lt          TitlecaseLetter
    Lm          ModifierLetter
    Lo          OtherLetter

    M           Mark
    Mn          NonspacingMark
    Mc          SpacingMark
    Me          EnclosingMark

    N           Number
    Nd          DecimalNumber
    Nl          LetterNumber
    No          OtherNumber

    P           Punctuation
    Pc          ConnectorPunctuation
    Pd          DashPunctuation
    Ps          OpenPunctuation
    Pe          ClosePunctuation
    Pi          InitialPunctuation
                (may behave like Ps or Pe depending on usage)
    Pf          FinalPunctuation
                (may behave like Ps or Pe depending on usage)
    Po          OtherPunctuation

    S           Symbol
    Sm          MathSymbol
    Sc          CurrencySymbol
    Sk          ModifierSymbol
    So          OtherSymbol

    Z           Separator
    Zs          SpaceSeparator
    Zl          LineSeparator
    Zp          ParagraphSeparator

    C           Other
    Cc          Control
    Cf          Format
    Cs          Surrogate   (not usable)
    Co          PrivateUse
    Cn          Unassigned

Single-letter properties match all characters in any of the
two-letter sub-properties starting with the same letter.
C<L&> is a special case, which is an alias for C<Ll>, C<Lu>, and C<Lt>.

Because Perl hides the need for the user to understand the internal
representation of Unicode characters, there is no need to implement
the somewhat messy concept of surrogates. C<Cs> is therefore not
supported.

Because scripts differ in their directionality--Hebrew is
written right to left, for example--Unicode supplies these properties:

    Property    Meaning

    BidiL       Left-to-Right
    BidiLRE     Left-to-Right Embedding
    BidiLRO     Left-to-Right Override
    BidiR       Right-to-Left
    BidiAL      Right-to-Left Arabic
    BidiRLE     Right-to-Left Embedding
    BidiRLO     Right-to-Left Override
    BidiPDF     Pop Directional Format
    BidiEN      European Number
    BidiES      European Number Separator
    BidiET      European Number Terminator
    BidiAN      Arabic Number
    BidiCS      Common Number Separator
    BidiNSM     Non-Spacing Mark
    BidiBN      Boundary Neutral
    BidiB       Paragraph Separator
    BidiS       Segment Separator
    BidiWS      Whitespace
    BidiON      Other Neutrals

For example, C<\p{BidiR}> matches characters that are normally
written right to left.

=back
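
The notations from the list above can be tried out with a short script.
This is only a sketch: the sample characters are arbitrary choices, and
the result variables are invented for the example. Each test yields 1
when the pattern matches.

```perl
use strict;
use warnings;
use charnames ':full';

# \N{...} and \x{...} name the same character.
my $smiley = "\N{WHITE SMILING FACE}";
my $same   = ($smiley eq "\x{263A}") ? 1 : 0;

# Short and long General Category names are interchangeable.
my $upper = ("A" =~ /^\p{Lu}$/ && "A" =~ /^\p{UppercaseLetter}$/) ? 1 : 0;

# Braces are optional for single-letter properties
# (U+0301 is COMBINING ACUTE ACCENT, a mark).
my $mark = ("\x{0301}" =~ /^\pM$/) ? 1 : 0;

# \p{^Tamil} and \P{Tamil} both mean "not Tamil".
my $not_tamil = ("A" =~ /^\p{^Tamil}$/ && "A" =~ /^\P{Tamil}$/) ? 1 : 0;

# \w matches an ideograph (U+65E5) under character semantics.
my $ideograph = ("\x{65E5}" =~ /^\w$/) ? 1 : 0;

print "same=$same upper=$upper mark=$mark ",
      "not_tamil=$not_tamil ideograph=$ideograph\n";
```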

=head2 Scripts

The script names which can be used by C<\p{...}> and C<\P{...}>,
such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows:

    Arabic
    Armenian
    Bengali
    Bopomofo
    Buhid
    CanadianAboriginal
    Cherokee
    Cyrillic
    Deseret
    Devanagari
    Ethiopic
    Georgian
    Gothic
    Greek
    Gujarati
    Gurmukhi
    Han
    Hangul
    Hanunoo
    Hebrew
    Hiragana
    Inherited
    Kannada
    Katakana
    Khmer
    Lao
    Latin
    Malayalam
    Mongolian
    Myanmar
    Ogham
    OldItalic
    Oriya
    Runic
    Sinhala
    Syriac
    Tagalog
    Tagbanwa
    Tamil
    Telugu
    Thaana
    Thai
    Tibetan
    Yi

Extended property classes can supplement the basic
properties, defined by the F<PropList> Unicode database:

    ASCIIHexDigit
    BidiControl
    Dash
    Deprecated
    Diacritic
    Extender
    GraphemeLink
    HexDigit
    Hyphen
    Ideographic
    IDSBinaryOperator
    IDSTrinaryOperator
    JoinControl
    LogicalOrderException
    NoncharacterCodePoint
    OtherAlphabetic
    OtherDefaultIgnorableCodePoint
    OtherGraphemeExtend
    OtherLowercase
    OtherMath
    OtherUppercase
    QuotationMark
    Radical
    SoftDotted
    TerminalPunctuation
    UnifiedIdeograph
    WhiteSpace

and there are further derived properties:

    Alphabetic      Lu + Ll + Lt + Lm + Lo + OtherAlphabetic
    Lowercase       Ll + OtherLowercase
    Uppercase       Lu + OtherUppercase
    Math            Sm + OtherMath

    ID_Start        Lu + Ll + Lt + Lm + Lo + Nl
    ID_Continue     ID_Start + Mn + Mc + Nd + Pc

    Any             Any character
    Assigned        Any non-Cn character (i.e. synonym for \P{Cn})
    Unassigned      Synonym for \p{Cn}
    Common          Any character (or unassigned code point)
                    not explicitly assigned to a script

For backward compatibility (with Perl 5.6), all properties mentioned
so far may have C<Is> prepended to their name, so C<\P{IsLu}>, for
example, is equal to C<\P{Lu}>.
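
A small sketch of script, derived-property, and C<Is>-prefixed tests
follows. The sample character (U+0416 CYRILLIC CAPITAL LETTER ZHE) and
the variable names are arbitrary choices for the example.

```perl
use strict;
use warnings;

my $zhe = "\x{0416}";   # CYRILLIC CAPITAL LETTER ZHE

# Script test, with and without the optional "Is" prefix.
my $script = ($zhe =~ /^\p{Cyrillic}$/ && $zhe =~ /^\p{IsCyrillic}$/) ? 1 : 0;

# General Category test with the "Is" prefix.
my $is_lu = ($zhe =~ /^\p{IsLu}$/) ? 1 : 0;

# Derived properties: an uppercase letter is Alphabetic, Uppercase, Assigned.
my $derived = ($zhe =~ /^\p{Alphabetic}$/
            && $zhe =~ /^\p{Uppercase}$/
            && $zhe =~ /^\p{Assigned}$/) ? 1 : 0;

# The character does not belong to the Latin script.
my $not_latin = ($zhe =~ /^\P{Latin}$/) ? 1 : 0;

print "script=$script is_lu=$is_lu derived=$derived not_latin=$not_latin\n";
```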

=head2 Blocks

In addition to B<scripts>, Unicode also defines B<blocks> of
characters. The difference between scripts and blocks is that the
concept of scripts is closer to natural languages, while the concept
of blocks is more of an artificial grouping based on groups of 256
Unicode characters. For example, the C<Latin> script contains letters
from many blocks but does not contain all the characters from those
blocks. It does not, for example, contain digits, because digits are
shared across many scripts. Digits and similar groups, like
punctuation, are in a category called C<Common>.

For more about scripts, see UTR #24:

    http://www.unicode.org/unicode/reports/tr24/

For more about blocks, see:

    http://www.unicode.org/Public/UNIDATA/Blocks.txt

Block names are given with the C<In> prefix. For example, the
Katakana block is referenced via C<\p{InKatakana}>. The C<In>
prefix may be omitted if there is no naming conflict with a script
or any other property, but it is recommended that C<In> always be used
for block tests to avoid confusion.
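
The digit example from the text can be checked directly: an ASCII digit
lies in the Basic Latin block but belongs to the C<Common> category
rather than to the C<Latin> script. The sketch below assumes nothing
beyond the property names listed in this document; the variable names
are made up.

```perl
use strict;
use warnings;

my $digit  = "5";
my $letter = "a";

# Block membership is purely positional.
my $digit_in_block = ($digit =~ /^\p{InBasicLatin}$/) ? 1 : 0;

# Script membership is not: the digit is Common, the letter is Latin.
my $digit_is_latin  = ($digit  =~ /^\p{Latin}$/)  ? 1 : 0;
my $digit_is_common = ($digit  =~ /^\p{Common}$/) ? 1 : 0;
my $letter_is_latin = ($letter =~ /^\p{Latin}$/)  ? 1 : 0;

print "block=$digit_in_block latin=$digit_is_latin ",
      "common=$digit_is_common letter_latin=$letter_is_latin\n";
```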

These block names are supported:

    InAlphabeticPresentationForms
    InArabic
    InArabicPresentationFormsA
    InArabicPresentationFormsB
    InArmenian
    InArrows
    InBasicLatin
    InBengali
    InBlockElements
    InBopomofo
    InBopomofoExtended
    InBoxDrawing
    InBraillePatterns
    InBuhid
    InByzantineMusicalSymbols
    InCJKCompatibility
    InCJKCompatibilityForms
    InCJKCompatibilityIdeographs
    InCJKCompatibilityIdeographsSupplement
    InCJKRadicalsSupplement
    InCJKSymbolsAndPunctuation
    InCJKUnifiedIdeographs
    InCJKUnifiedIdeographsExtensionA
    InCJKUnifiedIdeographsExtensionB
    InCherokee
    InCombiningDiacriticalMarks
    InCombiningDiacriticalMarksforSymbols
    InCombiningHalfMarks
    InControlPictures
    InCurrencySymbols
    InCyrillic
    InCyrillicSupplementary
    InDeseret
    InDevanagari
    InDingbats
    InEnclosedAlphanumerics
    InEnclosedCJKLettersAndMonths
    InEthiopic
    InGeneralPunctuation
    InGeometricShapes
    InGeorgian
    InGothic
    InGreekExtended
    InGreekAndCoptic
    InGujarati
    InGurmukhi
    InHalfwidthAndFullwidthForms
    InHangulCompatibilityJamo
    InHangulJamo
    InHangulSyllables
    InHanunoo
    InHebrew
    InHighPrivateUseSurrogates
    InHighSurrogates
    InHiragana
    InIPAExtensions
    InIdeographicDescriptionCharacters
    InKanbun
    InKangxiRadicals
    InKannada
    InKatakana
    InKatakanaPhoneticExtensions
    InKhmer
    InLao
    InLatin1Supplement
    InLatinExtendedA
    InLatinExtendedAdditional
    InLatinExtendedB
    InLetterlikeSymbols
    InLowSurrogates
    InMalayalam
    InMathematicalAlphanumericSymbols
    InMathematicalOperators
    InMiscellaneousMathematicalSymbolsA
    InMiscellaneousMathematicalSymbolsB
    InMiscellaneousSymbols
    InMiscellaneousTechnical
    InMongolian
    InMusicalSymbols
    InMyanmar
    InNumberForms
    InOgham
    InOldItalic
    InOpticalCharacterRecognition
    InOriya
    InPrivateUseArea
    InRunic
    InSinhala
    InSmallFormVariants
    InSpacingModifierLetters
    InSpecials
    InSuperscriptsAndSubscripts
    InSupplementalArrowsA
    InSupplementalArrowsB
    InSupplementalMathematicalOperators
    InSupplementaryPrivateUseAreaA
    InSupplementaryPrivateUseAreaB
    InSyriac
    InTagalog
    InTagbanwa
    InTags
    InTamil
    InTelugu
    InThaana
    InThai
    InTibetan
    InUnifiedCanadianAboriginalSyllabics
    InVariationSelectors
    InYiRadicals
    InYiSyllables

=over 4

=item *

The special pattern C<\X> matches any extended Unicode
sequence--"a combining character sequence" in Standardese--where the
first character is a base character and subsequent characters are mark
characters that apply to the base character. C<\X> is equivalent to
C<(?:\PM\pM*)>.

=item *

The C<tr///> operator translates characters instead of bytes. Note
that the C<tr///CU> functionality has been removed. For similar
functionality see C<pack('U0', ...)> and C<pack('C0', ...)>.

=item *

Case translation operators use the Unicode case translation tables
when character input is provided. Note that C<uc()>, or C<\U> in
interpolated strings, translates to uppercase, while C<ucfirst>,
or C<\u> in interpolated strings, translates to titlecase in languages
that make the distinction.

=item *

Most operators that deal with positions or lengths in a string will
automatically switch to using character positions, including
C<chop()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
C<sprintf()>, C<write()>, and C<length()>. Operators that
specifically do not switch include C<vec()>, C<pack()>, and
C<unpack()>. Operators that really don't care include C<chomp()>,
operators that treat strings as a bucket of bits such as C<sort()>,
and operators dealing with filenames.

=item *

The C<pack()>/C<unpack()> letters C<c> and C<C> do I<not> change,
since they are often used for byte-oriented formats. Again, think
C<char> in the C language.

There is a new C<U> specifier that converts between Unicode characters
and code points.

=item *

The C<chr()> and C<ord()> functions work on characters, similar to
C<pack("U")> and C<unpack("U")>, I<not> C<pack("C")> and
C<unpack("C")>. C<pack("C")> and C<unpack("C")> are methods for
emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.
While these methods reveal the internal encoding of Unicode strings,
that is not something one normally needs to care about at all.

=item *

The bit string operators, C<& | ^ ~>, can operate on character data.
However, for backward compatibility (for example, with bit string
operations on strings whose characters are all less than 256 in
ordinal value), one should not use C<~> (the bit complement) on
strings that mix characters with values less than 256 and characters
with values greater than 255. Most importantly,
DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
will not hold. The reason for this mathematical I<faux pas> is that
the complement cannot return B<both> the 8-bit (byte-wide) bit
complement B<and> the full character-wide bit complement.

=item *

C<lc()>, C<uc()>, C<lcfirst()>, and C<ucfirst()> work for the following
cases:

=over 8

=item *

the case mapping is from a single Unicode character to another
single Unicode character, or

=item *

the case mapping is from a single Unicode character to more
than one Unicode character.

=back

Locale-specific case mappings (Lithuanian, Turkish, Azeri) do B<not> work
since Perl does not understand the concept of Unicode locales.

See the Unicode Technical Report #21, Case Mappings, for more details.

=back

=over 4

=item *

And finally, C<scalar reverse()> reverses by character rather than by byte.

=back
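
Several of the operator behaviors listed above can be observed with a
short script. This is a sketch only: the sample strings are arbitrary,
the variable names are invented for the example, and the case mappings
shown are those of U+01C6 (LATIN SMALL LETTER DZ WITH CARON) and U+FB01
(LATIN SMALL LIGATURE FI) in the Unicode character database.

```perl
use strict;
use warnings;

my $text = "a\x{263A}b";              # three characters
my $len  = length($text);             # counts characters, not bytes
my $mid  = substr($text, 1, 1);       # the smiley
my $pos  = index($text, "b");         # character offset
my $rev  = scalar reverse($text);     # reversed by character

# Uppercase and titlecase differ for U+01C6.
my $dz    = "\x{01C6}";
my $upper = sprintf "U+%04X", ord(uc $dz);
my $title = sprintf "U+%04X", ord(ucfirst $dz);

# One-to-many mapping: the "fi" ligature uppercases to two characters.
my $fi = uc("\x{FB01}");

# \X matches a base character plus its combining marks as one unit.
my $cluster = ("e\x{0301}" =~ /^\X$/) ? 1 : 0;

# pack("U") and unpack("U") agree with chr() and ord().
my $packed = (pack("U", 0x263A) eq chr(0x263A)) ? 1 : 0;
my $code   = unpack("U", $mid);

print "len=$len pos=$pos upper=$upper title=$title fi=$fi ",
      "cluster=$cluster packed=$packed\n";
```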

=head2 User-Defined Character Properties

You can define your own character properties by defining subroutines
whose names begin with "In" or "Is". The subroutines must be defined
in the C<main> package. The user-defined properties can be used in the
regular expression C<\p> and C<\P> constructs. Note that the effect
is compile-time and immutable once defined.

The subroutines must return a specially-formatted string, with one
or more newline-separated lines. Each line must be one of the following:

=over 4

=item *

Two hexadecimal numbers separated by horizontal whitespace (space or
tab characters) denoting a range of Unicode code points to include.

=item *

Something to include, prefixed by "+": a built-in character
property (prefixed by "utf8::"), to represent all the characters in that
property; two hexadecimal code points for a range; or a single
hexadecimal code point.

=item *

Something to exclude, prefixed by "-": an existing character
property (prefixed by "utf8::"), for all the characters in that
property; two hexadecimal code points for a range; or a single
hexadecimal code point.

=item *

Something to negate, prefixed by "!": an existing character
property (prefixed by "utf8::") for all the characters except the
characters in the property; two hexadecimal code points for a range;
or a single hexadecimal code point.

=back

For example, to define a property that covers both the Japanese
syllabaries (hiragana and katakana), you can define

    sub InKana {
        return <<END;
    3040\t309F
    30A0\t30FF
    END
    }

Imagine that the here-doc end marker is at the beginning of the line.
Now you can use C<\p{InKana}> and C<\P{InKana}>.
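
A complete, runnable version of the range-based definition might look
like the sketch below. It builds the return string by concatenation
instead of a here-document, purely to avoid the end-marker placement
issue; the sample characters and result variables are arbitrary choices
for the example.

```perl
use strict;
use warnings;

# User-defined property covering the Hiragana and Katakana block ranges.
# Each line is "start<TAB>end" in hexadecimal, terminated by a newline.
sub InKana {
    return "3040\t309F\n"
         . "30A0\t30FF\n";
}

my $hiragana = ("\x{3042}" =~ /^\p{InKana}$/) ? 1 : 0;   # HIRAGANA LETTER A
my $katakana = ("\x{30A2}" =~ /^\p{InKana}$/) ? 1 : 0;   # KATAKANA LETTER A
my $cyrillic = ("\x{0416}" =~ /^\p{InKana}$/) ? 1 : 0;   # not kana
my $negated  = ("\x{0416}" =~ /^\P{InKana}$/) ? 1 : 0;   # \P negates

print "hiragana=$hiragana katakana=$katakana ",
      "cyrillic=$cyrillic negated=$negated\n";
```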

You could also have used the existing block property names:

    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    END
    }

Suppose you wanted to match only the allocated characters,
not the raw block ranges: in other words, you want to remove
the non-characters:

    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    -utf8::IsCn
    END
    }

The negation is useful for defining (surprise!) negated classes.

    sub InNotKana {
        return <<'END';
    !utf8::InHiragana
    -utf8::InKatakana
    +utf8::IsCn
    END
    }

You can also define your own mappings to be used in C<lc()>,
C<lcfirst()>, C<uc()>, and C<ucfirst()> (or their string-inlined versions).
The principle is the same: define subroutines in the C<main> package
with names like C<ToLower> (for C<lc()> and C<lcfirst()>), C<ToTitle> (for
the first character in C<ucfirst()>), and C<ToUpper> (for C<uc()>, and the
rest of the characters in C<ucfirst()>).

The string returned by the subroutines must now consist of three
hexadecimal numbers separated by tabs: start of the source
range, end of the source range, and start of the destination range.
For example:

    sub ToUpper {
        return <<END;
    0061\t0063\t0041
    END
    }

defines a C<uc()> mapping that causes only the characters "a", "b", and
"c" to be mapped to "A", "B", and "C"; all other characters remain
unchanged.

If there is no source range to speak of, that is, the mapping is from
a single character to another single character, leave the end of the
source range empty, but the two tab characters are still needed.
For example:

    sub ToLower {
        return <<END;
    0041\t\t0061
    END
    }

defines a C<lc()> mapping that causes only "A" to be mapped to "a"; all
other characters remain unchanged.

(For serious hackers only) If you want to introspect the default
mappings, you can find the data in the directory
C<$Config{privlib}>/F<unicore/To/>. The mapping data is returned as
the here-document, and the C<utf8::ToSpecFoo> are special exception
mappings derived from C<$Config{privlib}>/F<unicore/SpecialCasing.txt>.
The C<Digit> and C<Fold> mappings that one can see in the directory
are not directly user-accessible; one can use either the
C<Unicode::UCD> module or just match case-insensitively (that's when
the C<Fold> mapping is used).

A final note on the user-defined property tests and mappings: they
will be used only if the scalar has been marked as having Unicode
characters. Old byte-style strings will not be affected.

=head2 Character Encodings for Input and Output

See L<Encode>.

=head2 Unicode Regular Expression Support Level

The following list of Unicode support for regular expressions describes
all the features currently supported. The references to "Level N"
and the section numbers refer to the Unicode Technical Report 18,
"Unicode Regular Expression Guidelines", version 6 (Unicode 3.2.0,
Perl 5.8.0).

=over 4

=item *

Level 1 - Basic Unicode Support

        2.1 Hex Notation                        - done          [1]
            Named Notation                      - done          [2]
        2.2 Categories                          - done          [3][4]
        2.3 Subtraction                         - MISSING       [5][6]
        2.4 Simple Word Boundaries              - done          [7]
        2.5 Simple Loose Matches                - done          [8]
        2.6 End of Line                         - MISSING       [9][10]

        [ 1] \x{...}
        [ 2] \N{...}
        [ 3] . \p{...} \P{...}
        [ 4] now scripts (see UTR#24 Script Names) in addition to blocks
        [ 5] have negation
        [ 6] can use regular expression look-ahead [a]
             or user-defined character properties [b] to emulate subtraction
        [ 7] include Letters in word characters
        [ 8] note that Perl does Full case-folding in matching, not Simple:
             for example U+1F88 is equivalent with U+1F00 U+03B9,
             not with U+1F80. This difference matters for certain Greek
             capital letters with certain modifiers: the Full case-folding
             decomposes the letter, while the Simple case-folding would map
             it to a single character.
        [ 9] see UTR #13 Unicode Newline Guidelines
        [10] should do ^ and $ also on \x{85}, \x{2028} and \x{2029}
             (should also affect <>, $., and script line numbers)
             (the \x{85}, \x{2028} and \x{2029} do match \s)

[a] You can mimic class subtraction using lookahead.
For example, what UTR #18 might write as

    [{Greek}-[{UNASSIGNED}]]

in Perl can be written as:

    (?!\p{Unassigned})\p{InGreekAndCoptic}
    (?=\p{Assigned})\p{InGreekAndCoptic}

But in this particular example, you probably really want

    \p{GreekAndCoptic}

which will match assigned characters known to be part of the Greek script.

Also see the C<Unicode::Regex::Set> module; it implements the full
UTR #18 grouping, intersection, union, and removal (subtraction) syntax.

[b] See L</"User-Defined Character Properties">.

=item *

Level 2 - Extended Unicode Support

        3.1 Surrogates                          - MISSING       [11]
        3.2 Canonical Equivalents               - MISSING       [12][13]
        3.3 Locale-Independent Graphemes        - MISSING       [14]
        3.4 Locale-Independent Words            - MISSING       [15]
        3.5 Locale-Independent Loose Matches    - MISSING       [16]

        [11] Surrogates are solely a UTF-16 concept and Perl's internal
             representation is UTF-8. The Encode module does UTF-16, though.
        [12] see UTR#15 Unicode Normalization
        [13] have Unicode::Normalize but not integrated to regexes
        [14] have \X but at this level . should equal that
        [15] need three classes, not just \w and \W
        [16] see UTR#21 Case Mappings

=item *

Level 3 - Locale-Sensitive Support

        4.1 Locale-Dependent Categories         - MISSING
        4.2 Locale-Dependent Graphemes          - MISSING       [17][18]
        4.3 Locale-Dependent Words              - MISSING
        4.4 Locale-Dependent Loose Matches      - MISSING
        4.5 Locale-Dependent Ranges             - MISSING

        [17] see UTR#10 Unicode Collation Algorithms
        [18] have Unicode::Collate but not integrated to regexes

=back
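
The lookahead emulation of class subtraction from note [a] can be
exercised as follows. This is a sketch under one stated assumption:
U+0378 lies inside the Greek and Coptic block but is unassigned in the
Unicode versions this example was checked against, and a later Unicode
version could assign it. The variable names are invented for the
example.

```perl
use strict;
use warnings;

# "Greek and Coptic block minus unassigned code points", written as a
# negative lookahead followed by the block test.
my $assigned_greek = qr/^(?!\p{Unassigned})\p{InGreekAndCoptic}$/;

my $omega = ("\x{03A9}" =~ $assigned_greek) ? 1 : 0;  # assigned, in block
my $hole  = ("\x{0378}" =~ $assigned_greek) ? 1 : 0;  # in block, unassigned
my $zhe   = ("\x{0416}" =~ $assigned_greek) ? 1 : 0;  # assigned, other block

print "omega=$omega hole=$hole zhe=$zhe\n";
```

The lookahead consumes nothing, so the block test that follows still
matches exactly one character.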
861
c349b1b9
JH
862=head2 Unicode Encodings
863
376d9008
JB
864Unicode characters are assigned to I<code points>, which are abstract
865numbers. To use these numbers, various encodings are needed.
c349b1b9
JH
866
867=over 4
868
c29a771d 869=item *
5cb3728c
RB
870
871UTF-8
c349b1b9 872
3e4dbfed 873UTF-8 is a variable-length (1 to 6 bytes, current character allocations
376d9008
JB
874require 4 bytes), byte-order independent encoding. For ASCII (and we
875really do mean 7-bit ASCII, not another 8-bit encoding), UTF-8 is
876transparent.
c349b1b9 877
8c007b5a 878The following table is from Unicode 3.2.
05632f9a
JH
879
   Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte

     U+0000..U+007F       00..7F
     U+0080..U+07FF       C2..DF    80..BF
     U+0800..U+0FFF       E0        A0..BF    80..BF
     U+1000..U+CFFF       E1..EC    80..BF    80..BF
     U+D000..U+D7FF       ED        80..9F    80..BF
     U+D800..U+DFFF       ******* ill-formed *******
     U+E000..U+FFFF       EE..EF    80..BF    80..BF
    U+10000..U+3FFFF      F0        90..BF    80..BF    80..BF
    U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
   U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF
892
Note the C<A0..BF> in C<U+0800..U+0FFF>, the C<80..9F> in
C<U+D000..U+D7FF>, the C<90..BF> in C<U+10000..U+3FFFF>, and the
C<80..8F> in C<U+100000..U+10FFFF>. The "gaps" are caused by legal
UTF-8 avoiding non-shortest encodings: it is technically possible to
UTF-8-encode a single code point in different ways, but that is
explicitly forbidden, and the shortest possible encoding should always
be used. So that's what Perl does.
37361303 900
Another way to look at it is via bits:

 Code Points                    1st Byte  2nd Byte  3rd Byte  4th Byte

                    0aaaaaaa     0aaaaaaa
            00000bbbbbaaaaaa     110bbbbb  10aaaaaa
            ccccbbbbbbaaaaaa     1110cccc  10bbbbbb  10aaaaaa
  00000dddccccccbbbbbbaaaaaa     11110ddd  10cccccc  10bbbbbb  10aaaaaa
909
As you can see, the continuation bytes all begin with C<10>, and the
leading bits of the start byte tell how many bytes there are in the
encoded character.
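
The bit layout above can be followed by hand in a few lines of Perl.
This is only a minimal sketch, not how Perl itself encodes: the helper
name C<utf8_bytes> is made up for this illustration, it assumes a code
point no larger than C<U+10FFFF>, and it does not reject surrogates.
Its result is printed next to that of the core C<utf8::encode()>
function for comparison.

```perl
use strict;
use warnings;

# Hypothetical helper: encode one code point into a list of UTF-8 byte
# values by following the bit patterns in the table above.
sub utf8_bytes {
    my ($cp) = @_;
    return ($cp) if $cp < 0x80;                        # 0aaaaaaa
    return (0xC0 | ($cp >> 6),                         # 110bbbbb
            0x80 | ($cp & 0x3F))                       # 10aaaaaa
        if $cp < 0x800;
    return (0xE0 | ($cp >> 12),                        # 1110cccc
            0x80 | (($cp >> 6) & 0x3F),                # 10bbbbbb
            0x80 | ($cp & 0x3F))                       # 10aaaaaa
        if $cp < 0x10000;
    return (0xF0 | ($cp >> 18),                        # 11110ddd
            0x80 | (($cp >> 12) & 0x3F),               # 10cccccc
            0x80 | (($cp >> 6) & 0x3F),                # 10bbbbbb
            0x80 | ($cp & 0x3F));                      # 10aaaaaa
}

for my $cp (0x41, 0xE9, 0x263A, 0x10400) {
    my $manual = join ' ', map { sprintf '%02X', $_ } utf8_bytes($cp);

    my $str = chr($cp);
    utf8::encode($str);        # core function: characters -> UTF-8 octets
    my $core = join ' ', map { sprintf '%02X', ord($_) } split //, $str;

    printf "U+%04X  manual: %-12s core: %s\n", $cp, $manual, $core;
}
```

The two columns should agree for every code point in the loop; the
four sample values were chosen to exercise the one-, two-, three- and
four-byte rows of the table.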
913
c29a771d 914=item *
915
916UTF-EBCDIC
dbe420b4 917
376d9008 918Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.
dbe420b4 919
c29a771d 920=item *
5cb3728c 921
1e54db1a 922UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)
c349b1b9 923
The following items are mostly for reference and general Unicode
knowledge; Perl doesn't use these constructs internally.
dbe420b4 926
UTF-16 is a 2 or 4 byte encoding. The Unicode code points
C<U+0000..U+FFFF> are stored in a single 16-bit unit, and the code
points C<U+10000..U+10FFFF> in two 16-bit units. The latter case
uses I<surrogates>, the first 16-bit unit being the I<high
surrogate>, and the second being the I<low surrogate>.
932
Surrogates are code points set aside to encode the C<U+10000..U+10FFFF>
range of Unicode code points in pairs of 16-bit units. The I<high
surrogates> are the range C<U+D800..U+DBFF>, and the I<low surrogates>
are the range C<U+DC00..U+DFFF>. The surrogate encoding is

    $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
    $lo = ($uni - 0x10000) % 0x400 + 0xDC00;

and the decoding is

    $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
c349b1b9 944
feda178f 945If you try to generate surrogates (for example by using chr()), you
946will get a warning if warnings are turned on, because those code
947points are not valid for a Unicode character.
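
The surrogate arithmetic can be checked with a small round trip. The
sketch below uses two helper names, C<to_surrogates> and
C<from_surrogates>, that are invented for this example. It wraps the
division in C<int()>, because Perl's C</> operator is floating-point
division and would otherwise leave a fractional high surrogate.

```perl
use strict;
use warnings;

# Hypothetical helper: split a supplementary code point
# (U+10000..U+10FFFF) into a high/low surrogate pair.
sub to_surrogates {
    my ($uni) = @_;
    my $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
    my $lo =     ($uni - 0x10000) % 0x400  + 0xDC00;
    return ($hi, $lo);
}

# Hypothetical helper: combine a surrogate pair back into a code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
}

my ($hi, $lo) = to_surrogates(0x1F600);
printf "U+1F600 -> %04X %04X\n", $hi, $lo;             # D83D DE00
printf "back    -> U+%X\n", from_surrogates($hi, $lo); # U+1F600
```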
9466bab6 948
376d9008 949Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16
c349b1b9 950itself can be used for in-memory computations, but if storage or
951transfer is required either UTF-16BE (big-endian) or UTF-16LE
952(little-endian) encodings must be chosen.
953
954This introduces another problem: what if you just know that your data
955is UTF-16, but you don't know which endianness? Byte Order Marks, or
956BOMs, are a solution to this. A special character has been reserved
86bbd6d1 957in Unicode to function as a byte order marker: the character with the
376d9008 958code point C<U+FEFF> is the BOM.
042da322 959
c349b1b9 960The trick is that if you read a BOM, you will know the byte order,
961since if it was written on a big-endian platform, you will read the
962bytes C<0xFE 0xFF>, but if it was written on a little-endian platform,
963you will read the bytes C<0xFF 0xFE>. (And if the originating platform
964was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.)
042da322 965
The way this trick works is that the character with the code point
C<U+FFFE> is guaranteed not to be a valid Unicode character, so the
sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in
little-endian format" and cannot be "C<U+FFFE>, represented in
big-endian format".
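
A reader can apply this trick by looking at the first few bytes of the
data. The following is a minimal sketch under stated assumptions: the
helper name C<sniff_bom> is hypothetical, the UTF-32 signatures are the
ones listed under the UTF-32 item, and the function only reports what
the leading bytes look like. It cannot tell UTF-16LE text that begins
with a BOM followed by C<U+0000> apart from UTF-32LE.

```perl
use strict;
use warnings;

# Hypothetical helper: guess the encoding from the first bytes of a
# buffer.  The 4-byte UTF-32 signatures are tested before the 2-byte
# UTF-16 ones, because FF FE 00 00 also starts with FF FE.
sub sniff_bom {
    my ($bytes) = @_;
    return 'UTF-32BE' if $bytes =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-32LE' if $bytes =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return;            # no BOM: the encoding must be known some other way
}

print sniff_bom("\xFE\xFF\x00\x41"), "\n";             # UTF-16BE
print sniff_bom("\xFF\xFE\x41\x00"), "\n";             # UTF-16LE
print sniff_bom("\xEF\xBB\xBF" . "A"), "\n";           # UTF-8
print defined(sniff_bom("plain")) ? "BOM\n" : "no BOM\n";
```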
c349b1b9 971
c29a771d 972=item *
5cb3728c 973
1e54db1a 974UTF-32, UTF-32BE, UTF-32LE
975
The UTF-32 family is pretty much like the UTF-16 family, except that
the units are 32-bit, and therefore the surrogate scheme is not
needed. The BOM signatures will be C<0x00 0x00 0xFE 0xFF> for BE and
C<0xFF 0xFE 0x00 0x00> for LE.
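
To see how the unit sizes differ in practice, the same characters can
be encoded three ways with the Encode module. This is an illustrative
sketch only; it assumes the encoding names C<UTF-8>, C<UTF-16BE> and
C<UTF-32BE> as provided by Encode, and the three sample code points are
arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Show the octets each encoding produces for one character at a time.
for my $cp (0x41, 0x263A, 0x1F600) {
    printf "U+%04X:", $cp;
    for my $enc (qw(UTF-8 UTF-16BE UTF-32BE)) {
        my $char   = chr($cp);              # fresh copy for each call
        my $octets = encode($enc, $char);
        printf "  %s=%s", $enc, uc(unpack('H*', $octets));
    }
    print "\n";
}
```

For C<U+1F600> the UTF-16BE column shows a surrogate pair, while the
UTF-32BE column is simply the code point padded to 32 bits.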
c349b1b9 980
c29a771d 981=item *
982
983UCS-2, UCS-4
c349b1b9 984
86bbd6d1 985Encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit
376d9008 986encoding. Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
987because it does not use surrogates. UCS-4 is a 32-bit encoding,
988functionally identical to UTF-32.
c349b1b9 989
c29a771d 990=item *
991
992UTF-7
c349b1b9 993
994A seven-bit safe (non-eight-bit) encoding, which is useful if the
995transport or storage is not eight-bit safe. Defined by RFC 2152.
c349b1b9 996
997=back
998
999=head2 Security Implications of Unicode
1000
1001=over 4
1002
1003=item *
1004
1005Malformed UTF-8
1006
1007Unfortunately, the specification of UTF-8 leaves some room for
1008interpretation of how many bytes of encoded output one should generate
1009from one input Unicode character. Strictly speaking, the shortest
1010possible sequence of UTF-8 bytes should be generated,
1011because otherwise there is potential for an input buffer overflow at
feda178f 1012the receiving end of a UTF-8 connection. Perl always generates the
1013shortest length UTF-8, and with warnings on Perl will warn about
1014non-shortest length UTF-8 along with other malformations, such as the
1015surrogates, which are not real Unicode code points.
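
One way to check incoming octets before trusting them is the core
C<utf8::decode()> function, which returns false when its argument is
not well-formed UTF-8. The sketch below is illustrative rather than a
complete validator; the sample byte strings and their labels are
chosen for this example, with C<0xC0 0xAF> being a non-shortest
encoding of "/".

```perl
use strict;
use warnings;

# Octet strings to test; the labels are only for the printout.
my %samples = (
    'shortest form of "/"' => "\x2F",
    'overlong form of "/"' => "\xC0\xAF",
    'U+00E9 as UTF-8'      => "\xC3\xA9",
    'truncated sequence'   => "\xE2\x98",
);

for my $name (sort keys %samples) {
    my $octets = $samples{$name};
    # utf8::decode() works in place and returns false on malformed input.
    my $ok = utf8::decode($octets);
    printf "%-22s %s\n", $name, $ok ? 'accepted' : 'rejected';
}
```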
bf0fa0b2 1016
1017=item *
1018
Regular expressions behave slightly differently between byte data and
character (Unicode) data. For example, the "word character" character
class C<\w> will work differently depending on whether the data is
eight-bit bytes or Unicode.
0d7c09bb 1023
1024In the first case, the set of C<\w> characters is either small--the
1025default set of alphabetic characters, digits, and the "_"--or, if you
1026are using a locale (see L<perllocale>), the C<\w> might contain a few
1027more letters according to your language and country.
1028
376d9008 1029In the second case, the C<\w> set of characters is much, much larger.
1030Most importantly, even in the set of the first 256 characters, it will
1031probably match different characters: unlike most locales, which are
1032specific to a language and country pair, Unicode classifies all the
1033characters that are letters I<somewhere> as C<\w>. For example, your
1034locale might not think that LATIN SMALL LETTER ETH is a letter (unless
1035you happen to speak Icelandic), but Unicode does.
0d7c09bb 1036
376d9008 1037As discussed elsewhere, Perl has one foot (two hooves?) planted in
1038each of two worlds: the old world of bytes and the new world of
1039characters, upgrading from bytes to characters when necessary.
1040If your legacy code does not explicitly use Unicode, no automatic
1041switch-over to characters should happen. Characters shouldn't get
1042downgraded to bytes, either. It is possible to accidentally mix bytes
1043and characters, however (see L<perluniintro>), in which case C<\w> in
1044regular expressions might start behaving differently. Review your
1045code. Use warnings and the C<strict> pragma.
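
The difference in C<\w> can be observed with LATIN SMALL LETTER ETH
(C<0xF0>). The following sketch assumes that no locale is in effect
and that the traditional byte semantics apply; the C<no feature> line
is there to make that assumption explicit on Perls that know the
C<unicode_strings> feature, and would need to be removed on older
ones.

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # assumption: traditional byte semantics

my $eth = "\xF0";               # LATIN SMALL LETTER ETH, stored as one byte

printf "as bytes:      %s\n",
    $eth =~ /\w/ ? 'matches \w' : 'does not match \w';

utf8::upgrade($eth);            # same character, now stored as UTF-8

printf "as characters: %s\n",
    $eth =~ /\w/ ? 'matches \w' : 'does not match \w';
```

Under these assumptions the first line reports no match and the second
reports a match, although the string holds the same character both
times.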
1046
1047=back
1048
1049=head2 Unicode in Perl on EBCDIC
1050
1051The way Unicode is handled on EBCDIC platforms is still
1052experimental. On such platforms, references to UTF-8 encoding in this
1053document and elsewhere should be read as meaning the UTF-EBCDIC
1054specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues
c349b1b9 1055are specifically discussed. There is no C<utfebcdic> pragma or
376d9008 1056":utfebcdic" layer; rather, "utf8" and ":utf8" are reused to mean
1057the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic>
1058for more discussion of the issues.
c349b1b9 1059
1060=head2 Locales
1061
4616122b 1062Usually locale settings and Unicode do not affect each other, but
1063there are a couple of exceptions:
1064
1065=over 4
1066
1067=item *
1068
You can enable automatic UTF-8-ification of your standard file
handles, default C<open()> layer, and C<@ARGV> by using either
the C<-C> command line switch or the C<PERL_UNICODE> environment
variable; see L<perlrun> for the documentation of the C<-C> switch.
1073
1074=item *
1075
1076Perl tries really hard to work both with Unicode and the old
1077byte-oriented world. Most often this is nice, but sometimes Perl's
1078straddling of the proverbial fence causes problems.
1079
1080=back
1081
1082=head2 When Unicode Does Not Happen
1083
While Perl does have extensive ways to input and output in Unicode,
and a few other 'entry points' like C<@ARGV> which can be interpreted
as Unicode (UTF-8), there are still many places where Unicode (in some
encoding or another) could be given as arguments or received as
results, or both, but it is not.
1089
1090The following are such interfaces. For all of these Perl currently
1091(as of 5.8.1) simply assumes byte strings both as arguments and results.
1092
One reason why Perl does not attempt to resolve the role of Unicode in
these cases is that the answers are highly dependent on the operating
system and the file system(s). For example, whether filenames can be
in Unicode, and in exactly what kind of encoding, is not exactly a
portable concept. Similarly for C<qx> and C<system>: how well will the
'command line interface' (and which of them?) handle Unicode?
1099
1100=over 4
1101
1102=item *
1103
chdir, chmod, chown, chroot, exec, link, lstat, mkdir,
rename, rmdir, stat, symlink, truncate, unlink, utime, -X
1106
1107=item *
1108
1109%ENV
1110
1111=item *
1112
1113glob (aka the <*>)
1114
1115=item *
1aad1664 1116
557a2462 1117open, opendir, sysopen
1aad1664 1118
557a2462 1119=item *
1aad1664 1120
557a2462 1121qx (aka the backtick operator), system
1aad1664 1122
557a2462 1123=item *
1aad1664 1124
557a2462 1125readdir, readlink
1126
1127=back
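
Since these interfaces take and return byte strings, a program that
holds a file name as characters has to choose an encoding itself. The
sketch below assumes, purely for illustration, that the file system
stores names as UTF-8 octets; on other systems a different encoding
(or none at all) may be appropriate. The file name is made up for this
example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $name = "smiley-\x{263A}.txt";        # a character string

# Assumption: this file system stores names as UTF-8 octets.
my $octets = encode('UTF-8', $name);

open(my $fh, '>', "$dir/$octets") or die "Cannot create file: $!";
close($fh) or die "Cannot close file: $!";

# readdir returns octets; decode them to get characters back.
opendir(my $dh, $dir) or die "Cannot open directory: $!";
my @found = map  { decode('UTF-8', $_) }
            grep { $_ ne '.' && $_ ne '..' } readdir($dh);
closedir($dh);

print "round trip ok\n" if @found == 1 && $found[0] eq $name;
```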
1128
1129=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)
1130
Sometimes (see L</"When Unicode Does Not Happen">) there are
situations where you simply need to force Perl to believe that a byte
string is UTF-8, or vice versa. The low-level calls
C<utf8::upgrade($bytestring)> and C<utf8::downgrade($utf8string)> are
the answers.

Do not use them without careful thought, though: Perl may easily get
very confused, angry, or even crash, if you suddenly change the 'nature'
of a scalar like that. Be especially careful if you use
C<utf8::upgrade()>: an arbitrary byte string is not necessarily valid
UTF-8.
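
What the two calls do to a scalar can be inspected with
C<utf8::is_utf8()>. The following is a small sketch with sample
strings chosen for illustration; it shows that the internal
representation changes while the characters and the length stay the
same, and that a character above C<0xFF> cannot be downgraded.

```perl
use strict;
use warnings;

my $s = "caf\xE9";                 # four characters, stored as four bytes

utf8::upgrade($s);                 # now stored internally as UTF-8
printf "upgraded:   flag=%d length=%d\n",
    (utf8::is_utf8($s) ? 1 : 0), length($s);

utf8::downgrade($s);               # back to one byte per character
printf "downgraded: flag=%d length=%d\n",
    (utf8::is_utf8($s) ? 1 : 0), length($s);

# With a true second argument utf8::downgrade() returns false
# instead of dying when the string cannot be downgraded.
my $wide = "\x{263A}";
print "cannot downgrade U+263A\n" unless utf8::downgrade($wide, 1);
```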
1141
1142=head2 Using Unicode in XS
1143
1144If you want to handle Perl Unicode in XS extensions, you may find the
1145following C APIs useful. See also L<perlguts/"Unicode Support"> for an
1146explanation about Unicode at the XS level, and L<perlapi> for the API
1147details.
1148
1149=over 4
1150
1151=item *
1152
C<DO_UTF8(sv)> returns true if the C<UTF8> flag is on and the bytes
pragma is not in effect. C<SvUTF8(sv)> returns true if the C<UTF8>
flag is on; the bytes pragma is ignored. The C<UTF8> flag being on
does B<not> mean that there are any characters of code points greater
than 255 (or 127) in the scalar or that there are even any characters
in the scalar. What the C<UTF8> flag means is that the sequence of
octets in the representation of the scalar is the sequence of UTF-8
encoded code points of the characters of a string. The C<UTF8> flag
being off means that each octet in this representation encodes a
single character with code point 0..255 within the string. Perl's
Unicode model is not to use UTF-8 until it is absolutely necessary.
1164
1165=item *
1166
fb9cc174 1167C<uvuni_to_utf8(buf, chr)> writes a Unicode character code point into
1bfb14c4 1168a buffer encoding the code point as UTF-8, and returns a pointer
1169pointing after the UTF-8 bytes.
1170
1171=item *
1172
1173C<utf8_to_uvuni(buf, lenp)> reads UTF-8 encoded bytes from a buffer and
1174returns the Unicode character code point and, optionally, the length of
1175the UTF-8 byte sequence.
1176
1177=item *
1178
1179C<utf8_length(start, end)> returns the length of the UTF-8 encoded buffer
1180in characters. C<sv_len_utf8(sv)> returns the length of the UTF-8 encoded
1181scalar.
1182
1183=item *
1184
C<sv_utf8_upgrade(sv)> converts the string of the scalar to its UTF-8
encoded form. C<sv_utf8_downgrade(sv)> does the opposite, if
possible. C<sv_utf8_encode(sv)> is like C<sv_utf8_upgrade()> except that
it does not set the C<UTF8> flag. C<sv_utf8_decode()> does the
opposite of C<sv_utf8_encode()>. Note that none of these are to be
used as general-purpose encoding or decoding interfaces: C<use Encode>
for that. C<sv_utf8_upgrade()> is affected by the encoding pragma
but C<sv_utf8_downgrade()> is not (since the encoding pragma is
designed to be a one-way street).
1194
1195=item *
1196
376d9008 1197C<is_utf8_char(s)> returns true if the pointer points to a valid UTF-8
90f968e0 1198character.
1199
1200=item *
1201
376d9008 1202C<is_utf8_string(buf, len)> returns true if C<len> bytes of the buffer
1203are valid UTF-8.
1204
1205=item *
1206
1207C<UTF8SKIP(buf)> will return the number of bytes in the UTF-8 encoded
1208character in the buffer. C<UNISKIP(chr)> will return the number of bytes
1209required to UTF-8-encode the Unicode character code point. C<UTF8SKIP()>
90f968e0 1210is useful for example for iterating over the characters of a UTF-8
376d9008 1211encoded buffer; C<UNISKIP()> is useful, for example, in computing
90f968e0 1212the size required for a UTF-8 encoded buffer.
1213
1214=item *
1215
376d9008 1216C<utf8_distance(a, b)> will tell the distance in characters between the
1217two pointers pointing to the same UTF-8 encoded buffer.
1218
1219=item *
1220
C<utf8_hop(s, off)> will return a pointer to a UTF-8 encoded buffer
that is C<off> (positive or negative) Unicode characters displaced
from the UTF-8 buffer C<s>. Be careful not to overstep the buffer:
C<utf8_hop()> will merrily run off the end or the beginning of the
buffer if told to do so.
95a1a48b 1226
1227=item *
1228
1229C<pv_uni_display(dsv, spv, len, pvlim, flags)> and
1230C<sv_uni_display(dsv, ssv, pvlim, flags)> are useful for debugging the
1231output of Unicode strings and scalars. By default they are useful
1232only for debugging--they display B<all> characters as hexadecimal code
1233points--but with the flags C<UNI_DISPLAY_ISPRINT>,
1234C<UNI_DISPLAY_BACKSLASH>, and C<UNI_DISPLAY_QQ> you can make the
1235output more readable.
1236
1237=item *
1238
C<ibcmp_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to
compare two strings case-insensitively in Unicode. For case-sensitive
comparisons you can just use C<memEQ()> and C<memNE()> as usual.
d2cc3551 1242
1243=back
1244
1245For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h>
1246in the Perl source code distribution.
1247
1248=head1 BUGS
1249
376d9008 1250=head2 Interaction with Locales
7eabb34d 1251
1252Use of locales with Unicode data may lead to odd results. Currently,
1253Perl attempts to attach 8-bit locale info to characters in the range
12540..255, but this technique is demonstrably incorrect for locales that
1255use characters above that range when mapped into Unicode. Perl's
1256Unicode support will also tend to run slower. Use of locales with
1257Unicode is discouraged.
c29a771d 1258
376d9008 1259=head2 Interaction with Extensions
7eabb34d 1260
376d9008 1261When Perl exchanges data with an extension, the extension should be
7eabb34d 1262able to understand the UTF-8 flag and act accordingly. If the
1263extension doesn't know about the flag, it's likely that the extension
1264will return incorrectly-flagged data.
1265
So if you're working with Unicode data, consult the documentation of
every module you're using to see whether there are any issues with
Unicode data exchange. If the documentation does not talk about Unicode
at all, suspect the worst and consider looking at the source to learn
how the module is implemented. Modules written completely in Perl
shouldn't cause problems. Modules that directly or indirectly access
code written in other programming languages are at risk.
7eabb34d 1273
376d9008 1274For affected functions, the simple strategy to avoid data corruption is
7eabb34d 1275to always make the encoding of the exchanged data explicit. Choose an
376d9008 1276encoding that you know the extension can handle. Convert arguments passed
1277to the extensions to that encoding and convert results back from that
1278encoding. Write wrapper functions that do the conversions for you, so
1279you can later change the functions when the extension catches up.
1280
376d9008 1281To provide an example, let's say the popular Foo::Bar::escape_html
1282function doesn't deal with Unicode data yet. The wrapper function
1283would convert the argument to raw UTF-8 and convert the result back to
376d9008 1284Perl's internal representation like so:
1285
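    use Encode ();   # provides encode_utf8() and decode_utf8()

    sub my_escape_html ($) {
        my($what) = shift;
        return unless defined $what;
        # Hand raw UTF-8 octets to the extension, then decode its result.
        Encode::decode_utf8(Foo::Bar::escape_html(
                                Encode::encode_utf8($what)));
    }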
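
The same pattern can be tried without any extension installed. In the
sketch below, C<legacy_escape_html> is a hypothetical stand-in that
plays the role of C<Foo::Bar::escape_html>: it treats its argument as
plain octets and knows nothing about Unicode. Only the wrapper
converts between characters and octets.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for a byte-oriented extension function.
# It escapes the three HTML metacharacters and leaves other octets alone.
sub legacy_escape_html {
    my ($octets) = @_;
    my %ent = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;');
    $octets =~ s/([&<>])/$ent{$1}/g;
    return $octets;
}

# Wrapper: characters in, characters out; octets only at the boundary.
sub my_escape_html {
    my ($what) = @_;
    return unless defined $what;
    return Encode::decode_utf8(
        legacy_escape_html(Encode::encode_utf8($what)));
}

my $out = my_escape_html("<b>\x{263A}</b>");
printf "%d characters, smiley %s\n",
    length($out), index($out, "\x{263A}") >= 0 ? 'preserved' : 'lost';
```

Keeping the conversion in one wrapper means that only this function
has to change once the extension learns to handle Unicode itself.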
1291
1292Sometimes, when the extension does not convert data but just stores
1293and retrieves them, you will be in a position to use the otherwise
1294dangerous Encode::_utf8_on() function. Let's say the popular
66b79f27 1295C<Foo::Bar> extension, written in C, provides a C<param> method that
1296lets you store and retrieve data according to these prototypes:
1297
    $self->param($name, $value);            # set a scalar
    $value = $self->param($name);           # retrieve a scalar
1300
1301If it does not yet provide support for any encoding, one could write a
1302derived class with such a C<param> method:
1303
    sub param {
        my($self,$name,$value) = @_;
        utf8::upgrade($name);     # make sure it is UTF-8 encoded
        if (defined $value) {
            utf8::upgrade($value);  # make sure it is UTF-8 encoded
            return $self->SUPER::param($name,$value);
        } else {
            require Encode;         # needed for _utf8_on()
            my $ret = $self->SUPER::param($name);
            Encode::_utf8_on($ret); # we know, it is UTF-8 encoded
            return $ret;
        }
    }
1316
Some extensions provide filters on data entry/exit points, such as
DB_File::filter_store_key and family. Look out for such filters in
the documentation of your extensions; they can make the transition to
Unicode data much easier.
1321
376d9008 1322=head2 Speed
7eabb34d 1323
c29a771d 1324Some functions are slower when working on UTF-8 encoded strings than
574c8022 1325on byte encoded strings. All functions that need to hop over
1326characters such as length(), substr() or index(), or matching regular
1327expressions can work B<much> faster when the underlying data are
1328byte-encoded.
1329
In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
a caching scheme was introduced which will hopefully make the slowness
somewhat less spectacular, at least for some operations. In general,
operations with UTF-8 encoded strings are still slower. As an example,
the Unicode properties (character classes) like C<\p{Nd}> are known to
be quite a bit slower (5-20 times) than their simpler counterparts
like C<\d> (then again, there are 268 Unicode characters matching C<Nd>
compared with the 10 ASCII characters matching C<d>).
666f95b9 1338
1339=head2 Porting code from perl-5.6.X
1340
1341Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer
1342was required to use the C<utf8> pragma to declare that a given scope
1343expected to deal with Unicode data and had to make sure that only
1344Unicode data were reaching that scope. If you have code that is
1345working with 5.6, you will need some of the following adjustments to
1346your code. The examples are written such that the code will continue
1347to work under 5.6, so you should be safe to try them out.
1348
1349=over 4
1350
1351=item *
1352
1353A filehandle that should read or write UTF-8
1354
    if ($] > 5.007) {
        binmode $fh, ":utf8";
    }
1358
1359=item *
1360
1361A scalar that is going to be passed to some extension
1362
1363Be it Compress::Zlib, Apache::Request or any extension that has no
1364mention of Unicode in the manpage, you need to make sure that the
1365UTF-8 flag is stripped off. Note that at the time of this writing
1366(October 2002) the mentioned modules are not UTF-8-aware. Please
1367check the documentation to verify if this is still true.
1368
    if ($] > 5.007) {
        require Encode;
        $val = Encode::encode_utf8($val); # make octets
    }
1373
1374=item *
1375
1376A scalar we got back from an extension
1377
1378If you believe the scalar comes back as UTF-8, you will most likely
1379want the UTF-8 flag restored:
1380
    if ($] > 5.007) {
        require Encode;
        $val = Encode::decode_utf8($val);
    }
1385
1386=item *
1387
1388Same thing, if you are really sure it is UTF-8
1389
    if ($] > 5.007) {
        require Encode;
        Encode::_utf8_on($val);
    }
1394
1395=item *
1396
1397A wrapper for fetchrow_array and fetchrow_hashref
1398
1399When the database contains only UTF-8, a wrapper function or method is
1400a convenient way to replace all your fetchrow_array and
1401fetchrow_hashref calls. A wrapper function will also make it easier to
1402adapt to future enhancements in your database driver. Note that at the
1403time of this writing (October 2002), the DBI has no standardized way
1404to deal with UTF-8 data. Please check the documentation to verify if
1405that is still true.
1406
    sub fetchrow {
        my($self, $sth, $what) = @_; # $what is one of fetchrow_{array,hashref}
        if ($] < 5.007) {
            return $sth->$what;
        } else {
            require Encode;
            if (wantarray) {
                my @arr = $sth->$what;
                for (@arr) {
                    defined && /[^\000-\177]/ && Encode::_utf8_on($_);
                }
                return @arr;
            } else {
                my $ret = $sth->$what;
                if (ref $ret) {
                    for my $k (keys %$ret) {
                        defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret->{$k};
                    }
                    return $ret;
                } else {
                    defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
                    return $ret;
                }
            }
        }
    }
1433
1434
1435=item *
1436
1437A large scalar that you know can only contain ASCII
1438
Scalars that contain only ASCII and are marked as UTF-8 are sometimes
a drag on your program. If you recognize such a situation, just remove
the UTF-8 flag:

    utf8::downgrade($val) if $] > 5.007;
1444
1445=back
1446
1447=head1 SEE ALSO
1448
72ff2908 1449L<perluniintro>, L<encoding>, L<Encode>, L<open>, L<utf8>, L<bytes>,
a05d7ebb 1450L<perlretut>, L<perlvar/"${^UNICODE}">
1451
1452=cut