=head1 NAME

perlunicode - Unicode support in Perl

=head1 DESCRIPTION

=head2 Important Caveats

Unicode support is an extensive requirement. While Perl does not
implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.

=over 4

=item Input and Output Layers

Perl knows when a filehandle uses Perl's internal Unicode encodings
(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
the ":utf8" layer. Other encodings can be converted to Perl's
encoding on input or from Perl's encoding on output by use of the
":encoding(...)" layer. See L<open>.

To indicate that Perl source itself is using a particular encoding,
see L<encoding>.

=item Regular Expressions

The regular expression compiler produces polymorphic opcodes. That is,
the pattern adapts to the data and automatically switches to the Unicode
character scheme when presented with Unicode data--or instead uses
a traditional byte scheme when presented with byte data.

=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts

As a compatibility measure, the C<use utf8> pragma must be explicitly
included to enable recognition of UTF-8 in the Perl scripts themselves
(in string or regular expression literals, or in identifier names) on
ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
machines. B<These are the only times when an explicit C<use utf8>
is needed.> See L<utf8>.

You can also use the C<encoding> pragma to change the default encoding
of the data in your script; see L<encoding>.

=item C<use encoding> needed to upgrade non-Latin-1 byte strings

By default, there is a fundamental asymmetry in Perl's Unicode model:
implicit upgrading from byte strings to Unicode strings assumes that
they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
downgraded with UTF-8 encoding. This happens because the first 256
code points in Unicode happen to agree with Latin-1.

If you wish to interpret byte strings as UTF-8 instead, use the
C<encoding> pragma:

    use encoding 'utf8';

See L</"Byte and Character Semantics"> for more details.

=back
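
The layers described under "Input and Output Layers" can be sketched as
follows. This is a minimal illustration, not a prescribed idiom: the
in-memory filehandle and all variable names are assumptions chosen to keep
the example self-contained.

```perl
use strict;
use warnings;

my $smiley = "\x{263A}";    # one character, WHITE SMILING FACE

# Write through an encoding layer into an in-memory buffer (illustrative).
open(my $out, '>:encoding(UTF-8)', \my $buffer) or die "open: $!";
print {$out} $smiley;
close($out) or die "close: $!";

# The buffer now holds encoded bytes, not characters.
my $byte_count = length($buffer);

# Read the bytes back through the same layer to recover the character.
open(my $in, '<:encoding(UTF-8)', \$buffer) or die "open: $!";
my $round_trip = <$in>;
close($in);

my $char_count = length($round_trip);
```

The layer converts on the way out and on the way in, so the program itself
only ever handles characters.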

=head2 Byte and Character Semantics

Beginning with version 5.6, Perl uses logically-wide characters to
represent strings internally.

In future, Perl-level operations will be expected to work with
characters rather than bytes.

However, as an interim compatibility measure, Perl aims to
provide a safe migration path from byte semantics to character
semantics for programs. For operations where Perl can unambiguously
decide that the input data are characters, Perl switches to
character semantics. For operations where this determination cannot
be made without additional information from the user, Perl decides in
favor of compatibility and chooses to use byte semantics.

This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if
none of the program's inputs were marked as being a source of Unicode
character data. Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as %ENV),
or from literals and constants in the source text.

The C<bytes> pragma will always, regardless of platform, force byte
semantics in a particular lexical scope. See L<bytes>.

The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
Note that this pragma is only required while Perl defaults to byte
semantics; when character semantics become the default, this pragma
may become a no-op. See L<utf8>.

Unless explicitly stated, Perl operators use character semantics
for Unicode data and byte semantics for non-Unicode data.
The decision to use character semantics is made transparently. If
input data comes from a Unicode source--for example, if a character
encoding layer is added to a filehandle or a literal Unicode
string constant appears in a program--character semantics apply.
Otherwise, byte semantics are in effect. The C<bytes> pragma should
be used to force byte semantics on Unicode data.

If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will be created by
decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
old Unicode string used EBCDIC. This translation is done without
regard to the system's native 8-bit encoding. To change this for
systems with non-Latin-1 and non-EBCDIC native encodings, use the
C<encoding> pragma. See L<encoding>.

Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
logically just a number ranging from 0 to 2**31 or so. Larger
characters may encode into longer sequences of bytes internally, but
this internal detail is mostly hidden from Perl code.
See L<perluniintro> for more.
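
A short sketch of the difference between the two semantics, assuming an
ASCII-based platform where the internal encoding is UTF-8. The variable
names are illustrative only.

```perl
use strict;
use warnings;

my $unicode = "\x{263A}";          # code point above 255
my $chars   = length($unicode);    # counted in characters

my $bytes;
{
    use bytes;                     # force byte semantics in this scope only
    $bytes = length($unicode);     # counted in bytes of the internal encoding
}

# Concatenating a byte string with a Unicode string upgrades the byte
# string as if it were Latin-1: the byte 0xE9 becomes code point 0xE9.
my $joined = "\xE9" . $unicode;
my @points = map { ord } split //, $joined;
```

Outside the C<use bytes> block the string is one character long; inside it,
the same string is measured by its internal byte sequence.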

=head2 Effects of Character Semantics

Character semantics have the following effects:

=over 4

=item *

Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.

If you use a Unicode editor to edit your program, Unicode characters
may occur directly within the literal strings in one of the various
Unicode encodings (UTF-8, UTF-EBCDIC, UCS-2, etc.), but will be recognized
as such and converted to Perl's internal representation only if the
appropriate L<encoding> is specified.

Unicode characters can also be added to a string by using the
C<\x{...}> notation. The Unicode code for the desired character, in
hexadecimal, should be placed in the braces. For instance, a smiley
face is C<\x{263A}>. This encoding scheme only works for characters
with a code of 0x100 or above.

Additionally, if you

    use charnames ':full';

you can use the C<\N{...}> notation and put the official Unicode
character name within the braces, such as C<\N{WHITE SMILING FACE}>.

=item *

If an appropriate L<encoding> is specified, identifiers within the
Perl script may contain Unicode alphanumeric characters, including
ideographs. Perl does not currently attempt to canonicalize variable
names.

=item *

Regular expressions match characters instead of bytes. "." matches
a character instead of a byte. The C<\C> pattern is provided to force
a match of a single byte--a C<char> in C, hence C<\C>.

=item *

Character classes in regular expressions match characters instead of
bytes and match against the character properties specified in the
Unicode properties database. C<\w> can be used to match a Japanese
ideograph, for instance.

(However, and as a limitation of the current implementation, using
C<\w> or C<\W> I<inside> a C<[...]> character class will still match
with byte semantics.)

=item *

Named Unicode properties, scripts, and block ranges may be used like
character classes via the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".

For instance, C<\p{Lu}> matches any character with the Unicode "Lu"
(Letter, uppercase) property, while C<\p{M}> matches any character
with an "M" (mark--accents and such) property. Braces are not
required for single-letter properties, so C<\p{M}> is equivalent to
C<\pM>. Many predefined properties are available, such as
C<\p{Mirrored}> and C<\p{Tibetan}>.

The official Unicode script and block names have spaces and dashes as
separators, but for convenience you can use dashes, spaces, or
underbars, and case is unimportant. It is recommended, however, that
for consistency you use the following naming: the official Unicode
script, property, or block name (see below for the additional rules
that apply to block names) with whitespace and dashes removed, and the
words "uppercase-first-lowercase-rest". C<Latin-1 Supplement> thus
becomes C<Latin1Supplement>.

You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
(^) between the first brace and the property name: C<\p{^Tamil}> is
equal to C<\P{Tamil}>.

B<NOTE: the properties, scripts, and blocks listed here are as of
Unicode 3.2.0, March 2002, or Perl 5.8.0, July 2002. Unicode 4.0.0
came out in April 2003, and Perl 5.8.1 in September 2003.>

Here are the basic Unicode General Category properties, followed by their
long form. You can use either; C<\p{Lu}> and C<\p{UppercaseLetter}>,
for instance, are identical.

    Short       Long

    L           Letter
    Lu          UppercaseLetter
    Ll          LowercaseLetter
    Lt          TitlecaseLetter
    Lm          ModifierLetter
    Lo          OtherLetter

    M           Mark
    Mn          NonspacingMark
    Mc          SpacingMark
    Me          EnclosingMark

    N           Number
    Nd          DecimalNumber
    Nl          LetterNumber
    No          OtherNumber

    P           Punctuation
    Pc          ConnectorPunctuation
    Pd          DashPunctuation
    Ps          OpenPunctuation
    Pe          ClosePunctuation
    Pi          InitialPunctuation
                (may behave like Ps or Pe depending on usage)
    Pf          FinalPunctuation
                (may behave like Ps or Pe depending on usage)
    Po          OtherPunctuation

    S           Symbol
    Sm          MathSymbol
    Sc          CurrencySymbol
    Sk          ModifierSymbol
    So          OtherSymbol

    Z           Separator
    Zs          SpaceSeparator
    Zl          LineSeparator
    Zp          ParagraphSeparator

    C           Other
    Cc          Control
    Cf          Format
    Cs          Surrogate   (not usable)
    Co          PrivateUse
    Cn          Unassigned

Single-letter properties match all characters in any of the
two-letter sub-properties starting with the same letter.
C<L&> is a special case, which is an alias for C<Ll>, C<Lu>, and C<Lt>.

Because Perl hides the need for the user to understand the internal
representation of Unicode characters, there is no need to implement
the somewhat messy concept of surrogates. C<Cs> is therefore not
supported.

Because scripts differ in their directionality--Hebrew is
written right to left, for example--Unicode supplies these properties:

    Property    Meaning

    BidiL       Left-to-Right
    BidiLRE     Left-to-Right Embedding
    BidiLRO     Left-to-Right Override
    BidiR       Right-to-Left
    BidiAL      Right-to-Left Arabic
    BidiRLE     Right-to-Left Embedding
    BidiRLO     Right-to-Left Override
    BidiPDF     Pop Directional Format
    BidiEN      European Number
    BidiES      European Number Separator
    BidiET      European Number Terminator
    BidiAN      Arabic Number
    BidiCS      Common Number Separator
    BidiNSM     Non-Spacing Mark
    BidiBN      Boundary Neutral
    BidiB       Paragraph Separator
    BidiS       Segment Separator
    BidiWS      Whitespace
    BidiON      Other Neutrals

For example, C<\p{BidiR}> matches characters that are normally
written right to left.

=back
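
The C<\x{...}>, C<\N{...}>, and C<\p{...}> notations from the list above
can be tried out with a sketch like the following. The variable names are
illustrative, and the sample character (U+0410) is just one convenient
uppercase letter outside Latin-1.

```perl
use strict;
use warnings;
use charnames ':full';

# The same character written by code point and by official name.
my $by_code = "\x{263A}";
my $by_name = "\N{WHITE SMILING FACE}";
my $same    = $by_code eq $by_name ? 1 : 0;

my $capital = "\x{0410}";    # CYRILLIC CAPITAL LETTER A

# Short and long forms of a General Category property.
my $is_lu_short = $capital =~ /^\p{Lu}$/              ? 1 : 0;
my $is_lu_long  = $capital =~ /^\p{UppercaseLetter}$/ ? 1 : 0;

# Single-letter property without braces, and the two spellings of negation.
my $is_letter   = $capital =~ /^\pL$/     ? 1 : 0;
my $not_mark_P  = $capital =~ /^\P{M}$/   ? 1 : 0;
my $not_mark_up = $capital =~ /^\p{^M}$/  ? 1 : 0;
```

Each flag ends up true for this character, which shows that the alternative
spellings select the same set.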

=head2 Scripts

The script names which can be used by C<\p{...}> and C<\P{...}>,
such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows:

    Arabic
    Armenian
    Bengali
    Bopomofo
    Buhid
    CanadianAboriginal
    Cherokee
    Cyrillic
    Deseret
    Devanagari
    Ethiopic
    Georgian
    Gothic
    Greek
    Gujarati
    Gurmukhi
    Han
    Hangul
    Hanunoo
    Hebrew
    Hiragana
    Inherited
    Kannada
    Katakana
    Khmer
    Lao
    Latin
    Malayalam
    Mongolian
    Myanmar
    Ogham
    OldItalic
    Oriya
    Runic
    Sinhala
    Syriac
    Tagalog
    Tagbanwa
    Tamil
    Telugu
    Thaana
    Thai
    Tibetan
    Yi

Extended property classes can supplement the basic
properties, defined by the F<PropList> Unicode database:

    ASCIIHexDigit
    BidiControl
    Dash
    Deprecated
    Diacritic
    Extender
    GraphemeLink
    HexDigit
    Hyphen
    Ideographic
    IDSBinaryOperator
    IDSTrinaryOperator
    JoinControl
    LogicalOrderException
    NoncharacterCodePoint
    OtherAlphabetic
    OtherDefaultIgnorableCodePoint
    OtherGraphemeExtend
    OtherLowercase
    OtherMath
    OtherUppercase
    QuotationMark
    Radical
    SoftDotted
    TerminalPunctuation
    UnifiedIdeograph
    WhiteSpace

and there are further derived properties:

    Alphabetic      Lu + Ll + Lt + Lm + Lo + OtherAlphabetic
    Lowercase       Ll + OtherLowercase
    Uppercase       Lu + OtherUppercase
    Math            Sm + OtherMath

    ID_Start        Lu + Ll + Lt + Lm + Lo + Nl
    ID_Continue     ID_Start + Mn + Mc + Nd + Pc

    Any             Any character
    Assigned        Any non-Cn character (i.e. synonym for \P{Cn})
    Unassigned      Synonym for \p{Cn}
    Common          Any character (or unassigned code point)
                    not explicitly assigned to a script

For backward compatibility (with Perl 5.6), all properties mentioned
so far may have C<Is> prepended to their name, so C<\P{IsLu}>, for
example, is equal to C<\P{Lu}>.
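
As a brief illustration of script properties, the sketch below tests a
Cyrillic letter and an ASCII digit. The sample characters and variable
names are assumptions chosen for the example.

```perl
use strict;
use warnings;

my $zhe   = "\x{0436}";    # CYRILLIC SMALL LETTER ZHE
my $digit = "5";

my $zhe_is_cyrillic = $zhe   =~ /^\p{Cyrillic}$/ ? 1 : 0;
my $zhe_is_latin    = $zhe   =~ /^\p{Latin}$/    ? 1 : 0;

# Digits are shared across scripts, so they belong to Common, not Latin.
my $digit_is_latin  = $digit =~ /^\p{Latin}$/    ? 1 : 0;
my $digit_is_common = $digit =~ /^\p{Common}$/   ? 1 : 0;

# The optional "Is" prefix selects the same property.
my $with_is_prefix  = "\x{0416}" =~ /^\p{IsLu}$/ ? 1 : 0;
```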

=head2 Blocks

In addition to B<scripts>, Unicode also defines B<blocks> of
characters. The difference between scripts and blocks is that the
concept of scripts is closer to natural languages, while the concept
of blocks is more of an artificial grouping based on groups of 256
Unicode characters. For example, the C<Latin> script contains letters
from many blocks but does not contain all the characters from those
blocks. It does not, for example, contain digits, because digits are
shared across many scripts. Digits and similar groups, like
punctuation, are in a category called C<Common>.

For more about scripts, see UTR #24:

    http://www.unicode.org/unicode/reports/tr24/

For more about blocks, see:

    http://www.unicode.org/Public/UNIDATA/Blocks.txt

Block names are given with the C<In> prefix. For example, the
Katakana block is referenced via C<\p{InKatakana}>. The C<In>
prefix may be omitted if there is no naming conflict with a script
or any other property, but it is recommended that C<In> always be used
for block tests to avoid confusion.

These block names are supported:

    InAlphabeticPresentationForms
    InArabic
    InArabicPresentationFormsA
    InArabicPresentationFormsB
    InArmenian
    InArrows
    InBasicLatin
    InBengali
    InBlockElements
    InBopomofo
    InBopomofoExtended
    InBoxDrawing
    InBraillePatterns
    InBuhid
    InByzantineMusicalSymbols
    InCJKCompatibility
    InCJKCompatibilityForms
    InCJKCompatibilityIdeographs
    InCJKCompatibilityIdeographsSupplement
    InCJKRadicalsSupplement
    InCJKSymbolsAndPunctuation
    InCJKUnifiedIdeographs
    InCJKUnifiedIdeographsExtensionA
    InCJKUnifiedIdeographsExtensionB
    InCherokee
    InCombiningDiacriticalMarks
    InCombiningDiacriticalMarksforSymbols
    InCombiningHalfMarks
    InControlPictures
    InCurrencySymbols
    InCyrillic
    InCyrillicSupplementary
    InDeseret
    InDevanagari
    InDingbats
    InEnclosedAlphanumerics
    InEnclosedCJKLettersAndMonths
    InEthiopic
    InGeneralPunctuation
    InGeometricShapes
    InGeorgian
    InGothic
    InGreekExtended
    InGreekAndCoptic
    InGujarati
    InGurmukhi
    InHalfwidthAndFullwidthForms
    InHangulCompatibilityJamo
    InHangulJamo
    InHangulSyllables
    InHanunoo
    InHebrew
    InHighPrivateUseSurrogates
    InHighSurrogates
    InHiragana
    InIPAExtensions
    InIdeographicDescriptionCharacters
    InKanbun
    InKangxiRadicals
    InKannada
    InKatakana
    InKatakanaPhoneticExtensions
    InKhmer
    InLao
    InLatin1Supplement
    InLatinExtendedA
    InLatinExtendedAdditional
    InLatinExtendedB
    InLetterlikeSymbols
    InLowSurrogates
    InMalayalam
    InMathematicalAlphanumericSymbols
    InMathematicalOperators
    InMiscellaneousMathematicalSymbolsA
    InMiscellaneousMathematicalSymbolsB
    InMiscellaneousSymbols
    InMiscellaneousTechnical
    InMongolian
    InMusicalSymbols
    InMyanmar
    InNumberForms
    InOgham
    InOldItalic
    InOpticalCharacterRecognition
    InOriya
    InPrivateUseArea
    InRunic
    InSinhala
    InSmallFormVariants
    InSpacingModifierLetters
    InSpecials
    InSuperscriptsAndSubscripts
    InSupplementalArrowsA
    InSupplementalArrowsB
    InSupplementalMathematicalOperators
    InSupplementaryPrivateUseAreaA
    InSupplementaryPrivateUseAreaB
    InSyriac
    InTagalog
    InTagbanwa
    InTags
    InTamil
    InTelugu
    InThaana
    InThai
    InTibetan
    InUnifiedCanadianAboriginalSyllabics
    InVariationSelectors
    InYiRadicals
    InYiSyllables
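
The difference between a block test and a script test can be sketched as
below. The sample characters and variable names are illustrative
assumptions, not taken from this document.

```perl
use strict;
use warnings;

my $a_macron = "\x{0100}";    # LATIN CAPITAL LETTER A WITH MACRON
my $digit    = "5";

# A Latin letter that lives outside the Basic Latin block.
my $in_extended_a = $a_macron =~ /^\p{InLatinExtendedA}$/ ? 1 : 0;
my $in_basic      = $a_macron =~ /^\p{InBasicLatin}$/     ? 1 : 0;
my $script_latin  = $a_macron =~ /^\p{Latin}$/            ? 1 : 0;

# A digit is inside the Basic Latin block but is not in the Latin script.
my $digit_in_basic = $digit =~ /^\p{InBasicLatin}$/ ? 1 : 0;
my $digit_latin    = $digit =~ /^\p{Latin}$/        ? 1 : 0;

# Block test with the recommended "In" prefix.
my $ka_in_block = "\x{30AB}" =~ /^\p{InKatakana}$/ ? 1 : 0;
```

Blocks answer "which code point range is this in?", while scripts answer
"which writing system is this character used with?".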

=over 4

=item *

The special pattern C<\X> matches any extended Unicode
sequence--"a combining character sequence" in Standardese--where the
first character is a base character and subsequent characters are mark
characters that apply to the base character. C<\X> is equivalent to
C<(?:\PM\pM*)>.

=item *

The C<tr///> operator translates characters instead of bytes. Note
that the C<tr///CU> functionality has been removed. For similar
functionality see pack('U0', ...) and pack('C0', ...).

=item *

Case translation operators use the Unicode case translation tables
when character input is provided. Note that C<uc()>, or C<\U> in
interpolated strings, translates to uppercase, while C<ucfirst>,
or C<\u> in interpolated strings, translates to titlecase in languages
that make the distinction.

=item *

Most operators that deal with positions or lengths in a string will
automatically switch to using character positions, including
C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
C<sprintf()>, C<write()>, and C<length()>. Operators that
specifically do not switch include C<vec()>, C<pack()>, and
C<unpack()>. Operators that really don't care include
operators that treat strings as a bucket of bits such as C<sort()>,
and operators dealing with filenames.

=item *

The C<pack()>/C<unpack()> letters C<c> and C<C> do I<not> change,
since they are often used for byte-oriented formats. Again, think
C<char> in the C language.

There is a new C<U> specifier that converts between Unicode characters
and code points.

=item *

The C<chr()> and C<ord()> functions work on characters, similar to
C<pack("U")> and C<unpack("U")>, I<not> C<pack("C")> and
C<unpack("C")>. C<pack("C")> and C<unpack("C")> are methods for
emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.
While these methods reveal the internal encoding of Unicode strings,
that is not something one normally needs to care about at all.

=item *

The bit string operators, C<& | ^ ~>, can operate on character data.
However, for backward compatibility, such as when using bit string
operations when characters are all less than 256 in ordinal value, one
should not use C<~> (the bit complement) with characters of both
values less than 256 and values greater than 256. Most importantly,
DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
will not hold. The reason for this mathematical I<faux pas> is that
the complement cannot return B<both> the 8-bit (byte-wide) bit
complement B<and> the full character-wide bit complement.

=item *

lc(), uc(), lcfirst(), and ucfirst() work for the following cases:

=over 8

=item *

the case mapping is from a single Unicode character to another
single Unicode character, or

=item *

the case mapping is from a single Unicode character to more
than one Unicode character.

=back

Things to do with locales (Lithuanian, Turkish, Azeri) do B<not> work
since Perl does not understand the concept of Unicode locales.

See the Unicode Technical Report #21, Case Mappings, for more details.

=back

=over 4

=item *

And finally, C<scalar reverse()> reverses by character rather than by byte.

=back
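
Several of the effects listed above can be observed with a short sketch.
The sample strings and variable names are illustrative assumptions; the
behaviors shown are C<\X>, uppercase versus titlecase, a one-to-many case
mapping, character-based positions, and C<scalar reverse()>.

```perl
use strict;
use warnings;

# \X treats a base character plus its combining marks as one unit.
my $e_acute  = "e\x{301}";                 # "e" + COMBINING ACUTE ACCENT
my $points   = length($e_acute);           # counted in characters
my @clusters = $e_acute =~ /(\X)/g;        # combining sequences

# uc() gives uppercase, ucfirst() gives titlecase where they differ.
my $dz    = "\x{01C6}";                    # LATIN SMALL LETTER DZ WITH CARON
my $upper = sprintf "%04X", ord(uc($dz));
my $title = sprintf "%04X", ord(ucfirst($dz));

# A case mapping from one character to more than one character.
my $ligature_upper = uc("\x{FB01}");       # LATIN SMALL LIGATURE FI

# Positions and lengths are counted in characters, not bytes.
my $text     = "a\x{263A}b";
my $position = index($text, "b");
my $middle   = substr($text, 1, 1);
my $reversed = scalar reverse($text);
```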

=head2 User-Defined Character Properties

You can define your own character properties by defining subroutines
whose names begin with "In" or "Is". The subroutines must be defined
in the C<main> package. The user-defined properties can be used in the
regular expression C<\p> and C<\P> constructs. Note that the effect
is compile-time and immutable once defined.

The subroutines must return a specially-formatted string, with one
or more newline-separated lines. Each line must be one of the following:

=over 4

=item *

Two hexadecimal numbers separated by horizontal whitespace (space or
tab characters) denoting a range of Unicode code points to include.

=item *

Something to include, prefixed by "+": a built-in character
property (prefixed by "utf8::"), to represent all the characters in that
property; two hexadecimal code points for a range; or a single
hexadecimal code point.

=item *

Something to exclude, prefixed by "-": an existing character
property (prefixed by "utf8::"), for all the characters in that
property; two hexadecimal code points for a range; or a single
hexadecimal code point.

=item *

Something to negate, prefixed by "!": an existing character
property (prefixed by "utf8::") for all the characters except the
characters in the property; two hexadecimal code points for a range;
or a single hexadecimal code point.

=back

For example, to define a property that covers both the Japanese
syllabaries (hiragana and katakana), you can define

    sub InKana {
        return <<END;
    3040\t309F
    30A0\t30FF
    END
    }

Imagine that the here-doc end marker is at the beginning of the line.
Now you can use C<\p{InKana}> and C<\P{InKana}>.

You could also have used the existing block property names:

    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    END
    }

Suppose you wanted to match only the allocated characters,
not the raw block ranges: in other words, you want to remove
the non-characters:

    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    -utf8::IsCn
    END
    }

The negation is useful for defining (surprise!) negated classes.

    sub InNotKana {
        return <<'END';
    !utf8::InHiragana
    -utf8::InKatakana
    +utf8::IsCn
    END
    }
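
The following sketch puts two of the definitions above to work in actual
matches. The name C<InAssignedKana> is hypothetical, introduced here only so
that both variants can coexist in one program, and U+3040 is used as a code
point that lies in the Hiragana block but is assumed to be unassigned.

```perl
use strict;
use warnings;

# Raw block ranges, written as tab-separated hexadecimal code points.
sub InKana {
    return "3040\t309F\n" . "30A0\t30FF\n";
}

# Hypothetical variant: the same blocks minus unassigned code points.
sub InAssignedKana {
    return <<'END';
+utf8::InHiragana
+utf8::InKatakana
-utf8::IsCn
END
}

my $hiragana_a = "\x{3042}";    # HIRAGANA LETTER A
my $reserved   = "\x{3040}";    # in the block, assumed unassigned

my $a_in_kana        = $hiragana_a =~ /^\p{InKana}$/         ? 1 : 0;
my $a_in_assigned    = $hiragana_a =~ /^\p{InAssignedKana}$/ ? 1 : 0;
my $hole_in_kana     = $reserved   =~ /^\p{InKana}$/         ? 1 : 0;
my $hole_in_assigned = $reserved   =~ /^\p{InAssignedKana}$/ ? 1 : 0;
my $latin_not_kana   = "A"         =~ /^\P{InKana}$/         ? 1 : 0;
```

The range-based property accepts the whole block, while the subtracting
variant rejects the code point that has no character assigned.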

You can also define your own mappings to be used in the lc(),
lcfirst(), uc(), and ucfirst() (or their string-inlined versions).
The principle is the same: define subroutines in the C<main> package
with names like C<ToLower> (for lc() and lcfirst()), C<ToTitle> (for
the first character in ucfirst()), and C<ToUpper> (for uc(), and the
rest of the characters in ucfirst()).

The string returned by the subroutines now needs to be three
hexadecimal numbers separated by tabs: start of the source
range, end of the source range, and start of the destination range.
For example:

    sub ToUpper {
        return <<END;
    0061\t0063\t0041
    END
    }

defines a uc() mapping that causes only the characters "a", "b", and
"c" to be mapped to "A", "B", "C"; all other characters will remain
unchanged.

If there is no source range to speak of, that is, the mapping is from
a single character to another single character, leave the end of the
source range empty, but the two tab characters are still needed.
For example:

    sub ToLower {
        return <<END;
    0041\t\t0061
    END
    }

defines a lc() mapping that causes only "A" to be mapped to "a"; all
other characters will remain unchanged.

(For serious hackers only) If you want to introspect the default
mappings, you can find the data in the directory
C<$Config{privlib}>/F<unicore/To/>. The mapping data is returned as
the here-document, and the C<utf8::ToSpecFoo> are special exception
mappings derived from C<$Config{privlib}>/F<unicore/SpecialCasing.txt>.
The C<Digit> and C<Fold> mappings that one can see in the directory
are not directly user-accessible; one can use either the
C<Unicode::UCD> module, or just match case-insensitively (that's when
the C<Fold> mapping is used).

A final note on the user-defined property tests and mappings: they
will be used only if the scalar has been marked as having Unicode
characters. Old byte-style strings will not be affected.

=head2 Character Encodings for Input and Output

See L<Encode>.

=head2 Unicode Regular Expression Support Level

The following list of Unicode support for regular expressions describes
all the features currently supported. The references to "Level N"
and the section numbers refer to the Unicode Technical Report 18,
"Unicode Regular Expression Guidelines", version 6 (Unicode 3.2.0,
Perl 5.8.0).

=over 4

=item *

Level 1 - Basic Unicode Support

        2.1 Hex Notation                        - done          [1]
            Named Notation                      - done          [2]
        2.2 Categories                          - done          [3][4]
        2.3 Subtraction                         - MISSING       [5][6]
        2.4 Simple Word Boundaries              - done          [7]
        2.5 Simple Loose Matches                - done          [8]
        2.6 End of Line                         - MISSING       [9][10]

        [ 1] \x{...}
        [ 2] \N{...}
        [ 3] . \p{...} \P{...}
        [ 4] now scripts (see UTR#24 Script Names) in addition to blocks
        [ 5] have negation
        [ 6] can use regular expression look-ahead [a]
             or user-defined character properties [b] to emulate subtraction
        [ 7] include Letters in word characters
        [ 8] note that Perl does Full case-folding in matching, not Simple:
             for example U+1F88 is equivalent with U+1F00 U+03B9,
             not with U+1F80. This difference matters for certain Greek
             capital letters with certain modifiers: the Full case-folding
             decomposes the letter, while the Simple case-folding would map
             it to a single character.
        [ 9] see UTR #13 Unicode Newline Guidelines
        [10] should do ^ and $ also on \x{85}, \x{2028} and \x{2029}
             (should also affect <>, $., and script line numbers)
             (the \x{85}, \x{2028} and \x{2029} do match \s)

[a] You can mimic class subtraction using lookahead.
For example, what UTR #18 might write as

    [{Greek}-[{UNASSIGNED}]]

in Perl can be written as:

    (?!\p{Unassigned})\p{InGreekAndCoptic}
    (?=\p{Assigned})\p{InGreekAndCoptic}

But in this particular example, you probably really want

    \p{Greek}

which will match assigned characters known to be part of the Greek script.

Also see the Unicode::Regex::Set module, which implements the full
UTR #18 grouping, intersection, union, and removal (subtraction) syntax.

[b] See L</"User-Defined Character Properties">.

=item *

Level 2 - Extended Unicode Support

        3.1 Surrogates                          - MISSING       [11]
        3.2 Canonical Equivalents               - MISSING       [12][13]
        3.3 Locale-Independent Graphemes        - MISSING       [14]
        3.4 Locale-Independent Words            - MISSING       [15]
        3.5 Locale-Independent Loose Matches    - MISSING       [16]

        [11] Surrogates are solely a UTF-16 concept and Perl's internal
             representation is UTF-8. The Encode module does UTF-16, though.
        [12] see UTR#15 Unicode Normalization
        [13] have Unicode::Normalize but not integrated to regexes
        [14] have \X but at this level . should equal that
        [15] need three classes, not just \w and \W
        [16] see UTR#21 Case Mappings

=item *

Level 3 - Locale-Sensitive Support

        4.1 Locale-Dependent Categories         - MISSING
        4.2 Locale-Dependent Graphemes          - MISSING       [16][17]
        4.3 Locale-Dependent Words              - MISSING
        4.4 Locale-Dependent Loose Matches      - MISSING
        4.5 Locale-Dependent Ranges             - MISSING

        [16] see UTR#10 Unicode Collation Algorithms
        [17] have Unicode::Collate but not integrated to regexes

=back
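
The lookahead technique from note [a] above can be sketched as follows.
The variable names are illustrative, and U+0378 is used on the assumption
that it is an unassigned code point inside the Greek and Coptic block in
the Unicode version your Perl ships with.

```perl
use strict;
use warnings;

# "Greek block minus unassigned", emulated with a negative lookahead.
my $assigned_greek = qr/^(?!\p{Unassigned})\p{InGreekAndCoptic}$/;

# The equivalent form using a positive lookahead.
my $assigned_greek_alt = qr/^(?=\p{Assigned})\p{InGreekAndCoptic}$/;

my $alpha = "\x{03B1}";    # GREEK SMALL LETTER ALPHA
my $hole  = "\x{0378}";    # in the block, assumed unassigned

my $alpha_ok     = $alpha =~ $assigned_greek            ? 1 : 0;
my $alpha_ok_alt = $alpha =~ $assigned_greek_alt        ? 1 : 0;
my $hole_in_blk  = $hole  =~ /^\p{InGreekAndCoptic}$/   ? 1 : 0;
my $hole_ok      = $hole  =~ $assigned_greek            ? 1 : 0;
```

The lookahead consumes nothing, so the block test that follows it still
matches exactly one character.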
865
c349b1b9
JH
866=head2 Unicode Encodings
867
376d9008
JB
868Unicode characters are assigned to I<code points>, which are abstract
869numbers. To use these numbers, various encodings are needed.
c349b1b9
JH
870
871=over 4
872
c29a771d 873=item *
5cb3728c
RB
874
875UTF-8
c349b1b9 876
UTF-8 is a variable-length (1 to 6 bytes; current character allocations
require at most 4 bytes), byte-order independent encoding.  For ASCII (and we
really do mean 7-bit ASCII, not another 8-bit encoding), UTF-8 is
transparent.
c349b1b9 881
The following table is from Unicode 3.2.

 Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte

   U+0000..U+007F       00..7F
   U+0080..U+07FF       C2..DF    80..BF
   U+0800..U+0FFF       E0        A0..BF    80..BF
   U+1000..U+CFFF       E1..EC    80..BF    80..BF
   U+D000..U+D7FF       ED        80..9F    80..BF
   U+D800..U+DFFF       ******* ill-formed *******
   U+E000..U+FFFF       EE..EF    80..BF    80..BF
  U+10000..U+3FFFF      F0        90..BF    80..BF    80..BF
  U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
 U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF
896
Note the C<A0..BF> in C<U+0800..U+0FFF>, the C<80..9F> in
C<U+D000..U+D7FF>, the C<90..BF> in C<U+10000..U+3FFFF>, and the
C<80..8F> in C<U+100000..U+10FFFF>.  The C<A0..BF> and C<90..BF> "gaps"
are caused by legal UTF-8 avoiding non-shortest encodings: it is
technically possible to UTF-8-encode a single code point in different
ways, but that is explicitly forbidden, and the shortest possible
encoding should always be used.  So that's what Perl does.  The
C<80..9F> restriction excludes the surrogates, and the C<80..8F>
restriction stops at C<U+10FFFF>, the last Unicode code point.
37361303 904
Another way to look at it is via bits:

 Code Points                    1st Byte   2nd Byte  3rd Byte  4th Byte

                    0aaaaaaa     0aaaaaaa
            00000bbbbbaaaaaa     110bbbbb  10aaaaaa
            ccccbbbbbbaaaaaa     1110cccc  10bbbbbb  10aaaaaa
  00000dddccccccbbbbbbaaaaaa     11110ddd  10cccccc  10bbbbbb  10aaaaaa

As you can see, the continuation bytes all begin with C<10>, and the
leading bits of the start byte tell how many bytes there are in the
encoded character.
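
The bit layout above can be turned into code fairly directly.  The
following sketch encodes a code point by hand and compares the result
with what the Encode module produces; C<utf8_bytes_for> is a
hypothetical helper written for this illustration, not a Perl built-in.

```perl
use strict;
use warnings;
use Encode ();

# Encode one code point by hand, following the bit layout above.
# Surrogates and values beyond U+10FFFF are rejected.
sub utf8_bytes_for {
    my ($cp) = @_;
    die sprintf("U+%04X is not encodable\n", $cp)
        if $cp > 0x10FFFF or ($cp >= 0xD800 and $cp <= 0xDFFF);
    return ($cp) if $cp < 0x80;
    return (0xC0 | ($cp >> 6), 0x80 | ($cp & 0x3F)) if $cp < 0x800;
    return (0xE0 | ($cp >> 12),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F)) if $cp < 0x10000;
    return (0xF0 | ($cp >> 18),
            0x80 | (($cp >> 12) & 0x3F),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F));
}

for my $cp (0x41, 0xE9, 0x20AC, 0x1D11E) {
    my $by_hand   = join ' ', map { sprintf '%02X', $_ } utf8_bytes_for($cp);
    my $by_encode = join ' ', map { sprintf '%02X', ord($_) }
                    split //, Encode::encode('UTF-8', chr($cp));
    printf "U+%04X  %-12s %s\n", $cp, $by_hand,
        ($by_hand eq $by_encode ? 'matches Encode' : 'MISMATCH');
}
```

In real programs the Encode module should do this work; the hand-written
version is only meant to make the table concrete.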
917
c29a771d 918=item *
5cb3728c
RB
919
920UTF-EBCDIC
dbe420b4 921
376d9008 922Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.
dbe420b4 923
c29a771d 924=item *
5cb3728c 925
1e54db1a 926UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)
c349b1b9 927
1bfb14c4
JH
The following items are mostly for reference and general Unicode
knowledge; Perl doesn't use these constructs internally.
dbe420b4 930
c349b1b9 931UTF-16 is a 2 or 4 byte encoding. The Unicode code points
1bfb14c4
JH
932C<U+0000..U+FFFF> are stored in a single 16-bit unit, and the code
933points C<U+10000..U+10FFFF> in two 16-bit units. The latter case is
c349b1b9
JH
934using I<surrogates>, the first 16-bit unit being the I<high
935surrogate>, and the second being the I<low surrogate>.
936
376d9008 937Surrogates are code points set aside to encode the C<U+10000..U+10FFFF>
c349b1b9 938range of Unicode code points in pairs of 16-bit units. The I<high
376d9008
JB
939surrogates> are the range C<U+D800..U+DBFF>, and the I<low surrogates>
940are the range C<U+DC00..U+DFFF>. The surrogate encoding is
c349b1b9
JH
941
	$hi = int(($uni - 0x10000) / 0x400) + 0xD800;
	$lo = ($uni - 0x10000) % 0x400 + 0xDC00;
944
945and the decoding is
946
1a3fa709 947 $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
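
As a worked example, the two formulas can be wrapped in a pair of
functions and checked with a round trip.  This sketch uses shifts and
masks, which are equivalent to the integer division and modulus by
C<0x400>; the function names are made up for the illustration.

```perl
use strict;
use warnings;

# Split a supplementary code point (U+10000..U+10FFFF) into a
# high and a low surrogate.
sub to_surrogates {
    my ($uni) = @_;
    my $offset = $uni - 0x10000;
    return (0xD800 + ($offset >> 10), 0xDC00 + ($offset & 0x3FF));
}

# Combine a surrogate pair back into one code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
}

my $uni = 0x1D11E;                      # MUSICAL SYMBOL G CLEF
my ($hi, $lo) = to_surrogates($uni);
printf "U+%X -> %04X %04X -> U+%X\n",
    $uni, $hi, $lo, from_surrogates($hi, $lo);
```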
c349b1b9 948
feda178f 949If you try to generate surrogates (for example by using chr()), you
376d9008
JB
950will get a warning if warnings are turned on, because those code
951points are not valid for a Unicode character.
9466bab6 952
376d9008 953Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16
c349b1b9 954itself can be used for in-memory computations, but if storage or
376d9008
JB
955transfer is required either UTF-16BE (big-endian) or UTF-16LE
956(little-endian) encodings must be chosen.
c349b1b9
JH
957
958This introduces another problem: what if you just know that your data
376d9008
JB
959is UTF-16, but you don't know which endianness? Byte Order Marks, or
960BOMs, are a solution to this. A special character has been reserved
86bbd6d1 961in Unicode to function as a byte order marker: the character with the
376d9008 962code point C<U+FEFF> is the BOM.
042da322 963
c349b1b9 964The trick is that if you read a BOM, you will know the byte order,
376d9008
JB
965since if it was written on a big-endian platform, you will read the
966bytes C<0xFE 0xFF>, but if it was written on a little-endian platform,
967you will read the bytes C<0xFF 0xFE>. (And if the originating platform
968was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.)
042da322 969
The way this trick works is that the character with the code point
C<U+FFFE> is guaranteed not to be a valid Unicode character, so the
sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in
little-endian format" and cannot be "C<U+FFFE>, represented in big-endian
format".
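
A BOM check along these lines might look as follows.  This is a sketch
only: C<bom_encoding> is a hypothetical helper, it also covers the
UTF-32 signatures, and it cannot tell a UTF-32LE BOM apart from a
UTF-16LE BOM that happens to be followed by C<U+0000>.

```perl
use strict;
use warnings;

# Guess an encoding from the first bytes of a byte string.
# Returns nothing (undef in scalar context) when no BOM is present.
sub bom_encoding {
    my ($bytes) = @_;
    # Test the 4-byte signatures first: the UTF-32LE BOM begins
    # with the same two bytes as the UTF-16LE BOM.
    return 'UTF-32BE' if $bytes =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-32LE' if $bytes =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return;
}

for my $sample ("\xFE\xFF\x00A", "\xFF\xFEA\x00", "\xEF\xBB\xBFA", "A") {
    my $guess = bom_encoding($sample);
    printf "%-12s %s\n",
        join(' ', map { sprintf '%02X', ord($_) } split //, $sample),
        defined $guess ? $guess : '(no BOM)';
}
```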
c349b1b9 975
c29a771d 976=item *
5cb3728c 977
1e54db1a 978UTF-32, UTF-32BE, UTF-32LE
c349b1b9
JH
979
The UTF-32 family is pretty much like the UTF-16 family, except that
the units are 32-bit, and therefore the surrogate scheme is not
needed.  The BOM signatures will be C<0x00 0x00 0xFE 0xFF> for BE and
C<0xFF 0xFE 0x00 0x00> for LE.
c349b1b9 984
c29a771d 985=item *
5cb3728c
RB
986
987UCS-2, UCS-4
c349b1b9 988
86bbd6d1 989Encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit
376d9008 990encoding. Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
339cfa0e
JH
991because it does not use surrogates. UCS-4 is a 32-bit encoding,
992functionally identical to UTF-32.
c349b1b9 993
c29a771d 994=item *
5cb3728c
RB
995
996UTF-7
c349b1b9 997
376d9008
JB
998A seven-bit safe (non-eight-bit) encoding, which is useful if the
999transport or storage is not eight-bit safe. Defined by RFC 2152.
c349b1b9 1000
95a1a48b
JH
1001=back
1002
0d7c09bb
JH
1003=head2 Security Implications of Unicode
1004
1005=over 4
1006
1007=item *
1008
1009Malformed UTF-8
bf0fa0b2
JH
1010
1011Unfortunately, the specification of UTF-8 leaves some room for
1012interpretation of how many bytes of encoded output one should generate
376d9008
JB
1013from one input Unicode character. Strictly speaking, the shortest
1014possible sequence of UTF-8 bytes should be generated,
1015because otherwise there is potential for an input buffer overflow at
feda178f 1016the receiving end of a UTF-8 connection. Perl always generates the
376d9008
JB
1017shortest length UTF-8, and with warnings on Perl will warn about
1018non-shortest length UTF-8 along with other malformations, such as the
1019surrogates, which are not real Unicode code points.
bf0fa0b2 1020
0d7c09bb
JH
1021=item *
1022
Regular expressions behave slightly differently between byte data and
character (Unicode) data.  For example, the "word character" character
class C<\w> will work differently depending on whether the data is
eight-bit bytes or Unicode.
0d7c09bb 1027
376d9008
JB
1028In the first case, the set of C<\w> characters is either small--the
1029default set of alphabetic characters, digits, and the "_"--or, if you
0d7c09bb
JH
1030are using a locale (see L<perllocale>), the C<\w> might contain a few
1031more letters according to your language and country.
1032
376d9008 1033In the second case, the C<\w> set of characters is much, much larger.
1bfb14c4
JH
1034Most importantly, even in the set of the first 256 characters, it will
1035probably match different characters: unlike most locales, which are
1036specific to a language and country pair, Unicode classifies all the
1037characters that are letters I<somewhere> as C<\w>. For example, your
1038locale might not think that LATIN SMALL LETTER ETH is a letter (unless
1039you happen to speak Icelandic), but Unicode does.
0d7c09bb 1040
376d9008 1041As discussed elsewhere, Perl has one foot (two hooves?) planted in
1bfb14c4
JH
1042each of two worlds: the old world of bytes and the new world of
1043characters, upgrading from bytes to characters when necessary.
376d9008
JB
1044If your legacy code does not explicitly use Unicode, no automatic
1045switch-over to characters should happen. Characters shouldn't get
1bfb14c4
JH
1046downgraded to bytes, either. It is possible to accidentally mix bytes
1047and characters, however (see L<perluniintro>), in which case C<\w> in
1048regular expressions might start behaving differently. Review your
1049code. Use warnings and the C<strict> pragma.
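
The difference can be observed with a one-character string.  The sketch
below assumes the default regular expression semantics (no C<use locale>
and nothing that changes how C<\w> treats the characters 128..255); the
variable names are illustrative.

```perl
use strict;
use warnings;

my $bytes = "\xF0";      # one byte; LATIN SMALL LETTER ETH in Latin-1
my $chars = $bytes;
utf8::upgrade($chars);   # same character, now stored as UTF-8 internally

# The two scalars compare equal as strings...
my $same_string = ($bytes eq $chars) ? 1 : 0;

# ...yet \w may treat them differently.
my $byte_match = ($bytes =~ /\w/) ? 1 : 0;
my $char_match = ($chars =~ /\w/) ? 1 : 0;

printf "equal: %d, \\w on bytes: %d, \\w on characters: %d\n",
    $same_string, $byte_match, $char_match;
```

If the two match results differ, bytes and characters are being mixed
somewhere, which is exactly the situation this item warns about.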
0d7c09bb
JH
1050
1051=back
1052
c349b1b9
JH
1053=head2 Unicode in Perl on EBCDIC
1054
376d9008
JB
1055The way Unicode is handled on EBCDIC platforms is still
1056experimental. On such platforms, references to UTF-8 encoding in this
1057document and elsewhere should be read as meaning the UTF-EBCDIC
1058specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues
c349b1b9 1059are specifically discussed. There is no C<utfebcdic> pragma or
376d9008 1060":utfebcdic" layer; rather, "utf8" and ":utf8" are reused to mean
86bbd6d1
PN
1061the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic>
1062for more discussion of the issues.
c349b1b9 1063
b310b053
JH
1064=head2 Locales
1065
4616122b 1066Usually locale settings and Unicode do not affect each other, but
b310b053
JH
1067there are a couple of exceptions:
1068
1069=over 4
1070
1071=item *
1072
8aa8f774
JH
You can enable automatic UTF-8-ification of your standard file
handles, default C<open()> layer, and C<@ARGV> by using either
the C<-C> command line switch or the C<PERL_UNICODE> environment
variable; see L<perlrun> for the documentation of the C<-C> switch.
b310b053
JH
1077
1078=item *
1079
376d9008
JB
1080Perl tries really hard to work both with Unicode and the old
1081byte-oriented world. Most often this is nice, but sometimes Perl's
1082straddling of the proverbial fence causes problems.
b310b053
JH
1083
1084=back
1085
1aad1664
JH
1086=head2 When Unicode Does Not Happen
1087
While Perl does have extensive ways to input and output in Unicode,
and a few other 'entry points' like C<@ARGV> which can be interpreted
as Unicode (UTF-8), there still are many places where Unicode (in some
encoding or another) could be given as arguments or received as
results, or both, but it is not.
1093
6cd4dd6c
JH
1094The following are such interfaces. For all of these interfaces Perl
1095currently (as of 5.8.3) simply assumes byte strings both as arguments
1096and results, or UTF-8 strings if the C<encoding> pragma has been used.
1aad1664
JH
1097
One reason why Perl does not attempt to resolve the role of Unicode in
these cases is that the answers are highly dependent on the operating
system and the file system(s).  For example, whether filenames can be
in Unicode, and in exactly what kind of encoding, is not exactly a
portable concept.  Similarly for C<qx> and C<system>: how well will the
'command line interface' (and which of them?) handle Unicode?
1104
1105=over 4
1106
557a2462
RB
1107=item *
1108
1e8e8236
JH
chdir, chmod, chown, chroot, exec, link, lstat, mkdir,
rename, rmdir, stat, symlink, truncate, unlink, utime, -X
557a2462
RB
1111
1112=item *
1113
1114%ENV
1115
1116=item *
1117
1118glob (aka the <*>)
1119
1120=item *
1aad1664 1121
557a2462 1122open, opendir, sysopen
1aad1664 1123
557a2462 1124=item *
1aad1664 1125
557a2462 1126qx (aka the backtick operator), system
1aad1664 1127
557a2462 1128=item *
1aad1664 1129
557a2462 1130readdir, readlink
1aad1664
JH
1131
1132=back
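
Since these interfaces deal in bytes, one defensive approach is to
encode names explicitly on the way in and decode them on the way out.
The sketch below B<assumes> that the file system stores names as UTF-8
octets, which is not true everywhere; C<$fs_encoding> is a variable
introduced only for this example.

```perl
use strict;
use warnings;
use Encode ();
use File::Temp ();

# ASSUMPTION: the file system stores names as UTF-8 octets.
my $fs_encoding = 'UTF-8';

my $dir  = File::Temp::tempdir(CLEANUP => 1);
my $name = "smiley-\x{263A}.txt";          # a character string

# Encode the name explicitly before handing it to open().
my $path = "$dir/" . Encode::encode($fs_encoding, $name);
open(my $fh, '>', $path) or die "Cannot create $path: $!";
close($fh) or die "Cannot close $path: $!";

# readdir() returns octets; decode them to get characters back.
opendir(my $dh, $dir) or die "Cannot open $dir: $!";
my @names = map  { Encode::decode($fs_encoding, $_) }
            grep { $_ ne '.' and $_ ne '..' } readdir($dh);
closedir($dh);

print "round trip ok\n" if @names == 1 and $names[0] eq $name;
```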
1133
1134=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)
1135
1136Sometimes (see L</"When Unicode Does Not Happen">) there are
1137situations where you simply need to force Perl to believe that a byte
1138string is UTF-8, or vice versa. The low-level calls
1139utf8::upgrade($bytestring) and utf8::downgrade($utf8string) are
1140the answers.
1141
Do not use them without careful thought, though: Perl may easily get
very confused, angry, or even crash, if you suddenly change the 'nature'
of a scalar like that.  Be especially careful with C<utf8::upgrade()>:
an arbitrary byte string is not necessarily valid UTF-8.
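
The following sketch shows what the two calls do to a scalar.  In the
Perl releases we are aware of, they change the internal representation
and the UTF-8 flag but leave the logical characters alone; the optional
second argument to C<utf8::downgrade()> makes it return false instead
of dying when a character does not fit into a byte.

```perl
use strict;
use warnings;

my $string   = "caf\xE9";               # four characters, stored as four bytes
my $was_utf8 = utf8::is_utf8($string) ? 1 : 0;

utf8::upgrade($string);                 # switch the internal storage to UTF-8
my $is_utf8     = utf8::is_utf8($string) ? 1 : 0;
my $same_value  = ($string eq "caf\xE9") ? 1 : 0;
my $same_length = (length($string) == 4) ? 1 : 0;

utf8::downgrade($string);               # and back again
my $back_to_bytes = utf8::is_utf8($string) ? 0 : 1;

# A character above 255 cannot be downgraded.
my $wide       = "\x{263A}";
my $downgraded = utf8::downgrade($wide, 1) ? 1 : 0;

printf "flag before: %d, after upgrade: %d, value kept: %d, wide ok: %d\n",
    $was_utf8, $is_utf8, $same_value, $downgraded;
```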
1146
95a1a48b
JH
1147=head2 Using Unicode in XS
1148
3a2263fe
RGS
1149If you want to handle Perl Unicode in XS extensions, you may find the
1150following C APIs useful. See also L<perlguts/"Unicode Support"> for an
1151explanation about Unicode at the XS level, and L<perlapi> for the API
1152details.
95a1a48b
JH
1153
1154=over 4
1155
1156=item *
1157
1bfb14c4
JH
C<DO_UTF8(sv)> returns true if the C<UTF8> flag is on and the bytes
pragma is not in effect.  C<SvUTF8(sv)> returns true if the C<UTF8>
flag is on; the bytes pragma is ignored.  The C<UTF8> flag being on
does B<not> mean that there are any characters of code points greater
than 255 (or 127) in the scalar or that there are even any characters
in the scalar.  What the C<UTF8> flag means is that the sequence of
octets in the representation of the scalar is the sequence of UTF-8
encoded code points of the characters of a string.  The C<UTF8> flag
being off means that each octet in this representation encodes a
single character with code point 0..255 within the string.  Perl's
Unicode model is not to use UTF-8 until it is absolutely necessary.
95a1a48b
JH
1169
1170=item *
1171
fb9cc174 1172C<uvuni_to_utf8(buf, chr)> writes a Unicode character code point into
1bfb14c4 1173a buffer encoding the code point as UTF-8, and returns a pointer
95a1a48b
JH
1174pointing after the UTF-8 bytes.
1175
1176=item *
1177
376d9008
JB
1178C<utf8_to_uvuni(buf, lenp)> reads UTF-8 encoded bytes from a buffer and
1179returns the Unicode character code point and, optionally, the length of
1180the UTF-8 byte sequence.
95a1a48b
JH
1181
1182=item *
1183
376d9008
JB
1184C<utf8_length(start, end)> returns the length of the UTF-8 encoded buffer
1185in characters. C<sv_len_utf8(sv)> returns the length of the UTF-8 encoded
95a1a48b
JH
1186scalar.
1187
1188=item *
1189
376d9008
JB
1190C<sv_utf8_upgrade(sv)> converts the string of the scalar to its UTF-8
1191encoded form. C<sv_utf8_downgrade(sv)> does the opposite, if
1192possible. C<sv_utf8_encode(sv)> is like sv_utf8_upgrade except that
1193it does not set the C<UTF8> flag. C<sv_utf8_decode()> does the
1194opposite of C<sv_utf8_encode()>. Note that none of these are to be
1195used as general-purpose encoding or decoding interfaces: C<use Encode>
1196for that. C<sv_utf8_upgrade()> is affected by the encoding pragma
1197but C<sv_utf8_downgrade()> is not (since the encoding pragma is
1198designed to be a one-way street).
95a1a48b
JH
1199
1200=item *
1201
376d9008 1202C<is_utf8_char(s)> returns true if the pointer points to a valid UTF-8
90f968e0 1203character.
95a1a48b
JH
1204
1205=item *
1206
376d9008 1207C<is_utf8_string(buf, len)> returns true if C<len> bytes of the buffer
95a1a48b
JH
1208are valid UTF-8.
1209
1210=item *
1211
376d9008
JB
1212C<UTF8SKIP(buf)> will return the number of bytes in the UTF-8 encoded
1213character in the buffer. C<UNISKIP(chr)> will return the number of bytes
1214required to UTF-8-encode the Unicode character code point. C<UTF8SKIP()>
90f968e0 1215is useful for example for iterating over the characters of a UTF-8
376d9008 1216encoded buffer; C<UNISKIP()> is useful, for example, in computing
90f968e0 1217the size required for a UTF-8 encoded buffer.
95a1a48b
JH
1218
1219=item *
1220
376d9008 1221C<utf8_distance(a, b)> will tell the distance in characters between the
95a1a48b
JH
1222two pointers pointing to the same UTF-8 encoded buffer.
1223
1224=item *
1225
376d9008
JB
C<utf8_hop(s, off)> will return a pointer to a UTF-8 encoded buffer
that is C<off> (positive or negative) Unicode characters displaced
from the UTF-8 buffer C<s>.  Be careful not to overstep the buffer:
C<utf8_hop()> will merrily run off the end or the beginning of the
buffer if told to do so.
95a1a48b 1231
d2cc3551
JH
1232=item *
1233
376d9008
JB
1234C<pv_uni_display(dsv, spv, len, pvlim, flags)> and
1235C<sv_uni_display(dsv, ssv, pvlim, flags)> are useful for debugging the
1236output of Unicode strings and scalars. By default they are useful
1237only for debugging--they display B<all> characters as hexadecimal code
1bfb14c4
JH
1238points--but with the flags C<UNI_DISPLAY_ISPRINT>,
1239C<UNI_DISPLAY_BACKSLASH>, and C<UNI_DISPLAY_QQ> you can make the
1240output more readable.
d2cc3551
JH
1241
1242=item *
1243
376d9008
JB
C<ibcmp_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to
compare two strings case-insensitively in Unicode.  For case-sensitive
comparisons you can just use C<memEQ()> and C<memNE()> as usual.
d2cc3551 1247
c349b1b9
JH
1248=back
1249
95a1a48b
JH
1250For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h>
1251in the Perl source code distribution.
1252
c29a771d
JH
1253=head1 BUGS
1254
376d9008 1255=head2 Interaction with Locales
7eabb34d 1256
376d9008
JB
1257Use of locales with Unicode data may lead to odd results. Currently,
1258Perl attempts to attach 8-bit locale info to characters in the range
12590..255, but this technique is demonstrably incorrect for locales that
1260use characters above that range when mapped into Unicode. Perl's
1261Unicode support will also tend to run slower. Use of locales with
1262Unicode is discouraged.
c29a771d 1263
376d9008 1264=head2 Interaction with Extensions
7eabb34d 1265
376d9008 1266When Perl exchanges data with an extension, the extension should be
7eabb34d 1267able to understand the UTF-8 flag and act accordingly. If the
376d9008
JB
1268extension doesn't know about the flag, it's likely that the extension
1269will return incorrectly-flagged data.
7eabb34d
A
1270
So if you're working with Unicode data, consult the documentation of
every module you're using to see whether there are any issues with
Unicode data exchange.  If the documentation does not talk about Unicode
at all, suspect the worst and probably look at the source to learn how
the module is implemented.  Modules written completely in Perl shouldn't
cause problems.  Modules that directly or indirectly access code written
in other programming languages are at risk.
7eabb34d 1278
376d9008 1279For affected functions, the simple strategy to avoid data corruption is
7eabb34d 1280to always make the encoding of the exchanged data explicit. Choose an
376d9008 1281encoding that you know the extension can handle. Convert arguments passed
7eabb34d
A
1282to the extensions to that encoding and convert results back from that
1283encoding. Write wrapper functions that do the conversions for you, so
1284you can later change the functions when the extension catches up.
1285
376d9008 1286To provide an example, let's say the popular Foo::Bar::escape_html
7eabb34d
A
1287function doesn't deal with Unicode data yet. The wrapper function
1288would convert the argument to raw UTF-8 and convert the result back to
376d9008 1289Perl's internal representation like so:
7eabb34d
A
1290
    use Encode ();

    sub my_escape_html ($) {
        my($what) = shift;
        return unless defined $what;
        # encode to octets for the extension, decode its result back
        Encode::decode_utf8(Foo::Bar::escape_html(Encode::encode_utf8($what)));
    }
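
To try the wrapper pattern without the (fictional) C<Foo::Bar> module,
a byte-only function can stand in for the extension.  In this sketch
C<bytes_only_escape_html> is a hypothetical substitute that refuses
anything but octets; only the wrapper reflects the technique described
above.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for an extension that only understands octets:
# it escapes "&", "<" and ">" and insists on a byte string.
sub bytes_only_escape_html {
    my ($octets) = @_;
    die "wide character passed to byte-only code\n"
        if $octets =~ /[^\x00-\xFF]/;
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;');
    $octets =~ s/([&<>])/$entity{$1}/g;
    return $octets;
}

# The wrapper: encode on the way in, decode on the way out.
sub my_escape_html {
    my ($what) = @_;
    return unless defined $what;
    return Encode::decode_utf8(
        bytes_only_escape_html(Encode::encode_utf8($what)));
}

my $escaped = my_escape_html("<b>\x{263A}</b>");
printf "%d characters after escaping\n", length($escaped);
```

Keeping the conversion in one wrapper means that only this function has
to change once the extension learns about Unicode.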
1296
1297Sometimes, when the extension does not convert data but just stores
1298and retrieves them, you will be in a position to use the otherwise
1299dangerous Encode::_utf8_on() function. Let's say the popular
66b79f27 1300C<Foo::Bar> extension, written in C, provides a C<param> method that
7eabb34d
A
1301lets you store and retrieve data according to these prototypes:
1302
1303 $self->param($name, $value); # set a scalar
1304 $value = $self->param($name); # retrieve a scalar
1305
1306If it does not yet provide support for any encoding, one could write a
1307derived class with such a C<param> method:
1308
    sub param {
        my($self,$name,$value) = @_;
        utf8::upgrade($name);     # make sure it is UTF-8 encoded
        if (defined $value) {
            utf8::upgrade($value);  # make sure it is UTF-8 encoded
            return $self->SUPER::param($name,$value);
        } else {
            my $ret = $self->SUPER::param($name);
            Encode::_utf8_on($ret); # we know, it is UTF-8 encoded
            return $ret;
        }
    }
1321
a73d23f6
RGS
Some extensions provide filters on data entry/exit points, such as
DB_File::filter_store_key and family.  Look out for such filters in
the documentation of your extensions; they can make the transition to
Unicode data much easier.
1326
376d9008 1327=head2 Speed
7eabb34d 1328
c29a771d 1329Some functions are slower when working on UTF-8 encoded strings than
574c8022 1330on byte encoded strings. All functions that need to hop over
7c17141f
JH
1331characters such as length(), substr() or index(), or matching regular
1332expressions can work B<much> faster when the underlying data are
1333byte-encoded.
1334
In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
a caching scheme was introduced which will hopefully make the slowness
somewhat less spectacular, at least for some operations.  In general,
operations with UTF-8 encoded strings are still slower.  As an example,
the Unicode properties (character classes) like C<\p{Nd}> are known to
be quite a bit slower (5-20 times) than their simpler counterparts
like C<\d> (then again, there are 268 Unicode characters matching C<Nd>
compared with the 10 ASCII characters matching C<\d>).
666f95b9 1343
c8d992ba
A
1344=head2 Porting code from perl-5.6.X
1345
1346Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer
1347was required to use the C<utf8> pragma to declare that a given scope
1348expected to deal with Unicode data and had to make sure that only
1349Unicode data were reaching that scope. If you have code that is
1350working with 5.6, you will need some of the following adjustments to
1351your code. The examples are written such that the code will continue
1352to work under 5.6, so you should be safe to try them out.
1353
1354=over 4
1355
1356=item *
1357
1358A filehandle that should read or write UTF-8
1359
    if ($] > 5.007) {
        binmode $fh, ":utf8";
    }
1363
1364=item *
1365
1366A scalar that is going to be passed to some extension
1367
1368Be it Compress::Zlib, Apache::Request or any extension that has no
1369mention of Unicode in the manpage, you need to make sure that the
1370UTF-8 flag is stripped off. Note that at the time of this writing
1371(October 2002) the mentioned modules are not UTF-8-aware. Please
1372check the documentation to verify if this is still true.
1373
    if ($] > 5.007) {
        require Encode;
        $val = Encode::encode_utf8($val); # make octets
    }
1378
1379=item *
1380
1381A scalar we got back from an extension
1382
1383If you believe the scalar comes back as UTF-8, you will most likely
1384want the UTF-8 flag restored:
1385
    if ($] > 5.007) {
        require Encode;
        $val = Encode::decode_utf8($val);
    }
1390
1391=item *
1392
1393Same thing, if you are really sure it is UTF-8
1394
    if ($] > 5.007) {
        require Encode;
        Encode::_utf8_on($val);
    }
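
The difference between the last two items can be seen in a short
round trip.  This is an illustrative sketch: C<Encode::decode_utf8()>
converts the octets, whereas C<Encode::_utf8_on()> merely sets the flag
on whatever bytes are there, so it is only appropriate when the octets
are known to be well-formed UTF-8.

```perl
use strict;
use warnings;
use Encode ();

my $chars  = "\x{263A}";                    # one character
my $octets = Encode::encode_utf8($chars);   # three octets, flag off

# Conversion that examines the octets.
my $decoded = Encode::decode_utf8($octets);

# No conversion: only the flag on this copy is switched on.
my $flagged = $octets;
Encode::_utf8_on($flagged);

printf "octets: %d, decoded: %d, flagged: %d\n",
    length($octets), length($decoded), length($flagged);
```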
1399
1400=item *
1401
1402A wrapper for fetchrow_array and fetchrow_hashref
1403
1404When the database contains only UTF-8, a wrapper function or method is
1405a convenient way to replace all your fetchrow_array and
1406fetchrow_hashref calls. A wrapper function will also make it easier to
1407adapt to future enhancements in your database driver. Note that at the
1408time of this writing (October 2002), the DBI has no standardized way
1409to deal with UTF-8 data. Please check the documentation to verify if
1410that is still true.
1411
    sub fetchrow {
        my($self, $sth, $what) = @_; # $what is one of fetchrow_{array,hashref}
        if ($] < 5.007) {
            return $sth->$what;
        } else {
            require Encode;
            if (wantarray) {
                my @arr = $sth->$what;
                for (@arr) {
                    defined && /[^\000-\177]/ && Encode::_utf8_on($_);
                }
                return @arr;
            } else {
                my $ret = $sth->$what;
                if (ref $ret) {
                    for my $k (keys %$ret) {
                        defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret->{$k};
                    }
                    return $ret;
                } else {
                    defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
                    return $ret;
                }
            }
        }
    }
1438
1439
1440=item *
1441
1442A large scalar that you know can only contain ASCII
1443
1444Scalars that contain only ASCII and are marked as UTF-8 are sometimes
1445a drag to your program. If you recognize such a situation, just remove
1446the UTF-8 flag:
1447
    utf8::downgrade($val) if $] > 5.007;
1449
1450=back
1451
393fec97
GS
1452=head1 SEE ALSO
1453
72ff2908 1454L<perluniintro>, L<encoding>, L<Encode>, L<open>, L<utf8>, L<bytes>,
a05d7ebb 1455L<perlretut>, L<perlvar/"${^UNICODE}">
393fec97
GS
1456
1457=cut