=head1 NAME

perlunicode - Unicode support in Perl

=head1 DESCRIPTION

=head2 Important Caveats

Unicode support is an extensive requirement. While Perl does not
implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.

=over 4

=item Input and Output Disciplines

Perl knows when a filehandle uses Perl's internal Unicode encodings
(UTF-8 or UTF-EBCDIC) if the filehandle is opened with the ":utf8"
layer. Other encodings can be converted to Perl's encoding on input
or from Perl's encoding on output by use of the ":encoding(...)"
layer. See L<open>.

To indicate that Perl source itself is using a particular encoding,
see L<encoding>.

=item Regular Expressions

The regular expression compiler produces polymorphic opcodes. That is,
the pattern adapts to the data and automatically switches to the Unicode
character scheme when presented with Unicode data--or instead uses
a traditional byte scheme when presented with byte data.

=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts

As a compatibility measure, the C<use utf8> pragma must be explicitly
included to enable recognition of UTF-8 in the Perl scripts themselves
on ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
machines. B<These are the only times when an explicit C<use utf8>
is needed.>

You can also use the C<encoding> pragma to change the default encoding
of the data in your script; see L<encoding>.

=back

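As a minimal sketch of the layers described under "Input and Output
Disciplines", the following reads UTF-8 octets through an
C<:encoding(UTF-8)> layer. The in-memory filehandle and the variable
names are illustrative assumptions; a real program would normally open
a named file.

```perl
use strict;
use warnings;

# Five octets: "caf" followed by U+00E9 encoded in UTF-8 (0xC3 0xA9).
my $octets = "caf\xC3\xA9";

# Illustrative in-memory filehandle; the layer decodes UTF-8 on input.
open(my $in, '<:encoding(UTF-8)', \$octets) or die "Cannot open: $!";
my $text = <$in>;
close($in);

# After decoding, the string is counted in characters, not octets.
my $char_count = length($text);
my $last_ord   = ord(substr($text, -1));
```

Under these assumptions the five input octets become four characters,
the last of which has the ordinal 0xE9.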
=head2 Byte and Character Semantics

Beginning with version 5.6, Perl uses logically-wide characters to
represent strings internally.

In future, Perl-level operations will be expected to work with
characters rather than bytes.

However, as an interim compatibility measure, Perl aims to
provide a safe migration path from byte semantics to character
semantics for programs. For operations where Perl can unambiguously
decide that the input data are characters, Perl switches to
character semantics. For operations where this determination cannot
be made without additional information from the user, Perl decides in
favor of compatibility and chooses to use byte semantics.

This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if
none of the program's inputs were marked as being a source of Unicode
character data. Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as %ENV),
or from literals and constants in the source text.

On Windows platforms, if the C<-C> command line switch is used or the
${^WIDE_SYSTEM_CALLS} global flag is set to C<1>, all system calls
will use the corresponding wide-character APIs. This feature is
available only on Windows to conform to the API standard already
established for that platform.

The C<bytes> pragma will always, regardless of platform, force byte
semantics in a particular lexical scope. See L<bytes>.

The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
Note that this pragma is only required while Perl defaults to byte
semantics; when character semantics become the default, this pragma
may become a no-op. See L<utf8>.

Unless explicitly stated, Perl operators use character semantics
for Unicode data and byte semantics for non-Unicode data.
The decision to use character semantics is made transparently. If
input data comes from a Unicode source--for example, if a character
encoding discipline is added to a filehandle or a literal Unicode
string constant appears in a program--character semantics apply.
Otherwise, byte semantics are in effect. The C<bytes> pragma should
be used to force byte semantics on Unicode data.

If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will be created by
decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
old Unicode string used EBCDIC.
This translation is done without regard to the system's native 8-bit
encoding, so to change this for systems with non-Latin-1 and
non-EBCDIC native encodings use the C<encoding> pragma. See
L<encoding>.

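The concatenation rule above can be illustrated with a short sketch.
It assumes an ASCII-based platform, and the variable names are
illustrative only.

```perl
use strict;
use warnings;

my $byte_string = "\xE9";       # one byte, no characters above 0xFF
my $char_string = "\x{263A}";   # one character above 0xFF

# The byte string is read as Latin-1 when the two are joined.
my $joined = $byte_string . $char_string;

my $joined_length = length($joined);
my $first_ord     = ord(substr($joined, 0, 1));
my $second_ord    = ord(substr($joined, 1, 1));
```

The byte 0xE9 keeps its ordinal value in the result, which is how a
Latin-1 interpretation of that byte behaves.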
Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
logically just a number ranging from 0 to 2**31 or so. Larger
characters may encode into longer sequences of bytes internally, but
this internal detail is mostly hidden from Perl code.
See L<perluniintro> for more.

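To make the difference between the two semantics concrete, here is a
small sketch that measures the same string both ways. It assumes an
ASCII-based platform where the internal encoding is UTF-8; the variable
names are illustrative.

```perl
use strict;
use warnings;

my $smiley = "\x{263A}";

# Character semantics: one logical character.
my $char_count = length($smiley);

# Byte semantics, forced for this block only by the bytes pragma.
my $octet_count = do { use bytes; length($smiley) };
```

Here C<$char_count> is 1, while C<$octet_count> reflects the three
octets of the internal UTF-8 form. Looking at the internal form like
this is rarely needed in ordinary code.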
=head2 Effects of Character Semantics

Character semantics have the following effects:

=over 4

=item *

Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.

If you use a Unicode editor to edit your program, Unicode characters
may occur directly within the literal strings in one of the various
Unicode encodings (UTF-8, UTF-EBCDIC, UCS-2, etc.), but will be recognized
as such and converted to Perl's internal representation only if the
appropriate L<encoding> is specified.

Unicode characters can also be added to a string by using the C<\x{...}>
notation. The Unicode code for the desired character, in
hexadecimal, should be placed in the braces. For instance, a smiley
face is C<\x{263A}>. This encoding scheme only works for characters
with a code of 0x100 or above.

Additionally, if you

   use charnames ':full';

you can use the C<\N{...}> notation and put the official Unicode character
name within the braces, such as C<\N{WHITE SMILING FACE}>.

=item *

If an appropriate L<encoding> is specified, identifiers within the
Perl script may contain Unicode alphanumeric characters, including
ideographs. Perl does not currently attempt to canonicalize variable
names.

=item *

Regular expressions match characters instead of bytes.
"." matches a character instead of a byte. The C<\C> pattern
is provided to force a match of a single byte--a "C<char>" in C, hence C<\C>.

=item *

Character classes in regular expressions match characters instead of
bytes and match against the character properties specified in the
Unicode properties database. C<\w> can be used to match an
ideograph, for instance.

=item *

Named Unicode properties, scripts, and block ranges may be used like
character classes via the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".
For instance, C<\p{Lu}> matches any
character with the Unicode "Lu" (Letter, uppercase) property, while
C<\p{M}> matches any character with an "M" (mark--accents and such)
property. Braces are not required for single letter properties, so
C<\p{M}> is equivalent to C<\pM>. Many predefined properties are
available, such as C<\p{Mirrored}> and C<\p{Tibetan}>.

The official Unicode script and block names have spaces and dashes as
separators, but for convenience you can use dashes, spaces, or
underbars, and case is unimportant. It is
recommended, however, that for consistency you use the following naming:
the official Unicode script, property, or block name (see below for the
additional rules that apply to block names) with whitespace and dashes
removed, and the words "uppercase-first-lowercase-rest". "C<Latin-1
Supplement>" thus becomes "C<Latin1Supplement>".

You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
(^) between the first brace and the property name: C<\p{^Tamil}> is
equal to C<\P{Tamil}>.

Here are the basic Unicode General Category properties, followed by their
long form. You can use either; C<\p{Lu}> and C<\p{UppercaseLetter}>,
for instance, are identical.

    Short       Long

    L           Letter
    Lu          UppercaseLetter
    Ll          LowercaseLetter
    Lt          TitlecaseLetter
    Lm          ModifierLetter
    Lo          OtherLetter

    M           Mark
    Mn          NonspacingMark
    Mc          SpacingMark
    Me          EnclosingMark

    N           Number
    Nd          DecimalNumber
    Nl          LetterNumber
    No          OtherNumber

    P           Punctuation
    Pc          ConnectorPunctuation
    Pd          DashPunctuation
    Ps          OpenPunctuation
    Pe          ClosePunctuation
    Pi          InitialPunctuation
                (may behave like Ps or Pe depending on usage)
    Pf          FinalPunctuation
                (may behave like Ps or Pe depending on usage)
    Po          OtherPunctuation

    S           Symbol
    Sm          MathSymbol
    Sc          CurrencySymbol
    Sk          ModifierSymbol
    So          OtherSymbol

    Z           Separator
    Zs          SpaceSeparator
    Zl          LineSeparator
    Zp          ParagraphSeparator

    C           Other
    Cc          Control
    Cf          Format
    Cs          Surrogate   (not usable)
    Co          PrivateUse
    Cn          Unassigned

Single-letter properties match all characters in any of the
two-letter sub-properties starting with the same letter.
C<L&> is a special case, which is an alias for C<Ll>, C<Lu>, and C<Lt>.

Because Perl hides the need for the user to understand the internal
representation of Unicode characters, there is no need to implement the
somewhat messy concept of surrogates. C<Cs> is therefore not
supported.

Because scripts differ in their directionality--Hebrew is
written right to left, for example--Unicode supplies these properties:

    Property    Meaning

    BidiL       Left-to-Right
    BidiLRE     Left-to-Right Embedding
    BidiLRO     Left-to-Right Override
    BidiR       Right-to-Left
    BidiAL      Right-to-Left Arabic
    BidiRLE     Right-to-Left Embedding
    BidiRLO     Right-to-Left Override
    BidiPDF     Pop Directional Format
    BidiEN      European Number
    BidiES      European Number Separator
    BidiET      European Number Terminator
    BidiAN      Arabic Number
    BidiCS      Common Number Separator
    BidiNSM     Non-Spacing Mark
    BidiBN      Boundary Neutral
    BidiB       Paragraph Separator
    BidiS       Segment Separator
    BidiWS      Whitespace
    BidiON      Other Neutrals

For example, C<\p{BidiR}> matches characters that are normally
written right to left.

=back

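The notations introduced in the list above (C<\x{...}>, C<\N{...}>,
C<\p{...}>, C<\P{...}>, and the caret negation) can be tried out with a
sketch like the following. The variable names are illustrative, and the
results assume a reasonably recent Perl.

```perl
use strict;
use warnings;
use charnames ':full';

# Two spellings of the same character, U+263A.
my $by_number = "\x{263A}";
my $by_name   = "\N{WHITE SMILING FACE}";
my $same_char = $by_number eq $by_name;

# General Category properties, short and long forms.
my $upper_short = "A" =~ /^\p{Lu}$/;
my $upper_long  = "A" =~ /^\p{UppercaseLetter}$/;

# Two ways to negate a property.
my $not_upper   = "a" =~ /^\P{Lu}$/;
my $caret_form  = "a" =~ /^\p{^Lu}$/;

# A combining acute accent is a mark; braces are optional for \pM.
my $is_mark     = "\x{301}" =~ /^\pM$/;

# An ideograph counts as a word character.
my $ideograph_w = "\x{4E2D}" =~ /^\w$/;
```

Each of the match variables is true under these assumptions.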
=head2 Scripts

The script names which can be used by C<\p{...}> and C<\P{...}>,
such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows:

    Arabic
    Armenian
    Bengali
    Bopomofo
    Buhid
    CanadianAboriginal
    Cherokee
    Cyrillic
    Deseret
    Devanagari
    Ethiopic
    Georgian
    Gothic
    Greek
    Gujarati
    Gurmukhi
    Han
    Hangul
    Hanunoo
    Hebrew
    Hiragana
    Inherited
    Kannada
    Katakana
    Khmer
    Lao
    Latin
    Malayalam
    Mongolian
    Myanmar
    Ogham
    OldItalic
    Oriya
    Runic
    Sinhala
    Syriac
    Tagalog
    Tagbanwa
    Tamil
    Telugu
    Thaana
    Thai
    Tibetan
    Yi

Extended property classes can supplement the basic
properties, defined by the F<PropList> Unicode database:

    ASCIIHexDigit
    BidiControl
    Dash
    Deprecated
    Diacritic
    Extender
    GraphemeLink
    HexDigit
    Hyphen
    Ideographic
    IDSBinaryOperator
    IDSTrinaryOperator
    JoinControl
    LogicalOrderException
    NoncharacterCodePoint
    OtherAlphabetic
    OtherDefaultIgnorableCodePoint
    OtherGraphemeExtend
    OtherLowercase
    OtherMath
    OtherUppercase
    QuotationMark
    Radical
    SoftDotted
    TerminalPunctuation
    UnifiedIdeograph
    WhiteSpace

and there are further derived properties:

    Alphabetic      Lu + Ll + Lt + Lm + Lo + OtherAlphabetic
    Lowercase       Ll + OtherLowercase
    Uppercase       Lu + OtherUppercase
    Math            Sm + OtherMath

    ID_Start        Lu + Ll + Lt + Lm + Lo + Nl
    ID_Continue     ID_Start + Mn + Mc + Nd + Pc

    Any             Any character
    Assigned        Any non-Cn character (i.e. synonym for \P{Cn})
    Unassigned      Synonym for \p{Cn}
    Common          Any character (or unassigned code point)
                    not explicitly assigned to a script

For backward compatibility, all properties mentioned so far may have C<Is>
prepended to their name, so C<\P{IsLu}>, for example, is equal to C<\P{Lu}>.

=head2 Blocks

In addition to B<scripts>, Unicode also defines B<blocks> of characters.
The difference between scripts and blocks is that the concept of
scripts is closer to natural languages, while the concept of blocks
is more of an artificial grouping based on groups of around 256
Unicode characters. For example, the C<Latin> script contains letters
from many blocks but does not contain all the characters from those
blocks. It does not, for example, contain digits, because digits are
shared across many scripts. Digits and similar groups, like
punctuation, are in a category called C<Common>.

For more about scripts, see the UTR #24:

   http://www.unicode.org/unicode/reports/tr24/

For more about blocks, see:

   http://www.unicode.org/Public/UNIDATA/Blocks.txt

Block names are given with the C<In> prefix. For example, the
Katakana block is referenced via C<\p{InKatakana}>. The C<In>
prefix may be omitted if there is no naming conflict with a script
or any other property, but it is recommended that C<In> always be used
to avoid confusion.

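The distinction between scripts and blocks described above can be
checked with a short sketch. It uses the digit "1", which sits in the
Basic Latin block but belongs to the C<Common> script rather than
C<Latin>. The variable names are illustrative.

```perl
use strict;
use warnings;

my $digit = "1";

my $digit_in_block   = $digit =~ /^\p{InBasicLatin}$/ ? 1 : 0;  # block test
my $digit_in_latin   = $digit =~ /^\p{Latin}$/        ? 1 : 0;  # script test
my $digit_is_common  = $digit =~ /^\p{Common}$/       ? 1 : 0;  # shared script
my $letter_in_latin  = "A"    =~ /^\p{Latin}$/        ? 1 : 0;  # a Latin letter
```

The digit matches the block and C<Common> but not the C<Latin> script,
while the letter "A" matches C<Latin>.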
These block names are supported:

    InAlphabeticPresentationForms
    InArabic
    InArabicPresentationFormsA
    InArabicPresentationFormsB
    InArmenian
    InArrows
    InBasicLatin
    InBengali
    InBlockElements
    InBopomofo
    InBopomofoExtended
    InBoxDrawing
    InBraillePatterns
    InBuhid
    InByzantineMusicalSymbols
    InCJKCompatibility
    InCJKCompatibilityForms
    InCJKCompatibilityIdeographs
    InCJKCompatibilityIdeographsSupplement
    InCJKRadicalsSupplement
    InCJKSymbolsAndPunctuation
    InCJKUnifiedIdeographs
    InCJKUnifiedIdeographsExtensionA
    InCJKUnifiedIdeographsExtensionB
    InCherokee
    InCombiningDiacriticalMarks
    InCombiningDiacriticalMarksforSymbols
    InCombiningHalfMarks
    InControlPictures
    InCurrencySymbols
    InCyrillic
    InCyrillicSupplementary
    InDeseret
    InDevanagari
    InDingbats
    InEnclosedAlphanumerics
    InEnclosedCJKLettersAndMonths
    InEthiopic
    InGeneralPunctuation
    InGeometricShapes
    InGeorgian
    InGothic
    InGreekExtended
    InGreekAndCoptic
    InGujarati
    InGurmukhi
    InHalfwidthAndFullwidthForms
    InHangulCompatibilityJamo
    InHangulJamo
    InHangulSyllables
    InHanunoo
    InHebrew
    InHighPrivateUseSurrogates
    InHighSurrogates
    InHiragana
    InIPAExtensions
    InIdeographicDescriptionCharacters
    InKanbun
    InKangxiRadicals
    InKannada
    InKatakana
    InKatakanaPhoneticExtensions
    InKhmer
    InLao
    InLatin1Supplement
    InLatinExtendedA
    InLatinExtendedAdditional
    InLatinExtendedB
    InLetterlikeSymbols
    InLowSurrogates
    InMalayalam
    InMathematicalAlphanumericSymbols
    InMathematicalOperators
    InMiscellaneousMathematicalSymbolsA
    InMiscellaneousMathematicalSymbolsB
    InMiscellaneousSymbols
    InMiscellaneousTechnical
    InMongolian
    InMusicalSymbols
    InMyanmar
    InNumberForms
    InOgham
    InOldItalic
    InOpticalCharacterRecognition
    InOriya
    InPrivateUseArea
    InRunic
    InSinhala
    InSmallFormVariants
    InSpacingModifierLetters
    InSpecials
    InSuperscriptsAndSubscripts
    InSupplementalArrowsA
    InSupplementalArrowsB
    InSupplementalMathematicalOperators
    InSupplementaryPrivateUseAreaA
    InSupplementaryPrivateUseAreaB
    InSyriac
    InTagalog
    InTagbanwa
    InTags
    InTamil
    InTelugu
    InThaana
    InThai
    InTibetan
    InUnifiedCanadianAboriginalSyllabics
    InVariationSelectors
    InYiRadicals
    InYiSyllables

=over 4

=item *

The special pattern C<\X> matches any extended Unicode
sequence--"a combining character sequence" in Standardese--where the
first character is a base character and subsequent characters are mark
characters that apply to the base character. C<\X> is equivalent to
C<(?:\PM\pM*)>.

=item *

The C<tr///> operator translates characters instead of bytes. Note
that the C<tr///CU> functionality has been removed. For similar
functionality see C<pack('U0', ...)> and C<pack('C0', ...)>.

=item *

Case translation operators use the Unicode case translation tables
when character input is provided. Note that C<uc()>, or C<\U> in
interpolated strings, translates to uppercase, while C<ucfirst>,
or C<\u> in interpolated strings, translates to titlecase in languages
that make the distinction.

=item *

Most operators that deal with positions or lengths in a string will
automatically switch to using character positions, including
C<chop()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
C<sprintf()>, C<write()>, and C<length()>. Operators that
specifically do not switch include C<vec()>, C<pack()>, and
C<unpack()>. Operators that really don't care include C<chomp()>,
operators that treat strings as a bucket of bits such as C<sort()>,
and operators dealing with filenames.

=item *

The C<pack()>/C<unpack()> letters "C<c>" and "C<C>" do I<not> change,
since they are often used for byte-oriented formats. Again, think
"C<char>" in the C language. There is a new "C<U>" specifier
that converts between Unicode characters and integers.

=item *

The C<chr()> and C<ord()> functions work on characters, similar to
C<pack("U")> and C<unpack("U")>, I<not> C<pack("C")> and
C<unpack("C")>. C<pack("C")> and C<unpack("C")> are methods for
emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.
While these methods reveal the internal encoding of Unicode strings,
that is not something one normally needs to care about at all.

=item *

The bit string operators, C<& | ^ ~>, can operate on character data.
However, for backward compatibility, such as when using bit string
operations when characters are all less than 256 in ordinal value, one
should not use C<~> (the bit complement) with characters of both
values less than 256 and values greater than 256. Most importantly,
DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
will not hold. The reason for this mathematical I<faux pas> is that
the complement cannot return B<both> the 8-bit (byte-wide) bit
complement B<and> the full character-wide bit complement.

=item *

C<lc()>, C<uc()>, C<lcfirst()>, and C<ucfirst()> work for the following cases:

=over 8

=item *

the case mapping is from a single Unicode character to another
single Unicode character, or

=item *

the case mapping is from a single Unicode character to more
than one Unicode character.

=back

The following cases do not yet work:

=over 8

=item *

the "final sigma" (Greek), and

=item *

anything to do with locales (Lithuanian, Turkish, Azeri).

=back

See the Unicode Technical Report #21, Case Mappings, for more details.

=item *

And finally, C<scalar reverse()> reverses by character rather than by byte.

=back

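Several of the effects listed above can be observed in one short
sketch: titlecase versus uppercase, character-based C<length()> and
C<reverse()>, C<\X>, and the C<U> specifier of C<pack()> and
C<unpack()>. The sample characters and variable names are illustrative
choices, and the results assume a reasonably recent Perl.

```perl
use strict;
use warnings;

# U+01C6 has distinct uppercase (U+01C4) and titlecase (U+01C5) forms.
my $dz    = "\x{01C6}";
my $upper = uc($dz);
my $title = ucfirst($dz);

# "e" + combining acute accent + smiley: three characters.
my $string   = "e\x{301}\x{263A}";
my $length   = length($string);
my $reversed = reverse($string);          # scalar context: by character
my @clusters = $string =~ /(\X)/g;        # base character plus its marks

# pack("U") and unpack("U") convert between integers and characters.
my $packed      = pack("U", 0x263A);
my $same_as_chr = $packed eq chr(0x263A);
my ($code)      = unpack("U", chr(0x263A));
```

The combining accent stays attached to its base character for C<\X>,
so the three characters form two clusters.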
=head2 User-Defined Character Properties

You can define your own character properties by defining subroutines
whose names begin with "In" or "Is". The subroutines must be
visible in the package that uses the properties. The user-defined
properties can be used in the regular expression C<\p> and C<\P>
constructs.

The subroutines must return a specially-formatted string, with one
or more newline-separated lines. Each line must be one of the following:

=over 4

=item *

Two hexadecimal numbers separated by horizontal whitespace (space or
tab characters) denoting a range of Unicode code points to include.

=item *

Something to include, prefixed by "+": a built-in character
property (prefixed by "utf8::"), to represent all the characters in that
property; two hexadecimal code points for a range; or a single
hexadecimal code point.

=item *

Something to exclude, prefixed by "-": an existing character
property (prefixed by "utf8::"), for all the characters in that
property; two hexadecimal code points for a range; or a single
hexadecimal code point.

=item *

Something to negate, prefixed by "!": an existing character
property (prefixed by "utf8::") for all the characters except the
characters in the property; two hexadecimal code points for a range;
or a single hexadecimal code point.

=back

For example, to define a property that covers both the Japanese
syllabaries (hiragana and katakana), you can define

    sub InKana {
        return <<END;
    3040\t309F
    30A0\t30FF
    END
    }

Imagine that the here-doc end marker is at the beginning of the line.
Now you can use C<\p{InKana}> and C<\P{InKana}>.

You could also have used the existing block property names:

    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    END
    }

Suppose you wanted to match only the allocated characters,
not the raw block ranges: in other words, you want to remove
the non-characters:

    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    -utf8::IsCn
    END
    }

The negation is useful for defining (surprise!) negated classes.

    sub InNotKana {
        return <<'END';
    !utf8::InHiragana
    -utf8::InKatakana
    +utf8::IsCn
    END
    }

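As a runnable sketch of the range form shown above, the following
defines a property and then uses it with C<\p> and C<\P>. The name
C<InKanaSketch> is hypothetical, and the string is built with C<join>
only to sidestep the here-doc indentation caveat.

```perl
use strict;
use warnings;

# Hypothetical property; the name must begin with "In" or "Is".
sub InKanaSketch {
    # One "start<TAB>end" range per line, in hexadecimal.
    return join "", map { "$_\n" } "3040\t309F", "30A0\t30FF";
}

my $hiragana_a = "\x{3042}" =~ /^\p{InKanaSketch}$/ ? 1 : 0;
my $katakana_a = "\x{30A2}" =~ /^\p{InKanaSketch}$/ ? 1 : 0;
my $latin_a    = "A"        =~ /^\P{InKanaSketch}$/ ? 1 : 0;
```

Both kana letters match the property, and the Latin letter matches its
negation.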
=head2 Character Encodings for Input and Output

See L<Encode>.

=head2 Unicode Regular Expression Support Level

The following list of Unicode support for regular expressions describes
all the features currently supported. The references to "Level N"
and the section numbers refer to the Unicode Technical Report 18,
"Unicode Regular Expression Guidelines".

=over 4

=item *

Level 1 - Basic Unicode Support

        2.1 Hex Notation                        - done          [1]
            Named Notation                      - done          [2]
        2.2 Categories                          - done          [3][4]
        2.3 Subtraction                         - MISSING       [5][6]
        2.4 Simple Word Boundaries              - done          [7]
        2.5 Simple Loose Matches                - done          [8]
        2.6 End of Line                         - MISSING       [9][10]

        [ 1] \x{...}
        [ 2] \N{...}
        [ 3] . \p{...} \P{...}
        [ 4] now scripts (see UTR#24 Script Names) in addition to blocks
        [ 5] have negation
        [ 6] can use regular expression look-ahead [a]
             or user-defined character properties [b] to emulate subtraction
        [ 7] include Letters in word characters
        [ 8] note that Perl does Full case-folding in matching, not Simple:
             for example U+1F88 is equivalent with U+1F00 U+03B9,
             not with U+1F80. This difference matters for certain Greek
             capital letters with certain modifiers: the Full case-folding
             decomposes the letter, while the Simple case-folding would map
             it to a single character.
        [ 9] see UTR#13 Unicode Newline Guidelines
        [10] should do ^ and $ also on \x{85}, \x{2028} and \x{2029}
             (should also affect <>, $., and script line numbers)
             (the \x{85}, \x{2028} and \x{2029} do match \s)

[a] You can mimic class subtraction using lookahead.
For example, what TR18 might write as

    [{Greek}-[{UNASSIGNED}]]

in Perl can be written as:

    (?!\p{Unassigned})\p{InGreekAndCoptic}
    (?=\p{Assigned})\p{InGreekAndCoptic}

But in this particular example, you probably really want

    \p{Greek}

which will match assigned characters known to be part of the Greek script.

[b] See L</User-Defined Character Properties>.

=item *

Level 2 - Extended Unicode Support

        3.1 Surrogates                          - MISSING
        3.2 Canonical Equivalents               - MISSING       [11][12]
        3.3 Locale-Independent Graphemes        - MISSING       [13]
        3.4 Locale-Independent Words            - MISSING       [14]
        3.5 Locale-Independent Loose Matches    - MISSING       [15]

        [11] see UTR#15 Unicode Normalization
        [12] have Unicode::Normalize but not integrated to regexes
        [13] have \X but at this level . should equal that
        [14] need three classes, not just \w and \W
        [15] see UTR#21 Case Mappings

=item *

Level 3 - Locale-Sensitive Support

        4.1 Locale-Dependent Categories         - MISSING
        4.2 Locale-Dependent Graphemes          - MISSING       [16][17]
        4.3 Locale-Dependent Words              - MISSING
        4.4 Locale-Dependent Loose Matches      - MISSING
        4.5 Locale-Dependent Ranges             - MISSING

        [16] see UTR#10 Unicode Collation Algorithms
        [17] have Unicode::Collate but not integrated to regexes

=back

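The lookahead technique from note [a] under Level 1 can be sketched
with two General Category properties. The choice of "letters minus
uppercase letters" is an illustrative assumption, not an example from
TR18, and the variable names are hypothetical.

```perl
use strict;
use warnings;

# Emulate the class subtraction "L minus Lu": the negative lookahead
# rejects uppercase letters before \p{L} consumes a character.
my $letter_not_upper = qr/(?!\p{Lu})\p{L}/;

my @matches = "aBcD1" =~ /($letter_not_upper)/g;
my $joined  = join("", @matches);
```

Only the lowercase letters survive the subtraction; the digit is not a
letter at all, so it never matches C<\p{L}>.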
=head2 Unicode Encodings

Unicode characters are assigned to I<code points>, which are abstract
numbers. To use these numbers, various encodings are needed.

=over 4

=item *

UTF-8

UTF-8 is a variable-length (1 to 6 bytes; current character allocations
require at most 4 bytes), byte-order independent encoding. For ASCII
(and we really do mean 7-bit ASCII, not another 8-bit encoding), UTF-8 is
transparent.

The following table is from Unicode 3.2.

  Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte

    U+0000..U+007F       00..7F
    U+0080..U+07FF       C2..DF    80..BF
    U+0800..U+0FFF       E0        A0..BF    80..BF
    U+1000..U+CFFF       E1..EC    80..BF    80..BF
    U+D000..U+D7FF       ED        80..9F    80..BF
    U+D800..U+DFFF       ******* ill-formed *******
    U+E000..U+FFFF       EE..EF    80..BF    80..BF
   U+10000..U+3FFFF      F0        90..BF    80..BF    80..BF
   U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
  U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF

Note the C<A0..BF> in C<U+0800..U+0FFF>, the C<80..9F> in
C<U+D000..U+D7FF>, the C<90..BF> in C<U+10000..U+3FFFF>, and the
C<80..8F> in C<U+100000..U+10FFFF>. The "gaps" are caused by legal
UTF-8 avoiding non-shortest encodings: it is technically possible to
UTF-8-encode a single code point in different ways, but that is
explicitly forbidden, and the shortest possible encoding should always
be used. So that's what Perl does.

Another way to look at it is via bits:

  Code Points                    1st Byte  2nd Byte  3rd Byte  4th Byte

                    0aaaaaaa     0aaaaaaa
            00000bbbbbaaaaaa     110bbbbb  10aaaaaa
            ccccbbbbbbaaaaaa     1110cccc  10bbbbbb  10aaaaaa
  00000dddccccccbbbbbbaaaaaa     11110ddd  10cccccc  10bbbbbb  10aaaaaa

As you can see, the continuation bytes all begin with C<10>, and the
leading bits of the start byte tell how many bytes there are in the
encoded character.

=item *

UTF-EBCDIC

Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.

=item *

UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)

The following items are mostly for reference; Perl doesn't
use them internally.

UTF-16 is a 2 or 4 byte encoding. The Unicode code points
C<U+0000..U+FFFF> are stored in a single 16-bit unit, and the code points
C<U+10000..U+10FFFF> in two 16-bit units. The latter case is
using I<surrogates>, the first 16-bit unit being the I<high
surrogate>, and the second being the I<low surrogate>.

Surrogates are code points set aside to encode the C<U+10000..U+10FFFF>
range of Unicode code points in pairs of 16-bit units. The I<high
surrogates> are the range C<U+D800..U+DBFF>, and the I<low surrogates>
are the range C<U+DC00..U+DFFF>. The surrogate encoding is

    $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
    $lo = ($uni - 0x10000) % 0x400 + 0xDC00;

and the decoding is

    $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);

If you try to generate surrogates (for example by using chr()), you
will get a warning if warnings are turned on, because those code
points are not valid for a Unicode character.

Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16
itself can be used for in-memory computations, but if storage or
transfer is required either UTF-16BE (big-endian) or UTF-16LE
(little-endian) encodings must be chosen.

This introduces another problem: what if you just know that your data
is UTF-16, but you don't know which endianness? Byte Order Marks, or
BOMs, are a solution to this. A special character has been reserved
in Unicode to function as a byte order marker: the character with the
code point C<U+FEFF> is the BOM.

The trick is that if you read a BOM, you will know the byte order,
since if it was written on a big-endian platform, you will read the
bytes C<0xFE 0xFF>, but if it was written on a little-endian platform,
you will read the bytes C<0xFF 0xFE>. (And if the originating platform
was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.)

The way this trick works is that the character with the code point
C<U+FFFE> is guaranteed not to be a valid Unicode character, so the
sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in
little-endian format" and cannot be "C<U+FFFE>, represented in big-endian
format".

c29a771d 905=item *
5cb3728c
RB
906
UTF-32, UTF-32BE, UTF-32LE

The UTF-32 family is pretty much like the UTF-16 family, except that
the units are 32-bit, and therefore the surrogate scheme is not
needed. The BOM signatures will be C<0x00 0x00 0xFE 0xFF> for BE and
C<0xFF 0xFE 0x00 0x00> for LE.
c349b1b9 913
c29a771d 914=item *
5cb3728c
RB
915
916UCS-2, UCS-4
c349b1b9 917
86bbd6d1 918Encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit
376d9008 919encoding. Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
339cfa0e
JH
920because it does not use surrogates. UCS-4 is a 32-bit encoding,
921functionally identical to UTF-32.
c349b1b9 922
c29a771d 923=item *
5cb3728c
RB
924
925UTF-7
c349b1b9 926
376d9008
JB
927A seven-bit safe (non-eight-bit) encoding, which is useful if the
928transport or storage is not eight-bit safe. Defined by RFC 2152.
c349b1b9 929
95a1a48b
JH
930=back
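
The surrogate arithmetic given under UTF-16 above can be tried out
directly. The following is a minimal sketch, not part of Perl itself:
the names C<to_surrogates> and C<from_surrogates> are hypothetical
helpers introduced only for this illustration, and shifts and masks
are used in place of division and modulus so that the results stay
integers.

```perl
use strict;
use warnings;

# Hypothetical helper: split a supplementary code point
# (U+10000..U+10FFFF) into a (high, low) surrogate pair.
sub to_surrogates {
    my ($uni) = @_;
    die "not a supplementary code point\n"
        if $uni < 0x10000 || $uni > 0x10FFFF;
    my $offset = $uni - 0x10000;
    my $hi = 0xD800 + ($offset >> 10);      # upper 10 bits of the offset
    my $lo = 0xDC00 + ($offset & 0x3FF);    # lower 10 bits of the offset
    return ($hi, $lo);
}

# Hypothetical helper: recombine a surrogate pair into one code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
}

my ($hi, $lo) = to_surrogates(0x1D11E);     # MUSICAL SYMBOL G CLEF
printf "U+1D11E -> %04X %04X\n", $hi, $lo;
printf "and back -> U+%X\n", from_surrogates($hi, $lo);
```

Only the numbers are computed here; no surrogate characters are ever
created with chr(), so no warnings are triggered.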
931
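The BOM signatures listed above for UTF-8, UTF-16 and UTF-32 lend
themselves to a simple lookup on the first few bytes of the input. The
sketch below assumes the data has already been read as raw bytes;
C<sniff_bom> is a hypothetical name, not an existing Perl or Encode
function.

```perl
use strict;
use warnings;

# Hypothetical helper: guess an encoding from a leading byte order mark.
# Returns (encoding name, BOM length in bytes), or an empty list when
# no BOM is present.  The 4-byte UTF-32 signatures are tested first
# because the UTF-32LE BOM begins with the same two bytes as the
# UTF-16LE BOM.
sub sniff_bom {
    my ($bytes) = @_;
    return ('UTF-32BE', 4) if $bytes =~ /\A\x00\x00\xFE\xFF/;
    return ('UTF-32LE', 4) if $bytes =~ /\A\xFF\xFE\x00\x00/;
    return ('UTF-8',    3) if $bytes =~ /\A\xEF\xBB\xBF/;
    return ('UTF-16BE', 2) if $bytes =~ /\A\xFE\xFF/;
    return ('UTF-16LE', 2) if $bytes =~ /\A\xFF\xFE/;
    return;
}

my ($enc, $skip) = sniff_bom("\xFF\xFEh\x00i\x00");
print "looks like $enc; skip $skip bytes\n";
```

The result is only a guess: UTF-16LE data that starts with a BOM
followed by U+0000 is indistinguishable from UTF-32LE by this test.
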
0d7c09bb
JH
932=head2 Security Implications of Unicode
933
934=over 4
935
936=item *
937
938Malformed UTF-8
bf0fa0b2
JH
939
940Unfortunately, the specification of UTF-8 leaves some room for
941interpretation of how many bytes of encoded output one should generate
376d9008
JB
942from one input Unicode character. Strictly speaking, the shortest
943possible sequence of UTF-8 bytes should be generated,
944because otherwise there is potential for an input buffer overflow at
feda178f 945the receiving end of a UTF-8 connection. Perl always generates the
376d9008
JB
946shortest length UTF-8, and with warnings on Perl will warn about
947non-shortest length UTF-8 along with other malformations, such as the
948surrogates, which are not real Unicode code points.
bf0fa0b2 949
0d7c09bb
JH
950=item *
951
Regular expressions behave slightly differently between byte data and
character (Unicode) data. For example, the "word character" character
class C<\w> will work differently depending on whether the data is
eight-bit bytes or Unicode.
0d7c09bb 956
376d9008
JB
957In the first case, the set of C<\w> characters is either small--the
958default set of alphabetic characters, digits, and the "_"--or, if you
0d7c09bb
JH
959are using a locale (see L<perllocale>), the C<\w> might contain a few
960more letters according to your language and country.
961
376d9008
JB
962In the second case, the C<\w> set of characters is much, much larger.
963Most importantly, even in the set of the first 256 characters, it
964will probably match different characters: unlike most locales,
965which are specific to a language and country pair, Unicode classifies all
966the characters that are letters as C<\w>. For example, your locale might
0d7c09bb
JH
967not think that LATIN SMALL LETTER ETH is a letter (unless you happen
968to speak Icelandic), but Unicode does.
969
376d9008
JB
970As discussed elsewhere, Perl has one foot (two hooves?) planted in
971each of two worlds: the old world of bytes and the new
0d7c09bb 972world of characters, upgrading from bytes to characters when necessary.
376d9008
JB
973If your legacy code does not explicitly use Unicode, no automatic
974switch-over to characters should happen. Characters shouldn't get
975downgraded to bytes, either. It is possible to accidentally mix
976bytes and characters, however (see L<perluniintro>), in which case
977C<\w> in regular expressions might start behaving differently. Review
978your code.
0d7c09bb
JH
979
980=back
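
To make the point about malformed UTF-8 concrete, here is one way to
reject non-shortest forms and encoded surrogates in untrusted input.
It is a sketch that assumes the Encode module is available and relies
on its strict C<UTF-8> encoding together with C<FB_CROAK>;
C<is_strict_utf8> is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if $octets is well-formed, shortest-form
# UTF-8.  The argument is copied into a lexical first, because
# decode() with a CHECK argument may modify the string it is given.
sub is_strict_utf8 {
    my ($octets) = @_;
    my $ok = eval {
        Encode::decode('UTF-8', $octets, Encode::FB_CROAK());
        1;
    };
    return $ok ? 1 : 0;
}

for my $sample ("\xC3\xA9",         # U+00E9 in shortest form
                "\xC0\xAF",         # "/" in an overlong two-byte form
                "\xED\xA0\x80") {   # the surrogate U+D800
    printf "%-12s %s\n",
        join(' ', map { sprintf '%02X', ord } split //, $sample),
        is_strict_utf8($sample) ? 'accepted' : 'rejected';
}
```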
981
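The difference in C<\w> between byte data and character data can be
observed with a single character. This sketch assumes that no locale
is in effect and that nothing in scope switches the pattern to Unicode
rules; it uses LATIN SMALL LETTER ETH, the example mentioned above.

```perl
use strict;
use warnings;

my $eth = "\xF0";                 # LATIN SMALL LETTER ETH, held as one byte

my $as_bytes = $eth =~ /\w/ ? 1 : 0;

utf8::upgrade($eth);              # same character, now stored as UTF-8
my $as_chars = $eth =~ /\w/ ? 1 : 0;

print "byte semantics: $as_bytes, character semantics: $as_chars\n";
```

The string compares equal before and after the upgrade, yet the match
result changes, which is why accidentally mixing bytes and characters
deserves a code review.
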
c349b1b9
JH
982=head2 Unicode in Perl on EBCDIC
983
376d9008
JB
984The way Unicode is handled on EBCDIC platforms is still
985experimental. On such platforms, references to UTF-8 encoding in this
986document and elsewhere should be read as meaning the UTF-EBCDIC
987specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues
c349b1b9 988are specifically discussed. There is no C<utfebcdic> pragma or
376d9008 989":utfebcdic" layer; rather, "utf8" and ":utf8" are reused to mean
86bbd6d1
PN
990the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic>
991for more discussion of the issues.
c349b1b9 992
b310b053
JH
993=head2 Locales
994
4616122b 995Usually locale settings and Unicode do not affect each other, but
b310b053
JH
996there are a couple of exceptions:
997
998=over 4
999
1000=item *
1001
1002If your locale environment variables (LANGUAGE, LC_ALL, LC_CTYPE, LANG)
1003contain the strings 'UTF-8' or 'UTF8' (case-insensitive matching),
376d9008
JB
1004the default encodings of your STDIN, STDOUT, and STDERR, and of
1005B<any subsequent file open>, are considered to be UTF-8.
b310b053
JH
1006
1007=item *
1008
376d9008
JB
1009Perl tries really hard to work both with Unicode and the old
1010byte-oriented world. Most often this is nice, but sometimes Perl's
1011straddling of the proverbial fence causes problems.
b310b053
JH
1012
1013=back
1014
95a1a48b
JH
1015=head2 Using Unicode in XS
1016
1017If you want to handle Perl Unicode in XS extensions, you may find
376d9008 1018the following C APIs useful. See L<perlapi> for details.
95a1a48b
JH
1019
1020=over 4
1021
1022=item *
1023
C<DO_UTF8(sv)> returns true if the C<UTF8> flag is on and the bytes pragma
is not in effect. C<SvUTF8(sv)> returns true if the C<UTF8> flag is on; the
bytes pragma is ignored. The C<UTF8> flag being on does B<not> mean that
there are any characters of code points greater than 255 (or 127) in
the scalar or that there are even any characters in the scalar.
What the C<UTF8> flag means is that the sequence of octets in the
representation of the scalar is the sequence of UTF-8 encoded
code points of the characters of a string. The C<UTF8> flag being
off means that each octet in this representation encodes a single
character with code point 0..255 within the string. Perl's Unicode
model is not to use UTF-8 until it is absolutely necessary.
95a1a48b
JH
1035
1036=item *
1037
C<uvuni_to_utf8(buf, chr)> writes a Unicode character code point into a
buffer encoding the code point as UTF-8, and returns a pointer
pointing after the UTF-8 bytes.
1041
1042=item *
1043
376d9008
JB
1044C<utf8_to_uvuni(buf, lenp)> reads UTF-8 encoded bytes from a buffer and
1045returns the Unicode character code point and, optionally, the length of
1046the UTF-8 byte sequence.
95a1a48b
JH
1047
1048=item *
1049
376d9008
JB
1050C<utf8_length(start, end)> returns the length of the UTF-8 encoded buffer
1051in characters. C<sv_len_utf8(sv)> returns the length of the UTF-8 encoded
95a1a48b
JH
1052scalar.
1053
1054=item *
1055
C<sv_utf8_upgrade(sv)> converts the string of the scalar to its UTF-8
encoded form. C<sv_utf8_downgrade(sv)> does the opposite, if
possible. C<sv_utf8_encode(sv)> is like C<sv_utf8_upgrade()> except that
it does not set the C<UTF8> flag. C<sv_utf8_decode()> does the
opposite of C<sv_utf8_encode()>. Note that none of these are to be
used as general-purpose encoding or decoding interfaces: C<use Encode>
for that. C<sv_utf8_upgrade()> is affected by the encoding pragma
but C<sv_utf8_downgrade()> is not (since the encoding pragma is
designed to be a one-way street).
95a1a48b
JH
1065
1066=item *
1067
376d9008 1068C<is_utf8_char(s)> returns true if the pointer points to a valid UTF-8
90f968e0 1069character.
95a1a48b
JH
1070
1071=item *
1072
376d9008 1073C<is_utf8_string(buf, len)> returns true if C<len> bytes of the buffer
95a1a48b
JH
1074are valid UTF-8.
1075
1076=item *
1077
C<UTF8SKIP(buf)> will return the number of bytes in the UTF-8 encoded
character in the buffer. C<UNISKIP(chr)> will return the number of bytes
required to UTF-8-encode the Unicode character code point. C<UTF8SKIP()>
is useful, for example, for iterating over the characters of a UTF-8
encoded buffer; C<UNISKIP()> is useful, for example, in computing
the size required for a UTF-8 encoded buffer.
95a1a48b
JH
1084
1085=item *
1086
376d9008 1087C<utf8_distance(a, b)> will tell the distance in characters between the
95a1a48b
JH
1088two pointers pointing to the same UTF-8 encoded buffer.
1089
1090=item *
1091
C<utf8_hop(s, off)> will return a pointer to a UTF-8 encoded buffer
that is C<off> (positive or negative) Unicode characters displaced
from the UTF-8 buffer C<s>. Be careful not to overstep the buffer:
C<utf8_hop()> will merrily run off the end or the beginning of the
buffer if told to do so.
95a1a48b 1097
d2cc3551
JH
1098=item *
1099
376d9008
JB
1100C<pv_uni_display(dsv, spv, len, pvlim, flags)> and
1101C<sv_uni_display(dsv, ssv, pvlim, flags)> are useful for debugging the
1102output of Unicode strings and scalars. By default they are useful
1103only for debugging--they display B<all> characters as hexadecimal code
1104points--but with the flags C<UNI_DISPLAY_ISPRINT> and
1105C<UNI_DISPLAY_BACKSLASH> you can make the output more readable.
d2cc3551
JH
1106
1107=item *
1108
C<ibcmp_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to
compare two strings case-insensitively in Unicode. For case-sensitive
comparisons you can just use C<memEQ()> and C<memNE()> as usual.
d2cc3551 1112
c349b1b9
JH
1113=back
1114
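The meaning of the C<UTF8> flag described above can also be observed
from Perl code, without writing any XS. The sketch below uses
C<utf8::is_utf8()> and C<utf8::upgrade()> as Perl-level views of the
flag and of C<sv_utf8_upgrade()>, and C<bytes::length()> to count the
octets of the internal representation; the sample string is arbitrary.

```perl
use strict;
use warnings;
use bytes ();                     # load bytes::length(); pragma stays off

my $s = "caf\xE9";                # four characters, all below 256

my $flag_before   = utf8::is_utf8($s) ? 1 : 0;
my $octets_before = bytes::length($s);

utf8::upgrade($s);                # switch the internal form to UTF-8

my $flag_after   = utf8::is_utf8($s) ? 1 : 0;
my $octets_after = bytes::length($s);

printf "flag %d -> %d, octets %d -> %d, characters %d\n",
    $flag_before, $flag_after, $octets_before, $octets_after, length($s);
```

The character count is the same before and after; only the flag and
the number of octets in the representation change, even though the
string contains no code point above 255.
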
95a1a48b
JH
1115For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h>
1116in the Perl source code distribution.
1117
c29a771d
JH
1118=head1 BUGS
1119
376d9008 1120=head2 Interaction with Locales
7eabb34d 1121
376d9008
JB
1122Use of locales with Unicode data may lead to odd results. Currently,
1123Perl attempts to attach 8-bit locale info to characters in the range
11240..255, but this technique is demonstrably incorrect for locales that
1125use characters above that range when mapped into Unicode. Perl's
1126Unicode support will also tend to run slower. Use of locales with
1127Unicode is discouraged.
c29a771d 1128
376d9008 1129=head2 Interaction with Extensions
7eabb34d 1130
376d9008 1131When Perl exchanges data with an extension, the extension should be
7eabb34d 1132able to understand the UTF-8 flag and act accordingly. If the
376d9008
JB
1133extension doesn't know about the flag, it's likely that the extension
1134will return incorrectly-flagged data.
7eabb34d
A
1135
So if you're working with Unicode data, consult the documentation of
every module you're using to see whether there are any issues with
Unicode data exchange. If the documentation does not talk about Unicode
at all, suspect the worst and probably look at the source to learn how the
module is implemented. Modules written completely in Perl shouldn't
cause problems. Modules that directly or indirectly access code written
in other programming languages are at risk.
7eabb34d 1143
376d9008 1144For affected functions, the simple strategy to avoid data corruption is
7eabb34d 1145to always make the encoding of the exchanged data explicit. Choose an
376d9008 1146encoding that you know the extension can handle. Convert arguments passed
7eabb34d
A
1147to the extensions to that encoding and convert results back from that
1148encoding. Write wrapper functions that do the conversions for you, so
1149you can later change the functions when the extension catches up.
1150
376d9008 1151To provide an example, let's say the popular Foo::Bar::escape_html
7eabb34d
A
1152function doesn't deal with Unicode data yet. The wrapper function
1153would convert the argument to raw UTF-8 and convert the result back to
376d9008 1154Perl's internal representation like so:
7eabb34d
A
1155
    sub my_escape_html ($) {
      my($what) = shift;
      return unless defined $what;
      Encode::decode_utf8(Foo::Bar::escape_html(Encode::encode_utf8($what)));
    }
1161
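Since C<Foo::Bar> is only a placeholder, the wrapper technique can be
tried with a stand-in. In the sketch below, C<legacy_escape_html> is a
hypothetical byte-only function that plays the part of the extension;
everything else follows the encode, call, decode pattern just
described.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for an extension function that only
# understands bytes.
sub legacy_escape_html {
    my ($octets) = @_;
    die "wide character passed to byte-only code\n"
        if $octets =~ /[^\x00-\xFF]/;
    $octets =~ s/&/&amp;/g;
    $octets =~ s/</&lt;/g;
    $octets =~ s/>/&gt;/g;
    return $octets;
}

# Wrapper: characters in, characters out, bytes in between.
sub my_escape_html {
    my ($what) = @_;
    return unless defined $what;
    my $octets  = Encode::encode_utf8($what);     # characters -> UTF-8 octets
    my $escaped = legacy_escape_html($octets);    # extension sees bytes only
    return Encode::decode_utf8($escaped);         # UTF-8 octets -> characters
}

my $out = my_escape_html("<\x{263A}>");           # WHITE SMILING FACE
printf "%d characters, code points %vX\n", length($out), $out;
```

Keeping the conversions inside one wrapper means that only the wrapper
needs to change once the extension learns about Unicode.
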
1162Sometimes, when the extension does not convert data but just stores
1163and retrieves them, you will be in a position to use the otherwise
1164dangerous Encode::_utf8_on() function. Let's say the popular
66b79f27 1165C<Foo::Bar> extension, written in C, provides a C<param> method that
7eabb34d
A
1166lets you store and retrieve data according to these prototypes:
1167
1168 $self->param($name, $value); # set a scalar
1169 $value = $self->param($name); # retrieve a scalar
1170
1171If it does not yet provide support for any encoding, one could write a
1172derived class with such a C<param> method:
1173
    sub param {
      my($self,$name,$value) = @_;
      utf8::upgrade($name);     # make sure it is UTF-8 encoded
      if (defined $value) {
        utf8::upgrade($value);  # make sure it is UTF-8 encoded
        return $self->SUPER::param($name,$value);
      } else {
        my $ret = $self->SUPER::param($name);
        Encode::_utf8_on($ret); # we know, it is UTF-8 encoded
        return $ret;
      }
    }
1186
a73d23f6
RGS
Some extensions provide filters on data entry/exit points, such as
DB_File::filter_store_key and family. Look out for such filters in
the documentation of your extensions; they can make the transition to
Unicode data much easier.
1191
376d9008 1192=head2 Speed
7eabb34d 1193
c29a771d 1194Some functions are slower when working on UTF-8 encoded strings than
574c8022 1195on byte encoded strings. All functions that need to hop over
c29a771d
JH
1196characters such as length(), substr() or index() can work B<much>
1197faster when the underlying data are byte-encoded. Witness the
1198following benchmark:
666f95b9 1199
c29a771d
JH
1200 % perl -e '
1201 use Benchmark;
1202 use strict;
1203 our $l = 10000;
1204 our $u = our $b = "x" x $l;
1205 substr($u,0,1) = "\x{100}";
1206 timethese(-2,{
1207 LENGTH_B => q{ length($b) },
1208 LENGTH_U => q{ length($u) },
1209 SUBSTR_B => q{ substr($b, $l/4, $l/2) },
1210 SUBSTR_U => q{ substr($u, $l/4, $l/2) },
1211 });
1212 '
1213 Benchmark: running LENGTH_B, LENGTH_U, SUBSTR_B, SUBSTR_U for at least 2 CPU seconds...
1214 LENGTH_B: 2 wallclock secs ( 2.36 usr + 0.00 sys = 2.36 CPU) @ 5649983.05/s (n=13333960)
1215 LENGTH_U: 2 wallclock secs ( 2.11 usr + 0.00 sys = 2.11 CPU) @ 12155.45/s (n=25648)
1216 SUBSTR_B: 3 wallclock secs ( 2.16 usr + 0.00 sys = 2.16 CPU) @ 374480.09/s (n=808877)
1217 SUBSTR_U: 2 wallclock secs ( 2.11 usr + 0.00 sys = 2.11 CPU) @ 6791.00/s (n=14329)
666f95b9 1218
376d9008
JB
1219The numbers show an incredible slowness on long UTF-8 strings. You
1220should carefully avoid using these functions in tight loops. If you
1221want to iterate over characters, the superior coding technique would
1222split the characters into an array instead of using substr, as the following
c29a771d
JH
1223benchmark shows:
1224
1225 % perl -e '
1226 use Benchmark;
1227 use strict;
1228 our $l = 10000;
1229 our $u = our $b = "x" x $l;
1230 substr($u,0,1) = "\x{100}";
1231 timethese(-5,{
1232 SPLIT_B => q{ for my $c (split //, $b){} },
1233 SPLIT_U => q{ for my $c (split //, $u){} },
1234 SUBSTR_B => q{ for my $i (0..length($b)-1){my $c = substr($b,$i,1);} },
1235 SUBSTR_U => q{ for my $i (0..length($u)-1){my $c = substr($u,$i,1);} },
1236 });
1237 '
1238 Benchmark: running SPLIT_B, SPLIT_U, SUBSTR_B, SUBSTR_U for at least 5 CPU seconds...
1239 SPLIT_B: 6 wallclock secs ( 5.29 usr + 0.00 sys = 5.29 CPU) @ 56.14/s (n=297)
1240 SPLIT_U: 5 wallclock secs ( 5.17 usr + 0.01 sys = 5.18 CPU) @ 55.21/s (n=286)
1241 SUBSTR_B: 5 wallclock secs ( 5.34 usr + 0.00 sys = 5.34 CPU) @ 123.22/s (n=658)
1242 SUBSTR_U: 7 wallclock secs ( 6.20 usr + 0.00 sys = 6.20 CPU) @ 0.81/s (n=5)
1243
376d9008
JB
1244Even though the algorithm based on C<substr()> is faster than
1245C<split()> for byte-encoded data, it pales in comparison to the speed
1246of C<split()> when used with UTF-8 data.
666f95b9 1247
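As a small illustration of the recommended technique, the following
sketch splits a UTF-8 encoded string once and then walks the resulting
list, rather than calling C<substr()> at every offset. The string
contents and the counting task are made up for the example.

```perl
use strict;
use warnings;

my $u = "\x{100}" . ("x" x 9_999);   # one wide character forces UTF-8 storage

# Split once, then iterate over the list; the string itself is not
# indexed again inside the loop.
my $wide = 0;
for my $c (split //, $u) {
    $wide++ if ord($c) > 0xFF;
}

print "$wide wide character(s) out of ", length($u), "\n";
```
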
393fec97
GS
1248=head1 SEE ALSO
1249
72ff2908
JH
1250L<perluniintro>, L<encoding>, L<Encode>, L<open>, L<utf8>, L<bytes>,
1251L<perlretut>, L<perlvar/"${^WIDE_SYSTEM_CALLS}">
393fec97
GS
1252
1253=cut