=head1 NAME

perlunicode - Unicode support in Perl

=head1 DESCRIPTION

=head2 Important Caveats

Unicode support is an extensive requirement. While Perl does not
implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.

=over 4

=item Input and Output Layers

Perl knows when a filehandle uses Perl's internal Unicode encodings
(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
the ":utf8" layer. Other encodings can be converted to Perl's
encoding on input or from Perl's encoding on output by use of the
":encoding(...)" layer. See L<open>.

To indicate that Perl source itself is using a particular encoding,
see L<encoding>.

=item Regular Expressions

The regular expression compiler produces polymorphic opcodes. That is,
the pattern adapts to the data and automatically switches to the Unicode
character scheme when presented with Unicode data--or instead uses
a traditional byte scheme when presented with byte data.

=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts

As a compatibility measure, the C<use utf8> pragma must be explicitly
included to enable recognition of UTF-8 in the Perl scripts themselves
(in string or regular expression literals, or in identifier names) on
ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
machines. B<These are the only times when an explicit C<use utf8>
is needed.> See L<utf8>.

You can also use the C<encoding> pragma to change the default encoding
of the data in your script; see L<encoding>.

=back

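As a minimal sketch of the input and output layers described above, the
following decodes UTF-8 on input and encodes it again on output. It
uses in-memory filehandles (a reference to a scalar in place of a file
name) only to stay self-contained, and it assumes an ASCII-based
platform. The variable names are illustrative.

```perl
use strict;
use warnings;

# Three octets: the UTF-8 encoding of U+263A (WHITE SMILING FACE), then a newline.
my $octets = "\xE2\x98\xBA\n";

# Input: the ":encoding(UTF-8)" layer decodes octets into characters.
open(my $in, '<:encoding(UTF-8)', \$octets) or die "open for reading: $!";
my $line = <$in>;
close($in);
chomp $line;

# Output: the same layer encodes characters back into octets.
my $written = '';
open(my $out, '>:encoding(UTF-8)', \$written) or die "open for writing: $!";
print {$out} $line;
close($out);

printf "read %d character(s), U+%04X; wrote %d octet(s)\n",
    length($line), ord($line), length($written);
```

With a real file, the same layer strings go in the second argument of
C<open>; only the third argument changes.
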
=head2 Byte and Character Semantics

Beginning with version 5.6, Perl uses logically-wide characters to
represent strings internally.

In the future, Perl-level operations will be expected to work with
characters rather than bytes.

However, as an interim compatibility measure, Perl aims to
provide a safe migration path from byte semantics to character
semantics for programs. For operations where Perl can unambiguously
decide that the input data are characters, Perl switches to
character semantics. For operations where this determination cannot
be made without additional information from the user, Perl decides in
favor of compatibility and chooses to use byte semantics.

This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if
none of the program's inputs were marked as being a source of Unicode
character data. Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as %ENV),
or from literals and constants in the source text.

The C<bytes> pragma will always, regardless of platform, force byte
semantics in a particular lexical scope. See L<bytes>.

The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
Note that this pragma is only required while Perl defaults to byte
semantics; when character semantics become the default, this pragma
may become a no-op. See L<utf8>.

Unless explicitly stated, Perl operators use character semantics
for Unicode data and byte semantics for non-Unicode data.
The decision to use character semantics is made transparently. If
input data comes from a Unicode source--for example, if a character
encoding layer is added to a filehandle or a literal Unicode
string constant appears in a program--character semantics apply.
Otherwise, byte semantics are in effect. The C<bytes> pragma should
be used to force byte semantics on Unicode data.

If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string is created by
decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
old Unicode string used EBCDIC.
This translation is done without regard to the system's native 8-bit
encoding, so to change this for systems with non-Latin-1 and
non-EBCDIC native encodings use the C<encoding> pragma. See
L<encoding>.

Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
logically just a number ranging from 0 to 2**31 or so. Larger
characters may encode into longer sequences of bytes internally, but
this internal detail is mostly hidden from Perl code.
See L<perluniintro> for more.

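The difference between the two semantics can be sketched as follows.
This is an illustration rather than a recommended idiom: it assumes an
ASCII-based platform, where U+263A occupies three bytes internally, and
the variable names are made up for the example.

```perl
use strict;
use warnings;

my $smiley = "\x{263A}";     # one character, ordinal value above 255
my $latin1 = "\xE9";         # one byte; read as Latin-1 this is e-acute

# Character semantics: one character, whatever its internal byte length.
my $chars = length($smiley);

# The bytes pragma forces byte semantics within its lexical scope.
my $bytes = do { use bytes; length($smiley) };

# Concatenation decodes the byte string as Latin-1, so 0xE9 stays U+00E9.
my $joined = $latin1 . $smiley;

printf "chars=%d bytes=%d joined=%d first=U+%04X\n",
    $chars, $bytes, length($joined), ord($joined);
```
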
=head2 Effects of Character Semantics

Character semantics have the following effects:

=over 4

=item *

Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.

If you use a Unicode editor to edit your program, Unicode characters
may occur directly within the literal strings in one of the various
Unicode encodings (UTF-8, UTF-EBCDIC, UCS-2, etc.), but will be recognized
as such and converted to Perl's internal representation only if the
appropriate L<encoding> is specified.

Unicode characters can also be added to a string by using the
C<\x{...}> notation. The Unicode code for the desired character, in
hexadecimal, should be placed in the braces. For instance, a smiley
face is C<\x{263A}>. This encoding scheme only works for characters
with a code of 0x100 or above.

Additionally, if you

    use charnames ':full';

you can use the C<\N{...}> notation and put the official Unicode
character name within the braces, such as C<\N{WHITE SMILING FACE}>.

=item *

If an appropriate L<encoding> is specified, identifiers within the
Perl script may contain Unicode alphanumeric characters, including
ideographs. Perl does not currently attempt to canonicalize variable
names.

=item *

Regular expressions match characters instead of bytes. "." matches
a character instead of a byte. The C<\C> pattern is provided to force
a match of a single byte--a C<char> in C, hence C<\C>.

=item *

Character classes in regular expressions match characters instead of
bytes and match against the character properties specified in the
Unicode properties database. C<\w> can be used to match a Japanese
ideograph, for instance.

=item *

Named Unicode properties, scripts, and block ranges may be used like
character classes via the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".

For instance, C<\p{Lu}> matches any character with the Unicode "Lu"
(Letter, uppercase) property, while C<\p{M}> matches any character
with an "M" (mark--accents and such) property. Braces are not
required for single letter properties, so C<\p{M}> is equivalent to
C<\pM>. Many predefined properties are available, such as
C<\p{Mirrored}> and C<\p{Tibetan}>.

The official Unicode script and block names have spaces and dashes as
separators, but for convenience you can use dashes, spaces, or
underbars, and case is unimportant. It is recommended, however, that
for consistency you use the following naming: the official Unicode
script, property, or block name (see below for the additional rules
that apply to block names) with whitespace and dashes removed, and the
words "uppercase-first-lowercase-rest". C<Latin-1 Supplement> thus
becomes C<Latin1Supplement>.

You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
(^) between the first brace and the property name: C<\p{^Tamil}> is
equal to C<\P{Tamil}>.

B<NOTE: the properties, scripts, and blocks listed here are as of
Unicode 3.2.0, March 2002, or Perl 5.8.0, July 2002. Unicode 4.0.0
came out in April 2003, and Perl 5.8.1 in September 2003.>

Here are the basic Unicode General Category properties, followed by their
long form. You can use either; C<\p{Lu}> and C<\p{UppercaseLetter}>,
for instance, are identical.

    Short       Long

    L           Letter
    Lu          UppercaseLetter
    Ll          LowercaseLetter
    Lt          TitlecaseLetter
    Lm          ModifierLetter
    Lo          OtherLetter

    M           Mark
    Mn          NonspacingMark
    Mc          SpacingMark
    Me          EnclosingMark

    N           Number
    Nd          DecimalNumber
    Nl          LetterNumber
    No          OtherNumber

    P           Punctuation
    Pc          ConnectorPunctuation
    Pd          DashPunctuation
    Ps          OpenPunctuation
    Pe          ClosePunctuation
    Pi          InitialPunctuation
                (may behave like Ps or Pe depending on usage)
    Pf          FinalPunctuation
                (may behave like Ps or Pe depending on usage)
    Po          OtherPunctuation

    S           Symbol
    Sm          MathSymbol
    Sc          CurrencySymbol
    Sk          ModifierSymbol
    So          OtherSymbol

    Z           Separator
    Zs          SpaceSeparator
    Zl          LineSeparator
    Zp          ParagraphSeparator

    C           Other
    Cc          Control
    Cf          Format
    Cs          Surrogate   (not usable)
    Co          PrivateUse
    Cn          Unassigned

Single-letter properties match all characters in any of the
two-letter sub-properties starting with the same letter.
C<L&> is a special case, which is an alias for C<Ll>, C<Lu>, and C<Lt>.

Because Perl hides the need for the user to understand the internal
representation of Unicode characters, there is no need to implement
the somewhat messy concept of surrogates. C<Cs> is therefore not
supported.

Because scripts differ in their directionality--Hebrew is
written right to left, for example--Unicode supplies these properties:

    Property    Meaning

    BidiL       Left-to-Right
    BidiLRE     Left-to-Right Embedding
    BidiLRO     Left-to-Right Override
    BidiR       Right-to-Left
    BidiAL      Right-to-Left Arabic
    BidiRLE     Right-to-Left Embedding
    BidiRLO     Right-to-Left Override
    BidiPDF     Pop Directional Format
    BidiEN      European Number
    BidiES      European Number Separator
    BidiET      European Number Terminator
    BidiAN      Arabic Number
    BidiCS      Common Number Separator
    BidiNSM     Non-Spacing Mark
    BidiBN      Boundary Neutral
    BidiB       Paragraph Separator
    BidiS       Segment Separator
    BidiWS      Whitespace
    BidiON      Other Neutrals

For example, C<\p{BidiR}> matches characters that are normally
written right to left.

=back

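Several of the notations from the list above can be exercised together.
The following is a small sketch, not an exhaustive test: the sample
characters and the C<%check> hash are chosen for illustration, and every
entry is expected to report C<ok>.

```perl
use strict;
use warnings;
use charnames ':full';

# Two spellings of the same character, U+263A.
my $by_code = "\x{263A}";
my $by_name = "\N{WHITE SMILING FACE}";

# Sample characters: Greek capital alpha, Greek small alpha, combining acute.
my ($upper, $lower, $mark) = ("\x{391}", "\x{3B1}", "\x{301}");

my %check = (
    same_char      => $by_code eq $by_name,
    upper_is_Lu    => scalar($upper =~ /^\p{Lu}$/),
    long_form      => scalar($upper =~ /^\p{UppercaseLetter}$/),
    lower_not_Lu   => scalar($lower =~ /^\P{Lu}$/),
    mark_short     => scalar($mark  =~ /^\pM$/),
    script_greek   => scalar($lower =~ /^\p{Greek}$/),
    caret_negation => scalar($lower =~ /^\p{^Tamil}$/),
);

printf "%-15s %s\n", $_, ($check{$_} ? 'ok' : 'FAILED') for sort keys %check;
```
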
=head2 Scripts

The script names which can be used by C<\p{...}> and C<\P{...}>,
such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows:

    Arabic
    Armenian
    Bengali
    Bopomofo
    Buhid
    CanadianAboriginal
    Cherokee
    Cyrillic
    Deseret
    Devanagari
    Ethiopic
    Georgian
    Gothic
    Greek
    Gujarati
    Gurmukhi
    Han
    Hangul
    Hanunoo
    Hebrew
    Hiragana
    Inherited
    Kannada
    Katakana
    Khmer
    Lao
    Latin
    Malayalam
    Mongolian
    Myanmar
    Ogham
    OldItalic
    Oriya
    Runic
    Sinhala
    Syriac
    Tagalog
    Tagbanwa
    Tamil
    Telugu
    Thaana
    Thai
    Tibetan
    Yi

Extended property classes, defined by the F<PropList> Unicode database,
can supplement the basic properties:

    ASCIIHexDigit
    BidiControl
    Dash
    Deprecated
    Diacritic
    Extender
    GraphemeLink
    HexDigit
    Hyphen
    Ideographic
    IDSBinaryOperator
    IDSTrinaryOperator
    JoinControl
    LogicalOrderException
    NoncharacterCodePoint
    OtherAlphabetic
    OtherDefaultIgnorableCodePoint
    OtherGraphemeExtend
    OtherLowercase
    OtherMath
    OtherUppercase
    QuotationMark
    Radical
    SoftDotted
    TerminalPunctuation
    UnifiedIdeograph
    WhiteSpace

and there are further derived properties:

    Alphabetic      Lu + Ll + Lt + Lm + Lo + OtherAlphabetic
    Lowercase       Ll + OtherLowercase
    Uppercase       Lu + OtherUppercase
    Math            Sm + OtherMath

    ID_Start        Lu + Ll + Lt + Lm + Lo + Nl
    ID_Continue     ID_Start + Mn + Mc + Nd + Pc

    Any             Any character
    Assigned        Any non-Cn character (i.e. synonym for \P{Cn})
    Unassigned      Synonym for \p{Cn}
    Common          Any character (or unassigned code point)
                    not explicitly assigned to a script

For backward compatibility (with Perl 5.6), all properties mentioned
so far may have C<Is> prepended to their name, so C<\P{IsLu}>, for
example, is equal to C<\P{Lu}>.

=head2 Blocks

In addition to B<scripts>, Unicode also defines B<blocks> of
characters. The difference between scripts and blocks is that the
concept of scripts is closer to natural languages, while the concept
of blocks is more of an artificial grouping based on groups of 256
Unicode characters. For example, the C<Latin> script contains letters
from many blocks but does not contain all the characters from those
blocks. It does not, for example, contain digits, because digits are
shared across many scripts. Digits and similar groups, like
punctuation, are in a category called C<Common>.

For more about scripts, see UTR #24:

    http://www.unicode.org/unicode/reports/tr24/

For more about blocks, see:

    http://www.unicode.org/Public/UNIDATA/Blocks.txt

Block names are given with the C<In> prefix. For example, the
Katakana block is referenced via C<\p{InKatakana}>. The C<In>
prefix may be omitted if there is no naming conflict with a script
or any other property, but it is recommended that C<In> always be used
for block tests to avoid confusion.

These block names are supported:

    InAlphabeticPresentationForms
    InArabic
    InArabicPresentationFormsA
    InArabicPresentationFormsB
    InArmenian
    InArrows
    InBasicLatin
    InBengali
    InBlockElements
    InBopomofo
    InBopomofoExtended
    InBoxDrawing
    InBraillePatterns
    InBuhid
    InByzantineMusicalSymbols
    InCJKCompatibility
    InCJKCompatibilityForms
    InCJKCompatibilityIdeographs
    InCJKCompatibilityIdeographsSupplement
    InCJKRadicalsSupplement
    InCJKSymbolsAndPunctuation
    InCJKUnifiedIdeographs
    InCJKUnifiedIdeographsExtensionA
    InCJKUnifiedIdeographsExtensionB
    InCherokee
    InCombiningDiacriticalMarks
    InCombiningDiacriticalMarksforSymbols
    InCombiningHalfMarks
    InControlPictures
    InCurrencySymbols
    InCyrillic
    InCyrillicSupplementary
    InDeseret
    InDevanagari
    InDingbats
    InEnclosedAlphanumerics
    InEnclosedCJKLettersAndMonths
    InEthiopic
    InGeneralPunctuation
    InGeometricShapes
    InGeorgian
    InGothic
    InGreekExtended
    InGreekAndCoptic
    InGujarati
    InGurmukhi
    InHalfwidthAndFullwidthForms
    InHangulCompatibilityJamo
    InHangulJamo
    InHangulSyllables
    InHanunoo
    InHebrew
    InHighPrivateUseSurrogates
    InHighSurrogates
    InHiragana
    InIPAExtensions
    InIdeographicDescriptionCharacters
    InKanbun
    InKangxiRadicals
    InKannada
    InKatakana
    InKatakanaPhoneticExtensions
    InKhmer
    InLao
    InLatin1Supplement
    InLatinExtendedA
    InLatinExtendedAdditional
    InLatinExtendedB
    InLetterlikeSymbols
    InLowSurrogates
    InMalayalam
    InMathematicalAlphanumericSymbols
    InMathematicalOperators
    InMiscellaneousMathematicalSymbolsA
    InMiscellaneousMathematicalSymbolsB
    InMiscellaneousSymbols
    InMiscellaneousTechnical
    InMongolian
    InMusicalSymbols
    InMyanmar
    InNumberForms
    InOgham
    InOldItalic
    InOpticalCharacterRecognition
    InOriya
    InPrivateUseArea
    InRunic
    InSinhala
    InSmallFormVariants
    InSpacingModifierLetters
    InSpecials
    InSuperscriptsAndSubscripts
    InSupplementalArrowsA
    InSupplementalArrowsB
    InSupplementalMathematicalOperators
    InSupplementaryPrivateUseAreaA
    InSupplementaryPrivateUseAreaB
    InSyriac
    InTagalog
    InTagbanwa
    InTags
    InTamil
    InTelugu
    InThaana
    InThai
    InTibetan
    InUnifiedCanadianAboriginalSyllabics
    InVariationSelectors
    InYiRadicals
    InYiSyllables

=over 4

=item *

The special pattern C<\X> matches any extended Unicode
sequence--"a combining character sequence" in Standardese--where the
first character is a base character and subsequent characters are mark
characters that apply to the base character. C<\X> is equivalent to
C<(?:\PM\pM*)>.

=item *

The C<tr///> operator translates characters instead of bytes. Note
that the C<tr///CU> functionality has been removed. For similar
functionality see C<pack('U0', ...)> and C<pack('C0', ...)>.

=item *

Case translation operators use the Unicode case translation tables
when character input is provided. Note that C<uc()>, or C<\U> in
interpolated strings, translates to uppercase, while C<ucfirst>,
or C<\u> in interpolated strings, translates to titlecase in languages
that make the distinction.

=item *

Most operators that deal with positions or lengths in a string will
automatically switch to using character positions, including
C<chop()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
C<sprintf()>, C<write()>, and C<length()>. Operators that
specifically do not switch include C<vec()>, C<pack()>, and
C<unpack()>. Operators that really don't care include C<chomp()>,
operators that treat strings as a bucket of bits such as C<sort()>,
and operators dealing with filenames.

=item *

The C<pack()>/C<unpack()> letters C<c> and C<C> do I<not> change,
since they are often used for byte-oriented formats. Again, think
C<char> in the C language.

There is a new C<U> specifier that converts between Unicode characters
and code points.

=item *

The C<chr()> and C<ord()> functions work on characters, similar to
C<pack("U")> and C<unpack("U")>, I<not> C<pack("C")> and
C<unpack("C")>. C<pack("C")> and C<unpack("C")> are methods for
emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.
While these methods reveal the internal encoding of Unicode strings,
that is not something one normally needs to care about at all.

=item *

The bit string operators, C<& | ^ ~>, can operate on character data.
However, for backward compatibility, such as when using bit string
operations when characters are all less than 256 in ordinal value, one
should not use C<~> (the bit complement) with characters of both
values less than 256 and values greater than 256. Most importantly,
DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
will not hold. The reason for this mathematical I<faux pas> is that
the complement cannot return B<both> the 8-bit (byte-wide) bit
complement B<and> the full character-wide bit complement.

=item *

lc(), uc(), lcfirst(), and ucfirst() work for the following cases:

=over 8

=item *

the case mapping is from a single Unicode character to another
single Unicode character, or

=item *

the case mapping is from a single Unicode character to more
than one Unicode character.

=back

Things to do with locales (Lithuanian, Turkish, Azeri) do B<not> work
since Perl does not understand the concept of Unicode locales.

See the Unicode Technical Report #21, Case Mappings, for more details.

=item *

And finally, C<scalar reverse()> reverses by character rather than by byte.

=back

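A few of the operators from the list above can be seen at work in the
short sketch below. The sample strings and variable names are
illustrative. U+01C6 was picked because it is one of the few characters
whose uppercase and titlecase forms differ.

```perl
use strict;
use warnings;

# "a", U+263A, "b": three characters, although U+263A needs several bytes internally.
my $s = "a\x{263A}b";

my $len      = length($s);               # counts characters
my $pos      = index($s, "b");           # character offset
my $middle   = substr($s, 1, 1);         # the single character U+263A
my $reversed = scalar reverse($s);       # reversed by character

# chr() and ord() work on characters, like pack("U") and unpack("U").
my $code   = ord($middle);
my $packed = pack("U", 0x263A);

# U+01C6 (small letter dz with caron) has distinct uppercase and titlecase forms.
my $dz    = "\x{1C6}";
my $upper = uc($dz);                     # U+01C4
my $title = ucfirst($dz);                # U+01C5

# \X matches a base character together with the marks applied to it.
my $combined = "e\x{301}x";              # "e" + combining acute accent, then "x"
my ($cluster) = $combined =~ /^(\X)/;

printf "len=%d pos=%d code=U+%04X cluster_len=%d\n",
    $len, $pos, $code, length($cluster);
```
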
=head2 User-Defined Character Properties

You can define your own character properties by defining subroutines
whose names begin with "In" or "Is". The subroutines must be defined
in the C<main> package. The user-defined properties can be used in the
regular expression C<\p> and C<\P> constructs. Note that the effect
is compile-time and immutable once defined.

The subroutines must return a specially-formatted string, with one
or more newline-separated lines. Each line must be one of the following:

=over 4

=item *

Two hexadecimal numbers separated by horizontal whitespace (space or
tab characters) denoting a range of Unicode code points to include.

=item *

Something to include, prefixed by "+": a built-in character
property (prefixed by "utf8::"), to represent all the characters in that
property; two hexadecimal code points for a range; or a single
hexadecimal code point.

=item *

Something to exclude, prefixed by "-": an existing character
property (prefixed by "utf8::"), for all the characters in that
property; two hexadecimal code points for a range; or a single
hexadecimal code point.

=item *

Something to negate, prefixed by "!": an existing character
property (prefixed by "utf8::") for all the characters except the
characters in the property; two hexadecimal code points for a range;
or a single hexadecimal code point.

=back

For example, to define a property that covers both the Japanese
syllabaries (hiragana and katakana), you can define

    sub InKana {
        return <<END;
    3040\t309F
    30A0\t30FF
    END
    }

Imagine that the here-doc end marker is at the beginning of the line.
Now you can use C<\p{InKana}> and C<\P{InKana}>.

You could also have used the existing block property names:

    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    END
    }

Suppose you wanted to match only the allocated characters,
not the raw block ranges: in other words, you want to remove
the non-characters:

    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    -utf8::IsCn
    END
    }

The negation is useful for defining (surprise!) negated classes.

    sub InNotKana {
        return <<'END';
    !utf8::InHiragana
    -utf8::InKatakana
    +utf8::IsCn
    END
    }

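To see a user-defined property in use, here is a self-contained sketch
built on the first C<InKana> definition above. The string is assembled
with C<.> instead of a here-document so that the end marker problem
does not arise. The three sample characters are chosen for
illustration, and the script is assumed to run in package C<main>.

```perl
use strict;
use warnings;

# A user-defined property: the subroutine name begins with "In" and the
# subroutine lives in package main, as the text above requires.
sub InKana {
    return "3040\t309F\n"      # Hiragana block range
         . "30A0\t30FF\n";     # Katakana block range
}

my $hiragana_a = "\x{3042}";   # HIRAGANA LETTER A
my $katakana_a = "\x{30A2}";   # KATAKANA LETTER A
my $smiley     = "\x{263A}";   # a Unicode character outside both ranges

my $hira_ok   = $hiragana_a =~ /^\p{InKana}$/ ? 1 : 0;
my $kata_ok   = $katakana_a =~ /^\p{InKana}$/ ? 1 : 0;
my $smiley_ok = $smiley     =~ /^\P{InKana}$/ ? 1 : 0;

print "hiragana=$hira_ok katakana=$kata_ok smiley_excluded=$smiley_ok\n";
```
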
You can also define your own mappings to be used in lc(),
lcfirst(), uc(), and ucfirst() (or their string-inlined versions).
The principle is the same: define subroutines in the C<main> package
with names like C<ToLower> (for lc() and lcfirst()), C<ToTitle> (for
the first character in ucfirst()), and C<ToUpper> (for uc(), and the
rest of the characters in ucfirst()).

The string returned by the subroutines must now be three
hexadecimal numbers separated by tabs: start of the source
range, end of the source range, and start of the destination range.
For example:

    sub ToUpper {
        return <<END;
    0061\t0063\t0041
    END
    }

defines a uc() mapping that causes only the characters "a", "b", and
"c" to be mapped to "A", "B", "C"; all other characters will remain
unchanged.

If there is no source range to speak of, that is, the mapping is from
a single character to another single character, leave the end of the
source range empty, but the two tab characters are still needed.
For example:

    sub ToLower {
        return <<END;
    0041\t\t0061
    END
    }

defines a lc() mapping that causes only "A" to be mapped to "a"; all
other characters will remain unchanged.

(For serious hackers only) If you want to introspect the default
mappings, you can find the data in the directory
C<$Config{privlib}>/F<unicore/To/>. The mapping data is returned as
the here-document, and the C<utf8::ToSpecFoo> are special exception
mappings derived from C<$Config{privlib}>/F<unicore/SpecialCasing.txt>.
The C<Digit> and C<Fold> mappings that one can see in the directory
are not directly user-accessible; one can use either the
C<Unicode::UCD> module, or just match case-insensitively (that's when
the C<Fold> mapping is used).

A final note on the user-defined property tests and mappings: they
will be used only if the scalar has been marked as having Unicode
characters. Old byte-style strings will not be affected.

=head2 Character Encodings for Input and Output

See L<Encode>.

=head2 Unicode Regular Expression Support Level

The following list of Unicode support for regular expressions describes
all the features currently supported. The references to "Level N"
and the section numbers refer to the Unicode Technical Report 18,
"Unicode Regular Expression Guidelines", version 6 (Unicode 3.2.0,
Perl 5.8.0).

=over 4

=item *

Level 1 - Basic Unicode Support

        2.1 Hex Notation                        - done          [1]
            Named Notation                      - done          [2]
        2.2 Categories                          - done          [3][4]
        2.3 Subtraction                         - MISSING       [5][6]
        2.4 Simple Word Boundaries              - done          [7]
        2.5 Simple Loose Matches                - done          [8]
        2.6 End of Line                         - MISSING       [9][10]

        [ 1] \x{...}
        [ 2] \N{...}
        [ 3] . \p{...} \P{...}
        [ 4] now scripts (see UTR#24 Script Names) in addition to blocks
        [ 5] have negation
        [ 6] can use regular expression look-ahead [a]
             or user-defined character properties [b] to emulate subtraction
        [ 7] include Letters in word characters
        [ 8] note that Perl does Full case-folding in matching, not Simple:
             for example U+1F88 is equivalent with U+1F00 U+03B9,
             not with U+1F80.  This difference matters for certain Greek
             capital letters with certain modifiers: the Full case-folding
             decomposes the letter, while the Simple case-folding would map
             it to a single character.
        [ 9] see UTR #13 Unicode Newline Guidelines
        [10] should do ^ and $ also on \x{85}, \x{2028} and \x{2029}
             (should also affect <>, $., and script line numbers)
             (the \x{85}, \x{2028} and \x{2029} do match \s)

[a] You can mimic class subtraction using lookahead.
For example, what UTR #18 might write as

    [{Greek}-[{UNASSIGNED}]]

in Perl can be written as:

    (?!\p{Unassigned})\p{InGreekAndCoptic}
    (?=\p{Assigned})\p{InGreekAndCoptic}

But in this particular example, you probably really want

    \p{Greek}

which will match assigned characters known to be part of the Greek script.

Also see the Unicode::Regex::Set module, which implements the full
UTR #18 grouping, intersection, union, and removal (subtraction) syntax.

[b] See L</"User-Defined Character Properties">.

=item *

Level 2 - Extended Unicode Support

        3.1 Surrogates                          - MISSING       [11]
        3.2 Canonical Equivalents               - MISSING       [12][13]
        3.3 Locale-Independent Graphemes        - MISSING       [14]
        3.4 Locale-Independent Words            - MISSING       [15]
        3.5 Locale-Independent Loose Matches    - MISSING       [16]

        [11] Surrogates are solely a UTF-16 concept and Perl's internal
             representation is UTF-8.  The Encode module does UTF-16, though.
        [12] see UTR#15 Unicode Normalization
        [13] have Unicode::Normalize but not integrated to regexes
        [14] have \X but at this level . should equal that
        [15] need three classes, not just \w and \W
        [16] see UTR#21 Case Mappings

=item *

Level 3 - Locale-Sensitive Support

        4.1 Locale-Dependent Categories         - MISSING
        4.2 Locale-Dependent Graphemes          - MISSING       [17][18]
        4.3 Locale-Dependent Words              - MISSING
        4.4 Locale-Dependent Loose Matches      - MISSING
        4.5 Locale-Dependent Ranges             - MISSING

        [17] see UTR#10 Unicode Collation Algorithms
        [18] have Unicode::Collate but not integrated to regexes

=back

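The lookahead technique from note [a] can be tried with a subtraction
that does not depend on which code points happen to be unassigned. The
sketch below emulates "Greek script minus uppercase letters". The
choice of C<\p{Lu}> as the subtracted class and the variable names are
assumptions made for the example.

```perl
use strict;
use warnings;

# "Greek script minus uppercase letters": the negative lookahead rejects
# any character in \p{Lu} before \p{Greek} is tried at the same position.
my $greek_not_upper = qr/(?!\p{Lu})\p{Greek}/;

my $small_alpha   = "\x{3B1}";   # GREEK SMALL LETTER ALPHA
my $capital_alpha = "\x{391}";   # GREEK CAPITAL LETTER ALPHA

my $small_matches   = $small_alpha   =~ /^$greek_not_upper$/ ? 1 : 0;
my $capital_matches = $capital_alpha =~ /^$greek_not_upper$/ ? 1 : 0;

print "small alpha: $small_matches, capital alpha: $capital_matches\n";
```
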
c349b1b9
JH
847=head2 Unicode Encodings
848
376d9008
JB
849Unicode characters are assigned to I<code points>, which are abstract
850numbers. To use these numbers, various encodings are needed.
c349b1b9
JH
851
852=over 4
853
c29a771d 854=item *
5cb3728c
RB
855
856UTF-8
c349b1b9 857
3e4dbfed 858UTF-8 is a variable-length (1 to 6 bytes, current character allocations
376d9008
JB
859require 4 bytes), byte-order independent encoding. For ASCII (and we
860really do mean 7-bit ASCII, not another 8-bit encoding), UTF-8 is
861transparent.
c349b1b9 862
8c007b5a 863The following table is from Unicode 3.2.
05632f9a
JH
864
865 Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte
866
8c007b5a
JH
867 U+0000..U+007F 00..7F
868 U+0080..U+07FF C2..DF 80..BF
ec90690f
TS
869 U+0800..U+0FFF E0 A0..BF 80..BF
870 U+1000..U+CFFF E1..EC 80..BF 80..BF
871 U+D000..U+D7FF ED 80..9F 80..BF
8c007b5a 872 U+D800..U+DFFF ******* ill-formed *******
ec90690f 873 U+E000..U+FFFF EE..EF 80..BF 80..BF
05632f9a
JH
874 U+10000..U+3FFFF F0 90..BF 80..BF 80..BF
875 U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF
876 U+100000..U+10FFFF F4 80..8F 80..BF 80..BF
877
376d9008
JB
878Note the C<A0..BF> in C<U+0800..U+0FFF>, the C<80..9F> in
879C<U+D000..U+D7FF>, the C<90..BF> in C<U+10000..U+3FFFF>, and the
880C<80..8F> in C<U+100000..U+10FFFF>. The "gaps" are caused by legal
881UTF-8 avoiding non-shortest encodings: it is technically possible to
882UTF-8-encode a single code point in different ways, but that is
883explicitly forbidden, and the shortest possible encoding should always
884be used. So that's what Perl does.
37361303 885
376d9008 886Another way to look at it is via bits:
05632f9a
JH
887
888 Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte
889
890 0aaaaaaa 0aaaaaaa
891 00000bbbbbaaaaaa 110bbbbb 10aaaaaa
892 ccccbbbbbbaaaaaa 1110cccc 10bbbbbb 10aaaaaa
893 00000dddccccccbbbbbbaaaaaa 11110ddd 10cccccc 10bbbbbb 10aaaaaa
894
895As you can see, the continuation bytes all begin with C<10>, and the
8c007b5a 896leading bits of the start byte tell how many bytes there are in the
05632f9a
JH
897encoded character.
898
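The byte patterns in the two tables above can be checked from Perl
itself. The following is a minimal sketch, assuming the Encode module is
installed; the helper name C<utf8_hex> is made up for this illustration
and is not part of any API.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the UTF-8 encoding of one code point as a string of hex bytes.
# (utf8_hex is a hypothetical helper, named only for this example.)
sub utf8_hex {
    my ($code_point) = @_;
    my $bytes = encode('UTF-8', chr($code_point));
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

print utf8_hex(0x41),    "\n";   # 41
print utf8_hex(0xE9),    "\n";   # C3 A9
print utf8_hex(0x0800),  "\n";   # E0 A0 80
print utf8_hex(0x20AC),  "\n";   # E2 82 AC
print utf8_hex(0x10000), "\n";   # F0 90 80 80
```

U+0800 and U+10000 are the first code points of their rows, so their
second bytes (C<A0> and C<90>) show the lower bounds that the "gaps"
discussion refers to.
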
c29a771d 899=item *
5cb3728c
RB
900
901UTF-EBCDIC
dbe420b4 902
376d9008 903Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.
dbe420b4 904
c29a771d 905=item *
5cb3728c 906
1e54db1a 907UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)
c349b1b9 908
1bfb14c4
JH
909The following items are mostly for reference and general Unicode
910knowledge; Perl doesn't use these constructs internally.
dbe420b4 911
c349b1b9 912UTF-16 is a 2 or 4 byte encoding. The Unicode code points
1bfb14c4
JH
913C<U+0000..U+FFFF> are stored in a single 16-bit unit, and the code
914points C<U+10000..U+10FFFF> in two 16-bit units. The latter case is
c349b1b9
JH
915using I<surrogates>, the first 16-bit unit being the I<high
916surrogate>, and the second being the I<low surrogate>.
917
376d9008 918Surrogates are code points set aside to encode the C<U+10000..U+10FFFF>
c349b1b9 919range of Unicode code points in pairs of 16-bit units. The I<high
376d9008
JB
920surrogates> are the range C<U+D800..U+DBFF>, and the I<low surrogates>
921are the range C<U+DC00..U+DFFF>. The surrogate encoding is
c349b1b9
JH
922
923 $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
924 $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
925
926and the decoding is
927
1a3fa709 928 $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
c349b1b9 929
feda178f 930If you try to generate surrogates (for example by using chr()), you
376d9008
JB
931will get a warning if warnings are turned on, because those code
932points are not valid for a Unicode character.
9466bab6 933
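As a worked instance of the surrogate arithmetic, here is a small
round-trip sketch. The function names C<to_surrogates> and
C<from_surrogates> are invented for this example; the shifts and masks
are simply the integer form of the division and modulus by C<0x400>.

```perl
use strict;
use warnings;

# Split a code point in U+10000..U+10FFFF into (high, low) surrogates.
# Shifting right by 10 bits is integer division by 0x400.
sub to_surrogates {
    my ($uni) = @_;
    die sprintf("U+%04X is outside U+10000..U+10FFFF\n", $uni)
        if $uni < 0x10000 || $uni > 0x10FFFF;
    my $v = $uni - 0x10000;
    return (0xD800 + ($v >> 10), 0xDC00 + ($v & 0x3FF));
}

# Combine a (high, low) surrogate pair back into one code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
}

my ($hi, $lo) = to_surrogates(0x1D11E);    # MUSICAL SYMBOL G CLEF
printf "U+1D11E -> %04X %04X\n", $hi, $lo;              # D834 DD1E
printf "back    -> U+%X\n", from_surrogates($hi, $lo);  # U+1D11E
```
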
376d9008 934Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16
c349b1b9 935itself can be used for in-memory computations, but if storage or
376d9008
JB
936transfer is required either UTF-16BE (big-endian) or UTF-16LE
937(little-endian) encodings must be chosen.
c349b1b9
JH
938
939This introduces another problem: what if you just know that your data
376d9008
JB
940is UTF-16, but you don't know which endianness? Byte Order Marks, or
941BOMs, are a solution to this. A special character has been reserved
86bbd6d1 942in Unicode to function as a byte order marker: the character with the
376d9008 943code point C<U+FEFF> is the BOM.
042da322 944
c349b1b9 945The trick is that if you read a BOM, you will know the byte order,
376d9008
JB
946since if it was written on a big-endian platform, you will read the
947bytes C<0xFE 0xFF>, but if it was written on a little-endian platform,
948you will read the bytes C<0xFF 0xFE>. (And if the originating platform
949was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.)
042da322 950
86bbd6d1 951The way this trick works is that the character with the code point
376d9008
JB
952C<U+FFFE> is guaranteed not to be a valid Unicode character, so the
953sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in
1bfb14c4 954little-endian format" and cannot be "C<U+FFFE>, represented in big-endian
042da322 955format".
c349b1b9 956
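The BOM check described above can be sketched in a few lines. This is
only an illustration: C<bom_encoding> is a made-up name, and the input
is assumed to be a string of raw bytes (for example, read from a
filehandle without any encoding layer).

```perl
use strict;
use warnings;

# Guess the encoding from a leading BOM in a string of raw bytes.
# Returns nothing (undef in scalar context) if no BOM is recognized.
sub bom_encoding {
    my ($bytes) = @_;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return;
}

for my $sample ("\xFE\xFF\x00\x41", "\xFF\xFE\x41\x00", "\xEF\xBB\xBFA", "A") {
    my $guess = bom_encoding($sample);
    printf "%s\n", defined $guess ? $guess : 'no BOM';
}
```

A UTF-32LE BOM also begins with C<0xFF 0xFE>, so a detector that has to
tell UTF-16 from UTF-32 would need to test the longer signatures first.
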
c29a771d 957=item *
5cb3728c 958
1e54db1a 959UTF-32, UTF-32BE, UTF-32LE
c349b1b9
JH
960
961The UTF-32 family is pretty much like the UTF-16 family, except that
042da322 962the units are 32-bit, and therefore the surrogate scheme is not
376d9008
JB
963needed. The BOM signatures will be C<0x00 0x00 0xFE 0xFF> for BE and
964C<0xFF 0xFE 0x00 0x00> for LE.
c349b1b9 965
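To see the UTF-16 and UTF-32 byte orders side by side, the Encode
module can produce each form. The sketch below assumes Encode is
installed; C<hex_bytes> is a hypothetical helper written for this
example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode a string and show the resulting bytes in hex.
# (hex_bytes is a made-up helper, not part of Encode.)
sub hex_bytes {
    my ($encoding, $string) = @_;
    my $bytes = encode($encoding, $string);
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

my $clef = chr(0x1D11E);    # needs a surrogate pair in UTF-16
print hex_bytes('UTF-16BE', $clef), "\n";   # D8 34 DD 1E
print hex_bytes('UTF-16LE', $clef), "\n";   # 34 D8 1E DD
print hex_bytes('UTF-32BE', $clef), "\n";   # 00 01 D1 1E
print hex_bytes('UTF-32LE', $clef), "\n";   # 1E D1 01 00
print hex_bytes('UTF-16',   'A'),   "\n";   # FE FF 00 41
```

The last line shows that Encode's plain C<UTF-16>, when encoding,
prepends a BOM and writes big-endian data.
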
c29a771d 966=item *
5cb3728c
RB
967
968UCS-2, UCS-4
c349b1b9 969
86bbd6d1 970Encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit
376d9008 971encoding. Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
339cfa0e
JH
972because it does not use surrogates. UCS-4 is a 32-bit encoding,
973functionally identical to UTF-32.
c349b1b9 974
c29a771d 975=item *
5cb3728c
RB
976
977UTF-7
c349b1b9 978
376d9008
JB
979A seven-bit safe (non-eight-bit) encoding, which is useful if the
980transport or storage is not eight-bit safe. Defined by RFC 2152.
c349b1b9 981
95a1a48b
JH
982=back
983
0d7c09bb
JH
984=head2 Security Implications of Unicode
985
986=over 4
987
988=item *
989
990Malformed UTF-8
bf0fa0b2
JH
991
992Unfortunately, the specification of UTF-8 leaves some room for
993interpretation of how many bytes of encoded output one should generate
376d9008
JB
994from one input Unicode character. Strictly speaking, the shortest
995possible sequence of UTF-8 bytes should be generated,
996because otherwise there is potential for an input buffer overflow at
feda178f 997the receiving end of a UTF-8 connection. Perl always generates the
376d9008
JB
998shortest length UTF-8, and with warnings on Perl will warn about
999non-shortest length UTF-8 along with other malformations, such as the
1000surrogates, which are not real Unicode code points.
bf0fa0b2 1001
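One way to refuse non-shortest sequences and encoded surrogates in
incoming data is to decode with Encode's strict C<UTF-8> encoding and
ask it to die on errors. The following is a sketch under that
assumption; C<is_strict_utf8> is a name invented for this example.

```perl
use strict;
use warnings;
use Encode ();

# Return 1 if the byte string is strictly well-formed UTF-8, else 0.
# The argument is copied first because decode() with a CHECK value
# may modify the string it is given.
sub is_strict_utf8 {
    my ($bytes) = @_;
    my $ok = eval { Encode::decode('UTF-8', $bytes, Encode::FB_CROAK); 1 };
    return $ok ? 1 : 0;
}

for my $bytes ("\x2F", "\xC0\xAF", "\xED\xA0\x80") {
    my $hex = join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
    printf "%-10s %s\n", $hex, is_strict_utf8($bytes) ? 'well-formed' : 'rejected';
}
```

Here C<C0 AF> is a two-byte, non-shortest encoding of "/" (C<2F>), and
C<ED A0 80> would decode to the surrogate C<U+D800>; both should be
rejected, while the plain C<2F> is accepted.
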
0d7c09bb
JH
1002=item *
1003
1004Regular expressions behave slightly differently between byte data and
376d9008
JB
1005character (Unicode) data. For example, the "word character" character
1006class C<\w> will work differently depending on whether the data is eight-bit bytes
1007or Unicode.
0d7c09bb 1008
376d9008
JB
1009In the first case, the set of C<\w> characters is either small--the
1010default set of alphabetic characters, digits, and the "_"--or, if you
0d7c09bb
JH
1011are using a locale (see L<perllocale>), the C<\w> might contain a few
1012more letters according to your language and country.
1013
376d9008 1014In the second case, the C<\w> set of characters is much, much larger.
1bfb14c4
JH
1015Most importantly, even in the set of the first 256 characters, it will
1016probably match different characters: unlike most locales, which are
1017specific to a language and country pair, Unicode classifies all the
1018characters that are letters I<somewhere> as C<\w>. For example, your
1019locale might not think that LATIN SMALL LETTER ETH is a letter (unless
1020you happen to speak Icelandic), but Unicode does.
0d7c09bb 1021
376d9008 1022As discussed elsewhere, Perl has one foot (two hooves?) planted in
1bfb14c4
JH
1023each of two worlds: the old world of bytes and the new world of
1024characters, upgrading from bytes to characters when necessary.
376d9008
JB
1025If your legacy code does not explicitly use Unicode, no automatic
1026switch-over to characters should happen. Characters shouldn't get
1bfb14c4
JH
1027downgraded to bytes, either. It is possible to accidentally mix bytes
1028and characters, however (see L<perluniintro>), in which case C<\w> in
1029regular expressions might start behaving differently. Review your
1030code. Use warnings and the C<strict> pragma.
0d7c09bb
JH
1031
1032=back
1033
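The C<\w> difference can be observed with a single character. This
sketch assumes no locale is in effect and that nothing (such as a
C<use feature 'unicode_strings'> declaration or a C</u> modifier in
later Perls) has asked for Unicode rules on byte strings.

```perl
use strict;
use warnings;

my $eth = "\xF0";    # LATIN SMALL LETTER ETH, stored as one byte
my $as_bytes = ($eth =~ /\w/) ? 1 : 0;

utf8::upgrade($eth);    # same character, now stored as UTF-8 internally
my $as_chars = ($eth =~ /\w/) ? 1 : 0;

printf "byte semantics:      %s\n", $as_bytes ? 'matches \w' : 'does not match \w';
printf "character semantics: %s\n", $as_chars ? 'matches \w' : 'does not match \w';
```

The string compares equal before and after the upgrade; only the
internal representation, and with it the matching rules, has changed.
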
c349b1b9
JH
1034=head2 Unicode in Perl on EBCDIC
1035
376d9008
JB
1036The way Unicode is handled on EBCDIC platforms is still
1037experimental. On such platforms, references to UTF-8 encoding in this
1038document and elsewhere should be read as meaning the UTF-EBCDIC
1039specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues
c349b1b9 1040are specifically discussed. There is no C<utfebcdic> pragma or
376d9008 1041":utfebcdic" layer; rather, "utf8" and ":utf8" are reused to mean
86bbd6d1
PN
1042the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic>
1043for more discussion of the issues.
c349b1b9 1044
b310b053
JH
1045=head2 Locales
1046
4616122b 1047Usually locale settings and Unicode do not affect each other, but
b310b053
JH
1048there are a couple of exceptions:
1049
1050=over 4
1051
1052=item *
1053
8aa8f774
JH
1054You can enable automatic UTF-8-ification of your standard file
1055handles, default C<open()> layer, and C<@ARGV> by using either
1056the C<-C> command line switch or the C<PERL_UNICODE> environment
1057variable; see L<perlrun> for the documentation of the C<-C> switch.
b310b053
JH
1058
1059=item *
1060
376d9008
JB
1061Perl tries really hard to work both with Unicode and the old
1062byte-oriented world. Most often this is nice, but sometimes Perl's
1063straddling of the proverbial fence causes problems.
b310b053
JH
1064
1065=back
1066
1aad1664
JH
1067=head2 When Unicode Does Not Happen
1068
1069While Perl does have extensive ways to input and output in Unicode,
1070and a few other 'entry points' like @ARGV, which can be interpreted
1071as Unicode (UTF-8), there still are many places where Unicode (in some
1072encoding or another) could be given as arguments or received as
1073results, or both, but it is not.
1074
1075The following are such interfaces. For all of these Perl currently
1076(as of 5.8.1) simply assumes byte strings both as arguments and results.
1077
1078One reason why Perl does not attempt to resolve the role of Unicode in
1079these cases is that the answers are highly dependent on the operating
1080system and the file system(s). For example, whether filenames can be
1081in Unicode, and in exactly what kind of encoding, is not exactly a
1082portable concept. Similarly for the qx and system: how well will the
1083'command line interface' (and which of them?) handle Unicode?
1084
1085=over 4
1086
557a2462
RB
1087=item *
1088
1089chdir, chmod, chown, chroot, exec, link, mkdir,
1090rename, rmdir, stat, symlink, truncate, unlink, utime
1091
1092=item *
1093
1094%ENV
1095
1096=item *
1097
1098glob (aka the <*>)
1099
1100=item *
1aad1664 1101
557a2462 1102open, opendir, sysopen
1aad1664 1103
557a2462 1104=item *
1aad1664 1105
557a2462 1106qx (aka the backtick operator), system
1aad1664 1107
557a2462 1108=item *
1aad1664 1109
557a2462 1110readdir, readlink
1aad1664
JH
1111
1112=back
1113
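Because these interfaces deal in byte strings, one workable approach is
to do the conversion yourself at the boundary. The sketch below assumes
that the file system stores names as UTF-8, which is B<not> true
everywhere; C<to_fs>, C<from_fs>, and C<$FS_ENCODING> are names made up
for this example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Assumption: the file system stores names as UTF-8. Adjust as needed.
my $FS_ENCODING = 'UTF-8';

sub to_fs   { return encode($FS_ENCODING, $_[0]) }   # characters -> bytes
sub from_fs { return decode($FS_ENCODING, $_[0]) }   # bytes -> characters

my $name  = "r\x{E9}sum\x{E9}.txt";
my $bytes = to_fs($name);
printf "%d characters, %d bytes\n", length($name), length($bytes);  # 10 characters, 12 bytes

# The byte form is what would be handed to open(), unlink(), and so on;
# names coming back from readdir() would go through from_fs().
print "round trip ok\n" if from_fs($bytes) eq $name;
```
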
1114=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)
1115
1116Sometimes (see L</"When Unicode Does Not Happen">) there are
1117situations where you simply need to force Perl to believe that a byte
1118string is UTF-8, or vice versa. The low-level calls
1119utf8::upgrade($bytestring) and utf8::downgrade($utf8string) are
1120the answers.
1121
1122Do not use them without careful thought, though: Perl may easily get
1123very confused, angry, or even crash, if you suddenly change the 'nature'
1124of a scalar like that. You have to be especially careful if you use
1125utf8::upgrade(): not every byte string is valid UTF-8.
1126
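A short sketch of what the two calls do to a scalar may help. Note that
C<utf8::upgrade()> keeps the characters the same and only changes how
they are stored; it does not reinterpret existing bytes as UTF-8 (for
that, C<utf8::decode()> or C<Encode::decode()> would be the calls to
look at). The variable names are only illustrative.

```perl
use strict;
use warnings;

my $s = "caf\xE9";      # four characters, stored as four bytes

utf8::upgrade($s);      # same four characters, now UTF-8 internally
printf "flag %s, length %d\n", (utf8::is_utf8($s) ? 'on' : 'off'), length($s);

utf8::downgrade($s);    # back to one byte per character
printf "flag %s, length %d\n", (utf8::is_utf8($s) ? 'on' : 'off'), length($s);

# A character above 255 cannot be stored in one byte. With a true
# second argument, downgrade() returns false instead of dying.
my $wide = "\x{263A}";
my $ok   = utf8::downgrade($wide, 1);
printf "downgrade of U+263A %s\n", $ok ? 'succeeded' : 'failed';
```
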
95a1a48b
JH
1127=head2 Using Unicode in XS
1128
3a2263fe
RGS
1129If you want to handle Perl Unicode in XS extensions, you may find the
1130following C APIs useful. See also L<perlguts/"Unicode Support"> for an
1131explanation about Unicode at the XS level, and L<perlapi> for the API
1132details.
95a1a48b
JH
1133
1134=over 4
1135
1136=item *
1137
1bfb14c4
JH
1138C<DO_UTF8(sv)> returns true if the C<UTF8> flag is on and the bytes
1139pragma is not in effect. C<SvUTF8(sv)> returns true if the C<UTF8>
1140flag is on; the bytes pragma is ignored. The C<UTF8> flag being on
1141does B<not> mean that there are any characters of code points greater
1142than 255 (or 127) in the scalar or that there are even any characters
1143in the scalar. What the C<UTF8> flag means is that the sequence of
1144octets in the representation of the scalar is the sequence of UTF-8
1145encoded code points of the characters of a string. The C<UTF8> flag
1146being off means that each octet in this representation encodes a
1147single character with code point 0..255 within the string. Perl's
1148Unicode model is not to use UTF-8 until it is absolutely necessary.
95a1a48b
JH
1149
1150=item *
1151
fb9cc174 1152C<uvuni_to_utf8(buf, chr)> writes a Unicode character code point into
1bfb14c4 1153a buffer encoding the code point as UTF-8, and returns a pointer
95a1a48b
JH
1154pointing after the UTF-8 bytes.
1155
1156=item *
1157
376d9008
JB
1158C<utf8_to_uvuni(buf, lenp)> reads UTF-8 encoded bytes from a buffer and
1159returns the Unicode character code point and, optionally, the length of
1160the UTF-8 byte sequence.
95a1a48b
JH
1161
1162=item *
1163
376d9008
JB
1164C<utf8_length(start, end)> returns the length of the UTF-8 encoded buffer
1165in characters. C<sv_len_utf8(sv)> returns the length of the UTF-8 encoded
95a1a48b
JH
1166scalar.
1167
1168=item *
1169
376d9008
JB
1170C<sv_utf8_upgrade(sv)> converts the string of the scalar to its UTF-8
1171encoded form. C<sv_utf8_downgrade(sv)> does the opposite, if
1172possible. C<sv_utf8_encode(sv)> is like C<sv_utf8_upgrade()> except that
1173it does not set the C<UTF8> flag. C<sv_utf8_decode()> does the
1174opposite of C<sv_utf8_encode()>. Note that none of these are to be
1175used as general-purpose encoding or decoding interfaces: C<use Encode>
1176for that. C<sv_utf8_upgrade()> is affected by the encoding pragma
1177but C<sv_utf8_downgrade()> is not (since the encoding pragma is
1178designed to be a one-way street).
95a1a48b
JH
1179
1180=item *
1181
376d9008 1182C<is_utf8_char(s)> returns true if the pointer points to a valid UTF-8
90f968e0 1183character.
95a1a48b
JH
1184
1185=item *
1186
376d9008 1187C<is_utf8_string(buf, len)> returns true if C<len> bytes of the buffer
95a1a48b
JH
1188are valid UTF-8.
1189
1190=item *
1191
376d9008
JB
1192C<UTF8SKIP(buf)> will return the number of bytes in the UTF-8 encoded
1193character in the buffer. C<UNISKIP(chr)> will return the number of bytes
1194required to UTF-8-encode the Unicode character code point. C<UTF8SKIP()>
90f968e0 1195is useful, for example, for iterating over the characters of a UTF-8
376d9008 1196encoded buffer; C<UNISKIP()> is useful, for example, in computing
90f968e0 1197the size required for a UTF-8 encoded buffer.
95a1a48b
JH
1198
1199=item *
1200
376d9008 1201C<utf8_distance(a, b)> will tell the distance in characters between the
95a1a48b
JH
1202two pointers pointing to the same UTF-8 encoded buffer.
1203
1204=item *
1205
376d9008
JB
1206C<utf8_hop(s, off)> will return a pointer to a UTF-8 encoded buffer
1207that is C<off> (positive or negative) Unicode characters displaced
1208from the UTF-8 buffer C<s>. Be careful not to overstep the buffer:
1209C<utf8_hop()> will merrily run off the end or the beginning of the
1210buffer if told to do so.
95a1a48b 1211
d2cc3551
JH
1212=item *
1213
376d9008
JB
1214C<pv_uni_display(dsv, spv, len, pvlim, flags)> and
1215C<sv_uni_display(dsv, ssv, pvlim, flags)> are useful for debugging the
1216output of Unicode strings and scalars. By default they are useful
1217only for debugging--they display B<all> characters as hexadecimal code
1bfb14c4
JH
1218points--but with the flags C<UNI_DISPLAY_ISPRINT>,
1219C<UNI_DISPLAY_BACKSLASH>, and C<UNI_DISPLAY_QQ> you can make the
1220output more readable.
d2cc3551
JH
1221
1222=item *
1223
376d9008
JB
1224C<ibcmp_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to
1225compare two strings case-insensitively in Unicode. For case-sensitive
1226comparisons you can just use C<memEQ()> and C<memNE()> as usual.
d2cc3551 1227
c349b1b9
JH
1228=back
1229
95a1a48b
JH
1230For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h>
1231in the Perl source code distribution.
1232
c29a771d
JH
1233=head1 BUGS
1234
376d9008 1235=head2 Interaction with Locales
7eabb34d 1236
376d9008
JB
1237Use of locales with Unicode data may lead to odd results. Currently,
1238Perl attempts to attach 8-bit locale info to characters in the range
12390..255, but this technique is demonstrably incorrect for locales that
1240use characters above that range when mapped into Unicode. Perl's
1241Unicode support will also tend to run slower. Use of locales with
1242Unicode is discouraged.
c29a771d 1243
376d9008 1244=head2 Interaction with Extensions
7eabb34d 1245
376d9008 1246When Perl exchanges data with an extension, the extension should be
7eabb34d 1247able to understand the UTF-8 flag and act accordingly. If the
376d9008
JB
1248extension doesn't know about the flag, it's likely that the extension
1249will return incorrectly-flagged data.
7eabb34d
A
1250
1251So if you're working with Unicode data, consult the documentation of
1252every module you're using to see whether there are any issues with Unicode data
1253exchange. If the documentation does not talk about Unicode at all,
a73d23f6 1254suspect the worst and probably look at the source to learn how the
376d9008 1255module is implemented. Modules written completely in Perl shouldn't
a73d23f6
RGS
1256cause problems. Modules that directly or indirectly access code written
1257in other programming languages are at risk.
7eabb34d 1258
376d9008 1259For affected functions, the simple strategy to avoid data corruption is
7eabb34d 1260to always make the encoding of the exchanged data explicit. Choose an
376d9008 1261encoding that you know the extension can handle. Convert arguments passed
7eabb34d
A
1262to the extensions to that encoding and convert results back from that
1263encoding. Write wrapper functions that do the conversions for you, so
1264you can later change the functions when the extension catches up.
1265
376d9008 1266To provide an example, let's say the popular Foo::Bar::escape_html
7eabb34d
A
1267function doesn't deal with Unicode data yet. The wrapper function
1268would convert the argument to raw UTF-8 and convert the result back to
376d9008 1269Perl's internal representation like so:
7eabb34d
A
1270
1271 sub my_escape_html ($) {
1272 my($what) = shift;
1273 return unless defined $what;
1274 Encode::decode_utf8(Foo::Bar::escape_html(Encode::encode_utf8($what)));
1275 }
1276
1277Sometimes, when the extension does not convert data but just stores
1278and retrieves them, you will be in a position to use the otherwise
1279dangerous Encode::_utf8_on() function. Let's say the popular
66b79f27 1280C<Foo::Bar> extension, written in C, provides a C<param> method that
7eabb34d
A
1281lets you store and retrieve data according to these prototypes:
1282
1283 $self->param($name, $value); # set a scalar
1284 $value = $self->param($name); # retrieve a scalar
1285
1286If it does not yet provide support for any encoding, one could write a
1287derived class with such a C<param> method:
1288
1289 sub param {
1290 my($self,$name,$value) = @_;
1291 utf8::upgrade($name); # make sure it is UTF-8 encoded
1292 if (defined $value) {
1293 utf8::upgrade($value); # make sure it is UTF-8 encoded
1294 return $self->SUPER::param($name,$value);
1295 } else {
1296 my $ret = $self->SUPER::param($name);
1297 Encode::_utf8_on($ret); # we know, it is UTF-8 encoded
1298 return $ret;
1299 }
1300 }
1301
a73d23f6
RGS
1302Some extensions provide filters on data entry/exit points, such as
1303DB_File::filter_store_key and family. Look out for such filters in
66b79f27 1304the documentation of your extensions; they can make the transition to
7eabb34d
A
1305Unicode data much easier.
1306
376d9008 1307=head2 Speed
7eabb34d 1308
c29a771d 1309Some functions are slower when working on UTF-8 encoded strings than
574c8022 1310on byte encoded strings. All functions that need to hop over
7c17141f
JH
1311characters such as length(), substr() or index(), or matching regular
1312expressions can work B<much> faster when the underlying data are
1313byte-encoded.
1314
1315In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
1316a caching scheme was introduced which will hopefully make the slowness
a104b433
JH
1317somewhat less spectacular, at least for some operations. In general,
1318operations with UTF-8 encoded strings are still slower. As an example,
1319the Unicode properties (character classes) like C<\p{Nd}> are known to
1320be quite a bit slower (5-20 times) than their simpler counterparts
1321like C<\d> (then again, there are 268 Unicode characters matching C<Nd>
1322compared with the 10 ASCII characters matching C<\d>).
666f95b9 1323
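The size of the slowdown depends heavily on the Perl version and the
data, so it is worth measuring on your own system. The following is a
minimal sketch using the standard Benchmark module; the test string and
the labels C<d> and C<Nd> are arbitrary choices for this example, and no
particular ratio should be expected.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Build some sample text and force it into UTF-8 encoded form.
my $text = join '', map { "item $_ " } 1 .. 2000;
utf8::upgrade($text);

# Each sub counts the digit characters in the text.
my %tests = (
    'd'  => sub { my $n = () = $text =~ /\d/g;     return $n },
    'Nd' => sub { my $n = () = $text =~ /\p{Nd}/g; return $n },
);

# Run each sub for at least one CPU second and print a comparison table.
cmpthese(-1, \%tests);
```
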
c8d992ba
A
1324=head2 Porting code from perl-5.6.X
1325
1326Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer
1327was required to use the C<utf8> pragma to declare that a given scope
1328expected to deal with Unicode data and had to make sure that only
1329Unicode data were reaching that scope. If you have code that is
1330working with 5.6, you will need some of the following adjustments to
1331your code. The examples are written such that the code will continue
1332to work under 5.6, so you should be safe to try them out.
1333
1334=over 4
1335
1336=item *
1337
1338A filehandle that should read or write UTF-8
1339
1340 if ($] > 5.007) {
1341 binmode $fh, ":utf8";
1342 }
1343
1344=item *
1345
1346A scalar that is going to be passed to some extension
1347
1348Be it Compress::Zlib, Apache::Request or any extension that has no
1349mention of Unicode in the manpage, you need to make sure that the
1350UTF-8 flag is stripped off. Note that at the time of this writing
1351(October 2002) the mentioned modules are not UTF-8-aware. Please
1352check the documentation to verify if this is still true.
1353
1354 if ($] > 5.007) {
1355 require Encode;
1356 $val = Encode::encode_utf8($val); # make octets
1357 }
1358
1359=item *
1360
1361A scalar we got back from an extension
1362
1363If you believe the scalar comes back as UTF-8, you will most likely
1364want the UTF-8 flag restored:
1365
1366 if ($] > 5.007) {
1367 require Encode;
1368 $val = Encode::decode_utf8($val);
1369 }
1370
1371=item *
1372
1373Same thing, if you are really sure it is UTF-8
1374
1375 if ($] > 5.007) {
1376 require Encode;
1377 Encode::_utf8_on($val);
1378 }
1379
1380=item *
1381
1382A wrapper for fetchrow_array and fetchrow_hashref
1383
1384When the database contains only UTF-8, a wrapper function or method is
1385a convenient way to replace all your fetchrow_array and
1386fetchrow_hashref calls. A wrapper function will also make it easier to
1387adapt to future enhancements in your database driver. Note that at the
1388time of this writing (October 2002), the DBI has no standardized way
1389to deal with UTF-8 data. Please check the documentation to verify if
1390that is still true.
1391
1392 sub fetchrow {
1393 my($self, $sth, $what) = @_; # $what is one of fetchrow_{array,hashref}
1394 if ($] < 5.007) {
1395 return $sth->$what;
1396 } else {
1397 require Encode;
1398 if (wantarray) {
1399 my @arr = $sth->$what;
1400 for (@arr) {
1401 defined && /[^\000-\177]/ && Encode::_utf8_on($_);
1402 }
1403 return @arr;
1404 } else {
1405 my $ret = $sth->$what;
1406 if (ref $ret) {
1407 for my $k (keys %$ret) {
1408 defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret->{$k};
1409 }
1410 return $ret;
1411 } else {
1412 defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
1413 return $ret;
1414 }
1415 }
1416 }
1417 }
1418
1419
1420=item *
1421
1422A large scalar that you know can only contain ASCII
1423
1424Scalars that contain only ASCII and are marked as UTF-8 are sometimes
1425a drag to your program. If you recognize such a situation, just remove
1426the UTF-8 flag:
1427
1428 utf8::downgrade($val) if $] > 5.007;
1429
1430=back
1431
393fec97
GS
1432=head1 SEE ALSO
1433
72ff2908 1434L<perluniintro>, L<encoding>, L<Encode>, L<open>, L<utf8>, L<bytes>,
a05d7ebb 1435L<perlretut>, L<perlvar/"${^UNICODE}">
393fec97
GS
1436
1437=cut