| 1 | =head1 NAME |
| 2 | |
| 3 | perlunicode - Unicode support in Perl |
| 4 | |
| 5 | =head1 DESCRIPTION |
| 6 | |
| 7 | =head2 Important Caveats |
| 8 | |
| 9 | Unicode support is an extensive requirement. While Perl does not |
| 10 | implement the Unicode standard or the accompanying technical reports |
| 11 | from cover to cover, Perl does support many Unicode features. |
| 12 | |
| 13 | People who want to learn to use Unicode in Perl should probably read |
| 14 | L<the Perl Unicode tutorial, perlunitut|perlunitut> before reading |
| 15 | this reference document. |
| 16 | |
| 17 | =over 4 |
| 18 | |
| 19 | =item Input and Output Layers |
| 20 | |
| 21 | Perl knows when a filehandle uses Perl's internal Unicode encodings |
| 22 | (UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with |
| 23 | the ":utf8" layer. Other encodings can be converted to Perl's |
| 24 | encoding on input or from Perl's encoding on output by use of the |
| 25 | ":encoding(...)" layer. See L<open>. |
| 26 | |
| 27 | To indicate that Perl source itself is in UTF-8, use C<use utf8;>. |
| 28 | |
| 29 | =item Regular Expressions |
| 30 | |
| 31 | The regular expression compiler produces polymorphic opcodes. That is, |
| 32 | the pattern adapts to the data and automatically switches to the Unicode |
| 33 | character scheme when presented with data that is internally encoded in |
| 34 | UTF-8 -- or instead uses a traditional byte scheme when presented with |
| 35 | byte data. |
| 36 | |
| 37 | =item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts |
| 38 | |
| 39 | As a compatibility measure, the C<use utf8> pragma must be explicitly |
| 40 | included to enable recognition of UTF-8 in the Perl scripts themselves |
| 41 | (in string or regular expression literals, or in identifier names) on |
| 42 | ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based |
| 43 | machines. B<These are the only times when an explicit C<use utf8> |
| 44 | is needed.> See L<utf8>. |
| 45 | |
| 46 | =item BOM-marked scripts and UTF-16 scripts autodetected |
| 47 | |
| 48 | If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF-16BE, |
| 49 | or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either |
| 50 | endianness, Perl will correctly read in the script as Unicode. |
| 51 | (BOMless UTF-8 cannot be effectively recognized or differentiated from |
| 52 | ISO 8859-1 or other eight-bit encodings.) |
| 53 | |
| 54 | =item C<use encoding> needed to upgrade non-Latin-1 byte strings |
| 55 | |
| 56 | By default, there is a fundamental asymmetry in Perl's Unicode model: |
| 57 | implicit upgrading from byte strings to Unicode strings assumes that |
| 58 | they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are |
| 59 | downgraded with UTF-8 encoding. This happens because the first 256 |
| 60 | codepoints in Unicode happen to agree with Latin-1. |
| 61 | |
| 62 | See L</"Byte and Character Semantics"> for more details. |
| 63 | |
| 64 | =back |
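As a minimal sketch of the input and output layers described above, the
following writes a string through an C<:encoding(UTF-8)> layer, reads it
back through the same layer, and then reads the raw octets for comparison.
The temporary file and all variable names are assumptions made for
illustration only.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical scratch file, removed automatically at exit.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

my $text = "caf\x{E9} \x{263A}";    # 6 characters, one of them above 255

# Encode Perl's characters to UTF-8 on output.
open(my $out, '>:encoding(UTF-8)', $path) or die "open for write: $!";
print {$out} $text;
close $out or die "close: $!";

# Decode UTF-8 back to characters on input.
open(my $in, '<:encoding(UTF-8)', $path) or die "open for read: $!";
my $decoded = do { local $/; <$in> };
close $in;

# Read the same file without any decoding to see the encoded octets.
open(my $raw, '<:raw', $path) or die "open raw: $!";
my $octets = do { local $/; <$raw> };
close $raw;

my $char_count = length $decoded;   # counts characters
my $byte_count = length $octets;    # counts octets on disk
```

The decoded string compares equal to the original, while the raw read is
longer because the two non-ASCII characters occupy several octets each.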
| 65 | |
| 66 | =head2 Byte and Character Semantics |
| 67 | |
| 68 | Beginning with version 5.6, Perl uses logically-wide characters to |
| 69 | represent strings internally. |
| 70 | |
| 71 | In future, Perl-level operations will be expected to work with |
| 72 | characters rather than bytes. |
| 73 | |
| 74 | However, as an interim compatibility measure, Perl aims to |
| 75 | provide a safe migration path from byte semantics to character |
| 76 | semantics for programs. For operations where Perl can unambiguously |
| 77 | decide that the input data are characters, Perl switches to |
| 78 | character semantics. For operations where this determination cannot |
| 79 | be made without additional information from the user, Perl decides in |
| 80 | favor of compatibility and chooses to use byte semantics. |
| 81 | |
| 82 | Under byte semantics, when C<use locale> is in effect, Perl uses the |
| 83 | semantics associated with the current locale. Absent a C<use locale>, Perl |
| 84 | currently uses US-ASCII (or Basic Latin in Unicode terminology) byte semantics, |
| 85 | meaning that characters whose ordinal numbers are in the range 128 - 255 are |
| 86 | undefined except for their ordinal numbers. This means that none have case |
| 87 | (upper and lower), nor are any a member of character classes, like C<[:alpha:]> |
| 88 | or C<\w>. |
| 89 | (But all do belong to the C<\W> class or the Perl regular expression extension |
| 90 | C<[:^alpha:]>.) |
| 91 | |
| 92 | This behavior preserves compatibility with earlier versions of Perl, |
| 93 | which allowed byte semantics in Perl operations only if |
| 94 | none of the program's inputs were marked as being a source of Unicode |
| 95 | character data. Such data may come from filehandles, from calls to |
| 96 | external programs, from information provided by the system (such as %ENV), |
| 97 | or from literals and constants in the source text. |
| 98 | |
| 99 | The C<bytes> pragma will always, regardless of platform, force byte |
| 100 | semantics in a particular lexical scope. See L<bytes>. |
| 101 | |
| 102 | The C<utf8> pragma is primarily a compatibility device that enables |
| 103 | recognition of UTF-(8|EBCDIC) in literals encountered by the parser. |
| 104 | Note that this pragma is only required while Perl defaults to byte |
| 105 | semantics; when character semantics become the default, this pragma |
| 106 | may become a no-op. See L<utf8>. |
| 107 | |
| 108 | Unless explicitly stated, Perl operators use character semantics |
| 109 | for Unicode data and byte semantics for non-Unicode data. |
| 110 | The decision to use character semantics is made transparently. If |
| 111 | input data comes from a Unicode source--for example, if a character |
| 112 | encoding layer is added to a filehandle or a literal Unicode |
| 113 | string constant appears in a program--character semantics apply. |
| 114 | Otherwise, byte semantics are in effect. The C<bytes> pragma should |
| 115 | be used to force byte semantics on Unicode data. |
| 116 | |
| 117 | If strings operating under byte semantics and strings with Unicode |
| 118 | character data are concatenated, the new string will have |
| 119 | character semantics. This can cause surprises: see L</BUGS>, below. |
| 120 | |
| 121 | Under character semantics, many operations that formerly operated on |
| 122 | bytes now operate on characters. A character in Perl is |
| 123 | logically just a number ranging from 0 to 2**31 or so. Larger |
| 124 | characters may encode into longer sequences of bytes internally, but |
| 125 | this internal detail is mostly hidden for Perl code. |
| 126 | See L<perluniintro> for more. |
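The difference between the two semantics can be sketched as follows. The
example assumes an ASCII-based platform, and the variable names are
illustrative only. It measures the same string under character semantics
and under the C<bytes> pragma, then shows that concatenating a byte string
with Unicode character data yields a string with character semantics.

```perl
use strict;
use warnings;

my $smiley = "\x{263A}";                 # one character, ordinal above 255

my $char_len = length $smiley;           # character semantics
my $byte_len = do { use bytes; length $smiley };   # forced byte semantics

my $latin  = "\xE9";                     # a plain byte string
my $joined = $latin . $smiley;           # mixing the two kinds of string

my $latin_is_unicode  = utf8::is_utf8($latin)  ? 1 : 0;
my $joined_is_unicode = utf8::is_utf8($joined) ? 1 : 0;
my $joined_len        = length $joined;  # still counted in characters
```

Here C<utf8::is_utf8> is used only to peek at the internal flag; ordinary
code rarely needs to do so.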
| 127 | |
| 128 | =head2 Effects of Character Semantics |
| 129 | |
| 130 | Character semantics have the following effects: |
| 131 | |
| 132 | =over 4 |
| 133 | |
| 134 | =item * |
| 135 | |
| 136 | Strings--including hash keys--and regular expression patterns may |
| 137 | contain characters that have an ordinal value larger than 255. |
| 138 | |
| 139 | If you use a Unicode editor to edit your program, Unicode characters may |
| 140 | occur directly within the literal strings in UTF-8 encoding, or UTF-16. |
| 141 | (The former requires a BOM or C<use utf8>, the latter requires a BOM.) |
| 142 | |
| 143 | Unicode characters can also be added to a string by using the C<\x{...}> |
| 144 | notation. The Unicode code for the desired character, in hexadecimal, |
| 145 | should be placed in the braces. For instance, a smiley face is |
| 146 | C<\x{263A}>. This encoding scheme works for all characters, but |
| 147 | for characters under 0x100, note that Perl may use an 8 bit encoding |
| 148 | internally, for optimization and/or backward compatibility. |
| 149 | |
| 150 | Additionally, if you |
| 151 | |
| 152 | use charnames ':full'; |
| 153 | |
| 154 | you can use the C<\N{...}> notation and put the official Unicode |
| 155 | character name within the braces, such as C<\N{WHITE SMILING FACE}>. |
| 156 | |
| 157 | =item * |
| 158 | |
| 159 | If an appropriate L<encoding> is specified, identifiers within the |
| 160 | Perl script may contain Unicode alphanumeric characters, including |
| 161 | ideographs. Perl does not currently attempt to canonicalize variable |
| 162 | names. |
| 163 | |
| 164 | =item * |
| 165 | |
| 166 | Regular expressions match characters instead of bytes. "." matches |
| 167 | a character instead of a byte. |
| 168 | |
| 169 | =item * |
| 170 | |
| 171 | Character classes in regular expressions match characters instead of |
| 172 | bytes and match against the character properties specified in the |
| 173 | Unicode properties database. C<\w> can be used to match a Japanese |
| 174 | ideograph, for instance. |
| 175 | |
| 176 | =item * |
| 177 | |
| 178 | Named Unicode properties, scripts, and block ranges may be used like |
| 179 | character classes via the C<\p{}> "matches property" construct and |
| 180 | the C<\P{}> negation, "doesn't match property". |
| 181 | |
| 182 | See L</"Unicode Character Properties"> for more details. |
| 183 | |
| 184 | You can define your own character properties and use them |
| 185 | in the regular expression with the C<\p{}> or C<\P{}> construct. |
| 186 | |
| 187 | See L</"User-Defined Character Properties"> for more details. |
| 188 | |
| 189 | =item * |
| 190 | |
| 191 | The special pattern C<\X> matches any extended Unicode |
| 192 | sequence--"a combining character sequence" in Standardese--where the |
| 193 | first character is a base character and subsequent characters are mark |
| 194 | characters that apply to the base character. C<\X> is equivalent to |
| 195 | C<< (?>\PM\pM*) >>. |
| 196 | |
| 197 | =item * |
| 198 | |
| 199 | The C<tr///> operator translates characters instead of bytes. Note |
| 200 | that the C<tr///CU> functionality has been removed. For similar |
| 201 | functionality see C<pack('U0', ...)> and C<pack('C0', ...)>. |
| 202 | |
| 203 | =item * |
| 204 | |
| 205 | Case translation operators use the Unicode case translation tables |
| 206 | when character input is provided. Note that C<uc()>, or C<\U> in |
| 207 | interpolated strings, translates to uppercase, while C<ucfirst>, |
| 208 | or C<\u> in interpolated strings, translates to titlecase in languages |
| 209 | that make the distinction. |
| 210 | |
| 211 | =item * |
| 212 | |
| 213 | Most operators that deal with positions or lengths in a string will |
| 214 | automatically switch to using character positions, including |
| 215 | C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>, |
| 216 | C<sprintf()>, C<write()>, and C<length()>. An operator that |
| 217 | specifically does not switch is C<vec()>. Operators that really don't |
| 218 | care include operators that treat strings as a bucket of bits such as |
| 219 | C<sort()>, and operators dealing with filenames. |
| 220 | |
| 221 | =item * |
| 222 | |
| 223 | The C<pack()>/C<unpack()> letter C<C> does I<not> change, since it is often |
| 224 | used for byte-oriented formats. Think of C<char> in the C language. |
| 225 | |
| 226 | There is a new C<U> specifier that converts between Unicode characters |
| 227 | and code points. There is also a C<W> specifier that is the equivalent of |
| 228 | C<chr>/C<ord> and properly handles character values even if they are above 255. |
| 229 | |
| 230 | =item * |
| 231 | |
| 232 | The C<chr()> and C<ord()> functions work on characters, similar to |
| 233 | C<pack("W")> and C<unpack("W")>, I<not> C<pack("C")> and |
| 234 | C<unpack("C")>. C<pack("C")> and C<unpack("C")> are methods for |
| 235 | emulating byte-oriented C<chr()> and C<ord()> on Unicode strings. |
| 236 | While these methods reveal the internal encoding of Unicode strings, |
| 237 | that is not something one normally needs to care about at all. |
| 238 | |
| 239 | =item * |
| 240 | |
| 241 | The bit string operators, C<& | ^ ~>, can operate on character data. |
| 242 | However, because bit string operations on characters that are all |
| 243 | less than 256 in ordinal value keep their byte-wide behavior for |
| 244 | backward compatibility, one should not use C<~> (the bit complement) on |
| 245 | strings that mix characters below 256 with characters above 255. Most importantly, |
| 246 | DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>) |
| 247 | will not hold. The reason for this mathematical I<faux pas> is that |
| 248 | the complement cannot return B<both> the 8-bit (byte-wide) bit |
| 249 | complement B<and> the full character-wide bit complement. |
| 250 | |
| 251 | =item * |
| 252 | |
| 253 | lc(), uc(), lcfirst(), and ucfirst() work for the following cases: |
| 254 | |
| 255 | =over 8 |
| 256 | |
| 257 | =item * |
| 258 | |
| 259 | the case mapping is from a single Unicode character to another |
| 260 | single Unicode character, or |
| 261 | |
| 262 | =item * |
| 263 | |
| 264 | the case mapping is from a single Unicode character to more |
| 265 | than one Unicode character. |
| 266 | |
| 267 | =back |
| 268 | |
| 269 | Things to do with locales (Lithuanian, Turkish, Azeri) do B<not> work |
| 270 | since Perl does not understand the concept of Unicode locales. |
| 271 | |
| 272 | See the Unicode Technical Report #21, Case Mappings, for more details. |
| 273 | |
| 274 | But you can also define your own mappings to be used in the lc(), |
| 275 | lcfirst(), uc(), and ucfirst() (or their string-inlined versions). |
| 276 | |
| 277 | See L</"User-Defined Case Mappings"> for more details. |
| 278 | |
| 279 | =back |
| 280 | |
| 281 | =over 4 |
| 282 | |
| 283 | =item * |
| 284 | |
| 285 | And finally, C<scalar reverse()> reverses by character rather than by byte. |
| 286 | |
| 287 | =back |
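Several of the effects listed above can be seen in a short sketch. The
variable names are illustrative, and the C<W> pack template assumes a Perl
recent enough to provide it.

```perl
use strict;
use warnings;
use charnames ':full';

# The same character written by code point and by official name.
my $by_code = "\x{263A}";
my $by_name = "\N{WHITE SMILING FACE}";
my $same    = $by_code eq $by_name;

# Positions and lengths are counted in characters, not bytes.
my $s        = "a\x{263A}b";
my $len      = length $s;
my $middle   = substr($s, 1, 1);
my $reversed = scalar reverse($s);

# ord() and pack("W") work on full character values.
my $code   = ord $middle;
my $packed = pack("W", 0x263A);

# uc() gives uppercase, ucfirst() gives titlecase where they differ.
my $upper = uc "\x{3B1}";         # GREEK SMALL LETTER ALPHA
my $title = ucfirst "\x{1C6}";    # LATIN SMALL LETTER DZ WITH CARON

# \X matches a base character together with its combining marks.
my @clusters = "e\x{301}x" =~ /(\X)/g;
```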
| 288 | |
| 289 | =head2 Unicode Character Properties |
| 290 | |
| 291 | Named Unicode properties, scripts, and block ranges may be used like |
| 292 | character classes via the C<\p{}> "matches property" construct and |
| 293 | the C<\P{}> negation, "doesn't match property". |
| 294 | |
| 295 | For instance, C<\p{Lu}> matches any character with the Unicode "Lu" |
| 296 | (Letter, uppercase) property, while C<\p{M}> matches any character |
| 297 | with an "M" (mark--accents and such) property. Braces are not |
| 298 | required for single letter properties, so C<\p{M}> is equivalent to |
| 299 | C<\pM>. Many predefined properties are available, such as |
| 300 | C<\p{Mirrored}> and C<\p{Tibetan}>. |
| 301 | |
| 302 | The official Unicode script and block names have spaces and dashes as |
| 303 | separators, but for convenience you can use dashes, spaces, or |
| 304 | underbars, and case is unimportant. It is recommended, however, that |
| 305 | for consistency you use the following naming: the official Unicode |
| 306 | script, property, or block name (see below for the additional rules |
| 307 | that apply to block names) with whitespace and dashes removed, and the |
| 308 | words "uppercase-first-lowercase-rest". C<Latin-1 Supplement> thus |
| 309 | becomes C<Latin1Supplement>. |
| 310 | |
| 311 | You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret |
| 312 | (^) between the first brace and the property name: C<\p{^Tamil}> is |
| 313 | equal to C<\P{Tamil}>. |
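A brief sketch of these constructs follows. The sample strings and variable
names are chosen only for illustration.

```perl
use strict;
use warnings;

# "A", then COMBINING ACUTE ACCENT, then "b".
my $text = "A\x{0301}b";

my @upper = $text =~ /(\p{Lu})/g;   # uppercase letters
my @marks = $text =~ /(\pM)/g;      # marks, written without braces

my $tamil_a = "\x{0B85}";           # TAMIL LETTER A

my $is_tamil        = $tamil_a =~ /\p{Tamil}/  ? 1 : 0;
my $caret_negation  = "a"      =~ /\p{^Tamil}/ ? 1 : 0;
my $capital_p_form  = "a"      =~ /\P{Tamil}/  ? 1 : 0;
```

The last two tests agree, since C<\p{^Tamil}> and C<\P{Tamil}> are two
spellings of the same negated class.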
| 314 | |
| 315 | B<NOTE: the properties, scripts, and blocks listed here are as of |
| 316 | Unicode 5.0.0 in July 2006.> |
| 317 | |
| 318 | =over 4 |
| 319 | |
| 320 | =item General Category |
| 321 | |
| 322 | Here are the basic Unicode General Category properties, followed by their |
| 323 | long form. You can use either; C<\p{Lu}> and C<\p{UppercaseLetter}>, |
| 324 | for instance, are identical. |
| 325 | |
| 326 | Short Long |
| 327 | |
| 328 | L Letter |
| 329 | LC CasedLetter |
| 330 | Lu UppercaseLetter |
| 331 | Ll LowercaseLetter |
| 332 | Lt TitlecaseLetter |
| 333 | Lm ModifierLetter |
| 334 | Lo OtherLetter |
| 335 | |
| 336 | M Mark |
| 337 | Mn NonspacingMark |
| 338 | Mc SpacingMark |
| 339 | Me EnclosingMark |
| 340 | |
| 341 | N Number |
| 342 | Nd DecimalNumber |
| 343 | Nl LetterNumber |
| 344 | No OtherNumber |
| 345 | |
| 346 | P Punctuation |
| 347 | Pc ConnectorPunctuation |
| 348 | Pd DashPunctuation |
| 349 | Ps OpenPunctuation |
| 350 | Pe ClosePunctuation |
| 351 | Pi InitialPunctuation |
| 352 | (may behave like Ps or Pe depending on usage) |
| 353 | Pf FinalPunctuation |
| 354 | (may behave like Ps or Pe depending on usage) |
| 355 | Po OtherPunctuation |
| 356 | |
| 357 | S Symbol |
| 358 | Sm MathSymbol |
| 359 | Sc CurrencySymbol |
| 360 | Sk ModifierSymbol |
| 361 | So OtherSymbol |
| 362 | |
| 363 | Z Separator |
| 364 | Zs SpaceSeparator |
| 365 | Zl LineSeparator |
| 366 | Zp ParagraphSeparator |
| 367 | |
| 368 | C Other |
| 369 | Cc Control |
| 370 | Cf Format |
| 371 | Cs Surrogate (not usable) |
| 372 | Co PrivateUse |
| 373 | Cn Unassigned |
| 374 | |
| 375 | Single-letter properties match all characters in any of the |
| 376 | two-letter sub-properties starting with the same letter. |
| 377 | C<LC> and C<L&> are special cases, which are aliases for the set of |
| 378 | C<Ll>, C<Lu>, and C<Lt>. |
| 379 | |
| 380 | Because Perl hides the need for the user to understand the internal |
| 381 | representation of Unicode characters, there is no need to implement |
| 382 | the somewhat messy concept of surrogates. C<Cs> is therefore not |
| 383 | supported. |
| 384 | |
| 385 | =item Bidirectional Character Types |
| 386 | |
| 387 | Because scripts differ in their directionality--Hebrew is |
| 388 | written right to left, for example--Unicode supplies these properties in |
| 389 | the BidiClass class: |
| 390 | |
| 391 | Property Meaning |
| 392 | |
| 393 | L Left-to-Right |
| 394 | LRE Left-to-Right Embedding |
| 395 | LRO Left-to-Right Override |
| 396 | R Right-to-Left |
| 397 | AL Right-to-Left Arabic |
| 398 | RLE Right-to-Left Embedding |
| 399 | RLO Right-to-Left Override |
| 400 | PDF Pop Directional Format |
| 401 | EN European Number |
| 402 | ES European Number Separator |
| 403 | ET European Number Terminator |
| 404 | AN Arabic Number |
| 405 | CS Common Number Separator |
| 406 | NSM Non-Spacing Mark |
| 407 | BN Boundary Neutral |
| 408 | B Paragraph Separator |
| 409 | S Segment Separator |
| 410 | WS Whitespace |
| 411 | ON Other Neutrals |
| 412 | |
| 413 | For example, C<\p{BidiClass:R}> matches characters that are normally |
| 414 | written right to left. |
| 415 | |
| 416 | =item Scripts |
| 417 | |
| 418 | The script names which can be used by C<\p{...}> and C<\P{...}>, |
| 419 | such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows: |
| 420 | |
| 421 | Arabic |
| 422 | Armenian |
| 423 | Balinese |
| 424 | Bengali |
| 425 | Bopomofo |
| 426 | Braille |
| 427 | Buginese |
| 428 | Buhid |
| 429 | CanadianAboriginal |
| 430 | Cherokee |
| 431 | Coptic |
| 432 | Cuneiform |
| 433 | Cypriot |
| 434 | Cyrillic |
| 435 | Deseret |
| 436 | Devanagari |
| 437 | Ethiopic |
| 438 | Georgian |
| 439 | Glagolitic |
| 440 | Gothic |
| 441 | Greek |
| 442 | Gujarati |
| 443 | Gurmukhi |
| 444 | Han |
| 445 | Hangul |
| 446 | Hanunoo |
| 447 | Hebrew |
| 448 | Hiragana |
| 449 | Inherited |
| 450 | Kannada |
| 451 | Katakana |
| 452 | Kharoshthi |
| 453 | Khmer |
| 454 | Lao |
| 455 | Latin |
| 456 | Limbu |
| 457 | LinearB |
| 458 | Malayalam |
| 459 | Mongolian |
| 460 | Myanmar |
| 461 | NewTaiLue |
| 462 | Nko |
| 463 | Ogham |
| 464 | OldItalic |
| 465 | OldPersian |
| 466 | Oriya |
| 467 | Osmanya |
| 468 | PhagsPa |
| 469 | Phoenician |
| 470 | Runic |
| 471 | Shavian |
| 472 | Sinhala |
| 473 | SylotiNagri |
| 474 | Syriac |
| 475 | Tagalog |
| 476 | Tagbanwa |
| 477 | TaiLe |
| 478 | Tamil |
| 479 | Telugu |
| 480 | Thaana |
| 481 | Thai |
| 482 | Tibetan |
| 483 | Tifinagh |
| 484 | Ugaritic |
| 485 | Yi |
| 486 | |
| 487 | =item Extended property classes |
| 488 | |
| 489 | Extended property classes can supplement the basic |
| 490 | properties, defined by the F<PropList> Unicode database: |
| 491 | |
| 492 | ASCIIHexDigit |
| 493 | BidiControl |
| 494 | Dash |
| 495 | Deprecated |
| 496 | Diacritic |
| 497 | Extender |
| 498 | HexDigit |
| 499 | Hyphen |
| 500 | Ideographic |
| 501 | IDSBinaryOperator |
| 502 | IDSTrinaryOperator |
| 503 | JoinControl |
| 504 | LogicalOrderException |
| 505 | NoncharacterCodePoint |
| 506 | OtherAlphabetic |
| 507 | OtherDefaultIgnorableCodePoint |
| 508 | OtherGraphemeExtend |
| 509 | OtherIDStart |
| 510 | OtherIDContinue |
| 511 | OtherLowercase |
| 512 | OtherMath |
| 513 | OtherUppercase |
| 514 | PatternSyntax |
| 515 | PatternWhiteSpace |
| 516 | QuotationMark |
| 517 | Radical |
| 518 | SoftDotted |
| 519 | STerm |
| 520 | TerminalPunctuation |
| 521 | UnifiedIdeograph |
| 522 | VariationSelector |
| 523 | WhiteSpace |
| 524 | |
| 525 | and there are further derived properties: |
| 526 | |
| 527 | Alphabetic = Lu + Ll + Lt + Lm + Lo + Nl + OtherAlphabetic |
| 528 | Lowercase = Ll + OtherLowercase |
| 529 | Uppercase = Lu + OtherUppercase |
| 530 | Math = Sm + OtherMath |
| 531 | |
| 532 | IDStart = Lu + Ll + Lt + Lm + Lo + Nl + OtherIDStart |
| 533 | IDContinue = IDStart + Mn + Mc + Nd + Pc + OtherIDContinue |
| 534 | |
| 535 | DefaultIgnorableCodePoint |
| 536 | = OtherDefaultIgnorableCodePoint |
| 537 | + Cf + Cc + Cs + Noncharacters + VariationSelector |
| 538 | - WhiteSpace - FFF9..FFFB (Annotation Characters) |
| 539 | |
| 540 | Any = Any code points (i.e. U+0000 to U+10FFFF) |
| 541 | Assigned = Any non-Cn code points (i.e. synonym for \P{Cn}) |
| 542 | Unassigned = Synonym for \p{Cn} |
| 543 | ASCII = ASCII (i.e. U+0000 to U+007F) |
| 544 | |
| 545 | Common = Any character (or unassigned code point) |
| 546 | not explicitly assigned to a script |
| 547 | |
| 548 | =item Use of "Is" Prefix |
| 549 | |
| 550 | For backward compatibility (with Perl 5.6), all properties mentioned |
| 551 | so far may have C<Is> prepended to their name, so C<\P{IsLu}>, for |
| 552 | example, is equal to C<\P{Lu}>. |
| 553 | |
| 554 | =item Blocks |
| 555 | |
| 556 | In addition to B<scripts>, Unicode also defines B<blocks> of |
| 557 | characters. The difference between scripts and blocks is that the |
| 558 | concept of scripts is closer to natural languages, while the concept |
| 559 | of blocks is more of an artificial grouping based on groups of 256 |
| 560 | Unicode characters. For example, the C<Latin> script contains letters |
| 561 | from many blocks but does not contain all the characters from those |
| 562 | blocks. It does not, for example, contain digits, because digits are |
| 563 | shared across many scripts. Digits and similar groups, like |
| 564 | punctuation, are in a category called C<Common>. |
| 565 | |
| 566 | For more about scripts, see the UAX#24 "Script Names": |
| 567 | |
| 568 | http://www.unicode.org/reports/tr24/ |
| 569 | |
| 570 | For more about blocks, see: |
| 571 | |
| 572 | http://www.unicode.org/Public/UNIDATA/Blocks.txt |
| 573 | |
| 574 | Block names are given with the C<In> prefix. For example, the |
| 575 | Katakana block is referenced via C<\p{InKatakana}>. The C<In> |
| 576 | prefix may be omitted if there is no naming conflict with a script |
| 577 | or any other property, but it is recommended that C<In> always be used |
| 578 | for block tests to avoid confusion. |
| 579 | |
| 580 | These block names are supported: |
| 581 | |
| 582 | InAegeanNumbers |
| 583 | InAlphabeticPresentationForms |
| 584 | InAncientGreekMusicalNotation |
| 585 | InAncientGreekNumbers |
| 586 | InArabic |
| 587 | InArabicPresentationFormsA |
| 588 | InArabicPresentationFormsB |
| 589 | InArabicSupplement |
| 590 | InArmenian |
| 591 | InArrows |
| 592 | InBalinese |
| 593 | InBasicLatin |
| 594 | InBengali |
| 595 | InBlockElements |
| 596 | InBopomofo |
| 597 | InBopomofoExtended |
| 598 | InBoxDrawing |
| 599 | InBraillePatterns |
| 600 | InBuginese |
| 601 | InBuhid |
| 602 | InByzantineMusicalSymbols |
| 603 | InCJKCompatibility |
| 604 | InCJKCompatibilityForms |
| 605 | InCJKCompatibilityIdeographs |
| 606 | InCJKCompatibilityIdeographsSupplement |
| 607 | InCJKRadicalsSupplement |
| 608 | InCJKStrokes |
| 609 | InCJKSymbolsAndPunctuation |
| 610 | InCJKUnifiedIdeographs |
| 611 | InCJKUnifiedIdeographsExtensionA |
| 612 | InCJKUnifiedIdeographsExtensionB |
| 613 | InCherokee |
| 614 | InCombiningDiacriticalMarks |
| 615 | InCombiningDiacriticalMarksSupplement |
| 616 | InCombiningDiacriticalMarksforSymbols |
| 617 | InCombiningHalfMarks |
| 618 | InControlPictures |
| 619 | InCoptic |
| 620 | InCountingRodNumerals |
| 621 | InCuneiform |
| 622 | InCuneiformNumbersAndPunctuation |
| 623 | InCurrencySymbols |
| 624 | InCypriotSyllabary |
| 625 | InCyrillic |
| 626 | InCyrillicSupplement |
| 627 | InDeseret |
| 628 | InDevanagari |
| 629 | InDingbats |
| 630 | InEnclosedAlphanumerics |
| 631 | InEnclosedCJKLettersAndMonths |
| 632 | InEthiopic |
| 633 | InEthiopicExtended |
| 634 | InEthiopicSupplement |
| 635 | InGeneralPunctuation |
| 636 | InGeometricShapes |
| 637 | InGeorgian |
| 638 | InGeorgianSupplement |
| 639 | InGlagolitic |
| 640 | InGothic |
| 641 | InGreekExtended |
| 642 | InGreekAndCoptic |
| 643 | InGujarati |
| 644 | InGurmukhi |
| 645 | InHalfwidthAndFullwidthForms |
| 646 | InHangulCompatibilityJamo |
| 647 | InHangulJamo |
| 648 | InHangulSyllables |
| 649 | InHanunoo |
| 650 | InHebrew |
| 651 | InHighPrivateUseSurrogates |
| 652 | InHighSurrogates |
| 653 | InHiragana |
| 654 | InIPAExtensions |
| 655 | InIdeographicDescriptionCharacters |
| 656 | InKanbun |
| 657 | InKangxiRadicals |
| 658 | InKannada |
| 659 | InKatakana |
| 660 | InKatakanaPhoneticExtensions |
| 661 | InKharoshthi |
| 662 | InKhmer |
| 663 | InKhmerSymbols |
| 664 | InLao |
| 665 | InLatin1Supplement |
| 666 | InLatinExtendedA |
| 667 | InLatinExtendedAdditional |
| 668 | InLatinExtendedB |
| 669 | InLatinExtendedC |
| 670 | InLatinExtendedD |
| 671 | InLetterlikeSymbols |
| 672 | InLimbu |
| 673 | InLinearBIdeograms |
| 674 | InLinearBSyllabary |
| 675 | InLowSurrogates |
| 676 | InMalayalam |
| 677 | InMathematicalAlphanumericSymbols |
| 678 | InMathematicalOperators |
| 679 | InMiscellaneousMathematicalSymbolsA |
| 680 | InMiscellaneousMathematicalSymbolsB |
| 681 | InMiscellaneousSymbols |
| 682 | InMiscellaneousSymbolsAndArrows |
| 683 | InMiscellaneousTechnical |
| 684 | InModifierToneLetters |
| 685 | InMongolian |
| 686 | InMusicalSymbols |
| 687 | InMyanmar |
| 688 | InNKo |
| 689 | InNewTaiLue |
| 690 | InNumberForms |
| 691 | InOgham |
| 692 | InOldItalic |
| 693 | InOldPersian |
| 694 | InOpticalCharacterRecognition |
| 695 | InOriya |
| 696 | InOsmanya |
| 697 | InPhagspa |
| 698 | InPhoenician |
| 699 | InPhoneticExtensions |
| 700 | InPhoneticExtensionsSupplement |
| 701 | InPrivateUseArea |
| 702 | InRunic |
| 703 | InShavian |
| 704 | InSinhala |
| 705 | InSmallFormVariants |
| 706 | InSpacingModifierLetters |
| 707 | InSpecials |
| 708 | InSuperscriptsAndSubscripts |
| 709 | InSupplementalArrowsA |
| 710 | InSupplementalArrowsB |
| 711 | InSupplementalMathematicalOperators |
| 712 | InSupplementalPunctuation |
| 713 | InSupplementaryPrivateUseAreaA |
| 714 | InSupplementaryPrivateUseAreaB |
| 715 | InSylotiNagri |
| 716 | InSyriac |
| 717 | InTagalog |
| 718 | InTagbanwa |
| 719 | InTags |
| 720 | InTaiLe |
| 721 | InTaiXuanJingSymbols |
| 722 | InTamil |
| 723 | InTelugu |
| 724 | InThaana |
| 725 | InThai |
| 726 | InTibetan |
| 727 | InTifinagh |
| 728 | InUgaritic |
| 729 | InUnifiedCanadianAboriginalSyllabics |
| 730 | InVariationSelectors |
| 731 | InVariationSelectorsSupplement |
| 732 | InVerticalForms |
| 733 | InYiRadicals |
| 734 | InYiSyllables |
| 735 | InYijingHexagramSymbols |
| 736 | |
| 737 | =back |
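The distinction between scripts and blocks, along with the C<Is> prefix and
the bidirectional types, can be sketched as follows. The test characters and
variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# A digit lies in the Basic Latin block but is not part of the Latin script.
my $digit_in_block  = "1" =~ /\p{InBasicLatin}/ ? 1 : 0;
my $digit_in_script = "1" =~ /\p{Latin}/        ? 1 : 0;
my $letter_in_script = "a" =~ /\p{Latin}/       ? 1 : 0;

# Block test with the recommended "In" prefix.
my $katakana_a = "\x{30A2}";    # KATAKANA LETTER A
my $in_katakana_block = $katakana_a =~ /\p{InKatakana}/ ? 1 : 0;

# The "Is" prefix is accepted for backward compatibility.
my $is_prefix = "A" =~ /\p{IsLu}/ ? 1 : 0;

# Hebrew letters are normally written right to left.
my $alef = "\x{05D0}";          # HEBREW LETTER ALEF
my $right_to_left = $alef =~ /\p{BidiClass:R}/ ? 1 : 0;
```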
| 738 | |
| 739 | =head2 User-Defined Character Properties |
| 740 | |
| 741 | You can define your own character properties by defining subroutines |
| 742 | whose names begin with "In" or "Is". The subroutines can be defined in |
| 743 | any package. The user-defined properties can be used in the regular |
| 744 | expression C<\p> and C<\P> constructs; if you are using a user-defined |
| 745 | property from a package other than the one you are in, you must specify |
| 746 | its package in the C<\p> or C<\P> construct. |
| 747 | |
| 748 | # assuming property IsForeign defined in Lang:: |
| 749 | package main; # property package name required |
| 750 | if ($txt =~ /\p{Lang::IsForeign}+/) { ... } |
| 751 | |
| 752 | package Lang; # property package name not required |
| 753 | if ($txt =~ /\p{IsForeign}+/) { ... } |
| 754 | |
| 755 | |
| 756 | Note that the effect is compile-time and immutable once defined. |
| 757 | |
| 758 | The subroutines must return a specially-formatted string, with one |
| 759 | or more newline-separated lines. Each line must be one of the following: |
| 760 | |
| 761 | =over 4 |
| 762 | |
| 763 | =item * |
| 764 | |
| 765 | A single hexadecimal number denoting a Unicode code point to include. |
| 766 | |
| 767 | =item * |
| 768 | |
| 769 | Two hexadecimal numbers separated by horizontal whitespace (space or |
| 770 | tab characters) denoting a range of Unicode code points to include. |
| 771 | |
| 772 | =item * |
| 773 | |
| 774 | Something to include, prefixed by "+": a built-in character |
| 775 | property (prefixed by "utf8::") or a user-defined character property, |
| 776 | to represent all the characters in that property; two hexadecimal code |
| 777 | points for a range; or a single hexadecimal code point. |
| 778 | |
| 779 | =item * |
| 780 | |
| 781 | Something to exclude, prefixed by "-": an existing character |
| 782 | property (prefixed by "utf8::") or a user-defined character property, |
| 783 | to represent all the characters in that property; two hexadecimal code |
| 784 | points for a range; or a single hexadecimal code point. |
| 785 | |
| 786 | =item * |
| 787 | |
| 788 | Something to negate, prefixed by "!": an existing character |
| 789 | property (prefixed by "utf8::") or a user-defined character property, |
| 790 | to represent all the characters in that property; two hexadecimal code |
| 791 | points for a range; or a single hexadecimal code point. |
| 792 | |
| 793 | =item * |
| 794 | |
| 795 | Something to intersect with, prefixed by "&": an existing character |
| 796 | property (prefixed by "utf8::") or a user-defined character property, |
| 797 | to keep only the characters that are also in that property; two |
| 798 | hexadecimal code points for a range; or a single hexadecimal code point. |
| 799 | |
| 800 | =back |
| 801 | |
| 802 | For example, to define a property that covers both the Japanese |
| 803 | syllabaries (hiragana and katakana), you can define |
| 804 | |
| 805 | sub InKana { |
| 806 | return <<END; |
| 807 | 3040\t309F |
| 808 | 30A0\t30FF |
| 809 | END |
| 810 | } |
| 811 | |
| 812 | Imagine that the here-doc end marker is at the beginning of the line. |
| 813 | Now you can use C<\p{InKana}> and C<\P{InKana}>. |
| 814 | |
| 815 | You could also have used the existing block property names: |
| 816 | |
| 817 | sub InKana { |
| 818 | return <<'END'; |
| 819 | +utf8::InHiragana |
| 820 | +utf8::InKatakana |
| 821 | END |
| 822 | } |
| 823 | |
| 824 | Suppose you wanted to match only the allocated characters, |
| 825 | not the raw block ranges: in other words, you want to remove |
| 826 | the non-characters: |
| 827 | |
| 828 | sub InKana { |
| 829 | return <<'END'; |
| 830 | +utf8::InHiragana |
| 831 | +utf8::InKatakana |
| 832 | -utf8::IsCn |
| 833 | END |
| 834 | } |
| 835 | |
| 836 | The negation is useful for defining (surprise!) negated classes. |
| 837 | |
| 838 | sub InNotKana { |
| 839 | return <<'END'; |
| 840 | !utf8::InHiragana |
| 841 | -utf8::InKatakana |
| 842 | +utf8::IsCn |
| 843 | END |
| 844 | } |
| 845 | |
| 846 | Intersection is useful for getting the common characters matched by |
| 847 | two (or more) classes. |
| 848 | |
| 849 | sub InFooAndBar { |
| 850 | return <<'END'; |
| 851 | +main::Foo |
| 852 | &main::Bar |
| 853 | END |
| 854 | } |
| 855 | |
| 856 | It's important to remember not to use "&" for the first set -- that |
| 857 | would be intersecting with nothing (resulting in an empty set). |
| 858 | |
| 859 | =head2 User-Defined Case Mappings |
| 860 | |
| 861 | You can also define your own mappings to be used in lc(), lcfirst(), |
| 862 | uc(), and ucfirst() (or their string-inlined versions, C<\L>, C<\l>, C<\U>, and C<\u>). |
| 863 | The principle is similar to that of user-defined character |
| 864 | properties: to define subroutines in the C<main> package |
| 865 | with names like C<ToLower> (for lc() and lcfirst()), C<ToTitle> (for |
| 866 | the first character in ucfirst()), and C<ToUpper> (for uc(), and the |
| 867 | rest of the characters in ucfirst()). |
| 868 | |
| 869 | The string returned by the subroutines must now be three |
| 870 | hexadecimal numbers separated by tabs: start of the source |
| 871 | range, end of the source range, and start of the destination range. |
| 872 | For example: |
| 873 | |
| 874 | sub ToUpper { |
| 875 | return <<END; |
| 876 | 0061\t0063\t0041 |
| 877 | END |
| 878 | } |
| 879 | |
| 880 | defines a uc() mapping that causes only the characters "a", "b", and |
| 881 | "c" to be mapped to "A", "B", and "C"; all other characters will remain |
| 882 | unchanged. |
| 883 | |
| 884 | If there is no source range to speak of, that is, the mapping is from |
| 885 | a single character to another single character, leave the end of the |
| 886 | source range empty, but the two tab characters are still needed. |
| 887 | For example: |
| 888 | |
| 889 | sub ToLower { |
| 890 | return <<END; |
| 891 | 0041\t\t0061 |
| 892 | END |
| 893 | } |
| 894 | |
| 895 | defines an lc() mapping that causes only "A" to be mapped to "a"; all |
| 896 | other characters will remain unchanged. |
| 897 | |
| 898 | (For serious hackers only) If you want to introspect the default |
| 899 | mappings, you can find the data in the directory |
| 900 | C<$Config{privlib}>/F<unicore/To/>. The mapping data is returned as |
| 901 | the here-document, and the C<utf8::ToSpecFoo> are special exception |
| 902 | mappings derived from C<$Config{privlib}>/F<unicore/SpecialCasing.txt>. |
| 903 | The C<Digit> and C<Fold> mappings that one can see in the directory |
| 904 | are not directly user-accessible; instead, one can use either the |
| 905 | C<Unicode::UCD> module or just match case-insensitively (that's when |
| 906 | the C<Fold> mapping is used). |
| 907 | |
| 908 | A final note on the user-defined case mappings: they will be used |
| 909 | only if the scalar has been marked as having Unicode characters. |
| 910 | Old byte-style strings will not be affected. |
| 911 | |
| 912 | =head2 Character Encodings for Input and Output |
| 913 | |
| 914 | See L<Encode>. |
| 915 | |
| 916 | =head2 Unicode Regular Expression Support Level |
| 917 | |
| 918 | The following list of Unicode support for regular expressions describes |
| 919 | all the features currently supported. The references to "Level N" |
| 920 | and the section numbers refer to the Unicode Technical Standard #18, |
| 921 | "Unicode Regular Expressions", version 11, in May 2005. |
| 922 | |
| 923 | =over 4 |
| 924 | |
| 925 | =item * |
| 926 | |
| 927 | Level 1 - Basic Unicode Support |
| 928 | |
| 929 | RL1.1 Hex Notation - done [1] |
| 930 | RL1.2 Properties - done [2][3] |
| 931 | RL1.2a Compatibility Properties - done [4] |
| 932 | RL1.3 Subtraction and Intersection - MISSING [5] |
| 933 | RL1.4 Simple Word Boundaries - done [6] |
| 934 | RL1.5 Simple Loose Matches - done [7] |
| 935 | RL1.6 Line Boundaries - MISSING [8] |
| 936 | RL1.7 Supplementary Code Points - done [9] |
| 937 | |
| 938 | [1] \x{...} |
| 939 | [2] \p{...} \P{...} |
| 940 | [3] supports not only minimal list (general category, scripts, |
| 941 | Alphabetic, Lowercase, Uppercase, WhiteSpace, |
| 942 | NoncharacterCodePoint, DefaultIgnorableCodePoint, Any, |
| 943 | ASCII, Assigned), but also bidirectional types, blocks, etc. |
| 944 | (see L</"Unicode Character Properties">) |
| 945 | [4] \d \D \s \S \w \W \X [:prop:] [:^prop:] |
| 946 | [5] can use regular expression look-ahead [a] or |
| 947 | user-defined character properties [b] to emulate set operations |
| 948 | [6] \b \B |
| 949 | [7] note that Perl does Full case-folding in matching, not Simple: |
| 950 | for example U+1F88 is equivalent to U+1F00 U+03B9, |
| 951 | not with 1F80. This difference matters mainly for certain Greek |
| 952 | capital letters with certain modifiers: the Full case-folding |
| 953 | decomposes the letter, while the Simple case-folding would map |
| 954 | it to a single character. |
| 955 | [8] should do ^ and $ also on U+000B (\v in C), FF (\f), CR (\r), |
| 956 | CRLF (\r\n), NEL (U+0085), LS (U+2028), and PS (U+2029); |
| 957 | should also affect <>, $., and script line numbers; |
| 958 | should not split lines within CRLF [c] (i.e. there is no empty |
| 959 | line between \r and \n) |
| 960 | [9] UTF-8/UTF-EBCDIC used in perl allows not only U+10000 to U+10FFFF |
| 961 | but also beyond U+10FFFF [d] |
| 962 | |
| 963 | [a] You can mimic class subtraction using lookahead. |
| 964 | For example, what UTS#18 might write as |
| 965 | |
| 966 | [{Greek}-[{UNASSIGNED}]] |
| 967 | |
| 968 | in Perl can be written as: |
| 969 | |
| 970 | (?!\p{Unassigned})\p{InGreekAndCoptic} |
| 971 | (?=\p{Assigned})\p{InGreekAndCoptic} |
| 972 | |
| 973 | But in this particular example, you probably really want |
| 974 | |
| 975 | \p{Greek} |
| 976 | |
| 977 | which will match assigned characters known to be part of the Greek script. |
| 978 | |
| 979 | Also see the Unicode::Regex::Set module, which implements the full |
| 980 | UTS#18 grouping, intersection, union, and removal (subtraction) syntax. |
| 981 | |
| 982 | [b] '+' for union, '-' for removal (set-difference), '&' for intersection |
| 983 | (see L</"User-Defined Character Properties">) |
| 984 | |
| 985 | [c] Try the C<:crlf> layer (see L<PerlIO>). |
| 986 | |
| 987 | [d] Avoid C<use warnings 'utf8';> (or say C<no warnings 'utf8';>) to allow |
| 988 | U+FFFF (C<\x{FFFF}>). |
| 989 | |
| 990 | =item * |
| 991 | |
| 992 | Level 2 - Extended Unicode Support |
| 993 | |
| 994 | RL2.1 Canonical Equivalents - MISSING [10][11] |
| 995 | RL2.2 Default Grapheme Clusters - MISSING [12][13] |
| 996 | RL2.3 Default Word Boundaries - MISSING [14] |
| 997 | RL2.4 Default Loose Matches - MISSING [15] |
| 998 | RL2.5 Name Properties - MISSING [16] |
| 999 | RL2.6 Wildcard Properties - MISSING |
| 1000 | |
| 1001 | [10] see UAX#15 "Unicode Normalization Forms" |
| 1002 | [11] have Unicode::Normalize but not integrated to regexes |
| 1003 | [12] have \X but at this level . should equal that |
| 1004 | [13] UAX#29 "Text Boundaries" considers CRLF and Hangul syllable |
| 1005 | clusters as a single grapheme cluster. |
| 1006 | [14] see UAX#29, Word Boundaries |
| 1007 | [15] see UAX#21 "Case Mappings" |
| 1008 | [16] have \N{...} but neither compute names of CJK Ideographs |
| 1009 | and Hangul Syllables nor use a loose match [e] |
| 1010 | |
| 1011 | [e] C<\N{...}> allows namespaces (see L<charnames>). |
| 1012 | |
| 1013 | =item * |
| 1014 | |
| 1015 | Level 3 - Tailored Support |
| 1016 | |
| 1017 | RL3.1 Tailored Punctuation - MISSING |
| 1018 | RL3.2 Tailored Grapheme Clusters - MISSING [17][18] |
| 1019 | RL3.3 Tailored Word Boundaries - MISSING |
| 1020 | RL3.4 Tailored Loose Matches - MISSING |
| 1021 | RL3.5 Tailored Ranges - MISSING |
| 1022 | RL3.6 Context Matching - MISSING [19] |
| 1023 | RL3.7 Incremental Matches - MISSING |
| 1024 | ( RL3.8 Unicode Set Sharing ) |
| 1025 | RL3.9 Possible Match Sets - MISSING |
| 1026 | RL3.10 Folded Matching - MISSING [20] |
| 1027 | RL3.11 Submatchers - MISSING |
| 1028 | |
| 1029 | [17] see UTS#10 "Unicode Collation Algorithm" |
| 1030 | [18] have Unicode::Collate but not integrated to regexes |
| 1031 | [19] have (?<=x) and (?=x), but look-aheads or look-behinds should see |
| 1032 | outside of the target substring |
| 1033 | [20] need insensitive matching for linguistic features other than case; |
| 1034 | for example, hiragana to katakana, wide and narrow, simplified Han |
| 1035 | to traditional Han (see UTR#30 "Character Foldings") |
| 1036 | |
| 1037 | =back |
| 1038 | |
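The lookahead emulation of set subtraction described in note [a] above
can be sketched as follows. The subroutine name C<is_assigned_greek> is
made up for this illustration; the pattern itself is the one given in
the note.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $char is a single character that lies
# in the Greek and Coptic block AND is an assigned code point.
sub is_assigned_greek {
    my ($char) = @_;
    # "Block minus Unassigned", emulated with a negative lookahead.
    return $char =~ /\A(?!\p{Unassigned})\p{InGreekAndCoptic}\z/ ? 1 : 0;
}

printf "U+03B1: %d\n", is_assigned_greek("\x{03B1}");  # GREEK SMALL LETTER ALPHA
printf "U+0378: %d\n", is_assigned_greek("\x{0378}");  # reserved code point in the block
```

The lookahead runs at the same position as the property test that
follows it, so both conditions apply to one and the same character.
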
| 1039 | =head2 Unicode Encodings |
| 1040 | |
| 1041 | Unicode characters are assigned to I<code points>, which are abstract |
| 1042 | numbers. To use these numbers, various encodings are needed. |
| 1043 | |
| 1044 | =over 4 |
| 1045 | |
| 1046 | =item * |
| 1047 | |
| 1048 | UTF-8 |
| 1049 | |
| 1050 | UTF-8 is a variable-length (1 to 6 bytes, current character allocations |
| 1051 | require 4 bytes), byte-order independent encoding. For ASCII (and we |
| 1052 | really do mean 7-bit ASCII, not another 8-bit encoding), UTF-8 is |
| 1053 | transparent. |
| 1054 | |
| 1055 | The following table is from Unicode 3.2. |
| 1056 | |
| 1057 |   Code Points          1st Byte  2nd Byte  3rd Byte  4th Byte |
| 1058 | |
| 1059 |   U+0000..U+007F       00..7F |
| 1060 |   U+0080..U+07FF       C2..DF    80..BF |
| 1061 |   U+0800..U+0FFF       E0        A0..BF    80..BF |
| 1062 |   U+1000..U+CFFF       E1..EC    80..BF    80..BF |
| 1063 |   U+D000..U+D7FF       ED        80..9F    80..BF |
| 1064 |   U+D800..U+DFFF       ******* ill-formed ******* |
| 1065 |   U+E000..U+FFFF       EE..EF    80..BF    80..BF |
| 1066 |   U+10000..U+3FFFF     F0        90..BF    80..BF    80..BF |
| 1067 |   U+40000..U+FFFFF     F1..F3    80..BF    80..BF    80..BF |
| 1068 |   U+100000..U+10FFFF   F4        80..8F    80..BF    80..BF |
| 1069 | |
| 1070 | Note the C<A0..BF> in C<U+0800..U+0FFF>, the C<80..9F> in |
| 1071 | C<U+D000..U+D7FF>, the C<90..BF> in C<U+10000..U+3FFFF>, and the |
| 1072 | C<80..8F> in C<U+100000..U+10FFFF>. The "gaps" are caused by legal |
| 1073 | UTF-8 avoiding non-shortest encodings: it is technically possible to |
| 1074 | UTF-8-encode a single code point in different ways, but that is |
| 1075 | explicitly forbidden, and the shortest possible encoding should always |
| 1076 | be used. So that's what Perl does. |
| 1077 | |
| 1078 | Another way to look at it is via bits: |
| 1079 | |
| 1080 |   Code Points                  1st Byte  2nd Byte  3rd Byte  4th Byte |
| 1081 | |
| 1082 |                     0aaaaaaa   0aaaaaaa |
| 1083 |             00000bbbbbaaaaaa   110bbbbb  10aaaaaa |
| 1084 |             ccccbbbbbbaaaaaa   1110cccc  10bbbbbb  10aaaaaa |
| 1085 |   00000dddccccccbbbbbbaaaaaa   11110ddd  10cccccc  10bbbbbb  10aaaaaa |
| 1086 | |
| 1087 | As you can see, the continuation bytes all begin with C<10>, and the |
| 1088 | leading bits of the start byte tell how many bytes there are in the |
| 1089 | encoded character. |
| 1090 | |
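The bit layout above can be turned into code quite directly. The
following is only an illustrative sketch, not how Perl itself does the
encoding; the subroutine name C<utf8_bytes> is an assumption made for
this example, and real programs would normally use L<Encode> instead.

```perl
use strict;
use warnings;

# Hypothetical helper: return the UTF-8 bytes (as a list of integers)
# for one code point in U+0000..U+10FFFF, always in shortest form.
sub utf8_bytes {
    my ($cp) = @_;
    die "code point out of range\n" if $cp < 0 || $cp > 0x10FFFF;
    die "surrogates are ill-formed in UTF-8\n"
        if $cp >= 0xD800 && $cp <= 0xDFFF;

    return ($cp) if $cp < 0x80;                       # 0aaaaaaa
    return (0xC0 | ($cp >> 6),                        # 110bbbbb
            0x80 | ($cp & 0x3F)) if $cp < 0x800;      # 10aaaaaa
    return (0xE0 | ($cp >> 12),                       # 1110cccc
            0x80 | (($cp >> 6) & 0x3F),               # 10bbbbbb
            0x80 | ($cp & 0x3F)) if $cp < 0x10000;    # 10aaaaaa
    return (0xF0 | ($cp >> 18),                       # 11110ddd
            0x80 | (($cp >> 12) & 0x3F),              # 10cccccc
            0x80 | (($cp >> 6) & 0x3F),               # 10bbbbbb
            0x80 | ($cp & 0x3F));                     # 10aaaaaa
}

# U+20AC EURO SIGN needs three bytes.
print join(" ", map { sprintf "%02X", $_ } utf8_bytes(0x20AC)), "\n";
```

Because each branch is chosen by the size of the code point, the sketch
cannot produce a non-shortest encoding.
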
| 1091 | =item * |
| 1092 | |
| 1093 | UTF-EBCDIC |
| 1094 | |
| 1095 | Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe. |
| 1096 | |
| 1097 | =item * |
| 1098 | |
| 1099 | UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks) |
| 1100 | |
| 1101 | The following items are mostly for reference and general Unicode |
| 1102 | knowledge; Perl doesn't use these constructs internally. |
| 1103 | |
| 1104 | UTF-16 is a 2 or 4 byte encoding. The Unicode code points |
| 1105 | C<U+0000..U+FFFF> are stored in a single 16-bit unit, and the code |
| 1106 | points C<U+10000..U+10FFFF> in two 16-bit units. The latter case is |
| 1107 | using I<surrogates>, the first 16-bit unit being the I<high |
| 1108 | surrogate>, and the second being the I<low surrogate>. |
| 1109 | |
| 1110 | Surrogates are code points set aside to encode the C<U+10000..U+10FFFF> |
| 1111 | range of Unicode code points in pairs of 16-bit units. The I<high |
| 1112 | surrogates> are the range C<U+D800..U+DBFF>, and the I<low surrogates> |
| 1113 | are the range C<U+DC00..U+DFFF>. The surrogate encoding is |
| 1114 | |
| 1115 | $hi = int(($uni - 0x10000) / 0x400) + 0xD800; |
| 1116 | $lo = ($uni - 0x10000) % 0x400 + 0xDC00; |
| 1117 | |
| 1118 | and the decoding is |
| 1119 | |
| 1120 | $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00); |
| 1121 | |
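As a small worked sketch of the surrogate arithmetic above, here it is
written with shifts and masks, which avoids Perl's floating-point
division. The subroutine names C<to_surrogates> and C<from_surrogates>
are invented for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split a supplementary code point into a
# (high, low) surrogate pair.
sub to_surrogates {
    my ($uni) = @_;
    die "not a supplementary code point\n"
        if $uni < 0x10000 || $uni > 0x10FFFF;
    my $offset = $uni - 0x10000;          # 20 significant bits
    return (0xD800 + ($offset >> 10),     # top 10 bits
            0xDC00 + ($offset & 0x3FF));  # bottom 10 bits
}

# Hypothetical helper: combine a surrogate pair into a code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
}

my ($hi, $lo) = to_surrogates(0x1F600);
printf "U+1F600 -> %04X %04X -> U+%X\n", $hi, $lo, from_surrogates($hi, $lo);
```

Shifting by 10 bits is the same as dividing by C<0x400> and discarding
the remainder, and masking with C<0x3FF> is the same as C<% 0x400>.
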
| 1122 | If you try to generate surrogates (for example by using chr()), you |
| 1123 | will get a warning if warnings are turned on, because those code |
| 1124 | points are not valid for a Unicode character. |
| 1125 | |
| 1126 | Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16 |
| 1127 | itself can be used for in-memory computations, but if storage or |
| 1128 | transfer is required either UTF-16BE (big-endian) or UTF-16LE |
| 1129 | (little-endian) encodings must be chosen. |
| 1130 | |
| 1131 | This introduces another problem: what if you just know that your data |
| 1132 | is UTF-16, but you don't know which endianness? Byte Order Marks, or |
| 1133 | BOMs, are a solution to this. A special character has been reserved |
| 1134 | in Unicode to function as a byte order marker: the character with the |
| 1135 | code point C<U+FEFF> is the BOM. |
| 1136 | |
| 1137 | The trick is that if you read a BOM, you will know the byte order, |
| 1138 | since if it was written on a big-endian platform, you will read the |
| 1139 | bytes C<0xFE 0xFF>, but if it was written on a little-endian platform, |
| 1140 | you will read the bytes C<0xFF 0xFE>. (And if the originating platform |
| 1141 | was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.) |
| 1142 | |
| 1143 | The way this trick works is that the character with the code point |
| 1144 | C<U+FFFE> is guaranteed not to be a valid Unicode character, so the |
| 1145 | sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in |
| 1146 | little-endian format" and cannot be "C<U+FFFE>, represented in big-endian |
| 1147 | format". |
| 1148 | |
| 1149 | =item * |
| 1150 | |
| 1151 | UTF-32, UTF-32BE, UTF-32LE |
| 1152 | |
| 1153 | The UTF-32 family is pretty much like the UTF-16 family, except that |
| 1154 | the units are 32-bit, and therefore the surrogate scheme is not |
| 1155 | needed. The BOM signatures will be C<0x00 0x00 0xFE 0xFF> for BE and |
| 1156 | C<0xFF 0xFE 0x00 0x00> for LE. |
| 1157 | |
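The BOM signatures listed for the UTF-16 and UTF-32 families, plus the
UTF-8 one, could be checked along the following lines. This is a sketch
only; the subroutine name C<sniff_bom> and the encoding labels it
returns are assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: guess an encoding from the BOM at the start of
# a byte string. Returns undef (or an empty list) if no BOM is found.
sub sniff_bom {
    my ($bytes) = @_;
    # The UTF-32 checks come first because the UTF-32LE signature
    # begins with the same two bytes as the UTF-16LE one.
    return 'UTF-32BE' if $bytes =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-32LE' if $bytes =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return;
}

print sniff_bom("\xFF\xFE\x41\x00"), "\n";
```

Note that the bytes C<0xFF 0xFE 0x00 0x00> could in principle also be a
UTF-16LE BOM followed by C<U+0000>; the sketch simply prefers the
UTF-32LE reading.
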
| 1158 | =item * |
| 1159 | |
| 1160 | UCS-2, UCS-4 |
| 1161 | |
| 1162 | Encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit |
| 1163 | encoding. Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>, |
| 1164 | because it does not use surrogates. UCS-4 is a 32-bit encoding, |
| 1165 | functionally identical to UTF-32. |
| 1166 | |
| 1167 | =item * |
| 1168 | |
| 1169 | UTF-7 |
| 1170 | |
| 1171 | A seven-bit safe (non-eight-bit) encoding, which is useful if the |
| 1172 | transport or storage is not eight-bit safe. Defined by RFC 2152. |
| 1173 | |
| 1174 | =back |
| 1175 | |
| 1176 | =head2 Security Implications of Unicode |
| 1177 | |
| 1178 | =over 4 |
| 1179 | |
| 1180 | =item * |
| 1181 | |
| 1182 | Malformed UTF-8 |
| 1183 | |
| 1184 | Unfortunately, the specification of UTF-8 leaves some room for |
| 1185 | interpretation of how many bytes of encoded output one should generate |
| 1186 | from one input Unicode character. Strictly speaking, the shortest |
| 1187 | possible sequence of UTF-8 bytes should be generated, |
| 1188 | because otherwise there is potential for an input buffer overflow at |
| 1189 | the receiving end of a UTF-8 connection. Perl always generates the |
| 1190 | shortest length UTF-8, and with warnings on Perl will warn about |
| 1191 | non-shortest length UTF-8 along with other malformations, such as the |
| 1192 | surrogates, which are not real Unicode code points. |
| 1193 | |
| 1194 | =item * |
| 1195 | |
| 1196 | Regular expressions behave slightly differently between byte data and |
| 1197 | character (Unicode) data. For example, the "word character" character |
| 1198 | class C<\w> will work differently depending on whether the data is eight-bit |
| 1199 | bytes or Unicode. |
| 1200 | |
| 1201 | In the first case, the set of C<\w> characters is either small--the |
| 1202 | default set of alphabetic characters, digits, and the "_"--or, if you |
| 1203 | are using a locale (see L<perllocale>), the C<\w> might contain a few |
| 1204 | more letters according to your language and country. |
| 1205 | |
| 1206 | In the second case, the C<\w> set of characters is much, much larger. |
| 1207 | Most importantly, even in the set of the first 256 characters, it will |
| 1208 | probably match different characters: unlike most locales, which are |
| 1209 | specific to a language and country pair, Unicode classifies all the |
| 1210 | characters that are letters I<somewhere> as C<\w>. For example, your |
| 1211 | locale might not think that LATIN SMALL LETTER ETH is a letter (unless |
| 1212 | you happen to speak Icelandic), but Unicode does. |
| 1213 | |
| 1214 | As discussed elsewhere, Perl has one foot (two hooves?) planted in |
| 1215 | each of two worlds: the old world of bytes and the new world of |
| 1216 | characters, upgrading from bytes to characters when necessary. |
| 1217 | If your legacy code does not explicitly use Unicode, no automatic |
| 1218 | switch-over to characters should happen. Characters shouldn't get |
| 1219 | downgraded to bytes, either. It is possible to accidentally mix bytes |
| 1220 | and characters, however (see L<perluniintro>), in which case C<\w> in |
| 1221 | regular expressions might start behaving differently. Review your |
| 1222 | code. Use warnings and the C<strict> pragma. |
| 1223 | |
| 1224 | =back |
| 1225 | |
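For input that arrives from outside the program, one way to guard
against the malformations described under "Malformed UTF-8" above is to
decode strictly and treat any failure as an error. The following is a
minimal sketch using L<Encode>; the subroutine name C<decode_strict> is
invented for this example.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode octets as strict UTF-8. Returns the
# character string, or undef if the input is malformed.
sub decode_strict {
    my ($octets) = @_;
    my $copy = $octets;   # decode() with a CHECK value may modify its input
    return eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK) };
}

# "\xC0\xAF" is a non-shortest (overlong) encoding of "/".
print defined decode_strict("\xC0\xAF") ? "accepted\n" : "rejected\n";
```

The hyphenated name C<UTF-8> selects Encode's strict decoder, which
refuses non-shortest forms and surrogates rather than passing them
through.
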
| 1226 | =head2 Unicode in Perl on EBCDIC |
| 1227 | |
| 1228 | The way Unicode is handled on EBCDIC platforms is still |
| 1229 | experimental. On such platforms, references to UTF-8 encoding in this |
| 1230 | document and elsewhere should be read as meaning the UTF-EBCDIC |
| 1231 | specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues |
| 1232 | are specifically discussed. There is no C<utfebcdic> pragma or |
| 1233 | ":utfebcdic" layer; rather, "utf8" and ":utf8" are reused to mean |
| 1234 | the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic> |
| 1235 | for more discussion of the issues. |
| 1236 | |
| 1237 | =head2 Locales |
| 1238 | |
| 1239 | Usually locale settings and Unicode do not affect each other, but |
| 1240 | there are a couple of exceptions: |
| 1241 | |
| 1242 | =over 4 |
| 1243 | |
| 1244 | =item * |
| 1245 | |
| 1246 | You can enable automatic UTF-8-ification of your standard file |
| 1247 | handles, default C<open()> layer, and C<@ARGV> by using either |
| 1248 | the C<-C> command line switch or the C<PERL_UNICODE> environment |
| 1249 | variable; see L<perlrun> for the documentation of the C<-C> switch. |
| 1250 | |
| 1251 | =item * |
| 1252 | |
| 1253 | Perl tries really hard to work both with Unicode and the old |
| 1254 | byte-oriented world. Most often this is nice, but sometimes Perl's |
| 1255 | straddling of the proverbial fence causes problems. |
| 1256 | |
| 1257 | =back |
| 1258 | |
| 1259 | =head2 When Unicode Does Not Happen |
| 1260 | |
| 1261 | While Perl does have extensive ways to input and output in Unicode, |
| 1262 | and a few other 'entry points' like C<@ARGV> that can be interpreted |
| 1263 | as Unicode (UTF-8), there still are many places where Unicode (in some |
| 1264 | encoding or another) could be given as arguments or received as |
| 1265 | results, or both, but it is not. |
| 1266 | |
| 1267 | The following are such interfaces. For all of these interfaces Perl |
| 1268 | currently (as of 5.8.3) simply assumes byte strings both as arguments |
| 1269 | and results, or UTF-8 strings if the C<encoding> pragma has been used. |
| 1270 | |
| 1271 | One reason why Perl does not attempt to resolve the role of Unicode in |
| 1272 | these cases is that the answers are highly dependent on the operating |
| 1273 | system and the file system(s). For example, whether filenames can be |
| 1274 | in Unicode, and in exactly what kind of encoding, is not exactly a |
| 1275 | portable concept. Similarly for C<qx> and C<system>: how well will the |
| 1276 | 'command line interface' (and which of them?) handle Unicode? |
| 1277 | |
| 1278 | =over 4 |
| 1279 | |
| 1280 | =item * |
| 1281 | |
| 1282 | chdir, chmod, chown, chroot, exec, link, lstat, mkdir, |
| 1283 | rename, rmdir, stat, symlink, truncate, unlink, utime, -X |
| 1284 | |
| 1285 | =item * |
| 1286 | |
| 1287 | %ENV |
| 1288 | |
| 1289 | =item * |
| 1290 | |
| 1291 | glob (aka the <*>) |
| 1292 | |
| 1293 | =item * |
| 1294 | |
| 1295 | open, opendir, sysopen |
| 1296 | |
| 1297 | =item * |
| 1298 | |
| 1299 | qx (aka the backtick operator), system |
| 1300 | |
| 1301 | =item * |
| 1302 | |
| 1303 | readdir, readlink |
| 1304 | |
| 1305 | =back |
| 1306 | |
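Since these interfaces take byte strings, one possible approach is to
encode character strings explicitly before passing them in. The sketch
below assumes, purely for illustration, that the file system expects
UTF-8 file names; that is an assumption about the platform, not
something Perl can determine for you. The subroutine name
C<filename_bytes> is likewise invented.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: turn a character string into the bytes to hand
# to open(), mkdir(), and friends. The default encoding is an assumption.
sub filename_bytes {
    my ($name, $encoding) = @_;
    $encoding = 'UTF-8' unless defined $encoding;
    return Encode::encode($encoding, $name);
}

my $bytes = filename_bytes("\x{263A}.txt");
printf "%d bytes\n", length $bytes;
```

Names coming back from C<readdir> or C<glob> would need the reverse
treatment, that is, decoding from the same encoding.
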
| 1307 | =head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl) |
| 1308 | |
| 1309 | Sometimes (see L</"When Unicode Does Not Happen">) there are |
| 1310 | situations where you simply need to force a byte |
| 1311 | string into UTF-8, or vice versa. The low-level calls |
| 1312 | utf8::upgrade($bytestring) and utf8::downgrade($utf8string[, FAIL_OK]) are |
| 1313 | the answers. |
| 1314 | |
| 1315 | Note that utf8::downgrade() can fail if the string contains characters |
| 1316 | that don't fit into a byte. |
| 1317 | |
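A brief sketch of the two calls follows. The wrapper names
C<upgraded_copy> and C<downgraded_copy> are made up for this example;
they work on copies so that the caller's string is left alone.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy stored internally as UTF-8.
# The characters in the string do not change, only the representation.
sub upgraded_copy {
    my ($string) = @_;
    utf8::upgrade($string);
    return $string;
}

# Hypothetical helper: return a byte-encoded copy, or undef if some
# character does not fit into a byte. The true second argument
# (FAIL_OK) makes utf8::downgrade() return false instead of dying.
sub downgraded_copy {
    my ($string) = @_;
    return utf8::downgrade($string, 1) ? $string : undef;
}

my $up = upgraded_copy("caf\xE9");
printf "flag=%d length=%d\n", utf8::is_utf8($up) ? 1 : 0, length $up;
```

In both directions the string compares equal to the original; only the
internal representation differs.
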
| 1318 | =head2 Using Unicode in XS |
| 1319 | |
| 1320 | If you want to handle Perl Unicode in XS extensions, you may find the |
| 1321 | following C APIs useful. See also L<perlguts/"Unicode Support"> for an |
| 1322 | explanation about Unicode at the XS level, and L<perlapi> for the API |
| 1323 | details. |
| 1324 | |
| 1325 | =over 4 |
| 1326 | |
| 1327 | =item * |
| 1328 | |
| 1329 | C<DO_UTF8(sv)> returns true if the C<UTF8> flag is on and the bytes |
| 1330 | pragma is not in effect. C<SvUTF8(sv)> returns true if the C<UTF8> |
| 1331 | flag is on; the bytes pragma is ignored. The C<UTF8> flag being on |
| 1332 | does B<not> mean that there are any characters of code points greater |
| 1333 | than 255 (or 127) in the scalar or that there are even any characters |
| 1334 | in the scalar. What the C<UTF8> flag means is that the sequence of |
| 1335 | octets in the representation of the scalar is the sequence of UTF-8 |
| 1336 | encoded code points of the characters of a string. The C<UTF8> flag |
| 1337 | being off means that each octet in this representation encodes a |
| 1338 | single character with code point 0..255 within the string. Perl's |
| 1339 | Unicode model is not to use UTF-8 until it is absolutely necessary. |
| 1340 | |
| 1341 | =item * |
| 1342 | |
| 1343 | C<uvchr_to_utf8(buf, chr)> writes a Unicode character code point into |
| 1344 | a buffer encoding the code point as UTF-8, and returns a pointer |
| 1345 | pointing after the UTF-8 bytes. It works appropriately on EBCDIC machines. |
| 1346 | |
| 1347 | =item * |
| 1348 | |
| 1349 | C<utf8_to_uvchr(buf, lenp)> reads UTF-8 encoded bytes from a buffer and |
| 1350 | returns the Unicode character code point and, optionally, the length of |
| 1351 | the UTF-8 byte sequence. It works appropriately on EBCDIC machines. |
| 1352 | |
| 1353 | =item * |
| 1354 | |
| 1355 | C<utf8_length(start, end)> returns the length of the UTF-8 encoded buffer |
| 1356 | in characters. C<sv_len_utf8(sv)> returns the length of the UTF-8 encoded |
| 1357 | scalar. |
| 1358 | |
| 1359 | =item * |
| 1360 | |
| 1361 | C<sv_utf8_upgrade(sv)> converts the string of the scalar to its UTF-8 |
| 1362 | encoded form. C<sv_utf8_downgrade(sv)> does the opposite, if |
| 1363 | possible. C<sv_utf8_encode(sv)> is like sv_utf8_upgrade except that |
| 1364 | it does not set the C<UTF8> flag. C<sv_utf8_decode()> does the |
| 1365 | opposite of C<sv_utf8_encode()>. Note that none of these are to be |
| 1366 | used as general-purpose encoding or decoding interfaces: C<use Encode> |
| 1367 | for that. C<sv_utf8_upgrade()> is affected by the encoding pragma |
| 1368 | but C<sv_utf8_downgrade()> is not (since the encoding pragma is |
| 1369 | designed to be a one-way street). |
| 1370 | |
| 1371 | =item * |
| 1372 | |
| 1373 | C<is_utf8_char(s)> returns true if the pointer points to a valid UTF-8 |
| 1374 | character. |
| 1375 | |
| 1376 | =item * |
| 1377 | |
| 1378 | C<is_utf8_string(buf, len)> returns true if C<len> bytes of the buffer |
| 1379 | are valid UTF-8. |
| 1380 | |
| 1381 | =item * |
| 1382 | |
| 1383 | C<UTF8SKIP(buf)> will return the number of bytes in the UTF-8 encoded |
| 1384 | character in the buffer. C<UNISKIP(chr)> will return the number of bytes |
| 1385 | required to UTF-8-encode the Unicode character code point. C<UTF8SKIP()> |
| 1386 | is useful, for example, for iterating over the characters of a UTF-8 |
| 1387 | encoded buffer; C<UNISKIP()> is useful, for example, in computing |
| 1388 | the size required for a UTF-8 encoded buffer. |
| 1389 | |
| 1390 | =item * |
| 1391 | |
| 1392 | C<utf8_distance(a, b)> will tell the distance in characters between the |
| 1393 | two pointers pointing to the same UTF-8 encoded buffer. |
| 1394 | |
| 1395 | =item * |
| 1396 | |
| 1397 | C<utf8_hop(s, off)> will return a pointer to a UTF-8 encoded buffer |
| 1398 | that is C<off> (positive or negative) Unicode characters displaced |
| 1399 | from the UTF-8 buffer C<s>. Be careful not to overstep the buffer: |
| 1400 | C<utf8_hop()> will merrily run off the end or the beginning of the |
| 1401 | buffer if told to do so. |
| 1402 | |
| 1403 | =item * |
| 1404 | |
| 1405 | C<pv_uni_display(dsv, spv, len, pvlim, flags)> and |
| 1406 | C<sv_uni_display(dsv, ssv, pvlim, flags)> are useful for debugging the |
| 1407 | output of Unicode strings and scalars. By default they are useful |
| 1408 | only for debugging--they display B<all> characters as hexadecimal code |
| 1409 | points--but with the flags C<UNI_DISPLAY_ISPRINT>, |
| 1410 | C<UNI_DISPLAY_BACKSLASH>, and C<UNI_DISPLAY_QQ> you can make the |
| 1411 | output more readable. |
| 1412 | |
| 1413 | =item * |
| 1414 | |
| 1415 | C<ibcmp_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to |
| 1416 | compare two strings case-insensitively in Unicode. For case-sensitive |
| 1417 | comparisons you can just use C<memEQ()> and C<memNE()> as usual. |
| 1418 | |
| 1419 | =back |
| 1420 | |
| 1421 | For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h> |
| 1422 | in the Perl source code distribution. |
| 1423 | |
| 1424 | =head1 BUGS |
| 1425 | |
| 1426 | =head2 Interaction with Locales |
| 1427 | |
| 1428 | Use of locales with Unicode data may lead to odd results. Currently, |
| 1429 | Perl attempts to attach 8-bit locale info to characters in the range |
| 1430 | 0..255, but this technique is demonstrably incorrect for locales that |
| 1431 | use characters above that range when mapped into Unicode. Perl's |
| 1432 | Unicode support will also tend to run slower. Use of locales with |
| 1433 | Unicode is discouraged. |
| 1434 | |
| 1435 | =head2 Problems with characters whose ordinal numbers are in the range 128 - 255 with no Locale specified |
| 1436 | |
| 1437 | Without a locale specified, unlike all other characters or code points, |
| 1438 | these characters have very different semantics in byte semantics versus |
| 1439 | character semantics. |
| 1440 | In character semantics they are interpreted as Unicode code points, which means |
| 1441 | they are viewed as Latin-1 (ISO-8859-1). |
| 1442 | In byte semantics, they are considered to be unassigned characters, |
| 1443 | meaning that the only semantics they have is their |
| 1444 | ordinal numbers, and that they are not members of various character classes. |
| 1445 | None are considered to match C<\w> for example, but all match C<\W>. |
| 1446 | Besides these class matches, |
| 1447 | the known operations that this affects are those that change the case, |
| 1448 | regular expression matching while ignoring case, |
| 1449 | and B<quotemeta()>. |
| 1450 | This can lead to unexpected results in which a string's semantics suddenly |
| 1451 | change if a code point above 255 is appended to or removed from it, |
| 1452 | which changes the string's semantics from byte to character or vice versa. |
| 1453 | This behavior is scheduled to change in version 5.12, but in the meantime, |
| 1454 | a workaround is to always call utf8::upgrade($string), or to use the |
| 1455 | standard modules L<Encode> or L<charnames>. |
| 1456 | |
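The difference can be seen with a single character. The following
sketch assumes an ASCII-based platform and that no locale and no pragma
that changes string semantics is in effect; the subroutine name
C<is_word_char> is invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: does the one-character string match \w?
sub is_word_char {
    my ($char) = @_;
    return $char =~ /\A\w\z/ ? 1 : 0;
}

my $eth = "\xF0";                 # LATIN SMALL LETTER ETH as a byte string
my $before = is_word_char($eth);  # byte semantics

utf8::upgrade($eth);              # same character, now stored as UTF-8
my $after = is_word_char($eth);   # character semantics

print "before upgrade: $before, after upgrade: $after\n";
```

The string holds the same single character both times; only its
internal representation, and with it the choice of semantics, has
changed.
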
| 1457 | =head2 Interaction with Extensions |
| 1458 | |
| 1459 | When Perl exchanges data with an extension, the extension should be |
| 1460 | able to understand the UTF8 flag and act accordingly. If the |
| 1461 | extension doesn't know about the flag, it's likely that the extension |
| 1462 | will return incorrectly-flagged data. |
| 1463 | |
| 1464 | So if you're working with Unicode data, consult the documentation of |
| 1465 | every module you're using to see whether it has any issues with Unicode |
| 1466 | data exchange. If the documentation does not talk about Unicode at all, |
| 1467 | suspect the worst and probably look at the source to learn how the |
| 1468 | module is implemented. Modules written completely in Perl shouldn't |
| 1469 | cause problems. Modules that directly or indirectly access code written |
| 1470 | in other programming languages are at risk. |
| 1471 | |
| 1472 | For affected functions, the simple strategy to avoid data corruption is |
| 1473 | to always make the encoding of the exchanged data explicit. Choose an |
| 1474 | encoding that you know the extension can handle. Convert arguments passed |
| 1475 | to the extensions to that encoding and convert results back from that |
| 1476 | encoding. Write wrapper functions that do the conversions for you, so |
| 1477 | you can later change the functions when the extension catches up. |
| 1478 | |
| 1479 | To provide an example, let's say the popular Foo::Bar::escape_html |
| 1480 | function doesn't deal with Unicode data yet. The wrapper function |
| 1481 | would convert the argument to raw UTF-8 and convert the result back to |
| 1482 | Perl's internal representation like so: |
| 1483 | |
| 1484 | sub my_escape_html ($) { |
| 1485 | my($what) = shift; |
| 1486 | return unless defined $what; |
| 1487 | Encode::decode_utf8(Foo::Bar::escape_html(Encode::encode_utf8($what))); |
| 1488 | } |
| 1489 | |
| 1490 | Sometimes, when the extension does not convert data but just stores |
| 1491 | and retrieves them, you will be in a position to use the otherwise |
| 1492 | dangerous Encode::_utf8_on() function. Let's say the popular |
| 1493 | C<Foo::Bar> extension, written in C, provides a C<param> method that |
| 1494 | lets you store and retrieve data according to these prototypes: |
| 1495 | |
| 1496 | $self->param($name, $value); # set a scalar |
| 1497 | $value = $self->param($name); # retrieve a scalar |
| 1498 | |
| 1499 | If it does not yet provide support for any encoding, one could write a |
| 1500 | derived class with such a C<param> method: |
| 1501 | |
| 1502 | sub param { |
| 1503 | my($self,$name,$value) = @_; |
| 1504 | utf8::upgrade($name); # make sure it is UTF-8 encoded |
| 1505 | if (defined $value) { |
| 1506 | utf8::upgrade($value); # make sure it is UTF-8 encoded |
| 1507 | return $self->SUPER::param($name,$value); |
| 1508 | } else { |
| 1509 | my $ret = $self->SUPER::param($name); |
| 1510 | Encode::_utf8_on($ret); # we know, it is UTF-8 encoded |
| 1511 | return $ret; |
| 1512 | } |
| 1513 | } |
| 1514 | |
| 1515 | Some extensions provide filters on data entry/exit points, such as |
| 1516 | DB_File::filter_store_key and family. Look out for such filters in |
| 1517 | the documentation of your extensions; they can make the transition to |
| 1518 | Unicode data much easier. |
| 1519 | |
| 1520 | =head2 Speed |
| 1521 | |
| 1522 | Some functions are slower when working on UTF-8 encoded strings than |
| 1523 | on byte encoded strings. All functions that need to hop over |
| 1524 | characters such as length(), substr() or index(), or matching regular |
| 1525 | expressions can work B<much> faster when the underlying data are |
| 1526 | byte-encoded. |
| 1527 | |
| 1528 | In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1 |
| 1529 | a caching scheme was introduced which will hopefully make the slowness |
| 1530 | somewhat less spectacular, at least for some operations. In general, |
| 1531 | operations with UTF-8 encoded strings are still slower. As an example, |
| 1532 | the Unicode properties (character classes) like C<\p{Nd}> are known to |
| 1533 | be quite a bit slower (5-20 times) than their simpler counterparts |
| 1534 | like C<\d> (then again, there are 268 Unicode characters matching C<Nd> |
| 1535 | compared with the 10 ASCII characters matching C<\d>). |
| 1536 | |
| 1537 | =head2 Possible problems on EBCDIC platforms |
| 1538 | |
| 1539 | In earlier versions, when byte and character data were concatenated, |
| 1540 | the new string was sometimes created by |
| 1541 | decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the |
| 1542 | old Unicode string used EBCDIC. |
| 1543 | |
| 1544 | If you find any remaining cases of this behavior, please report them as bugs. |
| 1545 | |
| 1546 | =head2 Porting code from perl-5.6.X |
| 1547 | |
| 1548 | Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer |
| 1549 | was required to use the C<utf8> pragma to declare that a given scope |
| 1550 | expected to deal with Unicode data and had to make sure that only |
| 1551 | Unicode data were reaching that scope. If you have code that is |
| 1552 | working with 5.6, you will need some of the following adjustments to |
| 1553 | your code. The examples are written such that the code will continue |
| 1554 | to work under 5.6, so you should be safe to try them out. |
| 1555 | |
| 1556 | =over 4 |
| 1557 | |
| 1558 | =item * |
| 1559 | |
| 1560 | A filehandle that should read or write UTF-8 |
| 1561 | |
| 1562 | if ($] > 5.007) { |
| 1563 | binmode $fh, ":encoding(utf8)"; |
| 1564 | } |
| 1565 | |
| 1566 | =item * |
| 1567 | |
| 1568 | A scalar that is going to be passed to some extension |
| 1569 | |
| 1570 | Be it Compress::Zlib, Apache::Request or any extension that has no |
| 1571 | mention of Unicode in the manpage, you need to make sure that the |
| 1572 | UTF8 flag is stripped off. Note that at the time of this writing |
| 1573 | (October 2002) the mentioned modules are not UTF-8-aware. Please |
| 1574 | check the documentation to verify if this is still true. |
| 1575 | |
| 1576 | if ($] > 5.007) { |
| 1577 | require Encode; |
| 1578 | $val = Encode::encode_utf8($val); # make octets |
| 1579 | } |
| 1580 | |
| 1581 | =item * |
| 1582 | |
| 1583 | A scalar we got back from an extension |
| 1584 | |
| 1585 | If you believe the scalar comes back as UTF-8, you will most likely |
| 1586 | want the UTF8 flag restored: |
| 1587 | |
| 1588 | if ($] > 5.007) { |
| 1589 | require Encode; |
| 1590 | $val = Encode::decode_utf8($val); |
| 1591 | } |
| 1592 | |
| 1593 | =item * |
| 1594 | |
| 1595 | Same thing, if you are really sure it is UTF-8 |
| 1596 | |
| 1597 | if ($] > 5.007) { |
| 1598 | require Encode; |
| 1599 | Encode::_utf8_on($val); |
| 1600 | } |
| 1601 | |
| 1602 | =item * |
| 1603 | |
| 1604 | A wrapper for fetchrow_array and fetchrow_hashref |
| 1605 | |
| 1606 | When the database contains only UTF-8, a wrapper function or method is |
| 1607 | a convenient way to replace all your fetchrow_array and |
| 1608 | fetchrow_hashref calls. A wrapper function will also make it easier to |
| 1609 | adapt to future enhancements in your database driver. Note that at the |
| 1610 | time of this writing (October 2002), the DBI has no standardized way |
| 1611 | to deal with UTF-8 data. Please check the documentation to verify if |
| 1612 | that is still true. |
| 1613 | |
| 1614 | sub fetchrow { |
| 1615 | my($self, $sth, $what) = @_; # $what is one of fetchrow_{array,hashref} |
| 1616 | if ($] < 5.007) { |
| 1617 | return $sth->$what; |
| 1618 | } else { |
| 1619 | require Encode; |
| 1620 | if (wantarray) { |
| 1621 | my @arr = $sth->$what; |
| 1622 | for (@arr) { |
| 1623 | defined && /[^\000-\177]/ && Encode::_utf8_on($_); |
| 1624 | } |
| 1625 | return @arr; |
| 1626 | } else { |
| 1627 | my $ret = $sth->$what; |
| 1628 | if (ref $ret) { |
| 1629 | for my $k (keys %$ret) { |
| 1630 | defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret->{$k}; |
| 1631 | } |
| 1632 | return $ret; |
| 1633 | } else { |
| 1634 | defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret; |
| 1635 | return $ret; |
| 1636 | } |
| 1637 | } |
| 1638 | } |
| 1639 | } |
| 1640 | |
| 1641 | |
| 1642 | =item * |
| 1643 | |
| 1644 | A large scalar that you know can only contain ASCII |
| 1645 | |
| 1646 | Scalars that contain only ASCII and are marked as UTF-8 are sometimes |
| 1647 | a drag on your program. If you recognize such a situation, just remove |
| 1648 | the UTF8 flag: |
| 1649 | |
| 1650 | utf8::downgrade($val) if $] > 5.007; |
| 1651 | |
| 1652 | =back |
| 1653 | |
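The conversions in the list above can be checked in isolation. The following is a minimal sketch, assuming Perl 5.8.1 or later on an ASCII platform (so the C<$]> guards are left out) and using made-up sample strings. It turns a character string into octets for a byte-oriented extension, decodes the octets again, and removes the UTF8 flag from an ASCII-only scalar.

```perl
use strict;
use warnings;
use Encode ();

my $text   = "na\x{ef}ve \x{2603}";        # character string, 7 characters
my $octets = Encode::encode_utf8($text);   # octets, UTF8 flag off
my $back   = Encode::decode_utf8($octets); # characters again

printf "%d characters, %d octets\n", length($text), length($octets);

my $ascii = "only ASCII here";
utf8::upgrade($ascii);                     # UTF8 flag now on
my $was_flagged = utf8::is_utf8($ascii);
utf8::downgrade($ascii);                   # UTF8 flag off, content unchanged
printf "flag before: %s, after: %s\n",
    $was_flagged           ? "on" : "off",
    utf8::is_utf8($ascii)  ? "on" : "off";
```
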
| 1654 | =head1 SEE ALSO |
| 1655 | |
| 1656 | L<perlunitut>, L<perluniintro>, L<Encode>, L<open>, L<utf8>, L<bytes>, |
| 1657 | L<perlretut>, L<perlvar/"${^UNICODE}"> |
| 1658 | |
| 1659 | =cut |