Commit | Line | Data |
---|---|---|
393fec97 GS |
1 | =head1 NAME |
2 | ||
3 | perlunicode - Unicode support in Perl | |
4 | ||
5 | =head1 DESCRIPTION | |
6 | ||
0a1f2d14 | 7 | =head2 Important Caveats |
21bad921 | 8 | |
376d9008 | 9 | Unicode support is an extensive requirement. While Perl does not |
c349b1b9 JH |
10 | implement the Unicode standard or the accompanying technical reports |
11 | from cover to cover, Perl does support many Unicode features. | |
21bad921 | 12 | |
13a2d996 | 13 | =over 4 |
21bad921 | 14 | |
fae2c0fb | 15 | =item Input and Output Layers |
21bad921 | 16 | |
376d9008 | 17 | Perl knows when a filehandle uses Perl's internal Unicode encodings |
1bfb14c4 JH |
18 | (UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with |
19 | the ":utf8" layer. Other encodings can be converted to Perl's | |
20 | encoding on input or from Perl's encoding on output by use of the | |
21 | ":encoding(...)" layer. See L<open>. | |
c349b1b9 | 22 | |
376d9008 | 23 | To indicate that Perl source itself is using a particular encoding, |
c349b1b9 | 24 | see L<encoding>. |
21bad921 GS |
25 | |
26 | =item Regular Expressions | |
27 | ||
c349b1b9 | 28 | The regular expression compiler produces polymorphic opcodes. That is, |
376d9008 JB |
29 | the pattern adapts to the data and automatically switches to the Unicode |
30 | character scheme when presented with Unicode data--or instead uses | |
31 | a traditional byte scheme when presented with byte data. | |
21bad921 | 32 | |
ad0029c4 | 33 | =item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts |
21bad921 | 34 | |
376d9008 JB |
35 | As a compatibility measure, the C<use utf8> pragma must be explicitly |
36 | included to enable recognition of UTF-8 in the Perl scripts themselves | |
1bfb14c4 JH |
37 | (in string or regular expression literals, or in identifier names) on |
38 | ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based | |
376d9008 | 39 | machines. B<These are the only times when an explicit C<use utf8> |
8f8cf39c | 40 | is needed.> See L<utf8>. |
21bad921 | 41 | |
1768d7eb | 42 | You can also use the C<encoding> pragma to change the default encoding |
6ec9efec | 43 | of the data in your script; see L<encoding>. |
1768d7eb | 44 | |
21bad921 GS |
45 | =back |
46 | ||
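As a minimal sketch of the ":encoding(...)" layer described above, the following reads UTF-8 bytes through a decoding layer. The choice of C<UTF-8> as the encoding, the in-memory filehandle (standing in for a real file), and the variable names are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Five raw bytes: "caf" plus the two-byte UTF-8 encoding (C3 A9) of U+00E9.
my $bytes = "caf\xC3\xA9";

# An in-memory filehandle stands in for a real file here; the
# ":encoding(UTF-8)" layer decodes the bytes into characters on input.
open(my $in, '<:encoding(UTF-8)', \$bytes) or die "cannot open: $!";
my $text = <$in>;
close($in);

# Prints "5 bytes in, 4 characters out".
printf "%d bytes in, %d characters out\n", length($bytes), length($text);
```

The same layer string can be given when opening a named file; only the third argument changes.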
376d9008 | 47 | =head2 Byte and Character Semantics |
393fec97 | 48 | |
376d9008 | 49 | Beginning with version 5.6, Perl uses logically-wide characters to |
3e4dbfed | 50 | represent strings internally. |
393fec97 | 51 | |
376d9008 JB |
52 | In future, Perl-level operations will be expected to work with |
53 | characters rather than bytes. | |
393fec97 | 54 | |
376d9008 | 55 | However, as an interim compatibility measure, Perl aims to |
75daf61c JH |
56 | provide a safe migration path from byte semantics to character |
57 | semantics for programs. For operations where Perl can unambiguously | |
376d9008 | 58 | decide that the input data are characters, Perl switches to |
75daf61c JH |
59 | character semantics. For operations where this determination cannot |
60 | be made without additional information from the user, Perl decides in | |
376d9008 | 61 | favor of compatibility and chooses to use byte semantics. |
8cbd9a7a GS |
62 | |
63 | This behavior preserves compatibility with earlier versions of Perl, | |
376d9008 JB |
64 | which allowed byte semantics in Perl operations only if |
65 | none of the program's inputs were marked as being a source of Unicode | |
8cbd9a7a GS |
66 | character data. Such data may come from filehandles, from calls to |
67 | external programs, from information provided by the system (such as %ENV), | |
21bad921 | 68 | or from literals and constants in the source text. |
8cbd9a7a | 69 | |
376d9008 JB |
70 | The C<bytes> pragma will always, regardless of platform, force byte |
71 | semantics in a particular lexical scope. See L<bytes>. | |
8cbd9a7a GS |
72 | |
73 | The C<utf8> pragma is primarily a compatibility device that enables | |
75daf61c | 74 | recognition of UTF-(8|EBCDIC) in literals encountered by the parser. |
376d9008 JB |
75 | Note that this pragma is only required while Perl defaults to byte |
76 | semantics; when character semantics become the default, this pragma | |
77 | may become a no-op. See L<utf8>. | |
78 | ||
79 | Unless explicitly stated, Perl operators use character semantics | |
80 | for Unicode data and byte semantics for non-Unicode data. | |
81 | The decision to use character semantics is made transparently. If | |
82 | input data comes from a Unicode source--for example, if a character | |
fae2c0fb | 83 | encoding layer is added to a filehandle or a literal Unicode |
376d9008 JB |
84 | string constant appears in a program--character semantics apply. |
85 | Otherwise, byte semantics are in effect. The C<bytes> pragma should | |
86 | be used to force byte semantics on Unicode data. | |
87 | ||
88 | If strings operating under byte semantics and strings with Unicode | |
89 | character data are concatenated, the new string will be upgraded to | |
90 | I<ISO 8859-1 (Latin-1)>, even if the old Unicode string used EBCDIC. | |
91 | This translation is done without regard to the system's native 8-bit | |
92 | encoding, so to change this for systems with non-Latin-1 and | |
93 | non-EBCDIC native encodings use the C<encoding> pragma. See | |
94 | L<encoding>. | |
7dedd01f | 95 | |
feda178f | 96 | Under character semantics, many operations that formerly operated on |
376d9008 | 97 | bytes now operate on characters. A character in Perl is |
feda178f | 98 | logically just a number ranging from 0 to 2**31 or so. Larger |
376d9008 JB |
99 | characters may encode into longer sequences of bytes internally, but |
100 | this internal detail is mostly hidden for Perl code. | |
101 | See L<perluniintro> for more. | |
393fec97 | 102 | |
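The difference between the two semantics can be sketched with C<length()>. This example assumes an ASCII-based platform, where Perl's internal encoding is UTF-8; the byte count would differ under UTF-EBCDIC. The variable names are illustrative.

```perl
use strict;
use warnings;

my $smiley = "\x{263A}";       # one character with an ordinal above 255

my $chars = length($smiley);   # character semantics: counts characters

my $octets;
{
    use bytes;                 # byte semantics, limited to this lexical scope
    $octets = length($smiley); # counts the bytes of the internal encoding
}

# On an ASCII-based platform this prints "1 character, 3 bytes internally".
print "$chars character, $octets bytes internally\n";
```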
376d9008 | 103 | =head2 Effects of Character Semantics |
393fec97 GS |
104 | |
105 | Character semantics have the following effects: | |
106 | ||
107 | =over 4 | |
108 | ||
109 | =item * | |
110 | ||
376d9008 | 111 | Strings--including hash keys--and regular expression patterns may |
574c8022 | 112 | contain characters that have an ordinal value larger than 255. |
393fec97 | 113 | |
feda178f JH |
114 | If you use a Unicode editor to edit your program, Unicode characters |
115 | may occur directly within the literal strings in one of the various | |
376d9008 JB |
116 | Unicode encodings (UTF-8, UTF-EBCDIC, UCS-2, etc.), but will be recognized |
117 | as such and converted to Perl's internal representation only if the | |
feda178f | 118 | appropriate L<encoding> is specified. |
3e4dbfed | 119 | |
1bfb14c4 JH |
120 | Unicode characters can also be added to a string by using the |
121 | C<\x{...}> notation. The Unicode code for the desired character, in | |
376d9008 JB |
122 | hexadecimal, should be placed in the braces. For instance, a smiley |
123 | face is C<\x{263A}>. This encoding scheme only works for characters | |
124 | with a code of 0x100 or above. | |
3e4dbfed JF |
125 | |
126 | Additionally, if you | |
574c8022 | 127 | |
3e4dbfed | 128 | use charnames ':full'; |
574c8022 | 129 | |
1bfb14c4 JH |
130 | you can use the C<\N{...}> notation and put the official Unicode |
131 | character name within the braces, such as C<\N{WHITE SMILING FACE}>. | |
376d9008 | 132 | |
393fec97 GS |
133 | |
134 | =item * | |
135 | ||
574c8022 JH |
136 | If an appropriate L<encoding> is specified, identifiers within the |
137 | Perl script may contain Unicode alphanumeric characters, including | |
376d9008 JB |
138 | ideographs. Perl does not currently attempt to canonicalize variable |
139 | names. | |
393fec97 | 140 | |
393fec97 GS |
141 | =item * |
142 | ||
1bfb14c4 JH |
143 | Regular expressions match characters instead of bytes. "." matches |
144 | a character instead of a byte. The C<\C> pattern is provided to force | |
145 | a match of a single byte--a C<char> in C, hence C<\C>. | |
393fec97 | 146 | |
393fec97 GS |
147 | =item * |
148 | ||
149 | Character classes in regular expressions match characters instead of | |
376d9008 | 150 | bytes and match against the character properties specified in the |
1bfb14c4 | 151 | Unicode properties database. C<\w> can be used to match a Japanese |
75daf61c | 152 | ideograph, for instance. |
393fec97 | 153 | |
393fec97 GS |
154 | =item * |
155 | ||
eb0cc9e3 | 156 | Named Unicode properties, scripts, and block ranges may be used like |
376d9008 JB |
157 | character classes via the C<\p{}> "matches property" construct and |
158 | the C<\P{}> negation, "doesn't match property". | |
1bfb14c4 JH |
159 | |
160 | For instance, C<\p{Lu}> matches any character with the Unicode "Lu" | |
161 | (Letter, uppercase) property, while C<\p{M}> matches any character | |
162 | with an "M" (mark--accents and such) property. Brackets are not | |
163 | required for single letter properties, so C<\p{M}> is equivalent to | |
164 | C<\pM>. Many predefined properties are available, such as | |
165 | C<\p{Mirrored}> and C<\p{Tibetan}>. | |
4193bef7 | 166 | |
cfc01aea | 167 | The official Unicode script and block names have spaces and dashes as |
376d9008 | 168 | separators, but for convenience you can use dashes, spaces, or |
1bfb14c4 JH |
169 | underbars, and case is unimportant. It is recommended, however, that |
170 | for consistency you use the following naming: the official Unicode | |
171 | script, property, or block name (see below for the additional rules | |
172 | that apply to block names) with whitespace and dashes removed, and the | |
173 | words "uppercase-first-lowercase-rest". C<Latin-1 Supplement> thus | |
174 | becomes C<Latin1Supplement>. | |
4193bef7 | 175 | |
376d9008 JB |
176 | You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret |
177 | (^) between the first brace and the property name: C<\p{^Tamil}> is | |
eb0cc9e3 | 178 | equal to C<\P{Tamil}>. |
4193bef7 | 179 | |
eb0cc9e3 | 180 | Here are the basic Unicode General Category properties, followed by their |
68cd2d32 | 181 | long form. You can use either; C<\p{Lu}> and C<\p{UppercaseLetter}>, |
376d9008 | 182 | for instance, are identical. |
393fec97 | 183 | |
d73e5302 JH |
184 | Short Long |
185 | ||
186 | L Letter | |
eb0cc9e3 JH |
187 | Lu UppercaseLetter |
188 | Ll LowercaseLetter | |
189 | Lt TitlecaseLetter | |
190 | Lm ModifierLetter | |
191 | Lo OtherLetter | |
d73e5302 JH |
192 | |
193 | M Mark | |
eb0cc9e3 JH |
194 | Mn NonspacingMark |
195 | Mc SpacingMark | |
196 | Me EnclosingMark | |
d73e5302 JH |
197 | |
198 | N Number | |
eb0cc9e3 JH |
199 | Nd DecimalNumber |
200 | Nl LetterNumber | |
201 | No OtherNumber | |
d73e5302 JH |
202 | |
203 | P Punctuation | |
eb0cc9e3 JH |
204 | Pc ConnectorPunctuation |
205 | Pd DashPunctuation | |
206 | Ps OpenPunctuation | |
207 | Pe ClosePunctuation | |
208 | Pi InitialPunctuation | |
d73e5302 | 209 | (may behave like Ps or Pe depending on usage) |
eb0cc9e3 | 210 | Pf FinalPunctuation |
d73e5302 | 211 | (may behave like Ps or Pe depending on usage) |
eb0cc9e3 | 212 | Po OtherPunctuation |
d73e5302 JH |
213 | |
214 | S Symbol | |
eb0cc9e3 JH |
215 | Sm MathSymbol |
216 | Sc CurrencySymbol | |
217 | Sk ModifierSymbol | |
218 | So OtherSymbol | |
d73e5302 JH |
219 | |
220 | Z Separator | |
eb0cc9e3 JH |
221 | Zs SpaceSeparator |
222 | Zl LineSeparator | |
223 | Zp ParagraphSeparator | |
d73e5302 JH |
224 | |
225 | C Other | |
e150c829 JH |
226 | Cc Control |
227 | Cf Format | |
eb0cc9e3 JH |
228 | Cs Surrogate (not usable) |
229 | Co PrivateUse | |
e150c829 | 230 | Cn Unassigned |
1ac13f9a | 231 | |
376d9008 | 232 | Single-letter properties match all characters in any of the |
3e4dbfed | 233 | two-letter sub-properties starting with the same letter. |
376d9008 | 234 | C<L&> is a special case, which is an alias for C<Ll>, C<Lu>, and C<Lt>. |
32293815 | 235 | |
eb0cc9e3 | 236 | Because Perl hides the need for the user to understand the internal |
1bfb14c4 JH |
237 | representation of Unicode characters, there is no need to implement |
238 | the somewhat messy concept of surrogates. C<Cs> is therefore not | |
eb0cc9e3 | 239 | supported. |
d73e5302 | 240 | |
376d9008 JB |
241 | Because scripts differ in their directionality--Hebrew is |
242 | written right to left, for example--Unicode supplies these properties: | |
32293815 | 243 | |
eb0cc9e3 | 244 | Property Meaning |
92e830a9 | 245 | |
d73e5302 JH |
246 | BidiL Left-to-Right |
247 | BidiLRE Left-to-Right Embedding | |
248 | BidiLRO Left-to-Right Override | |
249 | BidiR Right-to-Left | |
250 | BidiAL Right-to-Left Arabic | |
251 | BidiRLE Right-to-Left Embedding | |
252 | BidiRLO Right-to-Left Override | |
253 | BidiPDF Pop Directional Format | |
254 | BidiEN European Number | |
255 | BidiES European Number Separator | |
256 | BidiET European Number Terminator | |
257 | BidiAN Arabic Number | |
258 | BidiCS Common Number Separator | |
259 | BidiNSM Non-Spacing Mark | |
260 | BidiBN Boundary Neutral | |
261 | BidiB Paragraph Separator | |
262 | BidiS Segment Separator | |
263 | BidiWS Whitespace | |
264 | BidiON Other Neutrals | |
32293815 | 265 | |
376d9008 | 266 | For example, C<\p{BidiR}> matches characters that are normally |
eb0cc9e3 JH |
267 | written right to left. |
268 | ||
210b36aa AMS |
269 | =back |
270 | ||
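The notations from the list above can be tried out with a short script. This is only a sketch: the sample characters (U+263A, U+0410, U+0301) and the variable names are chosen here for illustration and do not come from the text.

```perl
use strict;
use warnings;
use charnames ':full';

# The same character written by code point and by official name.
my $by_code = "\x{263A}";
my $by_name = "\N{WHITE SMILING FACE}";
my $same    = $by_code eq $by_name ? 1 : 0;

# CYRILLIC CAPITAL LETTER A (U+0410) has the "Lu" property;
# the short and long property names are interchangeable.
my $is_upper = ("\x{410}" =~ /^\p{Lu}$/
             && "\x{410}" =~ /^\p{UppercaseLetter}$/) ? 1 : 0;

# COMBINING ACUTE ACCENT (U+0301) is a mark, so \pM finds it after "e".
my $mark_count = () = "e\x{301}" =~ /\pM/g;

# A caret inside \p{} negates, just as \P{} does.
my $negated = ("A" =~ /^\p{^Tamil}$/ && "A" =~ /^\P{Tamil}$/) ? 1 : 0;

print "same=$same upper=$is_upper marks=$mark_count negated=$negated\n";
```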
2796c109 JH |
271 | =head2 Scripts |
272 | ||
376d9008 JB |
273 | The script names which can be used by C<\p{...}> and C<\P{...}>, |
274 | such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows: | |
2796c109 | 275 | |
1ac13f9a | 276 | Arabic |
e9ad1727 | 277 | Armenian |
1ac13f9a | 278 | Bengali |
e9ad1727 | 279 | Bopomofo |
1d81abf3 | 280 | Buhid |
eb0cc9e3 | 281 | CanadianAboriginal |
e9ad1727 JH |
282 | Cherokee |
283 | Cyrillic | |
284 | Deseret | |
285 | Devanagari | |
286 | Ethiopic | |
287 | Georgian | |
288 | Gothic | |
289 | Greek | |
1ac13f9a | 290 | Gujarati |
e9ad1727 JH |
291 | Gurmukhi |
292 | Han | |
293 | Hangul | |
1d81abf3 | 294 | Hanunoo |
e9ad1727 JH |
295 | Hebrew |
296 | Hiragana | |
297 | Inherited | |
1ac13f9a | 298 | Kannada |
e9ad1727 JH |
299 | Katakana |
300 | Khmer | |
1ac13f9a | 301 | Lao |
e9ad1727 JH |
302 | Latin |
303 | Malayalam | |
304 | Mongolian | |
1ac13f9a | 305 | Myanmar |
1ac13f9a | 306 | Ogham |
eb0cc9e3 | 307 | OldItalic |
e9ad1727 | 308 | Oriya |
1ac13f9a | 309 | Runic |
e9ad1727 JH |
310 | Sinhala |
311 | Syriac | |
1d81abf3 JH |
312 | Tagalog |
313 | Tagbanwa | |
e9ad1727 JH |
314 | Tamil |
315 | Telugu | |
316 | Thaana | |
317 | Thai | |
318 | Tibetan | |
1ac13f9a | 319 | Yi |
1ac13f9a | 320 | |
376d9008 | 321 | Extended property classes can supplement the basic |
1ac13f9a JH |
322 | properties, defined by the F<PropList> Unicode database: |
323 | ||
1d81abf3 | 324 | ASCIIHexDigit |
eb0cc9e3 | 325 | BidiControl |
1ac13f9a | 326 | Dash |
1d81abf3 | 327 | Deprecated |
1ac13f9a JH |
328 | Diacritic |
329 | Extender | |
1d81abf3 | 330 | GraphemeLink |
eb0cc9e3 | 331 | HexDigit |
e9ad1727 JH |
332 | Hyphen |
333 | Ideographic | |
1d81abf3 JH |
334 | IDSBinaryOperator |
335 | IDSTrinaryOperator | |
eb0cc9e3 | 336 | JoinControl |
1d81abf3 | 337 | LogicalOrderException |
eb0cc9e3 JH |
338 | NoncharacterCodePoint |
339 | OtherAlphabetic | |
1d81abf3 JH |
340 | OtherDefaultIgnorableCodePoint |
341 | OtherGraphemeExtend | |
eb0cc9e3 JH |
342 | OtherLowercase |
343 | OtherMath | |
344 | OtherUppercase | |
345 | QuotationMark | |
1d81abf3 JH |
346 | Radical |
347 | SoftDotted | |
348 | TerminalPunctuation | |
349 | UnifiedIdeograph | |
eb0cc9e3 | 350 | WhiteSpace |
1ac13f9a | 351 | |
376d9008 | 352 | and there are further derived properties: |
1ac13f9a | 353 | |
eb0cc9e3 JH |
354 | Alphabetic Lu + Ll + Lt + Lm + Lo + OtherAlphabetic |
355 | Lowercase Ll + OtherLowercase | |
356 | Uppercase Lu + OtherUppercase | |
357 | Math Sm + OtherMath | |
1ac13f9a JH |
358 | |
359 | ID_Start Lu + Ll + Lt + Lm + Lo + Nl | |
360 | ID_Continue ID_Start + Mn + Mc + Nd + Pc | |
361 | ||
362 | Any Any character | |
66b79f27 RGS |
363 | Assigned Any non-Cn character (i.e. synonym for \P{Cn}) |
364 | Unassigned Synonym for \p{Cn} | |
1ac13f9a | 365 | Common Any character (or unassigned code point) |
e150c829 | 366 | not explicitly assigned to a script |
2796c109 | 367 | |
1bfb14c4 JH |
368 | For backward compatibility (with Perl 5.6), all properties mentioned |
369 | so far may have C<Is> prepended to their name, so C<\P{IsLu}>, for | |
370 | example, is equal to C<\P{Lu}>. | |
eb0cc9e3 | 371 | |
2796c109 JH |
372 | =head2 Blocks |
373 | ||
1bfb14c4 JH |
374 | In addition to B<scripts>, Unicode also defines B<blocks> of |
375 | characters. The difference between scripts and blocks is that the | |
376 | concept of scripts is closer to natural languages, while the concept | |
377 | of blocks is more of an artificial grouping based on groups of 256 | |
376d9008 | 378 | Unicode characters. For example, the C<Latin> script contains letters |
1bfb14c4 | 379 | from many blocks but does not contain all the characters from those |
376d9008 JB |
380 | blocks. It does not, for example, contain digits, because digits are |
381 | shared across many scripts. Digits and similar groups, like | |
382 | punctuation, are in a category called C<Common>. | |
2796c109 | 383 | |
cfc01aea JF |
384 | For more about scripts, see the UTR #24: |
385 | ||
386 | http://www.unicode.org/unicode/reports/tr24/ | |
387 | ||
388 | For more about blocks, see: | |
389 | ||
390 | http://www.unicode.org/Public/UNIDATA/Blocks.txt | |
2796c109 | 391 | |
376d9008 JB |
392 | Block names are given with the C<In> prefix. For example, the |
393 | Katakana block is referenced via C<\p{InKatakana}>. The C<In> | |
7eabb34d | 394 | prefix may be omitted if there is no naming conflict with a script |
eb0cc9e3 | 395 | or any other property, but it is recommended that C<In> always be used |
1bfb14c4 | 396 | for block tests to avoid confusion. |
eb0cc9e3 JH |
397 | |
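The distinction between scripts and blocks can be sketched as follows. The sample characters and variable names are assumptions for illustration: a Latin letter and an ASCII digit share a block, but only the letter belongs to the C<Latin> script.

```perl
use strict;
use warnings;

my $letter = "A";
my $digit  = "7";

# Both characters live in the Basic Latin block ...
my $letter_in_block = $letter =~ /\p{InBasicLatin}/ ? 1 : 0;
my $digit_in_block  = $digit  =~ /\p{InBasicLatin}/ ? 1 : 0;

# ... but only the letter belongs to the Latin script; the digit
# is shared across scripts and so falls under Common.
my $letter_is_latin = $letter =~ /\p{Latin}/  ? 1 : 0;
my $digit_is_latin  = $digit  =~ /\p{Latin}/  ? 1 : 0;
my $digit_is_common = $digit  =~ /\p{Common}/ ? 1 : 0;

print "block: $letter_in_block $digit_in_block\n";
print "script: $letter_is_latin $digit_is_latin (Common: $digit_is_common)\n";
```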
398 | These block names are supported: | |
399 | ||
1d81abf3 JH |
400 | InAlphabeticPresentationForms |
401 | InArabic | |
402 | InArabicPresentationFormsA | |
403 | InArabicPresentationFormsB | |
404 | InArmenian | |
405 | InArrows | |
406 | InBasicLatin | |
407 | InBengali | |
408 | InBlockElements | |
409 | InBopomofo | |
410 | InBopomofoExtended | |
411 | InBoxDrawing | |
412 | InBraillePatterns | |
413 | InBuhid | |
414 | InByzantineMusicalSymbols | |
415 | InCJKCompatibility | |
416 | InCJKCompatibilityForms | |
417 | InCJKCompatibilityIdeographs | |
418 | InCJKCompatibilityIdeographsSupplement | |
419 | InCJKRadicalsSupplement | |
420 | InCJKSymbolsAndPunctuation | |
421 | InCJKUnifiedIdeographs | |
422 | InCJKUnifiedIdeographsExtensionA | |
423 | InCJKUnifiedIdeographsExtensionB | |
424 | InCherokee | |
425 | InCombiningDiacriticalMarks | |
426 | InCombiningDiacriticalMarksforSymbols | |
427 | InCombiningHalfMarks | |
428 | InControlPictures | |
429 | InCurrencySymbols | |
430 | InCyrillic | |
431 | InCyrillicSupplementary | |
432 | InDeseret | |
433 | InDevanagari | |
434 | InDingbats | |
435 | InEnclosedAlphanumerics | |
436 | InEnclosedCJKLettersAndMonths | |
437 | InEthiopic | |
438 | InGeneralPunctuation | |
439 | InGeometricShapes | |
440 | InGeorgian | |
441 | InGothic | |
442 | InGreekExtended | |
443 | InGreekAndCoptic | |
444 | InGujarati | |
445 | InGurmukhi | |
446 | InHalfwidthAndFullwidthForms | |
447 | InHangulCompatibilityJamo | |
448 | InHangulJamo | |
449 | InHangulSyllables | |
450 | InHanunoo | |
451 | InHebrew | |
452 | InHighPrivateUseSurrogates | |
453 | InHighSurrogates | |
454 | InHiragana | |
455 | InIPAExtensions | |
456 | InIdeographicDescriptionCharacters | |
457 | InKanbun | |
458 | InKangxiRadicals | |
459 | InKannada | |
460 | InKatakana | |
461 | InKatakanaPhoneticExtensions | |
462 | InKhmer | |
463 | InLao | |
464 | InLatin1Supplement | |
465 | InLatinExtendedA | |
466 | InLatinExtendedAdditional | |
467 | InLatinExtendedB | |
468 | InLetterlikeSymbols | |
469 | InLowSurrogates | |
470 | InMalayalam | |
471 | InMathematicalAlphanumericSymbols | |
472 | InMathematicalOperators | |
473 | InMiscellaneousMathematicalSymbolsA | |
474 | InMiscellaneousMathematicalSymbolsB | |
475 | InMiscellaneousSymbols | |
476 | InMiscellaneousTechnical | |
477 | InMongolian | |
478 | InMusicalSymbols | |
479 | InMyanmar | |
480 | InNumberForms | |
481 | InOgham | |
482 | InOldItalic | |
483 | InOpticalCharacterRecognition | |
484 | InOriya | |
485 | InPrivateUseArea | |
486 | InRunic | |
487 | InSinhala | |
488 | InSmallFormVariants | |
489 | InSpacingModifierLetters | |
490 | InSpecials | |
491 | InSuperscriptsAndSubscripts | |
492 | InSupplementalArrowsA | |
493 | InSupplementalArrowsB | |
494 | InSupplementalMathematicalOperators | |
495 | InSupplementaryPrivateUseAreaA | |
496 | InSupplementaryPrivateUseAreaB | |
497 | InSyriac | |
498 | InTagalog | |
499 | InTagbanwa | |
500 | InTags | |
501 | InTamil | |
502 | InTelugu | |
503 | InThaana | |
504 | InThai | |
505 | InTibetan | |
506 | InUnifiedCanadianAboriginalSyllabics | |
507 | InVariationSelectors | |
508 | InYiRadicals | |
509 | InYiSyllables | |
32293815 | 510 | |
210b36aa AMS |
511 | =over 4 |
512 | ||
393fec97 GS |
513 | =item * |
514 | ||
376d9008 JB |
515 | The special pattern C<\X> matches any extended Unicode |
516 | sequence--"a combining character sequence" in Standardese--where the | |
517 | first character is a base character and subsequent characters are mark | |
518 | characters that apply to the base character. C<\X> is equivalent to | |
393fec97 GS |
519 | C<(?:\PM\pM*)>. |
520 | ||
393fec97 GS |
521 | =item * |
522 | ||
383e7cdd | 523 | The C<tr///> operator translates characters instead of bytes. Note |
376d9008 JB |
524 | that the C<tr///CU> functionality has been removed. For similar |
525 | functionality see C<pack('U0', ...)> and C<pack('C0', ...)>. | |
393fec97 | 526 | |
393fec97 GS |
527 | =item * |
528 | ||
529 | Case translation operators use the Unicode case translation tables | |
376d9008 JB |
530 | when character input is provided. Note that C<uc()>, or C<\U> in |
531 | interpolated strings, translates to uppercase, while C<ucfirst>, | |
532 | or C<\u> in interpolated strings, translates to titlecase in languages | |
533 | that make the distinction. | |
393fec97 GS |
534 | |
535 | =item * | |
536 | ||
376d9008 | 537 | Most operators that deal with positions or lengths in a string will |
75daf61c JH |
538 | automatically switch to using character positions, including |
539 | C<chop()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>, | |
540 | C<sprintf()>, C<write()>, and C<length()>. Operators that | |
376d9008 JB |
541 | specifically do not switch include C<vec()>, C<pack()>, and |
542 | C<unpack()>. Operators that really don't care include C<chomp()>, | |
543 | operators that treat strings as a bucket of bits, such as C<sort()>, | |
544 | and operators dealing with filenames. | |
393fec97 GS |
545 | |
546 | =item * | |
547 | ||
1bfb14c4 | 548 | The C<pack()>/C<unpack()> letters C<c> and C<C> do I<not> change, |
376d9008 | 549 | since they are often used for byte-oriented formats. Again, think |
1bfb14c4 JH |
550 | C<char> in the C language. |
551 | ||
552 | There is a new C<U> specifier that converts between Unicode characters | |
553 | and code points. | |
393fec97 GS |
554 | |
555 | =item * | |
556 | ||
376d9008 JB |
557 | The C<chr()> and C<ord()> functions work on characters, similar to |
558 | C<pack("U")> and C<unpack("U")>, I<not> C<pack("C")> and | |
559 | C<unpack("C")>. C<pack("C")> and C<unpack("C")> are methods for | |
560 | emulating byte-oriented C<chr()> and C<ord()> on Unicode strings. | |
561 | While these methods reveal the internal encoding of Unicode strings, | |
562 | that is not something one normally needs to care about at all. | |
393fec97 GS |
563 | |
564 | =item * | |
565 | ||
376d9008 JB |
566 | The bit string operators, C<& | ^ ~>, can operate on character data. |
567 | However, for backward compatibility, such as when using bit string | |
568 | operations when characters are all less than 256 in ordinal value, one | |
569 | should not use C<~> (the bit complement) on strings that mix characters | |
570 | of values less than 256 with characters of values greater than 256. Most importantly, | |
571 | DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>) | |
572 | will not hold. The reason for this mathematical I<faux pas> is that | |
573 | the complement cannot return B<both> the 8-bit (byte-wide) bit | |
574 | complement B<and> the full character-wide bit complement. | |
a1ca4561 YST |
575 | |
576 | =item * | |
577 | ||
983ffd37 JH |
578 | lc(), uc(), lcfirst(), and ucfirst() work for the following cases: |
579 | ||
580 | =over 8 | |
581 | ||
582 | =item * | |
583 | ||
584 | the case mapping is from a single Unicode character to another | |
376d9008 | 585 | single Unicode character, or |
983ffd37 JH |
586 | |
587 | =item * | |
588 | ||
589 | the case mapping is from a single Unicode character to more | |
376d9008 | 590 | than one Unicode character. |
983ffd37 JH |
591 | |
592 | =back | |
593 | ||
63de3cb2 JH |
594 | Locale-specific case mappings (Lithuanian, Turkish, Azeri) do B<not> work | |
595 | since Perl does not understand the concept of Unicode locales. | |
983ffd37 | 596 | |
dc33ebcf RGS |
597 | See the Unicode Technical Report #21, Case Mappings, for more details. |
598 | ||
983ffd37 JH |
599 | =back |
600 | ||
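Several of the character-oriented operators in the list above can be seen together in a short sketch. The sample string and variable names are assumptions for illustration only.

```perl
use strict;
use warnings;

# "e", COMBINING ACUTE ACCENT, WHITE SMILING FACE: three characters.
my $s = "e\x{301}\x{263A}";

my $len   = length($s);              # counts characters
my $last  = substr($s, 2, 1);        # offsets are character positions
my $where = index($s, "\x{263A}");   # also a character position

# \X takes the base character together with its combining mark.
my @sequences = $s =~ /(\X)/g;

# chr() and ord() work on characters, as does the "U" pack letter.
my $round_trip   = ord(chr(0x263A));
my ($code_point) = unpack("U", "\x{263A}");

# Prints "len=3 where=2 sequences=2 code point=U+263A".
printf "len=%d where=%d sequences=%d code point=U+%04X\n",
    $len, $where, scalar(@sequences), $code_point;
```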
dc33ebcf | 601 | =over 4 |
ac1256e8 JH |
602 | |
603 | =item * | |
604 | ||
393fec97 GS |
605 | And finally, C<scalar reverse()> reverses by character rather than by byte. |
606 | ||
607 | =back | |
608 | ||
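The case-mapping and reversal behavior described above can be sketched with a few characters picked for illustration (they are not taken from the text): U+01C6 has distinct uppercase and titlecase forms, and U+FB01 maps to more than one character when uppercased.

```perl
use strict;
use warnings;

# LATIN SMALL LETTER DZ WITH CARON (U+01C6) has distinct uppercase
# (U+01C4) and titlecase (U+01C5) forms.
my $dz    = "\x{1C6}";
my $upper = uc($dz);
my $title = ucfirst($dz);

# LATIN SMALL LIGATURE FI (U+FB01) uppercases to two characters, "FI".
my $fi_upper = uc("\x{FB01}");

# In scalar context, reverse() reverses the characters of a string.
my $reversed = scalar reverse("ab\x{263A}");

# Prints "uc=U+01C4 ucfirst=U+01C5 fi=FI".
printf "uc=U+%04X ucfirst=U+%04X fi=%s\n", ord($upper), ord($title), $fi_upper;
```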
376d9008 | 609 | =head2 User-Defined Character Properties |
491fd90a JH |
610 | |
611 | You can define your own character properties by defining subroutines | |
3a2263fe RGS |
612 | whose names begin with "In" or "Is". The subroutines must be defined |
613 | in the C<main> package. The user-defined properties can be used in the | |
614 | regular expression C<\p> and C<\P> constructs. Note that the effect | |
615 | is compile-time and immutable once defined. | |
491fd90a | 616 | |
376d9008 JB |
617 | The subroutines must return a specially-formatted string, with one |
618 | or more newline-separated lines. Each line must be one of the following: | |
491fd90a JH |
619 | |
620 | =over 4 | |
621 | ||
622 | =item * | |
623 | ||
99a6b1f0 | 624 | Two hexadecimal numbers separated by horizontal whitespace (space or |
376d9008 | 625 | tab characters) denoting a range of Unicode code points to include. |
491fd90a JH |
626 | |
627 | =item * | |
628 | ||
376d9008 JB |
629 | Something to include, prefixed by "+": a built-in character |
630 | property (prefixed by "utf8::"), to represent all the characters in that | |
631 | property; two hexadecimal code points for a range; or a single | |
632 | hexadecimal code point. | |
491fd90a JH |
633 | |
634 | =item * | |
635 | ||
376d9008 | 636 | Something to exclude, prefixed by "-": an existing character |
11ef8fdd | 637 | property (prefixed by "utf8::"), for all the characters in that |
376d9008 JB |
638 | property; two hexadecimal code points for a range; or a single |
639 | hexadecimal code point. | |
491fd90a JH |
640 | |
641 | =item * | |
642 | ||
376d9008 | 643 | Something to negate, prefixed by "!": an existing character |
11ef8fdd | 644 | property (prefixed by "utf8::") for all the characters except the |
376d9008 JB |
645 | characters in the property; two hexadecimal code points for a range; |
646 | or a single hexadecimal code point. | |
491fd90a JH |
647 | |
648 | =back | |
649 | ||
650 | For example, to define a property that covers both the Japanese | |
651 | syllabaries (hiragana and katakana), you can define | |
652 | ||
653 | sub InKana { | |
d5822f25 A |
654 | return <<END; |
655 | 3040\t309F | |
656 | 30A0\t30FF | |
491fd90a JH |
657 | END |
658 | } | |
659 | ||
d5822f25 A |
660 | Imagine that the here-doc end marker is at the beginning of the line. |
661 | Now you can use C<\p{InKana}> and C<\P{InKana}>. | |
491fd90a JH |
662 | |
663 | You could also have used the existing block property names: | |
664 | ||
665 | sub InKana { | |
666 | return <<'END'; | |
667 | +utf8::InHiragana | |
668 | +utf8::InKatakana | |
669 | END | |
670 | } | |
671 | ||
672 | Suppose you wanted to match only the allocated characters, | |
d5822f25 | 673 | not the raw block ranges: in other words, you want to remove |
491fd90a JH |
674 | the non-characters: |
675 | ||
676 | sub InKana { | |
677 | return <<'END'; | |
678 | +utf8::InHiragana | |
679 | +utf8::InKatakana | |
680 | -utf8::IsCn | |
681 | END | |
682 | } | |
683 | ||
684 | The negation is useful for defining (surprise!) negated classes. | |
685 | ||
686 | sub InNotKana { | |
687 | return <<'END'; | |
688 | !utf8::InHiragana | |
689 | -utf8::InKatakana | |
690 | +utf8::IsCn | |
691 | END | |
692 | } | |
693 | ||
3a2263fe RGS |
694 | You can also define your own mappings to be used in lc(), | |
695 | lcfirst(), uc(), and ucfirst() (or their string-inlined versions). | |
696 | The principle is the same: define subroutines in the C<main> package | |
697 | with names like C<ToLower> (for lc() and lcfirst()), C<ToTitle> (for | |
698 | the first character in ucfirst()), and C<ToUpper> (for uc(), and the | |
699 | rest of the characters in ucfirst()). | |
700 | ||
701 | The string returned by the subroutines must now consist of three | |
702 | hexadecimal numbers separated by tabs: start of the source | |
703 | range, end of the source range, and start of the destination range. | |
704 | For example: | |
705 | ||
706 | sub ToUpper { | |
707 | return <<END; | |
708 | 0061\t0063\t0041 | |
709 | END | |
710 | } | |
711 | ||
712 | defines a uc() mapping that causes only the characters "a", "b", and | |
713 | "c" to be mapped to "A", "B", and "C"; all other characters remain | |
714 | unchanged. | |
715 | ||
716 | If there is no source range to speak of, that is, the mapping is from | |
717 | a single character to another single character, leave the end of the | |
718 | source range empty, but the two tab characters are still needed. | |
719 | For example: | |
720 | ||
721 | sub ToLower { | |
722 | return <<END; | |
723 | 0041\t\t0061 | |
724 | END | |
725 | } | |
726 | ||
727 | defines an lc() mapping that causes only "A" to be mapped to "a"; all | |
728 | other characters remain unchanged. | |
729 | ||
730 | (For serious hackers only) If you want to introspect the default | |
731 | mappings, you can find the data in the directory | |
732 | C<$Config{privlib}>/F<unicore/To/>. The mapping data is returned as | |
733 | the here-document, and the C<utf8::ToSpecFoo> are special exception | |
734 | mappings derived from C<$Config{privlib}>/F<unicore/SpecialCasing.txt>. | |
735 | The C<Digit> and C<Fold> mappings that one can see in the directory | |
736 | are not directly user-accessible; one can use either the | |
737 | C<Unicode::UCD> module, or just match case-insensitively (that's when | |
738 | the C<Fold> mapping is used). | |
739 | ||
740 | A final note on the user-defined property tests and mappings: they | |
741 | will be used only if the scalar has been marked as having Unicode | |
742 | characters. Old byte-style strings will not be affected. | |
743 | ||
376d9008 | 744 | =head2 Character Encodings for Input and Output |
8cbd9a7a | 745 | |
7221edc9 | 746 | See L<Encode>. |
8cbd9a7a | 747 | |
c29a771d | 748 | =head2 Unicode Regular Expression Support Level |
776f8809 | 749 | |
376d9008 JB |
750 | The following list of Unicode support for regular expressions describes |
751 | all the features currently supported. The references to "Level N" | |
752 | and the section numbers refer to the Unicode Technical Report 18, | |
753 | "Unicode Regular Expression Guidelines". | |
776f8809 JH |
754 | |
755 | =over 4 | |
756 | ||
757 | =item * | |
758 | ||
759 | Level 1 - Basic Unicode Support | |
760 | ||
761 | 2.1 Hex Notation - done [1] | |
3bfdc84c | 762 | Named Notation - done [2] |
776f8809 JH |
763 | 2.2 Categories - done [3][4] |
764 | 2.3 Subtraction - MISSING [5][6] | |
765 | 2.4 Simple Word Boundaries - done [7] | |
78d3e1bf | 766 | 2.5 Simple Loose Matches - done [8] |
776f8809 JH |
767 | 2.6 End of Line - MISSING [9][10] |
768 | ||
769 | [ 1] \x{...} | |
770 | [ 2] \N{...} | |
eb0cc9e3 | 771 | [ 3] . \p{...} \P{...} |
29bdacb8 | 772 | [ 4] now scripts (see UTR#24 Script Names) in addition to blocks |
776f8809 | 773 | [ 5] have negation |
237bad5b JH |
774 | [ 6] can use regular expression look-ahead [a] |
775 | or user-defined character properties [b] to emulate subtraction | |
776f8809 | 776 | [ 7] include Letters in word characters |
376d9008 | 777 | [ 8] note that Perl does Full case-folding in matching, not Simple: |
835863de | 778 | for example U+1F88 is equivalent to U+1F00 U+03B9, |
e0f9d4a8 | 779 | not to U+1F80. This difference matters for certain Greek |
376d9008 JB |
780 | capital letters with certain modifiers: the Full case-folding |
781 | decomposes the letter, while the Simple case-folding would map | |
e0f9d4a8 | 782 | it to a single character. |
5ca1ac52 | 783 | [ 9] see UTR #13 Unicode Newline Guidelines |
835863de | 784 | [10] should do ^ and $ also on \x{85}, \x{2028} and \x{2029} |
ec83e909 | 785 | (should also affect <>, $., and script line numbers) |
3bfdc84c | 786 | (the \x{85}, \x{2028} and \x{2029} do match \s) |
7207e29d | 787 | |
237bad5b | 788 | [a] You can mimic class subtraction using lookahead. |
5ca1ac52 | 789 | For example, what UTR #18 might write as |
29bdacb8 | 790 | |
dbe420b4 JH |
791 | [{Greek}-[{UNASSIGNED}]] |
792 | ||
793 | in Perl can be written as: | |
794 | ||
1d81abf3 JH |
795 | (?!\p{Unassigned})\p{InGreekAndCoptic} |
796 | (?=\p{Assigned})\p{InGreekAndCoptic} | |
dbe420b4 JH |
797 | |
798 | But in this particular example, you probably really want | |
799 | ||
1bfb14c4 | 800 | \p{Greek} |
dbe420b4 JH |
801 | |
802 | which will match assigned characters known to be part of the Greek script. | |
29bdacb8 | 803 | |
5ca1ac52 JH |
804 | Also see the Unicode::Regex::Set module, which implements the full |
805 | UTR #18 grouping, intersection, union, and removal (subtraction) syntax. | |
806 | ||
818c4caa | 807 | [b] See L</"User-Defined Character Properties">. |
237bad5b | 808 | |
776f8809 JH |
809 | =item * |
810 | ||
811 | Level 2 - Extended Unicode Support | |
812 | ||
63de3cb2 JH |
813 | 3.1 Surrogates - MISSING [11] |
814 | 3.2 Canonical Equivalents - MISSING [12][13] | |
815 | 3.3 Locale-Independent Graphemes - MISSING [14] | |
816 | 3.4 Locale-Independent Words - MISSING [15] | |
817 | 3.5 Locale-Independent Loose Matches - MISSING [16] | |
818 | ||
819 | [11] Surrogates are solely a UTF-16 concept and Perl's internal | |
820 | representation is UTF-8. The Encode module does UTF-16, though. | |
821 | [12] see UTR#15 Unicode Normalization | |
822 | [13] have Unicode::Normalize but not integrated to regexes | |
823 | [14] have \X but at this level . should equal that | |
824 | [15] need three classes, not just \w and \W | |
825 | [16] see UTR#21 Case Mappings | |
776f8809 JH |
826 | |
827 | =item * | |
828 | ||
829 | Level 3 - Locale-Sensitive Support | |
830 | ||
831 | 4.1 Locale-Dependent Categories - MISSING | |
832 | 4.2 Locale-Dependent Graphemes - MISSING [17][18] | |
833 | 4.3 Locale-Dependent Words - MISSING | |
834 | 4.4 Locale-Dependent Loose Matches - MISSING | |
835 | 4.5 Locale-Dependent Ranges - MISSING | |
836 | ||
837 | [17] see UTR#10 Unicode Collation Algorithms | |
838 | [18] have Unicode::Collate but not integrated to regexes | |
839 | ||
840 | =back | |
841 | ||
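The lookahead emulation of class subtraction described in note [a] can be tried out directly. The following is a minimal sketch rather than part of the support list: the particular properties (C<\p{Greek}> minus C<\p{Lu}>) and the variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# "Greek minus uppercase letters": the negative lookahead vetoes the
# characters to be subtracted before the positive class is tried.
my $greek_not_upper = qr/^(?!\p{Lu})\p{Greek}$/;

my $alpha_small   = "\x{3B1}";    # GREEK SMALL LETTER ALPHA
my $alpha_capital = "\x{391}";    # GREEK CAPITAL LETTER ALPHA

my @results;
for my $char ($alpha_small, $alpha_capital, "a") {
    push @results, ($char =~ $greek_not_upper ? 1 : 0);
}
print "@results\n";
```

Only the small alpha should match: the capital alpha is removed by the lookahead, and "a" is not Greek in the first place.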
c349b1b9 JH |
842 | =head2 Unicode Encodings |
843 | ||
376d9008 JB |
844 | Unicode characters are assigned to I<code points>, which are abstract |
845 | numbers. To use these numbers, various encodings are needed. | |
c349b1b9 JH |
846 | |
847 | =over 4 | |
848 | ||
c29a771d | 849 | =item * |
5cb3728c RB |
850 | |
851 | UTF-8 | |
c349b1b9 | 852 | |
3e4dbfed | 853 | UTF-8 is a variable-length (1 to 6 bytes, current character allocations |
376d9008 JB |
854 | require 4 bytes), byte-order independent encoding. For ASCII (and we |
855 | really do mean 7-bit ASCII, not another 8-bit encoding), UTF-8 is | |
856 | transparent. | |
c349b1b9 | 857 | |
8c007b5a | 858 | The following table is from Unicode 3.2. |
05632f9a JH |
859 | |
860 | Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte | |
861 | ||
8c007b5a JH |
862 | U+0000..U+007F 00..7F |
863 | U+0080..U+07FF C2..DF 80..BF | |
ec90690f TS |
864 | U+0800..U+0FFF E0 A0..BF 80..BF |
865 | U+1000..U+CFFF E1..EC 80..BF 80..BF | |
866 | U+D000..U+D7FF ED 80..9F 80..BF | |
8c007b5a | 867 | U+D800..U+DFFF ******* ill-formed ******* |
ec90690f | 868 | U+E000..U+FFFF EE..EF 80..BF 80..BF |
05632f9a JH |
869 | U+10000..U+3FFFF F0 90..BF 80..BF 80..BF |
870 | U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF | |
871 | U+100000..U+10FFFF F4 80..8F 80..BF 80..BF | |
872 | ||
376d9008 JB |
873 | Note the C<A0..BF> in C<U+0800..U+0FFF>, the C<80..9F> in |
874 | C<U+D000..U+D7FF>, the C<90..BF> in C<U+10000..U+3FFFF>, and the | |
875 | C<80..8F> in C<U+100000..U+10FFFF>. The "gaps" are caused by legal | |
876 | UTF-8 avoiding non-shortest encodings: it is technically possible to | |
877 | UTF-8-encode a single code point in different ways, but that is | |
878 | explicitly forbidden, and the shortest possible encoding should always | |
879 | be used. So that's what Perl does. | |
37361303 | 880 | |
376d9008 | 881 | Another way to look at it is via bits: |
05632f9a JH |
882 | |
883 | Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte | |
884 | ||
885 | 0aaaaaaa 0aaaaaaa | |
886 | 00000bbbbbaaaaaa 110bbbbb 10aaaaaa | |
887 | ccccbbbbbbaaaaaa 1110cccc 10bbbbbb 10aaaaaa | |
888 | 00000dddccccccbbbbbbaaaaaa 11110ddd 10cccccc 10bbbbbb 10aaaaaa | |
889 | ||
890 | As you can see, the continuation bytes all begin with C<10>, and the | |
8c007b5a | 891 | leading bits of the start byte tell how many bytes there are in the |
05632f9a JH |
892 | encoded character. |
893 | ||
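To see the byte patterns from the tables above for concrete characters, one can ask the C<Encode> module for the UTF-8 octets of a few code points. This is a minimal sketch; the helper name C<utf8_hex> and the sample code points are assumptions made for illustration.

```perl
use strict;
use warnings;
use Encode ();

# Return the UTF-8 encoding of one code point as a lowercase hex string.
sub utf8_hex {
    my ($code_point) = @_;
    return unpack "H*", Encode::encode("UTF-8", chr $code_point);
}

# One sample from each of the 1-, 2-, 3- and 4-byte ranges.
my %hex;
for my $code_point (0x41, 0xE9, 0x20AC, 0x10348) {
    $hex{ sprintf "U+%04X", $code_point } = utf8_hex($code_point);
}
print "$_ => $hex{$_}\n" for sort keys %hex;
```

Each result starts with a lead byte from the range the first table gives for that code point, followed by continuation bytes in C<80..BF>.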
c29a771d | 894 | =item * |
5cb3728c RB |
895 | |
896 | UTF-EBCDIC | |
dbe420b4 | 897 | |
376d9008 | 898 | Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe. |
dbe420b4 | 899 | |
c29a771d | 900 | =item * |
5cb3728c RB |
901 | |
902 | UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks) | |
c349b1b9 | 903 | |
1bfb14c4 JH |
904 | The following items are mostly for reference and general Unicode |
905 | knowledge; Perl doesn't use these constructs internally. | |
dbe420b4 | 906 | |
c349b1b9 | 907 | UTF-16 is a 2 or 4 byte encoding. The Unicode code points |
1bfb14c4 JH |
908 | C<U+0000..U+FFFF> are stored in a single 16-bit unit, and the code |
909 | points C<U+10000..U+10FFFF> in two 16-bit units. The latter case is | |
c349b1b9 JH |
910 | using I<surrogates>, the first 16-bit unit being the I<high |
911 | surrogate>, and the second being the I<low surrogate>. | |
912 | ||
376d9008 | 913 | Surrogates are code points set aside to encode the C<U+10000..U+10FFFF> |
c349b1b9 | 914 | range of Unicode code points in pairs of 16-bit units. The I<high |
376d9008 JB |
915 | surrogates> are the range C<U+D800..U+DBFF>, and the I<low surrogates> |
916 | are the range C<U+DC00..U+DFFF>. The surrogate encoding is | |
c349b1b9 JH |
917 | |
918 | $hi = int(($uni - 0x10000) / 0x400) + 0xD800; | |
919 | $lo = ($uni - 0x10000) % 0x400 + 0xDC00; | |
920 | ||
921 | and the decoding is | |
922 | ||
1a3fa709 | 923 | $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00); |
c349b1b9 | 924 | |
feda178f | 925 | If you try to generate surrogates (for example by using chr()), you |
376d9008 JB |
926 | will get a warning if warnings are turned on, because those code |
927 | points are not valid for a Unicode character. | |
9466bab6 | 928 | |
376d9008 | 929 | Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16 |
c349b1b9 | 930 | itself can be used for in-memory computations, but if storage or |
376d9008 JB |
931 | transfer is required either UTF-16BE (big-endian) or UTF-16LE |
932 | (little-endian) encodings must be chosen. | |
c349b1b9 JH |
933 | |
934 | This introduces another problem: what if you just know that your data | |
376d9008 JB |
935 | is UTF-16, but you don't know which endianness? Byte Order Marks, or |
936 | BOMs, are a solution to this. A special character has been reserved | |
86bbd6d1 | 937 | in Unicode to function as a byte order marker: the character with the |
376d9008 | 938 | code point C<U+FEFF> is the BOM. |
042da322 | 939 | |
c349b1b9 | 940 | The trick is that if you read a BOM, you will know the byte order, |
376d9008 JB |
941 | since if it was written on a big-endian platform, you will read the |
942 | bytes C<0xFE 0xFF>, but if it was written on a little-endian platform, | |
943 | you will read the bytes C<0xFF 0xFE>. (And if the originating platform | |
944 | was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.) | |
042da322 | 945 | |
86bbd6d1 | 946 | The way this trick works is that the character with the code point |
376d9008 JB |
947 | C<U+FFFE> is guaranteed not to be a valid Unicode character, so the |
948 | sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in | |
1bfb14c4 | 949 | little-endian format" and cannot be "C<U+FFFE>, represented in big-endian |
042da322 | 950 | format". |
c349b1b9 | 951 | |
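The surrogate arithmetic and the BOM trick can both be checked with a short script. This is a minimal sketch under stated assumptions: the function names C<to_surrogates> and C<from_surrogates> are made up for illustration, and shifts and masks are used in place of division and modulus by C<0x400> (equivalent for these non-negative integers).

```perl
use strict;
use warnings;
use Encode ();

# Split a supplementary code point (U+10000..U+10FFFF) into surrogates.
sub to_surrogates {
    my ($uni) = @_;
    my $offset = $uni - 0x10000;
    return (0xD800 + ($offset >> 10), 0xDC00 + ($offset & 0x3FF));
}

# Recombine a high/low surrogate pair into the original code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
}

my ($hi, $lo) = to_surrogates(0x1D11E);    # MUSICAL SYMBOL G CLEF
my $pair      = sprintf "%04X %04X", $hi, $lo;
my $roundtrip = from_surrogates($hi, $lo);

# Encode yields the same two 16-bit units when asked for UTF-16BE.
my $utf16be_hex = uc unpack "H*", Encode::encode("UTF-16BE", chr 0x1D11E);

# Under the plain "UTF-16" name, Encode reads the BOM to pick the byte order.
my $little  = "\xFF\xFE\x41\x00";
my $big     = "\xFE\xFF\x00\x41";
my $from_le = Encode::decode("UTF-16", $little);
my $from_be = Encode::decode("UTF-16", $big);

print "$pair $utf16be_hex $from_le $from_be\n";
```

Both byte orders decode to the same one-character string, with the BOM consumed rather than returned as data.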
c29a771d | 952 | =item * |
5cb3728c RB |
953 | |
954 | UTF-32, UTF-32BE, UTF-32LE | |
c349b1b9 JH |
955 | |
956 | The UTF-32 family is pretty much like the UTF-16 family, except that | |
042da322 | 957 | the units are 32-bit, and therefore the surrogate scheme is not |
376d9008 JB |
958 | needed. The BOM signatures will be C<0x00 0x00 0xFE 0xFF> for BE and |
959 | C<0xFF 0xFE 0x00 0x00> for LE. | |
c349b1b9 | 960 | |
c29a771d | 961 | =item * |
5cb3728c RB |
962 | |
963 | UCS-2, UCS-4 | |
c349b1b9 | 964 | |
86bbd6d1 | 965 | Encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit |
376d9008 | 966 | encoding. Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>, |
339cfa0e JH |
967 | because it does not use surrogates. UCS-4 is a 32-bit encoding, |
968 | functionally identical to UTF-32. | |
c349b1b9 | 969 | |
c29a771d | 970 | =item * |
5cb3728c RB |
971 | |
972 | UTF-7 | |
c349b1b9 | 973 | |
376d9008 JB |
974 | A seven-bit safe (non-eight-bit) encoding, which is useful if the |
975 | transport or storage is not eight-bit safe. Defined by RFC 2152. | |
c349b1b9 | 976 | |
95a1a48b JH |
977 | =back |
978 | ||
0d7c09bb JH |
979 | =head2 Security Implications of Unicode |
980 | ||
981 | =over 4 | |
982 | ||
983 | =item * | |
984 | ||
985 | Malformed UTF-8 | |
bf0fa0b2 JH |
986 | |
987 | Unfortunately, the specification of UTF-8 leaves some room for | |
988 | interpretation of how many bytes of encoded output one should generate | |
376d9008 JB |
989 | from one input Unicode character. Strictly speaking, the shortest |
990 | possible sequence of UTF-8 bytes should be generated, | |
991 | because otherwise there is potential for an input buffer overflow at | |
feda178f | 992 | the receiving end of a UTF-8 connection. Perl always generates the |
376d9008 JB |
993 | shortest length UTF-8, and with warnings on Perl will warn about |
994 | non-shortest length UTF-8 along with other malformations, such as the | |
995 | surrogates, which are not real Unicode code points. | |
bf0fa0b2 | 996 | |
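When data arrives from outside, one way to reject the malformations mentioned above is to decode with the strict C<UTF-8> encoding of the C<Encode> module and treat any failure as an error. This is a minimal sketch; the helper name C<is_strict_utf8> and the sample octets are assumptions made for illustration.

```perl
use strict;
use warnings;
use Encode ();

# Return 1 if the octets decode as strict UTF-8, 0 if Encode rejects them.
# LEAVE_SRC keeps Encode from modifying the caller's octets while checking.
sub is_strict_utf8 {
    my ($octets) = @_;
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    return eval { Encode::decode("UTF-8", $octets, $check); 1 } ? 1 : 0;
}

my $shortest  = "\xC3\xA9";        # U+00E9 in its shortest form
my $overlong  = "\xC0\x80";        # non-shortest encoding of U+0000
my $surrogate = "\xED\xA0\x80";    # would decode to the surrogate U+D800

my @verdicts = map { is_strict_utf8($_) } ($shortest, $overlong, $surrogate);
print "@verdicts\n";
```

Only the first sample should be accepted; the overlong form and the encoded surrogate are both refused.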
0d7c09bb JH |
997 | =item * |
998 | ||
999 | Regular expressions behave slightly differently between byte data and | |
376d9008 JB |
1000 | character (Unicode) data. For example, the "word character" character |
1001 | class C<\w> will work differently depending on whether the data is eight-bit bytes | |
1002 | or Unicode. | |
0d7c09bb | 1003 | |
376d9008 JB |
1004 | In the first case, the set of C<\w> characters is either small--the |
1005 | default set of alphabetic characters, digits, and the "_"--or, if you | |
0d7c09bb JH |
1006 | are using a locale (see L<perllocale>), the C<\w> might contain a few |
1007 | more letters according to your language and country. | |
1008 | ||
376d9008 | 1009 | In the second case, the C<\w> set of characters is much, much larger. |
1bfb14c4 JH |
1010 | Most importantly, even in the set of the first 256 characters, it will |
1011 | probably match different characters: unlike most locales, which are | |
1012 | specific to a language and country pair, Unicode classifies all the | |
1013 | characters that are letters I<somewhere> as C<\w>. For example, your | |
1014 | locale might not think that LATIN SMALL LETTER ETH is a letter (unless | |
1015 | you happen to speak Icelandic), but Unicode does. | |
0d7c09bb | 1016 | |
376d9008 | 1017 | As discussed elsewhere, Perl has one foot (two hooves?) planted in |
1bfb14c4 JH |
1018 | each of two worlds: the old world of bytes and the new world of |
1019 | characters, upgrading from bytes to characters when necessary. | |
376d9008 JB |
1020 | If your legacy code does not explicitly use Unicode, no automatic |
1021 | switch-over to characters should happen. Characters shouldn't get | |
1bfb14c4 JH |
1022 | downgraded to bytes, either. It is possible to accidentally mix bytes |
1023 | and characters, however (see L<perluniintro>), in which case C<\w> in | |
1024 | regular expressions might start behaving differently. Review your | |
1025 | code. Use warnings and the C<strict> pragma. | |
0d7c09bb JH |
1026 | |
1027 | =back | |
1028 | ||
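The C<\w> difference described above can be observed with a single character. This is a minimal sketch that assumes an ASCII-based platform and default regular expression semantics (no locale, and no pragma or pattern modifier that forces Unicode or ASCII rules); the variable names are illustrative.

```perl
use strict;
use warnings;

# LATIN SMALL LETTER ETH stored as a single byte (Latin-1 position 0xF0).
my $eth = "\xF0";

# Byte semantics: without a locale, \w covers only letters, digits and "_"
# from the ASCII range.
my $as_bytes = $eth =~ /\w/ ? 1 : 0;

# The same character, now stored internally as UTF-8: Unicode rules apply.
utf8::upgrade($eth);
my $as_chars = $eth =~ /\w/ ? 1 : 0;

print "bytes: $as_bytes, characters: $as_chars\n";
```

The string compares equal before and after the upgrade, yet the match result changes, which is exactly the kind of surprise that accidental mixing of bytes and characters produces.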
c349b1b9 JH |
1029 | =head2 Unicode in Perl on EBCDIC |
1030 | ||
376d9008 JB |
1031 | The way Unicode is handled on EBCDIC platforms is still |
1032 | experimental. On such platforms, references to UTF-8 encoding in this | |
1033 | document and elsewhere should be read as meaning the UTF-EBCDIC | |
1034 | specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues | |
c349b1b9 | 1035 | are specifically discussed. There is no C<utfebcdic> pragma or |
376d9008 | 1036 | ":utfebcdic" layer; rather, "utf8" and ":utf8" are reused to mean |
86bbd6d1 PN |
1037 | the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic> |
1038 | for more discussion of the issues. | |
c349b1b9 | 1039 | |
b310b053 JH |
1040 | =head2 Locales |
1041 | ||
4616122b | 1042 | Usually locale settings and Unicode do not affect each other, but |
b310b053 JH |
1043 | there are a couple of exceptions: |
1044 | ||
1045 | =over 4 | |
1046 | ||
1047 | =item * | |
1048 | ||
8aa8f774 JH |
1049 | You can enable automatic UTF-8-ification of your standard file |
1050 | handles, default C<open()> layer, and C<@ARGV> by using either | |
1051 | the C<-C> command line switch or the C<PERL_UNICODE> environment | |
1052 | variable, see L<perlrun> for the documentation of the C<-C> switch. | |
b310b053 JH |
1053 | |
1054 | =item * | |
1055 | ||
376d9008 JB |
1056 | Perl tries really hard to work both with Unicode and the old |
1057 | byte-oriented world. Most often this is nice, but sometimes Perl's | |
1058 | straddling of the proverbial fence causes problems. | |
b310b053 JH |
1059 | |
1060 | =back | |
1061 | ||
1aad1664 JH |
1062 | =head2 When Unicode Does Not Happen |
1063 | ||
1064 | While Perl does have extensive ways to input and output in Unicode, | |
1065 | and a few other 'entry points' like C<@ARGV> that can be interpreted | |
1066 | as Unicode (UTF-8), there are still many places where Unicode (in some | |
1067 | encoding or another) could be given as arguments or received as | |
1068 | results, or both, but it is not. | |
1069 | ||
1070 | The following are such interfaces. For all of these Perl currently | |
1071 | (as of 5.8.1) simply assumes byte strings both as arguments and results. | |
1072 | ||
1073 | One reason why Perl does not attempt to resolve the role of Unicode in | |
1074 | these cases is that the answers are highly dependent on the operating | |
1075 | system and the file system(s). For example, whether filenames can be | |
1076 | in Unicode, and in exactly what kind of encoding, is not exactly a | |
1077 | portable concept. Similarly for C<qx> and C<system>: how well will the | |
1078 | 'command line interface' (and which of them?) handle Unicode? | |
1079 | ||
1080 | =over 4 | |
1081 | ||
1082 | =item chdir, chmod, chown, chroot, exec, link, mkdir, rename, rmdir, stat, symlink, truncate, unlink, utime | |
1083 | ||
1084 | =item %ENV | |
1085 | ||
1086 | =item glob (aka the <*>) | |
1087 | ||
1088 | =item open, opendir, sysopen | |
1089 | ||
1090 | =item qx (aka the backtick operator), system | |
1091 | ||
1092 | =item readdir, readlink | |
1093 | ||
1094 | =back | |
1095 | ||
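Because these interfaces deal in byte strings, a common workaround is to encode names explicitly on the way in and decode them on the way out. The following is a minimal sketch, assuming a file system that stores names as UTF-8 octets without altering them; that assumption is not portable, and the file name is made up for illustration.

```perl
use strict;
use warnings;
use Encode ();
use File::Temp ();

my $dir  = File::Temp::tempdir(CLEANUP => 1);
my $name = "caf\x{E9}.txt";    # a character string

# Assumption: this file system keeps names as UTF-8 octets.
my $octets = Encode::encode("UTF-8", $name);
open my $fh, ">", "$dir/$octets" or die "open: $!";
close $fh or die "close: $!";

# readdir hands back octets as well, so decode them explicitly.
opendir my $dh, $dir or die "opendir: $!";
my @found = map { Encode::decode("UTF-8", $_) }
            grep { !/^\.\.?$/ } readdir $dh;
closedir $dh;

print scalar(@found), " file(s); round trip ",
      ($found[0] eq $name ? "ok" : "differs"), "\n";
```

On a file system that normalizes or re-encodes names, the round trip may differ, which is the portability problem this section describes.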
1096 | =head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl) | |
1097 | ||
1098 | Sometimes (see L</"When Unicode Does Not Happen">) there are | |
1099 | situations where you simply need to force Perl to believe that a byte | |
1100 | string is UTF-8, or vice versa. The low-level calls | |
1101 | utf8::upgrade($bytestring) and utf8::downgrade($utf8string) are | |
1102 | the answers. | |
1103 | ||
1104 | Do not use them without careful thought, though: Perl may easily get | |
1105 | very confused, angry, or even crash, if you suddenly change the 'nature' | |
1106 | of a scalar like that. Be especially careful if you use | |
1107 | utf8::upgrade(): not every byte string is valid UTF-8. | |
1108 | ||
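A short script shows what the two calls do to a scalar, as far as the C<utf8> documentation describes them: they change the internal storage and the UTF-8 flag, while the characters compare equal before and after. This is a minimal sketch; the sample strings and variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

my $text = "caf\xE9";    # four characters, stored one byte each

my @trace;
push @trace, utf8::is_utf8($text) ? "utf8" : "bytes";

utf8::upgrade($text);    # same characters, UTF-8 internally
push @trace, utf8::is_utf8($text) ? "utf8" : "bytes";
my $still_equal = ($text eq "caf\xE9" && length($text) == 4) ? 1 : 0;

utf8::downgrade($text);  # back to one byte per character
push @trace, utf8::is_utf8($text) ? "utf8" : "bytes";

# A character above 255 cannot be downgraded; with a true second
# argument utf8::downgrade() returns false instead of dying.
my $wide       = "\x{263A}";
my $downgraded = utf8::downgrade($wide, 1) ? 1 : 0;

print "@trace $still_equal $downgraded\n";
```

If the goal is to interpret octets as UTF-8 text, or to produce octets from text, the C<Encode> module is usually the safer tool.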
95a1a48b JH |
1109 | =head2 Using Unicode in XS |
1110 | ||
3a2263fe RGS |
1111 | If you want to handle Perl Unicode in XS extensions, you may find the |
1112 | following C APIs useful. See also L<perlguts/"Unicode Support"> for an | |
1113 | explanation about Unicode at the XS level, and L<perlapi> for the API | |
1114 | details. | |
95a1a48b JH |
1115 | |
1116 | =over 4 | |
1117 | ||
1118 | =item * | |
1119 | ||
1bfb14c4 JH |
1120 | C<DO_UTF8(sv)> returns true if the C<UTF8> flag is on and the bytes |
1121 | pragma is not in effect. C<SvUTF8(sv)> returns true if the C<UTF8> | |
1122 | flag is on; the bytes pragma is ignored. The C<UTF8> flag being on | |
1123 | does B<not> mean that there are any characters of code points greater | |
1124 | than 255 (or 127) in the scalar or that there are even any characters | |
1125 | in the scalar. What the C<UTF8> flag means is that the sequence of | |
1126 | octets in the representation of the scalar is the sequence of UTF-8 | |
1127 | encoded code points of the characters of a string. The C<UTF8> flag | |
1128 | being off means that each octet in this representation encodes a | |
1129 | single character with code point 0..255 within the string. Perl's | |
1130 | Unicode model is not to use UTF-8 until it is absolutely necessary. | |
95a1a48b JH |
1131 | |
1132 | =item * | |
1133 | ||
1bfb14c4 JH |
1134 | C<uvuni_to_utf8(buf, chr)> writes a Unicode character code point into |
1135 | a buffer encoding the code point as UTF-8, and returns a pointer | |
95a1a48b JH |
1136 | pointing after the UTF-8 bytes. |
1137 | ||
1138 | =item * | |
1139 | ||
376d9008 JB |
1140 | C<utf8_to_uvuni(buf, lenp)> reads UTF-8 encoded bytes from a buffer and |
1141 | returns the Unicode character code point and, optionally, the length of | |
1142 | the UTF-8 byte sequence. | |
95a1a48b JH |
1143 | |
1144 | =item * | |
1145 | ||
376d9008 JB |
1146 | C<utf8_length(start, end)> returns the length of the UTF-8 encoded buffer |
1147 | in characters. C<sv_len_utf8(sv)> returns the length of the UTF-8 encoded | |
95a1a48b JH |
1148 | scalar. |
1149 | ||
1150 | =item * | |
1151 | ||
376d9008 JB |
1152 | C<sv_utf8_upgrade(sv)> converts the string of the scalar to its UTF-8 |
1153 | encoded form. C<sv_utf8_downgrade(sv)> does the opposite, if | |
1154 | possible. C<sv_utf8_encode(sv)> is like sv_utf8_upgrade except that | |
1155 | it does not set the C<UTF8> flag. C<sv_utf8_decode()> does the | |
1156 | opposite of C<sv_utf8_encode()>. Note that none of these are to be | |
1157 | used as general-purpose encoding or decoding interfaces: C<use Encode> | |
1158 | for that. C<sv_utf8_upgrade()> is affected by the encoding pragma | |
1159 | but C<sv_utf8_downgrade()> is not (since the encoding pragma is | |
1160 | designed to be a one-way street). | |
95a1a48b JH |
1161 | |
1162 | =item * | |
1163 | ||
376d9008 | 1164 | C<is_utf8_char(s)> returns true if the pointer points to a valid UTF-8 |
90f968e0 | 1165 | character. |
95a1a48b JH |
1166 | |
1167 | =item * | |
1168 | ||
376d9008 | 1169 | C<is_utf8_string(buf, len)> returns true if C<len> bytes of the buffer |
95a1a48b JH |
1170 | are valid UTF-8. |
1171 | ||
1172 | =item * | |
1173 | ||
376d9008 JB |
1174 | C<UTF8SKIP(buf)> will return the number of bytes in the UTF-8 encoded |
1175 | character in the buffer. C<UNISKIP(chr)> will return the number of bytes | |
1176 | required to UTF-8-encode the Unicode character code point. C<UTF8SKIP()> | |
90f968e0 | 1177 | is useful for example for iterating over the characters of a UTF-8 |
376d9008 | 1178 | encoded buffer; C<UNISKIP()> is useful, for example, in computing |
90f968e0 | 1179 | the size required for a UTF-8 encoded buffer. |
95a1a48b JH |
1180 | |
1181 | =item * | |
1182 | ||
376d9008 | 1183 | C<utf8_distance(a, b)> will tell the distance in characters between the |
95a1a48b JH |
1184 | two pointers pointing to the same UTF-8 encoded buffer. |
1185 | ||
1186 | =item * | |
1187 | ||
376d9008 JB |
1188 | C<utf8_hop(s, off)> will return a pointer to a UTF-8 encoded buffer |
1189 | that is C<off> (positive or negative) Unicode characters displaced | |
1190 | from the UTF-8 buffer C<s>. Be careful not to overstep the buffer: | |
1191 | C<utf8_hop()> will merrily run off the end or the beginning of the | |
1192 | buffer if told to do so. | |
95a1a48b | 1193 | |
d2cc3551 JH |
1194 | =item * |
1195 | ||
376d9008 JB |
1196 | C<pv_uni_display(dsv, spv, len, pvlim, flags)> and |
1197 | C<sv_uni_display(dsv, ssv, pvlim, flags)> are useful for debugging the | |
1198 | output of Unicode strings and scalars. By default they are useful | |
1199 | only for debugging--they display B<all> characters as hexadecimal code | |
1bfb14c4 JH |
1200 | points--but with the flags C<UNI_DISPLAY_ISPRINT>, |
1201 | C<UNI_DISPLAY_BACKSLASH>, and C<UNI_DISPLAY_QQ> you can make the | |
1202 | output more readable. | |
d2cc3551 JH |
1203 | |
1204 | =item * | |
1205 | ||
376d9008 JB |
1206 | C<ibcmp_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to |
1207 | compare two strings case-insensitively in Unicode. For case-sensitive | |
1208 | comparisons you can just use C<memEQ()> and C<memNE()> as usual. | |
d2cc3551 | 1209 | |
c349b1b9 JH |
1210 | =back |
1211 | ||
95a1a48b JH |
1212 | For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h> |
1213 | in the Perl source code distribution. | |
1214 | ||
c29a771d JH |
1215 | =head1 BUGS |
1216 | ||
376d9008 | 1217 | =head2 Interaction with Locales |
7eabb34d | 1218 | |
376d9008 JB |
1219 | Use of locales with Unicode data may lead to odd results. Currently, |
1220 | Perl attempts to attach 8-bit locale info to characters in the range | |
1221 | 0..255, but this technique is demonstrably incorrect for locales that | |
1222 | use characters above that range when mapped into Unicode. Perl's | |
1223 | Unicode support will also tend to run slower. Use of locales with | |
1224 | Unicode is discouraged. | |
c29a771d | 1225 | |
376d9008 | 1226 | =head2 Interaction with Extensions |
7eabb34d | 1227 | |
376d9008 | 1228 | When Perl exchanges data with an extension, the extension should be |
7eabb34d | 1229 | able to understand the UTF-8 flag and act accordingly. If the |
376d9008 JB |
1230 | extension doesn't know about the flag, it's likely that the extension |
1231 | will return incorrectly-flagged data. | |
7eabb34d A |
1232 | |
1233 | So if you're working with Unicode data, consult the documentation of | |
1234 | every module you're using to see whether there are any issues with Unicode data | |
1235 | exchange. If the documentation does not talk about Unicode at all, | |
a73d23f6 | 1236 | suspect the worst and probably look at the source to learn how the |
376d9008 | 1237 | module is implemented. Modules written completely in Perl shouldn't |
a73d23f6 RGS |
1238 | cause problems. Modules that directly or indirectly access code written |
1239 | in other programming languages are at risk. | |
7eabb34d | 1240 | |
376d9008 | 1241 | For affected functions, the simple strategy to avoid data corruption is |
7eabb34d | 1242 | to always make the encoding of the exchanged data explicit. Choose an |
376d9008 | 1243 | encoding that you know the extension can handle. Convert arguments passed |
7eabb34d A |
1244 | to the extensions to that encoding and convert results back from that |
1245 | encoding. Write wrapper functions that do the conversions for you, so | |
1246 | you can later change the functions when the extension catches up. | |
1247 | ||
376d9008 | 1248 | To provide an example, let's say the popular Foo::Bar::escape_html |
7eabb34d A |
1249 | function doesn't deal with Unicode data yet. The wrapper function |
1250 | would convert the argument to raw UTF-8 and convert the result back to | |
376d9008 | 1251 | Perl's internal representation like so: |
7eabb34d A |
1252 | |
1253 | sub my_escape_html ($) { | |
1254 | my($what) = shift; | |
1255 | return unless defined $what; | |
1256 | Encode::decode_utf8(Foo::Bar::escape_html(Encode::encode_utf8($what))); | |
1257 | } | |
1258 | ||
1259 | Sometimes, when the extension does not convert data but just stores | |
1260 | and retrieves them, you will be in a position to use the otherwise | |
1261 | dangerous Encode::_utf8_on() function. Let's say the popular | |
66b79f27 | 1262 | C<Foo::Bar> extension, written in C, provides a C<param> method that |
7eabb34d A |
1263 | lets you store and retrieve data according to these prototypes: |
1264 | ||
1265 | $self->param($name, $value); # set a scalar | |
1266 | $value = $self->param($name); # retrieve a scalar | |
1267 | ||
1268 | If it does not yet provide support for any encoding, one could write a | |
1269 | derived class with such a C<param> method: | |
1270 | ||
1271 | sub param { | |
1272 | my($self,$name,$value) = @_; | |
1273 | utf8::upgrade($name); # make sure it is UTF-8 encoded | |
1274 | if (defined $value) { | |
1275 | utf8::upgrade($value); # make sure it is UTF-8 encoded | |
1276 | return $self->SUPER::param($name,$value); | |
1277 | } else { | |
1278 | my $ret = $self->SUPER::param($name); | |
1279 | Encode::_utf8_on($ret); # we know, it is UTF-8 encoded | |
1280 | return $ret; | |
1281 | } | |
1282 | } | |
1283 | ||
a73d23f6 RGS |
1284 | Some extensions provide filters on data entry/exit points, such as |
1285 | DB_File::filter_store_key and family. Look out for such filters in | |
66b79f27 | 1286 | the documentation of your extensions; they can make the transition to |
7eabb34d A |
1287 | Unicode data much easier. |
1288 | ||
376d9008 | 1289 | =head2 Speed |
7eabb34d | 1290 | |
c29a771d | 1291 | Some functions are slower when working on UTF-8 encoded strings than |
574c8022 | 1292 | on byte encoded strings. All functions that need to hop over |
7c17141f JH |
1293 | characters such as length(), substr() or index(), or matching regular |
1294 | expressions can work B<much> faster when the underlying data are | |
1295 | byte-encoded. | |
1296 | ||
1297 | In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1 | |
1298 | a caching scheme was introduced which will hopefully make the slowness | |
1299 | somewhat less spectacular. Operations with UTF-8 encoded strings are | |
1300 | still slower, though. | |
666f95b9 | 1301 | |
c8d992ba A |
1302 | =head2 Porting code from perl-5.6.X |
1303 | ||
1304 | Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer | |
1305 | was required to use the C<utf8> pragma to declare that a given scope | |
1306 | expected to deal with Unicode data and had to make sure that only | |
1307 | Unicode data were reaching that scope. If you have code that is | |
1308 | working with 5.6, you will need some of the following adjustments to | |
1309 | your code. The examples are written such that the code will continue | |
1310 | to work under 5.6, so you should be safe to try them out. | |
1311 | ||
1312 | =over 4 | |
1313 | ||
1314 | =item * | |
1315 | ||
1316 | A filehandle that should read or write UTF-8 | |
1317 | ||
1318 | if ($] > 5.007) { | |
1319 | binmode $fh, ":utf8"; | |
1320 | } | |
1321 | ||
1322 | =item * | |
1323 | ||
1324 | A scalar that is going to be passed to some extension | |
1325 | ||
1326 | Be it Compress::Zlib, Apache::Request or any extension that has no | |
1327 | mention of Unicode in the manpage, you need to make sure that the | |
1328 | UTF-8 flag is stripped off. Note that at the time of this writing | |
1329 | (October 2002) the mentioned modules are not UTF-8-aware. Please | |
1330 | check the documentation to verify if this is still true. | |
1331 | ||
1332 | if ($] > 5.007) { | |
1333 | require Encode; | |
1334 | $val = Encode::encode_utf8($val); # make octets | |
1335 | } | |
1336 | ||
1337 | =item * | |
1338 | ||
1339 | A scalar we got back from an extension | |
1340 | ||
1341 | If you believe the scalar comes back as UTF-8, you will most likely | |
1342 | want the UTF-8 flag restored: | |
1343 | ||
1344 | if ($] > 5.007) { | |
1345 | require Encode; | |
1346 | $val = Encode::decode_utf8($val); | |
1347 | } | |
1348 | ||
1349 | =item * | |
1350 | ||
1351 | Same thing, if you are really sure it is UTF-8 | |
1352 | ||
1353 | if ($] > 5.007) { | |
1354 | require Encode; | |
1355 | Encode::_utf8_on($val); | |
1356 | } | |
1357 | ||
1358 | =item * | |
1359 | ||
1360 | A wrapper for fetchrow_array and fetchrow_hashref | |
1361 | ||
1362 | When the database contains only UTF-8, a wrapper function or method is | |
1363 | a convenient way to replace all your fetchrow_array and | |
1364 | fetchrow_hashref calls. A wrapper function will also make it easier to | |
1365 | adapt to future enhancements in your database driver. Note that at the | |
1366 | time of this writing (October 2002), the DBI has no standardized way | |
1367 | to deal with UTF-8 data. Please check the documentation to verify if | |
1368 | that is still true. | |
1369 | ||
1370 | sub fetchrow { | |
1371 | my($self, $sth, $what) = @_; # $what is one of fetchrow_{array,hashref} | |
1372 | if ($] < 5.007) { | |
1373 | return $sth->$what; | |
1374 | } else { | |
1375 | require Encode; | |
1376 | if (wantarray) { | |
1377 | my @arr = $sth->$what; | |
1378 | for (@arr) { | |
1379 | defined && /[^\000-\177]/ && Encode::_utf8_on($_); | |
1380 | } | |
1381 | return @arr; | |
1382 | } else { | |
1383 | my $ret = $sth->$what; | |
1384 | if (ref $ret) { | |
1385 | for my $k (keys %$ret) { | |
1386 | defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret->{$k}; | |
1387 | } | |
1388 | return $ret; | |
1389 | } else { | |
1390 | defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret; | |
1391 | return $ret; | |
1392 | } | |
1393 | } | |
1394 | } | |
1395 | } | |
1396 | ||
1397 | ||
1398 | =item * | |
1399 | ||
1400 | A large scalar that you know can only contain ASCII | |
1401 | ||
1402 | Scalars that contain only ASCII and are marked as UTF-8 are sometimes | |
1403 | a drag to your program. If you recognize such a situation, just remove | |
1404 | the UTF-8 flag: | |
1405 | ||
1406 | utf8::downgrade($val) if $] > 5.007; | |
1407 | ||
1408 | =back | |
1409 | ||
393fec97 GS |
1410 | =head1 SEE ALSO |
1411 | ||
72ff2908 | 1412 | L<perluniintro>, L<encoding>, L<Encode>, L<open>, L<utf8>, L<bytes>, |
a05d7ebb | 1413 | L<perlretut>, L<perlvar/"${^UNICODE}"> |
393fec97 GS |
1414 | |
1415 | =cut |