Commit | Line | Data |
---|---|---|
393fec97 GS |
1 | =head1 NAME |
2 | ||
3 | perlunicode - Unicode support in Perl | |
4 | ||
5 | =head1 DESCRIPTION | |
6 | ||
0a1f2d14 | 7 | =head2 Important Caveats |
21bad921 | 8 | |
c349b1b9 JH |
9 | Unicode support is an extensive requirement. While perl does not |
10 | implement the Unicode standard or the accompanying technical reports | |
11 | from cover to cover, Perl does support many Unicode features. | |
21bad921 | 12 | |
13a2d996 | 13 | =over 4 |
21bad921 GS |
14 | |
15 | =item Input and Output Disciplines | |
16 | ||
75daf61c JH |
17 | A filehandle can be marked as containing perl's internal Unicode |
18 | encoding (UTF-8 or UTF-EBCDIC) by opening it with the ":utf8" layer. | |
0a1f2d14 | 19 | Other encodings can be converted to perl's encoding on input, or from |
c349b1b9 JH |
20 | perl's encoding on output by use of the ":encoding(...)" layer. |
21 | See L<open>. | |
22 | ||
d1be9408 | 23 | To mark the Perl source itself as being in a particular encoding, |
c349b1b9 | 24 | see L<encoding>. |
21bad921 GS |
25 | |
26 | =item Regular Expressions | |
27 | ||
c349b1b9 JH |
28 | The regular expression compiler produces polymorphic opcodes. That is, |
29 | the pattern adapts to the data and automatically switches to the Unicode | |
30 | character scheme when presented with Unicode data, or a traditional | |
31 | byte scheme when presented with byte data. | |
21bad921 | 32 | |
ad0029c4 | 33 | =item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts |
21bad921 | 34 | |
c349b1b9 JH |
35 | As a compatibility measure, this pragma must be explicitly used to |
36 | enable recognition of UTF-8 in the Perl scripts themselves on ASCII | |
3e4dbfed | 37 | based machines, or to recognize UTF-EBCDIC on EBCDIC based machines. |
c349b1b9 JH |
38 | B<NOTE: this should be the only place where an explicit C<use utf8> |
39 | is needed>. | |
21bad921 | 40 | |
1768d7eb | 41 | You can also use the C<encoding> pragma to change the default encoding |
6ec9efec | 42 | of the data in your script; see L<encoding>. |
1768d7eb | 43 | |
21bad921 GS |
44 | =back |
45 | ||
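As a minimal sketch of the ":encoding(...)" layer described above, the following round-trips one character through an encoded filehandle. The in-memory filehandle (a reference to C<$buffer>) and the variable names are assumptions made only to keep the example self-contained; a real program would normally open a file by name.

```perl
use strict;
use warnings;

# Write one Unicode character through an encoding layer into an
# in-memory buffer (the buffer stands in for a real file here).
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "open for write: $!";
print {$out} "\x{263A}";
close($out) or die "close: $!";

# The buffer now holds the encoded bytes, not characters.
my $encoded_length = length($buffer);

# Reading back through the same layer decodes the bytes into characters.
open(my $in, '<:encoding(UTF-8)', \$buffer) or die "open for read: $!";
my $text = <$in>;
close($in);

my $decoded_length = length($text);
```

The layer does the conversion in both directions, so the program itself only ever sees characters on the decoded side.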
46 | =head2 Byte and Character semantics | |
393fec97 GS |
47 | |
48 | Beginning with version 5.6, Perl uses logically wide characters to | |
3e4dbfed | 49 | represent strings internally. |
393fec97 | 50 | |
75daf61c JH |
51 | In future, Perl-level operations can be expected to work with |
52 | characters rather than bytes, in general. | |
393fec97 | 53 | |
75daf61c JH |
54 | However, as strictly an interim compatibility measure, Perl aims to |
55 | provide a safe migration path from byte semantics to character | |
56 | semantics for programs. For operations where Perl can unambiguously | |
57 | decide that the input data is characters, Perl now switches to | |
58 | character semantics. For operations where this determination cannot | |
59 | be made without additional information from the user, Perl decides in | |
60 | favor of compatibility, and chooses to use byte semantics. | |
8cbd9a7a GS |
61 | |
62 | This behavior preserves compatibility with earlier versions of Perl, | |
63 | which allowed byte semantics in Perl operations, but only as long as | |
64 | none of the program's inputs are marked as being a source of Unicode | |
65 | character data. Such data may come from filehandles, from calls to | |
66 | external programs, from information provided by the system (such as %ENV), | |
21bad921 | 67 | or from literals and constants in the source text. |
8cbd9a7a | 68 | |
c349b1b9 | 69 | On Windows platforms, if the C<-C> command line switch is used (or the |
75daf61c JH |
70 | ${^WIDE_SYSTEM_CALLS} global flag is set to C<1>), all system calls |
71 | will use the corresponding wide character APIs. Note that this is | |
c349b1b9 JH |
72 | currently only implemented on Windows since other platforms lack an |
73 | API standard in this area. | |
8cbd9a7a | 74 | |
75daf61c JH |
75 | Regardless of the above, the C<bytes> pragma can always be used to |
76 | force byte semantics in a particular lexical scope. See L<bytes>. | |
8cbd9a7a GS |
77 | |
78 | The C<utf8> pragma is primarily a compatibility device that enables | |
75daf61c | 79 | recognition of UTF-(8|EBCDIC) in literals encountered by the parser. |
7dedd01f JH |
80 | Note that this pragma is only required until a future version of Perl |
81 | in which character semantics will become the default. This pragma may | |
82 | then become a no-op. See L<utf8>. | |
8cbd9a7a GS |
83 | |
84 | Unless mentioned otherwise, Perl operators will use character semantics | |
85 | when they are dealing with Unicode data, and byte semantics otherwise. | |
86 | Thus, character semantics for these operations apply transparently; if | |
87 | the input data came from a Unicode source (for example, by adding a | |
88 | character encoding discipline to the filehandle whence it came, or a | |
3e4dbfed | 89 | literal Unicode string constant in the program), character semantics |
8cbd9a7a | 90 | apply; otherwise, byte semantics are in effect. To force byte semantics |
8058d7ab | 91 | on Unicode data, the C<bytes> pragma should be used. |
393fec97 | 92 | |
0a378802 JH |
93 | Notice that if you concatenate strings with byte semantics and strings |
94 | with Unicode character data, the bytes will by default be upgraded | |
95 | I<as if they were ISO 8859-1 (Latin-1)> (or if in EBCDIC, after a | |
3e4dbfed JF |
96 | translation to ISO 8859-1). This is done without regard to the |
97 | system's native 8-bit encoding, so to change this for systems with | |
98 | non-Latin-1 (or non-EBCDIC) native encodings, use the C<encoding> | |
0a378802 | 99 | pragma; see L<encoding>. |
7dedd01f | 100 | |
feda178f JH |
101 | Under character semantics, many operations that formerly operated on |
102 | bytes change to operating on characters. A character in Perl is | |
103 | logically just a number ranging from 0 to 2**31 or so. Larger | |
104 | characters may encode to longer sequences of bytes internally, but | |
105 | this is just an internal detail which is hidden at the Perl level. | |
106 | See L<perluniintro> for more on this. | |
393fec97 | 107 | |
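The difference between character and byte semantics can be illustrated with C<length()>. This is a sketch only; the variable names are illustrative, and the byte count shown in the comment assumes an ASCII-based platform where the internal encoding is UTF-8.

```perl
use strict;
use warnings;

my $smiley = "\x{263A}";    # one character with an ordinal above 255

# Character semantics: the string is one character long.
my $char_length = length($smiley);

# Byte semantics, forced in a lexical scope with the bytes pragma:
# length() now counts the bytes of the internal encoding
# (three bytes for U+263A when that encoding is UTF-8).
my $byte_length;
{
    use bytes;
    $byte_length = length($smiley);
}
```

Outside the block that says C<use bytes>, the same string goes back to being measured in characters.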
8cbd9a7a | 108 | =head2 Effects of character semantics |
393fec97 GS |
109 | |
110 | Character semantics have the following effects: | |
111 | ||
112 | =over 4 | |
113 | ||
114 | =item * | |
115 | ||
574c8022 JH |
116 | Strings (including hash keys) and regular expression patterns may |
117 | contain characters that have an ordinal value larger than 255. | |
393fec97 | 118 | |
feda178f JH |
119 | If you use a Unicode editor to edit your program, Unicode characters |
120 | may occur directly within the literal strings in one of the various | |
121 | Unicode encodings (UTF-8, UTF-EBCDIC, UCS-2, etc.), but are recognized | |
122 | as such (and converted to Perl's internal representation) only if the | |
123 | appropriate L<encoding> is specified. | |
3e4dbfed JF |
124 | |
125 | You can also get Unicode characters into a string by using the C<\x{...}> | |
126 | notation, putting the Unicode code for the desired character, in | |
127 | hexadecimal, into the curlies. For instance, a smiley face is C<\x{263A}>. | |
128 | This works only for characters with a code 0x100 and above. | |
129 | ||
130 | Additionally, if you | |
574c8022 | 131 | |
3e4dbfed | 132 | use charnames ':full'; |
574c8022 | 133 | |
3e4dbfed JF |
134 | you can use the C<\N{...}> notation, putting the official Unicode character |
135 | name within the curlies. For example, C<\N{WHITE SMILING FACE}>. | |
136 | This works for all characters that have names. | |
393fec97 GS |
137 | |
138 | =item * | |
139 | ||
574c8022 JH |
140 | If an appropriate L<encoding> is specified, identifiers within the |
141 | Perl script may contain Unicode alphanumeric characters, including | |
142 | ideographs. (You are currently on your own when it comes to using the | |
143 | canonical forms of characters--Perl doesn't (yet) attempt to | |
144 | canonicalize variable names for you.) | |
393fec97 | 145 | |
393fec97 GS |
146 | =item * |
147 | ||
148 | Regular expressions match characters instead of bytes. For instance, | |
149 | "." matches a character instead of a byte. (However, the C<\C> pattern | |
75daf61c | 150 | is provided to force a match of a single byte ("C<char>" in C, hence C<\C>).) |
393fec97 | 151 | |
393fec97 GS |
152 | =item * |
153 | ||
154 | Character classes in regular expressions match characters instead of | |
155 | bytes, and match against the character properties specified in the | |
75daf61c JH |
156 | Unicode properties database. So C<\w> can be used to match an |
157 | ideograph, for instance. | |
393fec97 | 158 | |
393fec97 GS |
159 | =item * |
160 | ||
eb0cc9e3 JH |
161 | Named Unicode properties, scripts, and block ranges may be used like |
162 | character classes via the new C<\p{}> (matches property) and C<\P{}> | |
163 | (doesn't match property) constructs. For instance, C<\p{Lu}> matches any | |
feda178f | 164 | character with the Unicode "Lu" (Letter, uppercase) property, while |
ec90690f | 165 | C<\p{M}> matches any character with an "M" (mark -- accents and such) |
eb0cc9e3 JH |
166 | property. Single-letter properties may omit the braces, so that can be |
167 | written C<\pM> also. Many predefined properties are available, such | |
168 | as C<\p{Mirrored}> and C<\p{Tibetan}>. | |
4193bef7 | 169 | |
cfc01aea | 170 | The official Unicode script and block names have spaces and dashes as |
eb0cc9e3 JH |
171 | separators, but for convenience you can have dashes, spaces, and underbars |
172 | at every word division, and you need not care about correct casing. It is | |
173 | recommended, however, that for consistency you use the following naming: | |
174 | the official Unicode script, block, or property name (see below for the | |
175 | additional rules that apply to block names), with whitespace and dashes | |
176 | removed, and the words "uppercase-first-lowercase-rest". That is, "Latin-1 | |
177 | Supplement" becomes "Latin1Supplement". | |
4193bef7 | 178 | |
a1cc1cb1 | 179 | You can also negate both C<\p{}> and C<\P{}> by introducing a caret |
eb0cc9e3 JH |
180 | (^) between the first curly and the property name: C<\p{^Tamil}> is |
181 | equal to C<\P{Tamil}>. | |
4193bef7 | 182 | |
eb0cc9e3 JH |
183 | Here are the basic Unicode General Category properties, followed by their |
184 | long form (you can use either, e.g. C<\p{Lu}> and C<\p{UppercaseLetter}> | |
185 | are identical). | |
393fec97 | 186 | |
d73e5302 JH |
187 | Short Long |
188 | ||
189 | L Letter | |
eb0cc9e3 JH |
190 | Lu UppercaseLetter |
191 | Ll LowercaseLetter | |
192 | Lt TitlecaseLetter | |
193 | Lm ModifierLetter | |
194 | Lo OtherLetter | |
d73e5302 JH |
195 | |
196 | M Mark | |
eb0cc9e3 JH |
197 | Mn NonspacingMark |
198 | Mc SpacingMark | |
199 | Me EnclosingMark | |
d73e5302 JH |
200 | |
201 | N Number | |
eb0cc9e3 JH |
202 | Nd DecimalNumber |
203 | Nl LetterNumber | |
204 | No OtherNumber | |
d73e5302 JH |
205 | |
206 | P Punctuation | |
eb0cc9e3 JH |
207 | Pc ConnectorPunctuation |
208 | Pd DashPunctuation | |
209 | Ps OpenPunctuation | |
210 | Pe ClosePunctuation | |
211 | Pi InitialPunctuation | |
d73e5302 | 212 | (may behave like Ps or Pe depending on usage) |
eb0cc9e3 | 213 | Pf FinalPunctuation |
d73e5302 | 214 | (may behave like Ps or Pe depending on usage) |
eb0cc9e3 | 215 | Po OtherPunctuation |
d73e5302 JH |
216 | |
217 | S Symbol | |
eb0cc9e3 JH |
218 | Sm MathSymbol |
219 | Sc CurrencySymbol | |
220 | Sk ModifierSymbol | |
221 | So OtherSymbol | |
d73e5302 JH |
222 | |
223 | Z Separator | |
eb0cc9e3 JH |
224 | Zs SpaceSeparator |
225 | Zl LineSeparator | |
226 | Zp ParagraphSeparator | |
d73e5302 JH |
227 | |
228 | C Other | |
e150c829 JH |
229 | Cc Control |
230 | Cf Format | |
eb0cc9e3 JH |
231 | Cs Surrogate (not usable) |
232 | Co PrivateUse | |
e150c829 | 233 | Cn Unassigned |
1ac13f9a | 234 | |
3e4dbfed JF |
235 | The single-letter properties match all characters in any of the |
236 | two-letter sub-properties starting with the same letter. | |
1ac13f9a | 237 | There's also C<L&> which is an alias for C<Ll>, C<Lu>, and C<Lt>. |
32293815 | 238 | |
eb0cc9e3 JH |
239 | Because Perl hides the need for the user to understand the internal |
240 | representation of Unicode characters, it has no need to support the | |
241 | somewhat messy concept of surrogates. Therefore, the C<Cs> property is not | |
242 | supported. | |
d73e5302 | 243 | |
eb0cc9e3 JH |
244 | Because scripts differ in their directionality (for example Hebrew is |
245 | written right to left), Unicode supplies these properties: | |
32293815 | 246 | |
eb0cc9e3 | 247 | Property Meaning |
92e830a9 | 248 | |
d73e5302 JH |
249 | BidiL Left-to-Right |
250 | BidiLRE Left-to-Right Embedding | |
251 | BidiLRO Left-to-Right Override | |
252 | BidiR Right-to-Left | |
253 | BidiAL Right-to-Left Arabic | |
254 | BidiRLE Right-to-Left Embedding | |
255 | BidiRLO Right-to-Left Override | |
256 | BidiPDF Pop Directional Format | |
257 | BidiEN European Number | |
258 | BidiES European Number Separator | |
259 | BidiET European Number Terminator | |
260 | BidiAN Arabic Number | |
261 | BidiCS Common Number Separator | |
262 | BidiNSM Non-Spacing Mark | |
263 | BidiBN Boundary Neutral | |
264 | BidiB Paragraph Separator | |
265 | BidiS Segment Separator | |
266 | BidiWS Whitespace | |
267 | BidiON Other Neutrals | |
32293815 | 268 | |
eb0cc9e3 JH |
269 | For example, C<\p{BidiR}> matches all characters that are normally |
270 | written right to left. | |
271 | ||
210b36aa AMS |
272 | =back |
273 | ||
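The literal notations and property constructs listed above can be tried out with a short script. This is a sketch under the assumption of a reasonably recent Perl; the variable names are illustrative and not part of any API.

```perl
use strict;
use warnings;
use charnames ':full';

# Two ways of writing the same character in a literal.
my $by_code = "\x{263A}";
my $by_name = "\N{WHITE SMILING FACE}";
my $same_character = $by_code eq $by_name ? 1 : 0;

# \p{...} matches a property, \P{...} and \p{^...} negate it.
my $upper_matches   = "A" =~ /^\p{Lu}$/  ? 1 : 0;
my $lower_is_not_lu = "a" =~ /^\P{Lu}$/  ? 1 : 0;
my $caret_negation  = "a" =~ /^\p{^Lu}$/ ? 1 : 0;

# Single-letter properties may be written without braces.
my $combining_acute_is_mark = "\x{0301}" =~ /^\pM$/ ? 1 : 0;

# \w consults the Unicode properties, so it matches an ideograph.
my $ideograph_is_word = "\x{4E2D}" =~ /^\w$/ ? 1 : 0;
```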
2796c109 JH |
274 | =head2 Scripts |
275 | ||
eb0cc9e3 | 276 | The scripts available via C<\p{...}> and C<\P{...}>, for example |
66b79f27 | 277 | C<\p{Latin}> or C<\p{Cyrillic}>, are as follows: |
2796c109 | 278 | |
1ac13f9a | 279 | Arabic |
e9ad1727 | 280 | Armenian |
1ac13f9a | 281 | Bengali |
e9ad1727 | 282 | Bopomofo |
1d81abf3 | 283 | Buhid |
eb0cc9e3 | 284 | CanadianAboriginal |
e9ad1727 JH |
285 | Cherokee |
286 | Cyrillic | |
287 | Deseret | |
288 | Devanagari | |
289 | Ethiopic | |
290 | Georgian | |
291 | Gothic | |
292 | Greek | |
1ac13f9a | 293 | Gujarati |
e9ad1727 JH |
294 | Gurmukhi |
295 | Han | |
296 | Hangul | |
1d81abf3 | 297 | Hanunoo |
e9ad1727 JH |
298 | Hebrew |
299 | Hiragana | |
300 | Inherited | |
1ac13f9a | 301 | Kannada |
e9ad1727 JH |
302 | Katakana |
303 | Khmer | |
1ac13f9a | 304 | Lao |
e9ad1727 JH |
305 | Latin |
306 | Malayalam | |
307 | Mongolian | |
1ac13f9a | 308 | Myanmar |
1ac13f9a | 309 | Ogham |
eb0cc9e3 | 310 | OldItalic |
e9ad1727 | 311 | Oriya |
1ac13f9a | 312 | Runic |
e9ad1727 JH |
313 | Sinhala |
314 | Syriac | |
1d81abf3 JH |
315 | Tagalog |
316 | Tagbanwa | |
e9ad1727 JH |
317 | Tamil |
318 | Telugu | |
319 | Thaana | |
320 | Thai | |
321 | Tibetan | |
1ac13f9a | 322 | Yi |
1ac13f9a JH |
323 | |
324 | There are also extended property classes that supplement the basic | |
325 | properties, defined by the F<PropList> Unicode database: | |
326 | ||
1d81abf3 | 327 | ASCIIHexDigit |
eb0cc9e3 | 328 | BidiControl |
1ac13f9a | 329 | Dash |
1d81abf3 | 330 | Deprecated |
1ac13f9a JH |
331 | Diacritic |
332 | Extender | |
1d81abf3 | 333 | GraphemeLink |
eb0cc9e3 | 334 | HexDigit |
e9ad1727 JH |
335 | Hyphen |
336 | Ideographic | |
1d81abf3 JH |
337 | IDSBinaryOperator |
338 | IDSTrinaryOperator | |
eb0cc9e3 | 339 | JoinControl |
1d81abf3 | 340 | LogicalOrderException |
eb0cc9e3 JH |
341 | NoncharacterCodePoint |
342 | OtherAlphabetic | |
1d81abf3 JH |
343 | OtherDefaultIgnorableCodePoint |
344 | OtherGraphemeExtend | |
eb0cc9e3 JH |
345 | OtherLowercase |
346 | OtherMath | |
347 | OtherUppercase | |
348 | QuotationMark | |
1d81abf3 JH |
349 | Radical |
350 | SoftDotted | |
351 | TerminalPunctuation | |
352 | UnifiedIdeograph | |
eb0cc9e3 | 353 | WhiteSpace |
1ac13f9a JH |
354 | |
355 | and further derived properties: | |
356 | ||
eb0cc9e3 JH |
357 | Alphabetic Lu + Ll + Lt + Lm + Lo + OtherAlphabetic |
358 | Lowercase Ll + OtherLowercase | |
359 | Uppercase Lu + OtherUppercase | |
360 | Math Sm + OtherMath | |
1ac13f9a JH |
361 | |
362 | ID_Start Lu + Ll + Lt + Lm + Lo + Nl | |
363 | ID_Continue ID_Start + Mn + Mc + Nd + Pc | |
364 | ||
365 | Any Any character | |
66b79f27 RGS |
366 | Assigned Any non-Cn character (i.e. synonym for \P{Cn}) |
367 | Unassigned Synonym for \p{Cn} | |
1ac13f9a | 368 | Common Any character (or unassigned code point) |
e150c829 | 369 | not explicitly assigned to a script |
2796c109 | 370 | |
7eabb34d | 371 | For backward compatibility, all properties mentioned so far may have C<Is> |
eb0cc9e3 JH |
372 | prepended to their name (e.g. C<\P{IsLu}> is equal to C<\P{Lu}>). |
373 | ||
2796c109 JH |
374 | =head2 Blocks |
375 | ||
eb0cc9e3 JH |
376 | In addition to B<scripts>, Unicode also defines B<blocks> of characters. |
377 | The difference between scripts and blocks is that the scripts concept is | |
378 | closer to natural languages, while the blocks concept is more an artificial | |
379 | grouping based on groups of mostly 256 Unicode characters. For example, the | |
380 | C<Latin> script contains letters from many blocks. On the other hand, the | |
381 | C<Latin> script does not contain all the characters from those blocks. It | |
382 | does not, for example, contain digits because digits are shared across many | |
383 | scripts. Digits and other similar groups, like punctuation, are in a | |
384 | category called C<Common>. | |
2796c109 | 385 | |
cfc01aea JF |
386 | For more about scripts, see the UTR #24: |
387 | ||
388 | http://www.unicode.org/unicode/reports/tr24/ | |
389 | ||
390 | For more about blocks, see: | |
391 | ||
392 | http://www.unicode.org/Public/UNIDATA/Blocks.txt | |
2796c109 | 393 | |
eb0cc9e3 | 394 | Block names are given with the C<In> prefix. For example, the |
92e830a9 | 395 | Katakana block is referenced via C<\p{InKatakana}>. The C<In> |
7eabb34d | 396 | prefix may be omitted if there is no naming conflict with a script |
eb0cc9e3 JH |
397 | or any other property, but it is recommended that C<In> always be used |
398 | to avoid confusion. | |
399 | ||
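The script versus block distinction can be seen with an ASCII digit, which lies in the Basic Latin block but belongs to C<Common> rather than to the C<Latin> script. The following is a sketch only, with illustrative variable names.

```perl
use strict;
use warnings;

my $digit  = "1";
my $letter = "A";

# Both characters sit in the Basic Latin block.
my $digit_in_block  = $digit  =~ /^\p{InBasicLatin}$/ ? 1 : 0;
my $letter_in_block = $letter =~ /^\p{InBasicLatin}$/ ? 1 : 0;

# Only the letter belongs to the Latin script; the digit is shared
# across scripts and is therefore classified as Common.
my $letter_in_latin = $letter =~ /^\p{Latin}$/  ? 1 : 0;
my $digit_in_latin  = $digit  =~ /^\p{Latin}$/  ? 1 : 0;
my $digit_in_common = $digit  =~ /^\p{Common}$/ ? 1 : 0;
```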
400 | These block names are supported: | |
401 | ||
1d81abf3 JH |
402 | InAlphabeticPresentationForms |
403 | InArabic | |
404 | InArabicPresentationFormsA | |
405 | InArabicPresentationFormsB | |
406 | InArmenian | |
407 | InArrows | |
408 | InBasicLatin | |
409 | InBengali | |
410 | InBlockElements | |
411 | InBopomofo | |
412 | InBopomofoExtended | |
413 | InBoxDrawing | |
414 | InBraillePatterns | |
415 | InBuhid | |
416 | InByzantineMusicalSymbols | |
417 | InCJKCompatibility | |
418 | InCJKCompatibilityForms | |
419 | InCJKCompatibilityIdeographs | |
420 | InCJKCompatibilityIdeographsSupplement | |
421 | InCJKRadicalsSupplement | |
422 | InCJKSymbolsAndPunctuation | |
423 | InCJKUnifiedIdeographs | |
424 | InCJKUnifiedIdeographsExtensionA | |
425 | InCJKUnifiedIdeographsExtensionB | |
426 | InCherokee | |
427 | InCombiningDiacriticalMarks | |
428 | InCombiningDiacriticalMarksforSymbols | |
429 | InCombiningHalfMarks | |
430 | InControlPictures | |
431 | InCurrencySymbols | |
432 | InCyrillic | |
433 | InCyrillicSupplementary | |
434 | InDeseret | |
435 | InDevanagari | |
436 | InDingbats | |
437 | InEnclosedAlphanumerics | |
438 | InEnclosedCJKLettersAndMonths | |
439 | InEthiopic | |
440 | InGeneralPunctuation | |
441 | InGeometricShapes | |
442 | InGeorgian | |
443 | InGothic | |
444 | InGreekExtended | |
445 | InGreekAndCoptic | |
446 | InGujarati | |
447 | InGurmukhi | |
448 | InHalfwidthAndFullwidthForms | |
449 | InHangulCompatibilityJamo | |
450 | InHangulJamo | |
451 | InHangulSyllables | |
452 | InHanunoo | |
453 | InHebrew | |
454 | InHighPrivateUseSurrogates | |
455 | InHighSurrogates | |
456 | InHiragana | |
457 | InIPAExtensions | |
458 | InIdeographicDescriptionCharacters | |
459 | InKanbun | |
460 | InKangxiRadicals | |
461 | InKannada | |
462 | InKatakana | |
463 | InKatakanaPhoneticExtensions | |
464 | InKhmer | |
465 | InLao | |
466 | InLatin1Supplement | |
467 | InLatinExtendedA | |
468 | InLatinExtendedAdditional | |
469 | InLatinExtendedB | |
470 | InLetterlikeSymbols | |
471 | InLowSurrogates | |
472 | InMalayalam | |
473 | InMathematicalAlphanumericSymbols | |
474 | InMathematicalOperators | |
475 | InMiscellaneousMathematicalSymbolsA | |
476 | InMiscellaneousMathematicalSymbolsB | |
477 | InMiscellaneousSymbols | |
478 | InMiscellaneousTechnical | |
479 | InMongolian | |
480 | InMusicalSymbols | |
481 | InMyanmar | |
482 | InNumberForms | |
483 | InOgham | |
484 | InOldItalic | |
485 | InOpticalCharacterRecognition | |
486 | InOriya | |
487 | InPrivateUseArea | |
488 | InRunic | |
489 | InSinhala | |
490 | InSmallFormVariants | |
491 | InSpacingModifierLetters | |
492 | InSpecials | |
493 | InSuperscriptsAndSubscripts | |
494 | InSupplementalArrowsA | |
495 | InSupplementalArrowsB | |
496 | InSupplementalMathematicalOperators | |
497 | InSupplementaryPrivateUseAreaA | |
498 | InSupplementaryPrivateUseAreaB | |
499 | InSyriac | |
500 | InTagalog | |
501 | InTagbanwa | |
502 | InTags | |
503 | InTamil | |
504 | InTelugu | |
505 | InThaana | |
506 | InThai | |
507 | InTibetan | |
508 | InUnifiedCanadianAboriginalSyllabics | |
509 | InVariationSelectors | |
510 | InYiRadicals | |
511 | InYiSyllables | |
32293815 | 512 | |
210b36aa AMS |
513 | =over 4 |
514 | ||
393fec97 GS |
515 | =item * |
516 | ||
c29a771d | 517 | The special pattern C<\X> matches any extended Unicode sequence |
393fec97 GS |
518 | (a "combining character sequence" in Standardese), where the first |
519 | character is a base character and subsequent characters are mark | |
520 | characters that apply to the base character. It is equivalent to | |
521 | C<(?:\PM\pM*)>. | |
522 | ||
393fec97 GS |
523 | =item * |
524 | ||
383e7cdd JH |
525 | The C<tr///> operator translates characters instead of bytes. Note |
526 | that the C<tr///CU> functionality has been removed, as the interface | |
527 | was a mistake. For similar functionality see pack('U0', ...) and | |
528 | pack('C0', ...). | |
393fec97 | 529 | |
393fec97 GS |
530 | =item * |
531 | ||
532 | Case translation operators use the Unicode case translation tables | |
44bc797b JH |
533 | when provided character input. Note that C<uc()> (also known as C<\U> |
534 | in doublequoted strings) translates to uppercase, while C<ucfirst> | |
535 | (also known as C<\u> in doublequoted strings) translates to titlecase | |
536 | (for languages that make the distinction). Naturally the | |
537 | corresponding backslash sequences have the same semantics. | |
393fec97 GS |
538 | |
539 | =item * | |
540 | ||
541 | Most operators that deal with positions or lengths in the string will | |
75daf61c JH |
542 | automatically switch to using character positions, including |
543 | C<chop()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>, | |
544 | C<sprintf()>, C<write()>, and C<length()>. Operators that | |
545 | specifically don't switch include C<vec()>, C<pack()>, and | |
546 | C<unpack()>. Operators that really don't care include C<chomp()>, as | |
547 | well as any other operator that treats a string as a bucket of bits, | |
548 | such as C<sort()>, and the operators dealing with filenames. | |
393fec97 GS |
549 | |
550 | =item * | |
551 | ||
552 | The C<pack()>/C<unpack()> letters "C<c>" and "C<C>" do I<not> change, | |
553 | since they're often used for byte-oriented formats. (Again, think | |
554 | "C<char>" in the C language.) However, there is a new "C<U>" specifier | |
3e4dbfed | 555 | that will convert between Unicode characters and integers. |
393fec97 GS |
556 | |
557 | =item * | |
558 | ||
559 | The C<chr()> and C<ord()> functions work on characters. This is like | |
560 | C<pack("U")> and C<unpack("U")>, not like C<pack("C")> and | |
561 | C<unpack("C")>. In fact, the latter are how you now emulate | |
35bcd338 | 562 | byte-oriented C<chr()> and C<ord()> for Unicode strings. |
3e4dbfed JF |
563 | (Note that this reveals the internal encoding of Unicode strings, |
564 | which is not something one normally needs to care about at all.) | |
393fec97 GS |
565 | |
566 | =item * | |
567 | ||
a1ca4561 YST |
568 | The bit string operators C<& | ^ ~> can operate on character data. |
569 | However, for backward compatibility reasons (bit string operations | |
75daf61c JH |
570 | when the characters all are less than 256 in ordinal value) one should |
571 | not mix C<~> (the bit complement) and characters both less than 256 and | |
a1ca4561 YST |
572 | equal to or greater than 256. Most importantly, DeMorgan's laws |
573 | (C<~($x|$y) eq ~$x&~$y>, C<~($x&$y) eq ~$x|~$y>) won't hold. | |
574 | Another way to look at this is that the complement cannot return | |
75daf61c | 575 | B<both> the 8-bit (byte) wide bit complement B<and> the full character |
a1ca4561 YST |
576 | wide bit complement. |
577 | ||
578 | =item * | |
579 | ||
983ffd37 JH |
580 | lc(), uc(), lcfirst(), and ucfirst() work for the following cases: |
581 | ||
582 | =over 8 | |
583 | ||
584 | =item * | |
585 | ||
586 | the case mapping is from a single Unicode character to another | |
587 | single Unicode character | |
588 | ||
589 | =item * | |
590 | ||
591 | the case mapping is from a single Unicode character to more | |
592 | than one Unicode character | |
593 | ||
594 | =back | |
595 | ||
210b36aa | 596 | What doesn't yet work are the following cases: |
983ffd37 JH |
597 | |
598 | =over 8 | |
599 | ||
600 | =item * | |
601 | ||
602 | the "final sigma" (Greek) | |
603 | ||
604 | =item * | |
605 | ||
606 | anything to do with locales (Lithuanian, Turkish, Azeri) | |
607 | ||
608 | =back | |
609 | ||
610 | See the Unicode Technical Report #21, Case Mappings, for more details. | |
ac1256e8 JH |
611 | |
612 | =item * | |
613 | ||
393fec97 GS |
614 | And finally, C<scalar reverse()> reverses by character rather than by byte. |
615 | ||
616 | =back | |
617 | ||
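Several of the operator behaviors listed above can be checked on a short string of characters above 255. This is a sketch with illustrative variable names; it assumes the full case mappings shipped with a reasonably recent Perl.

```perl
use strict;
use warnings;

my $kana = "\x{3042}\x{3044}\x{3046}";    # three hiragana characters

# Position and length operators count characters, not bytes.
my $length   = length($kana);
my $middle   = substr($kana, 1, 1);
my $position = index($kana, "\x{3046}");
my $reversed = scalar reverse($kana);

# chr()/ord() work on characters, like pack("U")/unpack("U").
my $ordinal  = ord(chr(0x263A));
my $packed   = pack("U", 0x263A);
my $unpacked = unpack("U", "\x{263A}");

# uc() gives uppercase while ucfirst() gives titlecase; the two
# differ for U+01C6 (LATIN SMALL LETTER DZ WITH CARON).
my $upper = uc("\x{1C6}");
my $title = ucfirst("\x{1C6}");

# A case mapping from one character to more than one character.
my $ligature_upper = uc("\x{FB00}");    # LATIN SMALL LIGATURE FF

# \X matches a base character plus its combining marks as a unit.
my $combined     = "e\x{301}";
my $one_sequence = $combined =~ /^\X$/ ? 1 : 0;
```

Note that C<$combined> is still two characters long as far as C<length()> is concerned; only C<\X> treats the pair as one unit.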
237bad5b | 618 | =head2 User-defined Character Properties |
491fd90a JH |
619 | |
620 | You can define your own character properties by defining subroutines | |
621 | that have names beginning with "In" or "Is". The subroutines must be | |
622 | visible in the package that uses the properties. The user-defined | |
623 | properties can be used in the regular expression C<\p> and C<\P> | |
624 | constructs. | |
625 | ||
626 | The subroutines must return a specially formatted string: one or more | |
627 | newline-separated lines. Each line must be one of the following: | |
628 | ||
629 | =over 4 | |
630 | ||
631 | =item * | |
632 | ||
99a6b1f0 DK |
633 | Two hexadecimal numbers separated by horizontal whitespace (space or |
634 | tabulator characters) denoting a range of Unicode codepoints to include. | |
491fd90a JH |
635 | |
636 | =item * | |
637 | ||
11ef8fdd JH |
638 | Something to include, prefixed by "+": either a built-in character |
639 | property (prefixed by "utf8::"), for all the characters in that | |
640 | property; or two hexadecimal codepoints for a range; or a single | |
641 | hexadecimal codepoint. | |
491fd90a JH |
642 | |
643 | =item * | |
644 | ||
11ef8fdd JH |
645 | Something to exclude, prefixed by "-": either an existing character |
646 | property (prefixed by "utf8::"), for all the characters in that | |
647 | property; or two hexadecimal codepoints for a range; or a single | |
648 | hexadecimal codepoint. | |
491fd90a JH |
649 | |
650 | =item * | |
651 | ||
11ef8fdd JH |
652 | Something to negate, prefixed by "!": either an existing character |
653 | property (prefixed by "utf8::") for all the characters except the | |
654 | characters in the property; or two hexadecimal codepoints for a range; | |
655 | or a single hexadecimal codepoint. | |
491fd90a JH |
656 | |
657 | =back | |
658 | ||
659 | For example, to define a property that covers both the Japanese | |
660 | syllabaries (hiragana and katakana), you can define | |
661 | ||
662 | sub InKana { | |
d5822f25 A |
663 | return <<END; |
664 | 3040\t309F | |
665 | 30A0\t30FF | |
491fd90a JH |
666 | END |
667 | } | |
668 | ||
d5822f25 A |
669 | Imagine that the here-doc end marker is at the beginning of the line. |
670 | Now you can use C<\p{InKana}> and C<\P{InKana}>. | |
491fd90a JH |
671 | |
672 | You could also have used the existing block property names: | |
673 | ||
674 | sub InKana { | |
675 | return <<'END'; | |
676 | +utf8::InHiragana | |
677 | +utf8::InKatakana | |
678 | END | |
679 | } | |
680 | ||
681 | Suppose you wanted to match only the allocated characters, | |
d5822f25 | 682 | not the raw block ranges: in other words, you want to remove |
491fd90a JH |
683 | the non-characters: |
684 | ||
685 | sub InKana { | |
686 | return <<'END'; | |
687 | +utf8::InHiragana | |
688 | +utf8::InKatakana | |
689 | -utf8::IsCn | |
690 | END | |
691 | } | |
692 | ||
693 | The negation is useful for defining (surprise!) negated classes. | |
694 | ||
695 | sub InNotKana { | |
696 | return <<'END'; | |
697 | !utf8::InHiragana | |
698 | -utf8::InKatakana | |
699 | +utf8::IsCn | |
700 | END | |
701 | } | |
702 | ||
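To see a user-defined property in action, the C<InKana> definition from above can be combined with a few matches. This is a sketch only: the sample characters and variable names are chosen for illustration, and the subroutine is assumed to live in the same package as the patterns that use it.

```perl
use strict;
use warnings;

# The property is an ordinary subroutine returning one range per line;
# the interpolating here-doc turns each \t into a real tab.
sub InKana {
    return <<"END";
3040\t309F
30A0\t30FF
END
}

my $hiragana_a  = "\x{3042}";    # inside the first range
my $katakana_ka = "\x{30AB}";    # inside the second range

my $hiragana_matches = $hiragana_a  =~ /^\p{InKana}$/ ? 1 : 0;
my $katakana_matches = $katakana_ka =~ /^\p{InKana}$/ ? 1 : 0;
my $latin_excluded   = "A"          =~ /^\P{InKana}$/ ? 1 : 0;
```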
8cbd9a7a GS |
703 | =head2 Character encodings for input and output |
704 | ||
7221edc9 | 705 | See L<Encode>. |
8cbd9a7a | 706 | |
c29a771d | 707 | =head2 Unicode Regular Expression Support Level |
776f8809 JH |
708 | |
709 | The following list of Unicode regular expression support describes | |
710 | feature by feature the Unicode support implemented in Perl as of Perl | |
711 | 5.8.0. The "Level N" and the section numbers refer to the Unicode | |
712 | Technical Report 18, "Unicode Regular Expression Guidelines". | |
713 | ||
714 | =over 4 | |
715 | ||
716 | =item * | |
717 | ||
718 | Level 1 - Basic Unicode Support | |
719 | ||
720 | 2.1 Hex Notation - done [1] | |
3bfdc84c | 721 | Named Notation - done [2] |
776f8809 JH |
722 | 2.2 Categories - done [3][4] |
723 | 2.3 Subtraction - MISSING [5][6] | |
724 | 2.4 Simple Word Boundaries - done [7] | |
78d3e1bf | 725 | 2.5 Simple Loose Matches - done [8] |
776f8809 JH |
726 | 2.6 End of Line - MISSING [9][10] |
727 | ||
728 | [ 1] \x{...} | |
729 | [ 2] \N{...} | |
eb0cc9e3 | 730 | [ 3] . \p{...} \P{...} |
29bdacb8 | 731 | [ 4] now scripts (see UTR#24 Script Names) in addition to blocks |
776f8809 | 732 | [ 5] have negation |
237bad5b JH |
733 | [ 6] can use regular expression look-ahead [a] |
734 | or user-defined character properties [b] to emulate subtraction | |
776f8809 | 735 | [ 7] include Letters in word characters |
e0f9d4a8 JH |
736 | [ 8] note that perl does Full casefolding in matching, not Simple: |
737 | for example U+1F88 is equivalent to U+1F00 U+03B9, | |
738 | not to U+1F80. This difference matters for certain Greek | |
739 | capital letters with certain modifiers: the Full casefolding | |
740 | decomposes the letter, while the Simple casefolding would map | |
741 | it to a single character. | |
776f8809 | 742 | [ 9] see UTR#13 Unicode Newline Guidelines |
ec83e909 JH |
743 | [10] should do ^ and $ also on \x{85}, \x{2028} and \x{2029} |
744 | (should also affect <>, $., and script line numbers) | |
3bfdc84c | 745 | (the \x{85}, \x{2028} and \x{2029} do match \s) |
7207e29d | 746 | |
237bad5b | 747 | [a] You can mimic class subtraction using lookahead. |
dbe420b4 | 748 | For example, what TR18 might write as |
29bdacb8 | 749 | |
dbe420b4 JH |
750 | [{Greek}-[{UNASSIGNED}]] |
751 | ||
752 | in Perl can be written as: | |
753 | ||
1d81abf3 JH |
754 | (?!\p{Unassigned})\p{InGreekAndCoptic} |
755 | (?=\p{Assigned})\p{InGreekAndCoptic} | |
dbe420b4 JH |
756 | |
757 | But in this particular example, you probably really want | |
758 | ||
759 | \p{Greek} | |
760 | ||
761 | which will match assigned characters known to be part of the Greek script. | |
29bdacb8 | 762 | |
237bad5b JH |
763 | [b] See L</User-defined Character Properties>. |
764 | ||
776f8809 JH |
765 | =item * |
766 | ||
767 | Level 2 - Extended Unicode Support | |
768 | ||
769 | 3.1 Surrogates - MISSING | |
770 | 3.2 Canonical Equivalents - MISSING [11][12] | |
771 | 3.3 Locale-Independent Graphemes - MISSING [13] | |
772 | 3.4 Locale-Independent Words - MISSING [14] | |
773 | 3.5 Locale-Independent Loose Matches - MISSING [15] | |
774 | ||
775 | [11] see UTR#15 Unicode Normalization | |
776 | [12] have Unicode::Normalize but not integrated to regexes | |
777 | [13] have \X but at this level . should equal that | |
778 | [14] need three classes, not just \w and \W | |
779 | [15] see UTR#21 Case Mappings | |
780 | ||
781 | =item * | |
782 | ||
783 | Level 3 - Locale-Sensitive Support | |
784 | ||
785 | 4.1 Locale-Dependent Categories - MISSING | |
786 | 4.2 Locale-Dependent Graphemes - MISSING [16][17] | |
787 | 4.3 Locale-Dependent Words - MISSING | |
788 | 4.4 Locale-Dependent Loose Matches - MISSING | |
789 | 4.5 Locale-Dependent Ranges - MISSING | |
790 | ||
791 | [16] see UTR#10 Unicode Collation Algorithms | |
792 | [17] have Unicode::Collate but not integrated to regexes | |
793 | ||
794 | =back | |
795 | ||
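Notes [12] and [13] can be illustrated from plain Perl. Because canonical equivalence is not built into the regex engine, one workaround is to normalize both sides with Unicode::Normalize before comparing or matching. This is a sketch under that assumption, not a replacement for real Level 2 support; the variable names are made up for the example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

my $precomposed = "\x{E9}";     # LATIN SMALL LETTER E WITH ACUTE
my $decomposed  = "e\x{301}";   # "e" followed by COMBINING ACUTE ACCENT

# Canonically equivalent, yet the code point sequences differ.
my $raw_equal = $precomposed eq $decomposed ? 1 : 0;

# After normalizing both to the same form they compare equal.
my $nfc_equal = NFC($precomposed) eq NFC($decomposed) ? 1 : 0;

# \X consumes a whole grapheme cluster, while "." consumes one code point.
my $one_grapheme = $decomposed =~ /\A\X\z/ ? 1 : 0;
my $one_dot      = $decomposed =~ /\A.\z/  ? 1 : 0;

print "raw eq: $raw_equal, NFC eq: $nfc_equal, \\X: $one_grapheme, dot: $one_dot\n";
```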
c349b1b9 JH |
796 | =head2 Unicode Encodings |
797 | ||
798 | Unicode characters are assigned to I<code points> which are abstract | |
86bbd6d1 | 799 | numbers. To use these numbers various encodings are needed. |
c349b1b9 JH |
800 | |
801 | =over 4 | |
802 | ||
c29a771d | 803 | =item * |
5cb3728c RB |
804 | |
805 | UTF-8 | |
c349b1b9 | 806 | |
3e4dbfed JF |
807 | UTF-8 is a variable-length (1 to 6 bytes, current character allocations |
808 | require 4 bytes), byteorder independent encoding. For ASCII, UTF-8 is | |
809 | transparent (and we really do mean 7-bit ASCII, not another 8-bit encoding). | |
c349b1b9 | 810 | |
8c007b5a | 811 | The following table is from Unicode 3.2. |
05632f9a JH |
812 | |
813 | Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte | |
814 | ||
8c007b5a JH |
815 | U+0000..U+007F 00..7F |
816 | U+0080..U+07FF C2..DF 80..BF | |
ec90690f TS |
817 | U+0800..U+0FFF E0 A0..BF 80..BF |
818 | U+1000..U+CFFF E1..EC 80..BF 80..BF | |
819 | U+D000..U+D7FF ED 80..9F 80..BF | |
8c007b5a | 820 | U+D800..U+DFFF ******* ill-formed ******* |
ec90690f | 821 | U+E000..U+FFFF EE..EF 80..BF 80..BF |
05632f9a JH |
822 | U+10000..U+3FFFF F0 90..BF 80..BF 80..BF |
823 | U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF | |
824 | U+100000..U+10FFFF F4 80..8F 80..BF 80..BF | |
825 | ||
8c007b5a JH |
826 | Note the A0..BF in U+0800..U+0FFF, the 80..9F in U+D000..U+D7FF, |
827 | the 90..BF in U+10000..U+3FFFF, and the 80..8F in U+100000..U+10FFFF. | |
37361303 JH |
828 | The "gaps" are caused by legal UTF-8 avoiding non-shortest encodings: |
829 | it is technically possible to UTF-8-encode a single code point in different | |
830 | ways, but that is explicitly forbidden, and the shortest possible encoding | |
831 | should always be used (and that is what Perl does). | |
832 | ||
05632f9a JH |
833 | Or, another way to look at it, as bits: |
834 | ||
835 | Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte | |
836 | ||
837 | 0aaaaaaa 0aaaaaaa | |
838 | 00000bbbbbaaaaaa 110bbbbb 10aaaaaa | |
839 | ccccbbbbbbaaaaaa 1110cccc 10bbbbbb 10aaaaaa | |
840 | 00000dddccccccbbbbbbaaaaaa 11110ddd 10cccccc 10bbbbbb 10aaaaaa | |
841 | ||
842 | As you can see, the continuation bytes all begin with C<10>, and the | |
8c007b5a | 843 | leading bits of the start byte tell how many bytes there are in the |
05632f9a JH |
844 | encoded character. |
845 | ||
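The bit layout above translates almost mechanically into code. The following sketch builds the bytes by hand for one code point; C<utf8_bytes> is a hypothetical helper written for this illustration, not something Perl provides (in real code you would use the Encode module).

```perl
use strict;
use warnings;

# Hypothetical helper: return the shortest-form UTF-8 bytes for one code point.
sub utf8_bytes {
    my ($cp) = @_;
    die sprintf("U+%X is out of range\n", $cp)  if $cp < 0 || $cp > 0x10FFFF;
    die sprintf("U+%04X is a surrogate\n", $cp) if $cp >= 0xD800 && $cp <= 0xDFFF;

    # 0aaaaaaa
    return pack("C", $cp) if $cp < 0x80;

    # 110bbbbb 10aaaaaa
    return pack("C2", 0xC0 | ($cp >> 6), 0x80 | ($cp & 0x3F)) if $cp < 0x800;

    # 1110cccc 10bbbbbb 10aaaaaa
    return pack("C3", 0xE0 | ($cp >> 12),
                      0x80 | (($cp >> 6) & 0x3F),
                      0x80 | ($cp & 0x3F)) if $cp < 0x10000;

    # 11110ddd 10cccccc 10bbbbbb 10aaaaaa
    return pack("C4", 0xF0 | ($cp >> 18),
                      0x80 | (($cp >> 12) & 0x3F),
                      0x80 | (($cp >> 6) & 0x3F),
                      0x80 | ($cp & 0x3F));
}

for my $cp (0x41, 0xE9, 0x20AC, 0x10348) {
    my $hex = join " ", map { sprintf "%02X", ord($_) } split //, utf8_bytes($cp);
    printf "U+%04X => %s\n", $cp, $hex;
}
```

Because each branch is only reached when the code point does not fit in a shorter form, the helper can never produce a non-shortest encoding, which is what gives rise to the "gaps" in the first table.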
c29a771d | 846 | =item * |
5cb3728c RB |
847 | |
848 | UTF-EBCDIC | |
dbe420b4 | 849 | |
fe854a6f | 850 | Like UTF-8, but EBCDIC-safe, as UTF-8 is ASCII-safe. |
dbe420b4 | 851 | |
c29a771d | 852 | =item * |
5cb3728c RB |
853 | |
854 | UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks) | |
c349b1b9 | 855 | |
dbe420b4 JH |
856 | (The following items are mostly for reference; Perl doesn't |
857 | use them internally.) | |
858 | ||
c349b1b9 | 859 | UTF-16 is a 2 or 4 byte encoding. The Unicode code points |
ec90690f TS |
860 | U+0000..U+FFFF are stored in a single 16-bit unit, and the code points |
861 | U+10000..U+10FFFF in two 16-bit units. The latter case | |
c349b1b9 JH |
862 | uses I<surrogates>, the first 16-bit unit being the I<high |
863 | surrogate>, and the second being the I<low surrogate>. | |
864 | ||
ec90690f | 865 | Surrogates are code points set aside to encode the U+10000..U+10FFFF |
c349b1b9 | 866 | range of Unicode code points in pairs of 16-bit units. The I<high |
ec90690f TS |
867 | surrogates> are the range U+D800..U+DBFF, and the I<low surrogates> |
868 | are the range U+DC00..U+DFFF. The surrogate encoding is | |
c349b1b9 JH |
869 | |
870 | $hi = int(($uni - 0x10000) / 0x400) + 0xD800; | |
871 | $lo = ($uni - 0x10000) % 0x400 + 0xDC00; | |
872 | ||
873 | and the decoding is | |
874 | ||
1a3fa709 | 875 | $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00); |
c349b1b9 | 876 | |
feda178f JH |
877 | If you try to generate surrogates (for example by using chr()), you |
878 | will get a warning if warnings are turned on (C<-w> or C<use | |
879 | warnings;>) because those code points are not valid for a Unicode | |
880 | character. | |
9466bab6 | 881 | |
86bbd6d1 | 882 | Because of the 16-bitness, UTF-16 is byteorder dependent. UTF-16 |
c349b1b9 | 883 | itself can be used for in-memory computations, but if storage or |
86bbd6d1 | 884 | transfer is required, either UTF-16BE (Big Endian) or UTF-16LE |
c349b1b9 JH |
885 | (Little Endian) must be chosen. |
886 | ||
887 | This introduces another problem: what if you just know that your data | |
888 | is UTF-16, but you don't know which endianness? Byte Order Marks | |
889 | (BOMs) are a solution to this. A special character has been reserved | |
86bbd6d1 | 890 | in Unicode to function as a byte order marker: the character with the |
ec90690f | 891 | code point U+FEFF is the BOM. |
042da322 | 892 | |
c349b1b9 JH |
893 | The trick is that if you read a BOM, you will know the byte order, |
894 | since if it was written on a big endian platform, you will read the | |
86bbd6d1 PN |
895 | bytes 0xFE 0xFF, but if it was written on a little endian platform, |
896 | you will read the bytes 0xFF 0xFE. (And if the originating platform | |
897 | was writing in UTF-8, you will read the bytes 0xEF 0xBB 0xBF.) | |
042da322 | 898 | |
86bbd6d1 | 899 | The way this trick works is that the character with the code point |
ec90690f | 900 | U+FFFE is guaranteed not to be a valid Unicode character, so the |
86bbd6d1 | 901 | sequence of bytes 0xFF 0xFE is unambiguously "BOM, represented in |
ec90690f | 902 | little-endian format" and cannot be "U+FFFE, represented in big-endian |
042da322 | 903 | format". |
c349b1b9 | 904 | |
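The surrogate arithmetic given in this item can be checked with a small round trip. This is a sketch only: C<to_surrogates> and C<from_surrogates> are hypothetical helper names, and shifts and masks are used in place of division and modulus so that the results are always integers.

```perl
use strict;
use warnings;

# Hypothetical helper: split a supplementary code point into (high, low).
sub to_surrogates {
    my ($uni) = @_;
    die "not in U+10000..U+10FFFF\n" if $uni < 0x10000 || $uni > 0x10FFFF;
    my $offset = $uni - 0x10000;                  # a 20-bit value
    return (0xD800 + ($offset >> 10),             # top 10 bits
            0xDC00 + ($offset & 0x3FF));          # bottom 10 bits
}

# Hypothetical helper: combine (high, low) back into a code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    die "bad high surrogate\n" if $hi < 0xD800 || $hi > 0xDBFF;
    die "bad low surrogate\n"  if $lo < 0xDC00 || $lo > 0xDFFF;
    return 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
}

for my $uni (0x10000, 0x1D11E, 0x10FFFF) {
    my ($hi, $lo) = to_surrogates($uni);
    printf "U+%X => %04X %04X => U+%X\n", $uni, $hi, $lo, from_surrogates($hi, $lo);
}
```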
c29a771d | 905 | =item * |
5cb3728c RB |
906 | |
907 | UTF-32, UTF-32BE, UTF-32LE | |
c349b1b9 JH |
908 | |
909 | The UTF-32 family is pretty much like the UTF-16 family, except that | |
042da322 JH |
910 | the units are 32-bit, and therefore the surrogate scheme is not |
911 | needed. The BOM signatures will be 0x00 0x00 0xFE 0xFF for BE and | |
912 | 0xFF 0xFE 0x00 0x00 for LE. | |
c349b1b9 | 913 | |
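Putting the UTF-8, UTF-16 and UTF-32 signatures together, a BOM sniffer might look like the sketch below. C<detect_bom> is a hypothetical helper, not a Perl builtin. Note one assumption it makes: the bytes 0xFF 0xFE 0x00 0x00 are reported as UTF-32LE, although they could also be a UTF-16LE BOM followed by U+0000, so the 32-bit signatures have to be tested first.

```perl
use strict;
use warnings;

# Hypothetical helper: guess the encoding from a leading byte order mark.
# Returns the encoding name, or undef when no known BOM is present.
sub detect_bom {
    my ($bytes) = @_;
    my @signatures = (
        [ "UTF-32BE", "\x00\x00\xFE\xFF" ],   # longest signatures first
        [ "UTF-32LE", "\xFF\xFE\x00\x00" ],
        [ "UTF-8",    "\xEF\xBB\xBF"     ],
        [ "UTF-16BE", "\xFE\xFF"         ],
        [ "UTF-16LE", "\xFF\xFE"         ],
    );
    for my $signature (@signatures) {
        my ($name, $bom) = @$signature;
        return $name if substr($bytes, 0, length $bom) eq $bom;
    }
    return;
}

for my $sample ("\xFE\xFF\x00a", "\xFF\xFEa\x00", "\xEF\xBB\xBFa", "plain") {
    my $guess = detect_bom($sample);
    printf "%-8s\n", defined $guess ? $guess : "no BOM";
}
```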
c29a771d | 914 | =item * |
5cb3728c RB |
915 | |
916 | UCS-2, UCS-4 | |
c349b1b9 | 917 | |
86bbd6d1 | 918 | Encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit |
ec90690f | 919 | encoding. Unlike UTF-16, UCS-2 is not extensible beyond U+FFFF, |
339cfa0e JH |
920 | because it does not use surrogates. UCS-4 is a 32-bit encoding, |
921 | functionally identical to UTF-32. | |
c349b1b9 | 922 | |
c29a771d | 923 | =item * |
5cb3728c RB |
924 | |
925 | UTF-7 | |
c349b1b9 JH |
926 | |
927 | A seven-bit safe (non-eight-bit) encoding, useful if the | |
928 | transport/storage is not eight-bit safe. Defined by RFC 2152. | |
929 | ||
95a1a48b JH |
930 | =back |
931 | ||
0d7c09bb JH |
932 | =head2 Security Implications of Unicode |
933 | ||
934 | =over 4 | |
935 | ||
936 | =item * | |
937 | ||
938 | Malformed UTF-8 | |
bf0fa0b2 JH |
939 | |
940 | Unfortunately, the specification of UTF-8 leaves some room for | |
941 | interpretation of how many bytes of encoded output one should generate | |
942 | from one input Unicode character. Strictly speaking, one is supposed | |
943 | to always generate the shortest possible sequence of UTF-8 bytes, | |
feda178f JH |
944 | because otherwise there is potential for input buffer overflow at |
945 | the receiving end of a UTF-8 connection. Perl always generates the | |
946 | shortest length UTF-8, and with warnings on (C<-w> or C<use | |
947 | warnings;>) Perl will warn about non-shortest length UTF-8 (and other | |
948 | malformations, too, such as the surrogates, which are not real | |
949 | Unicode code points). | |
bf0fa0b2 | 950 | |
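If you accept UTF-8 from the outside world, one way to refuse non-shortest forms and encoded surrogates up front is a strict decode through the Encode module. The following is a sketch, assuming the strict C<"UTF-8"> encoding name and the C<FB_CROAK> check mode of Encode; C<strict_decode> is a hypothetical wrapper.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical wrapper: return the decoded text, or undef for malformed input.
sub strict_decode {
    my ($bytes) = @_;
    my $text = eval {
        Encode::decode("UTF-8", $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return $text;
}

my $shortest  = "\x2F";           # "/" in its only legal, one-byte form
my $overlong  = "\xC0\xAF";       # the same code point padded to two bytes
my $surrogate = "\xED\xA0\x80";   # U+D800 encoded as if it were a character

for my $sample ($shortest, $overlong, $surrogate) {
    my $hex = join " ", map { sprintf "%02X", ord($_) } split //, $sample;
    printf "%-9s %s\n", $hex, defined strict_decode($sample) ? "accepted" : "rejected";
}
```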
0d7c09bb JH |
951 | =item * |
952 | ||
953 | Regular expressions behave slightly differently between byte data and | |
954 | character (Unicode) data. For example, the "word character" character | |
955 | class C<\w> will work differently when the data is all eight-bit bytes | |
956 | or when the data is Unicode. | |
957 | ||
958 | In the first case, the set of C<\w> characters is either small (the | |
959 | default set of alphabetic characters, digits, and the "_"), or, if you | |
960 | are using a locale (see L<perllocale>), the C<\w> might contain a few | |
961 | more letters according to your language and country. | |
962 | ||
963 | In the second case, the C<\w> set of characters is much, much larger, | |
964 | and most importantly, even in the set of the first 256 characters, it | |
965 | will most probably be different: as opposed to most locales (which are | |
966 | specific to a language and country pair) Unicode classifies all the | |
967 | characters that are letters as C<\w>. For example: your locale might | |
968 | not think that LATIN SMALL LETTER ETH is a letter (unless you happen | |
969 | to speak Icelandic), but Unicode does. | |
970 | ||
a73d23f6 | 971 | As discussed elsewhere, Perl tries to stand with one leg (two legs, as |
0746c07b | 972 | camels are quadrupeds?) in each of two worlds: the old world of bytes and the new |
0d7c09bb JH |
973 | world of characters, upgrading from bytes to characters when necessary. |
974 | If your legacy code is not explicitly using Unicode, no automatic | |
975 | switchover to characters should happen, and characters shouldn't get | |
976 | downgraded back to bytes, either. It is possible to accidentally mix | |
977 | bytes and characters, however (see L<perluniintro>), in which case the | |
978 | C<\w> might start behaving differently. Review your code. | |
979 | ||
980 | =back | |
981 | ||
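The C<\w> difference described above is easy to reproduce with the LATIN SMALL LETTER ETH example. The sketch below assumes no locale is in effect. The C<no feature 'unicode_strings'> line is an addition for the benefit of newer perls, where that feature (if enabled) deliberately removes the difference; it is not something this document otherwise relies on.

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # keep the legacy byte/character split visible

my $eth_bytes = "\xF0";         # LATIN SMALL LETTER ETH held as a single byte
my $eth_chars = $eth_bytes;
utf8::upgrade($eth_chars);      # same character, now held as UTF-8 internally

# The two scalars are equal as strings...
my $same_string = $eth_bytes eq $eth_chars ? 1 : 0;

# ...but \w sees byte semantics in one case and Unicode semantics in the other.
my $bytes_is_word = $eth_bytes =~ /\A\w\z/ ? 1 : 0;
my $chars_is_word = $eth_chars =~ /\A\w\z/ ? 1 : 0;

print "same string: $same_string, \\w on bytes: $bytes_is_word, ",
      "\\w on characters: $chars_is_word\n";
```

A check such as C</^\w+$/> used for input validation can therefore accept or reject the same text depending on how the scalar happened to be stored, which is why mixing bytes and characters deserves a review.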
c349b1b9 JH |
982 | =head2 Unicode in Perl on EBCDIC |
983 | ||
984 | The way Unicode is handled on EBCDIC platforms is still rather | |
86bbd6d1 | 985 | experimental. On such a platform, references to UTF-8 encoding in this |
c349b1b9 JH |
986 | document and elsewhere should be read as meaning UTF-EBCDIC as |
987 | specified in Unicode Technical Report 16 unless ASCII vs EBCDIC issues | |
988 | are specifically discussed. There is no C<utfebcdic> pragma or | |
86bbd6d1 PN |
989 | ":utfebcdic" layer; rather, "utf8" and ":utf8" are re-used to mean |
990 | the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic> | |
991 | for more discussion of the issues. | |
c349b1b9 | 992 | |
b310b053 JH |
993 | =head2 Locales |
994 | ||
4616122b | 995 | Usually locale settings and Unicode do not affect each other, but |
b310b053 JH |
996 | there are a couple of exceptions: |
997 | ||
998 | =over 4 | |
999 | ||
1000 | =item * | |
1001 | ||
1002 | If your locale environment variables (LANGUAGE, LC_ALL, LC_CTYPE, LANG) | |
1003 | contain the strings 'UTF-8' or 'UTF8' (case-insensitive matching), | |
1004 | the default encoding of your STDIN, STDOUT, and STDERR, and of | |
1005 | B<any subsequent file open>, is UTF-8. | |
1006 | ||
1007 | =item * | |
1008 | ||
1009 | Perl tries really hard to work both with Unicode and the old byte | |
1010 | oriented world: most often this is nice, but sometimes this causes | |
574c8022 | 1011 | problems. |
b310b053 JH |
1012 | |
1013 | =back | |
1014 | ||
95a1a48b JH |
1015 | =head2 Using Unicode in XS |
1016 | ||
1017 | If you want to handle Perl Unicode in XS extensions, you may find | |
90f968e0 | 1018 | the following C APIs useful (see perlapi for details): |
95a1a48b JH |
1019 | |
1020 | =over 4 | |
1021 | ||
1022 | =item * | |
1023 | ||
f1e62f77 AT |
1024 | DO_UTF8(sv) returns true if the UTF8 flag is on and the bytes pragma |
1025 | is not in effect. SvUTF8(sv) returns true if the UTF8 flag is on; the | |
1026 | bytes pragma is ignored. The UTF8 flag being on does B<not> mean that | |
b31c5e31 AT |
1027 | there are any characters of code points greater than 255 (or 127) in |
1028 | the scalar, or that there even are any characters in the scalar. | |
1029 | What the UTF8 flag means is that the sequence of octets in the | |
1030 | representation of the scalar is the sequence of UTF-8 encoded | |
1031 | code points of the characters of a string. The UTF8 flag being | |
1032 | off means that each octet in this representation encodes a single | |
1033 | character with codepoint 0..255 within the string. Perl's Unicode | |
1034 | model is not to use UTF-8 until it's really necessary. | |
95a1a48b JH |
1035 | |
1036 | =item * | |
1037 | ||
1038 | uvuni_to_utf8(buf, chr) writes a Unicode character code point into a | |
cfc01aea | 1039 | buffer encoding the code point as UTF-8, and returns a pointer |
95a1a48b JH |
1040 | pointing after the UTF-8 bytes. |
1041 | ||
1042 | =item * | |
1043 | ||
1044 | utf8_to_uvuni(buf, lenp) reads UTF-8 encoded bytes from a buffer and | |
1045 | returns the Unicode character code point (and optionally the length of | |
1046 | the UTF-8 byte sequence). | |
1047 | ||
1048 | =item * | |
1049 | ||
90f968e0 JH |
1050 | utf8_length(start, end) returns the length of the UTF-8 encoded buffer |
1051 | in characters. sv_len_utf8(sv) returns the length of the UTF-8 encoded | |
95a1a48b JH |
1052 | scalar. |
1053 | ||
1054 | =item * | |
1055 | ||
1056 | sv_utf8_upgrade(sv) converts the string of the scalar to its UTF-8 | |
1057 | encoded form. sv_utf8_downgrade(sv) does the opposite (if possible). | |
1058 | sv_utf8_encode(sv) is like sv_utf8_upgrade but the UTF8 flag does not | |
1059 | get turned on. sv_utf8_decode() does the opposite of sv_utf8_encode(). | |
13a6c0e0 JH |
1060 | Note that none of these are to be used as general purpose encoding/decoding |
1061 | interfaces: use Encode for that. sv_utf8_upgrade() is affected by the | |
1062 | encoding pragma, but sv_utf8_downgrade() is not (since the encoding | |
1063 | pragma is designed to be a one-way street). | |
95a1a48b JH |
1064 | |
1065 | =item * | |
1066 | ||
90f968e0 JH |
1067 | is_utf8_char(s) returns true if the pointer points to a valid UTF-8 |
1068 | character. | |
95a1a48b JH |
1069 | |
1070 | =item * | |
1071 | ||
1072 | is_utf8_string(buf, len) returns true if the len bytes of the buffer | |
1073 | are valid UTF-8. | |
1074 | ||
1075 | =item * | |
1076 | ||
1077 | UTF8SKIP(buf) will return the number of bytes in the UTF-8 encoded | |
1078 | character in the buffer. UNISKIP(chr) will return the number of bytes | |
90f968e0 JH |
1079 | required to UTF-8-encode the Unicode character code point. UTF8SKIP() |
1080 | is useful for example for iterating over the characters of a UTF-8 | |
1081 | encoded buffer; UNISKIP() is useful for example in computing | |
1082 | the size required for a UTF-8 encoded buffer. | |
95a1a48b JH |
1083 | |
1084 | =item * | |
1085 | ||
1086 | utf8_distance(a, b) will tell the distance in characters between the | |
1087 | two pointers pointing to the same UTF-8 encoded buffer. | |
1088 | ||
1089 | =item * | |
1090 | ||
1091 | utf8_hop(s, off) will return a pointer to a UTF-8 encoded buffer that | |
1092 | is C<off> (positive or negative) Unicode characters displaced from the | |
90f968e0 JH |
1093 | UTF-8 buffer C<s>. Be careful not to overstep the buffer: utf8_hop() |
1094 | will merrily run off the end or the beginning if told to do so. | |
95a1a48b | 1095 | |
d2cc3551 JH |
1096 | =item * |
1097 | ||
1098 | pv_uni_display(dsv, spv, len, pvlim, flags) and sv_uni_display(dsv, | |
1099 | ssv, pvlim, flags) are useful for debug output of Unicode strings and | |
90f968e0 JH |
1100 | scalars. By default they are useful only for debug: they display |
1101 | B<all> characters as hexadecimal code points, but with the flags | |
1102 | UNI_DISPLAY_ISPRINT and UNI_DISPLAY_BACKSLASH you can make the output | |
1103 | more readable. | |
d2cc3551 JH |
1104 | |
1105 | =item * | |
1106 | ||
90f968e0 JH |
1107 | ibcmp_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2) can be used to |
1108 | compare two strings case-insensitively in Unicode. | |
1109 | (For case-sensitive comparisons you can just use memEQ() and memNE() | |
1110 | as usual.) | |
d2cc3551 | 1111 | |
c349b1b9 JH |
1112 | =back |
1113 | ||
95a1a48b JH |
1114 | For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h> |
1115 | in the Perl source code distribution. | |
1116 | ||
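Several of the C functions above have Perl-level counterparts in the C<utf8::> namespace, which makes it possible to watch the UTF8 flag behave without writing any XS. The sketch below uses only C<utf8::is_utf8>, C<utf8::upgrade>, C<utf8::encode> and C<utf8::downgrade>; treating them as stand-ins for SvUTF8(), sv_utf8_upgrade(), sv_utf8_encode() and sv_utf8_downgrade() is an approximation made for illustration.

```perl
use strict;
use warnings;

my $latin1 = "caf\xE9";      # four characters, all below 256; UTF8 flag off

my $upgraded = $latin1;
utf8::upgrade($upgraded);    # same characters, internal storage now UTF-8

my $encoded = $latin1;
utf8::encode($encoded);      # characters replaced by their UTF-8 octets

for my $row ([ latin1 => $latin1 ], [ upgraded => $upgraded ], [ encoded => $encoded ]) {
    my ($label, $string) = @$row;
    printf "%-9s flag=%d length=%d\n",
        $label, utf8::is_utf8($string) ? 1 : 0, length $string;
}

# Downgrading fails (quietly, with a true second argument) when a character
# does not fit in a single byte.
my $wide = "\x{263A}";
my $could_downgrade = utf8::downgrade($wide, 1) ? 1 : 0;
print "wide character downgradable: $could_downgrade\n";
```

The upgraded copy has the flag on although it contains no character above 255, and it still compares equal to the original; only the encoded copy is a different string.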
c29a771d JH |
1117 | =head1 BUGS |
1118 | ||
7eabb34d A |
1119 | =head2 Interaction with locales |
1120 | ||
c29a771d JH |
1121 | Use of locales with Unicode data may lead to odd results. Currently |
1122 | there is some attempt to apply 8-bit locale info to characters in the | |
1123 | range 0..255, but this is demonstrably incorrect for locales that use | |
1124 | characters above that range when mapped into Unicode. It will also | |
574c8022 | 1125 | tend to run slower. Use of locales with Unicode is discouraged. |
c29a771d | 1126 | |
7eabb34d A |
1127 | =head2 Interaction with extensions |
1128 | ||
1129 | When perl exchanges data with an extension, the extension should be | |
1130 | able to understand the UTF-8 flag and act accordingly. If the | |
1131 | extension doesn't know about the flag, the risk is high that it will | |
1132 | return data that are incorrectly flagged. | |
1133 | ||
1134 | So if you're working with Unicode data, consult the documentation of | |
1135 | every module you're using to see whether there are any issues with Unicode data | |
1136 | exchange. If the documentation does not talk about Unicode at all, | |
a73d23f6 RGS |
1137 | suspect the worst and probably look at the source to learn how the |
1138 | module is implemented. Modules written completely in perl shouldn't | |
1139 | cause problems. Modules that directly or indirectly access code written | |
1140 | in other programming languages are at risk. | |
7eabb34d A |
1141 | |
1142 | For affected functions the simple strategy to avoid data corruption is | |
1143 | to always make the encoding of the exchanged data explicit. Choose an | |
1144 | encoding you know the extension can handle. Convert arguments passed | |
1145 | to the extensions to that encoding and convert results back from that | |
1146 | encoding. Write wrapper functions that do the conversions for you, so | |
1147 | you can later change the functions when the extension catches up. | |
1148 | ||
1149 | To provide an example let's say the popular Foo::Bar::escape_html | |
1150 | function doesn't deal with Unicode data yet. The wrapper function | |
1151 | would convert the argument to raw UTF-8 and convert the result back to | |
1152 | perl's internal representation like so: | |
1153 | ||
1154 | sub my_escape_html ($) { | |
1155 | my($what) = shift; | |
1156 | return unless defined $what; | |
1157 | Encode::decode_utf8(Foo::Bar::escape_html(Encode::encode_utf8($what))); | |
1158 | } | |
1159 | ||
1160 | Sometimes, when the extension does not convert data but just stores | |
1161 | and retrieves them, you will be in a position to use the otherwise | |
1162 | dangerous Encode::_utf8_on() function. Let's say the popular | |
66b79f27 | 1163 | C<Foo::Bar> extension, written in C, provides a C<param> method that |
7eabb34d A |
1164 | lets you store and retrieve data according to these prototypes: |
1165 | ||
1166 | $self->param($name, $value); # set a scalar | |
1167 | $value = $self->param($name); # retrieve a scalar | |
1168 | ||
1169 | If it does not yet provide support for any encoding, one could write a | |
1170 | derived class with such a C<param> method: | |
1171 | ||
1172 | sub param { | |
1173 | my($self,$name,$value) = @_; | |
1174 | utf8::upgrade($name); # make sure it is UTF-8 encoded | |
1175 | if (defined $value) { | |
1176 | utf8::upgrade($value); # make sure it is UTF-8 encoded | |
1177 | return $self->SUPER::param($name,$value); | |
1178 | } else { | |
1179 | my $ret = $self->SUPER::param($name); | |
1180 | Encode::_utf8_on($ret); # we know, it is UTF-8 encoded | |
1181 | return $ret; | |
1182 | } | |
1183 | } | |
1184 | ||
a73d23f6 RGS |
1185 | Some extensions provide filters on data entry/exit points, such as |
1186 | DB_File::filter_store_key and family. Look out for such filters in | |
66b79f27 | 1187 | the documentation of your extensions; they can make the transition to |
7eabb34d A |
1188 | Unicode data much easier. |
1189 | ||
1190 | =head2 Speed | |
1191 | ||
c29a771d | 1192 | Some functions are slower when working on UTF-8 encoded strings than |
574c8022 | 1193 | on byte encoded strings. All functions that need to hop over |
c29a771d JH |
1194 | characters such as length(), substr() or index() can work B<much> |
1195 | faster when the underlying data are byte-encoded. Witness the | |
1196 | following benchmark: | |
666f95b9 | 1197 | |
c29a771d JH |
1198 | % perl -e ' |
1199 | use Benchmark; | |
1200 | use strict; | |
1201 | our $l = 10000; | |
1202 | our $u = our $b = "x" x $l; | |
1203 | substr($u,0,1) = "\x{100}"; | |
1204 | timethese(-2,{ | |
1205 | LENGTH_B => q{ length($b) }, | |
1206 | LENGTH_U => q{ length($u) }, | |
1207 | SUBSTR_B => q{ substr($b, $l/4, $l/2) }, | |
1208 | SUBSTR_U => q{ substr($u, $l/4, $l/2) }, | |
1209 | }); | |
1210 | ' | |
1211 | Benchmark: running LENGTH_B, LENGTH_U, SUBSTR_B, SUBSTR_U for at least 2 CPU seconds... | |
1212 | LENGTH_B: 2 wallclock secs ( 2.36 usr + 0.00 sys = 2.36 CPU) @ 5649983.05/s (n=13333960) | |
1213 | LENGTH_U: 2 wallclock secs ( 2.11 usr + 0.00 sys = 2.11 CPU) @ 12155.45/s (n=25648) | |
1214 | SUBSTR_B: 3 wallclock secs ( 2.16 usr + 0.00 sys = 2.16 CPU) @ 374480.09/s (n=808877) | |
1215 | SUBSTR_U: 2 wallclock secs ( 2.11 usr + 0.00 sys = 2.11 CPU) @ 6791.00/s (n=14329) | |
666f95b9 | 1216 | |
c29a771d JH |
1217 | The numbers show an incredible slowness on long UTF-8 strings and you |
1218 | should carefully avoid using these functions within tight loops. For | |
1219 | example, if you want to iterate over characters, it is far | |
1220 | better to split into an array than to use substr, as the following | |
1221 | benchmark shows: | |
1222 | ||
1223 | % perl -e ' | |
1224 | use Benchmark; | |
1225 | use strict; | |
1226 | our $l = 10000; | |
1227 | our $u = our $b = "x" x $l; | |
1228 | substr($u,0,1) = "\x{100}"; | |
1229 | timethese(-5,{ | |
1230 | SPLIT_B => q{ for my $c (split //, $b){} }, | |
1231 | SPLIT_U => q{ for my $c (split //, $u){} }, | |
1232 | SUBSTR_B => q{ for my $i (0..length($b)-1){my $c = substr($b,$i,1);} }, | |
1233 | SUBSTR_U => q{ for my $i (0..length($u)-1){my $c = substr($u,$i,1);} }, | |
1234 | }); | |
1235 | ' | |
1236 | Benchmark: running SPLIT_B, SPLIT_U, SUBSTR_B, SUBSTR_U for at least 5 CPU seconds... | |
1237 | SPLIT_B: 6 wallclock secs ( 5.29 usr + 0.00 sys = 5.29 CPU) @ 56.14/s (n=297) | |
1238 | SPLIT_U: 5 wallclock secs ( 5.17 usr + 0.01 sys = 5.18 CPU) @ 55.21/s (n=286) | |
1239 | SUBSTR_B: 5 wallclock secs ( 5.34 usr + 0.00 sys = 5.34 CPU) @ 123.22/s (n=658) | |
1240 | SUBSTR_U: 7 wallclock secs ( 6.20 usr + 0.00 sys = 6.20 CPU) @ 0.81/s (n=5) | |
1241 | ||
1242 | You see, the algorithm based on substr() was faster with byte encoded | |
1243 | data but it is pathologically slow with UTF-8 data. | |
666f95b9 | 1244 | |
393fec97 GS |
1245 | =head1 SEE ALSO |
1246 | ||
72ff2908 JH |
1247 | L<perluniintro>, L<encoding>, L<Encode>, L<open>, L<utf8>, L<bytes>, |
1248 | L<perlretut>, L<perlvar/"${^WIDE_SYSTEM_CALLS}"> | |
393fec97 GS |
1249 | |
1250 | =cut |