Commit | Line | Data |
---|---|---|
393fec97 GS |
1 | =head1 NAME |
2 | ||
3 | perlunicode - Unicode support in Perl | |
4 | ||
5 | =head1 DESCRIPTION | |
6 | ||
a6a7eedc KW |
7 | If you haven't already, before reading this document, you should become |
8 | familiar with both L<perlunitut> and L<perluniintro>. | |
9 | ||
10 | Unicode aims to B<UNI>-fy the en-B<CODE>-ings of all the world's | |
11 | character sets into a single Standard. For quite a few of the various | |
12 | coding standards that existed when Unicode was first created, converting | |
13 | from each to Unicode essentially meant adding a constant to each code | |
14 | point in the original standard, and converting back meant just | |
15 | subtracting that same constant. For ASCII and ISO-8859-1, the constant | |
16 | is 0. For ISO-8859-5 (Cyrillic), the constant is 864; for Hebrew | |
17 | (ISO-8859-8), it's 1488; Thai (ISO-8859-11), 3424; and so forth. This | |
18 | made it easy to do the conversions, and facilitated the adoption of | |
19 | Unicode. | |
20 | ||
21 | And it worked; nowadays, those legacy standards are rarely used. Most | |
22 | everyone uses Unicode. | |
23 | ||
24 | Unicode is a comprehensive standard. It specifies many things outside | |
25 | the scope of Perl, such as how to display sequences of characters. For | |
26 | a full discussion of all aspects of Unicode, see | |
27 | L<http://www.unicode.org>. | |
28 | ||
0a1f2d14 | 29 | =head2 Important Caveats |
21bad921 | 30 | |
a6a7eedc KW |
31 | Even though some of this section may not be understandable to you on |
32 | first reading, we think it's important enough to highlight some of the | |
33 | gotchas before delving further, so here goes: | |
34 | ||
376d9008 | 35 | Unicode support is an extensive requirement. While Perl does not |
c349b1b9 JH |
36 | implement the Unicode standard or the accompanying technical reports |
37 | from cover to cover, Perl does support many Unicode features. | |
21bad921 | 38 | |
f57d8456 KW |
39 | Also, the use of Unicode may present security issues that aren't |
40 | obvious, see L</Security Implications of Unicode>. | |
9d1c51c1 | 41 | |
13a2d996 | 42 | =over 4 |
21bad921 | 43 | |
a9130ea9 | 44 | =item Safest if you C<use feature 'unicode_strings'> |
42581d5d KW |
45 | |
46 | In order to preserve backward compatibility, Perl does not turn | |
47 | on full internal Unicode support unless the pragma | |
b65e6125 KW |
48 | L<S<C<use feature 'unicode_strings'>>|feature/The 'unicode_strings' feature> |
49 | is specified. (This is automatically | |
50 | selected if you S<C<use 5.012>> or higher.) Failure to do this can | |
42581d5d KW |
51 | trigger surprises. See L</The "Unicode Bug"> below. |
52 | ||
2269d15c KW |
53 | This pragma doesn't affect I/O. Nor does it change the internal |
54 | representation of strings, only their interpretation. There are still | |
55 | several places where Unicode isn't fully supported, such as in | |
56 | filenames. | |
42581d5d | 57 | |
fae2c0fb | 58 | =item Input and Output Layers |
21bad921 | 59 | |
a6a7eedc KW |
60 | Use the C<:encoding(...)> layer to read from and write to |
61 | filehandles using the specified encoding. (See L<open>.) | |
c349b1b9 | 62 | |
a6a7eedc KW |
63 | =item You should convert your non-ASCII, non-UTF-8 Perl scripts to be |
64 | UTF-8. | |
21bad921 | 65 | |
a6a7eedc | 66 | See L<encoding>. |
21bad921 | 67 | |
a6a7eedc | 68 | =item C<use utf8> still needed to enable L<UTF-8|/Unicode Encodings> in scripts |
21bad921 | 69 | |
a6a7eedc KW |
70 | If your Perl script is itself encoded in L<UTF-8|/Unicode Encodings>, |
71 | the S<C<use utf8>> pragma must be explicitly included to enable | |
72 | recognition of that (in string or regular expression literals, or in | |
73 | identifier names). B<This is the only time when an explicit S<C<use | |
74 | utf8>> is needed.> (See L<utf8>). | |
7aa207d6 | 75 | |
27c74dfd KW |
76 | If a Perl script begins with the bytes that form the UTF-8 encoding of |
77 | the Unicode BYTE ORDER MARK (C<BOM>, see L</Unicode Encodings>), those | |
78 | bytes are completely ignored. | |
79 | ||
80 | =item L<UTF-16|/Unicode Encodings> scripts autodetected | |
7aa207d6 | 81 | |
fea12a3e | 82 | If a Perl script begins with the Unicode C<BOM> (UTF-16LE, |
27c74dfd | 83 | UTF-16BE), or if the script looks like non-C<BOM>-marked |
a6a7eedc | 84 | UTF-16 of either endianness, Perl will correctly read in the script as |
27c74dfd | 85 | the appropriate Unicode encoding. |
990e18f7 | 86 | |
21bad921 GS |
87 | =back |
88 | ||
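As a rough illustration of the caveats above, here is a minimal sketch that combines S<C<use feature 'unicode_strings'>> with an C<:encoding(...)> layer. It assumes UTF-8 as the external encoding and writes to an in-memory filehandle so nothing touches the disk; the variable names are purely illustrative.

```perl
use strict;
use warnings;
use feature 'unicode_strings';     # Unicode rules for string operations

# "caf" followed by LATIN SMALL LETTER E WITH ACUTE and a newline.
# The \N{U+...} escape keeps this source file pure ASCII.
my $text = "caf\N{U+E9}\n";

# Write through an :encoding layer to an in-memory file (a scalar).
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "open failed: $!";
print {$out} $text;
close($out) or die "close failed: $!";

# $buffer now holds encoded bytes; U+00E9 takes two bytes in UTF-8.
my $byte_count = length $buffer;

# Read the bytes back through the same layer to recover characters.
open(my $in, '<:encoding(UTF-8)', \$buffer) or die "open failed: $!";
my $line = <$in>;
close($in) or die "close failed: $!";
my $char_count = length $line;

print "bytes written: $byte_count, characters read: $char_count\n";
```

The layer does the conversion at the I/O boundary, so the rest of the program only ever sees characters.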
376d9008 | 89 | =head2 Byte and Character Semantics |
393fec97 | 90 | |
a6a7eedc KW |
91 | Before Unicode, most encodings used 8 bits (a single byte) to encode |
92 | each character. Thus a character was a byte, and a byte was a | |
93 | character, and there could be only 256 or fewer possible characters. | |
94 | "Byte Semantics" in the title of this section refers to | |
95 | this behavior. There was no need to distinguish between "Byte" and | |
96 | "Character". | |
97 | ||
98 | Then along comes Unicode which has room for over a million characters | |
99 | (and Perl allows for even more). This means that a character may | |
100 | require more than a single byte to represent it, and so the two terms | |
101 | are no longer equivalent. What matter are the characters as whole | |
102 | entities, and not usually the bytes that comprise them. That's what the | |
103 | term "Character Semantics" in the title of this section refers to. | |
104 | ||
105 | Perl had to change internally to decouple "bytes" from "characters". | |
106 | It is important that you too change your ideas, if you haven't already, | |
107 | so that "byte" and "character" no longer mean the same thing in your | |
108 | mind. | |
109 | ||
110 | The basic building block of Perl strings has always been a "character". | |
111 | The change essentially is that the implementation no longer assumes | |
112 | that a character is always just a single byte. | |
113 | ||
114 | There are various things to note: | |
393fec97 GS |
115 | |
116 | =over 4 | |
117 | ||
118 | =item * | |
119 | ||
a6a7eedc KW |
120 | String handling functions, for the most part, continue to operate in |
121 | terms of characters. C<length()>, for example, returns the number of | |
122 | characters in a string, just as before. But that number no longer is | |
123 | necessarily the same as the number of bytes in the string (there may be | |
124 | more bytes than characters). The other such functions include | |
125 | C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>, | |
126 | C<sort()>, C<sprintf()>, and C<write()>. | |
127 | ||
128 | The exceptions are: | |
129 | ||
130 | =over 4 | |
131 | ||
132 | =item * | |
133 | ||
134 | the bit-oriented C<vec> | |
135 | ||
136 | E<nbsp> | |
137 | ||
138 | =item * | |
139 | ||
140 | the byte-oriented C<pack>/C<unpack> C<"C"> format | |
141 | ||
142 | However, the C<W> specifier does operate on whole characters, as does the | |
143 | C<U> specifier. | |
144 | ||
145 | =item * | |
146 | ||
147 | some operators that interact with the platform's operating system | |
148 | ||
149 | Operators dealing with filenames are examples. | |
150 | ||
151 | =item * | |
152 | ||
153 | when the functions are called from within the scope of the | |
154 | S<C<L<use bytes|bytes>>> pragma | |
155 | ||
156 | Likely, you should use this only for debugging anyway. | |
157 | ||
158 | =back | |
159 | ||
160 | =item * | |
161 | ||
376d9008 | 162 | Strings--including hash keys--and regular expression patterns may |
b65e6125 | 163 | contain characters that have ordinal values larger than 255. |
393fec97 | 164 | |
2575c402 JW |
165 | If you use a Unicode editor to edit your program, Unicode characters may |
166 | occur directly within the literal strings in UTF-8 encoding, or UTF-16. | |
27c74dfd | 167 | (The former requires a C<use utf8>, the latter may require a C<BOM>.) |
3e4dbfed | 168 | |
a6a7eedc KW |
169 | L<perluniintro/Creating Unicode> gives other ways to place non-ASCII |
170 | characters in your strings. | |
6f335b04 | 171 | |
a6a7eedc | 172 | =item * |
fbb93542 | 173 | |
a6a7eedc | 174 | The C<chr()> and C<ord()> functions work on whole characters. |
376d9008 | 175 | |
393fec97 GS |
176 | =item * |
177 | ||
a6a7eedc KW |
178 | Regular expressions match whole characters. For example, C<"."> matches |
179 | a whole character instead of only a single byte. | |
393fec97 | 180 | |
393fec97 GS |
181 | =item * |
182 | ||
a6a7eedc KW |
183 | The C<tr///> operator translates whole characters. (Note that the |
184 | C<tr///CU> functionality has been removed. For similar functionality to | |
185 | that, see C<pack('U0', ...)> and C<pack('C0', ...)>). | |
393fec97 | 186 | |
393fec97 GS |
187 | =item * |
188 | ||
a6a7eedc | 189 | C<scalar reverse()> reverses by character rather than by byte. |
393fec97 | 190 | |
393fec97 GS |
191 | =item * |
192 | ||
a6a7eedc KW |
193 | The bit string operators, C<& | ^ ~> and (starting in v5.22) |
194 | C<&. |. ^. ~.> can operate on characters that don't fit into a byte. | |
195 | However, the current behavior is likely to change. You should not use | |
196 | these operators on strings that are encoded in UTF-8. If you're not | |
197 | sure about the encoding of a string, downgrade it before using any of | |
198 | these operators; you can use | |
199 | L<C<utf8::utf8_downgrade()>|utf8/Utility functions>. | |
822502e5 | 200 | |
a6a7eedc | 201 | =back |
822502e5 | 202 | |
a6a7eedc KW |
203 | The bottom line is that Perl has always practiced "Character Semantics", |
204 | but with the advent of Unicode, that is now different than "Byte | |
205 | Semantics". | |
206 | ||
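The distinction between characters and bytes can be seen in a short sketch like the following. It uses the core C<Encode> module to obtain a UTF-8 byte string for comparison; the choice of sample string is arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode);

# WHITE SMILING FACE (U+263A) followed by three ASCII letters.
my $str = "\N{U+263A}abc";

my $chars = length $str;                    # counts characters
my $bytes = length encode('UTF-8', $str);   # counts bytes of the UTF-8 encoding

my $first    = substr($str, 0, 1);          # one whole character, not one byte
my $ordinal  = ord $first;                  # the code point of that character
my $reversed = scalar reverse $str;         # reversed character by character

printf "%d characters, %d bytes, first is U+%04X\n", $chars, $bytes, $ordinal;
```

Here C<length()> reports 4 characters, while the UTF-8 encoding of the same string occupies 6 bytes, because U+263A needs three bytes.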
207 | =head2 ASCII Rules versus Unicode Rules | |
208 | ||
209 | Before Unicode, when a character was a byte was a character, | |
210 | Perl knew only about the 128 characters defined by ASCII, code points 0 | |
b57dd509 KW |
211 | through 127 (except for under L<S<C<use locale>>|perllocale>). That |
212 | left the code | |
a6a7eedc KW |
213 | points 128 to 255 as unassigned, and available for whatever use a |
214 | program might want. The only semantics they have is their ordinal | |
215 | numbers, and that they are members of none of the non-negative character | |
216 | classes. None are considered to match C<\w> for example, but all match | |
217 | C<\W>. | |
822502e5 | 218 | |
a6a7eedc KW |
219 | Unicode, of course, assigns each of those code points a particular |
220 | meaning (along with ones above 255). To preserve backward | |
221 | compatibility, Perl only uses the Unicode meanings when there is some | |
222 | indication that Unicode is what is intended; otherwise the non-ASCII | |
223 | code points remain treated as if they are unassigned. | |
224 | ||
225 | Here are the ways that Perl knows that a string should be treated as | |
226 | Unicode: | |
227 | ||
228 | =over | |
822502e5 TS |
229 | |
230 | =item * | |
231 | ||
a6a7eedc KW |
232 | Within the scope of S<C<use utf8>> |
233 | ||
234 | If the whole program is Unicode (signified by using 8-bit B<U>nicode | |
235 | B<T>ransformation B<F>ormat), then all strings within it must be | |
236 | Unicode. | |
822502e5 TS |
237 | |
238 | =item * | |
239 | ||
a6a7eedc KW |
240 | Within the scope of |
241 | L<S<C<use feature 'unicode_strings'>>|feature/The 'unicode_strings' feature> | |
242 | ||
243 | This pragma was created so you can explicitly tell Perl that operations | |
244 | executed within its scope are to use Unicode rules. More operations are | |
245 | affected with newer perls. See L</The "Unicode Bug">. | |
822502e5 TS |
246 | |
247 | =item * | |
248 | ||
a6a7eedc KW |
249 | Within the scope of S<C<use 5.012>> or higher |
250 | ||
251 | This implicitly turns on S<C<use feature 'unicode_strings'>>. | |
822502e5 TS |
252 | |
253 | =item * | |
254 | ||
a6a7eedc KW |
255 | Within the scope of |
256 | L<S<C<use locale 'not_characters'>>|perllocale/Unicode and UTF-8>, | |
257 | or L<S<C<use locale>>|perllocale> and the current | |
258 | locale is a UTF-8 locale. | |
822502e5 | 259 | |
a6a7eedc KW |
260 | The former is defined to imply Unicode handling; and the latter |
261 | indicates a Unicode locale, hence a Unicode interpretation of all | |
262 | strings within it. | |
822502e5 TS |
263 | |
264 | =item * | |
265 | ||
a6a7eedc KW |
266 | When the string contains a Unicode-only code point |
267 | ||
268 | Perl has never accepted code points above 255 without them being | |
269 | Unicode, so their use implies Unicode for the whole string. | |
822502e5 TS |
270 | |
271 | =item * | |
272 | ||
a6a7eedc KW |
273 | When the string contains a Unicode named code point C<\N{...}> |
274 | ||
275 | The C<\N{...}> construct explicitly refers to a Unicode code point, | |
276 | even if it is one that is also in ASCII. Therefore the string | |
277 | containing it must be Unicode. | |
822502e5 TS |
278 | |
279 | =item * | |
280 | ||
a6a7eedc KW |
281 | When the string has come from an external source marked as |
282 | Unicode | |
283 | ||
284 | The L<C<-C>|perlrun/-C [numberE<sol>list]> command line option can | |
285 | specify that certain inputs to the program are Unicode, and the values | |
286 | of this can be read by your Perl code, see L<perlvar/"${^UNICODE}">. | |
287 | ||
288 | =item * When the string has been upgraded to UTF-8 | |
289 | ||
290 | The function L<C<utf8::utf8_upgrade()>|utf8/Utility functions> | |
291 | can be explicitly used to permanently (unless a subsequent | |
292 | C<utf8::utf8_downgrade()> is called) cause a string to be treated as | |
293 | Unicode. | |
294 | ||
295 | =item * There are additional methods for regular expression patterns | |
296 | ||
f6cf4627 KW |
297 | A pattern that is compiled with the C<< /u >> or C<< /a >> modifiers is |
298 | treated as Unicode (though there are some restrictions with C<< /a >>). | |
299 | Under the C<< /d >> and C<< /l >> modifiers, there are several other | |
300 | indications for Unicode; see L<perlre/Character set modifiers>. | |
822502e5 TS |
301 | |
302 | =back | |
303 | ||
a6a7eedc KW |
304 | Note that all of the above are overridden within the scope of |
305 | C<L<use bytes|bytes>>; but you should be using this pragma only for | |
306 | debugging. | |
307 | ||
308 | Note also that some interactions with the platform's operating system | |
309 | never use Unicode rules. | |
310 | ||
311 | When Unicode rules are in effect: | |
312 | ||
822502e5 TS |
313 | =over 4 |
314 | ||
315 | =item * | |
316 | ||
a6a7eedc KW |
317 | Case translation operators use the Unicode case translation tables. |
318 | ||
319 | Note that C<uc()>, or C<\U> in interpolated strings, translates to | |
320 | uppercase, while C<ucfirst>, or C<\u> in interpolated strings, | |
321 | translates to titlecase in languages that make the distinction (which is | |
322 | equivalent to uppercase in languages without the distinction). | |
323 | ||
324 | There is a CPAN module, C<L<Unicode::Casing>>, which allows you to | |
325 | define your own mappings to be used in C<lc()>, C<lcfirst()>, C<uc()>, | |
326 | C<ucfirst()>, and C<fc> (or their double-quoted string inlined versions | |
327 | such as C<\U>). (Prior to Perl 5.16, this functionality was partially | |
328 | provided in the Perl core, but suffered from a number of insurmountable | |
329 | drawbacks, so the CPAN module was written instead.) | |
330 | ||
331 | =item * | |
332 | ||
333 | Character classes in regular expressions match based on the character | |
334 | properties specified in the Unicode properties database. | |
335 | ||
336 | C<\w> can be used to match a Japanese ideograph, for instance; and | |
337 | C<[[:digit:]]> a Bengali number. | |
338 | ||
339 | =item * | |
340 | ||
341 | Named Unicode properties, scripts, and block ranges may be used (like | |
342 | bracketed character classes) by using the C<\p{}> "matches property" | |
343 | construct and the C<\P{}> negation, "doesn't match property". | |
344 | ||
345 | See L</"Unicode Character Properties"> for more details. | |
346 | ||
347 | You can define your own character properties and use them | |
348 | in the regular expression with the C<\p{}> or C<\P{}> construct. | |
349 | See L</"User-Defined Character Properties"> for more details. | |
822502e5 TS |
350 | |
351 | =back | |
352 | ||
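The difference between the two sets of rules shows up most clearly on a code point in the range 128 to 255. The following is a minimal sketch, assuming no locale is in effect; it compares the same one-character string inside and outside the scope of S<C<use feature 'unicode_strings'>>, and then under the C</u> modifier.

```perl
use strict;
use warnings;

# LATIN SMALL LETTER E WITH ACUTE, a code point between 128 and 255.
my $e_acute = chr(0xE9);

my ($ascii_word, $ascii_uc);
{
    no feature 'unicode_strings';
    $ascii_word = $e_acute =~ /\w/ ? 1 : 0;   # ASCII rules: not a word character
    $ascii_uc   = uc $e_acute;                # ASCII rules: left unchanged
}

my ($uni_word, $uni_uc);
{
    use feature 'unicode_strings';
    $uni_word = $e_acute =~ /\w/ ? 1 : 0;     # Unicode rules: a word character
    $uni_uc   = uc $e_acute;                  # Unicode rules: becomes U+00C9
}

# The /u modifier requests Unicode rules for one pattern, whatever the scope.
my $forced = $e_acute =~ /\w/u ? 1 : 0;

printf "ASCII rules: \\w=%d  Unicode rules: \\w=%d  /u: \\w=%d\n",
    $ascii_word, $uni_word, $forced;
```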
a6a7eedc KW |
353 | =head2 Extended Grapheme Clusters (Logical characters) |
354 | ||
355 | Consider a character, say C<H>. It could appear with various marks around it, | |
356 | such as an acute accent, or a circumflex, or various hooks, circles, arrows, | |
357 | I<etc.>, above, below, to one side or the other, I<etc>. There are many | |
358 | possibilities among the world's languages. The number of combinations is | |
359 | astronomical, and if there were a character for each combination, it would | |
360 | soon exhaust Unicode's more than a million possible characters. So Unicode | |
361 | took a different approach: there is a character for the base C<H>, and a | |
362 | character for each of the possible marks, and these can be variously combined | |
363 | to get a final logical character. So a logical character--what appears to be a | |
364 | single character--can be a sequence of more than one individual character. | |
365 | The Unicode standard calls these "extended grapheme clusters" (which | |
366 | is an improved version of the no-longer much used "grapheme cluster"); | |
367 | Perl furnishes the C<\X> regular expression construct to match such | |
368 | sequences in their entirety. | |
369 | ||
370 | But Unicode's intent is to unify the existing character set standards and | |
371 | practices, and several pre-existing standards have single characters that | |
372 | mean the same thing as some of these combinations, like ISO-8859-1, | |
373 | which has quite a few of them. For example, C<"LATIN CAPITAL LETTER E | |
374 | WITH ACUTE"> was already in this standard when Unicode came along. | |
375 | Unicode therefore added it to its repertoire as that single character. | |
376 | But this character is considered by Unicode to be equivalent to the | |
377 | sequence consisting of the character C<"LATIN CAPITAL LETTER E"> | |
378 | followed by the character C<"COMBINING ACUTE ACCENT">. | |
379 | ||
380 | C<"LATIN CAPITAL LETTER E WITH ACUTE"> is called a "pre-composed" | |
381 | character, and its equivalence with the "E" and the "COMBINING ACUTE ACCENT" | |
382 | sequence is called canonical equivalence. All pre-composed characters | |
383 | are said to have a decomposition (into the equivalent sequence), and the | |
384 | decomposition type is also called canonical. A string may be comprised | |
385 | as much as possible of precomposed characters, or it may be comprised of | |
386 | entirely decomposed characters. Unicode calls these respectively, | |
387 | "Normalization Form Composed" (NFC) and "Normalization Form Decomposed" (NFD). | |
388 | The C<L<Unicode::Normalize>> module contains functions that convert | |
389 | between the two. A string may also have both composed characters and | |
390 | decomposed characters; this module can be used to make it all one or the | |
391 | other. | |
392 | ||
393 | You may be presented with strings in any of these equivalent forms. | |
394 | There is currently nothing in Perl 5 that ignores the differences. So | |
395 | you'll have to specially handle it. The usual advice is to convert your | |
396 | inputs to C<NFD> before processing further. | |
397 | ||
398 | For more detailed information, see L<http://unicode.org/reports/tr15/>. | |
399 | ||
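The following sketch illustrates canonical equivalence using the C<NFC> and C<NFD> functions of C<L<Unicode::Normalize>>, together with C<\X>. The two sample strings are the precomposed and decomposed forms of the example discussed above.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $precomposed = "\N{U+C9}";      # LATIN CAPITAL LETTER E WITH ACUTE
my $decomposed  = "E\N{U+301}";    # "E" followed by COMBINING ACUTE ACCENT

# Perl compares code point by code point, so the raw strings differ.
my $raw_equal = $precomposed eq $decomposed ? 1 : 0;

# After normalizing both sides to the same form, they compare equal.
my $nfd_equal = NFD($precomposed) eq NFD($decomposed) ? 1 : 0;
my $nfc_equal = NFC($precomposed) eq NFC($decomposed) ? 1 : 0;

# The decomposed form is two code points but one extended grapheme cluster.
my $code_points = length $decomposed;
my @clusters    = $decomposed =~ /(\X)/g;
my $n_clusters  = scalar @clusters;

print "raw: $raw_equal, NFD: $nfd_equal, NFC: $nfc_equal, ",
      "code points: $code_points, clusters: $n_clusters\n";
```

Normalizing once at input time means that later comparisons and hash lookups do not each have to account for the equivalent forms.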
822502e5 TS |
400 | =head2 Unicode Character Properties |
401 | ||
ee88f7b6 | 402 | (The only time that Perl considers a sequence of individual code |
9d1c51c1 KW |
403 | points as a single logical character is in the C<\X> construct, already |
404 | mentioned above. Therefore "character" in this discussion means a single | |
ee88f7b6 KW |
405 | Unicode code point.) |
406 | ||
407 | Very nearly all Unicode character properties are accessible through | |
408 | regular expressions by using the C<\p{}> "matches property" construct | |
409 | and the C<\P{}> "doesn't match property" for its negation. | |
51f494cc | 410 | |
9d1c51c1 | 411 | For instance, C<\p{Uppercase}> matches any single character with the Unicode |
a9130ea9 KW |
412 | C<"Uppercase"> property, while C<\p{L}> matches any character with a |
413 | C<General_Category> of C<"L"> (letter) property (see | |
414 | L</General_Category> below). Brackets are not | |
9d1c51c1 | 415 | required for single letter property names, so C<\p{L}> is equivalent to C<\pL>. |
51f494cc | 416 | |
9d1c51c1 | 417 | More formally, C<\p{Uppercase}> matches any single character whose Unicode |
a9130ea9 KW |
418 | C<Uppercase> property value is C<True>, and C<\P{Uppercase}> matches any character |
419 | whose C<Uppercase> property value is C<False>, and they could have been written as | |
9d1c51c1 | 420 | C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively. |
51f494cc | 421 | |
b19eb496 | 422 | This formality is needed when properties are not binary; that is, if they can |
a9130ea9 KW |
423 | take on more values than just C<True> and C<False>. For example, the |
424 | C<Bidi_Class> property (see L</"Bidirectional Character Types"> below), | |
425 | can take on several different | |
426 | values, such as C<Left>, C<Right>, C<Whitespace>, and others. To match these, one needs | |
427 | to specify both the property name (C<Bidi_Class>), AND the value being | |
5bff2035 | 428 | matched against |
b65e6125 | 429 | (C<Left>, C<Right>, I<etc.>). This is done, as in the examples above, by having the |
9f815e24 | 430 | two components separated by an equal sign (or interchangeably, a colon), like |
51f494cc KW |
431 | C<\p{Bidi_Class: Left}>. |
432 | ||
433 | All Unicode-defined character properties may be written in these compound forms | |
a9130ea9 | 434 | of C<\p{I<property>=I<value>}> or C<\p{I<property>:I<value>}>, but Perl provides some |
51f494cc KW |
435 | additional properties that are written only in the single form, as well as |
436 | single-form short-cuts for all binary properties and certain others described | |
437 | below, in which you may omit the property name and the equals or colon | |
438 | separator. | |
439 | ||
440 | Most Unicode character properties have at least two synonyms (or aliases if you | |
b19eb496 | 441 | prefer): a short one that is easier to type and a longer one that is more |
a9130ea9 KW |
442 | descriptive and hence easier to understand. Thus the C<"L"> and |
443 | C<"Letter"> properties above are equivalent and can be used | |
444 | interchangeably. Likewise, C<"Upper"> is a synonym for C<"Uppercase">, | |
445 | and we could have written C<\p{Uppercase}> equivalently as C<\p{Upper}>. | |
446 | Also, there are typically various synonyms for the values the property | |
447 | can be. For binary properties, C<"True"> has 3 synonyms: C<"T">, | |
448 | C<"Yes">, and C<"Y">; and C<"False"> has correspondingly C<"F">, | |
449 | C<"No">, and C<"N">. But be careful. A short form of a value for one | |
450 | property may not mean the same thing as the same short form for another. | |
451 | Thus, for the C<L</General_Category>> property, C<"L"> means | |
452 | C<"Letter">, but for the L<C<Bidi_Class>|/Bidirectional Character Types> | |
453 | property, C<"L"> means C<"Left">. A complete list of properties and | |
454 | synonyms is in L<perluniprops>. | |
51f494cc | 455 | |
b19eb496 | 456 | Upper/lower case differences in property names and values are irrelevant; |
51f494cc KW |
457 | thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>. |
458 | Similarly, you can add or subtract underscores anywhere in the middle of a | |
459 | word, so that these are also equivalent to C<\p{U_p_p_e_r}>. And white space | |
460 | is irrelevant adjacent to non-word characters, such as the braces and the equals | |
b19eb496 TC |
461 | or colon separators, so C<\p{ Upper }> and C<\p{ Upper_case : Y }> are |
462 | equivalent to these as well. In fact, white space and even | |
463 | hyphens can usually be added or deleted anywhere. So even C<\p{ Up-per case = Yes}> is | |
51f494cc | 464 | equivalent. All this is called "loose-matching" by Unicode. The few places |
b19eb496 | 465 | where stricter matching is used are in the middle of numbers, and in the Perl |
51f494cc | 466 | extension properties that begin or end with an underscore. Stricter matching |
b19eb496 | 467 | cares about white space (except adjacent to non-word characters), |
51f494cc | 468 | hyphens, and non-interior underscores. |
4193bef7 | 469 | |
376d9008 | 470 | You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret |
a9130ea9 | 471 | (C<^>) between the first brace and the property name: C<\p{^Tamil}> is |
eb0cc9e3 | 472 | equal to C<\P{Tamil}>. |
4193bef7 | 473 | |
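A small sketch of the property syntax described above follows. The spellings are a subset of those listed in the text. The C<Bidi_Class> checks use the short values C<L> and C<R>, and the sample characters are chosen only for illustration.

```perl
use strict;
use warnings;

# Spellings that loose matching treats as the same property.
my @equivalent = (
    qr/\p{Uppercase}/,
    qr/\p{Upper}/,
    qr/\p{upper}/,
    qr/\p{UpPeR}/,
    qr/\p{Uppercase=True}/,
    qr/\p{Uppercase: Y}/,
);

# grep in scalar context counts how many patterns match.
my $upper_hits = grep { "A" =~ $_ } @equivalent;
my $lower_hits = grep { "a" =~ $_ } @equivalent;

# \P{...} and \p{^...} are two ways to write the negation.
my $neg_brace = "a" =~ /\P{Upper}/  ? 1 : 0;
my $neg_caret = "a" =~ /\p{^Upper}/ ? 1 : 0;

# A non-binary property needs the compound form: name, separator, value.
my $ltr = "A"         =~ /\p{Bidi_Class: L}/ ? 1 : 0;
my $rtl = "\N{U+5D0}" =~ /\p{Bidi_Class: R}/ ? 1 : 0;   # HEBREW LETTER ALEF

print "upper: $upper_hits, lower: $lower_hits, ",
      "negations: $neg_brace/$neg_caret, bidi: $ltr/$rtl\n";
```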
56ca34ca KW |
474 | Almost all properties are immune to case-insensitive matching. That is, |
475 | adding a C</i> regular expression modifier does not change what they | |
476 | match. There are two sets that are affected. | |
477 | The first set is | |
478 | C<Uppercase_Letter>, | |
479 | C<Lowercase_Letter>, | |
480 | and C<Titlecase_Letter>, | |
481 | all of which match C<Cased_Letter> under C</i> matching. | |
482 | And the second set is | |
483 | C<Uppercase>, | |
484 | C<Lowercase>, | |
485 | and C<Titlecase>, | |
486 | all of which match C<Cased> under C</i> matching. | |
487 | This set also includes its subsets C<PosixUpper> and C<PosixLower> both | |
a9130ea9 | 488 | of which under C</i> match C<PosixAlpha>. |
56ca34ca | 489 | (The difference between these sets is that some things, such as Roman |
b65e6125 KW |
490 | numerals, come in both upper and lower case so they are C<Cased>, but |
491 | aren't considered letters, so they aren't C<Cased_Letter>'s.) | |
56ca34ca | 492 | |
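The effect of C</i> on these two sets can be sketched as follows. The Roman numeral used here, U+2170 SMALL ROMAN NUMERAL ONE, is chosen as an example of a character that is C<Cased> but is not a letter.

```perl
use strict;
use warnings;

# Without /i, Uppercase_Letter (Lu) does not match a lowercase letter.
my $plain = "a" =~ /\p{Lu}/  ? 1 : 0;

# With /i, Lu behaves like Cased_Letter, so the lowercase letter matches.
my $fold  = "a" =~ /\p{Lu}/i ? 1 : 0;

my $roman = "\N{U+2170}";   # SMALL ROMAN NUMERAL ONE

# Under /i, Uppercase behaves like Cased, which includes the numeral.
my $roman_cased  = $roman =~ /\p{Uppercase}/i ? 1 : 0;

# The numeral is not a letter, so it is outside Cased_Letter even under /i.
my $roman_letter = $roman =~ /\p{Lu}/i ? 1 : 0;

print "plain: $plain, fold: $fold, ",
      "roman cased: $roman_cased, roman letter: $roman_letter\n";
```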
2d88a86a KW |
493 | See L</Beyond Unicode code points> for special considerations when |
494 | matching Unicode properties against non-Unicode code points. | |
94b42e47 | 495 | |
51f494cc | 496 | =head3 B<General_Category> |
14bb0a9a | 497 | |
51f494cc KW |
498 | Every Unicode character is assigned a general category, which is the "most |
499 | usual categorization of a character" (from | |
500 | L<http://www.unicode.org/reports/tr44>). | |
822502e5 | 501 | |
9f815e24 | 502 | The compound way of writing these is like C<\p{General_Category=Number}> |
b65e6125 | 503 | (short: C<\p{gc:n}>). But Perl furnishes shortcuts in which everything up |
51f494cc KW |
504 | through the equal or colon separator is omitted. So you can instead just write |
505 | C<\pN>. | |
822502e5 | 506 | |
a9130ea9 KW |
507 | Here are the short and long forms of the values the C<General_Category> property |
508 | can have: | |
393fec97 | 509 | |
d73e5302 JH |
510 | Short Long |
511 | ||
512 | L Letter | |
51f494cc KW |
513 | LC, L& Cased_Letter (that is: [\p{Ll}\p{Lu}\p{Lt}]) |
514 | Lu Uppercase_Letter | |
515 | Ll Lowercase_Letter | |
516 | Lt Titlecase_Letter | |
517 | Lm Modifier_Letter | |
518 | Lo Other_Letter | |
d73e5302 JH |
519 | |
520 | M Mark | |
51f494cc KW |
521 | Mn Nonspacing_Mark |
522 | Mc Spacing_Mark | |
523 | Me Enclosing_Mark | |
d73e5302 JH |
524 | |
525 | N Number | |
51f494cc KW |
526 | Nd Decimal_Number (also Digit) |
527 | Nl Letter_Number | |
528 | No Other_Number | |
529 | ||
530 | P Punctuation (also Punct) | |
531 | Pc Connector_Punctuation | |
532 | Pd Dash_Punctuation | |
533 | Ps Open_Punctuation | |
534 | Pe Close_Punctuation | |
535 | Pi Initial_Punctuation | |
d73e5302 | 536 | (may behave like Ps or Pe depending on usage) |
51f494cc | 537 | Pf Final_Punctuation |
d73e5302 | 538 | (may behave like Ps or Pe depending on usage) |
51f494cc | 539 | Po Other_Punctuation |
d73e5302 JH |
540 | |
541 | S Symbol | |
51f494cc KW |
542 | Sm Math_Symbol |
543 | Sc Currency_Symbol | |
544 | Sk Modifier_Symbol | |
545 | So Other_Symbol | |
d73e5302 JH |
546 | |
547 | Z Separator | |
51f494cc KW |
548 | Zs Space_Separator |
549 | Zl Line_Separator | |
550 | Zp Paragraph_Separator | |
d73e5302 JH |
551 | |
552 | C Other | |
d88362ca | 553 | Cc Control (also Cntrl) |
e150c829 | 554 | Cf Format |
6d4f9cf2 | 555 | Cs Surrogate |
51f494cc | 556 | Co Private_Use |
e150c829 | 557 | Cn Unassigned |
1ac13f9a | 558 | |
376d9008 | 559 | Single-letter properties match all characters in any of the |
3e4dbfed | 560 | two-letter sub-properties starting with the same letter. |
b19eb496 | 561 | C<LC> and C<L&> are special: both are aliases for the set consisting of everything matched by C<Ll>, C<Lu>, and C<Lt>. |
32293815 | 562 | |
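A few of the forms in the table can be exercised with a sketch like this one. The sample characters (an ASCII digit, a Bengali digit, a Roman numeral, and a Hebrew letter) are assumptions picked to land in different subcategories.

```perl
use strict;
use warnings;

# Compound, short compound, and single-form spellings of the same category.
my @number_forms = (
    qr/\p{General_Category=Number}/,
    qr/\p{gc:n}/,
    qr/\p{N}/,
    qr/\pN/,
);
my $number_hits = grep { "5" =~ $_ } @number_forms;

# A single-letter category covers all of its two-letter subcategories.
my $bengali_nd = "\N{U+9E6}"  =~ /\p{Nd}/ ? 1 : 0;   # BENGALI DIGIT ZERO
my $roman_n    = "\N{U+2160}" =~ /\pN/    ? 1 : 0;   # ROMAN NUMERAL ONE is Nl
my $roman_nd   = "\N{U+2160}" =~ /\p{Nd}/ ? 1 : 0;   # but it is not Nd

# LC and L& both mean Ll, Lu, or Lt; Other_Letter (Lo) is excluded.
my $lc_alias  = ("a" =~ /\p{LC}/ && "a" =~ /\p{L&}/) ? 1 : 0;
my $hebrew_lc = "\N{U+5D0}" =~ /\p{LC}/ ? 1 : 0;     # HEBREW LETTER ALEF is Lo
my $hebrew_l  = "\N{U+5D0}" =~ /\pL/    ? 1 : 0;

print "number forms matching: $number_hits\n";
```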
51f494cc | 563 | =head3 B<Bidirectional Character Types> |
822502e5 | 564 | |
b19eb496 | 565 | Because scripts differ in their directionality (Hebrew and Arabic are |
a9130ea9 | 566 | written right to left, for example) Unicode supplies a C<Bidi_Class> property. |
1850f57f | 567 | Some of the values this property can have are: |
32293815 | 568 | |
88af3b93 | 569 | Value Meaning |
92e830a9 | 570 | |
12ac2576 JP |
571 | L Left-to-Right |
572 | LRE Left-to-Right Embedding | |
573 | LRO Left-to-Right Override | |
574 | R Right-to-Left | |
51f494cc | 575 | AL Arabic Letter |
12ac2576 JP |
576 | RLE Right-to-Left Embedding |
577 | RLO Right-to-Left Override | |
578 | PDF Pop Directional Format | |
579 | EN European Number | |
51f494cc KW |
580 | ES European Separator |
581 | ET European Terminator | |
12ac2576 | 582 | AN Arabic Number |
51f494cc | 583 | CS Common Separator |
12ac2576 JP |
584 | NSM Non-Spacing Mark |
585 | BN Boundary Neutral | |
586 | B Paragraph Separator | |
587 | S Segment Separator | |
588 | WS Whitespace | |
589 | ON Other Neutrals | |
590 | ||
51f494cc KW |
591 | This property is always written in the compound form. |
592 | For example, C<\p{Bidi_Class:R}> matches characters that are normally | |
1850f57f | 593 | written right to left. Unlike the |
a9130ea9 | 594 | C<L</General_Category>> property, this |
1850f57f KW |
595 | property can have more values added in a future Unicode release. Those |
596 | listed above comprised the complete set for many Unicode releases, but | |
597 | others were added in Unicode 6.3; you can always find what the | |
20ada7da | 598 | current ones are in L<perluniprops>. And |
1850f57f | 599 | L<http://www.unicode.org/reports/tr9/> describes how to use them. |
eb0cc9e3 | 600 | |
51f494cc KW |
601 | =head3 B<Scripts> |
602 | ||
b19eb496 | 603 | The world's languages are written in many different scripts. This sentence |
e1b711da | 604 | (unless you're reading it in translation) is written in Latin, while Russian is |
c69ca1d4 | 605 | written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in |
e1b711da | 606 | Hiragana or Katakana. There are many more. |
51f494cc | 607 | |
48791bf1 KW |
608 | The Unicode C<Script> and C<Script_Extensions> properties give what |
609 | script a given character is in. The C<Script_Extensions> property is an | |
610 | improved version of C<Script>, as demonstrated below. Either property | |
611 | can be specified with the compound form like | |
82aed44a KW |
612 | C<\p{Script=Hebrew}> (short: C<\p{sc=hebr}>), or |
613 | C<\p{Script_Extensions=Javanese}> (short: C<\p{scx=java}>). | |
614 | In addition, Perl furnishes shortcuts for all | |
48791bf1 KW |
615 | C<Script_Extensions> property names. You can omit everything up through |
616 | the equals (or colon), and simply write C<\p{Latin}> or C<\P{Cyrillic}>. | |
617 | (This is not true for C<Script>, which is required to be | |
618 | written in the compound form. Prior to Perl v5.26, the single form | |
619 | returned the plain old C<Script> version, but was changed because | |
620 | C<Script_Extensions> gives better results.) | |
82aed44a KW |
621 | |
622 | The difference between these two properties involves characters that are | |
623 | used in multiple scripts. For example the digits '0' through '9' are | |
624 | used in many parts of the world. These are placed in a script named | |
625 | C<Common>. Other characters are used in just a few scripts. For | |
a9130ea9 | 626 | example, the C<"KATAKANA-HIRAGANA DOUBLE HYPHEN"> is used in both Japanese |
82aed44a KW |
627 | scripts, Katakana and Hiragana, but nowhere else. The C<Script> |
628 | property places all characters that are used in multiple scripts in the | |
629 | C<Common> script, while the C<Script_Extensions> property places those | |
630 | that are used in only a few scripts into each of those scripts; while | |
631 | still using C<Common> for those used in many scripts. Thus both these | |
632 | match: | |
633 | ||
634 | "0" =~ /\p{sc=Common}/ # Matches | |
635 | "0" =~ /\p{scx=Common}/ # Matches | |
636 | ||
637 | but only the first of these matches: |
638 | ||
639 | "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Common}/ # Matches |
640 | "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Common}/ # No match |
641 | ||
642 | And only the last two of these match: | |
643 | ||
644 | "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Hiragana}/ # No match |
645 | "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Katakana}/ # No match |
646 | "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Hiragana}/ # Matches |
647 | "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Katakana}/ # Matches |
648 | ||
649 | C<Script_Extensions> is thus an improved C<Script>, in which there are | |
650 | fewer characters in the C<Common> script, and correspondingly more in | |
651 | other scripts. It is new in Unicode version 6.0, and its data are likely | |
652 | to change significantly in later releases, as things get sorted out. | |
b65e6125 | 653 | New code should probably be using C<Script_Extensions> and not plain |
48791bf1 KW |
654 | C<Script>. If you compile perl with a Unicode release that doesn't have |
655 | C<Script_Extensions>, the single form Perl extensions will instead refer | |
656 | to the plain C<Script> property. If you compile with a version of | |
657 | Unicode that doesn't have the C<Script> property, these extensions will | |
658 | not be defined at all. | |
82aed44a KW |
659 | |
660 | (Actually, besides C<Common>, the C<Inherited> script contains |
661 | characters that are used in multiple scripts. These are modifier | |
b65e6125 | 662 | characters which inherit the script value |
82aed44a KW |
663 | of the controlling character. Some of these are used in many scripts, |
664 | and so go into C<Inherited> in both C<Script> and C<Script_Extensions>. | |
665 | Others are used in just a few scripts, so are in C<Inherited> in | |
666 | C<Script>, but not in C<Script_Extensions>.) | |
667 | ||
668 | It is worth stressing that there are several different sets of digits in | |
669 | Unicode that are equivalent to 0-9 and are matchable by C<\d> in a | |
670 | regular expression. If they are used in a single language only, they | |
48791bf1 | 671 | are in that language's C<Script> and C<Script_Extensions>. If they are |
82aed44a KW |
672 | used in more than one script, they will be in C<sc=Common>, but only |
673 | if they are used in many scripts should they be in C<scx=Common>. | |
51f494cc | 674 | |
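The point about C<\d> can be checked directly. The following is a minimal
sketch, not part of the original text: it assumes a Perl of v5.14 or later
(for the C</a> modifier), and the variable names are purely illustrative.
It shows that C<\d> matches a digit from outside ASCII unless C</a>
restricts it.

```perl
use strict;
use warnings;

# Both characters have General_Category=Decimal_Number, so \d matches each.
my $ascii_seven  = "7";          # U+0037 DIGIT SEVEN
my $arabic_seven = "\x{0667}";   # U+0667 ARABIC-INDIC DIGIT SEVEN

my $ascii_is_digit  = $ascii_seven  =~ /^\d$/ ? 1 : 0;
my $arabic_is_digit = $arabic_seven =~ /^\d$/ ? 1 : 0;

# The /a modifier restricts \d to the ASCII digits 0-9.
my $arabic_is_ascii_digit = $arabic_seven =~ /^\d$/a ? 1 : 0;

printf "plain \\d: ascii=%d arabic=%d; with /a: arabic=%d\n",
    $ascii_is_digit, $arabic_is_digit, $arabic_is_ascii_digit;
```

If you need only the digits 0-9, say so explicitly with C<[0-9]> or C</a>
rather than relying on C<\d> alone.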
48791bf1 KW |
675 | The explanation above has omitted some detail; refer to UAX#24 "Unicode |
676 | Script Property": L<http://www.unicode.org/reports/tr24>. | |
677 | ||
51f494cc KW |
678 | A complete list of scripts and their shortcuts is in L<perluniprops>. |
679 | ||
a9130ea9 | 680 | =head3 B<Use of the C<"Is"> Prefix> |
822502e5 | 681 | |
b65e6125 KW |
682 | For backward compatibility (with Perl 5.6), all properties writable |
683 | without using the compound form mentioned | |
51f494cc KW |
684 | so far may have C<Is> or C<Is_> prepended to their name, so C<\P{Is_Lu}>, for |
685 | example, is equal to C<\P{Lu}>, and C<\p{IsScript:Arabic}> is equal to | |
686 | C<\p{Arabic}>. | |
eb0cc9e3 | 687 | |
51f494cc | 688 | =head3 B<Blocks> |
2796c109 | 689 | |
1bfb14c4 JH |
690 | In addition to B<scripts>, Unicode also defines B<blocks> of |
691 | characters. The difference between scripts and blocks is that the | |
692 | concept of scripts is closer to natural languages, while the concept | |
51f494cc | 693 | of blocks is more of an artificial grouping based on groups of Unicode |
a9130ea9 | 694 | characters with consecutive ordinal values. For example, the C<"Basic Latin"> |
b65e6125 | 695 | block is all the characters whose ordinals are between 0 and 127, inclusive; in |
a9130ea9 KW |
696 | other words, the ASCII characters. The C<"Latin"> script contains some letters |
697 | from this as well as several other blocks, like C<"Latin-1 Supplement">, | |
b65e6125 | 698 | C<"Latin Extended-A">, I<etc.>, but it does not contain all the characters from |
7be67b37 | 699 | those blocks. It does not, for example, contain the digits 0-9, because |
82aed44a KW |
700 | those digits are shared across many scripts, and hence are in the |
701 | C<Common> script. | |
51f494cc KW |
702 | |
703 | For more about scripts versus blocks, see UAX#24 "Unicode Script Property": | |
704 | L<http://www.unicode.org/reports/tr24> | |
705 | ||
48791bf1 | 706 | The C<Script_Extensions> or C<Script> properties are likely to be the |
82aed44a | 707 | ones you want to use when processing |
a9130ea9 | 708 | natural language; the C<Block> property may occasionally be useful in working |
b19eb496 | 709 | with the nuts and bolts of Unicode. |
51f494cc KW |
710 | |
711 | Block names are matched in the compound form, like C<\p{Block: Arrows}> or | |
b19eb496 | 712 | C<\p{Blk=Hebrew}>. Unlike most other properties, only a few block names have a |
6b5cf123 KW |
713 | Unicode-defined short name. |
714 | ||
715 | Perl also defines single form synonyms for the block property in cases | |
716 | where these do not conflict with something else. But don't use any of | |
717 | these, because they are unstable. Since these are Perl extensions, they | |
718 | are subordinate to official Unicode property names; Unicode doesn't know | |
719 | nor care about Perl's extensions. It may happen that a name that | |
720 | currently means the Perl extension will later be changed without warning | |
721 | to mean a different Unicode property in a future version of the perl | |
722 | interpreter that uses a later Unicode release, and your code would no | |
723 | longer work. The extensions are mentioned here for completeness: Take | |
724 | the block name and prefix it with one of: C<In> (for example | |
725 | C<\p{Blk=Arrows}> can currently be written as C<\p{In_Arrows}>); or | |
726 | sometimes C<Is> (like C<\p{Is_Arrows}>); or sometimes no prefix at all | |
48791bf1 | 727 | (C<\p{Arrows}>). As of this writing (Unicode 9.0) there are no |
6b5cf123 KW |
728 | conflicts with using the C<In_> prefix, but there are plenty with the |
729 | other two forms. For example, C<\p{Is_Hebrew}> and C<\p{Hebrew}> mean | |
48791bf1 KW |
730 | C<\p{Script_Extensions=Hebrew}> which is NOT the same thing as |
731 | C<\p{Blk=Hebrew}>. Our | |
6b5cf123 KW |
732 | advice used to be to use the C<In_> prefix as a single form way of |
733 | specifying a block. But Unicode 8.0 added properties whose names begin | |
734 | with C<In>, and it's now clear that it's only luck that's so far | |
735 | prevented a conflict. Using C<In> is only marginally less typing than | |
736 | C<Blk:>, and the latter's meaning is clearer anyway, and guaranteed to | |
737 | never conflict. So don't take chances. Use C<\p{Blk=foo}> for new | |
738 | code. And be sure that a block is really what you want. In |
739 | most cases scripts are what you want instead. | |
740 | ||
741 | A complete list of blocks is in L<perluniprops>. | |
51f494cc | 742 | |
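To make the difference between blocks and scripts concrete, here is a
short sketch (an illustration added for this discussion, with hypothetical
variable names). It uses only the compound forms recommended above. The
digit "0" lies in the C<"Basic Latin"> block but not in the Latin script,
and C<U+FB1D> is a Hebrew-script letter that lies outside the C<Hebrew>
block.

```perl
use strict;
use warnings;

my $digit = "0";          # U+0030: block Basic Latin, script Common
my $yod   = "\x{FB1D}";   # HEBREW LETTER YOD WITH HIRIQ:
                          # script Hebrew, block Alphabetic Presentation Forms

my %check = (
    digit_in_basic_latin_block => ($digit =~ /\p{Blk=Basic_Latin}/                  ? 1 : 0),
    digit_in_latin_script      => ($digit =~ /\p{Script_Extensions=Latin}/          ? 1 : 0),
    yod_in_hebrew_script       => ($yod   =~ /\p{Script_Extensions=Hebrew}/         ? 1 : 0),
    yod_in_hebrew_block        => ($yod   =~ /\p{Blk=Hebrew}/                       ? 1 : 0),
    yod_in_presentation_block  => ($yod   =~ /\p{Blk=Alphabetic_Presentation_Forms}/ ? 1 : 0),
);

printf "%-28s %d\n", $_, $check{$_} for sort keys %check;
```

Code that wants "Hebrew letters" would miss C<U+FB1D> if it tested the
block, which is why scripts are usually the better choice.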
9f815e24 KW |
743 | =head3 B<Other Properties> |
744 | ||
745 | There are many more properties than the very basic ones described here. | |
746 | A complete list is in L<perluniprops>. | |
747 | ||
748 | Unicode defines all its properties in the compound form, so all single-form | |
b19eb496 TC |
749 | properties are Perl extensions. Most of these are just synonyms for the |
750 | Unicode ones, but some are genuine extensions, including several that are in | |
9f815e24 KW |
751 | the compound form. And quite a few of these are actually recommended by Unicode |
752 | (in L<http://www.unicode.org/reports/tr18>). | |
753 | ||
5bff2035 KW |
754 | This section gives some details on all extensions that aren't just |
755 | synonyms for compound-form Unicode properties | |
756 | (for those properties, you'll have to refer to the | |
9f815e24 KW |
757 | L<Unicode Standard|http://www.unicode.org/reports/tr44>). |
758 | ||
759 | =over | |
760 | ||
761 | =item B<C<\p{All}>> | |
762 | ||
2d88a86a KW |
763 | This matches every possible code point. It is equivalent to C<qr/./s>. |
764 | Unlike all the other non-user-defined C<\p{}> property matches, no | |
765 | warning is ever generated if this property is matched against a |
766 | non-Unicode code point (see L</Beyond Unicode code points> below). | |
9f815e24 KW |
767 | |
768 | =item B<C<\p{Alnum}>> | |
769 | ||
770 | This matches any C<\p{Alphabetic}> or C<\p{Decimal_Number}> character. | |
771 | ||
772 | =item B<C<\p{Any}>> | |
773 | ||
2d88a86a KW |
774 | This matches any of the 1_114_112 Unicode code points. It is a synonym |
775 | for C<\p{Unicode}>. | |
9f815e24 | 776 | |
42581d5d KW |
777 | =item B<C<\p{ASCII}>> |
778 | ||
779 | This matches any of the 128 characters in the US-ASCII character set, | |
780 | which is a subset of Unicode. | |
781 | ||
9f815e24 KW |
782 | =item B<C<\p{Assigned}>> |
783 | ||
a9130ea9 KW |
784 | This matches any assigned code point; that is, any code point whose L<general |
785 | category|/General_Category> is not C<Unassigned> (or equivalently, not C<Cn>). | |
9f815e24 KW |
786 | |
787 | =item B<C<\p{Blank}>> | |
788 | ||
789 | This is the same as C<\h> and C<\p{HorizSpace}>: A character that changes the | |
790 | spacing horizontally. | |
791 | ||
792 | =item B<C<\p{Decomposition_Type: Non_Canonical}>> (Short: C<\p{Dt=NonCanon}>) | |
793 | ||
794 | Matches a character that has a non-canonical decomposition. | |
795 | ||
a6a7eedc KW |
796 | The L</Extended Grapheme Clusters (Logical characters)> section above |
797 | talked about canonical decompositions. However, many more characters | |
798 | have a different type of decomposition, a "compatible" or | |
799 | "non-canonical" decomposition. The sequences that form these | |
800 | decompositions are not considered canonically equivalent to the | |
801 | pre-composed character. An example is the C<"SUPERSCRIPT ONE">. It is | |
802 | somewhat like a regular digit 1, but not exactly; its decomposition into | |
803 | the digit 1 is called a "compatible" decomposition, specifically a | |
9f815e24 | 804 | "super" decomposition. There are several such compatibility |
b65e6125 KW |
805 | decompositions (see L<http://www.unicode.org/reports/tr44>), including |
806 | one called "compat", which means some miscellaneous type of | |
807 | decomposition that doesn't fit into the other decomposition categories | |
808 | that Unicode has chosen. | |
9f815e24 KW |
809 | |
810 | Note that most Unicode characters don't have a decomposition, so their | |
a9130ea9 | 811 | decomposition type is C<"None">. |
9f815e24 | 812 | |
b19eb496 TC |
813 | For your convenience, Perl has added the C<Non_Canonical> decomposition |
814 | type to mean any of the several compatibility decompositions. | |
9f815e24 KW |
815 | |
816 | =item B<C<\p{Graph}>> | |
817 | ||
818 | Matches any character that is graphic. Theoretically, this means a character | |
819 | that on a printer would cause ink to be used. | |
820 | ||
821 | =item B<C<\p{HorizSpace}>> | |
822 | ||
b19eb496 | 823 | This is the same as C<\h> and C<\p{Blank}>: a character that changes the |
9f815e24 KW |
824 | spacing horizontally. |
825 | ||
42581d5d | 826 | =item B<C<\p{In=*}>> |
9f815e24 KW |
827 | |
828 | This is a synonym for C<\p{Present_In=*}>. |
829 | ||
830 | =item B<C<\p{PerlSpace}>> | |
831 | ||
d28d8023 | 832 | This is the same as C<\s>, restricted to ASCII, namely C<S<[ \f\n\r\t]>> |
779cf272 | 833 | and starting in Perl v5.18, a vertical tab. |
9f815e24 KW |
834 | |
835 | Mnemonic: Perl's (original) space. |
836 | ||
837 | =item B<C<\p{PerlWord}>> | |
838 | ||
839 | This is the same as C<\w>, restricted to ASCII, namely C<[A-Za-z0-9_]>. |
840 | ||
841 | Mnemonic: Perl's (original) word. | |
842 | ||
42581d5d | 843 | =item B<C<\p{Posix...}>> |
9f815e24 | 844 | |
b65e6125 KW |
845 | There are several of these, which are equivalents, using the C<\p{}> |
846 | notation, for Posix classes and are described in | |
42581d5d | 847 | L<perlrecharclass/POSIX Character Classes>. |
9f815e24 KW |
848 | |
849 | =item B<C<\p{Present_In: *}>> (Short: C<\p{In=*}>) | |
850 | ||
851 | This property is used when you need to know in what Unicode version(s) a | |
852 | character is. | |
853 | ||
854 | The "*" above stands for some two digit Unicode version number, such as | |
855 | C<1.1> or C<4.0>; or the "*" can also be C<Unassigned>. This property will | |
856 | match the code points whose final disposition has been settled as of the | |
857 | Unicode release given by the version number; C<\p{Present_In: Unassigned}> | |
858 | will match those code points whose meaning has yet to be assigned. | |
859 | ||
a9130ea9 | 860 | For example, C<U+0041> C<"LATIN CAPITAL LETTER A"> was present in the very first |
9f815e24 KW |
861 | Unicode release available, which is C<1.1>, so this property is true for all |
862 | valid "*" versions. On the other hand, C<U+1EFF> was not assigned until version | |
a9130ea9 | 863 | 5.1 when it became C<"LATIN SMALL LETTER Y WITH LOOP">, so the only "*" |
9f815e24 KW |
864 | values that would match it are 5.1, 5.2, and later. |
865 | ||
866 | Unicode furnishes the C<Age> property from which this is derived. The problem | |
867 | with Age is that a strict interpretation of it (which Perl takes) has it | |
868 | matching the precise release a code point's meaning is introduced in. Thus | |
869 | C<U+0041> would match only 1.1; and C<U+1EFF> only 5.1. This is not usually what | |
870 | you want. | |
871 | ||
872 | Some non-Perl implementations of the Age property may change its meaning to be | |
a9130ea9 | 873 | the same as the Perl C<Present_In> property; just be aware of that. |
9f815e24 KW |
874 | |
875 | Another confusion with both these properties is that the definition is not | |
b19eb496 TC |
876 | that the code point has been I<assigned>, but that the meaning of the code point |
877 | has been I<determined>. This is because 66 code points will always be | |
a9130ea9 | 878 | unassigned, and so the C<Age> for them is the Unicode version in which the decision |
b19eb496 | 879 | to make them so was made. For example, C<U+FDD0> is to be permanently |
9f815e24 | 880 | unassigned to a character, and the decision to do that was made in version 3.1, |
b19eb496 | 881 | so C<\p{Age=3.1}> matches this character, as also does C<\p{Present_In: 3.1}> and up. |
9f815e24 KW |
882 | |
883 | =item B<C<\p{Print}>> | |
884 | ||
ae5b72c8 | 885 | This matches any character that is graphical or blank, except controls. |
9f815e24 KW |
886 | |
887 | =item B<C<\p{SpacePerl}>> | |
888 | ||
889 | This is the same as C<\s>, including beyond ASCII. | |
890 | ||
4d4acfba | 891 | Mnemonic: Space, as modified by Perl. (Before v5.18, it didn't include the |
779cf272 | 892 | vertical tab, which both the Posix standard and Unicode consider white space.) |
9f815e24 | 893 | |
4364919a KW |
894 | =item B<C<\p{Title}>> and B<C<\p{Titlecase}>> |
895 | ||
896 | Under case-sensitive matching, these both match the same code points as | |
897 | C<\p{General Category=Titlecase_Letter}> (C<\p{gc=lt}>). The difference | |
898 | is that under C</i> caseless matching, these match the same as | |
899 | C<\p{Cased}>, whereas C<\p{gc=lt}> matches C<\p{Cased_Letter}>. |
900 | ||
2d88a86a KW |
901 | =item B<C<\p{Unicode}>> |
902 | ||
903 | This matches any of the 1_114_112 Unicode code points. It is a synonym |
904 | for C<\p{Any}>. |
905 | ||
9f815e24 KW |
906 | =item B<C<\p{VertSpace}>> |
907 | ||
908 | This is the same as C<\v>: A character that changes the spacing vertically. | |
909 | ||
910 | =item B<C<\p{Word}>> | |
911 | ||
b19eb496 | 912 | This is the same as C<\w>, including over 100_000 characters beyond ASCII. |
9f815e24 | 913 | |
42581d5d KW |
914 | =item B<C<\p{XPosix...}>> |
915 | ||
b19eb496 | 916 | There are several of these, which are the standard Posix classes |
42581d5d KW |
917 | extended to the full Unicode range. They are described in |
918 | L<perlrecharclass/POSIX Character Classes>. | |
919 | ||
9f815e24 KW |
920 | =back |
921 | ||
a9130ea9 | 922 | |
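The difference between C<Present_In> and C<Age> described above is easy to
see with the two code points used in that discussion. This is a small
sketch under the assumption that your perl was built with Unicode 5.2 or
later; the hash keys are just labels chosen for the example.

```perl
use strict;
use warnings;

my $cap_a  = "A";          # U+0041, present since Unicode 1.1
my $y_loop = "\x{1EFF}";   # LATIN SMALL LETTER Y WITH LOOP, added in 5.1

my %result = (
    # Present_In is cumulative: true for the introducing release and later.
    a_present_in_5_1 => ($cap_a  =~ /\p{Present_In=5.1}/ ? 1 : 0),
    y_present_in_5_0 => ($y_loop =~ /\p{Present_In=5.0}/ ? 1 : 0),
    y_present_in_5_2 => ($y_loop =~ /\p{Present_In=5.2}/ ? 1 : 0),
    # Age names only the release that introduced the code point.
    a_age_1_1        => ($cap_a  =~ /\p{Age=1.1}/ ? 1 : 0),
    a_age_5_1        => ($cap_a  =~ /\p{Age=5.1}/ ? 1 : 0),
    y_age_5_1        => ($y_loop =~ /\p{Age=5.1}/ ? 1 : 0),
);

printf "%-18s %d\n", $_, $result{$_} for sort keys %result;
```

To ask "did this character exist as of release I<N>?", C<Present_In> is
the property to use; C<Age> answers "which release introduced it?".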
376d9008 | 923 | =head2 User-Defined Character Properties |
491fd90a | 924 | |
51f494cc | 925 | You can define your own binary character properties by defining subroutines |
a9130ea9 | 926 | whose names begin with C<"In"> or C<"Is">. (The experimental feature |
9d1a5160 KW |
927 | L<perlre/(?[ ])> provides an alternative which allows more complex |
928 | definitions.) The subroutines can be defined in any | |
51f494cc | 929 | package. The user-defined properties can be used in the regular expression |
a9130ea9 | 930 | C<\p{}> and C<\P{}> constructs; if you are using a user-defined property from a |
51f494cc | 931 | package other than the one you are in, you must specify its package in the |
a9130ea9 | 932 | C<\p{}> or C<\P{}> construct. |
bac0b425 | 933 | |
51f494cc | 934 | # assuming property IsForeign defined in Lang:: |
bac0b425 JP |
935 | package main; # property package name required |
936 | if ($txt =~ /\p{Lang::IsForeign}+/) { ... } | |
937 | ||
938 | package Lang; # property package name not required | |
939 | if ($txt =~ /\p{IsForeign}+/) { ... } | |
940 | ||
941 | ||
942 | Note that the effect is compile-time and immutable once defined. | |
b19eb496 TC |
943 | However, the subroutines are passed a single parameter, which is 0 if |
944 | case-sensitive matching is in effect and non-zero if caseless matching | |
56ca34ca KW |
945 | is in effect. The subroutine may return different values depending on |
946 | the value of the flag, and one set of values will immutably be in effect | |
b19eb496 | 947 | for all case-sensitive matches, and the other set for all case-insensitive |
56ca34ca | 948 | matches. |
491fd90a | 949 | |
b19eb496 | 950 | Note that if the regular expression is tainted, then Perl will die rather |
a9130ea9 | 951 | than calling the subroutine when the name of the subroutine is |
0e9be77f DM |
952 | determined by the tainted data. |
953 | ||
376d9008 JB |
954 | The subroutines must return a specially-formatted string, with one |
955 | or more newline-separated lines. Each line must be one of the following: | |
491fd90a JH |
956 | |
957 | =over 4 | |
958 | ||
959 | =item * | |
960 | ||
df9e1087 | 961 | A single hexadecimal number denoting a code point to include. |
510254c9 A |
962 | |
963 | =item * | |
964 | ||
99a6b1f0 | 965 | Two hexadecimal numbers separated by horizontal whitespace (space or |
df9e1087 | 966 | tab characters) denoting a range of code points to include. |
491fd90a JH |
967 | |
968 | =item * | |
969 | ||
a9130ea9 KW |
970 | Something to include, prefixed by C<"+">: a built-in character |
971 | property (prefixed by C<"utf8::">) or a fully qualified (including package | |
830137a2 | 972 | name) user-defined character property, |
bac0b425 JP |
973 | to represent all the characters in that property; two hexadecimal code |
974 | points for a range; or a single hexadecimal code point. | |
491fd90a JH |
975 | |
976 | =item * | |
977 | ||
a9130ea9 KW |
978 | Something to exclude, prefixed by C<"-">: an existing character |
979 | property (prefixed by C<"utf8::">) or a fully qualified (including package | |
830137a2 | 980 | name) user-defined character property, |
bac0b425 JP |
981 | to represent all the characters in that property; two hexadecimal code |
982 | points for a range; or a single hexadecimal code point. | |
491fd90a JH |
983 | |
984 | =item * | |
985 | ||
a9130ea9 KW |
986 | Something to negate, prefixed by C<"!">: an existing character |
987 | property (prefixed by C<"utf8::">) or a fully qualified (including package |
830137a2 | 988 | name) user-defined character property, |
bac0b425 JP |
989 | to represent all the characters except those in that property; two |
990 | hexadecimal code points for a range; or a single hexadecimal code point. |
991 | ||
992 | =item * | |
993 | ||
a9130ea9 KW |
994 | Something to intersect with, prefixed by C<"&">: an existing character |
995 | property (prefixed by C<"utf8::">) or a fully qualified (including package | |
830137a2 | 996 | name) user-defined character property, |
bac0b425 JP |
997 | to keep only the characters that are also in that property; two |
998 | hexadecimal code points for a range; or a single hexadecimal code point. | |
491fd90a JH |
999 | |
1000 | =back | |
1001 | ||
1002 | For example, to define a property that covers both the Japanese | |
1003 | syllabaries (hiragana and katakana), you can define | |
1004 | ||
1005 | sub InKana { | |
d88362ca | 1006 | return <<END; |
d5822f25 A |
1007 | 3040\t309F |
1008 | 30A0\t30FF | |
491fd90a JH |
1009 | END |
1010 | } | |
1011 | ||
d5822f25 A |
1012 | Imagine that the here-doc end marker is at the beginning of the line. |
1013 | Now you can use C<\p{InKana}> and C<\P{InKana}>. | |
491fd90a JH |
1014 | |
1015 | You could also have used the existing block property names: | |
1016 | ||
1017 | sub InKana { | |
d88362ca | 1018 | return <<'END'; |
491fd90a JH |
1019 | +utf8::InHiragana |
1020 | +utf8::InKatakana | |
1021 | END | |
1022 | } | |
1023 | ||
1024 | Suppose you wanted to match only the allocated characters, | |
d5822f25 | 1025 | not the raw block ranges: in other words, you want to remove |
b65e6125 | 1026 | the unassigned characters: |
491fd90a JH |
1027 | |
1028 | sub InKana { | |
d88362ca | 1029 | return <<'END'; |
491fd90a JH |
1030 | +utf8::InHiragana |
1031 | +utf8::InKatakana | |
1032 | -utf8::IsCn | |
1033 | END | |
1034 | } | |
1035 | ||
1036 | The negation is useful for defining (surprise!) negated classes. | |
1037 | ||
1038 | sub InNotKana { | |
d88362ca | 1039 | return <<'END'; |
491fd90a JH |
1040 | !utf8::InHiragana |
1041 | -utf8::InKatakana | |
1042 | +utf8::IsCn | |
1043 | END | |
1044 | } | |
1045 | ||
461020ad KW |
1046 | This will match all non-Unicode code points, since every one of them is |
1047 | not in Kana. You can use intersection to exclude these, if desired, as | |
1048 | this modified example shows: | |
bac0b425 | 1049 | |
461020ad | 1050 | sub InNotKana { |
bac0b425 | 1051 | return <<'END'; |
461020ad KW |
1052 | !utf8::InHiragana |
1053 | -utf8::InKatakana | |
1054 | +utf8::IsCn | |
1055 | &utf8::Any | |
bac0b425 JP |
1056 | END |
1057 | } | |
1058 | ||
461020ad KW |
1059 | C<&utf8::Any> must be the last line in the definition. |
1060 | ||
1061 | Intersection is used generally for getting the common characters matched | |
a9130ea9 | 1062 | by two (or more) classes. It's important to remember not to use C<"&"> for |
461020ad KW |
1063 | the first set; that would be intersecting with nothing, resulting in an |
1064 | empty set. | |
1065 | ||
2d88a86a KW |
1066 | Unlike non-user-defined C<\p{}> property matches, no warning is ever |
1067 | generated if these properties are matched against a non-Unicode code | |
1068 | point (see L</Beyond Unicode code points> below). | |
bac0b425 | 1069 | |
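Putting the pieces together, the following is a self-contained sketch of
the third C<InKana> definition above, used from package C<main>. The test
code points and the C<%is_kana> hash are illustrative choices for this
example only; C<U+3040> is included because it lies in the Hiragana block
but is unassigned as of this writing.

```perl
use strict;
use warnings;

# User-defined property: both kana blocks, minus unassigned code points.
sub InKana {
    return <<'END';
+utf8::InHiragana
+utf8::InKatakana
-utf8::IsCn
END
}

my %is_kana;
for my $cp (0x3042, 0x30A2, 0x0041, 0x3040) {
    # chr() of a code point above 255 yields a character string.
    $is_kana{ sprintf "U+%04X", $cp } = chr($cp) =~ /^\p{InKana}$/ ? 1 : 0;
}

printf "%s %d\n", $_, $is_kana{$_} for sort keys %is_kana;
```

Because the definition is consulted once and then fixed, changing what
C<InKana> returns later in the program has no effect on patterns that
use it.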
68585b5e | 1070 | =head2 User-Defined Case Mappings (for serious hackers only) |
822502e5 | 1071 | |
5d1892be | 1072 | B<This feature has been removed as of Perl 5.16.> |
a9130ea9 | 1073 | The CPAN module C<L<Unicode::Casing>> provides better functionality without |
5d1892be KW |
1074 | the drawbacks that this feature had. If you are using a Perl earlier |
1075 | than 5.16, this feature was most fully documented in the 5.14 version of | |
1076 | this pod: | |
1077 | L<http://perldoc.perl.org/5.14.0/perlunicode.html#User-Defined-Case-Mappings-%28for-serious-hackers-only%29> | |
3a2263fe | 1078 | |
376d9008 | 1079 | =head2 Character Encodings for Input and Output |
8cbd9a7a | 1080 | |
7221edc9 | 1081 | See L<Encode>. |
8cbd9a7a | 1082 | |
c29a771d | 1083 | =head2 Unicode Regular Expression Support Level |
776f8809 | 1084 | |
b19eb496 | 1085 | The following list of Unicode supported features for regular expressions describes |
fea12a3e KW |
1086 | all features currently directly supported by core Perl. The references |
1087 | to "Level I<N>" and the section numbers refer to | |
1088 | L<UTS#18 "Unicode Regular Expressions"|http://www.unicode.org/reports/tr18>, | |
1089 | version 13, November 2013. | |
1090 | ||
1091 | =head3 Level 1 - Basic Unicode Support | |
1092 | ||
1093 | RL1.1 Hex Notation - Done [1] | |
1094 | RL1.2 Properties - Done [2] | |
1095 | RL1.2a Compatibility Properties - Done [3] | |
1096 | RL1.3 Subtraction and Intersection - Experimental [4] | |
1097 | RL1.4 Simple Word Boundaries - Done [5] | |
1098 | RL1.5 Simple Loose Matches - Done [6] | |
1099 | RL1.6 Line Boundaries - Partial [7] | |
1100 | RL1.7 Supplementary Code Points - Done [8] | |
755789c0 | 1101 | |
6f33e417 KW |
1102 | =over 4 |
1103 | ||
a6a7eedc | 1104 | =item [1] C<\N{U+...}> and C<\x{...}> |
6f33e417 | 1105 | |
fea12a3e KW |
1106 | =item [2] |
1107 | C<\p{...}> C<\P{...}>. This requirement is for a minimal list of | |
1108 | properties. Perl supports these and all other Unicode character | |
1109 | properties, as RL2.7 asks (see L</"Unicode Character Properties"> above). |
6f33e417 | 1110 | |
fea12a3e KW |
1111 | =item [3] |
1112 | Perl has C<\d> C<\D> C<\s> C<\S> C<\w> C<\W> C<\X> C<[:I<prop>:]> | |
1113 | C<[:^I<prop>:]>, plus all the properties specified by | |
1114 | L<http://www.unicode.org/reports/tr18/#Compatibility_Properties>. These | |
1115 | are described above in L</Other Properties>. |
6f33e417 | 1116 | |
fea12a3e | 1117 | =item [4] |
6f33e417 | 1118 | |
fea12a3e | 1119 | The experimental feature C<"(?[...])"> starting in v5.18 accomplishes |
a6a7eedc | 1120 | this. |
6f33e417 | 1121 | |
a6a7eedc KW |
1122 | See L<perlre/(?[ ])>. If you don't want to use an experimental |
1123 | feature, you can use one of the following: | |
6f33e417 KW |
1124 | |
1125 | =over 4 | |
1126 | ||
a6a7eedc | 1127 | =item * |
f67a5002 | 1128 | Regular expression lookahead |
6f33e417 KW |
1129 | |
1130 | You can mimic class subtraction using lookahead. | |
8158862b | 1131 | For example, what UTS#18 might write as |
29bdacb8 | 1132 | |
209c9685 | 1133 | [{Block=Greek}-[{UNASSIGNED}]] |
dbe420b4 JH |
1134 | |
1135 | in Perl can be written as: | |
1136 | ||
209c9685 KW |
1137 | (?!\p{Unassigned})\p{Block=Greek} |
1138 | (?=\p{Assigned})\p{Block=Greek} | |
dbe420b4 JH |
1139 | |
1140 | But in this particular example, you probably really want | |
1141 | ||
209c9685 | 1142 | \p{Greek} |
dbe420b4 JH |
1143 | |
1144 | which will match assigned characters known to be part of the Greek script. | |
29bdacb8 | 1145 | |
a6a7eedc KW |
1146 | =item * |
1147 | ||
1148 | CPAN module C<L<Unicode::Regex::Set>> | |
8158862b | 1149 | |
6f33e417 KW |
1150 | It does implement the full UTS#18 grouping, intersection, union, and |
1151 | removal (subtraction) syntax. | |
8158862b | 1152 | |
a6a7eedc KW |
1153 | =item * |
1154 | ||
1155 | L</"User-Defined Character Properties"> | |
6f33e417 | 1156 | |
a9130ea9 | 1157 | C<"+"> for union, C<"-"> for removal (set-difference), C<"&"> for intersection |
6f33e417 KW |
1158 | |
1159 | =back | |
1160 | ||
fea12a3e KW |
1161 | =item [5] |
1162 | C<\b> C<\B> meet most, but not all, the details of this requirement, but | |
1163 | C<\b{wb}> and C<\B{wb}> do, as well as the stricter RL2.3. |
1164 | ||
1165 | =item [6] | |
6f33e417 | 1166 | |
a6a7eedc | 1167 | Note that Perl does Full case-folding in matching, not Simple: |
6f33e417 | 1168 | |
a6a7eedc KW |
1169 | For example C<U+1F88> is equivalent to C<U+1F00 U+03B9>, instead of just |
1170 | C<U+1F80>. This difference matters mainly for certain Greek capital | |
a9130ea9 KW |
1171 | letters with certain modifiers: the Full case-folding decomposes the |
1172 | letter, while the Simple case-folding would map it to a single | |
1173 | character. | |
6f33e417 | 1174 | |
fea12a3e KW |
1175 | =item [7] |
1176 | ||
1177 | The reason this is considered to be only partially implemented is that | |
1178 | Perl has L<C<qrE<sol>\b{lb}E<sol>>|perlrebackslash/\b{lb}> and | |
1179 | C<L<Unicode::LineBreak>> that are conformant with | |
1180 | L<UAX#14 "Unicode Line Breaking Algorithm"|http://www.unicode.org/reports/tr14>. | |
1181 | The regular expression construct provides default behavior, while the | |
1182 | heavier-weight module provides customizable line breaking. | |
1183 | ||
1184 | But Perl treats C<\n> as the start- and end-line | |
1185 | delimiter, whereas Unicode specifies more characters that should be | |
1186 | so-interpreted. | |
6f33e417 | 1187 | |
a6a7eedc | 1188 | These are: |
6f33e417 | 1189 | |
a6a7eedc KW |
1190 | VT U+000B (\v in C) |
1191 | FF U+000C (\f) | |
1192 | CR U+000D (\r) | |
1193 | NEL U+0085 | |
1194 | LS U+2028 | |
1195 | PS U+2029 | |
6f33e417 | 1196 | |
a6a7eedc KW |
1197 | C<^> and C<$> in regular expression patterns are supposed to match all |
1198 | these, but don't. | |
1199 | These characters also don't, but should, affect C<< <> >>, C<$.>, and |
1200 | script line numbers. | |
6f33e417 | 1201 | |
a6a7eedc KW |
1202 | Also, lines should not be split within C<CRLF> (i.e. there is no |
1203 | empty line between C<\r> and C<\n>). For C<CRLF>, try the C<:crlf> | |
1204 | layer (see L<PerlIO>). | |
1205 | ||
fea12a3e | 1206 | =item [8] |
a9130ea9 KW |
1207 | UTF-8/UTF-EBCDIC used in Perl allows not only C<U+10000> to |
1208 | C<U+10FFFF> but also beyond C<U+10FFFF>. |
6f33e417 KW |
1209 | |
1210 | =back | |
5ca1ac52 | 1211 | |
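Two of the Level 1 notes above lend themselves to a quick check: the
lookahead idiom from note [4] and the Full case-folding from note [6].
The sketch below is an illustration under stated assumptions, not a
conformance test; C<U+0378> is assumed to be unassigned, which is true as
of this writing, and the variable names are hypothetical.

```perl
use strict;
use warnings;

# Note [4]: mimic [{Block=Greek}-[{UNASSIGNED}]] with a negative lookahead.
my $assigned_greek = qr/^(?!\p{Unassigned})\p{Block=Greek}$/;

my $alpha = "\x{03B1}";   # GREEK SMALL LETTER ALPHA, assigned
my $hole  = "\x{0378}";   # in the Greek block, but (currently) unassigned

my $alpha_matches = $alpha =~ $assigned_greek      ? 1 : 0;
my $hole_matches  = $hole  =~ $assigned_greek      ? 1 : 0;
my $hole_in_block = $hole  =~ /^\p{Block=Greek}$/  ? 1 : 0;

# Note [6]: under /i, U+1F88 matches its Full case fold, U+1F00 U+03B9.
my $full_fold = "\x{1F88}" =~ /^\x{1F00}\x{03B9}$/i ? 1 : 0;

printf "alpha=%d hole=%d hole_in_block=%d full_fold=%d\n",
    $alpha_matches, $hole_matches, $hole_in_block, $full_fold;
```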
fea12a3e | 1212 | =head3 Level 2 - Extended Unicode Support |
776f8809 | 1213 | |
fea12a3e KW |
1214 | RL2.1 Canonical Equivalents - Retracted [9] |
1215 | by Unicode | |
1216 | RL2.2 Extended Grapheme Clusters - Partial [10] | |
1217 | RL2.3 Default Word Boundaries - Done [11] | |
1218 | RL2.4 Default Case Conversion - Done | |
1219 | RL2.5 Name Properties - Done | |
1220 | RL2.6 Wildcard Properties - Missing | |
1221 | RL2.7 Full Properties - Done | |
776f8809 | 1222 | |
fea12a3e | 1223 | =over 4 |
8158862b | 1224 | |
fea12a3e KW |
1225 | =item [9] |
1226 | Unicode has rewritten this portion of UTS#18 to say that getting | |
1227 | canonical equivalence (see UAX#15 | |
1228 | L<"Unicode Normalization Forms"|http://www.unicode.org/reports/tr15>) | |
1229 | is basically to be done at the programmer level. Use NFD to write | |
1230 | both your regular expressions and text to match them against (you | |
1231 | can use L<Unicode::Normalize>). | |
776f8809 | 1232 | |
fea12a3e KW |
1233 | =item [10] |
1234 | Perl has C<\X> and C<\b{gcb}> but we don't have a "Grapheme Cluster Mode". | |
1235 | ||
1236 | =item [11] See |
1237 | L<UAX#29 "Unicode Text Segmentation"|http://www.unicode.org/reports/tr29>. |
1238 | ||
1239 | =back | |
1240 | ||
1241 | =head3 Level 3 - Tailored Support | |
1242 | ||
1243 | RL3.1 Tailored Punctuation - Missing | |
1244 | RL3.2 Tailored Grapheme Clusters - Missing [12] | |
1245 | RL3.3 Tailored Word Boundaries - Missing | |
1246 | RL3.4 Tailored Loose Matches - Retracted by Unicode | |
1247 | RL3.5 Tailored Ranges - Retracted by Unicode | |
1248 | RL3.6 Context Matching - Missing [13] | |
1249 | RL3.7 Incremental Matches - Missing | |
1250 | RL3.8 Unicode Set Sharing - Unicode is proposing | |
1251 | to retract this | |
1252 | RL3.9 Possible Match Sets - Missing | |
1253 | RL3.10 Folded Matching - Retracted by Unicode | |
1254 | RL3.11 Submatchers - Missing | |
1255 | ||
1256 | =over 4 | |
1257 | ||
1258 | =item [12] | |
1259 | Perl has L<Unicode::Collate>, but it isn't integrated with regular | |
1260 | expressions. See | |
1260 | L<UTS#10 "Unicode Collation Algorithm"|http://www.unicode.org/reports/tr10>. |
776f8809 | 1262 | |
fea12a3e KW |
1263 | =item [13] |
1264 | Perl has C<(?<=x)> and C<(?=x)>, but to satisfy this item, lookaheads and |
1265 | lookbehinds would need to see outside of the target substring. |
776f8809 JH |
1266 | |
1267 | =back | |
1268 | ||
c349b1b9 JH |
1269 | =head2 Unicode Encodings |
1270 | ||
376d9008 JB |
1271 | Unicode characters are assigned to I<code points>, which are abstract |
1272 | numbers. To use these numbers, various encodings are needed. | |
c349b1b9 JH |
1273 | |
1274 | =over 4 | |
1275 | ||
c29a771d | 1276 | =item * |
5cb3728c RB |
1277 | |
1278 | UTF-8 | |
c349b1b9 | 1279 | |
6d4f9cf2 | 1280 | UTF-8 is a variable-length (1 to 4 bytes), byte-order independent |
a6a7eedc KW |
1281 | encoding. In most of Perl's documentation, including elsewhere in this |
1282 | document, the term "UTF-8" means also "UTF-EBCDIC". But in this section, | |
1283 | "UTF-8" refers only to the encoding used on ASCII platforms. It is a | |
1284 | superset of 7-bit US-ASCII, so anything encoded in ASCII has the | |
1285 | identical representation when encoded in UTF-8. | |
c349b1b9 | 1286 | |
8c007b5a | 1287 | The following table is from Unicode 3.2. |
05632f9a | 1288 | |
755789c0 | 1289 | Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte |
05632f9a | 1290 | |
d88362ca | 1291 | U+0000..U+007F 00..7F |
e1b711da | 1292 | U+0080..U+07FF * C2..DF 80..BF |
d88362ca | 1293 | U+0800..U+0FFF E0 * A0..BF 80..BF |
ec90690f TS |
1294 | U+1000..U+CFFF E1..EC 80..BF 80..BF |
1295 | U+D000..U+D7FF ED 80..9F 80..BF | |
755789c0 | 1296 | U+D800..U+DFFF +++++ utf16 surrogates, not legal utf8 +++++ |
ec90690f | 1297 | U+E000..U+FFFF EE..EF 80..BF 80..BF |
d88362ca KW |
1298 | U+10000..U+3FFFF F0 * 90..BF 80..BF 80..BF |
1299 | U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF | |
1300 | U+100000..U+10FFFF F4 80..8F 80..BF 80..BF | |
e1b711da | 1301 | |
b19eb496 | 1302 | Note the gaps marked by "*" before several of the byte entries above. These are |
e1b711da KW |
1303 | caused by legal UTF-8 avoiding non-shortest encodings: it is technically |
1304 | possible to UTF-8-encode a single code point in different ways, but that is | |
1305 | explicitly forbidden, and the shortest possible encoding should always be used | |
1306 | (and that is what Perl does). | |
37361303 | 1307 | |
376d9008 | 1308 | Another way to look at it is via bits: |
05632f9a | 1309 | |
755789c0 | 1310 | Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte |
05632f9a | 1311 | |
755789c0 KW |
1312 | 0aaaaaaa 0aaaaaaa |
1313 | 00000bbbbbaaaaaa 110bbbbb 10aaaaaa | |
1314 | ccccbbbbbbaaaaaa 1110cccc 10bbbbbb 10aaaaaa | |
1315 | 00000dddccccccbbbbbbaaaaaa 11110ddd 10cccccc 10bbbbbb 10aaaaaa | |
05632f9a | 1316 | |
a9130ea9 | 1317 | As you can see, the continuation bytes all begin with C<"10">, and the |
e1b711da | 1318 | leading bits of the start byte tell how many bytes there are in the |
05632f9a JH |
1319 | encoded character. |
1320 | ||
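The bit layout above can be turned into code almost mechanically. The following is a minimal sketch, not a description of how Perl does it internally: C<encode_utf8_manual> and C<hex_bytes> are hypothetical helpers written for this illustration, the sketch handles only C<U+0000..U+10FFFF>, and it makes no attempt to reject surrogates. The built-in C<utf8::encode> is used only as a cross-check.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration: return the UTF-8 bytes (as a list
# of integers) for a code point in the range 0 .. 0x10FFFF.
sub encode_utf8_manual {
    my ($cp) = @_;
    die sprintf("U+%X is above U+10FFFF\n", $cp) if $cp > 0x10FFFF;

    # 1 byte: 0aaaaaaa
    return ($cp) if $cp < 0x80;

    # 2 bytes: 110bbbbb 10aaaaaa
    return (0xC0 | ($cp >> 6), 0x80 | ($cp & 0x3F)) if $cp < 0x800;

    # 3 bytes: 1110cccc 10bbbbbb 10aaaaaa
    return (0xE0 | ($cp >> 12),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F)) if $cp < 0x10000;

    # 4 bytes: 11110ddd 10cccccc 10bbbbbb 10aaaaaa
    return (0xF0 | ($cp >> 18),
            0x80 | (($cp >> 12) & 0x3F),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F));
}

# Format a list of byte values as space-separated hex pairs.
sub hex_bytes { return join " ", map { sprintf("%02X", $_) } @_ }

for my $cp (0x41, 0xE9, 0x20AC, 0x1F600) {
    # Cross-check against Perl's own encoder.
    my $native = chr($cp);
    utf8::encode($native);
    my $expected = hex_bytes(map { ord($_) } split(//, $native));

    my $got = hex_bytes(encode_utf8_manual($cp));
    printf "U+%04X -> %-11s %s\n", $cp, $got,
        ($got eq $expected ? "(matches utf8::encode)" : "(MISMATCH)");
}
```

Each branch corresponds to one row of the bit table: the start byte carries the length marker and the top bits, and every continuation byte carries six more bits behind the C<"10"> prefix.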
6d4f9cf2 | 1321 | The original UTF-8 specification allowed up to 6 bytes, to allow |
a9130ea9 | 1322 | encoding of numbers up to C<0x7FFF_FFFF>. Perl continues to allow those, |
6d4f9cf2 KW |
1323 | and has extended that up to 13 bytes to encode code points up to what |
1324 | can fit in a 64-bit word. However, Perl will warn if you output any of | |
b19eb496 | 1325 | these as being non-portable; and under strict UTF-8 input protocols, |
760c7c2f KW |
1326 | they are forbidden. In addition, it is deprecated to use a code point |
1327 | larger than what a signed integer variable on your system can hold. On | |
1328 | 32-bit ASCII systems, this means C<0x7FFF_FFFF> is the legal maximum | |
1329 | going forward (much higher on 64-bit systems). | |
6d4f9cf2 | 1330 | |
c29a771d | 1331 | =item * |
5cb3728c RB |
1332 | |
1333 | UTF-EBCDIC | |
dbe420b4 | 1334 | |
b65e6125 | 1335 | Like UTF-8, but EBCDIC-safe, in the way that UTF-8 is ASCII-safe. |
a6a7eedc KW |
1336 | This means that all the basic characters (which includes all |
1337 | those that have ASCII equivalents, like C<"A">, C<"0">, C<"%">, I<etc.>) |
1338 | are the same in both EBCDIC and UTF-EBCDIC. |
1339 | ||
c0236afe KW |
1340 | UTF-EBCDIC is used on EBCDIC platforms. It generally requires more |
1341 | bytes to represent a given code point than UTF-8 does; the largest | |
1342 | Unicode code points take 5 bytes to represent (instead of 4 in UTF-8), | |
1343 | and, extended for 64-bit words, it uses 14 bytes instead of 13 bytes in | |
1344 | UTF-8. | |
dbe420b4 | 1345 | |
c29a771d | 1346 | =item * |
5cb3728c | 1347 | |
b65e6125 | 1348 | UTF-16, UTF-16BE, UTF-16LE, Surrogates, and C<BOM>'s (Byte Order Marks) |
c349b1b9 | 1349 | |
1bfb14c4 JH |
1350 | The following items are mostly for reference and general Unicode |
1351 | knowledge; Perl doesn't use these constructs internally. |
dbe420b4 | 1352 | |
b19eb496 TC |
1353 | Like UTF-8, UTF-16 is a variable-width encoding, but where |
1354 | UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units. | |
1355 | All code points occupy either 2 or 4 bytes in UTF-16: code points | |
1356 | C<U+0000..U+FFFF> are stored in a single 16-bit unit, and code | |
1bfb14c4 | 1357 | points C<U+10000..U+10FFFF> in two 16-bit units. The latter case is |
c349b1b9 JH |
1358 | using I<surrogates>, the first 16-bit unit being the I<high |
1359 | surrogate>, and the second being the I<low surrogate>. | |
1360 | ||
376d9008 | 1361 | Surrogates are code points set aside to encode the C<U+10000..U+10FFFF> |
c349b1b9 | 1362 | range of Unicode code points in pairs of 16-bit units. The I<high |
9f815e24 | 1363 | surrogates> are the range C<U+D800..U+DBFF> and the I<low surrogates> |
376d9008 | 1364 | are the range C<U+DC00..U+DFFF>. The surrogate encoding is |
c349b1b9 | 1365 | |
d88362ca KW |
1366 | $hi = int(($uni - 0x10000) / 0x400) + 0xD800; |
1367 | $lo = ($uni - 0x10000) % 0x400 + 0xDC00; | |
c349b1b9 JH |
1368 | |
1369 | and the decoding is | |
1370 | ||
d88362ca | 1371 | $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00); |
c349b1b9 | 1372 | |
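As a hedged illustration of the encoding and decoding formulas, the sketch below round-trips one supplementary code point. The helper names C<to_surrogates> and C<from_surrogates> are hypothetical, and C<int()> is applied to the division because Perl's C</> operator does not truncate to an integer by default.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration only.

# Split a supplementary code point (U+10000..U+10FFFF) into its
# (high, low) surrogate pair.
sub to_surrogates {
    my ($uni) = @_;
    die "not a supplementary code point\n"
        if $uni < 0x10000 || $uni > 0x10FFFF;
    my $offset = $uni - 0x10000;
    my $hi = int($offset / 0x400) + 0xD800;   # top 10 bits
    my $lo = $offset % 0x400 + 0xDC00;        # bottom 10 bits
    return ($hi, $lo);
}

# Combine a (high, low) surrogate pair back into a code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
}

my ($hi, $lo) = to_surrogates(0x1F600);
printf "U+1F600 -> %04X %04X\n", $hi, $lo;               # D83D DE00
printf "back    -> U+%X\n", from_surrogates($hi, $lo);   # U+1F600
```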
376d9008 | 1373 | Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16 |
c349b1b9 | 1374 | itself can be used for in-memory computations, but if storage or |
376d9008 JB |
1375 | transfer is required either UTF-16BE (big-endian) or UTF-16LE |
1376 | (little-endian) encodings must be chosen. | |
c349b1b9 JH |
1377 | |
1378 | This introduces another problem: what if you just know that your data | |
376d9008 | 1379 | is UTF-16, but you don't know which endianness? Byte Order Marks, or |
b65e6125 | 1380 | C<BOM>'s, are a solution to this. A special character has been reserved |
86bbd6d1 | 1381 | in Unicode to function as a byte order marker: the character with the |
a9130ea9 | 1382 | code point C<U+FEFF> is the C<BOM>. |
042da322 | 1383 | |
a9130ea9 | 1384 | The trick is that if you read a C<BOM>, you will know the byte order, |
376d9008 JB |
1385 | since if it was written on a big-endian platform, you will read the |
1386 | bytes C<0xFE 0xFF>, but if it was written on a little-endian platform, | |
1387 | you will read the bytes C<0xFF 0xFE>. (And if the originating platform | |
b65e6125 KW |
1388 | was writing in ASCII platform UTF-8, you will read the bytes |
1389 | C<0xEF 0xBB 0xBF>.) | |
042da322 | 1390 | |
86bbd6d1 | 1391 | The way this trick works is that the character with the code point |
6d4f9cf2 | 1392 | C<U+FFFE> is not supposed to be in input streams, so the |
a9130ea9 | 1393 | sequence of bytes C<0xFF 0xFE> is unambiguously "C<BOM>, represented in |
1bfb14c4 | 1394 | little-endian format" and cannot be "C<U+FFFE>, represented in big-endian |
6d4f9cf2 KW |
1395 | format". |
1396 | ||
1397 | Surrogates have no meaning in Unicode outside their use in pairs to | |
1398 | represent other code points. However, Perl allows them to be | |
1399 | represented individually internally, for example by saying | |
f651977e TC |
1400 | C<chr(0xD801)>, so that all code points, not just those valid for open |
1401 | interchange, are | |
6d4f9cf2 | 1402 | representable. Unicode does define semantics for them, such as their |
a9130ea9 KW |
1403 | C<L</General_Category>> is C<"Cs">. But because their use is somewhat dangerous, |
1404 | Perl will warn (using the warning category C<"surrogate">, which is a | |
1405 | sub-category of C<"utf8">) if an attempt is made | |
6d4f9cf2 KW |
1406 | to do things like take the lower case of one, or match |
1407 | case-insensitively, or to output them. (But don't try this on Perls | |
1408 | before 5.14.) | |
c349b1b9 | 1409 | |
c29a771d | 1410 | =item * |
5cb3728c | 1411 | |
1e54db1a | 1412 | UTF-32, UTF-32BE, UTF-32LE |
c349b1b9 | 1413 | |
b65e6125 | 1414 | The UTF-32 family is pretty much like the UTF-16 family, except that |
042da322 | 1415 | the units are 32-bit, and therefore the surrogate scheme is not |
a9130ea9 | 1416 | needed. UTF-32 is a fixed-width encoding. The C<BOM> signatures are |
b19eb496 | 1417 | C<0x00 0x00 0xFE 0xFF> for BE and C<0xFF 0xFE 0x00 0x00> for LE. |
c349b1b9 | 1418 | |
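The C<BOM> signatures given for the UTF-16 and UTF-32 families, and the UTF-8 one, can be checked with a few anchored patterns on the raw bytes. This is only a sketch: C<sniff_bom> is a hypothetical helper, not a core function. The order of the tests is a deliberate choice, because the UTF-32LE signature begins with the same two bytes as the UTF-16LE one.

```perl
use strict;
use warnings;

# Hypothetical helper: guess an encoding from a leading BOM in a byte
# string.  Returns nothing if no BOM is recognized.
sub sniff_bom {
    my ($bytes) = @_;
    # The 4-byte signatures must be tried before the 2-byte ones.
    return "UTF-32BE" if $bytes =~ /\A\x00\x00\xFE\xFF/;
    return "UTF-32LE" if $bytes =~ /\A\xFF\xFE\x00\x00/;
    return "UTF-8"    if $bytes =~ /\A\xEF\xBB\xBF/;
    return "UTF-16BE" if $bytes =~ /\A\xFE\xFF/;
    return "UTF-16LE" if $bytes =~ /\A\xFF\xFE/;
    return;
}

for my $sample ("\xFE\xFF\x00\x41", "\xFF\xFE\x41\x00", "\xEF\xBB\xBF\x41",
                "\x00\x00\xFE\xFF", "\xFF\xFE\x00\x00", "plain") {
    my $guess = sniff_bom($sample);
    my $hex   = join "", map { sprintf("%02X", ord($_)) } split(//, $sample);
    printf "%-12s %s\n", $hex, defined $guess ? $guess : "(no BOM)";
}
```

Note that this heuristic cannot distinguish UTF-32LE from a UTF-16LE stream whose first character after the C<BOM> is C<U+0000>; real code would need outside knowledge to settle that case.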
c29a771d | 1419 | =item * |
5cb3728c RB |
1420 | |
1421 | UCS-2, UCS-4 | |
c349b1b9 | 1422 | |
b19eb496 | 1423 | Legacy, fixed-width encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit |
376d9008 | 1424 | encoding. Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>, |
339cfa0e | 1425 | because it does not use surrogates. UCS-4 is a 32-bit encoding, |
b19eb496 | 1426 | functionally identical to UTF-32 (the difference being that |
a9130ea9 | 1427 | UCS-4 forbids neither surrogates nor code points larger than C<0x10_FFFF>). |
c349b1b9 | 1428 | |
c29a771d | 1429 | =item * |
5cb3728c RB |
1430 | |
1431 | UTF-7 | |
c349b1b9 | 1432 | |
376d9008 JB |
1433 | A seven-bit safe (non-eight-bit) encoding, which is useful if the |
1434 | transport or storage is not eight-bit safe. Defined by RFC 2152. | |
c349b1b9 | 1435 | |
95a1a48b JH |
1436 | =back |
1437 | ||
57e88091 | 1438 | =head2 Noncharacter code points |
6d4f9cf2 | 1439 | |
57e88091 | 1440 | 66 code points are set aside in Unicode as "noncharacter code points". |
a9130ea9 | 1441 | These all have the C<Unassigned> (C<Cn>) C<L</General_Category>>, and |
57e88091 KW |
1442 | no character will ever be assigned to any of them. They are the 32 code |
1443 | points between C<U+FDD0> and C<U+FDEF> inclusive, and the 34 code | |
1444 | points: | |
1445 | ||
1446 | U+FFFE U+FFFF | |
1447 | U+1FFFE U+1FFFF | |
1448 | U+2FFFE U+2FFFF | |
1449 | ... | |
1450 | U+EFFFE U+EFFFF | |
1451 | U+FFFFE U+FFFFF | |
1452 | U+10FFFE U+10FFFF | |
1453 | ||
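The two groups of noncharacters are regular enough to test for arithmetically. The following sketch assumes a hypothetical helper, C<is_noncharacter>, and confirms the count of 66 by brute force over the whole Unicode range.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $cp is one of the 66 Unicode noncharacters.
sub is_noncharacter {
    my ($cp) = @_;
    return 0 if $cp > 0x10FFFF;                    # not Unicode at all
    return 1 if $cp >= 0xFDD0 && $cp <= 0xFDEF;    # the block of 32
    # The last two code points of each of the 17 planes: xxFFFE and xxFFFF.
    return (($cp & 0xFFFE) == 0xFFFE) ? 1 : 0;
}

my $count = 0;
for my $cp (0 .. 0x10FFFF) {
    $count++ if is_noncharacter($cp);
}
print "noncharacters found: $count\n";   # 66
```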
1454 | Until Unicode 7.0, the noncharacters were "B<forbidden> for use in open | |
1455 | interchange of Unicode text data", so that code that processed those | |
1456 | streams could use these code points as sentinels that could be mixed in | |
1457 | with character data, and would always be distinguishable from that data. | |
1458 | (Emphasis above and in the next paragraph has been added in this document.) |
1459 | ||
1460 | Unicode 7.0 changed the wording so that they are "B<not recommended> for | |
1461 | use in open interchange of Unicode text data". The 7.0 Standard goes on | |
1462 | to say: | |
1463 | ||
1464 | =over 4 | |
1465 | ||
1466 | "If a noncharacter is received in open interchange, an application is | |
1467 | not required to interpret it in any way. It is good practice, however, | |
1468 | to recognize it as a noncharacter and to take appropriate action, such | |
1469 | as replacing it with C<U+FFFD> replacement character, to indicate the | |
1470 | problem in the text. It is not recommended to simply delete | |
1471 | noncharacter code points from such text, because of the potential | |
1472 | security issues caused by deleting uninterpreted characters. (See | |
1473 | conformance clause C7 in Section 3.2, Conformance Requirements, and | |
1474 | L<Unicode Technical Report #36, "Unicode Security | |
1475 | Considerations"|http://www.unicode.org/reports/tr36/#Substituting_for_Ill_Formed_Subsequences>)." | |
1476 | ||
1477 | =back | |
1478 | ||
1479 | This change was made because it was found that various commercial tools | |
1480 | like editors, or for things like source code control, had been written | |
1481 | so that they would not handle program files that used these code points, | |
1482 | effectively precluding their use almost entirely! And that was never | |
1483 | the intent. They've always been meant to be usable within an | |
1484 | application, or cooperating set of applications, at will. | |
1485 | ||
1486 | If you're writing code, such as an editor, that is supposed to be able | |
1487 | to handle any Unicode text data, then you shouldn't be using these code | |
1488 | points yourself, and should instead allow them in the input. If you need |
1489 | sentinels, they should instead be something that isn't legal Unicode. | |
1490 | For UTF-8 data, you can use the bytes 0xC0 and 0xC1 as sentinels, as |
1491 | they never appear in well-formed UTF-8. (There are equivalents for | |
1492 | UTF-EBCDIC). You can also store your Unicode code points in integer | |
1493 | variables and use negative values as sentinels. | |
1494 | ||
1495 | If you're not writing such a tool, then whether you accept noncharacters | |
1496 | as input is up to you (though the Standard recommends that you not). If | |
1497 | you do strict input stream checking with Perl, these code points | |
1498 | continue to be forbidden. This is to maintain backward compatibility | |
1499 | (otherwise potential security holes could open up, as an unsuspecting | |
1500 | application that was written assuming the noncharacters would be | |
1501 | filtered out before getting to it, could now, without warning, start | |
1502 | getting them). To do strict checking, you can use the layer | |
1503 | C<:encoding('UTF-8')>. | |
1504 | ||
1505 | Perl continues to warn (using the warning category C<"nonchar">, which | |
1506 | is a sub-category of C<"utf8">) if an attempt is made to output | |
1507 | noncharacters. | |
42581d5d KW |
1508 | |
1509 | =head2 Beyond Unicode code points | |
1510 | ||
a9130ea9 KW |
1511 | The maximum Unicode code point is C<U+10FFFF>, and Unicode only defines |
1512 | operations on code points up through that. But Perl works on code | |
42581d5d KW |
1513 | points up to the maximum permissible unsigned number available on the |
1514 | platform. However, Perl will not accept these from input streams unless | |
1515 | lax rules are being used, and will warn (using the warning category | |
2d88a86a KW |
1516 | C<"non_unicode">, which is a sub-category of C<"utf8">) if any are output. |
1517 | ||
1518 | Since Unicode rules are not defined on these code points, if a | |
1519 | Unicode-defined operation is done on them, Perl uses what we believe are | |
1520 | sensible rules, while generally warning, using the C<"non_unicode"> | |
1521 | category. For example, C<uc("\x{11_0000}")> will generate such a | |
1522 | warning, returning the input parameter as its result, since Perl defines | |
1523 | the uppercase of every non-Unicode code point to be the code point | |
b65e6125 KW |
1524 | itself. (All the case changing operations, not just uppercasing, work |
1525 | this way.) | |
2d88a86a KW |
1526 | |
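A small experiment makes the case-changing rule concrete. This sketch collects any warnings instead of letting them go to C<STDERR>; the exact warning text is not assumed here and may differ between Perl releases.

```perl
use strict;
use warnings;

# Collect warnings so they can be printed alongside the results.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

my $above = "\x{110000}";       # first code point beyond Unicode
my $upper = uc $above;
my $lower = lc $above;

print "uc unchanged: ", ($upper eq $above ? "yes" : "no"), "\n";
print "lc unchanged: ", ($lower eq $above ? "yes" : "no"), "\n";
print "warning: $_" for @warnings;
```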
1527 | The situation with matching Unicode properties in regular expressions, | |
1528 | the C<\p{}> and C<\P{}> constructs, against these code points is not as | |
1529 | clear cut, and how these are handled has changed as we've gained | |
1530 | experience. | |
1531 | ||
1532 | One possibility is to treat any match against these code points as | |
1533 | undefined. But since Perl doesn't have the concept of a match being | |
1534 | undefined, it converts this to failing or C<FALSE>. This is almost, but | |
1535 | not quite, what Perl did from v5.14 (when use of these code points | |
1536 | became generally reliable) through v5.18. The difference is that Perl | |
1537 | treated all C<\p{}> matches as failing, but all C<\P{}> matches as | |
1538 | succeeding. | |
1539 | ||
f66ccb6c | 1540 | One problem with this is that it leads to unexpected and confusing |
2d88a86a KW |
1541 | results in some cases: |
1542 | ||
1543 | chr(0x110000) =~ /\p{ASCII_Hex_Digit=True}/ # Failed on <= v5.18 |
1544 | chr(0x110000) =~ /\p{ASCII_Hex_Digit=False}/ # Failed! on <= v5.18 |
1545 | ||
1546 | That is, it treated both matches as undefined, and converted that to | |
1547 | false (raising a warning on each). The first case is the expected | |
1548 | result, but the second is likely counterintuitive: "How could both be | |
1549 | false when they are complements?" Another problem was that the | |
1550 | implementation optimized many Unicode property matches down to already | |
1551 | existing simpler, faster operations, which don't raise the warning. We | |
1552 | chose to not forgo those optimizations, which help the vast majority of | |
1553 | matches, just to generate a warning for the unlikely event that an | |
1554 | above-Unicode code point is being matched against. | |
1555 | ||
1556 | As a result of these problems, starting in v5.20, what Perl does is | |
1557 | to treat non-Unicode code points as just typical unassigned Unicode | |
1558 | characters, and to match accordingly. (Note: Unicode has atypical |
57e88091 | 1559 | unassigned code points. For example, it has noncharacter code points, |
2d88a86a KW |
1560 | and ones that, when they do get assigned, are destined to be written |
1561 | Right-to-left, as Arabic and Hebrew are. Perl assumes that no | |
1562 | non-Unicode code point has any atypical properties.) | |
1563 | ||
1564 | Perl, in most cases, will raise a warning when matching an above-Unicode | |
1565 | code point against a Unicode property when the result is C<TRUE> for | |
1566 | C<\p{}>, and C<FALSE> for C<\P{}>. For example: | |
1567 | ||
1568 | chr(0x110000) =~ /\p{ASCII_Hex_Digit=True}/ # Fails, no warning |
1569 | chr(0x110000) =~ /\p{ASCII_Hex_Digit=False}/ # Succeeds, with warning |
1570 | ||
1571 | In both these examples, the character being matched is non-Unicode, so | |
1572 | Unicode doesn't define how it should match. It clearly isn't an ASCII | |
1573 | hex digit, so the first example clearly should fail, and so it does, | |
1574 | with no warning. But it is arguable that the second example should have | |
1575 | an undefined, hence C<FALSE>, result. So a warning is raised for it. | |
1576 | ||
1577 | Thus the warning is raised for many fewer cases than in earlier Perls, | |
1578 | and only when what the result is could be arguable. It turns out that | |
1579 | none of the optimizations made by Perl (or that are ever likely to be made) |
1580 | cause the warning to be skipped, so it solves both problems of Perl's | |
1581 | earlier approach. The most commonly used property that is affected by | |
1582 | this change is C<\p{Unassigned}> which is a short form for | |
1583 | C<\p{General_Category=Unassigned}>. Starting in v5.20, all non-Unicode | |
1584 | code points are considered C<Unassigned>. In earlier releases the | |
1585 | matches failed because the result was considered undefined. | |
1586 | ||
1587 | The only place where the warning is not raised when it arguably ought to |
1588 | have been is if optimizations cause the whole pattern match to not even | |
1589 | be attempted. For example, Perl may figure out that for a string to | |
1590 | match a certain regular expression pattern, the string has to contain | |
1591 | the substring C<"foobar">. Before attempting the match, Perl may look | |
1592 | for that substring, and if not found, immediately fail the match without | |
1593 | actually trying it; so no warning gets generated even if the string | |
1594 | contains an above-Unicode code point. | |
1595 | ||
1596 | This behavior is more "Do what I mean" than in earlier Perls for most | |
1597 | applications. But it catches fewer issues for code that needs to be | |
1598 | strictly Unicode compliant. Therefore there is an additional mode of | |
1599 | operation available to accommodate such code. This mode is enabled if a | |
1600 | regular expression pattern is compiled within the lexical scope where | |
1601 | the C<"non_unicode"> warning class has been made fatal, say by: | |
1602 | ||
1603 | use warnings FATAL => "non_unicode"; |
1604 | ||
44ecbbd8 | 1605 | (see L<warnings>). In this mode of operation, Perl will raise the |
2d88a86a KW |
1606 | warning for all matches against a non-Unicode code point (not just the |
1607 | arguable ones), and it skips the optimizations that might cause the | |
1608 | warning to not be output. (It currently still won't warn if the match | |
1609 | isn't even attempted, like in the C<"foobar"> example above.) | |
1610 | ||
1611 | In summary, Perl now normally treats non-Unicode code points as typical | |
1612 | Unicode unassigned code points for regular expression matches, raising a | |
1613 | warning only when it is arguable what the result should be. However, if | |
1614 | this warning has been made fatal, it isn't skipped. | |
1615 | ||
1616 | There is one exception to all this. C<\p{All}> looks like a Unicode | |
1617 | property, but it is a Perl extension that is defined to be true for all | |
1618 | possible code points, Unicode or not, so no warning is ever generated | |
1619 | when matching this against a non-Unicode code point. (Prior to v5.20, | |
1620 | it was an exact synonym for C<\p{Any}>, matching code points C<0> | |
1621 | through C<0x10FFFF>.) | |
6d4f9cf2 | 1622 | |
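The rules in this section can be observed directly. The sketch below assumes Perl v5.20 or later, where the text above implies the four matches yield 0, 1, 1 and 1 respectively; warnings are collected rather than printed to C<STDERR>, and no particular warning wording is assumed.

```perl
use strict;
use warnings;

# Collect warnings so the matches can be inspected without noise.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

my $above = chr(0x110000);      # a non-Unicode (above-Unicode) code point

my @tests = (
    [ 'ASCII_Hex_Digit=True',  ($above =~ /\p{ASCII_Hex_Digit=True}/  ? 1 : 0) ],
    [ 'ASCII_Hex_Digit=False', ($above =~ /\p{ASCII_Hex_Digit=False}/ ? 1 : 0) ],
    [ 'Unassigned',            ($above =~ /\p{Unassigned}/            ? 1 : 0) ],
    [ 'All',                   ($above =~ /\p{All}/                   ? 1 : 0) ],
);

# Keep the results by name as well, for later lookup.
my %result = map { $_->[0] => $_->[1] } @tests;

printf "%-22s %d\n", @$_ for @tests;
printf "warnings collected: %d\n", scalar @warnings;
```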
0d7c09bb JH |
1623 | =head2 Security Implications of Unicode |
1624 | ||
b65e6125 KW |
1625 | First, read |
1626 | L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>. | |
1627 | ||
e1b711da KW |
1628 | Also, note the following: |
1629 | ||
0d7c09bb JH |
1630 | =over 4 |
1631 | ||
1632 | =item * | |
1633 | ||
1634 | Malformed UTF-8 | |
bf0fa0b2 | 1635 | |
f57d8456 KW |
1636 | UTF-8 is very structured, so many combinations of bytes are invalid. In |
1637 | the past, Perl tried to soldier on and make some sense of invalid | |
1638 | combinations, but this can lead to security holes, so now, if the Perl | |
1639 | core needs to process an invalid combination, it will either raise a | |
1640 | fatal error, or will replace those bytes by the sequence that forms the | |
1641 | Unicode REPLACEMENT CHARACTER, for which purpose Unicode created it. | |
1642 | ||
1643 | Every code point can be represented by more than one possible | |
1644 | syntactically valid UTF-8 sequence. Early on, both Unicode and Perl | |
1645 | considered any of these to be valid, but now, all sequences longer | |
1646 | than the shortest possible one are considered to be malformed. | |
1647 | ||
1648 | Unicode considers many code points to be illegal, or to be avoided. | |
1649 | Perl generally accepts them, once they have passed through any input | |
1650 | filters that may try to exclude them. These have been discussed above | |
1651 | (see "Surrogates" under UTF-16 in L</Unicode Encodings>, | |
1652 | L</Noncharacter code points>, and L</Beyond Unicode code points>). | |
bf0fa0b2 | 1653 | |
0d7c09bb JH |
1654 | =item * |
1655 | ||
68693f9e | 1656 | Regular expression pattern matching may surprise you if you're not |
b19eb496 TC |
1657 | accustomed to Unicode. Starting in Perl 5.14, several pattern |
1658 | modifiers are available to control this, called the character set | |
42581d5d KW |
1659 | modifiers. Details are given in L<perlre/Character set modifiers>. |
1660 | ||
1661 | =back | |
0d7c09bb | 1662 | |
376d9008 | 1663 | As discussed elsewhere, Perl has one foot (two hooves?) planted in |
a6a7eedc KW |
1664 | each of two worlds: the old world of ASCII and single-byte locales, and |
1665 | the new world of Unicode, upgrading when necessary. | |
376d9008 | 1666 | If your legacy code does not explicitly use Unicode, no automatic |
a6a7eedc | 1667 | switch-over to Unicode should happen. |
0d7c09bb | 1668 | |
c349b1b9 JH |
1669 | =head2 Unicode in Perl on EBCDIC |
1670 | ||
a6a7eedc KW |
1671 | Unicode is supported on EBCDIC platforms. See L<perlebcdic>. |
1672 | ||
1673 | Unless ASCII vs. EBCDIC issues are specifically being discussed, | |
1674 | references to UTF-8 encoding in this document and elsewhere should be | |
1675 | read as meaning UTF-EBCDIC on EBCDIC platforms. | |
1676 | See L<perlebcdic/Unicode and UTF>. | |
1677 | ||
1678 | Because UTF-EBCDIC is so similar to UTF-8, the differences are mostly | |
1679 | hidden from you; S<C<use utf8>> (and NOT something like | |
1680 | S<C<use utfebcdic>>) declares the script is in the platform's |
1681 | "native" 8-bit encoding of Unicode. (Similarly for the C<":utf8"> | |
1682 | layer.) | |
c349b1b9 | 1683 | |
b310b053 JH |
1684 | =head2 Locales |
1685 | ||
42581d5d | 1686 | See L<perllocale/Unicode and UTF-8>. |
b310b053 | 1687 | |
1aad1664 JH |
1688 | =head2 When Unicode Does Not Happen |
1689 | ||
b65e6125 KW |
1690 | There are still many places in Perl where Unicode (in some encoding or |
1691 | another) could be given as arguments or received as results, or both, |
1692 | but it is not, in spite of Perl having extensive ways to input and |
1693 | output in Unicode, and a few other "entry points" like the C<@ARGV> | |
1694 | array (which can sometimes be interpreted as UTF-8). | |
1aad1664 | 1695 | |
e1b711da KW |
1696 | The following are such interfaces. Also, see L</The "Unicode Bug">. |
1697 | For all of these interfaces Perl | |
b9cedb1b | 1698 | currently (as of v5.16.0) simply assumes byte strings both as arguments |
b65e6125 | 1699 | and results, or UTF-8 strings if the (deprecated) C<encoding> pragma has been used. |
1aad1664 | 1700 | |
b19eb496 TC |
1701 | One reason that Perl does not attempt to resolve the role of Unicode in |
1702 | these situations is that the answers are highly dependent on the operating | |
1aad1664 | 1703 | system and the file system(s). For example, whether filenames can be |
b19eb496 TC |
1704 | in Unicode and in exactly what kind of encoding, is not exactly a |
1705 | portable concept. Similarly for C<qx> and C<system>: how well will the | |
1706 | "command-line interface" (and which of them?) handle Unicode? | |
1aad1664 JH |
1707 | |
1708 | =over 4 | |
1709 | ||
557a2462 RB |
1710 | =item * |
1711 | ||
a9130ea9 KW |
1712 | C<chdir>, C<chmod>, C<chown>, C<chroot>, C<exec>, C<link>, C<lstat>, C<mkdir>, |
1713 | C<rename>, C<rmdir>, C<stat>, C<symlink>, C<truncate>, C<unlink>, C<utime>, C<-X> | |
557a2462 RB |
1714 | |
1715 | =item * | |
1716 | ||
a9130ea9 | 1717 | C<%ENV> |
557a2462 RB |
1718 | |
1719 | =item * | |
1720 | ||
a9130ea9 | 1721 | C<glob> (aka the C<E<lt>*E<gt>>) |
557a2462 RB |
1722 | |
1723 | =item * | |
1aad1664 | 1724 | |
a9130ea9 | 1725 | C<open>, C<opendir>, C<sysopen> |
1aad1664 | 1726 | |
557a2462 | 1727 | =item * |
1aad1664 | 1728 | |
a9130ea9 | 1729 | C<qx> (aka the backtick operator), C<system> |
1aad1664 | 1730 | |
557a2462 | 1731 | =item * |
1aad1664 | 1732 | |
a9130ea9 | 1733 | C<readdir>, C<readlink> |
1aad1664 JH |
1734 | |
1735 | =back | |
1736 | ||
e1b711da KW |
1737 | =head2 The "Unicode Bug" |
1738 | ||
a6a7eedc KW |
1739 | The term "Unicode bug" has been applied to an inconsistency with the |
1740 | code points in the C<Latin-1 Supplement> block, that is, between | |
1741 | 128 and 255. Without a locale specified, unlike all other characters or | |
1742 | code points, these characters can have very different semantics | |
1743 | depending on the rules in effect. (Characters whose code points are | |
1744 | above 255 force Unicode rules; whereas the rules for ASCII characters | |
1745 | are the same under both ASCII and Unicode rules.) | |
1746 | ||
1747 | Under Unicode rules, these upper-Latin1 characters are interpreted as | |
1748 | Unicode code points, which means they have the same semantics as Latin-1 | |
1749 | (ISO-8859-1) and C1 controls. | |
1750 | ||
1751 | As explained in L</ASCII Rules versus Unicode Rules>, under ASCII rules, | |
1752 | they are considered to be unassigned characters. | |
1753 | ||
1754 | This can lead to unexpected results. For example, a string's | |
1755 | semantics can suddenly change if a code point above 255 is appended to | |
1756 | it, which changes the rules from ASCII to Unicode. As an | |
1757 | example, consider the following program and its output: | |
1758 | ||
1759 | $ perl -le' | |
f434f357 | 1760 | no feature "unicode_strings"; |
a6a7eedc KW |
1761 | $s1 = "\xC2"; |
1762 | $s2 = "\x{2660}"; | |
1763 | for ($s1, $s2, $s1.$s2) { | |
1764 | print /\w/ || 0; | |
1765 | } | |
1766 | ' | |
1767 | 0 | |
1768 | 0 | |
1769 | 1 | |
1770 | ||
1771 | If there's no C<\w> in C<$s1> or in C<$s2>, why does their concatenation |
1772 | have one? | |
1773 | ||
1774 | This anomaly stems from Perl's attempt to not disturb older programs that | |
1775 | didn't use Unicode, along with Perl's desire to add Unicode support | |
1776 | seamlessly. But the result turned out to not be seamless. (By the way, | |
1777 | you can choose to be warned when things like this happen. See | |
1778 | C<L<encoding::warnings>>.) | |
1779 | ||
1780 | L<S<C<use feature 'unicode_strings'>>|feature/The 'unicode_strings' feature> | |
1781 | was added, starting in Perl v5.12, to address this problem. It affects | |
1782 | these things: | |
e1b711da KW |
1783 | |
1784 | =over 4 | |
1785 | ||
1786 | =item * | |
1787 | ||
1788 | Changing the case of a scalar, that is, using C<uc()>, C<ucfirst()>, C<lc()>, | |
2e2b2571 KW |
1789 | and C<lcfirst()>, or C<\L>, C<\U>, C<\u> and C<\l> in double-quotish |
1790 | contexts, such as regular expression substitutions. | |
a6a7eedc KW |
1791 | |
1792 | Under C<unicode_strings> starting in Perl 5.12.0, Unicode rules are | |
2e2b2571 KW |
1793 | generally used. See L<perlfunc/lc> for details on how this works |
1794 | in combination with various other pragmas. | |
e1b711da KW |
1795 | |
1796 | =item * | |
1797 | ||
2e2b2571 | 1798 | Using caseless (C</i>) regular expression matching. |
a6a7eedc | 1799 | |
2e2b2571 | 1800 | Starting in Perl 5.14.0, regular expressions compiled within |
a6a7eedc | 1801 | the scope of C<unicode_strings> use Unicode rules |
2e2b2571 KW |
1802 | even when executed or compiled into larger |
1803 | regular expressions outside the scope. | |
e1b711da KW |
1804 | |
1805 | =item * | |
1806 | ||
a6a7eedc KW |
1807 | Matching any of several properties in regular expressions. |
1808 | ||
1809 | These properties are C<\b> (without braces), C<\B> (without braces), | |
1810 | C<\s>, C<\S>, C<\w>, C<\W>, and all the POSIX character classes |
630d17dc | 1811 | I<except> C<[[:ascii:]]>. |
a6a7eedc | 1812 | |
2e2b2571 | 1813 | Starting in Perl 5.14.0, regular expressions compiled within |
a6a7eedc | 1814 | the scope of C<unicode_strings> use Unicode rules |
2e2b2571 KW |
1815 | even when executed or compiled into larger |
1816 | regular expressions outside the scope. | |
e1b711da KW |
1817 | |
1818 | =item * | |
1819 | ||
a6a7eedc KW |
1820 | In C<quotemeta> or its inline equivalent C<\Q>. |
1821 | ||
2e2b2571 KW |
1822 | Starting in Perl 5.16.0, consistent quoting rules are used within the |
1823 | scope of C<unicode_strings>, as described in L<perlfunc/quotemeta>. | |
a6a7eedc KW |
1824 | Prior to that, or outside its scope, no code points above 127 are quoted |
1825 | in UTF-8 encoded strings, but in byte encoded strings, code points | |
1826 | between 128-255 are always quoted. | |
eb88ed9e | 1827 | |
d6c970c7 AC |
1828 | =item * |
1829 | ||
1830 | In the C<..> or L<range|perlop/Range Operators> operator. | |
1831 | ||
1832 | Starting in Perl 5.26.0, the range operator on strings treats their lengths | |
1833 | consistently within the scope of C<unicode_strings>. Prior to that, or | |
1834 | outside its scope, it could produce strings whose length in characters | |
1835 | exceeded that of the right-hand side, where the right-hand side took up more | |
1836 | bytes than the correct range endpoint. | |
1837 | ||
e1b711da KW |
1838 | =back |
1839 | ||
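As a rough illustration of the items above, the following sketch compares how one native 8-bit string behaves inside and outside the scope of C<unicode_strings>. The variable names are made up for this example, and the results assume that no C<use locale> is in effect and that no regular expression modifier such as C</u> or C</a> is applied.

```perl
use strict;
use warnings;

# U+00E9 LATIN SMALL LETTER E WITH ACUTE, stored as a single native byte
# (the string is not UTF-8 encoded internally).
my $e_acute = "\xE9";

my ($word_off, $fold_off, $word_on, $fold_on);

{
    no feature 'unicode_strings';
    # ASCII rules: the byte 0xE9 is not a word character and has no case.
    $word_off = $e_acute =~ /\w/    ? 1 : 0;
    $fold_off = $e_acute =~ /\xC9/i ? 1 : 0;
}

{
    use feature 'unicode_strings';
    # Unicode rules: U+00E9 is a letter, and U+00C9 is its uppercase form.
    $word_on = $e_acute =~ /\w/    ? 1 : 0;
    $fold_on = $e_acute =~ /\xC9/i ? 1 : 0;
}

print "outside scope: \\w=$word_off /i=$fold_off\n";
print "inside scope:  \\w=$word_on /i=$fold_on\n";
```

The regular expressions are compiled in the block where they appear, so each pair of matches follows the rules of its enclosing scope.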
a6a7eedc KW |
1840 | You can see from the above that the effect of C<unicode_strings> |
1841 | increased over several Perl releases. (And Perl's support for Unicode | |
1842 | continues to improve; it's best to use the latest available release in | |
1843 | order to get the most complete and accurate results possible.) Note that | |
1844 | C<unicode_strings> is automatically chosen if you S<C<use 5.012>> or | |
1845 | higher. | |
e1b711da | 1846 | |
2e2b2571 | 1847 | For Perls earlier than those described above, or when a string is passed |
a6a7eedc | 1848 | to a function outside the scope of C<unicode_strings>, see the next section. |
e1b711da | 1849 | |
1aad1664 JH |
1850 | =head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl) |
1851 | ||
e1b711da KW |
1852 | Sometimes (see L</"When Unicode Does Not Happen"> or L</The "Unicode Bug">) |
1853 | there are situations where you simply need to force a byte | |
a6a7eedc KW |
1854 | string into UTF-8, or vice versa. The standard module L<Encode> can be |
1855 | used for this, or the low-level calls | |
a9130ea9 | 1856 | L<C<utf8::upgrade($bytestring)>|utf8/Utility functions> and |
a6a7eedc | 1857 | L<C<utf8::downgrade($utf8string[, FAIL_OK])>|utf8/Utility functions>. |
1aad1664 | 1858 | |
a9130ea9 | 1859 | Note that C<utf8::downgrade()> can fail if the string contains characters |
2bbc8d55 | 1860 | that don't fit into a byte. |
1aad1664 | 1861 | |
e1b711da KW |
1862 | Calling either function on a string that already is in the desired state is a |
1863 | no-op. | |
1864 | ||
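A minimal sketch of the two low-level calls follows. The variable names are illustrative only. The point is that upgrading or downgrading changes the internal representation but not the characters, and that C<FAIL_OK> turns a failed downgrade into a false return value instead of an exception.

```perl
use strict;
use warnings;

my $s = "caf\xE9";                        # four characters, stored as bytes

my $flag_before = utf8::is_utf8($s) ? 1 : 0;

utf8::upgrade($s);                        # same characters, now UTF-8 encoded
my $flag_after = utf8::is_utf8($s) ? 1 : 0;
my $same_text  = ($s eq "caf\xE9" && length($s) == 4) ? 1 : 0;

utf8::downgrade($s);                      # back to one byte per character
my $flag_final = utf8::is_utf8($s) ? 1 : 0;

# U+263A does not fit into a byte, so the downgrade cannot succeed.
# With a true FAIL_OK argument it returns false rather than dying.
my $wide          = "\x{263A}";
my $downgraded_ok = utf8::downgrade($wide, 1) ? 1 : 0;

print "flag: $flag_before -> $flag_after -> $flag_final\n";
print "text unchanged: $same_text, wide downgrade ok: $downgraded_ok\n";
```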
a6a7eedc KW |
1865 | L</ASCII Rules versus Unicode Rules> gives all the ways that a string is |
1866 | made to use Unicode rules. | |
95a1a48b | 1867 | |
37b3b608 | 1868 | =head2 Using Unicode in XS |
c349b1b9 | 1869 | |
37b3b608 KW |
1870 | See L<perlguts/"Unicode Support"> for an introduction to Unicode at |
1871 | the XS level, and L<perlapi/Unicode Support> for the API details. | |
95a1a48b | 1872 | |
e1b711da KW |
1873 | =head2 Hacking Perl to work on earlier Unicode versions (for very serious hackers only) |
1874 | ||
a6a7eedc KW |
1875 | Perl by default comes with the latest supported Unicode version built-in, but |
1876 | the goal is to allow you to change to use any earlier one. In Perls | |
1877 | v5.20 and v5.22, however, the earliest usable version is Unicode 5.1. | |
c55dd03d | 1878 | Perl v5.18 and v5.24 are able to handle all earlier versions. |
e1b711da | 1879 | |
42581d5d | 1880 | Download the files in the desired version of Unicode from the Unicode web |
e1b711da | 1881 | site L<http://www.unicode.org>. These should replace the existing files in |
b19eb496 | 1882 | F<lib/unicore> in the Perl source tree. Follow the instructions in |
116693e8 | 1883 | F<README.perl> in that directory to change some of their names, and then build |
26e391dd | 1884 | perl (see L<INSTALL>). |
116693e8 | 1885 | |
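After such a rebuild it can be useful to confirm which Unicode version a given perl binary actually carries. One way to check, sketched here with the core L<Unicode::UCD> module (the variable name is illustrative):

```perl
use strict;
use warnings;
use Unicode::UCD ();

# UnicodeVersion() reports the version of the Unicode character
# database that this perl was built with, as a dotted string.
my $unicode_version = Unicode::UCD::UnicodeVersion();

printf "perl %vd uses Unicode %s\n", $^V, $unicode_version;
```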
c8d992ba A |
1886 | =head2 Porting code from perl-5.6.X |
1887 | ||
a6a7eedc KW |
1888 | Perls starting in 5.8 have a different Unicode model from 5.6. In 5.6 the |
1889 | programmer was required to use the C<utf8> pragma to declare that a | |
1890 | given scope expected to deal with Unicode data and had to make sure that | |
1891 | only Unicode data were reaching that scope. If you have code that is | |
c8d992ba | 1892 | working with 5.6, you will need some of the following adjustments to |
a6a7eedc KW |
1893 | your code. The examples are written such that the code will continue to |
1894 | work under 5.6, so you should be safe to try them out. | |
c8d992ba | 1895 | |
755789c0 | 1896 | =over 3 |
c8d992ba A |
1897 | |
1898 | =item * | |
1899 | ||
1900 | A filehandle that should read or write UTF-8 | |
1901 | ||
b9cedb1b | 1902 | if ($] >= 5.008) { |
6d8e7450 | 1903 | binmode $fh, ":encoding(UTF-8)"; |
c8d992ba A |
1904 | } |
1905 | ||
1906 | =item * | |
1907 | ||
1908 | A scalar that is going to be passed to some extension | |
1909 | ||
a9130ea9 | 1910 | Be it C<Compress::Zlib>, C<Apache::Request> or any extension that has no |
c8d992ba | 1911 | mention of Unicode in the manpage, you need to make sure that the |
2575c402 | 1912 | UTF8 flag is stripped off. Note that at the time of this writing |
b9cedb1b | 1913 | (January 2012) the mentioned modules are not UTF-8-aware. Please |
c8d992ba A |
1914 | check the documentation to verify if this is still true. |
1915 | ||
b9cedb1b | 1916 | if ($] >= 5.008) { |
c8d992ba | 1917 | require Encode; |
8e179dd8 | 1918 | $val = Encode::encode("UTF-8", $val); # make octets |
c8d992ba A |
1919 | } |
1920 | ||
1921 | =item * | |
1922 | ||
1923 | A scalar we got back from an extension | |
1924 | ||
1925 | If you believe the scalar comes back as UTF-8, you will most likely | |
2575c402 | 1926 | want the UTF8 flag restored: |
c8d992ba | 1927 | |
b9cedb1b | 1928 | if ($] >= 5.008) { |
c8d992ba | 1929 | require Encode; |
8e179dd8 | 1930 | $val = Encode::decode("UTF-8", $val); |
c8d992ba A |
1931 | } |
1932 | ||
1933 | =item * | |
1934 | ||
1935 | Same thing, if you are really sure it is UTF-8 | |
1936 | ||
b9cedb1b | 1937 | if ($] >= 5.008) { |
c8d992ba A |
1938 | require Encode; |
1939 | Encode::_utf8_on($val); | |
1940 | } | |
1941 | ||
1942 | =item * | |
1943 | ||
a9130ea9 | 1944 | A wrapper for L<DBI> C<fetchrow_array> and C<fetchrow_hashref> |
c8d992ba A |
1945 | |
1946 | When the database contains only UTF-8, a wrapper function or method is | |
a9130ea9 KW |
1947 | a convenient way to replace all your C<fetchrow_array> and |
1948 | C<fetchrow_hashref> calls. A wrapper function will also make it easier to | |
c8d992ba | 1949 | adapt to future enhancements in your database driver. Note that at the |
b9cedb1b | 1950 | time of this writing (January 2012), the DBI has no standardized way |
a9130ea9 | 1951 | to deal with UTF-8 data. Please check the L<DBI documentation|DBI> to verify if |
c8d992ba A |
1952 | that is still true. |
1953 | ||
1954 | sub fetchrow { | |
d88362ca KW |
1955 | # $what is one of fetchrow_{array,hashref} |
1956 | my($self, $sth, $what) = @_; | |
b9cedb1b | 1957 | if ($] < 5.008) { |
c8d992ba A |
1958 | return $sth->$what; |
1959 | } else { | |
1960 | require Encode; | |
1961 | if (wantarray) { | |
1962 | my @arr = $sth->$what; | |
1963 | for (@arr) { | |
1964 | defined && /[^\000-\177]/ && Encode::_utf8_on($_); | |
1965 | } | |
1966 | return @arr; | |
1967 | } else { | |
1968 | my $ret = $sth->$what; | |
1969 | if (ref $ret) { | |
1970 | for my $k (keys %$ret) { | |
d88362ca KW |
1971 | defined |
1972 | && /[^\000-\177]/ | |
1973 | && Encode::_utf8_on($_) for $ret->{$k}; | |
c8d992ba A |
1974 | } |
1975 | return $ret; | |
1976 | } else { | |
1977 | defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret; | |
1978 | return $ret; | |
1979 | } | |
1980 | } | |
1981 | } | |
1982 | } | |
1983 | ||
1984 | ||
1985 | =item * | |
1986 | ||
1987 | A large scalar that you know can only contain ASCII | |
1988 | ||
1989 | Scalars that contain only ASCII and are marked as UTF-8 are sometimes | |
1990 | a drag on your program. If you recognize such a situation, just remove | |
2575c402 | 1991 | the UTF8 flag: |
c8d992ba | 1992 | |
b9cedb1b | 1993 | utf8::downgrade($val) if $] >= 5.008; |
c8d992ba A |
1994 | |
1995 | =back | |
1996 | ||
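The difference between decoding and merely turning the flag on is easier to see with malformed input. The sketch below uses made-up sample data. C<Encode::decode> with a strict check rejects octets that are not valid UTF-8, whereas C<Encode::_utf8_on> performs no validation at all, which is why it should be reserved for data you are really sure about.

```perl
use strict;
use warnings;
use Encode ();

my $octets = "caf\xC3\xA9";   # valid UTF-8 octets: "cafe" with e-acute
my $latin1 = "caf\xE9";       # ISO-8859-1 octets; malformed as UTF-8

# FB_CROAK makes decode() die on malformed input;
# LEAVE_SRC keeps decode() from modifying its argument.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

my $text     = Encode::decode("UTF-8", $octets, $check);
my $text_len = length $text;  # counted in characters, not octets

my $rejected = eval { Encode::decode("UTF-8", $latin1, $check); 1 } ? 0 : 1;

print "decoded length: $text_len\n";
print "malformed input rejected: $rejected\n";
```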
a6a7eedc KW |
1997 | =head1 BUGS |
1998 | ||
1999 | See also L</The "Unicode Bug"> above. | |
2000 | ||
2001 | =head2 Interaction with Extensions | |
2002 | ||
2003 | When Perl exchanges data with an extension, the extension should be | |
2004 | able to understand the UTF8 flag and act accordingly. If the | |
2005 | extension doesn't recognize that flag, it's likely that the extension | |
2006 | will return incorrectly-flagged data. | |
2007 | ||
2008 | So if you're working with Unicode data, consult the documentation of | |
2009 | every module you're using to see whether there are any issues with Unicode data | |
2010 | exchange. If the documentation does not talk about Unicode at all, | |
2011 | suspect the worst and probably look at the source to learn how the | |
2012 | module is implemented. Modules written completely in Perl shouldn't | |
2013 | cause problems. Modules that directly or indirectly access code written | |
2014 | in other programming languages are at risk. | |
2015 | ||
2016 | For affected functions, the simple strategy to avoid data corruption is | |
2017 | to always make the encoding of the exchanged data explicit. Choose an | |
2018 | encoding that you know the extension can handle. Convert arguments passed | |
2019 | to the extensions to that encoding and convert results back from that | |
2020 | encoding. Write wrapper functions that do the conversions for you, so | |
2021 | you can later change the functions when the extension catches up. | |
2022 | ||
2023 | To provide an example, let's say the popular C<Foo::Bar::escape_html> | |
2024 | function doesn't deal with Unicode data yet. The wrapper function | |
2025 | would convert the argument to raw UTF-8 and convert the result back to | |
2026 | Perl's internal representation like so: | |
2027 | ||
2028 | sub my_escape_html ($) { | |
2029 | my($what) = shift; | |
2030 | return unless defined $what; | |
8e179dd8 P |
2031 | Encode::decode("UTF-8", Foo::Bar::escape_html( |
2032 | Encode::encode("UTF-8", $what))); | |
a6a7eedc KW |
2033 | } |
2034 | ||
2035 | Sometimes, when the extension does not convert data but just stores | |
2036 | and retrieves it, you will be able to use the otherwise | |
2037 | dangerous L<C<Encode::_utf8_on()>|Encode/_utf8_on> function. Let's say | |
2038 | the popular C<Foo::Bar> extension, written in C, provides a C<param> | |
2039 | method that lets you store and retrieve data according to these prototypes: | |
2040 | ||
2041 | $self->param($name, $value); # set a scalar | |
2042 | $value = $self->param($name); # retrieve a scalar | |
2043 | ||
2044 | If it does not yet provide support for any encoding, one could write a | |
2045 | derived class with such a C<param> method: | |
2046 | ||
2047 | sub param { | |
2048 | my($self,$name,$value) = @_; | |
2049 | utf8::upgrade($name); # make sure it is UTF-8 encoded | |
2050 | if (defined $value) { | |
2051 | utf8::upgrade($value); # make sure it is UTF-8 encoded | |
2052 | return $self->SUPER::param($name,$value); | |
2053 | } else { | |
2054 | my $ret = $self->SUPER::param($name); | |
2055 | Encode::_utf8_on($ret); # we know, it is UTF-8 encoded | |
2056 | return $ret; | |
2057 | } | |
2058 | } | |
2059 | ||
2060 | Some extensions provide filters on data entry/exit points, such as | |
2061 | C<DB_File::filter_store_key> and family. Look out for such filters in | |
2062 | the documentation of your extensions; they can make the transition to | |
2063 | Unicode data much easier. | |
2064 | ||
2065 | =head2 Speed | |
2066 | ||
2067 | Some functions are slower when working on UTF-8 encoded strings than | |
2068 | on byte encoded strings. All functions that need to hop over | |
2069 | characters such as C<length()>, C<substr()> or C<index()>, or matching | |
2070 | regular expressions can work B<much> faster when the underlying data are | |
2071 | byte-encoded. | |
2072 | ||
2073 | In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1 | |
2074 | a caching scheme was introduced which improved the situation. In general, | |
2075 | operations with UTF-8 encoded strings are still slower. As an example, | |
2076 | the Unicode properties (character classes) like C<\p{Nd}> are known to | |
2077 | be quite a bit slower (5-20 times) than their simpler counterparts | |
2078 | like C<[0-9]> (then again, there are hundreds of Unicode characters matching | |
2079 | C<Nd> compared with the 10 ASCII characters matching C<[0-9]>). | |
2080 | ||
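The size difference between the two classes is part of why they are not interchangeable. As a small sketch (the sample characters are chosen only for illustration), a digit from another script matches C<\p{Nd}> but not C<[0-9]>:

```perl
use strict;
use warnings;

my $ascii_three  = "3";
my $arabic_three = "\x{0663}";   # ARABIC-INDIC DIGIT THREE

my %matches;
for my $pair ([ ascii => $ascii_three ], [ arabic => $arabic_three ]) {
    my ($name, $char) = @$pair;
    $matches{$name} = {
        Nd    => ($char =~ /\A\p{Nd}\z/ ? 1 : 0),
        ascii => ($char =~ /\A[0-9]\z/  ? 1 : 0),
    };
    print "$name: \\p{Nd}=$matches{$name}{Nd} [0-9]=$matches{$name}{ascii}\n";
}
```

If only the ten ASCII digits are wanted, C<[0-9]> is both the narrower and the cheaper test.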
393fec97 GS |
2081 | =head1 SEE ALSO |
2082 | ||
51f494cc | 2083 | L<perlunitut>, L<perluniintro>, L<perluniprops>, L<Encode>, L<open>, L<utf8>, L<bytes>, |
b65e6125 | 2084 | L<perlretut>, L<perlvar/"${^UNICODE}">, |
51f494cc | 2085 | L<http://www.unicode.org/reports/tr44>. |
393fec97 GS |
2086 | |
2087 | =cut |