Commit | Line | Data |
---|---|---|
393fec97 GS |
1 | =head1 NAME |
2 | ||
3 | perlunicode - Unicode support in Perl | |
4 | ||
5 | =head1 DESCRIPTION | |
6 | ||
a6a7eedc KW |
7 | If you haven't already, before reading this document, you should become |
8 | familiar with both L<perlunitut> and L<perluniintro>. | |
9 | ||
10 | Unicode aims to B<UNI>-fy the en-B<CODE>-ings of all the world's | |
11 | character sets into a single Standard. For quite a few of the various | |
12 | coding standards that existed when Unicode was first created, converting | |
13 | from each to Unicode essentially meant adding a constant to each code | |
14 | point in the original standard, and converting back meant just | |
15 | subtracting that same constant. For ASCII and ISO-8859-1, the constant | |
16 | is 0. For ISO-8859-5 (Cyrillic), the constant is 864; for Hebrew | |
17 | (ISO-8859-8), it's 1264; Thai (ISO-8859-11), 3424; and so forth. This | |
18 | made it easy to do the conversions, and facilitated the adoption of | |
19 | Unicode. | |
20 | ||
21 | And it worked; nowadays, those legacy standards are rarely used. Most | |
22 | everyone uses Unicode. | |
23 | ||
24 | Unicode is a comprehensive standard. It specifies many things outside | |
25 | the scope of Perl, such as how to display sequences of characters. For | |
26 | a full discussion of all aspects of Unicode, see | |
71c89d21 | 27 | L<https://www.unicode.org>. |
a6a7eedc | 28 | |
0a1f2d14 | 29 | =head2 Important Caveats |
21bad921 | 30 | |
a6a7eedc KW |
31 | Even though some of this section may not be understandable to you on |
32 | first reading, we think it's important enough to highlight some of the | |
33 | gotchas before delving further, so here goes: | |
34 | ||
376d9008 | 35 | Unicode support is an extensive requirement. While Perl does not |
c349b1b9 JH |
36 | implement the Unicode standard or the accompanying technical reports |
37 | from cover to cover, Perl does support many Unicode features. | |
21bad921 | 38 | |
f57d8456 | 39 | Also, the use of Unicode may present security issues that aren't |
526f2ca9 | 40 | obvious; see L</Security Implications of Unicode> below. |
9d1c51c1 | 41 | |
13a2d996 | 42 | =over 4 |
21bad921 | 43 | |
a9130ea9 | 44 | =item Safest if you C<use feature 'unicode_strings'> |
42581d5d KW |
45 | |
46 | In order to preserve backward compatibility, Perl does not turn | |
47 | on full internal Unicode support unless the pragma | |
b65e6125 KW |
48 | L<S<C<use feature 'unicode_strings'>>|feature/The 'unicode_strings' feature> |
49 | is specified. (This is automatically | |
6901d503 | 50 | selected if you S<C<use v5.12>> or higher.) Failure to do this can |
42581d5d KW |
51 | cause surprising behavior. See L</The "Unicode Bug"> below. |
52 | ||
2269d15c KW |
53 | This pragma doesn't affect I/O. Nor does it change the internal |
54 | representation of strings, only their interpretation. There are still | |
55 | several places where Unicode isn't fully supported, such as in | |
56 | filenames. | |
42581d5d | 57 | |
fae2c0fb | 58 | =item Input and Output Layers |
21bad921 | 59 | |
a6a7eedc KW |
60 | Use the C<:encoding(...)> layer to read from and write to |
61 | filehandles using the specified encoding. (See L<open>.) | |
c349b1b9 | 62 | |
01c3fbbc | 63 | =item You must convert your non-ASCII, non-UTF-8 Perl scripts to be |
a6a7eedc | 64 | UTF-8. |
21bad921 | 65 | |
01c3fbbc TC |
66 | The L<encoding> module has been deprecated since perl 5.18 and the |
67 | perl internals it requires have been removed with perl 5.26. | |
21bad921 | 68 | |
a6a7eedc | 69 | =item C<use utf8> still needed to enable L<UTF-8|/Unicode Encodings> in scripts |
21bad921 | 70 | |
a6a7eedc KW |
71 | If your Perl script is itself encoded in L<UTF-8|/Unicode Encodings>, |
72 | the S<C<use utf8>> pragma must be explicitly included to enable | |
73 | recognition of that (in string or regular expression literals, or in | |
74 | identifier names). B<This is the only time when an explicit S<C<use | |
75 | utf8>> is needed.> (See L<utf8>). | |
7aa207d6 | 76 | |
27c74dfd KW |
77 | If a Perl script begins with the bytes that form the UTF-8 encoding of |
78 | the Unicode BYTE ORDER MARK (C<BOM>, see L</Unicode Encodings>), those | |
79 | bytes are completely ignored. | |
80 | ||
81 | =item L<UTF-16|/Unicode Encodings> scripts autodetected | |
7aa207d6 | 82 | |
fea12a3e | 83 | If a Perl script begins with the Unicode C<BOM> (UTF-16LE, |
27c74dfd | 84 | UTF-16BE), or if the script looks like non-C<BOM>-marked |
a6a7eedc | 85 | UTF-16 of either endianness, Perl will correctly read in the script as |
27c74dfd | 86 | the appropriate Unicode encoding. |
990e18f7 | 87 | |
21bad921 GS |
88 | =back |
89 | ||
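As a minimal sketch of the C<:encoding(...)> layer described above, the following writes one non-ASCII character through a UTF-8 layer and reads it back. It uses an in-memory filehandle purely so the example is self-contained; the variable names are illustrative and not part of any API.

```perl
use strict;
use warnings;

# U+263A WHITE SMILING FACE is one character but three bytes in UTF-8.
my $text = "\x{263A}\n";

# Writing through the layer encodes characters into bytes.
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "open for write: $!";
print {$out} $text;
close($out) or die "close: $!";

# The buffer holds encoded bytes, so length() counts bytes here.
my $bytes_written = length $buffer;

# Reading through the same layer decodes the bytes back into characters.
open(my $in, '<:encoding(UTF-8)', \$buffer) or die "open for read: $!";
my $line = <$in>;
close($in);

my $chars_read = length $line;

print "bytes written: $bytes_written, characters read: $chars_read\n";
```

With a real file, the same layer names would be given in the mode argument of C<open>, or set as defaults with the L<open> pragma.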
376d9008 | 90 | =head2 Byte and Character Semantics |
393fec97 | 91 | |
a6a7eedc KW |
92 | Before Unicode, most encodings used 8 bits (a single byte) to encode |
93 | each character. Thus a character was a byte, and a byte was a | |
94 | character, and there could be only 256 or fewer possible characters. | |
95 | "Byte Semantics" in the title of this section refers to | |
96 | this behavior. There was no need to distinguish between "Byte" and | |
97 | "Character". | |
98 | ||
99 | Then along comes Unicode which has room for over a million characters | |
100 | (and Perl allows for even more). This means that a character may | |
101 | require more than a single byte to represent it, and so the two terms | |
102 | are no longer equivalent. What matter are the characters as whole | |
103 | entities, and not usually the bytes that comprise them. That's what the | |
104 | term "Character Semantics" in the title of this section refers to. | |
105 | ||
106 | Perl had to change internally to decouple "bytes" from "characters". | |
107 | It is important that you too change your ideas, if you haven't already, | |
108 | so that "byte" and "character" no longer mean the same thing in your | |
109 | mind. | |
110 | ||
111 | The basic building block of Perl strings has always been a "character". | |
112 | The change essentially comes down to this: the implementation no longer | |
113 | assumes that a character is always just a single byte. | |
114 | ||
115 | There are various things to note: | |
393fec97 GS |
116 | |
117 | =over 4 | |
118 | ||
119 | =item * | |
120 | ||
a6a7eedc KW |
121 | String handling functions, for the most part, continue to operate in |
122 | terms of characters. C<length()>, for example, returns the number of | |
123 | characters in a string, just as before. But that number no longer is | |
124 | necessarily the same as the number of bytes in the string (there may be | |
125 | more bytes than characters). The other such functions include | |
126 | C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>, | |
127 | C<sort()>, C<sprintf()>, and C<write()>. | |
128 | ||
129 | The exceptions are: | |
130 | ||
131 | =over 4 | |
132 | ||
133 | =item * | |
134 | ||
135 | the bit-oriented C<vec> | |
136 | ||
137 | E<nbsp> | |
138 | ||
139 | =item * | |
140 | ||
141 | the byte-oriented C<pack>/C<unpack> C<"C"> format | |
142 | ||
143 | However, the C<W> specifier does operate on whole characters, as does the | |
144 | C<U> specifier. | |
145 | ||
146 | =item * | |
147 | ||
148 | some operators that interact with the platform's operating system | |
149 | ||
150 | Operators dealing with filenames are examples. | |
151 | ||
152 | =item * | |
153 | ||
154 | when the functions are called from within the scope of the | |
155 | S<C<L<use bytes|bytes>>> pragma | |
156 | ||
157 | Likely, you should use this only for debugging anyway. | |
158 | ||
159 | =back | |
160 | ||
161 | =item * | |
162 | ||
376d9008 | 163 | Strings--including hash keys--and regular expression patterns may |
b65e6125 | 164 | contain characters that have ordinal values larger than 255. |
393fec97 | 165 | |
2575c402 JW |
166 | If you use a Unicode editor to edit your program, Unicode characters may |
167 | occur directly within the literal strings in UTF-8 encoding, or UTF-16. | |
27c74dfd | 168 | (The former requires a C<use utf8>, the latter may require a C<BOM>.) |
3e4dbfed | 169 | |
a6a7eedc KW |
170 | L<perluniintro/Creating Unicode> gives other ways to place non-ASCII |
171 | characters in your strings. | |
6f335b04 | 172 | |
a6a7eedc | 173 | =item * |
fbb93542 | 174 | |
a6a7eedc | 175 | The C<chr()> and C<ord()> functions work on whole characters. |
376d9008 | 176 | |
393fec97 GS |
177 | =item * |
178 | ||
a6a7eedc KW |
179 | Regular expressions match whole characters. For example, C<"."> matches |
180 | a whole character instead of only a single byte. | |
393fec97 | 181 | |
393fec97 GS |
182 | =item * |
183 | ||
a6a7eedc KW |
184 | The C<tr///> operator translates whole characters. (Note that the |
185 | C<tr///CU> functionality has been removed. For similar functionality to | |
111d8f2d | 186 | that, see S<C<pack('U0', ...)>> and S<C<pack('C0', ...)>>). |
393fec97 | 187 | |
393fec97 GS |
188 | =item * |
189 | ||
a6a7eedc | 190 | C<scalar reverse()> reverses by character rather than by byte. |
393fec97 | 191 | |
393fec97 GS |
192 | =item * |
193 | ||
a6a7eedc | 194 | The bit string operators, C<& | ^ ~> and (starting in v5.22) |
fac71630 KW |
195 | C<&. |. ^. ~.> can operate on bit strings encoded in UTF-8, but this |
196 | can give unexpected results if any of the strings contain code points | |
197 | above 0xFF. Starting in v5.28, it is a fatal error to have such an | |
198 | operand. Otherwise, the operation is performed on a non-UTF-8 copy of | |
199 | the operand. If you're not sure about the encoding of a string, | |
200 | downgrade it before using any of these operators; you can use | |
a6a7eedc | 201 | L<C<utf8::downgrade()>|utf8/Utility functions>. |
822502e5 | 202 | |
a6a7eedc | 203 | =back |
822502e5 | 204 | |
a6a7eedc KW |
205 | The bottom line is that Perl has always practiced "Character Semantics", |
206 | but with the advent of Unicode, that is now different than "Byte | |
207 | Semantics". | |
208 | ||
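The points above can be illustrated with a short sketch. It builds a string containing one character above 255 and checks that the string functions count and move whole characters. C<utf8::encode()> is used here only to obtain the UTF-8 byte count for comparison; the variable names are illustrative.

```perl
use strict;
use warnings;

# One character above 255 followed by three ASCII characters.
my $str = "\x{263A}abc";

# length() counts characters, not bytes.
my $char_count = length $str;

# Encode a copy to UTF-8 to see how many bytes those characters occupy.
my $encoded = $str;
utf8::encode($encoded);
my $byte_count = length $encoded;

# substr() and ord() work on whole characters.
my $first_ord = ord substr($str, 0, 1);

# reverse in scalar context reverses by character.
my $reversed = reverse $str;

# "." in a regular expression matches a whole character.
my @dot_matches = $str =~ /./g;
my $dot_count   = @dot_matches;

# tr/// translates whole characters.
(my $translated = $str) =~ tr/\x{263A}/!/;

printf "characters: %d, UTF-8 bytes: %d, first code point: U+%04X\n",
    $char_count, $byte_count, $first_ord;
```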
209 | =head2 ASCII Rules versus Unicode Rules | |
210 | ||
211 | Before Unicode, when a character was a byte was a character, | |
212 | Perl knew only about the 128 characters defined by ASCII, code points 0 | |
b57dd509 KW |
213 | through 127 (except for under L<S<C<use locale>>|perllocale>). That |
214 | left the code | |
a6a7eedc KW |
215 | points 128 to 255 as unassigned, and available for whatever use a |
216 | program might want. The only semantics they have is their ordinal | |
217 | numbers, and that they are members of none of the non-negative character | |
218 | classes. None are considered to match C<\w> for example, but all match | |
219 | C<\W>. | |
822502e5 | 220 | |
a6a7eedc KW |
221 | Unicode, of course, assigns each of those code points a particular |
222 | meaning (along with ones above 255). To preserve backward | |
223 | compatibility, Perl only uses the Unicode meanings when there is some | |
224 | indication that Unicode is what is intended; otherwise the non-ASCII | |
225 | code points remain treated as if they are unassigned. | |
226 | ||
227 | Here are the ways that Perl knows that a string should be treated as | |
228 | Unicode: | |
229 | ||
230 | =over | |
822502e5 TS |
231 | |
232 | =item * | |
233 | ||
a6a7eedc KW |
234 | Within the scope of S<C<use utf8>> |
235 | ||
236 | If the whole program is Unicode (signified by using 8-bit B<U>nicode | |
e423fa83 | 237 | B<T>ransformation B<F>ormat), then all literal strings within it must be |
a6a7eedc | 238 | Unicode. |
822502e5 TS |
239 | |
240 | =item * | |
241 | ||
a6a7eedc KW |
242 | Within the scope of |
243 | L<S<C<use feature 'unicode_strings'>>|feature/The 'unicode_strings' feature> | |
244 | ||
245 | This pragma was created so you can explicitly tell Perl that operations | |
246 | executed within its scope are to use Unicode rules. More operations are | |
247 | affected with newer perls. See L</The "Unicode Bug">. | |
822502e5 TS |
248 | |
249 | =item * | |
250 | ||
6901d503 | 251 | Within the scope of S<C<use v5.12>> or higher |
a6a7eedc KW |
252 | |
253 | This implicitly turns on S<C<use feature 'unicode_strings'>>. | |
822502e5 TS |
254 | |
255 | =item * | |
256 | ||
a6a7eedc KW |
257 | Within the scope of |
258 | L<S<C<use locale 'not_characters'>>|perllocale/Unicode and UTF-8>, | |
259 | or L<S<C<use locale>>|perllocale> and the current | |
260 | locale is a UTF-8 locale. | |
822502e5 | 261 | |
a6a7eedc KW |
262 | The former is defined to imply Unicode handling; and the latter |
263 | indicates a Unicode locale, hence a Unicode interpretation of all | |
264 | strings within it. | |
822502e5 TS |
265 | |
266 | =item * | |
267 | ||
a6a7eedc KW |
268 | When the string contains a Unicode-only code point |
269 | ||
270 | Perl has never accepted code points above 255 without them being | |
271 | Unicode, so their use implies Unicode for the whole string. | |
822502e5 TS |
272 | |
273 | =item * | |
274 | ||
a6a7eedc KW |
275 | When the string contains a Unicode named code point C<\N{...}> |
276 | ||
277 | The C<\N{...}> construct explicitly refers to a Unicode code point, | |
278 | even if it is one that is also in ASCII. Therefore the string | |
279 | containing it must be Unicode. | |
822502e5 TS |
280 | |
281 | =item * | |
282 | ||
a6a7eedc KW |
283 | When the string has come from an external source marked as |
284 | Unicode | |
285 | ||
286 | The L<C<-C>|perlrun/-C [numberE<sol>list]> command line option can | |
287 | specify that certain inputs to the program are Unicode, and the values | |
288 | of this can be read by your Perl code, see L<perlvar/"${^UNICODE}">. | |
289 | ||
290 | =item * When the string has been upgraded to UTF-8 | |
291 | ||
292 | The function L<C<utf8::upgrade()>|utf8/Utility functions> | |
293 | can be used explicitly to make a string be treated as Unicode | |
294 | permanently, or at least until a subsequent C<utf8::downgrade()> is | |
295 | called on it. | |
296 | ||
297 | =item * There are additional methods for regular expression patterns | |
298 | ||
f6cf4627 KW |
299 | A pattern that is compiled with the C<< /u >> or C<< /a >> modifiers is |
300 | treated as Unicode (though there are some restrictions with C<< /a >>). | |
301 | Under the C<< /d >> and C<< /l >> modifiers, there are several other | |
302 | indications for Unicode; see L<perlre/Character set modifiers>. | |
822502e5 TS |
303 | |
304 | =back | |
305 | ||
a6a7eedc KW |
306 | Note that all of the above are overridden within the scope of |
307 | C<L<use bytes|bytes>>; but you should be using this pragma only for | |
308 | debugging. | |
309 | ||
310 | Note also that some interactions with the platform's operating system | |
311 | never use Unicode rules. | |
312 | ||
313 | When Unicode rules are in effect: | |
314 | ||
822502e5 TS |
315 | =over 4 |
316 | ||
317 | =item * | |
318 | ||
a6a7eedc KW |
319 | Case translation operators use the Unicode case translation tables. |
320 | ||
321 | Note that C<uc()>, or C<\U> in interpolated strings, translates to | |
322 | uppercase, while C<ucfirst>, or C<\u> in interpolated strings, | |
323 | translates to titlecase in languages that make the distinction (which is | |
324 | equivalent to uppercase in languages without the distinction). | |
325 | ||
326 | There is a CPAN module, C<L<Unicode::Casing>>, which allows you to | |
327 | define your own mappings to be used in C<lc()>, C<lcfirst()>, C<uc()>, | |
328 | C<ucfirst()>, and C<fc> (or their double-quoted string inlined versions | |
329 | such as C<\U>). (Prior to Perl 5.16, this functionality was partially | |
330 | provided in the Perl core, but suffered from a number of insurmountable | |
331 | drawbacks, so the CPAN module was written instead.) | |
332 | ||
333 | =item * | |
334 | ||
335 | Character classes in regular expressions match based on the character | |
336 | properties specified in the Unicode properties database. | |
337 | ||
338 | C<\w> can be used to match a Japanese ideograph, for instance; and | |
339 | C<[[:digit:]]> a Bengali number. | |
340 | ||
341 | =item * | |
342 | ||
343 | Named Unicode properties, scripts, and block ranges may be used (like | |
344 | bracketed character classes) by using the C<\p{}> "matches property" | |
345 | construct and the C<\P{}> negation, "doesn't match property". | |
346 | ||
347 | See L</"Unicode Character Properties"> for more details. | |
348 | ||
349 | You can define your own character properties and use them | |
350 | in the regular expression with the C<\p{}> or C<\P{}> construct. | |
351 | See L</"User-Defined Character Properties"> for more details. | |
822502e5 TS |
352 | |
353 | =back | |
354 | ||
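The following sketch shows how a few of the indications listed above change the treatment of a single code point in the 128 to 255 range. It assumes an ASCII platform and no locale in effect. The C</d> modifier is written out explicitly so the result does not depend on which features the surrounding code has enabled.

```perl
use strict;
use warnings;

# LATIN SMALL LETTER E WITH ACUTE, stored as a single byte.
my $e_acute = "\xE9";

# Under /d with no other indication of Unicode, the code point is
# treated as unassigned, so it is not a word character.
my $match_d = $e_acute =~ /\w/d ? 1 : 0;

# /u forces Unicode rules; /a restricts \w to ASCII.
my $match_u = $e_acute =~ /\w/u ? 1 : 0;
my $match_a = $e_acute =~ /\w/a ? 1 : 0;

# Upgrading the string changes its internal representation, which /d
# takes as an indication of Unicode, although the string still
# compares equal to the original.
my $upgraded = $e_acute;
utf8::upgrade($upgraded);
my $match_upgraded = $upgraded =~ /\w/d ? 1 : 0;
my $still_equal    = $upgraded eq $e_acute ? 1 : 0;

# Case mapping depends on the 'unicode_strings' feature.
my ($uc_native, $uc_unicode);
{ no feature 'unicode_strings';  $uc_native  = uc $e_acute; }
{ use feature 'unicode_strings'; $uc_unicode = uc $e_acute; }

printf "/d: %d  /u: %d  /a: %d  upgraded /d: %d\n",
    $match_d, $match_u, $match_a, $match_upgraded;
printf "uc without feature: 0x%02X, with feature: 0x%02X\n",
    ord $uc_native, ord $uc_unicode;
```

That the same characters match differently depending only on the internal representation is the inconsistency described in L</The "Unicode Bug">.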
a6a7eedc KW |
355 | =head2 Extended Grapheme Clusters (Logical characters) |
356 | ||
357 | Consider a character, say C<H>. It could appear with various marks around it, | |
358 | such as an acute accent, or a circumflex, or various hooks, circles, arrows, | |
359 | I<etc.>, above, below, to one side or the other, I<etc>. There are many | |
360 | possibilities among the world's languages. The number of combinations is | |
361 | astronomical, and if there were a character for each combination, it would | |
362 | soon exhaust Unicode's more than a million possible characters. So Unicode | |
363 | took a different approach: there is a character for the base C<H>, and a | |
364 | character for each of the possible marks, and these can be variously combined | |
365 | to get a final logical character. So a logical character--what appears to be a | |
366 | single character--can be a sequence of more than one individual characters. | |
367 | The Unicode standard calls these "extended grapheme clusters" (which | |
368 | is an improved version of the no-longer much used "grapheme cluster"); | |
369 | Perl furnishes the C<\X> regular expression construct to match such | |
370 | sequences in their entirety. | |
371 | ||
372 | But Unicode's intent is to unify the existing character set standards and | |
373 | practices, and several pre-existing standards have single characters that | |
374 | mean the same thing as some of these combinations, like ISO-8859-1, | |
375 | which has quite a few of them. For example, C<"LATIN CAPITAL LETTER E | |
376 | WITH ACUTE"> was already in this standard when Unicode came along. | |
377 | Unicode therefore added it to its repertoire as that single character. | |
378 | But this character is considered by Unicode to be equivalent to the | |
379 | sequence consisting of the character C<"LATIN CAPITAL LETTER E"> | |
380 | followed by the character C<"COMBINING ACUTE ACCENT">. | |
381 | ||
382 | C<"LATIN CAPITAL LETTER E WITH ACUTE"> is called a "pre-composed" | |
383 | character, and its equivalence with the "E" and "COMBINING ACUTE ACCENT" | |
384 | sequence is called canonical equivalence. All pre-composed characters | |
385 | are said to have a decomposition (into the equivalent sequence), and the | |
f87d0856 DC |
386 | decomposition type is also called canonical. A string may consist |
387 | as much as possible of precomposed characters, or it may consist of | |
a6a7eedc KW |
388 | entirely decomposed characters. Unicode calls these respectively, |
389 | "Normalization Form Composed" (NFC) and "Normalization Form Decomposed" (NFD). | |
390 | The C<L<Unicode::Normalize>> module contains functions that convert | |
391 | between the two. A string may also have both composed characters and | |
392 | decomposed characters; this module can be used to make it all one or the | |
393 | other. | |
394 | ||
395 | You may be presented with strings in any of these equivalent forms. | |
396 | There is currently nothing in Perl 5 that ignores the differences. So | |
dabde021 | 397 | you'll have to specially handle it. The usual advice is to convert your |
a6a7eedc KW |
398 | inputs to C<NFD> before processing further. |
399 | ||
400 | For more detailed information, see L<http://unicode.org/reports/tr15/>. | |
401 | ||
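As an illustration of canonical equivalence, this sketch compares the pre-composed and decomposed spellings of "E with acute" using the C<NFC> and C<NFD> functions from C<Unicode::Normalize>, and uses C<\X> to count logical characters. The variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $precomposed = "\x{C9}";      # LATIN CAPITAL LETTER E WITH ACUTE
my $decomposed  = "E\x{301}";    # "E" followed by COMBINING ACUTE ACCENT

# Perl's string comparison does not ignore the difference.
my $equal_raw = $precomposed eq $decomposed ? 1 : 0;

# After normalizing both sides to the same form, they compare equal.
my $equal_nfd = NFD($precomposed) eq NFD($decomposed) ? 1 : 0;
my $equal_nfc = NFC($precomposed) eq NFC($decomposed) ? 1 : 0;

# NFD splits the pre-composed character into two code points.
my $nfd_length = length(NFD($precomposed));

# \X matches the base character plus its mark as one logical character.
my @clusters      = $decomposed =~ /\X/g;
my $cluster_count = @clusters;

print "raw equal: $equal_raw, NFD equal: $equal_nfd, NFC equal: $equal_nfc\n";
print "NFD length: $nfd_length, clusters in decomposed form: $cluster_count\n";
```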
822502e5 TS |
402 | =head2 Unicode Character Properties |
403 | ||
ee88f7b6 | 404 | (The only time that Perl considers a sequence of individual code |
9d1c51c1 KW |
405 | points as a single logical character is in the C<\X> construct, already |
406 | mentioned above. Therefore "character" in this discussion means a single | |
ee88f7b6 KW |
407 | Unicode code point.) |
408 | ||
409 | Very nearly all Unicode character properties are accessible through | |
410 | regular expressions by using the C<\p{}> "matches property" construct | |
411 | and the C<\P{}> "doesn't match property" for its negation. | |
51f494cc | 412 | |
9d1c51c1 | 413 | For instance, C<\p{Uppercase}> matches any single character with the Unicode |
a9130ea9 KW |
414 | C<"Uppercase"> property, while C<\p{L}> matches any character with a |
415 | C<General_Category> of C<"L"> (letter) property (see | |
416 | L</General_Category> below). Braces are not | |
9d1c51c1 | 417 | required for single letter property names, so C<\p{L}> is equivalent to C<\pL>. |
51f494cc | 418 | |
9d1c51c1 | 419 | More formally, C<\p{Uppercase}> matches any single character whose Unicode |
a9130ea9 KW |
420 | C<Uppercase> property value is C<True>, and C<\P{Uppercase}> matches any character |
421 | whose C<Uppercase> property value is C<False>, and they could have been written as | |
9d1c51c1 | 422 | C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively. |
51f494cc | 423 | |
b19eb496 | 424 | This formality is needed when properties are not binary; that is, if they can |
a9130ea9 KW |
425 | take on more values than just C<True> and C<False>. For example, the |
426 | C<Bidi_Class> property (see L</"Bidirectional Character Types"> below), | |
427 | can take on several different | |
428 | values, such as C<Left>, C<Right>, C<Whitespace>, and others. To match these, one needs | |
429 | to specify both the property name (C<Bidi_Class>), AND the value being | |
5bff2035 | 430 | matched against |
b65e6125 | 431 | (C<Left>, C<Right>, I<etc.>). This is done, as in the examples above, by having the |
9f815e24 | 432 | two components separated by an equal sign (or interchangeably, a colon), like |
51f494cc KW |
433 | C<\p{Bidi_Class: Left}>. |
434 | ||
435 | All Unicode-defined character properties may be written in these compound forms | |
a9130ea9 | 436 | of C<\p{I<property>=I<value>}> or C<\p{I<property>:I<value>}>, but Perl provides some |
51f494cc KW |
437 | additional properties that are written only in the single form, as well as |
438 | single-form short-cuts for all binary properties and certain others described | |
439 | below, in which you may omit the property name and the equals or colon | |
440 | separator. | |
441 | ||
442 | Most Unicode character properties have at least two synonyms (or aliases if you | |
b19eb496 | 443 | prefer): a short one that is easier to type and a longer one that is more |
a9130ea9 KW |
444 | descriptive and hence easier to understand. Thus the C<"L"> and |
445 | C<"Letter"> properties above are equivalent and can be used | |
446 | interchangeably. Likewise, C<"Upper"> is a synonym for C<"Uppercase">, | |
447 | and we could have written C<\p{Uppercase}> equivalently as C<\p{Upper}>. | |
448 | Also, there are typically various synonyms for the values the property | |
449 | can be. For binary properties, C<"True"> has 3 synonyms: C<"T">, | |
450 | C<"Yes">, and C<"Y">; and C<"False"> has correspondingly C<"F">, | |
451 | C<"No">, and C<"N">. But be careful. A short form of a value for one | |
d130120f KW |
452 | property may not mean the same thing as the short form spelled the same |
453 | for another. | |
a9130ea9 KW |
454 | Thus, for the C<L</General_Category>> property, C<"L"> means |
455 | C<"Letter">, but for the L<C<Bidi_Class>|/Bidirectional Character Types> | |
456 | property, C<"L"> means C<"Left">. A complete list of properties and | |
457 | synonyms is in L<perluniprops>. | |
51f494cc | 458 | |
b19eb496 | 459 | Upper/lower case differences in property names and values are irrelevant; |
51f494cc KW |
460 | thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>. |
461 | Similarly, you can add or subtract underscores anywhere in the middle of a | |
462 | word, so that these are also equivalent to C<\p{U_p_p_e_r}>. And white space | |
673c254b KW |
463 | is generally irrelevant adjacent to non-word characters, such as the |
464 | braces and the equals or colon separators, so C<\p{ Upper }> and | |
465 | C<\p{ Upper_case : Y }> are equivalent to these as well. In fact, white | |
466 | space and even hyphens can usually be added or deleted anywhere. So | |
467 | even C<\p{ Up-per case = Yes}> is equivalent. All this is called | |
468 | "loose-matching" by Unicode. The "name" property has some restrictions | |
469 | on this due to a few outlier names. Full details are given in | |
470 | L<https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2>. | |
471 | ||
472 | The few places where stricter matching is | |
473 | used are in the middle of numbers, the "name" property, and in the Perl | |
474 | extension properties that begin or end with an underscore. Stricter | |
475 | matching cares about white space (except adjacent to non-word | |
476 | characters), hyphens, and non-interior underscores. | |
4193bef7 | 477 | |
376d9008 | 478 | You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret |
a9130ea9 | 479 | (C<^>) between the first brace and the property name: C<\p{^Tamil}> is |
eb0cc9e3 | 480 | equal to C<\P{Tamil}>. |
4193bef7 | 481 | |
56ca34ca KW |
482 | Almost all properties are immune to case-insensitive matching. That is, |
483 | adding a C</i> regular expression modifier does not change what they | |
484 | match. There are two sets that are affected. | |
485 | The first set is | |
486 | C<Uppercase_Letter>, | |
487 | C<Lowercase_Letter>, | |
488 | and C<Titlecase_Letter>, | |
489 | all of which match C<Cased_Letter> under C</i> matching. | |
490 | And the second set is | |
491 | C<Uppercase>, | |
492 | C<Lowercase>, | |
493 | and C<Titlecase>, | |
494 | all of which match C<Cased> under C</i> matching. | |
495 | This set also includes its subsets C<PosixUpper> and C<PosixLower> both | |
a9130ea9 | 496 | of which under C</i> match C<PosixAlpha>. |
56ca34ca | 497 | (The difference between these sets is that some things, such as Roman |
b65e6125 KW |
498 | numerals, come in both upper and lower case so they are C<Cased>, but |
499 | aren't considered letters, so they aren't C<Cased_Letter>'s.) | |
56ca34ca | 500 | |
2d88a86a KW |
501 | See L</Beyond Unicode code points> for special considerations when |
502 | matching Unicode properties against non-Unicode code points. | |
94b42e47 | 503 | |
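A short sketch of the forms described in this section follows: single and compound forms, synonyms, loose matching of case, the two ways of negating, and the effect of C</i> on C<Lu>. The test characters were chosen for illustration only.

```perl
use strict;
use warnings;

# Single form, a synonym, loose matching of case, and the compound form.
my $upper_long     = ('A' =~ /\p{Uppercase}/)      ? 1 : 0;
my $upper_short    = ('A' =~ /\p{Upper}/)          ? 1 : 0;
my $upper_loose    = ('A' =~ /\p{UpPeR}/)          ? 1 : 0;
my $upper_compound = ('A' =~ /\p{Uppercase=True}/) ? 1 : 0;

# Two equivalent ways to negate a property.
my $negated_P     = ('a' =~ /\P{Uppercase}/)  ? 1 : 0;
my $negated_caret = ('a' =~ /\p{^Uppercase}/) ? 1 : 0;

# Braces may be omitted for single-letter property names.
my $letter = ('a' =~ /\pL/) ? 1 : 0;

# Non-binary properties need both name and value; "=" and ":" both work.
my $bidi_ltr = ('A'        =~ /\p{Bidi_Class=L}/) ? 1 : 0;
my $bidi_rtl = ("\x{05D0}" =~ /\p{Bidi_Class:R}/) ? 1 : 0;  # HEBREW LETTER ALEF

# Lu is one of the few properties affected by /i.
my $lu_plain  = ('a' =~ /\p{Lu}/)  ? 1 : 0;
my $lu_nocase = ('a' =~ /\p{Lu}/i) ? 1 : 0;

print "Lu without /i: $lu_plain, with /i: $lu_nocase\n";
```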
51f494cc | 504 | =head3 B<General_Category> |
14bb0a9a | 505 | |
51f494cc KW |
506 | Every Unicode character is assigned a general category, which is the "most |
507 | usual categorization of a character" (from | |
71c89d21 | 508 | L<https://www.unicode.org/reports/tr44>). |
822502e5 | 509 | |
9f815e24 | 510 | The compound way of writing these is like C<\p{General_Category=Number}> |
b65e6125 | 511 | (short: C<\p{gc:n}>). But Perl furnishes shortcuts in which everything up |
51f494cc KW |
512 | through the equal or colon separator is omitted. So you can instead just write |
513 | C<\pN>. | |
822502e5 | 514 | |
a9130ea9 KW |
515 | Here are the short and long forms of the values the C<General_Category> property |
516 | can have: | |
393fec97 | 517 | |
d73e5302 JH |
518 | Short Long |
519 | ||
520 | L Letter | |
51f494cc KW |
521 | LC, L& Cased_Letter (that is: [\p{Ll}\p{Lu}\p{Lt}]) |
522 | Lu Uppercase_Letter | |
523 | Ll Lowercase_Letter | |
524 | Lt Titlecase_Letter | |
525 | Lm Modifier_Letter | |
526 | Lo Other_Letter | |
d73e5302 JH |
527 | |
528 | M Mark | |
51f494cc KW |
529 | Mn Nonspacing_Mark |
530 | Mc Spacing_Mark | |
531 | Me Enclosing_Mark | |
d73e5302 JH |
532 | |
533 | N Number | |
51f494cc KW |
534 | Nd Decimal_Number (also Digit) |
535 | Nl Letter_Number | |
536 | No Other_Number | |
537 | ||
538 | P Punctuation (also Punct) | |
539 | Pc Connector_Punctuation | |
540 | Pd Dash_Punctuation | |
541 | Ps Open_Punctuation | |
542 | Pe Close_Punctuation | |
543 | Pi Initial_Punctuation | |
d73e5302 | 544 | (may behave like Ps or Pe depending on usage) |
51f494cc | 545 | Pf Final_Punctuation |
d73e5302 | 546 | (may behave like Ps or Pe depending on usage) |
51f494cc | 547 | Po Other_Punctuation |
d73e5302 JH |
548 | |
549 | S Symbol | |
51f494cc KW |
550 | Sm Math_Symbol |
551 | Sc Currency_Symbol | |
552 | Sk Modifier_Symbol | |
553 | So Other_Symbol | |
d73e5302 JH |
554 | |
555 | Z Separator | |
51f494cc KW |
556 | Zs Space_Separator |
557 | Zl Line_Separator | |
558 | Zp Paragraph_Separator | |
d73e5302 JH |
559 | |
560 | C Other | |
d88362ca | 561 | Cc Control (also Cntrl) |
e150c829 | 562 | Cf Format |
6d4f9cf2 | 563 | Cs Surrogate |
51f494cc | 564 | Co Private_Use |
e150c829 | 565 | Cn Unassigned |
1ac13f9a | 566 | |
376d9008 | 567 | Single-letter properties match all characters in any of the |
3e4dbfed | 568 | two-letter sub-properties starting with the same letter. |
b19eb496 | 569 | C<LC> and C<L&> are special: both are aliases for the set consisting of everything matched by C<Ll>, C<Lu>, and C<Lt>. |
32293815 | 570 | |
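The relationship between single-letter and two-letter categories can be sketched as follows. The example characters (a Devanagari digit, a Roman numeral, and a Hebrew letter) are chosen for illustration; their categories are those assigned by the Unicode Character Database.

```perl
use strict;
use warnings;

my $devanagari_one = "\x{0967}";   # DEVANAGARI DIGIT ONE, category Nd
my $roman_one      = "\x{2160}";   # ROMAN NUMERAL ONE, category Nl
my $hebrew_alef    = "\x{05D0}";   # HEBREW LETTER ALEF, category Lo

# A two-letter category is included in its single-letter parent,
# however the parent is spelled.
my $digit_is_Nd       = ($devanagari_one =~ /\p{Nd}/)                      ? 1 : 0;
my $digit_is_N        = ($devanagari_one =~ /\pN/)                         ? 1 : 0;
my $digit_is_N_long   = ($devanagari_one =~ /\p{General_Category=Number}/) ? 1 : 0;
my $digit_is_N_short  = ($devanagari_one =~ /\p{gc:n}/)                    ? 1 : 0;

# A Roman numeral is a Number (Nl) but not a Decimal_Number.
my $roman_is_N  = ($roman_one =~ /\pN/)    ? 1 : 0;
my $roman_is_Nd = ($roman_one =~ /\p{Nd}/) ? 1 : 0;

# It has the Uppercase property but is not a Cased_Letter.
my $roman_is_upper = ($roman_one =~ /\p{Uppercase}/) ? 1 : 0;
my $roman_is_LC    = ($roman_one =~ /\p{LC}/)        ? 1 : 0;

# LC and L& cover only Lu, Ll, and Lt, not every letter.
my $a_is_LC    = ('a' =~ /\p{LC}/)          ? 1 : 0;
my $a_is_Lamp  = ('a' =~ /\p{L&}/)          ? 1 : 0;
my $alef_is_L  = ($hebrew_alef =~ /\pL/)    ? 1 : 0;
my $alef_is_LC = ($hebrew_alef =~ /\p{LC}/) ? 1 : 0;

print "Roman numeral: N=$roman_is_N Nd=$roman_is_Nd ",
      "Uppercase=$roman_is_upper LC=$roman_is_LC\n";
```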
51f494cc | 571 | =head3 B<Bidirectional Character Types> |
822502e5 | 572 | |
b19eb496 | 573 | Because scripts differ in their directionality (Hebrew and Arabic are |
a9130ea9 | 574 | written right to left, for example) Unicode supplies a C<Bidi_Class> property. |
1850f57f | 575 | Some of the values this property can have are: |
32293815 | 576 | |
88af3b93 | 577 | Value Meaning |
92e830a9 | 578 | |
12ac2576 JP |
579 | L Left-to-Right |
580 | LRE Left-to-Right Embedding | |
581 | LRO Left-to-Right Override | |
582 | R Right-to-Left | |
51f494cc | 583 | AL Arabic Letter |
12ac2576 JP |
584 | RLE Right-to-Left Embedding |
585 | RLO Right-to-Left Override | |
586 | PDF Pop Directional Format | |
587 | EN European Number | |
51f494cc KW |
588 | ES European Separator |
589 | ET European Terminator | |
12ac2576 | 590 | AN Arabic Number |
51f494cc | 591 | CS Common Separator |
12ac2576 JP |
592 | NSM Non-Spacing Mark |
593 | BN Boundary Neutral | |
594 | B Paragraph Separator | |
595 | S Segment Separator | |
596 | WS Whitespace | |
597 | ON Other Neutrals | |
598 | ||
51f494cc KW |
599 | This property is always written in the compound form. |
600 | For example, C<\p{Bidi_Class:R}> matches characters that are normally | |
1850f57f | 601 | written right to left. Unlike the |
a9130ea9 | 602 | C<L</General_Category>> property, this |
1850f57f KW |
603 | property can have more values added in a future Unicode release. Those |
604 | listed above comprised the complete set for many Unicode releases, but | |
605 | others were added in Unicode 6.3; you can always find what the | |
20ada7da | 606 | current ones are in L<perluniprops>. And |
71c89d21 | 607 | L<https://www.unicode.org/reports/tr9/> describes how to use them. |
eb0cc9e3 | 608 | |
51f494cc KW |
609 | =head3 B<Scripts> |
610 | ||
b19eb496 | 611 | The world's languages are written in many different scripts. This sentence |
e1b711da | 612 | (unless you're reading it in translation) is written in Latin, while Russian is |
c69ca1d4 | 613 | written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in |
e1b711da | 614 | Hiragana or Katakana. There are many more. |
51f494cc | 615 | |
48791bf1 KW |
616 | The Unicode C<Script> and C<Script_Extensions> properties give what |
617 | script a given character is in. The C<Script_Extensions> property is an | |
618 | improved version of C<Script>, as demonstrated below. Either property | |
619 | can be specified with the compound form like | |
82aed44a KW |
620 | C<\p{Script=Hebrew}> (short: C<\p{sc=hebr}>), or |
621 | C<\p{Script_Extensions=Javanese}> (short: C<\p{scx=java}>). | |
622 | In addition, Perl furnishes shortcuts for all | |
48791bf1 KW |
623 | C<Script_Extensions> property names. You can omit everything up through |
624 | the equals (or colon), and simply write C<\p{Latin}> or C<\P{Cyrillic}>. | |
625 | (This is not true for C<Script>, which is required to be | |
626 | written in the compound form. Prior to Perl v5.26, the single form | |
627 | returned the plain old C<Script> version, but was changed because | |
628 | C<Script_Extensions> gives better results.) | |
82aed44a KW |
629 | |
630 | The difference between these two properties involves characters that are | |
631 | used in multiple scripts. For example the digits '0' through '9' are | |
632 | used in many parts of the world. These are placed in a script named | |
633 | C<Common>. Other characters are used in just a few scripts. For | |
a9130ea9 | 634 | example, the C<"KATAKANA-HIRAGANA DOUBLE HYPHEN"> is used in both Japanese |
82aed44a KW |
635 | scripts, Katakana and Hiragana, but nowhere else. The C<Script> |
636 | property places all characters that are used in multiple scripts in the | |
637 | C<Common> script, while the C<Script_Extensions> property places those | |
638 | that are used in only a few scripts into each of those scripts; while | |
639 | still using C<Common> for those used in many scripts. Thus both these | |
640 | match: | |
641 | ||
642 | "0" =~ /\p{sc=Common}/ # Matches | |
643 | "0" =~ /\p{scx=Common}/ # Matches | |
644 | ||
645 | and only the first of these matches: | |
646 | ||
647 | "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Common}/ # Matches | |
648 | "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Common}/ # No match | |
649 | ||
650 | And only the last two of these match: | |
651 | ||
652 | "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Hiragana}/ # No match | |
653 | "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Katakana}/ # No match | |
654 | "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Hiragana}/ # Matches | |
655 | "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Katakana}/ # Matches | |
656 | ||
657 | C<Script_Extensions> is thus an improved C<Script>, in which there are | |
658 | fewer characters in the C<Common> script, and correspondingly more in | |
659 | other scripts. It is new in Unicode version 6.0, and its data are likely | |
660 | to change significantly in later releases, as things get sorted out. | |
b65e6125 | 661 | New code should probably be using C<Script_Extensions> and not plain |
48791bf1 KW |
662 | C<Script>. If you compile perl with a Unicode release that doesn't have |
663 | C<Script_Extensions>, the single form Perl extensions will instead refer | |
664 | to the plain C<Script> property. If you compile with a version of | |
665 | Unicode that doesn't have the C<Script> property, these extensions will | |
666 | not be defined at all. | |
82aed44a KW |
667 | |
668 | (Actually, besides C<Common>, the C<Inherited> script also contains |
669 | characters that are used in multiple scripts. These are modifier | |
b65e6125 | 670 | characters which inherit the script value |
82aed44a KW |
671 | of the controlling character. Some of these are used in many scripts, |
672 | and so go into C<Inherited> in both C<Script> and C<Script_Extensions>. | |
673 | Others are used in just a few scripts, so are in C<Inherited> in | |
674 | C<Script>, but not in C<Script_Extensions>.) | |
675 | ||
676 | It is worth stressing that there are several different sets of digits in | |
677 | Unicode that are equivalent to 0-9 and are matchable by C<\d> in a | |
678 | regular expression. If they are used in a single language only, they | |
48791bf1 | 679 | are in that language's C<Script> and C<Script_Extensions>. If they are |
82aed44a KW |
680 | used in more than one script, they will be in C<sc=Common>, but only |
681 | if they are used in many scripts should they be in C<scx=Common>. | |
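
The digit behavior described above can be checked directly. The following is a minimal sketch, not a definitive test: the choice of C<THAI DIGIT ONE> as the script-specific digit and the names C<%digit_checks>, C<$thai_one>, and C<$ascii_one> are assumptions made for illustration.

```perl
use strict;
use warnings;

# Illustrative characters; picking Thai is an assumption of this sketch.
my $thai_one  = "\x{0E51}";    # THAI DIGIT ONE, used in a single script
my $ascii_one = "1";           # DIGIT ONE, shared by many scripts

my %digit_checks = (
    thai_is_digit    => ($thai_one  =~ /\A\d\z/         ? 1 : 0),
    thai_sc_thai     => ($thai_one  =~ /\p{sc=Thai}/    ? 1 : 0),
    thai_scx_thai    => ($thai_one  =~ /\p{scx=Thai}/   ? 1 : 0),
    thai_scx_common  => ($thai_one  =~ /\p{scx=Common}/ ? 1 : 0),
    ascii_sc_common  => ($ascii_one =~ /\p{sc=Common}/  ? 1 : 0),
    ascii_scx_common => ($ascii_one =~ /\p{scx=Common}/ ? 1 : 0),
);

for my $name (sort keys %digit_checks) {
    printf "%-18s %s\n", $name,
        $digit_checks{$name} ? "matches" : "no match";
}
```

Both digits match C<\d>-style tests, but only the ASCII digit is in C<Common>; the Thai digit stays in its own script under both properties.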
51f494cc | 682 | |
48791bf1 | 683 | The explanation above has omitted some detail; refer to UAX#24 "Unicode |
71c89d21 | 684 | Script Property": L<https://www.unicode.org/reports/tr24>. |
48791bf1 | 685 | |
51f494cc KW |
686 | A complete list of scripts and their shortcuts is in L<perluniprops>. |
687 | ||
a9130ea9 | 688 | =head3 B<Use of the C<"Is"> Prefix> |
822502e5 | 689 | |
7b0ac457 | 690 | For backward compatibility (with ancient Perl 5.6), all properties writable |
b65e6125 | 691 | without using the compound form mentioned |
51f494cc KW |
692 | so far may have C<Is> or C<Is_> prepended to their name, so C<\P{Is_Lu}>, for |
693 | example, is equal to C<\P{Lu}>, and C<\p{IsScript:Arabic}> is equal to | |
694 | C<\p{Arabic}>. | |
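
As a small sketch of the equivalence just described (the test characters and the C<@is_prefix_checks> name are illustrative assumptions), each pair below tests the same property with and without the C<Is> prefix:

```perl
use strict;
use warnings;

my $alef = "\x{0627}";    # ARABIC LETTER ALEF

# Each entry is [ description, result ]; paired entries should agree.
my @is_prefix_checks = (
    [ '"A" vs \p{Lu}',         ("A"   =~ /\p{Lu}/       ? 1 : 0) ],
    [ '"A" vs \p{Is_Lu}',      ("A"   =~ /\p{Is_Lu}/    ? 1 : 0) ],
    [ '"a" vs \P{Lu}',         ("a"   =~ /\P{Lu}/       ? 1 : 0) ],
    [ '"a" vs \P{Is_Lu}',      ("a"   =~ /\P{Is_Lu}/    ? 1 : 0) ],
    [ 'alef vs \p{Arabic}',    ($alef =~ /\p{Arabic}/   ? 1 : 0) ],
    [ 'alef vs \p{IsArabic}',  ($alef =~ /\p{IsArabic}/ ? 1 : 0) ],
);

printf "%-22s %s\n", $_->[0], $_->[1] ? "matches" : "no match"
    for @is_prefix_checks;
```

New code has little reason to use the prefixed spellings; they are shown here only to confirm that they behave identically.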
eb0cc9e3 | 695 | |
51f494cc | 696 | =head3 B<Blocks> |
2796c109 | 697 | |
1bfb14c4 JH |
698 | In addition to B<scripts>, Unicode also defines B<blocks> of |
699 | characters. The difference between scripts and blocks is that the | |
700 | concept of scripts is closer to natural languages, while the concept | |
51f494cc | 701 | of blocks is more of an artificial grouping based on groups of Unicode |
a9130ea9 | 702 | characters with consecutive ordinal values. For example, the C<"Basic Latin"> |
b65e6125 | 703 | block is all the characters whose ordinals are between 0 and 127, inclusive; in |
a9130ea9 KW |
704 | other words, the ASCII characters. The C<"Latin"> script contains some letters |
705 | from this as well as several other blocks, like C<"Latin-1 Supplement">, | |
b65e6125 | 706 | C<"Latin Extended-A">, I<etc.>, but it does not contain all the characters from |
7be67b37 | 707 | those blocks. It does not, for example, contain the digits 0-9, because |
82aed44a KW |
708 | those digits are shared across many scripts, and hence are in the |
709 | C<Common> script. | |
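
The distinction can be seen with a few ASCII and Latin-1 characters. This is a minimal sketch; the sample characters and the C<%block_vs_script> name are assumptions chosen for illustration.

```perl
use strict;
use warnings;

my $times = "\x{D7}";    # MULTIPLICATION SIGN, in the Latin-1 Supplement block
my $e_acute = "\x{E9}";  # LATIN SMALL LETTER E WITH ACUTE, same block

my %block_vs_script = (
    # A digit lives in the Basic Latin block but in the Common script.
    digit_in_basic_latin_block => ("5" =~ /\p{Blk=Basic_Latin}/ ? 1 : 0),
    digit_in_latin_script      => ("5" =~ /\p{Latin}/           ? 1 : 0),
    digit_in_common_script     => ("5" =~ /\p{Common}/          ? 1 : 0),

    # Two characters from one block can belong to different scripts.
    e_acute_in_latin1_block    => ($e_acute =~ /\p{Blk=Latin_1_Supplement}/ ? 1 : 0),
    e_acute_in_latin_script    => ($e_acute =~ /\p{Latin}/                  ? 1 : 0),
    times_in_latin1_block      => ($times   =~ /\p{Blk=Latin_1_Supplement}/ ? 1 : 0),
    times_in_latin_script      => ($times   =~ /\p{Latin}/                  ? 1 : 0),
);

printf "%-28s %s\n", $_, $block_vs_script{$_} ? "matches" : "no match"
    for sort keys %block_vs_script;
```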
51f494cc KW |
710 | |
711 | For more about scripts versus blocks, see UAX#24 "Unicode Script Property": | |
71c89d21 | 712 | L<https://www.unicode.org/reports/tr24>.
51f494cc | 713 | |
48791bf1 | 714 | The C<Script_Extensions> or C<Script> properties are likely to be the |
82aed44a | 715 | ones you want to use when processing |
a9130ea9 | 716 | natural language; the C<Block> property may occasionally be useful in working |
b19eb496 | 717 | with the nuts and bolts of Unicode. |
51f494cc KW |
718 | |
719 | Block names are matched in the compound form, like C<\p{Block: Arrows}> or | |
b19eb496 | 720 | C<\p{Blk=Hebrew}>. Unlike most other properties, only a few block names have a |
6b5cf123 KW |
721 | Unicode-defined short name. |
722 | ||
723 | Perl also defines single form synonyms for the block property in cases | |
724 | where these do not conflict with something else. But don't use any of | |
725 | these, because they are unstable. Since these are Perl extensions, they | |
726 | are subordinate to official Unicode property names; Unicode doesn't know | |
727 | nor care about Perl's extensions. It may happen that a name that | |
728 | currently means the Perl extension will later be changed without warning | |
729 | to mean a different Unicode property in a future version of the perl | |
730 | interpreter that uses a later Unicode release, and your code would no | |
731 | longer work. The extensions are mentioned here for completeness: Take | |
732 | the block name and prefix it with one of: C<In> (for example | |
733 | C<\p{Blk=Arrows}> can currently be written as C<\p{In_Arrows}>); or | |
734 | sometimes C<Is> (like C<\p{Is_Arrows}>); or sometimes no prefix at all | |
48791bf1 | 735 | (C<\p{Arrows}>). As of this writing (Unicode 9.0) there are no |
6b5cf123 KW |
736 | conflicts with using the C<In_> prefix, but there are plenty with the |
737 | other two forms. For example, C<\p{Is_Hebrew}> and C<\p{Hebrew}> mean | |
48791bf1 KW |
738 | C<\p{Script_Extensions=Hebrew}> which is NOT the same thing as |
739 | C<\p{Blk=Hebrew}>. Our | |
6b5cf123 KW |
740 | advice used to be to use the C<In_> prefix as a single form way of |
741 | specifying a block. But Unicode 8.0 added properties whose names begin | |
742 | with C<In>, and it's now clear that it's only luck that's so far | |
743 | prevented a conflict. Using C<In> is only marginally less typing than | |
744 | C<Blk:>, and the latter's meaning is clearer anyway, and guaranteed to | |
745 | never conflict. So don't take chances. Use C<\p{Blk=foo}> for new | |
746 | code. And be sure that block is what you really really want to do. In | |
747 | most cases scripts are what you want instead. | |
748 | ||
749 | A complete list of blocks is in L<perluniprops>. | |
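
To make the C<Hebrew> example concrete, here is a hedged sketch comparing the block with the script. The two sample characters and the C<%hebrew_checks> name are assumptions picked for illustration; the second character is a Hebrew letter encoded outside the Hebrew block.

```perl
use strict;
use warnings;

my $alef      = "\x{05D0}";   # HEBREW LETTER ALEF
my $yod_hiriq = "\x{FB1D}";   # HEBREW LETTER YOD WITH HIRIQ, encoded in
                              # the Alphabetic Presentation Forms block

my %hebrew_checks = (
    alef_block     => ($alef      =~ /\p{Blk=Hebrew}/ ? 1 : 0),
    alef_script    => ($alef      =~ /\p{Hebrew}/     ? 1 : 0),
    yod_block      => ($yod_hiriq =~ /\p{Blk=Hebrew}/ ? 1 : 0),
    yod_script     => ($yod_hiriq =~ /\p{Hebrew}/     ? 1 : 0),
    yod_is_prefix  => ($yod_hiriq =~ /\p{Is_Hebrew}/  ? 1 : 0),
);

printf "%-14s %s\n", $_, $hebrew_checks{$_} ? "matches" : "no match"
    for sort keys %hebrew_checks;
```

The single forms follow the script, so only the explicit C<\p{Blk=Hebrew}> rejects the presentation-form letter.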
51f494cc | 750 | |
9f815e24 KW |
751 | =head3 B<Other Properties> |
752 | ||
753 | There are many more properties than the very basic ones described here. | |
754 | A complete list is in L<perluniprops>. | |
755 | ||
756 | Unicode defines all its properties in the compound form, so all single-form | |
b19eb496 TC |
757 | properties are Perl extensions. Most of these are just synonyms for the |
758 | Unicode ones, but some are genuine extensions, including several that are in | |
9f815e24 | 759 | the compound form. And quite a few of these are actually recommended by Unicode |
71c89d21 | 760 | (in L<https://www.unicode.org/reports/tr18>). |
9f815e24 | 761 | |
5bff2035 KW |
762 | This section gives some details on all extensions that aren't just |
763 | synonyms for compound-form Unicode properties | |
764 | (for those properties, you'll have to refer to the | |
71c89d21 | 765 | L<Unicode Standard|https://www.unicode.org/reports/tr44>).
9f815e24 KW |
766 | |
767 | =over | |
768 | ||
769 | =item B<C<\p{All}>> | |
770 | ||
2d88a86a KW |
771 | This matches every possible code point. It is equivalent to C<qr/./s>. |
772 | Unlike all the other non-user-defined C<\p{}> property matches, no | |
773 | warning is ever generated if this property is matched against a |
774 | non-Unicode code point (see L</Beyond Unicode code points> below). | |
9f815e24 KW |
775 | |
776 | =item B<C<\p{Alnum}>> | |
777 | ||
778 | This matches any C<\p{Alphabetic}> or C<\p{Decimal_Number}> character. | |
779 | ||
780 | =item B<C<\p{Any}>> | |
781 | ||
2d88a86a KW |
782 | This matches any of the 1_114_112 Unicode code points. It is a synonym |
783 | for C<\p{Unicode}>. | |
9f815e24 | 784 | |
42581d5d KW |
785 | =item B<C<\p{ASCII}>> |
786 | ||
787 | This matches any of the 128 characters in the US-ASCII character set, | |
788 | which is a subset of Unicode. | |
789 | ||
9f815e24 KW |
790 | =item B<C<\p{Assigned}>> |
791 | ||
a9130ea9 KW |
792 | This matches any assigned code point; that is, any code point whose L<general |
793 | category|/General_Category> is not C<Unassigned> (or equivalently, not C<Cn>). | |
9f815e24 KW |
794 | |
795 | =item B<C<\p{Blank}>> | |
796 | ||
797 | This is the same as C<\h> and C<\p{HorizSpace}>: A character that changes the | |
798 | spacing horizontally. | |
799 | ||
800 | =item B<C<\p{Decomposition_Type: Non_Canonical}>> (Short: C<\p{Dt=NonCanon}>) | |
801 | ||
069d116c KW |
802 | Matches a character that has any of the non-canonical decomposition |
803 | types. Canonical decompositions are introduced in the | |
804 | L</Extended Grapheme Clusters (Logical characters)> section above. | |
805 | However, many more characters have a different type of decomposition, | |
806 | generically called "compatible" decompositions, or "non-canonical". The | |
807 | sequences that form these decompositions are not considered canonically | |
808 | equivalent to the pre-composed character. An example is the | |
809 | C<"SUPERSCRIPT ONE">. It is somewhat like a regular digit 1, but not | |
810 | exactly; its decomposition into the digit 1 is called a "compatible" | |
811 | decomposition, specifically a "super" (for "superscript") decomposition. | |
812 | There are several such compatibility decompositions (see | |
813 | L<https://www.unicode.org/reports/tr44>). S<C<\p{Dt: Non_Canon}>> is a | |
814 | Perl extension that uses just one name to refer to the union of all of | |
815 | them. | |
816 | ||
817 | Most Unicode characters don't have a decomposition, so their | |
818 | decomposition type is C<"None">. Hence, C<Non_Canonical> is equivalent | |
819 | to | |
820 | ||
821 | qr/(?[ \P{DT=Canonical} - \p{DT=None} ])/ | |
822 | ||
823 | (Note that one of the non-canonical decompositions is named "compat", | |
824 | which could perhaps have been better named "miscellaneous". It includes | |
825 | just the things that Unicode couldn't figure out a better generic name | |
826 | for.) | |
9f815e24 KW |
827 | |
828 | =item B<C<\p{Graph}>> | |
829 | ||
830 | Matches any character that is graphic. Theoretically, this means a character | |
831 | that on a printer would cause ink to be used. | |
832 | ||
833 | =item B<C<\p{HorizSpace}>> | |
834 | ||
b19eb496 | 835 | This is the same as C<\h> and C<\p{Blank}>: a character that changes the |
9f815e24 KW |
836 | spacing horizontally. |
837 | ||
42581d5d | 838 | =item B<C<\p{In=*}>> |
9f815e24 KW |
839 | |
840 | This is a synonym for C<\p{Present_In=*}>. |
841 | ||
842 | =item B<C<\p{PerlSpace}>> | |
843 | ||
d28d8023 | 844 | This is the same as C<\s>, restricted to ASCII, namely C<S<[ \f\n\r\t]>> |
779cf272 | 845 | and starting in Perl v5.18, a vertical tab. |
9f815e24 KW |
846 | |
847 | Mnemonic: Perl's (original) space. |
848 | ||
849 | =item B<C<\p{PerlWord}>> | |
850 | ||
851 | This is the same as C<\w>, restricted to ASCII, namely C<[A-Za-z0-9_]>. |
852 | ||
853 | Mnemonic: Perl's (original) word. | |
854 | ||
42581d5d | 855 | =item B<C<\p{Posix...}>> |
9f815e24 | 856 | |
b65e6125 KW |
857 | There are several of these, which are equivalents, using the C<\p{}> |
858 | notation, for Posix classes and are described in | |
42581d5d | 859 | L<perlrecharclass/POSIX Character Classes>. |
9f815e24 KW |
860 | |
861 | =item B<C<\p{Present_In: *}>> (Short: C<\p{In=*}>) | |
862 | ||
863 | This property is used when you need to know in what Unicode version(s) a | |
864 | character is present. |
865 | ||
526f2ca9 KW |
866 | The "*" above stands for some Unicode version number, such as |
867 | C<1.1> or C<12.0>; or the "*" can also be C<Unassigned>. This property will | |
9f815e24 KW |
868 | match the code points whose final disposition has been settled as of the |
869 | Unicode release given by the version number; C<\p{Present_In: Unassigned}> | |
870 | will match those code points whose meaning has yet to be assigned. | |
871 | ||
a9130ea9 | 872 | For example, C<U+0041> C<"LATIN CAPITAL LETTER A"> was present in the very first |
9f815e24 KW |
873 | Unicode release available, which is C<1.1>, so this property is true for all |
874 | valid "*" versions. On the other hand, C<U+1EFF> was not assigned until version | |
a9130ea9 | 875 | 5.1 when it became C<"LATIN SMALL LETTER Y WITH LOOP">, so the only "*" that |
9f815e24 KW |
876 | would match it are 5.1, 5.2, and later. |
877 | ||
878 | Unicode furnishes the C<Age> property from which this is derived. The problem | |
879 | with Age is that a strict interpretation of it (which Perl takes) has it | |
880 | matching the precise release a code point's meaning is introduced in. Thus | |
881 | C<U+0041> would match only 1.1; and C<U+1EFF> only 5.1. This is not usually what | |
882 | you want. | |
883 | ||
884 | Some non-Perl implementations of the Age property may change its meaning to be | |
a9130ea9 | 885 | the same as the Perl C<Present_In> property; just be aware of that. |
9f815e24 KW |
886 | |
887 | Another confusion with both these properties is that the definition is not | |
b19eb496 TC |
888 | that the code point has been I<assigned>, but that the meaning of the code point |
889 | has been I<determined>. This is because 66 code points will always be | |
a9130ea9 | 890 | unassigned, and so the C<Age> for them is the Unicode version in which the decision |
b19eb496 | 891 | to make them so was made. For example, C<U+FDD0> is to be permanently |
9f815e24 | 892 | unassigned to a character, and the decision to do that was made in version 3.1, |
b19eb496 | 893 | so C<\p{Age=3.1}> matches this character, as do C<\p{Present_In: 3.1}> and up.
9f815e24 KW |
894 | |
895 | =item B<C<\p{Print}>> | |
896 | ||
ae5b72c8 | 897 | This matches any character that is graphical or blank, except controls. |
9f815e24 KW |
898 | |
899 | =item B<C<\p{SpacePerl}>> | |
900 | ||
901 | This is the same as C<\s>, including beyond ASCII. | |
902 | ||
4d4acfba | 903 | Mnemonic: Space, as modified by Perl. (It doesn't include the vertical tab |
779cf272 | 904 | until v5.18, which both the Posix standard and Unicode consider white space.) |
9f815e24 | 905 | |
4364919a KW |
906 | =item B<C<\p{Title}>> and B<C<\p{Titlecase}>> |
907 | ||
908 | Under case-sensitive matching, these both match the same code points as | |
909 | C<\p{General Category=Titlecase_Letter}> (C<\p{gc=lt}>). The difference | |
910 | is that under C</i> caseless matching, these match the same as | |
911 | C<\p{Cased}>, whereas C<\p{gc=lt}> matches C<\p{Cased_Letter}>. |
912 | ||
2d88a86a KW |
913 | =item B<C<\p{Unicode}>> |
914 | ||
915 | This matches any of the 1_114_112 Unicode code points. | |
916 | It is a synonym for C<\p{Any}>. |
917 | ||
9f815e24 KW |
918 | =item B<C<\p{VertSpace}>> |
919 | ||
920 | This is the same as C<\v>: A character that changes the spacing vertically. | |
921 | ||
922 | =item B<C<\p{Word}>> | |
923 | ||
b19eb496 | 924 | This is the same as C<\w>, including over 100_000 characters beyond ASCII. |
9f815e24 | 925 | |
42581d5d KW |
926 | =item B<C<\p{XPosix...}>> |
927 | ||
b19eb496 | 928 | There are several of these, which are the standard Posix classes |
42581d5d KW |
929 | extended to the full Unicode range. They are described in |
930 | L<perlrecharclass/POSIX Character Classes>. | |
931 | ||
9f815e24 KW |
932 | =back |
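
A few of the extensions listed above can be compared side by side. The following is a sketch under stated assumptions: the sample code points and the C<%extension_checks> name are illustrative, and the C<non_unicode> warning category is disabled only because the sketch deliberately probes a code point above C<U+10FFFF>.

```perl
use strict;
use warnings;
no warnings 'non_unicode';    # we deliberately test an above-Unicode code point

my $above_unicode = "\x{110000}";
my $a_macron      = "\x{0100}";   # LATIN CAPITAL LETTER A WITH MACRON
my $y_loop        = "\x{1EFF}";   # LATIN SMALL LETTER Y WITH LOOP (added in 5.1)
my $super_one     = "\x{B9}";     # SUPERSCRIPT ONE ("super" decomposition)
my $e_acute       = "\x{E9}";     # has a canonical decomposition

my %extension_checks = (
    # \p{All} covers every code point; \p{Any} stops at U+10FFFF.
    all_above_unicode  => ($above_unicode =~ /\p{All}/ ? 1 : 0),
    any_above_unicode  => ($above_unicode =~ /\p{Any}/ ? 1 : 0),

    # \p{Word} is \w over all of Unicode; \p{PerlWord} is ASCII-only.
    word_a_macron      => ($a_macron =~ /\p{Word}/     ? 1 : 0),
    perlword_a_macron  => ($a_macron =~ /\p{PerlWord}/ ? 1 : 0),

    # Age names the introducing release; Present_In is cumulative.
    age_51_letter_a    => ("A"     =~ /\p{Age=5.1}/         ? 1 : 0),
    in_51_letter_a     => ("A"     =~ /\p{Present_In: 5.1}/ ? 1 : 0),
    in_50_y_loop       => ($y_loop =~ /\p{In=5.0}/          ? 1 : 0),
    in_52_y_loop       => ($y_loop =~ /\p{In=5.2}/          ? 1 : 0),

    # Non_Canonical excludes both canonical and absent decompositions.
    noncanon_super_one => ($super_one =~ /\p{Dt=NonCanon}/ ? 1 : 0),
    noncanon_e_acute   => ($e_acute   =~ /\p{Dt=NonCanon}/ ? 1 : 0),
);

printf "%-20s %s\n", $_, $extension_checks{$_} ? "matches" : "no match"
    for sort keys %extension_checks;
```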
933 | ||
673c254b KW |
934 | =head2 Comparison of C<\N{...}> and C<\p{name=...}> |
935 | ||
936 | Starting in Perl 5.32, you can specify a character by its name in | |
937 | regular expression patterns using C<\p{name=...}>. This is in addition | |
938 | to the longstanding method of using C<\N{...}>. The following | |
939 | summarizes the differences between these two: | |
940 | ||
941 | \N{...} \p{Name=...} | |
942 | can interpolate only with eval yes [1] | |
943 | custom names yes no [2] | |
944 | name aliases yes yes [3] | |
cc06e157 | 945 | named sequences yes yes [4] |
673c254b KW |
946 | name value parsing exact Unicode loose [5] |
947 | ||
948 | =over | |
949 | ||
950 | =item [1] | |
951 | ||
952 | The ability to interpolate means you can do something like | |
953 | ||
954 | qr/\p{na=latin capital letter $which}/ | |
955 | ||
956 | and specify C<$which> elsewhere. | |
957 | ||
958 | =item [2] | |
959 | ||
960 | You can create your own names for characters, and override official | |
961 | ones when using C<\N{...}>. See L<charnames/CUSTOM ALIASES>. | |
962 | ||
963 | =item [3] | |
964 | ||
965 | Some characters have multiple names (synonyms). | |
966 | ||
967 | =item [4] | |
968 | ||
969 | Some particular sequences of characters are given a single name, in | |
970 | addition to their individual ones. | |
971 | ||
673c254b KW |
972 | =item [5] |
973 | ||
974 | Exact name value matching means you have to specify case, hyphens, | |
975 | underscores, and spaces precisely in the name you want. Loose matching | |
976 | follows the Unicode rules | |
977 | L<https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2>, | |
978 | where these are mostly irrelevant. Except for a few outlier character | |
979 | names, these are the same rules as are already used for any other | |
980 | C<\p{...}> property. | |
981 | ||
982 | =back | |
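
The interpolation difference in note [1] can be sketched as follows. This assumes Perl 5.32 or later; the variable names (C<$which>, C<$by_property>, C<$by_escape>) are illustrative only.

```perl
use v5.32;
use warnings;

# \p{na=...} is loosely matched and can interpolate a variable.
my $which       = "B";
my $by_property = qr/\A\p{na=latin capital letter $which}\z/;

# \N{...} needs the exact name, fixed when the string is compiled.
my $by_escape = "\N{LATIN CAPITAL LETTER B}";

my $name_matches_b = $by_escape =~ $by_property ? 1 : 0;
my $name_matches_c = "C"        =~ $by_property ? 1 : 0;

say "B: ", $name_matches_b ? "matches" : "no match";
say "C: ", $name_matches_c ? "matches" : "no match";
```

Changing C<$which> before the C<qr//> is evaluated selects a different letter, which C<\N{...}> cannot do without a string C<eval>.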
983 | ||
1532347b KW |
984 | =head2 Wildcards in Property Values |
985 | ||
a3815e44 | 986 | Starting in Perl 5.30, it is possible to do something like this: |
1532347b KW |
987 | |
988 | qr!\p{numeric_value=/\A[0-5]\z/}! | |
989 | ||
990 | or, by abbreviating and adding C</x>, | |
991 | ||
992 | qr! \p{nv= /(?x) \A [0-5] \z / }! | |
993 | ||
994 | This matches all code points whose numeric value is one of 0, 1, 2, 3, | |
995 | 4, or 5. This particular example could instead have been written as | |
996 | ||
997 | qr! \A [ \p{nv=0}\p{nv=1}\p{nv=2}\p{nv=3}\p{nv=4}\p{nv=5} ] \z !xx | |
998 | ||
999 | in earlier perls, so in this case this feature just makes things easier | |
1000 | and shorter to write. If we hadn't included the C<\A> and C<\z>, these | |
1001 | would have matched things like C<1E<sol>2> because that contains a 1 (as | |
1002 | well as a 2). As written, it matches things like subscripts that have | |
1003 | these numeric values. If we only wanted the decimal digits with those | |
1004 | numeric values, we could say, | |
1005 | ||
1006 | qr! (?[ \d & \p{nv=/[0-5]/} ]) !x | |
1007 | ||
1008 | The C<\d> gets rid of needing to anchor the pattern, since it forces the | |
1009 | result to only match C<[0-9]>, and the C<[0-5]> further restricts it. | |
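
The effect of anchoring the subpattern can be tried with a short script. This is a sketch only: it assumes Perl 5.30 or later, relies on the experimental feature described here (so its behavior may change), and the sample characters and the C<%nv_checks> name are illustrative.

```perl
use v5.30;
use warnings;
no warnings 'experimental::uniprop_wildcards';

# The inner \A and \z anchor the subpattern against each numeric value;
# the outer ones anchor the match against the tested string.
my $zero_to_five = qr!\A\p{nv=/\A[0-5]\z/}\z!;

my %nv_checks = (
    ascii_three   => ("3"        =~ $zero_to_five ? 1 : 0),
    ascii_seven   => ("7"        =~ $zero_to_five ? 1 : 0),
    subscript_two => ("\x{2082}" =~ $zero_to_five ? 1 : 0),  # SUBSCRIPT TWO
    one_half      => ("\x{BD}"   =~ $zero_to_five ? 1 : 0),  # VULGAR FRACTION ONE HALF
);

printf "%-14s %s\n", $_, $nv_checks{$_} ? "matches" : "no match"
    for sort keys %nv_checks;
```

The subscript matches because its numeric value is 2, while the fraction is rejected only because the subpattern is anchored.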
1010 | ||
1011 | The text in the above examples enclosed between the C<"E<sol>"> | |
1012 | characters can be just about any regular expression. It is independent | |
1013 | of the main pattern, so doesn't share any capturing groups, I<etc>. The | |
1014 | delimiters for it must be ASCII punctuation, but it may NOT be | |
1015 | delimited by C<"{">, nor C<"}"> nor contain a literal C<"}">, as that | |
1016 | delimits the end of the enclosing C<\p{}>. Like any pattern, certain | |
1017 | other delimiters are terminated by their mirror images. These are | |
1018 | C<"(">, C<"[">, and C<"E<lt>">. If the delimiter is any of C<"-">, |
1019 | C<"_">, C<"+">, or C<"\">, or is the same delimiter as is used for the | |
a3815e44 | 1020 | enclosing pattern, it must be preceded by a backslash escape, both |
1532347b KW |
1021 | fore and aft. |
1022 | ||
1023 | Beware of using C<"$"> to match the end of the string. It |
1024 | can too easily be interpreted as being a punctuation variable, like | |
1025 | C<$/>. | |
1026 | ||
1027 | No modifiers may follow the final delimiter. Instead, use | |
1028 | L<perlre/(?adlupimnsx-imnsx)> and/or | |
1029 | L<perlre/(?adluimnsx-imnsx:pattern)> to specify modifiers. | |
4829f32d KW |
1030 | However, certain modifiers are illegal in your wildcard subpattern. |
1031 | The only character set modifier specifiable is C</aa>; | |
1032 | any other character set, and C<-m>, and C<p>, and C<s> are all illegal. | |
1033 | Specifying modifiers like C<qr/.../gc> that aren't legal in the | |
1034 | C<(?...)> notation normally raises a warning, but with wildcard |
1035 | subpatterns, their use is an error. The C<m> modifier is ineffective; | |
1036 | everything that matches will be a single line. | |
1037 | ||
1038 | By default, your pattern is matched case-insensitively, as if C</i> had | |
1039 | been specified. You can change this by saying C<(?-i)> in your pattern. | |
1040 | ||
1041 | There are also certain operations that are illegal. You can't nest | |
1042 | C<\p{...}> and C<\P{...}> calls within a wildcard subpattern, and C<\G> | |
1043 | doesn't make sense, so is also prohibited. | |
1044 | ||
1045 | And the C<*> quantifier (or its equivalent C<{0,}>) is illegal. |
1532347b KW |
1046 | |
1047 | This feature is not available when the left-hand side is prefixed by | |
1048 | C<Is_>, nor for any form that is marked as "Discouraged" in | |
1049 | L<perluniprops/Discouraged>. | |
1050 | ||
1532347b KW |
1051 | This experimental feature has been added to begin to implement |
1052 | L<https://www.unicode.org/reports/tr18/#Wildcard_Properties>. Using it | |
1053 | will raise a (default-on) warning in the | |
1054 | C<experimental::uniprop_wildcards> category. We reserve the right to | |
1055 | change its operation as we gain experience. | |
1056 | ||
1057 | Your subpattern can be just about anything, but for it to have some | |
1058 | utility, it should match when called with either or both of | |
1059 | a) the full name of the property value with underscores (and/or spaces | |
1060 | in the Block property) and some things uppercase; or b) the property | |
1061 | value in all lowercase with spaces and underscores squeezed out. For | |
1062 | example, | |
1063 | ||
1064 | qr!\p{Blk=/Old I.*/}! | |
1065 | qr!\p{Blk=/oldi.*/}! | |
1066 | ||
1067 | would match the same things. | |
1068 | ||
1532347b KW |
1069 | Another example that shows that within C<\p{...}>, C</x> isn't needed to |
1070 | have spaces: | |
1071 | ||
1072 | qr!\p{scx= /Hebrew|Greek/ }! | |
1073 | ||
1074 | To be safe, we should have anchored the above example, to prevent | |
01d49772 | 1075 | matches for something like C<Hebrew_Braille>, but there aren't |
b1a91f30 KW |
1076 | any script names like that, so far. |
1077 | A warning is issued if none of the legal values for a property are | |
1078 | matched by your pattern. It's likely that a future release will raise a | |
1079 | warning if your pattern ends up causing every possible code point to | |
1080 | match. | |
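
An anchored variant of the script example might look like the sketch below. It assumes Perl 5.30 or later and the experimental wildcard feature; the names C<$hebrew_or_greek> and C<%script_checks> and the sample characters are illustrative assumptions.

```perl
use v5.30;
use warnings;
no warnings 'experimental::uniprop_wildcards';

# Anchoring keeps the subpattern from matching longer script names that
# merely contain "Hebrew" or "Greek".
my $hebrew_or_greek = qr!\p{scx= /\A(?:Hebrew|Greek)\z/ }!;

my %script_checks = (
    hebrew_alef => ("\x{05D0}" =~ $hebrew_or_greek ? 1 : 0),  # HEBREW LETTER ALEF
    greek_alpha => ("\x{03B1}" =~ $hebrew_or_greek ? 1 : 0),  # GREEK SMALL LETTER ALPHA
    latin_a     => ("A"        =~ $hebrew_or_greek ? 1 : 0),
);

printf "%-12s %s\n", $_, $script_checks{$_} ? "matches" : "no match"
    for sort keys %script_checks;
```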
1081 | ||
cc06e157 KW |
1082 | Starting in 5.32, the Name, Name Aliases, and Named Sequences properties |
1083 | are allowed to be matched. They are considered to be a single | |
1084 | combination property, just as has long been the case for C<\N{}>. Loose | |
1085 | matching doesn't work in exactly the same way for these as it does for | |
1086 | the values of other properties. The rules are given in | |
b1a91f30 KW |
1087 | L<https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2>. As a |
1088 | result, Perl doesn't try loose matching for you, like it does in other | |
1089 | properties. All letters in names are uppercase, but you can add C<(?i)> | |
1090 | to your subpattern to ignore case. If you're uncertain where a blank | |
1091 | is, you can use C< ?> in your subpattern. No character name contains an | |
1092 | underscore, so don't bother trying to match one. The use of hyphens is | |
1093 | particularly problematic; refer to the above link. But note that, as of | |
1094 | Unicode 13.0, the only script in modern usage which has weirdnesses with | |
1095 | these is Tibetan; also the two Korean characters U+116C HANGUL JUNGSEONG | |
1096 | OE and U+1180 HANGUL JUNGSEONG O-E. Unicode makes no promises to not | |
1097 | add hyphen-problematic names in the future. | |
1098 | ||
1099 | Using wildcards on these is resource intensive, given the hundreds of | |
1100 | thousands of legal names that must be checked against. | |
1101 | ||
1102 | An example of using Name property wildcards is | |
1103 | ||
1104 | qr!\p{name=/(SMILING|GRINNING) FACE/}! | |
1105 | ||
1106 | Another is | |
1107 | ||
1108 | qr/(?[ \p{name=\/CJK\/} - \p{ideographic} ])/ | |
1109 | ||
1110 | which is the 200-ish (as of Unicode 13.0) CJK characters that aren't | |
1111 | ideographs. | |
1532347b | 1112 | |
b1a91f30 KW |
1113 | There are certain properties that wildcard subpatterns don't currently |
1114 | work with. These are: | |
1532347b KW |
1115 | |
1116 | Bidi Mirroring Glyph | |
1117 | Bidi Paired Bracket | |
1118 | Case Folding | |
1119 | Decomposition Mapping | |
1120 | Equivalent Unified Ideograph | |
1532347b KW |
1121 | Lowercase Mapping |
1122 | NFKC Case Fold | |
1123 | Titlecase Mapping | |
1124 | Uppercase Mapping | |
1125 | ||
1126 | Nor is the C<@I<unicode_property>@> form implemented. | |
1127 | ||
1128 | Here's a complete example of matching IPV4 internet protocol addresses | |
1129 | in any (single) script | |
1130 | ||
1532347b KW |
1131 | no warnings 'experimental::uniprop_wildcards'; |
1132 | ||
1133 | # Can match a substring, so this intermediate regex needs to have | |
1134 | # context or anchoring in its final use. Using nt=de yields decimal | |
1135 | # digits. When specifying a subset of these, we must include \d to | |
1136 | # prevent things like U+00B2 SUPERSCRIPT TWO from matching | |
1137 | my $zero_through_255 = | |
1138 | qr/ \b (*sr: # All from same script | |
1139 | (?[ \p{nv=0} & \d ])* # Optional leading zeros | |
1140 | ( # Then one of: | |
1141 | \d{1,2} # 0 - 99 | |
1142 | | (?[ \p{nv=1} & \d ]) \d{2} # 100 - 199 | |
1143 | | (?[ \p{nv=2} & \d ]) | |
1144 | ( (?[ \p{nv=:[0-4]:} & \d ]) \d # 200 - 249 | |
1145 | | (?[ \p{nv=5} & \d ]) | |
1146 | (?[ \p{nv=:[0-5]:} & \d ]) # 250 - 255 | |
1147 | ) | |
1148 | ) | |
1149 | ) | |
1150 | \b | |
1151 | /x; | |
1152 | ||
1153 | my $ipv4 = qr/ \A (*sr: $zero_through_255 | |
1154 | (?: [.] $zero_through_255 ) {3} | |
1155 | ) | |
1156 | \z | |
1157 | /x; | |
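
The C<(*sr:...)> construct used above is what keeps digits from different scripts from being mixed. A much smaller sketch of just that piece follows; it assumes Perl 5.32 or later (where script runs are no longer experimental), and the sample strings and variable names are illustrative.

```perl
use v5.32;
use warnings;

# A run of digits that must all come from one script's set of ten digits.
my $same_script_digits = qr/\A(*sr:\d+)\z/;

my $ascii_digits = "2024";
my $thai_digits  = "\x{0E52}\x{0E50}\x{0E52}\x{0E54}";   # Thai 2 0 2 4
my $mixed_digits = "20\x{0E52}\x{0E54}";                 # ASCII then Thai

my %run_checks = (
    ascii => ($ascii_digits =~ $same_script_digits ? 1 : 0),
    thai  => ($thai_digits  =~ $same_script_digits ? 1 : 0),
    mixed => ($mixed_digits =~ $same_script_digits ? 1 : 0),
);

say "$_: ", $run_checks{$_} ? "matches" : "no match"
    for sort keys %run_checks;
```

Without the script run, plain C<\d+> would accept the mixed string as well.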
a9130ea9 | 1158 | |
376d9008 | 1159 | =head2 User-Defined Character Properties |
491fd90a | 1160 | |
51f494cc | 1161 | You can define your own binary character properties by defining subroutines |
0f7529f0 | 1162 | whose names begin with C<"In"> or C<"Is">. (The regex sets feature |
9d1a5160 KW |
1163 | L<perlre/(?[ ])> provides an alternative which allows more complex |
1164 | definitions.) The subroutines can be defined in any | |
1c2f3d7a KW |
1165 | package. They override any Unicode properties expressed as the same |
1166 | names. The user-defined properties can be used in the regular | |
1167 | expression | |
a9130ea9 | 1168 | C<\p{}> and C<\P{}> constructs; if you are using a user-defined property from a |
51f494cc | 1169 | package other than the one you are in, you must specify its package in the |
a9130ea9 | 1170 | C<\p{}> or C<\P{}> construct. |
bac0b425 | 1171 | |
cb1faabf | 1172 | # assuming property IsForeign defined in Lang:: |
bac0b425 JP |
1173 | package main; # property package name required |
1174 | if ($txt =~ /\p{Lang::IsForeign}+/) { ... } | |
1175 | ||
1176 | package Lang; # property package name not required | |
1177 | if ($txt =~ /\p{IsForeign}+/) { ... } | |
1178 | ||
1179 | ||
966f9a3b | 1180 | The subroutines are passed a single parameter, which is 0 if |
b19eb496 | 1181 | case-sensitive matching is in effect and non-zero if caseless matching |
56ca34ca | 1182 | is in effect. The subroutine may return different values depending on |
966f9a3b KW |
1183 | the value of the flag. But the subroutine is never called more than |
1184 | once for each flag value (zero vs non-zero). The return value is saved | |
1185 | and used instead of calling the sub ever again. If the sub is defined | |
1186 | at the time the pattern is compiled, it will be called then; if not, it | |
1187 | will be called the first time its value (for that flag) is needed during | |
1188 | execution. | |
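
The flag parameter and the package qualification can be sketched together. Everything named here (C<Lang::IsVowel>, C<%vowel_checks>) is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;

package Lang;

# Hypothetical property: the lowercase ASCII vowels. When the caseless
# flag is non-zero, the uppercase vowels are returned as well.
sub IsVowel {
    my ($caseless) = @_;
    my @lines = ("61", "65", "69", "6F", "75");                # a e i o u
    push @lines, ("41", "45", "49", "4F", "55") if $caseless;  # A E I O U
    return join("\n", @lines) . "\n";
}

package main;

# Outside package Lang, the property name must be package-qualified.
my %vowel_checks = (
    lower_sensitive => ("e" =~ /\A\p{Lang::IsVowel}\z/  ? 1 : 0),
    upper_sensitive => ("E" =~ /\A\p{Lang::IsVowel}\z/  ? 1 : 0),
    upper_caseless  => ("E" =~ /\A\p{Lang::IsVowel}\z/i ? 1 : 0),
    consonant       => ("x" =~ /\A\p{Lang::IsVowel}\z/i ? 1 : 0),
);

printf "%-16s %s\n", $_, $vowel_checks{$_} ? "matches" : "no match"
    for sort keys %vowel_checks;
```

Because the return value is cached per flag value, the subroutine should depend only on that flag and not on any other changing state.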
491fd90a | 1189 | |
b19eb496 | 1190 | Note that if the regular expression is tainted, then Perl will die rather |
a9130ea9 | 1191 | than calling the subroutine when the name of the subroutine is |
0e9be77f DM |
1192 | determined by the tainted data. |
1193 | ||
376d9008 JB |
1194 | The subroutines must return a specially-formatted string, with one |
1195 | or more newline-separated lines. Each line must be one of the following: | |
491fd90a JH |
1196 | |
1197 | =over 4 | |
1198 | ||
1199 | =item * | |
1200 | ||
df9e1087 | 1201 | A single hexadecimal number denoting a code point to include. |
510254c9 A |
1202 | |
1203 | =item * | |
1204 | ||
99a6b1f0 | 1205 | Two hexadecimal numbers separated by horizontal whitespace (space or |
73b95840 KW |
1206 | tab characters) denoting a range of code points to include. The
1207 | second number must not be smaller than the first. | |
491fd90a JH |
1208 | |
1209 | =item * | |
1210 | ||
a9130ea9 KW |
1211 | Something to include, prefixed by C<"+">: a built-in character |
1212 | property (prefixed by C<"utf8::">) or a fully qualified (including package | |
830137a2 | 1213 | name) user-defined character property, |
bac0b425 JP |
1214 | to represent all the characters in that property; two hexadecimal code |
1215 | points for a range; or a single hexadecimal code point. | |
491fd90a JH |
1216 | |
1217 | =item * | |
1218 | ||
a9130ea9 KW |
1219 | Something to exclude, prefixed by C<"-">: an existing character |
1220 | property (prefixed by C<"utf8::">) or a fully qualified (including package | |
830137a2 | 1221 | name) user-defined character property, |
bac0b425 JP |
1222 | to represent all the characters in that property; two hexadecimal code |
1223 | points for a range; or a single hexadecimal code point. | |
491fd90a JH |
1224 | |
1225 | =item * | |
1226 | ||
a9130ea9 KW |
1227 | Something to negate, prefixed by C<"!">: an existing character |
1228 | property (prefixed by C<"utf8::">) or a fully qualified (including package | |
830137a2 | 1229 | name) user-defined character property, |
bac0b425 JP |
1230 | to represent all the characters not in that property; two hexadecimal code |
1231 | points for a range; or a single hexadecimal code point. | |
1232 | ||
1233 | =item * | |
1234 | ||
a9130ea9 KW |
1235 | Something to intersect with, prefixed by C<"&">: an existing character |
1236 | property (prefixed by C<"utf8::">) or a fully qualified (including package | |
830137a2 | 1237 | name) user-defined character property, |
bac0b425 JP |
1238 | to keep only the characters that are also in that property; two |
1239 | hexadecimal code points for a range; or a single hexadecimal code point. | |
491fd90a JH |
1240 | |
1241 | =back | |
1242 | ||
1243 | For example, to define a property that covers both the Japanese | |
1244 | syllabaries (hiragana and katakana), you can define | |
1245 | ||
1246 | sub InKana { | |
d88362ca | 1247 | return <<END; |
d5822f25 A |
1248 | 3040\t309F |
1249 | 30A0\t30FF | |
491fd90a JH |
1250 | END |
1251 | } | |
1252 | ||
d5822f25 A |
1253 | Imagine that the here-doc end marker is at the beginning of the line. |
1254 | Now you can use C<\p{InKana}> and C<\P{InKana}>. | |
491fd90a JH |
1255 | |
1256 | You could also have used the existing block property names: | |
1257 | ||
1258 | sub InKana { | |
d88362ca | 1259 | return <<'END'; |
491fd90a JH |
1260 | +utf8::InHiragana |
1261 | +utf8::InKatakana | |
1262 | END | |
1263 | } | |
1264 | ||
1265 | Suppose you wanted to match only the allocated characters, | |
d5822f25 | 1266 | not the raw block ranges: in other words, you want to remove |
b65e6125 | 1267 | the unassigned characters: |
491fd90a JH |
1268 | |
1269 | sub InKana { | |
d88362ca | 1270 | return <<'END'; |
491fd90a JH |
1271 | +utf8::InHiragana |
1272 | +utf8::InKatakana | |
1273 | -utf8::IsCn | |
1274 | END | |
1275 | } | |
1276 | ||
1277 | The negation is useful for defining (surprise!) negated classes. | |
1278 | ||
1279 | sub InNotKana { | |
d88362ca | 1280 | return <<'END'; |
491fd90a JH |
1281 | !utf8::InHiragana |
1282 | -utf8::InKatakana | |
1283 | +utf8::IsCn | |
1284 | END | |
1285 | } | |
1286 | ||
461020ad KW |
1287 | This will match all non-Unicode code points, since every one of them is |
1288 | not in Kana. You can use intersection to exclude these, if desired, as | |
1289 | this modified example shows: | |
bac0b425 | 1290 | |
461020ad | 1291 | sub InNotKana { |
bac0b425 | 1292 | return <<'END'; |
461020ad KW |
1293 | !utf8::InHiragana |
1294 | -utf8::InKatakana | |
1295 | +utf8::IsCn | |
1296 | &utf8::Any | |
bac0b425 JP |
1297 | END |
1298 | } | |
1299 | ||
461020ad KW |
1300 | C<&utf8::Any> must be the last line in the definition. |
1301 | ||
1302 | Intersection is used generally for getting the common characters matched | |
a9130ea9 | 1303 | by two (or more) classes. It's important to remember not to use C<"&"> for |
461020ad | 1304 | the first set; that would be intersecting with nothing, resulting in an |
5acbde07 | 1305 | empty set. (Similarly, using C<"-"> for the first set does nothing.) |
461020ad | 1306 | |
2d88a86a KW |
1307 | Unlike non-user-defined C<\p{}> property matches, no warning is ever |
1308 | generated if these properties are matched against a non-Unicode code | |
1309 | point (see L</Beyond Unicode code points> below). | |
bac0b425 | 1310 | |
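As a minimal sketch of how the definitions above behave, the following script defines two properties and matches single characters against them. The property names C<InAssignedKana> and C<InNotKanaUnicode> and the two helper subs are made up for this illustration; the definition lines themselves are the ones shown earlier in this section.

```perl
use strict;
use warnings;

# Hypothetical property: both kana blocks, minus unassigned code points.
sub InAssignedKana {
    return <<'END';
+utf8::InHiragana
+utf8::InKatakana
-utf8::IsCn
END
}

# Hypothetical property: everything outside the kana blocks, plus the
# unassigned code points, restricted to Unicode by the final "&" line.
sub InNotKanaUnicode {
    return <<'END';
!utf8::InHiragana
-utf8::InKatakana
+utf8::IsCn
&utf8::Any
END
}

# Helpers (illustrative names) that test exactly one character.
sub is_assigned_kana { return $_[0] =~ /\A\p{InAssignedKana}\z/   ? 1 : 0 }
sub is_not_kana      { return $_[0] =~ /\A\p{InNotKanaUnicode}\z/ ? 1 : 0 }

for my $char ("\x{3042}", "\x{30A2}", "A", chr(0x110000)) {
    printf "U+%04X assigned_kana=%d not_kana=%d\n",
        ord($char), is_assigned_kana($char), is_not_kana($char);
}
```

Because the last line of C<InNotKanaUnicode> intersects with C<utf8::Any>, the above-Unicode code point is rejected by both properties.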
68585b5e | 1311 | =head2 User-Defined Case Mappings (for serious hackers only) |
822502e5 | 1312 | |
5d1892be | 1313 | B<This feature has been removed as of Perl 5.16.> |
a9130ea9 | 1314 | The CPAN module C<L<Unicode::Casing>> provides better functionality without |
5d1892be KW |
1315 | the drawbacks that this feature had. If you are using a Perl earlier |
1316 | than 5.16, this feature was most fully documented in the 5.14 version of | |
1317 | this pod: | |
1318 | L<http://perldoc.perl.org/5.14.0/perlunicode.html#User-Defined-Case-Mappings-%28for-serious-hackers-only%29> | |
3a2263fe | 1319 | |
376d9008 | 1320 | =head2 Character Encodings for Input and Output |
8cbd9a7a | 1321 | |
7221edc9 | 1322 | See L<Encode>. |
8cbd9a7a | 1323 | |
c29a771d | 1324 | =head2 Unicode Regular Expression Support Level |
776f8809 | 1325 | |
b19eb496 | 1326 | The following list of Unicode supported features for regular expressions describes |
fea12a3e KW |
1327 | all features currently directly supported by core Perl. The references |
1328 | to "Level I<N>" and the section numbers refer to | |
71c89d21 | 1329 | L<UTS#18 "Unicode Regular Expressions"|https://www.unicode.org/reports/tr18>, |
526f2ca9 | 1330 | version 18, October 2016. |
fea12a3e KW |
1331 | |
1332 | =head3 Level 1 - Basic Unicode Support | |
1333 | ||
1334 | RL1.1 Hex Notation - Done [1] | |
1335 | RL1.2 Properties - Done [2] | |
1336 | RL1.2a Compatibility Properties - Done [3] | |
0f7529f0 | 1337 | RL1.3 Subtraction and Intersection - Done [4] |
fea12a3e KW |
1338 | RL1.4 Simple Word Boundaries - Done [5] |
1339 | RL1.5 Simple Loose Matches - Done [6] | |
1340 | RL1.6 Line Boundaries - Partial [7] | |
1341 | RL1.7 Supplementary Code Points - Done [8] | |
755789c0 | 1342 | |
6f33e417 KW |
1343 | =over 4 |
1344 | ||
a6a7eedc | 1345 | =item [1] C<\N{U+...}> and C<\x{...}> |
6f33e417 | 1346 | |
fea12a3e KW |
1347 | =item [2] |
1348 | C<\p{...}> C<\P{...}>. This requirement is for a minimal list of | |
58f92e50 | 1349 | properties. Perl supports these. See RL2.7 for other properties. |
6f33e417 | 1350 | |
fea12a3e | 1351 | =item [3] |
7ddf4b55 | 1352 | |
fea12a3e KW |
1353 | Perl has C<\d> C<\D> C<\s> C<\S> C<\w> C<\W> C<\X> C<[:I<prop>:]> |
1354 | C<[:^I<prop>:]>, plus all the properties specified by | |
71c89d21 | 1355 | L<https://www.unicode.org/reports/tr18/#Compatibility_Properties>. These |
fea12a3e | 1356 | are described above in L</Other Properties>. |
6f33e417 | 1357 | |
fea12a3e | 1358 | =item [4] |
6f33e417 | 1359 | |
0f7529f0 KW |
1360 | The regex sets feature C<"(?[...])"> starting in v5.18 accomplishes |
1361 | this. See L<perlre/(?[ ])>. | |
6f33e417 | 1362 | |
fea12a3e KW |
1363 | =item [5] |
1364 | C<\b> C<\B> meet most, but not all, the details of this requirement, but | |
1365 | C<\b{wb}> and C<\B{wb}> do, as well as the stricter RL2.3. |
1366 | ||
1367 | =item [6] | |
6f33e417 | 1368 | |
a6a7eedc | 1369 | Note that Perl does Full case-folding in matching, not Simple: |
6f33e417 | 1370 | |
a6a7eedc KW |
1371 | For example C<U+1F88> is equivalent to C<U+1F00 U+03B9>, instead of just |
1372 | C<U+1F80>. This difference matters mainly for certain Greek capital | |
a9130ea9 KW |
1373 | letters with certain modifiers: the Full case-folding decomposes the |
1374 | letter, while the Simple case-folding would map it to a single | |
1375 | character. | |
6f33e417 | 1376 | |
fea12a3e KW |
1377 | =item [7] |
1378 | ||
1379 | The reason this is considered to be only partially implemented is that | |
1380 | Perl has L<C<qrE<sol>\b{lb}E<sol>>|perlrebackslash/\b{lb}> and | |
1381 | C<L<Unicode::LineBreak>> that are conformant with | |
71c89d21 | 1382 | L<UAX#14 "Unicode Line Breaking Algorithm"|https://www.unicode.org/reports/tr14>. |
fea12a3e KW |
1383 | The regular expression construct provides default behavior, while the |
1384 | heavier-weight module provides customizable line breaking. | |
1385 | ||
1386 | But Perl treats C<\n> as the start- and end-line | |
1387 | delimiter, whereas Unicode specifies more characters that should be | |
1388 | so-interpreted. | |
6f33e417 | 1389 | |
a6a7eedc | 1390 | These are: |
6f33e417 | 1391 | |
a6a7eedc KW |
1392 | VT U+000B (\v in C) |
1393 | FF U+000C (\f) | |
1394 | CR U+000D (\r) | |
1395 | NEL U+0085 | |
1396 | LS U+2028 | |
1397 | PS U+2029 | |
6f33e417 | 1398 | |
a6a7eedc KW |
1399 | C<^> and C<$> in regular expression patterns are supposed to match all |
1400 | these, but don't. | |
1401 | These characters also don't, but should, affect C<< <> >> C<$.>, and | |
1402 | script line numbers. | |
6f33e417 | 1403 | |
a6a7eedc KW |
1404 | Also, lines should not be split within C<CRLF> (i.e. there is no |
1405 | empty line between C<\r> and C<\n>). For C<CRLF>, try the C<:crlf> | |
1406 | layer (see L<PerlIO>). | |
1407 | ||
fea12a3e | 1408 | =item [8] |
a9130ea9 KW |
1409 | UTF-8/UTF-EBCDIC used in Perl allows not only C<U+10000> to |
1410 | C<U+10FFFF> but also code points beyond C<U+10FFFF>. |
6f33e417 KW |
1411 | |
1412 | =back | |
5ca1ac52 | 1413 | |
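Notes [4] and [6] above can be tried out with a short script. This is a sketch that assumes Perl v5.18 or later (the regex sets construct was experimental for many releases, hence the C<no warnings> line); the variable names are made up for the illustration.

```perl
use strict;
use warnings;
no warnings 'experimental::regex_sets';   # category exists from v5.18 on

# [4] Intersection: characters that are Thai or Lao, and also digits.
my $thai_or_lao_digit = qr/\A(?[ ( \p{Thai} + \p{Lao} ) & \p{Digit} ])\z/;

# [4] Subtraction: digits that are not ASCII.
my $non_ascii_digit = qr/\A(?[ \p{Digit} - \p{ASCII} ])\z/;

# [6] Full case folding: U+1F88 folds to the two-character sequence
# U+1F00 U+03B9, so a case-insensitive match against that sequence succeeds.
my $full_fold_matches = "\x{1F88}" =~ /\A\x{1F00}\x{3B9}\z/i ? 1 : 0;

for my $char ("\x{0E51}", "\x{0ED1}", "\x{0E01}", "1") {
    printf "U+%04X thai_or_lao_digit=%d non_ascii_digit=%d\n",
        ord($char),
        ($char =~ $thai_or_lao_digit ? 1 : 0),
        ($char =~ $non_ascii_digit   ? 1 : 0);
}
print "full fold matches: $full_fold_matches\n";
```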
fea12a3e | 1414 | =head3 Level 2 - Extended Unicode Support |
776f8809 | 1415 | |
fea12a3e KW |
1416 | RL2.1 Canonical Equivalents - Retracted [9] |
1417 | by Unicode | |
58f92e50 KW |
1418 | RL2.2 Extended Grapheme Clusters and - Partial [10] |
1419 | Character Classes with Strings | |
fea12a3e KW |
1420 | RL2.3 Default Word Boundaries - Done [11] |
1421 | RL2.4 Default Case Conversion - Done | |
1422 | RL2.5 Name Properties - Done | |
1532347b | 1423 | RL2.6 Wildcards in Property Values - Partial [12] |
58f92e50 KW |
1424 | RL2.7 Full Properties - Partial [13] |
1425 | RL2.8 Optional Properties - Partial [14] | |
776f8809 | 1426 | |
fea12a3e | 1427 | =over 4 |
8158862b | 1428 | |
fea12a3e KW |
1429 | =item [9] |
1430 | Unicode has rewritten this portion of UTS#18 to say that getting | |
1431 | canonical equivalence (see UAX#15 | |
71c89d21 | 1432 | L<"Unicode Normalization Forms"|https://www.unicode.org/reports/tr15>) |
fea12a3e KW |
1433 | is basically to be done at the programmer level. Use NFD to write |
1434 | both your regular expressions and text to match them against (you | |
1435 | can use L<Unicode::Normalize>). | |
776f8809 | 1436 | |
fea12a3e | 1437 | =item [10] |
58f92e50 KW |
1438 | Perl has C<\X> and C<\b{gcb}>. Unicode has retracted their "Grapheme |
1439 | Cluster Mode", and recently added string properties, which Perl does not | |
1440 | yet support. | |
fea12a3e KW |
1441 | |
1442 | =item [11] see | |
71c89d21 | 1443 | L<UAX#29 "Unicode Text Segmentation"|https://www.unicode.org/reports/tr29>. |
fea12a3e | 1444 | |
1532347b KW |
1445 | =item [12] see |
1446 | L</Wildcards in Property Values> above. | |
1447 | ||
526f2ca9 | 1448 | =item [13] |
58f92e50 KW |
1449 | Perl supports all the properties in the Unicode Character Database |
1450 | (UCD). It does not yet support the listed properties that come from | |
1451 | other Unicode sources. | |
776f8809 | 1452 | |
526f2ca9 | 1453 | =item [14] |
58f92e50 KW |
1454 | The only optional property that Perl supports is Named Sequence. None |
1455 | of these properties are in the UCD. | |
776f8809 JH |
1456 | |
1457 | =back | |
1458 | ||
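The advice in note [9] (normalize both the pattern and the text to NFD) and the C<\X> construct from note [10] can be sketched as follows. The helper name C<nfd_equal> is hypothetical; C<NFD> comes from the core module L<Unicode::Normalize>.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

my $precomposed = "\x{E9}";      # LATIN SMALL LETTER E WITH ACUTE
my $decomposed  = "e\x{301}";    # "e" followed by COMBINING ACUTE ACCENT

# Hypothetical helper: compare two strings for canonical equivalence by
# normalizing both sides to NFD before matching literally.
sub nfd_equal {
    my ($text, $wanted) = @_;
    my $pattern = NFD($wanted);
    return NFD($text) =~ /\A\Q$pattern\E\z/ ? 1 : 0;
}

# Without normalization the two spellings are different strings.
my $raw_equal = $precomposed eq $decomposed ? 1 : 0;

# \X matches one extended grapheme cluster, so the two-code-point
# spelling is still a single \X.
my $one_cluster = $decomposed =~ /\A\X\z/ ? 1 : 0;

print "raw=$raw_equal nfd=", nfd_equal($precomposed, $decomposed),
      " one_cluster=$one_cluster\n";
```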
58f92e50 KW |
1459 | =head3 Level 3 - Tailored Support |
1460 | ||
1461 | This has been retracted by Unicode. | |
1462 | ||
c349b1b9 JH |
1463 | =head2 Unicode Encodings |
1464 | ||
376d9008 JB |
1465 | Unicode characters are assigned to I<code points>, which are abstract |
1466 | numbers. To use these numbers, various encodings are needed. | |
c349b1b9 JH |
1467 | |
1468 | =over 4 | |
1469 | ||
c29a771d | 1470 | =item * |
5cb3728c RB |
1471 | |
1472 | UTF-8 | |
c349b1b9 | 1473 | |
6d4f9cf2 | 1474 | UTF-8 is a variable-length (1 to 4 bytes), byte-order independent |
a6a7eedc KW |
1475 | encoding. In most of Perl's documentation, including elsewhere in this |
1476 | document, the term "UTF-8" means also "UTF-EBCDIC". But in this section, | |
1477 | "UTF-8" refers only to the encoding used on ASCII platforms. It is a | |
1478 | superset of 7-bit US-ASCII, so anything encoded in ASCII has the | |
1479 | identical representation when encoded in UTF-8. | |
c349b1b9 | 1480 | |
8c007b5a | 1481 | The following table is from Unicode 3.2. |
05632f9a | 1482 | |
755789c0 | 1483 | Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte |
05632f9a | 1484 | |
d88362ca | 1485 | U+0000..U+007F 00..7F |
e1b711da | 1486 | U+0080..U+07FF * C2..DF 80..BF |
d88362ca | 1487 | U+0800..U+0FFF E0 * A0..BF 80..BF |
ec90690f TS |
1488 | U+1000..U+CFFF E1..EC 80..BF 80..BF |
1489 | U+D000..U+D7FF ED 80..9F 80..BF | |
755789c0 | 1490 | U+D800..U+DFFF +++++ utf16 surrogates, not legal utf8 +++++ |
ec90690f | 1491 | U+E000..U+FFFF EE..EF 80..BF 80..BF |
d88362ca KW |
1492 | U+10000..U+3FFFF F0 * 90..BF 80..BF 80..BF |
1493 | U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF | |
1494 | U+100000..U+10FFFF F4 80..8F 80..BF 80..BF | |
e1b711da | 1495 | |
b19eb496 | 1496 | Note the gaps marked by "*" before several of the byte entries above. These are |
e1b711da KW |
1497 | caused by legal UTF-8 avoiding non-shortest encodings: it is technically |
1498 | possible to UTF-8-encode a single code point in different ways, but that is | |
1499 | explicitly forbidden, and the shortest possible encoding should always be used | |
1500 | (and that is what Perl does). | |
37361303 | 1501 | |
376d9008 | 1502 | Another way to look at it is via bits: |
05632f9a | 1503 | |
755789c0 | 1504 | Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte |
05632f9a | 1505 | |
755789c0 KW |
1506 | 0aaaaaaa 0aaaaaaa |
1507 | 00000bbbbbaaaaaa 110bbbbb 10aaaaaa | |
1508 | ccccbbbbbbaaaaaa 1110cccc 10bbbbbb 10aaaaaa | |
1509 | 00000dddccccccbbbbbbaaaaaa 11110ddd 10cccccc 10bbbbbb 10aaaaaa | |
05632f9a | 1510 | |
a9130ea9 | 1511 | As you can see, the continuation bytes all begin with C<"10">, and the |
e1b711da | 1512 | leading bits of the start byte tell how many bytes there are in the |
05632f9a JH |
1513 | encoded character. |
1514 | ||
6d4f9cf2 | 1515 | The original UTF-8 specification allowed up to 6 bytes, to allow |
a9130ea9 | 1516 | encoding of numbers up to C<0x7FFF_FFFF>. Perl continues to allow those, |
6d4f9cf2 KW |
1517 | and has extended that up to 13 bytes to encode code points up to what |
1518 | can fit in a 64-bit word. However, Perl will warn if you output any of | |
b19eb496 | 1519 | these as being non-portable; and under strict UTF-8 input protocols, |
526f2ca9 | 1520 | they are forbidden. In addition, it is now illegal to use a code point |
760c7c2f KW |
1521 | larger than what a signed integer variable on your system can hold. On |
1522 | 32-bit ASCII systems, this means C<0x7FFF_FFFF> is the legal maximum | |
526f2ca9 | 1523 | (much higher on 64-bit systems). |
6d4f9cf2 | 1524 | |
c29a771d | 1525 | =item * |
5cb3728c RB |
1526 | |
1527 | UTF-EBCDIC | |
dbe420b4 | 1528 | |
b65e6125 | 1529 | Like UTF-8, but EBCDIC-safe, in the way that UTF-8 is ASCII-safe. |
a6a7eedc KW |
1530 | This means that all the basic characters (which include all those |
1531 | that have ASCII equivalents, like C<"A">, C<"0">, C<"%">, I<etc.>) are |
1532 | the same in both EBCDIC and UTF-EBCDIC. |
1533 | ||
c0236afe KW |
1534 | UTF-EBCDIC is used on EBCDIC platforms. It generally requires more |
1535 | bytes to represent a given code point than UTF-8 does; the largest | |
1536 | Unicode code points take 5 bytes to represent (instead of 4 in UTF-8), | |
1537 | and, extended for 64-bit words, it uses 14 bytes instead of 13 bytes in | |
1538 | UTF-8. | |
dbe420b4 | 1539 | |
c29a771d | 1540 | =item * |
5cb3728c | 1541 | |
b65e6125 | 1542 | UTF-16, UTF-16BE, UTF-16LE, Surrogates, and C<BOM>'s (Byte Order Marks) |
c349b1b9 | 1543 | |
1bfb14c4 JH |
1544 | The following items are mostly for reference and general Unicode |
1545 | knowledge; Perl doesn't use these constructs internally. |
dbe420b4 | 1546 | |
b19eb496 TC |
1547 | Like UTF-8, UTF-16 is a variable-width encoding, but where |
1548 | UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units. | |
1549 | All code points occupy either 2 or 4 bytes in UTF-16: code points | |
1550 | C<U+0000..U+FFFF> are stored in a single 16-bit unit, and code | |
1bfb14c4 | 1551 | points C<U+10000..U+10FFFF> in two 16-bit units. The latter case is |
c349b1b9 JH |
1552 | using I<surrogates>, the first 16-bit unit being the I<high |
1553 | surrogate>, and the second being the I<low surrogate>. | |
1554 | ||
376d9008 | 1555 | Surrogates are code points set aside to encode the C<U+10000..U+10FFFF> |
c349b1b9 | 1556 | range of Unicode code points in pairs of 16-bit units. The I<high |
9f815e24 | 1557 | surrogates> are the range C<U+D800..U+DBFF> and the I<low surrogates> |
376d9008 | 1558 | are the range C<U+DC00..U+DFFF>. The surrogate encoding is |
c349b1b9 | 1559 | |
d88362ca KW |
1560 | $hi = int(($uni - 0x10000) / 0x400) + 0xD800; |
1561 | $lo = ($uni - 0x10000) % 0x400 + 0xDC00; | |
c349b1b9 JH |
1562 | |
1563 | and the decoding is | |
1564 | ||
d88362ca | 1565 | $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00); |
c349b1b9 | 1566 | |
376d9008 | 1567 | Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16 |
c349b1b9 | 1568 | itself can be used for in-memory computations, but if storage or |
376d9008 JB |
1569 | transfer is required either UTF-16BE (big-endian) or UTF-16LE |
1570 | (little-endian) encodings must be chosen. | |
c349b1b9 JH |
1571 | |
1572 | This introduces another problem: what if you just know that your data | |
376d9008 | 1573 | is UTF-16, but you don't know which endianness? Byte Order Marks, or |
b65e6125 | 1574 | C<BOM>'s, are a solution to this. A special character has been reserved |
86bbd6d1 | 1575 | in Unicode to function as a byte order marker: the character with the |
a9130ea9 | 1576 | code point C<U+FEFF> is the C<BOM>. |
042da322 | 1577 | |
a9130ea9 | 1578 | The trick is that if you read a C<BOM>, you will know the byte order, |
376d9008 JB |
1579 | since if it was written on a big-endian platform, you will read the |
1580 | bytes C<0xFE 0xFF>, but if it was written on a little-endian platform, | |
1581 | you will read the bytes C<0xFF 0xFE>. (And if the originating platform | |
b65e6125 KW |
1582 | was writing in ASCII platform UTF-8, you will read the bytes |
1583 | C<0xEF 0xBB 0xBF>.) | |
042da322 | 1584 | |
86bbd6d1 | 1585 | The way this trick works is that the character with the code point |
6d4f9cf2 | 1586 | C<U+FFFE> is not supposed to be in input streams, so the |
a9130ea9 | 1587 | sequence of bytes C<0xFF 0xFE> is unambiguously "C<BOM>, represented in |
1bfb14c4 | 1588 | little-endian format" and cannot be "C<U+FFFE>, represented in big-endian |
6d4f9cf2 KW |
1589 | format". |
1590 | ||
1591 | Surrogates have no meaning in Unicode outside their use in pairs to | |
1592 | represent other code points. However, Perl allows them to be | |
1593 | represented individually internally, for example by saying | |
f651977e TC |
1594 | C<chr(0xD801)>, so that all code points, not just those valid for open |
1595 | interchange, are | |
6d4f9cf2 | 1596 | representable. Unicode does define semantics for them, such as their |
a9130ea9 KW |
1597 | C<L</General_Category>> is C<"Cs">. But because their use is somewhat dangerous, |
1598 | Perl will warn (using the warning category C<"surrogate">, which is a | |
1599 | sub-category of C<"utf8">) if an attempt is made | |
6d4f9cf2 KW |
1600 | to do things like take the lower case of one, or match |
1601 | case-insensitively, or to output them. (But don't try this on Perls | |
1602 | before 5.14.) | |
c349b1b9 | 1603 | |
c29a771d | 1604 | =item * |
5cb3728c | 1605 | |
1e54db1a | 1606 | UTF-32, UTF-32BE, UTF-32LE |
c349b1b9 | 1607 | |
b65e6125 | 1608 | The UTF-32 family is pretty much like the UTF-16 family, except that |
042da322 | 1609 | the units are 32-bit, and therefore the surrogate scheme is not |
a9130ea9 | 1610 | needed. UTF-32 is a fixed-width encoding. The C<BOM> signatures are |
b19eb496 | 1611 | C<0x00 0x00 0xFE 0xFF> for BE and C<0xFF 0xFE 0x00 0x00> for LE. |
c349b1b9 | 1612 | |
c29a771d | 1613 | =item * |
5cb3728c RB |
1614 | |
1615 | UCS-2, UCS-4 | |
c349b1b9 | 1616 | |
b19eb496 | 1617 | Legacy, fixed-width encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit |
376d9008 | 1618 | encoding. Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>, |
339cfa0e | 1619 | because it does not use surrogates. UCS-4 is a 32-bit encoding, |
b19eb496 | 1620 | functionally identical to UTF-32 (the difference being that |
a9130ea9 | 1621 | UCS-4 forbids neither surrogates nor code points larger than C<0x10_FFFF>). |
c349b1b9 | 1622 | |
c29a771d | 1623 | =item * |
5cb3728c RB |
1624 | |
1625 | UTF-7 | |
c349b1b9 | 1626 | |
376d9008 JB |
1627 | A seven-bit safe (non-eight-bit) encoding, which is useful if the |
1628 | transport or storage is not eight-bit safe. Defined by RFC 2152. | |
c349b1b9 | 1629 | |
95a1a48b JH |
1630 | =back |
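
The bit layout tabulated under UTF-8 above can be checked against Perl's own encoder. The following is a sketch for ASCII platforms only; C<encode_utf8_by_hand> is a made-up name, and it deliberately handles only Unicode scalar values (no surrogates, nothing above C<U+10FFFF>), unlike Perl's extended internal encoding.

```perl
use strict;
use warnings;

# Hypothetical helper: encode one code point following the bit table.
sub encode_utf8_by_hand {
    my ($cp) = @_;
    die "not a Unicode scalar value\n"
        if $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF);
    return pack("C", $cp) if $cp < 0x80;
    return pack("C2", 0xC0 | ($cp >> 6), 0x80 | ($cp & 0x3F)) if $cp < 0x800;
    return pack("C3", 0xE0 | ($cp >> 12),
                      0x80 | (($cp >> 6) & 0x3F),
                      0x80 | ($cp & 0x3F)) if $cp < 0x10000;
    return pack("C4", 0xF0 | ($cp >> 18),
                      0x80 | (($cp >> 12) & 0x3F),
                      0x80 | (($cp >> 6) & 0x3F),
                      0x80 | ($cp & 0x3F));
}

# Reference result from the built-in utf8::encode().
sub encode_utf8_builtin {
    my ($cp) = @_;
    my $string = chr($cp);
    utf8::encode($string);
    return $string;
}

for my $cp (0x41, 0xE9, 0x20AC, 0x1F600) {
    printf "U+%04X => %s\n", $cp,
        join(" ", map { sprintf "%02X", $_ } unpack("C*", encode_utf8_by_hand($cp)));
}
```

Because the shortest form is chosen for each range, the sketch never produces the non-shortest encodings that the table's "*" entries rule out.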
1631 | ||
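The surrogate arithmetic and the C<BOM> signatures described in the UTF-16 and UTF-32 items above can be sketched as follows. All sub names are hypothetical. The signature check tries the 4-byte UTF-32 marks first, because the UTF-32LE mark begins with the same two bytes as the UTF-16LE mark; a UTF-16LE stream that starts with a C<BOM> followed by C<U+0000> is therefore indistinguishable from UTF-32LE by this test alone.

```perl
use strict;
use warnings;

# Split a supplementary code point into (high, low) surrogates.
sub to_surrogates {
    my ($uni) = @_;
    die "not in U+10000..U+10FFFF\n" if $uni < 0x10000 || $uni > 0x10FFFF;
    my $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
    my $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
    return ($hi, $lo);
}

# Recombine a surrogate pair into the code point it stands for.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
}

# Guess an encoding from a leading byte order mark; undef if none is found.
sub sniff_bom {
    my ($bytes) = @_;
    my @signatures = (
        [ 'UTF-32BE', "\x00\x00\xFE\xFF" ],
        [ 'UTF-32LE', "\xFF\xFE\x00\x00" ],
        [ 'UTF-8',    "\xEF\xBB\xBF"     ],
        [ 'UTF-16BE', "\xFE\xFF"         ],
        [ 'UTF-16LE', "\xFF\xFE"         ],
    );
    for my $signature (@signatures) {
        my ($name, $bom) = @$signature;
        return $name if substr($bytes, 0, length $bom) eq $bom;
    }
    return undef;
}

printf "U+1F600 => %04X %04X\n", to_surrogates(0x1F600);
```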
57e88091 | 1632 | =head2 Noncharacter code points |
6d4f9cf2 | 1633 | |
57e88091 | 1634 | 66 code points are set aside in Unicode as "noncharacter code points". |
a9130ea9 | 1635 | These all have the C<Unassigned> (C<Cn>) C<L</General_Category>>, and |
57e88091 KW |
1636 | no character will ever be assigned to any of them. They are the 32 code |
1637 | points between C<U+FDD0> and C<U+FDEF> inclusive, and the 34 code | |
1638 | points: | |
1639 | ||
1640 | U+FFFE U+FFFF | |
1641 | U+1FFFE U+1FFFF | |
1642 | U+2FFFE U+2FFFF | |
1643 | ... | |
1644 | U+EFFFE U+EFFFF | |
1645 | U+FFFFE U+FFFFF | |
1646 | U+10FFFE U+10FFFF | |
1647 | ||
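The list above follows a simple pattern: the block C<U+FDD0..U+FDEF>, plus the last two code points of each of the 17 planes. A minimal sketch of a test for that pattern (C<is_noncharacter> is a made-up name) also confirms the count of 66:

```perl
use strict;
use warnings;

# Hypothetical helper: true for the 66 noncharacter code points.
sub is_noncharacter {
    my ($cp) = @_;
    return 0 if $cp > 0x10FFFF;                       # beyond Unicode
    return 1 if $cp >= 0xFDD0 && $cp <= 0xFDEF;       # the 32 in the BMP block
    return (($cp & 0xFFFE) == 0xFFFE) ? 1 : 0;        # xxFFFE and xxFFFF
}

my $count = 0;
for my $cp (0 .. 0x10FFFF) {
    $count++ if is_noncharacter($cp);
}
print "noncharacters: $count\n";

# Cross-check one value against the Unicode property Perl already provides.
my $property_agrees = chr(0xFFFE) =~ /\p{Noncharacter_Code_Point}/ ? 1 : 0;
print "property agrees for U+FFFE: $property_agrees\n";
```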
1648 | Until Unicode 7.0, the noncharacters were "B<forbidden> for use in open | |
1649 | interchange of Unicode text data", so that code that processed those | |
1650 | streams could use these code points as sentinels that could be mixed in | |
1651 | with character data, and would always be distinguishable from that data. | |
1652 | (Emphasis above and in the next paragraph are added in this document.) | |
1653 | ||
1654 | Unicode 7.0 changed the wording so that they are "B<not recommended> for | |
1655 | use in open interchange of Unicode text data". The 7.0 Standard goes on | |
1656 | to say: | |
1657 | ||
1658 | =over 4 | |
1659 | ||
1660 | "If a noncharacter is received in open interchange, an application is | |
1661 | not required to interpret it in any way. It is good practice, however, | |
1662 | to recognize it as a noncharacter and to take appropriate action, such | |
1663 | as replacing it with C<U+FFFD> replacement character, to indicate the | |
1664 | problem in the text. It is not recommended to simply delete | |
1665 | noncharacter code points from such text, because of the potential | |
1666 | security issues caused by deleting uninterpreted characters. (See | |
1667 | conformance clause C7 in Section 3.2, Conformance Requirements, and | |
1668 | L<Unicode Technical Report #36, "Unicode Security | |
71c89d21 | 1669 | Considerations"|https://www.unicode.org/reports/tr36/#Substituting_for_Ill_Formed_Subsequences>)." |
57e88091 KW |
1670 | |
1671 | =back | |
1672 | ||
1673 | This change was made because it was found that various commercial tools | |
1674 | like editors, or for things like source code control, had been written | |
1675 | so that they would not handle program files that used these code points, | |
1676 | effectively precluding their use almost entirely! And that was never | |
1677 | the intent. They've always been meant to be usable within an | |
1678 | application, or cooperating set of applications, at will. | |
1679 | ||
1680 | If you're writing code, such as an editor, that is supposed to be able | |
1681 | to handle any Unicode text data, then you shouldn't be using these code | |
1682 | points yourself, and instead allow them in the input. If you need | |
1683 | sentinels, they should instead be something that isn't legal Unicode. | |
8f7bec53 | 1684 | For UTF-8 data, you can use the bytes 0xC0 and 0xC1 as sentinels, as |
57e88091 KW |
1685 | they never appear in well-formed UTF-8. (There are equivalents for |
1686 | UTF-EBCDIC). You can also store your Unicode code points in integer | |
1687 | variables and use negative values as sentinels. | |
1688 | ||
1689 | If you're not writing such a tool, then whether you accept noncharacters | |
1690 | as input is up to you (though the Standard recommends that you not). If | |
1691 | you do strict input stream checking with Perl, these code points | |
1692 | continue to be forbidden. This is to maintain backward compatibility | |
1693 | (otherwise potential security holes could open up, as an unsuspecting | |
1694 | application that was written assuming the noncharacters would be | |
1695 | filtered out before getting to it, could now, without warning, start | |
1696 | getting them). To do strict checking, you can use the layer | |
1697 | C<:encoding('UTF-8')>. | |
1698 | ||
1699 | Perl continues to warn (using the warning category C<"nonchar">, which | |
1700 | is a sub-category of C<"utf8">) if an attempt is made to output | |
1701 | noncharacters. | |
42581d5d KW |
1702 | |
1703 | =head2 Beyond Unicode code points | |
1704 | ||
a9130ea9 KW |
1705 | The maximum Unicode code point is C<U+10FFFF>, and Unicode only defines |
1706 | operations on code points up through that. But Perl works on code | |
526f2ca9 | 1707 | points up to the maximum permissible signed number available on the |
42581d5d KW |
1708 | platform. However, Perl will not accept these from input streams unless |
1709 | lax rules are being used, and will warn (using the warning category | |
2d88a86a KW |
1710 | C<"non_unicode">, which is a sub-category of C<"utf8">) if any are output. |
1711 | ||
1712 | Since Unicode rules are not defined on these code points, if a | |
1713 | Unicode-defined operation is done on them, Perl uses what we believe are | |
1714 | sensible rules, while generally warning, using the C<"non_unicode"> | |
1715 | category. For example, C<uc("\x{11_0000}")> will generate such a | |
1716 | warning, returning the input parameter as its result, since Perl defines | |
1717 | the uppercase of every non-Unicode code point to be the code point | |
b65e6125 KW |
1718 | itself. (All the case changing operations, not just uppercasing, work |
1719 | this way.) | |
2d88a86a KW |
1720 | |
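The case-changing behavior just described can be observed with a short script. This is a sketch; it captures any warnings in an array instead of letting them print, and the variable names are made up for the illustration.

```perl
use strict;
use warnings;

my $above_unicode = chr(0x110000);   # first code point beyond Unicode
my @captured;
my ($upper, $lower);

{
    # Collect warnings locally so they can be inspected afterwards.
    local $SIG{__WARN__} = sub { push @captured, $_[0] };
    $upper = uc $above_unicode;
    $lower = lc $above_unicode;
}

print "uc returned its argument\n" if $upper eq $above_unicode;
print "lc returned its argument\n" if $lower eq $above_unicode;
printf "%d warning(s) captured\n", scalar @captured;
```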
1721 | The situation with matching Unicode properties in regular expressions, | |
1722 | the C<\p{}> and C<\P{}> constructs, against these code points is not as | |
1723 | clear cut, and how these are handled has changed as we've gained | |
1724 | experience. | |
1725 | ||
1726 | One possibility is to treat any match against these code points as | |
1727 | undefined. But since Perl doesn't have the concept of a match being | |
1728 | undefined, it converts this to failing or C<FALSE>. This is almost, but | |
1729 | not quite, what Perl did from v5.14 (when use of these code points | |
1730 | became generally reliable) through v5.18. The difference is that Perl | |
1731 | treated all C<\p{}> matches as failing, but all C<\P{}> matches as | |
1732 | succeeding. | |
1733 | ||
f66ccb6c | 1734 | One problem with this is that it leads to unexpected and confusing |
2d88a86a KW |
1735 | results in some cases: |
1736 | ||
1737 | chr(0x110000) =~ /\p{ASCII_Hex_Digit=True}/ # Failed on <= v5.18 |
1738 | chr(0x110000) =~ /\p{ASCII_Hex_Digit=False}/ # Failed! on <= v5.18 |
1739 | ||
1740 | That is, it treated both matches as undefined, and converted that to | |
1741 | false (raising a warning on each). The first case is the expected | |
1742 | result, but the second is likely counterintuitive: "How could both be | |
1743 | false when they are complements?" Another problem was that the | |
1744 | implementation optimized many Unicode property matches down to already | |
1745 | existing simpler, faster operations, which don't raise the warning. We | |
1746 | chose to not forgo those optimizations, which help the vast majority of | |
1747 | matches, just to generate a warning for the unlikely event that an | |
1748 | above-Unicode code point is being matched against. | |
1749 | ||
1750 | As a result of these problems, starting in v5.20, what Perl does is | |
1751 | to treat non-Unicode code points as just typical unassigned Unicode | |
1752 | characters, and matches accordingly. (Note: Unicode has atypical | |
57e88091 | 1753 | unassigned code points. For example, it has noncharacter code points, |
2d88a86a KW |
1754 | and ones that, when they do get assigned, are destined to be written |
1755 | Right-to-left, as Arabic and Hebrew are. Perl assumes that no | |
1756 | non-Unicode code point has any atypical properties.) | |
1757 | ||
1758 | Perl, in most cases, will raise a warning when matching an above-Unicode | |
1759 | code point against a Unicode property when the result is C<TRUE> for | |
1760 | C<\p{}>, and C<FALSE> for C<\P{}>. For example: | |
1761 | ||
1762 | chr(0x110000) =~ /\p{ASCII_Hex_Digit=True}/ # Fails, no warning |
1763 | chr(0x110000) =~ /\p{ASCII_Hex_Digit=False}/ # Succeeds, with warning |
1764 | ||
1765 | In both these examples, the character being matched is non-Unicode, so | |
1766 | Unicode doesn't define how it should match. It clearly isn't an ASCII | |
1767 | hex digit, so the first example clearly should fail, and so it does, | |
1768 | with no warning. But it is arguable that the second example should have | |
1769 | an undefined, hence C<FALSE>, result. So a warning is raised for it. | |
1770 | ||
1771 | Thus the warning is raised for many fewer cases than in earlier Perls, | |
1772 | and only when what the result is could be arguable. It turns out that | |
1773 | none of the optimizations made by Perl (or are ever likely to be made) | |
1774 | cause the warning to be skipped, so it solves both problems of Perl's | |
1775 | earlier approach. The most commonly used property that is affected by | |
1776 | this change is C<\p{Unassigned}> which is a short form for | |
1777 | C<\p{General_Category=Unassigned}>. Starting in v5.20, all non-Unicode | |
1778 | code points are considered C<Unassigned>. In earlier releases the | |
1779 | matches failed because the result was considered undefined. | |
1780 | ||
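The v5.20 matching rules described above can be checked with a sketch like the following, which assumes Perl v5.20 or later. Warnings are captured into an array (a made-up name, as are the hash keys) so that the match results can be examined separately from them.

```perl
use strict;
use warnings;

my $above_unicode = chr(0x110000);
my @captured;
my %matched;

{
    local $SIG{__WARN__} = sub { push @captured, $_[0] };
    # Clearly not an ASCII hex digit: fails.
    $matched{hex_true}   = $above_unicode =~ /\p{ASCII_Hex_Digit=True}/  ? 1 : 0;
    # Arguable result: succeeds.
    $matched{hex_false}  = $above_unicode =~ /\p{ASCII_Hex_Digit=False}/ ? 1 : 0;
    # Non-Unicode code points are treated as unassigned.
    $matched{unassigned} = $above_unicode =~ /\p{Unassigned}/            ? 1 : 0;
}

print "$_ => $matched{$_}\n" for sort keys %matched;
print "warning: $_" for @captured;
```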
1781 | The only place where the warning is not raised when it arguably should |
1782 | have been is if optimizations cause the whole pattern match to not even |
1783 | be attempted. For example, Perl may figure out that for a string to | |
1784 | match a certain regular expression pattern, the string has to contain | |
1785 | the substring C<"foobar">. Before attempting the match, Perl may look | |
1786 | for that substring, and if not found, immediately fail the match without | |
1787 | actually trying it; so no warning gets generated even if the string | |
1788 | contains an above-Unicode code point. | |
1789 | ||
1790 | This behavior is more "Do what I mean" than in earlier Perls for most | |
1791 | applications. But it catches fewer issues for code that needs to be | |
1792 | strictly Unicode compliant. Therefore there is an additional mode of | |
1793 | operation available to accommodate such code. This mode is enabled if a | |
1794 | regular expression pattern is compiled within the lexical scope where | |
1795 | the C<"non_unicode"> warning class has been made fatal, say by: | |
1796 | ||
1797 | use warnings FATAL => "non_unicode"; |
1798 | ||
44ecbbd8 | 1799 | (see L<warnings>). In this mode of operation, Perl will raise the |
2d88a86a KW |
1800 | warning for all matches against a non-Unicode code point (not just the |
1801 | arguable ones), and it skips the optimizations that might cause the | |
1802 | warning to not be output. (It currently still won't warn if the match | |
1803 | isn't even attempted, like in the C<"foobar"> example above.) | |
1804 | ||
1805 | In summary, Perl now normally treats non-Unicode code points as typical | |
1806 | Unicode unassigned code points for regular expression matches, raising a | |
1807 | warning only when it is arguable what the result should be. However, if | |
1808 | this warning has been made fatal, it isn't skipped. | |
1809 | ||
1810 | There is one exception to all this. C<\p{All}> looks like a Unicode | |
1811 | property, but it is a Perl extension that is defined to be true for all | |
1812 | possible code points, Unicode or not, so no warning is ever generated | |
1813 | when matching this against a non-Unicode code point. (Prior to v5.20, | |
1814 | it was an exact synonym for C<\p{Any}>, matching code points C<0> | |
1815 | through C<0x10FFFF>.) | |
6d4f9cf2 | 1816 | |
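The difference between C<\p{All}> and C<\p{Any}> can be seen with a small sketch, again assuming Perl v5.20 or later; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $above_unicode = chr(0x110000);
my @captured;
my ($all_matches, $any_matches);

{
    local $SIG{__WARN__} = sub { push @captured, $_[0] };
    # \p{All} is a Perl extension covering every possible code point.
    $all_matches = $above_unicode =~ /\p{All}/ ? 1 : 0;
    # \p{Any} stops at U+10FFFF.
    $any_matches = $above_unicode =~ /\p{Any}/ ? 1 : 0;
}

print "All=$all_matches Any=$any_matches\n";
```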
0d7c09bb JH |
1817 | =head2 Security Implications of Unicode |
1818 | ||
b65e6125 | 1819 | First, read |
71c89d21 | 1820 | L<Unicode Security Considerations|https://www.unicode.org/reports/tr36>. |
b65e6125 | 1821 | |
e1b711da KW |
1822 | Also, note the following: |
1823 | ||
0d7c09bb JH |
1824 | =over 4 |
1825 | ||
1826 | =item * | |
1827 | ||
1828 | Malformed UTF-8 | |
bf0fa0b2 | 1829 | |
f57d8456 KW |
1830 | UTF-8 is very structured, so many combinations of bytes are invalid. In |
1831 | the past, Perl tried to soldier on and make some sense of invalid | |
1832 | combinations, but this can lead to security holes, so now, if the Perl | |
1833 | core needs to process an invalid combination, it will either raise a | |
1834 | fatal error, or will replace those bytes by the sequence that forms the | |
1835 | Unicode REPLACEMENT CHARACTER, for which purpose Unicode created it. | |
1836 | ||
1837 | Every code point can be represented by more than one possible | |
1838 | syntactically valid UTF-8 sequence. Early on, both Unicode and Perl | |
1839 | considered any of these to be valid, but now, all sequences longer | |
1840 | than the shortest possible one are considered to be malformed. | |
1841 | ||
1842 | Unicode considers many code points to be illegal, or to be avoided. | |
1843 | Perl generally accepts them, once they have passed through any input | |
1844 | filters that may try to exclude them. These have been discussed above | |
1845 | (see "Surrogates" under UTF-16 in L</Unicode Encodings>, | |
1846 | L</Noncharacter code points>, and L</Beyond Unicode code points>). | |
bf0fa0b2 | 1847 | |
0d7c09bb JH |
1848 | =item * |
1849 | ||
68693f9e | 1850 | Regular expression pattern matching may surprise you if you're not |
b19eb496 TC |
1851 | accustomed to Unicode. Starting in Perl 5.14, several pattern |
1852 | modifiers are available to control this, called the character set | |
42581d5d KW |
1853 | modifiers. Details are given in L<perlre/Character set modifiers>. |
1854 | ||
1855 | =back | |
0d7c09bb | 1856 | |
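The shortest-form rule for UTF-8 mentioned above can be observed from Perl code through the L<Encode> module. This is only a sketch: it exercises Encode's strict C<"UTF-8"> decoder rather than the Perl core's internal handling, and the two byte strings are examples picked for illustration.

```perl
use strict;
use warnings;
use Encode ();

my $overlong = "\xC0\x80";   # two-byte (non-shortest) encoding of U+0000
my $shortest = "\xC3\xA9";   # shortest-form encoding of U+00E9

# With FB_CROAK, Encode dies on malformed input.  Decoding may modify
# its argument when a CHECK value is given, so work on a copy.
my $copy = $overlong;
my $strict_ok = eval { Encode::decode("UTF-8", $copy, Encode::FB_CROAK); 1 } ? 1 : 0;

# With the default CHECK, the malformed bytes are replaced by
# U+FFFD REPLACEMENT CHARACTER instead.
my $copy2   = $overlong;
my $lenient = Encode::decode("UTF-8", $copy2);
my $has_replacement = $lenient =~ /\x{FFFD}/ ? 1 : 0;

# Well-formed input decodes normally.
my $copy3 = $shortest;
my $good  = Encode::decode("UTF-8", $copy3, Encode::FB_CROAK);

print "overlong accepted under FB_CROAK: $strict_ok\n";
print "default decode used U+FFFD: $has_replacement\n";
```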
376d9008 | 1857 | As discussed elsewhere, Perl has one foot (two hooves?) planted in |
a6a7eedc KW |
1858 | each of two worlds: the old world of ASCII and single-byte locales, and |
1859 | the new world of Unicode, upgrading when necessary. | |
376d9008 | 1860 | If your legacy code does not explicitly use Unicode, no automatic |
a6a7eedc | 1861 | switch-over to Unicode should happen. |
0d7c09bb | 1862 | |
c349b1b9 JH |
1863 | =head2 Unicode in Perl on EBCDIC |
1864 | ||
a6a7eedc KW |
1865 | Unicode is supported on EBCDIC platforms. See L<perlebcdic>. |
1866 | ||
1867 | Unless ASCII vs. EBCDIC issues are specifically being discussed, | |
1868 | references to UTF-8 encoding in this document and elsewhere should be | |
1869 | read as meaning UTF-EBCDIC on EBCDIC platforms. | |
1870 | See L<perlebcdic/Unicode and UTF>. | |
1871 | ||
1872 | Because UTF-EBCDIC is so similar to UTF-8, the differences are mostly | |
1873 | hidden from you; S<C<use utf8>> (and NOT something like | |
dabde021 | 1874 | S<C<use utfebcdic>>) declares the script is in the platform's |
a6a7eedc KW |
1875 | "native" 8-bit encoding of Unicode. (Similarly for the C<":utf8"> |
1876 | layer.) | |
c349b1b9 | 1877 | |
b310b053 JH |
1878 | =head2 Locales |
1879 | ||
42581d5d | 1880 | See L<perllocale/Unicode and UTF-8>. |
b310b053 | 1881 | |
1aad1664 JH |
1882 | =head2 When Unicode Does Not Happen |
1883 | ||
b65e6125 KW |
1884 | There are still many places in Perl where Unicode (in some encoding or |
1885 | another) could be given as arguments or received as results, or both, | |
1886 | but is not. This is so even though Perl has extensive ways to input and | |
1887 | output in Unicode, and a few other "entry points" like the C<@ARGV> | |
1888 | array (which can sometimes be interpreted as UTF-8). | |
1aad1664 | 1889 | |
e1b711da KW |
1890 | The following are such interfaces. Also, see L</The "Unicode Bug">. |
1891 | For all of these interfaces Perl | |
b9cedb1b | 1892 | currently (as of v5.16.0) simply assumes byte strings both as arguments |
b65e6125 | 1893 | and results, or UTF-8 strings if the (deprecated) C<encoding> pragma has been used. |
1aad1664 | 1894 | |
b19eb496 TC |
1895 | One reason that Perl does not attempt to resolve the role of Unicode in |
1896 | these situations is that the answers are highly dependent on the operating | |
1aad1664 | 1897 | system and the file system(s). For example, whether filenames can be |
b19eb496 TC |
1898 | in Unicode and in exactly what kind of encoding, is not exactly a |
1899 | portable concept. Similarly for C<qx> and C<system>: how well will the | |
1900 | "command-line interface" (and which of them?) handle Unicode? | |
1aad1664 JH |
1901 | |
1902 | =over 4 | |
1903 | ||
557a2462 RB |
1904 | =item * |
1905 | ||
a9130ea9 KW |
1906 | C<chdir>, C<chmod>, C<chown>, C<chroot>, C<exec>, C<link>, C<lstat>, C<mkdir>, |
1907 | C<rename>, C<rmdir>, C<stat>, C<symlink>, C<truncate>, C<unlink>, C<utime>, C<-X> | |
557a2462 RB |
1908 | |
1909 | =item * | |
1910 | ||
a9130ea9 | 1911 | C<%ENV> |
557a2462 RB |
1912 | |
1913 | =item * | |
1914 | ||
a9130ea9 | 1915 | C<glob> (aka the C<E<lt>*E<gt>>) |
557a2462 RB |
1916 | |
1917 | =item * | |
1aad1664 | 1918 | |
a9130ea9 | 1919 | C<open>, C<opendir>, C<sysopen> |
1aad1664 | 1920 | |
557a2462 | 1921 | =item * |
1aad1664 | 1922 | |
a9130ea9 | 1923 | C<qx> (aka the backtick operator), C<system> |
1aad1664 | 1924 | |
557a2462 | 1925 | =item * |
1aad1664 | 1926 | |
a9130ea9 | 1927 | C<readdir>, C<readlink> |
1aad1664 JH |
1928 | |
1929 | =back | |
1930 | ||
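Because the interfaces listed above work on byte strings, one common approach is to encode and decode file names explicitly. The following is a sketch only. It assumes the file system stores names as UTF-8 bytes and does not normalize them, which is typical on Linux but, as noted above, is not portable. The file name and variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $name = "smiley-\x{263A}.txt";      # a character string

# open() expects bytes, so encode the name first (assumed: UTF-8).
my $bytes = encode("UTF-8", $name);
open(my $fh, ">", "$dir/$bytes") or die "open failed: $!";
close $fh;

# readdir() returns bytes, so decode what comes back.
opendir(my $dh, $dir) or die "opendir failed: $!";
my @names = map  { decode("UTF-8", $_) }
            grep { !/^\.\.?\z/ } readdir $dh;
closedir $dh;

print "round trip ok\n" if @names == 1 && $names[0] eq $name;
```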
e1b711da KW |
1931 | =head2 The "Unicode Bug" |
1932 | ||
a6a7eedc KW |
1933 | The term "Unicode bug" has been applied to an inconsistency with the |
1934 | code points in the C<Latin-1 Supplement> block, that is, between | |
1935 | 128 and 255. Without a locale specified, unlike all other characters or | |
1936 | code points, these characters can have very different semantics | |
1937 | depending on the rules in effect. (Characters whose code points are | |
1938 | above 255 force Unicode rules; whereas the rules for ASCII characters | |
1939 | are the same under both ASCII and Unicode rules.) | |
1940 | ||
1941 | Under Unicode rules, these upper-Latin1 characters are interpreted as | |
1942 | Unicode code points, which means they have the same semantics as Latin-1 | |
1943 | (ISO-8859-1) and C1 controls. | |
1944 | ||
1945 | As explained in L</ASCII Rules versus Unicode Rules>, under ASCII rules, | |
1946 | they are considered to be unassigned characters. | |
1947 | ||
1948 | This can lead to unexpected results. For example, a string's | |
1949 | semantics can suddenly change if a code point above 255 is appended to | |
1950 | it, which changes the rules from ASCII to Unicode. As an | |
1951 | example, consider the following program and its output: | |
1952 | ||
1953 | $ perl -le' | |
f434f357 | 1954 | no feature "unicode_strings"; |
a6a7eedc KW |
1955 | $s1 = "\xC2"; |
1956 | $s2 = "\x{2660}"; | |
1957 | for ($s1, $s2, $s1.$s2) { | |
1958 | print /\w/ || 0; | |
1959 | } | |
1960 | ' | |
1961 | 0 | |
1962 | 0 | |
1963 | 1 | |
1964 | ||
1965 | If neither C<$s1> nor C<$s2> contains a C<\w> character, why does their | |
1966 | concatenation have one? | |
1967 | ||
1968 | This anomaly stems from Perl's attempt to not disturb older programs that | |
1969 | didn't use Unicode, along with Perl's desire to add Unicode support | |
1970 | seamlessly. But the result turned out to not be seamless. (By the way, | |
1971 | you can choose to be warned when things like this happen. See | |
1972 | C<L<encoding::warnings>>.) | |
1973 | ||
1974 | L<S<C<use feature 'unicode_strings'>>|feature/The 'unicode_strings' feature> | |
1975 | was added, starting in Perl v5.12, to address this problem. It affects | |
1976 | these things: | |
e1b711da KW |
1977 | |
1978 | =over 4 | |
1979 | ||
1980 | =item * | |
1981 | ||
1982 | Changing the case of a scalar, that is, using C<uc()>, C<ucfirst()>, C<lc()>, | |
2e2b2571 KW |
1983 | and C<lcfirst()>, or C<\L>, C<\U>, C<\u> and C<\l> in double-quotish |
1984 | contexts, such as regular expression substitutions. | |
a6a7eedc KW |
1985 | |
1986 | Under C<unicode_strings> starting in Perl 5.12.0, Unicode rules are | |
2e2b2571 KW |
1987 | generally used. See L<perlfunc/lc> for details on how this works |
1988 | in combination with various other pragmas. | |
e1b711da KW |
1989 | |
1990 | =item * | |
1991 | ||
2e2b2571 | 1992 | Using caseless (C</i>) regular expression matching. |
a6a7eedc | 1993 | |
2e2b2571 | 1994 | Starting in Perl 5.14.0, regular expressions compiled within |
a6a7eedc | 1995 | the scope of C<unicode_strings> use Unicode rules |
2e2b2571 KW |
1996 | even when executed or compiled into larger |
1997 | regular expressions outside the scope. | |
e1b711da KW |
1998 | |
1999 | =item * | |
2000 | ||
a6a7eedc KW |
2001 | Matching any of several properties in regular expressions. |
2002 | ||
2003 | These properties are C<\b> (without braces), C<\B> (without braces), | |
2004 | C<\s>, C<\S>, C<\w>, C<\W>, and all the Posix character classes | |
630d17dc | 2005 | I<except> C<[[:ascii:]]>. |
a6a7eedc | 2006 | |
2e2b2571 | 2007 | Starting in Perl 5.14.0, regular expressions compiled within |
a6a7eedc | 2008 | the scope of C<unicode_strings> use Unicode rules |
2e2b2571 KW |
2009 | even when executed or compiled into larger |
2010 | regular expressions outside the scope. | |
e1b711da KW |
2011 | |
2012 | =item * | |
2013 | ||
a6a7eedc KW |
2014 | In C<quotemeta> or its inline equivalent C<\Q>. |
2015 | ||
2e2b2571 KW |
2016 | Starting in Perl 5.16.0, consistent quoting rules are used within the |
2017 | scope of C<unicode_strings>, as described in L<perlfunc/quotemeta>. | |
a6a7eedc KW |
2018 | Prior to that, or outside its scope, no code points above 127 are quoted |
2019 | in UTF-8 encoded strings, but in byte encoded strings, code points | |
2020 | between 128-255 are always quoted. | |
eb88ed9e | 2021 | |
d6c970c7 AC |
2022 | =item * |
2023 | ||
2024 | In the C<..> or L<range|perlop/Range Operators> operator. | |
2025 | ||
2026 | Starting in Perl 5.26.0, the range operator on strings treats their lengths | |
2027 | consistently within the scope of C<unicode_strings>. Prior to that, or | |
2028 | outside its scope, it could produce strings whose length in characters | |
2029 | exceeded that of the right-hand side, where the right-hand side took up more | |
2030 | bytes than the correct range endpoint. | |
2031 | ||
20ae58f7 AC |
2032 | =item * |
2033 | ||
2034 | In L<< C<split>'s special-case whitespace splitting|perlfunc/split >>. | |
2035 | ||
2036 | Starting in Perl 5.28.0, the C<split> function with a pattern specified as | |
2037 | a string containing a single space handles whitespace characters consistently | |
a3815e44 | 2038 | within the scope of C<unicode_strings>. Prior to that, or outside its scope, |
20ae58f7 AC |
2039 | characters that are whitespace according to Unicode rules but not according to |
2040 | ASCII rules were treated as field contents rather than field separators when | |
2041 | they appear in byte-encoded strings. | |
2042 | ||
e1b711da KW |
2043 | =back |
2044 | ||
a6a7eedc KW |
2045 | You can see from the above that the effect of C<unicode_strings> |
2046 | increased over several Perl releases. (And Perl's support for Unicode | |
2047 | continues to improve; it's best to use the latest available release in | |
2048 | order to get the most complete and accurate results possible.) Note that | |
6901d503 | 2049 | C<unicode_strings> is automatically chosen if you S<C<use v5.12>> or |
a6a7eedc | 2050 | higher. |
e1b711da | 2051 | |
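The effect of C<unicode_strings> on two of the operations listed above (matching C<\w> and changing case) can be sketched as follows. The strings are native byte strings holding upper-Latin1 code points, and the variable names are illustrative.

```perl
use strict;
use warnings;

my $s = "\xC2";    # LATIN CAPITAL LETTER A WITH CIRCUMFLEX, as a byte string
my $e = "\xE9";    # LATIN SMALL LETTER E WITH ACUTE, as a byte string

my ($word_without, $uc_without, $word_with, $uc_with);

{
    no feature 'unicode_strings';
    # ASCII rules: upper-Latin1 characters are not word characters
    # and have no case mappings.
    $word_without = $s =~ /\w/ ? 1 : 0;
    $uc_without   = uc($e) eq "\xC9" ? 1 : 0;
}

{
    use feature 'unicode_strings';
    # Unicode rules apply regardless of the internal representation.
    $word_with = $s =~ /\w/ ? 1 : 0;
    $uc_with   = uc($e) eq "\xC9" ? 1 : 0;
}

print "without unicode_strings: \\w=$word_without uc=$uc_without\n";
print "with unicode_strings:    \\w=$word_with uc=$uc_with\n";
```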
2e2b2571 | 2052 | For Perls earlier than those described above, or when a string is passed |
a6a7eedc | 2053 | to a function outside the scope of C<unicode_strings>, see the next section. |
e1b711da | 2054 | |
1aad1664 JH |
2055 | =head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl) |
2056 | ||
e1b711da KW |
2057 | Sometimes (see L</"When Unicode Does Not Happen"> or L</The "Unicode Bug">) |
2058 | there are situations where you simply need to force a byte | |
a6a7eedc KW |
2059 | string into UTF-8, or vice versa. The standard module L<Encode> can be |
2060 | used for this, or the low-level calls | |
a9130ea9 | 2061 | L<C<utf8::upgrade($bytestring)>|utf8/Utility functions> and |
a6a7eedc | 2062 | L<C<utf8::downgrade($utf8string[, FAIL_OK])>|utf8/Utility functions>. |
1aad1664 | 2063 | |
a9130ea9 | 2064 | Note that C<utf8::downgrade()> can fail if the string contains characters |
2bbc8d55 | 2065 | that don't fit into a byte. |
1aad1664 | 2066 | |
e1b711da KW |
2067 | Calling either function on a string that already is in the desired state is a |
2068 | no-op. | |
2069 | ||
a6a7eedc KW |
2070 | L</ASCII Rules versus Unicode Rules> gives all the ways that a string is |
2071 | made to use Unicode rules. | |
95a1a48b | 2072 | |
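A short sketch of the two low-level calls follows. It shows that upgrading and downgrading change only the internal representation, not the characters in the string, and that C<utf8::downgrade()> with a true C<FAIL_OK> argument returns false instead of dying. The variable names are illustrative.

```perl
use strict;
use warnings;

my $s = "caf\xE9";                       # starts out as a byte string

utf8::upgrade($s);                       # now stored internally as UTF-8
my $flag_after_upgrade = utf8::is_utf8($s) ? 1 : 0;
my $same_after_upgrade = $s eq "caf\xE9" ? 1 : 0;   # same characters

utf8::downgrade($s);                     # back to one byte per character
my $flag_after_downgrade = utf8::is_utf8($s) ? 1 : 0;

# A character above 255 does not fit into a byte.  With FAIL_OK true,
# downgrade returns false rather than raising an exception.
my $wide = "\x{263A}";
my $downgraded = utf8::downgrade($wide, 1) ? 1 : 0;

print "upgraded: $flag_after_upgrade, unchanged: $same_after_upgrade\n";
print "downgraded flag: $flag_after_downgrade, wide ok: $downgraded\n";
```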
37b3b608 | 2073 | =head2 Using Unicode in XS |
c349b1b9 | 2074 | |
37b3b608 KW |
2075 | See L<perlguts/"Unicode Support"> for an introduction to Unicode at |
2076 | the XS level, and L<perlapi/Unicode Support> for the API details. | |
95a1a48b | 2077 | |
e1b711da KW |
2078 | =head2 Hacking Perl to work on earlier Unicode versions (for very serious hackers only) |
2079 | ||
a6a7eedc KW |
2080 | Perl by default comes with the latest supported Unicode version built-in, but |
2081 | the goal is to allow you to change to use any earlier one. In Perls | |
2082 | v5.20 and v5.22, however, the earliest usable version is Unicode 5.1. | |
c55dd03d | 2083 | Perl v5.18 and v5.24 are able to handle all earlier versions. |
e1b711da | 2084 | |
42581d5d | 2085 | Download the files in the desired version of Unicode from the Unicode web |
71c89d21 | 2086 | site (L<https://www.unicode.org>). These should replace the existing files in |
b19eb496 | 2087 | F<lib/unicore> in the Perl source tree. Follow the instructions in |
116693e8 | 2088 | F<README.perl> in that directory to change some of their names, and then build |
26e391dd | 2089 | perl (see L<INSTALL>). |
116693e8 | 2090 | |
c8d992ba A |
2091 | =head2 Porting code from perl-5.6.X |
2092 | ||
a6a7eedc KW |
2093 | Perls starting in 5.8 have a different Unicode model from 5.6. In 5.6 the |
2094 | programmer was required to use the C<utf8> pragma to declare that a | |
2095 | given scope expected to deal with Unicode data and had to make sure that | |
2096 | only Unicode data were reaching that scope. If you have code that is | |
c8d992ba | 2097 | working with 5.6, you will need some of the following adjustments to |
a6a7eedc KW |
2098 | your code. The examples are written such that the code will continue to |
2099 | work under 5.6, so you should be safe to try them out. | |
c8d992ba | 2100 | |
755789c0 | 2101 | =over 3 |
c8d992ba A |
2102 | |
2103 | =item * | |
2104 | ||
2105 | A filehandle that should read or write UTF-8 | |
2106 | ||
b9cedb1b | 2107 | if ($] >= 5.008) { |
6d8e7450 | 2108 | binmode $fh, ":encoding(UTF-8)"; |
c8d992ba A |
2109 | } |
2110 | ||
2111 | =item * | |
2112 | ||
2113 | A scalar that is going to be passed to some extension | |
2114 | ||
a9130ea9 | 2115 | Be it C<Compress::Zlib>, C<Apache::Request> or any extension that has no |
c8d992ba | 2116 | mention of Unicode in the manpage, you need to make sure that the |
2575c402 | 2117 | UTF8 flag is stripped off. Note that at the time of this writing |
b9cedb1b | 2118 | (January 2012) the mentioned modules are not UTF-8-aware. Please |
c8d992ba A |
2119 | check the documentation to verify if this is still true. |
2120 | ||
b9cedb1b | 2121 | if ($] >= 5.008) { |
c8d992ba | 2122 | require Encode; |
8e179dd8 | 2123 | $val = Encode::encode("UTF-8", $val); # make octets |
c8d992ba A |
2124 | } |
2125 | ||
2126 | =item * | |
2127 | ||
2128 | A scalar we got back from an extension | |
2129 | ||
2130 | If you believe the scalar comes back as UTF-8, you will most likely | |
2575c402 | 2131 | want the UTF8 flag restored: |
c8d992ba | 2132 | |
b9cedb1b | 2133 | if ($] >= 5.008) { |
c8d992ba | 2134 | require Encode; |
8e179dd8 | 2135 | $val = Encode::decode("UTF-8", $val); |
c8d992ba A |
2136 | } |
2137 | ||
2138 | =item * | |
2139 | ||
2140 | Same thing, if you are really sure it is UTF-8 | |
2141 | ||
b9cedb1b | 2142 | if ($] >= 5.008) { |
c8d992ba A |
2143 | require Encode; |
2144 | Encode::_utf8_on($val); | |
2145 | } | |
2146 | ||
2147 | =item * | |
2148 | ||
a9130ea9 | 2149 | A wrapper for L<DBI> C<fetchrow_array> and C<fetchrow_hashref> |
c8d992ba A |
2150 | |
2151 | When the database contains only UTF-8, a wrapper function or method is | |
a9130ea9 KW |
2152 | a convenient way to replace all your C<fetchrow_array> and |
2153 | C<fetchrow_hashref> calls. A wrapper function will also make it easier to | |
c8d992ba | 2154 | adapt to future enhancements in your database driver. Note that at the |
b9cedb1b | 2155 | time of this writing (January 2012), the DBI has no standardized way |
a9130ea9 | 2156 | to deal with UTF-8 data. Please check the L<DBI documentation|DBI> to verify if |
c8d992ba A |
2157 | that is still true. |
2158 | ||
2159 | sub fetchrow { | |
d88362ca KW |
2160 | # $what is one of fetchrow_{array,hashref} |
2161 | my($self, $sth, $what) = @_; | |
b9cedb1b | 2162 | if ($] < 5.008) { |
c8d992ba A |
2163 | return $sth->$what; |
2164 | } else { | |
2165 | require Encode; | |
2166 | if (wantarray) { | |
2167 | my @arr = $sth->$what; | |
2168 | for (@arr) { | |
2169 | defined && /[^\000-\177]/ && Encode::_utf8_on($_); | |
2170 | } | |
2171 | return @arr; | |
2172 | } else { | |
2173 | my $ret = $sth->$what; | |
2174 | if (ref $ret) { | |
2175 | for my $k (keys %$ret) { | |
d88362ca KW |
2176 | defined |
2177 | && /[^\000-\177]/ | |
2178 | && Encode::_utf8_on($_) for $ret->{$k}; | |
c8d992ba A |
2179 | } |
2180 | return $ret; | |
2181 | } else { | |
2182 | defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret; | |
2183 | return $ret; | |
2184 | } | |
2185 | } | |
2186 | } | |
2187 | } | |
2188 | ||
2189 | ||
2190 | =item * | |
2191 | ||
2192 | A large scalar that you know can only contain ASCII | |
2193 | ||
2194 | Scalars that contain only ASCII and are marked as UTF-8 are sometimes | |
2195 | a drag to your program. If you recognize such a situation, just remove | |
2575c402 | 2196 | the UTF8 flag: |
c8d992ba | 2197 | |
b9cedb1b | 2198 | utf8::downgrade($val) if $] >= 5.008; |
c8d992ba A |
2199 | |
2200 | =back | |
2201 | ||
a6a7eedc KW |
2202 | =head1 BUGS |
2203 | ||
2204 | See also L</The "Unicode Bug"> above. | |
2205 | ||
2206 | =head2 Interaction with Extensions | |
2207 | ||
2208 | When Perl exchanges data with an extension, the extension should be | |
2209 | able to understand the UTF8 flag and act accordingly. If the | |
2210 | extension doesn't recognize that flag, it's likely that the extension | |
2211 | will return incorrectly-flagged data. | |
2212 | ||
2213 | So if you're working with Unicode data, consult the documentation of | |
2214 | every module you're using to see whether there are any issues with Unicode data | |
2215 | exchange. If the documentation does not talk about Unicode at all, | |
2216 | suspect the worst and probably look at the source to learn how the | |
2217 | module is implemented. Modules written completely in Perl shouldn't | |
2218 | cause problems. Modules that directly or indirectly access code written | |
2219 | in other programming languages are at risk. | |
2220 | ||
2221 | For affected functions, the simple strategy to avoid data corruption is | |
2222 | to always make the encoding of the exchanged data explicit. Choose an | |
2223 | encoding that you know the extension can handle. Convert arguments passed | |
2224 | to the extensions to that encoding and convert results back from that | |
2225 | encoding. Write wrapper functions that do the conversions for you, so | |
2226 | you can later change the functions when the extension catches up. | |
2227 | ||
2228 | To provide an example, let's say the popular C<Foo::Bar::escape_html> | |
2229 | function doesn't deal with Unicode data yet. The wrapper function | |
2230 | would convert the argument to raw UTF-8 and convert the result back to | |
2231 | Perl's internal representation like so: | |
2232 | ||
2233 | sub my_escape_html ($) { | |
2234 | my($what) = shift; | |
2235 | return unless defined $what; | |
8e179dd8 P |
2236 | Encode::decode("UTF-8", Foo::Bar::escape_html( |
2237 | Encode::encode("UTF-8", $what))); | |
a6a7eedc KW |
2238 | } |
2239 | ||
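The wrapper pattern above can be tried out with a stand-in for the extension. In the sketch below, C<bytes_escape_html> is a hypothetical byte-oriented function invented for illustration. It plays the role of C<Foo::Bar::escape_html> and is not part of any real module.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for an extension function that only
# understands byte strings.
sub bytes_escape_html {
    my ($octets) = @_;
    $octets =~ s/&/&amp;/g;
    $octets =~ s/</&lt;/g;
    $octets =~ s/>/&gt;/g;
    return $octets;
}

# Wrapper: encode characters to UTF-8 octets on the way in,
# decode the octets back to characters on the way out.
sub my_escape_html {
    my ($what) = @_;
    return unless defined $what;
    return Encode::decode("UTF-8",
        bytes_escape_html(Encode::encode("UTF-8", $what)));
}

my $escaped = my_escape_html("<b>\x{263A}</b>");
print "length in characters: ", length($escaped), "\n";
```

Keeping the conversions in one wrapper means only that function needs to change if the extension later gains Unicode support.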
2240 | Sometimes, when the extension does not convert data but just stores | |
2241 | and retrieves it, you will be able to use the otherwise | |
2242 | dangerous L<C<Encode::_utf8_on()>|Encode/_utf8_on> function. Let's say | |
2243 | the popular C<Foo::Bar> extension, written in C, provides a C<param> | |
2244 | method that lets you store and retrieve data according to these prototypes: | |
2245 | ||
2246 | $self->param($name, $value); # set a scalar | |
2247 | $value = $self->param($name); # retrieve a scalar | |
2248 | ||
2249 | If it does not yet provide support for any encoding, one could write a | |
2250 | derived class with such a C<param> method: | |
2251 | ||
2252 | sub param { | |
2253 | my($self,$name,$value) = @_; | |
2254 | utf8::upgrade($name); # make sure it is UTF-8 encoded | |
2255 | if (defined $value) { | |
2256 | utf8::upgrade($value); # make sure it is UTF-8 encoded | |
2257 | return $self->SUPER::param($name,$value); | |
2258 | } else { | |
2259 | my $ret = $self->SUPER::param($name); | |
2260 | Encode::_utf8_on($ret); # we know, it is UTF-8 encoded | |
2261 | return $ret; | |
2262 | } | |
2263 | } | |
2264 | ||
2265 | Some extensions provide filters on data entry/exit points, such as | |
2266 | C<DB_File::filter_store_key> and family. Look out for such filters in | |
2267 | the documentation of your extensions; they can make the transition to | |
2268 | Unicode data much easier. | |
2269 | ||
2270 | =head2 Speed | |
2271 | ||
2272 | Some functions are slower when working on UTF-8 encoded strings than | |
2273 | on byte encoded strings. All functions that need to hop over | |
2274 | characters such as C<length()>, C<substr()> or C<index()>, or matching | |
2275 | regular expressions can work B<much> faster when the underlying data are | |
2276 | byte-encoded. | |
2277 | ||
2278 | In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1 | |
2279 | a caching scheme was introduced which improved the situation. In general, | |
2280 | operations with UTF-8 encoded strings are still slower. As an example, | |
2281 | the Unicode properties (character classes) like C<\p{Nd}> are known to | |
2282 | be quite a bit slower (5-20 times) than their simpler counterparts | |
2283 | like C<[0-9]> (then again, there are hundreds of Unicode characters matching | |
2284 | C<Nd> compared with the 10 ASCII characters matching C<[0-9]>). | |
2285 | ||
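The difference in what C<\p{Nd}> and C<[0-9]> match, which accounts for part of the speed difference, can be sketched with a single non-ASCII digit. The character chosen is only an example, and the C</a> modifier used in the last test requires Perl 5.14 or later.

```perl
use strict;
use warnings;

my $arabic_three = "\x{0663}";   # ARABIC-INDIC DIGIT THREE

# \p{Nd} covers the decimal digits of every script.
my $is_nd = $arabic_three =~ /\p{Nd}/ ? 1 : 0;

# [0-9] covers only the ten ASCII digits.
my $is_ascii_digit = $arabic_three =~ /[0-9]/ ? 1 : 0;

# \d under /a is likewise restricted to ASCII digits.
my $is_d_ascii = $arabic_three =~ /\d/a ? 1 : 0;

print "\\p{Nd}: $is_nd, [0-9]: $is_ascii_digit, \\d with /a: $is_d_ascii\n";
```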
393fec97 GS |
2286 | =head1 SEE ALSO |
2287 | ||
51f494cc | 2288 | L<perlunitut>, L<perluniintro>, L<perluniprops>, L<Encode>, L<open>, L<utf8>, L<bytes>, |
b65e6125 | 2289 | L<perlretut>, L<perlvar/"${^UNICODE}">, |
71c89d21 | 2290 | L<https://www.unicode.org/reports/tr44>. |
393fec97 GS |
2291 | |
2292 | =cut |