Commit | Line | Data |
---|---|---|
393fec97 GS |
1 | =head1 NAME |
2 | ||
3 | perlunicode - Unicode support in Perl | |
4 | ||
5 | =head1 DESCRIPTION | |
6 | ||
0a1f2d14 | 7 | =head2 Important Caveats |
21bad921 | 8 | |
376d9008 | 9 | Unicode support is an extensive requirement. While Perl does not |
c349b1b9 JH |
10 | implement the Unicode standard or the accompanying technical reports |
11 | from cover to cover, Perl does support many Unicode features. | |
21bad921 | 12 | |
2575c402 | 13 | People who want to learn to use Unicode in Perl should probably read |
0314f483 KW |
14 | the L<Perl Unicode tutorial, perlunitut|perlunitut> and |
15 | L<perluniintro>, before reading | |
e4911a48 | 16 | this reference document. |
2575c402 | 17 | |
9d1c51c1 KW |
18 | Also, the use of Unicode may present security issues that aren't obvious. |
19 | Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>. | |
20 | ||
13a2d996 | 21 | =over 4 |
21bad921 | 22 | |
a9130ea9 | 23 | =item Safest if you C<use feature 'unicode_strings'> |
42581d5d KW |
24 | |
25 | In order to preserve backward compatibility, Perl does not turn | |
26 | on full internal Unicode support unless the pragma | |
b65e6125 KW |
27 | L<S<C<use feature 'unicode_strings'>>|feature/The 'unicode_strings' feature> |
28 | is specified. (This is automatically | |
29 | selected if you S<C<use 5.012>> or higher.) Failure to do this can | |
42581d5d KW |
30 | trigger surprises. See L</The "Unicode Bug"> below. |
31 | ||
2269d15c KW |
32 | This pragma doesn't affect I/O. Nor does it change the internal |
33 | representation of strings, only their interpretation. There are still | |
34 | several places where Unicode isn't fully supported, such as in | |
35 | filenames. | |
42581d5d | 36 | |
fae2c0fb | 37 | =item Input and Output Layers |
21bad921 | 38 | |
376d9008 | 39 | Perl knows when a filehandle uses Perl's internal Unicode encodings |
1bfb14c4 | 40 | (UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with |
a9130ea9 | 41 | the C<:encoding(utf8)> layer. Other encodings can be converted to Perl's |
1bfb14c4 | 42 | encoding on input or from Perl's encoding on output by use of the |
a9130ea9 | 43 | C<:encoding(...)> layer. See L<open>. |
c349b1b9 | 44 | |
2575c402 | 45 | To indicate that Perl source itself is in UTF-8, use C<use utf8;>. |
21bad921 | 46 | |
ad0029c4 | 47 | =item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts |
21bad921 | 48 | |
376d9008 JB |
49 | As a compatibility measure, the C<use utf8> pragma must be explicitly |
50 | included to enable recognition of UTF-8 in the Perl scripts themselves | |
1bfb14c4 JH |
51 | (in string or regular expression literals, or in identifier names) on |
52 | ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based | |
376d9008 | 53 | machines. B<These are the only times when an explicit C<use utf8> |
8f8cf39c | 54 | is needed.> See L<utf8>. |
21bad921 | 55 | |
a9130ea9 | 56 | =item C<BOM>-marked scripts and UTF-16 scripts autodetected |
7aa207d6 | 57 | |
a9130ea9 KW |
58 | If a Perl script begins marked with the Unicode C<BOM> (UTF-16LE, UTF-16BE, |
59 | or UTF-8), or if the script looks like non-C<BOM>-marked UTF-16 of either | |
7aa207d6 | 60 | endianness, Perl will correctly read in the script as Unicode. |
a9130ea9 | 61 | (C<BOM>less UTF-8 cannot be effectively recognized or differentiated from |
7aa207d6 JH |
62 | ISO 8859-1 or other eight-bit encodings.) |
63 | ||
990e18f7 AT |
64 | =item C<use encoding> needed to upgrade non-Latin-1 byte strings |
65 | ||
38a44b82 | 66 | By default, there is a fundamental asymmetry in Perl's Unicode model: |
990e18f7 AT |
67 | implicit upgrading from byte strings to Unicode strings assumes that |
68 | they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are | |
69 | downgraded with UTF-8 encoding. This happens because the first 256 | |
51f494cc | 70 | codepoints in Unicode happen to agree with Latin-1. |
990e18f7 | 71 | |
990e18f7 AT |
72 | See L</"Byte and Character Semantics"> for more details. |
73 | ||
21bad921 GS |
74 | =back |
75 | ||
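As a rough illustration of the first two caveats, the sketch below shows
C<uc()> with and without C<use feature 'unicode_strings'>, and then a round
trip through an C<:encoding(UTF-8)> layer on an in-memory filehandle. It
assumes an ASCII-based platform and Perl 5.12 or later; the variable names
are purely illustrative.

```perl
use strict;
use warnings;

# U+00E0 (LATIN SMALL LETTER A WITH GRAVE) held as a native 8-bit string.
my $a_grave = "\xE0";

# Without the feature, uc() uses byte semantics and leaves it unchanged.
my $plain = uc $a_grave;

my $fixed;
{
    use feature 'unicode_strings';
    # With the feature, uc() uses Unicode rules: U+00E0 becomes U+00C0.
    $fixed = uc $a_grave;
}

# An :encoding layer turns characters into bytes on output ...
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer or die "open for write: $!";
print {$out} "\x{263A}";
close $out or die "close: $!";

# ... and bytes back into characters on input.
open my $in, '<:encoding(UTF-8)', \$buffer or die "open for read: $!";
my $char = <$in>;
close $in;

printf "plain=%02X fixed=%02X bytes=%d chars=%d\n",
    ord $plain, ord $fixed, length $buffer, length $char;
```

The feature changes only how the existing string is interpreted; the layer
is what converts between characters and encoded bytes.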
376d9008 | 76 | =head2 Byte and Character Semantics |
393fec97 | 77 | |
b9cedb1b | 78 | Perl uses logically-wide characters to represent strings internally. |
393fec97 | 79 | |
42581d5d KW |
80 | Starting in Perl 5.14, Perl-level operations work with |
81 | characters rather than bytes within the scope of a | |
82 | C<L<use feature 'unicode_strings'|feature>> (or equivalently | |
83 | C<use 5.012> or higher). (This is not true if bytes have been | |
b19eb496 | 84 | explicitly requested by C<L<use bytes|bytes>>, nor necessarily true |
42581d5d KW |
85 | for interactions with the platform's operating system.) |
86 | ||
87 | For earlier Perls, and when C<unicode_strings> is not in effect, Perl | |
88 | provides a fairly safe environment that can handle both types of | |
89 | semantics in programs. For operations where Perl can unambiguously | |
90 | decide that the input data are characters, Perl switches to character | |
91 | semantics. For operations where this determination cannot be made | |
92 | without additional information from the user, Perl decides in favor of | |
93 | compatibility and chooses to use byte semantics. | |
94 | ||
66cbab2c | 95 | When C<use locale> (but not C<use locale ':not_characters'>) is in |
850b7ec9 | 96 | effect, Perl uses the rules associated with the current locale. |
66cbab2c KW |
97 | (C<use locale> overrides C<use feature 'unicode_strings'> in the same scope; |
98 | while C<use locale ':not_characters'> effectively also selects | |
99 | C<use feature 'unicode_strings'> in its scope; see L<perllocale>.) | |
100 | Otherwise, Perl uses the platform's native | |
42581d5d | 101 | byte semantics for characters whose code points are less than 256, and |
850b7ec9 | 102 | Unicode rules for those greater than 255. That means that non-ASCII |
4b9734bf | 103 | characters are undefined except for their |
e1b711da KW |
104 | ordinal numbers. This means that none have case (upper and lower), nor are any |
105 | a member of character classes, like C<[:alpha:]> or C<\w>. (But all do belong | |
106 | to the C<\W> class or the Perl regular expression extension C<[:^alpha:]>.) | |
2bbc8d55 | 107 | |
8cbd9a7a | 108 | This behavior preserves compatibility with earlier versions of Perl, |
376d9008 | 109 | which allowed byte semantics in Perl operations only if |
e1b711da | 110 | none of the program's inputs were marked as being a source of Unicode |
8cbd9a7a | 111 | character data. Such data may come from filehandles, from calls to |
a9130ea9 | 112 | external programs, from information provided by the system (such as C<%ENV>), |
21bad921 | 113 | or from literals and constants in the source text. |
8cbd9a7a | 114 | |
8cbd9a7a | 115 | The C<utf8> pragma is primarily a compatibility device that enables |
75daf61c | 116 | recognition of UTF-(8|EBCDIC) in literals encountered by the parser. |
376d9008 JB |
117 | Note that this pragma is only required while Perl defaults to byte |
118 | semantics; when character semantics become the default, this pragma | |
119 | may become a no-op. See L<utf8>. | |
120 | ||
376d9008 | 121 | If strings operating under byte semantics and strings with Unicode |
51f494cc | 122 | character data are concatenated, the new string will have |
d9b01026 | 123 | character semantics. This can cause surprises: See L</BUGS>, below. |
a9130ea9 | 124 | You can choose to be warned when this happens. See C<L<encoding::warnings>>. |
7dedd01f | 125 | |
feda178f | 126 | Under character semantics, many operations that formerly operated on |
376d9008 | 127 | bytes now operate on characters. A character in Perl is |
feda178f | 128 | logically just a number ranging from 0 to 2**31 or so. Larger |
376d9008 JB |
129 | characters may encode into longer sequences of bytes internally, but |
130 | this internal detail is mostly hidden for Perl code. | |
131 | See L<perluniintro> for more. | |
393fec97 | 132 | |
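The following sketch illustrates the concatenation surprise described above.
It assumes an ASCII-based platform and that C<unicode_strings> is I<not> in
effect; C<Encode> is used only to count the encoded bytes, and all variable
names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $native = "caf\xE9";     # four characters, all below 256
my $wide   = "\x{263A}";    # one character above 255
my $joined = $native . $wide;

# length() counts characters; the UTF-8 encoding needs more bytes.
my $char_count = length $joined;                     # 3 + 1 + 1 characters
my $byte_count = length encode('UTF-8', $joined);    # 3 + 2 + 3 bytes

# The first four characters of $joined compare equal to $native ...
my $head      = substr $joined, 0, 4;
my $same_text = $head eq $native ? 1 : 0;

# ... yet \w treats the E9 character differently in the two strings,
# because only $head now carries character semantics.
my $native_word = $native =~ /\Acaf\w\z/ ? 1 : 0;
my $head_word   = $head   =~ /\Acaf\w\z/ ? 1 : 0;

print "chars=$char_count bytes=$byte_count same=$same_text ",
      "native_word=$native_word head_word=$head_word\n";
```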
376d9008 | 133 | =head2 Effects of Character Semantics |
393fec97 GS |
134 | |
135 | Character semantics have the following effects: | |
136 | ||
137 | =over 4 | |
138 | ||
139 | =item * | |
140 | ||
376d9008 | 141 | Strings--including hash keys--and regular expression patterns may |
b65e6125 | 142 | contain characters that have ordinal values larger than 255. |
393fec97 | 143 | |
2575c402 JW |
144 | If you use a Unicode editor to edit your program, Unicode characters may |
145 | occur directly within the literal strings in UTF-8 encoding, or UTF-16. | |
a9130ea9 | 146 | (The former requires a C<BOM> or C<use utf8>, the latter requires a C<BOM>.) |
3e4dbfed | 147 | |
195e542a KW |
148 | Unicode characters can also be added to a string by using the C<\N{U+...}> |
149 | notation. The Unicode code for the desired character, in hexadecimal, | |
150 | should be placed in the braces, after the C<U>. For instance, a smiley face is | |
6f335b04 KW |
151 | C<\N{U+263A}>. |
152 | ||
a9130ea9 KW |
153 | Alternatively, you can use the C<\x{...}> notation for characters C<0x100> and |
154 | above. For characters below C<0x100> you may get byte semantics instead of | |
6f335b04 | 155 | character semantics; see L</The "Unicode Bug">. On EBCDIC machines there is |
195e542a | 156 | the additional problem that the value for such characters gives the EBCDIC |
0bd42786 KW |
157 | character rather than the Unicode one, thus it is more portable to use |
158 | C<\N{U+...}> instead. | |
3e4dbfed | 159 | |
fbb93542 KW |
160 | Additionally, you can use the C<\N{...}> notation and put the official |
161 | Unicode character name within the braces, such as | |
162 | C<\N{WHITE SMILING FACE}>. This automatically loads the L<charnames> | |
163 | module with the C<:full> and C<:short> options. If you prefer different | |
164 | options for this module, you can instead, before the C<\N{...}>, | |
165 | explicitly load it with your desired options; for example, | |
166 | ||
167 | use charnames ':loose'; | |
376d9008 | 168 | |
393fec97 GS |
169 | =item * |
170 | ||
574c8022 JH |
171 | If an appropriate L<encoding> is specified, identifiers within the |
172 | Perl script may contain Unicode alphanumeric characters, including | |
376d9008 JB |
173 | ideographs. Perl does not currently attempt to canonicalize variable |
174 | names. | |
393fec97 | 175 | |
393fec97 GS |
176 | =item * |
177 | ||
a9130ea9 | 178 | Regular expressions match characters instead of bytes. C<"."> matches |
2575c402 | 179 | a character instead of a byte. |
393fec97 | 180 | |
393fec97 GS |
181 | =item * |
182 | ||
9d1c51c1 | 183 | Bracketed character classes in regular expressions match characters instead of |
376d9008 | 184 | bytes and match against the character properties specified in the |
1bfb14c4 | 185 | Unicode properties database. C<\w> can be used to match a Japanese |
75daf61c | 186 | ideograph, for instance. |
393fec97 | 187 | |
393fec97 GS |
188 | =item * |
189 | ||
9d1c51c1 KW |
190 | Named Unicode properties, scripts, and block ranges may be used (like bracketed |
191 | character classes) by using the C<\p{}> "matches property" construct and | |
822502e5 | 192 | the C<\P{}> negation, "doesn't match property". |
2575c402 | 193 | See L</"Unicode Character Properties"> for more details. |
822502e5 TS |
194 | |
195 | You can define your own character properties and use them | |
196 | in the regular expression with the C<\p{}> or C<\P{}> construct. | |
822502e5 TS |
197 | See L</"User-Defined Character Properties"> for more details. |
198 | ||
199 | =item * | |
200 | ||
9f815e24 KW |
201 | The special pattern C<\X> matches a logical character, an "extended grapheme |
202 | cluster" in Standardese. In Unicode what appears to the user to be a single | |
51f494cc KW |
203 | character, for example an accented C<G>, may in fact be composed of a sequence |
204 | of characters, in this case a C<G> followed by an accent character. C<\X> | |
205 | will match the entire sequence. | |
822502e5 TS |
206 | |
207 | =item * | |
208 | ||
209 | The C<tr///> operator translates characters instead of bytes. Note | |
210 | that the C<tr///CU> functionality has been removed. For similar | |
211 | functionality see C<pack('U0', ...)> and C<pack('C0', ...)>. | |
212 | ||
213 | =item * | |
214 | ||
215 | Case translation operators use the Unicode case translation tables | |
216 | when character input is provided. Note that C<uc()>, or C<\U> in | |
217 | interpolated strings, translates to uppercase, while C<ucfirst>, | |
218 | or C<\u> in interpolated strings, translates to titlecase in languages | |
e1b711da KW |
219 | that make the distinction (which is equivalent to uppercase in languages |
220 | without the distinction). | |
822502e5 TS |
221 | |
222 | =item * | |
223 | ||
224 | Most operators that deal with positions or lengths in a string will | |
225 | automatically switch to using character positions, including | |
226 | C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>, | |
227 | C<sprintf()>, C<write()>, and C<length()>. An operator that | |
51f494cc KW |
228 | specifically does not switch is C<vec()>. Operators that really don't |
229 | care include operators that treat strings as a bucket of bits such as | |
822502e5 TS |
230 | C<sort()>, and operators dealing with filenames. |
231 | ||
232 | =item * | |
233 | ||
51f494cc | 234 | The C<pack()>/C<unpack()> letter C<C> does I<not> change, since it is often |
822502e5 TS |
235 | used for byte-oriented formats. Again, think C<char> in the C language. |
236 | ||
237 | There is a new C<U> specifier that converts between Unicode characters | |
238 | and code points. There is also a C<W> specifier that is the equivalent of | |
239 | C<chr>/C<ord> and properly handles character values even if they are above 255. | |
240 | ||
241 | =item * | |
242 | ||
243 | The C<chr()> and C<ord()> functions work on characters, similar to | |
244 | C<pack("W")> and C<unpack("W")>, I<not> C<pack("C")> and | |
245 | C<unpack("C")>. C<pack("C")> and C<unpack("C")> are methods for | |
246 | emulating byte-oriented C<chr()> and C<ord()> on Unicode strings. | |
247 | While these methods reveal the internal encoding of Unicode strings, | |
248 | that is not something one normally needs to care about at all. | |
249 | ||
250 | =item * | |
251 | ||
252 | The bit string operators, C<& | ^ ~>, can operate on character data. | |
253 | However, for backward compatibility, such as when using bit string | |
254 | operations when characters are all less than 256 in ordinal value, one | |
255 | should not use C<~> (the bit complement) with characters of both | |
256 | values less than 256 and values greater than 255. Most importantly, | |
257 | DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>) | |
258 | will not hold. The reason for this mathematical I<faux pas> is that | |
259 | the complement cannot return B<both> the 8-bit (byte-wide) bit | |
260 | complement B<and> the full character-wide bit complement. | |
261 | ||
262 | =item * | |
263 | ||
a9130ea9 | 264 | There is a CPAN module, C<L<Unicode::Casing>>, which allows you to define |
628253b8 BF |
265 | your own mappings to be used in C<lc()>, C<lcfirst()>, C<uc()>, |
266 | C<ucfirst()>, and C<fc> (or their double-quoted string inlined | |
267 | versions such as C<\U>). | |
268 | (Prior to Perl 5.16, this functionality was partially provided | |
5d1892be KW |
269 | in the Perl core, but suffered from a number of insurmountable |
270 | drawbacks, so the CPAN module was written instead.) | |
822502e5 TS |
271 | |
272 | =back | |
273 | ||
274 | =over 4 | |
275 | ||
276 | =item * | |
277 | ||
278 | And finally, C<scalar reverse()> reverses by character rather than by byte. | |
279 | ||
280 | =back | |
281 | ||
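Several of the effects listed above can be seen in one short sketch. It is
only an illustration with made-up variable names: it loads C<charnames>
explicitly so that it also runs on older Perls, and it uses U+01C6
(LATIN SMALL LETTER DZ WITH CARON) because that letter has distinct uppercase
and titlecase forms.

```perl
use strict;
use warnings;
use charnames ':full';

# \N{U+...} and \N{name} denote the same character.
my $smiley     = "\N{U+263A}";
my $same_char  = $smiley eq "\N{WHITE SMILING FACE}" ? 1 : 0;

# \X matches a base character together with its combining marks.
my $text          = "G\x{301}x";          # G, COMBINING ACUTE ACCENT, x
my @clusters      = $text =~ /(\X)/g;
my $cluster_count = scalar @clusters;
my $first_length  = length $clusters[0];

# uc() gives uppercase; ucfirst() gives titlecase.
my $dz         = "\x{1C6}";
my $upper_code = ord uc $dz;              # U+01C4
my $title_code = ord ucfirst $dz;         # U+01C5

# Positions and lengths count characters, and reverse works by character.
my $str      = "a\x{263A}b";
my $position = index $str, "b";
my $middle   = ord substr $str, 1, 1;
my $reversed = scalar reverse $str;

# pack("W") and unpack("W") agree with chr() and ord().
my $pack_ok = (pack("W", 0x263A) eq chr(0x263A)
               && unpack("W", $smiley) == ord $smiley) ? 1 : 0;

printf "clusters=%d first=%d uc=%04X ucfirst=%04X index=%d\n",
    $cluster_count, $first_length, $upper_code, $title_code, $position;
```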
282 | =head2 Unicode Character Properties | |
283 | ||
ee88f7b6 | 284 | (The only time that Perl considers a sequence of individual code |
9d1c51c1 KW |
285 | points as a single logical character is in the C<\X> construct, already |
286 | mentioned above. Therefore "character" in this discussion means a single | |
ee88f7b6 KW |
287 | Unicode code point.) |
288 | ||
289 | Very nearly all Unicode character properties are accessible through | |
290 | regular expressions by using the C<\p{}> "matches property" construct | |
291 | and the C<\P{}> "doesn't match property" for its negation. | |
51f494cc | 292 | |
9d1c51c1 | 293 | For instance, C<\p{Uppercase}> matches any single character with the Unicode |
a9130ea9 KW |
294 | C<"Uppercase"> property, while C<\p{L}> matches any character with a |
295 | C<General_Category> of C<"L"> (letter) property (see | |
296 | L</General_Category> below). Braces are not | |
9d1c51c1 | 297 | required for single letter property names, so C<\p{L}> is equivalent to C<\pL>. |
51f494cc | 298 | |
9d1c51c1 | 299 | More formally, C<\p{Uppercase}> matches any single character whose Unicode |
a9130ea9 KW |
300 | C<Uppercase> property value is C<True>, and C<\P{Uppercase}> matches any character |
301 | whose C<Uppercase> property value is C<False>, and they could have been written as | |
9d1c51c1 | 302 | C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively. |
51f494cc | 303 | |
b19eb496 | 304 | This formality is needed when properties are not binary; that is, if they can |
a9130ea9 KW |
305 | take on more values than just C<True> and C<False>. For example, the |
306 | C<Bidi_Class> property (see L</"Bidirectional Character Types"> below), | |
307 | can take on several different | |
308 | values, such as C<Left>, C<Right>, C<Whitespace>, and others. To match these, one needs | |
309 | to specify both the property name (C<Bidi_Class>), AND the value being | |
5bff2035 | 310 | matched against |
b65e6125 | 311 | (C<Left>, C<Right>, I<etc.>). This is done, as in the examples above, by having the |
9f815e24 | 312 | two components separated by an equal sign (or interchangeably, a colon), like |
51f494cc KW |
313 | C<\p{Bidi_Class: Left}>. |
314 | ||
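A minimal sketch of the single and compound forms described above follows.
The variable names are illustrative; each one holds 1 if the match succeeds
and 0 otherwise.

```perl
use strict;
use warnings;

# Single form for a binary property, and its negation.
my $upper_matches  = "A" =~ /\p{Uppercase}/ ? 1 : 0;
my $lower_rejected = "a" =~ /\p{Uppercase}/ ? 1 : 0;
my $negated        = "a" =~ /\P{Uppercase}/ ? 1 : 0;

# Compound forms, with either separator.
my $explicit_true  = "A" =~ /\p{Uppercase=True}/  ? 1 : 0;
my $explicit_false = "a" =~ /\p{Uppercase=False}/ ? 1 : 0;
my $colon_form     = "A" =~ /\p{Uppercase: True}/ ? 1 : 0;

# A single-letter property name may be written with or without braces.
my $letter_braced  = "x" =~ /\p{L}/ ? 1 : 0;
my $letter_bare    = "x" =~ /\pL/   ? 1 : 0;

print join(" ", $upper_matches, $lower_rejected, $negated, $explicit_true,
                $explicit_false, $colon_form, $letter_braced, $letter_bare),
      "\n";
```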
315 | All Unicode-defined character properties may be written in these compound forms | |
a9130ea9 | 316 | of C<\p{I<property>=I<value>}> or C<\p{I<property>:I<value>}>, but Perl provides some |
51f494cc KW |
317 | additional properties that are written only in the single form, as well as |
318 | single-form short-cuts for all binary properties and certain others described | |
319 | below, in which you may omit the property name and the equals or colon | |
320 | separator. | |
321 | ||
322 | Most Unicode character properties have at least two synonyms (or aliases if you | |
b19eb496 | 323 | prefer): a short one that is easier to type and a longer one that is more |
a9130ea9 KW |
324 | descriptive and hence easier to understand. Thus the C<"L"> and |
325 | C<"Letter"> properties above are equivalent and can be used | |
326 | interchangeably. Likewise, C<"Upper"> is a synonym for C<"Uppercase">, | |
327 | and we could have written C<\p{Uppercase}> equivalently as C<\p{Upper}>. | |
328 | Also, there are typically various synonyms for the values the property | |
329 | can be. For binary properties, C<"True"> has 3 synonyms: C<"T">, | |
330 | C<"Yes">, and C<"Y">; and C<"False"> has correspondingly C<"F">, | |
331 | C<"No">, and C<"N">. But be careful. A short form of a value for one | |
332 | property may not mean the same thing as the same short form for another. | |
333 | Thus, for the C<L</General_Category>> property, C<"L"> means | |
334 | C<"Letter">, but for the L<C<Bidi_Class>|/Bidirectional Character Types> | |
335 | property, C<"L"> means C<"Left">. A complete list of properties and | |
336 | synonyms is in L<perluniprops>. | |
51f494cc | 337 | |
b19eb496 | 338 | Upper/lower case differences in property names and values are irrelevant; |
51f494cc KW |
339 | thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>. |
340 | Similarly, you can add or subtract underscores anywhere in the middle of a | |
341 | word, so that these are also equivalent to C<\p{U_p_p_e_r}>. And white space | |
342 | is irrelevant adjacent to non-word characters, such as the braces and the equals | |
b19eb496 TC |
343 | or colon separators, so C<\p{ Upper }> and C<\p{ Upper_case : Y }> are |
344 | equivalent to these as well. In fact, white space and even | |
345 | hyphens can usually be added or deleted anywhere. So even C<\p{ Up-per case = Yes}> is | |
51f494cc | 346 | equivalent. All this is called "loose-matching" by Unicode. The few places |
b19eb496 | 347 | where stricter matching is used are in the middle of numbers, and in the Perl |
51f494cc | 348 | extension properties that begin or end with an underscore. Stricter matching |
b19eb496 | 349 | cares about white space (except adjacent to non-word characters), |
51f494cc | 350 | hyphens, and non-interior underscores. |
4193bef7 | 351 | |
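The loose-matching rules can be checked with a sketch like the one below,
which builds each pattern at run time from spellings taken from the text
above. The list and the variable names are illustrative, and the sketch
assumes that every spelling is accepted as described.

```perl
use strict;
use warnings;

# Spellings that should all name the Uppercase property.
my @spellings = (
    'Upper', 'upper', 'UpPeR', 'U_p_p_e_r',
    ' Upper ', 'Uppercase=Yes', ' Upper_case : Y ',
);

my $all_equivalent = 1;
for my $name (@spellings) {
    # Assemble \p{...} as a string so the name can vary.
    my $pattern = '\p{' . $name . '}';

    my $hits_upper = "A" =~ /\A$pattern\z/ ? 1 : 0;
    my $hits_lower = "a" =~ /\A$pattern\z/ ? 1 : 0;

    # Each spelling should accept "A" and reject "a".
    $all_equivalent = 0 unless $hits_upper && !$hits_lower;
}

print "all spellings equivalent: $all_equivalent\n";
```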
376d9008 | 352 | You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret |
a9130ea9 | 353 | (C<^>) between the first brace and the property name: C<\p{^Tamil}> is |
eb0cc9e3 | 354 | equal to C<\P{Tamil}>. |
4193bef7 | 355 | |
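A brief sketch of the caret form follows, using U+0B85 (TAMIL LETTER A) and
an ASCII letter; the variable names are illustrative.

```perl
use strict;
use warnings;

my $tamil_a = "\x{0B85}";   # TAMIL LETTER A
my $latin_a = "A";

# \p{^Tamil} behaves like \P{Tamil}.
my $caret_on_tamil = $tamil_a =~ /\p{^Tamil}/ ? 1 : 0;
my $caret_on_latin = $latin_a =~ /\p{^Tamil}/ ? 1 : 0;
my $big_p_on_latin = $latin_a =~ /\P{Tamil}/  ? 1 : 0;

# A caret inside \P{} negates the negation.
my $double_negation = $tamil_a =~ /\P{^Tamil}/ ? 1 : 0;

print "$caret_on_tamil $caret_on_latin $big_p_on_latin $double_negation\n";
```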
56ca34ca KW |
356 | Almost all properties are immune to case-insensitive matching. That is, |
357 | adding a C</i> regular expression modifier does not change what they | |
358 | match. There are two sets that are affected. | |
359 | The first set is | |
360 | C<Uppercase_Letter>, | |
361 | C<Lowercase_Letter>, | |
362 | and C<Titlecase_Letter>, | |
363 | all of which match C<Cased_Letter> under C</i> matching. | |
364 | And the second set is | |
365 | C<Uppercase>, | |
366 | C<Lowercase>, | |
367 | and C<Titlecase>, | |
368 | all of which match C<Cased> under C</i> matching. | |
369 | This set also includes its subsets C<PosixUpper> and C<PosixLower> both | |
a9130ea9 | 370 | of which under C</i> match C<PosixAlpha>. |
56ca34ca | 371 | (The difference between these sets is that some things, such as Roman |
b65e6125 KW |
372 | numerals, come in both upper and lower case so they are C<Cased>, but |
373 | aren't considered letters, so they aren't C<Cased_Letter>'s.) | |
56ca34ca | 374 | |
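The effect of C</i> on these two sets can be sketched as follows. The example
assumes Perl 5.14 or later and uses U+2170 (SMALL ROMAN NUMERAL ONE) as a
character that is cased but is not a letter; the variable names are
illustrative.

```perl
use strict;
use warnings;

my $small_a     = "a";
my $roman_small = "\x{2170}";   # SMALL ROMAN NUMERAL ONE

# Without /i, a lowercase letter is not an Uppercase_Letter.
my $lu_plain = $small_a =~ /\p{Lu}/  ? 1 : 0;
# With /i, \p{Lu} matches any Cased_Letter.
my $lu_fold  = $small_a =~ /\p{Lu}/i ? 1 : 0;

# The Roman numeral is Cased, so \p{Uppercase} matches it only under /i ...
my $roman_upper_plain = $roman_small =~ /\p{Uppercase}/  ? 1 : 0;
my $roman_upper_fold  = $roman_small =~ /\p{Uppercase}/i ? 1 : 0;
# ... but it is not a Cased_Letter, so \p{Lu} never matches it.
my $roman_lu_fold     = $roman_small =~ /\p{Lu}/i ? 1 : 0;

# PosixUpper matches PosixAlpha under /i.
my $posix_fold = $small_a =~ /\p{PosixUpper}/i ? 1 : 0;

print "$lu_plain $lu_fold $roman_upper_plain $roman_upper_fold ",
      "$roman_lu_fold $posix_fold\n";
```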
2d88a86a KW |
375 | See L</Beyond Unicode code points> for special considerations when |
376 | matching Unicode properties against non-Unicode code points. | |
94b42e47 | 377 | |
51f494cc | 378 | =head3 B<General_Category> |
14bb0a9a | 379 | |
51f494cc KW |
380 | Every Unicode character is assigned a general category, which is the "most |
381 | usual categorization of a character" (from | |
382 | L<http://www.unicode.org/reports/tr44>). | |
822502e5 | 383 | |
9f815e24 | 384 | The compound way of writing these is like C<\p{General_Category=Number}> |
b65e6125 | 385 | (short: C<\p{gc:n}>). But Perl furnishes shortcuts in which everything up |
51f494cc KW |
386 | through the equal or colon separator is omitted. So you can instead just write |
387 | C<\pN>. | |
822502e5 | 388 | |
a9130ea9 KW |
389 | Here are the short and long forms of the values the C<General_Category> property |
390 | can have: | |
393fec97 | 391 | |
d73e5302 JH |
392 | Short Long |
393 | ||
394 | L Letter | |
51f494cc KW |
395 | LC, L& Cased_Letter (that is: [\p{Ll}\p{Lu}\p{Lt}]) |
396 | Lu Uppercase_Letter | |
397 | Ll Lowercase_Letter | |
398 | Lt Titlecase_Letter | |
399 | Lm Modifier_Letter | |
400 | Lo Other_Letter | |
d73e5302 JH |
401 | |
402 | M Mark | |
51f494cc KW |
403 | Mn Nonspacing_Mark |
404 | Mc Spacing_Mark | |
405 | Me Enclosing_Mark | |
d73e5302 JH |
406 | |
407 | N Number | |
51f494cc KW |
408 | Nd Decimal_Number (also Digit) |
409 | Nl Letter_Number | |
410 | No Other_Number | |
411 | ||
412 | P Punctuation (also Punct) | |
413 | Pc Connector_Punctuation | |
414 | Pd Dash_Punctuation | |
415 | Ps Open_Punctuation | |
416 | Pe Close_Punctuation | |
417 | Pi Initial_Punctuation | |
d73e5302 | 418 | (may behave like Ps or Pe depending on usage) |
51f494cc | 419 | Pf Final_Punctuation |
d73e5302 | 420 | (may behave like Ps or Pe depending on usage) |
51f494cc | 421 | Po Other_Punctuation |
d73e5302 JH |
422 | |
423 | S Symbol | |
51f494cc KW |
424 | Sm Math_Symbol |
425 | Sc Currency_Symbol | |
426 | Sk Modifier_Symbol | |
427 | So Other_Symbol | |
d73e5302 JH |
428 | |
429 | Z Separator | |
51f494cc KW |
430 | Zs Space_Separator |
431 | Zl Line_Separator | |
432 | Zp Paragraph_Separator | |
d73e5302 JH |
433 | |
434 | C Other | |
d88362ca | 435 | Cc Control (also Cntrl) |
e150c829 | 436 | Cf Format |
6d4f9cf2 | 437 | Cs Surrogate |
51f494cc | 438 | Co Private_Use |
e150c829 | 439 | Cn Unassigned |
1ac13f9a | 440 | |
376d9008 | 441 | Single-letter properties match all characters in any of the |
3e4dbfed | 442 | two-letter sub-properties starting with the same letter. |
b19eb496 | 443 | C<LC> and C<L&> are special: both are aliases for the set consisting of everything matched by C<Ll>, C<Lu>, and C<Lt>. |
32293815 | 444 | |
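As an illustration of the table above, the following sketch checks a few
characters against one-letter and two-letter categories. The sample
characters and variable names are chosen only for the example.

```perl
use strict;
use warnings;

my $digit = "5";           # Nd
my $roman = "\x{2160}";    # ROMAN NUMERAL ONE, Nl
my $third = "\x{2153}";    # VULGAR FRACTION ONE THIRD, No
my $alef  = "\x{05D0}";    # HEBREW LETTER ALEF, Lo

# Long, short, and shortcut spellings of the same test.
my $long_form  = $digit =~ /\p{General_Category=Number}/ ? 1 : 0;
my $short_form = $digit =~ /\p{gc:n}/                    ? 1 : 0;
my $shortcut   = $digit =~ /\pN/                         ? 1 : 0;

# N covers Nd, Nl, and No; each two-letter value is narrower.
my $all_numbers   = (grep { /\pN/ } $digit, $roman, $third) == 3 ? 1 : 0;
my $roman_is_nl   = $roman =~ /\p{Nl}/ ? 1 : 0;
my $roman_is_nd   = $roman =~ /\p{Nd}/ ? 1 : 0;
my $third_is_no   = $third =~ /\p{No}/ ? 1 : 0;

# LC and L& cover only Ll, Lu, and Lt, so a caseless letter is excluded.
my $alef_is_letter = $alef =~ /\pL/    ? 1 : 0;
my $alef_is_cased  = $alef =~ /\p{LC}/ ? 1 : 0;
my $a_is_cased     = ("a" =~ /\p{LC}/ && "a" =~ /\p{L&}/) ? 1 : 0;

print "$long_form $short_form $shortcut $all_numbers $roman_is_nl ",
      "$roman_is_nd $third_is_no $alef_is_letter $alef_is_cased $a_is_cased\n";
```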
51f494cc | 445 | =head3 B<Bidirectional Character Types> |
822502e5 | 446 | |
b19eb496 | 447 | Because scripts differ in their directionality (Hebrew and Arabic are |
a9130ea9 | 448 | written right to left, for example) Unicode supplies a C<Bidi_Class> property. |
1850f57f | 449 | Some of the values this property can have are: |
32293815 | 450 | |
88af3b93 | 451 | Value Meaning |
92e830a9 | 452 | |
12ac2576 JP |
453 | L Left-to-Right |
454 | LRE Left-to-Right Embedding | |
455 | LRO Left-to-Right Override | |
456 | R Right-to-Left | |
51f494cc | 457 | AL Arabic Letter |
12ac2576 JP |
458 | RLE Right-to-Left Embedding |
459 | RLO Right-to-Left Override | |
460 | PDF Pop Directional Format | |
461 | EN European Number | |
51f494cc KW |
462 | ES European Separator |
463 | ET European Terminator | |
12ac2576 | 464 | AN Arabic Number |
51f494cc | 465 | CS Common Separator |
12ac2576 JP |
466 | NSM Non-Spacing Mark |
467 | BN Boundary Neutral | |
468 | B Paragraph Separator | |
469 | S Segment Separator | |
470 | WS Whitespace | |
471 | ON Other Neutrals | |
472 | ||
51f494cc KW |
473 | This property is always written in the compound form. |
474 | For example, C<\p{Bidi_Class:R}> matches characters that are normally | |
1850f57f | 475 | written right to left. Unlike the |
a9130ea9 | 476 | C<L</General_Category>> property, this |
1850f57f KW |
477 | property can have more values added in a future Unicode release. Those |
478 | listed above comprised the complete set for many Unicode releases, but | |
479 | others were added in Unicode 6.3; you can always find what the | |
480 | current ones are in L<perluniprops>. And | |
481 | L<http://www.unicode.org/reports/tr9/> describes how to use them. | |
eb0cc9e3 | 482 | |
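A short sketch of matching on C<Bidi_Class> with the short values from the
table follows; the sample characters and variable names are illustrative.

```perl
use strict;
use warnings;

my $hebrew_alef = "\x{05D0}";   # HEBREW LETTER ALEF
my $arabic_alef = "\x{0627}";   # ARABIC LETTER ALEF

# The property is always written in the compound form.
my $latin_is_l  = "A"          =~ /\p{Bidi_Class:L}/  ? 1 : 0;
my $hebrew_is_r = $hebrew_alef =~ /\p{Bidi_Class:R}/  ? 1 : 0;
my $arabic_is_al = $arabic_alef =~ /\p{Bidi_Class:AL}/ ? 1 : 0;
my $digit_is_en = "1"          =~ /\p{Bidi_Class=EN}/ ? 1 : 0;
my $space_is_ws = " "          =~ /\p{Bidi_Class=WS}/ ? 1 : 0;

# Arabic letters have their own value, so R does not match them.
my $arabic_is_r = $arabic_alef =~ /\p{Bidi_Class:R}/ ? 1 : 0;

print "$latin_is_l $hebrew_is_r $arabic_is_al $digit_is_en ",
      "$space_is_ws $arabic_is_r\n";
```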
51f494cc KW |
483 | =head3 B<Scripts> |
484 | ||
b19eb496 | 485 | The world's languages are written in many different scripts. This sentence |
e1b711da | 486 | (unless you're reading it in translation) is written in Latin, while Russian is |
c69ca1d4 | 487 | written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in |
e1b711da | 488 | Hiragana or Katakana. There are many more. |
51f494cc | 489 | |
b65e6125 | 490 | The Unicode C<Script> and C<Script_Extensions> properties give what script a |
82aed44a KW |
491 | given character is in. Either property can be specified with the |
492 | compound form like | |
493 | C<\p{Script=Hebrew}> (short: C<\p{sc=hebr}>), or | |
494 | C<\p{Script_Extensions=Javanese}> (short: C<\p{scx=java}>). | |
495 | In addition, Perl furnishes shortcuts for all | |
496 | C<Script> property names. You can omit everything up through the equals | |
497 | (or colon), and simply write C<\p{Latin}> or C<\P{Cyrillic}>. | |
498 | (This is not true for C<Script_Extensions>, which is required to be | |
499 | written in the compound form.) | |
500 | ||
501 | The difference between these two properties involves characters that are | |
502 | used in multiple scripts. For example the digits '0' through '9' are | |
503 | used in many parts of the world. These are placed in a script named | |
504 | C<Common>. Other characters are used in just a few scripts. For | |
a9130ea9 | 505 | example, the C<"KATAKANA-HIRAGANA DOUBLE HYPHEN"> is used in both Japanese |
82aed44a KW |
506 | scripts, Katakana and Hiragana, but nowhere else. The C<Script> |
507 | property places all characters that are used in multiple scripts in the | |
508 | C<Common> script, while the C<Script_Extensions> property places those | |
509 | that are used in only a few scripts into each of those scripts; while | |
510 | still using C<Common> for those used in many scripts. Thus both these | |
511 | match: | |
512 | ||
513 | "0" =~ /\p{sc=Common}/ # Matches | |
514 | "0" =~ /\p{scx=Common}/ # Matches | |
515 | ||
516 | and only the first of these matches: | |
517 | ||
518 | "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Common}/ # Matches | |
519 | "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Common}/ # No match | |
520 | ||
521 | And only the last two of these match: | |
522 | ||
523 | "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Hiragana}/ # No match | |
524 | "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Katakana}/ # No match | |
525 | "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Hiragana}/ # Matches | |
526 | "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Katakana}/ # Matches | |
527 | ||
528 | C<Script_Extensions> is thus an improved C<Script>, in which there are | |
529 | fewer characters in the C<Common> script, and correspondingly more in | |
530 | other scripts. It is new in Unicode version 6.0, and its data are likely | |
531 | to change significantly in later releases, as things get sorted out. | |
b65e6125 KW |
532 | New code should probably be using C<Script_Extensions> and not plain |
533 | C<Script>. | |
82aed44a KW |
534 | |
535 | (Actually, besides C<Common>, the C<Inherited> script also contains | |
536 | characters that are used in multiple scripts. These are modifier | |
b65e6125 | 537 | characters which inherit the script value |
82aed44a KW |
538 | of the controlling character. Some of these are used in many scripts, |
539 | and so go into C<Inherited> in both C<Script> and C<Script_Extensions>. | |
540 | Others are used in just a few scripts, so are in C<Inherited> in | |
541 | C<Script>, but not in C<Script_Extensions>.) | |
542 | ||
543 | It is worth stressing that there are several different sets of digits in | |
544 | Unicode that are equivalent to 0-9 and are matchable by C<\d> in a | |
545 | regular expression. If they are used in a single language only, they | |
546 | are in that language's C<Script> and C<Script_Extensions>. If they are | |
547 | used in more than one script, they will be in C<sc=Common>, but only | |
548 | if they are used in many scripts should they be in C<scx=Common>. | |
51f494cc KW |
549 | |
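The point about digits can be illustrated with U+0663 (ARABIC-INDIC DIGIT
THREE). The sketch assumes Perl 5.14 or later for the C</a> modifier, and the
variable names are illustrative.

```perl
use strict;
use warnings;

my $arabic_three = "\x{0663}";   # ARABIC-INDIC DIGIT THREE

# \d matches decimal digits from any script ...
my $unicode_digit = $arabic_three =~ /\A\d\z/     ? 1 : 0;
my $is_nd         = $arabic_three =~ /\A\p{Nd}\z/ ? 1 : 0;

# ... unless matching is restricted to ASCII with /a or an explicit range.
my $ascii_only    = $arabic_three =~ /\A\d\z/a    ? 1 : 0;
my $bracket_range = $arabic_three =~ /\A[0-9]\z/  ? 1 : 0;

print "$unicode_digit $is_nd $ascii_only $bracket_range\n";
```

Code that needs only the ASCII digits should therefore say so explicitly.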
550 | A complete list of scripts and their shortcuts is in L<perluniprops>. | |
551 | ||
a9130ea9 | 552 | =head3 B<Use of the C<"Is"> Prefix> |
822502e5 | 553 | |
b65e6125 KW |
554 | For backward compatibility (with Perl 5.6), all properties writable |
555 | without using the compound form mentioned | |
51f494cc KW |
556 | so far may have C<Is> or C<Is_> prepended to their name, so C<\P{Is_Lu}>, for |
557 | example, is equal to C<\P{Lu}>, and C<\p{IsScript:Arabic}> is equal to | |
558 | C<\p{Arabic}>. | |
eb0cc9e3 | 559 | |
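A minimal sketch of the C<Is> prefix, limited to the C<Lu> property and with
illustrative variable names:

```perl
use strict;
use warnings;

# The plain name and both prefixed spellings select the same property.
my $plain      = "A" =~ /\p{Lu}/    ? 1 : 0;
my $is_prefix  = "A" =~ /\p{IsLu}/  ? 1 : 0;
my $is_under   = "A" =~ /\p{Is_Lu}/ ? 1 : 0;

# The negated construct accepts the prefix as well.
my $negated    = "a" =~ /\P{Is_Lu}/ ? 1 : 0;

print "$plain $is_prefix $is_under $negated\n";
```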
51f494cc | 560 | =head3 B<Blocks> |
2796c109 | 561 | |
1bfb14c4 JH |
562 | In addition to B<scripts>, Unicode also defines B<blocks> of |
563 | characters. The difference between scripts and blocks is that the | |
564 | concept of scripts is closer to natural languages, while the concept | |
51f494cc | 565 | of blocks is more of an artificial grouping based on groups of Unicode |
a9130ea9 | 566 | characters with consecutive ordinal values. For example, the C<"Basic Latin"> |
b65e6125 | 567 | block is all the characters whose ordinals are between 0 and 127, inclusive; in |
a9130ea9 KW |
568 | other words, the ASCII characters. The C<"Latin"> script contains some letters |
569 | from this as well as several other blocks, like C<"Latin-1 Supplement">, | |
b65e6125 | 570 | C<"Latin Extended-A">, I<etc.>, but it does not contain all the characters from |
7be67b37 | 571 | those blocks. It does not, for example, contain the digits 0-9, because |
82aed44a KW |
572 | those digits are shared across many scripts, and hence are in the |
573 | C<Common> script. | |
51f494cc KW |
574 | |
575 | For more about scripts versus blocks, see UAX#24 "Unicode Script Property": | |
576 | L<http://www.unicode.org/reports/tr24> | |
577 | ||
82aed44a KW |
578 | The C<Script> or C<Script_Extensions> properties are likely to be the |
579 | ones you want to use when processing | |
a9130ea9 | 580 | natural language; the C<Block> property may occasionally be useful in working |
b19eb496 | 581 | with the nuts and bolts of Unicode. |
51f494cc KW |
582 | |
583 | Block names are matched in the compound form, like C<\p{Block: Arrows}> or | |
b19eb496 | 584 | C<\p{Blk=Hebrew}>. Unlike most other properties, only a few block names have a |
51f494cc KW |
585 | Unicode-defined short name. But Perl does provide a (slight) shortcut: You |
586 | can say, for example C<\p{In_Arrows}> or C<\p{In_Hebrew}>. For backwards | |
587 | compatibility, the C<In> prefix may be omitted if there is no naming conflict | |
588 | with a script or any other property, and you can even use an C<Is> prefix | |
589 | instead in those cases. But it is not a good idea to do this, for a couple | |
590 | of reasons: | |
591 | ||
592 | =over 4 | |
593 | ||
594 | =item 1 | |
595 | ||
596 | It is confusing. There are many naming conflicts, and you may forget some. | |
9f815e24 | 597 | For example, C<\p{Hebrew}> means the I<script> Hebrew, and NOT the I<block> |
51f494cc KW |
598 | Hebrew. But would you remember that 6 months from now? |
599 | ||
600 | =item 2 | |
601 | ||
3e2dd9ee | 602 | It is unstable. A new version of Unicode may preempt the current meaning by |
51f494cc | 603 | creating a property with the same name. There was a time in very early Unicode |
9f815e24 | 604 | releases when C<\p{Hebrew}> would have matched the I<block> Hebrew; now it |
51f494cc | 605 | doesn't. |
32293815 | 606 | |
393fec97 GS |
607 | =back |
608 | ||
b19eb496 TC |
609 | Some people prefer to always use C<\p{Block: foo}> and C<\p{Script: bar}> |
610 | instead of the shortcuts, whether for clarity, because they can't remember the | |
611 | difference between 'In' and 'Is' anyway, or they aren't confident that those who | |
612 | eventually will read their code will know that difference. | |
51f494cc KW |
613 | |
614 | A complete list of blocks and their shortcuts is in L<perluniprops>. | |
615 | ||
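The script/block distinction described above can be probed directly. The
following is only a minimal sketch; the helper name C<probe> is made up for
illustration, and the results assume the Unicode tables shipped with a
reasonably recent Perl.

```perl
use strict;
use warnings;

# Hypothetical helper: report which of a few script and block
# properties a single character matches (1 for yes, 0 for no).
sub probe {
    my ($char) = @_;
    return {
        script_latin      => ($char =~ /\p{Script: Latin}/      ? 1 : 0),
        script_common     => ($char =~ /\p{Script: Common}/     ? 1 : 0),
        script_hebrew     => ($char =~ /\p{Script: Hebrew}/     ? 1 : 0),
        block_basic_latin => ($char =~ /\p{Block: Basic_Latin}/ ? 1 : 0),
        block_hebrew      => ($char =~ /\p{Block: Hebrew}/      ? 1 : 0),
    };
}

my $digit  = probe('1');          # in the Basic Latin block, Common script
my $letter = probe('A');          # in the Basic Latin block, Latin script
my $yod    = probe(chr(0xFB1D));  # Hebrew script, but outside the Hebrew block

for my $result ($digit, $letter, $yod) {
    print join(', ', map { "$_=$result->{$_}" } sort keys %$result), "\n";
}
```

The digit shows why a block test is a poor stand-in for a script test: it
sits in C<"Basic Latin"> yet belongs to the C<Common> script. C<U+FB1D> shows
the reverse case, a Hebrew-script letter that lives in a different block.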
9f815e24 KW |
616 | =head3 B<Other Properties> |
617 | ||
618 | There are many more properties than the very basic ones described here. | |
619 | A complete list is in L<perluniprops>. | |
620 | ||
621 | Unicode defines all its properties in the compound form, so all single-form | |
b19eb496 TC |
622 | properties are Perl extensions. Most of these are just synonyms for the |
623 | Unicode ones, but some are genuine extensions, including several that are in | |
9f815e24 KW |
624 | the compound form. And quite a few of these are actually recommended by Unicode |
625 | (in L<http://www.unicode.org/reports/tr18>). | |
626 | ||
5bff2035 KW |
627 | This section gives some details on all extensions that aren't just |
628 | synonyms for compound-form Unicode properties | |
629 | (for those properties, you'll have to refer to the | |
9f815e24 KW |
630 | L<Unicode Standard|http://www.unicode.org/reports/tr44>). | |
631 | ||
632 | =over | |
633 | ||
634 | =item B<C<\p{All}>> | |
635 | ||
2d88a86a KW |
636 | This matches every possible code point. It is equivalent to C<qr/./s>. |
637 | Unlike all the other non-user-defined C<\p{}> property matches, no | |
638 | warning is ever generated if this property is matched against a | |
639 | non-Unicode code point (see L</Beyond Unicode code points> below). | |
9f815e24 KW |
640 | |
641 | =item B<C<\p{Alnum}>> | |
642 | ||
643 | This matches any C<\p{Alphabetic}> or C<\p{Decimal_Number}> character. | |
644 | ||
645 | =item B<C<\p{Any}>> | |
646 | ||
2d88a86a KW |
647 | This matches any of the 1_114_112 Unicode code points. It is a synonym |
648 | for C<\p{Unicode}>. | |
9f815e24 | 649 | |
42581d5d KW |
650 | =item B<C<\p{ASCII}>> |
651 | ||
652 | This matches any of the 128 characters in the US-ASCII character set, | |
653 | which is a subset of Unicode. | |
654 | ||
9f815e24 KW |
655 | =item B<C<\p{Assigned}>> |
656 | ||
a9130ea9 KW |
657 | This matches any assigned code point; that is, any code point whose L<general |
658 | category|/General_Category> is not C<Unassigned> (or equivalently, not C<Cn>). | |
9f815e24 KW |
659 | |
660 | =item B<C<\p{Blank}>> | |
661 | ||
662 | This is the same as C<\h> and C<\p{HorizSpace}>: A character that changes the | |
663 | spacing horizontally. | |
664 | ||
665 | =item B<C<\p{Decomposition_Type: Non_Canonical}>> (Short: C<\p{Dt=NonCanon}>) | |
666 | ||
667 | Matches a character that has a non-canonical decomposition. | |
668 | ||
a9130ea9 | 669 | To understand the use of this rarely used I<property=value> combination, it is |
9f815e24 KW |
670 | necessary to know some basics about decomposition. |
671 | Consider a character, say H. It could appear with various marks around it, | |
672 | such as an acute accent, or a circumflex, or various hooks, circles, arrows, | |
b19eb496 | 673 | I<etc.>, above, below, to one side or the other, etc. There are many |
9f815e24 KW |
674 | possibilities among the world's languages. The number of combinations is |
675 | astronomical, and if there were a character for each combination, it would | |
676 | soon exhaust Unicode's more than a million possible characters. So Unicode | |
677 | took a different approach: there is a character for the base H, and a | |
b19eb496 | 678 | character for each of the possible marks, and these can be variously combined |
9f815e24 KW |
679 | to get a final logical character. So a logical character--what appears to be a |
680 | single character--can be a sequence of more than one individual character. | |
b19eb496 TC |
681 | This is called an "extended grapheme cluster"; Perl furnishes the C<\X> |
682 | regular expression construct to match such sequences. | |
9f815e24 KW |
683 | |
684 | But Unicode's intent is to unify the existing character set standards and | |
b19eb496 | 685 | practices, and several pre-existing standards have single characters that |
9f815e24 | 686 | mean the same thing as some of these combinations. An example is ISO-8859-1, |
a9130ea9 KW |
687 | which has quite a few of these in the Latin-1 range, an example being C<"LATIN |
688 | CAPITAL LETTER E WITH ACUTE">. Because this character was in this pre-existing | |
9f815e24 | 689 | standard, Unicode added it to its repertoire. But this character is considered |
b19eb496 | 690 | by Unicode to be equivalent to the sequence consisting of the character |
a9130ea9 | 691 | C<"LATIN CAPITAL LETTER E"> followed by the character C<"COMBINING ACUTE ACCENT">. |
9f815e24 | 692 | |
a9130ea9 | 693 | C<"LATIN CAPITAL LETTER E WITH ACUTE"> is called a "pre-composed" character, and |
b19eb496 | 694 | its equivalence with the sequence is called canonical equivalence. All |
9f815e24 | 695 | pre-composed characters are said to have a decomposition (into the equivalent |
b19eb496 | 696 | sequence), and the decomposition type is also called canonical. |
9f815e24 KW |
697 | |
698 | However, many more characters have a different type of decomposition, a | |
699 | "compatible" or "non-canonical" decomposition. The sequences that form these | |
700 | decompositions are not considered canonically equivalent to the pre-composed | |
a9130ea9 | 701 | character. An example, again in the Latin-1 range, is the C<"SUPERSCRIPT ONE">. |
b19eb496 | 702 | It is somewhat like a regular digit 1, but not exactly; its decomposition |
9f815e24 KW |
703 | into the digit 1 is called a "compatible" decomposition, specifically a |
704 | "super" decomposition. There are several such compatibility | |
b65e6125 KW |
705 | decompositions (see L<http://www.unicode.org/reports/tr44>), including |
706 | one called "compat", which means some miscellaneous type of | |
707 | decomposition that doesn't fit into the other decomposition categories | |
708 | that Unicode has chosen. | |
9f815e24 KW |
709 | |
710 | Note that most Unicode characters don't have a decomposition, so their | |
a9130ea9 | 711 | decomposition type is C<"None">. |
9f815e24 | 712 | |
b19eb496 TC |
713 | For your convenience, Perl has added the C<Non_Canonical> decomposition |
714 | type to mean any of the several compatibility decompositions. | |
9f815e24 KW |
715 | |
716 | =item B<C<\p{Graph}>> | |
717 | ||
718 | Matches any character that is graphic. Theoretically, this means a character | |
719 | that on a printer would cause ink to be used. | |
720 | ||
721 | =item B<C<\p{HorizSpace}>> | |
722 | ||
b19eb496 | 723 | This is the same as C<\h> and C<\p{Blank}>: a character that changes the |
9f815e24 KW |
724 | spacing horizontally. |
725 | ||
42581d5d | 726 | =item B<C<\p{In=*}>> |
9f815e24 KW |
727 | |
728 | This is a synonym for C<\p{Present_In=*}>. | |
729 | ||
730 | =item B<C<\p{PerlSpace}>> | |
731 | ||
d28d8023 | 732 | This is the same as C<\s>, restricted to ASCII, namely C<S<[ \f\n\r\t]>> |
779cf272 | 733 | and starting in Perl v5.18, a vertical tab. |
9f815e24 KW |
734 | |
735 | Mnemonic: Perl's (original) space. | |
736 | ||
737 | =item B<C<\p{PerlWord}>> | |
738 | ||
739 | This is the same as C<\w>, restricted to ASCII, namely C<[A-Za-z0-9_]>. | |
740 | ||
741 | Mnemonic: Perl's (original) word. | |
742 | ||
42581d5d | 743 | =item B<C<\p{Posix...}>> |
9f815e24 | 744 | |
b65e6125 KW |
745 | There are several of these, which are equivalents, using the C<\p{}> |
746 | notation, for Posix classes and are described in | |
42581d5d | 747 | L<perlrecharclass/POSIX Character Classes>. |
9f815e24 KW |
748 | |
749 | =item B<C<\p{Present_In: *}>> (Short: C<\p{In=*}>) | |
750 | ||
751 | This property is used when you need to know in what Unicode version(s) a | |
752 | character is. | |
753 | ||
754 | The "*" above stands for some two digit Unicode version number, such as | |
755 | C<1.1> or C<4.0>; or the "*" can also be C<Unassigned>. This property will | |
756 | match the code points whose final disposition has been settled as of the | |
757 | Unicode release given by the version number; C<\p{Present_In: Unassigned}> | |
758 | will match those code points whose meaning has yet to be assigned. | |
759 | ||
a9130ea9 | 760 | For example, C<U+0041> C<"LATIN CAPITAL LETTER A"> was present in the very first |
9f815e24 KW |
761 | Unicode release available, which is C<1.1>, so this property is true for all |
762 | valid "*" versions. On the other hand, C<U+1EFF> was not assigned until version | |
a9130ea9 | 763 | 5.1 when it became C<"LATIN SMALL LETTER Y WITH LOOP">, so the only "*" values that |
9f815e24 KW |
764 | would match it are 5.1, 5.2, and later. |
765 | ||
766 | Unicode furnishes the C<Age> property from which this is derived. The problem | |
767 | with Age is that a strict interpretation of it (which Perl takes) has it | |
768 | matching the precise release a code point's meaning is introduced in. Thus | |
769 | C<U+0041> would match only 1.1; and C<U+1EFF> only 5.1. This is not usually what | |
770 | you want. | |
771 | ||
772 | Some non-Perl implementations of the Age property may change its meaning to be | |
a9130ea9 | 773 | the same as the Perl C<Present_In> property; just be aware of that. |
9f815e24 KW |
774 | |
775 | Another confusion with both these properties is that the definition is not | |
b19eb496 TC |
776 | that the code point has been I<assigned>, but that the meaning of the code point |
777 | has been I<determined>. This is because 66 code points will always be | |
a9130ea9 | 778 | unassigned, and so the C<Age> for them is the Unicode version in which the decision |
b19eb496 | 779 | to make them so was made. For example, C<U+FDD0> is to be permanently |
9f815e24 | 780 | unassigned to a character, and the decision to do that was made in version 3.1, |
b19eb496 | 781 | so C<\p{Age=3.1}> matches this character, as do C<\p{Present_In: 3.1}> and up. |
9f815e24 KW |
782 | |
783 | =item B<C<\p{Print}>> | |
784 | ||
ae5b72c8 | 785 | This matches any character that is graphical or blank, except controls. |
9f815e24 KW |
786 | |
787 | =item B<C<\p{SpacePerl}>> | |
788 | ||
789 | This is the same as C<\s>, including beyond ASCII. | |
790 | ||
4d4acfba | 791 | Mnemonic: Space, as modified by Perl. (It doesn't include the vertical tab |
779cf272 | 792 | until v5.18, which both the Posix standard and Unicode consider white space.) |
9f815e24 | 793 | |
4364919a KW |
794 | =item B<C<\p{Title}>> and B<C<\p{Titlecase}>> |
795 | ||
796 | Under case-sensitive matching, these both match the same code points as | |
797 | C<\p{General Category=Titlecase_Letter}> (C<\p{gc=lt}>). The difference | |
798 | is that under C</i> caseless matching, these match the same as | |
799 | C<\p{Cased}>, whereas C<\p{gc=lt}> matches C<\p{Cased_Letter}>. | |
800 | ||
2d88a86a KW |
801 | =item B<C<\p{Unicode}>> |
802 | ||
803 | This matches any of the 1_114_112 Unicode code points. | |
804 | It is a synonym for C<\p{Any}>. | |
805 | ||
9f815e24 KW |
806 | =item B<C<\p{VertSpace}>> |
807 | ||
808 | This is the same as C<\v>: A character that changes the spacing vertically. | |
809 | ||
810 | =item B<C<\p{Word}>> | |
811 | ||
b19eb496 | 812 | This is the same as C<\w>, including over 100_000 characters beyond ASCII. |
9f815e24 | 813 | |
42581d5d KW |
814 | =item B<C<\p{XPosix...}>> |
815 | ||
b19eb496 | 816 | There are several of these, which are the standard Posix classes |
42581d5d KW |
817 | extended to the full Unicode range. They are described in |
818 | L<perlrecharclass/POSIX Character Classes>. | |
819 | ||
9f815e24 KW |
820 | =back |
821 | ||
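Two of the properties in the list above are easier to follow with concrete
code points. This is a sketch only: it uses the core C<Unicode::Normalize>
module and the example characters named in the text, and the variable names
are invented for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFKD);

my $e_acute   = "\x{C9}";    # LATIN CAPITAL LETTER E WITH ACUTE
my $super_one = "\x{B9}";    # SUPERSCRIPT ONE
my $y_loop    = chr(0x1EFF); # LATIN SMALL LETTER Y WITH LOOP

# Canonical decomposition: NFD splits the pre-composed character into
# the base letter plus COMBINING ACUTE ACCENT.
my $nfd_e = NFD($e_acute);

# Compatibility decomposition: NFD leaves SUPERSCRIPT ONE alone,
# only NFKD maps it to the plain digit.
my $nfd_one  = NFD($super_one);
my $nfkd_one = NFKD($super_one);

# Perl's Non_Canonical value groups all the compatibility types.
my $e_noncanon   = $e_acute   =~ /\p{Dt=NonCanon}/ ? 1 : 0;
my $one_noncanon = $super_one =~ /\p{Dt=NonCanon}/ ? 1 : 0;

# Age names one release; Present_In covers that release and later ones.
my %versions = (
    a_age_1_1     => ('A'     =~ /\p{Age=1.1}/        ? 1 : 0),
    a_age_5_1     => ('A'     =~ /\p{Age=5.1}/        ? 1 : 0),
    a_present_5_1 => ('A'     =~ /\p{Present_In: 5.1}/ ? 1 : 0),
    y_age_5_1     => ($y_loop =~ /\p{Age=5.1}/        ? 1 : 0),
    y_present_5_0 => ($y_loop =~ /\p{Present_In: 5.0}/ ? 1 : 0),
    y_present_5_2 => ($y_loop =~ /\p{Present_In: 5.2}/ ? 1 : 0),
);

printf "NFD(E-acute) has %d code points\n", length $nfd_e;
print "$_ => $versions{$_}\n" for sort keys %versions;
```

The C<Age> tests are true only for the exact release, while the
C<Present_In> tests stay true for every later release, which is usually the
question being asked.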
a9130ea9 | 822 | |
376d9008 | 823 | =head2 User-Defined Character Properties |
491fd90a | 824 | |
51f494cc | 825 | You can define your own binary character properties by defining subroutines |
a9130ea9 | 826 | whose names begin with C<"In"> or C<"Is">. (The experimental feature |
9d1a5160 KW |
827 | L<perlre/(?[ ])> provides an alternative which allows more complex |
828 | definitions.) The subroutines can be defined in any | |
51f494cc | 829 | package. The user-defined properties can be used in the regular expression |
a9130ea9 | 830 | C<\p{}> and C<\P{}> constructs; if you are using a user-defined property from a |
51f494cc | 831 | package other than the one you are in, you must specify its package in the |
a9130ea9 | 832 | C<\p{}> or C<\P{}> construct. |
bac0b425 | 833 | |
51f494cc | 834 | # assuming property IsForeign defined in Lang:: |
bac0b425 JP |
835 | package main; # property package name required |
836 | if ($txt =~ /\p{Lang::IsForeign}+/) { ... } | |
837 | ||
838 | package Lang; # property package name not required | |
839 | if ($txt =~ /\p{IsForeign}+/) { ... } | |
840 | ||
841 | ||
842 | Note that the effect is compile-time and immutable once defined. | |
b19eb496 TC |
843 | However, the subroutines are passed a single parameter, which is 0 if |
844 | case-sensitive matching is in effect and non-zero if caseless matching | |
56ca34ca KW |
845 | is in effect. The subroutine may return different values depending on |
846 | the value of the flag, and one set of values will immutably be in effect | |
b19eb496 | 847 | for all case-sensitive matches, and the other set for all case-insensitive |
56ca34ca | 848 | matches. |
491fd90a | 849 | |
b19eb496 | 850 | Note that if the regular expression is tainted, then Perl will die rather |
a9130ea9 | 851 | than calling the subroutine when the name of the subroutine is |
0e9be77f DM |
852 | determined by the tainted data. |
853 | ||
376d9008 JB |
854 | The subroutines must return a specially-formatted string, with one |
855 | or more newline-separated lines. Each line must be one of the following: | |
491fd90a JH |
856 | |
857 | =over 4 | |
858 | ||
859 | =item * | |
860 | ||
df9e1087 | 861 | A single hexadecimal number denoting a code point to include. |
510254c9 A |
862 | |
863 | =item * | |
864 | ||
99a6b1f0 | 865 | Two hexadecimal numbers separated by horizontal whitespace (space or |
df9e1087 | 866 | tab characters) denoting a range of code points to include. |
491fd90a JH |
867 | |
868 | =item * | |
869 | ||
a9130ea9 KW |
870 | Something to include, prefixed by C<"+">: a built-in character |
871 | property (prefixed by C<"utf8::">) or a fully qualified (including package | |
830137a2 | 872 | name) user-defined character property, |
bac0b425 JP |
873 | to represent all the characters in that property; two hexadecimal code |
874 | points for a range; or a single hexadecimal code point. | |
491fd90a JH |
875 | |
876 | =item * | |
877 | ||
a9130ea9 KW |
878 | Something to exclude, prefixed by C<"-">: an existing character |
879 | property (prefixed by C<"utf8::">) or a fully qualified (including package | |
830137a2 | 880 | name) user-defined character property, |
bac0b425 JP |
881 | to represent all the characters in that property; two hexadecimal code |
882 | points for a range; or a single hexadecimal code point. | |
491fd90a JH |
883 | |
884 | =item * | |
885 | ||
a9130ea9 KW |
886 | Something to negate, prefixed by C<"!">: an existing character | |
887 | property (prefixed by C<"utf8::">) or a fully qualified (including package | |
830137a2 | 888 | name) user-defined character property, |
bac0b425 JP |
889 | to represent all the characters in that property; two hexadecimal code |
890 | points for a range; or a single hexadecimal code point. | |
891 | ||
892 | =item * | |
893 | ||
a9130ea9 KW |
894 | Something to intersect with, prefixed by C<"&">: an existing character |
895 | property (prefixed by C<"utf8::">) or a fully qualified (including package | |
830137a2 | 896 | name) user-defined character property, |
bac0b425 JP |
897 | to represent all the characters in that property; two | |
898 | hexadecimal code points for a range; or a single hexadecimal code point. | |
491fd90a JH |
899 | |
900 | =back | |
901 | ||
902 | For example, to define a property that covers both the Japanese | |
903 | syllabaries (hiragana and katakana), you can define | |
904 | ||
905 | sub InKana { | |
d88362ca | 906 | return <<END; |
d5822f25 A |
907 | 3040\t309F |
908 | 30A0\t30FF | |
491fd90a JH |
909 | END |
910 | } | |
911 | ||
d5822f25 A |
912 | Imagine that the here-doc end marker is at the beginning of the line. |
913 | Now you can use C<\p{InKana}> and C<\P{InKana}>. | |
491fd90a JH |
914 | |
915 | You could also have used the existing block property names: | |
916 | ||
917 | sub InKana { | |
d88362ca | 918 | return <<'END'; |
491fd90a JH |
919 | +utf8::InHiragana |
920 | +utf8::InKatakana | |
921 | END | |
922 | } | |
923 | ||
924 | Suppose you wanted to match only the allocated characters, | |
d5822f25 | 925 | not the raw block ranges: in other words, you want to remove |
b65e6125 | 926 | the unassigned characters: |
491fd90a JH |
927 | |
928 | sub InKana { | |
d88362ca | 929 | return <<'END'; |
491fd90a JH |
930 | +utf8::InHiragana |
931 | +utf8::InKatakana | |
932 | -utf8::IsCn | |
933 | END | |
934 | } | |
935 | ||
936 | The negation is useful for defining (surprise!) negated classes. | |
937 | ||
938 | sub InNotKana { | |
d88362ca | 939 | return <<'END'; |
491fd90a JH |
940 | !utf8::InHiragana |
941 | -utf8::InKatakana | |
942 | +utf8::IsCn | |
943 | END | |
944 | } | |
945 | ||
461020ad KW |
946 | This will match all non-Unicode code points, since every one of them is |
947 | not in Kana. You can use intersection to exclude these, if desired, as | |
948 | this modified example shows: | |
bac0b425 | 949 | |
461020ad | 950 | sub InNotKana { |
bac0b425 | 951 | return <<'END'; |
461020ad KW |
952 | !utf8::InHiragana |
953 | -utf8::InKatakana | |
954 | +utf8::IsCn | |
955 | &utf8::Any | |
bac0b425 JP |
956 | END |
957 | } | |
958 | ||
461020ad KW |
959 | C<&utf8::Any> must be the last line in the definition. |
960 | ||
961 | Intersection is used generally for getting the common characters matched | |
a9130ea9 | 962 | by two (or more) classes. It's important to remember not to use C<"&"> for |
461020ad KW |
963 | the first set; that would be intersecting with nothing, resulting in an |
964 | empty set. | |
965 | ||
2d88a86a KW |
966 | Unlike non-user-defined C<\p{}> property matches, no warning is ever |
967 | generated if these properties are matched against a non-Unicode code | |
968 | point (see L</Beyond Unicode code points> below). | |
bac0b425 | 969 | |
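Putting the pieces of this section together, a complete script might look
like the sketch below. The property name C<InAssignedKana> is hypothetical,
and the test characters assume that C<U+3040> is still unassigned in the
Unicode version your Perl ships with.

```perl
use strict;
use warnings;

# Hypothetical user-defined property: both kana blocks, minus any
# code points that are still unassigned.
sub InAssignedKana {
    return <<'END';
+utf8::InHiragana
+utf8::InKatakana
-utf8::IsCn
END
}

my $hiragana_a = chr(0x3042);   # HIRAGANA LETTER A
my $katakana_a = chr(0x30A2);   # KATAKANA LETTER A
my $reserved   = chr(0x3040);   # inside the Hiragana block, assumed unassigned

my %matched = (
    hiragana_a => ($hiragana_a =~ /\p{InAssignedKana}/ ? 1 : 0),
    katakana_a => ($katakana_a =~ /\p{InAssignedKana}/ ? 1 : 0),
    reserved   => ($reserved   =~ /\p{InAssignedKana}/ ? 1 : 0),
    latin_a    => ('A'         =~ /\p{InAssignedKana}/ ? 1 : 0),
    not_latin  => ('A'         =~ /\P{InAssignedKana}/ ? 1 : 0),
);

print "$_ => $matched{$_}\n" for sort keys %matched;
```

Because the subroutine and the patterns are in the same package, no package
qualifier is needed inside C<\p{}>. From another package the pattern would
have to name the defining package, as described at the start of this section.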
68585b5e | 970 | =head2 User-Defined Case Mappings (for serious hackers only) |
822502e5 | 971 | |
5d1892be | 972 | B<This feature has been removed as of Perl 5.16.> |
a9130ea9 | 973 | The CPAN module C<L<Unicode::Casing>> provides better functionality without |
5d1892be KW |
974 | the drawbacks that this feature had. If you are using a Perl earlier |
975 | than 5.16, this feature was most fully documented in the 5.14 version of | |
976 | this pod: | |
977 | L<http://perldoc.perl.org/5.14.0/perlunicode.html#User-Defined-Case-Mappings-%28for-serious-hackers-only%29> | |
3a2263fe | 978 | |
376d9008 | 979 | =head2 Character Encodings for Input and Output |
8cbd9a7a | 980 | |
7221edc9 | 981 | See L<Encode>. |
8cbd9a7a | 982 | |
c29a771d | 983 | =head2 Unicode Regular Expression Support Level |
776f8809 | 984 | |
b19eb496 TC |
985 | The following list of Unicode supported features for regular expressions describes |
986 | all features currently directly supported by core Perl. The references to "Level N" | |
8158862b | 987 | and the section numbers refer to the Unicode Technical Standard #18, |
b19eb496 | 988 | "Unicode Regular Expressions", version 13, from August 2008. |
776f8809 JH |
989 | |
990 | =over 4 | |
991 | ||
992 | =item * | |
993 | ||
994 | Level 1 - Basic Unicode Support | |
995 | ||
755789c0 KW |
996 | RL1.1 Hex Notation - done [1] |
997 | RL1.2 Properties - done [2][3] | |
998 | RL1.2a Compatibility Properties - done [4] | |
9d1a5160 | 999 | RL1.3 Subtraction and Intersection - experimental [5] |
755789c0 KW |
1000 | RL1.4 Simple Word Boundaries - done [6] |
1001 | RL1.5 Simple Loose Matches - done [7] | |
1002 | RL1.6 Line Boundaries - MISSING [8][9] | |
1003 | RL1.7 Supplementary Code Points - done [10] | |
1004 | ||
6f33e417 KW |
1005 | =over 4 |
1006 | ||
1007 | =item [1] | |
1008 | ||
a9130ea9 | 1009 | C<\x{...}> |
6f33e417 KW |
1010 | |
1011 | =item [2] | |
1012 | ||
a9130ea9 | 1013 | C<\p{...}> C<\P{...}> |
6f33e417 KW |
1014 | |
1015 | =item [3] | |
1016 | ||
1017 | Supports not only the minimal list, but all Unicode character properties (see Unicode Character Properties above). | |
1018 | ||
1019 | =item [4] | |
1020 | ||
a9130ea9 | 1021 | C<\d> C<\D> C<\s> C<\S> C<\w> C<\W> C<\X> C<[:I<prop>:]> C<[:^I<prop>:]> |
6f33e417 KW |
1022 | |
1023 | =item [5] | |
1024 | ||
df9e1087 | 1025 | The experimental feature in v5.18 C<"(?[...])"> accomplishes this. See |
9d1a5160 KW |
1026 | L<perlre/(?[ ])>. If you don't want to use an experimental feature, |
1027 | you can use one of the following: | |
6f33e417 KW |
1028 | |
1029 | =over 4 | |
1030 | ||
1031 | =item * Regular expression look-ahead | |
1032 | ||
1033 | You can mimic class subtraction using lookahead. | |
8158862b | 1034 | For example, what UTS#18 might write as |
29bdacb8 | 1035 | |
209c9685 | 1036 | [{Block=Greek}-[{UNASSIGNED}]] |
dbe420b4 JH |
1037 | |
1038 | in Perl can be written as: | |
1039 | ||
209c9685 KW |
1040 | (?!\p{Unassigned})\p{Block=Greek} |
1041 | (?=\p{Assigned})\p{Block=Greek} | |
dbe420b4 JH |
1042 | |
1043 | But in this particular example, you probably really want | |
1044 | ||
209c9685 | 1045 | \p{Greek} |
dbe420b4 JH |
1046 | |
1047 | which will match assigned characters known to be part of the Greek script. | |
29bdacb8 | 1048 | |
a9130ea9 | 1049 | =item * CPAN module C<L<Unicode::Regex::Set>> |
8158862b | 1050 | |
6f33e417 KW |
1051 | It does implement the full UTS#18 grouping, intersection, union, and |
1052 | removal (subtraction) syntax. | |
8158862b | 1053 | |
6f33e417 KW |
1054 | =item * L</"User-Defined Character Properties"> |
1055 | ||
a9130ea9 | 1056 | C<"+"> for union, C<"-"> for removal (set-difference), C<"&"> for intersection |
6f33e417 KW |
1057 | |
1058 | =back | |
1059 | ||
1060 | =item [6] | |
1061 | ||
a9130ea9 | 1062 | C<\b> C<\B> |
6f33e417 KW |
1063 | |
1064 | =item [7] | |
1065 | ||
a9130ea9 KW |
1066 | Note that Perl does Full case-folding in matching (but with bugs), not |
1067 | Simple: for example C<U+1F88> is equivalent to C<U+1F00 U+03B9>, instead of | |
1068 | just C<U+1F80>. This difference matters mainly for certain Greek capital | |
1069 | letters with certain modifiers: the Full case-folding decomposes the | |
1070 | letter, while the Simple case-folding would map it to a single | |
1071 | character. | |
6f33e417 KW |
1072 | |
1073 | =item [8] | |
1074 | ||
6bc50c7f KW |
1075 | Should do C<^> and C<$> also on C<U+000B> (C<\v> in C), C<FF> (C<\f>), |
1076 | C<CR> (C<\r>), C<CRLF> (C<\r\n>), C<NEL> (C<U+0085>), C<LS> (C<U+2028>), | |
1077 | and C<PS> (C<U+2029>); should also affect C<E<lt>E<gt>>, C<$.>, and | |
1078 | script line numbers; should not split lines within C<CRLF> (i.e. there | |
1079 | is no empty line between C<\r> and C<\n>). For C<CRLF>, try the | |
6f33e417 KW |
1080 | C<:crlf> layer (see L<PerlIO>). |
1081 | ||
1082 | =item [9] | |
1083 | ||
a9130ea9 KW |
1084 | Linebreaking conformant with L<UAX#14 "Unicode Line Breaking |
1085 | Algorithm"|http://www.unicode.org/reports/tr14> | |
1086 | is available through the C<L<Unicode::LineBreak>> module. | |
6f33e417 KW |
1087 | |
1088 | =item [10] | |
1089 | ||
a9130ea9 KW |
1090 | UTF-8/UTF-EBCDIC used in Perl allows not only C<U+10000> to | |
1091 | C<U+10FFFF> but also beyond C<U+10FFFF> | |
6f33e417 KW |
1092 | |
1093 | =back | |
5ca1ac52 | 1094 | |
776f8809 JH |
1095 | =item * |
1096 | ||
1097 | Level 2 - Extended Unicode Support | |
1098 | ||
755789c0 KW |
1099 | RL2.1 Canonical Equivalents - MISSING [10][11] |
1100 | RL2.2 Default Grapheme Clusters - MISSING [12] | |
ae3bb8ea | 1101 | RL2.3 Default Word Boundaries - DONE [14] |
755789c0 KW |
1102 | RL2.4 Default Loose Matches - MISSING [15] |
1103 | RL2.5 Name Properties - DONE | |
1104 | RL2.6 Wildcard Properties - MISSING | |
8158862b | 1105 | |
755789c0 KW |
1106 | [10] see UAX#15 "Unicode Normalization Forms" |
1107 | [11] have Unicode::Normalize but not integrated to regexes | |
64935bc6 KW |
1108 | [12] have \X and \b{gcb} but we don't have a "Grapheme Cluster |
1109 | Mode" | |
755789c0 | 1110 | [14] see UAX#29, Word Boundaries |
902b08d0 | 1111 | [15] This is covered in Chapter 3.13 (in Unicode 6.0) |
776f8809 JH |
1112 | |
1113 | =item * | |
1114 | ||
8158862b TS |
1115 | Level 3 - Tailored Support |
1116 | ||
755789c0 KW |
1117 | RL3.1 Tailored Punctuation - MISSING |
1118 | RL3.2 Tailored Grapheme Clusters - MISSING [17][18] | |
1119 | RL3.3 Tailored Word Boundaries - MISSING | |
1120 | RL3.4 Tailored Loose Matches - MISSING | |
1121 | RL3.5 Tailored Ranges - MISSING | |
1122 | RL3.6 Context Matching - MISSING [19] | |
1123 | RL3.7 Incremental Matches - MISSING | |
8158862b | 1124 | ( RL3.8 Unicode Set Sharing ) |
755789c0 KW |
1125 | RL3.9 Possible Match Sets - MISSING |
1126 | RL3.10 Folded Matching - MISSING [20] | |
1127 | RL3.11 Submatchers - MISSING | |
1128 | ||
1129 | [17] see UTS#10 "Unicode Collation Algorithm" | |
1130 | [18] have Unicode::Collate but not integrated to regexes | |
1131 | [19] have (?<=x) and (?=x), but look-aheads or look-behinds | |
1132 | should see outside of the target substring | |
1133 | [20] need insensitive matching for linguistic features other | |
1134 | than case; for example, hiragana to katakana, wide and | |
1135 | narrow, simplified Han to traditional Han (see UTR#30 | |
1136 | "Character Foldings") | |
776f8809 JH |
1137 | |
1138 | =back | |
1139 | ||
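The look-ahead idiom from note [5] above can be tried out as follows. This
is a minimal sketch; it assumes C<U+03A2> is unassigned in the Unicode
version your Perl ships with, and the variable names are illustrative only.

```perl
use strict;
use warnings;

# Class subtraction by look-ahead: "in the Greek block, but assigned".
my $assigned_greek = qr/(?!\p{Unassigned})\p{Block=Greek}/;

my $alpha = chr(0x0391);   # GREEK CAPITAL LETTER ALPHA
my $gap   = chr(0x03A2);   # a hole in the block, assumed unassigned

my %result = (
    alpha_in_block => ($alpha =~ /\p{Block=Greek}/ ? 1 : 0),
    alpha_assigned => ($alpha =~ $assigned_greek   ? 1 : 0),
    gap_in_block   => ($gap   =~ /\p{Block=Greek}/ ? 1 : 0),
    gap_assigned   => ($gap   =~ $assigned_greek   ? 1 : 0),
    alpha_script   => ($alpha =~ /\p{Greek}/       ? 1 : 0),
);

print "$_ => $result{$_}\n" for sort keys %result;
```

The negative look-ahead consumes nothing, so the pattern still matches
exactly one character; it only vetoes the code points that the second
property should not cover.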
c349b1b9 JH |
1140 | =head2 Unicode Encodings |
1141 | ||
376d9008 JB |
1142 | Unicode characters are assigned to I<code points>, which are abstract |
1143 | numbers. To use these numbers, various encodings are needed. | |
c349b1b9 JH |
1144 | |
1145 | =over 4 | |
1146 | ||
c29a771d | 1147 | =item * |
5cb3728c RB |
1148 | |
1149 | UTF-8 | |
c349b1b9 | 1150 | |
6d4f9cf2 KW |
1151 | UTF-8 is a variable-length (1 to 4 bytes), byte-order independent |
1152 | encoding. For ASCII (and we really do mean 7-bit ASCII, not another | |
1153 | 8-bit encoding), UTF-8 is transparent. | |
c349b1b9 | 1154 | |
8c007b5a | 1155 | The following table is from Unicode 3.2. |
05632f9a | 1156 | |
755789c0 | 1157 | Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte |
05632f9a | 1158 | |
d88362ca | 1159 | U+0000..U+007F 00..7F |
e1b711da | 1160 | U+0080..U+07FF * C2..DF 80..BF |
d88362ca | 1161 | U+0800..U+0FFF E0 * A0..BF 80..BF |
ec90690f TS |
1162 | U+1000..U+CFFF E1..EC 80..BF 80..BF |
1163 | U+D000..U+D7FF ED 80..9F 80..BF | |
755789c0 | 1164 | U+D800..U+DFFF +++++ utf16 surrogates, not legal utf8 +++++ |
ec90690f | 1165 | U+E000..U+FFFF EE..EF 80..BF 80..BF |
d88362ca KW |
1166 | U+10000..U+3FFFF F0 * 90..BF 80..BF 80..BF |
1167 | U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF | |
1168 | U+100000..U+10FFFF F4 80..8F 80..BF 80..BF | |
e1b711da | 1169 | |
b19eb496 | 1170 | Note the gaps marked by "*" before several of the byte entries above. These are |
e1b711da KW |
1171 | caused by legal UTF-8 avoiding non-shortest encodings: it is technically |
1172 | possible to UTF-8-encode a single code point in different ways, but that is | |
1173 | explicitly forbidden, and the shortest possible encoding should always be used | |
1174 | (and that is what Perl does). | |
37361303 | 1175 | |
376d9008 | 1176 | Another way to look at it is via bits: |
05632f9a | 1177 | |
755789c0 | 1178 | Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte |
05632f9a | 1179 | |
755789c0 KW |
1180 | 0aaaaaaa 0aaaaaaa |
1181 | 00000bbbbbaaaaaa 110bbbbb 10aaaaaa | |
1182 | ccccbbbbbbaaaaaa 1110cccc 10bbbbbb 10aaaaaa | |
1183 | 00000dddccccccbbbbbbaaaaaa 11110ddd 10cccccc 10bbbbbb 10aaaaaa | |
05632f9a | 1184 | |
a9130ea9 | 1185 | As you can see, the continuation bytes all begin with C<"10">, and the |
e1b711da | 1186 | leading bits of the start byte tell how many bytes there are in the |
05632f9a JH |
1187 | encoded character. |
1188 | ||
6d4f9cf2 | 1189 | The original UTF-8 specification allowed up to 6 bytes, to allow |
a9130ea9 | 1190 | encoding of numbers up to C<0x7FFF_FFFF>. Perl continues to allow those, |
6d4f9cf2 KW |
1191 | and has extended that up to 13 bytes to encode code points up to what |
1192 | can fit in a 64-bit word. However, Perl will warn if you output any of | |
b19eb496 | 1193 | these as being non-portable; and under strict UTF-8 input protocols, |
6d4f9cf2 KW |
1194 | they are forbidden. |
1195 | ||
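The bit layout in the table above can be checked with a hand-written
encoder. The sketch below is for illustration only (real code should use
C<utf8::encode> or L<Encode>); the function names are made up, it covers
only C<U+0000..U+10FFFF>, and it assumes an ASCII platform.

```perl
use strict;
use warnings;

# Illustrative encoder following the bit table: returns the list of
# octets for one code point.
sub utf8_octets {
    my ($cp) = @_;
    die sprintf("U+%04X is a surrogate\n", $cp)
        if $cp >= 0xD800 && $cp <= 0xDFFF;
    die sprintf("U+%X is above U+10FFFF\n", $cp) if $cp > 0x10FFFF;

    return ($cp) if $cp < 0x80;                      # 0aaaaaaa
    return (0xC0 | ($cp >> 6),                       # 110bbbbb
            0x80 | ($cp & 0x3F)) if $cp < 0x800;     # 10aaaaaa
    return (0xE0 | ($cp >> 12),                      # 1110cccc
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F)) if $cp < 0x10000;
    return (0xF0 | ($cp >> 18),                      # 11110ddd
            0x80 | (($cp >> 12) & 0x3F),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F));
}

# Format a list of octets as space-separated hex.
sub hex_octets {
    return join ' ', map { sprintf '%02X', $_ } @_;
}

print hex_octets(utf8_octets(0x20AC)), "\n";   # E2 82 AC

# Cross-check against Perl's own encoder.
my $builtin = chr(0x20AC);
utf8::encode($builtin);
my $agrees = $builtin eq pack('C*', utf8_octets(0x20AC)) ? 1 : 0;
print "agrees with utf8::encode: $agrees\n";
```

Because each range is tested in ascending order, a code point always gets
the shortest form, which is the rule the gaps marked "*" in the first table
reflect.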
c29a771d | 1196 | =item * |
5cb3728c RB |
1197 | |
1198 | UTF-EBCDIC | |
dbe420b4 | 1199 | |
b65e6125 | 1200 | Like UTF-8, but EBCDIC-safe, in the way that UTF-8 is ASCII-safe. |
dbe420b4 | 1201 | |
c29a771d | 1202 | =item * |
5cb3728c | 1203 | |
b65e6125 | 1204 | UTF-16, UTF-16BE, UTF-16LE, Surrogates, and C<BOM>'s (Byte Order Marks) |
c349b1b9 | 1205 | |
1bfb14c4 JH |
1206 | The following items are mostly for reference and general Unicode | |
1207 | knowledge; Perl doesn't use these constructs internally. | |
dbe420b4 | 1208 | |
b19eb496 TC |
1209 | Like UTF-8, UTF-16 is a variable-width encoding, but where |
1210 | UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units. | |
1211 | All code points occupy either 2 or 4 bytes in UTF-16: code points | |
1212 | C<U+0000..U+FFFF> are stored in a single 16-bit unit, and code | |
1bfb14c4 | 1213 | points C<U+10000..U+10FFFF> in two 16-bit units. The latter case is |
c349b1b9 JH |
1214 | using I<surrogates>, the first 16-bit unit being the I<high |
1215 | surrogate>, and the second being the I<low surrogate>. | |
1216 | ||
376d9008 | 1217 | Surrogates are code points set aside to encode the C<U+10000..U+10FFFF> |
c349b1b9 | 1218 | range of Unicode code points in pairs of 16-bit units. The I<high |
9f815e24 | 1219 | surrogates> are the range C<U+D800..U+DBFF> and the I<low surrogates> |
376d9008 | 1220 | are the range C<U+DC00..U+DFFF>. The surrogate encoding is |
c349b1b9 | 1221 | |
d88362ca KW |
1222 | $hi = int(($uni - 0x10000) / 0x400) + 0xD800; |
1223 | $lo = ($uni - 0x10000) % 0x400 + 0xDC00; | |
c349b1b9 JH |
1224 | |
1225 | and the decoding is | |
1226 | ||
d88362ca | 1227 | $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00); |
c349b1b9 | 1228 | |
376d9008 | 1229 | Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16 |
c349b1b9 | 1230 | itself can be used for in-memory computations, but if storage or |
376d9008 JB |
1231 | transfer is required either UTF-16BE (big-endian) or UTF-16LE |
1232 | (little-endian) encodings must be chosen. | |
c349b1b9 JH |
1233 | |
1234 | This introduces another problem: what if you just know that your data | |
376d9008 | 1235 | is UTF-16, but you don't know which endianness? Byte Order Marks, or |
b65e6125 | 1236 | C<BOM>'s, are a solution to this. A special character has been reserved |
86bbd6d1 | 1237 | in Unicode to function as a byte order marker: the character with the |
a9130ea9 | 1238 | code point C<U+FEFF> is the C<BOM>. |
042da322 | 1239 | |
a9130ea9 | 1240 | The trick is that if you read a C<BOM>, you will know the byte order, |
376d9008 JB |
1241 | since if it was written on a big-endian platform, you will read the |
1242 | bytes C<0xFE 0xFF>, but if it was written on a little-endian platform, | |
1243 | you will read the bytes C<0xFF 0xFE>. (And if the originating platform | |
b65e6125 KW |
1244 | was writing in ASCII platform UTF-8, you will read the bytes |
1245 | C<0xEF 0xBB 0xBF>.) | |
042da322 | 1246 | |
86bbd6d1 | 1247 | The way this trick works is that the character with the code point |
6d4f9cf2 | 1248 | C<U+FFFE> is not supposed to be in input streams, so the |
a9130ea9 | 1249 | sequence of bytes C<0xFF 0xFE> is unambiguously "C<BOM>, represented in |
1bfb14c4 | 1250 | little-endian format" and cannot be "C<U+FFFE>, represented in big-endian |
6d4f9cf2 KW |
1251 | format". |
1252 | ||
1253 | Surrogates have no meaning in Unicode outside their use in pairs to | |
1254 | represent other code points. However, Perl allows them to be | |
1255 | represented individually internally, for example by saying | |
f651977e TC |
1256 | C<chr(0xD801)>, so that all code points, not just those valid for open |
1257 | interchange, are | |
6d4f9cf2 | 1258 | representable. Unicode does define semantics for them, such as their |
a9130ea9 KW |
1259 | C<L</General_Category>> being C<"Cs">. But because their use is somewhat dangerous, |
1260 | Perl will warn (using the warning category C<"surrogate">, which is a | |
1261 | sub-category of C<"utf8">) if an attempt is made | |
6d4f9cf2 KW |
1262 | to do things like take the lower case of one, or match |
1263 | case-insensitively, or to output them. (But don't try this on Perls | |
1264 | before 5.14.) | |
c349b1b9 | 1265 | |
c29a771d | 1266 | =item * |
5cb3728c | 1267 | |
1e54db1a | 1268 | UTF-32, UTF-32BE, UTF-32LE |
c349b1b9 | 1269 | |
b65e6125 | 1270 | The UTF-32 family is pretty much like the UTF-16 family, except that |
042da322 | 1271 | the units are 32-bit, and therefore the surrogate scheme is not |
a9130ea9 | 1272 | needed. UTF-32 is a fixed-width encoding. The C<BOM> signatures are |
b19eb496 | 1273 | C<0x00 0x00 0xFE 0xFF> for BE and C<0xFF 0xFE 0x00 0x00> for LE. |
c349b1b9 | 1274 | |
c29a771d | 1275 | =item * |
5cb3728c RB |
1276 | |
1277 | UCS-2, UCS-4 | |
c349b1b9 | 1278 | |
b19eb496 | 1279 | Legacy, fixed-width encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit |
376d9008 | 1280 | encoding. Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>, |
339cfa0e | 1281 | because it does not use surrogates. UCS-4 is a 32-bit encoding, |
b19eb496 | 1282 | functionally identical to UTF-32 (the difference being that |
a9130ea9 | 1283 | UCS-4 forbids neither surrogates nor code points larger than C<0x10_FFFF>). |
c349b1b9 | 1284 | |
c29a771d | 1285 | =item * |
5cb3728c RB |
1286 | |
1287 | UTF-7 | |
c349b1b9 | 1288 | |
376d9008 JB |
1289 | A seven-bit safe (non-eight-bit) encoding, which is useful if the |
1290 | transport or storage is not eight-bit safe. Defined by RFC 2152. | |
c349b1b9 | 1291 | |
95a1a48b JH |
1292 | =back |
1293 | ||
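The surrogate arithmetic given in the UTF-16 item above can be checked
with a short round trip. What follows is only a minimal sketch: the
helper names C<to_surrogates> and C<from_surrogates> are invented for
this illustration and are not part of Perl or of any module.

```perl
use strict;
use warnings;

# Hypothetical helpers (not part of Perl): split a supplementary code
# point into a UTF-16 surrogate pair, and join a pair back together.
sub to_surrogates {
    my ($uni) = @_;
    die "not a supplementary code point"
        if $uni < 0x10000 || $uni > 0x10FFFF;
    my $offset = $uni - 0x10000;
    my $hi = int($offset / 0x400) + 0xD800;   # int() forces integer division
    my $lo = $offset % 0x400 + 0xDC00;
    return ($hi, $lo);
}

sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
}

my ($hi, $lo) = to_surrogates(0x1F600);
printf "U+%04X -> %04X %04X -> U+%04X\n",
    0x1F600, $hi, $lo, from_surrogates($hi, $lo);
# prints "U+1F600 -> D83D DE00 -> U+1F600"
```

The C<int()> call matters because Perl's C</> operator does
floating-point division unless C<use integer> is in effect.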
57e88091 | 1294 | =head2 Noncharacter code points |
6d4f9cf2 | 1295 | |
57e88091 | 1296 | 66 code points are set aside in Unicode as "noncharacter code points". |
a9130ea9 | 1297 | These all have the C<Unassigned> (C<Cn>) C<L</General_Category>>, and |
57e88091 KW |
1298 | no character will ever be assigned to any of them. They are the 32 code |
1299 | points between C<U+FDD0> and C<U+FDEF> inclusive, and the 34 code | |
1300 | points: | |
1301 | ||
1302 | U+FFFE U+FFFF | |
1303 | U+1FFFE U+1FFFF | |
1304 | U+2FFFE U+2FFFF | |
1305 | ... | |
1306 | U+EFFFE U+EFFFF | |
1307 | U+FFFFE U+FFFFF | |
1308 | U+10FFFE U+10FFFF | |
1309 | ||
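The two groups listed above are regular enough to test for
arithmetically. The following sketch counts them; the function name
C<is_noncharacter> is an assumption made for this example, not an
existing Perl API.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $cp is one of the 66 noncharacters.
sub is_noncharacter {
    my ($cp) = @_;
    return 0 if $cp < 0 || $cp > 0x10FFFF;
    return 1 if $cp >= 0xFDD0 && $cp <= 0xFDEF;   # the contiguous block of 32
    return (($cp & 0xFFFE) == 0xFFFE) ? 1 : 0;    # last two of each plane
}

my $count = 0;
for my $cp (0 .. 0x10FFFF) {
    $count++ if is_noncharacter($cp);
}
print "$count noncharacters\n";    # prints "66 noncharacters"
```

In regular expressions the same set is available as the Unicode
property C<\p{Noncharacter_Code_Point}>, which is usually preferable to
hand-written arithmetic.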
1310 | Until Unicode 7.0, the noncharacters were "B<forbidden> for use in open | |
1311 | interchange of Unicode text data", so that code that processed those | |
1312 | streams could use these code points as sentinels that could be mixed in | |
1313 | with character data, and would always be distinguishable from that data. | |
1314 | (Emphasis above and in the next paragraph are added in this document.) | |
1315 | ||
1316 | Unicode 7.0 changed the wording so that they are "B<not recommended> for | |
1317 | use in open interchange of Unicode text data". The 7.0 Standard goes on | |
1318 | to say: | |
1319 | ||
1320 | =over 4 | |
1321 | ||
1322 | "If a noncharacter is received in open interchange, an application is | |
1323 | not required to interpret it in any way. It is good practice, however, | |
1324 | to recognize it as a noncharacter and to take appropriate action, such | |
1325 | as replacing it with C<U+FFFD> replacement character, to indicate the | |
1326 | problem in the text. It is not recommended to simply delete | |
1327 | noncharacter code points from such text, because of the potential | |
1328 | security issues caused by deleting uninterpreted characters. (See | |
1329 | conformance clause C7 in Section 3.2, Conformance Requirements, and | |
1330 | L<Unicode Technical Report #36, "Unicode Security | |
1331 | Considerations"|http://www.unicode.org/reports/tr36/#Substituting_for_Ill_Formed_Subsequences>)." | |
1332 | ||
1333 | =back | |
1334 | ||
1335 | This change was made because it was found that various commercial tools, |
1336 | such as editors and source code control systems, had been written |
1337 | so that they would not handle program files that used these code points, | |
1338 | effectively precluding their use almost entirely! And that was never | |
1339 | the intent. They've always been meant to be usable within an | |
1340 | application, or cooperating set of applications, at will. | |
1341 | ||
1342 | If you're writing code, such as an editor, that is supposed to be able | |
1343 | to handle any Unicode text data, then you shouldn't be using these code | |
1344 | points yourself, and should instead allow them in the input. If you need |
1345 | sentinels, they should be something that isn't legal Unicode. |
1346 | For UTF-8 data, you can use the bytes 0xC0 and 0xC1 as sentinels, as |
1347 | they never appear in well-formed UTF-8. (There are equivalents for | |
1348 | UTF-EBCDIC). You can also store your Unicode code points in integer | |
1349 | variables and use negative values as sentinels. | |
1350 | ||
1351 | If you're not writing such a tool, then whether you accept noncharacters | |
1352 | as input is up to you (though the Standard recommends that you not). If | |
1353 | you do strict input stream checking with Perl, these code points | |
1354 | continue to be forbidden. This is to maintain backward compatibility | |
1355 | (otherwise potential security holes could open up, as an unsuspecting | |
1356 | application that was written assuming the noncharacters would be | |
1357 | filtered out before getting to it, could now, without warning, start | |
1358 | getting them). To do strict checking, you can use the layer | |
1359 | C<:encoding('UTF-8')>. | |
1360 | ||
1361 | Perl continues to warn (using the warning category C<"nonchar">, which | |
1362 | is a sub-category of C<"utf8">) if an attempt is made to output | |
1363 | noncharacters. | |
42581d5d KW |
1364 | |
1365 | =head2 Beyond Unicode code points | |
1366 | ||
a9130ea9 KW |
1367 | The maximum Unicode code point is C<U+10FFFF>, and Unicode only defines |
1368 | operations on code points up through that. But Perl works on code | |
42581d5d KW |
1369 | points up to the maximum permissible unsigned number available on the |
1370 | platform. However, Perl will not accept these from input streams unless | |
1371 | lax rules are being used, and will warn (using the warning category | |
2d88a86a KW |
1372 | C<"non_unicode">, which is a sub-category of C<"utf8">) if any are output. |
1373 | ||
1374 | Since Unicode rules are not defined on these code points, if a | |
1375 | Unicode-defined operation is done on them, Perl uses what we believe are | |
1376 | sensible rules, while generally warning, using the C<"non_unicode"> | |
1377 | category. For example, C<uc("\x{11_0000}")> will generate such a | |
1378 | warning, returning the input parameter as its result, since Perl defines | |
1379 | the uppercase of every non-Unicode code point to be the code point | |
b65e6125 KW |
1380 | itself. (All the case changing operations, not just uppercasing, work |
1381 | this way.) | |
2d88a86a KW |
1382 | |
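The case-changing behavior just described can be observed directly.
This is a small sketch, assuming a Perl recent enough (v5.14 or later)
for above-Unicode code points to be reliable; it collects any warnings
rather than asserting their exact text.

```perl
use strict;
use warnings;

my $cp   = 0x110000;     # first code point beyond the Unicode range
my $char = chr($cp);

my @warnings;
my $result;
{
    # Collect warnings instead of letting them go to STDERR.
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    $result = uc($char);
}

# The uppercase of a non-Unicode code point is the code point itself.
print "uc() returned its argument unchanged\n" if $result eq $char;
print "warning: $_" for @warnings;
```

Any warning collected here is expected to belong to the
C<"non_unicode"> category, so C<no warnings 'non_unicode'> should
silence it in code that deliberately uses such code points.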
1383 | The situation with matching Unicode properties in regular expressions, | |
1384 | the C<\p{}> and C<\P{}> constructs, against these code points is not as | |
1385 | clear cut, and how these are handled has changed as we've gained | |
1386 | experience. | |
1387 | ||
1388 | One possibility is to treat any match against these code points as | |
1389 | undefined. But since Perl doesn't have the concept of a match being | |
1390 | undefined, it converts this to failing or C<FALSE>. This is almost, but | |
1391 | not quite, what Perl did from v5.14 (when use of these code points | |
1392 | became generally reliable) through v5.18. The difference is that Perl | |
1393 | treated all C<\p{}> matches as failing, but all C<\P{}> matches as | |
1394 | succeeding. | |
1395 | ||
1396 | One problem with this is that it leads to unexpected and confusing |
1397 | results in some cases: | |
1398 | ||
1399 | chr(0x110000) =~ /\p{ASCII_Hex_Digit=True}/ # Failed on <= v5.18 |
1400 | chr(0x110000) =~ /\p{ASCII_Hex_Digit=False}/ # Failed! on <= v5.18 |
1401 | ||
1402 | That is, it treated both matches as undefined, and converted that to | |
1403 | false (raising a warning on each). The first case is the expected | |
1404 | result, but the second is likely counterintuitive: "How could both be | |
1405 | false when they are complements?" Another problem was that the | |
1406 | implementation optimized many Unicode property matches down to already | |
1407 | existing simpler, faster operations, which don't raise the warning. We | |
1408 | chose to not forgo those optimizations, which help the vast majority of | |
1409 | matches, just to generate a warning for the unlikely event that an | |
1410 | above-Unicode code point is being matched against. | |
1411 | ||
1412 | As a result of these problems, starting in v5.20, what Perl does is | |
1413 | to treat non-Unicode code points as just typical unassigned Unicode | |
1414 | characters, and matches accordingly. (Note: Unicode has atypical | |
57e88091 | 1415 | unassigned code points. For example, it has noncharacter code points, |
2d88a86a KW |
1416 | and ones that, when they do get assigned, are destined to be written |
1417 | Right-to-left, as Arabic and Hebrew are. Perl assumes that no | |
1418 | non-Unicode code point has any atypical properties.) | |
1419 | ||
1420 | Perl, in most cases, will raise a warning when matching an above-Unicode | |
1421 | code point against a Unicode property when the result is C<TRUE> for | |
1422 | C<\p{}>, and C<FALSE> for C<\P{}>. For example: | |
1423 | ||
1424 | chr(0x110000) =~ /\p{ASCII_Hex_Digit=True}/ # Fails, no warning |
1425 | chr(0x110000) =~ /\p{ASCII_Hex_Digit=False}/ # Succeeds, with warning |
1426 | ||
1427 | In both these examples, the character being matched is non-Unicode, so | |
1428 | Unicode doesn't define how it should match. It clearly isn't an ASCII | |
1429 | hex digit, so the first example clearly should fail, and so it does, | |
1430 | with no warning. But it is arguable that the second example should have | |
1431 | an undefined, hence C<FALSE>, result. So a warning is raised for it. | |
1432 | ||
1433 | Thus the warning is raised for many fewer cases than in earlier Perls, | |
1434 | and only when what the result is could be arguable. It turns out that | |
1435 | none of the optimizations made by Perl (or ever likely to be made) |
1436 | cause the warning to be skipped, so it solves both problems of Perl's | |
1437 | earlier approach. The most commonly used property that is affected by | |
1438 | this change is C<\p{Unassigned}> which is a short form for | |
1439 | C<\p{General_Category=Unassigned}>. Starting in v5.20, all non-Unicode | |
1440 | code points are considered C<Unassigned>. In earlier releases the | |
1441 | matches failed because the result was considered undefined. | |
1442 | ||
1443 | The only place where the warning is not raised when it arguably ought to |
1444 | have been is if optimizations cause the whole pattern match to not even | |
1445 | be attempted. For example, Perl may figure out that for a string to | |
1446 | match a certain regular expression pattern, the string has to contain | |
1447 | the substring C<"foobar">. Before attempting the match, Perl may look | |
1448 | for that substring, and if not found, immediately fail the match without | |
1449 | actually trying it; so no warning gets generated even if the string | |
1450 | contains an above-Unicode code point. | |
1451 | ||
1452 | This behavior is more "Do what I mean" than in earlier Perls for most | |
1453 | applications. But it catches fewer issues for code that needs to be | |
1454 | strictly Unicode compliant. Therefore there is an additional mode of | |
1455 | operation available to accommodate such code. This mode is enabled if a | |
1456 | regular expression pattern is compiled within the lexical scope where | |
1457 | the C<"non_unicode"> warning class has been made fatal, say by: | |
1458 | ||
1459 | use warnings FATAL => "non_unicode"; |
1460 | ||
44ecbbd8 | 1461 | (see L<warnings>). In this mode of operation, Perl will raise the |
2d88a86a KW |
1462 | warning for all matches against a non-Unicode code point (not just the |
1463 | arguable ones), and it skips the optimizations that might cause the | |
1464 | warning to not be output. (It currently still won't warn if the match | |
1465 | isn't even attempted, like in the C<"foobar"> example above.) | |
1466 | ||
1467 | In summary, Perl now normally treats non-Unicode code points as typical | |
1468 | Unicode unassigned code points for regular expression matches, raising a | |
1469 | warning only when it is arguable what the result should be. However, if | |
1470 | this warning has been made fatal, it isn't skipped. | |
1471 | ||
1472 | There is one exception to all this. C<\p{All}> looks like a Unicode | |
1473 | property, but it is a Perl extension that is defined to be true for all | |
1474 | possible code points, Unicode or not, so no warning is ever generated | |
1475 | when matching this against a non-Unicode code point. (Prior to v5.20, | |
1476 | it was an exact synonym for C<\p{Any}>, matching code points C<0> | |
1477 | through C<0x10FFFF>.) | |
6d4f9cf2 | 1478 | |
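The property-matching rules of this section can be tried out with a
few patterns. This is an illustrative sketch only; on v5.20 or later
the text above implies that C<ASCII_Hex_Digit> fails, C<Unassigned>
succeeds (with a warning), and C<All> succeeds silently. Earlier
releases are described above as behaving differently.

```perl
use strict;
use warnings;

my $above = chr(0x110000);    # first code point beyond Unicode

my @warnings;
my %matched;
{
    # Collect warnings so they can be counted afterwards.
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    $matched{ahex}       = $above =~ /\p{ASCII_Hex_Digit}/ ? 1 : 0;
    $matched{unassigned} = $above =~ /\p{Unassigned}/      ? 1 : 0;
    $matched{all}        = $above =~ /\p{All}/             ? 1 : 0;
}

printf "%-10s %d\n", $_, $matched{$_} for sort keys %matched;
printf "%d warning(s) raised\n", scalar @warnings;
```

Compiling the same patterns within the scope of
C<use warnings FATAL =E<gt> "non_unicode"> would turn the arguable
cases into exceptions instead.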
0d7c09bb JH |
1479 | =head2 Security Implications of Unicode |
1480 | ||
b65e6125 KW |
1481 | First, read |
1482 | L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>. | |
1483 | ||
e1b711da KW |
1484 | Also, note the following: |
1485 | ||
0d7c09bb JH |
1486 | =over 4 |
1487 | ||
1488 | =item * | |
1489 | ||
1490 | Malformed UTF-8 | |
bf0fa0b2 | 1491 | |
42581d5d | 1492 | Unfortunately, the original specification of UTF-8 leaves some room for |
bf0fa0b2 | 1493 | interpretation of how many bytes of encoded output one should generate |
376d9008 JB |
1494 | from one input Unicode character. Strictly speaking, the shortest |
1495 | possible sequence of UTF-8 bytes should be generated, | |
1496 | because otherwise there is potential for an input buffer overflow at | |
feda178f | 1497 | the receiving end of a UTF-8 connection. Perl always generates the |
e1b711da | 1498 | shortest length UTF-8, and with warnings on, Perl will warn about |
376d9008 | 1499 | non-shortest length UTF-8 along with other malformations, such as the |
b19eb496 | 1500 | surrogates, which are not Unicode code points valid for interchange. |
bf0fa0b2 | 1501 | |
0d7c09bb JH |
1502 | =item * |
1503 | ||
68693f9e | 1504 | Regular expression pattern matching may surprise you if you're not |
b19eb496 TC |
1505 | accustomed to Unicode. Starting in Perl 5.14, several pattern |
1506 | modifiers are available to control this, called the character set | |
42581d5d KW |
1507 | modifiers. Details are given in L<perlre/Character set modifiers>. |
1508 | ||
1509 | =back | |
0d7c09bb | 1510 | |
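The shortest-form rule from the "Malformed UTF-8" item can be seen with
the standard L<Encode> module. The sketch below assumes the strict
C<UTF-8> encoding name and the C<FB_CROAK> check value; the byte pair
C<0xC0 0xAF> is a well-known overlong spelling of C<"/">.

```perl
use strict;
use warnings;
use Encode qw(encode decode FB_CROAK);

# Perl always emits the shortest form: "/" (U+002F) is a single byte.
my $shortest = encode('UTF-8', "/");

# 0xC0 0xAF is an overlong (two-byte) spelling of the same code point.
my $overlong = "\xC0\xAF";
my $decoded  = eval {
    # decode() may modify its input when a CHECK value is given,
    # so hand it a scratch copy.
    my $scratch = $overlong;
    decode('UTF-8', $scratch, FB_CROAK);
};
my $rejected = defined $decoded ? 0 : 1;

printf "shortest form is %d byte(s); overlong input %s\n",
    length($shortest), $rejected ? "rejected" : "accepted";
```

Rejecting the overlong form matters because a filter that looks only
for the byte C<0x2F> would otherwise miss the disguised slash.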
376d9008 | 1511 | As discussed elsewhere, Perl has one foot (two hooves?) planted in |
1bfb14c4 JH |
1512 | each of two worlds: the old world of bytes and the new world of |
1513 | characters, upgrading from bytes to characters when necessary. | |
376d9008 JB |
1514 | If your legacy code does not explicitly use Unicode, no automatic |
1515 | switch-over to characters should happen. Characters shouldn't get | |
1bfb14c4 JH |
1516 | downgraded to bytes, either. It is possible to accidentally mix bytes |
1517 | and characters, however (see L<perluniintro>), in which case C<\w> in | |
42581d5d | 1518 | regular expressions might start behaving differently (unless the C</a> |
b19eb496 | 1519 | modifier is in effect). Review your code. Use warnings and the C<strict> pragma. |
0d7c09bb | 1520 | |
c349b1b9 JH |
1521 | =head2 Unicode in Perl on EBCDIC |
1522 | ||
376d9008 JB |
1523 | The way Unicode is handled on EBCDIC platforms is still |
1524 | experimental. On such platforms, references to UTF-8 encoding in this | |
1525 | document and elsewhere should be read as meaning the UTF-EBCDIC | |
1526 | specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues | |
c349b1b9 | 1527 | are specifically discussed. There is no C<utfebcdic> pragma or |
a9130ea9 | 1528 | C<":utfebcdic"> layer; rather, C<"utf8"> and C<":utf8"> are reused to mean |
86bbd6d1 PN |
1529 | the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic> |
1530 | for more discussion of the issues. | |
c349b1b9 | 1531 | |
b310b053 JH |
1532 | =head2 Locales |
1533 | ||
42581d5d | 1534 | See L<perllocale/Unicode and UTF-8>. |
b310b053 | 1535 | |
1aad1664 JH |
1536 | =head2 When Unicode Does Not Happen |
1537 | ||
b65e6125 KW |
1538 | There are still many places where Unicode (in some encoding or |
1539 | another) could be given as arguments or received as results, or both in | |
1540 | Perl, but it is not, in spite of Perl having extensive ways to input and | |
1541 | output in Unicode, and a few other "entry points" like the C<@ARGV> | |
1542 | array (which can sometimes be interpreted as UTF-8). | |
1aad1664 | 1543 | |
e1b711da KW |
1544 | The following are such interfaces. Also, see L</The "Unicode Bug">. |
1545 | For all of these interfaces Perl | |
b9cedb1b | 1546 | currently (as of v5.16.0) simply assumes byte strings both as arguments |
b65e6125 | 1547 | and results, or UTF-8 strings if the (deprecated) C<encoding> pragma has been used. |
1aad1664 | 1548 | |
b19eb496 TC |
1549 | One reason that Perl does not attempt to resolve the role of Unicode in |
1550 | these situations is that the answers are highly dependent on the operating | |
1aad1664 | 1551 | system and the file system(s). For example, whether filenames can be |
b19eb496 TC |
1552 | in Unicode and in exactly what kind of encoding, is not exactly a |
1553 | portable concept. Similarly for C<qx> and C<system>: how well will the | |
1554 | "command-line interface" (and which of them?) handle Unicode? | |
1aad1664 JH |
1555 | |
1556 | =over 4 | |
1557 | ||
557a2462 RB |
1558 | =item * |
1559 | ||
a9130ea9 KW |
1560 | C<chdir>, C<chmod>, C<chown>, C<chroot>, C<exec>, C<link>, C<lstat>, C<mkdir>, |
1561 | C<rename>, C<rmdir>, C<stat>, C<symlink>, C<truncate>, C<unlink>, C<utime>, C<-X> | |
557a2462 RB |
1562 | |
1563 | =item * | |
1564 | ||
a9130ea9 | 1565 | C<%ENV> |
557a2462 RB |
1566 | |
1567 | =item * | |
1568 | ||
a9130ea9 | 1569 | C<glob> (aka the C<E<lt>*E<gt>>) |
557a2462 RB |
1570 | |
1571 | =item * | |
1aad1664 | 1572 | |
a9130ea9 | 1573 | C<open>, C<opendir>, C<sysopen> |
1aad1664 | 1574 | |
557a2462 | 1575 | =item * |
1aad1664 | 1576 | |
a9130ea9 | 1577 | C<qx> (aka the backtick operator), C<system> |
1aad1664 | 1578 | |
557a2462 | 1579 | =item * |
1aad1664 | 1580 | |
a9130ea9 | 1581 | C<readdir>, C<readlink> |
1aad1664 JH |
1582 | |
1583 | =back | |
1584 | ||
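Because the interfaces listed above work on byte strings, one common
approach is to convert explicitly at the boundary. The following is a
sketch under a loudly stated assumption: that the file system stores
names as UTF-8, which is not true everywhere. The helper names
C<fs_encode> and C<fs_decode> are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Assumption: the file system stores names as UTF-8 bytes.  Adjust
# $FS_ENCODING on platforms where that is not true.
my $FS_ENCODING = 'UTF-8';

# Hypothetical helpers: convert between Perl character strings and
# the byte strings that open(), readdir() and friends work with.
sub fs_encode { return encode($FS_ENCODING, $_[0]) }
sub fs_decode { return decode($FS_ENCODING, $_[0]) }

my $name  = "caf\x{E9}.txt";      # 8 characters
my $bytes = fs_encode($name);     # 9 bytes: E9 becomes C3 A9

printf "%d characters, %d bytes\n", length($name), length($bytes);

# Typical use (not executed here):
#   open(my $fh, '<', fs_encode($name)) or die "open: $!";
#   my @names = map { fs_decode($_) } readdir($dh);
```

Keeping the conversion in two small functions means only one place has
to change if the assumption about the file system turns out to be
wrong.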
e1b711da KW |
1585 | =head2 The "Unicode Bug" |
1586 | ||
2e2b2571 | 1587 | The term "Unicode bug" has been applied to an inconsistency |
42581d5d | 1588 | on ASCII platforms with the |
a9130ea9 | 1589 | Unicode code points in the C<Latin-1 Supplement> block, that |
e1b711da KW |
1590 | is, between 128 and 255. Without a locale specified, unlike all other |
1591 | characters or code points, these characters behave very differently under |
20db7501 | 1592 | byte semantics than under character semantics, unless |
2e2b2571 KW |
1593 | C<use feature 'unicode_strings'> is specified, directly or indirectly. |
1594 | (It is indirectly specified by a C<use v5.12> or higher.) | |
e1b711da | 1595 | |
2e2b2571 KW |
1596 | In character semantics these upper-Latin1 characters are interpreted as |
1597 | Unicode code points, which means | |
e1b711da KW |
1598 | they have the same semantics as Latin-1 (ISO-8859-1). |
1599 | ||
2e2b2571 KW |
1600 | In byte semantics (without C<unicode_strings>), they are considered to |
1601 | be unassigned characters, meaning that the only semantics they have is | |
1602 | their ordinal numbers, and that they are | |
e1b711da | 1603 | not members of various character classes. None are considered to match C<\w> |
42581d5d | 1604 | for example, but all match C<\W>. |
e1b711da | 1605 | |
2e2b2571 KW |
1606 | Perl 5.12.0 added C<unicode_strings> to force character semantics on |
1607 | these code points in some circumstances, which fixed portions of the | |
1608 | bug; Perl 5.14.0 fixed almost all of it; and Perl 5.16.0 fixed the | |
1609 | remainder (so far as we know, anyway). The lesson here is to enable | |
1610 | C<unicode_strings> to avoid the headaches described below. | |
1611 | ||
1612 | The old, problematic behavior affects these areas: | |
e1b711da KW |
1613 | |
1614 | =over 4 | |
1615 | ||
1616 | =item * | |
1617 | ||
1618 | Changing the case of a scalar, that is, using C<uc()>, C<ucfirst()>, C<lc()>, | |
2e2b2571 KW |
1619 | and C<lcfirst()>, or C<\L>, C<\U>, C<\u> and C<\l> in double-quotish |
1620 | contexts, such as regular expression substitutions. | |
1621 | Under C<unicode_strings> starting in Perl 5.12.0, character semantics are | |
1622 | generally used. See L<perlfunc/lc> for details on how this works | |
1623 | in combination with various other pragmas. | |
e1b711da KW |
1624 | |
1625 | =item * | |
1626 | ||
2e2b2571 KW |
1627 | Using caseless (C</i>) regular expression matching. |
1628 | Starting in Perl 5.14.0, regular expressions compiled within | |
c43ca372 | 1629 | the scope of C<unicode_strings> use character semantics |
2e2b2571 KW |
1630 | even when executed or compiled into larger |
1631 | regular expressions outside the scope. | |
e1b711da KW |
1632 | |
1633 | =item * | |
1634 | ||
64935bc6 KW |
1635 | Matching any of several properties in regular expressions, namely |
1636 | C<\b> (without braces), C<\B> (without braces), C<\s>, C<\S>, C<\w>, | |
1637 | C<\W>, and all the Posix character classes | |
630d17dc | 1638 | I<except> C<[[:ascii:]]>. |
2e2b2571 | 1639 | Starting in Perl 5.14.0, regular expressions compiled within |
c43ca372 | 1640 | the scope of C<unicode_strings> use character semantics |
2e2b2571 KW |
1641 | even when executed or compiled into larger |
1642 | regular expressions outside the scope. | |
e1b711da KW |
1643 | |
1644 | =item * | |
1645 | ||
91faff93 KW |
1646 | In C<quotemeta> or its inline equivalent C<\Q>, no code points above 127 |
1647 | are quoted in UTF-8 encoded strings, but in byte encoded strings, code | |
1648 | points between 128-255 are always quoted. | |
2e2b2571 KW |
1649 | Starting in Perl 5.16.0, consistent quoting rules are used within the |
1650 | scope of C<unicode_strings>, as described in L<perlfunc/quotemeta>. | |
eb88ed9e | 1651 | |
e1b711da KW |
1652 | =back |
1653 | ||
1654 | This behavior can lead to unexpected results in which a string's semantics | |
1655 | suddenly change if a code point above 255 is appended to or removed from it, | |
1656 | which changes the string's semantics from byte to character or vice versa. As | |
1657 | an example, consider the following program and its output: | |
1658 | ||
1659 | $ perl -le' | |
42581d5d | 1660 | no feature "unicode_strings"; |
e1b711da KW |
1661 | $s1 = "\xC2"; |
1662 | $s2 = "\x{2660}"; | |
1663 | for ($s1, $s2, $s1.$s2) { | |
1664 | print /\w/ || 0; | |
1665 | } | |
1666 | ' | |
1667 | 0 | |
1668 | 0 | |
1669 | 1 | |
1670 | ||
9f815e24 | 1671 | If there's no C<\w> in C<$s1> or in C<$s2>, why does their concatenation have one? |
e1b711da KW |
1672 | |
1673 | This anomaly stems from Perl's attempt to not disturb older programs that | |
1674 | didn't use Unicode, and hence had no semantics for characters outside of the | |
1675 | ASCII range (except in a locale), along with Perl's desire to add Unicode | |
1676 | support seamlessly. The result wasn't seamless: these characters were | |
1677 | orphaned. | |
1678 | ||
2e2b2571 KW |
1679 | For Perls earlier than those described above, or when a string is passed |
1680 | to a function outside the subpragma's scope, a workaround is to always | |
a9130ea9 | 1681 | call L<C<utf8::upgrade($string)>|utf8/Utility functions>, |
20db7501 | 1682 | or to use the standard module L<Encode>. Also, a scalar that has any characters |
a9130ea9 | 1683 | whose ordinal is C<0x100> or above, or which were specified using either of the |
b19eb496 | 1684 | C<\N{...}> notations, will automatically have character semantics. |
e1b711da | 1685 | |
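The two remedies mentioned above, upgrading the string or compiling the
pattern under C<unicode_strings>, can be compared side by side. This is
a minimal sketch for an ASCII platform with no locale in effect.

```perl
use strict;
use warnings;
no feature 'unicode_strings';

my $e_acute = "\xE9";    # U+00E9, stored as a single native byte

# Byte semantics: not considered a word character.
my $before = $e_acute =~ /\w/ ? 1 : 0;

# After upgrading, the same character gets character semantics.
utf8::upgrade($e_acute);
my $after = $e_acute =~ /\w/ ? 1 : 0;

# Within the feature's scope the internal representation is irrelevant.
my $scoped;
{
    use feature 'unicode_strings';
    my $fresh = "\xE9";
    $scoped = $fresh =~ /\w/ ? 1 : 0;
}

print "before=$before after=$after scoped=$scoped\n";
# prints "before=0 after=1 scoped=1"
```

The string compares equal before and after C<utf8::upgrade()>; only
its internal representation, and therefore its treatment under the old
rules, has changed.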
1aad1664 JH |
1686 | =head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl) |
1687 | ||
e1b711da KW |
1688 | Sometimes (see L</"When Unicode Does Not Happen"> or L</The "Unicode Bug">) |
1689 | there are situations where you simply need to force a byte | |
2bbc8d55 | 1690 | string into UTF-8, or vice versa. The low-level calls |
a9130ea9 KW |
1691 | L<C<utf8::upgrade($bytestring)>|utf8/Utility functions> and |
1692 | L<C<utf8::downgrade($utf8string[, FAIL_OK])>|utf8/Utility functions> are | |
1aad1664 JH |
1693 | the answers. |
1694 | ||
a9130ea9 | 1695 | Note that C<utf8::downgrade()> can fail if the string contains characters |
2bbc8d55 | 1696 | that don't fit into a byte. |
1aad1664 | 1697 | |
e1b711da KW |
1698 | Calling either function on a string that already is in the desired state is a |
1699 | no-op. | |
1700 | ||
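A short sketch of the two calls follows. It uses only functions from
the always-available C<utf8::> namespace; the variable names are
illustrative.

```perl
use strict;
use warnings;

my $latin = "\xE9";        # fits in one byte
my $wide  = "\x{263A}";    # does not fit in a byte

utf8::upgrade($latin);
my $upgraded   = utf8::is_utf8($latin) ? 1 : 0;    # now UTF-8 internally
my $same_value = ($latin eq "\xE9")    ? 1 : 0;    # still the same string

my $down_ok = utf8::downgrade($latin) ? 1 : 0;     # succeeds

# With FAIL_OK true, failure returns false instead of dying.
my $wide_ok = utf8::downgrade($wide, 1) ? 1 : 0;

print "upgraded=$upgraded same=$same_value ",
      "downgrade=$down_ok wide=$wide_ok\n";
# prints "upgraded=1 same=1 downgrade=1 wide=0"
```

Neither call changes what the string contains as far as Perl code is
concerned; they only change how it is stored.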
95a1a48b | 1701 | |
37b3b608 | 1702 | =head2 Using Unicode in XS |
c349b1b9 | 1703 | |
37b3b608 KW |
1704 | See L<perlguts/"Unicode Support"> for an introduction to Unicode at |
1705 | the XS level, and L<perlapi/Unicode Support> for the API details. | |
95a1a48b | 1706 | |
e1b711da KW |
1707 | =head2 Hacking Perl to work on earlier Unicode versions (for very serious hackers only) |
1708 | ||
1709 | Perl by default comes with the latest supported Unicode version built in, but | |
1710 | you can change to use any earlier one. | |
1711 | ||
42581d5d | 1712 | Download the files in the desired version of Unicode from the Unicode web |
e1b711da | 1713 | site (L<http://www.unicode.org>). These should replace the existing files in |
b19eb496 | 1714 | F<lib/unicore> in the Perl source tree. Follow the instructions in |
116693e8 | 1715 | F<README.perl> in that directory to change some of their names, and then build |
26e391dd | 1716 | perl (see L<INSTALL>). |
116693e8 | 1717 | |
c29a771d JH |
1718 | =head1 BUGS |
1719 | ||
376d9008 | 1720 | =head2 Interaction with Locales |
7eabb34d | 1721 | |
42581d5d | 1722 | See L<perllocale/Unicode and UTF-8>. |
c29a771d | 1723 | |
9f815e24 | 1724 | =head2 Problems with characters in the Latin-1 Supplement range |
2bbc8d55 | 1725 | |
e1b711da KW |
1726 | See L</The "Unicode Bug"> |
1727 | ||
376d9008 | 1728 | =head2 Interaction with Extensions |
7eabb34d | 1729 | |
376d9008 | 1730 | When Perl exchanges data with an extension, the extension should be |
2575c402 | 1731 | able to understand the UTF8 flag and act accordingly. If the |
b19eb496 | 1732 | extension doesn't recognize that flag, it's likely that the extension |
376d9008 | 1733 | will return incorrectly-flagged data. |
7eabb34d A |
1734 | |
1735 | So if you're working with Unicode data, consult the documentation of | |
1736 | every module you're using to see whether there are any issues with Unicode data |
1737 | exchange. If the documentation does not talk about Unicode at all, | |
a73d23f6 | 1738 | suspect the worst and probably look at the source to learn how the |
376d9008 | 1739 | module is implemented. Modules written completely in Perl shouldn't |
a73d23f6 RGS |
1740 | cause problems. Modules that directly or indirectly access code written |
1741 | in other programming languages are at risk. | |
7eabb34d | 1742 | |
376d9008 | 1743 | For affected functions, the simple strategy to avoid data corruption is |
7eabb34d | 1744 | to always make the encoding of the exchanged data explicit. Choose an |
376d9008 | 1745 | encoding that you know the extension can handle. Convert arguments passed |
7eabb34d A |
1746 | to the extensions to that encoding and convert results back from that |
1747 | encoding. Write wrapper functions that do the conversions for you, so | |
1748 | you can later change the functions when the extension catches up. | |
1749 | ||
a9130ea9 | 1750 | To provide an example, let's say the popular C<Foo::Bar::escape_html> |
7eabb34d A |
1751 | function doesn't deal with Unicode data yet. The wrapper function |
1752 | would convert the argument to raw UTF-8 and convert the result back to | |
376d9008 | 1753 | Perl's internal representation like so: |
7eabb34d A |
1754 | |
1755 | sub my_escape_html ($) { | |
d88362ca KW |
1756 | my($what) = shift; |
1757 | return unless defined $what; | |
1758 | Encode::decode_utf8(Foo::Bar::escape_html( | |
1759 | Encode::encode_utf8($what))); | |
7eabb34d A |
1760 | } |
1761 | ||
1762 | Sometimes, when the extension does not convert data but just stores | |
b19eb496 | 1763 | and retrieves them, you will be able to use the otherwise |
a9130ea9 KW |
1764 | dangerous L<C<Encode::_utf8_on()>|Encode/_utf8_on> function. Let's say |
1765 | the popular C<Foo::Bar> extension, written in C, provides a C<param> | |
1766 | method that lets you store and retrieve data according to these prototypes: | |
7eabb34d A |
1767 | |
1768 | $self->param($name, $value); # set a scalar | |
1769 | $value = $self->param($name); # retrieve a scalar | |
1770 | ||
1771 | If it does not yet provide support for any encoding, one could write a | |
1772 | derived class with such a C<param> method: | |
1773 | ||
1774 | sub param { | |
1775 | my($self,$name,$value) = @_; | |
1776 | utf8::upgrade($name); # make sure it is UTF-8 encoded | |
af55fc6a | 1777 | if (defined $value) { |
7eabb34d A |
1778 | utf8::upgrade($value); # make sure it is UTF-8 encoded |
1779 | return $self->SUPER::param($name,$value); | |
1780 | } else { | |
1781 | my $ret = $self->SUPER::param($name); | |
1782 | Encode::_utf8_on($ret); # we know it is UTF-8 encoded |
1783 | return $ret; | |
1784 | } | |
1785 | } | |
1786 | ||
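The difference between C<Encode::_utf8_on()> and a real decode can be seen in a small experiment. This is only a sketch: it assumes the octets are already well-formed UTF-8, which is the one situation in which flipping the flag is safe. C<_utf8_on()> performs no validation at all, whereas C<decode_utf8()> actually parses the octets.

```perl
use strict;
use warnings;
use Encode ();

# Seven characters, two of them outside ASCII.
my $octets = Encode::encode_utf8("na\x{ef}ve \x{263A}");

# Flip the flag on a copy: no bytes change, only their interpretation.
my $flagged = $octets;
Encode::_utf8_on($flagged);

# Decode properly: the octets are parsed as UTF-8.
my $decoded = Encode::decode_utf8($octets);

printf "octets: %d, flagged: %d, decoded: %d\n",
    length($octets), length($flagged), length($decoded);
```

For well-formed input both approaches yield the same character string. They differ only in what happens when the input is not what you believed it to be.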
a73d23f6 | 1787 | Some extensions provide filters on data entry/exit points, such as |
a9130ea9 | 1788 | C<DB_File::filter_store_key> and family. Look out for such filters in |
66b79f27 | 1789 | the documentation of your extensions; they can make the transition to |
7eabb34d A |
1790 | Unicode data much easier. |
1791 | ||
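As an illustration of such filters, here is a minimal sketch that uses the DBM filter hooks to store UTF-8 octets on disk while the program sees character strings. It assumes C<SDBM_File> (shipped with Perl, and offering the same C<filter_*> methods as C<DB_File>) and a temporary directory; the key and value are made up for the example.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDWR O_CREAT);
use SDBM_File;
use Encode ();
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);
my %h;
my $db = tie(%h, 'SDBM_File', "$dir/demo", O_RDWR | O_CREAT, 0666)
    or die "cannot tie: $!";

# Each filter receives the datum in $_ and must leave its result there.
$db->filter_store_key  (sub { $_ = Encode::encode_utf8($_) });
$db->filter_store_value(sub { $_ = Encode::encode_utf8($_) });
$db->filter_fetch_key  (sub { $_ = Encode::decode_utf8($_) });
$db->filter_fetch_value(sub { $_ = Encode::decode_utf8($_) });

$h{"\x{263A}"} = "caf\x{e9}";       # characters go in
my $value = $h{"\x{263A}"};         # characters come out
my ($key) = keys %h;

undef $db;
untie %h;
```

With the filters installed, the rest of the program needs no explicit encoding or decoding calls for this hash.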
376d9008 | 1792 | =head2 Speed |
7eabb34d | 1793 | |
c29a771d | 1794 | Some functions are slower when working on UTF-8 encoded strings than |
574c8022 | 1795 | on byte-encoded strings. All functions that need to hop over |
a9130ea9 KW |
1796 | characters, such as C<length()>, C<substr()> or C<index()>, or matching |
1797 | regular expressions, can work B<much> faster when the underlying data are |
7c17141f JH |
1798 | byte-encoded. |
1799 | ||
1800 | In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1 | |
1801 | a caching scheme was introduced that makes the slowness |
a104b433 JH |
1802 | somewhat less spectacular, at least for some operations. In general, |
1803 | operations with UTF-8 encoded strings are still slower. As an example, | |
1804 | the Unicode properties (character classes) like C<\p{Nd}> are known to | |
1805 | be quite a bit slower (5-20 times) than their simpler counterparts | |
42581d5d | 1806 | like C<\d> (then again, there are hundreds of Unicode characters matching C<Nd> |
a104b433 | 1807 | compared with the 10 ASCII characters matching C<\d>). |
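
To judge whether the difference matters for your own data, you can measure it. The following is a rough sketch using the core C<Benchmark> module; the string contents, sizes and offsets are arbitrary choices for illustration, and the results depend heavily on the Perl version and platform.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Same characters, two internal representations.
my $bytes = "\x{e9}" x 65_536;      # one byte per character
my $wide  = $bytes;
utf8::upgrade($wide);               # UTF-8 internally, two bytes each

# Run each candidate for at least one CPU second and print a comparison.
cmpthese(-1, {
    byte_substr => sub { my $c = substr($bytes, 60_000, 1) },
    utf8_substr => sub { my $c = substr($wide,  60_000, 1) },
    byte_index  => sub { my $p = index($bytes, "z") },
    utf8_index  => sub { my $p = index($wide,  "z") },
});
```

The two strings compare equal with C<eq>; only the internal storage, and therefore the cost of locating a character position, differs.
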
c8d992ba A |
1808 | =head2 Porting code from perl-5.6.X |
1809 | ||
1810 | Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer | |
1811 | was required to use the C<utf8> pragma to declare that a given scope | |
1812 | expected to deal with Unicode data and had to make sure that only | |
1813 | Unicode data were reaching that scope. If you have code that is | |
1814 | working with 5.6, you will need some of the following adjustments to | |
1815 | your code. The examples are written such that the code will continue | |
1816 | to work under 5.6, so you should be safe to try them out. | |
1817 | ||
755789c0 | 1818 | =over 3 |
c8d992ba A |
1819 | |
1820 | =item * | |
1821 | ||
1822 | A filehandle that should read or write UTF-8 | |
1823 | ||
b9cedb1b | 1824 | if ($] >= 5.008) { |
740d4bb2 | 1825 | binmode $fh, ":encoding(utf8)"; |
c8d992ba A |
1826 | } |
1827 | ||
1828 | =item * | |
1829 | ||
1830 | A scalar that is going to be passed to some extension | |
1831 | ||
a9130ea9 | 1832 | Be it C<Compress::Zlib>, C<Apache::Request> or any extension that has no |
c8d992ba | 1833 | mention of Unicode in the manpage, you need to make sure that the |
2575c402 | 1834 | UTF8 flag is stripped off. Note that at the time of this writing |
b9cedb1b | 1835 | (January 2012) the mentioned modules are not UTF-8-aware. Please |
c8d992ba A |
1836 | check the documentation to verify if this is still true. |
1837 | ||
b9cedb1b | 1838 | if ($] >= 5.008) { |
c8d992ba A |
1839 | require Encode; |
1840 | $val = Encode::encode_utf8($val); # make octets | |
1841 | } | |
1842 | ||
1843 | =item * | |
1844 | ||
1845 | A scalar we got back from an extension | |
1846 | ||
1847 | If you believe the scalar comes back as UTF-8, you will most likely | |
2575c402 | 1848 | want the UTF8 flag restored: |
c8d992ba | 1849 | |
b9cedb1b | 1850 | if ($] >= 5.008) { |
c8d992ba A |
1851 | require Encode; |
1852 | $val = Encode::decode_utf8($val); | |
1853 | } | |
1854 | ||
1855 | =item * | |
1856 | ||
1857 | Same thing, if you are really sure it is UTF-8 | |
1858 | ||
b9cedb1b | 1859 | if ($] >= 5.008) { |
c8d992ba A |
1860 | require Encode; |
1861 | Encode::_utf8_on($val); | |
1862 | } | |
1863 | ||
1864 | =item * | |
1865 | ||
a9130ea9 | 1866 | A wrapper for L<DBI> C<fetchrow_array> and C<fetchrow_hashref> |
c8d992ba A |
1867 | |
1868 | When the database contains only UTF-8, a wrapper function or method is | |
a9130ea9 KW |
1869 | a convenient way to replace all your C<fetchrow_array> and |
1870 | C<fetchrow_hashref> calls. A wrapper function will also make it easier to | |
c8d992ba | 1871 | adapt to future enhancements in your database driver. Note that at the |
b9cedb1b | 1872 | time of this writing (January 2012), the DBI has no standardized way |
a9130ea9 | 1873 | to deal with UTF-8 data. Please check the L<DBI documentation|DBI> to verify if |
c8d992ba A |
1874 | that is still true. |
1875 | ||
1876 | sub fetchrow { | |
d88362ca KW |
1877 | # $what is one of fetchrow_{array,hashref} |
1878 | my($self, $sth, $what) = @_; | |
b9cedb1b | 1879 | if ($] < 5.008) { |
c8d992ba A |
1880 | return $sth->$what; |
1881 | } else { | |
1882 | require Encode; | |
1883 | if (wantarray) { | |
1884 | my @arr = $sth->$what; | |
1885 | for (@arr) { | |
1886 | defined && /[^\000-\177]/ && Encode::_utf8_on($_); | |
1887 | } | |
1888 | return @arr; | |
1889 | } else { | |
1890 | my $ret = $sth->$what; | |
1891 | if (ref $ret) { | |
1892 | for my $k (keys %$ret) { | |
d88362ca KW |
1893 | defined |
1894 | && /[^\000-\177]/ | |
1895 | && Encode::_utf8_on($_) for $ret->{$k}; | |
c8d992ba A |
1896 | } |
1897 | return $ret; | |
1898 | } else { | |
1899 | defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret; | |
1900 | return $ret; | |
1901 | } | |
1902 | } | |
1903 | } | |
1904 | } | |
1905 | ||
1906 | ||
1907 | =item * | |
1908 | ||
1909 | A large scalar that you know can only contain ASCII | |
1910 | ||
1911 | Scalars that contain only ASCII and are marked as UTF-8 are sometimes | |
1912 | a drag on your program. If you recognize such a situation, just remove |
2575c402 | 1913 | the UTF8 flag: |
c8d992ba | 1914 | |
b9cedb1b | 1915 | utf8::downgrade($val) if $] >= 5.008; |
c8d992ba A |
1916 | |
1917 | =back | |
1918 | ||
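The effect of the adjustments in the list above can be checked with a short script. This is a sketch for Perl 5.8 or later only, so it omits the C<$]> guards; it assumes a temporary file and made-up data.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A filehandle that writes UTF-8: one character becomes three octets.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh, ":encoding(UTF-8)";
print {$fh} "\x{263A}";
close $fh or die "close failed: $!";

# Read the file back as raw octets, then as decoded characters.
open my $raw, "<:raw", $path or die "open failed: $!";
my $octets = do { local $/; <$raw> };
close $raw;

open my $in, "<:encoding(UTF-8)", $path or die "open failed: $!";
my $chars = do { local $/; <$in> };
close $in;

# An ASCII-only scalar keeps its value when the UTF8 flag is removed.
my $ascii = "plain ASCII";
utf8::upgrade($ascii);
my $was_flagged = utf8::is_utf8($ascii);
utf8::downgrade($ascii);
my $is_flagged = utf8::is_utf8($ascii);
```

The layer does the conversion at the I/O boundary, so the rest of the program deals only with character strings.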
393fec97 GS |
1919 | =head1 SEE ALSO |
1920 | ||
51f494cc | 1921 | L<perlunitut>, L<perluniintro>, L<perluniprops>, L<Encode>, L<open>, L<utf8>, L<bytes>, |
b65e6125 | 1922 | L<perlretut>, L<perlvar/"${^UNICODE}">, |
51f494cc | 1923 | L<http://www.unicode.org/reports/tr44>. |
393fec97 GS |
1924 | |
1925 | =cut |