Commit | Line | Data |
---|---|---|
393fec97 GS |
1 | =head1 NAME |
2 | ||
3 | perlunicode - Unicode support in Perl | |
4 | ||
5 | =head1 DESCRIPTION | |
6 | ||
0a1f2d14 | 7 | =head2 Important Caveats |
21bad921 | 8 | |
376d9008 | 9 | Unicode support is an extensive requirement. While Perl does not |
c349b1b9 JH |
10 | implement the Unicode standard or the accompanying technical reports |
11 | from cover to cover, Perl does support many Unicode features. | |
21bad921 | 12 | |
2575c402 | 13 | People who want to learn to use Unicode in Perl should probably read |
9d1c51c1 | 14 | the L<Perl Unicode tutorial, perlunitut|perlunitut>, before reading |
e4911a48 | 15 | this reference document. |
2575c402 | 16 | |
9d1c51c1 KW |
17 | Also, the use of Unicode may present security issues that aren't obvious. |
18 | Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>. | |
19 | ||
13a2d996 | 20 | =over 4 |
21bad921 | 21 | |
fae2c0fb | 22 | =item Input and Output Layers |
21bad921 | 23 | |
376d9008 | 24 | Perl knows when a filehandle uses Perl's internal Unicode encodings |
1bfb14c4 | 25 | (UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with |
4ee7c0ea | 26 | the ":encoding(utf8)" layer. Other encodings can be converted to Perl's |
1bfb14c4 JH |
27 | encoding on input or from Perl's encoding on output by use of the |
28 | ":encoding(...)" layer. See L<open>. | |
c349b1b9 | 29 | |
2575c402 | 30 | To indicate that Perl source itself is in UTF-8, use C<use utf8;>. |
21bad921 GS |
31 | |
32 | =item Regular Expressions | |
33 | ||
c349b1b9 | 34 | The regular expression compiler produces polymorphic opcodes. That is, |
376d9008 | 35 | the pattern adapts to the data and automatically switches to the Unicode |
2575c402 | 36 | character scheme when presented with data that is internally encoded in |
ac036724 | 37 | UTF-8, or instead uses a traditional byte scheme when presented with |
2575c402 | 38 | byte data. |
21bad921 | 39 | |
ad0029c4 | 40 | =item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts |
21bad921 | 41 | |
376d9008 JB |
42 | As a compatibility measure, the C<use utf8> pragma must be explicitly |
43 | included to enable recognition of UTF-8 in the Perl scripts themselves | |
1bfb14c4 JH |
44 | (in string or regular expression literals, or in identifier names) on |
45 | ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based | |
376d9008 | 46 | machines. B<These are the only times when an explicit C<use utf8> |
8f8cf39c | 47 | is needed.> See L<utf8>. |
21bad921 | 48 | |
7aa207d6 JH |
49 | =item BOM-marked scripts and UTF-16 scripts autodetected |
50 | ||
51 | If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF-16BE, | |
52 | or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either | |
53 | endianness, Perl will correctly read in the script as Unicode. | |
54 | (BOMless UTF-8 cannot be effectively recognized or differentiated from | |
55 | ISO 8859-1 or other eight-bit encodings.) | |
56 | ||
990e18f7 AT |
57 | =item C<use encoding> needed to upgrade non-Latin-1 byte strings |
58 | ||
38a44b82 | 59 | By default, there is a fundamental asymmetry in Perl's Unicode model: |
990e18f7 AT |
60 | implicit upgrading from byte strings to Unicode strings assumes that |
61 | they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are | |
62 | downgraded with UTF-8 encoding. This happens because the first 256 | |
51f494cc | 63 | code points in Unicode happen to agree with Latin-1. |
990e18f7 | 64 | |
990e18f7 AT |
65 | See L</"Byte and Character Semantics"> for more details. |
66 | ||
21bad921 GS |
67 | =back |
68 | ||
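As a minimal sketch of the "Input and Output Layers" item above, the following writes a string through an encoding layer and reads it back both decoded and raw. The scratch file from C<File::Temp> and the variable names are assumptions made for the illustration, and the strict C<:encoding(UTF-8)> spelling is used in place of C<:encoding(utf8)>.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical scratch file, used only for this demonstration.
my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

my $text = "caf\x{E9} \x{263A}";    # 6 characters, one of them above 0xFF

# Encode from Perl's internal form to UTF-8 on output.
open(my $out, '>:encoding(UTF-8)', $path) or die "Cannot write $path: $!";
print {$out} $text;
close $out;

# Decode from UTF-8 to Perl's internal form on input.
open(my $in, '<:encoding(UTF-8)', $path) or die "Cannot read $path: $!";
my $decoded = do { local $/; <$in> };
close $in;

# Read the same file with no decoding at all, to see the octets on disk.
open(my $raw, '<:raw', $path) or die "Cannot read $path: $!";
my $octets = do { local $/; <$raw> };
close $raw;

my $char_count  = length $decoded;
my $octet_count = length $octets;
print "$char_count characters, $octet_count octets\n";
```

The decoded string has the same length as the original, while the raw read is longer because the two non-ASCII characters occupy several octets each.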
376d9008 | 69 | =head2 Byte and Character Semantics |
393fec97 | 70 | |
376d9008 | 71 | Beginning with version 5.6, Perl uses logically-wide characters to |
3e4dbfed | 72 | represent strings internally. |
393fec97 | 73 | |
376d9008 JB |
74 | In future, Perl-level operations will be expected to work with |
75 | characters rather than bytes. | |
393fec97 | 76 | |
376d9008 | 77 | However, as an interim compatibility measure, Perl aims to |
75daf61c JH |
78 | provide a safe migration path from byte semantics to character |
79 | semantics for programs. For operations where Perl can unambiguously | |
376d9008 | 80 | decide that the input data are characters, Perl switches to |
75daf61c JH |
81 | character semantics. For operations where this determination cannot |
82 | be made without additional information from the user, Perl decides in | |
376d9008 | 83 | favor of compatibility and chooses to use byte semantics. |
8cbd9a7a | 84 | |
51f494cc | 85 | Under byte semantics, when C<use locale> is in effect, Perl uses the |
e1b711da KW |
86 | semantics associated with the current locale. Absent a C<use locale>, and |
87 | absent a C<use feature 'unicode_strings'> pragma, Perl currently uses US-ASCII | |
88 | (or Basic Latin in Unicode terminology) byte semantics, meaning that characters | |
89 | whose ordinal numbers are in the range 128 - 255 are undefined except for their | |
90 | ordinal numbers. This means that none have case (upper and lower), nor are any | |
91 | a member of character classes, like C<[:alpha:]> or C<\w>. (But all do belong | |
92 | to the C<\W> class or the Perl regular expression extension C<[:^alpha:]>.) | |
2bbc8d55 | 93 | |
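The following sketch illustrates the byte semantics just described on an ASCII-based platform, assuming no C<use locale> and no pattern modifiers. The variable names are illustrative only; C<utf8::upgrade> is used merely to obtain the same character in the internal UTF-8 representation.

```perl
use strict;
use warnings;

my $native   = "\xE9";      # LATIN SMALL LETTER E WITH ACUTE, stored as one byte
my $upgraded = "\xE9";
utf8::upgrade($upgraded);   # same character, now stored internally as UTF-8

# The two strings compare equal, yet they are classified differently.
my $same_string   = $native eq $upgraded ? 1 : 0;
my $native_word   = $native   =~ /\w/ ? 1 : 0;
my $upgraded_word = $upgraded =~ /\w/ ? 1 : 0;

# Under byte semantics the character has no case.
my $native_uc = uc $native;

# Inside the lexical scope of the feature, Unicode rules apply to the byte string.
my $scoped_uc;
{
    use feature 'unicode_strings';
    $scoped_uc = uc $native;
}

printf "equal: %d, native \\w: %d, upgraded \\w: %d\n",
    $same_string, $native_word, $upgraded_word;
printf "uc without feature: 0x%02X, with feature: 0x%02X\n",
    ord $native_uc, ord $scoped_uc;
```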
8cbd9a7a | 94 | This behavior preserves compatibility with earlier versions of Perl, |
376d9008 | 95 | which allowed byte semantics in Perl operations only if |
e1b711da | 96 | none of the program's inputs were marked as being a source of Unicode |
8cbd9a7a GS |
97 | character data. Such data may come from filehandles, from calls to |
98 | external programs, from information provided by the system (such as %ENV), | |
21bad921 | 99 | or from literals and constants in the source text. |
8cbd9a7a | 100 | |
376d9008 JB |
101 | The C<bytes> pragma will always, regardless of platform, force byte |
102 | semantics in a particular lexical scope. See L<bytes>. | |
8cbd9a7a | 103 | |
cbf9b121 KW |
104 | The C<use feature 'unicode_strings'> pragma is intended always, |
105 | regardless of platform, to force character (Unicode) semantics in a | |
106 | particular lexical scope. | |
e1b711da KW |
107 | See L</The "Unicode Bug"> below. |
108 | ||
8cbd9a7a | 109 | The C<utf8> pragma is primarily a compatibility device that enables |
75daf61c | 110 | recognition of UTF-(8|EBCDIC) in literals encountered by the parser. |
376d9008 JB |
111 | Note that this pragma is only required while Perl defaults to byte |
112 | semantics; when character semantics become the default, this pragma | |
113 | may become a no-op. See L<utf8>. | |
114 | ||
115 | Unless explicitly stated, Perl operators use character semantics | |
116 | for Unicode data and byte semantics for non-Unicode data. | |
117 | The decision to use character semantics is made transparently. If | |
118 | input data comes from a Unicode source--for example, if a character | |
fae2c0fb | 119 | encoding layer is added to a filehandle or a literal Unicode |
376d9008 JB |
120 | string constant appears in a program--character semantics apply. |
121 | Otherwise, byte semantics are in effect. The C<bytes> pragma should | |
e1b711da KW |
122 | be used to force byte semantics on Unicode data, and the C<use feature |
123 | 'unicode_strings'> pragma to force Unicode semantics on byte data (though in | |
124 | 5.12 it isn't fully implemented). | |
376d9008 JB |
125 | |
126 | If strings operating under byte semantics and strings with Unicode | |
51f494cc | 127 | character data are concatenated, the new string will have |
d9b01026 KW |
128 | character semantics. This can cause surprises: See L</BUGS>, below. |
129 | You can choose to be warned when this happens. See L<encoding::warnings>. | |
7dedd01f | 130 | |
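A short sketch of the concatenation rule above, with illustrative variable names. C<utf8::is_utf8> is used here only to observe which internal representation a string has; ordinary code rarely needs to call it.

```perl
use strict;
use warnings;

my $bytes  = "\xE9";        # one character, stored as a single byte
my $wide   = "\x{263A}";    # needs the internal UTF-8 representation
my $joined = $bytes . $wide;

my $bytes_flag  = utf8::is_utf8($bytes)  ? 1 : 0;
my $joined_flag = utf8::is_utf8($joined) ? 1 : 0;

# The joined string is two characters long and its first character is unchanged.
my $joined_len = length $joined;
my $first_ord  = ord substr($joined, 0, 1);

# The surprise: the same character is now classified under character semantics.
my $before_word = $bytes =~ /\w/ ? 1 : 0;
my $after_word  = substr($joined, 0, 1) =~ /\w/ ? 1 : 0;

printf "flag before: %d, flag after: %d, length: %d\n",
    $bytes_flag, $joined_flag, $joined_len;
printf "\\w before: %d, \\w after: %d\n", $before_word, $after_word;
```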
feda178f | 131 | Under character semantics, many operations that formerly operated on |
376d9008 | 132 | bytes now operate on characters. A character in Perl is |
feda178f | 133 | logically just a number ranging from 0 to 2**31 or so. Larger |
376d9008 JB |
134 | characters may encode into longer sequences of bytes internally, but |
135 | this internal detail is mostly hidden for Perl code. | |
136 | See L<perluniintro> for more. | |
393fec97 | 137 | |
376d9008 | 138 | =head2 Effects of Character Semantics |
393fec97 GS |
139 | |
140 | Character semantics have the following effects: | |
141 | ||
142 | =over 4 | |
143 | ||
144 | =item * | |
145 | ||
376d9008 | 146 | Strings--including hash keys--and regular expression patterns may |
574c8022 | 147 | contain characters that have an ordinal value larger than 255. |
393fec97 | 148 | |
2575c402 JW |
149 | If you use a Unicode editor to edit your program, Unicode characters may |
150 | occur directly within the literal strings in UTF-8 encoding, or UTF-16. | |
151 | (The former requires a BOM or C<use utf8>, the latter requires a BOM.) | |
3e4dbfed | 152 | |
195e542a KW |
153 | Unicode characters can also be added to a string by using the C<\N{U+...}> |
154 | notation. The Unicode code for the desired character, in hexadecimal, | |
155 | should be placed in the braces, after the C<U>. For instance, a smiley face is | |
6f335b04 KW |
156 | C<\N{U+263A}>. |
157 | ||
195e542a KW |
158 | Alternatively, you can use the C<\x{...}> notation for characters 0x100 and |
159 | above. For characters below 0x100 you may get byte semantics instead of | |
6f335b04 | 160 | character semantics; see L</The "Unicode Bug">. On EBCDIC machines there is |
195e542a | 161 | the additional problem that the value for such characters gives the EBCDIC |
6f335b04 | 162 | character rather than the Unicode one. |
3e4dbfed JF |
163 | |
164 | Additionally, if you | |
574c8022 | 165 | |
3e4dbfed | 166 | use charnames ':full'; |
574c8022 | 167 | |
1bfb14c4 JH |
168 | you can use the C<\N{...}> notation and put the official Unicode |
169 | character name within the braces, such as C<\N{WHITE SMILING FACE}>. | |
6f335b04 | 170 | See L<charnames>. |
376d9008 | 171 | |
393fec97 GS |
172 | =item * |
173 | ||
574c8022 JH |
174 | If an appropriate L<encoding> is specified, identifiers within the |
175 | Perl script may contain Unicode alphanumeric characters, including | |
376d9008 JB |
176 | ideographs. Perl does not currently attempt to canonicalize variable |
177 | names. | |
393fec97 | 178 | |
393fec97 GS |
179 | =item * |
180 | ||
1bfb14c4 | 181 | Regular expressions match characters instead of bytes. "." matches |
2575c402 | 182 | a character instead of a byte. |
393fec97 | 183 | |
393fec97 GS |
184 | =item * |
185 | ||
9d1c51c1 | 186 | Bracketed character classes in regular expressions match characters instead of |
376d9008 | 187 | bytes and match against the character properties specified in the |
1bfb14c4 | 188 | Unicode properties database. C<\w> can be used to match a Japanese |
75daf61c | 189 | ideograph, for instance. |
393fec97 | 190 | |
393fec97 GS |
191 | =item * |
192 | ||
9d1c51c1 KW |
193 | Named Unicode properties, scripts, and block ranges may be used (like bracketed |
194 | character classes) by using the C<\p{}> "matches property" construct and | |
822502e5 | 195 | the C<\P{}> negation, "doesn't match property". |
2575c402 | 196 | See L</"Unicode Character Properties"> for more details. |
822502e5 TS |
197 | |
198 | You can define your own character properties and use them | |
199 | in the regular expression with the C<\p{}> or C<\P{}> construct. | |
822502e5 TS |
200 | See L</"User-Defined Character Properties"> for more details. |
201 | ||
202 | =item * | |
203 | ||
9f815e24 KW |
204 | The special pattern C<\X> matches a logical character, an "extended grapheme |
205 | cluster" in Standardese. In Unicode what appears to the user to be a single | |
51f494cc KW |
206 | character, for example an accented C<G>, may in fact be composed of a sequence |
207 | of characters, in this case a C<G> followed by an accent character. C<\X> | |
208 | will match the entire sequence. | |
822502e5 TS |
209 | |
210 | =item * | |
211 | ||
212 | The C<tr///> operator translates characters instead of bytes. Note | |
213 | that the C<tr///CU> functionality has been removed. For similar | |
214 | functionality see pack('U0', ...) and pack('C0', ...). | |
215 | ||
216 | =item * | |
217 | ||
218 | Case translation operators use the Unicode case translation tables | |
219 | when character input is provided. Note that C<uc()>, or C<\U> in | |
220 | interpolated strings, translates to uppercase, while C<ucfirst>, | |
221 | or C<\u> in interpolated strings, translates to titlecase in languages | |
e1b711da KW |
222 | that make the distinction (which is equivalent to uppercase in languages |
223 | without the distinction). | |
822502e5 TS |
224 | |
225 | =item * | |
226 | ||
227 | Most operators that deal with positions or lengths in a string will | |
228 | automatically switch to using character positions, including | |
229 | C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>, | |
230 | C<sprintf()>, C<write()>, and C<length()>. An operator that | |
51f494cc KW |
231 | specifically does not switch is C<vec()>. Operators that really don't |
232 | care include operators that treat strings as a bucket of bits such as | |
822502e5 TS |
233 | C<sort()>, and operators dealing with filenames. |
234 | ||
235 | =item * | |
236 | ||
51f494cc | 237 | The C<pack()>/C<unpack()> letter C<C> does I<not> change, since it is often |
822502e5 TS |
238 | used for byte-oriented formats. Again, think C<char> in the C language. |
239 | ||
240 | There is a new C<U> specifier that converts between Unicode characters | |
241 | and code points. There is also a C<W> specifier that is the equivalent of | |
242 | C<chr>/C<ord> and properly handles character values even if they are above 255. | |
243 | ||
244 | =item * | |
245 | ||
246 | The C<chr()> and C<ord()> functions work on characters, similar to | |
247 | C<pack("W")> and C<unpack("W")>, I<not> C<pack("C")> and | |
248 | C<unpack("C")>. C<pack("C")> and C<unpack("C")> are methods for | |
249 | emulating byte-oriented C<chr()> and C<ord()> on Unicode strings. | |
250 | While these methods reveal the internal encoding of Unicode strings, | |
251 | that is not something one normally needs to care about at all. | |
252 | ||
253 | =item * | |
254 | ||
255 | The bit string operators, C<& | ^ ~>, can operate on character data. | |
256 | However, for backward compatibility, such as when using bit string | |
257 | operations when characters are all less than 256 in ordinal value, one | |
258 | should not use C<~> (the bit complement) with characters of both | |
259 | values less than 256 and values greater than 256. Most importantly, | |
260 | DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>) | |
261 | will not hold. The reason for this mathematical I<faux pas> is that | |
262 | the complement cannot return B<both> the 8-bit (byte-wide) bit | |
263 | complement B<and> the full character-wide bit complement. | |
264 | ||
265 | =item * | |
266 | ||
9d1c51c1 KW |
267 | You can define your own mappings to be used in C<lc()>, |
268 | C<lcfirst()>, C<uc()>, and C<ucfirst()> (or their double-quoted string inlined | |
ab296a20 FR |
269 | versions such as C<\U>). See |
270 | L<User-Defined Case-Mappings|/"User-Defined Case Mappings (for serious hackers only)"> | |
271 | for more details. | |
822502e5 TS |
272 | |
273 | =back | |
274 | ||
275 | =over 4 | |
276 | ||
277 | =item * | |
278 | ||
279 | And finally, C<scalar reverse()> reverses by character rather than by byte. | |
280 | ||
281 | =back | |
282 | ||
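Several of the effects listed above can be sketched in one short script. The sample strings and variable names are chosen for illustration; nothing is printed except counts, so no output layer is needed.

```perl
use strict;
use warnings;
use charnames ':full';

# Three spellings of U+263A WHITE SMILING FACE.
my $by_code = "\N{U+263A}";
my $by_hex  = "\x{263A}";
my $by_name = "\N{WHITE SMILING FACE}";

# length, substr, index and reverse count characters, not bytes.
my $str      = "a\x{263A}b";
my $len      = length $str;
my $middle   = substr($str, 1, 1);
my $pos_b    = index($str, 'b');
my $reversed = scalar reverse $str;

# "." matches one whole character.
my ($dot) = $str =~ /a(.)b/;

# \X matches a whole extended grapheme cluster:
# here a G followed by U+0301 COMBINING ACUTE ACCENT.
my $accented    = "G\x{301}";
my ($cluster)   = $accented =~ /(\X)/;
my $cluster_len = length $cluster;

# chr and pack("W") agree on characters above 255;
# pack("U") maps a code point to the corresponding character.
my $from_chr    = chr 0x263A;
my $from_pack_w = pack 'W', 0x263A;
my $from_pack_u = pack 'U', 0x263A;

# uc gives uppercase, ucfirst gives titlecase; U+01C6 has distinct forms.
my $upper = uc "\x{1C6}";
my $title = ucfirst "\x{1C6}";

printf "length: %d, index of b: %d, cluster length: %d\n",
    $len, $pos_b, $cluster_len;
printf "uc: U+%04X, ucfirst: U+%04X\n", ord $upper, ord $title;
```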
283 | =head2 Unicode Character Properties | |
284 | ||
51f494cc | 285 | Most Unicode character properties are accessible by using regular expressions. |
9d1c51c1 KW |
286 | They are used (like bracketed character classes) by using the C<\p{}> "matches |
287 | property" construct and the C<\P{}> negation, "doesn't match property". | |
288 | ||
289 | Note that the only time that Perl considers a sequence of individual code | |
290 | points as a single logical character is in the C<\X> construct, already | |
291 | mentioned above. Therefore "character" in this discussion means a single | |
292 | Unicode code point. | |
51f494cc | 293 | |
9d1c51c1 | 294 | For instance, C<\p{Uppercase}> matches any single character with the Unicode |
51f494cc KW |
295 | "Uppercase" property, while C<\p{L}> matches any character with a |
296 | General_Category of "L" (letter) property. Brackets are not | |
9d1c51c1 | 297 | required for single letter property names, so C<\p{L}> is equivalent to C<\pL>. |
51f494cc | 298 | |
9d1c51c1 KW |
299 | More formally, C<\p{Uppercase}> matches any single character whose Unicode |
300 | Uppercase property value is True, and C<\P{Uppercase}> matches any character | |
301 | whose Uppercase property value is False, and they could have been written as | |
302 | C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively. | |
51f494cc KW |
303 | |
304 | This formality is needed when properties are not binary, that is if they can | |
305 | take on more values than just True and False. For example, the Bidi_Class (see | |
306 | L</"Bidirectional Character Types"> below), can take on a number of different | |
307 | values, such as Left, Right, Whitespace, and others. To match these, one needs | |
e1b711da | 308 | to specify the property name (Bidi_Class), and the value being matched against |
9d1c51c1 | 309 | (Left, Right, etc.). This is done, as in the examples above, by having the |
9f815e24 | 310 | two components separated by an equal sign (or interchangeably, a colon), like |
51f494cc KW |
311 | C<\p{Bidi_Class: Left}>. |
312 | ||
313 | All Unicode-defined character properties may be written in these compound forms | |
314 | of C<\p{property=value}> or C<\p{property:value}>, but Perl provides some | |
315 | additional properties that are written only in the single form, as well as | |
316 | single-form short-cuts for all binary properties and certain others described | |
317 | below, in which you may omit the property name and the equals or colon | |
318 | separator. | |
319 | ||
320 | Most Unicode character properties have at least two synonyms (or aliases if you | |
321 | prefer), a short one that is easier to type, and a longer one which is more | |
322 | descriptive and hence easier to understand. Thus the "L" | |
323 | and "Letter" above are equivalent and can be used interchangeably. Likewise, | |
324 | "Upper" is a synonym for "Uppercase", and we could have written | |
325 | C<\p{Uppercase}> equivalently as C<\p{Upper}>. Also, there are typically | |
326 | various synonyms for the values the property can be. For binary properties, | |
327 | "True" has 3 synonyms: "T", "Yes", and "Y"; and "False" has correspondingly "F", | |
328 | "No", and "N". But be careful. A short form of a value for one property may | |
e1b711da | 329 | not mean the same thing as the same short form for another. Thus, for the |
51f494cc KW |
330 | General_Category property, "L" means "Letter", but for the Bidi_Class property, |
331 | "L" means "Left". A complete list of properties and synonyms is in | |
332 | L<perluniprops>. | |
333 | ||
334 | Upper/lower case differences in the property names and values are irrelevant, | |
335 | thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>. | |
336 | Similarly, you can add or subtract underscores anywhere in the middle of a | |
337 | word, so that these are also equivalent to C<\p{U_p_p_e_r}>. And white space | |
338 | is irrelevant adjacent to non-word characters, such as the braces and the equals | |
339 | or colon separators so C<\p{ Upper }> and C<\p{ Upper_case : Y }> are | |
340 | equivalent to these as well. In fact, in most cases, white space and even | |
341 | hyphens can be added or deleted anywhere. So even C<\p{ Up-per case = Yes}> is | |
342 | equivalent. All this is called "loose-matching" by Unicode. The few places | |
343 | where stricter matching is employed are the middle of numbers, and the Perl | |
344 | extension properties that begin or end with an underscore. Stricter matching | |
345 | cares about white space (except adjacent to the non-word characters) and | |
346 | hyphens, and non-interior underscores. | |
4193bef7 | 347 | |
376d9008 JB |
348 | You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret |
349 | (^) between the first brace and the property name: C<\p{^Tamil}> is | |
eb0cc9e3 | 350 | equal to C<\P{Tamil}>. |
4193bef7 | 351 | |
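The spellings discussed so far can be compared side by side. This is a sketch with illustrative variable names; the list of forms is a sample taken from the text above, not an exhaustive set.

```perl
use strict;
use warnings;

# Single-form, compound-form and loosely matched spellings of one property.
my @upper_forms = (
    qr/\p{Uppercase}/,
    qr/\p{Uppercase=True}/,
    qr/\p{Upper}/,
    qr/\p{upper}/,
    qr/\p{UpPeR}/,
    qr/\p{ Upper }/,
    qr/\p{ Upper_case : Y }/,
);
my $form_count = scalar @upper_forms;
my $hits_upper = grep { 'A' =~ $_ } @upper_forms;   # how many forms match "A"
my $hits_lower = grep { 'a' =~ $_ } @upper_forms;   # how many forms match "a"

# Braces are optional for single-letter property names.
my $braced   = 'x' =~ /\p{L}/ ? 1 : 0;
my $unbraced = 'x' =~ /\pL/   ? 1 : 0;

# A caret after the opening brace negates, like \P{}.
my $tamil_a        = "\x{B85}";    # TAMIL LETTER A
my $caret_on_latin = 'A'      =~ /\p{^Tamil}/ ? 1 : 0;
my $big_p_on_latin = 'A'      =~ /\P{Tamil}/  ? 1 : 0;
my $caret_on_tamil = $tamil_a =~ /\p{^Tamil}/ ? 1 : 0;
my $plain_on_tamil = $tamil_a =~ /\p{Tamil}/  ? 1 : 0;

print "$hits_upper of $form_count forms match 'A', $hits_lower match 'a'\n";
```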
56ca34ca KW |
352 | Almost all properties are immune to case-insensitive matching. That is, |
353 | adding a C</i> regular expression modifier does not change what they | |
354 | match. There are two sets that are affected. | |
355 | The first set is | |
356 | C<Uppercase_Letter>, | |
357 | C<Lowercase_Letter>, | |
358 | and C<Titlecase_Letter>, | |
359 | all of which match C<Cased_Letter> under C</i> matching. | |
360 | And the second set is | |
361 | C<Uppercase>, | |
362 | C<Lowercase>, | |
363 | and C<Titlecase>, | |
364 | all of which match C<Cased> under C</i> matching. | |
365 | This set also includes its subsets C<PosixUpper> and C<PosixLower> both | |
366 | of which under C</i> matching match C<PosixAlpha>. | |
367 | (The difference between these sets is that some things, such as Roman | |
368 | Numerals come in both upper and lower case so they are C<Cased>, but aren't considered to be | |
369 | letters, so they aren't C<Cased_Letter>s.) | |
370 | L<perluniprops> includes a notation for all forms that have C</i> | |
371 | differences. | |
372 | ||
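A small sketch of the C</i> behavior described above, assuming a Perl recent enough to implement it. The Roman numeral is chosen because it is cased without being a letter; the variable names are illustrative.

```perl
use strict;
use warnings;

# Uppercase_Letter widens to Cased_Letter under /i.
my $lu_plain = 'a' =~ /\p{Lu}/  ? 1 : 0;
my $lu_fold  = 'a' =~ /\p{Lu}/i ? 1 : 0;

# U+2177 SMALL ROMAN NUMERAL EIGHT is Cased, but its General_Category is Nl.
my $roman            = "\x{2177}";
my $roman_upper_fold = $roman =~ /\p{Uppercase}/i ? 1 : 0;
my $roman_lu_fold    = $roman =~ /\p{Lu}/i        ? 1 : 0;

# Characters without case are unaffected by /i.
my $digit_fold = '5' =~ /\p{Uppercase}/i ? 1 : 0;

printf "Lu: %d, Lu/i: %d, numeral Uppercase/i: %d, numeral Lu/i: %d\n",
    $lu_plain, $lu_fold, $roman_upper_fold, $roman_lu_fold;
```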
51f494cc | 373 | =head3 B<General_Category> |
14bb0a9a | 374 | |
51f494cc KW |
375 | Every Unicode character is assigned a general category, which is the "most |
376 | usual categorization of a character" (from | |
377 | L<http://www.unicode.org/reports/tr44>). | |
822502e5 | 378 | |
9f815e24 | 379 | The compound way of writing these is like C<\p{General_Category=Number}> |
51f494cc KW |
380 | (short, C<\p{gc:n}>). But Perl furnishes shortcuts in which everything up |
381 | through the equal or colon separator is omitted. So you can instead just write | |
382 | C<\pN>. | |
822502e5 | 383 | |
51f494cc | 384 | Here are the short and long forms of the General Category properties: |
393fec97 | 385 | |
d73e5302 JH |
386 | Short Long |
387 | ||
388 | L Letter | |
51f494cc KW |
389 | LC, L& Cased_Letter (that is: [\p{Ll}\p{Lu}\p{Lt}]) |
390 | Lu Uppercase_Letter | |
391 | Ll Lowercase_Letter | |
392 | Lt Titlecase_Letter | |
393 | Lm Modifier_Letter | |
394 | Lo Other_Letter | |
d73e5302 JH |
395 | |
396 | M Mark | |
51f494cc KW |
397 | Mn Nonspacing_Mark |
398 | Mc Spacing_Mark | |
399 | Me Enclosing_Mark | |
d73e5302 JH |
400 | |
401 | N Number | |
51f494cc KW |
402 | Nd Decimal_Number (also Digit) |
403 | Nl Letter_Number | |
404 | No Other_Number | |
405 | ||
406 | P Punctuation (also Punct) | |
407 | Pc Connector_Punctuation | |
408 | Pd Dash_Punctuation | |
409 | Ps Open_Punctuation | |
410 | Pe Close_Punctuation | |
411 | Pi Initial_Punctuation | |
d73e5302 | 412 | (may behave like Ps or Pe depending on usage) |
51f494cc | 413 | Pf Final_Punctuation |
d73e5302 | 414 | (may behave like Ps or Pe depending on usage) |
51f494cc | 415 | Po Other_Punctuation |
d73e5302 JH |
416 | |
417 | S Symbol | |
51f494cc KW |
418 | Sm Math_Symbol |
419 | Sc Currency_Symbol | |
420 | Sk Modifier_Symbol | |
421 | So Other_Symbol | |
d73e5302 JH |
422 | |
423 | Z Separator | |
51f494cc KW |
424 | Zs Space_Separator |
425 | Zl Line_Separator | |
426 | Zp Paragraph_Separator | |
d73e5302 JH |
427 | |
428 | C Other | |
d88362ca | 429 | Cc Control (also Cntrl) |
e150c829 | 430 | Cf Format |
6d4f9cf2 | 431 | Cs Surrogate |
51f494cc | 432 | Co Private_Use |
e150c829 | 433 | Cn Unassigned |
1ac13f9a | 434 | |
376d9008 | 435 | Single-letter properties match all characters in any of the |
3e4dbfed | 436 | two-letter sub-properties starting with the same letter. |
9d1c51c1 | 437 | C<LC> and C<L&> are special cases, which are both aliases for the set consisting of everything matched by C<Ll>, C<Lu>, and C<Lt>. |
32293815 | 438 | |
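The relationship between the one-letter and two-letter categories can be checked with a few sample characters. This is a sketch; the characters and variable names are chosen for illustration.

```perl
use strict;
use warnings;

# Compound, abbreviated compound and single-form spellings of "Number".
my @number_forms = (
    qr/\p{General_Category=Number}/,
    qr/\p{gc:n}/,
    qr/\p{N}/,
    qr/\pN/,
);
my $roman      = "\x{2167}";    # ROMAN NUMERAL EIGHT, category Nl
my $digit_hits = grep { '7'    =~ $_ } @number_forms;
my $roman_hits = grep { $roman =~ $_ } @number_forms;

# The two-letter sub-categories separate them again.
my $digit_is_nd = '7'    =~ /\p{Nd}/ ? 1 : 0;
my $roman_is_nd = $roman =~ /\p{Nd}/ ? 1 : 0;
my $roman_is_nl = $roman =~ /\p{Nl}/ ? 1 : 0;

# LC and L& cover only Ll, Lu and Lt, so an ideograph (Lo) is a letter
# but not a cased letter.
my $ideograph    = "\x{4E2D}";
my $titlecase_lc = "\x{1C5}"  =~ /\p{LC}/ ? 1 : 0;
my $lower_lamp   = 'a'        =~ /\p{L&}/ ? 1 : 0;
my $ideograph_l  = $ideograph =~ /\pL/    ? 1 : 0;
my $ideograph_lc = $ideograph =~ /\p{LC}/ ? 1 : 0;

print "digit matches $digit_hits forms, Roman numeral matches $roman_hits forms\n";
```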
51f494cc | 439 | =head3 B<Bidirectional Character Types> |
822502e5 | 440 | |
9d1c51c1 KW |
441 | Because scripts differ in their directionality (Hebrew is |
442 | written right to left, for example) Unicode supplies these properties in | |
51f494cc | 443 | the Bidi_Class class: |
32293815 | 444 | |
eb0cc9e3 | 445 | Property Meaning |
92e830a9 | 446 | |
12ac2576 JP |
447 | L Left-to-Right |
448 | LRE Left-to-Right Embedding | |
449 | LRO Left-to-Right Override | |
450 | R Right-to-Left | |
51f494cc | 451 | AL Arabic Letter |
12ac2576 JP |
452 | RLE Right-to-Left Embedding |
453 | RLO Right-to-Left Override | |
454 | PDF Pop Directional Format | |
455 | EN European Number | |
51f494cc KW |
456 | ES European Separator |
457 | ET European Terminator | |
12ac2576 | 458 | AN Arabic Number |
51f494cc | 459 | CS Common Separator |
12ac2576 JP |
460 | NSM Non-Spacing Mark |
461 | BN Boundary Neutral | |
462 | B Paragraph Separator | |
463 | S Segment Separator | |
464 | WS Whitespace | |
465 | ON Other Neutrals | |
466 | ||
51f494cc KW |
467 | This property is always written in the compound form. |
468 | For example, C<\p{Bidi_Class:R}> matches characters that are normally | |
eb0cc9e3 JH |
469 | written right to left. |
470 | ||
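As an illustration, the compound form can be used to look up the class of a single character. The helper C<bidi_class> and its table are hypothetical, written for this sketch, and cover only a handful of the classes listed above.

```perl
use strict;
use warnings;

# A few Bidi_Class values, each written in the compound form.
my %bidi = (
    L  => qr/\p{Bidi_Class:L}/,
    R  => qr/\p{Bidi_Class:R}/,
    AL => qr/\p{Bidi_Class:AL}/,
    EN => qr/\p{Bidi_Class:EN}/,
    AN => qr/\p{Bidi_Class:AN}/,
    WS => qr/\p{Bidi_Class:WS}/,
);

# Hypothetical helper: return the first class in the table that matches.
# Each character has exactly one Bidi_Class, so the order is irrelevant.
sub bidi_class {
    my ($char) = @_;
    for my $class (sort keys %bidi) {
        return $class if $char =~ $bidi{$class};
    }
    return 'other';
}

my $latin_a      = bidi_class('A');
my $hebrew_alef  = bidi_class("\x{5D0}");
my $arabic_alef  = bidi_class("\x{627}");
my $ascii_digit  = bidi_class('7');
my $arabic_digit = bidi_class("\x{660}");
my $space        = bidi_class(' ');

print join(' ', $latin_a, $hebrew_alef, $arabic_alef,
                $ascii_digit, $arabic_digit, $space), "\n";
```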
51f494cc KW |
471 | =head3 B<Scripts> |
472 | ||
e1b711da KW |
473 | The world's languages are written in a number of scripts. This sentence |
474 | (unless you're reading it in translation) is written in Latin, while Russian is | |
c69ca1d4 | 475 | written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in |
e1b711da | 476 | Hiragana or Katakana. There are many more. |
51f494cc KW |
477 | |
478 | The Unicode Script property gives what script a given character is in, | |
9d1c51c1 KW |
479 | and the property can be specified with the compound form like |
480 | C<\p{Script=Hebrew}> (short: C<\p{sc=hebr}>). Perl furnishes shortcuts for all | |
481 | script names. You can omit everything up through the equals (or colon), and | |
482 | simply write C<\p{Latin}> or C<\P{Cyrillic}>. | |
51f494cc KW |
483 | |
484 | A complete list of scripts and their shortcuts is in L<perluniprops>. | |
485 | ||
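A brief sketch of the script forms above, counting the characters of each script in a mixed string. The sample text and variable names are illustrative. Note that in later Perl releases the single-form shortcuts are, as far as we know, based on the Script_Extensions property rather than Script; the characters used here give the same result either way.

```perl
use strict;
use warnings;

# Compound and abbreviated compound forms.
my $alef         = "\x{5D0}";    # HEBREW LETTER ALEF
my $hebrew_long  = $alef =~ /\p{Script=Hebrew}/ ? 1 : 0;
my $hebrew_short = $alef =~ /\p{sc=hebr}/       ? 1 : 0;

# Single-form shortcuts, counted over "Perl" followed by its Cyrillic spelling.
my $text = "Perl \x{41F}\x{435}\x{440}\x{43B}";
my $latin_count    = () = $text =~ /\p{Latin}/g;
my $cyrillic_count = () = $text =~ /\p{Cyrillic}/g;
my $not_cyrillic   = () = $text =~ /\P{Cyrillic}/g;

print "Latin: $latin_count, Cyrillic: $cyrillic_count, other: ",
    $not_cyrillic - $latin_count, "\n";
```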
51f494cc | 486 | =head3 B<Use of "Is" Prefix> |
822502e5 | 487 | |
1bfb14c4 | 488 | For backward compatibility (with Perl 5.6), all properties mentioned |
51f494cc KW |
489 | so far may have C<Is> or C<Is_> prepended to their name, so C<\P{Is_Lu}>, for |
490 | example, is equal to C<\P{Lu}>, and C<\p{IsScript:Arabic}> is equal to | |
491 | C<\p{Arabic}>. | |
eb0cc9e3 | 492 | |
51f494cc | 493 | =head3 B<Blocks> |
2796c109 | 494 | |
1bfb14c4 JH |
495 | In addition to B<scripts>, Unicode also defines B<blocks> of |
496 | characters. The difference between scripts and blocks is that the | |
497 | concept of scripts is closer to natural languages, while the concept | |
51f494cc | 498 | of blocks is more of an artificial grouping based on groups of Unicode |
9f815e24 | 499 | characters with consecutive ordinal values. For example, the "Basic Latin" |
51f494cc | 500 | block is all characters whose ordinals are between 0 and 127, inclusive, in |
9f815e24 KW |
501 | other words, the ASCII characters. The "Latin" script contains some letters |
502 | from this block as well as several more, like "Latin-1 Supplement", | |
9d1c51c1 | 503 | "Latin Extended-A", etc., but it does not contain all the characters from |
7be67b37 KW |
504 | those blocks. It does not, for example, contain the digits 0-9, because |
505 | those digits are shared across many scripts. The digits 0-9 and similar groups, | |
506 | like punctuation, are in the script called C<Common>. There is also a | |
507 | script called C<Inherited> for characters that modify other characters, | |
508 | and inherit the script value of the controlling character. (Note that | |
509 | there are a number of different sets of digits in Unicode that are | |
510 | equivalent to 0-9 and are matchable by C<\d> in a regular expression. | |
511 | If they are used in a single language only, they are in that language's | |
512 | script. Only the sets that are used across languages are in the | |
513 | C<Common> script.) | |
51f494cc KW |
514 | |
515 | For more about scripts versus blocks, see UAX#24 "Unicode Script Property": | |
516 | L<http://www.unicode.org/reports/tr24> | |
517 | ||
518 | The Script property is likely to be the one you want to use when processing | |
519 | natural language; the Block property may be useful in working with the nuts and | |
520 | bolts of Unicode. | |
521 | ||
522 | Block names are matched in the compound form, like C<\p{Block: Arrows}> or | |
523 | C<\p{Blk=Hebrew}>. Unlike most other properties only a few block names have a | |
524 | Unicode-defined short name. But Perl does provide a (slight) shortcut: You | |
525 | can say, for example C<\p{In_Arrows}> or C<\p{In_Hebrew}>. For backwards | |
526 | compatibility, the C<In> prefix may be omitted if there is no naming conflict | |
527 | with a script or any other property, and you can even use an C<Is> prefix | |
528 | instead in those cases. But it is not a good idea to do this, for a couple | |
529 | of reasons: | |
530 | ||
531 | =over 4 | |
532 | ||
533 | =item 1 | |
534 | ||
535 | It is confusing. There are many naming conflicts, and you may forget some. | |
9f815e24 | 536 | For example, C<\p{Hebrew}> means the I<script> Hebrew, and NOT the I<block> |
51f494cc KW |
537 | Hebrew. But would you remember that 6 months from now? |
538 | ||
539 | =item 2 | |
540 | ||
541 | It is unstable. A new version of Unicode may pre-empt the current meaning by | |
542 | creating a property with the same name. There was a time in very early Unicode | |
9f815e24 | 543 | releases when C<\p{Hebrew}> would have matched the I<block> Hebrew; now it |
51f494cc | 544 | doesn't. |
32293815 | 545 | |
393fec97 GS |
546 | =back |
547 | ||
51f494cc KW |
548 | Some people just prefer to always use C<\p{Block: foo}> and C<\p{Script: bar}> |
549 | instead of the shortcuts, for clarity, and because they can't remember the | |
550 | difference between 'In' and 'Is' anyway (or aren't confident that those who | |
551 | eventually will read their code will know). | |
552 | ||
553 | A complete list of blocks and their shortcuts is in L<perluniprops>. | |
554 | ||
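The difference between blocks and scripts can be seen with three sample characters. This is a sketch with illustrative variable names; the explicit C<Script=> and C<Block=> forms are used throughout, as recommended above.

```perl
use strict;
use warnings;

# U+2190 LEFTWARDS ARROW lies in the Arrows block.
my $arrow          = "\x{2190}";
my $arrow_compound = $arrow =~ /\p{Block: Arrows}/ ? 1 : 0;
my $arrow_shortcut = $arrow =~ /\p{In_Arrows}/     ? 1 : 0;

# The digit 5 lies in the Basic Latin block, but its script is Common.
my $five_block  = '5' =~ /\p{Block=Basic_Latin}/ ? 1 : 0;
my $five_latin  = '5' =~ /\p{Script=Latin}/      ? 1 : 0;
my $five_common = '5' =~ /\p{Script=Common}/     ? 1 : 0;

# U+FB1D HEBREW LETTER YOD WITH HIRIQ is in the Hebrew script,
# but it sits outside the Hebrew block.
my $yod        = "\x{FB1D}";
my $yod_script = $yod =~ /\p{Script=Hebrew}/ ? 1 : 0;
my $yod_block  = $yod =~ /\p{Blk=Hebrew}/    ? 1 : 0;

printf "arrow in block: %d, 5 in Latin script: %d, yod in Hebrew block: %d\n",
    $arrow_compound, $five_latin, $yod_block;
```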
9f815e24 KW |
555 | =head3 B<Other Properties> |
556 | ||
557 | There are many more properties than the very basic ones described here. | |
558 | A complete list is in L<perluniprops>. | |
559 | ||
560 | Unicode defines all its properties in the compound form, so all single-form | |
561 | properties are Perl extensions. A number of these are just synonyms for the | |
562 | Unicode ones, but some are genuine extensions, including a couple that are in | |
563 | the compound form. And quite a few of these are actually recommended by Unicode | |
564 | (in L<http://www.unicode.org/reports/tr18>). | |
565 | ||
566 | This section gives some details on all the extensions that aren't synonyms for | |
567 | compound-form Unicode properties (for those, you'll have to refer to the | |
568 | L<Unicode Standard|http://www.unicode.org/reports/tr44>). | |
569 | ||
570 | =over | |
571 | ||
572 | =item B<C<\p{All}>> | |
573 | ||
574 | This matches any of the 1_114_112 Unicode code points. It is a synonym for | |
575 | C<\p{Any}>. | |
576 | ||
577 | =item B<C<\p{Alnum}>> | |
578 | ||
579 | This matches any C<\p{Alphabetic}> or C<\p{Decimal_Number}> character. | |
580 | ||
581 | =item B<C<\p{Any}>> | |
582 | ||
583 | This matches any of the 1_114_112 Unicode code points. It is a synonym for | |
584 | C<\p{All}>. | |
585 | ||
586 | =item B<C<\p{Assigned}>> | |
587 | ||
588 | This matches any assigned code point; that is, any code point whose general | |
589 | category is not Unassigned (or equivalently, not Cn). | |
590 | ||
591 | =item B<C<\p{Blank}>> | |
592 | ||
593 | This is the same as C<\h> and C<\p{HorizSpace}>: A character that changes the | |
594 | spacing horizontally. | |
595 | ||
596 | =item B<C<\p{Decomposition_Type: Non_Canonical}>> (Short: C<\p{Dt=NonCanon}>) | |
597 | ||
598 | Matches a character that has a non-canonical decomposition. | |
599 | ||
600 | To understand the use of this rarely used property=value combination, it is | |
601 | necessary to know some basics about decomposition. | |
602 | Consider a character, say H. It could appear with various marks around it, | |
603 | such as an acute accent, or a circumflex, or various hooks, circles, arrows, | |
9d1c51c1 | 604 | I<etc.>, above, below, or to one side or the other. There are many |
9f815e24 KW |
605 | possibilities among the world's languages. The number of combinations is |
606 | astronomical, and if there were a character for each combination, it would | |
607 | soon exhaust Unicode's more than a million possible characters. So Unicode | |
608 | took a different approach: there is a character for the base H, and a | |
609 | character for each of the possible marks, and they can be combined variously | |
610 | to get a final logical character. So a logical character--what appears to be a | |
611 | single character--can be a sequence of more than one individual character. | |
612 | This is called an "extended grapheme cluster". (Perl furnishes the C<\X> | |
613 | construct to match such sequences.) | |
614 | ||
615 | But Unicode's intent is to unify the existing character set standards and | |
616 | practices, and a number of pre-existing standards have single characters that | |
617 | mean the same thing as some of these combinations. An example is ISO-8859-1, | |
618 | which has quite a few of these in the Latin-1 range, an example being "LATIN | |
619 | CAPITAL LETTER E WITH ACUTE". Because this character was in this pre-existing | |
620 | standard, Unicode added it to its repertoire. But this character is considered | |
621 | by Unicode to be equivalent to the sequence consisting of first the character | |
622 | "LATIN CAPITAL LETTER E", then the character "COMBINING ACUTE ACCENT". | |
623 | ||
624 | "LATIN CAPITAL LETTER E WITH ACUTE" is called a "pre-composed" character, and | |
625 | the equivalence with the sequence is called canonical equivalence. All | |
626 | pre-composed characters are said to have a decomposition (into the equivalent | |
627 | sequence) and the decomposition type is also called canonical. | |
628 | ||
629 | However, many more characters have a different type of decomposition, a | |
630 | "compatible" or "non-canonical" decomposition. The sequences that form these | |
631 | decompositions are not considered canonically equivalent to the pre-composed | |
632 | character. An example, again in the Latin-1 range, is the "SUPERSCRIPT ONE". | |
633 | It is kind of like a regular digit 1, but not exactly; its decomposition | |
634 | into the digit 1 is called a "compatible" decomposition, specifically a | |
635 | "super" decomposition. There are several such compatibility | |
636 | decompositions (see L<http://www.unicode.org/reports/tr44>), including one | |
637 | called "compat" which means some miscellaneous type of decomposition | |
638 | that doesn't fit into the decomposition categories that Unicode has chosen. | |
639 | ||
640 | Note that most Unicode characters don't have a decomposition, so their | |
641 | decomposition type is "None". | |
642 | ||
643 | Perl has added the C<Non_Canonical> type, for your convenience, to mean any of | |
644 | the compatibility decompositions. | |
645 | ||
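The three decomposition types described above can be told apart with a short test. This is a minimal sketch, assuming a Perl recent enough to accept the C<\p{Dt=...}> forms listed in L<perluniprops>; C<decomposition_kind()> is a hypothetical helper written for this illustration, not something Perl provides.

```perl
use strict;
use warnings;

# decomposition_kind() is a hypothetical helper for this illustration.
# It classifies one character by its Decomposition_Type.
sub decomposition_kind {
    my ($char) = @_;
    return 'canonical'     if $char =~ /\A\p{Dt=Canonical}\z/;
    return 'non-canonical' if $char =~ /\A\p{Dt=NonCanon}\z/;
    return 'none';
}

my @samples = (
    [ 'LATIN CAPITAL LETTER E WITH ACUTE', "\N{U+00C9}" ],
    [ 'SUPERSCRIPT ONE',                   "\N{U+00B9}" ],
    [ 'LATIN CAPITAL LETTER H',            "H"          ],
);

for my $sample (@samples) {
    my ($name, $char) = @$sample;
    printf "%-35s %s\n", $name, decomposition_kind($char);
}
```

The pre-composed E WITH ACUTE should be reported as canonical, SUPERSCRIPT ONE as non-canonical (its "super" decomposition is one of the compatibility types), and plain H as having no decomposition.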
646 | =item B<C<\p{Graph}>> | |
647 | ||
648 | Matches any character that is graphic. Theoretically, this means a character | |
649 | that on a printer would cause ink to be used. | |
650 | ||
651 | =item B<C<\p{HorizSpace}>> | |
652 | ||
653 | This is the same as C<\h> and C<\p{Blank}>: A character that changes the | |
654 | spacing horizontally. | |
655 | ||
656 | =item B<C<\p{In=*}>> | |
657 | ||
658 | This is a synonym for C<\p{Present_In=*}>. | |
659 | ||
660 | =item B<C<\p{PerlSpace}>> | |
661 | ||
662 | This is the same as C<\s>, restricted to ASCII, namely C<S<[ \f\n\r\t]>>. | |
663 | ||
664 | Mnemonic: Perl's (original) space | |
665 | ||
666 | =item B<C<\p{PerlWord}>> | |
667 | ||
668 | This is the same as C<\w>, restricted to ASCII, namely C<[A-Za-z0-9_]>. | |
669 | ||
670 | Mnemonic: Perl's (original) word. | |
671 | ||
672 | =item B<C<\p{PosixAlnum}>> | |
673 | ||
674 | This matches any alphanumeric character in the ASCII range, namely | |
675 | C<[A-Za-z0-9]>. | |
676 | ||
677 | =item B<C<\p{PosixAlpha}>> | |
678 | ||
679 | This matches any alphabetic character in the ASCII range, namely C<[A-Za-z]>. | |
680 | ||
681 | =item B<C<\p{PosixBlank}>> | |
682 | ||
683 | This matches any blank character in the ASCII range, namely C<S<[ \t]>>. | |
684 | ||
685 | =item B<C<\p{PosixCntrl}>> | |
686 | ||
687 | This matches any control character in the ASCII range, namely C<[\x00-\x1F\x7F]>. | |
688 | ||
689 | =item B<C<\p{PosixDigit}>> | |
690 | ||
691 | This matches any digit character in the ASCII range, namely C<[0-9]>. | |
692 | ||
693 | =item B<C<\p{PosixGraph}>> | |
694 | ||
695 | This matches any graphical character in the ASCII range, namely C<[\x21-\x7E]>. | |
696 | ||
697 | =item B<C<\p{PosixLower}>> | |
698 | ||
699 | This matches any lowercase character in the ASCII range, namely C<[a-z]>. | |
700 | ||
701 | =item B<C<\p{PosixPrint}>> | |
702 | ||
703 | This matches any printable character in the ASCII range, namely C<[\x20-\x7E]>. | |
704 | These are the graphical characters plus SPACE. | |
705 | ||
706 | =item B<C<\p{PosixPunct}>> | |
707 | ||
708 | This matches any punctuation character in the ASCII range, namely | |
709 | C<[\x21-\x2F\x3A-\x40\x5B-\x60\x7B-\x7E]>. These are the | |
710 | graphical characters that aren't word characters. Note that the Posix standard | |
711 | includes in its definition of punctuation, those characters that Unicode calls | |
712 | "symbols." | |
713 | ||
714 | =item B<C<\p{PosixSpace}>> | |
715 | ||
716 | This matches any space character in the ASCII range, namely | |
717 | C<S<[ \f\n\r\t\x0B]>> (the last being a vertical tab). | |
718 | ||
719 | =item B<C<\p{PosixUpper}>> | |
720 | ||
721 | This matches any uppercase character in the ASCII range, namely C<[A-Z]>. | |
722 | ||
723 | =item B<C<\p{Present_In: *}>> (Short: C<\p{In=*}>) | |
724 | ||
725 | This property is used when you need to know in which Unicode version(s) a | |
726 | character is present. | |
727 | ||
728 | The "*" above stands for some two digit Unicode version number, such as | |
729 | C<1.1> or C<4.0>; or the "*" can also be C<Unassigned>. This property will | |
730 | match the code points whose final disposition has been settled as of the | |
731 | Unicode release given by the version number; C<\p{Present_In: Unassigned}> | |
732 | will match those code points whose meaning has yet to be assigned. | |
733 | ||
734 | For example, C<U+0041> "LATIN CAPITAL LETTER A" was present in the very first | |
735 | Unicode release available, which is C<1.1>, so this property is true for all | |
736 | valid "*" versions. On the other hand, C<U+1EFF> was not assigned until version | |
737 | 5.1 when it became "LATIN SMALL LETTER Y WITH LOOP", so the only "*" that | |
738 | would match it are 5.1, 5.2, and later. | |
739 | ||
740 | Unicode furnishes the C<Age> property from which this is derived. The problem | |
741 | with Age is that a strict interpretation of it (which Perl takes) has it | |
742 | matching the precise release a code point's meaning is introduced in. Thus | |
743 | C<U+0041> would match only 1.1; and C<U+1EFF> only 5.1. This is not usually what | |
744 | you want. | |
745 | ||
746 | Some non-Perl implementations of the Age property may change its meaning to be | |
747 | the same as the Perl Present_In property; just be aware of that. | |
748 | ||
749 | Another confusion with both these properties is that the definition is not | |
750 | that the code point has been assigned, but that the meaning of the code point | |
751 | has been determined. This is because 66 code points will always be | |
752 | unassigned, and so the Age for them is the Unicode version in which the | |
753 | decision to make them so was made. For example, C<U+FDD0> is to be permanently | |
754 | unassigned to a character, and the decision to do that was made in version 3.1, | |
755 | so C<\p{Age=3.1}> matches this character and C<\p{Present_In: 3.1}> and up | |
756 | matches as well. | |
757 | ||
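The difference between C<Present_In> and C<Age> can be checked directly with the two code points used in the explanation above. This is only a sketch; the variable names and the C<yes_no()> helper are made up for the illustration.

```perl
use strict;
use warnings;

my $y_with_loop = "\N{U+1EFF}";   # LATIN SMALL LETTER Y WITH LOOP (new in 5.1)
my $capital_a   = "A";            # present since 1.1

# yes_no() is a hypothetical helper that makes the output readable.
sub yes_no { return $_[0] ? 'yes' : 'no' }

# Present_In is cumulative: it stays true for every later release.
printf "U+1EFF Present_In=5.0: %s\n", yes_no($y_with_loop =~ /\p{Present_In=5.0}/);
printf "U+1EFF Present_In=5.1: %s\n", yes_no($y_with_loop =~ /\p{Present_In=5.1}/);
printf "U+1EFF Present_In=5.2: %s\n", yes_no($y_with_loop =~ /\p{Present_In=5.2}/);

# Age, as Perl interprets it, names only the introducing release.
printf "U+1EFF Age=5.1:        %s\n", yes_no($y_with_loop =~ /\p{Age=5.1}/);
printf "U+1EFF Age=5.2:        %s\n", yes_no($y_with_loop =~ /\p{Age=5.2}/);
printf "U+0041 Age=1.1:        %s\n", yes_no($capital_a   =~ /\p{Age=1.1}/);
printf "U+0041 Present_In=5.1: %s\n", yes_no($capital_a   =~ /\p{Present_In=5.1}/);
```

In practice C<Present_In> answers "can I rely on this character in data limited to Unicode release N?", whereas C<Age> answers "which release introduced it?".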
758 | =item B<C<\p{Print}>> | |
759 | ||
ae5b72c8 | 760 | This matches any character that is graphical or blank, except controls. |
9f815e24 KW |
761 | |
762 | =item B<C<\p{SpacePerl}>> | |
763 | ||
764 | This is the same as C<\s>, including beyond ASCII. | |
765 | ||
4d4acfba KW |
766 | Mnemonic: Space, as modified by Perl. (It doesn't include the vertical tab |
767 | which both the Posix standard and Unicode consider to be space.) | |
9f815e24 KW |
768 | |
769 | =item B<C<\p{VertSpace}>> | |
770 | ||
771 | This is the same as C<\v>: A character that changes the spacing vertically. | |
772 | ||
773 | =item B<C<\p{Word}>> | |
774 | ||
775 | This is the same as C<\w>, including beyond ASCII. | |
776 | ||
777 | =back | |
778 | ||
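Several of the properties listed above come in pairs: an ASCII-only form and a full Unicode form. The following sketch contrasts a few of those pairs on characters outside the ASCII range; the sample characters and variable names are chosen only for illustration.

```perl
use strict;
use warnings;

my $arabic_three = "\N{U+0663}";   # ARABIC-INDIC DIGIT THREE
my $e_acute      = "\N{U+00E9}";   # LATIN SMALL LETTER E WITH ACUTE
my $nbsp         = "\N{U+00A0}";   # NO-BREAK SPACE

my @checks = (
    [ 'U+0663 \p{Digit}',      $arabic_three =~ /\p{Digit}/      ],
    [ 'U+0663 \p{PosixDigit}', $arabic_three =~ /\p{PosixDigit}/ ],
    [ 'U+00E9 \p{Word}',       $e_acute      =~ /\p{Word}/       ],
    [ 'U+00E9 \p{PerlWord}',   $e_acute      =~ /\p{PerlWord}/   ],
    [ 'U+00A0 \p{Blank}',      $nbsp         =~ /\p{Blank}/      ],
    [ 'U+00A0 \p{PosixBlank}', $nbsp         =~ /\p{PosixBlank}/ ],
);

for my $check (@checks) {
    my ($label, $matched) = @$check;
    printf "%-22s %s\n", $label, $matched ? 'matches' : 'does not match';
}
```

Each full-Unicode property should match its sample, and each ASCII-restricted counterpart should not.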
376d9008 | 779 | =head2 User-Defined Character Properties |
491fd90a | 780 | |
51f494cc KW |
781 | You can define your own binary character properties by defining subroutines |
782 | whose names begin with "In" or "Is". The subroutines can be defined in any | |
783 | package. The user-defined properties can be used in the regular expression | |
784 | C<\p> and C<\P> constructs; if you are using a user-defined property from a | |
785 | package other than the one you are in, you must specify its package in the | |
786 | C<\p> or C<\P> construct. | |
bac0b425 | 787 | |
51f494cc | 788 | # assuming property IsForeign is defined in Lang:: |
bac0b425 JP |
789 | package main; # property package name required |
790 | if ($txt =~ /\p{Lang::IsForeign}+/) { ... } | |
791 | ||
792 | package Lang; # property package name not required | |
793 | if ($txt =~ /\p{IsForeign}+/) { ... } | |
794 | ||
795 | ||
796 | Note that the effect is compile-time and immutable once defined. | |
56ca34ca KW |
797 | However the subroutines are passed a single parameter which is 0 if |
798 | case-sensitive matching is in effect, and non-zero if caseless matching | |
799 | is in effect. The subroutine may return different values depending on | |
800 | the value of the flag, and one set of values will immutably be in effect | |
801 | for all case-sensitive matches; the other set for all case-insensitive | |
802 | matches. | |
491fd90a | 803 | |
376d9008 JB |
804 | The subroutines must return a specially-formatted string, with one |
805 | or more newline-separated lines. Each line must be one of the following: | |
491fd90a JH |
806 | |
807 | =over 4 | |
808 | ||
809 | =item * | |
810 | ||
510254c9 A |
811 | A single hexadecimal number denoting a Unicode code point to include. |
812 | ||
813 | =item * | |
814 | ||
99a6b1f0 | 815 | Two hexadecimal numbers separated by horizontal whitespace (space or |
376d9008 | 816 | tab characters) denoting a range of Unicode code points to include. |
491fd90a JH |
817 | |
818 | =item * | |
819 | ||
376d9008 | 820 | Something to include, prefixed by "+": a built-in character |
bac0b425 JP |
821 | property (prefixed by "utf8::") or a user-defined character property, |
822 | to represent all the characters in that property; two hexadecimal code | |
823 | points for a range; or a single hexadecimal code point. | |
491fd90a JH |
824 | |
825 | =item * | |
826 | ||
376d9008 | 827 | Something to exclude, prefixed by "-": an existing character |
bac0b425 JP |
828 | property (prefixed by "utf8::") or a user-defined character property, |
829 | to represent all the characters in that property; two hexadecimal code | |
830 | points for a range; or a single hexadecimal code point. | |
491fd90a JH |
831 | |
832 | =item * | |
833 | ||
376d9008 | 834 | Something to negate, prefixed by "!": an existing character |
bac0b425 JP |
835 | property (prefixed by "utf8::") or a user-defined character property, |
836 | to represent all the characters in that property; two hexadecimal code | |
837 | points for a range; or a single hexadecimal code point. | |
838 | ||
839 | =item * | |
840 | ||
841 | Something to intersect with, prefixed by "&": an existing character | |
842 | property (prefixed by "utf8::") or a user-defined character property, | |
843 | keeping only the characters that are also in that property; two | |
844 | hexadecimal code points for a range; or a single hexadecimal code point. | |
491fd90a JH |
845 | |
846 | =back | |
847 | ||
848 | For example, to define a property that covers both the Japanese | |
849 | syllabaries (hiragana and katakana), you can define | |
850 | ||
851 | sub InKana { | |
d88362ca | 852 | return <<END; |
d5822f25 A |
853 | 3040\t309F |
854 | 30A0\t30FF | |
491fd90a JH |
855 | END |
856 | } | |
857 | ||
d5822f25 A |
858 | Imagine that the here-doc end marker is at the beginning of the line. |
859 | Now you can use C<\p{InKana}> and C<\P{InKana}>. | |
491fd90a JH |
860 | |
861 | You could also have used the existing block property names: | |
862 | ||
863 | sub InKana { | |
d88362ca | 864 | return <<'END'; |
491fd90a JH |
865 | +utf8::InHiragana |
866 | +utf8::InKatakana | |
867 | END | |
868 | } | |
869 | ||
870 | Suppose you wanted to match only the allocated characters, | |
d5822f25 | 871 | not the raw block ranges: in other words, you want to remove |
491fd90a JH |
872 | the unassigned code points: |
873 | ||
874 | sub InKana { | |
d88362ca | 875 | return <<'END'; |
491fd90a JH |
876 | +utf8::InHiragana |
877 | +utf8::InKatakana | |
878 | -utf8::IsCn | |
879 | END | |
880 | } | |
881 | ||
882 | The negation is useful for defining (surprise!) negated classes. | |
883 | ||
884 | sub InNotKana { | |
d88362ca | 885 | return <<'END'; |
491fd90a JH |
886 | !utf8::InHiragana |
887 | -utf8::InKatakana | |
888 | +utf8::IsCn | |
889 | END | |
890 | } | |
891 | ||
bac0b425 JP |
892 | Intersection is useful for getting the common characters matched by |
893 | two (or more) classes. | |
894 | ||
895 | sub InFooAndBar { | |
896 | return <<'END'; | |
897 | +main::Foo | |
898 | &main::Bar | |
899 | END | |
900 | } | |
901 | ||
ac036724 | 902 | It's important to remember not to use "&" for the first set; that |
bac0b425 JP |
903 | would be intersecting with nothing (resulting in an empty set). |
904 | ||
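As a self-contained illustration of the mechanism described in this section, the following sketch defines one property from raw ranges and a second one by set difference, then uses both. C<InAssignedKana> is a name made up for this example. The strings are built with double quotes so that C<\t> and C<\n> really become a tab and a newline.

```perl
use strict;
use warnings;

# Raw block ranges: Hiragana (3040-309F) and Katakana (30A0-30FF).
sub InKana {
    return "3040\t309F\n"
         . "30A0\t30FF\n";
}

# Hypothetical second property: the same ranges minus unassigned
# code points, built with "+" (include) and "-" (exclude).
sub InAssignedKana {
    return "+main::InKana\n"
         . "-utf8::IsCn\n";
}

my $hiragana_a = "\N{U+3042}";   # HIRAGANA LETTER A
my $reserved   = "\N{U+3040}";   # inside the block, but unassigned

for my $sample ( [ 'U+3042', $hiragana_a ], [ 'U+3040', $reserved ] ) {
    my ($label, $char) = @$sample;
    printf "%s  InKana: %-3s  InAssignedKana: %s\n",
        $label,
        ( $char =~ /\p{InKana}/         ? 'yes' : 'no' ),
        ( $char =~ /\p{InAssignedKana}/ ? 'yes' : 'no' );
}
```

Because both subroutines live in package C<main> and the patterns are compiled there, no package qualifier is needed inside C<\p{...}>.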
68585b5e | 905 | =head2 User-Defined Case Mappings (for serious hackers only) |
822502e5 | 906 | |
d5cd9e7b KW |
907 | You can also define your own mappings to be used in C<lc()>, |
908 | C<lcfirst()>, C<uc()>, and C<ucfirst()> (or their string-inlined versions, | |
68585b5e KW |
909 | C<\L>, C<\l>, C<\U>, and C<\u>). The mappings are currently only valid |
910 | on strings encoded in UTF-8, but see below for a partial workaround for | |
911 | this restriction. | |
912 | ||
822502e5 | 913 | The principle is similar to that of user-defined character |
68585b5e KW |
914 | properties: define subroutines that do the mappings. |
915 | C<ToLower> is used for C<lc()>, C<\L>, C<lcfirst()>, and C<\l>; C<ToTitle> for | |
916 | C<ucfirst()> and C<\u>; and C<ToUpper> for C<uc()> and C<\U>. | |
3a2263fe | 917 | |
68585b5e | 918 | C<ToUpper()> should look something like this: |
3a2263fe RGS |
919 | |
920 | sub ToUpper { | |
d88362ca | 921 | return <<END; |
68585b5e KW |
922 | 0061\t007A\t0041 |
923 | 0101\t\t0100 | |
3a2263fe RGS |
924 | END |
925 | } | |
926 | ||
68585b5e KW |
927 | This sample C<ToUpper()> has the effect of mapping "a-z" to "A-Z", 0x101 |
928 | to 0x100, and all other characters map to themselves. The first | |
929 | returned line means to map the code point at 0x61 ("a") to 0x41 ("A"), | |
930 | the code point at 0x62 ("b") to 0x42 ("B"), ..., 0x7A ("z") to 0x5A | |
931 | ("Z"). The second line maps just the code point 0x101 to 0x100. Since | |
932 | there are no other mappings defined, all other code points map to | |
933 | themselves. | |
934 | ||
935 | This mechanism is not well behaved as far as affecting other packages | |
936 | and scopes. All non-threaded programs have exactly one uppercasing | |
937 | behavior, one lowercasing behavior, and one titlecasing behavior in | |
938 | effect for utf8-encoded strings for the duration of the program. Each | |
939 | of these behaviors is irrevocably determined the first time the | |
940 | corresponding function is called to change a utf8-encoded string's case. | |
941 | If a corresponding C<To-> function has been defined in the package that | |
942 | makes that first call, the mapping defined by that function will be the | |
943 | mapping used for the duration of the program's execution across all | |
944 | packages and scopes. If no corresponding C<To-> function has been | |
945 | defined in that package, the standard official mapping will be used for | |
946 | all packages and scopes, and any corresponding C<To-> function anywhere | |
947 | will be ignored. Threaded programs have similar behavior. If the | |
948 | program's casing behavior has been decided at the time of a thread's | |
949 | creation, the thread will inherit that behavior. But, if the behavior | |
950 | hasn't been decided, the thread gets to decide for itself, and its | |
951 | decision does not affect other threads nor its creator. | |
952 | ||
953 | As shown by the example above, you have to furnish a complete mapping; | |
954 | you can't just override a couple of characters and leave the rest | |
71648f9a | 955 | unchanged. You can find all the official mappings in the directory |
d5cd9e7b KW |
956 | C<$Config{privlib}>F</unicore/To/>. The mapping data is returned as the |
957 | here-document. The C<utf8::ToSpecI<Foo>> hashes in those files are special | |
958 | exception mappings derived from | |
71648f9a | 959 | C<$Config{privlib}>F</unicore/SpecialCasing.txt>. (The "Digit" and |
9f815e24 | 960 | "Fold" mappings that one can see in the directory are not directly |
d5cd9e7b | 961 | user-accessible; one can use either the L<Unicode::UCD> module, or just match |
71648f9a KW |
962 | case-insensitively, which is what uses the "Fold" mapping. Neither is user |
963 | overridable.) | |
3a2263fe | 964 | |
71648f9a KW |
965 | If you have many mappings to change, you can take the official mapping data, |
966 | change by hand the affected code points, and place the whole thing into your | |
967 | subroutine. But this will only be valid on Perls that use the same Unicode | |
968 | version. Another option would be to have your subroutine read the official | |
969 | mapping file(s) and overwrite the affected code points. | |
3a2263fe | 970 | |
70a5eb4a KW |
971 | If you have only a few mappings to change you can use the |
972 | following trick (but see below for a big caveat), here illustrated for | |
973 | Turkish: | |
71648f9a KW |
974 | |
975 | use Config; | |
70a5eb4a | 976 | use charnames ":full"; |
71648f9a KW |
977 | |
978 | sub ToUpper { | |
979 | my $official = do "$Config{privlib}/unicore/To/Upper.pl"; | |
70a5eb4a | 980 | $utf8::ToSpecUpper{'i'} = |
71648f9a KW |
981 | "\N{LATIN CAPITAL LETTER I WITH DOT ABOVE}"; |
982 | return $official; | |
983 | } | |
984 | ||
985 | This takes the official mappings and overrides just one, for "LATIN SMALL | |
70a5eb4a KW |
986 | LETTER I". Each hash key must be the string of bytes that form the UTF-8 |
987 | (on EBCDIC platforms, UTF-EBCDIC) of the character, as illustrated by | |
988 | the inverse function. | |
71648f9a KW |
989 | |
990 | sub ToLower { | |
991 | my $official = do $lower; | |
992 | $utf8::ToSpecLower{"\xc4\xb0"} = "i"; | |
993 | return $official; | |
994 | } | |
995 | ||
70a5eb4a KW |
996 | This example is for an ASCII platform, and C<\xc4\xb0> is the string of |
997 | bytes that together form the UTF-8 that represents C<\N{LATIN CAPITAL | |
998 | LETTER I WITH DOT ABOVE}>, C<U+0130>. You can avoid having to figure out | |
999 | these bytes, and at the same time make it work on all platforms by | |
1000 | instead writing: | |
71648f9a | 1001 | |
70a5eb4a KW |
1002 | sub ToLower { |
1003 | my $official = do $lower; | |
1004 | my $sequence = "\N{LATIN CAPITAL LETTER I WITH DOT ABOVE}"; | |
1005 | utf8::encode($sequence); | |
1006 | $utf8::ToSpecLower{$sequence} = "i"; | |
1007 | return $official; | |
71648f9a KW |
1008 | } |
1009 | ||
70a5eb4a KW |
1010 | This works because C<utf8::encode()> takes the single character and |
1011 | converts it to the sequence of bytes that constitute it. Note that we took | |
1012 | advantage of the fact that C<"i"> is the same in UTF-8 or UTF-EBCDIC as not; | |
1013 | otherwise we would have had to write | |
1014 | ||
1015 | $utf8::ToSpecLower{$sequence} = "\N{LATIN SMALL LETTER I}"; | |
1016 | ||
1017 | in the ToLower example, and in the ToUpper example, use | |
1018 | ||
1019 | my $sequence = "\N{LATIN SMALL LETTER I}"; | |
1020 | utf8::encode($sequence); | |
1021 | ||
68585b5e KW |
1022 | A big caveat to the above trick, and to this whole mechanism in general, |
1023 | is that they work only on strings encoded in UTF-8. You can partially | |
1024 | get around this by using C<use subs>. For example: | |
70a5eb4a KW |
1025 | |
1026 | use subs qw(uc ucfirst lc lcfirst); | |
1027 | ||
1028 | sub uc($) { | |
1029 | my $string = shift; | |
1030 | utf8::upgrade($string); | |
1031 | return CORE::uc($string); | |
1032 | } | |
1033 | ||
1034 | sub lc($) { | |
1035 | my $string = shift; | |
1036 | utf8::upgrade($string); | |
1037 | ||
1038 | # Unless an I is before a dot_above, it turns into a dotless i. | |
1039 | # (The character class with the combining classes matches non-above | |
1040 | # marks following the I. Any number of these may be between the 'I' and | |
68585b5e | 1041 | # the dot_above, and the dot_above will still apply to the 'I'.) |
70a5eb4a KW |
1042 | use charnames ":full"; |
1043 | $string =~ | |
1044 | s/I | |
1045 | (?! [^\p{ccc=0}\p{ccc=Above}]* \N{COMBINING DOT ABOVE} ) | |
1046 | /\N{LATIN SMALL LETTER DOTLESS I}/gx; | |
1047 | ||
1048 | # But when the I is followed by a dot_above, remove the | |
1049 | # dot_above so the end result will be i. | |
1050 | $string =~ s/I | |
1051 | ([^\p{ccc=0}\p{ccc=Above}]* ) | |
1052 | \N{COMBINING DOT ABOVE} | |
1053 | /i$1/gx; | |
1054 | return CORE::lc($string); | |
1055 | } | |
71648f9a KW |
1056 | |
1057 | These examples (also for Turkish) make sure the input is in UTF-8, and then | |
1058 | call the corresponding official function, which will use the C<ToUpper()> and | |
68585b5e | 1059 | C<ToLower()> functions you have defined. |
70a5eb4a KW |
1060 | (For Turkish, there are other required functions: C<ucfirst>, C<lcfirst>, |
1061 | and C<ToTitle>. These are very similar to the ones given above.) | |
1062 | ||
1063 | The reason this is a partial work-around is that it doesn't affect the C<\l>, | |
1064 | C<\L>, C<\u>, and C<\U> case change operations, which still require the source | |
1065 | to be encoded in utf8 (see L</The "Unicode Bug">). | |
1066 | ||
1067 | The C<lc()> example shows how you can add context-dependent casing. Note | |
1068 | that context-dependent casing suffers from the problem that the string | |
1069 | passed to the casing function may not have sufficient context to make | |
68585b5e | 1070 | the proper choice. And, it will not be called for C<\l>, C<\L>, C<\u>, |
70a5eb4a | 1071 | and C<\U>. |
3a2263fe | 1072 | |
376d9008 | 1073 | =head2 Character Encodings for Input and Output |
8cbd9a7a | 1074 | |
7221edc9 | 1075 | See L<Encode>. |
8cbd9a7a | 1076 | |
c29a771d | 1077 | =head2 Unicode Regular Expression Support Level |
776f8809 | 1078 | |
376d9008 JB |
1079 | The following list of Unicode support for regular expressions describes |
1080 | all the features currently supported. The references to "Level N" | |
8158862b TS |
1081 | and the section numbers refer to the Unicode Technical Standard #18, |
1082 | "Unicode Regular Expressions", version 11, in May 2005. | |
776f8809 JH |
1083 | |
1084 | =over 4 | |
1085 | ||
1086 | =item * | |
1087 | ||
1088 | Level 1 - Basic Unicode Support | |
1089 | ||
d88362ca KW |
1090 | RL1.1 Hex Notation - done [1] |
1091 | RL1.2 Properties - done [2][3] | |
1092 | RL1.2a Compatibility Properties - done [4] | |
1093 | RL1.3 Subtraction and Intersection - MISSING [5] | |
1094 | RL1.4 Simple Word Boundaries - done [6] | |
1095 | RL1.5 Simple Loose Matches - done [7] | |
1096 | RL1.6 Line Boundaries - MISSING [8] | |
1097 | RL1.7 Supplementary Code Points - done [9] | |
8158862b TS |
1098 | |
1099 | [1] \x{...} | |
1100 | [2] \p{...} \P{...} | |
d88362ca KW |
1101 | [3] supports not only minimal list, but all Unicode character |
1102 | properties (see L</Unicode Character Properties>) | |
8158862b TS |
1103 | [4] \d \D \s \S \w \W \X [:prop:] [:^prop:] |
1104 | [5] can use regular expression look-ahead [a] or | |
d88362ca KW |
1105 | user-defined character properties [b] to emulate set |
1106 | operations | |
8158862b | 1107 | [6] \b \B |
d88362ca KW |
1108 | [7] note that Perl does Full case-folding in matching (but with |
1109 | bugs), not Simple: for example U+1F88 is equivalent to | |
1110 | U+1F00 U+03B9, not to U+1F80. This difference matters | |
1111 | mainly for certain Greek capital letters with certain | |
1112 | modifiers: the Full case-folding decomposes the letter, | |
1113 | while the Simple case-folding would map it to a single | |
1114 | character. | |
1115 | [8] should do ^ and $ also on U+000B (\v in C), FF (\f), CR | |
1116 | (\r), CRLF (\r\n), NEL (U+0085), LS (U+2028), and PS | |
1117 | (U+2029); should also affect <>, $., and script line | |
1118 | numbers; should not split lines within CRLF [c] (i.e. there | |
1119 | is no empty line between \r and \n) | |
1120 | [9] UTF-8/UTF-EBCDIC used in perl allows not only U+10000 to | |
1121 | U+10FFFF but also beyond U+10FFFF [d] | |
7207e29d | 1122 | |
237bad5b | 1123 | [a] You can mimic class subtraction using lookahead. |
8158862b | 1124 | For example, what UTS#18 might write as |
29bdacb8 | 1125 | |
dbe420b4 JH |
1126 | [{Greek}-[{UNASSIGNED}]] |
1127 | ||
1128 | in Perl can be written as: | |
1129 | ||
1d81abf3 JH |
1130 | (?!\p{Unassigned})\p{InGreekAndCoptic} |
1131 | (?=\p{Assigned})\p{InGreekAndCoptic} | |
dbe420b4 JH |
1132 | |
1133 | But in this particular example, you probably really want | |
1134 | ||
1bfb14c4 | 1135 | \p{GreekAndCoptic} |
dbe420b4 JH |
1136 | |
1137 | which will match assigned characters known to be part of the Greek script. | |
29bdacb8 | 1138 | |
5ca1ac52 | 1139 | Also see the Unicode::Regex::Set module, which implements the full |
8158862b TS |
1140 | UTS#18 grouping, intersection, union, and removal (subtraction) syntax. |
1141 | ||
1142 | [b] '+' for union, '-' for removal (set-difference), '&' for intersection | |
1143 | (see L</"User-Defined Character Properties">) | |
1144 | ||
1145 | [c] Try the C<:crlf> layer (see L<PerlIO>). | |
5ca1ac52 | 1146 | |
c670e63a KW |
1147 | [d] U+FFFF will currently generate a warning message if 'utf8' warnings are |
1148 | enabled | |
237bad5b | 1149 | |
776f8809 JH |
1150 | =item * |
1151 | ||
1152 | Level 2 - Extended Unicode Support | |
1153 | ||
8158862b | 1154 | RL2.1 Canonical Equivalents - MISSING [10][11] |
c670e63a | 1155 | RL2.2 Default Grapheme Clusters - MISSING [12] |
8158862b TS |
1156 | RL2.3 Default Word Boundaries - MISSING [14] |
1157 | RL2.4 Default Loose Matches - MISSING [15] | |
1158 | RL2.5 Name Properties - MISSING [16] | |
1159 | RL2.6 Wildcard Properties - MISSING | |
1160 | ||
1161 | [10] see UAX#15 "Unicode Normalization Forms" | |
1162 | [11] have Unicode::Normalize but not integrated to regexes | |
e1b711da | 1163 | [12] have \X but we don't have a "Grapheme Cluster Mode" |
8158862b TS |
1164 | [14] see UAX#29, Word Boundaries |
1165 | [15] see UAX#21 "Case Mappings" | |
5bd59e57 | 1166 | [16] missing loose match [e] |
8158862b TS |
1167 | |
1168 | [e] C<\N{...}> allows namespaces (see L<charnames>). | |
776f8809 JH |
1169 | |
1170 | =item * | |
1171 | ||
8158862b TS |
1172 | Level 3 - Tailored Support |
1173 | ||
1174 | RL3.1 Tailored Punctuation - MISSING | |
1175 | RL3.2 Tailored Grapheme Clusters - MISSING [17][18] | |
1176 | RL3.3 Tailored Word Boundaries - MISSING | |
1177 | RL3.4 Tailored Loose Matches - MISSING | |
1178 | RL3.5 Tailored Ranges - MISSING | |
1179 | RL3.6 Context Matching - MISSING [19] | |
1180 | RL3.7 Incremental Matches - MISSING | |
1181 | ( RL3.8 Unicode Set Sharing ) | |
1182 | RL3.9 Possible Match Sets - MISSING | |
1183 | RL3.10 Folded Matching - MISSING [20] | |
1184 | RL3.11 Submatchers - MISSING | |
1185 | ||
1186 | [17] see UAX#10 "Unicode Collation Algorithms" | |
1187 | [18] have Unicode::Collate but not integrated to regexes | |
d88362ca KW |
1188 | [19] have (?<=x) and (?=x), but look-aheads or look-behinds |
1189 | should see outside of the target substring | |
1190 | [20] need insensitive matching for linguistic features other | |
1191 | than case; for example, hiragana to katakana, wide and | |
1192 | narrow, simplified Han to traditional Han (see UTR#30 | |
1193 | "Character Foldings") | |
776f8809 JH |
1194 | |
1195 | =back | |
1196 | ||
c349b1b9 JH |
1197 | =head2 Unicode Encodings |
1198 | ||
376d9008 JB |
1199 | Unicode characters are assigned to I<code points>, which are abstract |
1200 | numbers. To use these numbers, various encodings are needed. | |
c349b1b9 JH |
1201 | |
1202 | =over 4 | |
1203 | ||
c29a771d | 1204 | =item * |
5cb3728c RB |
1205 | |
1206 | UTF-8 | |
c349b1b9 | 1207 | |
6d4f9cf2 KW |
1208 | UTF-8 is a variable-length (1 to 4 bytes), byte-order independent |
1209 | encoding. For ASCII (and we really do mean 7-bit ASCII, not another | |
1210 | 8-bit encoding), UTF-8 is transparent. | |
c349b1b9 | 1211 | |
8c007b5a | 1212 | The following table is from Unicode 3.2. |
05632f9a | 1213 | |
d88362ca | 1214 | Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte |
05632f9a | 1215 | |
d88362ca | 1216 | U+0000..U+007F 00..7F |
e1b711da | 1217 | U+0080..U+07FF * C2..DF 80..BF |
d88362ca | 1218 | U+0800..U+0FFF E0 * A0..BF 80..BF |
ec90690f TS |
1219 | U+1000..U+CFFF E1..EC 80..BF 80..BF |
1220 | U+D000..U+D7FF ED 80..9F 80..BF | |
e1b711da | 1221 | U+D800..U+DFFF +++++++ utf16 surrogates, not legal utf8 +++++++ |
ec90690f | 1222 | U+E000..U+FFFF EE..EF 80..BF 80..BF |
d88362ca KW |
1223 | U+10000..U+3FFFF F0 * 90..BF 80..BF 80..BF |
1224 | U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF | |
1225 | U+100000..U+10FFFF F4 80..8F 80..BF 80..BF | |
e1b711da KW |
1226 | |
1227 | Note the gaps before several of the byte entries above marked by '*'. These are | |
1228 | caused by legal UTF-8 avoiding non-shortest encodings: it is technically | |
1229 | possible to UTF-8-encode a single code point in different ways, but that is | |
1230 | explicitly forbidden, and the shortest possible encoding should always be used | |
1231 | (and that is what Perl does). | |
37361303 | 1232 | |
376d9008 | 1233 | Another way to look at it is via bits: |
05632f9a JH |
1234 | |
1235 | Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte | |
1236 | ||
1237 | 0aaaaaaa 0aaaaaaa | |
1238 | 00000bbbbbaaaaaa 110bbbbb 10aaaaaa | |
1239 | ccccbbbbbbaaaaaa 1110cccc 10bbbbbb 10aaaaaa | |
1240 | 00000dddccccccbbbbbbaaaaaa 11110ddd 10cccccc 10bbbbbb 10aaaaaa | |
1241 | ||
9f815e24 | 1242 | As you can see, the continuation bytes all begin with "10", and the |
e1b711da | 1243 | leading bits of the start byte tell how many bytes there are in the |
05632f9a JH |
1244 | encoded character. |
1245 | ||
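The bit layout above translates almost directly into code. The following sketch encodes a code point by hand and compares the result with Perl's own C<utf8::encode()>. It assumes an ASCII (not EBCDIC) platform and handles only the 1 to 4 byte forms for code points up to C<U+10FFFF>, without checking for surrogates. C<utf8_bytes()> is a hypothetical helper, not part of Perl.

```perl
use strict;
use warnings;

# utf8_bytes() is a hypothetical helper: it returns the list of byte
# values that encode $cp, following the bit patterns in the table.
sub utf8_bytes {
    my ($cp) = @_;
    return ($cp) if $cp < 0x80;                      # 0aaaaaaa
    return ( 0xC0 | ( $cp >> 6 ),                    # 110bbbbb
             0x80 | ( $cp & 0x3F ) )                 # 10aaaaaa
        if $cp < 0x800;
    return ( 0xE0 | ( $cp >> 12 ),                   # 1110cccc
             0x80 | ( ( $cp >> 6 ) & 0x3F ),         # 10bbbbbb
             0x80 | ( $cp & 0x3F ) )                 # 10aaaaaa
        if $cp < 0x10000;
    return ( 0xF0 | ( $cp >> 18 ),                   # 11110ddd
             0x80 | ( ( $cp >> 12 ) & 0x3F ),        # 10cccccc
             0x80 | ( ( $cp >> 6 ) & 0x3F ),         # 10bbbbbb
             0x80 | ( $cp & 0x3F ) );                # 10aaaaaa
}

for my $cp ( 0x41, 0xE9, 0x20AC, 0x1F600 ) {
    my $by_hand = join ' ', map { sprintf '%02X', $_ } utf8_bytes($cp);

    my $encoded = chr($cp);
    utf8::encode($encoded);          # characters -> UTF-8 bytes, in place
    my $by_perl = join ' ',
        map { sprintf '%02X', ord($_) } split //, $encoded;

    printf "U+%04X  by hand: %-11s  utf8::encode: %s\n",
        $cp, $by_hand, $by_perl;
}
```

The two columns should agree for every code point in the loop.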
6d4f9cf2 KW |
1246 | The original UTF-8 specification allowed up to 6 bytes, to allow |
1247 | encoding of numbers up to 0x7FFF_FFFF. Perl continues to allow those, | |
1248 | and has extended that up to 13 bytes to encode code points up to what | |
1249 | can fit in a 64-bit word. However, Perl will warn if you output any of | |
1250 | these, as being non-portable; and under strict UTF-8 input protocols, | |
1251 | they are forbidden. | |
1252 | ||
1253 | The Unicode non-character code points are also disallowed in UTF-8 in | |
1254 | "open interchange". See L</Non-character code points>. | |
1255 | ||
c29a771d | 1256 | =item * |
5cb3728c RB |
1257 | |
1258 | UTF-EBCDIC | |
dbe420b4 | 1259 | |
376d9008 | 1260 | Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe. |
dbe420b4 | 1261 | |
c29a771d | 1262 | =item * |
5cb3728c | 1263 | |
1e54db1a | 1264 | UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks) |
c349b1b9 | 1265 | |
1bfb14c4 JH |
1266 | The following items are mostly for reference and general Unicode |
1267 | knowledge; Perl doesn't use these constructs internally. | |
dbe420b4 | 1268 | |
c349b1b9 | 1269 | UTF-16 is a 2 or 4 byte encoding. The Unicode code points |
1bfb14c4 JH |
1270 | C<U+0000..U+FFFF> are stored in a single 16-bit unit, and the code |
1271 | points C<U+10000..U+10FFFF> in two 16-bit units. The latter case is | |
c349b1b9 JH |
1272 | using I<surrogates>, the first 16-bit unit being the I<high |
1273 | surrogate>, and the second being the I<low surrogate>. | |
1274 | ||
376d9008 | 1275 | Surrogates are code points set aside to encode the C<U+10000..U+10FFFF> |
c349b1b9 | 1276 | range of Unicode code points in pairs of 16-bit units. The I<high |
9f815e24 | 1277 | surrogates> are the range C<U+D800..U+DBFF> and the I<low surrogates> |
376d9008 | 1278 | are the range C<U+DC00..U+DFFF>. The surrogate encoding is |
c349b1b9 | 1279 | |
d88362ca KW |
1280 | $hi = int(($uni - 0x10000) / 0x400) + 0xD800; |
1281 | $lo = ($uni - 0x10000) % 0x400 + 0xDC00; | |
c349b1b9 JH |
1282 | |
1283 | and the decoding is | |
1284 | ||
d88362ca | 1285 | $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00); |
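As a sketch of how the two formulas fit together, the following round-trips one supplementary code point. The subroutine names are hypothetical and chosen only for this illustration. The division is wrapped in C<int()>, since Perl's C</> operator yields a fractional result unless C<use integer> is in effect.

```perl
use strict;
use warnings;

# Hypothetical helper: split a code point in U+10000..U+10FFFF into a
# (high, low) surrogate pair.
sub to_surrogates {
    my ($uni) = @_;
    die "only U+10000..U+10FFFF need surrogates\n"
        if $uni < 0x10000 || $uni > 0x10FFFF;
    my $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
    my $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
    return ($hi, $lo);
}

# Hypothetical helper: combine a surrogate pair back into a code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
}

my ($hi, $lo) = to_surrogates(0x1F600);
printf "U+1F600 -> %04X %04X -> U+%X\n", $hi, $lo, from_surrogates($hi, $lo);
# prints "U+1F600 -> D83D DE00 -> U+1F600"
```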
c349b1b9 | 1286 | |
376d9008 | 1287 | Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16 |
c349b1b9 | 1288 | itself can be used for in-memory computations, but if storage or |
376d9008 JB |
1289 | transfer is required either UTF-16BE (big-endian) or UTF-16LE |
1290 | (little-endian) encodings must be chosen. | |
c349b1b9 JH |
1291 | |
1292 | This introduces another problem: what if you just know that your data | |
376d9008 JB |
1293 | is UTF-16, but you don't know which endianness? Byte Order Marks, or |
1294 | BOMs, are a solution to this. A special character has been reserved | |
86bbd6d1 | 1295 | in Unicode to function as a byte order marker: the character with the |
376d9008 | 1296 | code point C<U+FEFF> is the BOM. |
042da322 | 1297 | |
c349b1b9 | 1298 | The trick is that if you read a BOM, you will know the byte order, |
376d9008 JB |
1299 | since if it was written on a big-endian platform, you will read the |
1300 | bytes C<0xFE 0xFF>, but if it was written on a little-endian platform, | |
1301 | you will read the bytes C<0xFF 0xFE>. (And if the originating platform | |
1302 | was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.) | |
042da322 | 1303 | |
86bbd6d1 | 1304 | The way this trick works is that the character with the code point |
6d4f9cf2 | 1305 | C<U+FFFE> is not supposed to be in input streams, so the |
376d9008 | 1306 | sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in |
1bfb14c4 | 1307 | little-endian format" and cannot be "C<U+FFFE>, represented in big-endian |
6d4f9cf2 KW |
1308 | format". |
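The byte patterns just described are enough for a simple detector. This is a minimal sketch, assuming the data is either UTF-8 or UTF-16; C<sniff_bom> is a hypothetical name, and UTF-32 signatures are deliberately not considered.

```perl
use strict;
use warnings;

# Hypothetical helper: guess an encoding name from the leading bytes of a
# byte string.  Assumes the data is UTF-8 or UTF-16 only.
sub sniff_bom {
    my ($bytes) = @_;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return;    # no BOM: the caller has to decide some other way
}

my $guess = sniff_bom("\xFF\xFEh\x00i\x00");
print defined $guess ? "looks like $guess\n" : "no BOM found\n";
# prints "looks like UTF-16LE"
```

A caller would typically strip the BOM and then hand the remaining bytes to the L<Encode> module with the name that was detected.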
1309 | ||
1310 | Surrogates have no meaning in Unicode outside their use in pairs to | |
1311 | represent other code points. However, Perl allows them to be | |
1312 | represented individually internally, for example by saying | |
1313 | C<chr(0xD801)>, so that all code points, not just Unicode ones, are | |
1314 | representable. Unicode does define semantics for them, such as their | |
1315 | General Category being "Cs". But because their use is somewhat dangerous, | |
1316 | Perl will warn (using the warning category UTF8) if an attempt is made | |
1317 | to do things like take the lower case of one, or match | |
1318 | case-insensitively, or to output them. (But don't try this on Perls | |
1319 | before 5.14.) | |
c349b1b9 | 1320 | |
c29a771d | 1321 | =item * |
5cb3728c | 1322 | |
1e54db1a | 1323 | UTF-32, UTF-32BE, UTF-32LE |
c349b1b9 JH |
1324 | |
1325 | The UTF-32 family is pretty much like the UTF-16 family, except that | |
042da322 | 1326 | the units are 32-bit, and therefore the surrogate scheme is not |
376d9008 JB |
1327 | needed. The BOM signatures will be C<0x00 0x00 0xFE 0xFF> for BE and |
1328 | C<0xFF 0xFE 0x00 0x00> for LE. | |
c349b1b9 | 1329 | |
c29a771d | 1330 | =item * |
5cb3728c RB |
1331 | |
1332 | UCS-2, UCS-4 | |
c349b1b9 | 1333 | |
86bbd6d1 | 1334 | Encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit |
376d9008 | 1335 | encoding. Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>, |
339cfa0e JH |
1336 | because it does not use surrogates. UCS-4 is a 32-bit encoding, |
1337 | functionally identical to UTF-32. | |
c349b1b9 | 1338 | |
c29a771d | 1339 | =item * |
5cb3728c RB |
1340 | |
1341 | UTF-7 | |
c349b1b9 | 1342 | |
376d9008 JB |
1343 | A seven-bit safe (non-eight-bit) encoding, which is useful if the |
1344 | transport or storage is not eight-bit safe. Defined by RFC 2152. | |
c349b1b9 | 1345 | |
95a1a48b JH |
1346 | =back |
1347 | ||
6d4f9cf2 KW |
1348 | =head2 Non-character code points |
1349 | ||
1350 | 66 code points are set aside in Unicode as "non-character code points". | |
1351 | These all have the Unassigned (Cn) General Category, and they never will | |
1352 | be assigned. These are never supposed to be in legal Unicode input | |
1353 | streams, so that code can use them as sentinels that can be mixed in | |
1354 | with character data, and they always will be distinguishable from that data. | |
1355 | To keep them out of Perl input streams, strict UTF-8 should be | |
1356 | specified, such as by using the layer C<:encoding('UTF-8')>. The | |
1357 | non-character code points are the 32 between U+FDD0 and U+FDEF, and the | |
1358 | 34 code points U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF. | |
1359 | Some people are under the mistaken impression that these are "illegal", | |
1360 | but that is not true. An application or cooperating set of applications | |
1361 | can legally use them at will internally; but these code points are | |
1362 | "illegal for open interchange". | |
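The two groups listed above are easy to test for arithmetically. The following sketch uses a hypothetical C<is_noncharacter> subroutine (not a Perl built-in) and confirms the total of 66 by counting over the whole Unicode range.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the code point is one of the 66 Unicode
# non-character code points.
sub is_noncharacter {
    my ($cp) = @_;

    # The contiguous block of 32: U+FDD0..U+FDEF
    return 1 if $cp >= 0xFDD0 && $cp <= 0xFDEF;

    # The last two code points of each of the 17 planes: U+nFFFE, U+nFFFF
    return 1 if $cp <= 0x10FFFF && ($cp & 0xFFFE) == 0xFFFE;

    return 0;
}

my $count = grep { is_noncharacter($_) } 0 .. 0x10FFFF;
print "$count non-character code points\n";    # 32 + 34 = 66
```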
1363 | ||
0d7c09bb JH |
1364 | =head2 Security Implications of Unicode |
1365 | ||
e1b711da KW |
1366 | Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>. |
1367 | Also, note the following: | |
1368 | ||
0d7c09bb JH |
1369 | =over 4 |
1370 | ||
1371 | =item * | |
1372 | ||
1373 | Malformed UTF-8 | |
bf0fa0b2 JH |
1374 | |
1375 | Unfortunately, the specification of UTF-8 leaves some room for | |
1376 | interpretation of how many bytes of encoded output one should generate | |
376d9008 JB |
1377 | from one input Unicode character. Strictly speaking, the shortest |
1378 | possible sequence of UTF-8 bytes should be generated, | |
1379 | because otherwise there is potential for an input buffer overflow at | |
feda178f | 1380 | the receiving end of a UTF-8 connection. Perl always generates the |
e1b711da | 1381 | shortest length UTF-8, and with warnings on, Perl will warn about |
376d9008 JB |
1382 | non-shortest length UTF-8 along with other malformations, such as the |
1383 | surrogates, which are not real Unicode code points. | |
bf0fa0b2 | 1384 | |
0d7c09bb JH |
1385 | =item * |
1386 | ||
68693f9e KW |
1387 | Regular expression pattern matching may surprise you if you're not |
1388 | accustomed to Unicode. Starting in Perl 5.14, there are a number of | |
1389 | modifiers available that control this. For convenience, they will be | |
1390 | referred to in this section using the notation, e.g., C<"/a"> even | |
1391 | though in 5.14, they are not usable in a postfix form after the | |
1392 | (typical) trailing slash of a regular expression. (In 5.14, they are | |
1393 | usable only infix, for example by C</(?a:foo)/>, or by setting them to | |
1394 | apply across a scope by, e.g., C<use re '/a';>. It is planned to lift | |
1395 | this restriction in 5.16.) | |
1396 | ||
1397 | The C<"/l"> modifier says that the regular expression should match based | |
1398 | on whatever locale is in effect at execution time. For example, C<\w> | |
1399 | will match the "word" characters of that locale, and C<"/i"> | |
1400 | case-insensitive matching will match according to the locale's case | |
1401 | folding rules. (See L<perllocale>.) C<\d> will likely match just 10 | |
1402 | digit characters. This modifier is automatically selected within the | |
1403 | scope of either C<use locale> or C<use re '/l'>. | |
1404 | ||
1405 | The C<"/u"> modifier says that the regular expression should match based | |
1406 | on Unicode semantics. C<\w> will match any of the more than 100_000 | |
1407 | word characters in Unicode. Unlike most locales, which are specific to | |
1408 | a language and country pair, Unicode classifies all the characters that | |
1409 | are letters I<somewhere> as C<\w>. For example, your locale might not | |
1410 | think that "LATIN SMALL LETTER ETH" is a letter (unless you happen to | |
1411 | speak Icelandic), but Unicode does. Similarly, all the characters that | |
1412 | are decimal digits somewhere in the world will match C<\d>; this is | |
1413 | hundreds, not 10, possible matches. (And some of those digits look like | |
1414 | some of the 10 ASCII digits, but mean a different number, so a human | |
1415 | could easily think a number is a different quantity than it really is.) | |
1416 | Also, case-insensitive matching works on the full set of Unicode | |
1417 | characters. The "KELVIN SIGN", for example, matches the letters "k" and | |
1418 | "K"; and "LATIN SMALL LETTER LONG S" (which looks very much like an "f", | |
1419 | and was common in the 18th century but is now obsolete), matches "s" and | |
1420 | "S". This modifier is automatically selected within the scope of either | |
1421 | C<use re '/u'> or C<use feature 'unicode_strings'> (which in turn is | |
1422 | selected by C<use 5.012>). | |
1423 | ||
1424 | The C<"/a"> modifier is like the C<"/u"> modifier, except that it | |
1425 | restricts certain constructs to match only in the ASCII range. C<\w> | |
1426 | will match only the 63 characters "[A-Za-z0-9_]"; C<\d>, only the 10 | |
1427 | digits 0-9; C<\s>, only the five characters "[ \f\n\r\t]"; and the | |
1428 | C<"[[:posix:]]"> classes only the appropriate ASCII characters. (See | |
765fa144 | 1429 | L<perlrecharclass>.) This modifier is like the C<"/u"> modifier in that |
68693f9e KW |
1430 | things like "KELVIN SIGN" match the letters "k" and "K"; and non-ASCII |
1431 | characters continue to have Unicode semantics. This modifier is | |
1432 | recommended for people who only incidentally use Unicode. One can write | |
1433 | C<\d> with confidence that it will only match ASCII characters, and | |
1434 | should the need arise to match beyond ASCII, you can use C<\p{Digit}> or | |
765fa144 | 1435 | C<\p{Word}>. (See L<perlrecharclass> for how to extend C<\s>, and the |
68693f9e KW |
1436 | Posix classes beyond ASCII under this modifier.) This modifier is |
1437 | automatically selected within the scope of C<use re '/a'>. | |
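To make the difference between C<"/u"> and C<"/a"> concrete, here is a small sketch using the infix forms, which are the ones described above as available in 5.14. The two sample characters are chosen only for illustration.

```perl
use 5.014;    # the infix (?u:...) and (?a:...) forms need Perl 5.14
use warnings;

my $arabic_three = "\x{0663}";    # ARABIC-INDIC DIGIT THREE
my $kelvin       = "\x{212A}";    # KELVIN SIGN

# \d: any Unicode decimal digit under /u, only 0-9 under /a
my $digit_u = $arabic_three =~ /(?u:\d)/ ? 1 : 0;
my $digit_a = $arabic_three =~ /(?a:\d)/ ? 1 : 0;

# Case-insensitive matching keeps Unicode rules under both modifiers
my $kelvin_u = $kelvin =~ /(?ui:k)/ ? 1 : 0;
my $kelvin_a = $kelvin =~ /(?ai:k)/ ? 1 : 0;

say "U+0663 as \\d:  /u=$digit_u  /a=$digit_a";
say "U+212A as k/i: /u=$kelvin_u  /a=$kelvin_a";
```

The first line should report a match only under C<"/u">; the second should report a match under both, since C<"/a"> restricts the character classes but not case folding.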
1438 | ||
1439 | The C<"/d"> modifier gives the regular expression behavior that Perl has | |
1440 | had between 5.6 and 5.12. For backwards compatibility it is selected | |
1441 | by default, but it leads to a number of issues, as outlined in | |
1442 | L</The "Unicode Bug">. When this modifier is in effect, regular | |
1443 | expression matching uses the semantics of what is called the "C" or | |
1444 | "Posix" locale, unless the pattern or target string of the match is | |
1445 | encoded in UTF-8, in which case it uses Unicode semantics. That is, it | |
1446 | uses what this document calls "byte" semantics unless there is some | |
1447 | UTF-8-ness involved, in which case it uses "character" semantics. Note | |
1448 | that byte semantics are not the same as C<"/a"> matching, as the former | |
1449 | doesn't know about the characters that are in the Latin-1 range which | |
1450 | aren't ASCII (such as "LATIN SMALL LETTER ETH"), but C<"/a"> does. | |
0d7c09bb | 1451 | |
376d9008 | 1452 | As discussed elsewhere, Perl has one foot (two hooves?) planted in |
1bfb14c4 JH |
1453 | each of two worlds: the old world of bytes and the new world of |
1454 | characters, upgrading from bytes to characters when necessary. | |
376d9008 JB |
1455 | If your legacy code does not explicitly use Unicode, no automatic |
1456 | switch-over to characters should happen. Characters shouldn't get | |
1bfb14c4 JH |
1457 | downgraded to bytes, either. It is possible to accidentally mix bytes |
1458 | and characters, however (see L<perluniintro>), in which case C<\w> in | |
1459 | regular expressions might start behaving differently. Review your | |
1460 | code. Use warnings and the C<strict> pragma. | |
0d7c09bb | 1461 | |
68693f9e KW |
1462 | There are some additional rules as to which of these modifiers is in |
1463 | effect if there are contradictory rules present. First, an explicit | |
1464 | modifier in a regular expression always overrides any pragmas. And a | |
1465 | modifier in an inner cluster or capture group overrides one in an outer | |
1466 | group (for that inner group only). If both C<use locale> and C<use | |
1467 | feature 'unicode_strings'> are in effect, the C<"/l"> modifier is | |
1468 | selected. And finally, a C<use re> that specifies a modifier has | |
1469 | precedence over both those pragmas. | |
1470 | ||
0d7c09bb JH |
1471 | =back |
1472 | ||
c349b1b9 JH |
1473 | =head2 Unicode in Perl on EBCDIC |
1474 | ||
376d9008 JB |
1475 | The way Unicode is handled on EBCDIC platforms is still |
1476 | experimental. On such platforms, references to UTF-8 encoding in this | |
1477 | document and elsewhere should be read as meaning the UTF-EBCDIC | |
1478 | specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues | |
c349b1b9 | 1479 | are specifically discussed. There is no C<utfebcdic> pragma or |
376d9008 | 1480 | ":utfebcdic" layer; rather, "utf8" and ":utf8" are reused to mean |
86bbd6d1 PN |
1481 | the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic> |
1482 | for more discussion of the issues. | |
c349b1b9 | 1483 | |
b310b053 JH |
1484 | =head2 Locales |
1485 | ||
4616122b | 1486 | Usually locale settings and Unicode do not affect each other, but |
9e92194e | 1487 | there are exceptions: |
b310b053 JH |
1488 | |
1489 | =over 4 | |
1490 | ||
1491 | =item * | |
1492 | ||
8aa8f774 JH |
1493 | You can enable automatic UTF-8-ification of your standard file |
1494 | handles, default C<open()> layer, and C<@ARGV> by using either | |
1495 | the C<-C> command line switch or the C<PERL_UNICODE> environment | |
1496 | variable; see L<perlrun> for the documentation of the C<-C> switch. | |
b310b053 JH |
1497 | |
1498 | =item * | |
1499 | ||
376d9008 JB |
1500 | Perl tries really hard to work both with Unicode and the old |
1501 | byte-oriented world. Most often this is nice, but sometimes Perl's | |
9e92194e KW |
1502 | straddling of the proverbial fence causes problems. Here's an example |
1503 | of how things can go wrong. A locale can define a code point to be | |
1504 | anything it wants. It could make 'A' into a control character, for example. | |
1505 | But strings encoded in utf8 always have Unicode semantics, so an 'A' in | |
1506 | such a string is always an uppercase letter, never a control, no matter | |
1507 | what the locale says it should be. | |
b310b053 JH |
1508 | |
1509 | =back | |
1510 | ||
1aad1664 JH |
1511 | =head2 When Unicode Does Not Happen |
1512 | ||
1513 | While Perl does have extensive ways to input and output in Unicode, | |
1514 | and a few other 'entry points' like C<@ARGV> that can be interpreted | |
1515 | as Unicode (UTF-8), there still are many places where Unicode (in some | |
1516 | encoding or another) could be given as arguments or received as | |
1517 | results, or both, but it is not. | |
1518 | ||
e1b711da KW |
1519 | The following are such interfaces. Also, see L</The "Unicode Bug">. |
1520 | For all of these interfaces Perl | |
6cd4dd6c JH |
1521 | currently (as of 5.8.3) simply assumes byte strings both as arguments |
1522 | and results, or UTF-8 strings if the C<encoding> pragma has been used. | |
1aad1664 JH |
1523 | |
1524 | One reason why Perl does not attempt to resolve the role of Unicode in | |
e1b711da | 1525 | these cases is that the answers are highly dependent on the operating |
1aad1664 JH |
1526 | system and the file system(s). For example, whether filenames can be |
1527 | in Unicode, and in exactly what kind of encoding, is not exactly a | |
1528 | portable concept. Similarly for C<qx> and C<system>: how well will the | |
1529 | 'command line interface' (and which of them?) handle Unicode? | |
1530 | ||
1531 | =over 4 | |
1532 | ||
557a2462 RB |
1533 | =item * |
1534 | ||
51f494cc | 1535 | chdir, chmod, chown, chroot, exec, link, lstat, mkdir, |
1e8e8236 | 1536 | rename, rmdir, stat, symlink, truncate, unlink, utime, -X |
557a2462 RB |
1537 | |
1538 | =item * | |
1539 | ||
1540 | %ENV | |
1541 | ||
1542 | =item * | |
1543 | ||
1544 | glob (aka the <*>) | |
1545 | ||
1546 | =item * | |
1aad1664 | 1547 | |
557a2462 | 1548 | open, opendir, sysopen |
1aad1664 | 1549 | |
557a2462 | 1550 | =item * |
1aad1664 | 1551 | |
557a2462 | 1552 | qx (aka the backtick operator), system |
1aad1664 | 1553 | |
557a2462 | 1554 | =item * |
1aad1664 | 1555 | |
557a2462 | 1556 | readdir, readlink |
1aad1664 JH |
1557 | |
1558 | =back | |
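Since these interfaces take and return byte strings, one common approach is to encode and decode names explicitly at the boundary. The following is only a sketch under a loud assumption: that the file system stores names as UTF-8 bytes and does not normalize them. That holds on many systems but is not portable, for the reasons given above.

```perl
use strict;
use warnings;
use Encode ();
use File::Temp ();

# ASSUMPTION: file names on this system are UTF-8 byte strings.
my $dir  = File::Temp::tempdir(CLEANUP => 1);
my $name = "\x{263A}.txt";                              # character string
my $path = "$dir/" . Encode::encode('UTF-8', $name);    # bytes for the OS

open my $fh, '>', $path or die "open: $!";
close $fh or die "close: $!";

# readdir hands back bytes, so decode them to get characters again
opendir my $dh, $dir or die "opendir: $!";
my @names = map  { Encode::decode('UTF-8', $_) }
            grep { !/\A\.\.?\z/ } readdir $dh;
closedir $dh;

print "round trip ok\n" if @names == 1 && $names[0] eq $name;
```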
1559 | ||
e1b711da KW |
1560 | =head2 The "Unicode Bug" |
1561 | ||
1562 | The term "Unicode bug" has been applied to an inconsistency with the | |
6f335b04 | 1563 | Unicode characters whose ordinals are in the Latin-1 Supplement block, that |
e1b711da KW |
1564 | is, between 128 and 255. Without a locale specified, unlike all other |
1565 | characters or code points, these characters have very different semantics in | |
20db7501 KW |
1566 | byte semantics versus character semantics, unless |
1567 | C<use feature 'unicode_strings'> is specified. | |
e1b711da KW |
1568 | |
1569 | In character semantics they are interpreted as Unicode code points, which means | |
1570 | they have the same semantics as Latin-1 (ISO-8859-1). | |
1571 | ||
1572 | In byte semantics, they are considered to be unassigned characters, meaning | |
1573 | that the only semantics they have is their ordinal numbers, and that they are | |
1574 | not members of various character classes. None are considered to match C<\w> | |
1575 | for example, but all match C<\W>. (On EBCDIC platforms, the behavior may | |
1576 | be different from this, depending on the underlying C language library | |
1577 | functions.) | |
1578 | ||
1579 | The behavior is known to have effects on these areas: | |
1580 | ||
1581 | =over 4 | |
1582 | ||
1583 | =item * | |
1584 | ||
1585 | Changing the case of a scalar, that is, using C<uc()>, C<ucfirst()>, C<lc()>, | |
1586 | and C<lcfirst()>, or C<\L>, C<\U>, C<\u> and C<\l> in regular expression | |
1587 | substitutions. | |
1588 | ||
1589 | =item * | |
1590 | ||
1591 | Using caseless (C</i>) regular expression matching | |
1592 | ||
1593 | =item * | |
1594 | ||
630d17dc KW |
1595 | Matching a number of properties in regular expressions, namely C<\b>, |
1596 | C<\B>, C<\s>, C<\S>, C<\w>, C<\W>, and all the Posix character classes | |
1597 | I<except> C<[[:ascii:]]>. | |
e1b711da KW |
1598 | |
1599 | =item * | |
1600 | ||
1601 | User-defined case change mappings. You can create a C<ToUpper()> function, for | |
1602 | example, which overrides Perl's built-in case mappings. The scalar must be | |
1603 | encoded in utf8 for your function to actually be invoked. | |
1604 | ||
1605 | =back | |
1606 | ||
1607 | This behavior can lead to unexpected results in which a string's semantics | |
1608 | suddenly change if a code point above 255 is appended to or removed from it, | |
1609 | which changes the string's semantics from byte to character or vice versa. As | |
1610 | an example, consider the following program and its output: | |
1611 | ||
1612 | $ perl -le' | |
1613 | $s1 = "\xC2"; | |
1614 | $s2 = "\x{2660}"; | |
1615 | for ($s1, $s2, $s1.$s2) { | |
1616 | print /\w/ || 0; | |
1617 | } | |
1618 | ' | |
1619 | 0 | |
1620 | 0 | |
1621 | 1 | |
1622 | ||
9f815e24 | 1623 | If there's no C<\w> in C<$s1> or in C<$s2>, why does their concatenation have one? |
e1b711da KW |
1624 | |
1625 | This anomaly stems from Perl's attempt to not disturb older programs that | |
1626 | didn't use Unicode, and hence had no semantics for characters outside of the | |
1627 | ASCII range (except in a locale), along with Perl's desire to add Unicode | |
1628 | support seamlessly. The result wasn't seamless: these characters were | |
1629 | orphaned. | |
1630 | ||
20db7501 KW |
1631 | Starting in Perl 5.14, C<use feature 'unicode_strings'> can be used to |
1632 | cause Perl to use Unicode semantics on all string operations within the | |
1633 | scope of the feature subpragma. Regular expressions compiled in its | |
1634 | scope retain that behavior even when executed or compiled into larger | |
1635 | regular expressions outside the scope. (The pragma does not, however, | |
1636 | affect user-defined case changing operations. These still require a | |
1637 | UTF-8 encoded string to operate.) | |
1638 | ||
1639 | In Perl 5.12, the subpragma affected casing changes, but not regular | |
1640 | expressions. See L<perlfunc/lc> for details on how this pragma works in | |
1641 | combination with various others for casing. | |
1642 | ||
1643 | For earlier Perls, or when a string is passed to a function outside the | |
1644 | subpragma's scope, a workaround is to always call C<utf8::upgrade($string)>, | |
1645 | or to use the standard module L<Encode>. Also, a scalar that has any characters | |
6f335b04 KW |
1646 | whose ordinal is 0x100 or above, or which were specified using either of the |
1647 | C<\N{...}> notations will automatically have character semantics. | |
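The effect of both remedies can be seen on a single Latin-1 character. This sketch assumes Perl 5.14 or later for the regular expression part of C<unicode_strings>, and that no locale or other pragma is in effect.

```perl
use strict;
use warnings;

my $s = "\xC2";    # LATIN CAPITAL LETTER A WITH CIRCUMFLEX, held as one byte

my $before = $s =~ /\w/ ? 1 : 0;    # byte semantics: not a word character
utf8::upgrade($s);                  # same character, UTF-8 encoded internally
my $after  = $s =~ /\w/ ? 1 : 0;    # character semantics: a word character

my $in_scope;
{
    use feature 'unicode_strings';  # affects regexes from Perl 5.14 on
    my $t = "\xC2";                 # not upgraded
    $in_scope = $t =~ /\w/ ? 1 : 0;
}

print "before upgrade: $before, after upgrade: $after, ",
      "with feature: $in_scope\n";
```

On a Perl that meets the stated assumptions this should print 0, 1 and 1 respectively.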
e1b711da | 1648 | |
1aad1664 JH |
1649 | =head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl) |
1650 | ||
e1b711da KW |
1651 | Sometimes (see L</"When Unicode Does Not Happen"> or L</The "Unicode Bug">) |
1652 | there are situations where you simply need to force a byte | |
2bbc8d55 SP |
1653 | string into UTF-8, or vice versa. The low-level calls |
1654 | C<utf8::upgrade($bytestring)> and C<utf8::downgrade($utf8string[, FAIL_OK])> are | |
1aad1664 JH |
1655 | the answers. |
1656 | ||
2bbc8d55 SP |
1657 | Note that C<utf8::downgrade()> can fail if the string contains characters |
1658 | that don't fit into a byte. | |
1aad1664 | 1659 | |
e1b711da KW |
1660 | Calling either function on a string that already is in the desired state is a |
1661 | no-op. | |
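A short sketch of both calls follows. It shows that upgrading changes only the internal representation, not the characters, and that a true C<FAIL_OK> argument makes a failed downgrade return false instead of dying. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $s    = "caf\xE9";    # four characters, held as four bytes
my $copy = $s;
utf8::upgrade($copy);    # internal form changes, the characters do not

print "same string after upgrade\n"
    if $copy eq $s && length($copy) == 4;
print "only the copy is flagged as UTF-8\n"
    if utf8::is_utf8($copy) && !utf8::is_utf8($s);

utf8::downgrade($copy);  # back to one byte per character
print "copy is a byte string again\n" unless utf8::is_utf8($copy);

my $wide = "\x{263A}";                 # WHITE SMILING FACE: too big for a byte
my $ok   = utf8::downgrade($wide, 1);  # FAIL_OK true: return false, don't die
print "cannot downgrade U+263A\n" unless $ok;
```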
1662 | ||
95a1a48b JH |
1663 | =head2 Using Unicode in XS |
1664 | ||
3a2263fe RGS |
1665 | If you want to handle Perl Unicode in XS extensions, you may find the |
1666 | following C APIs useful. See also L<perlguts/"Unicode Support"> for an | |
1667 | explanation about Unicode at the XS level, and L<perlapi> for the API | |
1668 | details. | |
95a1a48b JH |
1669 | |
1670 | =over 4 | |
1671 | ||
1672 | =item * | |
1673 | ||
1bfb14c4 | 1674 | C<DO_UTF8(sv)> returns true if the C<UTF8> flag is on and the bytes |
2bbc8d55 | 1675 | pragma is not in effect. C<SvUTF8(sv)> returns true if the C<UTF8> |
1bfb14c4 JH |
1676 | flag is on; the bytes pragma is ignored. The C<UTF8> flag being on |
1677 | does B<not> mean that there are any characters of code points greater | |
1678 | than 255 (or 127) in the scalar or that there are even any characters | |
1679 | in the scalar. What the C<UTF8> flag means is that the sequence of | |
1680 | octets in the representation of the scalar is the sequence of UTF-8 | |
1681 | encoded code points of the characters of a string. The C<UTF8> flag | |
1682 | being off means that each octet in this representation encodes a | |
1683 | single character with code point 0..255 within the string. Perl's | |
1684 | Unicode model is not to use UTF-8 until it is absolutely necessary. | |
95a1a48b JH |
1685 | |
1686 | =item * | |
1687 | ||
2bbc8d55 | 1688 | C<uvchr_to_utf8(buf, chr)> writes a Unicode character code point into |
1bfb14c4 | 1689 | a buffer encoding the code point as UTF-8, and returns a pointer |
2bbc8d55 | 1690 | pointing after the UTF-8 bytes. It works appropriately on EBCDIC machines. |
95a1a48b JH |
1691 | |
1692 | =item * | |
1693 | ||
2bbc8d55 | 1694 | C<utf8_to_uvchr(buf, lenp)> reads UTF-8 encoded bytes from a buffer and |
376d9008 | 1695 | returns the Unicode character code point and, optionally, the length of |
2bbc8d55 | 1696 | the UTF-8 byte sequence. It works appropriately on EBCDIC machines. |
95a1a48b JH |
1697 | |
1698 | =item * | |
1699 | ||
376d9008 JB |
1700 | C<utf8_length(start, end)> returns the length of the UTF-8 encoded buffer |
1701 | in characters. C<sv_len_utf8(sv)> returns the length of the UTF-8 encoded | |
95a1a48b JH |
1702 | scalar. |
1703 | ||
1704 | =item * | |
1705 | ||
376d9008 JB |
1706 | C<sv_utf8_upgrade(sv)> converts the string of the scalar to its UTF-8 |
1707 | encoded form. C<sv_utf8_downgrade(sv)> does the opposite, if | |
1708 | possible. C<sv_utf8_encode(sv)> is like C<sv_utf8_upgrade()> except that | |
1709 | it does not set the C<UTF8> flag. C<sv_utf8_decode()> does the | |
1710 | opposite of C<sv_utf8_encode()>. Note that none of these are to be | |
1711 | used as general-purpose encoding or decoding interfaces: C<use Encode> | |
1712 | for that. C<sv_utf8_upgrade()> is affected by the encoding pragma | |
1713 | but C<sv_utf8_downgrade()> is not (since the encoding pragma is | |
1714 | designed to be a one-way street). | |
95a1a48b JH |
1715 | |
1716 | =item * | |
1717 | ||
376d9008 | 1718 | C<is_utf8_char(s)> returns true if the pointer points to a valid UTF-8 |
90f968e0 | 1719 | character. |
95a1a48b JH |
1720 | |
1721 | =item * | |
1722 | ||
376d9008 | 1723 | C<is_utf8_string(buf, len)> returns true if C<len> bytes of the buffer |
95a1a48b JH |
1724 | are valid UTF-8. |
1725 | ||
1726 | =item * | |
1727 | ||
376d9008 JB |
1728 | C<UTF8SKIP(buf)> will return the number of bytes in the UTF-8 encoded |
1729 | character in the buffer. C<UNISKIP(chr)> will return the number of bytes | |
1730 | required to UTF-8-encode the Unicode character code point. C<UTF8SKIP()> | |
90f968e0 | 1731 | is useful for example for iterating over the characters of a UTF-8 |
376d9008 | 1732 | encoded buffer; C<UNISKIP()> is useful, for example, in computing |
90f968e0 | 1733 | the size required for a UTF-8 encoded buffer. |
95a1a48b JH |
1734 | |
1735 | =item * | |
1736 | ||
376d9008 | 1737 | C<utf8_distance(a, b)> will tell the distance in characters between the |
95a1a48b JH |
1738 | two pointers pointing to the same UTF-8 encoded buffer. |
1739 | ||
1740 | =item * | |
1741 | ||
2bbc8d55 | 1742 | C<utf8_hop(s, off)> will return a pointer to a UTF-8 encoded buffer |
376d9008 JB |
1743 | that is C<off> (positive or negative) Unicode characters displaced |
1744 | from the UTF-8 buffer C<s>. Be careful not to overstep the buffer: | |
1745 | C<utf8_hop()> will merrily run off the end or the beginning of the | |
1746 | buffer if told to do so. | |
95a1a48b | 1747 | |
d2cc3551 JH |
1748 | =item * |
1749 | ||
376d9008 JB |
1750 | C<pv_uni_display(dsv, spv, len, pvlim, flags)> and |
1751 | C<sv_uni_display(dsv, ssv, pvlim, flags)> are useful for debugging the | |
1752 | output of Unicode strings and scalars. By default they are useful | |
1753 | only for debugging--they display B<all> characters as hexadecimal code | |
1bfb14c4 JH |
1754 | points--but with the flags C<UNI_DISPLAY_ISPRINT>, |
1755 | C<UNI_DISPLAY_BACKSLASH>, and C<UNI_DISPLAY_QQ> you can make the | |
1756 | output more readable. | |
d2cc3551 JH |
1757 | |
1758 | =item * | |
1759 | ||
66615a54 | 1760 | C<foldEQ_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to |
376d9008 | 1761 | compare two strings case-insensitively in Unicode. For case-sensitive |
66615a54 KW |
1762 | comparisons you can just use C<memEQ()> and C<memNE()> as usual, except |
1763 | if one string is in utf8 and the other isn't. | |
d2cc3551 | 1764 | |
c349b1b9 JH |
1765 | =back |
1766 | ||
95a1a48b JH |
1767 | For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h> |
1768 | in the Perl source code distribution. | |
1769 | ||
e1b711da KW |
1770 | =head2 Hacking Perl to work on earlier Unicode versions (for very serious hackers only) |
1771 | ||
1772 | Perl by default comes with the latest supported Unicode version built in, but | |
1773 | you can change to use any earlier one. | |
1774 | ||
1775 | Download the files in the version of Unicode that you want from the Unicode web | |
1776 | site (L<http://www.unicode.org>). These should replace the existing files in | |
1777 | C<$Config{privlib}>/F<unicore>. (C<%Config> is available from the Config | |
1778 | module.) Follow the instructions in F<README.perl> in that directory to change | |
1779 | some of their names, and then run F<make>. | |
1780 | ||
1781 | It is even possible to download them to a different directory, and then change | |
1782 | F<utf8_heavy.pl> in the directory C<$Config{privlib}> to point to the new | |
1783 | directory, or maybe make a copy of that directory before making the change, and | |
1784 | using C<@INC> or the C<-I> run-time flag to switch between versions at will | |
1785 | (but because of caching, not in the middle of a process), but all this is | |
1786 | beyond the scope of these instructions. | |
1787 | ||
c29a771d JH |
1788 | =head1 BUGS |
1789 | ||
376d9008 | 1790 | =head2 Interaction with Locales |
7eabb34d | 1791 | |
376d9008 JB |
1792 | Use of locales with Unicode data may lead to odd results. Currently, |
1793 | Perl attempts to attach 8-bit locale info to characters in the range | |
1794 | 0..255, but this technique is demonstrably incorrect for locales that | |
1795 | use characters above that range when mapped into Unicode. Perl's | |
1796 | Unicode support will also tend to run slower. Use of locales with | |
1797 | Unicode is discouraged. | |
c29a771d | 1798 | |
9f815e24 | 1799 | =head2 Problems with characters in the Latin-1 Supplement range |
2bbc8d55 | 1800 | |
e1b711da KW |
1801 | See L</The "Unicode Bug"> |
1802 | ||
1803 | =head2 Problems with case-insensitive regular expression matching | |
1804 | ||
1805 | There are problems with case-insensitive matches, including those involving | |
1806 | character classes (enclosed in [square brackets]), characters whose fold | |
9f815e24 KW |
1807 | is to multiple characters (such as the single character LATIN SMALL LIGATURE |
1808 | FFL, which matches case-insensitively with the 3-character string C<ffl>), and | |
1809 | characters in the Latin-1 Supplement. | |
2bbc8d55 | 1810 | |
376d9008 | 1811 | =head2 Interaction with Extensions |
7eabb34d | 1812 | |
376d9008 | 1813 | When Perl exchanges data with an extension, the extension should be |
2575c402 | 1814 | able to understand the UTF8 flag and act accordingly. If the |
376d9008 JB |
1815 | extension doesn't know about the flag, it's likely that the extension |
1816 | will return incorrectly-flagged data. | |
7eabb34d A |
1817 | |
1818 | So if you're working with Unicode data, consult the documentation of | |
1819 | every module you're using if there are any issues with Unicode data | |
1820 | exchange. If the documentation does not talk about Unicode at all, | |
a73d23f6 | 1821 | suspect the worst and probably look at the source to learn how the |
376d9008 | 1822 | module is implemented. Modules written completely in Perl shouldn't |
a73d23f6 RGS |
1823 | cause problems. Modules that directly or indirectly access code written |
1824 | in other programming languages are at risk. | |
7eabb34d | 1825 | |
376d9008 | 1826 | For affected functions, the simple strategy to avoid data corruption is |
7eabb34d | 1827 | to always make the encoding of the exchanged data explicit. Choose an |
376d9008 | 1828 | encoding that you know the extension can handle. Convert arguments passed |
7eabb34d A |
1829 | to the extensions to that encoding and convert results back from that |
1830 | encoding. Write wrapper functions that do the conversions for you, so | |
1831 | you can later change the functions when the extension catches up. | |
1832 | ||
376d9008 | 1833 | To provide an example, let's say the popular Foo::Bar::escape_html |
7eabb34d A |
1834 | function doesn't deal with Unicode data yet. The wrapper function |
1835 | would convert the argument to raw UTF-8 and convert the result back to | |
376d9008 | 1836 | Perl's internal representation like so: |
7eabb34d A |
1837 | |
1838 | sub my_escape_html ($) { | |
d88362ca KW |
1839 | my($what) = shift; |
1840 | return unless defined $what; | |
1841 | Encode::decode_utf8(Foo::Bar::escape_html( | |
1842 | Encode::encode_utf8($what))); | |
7eabb34d A |
1843 | } |
1844 | ||
1845 | Sometimes, when the extension does not convert data but just stores | |
1846 | and retrieves them, you will be in a position to use the otherwise | |
1847 | dangerous Encode::_utf8_on() function. Let's say the popular | |
66b79f27 | 1848 | C<Foo::Bar> extension, written in C, provides a C<param> method that |
7eabb34d A |
1849 | lets you store and retrieve data according to these prototypes: |
1850 | ||
1851 | $self->param($name, $value); # set a scalar | |
1852 | $value = $self->param($name); # retrieve a scalar | |
1853 | ||
1854 | If it does not yet provide support for any encoding, one could write a | |
1855 | derived class with such a C<param> method: | |
1856 | ||
1857 | sub param { | |
1858 | my($self,$name,$value) = @_; | |
1859 | utf8::upgrade($name); # make sure it is UTF-8 encoded | |
af55fc6a | 1860 | if (defined $value) { |
7eabb34d A |
1861 | utf8::upgrade($value); # make sure it is UTF-8 encoded |
1862 | return $self->SUPER::param($name,$value); | |
1863 | } else { | |
1864 | my $ret = $self->SUPER::param($name); | |
1865 | Encode::_utf8_on($ret); # we know, it is UTF-8 encoded | |
1866 | return $ret; | |
1867 | } | |
1868 | } | |
1869 | ||
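Encode::_utf8_on() performs no validation, which is what makes it dangerous. Where you cannot be certain that the extension hands back well-formed UTF-8, a checked decode is a more cautious alternative. The sketch below is only an illustration: C<checked_utf8> is a hypothetical helper name, and the sample octet strings stand in for whatever the extension returns.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode octets as strict UTF-8 and die on
# malformed input instead of silently setting the UTF8 flag.
sub checked_utf8 {
    my ($octets) = @_;
    return undef unless defined $octets;
    return Encode::decode('UTF-8', $octets,
                          Encode::FB_CROAK | Encode::LEAVE_SRC);
}

# Well-formed input: "caf" followed by the two-octet encoding of U+00E9.
my $good = checked_utf8("caf\xC3\xA9");

# Malformed input: a lone 0xE9 octet is not valid UTF-8, so decode() dies
# and eval returns undef with the reason in $@.
my $bad = eval { checked_utf8("caf\xE9") };
my $err = $@;

printf "good has %d characters\n", length $good;
print  "bad input was rejected\n" if !defined $bad && $err;
```

The checked decode costs a scan of the data, so Encode::_utf8_on() remains the faster choice when the guarantee really holds.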
a73d23f6 RGS |
1870 | Some extensions provide filters on data entry/exit points, such as |
1871 | DB_File::filter_store_key and family. Look out for such filters in | |
66b79f27 | 1872 | the documentation of your extensions; they can make the transition to |
7eabb34d A |
1873 | Unicode data much easier. |
1874 | ||
376d9008 | 1875 | =head2 Speed |
7eabb34d | 1876 | |
c29a771d | 1877 | Some functions are slower when working on UTF-8 encoded strings than |
574c8022 | 1878 | on byte encoded strings. All functions that need to hop over |
7c17141f JH |
1879 | characters, such as length(), substr() or index(), or matching regular | |
1880 | expressions can work B<much> faster when the underlying data are | |
1881 | byte-encoded. | |
1882 | ||
1883 | In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1 | |
1884 | a caching scheme was introduced which will hopefully make the slowness | |
a104b433 JH |
1885 | somewhat less spectacular, at least for some operations. In general, |
1886 | operations with UTF-8 encoded strings are still slower. As an example, | |
1887 | the Unicode properties (character classes) like C<\p{Nd}> are known to | |
1888 | be quite a bit slower (5-20 times) than their simpler counterparts | |
1889 | like C<\d> (then again, there are 268 Unicode characters matching C<Nd> | |
1890 | compared with the 10 ASCII characters matching C<\d>). | |
666f95b9 | 1891 | |
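To get a feeling for the difference on your own Perl, you can time the same operation on a byte-encoded string and on an upgraded copy with the core Benchmark module. This is a minimal sketch: the test string and the iteration count are arbitrary assumptions, and no particular ratio should be expected from it.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Arbitrary sample data: ASCII only, so it starts out byte-encoded.
my $bytes = "item 12345 " x 1_000;
my $chars = $bytes;
utf8::upgrade($chars);    # same text, now UTF-8 encoded internally

# Count the digits in each string with \d and with \p{Nd}.
cmpthese(200, {
    'bytes \d'     => sub { my $n = () = $bytes =~ /\d/g },
    'utf8 \d'      => sub { my $n = () = $chars =~ /\d/g },
    'bytes \p{Nd}' => sub { my $n = () = $bytes =~ /\p{Nd}/g },
    'utf8 \p{Nd}'  => sub { my $n = () = $chars =~ /\p{Nd}/g },
});
```

Both strings compare equal with C<eq>; only the internal representation differs, so any timing difference comes from the encoding alone.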
e1b711da KW |
1892 | =head2 Problems on EBCDIC platforms |
1893 | ||
1894 | There are a number of known problems with Perl on EBCDIC platforms. If you | |
1895 | want to use Perl there, send email to perlbug@perl.org. | |
fe749c9a KW |
1896 | |
1897 | In earlier versions, when byte and character data were concatenated, | |
1898 | the new string was sometimes created by | |
1899 | decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the | |
1900 | old Unicode string used EBCDIC. | |
1901 | ||
1902 | If you find any of these, please report them as bugs. | |
1903 | ||
c8d992ba A |
1904 | =head2 Porting code from perl-5.6.X |
1905 | ||
1906 | Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer | |
1907 | was required to use the C<utf8> pragma to declare that a given scope | |
1908 | was expected to deal with Unicode data and had to make sure that only | |
1909 | Unicode data were reaching that scope. If you have code that is | |
1910 | working with 5.6, you will need some of the following adjustments to | |
1911 | your code. The examples are written such that the code will continue | |
1912 | to work under 5.6, so you should be safe to try them out. | |
1913 | ||
1914 | =over 4 | |
1915 | ||
1916 | =item * | |
1917 | ||
1918 | A filehandle that should read or write UTF-8 | |
1919 | ||
1920 | if ($] > 5.007) { | |
740d4bb2 | 1921 | binmode $fh, ":encoding(utf8)"; |
c8d992ba A |
1922 | } |
1923 | ||
1924 | =item * | |
1925 | ||
1926 | A scalar that is going to be passed to some extension | |
1927 | ||
1928 | Be it Compress::Zlib, Apache::Request or any extension that has no | |
1929 | mention of Unicode in the manpage, you need to make sure that the | |
2575c402 | 1930 | UTF8 flag is stripped off. Note that at the time of this writing |
c8d992ba A |
1931 | (October 2002) the mentioned modules are not UTF-8-aware. Please |
1932 | check the documentation to verify if this is still true. | |
1933 | ||
1934 | if ($] > 5.007) { | |
1935 | require Encode; | |
1936 | $val = Encode::encode_utf8($val); # make octets | |
1937 | } | |
1938 | ||
1939 | =item * | |
1940 | ||
1941 | A scalar we got back from an extension | |
1942 | ||
1943 | If you believe the scalar comes back as UTF-8, you will most likely | |
2575c402 | 1944 | want the UTF8 flag restored: |
c8d992ba A |
1945 | |
1946 | if ($] > 5.007) { | |
1947 | require Encode; | |
1948 | $val = Encode::decode_utf8($val); | |
1949 | } | |
1950 | ||
1951 | =item * | |
1952 | ||
1953 | Same thing, if you are really sure it is UTF-8 | |
1954 | ||
1955 | if ($] > 5.007) { | |
1956 | require Encode; | |
1957 | Encode::_utf8_on($val); | |
1958 | } | |
1959 | ||
1960 | =item * | |
1961 | ||
1962 | A wrapper for fetchrow_array and fetchrow_hashref | |
1963 | ||
1964 | When the database contains only UTF-8, a wrapper function or method is | |
1965 | a convenient way to replace all your fetchrow_array and | |
1966 | fetchrow_hashref calls. A wrapper function will also make it easier to | |
1967 | adapt to future enhancements in your database driver. Note that at the | |
1968 | time of this writing (October 2002), the DBI has no standardized way | |
1969 | to deal with UTF-8 data. Please check the documentation to verify if | |
1970 | that is still true. | |
1971 | ||
1972 | sub fetchrow { | |
d88362ca KW |
1973 | # $what is one of fetchrow_{array,hashref} |
1974 | my($self, $sth, $what) = @_; | |
c8d992ba A |
1975 | if ($] < 5.007) { |
1976 | return $sth->$what; | |
1977 | } else { | |
1978 | require Encode; | |
1979 | if (wantarray) { | |
1980 | my @arr = $sth->$what; | |
1981 | for (@arr) { | |
1982 | defined && /[^\000-\177]/ && Encode::_utf8_on($_); | |
1983 | } | |
1984 | return @arr; | |
1985 | } else { | |
1986 | my $ret = $sth->$what; | |
1987 | if (ref $ret) { | |
1988 | for my $k (keys %$ret) { | |
d88362ca KW |
1989 | defined |
1990 | && /[^\000-\177]/ | |
1991 | && Encode::_utf8_on($_) for $ret->{$k}; | |
c8d992ba A |
1992 | } |
1993 | return $ret; | |
1994 | } else { | |
1995 | defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret; | |
1996 | return $ret; | |
1997 | } | |
1998 | } | |
1999 | } | |
2000 | } | |
2001 | ||
2002 | ||
2003 | =item * | |
2004 | ||
2005 | A large scalar that you know can only contain ASCII | |
2006 | ||
2007 | Scalars that contain only ASCII and are marked as UTF-8 are sometimes | |
2008 | a drag to your program. If you recognize such a situation, just remove | |
2575c402 | 2009 | the UTF8 flag: |
c8d992ba A |
2010 | |
2011 | utf8::downgrade($val) if $] > 5.007; | |
2012 | ||
2013 | =back | |
2014 | ||
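The effect of the last item can be observed with utf8::is_utf8(). The following sketch assumes nothing beyond the core C<utf8> functions; the sample strings are made up. It also shows the optional second argument to utf8::downgrade(), which makes the call return false instead of dying when the string contains characters above 255.

```perl
use strict;
use warnings;

my $val = "plain ASCII text";
utf8::upgrade($val);                        # simulate a flagged scalar
my $before = utf8::is_utf8($val) ? 1 : 0;

utf8::downgrade($val);                      # remove the UTF8 flag
my $after = utf8::is_utf8($val) ? 1 : 0;

print "flag before: $before, after: $after\n";

# A string with a character above 255 cannot be downgraded. With a true
# second argument the function returns false rather than dying.
my $wide = "\x{263A}";
my $ok   = utf8::downgrade($wide, 1);
print "wide string left unchanged\n" unless $ok;
```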
393fec97 GS |
2015 | =head1 SEE ALSO |
2016 | ||
51f494cc | 2017 | L<perlunitut>, L<perluniintro>, L<perluniprops>, L<Encode>, L<open>, L<utf8>, L<bytes>, |
a05d7ebb | 2018 | L<perlretut>, L<perlvar/"${^UNICODE}">, |
51f494cc | 2019 | L<http://www.unicode.org/reports/tr44>. |
393fec97 GS |
2020 | |
2021 | =cut |