Commit | Line | Data |
---|---|---|
a0ed51b3 LW |
1 | package utf8; |
2 | ||
3 | sub import { | |
4 | $^H |= 0x00000008; | |
5 | $enc{caller()} = $_[1] if $_[1]; | |
6 | } | |
7 | ||
8 | sub unimport { | |
9 | $^H &= ~0x00000008; | |
10 | } | |
11 | ||
12 | sub AUTOLOAD { | |
13 | require "utf8_heavy.pl"; | |
14 | goto &$AUTOLOAD; | |
15 | } | |
16 | ||
17 | 1; | |
18 | __END__ | |
19 | ||
20 | =head1 NAME | |
21 | ||
22 | utf8 - Perl pragma to turn on UTF-8 and Unicode support | |
23 | ||
24 | =head1 SYNOPSIS | |
25 | ||
26 | use utf8; | |
27 | no utf8; | |
28 | ||
29 | =head1 DESCRIPTION | |
30 | ||
31 | The utf8 pragma tells Perl to use UTF-8 as its internal string | |
32 | representation for the rest of the enclosing block. (The "no utf8" | |
33 | pragma tells Perl to switch back to ordinary byte-oriented processing | |
34 | for the rest of the enclosing block.) Under utf8, many operations that | |
35 | formerly operated on bytes change to operating on characters. For | |
36 | ASCII data this makes no difference, because UTF-8 stores ASCII in | |
37 | single bytes, but for any character greater than C<chr(127)>, the | |
38 | character is stored in a sequence of two or more bytes, all of which | |
39 | have the high bit set. But by and large, the user need not worry about | |
40 | this, because the utf8 pragma hides it from the user. A character | |
41 | under utf8 is logically just a number ranging from 0 to 2**32 or so. | |
42 | Larger characters encode to longer sequences of bytes, but again, this | |
43 | is hidden. | |
44 | ||
45 | Use of the utf8 pragma has the following effects: | |
46 | ||
47 | =over 4 | |
48 | ||
49 | =item * | |
50 | ||
51 | Strings and patterns may contain characters that have an ordinal value | |
52 | larger than 255. Presuming you use a Unicode editor to edit your | |
53 | program, these will typically occur directly within the literal strings | |
54 | as UTF-8 characters, but you can also specify a particular character | |
55 | with an extension of the C<\x> notation. UTF-8 characters are | |
f244e06d | 56 | specified by putting the hexadecimal code within curlies after the |
a0ed51b3 LW |
57 | C<\x>. For instance, a Unicode smiley face is C<\x{263A}>. A |
58 | character in the Latin-1 range (128..255) should be written C<\x{ab}> | |
59 | rather than C<\xab>, since the former will turn into a two-byte UTF-8 | |
60 | code, while the latter will continue to be interpreted as generating an | |
ca24dfc6 | 61 | 8-bit byte rather than a character. In fact, if C<-w> is turned on, it will |
a0ed51b3 LW |
62 | produce a warning that you might be generating invalid UTF-8. |
63 | ||
64 | =item * | |
65 | ||
66 | Identifiers within the Perl script may contain Unicode alphanumeric | |
67 | characters, including ideographs. (You are currently on your own when | |
68 | it comes to using the canonical forms of characters--Perl doesn't (yet) | |
69 | attempt to canonicalize variable names for you.) | |
70 | ||
71 | =item * | |
72 | ||
73 | Regular expressions match characters instead of bytes. For instance, | |
4a2d328f IZ |
74 | "." matches a character instead of a byte. (However, the C<\C> pattern |
75 | is provided to force a match of a single byte ("C<char>" in C, hence | |
76 | C<\C>).) | |
a0ed51b3 LW |
77 | |
78 | =item * | |
79 | ||
80 | Character classes in regular expressions match characters instead of | |
81 | bytes, and match against the character properties specified in the | |
82 | Unicode properties database. So C<\w> can be used to match an ideograph, | |
83 | for instance. | |
84 | ||
85 | =item * | |
86 | ||
87 | Named Unicode properties and block ranges may be used as character | |
88 | classes via the new C<\p{}> (matches property) and C<\P{}> (doesn't | |
89 | match property) constructs. For instance, C<\p{Lu}> matches any | |
90 | character with the Unicode uppercase property, while C<\p{M}> matches | |
91 | any mark character. Single letter properties may omit the braces, so | |
92 | the latter can also be written C<\pM>. Many predefined character classes are | |
93 | available, such as C<\p{IsMirrored}> and C<\p{InTibetan}>. | |
94 | ||
95 | =item * | |
96 | ||
97 | The special pattern C<\X> matches any extended Unicode sequence | |
98 | (a "combining character sequence" in Standardese), where the first | |
99 | character is a base character and subsequent characters are mark | |
100 | characters that apply to the base character. It is equivalent to | |
22244bdb | 101 | C<(?:\PM\pM*)>. |
a0ed51b3 LW |
102 | |
103 | =item * | |
104 | ||
105 | The C<tr///> operator translates characters instead of bytes. It can also | |
106 | be forced to translate between 8-bit codes and UTF-8 regardless of the | |
107 | surrounding utf8 state. For instance, if you know your input is in Latin-1, | |
108 | you can say: | |
109 | ||
110 | use utf8; | |
111 | while (<>) { | |
112 | tr/\0-\xff//CU; # latin1 char to utf8 | |
113 | ... | |
114 | } | |
115 | ||
116 | Similarly you could translate your output with | |
117 | ||
118 | tr/\0-\x{ff}//UC; # utf8 to latin1 char | |
119 | ||
120 | No, C<s///> doesn't take /U or /C (yet?). | |
121 | ||
122 | =item * | |
123 | ||
124 | Case translation operators use the Unicode case translation tables. | |
125 | Note that C<uc()> translates to uppercase, while C<ucfirst> translates | |
126 | to titlecase (for languages that make the distinction). Naturally | |
127 | the corresponding backslash sequences have the same semantics. | |
128 | ||
129 | =item * | |
130 | ||
131 | Most operators that deal with positions or lengths in the string will | |
132 | automatically switch to using character positions, including C<chop()>, | |
133 | C<substr()>, C<pos()>, C<index()>, C<rindex()>, C<sprintf()>, | |
134 | C<write()>, and C<length()>. Operators that specifically don't switch | |
135 | include C<vec()>, C<pack()>, and C<unpack()>. Operators that really | |
136 | don't care include C<chomp()>, as well as any other operator that | |
137 | treats a string as a bucket of bits, such as C<sort()>, and the | |
138 | operators dealing with filenames. | |
139 | ||
140 | =item * | |
141 | ||
142 | The C<pack()>/C<unpack()> letters "C<c>" and "C<C>" do I<not> change, | |
143 | since they're often used for byte-oriented formats. (Again, think | |
144 | "C<char>" in the C language.) However, there is a new "C<U>" specifier | |
145 | that will convert between UTF-8 characters and integers. (It works | |
146 | outside of the utf8 pragma too.) | |
147 | ||
148 | =item * | |
149 | ||
150 | The C<chr()> and C<ord()> functions work on characters. This is like | |
151 | C<pack("U")> and C<unpack("U")>, not like C<pack("C")> and | |
152 | C<unpack("C")>. In fact, the latter are how you now emulate | |
153 | byte-oriented C<chr()> and C<ord()> under utf8. | |
154 | ||
155 | =item * | |
156 | ||
157 | And finally, C<scalar reverse()> reverses by character rather than by byte. | |
158 | ||
159 | =back | |
160 | ||
161 | =head1 CAVEATS | |
162 | ||
163 | As of yet, there is no method for automatically coercing input and | |
164 | output to some encoding other than UTF-8. This is planned in the near | |
165 | future, however. | |
166 | ||
167 | In any event, you'll need to keep track of whether interfaces to other | |
168 | modules expect UTF-8 data or something else. The utf8 pragma does not | |
169 | magically mark strings for you in order to remember their encoding, nor | |
170 | will any automatic coercion happen (other than that eventually planned | |
171 | for I/O). If you want such automatic coercion, you can build yourself | |
172 | a set of pretty object-oriented modules. Expect it to run considerably | |
173 | slower than this low-level support. | |
174 | ||
175 | Use of locales with utf8 may lead to odd results. Currently there is | |
176 | some attempt to apply 8-bit locale info to characters in the range | |
177 | 0..255, but this is demonstrably incorrect for locales that use | |
178 | characters above that range (when mapped into Unicode). It will also | |
179 | tend to run slower. Avoidance of locales is strongly encouraged. | |
180 | ||
181 | =cut |
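
The character-oriented behaviour described in the POD above can be illustrated with a short sketch. It is not part of the module. It assumes a modern Perl 5 interpreter (5.14 or later), not the development release this file belongs to, and in such a Perl these operations apply to any string holding code points above 255. All variable names are made up for the example, and `utf8::encode` is used only to show the underlying byte sequence.

```perl
use strict;
use warnings;

# Assumption: modern Perl 5 (5.14+). Variable names are illustrative only.

# A character above 255, written with the \x{...} notation.
my $smiley = "\x{263A}";
my $chars  = length $smiley;        # length counts characters
my $ord    = ord $smiley;           # ord returns the code point

# The same character as its UTF-8 byte sequence.
my $bytes = $smiley;
utf8::encode($bytes);               # in-place conversion to UTF-8 bytes
my $nbytes       = length $bytes;
my $all_high_bit = $bytes !~ /[\x00-\x7F]/ ? 1 : 0;

# pack("U") and unpack("U") convert between integers and characters.
my $packed     = pack("U", 0x263A);
my ($unpacked) = unpack("U", $packed);

# scalar reverse works character by character.
my $reversed = scalar reverse("a\x{263A}b");

# \p{Lu} matches an uppercase letter (here GREEK CAPITAL LETTER ALPHA).
my $is_upper = "\x{391}" =~ /\A\p{Lu}\z/ ? 1 : 0;

# \X matches a base character together with its combining marks.
my @clusters = ("e\x{301}x" =~ /\X/g);

# uc gives uppercase, ucfirst gives titlecase (U+01C6 is a digraph
# for which the two forms differ).
my $upper = uc("\x{1C6}");
my $title = ucfirst("\x{1C6}");

printf "chars=%d bytes=%d ord=0x%X clusters=%d\n",
    $chars, $nbytes, $ord, scalar @clusters;
```

Only numbers are printed, so the sketch does not depend on any output encoding being set up for the filehandle.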