Commit | Line | Data |
---|---|---|
5d030b67 JH |
1 | =head1 NAME |
2 | ||
a999c27c | 3 | Encode::Supported -- Supported encodings by Encode |
5d030b67 JH |
4 | |
5 | =head1 DESCRIPTION | |
6 | ||
5129552c | 7 | =head2 Encoding Names |
5d030b67 JH |
8 | |
9 | Encoding names are case insensitive. White space in names | |
10 | is ignored. In addition an encoding may have aliases. | |
11 | Each encoding has one "canonical" name. The "canonical" | |
12 | name is chosen from the names of the encoding by picking | |
a999c27c | 13 | the first in the following sequence (with a few exceptions). |
5d030b67 | 14 | |
a999c27c JH |
15 | =over |
16 | ||
17 | =item * | |
18 | ||
19 | The name used by the Perl community. That includes 'utf8' and 'ascii'. | |
20 | Unlike aliases, canonical names directly reach the method, so | |
21 | frequently used names like 'utf8' work without alias lookups. | |
22 | ||
23 | =item * | |
24 | ||
25 | The MIME name as defined in IETF RFCs. This includes all "iso-" names. | |
26 | ||
27 | =item * | |
28 | ||
29 | The name in the IANA registry. | |
30 | ||
31 | =item * | |
32 | ||
33 | The name used by the organization that defined it. | |
34 | ||
35 | =back | |
36 | ||
37 | If a I<de jure> canonical name differs from the one used by the Encode | |
38 | module, it is always aliased whenever the encoding is implemented. So you can | |
39 | safely tell whether a given encoding is implemented just by passing | |
40 | the canonical name. | |
5d030b67 | 41 | |
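As a minimal sketch of the point above, assuming a stock Encode installation, C<find_encoding> and C<resolve_alias> (both importable from Encode) can be used to check whether a name is implemented and which canonical name it maps to. The name C<x-no-such-encoding> is made up for illustration only.

```perl
use strict;
use warnings;
use Encode qw(find_encoding resolve_alias);

# An alias and a differently-cased name resolve to one canonical name.
for my $name (qw(latin1 ISO-8859-1 US-ascii)) {
    my $canonical = resolve_alias($name);
    print "$name => ", ($canonical || '(not implemented)'), "\n";
}

# find_encoding() returns an encoding object, or undef when unsupported.
# 'x-no-such-encoding' is a hypothetical name used only for illustration.
my $obj = find_encoding('x-no-such-encoding');
print defined $obj ? "implemented\n" : "not implemented\n";
```

Because C<resolve_alias> returns a false value for unknown names, it can double as a cheap "is this encoding available?" test before any data is converted.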
5129552c JH |
42 | Because of all the alias issues, and because in the general case |
43 | encodings have state, "Encode" uses the encoding object internally | |
44 | once an operation is in progress. | |
5d030b67 | 45 | |
5129552c | 46 | =head1 Supported Encodings |
5d030b67 JH |
47 | |
48 | As of Perl 5.8.0, at least the following encodings are recognized. | |
49 | Note that unless otherwise specified, they are all case insensitive | |
a63c962f | 50 | (via alias) and all occurrences of spaces are replaced with '-'. In |
5d030b67 JH |
51 | other words, "ISO 8859 1" and "iso-8859-1" are identical. |
52 | ||
5129552c JH |
53 | Encodings are categorized and implemented in several different modules |
54 | but you don't have to C<use Encode::XX> to make them available in | |
55 | most cases. Encode.pm will automatically load those modules on demand. | |
5d030b67 | 56 | |
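A short sketch of the on-demand loading described above, assuming the Encode distribution bundled with Perl (which includes Encode::JP). C<< Encode->encodings(":all") >> lists every available encoding name, whether or not its module has been loaded yet.

```perl
use strict;
use warnings;
use Encode qw(encode);

# List every encoding Encode knows how to provide, loaded or not.
my @all = Encode->encodings(':all');
print scalar(@all), " encodings available\n";

# No "use Encode::JP" here: Encode loads the module when euc-jp is first used.
my $bytes = encode('euc-jp', "\x{3042}");    # HIRAGANA LETTER A
print "euc-jp bytes: ", unpack('H*', $bytes), "\n";
```

Calling C<< Encode->encodings() >> without arguments returns only the encodings that are currently loaded, which is a handy way to watch modules being pulled in.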
5129552c | 57 | =head2 Built-in Encodings |
5d030b67 | 58 | |
5129552c | 59 | The following encodings are always available. |
5d030b67 | 60 | |
67d7b5ef JH |
61 | Canonical Aliases Comments & References |
62 | ---------------------------------------------------------------- | |
a999c27c | 63 | ascii US-ascii [ECMA] |
67d7b5ef | 64 | iso-8859-1 latin1 [ISO] |
a999c27c | 65 | utf8 UTF-8 [RFC2279] |
c731e18e JH |
66 | ---------------------------------------------------------------- |
67 | ||
68 | ||
69 | =head2 Encode::Unicode -- other Unicode encodings | |
70 | ||
71 | Unicode coding schemes other than native utf8 are supported by | |
72 | Encode::Unicode which will be autoloaded on demand. | |
73 | ||
74 | ---------------------------------------------------------------- | |
f2a2953c JH |
75 | UCS-2BE UCS-2, iso-10646-1 [IANA, UC] |
76 | UCS-2LE [UC] | |
77 | UTF-16 [UC] | |
78 | UTF-16BE [UC] | |
79 | UTF-16LE [UC] | |
80 | UTF-32 [UC] | |
81 | UTF-32BE [UC] | |
82 | UTF-32LE [UC] | |
67d7b5ef | 83 | ---------------------------------------------------------------- |
5d030b67 | 84 | |
f2a2953c JH |
85 | To find out how (UCS-2|UTF-(16|32))(LE|BE)? differ from one another, |
86 | see L<Encode::Unicode>. | |
87 | ||
a999c27c | 88 | =head2 Encode::Byte -- Extended ASCII |
5d030b67 | 89 | |
a999c27c JH |
90 | Encode::Byte implements most single-byte encodings except for |
91 | Symbols and EBCDIC. The following are single-byte encodings | |
92 | implemented as extended ASCII. In most cases they use | |
93 | \x80-\xff (upper half) to map non-ASCII characters. | |
94 | ||
95 | =over 2 | |
96 | ||
97 | =item ISO-8859 and corresponding vendor mappings | |
98 | ||
99 | Since there are so many, they are presented in table format with | |
100 | languages and the corresponding encoding names by vendor. Note the table | |
101 | is sorted in ISO-8859 order, and the corresponding vendor mappings | |
102 | are slightly different from those of ISO. See | |
103 | L<http://czyborra.com/charsets/iso8859.html> for details. | |
104 | ||
105 | Lang/Regions ISO/Other Std. DOS Windows Macintosh Others | |
106 | ---------------------------------------------------------------- | |
107 | N. America (ASCII) cp437 AdobeStandardEncoding | |
108 | cp863 (DOSCanadaF) | |
109 | W. Europe (iso-8859-1) cp850 cp1252 MacRoman nextstep | |
110 | hp-roman8 | |
111 | cp860 (DOSPortuguese) | |
112 | CE. Europe iso-8859-2 cp852 cp1250 MacCentralEurRoman | |
113 | MacCroatian | |
114 | MacRomanian | |
115 | MacRumanian | |
116 | Latin3(*3) iso-8859-3 | |
117 | Latin4(*4) iso-8859-4 | |
118 | Cyrillics iso-8859-5 cp855 cp1251 MacCyrillic | |
119 | (Also see next section) cp866 MacUkrainian | |
120 | Arabic iso-8859-6 cp864 cp1256 MacArabic | |
121 | cp1006 MacFarsi | |
122 | Greek iso-8859-7 cp737 cp1253 MacGreek | |
123 | cp869 (DOSGreek2) | |
124 | Hebrew iso-8859-8 cp862 cp1255 MacHebrew | |
125 | Turkish iso-8859-9 cp857 cp1254 MacTurkish | |
126 | Nordics iso-8859-10 cp865 | |
127 | cp861 MacIcelandic | |
128 | MacSami | |
129 | Thai iso-8859-11 cp874 MacThai | |
130 | (iso-8859-12 is nonexistent. Reserved for Indics?) | |
131 | Baltics iso-8859-13 cp775 cp1257 | |
132 | Celtics iso-8859-14 | |
133 | Latin9(*15) iso-8859-15 | |
134 | Latin10 iso-8859-16 | |
135 | Vietnamese viscii cp1258 MacVietnamese | |
136 | ---------------------------------------------------------------- | |
137 | ||
138 | (*3) Esperanto, Maltese, and Turkish. Turkish is now on 8859-9 | |
139 | (*4) Baltics. Now on 8859-10 | |
140 | (*15) Nicknamed Latin0; Euro sign as well as French and Finnish | |
141 | letters that are missing from 8859-1 are added. | |
142 | ||
143 | All cp* are also available as ibm-*, ms-*, and windows-* . See also | |
144 | L<http://czyborra.com/charsets/codepages.html>. | |
145 | ||
146 | Macintosh encodings don't seem to be registered in such entities as | |
147 | IANA. "Canonical" names in Encode are based upon Apple's Tech Note | |
148 | 1150. See L<http://developer.apple.com/technotes/tn/tn1150.html> | |
149 | for details. | |
150 | ||
151 | =item KOI8 - De Facto Standard for Cyrillic world | |
152 | ||
153 | Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more | |
154 | popular on the Net. L<Encode> comes with the following KOI charsets. | |
155 | For gory details, see | |
156 | L<http://czyborra.com/charsets/cyrillic.html>. | |
5d030b67 | 157 | |
67d7b5ef | 158 | ---------------------------------------------------------------- |
67d7b5ef | 159 | koi8-f |
a999c27c | 160 | koi8-r cp878 [RFC1489] |
67d7b5ef | 161 | koi8-u [RFC2319] |
67d7b5ef | 162 | |
a999c27c JH |
163 | =item gsm0338 - Hentai Latin 1 |
164 | ||
165 | GSM0338 is for GSM handsets. Though it shares alphanumerics with ASCII, | |
166 | control character ranges and other parts are mapped very differently, | |
f2a2953c JH |
167 | presumably to store Greek and Cyrillic alphabets. This one is also |
168 | covered in Encode::Byte even though it does not comply with extended | |
169 | ASCII. | |
a999c27c JH |
170 | |
171 | =back | |
5d030b67 | 172 | |
5129552c | 173 | =head2 The CJK: Chinese, Japanese, Korean (Multibyte) |
5d030b67 JH |
174 | |
175 | Note Vietnamese is listed above. Also read "Encoding vs Charset" | |
a63c962f JH |
176 | below. Also note these are implemented in distinct modules by |
177 | language, due to size concerns. Please also refer to their | |
178 | respective documentation pages. | |
5d030b67 | 179 | |
5129552c JH |
180 | =over 4 |
181 | ||
182 | =item Encode::CN -- Continental China | |
183 | ||
f2a2953c | 184 | Standard DOS/Win Macintosh Comment/Reference |
67d7b5ef | 185 | ---------------------------------------------------------------- |
f2a2953c JH |
186 | euc-cn(*1) MacChineseSimp |
187 | (gbk) cp936 (*2) | |
188 | gb12345-raw { GB12345 without CES } | |
189 | gb2312-raw { GB2312 without CES } | |
5129552c JH |
190 | hz |
191 | iso-ir-165 | |
67d7b5ef | 192 | ---------------------------------------------------------------- |
5129552c | 193 | |
f2a2953c JH |
194 | (*1) GB2312 is aliased to this. See L<Microsoft-related naming mess> |
195 | (*2) gbk is aliased to this. See L<Microsoft-related naming mess> | |
196 | ||
5129552c JH |
197 | =item Encode::JP -- Japan |
198 | ||
f2a2953c | 199 | Standard DOS/Win Macintosh Comment/Reference |
67d7b5ef | 200 | ---------------------------------------------------------------- |
a999c27c JH |
201 | euc-jp |
202 | shiftjis cp932 macJapanese | |
f2a2953c JH |
203 | 7bit-jis |
204 | euc-jp | |
205 | iso-2022-jp [RFC1468] | |
206 | iso-2022-jp-1 [RFC2237] | |
207 | jis0201-raw { JIS X 0201 (roman + halfwidth kana) without CES } | |
208 | jis0208-raw { JIS X 0208 (Kanji + fullwidth kana) without CES } | |
209 | jis0212-raw { JIS X 0212 (Extended Kanji) without CES } | |
67d7b5ef | 210 | ---------------------------------------------------------------- |
5129552c JH |
211 | |
212 | =item Encode::KR -- Korea | |
213 | ||
f2a2953c | 214 | Standard DOS/Win Macintosh Comment/Reference |
67d7b5ef | 215 | ---------------------------------------------------------------- |
a999c27c | 216 | euc-kr MacKorean [RFC1557] |
f2a2953c | 217 | cp949 (*) |
a999c27c JH |
218 | iso-2022-kr [RFC1557] |
219 | johab [KS X 1001:1998, Annex 3] | |
f2a2953c | 220 | ksc5601-raw { KSC5601 without CES } |
67d7b5ef | 221 | ---------------------------------------------------------------- |
5129552c | 222 | |
f2a2953c JH |
223 | (*) ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to |
224 | this. See below. | |
225 | ||
226 | ||
5129552c JH |
227 | =item Encode::TW -- Taiwan |
228 | ||
f2a2953c | 229 | Standard DOS/Win Macintosh Comment/Reference |
67d7b5ef | 230 | ---------------------------------------------------------------- |
a999c27c | 231 | big5 cp950 MacChineseTrad |
5129552c | 232 | big5-hkscs |
67d7b5ef | 233 | ---------------------------------------------------------------- |
5129552c JH |
234 | |
235 | =item Encode::HanExtra -- More Chinese via CPAN | |
236 | ||
237 | Due to size concerns, additional Chinese encodings below are | |
238 | distributed separately on CPAN, under the name Encode::HanExtra. | |
239 | ||
f2a2953c | 240 | Standard DOS/Win Macintosh Comment/Reference |
67d7b5ef | 241 | ---------------------------------------------------------------- |
5129552c JH |
242 | gb18030 |
243 | euc-tw | |
244 | big5plus | |
67d7b5ef | 245 | ---------------------------------------------------------------- |
5129552c JH |
246 | |
247 | =back | |
248 | ||
249 | =head2 Miscellaneous encodings | |
250 | ||
251 | =over 4 | |
252 | ||
253 | =item Encode::EBCDIC | |
5d030b67 | 254 | |
a999c27c | 255 | See L<perlebcdic> for details. |
5d030b67 | 256 | |
67d7b5ef | 257 | ---------------------------------------------------------------- |
5d030b67 | 258 | cp37 |
a999c27c JH |
259 | cp500 |
260 | cp875 | |
261 | cp1026 | |
262 | cp1047 | |
5d030b67 | 263 | posix-bc |
67d7b5ef | 264 | ---------------------------------------------------------------- |
5129552c | 265 | |
a63c962f | 266 | =item Encode::Symbol |
5d030b67 | 267 | |
5129552c | 268 | For symbols and dingbats. |
5d030b67 | 269 | |
67d7b5ef | 270 | ---------------------------------------------------------------- |
5d030b67 JH |
271 | symbol |
272 | dingbats | |
a999c27c JH |
273 | MacDingbats |
274 | AdobeZdingbat | |
275 | AdobeSymbol | |
67d7b5ef JH |
276 | ---------------------------------------------------------------- |
277 | ||
278 | =back | |
279 | ||
280 | =head1 Unsupported encodings | |
281 | ||
282 | The following are not supported as yet. Some because they are rarely | |
283 | used, some because of technical difficulty. They may be supported by | |
284 | external modules via CPAN in the future, however. | |
285 | ||
286 | =over 4 | |
287 | ||
288 | =item ISO-2022-JP-2 [RFC1554] | |
289 | ||
290 | Not very popular yet. Needs the Unicode Database or equivalent to | |
291 | implement encode() (because it includes JIS X 0208/0212, KSC5601, and | |
292 | GB2312 simultaneously, whose code points in Unicode overlap, so you | |
293 | need to look up the database to determine which character set a given | |
294 | Unicode character should belong to). | |
295 | ||
296 | =item ISO-2022-CN [RFC1922] | |
297 | ||
298 | Not very popular. Needs CNS 11643-1 and 2 which are not available in | |
299 | this module. CNS 11643 is supported (via euc-tw) in | |
300 | Encode::HanExtra. Autrijus may add support for this encoding in his | |
301 | module in the future. | |
302 | ||
303 | =item various HP-UX encodings | |
304 | ||
305 | The following are unsupported due to the lack of mapping data. | |
306 | ||
307 | '8' - arabic8, greek8, hebrew8, kana8, thai8, and turkish8 | |
308 | '15' - japanese15, korean15, and roi15 | |
309 | ||
310 | =item Cyrillic encoding ISO-IR-111 | |
311 | ||
312 | Anton doubts its usefulness. | |
313 | ||
314 | =item ISO-8859-8-1 [Hebrew] | |
315 | ||
a999c27c JH |
316 | None of the Encode team knows Hebrew well enough (ISO-8859-8, cp1255 and |
317 | MacHebrew are supported only because there were mappings | |
318 | available at L<http://www.unicode.org/>). Contributions welcome. | |
67d7b5ef JH |
319 | |
320 | =item Thai encoding TCVN | |
321 | ||
322 | Ditto. | |
323 | ||
324 | =item Vietnamese encodings VPS | |
325 | ||
a999c27c JH |
326 | Though Jungshik has reported that Mozilla supports this encoding, it was too late for us to add it. It may be added in the future via a separate module. See |
327 | L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf> and | |
328 | L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut> | |
329 | if you are interested in helping us. | |
67d7b5ef JH |
330 | |
331 | =item various Mac encodings | |
332 | ||
a999c27c JH |
333 | The following are unsupported due to the lack of mapping data. |
334 | ||
335 | MacArmenian, MacBengali, MacBurmese, MacEthiopic | |
336 | MacExtArabic, MacGeorgian, MacKannada, MacKhmer | |
337 | MacLaotian, MacMalayalam, MacMongolian, MacOriya | |
338 | MacSinhalese, MacTamil, MacTelugu, MacTibetan | |
339 | MacVietnamese | |
340 | ||
341 | The rest, which are already available, are based upon the vendor mappings at | |
342 | L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/> . | |
343 | ||
344 | =item (Mac) Indic encodings | |
345 | ||
346 | The maps for the following are available at L<http://www.unicode.org/> | |
347 | but they remain unsupported because those encodings need an algorithmic | |
348 | approach, which F<enc2xs> does not support. | |
67d7b5ef | 349 | |
a999c27c JH |
350 | MacDevanagari |
351 | MacGurmukhi | |
352 | MacGujarati | |
67d7b5ef | 353 | |
a999c27c JH |
354 | For details, please see C<Unicode mapping issues and notes:> at |
355 | L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT> . | |
356 | ||
357 | I believe this issue is prevalent not only for Mac Indics but also in | |
358 | other Indic encodings, but those mentioned above were the only Indic encoding | |
359 | maps that I could find at L<http://www.unicode.org/> . | |
5129552c JH |
360 | |
361 | =back | |
5d030b67 | 362 | |
a999c27c | 363 | =head1 Encoding vs. Charset -- terminology |
5d030b67 | 364 | |
a999c27c JH |
365 | We are used to using the terms (character) I<encoding> and I<character set> |
366 | interchangeably. But just as confusing the terms byte and character is | |
367 | dangerous and the two should be differentiated when needed, we need to | |
368 | differentiate I<encoding> and I<character set>. | |
5d030b67 | 369 | |
f2a2953c | 370 | To understand that, let's follow how we make computers grok our characters. |
a999c27c JH |
371 | |
372 | =over 4 | |
373 | ||
374 | =item * | |
67d7b5ef | 375 | |
a999c27c JH |
376 | First we start with which characters to include. We call this |
377 | collection of characters a I<character repertoire>. | |
5d030b67 | 378 | |
a999c27c | 379 | =item * |
5d030b67 | 380 | |
a999c27c JH |
381 | Then we have to give each character a unique ID so your computer can |
382 | tell the difference between 'a' and 'A'. This itemized character | |
383 | repertoire is now a I<character set>. | |
a63c962f | 384 | |
a999c27c JH |
385 | =item * |
386 | ||
387 | If your computer can grok the character set without further | |
388 | processing, you can go ahead and use it. This is called a I<coded | |
389 | character set> (CCS) or I<raw character encoding>. ASCII is used this | |
390 | way for most cases. | |
391 | ||
392 | =item * | |
393 | ||
394 | But in many cases, especially with multi-byte CJK encodings, you have to | |
395 | tweak a little more. Your network connection may not accept any data | |
396 | with the Most Significant Bit set, or your computer may not be able to | |
397 | tell if a given byte is a whole character or just half of it. So you | |
398 | have to I<encode> the character set to use it. | |
399 | ||
400 | A I<character encoding scheme> (CES) determines how to encode a given | |
401 | character set, or a set of multiple character sets. 7bit ISO-2022 is | |
402 | an example of a CES. You switch between character sets via I<escape | |
403 | sequences>. | |
67d7b5ef JH |
404 | |
405 | =back | |
406 | ||
a999c27c JH |
407 | Technically, or mathematically speaking, a character set encoded in |
408 | such a CES that maps character by character may form a CCS. EUC is such | |
409 | an example. The CES of EUC is as follows: | |
67d7b5ef | 410 | |
a999c27c | 411 | =over 4 |
5d030b67 | 412 | |
a999c27c | 413 | =item * |
5d030b67 | 414 | |
a999c27c JH |
415 | Map ASCII unchanged. |
416 | ||
417 | =item * | |
418 | ||
419 | Map a character set that consists of 94 or 96 to the power of N | |
420 | members by adding 0x80 to each byte. | |
421 | ||
422 | =item * | |
423 | ||
424 | You can also use 0x8e and 0x8f to indicate that the following sequence of | |
425 | characters belongs to yet another character set. 0x80 is added to each | |
426 | following byte. | |
427 | ||
428 | =back | |
429 | ||
430 | By carefully looking at the encoded byte sequence, you may find that the | |
431 | byte sequence forms a unique number. In that sense EUC is a CCS | |
432 | generated by the CES above from up to four CCSs (complicated?). UTF-8 | |
433 | falls into this category. See L<perlunicode/"UTF-8"> to find how | |
434 | UTF-8 maps Unicode to a byte sequence. | |
435 | ||
436 | You may also see by now why 7bit ISO-2022 cannot form a CCS. If | |
437 | you look at a byte sequence \x21\x21, you can't tell if it is two !'s | |
438 | or IDEOGRAPHIC SPACE. EUC maps the latter to \xA1\xA1 so you have no | |
439 | trouble telling "!!" from an IDEOGRAPHIC SPACE. | |
67d7b5ef | 440 | |
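The ambiguity described above can be observed directly. The sketch below, assuming the bundled Encode::JP encodings C<euc-jp> and C<iso-2022-jp>, encodes IDEOGRAPHIC SPACE (U+3000) and two exclamation marks in both and prints the resulting bytes as hex.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $space = "\x{3000}";    # IDEOGRAPHIC SPACE
my $bangs = "!!";

for my $enc (qw(euc-jp iso-2022-jp)) {
    # Print each encoded form as hex so escape sequences stay visible.
    printf "%-12s U+3000 => %s\n", $enc, unpack('H*', encode($enc, $space));
    printf "%-12s \"!!\"   => %s\n", $enc, unpack('H*', encode($enc, $bangs));
}
```

In C<euc-jp> the two inputs yield different byte pairs (a1a1 versus 2121). In C<iso-2022-jp> the same 2121 pair appears for both, and only the surrounding escape sequences say which character set is in effect.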
a63c962f JH |
441 | =head1 Encoding Classification (by Anton Tagunov and Dan Kogai) |
442 | ||
443 | This section tries to classify the supported encodings by their | |
444 | applicability for information exchange over the Internet and to | |
445 | choose the most suitable aliases to name them in the context of | |
446 | such communication. | |
447 | ||
67d7b5ef JH |
448 | =over 2 |
449 | ||
450 | =item * | |
451 | ||
f2a2953c | 452 | To (en|de)code encodings marked as C<(**)>, you need |
a999c27c | 453 | C<Encode::HanExtra>, available from CPAN. |
67d7b5ef JH |
454 | |
455 | =back | |
456 | ||
a63c962f | 457 | Encoding names |
5d030b67 | 458 | |
f2a2953c JH |
459 | US-ASCII UTF-8 ISO-8859-* KOI8-R |
460 | Shift_JIS EUC-JP ISO-2022-JP ISO-2022-JP-1 | |
461 | EUC-KR Big5 GB2312 | |
a999c27c JH |
462 | |
463 | are registered with IANA as preferred MIME names and may | |
464 | be used over the Internet. | |
5d030b67 | 465 | |
c731e18e | 466 | C<Shift_JIS> has been made official by JIS X 0208:1997. |
a999c27c | 467 | L<Microsoft-related naming mess> gives details. |
5d030b67 | 468 | |
a999c27c JH |
469 | C<GB2312> is the IANA name for C<EUC-CN>. |
470 | See L<Microsoft-related naming mess> for details. | |
471 | ||
472 | C<GB_2312-80> I<raw> encoding is available as C<gb2312-raw> | |
f2a2953c | 473 | with Encode. See L<Encode::CN> for details. |
5d030b67 | 474 | |
a63c962f | 475 | EUC-CN |
f2a2953c | 476 | KOI8-U [RFC2319] |
5d030b67 | 477 | |
a999c27c JH |
478 | have not been registered with IANA (as of March 2002) but |
479 | seem to be supported by major web browsers. | |
480 | The IANA name for C<EUC-CN> is C<GB2312>. | |
67d7b5ef JH |
481 | |
482 | KS_C_5601-1987 | |
483 | ||
a999c27c JH |
484 | is heavily misused. |
485 | See L<Microsoft-related naming mess> for details. | |
486 | ||
487 | C<KS_C_5601-1987> I<raw> encoding is available as C<ksc5601-raw> | |
f2a2953c JH |
488 | with Encode. See L<Encode::KR> for details. |
489 | ||
490 | UTF-16 UTF-16BE UTF-16LE | |
491 | ||
492 | are IANA-registered C<charset>s. See [RFC 2781] for details. | |
493 | Jungshik Shin reports that UTF-16 with a BOM is well accepted | |
494 | by MS IE 5/6 and NS 4/6. Beware however that | |
495 | ||
496 | =over 2 | |
497 | ||
498 | =item * | |
5d030b67 | 499 | |
f2a2953c JH |
500 | C<UTF-16> support in any software you're going to be |
501 | using/interoperating with has probably been less tested | |
502 | than C<UTF-8> support. | |
5d030b67 | 503 | |
f2a2953c JH |
504 | =item * |
505 | ||
c731e18e JH |
506 | C<UTF-8> coded data seamlessly passes traditional |
507 | command piping (C<cat>, C<more>, etc.) while C<UTF-16> coded | |
f2a2953c JH |
508 | data is likely to cause confusion (with its zero bytes, |
509 | for example) | |
510 | ||
511 | =item * | |
512 | ||
513 | It is beyond the power of words to describe the way HTML browsers | |
c731e18e | 514 | encode non-C<ASCII> form data. To get a general impression visit |
f2a2953c | 515 | L<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>. |
c731e18e | 516 | While encoding of form data has stabilized for C<UTF-8> coded pages |
f2a2953c JH |
517 | (at least IE 5/6, NS 6, Opera 6 behave consistently), be sure to |
518 | expect fun (and cross-browser discrepancies) with C<UTF-16> coded | |
519 | pages! | |
520 | ||
521 | =back | |
522 | ||
523 | The rule of thumb is to use C<UTF-8> unless you know what | |
c731e18e | 524 | you're doing and unless you really benefit from using C<UTF-16>. |
a999c27c | 525 | |
5d030b67 | 526 | |
f2a2953c | 527 | ISO-IR-165 [RFC1345] |
5d030b67 JH |
528 | GBK |
529 | VISCII | |
a63c962f | 530 | GB 12345 |
f2a2953c JH |
531 | GB 18030 (**) (see links below) |
532 | EUC-TW (**) | |
5d030b67 JH |
533 | |
534 | are totally valid encodings but not registered at IANA. | |
a63c962f JH |
535 | The names under which they are listed here are probably the |
536 | most widely-known names for these encodings and are recommended | |
537 | names. | |
538 | ||
f2a2953c | 539 | BIG5PLUS (**) |
a63c962f | 540 | |
67d7b5ef | 541 | is a somewhat proprietary name. |
5d030b67 | 542 | |
a999c27c JH |
543 | =head2 Microsoft-related naming mess |
544 | ||
545 | Microsoft products misuse the following names: | |
5d030b67 | 546 | |
67d7b5ef | 547 | =over 2 |
a63c962f | 548 | |
a999c27c | 549 | =item KS_C_5601-1987 |
5d030b67 | 550 | |
a999c27c | 551 | Microsoft extension to C<EUC-KR>. |
5d030b67 | 552 | |
c731e18e | 553 | Proper names: C<CP949>, C<UHC>, C<x-windows-949> (as used by Mozilla). |
67d7b5ef | 554 | |
f2a2953c | 555 | See L<http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html> |
a999c27c | 556 | for details. |
5d030b67 | 557 | |
f2a2953c JH |
558 | Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect this common |
559 | misuse. I<Raw> C<KS_C_5601-1987> encoding is available as | |
560 | C<ksc5601-raw>. | |
5d030b67 | 561 | |
f2a2953c | 562 | See L<Encode::KR> for details. |
67d7b5ef | 563 | |
a999c27c | 564 | =item GB2312 |
67d7b5ef | 565 | |
a999c27c | 566 | Microsoft extension to C<EUC-CN>. |
a63c962f | 567 | |
a999c27c | 568 | Proper names: C<CP936>, C<GBK>. |
a63c962f | 569 | |
a999c27c JH |
570 | C<GB2312> has been registered in the C<EUC-CN> meaning at |
571 | IANA. This has partially repaired the situation: Microsoft's | |
572 | C<GB2312> has become a superset of the official C<GB2312>. | |
67d7b5ef | 573 | |
a999c27c JH |
574 | Encode aliases C<GB2312> to C<euc-cn> in full agreement with |
575 | IANA registration. C<cp936> is supported separately. | |
f2a2953c | 576 | I<Raw> C<GB_2312-80> encoding is available as C<gb2312-raw>. |
a999c27c | 577 | |
f2a2953c | 578 | See L<Encode::CN> for details. |
a999c27c JH |
579 | |
580 | =item Big5 | |
581 | ||
582 | Microsoft extension to C<Big5>. | |
583 | ||
584 | Proper name: C<CP950>. | |
585 | ||
586 | Encode separately supports C<Big5> and C<cp950>. | |
587 | ||
588 | =item Shift_JIS | |
589 | ||
590 | Microsoft's understanding of C<Shift_JIS>. | |
591 | ||
592 | JIS has not endorsed the full Microsoft standard however. | |
593 | The official C<Shift_JIS> includes only JIS X 0201 and JIS X 0208 | |
594 | subsets, while Microsoft has always used C<Shift_JIS> to mean an | |
c731e18e JH |
595 | encoding of a wider character repertoire; see the C<IANA> registration for |
596 | C<Windows-31J>. | |
a999c27c JH |
597 | |
598 | As a historical predecessor, Microsoft's variant | |
599 | probably has more right to the name, though it may be objected | |
600 | that Microsoft shouldn't have used JIS as part of the name | |
601 | in the first place. | |
602 | ||
c731e18e | 603 | Unambiguous name: C<CP932>. C<IANA> name (not used?): C<Windows-31J>. |
a999c27c JH |
604 | |
605 | Encode separately supports C<Shift_JIS> and C<cp932>. | |
606 | ||
607 | =back | |
608 | ||
609 | =head1 Glossary | |
610 | ||
611 | =over 2 | |
612 | ||
613 | =item character repertoire | |
614 | ||
615 | A collection of unique characters. A I<character> set in the | |
616 | strictest sense. At this stage characters are not numbered. | |
617 | ||
618 | =item coded character set (CCS) | |
619 | ||
620 | A character set that is mapped in a way computers can use directly. | |
621 | Many character encodings, including EUC, fall into this category. | |
622 | ||
623 | =item character encoding scheme (CES) | |
624 | ||
625 | An algorithm to map a character set to a byte sequence. You don't | |
626 | have to be able to tell which character set a given byte sequence | |
627 | belongs to. 7-bit ISO-2022 is a CES but it cannot be a CCS. EUC is an | |
628 | example of being both a CCS and CES. | |
629 | ||
f2a2953c JH |
630 | =item charset (in MIME context) |
631 | ||
632 | Has long been used to mean C<encoding>, that is, a CES. | |
633 | ||
634 | While the word combination C<character set> has lost this meaning | |
635 | in MIME context since [RFC 2130], the abbreviation C<charset> has | |
636 | retained it. This is how [RFC 2277], [RFC 2278] bless C<charset>: | |
637 | ||
638 | ||
639 | This document uses the term "charset" to mean a set of rules for | |
640 | mapping from a sequence of octets to a sequence of characters, such | |
641 | as the combination of a coded character set and a character encoding | |
642 | scheme; this is also what is used as an identifier in MIME "charset=" | |
643 | parameters, and registered in the IANA charset registry ... (Note | |
644 | that this is NOT a term used by other standards bodies, such as ISO). | |
645 | [RFC 2277] | |
646 | ||
a999c27c JH |
647 | =item EUC |
648 | ||
649 | Extended Unix Code. See ISO-2022. | |
650 | ||
651 | =item ISO-2022 | |
652 | ||
653 | A CES that was carefully designed to coexist with ASCII. There are a 7-bit | |
f2a2953c JH |
654 | version and an 8-bit version. |
655 | ||
656 | The 7-bit version switches character sets via escape sequences, so it | |
657 | cannot form a CCS. Since this is more difficult to handle in programs | |
658 | than the 8-bit version, the 7-bit version is not very popular except for | |
659 | iso-2022-jp, the de facto standard CES for e-mails. | |
660 | ||
661 | The 8-bit version can form a CCS. EUC and ISO-8859 are two examples | |
662 | thereof. Pre-5.6 perl could use them as string literals. | |
a999c27c JH |
663 | |
664 | =item UCS | |
665 | ||
666 | Short for I<Universal Character Set>. When you say just UCS, it means | |
667 | I<Unicode>. | |
668 | ||
669 | =item UCS-2 | |
670 | ||
671 | ISO/IEC 10646 encoding form: Universal Character Set coded in two | |
672 | octets. | |
673 | ||
674 | =item Unicode | |
675 | ||
f2a2953c JH |
676 | A character set that aims to include all character repertoires of the |
677 | world. Many character sets in various national as well as industrial | |
678 | standards have become, in a way, just subsets of Unicode. | |
a999c27c JH |
679 | |
680 | =item UTF | |
681 | ||
f2a2953c | 682 | Short for I<Unicode Transformation Format>. Determines how to map a |
a999c27c JH |
683 | Unicode character into a byte sequence. |
684 | ||
685 | =item UTF-16 | |
686 | ||
687 | A UTF in 16-bit encoding. Can be either big endian or little | |
f2a2953c JH |
688 | endian. The big endian version is called UTF-16BE (equal to UCS-2 + |
689 | surrogate support) and the little endian version is UTF-16LE. | |
67d7b5ef JH |
690 | |
691 | =back | |
5d030b67 JH |
692 | |
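To illustrate the UTF-16 glossary entry above, here is a small sketch that encodes a single character with each variant. It assumes the behavior documented in L<Encode::Unicode>, namely that plain C<UTF-16> on encode produces big-endian data with a byte order mark prepended.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "A";    # U+0041

# Each 16-bit code unit is written in the byte order the name specifies.
for my $enc (qw(UTF-16BE UTF-16LE UTF-16)) {
    printf "%-8s => %s\n", $enc, unpack('H*', encode($enc, $text));
}
```

Every variant contains a zero byte for this ASCII character, which is why UTF-16 data tends to confuse tools that expect byte-oriented text.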
693 | =head1 See Also | |
694 | ||
5129552c JH |
695 | L<Encode>, |
696 | L<Encode::Byte>, | |
a63c962f | 697 | L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>, |
5129552c | 698 | L<Encode::EBCDIC>, L<Encode::Symbol> |
5d030b67 | 699 | |
a999c27c JH |
700 | =head1 References |
701 | ||
702 | =over 2 | |
703 | ||
704 | =item ECMA | |
705 | ||
706 | European Computer Manufacturers Association | |
707 | L<http://www.ecma.ch> | |
708 | ||
709 | =over 2 | |
710 | ||
711 | =item ECMA-035 (eq C<ISO-2022>) | |
712 | ||
713 | L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM> | |
714 | ||
715 | The very specification of ISO-2022 is available from the link above. | |
716 | ||
717 | =back | |
718 | ||
719 | =item IANA | |
720 | ||
721 | Internet Assigned Numbers Authority | |
722 | L<http://www.iana.org/> | |
723 | ||
724 | =over 2 | |
725 | ||
726 | =item Assigned Charset Names by IANA | |
727 | ||
728 | L<http://www.iana.org/assignments/character-sets> | |
729 | ||
730 | Most of the C<canonical names> in Encode derive from this list | |
731 | so you can directly apply the string you have extracted from the MIME | |
732 | headers of mails and web pages. | |
733 | ||
734 | =back | |
735 | ||
736 | =item ISO | |
737 | ||
738 | International Organization for Standardization | |
739 | L<http://www.iso.ch/> | |
740 | ||
741 | =item RFC | |
742 | ||
743 | Request For Comment -- need I say more? | |
f2a2953c | 744 | L<http://www.rfc.net/>, L<http://www.faqs.org/rfcs/> |
a999c27c JH |
745 | |
746 | =item UC | |
747 | ||
748 | Unicode Consortium | |
749 | L<http://www.unicode.org/> | |
750 | ||
751 | =over 2 | |
752 | ||
753 | =item Unicode Glossary | |
754 | ||
755 | L<http://www.unicode.org/glossary/> | |
756 | ||
757 | The glossary of this document is based upon this site. | |
758 | ||
759 | =back | |
760 | ||
761 | =back | |
762 | ||
763 | =head2 Other Notable Sites | |
764 | ||
765 | =over 2 | |
766 | ||
767 | =item czyborra.com | |
768 | ||
f2a2953c | 769 | L<http://czyborra.com/> |
a999c27c JH |
770 | |
771 | Contains a lot of useful information, especially gory details of ISO | |
772 | vs. vendor mappings. | |
773 | ||
774 | =item CJK.inf | |
775 | ||
776 | L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html> | |
777 | ||
778 | Somewhat obsolete (last update in 1996), but still useful. Also try | |
779 | ||
780 | L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf> | |
781 | ||
782 | You will find brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030> | |
783 | ||
f2a2953c JH |
784 | =item Jungshik Shin's Hangul FAQ |
785 | ||
786 | L<http://jshin.net/faq> | |
787 | ||
788 | And especially its subject 8 | |
789 | ||
790 | L<http://jshin.net/faq/qa8.html> | |
791 | ||
792 | a comprehensive overview of the Korean (C<KS *>) standards. | |
793 | ||
794 | =back | |
795 | ||
796 | =head2 Offline sources | |
797 | ||
798 | =over 2 | |
799 | ||
800 | =item C<CJKV Information Processing> by Ken Lunde | |
801 | ||
802 | CJKV Information Processing | |
803 | 1999 O'Reilly & Associates, ISBN : 1-56592-224-7 | |
804 | ||
805 | The modern successor of the C<CJK.inf>. | |
806 | ||
807 | Features a comprehensive coverage on CJKV character sets and | |
808 | encodings along with many other issues faced by anyone trying | |
809 | to better support CJKV languages/scripts in all the areas of | |
810 | information processing. | |
811 | ||
812 | To purchase this book visit | |
813 | L<http://www.oreilly.com/catalog/cjkvinfo/> | |
814 | ||
a999c27c JH |
815 | =back |
816 | ||
5d030b67 | 817 | =cut |
67d7b5ef JH |
818 | |
819 | I could not find this page because the hostname doesn't resolve! | |
820 | ||
821 | Brief description for most of the mentioned CJK encodings | |
822 | L<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html> |