Upgrade to Encode 1.30, from Dan Kogai.
=head1 NAME

Encode::Supported -- Encodings supported by Encode

=head1 DESCRIPTION

=head2 Encoding Names

Encoding names are case insensitive.  White space in names
is ignored.  In addition, an encoding may have aliases.
Each encoding has one "canonical" name.  The "canonical"
name is chosen from the names of the encoding by picking
the first in the following sequence (with a few exceptions).

=over

=item *

The name used by the Perl community.  That includes 'utf8' and 'ascii'.
Unlike aliases, canonical names reach the method directly, so such
frequently used names as 'utf8' do not need an alias lookup.

=item *

The MIME name as defined in IETF RFCs.  This includes all "iso-"s.

=item *

The name in the IANA registry.

=item *

The name used by the organization that defined it.

=back

In case the I<de jure> canonical name differs from that of the Encode
module, it is always aliased whenever the encoding is implemented.  So
you can safely tell whether a given encoding is implemented just by
passing the canonical name.

Because of all the alias issues, and because in the general case
encodings have state, "Encode" uses the encoding object internally
once an operation is in progress.
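
The lookup rules above can be tried out from code.  The following is a
minimal sketch, assuming a reasonably recent Encode; C<find_encoding>
is part of Encode's documented interface, while the helper name
C<canonical_name> is made up for this example.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: return the canonical name for an encoding
# name or alias, or undef if Encode does not know the encoding.
sub canonical_name {
    my ($name) = @_;
    my $enc = find_encoding($name);    # encoding object, or undef
    return $enc ? $enc->name : undef;
}

for my $name (qw(latin1 Latin1 US-ASCII utf8 no-such-encoding)) {
    my $canon = canonical_name($name);
    printf "%-18s => %s\n", $name,
        defined $canon ? $canon : '(not implemented)';
}
```

An undefined result means the name is neither a canonical name nor an
alias that Encode can resolve.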

=head1 Supported Encodings

As of Perl 5.8.0, at least the following encodings are recognized.
Note that unless otherwise specified, they are all case insensitive
(via alias) and all occurrences of spaces are replaced with '-'.  In
other words, "ISO 8859 1" and "iso-8859-1" are identical.

Encodings are categorized and implemented in several different modules,
but in most cases you don't have to C<use Encode::XX> to make them
available.  Encode.pm will automatically load those modules on demand.
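
As a rough illustration of the on-demand loading, the sketch below
lists what Encode knows about and then uses an encoding from
Encode::JP without loading that module explicitly.  It assumes a stock
Encode installation; the variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);    # note: no "use Encode::JP" anywhere

# Encodings whose implementing modules are already loaded.
my @loaded = Encode->encodings();

# Every encoding Encode knows how to load on demand.
my @all = Encode->encodings(':all');

printf "%d loaded, %d available in total\n", scalar @loaded, scalar @all;

# euc-jp lives in Encode::JP; Encode loads that module when needed.
# U+3042 is HIRAGANA LETTER A.
my $octets = encode('euc-jp', "\x{3042}");
print "euc-jp octets: ", unpack('H*', $octets), "\n";
```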

=head2 Built-in Encodings

The following encodings are always available.

  Canonical     Aliases                      Comments & References
  ----------------------------------------------------------------
  ascii         US-ascii                     [ECMA]
  iso-8859-1    latin1                       [ISO]
  utf8          UTF-8                        [RFC2279]
  ----------------------------------------------------------------

=head2 Encode::Unicode -- other Unicode encodings

Unicode coding schemes other than native utf8 are supported by
Encode::Unicode, which will be autoloaded on demand.

  ----------------------------------------------------------------
  UCS-2BE       UCS-2, iso-10646-1           [IANA, UC]
  UCS-2LE                                    [UC]
  UTF-16                                     [UC]
  UTF-16BE                                   [UC]
  UTF-16LE                                   [UC]
  UTF-32                                     [UC]
  UTF-32BE                                   [UC]
  UTF-32LE                                   [UC]
  ----------------------------------------------------------------

To find out how those (UCS-2|UTF-(16|32))(LE|BE)? differ from one
another, see L<Encode::Unicode>.
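
The difference is easiest to see in the octets themselves.  This is a
small sketch, not a replacement for L<Encode::Unicode>; it encodes the
same one-character string with several of the names listed above.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "A";    # U+0041

# Print the octets each encoding produces for the same string.
for my $name (qw(UTF-16BE UTF-16LE UTF-16 UTF-32BE UTF-32LE)) {
    my $hex = join ' ',
        map { sprintf '%02X', ord } split //, encode($name, $text);
    printf "%-9s %s\n", $name, $hex;
}
```

The BE and LE variants differ only in byte order, while plain
C<UTF-16> prepends a byte order mark when encoding.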

=head2 Encode::Byte -- Extended ASCII

Encode::Byte implements most single-byte encodings except for
Symbols and EBCDIC.  The following encodings are single-byte
encodings implemented as extended ASCII.  In most cases they use
\x80-\xff (the upper half) to map non-ASCII characters.
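
A brief sketch of what "extended ASCII" means in practice, assuming
the three encodings named here are available in your installation:
ASCII text passes through unchanged, and the sample non-ASCII
characters each become a single octet in the upper half.

```perl
use strict;
use warnings;
use Encode qw(encode);

# ASCII text comes out unchanged in every extended-ASCII encoding.
for my $name (qw(iso-8859-1 cp1252 koi8-r)) {
    print "$name keeps ASCII as is\n" if encode($name, 'Perl') eq 'Perl';
}

# Non-ASCII characters land in the upper half (0x80-0xFF).
my %sample = (
    'iso-8859-1' => "\x{00E9}",    # LATIN SMALL LETTER E WITH ACUTE
    'cp1252'     => "\x{20AC}",    # EURO SIGN
    'koi8-r'     => "\x{0410}",    # CYRILLIC CAPITAL LETTER A
);
for my $name (sort keys %sample) {
    my $octet = encode($name, $sample{$name});
    printf "%-10s 0x%02X\n", $name, ord $octet;
}
```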

=over 2

=item ISO-8859 and corresponding vendor mappings

Since there are so many, they are presented in table format with
languages and corresponding encoding names by vendors.  Note that the
table is sorted in order of ISO-8859 and that the corresponding vendor
mappings are slightly different from those of ISO.  See
L<http://czyborra.com/charsets/iso8859.html> for details.

  Lang/Regions  ISO/Other Std.  DOS     Windows Macintosh  Others
  ----------------------------------------------------------------
  N. America    (ASCII)         cp437        AdobeStandardEncoding
                                cp863 (DOSCanadaF)
  W. Europe     (iso-8859-1)    cp850   cp1252  MacRoman  nextstep
                                                         hp-roman8
                                cp860 (DOSPortuguese)
  CE. Europe    iso-8859-2      cp852   cp1250  MacCentralEurRoman
                                                MacCroatian
                                                MacRomanian
                                                MacRumanian
  Latin3(*3)    iso-8859-3
  Latin4(*4)    iso-8859-4
  Cyrillics     iso-8859-5      cp855   cp1251  MacCyrillic
    (Also see next section)     cp866           MacUkrainian
  Arabic        iso-8859-6      cp864   cp1256  MacArabic
                                cp1006          MacFarsi
  Greek         iso-8859-7      cp737   cp1253  MacGreek
                                cp869 (DOSGreek2)
  Hebrew        iso-8859-8      cp862   cp1255  MacHebrew
  Turkish       iso-8859-9      cp857   cp1254  MacTurkish
  Nordics       iso-8859-10     cp865
                                cp861           MacIcelandic
                                                MacSami
  Thai          iso-8859-11     cp874           MacThai
  (iso-8859-12 is nonexistent. Reserved for Indics?)
  Baltics       iso-8859-13     cp775   cp1257
  Celtics       iso-8859-14
  Latin9(*15)   iso-8859-15
  Latin10       iso-8859-16
  Vietnamese    viscii                  cp1258  MacVietnamese
  ----------------------------------------------------------------

  (*3)  Esperanto, Maltese, and Turkish.  Turkish is now on 8859-9.
  (*4)  Baltics.  Now on 8859-10.
  (*15) Nicknamed Latin0; the Euro sign as well as French and Finnish
        letters that are missing from 8859-1 were added.

All cp* are also available as ibm-*, ms-*, and windows-*.  See also
L<http://czyborra.com/charsets/codepages.html>.

Macintosh encodings don't seem to be registered in such entities as
IANA.  "Canonical" names in Encode are based upon Apple's Tech Note
1150.  See L<http://developer.apple.com/technotes/tn/tn1150.html>
for details.

=item KOI8 - De Facto Standard for Cyrillic world

Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more
popular on the Net.  L<Encode> comes with the following KOI charsets.
For gory details, see L<http://czyborra.com/charsets/cyrillic.html>.

  ----------------------------------------------------------------
  koi8-f
  koi8-r        cp878                        [RFC1489]
  koi8-u                                     [RFC2319]

=item gsm0338 - Hentai Latin 1

GSM0338 is for GSM handsets.  Though it shares alphanumerics with
ASCII, control character ranges and other parts are mapped very
differently, presumably to store Greek letters.  This one is also
covered in Encode::Byte even though it does not comply with extended
ASCII.

=back

=head2 The CJK: Chinese, Japanese, Korean (Multibyte)

Note that Vietnamese is listed above.  Also read "Encoding vs. Charset"
below.  Also note that these are implemented in distinct modules by
language, due to size concerns.  Please also refer to their
respective documentation pages.

=over 4

=item Encode::CN -- Continental China

  Standard      DOS/Win Macintosh                Comment/Reference
  ----------------------------------------------------------------
  euc-cn(*1)            MacChineseSimp
  (gbk)         cp936(*2)
  gb12345-raw                      { GB12345 without CES }
  gb2312-raw                       { GB2312  without CES }
  hz
  iso-ir-165
  ----------------------------------------------------------------

  (*1) GB2312 is aliased to this.  See "Microsoft-related naming mess".
  (*2) gbk is aliased to this.  See "Microsoft-related naming mess".

=item Encode::JP -- Japan

  Standard      DOS/Win Macintosh                Comment/Reference
  ----------------------------------------------------------------
  euc-jp
  shiftjis      cp932   macJapanese
  7bit-jis
  iso-2022-jp                                    [RFC1468]
  iso-2022-jp-1                                  [RFC2237]
  jis0201-raw  { JIS X 0201 (roman + halfwidth kana) without CES }
  jis0208-raw  { JIS X 0208 (Kanji + fullwidth kana) without CES }
  jis0212-raw  { JIS X 0212 (Extended Kanji)         without CES }
  ----------------------------------------------------------------

=item Encode::KR -- Korea

  Standard      DOS/Win Macintosh                Comment/Reference
  ----------------------------------------------------------------
  euc-kr                MacKorean                [RFC1557]
                cp949(*)
  iso-2022-kr                                    [RFC1557]
  johab                          [KS X 1001:1998, Annex 3]
  ksc5601-raw                      { KSC5601 without CES }
  ----------------------------------------------------------------

  (*) ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to
      this.  See below.

=item Encode::TW -- Taiwan

  Standard      DOS/Win Macintosh                Comment/Reference
  ----------------------------------------------------------------
  big5          cp950   MacChineseTrad
  big5-hkscs
  ----------------------------------------------------------------

=item Encode::HanExtra -- More Chinese via CPAN

Due to size concerns, the additional Chinese encodings below are
distributed separately on CPAN, under the name Encode::HanExtra.

  Standard      DOS/Win Macintosh                Comment/Reference
  ----------------------------------------------------------------
  gb18030
  euc-tw
  big5plus
  ----------------------------------------------------------------

=back

=head2 Miscellaneous encodings

=over 4

=item Encode::EBCDIC

See L<perlebcdic> for details.

  ----------------------------------------------------------------
  cp37
  cp500
  cp875
  cp1026
  cp1047
  posix-bc
  ----------------------------------------------------------------

=item Encode::Symbol

For symbols and dingbats.

  ----------------------------------------------------------------
  symbol
  dingbats
  MacDingbats
  AdobeZdingbat
  AdobeSymbol
  ----------------------------------------------------------------

=back

=head1 Unsupported encodings

The following are not supported as yet.  Some because they are rarely
used, some because of technical difficulty.  They may be supported by
external modules via CPAN in the future, however.

=over 4

=item ISO-2022-JP-2 [RFC1554]

Not very popular yet.  Needs the Unicode Database or equivalent to
implement encode() (because it includes JIS X 0208/0212, KSC5601, and
GB2312 simultaneously, whose code points in Unicode overlap.  So you
need to look up the database to determine to which character set a
given Unicode character should belong).

=item ISO-2022-CN [RFC1922]

Not very popular.  Needs CNS 11643-1 and 2, which are not available in
this module.  CNS 11643 is supported (via euc-tw) in
Encode::HanExtra.  Autrijus may add support for this encoding in his
module in the future.

=item various HP-UX encodings

The following are unsupported due to the lack of mapping data.

  '8'  - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
  '15' - japanese15, korean15, and roi15

=item Cyrillic encoding ISO-IR-111

Anton doubts its usefulness.

=item ISO-8859-8-I [Hebrew]

None of the Encode team knows Hebrew well enough (ISO-8859-8, cp1255
and MacHebrew are supported only because there were mappings
available at L<http://www.unicode.org/>).  Contributions welcome.

=item Thai encoding TCVN

Ditto.

=item Vietnamese encodings VPS

Though Jungshik has reported that Mozilla supports this encoding, it
was too late for us to add it.  It may be supported in the future via
a separate module.  See
L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf> and
L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut>
if you are interested in helping us.

=item various Mac encodings

The following are unsupported due to the lack of mapping data.

  MacArmenian,  MacBengali,   MacBurmese,   MacEthiopic
  MacExtArabic, MacGeorgian,  MacKannada,   MacKhmer
  MacLaotian,   MacMalayalam, MacMongolian, MacOriya
  MacSinhalese, MacTamil,     MacTelugu,    MacTibetan
  MacVietnamese

The rest, which are already available, are based upon the vendor
mappings at L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/> .

=item (Mac) Indic encodings

The maps for the following are available at L<http://www.unicode.org/>
but remain unsupported because those encodings need an algorithmic
approach, which F<enc2xs> does not support.

  MacDevanagari
  MacGurmukhi
  MacGujarati

For details, please see C<Unicode mapping issues and notes:> at
L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT> .

I believe this issue is prevalent not only for Mac Indics but also in
other Indic encodings, but the above were the only Indic encoding
maps that I could find at L<http://www.unicode.org/> .

=back

=head1 Encoding vs. Charset -- terminology

We are used to using the terms (character) I<encoding> and I<character
set> interchangeably.  But just as confusing the terms byte and
character is dangerous and the two should be differentiated when
needed, we need to differentiate I<encoding> and I<character set>.

To understand that, let's follow how we make computers grok our
characters.

=over 4

=item *

First we start with which characters to include.  We call this
collection of characters a I<character repertoire>.

=item *

Then we have to give each character a unique ID so your computer can
tell the difference between 'a' and 'A'.  This itemized character
repertoire is now a I<character set>.

=item *

If your computer can grok the character set without further
processing, you can go ahead and use it.  This is called a I<coded
character set> (CCS) or I<raw character encoding>.  ASCII is used this
way in most cases.

=item *

But in many cases, especially multi-byte CJK encodings, you have to
tweak a little more.  Your network connection may not accept any data
with the Most Significant Bit set, and your computer may not be able
to tell if a given byte is a whole character or just half of it.  So
you have to I<encode> the character set to use it.

A I<character encoding scheme> (CES) determines how to encode a given
character set, or a set of multiple character sets.  7bit ISO-2022 is
an example of a CES.  You switch between character sets via I<escape
sequences>.

=back

Technically, or mathematically speaking, a character set encoded in
such a CES that maps character by character may form a CCS.  EUC is
such an example.  The CES of EUC is as follows:

=over 4

=item *

Map ASCII unchanged.

=item *

Map such a character set that consists of 94 or 96 to the power of N
members by adding 0x80 to each byte.

=item *

You can also use 0x8e and 0x8f to indicate that the following sequence
of characters belongs to yet another character set.  0x80 is added to
each following byte.

=back

By carefully looking at the encoded byte sequence, you may find that
the byte sequence conforms to a unique number.  In that sense, EUC is
a CCS generated by the CES above from up to four CCSs (complicated?).
UTF-8 falls into this category.  See L<perlunicode/"UTF-8"> to find
out how UTF-8 maps Unicode to a byte sequence.

You may also have found out by now why 7bit ISO-2022 cannot form a
CCS.  If you look at a byte sequence \x21\x21, you can't tell if it is
two !'s or IDEOGRAPHIC SPACE.  EUC maps the latter to \xA1\xA1 so you
have no trouble telling "!!" from IDEOGRAPHIC SPACE.
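
The relation between the raw character set, EUC, and 7bit ISO-2022
can be observed with Encode itself.  The following sketch assumes
Encode::JP is installed (it ships with Encode) and applies the "add
0x80 to each byte" rule by hand; C<hex_of> is a throwaway helper
defined only for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Throwaway helper: show a byte string as hex octets.
sub hex_of { join ' ', map { sprintf '%02X', ord } split //, $_[0] }

my $char = "\x{3000}";    # IDEOGRAPHIC SPACE

# The bare JIS X 0208 code, with no encoding scheme applied.
my $raw = encode('jis0208-raw', $char);

# EUC rule from the text: add 0x80 to (set the high bit of) each byte.
my $by_hand = join '', map { chr(ord($_) | 0x80) } split //, $raw;

my $euc = encode('euc-jp',      $char);
my $jis = encode('iso-2022-jp', $char);

print "jis0208-raw : ", hex_of($raw),     "\n";
print "by hand     : ", hex_of($by_hand), "\n";
print "euc-jp      : ", hex_of($euc),     "\n";
print "iso-2022-jp : ", hex_of($jis),     "\n";
```

The hand-made octets match what C<euc-jp> produces, while the
C<iso-2022-jp> output wraps the same raw bytes in escape sequences,
which is why those bytes alone are ambiguous.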

=head1 Encoding Classification (by Anton Tagunov and Dan Kogai)

This section tries to classify the supported encodings by their
applicability for information exchange over the Internet and to
choose the most suitable aliases to name them in the context of
such communication.

=over 2

=item *

To (en|de)code encodings marked with C<(**)>, you need
C<Encode::HanExtra>, available from CPAN.

=back

Encoding names

  US-ASCII    UTF-8    ISO-8859-*  KOI8-R
  Shift_JIS   EUC-JP   ISO-2022-JP ISO-2022-JP-1
  EUC-KR      Big5     GB2312

are registered with IANA as preferred MIME names and may
be used over the Internet.

C<Shift_JIS> has been officialized by JIS X 0208:1997.
L</Microsoft-related naming mess> gives details.

C<GB2312> is the IANA name for C<EUC-CN>.
See L</Microsoft-related naming mess> for details.

C<GB_2312-80> I<raw> encoding is available as C<gb2312-raw>
with Encode.  See L<Encode::CN> for details.

  EUC-CN
  KOI8-U        [RFC2319]

have not been registered with IANA (as of March 2002) but
seem to be supported by major web browsers.
The IANA name for C<EUC-CN> is C<GB2312>.

  KS_C_5601-1987

is heavily misused.
See L</Microsoft-related naming mess> for details.

C<KS_C_5601-1987> I<raw> encoding is available as C<ksc5601-raw>
with Encode.  See L<Encode::KR> for details.

  UTF-16 UTF-16BE UTF-16LE

are IANA-registered C<charset>s.  See [RFC 2781] for details.
Jungshik Shin reports that UTF-16 with a BOM is well accepted
by MS IE 5/6 and NS 4/6.  Beware, however, that

=over 2

=item *

C<UTF-16> support in any software you're going to be
using/interoperating with has probably been less tested
than C<UTF-8> support

=item *

C<UTF-8> coded data seamlessly passes traditional
command piping (C<cat>, C<more>, etc.) while C<UTF-16> coded
data is likely to cause confusion (with its zero bytes,
for example)

=item *

it is beyond the power of words to describe the way HTML browsers
encode non-C<ASCII> form data.  To get a general impression, visit
L<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>.
While encoding of form data has stabilized for C<UTF-8> coded pages
(at least IE 5/6, NS 6, and Opera 6 behave consistently), be sure to
expect fun (and cross-browser discrepancies) with C<UTF-16> coded
pages!

=back

The rule of thumb is to use C<UTF-8> unless you know what
you're doing and unless you really benefit from using C<UTF-16>.

  ISO-IR-165    [RFC1345]
  GBK
  VISCII
  GB 12345
  GB 18030 (**)  (see links below)
  EUC-TW   (**)

are totally valid encodings but not registered at IANA.
The names under which they are listed here are probably the
most widely-known names for these encodings and are recommended
names.

  BIG5PLUS (**)

is a somewhat proprietary name.

=head2 Microsoft-related naming mess

Microsoft products misuse the following names:

=over 2

=item KS_C_5601-1987

Microsoft extension to C<EUC-KR>.

Proper names: C<CP949>, C<UHC>, C<x-windows-949> (as used by Mozilla).

See L<http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html>
for details.

Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect this common
misusage.  I<Raw> C<KS_C_5601-1987> encoding is available as
C<ksc5601-raw>.

See L<Encode::KR> for details.

=item GB2312

Microsoft extension to C<EUC-CN>.

Proper names: C<CP936>, C<GBK>.

C<GB2312> has been registered in the C<EUC-CN> meaning at
IANA.  This has partially repaired the situation: Microsoft's
C<GB2312> has become a superset of the official C<GB2312>.

Encode aliases C<GB2312> to C<euc-cn> in full agreement with
the IANA registration.  C<cp936> is supported separately.
I<Raw> C<GB_2312-80> encoding is available as C<gb2312-raw>.

See L<Encode::CN> for details.

=item Big5

Microsoft extension to C<Big5>.

Proper name: C<CP950>.

Encode separately supports C<Big5> and C<cp950>.

=item Shift_JIS

Microsoft's understanding of C<Shift_JIS>.

JIS has not endorsed the full Microsoft standard, however.
The official C<Shift_JIS> includes only JIS X 0201 and JIS X 0208
subsets, while Microsoft has always used C<Shift_JIS> to
encode a wider character repertoire.  See the C<IANA> registration for
C<Windows-31J>.

As a historical predecessor, Microsoft's variant
probably has more rights to the name, though it may be objected
that Microsoft shouldn't have used JIS as part of the name
in the first place.

Unambiguous name: C<CP932>.  C<IANA> name (not used?): C<Windows-31J>.

Encode separately supports C<Shift_JIS> and C<cp932>.

=back
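
How Encode treats these names can be inspected directly.  The sketch
below assumes a stock Encode with the CJK modules installed; the
octet pair C<\x81\x60> is chosen for illustration because it is a
well-known point where the JIS-based and Microsoft mappings disagree.

```perl
use strict;
use warnings;
use Encode qw(find_encoding decode);

# Names that are used loosely in the wild, and what Encode maps them to.
for my $name (qw(KS_C_5601-1987 GB2312 GBK Shift_JIS cp932)) {
    my $enc = find_encoding($name);
    printf "%-15s => %s\n", $name, $enc ? $enc->name : '(unknown)';
}

# shiftjis and cp932 are separate encodings: the same two octets
# decode to different characters.
my $octets = "\x81\x60";
printf "shiftjis: U+%04X\n", ord decode('shiftjis', $octets);
printf "cp932   : U+%04X\n", ord decode('cp932',    $octets);
```

If round-tripping data from Windows systems matters, naming C<cp932>
explicitly avoids relying on which meaning of C<Shift_JIS> a peer has
in mind.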

=head1 Glossary

=over 2

=item character repertoire

A collection of unique characters.  A I<character> set in the
strictest sense.  At this stage, characters are not numbered.

=item coded character set (CCS)

A character set that is mapped in a way computers can use directly.
Many character encodings, including EUC, fall in this category.

=item character encoding scheme (CES)

An algorithm to map a character set to a byte sequence.  You don't
have to be able to tell which character set a given byte sequence
belongs to.  7-bit ISO-2022 is a CES but it cannot be a CCS.  EUC is an
example of being both a CCS and a CES.

=item charset (in MIME context)

has long been used in the meaning of C<encoding>, CES.

While the word combination C<character set> has lost this meaning
in MIME context since [RFC 2130], the C<charset> abbreviation has
retained it.  This is how [RFC 2277] and [RFC 2278] bless C<charset>:

 This document uses the term "charset" to mean a set of rules for
 mapping from a sequence of octets to a sequence of characters, such
 as the combination of a coded character set and a character encoding
 scheme; this is also what is used as an identifier in MIME "charset="
 parameters, and registered in the IANA charset registry ...  (Note
 that this is NOT a term used by other standards bodies, such as ISO).
 [RFC 2277]

=item EUC

Extended Unix Code.  See ISO-2022.

=item ISO-2022

A CES that was carefully designed to coexist with ASCII.  There are a
7-bit version and an 8-bit version.

The 7-bit version switches character sets via escape sequences, so it
cannot form a CCS.  Since this is more difficult to handle in programs
than the 8-bit version, the 7-bit version is not very popular except
for iso-2022-jp, the de facto standard CES for e-mails.

The 8-bit version can form a CCS.  EUC and ISO-8859 are two examples
thereof.  Pre-5.6 perl could use them as string literals.

=item UCS

Short for I<Universal Character Set>.  When you say just UCS, it means
I<Unicode>.

=item UCS-2

ISO/IEC 10646 encoding form: Universal Character Set coded in two
octets.

=item Unicode

A character set that aims to include all character repertoires of the
world.  Many character sets in various national as well as industrial
standards have become, in a way, just subsets of Unicode.

=item UTF

Short for I<Unicode Transformation Format>.  Determines how to map a
Unicode character into a byte sequence.

=item UTF-16

A UTF in 16-bit encoding.  Can be either big endian or little
endian.  The big endian version is called UTF-16BE (equal to UCS-2 +
surrogate support) and the little endian version is UTF-16LE.

=back
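
The "surrogate support" mentioned under UTF-16 can be made visible
with a short sketch.  It assumes nothing beyond Encode itself;
C<hex_of> is a helper defined only for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Throwaway helper: show a byte string as hex octets.
sub hex_of { join ' ', map { sprintf '%02X', ord } split //, $_[0] }

# A character inside the Basic Multilingual Plane: one 16-bit unit.
print "U+3042  UTF-16BE: ", hex_of(encode('UTF-16BE', "\x{3042}")), "\n";
print "U+3042  UTF-16LE: ", hex_of(encode('UTF-16LE', "\x{3042}")), "\n";

# A character beyond U+FFFF: a surrogate pair, i.e. two 16-bit units.
print "U+1F600 UTF-16BE: ", hex_of(encode('UTF-16BE', "\x{1F600}")), "\n";
```

Characters up to U+FFFF take one 16-bit unit, exactly as in UCS-2;
anything above that is split into a high and a low surrogate.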

=head1 See Also

L<Encode>,
L<Encode::Byte>,
L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
L<Encode::EBCDIC>, L<Encode::Symbol>

=head1 References

=over 2

=item ECMA

European Computer Manufacturers Association
L<http://www.ecma.ch>

=over 2

=item ECMA-035 (eq C<ISO-2022>)

L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM>

The very specification of ISO-2022 is available from the link above.

=back

=item IANA

Internet Assigned Numbers Authority
L<http://www.iana.org/>

=over 2

=item Assigned Charset Names by IANA

L<http://www.iana.org/assignments/character-sets>

Most of the C<canonical names> in Encode derive from this list,
so you can directly apply the string you have extracted from the MIME
headers of mails and web pages.

=back

=item ISO

International Organization for Standardization
L<http://www.iso.ch/>

=item RFC

Request For Comments -- need I say more?
L<http://www.rfc.net/>, L<http://www.faqs.org/rfcs/>

=item UC

Unicode Consortium
L<http://www.unicode.org/>

=over 2

=item Unicode Glossary

L<http://www.unicode.org/glossary/>

The glossary of this document is based upon this site.

=back

=back
762
763=head2 Other Notable Sites
764
765=over 2
766
767=item czyborra.com
768
f2a2953c 769L<http://czyborra.com/>
a999c27c
JH
770
771Contains a a lot of useful information, especially gory details of ISO
772vs. vendor mappings.
773
774=item CJK.inf
775
776L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>
777
778Somewhat obsolete (last update in 1996), but still useful. Also try
779
780L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>
781
782You will find brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>
783
f2a2953c
JH
784=item Jungshik Shin's Hangul FAQ
785
786L<http://jshin.net/faq>
787
788And especially it's subject 8
789
790L<http://jshin.net/faq/qa8.html>
791
792a comprehensive overview of the Korean (C<KS *>) standards.
793
794=back
795
796=head2 Offline sources
797
798=over 2
799
800=item C<CJKV Information Processing> by Ken Lunde
801
802CJKV Information Processing
8031999 O'Reilly & Associates, ISBN : 1-56592-224-7
804
805The modern successor of the C<CJK.inf>.
806
807Features a comprehensive coverage on CJKV character sets and
808encodings along with many other issues faced by anyone trying
809to better support CJKV languages/scripts in all the areas of
810information processing.
811
812To purchase this book visit
813L<http://www.oreilly.com/catalog/cjkvinfo/>
814
a999c27c
JH
815=back
816
5d030b67 817=cut
67d7b5ef
JH
818
819I could not find this page because the hostname doesn't resolve!
820
821 Brief description for most of the mentioned CJK encodings
822L<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html>