Commit | Line | Data |
---|---|---|
1d72bdf6 NIS |
1 | /* utfebcdic.h |
2 | * | |
2eee27d7 SS |
3 | * Copyright (C) 2001, 2002, 2003, 2005, 2006, 2007, 2009, |
4 | * 2010, 2011 by Larry Wall, Nick Ing-Simmons, and others | |
1d72bdf6 NIS |
5 | * |
6 | * You may distribute under the terms of either the GNU General Public | |
7 | * License or the Artistic License, as specified in the README file. | |
8 | * | |
9 | * Macros to implement UTF-EBCDIC as perl's internal encoding | |
97237291 | 10 | * Adapted from version 7.1 of Unicode Technical Report #16: |
ad37daf5 | 11 | * http://www.unicode.org/reports/tr16 |
fe749c9a KW |
12 | * |
13 | * To summarize, the way it works is: | |
c229c178 KW |
14 | * To convert an EBCDIC code point to UTF-EBCDIC: |
15 | * 1) convert to Unicode. No conversion is necessary for code points above | |
16 | * 255, as Unicode and EBCDIC are identical in this range. For smaller | |
17 | * code points, the conversion is done by lookup in the PL_e2a table (with | |
18 | * inverse PL_a2e) in the generated file 'ebcdic_tables.h'. The 'a' | |
02a95550 KW |
19 | * stands for ASCII platform, meaning 0-255 Unicode. Use |
20 | * NATIVE_TO_LATIN1() and LATIN1_TO_NATIVE(), respectively, to perform this | |
21 | * lookup. NATIVE_TO_UNI() and UNI_TO_NATIVE() are similarly used for any | |
22 | * input, and know to avoid the lookup for inputs above 255. | |
97237291 | 23 | * 2) convert that to a utf8-like string called I8 ('I' stands for |
d06134e5 KW |
24 | * intermediate) with variant characters occupying multiple bytes. This |
25 | * step is similar to the utf8-creating step from Unicode, but the details | |
26 | * are different. This transformation is called UTF8-Mod. There is a | |
27 | * chart about the bit patterns in a comment later in this file. But | |
fe749c9a KW |
28 | * essentially here are the differences: |
29 | * UTF8 I8 | |
30 | * invariant byte starts with 0 starts with 0 or 100 | |
31 | * continuation byte starts with 10 starts with 101 | |
32 | * start byte same in both: if the code point requires N bytes, | |
c0236afe KW |
33 | * then the leading N bits are 1, followed by a 0. If |
34 | * all 8 bits in the first byte are 1, the code point | |
35 | * will occupy 14 bytes (compared to 13 in Perl's | |
36 | * extended UTF-8). This is incompatible with what | |
37 | * tr16 implies should be the representation of code | |
38 | * points 2**30 and above, but allows Perl to be able | |
39 | * to represent all code points that fit in a 64-bit | |
40 | * word in either our extended UTF-EBCDIC or UTF-8. | |
97237291 KW |
41 | * 3) Use the algorithm in tr16 to convert each byte from step 2 into |
42 | * final UTF-EBCDIC. This is done by table lookup from a table | |
4bc3dcfa | 43 | * constructed from the algorithm, reproduced in ebcdic_tables.h as |
97237291 KW |
44 | * PL_utf2e, with its inverse being PL_e2utf. They are constructed so that |
45 | * all EBCDIC invariants remain invariant, but no others do, and the first | |
46 | * byte of a variant will always have its upper bit set. But note that | |
97d0ceda KW |
47 | * the upper bit of some invariants is also 1. The table also is designed |
48 | * so that lexically comparing two UTF-EBCDIC-variant characters yields | |
49 | * the Unicode code point order. (To get native code point order, one has | |
50 | * to convert the latin1-range characters to their native code point | |
02a95550 KW |
51 | * value.) The macros NATIVE_UTF8_TO_I8() and I8_TO_NATIVE_UTF8() do the |
52 | * table lookups. | |
97237291 KW |
53 | * |
54 | * For example, the ordinal value of 'A' is 193 in EBCDIC, and also is 193 in | |
55 | * UTF-EBCDIC. Step 1) converts it to 65, Step 2 leaves it at 65, and Step 3 | |
56 | * converts it back to 193. As an example of how a variant character works, | |
57 | * take LATIN SMALL LETTER Y WITH DIAERESIS, which is typically 0xDF in | |
58 | * EBCDIC. Step 1 converts it to the Unicode value, 0xFF. Step 2 converts | |
59 | * that to two bytes = 11000111 10111111 = C7 BF, and Step 3 converts those to | |
60 | * 0x8B 0x73. | |
45f80db9 | 61 | * |
fe749c9a KW |
62 | * If you're starting from Unicode, skip step 1. For UTF-EBCDIC to straight |
63 | * EBCDIC, reverse the steps. | |
64 | * | |
65 | * The EBCDIC invariants have been chosen to be those characters whose Unicode | |
66 | * equivalents have ordinal numbers less than 160, that is the same characters | |
67 | * that are expressible in ASCII, plus the C1 controls. So there are 160 | |
bc2161fd | 68 | * invariants instead of the 128 in UTF-8. |
fe749c9a KW |
69 | * |
70 | * The purpose of Step 3 is to make the encoding be invariant for the chosen | |
71 | * characters. This messes up the convenient patterns found in step 2, so | |
02a95550 KW |
72 | * generally, one has to undo step 3 into a temporary to use them, using the |
73 | * macro NATIVE_UTF8_TO_I8(). However, one "shadow", or parallel table, | |
74 | * PL_utf8skip, has been constructed that doesn't require undoing things. It | |
75 | * is such that for each byte, it says how long the sequence is if that | |
76 | * (UTF-EBCDIC) byte were to begin it. | |
97237291 KW |
77 | * |
78 | * There are actually 3 slightly different UTF-EBCDIC encodings in | |
4bc3dcfa | 79 | * ebcdic_tables.h, one for each of the code pages recognized by Perl. That |
97237291 KW |
80 | * means that there are actually three different sets of tables, one for each |
81 | * code page. (If Perl is compiled on platforms using another EBCDIC code | |
82 | * page, it may not compile, or Perl may silently mistake it for one of the | |
83 | * three.) | |
fe749c9a | 84 | * |
97237291 KW |
85 | * Note that tr16 actually only specifies one version of UTF-EBCDIC, based on |
86 | * the 1047 encoding, and which is supposed to be used for all code pages. | |
87 | * But this doesn't work. To illustrate the problem, consider the '^' character. | |
88 | * On a 037 code page it is the single byte 176, whereas under 1047 UTF-EBCDIC | |
89 | * it is the single byte 95. If Perl implemented tr16 exactly, it would mean | |
90 | * that changing a string containing '^' to UTF-EBCDIC would change that '^' | |
91 | * from 176 to 95 (and vice-versa), violating the rule that ASCII-range | |
92 | * characters are the same in UTF-8 or not. Much code in Perl assumes this | |
93 | * rule. See for example | |
94 | * http://grokbase.com/t/perl/mvs/025xf0yhmn/utf-ebcdic-for-posix-bc-malformed-utf-8-character | |
95 | * What Perl does is create a version of UTF-EBCDIC suited to each code page; | |
96 | * the one for the 1047 code page is identical to what's specified in tr16. | |
97 | * This complicates interchanging files between computers using different code | |
98 | * pages. Best is to convert to I8 before sending them, as the I8 | |
99 | * representation is the same no matter what the underlying code page is. | |
fe749c9a | 100 | * |
ff982d00 KW |
101 | * Because of the way UTF-EBCDIC is constructed, the lowest 32 code points that |
102 | * aren't equivalent to ASCII characters nor C1 controls form the set of | |
103 | * continuation bytes; the remaining 64 non-ASCII, non-control code points form | |
104 | * the potential start bytes, in order. (However, the first 5 of these lead to | |
80bfb4dc KW |
105 | * malformed overlongs, so there really are only 59 start bytes, and the first |
106 | * three of the 59 are the start bytes for the Latin1 range.) Hence the | |
ff982d00 KW |
107 | * UTF-EBCDIC for the smallest variant code point, 0xA0, will likely have 0x41 |
108 | * as its continuation byte, provided 0x41 isn't an ASCII or C1 equivalent. | |
109 | * And its start byte will be the code point that is 37 (32+5) non-ASCII, | |
110 | * non-control code points past it. (0 - 3F are controls, and 40 is SPACE, | |
111 | * leaving 41 as the first potentially available one.) In contrast, on ASCII | |
112 | * platforms, the first 64 (not 32) non-ASCII code points are the continuation | |
113 | * bytes. And the first 2 (not 5) potential start bytes form overlong | |
114 | * malformed sequences. | |
115 | * | |
fe749c9a KW |
116 | * EBCDIC characters above 0xFF are the same as Unicode in Perl's |
117 | * implementation of all 3 encodings, so for those Step 1 is trivial. | |
118 | * | |
119 | * (Note that the entries for invariant characters are necessarily the same in | |
97237291 | 120 | * PL_e2a and PL_e2utf; likewise for their inverses.) |
fe749c9a KW |
121 | * |
122 | * UTF-EBCDIC strings are the same length or longer than UTF-8 representations | |
123 | * of the same string. The maximum code point representable as 2 bytes in | |
124 | * UTF-EBCDIC is 0x3FF, instead of 0x7FF in UTF-8. | |
1d72bdf6 NIS |
125 | */ |
126 | ||
127 | START_EXTERN_C | |
128 | ||
4bc3dcfa | 129 | #include "ebcdic_tables.h" |
44f2fc15 | 130 | |
1d72bdf6 NIS |
131 | END_EXTERN_C |
132 | ||
1e54db1a | 133 | /* EBCDIC-happy ways of converting native code to UTF-8 */ |
1d72bdf6 | 134 | |
e9b19ab7 KW |
135 | /* Use these when ch is known to be < 256 */ |
136 | #define NATIVE_TO_LATIN1(ch) (__ASSERT_(FITS_IN_8_BITS(ch)) PL_e2a[(U8)(ch)]) | |
137 | #define LATIN1_TO_NATIVE(ch) (__ASSERT_(FITS_IN_8_BITS(ch)) PL_a2e[(U8)(ch)]) | |
59a449d5 | 138 | |
e9b19ab7 KW |
139 | /* Use these on bytes */ |
140 | #define NATIVE_UTF8_TO_I8(b) (__ASSERT_(FITS_IN_8_BITS(b)) PL_e2utf[(U8)(b)]) | |
141 | #define I8_TO_NATIVE_UTF8(b) (__ASSERT_(FITS_IN_8_BITS(b)) PL_utf2e[(U8)(b)]) | |
59a449d5 | 142 | |
bc3632a8 | 143 | /* Transforms in wide UV chars */ |
02a95550 KW |
144 | #define NATIVE_TO_UNI(ch) \ |
145 | (FITS_IN_8_BITS(ch) ? NATIVE_TO_LATIN1(ch) : (UV) (ch)) | |
146 | #define UNI_TO_NATIVE(ch) \ | |
147 | (FITS_IN_8_BITS(ch) ? LATIN1_TO_NATIVE(ch) : (UV) (ch)) | |
1d72bdf6 | 148 | /* |
c0236afe | 149 | The following table is adapted from tr16; it shows the I8 encoding of Unicode code points. |
1d72bdf6 | 150 | |
c0236afe | 151 | Unicode U32 Bit pattern 1st Byte 2nd Byte 3rd Byte 4th Byte 5th Byte 6th Byte 7th Byte |
1d72bdf6 NIS |
152 | U+0000..U+007F 000000000xxxxxxx 0xxxxxxx |
153 | U+0080..U+009F 00000000100xxxxx 100xxxxx | |
1d72bdf6 NIS |
154 | U+00A0..U+03FF 000000yyyyyxxxxx 110yyyyy 101xxxxx |
155 | U+0400..U+3FFF 00zzzzyyyyyxxxxx 1110zzzz 101yyyyy 101xxxxx | |
156 | U+4000..U+3FFFF 0wwwzzzzzyyyyyxxxxx 11110www 101zzzzz 101yyyyy 101xxxxx | |
157 | U+40000..U+3FFFFF 0vvwwwwwzzzzzyyyyyxxxxx 111110vv 101wwwww 101zzzzz 101yyyyy 101xxxxx | |
158 | U+400000..U+3FFFFFF 0uvvvvvwwwwwzzzzzyyyyyxxxxx 1111110u 101vvvvv 101wwwww 101zzzzz 101yyyyy 101xxxxx | |
c0236afe | 159 | U+4000000..U+3FFFFFFF 00uuuuuvvvvvwwwwwzzzzzyyyyyxxxxx 11111110 101uuuuu 101vvvvv 101wwwww 101zzzzz 101yyyyy 101xxxxx |
1d72bdf6 | 160 | |
c0236afe KW |
161 | Beyond this, Perl uses an incompatible extension, similar to the one used in |
162 | regular UTF-8. There are now 14 bytes. A full 32 bits of information thus looks like this: | |
163 | 1st Byte 2nd-7th 8th Byte 9th Byte 10th B 11th B 12th B 13th B 14th B | |
164 | U+40000000..U+FFFFFFFF ttuuuuuvvvvvwwwwwzzzzzyyyyyxxxxx 11111111 10100000 101000tt 101uuuuu 101vvvvv 101wwwww 101zzzzz 101yyyyy 101xxxxx | |
1d72bdf6 | 165 | |
c0236afe KW |
166 | For 32-bit words, the 2nd through 7th bytes effectively function as leading |
167 | zeros. Above 32 bits, these fill up, with each byte yielding 5 bits of | |
168 | information, so that with 13 continuation bytes, we can handle 65 bits, just | |
a14e0a36 KW |
169 | above what a 64-bit word can hold. |
170 | ||
171 | The following table gives the I8 byte ranges for each range of code points: | |
172 | ||
173 | I8 Code Points 1st Byte 2nd Byte 3rd 4th 5th 6th 7th 8th 9th-14th | |
174 | ||
e0e0cbdb | 175 | 0x0000..0x009F 00..9F |
a14e0a36 KW |
176 | 0x00A0..0x00FF * C5..C7 A0..BF |
177 | U+0100..U+03FF C8..DF A0..BF | |
178 | U+0400..U+3FFF * E1..EF A0..BF A0..BF | |
179 | U+4000..U+7FFF F0 * B0..BF A0..BF A0..BF | |
180 | U+8000..U+D7FF F1 A0..B5 A0..BF A0..BF | |
181 | U+D800..U+DFFF F1 B6..B7 A0..BF A0..BF (surrogates) | |
182 | U+E000..U+FFFF F1 B8..BF A0..BF A0..BF | |
183 | U+10000..U+3FFFF F2..F7 A0..BF A0..BF A0..BF | |
184 | U+40000..U+FFFFF F8 * A8..BF A0..BF A0..BF A0..BF | |
185 | U+100000..U+10FFFF F9 A0..A1 A0..BF A0..BF A0..BF | |
186 | Below are above-Unicode code points | |
1072f3e3 | 187 | U+110000..U+1FFFFF F9 A2..BF A0..BF A0..BF A0..BF |
a14e0a36 KW |
188 | U+200000..U+3FFFFF FA..FB A0..BF A0..BF A0..BF A0..BF |
189 | U+400000..U+1FFFFFF FC * A4..BF A0..BF A0..BF A0..BF A0..BF | |
190 | U+2000000..U+3FFFFFF FD A0..BF A0..BF A0..BF A0..BF A0..BF | |
191 | U+4000000..U+3FFFFFFF FE * A2..BF A0..BF A0..BF A0..BF A0..BF A0..BF | |
192 | U+40000000.. FF A0..BF A0..BF A0..BF A0..BF A0..BF A0..BF * A1..BF A0..BF | |
193 | ||
194 | Note the gaps before several of the byte entries above marked by '*'. These are | |
195 | caused by legal UTF-8 avoiding non-shortest encodings: it is technically | |
196 | possible to UTF-8-encode a single code point in different ways, but that is | |
197 | explicitly forbidden, and the shortest possible encoding should always be used | |
198 | (and that is what Perl does). */ | |
1ff3baa2 | 199 | |
fcd03d92 | 200 | #define UTF_CONTINUATION_BYTE_INFO_BITS UTF_EBCDIC_CONTINUATION_BYTE_INFO_BITS |
1d72bdf6 | 201 | |
02a95550 KW |
202 | /* ^? is defined to be APC on EBCDIC systems, as specified in Unicode Technical |
203 | * Report #16. See the definition of toCTRL() for more */ | |
0ed2b00b KW |
204 | #define QUESTION_MARK_CTRL LATIN1_TO_NATIVE(0x9F) |
205 | ||
e9a8c099 | 206 | /* |
14d04a33 | 207 | * ex: set ts=8 sts=4 sw=4 et: |
e9a8c099 | 208 | */ |
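
Step 2 of the header comment (Unicode code point to I8) can be sketched in standalone C using the bit patterns from the I8 table. This is only an illustrative sketch, not code from Perl: the function name `i8_encode` is hypothetical, it handles only code points below 2**30 (the 1- to 7-byte forms), and it stops at I8, so the per-code-page `PL_utf2e` lookup of step 3 is not applied.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of Perl's API): encode a code point below
 * 2**30 as I8, following the bit patterns in the I8 table.
 * Writes at most 7 bytes to out and returns the number written,
 * or 0 if cp needs Perl's 14-byte extended form, which is not handled here. */
static size_t i8_encode(uint32_t cp, unsigned char out[7])
{
    size_t len, i;

    assert(out != NULL);

    /* Invariants (0xxxxxxx and 100xxxxx) are a single byte */
    if (cp < 0xA0) {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp > 0x3FFFFFFF)
        return 0;

    /* Choose the shortest form; each continuation byte carries 5 bits */
    if      (cp <= 0x3FF)     len = 2;
    else if (cp <= 0x3FFF)    len = 3;
    else if (cp <= 0x3FFFF)   len = 4;
    else if (cp <= 0x3FFFFF)  len = 5;
    else if (cp <= 0x3FFFFFF) len = 6;
    else                      len = 7;

    /* Continuation bytes are 101xxxxx, filled from the last byte backwards */
    for (i = len - 1; i > 0; i--) {
        out[i] = (unsigned char) (0xA0 | (cp & 0x1F));
        cp >>= 5;
    }

    /* Start byte: len leading 1 bits, then a 0, then the remaining high bits */
    out[0] = (unsigned char) (((0xFF << (8 - len)) & 0xFF) | cp);
    return len;
}
```

With this sketch, `i8_encode(0xFF, buf)` writes the two bytes C7 BF, matching the LATIN SMALL LETTER Y WITH DIAERESIS example in the header comment, and `i8_encode(0xA0, buf)` writes C5 A0, the first non-overlong start byte followed by the first continuation byte.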