Commit | Line | Data |
---|---|---|
49781f4a AB |
1 | =encoding utf8 |
2 | ||
d396a558 JH |
3 | =head1 NAME |
4 | ||
5 | perlebcdic - Considerations for running Perl on EBCDIC platforms | |
6 | ||
7 | =head1 DESCRIPTION | |
8 | ||
9 | An exploration of some of the issues facing Perl programmers | |
eaf8b9b9 | 10 | on EBCDIC based computers. We do not cover localization, |
8a50e6a3 | 11 | internationalization, or multi-byte character set issues other |
395f5a0c | 12 | than some discussion of UTF-8 and UTF-EBCDIC. |
d396a558 JH |
13 | |
14 | Portions that are still incomplete are marked with XXX. | |
15 | ||
e1b711da KW |
16 | Perl used to work on EBCDIC machines, but there are now areas of the code where |
17 | it doesn't. If you want to use Perl on an EBCDIC machine, please let us know | |
18 | by sending mail to perlbug@perl.org | |
19 | ||
d396a558 JH |
20 | =head1 COMMON CHARACTER CODE SETS |
21 | ||
22 | =head2 ASCII | |
23 | ||
2bbc8d55 SP |
24 | The American Standard Code for Information Interchange (ASCII or US-ASCII) is a |
25 | set of | |
eaf8b9b9 KW |
26 | integers running from 0 to 127 (decimal) that imply character |
27 | interpretation by the display and other systems of computers. | |
28 | The range 0..127 can be covered by setting the bits in a 7-bit binary | |
29 | number, hence the set is sometimes referred to as "7-bit ASCII". | |
30 | ASCII was described by the American National Standards Institute | |
31 | document ANSI X3.4-1986. It was also described by ISO 646:1991 | |
32 | (with localization for currency symbols). The full ASCII set is | |
33 | given in the table below as the first 128 elements. Languages that | |
34 | can be written adequately with the characters in ASCII include | |
35 | English, Hawaiian, Indonesian, Swahili and some Native American | |
d396a558 JH |
36 | languages. |
37 | ||
51b5cecb PP |
38 | There are many character sets that extend the range of integers |
39 | from 0..2**7-1 up to 2**8-1, or 8 bit bytes (octets if you prefer). | |
40 | One common one is the ISO 8859-1 character set. | |
41 | ||
d396a558 JH |
42 | =head2 ISO 8859 |
43 | ||
eaf8b9b9 | 44 | The ISO 8859-$n are a collection of character code sets from the |
5d9fe53c | 45 | International Organization for Standardization (ISO), each of which |
eaf8b9b9 | 46 | adds characters to the ASCII set that are typically found in European |
5d9fe53c | 47 | languages, many of which are based on the Roman, or Latin, alphabet. |
d396a558 JH |
48 | |
49 | =head2 Latin 1 (ISO 8859-1) | |
50 | ||
eaf8b9b9 KW |
51 | Latin 1 is an 8-bit extension to ASCII that includes grave and acute |
52 | accented Latin characters. Languages that can employ ISO 8859-1 | |
53 | include all the languages covered by ASCII as well as Afrikaans, | |
54 | Albanian, Basque, Catalan, Danish, Faroese, Finnish, Norwegian, | |
55 | Portuguese, Spanish, and Swedish. Dutch is covered albeit without | |
56 | the ij ligature. French is covered too but without the oe ligature. | |
d396a558 | 57 | German can use ISO 8859-1 but must do so without German-style |
eaf8b9b9 | 58 | quotation marks. This set is based on Western European extensions |
d396a558 JH |
59 | to ASCII and is commonly encountered in world wide web work. |
60 | In IBM character code set identification terminology ISO 8859-1 is | |
51b5cecb | 61 | also known as CCSID 819 (or sometimes 0819 or even 00819). |
d396a558 JH |
62 | |
63 | =head2 EBCDIC | |
64 | ||
eaf8b9b9 | 65 | The Extended Binary Coded Decimal Interchange Code refers to a |
8a50e6a3 | 66 | large collection of single- and multi-byte coded character sets that are |
e1b711da KW |
67 | different from ASCII or ISO 8859-1 and are all slightly different from each |
68 | other; they typically run on host computers. The EBCDIC encodings derive from | |
8a50e6a3 | 69 | 8-bit byte extensions of Hollerith punched card encodings. The layout on the |
e1b711da KW |
70 | cards was such that high bits were set for the upper and lower case alphabet |
71 | characters [a-z] and [A-Z], but there were gaps within each Latin alphabet | |
72 | range. | |
d396a558 | 73 | |
eaf8b9b9 | 74 | Some IBM EBCDIC character sets may be known by character code set |
2c09a866 | 75 | identification numbers (CCSID numbers) or code page numbers. |
51b5cecb | 76 | |
2bbc8d55 SP |
77 | Perl can be compiled on platforms that run any of three commonly used EBCDIC |
78 | character sets, listed below. | |
79 | ||
d5924ca6 | 80 | =head3 The 13 variant characters |
1e054b24 | 81 | |
51b5cecb PP |
82 | Among IBM EBCDIC character code sets there are 13 characters that |
83 | are often mapped to different integer values. Those characters | |
84 | are known as the 13 "variant" characters and are: | |
d396a558 | 85 | |
eaf8b9b9 | 86 | \ [ ] { } ^ ~ ! # | $ @ ` |
d396a558 | 87 | |
6ff677df | 88 | When Perl is compiled for a platform, it looks at all of these characters to |
2bbc8d55 SP |
89 | guess which EBCDIC character set the platform uses, and adapts itself |
90 | accordingly to that platform. If the platform uses a character set that is not | |
91 | one of the three Perl knows about, Perl will either fail to compile, or | |
92 | mistakenly and silently choose one of the three. | |
93 | The three character sets Perl knows about are: | |
94 | ||
d5924ca6 KW |
95 | =over |
96 | ||
97 | =item B<0037> | |
d396a558 | 98 | |
eaf8b9b9 KW |
99 | Character code set ID 0037 is a mapping of the ASCII plus Latin-1 |
100 | characters (i.e. ISO 8859-1) to an EBCDIC set. 0037 is used | |
101 | in North American English locales on the OS/400 operating system | |
102 | that runs on AS/400 computers. CCSID 0037 differs from ISO 8859-1 | |
51b5cecb | 103 | in 237 places, in other words they agree on only 19 code point values. |
d396a558 | 104 | |
d5924ca6 | 105 | =item B<1047> |
d396a558 | 106 | |
eaf8b9b9 KW |
107 | Character code set ID 1047 is also a mapping of the ASCII plus |
108 | Latin-1 characters (i.e. ISO 8859-1) to an EBCDIC set. 1047 is | |
109 | used under Unix System Services for OS/390 or z/OS, and OpenEdition | |
395f5a0c | 110 | for VM/ESA. CCSID 1047 differs from CCSID 0037 in eight places. |
d396a558 | 111 | |
d5924ca6 | 112 | =item B<POSIX-BC> |
d396a558 JH |
113 | |
114 | The EBCDIC code page in use on Siemens' BS2000 system is distinct from | |
115 | 1047 and 0037. It is identified below as the POSIX-BC set. | |
116 | ||
d5924ca6 KW |
117 | =back |
118 | ||
64c66fb6 JH |
119 | =head2 Unicode code points versus EBCDIC code points |
120 | ||
121 | In Unicode terminology a I<code point> is the number assigned to a | |
122 | character: for example, in EBCDIC the character "A" is usually assigned | |
123 | the number 193. In Unicode the character "A" is assigned the number 65. | |
124 | This causes a problem with the semantics of the pack/unpack "U", which | |
125 | are supposed to pack Unicode code points to characters and back to numbers. | |
126 | The problem is: which code points to use for code points less than 256? | |
127 | (For 256 and over there's no problem: Unicode code points are used.) | |
f11f9c4c | 128 | In EBCDIC, the EBCDIC code points are used for the low 256. This |
64c66fb6 JH |
129 | means that the equivalences |
130 | ||
c72e675e KW |
131 | pack("U", ord($character)) eq $character |
132 | unpack("U", $character) == ord $character | |
64c66fb6 JH |
133 | |
134 | will hold. (If Unicode code points were applied consistently over | |
135 | all the possible code points, pack("U",ord("A")) would in EBCDIC | |
136 | equal I<A with acute> or chr(101), and unpack("U", "A") would equal | |
137 | 65, or I<non-breaking space>, not 193, or ord "A".) | |
138 | ||
dc4af4bb JH |
139 | =head2 Remaining Perl Unicode problems in EBCDIC |
140 | ||
141 | =over 4 | |
142 | ||
143 | =item * | |
144 | ||
dc4af4bb | 145 | The extensions Unicode::Collate and Unicode::Normalize are not |
f11f9c4c | 146 | supported under EBCDIC, likewise for the (now deprecated) encoding pragma. |
dc4af4bb JH |
147 | |
148 | =back | |
149 | ||
395f5a0c PK |
150 | =head2 Unicode and UTF |
151 | ||
2bbc8d55 SP |
152 | UTF stands for C<Unicode Transformation Format>. |
153 | UTF-8 is an encoding of Unicode into a sequence of 8-bit byte chunks, based on | |
154 | ASCII and Latin-1. | |
155 | The length of a sequence required to represent a Unicode code point | |
156 | depends on the ordinal number of that code point, | |
157 | with larger numbers requiring more bytes. | |
158 | UTF-EBCDIC is like UTF-8, but based on EBCDIC. | |
159 | ||
fe749c9a KW |
160 | You may see the term C<invariant> character or code point. |
161 | This simply means that the character has the same numeric | |
162 | value when encoded as when not. | |
42bde815 | 163 | (Note that this is a very different concept from L</The 13 variant characters> |
2bbc8d55 | 164 | mentioned above.) |
fe749c9a KW |
165 | For example, the ordinal value of 'A' is 193 in most EBCDIC code pages, |
166 | and also is 193 when encoded in UTF-EBCDIC. | |
e1b711da | 167 | All variant code points occupy at least two bytes when encoded. |
fe749c9a KW |
168 | In UTF-8, the code points corresponding to the lowest 128 |
169 | ordinal numbers (0 - 127: the ASCII characters) are invariant. | |
170 | In UTF-EBCDIC, there are 160 invariant characters. | |
2bbc8d55 | 171 | (If you care, the EBCDIC invariants are those characters |
fe749c9a | 172 | which have ASCII equivalents, plus those that correspond to |
2bbc8d55 | 173 | the C1 controls (80..9f on ASCII platforms).) |
fe749c9a | 174 | |
2bbc8d55 SP |
175 | A string encoded in UTF-EBCDIC may be longer (but never shorter) than |
176 | one encoded in UTF-8. | |
395f5a0c | 177 | |
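The relationship between a code point's ordinal and its encoded length can be observed directly. The following is a minimal sketch that assumes an ASCII platform and the core Encode module; the sample code points are arbitrary choices for illustration, and the results on an EBCDIC build may differ.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Sample code points, from ASCII up to a supplementary-plane character.
my @code_points = (0x41, 0xE9, 0x20AC, 0x1F600);

for my $cp (@code_points) {
    # encode() returns the UTF-8 bytes for the single character chr($cp).
    my $bytes = encode('UTF-8', chr($cp));
    printf "U+%04X -> %d byte(s): %s\n",
        $cp, length($bytes),
        join ' ', map { sprintf '%02X', ord } split //, $bytes;
}
```

Only the first of these sample code points is invariant under UTF-8: it alone is encoded as a single byte equal to its ordinal.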
8704cfd1 | 178 | =head2 Using Encode |
8f94de01 JH |
179 | |
180 | Starting with Perl 5.8 you can use the standard module Encode | |
2bbc8d55 SP |
181 | to translate from EBCDIC to Latin-1 code points. |
182 | Encode knows about more EBCDIC character sets than Perl can currently | |
183 | be compiled to run on. | |
8f94de01 | 184 | |
c72e675e | 185 | use Encode 'from_to'; |
8f94de01 | 186 | |
c72e675e | 187 | my %ebcdic = ( 176 => 'cp37', 95 => 'cp1047', 106 => 'posix-bc' ); |
8f94de01 | 188 | |
c72e675e KW |
189 | # $a is in EBCDIC code points |
190 | from_to($a, $ebcdic{ord '^'}, 'latin1'); | |
191 | # $a is ISO 8859-1 code points | |
8f94de01 JH |
192 | |
193 | and from Latin-1 code points to EBCDIC code points | |
194 | ||
c72e675e | 195 | use Encode 'from_to'; |
8f94de01 | 196 | |
c72e675e | 197 | my %ebcdic = ( 176 => 'cp37', 95 => 'cp1047', 106 => 'posix-bc' ); |
8f94de01 | 198 | |
c72e675e KW |
199 | # $a is ISO 8859-1 code points |
200 | from_to($a, 'latin1', $ebcdic{ord '^'}); | |
201 | # $a is in EBCDIC code points | |
8f94de01 JH |
202 | |
203 | For doing I/O it is suggested that you use the autotranslating features | |
204 | of PerlIO, see L<perluniintro>. | |
205 | ||
aa2b82fc JH |
206 | Since version 5.8 Perl uses the PerlIO I/O library. This enables |
207 | you to use different encodings per IO channel. For example you may use | |
208 | ||
209 | use Encode; | |
210 | open($f, ">:encoding(ascii)", "test.ascii"); | |
211 | print $f "Hello World!\n"; | |
212 | open($f, ">:encoding(cp37)", "test.ebcdic"); | |
213 | print $f "Hello World!\n"; | |
214 | open($f, ">:encoding(latin1)", "test.latin1"); | |
215 | print $f "Hello World!\n"; | |
216 | open($f, ">:encoding(utf8)", "test.utf8"); | |
217 | print $f "Hello World!\n"; | |
218 | ||
2c09a866 | 219 | to get four files containing "Hello World!\n" in ASCII, CP 0037 EBCDIC, |
2bbc8d55 | 220 | ISO 8859-1 (Latin-1) (in this example identical to ASCII since only ASCII |
eaf8b9b9 | 221 | characters were printed), and |
2bbc8d55 SP |
222 | UTF-EBCDIC (in this example identical to normal EBCDIC since only characters |
223 | that don't differ between EBCDIC and UTF-EBCDIC were printed). See the | |
aa2b82fc JH |
224 | documentation of Encode::PerlIO for details. |
225 | ||
226 | As the PerlIO layer uses raw IO (bytes) internally, all this totally | |
227 | ignores things like the type of your filesystem (ASCII or EBCDIC). | |
228 | ||
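Encode can also be used on an ASCII platform to inspect how a given EBCDIC character set numbers particular characters. This is a hedged sketch: it assumes an ASCII build of Perl and the encoding names C<cp37>, C<cp1047> and C<posix-bc> used in the examples above; the sample string is an arbitrary choice.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encoding names as used in the from_to() examples above.
for my $charset (qw(cp37 cp1047 posix-bc)) {
    # encode() returns a byte string; list each byte in decimal.
    my @bytes = map { ord } split //, encode($charset, 'A[^');
    printf "%-8s %s\n", $charset, join ' ', @bytes;
}
```

The printed values can be compared with the corresponding columns of the table in L</SINGLE OCTET TABLES>.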
d396a558 JH |
229 | =head1 SINGLE OCTET TABLES |
230 | ||
231 | The following tables list the ASCII and Latin 1 ordered sets including | |
232 | the subsets: C0 controls (00..1f), ASCII graphics (20..7e), delete (7f), | |
eaf8b9b9 | 233 | C1 controls (80..9f), and Latin-1 (a.k.a. ISO 8859-1) (a0..ff), all in hex. In the |
8d725451 | 234 | table, the Latin 1 |
eaf8b9b9 | 235 | extensions to ASCII have been labelled with character names roughly |
8d725451 KW |
236 | corresponding to I<The Unicode Standard, Version 6.1> albeit with |
237 | substitutions such as s/LATIN// and s/VULGAR// in all cases, s/CAPITAL | |
238 | LETTER// in some cases, and s/SMALL LETTER ([A-Z])/\l$1/ in some other | |
0e56abba | 239 | cases. Controls are listed using their Unicode 6.2 abbreviations. |
eaf8b9b9 | 240 | The differences between the 0037 and 1047 sets are |
8d725451 KW |
241 | flagged with **. The differences between the 1047 and POSIX-BC sets |
242 | are flagged with ##. All ord() numbers listed are decimal. If you | |
243 | would rather see this table listing octal values, then run the table | |
244 | (that is, the pod source text of this document, since this recipe may not | |
1e054b24 | 245 | work with a pod2_other_format translation) through: |
d396a558 JH |
246 | |
247 | =over 4 | |
248 | ||
249 | =item recipe 0 | |
250 | ||
251 | =back | |
252 | ||
8d725451 KW |
253 | perl -ne 'if(/(.{29})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)' \ |
254 | -e '{printf("%s%-5.03o%-5.03o%-5.03o%.03o\n",$1,$2,$3,$4,$5)}' \ | |
5f26d5fd | 255 | perlebcdic.pod |
395f5a0c PK |
256 | |
257 | If you want to retain the UTF-x code points then in script form you | |
258 | might want to write: | |
259 | ||
260 | =over 4 | |
261 | ||
262 | =item recipe 1 | |
263 | ||
264 | =back | |
265 | ||
c72e675e KW |
266 | open(FH,"<perlebcdic.pod") or die "Could not open perlebcdic.pod: $!"; |
267 | while (<FH>) { | |
f11f9c4c KW |
268 | if (/(.{29})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\.?(\d*) |
269 | \s+(\d+)\.?(\d*)/x) | |
5f26d5fd | 270 | { |
c72e675e | 271 | if ($7 ne '' && $9 ne '') { |
5f26d5fd | 272 | printf( |
8d725451 | 273 | "%s%-5.03o%-5.03o%-5.03o%-5.03o%-3o.%-5o%-3o.%.03o\n", |
5f26d5fd | 274 | $1,$2,$3,$4,$5,$6,$7,$8,$9); |
c72e675e KW |
275 | } |
276 | elsif ($7 ne '') { | |
8d725451 | 277 | printf("%s%-5.03o%-5.03o%-5.03o%-5.03o%-3o.%-5o%.03o\n", |
c72e675e KW |
278 | $1,$2,$3,$4,$5,$6,$7,$8); |
279 | } | |
280 | else { | |
8d725451 | 281 | printf("%s%-5.03o%-5.03o%-5.03o%-5.03o%-5.03o%.03o\n", |
5f26d5fd | 282 | $1,$2,$3,$4,$5,$6,$8); |
c72e675e KW |
283 | } |
284 | } | |
285 | } | |
d396a558 JH |
286 | |
287 | If you would rather see this table listing hexadecimal values then | |
288 | run the table through: | |
289 | ||
290 | =over 4 | |
291 | ||
395f5a0c | 292 | =item recipe 2 |
d396a558 JH |
293 | |
294 | =back | |
295 | ||
8d725451 KW |
296 | perl -ne 'if(/(.{29})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)' \ |
297 | -e '{printf("%s%-5.02X%-5.02X%-5.02X%.02X\n",$1,$2,$3,$4,$5)}' \ | |
5f26d5fd | 298 | perlebcdic.pod |
395f5a0c PK |
299 | |
300 | Or, in order to retain the UTF-x code points in hexadecimal: | |
301 | ||
302 | =over 4 | |
303 | ||
304 | =item recipe 3 | |
305 | ||
306 | =back | |
307 | ||
c72e675e KW |
308 | open(FH,"<perlebcdic.pod") or die "Could not open perlebcdic.pod: $!"; |
309 | while (<FH>) { | |
f11f9c4c KW |
310 | if (/(.{29})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\.?(\d*) |
311 | \s+(\d+)\.?(\d*)/x) | |
5f26d5fd | 312 | { |
c72e675e | 313 | if ($7 ne '' && $9 ne '') { |
5f26d5fd | 314 | printf( |
8d725451 | 315 | "%s%-5.02X%-5.02X%-5.02X%-5.02X%-2X.%-6.02X%02X.%02X\n", |
c72e675e KW |
316 | $1,$2,$3,$4,$5,$6,$7,$8,$9); |
317 | } | |
318 | elsif ($7 ne '') { | |
8d725451 | 319 | printf("%s%-5.02X%-5.02X%-5.02X%-5.02X%-2X.%-6.02X%02X\n", |
c72e675e KW |
320 | $1,$2,$3,$4,$5,$6,$7,$8); |
321 | } | |
322 | else { | |
8d725451 | 323 | printf("%s%-5.02X%-5.02X%-5.02X%-5.02X%-5.02X%02X\n", |
5f26d5fd | 324 | $1,$2,$3,$4,$5,$6,$8); |
c72e675e KW |
325 | } |
326 | } | |
327 | } | |
395f5a0c PK |
328 | |
329 | ||
8d725451 | 330 | ISO |
f11f9c4c KW |
331 | 8859-1 POS- CCSID |
332 | CCSID CCSID CCSID IX- 1047 | |
8d725451 KW |
333 | chr 0819 0037 1047 BC UTF-8 UTF-EBCDIC |
334 | --------------------------------------------------------------------- | |
335 | <NUL> 0 0 0 0 0 0 | |
336 | <SOH> 1 1 1 1 1 1 | |
337 | <STX> 2 2 2 2 2 2 | |
338 | <ETX> 3 3 3 3 3 3 | |
339 | <EOT> 4 55 55 55 4 55 | |
340 | <ENQ> 5 45 45 45 5 45 | |
341 | <ACK> 6 46 46 46 6 46 | |
342 | <BEL> 7 47 47 47 7 47 | |
343 | <BS> 8 22 22 22 8 22 | |
344 | <HT> 9 5 5 5 9 5 | |
345 | <LF> 10 37 21 21 10 21 ** | |
346 | <VT> 11 11 11 11 11 11 | |
347 | <FF> 12 12 12 12 12 12 | |
348 | <CR> 13 13 13 13 13 13 | |
349 | <SO> 14 14 14 14 14 14 | |
350 | <SI> 15 15 15 15 15 15 | |
351 | <DLE> 16 16 16 16 16 16 | |
352 | <DC1> 17 17 17 17 17 17 | |
353 | <DC2> 18 18 18 18 18 18 | |
354 | <DC3> 19 19 19 19 19 19 | |
355 | <DC4> 20 60 60 60 20 60 | |
356 | <NAK> 21 61 61 61 21 61 | |
357 | <SYN> 22 50 50 50 22 50 | |
358 | <ETB> 23 38 38 38 23 38 | |
359 | <CAN> 24 24 24 24 24 24 | |
360 | <EOM> 25 25 25 25 25 25 | |
361 | <SUB> 26 63 63 63 26 63 | |
362 | <ESC> 27 39 39 39 27 39 | |
363 | <FS> 28 28 28 28 28 28 | |
364 | <GS> 29 29 29 29 29 29 | |
365 | <RS> 30 30 30 30 30 30 | |
366 | <US> 31 31 31 31 31 31 | |
367 | <SPACE> 32 64 64 64 32 64 | |
368 | ! 33 90 90 90 33 90 | |
369 | " 34 127 127 127 34 127 | |
370 | # 35 123 123 123 35 123 | |
371 | $ 36 91 91 91 36 91 | |
372 | % 37 108 108 108 37 108 | |
373 | & 38 80 80 80 38 80 | |
374 | ' 39 125 125 125 39 125 | |
375 | ( 40 77 77 77 40 77 | |
376 | ) 41 93 93 93 41 93 | |
377 | * 42 92 92 92 42 92 | |
378 | + 43 78 78 78 43 78 | |
379 | , 44 107 107 107 44 107 | |
380 | - 45 96 96 96 45 96 | |
381 | . 46 75 75 75 46 75 | |
382 | / 47 97 97 97 47 97 | |
383 | 0 48 240 240 240 48 240 | |
384 | 1 49 241 241 241 49 241 | |
385 | 2 50 242 242 242 50 242 | |
386 | 3 51 243 243 243 51 243 | |
387 | 4 52 244 244 244 52 244 | |
388 | 5 53 245 245 245 53 245 | |
389 | 6 54 246 246 246 54 246 | |
390 | 7 55 247 247 247 55 247 | |
391 | 8 56 248 248 248 56 248 | |
392 | 9 57 249 249 249 57 249 | |
393 | : 58 122 122 122 58 122 | |
394 | ; 59 94 94 94 59 94 | |
395 | < 60 76 76 76 60 76 | |
396 | = 61 126 126 126 61 126 | |
397 | > 62 110 110 110 62 110 | |
398 | ? 63 111 111 111 63 111 | |
399 | @ 64 124 124 124 64 124 | |
400 | A 65 193 193 193 65 193 | |
401 | B 66 194 194 194 66 194 | |
402 | C 67 195 195 195 67 195 | |
403 | D 68 196 196 196 68 196 | |
404 | E 69 197 197 197 69 197 | |
405 | F 70 198 198 198 70 198 | |
406 | G 71 199 199 199 71 199 | |
407 | H 72 200 200 200 72 200 | |
408 | I 73 201 201 201 73 201 | |
409 | J 74 209 209 209 74 209 | |
410 | K 75 210 210 210 75 210 | |
411 | L 76 211 211 211 76 211 | |
412 | M 77 212 212 212 77 212 | |
413 | N 78 213 213 213 78 213 | |
414 | O 79 214 214 214 79 214 | |
415 | P 80 215 215 215 80 215 | |
416 | Q 81 216 216 216 81 216 | |
417 | R 82 217 217 217 82 217 | |
418 | S 83 226 226 226 83 226 | |
419 | T 84 227 227 227 84 227 | |
420 | U 85 228 228 228 85 228 | |
421 | V 86 229 229 229 86 229 | |
422 | W 87 230 230 230 87 230 | |
423 | X 88 231 231 231 88 231 | |
424 | Y 89 232 232 232 89 232 | |
425 | Z 90 233 233 233 90 233 | |
426 | [ 91 186 173 187 91 173 ** ## | |
427 | \ 92 224 224 188 92 224 ## | |
428 | ] 93 187 189 189 93 189 ** | |
429 | ^ 94 176 95 106 94 95 ** ## | |
430 | _ 95 109 109 109 95 109 | |
431 | ` 96 121 121 74 96 121 ## | |
432 | a 97 129 129 129 97 129 | |
433 | b 98 130 130 130 98 130 | |
434 | c 99 131 131 131 99 131 | |
435 | d 100 132 132 132 100 132 | |
436 | e 101 133 133 133 101 133 | |
437 | f 102 134 134 134 102 134 | |
438 | g 103 135 135 135 103 135 | |
439 | h 104 136 136 136 104 136 | |
440 | i 105 137 137 137 105 137 | |
441 | j 106 145 145 145 106 145 | |
442 | k 107 146 146 146 107 146 | |
443 | l 108 147 147 147 108 147 | |
444 | m 109 148 148 148 109 148 | |
445 | n 110 149 149 149 110 149 | |
446 | o 111 150 150 150 111 150 | |
447 | p 112 151 151 151 112 151 | |
448 | q 113 152 152 152 113 152 | |
449 | r 114 153 153 153 114 153 | |
450 | s 115 162 162 162 115 162 | |
451 | t 116 163 163 163 116 163 | |
452 | u 117 164 164 164 117 164 | |
453 | v 118 165 165 165 118 165 | |
454 | w 119 166 166 166 119 166 | |
455 | x 120 167 167 167 120 167 | |
456 | y 121 168 168 168 121 168 | |
457 | z 122 169 169 169 122 169 | |
458 | { 123 192 192 251 123 192 ## | |
459 | | 124 79 79 79 124 79 | |
460 | } 125 208 208 253 125 208 ## | |
461 | ~ 126 161 161 255 126 161 ## | |
462 | <DEL> 127 7 7 7 127 7 | |
463 | <PAD> 128 32 32 32 194.128 32 | |
464 | <HOP> 129 33 33 33 194.129 33 | |
465 | <BPH> 130 34 34 34 194.130 34 | |
466 | <NBH> 131 35 35 35 194.131 35 | |
467 | <IND> 132 36 36 36 194.132 36 | |
468 | <NEL> 133 21 37 37 194.133 37 ** | |
469 | <SSA> 134 6 6 6 194.134 6 | |
470 | <ESA> 135 23 23 23 194.135 23 | |
471 | <HTS> 136 40 40 40 194.136 40 | |
472 | <HTJ> 137 41 41 41 194.137 41 | |
473 | <VTS> 138 42 42 42 194.138 42 | |
474 | <PLD> 139 43 43 43 194.139 43 | |
475 | <PLU> 140 44 44 44 194.140 44 | |
476 | <RI> 141 9 9 9 194.141 9 | |
477 | <SS2> 142 10 10 10 194.142 10 | |
478 | <SS3> 143 27 27 27 194.143 27 | |
479 | <DCS> 144 48 48 48 194.144 48 | |
480 | <PU1> 145 49 49 49 194.145 49 | |
481 | <PU2> 146 26 26 26 194.146 26 | |
482 | <STS> 147 51 51 51 194.147 51 | |
483 | <CCH> 148 52 52 52 194.148 52 | |
484 | <MW> 149 53 53 53 194.149 53 | |
485 | <SPA> 150 54 54 54 194.150 54 | |
486 | <EPA> 151 8 8 8 194.151 8 | |
487 | <SOS> 152 56 56 56 194.152 56 | |
488 | <SGC> 153 57 57 57 194.153 57 | |
489 | <SCI> 154 58 58 58 194.154 58 | |
490 | <CSI> 155 59 59 59 194.155 59 | |
491 | <ST> 156 4 4 4 194.156 4 | |
492 | <OSC> 157 20 20 20 194.157 20 | |
493 | <PM> 158 62 62 62 194.158 62 | |
494 | <APC> 159 255 255 95 194.159 255 ## | |
495 | <NON-BREAKING SPACE> 160 65 65 65 194.160 128.65 | |
496 | <INVERTED "!" > 161 170 170 170 194.161 128.66 | |
497 | <CENT SIGN> 162 74 74 176 194.162 128.67 ## | |
498 | <POUND SIGN> 163 177 177 177 194.163 128.68 | |
499 | <CURRENCY SIGN> 164 159 159 159 194.164 128.69 | |
500 | <YEN SIGN> 165 178 178 178 194.165 128.70 | |
501 | <BROKEN BAR> 166 106 106 208 194.166 128.71 ## | |
502 | <SECTION SIGN> 167 181 181 181 194.167 128.72 | |
503 | <DIAERESIS> 168 189 187 121 194.168 128.73 ** ## | |
504 | <COPYRIGHT SIGN> 169 180 180 180 194.169 128.74 | |
505 | <FEMININE ORDINAL> 170 154 154 154 194.170 128.81 | |
506 | <LEFT POINTING GUILLEMET> 171 138 138 138 194.171 128.82 | |
507 | <NOT SIGN> 172 95 176 186 194.172 128.83 ** ## | |
508 | <SOFT HYPHEN> 173 202 202 202 194.173 128.84 | |
509 | <REGISTERED TRADE MARK> 174 175 175 175 194.174 128.85 | |
510 | <MACRON> 175 188 188 161 194.175 128.86 ## | |
511 | <DEGREE SIGN> 176 144 144 144 194.176 128.87 | |
512 | <PLUS-OR-MINUS SIGN> 177 143 143 143 194.177 128.88 | |
513 | <SUPERSCRIPT TWO> 178 234 234 234 194.178 128.89 | |
514 | <SUPERSCRIPT THREE> 179 250 250 250 194.179 128.98 | |
515 | <ACUTE ACCENT> 180 190 190 190 194.180 128.99 | |
516 | <MICRO SIGN> 181 160 160 160 194.181 128.100 | |
517 | <PARAGRAPH SIGN> 182 182 182 182 194.182 128.101 | |
518 | <MIDDLE DOT> 183 179 179 179 194.183 128.102 | |
519 | <CEDILLA> 184 157 157 157 194.184 128.103 | |
520 | <SUPERSCRIPT ONE> 185 218 218 218 194.185 128.104 | |
521 | <MASC. ORDINAL INDICATOR> 186 155 155 155 194.186 128.105 | |
522 | <RIGHT POINTING GUILLEMET> 187 139 139 139 194.187 128.106 | |
523 | <FRACTION ONE QUARTER> 188 183 183 183 194.188 128.112 | |
524 | <FRACTION ONE HALF> 189 184 184 184 194.189 128.113 | |
525 | <FRACTION THREE QUARTERS> 190 185 185 185 194.190 128.114 | |
526 | <INVERTED QUESTION MARK> 191 171 171 171 194.191 128.115 | |
527 | <A WITH GRAVE> 192 100 100 100 195.128 138.65 | |
528 | <A WITH ACUTE> 193 101 101 101 195.129 138.66 | |
529 | <A WITH CIRCUMFLEX> 194 98 98 98 195.130 138.67 | |
530 | <A WITH TILDE> 195 102 102 102 195.131 138.68 | |
531 | <A WITH DIAERESIS> 196 99 99 99 195.132 138.69 | |
532 | <A WITH RING ABOVE> 197 103 103 103 195.133 138.70 | |
533 | <CAPITAL LIGATURE AE> 198 158 158 158 195.134 138.71 | |
534 | <C WITH CEDILLA> 199 104 104 104 195.135 138.72 | |
535 | <E WITH GRAVE> 200 116 116 116 195.136 138.73 | |
536 | <E WITH ACUTE> 201 113 113 113 195.137 138.74 | |
537 | <E WITH CIRCUMFLEX> 202 114 114 114 195.138 138.81 | |
538 | <E WITH DIAERESIS> 203 115 115 115 195.139 138.82 | |
539 | <I WITH GRAVE> 204 120 120 120 195.140 138.83 | |
540 | <I WITH ACUTE> 205 117 117 117 195.141 138.84 | |
541 | <I WITH CIRCUMFLEX> 206 118 118 118 195.142 138.85 | |
542 | <I WITH DIAERESIS> 207 119 119 119 195.143 138.86 | |
543 | <CAPITAL LETTER ETH> 208 172 172 172 195.144 138.87 | |
544 | <N WITH TILDE> 209 105 105 105 195.145 138.88 | |
545 | <O WITH GRAVE> 210 237 237 237 195.146 138.89 | |
546 | <O WITH ACUTE> 211 238 238 238 195.147 138.98 | |
547 | <O WITH CIRCUMFLEX> 212 235 235 235 195.148 138.99 | |
548 | <O WITH TILDE> 213 239 239 239 195.149 138.100 | |
549 | <O WITH DIAERESIS> 214 236 236 236 195.150 138.101 | |
550 | <MULTIPLICATION SIGN> 215 191 191 191 195.151 138.102 | |
551 | <O WITH STROKE> 216 128 128 128 195.152 138.103 | |
552 | <U WITH GRAVE> 217 253 253 224 195.153 138.104 ## | |
553 | <U WITH ACUTE> 218 254 254 254 195.154 138.105 | |
554 | <U WITH CIRCUMFLEX> 219 251 251 221 195.155 138.106 ## | |
555 | <U WITH DIAERESIS> 220 252 252 252 195.156 138.112 | |
556 | <Y WITH ACUTE> 221 173 186 173 195.157 138.113 ** ## | |
557 | <CAPITAL LETTER THORN> 222 174 174 174 195.158 138.114 | |
558 | <SMALL LETTER SHARP S> 223 89 89 89 195.159 138.115 | |
559 | <a WITH GRAVE> 224 68 68 68 195.160 139.65 | |
560 | <a WITH ACUTE> 225 69 69 69 195.161 139.66 | |
561 | <a WITH CIRCUMFLEX> 226 66 66 66 195.162 139.67 | |
562 | <a WITH TILDE> 227 70 70 70 195.163 139.68 | |
563 | <a WITH DIAERESIS> 228 67 67 67 195.164 139.69 | |
564 | <a WITH RING ABOVE> 229 71 71 71 195.165 139.70 | |
565 | <SMALL LIGATURE ae> 230 156 156 156 195.166 139.71 | |
566 | <c WITH CEDILLA> 231 72 72 72 195.167 139.72 | |
567 | <e WITH GRAVE> 232 84 84 84 195.168 139.73 | |
568 | <e WITH ACUTE> 233 81 81 81 195.169 139.74 | |
569 | <e WITH CIRCUMFLEX> 234 82 82 82 195.170 139.81 | |
570 | <e WITH DIAERESIS> 235 83 83 83 195.171 139.82 | |
571 | <i WITH GRAVE> 236 88 88 88 195.172 139.83 | |
572 | <i WITH ACUTE> 237 85 85 85 195.173 139.84 | |
573 | <i WITH CIRCUMFLEX> 238 86 86 86 195.174 139.85 | |
574 | <i WITH DIAERESIS> 239 87 87 87 195.175 139.86 | |
575 | <SMALL LETTER eth> 240 140 140 140 195.176 139.87 | |
576 | <n WITH TILDE> 241 73 73 73 195.177 139.88 | |
577 | <o WITH GRAVE> 242 205 205 205 195.178 139.89 | |
578 | <o WITH ACUTE> 243 206 206 206 195.179 139.98 | |
579 | <o WITH CIRCUMFLEX> 244 203 203 203 195.180 139.99 | |
580 | <o WITH TILDE> 245 207 207 207 195.181 139.100 | |
581 | <o WITH DIAERESIS> 246 204 204 204 195.182 139.101 | |
582 | <DIVISION SIGN> 247 225 225 225 195.183 139.102 | |
583 | <o WITH STROKE> 248 112 112 112 195.184 139.103 | |
584 | <u WITH GRAVE> 249 221 221 192 195.185 139.104 ## | |
585 | <u WITH ACUTE> 250 222 222 222 195.186 139.105 | |
586 | <u WITH CIRCUMFLEX> 251 219 219 219 195.187 139.106 | |
587 | <u WITH DIAERESIS> 252 220 220 220 195.188 139.112 | |
588 | <y WITH ACUTE> 253 141 141 141 195.189 139.113 | |
589 | <SMALL LETTER thorn> 254 142 142 142 195.190 139.114 | |
590 | <y WITH DIAERESIS> 255 223 223 223 195.191 139.115 | |
d396a558 JH |
591 | |
592 | If you would rather see the above table in CCSID 0037 order rather than | |
593 | ASCII + Latin-1 order then run the table through: | |
594 | ||
595 | =over 4 | |
596 | ||
395f5a0c | 597 | =item recipe 4 |
d396a558 JH |
598 | |
599 | =back | |
600 | ||
5f26d5fd | 601 | perl \ |
8d725451 | 602 | -ne 'if(/.{29}\d{1,3}\s{2,4}\d{1,3}\s{2,4}\d{1,3}\s{2,4}\d{1,3}/)'\ |
84f709e7 JH |
603 | -e '{push(@l,$_)}' \ |
604 | -e 'END{print map{$_->[0]}' \ | |
605 | -e ' sort{$a->[1] <=> $b->[1]}' \ | |
8d725451 | 606 | -e ' map{[$_,substr($_,34,3)]}@l;}' perlebcdic.pod |
d396a558 | 607 | |
2c09a866 | 608 | If you would rather see it in CCSID 1047 order then change the number |
8d725451 | 609 | 34 in the last line to 39, like this: |
d396a558 JH |
610 | |
611 | =over 4 | |
612 | ||
395f5a0c | 613 | =item recipe 5 |
d396a558 JH |
614 | |
615 | =back | |
616 | ||
5f26d5fd | 617 | perl \ |
8d725451 | 618 | -ne 'if(/.{29}\d{1,3}\s{2,4}\d{1,3}\s{2,4}\d{1,3}\s{2,4}\d{1,3}/)'\ |
5f26d5fd KW |
619 | -e '{push(@l,$_)}' \ |
620 | -e 'END{print map{$_->[0]}' \ | |
621 | -e ' sort{$a->[1] <=> $b->[1]}' \ | |
8d725451 | 622 | -e ' map{[$_,substr($_,39,3)]}@l;}' perlebcdic.pod |
d396a558 | 623 | |
2c09a866 | 624 | If you would rather see it in POSIX-BC order then change the number |
8d725451 | 625 | 39 in the last line to 44, like this: |
d396a558 JH |
626 | |
627 | =over 4 | |
628 | ||
395f5a0c | 629 | =item recipe 6 |
d396a558 JH |
630 | |
631 | =back | |
632 | ||
5f26d5fd | 633 | perl \ |
8d725451 | 634 | -ne 'if(/.{29}\d{1,3}\s{2,4}\d{1,3}\s{2,4}\d{1,3}\s{2,4}\d{1,3}/)'\ |
84f709e7 JH |
635 | -e '{push(@l,$_)}' \ |
636 | -e 'END{print map{$_->[0]}' \ | |
637 | -e ' sort{$a->[1] <=> $b->[1]}' \ | |
8d725451 | 638 | -e ' map{[$_,substr($_,44,3)]}@l;}' perlebcdic.pod |
d396a558 JH |
639 | |
640 | ||
641 | =head1 IDENTIFYING CHARACTER CODE SETS | |
642 | ||
eaf8b9b9 KW |
643 | To determine the character set you are running under from perl one |
644 | could use the return value of ord() or chr() to test one or more | |
d396a558 JH |
645 | character values. For example: |
646 | ||
84f709e7 JH |
647 | $is_ascii = "A" eq chr(65); |
648 | $is_ebcdic = "A" eq chr(193); | |
d396a558 | 649 | |
51b5cecb | 650 | Also, "\t" is a C<HORIZONTAL TABULATION> character so that: |
d396a558 | 651 | |
84f709e7 JH |
652 | $is_ascii = ord("\t") == 9; |
653 | $is_ebcdic = ord("\t") == 5; | |
d396a558 | 654 | |
b439bde5 | 655 | To distinguish between EBCDIC code pages try looking at one or more of |
d396a558 JH |
656 | the characters that differ between them. For example: |
657 | ||
84f709e7 JH |
658 | $is_ebcdic_37 = "\n" eq chr(37); |
659 | $is_ebcdic_1047 = "\n" eq chr(21); | |
d396a558 JH |
660 | |
661 | Or better still choose a character that is uniquely encoded in any | |
662 | of the code sets, e.g.: | |
663 | ||
84f709e7 JH |
664 | $is_ascii = ord('[') == 91; |
665 | $is_ebcdic_37 = ord('[') == 186; | |
666 | $is_ebcdic_1047 = ord('[') == 173; | |
667 | $is_ebcdic_POSIX_BC = ord('[') == 187; | |
d396a558 JH |
668 | |
669 | However, it would be unwise to write tests such as: | |
670 | ||
84f709e7 JH |
671 | $is_ascii = "\r" ne chr(13); # WRONG |
672 | $is_ascii = "\n" ne chr(10); # ILL ADVISED | |
d396a558 | 673 | |
2bbc8d55 | 674 | Obviously the first of these will fail to distinguish most ASCII platforms |
eaf8b9b9 KW |
675 | from either a CCSID 0037, a 1047, or a POSIX-BC EBCDIC platform since "\r" eq |
676 | chr(13) under all of those coded character sets. But note too that | |
677 | because "\n" is chr(13) and "\r" is chr(10) on classic Mac OS (which is an | |
2bbc8d55 | 678 | ASCII platform) the second C<$is_ascii> test will lead to trouble there. |
d396a558 | 679 | |
eaf8b9b9 | 680 | To determine whether or not perl was built under an EBCDIC |
d396a558 JH |
681 | code page you can use the Config module like so: |
682 | ||
683 | use Config; | |
84f709e7 | 684 | $is_ebcdic = $Config{'ebcdic'} eq 'define'; |
d396a558 JH |
685 | |
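The C<ord('[')> tests above can be gathered into one lookup. This is only a sketch: the helper name C<guess_charset> and the returned labels are hypothetical, and a platform whose character set is not one of the four listed is simply reported as unknown.

```perl
use strict;
use warnings;

# ord('[') for each character set, taken from the tests above.
my %charset_for_bracket = (
     91 => 'ascii',
    186 => 'cp37',
    173 => 'cp1047',
    187 => 'posix-bc',
);

# Hypothetical helper: name the character set implied by a given
# ord('[') value, defaulting to the value on the running platform.
sub guess_charset {
    my $bracket = @_ ? shift : ord('[');
    return exists $charset_for_bracket{$bracket}
        ? $charset_for_bracket{$bracket}
        : 'unknown';
}

print "This perl appears to use: ", guess_charset(), "\n";
```

Passing an explicit number, as in C<guess_charset(173)>, makes it possible to exercise the lookup without access to an EBCDIC machine.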
686 | =head1 CONVERSIONS | |
687 | ||
d5924ca6 KW |
688 | =head2 C<utf8::unicode_to_native()> and C<utf8::native_to_unicode()> |
689 | ||
690 | These functions take an input numeric code point in one encoding and | |
691 | return what its equivalent value is in the other. | |
692 | ||
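A short illustration of the two functions follows. It is a sketch rather than a definitive recipe: the variable names are illustrative, and on an ASCII platform both functions return their argument unchanged, so the conversion only becomes visible on an EBCDIC build.

```perl
use strict;
use warnings;

# 65 is the Unicode code point of "A"; convert it to the native value.
my $native = utf8::unicode_to_native(65);
printf "native code point of 'A': %d\n", $native;

# Converting back should return the original Unicode code point.
my $unicode = utf8::native_to_unicode($native);
printf "round trip: %d\n", $unicode;

# chr() works with native code points, so this should hold everywhere.
print "chr gives 'A'\n" if chr($native) eq 'A';
```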
1e054b24 PP |
693 | =head2 tr/// |
694 | ||
eaf8b9b9 | 695 | In order to convert a string of characters from one character set to |
d396a558 | 696 | another a simple list of numbers, such as in the right columns in the |
eaf8b9b9 | 697 | above table, along with perl's tr/// operator is all that is needed. |
5f26d5fd | 698 | The data in the table are in ASCII/Latin1 order, hence the EBCDIC columns |
eaf8b9b9 | 699 | provide easy-to-use ASCII/Latin1 to EBCDIC operations that are also easily |
d396a558 JH |
700 | reversed. |
701 | ||
5f26d5fd KW |
702 | For example, to convert ASCII/Latin1 to code page 037 take the output of the |
703 | second numbers column from the output of recipe 2 (modified to add '\' | |
5d9fe53c | 704 | characters), and use it in tr/// like so: |
d396a558 | 705 | |
eaf8b9b9 | 706 | $cp_037 = |
5f26d5fd KW |
707 | '\x00\x01\x02\x03\x37\x2D\x2E\x2F\x16\x05\x25\x0B\x0C\x0D\x0E\x0F' . |
708 | '\x10\x11\x12\x13\x3C\x3D\x32\x26\x18\x19\x3F\x27\x1C\x1D\x1E\x1F' . | |
709 | '\x40\x5A\x7F\x7B\x5B\x6C\x50\x7D\x4D\x5D\x5C\x4E\x6B\x60\x4B\x61' . | |
710 | '\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF7\xF8\xF9\x7A\x5E\x4C\x7E\x6E\x6F' . | |
711 | '\x7C\xC1\xC2\xC3\xC4\xC5\xC6\xC7\xC8\xC9\xD1\xD2\xD3\xD4\xD5\xD6' . | |
712 | '\xD7\xD8\xD9\xE2\xE3\xE4\xE5\xE6\xE7\xE8\xE9\xBA\xE0\xBB\xB0\x6D' . | |
713 | '\x79\x81\x82\x83\x84\x85\x86\x87\x88\x89\x91\x92\x93\x94\x95\x96' . | |
714 | '\x97\x98\x99\xA2\xA3\xA4\xA5\xA6\xA7\xA8\xA9\xC0\x4F\xD0\xA1\x07' . | |
715 | '\x20\x21\x22\x23\x24\x15\x06\x17\x28\x29\x2A\x2B\x2C\x09\x0A\x1B' . | |
716 | '\x30\x31\x1A\x33\x34\x35\x36\x08\x38\x39\x3A\x3B\x04\x14\x3E\xFF' . | |
717 | '\x41\xAA\x4A\xB1\x9F\xB2\x6A\xB5\xBD\xB4\x9A\x8A\x5F\xCA\xAF\xBC' . | |
718 | '\x90\x8F\xEA\xFA\xBE\xA0\xB6\xB3\x9D\xDA\x9B\x8B\xB7\xB8\xB9\xAB' . | |
719 | '\x64\x65\x62\x66\x63\x67\x9E\x68\x74\x71\x72\x73\x78\x75\x76\x77' . | |
720 | '\xAC\x69\xED\xEE\xEB\xEF\xEC\xBF\x80\xFD\xFE\xFB\xFC\xAD\xAE\x59' . | |
721 | '\x44\x45\x42\x46\x43\x47\x9C\x48\x54\x51\x52\x53\x58\x55\x56\x57' . | |
722 | '\x8C\x49\xCD\xCE\xCB\xCF\xCC\xE1\x70\xDD\xDE\xDB\xDC\x8D\x8E\xDF'; | |
d396a558 JH |
723 | |
724 | my $ebcdic_string = $ascii_string; | |
5f26d5fd | 725 | eval '$ebcdic_string =~ tr/\000-\377/' . $cp_037 . '/'; |
d396a558 | 726 | |
0be03469 | 727 | To convert from EBCDIC 037 to ASCII just reverse the order of the tr/// |
d396a558 JH |
728 | arguments like so: |
729 | ||
730 | my $ascii_string = $ebcdic_string; | |
5f26d5fd KW |
731 | eval '$ascii_string =~ tr/' . $cp_037 . '/\000-\377/'; |
732 | ||
733 | Similarly one could take the output of the third numbers column from recipe 2 | |
734 | to obtain a C<$cp_1047> table. The fourth numbers column of the output from | |
735 | recipe 2 could provide a C<$cp_posix_bc> table suitable for transcoding as | |
736 | well. | |
d5d9880c | 737 | |
5f26d5fd KW |
738 | If you wanted to see the inverse tables, you would first have to sort on the |
739 | desired numbers column as in recipes 4, 5 or 6, then take the output of the | |
740 | first numbers column. | |
1e054b24 PP |
741 | |
742 | =head2 iconv | |
d396a558 | 743 | |
d5d9880c | 744 | XPG operability often implies the presence of an I<iconv> utility |
d396a558 JH |
745 | available from the shell or from the C library. Consult your system's |
746 | documentation for information on iconv. | |
747 | ||
eaf8b9b9 | 748 | On OS/390 or z/OS see the iconv(1) manpage. One way to invoke the iconv |
d396a558 JH |
749 | shell utility from within perl would be to: |
750 | ||
395f5a0c | 751 | # OS/390 or z/OS example |
84f709e7 | 752 | $ascii_data = `echo '$ebcdic_data'| iconv -f IBM-1047 -t ISO8859-1`; |
d396a558 JH |
753 | |
754 | or the inverse map: | |
755 | ||
395f5a0c | 756 | # OS/390 or z/OS example |
84f709e7 | 757 | $ebcdic_data = `echo '$ascii_data'| iconv -f ISO8859-1 -t IBM-1047`; |
d396a558 | 758 | |
8a50e6a3 | 759 | For other perl-based conversion options see the Convert::* modules on CPAN. |
d396a558 | 760 | |
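One further perl-based option, offered here as a sketch rather than a recommendation from the original text, is the core L<Encode> module, whose C<Encode::EBCDIC> component provides the encodings C<cp37>, C<cp1047> and C<posix-bc>. The example assumes it is run on an ASCII platform; the sample string is arbitrary. Note that Encode's tables may treat the newline controls differently from the mappings perl itself uses, so check those characters before relying on this for line-oriented data.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text = 'Perl [5]';

for my $page (qw(cp37 cp1047 posix-bc)) {
    # Encode the characters into the bytes of the given EBCDIC code page.
    my $bytes = encode($page, $text);

    # Show the bytes in hex so the differing '[' and ']' values stand out.
    printf "%-8s %s\n", $page,
        join ' ', map { sprintf '%02X', ord } split //, $bytes;

    # Decoding with the same code page should give the original text back.
    my $round_trip = decode($page, $bytes);
    print "  round trip ok\n" if $round_trip eq $text;
}
```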
1e054b24 PP |
761 | =head2 C RTL |
762 | ||
8a50e6a3 | 763 | The OS/390 and z/OS C run-time libraries provide _atoe() and _etoa() functions. |
1e054b24 | 764 | |
d396a558 JH |
765 | =head1 OPERATOR DIFFERENCES |
766 | ||
eaf8b9b9 | 767 | The C<..> range operator treats certain character ranges with |
2bbc8d55 SP |
768 | care on EBCDIC platforms. For example the following array |
769 | will have twenty-six elements on either an EBCDIC platform |
770 | or an ASCII platform: | |
d396a558 | 771 | |
84f709e7 | 772 | @alphabet = ('A'..'Z'); # $#alphabet == 25 |
d396a558 JH |
773 | |
774 | The bitwise operators such as & ^ | may return different results | |
eaf8b9b9 | 775 | when operating on string or character data in a perl program running |
2bbc8d55 | 776 | on an EBCDIC platform than when run on an ASCII platform. Here is |
d396a558 JH |
777 | an example adapted from the one in L<perlop>: |
778 | ||
779 | # EBCDIC-based examples | |
84f709e7 | 780 | print "j p \n" ^ " a h"; # prints "JAPH\n" |
eaf8b9b9 | 781 | print "JA" | " ph\n"; # prints "japh\n" |
84f709e7 JH |
782 | print "JAPH\nJunk" & "\277\277\277\277\277"; # prints "japh\n"; |
783 | print 'p N$' ^ " E<H\n"; # prints "Perl\n"; | |
d396a558 JH |
784 | |
785 | An interesting property of the 32 C0 control characters | |
786 | in the ASCII table is that they can "literally" be constructed | |
c72e675e KW |
787 | as control characters in perl, e.g. (C<chr(0)> eq C<\c@>), |
788 | (C<chr(1)> eq C<\cA>), and so on. Perl on EBCDIC platforms has been |
2c09a866 | 789 | ported to take C<\c@> to chr(0) and C<\cA> to chr(1), etc. as well, but the |
2bd1cbf6 | 790 | characters that result depend on which code page you are |
2c09a866 KW |
791 | using. The table below uses the standard acronyms for the controls. |
792 | The POSIX-BC and 1047 sets are | |
eaf8b9b9 | 793 | identical throughout this range and differ from the 0037 set at only |
51b5cecb | 794 | one spot (21 decimal). Note that the C<LINE FEED> character |
eaf8b9b9 KW |
795 | may be generated by C<\cJ> on ASCII platforms but by C<\cU> on 1047 or POSIX-BC |
796 | platforms and cannot be generated as a C<"\c.letter."> control character on | |
2c09a866 KW |
797 | 0037 platforms. Note also that C<\c\> cannot be the final element in a string |
798 | or regex, as it will absorb the terminator. But C<\c\I<X>> is a C<FILE | |
799 | SEPARATOR> concatenated with I<X> for all I<X>. | |
2bd1cbf6 KW |
800 | The outlier C<\c?> on ASCII, which yields a non-C0 control C<DEL>, |
801 | yields the outlier control C<APC> on EBCDIC, the one that isn't in the | |
802 | block of contiguous controls. | |
2c09a866 | 803 | |
eaf8b9b9 | 804 | chr ord 8859-1 0037 1047 && POSIX-BC |
c72e675e | 805 | ----------------------------------------------------------------------- |
2c09a866 | 806 | \c@ 0 <NUL> <NUL> <NUL> |
eaf8b9b9 | 807 | \cA 1 <SOH> <SOH> <SOH> |
2c09a866 KW |
808 | \cB 2 <STX> <STX> <STX> |
809 | \cC 3 <ETX> <ETX> <ETX> | |
eaf8b9b9 KW |
810 | \cD 4 <EOT> <ST> <ST> |
811 | \cE 5 <ENQ> <HT> <HT> | |
812 | \cF 6 <ACK> <SSA> <SSA> | |
813 | \cG 7 <BEL> <DEL> <DEL> | |
814 | \cH 8 <BS> <EPA> <EPA> | |
815 | \cI 9 <HT> <RI> <RI> | |
816 | \cJ 10 <LF> <SS2> <SS2> | |
2c09a866 | 817 | \cK 11 <VT> <VT> <VT> |
eaf8b9b9 KW |
818 | \cL 12 <FF> <FF> <FF> |
819 | \cM 13 <CR> <CR> <CR> | |
2c09a866 KW |
820 | \cN 14 <SO> <SO> <SO> |
821 | \cO 15 <SI> <SI> <SI> | |
eaf8b9b9 | 822 | \cP 16 <DLE> <DLE> <DLE> |
2c09a866 KW |
823 | \cQ 17 <DC1> <DC1> <DC1> |
824 | \cR 18 <DC2> <DC2> <DC2> | |
eaf8b9b9 KW |
825 | \cS 19 <DC3> <DC3> <DC3> |
826 | \cT 20 <DC4> <OSC> <OSC> | |
8d725451 | 827 | \cU 21 <NAK> <NEL> <LF> ** |
2c09a866 | 828 | \cV 22 <SYN> <BS> <BS> |
eaf8b9b9 | 829 | \cW 23 <ETB> <ESA> <ESA> |
2c09a866 KW |
830 | \cX 24 <CAN> <CAN> <CAN> |
831 | \cY 25 <EOM> <EOM> <EOM> | |
eaf8b9b9 KW |
832 | \cZ 26 <SUB> <PU2> <PU2> |
833 | \c[ 27 <ESC> <SS3> <SS3> | |
2c09a866 KW |
834 | \c\X 28 <FS>X <FS>X <FS>X |
835 | \c] 29 <GS> <GS> <GS> | |
836 | \c^ 30 <RS> <RS> <RS> | |
837 | \c_ 31 <US> <US> <US> | |
2bd1cbf6 KW |
838 | \c? * <DEL> <APC> <APC> |
839 | ||
840 | C<*> Note: C<\c?> maps to ordinal 127 (C<DEL>) on ASCII platforms, but | |
841 | since ordinal 127 is not a control character on EBCDIC machines, |
842 | C<\c?> instead maps to C<APC>, which is 255 in 0037 and 1047, and 95 in | |
843 | POSIX-BC. | |
d396a558 JH |
844 | |
845 | =head1 FUNCTION DIFFERENCES | |
846 | ||
847 | =over 8 | |
848 | ||
849 | =item chr() | |
850 | ||
eaf8b9b9 | 851 | chr() must be given an EBCDIC code number argument to yield a desired |
2bbc8d55 | 852 | character return value on an EBCDIC platform. For example: |
d396a558 | 853 | |
84f709e7 | 854 | $CAPITAL_LETTER_A = chr(193); |
d396a558 JH |
855 | |
856 | =item ord() | |
857 | ||
2bbc8d55 | 858 | ord() will return EBCDIC code number values on an EBCDIC platform. |
d396a558 JH |
859 | For example: |
860 | ||
84f709e7 | 861 | $the_number_193 = ord("A"); |
d396a558 JH |
862 | |
863 | =item pack() | |
864 | ||
eaf8b9b9 | 865 | The c and C templates for pack() are dependent upon character set |
d396a558 JH |
866 | encoding. Examples of usage on EBCDIC include: |
867 | ||
868 | $foo = pack("CCCC",193,194,195,196); | |
869 | # $foo eq "ABCD" | |
84f709e7 | 870 | $foo = pack("C4",193,194,195,196); |
d396a558 JH |
871 | # same thing |
872 | ||
873 | $foo = pack("ccxxcc",193,194,195,196); | |
874 | # $foo eq "AB\0\0CD" | |
875 | ||
876 | =item print() | |
877 | ||
878 | One must be careful with scalars and strings that are passed to | |
879 | print that contain ASCII encodings. One common place | |
880 | for this to occur is in the output of the MIME type header for | |
eaf8b9b9 | 881 | CGI script writing. For example, many perl programming guides |
d396a558 JH |
882 | recommend something similar to: |
883 | ||
eaf8b9b9 | 884 | print "Content-type:\ttext/html\015\012\015\012"; |
d396a558 JH |
885 | # this may be wrong on EBCDIC |
886 | ||
eaf8b9b9 | 887 | Under the IBM OS/390 USS Web Server or WebSphere on z/OS for example |
395f5a0c | 888 | you should instead write that as: |
d396a558 | 889 | |
5f26d5fd | 890 | print "Content-type:\ttext/html\r\n\r\n"; # OK for DGW et al |
d396a558 JH |
891 | |
892 | That is because the translation from EBCDIC to ASCII is done | |
893 | by the web server in this case (such code will not be appropriate for | |
eaf8b9b9 | 894 | the Macintosh however). Consult your web server's documentation for |
d396a558 JH |
895 | further details. |
896 | ||
897 | =item printf() | |
898 | ||
899 | The formats that can convert characters to numbers and vice versa | |
900 | will be different from their ASCII counterparts when executed | |
2bbc8d55 | 901 | on an EBCDIC platform. Examples include: |
d396a558 JH |
902 | |
903 | printf("%c%c%c",193,194,195); # prints ABC | |
904 | ||
905 | =item sort() | |
906 | ||
eaf8b9b9 | 907 | EBCDIC sort results may differ from ASCII sort results especially for |
d396a558 JH |
908 | mixed case strings. This is discussed in more detail below. |
909 | ||
910 | =item sprintf() | |
911 | ||
912 | See the discussion of printf() above. An example of the use | |
913 | of sprintf would be: | |
914 | ||
84f709e7 | 915 | $CAPITAL_LETTER_A = sprintf("%c",193); |
d396a558 JH |
916 | |
917 | =item unpack() | |
918 | ||
919 | See the discussion of pack() above. | |
920 | ||
921 | =back | |
922 | ||
923 | =head1 REGULAR EXPRESSION DIFFERENCES | |
924 | ||
eaf8b9b9 KW |
925 | As of perl 5.005_03 the letter range regular expressions such as |
926 | [A-Z] and [a-z] have been especially coded to not pick up gap | |
927 | characters. For example, characters such as E<ocirc> C<o WITH CIRCUMFLEX> | |
928 | that lie between I and J would not be matched by the | |
1b2d223b JH |
929 | regular expression range C</[H-K]/>. This works in |
930 | the other direction, too, if either of the range end points is | |
931 | explicitly numeric: C<[\x89-\x91]> will match C<\x8e>, even | |
932 | though C<\x89> is C<i> and C<\x91> is C<j>, and C<\x8e> |
933 | is a gap character from the alphabetic viewpoint. | |
51b5cecb | 934 | |
eaf8b9b9 KW |
935 | If you do want to match the alphabet gap characters in a single octet |
936 | regular expression try matching the hex or octal code such | |
937 | as C</\313/> on EBCDIC or C</\364/> on ASCII platforms to | |
51b5cecb | 938 | have your regular expression match C<o WITH CIRCUMFLEX>. |
d396a558 | 939 | |
51b5cecb | 940 | Another construct to be wary of is the inappropriate use of hex or |
d396a558 JH |
941 | octal constants in regular expressions. Consider the following |
942 | set of subs: | |
943 | ||
944 | sub is_c0 { | |
945 | my $char = substr(shift,0,1); | |
946 | $char =~ /[\000-\037]/; | |
947 | } | |
948 | ||
949 | sub is_print_ascii { | |
950 | my $char = substr(shift,0,1); | |
951 | $char =~ /[\040-\176]/; | |
952 | } | |
953 | ||
954 | sub is_delete { | |
955 | my $char = substr(shift,0,1); | |
956 | $char eq "\177"; | |
957 | } | |
958 | ||
959 | sub is_c1 { | |
960 | my $char = substr(shift,0,1); | |
961 | $char =~ /[\200-\237]/; | |
962 | } | |
963 | ||
10c526cf | 964 | sub is_latin_1 { # But not ASCII; not C1 |
d396a558 JH |
965 | my $char = substr(shift,0,1); |
966 | $char =~ /[\240-\377]/; | |
967 | } | |
968 | ||
10c526cf KW |
969 | These are valid only on ASCII platforms, but can be easily rewritten to |
970 | work on any platform as follows: | |
d396a558 JH |
971 | |
972 | sub Is_c0 { | |
973 | my $char = substr(shift,0,1); | |
f11f9c4c KW |
974 | return $char =~ /[[:cntrl:]]/a && ! Is_delete($char); |
975 | ||
976 | # Alternatively: | |
977 | # return $char =~ /[[:cntrl:]]/ | |
978 | # && $char =~ /[[:ascii:]]/ | |
979 | # && ! Is_delete($char); | |
d396a558 JH |
980 | } |
981 | ||
982 | sub Is_print_ascii { | |
983 | my $char = substr(shift,0,1); | |
10c526cf | 984 | |
f11f9c4c | 985 | return $char =~ /[[:print:]]/a; |
10c526cf KW |
986 | |
987 | # Alternatively: | |
f11f9c4c KW |
988 | # return $char =~ /[[:print:]]/ && $char =~ /[[:ascii:]]/; |
989 | ||
990 | # Or | |
10c526cf KW |
991 | # return $char |
992 | # =~ /[ !"\#\$%&'()*+,\-.\/0-9:;<=>?\@A-Z[\\\]^_`a-z{|}~]/; | |
d396a558 JH |
993 | } |
994 | ||
995 | sub Is_delete { | |
996 | my $char = substr(shift,0,1); | |
10c526cf | 997 | return utf8::native_to_unicode(ord $char) == 0x7F; |
d396a558 JH |
998 | } |
999 | ||
1000 | sub Is_c1 { | |
10c526cf | 1001 | use feature 'unicode_strings'; |
d396a558 | 1002 | my $char = substr(shift,0,1); |
10c526cf | 1003 | return $char =~ /[[:cntrl:]]/ && $char !~ /[[:ascii:]]/; |
d396a558 JH |
1004 | } |
1005 | ||
10c526cf KW |
1006 | sub Is_latin_1 { # But not ASCII; not C1 |
1007 | use feature 'unicode_strings'; | |
d396a558 | 1008 | my $char = substr(shift,0,1); |
10c526cf KW |
1009 | return ord($char) < 256 |
1010 | && $char !~ /[[:ascii:]]/ |
1011 | && $char !~ /[[:cntrl:]]/; |
d396a558 JH |
1012 | } |
1013 | ||
10c526cf | 1014 | Another way to write C<Is_latin_1()> would be |
d396a558 JH |
1015 | to use the characters in the range explicitly: |
1016 | ||
1017 | sub Is_latin_1 { | |
1018 | my $char = substr(shift,0,1); | |
f11f9c4c KW |
1019 | $char =~ /[ ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ] |
1020 | | [ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ]/x; |
d396a558 JH |
1021 | } |
1022 | ||
eaf8b9b9 | 1023 | That form, however, may run into trouble in network transit (due to the |
d396a558 | 1024 | presence of 8 bit characters) or on non ISO-Latin character sets. |
d396a558 JH |
1025 | |
1026 | =head1 SOCKETS | |
1027 | ||
1028 | Most socket programming assumes ASCII character encodings in network | |
1029 | byte order. Exceptions can include CGI script writing under a | |
1030 | host web server where the server may take care of translation for you. | |
1031 | Most host web servers convert EBCDIC data to ISO-8859-1 or Unicode on | |
1032 | output. | |
1033 | ||
1034 | =head1 SORTING | |
1035 | ||
8a50e6a3 | 1036 | One big difference between ASCII-based character sets and EBCDIC ones |
d396a558 | 1037 | is the relative positions of upper and lower case letters and the |
8a50e6a3 FC |
1038 | letters compared to the digits. If sorted on an ASCII-based platform the |
1039 | two-letter abbreviation for a physician comes before the two-letter |
1040 | abbreviation for drive; that is: | |
d396a558 | 1041 | |
c72e675e | 1042 | @sorted = sort(qw(Dr. dr.)); # @sorted holds ('Dr.','dr.') on ASCII, |
84f709e7 | 1043 | # but ('dr.','Dr.') on EBCDIC |
d396a558 | 1044 | |
8a50e6a3 | 1045 | The property of lowercase before uppercase letters in EBCDIC is |
d396a558 | 1046 | even carried to the Latin 1 EBCDIC pages such as 0037 and 1047. |
eaf8b9b9 KW |
1047 | An example would be that E<Euml> C<E WITH DIAERESIS> (203) comes |
1048 | before E<euml> C<e WITH DIAERESIS> (235) on an ASCII platform, but | |
1049 | the latter (83) comes before the former (115) on an EBCDIC platform. | |
1050 | (Astute readers will note that the uppercase version of E<szlig> | |
1051 | C<SMALL LETTER SHARP S> is simply "SS" and that the upper case version of | |
1052 | E<yuml> C<y WITH DIAERESIS> is not in the 0..255 range but it is | |
51b5cecb | 1053 | at U+0178 in Unicode, or C<"\x{178}"> in a Unicode enabled Perl). |
d396a558 JH |
1054 | |
1055 | The sort order will cause differences between results obtained on | |
2bbc8d55 | 1056 | ASCII platforms versus EBCDIC platforms. What follows are some suggestions |
d396a558 JH |
1057 | on how to deal with these differences. |
1058 | ||
51b5cecb | 1059 | =head2 Ignore ASCII vs. EBCDIC sort differences. |
d396a558 JH |
1060 | |
1061 | This is the least computationally expensive strategy. It may require | |
1062 | some user education. | |
1063 | ||
51b5cecb | 1064 | =head2 MONO CASE then sort data. |
d396a558 | 1065 | |
8a50e6a3 | 1066 | In order to minimize the expense of mono casing mixed-case text, try to |
d396a558 JH |
1067 | C<tr///> towards the character set case most employed within the data. |
1068 | If the data are primarily UPPERCASE non Latin 1 then apply tr/[a-z]/[A-Z]/ | |
1069 | then sort(). If the data are primarily lowercase non Latin 1 then | |
1070 | apply tr/[A-Z]/[a-z]/ before sorting. If the data are primarily UPPERCASE | |
eaf8b9b9 | 1071 | and include Latin-1 characters then apply: |
51b5cecb | 1072 | |
b693e169 KW |
1073 | tr/[a-z]/[A-Z]/; |
1074 | tr/[àáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ]/[ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ]/; |
1075 | s/ß/SS/g; | |
d396a558 | 1076 | |
eaf8b9b9 KW |
1077 | then sort(). Do note however that such Latin-1 manipulation does not |
1078 | address the E<yuml> C<y WITH DIAERESIS> character that will remain at | |
1079 | code point 255 on ASCII platforms, but 223 on most EBCDIC platforms | |
1080 | where it will sort to a place less than the EBCDIC numerals. With a | |
8a50e6a3 | 1081 | Unicode-enabled Perl you might try: |
d396a558 | 1082 | |
51b5cecb PP |
1083 | tr/ÿ/\x{178}/; |
1084 | ||
eaf8b9b9 | 1085 | The strategy of mono casing data before sorting does not preserve the case |
51b5cecb PP |
1086 | of the data and may not be acceptable for that reason. |
1087 | ||
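If losing the original case is the only objection, one possible variation is to sort on a mono-cased copy of each string while returning the original strings. The sketch below illustrates that idea. The helper name C<sort_ignoring_case()> is hypothetical, ties are broken by input position so that the result does not depend on the platform's relative order of upper and lower case, and the relative order of letters, digits and punctuation still differs between ASCII and EBCDIC.

```perl
use strict;
use warnings;

# Hypothetical helper: sort on lowercased keys, return the original strings.
sub sort_ignoring_case {
    my @input = @_;

    # Pair each string with its lowercased key and its input position.
    my @keyed = map { [ lc $input[$_], $_, $input[$_] ] } 0 .. $#input;

    # Compare keys first; equal keys keep their input order.
    return map  { $_->[2] }
           sort { $a->[0] cmp $b->[0] or $a->[1] <=> $b->[1] } @keyed;
}

my @sorted = sort_ignoring_case(qw(dr. Banana Dr. apple));
print "@sorted\n";    # apple Banana dr. Dr.
```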
1088 | =head2 Convert, sort data, then re convert. | |
d396a558 JH |
1089 | |
1090 | This is the most expensive proposition that does not employ a network | |
1091 | connection. | |
1092 | ||
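One way to sketch this strategy without a full transcoding table is to build a sort key in which every character is replaced by the character whose native code point equals the original's Unicode code point, then sort on the keys and return the untouched originals. The helper names C<unicode_key()> and C<sort_in_unicode_order()> are hypothetical, and the sketch assumes single-octet strings (code points below 256). On an ASCII platform the key equals the input.

```perl
use strict;
use warnings;

# Hypothetical helper: renumber each character by its Unicode code point.
sub unicode_key {
    my ($string) = @_;
    return join '',
        map { chr(utf8::native_to_unicode(ord($_))) } split //, $string;
}

# Sort on the converted keys, but return the original strings.
sub sort_in_unicode_order {
    return map  { $_->[1] }
           sort { $a->[0] cmp $b->[0] }
           map  { [ unicode_key($_), $_ ] } @_;
}

my @sorted = sort_in_unicode_order(qw(dr. Dr.));
print "@sorted\n";
```

Because the keys are built once per element rather than once per comparison, the conversion cost grows linearly with the size of the list.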
2bbc8d55 | 1093 | =head2 Perform sorting on one type of platform only. |
d396a558 JH |
1094 | |
1095 | This strategy can employ a network connection. As such | |
1096 | it would be computationally expensive. | |
1097 | ||
395f5a0c | 1098 | =head1 TRANSFORMATION FORMATS |
1e054b24 | 1099 | |
eaf8b9b9 KW |
1100 | There are a variety of ways of transforming data with an intra character set |
1101 | mapping that serve a variety of purposes. Sorting was discussed in the | |
1102 | previous section and a few of the other more popular mapping techniques are | |
1e054b24 PP |
1103 | discussed next. |
1104 | ||
1105 | =head2 URL decoding and encoding | |
d396a558 | 1106 | |
51b5cecb | 1107 | Note that some URLs have hexadecimal ASCII code points in them in an |
eaf8b9b9 | 1108 | attempt to overcome character or protocol limitation issues. For example |
1e054b24 | 1109 | the tilde character is not on every keyboard hence a URL of the form: |
d396a558 JH |
1110 | |
1111 | http://www.pvhp.com/~pvhp/ | |
1112 | ||
1113 | may also be expressed as either of: | |
1114 | ||
1115 | http://www.pvhp.com/%7Epvhp/ | |
1116 | ||
1117 | http://www.pvhp.com/%7epvhp/ | |
1118 | ||
51b5cecb | 1119 | where 7E is the hexadecimal ASCII code point for '~'. Here is an example |
f11f9c4c | 1120 | of decoding such a URL in any EBCDIC code page: |
d396a558 | 1121 | |
84f709e7 | 1122 | $url = 'http://www.pvhp.com/%7Epvhp/'; |
f11f9c4c KW |
1123 | $url =~ s/%([0-9a-fA-F]{2})/ |
1124 | pack("c",utf8::unicode_to_native(hex($1)))/xge; | |
d396a558 | 1125 | |
eaf8b9b9 | 1126 | Conversely, here is a partial solution for the task of encoding such |
f11f9c4c | 1127 | a URL in any EBCDIC code page: |
1e054b24 | 1128 | |
84f709e7 | 1129 | $url = 'http://www.pvhp.com/~pvhp/'; |
eaf8b9b9 KW |
1130 | # The following regular expression does not address the |
1131 | # mappings for: ('.' => '%2E', '/' => '%2F', ':' => '%3A') | |
10c526cf | 1132 | $url =~ s/([\t "#%&\(\),;<=>\?\@\[\\\]^`{|}~])/ |
f11f9c4c | 1133 | sprintf("%%%02X",utf8::native_to_unicode(ord($1)))/xge; |
1e054b24 | 1134 | |
eaf8b9b9 | 1135 | where a more complete solution would split the URL into components |
1e054b24 PP |
1136 | and apply a full s/// substitution only to the appropriate parts. |
1137 | ||
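As a sketch of that more complete approach, the two substitutions can be wrapped in subroutines and the encoder applied to one path segment at a time. The names C<url_decode()> and C<url_encode_segment()> are hypothetical, the choice to leave only letters, digits, C<->, C<.>, C<_> and C<~> unescaped is an assumption taken from the RFC 3986 unreserved set (so, unlike the substitution above, C<~> is not escaped), and the input is assumed to consist of single-octet characters.

```perl
use strict;
use warnings;

# Decode %XX escapes: the hex digits give an ASCII/Unicode code point,
# which is mapped to the native code point before chr().
sub url_decode {
    my ($url) = @_;
    $url =~ s/%([0-9a-fA-F]{2})/chr(utf8::unicode_to_native(hex($1)))/ge;
    return $url;
}

# Encode one path segment: escape everything outside the assumed
# unreserved set, using the Unicode value of each native character.
sub url_encode_segment {
    my ($segment) = @_;
    $segment =~ s/([^A-Za-z0-9\-._~])/sprintf("%%%02X", utf8::native_to_unicode(ord($1)))/ge;
    return $segment;
}

print url_decode('http://www.pvhp.com/%7Epvhp/'), "\n";
print join('/', map { url_encode_segment($_) } 'my docs', 'a&b.html'), "\n";
```

Encoding segment by segment keeps the C</> separators intact, which a single substitution over the whole URL cannot do safely.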
1e054b24 PP |
1138 | =head2 uu encoding and decoding |
1139 | ||
eaf8b9b9 KW |
1140 | The C<u> template to pack() or unpack() will render EBCDIC data in EBCDIC |
1141 | characters equivalent to their ASCII counterparts. For example, the | |
1e054b24 PP |
1142 | following will print "Yes indeed\n" on either an ASCII or EBCDIC computer: |
1143 | ||
84f709e7 JH |
1144 | $all_byte_chrs = ''; |
1145 | for (0..255) { $all_byte_chrs .= chr($_); } | |
1146 | $uuencode_byte_chrs = pack('u', $all_byte_chrs); | |
210b36aa | 1147 | ($uu = <<'ENDOFHEREDOC') =~ s/^\s*//gm; |
1e054b24 PP |
1148 | M``$"`P0%!@<("0H+#`T.#Q`1$A,4%187&!D:&QP='A\@(2(C)"4F)R@I*BLL |
1149 | M+2XO,#$R,S0U-C<X.3H[/#T^/T!!0D-$149'2$E*2TQ-3D]045)35%565UA9 | |
1150 | M6EM<75Y?8&%B8V1E9F=H:6IK;&UN;W!Q<G-T=79W>'EZ>WQ]?G^`@8*#A(6& | |
1151 | MAXB)BHN,C8Z/D)&2DY25EI>8F9J;G)V>GZ"AHJ.DI::GJ*FJJZRMKJ^PL;*S | |
1152 | MM+6VM[BYNKN\O;Z_P,'"P\3%QL?(R<K+S,W.S]#1TM/4U=;7V-G:V]S=WM_@ | |
1153 | ?X>+CY.7FY^CIZNOL[>[O\/'R\_3U]O?X^?K[_/W^_P`` | |
1154 | ENDOFHEREDOC | |
84f709e7 | 1155 | if ($uuencode_byte_chrs eq $uu) { |
1e054b24 PP |
1156 | print "Yes "; |
1157 | } | |
1158 | $uudecode_byte_chrs = unpack('u', $uuencode_byte_chrs); | |
84f709e7 | 1159 | if ($uudecode_byte_chrs eq $all_byte_chrs) { |
1e054b24 PP |
1160 | print "indeed\n"; |
1161 | } | |
1162 | ||
f11f9c4c | 1163 | Here is a very spartan uudecoder that will work on EBCDIC: |
1e054b24 | 1164 | |
84f709e7 | 1165 | #!/usr/local/bin/perl |
84f709e7 | 1166 | $_ = <> until ($mode,$file) = /^begin\s*(\d*)\s*(\S*)/; |
1e054b24 PP |
1167 | open(OUT, "> $file") if $file ne ""; |
1168 | while(<>) { | |
1169 | last if /^end/; | |
1170 | next if /[a-z]/; | |
f11f9c4c KW |
1171 | next unless int((((utf8::native_to_unicode(ord()) - 32 ) & 077) |
1172 | + 2) / 3) | |
1173 | == int(length() / 4); | |
1e054b24 PP |
1174 | print OUT unpack("u", $_); |
1175 | } | |
1176 | close(OUT); | |
1177 | chmod oct($mode), $file; | |
1178 | ||
1179 | ||
1180 | =head2 Quoted-Printable encoding and decoding | |
1181 | ||
8a50e6a3 | 1182 | On ASCII-encoded platforms it is possible to strip characters outside of |
1e054b24 PP |
1183 | the printable set using: |
1184 | ||
1185 | # This QP encoder works on ASCII only | |
84f709e7 | 1186 | $qp_string =~ s/([=\x00-\x1F\x80-\xFF])/sprintf("=%02X",ord($1))/ge; |
1e054b24 | 1187 | |
eaf8b9b9 | 1188 | Whereas a QP encoder that works on both ASCII and EBCDIC platforms |
f11f9c4c | 1189 | would look somewhat like the following: |
1e054b24 | 1190 | |
f11f9c4c | 1191 | $delete = chr utf8::unicode_to_native(0x7F); |
84f709e7 | 1192 | $qp_string =~ |
f11f9c4c KW |
1193 | s/(=|[^[:print:]$delete])/ |
1194 | sprintf("=%02X",utf8::native_to_unicode(ord($1)))/xage; |
1e054b24 PP |
1195 | |
1196 | (although in production code the substitutions might be done | |
f11f9c4c | 1197 | in the EBCDIC branch with the function call and separately in the |
1e054b24 PP |
1198 | ASCII branch without the expense of the identity map). |
1199 | ||
1200 | Such QP strings can be decoded with: | |
1201 | ||
1202 | # This QP decoder is limited to ASCII only | |
f11f9c4c | 1203 | $string =~ s/=([[:xdigit:]][[:xdigit:]])/chr hex $1/ge; |
1e054b24 PP |
1204 | $string =~ s/=[\n\r]+$//; |
1205 | ||
eaf8b9b9 | 1206 | Whereas a QP decoder that works on both ASCII and EBCDIC platforms |
f11f9c4c | 1207 | would look somewhat like the following: |
1e054b24 | 1208 | |
f11f9c4c KW |
1209 | $string =~ s/=([[:xdigit:]][[:xdigit:]])/ |
1210 | chr utf8::unicode_to_native(hex $1)/xge; |
1e054b24 PP |
1211 | $string =~ s/=[\n\r]+$//; |
1212 | ||
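A round trip is a quick way to check that a QP encoder and decoder agree: the encoder writes the Unicode value of each native character (C<utf8::native_to_unicode>), so the decoder has to map in the opposite direction (C<utf8::unicode_to_native>). The following is a simplified sketch under stated assumptions. The names C<qp_encode()> and C<qp_decode()> are hypothetical, the encoder escapes C<=> and every character outside printable ASCII (including C<DEL> and line ends), it does no line folding, the input is assumed to hold single-octet characters, and the C</a> modifier requires perl 5.14 or later.

```perl
use strict;
use warnings;

# Escape '=' and anything that is not a printable ASCII character.
sub qp_encode {
    my ($string) = @_;
    $string =~ s/(=|[^[:print:]])/sprintf("=%02X", utf8::native_to_unicode(ord($1)))/age;
    return $string;
}

# Turn each =XX escape back into the corresponding native character.
sub qp_decode {
    my ($string) = @_;
    $string =~ s/=([[:xdigit:]]{2})/chr(utf8::unicode_to_native(hex($1)))/ge;
    return $string;
}

my $original = "caf\xE9=1\n";
my $encoded  = qp_encode($original);
print "$encoded\n";
print "round trip ok\n" if qp_decode($encoded) eq $original;
```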
c69ca1d4 | 1213 | =head2 Caesarean ciphers |
1e054b24 PP |
1214 | |
1215 | The practice of shifting an alphabet one or more characters for encipherment | |
1216 | dates back thousands of years and was explicitly detailed by Gaius Julius | |
eaf8b9b9 | 1217 | Caesar in his B<Gallic Wars> text. A single alphabet shift is sometimes |
1e054b24 | 1218 | referred to as a rotation and the shift amount is given as a number $n after |
eaf8b9b9 KW |
1219 | the string 'rot' or "rot$n". Rot0 and rot26 would designate identity maps |
1220 | on the 26-letter English version of the Latin alphabet. Rot13 has the | |
1221 | interesting property that alternate subsequent invocations are identity maps | |
1222 | (thus rot13 is its own non-trivial inverse in the group of 26 alphabet | |
1223 | rotations). Hence the following is a rot13 encoder and decoder that will | |
2bbc8d55 | 1224 | work on ASCII and EBCDIC platforms: |
1e054b24 PP |
1225 | |
1226 | #!/usr/local/bin/perl | |
1227 | ||
84f709e7 | 1228 | while(<>){ |
1e054b24 PP |
1229 | tr/n-za-mN-ZA-M/a-zA-Z/; |
1230 | print; | |
1231 | } | |
1232 | ||
1233 | In one-liner form: | |
1234 | ||
84f709e7 | 1235 | perl -ne 'tr/n-za-mN-ZA-M/a-zA-Z/;print' |
1e054b24 PP |
1236 | |
1237 | ||
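The self-inverse property mentioned above is easy to check. This sketch wraps the same kind of C<tr///> in a subroutine (the name C<rot13()> is hypothetical) and applies it twice to an arbitrary sample string.

```perl
use strict;
use warnings;

# Rotate each English letter by 13 places; other characters pass through.
sub rot13 {
    my ($text) = @_;
    $text =~ tr/A-Za-z/N-ZA-Mn-za-m/;
    return $text;
}

my $plain  = 'Hello, World!';
my $cipher = rot13($plain);
print "$cipher\n";            # Uryyb, Jbeyq!
print rot13($cipher), "\n";   # Hello, World!
```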
1238 | =head1 Hashing order and checksums | |
1239 | ||
eaf8b9b9 | 1240 | To the extent that it is possible to write code that depends on |
395f5a0c | 1241 | hashing order there may be differences between hashes as stored |
8a50e6a3 | 1242 | on an ASCII-based platform and hashes stored on an EBCDIC-based platform. |
1e054b24 PP |
1243 | XXX |
1244 | ||
d396a558 JH |
1245 | =head1 I18N AND L10N |
1246 | ||
eaf8b9b9 KW |
1247 | Internationalization (I18N) and localization (L10N) are supported at least |
1248 | in principle even on EBCDIC platforms. The details are system-dependent | |
d396a558 JH |
1249 | and discussed under the L<perlebcdic/OS ISSUES> section below. |
1250 | ||
8a50e6a3 | 1251 | =head1 MULTI-OCTET CHARACTER SETS |
d396a558 | 1252 | |
eaf8b9b9 KW |
1253 | Perl may work with an internal UTF-EBCDIC encoding form for wide characters |
1254 | on EBCDIC platforms in a manner analogous to the way that it works with | |
395f5a0c PK |
1255 | the UTF-8 internal encoding form on ASCII based platforms. |
1256 | ||
1257 | Legacy multi byte EBCDIC code pages XXX. | |
d396a558 JH |
1258 | |
1259 | =head1 OS ISSUES | |
1260 | ||
eaf8b9b9 | 1261 | There may be a few system-dependent issues |
d396a558 JH |
1262 | of concern to EBCDIC Perl programmers. |
1263 | ||
522b859a | 1264 | =head2 OS/400 |
51b5cecb | 1265 | |
d396a558 JH |
1266 | =over 8 |
1267 | ||
522b859a JH |
1268 | =item PASE |
1269 | ||
8a50e6a3 FC |
1270 | The PASE environment is a runtime environment for OS/400 that can run |
1271 | executables built for PowerPC AIX in OS/400; see L<perlos400>. PASE | |
522b859a JH |
1272 | is ASCII-based, not EBCDIC-based as the ILE is. |
1273 | ||
d396a558 JH |
1274 | =item IFS access |
1275 | ||
1276 | XXX. | |
1277 | ||
1278 | =back | |
1279 | ||
395f5a0c | 1280 | =head2 OS/390, z/OS |
d396a558 | 1281 | |
51b5cecb PP |
1282 | Perl runs under UNIX System Services (USS). |
1283 | ||
d396a558 JH |
1284 | =over 8 |
1285 | ||
51b5cecb PP |
1286 | =item chcp |
1287 | ||
eaf8b9b9 | 1288 | B<chcp> is supported as a shell utility for displaying and changing |
75cdcc93 | 1289 | one's code page. See also L<chcp(1)>. |
51b5cecb | 1290 | |
d396a558 JH |
1291 | =item dataset access |
1292 | ||
1293 | For sequential data set access try: | |
1294 | ||
1295 | my @ds_records = `cat //DSNAME`; | |
1296 | ||
1297 | or: | |
1298 | ||
1299 | my @ds_records = `cat //'HLQ.DSNAME'`; | |
1300 | ||
1301 | See also the OS390::Stdio module on CPAN. | |
1302 | ||
395f5a0c | 1303 | =item OS/390, z/OS iconv |
51b5cecb | 1304 | |
1e054b24 PP |
1305 | B<iconv> is supported as both a shell utility and a C RTL routine. |
1306 | See also the iconv(1) and iconv(3) manual pages. | |
51b5cecb | 1307 | |
d396a558 JH |
1308 | =item locales |
1309 | ||
395f5a0c PK |
1310 | On OS/390 or z/OS see L<locale> for information on locales. The L10N files |
1311 | are in F</usr/nls/locale>. $Config{d_setlocale} is 'define' on OS/390 | |
1312 | or z/OS. | |
d396a558 JH |
1313 | |
1314 | =back | |
1315 | ||
d396a558 JH |
1316 | =head2 POSIX-BC? |
1317 | ||
1318 | XXX. | |
1319 | ||
51b5cecb PP |
1320 | =head1 BUGS |
1321 | ||
51b5cecb | 1322 | Not all shells will allow multiple C<-e> string arguments to perl to |
eaf8b9b9 | 1323 | be concatenated together properly as recipes 0, 2, 4, 5, and 6 might |
395f5a0c | 1324 | seem to imply. |
51b5cecb | 1325 | |
b3b6085d PP |
1326 | =head1 SEE ALSO |
1327 | ||
395f5a0c | 1328 | L<perllocale>, L<perlfunc>, L<perlunicode>, L<utf8>. |
b3b6085d | 1329 | |
d396a558 JH |
1330 | =head1 REFERENCES |
1331 | ||
2bbc8d55 | 1332 | L<http://anubis.dkuug.dk/i18n/charmaps> |
d396a558 | 1333 | |
2bbc8d55 | 1334 | L<http://www.unicode.org/> |
d396a558 | 1335 | |
2bbc8d55 | 1336 | L<http://www.unicode.org/unicode/reports/tr16/> |
d396a558 | 1337 | |
08d7a6b2 | 1338 | L<http://www.wps.com/projects/codes/> |
51b5cecb PP |
1339 | B<ASCII: American Standard Code for Information Infiltration> Tom Jennings, |
1340 | September 1999. | |
1341 | ||
eaf8b9b9 KW |
1342 | B<The Unicode Standard, Version 3.0> The Unicode Consortium, Lisa Moore ed., |
1343 | ISBN 0-201-61633-5, Addison Wesley Developers Press, February 2000. | |
51b5cecb | 1344 | |
eaf8b9b9 KW |
1345 | B<CDRA: IBM - Character Data Representation Architecture - |
1346 | Reference and Registry>, IBM SC09-2190-00, December 1996. | |
d396a558 | 1347 | |
eaf8b9b9 | 1348 | "Demystifying Character Sets", Andrea Vine, Multilingual Computing |
d396a558 JH |
1349 | & Technology, B<#26 Vol. 10 Issue 4>, August/September 1999; |
1350 | ISSN 1523-0309; Multilingual Computing Inc. Sandpoint ID, USA. | |
1351 | ||
1e054b24 PP |
1352 | B<Codes, Ciphers, and Other Cryptic and Clandestine Communication> |
1353 | Fred B. Wrixon, ISBN 1-57912-040-7, Black Dog & Leventhal Publishers, | |
1354 | 1998. | |
1355 | ||
2bbc8d55 | 1356 | L<http://www.bobbemer.com/P-BIT.HTM> |
395f5a0c PK |
1357 | B<IBM - EBCDIC and the P-bit; The biggest Computer Goof Ever> Robert Bemer. |
1358 | ||
1359 | =head1 HISTORY | |
1360 | ||
1361 | 15 April 2001: added UTF-8 and UTF-EBCDIC to main table, pvhp. | |
1362 | ||
d396a558 JH |
1363 | =head1 AUTHOR |
1364 | ||
eaf8b9b9 KW |
1365 | Peter Prymmer pvhp@best.com wrote this in 1999 and 2000 |
1366 | with CCSID 0819 and 0037 help from Chris Leach and | |
1367 | AndrE<eacute> Pirard A.Pirard@ulg.ac.be as well as POSIX-BC | |
b3b6085d | 1368 | help from Thomas Dorner Thomas.Dorner@start.de. |
eaf8b9b9 KW |
1369 | Thanks also to Vickie Cooper, Philip Newton, William Raffloer, and |
1370 | Joe Smith. Trademarks, registered trademarks, service marks and | |
1371 | registered service marks used in this document are the property of | |
1e054b24 | 1372 | their respective owners. |