Commit | Line | Data |
---|---|---|
49781f4a AB |
1 | =encoding utf8 |
2 | ||
d396a558 JH |
3 | =head1 NAME |
4 | ||
5 | perlebcdic - Considerations for running Perl on EBCDIC platforms | |
6 | ||
7 | =head1 DESCRIPTION | |
8 | ||
9 | An exploration of some of the issues facing Perl programmers | |
4d2ca8b5 | 10 | on EBCDIC-based computers. |
d396a558 | 11 | |
4d2ca8b5 | 12 | Portions of this document that are still incomplete are marked with XXX. |
d396a558 | 13 | |
4d2ca8b5 KW |
14 | Early Perl versions worked on some EBCDIC machines, but the last known |
15 | version that ran on EBCDIC was v5.8.7, until v5.22, when the Perl core | |
16 | again works on z/OS. Theoretically, it could work on OS/400 or Siemens' | |
4b638048 KW |
17 | BS2000 (or their successors), but this is untested. In v5.22 and v5.24, |
18 | not all of | |
4d2ca8b5 KW |
19 | the modules that are available on CPAN but also ship with core Perl work on z/OS. |
20 | ||
21 | If you want to use Perl on a non-z/OS EBCDIC machine, please let us know | |
e1b711da KW |
22 | by sending mail to perlbug@perl.org. |
23 | ||
4d2ca8b5 KW |
24 | Writing Perl on an EBCDIC platform is really no different than writing |
25 | on an L</ASCII> one, but with different underlying numbers, as we'll see | |
26 | shortly. You'll have to know something about those L</ASCII> platforms | |
27 | because the documentation is biased and will frequently use example | |
28 | numbers that don't apply to EBCDIC. There are also very few CPAN | |
29 | modules that are written for EBCDIC and which don't work on ASCII; | |
30 | instead the vast majority of CPAN modules are written for ASCII, and | |
31 | some may happen to work on EBCDIC, while a few have been designed to | |
32 | portably work on both. | |
33 | ||
34 | If your code just uses the 52 letters A-Z and a-z, plus SPACE, the | |
35 | digits 0-9, and the punctuation characters that Perl uses, plus a few | |
36 | controls that are denoted by escape sequences like C<\n> and C<\t>, then | |
37 | there's nothing special about using Perl, and your code may very well | |
38 | work on an ASCII machine without change. | |
39 | ||
40 | But if you write code that uses C<\005> to mean a TAB or C<\xC1> to mean | |
41 | an "A", or C<\xDF> to mean a "E<yuml>" (small C<"y"> with a diaeresis), | |
42 | then your code may well work on your EBCDIC platform, but not on an | |
43 | ASCII one. That's fine to do if no one will ever want to run your code | |
4b638048 | 44 | on an ASCII platform; but the bias in this document will be towards writing |
4d2ca8b5 KW |
45 | code portable between EBCDIC and ASCII systems. Again, if every |
46 | character you care about is easily enterable from your keyboard, you | |
47 | don't have to know anything about ASCII, but many keyboards don't easily | |
48 | allow you to directly enter, say, the character C<\xDF>, so you have to | |
49 | specify it indirectly, such as by using the C<"\xDF"> escape sequence. | |
50 | In those cases it's easiest to know something about the ASCII/Unicode | |
51 | character sets. If you know that the small "E<yuml>" is C<U+00FF>, then | |
52 | you can instead specify it as C<"\N{U+FF}">, and have the computer | |
53 | automatically translate it to C<\xDF> on your platform, and leave it as | |
54 | C<\xFF> on ASCII ones. Or you could specify it by name, C<\N{LATIN | |
55 | SMALL LETTER Y WITH DIAERESIS}> and not have to know the numbers. |
4b638048 | 56 | Either way works, but both require familiarity with Unicode. |
4d2ca8b5 | 57 | |
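The two portable spellings described above can be compared directly. This is a minimal sketch, not taken from the original document; the variable names are illustrative only. The C<use charnames> line is assumed to be needed only on perls older than v5.16.

```perl
use strict;
use warnings;
use charnames ':full';    # needed for \N{NAME} before v5.16; harmless later

# Two portable spellings of LATIN SMALL LETTER Y WITH DIAERESIS.
my $by_code_point = "\N{U+FF}";
my $by_name       = "\N{LATIN SMALL LETTER Y WITH DIAERESIS}";

# Both spellings yield the same one-character string on any platform.
print "same character: ", ($by_code_point eq $by_name ? "yes" : "no"), "\n";

# The native ordinal is platform-dependent: 0xFF on ASCII platforms, and
# (per the text above) 0xDF on the EBCDIC platform used in the example.
printf "native ordinal: 0x%02X\n", ord($by_code_point);
```

Neither spelling hard-codes a native ordinal, so the same source should behave the same way on ASCII and EBCDIC platforms.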
d396a558 JH |
58 | =head1 COMMON CHARACTER CODE SETS |
59 | ||
60 | =head2 ASCII | |
61 | ||
4d2ca8b5 KW |
62 | The American Standard Code for Information Interchange (ASCII or |
63 | US-ASCII) is a set of | |
64 | integers running from 0 to 127 (decimal) that have standardized | |
65 | interpretations by the computers which use ASCII. For example, 65 means | |
66 | the letter "A". | |
4b638048 | 67 | The range 0..127 can be covered by setting various bits in a 7-bit binary |
eaf8b9b9 KW |
68 | number, hence the set is sometimes referred to as "7-bit ASCII". |
69 | ASCII was described by the American National Standards Institute | |
70 | document ANSI X3.4-1986. It was also described by ISO 646:1991 | |
71 | (with localization for currency symbols). The full ASCII set is | |
4d2ca8b5 KW |
72 | given in the table L<below|/recipe 3> as the first 128 elements. |
73 | Languages that | |
eaf8b9b9 KW |
74 | can be written adequately with the characters in ASCII include |
75 | English, Hawaiian, Indonesian, Swahili and some Native American | |
d396a558 JH |
76 | languages. |
77 | ||
4d2ca8b5 KW |
78 | Most non-EBCDIC character sets are supersets of ASCII. That is, the |
79 | integers 0-127 mean what ASCII says they mean. But integers 128 and | |
80 | above are specific to the character set. | |
81 | ||
82 | Many of these fit entirely into 8 bits, using ASCII as 0-127, while | |
83 | specifying what 128-255 mean, and not using anything above 255. | |
84 | Thus, these are single-byte (or octet if you prefer) character sets. | |
85 | One important one (since Unicode is a superset of it) is the ISO 8859-1 | |
86 | character set. | |
51b5cecb | 87 | |
d396a558 JH |
88 | =head2 ISO 8859 |
89 | ||
4d2ca8b5 KW |
90 | The ISO 8859-I<B<$n>> standards are a collection of character code sets from the |
91 | International Organization for Standardization (ISO), each of which adds | |
92 | characters to the ASCII set that are typically found in various | |
5d9fe53c | 93 | languages, many of which are based on the Roman, or Latin, alphabet. |
4d2ca8b5 KW |
94 | Most are for European languages, but there are also ones for Arabic, |
95 | Greek, Hebrew, and Thai. There are good references on the web about | |
96 | all these. | |
d396a558 JH |
97 | |
98 | =head2 Latin 1 (ISO 8859-1) | |
99 | ||
eaf8b9b9 KW |
100 | Latin 1 is a particular 8-bit extension to ASCII that includes grave- and |
101 | acute-accented Latin characters. Languages that can employ ISO 8859-1 | |
102 | include all the languages covered by ASCII as well as Afrikaans, | |
103 | Albanian, Basque, Catalan, Danish, Faroese, Finnish, Norwegian, | |
104 | Portuguese, Spanish, and Swedish. Dutch is covered albeit without | |
105 | the ij ligature. French is covered too but without the oe ligature. | |
d396a558 | 106 | German can use ISO 8859-1 but must do so without German-style |
eaf8b9b9 | 107 | quotation marks. This set is based on Western European extensions |
d396a558 | 108 | to ASCII and is commonly encountered in world wide web work. |
4d2ca8b5 | 109 | In IBM character code set identification terminology, ISO 8859-1 is |
51b5cecb | 110 | also known as CCSID 819 (or sometimes 0819 or even 00819). |
d396a558 JH |
111 | |
112 | =head2 EBCDIC | |
113 | ||
eaf8b9b9 | 114 | The Extended Binary Coded Decimal Interchange Code refers to a |
8a50e6a3 | 115 | large collection of single- and multi-byte coded character sets that are |
4d2ca8b5 KW |
116 | quite different from ASCII and ISO 8859-1, and are all slightly |
117 | different from each other; they are typically used on host computers. The | |
118 | EBCDIC encodings derive from 8-bit byte extensions of Hollerith punched | |
119 | card encodings, which long predate ASCII. The layout on the | |
120 | cards was such that high bits were set for the upper and lower case | |
121 | alphabetic | |
122 | characters C<[a-z]> and C<[A-Z]>, but there were gaps within each Latin | |
123 | alphabet range, visible in the table L<below|/recipe 3>. These gaps can | |
124 | cause complications. | |
d396a558 | 125 | |
eaf8b9b9 | 126 | Some IBM EBCDIC character sets may be known by character code set |
2c09a866 | 127 | identification numbers (CCSID numbers) or code page numbers. |
51b5cecb | 128 | |
2bbc8d55 SP |
129 | Perl can be compiled on platforms that run any of three commonly used EBCDIC |
130 | character sets, listed below. | |
131 | ||
d5924ca6 | 132 | =head3 The 13 variant characters |
1e054b24 | 133 | |
51b5cecb PP |
134 | Among IBM EBCDIC character code sets there are 13 characters that |
135 | are often mapped to different integer values. Those characters | |
136 | are known as the 13 "variant" characters and are: | |
d396a558 | 137 | |
eaf8b9b9 | 138 | \ [ ] { } ^ ~ ! # | $ @ ` |
d396a558 | 139 | |
6ff677df | 140 | When Perl is compiled for a platform, it looks at all of these characters to |
2bbc8d55 SP |
141 | guess which EBCDIC character set the platform uses, and adapts itself |
142 | accordingly to that platform. If the platform uses a character set that is not | |
143 | one of the three Perl knows about, Perl will either fail to compile, or | |
144 | mistakenly and silently choose one of the three. | |
4d2ca8b5 KW |
145 | |
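A script can make a rough runtime guess at the code set from a single variant character. The sketch below is an illustration only, not the build-time check Perl itself performs, which looks at all 13 characters. The helper name C<code_set_name> is hypothetical, and the ordinals of C<"^"> are the ones listed in this document's code chart.

```perl
use strict;
use warnings;

# Ordinals of "^" taken from this document's code chart.
my %code_set_for_caret = (
     94 => 'ASCII-based',
    176 => 'CCSID 0037',
     95 => 'CCSID 1047',
    106 => 'POSIX-BC',
);

# Hypothetical helper: name the code set implied by an ordinal of "^".
sub code_set_name {
    my ($caret_ordinal) = @_;
    return exists $code_set_for_caret{$caret_ordinal}
        ? $code_set_for_caret{$caret_ordinal}
        : 'unknown';
}

print "This perl appears to use: ", code_set_name(ord '^'), "\n";
```

Testing one character is enough to tell these four cases apart, but it cannot detect a platform whose code set merely happens to share that one ordinal.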
146 | =head3 EBCDIC code sets recognized by Perl | |
2bbc8d55 | 147 | |
d5924ca6 KW |
148 | =over |
149 | ||
150 | =item B<0037> | |
d396a558 | 151 | |
eaf8b9b9 KW |
152 | Character code set ID 0037 is a mapping of the ASCII plus Latin-1 |
153 | characters (i.e. ISO 8859-1) to an EBCDIC set. 0037 is used | |
154 | in North American English locales on the OS/400 operating system | |
155 | that runs on AS/400 computers. CCSID 0037 differs from ISO 8859-1 | |
a8f582bb | 156 | in 236 places; in other words they agree on only 20 code point values. |
d396a558 | 157 | |
d5924ca6 | 158 | =item B<1047> |
d396a558 | 159 | |
eaf8b9b9 KW |
160 | Character code set ID 1047 is also a mapping of the ASCII plus |
161 | Latin-1 characters (i.e. ISO 8859-1) to an EBCDIC set. 1047 is | |
162 | used under Unix System Services for OS/390 or z/OS, and OpenEdition | |
a8f582bb KW |
163 | for VM/ESA. CCSID 1047 differs from CCSID 0037 in eight places, |
164 | and from ISO 8859-1 in 236. | |
d396a558 | 165 | |
d5924ca6 | 166 | =item B<POSIX-BC> |
d396a558 JH |
167 | |
168 | The EBCDIC code page in use on Siemens' BS2000 system is distinct from | |
169 | 1047 and 0037. It is identified below as the POSIX-BC set. | |
a8f582bb KW |
170 | Like 0037 and 1047, it is the same as ISO 8859-1 in 20 code point |
171 | values. | |
d396a558 | 172 | |
d5924ca6 KW |
173 | =back |
174 | ||
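The difference counts quoted above can be explored with the Encode module. This is a sketch under the assumption that Encode's C<cp37>, C<cp1047>, and C<iso-8859-1> tables correspond to the code sets described here; that correspondence is not verified in this document, so no expected counts are shown. The helper name C<differing_positions> is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: count the byte values (0..255) that decode to
# different characters under two single-byte encodings known to Encode.
sub differing_positions {
    my ($encoding_a, $encoding_b) = @_;
    my $all_bytes = pack 'C*', 0 .. 255;
    my @chars_a = split //, decode($encoding_a, $all_bytes);
    my @chars_b = split //, decode($encoding_b, $all_bytes);
    return scalar grep { $chars_a[$_] ne $chars_b[$_] } 0 .. 255;
}

for my $pair (['cp37', 'iso-8859-1'],
              ['cp1047', 'iso-8859-1'],
              ['cp1047', 'cp37'])
{
    printf "%-7s vs %-10s: %d differences\n",
        @$pair, differing_positions(@$pair);
}
```

Decoding every byte value to characters and comparing the results avoids relying on the native character set of the machine running the script.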
64c66fb6 JH |
175 | =head2 Unicode code points versus EBCDIC code points |
176 | ||
177 | In Unicode terminology a I<code point> is the number assigned to a | |
178 | character: for example, in EBCDIC the character "A" is usually assigned | |
4d2ca8b5 KW |
179 | the number 193. In Unicode, the character "A" is assigned the number 65. |
180 | All the code points in ASCII and Latin-1 (ISO 8859-1) have the same | |
181 | meaning in Unicode. All three of the recognized EBCDIC code sets have | |
182 | 256 code points, and in each code set, all 256 code points are mapped to | |
183 | equivalent Latin1 code points. Obviously, "A" will map to "A", "B" => | |
184 | "B", "%" => "%", etc., for all printable characters in Latin1 and these | |
185 | code pages. | |
186 | ||
187 | It also turns out that EBCDIC has nearly precise equivalents for the | |
188 | ASCII/Latin1 C0 controls and the DELETE control. (The C0 controls are | |
189 | those whose ASCII code points are 0..0x1F; things like TAB, ACK, BEL, | |
190 | etc.) A mapping is set up between these ASCII/EBCDIC controls. There | |
191 | isn't such a precise mapping between the C1 controls on ASCII platforms | |
192 | and the remaining EBCDIC controls. What has been done is to map these | |
193 | controls, mostly arbitrarily, to some otherwise unmatched character in | |
194 | the other character set. Most of these are very rarely used | |
195 | nowadays in EBCDIC anyway, and their names have been dropped, without | |
196 | much complaint. For example the EO (Eight Ones) EBCDIC control | |
197 | (consisting of eight one bits = 0xFF) is mapped to the C1 APC control | |
198 | (0x9F), and you can't use the name "EO". | |
199 | ||
200 | The EBCDIC controls provide three possible line terminator characters, | |
201 | CR (0x0D), LF (0x25), and NL (0x15). On ASCII platforms, the symbols | |
202 | "NL" and "LF" refer to the same character, but in strict EBCDIC | |
203 | terminology they are different ones. The EBCDIC NL is mapped to the C1 | |
204 | control called "NEL" ("Next Line"; here's a case where the mapping makes | |
205 | quite a bit of sense, and hence isn't just arbitrary). On some EBCDIC | |
206 | platforms, this NL or NEL is the typical line terminator. This is true | |
207 | of z/OS and BS2000. On these platforms, the C compilers will swap the | |
208 | LF and NEL code points, so that C<"\n"> is 0x15, and refers to NL. Perl | |
209 | does that too; you can see it in the code chart L<below|/recipe 3>. | |
210 | This makes things generally "just work" without you even having to be | |
211 | aware that there is a swap. | |
dc4af4bb | 212 | |
395f5a0c PK |
213 | =head2 Unicode and UTF |
214 | ||
4d2ca8b5 | 215 | UTF stands for "Unicode Transformation Format". |
2bbc8d55 SP |
216 | UTF-8 is an encoding of Unicode into a sequence of 8-bit byte chunks, based on |
217 | ASCII and Latin-1. | |
218 | The length of a sequence required to represent a Unicode code point | |
219 | depends on the ordinal number of that code point, | |
220 | with larger numbers requiring more bytes. | |
221 | UTF-EBCDIC is like UTF-8, but based on EBCDIC. | |
4d2ca8b5 KW |
222 | They are enough alike that often, casual usage will conflate the two |
223 | terms, and use "UTF-8" to mean both the UTF-8 found on ASCII platforms, | |
224 | and the UTF-EBCDIC found on EBCDIC ones. | |
2bbc8d55 | 225 | |
4d2ca8b5 | 226 | You may see the term "invariant" character or code point. |
fe749c9a | 227 | This simply means that the character has the same numeric |
4d2ca8b5 KW |
228 | value and representation when encoded in UTF-8 (or UTF-EBCDIC) as when |
229 | not. (Note that this is a very different concept from L</The 13 variant | |
230 | characters> mentioned above. Careful prose will use the term "UTF-8 | |
231 | invariant" instead of just "invariant", but most often you'll see just | |
232 | "invariant".) For example, the ordinal value of "A" is 193 in most | |
233 | EBCDIC code pages, and also is 193 when encoded in UTF-EBCDIC. All | |
234 | UTF-8 (or UTF-EBCDIC) variant code points occupy at least two bytes when | |
235 | encoded in UTF-8 (or UTF-EBCDIC); by definition, the UTF-8 (or | |
236 | UTF-EBCDIC) invariant code points are exactly one byte whether encoded | |
237 | in UTF-8 (or UTF-EBCDIC), or not. (By now you see why people typically | |
238 | just say "UTF-8" when they also mean "UTF-EBCDIC". For the rest of this | |
239 | document, we'll mostly be casual about it too.) | |
240 | In ASCII UTF-8, the code points corresponding to the lowest 128 | |
fe749c9a KW |
241 | ordinal numbers (0 - 127: the ASCII characters) are invariant. |
242 | In UTF-EBCDIC, there are 160 invariant characters. | |
2bbc8d55 | 243 | (If you care, the EBCDIC invariants are those characters |
fe749c9a | 244 | which have ASCII equivalents, plus those that correspond to |
4d2ca8b5 | 245 | the C1 controls (128 - 159 on ASCII platforms).) |
fe749c9a | 246 | |
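Whether a given code point is invariant can be checked by encoding it and counting the resulting bytes. This is a minimal sketch: the helper name C<encoded_length> is hypothetical, and it relies only on the built-in C<utf8::unicode_to_native> and C<utf8::encode> functions.

```perl
use strict;
use warnings;

# Hypothetical helper: number of bytes a Unicode code point occupies when
# encoded in the platform's UTF-8 (UTF-EBCDIC on EBCDIC platforms).
sub encoded_length {
    my ($unicode_code_point) = @_;
    my $string = chr(utf8::unicode_to_native($unicode_code_point));
    utf8::encode($string);    # in place: characters become encoded bytes
    return length $string;
}

for my $code_point (0x41, 0x7F, 0x85, 0xFF, 0x100) {
    my $length = encoded_length($code_point);
    printf "U+%04X: %d byte(s), %s\n", $code_point, $length,
        $length == 1 ? 'invariant' : 'variant';
}
```

On an ASCII platform, U+0085 (a C1 control) should be reported as variant. According to the description above, it would be invariant under UTF-EBCDIC.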
c0236afe KW |
247 | A string encoded in UTF-EBCDIC may be longer (very rarely shorter) than |
248 | one encoded in UTF-8. Perl extends both UTF-8 and UTF-EBCDIC so that | |
249 | they can encode code points above the Unicode maximum of U+10FFFF. Both | |
250 | extensions are constructed to allow encoding of any code point that fits | |
251 | in a 64-bit word. | |
4d2ca8b5 KW |
252 | |
253 | UTF-EBCDIC is defined by | |
c0236afe KW |
254 | L<Unicode Technical Report #16|http://www.unicode.org/reports/tr16> |
255 | (often referred to as just TR16). | |
4d2ca8b5 KW |
256 | It is defined based on CCSID 1047, not allowing for the differences for |
257 | other code pages. This allows for easy interchange of text between | |
258 | computers running different code pages, but makes it unusable, without | |
259 | adaptation, for Perl on those other code pages. | |
260 | ||
261 | The reason for this unusability is that a fundamental assumption of Perl | |
262 | is that the characters it cares about for parsing and lexical analysis | |
263 | are the same whether or not the text is in UTF-8. For example, Perl | |
264 | expects the character C<"["> to have the same representation, no matter | |
265 | if the string containing it (or program text) is UTF-8 encoded or not. | |
266 | To ensure this, Perl adapts UTF-EBCDIC to the particular code page so | |
267 | that all characters it expects to be UTF-8 invariant are in fact UTF-8 | |
268 | invariant. This means that text generated on a computer running one | |
269 | version of Perl's UTF-EBCDIC has to be translated to be intelligible to | |
270 | a computer running another. | |
395f5a0c | 271 | |
c0236afe KW |
272 | TR16 implies a method to extend UTF-EBCDIC to encode code points up through |
273 | S<C<2 ** 31 - 1>>. Perl uses this method for code points up through | |
274 | S<C<2 ** 30 - 1>>, but uses an incompatible method for larger ones, to | |
275 | enable it to handle much larger code points than otherwise. | |
276 | ||
8704cfd1 | 277 | =head2 Using Encode |
8f94de01 | 278 | |
4d2ca8b5 | 279 | Starting from Perl 5.8 you can use the standard module Encode |
2bbc8d55 SP |
280 | to translate from EBCDIC to Latin-1 code points. |
281 | Encode knows about more EBCDIC character sets than Perl can currently | |
282 | be compiled to run on. | |
8f94de01 | 283 | |
c72e675e | 284 | use Encode 'from_to'; |
8f94de01 | 285 | |
c72e675e | 286 | my %ebcdic = ( 176 => 'cp37', 95 => 'cp1047', 106 => 'posix-bc' ); |
8f94de01 | 287 | |
c72e675e KW |
288 | # $a is in EBCDIC code points |
289 | from_to($a, $ebcdic{ord '^'}, 'latin1'); | |
290 | # $a is ISO 8859-1 code points | |
8f94de01 JH |
291 | |
292 | and from Latin-1 code points to EBCDIC code points | |
293 | ||
c72e675e | 294 | use Encode 'from_to'; |
8f94de01 | 295 | |
c72e675e | 296 | my %ebcdic = ( 176 => 'cp37', 95 => 'cp1047', 106 => 'posix-bc' ); |
8f94de01 | 297 | |
c72e675e KW |
298 | # $a is ISO 8859-1 code points |
299 | from_to($a, 'latin1', $ebcdic{ord '^'}); | |
300 | # $a is in EBCDIC code points | |
8f94de01 JH |
301 | |
302 | For doing I/O it is suggested that you use the autotranslating features | |
303 | of PerlIO; see L<perluniintro>. |
304 | ||
4d2ca8b5 | 305 | Since version 5.8 Perl uses the PerlIO I/O library. This enables |
aa2b82fc JH |
306 | you to use different encodings per IO channel. For example you may use |
307 | ||
308 | use Encode; | |
309 | open($f, ">:encoding(ascii)", "test.ascii"); | |
310 | print $f "Hello World!\n"; | |
311 | open($f, ">:encoding(cp37)", "test.ebcdic"); | |
312 | print $f "Hello World!\n"; | |
313 | open($f, ">:encoding(latin1)", "test.latin1"); | |
314 | print $f "Hello World!\n"; | |
315 | open($f, ">:encoding(utf8)", "test.utf8"); | |
316 | print $f "Hello World!\n"; | |
317 | ||
2c09a866 | 318 | to get four files containing "Hello World!\n" in ASCII, CP 0037 EBCDIC, |
2bbc8d55 | 319 | ISO 8859-1 (Latin-1) (in this example identical to ASCII since only ASCII |
eaf8b9b9 | 320 | characters were printed), and |
2bbc8d55 SP |
321 | UTF-EBCDIC (in this example identical to normal EBCDIC since only characters |
322 | that don't differ between EBCDIC and UTF-EBCDIC were printed). See the | |
4d2ca8b5 | 323 | documentation of L<Encode::PerlIO> for details. |
aa2b82fc JH |
324 | |
325 | As the PerlIO layer uses raw IO (bytes) internally, all this totally | |
326 | ignores things like the type of your filesystem (ASCII or EBCDIC). | |
327 | ||
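The translations in this section can be tried on any platform, because Encode works on explicit byte strings. This sketch is not part of the original examples, and the helper name C<to_hex> is hypothetical. It encodes a short string into each of the three EBCDIC code sets and decodes it back.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: show a byte string as space-separated hex pairs.
sub to_hex {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $octets;
}

my $text = "Hello World!";

for my $code_set (qw(cp37 cp1047 posix-bc)) {
    my $octets    = encode($code_set, $text);      # characters -> EBCDIC bytes
    my $recovered = decode($code_set, $octets);    # EBCDIC bytes -> characters
    printf "%-8s %s (%s)\n", $code_set, to_hex($octets),
        $recovered eq $text ? 'round trip ok' : 'round trip FAILED';
}
```

Unlike C<from_to>, which converts a variable in place, C<encode> and C<decode> return new strings, so the original text stays available for comparison.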
d396a558 JH |
328 | =head1 SINGLE OCTET TABLES |
329 | ||
330 | The following tables list the ASCII and Latin 1 ordered sets including | |
331 | the subsets: C0 controls (0x00..0x1F), ASCII graphics (0x20..0x7E), delete |
eaf8b9b9 | 332 | (0x7F), C1 controls (0x80..0x9F), and Latin-1 (a.k.a. ISO 8859-1) (0xA0..0xFF). In the |
8d725451 | 333 | table, the names of the Latin 1 |
eaf8b9b9 | 334 | extensions to ASCII have been labelled with character names roughly |
8d725451 | 335 | corresponding to I<The Unicode Standard, Version 6.1> albeit with |
4d2ca8b5 KW |
336 | substitutions such as C<s/LATIN//> and C<s/VULGAR//> in all cases; |
337 | S<C<s/CAPITAL LETTER//>> in some cases; and | |
338 | S<C<s/SMALL LETTER ([A-Z])/\l$1/>> in some other | |
0e56abba | 339 | cases. Controls are listed using their Unicode 6.2 abbreviations. |
eaf8b9b9 | 340 | The differences between the 0037 and 1047 sets are |
4d2ca8b5 KW |
341 | flagged with C<**>. The differences between the 1047 and POSIX-BC sets |
342 | are flagged with C<##>. All C<ord()> numbers listed are decimal. If you | |
8d725451 KW |
343 | would rather see this table listing octal values, then run the table |
344 | (that is, the pod source text of this document, since this recipe may not | |
1e054b24 | 345 | work with a pod2_other_format translation) through: |
d396a558 JH |
346 | |
347 | =over 4 | |
348 | ||
349 | =item recipe 0 | |
350 | ||
351 | =back | |
352 | ||
8d725451 KW |
353 | perl -ne 'if(/(.{29})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)' \ |
354 | -e '{printf("%s%-5.03o%-5.03o%-5.03o%.03o\n",$1,$2,$3,$4,$5)}' \ | |
5f26d5fd | 355 | perlebcdic.pod |
395f5a0c PK |
356 | |
357 | If you want to retain the UTF-x code points then in script form you | |
358 | might want to write: | |
359 | ||
360 | =over 4 | |
361 | ||
362 | =item recipe 1 | |
363 | ||
364 | =back | |
365 | ||
c72e675e KW |
366 | open(my $fh, '<', 'perlebcdic.pod') or die "Could not open perlebcdic.pod: $!"; |
367 | while (<$fh>) { | |
f11f9c4c KW |
368 | if (/(.{29})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\.?(\d*) |
369 | \s+(\d+)\.?(\d*)/x) | |
5f26d5fd | 370 | { |
c72e675e | 371 | if ($7 ne '' && $9 ne '') { |
5f26d5fd | 372 | printf( |
8d725451 | 373 | "%s%-5.03o%-5.03o%-5.03o%-5.03o%-3o.%-5o%-3o.%.03o\n", |
5f26d5fd | 374 | $1,$2,$3,$4,$5,$6,$7,$8,$9); |
c72e675e KW |
375 | } |
376 | elsif ($7 ne '') { | |
8d725451 | 377 | printf("%s%-5.03o%-5.03o%-5.03o%-5.03o%-3o.%-5o%.03o\n", |
c72e675e KW |
378 | $1,$2,$3,$4,$5,$6,$7,$8); |
379 | } | |
380 | else { | |
8d725451 | 381 | printf("%s%-5.03o%-5.03o%-5.03o%-5.03o%-5.03o%.03o\n", |
5f26d5fd | 382 | $1,$2,$3,$4,$5,$6,$8); |
c72e675e KW |
383 | } |
384 | } | |
385 | } | |
d396a558 JH |
386 | |
387 | If you would rather see this table listing hexadecimal values then | |
388 | run the table through: | |
389 | ||
390 | =over 4 | |
391 | ||
395f5a0c | 392 | =item recipe 2 |
d396a558 JH |
393 | |
394 | =back | |
395 | ||
8d725451 KW |
396 | perl -ne 'if(/(.{29})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)' \ |
397 | -e '{printf("%s%-5.02X%-5.02X%-5.02X%.02X\n",$1,$2,$3,$4,$5)}' \ | |
5f26d5fd | 398 | perlebcdic.pod |
395f5a0c PK |
399 | |
400 | Or, in order to retain the UTF-x code points in hexadecimal: | |
401 | ||
402 | =over 4 | |
403 | ||
404 | =item recipe 3 | |
405 | ||
406 | =back | |
407 | ||
c72e675e KW |
408 | open(my $fh, '<', 'perlebcdic.pod') or die "Could not open perlebcdic.pod: $!"; |
409 | while (<$fh>) { | |
f11f9c4c KW |
410 | if (/(.{29})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\.?(\d*) |
411 | \s+(\d+)\.?(\d*)/x) | |
5f26d5fd | 412 | { |
c72e675e | 413 | if ($7 ne '' && $9 ne '') { |
5f26d5fd | 414 | printf( |
8d725451 | 415 | "%s%-5.02X%-5.02X%-5.02X%-5.02X%-2X.%-6.02X%02X.%02X\n", |
c72e675e KW |
416 | $1,$2,$3,$4,$5,$6,$7,$8,$9); |
417 | } | |
418 | elsif ($7 ne '') { | |
8d725451 | 419 | printf("%s%-5.02X%-5.02X%-5.02X%-5.02X%-2X.%-6.02X%02X\n", |
c72e675e KW |
420 | $1,$2,$3,$4,$5,$6,$7,$8); |
421 | } | |
422 | else { | |
8d725451 | 423 | printf("%s%-5.02X%-5.02X%-5.02X%-5.02X%-5.02X%02X\n", |
5f26d5fd | 424 | $1,$2,$3,$4,$5,$6,$8); |
c72e675e KW |
425 | } |
426 | } | |
427 | } | |
395f5a0c PK |
428 | |
429 | ||
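Individual rows of the chart in this section can also be recomputed rather than parsed out of the pod source. This is a sketch only: the helper name C<table_row> is hypothetical. Encode's C<cp37>, C<cp1047>, and C<posix-bc> tables are assumed, not verified here, to agree with the chart for every row. The UTF-EBCDIC column is not reproduced.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: recompute the numeric columns for one ISO 8859-1
# ordinal: ISO 8859-1, cp37, cp1047, posix-bc, then UTF-8 bytes joined by ".".
sub table_row {
    my ($latin1_ordinal) = @_;
    my $char   = decode('iso-8859-1', pack('C', $latin1_ordinal));
    my @ebcdic = map { ord(encode($_, $char)) } qw(cp37 cp1047 posix-bc);
    my $utf8   = join '.', map { ord($_) } split //, encode('UTF-8', $char);
    return ($latin1_ordinal, @ebcdic, $utf8);
}

print join(' ', table_row($_)), "\n" for 65, 97, 233;
```

Starting from ISO 8859-1 bytes and decoding them keeps the helper independent of the native character set of the machine it runs on.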
8d725451 | 430 |                           ISO |
f11f9c4c KW |
431 |                          8859-1             POS-             CCSID |
432 |                          CCSID  CCSID CCSID IX-              1047 | |
8d725451 KW |
433 |   chr                     0819   0037 1047  BC  UTF-8  UTF-EBCDIC |
434 | --------------------------------------------------------------------- | |
435 | <NUL> 0 0 0 0 0 0 | |
436 | <SOH> 1 1 1 1 1 1 | |
437 | <STX> 2 2 2 2 2 2 | |
438 | <ETX> 3 3 3 3 3 3 | |
439 | <EOT> 4 55 55 55 4 55 | |
440 | <ENQ> 5 45 45 45 5 45 | |
441 | <ACK> 6 46 46 46 6 46 | |
442 | <BEL> 7 47 47 47 7 47 | |
443 | <BS> 8 22 22 22 8 22 | |
444 | <HT> 9 5 5 5 9 5 | |
445 | <LF> 10 37 21 21 10 21 ** | |
446 | <VT> 11 11 11 11 11 11 | |
447 | <FF> 12 12 12 12 12 12 | |
448 | <CR> 13 13 13 13 13 13 | |
449 | <SO> 14 14 14 14 14 14 | |
450 | <SI> 15 15 15 15 15 15 | |
451 | <DLE> 16 16 16 16 16 16 | |
452 | <DC1> 17 17 17 17 17 17 | |
453 | <DC2> 18 18 18 18 18 18 | |
454 | <DC3> 19 19 19 19 19 19 | |
455 | <DC4> 20 60 60 60 20 60 | |
456 | <NAK> 21 61 61 61 21 61 | |
457 | <SYN> 22 50 50 50 22 50 | |
458 | <ETB> 23 38 38 38 23 38 | |
459 | <CAN> 24 24 24 24 24 24 | |
460 | <EOM> 25 25 25 25 25 25 | |
461 | <SUB> 26 63 63 63 26 63 | |
462 | <ESC> 27 39 39 39 27 39 | |
463 | <FS> 28 28 28 28 28 28 | |
464 | <GS> 29 29 29 29 29 29 | |
465 | <RS> 30 30 30 30 30 30 | |
466 | <US> 31 31 31 31 31 31 | |
467 | <SPACE> 32 64 64 64 32 64 | |
468 | ! 33 90 90 90 33 90 | |
469 | " 34 127 127 127 34 127 | |
470 | # 35 123 123 123 35 123 | |
471 | $ 36 91 91 91 36 91 | |
472 | % 37 108 108 108 37 108 | |
473 | & 38 80 80 80 38 80 | |
474 | ' 39 125 125 125 39 125 | |
475 | ( 40 77 77 77 40 77 | |
476 | ) 41 93 93 93 41 93 | |
477 | * 42 92 92 92 42 92 | |
478 | + 43 78 78 78 43 78 | |
479 | , 44 107 107 107 44 107 | |
480 | - 45 96 96 96 45 96 | |
481 | . 46 75 75 75 46 75 | |
482 | / 47 97 97 97 47 97 | |
483 | 0 48 240 240 240 48 240 | |
484 | 1 49 241 241 241 49 241 | |
485 | 2 50 242 242 242 50 242 | |
486 | 3 51 243 243 243 51 243 | |
487 | 4 52 244 244 244 52 244 | |
488 | 5 53 245 245 245 53 245 | |
489 | 6 54 246 246 246 54 246 | |
490 | 7 55 247 247 247 55 247 | |
491 | 8 56 248 248 248 56 248 | |
492 | 9 57 249 249 249 57 249 | |
493 | : 58 122 122 122 58 122 | |
494 | ; 59 94 94 94 59 94 | |
495 | < 60 76 76 76 60 76 | |
496 | = 61 126 126 126 61 126 | |
497 | > 62 110 110 110 62 110 | |
498 | ? 63 111 111 111 63 111 | |
499 | @ 64 124 124 124 64 124 | |
500 | A 65 193 193 193 65 193 | |
501 | B 66 194 194 194 66 194 | |
502 | C 67 195 195 195 67 195 | |
503 | D 68 196 196 196 68 196 | |
504 | E 69 197 197 197 69 197 | |
505 | F 70 198 198 198 70 198 | |
506 | G 71 199 199 199 71 199 | |
507 | H 72 200 200 200 72 200 | |
508 | I 73 201 201 201 73 201 | |
509 | J 74 209 209 209 74 209 | |
510 | K 75 210 210 210 75 210 | |
511 | L 76 211 211 211 76 211 | |
512 | M 77 212 212 212 77 212 | |
513 | N 78 213 213 213 78 213 | |
514 | O 79 214 214 214 79 214 | |
515 | P 80 215 215 215 80 215 | |
516 | Q 81 216 216 216 81 216 | |
517 | R 82 217 217 217 82 217 | |
518 | S 83 226 226 226 83 226 | |
519 | T 84 227 227 227 84 227 | |
520 | U 85 228 228 228 85 228 | |
521 | V 86 229 229 229 86 229 | |
522 | W 87 230 230 230 87 230 | |
523 | X 88 231 231 231 88 231 | |
524 | Y 89 232 232 232 89 232 | |
525 | Z 90 233 233 233 90 233 | |
526 | [ 91 186 173 187 91 173 ** ## | |
527 | \ 92 224 224 188 92 224 ## | |
528 | ] 93 187 189 189 93 189 ** | |
529 | ^ 94 176 95 106 94 95 ** ## | |
530 | _ 95 109 109 109 95 109 | |
531 | ` 96 121 121 74 96 121 ## | |
532 | a 97 129 129 129 97 129 | |
533 | b 98 130 130 130 98 130 | |
534 | c 99 131 131 131 99 131 | |
535 | d 100 132 132 132 100 132 | |
536 | e 101 133 133 133 101 133 | |
537 | f 102 134 134 134 102 134 | |
538 | g 103 135 135 135 103 135 | |
539 | h 104 136 136 136 104 136 | |
540 | i 105 137 137 137 105 137 | |
541 | j 106 145 145 145 106 145 | |
542 | k 107 146 146 146 107 146 | |
543 | l 108 147 147 147 108 147 | |
544 | m 109 148 148 148 109 148 | |
545 | n 110 149 149 149 110 149 | |
546 | o 111 150 150 150 111 150 | |
547 | p 112 151 151 151 112 151 | |
548 | q 113 152 152 152 113 152 | |
549 | r 114 153 153 153 114 153 | |
550 | s 115 162 162 162 115 162 | |
551 | t 116 163 163 163 116 163 | |
552 | u 117 164 164 164 117 164 | |
553 | v 118 165 165 165 118 165 | |
554 | w 119 166 166 166 119 166 | |
555 | x 120 167 167 167 120 167 | |
556 | y 121 168 168 168 121 168 | |
557 | z 122 169 169 169 122 169 | |
558 | { 123 192 192 251 123 192 ## | |
559 | | 124 79 79 79 124 79 | |
560 | } 125 208 208 253 125 208 ## | |
561 | ~ 126 161 161 255 126 161 ## | |
562 | <DEL> 127 7 7 7 127 7 | |
563 | <PAD> 128 32 32 32 194.128 32 | |
564 | <HOP> 129 33 33 33 194.129 33 | |
565 | <BPH> 130 34 34 34 194.130 34 | |
566 | <NBH> 131 35 35 35 194.131 35 | |
567 | <IND> 132 36 36 36 194.132 36 | |
568 | <NEL> 133 21 37 37 194.133 37 ** | |
569 | <SSA> 134 6 6 6 194.134 6 | |
570 | <ESA> 135 23 23 23 194.135 23 | |
571 | <HTS> 136 40 40 40 194.136 40 | |
572 | <HTJ> 137 41 41 41 194.137 41 | |
573 | <VTS> 138 42 42 42 194.138 42 | |
574 | <PLD> 139 43 43 43 194.139 43 | |
575 | <PLU> 140 44 44 44 194.140 44 | |
576 | <RI> 141 9 9 9 194.141 9 | |
577 | <SS2> 142 10 10 10 194.142 10 | |
578 | <SS3> 143 27 27 27 194.143 27 | |
579 | <DCS> 144 48 48 48 194.144 48 | |
580 | <PU1> 145 49 49 49 194.145 49 | |
581 | <PU2> 146 26 26 26 194.146 26 | |
582 | <STS> 147 51 51 51 194.147 51 | |
583 | <CCH> 148 52 52 52 194.148 52 | |
584 | <MW> 149 53 53 53 194.149 53 | |
585 | <SPA> 150 54 54 54 194.150 54 | |
586 | <EPA> 151 8 8 8 194.151 8 | |
587 | <SOS> 152 56 56 56 194.152 56 | |
588 | <SGC> 153 57 57 57 194.153 57 | |
589 | <SCI> 154 58 58 58 194.154 58 | |
590 | <CSI> 155 59 59 59 194.155 59 | |
591 | <ST> 156 4 4 4 194.156 4 | |
592 | <OSC> 157 20 20 20 194.157 20 | |
593 | <PM> 158 62 62 62 194.158 62 | |
594 | <APC> 159 255 255 95 194.159 255 ## | |
595 | <NON-BREAKING SPACE> 160 65 65 65 194.160 128.65 | |
596 | <INVERTED "!" > 161 170 170 170 194.161 128.66 | |
597 | <CENT SIGN> 162 74 74 176 194.162 128.67 ## | |
598 | <POUND SIGN> 163 177 177 177 194.163 128.68 | |
599 | <CURRENCY SIGN> 164 159 159 159 194.164 128.69 | |
600 | <YEN SIGN> 165 178 178 178 194.165 128.70 | |
601 | <BROKEN BAR> 166 106 106 208 194.166 128.71 ## | |
602 | <SECTION SIGN> 167 181 181 181 194.167 128.72 | |
603 | <DIAERESIS> 168 189 187 121 194.168 128.73 ** ## | |
604 | <COPYRIGHT SIGN> 169 180 180 180 194.169 128.74 | |
605 | <FEMININE ORDINAL> 170 154 154 154 194.170 128.81 | |
606 | <LEFT POINTING GUILLEMET> 171 138 138 138 194.171 128.82 | |
607 | <NOT SIGN> 172 95 176 186 194.172 128.83 ** ## | |
608 | <SOFT HYPHEN> 173 202 202 202 194.173 128.84 | |
609 | <REGISTERED TRADE MARK> 174 175 175 175 194.174 128.85 | |
610 | <MACRON> 175 188 188 161 194.175 128.86 ## | |
611 | <DEGREE SIGN> 176 144 144 144 194.176 128.87 | |
612 | <PLUS-OR-MINUS SIGN> 177 143 143 143 194.177 128.88 | |
613 | <SUPERSCRIPT TWO> 178 234 234 234 194.178 128.89 | |
614 | <SUPERSCRIPT THREE> 179 250 250 250 194.179 128.98 | |
615 | <ACUTE ACCENT> 180 190 190 190 194.180 128.99 | |
616 | <MICRO SIGN> 181 160 160 160 194.181 128.100 | |
617 | <PARAGRAPH SIGN> 182 182 182 182 194.182 128.101 | |
618 | <MIDDLE DOT> 183 179 179 179 194.183 128.102 | |
619 | <CEDILLA> 184 157 157 157 194.184 128.103 | |
620 | <SUPERSCRIPT ONE> 185 218 218 218 194.185 128.104 | |
621 | <MASC. ORDINAL INDICATOR> 186 155 155 155 194.186 128.105 | |
622 | <RIGHT POINTING GUILLEMET> 187 139 139 139 194.187 128.106 | |
623 | <FRACTION ONE QUARTER> 188 183 183 183 194.188 128.112 | |
624 | <FRACTION ONE HALF> 189 184 184 184 194.189 128.113 | |
625 | <FRACTION THREE QUARTERS> 190 185 185 185 194.190 128.114 | |
626 | <INVERTED QUESTION MARK> 191 171 171 171 194.191 128.115 | |
627 | <A WITH GRAVE> 192 100 100 100 195.128 138.65 | |
628 | <A WITH ACUTE> 193 101 101 101 195.129 138.66 | |
629 | <A WITH CIRCUMFLEX> 194 98 98 98 195.130 138.67 | |
630 | <A WITH TILDE> 195 102 102 102 195.131 138.68 | |
631 | <A WITH DIAERESIS> 196 99 99 99 195.132 138.69 | |
632 | <A WITH RING ABOVE> 197 103 103 103 195.133 138.70 | |
633 | <CAPITAL LIGATURE AE> 198 158 158 158 195.134 138.71 | |
634 | <C WITH CEDILLA> 199 104 104 104 195.135 138.72 | |
635 | <E WITH GRAVE> 200 116 116 116 195.136 138.73 | |
636 | <E WITH ACUTE> 201 113 113 113 195.137 138.74 | |
637 | <E WITH CIRCUMFLEX> 202 114 114 114 195.138 138.81 | |
638 | <E WITH DIAERESIS> 203 115 115 115 195.139 138.82 | |
639 | <I WITH GRAVE> 204 120 120 120 195.140 138.83 | |
640 | <I WITH ACUTE> 205 117 117 117 195.141 138.84 | |
641 | <I WITH CIRCUMFLEX> 206 118 118 118 195.142 138.85 | |
642 | <I WITH DIAERESIS> 207 119 119 119 195.143 138.86 | |
643 | <CAPITAL LETTER ETH> 208 172 172 172 195.144 138.87 | |
644 | <N WITH TILDE> 209 105 105 105 195.145 138.88 | |
645 | <O WITH GRAVE> 210 237 237 237 195.146 138.89 | |
646 | <O WITH ACUTE> 211 238 238 238 195.147 138.98 | |
647 | <O WITH CIRCUMFLEX> 212 235 235 235 195.148 138.99 | |
648 | <O WITH TILDE> 213 239 239 239 195.149 138.100 | |
649 | <O WITH DIAERESIS> 214 236 236 236 195.150 138.101 | |
650 | <MULTIPLICATION SIGN> 215 191 191 191 195.151 138.102 | |
651 | <O WITH STROKE> 216 128 128 128 195.152 138.103 | |
652 | <U WITH GRAVE> 217 253 253 224 195.153 138.104 ## | |
653 | <U WITH ACUTE> 218 254 254 254 195.154 138.105 | |
654 | <U WITH CIRCUMFLEX> 219 251 251 221 195.155 138.106 ## | |
655 | <U WITH DIAERESIS> 220 252 252 252 195.156 138.112 | |
656 | <Y WITH ACUTE> 221 173 186 173 195.157 138.113 ** ## | |
657 | <CAPITAL LETTER THORN> 222 174 174 174 195.158 138.114 | |
658 | <SMALL LETTER SHARP S> 223 89 89 89 195.159 138.115 | |
659 | <a WITH GRAVE> 224 68 68 68 195.160 139.65 | |
660 | <a WITH ACUTE> 225 69 69 69 195.161 139.66 | |
661 | <a WITH CIRCUMFLEX> 226 66 66 66 195.162 139.67 | |
662 | <a WITH TILDE> 227 70 70 70 195.163 139.68 | |
663 | <a WITH DIAERESIS> 228 67 67 67 195.164 139.69 | |
664 | <a WITH RING ABOVE> 229 71 71 71 195.165 139.70 | |
665 | <SMALL LIGATURE ae> 230 156 156 156 195.166 139.71 | |
666 | <c WITH CEDILLA> 231 72 72 72 195.167 139.72 | |
667 | <e WITH GRAVE> 232 84 84 84 195.168 139.73 | |
668 | <e WITH ACUTE> 233 81 81 81 195.169 139.74 | |
669 | <e WITH CIRCUMFLEX> 234 82 82 82 195.170 139.81 | |
670 | <e WITH DIAERESIS> 235 83 83 83 195.171 139.82 | |
671 | <i WITH GRAVE> 236 88 88 88 195.172 139.83 | |
672 | <i WITH ACUTE> 237 85 85 85 195.173 139.84 | |
673 | <i WITH CIRCUMFLEX> 238 86 86 86 195.174 139.85 | |
674 | <i WITH DIAERESIS> 239 87 87 87 195.175 139.86 | |
675 | <SMALL LETTER eth> 240 140 140 140 195.176 139.87 | |
676 | <n WITH TILDE> 241 73 73 73 195.177 139.88 | |
677 | <o WITH GRAVE> 242 205 205 205 195.178 139.89 | |
678 | <o WITH ACUTE> 243 206 206 206 195.179 139.98 | |
679 | <o WITH CIRCUMFLEX> 244 203 203 203 195.180 139.99 | |
680 | <o WITH TILDE> 245 207 207 207 195.181 139.100 | |
681 | <o WITH DIAERESIS> 246 204 204 204 195.182 139.101 | |
682 | <DIVISION SIGN> 247 225 225 225 195.183 139.102 | |
683 | <o WITH STROKE> 248 112 112 112 195.184 139.103 | |
684 | <u WITH GRAVE> 249 221 221 192 195.185 139.104 ## | |
685 | <u WITH ACUTE> 250 222 222 222 195.186 139.105 | |
686 | <u WITH CIRCUMFLEX> 251 219 219 219 195.187 139.106 | |
687 | <u WITH DIAERESIS> 252 220 220 220 195.188 139.112 | |
688 | <y WITH ACUTE> 253 141 141 141 195.189 139.113 | |
689 | <SMALL LETTER thorn> 254 142 142 142 195.190 139.114 | |
690 | <y WITH DIAERESIS> 255 223 223 223 195.191 139.115 | |
d396a558 JH |
691 | |
692 | If you would rather see the above table in CCSID 0037 order rather than | |
693 | ASCII + Latin-1 order then run the table through: | |
694 | ||
695 | =over 4 | |
696 | ||
395f5a0c | 697 | =item recipe 4 |
d396a558 JH |
698 | |
699 | =back | |
700 | ||
5f26d5fd | 701 | perl \ |
8d725451 | 702 | -ne 'if(/.{29}\d{1,3}\s{2,4}\d{1,3}\s{2,4}\d{1,3}\s{2,4}\d{1,3}/)'\ |
84f709e7 JH |
703 | -e '{push(@l,$_)}' \ |
704 | -e 'END{print map{$_->[0]}' \ | |
705 | -e ' sort{$a->[1] <=> $b->[1]}' \ | |
8d725451 | 706 | -e ' map{[$_,substr($_,34,3)]}@l;}' perlebcdic.pod |
d396a558 | 707 | |
2c09a866 | 708 | If you would rather see it in CCSID 1047 order then change the number |
8d725451 | 709 | 34 in the last line to 39, like this: |
d396a558 JH |
710 | |
711 | =over 4 | |
712 | ||
395f5a0c | 713 | =item recipe 5 |
d396a558 JH |
714 | |
715 | =back | |
716 | ||
5f26d5fd | 717 | perl \ |
8d725451 | 718 | -ne 'if(/.{29}\d{1,3}\s{2,4}\d{1,3}\s{2,4}\d{1,3}\s{2,4}\d{1,3}/)'\ |
5f26d5fd KW |
719 | -e '{push(@l,$_)}' \ |
720 | -e 'END{print map{$_->[0]}' \ | |
721 | -e ' sort{$a->[1] <=> $b->[1]}' \ | |
8d725451 | 722 | -e ' map{[$_,substr($_,39,3)]}@l;}' perlebcdic.pod |
d396a558 | 723 | |
2c09a866 | 724 | If you would rather see it in POSIX-BC order then change the number |
4d2ca8b5 | 725 | 34 in the last line to 44, like this: |
d396a558 JH |
726 | |
727 | =over 4 | |
728 | ||
395f5a0c | 729 | =item recipe 6 |
d396a558 JH |
730 | |
731 | =back | |
732 | ||
5f26d5fd | 733 | perl \ |
8d725451 | 734 | -ne 'if(/.{29}\d{1,3}\s{2,4}\d{1,3}\s{2,4}\d{1,3}\s{2,4}\d{1,3}/)'\ |
84f709e7 JH |
735 | -e '{push(@l,$_)}' \ |
736 | -e 'END{print map{$_->[0]}' \ | |
737 | -e ' sort{$a->[1] <=> $b->[1]}' \ | |
8d725451 | 738 | -e ' map{[$_,substr($_,44,3)]}@l;}' perlebcdic.pod |
d396a558 | 739 | |
4d2ca8b5 | 740 | =head2 Table in hex, sorted in 1047 order |
d396a558 | 741 | |
4d2ca8b5 KW |
742 | Since this document was first written, the convention has become more |
743 | and more to use hexadecimal notation for code points. To do this with | |
744 | the recipes and to also sort is a multi-step process, so here, for | |
745 | convenience, is the table from above, re-sorted to be in Code Page 1047 | |
746 | order, and using hex notation. | |
d396a558 | 747 | |
4d2ca8b5 KW |
748 | ISO |
749 | 8859-1 POS- CCSID | |
750 | CCSID CCSID CCSID IX- 1047 | |
751 | chr 0819 0037 1047 BC UTF-8 UTF-EBCDIC | |
752 | --------------------------------------------------------------------- | |
753 | <NUL> 00 00 00 00 00 00 | |
754 | <SOH> 01 01 01 01 01 01 | |
755 | <STX> 02 02 02 02 02 02 | |
756 | <ETX> 03 03 03 03 03 03 | |
757 | <ST> 9C 04 04 04 C2.9C 04 | |
758 | <HT> 09 05 05 05 09 05 | |
759 | <SSA> 86 06 06 06 C2.86 06 | |
760 | <DEL> 7F 07 07 07 7F 07 | |
761 | <EPA> 97 08 08 08 C2.97 08 | |
762 | <RI> 8D 09 09 09 C2.8D 09 | |
763 | <SS2> 8E 0A 0A 0A C2.8E 0A | |
764 | <VT> 0B 0B 0B 0B 0B 0B | |
765 | <FF> 0C 0C 0C 0C 0C 0C | |
766 | <CR> 0D 0D 0D 0D 0D 0D | |
767 | <SO> 0E 0E 0E 0E 0E 0E | |
768 | <SI> 0F 0F 0F 0F 0F 0F | |
769 | <DLE> 10 10 10 10 10 10 | |
770 | <DC1> 11 11 11 11 11 11 | |
771 | <DC2> 12 12 12 12 12 12 | |
772 | <DC3> 13 13 13 13 13 13 | |
773 | <OSC> 9D 14 14 14 C2.9D 14 | |
774 | <LF> 0A 25 15 15 0A 15 ** | |
775 | <BS> 08 16 16 16 08 16 | |
776 | <ESA> 87 17 17 17 C2.87 17 | |
777 | <CAN> 18 18 18 18 18 18 | |
778 | <EOM> 19 19 19 19 19 19 | |
779 | <PU2> 92 1A 1A 1A C2.92 1A | |
780 | <SS3> 8F 1B 1B 1B C2.8F 1B | |
781 | <FS> 1C 1C 1C 1C 1C 1C | |
782 | <GS> 1D 1D 1D 1D 1D 1D | |
783 | <RS> 1E 1E 1E 1E 1E 1E | |
784 | <US> 1F 1F 1F 1F 1F 1F | |
785 | <PAD> 80 20 20 20 C2.80 20 | |
786 | <HOP> 81 21 21 21 C2.81 21 | |
787 | <BPH> 82 22 22 22 C2.82 22 | |
788 | <NBH> 83 23 23 23 C2.83 23 | |
789 | <IND> 84 24 24 24 C2.84 24 | |
790 | <NEL> 85 15 25 25 C2.85 25 ** | |
791 | <ETB> 17 26 26 26 17 26 | |
792 | <ESC> 1B 27 27 27 1B 27 | |
793 | <HTS> 88 28 28 28 C2.88 28 | |
794 | <HTJ> 89 29 29 29 C2.89 29 | |
795 | <VTS> 8A 2A 2A 2A C2.8A 2A | |
796 | <PLD> 8B 2B 2B 2B C2.8B 2B | |
797 | <PLU> 8C 2C 2C 2C C2.8C 2C | |
798 | <ENQ> 05 2D 2D 2D 05 2D | |
799 | <ACK> 06 2E 2E 2E 06 2E | |
800 | <BEL> 07 2F 2F 2F 07 2F | |
801 | <DCS> 90 30 30 30 C2.90 30 | |
802 | <PU1> 91 31 31 31 C2.91 31 | |
803 | <SYN> 16 32 32 32 16 32 | |
804 | <STS> 93 33 33 33 C2.93 33 | |
805 | <CCH> 94 34 34 34 C2.94 34 | |
806 | <MW> 95 35 35 35 C2.95 35 | |
807 | <SPA> 96 36 36 36 C2.96 36 | |
808 | <EOT> 04 37 37 37 04 37 | |
809 | <SOS> 98 38 38 38 C2.98 38 | |
810 | <SGC> 99 39 39 39 C2.99 39 | |
811 | <SCI> 9A 3A 3A 3A C2.9A 3A | |
812 | <CSI> 9B 3B 3B 3B C2.9B 3B | |
813 | <DC4> 14 3C 3C 3C 14 3C | |
814 | <NAK> 15 3D 3D 3D 15 3D | |
815 | <PM> 9E 3E 3E 3E C2.9E 3E | |
816 | <SUB> 1A 3F 3F 3F 1A 3F | |
817 | <SPACE> 20 40 40 40 20 40 | |
818 | <NON-BREAKING SPACE> A0 41 41 41 C2.A0 80.41 | |
819 | <a WITH CIRCUMFLEX> E2 42 42 42 C3.A2 8B.43 | |
820 | <a WITH DIAERESIS> E4 43 43 43 C3.A4 8B.45 | |
821 | <a WITH GRAVE> E0 44 44 44 C3.A0 8B.41 | |
822 | <a WITH ACUTE> E1 45 45 45 C3.A1 8B.42 | |
823 | <a WITH TILDE> E3 46 46 46 C3.A3 8B.44 | |
824 | <a WITH RING ABOVE> E5 47 47 47 C3.A5 8B.46 | |
825 | <c WITH CEDILLA> E7 48 48 48 C3.A7 8B.48 | |
826 | <n WITH TILDE> F1 49 49 49 C3.B1 8B.58 | |
827 | <CENT SIGN> A2 4A 4A B0 C2.A2 80.43 ## | |
828 | . 2E 4B 4B 4B 2E 4B | |
829 | < 3C 4C 4C 4C 3C 4C | |
830 | ( 28 4D 4D 4D 28 4D | |
831 | + 2B 4E 4E 4E 2B 4E | |
832 | | 7C 4F 4F 4F 7C 4F | |
833 | & 26 50 50 50 26 50 | |
834 | <e WITH ACUTE> E9 51 51 51 C3.A9 8B.4A | |
835 | <e WITH CIRCUMFLEX> EA 52 52 52 C3.AA 8B.51 | |
836 | <e WITH DIAERESIS> EB 53 53 53 C3.AB 8B.52 | |
837 | <e WITH GRAVE> E8 54 54 54 C3.A8 8B.49 | |
838 | <i WITH ACUTE> ED 55 55 55 C3.AD 8B.54 | |
839 | <i WITH CIRCUMFLEX> EE 56 56 56 C3.AE 8B.55 | |
840 | <i WITH DIAERESIS> EF 57 57 57 C3.AF 8B.56 | |
841 | <i WITH GRAVE> EC 58 58 58 C3.AC 8B.53 | |
842 | <SMALL LETTER SHARP S> DF 59 59 59 C3.9F 8A.73 | |
843 | ! 21 5A 5A 5A 21 5A | |
844 | $ 24 5B 5B 5B 24 5B | |
845 | * 2A 5C 5C 5C 2A 5C | |
846 | ) 29 5D 5D 5D 29 5D | |
847 | ; 3B 5E 5E 5E 3B 5E | |
848 | ^ 5E B0 5F 6A 5E 5F ** ## | |
849 | - 2D 60 60 60 2D 60 | |
850 | / 2F 61 61 61 2F 61 | |
851 | <A WITH CIRCUMFLEX> C2 62 62 62 C3.82 8A.43 | |
852 | <A WITH DIAERESIS> C4 63 63 63 C3.84 8A.45 | |
853 | <A WITH GRAVE> C0 64 64 64 C3.80 8A.41 | |
854 | <A WITH ACUTE> C1 65 65 65 C3.81 8A.42 | |
855 | <A WITH TILDE> C3 66 66 66 C3.83 8A.44 | |
856 | <A WITH RING ABOVE> C5 67 67 67 C3.85 8A.46 | |
857 | <C WITH CEDILLA> C7 68 68 68 C3.87 8A.48 | |
858 | <N WITH TILDE> D1 69 69 69 C3.91 8A.58 | |
859 | <BROKEN BAR> A6 6A 6A D0 C2.A6 80.47 ## | |
860 | , 2C 6B 6B 6B 2C 6B | |
861 | % 25 6C 6C 6C 25 6C | |
862 | _ 5F 6D 6D 6D 5F 6D | |
863 | > 3E 6E 6E 6E 3E 6E | |
864 | ? 3F 6F 6F 6F 3F 6F | |
865 | <o WITH STROKE> F8 70 70 70 C3.B8 8B.67 | |
866 | <E WITH ACUTE> C9 71 71 71 C3.89 8A.4A | |
867 | <E WITH CIRCUMFLEX> CA 72 72 72 C3.8A 8A.51 | |
868 | <E WITH DIAERESIS> CB 73 73 73 C3.8B 8A.52 | |
869 | <E WITH GRAVE> C8 74 74 74 C3.88 8A.49 | |
870 | <I WITH ACUTE> CD 75 75 75 C3.8D 8A.54 | |
871 | <I WITH CIRCUMFLEX> CE 76 76 76 C3.8E 8A.55 | |
872 | <I WITH DIAERESIS> CF 77 77 77 C3.8F 8A.56 | |
873 | <I WITH GRAVE> CC 78 78 78 C3.8C 8A.53 | |
874 | ` 60 79 79 4A 60 79 ## | |
875 | : 3A 7A 7A 7A 3A 7A | |
876 | # 23 7B 7B 7B 23 7B | |
877 | @ 40 7C 7C 7C 40 7C | |
878 | ' 27 7D 7D 7D 27 7D | |
879 | = 3D 7E 7E 7E 3D 7E | |
880 | " 22 7F 7F 7F 22 7F | |
881 | <O WITH STROKE> D8 80 80 80 C3.98 8A.67 | |
882 | a 61 81 81 81 61 81 | |
883 | b 62 82 82 82 62 82 | |
884 | c 63 83 83 83 63 83 | |
885 | d 64 84 84 84 64 84 | |
886 | e 65 85 85 85 65 85 | |
887 | f 66 86 86 86 66 86 | |
888 | g 67 87 87 87 67 87 | |
889 | h 68 88 88 88 68 88 | |
890 | i 69 89 89 89 69 89 | |
891 | <LEFT POINTING GUILLEMET> AB 8A 8A 8A C2.AB 80.52 | |
892 | <RIGHT POINTING GUILLEMET> BB 8B 8B 8B C2.BB 80.6A | |
893 | <SMALL LETTER eth> F0 8C 8C 8C C3.B0 8B.57 | |
894 | <y WITH ACUTE> FD 8D 8D 8D C3.BD 8B.71 | |
895 | <SMALL LETTER thorn> FE 8E 8E 8E C3.BE 8B.72 | |
896 | <PLUS-OR-MINUS SIGN> B1 8F 8F 8F C2.B1 80.58 | |
897 | <DEGREE SIGN> B0 90 90 90 C2.B0 80.57 | |
898 | j 6A 91 91 91 6A 91 | |
899 | k 6B 92 92 92 6B 92 | |
900 | l 6C 93 93 93 6C 93 | |
901 | m 6D 94 94 94 6D 94 | |
902 | n 6E 95 95 95 6E 95 | |
903 | o 6F 96 96 96 6F 96 | |
904 | p 70 97 97 97 70 97 | |
905 | q 71 98 98 98 71 98 | |
906 | r 72 99 99 99 72 99 | |
907 | <FEMININE ORDINAL> AA 9A 9A 9A C2.AA 80.51 | |
908 | <MASC. ORDINAL INDICATOR> BA 9B 9B 9B C2.BA 80.69 | |
909 | <SMALL LIGATURE ae> E6 9C 9C 9C C3.A6 8B.47 | |
910 | <CEDILLA> B8 9D 9D 9D C2.B8 80.67 | |
911 | <CAPITAL LIGATURE AE> C6 9E 9E 9E C3.86 8A.47 | |
912 | <CURRENCY SIGN> A4 9F 9F 9F C2.A4 80.45 | |
913 | <MICRO SIGN> B5 A0 A0 A0 C2.B5 80.64 | |
914 | ~ 7E A1 A1 FF 7E A1 ## | |
915 | s 73 A2 A2 A2 73 A2 | |
916 | t 74 A3 A3 A3 74 A3 | |
917 | u 75 A4 A4 A4 75 A4 | |
918 | v 76 A5 A5 A5 76 A5 | |
919 | w 77 A6 A6 A6 77 A6 | |
920 | x 78 A7 A7 A7 78 A7 | |
921 | y 79 A8 A8 A8 79 A8 | |
922 | z 7A A9 A9 A9 7A A9 | |
923 | <INVERTED "!" > A1 AA AA AA C2.A1 80.42 | |
924 | <INVERTED QUESTION MARK> BF AB AB AB C2.BF 80.73 | |
925 | <CAPITAL LETTER ETH> D0 AC AC AC C3.90 8A.57 | |
926 | [ 5B BA AD BB 5B AD ** ## | |
927 | <CAPITAL LETTER THORN> DE AE AE AE C3.9E 8A.72 | |
928 | <REGISTERED TRADE MARK> AE AF AF AF C2.AE 80.55 | |
929 | <NOT SIGN> AC 5F B0 BA C2.AC 80.53 ** ## | |
930 | <POUND SIGN> A3 B1 B1 B1 C2.A3 80.44 | |
931 | <YEN SIGN> A5 B2 B2 B2 C2.A5 80.46 | |
932 | <MIDDLE DOT> B7 B3 B3 B3 C2.B7 80.66 | |
933 | <COPYRIGHT SIGN> A9 B4 B4 B4 C2.A9 80.4A | |
934 | <SECTION SIGN> A7 B5 B5 B5 C2.A7 80.48 | |
935 | <PARAGRAPH SIGN> B6 B6 B6 B6 C2.B6 80.65 | |
936 | <FRACTION ONE QUARTER> BC B7 B7 B7 C2.BC 80.70 | |
937 | <FRACTION ONE HALF> BD B8 B8 B8 C2.BD 80.71 | |
938 | <FRACTION THREE QUARTERS> BE B9 B9 B9 C2.BE 80.72 | |
939 | <Y WITH ACUTE> DD AD BA AD C3.9D 8A.71 ** ## | |
940 | <DIAERESIS> A8 BD BB 79 C2.A8 80.49 ** ## | |
941 | <MACRON> AF BC BC A1 C2.AF 80.56 ## | |
942 | ] 5D BB BD BD 5D BD ** | |
943 | <ACUTE ACCENT> B4 BE BE BE C2.B4 80.63 | |
944 | <MULTIPLICATION SIGN> D7 BF BF BF C3.97 8A.66 | |
945 | { 7B C0 C0 FB 7B C0 ## | |
946 | A 41 C1 C1 C1 41 C1 | |
947 | B 42 C2 C2 C2 42 C2 | |
948 | C 43 C3 C3 C3 43 C3 | |
949 | D 44 C4 C4 C4 44 C4 | |
950 | E 45 C5 C5 C5 45 C5 | |
951 | F 46 C6 C6 C6 46 C6 | |
952 | G 47 C7 C7 C7 47 C7 | |
953 | H 48 C8 C8 C8 48 C8 | |
954 | I 49 C9 C9 C9 49 C9 | |
955 | <SOFT HYPHEN> AD CA CA CA C2.AD 80.54 | |
956 | <o WITH CIRCUMFLEX> F4 CB CB CB C3.B4 8B.63 | |
957 | <o WITH DIAERESIS> F6 CC CC CC C3.B6 8B.65 | |
958 | <o WITH GRAVE> F2 CD CD CD C3.B2 8B.59 | |
959 | <o WITH ACUTE> F3 CE CE CE C3.B3 8B.62 | |
960 | <o WITH TILDE> F5 CF CF CF C3.B5 8B.64 | |
961 | } 7D D0 D0 FD 7D D0 ## | |
962 | J 4A D1 D1 D1 4A D1 | |
963 | K 4B D2 D2 D2 4B D2 | |
964 | L 4C D3 D3 D3 4C D3 | |
965 | M 4D D4 D4 D4 4D D4 | |
966 | N 4E D5 D5 D5 4E D5 | |
967 | O 4F D6 D6 D6 4F D6 | |
968 | P 50 D7 D7 D7 50 D7 | |
969 | Q 51 D8 D8 D8 51 D8 | |
970 | R 52 D9 D9 D9 52 D9 | |
971 | <SUPERSCRIPT ONE> B9 DA DA DA C2.B9 80.68 | |
972 | <u WITH CIRCUMFLEX> FB DB DB DB C3.BB 8B.6A | |
973 | <u WITH DIAERESIS> FC DC DC DC C3.BC 8B.70 | |
974 | <u WITH GRAVE> F9 DD DD C0 C3.B9 8B.68 ## | |
975 | <u WITH ACUTE> FA DE DE DE C3.BA 8B.69 | |
976 | <y WITH DIAERESIS> FF DF DF DF C3.BF 8B.73 | |
977 | \ 5C E0 E0 BC 5C E0 ## | |
978 | <DIVISION SIGN> F7 E1 E1 E1 C3.B7 8B.66 | |
979 | S 53 E2 E2 E2 53 E2 | |
980 | T 54 E3 E3 E3 54 E3 | |
981 | U 55 E4 E4 E4 55 E4 | |
982 | V 56 E5 E5 E5 56 E5 | |
983 | W 57 E6 E6 E6 57 E6 | |
984 | X 58 E7 E7 E7 58 E7 | |
985 | Y 59 E8 E8 E8 59 E8 | |
986 | Z 5A E9 E9 E9 5A E9 | |
987 | <SUPERSCRIPT TWO> B2 EA EA EA C2.B2 80.59 | |
988 | <O WITH CIRCUMFLEX> D4 EB EB EB C3.94 8A.63 | |
989 | <O WITH DIAERESIS> D6 EC EC EC C3.96 8A.65 | |
990 | <O WITH GRAVE> D2 ED ED ED C3.92 8A.59 | |
991 | <O WITH ACUTE> D3 EE EE EE C3.93 8A.62 | |
992 | <O WITH TILDE> D5 EF EF EF C3.95 8A.64 | |
993 | 0 30 F0 F0 F0 30 F0 | |
994 | 1 31 F1 F1 F1 31 F1 | |
995 | 2 32 F2 F2 F2 32 F2 | |
996 | 3 33 F3 F3 F3 33 F3 | |
997 | 4 34 F4 F4 F4 34 F4 | |
998 | 5 35 F5 F5 F5 35 F5 | |
999 | 6 36 F6 F6 F6 36 F6 | |
1000 | 7 37 F7 F7 F7 37 F7 | |
1001 | 8 38 F8 F8 F8 38 F8 | |
1002 | 9 39 F9 F9 F9 39 F9 | |
1003 | <SUPERSCRIPT THREE> B3 FA FA FA C2.B3 80.62 | |
1004 | <U WITH CIRCUMFLEX> DB FB FB DD C3.9B 8A.6A ## | |
1005 | <U WITH DIAERESIS> DC FC FC FC C3.9C 8A.70 | |
1006 | <U WITH GRAVE> D9 FD FD E0 C3.99 8A.68 ## | |
1007 | <U WITH ACUTE> DA FE FE FE C3.9A 8A.69 | |
1008 | <APC> 9F FF FF 5F C2.9F FF ## | |
d396a558 | 1009 | |
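The multi-step process mentioned above (convert to hex, then sort) could be sketched roughly as follows. This is only an illustration: the three sample rows are copied from the decimal table, the row-matching pattern is an assumption about the table layout, and the names C<@rows>, C<@out> and C<to_hex> are invented for the example.

```perl
use strict;
use warnings;

# Three sample rows from the decimal table; a real run would read the
# rows from perlebcdic.pod instead.
my @rows = (
    '<y WITH DIAERESIS>    255  223  223  223  195.191  139.115',
    '<NOT SIGN>            172   95  176  186  194.172  128.83  ** ##',
    '<A WITH GRAVE>        192  100  100  100  195.128  138.65',
);

# Render a dotted decimal byte sequence such as "195.128" as "C3.80".
sub to_hex { join '.', map { sprintf '%02X', $_ } split /\./, shift }

my @parsed;
for my $row (@rows) {
    # Name, four single-byte columns, then the two dotted columns.
    next unless $row =~ /^(.*?)\s+(\d{1,3})\s+(\d{1,3})\s+(\d{1,3})
                          \s+(\d{1,3})\s+([\d.]+)\s+([\d.]+)/x;
    push @parsed, { name => $1, cols => [ $2, $3, $4, $5 ],
                    utf8 => $6, utfebcdic => $7 };
}

my @out;
# Index 2 of cols is the CCSID 1047 column.
for my $r (sort { $a->{cols}[2] <=> $b->{cols}[2] } @parsed) {
    push @out, sprintf '%-28s %s  %s  %s', $r->{name},
        join('  ', map { sprintf '%02X', $_ } @{ $r->{cols} }),
        to_hex($r->{utf8}), to_hex($r->{utfebcdic});
}
print "$_\n" for @out;
```

Sorting on a different index of C<cols> would give the 0037 or POSIX-BC order instead.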
4d2ca8b5 | 1010 | =head1 IDENTIFYING CHARACTER CODE SETS |
d396a558 | 1011 | |
4d2ca8b5 KW |
1012 | It is possible to determine which character set you are operating under. |
1013 | But first you need to be really really sure you need to do this. Your | |
1014 | code will be simpler and probably just as portable if you don't have | |
1015 | to test the character set and do different things, depending. There are | |
1016 | actually very few circumstances where it's not easy to write |
1017 | straight-line code portable to all character sets. See | |
1018 | L<perluniintro/Unicode and EBCDIC> for how to portably specify | |
1019 | characters. | |
d396a558 | 1020 | |
4d2ca8b5 KW |
1021 | But there are some cases where you may want to know which character set |
1022 | you are running under. One possible example is doing | |
1023 | L<sorting|/SORTING> in inner loops where performance is critical. | |
d396a558 | 1024 | |
4d2ca8b5 KW |
1025 | To determine if you are running under ASCII or EBCDIC, you can use the |
1026 | return value of C<ord()> or C<chr()> to test one or more character | |
1027 | values. For example: | |
d396a558 | 1028 | |
4d2ca8b5 KW |
1029 | $is_ascii = "A" eq chr(65); |
1030 | $is_ebcdic = "A" eq chr(193); | |
1031 | $is_ascii = ord("A") == 65; | |
1032 | $is_ebcdic = ord("A") == 193; | |
d396a558 | 1033 | |
4d2ca8b5 KW |
1034 | There's even less need to distinguish between EBCDIC code pages, but to |
1035 | do so try looking at one or more of the characters that differ between | |
1036 | them. | |
d396a558 | 1037 | |
84f709e7 JH |
1038 | $is_ascii = ord('[') == 91; |
1039 | $is_ebcdic_37 = ord('[') == 186; | |
1040 | $is_ebcdic_1047 = ord('[') == 173; | |
1041 | $is_ebcdic_POSIX_BC = ord('[') == 187; | |
d396a558 JH |
1042 | |
1043 | However, it would be unwise to write tests such as: | |
1044 | ||
84f709e7 JH |
1045 | $is_ascii = "\r" ne chr(13); # WRONG |
1046 | $is_ascii = "\n" ne chr(10); # ILL ADVISED | |
d396a558 | 1047 | |
4d2ca8b5 KW |
1048 | Obviously the first of these will fail to distinguish most ASCII |
1049 | platforms from either a CCSID 0037, a 1047, or a POSIX-BC EBCDIC | |
1050 | platform since S<C<"\r" eq chr(13)>> under all of those coded character | |
1051 | sets. But note too that because C<"\n"> is C<chr(13)> and C<"\r"> is | |
1052 | C<chr(10)> on old Macintosh (which is an ASCII platform) the second | |
1053 | C<$is_ascii> test will lead to trouble there. | |
d396a558 | 1054 | |
eaf8b9b9 | 1055 | To determine whether or not perl was built under an EBCDIC |
d396a558 JH |
1056 | code page you can use the Config module like so: |
1057 | ||
1058 | use Config; | |
84f709e7 | 1059 | $is_ebcdic = $Config{'ebcdic'} eq 'define'; |
d396a558 JH |
1060 | |
1061 | =head1 CONVERSIONS | |
1062 | ||
d5924ca6 KW |
1063 | =head2 C<utf8::unicode_to_native()> and C<utf8::native_to_unicode()> |
1064 | ||
1065 | These functions take an input numeric code point in one encoding and | |
1066 | return what its equivalent value is in the other. | |
1067 | ||
4d2ca8b5 KW |
1068 | See L<utf8>. |
1069 | ||
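As a small sketch of how the two functions might be used (the variable names are invented for the example):

```perl
use strict;
use warnings;

# Unicode code point -> native code point.  This is the identity on
# ASCII platforms; on the EBCDIC code pages tabulated above, U+0041
# would come back as 193.
my $native_A = utf8::unicode_to_native(0x41);

# Native code point -> Unicode code point: 65 for "A" on every platform.
my $unicode_A = utf8::native_to_unicode(ord 'A');

printf "native: %d, Unicode: %d\n", $native_A, $unicode_A;
print "chr() of the native value is 'A'\n" if chr($native_A) eq 'A';
```

Both functions are available without saying C<use utf8>.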
1e054b24 PP |
1070 | =head2 tr/// |
1071 | ||
eaf8b9b9 | 1072 | In order to convert a string of characters from one character set to |
d396a558 | 1073 | another, a simple list of numbers, such as in the right columns in the |
4d2ca8b5 | 1074 | above table, along with Perl's C<tr///> operator is all that is needed. |
5f26d5fd | 1075 | The data in the table are in ASCII/Latin1 order, hence the EBCDIC columns |
eaf8b9b9 | 1076 | provide easy-to-use ASCII/Latin1 to EBCDIC operations that are also easily |
d396a558 JH |
1077 | reversed. |
1078 | ||
5f26d5fd | 1079 | For example, to convert ASCII/Latin1 to code page 037, take the |
4d2ca8b5 KW |
1080 | second numbers column from the output of recipe 2 (modified to add |
1081 | C<"\"> characters), and use it in C<tr///> like so: |
d396a558 | 1082 | |
eaf8b9b9 | 1083 | $cp_037 = |
5f26d5fd KW |
1084 | '\x00\x01\x02\x03\x37\x2D\x2E\x2F\x16\x05\x25\x0B\x0C\x0D\x0E\x0F' . |
1085 | '\x10\x11\x12\x13\x3C\x3D\x32\x26\x18\x19\x3F\x27\x1C\x1D\x1E\x1F' . | |
1086 | '\x40\x5A\x7F\x7B\x5B\x6C\x50\x7D\x4D\x5D\x5C\x4E\x6B\x60\x4B\x61' . | |
1087 | '\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF7\xF8\xF9\x7A\x5E\x4C\x7E\x6E\x6F' . | |
1088 | '\x7C\xC1\xC2\xC3\xC4\xC5\xC6\xC7\xC8\xC9\xD1\xD2\xD3\xD4\xD5\xD6' . | |
1089 | '\xD7\xD8\xD9\xE2\xE3\xE4\xE5\xE6\xE7\xE8\xE9\xBA\xE0\xBB\xB0\x6D' . | |
1090 | '\x79\x81\x82\x83\x84\x85\x86\x87\x88\x89\x91\x92\x93\x94\x95\x96' . | |
1091 | '\x97\x98\x99\xA2\xA3\xA4\xA5\xA6\xA7\xA8\xA9\xC0\x4F\xD0\xA1\x07' . | |
1092 | '\x20\x21\x22\x23\x24\x15\x06\x17\x28\x29\x2A\x2B\x2C\x09\x0A\x1B' . | |
1093 | '\x30\x31\x1A\x33\x34\x35\x36\x08\x38\x39\x3A\x3B\x04\x14\x3E\xFF' . | |
1094 | '\x41\xAA\x4A\xB1\x9F\xB2\x6A\xB5\xBD\xB4\x9A\x8A\x5F\xCA\xAF\xBC' . | |
1095 | '\x90\x8F\xEA\xFA\xBE\xA0\xB6\xB3\x9D\xDA\x9B\x8B\xB7\xB8\xB9\xAB' . | |
1096 | '\x64\x65\x62\x66\x63\x67\x9E\x68\x74\x71\x72\x73\x78\x75\x76\x77' . | |
1097 | '\xAC\x69\xED\xEE\xEB\xEF\xEC\xBF\x80\xFD\xFE\xFB\xFC\xAD\xAE\x59' . | |
1098 | '\x44\x45\x42\x46\x43\x47\x9C\x48\x54\x51\x52\x53\x58\x55\x56\x57' . | |
1099 | '\x8C\x49\xCD\xCE\xCB\xCF\xCC\xE1\x70\xDD\xDE\xDB\xDC\x8D\x8E\xDF'; | |
d396a558 JH |
1100 | |
1101 | my $ebcdic_string = $ascii_string; | |
5f26d5fd | 1102 | eval '$ebcdic_string =~ tr/\000-\377/' . $cp_037 . '/'; |
d396a558 | 1103 | |
0be03469 | 1104 | To convert from EBCDIC 037 to ASCII just reverse the order of the tr/// |
d396a558 JH |
1105 | arguments like so: |
1106 | ||
1107 | my $ascii_string = $ebcdic_string; | |
5f26d5fd KW |
1108 | eval '$ascii_string =~ tr/' . $cp_037 . '/\000-\377/'; |
1109 | ||
1110 | Similarly one could take the output of the third numbers column from recipe 2 | |
1111 | to obtain a C<$cp_1047> table. The fourth numbers column of the output from | |
1112 | recipe 2 could provide a C<$cp_posix_bc> table suitable for transcoding as | |
1113 | well. | |
d5d9880c | 1114 | |
5f26d5fd KW |
1115 | If you wanted to see the inverse tables, you would first have to sort on the |
1116 | desired numbers column as in recipes 4, 5 or 6, then take the output of the | |
1117 | first numbers column. | |
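Because C<tr///> does not interpolate variables, the examples above go through a string C<eval> each time. One possible variation, sketched here, is to C<eval> once and keep the resulting subroutine. To stay short it translates only the ten digits, which the table shows at 0xF0 through 0xF9 in all three EBCDIC code pages. It also assumes ASCII input on an ASCII platform, and the names C<$from>, C<$to> and C<$digits_to_ebcdic> are made up for the illustration.

```perl
use strict;
use warnings;

# Illustrative subset only: ASCII digits to their EBCDIC positions.
# A full translator would use '\000-\377' and a 256-entry table such
# as $cp_037 in place of these two strings.
my $from = '\x30-\x39';
my $to   = '\xF0-\xF9';

# Compile the tr/// once; the sub returns a translated copy of its
# argument and leaves the original untouched.
my $digits_to_ebcdic =
    eval "sub { (my \$s = shift) =~ tr/$from/$to/; \$s }"
    or die "could not compile translator: $@";

my $ebcdic_digits = $digits_to_ebcdic->("2024");
printf "%vX\n", $ebcdic_digits;    # code points in hex, dot-separated
```

Swapping C<$from> and C<$to> in the C<eval> string would give the reverse translator.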
1e054b24 PP |
1118 | |
1119 | =head2 iconv | |
d396a558 | 1120 | |
d5d9880c | 1121 | XPG operability often implies the presence of an I<iconv> utility |
d396a558 JH |
1122 | available from the shell or from the C library. Consult your system's |
1123 | documentation for information on iconv. | |
1124 | ||
4d2ca8b5 | 1125 | On OS/390 or z/OS see the L<iconv(1)> manpage. One way to invoke the C<iconv> |
d396a558 JH |
1126 | shell utility from within perl would be to: |
1127 | ||
395f5a0c | 1128 | # OS/390 or z/OS example |
84f709e7 | 1129 | $ascii_data = `echo '$ebcdic_data'| iconv -f IBM-1047 -t ISO8859-1`; |
d396a558 JH |
1130 | |
1131 | or the inverse map: | |
1132 | ||
395f5a0c | 1133 | # OS/390 or z/OS example |
84f709e7 | 1134 | $ebcdic_data = `echo '$ascii_data'| iconv -f ISO8859-1 -t IBM-1047`; |
d396a558 | 1135 | |
4d2ca8b5 | 1136 | For other Perl-based conversion options see the C<Convert::*> modules on CPAN. |
d396a558 | 1137 | |
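One Perl-based option not named above is the core C<Encode> module, which knows encodings called C<cp37>, C<cp1047> and C<posix-bc>. The following is a minimal sketch that assumes it is run on an ASCII platform. Whether C<Encode>'s treatment of the line-terminator controls agrees with the table in this document should be checked before relying on it.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text = "Perl";

# Characters -> bytes in code page 1047.
my $ebcdic = encode('cp1047', $text);

# Bytes in code page 1047 -> characters.
my $back = decode('cp1047', $ebcdic);

printf "%vX\n", $ebcdic;           # the encoded bytes, in hex
print "round trip ok\n" if $back eq $text;
```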
1e054b24 PP |
1138 | =head2 C RTL |
1139 | ||
4d2ca8b5 | 1140 | The OS/390 and z/OS C run-time libraries provide C<_atoe()> and C<_etoa()> functions. |
1e054b24 | 1141 | |
d396a558 JH |
1142 | =head1 OPERATOR DIFFERENCES |
1143 | ||
eaf8b9b9 | 1144 | The C<..> range operator treats certain character ranges with |
2bbc8d55 SP |
1145 | care on EBCDIC platforms. For example the following array |
1146 | will have twenty-six elements on either an EBCDIC platform |
1147 | or an ASCII platform: | |
d396a558 | 1148 | |
84f709e7 | 1149 | @alphabet = ('A'..'Z'); # $#alphabet == 25 |
d396a558 JH |
1150 | |
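To see why that care matters, the string range can be compared with a purely numeric range over the same end points. This is a sketch with invented variable names. On an ASCII platform both lists have 26 elements. On the EBCDIC code pages tabulated above, C<ord('A')> is 193 and C<ord('Z')> is 233, so the numeric list would have 41 elements, including the gap characters.

```perl
use strict;
use warnings;

my @letters = ('A' .. 'Z');                          # string range
my @by_code = map { chr($_) } ord('A') .. ord('Z');  # numeric range

printf "%d letters, %d code points\n",
    scalar @letters, scalar @by_code;
```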
1151 | The bitwise operators such as & ^ | may return different results | |
4d2ca8b5 | 1152 | when operating on string or character data in a Perl program running |
2bbc8d55 | 1153 | on an EBCDIC platform than when run on an ASCII platform. Here is |
d396a558 JH |
1154 | an example adapted from the one in L<perlop>: |
1155 | ||
1156 | # EBCDIC-based examples | |
84f709e7 | 1157 | print "j p \n" ^ " a h"; # prints "JAPH\n" |
eaf8b9b9 | 1158 | print "JA" | " ph\n"; # prints "japh\n" |
84f709e7 JH |
1159 | print "JAPH\nJunk" & "\277\277\277\277\277"; # prints "japh\n"; |
1160 | print 'p N$' ^ " E<H\n"; # prints "Perl\n"; | |
d396a558 JH |
1161 | |
1162 | An interesting property of the 32 C0 control characters | |
1163 | in the ASCII table is that they can "literally" be constructed | |
4d2ca8b5 | 1164 | as control characters in Perl, e.g. C<chr(0) eq "\c@">, |
c72e675e | 1165 | C<chr(1) eq "\cA">, and so on. Perl on EBCDIC platforms has been |
4d2ca8b5 | 1166 | ported to take C<\c@> to C<chr(0)> and C<\cA> to C<chr(1)>, etc. as well, but the |
2bd1cbf6 | 1167 | characters that result depend on which code page you are |
2c09a866 KW |
1168 | using. The table below uses the standard acronyms for the controls. |
1169 | The POSIX-BC and 1047 sets are | |
eaf8b9b9 | 1170 | identical throughout this range and differ from the 0037 set at only |
4d2ca8b5 | 1171 | one spot (21 decimal). Note that the line terminator character |
eaf8b9b9 KW |
1172 | may be generated by C<\cJ> on ASCII platforms but by C<\cU> on 1047 or POSIX-BC |
1173 | platforms and cannot be generated as a C<"\c.letter."> control character on | |
2c09a866 KW |
1174 | 0037 platforms. Note also that C<\c\> cannot be the final element in a string |
1175 | or regex, as it will absorb the terminator. But C<\c\I<X>> is a C<FILE | |
1176 | SEPARATOR> concatenated with I<X> for all I<X>. | |
2bd1cbf6 KW |
1177 | The outlier C<\c?> on ASCII, which yields a non-C0 control C<DEL>, |
1178 | yields the outlier control C<APC> on EBCDIC, the one that isn't in the | |
aae773bb KW |
1179 | block of contiguous controls. Note that a subtlety of this is that |
1180 | C<\c?> on ASCII platforms is an ASCII character, while it isn't | |
1181 | equivalent to any ASCII character in EBCDIC platforms. | |
2c09a866 | 1182 | |
eaf8b9b9 | 1183 | chr ord 8859-1 0037 1047 && POSIX-BC |
c72e675e | 1184 | ----------------------------------------------------------------------- |
2c09a866 | 1185 | \c@ 0 <NUL> <NUL> <NUL> |
eaf8b9b9 | 1186 | \cA 1 <SOH> <SOH> <SOH> |
2c09a866 KW |
1187 | \cB 2 <STX> <STX> <STX> |
1188 | \cC 3 <ETX> <ETX> <ETX> | |
eaf8b9b9 KW |
1189 | \cD 4 <EOT> <ST> <ST> |
1190 | \cE 5 <ENQ> <HT> <HT> | |
1191 | \cF 6 <ACK> <SSA> <SSA> | |
1192 | \cG 7 <BEL> <DEL> <DEL> | |
1193 | \cH 8 <BS> <EPA> <EPA> | |
1194 | \cI 9 <HT> <RI> <RI> | |
1195 | \cJ 10 <LF> <SS2> <SS2> | |
2c09a866 | 1196 | \cK 11 <VT> <VT> <VT> |
eaf8b9b9 KW |
1197 | \cL 12 <FF> <FF> <FF> |
1198 | \cM 13 <CR> <CR> <CR> | |
2c09a866 KW |
1199 | \cN 14 <SO> <SO> <SO> |
1200 | \cO 15 <SI> <SI> <SI> | |
eaf8b9b9 | 1201 | \cP 16 <DLE> <DLE> <DLE> |
2c09a866 KW |
1202 | \cQ 17 <DC1> <DC1> <DC1> |
1203 | \cR 18 <DC2> <DC2> <DC2> | |
eaf8b9b9 KW |
1204 | \cS 19 <DC3> <DC3> <DC3> |
1205 | \cT 20 <DC4> <OSC> <OSC> | |
8d725451 | 1206 | \cU 21 <NAK> <NEL> <LF> ** |
2c09a866 | 1207 | \cV 22 <SYN> <BS> <BS> |
eaf8b9b9 | 1208 | \cW 23 <ETB> <ESA> <ESA> |
2c09a866 KW |
1209 | \cX 24 <CAN> <CAN> <CAN> |
1210 | \cY 25 <EOM> <EOM> <EOM> | |
eaf8b9b9 KW |
1211 | \cZ 26 <SUB> <PU2> <PU2> |
1212 | \c[ 27 <ESC> <SS3> <SS3> | |
2c09a866 KW |
1213 | \c\X 28 <FS>X <FS>X <FS>X |
1214 | \c] 29 <GS> <GS> <GS> | |
1215 | \c^ 30 <RS> <RS> <RS> | |
1216 | \c_ 31 <US> <US> <US> | |
2bd1cbf6 KW |
1217 | \c? * <DEL> <APC> <APC> |
1218 | ||
1219 | C<*> Note: C<\c?> maps to ordinal 127 (C<DEL>) on ASCII platforms, but | |
1220 | since ordinal 127 is not a control character on EBCDIC machines, |
4d2ca8b5 KW |
1221 | C<\c?> instead maps on them to C<APC>, which is 255 in 0037 and 1047, |
1222 | and 95 in POSIX-BC. | |
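A program that really needs to know which control escape, if any, produces its line terminator could probe for it along these lines. This is only a sketch, and the names C<%controls> and C<$newline_escape> are invented for the example.

```perl
use strict;
use warnings;

# The two candidates named in the text above, keyed by how they are
# written in source code.
my %controls = ('\cJ' => "\cJ", '\cU' => "\cU");

my ($newline_escape) =
    grep { $controls{$_} eq "\n" } sort keys %controls;

print defined $newline_escape
    ? "\"\\n\" is generated by $newline_escape\n"
    : "\"\\n\" is not a \\c control here (as on 0037)\n";

# DEL (127) on ASCII platforms; APC on EBCDIC ones.
printf "\\c? has ordinal %d\n", ord("\c?");
```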
d396a558 JH |
1223 | |
1224 | =head1 FUNCTION DIFFERENCES | |
1225 | ||
1226 | =over 8 | |
1227 | ||
4d2ca8b5 | 1228 | =item C<chr()> |
d396a558 | 1229 | |
4d2ca8b5 | 1230 | C<chr()> must be given an EBCDIC code number argument to yield a desired |
2bbc8d55 | 1231 | character return value on an EBCDIC platform. For example: |
d396a558 | 1232 | |
84f709e7 | 1233 | $CAPITAL_LETTER_A = chr(193); |
d396a558 | 1234 | |
4d2ca8b5 | 1235 | =item C<ord()> |
d396a558 | 1236 | |
4d2ca8b5 | 1237 | C<ord()> will return EBCDIC code number values on an EBCDIC platform. |
d396a558 JH |
1238 | For example: |
1239 | ||
84f709e7 | 1240 | $the_number_193 = ord("A"); |
d396a558 | 1241 | |
4d2ca8b5 | 1242 | =item C<pack()> |
d396a558 | 1243 | |
4d2ca8b5 KW |
1244 | |
1245 | The C<"c"> and C<"C"> templates for C<pack()> are dependent upon character set | |
d396a558 JH |
1246 | encoding. Examples of usage on EBCDIC include: |
1247 | ||
1248 | $foo = pack("CCCC",193,194,195,196); | |
1249 | # $foo eq "ABCD" | |
84f709e7 | 1250 | $foo = pack("C4",193,194,195,196); |
d396a558 JH |
1251 | # same thing |
1252 | ||
1253 | $foo = pack("ccxxcc",193,194,195,196); | |
1254 | # $foo eq "AB\0\0CD" | |
1255 | ||
4d2ca8b5 KW |
1256 | The C<"U"> template has been ported to mean "Unicode" on all platforms so |
1257 | that | |
1258 | ||
1259 | pack("U", 65) eq 'A' | |
1260 | ||
1261 | is true on all platforms. If you want native code points for the low | |
1262 | 256, use the C<"W"> template. This means that the equivalences | |
1263 | ||
1264 | pack("W", ord($character)) eq $character | |
1265 | unpack("W", $character) == ord $character | |
1266 | ||
1267 | will hold. | |
1268 | ||
4d2ca8b5 | 1269 | =item C<print()> |
d396a558 JH |
1270 | |
1271 | One must be careful with scalars and strings that are passed to | |
1272 | print that contain ASCII encodings. One common place | |
1273 | for this to occur is in the output of the MIME type header for | |
4d2ca8b5 | 1274 | CGI script writing. For example, many Perl programming guides |
d396a558 JH |
1275 | recommend something similar to: |
1276 | ||
eaf8b9b9 | 1277 | print "Content-type:\ttext/html\015\012\015\012"; |
d396a558 JH |
1278 | # this may be wrong on EBCDIC |
1279 | ||
4d2ca8b5 | 1280 | You can instead write |
d396a558 | 1281 | |
5f26d5fd | 1282 | print "Content-type:\ttext/html\r\n\r\n"; # OK for DGW et al |
d396a558 | 1283 | |
4d2ca8b5 KW |
1284 | and have it work portably. |
1285 | ||
d396a558 | 1286 | That is because the translation from EBCDIC to ASCII is done |
4d2ca8b5 | 1287 | by the web server in this case. Consult your web server's documentation for |
d396a558 JH |
1288 | further details. |
1289 | ||
4d2ca8b5 | 1290 | =item C<printf()> |
d396a558 JH |
1291 | |
1292 | The formats that can convert characters to numbers and vice versa | |
1293 | will be different from their ASCII counterparts when executed | |
2bbc8d55 | 1294 | on an EBCDIC platform. Examples include: |
d396a558 JH |
1295 | |
1296 | printf("%c%c%c",193,194,195); # prints ABC | |
1297 | ||
4d2ca8b5 | 1298 | =item C<sort()> |
d396a558 | 1299 | |
eaf8b9b9 | 1300 | EBCDIC sort results may differ from ASCII sort results especially for |
4d2ca8b5 | 1301 | mixed case strings. This is discussed in more detail L<below|/SORTING>. |
d396a558 | 1302 | |
4d2ca8b5 | 1303 | =item C<sprintf()> |
d396a558 | 1304 | |
4d2ca8b5 | 1305 | See the discussion of C<L</printf()>> above. An example of the use |
d396a558 JH |
1306 | of sprintf would be: |
1307 | ||
84f709e7 | 1308 | $CAPITAL_LETTER_A = sprintf("%c",193); |
d396a558 | 1309 | |
4d2ca8b5 | 1310 | =item C<unpack()> |
d396a558 | 1311 | |
4d2ca8b5 | 1312 | See the discussion of C<L</pack()>> above. |
d396a558 JH |
1313 | |
1314 | =back | |
1315 | ||
4d2ca8b5 KW |
1316 | Note that it is possible to write portable code for these by specifying |
1317 | things in Unicode numbers, and using a conversion function: | |
1318 | ||
1319 | printf("%c",utf8::unicode_to_native(65)); # prints A on all | |
1320 | # platforms | |
1321 | print utf8::native_to_unicode(ord("A")); # Likewise, prints 65 | |
1322 | ||
1323 | See L<perluniintro/Unicode and EBCDIC> and L</CONVERSIONS> | |
1324 | for other options. | |
1325 | ||
d396a558 JH |
1326 | =head1 REGULAR EXPRESSION DIFFERENCES |
1327 | ||
4d2ca8b5 KW |
1328 | You can write your regular expressions just like someone on an ASCII |
1329 | platform would do. But keep in mind that using octal or hex notation to | |
1330 | specify a particular code point will give you the character that the | |
1331 | EBCDIC code page natively maps to it. (This is also true of all | |
1332 | double-quoted strings.) If you want to write portably, just use the | |
1333 | C<\N{U+...}> notation everywhere where you would have used C<\x{...}>, | |
1334 | and don't use octal notation at all. | |
1335 | ||
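As a small illustration of the advice above, the following sketch contrasts a C<\N{U+...}> literal with a C<\x> literal. The variable names are made up for this example, and the last line only prints on a platform whose native code for "A" is 65:

```perl
use strict;
use warnings;

# "\N{U+41}" is LATIN CAPITAL LETTER A on every platform, whereas
# "\x41" is whatever character the native code page puts at 0x41.
my $portable_a = "\N{U+41}";

# $looks_like_ascii is a hypothetical name used only in this sketch.
my $looks_like_ascii = ord('A') == 65;

print "portable literal is 'A'\n" if $portable_a eq 'A';
print "regex matches 'A'\n"       if 'A' =~ /\N{U+41}/;
print "\\x41 is also 'A' here\n"  if $looks_like_ascii && "\x41" eq 'A';
```

The first two tests should hold on both ASCII and EBCDIC platforms; only the third depends on the native character set.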
1336 | Starting in Perl v5.22, this applies to ranges in bracketed character | |
1337 | classes. If you say, for example, C<qr/[\N{U+20}-\N{U+7E}]/>, it means |
1338 | the characters C<\N{U+20}>, C<\N{U+21}>, ..., C<\N{U+7E}>. This range |
1339 | is all the printable characters that the ASCII character set contains. | |
1340 | ||
1341 | Prior to v5.22, you couldn't specify any ranges portably, except that |
1342 | (starting in Perl v5.5.3) all subsets of the C<[A-Z]> and C<[a-z]> | |
1343 | ranges are specially coded to not pick up gap characters. For example, | |
1344 | characters such as "E<ocirc>" (C<o WITH CIRCUMFLEX>) that lie between | |
1345 | "I" and "J" would not be matched by the regular expression range | |
1346 | C</[H-K]/>. But if either of the range end points is explicitly numeric | |
1347 | (and neither is specified by C<\N{U+...}>), the gap characters are | |
1348 | matched: | |
1349 | ||
1350 | /[\x89-\x91]/ | |
1351 | ||
1352 | will match C<\x8e>, even though C<\x89> is "i" and C<\x91> is "j", |
1353 | and C<\x8e> is a gap character, from the alphabetic viewpoint. | |
1354 | ||
1355 | Another construct to be wary of is the inappropriate use of hex (unless | |
1356 | you use C<\N{U+...}>) or | |
d396a558 JH |
1357 | octal constants in regular expressions. Consider the following |
1358 | set of subs: | |
1359 | ||
1360 | sub is_c0 { | |
1361 | my $char = substr(shift,0,1); | |
1362 | $char =~ /[\000-\037]/; | |
1363 | } | |
1364 | ||
1365 | sub is_print_ascii { | |
1366 | my $char = substr(shift,0,1); | |
1367 | $char =~ /[\040-\176]/; | |
1368 | } | |
1369 | ||
1370 | sub is_delete { | |
1371 | my $char = substr(shift,0,1); | |
1372 | $char eq "\177"; | |
1373 | } | |
1374 | ||
1375 | sub is_c1 { | |
1376 | my $char = substr(shift,0,1); | |
1377 | $char =~ /[\200-\237]/; | |
1378 | } | |
1379 | ||
10c526cf | 1380 | sub is_latin_1 { # But not ASCII; not C1 |
d396a558 JH |
1381 | my $char = substr(shift,0,1); |
1382 | $char =~ /[\240-\377]/; | |
1383 | } | |
1384 | ||
4d2ca8b5 KW |
1385 | These are valid only on ASCII platforms. Starting in Perl v5.22, simply |
1386 | changing the octal constants to equivalent C<\N{U+...}> values makes | |
1387 | them portable: | |
1388 | ||
1389 | sub is_c0 { | |
1390 | my $char = substr(shift,0,1); | |
1391 | $char =~ /[\N{U+00}-\N{U+1F}]/; | |
1392 | } | |
1393 | ||
1394 | sub is_print_ascii { | |
1395 | my $char = substr(shift,0,1); | |
1396 | $char =~ /[\N{U+20}-\N{U+7E}]/; | |
1397 | } | |
1398 | ||
1399 | sub is_delete { | |
1400 | my $char = substr(shift,0,1); | |
1401 | $char eq "\N{U+7F}"; | |
1402 | } | |
1403 | ||
1404 | sub is_c1 { | |
1405 | my $char = substr(shift,0,1); | |
1406 | $char =~ /[\N{U+80}-\N{U+9F}]/; | |
1407 | } | |
1408 | ||
1409 | sub is_latin_1 { # But not ASCII; not C1 | |
1410 | my $char = substr(shift,0,1); | |
1411 | $char =~ /[\N{U+A0}-\N{U+FF}]/; | |
1412 | } | |
1413 | ||
1414 | And here are some alternative portable ways to write them: | |
d396a558 JH |
1415 | |
1416 | sub Is_c0 { | |
1417 | my $char = substr(shift,0,1); | |
f11f9c4c KW |
1418 | return $char =~ /[[:cntrl:]]/a && ! Is_delete($char); |
1419 | ||
1420 | # Alternatively: | |
1421 | # return $char =~ /[[:cntrl:]]/ | |
1422 | # && $char =~ /[[:ascii:]]/ | |
1423 | # && ! Is_delete($char); | |
d396a558 JH |
1424 | } |
1425 | ||
1426 | sub Is_print_ascii { | |
1427 | my $char = substr(shift,0,1); | |
10c526cf | 1428 | |
f11f9c4c | 1429 | return $char =~ /[[:print:]]/a; |
10c526cf KW |
1430 | |
1431 | # Alternatively: | |
f11f9c4c KW |
1432 | # return $char =~ /[[:print:]]/ && $char =~ /[[:ascii:]]/; |
1433 | ||
1434 | # Or | |
10c526cf KW |
1435 | # return $char |
1436 | # =~ /[ !"\#\$%&'()*+,\-.\/0-9:;<=>?\@A-Z[\\\]^_`a-z{|}~]/; | |
d396a558 JH |
1437 | } |
1438 | ||
1439 | sub Is_delete { | |
1440 | my $char = substr(shift,0,1); | |
10c526cf | 1441 | return utf8::native_to_unicode(ord $char) == 0x7F; |
d396a558 JH |
1442 | } |
1443 | ||
1444 | sub Is_c1 { | |
10c526cf | 1445 | use feature 'unicode_strings'; |
d396a558 | 1446 | my $char = substr(shift,0,1); |
10c526cf | 1447 | return $char =~ /[[:cntrl:]]/ && $char !~ /[[:ascii:]]/; |
d396a558 JH |
1448 | } |
1449 | ||
10c526cf KW |
1450 | sub Is_latin_1 { # But not ASCII; not C1 |
1451 | use feature 'unicode_strings'; | |
d396a558 | 1452 | my $char = substr(shift,0,1); |
10c526cf | 1453 | return ord($char) < 256 |
4d2ca8b5 KW |
1454 | && $char !~ /[[:ascii:]]/ |
1455 | && $char !~ /[[:cntrl:]]/; | |
d396a558 JH |
1456 | } |
1457 | ||
10c526cf | 1458 | Another way to write C<Is_latin_1()> would be |
d396a558 JH |
1459 | to use the characters in the range explicitly: |
1460 | ||
1461 | sub Is_latin_1 { | |
1462 | my $char = substr(shift,0,1); | |
f11f9c4c KW |
1463 | $char =~ /[ ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ] |
1464 | | [ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ]/x; |
d396a558 JH |
1465 | } |
1466 | ||
eaf8b9b9 | 1467 | That form may run into trouble in network transit (due to the |
4d2ca8b5 KW |
1468 | presence of 8-bit characters) or on non-ISO-Latin character sets, but |
1469 | it does allow C<Is_c1> to be rewritten so it works on Perls that don't | |
1470 | have C<'unicode_strings'> (earlier than v5.14): | |
1471 | ||
1472 | sub Is_c1 { # But not ASCII; not Latin1 |
1473 | my $char = substr(shift,0,1); |
1474 | return ord($char) < 256 |
1475 | && $char !~ /[[:ascii:]]/ |
1476 | && ! Is_latin_1($char); |
1477 | } | |
d396a558 JH |
1478 | |
1479 | =head1 SOCKETS | |
1480 | ||
1481 | Most socket programming assumes ASCII character encodings in network | |
1482 | byte order. Exceptions can include CGI script writing under a | |
1483 | host web server where the server may take care of translation for you. | |
1484 | Most host web servers convert EBCDIC data to ISO-8859-1 or Unicode on | |
1485 | output. | |
1486 | ||
1487 | =head1 SORTING | |
1488 | ||
8a50e6a3 | 1489 | One big difference between ASCII-based character sets and EBCDIC ones |
4d2ca8b5 KW |
1490 | is the relative positions of the characters when sorted in native |
1491 | order. Of most concern are the upper- and lowercase letters, the | |
1492 | digits, and the underscore (C<"_">). On ASCII platforms the native sort | |
1493 | order has the digits come before the uppercase letters which come before | |
1494 | the underscore which comes before the lowercase letters. On EBCDIC, the | |
1495 | underscore comes first, then the lowercase letters, then the uppercase | |
1496 | ones, and the digits last. If sorted on an ASCII-based platform, the | |
8a50e6a3 FC |
1497 | two-letter abbreviation for a physician comes before the two-letter |
1498 | abbreviation for drive; that is: | |
d396a558 | 1499 | |
c72e675e | 1500 | @sorted = sort(qw(Dr. dr.)); # @sorted holds ('Dr.','dr.') on ASCII, |
84f709e7 | 1501 | # but ('dr.','Dr.') on EBCDIC |
d396a558 | 1502 | |
8a50e6a3 | 1503 | The property of lowercase before uppercase letters in EBCDIC is |
d396a558 | 1504 | even carried to the Latin 1 EBCDIC pages such as 0037 and 1047. |
4d2ca8b5 KW |
1505 | An example would be that "E<Euml>" (C<E WITH DIAERESIS>, 203) comes |
1506 | before "E<euml>" (C<e WITH DIAERESIS>, 235) on an ASCII platform, but | |
eaf8b9b9 | 1507 | the latter (83) comes before the former (115) on an EBCDIC platform. |
4d2ca8b5 KW |
1508 | (Astute readers will note that the uppercase version of "E<szlig>" |
1509 | C<SMALL LETTER SHARP S> is simply "SS" and that the upper case versions | |
1510 | of "E<yuml>" (small C<y WITH DIAERESIS>) and "E<micro>" (C<MICRO SIGN>) | |
1511 | are not in the 0..255 range but are in Unicode, in a Unicode enabled | |
1512 | Perl). | |
d396a558 JH |
1513 | |
1514 | The sort order will cause differences between results obtained on | |
2bbc8d55 | 1515 | ASCII platforms versus EBCDIC platforms. What follows are some suggestions |
d396a558 JH |
1516 | on how to deal with these differences. |
1517 | ||
51b5cecb | 1518 | =head2 Ignore ASCII vs. EBCDIC sort differences. |
d396a558 JH |
1519 | |
1520 | This is the least computationally expensive strategy. It may require | |
1521 | some user education. | |
1522 | ||
4d2ca8b5 | 1523 | =head2 Use a sort helper function |
d396a558 | 1524 | |
4d2ca8b5 KW |
1525 | This is completely general, but the most computationally expensive |
1526 | strategy. Choose one or the other character set and transform to that | |
33f0d962 | 1527 | for every sort comparison. Here's a complete example that transforms |
4d2ca8b5 | 1528 | to ASCII sort order: |
51b5cecb | 1529 | |
4d2ca8b5 KW |
1530 | sub native_to_uni($) { |
1531 | my $string = shift; | |
d396a558 | 1532 | |
4d2ca8b5 KW |
1533 | # Saves time on an ASCII platform |
1534 | return $string if ord 'A' == 65; | |
d396a558 | 1535 | |
4d2ca8b5 KW |
1536 | my $output = ""; |
1537 | for my $i (0 .. length($string) - 1) { | |
1538 | $output | |
1539 | .= chr(utf8::native_to_unicode(ord(substr($string, $i, 1)))); | |
1540 | } | |
1541 | ||
1542 | # Preserve utf8ness of input onto the output, even if it didn't need | |
1543 | # to be utf8 | |
1544 | utf8::upgrade($output) if utf8::is_utf8($string); | |
51b5cecb | 1545 | |
4d2ca8b5 KW |
1546 | return $output; |
1547 | } | |
51b5cecb | 1548 | |
4d2ca8b5 KW |
1549 | sub ascii_order { # Sort helper |
1550 | return native_to_uni($a) cmp native_to_uni($b); | |
1551 | } | |
d396a558 | 1552 | |
4d2ca8b5 KW |
1553 | sort ascii_order @list; |
1554 | ||
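Much of the cost of the helper above comes from converting both operands again on every comparison. One way to reduce it, sketched here under the assumption that the whole list fits in memory, is to convert each string once and sort on the cached keys (a Schwartzian transform). The names C<to_unicode_key> and C<sort_ascii_order> are hypothetical and used only for this illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: map every character of a string from its native
# code point to the corresponding Unicode code point.
sub to_unicode_key {
    my $string = shift;
    return join '', map { chr utf8::native_to_unicode(ord $_) }
                    split //, $string;
}

# Sort in ASCII/Unicode order, computing each key only once.
sub sort_ascii_order {
    return map  { $_->[1] }                      # recover the original string
           sort { $a->[0] cmp $b->[0] }          # compare the cached keys
           map  { [ to_unicode_key($_), $_ ] }   # pair each string with its key
           @_;
}

my @sorted = sort_ascii_order(qw(dr. Dr. _x 9z));
print "@sorted\n";
```

The intent is that digits sort first, then uppercase letters, then the underscore, then lowercase letters, whatever the native character set is.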
1555 | =head2 MONO CASE then sort data (for non-digits, non-underscore) | |
1556 | ||
1557 | If you don't care about where digits and underscore sort to, you can do | |
1558 | something like this | |
1559 | ||
1560 | sub case_insensitive_order { # Sort helper | |
1561 | return lc($a) cmp lc($b) | |
1562 | } | |
1563 | ||
1564 | sort case_insensitive_order @list; | |
1565 | ||
1566 | If performance is an issue, and you don't care if the output is in the | |
1567 | same case as the input, use C<tr///> to transform to the case most |
1568 | employed within the data. If the data are primarily UPPERCASE |
1569 | non-Latin1, then apply C<tr/[a-z]/[A-Z]/>, and then C<sort()>. If the |
1570 | data are primarily lowercase non-Latin1, then apply C<tr/[A-Z]/[a-z]/> |
1571 | before sorting. If the data are primarily UPPERCASE and include Latin-1 | |
1572 | characters then apply: | |
1573 | ||
1574 | tr/[a-z]/[A-Z]/; | |
1575 | tr/[àáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ]/[ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ]/; |
1576 | s/ß/SS/g; | |
1577 | ||
1578 | then C<sort()>. If you have a choice, it's better to lowercase things | |
1579 | to avoid the problems of the two Latin-1 characters whose uppercase is | |
1580 | outside Latin-1: "E<yuml>" (small C<y WITH DIAERESIS>) and "E<micro>" | |
1581 | (C<MICRO SIGN>). If you do need to uppercase, you can; with a |
1582 | Unicode-enabled Perl, do: | |
1583 | ||
1584 | tr/ÿ/\x{178}/; | |
1585 | tr/µ/\x{39C}/; | |
d396a558 | 1586 | |
2bbc8d55 | 1587 | =head2 Perform sorting on one type of platform only. |
d396a558 JH |
1588 | |
1589 | This strategy can employ a network connection. As such | |
1590 | it would be computationally expensive. | |
1591 | ||
395f5a0c | 1592 | =head1 TRANSFORMATION FORMATS |
1e054b24 | 1593 | |
eaf8b9b9 KW |
1594 | There are a variety of ways of transforming data with an intra character set |
1595 | mapping that serve a variety of purposes. Sorting was discussed in the | |
1596 | previous section and a few of the other more popular mapping techniques are | |
1e054b24 PP |
1597 | discussed next. |
1598 | ||
1599 | =head2 URL decoding and encoding | |
d396a558 | 1600 | |
51b5cecb | 1601 | Note that some URLs have hexadecimal ASCII code points in them in an |
eaf8b9b9 | 1602 | attempt to overcome character or protocol limitation issues. For example |
1e054b24 | 1603 | the tilde character is not on every keyboard hence a URL of the form: |
d396a558 JH |
1604 | |
1605 | http://www.pvhp.com/~pvhp/ | |
1606 | ||
1607 | may also be expressed as either of: | |
1608 | ||
1609 | http://www.pvhp.com/%7Epvhp/ | |
1610 | ||
1611 | http://www.pvhp.com/%7epvhp/ | |
1612 | ||
4d2ca8b5 | 1613 | where 7E is the hexadecimal ASCII code point for "~". Here is an example |
f11f9c4c | 1614 | of decoding such a URL in any EBCDIC code page: |
d396a558 | 1615 | |
84f709e7 | 1616 | $url = 'http://www.pvhp.com/%7Epvhp/'; |
f11f9c4c KW |
1617 | $url =~ s/%([0-9a-fA-F]{2})/ |
1618 | pack("C",utf8::unicode_to_native(hex($1)))/xge; |
d396a558 | 1619 | |
eaf8b9b9 | 1620 | Conversely, here is a partial solution for the task of encoding such |
f11f9c4c | 1621 | a URL in any EBCDIC code page: |
1e054b24 | 1622 | |
84f709e7 | 1623 | $url = 'http://www.pvhp.com/~pvhp/'; |
eaf8b9b9 KW |
1624 | # The following regular expression does not address the |
1625 | # mappings for: ('.' => '%2E', '/' => '%2F', ':' => '%3A') | |
10c526cf | 1626 | $url =~ s/([\t "#%&\(\),;<=>\?\@\[\\\]^`{|}~])/ |
f11f9c4c | 1627 | sprintf("%%%02X",utf8::native_to_unicode(ord($1)))/xge; |
1e054b24 | 1628 | |
eaf8b9b9 | 1629 | where a more complete solution would split the URL into components |
1e054b24 PP |
1630 | and apply a full s/// substitution only to the appropriate parts. |
1631 | ||
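A rough sketch of the component-wise approach for the path part of a URL might look like the following. The helper names C<encode_segment> and C<decode_segment> are hypothetical. The sketch assumes the input contains only characters with code points below 256, and it leaves just the "unreserved" characters (letters, digits, C<->, C<.>, C<_>, C<~>) unescaped:

```perl
use strict;
use warnings;

# Hypothetical helper: percent-encode one path segment using the
# ASCII/Unicode code point of each escaped character.
sub encode_segment {
    my $segment = shift;
    $segment =~ s/([^A-Za-z0-9\-._~])/
        sprintf("%%%02X", utf8::native_to_unicode(ord($1)))/xge;
    return $segment;
}

# Hypothetical helper: reverse the encoding, yielding native characters.
sub decode_segment {
    my $segment = shift;
    $segment =~ s/%([0-9a-fA-F]{2})/
        chr(utf8::unicode_to_native(hex($1)))/xge;
    return $segment;
}

# Encode each segment separately so the "/" separators are kept as-is.
my $path = join '/', map { encode_segment($_) }
                     split m{/}, '/~pvhp/my file', -1;
print "$path\n";
```

Splitting on "/" first keeps the separators out of the substitution; a query string or fragment would need its own, different set of characters to leave alone.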
1e054b24 PP |
1632 | =head2 uu encoding and decoding |
1633 | ||
4d2ca8b5 KW |
1634 | The C<u> template to C<pack()> or C<unpack()> will render EBCDIC data in |
1635 | EBCDIC characters equivalent to their ASCII counterparts. For example, | |
1636 | the following will print "Yes indeed\n" on either an ASCII or EBCDIC | |
1637 | computer: | |
1e054b24 | 1638 | |
84f709e7 JH |
1639 | $all_byte_chrs = ''; |
1640 | for (0..255) { $all_byte_chrs .= chr($_); } | |
1641 | $uuencode_byte_chrs = pack('u', $all_byte_chrs); | |
210b36aa | 1642 | ($uu = <<'ENDOFHEREDOC') =~ s/^\s*//gm; |
1e054b24 PP |
1643 | M``$"`P0%!@<("0H+#`T.#Q`1$A,4%187&!D:&QP='A\@(2(C)"4F)R@I*BLL |
1644 | M+2XO,#$R,S0U-C<X.3H[/#T^/T!!0D-$149'2$E*2TQ-3D]045)35%565UA9 | |
1645 | M6EM<75Y?8&%B8V1E9F=H:6IK;&UN;W!Q<G-T=79W>'EZ>WQ]?G^`@8*#A(6& | |
1646 | MAXB)BHN,C8Z/D)&2DY25EI>8F9J;G)V>GZ"AHJ.DI::GJ*FJJZRMKJ^PL;*S | |
1647 | MM+6VM[BYNKN\O;Z_P,'"P\3%QL?(R<K+S,W.S]#1TM/4U=;7V-G:V]S=WM_@ | |
1648 | ?X>+CY.7FY^CIZNOL[>[O\/'R\_3U]O?X^?K[_/W^_P`` | |
1649 | ENDOFHEREDOC | |
84f709e7 | 1650 | if ($uuencode_byte_chrs eq $uu) { |
1e054b24 PP |
1651 | print "Yes "; |
1652 | } | |
1653 | $uudecode_byte_chrs = unpack('u', $uuencode_byte_chrs); | |
84f709e7 | 1654 | if ($uudecode_byte_chrs eq $all_byte_chrs) { |
1e054b24 PP |
1655 | print "indeed\n"; |
1656 | } | |
1657 | ||
f11f9c4c | 1658 | Here is a very spartan uudecoder that will work on EBCDIC: |
1e054b24 | 1659 | |
84f709e7 | 1660 | #!/usr/local/bin/perl |
84f709e7 | 1661 | $_ = <> until ($mode,$file) = /^begin\s*(\d*)\s*(\S*)/; |
1e054b24 PP |
1662 | open(OUT, "> $file") if $file ne ""; |
1663 | while(<>) { | |
1664 | last if /^end/; | |
1665 | next if /[a-z]/; | |
f11f9c4c KW |
1666 | next unless int((((utf8::native_to_unicode(ord()) - 32 ) & 077) |
1667 | + 2) / 3) | |
1668 | == int(length() / 4); | |
1e054b24 PP |
1669 | print OUT unpack("u", $_); |
1670 | } | |
1671 | close(OUT); | |
1672 | chmod oct($mode), $file; | |
1673 | ||
1674 | ||
1675 | =head2 Quoted-Printable encoding and decoding | |
1676 | ||
8a50e6a3 | 1677 | On ASCII-encoded platforms it is possible to strip characters outside of |
1e054b24 PP |
1678 | the printable set using: |
1679 | ||
1680 | # This QP encoder works on ASCII only | |
4d2ca8b5 KW |
1681 | $qp_string =~ s/([=\x00-\x1F\x80-\xFF])/ |
1682 | sprintf("=%02X",ord($1))/xge; | |
1e054b24 | 1683 | |
4d2ca8b5 KW |
1684 | Starting in Perl v5.22, this is trivially changeable to work portably on |
1685 | both ASCII and EBCDIC platforms. | |
1686 | ||
1687 | # This QP encoder works on both ASCII and EBCDIC | |
1688 | $qp_string =~ s/([=\N{U+00}-\N{U+1F}\N{U+80}-\N{U+FF}])/ | |
1689 | sprintf("=%02X",utf8::native_to_unicode(ord($1)))/xge; |
1690 | ||
1691 | For earlier Perls, a QP encoder that works on both ASCII and EBCDIC | |
1692 | platforms would look somewhat like the following: | |
1e054b24 | 1693 | |
f11f9c4c | 1694 | $delete = chr utf8::unicode_to_native(0x7F); |
84f709e7 | 1695 | $qp_string =~ |
f11f9c4c KW |
1696 | s/([^[:print:]$delete])/ |
1697 | sprintf("=%02X",utf8::native_to_unicode(ord($1)))/xage; | |
1e054b24 PP |
1698 | |
1699 | (although in production code the substitutions might be done | |
f11f9c4c | 1700 | in the EBCDIC branch with the function call and separately in the |
4d2ca8b5 KW |
1701 | ASCII branch without the expense of the identity map; in Perl v5.22, the |
1702 | identity map is optimized out so there is no expense, but the | |
1703 | alternative above is simpler and is also available in v5.22). | |
1e054b24 PP |
1704 | |
1705 | Such QP strings can be decoded with: | |
1706 | ||
1707 | # This QP decoder is limited to ASCII only | |
f11f9c4c | 1708 | $string =~ s/=([[:xdigit:]][[:xdigit:]])/chr hex $1/ge; |
1e054b24 PP |
1709 | $string =~ s/=[\n\r]+$//; |
1710 | ||
eaf8b9b9 | 1711 | Whereas a QP decoder that works on both ASCII and EBCDIC platforms |
f11f9c4c | 1712 | would look somewhat like the following: |
1e054b24 | 1713 | |
f11f9c4c KW |
1714 | $string =~ s/=([[:xdigit:]][[:xdigit:]])/ |
1715 | chr utf8::unicode_to_native(hex $1)/xge; |
1e054b24 PP |
1716 | $string =~ s/=[\n\r]+$//; |
1717 | ||
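To check that an encoder and decoder agree, a round trip can be tried. The following is only a minimal sketch: C<qp_encode> and C<qp_decode> are hypothetical names, the input is assumed to hold only code points below 256, and the line-length and soft-line-break rules of real Quoted-Printable are ignored (so newlines get encoded too). The C<\N{U+...}> ranges are portable starting in Perl v5.22:

```perl
use strict;
use warnings;

# Hypothetical helper: encode "=" and everything outside printable ASCII.
sub qp_encode {
    my $string = shift;
    $string =~ s/([^\N{U+20}-\N{U+3C}\N{U+3E}-\N{U+7E}])/
        sprintf("=%02X", utf8::native_to_unicode(ord($1)))/xge;
    return $string;
}

# Hypothetical helper: turn each =XX escape back into a native character.
sub qp_decode {
    my $string = shift;
    $string =~ s/=([[:xdigit:]]{2})/
        chr(utf8::unicode_to_native(hex($1)))/xge;
    return $string;
}

my $original = "a=b\tcaf\N{U+E9}";
my $encoded  = qp_encode($original);
print "$encoded\n";
print "round trip ok\n" if qp_decode($encoded) eq $original;
```

The encoder converts native to Unicode and the decoder converts Unicode back to native, so the escapes themselves always carry ASCII/Latin-1 code points.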
c69ca1d4 | 1718 | =head2 Caesarean ciphers |
1e054b24 PP |
1719 | |
1720 | The practice of shifting an alphabet one or more characters for encipherment | |
1721 | dates back thousands of years and was explicitly detailed by Gaius Julius | |
eaf8b9b9 | 1722 | Caesar in his B<Gallic Wars> text. A single alphabet shift is sometimes |
1e054b24 | 1723 | referred to as a rotation and the shift amount is given as a number $n after |
eaf8b9b9 KW |
1724 | the string 'rot' or "rot$n". Rot0 and rot26 would designate identity maps |
1725 | on the 26-letter English version of the Latin alphabet. Rot13 has the | |
1726 | interesting property that alternate subsequent invocations are identity maps | |
1727 | (thus rot13 is its own non-trivial inverse in the group of 26 alphabet | |
1728 | rotations). Hence the following is a rot13 encoder and decoder that will | |
2bbc8d55 | 1729 | work on ASCII and EBCDIC platforms: |
1e054b24 PP |
1730 | |
1731 | #!/usr/local/bin/perl | |
1732 | ||
84f709e7 | 1733 | while(<>){ |
1e054b24 PP |
1734 | tr/n-za-mN-ZA-M/a-zA-Z/; |
1735 | print; | |
1736 | } | |
1737 | ||
1738 | In one-liner form: | |
1739 | ||
84f709e7 | 1740 | perl -ne 'tr/n-za-mN-ZA-M/a-zA-Z/;print' |
1e054b24 PP |
1741 | |
1742 | ||
1743 | =head1 Hashing order and checksums | |
1744 | ||
4d2ca8b5 KW |
1745 | Perl deliberately randomizes hash order for security purposes on both |
1746 | ASCII and EBCDIC platforms. | |
1747 | ||
1748 | EBCDIC checksums will differ for the same file translated into ASCII | |
1749 | and vice versa. | |
1e054b24 | 1750 | |
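If a checksum of text has to agree across platforms, one option is to compute it over Unicode code points rather than native ones. The following sketch uses a deliberately simple additive checksum purely for illustration; C<portable_checksum> is a hypothetical name, and the approach only makes sense for character data, not for arbitrary binary files:

```perl
use strict;
use warnings;

# Hypothetical helper: sum the Unicode code points of a string, modulo
# 2**16, so the result does not depend on the native code page.
sub portable_checksum {
    my $text = shift;
    my $sum  = 0;
    for my $char (split //, $text) {
        $sum = ($sum + utf8::native_to_unicode(ord $char)) % 65536;
    }
    return $sum;
}

print portable_checksum("Perl"), "\n";
```

A real application would more likely convert the text to a common encoding first and then apply an established digest to the resulting bytes.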
d396a558 JH |
1751 | =head1 I18N AND L10N |
1752 | ||
eaf8b9b9 KW |
1753 | Internationalization (I18N) and localization (L10N) are supported at least |
1754 | in principle even on EBCDIC platforms. The details are system-dependent | |
5a0de581 | 1755 | and discussed under the L</OS ISSUES> section below. |
d396a558 | 1756 | |
8a50e6a3 | 1757 | =head1 MULTI-OCTET CHARACTER SETS |
d396a558 | 1758 | |
4d2ca8b5 KW |
1759 | Perl works with UTF-EBCDIC, a multi-byte encoding. In Perls earlier |
1760 | than v5.22, there may be various bugs in this regard. | |
395f5a0c PK |
1761 | |
1762 | Legacy multi byte EBCDIC code pages XXX. | |
d396a558 JH |
1763 | |
1764 | =head1 OS ISSUES | |
1765 | ||
eaf8b9b9 | 1766 | There may be a few system-dependent issues |
d396a558 JH |
1767 | of concern to EBCDIC Perl programmers. |
1768 | ||
522b859a | 1769 | =head2 OS/400 |
51b5cecb | 1770 | |
d396a558 JH |
1771 | =over 8 |
1772 | ||
522b859a JH |
1773 | =item PASE |
1774 | ||
8a50e6a3 FC |
1775 | The PASE environment is a runtime environment for OS/400 that can run |
1776 | executables built for PowerPC AIX in OS/400; see L<perlos400>. PASE | |
522b859a JH |
1777 | is ASCII-based, unlike the ILE, which is EBCDIC-based. |
1778 | ||
d396a558 JH |
1779 | =item IFS access |
1780 | ||
1781 | XXX. | |
1782 | ||
1783 | =back | |
1784 | ||
395f5a0c | 1785 | =head2 OS/390, z/OS |
d396a558 | 1786 | |
51b5cecb PP |
1787 | Perl runs under UNIX System Services (USS). |
1788 | ||
d396a558 JH |
1789 | =over 8 |
1790 | ||
4d2ca8b5 KW |
1791 | =item C<sigaction> |
1792 | ||
1793 | Using C<SA_SIGINFO> can cause segmentation faults. |
1794 | ||
1795 | =item C<chcp> | |
51b5cecb | 1796 | |
eaf8b9b9 | 1797 | B<chcp> is supported as a shell utility for displaying and changing |
75cdcc93 | 1798 | one's code page. See also L<chcp(1)>. |
51b5cecb | 1799 | |
d396a558 JH |
1800 | =item dataset access |
1801 | ||
1802 | For sequential data set access try: | |
1803 | ||
1804 | my @ds_records = `cat //DSNAME`; | |
1805 | ||
1806 | or: | |
1807 | ||
1808 | my @ds_records = `cat //'HLQ.DSNAME'`; | |
1809 | ||
1810 | See also the OS390::Stdio module on CPAN. | |
1811 | ||
4d2ca8b5 | 1812 | =item C<iconv> |
51b5cecb | 1813 | |
1e054b24 | 1814 | B<iconv> is supported as both a shell utility and a C RTL routine. |
4d2ca8b5 | 1815 | See also the L<iconv(1)> and L<iconv(3)> manual pages. |
51b5cecb | 1816 | |
d396a558 JH |
1817 | =item locales |
1818 | ||
4d2ca8b5 KW |
1819 | Locales are supported. There may be glitches when a locale is another |
1820 | EBCDIC code page which has some of the | |
1821 | L<code-page variant characters|/The 13 variant characters> in other | |
1822 | positions. | |
1823 | ||
1824 | There aren't currently any real UTF-8 locales, even though some locale | |
1825 | names contain the string "UTF-8". | |
1826 | ||
1827 | See L<perllocale> for information on locales. The L10N files | |
1828 | are in F</usr/nls/locale>. C<$Config{d_setlocale}> is C<'define'> on | |
1829 | OS/390 or z/OS. | |
d396a558 JH |
1830 | |
1831 | =back | |
1832 | ||
d396a558 JH |
1833 | =head2 POSIX-BC? |
1834 | ||
1835 | XXX. | |
1836 | ||
51b5cecb PP |
1837 | =head1 BUGS |
1838 | ||
4d2ca8b5 KW |
1839 | =over 4 |
1840 | ||
1841 | =item * | |
1842 | ||
51b5cecb | 1843 | Not all shells will allow multiple C<-e> string arguments to perl to |
4d2ca8b5 KW |
1844 | be concatenated together properly as recipes in this document |
1845 | 0, 2, 4, 5, and 6 might | |
395f5a0c | 1846 | seem to imply. |
51b5cecb | 1847 | |
4d2ca8b5 KW |
1848 | =item * |
1849 | ||
4d2ca8b5 | 1850 | There are a significant number of test failures in the CPAN modules |
4b638048 | 1851 | shipped with Perl v5.22 and 5.24. These are only in modules not primarily |
4d2ca8b5 KW |
1852 | maintained by Perl 5 porters. Some of these are failures in the tests |
1853 | only: they don't realize that it is proper to get different results on | |
1854 | EBCDIC platforms. And some of the failures are real bugs. If you | |
1855 | compile and do a C<make test> on Perl, all tests on the C</cpan> | |
1856 | directory are skipped. | |
1857 | ||
4d2ca8b5 KW |
1858 | L<Encode> partially works. |
1859 | ||
7a145370 KW |
1860 | =item * |
1861 | ||
4b638048 KW |
1862 | In earlier Perl versions, when byte and character data were |
1863 | concatenated, the new string was sometimes created by | |
7a145370 KW |
1864 | decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the |
1865 | old Unicode string used EBCDIC. | |
1866 | ||
4d2ca8b5 KW |
1867 | =back |
1868 | ||
b3b6085d PP |
1869 | =head1 SEE ALSO |
1870 | ||
395f5a0c | 1871 | L<perllocale>, L<perlfunc>, L<perlunicode>, L<utf8>. |
b3b6085d | 1872 | |
d396a558 JH |
1873 | =head1 REFERENCES |
1874 | ||
2bbc8d55 | 1875 | L<http://anubis.dkuug.dk/i18n/charmaps> |
d396a558 | 1876 | |
2bbc8d55 | 1877 | L<http://www.unicode.org/> |
d396a558 | 1878 | |
2bbc8d55 | 1879 | L<http://www.unicode.org/unicode/reports/tr16/> |
d396a558 | 1880 | |
08d7a6b2 | 1881 | L<http://www.wps.com/projects/codes/> |
51b5cecb PP |
1882 | B<ASCII: American Standard Code for Information Infiltration> Tom Jennings, |
1883 | September 1999. | |
1884 | ||
eaf8b9b9 KW |
1885 | B<The Unicode Standard, Version 3.0> The Unicode Consortium, Lisa Moore ed., |
1886 | ISBN 0-201-61633-5, Addison Wesley Developers Press, February 2000. | |
51b5cecb | 1887 | |
eaf8b9b9 KW |
1888 | B<CDRA: IBM - Character Data Representation Architecture - |
1889 | Reference and Registry>, IBM SC09-2190-00, December 1996. | |
d396a558 | 1890 | |
eaf8b9b9 | 1891 | "Demystifying Character Sets", Andrea Vine, Multilingual Computing |
d396a558 JH |
1892 | & Technology, B<#26 Vol. 10 Issue 4>, August/September 1999; |
1893 | ISSN 1523-0309; Multilingual Computing Inc. Sandpoint ID, USA. | |
1894 | ||
1e054b24 PP |
1895 | B<Codes, Ciphers, and Other Cryptic and Clandestine Communication> |
1896 | Fred B. Wrixon, ISBN 1-57912-040-7, Black Dog & Leventhal Publishers, | |
1897 | 1998. | |
1898 | ||
2bbc8d55 | 1899 | L<http://www.bobbemer.com/P-BIT.HTM> |
395f5a0c PK |
1900 | B<IBM - EBCDIC and the P-bit; The biggest Computer Goof Ever> Robert Bemer. |
1901 | ||
1902 | =head1 HISTORY | |
1903 | ||
1904 | 15 April 2001: added UTF-8 and UTF-EBCDIC to main table, pvhp. | |
1905 | ||
d396a558 JH |
1906 | =head1 AUTHOR |
1907 | ||
eaf8b9b9 KW |
1908 | Peter Prymmer pvhp@best.com wrote this in 1999 and 2000 |
1909 | with CCSID 0819 and 0037 help from Chris Leach and | |
1910 | AndrE<eacute> Pirard A.Pirard@ulg.ac.be as well as POSIX-BC | |
b3b6085d | 1911 | help from Thomas Dorner Thomas.Dorner@start.de. |
eaf8b9b9 KW |
1912 | Thanks also to Vickie Cooper, Philip Newton, William Raffloer, and |
1913 | Joe Smith. Trademarks, registered trademarks, service marks and | |
1914 | registered service marks used in this document are the property of | |
1e054b24 | 1915 | their respective owners. |
4d2ca8b5 KW |
1916 | |
1917 | Now maintained by Perl5 Porters. |