=encoding utf8

=head1 NAME

perlebcdic - Considerations for running Perl on EBCDIC platforms

=head1 DESCRIPTION

An exploration of some of the issues facing Perl programmers
on EBCDIC based computers.

Portions of this document that are still incomplete are marked with XXX.

Early Perl versions worked on some EBCDIC machines, but after v5.8.7 no
version was known to run on EBCDIC until v5.22, when the Perl core
began working on z/OS again. Theoretically, it could work on OS/400 or
Siemens' BS2000 (or their successors), but this is untested. In v5.22
and v5.24, not all the modules that are found on CPAN but shipped with
core Perl work on z/OS.

If you want to use Perl on a non-z/OS EBCDIC machine, please let us know
by sending mail to perlbug@perl.org.

Writing Perl on an EBCDIC platform is really no different from writing
on an L</ASCII> one, but with different underlying numbers, as we'll see
shortly. You'll have to know something about those L</ASCII> platforms
because the documentation is biased towards them and will frequently use
example numbers that don't apply to EBCDIC. There are also very few CPAN
modules that are written for EBCDIC and which don't work on ASCII;
instead the vast majority of CPAN modules are written for ASCII, and
some may happen to work on EBCDIC, while a few have been designed to
portably work on both.

If your code just uses the 52 letters A-Z and a-z, plus SPACE, the
digits 0-9, and the punctuation characters that Perl uses, plus a few
controls that are denoted by escape sequences like C<\n> and C<\t>, then
there's nothing special about using Perl, and your code may very well
work on an ASCII machine without change.

But if you write code that uses C<\005> to mean a TAB or C<\xC1> to mean
an "A", or C<\xDF> to mean a "E<yuml>" (small C<"y"> with a diaeresis),
then your code may well work on your EBCDIC platform, but not on an
ASCII one. That's fine to do if no one will ever want to run your code
on an ASCII platform; but the bias in this document will be towards writing
code portable between EBCDIC and ASCII systems. Again, if every
character you care about is easily enterable from your keyboard, you
don't have to know anything about ASCII, but many keyboards don't easily
allow you to directly enter, say, the character C<\xDF>, so you have to
specify it indirectly, such as by using the C<"\xDF"> escape sequence.
In those cases it's easiest to know something about the ASCII/Unicode
character sets. If you know that the small "E<yuml>" is C<U+00FF>, then
you can instead specify it as C<"\N{U+FF}">, and have the computer
automatically translate it to C<\xDF> on your platform, and leave it as
C<\xFF> on ASCII ones. Or you could specify it by name, C<\N{LATIN
SMALL LETTER Y WITH DIAERESIS}>, and not have to know the numbers.
Either way works, but both require familiarity with Unicode.

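As a minimal sketch of the two portable spellings just described, the
following compares them and prints the ordinal the current platform
assigns. The variable names are only illustrative, and the
C<use charnames> line is an assumption made for Perls older than 5.16,
which do not load character names automatically.

```perl
use strict;
use warnings;
use charnames ':full';   # assumed only for \N{name} on Perls before 5.16

# Two platform-independent ways to write small y with diaeresis.
my $by_number = "\N{U+FF}";
my $by_name   = "\N{LATIN SMALL LETTER Y WITH DIAERESIS}";

print "same character\n" if $by_number eq $by_name;

# The native ordinal depends on the platform's character set.
printf "native ordinal here: 0x%02X\n", ord $by_number;
```

Neither spelling hard-codes a platform-specific number such as C<\xDF>,
so the same source should behave the same way on ASCII and EBCDIC
machines.
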
=head1 COMMON CHARACTER CODE SETS

=head2 ASCII

The American Standard Code for Information Interchange (ASCII or
US-ASCII) is a set of integers running from 0 to 127 (decimal) that
have standardized interpretations by the computers which use ASCII.
For example, 65 means the letter "A".
The range 0..127 can be covered by setting various bits in a 7-bit binary
number, hence the set is sometimes referred to as "7-bit ASCII".
ASCII was described by the American National Standards Institute
document ANSI X3.4-1986. It was also described by ISO 646:1991
(with localization for currency symbols). The full ASCII set is
given in the table L<below|/recipe 3> as the first 128 elements.
Languages that can be written adequately with the characters in ASCII
include English, Hawaiian, Indonesian, Swahili and some Native American
languages.

Most non-EBCDIC character sets are supersets of ASCII. That is, the
integers 0-127 mean what ASCII says they mean, but integers 128 and
above are specific to the character set.

Many of these fit entirely into 8 bits, using ASCII as 0-127, while
specifying what 128-255 mean, and not using anything above 255.
Thus, these are single-byte (or octet if you prefer) character sets.
One important one (since Unicode is a superset of it) is the ISO 8859-1
character set.

=head2 ISO 8859

The ISO 8859-I<B<$n>> are a collection of character code sets from the
International Organization for Standardization (ISO), each of which adds
characters to the ASCII set that are typically found in various
languages, many of which are based on the Roman, or Latin, alphabet.
Most are for European languages, but there are also ones for Arabic,
Greek, Hebrew, and Thai. There are good references on the web about
all these.

=head2 Latin 1 (ISO 8859-1)

Latin 1 is a particular 8-bit extension to ASCII that includes grave and
acute accented Latin characters. Languages that can employ ISO 8859-1
include all the languages covered by ASCII as well as Afrikaans,
Albanian, Basque, Catalan, Danish, Faroese, Finnish, Norwegian,
Portuguese, Spanish, and Swedish. Dutch is covered albeit without
the ij ligature. French is covered too but without the oe ligature.
German can use ISO 8859-1 but must do so without German-style
quotation marks. This set is based on Western European extensions
to ASCII and is commonly encountered in world wide web work.
In IBM character code set identification terminology, ISO 8859-1 is
also known as CCSID 819 (or sometimes 0819 or even 00819).

=head2 EBCDIC

The Extended Binary Coded Decimal Interchange Code refers to a
large collection of single- and multi-byte coded character sets that are
quite different from ASCII and ISO 8859-1, and are all slightly
different from each other; they typically run on host computers. The
EBCDIC encodings derive from 8-bit byte extensions of Hollerith punched
card encodings, which long predate ASCII. The layout on the
cards was such that high bits were set for the upper and lower case
alphabetic characters C<[a-z]> and C<[A-Z]>, but there were gaps within
each Latin alphabet range, visible in the table L<below|/recipe 3>.
These gaps can cause complications.

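To make the gaps concrete, here is a small sketch that uses four CCSID
1047 ordinals copied from the table in this document and compares the
distance between neighbouring letters with what the running platform
reports. The hash name C<%cp1047> is made up for the illustration. The
last step relies on Perl's documented special handling of the C<[a-z]>
range, which is expected to match only the 26 letters on either kind of
platform.

```perl
use strict;
use warnings;

# CCSID 1047 ordinals copied from the table in this document.
my %cp1047 = (i => 137, j => 145, r => 153, s => 162);

printf "j - i = %d in CCSID 1047, %d on this platform\n",
    $cp1047{j} - $cp1047{i}, ord('j') - ord('i');
printf "s - r = %d in CCSID 1047, %d on this platform\n",
    $cp1047{s} - $cp1047{r}, ord('s') - ord('r');

# Count how many of the 256 single-byte characters [a-z] matches.
my $all_bytes = join '', map { chr($_) } 0 .. 255;
my $letters   = () = $all_bytes =~ /[a-z]/g;
print "[a-z] matches $letters characters\n";
```

Code that steps through the alphabet by adding 1 to an ordinal works on
ASCII but would land in a gap after "i" or "r" on EBCDIC, which is one
of the complications mentioned above.
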
Some IBM EBCDIC character sets may be known by character code set
identification numbers (CCSID numbers) or code page numbers.

Perl can be compiled on platforms that run any of three commonly used EBCDIC
character sets, listed below.

=head3 The 13 variant characters

Among IBM EBCDIC character code sets there are 13 characters that
are often mapped to different integer values. Those characters
are known as the 13 "variant" characters and are:

    \ [ ] { } ^ ~ ! # | $ @ `

When Perl is compiled for a platform, it looks at all of these characters to
guess which EBCDIC character set the platform uses, and adapts itself
accordingly to that platform. If the platform uses a character set that is not
one of the three Perl knows about, Perl will either fail to compile, or
mistakenly and silently choose one of the three.

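If you are curious which values your own platform assigns to these
characters, a short sketch such as the following prints them. It only
reports what C<ord> returns at run time; it does not reproduce whatever
checks Perl performs when it is built.

```perl
use strict;
use warnings;

# The 13 variant characters, in a single-quoted string so that none of
# them is interpolated (\\ is one backslash inside single quotes).
my @variant = split //, '\\[]{}^~!#|$@`';

for my $char (@variant) {
    printf "%s => %3d (0x%02X)\n", $char, ord($char), ord($char);
}
```

Comparing the output with the table later in this document shows which
of the listed code sets, if any, the platform resembles.
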
=head3 EBCDIC code sets recognized by Perl

=over

=item B<0037>

Character code set ID 0037 is a mapping of the ASCII plus Latin-1
characters (i.e. ISO 8859-1) to an EBCDIC set. 0037 is used
in North American English locales on the OS/400 operating system
that runs on AS/400 computers. CCSID 0037 differs from ISO 8859-1
in 236 places; in other words they agree on only 20 code point values.

=item B<1047>

Character code set ID 1047 is also a mapping of the ASCII plus
Latin-1 characters (i.e. ISO 8859-1) to an EBCDIC set. 1047 is
used under Unix System Services for OS/390 or z/OS, and OpenEdition
for VM/ESA. CCSID 1047 differs from CCSID 0037 in eight places,
and from ISO 8859-1 in 236.

=item B<POSIX-BC>

The EBCDIC code page in use on Siemens' BS2000 system is distinct from
1047 and 0037. It is identified below as the POSIX-BC set.
Like 0037 and 1047, it is the same as ISO 8859-1 in 20 code point
values.

=back

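The count of differing code points can be checked with a rough sketch
like the one below, meant to be run on an ASCII platform. It assumes
that the C<cp37> table shipped with the standard Encode module
corresponds to CCSID 0037 as described here; if that assumption holds,
the result should agree with the figure of 236 quoted above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Count the byte values whose meaning under 'cp37' differs from the
# ISO 8859-1 meaning of the same byte value.
my $differ = grep {
    my $byte = chr $_;                    # a single octet
    ord(decode('cp37', $byte)) != $_;     # code point it decodes to
} 0 .. 255;

printf "cp37 differs from ISO 8859-1 at %d of 256 byte values\n",
    $differ;
```

Replacing C<'cp37'> with C<'cp1047'> or C<'posix-bc'> gives the
corresponding count for the other two code sets.
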
=head2 Unicode code points versus EBCDIC code points

In Unicode terminology a I<code point> is the number assigned to a
character: for example, in EBCDIC the character "A" is usually assigned
the number 193. In Unicode, the character "A" is assigned the number 65.
All the code points in ASCII and Latin-1 (ISO 8859-1) have the same
meaning in Unicode. All three of the recognized EBCDIC code sets have
256 code points, and in each code set, all 256 code points are mapped to
equivalent Latin1 code points. Obviously, "A" will map to "A", "B" =>
"B", "%" => "%", etc., for all printable characters in Latin1 and these
code pages.

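The mapping between native and Unicode code points can be observed with
the core function C<utf8::native_to_unicode>, as in this small sketch.
On an ASCII platform the two columns are identical; on an EBCDIC
platform the first column would show values such as 193 for "A".

```perl
use strict;
use warnings;

for my $char ("A", "B", "%") {
    my $native  = ord $char;                          # platform value
    my $unicode = utf8::native_to_unicode($native);   # Unicode value
    printf "%s  native %3d  Unicode %3d\n", $char, $native, $unicode;
}
```

The Unicode column is the same everywhere, which is what makes it a
convenient common reference between platforms.
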
It also turns out that EBCDIC has nearly precise equivalents for the
ASCII/Latin1 C0 controls and the DELETE control. (The C0 controls are
those whose ASCII code points are 0..0x1F; things like TAB, ACK, BEL,
etc.) A mapping is set up between these ASCII/EBCDIC controls. There
isn't such a precise mapping between the C1 controls on ASCII platforms
and the remaining EBCDIC controls. What has been done is to map these
controls, mostly arbitrarily, to some otherwise unmatched character in
the other character set. Most of these are very rarely used
nowadays in EBCDIC anyway, and their names have been dropped, without
much complaint. For example the EO (Eight Ones) EBCDIC control
(consisting of eight one bits = 0xFF) is mapped to the C1 APC control
(0x9F), and you can't use the name "EO".

The EBCDIC controls provide three possible line terminator characters,
CR (0x0D), LF (0x25), and NL (0x15). On ASCII platforms, the symbols
"NL" and "LF" refer to the same character, but in strict EBCDIC
terminology they are different ones. The EBCDIC NL is mapped to the C1
control called "NEL" ("Next Line"; here's a case where the mapping makes
quite a bit of sense, and hence isn't just arbitrary). On some EBCDIC
platforms, this NL or NEL is the typical line terminator. This is true
of z/OS and BS2000. On these platforms, the C compilers will swap the
LF and NEL code points, so that C<"\n"> is 0x15, and refers to NL. Perl
does that too; you can see it in the code chart L<below|/recipe 3>.
This makes things generally "just work" without you even having to be
aware that there is a swap.

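A one-line check shows which values the running Perl uses for the usual
line-ending escapes. This is only a sketch for inspection; on an ASCII
platform it reports 0x0A and 0x0D, while on z/OS the text above leads
one to expect 0x15 for C<"\n">.

```perl
use strict;
use warnings;

# Print the native ordinals behind the "\n" and "\r" escapes.
printf "\\n is 0x%02X, \\r is 0x%02X\n", ord("\n"), ord("\r");
```

Code that writes C<"\n"> instead of a literal C<"\012"> or C<"\x0A">
therefore stays portable across the swap.
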
=head2 Unicode and UTF

UTF stands for "Unicode Transformation Format".
UTF-8 is an encoding of Unicode into a sequence of 8-bit byte chunks, based on
ASCII and Latin-1.
The length of a sequence required to represent a Unicode code point
depends on the ordinal number of that code point,
with larger numbers requiring more bytes.
UTF-EBCDIC is like UTF-8, but based on EBCDIC.
They are enough alike that often, casual usage will conflate the two
terms, and use "UTF-8" to mean both the UTF-8 found on ASCII platforms,
and the UTF-EBCDIC found on EBCDIC ones.

You may see the term "invariant" character or code point.
This simply means that the character has the same numeric
value and representation when encoded in UTF-8 (or UTF-EBCDIC) as when
not. (Note that this is a very different concept from L</The 13 variant
characters> mentioned above. Careful prose will use the term "UTF-8
invariant" instead of just "invariant", but most often you'll see just
"invariant".) For example, the ordinal value of "A" is 193 in most
EBCDIC code pages, and also is 193 when encoded in UTF-EBCDIC. All
UTF-8 (or UTF-EBCDIC) variant code points occupy at least two bytes when
encoded in UTF-8 (or UTF-EBCDIC); by definition, the UTF-8 (or
UTF-EBCDIC) invariant code points are exactly one byte whether encoded
in UTF-8 (or UTF-EBCDIC), or not. (By now you see why people typically
just say "UTF-8" when they also mean "UTF-EBCDIC". For the rest of this
document, we'll mostly be casual about it too.)
In ASCII UTF-8, the code points corresponding to the lowest 128
ordinal numbers (0 - 127: the ASCII characters) are invariant.
In UTF-EBCDIC, there are 160 invariant characters.
(If you care, the EBCDIC invariants are those characters
which have ASCII equivalents, plus those that correspond to
the C1 controls (128 - 159 on ASCII platforms).)

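The difference between invariant and variant code points can be seen by
encoding a character and counting the resulting bytes. The following
sketch uses the core function C<utf8::encode>, which produces UTF-8 on
ASCII platforms and, as an assumption based on the casual usage
described above, UTF-EBCDIC on EBCDIC ones.

```perl
use strict;
use warnings;

# "A" is invariant; e with acute (U+00E9) is variant.
for my $char ("A", "\N{U+E9}") {
    my $encoded = $char;
    utf8::encode($encoded);   # $encoded now holds the encoded bytes
    printf "U+%04X takes %d byte(s) when encoded\n",
        utf8::native_to_unicode(ord $char), length $encoded;
}
```

The invariant "A" stays a single byte, while the variant character needs
two.
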
A string encoded in UTF-EBCDIC may be longer (very rarely shorter) than
one encoded in UTF-8. Perl extends both UTF-8 and UTF-EBCDIC so that
they can encode code points above the Unicode maximum of U+10FFFF. Both
extensions are constructed to allow encoding of any code point that fits
in a 64-bit word.

UTF-EBCDIC is defined by
L<Unicode Technical Report #16|http://www.unicode.org/reports/tr16>
(often referred to as just TR16).
It is defined based on CCSID 1047, not allowing for the differences for
other code pages. This allows for easy interchange of text between
computers running different code pages, but makes it unusable, without
adaptation, for Perl on those other code pages.

The reason for this unusability is that a fundamental assumption of Perl
is that the characters it cares about for parsing and lexical analysis
are the same whether or not the text is in UTF-8. For example, Perl
expects the character C<"["> to have the same representation, no matter
if the string containing it (or program text) is UTF-8 encoded or not.
To ensure this, Perl adapts UTF-EBCDIC to the particular code page so
that all characters it expects to be UTF-8 invariant are in fact UTF-8
invariant. This means that text generated on a computer running one
version of Perl's UTF-EBCDIC has to be translated to be intelligible to
a computer running another.

TR16 implies a method to extend UTF-EBCDIC to encode points up through
S<C<2 ** 31 - 1>>. Perl uses this method for code points up through
S<C<2 ** 30 - 1>>, but uses an incompatible method for larger ones, to
enable it to handle much larger code points than otherwise.

=head2 Using Encode

Starting from Perl 5.8 you can use the standard module Encode
to translate from EBCDIC to Latin-1 code points.
Encode knows about more EBCDIC character sets than Perl can currently
be compiled to run on.

    use Encode 'from_to';

    # The native ordinal of '^' identifies the platform's code set
    my %ebcdic = ( 176 => 'cp37', 95 => 'cp1047', 106 => 'posix-bc' );

    # $a is in EBCDIC code points
    from_to($a, $ebcdic{ord '^'}, 'latin1');
    # $a is ISO 8859-1 code points

and from Latin-1 code points to EBCDIC code points

    use Encode 'from_to';

    # The native ordinal of '^' identifies the platform's code set
    my %ebcdic = ( 176 => 'cp37', 95 => 'cp1047', 106 => 'posix-bc' );

    # $a is ISO 8859-1 code points
    from_to($a, 'latin1', $ebcdic{ord '^'});
    # $a is in EBCDIC code points

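The same translation can be tried on an ASCII platform without touching
any files. The following is a minimal sketch using Encode's C<encode>
and C<decode> functions with the C<cp37> encoding; the variable names
are only illustrative. Going by the CCSID 0037 column of the table in
this document (H is 200, e is 133, l is 147, o is 150), the hex dump
should read C<c885939396>.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text   = "Hello";
my $ebcdic = encode('cp37', $text);     # octets in CCSID 0037

# Show the EBCDIC octets as hexadecimal.
print unpack('H*', $ebcdic), "\n";

# Decoding the octets again should give back the original text.
my $back = decode('cp37', $ebcdic);
print $back eq $text ? "round trip ok\n" : "round trip failed\n";
```

Unlike C<from_to>, which converts a buffer in place, C<encode> and
C<decode> return new strings and leave their arguments alone when
called without a check argument.
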
For doing I/O it is suggested that you use the autotranslating features
of PerlIO, see L<perluniintro>.

Since version 5.8 Perl uses the PerlIO I/O library. This enables
you to use different encodings per IO channel. For example you may use

    use Encode;
    my $f;
    open($f, ">:encoding(ascii)", "test.ascii") or die "test.ascii: $!";
    print $f "Hello World!\n";
    open($f, ">:encoding(cp37)", "test.ebcdic") or die "test.ebcdic: $!";
    print $f "Hello World!\n";
    open($f, ">:encoding(latin1)", "test.latin1") or die "test.latin1: $!";
    print $f "Hello World!\n";
    open($f, ">:encoding(utf8)", "test.utf8") or die "test.utf8: $!";
    print $f "Hello World!\n";
    close($f) or die "test.utf8: $!";

to get four files containing "Hello World!\n" in ASCII, CP 0037 EBCDIC,
ISO 8859-1 (Latin-1) (in this example identical to ASCII since only ASCII
characters were printed), and
UTF-EBCDIC (in this example identical to normal EBCDIC since only characters
that don't differ between EBCDIC and UTF-EBCDIC were printed). See the
documentation of L<Encode::PerlIO> for details.

As the PerlIO layer uses raw IO (bytes) internally, all this totally
ignores things like the type of your filesystem (ASCII or EBCDIC).

=head1 SINGLE OCTET TABLES

The following tables list the ASCII and Latin 1 ordered sets including
the subsets: C0 controls (0x00..0x1F), ASCII graphics (0x20..0x7E),
delete (0x7F), C1 controls (0x80..0x9F), and Latin-1 (a.k.a. ISO 8859-1)
(0xA0..0xFF). In the table, the names of the Latin 1
extensions to ASCII have been labelled with character names roughly
corresponding to I<The Unicode Standard, Version 6.1> albeit with
substitutions such as C<s/LATIN//> and C<s/VULGAR//> in all cases;
S<C<s/CAPITAL LETTER//>> in some cases; and
S<C<s/SMALL LETTER ([A-Z])/\l$1/>> in some other
cases. Controls are listed using their Unicode 6.2 abbreviations.
The differences between the 0037 and 1047 sets are
flagged with C<**>. The differences between the 1047 and POSIX-BC sets
are flagged with C<##>. All C<ord()> numbers listed are decimal. If you
would rather see this table listing octal values, then run the table
(that is, the pod source text of this document, since this recipe may not
work with a pod2_other_format translation) through:

=over 4

=item recipe 0

=back

    perl -ne 'if(/(.{29})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)' \
     -e '{printf("%s%-5.03o%-5.03o%-5.03o%.03o\n",$1,$2,$3,$4,$5)}' \
     perlebcdic.pod

If you want to retain the UTF-x code points then in script form you
might want to write:

=over 4

=item recipe 1

=back

    open(my $fh, '<', 'perlebcdic.pod')
        or die "Could not open perlebcdic.pod: $!";
    while (<$fh>) {
        if (/(.{29})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\.?(\d*)
                                                   \s+(\d+)\.?(\d*)/x)
        {
            # $7 and $9 are set only for two-byte sequences
            if ($7 ne '' && $9 ne '') {
                printf(
                   "%s%-5.03o%-5.03o%-5.03o%-5.03o%-3o.%-5o%-3o.%.03o\n",
                                           $1,$2,$3,$4,$5,$6,$7,$8,$9);
            }
            elsif ($7 ne '') {
                printf("%s%-5.03o%-5.03o%-5.03o%-5.03o%-3o.%-5o%.03o\n",
                                           $1,$2,$3,$4,$5,$6,$7,$8);
            }
            else {
                printf("%s%-5.03o%-5.03o%-5.03o%-5.03o%-5.03o%.03o\n",
                                                $1,$2,$3,$4,$5,$6,$8);
            }
        }
    }

If you would rather see this table listing hexadecimal values then
run the table through:

=over 4

=item recipe 2

=back

    perl -ne 'if(/(.{29})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)' \
     -e '{printf("%s%-5.02X%-5.02X%-5.02X%.02X\n",$1,$2,$3,$4,$5)}' \
     perlebcdic.pod

Or, in order to retain the UTF-x code points in hexadecimal:

=over 4

=item recipe 3

=back

    open(my $fh, '<', 'perlebcdic.pod')
        or die "Could not open perlebcdic.pod: $!";
    while (<$fh>) {
        if (/(.{29})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\.?(\d*)
                                                   \s+(\d+)\.?(\d*)/x)
        {
            # $7 and $9 are set only for two-byte sequences
            if ($7 ne '' && $9 ne '') {
                printf(
                   "%s%-5.02X%-5.02X%-5.02X%-5.02X%-2X.%-6.02X%02X.%02X\n",
                                           $1,$2,$3,$4,$5,$6,$7,$8,$9);
            }
            elsif ($7 ne '') {
                printf("%s%-5.02X%-5.02X%-5.02X%-5.02X%-2X.%-6.02X%02X\n",
                                           $1,$2,$3,$4,$5,$6,$7,$8);
            }
            else {
                printf("%s%-5.02X%-5.02X%-5.02X%-5.02X%-5.02X%02X\n",
                                                $1,$2,$3,$4,$5,$6,$8);
            }
        }
    }


                             ISO
                            8859-1          POS-          CCSID
                            CCSID CCSID CCSID IX-         1047
  chr                        0819 0037 1047 BC   UTF-8    UTF-EBCDIC
 ---------------------------------------------------------------------
 <NUL>                       0    0    0    0    0        0
 <SOH>                       1    1    1    1    1        1
 <STX>                       2    2    2    2    2        2
 <ETX>                       3    3    3    3    3        3
 <EOT>                       4    55   55   55   4        55
 <ENQ>                       5    45   45   45   5        45
 <ACK>                       6    46   46   46   6        46
 <BEL>                       7    47   47   47   7        47
 <BS>                        8    22   22   22   8        22
 <HT>                        9    5    5    5    9        5
 <LF>                        10   37   21   21   10       21       **
 <VT>                        11   11   11   11   11       11
 <FF>                        12   12   12   12   12       12
 <CR>                        13   13   13   13   13       13
 <SO>                        14   14   14   14   14       14
 <SI>                        15   15   15   15   15       15
 <DLE>                       16   16   16   16   16       16
 <DC1>                       17   17   17   17   17       17
 <DC2>                       18   18   18   18   18       18
 <DC3>                       19   19   19   19   19       19
 <DC4>                       20   60   60   60   20       60
 <NAK>                       21   61   61   61   21       61
 <SYN>                       22   50   50   50   22       50
 <ETB>                       23   38   38   38   23       38
 <CAN>                       24   24   24   24   24       24
 <EOM>                       25   25   25   25   25       25
 <SUB>                       26   63   63   63   26       63
 <ESC>                       27   39   39   39   27       39
 <FS>                        28   28   28   28   28       28
 <GS>                        29   29   29   29   29       29
 <RS>                        30   30   30   30   30       30
 <US>                        31   31   31   31   31       31
 <SPACE>                     32   64   64   64   32       64
 !                           33   90   90   90   33       90
 "                           34   127  127  127  34       127
 #                           35   123  123  123  35       123
 $                           36   91   91   91   36       91
 %                           37   108  108  108  37       108
 &                           38   80   80   80   38       80
 '                           39   125  125  125  39       125
 (                           40   77   77   77   40       77
 )                           41   93   93   93   41       93
 *                           42   92   92   92   42       92
 +                           43   78   78   78   43       78
 ,                           44   107  107  107  44       107
 -                           45   96   96   96   45       96
 .                           46   75   75   75   46       75
 /                           47   97   97   97   47       97
 0                           48   240  240  240  48       240
 1                           49   241  241  241  49       241
 2                           50   242  242  242  50       242
 3                           51   243  243  243  51       243
 4                           52   244  244  244  52       244
 5                           53   245  245  245  53       245
 6                           54   246  246  246  54       246
 7                           55   247  247  247  55       247
 8                           56   248  248  248  56       248
 9                           57   249  249  249  57       249
 :                           58   122  122  122  58       122
 ;                           59   94   94   94   59       94
 <                           60   76   76   76   60       76
 =                           61   126  126  126  61       126
 >                           62   110  110  110  62       110
 ?                           63   111  111  111  63       111
 @                           64   124  124  124  64       124
 A                           65   193  193  193  65       193
 B                           66   194  194  194  66       194
 C                           67   195  195  195  67       195
 D                           68   196  196  196  68       196
 E                           69   197  197  197  69       197
 F                           70   198  198  198  70       198
 G                           71   199  199  199  71       199
 H                           72   200  200  200  72       200
 I                           73   201  201  201  73       201
 J                           74   209  209  209  74       209
 K                           75   210  210  210  75       210
 L                           76   211  211  211  76       211
 M                           77   212  212  212  77       212
 N                           78   213  213  213  78       213
 O                           79   214  214  214  79       214
 P                           80   215  215  215  80       215
 Q                           81   216  216  216  81       216
 R                           82   217  217  217  82       217
 S                           83   226  226  226  83       226
 T                           84   227  227  227  84       227
 U                           85   228  228  228  85       228
 V                           86   229  229  229  86       229
 W                           87   230  230  230  87       230
 X                           88   231  231  231  88       231
 Y                           89   232  232  232  89       232
 Z                           90   233  233  233  90       233
 [                           91   186  173  187  91       173      ** ##
 \                           92   224  224  188  92       224         ##
 ]                           93   187  189  189  93       189      **
 ^                           94   176  95   106  94       95       ** ##
 _                           95   109  109  109  95       109
 `                           96   121  121  74   96       121         ##
 a                           97   129  129  129  97       129
 b                           98   130  130  130  98       130
 c                           99   131  131  131  99       131
 d                           100  132  132  132  100      132
 e                           101  133  133  133  101      133
 f                           102  134  134  134  102      134
 g                           103  135  135  135  103      135
 h                           104  136  136  136  104      136
 i                           105  137  137  137  105      137
 j                           106  145  145  145  106      145
 k                           107  146  146  146  107      146
 l                           108  147  147  147  108      147
 m                           109  148  148  148  109      148
 n                           110  149  149  149  110      149
 o                           111  150  150  150  111      150
 p                           112  151  151  151  112      151
 q                           113  152  152  152  113      152
 r                           114  153  153  153  114      153
 s                           115  162  162  162  115      162
 t                           116  163  163  163  116      163
 u                           117  164  164  164  117      164
 v                           118  165  165  165  118      165
 w                           119  166  166  166  119      166
 x                           120  167  167  167  120      167
 y                           121  168  168  168  121      168
 z                           122  169  169  169  122      169
 {                           123  192  192  251  123      192         ##
 |                           124  79   79   79   124      79
 }                           125  208  208  253  125      208         ##
 ~                           126  161  161  255  126      161         ##
 <DEL>                       127  7    7    7    127      7
 <PAD>                       128  32   32   32   194.128  32
 <HOP>                       129  33   33   33   194.129  33
 <BPH>                       130  34   34   34   194.130  34
 <NBH>                       131  35   35   35   194.131  35
 <IND>                       132  36   36   36   194.132  36
 <NEL>                       133  21   37   37   194.133  37       **
 <SSA>                       134  6    6    6    194.134  6
 <ESA>                       135  23   23   23   194.135  23
 <HTS>                       136  40   40   40   194.136  40
 <HTJ>                       137  41   41   41   194.137  41
 <VTS>                       138  42   42   42   194.138  42
 <PLD>                       139  43   43   43   194.139  43
 <PLU>                       140  44   44   44   194.140  44
 <RI>                        141  9    9    9    194.141  9
 <SS2>                       142  10   10   10   194.142  10
 <SS3>                       143  27   27   27   194.143  27
 <DCS>                       144  48   48   48   194.144  48
 <PU1>                       145  49   49   49   194.145  49
 <PU2>                       146  26   26   26   194.146  26
 <STS>                       147  51   51   51   194.147  51
 <CCH>                       148  52   52   52   194.148  52
 <MW>                        149  53   53   53   194.149  53
 <SPA>                       150  54   54   54   194.150  54
 <EPA>                       151  8    8    8    194.151  8
 <SOS>                       152  56   56   56   194.152  56
 <SGC>                       153  57   57   57   194.153  57
 <SCI>                       154  58   58   58   194.154  58
 <CSI>                       155  59   59   59   194.155  59
 <ST>                        156  4    4    4    194.156  4
 <OSC>                       157  20   20   20   194.157  20
 <PM>                        158  62   62   62   194.158  62
 <APC>                       159  255  255  95   194.159  255         ##
 <NON-BREAKING SPACE>        160  65   65   65   194.160  128.65
 <INVERTED "!" >             161  170  170  170  194.161  128.66
 <CENT SIGN>                 162  74   74   176  194.162  128.67      ##
 <POUND SIGN>                163  177  177  177  194.163  128.68
 <CURRENCY SIGN>             164  159  159  159  194.164  128.69
 <YEN SIGN>                  165  178  178  178  194.165  128.70
 <BROKEN BAR>                166  106  106  208  194.166  128.71      ##
 <SECTION SIGN>              167  181  181  181  194.167  128.72
 <DIAERESIS>                 168  189  187  121  194.168  128.73   ** ##
 <COPYRIGHT SIGN>            169  180  180  180  194.169  128.74
 <FEMININE ORDINAL>          170  154  154  154  194.170  128.81
 <LEFT POINTING GUILLEMET>   171  138  138  138  194.171  128.82
 <NOT SIGN>                  172  95   176  186  194.172  128.83   ** ##
 <SOFT HYPHEN>               173  202  202  202  194.173  128.84
 <REGISTERED TRADE MARK>     174  175  175  175  194.174  128.85
 <MACRON>                    175  188  188  161  194.175  128.86      ##
 <DEGREE SIGN>               176  144  144  144  194.176  128.87
 <PLUS-OR-MINUS SIGN>        177  143  143  143  194.177  128.88
 <SUPERSCRIPT TWO>           178  234  234  234  194.178  128.89
 <SUPERSCRIPT THREE>         179  250  250  250  194.179  128.98
 <ACUTE ACCENT>              180  190  190  190  194.180  128.99
 <MICRO SIGN>                181  160  160  160  194.181  128.100
 <PARAGRAPH SIGN>            182  182  182  182  194.182  128.101
 <MIDDLE DOT>                183  179  179  179  194.183  128.102
 <CEDILLA>                   184  157  157  157  194.184  128.103
 <SUPERSCRIPT ONE>           185  218  218  218  194.185  128.104
 <MASC. ORDINAL INDICATOR>   186  155  155  155  194.186  128.105
 <RIGHT POINTING GUILLEMET>  187  139  139  139  194.187  128.106
 <FRACTION ONE QUARTER>      188  183  183  183  194.188  128.112
 <FRACTION ONE HALF>         189  184  184  184  194.189  128.113
 <FRACTION THREE QUARTERS>   190  185  185  185  194.190  128.114
 <INVERTED QUESTION MARK>    191  171  171  171  194.191  128.115
 <A WITH GRAVE>              192  100  100  100  195.128  138.65
 <A WITH ACUTE>              193  101  101  101  195.129  138.66
 <A WITH CIRCUMFLEX>         194  98   98   98   195.130  138.67
 <A WITH TILDE>              195  102  102  102  195.131  138.68
 <A WITH DIAERESIS>          196  99   99   99   195.132  138.69
 <A WITH RING ABOVE>         197  103  103  103  195.133  138.70
 <CAPITAL LIGATURE AE>       198  158  158  158  195.134  138.71
 <C WITH CEDILLA>            199  104  104  104  195.135  138.72
 <E WITH GRAVE>              200  116  116  116  195.136  138.73
 <E WITH ACUTE>              201  113  113  113  195.137  138.74
 <E WITH CIRCUMFLEX>         202  114  114  114  195.138  138.81
 <E WITH DIAERESIS>          203  115  115  115  195.139  138.82
 <I WITH GRAVE>              204  120  120  120  195.140  138.83
 <I WITH ACUTE>              205  117  117  117  195.141  138.84
 <I WITH CIRCUMFLEX>         206  118  118  118  195.142  138.85
 <I WITH DIAERESIS>          207  119  119  119  195.143  138.86
 <CAPITAL LETTER ETH>        208  172  172  172  195.144  138.87
 <N WITH TILDE>              209  105  105  105  195.145  138.88
 <O WITH GRAVE>              210  237  237  237  195.146  138.89
 <O WITH ACUTE>              211  238  238  238  195.147  138.98
 <O WITH CIRCUMFLEX>         212  235  235  235  195.148  138.99
 <O WITH TILDE>              213  239  239  239  195.149  138.100
 <O WITH DIAERESIS>          214  236  236  236  195.150  138.101
 <MULTIPLICATION SIGN>       215  191  191  191  195.151  138.102
 <O WITH STROKE>             216  128  128  128  195.152  138.103
 <U WITH GRAVE>              217  253  253  224  195.153  138.104     ##
 <U WITH ACUTE>              218  254  254  254  195.154  138.105
 <U WITH CIRCUMFLEX>         219  251  251  221  195.155  138.106     ##
 <U WITH DIAERESIS>          220  252  252  252  195.156  138.112
 <Y WITH ACUTE>              221  173  186  173  195.157  138.113  ** ##
 <CAPITAL LETTER THORN>      222  174  174  174  195.158  138.114
 <SMALL LETTER SHARP S>      223  89   89   89   195.159  138.115
 <a WITH GRAVE>              224  68   68   68   195.160  139.65
 <a WITH ACUTE>              225  69   69   69   195.161  139.66
 <a WITH CIRCUMFLEX>         226  66   66   66   195.162  139.67
 <a WITH TILDE>              227  70   70   70   195.163  139.68
 <a WITH DIAERESIS>          228  67   67   67   195.164  139.69
 <a WITH RING ABOVE>         229  71   71   71   195.165  139.70
 <SMALL LIGATURE ae>         230  156  156  156  195.166  139.71
 <c WITH CEDILLA>            231  72   72   72   195.167  139.72
 <e WITH GRAVE>              232  84   84   84   195.168  139.73
 <e WITH ACUTE>              233  81   81   81   195.169  139.74
 <e WITH CIRCUMFLEX>         234  82   82   82   195.170  139.81
 <e WITH DIAERESIS>          235  83   83   83   195.171  139.82
 <i WITH GRAVE>              236  88   88   88   195.172  139.83
 <i WITH ACUTE>              237  85   85   85   195.173  139.84
 <i WITH CIRCUMFLEX>         238  86   86   86   195.174  139.85
 <i WITH DIAERESIS>          239  87   87   87   195.175  139.86
 <SMALL LETTER eth>          240  140  140  140  195.176  139.87
 <n WITH TILDE>              241  73   73   73   195.177  139.88
 <o WITH GRAVE>              242  205  205  205  195.178  139.89
 <o WITH ACUTE>              243  206  206  206  195.179  139.98
 <o WITH CIRCUMFLEX>         244  203  203  203  195.180  139.99
 <o WITH TILDE>              245  207  207  207  195.181  139.100
 <o WITH DIAERESIS>          246  204  204  204  195.182  139.101
 <DIVISION SIGN>             247  225  225  225  195.183  139.102
 <o WITH STROKE>             248  112  112  112  195.184  139.103
 <u WITH GRAVE>              249  221  221  192  195.185  139.104     ##
 <u WITH ACUTE>              250  222  222  222  195.186  139.105
 <u WITH CIRCUMFLEX>         251  219  219  219  195.187  139.106
 <u WITH DIAERESIS>          252  220  220  220  195.188  139.112
 <y WITH ACUTE>              253  141  141  141  195.189  139.113
 <SMALL LETTER thorn>        254  142  142  142  195.190  139.114
 <y WITH DIAERESIS>          255  223  223  223  195.191  139.115

If you would rather see the above table in CCSID 0037 order rather than
ASCII + Latin-1 order then run the table through:

=over 4

=item recipe 4

=back

    perl \
     -ne 'if(/.{29}\d{1,3}\s{2,4}\d{1,3}\s{2,4}\d{1,3}\s{2,4}\d{1,3}/)'\
      -e '{push(@l,$_)}' \
      -e 'END{print map{$_->[0]}' \
      -e '          sort{$a->[1] <=> $b->[1]}' \
      -e '          map{[$_,substr($_,34,3)]}@l;}' perlebcdic.pod

If you would rather see it in CCSID 1047 order then change the number
34 in the last line to 39, like this:

=over 4

=item recipe 5

=back

    perl \
     -ne 'if(/.{29}\d{1,3}\s{2,4}\d{1,3}\s{2,4}\d{1,3}\s{2,4}\d{1,3}/)'\
      -e '{push(@l,$_)}' \
      -e 'END{print map{$_->[0]}' \
      -e '          sort{$a->[1] <=> $b->[1]}' \
      -e '          map{[$_,substr($_,39,3)]}@l;}' perlebcdic.pod

2c09a866 724If you would rather see it in POSIX-BC order then change the number
4d2ca8b5 72534 in the last line to 44, like this:
d396a558
JH
726
727=over 4
728
395f5a0c 729=item recipe 6
d396a558
JH
730
731=back
732
5f26d5fd 733 perl \
8d725451 734 -ne 'if(/.{29}\d{1,3}\s{2,4}\d{1,3}\s{2,4}\d{1,3}\s{2,4}\d{1,3}/)'\
84f709e7
JH
735 -e '{push(@l,$_)}' \
736 -e 'END{print map{$_->[0]}' \
737 -e ' sort{$a->[1] <=> $b->[1]}' \
8d725451 738 -e ' map{[$_,substr($_,44,3)]}@l;}' perlebcdic.pod
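
The three one-liners differ only in the column offset, so the same idea can be sketched as a small function. This is an illustration rather than part of the recipes: the name C<sort_table_rows> is hypothetical, and the sample rows are built with C<sprintf> on the assumption that the name field is 29 columns wide and each numeric field 5 columns wide, which is what the offsets 34, 39 and 44 imply.

```perl
use strict;
use warnings;

# Keep only lines that look like table rows (a 29-column name field
# followed by at least four decimal columns), then sort them numerically
# on the 3-character field that starts at $offset.
# 34 selects CCSID 0037, 39 selects CCSID 1047, 44 selects POSIX-BC.
sub sort_table_rows {
    my ($offset, @lines) = @_;

    my @rows = grep {
        /.{29}\d{1,3}\s{2,4}\d{1,3}\s{2,4}\d{1,3}\s{2,4}\d{1,3}/
    } @lines;

    return map  { $_->[0] }
           sort { $a->[1] <=> $b->[1] }
           map  {
               # Pull the digits out of the fixed-width field.
               my ($n) = substr($_, $offset, 3) =~ /(\d+)/;
               [ $_, $n // 0 ]
           } @rows;
}

# Sample rows laid out like the table: name, then the 0819, 0037,
# 1047 and POSIX-BC values in decimal.
my @sample = map { sprintf "%-29s%-5d%-5d%-5d%-5d\n", @$_ } (
    [ 'A', 65, 193, 193, 193 ],
    [ '[', 91, 186, 173, 187 ],
    [ ']', 93, 187, 189, 189 ],
    [ 'a', 97, 129, 129, 129 ],
);

print sort_table_rows(34, @sample);    # rows in CCSID 0037 order
```

Passing 29 as the offset sorts on the first numeric column and so gives back ASCII + Latin-1 order.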

=head2 Table in hex, sorted in 1047 order

Since this document was first written, it has become increasingly
conventional to use hexadecimal notation for code points.  To do this with
the recipes and to also sort is a multi-step process, so here, for
convenience, is the table from above, re-sorted to be in Code Page 1047
order, and using hex notation.

748 ISO
749 8859-1 POS- CCSID
750 CCSID CCSID CCSID IX- 1047
751 chr 0819 0037 1047 BC UTF-8 UTF-EBCDIC
752 ---------------------------------------------------------------------
753 <NUL> 00 00 00 00 00 00
754 <SOH> 01 01 01 01 01 01
755 <STX> 02 02 02 02 02 02
756 <ETX> 03 03 03 03 03 03
757 <ST> 9C 04 04 04 C2.9C 04
758 <HT> 09 05 05 05 09 05
759 <SSA> 86 06 06 06 C2.86 06
760 <DEL> 7F 07 07 07 7F 07
761 <EPA> 97 08 08 08 C2.97 08
762 <RI> 8D 09 09 09 C2.8D 09
763 <SS2> 8E 0A 0A 0A C2.8E 0A
764 <VT> 0B 0B 0B 0B 0B 0B
765 <FF> 0C 0C 0C 0C 0C 0C
766 <CR> 0D 0D 0D 0D 0D 0D
767 <SO> 0E 0E 0E 0E 0E 0E
768 <SI> 0F 0F 0F 0F 0F 0F
769 <DLE> 10 10 10 10 10 10
770 <DC1> 11 11 11 11 11 11
771 <DC2> 12 12 12 12 12 12
772 <DC3> 13 13 13 13 13 13
773 <OSC> 9D 14 14 14 C2.9D 14
774 <LF> 0A 25 15 15 0A 15 **
775 <BS> 08 16 16 16 08 16
776 <ESA> 87 17 17 17 C2.87 17
777 <CAN> 18 18 18 18 18 18
778 <EOM> 19 19 19 19 19 19
779 <PU2> 92 1A 1A 1A C2.92 1A
780 <SS3> 8F 1B 1B 1B C2.8F 1B
781 <FS> 1C 1C 1C 1C 1C 1C
782 <GS> 1D 1D 1D 1D 1D 1D
783 <RS> 1E 1E 1E 1E 1E 1E
784 <US> 1F 1F 1F 1F 1F 1F
785 <PAD> 80 20 20 20 C2.80 20
786 <HOP> 81 21 21 21 C2.81 21
787 <BPH> 82 22 22 22 C2.82 22
788 <NBH> 83 23 23 23 C2.83 23
789 <IND> 84 24 24 24 C2.84 24
790 <NEL> 85 15 25 25 C2.85 25 **
791 <ETB> 17 26 26 26 17 26
792 <ESC> 1B 27 27 27 1B 27
793 <HTS> 88 28 28 28 C2.88 28
794 <HTJ> 89 29 29 29 C2.89 29
795 <VTS> 8A 2A 2A 2A C2.8A 2A
796 <PLD> 8B 2B 2B 2B C2.8B 2B
797 <PLU> 8C 2C 2C 2C C2.8C 2C
798 <ENQ> 05 2D 2D 2D 05 2D
799 <ACK> 06 2E 2E 2E 06 2E
800 <BEL> 07 2F 2F 2F 07 2F
801 <DCS> 90 30 30 30 C2.90 30
802 <PU1> 91 31 31 31 C2.91 31
803 <SYN> 16 32 32 32 16 32
804 <STS> 93 33 33 33 C2.93 33
805 <CCH> 94 34 34 34 C2.94 34
806 <MW> 95 35 35 35 C2.95 35
807 <SPA> 96 36 36 36 C2.96 36
808 <EOT> 04 37 37 37 04 37
809 <SOS> 98 38 38 38 C2.98 38
810 <SGC> 99 39 39 39 C2.99 39
811 <SCI> 9A 3A 3A 3A C2.9A 3A
812 <CSI> 9B 3B 3B 3B C2.9B 3B
813 <DC4> 14 3C 3C 3C 14 3C
814 <NAK> 15 3D 3D 3D 15 3D
815 <PM> 9E 3E 3E 3E C2.9E 3E
816 <SUB> 1A 3F 3F 3F 1A 3F
817 <SPACE> 20 40 40 40 20 40
818 <NON-BREAKING SPACE> A0 41 41 41 C2.A0 80.41
819 <a WITH CIRCUMFLEX> E2 42 42 42 C3.A2 8B.43
820 <a WITH DIAERESIS> E4 43 43 43 C3.A4 8B.45
821 <a WITH GRAVE> E0 44 44 44 C3.A0 8B.41
822 <a WITH ACUTE> E1 45 45 45 C3.A1 8B.42
823 <a WITH TILDE> E3 46 46 46 C3.A3 8B.44
824 <a WITH RING ABOVE> E5 47 47 47 C3.A5 8B.46
825 <c WITH CEDILLA> E7 48 48 48 C3.A7 8B.48
826 <n WITH TILDE> F1 49 49 49 C3.B1 8B.58
827 <CENT SIGN> A2 4A 4A B0 C2.A2 80.43 ##
828 . 2E 4B 4B 4B 2E 4B
829 < 3C 4C 4C 4C 3C 4C
830 ( 28 4D 4D 4D 28 4D
831 + 2B 4E 4E 4E 2B 4E
832 | 7C 4F 4F 4F 7C 4F
833 & 26 50 50 50 26 50
834 <e WITH ACUTE> E9 51 51 51 C3.A9 8B.4A
835 <e WITH CIRCUMFLEX> EA 52 52 52 C3.AA 8B.51
836 <e WITH DIAERESIS> EB 53 53 53 C3.AB 8B.52
837 <e WITH GRAVE> E8 54 54 54 C3.A8 8B.49
838 <i WITH ACUTE> ED 55 55 55 C3.AD 8B.54
839 <i WITH CIRCUMFLEX> EE 56 56 56 C3.AE 8B.55
840 <i WITH DIAERESIS> EF 57 57 57 C3.AF 8B.56
841 <i WITH GRAVE> EC 58 58 58 C3.AC 8B.53
842 <SMALL LETTER SHARP S> DF 59 59 59 C3.9F 8A.73
843 ! 21 5A 5A 5A 21 5A
844 $ 24 5B 5B 5B 24 5B
845 * 2A 5C 5C 5C 2A 5C
846 ) 29 5D 5D 5D 29 5D
847 ; 3B 5E 5E 5E 3B 5E
848 ^ 5E B0 5F 6A 5E 5F ** ##
849 - 2D 60 60 60 2D 60
850 / 2F 61 61 61 2F 61
851 <A WITH CIRCUMFLEX> C2 62 62 62 C3.82 8A.43
852 <A WITH DIAERESIS> C4 63 63 63 C3.84 8A.45
853 <A WITH GRAVE> C0 64 64 64 C3.80 8A.41
854 <A WITH ACUTE> C1 65 65 65 C3.81 8A.42
855 <A WITH TILDE> C3 66 66 66 C3.83 8A.44
856 <A WITH RING ABOVE> C5 67 67 67 C3.85 8A.46
857 <C WITH CEDILLA> C7 68 68 68 C3.87 8A.48
858 <N WITH TILDE> D1 69 69 69 C3.91 8A.58
859 <BROKEN BAR> A6 6A 6A D0 C2.A6 80.47 ##
860 , 2C 6B 6B 6B 2C 6B
861 % 25 6C 6C 6C 25 6C
862 _ 5F 6D 6D 6D 5F 6D
863 > 3E 6E 6E 6E 3E 6E
864 ? 3F 6F 6F 6F 3F 6F
865 <o WITH STROKE> F8 70 70 70 C3.B8 8B.67
866 <E WITH ACUTE> C9 71 71 71 C3.89 8A.4A
867 <E WITH CIRCUMFLEX> CA 72 72 72 C3.8A 8A.51
868 <E WITH DIAERESIS> CB 73 73 73 C3.8B 8A.52
869 <E WITH GRAVE> C8 74 74 74 C3.88 8A.49
870 <I WITH ACUTE> CD 75 75 75 C3.8D 8A.54
871 <I WITH CIRCUMFLEX> CE 76 76 76 C3.8E 8A.55
872 <I WITH DIAERESIS> CF 77 77 77 C3.8F 8A.56
873 <I WITH GRAVE> CC 78 78 78 C3.8C 8A.53
874 ` 60 79 79 4A 60 79 ##
875 : 3A 7A 7A 7A 3A 7A
876 # 23 7B 7B 7B 23 7B
877 @ 40 7C 7C 7C 40 7C
878 ' 27 7D 7D 7D 27 7D
879 = 3D 7E 7E 7E 3D 7E
880 " 22 7F 7F 7F 22 7F
881 <O WITH STROKE> D8 80 80 80 C3.98 8A.67
882 a 61 81 81 81 61 81
883 b 62 82 82 82 62 82
884 c 63 83 83 83 63 83
885 d 64 84 84 84 64 84
886 e 65 85 85 85 65 85
887 f 66 86 86 86 66 86
888 g 67 87 87 87 67 87
889 h 68 88 88 88 68 88
890 i 69 89 89 89 69 89
891 <LEFT POINTING GUILLEMET> AB 8A 8A 8A C2.AB 80.52
892 <RIGHT POINTING GUILLEMET> BB 8B 8B 8B C2.BB 80.6A
893 <SMALL LETTER eth> F0 8C 8C 8C C3.B0 8B.57
894 <y WITH ACUTE> FD 8D 8D 8D C3.BD 8B.71
895 <SMALL LETTER thorn> FE 8E 8E 8E C3.BE 8B.72
896 <PLUS-OR-MINUS SIGN> B1 8F 8F 8F C2.B1 80.58
897 <DEGREE SIGN> B0 90 90 90 C2.B0 80.57
898 j 6A 91 91 91 6A 91
899 k 6B 92 92 92 6B 92
900 l 6C 93 93 93 6C 93
901 m 6D 94 94 94 6D 94
902 n 6E 95 95 95 6E 95
903 o 6F 96 96 96 6F 96
904 p 70 97 97 97 70 97
905 q 71 98 98 98 71 98
906 r 72 99 99 99 72 99
907 <FEMININE ORDINAL> AA 9A 9A 9A C2.AA 80.51
908 <MASC. ORDINAL INDICATOR> BA 9B 9B 9B C2.BA 80.69
909 <SMALL LIGATURE ae> E6 9C 9C 9C C3.A6 8B.47
910 <CEDILLA> B8 9D 9D 9D C2.B8 80.67
911 <CAPITAL LIGATURE AE> C6 9E 9E 9E C3.86 8A.47
912 <CURRENCY SIGN> A4 9F 9F 9F C2.A4 80.45
913 <MICRO SIGN> B5 A0 A0 A0 C2.B5 80.64
914 ~ 7E A1 A1 FF 7E A1 ##
915 s 73 A2 A2 A2 73 A2
916 t 74 A3 A3 A3 74 A3
917 u 75 A4 A4 A4 75 A4
918 v 76 A5 A5 A5 76 A5
919 w 77 A6 A6 A6 77 A6
920 x 78 A7 A7 A7 78 A7
921 y 79 A8 A8 A8 79 A8
922 z 7A A9 A9 A9 7A A9
923 <INVERTED "!" > A1 AA AA AA C2.A1 80.42
924 <INVERTED QUESTION MARK> BF AB AB AB C2.BF 80.73
925 <CAPITAL LETTER ETH> D0 AC AC AC C3.90 8A.57
926 [ 5B BA AD BB 5B AD ** ##
927 <CAPITAL LETTER THORN> DE AE AE AE C3.9E 8A.72
928 <REGISTERED TRADE MARK> AE AF AF AF C2.AE 80.55
929 <NOT SIGN> AC 5F B0 BA C2.AC 80.53 ** ##
930 <POUND SIGN> A3 B1 B1 B1 C2.A3 80.44
931 <YEN SIGN> A5 B2 B2 B2 C2.A5 80.46
932 <MIDDLE DOT> B7 B3 B3 B3 C2.B7 80.66
933 <COPYRIGHT SIGN> A9 B4 B4 B4 C2.A9 80.4A
934 <SECTION SIGN> A7 B5 B5 B5 C2.A7 80.48
935 <PARAGRAPH SIGN> B6 B6 B6 B6 C2.B6 80.65
936 <FRACTION ONE QUARTER> BC B7 B7 B7 C2.BC 80.70
937 <FRACTION ONE HALF> BD B8 B8 B8 C2.BD 80.71
938 <FRACTION THREE QUARTERS> BE B9 B9 B9 C2.BE 80.72
939 <Y WITH ACUTE> DD AD BA AD C3.9D 8A.71 ** ##
940 <DIAERESIS> A8 BD BB 79 C2.A8 80.49 ** ##
941 <MACRON> AF BC BC A1 C2.AF 80.56 ##
942 ] 5D BB BD BD 5D BD **
943 <ACUTE ACCENT> B4 BE BE BE C2.B4 80.63
944 <MULTIPLICATION SIGN> D7 BF BF BF C3.97 8A.66
945 { 7B C0 C0 FB 7B C0 ##
946 A 41 C1 C1 C1 41 C1
947 B 42 C2 C2 C2 42 C2
948 C 43 C3 C3 C3 43 C3
949 D 44 C4 C4 C4 44 C4
950 E 45 C5 C5 C5 45 C5
951 F 46 C6 C6 C6 46 C6
952 G 47 C7 C7 C7 47 C7
953 H 48 C8 C8 C8 48 C8
954 I 49 C9 C9 C9 49 C9
955 <SOFT HYPHEN> AD CA CA CA C2.AD 80.54
956 <o WITH CIRCUMFLEX> F4 CB CB CB C3.B4 8B.63
957 <o WITH DIAERESIS> F6 CC CC CC C3.B6 8B.65
958 <o WITH GRAVE> F2 CD CD CD C3.B2 8B.59
959 <o WITH ACUTE> F3 CE CE CE C3.B3 8B.62
960 <o WITH TILDE> F5 CF CF CF C3.B5 8B.64
961 } 7D D0 D0 FD 7D D0 ##
962 J 4A D1 D1 D1 4A D1
963 K 4B D2 D2 D2 4B D2
964 L 4C D3 D3 D3 4C D3
965 M 4D D4 D4 D4 4D D4
966 N 4E D5 D5 D5 4E D5
967 O 4F D6 D6 D6 4F D6
968 P 50 D7 D7 D7 50 D7
969 Q 51 D8 D8 D8 51 D8
970 R 52 D9 D9 D9 52 D9
971 <SUPERSCRIPT ONE> B9 DA DA DA C2.B9 80.68
972 <u WITH CIRCUMFLEX> FB DB DB DB C3.BB 8B.6A
973 <u WITH DIAERESIS> FC DC DC DC C3.BC 8B.70
974 <u WITH GRAVE> F9 DD DD C0 C3.B9 8B.68 ##
975 <u WITH ACUTE> FA DE DE DE C3.BA 8B.69
976 <y WITH DIAERESIS> FF DF DF DF C3.BF 8B.73
977 \ 5C E0 E0 BC 5C E0 ##
978 <DIVISION SIGN> F7 E1 E1 E1 C3.B7 8B.66
979 S 53 E2 E2 E2 53 E2
980 T 54 E3 E3 E3 54 E3
981 U 55 E4 E4 E4 55 E4
982 V 56 E5 E5 E5 56 E5
983 W 57 E6 E6 E6 57 E6
984 X 58 E7 E7 E7 58 E7
985 Y 59 E8 E8 E8 59 E8
986 Z 5A E9 E9 E9 5A E9
987 <SUPERSCRIPT TWO> B2 EA EA EA C2.B2 80.59
988 <O WITH CIRCUMFLEX> D4 EB EB EB C3.94 8A.63
989 <O WITH DIAERESIS> D6 EC EC EC C3.96 8A.65
990 <O WITH GRAVE> D2 ED ED ED C3.92 8A.59
991 <O WITH ACUTE> D3 EE EE EE C3.93 8A.62
992 <O WITH TILDE> D5 EF EF EF C3.95 8A.64
993 0 30 F0 F0 F0 30 F0
994 1 31 F1 F1 F1 31 F1
995 2 32 F2 F2 F2 32 F2
996 3 33 F3 F3 F3 33 F3
997 4 34 F4 F4 F4 34 F4
998 5 35 F5 F5 F5 35 F5
999 6 36 F6 F6 F6 36 F6
1000 7 37 F7 F7 F7 37 F7
1001 8 38 F8 F8 F8 38 F8
1002 9 39 F9 F9 F9 39 F9
1003 <SUPERSCRIPT THREE> B3 FA FA FA C2.B3 80.62
1004 <U WITH CIRCUMFLEX> DB FB FB DD C3.9B 8A.6A ##
1005 <U WITH DIAERESIS> DC FC FC FC C3.9C 8A.70
1006 <U WITH GRAVE> D9 FD FD E0 C3.99 8A.68 ##
1007 <U WITH ACUTE> DA FE FE FE C3.9A 8A.69
1008 <APC> 9F FF FF 5F C2.9F FF ##

=head1 IDENTIFYING CHARACTER CODE SETS

It is possible to determine which character set you are operating under.
But first be sure that you really need to.  Your code will be simpler,
and probably just as portable, if it doesn't test the character set and
branch on the result.  There are only a few circumstances in which it's
hard to write straight-line code that is portable to all character sets.
See L<perluniintro/Unicode and EBCDIC> for how to portably specify
characters.

But there are some cases where you may want to know which character set
you are running under.  One possible example is doing
L<sorting|/SORTING> in inner loops where performance is critical.

To determine if you are running under ASCII or EBCDIC, you can use the
return value of C<ord()> or C<chr()> to test one or more character
values.  For example:

    $is_ascii  = "A" eq chr(65);
    $is_ebcdic = "A" eq chr(193);
    $is_ascii  = ord("A") == 65;
    $is_ebcdic = ord("A") == 193;

There's even less need to distinguish between EBCDIC code pages, but to
do so try looking at one or more of the characters that differ between
them.

    $is_ascii           = ord('[') == 91;
    $is_ebcdic_37       = ord('[') == 186;
    $is_ebcdic_1047     = ord('[') == 173;
    $is_ebcdic_POSIX_BC = ord('[') == 187;
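
The four tests above can be folded into one lookup. The following is a minimal sketch; the function name C<guess_charset> and the C<'unknown'> fallback are illustrative choices, and only the code pages tabulated in this document are recognized.

```perl
use strict;
use warnings;

# Classify the native character set from the code point of '['.
# Any value not listed here is reported as 'unknown' rather than guessed.
sub guess_charset {
    my %by_bracket = (
         91 => 'ASCII',
        186 => 'CCSID 0037',
        173 => 'CCSID 1047',
        187 => 'POSIX-BC',
    );
    return $by_bracket{ ord('[') } // 'unknown';
}

print guess_charset(), "\n";
```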

However, it would be unwise to write tests such as:

    $is_ascii = "\r" ne chr(13);  #  WRONG
    $is_ascii = "\n" ne chr(10);  #  ILL ADVISED

The first of these cannot distinguish most ASCII platforms from a CCSID
0037, a 1047, or a POSIX-BC EBCDIC platform, since
S<C<"\r" eq chr(13)>> under all of those coded character sets.  Note too
that because C<"\n"> is C<chr(13)> and C<"\r"> is C<chr(10)> on old
Macintosh (which is an ASCII platform), the second C<$is_ascii> test will
lead to trouble there.

To determine whether or not perl was built under an EBCDIC
code page you can use the Config module like so:

    use Config;
    $is_ebcdic = $Config{'ebcdic'} eq 'define';
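
Under C<use warnings>, the comparison above complains if the C<ebcdic> entry happens to be undefined. A slightly more defensive sketch follows; the function name C<built_for_ebcdic> is hypothetical, and the defined-or operator needs Perl v5.10 or later.

```perl
use strict;
use warnings;
use Config;

# True if this perl was configured for an EBCDIC code page.  The
# "// ''" guards against the entry being undefined, which would
# otherwise trigger an "uninitialized" warning in the comparison.
sub built_for_ebcdic {
    return ( $Config{ebcdic} // '' ) eq 'define' ? 1 : 0;
}

printf "EBCDIC build: %s\n", built_for_ebcdic() ? 'yes' : 'no';
```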

=head1 CONVERSIONS

=head2 C<utf8::unicode_to_native()> and C<utf8::native_to_unicode()>

These functions take an input numeric code point in one encoding and
return its equivalent value in the other.

See L<utf8>.
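
As a short illustration of the two functions, here is a round trip through the code point of "A". The variable name is arbitrary; the values in the comments follow from the table earlier in this document.

```perl
use strict;
use warnings;

# U+0041 is "A" in Unicode.  unicode_to_native() gives the code point
# that chr() needs on the current platform (65 on ASCII, 193 on the
# EBCDIC code pages tabulated above).
my $native = utf8::unicode_to_native(0x41);
print chr($native), "\n";                               # prints "A"

# native_to_unicode() goes the other way.
printf "U+%04X\n", utf8::native_to_unicode(ord "A");    # prints "U+0041"
```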

=head2 tr///

To convert a string of characters from one character set to another, all
that is needed is a simple list of numbers, such as those in the right
columns of the above table, along with Perl's C<tr///> operator.  The
data in the table are in ASCII/Latin1 order, hence the EBCDIC columns
provide easy-to-use ASCII/Latin1 to EBCDIC operations that are also easily
reversed.

For example, to convert ASCII/Latin1 to code page 037, take the second
numbers column from the output of recipe 2 (modified to add C<"\">
characters), and use it in C<tr///> like so:

    $cp_037 =
    '\x00\x01\x02\x03\x37\x2D\x2E\x2F\x16\x05\x25\x0B\x0C\x0D\x0E\x0F' .
    '\x10\x11\x12\x13\x3C\x3D\x32\x26\x18\x19\x3F\x27\x1C\x1D\x1E\x1F' .
    '\x40\x5A\x7F\x7B\x5B\x6C\x50\x7D\x4D\x5D\x5C\x4E\x6B\x60\x4B\x61' .
    '\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF7\xF8\xF9\x7A\x5E\x4C\x7E\x6E\x6F' .
    '\x7C\xC1\xC2\xC3\xC4\xC5\xC6\xC7\xC8\xC9\xD1\xD2\xD3\xD4\xD5\xD6' .
    '\xD7\xD8\xD9\xE2\xE3\xE4\xE5\xE6\xE7\xE8\xE9\xBA\xE0\xBB\xB0\x6D' .
    '\x79\x81\x82\x83\x84\x85\x86\x87\x88\x89\x91\x92\x93\x94\x95\x96' .
    '\x97\x98\x99\xA2\xA3\xA4\xA5\xA6\xA7\xA8\xA9\xC0\x4F\xD0\xA1\x07' .
    '\x20\x21\x22\x23\x24\x15\x06\x17\x28\x29\x2A\x2B\x2C\x09\x0A\x1B' .
    '\x30\x31\x1A\x33\x34\x35\x36\x08\x38\x39\x3A\x3B\x04\x14\x3E\xFF' .
    '\x41\xAA\x4A\xB1\x9F\xB2\x6A\xB5\xBD\xB4\x9A\x8A\x5F\xCA\xAF\xBC' .
    '\x90\x8F\xEA\xFA\xBE\xA0\xB6\xB3\x9D\xDA\x9B\x8B\xB7\xB8\xB9\xAB' .
    '\x64\x65\x62\x66\x63\x67\x9E\x68\x74\x71\x72\x73\x78\x75\x76\x77' .
    '\xAC\x69\xED\xEE\xEB\xEF\xEC\xBF\x80\xFD\xFE\xFB\xFC\xAD\xAE\x59' .
    '\x44\x45\x42\x46\x43\x47\x9C\x48\x54\x51\x52\x53\x58\x55\x56\x57' .
    '\x8C\x49\xCD\xCE\xCB\xCF\xCC\xE1\x70\xDD\xDE\xDB\xDC\x8D\x8E\xDF';

    my $ebcdic_string = $ascii_string;
    eval '$ebcdic_string =~ tr/\000-\377/' . $cp_037 . '/';

To convert from EBCDIC 037 to ASCII just reverse the order of the C<tr///>
arguments like so:

    my $ascii_string = $ebcdic_string;
    eval '$ascii_string =~ tr/' . $cp_037 . '/\000-\377/';

Similarly one could take the third numbers column from the output of
recipe 2 to obtain a C<$cp_1047> table.  The fourth numbers column of the
output from recipe 2 could provide a C<$cp_posix_bc> table suitable for
transcoding as well.

If you wanted to see the inverse tables, you would first have to sort on the
desired numbers column as in recipes 4, 5 or 6, then take the output of the
first numbers column.
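
The C<eval> in these examples is needed because C<tr///> does not interpolate variables. The sketch below shows the same technique on a deliberately tiny, hypothetical excerpt of a table (the ten digits only), compiles the operator once into a reusable closure, and checks that the C<eval> succeeded. It assumes it is run on an ASCII platform.

```perl
use strict;
use warnings;

# Hypothetical excerpt of a translation table: only the ten digits,
# which are 0x30-0x39 in ASCII and 0xF0-0xF9 in code page 037.
my $ascii_digits = '\x30-\x39';
my $cp037_digits = '\xF0-\xF9';

# tr/// does not interpolate variables, so the operator is assembled
# as a string and compiled once with eval; the resulting closure can
# then be reused without further evals.
my $digits_to_cp037 = eval
    "sub { (my \$s = shift) =~ tr/$ascii_digits/$cp037_digits/; \$s }"
    or die "could not build translator: $@";

my $converted = $digits_to_cp037->("2019");
printf "%vX\n", $converted;    # prints "F2.F0.F1.F9" on an ASCII platform
```

Characters outside the listed ranges pass through unchanged, so a full conversion still needs the complete 256-entry table.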

=head2 iconv

XPG operability often implies the presence of an I<iconv> utility
available from the shell or from the C library.  Consult your system's
documentation for information on iconv.

On OS/390 or z/OS see the L<iconv(1)> manpage.  One way to invoke the C<iconv>
shell utility from within perl is:

    # OS/390 or z/OS example
    $ascii_data = `echo '$ebcdic_data'| iconv -f IBM-1047 -t ISO8859-1`;

or the inverse map:

    # OS/390 or z/OS example
    $ebcdic_data = `echo '$ascii_data'| iconv -f ISO8859-1 -t IBM-1047`;

For other Perl-based conversion options see the C<Convert::*> modules on CPAN.
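
One further option, not mentioned above and offered here only as a sketch, is the core C<Encode> module. The example assumes an ASCII platform and an C<Encode> installation that includes the EBCDIC encodings from C<Encode::EBCDIC> (C<cp37>, C<cp1047>, C<posix-bc>).

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Character string -> CCSID 1047 octets, and back again.
my $text   = "Hello";
my $octets = encode('cp1047', $text);
printf "%vX\n", $octets;        # prints "C8.85.93.93.96"

my $round_trip = decode('cp1047', $octets);
print $round_trip eq $text ? "ok\n" : "mismatch\n";
```

Unlike the C<iconv> examples, this does not spawn a shell, so the data never has to be quoted for the command line.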

=head2 C RTL

The OS/390 and z/OS C run-time libraries provide C<_atoe()> and C<_etoa()> functions.

=head1 OPERATOR DIFFERENCES

The C<..> range operator treats certain character ranges with
care on EBCDIC platforms.  For example the following array
will have twenty-six elements on either an EBCDIC platform
or an ASCII platform:

    @alphabet = ('A'..'Z');   #  $#alphabet == 25

The bitwise operators such as C<&>, C<^>, and C<|> may return different
results when operating on string or character data in a Perl program
running on an EBCDIC platform than when run on an ASCII platform.  Here is
an example adapted from the one in L<perlop>:

    # EBCDIC-based examples
    print "j p \n" ^ " a h";                      # prints "JAPH\n"
    print "JA" | " ph\n";                         # prints "japh\n"
    print "JAPH\nJunk" & "\277\277\277\277\277";  # prints "japh\n";
    print 'p N$' ^ " E<H\n";                      # prints "Perl\n";

An interesting property of the 32 C0 control characters
in the ASCII table is that they can "literally" be constructed
as control characters in Perl, e.g. C<(chr(0) eq "\c@")>,
C<(chr(1) eq "\cA")>, and so on.  Perl on EBCDIC platforms has been
ported to take C<\c@> to C<chr(0)> and C<\cA> to C<chr(1)>, etc. as well,
but the characters that result depend on which code page you are
using.  The table below uses the standard acronyms for the controls.
The POSIX-BC and 1047 sets are
identical throughout this range and differ from the 0037 set at only
one spot (21 decimal).  Note that the line terminator character
may be generated by C<\cJ> on ASCII platforms but by C<\cU> on 1047 or POSIX-BC
platforms and cannot be generated as a C<"\c.letter."> control character on
0037 platforms.  Note also that C<\c\> cannot be the final element in a string
or regex, as it will absorb the terminator.  But C<\c\I<X>> is a C<FILE
SEPARATOR> concatenated with I<X> for all I<X>.
The outlier C<\c?> on ASCII, which yields a non-C0 control C<DEL>,
yields the outlier control C<APC> on EBCDIC, the one that isn't in the
block of contiguous controls.  A subtlety of this is that
C<\c?> on ASCII platforms is an ASCII character, while it isn't
equivalent to any ASCII character on EBCDIC platforms.

 chr   ord   8859-1    0037      1047 && POSIX-BC
 -----------------------------------------------------------------------
 \c@     0   <NUL>     <NUL>     <NUL>
 \cA     1   <SOH>     <SOH>     <SOH>
 \cB     2   <STX>     <STX>     <STX>
 \cC     3   <ETX>     <ETX>     <ETX>
 \cD     4   <EOT>     <ST>      <ST>
 \cE     5   <ENQ>     <HT>      <HT>
 \cF     6   <ACK>     <SSA>     <SSA>
 \cG     7   <BEL>     <DEL>     <DEL>
 \cH     8   <BS>      <EPA>     <EPA>
 \cI     9   <HT>      <RI>      <RI>
 \cJ    10   <LF>      <SS2>     <SS2>
 \cK    11   <VT>      <VT>      <VT>
 \cL    12   <FF>      <FF>      <FF>
 \cM    13   <CR>      <CR>      <CR>
 \cN    14   <SO>      <SO>      <SO>
 \cO    15   <SI>      <SI>      <SI>
 \cP    16   <DLE>     <DLE>     <DLE>
 \cQ    17   <DC1>     <DC1>     <DC1>
 \cR    18   <DC2>     <DC2>     <DC2>
 \cS    19   <DC3>     <DC3>     <DC3>
 \cT    20   <DC4>     <OSC>     <OSC>
 \cU    21   <NAK>     <NEL>     <LF>              **
 \cV    22   <SYN>     <BS>      <BS>
 \cW    23   <ETB>     <ESA>     <ESA>
 \cX    24   <CAN>     <CAN>     <CAN>
 \cY    25   <EOM>     <EOM>     <EOM>
 \cZ    26   <SUB>     <PU2>     <PU2>
 \c[    27   <ESC>     <SS3>     <SS3>
 \c\X   28   <FS>X     <FS>X     <FS>X
 \c]    29   <GS>      <GS>      <GS>
 \c^    30   <RS>      <RS>      <RS>
 \c_    31   <US>      <US>      <US>
 \c?    *    <DEL>     <APC>     <APC>

C<*> Note: C<\c?> maps to ordinal 127 (C<DEL>) on ASCII platforms, but
since ordinal 127 is not a control character on EBCDIC machines,
C<\c?> instead maps on them to C<APC>, which is 255 in 0037 and 1047,
and 95 in POSIX-BC.
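
To see how the table applies on a given machine, the ordinals of a few C<\c> escapes can simply be printed. This is a small sketch; the selection of escapes is arbitrary, and the note about the line terminator is derived by comparing each character with C<"\n"> at run time rather than being hard-coded.

```perl
use strict;
use warnings;

# The ordinals of "\c@" .. "\c_" are 0 .. 31 everywhere; what differs
# between platforms is which control each ordinal denotes.  "\c?" is the
# outlier: 127 on ASCII, and the code point of APC on EBCDIC.
my @escapes = (
    [ '\c@' => "\c@" ],
    [ '\cA' => "\cA" ],
    [ '\cJ' => "\cJ" ],
    [ '\cU' => "\cU" ],
    [ '\c_' => "\c_" ],
    [ '\c?' => "\c?" ],
);

for my $pair (@escapes) {
    my ($name, $char) = @$pair;
    printf "%-4s ord %3d%s\n", $name, ord($char),
        $char eq "\n" ? '   (the native line terminator)' : '';
}
```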

=head1 FUNCTION DIFFERENCES

=over 8

=item C<chr()>

C<chr()> must be given an EBCDIC code number argument to yield a desired
character return value on an EBCDIC platform.  For example:

    $CAPITAL_LETTER_A = chr(193);

=item C<ord()>

C<ord()> will return EBCDIC code number values on an EBCDIC platform.
For example:

    $the_number_193 = ord("A");

=item C<pack()>

The C<"c"> and C<"C"> templates for C<pack()> are dependent upon character set
encoding.  Examples of usage on EBCDIC include:

    $foo = pack("CCCC",193,194,195,196);
    # $foo eq "ABCD"
    $foo = pack("C4",193,194,195,196);
    # same thing

    $foo = pack("ccxxcc",193,194,195,196);
    # $foo eq "AB\0\0CD"

The C<"U"> template has been ported to mean "Unicode" on all platforms so
that

    pack("U", 65) eq 'A'

is true on all platforms.  If you want native code points for the low
256, use the C<"W"> template.  This means that the equivalences

    pack("W", ord($character)) eq $character
    unpack("W", $character) == ord $character

will hold.

=item C<print()>

One must be careful with scalars and strings that are passed to
print that contain ASCII encodings.  One common place
for this to occur is in the output of the MIME type header for
CGI script writing.  For example, many Perl programming guides
recommend something similar to:

    print "Content-type:\ttext/html\015\012\015\012";
    # this may be wrong on EBCDIC

You can instead write

    print "Content-type:\ttext/html\r\n\r\n"; # OK for DGW et al

and have it work portably.

That is because the translation from EBCDIC to ASCII is done
by the web server in this case.  Consult your web server's documentation for
further details.

=item C<printf()>

The formats that can convert characters to numbers and vice versa
will be different from their ASCII counterparts when executed
on an EBCDIC platform.  Examples include:

    printf("%c%c%c",193,194,195);  # prints ABC

=item C<sort()>

EBCDIC sort results may differ from ASCII sort results especially for
mixed case strings.  This is discussed in more detail L<below|/SORTING>.

=item C<sprintf()>

See the discussion of C<L</printf()>> above.  An example of the use
of sprintf would be:

    $CAPITAL_LETTER_A = sprintf("%c",193);

=item C<unpack()>

See the discussion of C<L</pack()>> above.

=back

Note that it is possible to write portable code for these by specifying
things in Unicode numbers, and using a conversion function:

    printf("%c",utf8::unicode_to_native(65));  # prints A on all
                                               # platforms
    print utf8::native_to_unicode(ord("A"));   # Likewise, prints 65

See L<perluniintro/Unicode and EBCDIC> and L</CONVERSIONS>
for other options.
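
If a program does this often, the conversions can be wrapped once and reused. The helper names below (C<chr_from_unicode> and C<ord_as_unicode>) are hypothetical; they are thin wrappers around the two core C<utf8::> functions and add nothing else.

```perl
use strict;
use warnings;

# Hypothetical helper names; both simply wrap the core utf8:: functions.
sub chr_from_unicode { return chr utf8::unicode_to_native(shift) }
sub ord_as_unicode   { return utf8::native_to_unicode(ord shift) }

my $capital_a = chr_from_unicode(0x41);         # "A" on every platform
my $abcd      = join '', map { chr_from_unicode($_) } 0x41 .. 0x44;
my $code      = ord_as_unicode("A");            # 65 on every platform

print "$capital_a $abcd $code\n";               # prints "A ABCD 65"

# pack's "U" template counts in Unicode, "W" in native code points.
print pack("U", 0x41) eq pack("W", ord "A") ? "same\n" : "different\n";
```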

=head1 REGULAR EXPRESSION DIFFERENCES

You can write your regular expressions just like someone on an ASCII
platform would do.  But keep in mind that using octal or hex notation to
specify a particular code point will give you the character that the
EBCDIC code page natively maps to it.  (This is also true of all
double-quoted strings.)  If you want to write portably, just use the
C<\N{U+...}> notation everywhere where you would have used C<\x{...}>,
and don't use octal notation at all.

Starting in Perl v5.22, this applies to ranges in bracketed character
classes.  If you say, for example, C<qr/[\N{U+20}-\N{U+7E}]/>, it means
the characters C<\N{U+20}>, C<\N{U+21}>, ..., C<\N{U+7E}>.  This range
is all the printable characters that the ASCII character set contains.

Prior to v5.22, you couldn't specify any ranges portably, except that
(starting in Perl v5.5.3) all subsets of the C<[A-Z]> and C<[a-z]>
ranges are specially coded to not pick up gap characters.  For example,
characters such as "E<ocirc>" (C<o WITH CIRCUMFLEX>) that lie between
"I" and "J" would not be matched by the regular expression range
C</[H-K]/>.  But if either of the range end points is explicitly numeric
(and neither is specified by C<\N{U+...}>), the gap characters are
matched:

    /[\x89-\x91]/

will match C<\x8e>, even though C<\x89> is "i" and C<\x91> is "j",
and C<\x8e> is a gap character, from the alphabetic viewpoint.

Another construct to be wary of is the inappropriate use of hex (unless
you use C<\N{U+...}>) or octal constants in regular expressions.
Consider the following set of subs:

    sub is_c0 {
        my $char = substr(shift,0,1);
        $char =~ /[\000-\037]/;
    }

    sub is_print_ascii {
        my $char = substr(shift,0,1);
        $char =~ /[\040-\176]/;
    }

    sub is_delete {
        my $char = substr(shift,0,1);
        $char eq "\177";
    }

    sub is_c1 {
        my $char = substr(shift,0,1);
        $char =~ /[\200-\237]/;
    }

    sub is_latin_1 {    # But not ASCII; not C1
        my $char = substr(shift,0,1);
        $char =~ /[\240-\377]/;
    }

These are valid only on ASCII platforms.  Starting in Perl v5.22, simply
changing the octal constants to equivalent C<\N{U+...}> values makes
them portable:

    sub is_c0 {
        my $char = substr(shift,0,1);
        $char =~ /[\N{U+00}-\N{U+1F}]/;
    }

    sub is_print_ascii {
        my $char = substr(shift,0,1);
        $char =~ /[\N{U+20}-\N{U+7E}]/;
    }

    sub is_delete {
        my $char = substr(shift,0,1);
        $char eq "\N{U+7F}";
    }

    sub is_c1 {
        my $char = substr(shift,0,1);
        $char =~ /[\N{U+80}-\N{U+9F}]/;
    }

    sub is_latin_1 {    # But not ASCII; not C1
        my $char = substr(shift,0,1);
        $char =~ /[\N{U+A0}-\N{U+FF}]/;
    }

And here are some alternative portable ways to write them:

    sub Is_c0 {
        my $char = substr(shift,0,1);
        return $char =~ /[[:cntrl:]]/a && ! Is_delete($char);

        # Alternatively:
        # return $char =~ /[[:cntrl:]]/
        #        && $char =~ /[[:ascii:]]/
        #        && ! Is_delete($char);
    }

    sub Is_print_ascii {
        my $char = substr(shift,0,1);

        return $char =~ /[[:print:]]/a;

        # Alternatively:
        # return $char =~ /[[:print:]]/ && $char =~ /[[:ascii:]]/;

        # Or
        # return $char
        #      =~ /[ !"\#\$%&'()*+,\-.\/0-9:;<=>?\@A-Z[\\\]^_`a-z{|}~]/;
    }

    sub Is_delete {
        my $char = substr(shift,0,1);
        return utf8::native_to_unicode(ord $char) == 0x7F;
    }

    sub Is_c1 {
        use feature 'unicode_strings';
        my $char = substr(shift,0,1);
        return $char =~ /[[:cntrl:]]/ && $char !~ /[[:ascii:]]/;
    }

    sub Is_latin_1 {    # But not ASCII; not C1
        use feature 'unicode_strings';
        my $char = substr(shift,0,1);
        return ord($char) < 256
            && $char !~ /[[:ascii:]]/
            && $char !~ /[[:cntrl:]]/;
    }

Another way to write C<Is_latin_1()> would be
to use the characters in the range explicitly:

    sub Is_latin_1 {
        my $char = substr(shift,0,1);
        $char =~ /[ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ]
                | [ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ]/x;
    }

That form may run into trouble in network transit (due to the
presence of 8 bit characters) or on non ISO-Latin character sets.  But
it does allow C<Is_c1> to be rewritten so it works on Perls that don't
have C<'unicode_strings'> (earlier than v5.14):

    sub Is_c1 {
        my $char = substr(shift,0,1);
        return ord($char) < 256
            && $char !~ /[[:ascii:]]/
            && ! Is_latin_1($char);
    }

=head1 SOCKETS

Most socket programming assumes ASCII character encodings in network
byte order.  Exceptions can include CGI script writing under a
host web server where the server may take care of translation for you.
Most host web servers convert EBCDIC data to ISO-8859-1 or Unicode on
output.
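
When nothing translates for you, the program has to produce the ASCII or ISO-8859-1 octets itself before writing to the socket. The following is a minimal sketch of one way to do that with the core conversion function. The helper name C<latin1_octets> is hypothetical, and the sketch assumes every character in the string has a code point below 256.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a native string whose characters are all
# below 256 into ISO-8859-1 octets suitable for writing to a socket.
sub latin1_octets {
    my ($string) = @_;
    return pack 'C*',
        map { utf8::native_to_unicode(ord $_) } split //, $string;
}

my $request = latin1_octets("GET / HTTP/1.0\r\n\r\n");
print join(' ', unpack 'C*', substr($request, 0, 3)), "\n";   # prints "71 69 84"
```

On an ASCII platform the function returns its input unchanged, so the same code can run everywhere.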

=head1 SORTING

One big difference between ASCII-based character sets and EBCDIC ones
is the relative positions of the characters when sorted in native
order.  Of most concern are the upper- and lowercase letters, the
digits, and the underscore (C<"_">).  On ASCII platforms the native sort
order has the digits come before the uppercase letters which come before
the underscore which comes before the lowercase letters.  On EBCDIC, the
underscore comes first, then the lowercase letters, then the uppercase
ones, and the digits last.  If sorted on an ASCII-based platform, the
two-letter abbreviation for a physician comes before the two-letter
abbreviation for drive; that is:

    @sorted = sort(qw(Dr. dr.));  # @sorted holds ('Dr.','dr.') on ASCII,
                                  # but ('dr.','Dr.') on EBCDIC

The property of lowercase before uppercase letters in EBCDIC is
even carried to the Latin 1 EBCDIC pages such as 0037 and 1047.
An example would be that "E<Euml>" (C<E WITH DIAERESIS>, 203) comes
before "E<euml>" (C<e WITH DIAERESIS>, 235) on an ASCII platform, but
the latter (83) comes before the former (115) on an EBCDIC platform.
(Astute readers will note that the uppercase version of "E<szlig>"
C<SMALL LETTER SHARP S> is simply "SS" and that the uppercase versions
of "E<yuml>" (small C<y WITH DIAERESIS>) and "E<micro>" (C<MICRO SIGN>)
are not in the 0..255 range but are in Unicode, in a Unicode enabled
Perl).

The sort order will cause differences between results obtained on
ASCII platforms versus EBCDIC platforms.  What follows are some suggestions
on how to deal with these differences.

=head2 Ignore ASCII vs. EBCDIC sort differences.

This is the least computationally expensive strategy.  It may require
some user education.

=head2 Use a sort helper function

This is completely general, but the most computationally expensive
strategy.  Choose one or the other character set and transform to that
for every sort comparison.  Here's a complete example that transforms
to ASCII sort order:

    sub native_to_uni($) {
        my $string = shift;

        # Saves time on an ASCII platform
        return $string if ord 'A' == 65;

        my $output = "";
        for my $i (0 .. length($string) - 1) {
            $output
               .= chr(utf8::native_to_unicode(ord(substr($string, $i, 1))));
        }

        # Preserve utf8ness of input onto the output, even if it didn't need
        # to be utf8
        utf8::upgrade($output) if utf8::is_utf8($string);

        return $output;
    }

    sub ascii_order {   # Sort helper
        return native_to_uni($a) cmp native_to_uni($b);
    }

    sort ascii_order @list;
1554
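Because the helper above converts both operands on every comparison, one way to reduce the cost is to convert each element only once and sort on the cached key (the idiom often called a Schwartzian transform). This is a sketch only; C<to_unicode_key> is a hypothetical name, not part of Perl, and the sample list is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: map every character to its Unicode code point.
sub to_unicode_key {
    my ($string) = @_;
    return $string if ord('A') == 65;    # already ASCII/Unicode order
    return join '',
        map { chr(utf8::native_to_unicode(ord($_))) } split //, $string;
}

my @list = qw(banana Apple cherry apple Banana);

# Build [key, original] pairs once, sort on the key, keep the original.
my @sorted = map  { $_->[1] }
             sort { $a->[0] cmp $b->[0] }
             map  { [ to_unicode_key($_), $_ ] } @list;

print "@sorted\n";    # Apple Banana apple banana cherry
```

The conversion now runs once per element rather than twice per comparison, at the price of holding a second copy of each string while sorting.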
1555=head2 MONO CASE then sort data (for non-digits, non-underscore)
1556
1557If you don't care about where digits and underscore sort to, you can do
1558something like this:
1559
1560 sub case_insensitive_order { # Sort helper
1561 return lc($a) cmp lc($b)
1562 }
1563
1564 sort case_insensitive_order @list;
1565
1566If performance is an issue, and you don't care if the output is in the
1567same case as the input, use C<tr///> to transform to the case most
1568employed within the data. If the data are primarily UPPERCASE
1569non-Latin-1, then apply C<tr/a-z/A-Z/>, and then C<sort()>. If the
1570data are primarily lowercase non-Latin-1, then apply C<tr/A-Z/a-z/>
1571before sorting. If the data are primarily UPPERCASE and include Latin-1
1572characters then apply:
1573
1574 tr/a-z/A-Z/;
1575 tr/àáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ/ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ/;
1576 s/ß/SS/g;
1577
1578then C<sort()>. If you have a choice, it's better to lowercase things
1579to avoid the problems of the two Latin-1 characters whose uppercase is
1580outside Latin-1: "E<yuml>" (small C<y WITH DIAERESIS>) and "E<micro>"
1581(C<MICRO SIGN>). If you do need to uppercase, you can; with a
1582Unicode-enabled Perl, do:
1583
1584 tr/ÿ/\x{178}/;
1585 tr/µ/\x{39C}/;
d396a558 1586
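As an alternative to spelling out the C<tr///> tables by hand, the built-in C<uc()> can do the same mappings when Unicode rules are in effect. The sketch below assumes Perl v5.12 or later (for the C<unicode_strings> feature); the variable names are illustrative.

```perl
use strict;
use warnings;
use feature 'unicode_strings';    # make uc() follow Unicode rules

my $y_diaeresis = "\N{U+FF}";     # small y WITH DIAERESIS
my $micro       = "\N{U+B5}";     # MICRO SIGN
my $sharp_s     = "\N{U+DF}";     # SMALL LETTER SHARP S

# Report results as Unicode code points so the output is the same on
# ASCII and EBCDIC platforms.
printf "U+%04X\n", utf8::native_to_unicode(ord(uc($y_diaeresis)));  # U+0178
printf "U+%04X\n", utf8::native_to_unicode(ord(uc($micro)));        # U+039C
print uc($sharp_s), "\n";                                           # SS
```

Both uppercase results lie outside the 0..255 range, which is why lowercasing is the simpler choice when either direction would do.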
2bbc8d55 1587=head2 Perform sorting on one type of platform only.
d396a558
JH
1588
1589This strategy can employ a network connection. As such
1590it would be computationally expensive.
1591
395f5a0c 1592=head1 TRANSFORMATION FORMATS
1e054b24 1593
eaf8b9b9
KW
1594There are a variety of ways of transforming data with an intra character set
1595mapping that serve a variety of purposes. Sorting was discussed in the
1596previous section and a few of the other more popular mapping techniques are
1e054b24
PP
1597discussed next.
1598
1599=head2 URL decoding and encoding
d396a558 1600
51b5cecb 1601Note that some URLs have hexadecimal ASCII code points in them in an
eaf8b9b9 1602attempt to overcome character or protocol limitation issues. For example,
1e054b24 1603the tilde character is not on every keyboard, so a URL of the form:
d396a558
JH
1604
1605 http://www.pvhp.com/~pvhp/
1606
1607may also be expressed as either of:
1608
1609 http://www.pvhp.com/%7Epvhp/
1610
1611 http://www.pvhp.com/%7epvhp/
1612
4d2ca8b5 1613where 7E is the hexadecimal ASCII code point for "~". Here is an example
f11f9c4c 1614of decoding such a URL in any EBCDIC code page:
d396a558 1615
84f709e7 1616 $url = 'http://www.pvhp.com/%7Epvhp/';
f11f9c4c
KW
1617 $url =~ s/%([0-9a-fA-F]{2})/
1618 chr(utf8::unicode_to_native(hex($1)))/xge;
d396a558 1619
eaf8b9b9 1620Conversely, here is a partial solution for the task of encoding such
f11f9c4c 1621a URL in any EBCDIC code page:
1e054b24 1622
84f709e7 1623 $url = 'http://www.pvhp.com/~pvhp/';
eaf8b9b9
KW
1624 # The following regular expression does not address the
1625 # mappings for: ('.' => '%2E', '/' => '%2F', ':' => '%3A')
10c526cf 1626 $url =~ s/([\t "#%&\(\),;<=>\?\@\[\\\]^`{|}~])/
f11f9c4c 1627 sprintf("%%%02X",utf8::native_to_unicode(ord($1)))/xge;
1e054b24 1628
eaf8b9b9 1629where a more complete solution would split the URL into components
1e054b24
PP
1630and apply a full s/// substitution only to the appropriate parts.
1631
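The component-wise approach can be sketched as a pair of small helpers. This is an illustration under stated assumptions, not a complete URL library: C<url_unescape> and C<url_escape_segment> are hypothetical names, only characters with code points below 256 are handled, and the choice of which characters to leave unescaped (letters, digits, "-", ".", "_" and "~") is an assumption of this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: turn %XX escapes (ASCII code points) back into
# native characters.
sub url_unescape {
    my ($text) = @_;
    $text =~ s/%([0-9a-fA-F]{2})/chr(utf8::unicode_to_native(hex($1)))/ge;
    return $text;
}

# Hypothetical helper: escape one path segment, leaving only letters,
# digits, "-", ".", "_" and "~" untouched.
sub url_escape_segment {
    my ($segment) = @_;
    $segment =~ s/([^A-Za-z0-9\-._~])/
                  sprintf("%%%02X", utf8::native_to_unicode(ord($1)))/xge;
    return $segment;
}

print url_unescape('http://www.pvhp.com/%7Epvhp/'), "\n";
# http://www.pvhp.com/~pvhp/

# Escape each segment separately so that the "/" separators survive.
my $path = join '/',
    map { url_escape_segment($_) } split m{/}, 'my docs/a&b', -1;
print "$path\n";    # my%20docs/a%26b
```

Splitting on "/" before escaping is what keeps the separators intact; a real application would also treat the scheme, host and query string separately.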
1e054b24
PP
1632=head2 uu encoding and decoding
1633
4d2ca8b5
KW
1634The C<u> template to C<pack()> or C<unpack()> will render EBCDIC data in
1635EBCDIC characters equivalent to their ASCII counterparts. For example,
1636the following will print "Yes indeed\n" on either an ASCII or EBCDIC
1637computer:
1e054b24 1638
84f709e7
JH
1639 $all_byte_chrs = '';
1640 for (0..255) { $all_byte_chrs .= chr($_); }
1641 $uuencode_byte_chrs = pack('u', $all_byte_chrs);
210b36aa 1642 ($uu = <<'ENDOFHEREDOC') =~ s/^\s*//gm;
1e054b24
PP
1643 M``$"`P0%!@<("0H+#`T.#Q`1$A,4%187&!D:&QP='A\@(2(C)"4F)R@I*BLL
1644 M+2XO,#$R,S0U-C<X.3H[/#T^/T!!0D-$149'2$E*2TQ-3D]045)35%565UA9
1645 M6EM<75Y?8&%B8V1E9F=H:6IK;&UN;W!Q<G-T=79W>'EZ>WQ]?G^`@8*#A(6&
1646 MAXB)BHN,C8Z/D)&2DY25EI>8F9J;G)V>GZ"AHJ.DI::GJ*FJJZRMKJ^PL;*S
1647 MM+6VM[BYNKN\O;Z_P,'"P\3%QL?(R<K+S,W.S]#1TM/4U=;7V-G:V]S=WM_@
1648 ?X>+CY.7FY^CIZNOL[>[O\/'R\_3U]O?X^?K[_/W^_P``
1649 ENDOFHEREDOC
84f709e7 1650 if ($uuencode_byte_chrs eq $uu) {
1e054b24
PP
1651 print "Yes ";
1652 }
1653 $uudecode_byte_chrs = unpack('u', $uuencode_byte_chrs);
84f709e7 1654 if ($uudecode_byte_chrs eq $all_byte_chrs) {
1e054b24
PP
1655 print "indeed\n";
1656 }
1657
f11f9c4c 1658Here is a very spartan uudecoder that will work on EBCDIC:
1e054b24 1659
84f709e7 1660 #!/usr/local/bin/perl
84f709e7 1661 $_ = <> until ($mode,$file) = /^begin\s*(\d*)\s*(\S*)/;
1e054b24
PP
1662 open(OUT, "> $file") if $file ne "";
1663 while(<>) {
1664 last if /^end/;
1665 next if /[a-z]/;
f11f9c4c
KW
1666 next unless int((((utf8::native_to_unicode(ord()) - 32 ) & 077)
1667 + 2) / 3)
1668 == int(length() / 4);
1e054b24
PP
1669 print OUT unpack("u", $_);
1670 }
1671 close(OUT);
1672 chmod oct($mode), $file;
1673
1674
1675=head2 Quoted-Printable encoding and decoding
1676
8a50e6a3 1677On ASCII-encoded platforms it is possible to strip characters outside of
1e054b24
PP
1678the printable set using:
1679
1680 # This QP encoder works on ASCII only
4d2ca8b5
KW
1681 $qp_string =~ s/([=\x00-\x1F\x80-\xFF])/
1682 sprintf("=%02X",ord($1))/xge;
1e054b24 1683
4d2ca8b5
KW
1684Starting in Perl v5.22, this is trivially changeable to work portably on
1685both ASCII and EBCDIC platforms.
1686
1687 # This QP encoder works on both ASCII and EBCDIC
1688 $qp_string =~ s/([=\N{U+00}-\N{U+1F}\N{U+80}-\N{U+FF}])/
1689 sprintf("=%02X",ord($1))/xge;
1690
1691For earlier Perls, a QP encoder that works on both ASCII and EBCDIC
1692platforms would look somewhat like the following:
1e054b24 1693
f11f9c4c 1694 $delete = chr(utf8::unicode_to_native(0x7F));
84f709e7 1695 $qp_string =~
f11f9c4c
KW
1696 s/([^[:print:]$delete])/
1697 sprintf("=%02X",utf8::native_to_unicode(ord($1)))/xage;
1e054b24
PP
1698
1699(although in production code the substitutions might be done
f11f9c4c 1700in the EBCDIC branch with the function call and separately in the
4d2ca8b5
KW
1701ASCII branch without the expense of the identity map; in Perl v5.22, the
1702identity map is optimized out so there is no expense, but the
1703alternative above is simpler and is also available in v5.22).
1e054b24
PP
1704
1705Such QP strings can be decoded with:
1706
1707 # This QP decoder is limited to ASCII only
f11f9c4c 1708 $string =~ s/=([[:xdigit:]]{2})/chr hex $1/ge;
1e054b24
PP
1709 $string =~ s/=[\n\r]+$//;
1710
eaf8b9b9 1711Whereas a QP decoder that works on both ASCII and EBCDIC platforms
f11f9c4c 1712would look somewhat like the following:
1e054b24 1713
f11f9c4c
KW
1714 $string =~ s/=([[:xdigit:]]{2})/
1715 chr utf8::unicode_to_native(hex $1)/xge;
1e054b24
PP
1716 $string =~ s/=[\n\r]+$//;
1717
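The encoder and decoder are inverses of one another, which can be checked with a round trip. The following sketch uses hypothetical helper names (C<qp_encode>, C<qp_decode>) and deliberately ignores the line-length and soft-line-break rules of real Quoted-Printable; it also assumes every character has a code point below 256.

```perl
use strict;
use warnings;

# Hypothetical helper: escape "=", controls, DEL and anything above
# U+7E, always writing the ASCII/Unicode code point in hex.
sub qp_encode {
    my ($text) = @_;
    return join '', map {
        my $u = utf8::native_to_unicode(ord($_));
        ($u == 0x3D || $u < 0x20 || $u > 0x7E) ? sprintf("=%02X", $u) : $_;
    } split //, $text;
}

# Hypothetical helper: map each =XX back to the native character.
sub qp_decode {
    my ($text) = @_;
    $text =~ s/=([[:xdigit:]]{2})/chr(utf8::unicode_to_native(hex($1)))/ge;
    return $text;
}

my $original = "a=b \N{U+E9}";
my $encoded  = qp_encode($original);
print "$encoded\n";                                       # a=3Db =E9
print qp_decode($encoded) eq $original ? "ok\n" : "mismatch\n";   # ok
```

Note the direction of the two conversions: encoding goes from native to Unicode, decoding from Unicode back to native.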
c69ca1d4 1718=head2 Caesarean ciphers
1e054b24
PP
1719
1720The practice of shifting an alphabet one or more characters for encipherment
1721dates back thousands of years and was explicitly detailed by Gaius Julius
eaf8b9b9 1722Caesar in his B<Gallic Wars> text. A single alphabet shift is sometimes
1e054b24 1723referred to as a rotation and the shift amount is given as a number $n after
eaf8b9b9
KW
1724the string 'rot' or "rot$n". Rot0 and rot26 would designate identity maps
1725on the 26-letter English version of the Latin alphabet. Rot13 has the
1726interesting property that alternate subsequent invocations are identity maps
1727(thus rot13 is its own non-trivial inverse in the group of 26 alphabet
1728rotations). Hence the following is a rot13 encoder and decoder that will
2bbc8d55 1729work on ASCII and EBCDIC platforms:
1e054b24
PP
1730
1731 #!/usr/local/bin/perl
1732
84f709e7 1733 while(<>){
1e054b24
PP
1734 tr/n-za-mN-ZA-M/a-zA-Z/;
1735 print;
1736 }
1737
1738In one-liner form:
1739
84f709e7 1740 perl -ne 'tr/n-za-mN-ZA-M/a-zA-Z/;print'
1e054b24
PP
1741
1742
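Since C<tr///> builds its tables at compile time, a rotation by an arbitrary amount needs a different approach. One possibility, sketched here with a hypothetical C<rot_n> helper, is to build the mapping from the C<'a' .. 'z'> and C<'A' .. 'Z'> lists, which contain only the 26 letters regardless of where they sit in the native character set.

```perl
use strict;
use warnings;

# Hypothetical helper: rotate the 26 English letters by $n places.
sub rot_n {
    my ($n, $text) = @_;
    my @lower = ('a' .. 'z');
    my @upper = ('A' .. 'Z');
    my %map;
    for my $i (0 .. 25) {
        $map{ $lower[$i] } = $lower[ ($i + $n) % 26 ];
        $map{ $upper[$i] } = $upper[ ($i + $n) % 26 ];
    }
    $text =~ s/([A-Za-z])/$map{$1}/g;
    return $text;
}

print rot_n(13, "Hello, World!"), "\n";               # Uryyb, Jbeyq!
print rot_n(13, rot_n(13, "Hello, World!")), "\n";    # Hello, World!
print rot_n(3, "abc xyz"), "\n";                      # def abc
```

Applying C<rot_n(13, ...)> twice returns the original text, which is the self-inverse property described above.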
1743=head1 Hashing order and checksums
1744
4d2ca8b5
KW
1745Perl deliberately randomizes hash order for security purposes on both
1746ASCII and EBCDIC platforms.
1747
1748EBCDIC checksums will differ for the same file translated into ASCII
1749and vice versa.
1e054b24 1750
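The checksum difference follows directly from the different underlying byte values. As a rough illustration (the variable names are made up, and the simple additive checksum from C<unpack> stands in for whatever checksum is actually in use), the following compares the sum over the native bytes with the sum over the same text mapped to ASCII code points:

```perl
use strict;
use warnings;

my $text = "Hello, world\n";

# Sum of the native byte values of the text.
my $native_sum = unpack("%32C*", $text);

# The same text with every character mapped to its ASCII code point.
my $ascii_text = join '',
    map { chr(utf8::native_to_unicode(ord($_))) } split //, $text;
my $ascii_sum = unpack("%32C*", $ascii_text);

printf "native=%d ascii=%d (%s)\n", $native_sum, $ascii_sum,
    $native_sum == $ascii_sum ? "same" : "different";
```

On an ASCII platform the mapping is the identity, so the two sums agree; on an EBCDIC platform they are expected to differ.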
d396a558
JH
1751=head1 I18N AND L10N
1752
eaf8b9b9
KW
1753Internationalization (I18N) and localization (L10N) are supported at least
1754in principle even on EBCDIC platforms. The details are system-dependent
5a0de581 1755and discussed under the L</OS ISSUES> section below.
d396a558 1756
8a50e6a3 1757=head1 MULTI-OCTET CHARACTER SETS
d396a558 1758
4d2ca8b5
KW
1759Perl works with UTF-EBCDIC, a multi-byte encoding. In Perls earlier
1760than v5.22, there may be various bugs in this regard.
395f5a0c
PK
1761
1762Legacy multi byte EBCDIC code pages XXX.
d396a558
JH
1763
1764=head1 OS ISSUES
1765
eaf8b9b9 1766There may be a few system-dependent issues
d396a558
JH
1767of concern to EBCDIC Perl programmers.
1768
522b859a 1769=head2 OS/400
51b5cecb 1770
d396a558
JH
1771=over 8
1772
522b859a
JH
1773=item PASE
1774
8a50e6a3
FC
1775The PASE environment is a runtime environment for OS/400 that can run
1776executables built for PowerPC AIX in OS/400; see L<perlos400>. PASE
522b859a
JH
1777is ASCII-based, not EBCDIC-based as the ILE is.
1778
d396a558
JH
1779=item IFS access
1780
1781XXX.
1782
1783=back
1784
395f5a0c 1785=head2 OS/390, z/OS
d396a558 1786
51b5cecb
PP
1787Perl runs under UNIX System Services (USS).
1788
d396a558
JH
1789=over 8
1790
4d2ca8b5
KW
1791=item C<sigaction>
1792
1793Using C<SA_SIGINFO> can cause segmentation faults.
1794
1795=item C<chcp>
51b5cecb 1796
eaf8b9b9 1797B<chcp> is supported as a shell utility for displaying and changing
75cdcc93 1798one's code page. See also L<chcp(1)>.
51b5cecb 1799
d396a558
JH
1800=item dataset access
1801
1802For sequential data set access try:
1803
1804 my @ds_records = `cat //DSNAME`;
1805
1806or:
1807
1808 my @ds_records = `cat //'HLQ.DSNAME'`;
1809
1810See also the OS390::Stdio module on CPAN.
1811
4d2ca8b5 1812=item C<iconv>
51b5cecb 1813
1e054b24 1814B<iconv> is supported as both a shell utility and a C RTL routine.
4d2ca8b5 1815See also the L<iconv(1)> and L<iconv(3)> manual pages.
51b5cecb 1816
d396a558
JH
1817=item locales
1818
4d2ca8b5
KW
1819Locales are supported. There may be glitches when a locale is another
1820EBCDIC code page which has some of the
1821L<code-page variant characters|/The 13 variant characters> in other
1822positions.
1823
1824There aren't currently any real UTF-8 locales, even though some locale
1825names contain the string "UTF-8".
1826
1827See L<perllocale> for information on locales. The L10N files
1828are in F</usr/nls/locale>. C<$Config{d_setlocale}> is C<'define'> on
1829OS/390 or z/OS.
d396a558
JH
1830
1831=back
1832
d396a558
JH
1833=head2 POSIX-BC?
1834
1835XXX.
1836
51b5cecb
PP
1837=head1 BUGS
1838
4d2ca8b5
KW
1839=over 4
1840
1841=item *
1842
51b5cecb 1843Not all shells will allow multiple C<-e> string arguments to perl to
4d2ca8b5
KW
1844be concatenated together properly as recipes in this document
18450, 2, 4, 5, and 6 might
395f5a0c 1846seem to imply.
51b5cecb 1847
4d2ca8b5
KW
1848=item *
1849
4d2ca8b5 1850There are a significant number of test failures in the CPAN modules
4b638048 1851shipped with Perl v5.22 and 5.24. These are only in modules not primarily
4d2ca8b5
KW
1852maintained by Perl 5 porters. Some of these are failures in the tests
1853only: they don't realize that it is proper to get different results on
1854EBCDIC platforms. And some of the failures are real bugs. If you
1855compile and do a C<make test> on Perl, all tests in the C</cpan>
1856directory are skipped.
1857
4d2ca8b5
KW
1858L<Encode> partially works.
1859
7a145370
KW
1860=item *
1861
4b638048
KW
1862In earlier Perl versions, when byte and character data were
1863concatenated, the new string was sometimes created by
7a145370
KW
1864decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
1865old Unicode string used EBCDIC.
1866
4d2ca8b5
KW
1867=back
1868
b3b6085d
PP
1869=head1 SEE ALSO
1870
395f5a0c 1871L<perllocale>, L<perlfunc>, L<perlunicode>, L<utf8>.
b3b6085d 1872
d396a558
JH
1873=head1 REFERENCES
1874
2bbc8d55 1875L<http://anubis.dkuug.dk/i18n/charmaps>
d396a558 1876
2bbc8d55 1877L<http://www.unicode.org/>
d396a558 1878
2bbc8d55 1879L<http://www.unicode.org/unicode/reports/tr16/>
d396a558 1880
08d7a6b2 1881L<http://www.wps.com/projects/codes/>
51b5cecb
PP
1882B<ASCII: American Standard Code for Information Infiltration> Tom Jennings,
1883September 1999.
1884
eaf8b9b9
KW
1885B<The Unicode Standard, Version 3.0> The Unicode Consortium, Lisa Moore ed.,
1886ISBN 0-201-61633-5, Addison Wesley Developers Press, February 2000.
51b5cecb 1887
eaf8b9b9
KW
1888B<CDRA: IBM - Character Data Representation Architecture -
1889Reference and Registry>, IBM SC09-2190-00, December 1996.
d396a558 1890
eaf8b9b9 1891"Demystifying Character Sets", Andrea Vine, Multilingual Computing
d396a558
JH
1892& Technology, B<#26 Vol. 10 Issue 4>, August/September 1999;
1893ISSN 1523-0309; Multilingual Computing Inc. Sandpoint ID, USA.
1894
1e054b24
PP
1895B<Codes, Ciphers, and Other Cryptic and Clandestine Communication>
1896Fred B. Wrixon, ISBN 1-57912-040-7, Black Dog & Leventhal Publishers,
18971998.
1898
2bbc8d55 1899L<http://www.bobbemer.com/P-BIT.HTM>
395f5a0c
PK
1900B<IBM - EBCDIC and the P-bit; The biggest Computer Goof Ever> Robert Bemer.
1901
1902=head1 HISTORY
1903
190415 April 2001: added UTF-8 and UTF-EBCDIC to main table, pvhp.
1905
d396a558
JH
1906=head1 AUTHOR
1907
eaf8b9b9
KW
1908Peter Prymmer pvhp@best.com wrote this in 1999 and 2000
1909with CCSID 0819 and 0037 help from Chris Leach and
1910AndrE<eacute> Pirard A.Pirard@ulg.ac.be as well as POSIX-BC
b3b6085d 1911help from Thomas Dorner Thomas.Dorner@start.de.
eaf8b9b9
KW
1912Thanks also to Vickie Cooper, Philip Newton, William Raffloer, and
1913Joe Smith. Trademarks, registered trademarks, service marks and
1914registered service marks used in this document are the property of
1e054b24 1915their respective owners.
4d2ca8b5
KW
1916
1917Now maintained by Perl5 Porters.