+=encoding utf8
+
=head1 NAME
perlebcdic - Considerations for running Perl on EBCDIC platforms
An exploration of some of the issues facing Perl programmers
on EBCDIC based computers. We do not cover localization,
-internationalization, or multi byte character set issues other
+internationalization, or multi-byte character set issues other
than some discussion of UTF-8 and UTF-EBCDIC.
Portions that are still incomplete are marked with XXX.
integers running from 0 to 127 (decimal) that imply character
interpretation by the display and other systems of computers.
The range 0..127 can be covered by setting the bits in a 7-bit binary
-digit, hence the set is sometimes referred to as a "7-bit ASCII".
+digit, hence the set is sometimes referred to as "7-bit ASCII".
ASCII was described by the American National Standards Institute
document ANSI X3.4-1986. It was also described by ISO 646:1991
(with localization for currency symbols). The full ASCII set is
=head2 EBCDIC
The Extended Binary Coded Decimal Interchange Code refers to a
-large collection of single and multi byte coded character sets that are
+large collection of single- and multi-byte coded character sets that are
different from ASCII or ISO 8859-1 and are all slightly different from each
other; they typically run on host computers. The EBCDIC encodings derive from
-8 bit byte extensions of Hollerith punched card encodings. The layout on the
+8-bit byte extensions of Hollerith punched card encodings. The layout on the
cards was such that high bits were set for the upper and lower case alphabet
characters [a-z] and [A-Z], but there were gaps within each Latin alphabet
range.
In EBCDIC, for the low 256 the EBCDIC code points are used. This
means that the equivalences
- pack("U", ord($character)) eq $character
- unpack("U", $character) == ord $character
+ pack("U", ord($character)) eq $character
+ unpack("U", $character) == ord $character
will hold. (If Unicode code points were applied consistently over
all the possible code points, pack("U",ord("A")) would in EBCDIC
Encode knows about more EBCDIC character sets than Perl can currently
be compiled to run on.
- use Encode 'from_to';
+ use Encode 'from_to';
- my %ebcdic = ( 176 => 'cp37', 95 => 'cp1047', 106 => 'posix-bc' );
+ my %ebcdic = ( 176 => 'cp37', 95 => 'cp1047', 106 => 'posix-bc' );
- # $a is in EBCDIC code points
- from_to($a, $ebcdic{ord '^'}, 'latin1');
- # $a is ISO 8859-1 code points
+ # $a is in EBCDIC code points
+ from_to($a, $ebcdic{ord '^'}, 'latin1');
+ # $a is ISO 8859-1 code points
and from Latin-1 code points to EBCDIC code points
- use Encode 'from_to';
+ use Encode 'from_to';
- my %ebcdic = ( 176 => 'cp37', 95 => 'cp1047', 106 => 'posix-bc' );
+ my %ebcdic = ( 176 => 'cp37', 95 => 'cp1047', 106 => 'posix-bc' );
- # $a is ISO 8859-1 code points
- from_to($a, 'latin1', $ebcdic{ord '^'});
- # $a is in EBCDIC code points
+ # $a is ISO 8859-1 code points
+ from_to($a, 'latin1', $ebcdic{ord '^'});
+ # $a is in EBCDIC code points
For doing I/O it is suggested that you use the autotranslating features
of PerlIO, see L<perluniintro>.
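
As an illustration of the PerlIO approach, here is a minimal sketch that reads EBCDIC data through an encoding layer. It assumes an ASCII platform with the Encode module's cp1047 encoding available; the scratch file and variable names are hypothetical and exist only for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a few raw CCSID 1047 bytes to a scratch file: "Hello" in EBCDIC.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "\xC8\x85\x93\x93\x96";
close $tmp_fh or die "Could not close $tmp_name: $!";

# Reading through an :encoding layer translates the bytes to Perl
# characters as they are read, so no explicit from_to() call is needed.
open(my $in, '<:encoding(cp1047)', $tmp_name)
    or die "Could not open $tmp_name: $!";
my $text = <$in>;
close $in;

print "$text\n";
```

The same layer name can be given to binmode() on a handle that is already open, and an output handle opened with '>:encoding(cp1047)' should translate in the other direction.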
in some other cases. The "names" of the controls listed here are
the Unicode Version 1 names, except for the few that don't have names, in which
case the names in the Wikipedia article were used
-(L<http://en.wikipedia.org/wiki/C0_and_C1_control_codes>.
+(L<http://en.wikipedia.org/wiki/C0_and_C1_control_codes>).
The differences between the 0037 and 1047 sets are
flagged with ***. The differences between the 1047 and POSIX-BC sets
are flagged with ###. All ord() numbers listed are decimal. If you
=back
perl -ne 'if(/(.{43})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)' \
- -e '{printf("%s%-9o%-9o%-9o%o\n",$1,$2,$3,$4,$5)}' perlebcdic.pod
+ -e '{printf("%s%-9.03o%-9.03o%-9.03o%.03o\n",$1,$2,$3,$4,$5)}' \
+ perlebcdic.pod
If you want to retain the UTF-x code points then in script form you
might want to write:
=back
- open(FH,"<perlebcdic.pod") or die "Could not open perlebcdic.pod: $!";
- while (<FH>) {
- if (/(.{43})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\.?(\d*)\s+(\d+)\.?(\d*)/) {
- if ($7 ne '' && $9 ne '') {
- printf("%s%-9o%-9o%-9o%-9o%-3o.%-5o%-3o.%o\n",$1,$2,$3,$4,$5,$6,$7,$8,$9);
- }
- elsif ($7 ne '') {
- printf("%s%-9o%-9o%-9o%-9o%-3o.%-5o%o\n",$1,$2,$3,$4,$5,$6,$7,$8);
- }
- else {
- printf("%s%-9o%-9o%-9o%-9o%-9o%o\n",$1,$2,$3,$4,$5,$6,$8);
- }
- }
- }
+ open(FH,"<perlebcdic.pod") or die "Could not open perlebcdic.pod: $!";
+ while (<FH>) {
+ if (/(.{43})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\.?(\d*)\s+(\d+)\.?(\d*)/)
+ {
+ if ($7 ne '' && $9 ne '') {
+ printf(
+ "%s%-9.03o%-9.03o%-9.03o%-9.03o%-3o.%-5o%-3o.%.03o\n",
+ $1,$2,$3,$4,$5,$6,$7,$8,$9);
+ }
+ elsif ($7 ne '') {
+ printf("%s%-9.03o%-9.03o%-9.03o%-9.03o%-3o.%-5o%.03o\n",
+ $1,$2,$3,$4,$5,$6,$7,$8);
+ }
+ else {
+ printf("%s%-9.03o%-9.03o%-9.03o%-9.03o%-9.03o%.03o\n",
+ $1,$2,$3,$4,$5,$6,$8);
+ }
+ }
+ }
If you would rather see this table listing hexadecimal values then
run the table through:
=back
perl -ne 'if(/(.{43})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)' \
- -e '{printf("%s%-9X%-9X%-9X%X\n",$1,$2,$3,$4,$5)}' perlebcdic.pod
+ -e '{printf("%s%-9.02X%-9.02X%-9.02X%.02X\n",$1,$2,$3,$4,$5)}' \
+ perlebcdic.pod
Or, in order to retain the UTF-x code points in hexadecimal:
=back
- open(FH,"<perlebcdic.pod") or die "Could not open perlebcdic.pod: $!";
- while (<FH>) {
- if (/(.{43})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\.?(\d*)\s+(\d+)\.?(\d*)/) {
- if ($7 ne '' && $9 ne '') {
- printf("%s%-9X%-9X%-9X%-9X%-2X.%-6X%-2X.%X\n",$1,$2,$3,$4,$5,$6,$7,$8,$9);
- }
- elsif ($7 ne '') {
- printf("%s%-9X%-9X%-9X%-9X%-2X.%-6X%X\n",$1,$2,$3,$4,$5,$6,$7,$8);
- }
- else {
- printf("%s%-9X%-9X%-9X%-9X%-9X%X\n",$1,$2,$3,$4,$5,$6,$8);
- }
- }
- }
+ open(FH,"<perlebcdic.pod") or die "Could not open perlebcdic.pod: $!";
+ while (<FH>) {
+ if (/(.{43})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\.?(\d*)\s+(\d+)\.?(\d*)/)
+ {
+ if ($7 ne '' && $9 ne '') {
+ printf(
+ "%s%-9.02X%-9.02X%-9.02X%-9.02X%-2X.%-6.02X%02X.%02X\n",
+ $1,$2,$3,$4,$5,$6,$7,$8,$9);
+ }
+ elsif ($7 ne '') {
+ printf("%s%-9.02X%-9.02X%-9.02X%-9.02X%-2X.%-6.02X%02X\n",
+ $1,$2,$3,$4,$5,$6,$7,$8);
+ }
+ else {
+ printf("%s%-9.02X%-9.02X%-9.02X%-9.02X%-9.02X%02X\n",
+ $1,$2,$3,$4,$5,$6,$8);
+ }
+ }
+ }
ISO 8859-1 CCSID CCSID CCSID 1047
=back
- perl -ne 'if(/.{43}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}/)'\
+ perl \
+ -ne 'if(/.{43}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}/)'\
-e '{push(@l,$_)}' \
-e 'END{print map{$_->[0]}' \
-e ' sort{$a->[1] <=> $b->[1]}' \
=back
- perl -ne 'if(/.{43}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}/)'\
- -e '{push(@l,$_)}' \
- -e 'END{print map{$_->[0]}' \
- -e ' sort{$a->[1] <=> $b->[1]}' \
- -e ' map{[$_,substr($_,61,3)]}@l;}' perlebcdic.pod
+ perl \
+ -ne 'if(/.{43}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}/)'\
+ -e '{push(@l,$_)}' \
+ -e 'END{print map{$_->[0]}' \
+ -e ' sort{$a->[1] <=> $b->[1]}' \
+ -e ' map{[$_,substr($_,61,3)]}@l;}' perlebcdic.pod
If you would rather see it in POSIX-BC order then change the number
61 in the last line to 70, like this:
=back
- perl -ne 'if(/.{43}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}/)'\
+ perl \
+ -ne 'if(/.{43}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}/)'\
-e '{push(@l,$_)}' \
-e 'END{print map{$_->[0]}' \
-e ' sort{$a->[1] <=> $b->[1]}' \
Obviously the first of these will fail to distinguish most ASCII platforms
from either a CCSID 0037, a 1047, or a POSIX-BC EBCDIC platform since "\r" eq
chr(13) under all of those coded character sets. But note too that
-because "\n" is chr(13) and "\r" is chr(10) on the MacIntosh (which is an
+because "\n" is chr(13) and "\r" is chr(10) on the Macintosh (which is an
ASCII platform) the second C<$is_ascii> test will lead to trouble there.
To determine whether or not perl was built under an EBCDIC
In order to convert a string of characters from one character set to
another a simple list of numbers, such as in the right columns in the
above table, along with perl's tr/// operator is all that is needed.
-The data in the table are in ASCII order hence the EBCDIC columns
-provide easy to use ASCII to EBCDIC operations that are also easily
+The data in the table are in ASCII/Latin1 order, hence the EBCDIC columns
+provide easy-to-use ASCII/Latin1 to EBCDIC operations that are also easily
reversed.
-For example, to convert ASCII to code page 037 take the output of the second
-column from the output of recipe 0 (modified to add \\ characters) and use
-it in tr/// like so:
+For example, to convert ASCII/Latin1 to code page 037 take the output of the
+second numbers column from the output of recipe 2 (modified to add '\'
+characters) and use it in tr/// like so:
$cp_037 =
- '\000\001\002\003\234\011\206\177\227\215\216\013\014\015\016\017' .
- '\020\021\022\023\235\205\010\207\030\031\222\217\034\035\036\037' .
- '\200\201\202\203\204\012\027\033\210\211\212\213\214\005\006\007' .
- '\220\221\026\223\224\225\226\004\230\231\232\233\024\025\236\032' .
- '\040\240\342\344\340\341\343\345\347\361\242\056\074\050\053\174' .
- '\046\351\352\353\350\355\356\357\354\337\041\044\052\051\073\254' .
- '\055\057\302\304\300\301\303\305\307\321\246\054\045\137\076\077' .
- '\370\311\312\313\310\315\316\317\314\140\072\043\100\047\075\042' .
- '\330\141\142\143\144\145\146\147\150\151\253\273\360\375\376\261' .
- '\260\152\153\154\155\156\157\160\161\162\252\272\346\270\306\244' .
- '\265\176\163\164\165\166\167\170\171\172\241\277\320\335\336\256' .
- '\136\243\245\267\251\247\266\274\275\276\133\135\257\250\264\327' .
- '\173\101\102\103\104\105\106\107\110\111\255\364\366\362\363\365' .
- '\175\112\113\114\115\116\117\120\121\122\271\373\374\371\372\377' .
- '\134\367\123\124\125\126\127\130\131\132\262\324\326\322\323\325' .
- '\060\061\062\063\064\065\066\067\070\071\263\333\334\331\332\237' ;
+ '\x00\x01\x02\x03\x37\x2D\x2E\x2F\x16\x05\x25\x0B\x0C\x0D\x0E\x0F' .
+ '\x10\x11\x12\x13\x3C\x3D\x32\x26\x18\x19\x3F\x27\x1C\x1D\x1E\x1F' .
+ '\x40\x5A\x7F\x7B\x5B\x6C\x50\x7D\x4D\x5D\x5C\x4E\x6B\x60\x4B\x61' .
+ '\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF7\xF8\xF9\x7A\x5E\x4C\x7E\x6E\x6F' .
+ '\x7C\xC1\xC2\xC3\xC4\xC5\xC6\xC7\xC8\xC9\xD1\xD2\xD3\xD4\xD5\xD6' .
+ '\xD7\xD8\xD9\xE2\xE3\xE4\xE5\xE6\xE7\xE8\xE9\xBA\xE0\xBB\xB0\x6D' .
+ '\x79\x81\x82\x83\x84\x85\x86\x87\x88\x89\x91\x92\x93\x94\x95\x96' .
+ '\x97\x98\x99\xA2\xA3\xA4\xA5\xA6\xA7\xA8\xA9\xC0\x4F\xD0\xA1\x07' .
+ '\x20\x21\x22\x23\x24\x15\x06\x17\x28\x29\x2A\x2B\x2C\x09\x0A\x1B' .
+ '\x30\x31\x1A\x33\x34\x35\x36\x08\x38\x39\x3A\x3B\x04\x14\x3E\xFF' .
+ '\x41\xAA\x4A\xB1\x9F\xB2\x6A\xB5\xBD\xB4\x9A\x8A\x5F\xCA\xAF\xBC' .
+ '\x90\x8F\xEA\xFA\xBE\xA0\xB6\xB3\x9D\xDA\x9B\x8B\xB7\xB8\xB9\xAB' .
+ '\x64\x65\x62\x66\x63\x67\x9E\x68\x74\x71\x72\x73\x78\x75\x76\x77' .
+ '\xAC\x69\xED\xEE\xEB\xEF\xEC\xBF\x80\xFD\xFE\xFB\xFC\xAD\xAE\x59' .
+ '\x44\x45\x42\x46\x43\x47\x9C\x48\x54\x51\x52\x53\x58\x55\x56\x57' .
+ '\x8C\x49\xCD\xCE\xCB\xCF\xCC\xE1\x70\xDD\xDE\xDB\xDC\x8D\x8E\xDF';
my $ebcdic_string = $ascii_string;
- eval '$ebcdic_string =~ tr/' . $cp_037 . '/\000-\377/';
+ eval '$ebcdic_string =~ tr/\000-\377/' . $cp_037 . '/';
To convert from EBCDIC 037 to ASCII just reverse the order of the tr///
arguments like so:
my $ascii_string = $ebcdic_string;
- eval '$ascii_string =~ tr/\000-\377/' . $cp_037 . '/';
+ eval '$ascii_string =~ tr/' . $cp_037 . '/\000-\377/';
+
+Similarly one could take the output of the third numbers column from recipe 2
+to obtain a C<$cp_1047> table. The fourth numbers column of the output from
+recipe 2 could provide a C<$cp_posix_bc> table suitable for transcoding as
+well.
-Similarly one could take the output of the third column from recipe 0 to
-obtain a C<$cp_1047> table. The fourth column of the output from recipe
-0 could provide a C<$cp_posix_bc> table suitable for transcoding as well.
+If you wanted to see the inverse tables, you would first have to sort on the
+desired numbers column as in recipes 4, 5 or 6, then take the output of the
+first numbers column.
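
Where the Encode module is available, a table of this shape can also be generated rather than typed in by hand. The following is a minimal sketch, assuming an ASCII platform and Encode's cp37 encoding; the variable names are illustrative, and the output has not been compared row by row with the C<$cp_037> table above.

```perl
use strict;
use warnings;
use Encode qw(encode);

# All 256 Latin-1 characters, in code point order.
my $latin1 = join '', map { chr($_) } 0 .. 255;

# Byte N of $table is the CCSID 0037 code for Latin-1 character N.
my $table = encode('cp37', $latin1);

# Format as sixteen rows of \xNN escapes, in the style of the table above.
my @rows = map { join '', map { sprintf '\x%02X', ord($_) } split //, $_ }
           unpack '(a16)*', $table;
print "'$_'\n" for @rows;
```

Substituting 'cp1047' or 'posix-bc' for 'cp37' would give candidate tables for the other two code pages, though Encode's mappings for a few control characters may differ from the ones a given EBCDIC perl uses.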
=head2 iconv
# OS/390 or z/OS example
$ebcdic_data = `echo '$ascii_data'| iconv -f ISO8859-1 -t IBM-1047`
-For other perl based conversion options see the Convert::* modules on CPAN.
+For other perl-based conversion options see the Convert::* modules on CPAN.
=head2 C RTL
-The OS/390 and z/OS C run time libraries provide _atoe() and _etoa() functions.
+The OS/390 and z/OS C run-time libraries provide _atoe() and _etoa() functions.
=head1 OPERATOR DIFFERENCES
An interesting property of the 32 C0 control characters
in the ASCII table is that they can "literally" be constructed
-as control characters in perl, e.g. C<(chr(0) eq C<\c@>)>
-C<(chr(1) eq C<\cA>)>, and so on. Perl on EBCDIC platforms has been
+as control characters in perl, e.g. (C<chr(0)> eq C<\c@>),
+(C<chr(1)> eq C<\cA>), and so on. Perl on EBCDIC platforms has been
ported to take C<\c@> to chr(0) and C<\cA> to chr(1), etc. as well, but the
thirty three characters that result depend on which code page you are
using. The table below uses the standard acronyms for the controls.
SEPARATOR> concatenated with I<X> for all I<X>.
chr ord 8859-1 0037 1047 && POSIX-BC
- ------------------------------------------------------------------------
+ -----------------------------------------------------------------------
\c? 127 <DEL> " "
\c@ 0 <NUL> <NUL> <NUL>
\cA 1 <SOH> <SOH> <SOH>
Under the IBM OS/390 USS Web Server or WebSphere on z/OS for example
you should instead write that as:
- print "Content-type:\ttext/html\r\n\r\n"; # OK for DGW et alia
+ print "Content-type:\ttext/html\r\n\r\n"; # OK for DGW et al.
That is because the translation from EBCDIC to ASCII is done
by the web server in this case (such code will not be appropriate for
=head1 REGULAR EXPRESSION DIFFERENCES
-As of perl 5.005_03 the letter range regular expression such as
+As of perl 5.005_03 the letter range regular expressions such as
[A-Z] and [a-z] have been especially coded to not pick up gap
characters. For example, characters such as E<ocirc> C<o WITH CIRCUMFLEX>
that lie between I and J would not be matched by the
=head1 SORTING
-One big difference between ASCII based character sets and EBCDIC ones
+One big difference between ASCII-based character sets and EBCDIC ones
are the relative positions of upper and lower case letters and the
-letters compared to the digits. If sorted on an ASCII based platform the
-two letter abbreviation for a physician comes before the two letter
-for drive, that is:
+letters compared to the digits. If sorted on an ASCII-based platform the
+two-letter abbreviation for a physician comes before the two-letter
+abbreviation for drive; that is:
- @sorted = sort(qw(Dr. dr.)); # @sorted holds ('Dr.','dr.') on ASCII,
+ @sorted = sort(qw(Dr. dr.)); # @sorted holds ('Dr.','dr.') on ASCII,
# but ('dr.','Dr.') on EBCDIC
-The property of lower case before uppercase letters in EBCDIC is
+The property of lowercase before uppercase letters in EBCDIC is
even carried to the Latin 1 EBCDIC pages such as 0037 and 1047.
An example would be that E<Euml> C<E WITH DIAERESIS> (203) comes
before E<euml> C<e WITH DIAERESIS> (235) on an ASCII platform, but
the latter (83) comes before the former (115) on an EBCDIC platform.
-(Astute readers will note that the upper case version of E<szlig>
+(Astute readers will note that the uppercase version of E<szlig>
C<SMALL LETTER SHARP S> is simply "SS" and that the upper case version of
E<yuml> C<y WITH DIAERESIS> is not in the 0..255 range but it is
at U+x0178 in Unicode, or C<"\x{178}"> in a Unicode enabled Perl).
=head2 MONO CASE then sort data.
-In order to minimize the expense of mono casing mixed test try to
+In order to minimize the expense of mono casing mixed-case text, try to
C<tr///> towards the character set case most employed within the data.
If the data are primarily UPPERCASE non Latin 1 then apply tr/[a-z]/[A-Z]/
then sort(). If the data are primarily lowercase non Latin 1 then
address the E<yuml> C<y WITH DIAERESIS> character that will remain at
code point 255 on ASCII platforms, but 223 on most EBCDIC platforms
where it will sort to a place less than the EBCDIC numerals. With a
-Unicode enabled Perl you might try:
+Unicode-enabled Perl you might try:
tr/^?/\x{178}/;
=head2 Quoted-Printable encoding and decoding
-On ASCII encoded platforms it is possible to strip characters outside of
+On ASCII-encoded platforms it is possible to strip characters outside of
the printable set using:
# This QP encoder works on ASCII only
$string =~ s/=([0-9A-Fa-f][0-9A-Fa-f])/chr $a2e[hex $1]/ge;
$string =~ s/=[\n\r]+$//;
-=head2 Caesarian ciphers
+=head2 Caesarean ciphers
The practice of shifting an alphabet one or more characters for encipherment
dates back thousands of years and was explicitly detailed by Gaius Julius
Caesar in his B<Gallic Wars> text. A single alphabet shift is sometimes
referred to as a rotation and the shift amount is given as a number $n after
the string 'rot' or "rot$n". Rot0 and rot26 would designate identity maps
-on the 26 letter English version of the Latin alphabet. Rot13 has the
+on the 26-letter English version of the Latin alphabet. Rot13 has the
interesting property that alternate subsequent invocations are identity maps
(thus rot13 is its own non-trivial inverse in the group of 26 alphabet
rotations). Hence the following is a rot13 encoder and decoder that will
To the extent that it is possible to write code that depends on
hashing order there may be differences between hashes as stored
-on an ASCII based platform and hashes stored on an EBCDIC based platform.
+on an ASCII-based platform and hashes stored on an EBCDIC-based platform.
XXX
=head1 I18N AND L10N
-Internationalization(I18N) and localization(L10N) are supported at least
-in principle even on EBCDIC platforms. The details are system dependent
+Internationalization (I18N) and localization (L10N) are supported at least
+in principle even on EBCDIC platforms. The details are system-dependent
and discussed under the L<perlebcdic/OS ISSUES> section below.
-=head1 MULTI OCTET CHARACTER SETS
+=head1 MULTI-OCTET CHARACTER SETS
Perl may work with an internal UTF-EBCDIC encoding form for wide characters
on EBCDIC platforms in a manner analogous to the way that it works with
=head1 OS ISSUES
-There may be a few system dependent issues
+There may be a few system-dependent issues
of concern to EBCDIC Perl programmers.
=head2 OS/400
=item PASE
-The PASE environment is runtime environment for OS/400 that can run
-executables built for PowerPC AIX in OS/400, see L<perlos400>. PASE
+The PASE environment is a runtime environment for OS/400 that can run
+executables built for PowerPC AIX in OS/400; see L<perlos400>. PASE
is ASCII-based, not EBCDIC-based as the ILE.
=item IFS access
=item chcp
B<chcp> is supported as a shell utility for displaying and changing
-one's code page. See also L<chcp>.
+one's code page. See also L<chcp(1)>.
=item dataset access