Early Perl versions worked on some EBCDIC machines, but the last known
version that ran on EBCDIC was v5.8.7, until v5.22, when the Perl core
again works on z/OS. Theoretically, it could work on OS/400 or Siemens'
-BS2000 (or their successors), but this is untested. In v5.22, not all
+BS2000 (or their successors), but this is untested. In v5.22 and v5.24,
+not all
the modules found on CPAN but shipped with core Perl work on z/OS.
If you want to use Perl on a non-z/OS EBCDIC machine, please let us know
an "A", or C<\xDF> to mean a "E<yuml>" (small C<"y"> with a diaeresis),
then your code may well work on your EBCDIC platform, but not on an
ASCII one. That's fine to do if no one will ever want to run your code
-on an ASCII platform; but the bias in this document will be in writing
+on an ASCII platform; but the bias in this document will be towards writing
code portable between EBCDIC and ASCII systems. Again, if every
character you care about is easily enterable from your keyboard, you
don't have to know anything about ASCII, but many keyboards don't easily
automatically translate it to C<\xDF> on your platform, and leave it as
C<\xFF> on ASCII ones. Or you could specify it by name, C<\N{LATIN
SMALL LETTER Y WITH DIAERESIS}> and not have to know the numbers.
-Either way works, but require familiarity with Unicode.
+Either way works, but both require familiarity with Unicode.
=head1 COMMON CHARACTER CODE SETS
integers running from 0 to 127 (decimal) that have standardized
interpretations by the computers which use ASCII. For example, 65 means
the letter "A".
-The range 0..127 can be covered by setting the bits in a 7-bit binary
+The range 0..127 can be covered by setting various bits in a 7-bit binary
number, hence the set is sometimes referred to as "7-bit ASCII".
ASCII was described by the American National Standards Institute
document ANSI X3.4-1986. It was also described by ISO 646:1991
characters (i.e. ISO 8859-1) to an EBCDIC set. 0037 is used
in North American English locales on the OS/400 operating system
that runs on AS/400 computers. CCSID 0037 differs from ISO 8859-1
-in 237 places; in other words they agree on only 19 code point values.
+in 236 places; in other words they agree on only 20 code point values.
=item B<1047>
Character code set ID 1047 is also a mapping of the ASCII plus
Latin-1 characters (i.e. ISO 8859-1) to an EBCDIC set. 1047 is
used under Unix System Services for OS/390 or z/OS, and OpenEdition
-for VM/ESA. CCSID 1047 differs from CCSID 0037 in eight places.
+for VM/ESA. CCSID 1047 differs from CCSID 0037 in eight places,
+and from ISO 8859-1 in 236.
=item B<POSIX-BC>
The EBCDIC code page in use on Siemens' BS2000 system is distinct from
1047 and 0037. It is identified below as the POSIX-BC set.
+Like 0037 and 1047, it is the same as ISO 8859-1 in 20 code point
+values.
=back
which have ASCII equivalents, plus those that correspond to
the C1 controls (128 - 159 on ASCII platforms).)
-A string encoded in UTF-EBCDIC may be longer (but never shorter) than
-one encoded in UTF-8. Perl extends UTF-8 so that it can encode code
-points above the Unicode maximum of U+10FFFF. It extends UTF-EBCDIC as
-well, but due to the inherent limitations in UTF-EBCDIC, the maximum
-code point expressible is U+7FFF_FFFF, even if the word size is more
-than 32 bits.
+A string encoded in UTF-EBCDIC may be longer (very rarely shorter) than
+one encoded in UTF-8. Perl extends both UTF-8 and UTF-EBCDIC so that
+they can encode code points above the Unicode maximum of U+10FFFF. Both
+extensions are constructed to allow encoding of any code point that fits
+in a 64-bit word.
UTF-EBCDIC is defined by
-L<Unicode Technical Report #16|http://www.unicode.org/reports/tr16>.
+L<Unicode Technical Report #16|http://www.unicode.org/reports/tr16>
+(often referred to as just TR16).
It is defined based on CCSID 1047, not allowing for the differences for
other code pages. This allows for easy interchange of text between
computers running different code pages, but makes it unusable, without
version of Perl's UTF-EBCDIC has to be translated to be intelligible to
a computer running another.
+TR16 implies a method to extend UTF-EBCDIC to encode code points up
+through S<C<2 ** 31 - 1>>. Perl uses this method for code points up
+through S<C<2 ** 30 - 1>>, but uses an incompatible method for larger
+ones, so that it can handle much larger code points than it otherwise
+could.
+
=head2 Using Encode
Starting from Perl 5.8 you can use the standard module Encode
$CAPITAL_LETTER_A = chr(193);
-The largest code point that is representable in UTF-EBCDIC is
-U+7FFF_FFFF. If you do C<chr()> on a larger value, a runtime error
-(similar to division by 0) will happen.
-
=item C<ord()>
C<ord()> will return EBCDIC code number values on an EBCDIC platform.
will hold.
-The largest code point that is representable in UTF-EBCDIC is
-U+7FFF_FFFF. If you try to pack a larger value into a character, a
-runtime error (similar to division by 0) will happen.
-
=item C<print()>
One must be careful with scalars and strings that are passed to
Internationalization (I18N) and localization (L10N) are supported at least
in principle even on EBCDIC platforms. The details are system-dependent
-and discussed under the L<OS ISSUES> section below.
+and discussed under the L</OS ISSUES> section below.
=head1 MULTI-OCTET CHARACTER SETS
=item *
-There are some bugs in the C<pack>/C<unpack> C<"U0"> template
-
-=item *
-
There are a significant number of test failures in the CPAN modules
-shipped with Perl v5.22. These are only in modules not primarily
+shipped with Perl v5.22 and v5.24. These are only in modules not primarily
maintained by Perl 5 porters. Some of these are failures in the tests
only: they don't realize that it is proper to get different results on
EBCDIC platforms. And some of the failures are real bugs. If you
compile and do a C<make test> on Perl, all tests on the C</cpan>
directory are skipped.
-In particular, the extensions L<Unicode::Collate> and
-L<Unicode::Normalize> are not supported under EBCDIC; likewise for the
-(now deprecated) L<encoding> pragma.
+In particular, the (now deprecated) L<encoding> pragma is not supported
+under EBCDIC.
L<Encode> partially works.
+=item *
+
+In earlier Perl versions, when byte and character data were
+concatenated, the new string was sometimes created by
+decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
+old Unicode string used EBCDIC.
+
=back
=head1 SEE ALSO