perldelta for #47047 / 1de22db27a

diff --git a/pod/perlebcdic.pod b/pod/perlebcdic.pod

index b7e69f8..6dd8e10 100644
--- a/pod/perlebcdic.pod
+++ b/pod/perlebcdic.pod
@@ -14,7 +14,8 @@ Portions of this document that are still incomplete are marked with XXX.
  Early Perl versions worked on some EBCDIC machines, but the last known
  version that ran on EBCDIC was v5.8.7, until v5.22, when the Perl core
  again works on z/OS.  Theoretically, it could work on OS/400 or Siemens'
-BS2000  (or their successors), but this is untested.  In v5.22, not all
+BS2000  (or their successors), but this is untested.  In v5.22 and 5.24,
+not all
  the modules found on CPAN but shipped with core Perl work on z/OS.
  
  If you want to use Perl on a non-z/OS EBCDIC machine, please let us know
@@ -40,7 +41,7 @@ But if you write code that uses C<\005> to mean a TAB or C<\xC1> to mean
  an "A", or C<\xDF> to mean a "E<yuml>" (small C<"y"> with a diaeresis),
  then your code may well work on your EBCDIC platform, but not on an
  ASCII one.  That's fine to do if no one will ever want to run your code
-on an ASCII platform; but the bias in this document will be in writing
+on an ASCII platform; but the bias in this document will be towards writing
  code portable between EBCDIC and ASCII systems.  Again, if every
  character you care about is easily enterable from your keyboard, you
  don't have to know anything about ASCII, but many keyboards don't easily
@@ -52,7 +53,7 @@ you can instead specify it as C<"\N{U+FF}">, and have the computer
  automatically translate it to C<\xDF> on your platform, and leave it as
  C<\xFF> on ASCII ones.  Or you could specify it by name, C<\N{LATIN
  SMALL LETTER Y WITH DIAERESIS> and not have to know the  numbers.
-Either way works, but require familiarity with Unicode.
+Either way works, but both require familiarity with Unicode.
  
  =head1 COMMON CHARACTER CODE SETS
  
@@ -63,7 +64,7 @@ US-ASCII) is a set of
  integers running from 0 to 127 (decimal) that have standardized
  interpretations by the computers which use ASCII.  For example, 65 means
  the letter "A".
-The range 0..127 can be covered by setting the bits in a 7-bit binary
+The range 0..127 can be covered by setting various bits in a 7-bit binary
  digit, hence the set is sometimes referred to as "7-bit ASCII".
  ASCII was described by the American National Standards Institute
  document ANSI X3.4-1986.  It was also described by ISO 646:1991
@@ -152,19 +153,22 @@ Character code set ID 0037 is a mapping of the ASCII plus Latin-1
  characters (i.e. ISO 8859-1) to an EBCDIC set.  0037 is used
  in North American English locales on the OS/400 operating system
  that runs on AS/400 computers.  CCSID 0037 differs from ISO 8859-1
-in 237 places; in other words they agree on only 19 code point values.
+in 236 places; in other words they agree on only 20 code point values.
  
  =item B<1047>
  
  Character code set ID 1047 is also a mapping of the ASCII plus
  Latin-1 characters (i.e. ISO 8859-1) to an EBCDIC set.  1047 is
  used under Unix System Services for OS/390 or z/OS, and OpenEdition
-for VM/ESA.  CCSID 1047 differs from CCSID 0037 in eight places.
+for VM/ESA.  CCSID 1047 differs from CCSID 0037 in eight places,
+and from ISO 8859-1 in 236.
  
  =item B<POSIX-BC>
  
  The EBCDIC code page in use on Siemens' BS2000 system is distinct from
  1047 and 0037.  It is identified below as the POSIX-BC set.
+Like 0037 and 1047, it is the same as ISO 8859-1 in 20 code point
+values.
  
  =back
  
@@ -240,15 +244,15 @@ In UTF-EBCDIC, there are 160 invariant characters.
  which have ASCII equivalents, plus those that correspond to
  the C1 controls (128 - 159 on ASCII platforms).)
  
-A string encoded in UTF-EBCDIC may be longer (but never shorter) than
-one encoded in UTF-8.  Perl extends UTF-8 so that it can encode code
-points above the Unicode maximum of U+10FFFF.  It extends UTF-EBCDIC as
-well, but due to the inherent limitations in UTF-EBCDIC, the maximum
-code point expressible is U+7FFF_FFFF, even if the word size is more
-than 32 bits.
+A string encoded in UTF-EBCDIC may be longer (very rarely shorter) than
+one encoded in UTF-8.  Perl extends both UTF-8 and UTF-EBCDIC so that
+they can encode code points above the Unicode maximum of U+10FFFF.  Both
+extensions are constructed to allow encoding of any code point that fits
+in a 64-bit word.
  
  UTF-EBCDIC is defined by
-L<Unicode Technical Report #16|http://www.unicode.org/reports/tr16>.
+L<Unicode Technical Report #16|http://www.unicode.org/reports/tr16>
+(often referred to as just TR16).
  It is defined based on CCSID 1047, not allowing for the differences for
  other code pages.  This allows for easy interchange of text between
  computers running different code pages, but makes it unusable, without
@@ -265,6 +269,11 @@ invariant.  This means that text generated on a computer running one
  version of Perl's UTF-EBCDIC has to be translated to be intelligible to
  a computer running another.
  
+TR16 implies a method to extend UTF-EBCDIC to encode points up through
+S<C<2 ** 31 - 1>>.  Perl uses this method for code points up through
+S<C<2 ** 30 - 1>>, but uses an incompatible method for larger ones, to
+enable it to handle much larger code points than otherwise.
+
  =head2 Using Encode
  
  Starting from Perl 5.8 you can use the standard module Encode
@@ -1223,10 +1232,6 @@ character return value on an EBCDIC platform.  For example:
  
      $CAPITAL_LETTER_A = chr(193);
  
-The largest code point that is representable in UTF-EBCDIC is
-U+7FFF_FFFF.  If you do C<chr()> on a larger value, a runtime error
-(similar to division by 0) will happen.
-
  =item C<ord()>
  
  C<ord()> will return EBCDIC code number values on an EBCDIC platform.
@@ -1261,10 +1266,6 @@ is true on all platforms.  If you want native code points for the low
  
  will hold.
  
-The largest code point that is representable in UTF-EBCDIC is
-U+7FFF_FFFF.  If you try to pack a larger value into a character, a
-runtime error (similar to division by 0) will happen.
-
  =item C<print()>
  
  One must be careful with scalars and strings that are passed to
@@ -1751,7 +1752,7 @@ and vice versa.
  
  Internationalization (I18N) and localization (L10N) are supported at least
  in principle even on EBCDIC platforms.  The details are system-dependent
-and discussed under the L<OS ISSUES> section below.
+and discussed under the L</OS ISSUES> section below.
  
  =head1 MULTI-OCTET CHARACTER SETS
  
@@ -1846,24 +1847,26 @@ seem to imply.
  
  =item *
  
-There are some bugs in the C<pack>/C<unpack> C<"U0"> template
-
-=item *
-
  There are a significant number of test failures in the CPAN modules
-shipped with Perl v5.22.  These are only in modules not primarily
+shipped with Perl v5.22 and 5.24.  These are only in modules not primarily
  maintained by Perl 5 porters.  Some of these are failures in the tests
  only: they don't realize that it is proper to get different results on
  EBCDIC platforms.  And some of the failures are real bugs.  If you
  compile and do a C<make test> on Perl, all tests on the C</cpan>
  directory are skipped.
  
-In particular, the extensions L<Unicode::Collate> and
-L<Unicode::Normalize> are not supported under EBCDIC; likewise for the
-(now deprecated) L<encoding> pragma.
+In particular, the (now deprecated) L<encoding> pragma is not supported
+under EBCDIC.
  
  L<Encode> partially works.
  
+=item *
+
+In earlier Perl versions, when byte and character data were
+concatenated, the new string was sometimes created by
+decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
+old Unicode string used EBCDIC.
+
  =back
  
  =head1 SEE ALSO
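
The context around the second and third hunks recommends writing a character such as "E<yuml>" as C<"\N{U+FF}"> or by name, instead of hard-coding C<\xDF> or C<\xFF>. A minimal sketch of that advice follows. It is not part of the patch. The variable names are illustrative, and it assumes Perl 5.16 or later so that C<\N{...}> names load without an explicit C<use charnames>.

```perl
use strict;
use warnings;

# Three ways to write LATIN SMALL LETTER Y WITH DIAERESIS without
# hard-coding a platform-specific byte value.
# Assumption: Perl 5.16+, where \N{NAME} autoloads charnames.
my $by_code   = "\N{U+FF}";
my $by_name   = "\N{LATIN SMALL LETTER Y WITH DIAERESIS}";

# utf8::unicode_to_native() maps a Unicode code point to the
# platform's native code point (a no-op on ASCII platforms).
my $by_native = chr utf8::unicode_to_native(0xFF);

my $all_same = ($by_code eq $by_name && $by_code eq $by_native) ? 1 : 0;
print "all three spellings agree: ", ($all_same ? "yes" : "no"), "\n";

# The native ordinal differs by platform; the source text above
# gives \xFF for ASCII platforms and \xDF for the EBCDIC case it describes.
printf "native ordinal: 0x%02X\n", ord $by_code;
```

All three spellings should compare equal on any platform; only the value printed on the last line is expected to vary with the native code page.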