EBCDIC has the Unicode bug too

author Karl Williamson <public@khwilliamson.com>

Tue, 12 Mar 2013 03:13:38 +0000 (21:13 -0600)

committer Karl Williamson <public@khwilliamson.com>

Tue, 12 Mar 2013 03:21:03 +0000 (21:21 -0600)
diff --git a/autodoc.pl b/autodoc.pl
index 3b39696..925f2f5 100644
--- a/autodoc.pl
+++ b/autodoc.pl
@@ -415,11 +415,6 @@ But the ordinals of characters differ between ASCII, EBCDIC, and
  the UTF- encodings, and a string encoded in UTF-EBCDIC may occupy more bytes
  than in UTF-8.
  
-Also, on some EBCDIC machines, functions that are documented as operating on
-US-ASCII (or Basic Latin in Unicode terminology) may in fact operate on all
-256 characters in the EBCDIC range, not just the subset corresponding to
-US-ASCII.
-
  The listing below is alphabetical, case insensitive.
  
  _EOB_
diff --git a/handy.h b/handy.h
index a65523e..a969d1a 100644
--- a/handy.h
+++ b/handy.h
@@ -489,19 +489,15 @@ Perl rules.  If the input is a number that doesn't fit in an octet, FALSE is
  always returned.
  
  Variant C<isFOO_A> (e.g., C<isALPHA_A()>) will return TRUE only if the input is
-also in the ASCII character set.  For ASCII platforms, the base function with
-no suffix and the one with the C<_A> suffix are identical.  On EBCDIC
-platforms, the C<_A> suffix function will not return true unless the specified
-character also has an ASCII equivalent.
-
-Variant C<isFOO_L1> operates on the full Latin1 character set.  For EBCDIC
-platforms, the base function with no suffix and the one with the C<_L1> suffix
-are identical.  For ASCII platforms, the C<_L1> suffix imposes the Latin-1
-character set onto the platform.  That is, the code points that are ASCII are
-unaffected, since ASCII is a subset of Latin-1.  But the non-ASCII code points
-are treated as if they are Latin-1 characters.  For example, C<isSPACE_L1()>
-will return true when called with the code point 0xA0, which is the Latin-1
-NO-BREAK SPACE.
+also in the ASCII character set.  The base function with no suffix and the one
+with the C<_A> suffix are identical.
+
+Variant C<isFOO_L1> imposes the Latin-1 (or EBCDIC equivlalent) character set
+onto the platform.  That is, the code points that are ASCII are unaffected,
+since ASCII is a subset of Latin-1.  But the non-ASCII code points are treated
+as if they are Latin-1 characters.  For example, C<isWORDCHAR_L1()> will return
+true when called with the code point 0xDF, which is a word character in both
+ASCII and EBCDIC (though it represent different characters in each).
  
  Variant C<isFOO_uni> is like the C<isFOO_L1> variant, but accepts any UV code
  point as input.  If the code point is larger than 255, Unicode rules are used
diff --git a/pod/perlfunc.pod b/pod/perlfunc.pod
index 2c1e5f4..468b6b0 100644
--- a/pod/perlfunc.pod
+++ b/pod/perlfunc.pod
@@ -3287,19 +3287,9 @@ What gets returned depends on several factors:
  
  =item If C<use bytes> is in effect:
  
-=over
-
-=item On EBCDIC platforms
-
-The results are what the C language system call C<tolower()> returns.
-
-=item On ASCII platforms
-
  The results follow ASCII semantics.  Only characters C<A-Z> change, to C<a-z>
  respectively.
  
-=back
-
  =item Otherwise, if C<use locale> (but not C<use locale ':not_characters'>) is in effect:
  
  Respects current LC_CTYPE locale for code points < 256; and uses Unicode
@@ -3326,21 +3316,11 @@ Unicode semantics are used for the case change.
  
  =item Otherwise:
  
-=over
-
-=item On EBCDIC platforms
-
-The results are what the C language system call C<tolower()> returns.
-
-=item On ASCII platforms
-
  ASCII semantics are used for the case change.  The lowercase of any character
  outside the ASCII range is the character itself.
  
  =back
  
-=back
-
  =item lcfirst EXPR
  X<lcfirst> X<lowercase>
  
diff --git a/pod/perlre.pod b/pod/perlre.pod
index 60173d6..80aa306 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -272,14 +272,6 @@ presenting another potential security issue.  See
  L<http://unicode.org/reports/tr36> for a detailed discussion of Unicode
  security issues.
  
-On the EBCDIC platforms that Perl handles, the native character set is
-equivalent to Latin-1.  Thus this modifier changes behavior only when
-the C<"/i"> modifier is also specified, and it turns out it affects only
-two characters, giving them full Unicode semantics: the C<MICRO SIGN>
-will match the Greek capital and small letters C<MU>, otherwise not; and
-the C<LATIN CAPITAL LETTER SHARP S> will match any of C<SS>, C<Ss>,
-C<sS>, and C<ss>, otherwise not.
-
  This modifier may be specified to be the default by C<use feature
  'unicode_strings>, C<use locale ':not_characters'>, or
  C<L<use 5.012|perlfunc/use VERSION>> (or higher),
@@ -326,8 +318,8 @@ results.  See L<perlunicode/The "Unicode Bug">.  The Unicode Bug has
  become rather infamous, leading to yet another (printable) name for this
  modifier, "Dodgy".
  
-On ASCII platforms, the native rules are ASCII, and on EBCDIC platforms
-(at least the ones that Perl handles), they are Latin-1.
+Unless the pattern or string are encoded in UTF-8, only ASCII characters
+can match positively.
  
  Here are some examples of how that works on an ASCII platform:
  
diff --git a/pod/perlrecharclass.pod b/pod/perlrecharclass.pod
index 681cd06..2611618 100644
--- a/pod/perlrecharclass.pod
+++ b/pod/perlrecharclass.pod
@@ -174,7 +174,7 @@ are generally used to add auxiliary markings to letters.
  C<\w> matches the platform's native underscore character plus whatever
  the locale considers to be alphanumeric.
  
-=item if Unicode rules are in effect or if on an EBCDIC platform ...
+=item if Unicode rules are in effect ...
  
  C<\w> matches exactly what C<\p{Word}> matches.
  
@@ -232,7 +232,7 @@ in the table below.
  
  C<\s> matches whatever the locale considers to be whitespace.
  
-=item if Unicode rules are in effect or if on an EBCDIC platform ...
+=item if Unicode rules are in effect ...
  
  C<\s> matches exactly the characters shown with an "s" column in the
  table below.
@@ -289,8 +289,8 @@ C<\s>, C<\h> and C<\v> as of Unicode 6.0.
  
  The first column gives the Unicode code point of the character (in hex format),
  the second column gives the (Unicode) name. The third column indicates
-by which class(es) the character is matched (assuming no locale or EBCDIC code
-page is in effect that changes the C<\s> matching).
+by which class(es) the character is matched (assuming no locale is in
+effect that changes the C<\s> matching).
  
   0x0009        CHARACTER TABULATION   h s
   0x000a              LINE FEED (LF)    vs
@@ -732,10 +732,6 @@ the terminal somehow: for example, newline and backspace are control characters.
  In the ASCII range, characters whose code points are between 0 and 31 inclusive,
  plus 127 (C<DEL>) are control characters.
  
-On EBCDIC platforms, it is likely that the code page will define C<[[:cntrl:]]>
-to be the EBCDIC equivalents of the ASCII controls, plus the controls
-that in Unicode have code points from 128 through 159.
-
  =item [3]
  
  Any character that is I<graphical>, that is, visible. This class consists
@@ -815,7 +811,7 @@ The POSIX class matches according to the locale, except that
  C<word> uses the platform's native underscore character, no matter what
  the locale is.
  
-=item if Unicode rules are in effect or if on an EBCDIC platform ...
+=item if Unicode rules are in effect ...
  
  The POSIX class matches the same as the Full-range counterpart.
  
@@ -834,7 +830,7 @@ L<perlre/Which character set modifier is in effect?>.
  
  It is proposed to change this behavior in a future release of Perl so that
  whether or not Unicode rules are in effect would not change the
-behavior:  Outside of locale or an EBCDIC code page, the POSIX classes
+behavior:  Outside of locale, the POSIX classes
  would behave like their ASCII-range counterparts.  If you wish to
  comment on this proposal, send email to C<perl5-porters@perl.org>.
  
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 7a0b915..7a98285 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -98,13 +98,8 @@ while C<use locale ':not_characters'> effectively also selects
  C<use feature 'unicode_strings'> in its scope; see L<perllocale>.)
  Otherwise, Perl uses the platform's native
  byte semantics for characters whose code points are less than 256, and
-Unicode semantics for those greater than 255.  On EBCDIC platforms, this
-is almost seamless, as the EBCDIC code pages that Perl handles are
-equivalent to Unicode's first 256 code points.  (The exception is that
-EBCDIC regular expression case-insensitive matching rules are not as
-as robust as Unicode's.)   But on ASCII platforms, Perl uses US-ASCII
-(or Basic Latin in Unicode terminology) byte semantics, meaning that characters
-whose ordinal numbers are in the range 128 - 255 are undefined except for their
+Unicode semantics for those greater than 255.  That means that non-ASCII
+characters are undefined except for their
  ordinal numbers.  This means that none have case (upper and lower), nor are any
  a member of character classes, like C<[:alpha:]> or C<\w>.  (But all do belong
  to the C<\W> class or the Perl regular expression extension C<[:^alpha:]>.)
diff --git a/pod/perlunifaq.pod b/pod/perlunifaq.pod
index ca3a180..f952d1a 100644
--- a/pod/perlunifaq.pod
+++ b/pod/perlunifaq.pod
@@ -149,8 +149,8 @@ rely on the way things worked before Unicode came along.  Those older
  programs knew only about the ASCII character set, and so may not work
  properly for additional characters.  When a string is encoded in UTF-8,
  Perl assumes that the program is prepared to deal with Unicode, but when
-the string isn't, Perl assumes that only ASCII (unless it is an EBCDIC
-platform) is wanted, and so those characters that are not ASCII
+the string isn't, Perl assumes that only ASCII
+is wanted, and so those characters that are not ASCII
  characters aren't recognized as to what they would be in Unicode.
  C<use feature 'unicode_strings'> tells Perl to treat all characters as
  Unicode, whether the string is encoded in UTF-8 or not, thus avoiding