Nit in perluniintro.pod

diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod

index 8144303..6a8c07d 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -24,7 +24,7 @@ Unicode 1.0 was released in October 1991, and 4.0 in April 2003.
  A Unicode I<character> is an abstract entity.  It is not bound to any
  particular integer width, especially not to the C language C<char>.
  Unicode is language-neutral and display-neutral: it does not encode the
-language of the text and it does not generally define fonts or other graphical
+language of the text, and it does not generally define fonts or other graphical
  layout details.  Unicode operates on characters and on text built from
  those characters.
  
@@ -47,15 +47,15 @@ lowercasing, and collating (sorting) are defined.
  
  A Unicode I<logical> "character" can actually consist of more than one internal
  I<actual> "character" or code point.  For Western languages, this is adequately
-represented by a I<base character> (like C<LATIN CAPITAL LETTER A>), followed
+modelled by a I<base character> (like C<LATIN CAPITAL LETTER A>) followed
  by one or more I<modifiers> (like C<COMBINING ACUTE ACCENT>).  This sequence of
  base character and modifiers is called a I<combining character
  sequence>.  Some non-western languages require more complicated
-representations, so Unicode invented a I<grapheme cluster> and then an
-I<extended grapheme cluster>.  For example, A Korean Hangul syllable is
+models, so Unicode created the I<grapheme cluster> concept, and then the
+I<extended grapheme cluster>.  For example, a Korean Hangul syllable is
  considered a single logical character, but most often consists of three actual
-characters: a leading consonant followed by an interior vowel followed by a
-trailing consonant.
+Unicode characters: a leading consonant followed by an interior vowel followed
+by a trailing consonant.
  
  Whether to call these extended grapheme clusters "characters" depends on your
  point of view. If you are a programmer, you probably would tend towards seeing
@@ -66,7 +66,7 @@ that's probably what it looks like in the context of the user's language.
  With this "whole sequence" view of characters, the total number of
  characters is open-ended. But in the programmer's "one unit is one
  character" point of view, the concept of "characters" is more
-deterministic.  In this document, we take that second  point of view:
+deterministic.  In this document, we take that second point of view:
  one "character" is one Unicode code point.
  
  For some combinations, there are I<precomposed> characters.
@@ -83,12 +83,12 @@ Because of backward compatibility with legacy encodings, the "a unique
  number for every character" idea breaks down a bit: instead, there is
  "at least one number for every character".  The same character could
  be represented differently in several legacy encodings.  The
-converse is also not true: some code points do not have an assigned
+converse is not also true: some code points do not have an assigned
  character.  Firstly, there are unallocated code points within
  otherwise used blocks.  Secondly, there are special Unicode control
  characters that do not represent true characters.
  
-A common myth about Unicode is that it would be "16-bit", that is,
+A common myth about Unicode is that it is "16-bit", that is,
  Unicode is only represented as C<0x10000> (or 65536) characters from
  C<0x0000> to C<0xFFFF>.  B<This is untrue.>  Since Unicode 2.0 (July
  1996), Unicode has been defined all the way up to 21 bits (C<0x10FFFF>),
@@ -98,22 +98,22 @@ I<Plane 0>, or the I<Basic Multilingual Plane> (BMP).  With Unicode
  3.1, 17 (yes, seventeen) planes in all were defined--but they are
  nowhere near full of defined characters, yet.
  
-Another myth is that the 256-character blocks have something to
+Another myth is about Unicode blocks--that they have something to
  do with languages--that each block would define the characters used
  by a language or a set of languages.  B<This is also untrue.>
  The division into blocks exists, but it is almost completely
  accidental--an artifact of how the characters have been and
-still are allocated.  Instead, there is a concept called I<scripts>,
-which is more useful: there is C<Latin> script, C<Greek> script, and
-so on.  Scripts usually span varied parts of several blocks.
-For further information see L<Unicode::UCD>.
+still are allocated.  Instead, there is a concept called I<scripts>, which is
+more useful: there is C<Latin> script, C<Greek> script, and so on.  Scripts
+usually span varied parts of several blocks.  For more information about
+scripts, see L<perlunicode/Scripts>.
  
  The Unicode code points are just abstract numbers.  To input and
  output these abstract numbers, the numbers must be I<encoded> or
  I<serialised> somehow.  Unicode defines several I<character encoding
  forms>, of which I<UTF-8> is perhaps the most popular.  UTF-8 is a
  variable length encoding that encodes Unicode characters as 1 to 6
-bytes (only 4 with the currently defined characters).  Other encodings
+bytes.  Other encodings
  include UTF-16 and UTF-32 and their big- and little-endian variants
  (UTF-8 is byte-order independent) The ISO/IEC 10646 defines the UCS-2
  and UCS-4 encoding forms.
@@ -144,8 +144,8 @@ scripts with legacy 8-bit data in them would break.  See L<utf8>.
  Perl supports both pre-5.6 strings of eight-bit native bytes, and
  strings of Unicode characters.  The principle is that Perl tries to
  keep its data as eight-bit bytes for as long as possible, but as soon
-as Unicodeness cannot be avoided, the data is transparently upgraded
-to Unicode.
+as Unicodeness cannot be avoided, the data is (mostly) transparently upgraded
+to Unicode.  There are some problems--see L<perlunicode/The "Unicode Bug">.
  
  Internally, Perl currently uses either whatever the native eight-bit
  character set of the platform (for example Latin-1) is, defaulting to
@@ -155,8 +155,8 @@ character set.  Otherwise, it uses UTF-8.
  
  A user of Perl does not normally need to know nor care how Perl
  happens to encode its internal strings, but it becomes relevant when
-outputting Unicode strings to a stream without a PerlIO layer -- one with
-the "default" encoding.  In such a case, the raw bytes used internally
+outputting Unicode strings to a stream without a PerlIO layer (one with
+the "default" encoding).  In such a case, the raw bytes used internally
  (the native character set or UTF-8, as appropriate for each string)
  will be used, and a "Wide character" warning will be issued if those
  strings contain a character beyond 0x00FF.
@@ -196,11 +196,12 @@ C<useperlio=define>.
  
  Perl 5.8.0 also supports Unicode on EBCDIC platforms.  There,
  Unicode support is somewhat more complex to implement since
-additional conversions are needed at every step.  Some problems
-remain, see L<perlebcdic> for details.
+additional conversions are needed at every step.
+
+Later Perl releases have added code that will not work on EBCDIC platforms, and
+no one has complained, so the divergence has continued.  If you want to run
+Perl on an EBCDIC platform, send email to perlbug@perl.org
  
-In any case, the Unicode support on EBCDIC platforms is better than
-in the 5.6 series, which didn't work much at all for EBCDIC platform.
  On EBCDIC platforms, the internal Unicode encoding form is UTF-EBCDIC
  instead of UTF-8.  The difference is that as UTF-8 is "ASCII-safe" in
  that ASCII characters encode to UTF-8 as-is, while UTF-EBCDIC is
@@ -247,7 +248,7 @@ characters:
  
  Note that both C<\x{...}> and C<\N{...}> are compile-time string
  constants: you cannot use variables in them.  if you want similar
-run-time functionality, use C<chr()> and C<charnames::vianame()>.
+run-time functionality, use C<chr()> and C<charnames::string_vianame()>.
  
  If you want to force the result to Unicode characters, use the special
  C<"U0"> prefix.  It consumes no arguments but causes the following bytes
@@ -343,7 +344,8 @@ layer when opening files
  The I/O layers can also be specified more flexibly with
  the C<open> pragma.  See L<open>, or look at the following example.
  
-    use open ':encoding(utf8)'; # input/output default encoding will be UTF-8
+    use open ':encoding(utf8)'; # input/output default encoding will be
+                                # UTF-8
      open X, ">file";
      print X chr(0x100), "\n";
      close X;
@@ -354,7 +356,8 @@ the C<open> pragma.  See L<open>, or look at the following example.
  With the C<open> pragma you can use the C<:locale> layer
  
      BEGIN { $ENV{LC_ALL} = $ENV{LANG} = 'ru_RU.KOI8-R' }
-    # the :locale will probe the locale environment variables like LC_ALL
+    # the :locale will probe the locale environment variables like
+    # LC_ALL
      use open OUT => ':locale'; # russki parusski
      open(O, ">koi8");
      print O chr(0x430); # Unicode CYRILLIC SMALL LETTER A = KOI8-R 0xc1
@@ -431,13 +434,13 @@ its argument so that Unicode characters with code points greater than
  255 are displayed as C<\x{...}>, control characters (like C<\n>) are
  displayed as C<\x..>, and the rest of the characters as themselves:
  
-   sub nice_string {
-       join("",
-         map { $_ > 255 ?                  # if wide character...
-               sprintf("\\x{%04X}", $_) :  # \x{...}
-               chr($_) =~ /[[:cntrl:]]/ ?  # else if control character ...
-               sprintf("\\x%02X", $_) :    # \x..
-               quotemeta(chr($_))          # else quoted or as themselves
+ sub nice_string {
+     join("",
+       map { $_ > 255 ?                  # if wide character...
+              sprintf("\\x{%04X}", $_) :  # \x{...}
+              chr($_) =~ /[[:cntrl:]]/ ?  # else if control character ...
+              sprintf("\\x%02X", $_) :    # \x..
+              quotemeta(chr($_))          # else quoted or as themselves
           } unpack("W*", $_[0]));           # unpack Unicode characters
     }
  
@@ -515,13 +518,12 @@ case, the answer is no (because 0x00C1 != 0x0041).  But sometimes, any
  CAPITAL LETTER As should be considered equal, or even As of any case.
  
  The long answer is that you need to consider character normalization
-and casing issues: see L<Unicode::Normalize>, Unicode Technical
-Reports #15 and #21, I<Unicode Normalization Forms> and I<Case
-Mappings>, L<http://www.unicode.org/unicode/reports/tr15/> and
-L<http://www.unicode.org/unicode/reports/tr21/>
+and casing issues: see L<Unicode::Normalize>, Unicode Technical Report #15,
+L<Unicode Normalization Forms|http://www.unicode.org/unicode/reports/tr15> and
+sections on case mapping in the L<Unicode Standard|http://www.unicode.org>.
  
  As of Perl 5.8.0, the "Full" case-folding of I<Case
-Mappings/SpecialCasing> is implemented.
+Mappings/SpecialCasing> is implemented, but bugs remain in C<qr//i> with them.
  
  =item *
  
@@ -553,19 +555,19 @@ L<http://www.unicode.org/unicode/reports/tr10/>
  
  Character Ranges and Classes
  
-Character ranges in regular expression character classes (C</[a-z]/>)
-and in the C<tr///> (also known as C<y///>) operator are not magically
-Unicode-aware.  What this means is that C<[A-Za-z]> will not magically start
-to mean "all alphabetic letters"; not that it does mean that even for
-8-bit characters, you should be using C</[[:alpha:]]/> in that case.
+Character ranges in regular expression bracketed character classes ( e.g.,
+C</[a-z]/>) and in the C<tr///> (also known as C<y///>) operator are not
+magically Unicode-aware.  What this means is that C<[A-Za-z]> will not
+magically start to mean "all alphabetic letters" (not that it does mean that
+even for 8-bit characters; for those, if you are using locales (L<perllocale>),
+use C</[[:alpha:]]/>; and if not, use the 8-bit-aware property C<\p{alpha}>).
+
+All the properties that begin with C<\p> (and its inverse C<\P>) are actually
+character classes that are Unicode-aware.  There are dozens of them, see
+L<perluniprops>.
  
-For specifying character classes like that in regular expressions,
-you can use the various Unicode properties--C<\pL>, or perhaps
-C<\p{Alphabetic}>, in this particular case.  You can use Unicode
-code points as the end points of character ranges, but there is no
-magic associated with specifying a certain range.  For further
-information--there are dozens of Unicode character classes--see
-L<perlunicode>.
+You can use Unicode code points as the end points of character ranges, and the
+range will include all Unicode code points that lie between those end points.
  
  =item *
  
@@ -607,9 +609,9 @@ Unicode; for that, see the earlier I/O discussion.
  How Do I Know Whether My String Is In Unicode?
  
  You shouldn't have to care.  But you may, because currently the semantics of the
-characters whose ordinals are in the range 128 to 255 is different depending on
+characters whose ordinals are in the range 128 to 255 are different depending on
  whether the string they are contained within is in Unicode or not.
-(See L<perlunicode>.) 
+(See L<perlunicode/When Unicode Does Not Happen>.)
  
  To determine if a string is in Unicode, use:
  
@@ -622,8 +624,8 @@ string has any characters at all.  All the C<is_utf8()> does is to
  return the value of the internal "utf8ness" flag attached to the
  C<$string>.  If the flag is off, the bytes in the scalar are interpreted
  as a single byte encoding.  If the flag is on, the bytes in the scalar
-are interpreted as the (multi-byte, variable-length) UTF-8 encoded code
-points of the characters.  Bytes added to an UTF-8 encoded string are
+are interpreted as the (variable-length, potentially multi-byte) UTF-8 encoded
+code points of the characters.  Bytes added to a UTF-8 encoded string are
  automatically upgraded to UTF-8.  If mixed non-UTF-8 and UTF-8 scalars
  are merged (double-quoted interpolation, explicit concatenation, and
  printf/sprintf parameter substitution), the result will be UTF-8 encoded
@@ -648,6 +650,7 @@ the C<length()> function:
      use bytes;
      print length($unicode), "\n"; # will also print 2
                                    # (the 0xC4 0x80 of the UTF-8)
+    no bytes;
  
  =item *
  
@@ -730,11 +733,11 @@ or:
  
  You can find the bytes that make up a UTF-8 sequence with
  
-       @bytes = unpack("C*", $Unicode_string)
+    @bytes = unpack("C*", $Unicode_string)
  
  and you can create well-formed Unicode with
  
-       $Unicode_string = pack("U*", 0xff, ...)
+    $Unicode_string = pack("U*", 0xff, ...)
  
  =item *