perl.exp was not built in time on systems that required it (AIX, ...)

diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod

index eadcedd..36f729c 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -24,7 +24,7 @@ Unicode 1.0 was released in October 1991, and 4.0 in April 2003.
  A Unicode I<character> is an abstract entity.  It is not bound to any
  particular integer width, especially not to the C language C<char>.
  Unicode is language-neutral and display-neutral: it does not encode the
-language of the text and it does not define fonts or other graphical
+language of the text and it does not generally define fonts or other graphical
  layout details.  Unicode operates on characters and on text built from
  those characters.
  
@@ -125,8 +125,7 @@ serious Unicode work.  The maintenance release 5.6.1 fixed many of the
  problems of the initial Unicode implementation, but for example
  regular expressions still do not work with Unicode in 5.6.1.
  
-B<Starting from Perl 5.8.0, the use of C<use utf8> is no longer
-necessary.> In earlier releases the C<utf8> pragma was used to declare
+B<Starting from Perl 5.8.0, the use of C<use utf8> is needed only in much more restricted circumstances.> In earlier releases the C<utf8> pragma was used to declare
  that operations in the current block or file would be Unicode-aware.
  This model was found to be wrong, or at least clumsy: the "Unicodeness"
  is now carried with the data, instead of being attached to the
@@ -160,14 +159,14 @@ strings contain a character beyond 0x00FF.
  
  For example,
  
-      perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'              
+      perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'
  
  produces a fairly useless mixture of native bytes and UTF-8, as well
  as a warning:
  
       Wide character in print at ...
  
-To output UTF-8, use the C<:utf8> output layer.  Prepending
+To output UTF-8, use the C<:encoding> or C<:utf8> output layer.  Prepending
  
        binmode(STDOUT, ":utf8");
  
@@ -246,16 +245,14 @@ Note that both C<\x{...}> and C<\N{...}> are compile-time string
  constants: you cannot use variables in them.  if you want similar
  run-time functionality, use C<chr()> and C<charnames::vianame()>.
  
-Also note that if all the code points for pack "U" are below 0x100,
-bytes will be generated, just like if you were using C<chr()>.
-
-   my $bytes = pack("U*", 0x80, 0xFF);
-
  If you want to force the result to Unicode characters, use the special
-C<"U0"> prefix.  It consumes no arguments but forces the result to be
-in Unicode characters, instead of bytes.
+C<"U0"> prefix.  It consumes no arguments but causes the following bytes
+to be interpreted as the UTF-8 encoding of Unicode characters:
+
+   my $chars = pack("U0W*", 0x80, 0x42);
  
-   my $chars = pack("U0U*", 0x80, 0xFF);
+Likewise, you can stop such UTF-8 interpretation by using the special
+C<"C0"> prefix.
  
  =head2 Handling Unicode
  
@@ -265,7 +262,7 @@ C<substr()> will work on the Unicode characters; regular expressions
  will work on the Unicode characters (see L<perlunicode> and L<perlretut>).
  
  Note that Perl considers combining character sequences to be
-characters, so for example
+separate characters, so for example
  
      use charnames ':full';
      print length("\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}"), "\n";
@@ -280,27 +277,13 @@ encodings, I/O, and certain special cases:
  
  When you combine legacy data and Unicode the legacy data needs
  to be upgraded to Unicode.  Normally ISO 8859-1 (or EBCDIC, if
-applicable) is assumed.  You can override this assumption by
-using the C<encoding> pragma, for example
-
-    use encoding 'latin2'; # ISO 8859-2
-
-in which case literals (string or regular expressions), C<chr()>,
-and C<ord()> in your whole script are assumed to produce Unicode
-characters from ISO 8859-2 code points.  Note that the matching for
-encoding names is forgiving: instead of C<latin2> you could have
-said C<Latin 2>, or C<iso8859-2>, or other variations.  With just
-
-    use encoding;
-
-the environment variable C<PERL_ENCODING> will be consulted.
-If that variable isn't set, the encoding pragma will fail.
+applicable) is assumed.
  
  The C<Encode> module knows about many encodings and has interfaces
  for doing conversions between those encodings:
  
-    use Encode 'from_to';
-    from_to($data, "iso-8859-3", "utf-8"); # from legacy to utf-8
+    use Encode 'decode';
+    $data = decode("iso-8859-3", $data); # decode legacy bytes into Perl characters
  
  =head2 Unicode I/O
  
@@ -333,7 +316,9 @@ and on already open streams, use C<binmode()>:
  The matching of encoding names is loose: case does not matter, and
  many encodings have several aliases.  Note that the C<:utf8> layer
  must always be specified exactly like that; it is I<not> subject to
-the loose matching of encoding names.
+the loose matching of encoding names. Also note that C<:utf8> is unsafe for
+input, because it accepts the data without validating that it is indeed valid
+UTF-8.
  
  See L<PerlIO> for the C<:utf8> layer, L<PerlIO::encoding> and
  L<Encode::PerlIO> for the C<:encoding()> layer, and
@@ -345,7 +330,7 @@ Unicode or legacy encodings does not magically turn the data into
  Unicode in Perl's eyes.  To do that, specify the appropriate
  layer when opening files
  
-    open(my $fh,'<:utf8', 'anything');
+    open(my $fh,'<:encoding(utf8)', 'anything');
      my $line_of_unicode = <$fh>;
  
      open(my $fh,'<:encoding(Big5)', 'anything');
@@ -354,7 +339,7 @@ layer when opening files
  The I/O layers can also be specified more flexibly with
  the C<open> pragma.  See L<open>, or look at the following example.
  
-    use open ':utf8'; # input and output default layer will be UTF-8
+    use open ':encoding(utf8)'; # input/output default encoding will be UTF-8
      open X, ">file";
      print X chr(0x100), "\n";
      close X;
@@ -374,11 +359,6 @@ With the C<open> pragma you can use the C<:locale> layer
      printf "%#x\n", ord(<I>), "\n"; # this should print 0xc1
      close I;
  
-or you can also use the C<':encoding(...)'> layer
-
-    open(my $epic,'<:encoding(iso-8859-7)','iliad.greek');
-    my $line_of_unicode = <$epic>;
-
  These methods install a transparent filter on the I/O stream that
  converts data from the specified encoding when it is read in from the
  stream.  The result is always Unicode.
@@ -406,8 +386,8 @@ the file "text.utf8", encoded as UTF-8:
      while (<$nihongo>) { print $unicode $_ }
  
  The naming of encodings, both by the C<open()> and by the C<open>
-pragma, is similar to the C<encoding> pragma in that it allows for
-flexible names: C<koi8-r> and C<KOI8R> will both be understood.
+pragma, allows for flexible names: C<koi8-r> and C<KOI8R> will both be
+understood.
  
  Common encodings recognized by ISO, MIME, IANA, and various other
  standardisation organisations are recognised; for a more detailed
@@ -427,13 +407,13 @@ by repeatedly encoding the data:
      local $/; ## read in the whole file of 8-bit characters
      $t = <F>;
      close F;
-    open F, ">:utf8", "file";
+    open F, ">:encoding(utf8)", "file";
      print F $t; ## convert to UTF-8 on output
      close F;
  
  If you run this code twice, the contents of the F<file> will be twice
-UTF-8 encoded.  A C<use open ':utf8'> would have avoided the bug, or
-explicitly opening also the F<file> for input as UTF-8.
+UTF-8 encoded.  A C<use open ':encoding(utf8)'> would have avoided the
+bug, or explicitly opening also the F<file> for input as UTF-8.
  
  B<NOTE>: the C<:utf8> and C<:encoding> features work only if your
  Perl has been built with the new PerlIO feature (which is the default
@@ -454,7 +434,7 @@ displayed as C<\x..>, and the rest of the characters as themselves:
                 chr($_) =~ /[[:cntrl:]]/ ?  # else if control character ...
                 sprintf("\\x%02X", $_) :    # \x..
                 quotemeta(chr($_))          # else quoted or as themselves
-         } unpack("U*", $_[0]));           # unpack Unicode characters
+         } unpack("W*", $_[0]));           # unpack Unicode characters
     }
  
  For example,
@@ -494,17 +474,18 @@ explicitly-defined I/O layers). But if you must, there are two
  ways of looking behind the scenes.
  
  One way of peeking inside the internal encoding of Unicode characters
-is to use C<unpack("C*", ...> to get the bytes or C<unpack("H*", ...)>
-to display the bytes:
+is to use C<unpack("C*", ...)> to get the bytes of whatever the string
+encoding happens to be, or C<unpack("U0..", ...)> to get the bytes of the
+UTF-8 encoding:
  
      # this prints  c4 80  for the UTF-8 bytes 0xc4 0x80
-    print join(" ", unpack("H*", pack("U", 0x100))), "\n";
+    print join(" ", unpack("U0(H2)*", pack("U", 0x100))), "\n";
  
  Yet another way would be to use the Devel::Peek module:
  
      perl -MDevel::Peek -e 'Dump(chr(0x100))'
  
-That shows the UTF8 flag in FLAGS and both the UTF-8 bytes
+That shows the C<UTF8> flag in FLAGS and both the UTF-8 bytes
  and Unicode characters in C<PV>.  See also later in this document
  the discussion about the C<utf8::is_utf8()> function.
  
@@ -532,8 +513,8 @@ CAPITAL LETTER As should be considered equal, or even As of any case.
  The long answer is that you need to consider character normalization
  and casing issues: see L<Unicode::Normalize>, Unicode Technical
  Reports #15 and #21, I<Unicode Normalization Forms> and I<Case
-Mappings>, http://www.unicode.org/unicode/reports/tr15/ and 
-http://www.unicode.org/unicode/reports/tr21/ 
+Mappings>, L<http://www.unicode.org/unicode/reports/tr15/> and
+L<http://www.unicode.org/unicode/reports/tr21/>
  
  As of Perl 5.8.0, the "Full" case-folding of I<Case
  Mappings/SpecialCasing> is implemented.
@@ -556,7 +537,7 @@ C<0x00C1> > C<0x00C0>.
  The long answer is that "it depends", and a good answer cannot be
  given without knowing (at the very least) the language context.
  See L<Unicode::Collate>, and I<Unicode Collation Algorithm>
-http://www.unicode.org/unicode/reports/tr10/
+L<http://www.unicode.org/unicode/reports/tr10/>
  
  =back
  
@@ -570,7 +551,7 @@ Character Ranges and Classes
  
  Character ranges in regular expression character classes (C</[a-z]/>)
  and in the C<tr///> (also known as C<y///>) operator are not magically
-Unicode-aware.  What this means that C<[A-Za-z]> will not magically start
+Unicode-aware.  What this means is that C<[A-Za-z]> will not magically start
  to mean "all alphabetic letters"; not that it does mean that even for
  8-bit characters, you should be using C</[[:alpha:]]/> in that case.
  
@@ -621,11 +602,12 @@ Unicode; for that, see the earlier I/O discussion.
  
  How Do I Know Whether My String Is In Unicode?
  
-You shouldn't care.  No, you really shouldn't.  No, really.  If you
-have to care--beyond the cases described above--it means that we
-didn't get the transparency of Unicode quite right.
+You shouldn't have to care.  But you may, because currently the semantics of
+characters whose ordinals are in the range 128 to 255 differ depending on
+whether the string that contains them is in Unicode or not.
+(See L<perlunicode>.)
  
-Okay, if you insist:
+To determine if a string is in Unicode, use:
  
      print utf8::is_utf8($string) ? 1 : 0, "\n";
  
@@ -638,7 +620,7 @@ C<$string>.  If the flag is off, the bytes in the scalar are interpreted
  as a single byte encoding.  If the flag is on, the bytes in the scalar
  are interpreted as the (multi-byte, variable-length) UTF-8 encoded code
  points of the characters.  Bytes added to an UTF-8 encoded string are
-automatically upgraded to UTF-8.  If mixed non-UTF8 and UTF-8 scalars
+automatically upgraded to UTF-8.  If mixed non-UTF-8 and UTF-8 scalars
  are merged (double-quoted interpolation, explicit concatenation, and
  printf/sprintf parameter substitution), the result will be UTF-8 encoded
  as if copies of the byte strings were upgraded to UTF-8: for example,
@@ -652,8 +634,8 @@ C<$a> will stay byte-encoded.
  
  Sometimes you might really need to know the byte length of a string
  instead of the character length. For that use either the
-C<Encode::encode_utf8()> function or the C<bytes> pragma and its only
-defined function C<length()>:
+C<Encode::encode_utf8()> function or the C<bytes> pragma and
+the C<length()> function:
  
      my $unicode = chr(0x100);
      print length($unicode), "\n"; # will print 1
@@ -670,22 +652,24 @@ How Do I Detect Data That's Not Valid In a Particular Encoding?
  Use the C<Encode> package to try converting it.
  For example,
  
-    use Encode 'encode_utf8';
-    if (encode_utf8($string_of_bytes_that_I_think_is_utf8)) {
-        # valid
+    use Encode 'decode_utf8';
+
+    if (eval { decode_utf8($string, Encode::FB_CROAK); 1 }) {
+        # $string is valid UTF-8
      } else {
-        # invalid
+        # $string is not valid UTF-8
      }
  
-For UTF-8 only, you can use:
+Or use C<unpack> to try decoding it:
  
      use warnings;
-    @chars = unpack("U0U*", $string_of_bytes_that_I_think_is_utf8);
+    @chars = unpack("C0U*", $string_of_bytes_that_I_think_is_utf8);
  
-If invalid, a C<Malformed UTF-8 character (byte 0x##) in unpack>
-warning is produced. The "U0" means "expect strictly UTF-8 encoded
-Unicode".  Without that the C<unpack("U*", ...)> would accept also
-data like C<chr(0xFF>), similarly to the C<pack> as we saw earlier.
+If invalid, a C<Malformed UTF-8 character> warning is produced. The "C0" means
+"process the string character by character".  Without that, the
+C<unpack("U*", ...)> would work in C<U0> mode (the default if the format
+string starts with C<U>) and it would return the bytes making up the UTF-8
+encoding of the target string, something that will always work.
  
  =item *
  
@@ -727,8 +711,8 @@ Back to converting data.  If you have (or want) data in your system's
  native 8-bit encoding (e.g. Latin-1, EBCDIC, etc.), you can use
  pack/unpack to convert to/from Unicode.
  
-    $native_string  = pack("C*", unpack("U*", $Unicode_string));
-    $Unicode_string = pack("U*", unpack("C*", $native_string));
+    $native_string  = pack("W*", unpack("U*", $Unicode_string));
+    $Unicode_string = pack("U*", unpack("W*", $native_string));
  
  If you have a sequence of bytes you B<know> is valid UTF-8,
  but Perl doesn't know it yet, you can make Perl a believer, too:
@@ -736,18 +720,24 @@ but Perl doesn't know it yet, you can make Perl a believer, too:
      use Encode 'decode_utf8';
      $Unicode = decode_utf8($bytes);
  
-You can convert well-formed UTF-8 to a sequence of bytes, but if
-you just want to convert random binary data into UTF-8, you can't.
-B<Any random collection of bytes isn't well-formed UTF-8>.  You can
-use C<unpack("C*", $string)> for the former, and you can create
-well-formed Unicode data by C<pack("U*", 0xff, ...)>.
+or:
+
+    $Unicode = pack("U0a*", $bytes);
+
+You can find the bytes that make up a UTF-8 sequence with
+
+       @bytes = unpack("C*", $Unicode_string)
+
+and you can create well-formed Unicode with
+
+       $Unicode_string = pack("U*", 0xff, ...)
  
  =item *
  
  How Do I Display Unicode?  How Do I Input Unicode?
  
-See http://www.alanwood.net/unicode/ and
-http://www.cl.cam.ac.uk/~mgk25/unicode.html
+See L<http://www.alanwood.net/unicode/> and
+L<http://www.cl.cam.ac.uk/~mgk25/unicode.html>
  
  =item *
  
@@ -799,44 +789,44 @@ show a decimal number in hexadecimal.  If you have just the
  
  Unicode Consortium
  
-    http://www.unicode.org/
+L<http://www.unicode.org/>
  
  =item *
  
  Unicode FAQ
  
-    http://www.unicode.org/unicode/faq/
+L<http://www.unicode.org/unicode/faq/>
  
  =item *
  
  Unicode Glossary
  
-    http://www.unicode.org/glossary/
+L<http://www.unicode.org/glossary/>
  
  =item *
  
  Unicode Useful Resources
  
-    http://www.unicode.org/unicode/onlinedat/resources.html
+L<http://www.unicode.org/unicode/onlinedat/resources.html>
  
  =item *
  
  Unicode and Multilingual Support in HTML, Fonts, Web Browsers and Other Applications
  
-    http://www.alanwood.net/unicode/
+L<http://www.alanwood.net/unicode/>
  
  =item *
  
  UTF-8 and Unicode FAQ for Unix/Linux
  
-    http://www.cl.cam.ac.uk/~mgk25/unicode.html
+L<http://www.cl.cam.ac.uk/~mgk25/unicode.html>
  
  =item *
  
  Legacy Character Sets
  
-    http://www.czyborra.com/
-    http://www.eki.ee/letter/
+L<http://www.czyborra.com/>
+L<http://www.eki.ee/letter/>
  
  =item *
  
@@ -845,7 +835,7 @@ directory
  
      $Config{installprivlib}/unicore
  
-in Perl 5.8.0 or newer, and 
+in Perl 5.8.0 or newer, and
  
      $Config{installprivlib}/unicode
  
@@ -880,7 +870,7 @@ to UTF-8 bytes and back, the code works even with older Perl 5 versions.
  
  =head1 SEE ALSO
  
-L<perlunicode>, L<Encode>, L<encoding>, L<open>, L<utf8>, L<bytes>,
+L<perlunitut>, L<perlunicode>, L<Encode>, L<open>, L<utf8>, L<bytes>,
  L<perlretut>, L<perlrun>, L<Unicode::Collate>, L<Unicode::Normalize>,
  L<Unicode::UCD>
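
The validity check that this patch introduces (C<decode_utf8> with C<Encode::FB_CROAK> inside C<eval>) can be tried in isolation. The following is only a minimal sketch: the helper name C<is_valid_utf8> is hypothetical and not part of the patch, and the copy of the input is made on the assumption that C<decode_utf8> may modify its argument when a CHECK value is passed.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Hypothetical helper (not part of the patch): returns 1 if $bytes is
# well-formed UTF-8, 0 otherwise.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # Work on a copy, because decoding with a CHECK argument may
    # modify the string that is passed in.
    my $copy = $bytes;
    # FB_CROAK makes decode_utf8 die on malformed input; eval traps that.
    return eval { decode_utf8($copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

print is_valid_utf8("\xC4\x80"), "\n";  # UTF-8 bytes of U+0100: prints 1
print is_valid_utf8("\x80"), "\n";      # lone continuation byte: prints 0
```

Wrapping the call in C<eval> keeps a malformed input from terminating the program, which is why the patched example tests the result of the C<eval> block rather than the return value of C<decode_utf8> itself.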