No pod in internal Net::FTP classes.

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 13031ff..e56f3ff 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -20,7 +20,7 @@ Other encodings can be converted to perl's encoding on input, or from
  perl's encoding on output by use of the ":encoding(...)" layer.
  See L<open>.
  
-To mark the Perl source itself as being in an particular encoding,
+To mark the Perl source itself as being in a particular encoding,
  see L<encoding>.
  
  =item Regular Expressions
@@ -622,7 +622,7 @@ Level 1 - Basic Unicode Support
          2.2 Categories                          - done          [3][4]
          2.3 Subtraction                         - MISSING       [5][6]
          2.4 Simple Word Boundaries              - done          [7]
-        2.5 Simple Loose Matches                - MISSING       [8]
+        2.5 Simple Loose Matches                - done          [8]
          2.6 End of Line                         - MISSING       [9][10]
  
          [ 1] \x{...}
@@ -632,7 +632,7 @@ Level 1 - Basic Unicode Support
          [ 5] have negation
          [ 6] can use look-ahead to emulate subtracion
          [ 7] include Letters in word characters
-        [ 8] see UTR#21 Case Mappings
+        [ 8] see UTR#21 Case Mappings: Perl implements 1:1 mappings
          [ 9] see UTR#13 Unicode Newline Guidelines
          [10] should do ^ and $ also on \x{2028} and \x{2029}
  
@@ -711,21 +711,25 @@ is UTF-16, but you don't know which endianness?  Byte Order Marks
  (BOMs) are a solution to this.  A special character has been reserved
  in Unicode to function as a byte order marker: the character with the
  code point 0xFEFF is the BOM.
+
  The trick is that if you read a BOM, you will know the byte order,
  since if it was written on a big endian platform, you will read the
  bytes 0xFE 0xFF, but if it was written on a little endian platform,
  you will read the bytes 0xFF 0xFE.  (And if the originating platform
  was writing in UTF-8, you will read the bytes 0xEF 0xBB 0xBF.)
+
  The way this trick works is that the character with the code point
  0xFFFE is guaranteed not to be a valid Unicode character, so the
  sequence of bytes 0xFF 0xFE is unambiguously "BOM, represented in
-little-endian format" and cannot be "0xFFFE, represented in
-big-endian format".
+little-endian format" and cannot be "0xFFFE, represented in big-endian
+format".
  
  =item UTF-32, UTF-32BE, UTF32-LE
  
  The UTF-32 family is pretty much like the UTF-16 family, expect that
-the units are 32-bit, and therefore the surrogate scheme is not needed.
+the units are 32-bit, and therefore the surrogate scheme is not
+needed.  The BOM signatures will be 0x00 0x00 0xFE 0xFF for BE and
+0xFF 0xFE 0x00 0x00 for LE.
  
  =item UCS-2, UCS-4
  
@@ -738,6 +742,18 @@ is not extensible beyond 0xFFFF, because it does not use surrogates.
  A seven-bit safe (non-eight-bit) encoding, useful if the
  transport/storage is not eight-bit safe.  Defined by RFC 2152.
  
+=head2 Security Implications of Malformed UTF-8
+
+Unfortunately, the specification of UTF-8 leaves some room for
+interpretation of how many bytes of encoded output one should generate
+from one input Unicode character.  Strictly speaking, one is supposed
+to always generate the shortest possible sequence of UTF-8 bytes,
+because otherwise there is potential for input buffer overflow at the
+receiving end of a UTF-8 connection.  Perl always generates the shortest
+length UTF-8, and with warnings on (C<-w> or C<use warnings;>) Perl will
+warn about non-shortest length UTF-8 (and other malformations, too,
+such as the surrogates, which are not real character code points.)
+
  =head2 Unicode in Perl on EBCDIC
  
  The way Unicode is handled on EBCDIC platforms is still rather