perl's encoding on output by use of the ":encoding(...)" layer.
See L<open>.
-To mark the Perl source itself as being in an particular encoding,
+To mark the Perl source itself as being in a particular encoding,
see L<encoding>.
=item Regular Expressions
2.2 Categories - done [3][4]
2.3 Subtraction - MISSING [5][6]
2.4 Simple Word Boundaries - done [7]
- 2.5 Simple Loose Matches - MISSING [8]
+ 2.5 Simple Loose Matches - done [8]
2.6 End of Line - MISSING [9][10]
[ 1] \x{...}
[ 5] have negation
[ 6] can use look-ahead to emulate subtraction
[ 7] include Letters in word characters
- [ 8] see UTR#21 Case Mappings
+ [ 8] see UTR#21 Case Mappings: Perl implements 1:1 mappings
[ 9] see UTR#13 Unicode Newline Guidelines
[10] should do ^ and $ also on \x{2028} and \x{2029}
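Note [6] says that subtraction can be emulated with look-ahead. A minimal sketch of that idea follows, assuming an ASCII-only example; the pattern and the C<consonants> helper are hypothetical and do not come from the original document.

```perl
use strict;
use warnings;

# Hypothetical example: "[a-z] minus [aeiou]", i.e. lowercase ASCII consonants.
# The negative look-ahead rejects the subtracted set before the class matches.
my $consonant = qr/(?![aeiou])[a-z]/;

# Return only the characters of $text that survive the "subtraction".
sub consonants {
    my ($text) = @_;
    return join '', $text =~ /($consonant)/g;
}

print consonants("unicode"), "\n";   # prints "ncd"
```

The same shape should work with Unicode properties, for example a negative look-ahead on one property followed by a broader one.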
(BOMs) are a solution to this. A special character has been reserved
in Unicode to function as a byte order marker: the character with the
code point 0xFEFF is the BOM.
+
The trick is that if you read a BOM, you will know the byte order,
since if it was written on a big endian platform, you will read the
bytes 0xFE 0xFF, but if it was written on a little endian platform,
you will read the bytes 0xFF 0xFE. (And if the originating platform
was writing in UTF-8, you will read the bytes 0xEF 0xBB 0xBF.)
+
The way this trick works is that the character with the code point
0xFFFE is guaranteed not to be a valid Unicode character, so the
sequence of bytes 0xFF 0xFE is unambiguously "BOM, represented in
-little-endian format" and cannot be "0xFFFE, represented in
-big-endian format".
+little-endian format" and cannot be "0xFFFE, represented in big-endian
+format".
=item UTF-32, UTF-32BE, UTF-32LE
The UTF-32 family is pretty much like the UTF-16 family, except that
-the units are 32-bit, and therefore the surrogate scheme is not needed.
+the units are 32-bit, and therefore the surrogate scheme is not
+needed. The BOM signatures will be 0x00 0x00 0xFE 0xFF for BE and
+0xFF 0xFE 0x00 0x00 for LE.
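The BOM signatures described above can be checked mechanically. The following is a sketch only: C<bom_encoding> is a hypothetical helper, not part of Perl or of any module, and it assumes the caller passes a string of raw bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: guess the encoding from the leading bytes of a buffer.
# Returns an encoding name, or undef if no BOM signature is present.
sub bom_encoding {
    my ($bytes) = @_;
    # Check the 4-byte UTF-32 signatures first: the UTF-32LE BOM
    # begins with the same two bytes as the UTF-16LE BOM.
    return 'UTF-32BE' if $bytes =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-32LE' if $bytes =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return;
}

print bom_encoding("\xFE\xFF" . "\x00A"), "\n";   # prints "UTF-16BE"
```

Because the UTF-32LE test runs first, UTF-16LE text that starts with a BOM followed by U+0000 would be reported as UTF-32LE. A real reader would need extra context to tell the two apart.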
=item UCS-2, UCS-4
A seven-bit safe (non-eight-bit) encoding, useful if the
transport/storage is not eight-bit safe. Defined by RFC 2152.
+=head2 Security Implications of Malformed UTF-8
+
+Unfortunately, the specification of UTF-8 leaves some room for
+interpretation of how many bytes of encoded output one should generate
+from one input Unicode character. Strictly speaking, one is supposed
+to always generate the shortest possible sequence of UTF-8 bytes,
+because otherwise there is potential for input buffer overflow at the
+receiving end of a UTF-8 connection. Perl always generates the shortest
+length UTF-8, and with warnings on (C<-w> or C<use warnings;>) Perl will
+warn about non-shortest length UTF-8 (and other malformations, too,
+such as the surrogates, which are not real character code points).
+
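To make the shortest-form rule concrete, here is a sketch that uses the Encode module's strict C<UTF-8> decoder to reject an overlong sequence. The helper name C<is_strict_utf8> is hypothetical; the byte pair 0xC0 0xAF is the well-known non-shortest two-byte form of "/" (0x2F).

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: true if $bytes is well-formed, shortest-form UTF-8.
sub is_strict_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # decode() from modifying our copy of the buffer.
    return eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    } ? 1 : 0;
}

my $shortest = encode('UTF-8', "/");   # the single byte 0x2F
my $overlong = "\xC0\xAF";             # non-shortest two-byte form of "/"

print is_strict_utf8($shortest) ? "ok" : "malformed", "\n";   # prints "ok"
print is_strict_utf8($overlong) ? "ok" : "malformed", "\n";   # prints "malformed"
```

A check like this matters when input is filtered on its byte form: a filter that only looks for the byte 0x2F would miss the overlong form if a later stage decoded it leniently.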
=head2 Unicode in Perl on EBCDIC
The way Unicode is handled on EBCDIC platforms is still rather