=item *
-The special pattern C<\X> match matches any extended Unicode sequence
+The special pattern C<\X> matches any extended Unicode sequence
(a "combining character sequence" in Standardese), where the first
character is a base character and subsequent characters are mark
characters that apply to the base character. It is equivalent to
See L<Encode>.
-=head1 CAVEATS
-
-Whether an arbitrary piece of data will be treated as "characters" or
-"bytes" by internal operations cannot be divined at the current time.
-
-Use of locales with Unicode data may lead to odd results. Currently
-there is some attempt to apply 8-bit locale info to characters in the
-range 0..255, but this is demonstrably incorrect for locales that use
-characters above that range when mapped into Unicode. It will also
-tend to run slower. Avoidance of locales is strongly encouraged.
-
-=head1 UNICODE REGULAR EXPRESSION SUPPORT LEVEL
+=head2 Unicode Regular Expression Support Level
The following list of Unicode regular expression support describes
feature by feature the Unicode support implemented in Perl as of Perl
=over 4
-=item
+=item *
UTF-8
leading bits of the start byte tell how many bytes the are in the
encoded character.
-=item
+=item *
UTF-EBCDIC
Like UTF-8, but EBCDIC-safe, as UTF-8 is ASCII-safe.
-=item
+=item *
UTF-16, UTF-16BE, UTF16-LE, Surrogates, and BOMs (Byte Order Marks)
little-endian format" and cannot be "0xFFFE, represented in big-endian
format".
-=item
+=item *
UTF-32, UTF-32BE, UTF32-LE
needed. The BOM signatures will be 0x00 0x00 0xFE 0xFF for BE and
0xFF 0xFE 0x00 0x00 for LE.
-=item
+=item *
UCS-2, UCS-4
encoding, UCS-4 is a 32-bit encoding. Unlike UTF-16, UCS-2
is not extensible beyond 0xFFFF, because it does not use surrogates.
-=item
+=item *
UTF-7
DO_UTF8(sv) returns true if the UTF8 flag is on and the bytes pragma
is not in effect. SvUTF8(sv) returns true is the UTF8 flag is on, the
bytes pragma is ignored. The UTF8 flag being on does B<not> mean that
-there are any characters of code points greater than 255 (or 127) in the
-scalar, or that there even are any characters in the scalar. What the
-UTF8 flag means is that the sequence of octets in the representation
-of the scalar should be treated as UTF-8 encoding of a string.
-The UTF8 flag being off means that each octet in this representation
-encodes a single character with codepoint 0..255 within the string.
-Perl's Unicode model is not to use UTF-8 until it's really necessary.
+there are any characters of code points greater than 255 (or 127) in
+the scalar, or that there even are any characters in the scalar.
+What the UTF8 flag means is that the sequence of octets in the
+representation of the scalar is the sequence of UTF-8 encoded
+code points of the characters of a string. The UTF8 flag being
+off means that each octet in this representation encodes a single
+character with codepoint 0..255 within the string. Perl's Unicode
+model is not to use UTF-8 until it's really necessary.
=item *
encoded form. sv_utf8_downgrade(sv) does the opposite (if possible).
sv_utf8_encode(sv) is like sv_utf8_upgrade but the UTF8 flag does not
get turned on. sv_utf8_decode() does the opposite of sv_utf8_encode().
+Note that none of these are to be used as general purpose encoding/decoding
+interfaces: use Encode for that. sv_utf8_upgrade() is affected by the
+encoding pragma, but sv_utf8_downgrade() is not (since the encoding
+pragma is designed to be a one-way street).
=item *
For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h>
in the Perl source code distribution.
+=head1 BUGS
+
+Use of locales with Unicode data may lead to odd results. Currently
+there is some attempt to apply 8-bit locale info to characters in the
+range 0..255, but this is demonstrably incorrect for locales that use
+characters above that range when mapped into Unicode. It will also
+tend to run slower. Avoidance of locales is strongly encouraged.
+
+Some functions are slower when working on UTF-8 encoded strings than
+on byte encoded strings. All functions that need to hop over
+characters such as length(), substr() or index() can work B<much>
+faster when the underlying data are byte-encoded. Witness the
+following benchmark:
+
+ % perl -e '
+ use Benchmark;
+ use strict;
+ our $l = 10000;
+ our $u = our $b = "x" x $l;
+ substr($u,0,1) = "\x{100}";
+ timethese(-2,{
+ LENGTH_B => q{ length($b) },
+ LENGTH_U => q{ length($u) },
+ SUBSTR_B => q{ substr($b, $l/4, $l/2) },
+ SUBSTR_U => q{ substr($u, $l/4, $l/2) },
+ });
+ '
+ Benchmark: running LENGTH_B, LENGTH_U, SUBSTR_B, SUBSTR_U for at least 2 CPU seconds...
+ LENGTH_B: 2 wallclock secs ( 2.36 usr + 0.00 sys = 2.36 CPU) @ 5649983.05/s (n=13333960)
+ LENGTH_U: 2 wallclock secs ( 2.11 usr + 0.00 sys = 2.11 CPU) @ 12155.45/s (n=25648)
+ SUBSTR_B: 3 wallclock secs ( 2.16 usr + 0.00 sys = 2.16 CPU) @ 374480.09/s (n=808877)
+ SUBSTR_U: 2 wallclock secs ( 2.11 usr + 0.00 sys = 2.11 CPU) @ 6791.00/s (n=14329)
+
+The numbers show an incredible slowness on long UTF-8 strings and you
+should carefully avoid to use these functions within tight loops. For
+example if you want to iterate over characters, it is infinitely
+better to split into an array than to use substr, as the following
+benchmark shows:
+
+ % perl -e '
+ use Benchmark;
+ use strict;
+ our $l = 10000;
+ our $u = our $b = "x" x $l;
+ substr($u,0,1) = "\x{100}";
+ timethese(-5,{
+ SPLIT_B => q{ for my $c (split //, $b){} },
+ SPLIT_U => q{ for my $c (split //, $u){} },
+ SUBSTR_B => q{ for my $i (0..length($b)-1){my $c = substr($b,$i,1);} },
+ SUBSTR_U => q{ for my $i (0..length($u)-1){my $c = substr($u,$i,1);} },
+ });
+ '
+ Benchmark: running SPLIT_B, SPLIT_U, SUBSTR_B, SUBSTR_U for at least 5 CPU seconds...
+ SPLIT_B: 6 wallclock secs ( 5.29 usr + 0.00 sys = 5.29 CPU) @ 56.14/s (n=297)
+ SPLIT_U: 5 wallclock secs ( 5.17 usr + 0.01 sys = 5.18 CPU) @ 55.21/s (n=286)
+ SUBSTR_B: 5 wallclock secs ( 5.34 usr + 0.00 sys = 5.34 CPU) @ 123.22/s (n=658)
+ SUBSTR_U: 7 wallclock secs ( 6.20 usr + 0.00 sys = 6.20 CPU) @ 0.81/s (n=5)
+
+You see, the algorithm based on substr() was faster with byte encoded
+data but it is pathologically slow with UTF-8 data.
+
=head1 SEE ALSO
L<perluniintro>, L<encoding>, L<Encode>, L<open>, L<utf8>, L<bytes>,