problems of the initial Unicode implementation, but for example
regular expressions still do not work with Unicode in 5.6.1.
Perl v5.14.0 is the first release where Unicode support is
-(almost) seamlessly integrable without some gotchas. (There are two
+(almost) seamlessly integrable without some gotchas. (There are a few
exceptions. Firstly, some differences in L<quotemeta|perlfunc/quotemeta>
were fixed starting in Perl 5.16.0. Secondly, some differences in
L<the range operator|perlop/Range Operators> were fixed starting in
-Perl 5.26.0.)
+Perl 5.26.0. Thirdly, some differences in L<split|perlfunc/split> were fixed
+starting in Perl 5.28.0.)
To enable this
seamless support, you should C<use feature 'unicode_strings'> (which is
for doing conversions between those encodings:
use Encode 'decode';
- $data = decode("iso-8859-3", $data); # convert from legacy to utf-8
+ $data = decode("iso-8859-3", $data); # convert from legacy
=head2 Unicode I/O
must always be specified exactly like that; it is I<not> subject to
the loose matching of encoding names. Also note that currently C<:utf8> is unsafe for
input, because it accepts the data without validating that it is indeed valid
-UTF-8; you should instead use C<:encoding(utf-8)> (with or without a
+UTF-8; you should instead use C<:encoding(UTF-8)> (with or without a
hyphen).
See L<PerlIO> for the C<:utf8> layer, L<PerlIO::encoding> and
list see L<Encode::Supported>.
C<read()> reads characters and returns the number of characters.
-C<seek()> and C<tell()> operate on byte counts, as do C<sysread()>
-and C<sysseek()>.
+C<seek()> and C<tell()> operate on byte counts, as does C<sysseek()>.
+
+C<sysread()> and C<syswrite()> should not be used on file handles with
+character encoding layers; they behave badly, and that behaviour has
+been deprecated since Perl 5.24.
Notice that because of the default behaviour of not doing any
conversion upon input if there is no default layer,
=item *
-Bit Complement Operator ~ And vec()
+Starting in Perl 5.28, it is illegal for bit operators, like C<~>, to
+operate on strings containing code points above 255.
+
+=item *
-The bit complement operator C<~> may produce surprising results if
+The C<vec()> function may produce surprising results if
used on strings containing characters with ordinal values above
255. In such a case, the results are consistent with the internal
encoding of the characters, but not with much else. So don't do
-that. Similarly for C<vec()>: you will be operating on the
-internally-encoded bit patterns of the Unicode characters, not on
-the code point values, which is very probably not what you want.
+that. Starting in Perl 5.28, doing so raises a deprecation message, and
+it becomes illegal in Perl 5.32.
=item *
interpreted via a particular encoding, you can use C<Encode>:
use Encode 'from_to';
- from_to($data, "iso-8859-1", "utf-8"); # from latin-1 to utf-8
+ from_to($data, "iso-8859-1", "UTF-8"); # from latin-1 to UTF-8
The call to C<from_to()> changes the bytes in C<$data>, but nothing
material about the nature of the string has changed as far as Perl is
If you have a sequence of bytes you B<know> is valid UTF-8,
but Perl doesn't know it yet, you can make Perl a believer, too:
- use Encode 'decode_utf8';
- $Unicode = decode_utf8($bytes);
+ $Unicode = $bytes;
+ utf8::decode($Unicode);
or:
How Does Unicode Work With Traditional Locales?
-If your locale is a UTF-8 locale, starting in Perl v5.20, Perl works
-well for all categories except C<LC_COLLATE> dealing with sorting and
-the C<cmp> operator.
+If your locale is a UTF-8 locale, starting in Perl v5.26, Perl works
+well for all categories; before this, starting with Perl v5.20, it works
+for all categories but C<LC_COLLATE>, which deals with
+sorting and the C<cmp> operator. But note that the standard
+C<L<Unicode::Collate>> and C<L<Unicode::Collate::Locale>> modules offer
+much more powerful solutions to collation issues, and work on earlier
+releases.
For other locales, starting in Perl 5.16, you can specify