One possible cause is that you set the UTF8 flag yourself for data that
you thought to be in UTF-8 but it wasn't (it was for example legacy
-8-bit data). To guard against this, you can use Encode::decode_utf8.
+8-bit data). To guard against this, you can use C<Encode::decode('UTF-8', ...)>.
If you use the C<:encoding(UTF-8)> PerlIO layer for input, invalid byte
sequences are handled gracefully, but if you use C<:utf8>, the flag is
Like all Perl character operations, L<C<length>|/length EXPR> normally
deals in logical
characters, not physical bytes. For how many bytes a string encoded as
-UTF-8 would take up, use C<length(Encode::encode_utf8(EXPR))> (you'll have
-to C<use Encode> first). See L<Encode> and L<perlunicode>.
+UTF-8 would take up, use C<length(Encode::encode('UTF-8', EXPR))>
+(you'll have to C<use Encode> first). See L<Encode> and L<perlunicode>.
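
As a rough illustration of the difference between character length and encoded byte length, here is a minimal sketch. The sample string and the variable names C<$str>, C<$chars>, and C<$bytes> are assumptions made for this example and do not come from the original text.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample: "caf", U+00E9 (e with acute), a space,
# U+20AC (euro sign), and "5".
my $str = "caf\x{E9} \x{20AC}5";

# length() counts logical characters.
my $chars = length($str);

# Encoding to UTF-8 first yields octets, so length() then counts bytes.
# U+00E9 takes 2 bytes and U+20AC takes 3 bytes in UTF-8.
my $bytes = length(encode('UTF-8', $str));

printf "%d characters, %d bytes\n", $chars, $bytes;   # 7 characters, 10 bytes
```

The two counts differ only when the string contains characters outside ASCII.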
=item __LINE__
X<__LINE__>
my @hebrew = unpack( 'U*', $utf );
Please note: in the general case, you're better off using
-Encode::decode_utf8 to decode a UTF-8 encoded byte string to a Perl
-Unicode string, and Encode::encode_utf8 to encode a Perl Unicode string
-to UTF-8 bytes. These functions provide means of handling invalid byte
+L<C<Encode::decode('UTF-8', $utf)>|Encode/decode> to decode a UTF-8
+encoded byte string to a Perl Unicode string, and
+L<C<Encode::encode('UTF-8', $str)>|Encode/encode> to encode a Perl Unicode
+string to UTF-8 bytes. These functions provide means of handling invalid byte
sequences and generally have a friendlier interface.
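
The following sketch shows one way the handling of invalid byte sequences might look in practice. The byte strings and the variable names (C<$good>, C<$bad>, C<$str>, C<$ok>, C<$lenient>) are assumptions chosen for illustration; C<Encode::LEAVE_SRC> is used here so that the input strings are not modified by the strict calls.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical inputs: UTF-8 bytes for four Hebrew letters, and a
# truncated copy followed by 0xFF, which never appears in valid UTF-8.
my $good = "\xD7\xA9\xD7\x9C\xD7\x95\xD7\x9D";
my $bad  = "\xD7\xA9\xFF";

# Strict decoding: die on malformed input, leave the source untouched.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

my $str = decode('UTF-8', $good, $check);
print length($str), " characters decoded\n";          # 4 characters decoded

# Wrapping the call in eval turns the exception into a false value.
my $ok = eval { decode('UTF-8', $bad, $check); 1 };
print "invalid input rejected\n" unless $ok;

# With no CHECK argument, malformed bytes become U+FFFD instead.
my $lenient = decode('UTF-8', $bad);

# Encoding the decoded string gives back the original octets.
print "round trip ok\n" if encode('UTF-8', $str) eq $good;
```

Whether to croak or to substitute U+FFFD depends on whether the caller would rather reject bad input or carry on with a visibly damaged string.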
=head2 Another Portable Binary Encoding
if ($] > 5.008) {
require Encode;
- $val = Encode::encode_utf8($val); # make octets
+ $val = Encode::encode("UTF-8", $val); # make octets
}
=item *
if ($] > 5.008) {
require Encode;
- $val = Encode::decode_utf8($val);
+ $val = Encode::decode("UTF-8", $val);
}
=item *
sub my_escape_html ($) {
my($what) = shift;
return unless defined $what;
- Encode::decode_utf8(Foo::Bar::escape_html(
- Encode::encode_utf8($what)));
+ Encode::decode("UTF-8", Foo::Bar::escape_html(
+ Encode::encode("UTF-8", $what)));
}
Sometimes, when the extension does not convert data but just stores
or
$ export PERL_UNICODE=A
or
- use Encode qw(decode_utf8);
- @ARGV = map { decode_utf8($_, 1) } @ARGV;
+ use Encode qw(decode);
+ @ARGV = map { decode('UTF-8', $_, 1) } @ARGV;
=head2 ℞ 14: Decode program arguments as locale encoding
$ export PERL_UNICODE=SDA
or
use open qw(:std :utf8);
- use Encode qw(decode_utf8);
- @ARGV = map { decode_utf8($_, 1) } @ARGV;
+ use Encode qw(decode);
+ @ARGV = map { decode('UTF-8', $_, 1) } @ARGV;
=head2 ℞ 19: Open file with specific encoding
=head2 What are C<decode_utf8> and C<encode_utf8>?
These are alternate syntaxes for C<decode('utf8', ...)> and C<encode('utf8',
-...)>.
+...)>. Do not use these functions for data exchange. Instead use
+C<decode('UTF-8', ...)> and C<encode('UTF-8', ...)>; see
+L</What's the difference between UTF-8 and utf8?> below.
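
To see that the two spellings select different encodings, a small sketch along these lines may help. It only asks C<Encode> for the canonical name behind each spelling; the variable names C<$strict_name> and C<$lax_name> are assumptions made for this example.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# "UTF-8" (with the hyphen) resolves to the strict encoding, while
# "utf8" resolves to Perl's more liberal variant.
my $strict_name = find_encoding('UTF-8')->name;
my $lax_name    = find_encoding('utf8')->name;

print "UTF-8 resolves to $strict_name\n";   # UTF-8 resolves to utf-8-strict
print "utf8 resolves to $lax_name\n";       # utf8 resolves to utf8
```

Since C<decode_utf8> and C<encode_utf8> are described above as alternate syntaxes for the C<utf8> spelling, they use the liberal variant.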
=head2 What is a "wide character"?
what it accepts. If you have to communicate with things that aren't so liberal,
you may want to consider using C<UTF-8>. If you have to communicate with things
that are too liberal, you may have to use C<utf8>. The full explanation is in
-L<Encode>.
+L<Encode/"UTF-8 vs. utf8 vs. UTF8">.
C<UTF-8> is internally known as C<utf-8-strict>. The tutorial uses UTF-8
consistently, even where utf8 is actually used internally, because the
Use the C<Encode> package to try converting it.
For example,
- use Encode 'decode_utf8';
+ use Encode 'decode';
- if (eval { decode_utf8($string, Encode::FB_CROAK); 1 }) {
- # $string is valid utf8
+ if (eval { decode('UTF-8', $string, Encode::FB_CROAK); 1 }) {
+ # $string is valid UTF-8
} else {
- # $string is not valid utf8
+ # $string is not valid UTF-8
}
Or use C<unpack> to try decoding it: