No, and this isn't really a Unicode FAQ.
-Perl has an abstracted interface for all supported character encodings, so they
+Perl has an abstracted interface for all supported character encodings, so this
is actually a generic C<Encode> tutorial and C<Encode> FAQ. But many people
think that Unicode is special and magical, and I didn't want to disappoint
them, so I decided to call the document a Unicode tutorial.
=head2 Which version of perl should I use?
Well, if you can, upgrade to the most recent, but certainly C<5.8.1> or newer.
-The tutorial and FAQ are based on the status quo as of C<5.8.8>.
+The tutorial and FAQ assume the latest release.
You should also check your modules, and upgrade them if necessary. For example,
HTML::Entities requires version >= 1.32 to function correctly, even though the
You can provide this layer when C<open>ing the file:
- open my $fh, '>:encoding(UTF-8)', $filename; # auto encoding on write
- open my $fh, '<:encoding(UTF-8)', $filename; # auto decoding on read
+ open my $fh, '>:encoding(UTF-8)', $filename; # auto encoding on write
+ open my $fh, '<:encoding(UTF-8)', $filename; # auto decoding on read
Or if you already have an open filehandle:
- binmode $fh, ':encoding(UTF-8)';
+ binmode $fh, ':encoding(UTF-8)';
Some database drivers for DBI can also automatically encode and decode, but
that is sometimes limited to the UTF-8 encoding.
=head2 Why do regex character classes sometimes match only in the ASCII range?
-=head2 Why do some characters not uppercase or lowercase correctly?
-
-It seemed like a good idea at the time, to keep the semantics the same for
-standard strings, when Perl got Unicode support. While it might be repaired
-in the future, we now have to deal with the fact that Perl treats equal
-strings differently, depending on the internal state.
+Starting in Perl 5.14 (and partially in Perl 5.12), just put a
+C<use feature 'unicode_strings'> near the beginning of your program.
+Within its lexical scope you shouldn't have this problem. It also is
+automatically enabled under C<use feature ':5.12'> or C<use v5.12> or
+using C<-E> on the command line for Perl 5.12 or higher.
+
+The rationale for requiring this is to not break older programs that
+rely on the way things worked before Unicode came along. Those older
+programs knew only about the ASCII character set, and so may not work
+properly for additional characters. When a string is encoded in UTF-8,
+Perl assumes that the program is prepared to deal with Unicode, but when
+the string isn't, Perl assumes that only ASCII
+is wanted, and so those characters that are not ASCII
+characters aren't recognized as to what they would be in Unicode.
+C<use feature 'unicode_strings'> tells Perl to treat all characters as
+Unicode, whether the string is encoded in UTF-8 or not, thus avoiding
+the problem.
+
+However, on earlier Perls, or if you pass strings to subroutines outside
+the feature's scope, you can force Unicode rules by changing the
+encoding to UTF-8 by doing C<utf8::upgrade($string)>. This can be used
+safely on any string, as it checks and does not change strings that have
+already been upgraded.
-Affected are C<uc>, C<lc>, C<ucfirst>, C<lcfirst>, C<\U>, C<\L>, C<\u>, C<\l>,
-C<\d>, C<\s>, C<\w>, C<\D>, C<\S>, C<\W>, C</.../i>, C<(?i:...)>,
-C</[[:posix:]]/>.
+For a more detailed discussion, see L<Unicode::Semantics> on CPAN.
-To force Unicode semantics, you can upgrade the internal representation to
-by doing C<utf8::upgrade($string)>. This does not change strings that were
-already upgraded.
+=head2 Why do some characters not uppercase or lowercase correctly?
-For a more detailed discussion, see L<Unicode::Semantics> on CPAN.
+See the answer to the previous question.
=head2 How can I determine if a string is a text string or a binary string?
This is a term used both for characters with an ordinal value greater than 127,
characters with an ordinal value greater than 255, or any character occupying
-than one byte, depending on the context.
+more than one byte, depending on the context.
The Perl warning "Wide character in ..." is caused by a character with an
ordinal value greater than 255. With no specified encoding layer, Perl tries to
The UTF8 flag, also called SvUTF8, is an internal flag that indicates that the
current internal representation is UTF-8. Without the flag, it is assumed to be
-ISO-8859-1. Perl converts between these automatically.
+ISO-8859-1. Perl converts between these automatically. (Actually Perl usually
+assumes the representation is ASCII; see L</Why do regex character classes
+sometimes match only in the ASCII range?> above.)
One of Perl's internal formats happens to be UTF-8. Unfortunately, Perl can't
keep a secret, so everyone knows about this. That is the source of much
but this is considered bad style. Especially C<_utf8_on> can be dangerous, for
the same reason that C<:utf8> can.
-There are some shortcuts for oneliners; see C<-C> in L<perlrun>.
+There are some shortcuts for oneliners;
+see L<-C|perlrun/-C [numberE<sol>list]> in L<perlrun>.
=head2 What's the difference between C<UTF-8> and C<utf8>?