A Unicode I<character> is an abstract entity. It is not bound to any
particular integer width, especially not to the C language C<char>.
Unicode is language-neutral and display-neutral: it does not encode the
-language of the text and it does not define fonts or other graphical
+language of the text and it does not generally define fonts or other graphical
layout details. Unicode operates on characters and on text built from
those characters.
problems of the initial Unicode implementation, but for example
regular expressions still do not work with Unicode in 5.6.1.
-B<Starting from Perl 5.8.0, the use of C<use utf8> is no longer
-necessary.> In earlier releases the C<utf8> pragma was used to declare
+B<Starting from Perl 5.8.0, the use of C<use utf8> is needed only in much more restricted circumstances.> In earlier releases the C<utf8> pragma was used to declare
that operations in the current block or file would be Unicode-aware.
This model was found to be wrong, or at least clumsy: the "Unicodeness"
is now carried with the data, instead of being attached to the
For example,
- perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'
+ perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'
produces a fairly useless mixture of native bytes and UTF-8, as well
as a warning:
Wide character in print at ...
-To output UTF-8, use the C<:utf8> output layer. Prepending
+To output UTF-8, use the C<:encoding> or C<:utf8> output layer. Prepending
binmode(STDOUT, ":utf8");
constants: you cannot use variables in them. if you want similar
run-time functionality, use C<chr()> and C<charnames::vianame()>.
-Also note that if all the code points for pack "U" are below 0x100,
-bytes will be generated, just like if you were using C<chr()>.
-
- my $bytes = pack("U*", 0x80, 0xFF);
-
If you want to force the result to Unicode characters, use the special
-C<"U0"> prefix. It consumes no arguments but forces the result to be
-in Unicode characters, instead of bytes.
+C<"U0"> prefix. It consumes no arguments but causes the following bytes
+to be interpreted as the UTF-8 encoding of Unicode characters:
- my $chars = pack("U0U*", 0x80, 0xFF);
+ my $chars = pack("U0W*", 0x80, 0x42);
+
+Likewise, you can stop such UTF-8 interpretation by using the special
+C<"C0"> prefix.
=head2 Handling Unicode
When you combine legacy data and Unicode the legacy data needs
to be upgraded to Unicode. Normally ISO 8859-1 (or EBCDIC, if
-applicable) is assumed. You can override this assumption by
-using the C<encoding> pragma, for example
-
- use encoding 'latin2'; # ISO 8859-2
-
-in which case literals (string or regular expressions), C<chr()>,
-and C<ord()> in your whole script are assumed to produce Unicode
-characters from ISO 8859-2 code points. Note that the matching for
-encoding names is forgiving: instead of C<latin2> you could have
-said C<Latin 2>, or C<iso8859-2>, or other variations. With just
-
- use encoding;
-
-the environment variable C<PERL_ENCODING> will be consulted.
-If that variable isn't set, the encoding pragma will fail.
+applicable) is assumed.
The C<Encode> module knows about many encodings and has interfaces
for doing conversions between those encodings:
The matching of encoding names is loose: case does not matter, and
many encodings have several aliases. Note that the C<:utf8> layer
must always be specified exactly like that; it is I<not> subject to
-the loose matching of encoding names.
+the loose matching of encoding names. Also note that C<:utf8> is unsafe for
+input, because it accepts the data without validating that it is indeed valid
+UTF8.
See L<PerlIO> for the C<:utf8> layer, L<PerlIO::encoding> and
L<Encode::PerlIO> for the C<:encoding()> layer, and
Unicode in Perl's eyes. To do that, specify the appropriate
layer when opening files
- open(my $fh,'<:utf8', 'anything');
+ open(my $fh,'<:encoding(utf8)', 'anything');
my $line_of_unicode = <$fh>;
open(my $fh,'<:encoding(Big5)', 'anything');
The I/O layers can also be specified more flexibly with
the C<open> pragma. See L<open>, or look at the following example.
- use open ':utf8'; # input and output default layer will be UTF-8
+ use open ':encoding(utf8)'; # input/output default encoding will be UTF-8
open X, ">file";
print X chr(0x100), "\n";
close X;
printf "%#x\n", ord(<I>), "\n"; # this should print 0xc1
close I;
-or you can also use the C<':encoding(...)'> layer
-
- open(my $epic,'<:encoding(iso-8859-7)','iliad.greek');
- my $line_of_unicode = <$epic>;
-
These methods install a transparent filter on the I/O stream that
converts data from the specified encoding when it is read in from the
stream. The result is always Unicode.
while (<$nihongo>) { print $unicode $_ }
The naming of encodings, both by the C<open()> and by the C<open>
-pragma, is similar to the C<encoding> pragma in that it allows for
-flexible names: C<koi8-r> and C<KOI8R> will both be understood.
+pragma allows for flexible names: C<koi8-r> and C<KOI8R> will both be
+understood.
Common encodings recognized by ISO, MIME, IANA, and various other
standardisation organisations are recognised; for a more detailed
local $/; ## read in the whole file of 8-bit characters
$t = <F>;
close F;
- open F, ">:utf8", "file";
+ open F, ">:encoding(utf8)", "file";
print F $t; ## convert to UTF-8 on output
close F;
If you run this code twice, the contents of the F<file> will be twice
-UTF-8 encoded. A C<use open ':utf8'> would have avoided the bug, or
-explicitly opening also the F<file> for input as UTF-8.
+UTF-8 encoded. A C<use open ':encoding(utf8)'> would have avoided the
+bug, or explicitly opening also the F<file> for input as UTF-8.
B<NOTE>: the C<:utf8> and C<:encoding> features work only if your
Perl has been built with the new PerlIO feature (which is the default
chr($_) =~ /[[:cntrl:]]/ ? # else if control character ...
sprintf("\\x%02X", $_) : # \x..
quotemeta(chr($_)) # else quoted or as themselves
- } unpack("U*", $_[0])); # unpack Unicode characters
+ } unpack("W*", $_[0])); # unpack Unicode characters
}
For example,
ways of looking behind the scenes.
One way of peeking inside the internal encoding of Unicode characters
-is to use C<unpack("C*", ...> to get the bytes or C<unpack("H*", ...)>
-to display the bytes:
+is to use C<unpack("C*", ...> to get the bytes of whatever the string
+encoding happens to be, or C<unpack("U0..", ...)> to get the bytes of the
+UTF-8 encoding:
# this prints c4 80 for the UTF-8 bytes 0xc4 0x80
- print join(" ", unpack("H*", pack("U", 0x100))), "\n";
+ print join(" ", unpack("U0(H2)*", pack("U", 0x100))), "\n";
Yet another way would be to use the Devel::Peek module:
The long answer is that you need to consider character normalization
and casing issues: see L<Unicode::Normalize>, Unicode Technical
Reports #15 and #21, I<Unicode Normalization Forms> and I<Case
-Mappings>, http://www.unicode.org/unicode/reports/tr15/ and
-http://www.unicode.org/unicode/reports/tr21/
+Mappings>, L<http://www.unicode.org/unicode/reports/tr15/> and
+L<http://www.unicode.org/unicode/reports/tr21/>
As of Perl 5.8.0, the "Full" case-folding of I<Case
Mappings/SpecialCasing> is implemented.
The long answer is that "it depends", and a good answer cannot be
given without knowing (at the very least) the language context.
See L<Unicode::Collate>, and I<Unicode Collation Algorithm>
-http://www.unicode.org/unicode/reports/tr10/
+L<http://www.unicode.org/unicode/reports/tr10/>
=back
Character ranges in regular expression character classes (C</[a-z]/>)
and in the C<tr///> (also known as C<y///>) operator are not magically
-Unicode-aware. What this means that C<[A-Za-z]> will not magically start
+Unicode-aware. What this means is that C<[A-Za-z]> will not magically start
to mean "all alphabetic letters"; not that it does mean that even for
8-bit characters, you should be using C</[[:alpha:]]/> in that case.
How Do I Know Whether My String Is In Unicode?
-You shouldn't care. No, you really shouldn't. No, really. If you
-have to care--beyond the cases described above--it means that we
-didn't get the transparency of Unicode quite right.
+You shouldn't have to care. But you may, because currently the semantics of the
+characters whose ordinals are in the range 128 to 255 is different depending on
+whether the string they are contained within is in Unicode or not.
+(See L<perlunicode>.)
-Okay, if you insist:
+To determine if a string is in Unicode, use:
print utf8::is_utf8($string) ? 1 : 0, "\n";
Sometimes you might really need to know the byte length of a string
instead of the character length. For that use either the
-C<Encode::encode_utf8()> function or the C<bytes> pragma and its only
-defined function C<length()>:
+C<Encode::encode_utf8()> function or the C<bytes> pragma and
+the C<length()> function:
my $unicode = chr(0x100);
print length($unicode), "\n"; # will print 1
Use the C<Encode> package to try converting it.
For example,
- use Encode 'encode_utf8';
- if (encode_utf8($string_of_bytes_that_I_think_is_utf8)) {
- # valid
+ use Encode 'decode_utf8';
+
+ if (eval { decode_utf8($string, Encode::FB_CROAK); 1 }) {
+ # $string is valid utf8
} else {
- # invalid
+ # $string is not valid utf8
}
-For UTF-8 only, you can use:
+Or use C<unpack> to try decoding it:
use warnings;
- @chars = unpack("U0U*", $string_of_bytes_that_I_think_is_utf8);
+ @chars = unpack("C0U*", $string_of_bytes_that_I_think_is_utf8);
-If invalid, a C<Malformed UTF-8 character (byte 0x##) in unpack>
-warning is produced. The "U0" means "expect strictly UTF-8 encoded
-Unicode". Without that the C<unpack("U*", ...)> would accept also
-data like C<chr(0xFF>), similarly to the C<pack> as we saw earlier.
+If invalid, a C<Malformed UTF-8 character> warning is produced. The "C0" means
+"process the string character per character". Without that, the
+C<unpack("U*", ...)> would work in C<U0> mode (the default if the format
+string starts with C<U>) and it would return the bytes making up the UTF-8
+encoding of the target string, something that will always work.
=item *
native 8-bit encoding (e.g. Latin-1, EBCDIC, etc.), you can use
pack/unpack to convert to/from Unicode.
- $native_string = pack("C*", unpack("U*", $Unicode_string));
- $Unicode_string = pack("U*", unpack("C*", $native_string));
+ $native_string = pack("W*", unpack("U*", $Unicode_string));
+ $Unicode_string = pack("U*", unpack("W*", $native_string));
If you have a sequence of bytes you B<know> is valid UTF-8,
but Perl doesn't know it yet, you can make Perl a believer, too:
use Encode 'decode_utf8';
$Unicode = decode_utf8($bytes);
-You can convert well-formed UTF-8 to a sequence of bytes, but if
-you just want to convert random binary data into UTF-8, you can't.
-B<Any random collection of bytes isn't well-formed UTF-8>. You can
-use C<unpack("C*", $string)> for the former, and you can create
-well-formed Unicode data by C<pack("U*", 0xff, ...)>.
+or:
+
+ $Unicode = pack("U0a*", $bytes);
+
+You can find the bytes that make up a UTF-8 sequence with
+
+ @bytes = unpack("C*", $Unicode_string)
+
+and you can create well-formed Unicode with
+
+ $Unicode_string = pack("U*", 0xff, ...)
=item *
How Do I Display Unicode? How Do I Input Unicode?
-See http://www.alanwood.net/unicode/ and
-http://www.cl.cam.ac.uk/~mgk25/unicode.html
+See L<http://www.alanwood.net/unicode/> and
+L<http://www.cl.cam.ac.uk/~mgk25/unicode.html>
=item *
Unicode Consortium
- http://www.unicode.org/
+L<http://www.unicode.org/>
=item *
Unicode FAQ
- http://www.unicode.org/unicode/faq/
+L<http://www.unicode.org/unicode/faq/>
=item *
Unicode Glossary
- http://www.unicode.org/glossary/
+L<http://www.unicode.org/glossary/>
=item *
Unicode Useful Resources
- http://www.unicode.org/unicode/onlinedat/resources.html
+L<http://www.unicode.org/unicode/onlinedat/resources.html>
=item *
Unicode and Multilingual Support in HTML, Fonts, Web Browsers and Other Applications
- http://www.alanwood.net/unicode/
+L<http://www.alanwood.net/unicode/>
=item *
UTF-8 and Unicode FAQ for Unix/Linux
- http://www.cl.cam.ac.uk/~mgk25/unicode.html
+L<http://www.cl.cam.ac.uk/~mgk25/unicode.html>
=item *
Legacy Character Sets
- http://www.czyborra.com/
- http://www.eki.ee/letter/
+L<http://www.czyborra.com/>
+L<http://www.eki.ee/letter/>
=item *
$Config{installprivlib}/unicore
-in Perl 5.8.0 or newer, and
+in Perl 5.8.0 or newer, and
$Config{installprivlib}/unicode
=head1 SEE ALSO
-L<perlunicode>, L<Encode>, L<encoding>, L<open>, L<utf8>, L<bytes>,
+L<perlunitut>, L<perlunicode>, L<Encode>, L<open>, L<utf8>, L<bytes>,
L<perlretut>, L<perlrun>, L<Unicode::Collate>, L<Unicode::Normalize>,
L<Unicode::UCD>