constants: you cannot use variables in them. If you want similar
run-time functionality, use C<chr()> and C<charnames::vianame()>.
-Also note that if all the code points for pack "U" are below 0x100,
-bytes will be generated, just like if you were using C<chr()>.
-
- my $bytes = pack("U*", 0x80, 0xFF);
-
If you want to force the result to Unicode characters, use the special
C<"U0"> prefix. It consumes no arguments but forces the result to be
in Unicode characters, instead of bytes.
- my $chars = pack("U0U*", 0x80, 0xFF);
+ my $chars = pack("U0C*", 0x80, 0x42);
+
+Likewise, you can force the result to be bytes by using the special
+C<"C0"> prefix.
=head2 Handling Unicode
will work on the Unicode characters (see L<perlunicode> and L<perlretut>).
Note that Perl considers combining character sequences to be
-characters, so for example
+separate characters, so for example
use charnames ':full';
print length("\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}"), "\n";
The C<Encode> module knows about many encodings and has interfaces
for doing conversions between those encodings:
- use Encode 'from_to';
- from_to($data, "iso-8859-3", "utf-8"); # from legacy to utf-8
+ use Encode 'decode';
+ $data = decode("iso-8859-3", $data); # convert from legacy to utf-8
=head2 Unicode I/O
perl -MDevel::Peek -e 'Dump(chr(0x100))'
-That shows the UTF8 flag in FLAGS and both the UTF-8 bytes
+That shows the C<UTF8> flag in FLAGS and both the UTF-8 bytes
and Unicode characters in C<PV>. See also later in this document
the discussion about the C<utf8::is_utf8()> function.
as a single byte encoding. If the flag is on, the bytes in the scalar
are interpreted as the (multi-byte, variable-length) UTF-8 encoded code
points of the characters. Bytes added to a UTF-8 encoded string are
-automatically upgraded to UTF-8. If mixed non-UTF8 and UTF-8 scalars
+automatically upgraded to UTF-8. If mixed non-UTF-8 and UTF-8 scalars
are merged (double-quoted interpolation, explicit concatenation, and
printf/sprintf parameter substitution), the result will be UTF-8 encoded
as if copies of the byte strings were upgraded to UTF-8: for example,