$utf8::hint_bits = 0x00800000;
-our $VERSION = '1.06';
+our $VERSION = '1.10';
sub import {
$^H |= $utf8::hint_bits;
use utf8;
no utf8;
- # Convert a Perl scalar to/from UTF-8.
+ # Convert the internal representation of a Perl scalar to/from UTF-8.
+
$num_octets = utf8::upgrade($string);
$success = utf8::downgrade($string[, FAIL_OK]);
- # Change the native bytes of a Perl scalar to/from UTF-8 bytes.
- utf8::encode($string);
- utf8::decode($string);
+ # Change each character of a Perl scalar to/from a series of
+ # characters that represent the UTF-8 bytes of each original character.
+
+ utf8::encode($string); # "\x{100}" becomes "\xc4\x80"
+ utf8::decode($string); # "\xc4\x80" becomes "\x{100}"
$flag = utf8::is_utf8(STRING); # since Perl 5.8.1
$flag = utf8::valid(STRING);
=item * $num_octets = utf8::upgrade($string)
-Converts in-place the internal octet sequence in the native encoding
-(Latin-1 or EBCDIC) to the equivalent character sequence in I<UTF-X>.
-I<$string> already encoded as characters does no harm. Returns the
+Converts in-place the internal representation of the string from an octet
+sequence in the native encoding (Latin-1 or EBCDIC) to I<UTF-X>. The
+logical character sequence itself is unchanged. If I<$string> is already
+stored as I<UTF-X>, then this is a no-op. Returns the
number of octets necessary to represent the string as I<UTF-X>. Can be
used to make sure that the UTF-8 flag is on, so that C<\w> or C<lc()>
work as Unicode on strings containing characters in the range 0x80-0xFF
=item * $success = utf8::downgrade($string[, FAIL_OK])
-Converts in-place the internal octet sequence in I<UTF-X> to the
-equivalent octet sequence in the native encoding (Latin-1 or EBCDIC).
-I<$string> already encoded as native 8 bit does no harm. Can be used to
+Converts in-place the internal representation of the string from
+I<UTF-X> to the equivalent octet sequence in the native encoding (Latin-1
+or EBCDIC). The logical character sequence itself is unchanged. If
+I<$string> is already stored in the native 8-bit encoding, then this is
+a no-op. Can be used to
make sure that the UTF-8 flag is off, e.g. when you want to make sure
that the substr() or length() function works with the usually faster
byte algorithm.
Therefore Encode is recommended for the general purposes; see also
L<Encode>.
-B<NOTE:> this function is experimental and may change or be removed
-without notice.
-
=item * utf8::encode($string)
Converts in-place the character sequence to the corresponding octet
-sequence in I<UTF-X>. The UTF8 flag is turned off, so that after this
-operation, the string is a byte string. Returns nothing.
+sequence in I<UTF-X>. That is, every (possibly wide) character gets
+replaced with a sequence of one or more characters that represent the
+individual I<UTF-X> bytes of the character. The UTF-8 flag is turned off.
+Returns nothing.
+
+ my $a = "\x{100}"; # $a contains one character, with ord 0x100
+ utf8::encode($a); # $a contains two characters, with ords 0xc4 and 0x80
B<Note that this function does not handle arbitrary encodings.>
Therefore Encode is recommended for the general purposes; see also
=item * $success = utf8::decode($string)
Attempts to convert in-place the octet sequence in I<UTF-X> to the
-corresponding character sequence. The UTF-8 flag is turned on only if
-the source string contains multiple-byte I<UTF-X> characters. If
-I<$string> is invalid as I<UTF-X>, returns false; otherwise returns
-true.
+corresponding character sequence. That is, it replaces each sequence of
+characters in the string whose ords represent a valid I<UTF-X> byte
+sequence, with the corresponding single character. The UTF-8 flag is
+turned on only if the source string contains multiple-byte I<UTF-X>
+characters. If I<$string> is invalid as I<UTF-X>, returns false;
+otherwise returns true.
+
+ my $a = "\xc4\x80"; # $a contains two characters, with ords 0xc4 and 0x80
+ utf8::decode($a); # $a contains one character, with ord 0x100
B<Note that this function does not handle arbitrary encodings.>
Therefore Encode is recommended for the general purposes; see also
L<Encode>.
-B<NOTE:> this function is experimental and may change or be removed
-without notice.
-
=item * $flag = utf8::is_utf8(STRING)
-(Since Perl 5.8.1) Test whether STRING is in UTF-8 internally.
+(Since Perl 5.8.1) Test whether STRING is encoded internally in UTF-8.
Functionally the same as Encode::is_utf8().
=item * $flag = utf8::valid(STRING)
[INTERNAL] Test whether STRING is in a consistent state regarding
-UTF-8. Will return true is well-formed UTF-8 and has the UTF-8 flag
-on B<or> if string is held as bytes (both these states are 'consistent').
+UTF-8. Will return true if it is well-formed UTF-8 and has the UTF-8 flag
+on B<or> if STRING is held as bytes (both these states are 'consistent').
Main reason for this routine is to allow Perl's testsuite to check
that operations have left strings in a consistent state. You most
probably want to use utf8::is_utf8() instead.
functions C<sv_utf8_upgrade>, C<sv_utf8_downgrade>, C<sv_utf8_encode>,
and C<sv_utf8_decode>, which are wrapped by the Perl functions
C<utf8::upgrade>, C<utf8::downgrade>, C<utf8::encode> and
-C<utf8::decode>. Note that in the Perl 5.8.0 and 5.8.1 implementation
-the functions utf8::is_utf8, utf8::valid, utf8::encode, utf8::decode,
-utf8::upgrade, and utf8::downgrade are always available, without a
-C<require utf8> statement-- this may change in future releases.
+C<utf8::decode>. Also, the functions utf8::is_utf8, utf8::valid,
+utf8::encode, utf8::decode, utf8::upgrade, and utf8::downgrade are
+actually internal, and thus always available, without a C<require utf8>
+statement.
=head1 BUGS