This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5
Better explain utf8::upgrade/downgrade/encode/decode
[perl5.git] / lib / utf8.pm
package utf8;

# Bit in $^H (the compile-time hints variable) that marks UTF-8 source text.
$utf8::hint_bits = 0x00800000;

our $VERSION = '1.08';

# "use utf8": set the hint bit for the enclosing lexical scope.
sub import {
    $^H |= $utf8::hint_bits;
    $enc{caller()} = $_[1] if $_[1];
}

# "no utf8": clear the hint bit again.
sub unimport {
    $^H &= ~$utf8::hint_bits;
}

# Load the remaining utf8:: routines on demand from utf8_heavy.pl.
sub AUTOLOAD {
    require "utf8_heavy.pl";
    goto &$AUTOLOAD if defined &$AUTOLOAD;
    require Carp;
    Carp::croak("Undefined subroutine $AUTOLOAD called");
}

1;
__END__

=head1 NAME

utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source code

=head1 SYNOPSIS

    use utf8;
    no utf8;

    # Convert the internal representation of a Perl scalar to/from UTF-8.

    $num_octets = utf8::upgrade($string);
    $success    = utf8::downgrade($string[, FAIL_OK]);

    # Change each character of a Perl scalar to/from a series of
    # characters that represent the UTF-8 bytes of each original character.

    utf8::encode($string);  # "\x{100}"  becomes "\xc4\x80"
    utf8::decode($string);  # "\xc4\x80" becomes "\x{100}"

    $flag = utf8::is_utf8(STRING); # since Perl 5.8.1
    $flag = utf8::valid(STRING);

=head1 DESCRIPTION

The C<use utf8> pragma tells the Perl parser to allow UTF-8 in the
program text in the current lexical scope (allow UTF-EBCDIC on EBCDIC based
platforms).  The C<no utf8> pragma tells Perl to switch back to treating
the source text as literal bytes in the current lexical scope.

B<Do not use this pragma for anything other than telling Perl that your
script is written in UTF-8.> The utility functions described below are
directly usable without C<use utf8;>.

Because it is not possible to reliably tell UTF-8 from native 8 bit
encodings, you need either a Byte Order Mark at the beginning of your
source code, or C<use utf8;>, to instruct perl.

When UTF-8 becomes the standard source format, this pragma will
effectively become a no-op.  For convenience in what follows the term
I<UTF-X> is used to refer to UTF-8 on ASCII and ISO Latin based
platforms and UTF-EBCDIC on EBCDIC based platforms.

See also the effects of the C<-C> switch and its cousin, the
C<$ENV{PERL_UNICODE}> environment variable, in L<perlrun>.

Enabling the C<utf8> pragma has the following effect:

=over 4

=item *

Bytes in the source text that have their high-bit set will be treated
as being part of a literal UTF-X sequence.  This includes most
literals such as identifier names, string constants, and constant
regular expression patterns.

On EBCDIC platforms characters in the Latin 1 character set are
treated as being part of a literal UTF-EBCDIC character.

=back

Note that if you have bytes with the eighth bit on in your script
(for example embedded Latin-1 in your string literals), C<use utf8>
will be unhappy since the bytes are most probably not well-formed
UTF-X.  If you want to have such bytes under C<use utf8>, you can disable
this pragma until the end of the block (or file, if at top level) by
C<no utf8;>.
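
The lexical scoping described above can be sketched as follows. This is
an illustration only: it assumes the file holding the snippet is saved as
UTF-8 on an ASCII-based platform, and the variable names are made up for
the example.

```perl
use strict;
use warnings;

my ($as_chars, $as_bytes);

{
    use utf8;             # literals in this block are read as UTF-8 text
    $as_chars = "é";      # one character, U+00E9
}

{
    no utf8;              # literals in this block are read as raw bytes
    $as_bytes = "é";      # two characters, 0xC3 and 0xA9, if the file is UTF-8
}

printf "%d %d\n", length($as_chars), length($as_bytes);
```

Under those assumptions the two lengths differ (1 and 2), even though the
two literals look identical in an editor.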

=head2 Utility functions

The following functions are defined in the C<utf8::> package by the
Perl core.  You do not need to say C<use utf8> to use these and in fact
you should not say that unless you really want to have UTF-8 source code.

=over 4

=item * $num_octets = utf8::upgrade($string)

Converts in-place the internal representation of the string from an octet
sequence in the native encoding (Latin-1 or EBCDIC) to I<UTF-X>.  The
logical character sequence itself is unchanged.  If I<$string> is already
stored as I<UTF-X>, then this is a no-op.  Returns the
number of octets necessary to represent the string as I<UTF-X>.  Can be
used to make sure that the UTF-8 flag is on, so that C<\w> or C<lc()>
work as Unicode on strings containing characters in the range 0x80-0xFF
(on ASCII and derivatives).

B<Note that this function does not handle arbitrary encodings.>
For general-purpose encoding conversion, L<Encode> is recommended.
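
A minimal sketch of the behaviour described above, assuming an
ASCII-based platform.  The variable names are illustrative.  The C<\w>
result depends on the perl version and on features such as
C<unicode_strings>, so the snippet prints it rather than promising a
value.

```perl
use strict;
use warnings;

my $s = "caf\xE9";                        # four characters, stored as native bytes

my $flag_before  = utf8::is_utf8($s) ? 1 : 0;
my $match_before = $s =~ /\w\z/ ? 1 : 0;  # is the last character a word character?

my $octets = utf8::upgrade($s);           # changes the storage, not the characters

my $flag_after  = utf8::is_utf8($s) ? 1 : 0;
my $match_after = $s =~ /\w\z/ ? 1 : 0;   # may differ from $match_before

printf "flag %d -> %d, \\w %d -> %d, length %d, octets %d\n",
    $flag_before, $flag_after, $match_before, $match_after,
    length($s), $octets;
```

The length stays at four characters while the octet count grows to five,
because only the character 0xE9 needs two octets in UTF-8.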

=item * $success = utf8::downgrade($string[, FAIL_OK])

Converts in-place the internal representation of the string from
I<UTF-X> to the equivalent octet sequence in the native encoding (Latin-1
or EBCDIC).  The logical character sequence itself is unchanged.  If
I<$string> is already stored as native 8 bit, then this is a no-op.  Can
be used to make sure that the UTF-8 flag is off, e.g. when you want to
make sure that the substr() or length() function works with the usually
faster byte algorithm.

Fails if the original I<UTF-X> sequence cannot be represented in the
native 8 bit encoding.  On failure dies or, if the value of C<FAIL_OK> is
true, returns false.

Returns true on success.

B<Note that this function does not handle arbitrary encodings.>
For general-purpose encoding conversion, L<Encode> is recommended.
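
The three outcomes described above (success, a soft failure with
C<FAIL_OK>, and a fatal failure without it) might be exercised like
this.  The variable names are illustrative only.

```perl
use strict;
use warnings;

# Success: every character fits into the native 8 bit encoding.
my $narrow = "caf\xE9";
utf8::upgrade($narrow);
my $ok = utf8::downgrade($narrow);

# Soft failure: U+0100 has no 8 bit form; a true FAIL_OK avoids the die.
my $wide = "\x{100}";
my $soft = utf8::downgrade($wide, 1);

# Hard failure: without FAIL_OK the same call dies, so trap it with eval.
my $died = !eval { utf8::downgrade($wide); 1 };

printf "ok=%d soft=%d died=%d\n", $ok ? 1 : 0, $soft ? 1 : 0, $died ? 1 : 0;
```

Checking the return value with C<FAIL_OK> set is usually gentler than
wrapping the call in C<eval>, since nothing has to be thrown and caught.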

=item * utf8::encode($string)

Converts in-place the character sequence to the corresponding octet
sequence in I<UTF-X>.  That is, every (possibly wide) character gets
replaced with a sequence of one or more characters that represent the
individual I<UTF-X> bytes of the character.  The UTF8 flag is turned off.
Returns nothing.

    my $a = "\x{100}"; # $a contains one character, with ord 0x100
    utf8::encode($a);  # $a contains two characters, with ords 0xc4 and 0x80

B<Note that this function does not handle arbitrary encodings.>
For general-purpose encoding conversion, L<Encode> is recommended.

=item * $success = utf8::decode($string)

Attempts to convert in-place the octet sequence in I<UTF-X> to the
corresponding character sequence.  That is, it replaces each sequence of
characters in the string whose ords represent a valid UTF-X byte
sequence, with the corresponding single character.  The UTF-8 flag is
turned on only if the source string contains multiple-byte I<UTF-X>
characters.  If I<$string> is invalid as I<UTF-X>, returns false;
otherwise returns true.

    my $a = "\xc4\x80"; # $a contains two characters, with ords 0xc4 and 0x80
    utf8::decode($a);   # $a contains one character, with ord 0x100

B<Note that this function does not handle arbitrary encodings.>
For general-purpose encoding conversion, L<Encode> is recommended.
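
As a rough sketch, an encode/decode round trip and the failure case can
be checked like this.  It assumes an ASCII-based platform (so I<UTF-X>
is UTF-8), and the variable names are illustrative.

```perl
use strict;
use warnings;

my $text  = "\x{100}\x{263A}";     # two wide characters
my $bytes = $text;

utf8::encode($bytes);              # now five octets: c4 80 e2 98 ba
my $octet_count = length $bytes;

my $ok = utf8::decode($bytes);     # true; $bytes holds two characters again

my $bad    = "\xC4";               # lead byte with no continuation byte
my $bad_ok = utf8::decode($bad);   # false; $bad is left untouched

printf "octets=%d ok=%d same=%d bad_ok=%d\n",
    $octet_count, $ok ? 1 : 0, $bytes eq $text ? 1 : 0, $bad_ok ? 1 : 0;
```

Because C<utf8::decode> signals bad input only through its return value,
code that handles untrusted data should test that value.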

=item * $flag = utf8::is_utf8(STRING)

(Since Perl 5.8.1)  Test whether STRING is in UTF-8 internally.
Functionally the same as Encode::is_utf8().

=item * $flag = utf8::valid(STRING)

[INTERNAL] Test whether STRING is in a consistent state regarding
UTF-8.  Returns true if STRING is well-formed UTF-8 and has the UTF-8
flag on B<or> if STRING is held as bytes (both these states are
'consistent').  The main reason for this routine is to allow Perl's
test suite to check that operations have left strings in a consistent
state.  You most probably want to use utf8::is_utf8() instead.

=back

C<utf8::encode> is like C<utf8::upgrade>, but the UTF8 flag is
cleared.  See L<perlunicode> for more on the UTF8 flag and the C API
functions C<sv_utf8_upgrade>, C<sv_utf8_downgrade>, C<sv_utf8_encode>,
and C<sv_utf8_decode>, which are wrapped by the Perl functions
C<utf8::upgrade>, C<utf8::downgrade>, C<utf8::encode> and
C<utf8::decode>.  Also, the functions utf8::is_utf8, utf8::valid,
utf8::encode, utf8::decode, utf8::upgrade, and utf8::downgrade are
actually internal, and thus always available, without a C<require utf8>
statement.
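
Seen from Perl code, the difference between C<utf8::upgrade> and
C<utf8::encode> could be illustrated as below.  This is a sketch
assuming an ASCII-based platform; the variable names are made up for
the example.

```perl
use strict;
use warnings;

my $upgraded = "\xE9";
my $encoded  = "\xE9";

utf8::upgrade($upgraded);   # storage changes; still one character, ord 0xE9
utf8::encode($encoded);     # content changes; two characters, 0xC3 and 0xA9

printf "upgraded: length %d, flag %d\n",
    length($upgraded), utf8::is_utf8($upgraded) ? 1 : 0;
printf "encoded:  length %d, flag %d\n",
    length($encoded), utf8::is_utf8($encoded) ? 1 : 0;
```

After the calls, C<$upgraded> still compares equal to C<"\xE9">, while
C<$encoded> no longer does.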

=head1 BUGS

One can have Unicode in identifier names, but not in package/class or
subroutine names.  While some limited functionality towards this does
exist as of Perl 5.8.0, that is more accidental than designed; use of
Unicode for the said purposes is unsupported.

One reason for this unfinished state is its (currently) inherent
unportability: since both package names and subroutine names may need
to be mapped to file and directory names, the Unicode capability of
the filesystem becomes important, and unfortunately there are no
portable answers.

=head1 SEE ALSO

L<perlunitut>, L<perluniintro>, L<perlrun>, L<bytes>, L<perlunicode>

=cut