This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5
FAIL_OK looks too much like a constant [RT##118279]
[perl5.git] / lib / utf8.pm
package utf8;

$utf8::hint_bits = 0x00800000;

our $VERSION = '1.12';

sub import {
    $^H |= $utf8::hint_bits;
}

sub unimport {
    $^H &= ~$utf8::hint_bits;
}

sub AUTOLOAD {
    require "utf8_heavy.pl";
    goto &$AUTOLOAD if defined &$AUTOLOAD;
    require Carp;
    Carp::croak("Undefined subroutine $AUTOLOAD called");
}

1;
__END__

=head1 NAME

utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source code

=head1 SYNOPSIS

    use utf8;
    no utf8;

    # Convert the internal representation of a Perl scalar to/from UTF-8.

    $num_octets = utf8::upgrade($string);
    $success    = utf8::downgrade($string[, $fail_ok]);

    # Change each character of a Perl scalar to/from a series of
    # characters that represent the UTF-8 bytes of each original character.

    utf8::encode($string);  # "\x{100}"  becomes "\xc4\x80"
    utf8::decode($string);  # "\xc4\x80" becomes "\x{100}"

    $flag = utf8::is_utf8(STRING); # since Perl 5.8.1
    $flag = utf8::valid(STRING);

=head1 DESCRIPTION

The C<use utf8> pragma tells the Perl parser to allow UTF-8 in the
program text in the current lexical scope (allow UTF-EBCDIC on EBCDIC based
platforms). The C<no utf8> pragma tells Perl to switch back to treating
the source text as literal bytes in the current lexical scope.

B<Do not use this pragma for anything other than telling Perl that your
script is written in UTF-8.> The utility functions described below are
directly usable without C<use utf8;>.

Because it is not possible to reliably tell UTF-8 from native 8 bit
encodings, you need either a Byte Order Mark at the beginning of your
source code, or C<use utf8;>, to instruct perl.

When UTF-8 becomes the standard source format, this pragma will
effectively become a no-op. For convenience in what follows the term
I<UTF-X> is used to refer to UTF-8 on ASCII and ISO Latin based
platforms and UTF-EBCDIC on EBCDIC based platforms.

See also the effects of the C<-C> switch and its cousin, the
C<$ENV{PERL_UNICODE}>, in L<perlrun>.

Enabling the C<utf8> pragma has the following effect:

=over 4

=item *

Bytes in the source text that have their high-bit set will be treated
as being part of a literal UTF-X sequence. This includes most
literals such as identifier names, string constants, and constant
regular expression patterns.

On EBCDIC platforms characters in the Latin 1 character set are
treated as being part of a literal UTF-EBCDIC character.

=back

Note that if you have bytes with the eighth bit on in your script
(for example embedded Latin-1 in your string literals), C<use utf8>
will be unhappy since the bytes are most probably not well-formed
UTF-X. If you want to have such bytes under C<use utf8>, you can disable
this pragma until the end of the block (or file, if at top level) with
C<no utf8;>.
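
As a minimal sketch of the lexical scoping described above, the same
string literal can be measured under both settings. This assumes the
file itself is saved as UTF-8 on an ASCII platform; the variable names
are illustrative only.

```perl
use strict;
use warnings;

my ($chars_on, $chars_off);

{
    use utf8;
    # The two source bytes 0xC3 0xA9 are parsed as one character, U+00E9.
    $chars_on = length("é");
}

{
    no utf8;
    # The same two source bytes are parsed as two separate characters.
    $chars_off = length("é");
}

print "use utf8: $chars_on, no utf8: $chars_off\n";
```

Under these assumptions the first length is 1 and the second is 2,
because the source bytes are decoded into a single character only while
C<use utf8> is in effect.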

=head2 Utility functions

The following functions are defined in the C<utf8::> package by the
Perl core. You do not need to say C<use utf8> to use these and in fact
you should not say that unless you really want to have UTF-8 source code.

=over 4

=item * $num_octets = utf8::upgrade($string)

Converts in-place the internal representation of the string from an octet
sequence in the native encoding (Latin-1 or EBCDIC) to I<UTF-X>. The
logical character sequence itself is unchanged. If I<$string> is already
stored as I<UTF-X>, then this is a no-op. Returns the
number of octets necessary to represent the string as I<UTF-X>. Can be
used to make sure that the UTF-8 flag is on, so that C<\w> or C<lc()>
work as Unicode on strings containing characters in the range 0x80-0xFF
(on ASCII and derivatives).

B<Note that this function does not handle arbitrary encodings.>
Therefore Encode is recommended for general purposes; see also
L<Encode>.

=item * $success = utf8::downgrade($string[, $fail_ok])

Converts in-place the internal representation of the string from
I<UTF-X> to the equivalent octet sequence in the native encoding (Latin-1
or EBCDIC). The logical character sequence itself is unchanged. If
I<$string> is already stored as native 8 bit, then this is a no-op. Can
be used to
make sure that the UTF-8 flag is off, e.g. when you want to make sure
that the substr() or length() function works with the usually faster
byte algorithm.

Fails if the original I<UTF-X> sequence cannot be represented in the
native 8 bit encoding. On failure dies or, if the value of C<$fail_ok> is
true, returns false.

Returns true on success.

B<Note that this function does not handle arbitrary encodings.>
Therefore Encode is recommended for general purposes; see also
L<Encode>.

=item * utf8::encode($string)

Converts in-place the character sequence to the corresponding octet
sequence in I<UTF-X>. That is, every (possibly wide) character gets
replaced with a sequence of one or more characters that represent the
individual I<UTF-X> bytes of the character. The UTF8 flag is turned off.
Returns nothing.

    my $a = "\x{100}"; # $a contains one character, with ord 0x100
    utf8::encode($a);  # $a contains two characters, with ords 0xc4 and
                       # 0x80

B<Note that this function does not handle arbitrary encodings.>
Therefore Encode is recommended for general purposes; see also
L<Encode>.

=item * $success = utf8::decode($string)

Attempts to convert in-place the octet sequence in I<UTF-X> to the
corresponding character sequence. That is, it replaces each sequence of
characters in the string whose ords represent a valid UTF-X byte
sequence, with the corresponding single character. The UTF-8 flag is
turned on only if the source string contains multiple-byte I<UTF-X>
characters. If I<$string> is invalid as I<UTF-X>, returns false;
otherwise returns true.

    my $a = "\xc4\x80"; # $a contains two characters, with ords
                        # 0xc4 and 0x80
    utf8::decode($a);   # $a contains one character, with ord 0x100

B<Note that this function does not handle arbitrary encodings.>
Therefore Encode is recommended for general purposes; see also
L<Encode>.

=item * $flag = utf8::is_utf8(STRING)

(Since Perl 5.8.1) Test whether STRING is marked internally as encoded in
UTF-8. Functionally the same as Encode::is_utf8().

=item * $flag = utf8::valid(STRING)

[INTERNAL] Test whether STRING is in a consistent state regarding
UTF-8. Will return true if it is well-formed UTF-8 and has the UTF-8 flag
on B<or> if STRING is held as bytes (both these states are 'consistent').
The main reason for this routine is to allow Perl's test suite to check
that operations have left strings in a consistent state. You most
probably want to use utf8::is_utf8() instead.

=back
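
The relationship between C<utf8::upgrade>, C<utf8::downgrade> and
C<utf8::is_utf8> described in the list above can be sketched as follows.
This is only an illustration, assuming an ASCII platform; the variable
names are not part of any API.

```perl
use strict;
use warnings;

my $s = "\xE9";                            # one character, stored as native 8 bit
my $flag_before = utf8::is_utf8($s) ? 1 : 0;

my $octets     = utf8::upgrade($s);        # internal storage becomes UTF-X
my $flag_after = utf8::is_utf8($s) ? 1 : 0;
my $length     = length($s);               # still one logical character

my $down_ok = utf8::downgrade($s) ? 1 : 0; # back to native 8 bit storage

my $wide = "\x{100}";                      # has no native 8 bit representation
# With a true $fail_ok argument, failure returns false instead of dying.
my $wide_ok = utf8::downgrade($wide, 1) ? 1 : 0;

print "flag: $flag_before -> $flag_after, octets: $octets, length: $length\n";
print "downgrade ok: $down_ok, wide downgrade ok: $wide_ok\n";
```

The logical string is the same before and after each call; only the
internal representation, and therefore the flag reported by
C<utf8::is_utf8>, changes.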

C<utf8::encode> is like C<utf8::upgrade>, but the UTF8 flag is
cleared. See L<perlunicode> for more on the UTF8 flag and the C API
functions C<sv_utf8_upgrade>, C<sv_utf8_downgrade>, C<sv_utf8_encode>,
and C<sv_utf8_decode>, which are wrapped by the Perl functions
C<utf8::upgrade>, C<utf8::downgrade>, C<utf8::encode> and
C<utf8::decode>. Also, the functions C<utf8::is_utf8>, C<utf8::valid>,
C<utf8::encode>, C<utf8::decode>, C<utf8::upgrade>, and C<utf8::downgrade> are
actually internal, and thus always available, without a C<require utf8>
statement.
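
A short sketch of that last point: the script below never loads the
C<utf8> module, yet the functions can still be called. It also shows
the return value of C<utf8::decode> on invalid input. It assumes an
ASCII platform, and the variable names are illustrative only.

```perl
use strict;
use warnings;

# Note: there is no "use utf8" or "require utf8" anywhere in this script.

my $s = "\x{100}";
utf8::encode($s);                  # now two characters, ords 0xc4 and 0x80
my $encoded_length = length($s);

my $decoded    = utf8::decode($s); # back to one character, ord 0x100
my $round_trip = ($decoded && $s eq "\x{100}") ? 1 : 0;

my $bad    = "\xC4";               # truncated sequence, not valid UTF-8
my $bad_ok = utf8::decode($bad) ? 1 : 0;

print "encoded length: $encoded_length, round trip: $round_trip, ",
      "invalid input decoded: $bad_ok\n";
```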

=head1 BUGS

One can have Unicode in identifier names, but not in package/class or
subroutine names. While some limited functionality towards this does
exist as of Perl 5.8.0, that is more accidental than designed; use of
Unicode for the said purposes is unsupported.

One reason for this incompleteness is its (currently) inherent
unportability: since both package names and subroutine names may need
to be mapped to file and directory names, the Unicode capability of
the filesystem becomes important, and there unfortunately aren't
portable answers.

=head1 SEE ALSO

L<perlunitut>, L<perluniintro>, L<perlrun>, L<bytes>, L<perlunicode>

=cut