package utf8;

# Bit in $^H that marks the enclosing lexical scope as UTF-8 source.
$utf8::hint_bits = 0x00800000;

our $VERSION = '1.14';

# "use utf8": set the hint bit for the scope being compiled.
sub import {
    $^H |= $utf8::hint_bits;
}

# "no utf8": clear the hint bit again.
sub unimport {
    $^H &= ~$utf8::hint_bits;
}

# Load the heavyweight helpers on demand, then retry the call.
sub AUTOLOAD {
    require "utf8_heavy.pl";
    goto &$AUTOLOAD if defined &$AUTOLOAD;
    require Carp;
    Carp::croak("Undefined subroutine $AUTOLOAD called");
}

1;
__END__

=head1 NAME

utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source code

=head1 SYNOPSIS

    use utf8;
    no utf8;

    # Convert the internal representation of a Perl scalar to/from UTF-8.

    $num_octets = utf8::upgrade($string);
    $success    = utf8::downgrade($string[, $fail_ok]);

    # Change each character of a Perl scalar to/from a series of
    # characters that represent the UTF-8 bytes of each original character.

    utf8::encode($string);  # "\x{100}"  becomes "\xc4\x80"
    utf8::decode($string);  # "\xc4\x80" becomes "\x{100}"

    # Convert a code point from the platform native character set to
    # Unicode, and vice-versa.
    $unicode = utf8::native_to_unicode(ord('A')); # returns 65 on both
                                                  # ASCII and EBCDIC
                                                  # platforms
    $native  = utf8::unicode_to_native(65);       # returns 65 on ASCII
                                                  # platforms; 193 on EBCDIC

    $flag = utf8::is_utf8($string); # since Perl 5.8.1
    $flag = utf8::valid($string);

=head1 DESCRIPTION

The C<use utf8> pragma tells the Perl parser to allow UTF-8 in the
program text in the current lexical scope (allow UTF-EBCDIC on EBCDIC based
platforms).  The C<no utf8> pragma tells Perl to switch back to treating
the source text as literal bytes in the current lexical scope.

B<Do not use this pragma for anything other than telling Perl that your
script is written in UTF-8.>  The utility functions described below are
directly usable without C<use utf8;>.

Because it is not possible to reliably tell UTF-8 from native 8 bit
encodings, you need either a Byte Order Mark at the beginning of your
source code, or C<use utf8;>, to instruct perl.

When UTF-8 becomes the standard source format, this pragma will
effectively become a no-op.  For convenience in what follows the term
I<UTF-X> is used to refer to UTF-8 on ASCII and ISO Latin based
platforms and UTF-EBCDIC on EBCDIC based platforms.

See also the effects of the C<-C> switch and its cousin, the
C<$ENV{PERL_UNICODE}> environment variable, in L<perlrun>.

Enabling the C<utf8> pragma has the following effect:

=over 4

=item *

Bytes in the source text that have their high-bit set will be treated
as being part of a literal UTF-X sequence.  This includes most
literals such as identifier names, string constants, and constant
regular expression patterns.

On EBCDIC platforms characters in the Latin 1 character set are
treated as being part of a literal UTF-EBCDIC character.

=back

Note that if you have bytes with the eighth bit on in your script
(for example embedded Latin-1 in your string literals), C<use utf8>
will complain, since the bytes are most probably not well-formed
UTF-X.  If you want to have such bytes under C<use utf8>, you can disable
this pragma until the end of the block (or file, if at top level) with
C<no utf8;>.

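As a rough illustration of the effect described above, the following
sketch compiles the same two high-bit bytes once with the pragma on and
once with it off.  To keep the example ASCII-only, the bytes are passed
through a string C<eval> instead of being typed into the source file.
It assumes an ASCII platform and that the C<unicode_eval> feature is
not enabled; the variable names are purely illustrative.

    # Assumption: ASCII platform, "unicode_eval" feature not enabled.
    # $src is a quoted string literal whose body is the two UTF-8
    # bytes of U+00E9 (illustrative sample data).
    my $src = qq{"\xc3\xa9"};

    # Compiled with the pragma on, the two bytes form one character.
    my $with = eval "use utf8; length($src)";

    # Compiled with the pragma off, each byte is its own character.
    my $without = eval "no utf8; length($src)";

    print "with: $with, without: $without\n";

In ordinary code you would simply save the file as UTF-8 and say
C<use utf8;> at the top, rather than going through C<eval>.
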
=head2 Utility functions

The following functions are defined in the C<utf8::> package by the
Perl core.  You do not need to say C<use utf8> to use these and in fact
you should not say that unless you really want to have UTF-8 source code.

=over 4

=item * C<$num_octets = utf8::upgrade($string)>

Converts in-place the internal representation of the string from an octet
sequence in the native encoding (Latin-1 or EBCDIC) to I<UTF-X>.  The
logical character sequence itself is unchanged.  If I<$string> is already
stored as I<UTF-X>, then this is a no-op.  Returns the
number of octets necessary to represent the string as I<UTF-X>.  Can be
used to make sure that the UTF-8 flag is on, so that C<\w> or C<lc()>
work as Unicode on strings containing characters in the range 0x80-0xFF
(on ASCII and derivatives).

B<Note that this function does not handle arbitrary encodings.>
For general-purpose conversions, use the L<Encode> module instead.

=item * C<$success = utf8::downgrade($string[, $fail_ok])>

Converts in-place the internal representation of the string from
I<UTF-X> to the equivalent octet sequence in the native encoding (Latin-1
or EBCDIC).  The logical character sequence itself is unchanged.  If
I<$string> is already stored as native 8 bit, then this is a no-op.  Can
be used to
make sure that the UTF-8 flag is off, e.g. when you want to make sure
that the C<substr()> or C<length()> function works with the usually faster
byte algorithm.

Fails if the original I<UTF-X> sequence cannot be represented in the
native 8 bit encoding.  On failure dies or, if the value of I<$fail_ok> is
true, returns false.

Returns true on success.

B<Note that this function does not handle arbitrary encodings.>
For general-purpose conversions, use the L<Encode> module instead.

=item * C<utf8::encode($string)>

Converts in-place the character sequence to the corresponding octet
sequence in I<UTF-X>.  That is, every (possibly wide) character gets
replaced with a sequence of one or more characters that represent the
individual I<UTF-X> bytes of the character.  The UTF-8 flag is turned off.
Returns nothing.

    my $a = "\x{100}"; # $a contains one character, with ord 0x100
    utf8::encode($a);  # $a contains two characters, with ords (on
                       # ASCII platforms) 0xc4 and 0x80

B<Note that this function does not handle arbitrary encodings.>
For general-purpose conversions, use the L<Encode> module instead.

=item * C<$success = utf8::decode($string)>

Attempts to convert in-place the octet sequence encoded as I<UTF-X> to the
corresponding character sequence.  That is, it replaces each sequence of
characters in the string whose ords represent a valid UTF-X byte
sequence, with the corresponding single character.  The UTF-8 flag is
turned on only if the source string contains multiple-byte I<UTF-X>
characters.  If I<$string> is invalid as I<UTF-X>, returns false;
otherwise returns true.

    my $a = "\xc4\x80"; # $a contains two characters, with ords
                        # 0xc4 and 0x80
    utf8::decode($a);   # On ASCII platforms, $a contains one char,
                        # with ord 0x100.  On EBCDIC platforms, $a
                        # is unchanged and the function returns FALSE.

(C<"\xc4\x80"> is not a valid sequence of bytes in any UTF-8-encoded
character(s) in the EBCDIC code pages that Perl supports, which is why the
above example returns failure on them.  The byte sequence that does decode
into C<\x{100}> depends on the platform; in IBM-1047 it is C<"\x8C\x41">.)

B<Note that this function does not handle arbitrary encodings.>
For general-purpose conversions, use the L<Encode> module instead.

=item * C<$unicode = utf8::native_to_unicode($code_point)>

This takes an unsigned integer (which represents the ordinal number of a
character (or a code point) on the platform the program is being run on) and
returns its Unicode equivalent value.  Since ASCII platforms natively use the
Unicode code points, this function returns its input on them.  On EBCDIC
platforms it converts from EBCDIC to Unicode.

A meaningless value will currently be returned if the input is not an unsigned
integer.

=item * C<$native = utf8::unicode_to_native($code_point)>

This is the inverse of C<utf8::native_to_unicode()>, converting in the other
direction.  Again, on ASCII platforms, this returns its input, but on EBCDIC
platforms it will find the native platform code point, given any Unicode one.

A meaningless value will currently be returned if the input is not an unsigned
integer.

=item * C<$flag = utf8::is_utf8($string)>

(Since Perl 5.8.1)  Test whether I<$string> is marked internally as encoded in
UTF-8.  Functionally the same as C<Encode::is_utf8()>.

=item * C<$flag = utf8::valid($string)>

[INTERNAL] Test whether I<$string> is in a consistent state regarding
UTF-8.  Will return true if it is well-formed UTF-8 and has the UTF-8 flag
on B<or> if I<$string> is held as bytes (both these states are 'consistent').
The main reason for this routine is to allow Perl's test suite to check
that operations have left strings in a consistent state.  You most
probably want to use C<utf8::is_utf8()> instead.

=back

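The following sketch, which is only an illustration and assumes an ASCII
platform, walks one string through C<utf8::upgrade> and
C<utf8::downgrade>, shows both failure modes of C<utf8::downgrade>, and
round-trips a code point through the two mapping functions.  All
variable names are made up for the example.

    # Assumption: ASCII platform; variable names are illustrative only.
    my $s = "\xe9";                           # one character, one octet

    my $octets  = utf8::upgrade($s);          # octets needed as UTF-X
    my $is_up   = utf8::is_utf8($s) ? 1 : 0;  # the UTF-8 flag is now on
    my $chars   = length($s);                 # still one character

    my $ok      = utf8::downgrade($s) ? 1 : 0; # back to native 8 bit
    my $is_down = utf8::is_utf8($s) ? 1 : 0;   # the flag is off again

    # A character above 0xFF has no native 8 bit representation.
    my $wide = "\x{100}";
    my $soft = utf8::downgrade($wide, 1) ? 1 : 0;         # false, no die
    my $died = eval { utf8::downgrade($wide); 1 } ? 0 : 1; # dies

    # Mapping to Unicode and back returns the original code point.
    my $round = utf8::unicode_to_native(
                    utf8::native_to_unicode(ord('A')));
    my $same  = ($round == ord('A')) ? 1 : 0;

    print "octets=$octets chars=$chars up=$is_up ok=$ok down=$is_down ",
          "soft=$soft died=$died same=$same\n";

Passing a true I<$fail_ok> is usually the more convenient choice when a
string may contain wide characters, since the caller can then test the
return value instead of trapping an exception.
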
C<utf8::encode> is like C<utf8::upgrade>, but the UTF-8 flag is
cleared, so each original character is left as the characters that
represent its I<UTF-X> bytes.  See L<perlunicode> for more on the UTF-8
flag and the C API
functions C<sv_utf8_upgrade>, C<sv_utf8_downgrade>, C<sv_utf8_encode>,
and C<sv_utf8_decode>, which are wrapped by the Perl functions
C<utf8::upgrade>, C<utf8::downgrade>, C<utf8::encode> and
C<utf8::decode>.  Also, the functions C<utf8::is_utf8>, C<utf8::valid>,
C<utf8::encode>, C<utf8::decode>, C<utf8::upgrade>, and C<utf8::downgrade>
are actually internal, and thus always available, without a
C<require utf8> statement.

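The difference between C<utf8::upgrade> and C<utf8::encode> can be seen
by applying each to a copy of the same one-character string.  This is a
minimal sketch that assumes an ASCII platform; note that it calls the
functions without loading the C<utf8> module, and that the variable
names are illustrative only.

    # Assumption: ASCII platform; no "use utf8" or "require utf8" here.
    my $up  = "\xe9";
    my $enc = "\xe9";

    utf8::upgrade($up);   # same single character, UTF-8 flag on
    utf8::encode($enc);   # two characters (the UTF-X bytes), flag off

    my $up_len   = length($up);
    my $up_flag  = utf8::is_utf8($up)  ? 1 : 0;
    my $enc_len  = length($enc);
    my $enc_flag = utf8::is_utf8($enc) ? 1 : 0;

    print "upgrade: length $up_len, flag $up_flag\n";
    print "encode:  length $enc_len, flag $enc_flag\n";

    # decode reverses encode and reports whether the input was valid.
    my $valid = utf8::decode($enc) ? 1 : 0;
    my $equal = ($enc eq $up) ? 1 : 0;

    print "decode:  valid $valid, equal to upgraded copy $equal\n";
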
=head1 BUGS

One can have Unicode in identifier names, but not in package/class or
subroutine names.  While some limited functionality towards this does
exist as of Perl 5.8.0, that is more accidental than designed; use of
Unicode for the said purposes is unsupported.

One reason for this incompleteness is its (currently) inherent
unportability: since both package names and subroutine names may need
to be mapped to file and directory names, the Unicode capability of
the filesystem becomes important, and unfortunately there are no
portable answers.

=head1 SEE ALSO

L<perlunitut>, L<perluniintro>, L<perlrun>, L<bytes>, L<perlunicode>

=cut