This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5
[perl5.git] / lib / utf8.pm
package utf8;

# Bit in $^H that marks the current lexical scope as UTF-8 source.
$utf8::hint_bits = 0x00800000;

our $VERSION = '1.17';

sub import {
    $^H |= $utf8::hint_bits;
}

sub unimport {
    $^H &= ~$utf8::hint_bits;
}

# Load the rest of the utf8:: routines on demand, then retry the call.
sub AUTOLOAD {
    require "utf8_heavy.pl";
    goto &$AUTOLOAD if defined &$AUTOLOAD;
    require Carp;
    Carp::croak("Undefined subroutine $AUTOLOAD called");
}

1;
__END__

=head1 NAME

utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source code

=head1 SYNOPSIS

 use utf8;
 no utf8;

 # Convert the internal representation of a Perl scalar to/from UTF-8.

 $num_octets = utf8::upgrade($string);
 $success    = utf8::downgrade($string[, $fail_ok]);

 # Change each character of a Perl scalar to/from a series of
 # characters that represent the UTF-8 bytes of each original character.

 utf8::encode($string);  # "\x{100}"  becomes "\xc4\x80"
 utf8::decode($string);  # "\xc4\x80" becomes "\x{100}"

 # Convert a code point from the platform native character set to
 # Unicode, and vice-versa.
 $unicode = utf8::native_to_unicode(ord('A')); # returns 65 on both
                                               # ASCII and EBCDIC
                                               # platforms
 $native = utf8::unicode_to_native(65);        # returns 65 on ASCII
                                               # platforms; 193 on EBCDIC

 $flag = utf8::is_utf8($string); # since Perl 5.8.1
 $flag = utf8::valid($string);

=head1 DESCRIPTION

The C<use utf8> pragma tells the Perl parser to allow UTF-8 in the
program text in the current lexical scope (allow UTF-EBCDIC on EBCDIC based
platforms).  The C<no utf8> pragma tells Perl to switch back to treating
the source text as literal bytes in the current lexical scope.

B<Do not use this pragma for anything other than telling Perl that your
script is written in UTF-8.> The utility functions described below are
directly usable without C<use utf8;>.

Because it is not possible to reliably tell UTF-8 from native 8 bit
encodings, you need either a Byte Order Mark at the beginning of your
source code, or C<use utf8;>, to tell perl that the source is UTF-8.

When UTF-8 becomes the standard source format, this pragma will
effectively become a no-op.  For convenience in what follows the term
I<UTF-X> is used to refer to UTF-8 on ASCII and ISO Latin based
platforms and UTF-EBCDIC on EBCDIC based platforms.

See also the effects of the C<-C> switch and its cousin, the
C<PERL_UNICODE> environment variable, in L<perlrun>.

Enabling the C<utf8> pragma has the following effect:

=over 4

=item *

Bytes in the source text that have their high-bit set will be treated
as being part of a literal UTF-X sequence.  This includes most
literals such as identifier names, string constants, and constant
regular expression patterns.

On EBCDIC platforms characters in the Latin 1 character set are
treated as being part of a literal UTF-EBCDIC character.

=back

Note that if you have bytes with the eighth bit on in your script
(for example embedded Latin-1 in your string literals), C<use utf8>
will most likely make perl complain, since the bytes are probably not
well-formed UTF-X.  If you want to have such bytes under C<use utf8>, you
can disable this pragma until the end of the block (or file, if at top
level) by C<no utf8;>.

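As a minimal sketch of the lexical scoping described above, the following
compares how the same literal is parsed with and without the pragma.  It
assumes the file holding the code is saved as UTF-8; the variable names are
illustrative only.

```perl
use strict;
use warnings;

# Assumption: this file is saved as UTF-8, so the literal below is
# stored on disk as the two bytes 0xC3 0xA9.
my ($chars, $bytes);
{
    use utf8;     # the literal is parsed as UTF-8: one character
    $chars = length("é");
}
{
    no utf8;      # the literal is parsed as raw bytes: two characters
    $bytes = length("é");
}
print "use utf8: $chars, no utf8: $bytes\n";
```

Under that assumption, the first length should be 1 and the second 2, and
the effect of each pragma ends with its enclosing block.
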
=head2 Utility functions

The following functions are defined in the C<utf8::> package by the
Perl core.  You do not need to say C<use utf8> to use these and in fact
you should not say that unless you really want to have UTF-8 source code.

=over 4

=item * C<$num_octets = utf8::upgrade($string)>

Converts in-place the internal representation of the string from an octet
sequence in the native encoding (Latin-1 or EBCDIC) to I<UTF-X>. The
logical character sequence itself is unchanged.  If I<$string> is already
stored as I<UTF-X>, then this is a no-op. Returns the
number of octets necessary to represent the string as I<UTF-X>.  Can be
used to make sure that the UTF-8 flag is on, so that C<\w> or C<lc()>
work as Unicode on strings containing characters in the range 0x80-0xFF
(on ASCII and derivatives).

B<Note that this function does not handle arbitrary encodings.>
Therefore Encode is recommended for general purposes; see also
L<Encode>.

=item * C<$success = utf8::downgrade($string[, $fail_ok])>

Converts in-place the internal representation of the string from
I<UTF-X> to the equivalent octet sequence in the native encoding (Latin-1
or EBCDIC). The logical character sequence itself is unchanged. If
I<$string> is already stored as native 8 bit, then this is a no-op.  Can
be used to
make sure that the UTF-8 flag is off, e.g. when you want to make sure
that the substr() or length() function works with the usually faster
byte algorithm.

Fails if the original I<UTF-X> sequence cannot be represented in the
native 8 bit encoding. On failure dies or, if the value of I<$fail_ok> is
true, returns false.

Returns true on success.

B<Note that this function does not handle arbitrary encodings.>
Therefore Encode is recommended for general purposes; see also
L<Encode>.

=item * C<utf8::encode($string)>

Converts in-place the character sequence to the corresponding octet
sequence in I<UTF-X>. That is, every (possibly wide) character gets
replaced with a sequence of one or more characters that represent the
individual I<UTF-X> bytes of the character.  The UTF8 flag is turned off.
Returns nothing.

    my $a = "\x{100}"; # $a contains one character, with ord 0x100
    utf8::encode($a);  # $a contains two characters, with ords (on
                       # ASCII platforms) 0xc4 and 0x80

B<Note that this function does not handle arbitrary encodings.>
Therefore Encode is recommended for general purposes; see also
L<Encode>.

=item * C<$success = utf8::decode($string)>

Attempts to convert in-place the octet sequence encoded as I<UTF-X> to the
corresponding character sequence. That is, it replaces each sequence of
characters in the string whose ords represent a valid UTF-X byte
sequence, with the corresponding single character.  The UTF-8 flag is
turned on only if the source string contains multiple-byte I<UTF-X>
characters.  If I<$string> is invalid as I<UTF-X>, returns false;
otherwise returns true.

    my $a = "\xc4\x80"; # $a contains two characters, with ords
                        # 0xc4 and 0x80
    utf8::decode($a);   # On ASCII platforms, $a contains one char,
                        # with ord 0x100.   On EBCDIC platforms, $a
                        # is unchanged and the function returns FALSE.

(C<"\xc4\x80"> is not a valid sequence of bytes in any UTF-8-encoded
character(s) in the EBCDIC code pages that Perl supports, which is why the
above example returns failure on them.  What does decode into C<\x{100}>
depends on the platform.  It is C<"\x8C\x41"> in IBM-1047.)

B<Note that this function does not handle arbitrary encodings.>
Therefore Encode is recommended for general purposes; see also
L<Encode>.

=item * C<$unicode = utf8::native_to_unicode($code_point)>

(Since Perl v5.8.0)
This takes an unsigned integer (which represents the ordinal number of a
character (or a code point) on the platform the program is being run on) and
returns its Unicode equivalent value.  Since ASCII platforms natively use the
Unicode code points, this function returns its input on them.  On EBCDIC
platforms it converts from EBCDIC to Unicode.

A meaningless value will currently be returned if the input is not an unsigned
integer.

Since Perl v5.22.0, calls to this function are optimized out on ASCII
platforms, so there is no performance hit in using it there.

=item * C<$native = utf8::unicode_to_native($code_point)>

(Since Perl v5.8.0)
This is the inverse of C<utf8::native_to_unicode()>, converting the other
direction.  Again, on ASCII platforms, this returns its input, but on EBCDIC
platforms it will find the native platform code point, given any Unicode one.

A meaningless value will currently be returned if the input is not an unsigned
integer.

Since Perl v5.22.0, calls to this function are optimized out on ASCII
platforms, so there is no performance hit in using it there.

=item * C<$flag = utf8::is_utf8($string)>

(Since Perl 5.8.1)  Test whether I<$string> is marked internally as encoded in
UTF-8.  Functionally the same as Encode::is_utf8().

=item * C<$flag = utf8::valid($string)>

[INTERNAL] Test whether I<$string> is in a consistent state regarding
UTF-8.  Will return true if it is well-formed UTF-8 and has the UTF-8 flag
on B<or> if I<$string> is held as bytes (both these states are 'consistent').
The main reason for this routine is to allow Perl's test suite to check
that operations have left strings in a consistent state.  You most
probably want to use utf8::is_utf8() instead.

=back

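As a rough illustration of C<utf8::upgrade>, C<utf8::downgrade> and
C<utf8::is_utf8> as described above, the sketch below changes only the
internal representation of a string and checks that its length stays the
same.  The variable names are illustrative, and the octet count in the
comment assumes an ASCII platform.

```perl
use strict;
use warnings;

my $s = "caf\x{e9}";               # four characters, held as native bytes
my $before = utf8::is_utf8($s) ? 1 : 0;

my $octets = utf8::upgrade($s);    # same characters, now stored as UTF-X
                                   # (5 octets on ASCII platforms)
my $after  = utf8::is_utf8($s) ? 1 : 0;
my $len    = length($s);           # still 4: the logical string is unchanged

my $ok = utf8::downgrade($s);      # back to the native 8-bit form
my $final = utf8::is_utf8($s) ? 1 : 0;

my $wide = "\x{100}";              # cannot be represented in 8 bits
# With $fail_ok true, failure is reported by a false return value
# instead of an exception.
my $failed = utf8::downgrade($wide, 1) ? 0 : 1;

print "flag: $before -> $after -> $final, length $len, failed: $failed\n";
```

Passing a true I<$fail_ok> is a design choice for code that prefers to test
the return value; without it, the same call would die.
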
C<utf8::encode> is like C<utf8::upgrade>, but the UTF8 flag is
cleared.  See L<perlunicode> for more on the UTF8 flag and the C API
functions C<sv_utf8_upgrade>, C<sv_utf8_downgrade>, C<sv_utf8_encode>,
and C<sv_utf8_decode>, which are wrapped by the Perl functions
C<utf8::upgrade>, C<utf8::downgrade>, C<utf8::encode> and
C<utf8::decode>.  Also, the functions utf8::is_utf8, utf8::valid,
utf8::encode, utf8::decode, utf8::upgrade, and utf8::downgrade are
actually internal, and thus always available, without a C<require utf8>
statement.

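To make the contrast with the representation-only functions concrete, here
is a small sketch of C<utf8::encode> and C<utf8::decode>, which change the
logical contents of the string.  It assumes an ASCII platform for the byte
values mentioned in the comments, and the variable names are illustrative.

```perl
use strict;
use warnings;

my $text = "\x{100}";              # one character, ord 0x100
utf8::encode($text);               # now one character per UTF-X byte
my $encoded_len = length($text);   # 2 on ASCII platforms (0xc4, 0x80)
my $flag = utf8::is_utf8($text) ? 1 : 0;    # the UTF8 flag is off

my $ok = utf8::decode($text);      # true: the bytes were valid UTF-X
my $round_trip = ($ok && $text eq "\x{100}") ? 1 : 0;

# Assumption: ASCII platform, where a lone 0xc4 is a lead byte with no
# continuation byte and so is not valid UTF-8.
my $bad = "\xc4";
my $bad_ok = utf8::decode($bad) ? 1 : 0;    # decode reports failure

print "encoded length $encoded_len, round trip $round_trip, ",
      "bad input accepted: $bad_ok\n";
```

Checking the return value of C<utf8::decode> is the only way this sketch
learns that the input was malformed, so it should not be ignored.
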
=head1 BUGS

One can have Unicode in identifier names, but not in package/class or
subroutine names.  While some limited functionality towards this does
exist as of Perl 5.8.0, that is more accidental than designed; use of
Unicode for the said purposes is unsupported.

One reason this remains unfinished is its (currently) inherent
unportability: since both package names and subroutine names may need
to be mapped to file and directory names, the Unicode capability of
the filesystem becomes important, and there unfortunately aren't
portable answers.

=head1 SEE ALSO

L<perlunitut>, L<perluniintro>, L<perlrun>, L<bytes>, L<perlunicode>

=cut