This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5
package utf8;

use strict;
use warnings;

# Bit in $^H that records that "use utf8" is in effect for the
# current lexical scope.
our $hint_bits = 0x00800000;

our $VERSION = '1.25';
our $AUTOLOAD;

sub import {
    $^H |= $hint_bits;
}

sub unimport {
    $^H &= ~$hint_bits;
}

# Dispatch to the requested subroutine if it exists; otherwise croak.
sub AUTOLOAD {
    goto &$AUTOLOAD if defined &$AUTOLOAD;
    require Carp;
    Carp::croak("Undefined subroutine $AUTOLOAD called");
}

1;
__END__

=head1 NAME

utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source code

=head1 SYNOPSIS

    use utf8;
    no utf8;

    # Convert the internal representation of a Perl scalar to/from UTF-8.

    $num_octets = utf8::upgrade($string);
    $success    = utf8::downgrade($string[, $fail_ok]);

    # Change each character of a Perl scalar to/from a series of
    # characters that represent the UTF-8 bytes of each original character.

    utf8::encode($string);  # "\x{100}"  becomes "\xc4\x80"
    utf8::decode($string);  # "\xc4\x80" becomes "\x{100}"

    # Convert a code point from the platform native character set to
    # Unicode, and vice-versa.
    $unicode = utf8::native_to_unicode(ord('A')); # returns 65 on both
                                                  # ASCII and EBCDIC
                                                  # platforms
    $native = utf8::unicode_to_native(65);        # returns 65 on ASCII
                                                  # platforms; 193 on
                                                  # EBCDIC

    $flag = utf8::is_utf8($string); # since Perl 5.8.1
    $flag = utf8::valid($string);

=head1 DESCRIPTION

The C<use utf8> pragma tells the Perl parser to allow UTF-8 in the
program text in the current lexical scope. The C<no utf8> pragma tells Perl
to switch back to treating the source text as literal bytes in the current
lexical scope. (On EBCDIC platforms, technically it is allowing UTF-EBCDIC,
and not UTF-8, but this distinction is academic, so in this document the term
UTF-8 is used to mean both.)

B<Do not use this pragma for anything other than telling Perl that your
script is written in UTF-8.> The utility functions described below are
directly usable without C<use utf8;>.

Because it is not possible to reliably tell UTF-8 from native 8 bit
encodings, you need either a Byte Order Mark at the beginning of your
source code, or C<use utf8;>, to instruct perl.

When UTF-8 becomes the standard source format, this pragma will
effectively become a no-op.

See also the effects of the C<-C> switch and its cousin, the
C<PERL_UNICODE> environment variable, in L<perlrun>.

Enabling the C<utf8> pragma has the following effect:

=over 4

=item *

Bytes in the source text that are not in the ASCII character set will be
treated as being part of a literal UTF-8 sequence. This includes most
literals such as identifier names, string constants, and constant
regular expression patterns.

=back

Note that if you have non-ASCII, non-UTF-8 bytes in your script (for example
embedded Latin-1 in your string literals), perl will complain about malformed
UTF-8 under C<use utf8>. If you want to have such bytes under C<use utf8>,
you can disable this pragma until the end of the block (or file, if at top
level) by C<no utf8;>.

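The effect on string literals can be seen with a minimal sketch such as the
following. It assumes that the file containing it is saved as UTF-8; the
variable names are illustrative only.

```perl

    use strict;
    use warnings;

    # Assumption: this file is saved as UTF-8, so the literal below
    # is stored on disk as the two bytes 0xC3 0xA9.
    my $chars = do { use utf8; "é" };   # read as one character, U+00E9
    my $bytes = do { no utf8;  "é" };   # read as two separate bytes

    printf "%d %d\n", length($chars), length($bytes);   # prints "1 2"

```

Both pragmas are lexically scoped, so each C<do> block above is parsed under
its own setting.
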
=head2 Utility functions

The following functions are defined in the C<utf8::> package by the
Perl core. You do not need to say C<use utf8> to use these and in fact
you should not say that unless you really want to have UTF-8 source code.

=over 4

=item * C<$num_octets = utf8::upgrade($string)>

(Since Perl v5.8.0)
Converts in-place the internal representation of the string from an octet
sequence in the native encoding (Latin-1 or EBCDIC) to UTF-8. The
logical character sequence itself is unchanged. If I<$string> is already
upgraded, then this is a no-op. Returns the
number of octets necessary to represent the string as UTF-8.
Since Perl v5.38, if C<$string> is C<undef> no action is taken; prior to that,
it would be converted to be defined and zero-length.

If your code needs to be compatible with versions of perl without
C<use feature 'unicode_strings';>, you can force Unicode semantics on
a given string:

    # force unicode semantics for $string without the
    # "unicode_strings" feature
    utf8::upgrade($string);

For example:

    # without explicit or implicit use feature 'unicode_strings'
    my $x = "\xDF";    # LATIN SMALL LETTER SHARP S
    $x =~ /ss/i;       # won't match
    my $y = uc($x);    # won't convert
    utf8::upgrade($x);
    $x =~ /ss/i;       # matches
    my $z = uc($x);    # converts to "SS"

B<Note that this function does not handle arbitrary encodings>;
use L<Encode> instead.

=item * C<$success = utf8::downgrade($string[, $fail_ok])>

(Since Perl v5.8.0)
Converts in-place the internal representation of the string from UTF-8 to the
equivalent octet sequence in the native encoding (Latin-1 or EBCDIC). The
logical character sequence itself is unchanged. If I<$string> is already
stored as native 8 bit, then this is a no-op. Can be used to make sure that
the UTF-8 flag is off, e.g. when you want to make sure that the substr() or
length() function works with the usually faster byte algorithm.

Fails if the original UTF-8 sequence cannot be represented in the
native 8 bit encoding. On failure dies or, if the value of I<$fail_ok> is
true, returns false.

Returns true on success.

If your code expects an octet sequence this can be used to validate
that you've received one:

    # throw an exception if not representable as octets
    utf8::downgrade($string);

    # or do your own error handling
    utf8::downgrade($string, 1) or die "string must be octets";

B<Note that this function does not handle arbitrary encodings>;
use L<Encode> instead.

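The success case and both failure modes described above can be exercised with
a short sketch. The variable names are illustrative only.

```perl

    use strict;
    use warnings;

    my $narrow = "caf\xE9";    # every character fits in one octet
    my $wide   = "\x{100}";    # cannot be represented as a single octet

    utf8::upgrade($narrow);               # now stored internally as UTF-8
    my $ok = utf8::downgrade($narrow);    # true: representable as octets

    # Without $fail_ok the call dies, so trap the exception.
    my $died = eval { utf8::downgrade($wide); 1 } ? 0 : 1;

    # With a true $fail_ok it returns false instead of dying.
    my $soft = utf8::downgrade($wide, 1);

    printf "ok=%d died=%d soft=%d\n", $ok ? 1 : 0, $died, $soft ? 1 : 0;

```
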
=item * C<utf8::encode($string)>

(Since Perl v5.8.0)
Converts in-place the character sequence to the corresponding octet
sequence in Perl's extended UTF-8. That is, every (possibly wide) character
gets replaced with a sequence of one or more characters that represent the
individual UTF-8 bytes of the character. The UTF-8 flag is turned off.
Returns nothing.

    my $x = "\x{100}"; # $x contains one character, with ord 0x100
    utf8::encode($x);  # $x contains two characters, with ords (on
                       # ASCII platforms) 0xc4 and 0x80. On EBCDIC
                       # 1047, this would instead be 0x8C and 0x41.

Similar to:

    use Encode;
    $x = Encode::encode("utf8", $x);

B<Note that this function does not handle arbitrary encodings>;
use L<Encode> instead.

=item * C<$success = utf8::decode($string)>

(Since Perl v5.8.0)
Attempts to convert in-place the octet sequence encoded in Perl's extended
UTF-8 to the corresponding character sequence. That is, it replaces each
sequence of characters in the string whose ords represent a valid (extended)
UTF-8 byte sequence, with the corresponding single character. The UTF-8 flag
is turned on only if the source string contains multiple-byte UTF-8
characters. If I<$string> is invalid as extended UTF-8, returns false;
otherwise returns true.

    my $x = "\xc4\x80"; # $x contains two characters, with ords
                        # 0xc4 and 0x80
    utf8::decode($x);   # On ASCII platforms, $x contains one char,
                        # with ord 0x100. Since these bytes aren't
                        # legal UTF-EBCDIC, on EBCDIC platforms, $x is
                        # unchanged and the function returns FALSE.
    my $y = "\xc3\x83\xc2\xab"; # This has been encoded twice; this
                        # example is only for ASCII platforms
    utf8::decode($y);   # Converts $y to \xc3\xab, returns TRUE;
    utf8::decode($y);   # Further converts to \xeb, returns TRUE;
    utf8::decode($y);   # Returns FALSE, leaves $y unchanged

B<Note that this function does not handle arbitrary encodings>;
use L<Encode> instead.

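Because C<utf8::decode> reports failure through its return value rather than
by dying, callers normally need to check it. The following is a minimal
sketch for ASCII platforms only; C<decode_strict> is a hypothetical helper
written for this example, not part of the C<utf8::> package.

```perl

    use strict;
    use warnings;

    # Hypothetical helper: returns the decoded copy, or dies if the
    # octets are not well-formed (extended) UTF-8.
    sub decode_strict {
        my ($octets) = @_;
        utf8::decode($octets)
            or die "input is not valid UTF-8\n";
        return $octets;
    }

    my $good = decode_strict("\xc4\x80");            # one character, ord 0x100
    my $bad  = eval { decode_strict("\xe9"); 1 };    # lone lead byte: dies

    printf "%d %s\n", length($good), $bad ? "accepted" : "rejected";

```

The helper works on a copy of its argument, so the caller's string is never
modified in place.
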
=item * C<$unicode = utf8::native_to_unicode($code_point)>

(Since Perl v5.8.0)
This takes an unsigned integer (which represents the ordinal number of a
character (or a code point) on the platform the program is being run on) and
returns its Unicode equivalent value. Since ASCII platforms natively use the
Unicode code points, this function returns its input on them. On EBCDIC
platforms it converts from EBCDIC to Unicode.

A meaningless value will currently be returned if the input is not an unsigned
integer.

Since Perl v5.22.0, calls to this function are optimized out on ASCII
platforms, so there is no performance hit in using it there.

=item * C<$native = utf8::unicode_to_native($code_point)>

(Since Perl v5.8.0)
This is the inverse of C<utf8::native_to_unicode()>, converting the other
direction. Again, on ASCII platforms, this returns its input, but on EBCDIC
platforms it will find the native platform code point, given any Unicode one.

A meaningless value will currently be returned if the input is not an unsigned
integer.

Since Perl v5.22.0, calls to this function are optimized out on ASCII
platforms, so there is no performance hit in using it there.

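The two conversion functions are inverses of each other, which a short
round trip illustrates. This is a sketch with illustrative variable names;
the native value printed depends on the platform.

```perl

    use strict;
    use warnings;

    my $native_A  = ord('A');    # 65 on ASCII platforms, 193 on EBCDIC
    my $unicode_A = utf8::native_to_unicode($native_A);
    my $back      = utf8::unicode_to_native($unicode_A);

    printf "native=%d unicode=%d round-trip=%d\n",
        $native_A, $unicode_A, $back;

```

On every platform the Unicode value is 65 and the round trip returns the
original native code point.
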
=item * C<$flag = utf8::is_utf8($string)>

(Since Perl 5.8.1) Test whether I<$string> is marked internally as encoded in
UTF-8. Functionally the same as C<Encode::is_utf8($string)>.

Typically only necessary for debugging and testing. If you need to
dump the internals of an SV, L<Devel::Peek's|Devel::Peek> Dump()
provides more detail in a compact form.

If you still think you need this outside of debugging, testing or
dealing with filenames, you should probably read L<perlunitut> and
L<perlunifaq/What is "the UTF8 flag"?>.

Don't use this flag as a marker to distinguish character and binary
data: that should be decided for each variable when you write your
code.

To force Unicode semantics in code portable to perl 5.8 and 5.10, call
C<utf8::upgrade($string)> unconditionally.

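The reason the flag is a poor marker for "character versus binary" is that
two strings with identical characters can differ in it. A minimal sketch,
with illustrative variable names:

```perl

    use strict;
    use warnings;

    my $plain    = "caf\xE9";    # stored as native 8-bit octets
    my $upgraded = $plain;
    utf8::upgrade($upgraded);    # same characters, UTF-8 internally

    printf "flags: %d %d\n",
        utf8::is_utf8($plain)    ? 1 : 0,
        utf8::is_utf8($upgraded) ? 1 : 0;
    print "same string\n" if $plain eq $upgraded;

```

The flags differ, yet C<eq> treats the two strings as equal, because only
the internal representation has changed.
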
=item * C<$flag = utf8::valid($string)>

[INTERNAL] Test whether I<$string> is in a consistent state regarding
UTF-8. Will return true if it is well-formed Perl extended UTF-8 and has the
UTF-8 flag on B<or> if I<$string> is held as bytes (both these states are
'consistent'). The main reason for this routine is to allow Perl's test
suite to check that operations have left strings in a consistent state.

=back

C<utf8::encode> is like C<utf8::upgrade>, but the UTF-8 flag is
cleared. See L<perlunicode>, and the C API
functions C<L<sv_utf8_upgrade|perlapi/sv_utf8_upgrade>>,
C<L<perlapi/sv_utf8_downgrade>>, C<L<perlapi/sv_utf8_encode>>,
and C<L<perlapi/sv_utf8_decode>>, which are wrapped by the Perl functions
C<utf8::upgrade>, C<utf8::downgrade>, C<utf8::encode> and
C<utf8::decode>. Also, the functions C<utf8::is_utf8>, C<utf8::valid>,
C<utf8::encode>, C<utf8::decode>, C<utf8::upgrade>, and C<utf8::downgrade> are
actually internal, and thus always available, without a C<require utf8>
statement.

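That these functions are available without loading this module can be
checked with a sketch like the following, which never says C<use utf8> or
C<require utf8>. The variable names are illustrative only.

```perl

    use strict;
    use warnings;

    # utf8.pm is deliberately neither use'd nor require'd here.
    my @names   = qw(is_utf8 valid encode decode upgrade downgrade);
    my @defined = grep { 'utf8'->can($_) } @names;

    printf "%d of %d functions available\n",
        scalar(@defined), scalar(@names);

```
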
=head1 BUGS

Some filesystems may not support UTF-8 file names, or they may be supported
incompatibly with Perl. Therefore UTF-8 names that are visible to the
filesystem, such as module names, may not work.

=head1 SEE ALSO

L<perlunitut>, L<perluniintro>, L<perlrun>, L<bytes>, L<perlunicode>

=cut