This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5
[perl5.git] / lib / utf8.pm
package utf8;

$utf8::hint_bits = 0x00800000;

our $VERSION = '1.19';

sub import {
    $^H |= $utf8::hint_bits;
}

sub unimport {
    $^H &= ~$utf8::hint_bits;
}

sub AUTOLOAD {
    require "utf8_heavy.pl";
    goto &$AUTOLOAD if defined &$AUTOLOAD;
    require Carp;
    Carp::croak("Undefined subroutine $AUTOLOAD called");
}

1;
__END__

=head1 NAME

utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source code

=head1 SYNOPSIS

 use utf8;
 no utf8;

 # Convert the internal representation of a Perl scalar to/from UTF-8.

 $num_octets = utf8::upgrade($string);
 $success    = utf8::downgrade($string[, $fail_ok]);

 # Change each character of a Perl scalar to/from a series of
 # characters that represent the UTF-8 bytes of each original character.

 utf8::encode($string);  # "\x{100}"  becomes "\xc4\x80"
 utf8::decode($string);  # "\xc4\x80" becomes "\x{100}"

 # Convert a code point from the platform native character set to
 # Unicode, and vice-versa.
 $unicode = utf8::native_to_unicode(ord('A')); # returns 65 on both
                                               # ASCII and EBCDIC
                                               # platforms
 $native = utf8::unicode_to_native(65);        # returns 65 on ASCII
                                               # platforms; 193 on
                                               # EBCDIC

 $flag = utf8::is_utf8($string); # since Perl 5.8.1
 $flag = utf8::valid($string);

=head1 DESCRIPTION

The C<use utf8> pragma tells the Perl parser to allow UTF-8 in the
program text in the current lexical scope. The C<no utf8> pragma tells Perl
to switch back to treating the source text as literal bytes in the current
lexical scope. (On EBCDIC platforms, technically it is allowing UTF-EBCDIC,
and not UTF-8, but this distinction is academic, so in this document the term
UTF-8 is used to mean both).

B<Do not use this pragma for anything other than telling Perl that your
script is written in UTF-8.> The utility functions described below are
directly usable without C<use utf8;>.

Because it is not possible to reliably tell UTF-8 from native 8 bit
encodings, you need either a Byte Order Mark at the beginning of your
source code, or C<use utf8;>, to instruct perl.

When UTF-8 becomes the standard source format, this pragma will
effectively become a no-op.

See also the effects of the C<-C> switch and its cousin, the
C<PERL_UNICODE> environment variable, in L<perlrun>.

Enabling the C<utf8> pragma has the following effect:

=over 4

=item *

Bytes in the source text that are not in the ASCII character set will be
treated as being part of a literal UTF-8 sequence. This includes most
literals such as identifier names, string constants, and constant
regular expression patterns.

=back

Note that if you have non-ASCII, non-UTF-8 bytes in your script (for example
embedded Latin-1 in your string literals), C<use utf8> will be unhappy. If
you want to have such bytes under C<use utf8>, you can disable this pragma
until the end of the block (or file, if at top level) by C<no utf8;>.
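
As an illustrative sketch only, the difference the pragma makes to a string
literal can be observed as follows. The variable names are hypothetical, and
a string C<eval> is used purely so that the example itself stays ASCII; it
assumes an ASCII platform and that the C<unicode_eval> feature is not
enabled.

    # The two bytes 0xC3 0xA9 are the UTF-8 encoding of U+00E9.
    my $src = qq{length("\xc3\xa9")};

    # Under "use utf8" the literal is parsed as one character.
    my $with_pragma    = eval "use utf8; $src";   # 1

    # Under "no utf8" the same source bytes are two characters.
    my $without_pragma = eval "no utf8; $src";    # 2

    print "$with_pragma $without_pragma\n";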

=head2 Utility functions

The following functions are defined in the C<utf8::> package by the
Perl core. You do not need to say C<use utf8> to use these and in fact
you should not say that unless you really want to have UTF-8 source code.

=over 4

=item * C<$num_octets = utf8::upgrade($string)>

(Since Perl v5.8.0)
Converts in-place the internal representation of the string from an octet
sequence in the native encoding (Latin-1 or EBCDIC) to UTF-8. The
logical character sequence itself is unchanged. If I<$string> is already
stored as UTF-8, then this is a no-op. Returns the
number of octets necessary to represent the string as UTF-8. Can be
used to make sure that the UTF-8 flag is on, so that C<\w> or C<lc()>
work as Unicode on strings containing non-ASCII characters whose code points
are below 256.

B<Note that this function does not handle arbitrary encodings>;
use L<Encode> instead.
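
A minimal sketch of the C<\w> case mentioned above, with hypothetical
variable names. It assumes an ASCII platform, no C<use locale>, no
C<use feature 'unicode_strings'>, and no explicit regular expression
character set modifier:

    my $string = "\xe9";    # U+00E9, stored as one native octet

    # Stored as native 8 bit, the character is not matched by \w.
    my $matched_before = ($string =~ /\w/) ? 1 : 0;   # 0

    # Two octets are needed to hold U+00E9 as UTF-8.
    my $num_octets = utf8::upgrade($string);          # 2

    # Same logical string, but \w now follows Unicode rules.
    my $matched_after = ($string =~ /\w/) ? 1 : 0;    # 1

    print "$matched_before $num_octets $matched_after\n";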

=item * C<$success = utf8::downgrade($string[, $fail_ok])>

(Since Perl v5.8.0)
Converts in-place the internal representation of the string from
UTF-8 to the equivalent octet sequence in the native encoding (Latin-1
or EBCDIC). The logical character sequence itself is unchanged. If
I<$string> is already stored as native 8 bit, then this is a no-op. Can
be used to make sure that the UTF-8 flag is off, e.g. when you want to
make sure that the substr() or length() function works with the usually
faster byte algorithm.

Fails if the original UTF-8 sequence cannot be represented in the
native 8 bit encoding. On failure dies or, if the value of I<$fail_ok> is
true, returns false.

Returns true on success.

B<Note that this function does not handle arbitrary encodings>;
use L<Encode> instead.
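
The success and failure cases described above can be sketched like this
(variable names are hypothetical; the exact text of the exception is not
shown because it is not specified here):

    my $narrow = "\xe9";
    utf8::upgrade($narrow);

    # Every character fits in 8 bits, so this succeeds.
    my $ok = utf8::downgrade($narrow);          # true

    my $wide = "\x{100}";

    # Code point 0x100 does not fit; with $fail_ok true the call
    # returns false and leaves $wide as it was.
    my $soft = utf8::downgrade($wide, 1);       # false

    # Without $fail_ok the same call dies, so trap it with eval.
    my $died = !eval { utf8::downgrade($wide); 1 };

    print "ok\n" if $ok && !$soft && $died;     # prints "ok"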

=item * C<utf8::encode($string)>

(Since Perl v5.8.0)
Converts in-place the character sequence to the corresponding octet
sequence in UTF-8. That is, every (possibly wide) character gets
replaced with a sequence of one or more characters that represent the
individual UTF-8 bytes of the character. The UTF-8 flag is turned off.
Returns nothing.

    my $a = "\x{100}"; # $a contains one character, with ord 0x100
    utf8::encode($a);  # $a contains two characters, with ords (on
                       # ASCII platforms) 0xc4 and 0x80. On EBCDIC
                       # 1047, this would instead be 0x8C and 0x41.

B<Note that this function does not handle arbitrary encodings>;
use L<Encode> instead.

=item * C<$success = utf8::decode($string)>

(Since Perl v5.8.0)
Attempts to convert in-place the octet sequence encoded as UTF-8 to the
corresponding character sequence. That is, it replaces each sequence of
characters in the string whose ords represent a valid UTF-8 byte
sequence, with the corresponding single character. The UTF-8 flag is
turned on only if the source string contains multiple-byte UTF-8
characters. If I<$string> is invalid as UTF-8, returns false;
otherwise returns true.

    my $a = "\xc4\x80"; # $a contains two characters, with ords
                        # 0xc4 and 0x80
    utf8::decode($a);   # On ASCII platforms, $a contains one char,
                        # with ord 0x100. Since these bytes aren't
                        # legal UTF-EBCDIC, on EBCDIC platforms, $a is
                        # unchanged and the function returns FALSE.

B<Note that this function does not handle arbitrary encodings>;
use L<Encode> instead.

=item * C<$unicode = utf8::native_to_unicode($code_point)>

(Since Perl v5.8.0)
This takes an unsigned integer (which represents the ordinal number of a
character (or a code point) on the platform the program is being run on) and
returns its Unicode equivalent value. Since ASCII platforms natively use the
Unicode code points, this function returns its input on them. On EBCDIC
platforms it converts from EBCDIC to Unicode.

A meaningless value will currently be returned if the input is not an unsigned
integer.

Since Perl v5.22.0, calls to this function are optimized out on ASCII
platforms, so there is no performance hit in using it there.

=item * C<$native = utf8::unicode_to_native($code_point)>

(Since Perl v5.8.0)
This is the inverse of C<utf8::native_to_unicode()>, converting the other
direction. Again, on ASCII platforms, this returns its input, but on EBCDIC
platforms it will find the native platform code point, given any Unicode one.

A meaningless value will currently be returned if the input is not an unsigned
integer.

Since Perl v5.22.0, calls to this function are optimized out on ASCII
platforms, so there is no performance hit in using it there.
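
As a small sketch of how the two conversions relate (variable names are
hypothetical), a round trip returns the original native code point, and
C<utf8::unicode_to_native()> can be combined with C<chr()> to build a
character from its Unicode code point on either kind of platform:

    my $native  = ord('A');                           # platform value
    my $unicode = utf8::native_to_unicode($native);   # 65 (U+0041)
    my $back    = utf8::unicode_to_native($unicode);  # $native again

    # U+00E9 (LATIN SMALL LETTER E WITH ACUTE), whatever its
    # native code point happens to be.
    my $e_acute = chr(utf8::unicode_to_native(0xE9));

    print "round trip ok\n" if $back == $native;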

=item * C<$flag = utf8::is_utf8($string)>

(Since Perl 5.8.1) Test whether I<$string> is marked internally as encoded in
UTF-8. Functionally the same as C<Encode::is_utf8()>.
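
The following sketch, with hypothetical variable names, illustrates that
the flag describes how a string is stored rather than which characters it
contains: two strings can compare equal while reporting different flags.

    my $bytes = "\xe9";       # stored as a native 8-bit string
    my $copy  = $bytes;
    utf8::upgrade($copy);     # same characters, stored as UTF-8

    my $same       = ($bytes eq $copy)      ? 1 : 0;   # 1
    my $bytes_flag = utf8::is_utf8($bytes)  ? 1 : 0;   # 0
    my $copy_flag  = utf8::is_utf8($copy)   ? 1 : 0;   # 1

    print "$same $bytes_flag $copy_flag\n";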

=item * C<$flag = utf8::valid($string)>

[INTERNAL] Test whether I<$string> is in a consistent state regarding
UTF-8. Will return true if it is well-formed UTF-8 and has the UTF-8 flag
on B<or> if I<$string> is held as bytes (both these states are 'consistent').
The main reason for this routine is to allow Perl's test suite to check
that operations have left strings in a consistent state. You most
probably want to use C<utf8::is_utf8()> instead.

=back

C<utf8::encode> is like C<utf8::upgrade>, but the UTF-8 flag is
cleared. See L<perlunicode>, and the C API
functions C<L<sv_utf8_upgrade|perlapi/sv_utf8_upgrade>>,
C<L<perlapi/sv_utf8_downgrade>>, C<L<perlapi/sv_utf8_encode>>,
and C<L<perlapi/sv_utf8_decode>>, which are wrapped by the Perl functions
C<utf8::upgrade>, C<utf8::downgrade>, C<utf8::encode> and
C<utf8::decode>. Also, the functions C<utf8::is_utf8>, C<utf8::valid>,
C<utf8::encode>, C<utf8::decode>, C<utf8::upgrade>, and C<utf8::downgrade> are
actually internal, and thus always available, without a C<require utf8>
statement.

=head1 BUGS

Some filesystems may not support UTF-8 file names, or they may be supported
incompatibly with Perl. Therefore UTF-8 names that are visible to the
filesystem, such as module names, may not work.

=head1 SEE ALSO

L<perlunitut>, L<perluniintro>, L<perlrun>, L<bytes>, L<perlunicode>

=cut