This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5
lib/utf8.pm: Pod clarification and nit
[perl5.git] / lib / utf8.pm
package utf8;

# Bit in $^H (the compile-time hints variable) that marks a lexical
# scope as being under "use utf8".
$utf8::hint_bits = 0x00800000;

our $VERSION = '1.12';

# "use utf8" sets the hint bit for the enclosing lexical scope.
sub import {
    $^H |= $utf8::hint_bits;
}

# "no utf8" clears it again.
sub unimport {
    $^H &= ~$utf8::hint_bits;
}

# Load utf8_heavy.pl on demand, then retry the requested subroutine
# if it is now defined; otherwise report it as undefined.
sub AUTOLOAD {
    require "utf8_heavy.pl";
    goto &$AUTOLOAD if defined &$AUTOLOAD;
    require Carp;
    Carp::croak("Undefined subroutine $AUTOLOAD called");
}

1;
__END__

=head1 NAME

utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source code

=head1 SYNOPSIS

    use utf8;
    no utf8;

    # Convert the internal representation of a Perl scalar to/from UTF-8.

    $num_octets = utf8::upgrade($string);
    $success    = utf8::downgrade($string[, FAIL_OK]);

    # Change each character of a Perl scalar to/from a series of
    # characters that represent the UTF-8 bytes of each original character.

    utf8::encode($string);  # "\x{100}"  becomes "\xc4\x80"
    utf8::decode($string);  # "\xc4\x80" becomes "\x{100}"

    $flag = utf8::is_utf8(STRING); # since Perl 5.8.1
    $flag = utf8::valid(STRING);

=head1 DESCRIPTION

The C<use utf8> pragma tells the Perl parser to allow UTF-8 in the
program text in the current lexical scope (allow UTF-EBCDIC on EBCDIC based
platforms). The C<no utf8> pragma tells Perl to switch back to treating
the source text as literal bytes in the current lexical scope.

B<Do not use this pragma for anything other than telling Perl that your
script is written in UTF-8.> The utility functions described below are
directly usable without C<use utf8;>.

Because it is not possible to reliably tell UTF-8 from native 8-bit
encodings, you need either a Byte Order Mark at the beginning of your
source code, or C<use utf8;>, to instruct perl.

When UTF-8 becomes the standard source format, this pragma will
effectively become a no-op. For convenience in what follows the term
I<UTF-X> is used to refer to UTF-8 on ASCII and ISO Latin based
platforms and UTF-EBCDIC on EBCDIC based platforms.

See also the effects of the C<-C> switch and its cousin,
C<$ENV{PERL_UNICODE}>, in L<perlrun>.

Enabling the C<utf8> pragma has the following effect:

=over 4

=item *

Bytes in the source text that have their high-bit set will be treated
as being part of a literal UTF-X sequence. This includes most
literals such as identifier names, string constants, and constant
regular expression patterns.

On EBCDIC platforms characters in the Latin 1 character set are
treated as being part of a literal UTF-EBCDIC character.

=back

Note that if you have bytes with the eighth bit on in your script
(for example embedded Latin-1 in your string literals), C<use utf8>
will be unhappy since the bytes are most probably not well-formed
UTF-X. If you want to have such bytes under C<use utf8>, you can disable
this pragma until the end of the block (or file, if at top level) by
C<no utf8;>.
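
The lexical scoping described above can be sketched as follows. This is
only an illustration: it assumes an ASCII-based platform and a source file
saved as UTF-8, and the variable names C<$chars> and C<$bytes> are
hypothetical. The same literal is read as one character inside the
C<use utf8> block and as two raw bytes inside the C<no utf8> block.

```perl
use strict;
use warnings;

# Assumption: this file is saved as UTF-8, so the literal below
# occupies two bytes (0xC3 0xA9) on disk.
my ($chars, $bytes);

{
    use utf8;                 # literal is decoded as one character
    $chars = length("é");
}

{
    no utf8;                  # literal is taken as two raw bytes
    $bytes = length("é");
}

print "with utf8: $chars, without: $bytes\n";
```

Under those assumptions the first length should be 1 and the second 2;
the pragma affects only how the source text is read, not the utility
functions.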

=head2 Utility functions

The following functions are defined in the C<utf8::> package by the
Perl core. You do not need to say C<use utf8> to use these and in fact
you should not say that unless you really want to have UTF-8 source code.

=over 4

=item * $num_octets = utf8::upgrade($string)

Converts in-place the internal representation of the string from an octet
sequence in the native encoding (Latin-1 or EBCDIC) to I<UTF-X>. The
logical character sequence itself is unchanged. If I<$string> is already
stored as I<UTF-X>, then this is a no-op. Returns the
number of octets necessary to represent the string as I<UTF-X>. Can be
used to make sure that the UTF-8 flag is on, so that C<\w> or C<lc()>
work as Unicode on strings containing characters in the range 0x80-0xFF
(on ASCII and derivatives).

B<Note that this function does not handle arbitrary encodings.>
Therefore Encode is recommended for general purposes; see also
L<Encode>.

=item * $success = utf8::downgrade($string[, FAIL_OK])

Converts in-place the internal representation of the string from
I<UTF-X> to the equivalent octet sequence in the native encoding (Latin-1
or EBCDIC). The logical character sequence itself is unchanged. If
I<$string> is already stored as native 8-bit, then this is a no-op. Can
be used to
make sure that the UTF-8 flag is off, e.g. when you want to make sure
that the substr() or length() function works with the usually faster
byte algorithm.

Fails if the original I<UTF-X> sequence cannot be represented in the
native 8-bit encoding. On failure dies or, if the value of C<FAIL_OK> is
true, returns false.

Returns true on success.

B<Note that this function does not handle arbitrary encodings.>
Therefore Encode is recommended for general purposes; see also
L<Encode>.

=item * utf8::encode($string)

Converts in-place the character sequence to the corresponding octet
sequence in I<UTF-X>. That is, every (possibly wide) character gets
replaced with a sequence of one or more characters that represent the
individual I<UTF-X> bytes of the character. The UTF-8 flag is turned off.
Returns nothing.

    my $a = "\x{100}"; # $a contains one character, with ord 0x100
    utf8::encode($a);  # $a contains two characters, with ords 0xc4 and 0x80

B<Note that this function does not handle arbitrary encodings.>
Therefore Encode is recommended for general purposes; see also
L<Encode>.

=item * $success = utf8::decode($string)

Attempts to convert in-place the octet sequence in I<UTF-X> to the
corresponding character sequence. That is, it replaces each sequence of
characters in the string whose ords represent a valid UTF-X byte
sequence, with the corresponding single character. The UTF-8 flag is
turned on only if the source string contains multiple-byte I<UTF-X>
characters. If I<$string> is invalid as I<UTF-X>, returns false;
otherwise returns true.

    my $a = "\xc4\x80"; # $a contains two characters, with ords 0xc4 and 0x80
    utf8::decode($a);   # $a contains one character, with ord 0x100

B<Note that this function does not handle arbitrary encodings.>
Therefore Encode is recommended for general purposes; see also
L<Encode>.

=item * $flag = utf8::is_utf8(STRING)

(Since Perl 5.8.1) Tests whether STRING is marked internally as encoded in
UTF-8. Functionally the same as Encode::is_utf8().

=item * $flag = utf8::valid(STRING)

[INTERNAL] Tests whether STRING is in a consistent state regarding
UTF-8. Will return true if it is well-formed UTF-8 and has the UTF-8 flag
on B<or> if STRING is held as bytes (both these states are 'consistent').
The main reason for this routine is to allow Perl's test suite to check
that operations have left strings in a consistent state. You most
probably want to use utf8::is_utf8() instead.

=back
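
As a minimal sketch of C<utf8::upgrade> and C<utf8::downgrade> as described
above, assuming an ASCII-based platform: the variable names are
illustrative only. It shows that upgrading changes the internal flag but
not the logical string, and that C<FAIL_OK> turns a fatal downgrade into a
false return value.

```perl
use strict;
use warnings;

my $s = "caf\xe9";                        # four characters, stored as native bytes
my $before = utf8::is_utf8($s) ? 1 : 0;

my $octets = utf8::upgrade($s);           # internal storage becomes UTF-8
my $after  = utf8::is_utf8($s) ? 1 : 0;
my $same   = ($s eq "caf\xe9") ? 1 : 0;   # logical characters are unchanged

my $ok_back = utf8::downgrade($s) ? 1 : 0;         # back to native bytes

my $wide    = "\x{100}";                           # does not fit in 8 bits
my $ok_wide = utf8::downgrade($wide, 1) ? 1 : 0;   # FAIL_OK set: false, no die

print "flag $before -> $after, octets $octets, same $same\n";
print "downgrade ok: $ok_back, wide ok: $ok_wide\n";
```

Without the second argument, the last C<utf8::downgrade> call would die
instead of returning false.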

C<utf8::encode> is like C<utf8::upgrade>, but the UTF-8 flag is
cleared, so the resulting octets are seen as individual characters.
See L<perlunicode> for more on the UTF-8 flag and the C API
functions C<sv_utf8_upgrade>, C<sv_utf8_downgrade>, C<sv_utf8_encode>,
and C<sv_utf8_decode>, which are wrapped by the Perl functions
C<utf8::upgrade>, C<utf8::downgrade>, C<utf8::encode> and
C<utf8::decode>. Also, the functions C<utf8::is_utf8>, C<utf8::valid>,
C<utf8::encode>, C<utf8::decode>, C<utf8::upgrade>, and
C<utf8::downgrade> are actually internal, and thus always available,
without a C<require utf8> statement.
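
The difference between C<utf8::encode> and C<utf8::upgrade> can be sketched
as below. This is an illustration under the assumption of an ASCII-based
platform, with hypothetical variable names. No C<use utf8> or
C<require utf8> is needed for these calls.

```perl
use strict;
use warnings;

my $up  = "\xe9";
my $enc = "\xe9";

utf8::upgrade($up);    # storage changes; the string still has one character
utf8::encode($enc);    # the string now has two characters: 0xC3 and 0xA9

my $up_len   = length $up;
my $enc_len  = length $enc;
my $enc_flag = utf8::is_utf8($enc) ? 1 : 0;   # encode leaves the flag off

my $ok      = utf8::decode($enc) ? 1 : 0;     # back to the single character
my $restored = ($enc eq "\xe9") ? 1 : 0;

my $bad    = "\xff";                          # not a valid UTF-8 sequence
my $bad_ok = utf8::decode($bad) ? 1 : 0;      # decode reports failure

print "upgrade length $up_len, encode length $enc_len, flag $enc_flag\n";
print "decode ok: $ok, restored: $restored, bad ok: $bad_ok\n";
```

For anything beyond UTF-X, L<Encode> remains the recommended interface.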

=head1 BUGS

One can have Unicode in identifier names, but not in package/class or
subroutine names. While some limited functionality towards this does
exist as of Perl 5.8.0, that is more accidental than designed; use of
Unicode for the said purposes is unsupported.

One reason for this incompleteness is its (currently) inherent
unportability: since both package names and subroutine names may need
to be mapped to file and directory names, the Unicode capability of
the filesystem becomes important, and there unfortunately aren't
portable answers.

=head1 SEE ALSO

L<perlunitut>, L<perluniintro>, L<perlrun>, L<bytes>, L<perlunicode>

=cut