This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5
Undocument the use of .*utf8.*{upgrade,downgrade,encode,decode}
[perl5.git] / lib / utf8.pm
package utf8;

$utf8::hint_bits = 0x00800000;

our $VERSION = '1.00';

sub import {
    $^H |= $utf8::hint_bits;
    $enc{caller()} = $_[1] if $_[1];
}

sub unimport {
    $^H &= ~$utf8::hint_bits;
}

sub AUTOLOAD {
    require "utf8_heavy.pl";
    goto &$AUTOLOAD if defined &$AUTOLOAD;
    Carp::croak("Undefined subroutine $AUTOLOAD called");
}

1;
__END__

=head1 NAME

utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source code

=head1 SYNOPSIS

    use utf8;
    no utf8;

=head1 DESCRIPTION

The C<use utf8> pragma tells the Perl parser to allow UTF-8 in the
program text in the current lexical scope (or UTF-EBCDIC on EBCDIC-based
platforms). The C<no utf8> pragma tells Perl to switch back to treating
the source text as literal bytes in the current lexical scope.

This pragma is primarily a compatibility device. Perl versions
earlier than 5.6 allowed arbitrary bytes in source code, whereas
in the future we would like to standardize on the UTF-8 encoding for
source text. Until UTF-8 becomes the default format for source
text, this pragma should be used to recognize UTF-8 in the source.
When UTF-8 becomes the standard source format, this pragma will
effectively become a no-op. For convenience, in what follows the
term I<UTF-X> refers to UTF-8 on ASCII and ISO Latin based
platforms and to UTF-EBCDIC on EBCDIC-based platforms.

Enabling the C<utf8> pragma has the following effect:

=over 4

=item *

Bytes in the source text that have their high bit set will be treated
as being part of a literal UTF-8 character. This includes most
literals such as identifiers, string constants, constant regular
expression patterns and package names. On EBCDIC platforms, characters
in the Latin 1 character set are treated as being part of a literal
UTF-EBCDIC character.

=back

Note that if you have bytes with the eighth bit on in your script
(for example embedded Latin-1 in your string literals), C<use utf8>
will be unhappy since the bytes are most probably not well-formed
UTF-8. If you want to have such bytes and still use utf8, you can
disable utf8 until the end of the block (or file, if at top level)
with C<no utf8;>.

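As a minimal sketch of the scoping described above (not taken from the
Perl distribution, and assuming the file is saved as UTF-8 on an
ASCII-based platform), the same literal is seen as one character under
C<use utf8> and as two raw bytes under C<no utf8>. The variable names
are illustrative only.

    use utf8;
    my $chars = length("é");    # literal parsed as a single character

    my $bytes;
    {
        no utf8;                # literal bytes until the end of this block
        $bytes = length("é");   # the same literal, now two raw bytes
    }

    print "$chars $bytes\n";    # prints "1 2"
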
=head2 Utility functions

The following functions are defined in the C<utf8::> package by the Perl core.

=over 4

=item * $num_octets = utf8::upgrade($string);

Converts the internal representation of the string to Perl's internal
I<UTF-X> form. Returns the number of octets necessary to represent
the string as I<UTF-X>. Note that this should not be used to convert
a legacy byte encoding to Unicode: use Encode for that. Affected
by the encoding pragma.

=item * utf8::downgrade($string[, CHECK])

Converts the internal representation of the string to un-encoded bytes.
Note that this should not be used to convert Unicode back to a legacy
byte encoding: use Encode for that. B<Not> affected by the encoding
pragma.

=item * utf8::encode($string)

Converts (in-place) I<$string> from logical characters to the octet
sequence representing it in Perl's I<UTF-X> encoding. Note that this
should not be used to convert a legacy byte encoding to Unicode: use
Encode for that.

=item * $flag = utf8::decode($string)

Attempts to convert I<$string> in-place from Perl's I<UTF-X> encoding
into logical characters. Note that this should not be used to convert
Unicode back to a legacy byte encoding: use Encode for that.

=back

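The following sketch walks one string through the functions listed
above. It assumes an ASCII-based platform with no C<encoding> pragma in
effect; the variable name is illustrative, and the C<length> results
show characters versus octets.

    my $s = "caf\xE9";          # 4 characters, stored one byte each

    utf8::upgrade($s);          # internal form changes, contents do not
    print length($s), "\n";     # prints 4
    utf8::downgrade($s);        # back to one byte per character

    utf8::encode($s);           # characters become UTF-8 octets, in place
    print length($s), "\n";     # prints 5 (0xE9 became 0xC3 0xA9)

    utf8::decode($s) or die "not valid UTF-8";
    print length($s), "\n";     # prints 4 again
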
C<utf8::encode> is like C<utf8::upgrade> but the UTF8 flag does not
get turned on. See L<perlunicode> for more on the UTF8 flag and the C
API functions C<sv_utf8_upgrade>, C<sv_utf8_downgrade>,
C<sv_utf8_encode>, C<sv_utf8_decode> that are wrapped by the Perl
functions C<utf8::upgrade>, C<utf8::downgrade>, C<utf8::encode> and
C<utf8::decode>.

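To illustrate the difference, here is a sketch under stated assumptions
(an ASCII-based platform, no C<encoding> pragma, illustrative variable
names): an upgraded string still compares equal to the original,
whereas an encoded string holds different, longer contents.

    my $orig     = "caf\xE9";
    my $upgraded = $orig;
    my $encoded  = $orig;

    utf8::upgrade($upgraded);   # UTF8 flag on, same logical characters
    utf8::encode($encoded);     # UTF8 flag off, characters are now octets

    my $same_after_upgrade = ($upgraded eq $orig);   # true
    my $same_after_encode  = ($encoded  eq $orig);   # false

    print length($upgraded), " ", length($encoded), "\n";   # prints "4 5"
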
=head1 SEE ALSO

L<perlunicode>, L<bytes>

=cut