This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5
[perl5.git] / lib / utf8.pm
package utf8;

$utf8::hint_bits = 0x00800000;

our $VERSION = '1.00';

sub import {
    $^H |= $utf8::hint_bits;
    $enc{caller()} = $_[1] if $_[1];
}

sub unimport {
    $^H &= ~$utf8::hint_bits;
}

sub AUTOLOAD {
    require "utf8_heavy.pl";
    goto &$AUTOLOAD if defined &$AUTOLOAD;
    Carp::croak("Undefined subroutine $AUTOLOAD called");
}

1;
__END__

=head1 NAME

utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source code

=head1 SYNOPSIS

    use utf8;
    no utf8;

=head1 DESCRIPTION

The C<use utf8> pragma tells the Perl parser to allow UTF-8 in the
program text in the current lexical scope (or UTF-EBCDIC on EBCDIC based
platforms).  The C<no utf8> pragma tells Perl to switch back to treating
the source text as literal bytes in the current lexical scope.

This pragma is primarily a compatibility device.  Perl versions
earlier than 5.6 allowed arbitrary bytes in source code, whereas
in the future we would like to standardize on the UTF-8 encoding for
source text.  Until UTF-8 becomes the default format for source
text, this pragma should be used to recognize UTF-8 in the source.
When UTF-8 becomes the standard source format, this pragma will
effectively become a no-op.  For convenience in what follows, the
term I<UTF-X> is used to refer to UTF-8 on ASCII and ISO Latin based
platforms and UTF-EBCDIC on EBCDIC based platforms.

Enabling the C<utf8> pragma has the following effect:

=over 4

=item *

Bytes in the source text that have their high bit set will be treated
as being part of a literal UTF-8 character.  This includes most
literals such as identifiers, string constants, constant regular
expression patterns and package names.  On EBCDIC platforms, characters
in the Latin 1 character set are treated as being part of a literal
UTF-EBCDIC character.

=back

Note that if your script contains bytes with the eighth bit set
(for example, embedded Latin-1 in string literals), C<use utf8>
will be unhappy, since those bytes are most probably not well-formed
UTF-8.  If you want to keep such bytes and still use utf8, you can disable
utf8 until the end of the block (or file, if at top level) with C<no utf8;>.
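The lexical scoping described above can be sketched as follows. This is a minimal illustration, not part of the module. It assumes the script file itself is saved as UTF-8 and that the accented letter is the single precomposed character U+00E9; the variable names are made up for the example.

```perl
use strict;
use warnings;

my ($chars, $bytes);

{
    use utf8;                  # literals in this block are parsed as UTF-8
    $chars = length("é");      # one character, U+00E9

    {
        no utf8;               # back to raw bytes until the closing brace
        $bytes = length("é");  # the two octets 0xC3 0xA9
    }
}

print "with utf8: $chars, without: $bytes\n";
```

The inner C<no utf8> only lasts until its closing brace; code following that brace inside the outer block is parsed under C<use utf8> again.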

=head2 Utility functions

The following functions are defined in the C<utf8::> package by the Perl core.

=over 4

=item * $num_octets = utf8::upgrade($string);

Converts the internal representation of the string to Perl's internal
I<UTF-X> form.  Returns the number of octets necessary to represent
the string as I<UTF-X>.

=item * utf8::downgrade($string[, CHECK])

Converts the internal representation of the string to un-encoded bytes.

=item * utf8::encode($string)

Converts (in-place) I<$string> from logical characters to the octet sequence
representing it in Perl's I<UTF-X> encoding.

=item * $flag = utf8::decode($string)

Attempts to convert I<$string> in-place from Perl's I<UTF-X> encoding
into logical characters.

=back
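A short sketch of how the four functions relate, assuming an ASCII platform where I<UTF-X> means UTF-8. The variable names are illustrative only. The meaning of the second argument to C<utf8::downgrade> is not spelled out above; the sketch assumes that a true value makes the call return false on failure rather than die, as in current Perl releases.

```perl
use strict;
use warnings;

# "caf" followed by U+00E9, written as an escape so this file stays ASCII.
my $string = "caf\x{e9}";

# upgrade: only the internal representation changes, not the characters.
my $upgraded   = $string;
my $num_octets = utf8::upgrade($upgraded);   # octets needed in UTF-X form

# downgrade: back to one byte per character; true on success.
my $downgraded   = $upgraded;
my $downgrade_ok = utf8::downgrade($downgraded);

# A character above 0xFF cannot be downgraded.  The true second argument
# is assumed to request a false return value instead of an exception.
my $wide    = "\x{263A}";
my $wide_ok = utf8::downgrade($wide, 1);

# encode: the characters are replaced, in place, by their UTF-X octets.
my $octets = $string;
utf8::encode($octets);

# decode: the reverse step; the flag reports whether the octets were valid.
my $decoded   = $octets;
my $decode_ok = utf8::decode($decoded);

printf "%d characters, %d octets after encode\n",
    length($string), length($octets);
```

Note the difference between the two pairs: C<upgrade> and C<downgrade> leave the logical string equal to the original, while C<encode> and C<decode> change what the string contains.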

=head1 SEE ALSO

L<perlunicode>, L<bytes>

=cut