
Commit	Line	Data
a0ed51b3 LW	1	package utf8;
a0ed51b3 LW	2
ba6f05db TR	3	use strict;
ba6f05db TR	4	use warnings;
d5448623	5
ba6f05db TR	6	our $hint_bits = 0x00800000;
ba6f05db TR	7
347a477c	8	our $VERSION = '1.25';
ba6f05db	9	our $AUTOLOAD;
b75c8c73	10
a0ed51b3	11	sub import {
ba6f05db	12	$^H |= $hint_bits;
a0ed51b3 LW	13	}
	14
	15	sub unimport {
ba6f05db	16	$^H &= ~$hint_bits;
a0ed51b3 LW	17	}
	18
	19	sub AUTOLOAD {
daf4d4ea	20	goto &$AUTOLOAD if defined &$AUTOLOAD;
bd7017d3	21	require Carp;
daf4d4ea	22	Carp::croak("Undefined subroutine $AUTOLOAD called");
a0ed51b3 LW	23	}
	24
	25	1;
	26	__END__
	27
	28	=head1 NAME
	29
b3419ed8	30	utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source code
a0ed51b3 LW	31
	32	=head1 SYNOPSIS
	33
291cc134 KW	34	use utf8;
291cc134 KW	35	no utf8;
a0ed51b3	36
291cc134	37	# Convert the internal representation of a Perl scalar to/from UTF-8.
836ccc8e	38
291cc134	39	$num_octets = utf8::upgrade($string);
98695e13	40	$success = utf8::downgrade($string[, $fail_ok]);
973655a8	41
291cc134 KW	42	# Change each character of a Perl scalar to/from a series of
291cc134 KW	43	# characters that represent the UTF-8 bytes of each original character.
836ccc8e	44
291cc134 KW	45	utf8::encode($string); # "\x{100}" becomes "\xc4\x80"
291cc134 KW	46	utf8::decode($string); # "\xc4\x80" becomes "\x{100}"
973655a8	47
ca3d51ba KW	48	# Convert a code point from the platform native character set to
	49	# Unicode, and vice-versa.
	50	$unicode = utf8::native_to_unicode(ord('A')); # returns 65 on both
	51	# ASCII and EBCDIC
	52	# platforms
a04477f8 KW	53	$native = utf8::unicode_to_native(65); # returns 65 on ASCII
	54	# platforms; 193 on
	55	# EBCDIC
ca3d51ba	56
ac8b87d7 EB	57	$flag = utf8::is_utf8($string); # since Perl 5.8.1
ac8b87d7 EB	58	$flag = utf8::valid($string);
973655a8	59
a0ed51b3 LW	60	=head1 DESCRIPTION
a0ed51b3 LW	61
393fec97	62	The C<use utf8> pragma tells the Perl parser to allow UTF-8 in the
a04477f8 KW	63	program text in the current lexical scope. The C<no utf8> pragma tells Perl
	64	to switch back to treating the source text as literal bytes in the current
	65	lexical scope. (On EBCDIC platforms, technically it is allowing UTF-EBCDIC,
	66	and not UTF-8, but this distinction is academic, so in this document the term
	67	UTF-8 is used to mean both).
a0ed51b3	68
19b49582 JH	69	B<Do not use this pragma for anything other than telling Perl that your
19b49582 JH	70	script is written in UTF-8.> The utility functions described below are
2575c402 JW	71	directly usable without C<use utf8;>.
	72
	73	Because it is not possible to reliably tell UTF-8 from native 8 bit
	74	encodings, you need either a Byte Order Mark at the beginning of your
	75	source code, or C<use utf8;>, to instruct perl.
19b49582	76
2575c402	77	When UTF-8 becomes the standard source format, this pragma will
a04477f8	78	effectively become a no-op.
a0ed51b3	79
a74e8b45	80	See also the effects of the C<-C> switch and its cousin, the
127161e0	81	C<PERL_UNICODE> environment variable, in L<perlrun>.
a74e8b45	82
ad0029c4	83	Enabling the C<utf8> pragma has the following effect:
a0ed51b3	84
4ac9195f	85	=over 4
a0ed51b3 LW	86
	87	=item *
	88
a04477f8 KW	89	Bytes in the source text that are not in the ASCII character set will be
a04477f8 KW	90	treated as being part of a literal UTF-8 sequence. This includes most
c20e2abd	91	literals such as identifier names, string constants, and constant
8f8cf39c JH	92	regular expression patterns.
8f8cf39c JH	93
4ac9195f MS	94	=back
4ac9195f MS	95
a04477f8 KW	96	Note that if you have non-ASCII, non-UTF-8 bytes in your script (for example
	97	embedded Latin-1 in your string literals), Perl will complain about malformed UTF-8. If
	98	you want to have such bytes under C<use utf8>, you can disable this pragma
	99	until the end of the block (or file, if at top level) with C<no utf8;>.
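
As an illustration of the lexical scoping described above, the following sketch compares the length of one literal under C<use utf8> and under C<no utf8>. It assumes the file is saved as UTF-8 and is run on an ASCII platform; the variable names C<$chars> and C<$bytes> are illustrative only.

```perl
use strict;
use warnings;

my ($chars, $bytes);
{
    use utf8;                # literals in this block are decoded as UTF-8
    $chars = length("é");    # one character, U+00E9
}
{
    no utf8;                 # literals in this block are taken as raw bytes
    $bytes = length("é");    # the two UTF-8 bytes 0xC3 0xA9
}
print "use utf8: $chars, no utf8: $bytes\n";
```

Both blocks contain the same bytes on disk; only the pragma in effect decides whether the parser sees one character or two.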
ae90e350	100
1b026014 NIS	101	=head2 Utility functions
1b026014 NIS	102
8800c35a JH	103	The following functions are defined in the C<utf8::> package by the
8800c35a JH	104	Perl core. You do not need to say C<use utf8> to use these and in fact
2f7e5073	105	you should not say that unless you really want to have UTF-8 source code.
1b026014 NIS	106
	107	=over 4
	108
308a4ae1	109	=item * C<$num_octets = utf8::upgrade($string)>
1b026014	110
a04477f8	111	(Since Perl v5.8.0)
836ccc8e	112	Converts in-place the internal representation of the string from an octet
a04477f8	113	sequence in the native encoding (Latin-1 or EBCDIC) to UTF-8. The
836ccc8e	114	logical character sequence itself is unchanged. If I<$string> is already
0397beb0 TC	115	upgraded, then this is a no-op. Returns the
0397beb0 TC	116	number of octets necessary to represent the string as UTF-8.
347a477c KW	117	Since Perl v5.38, if C<$string> is C<undef> no action is taken; prior to that,
347a477c KW	118	it would be converted to be defined and zero-length.
0397beb0 TC	119
	120	If your code needs to be compatible with versions of perl without
	121	C<use feature 'unicode_strings';>, you can force Unicode semantics on
	122	a given string:
	123
	124	# force unicode semantics for $string without the
	125	# "unicode_strings" feature
	126	utf8::upgrade($string);
	127
	128	For example:
	129
	130	# without explicit or implicit use feature 'unicode_strings'
	131	my $x = "\xDF"; # LATIN SMALL LETTER SHARP S
	132	$x =~ /ss/i; # won't match
	133	my $y = uc($x); # won't convert
	134	utf8::upgrade($x);
	135	$x =~ /ss/i; # matches
	136	my $z = uc($x); # converts to "SS"
78ea37eb	137
a04477f8 KW	138	B<Note that this function does not handle arbitrary encodings>;
a04477f8 KW	139	use L<Encode> instead.
1b026014	140
308a4ae1	141	=item * C<$success = utf8::downgrade($string[, $fail_ok])>
1b026014	142
a04477f8	143	(Since Perl v5.8.0)
50a85cfe KW	144	Converts in-place the internal representation of the string from UTF-8 to the
	145	equivalent octet sequence in the native encoding (Latin-1 or EBCDIC). The
	146	logical character sequence itself is unchanged. If I<$string> is already
	147	stored as native 8 bit, then this is a no-op. Can be used to make sure that
	148	the UTF-8 flag is off, e.g. when you want to make sure that the substr() or
	149	length() function works with the usually faster byte algorithm.
78ea37eb	150
a04477f8	151	Fails if the original UTF-8 sequence cannot be represented in the
ac8b87d7	152	native 8 bit encoding. On failure dies or, if the value of I<$fail_ok> is
2575c402	153	true, returns false.
78ea37eb	154
2575c402 JW	155	Returns true on success.
2575c402 JW	156
0397beb0 TC	157	If your code expects an octet sequence this can be used to validate
	158	that you've received one:
	159
	160	# throw an exception if not representable as octets
	161	utf8::downgrade($string);
	162
	163	# or do your own error handling
	164	utf8::downgrade($string, 1) or die "string must be octets";
	165
a04477f8 KW	166	B<Note that this function does not handle arbitrary encodings>;
a04477f8 KW	167	use L<Encode> instead.
78ea37eb	168
308a4ae1	169	=item * C<utf8::encode($string)>
1b026014	170
a04477f8	171	(Since Perl v5.8.0)
2575c402	172	Converts in-place the character sequence to the corresponding octet
50a85cfe KW	173	sequence in Perl's extended UTF-8. That is, every (possibly wide) character
50a85cfe KW	174	gets replaced with a sequence of one or more characters that represent the
a04477f8	175	individual UTF-8 bytes of the character. The UTF-8 flag is turned off.
836ccc8e DM	176	Returns nothing.
836ccc8e DM	177
0397beb0 TC	178	my $x = "\x{100}"; # $x contains one character, with ord 0x100
0397beb0 TC	179	utf8::encode($x); # $x contains two characters, with ords (on
a04477f8 KW	180	# ASCII platforms) 0xc4 and 0x80. On EBCDIC
a04477f8 KW	181	# 1047, this would instead be 0x8C and 0x41.
78ea37eb	182
0397beb0 TC	183	Similar to:
	184
	185	use Encode;
	186	$x = Encode::encode("utf8", $x);
	187
a04477f8 KW	188	B<Note that this function does not handle arbitrary encodings>;
a04477f8 KW	189	use L<Encode> instead.
094ce63c	190
308a4ae1	191	=item * C<$success = utf8::decode($string)>
1b026014	192
a04477f8	193	(Since Perl v5.8.0)
50a85cfe KW	194	Attempts to convert in-place the octet sequence encoded in Perl's extended
	195	UTF-8 to the corresponding character sequence. That is, it replaces each
	196	sequence of characters in the string whose ords represent a valid (extended)
	197	UTF-8 byte sequence, with the corresponding single character. The UTF-8 flag
	198	is turned on only if the source string contains multiple-byte UTF-8
	199	characters. If I<$string> is invalid as extended UTF-8, returns false;
836ccc8e DM	200	otherwise returns true.
836ccc8e DM	201
0397beb0	202	my $x = "\xc4\x80"; # $x contains two characters, with ords
ca3d51ba	203	# 0xc4 and 0x80
0397beb0	204	utf8::decode($x); # On ASCII platforms, $x contains one char,
a04477f8	205	# with ord 0x100. Since these bytes aren't
0397beb0	206	# legal UTF-EBCDIC, on EBCDIC platforms, $x is
a04477f8	207	# unchanged and the function returns FALSE.
089cd0e7 KW	208	my $y = "\xc3\x83\xc2\xab"; # This has been encoded twice; this
	209	# example is only for ASCII platforms
	210	utf8::decode($y); # Converts $y to \xc3\xab, returns TRUE;
	211	utf8::decode($y); # Further converts to \xeb, returns TRUE;
	212	utf8::decode($y); # Returns FALSE, leaves $y unchanged
78ea37eb	213
a04477f8 KW	214	B<Note that this function does not handle arbitrary encodings>;
a04477f8 KW	215	use L<Encode> instead.
78ea37eb	216
ca3d51ba KW	217	=item * C<$unicode = utf8::native_to_unicode($code_point)>
ca3d51ba KW	218
273e254d	219	(Since Perl v5.8.0)
ca3d51ba KW	220	This takes an unsigned integer (which represents the ordinal number of a
	221	character (or a code point) on the platform the program is being run on) and
	222	returns its Unicode equivalent value. Since ASCII platforms natively use the
	223	Unicode code points, this function returns its input on them. On EBCDIC
bc1767aa	224	platforms it converts from EBCDIC to Unicode.
ca3d51ba KW	225
	226	A meaningless value will currently be returned if the input is not an unsigned
	227	integer.
	228
273e254d KW	229	Since Perl v5.22.0, calls to this function are optimized out on ASCII
	230	platforms, so there is no performance hit in using it there.
	231
ca3d51ba KW	232	=item * C<$native = utf8::unicode_to_native($code_point)>
ca3d51ba KW	233
273e254d	234	(Since Perl v5.8.0)
ca3d51ba KW	235	This is the inverse of C<utf8::native_to_unicode()>, converting the other
	236	direction. Again, on ASCII platforms, this returns its input, but on EBCDIC
	237	platforms it will find the native platform code point, given any Unicode one.
	238
	239	A meaningless value will currently be returned if the input is not an unsigned
	240	integer.
	241
273e254d KW	242	Since Perl v5.22.0, calls to this function are optimized out on ASCII
	243	platforms, so there is no performance hit in using it there.
	244
308a4ae1	245	=item * C<$flag = utf8::is_utf8($string)>
8800c35a	246
ac8b87d7	247	(Since Perl 5.8.1) Test whether I<$string> is marked internally as encoded in
0397beb0 TC	248	UTF-8. Functionally the same as C<Encode::is_utf8($string)>.
	249
	250	Typically only necessary for debugging and testing. If you need to
	251	dump the internals of an SV, L<Devel::Peek's|Devel::Peek> Dump()
	252	provides more detail in a compact form.
	253
	254	If you still think you need this outside of debugging, testing or
	255	dealing with filenames, you should probably read L<perlunitut> and
	256	L<perlunifaq/What is "the UTF8 flag"?>.
	257
	258	Don't use this flag as a marker to distinguish character and binary
0c50e915	259	data: that should be decided for each variable when you write your
0397beb0 TC	260	code.
	261
	262	To force Unicode semantics in code portable to perl 5.8 and 5.10, call
	263	C<utf8::upgrade($string)> unconditionally.
8800c35a	264
308a4ae1	265	=item * C<$flag = utf8::valid($string)>
70122e76	266
ac8b87d7	267	[INTERNAL] Test whether I<$string> is in a consistent state regarding
8b4b6c86 KW	268	UTF-8. Will return true if it is well-formed Perl extended UTF-8 and has the
8b4b6c86 KW	269	UTF-8 flag
ac8b87d7	270	on B<or> if I<$string> is held as bytes (both these states are 'consistent').
0c50e915	271	The main reason for this routine is to allow Perl's test suite to check
0397beb0	272	that operations have left strings in a consistent state.
70122e76	273
1b026014 NIS	274	=back
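
The relationship between C<utf8::upgrade>, C<utf8::downgrade> and C<utf8::is_utf8> can be sketched as follows. This is a minimal example for inspection and testing only, assuming an ASCII platform; all variable names are hypothetical.

```perl
use strict;
use warnings;

my $s    = "caf\xE9";    # four characters, all below 0x100
my $copy = $s;           # untouched copy for comparison

utf8::upgrade($s);       # switch the internal storage to UTF-8
my $flag_after_upgrade = utf8::is_utf8($s) ? 1 : 0;
my $same_string        = ($s eq $copy) ? 1 : 0;   # logical content unchanged
my $same_length        = length($s);              # still counted in characters

my $downgraded           = utf8::downgrade($s, 1) ? 1 : 0;  # back to 8-bit storage
my $flag_after_downgrade = utf8::is_utf8($s) ? 1 : 0;

my $wide   = "\x{100}";                           # does not fit in 8 bits
my $failed = utf8::downgrade($wide, 1) ? 0 : 1;   # $fail_ok set: false, no die

print "upgraded: flag=$flag_after_upgrade same=$same_string length=$same_length\n";
print "downgraded: ok=$downgraded flag=$flag_after_downgrade wide_failed=$failed\n";
```

Only the internal flag changes between the two calls; string comparison and C<length> give the same answers throughout, which is why the flag is a poor marker for "character versus binary" data.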
1b026014 NIS	275
7d865a91	276	C<utf8::encode> is like C<utf8::upgrade>, but the UTF-8 flag is
a04477f8 KW	277	cleared. See L<perlunicode>, and the C API
	278	functions C<L<sv_utf8_upgrade|perlapi/sv_utf8_upgrade>>,
	279	C<L<perlapi/sv_utf8_downgrade>>, C<L<perlapi/sv_utf8_encode>>,
	280	and C<L<perlapi/sv_utf8_decode>>, which are wrapped by the Perl functions
094ce63c	281	C<utf8::upgrade>, C<utf8::downgrade>, C<utf8::encode> and
a04477f8 KW	282	C<utf8::decode>. Also, the functions C<utf8::is_utf8>, C<utf8::valid>,
a04477f8 KW	283	C<utf8::encode>, C<utf8::decode>, C<utf8::upgrade>, and C<utf8::downgrade> are
7edb8f2b RGS	284	actually internal, and thus always available, without a C<require utf8>
7edb8f2b RGS	285	statement.
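
To make the contrast between C<utf8::encode> and C<utf8::upgrade> concrete, here is a small sketch. It assumes an ASCII platform and needs no C<use utf8> or C<require utf8>; the variable names are illustrative, not part of any API.

```perl
use strict;
use warnings;

my $upgraded = "\xE9";       # one character, U+00E9
my $encoded  = "\xE9";       # same starting value

utf8::upgrade($upgraded);    # same character, UTF-8 storage, flag on
utf8::encode($encoded);      # now two characters, 0xC3 0xA9, flag off

printf "upgraded: length %d, flag %d\n",
    length($upgraded), utf8::is_utf8($upgraded) ? 1 : 0;
printf "encoded:  length %d, flag %d\n",
    length($encoded),  utf8::is_utf8($encoded)  ? 1 : 0;

# utf8::decode reverses utf8::encode and reports whether the octets were valid.
my $decoded    = $encoded;
my $decode_ok  = utf8::decode($decoded) ? 1 : 0;
my $round_trip = ($decoded eq $upgraded) ? 1 : 0;

# A lone Latin-1 byte is not valid UTF-8, so decode returns false
# and leaves the string as it was.
my $latin1    = "caf\xE9";
my $latin1_ok = utf8::decode($latin1) ? 1 : 0;
```

Checking the return value of C<utf8::decode>, as done here, is the only signal that the input was not valid UTF-8.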
f1e62f77	286
8f8cf39c JH	287	=head1 BUGS
8f8cf39c JH	288
a04477f8 KW	289	Some filesystems may not support UTF-8 file names, or they may be supported
	290	incompatibly with Perl. Therefore UTF-8 names that are visible to the
	291	filesystem, such as module names, may not work.
8f8cf39c	292
393fec97	293	=head1 SEE ALSO
a0ed51b3	294
2575c402	295	L<perlunitut>, L<perluniintro>, L<perlrun>, L<bytes>, L<perlunicode>
a0ed51b3 LW	296
a0ed51b3 LW	297	=cut