perl5.git.perl.org Git - perl5.git/blame_incremental

Commit	Line	Data
	1	package utf8;
	2
	3	use strict;
	4	use warnings;
	5
	6	our $hint_bits = 0x00800000;
	7
	8	our $VERSION = '1.25';
	9	our $AUTOLOAD;
	10
	11	sub import {
	12	$^H |= $hint_bits;
	13	}
	14
	15	sub unimport {
	16	$^H &= ~$hint_bits;
	17	}
	18
	19	sub AUTOLOAD {
	20	goto &$AUTOLOAD if defined &$AUTOLOAD;
	21	require Carp;
	22	Carp::croak("Undefined subroutine $AUTOLOAD called");
	23	}
	24
	25	1;
	26	__END__
	27
	28	=head1 NAME
	29
	30	utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source code
	31
	32	=head1 SYNOPSIS
	33
	34	use utf8;
	35	no utf8;
	36
	37	# Convert the internal representation of a Perl scalar to/from UTF-8.
	38
	39	$num_octets = utf8::upgrade($string);
	40	$success = utf8::downgrade($string[, $fail_ok]);
	41
	42	# Change each character of a Perl scalar to/from a series of
	43	# characters that represent the UTF-8 bytes of each original character.
	44
	45	utf8::encode($string); # "\x{100}" becomes "\xc4\x80"
	46	utf8::decode($string); # "\xc4\x80" becomes "\x{100}"
	47
	48	# Convert a code point from the platform native character set to
	49	# Unicode, and vice-versa.
	50	$unicode = utf8::native_to_unicode(ord('A')); # returns 65 on both
	51	# ASCII and EBCDIC
	52	# platforms
	53	$native = utf8::unicode_to_native(65); # returns 65 on ASCII
	54	# platforms; 193 on
	55	# EBCDIC
	56
	57	$flag = utf8::is_utf8($string); # since Perl 5.8.1
	58	$flag = utf8::valid($string);
	59
	60	=head1 DESCRIPTION
	61
	62	The C<use utf8> pragma tells the Perl parser to allow UTF-8 in the
	63	program text in the current lexical scope. The C<no utf8> pragma tells Perl
	64	to switch back to treating the source text as literal bytes in the current
	65	lexical scope. (On EBCDIC platforms, technically it is allowing UTF-EBCDIC,
	66	and not UTF-8, but this distinction is academic, so in this document the term
	67	UTF-8 is used to mean both).
	68
	69	B<Do not use this pragma for anything other than telling Perl that your
	70	script is written in UTF-8.> The utility functions described below are
	71	directly usable without C<use utf8;>.
	72
	73	Because it is not possible to reliably tell UTF-8 from native 8 bit
	74	encodings, you need either a Byte Order Mark at the beginning of your
	75	source code, or C<use utf8;>, to tell perl that the source is UTF-8.
	76
	77	When UTF-8 becomes the standard source format, this pragma will
	78	effectively become a no-op.
	79
	80	See also the effects of the C<-C> switch and its cousin, the
	81	C<PERL_UNICODE> environment variable, in L<perlrun>.
	82
	83	Enabling the C<utf8> pragma has the following effect:
	84
	85	=over 4
	86
	87	=item *
	88
	89	Bytes in the source text that are not in the ASCII character set will be
	90	treated as being part of a literal UTF-8 sequence. This includes most
	91	literals such as identifier names, string constants, and constant
	92	regular expression patterns.
	93
	94	=back
	95
	96	Note that if you have non-ASCII, non-UTF-8 bytes in your script (for example
	97	embedded Latin-1 in your string literals), Perl will treat them as malformed
	98	UTF-8 under C<use utf8>. If you need such bytes in your source, you can disable
	99	this pragma until the end of the block (or file, if at top level) with C<no utf8;>.
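
The lexical scoping described above can be illustrated with a minimal sketch. It assumes an ASCII platform and a source file saved as UTF-8; the variable names C<$chars> and C<$bytes> are illustrative only. The same literal is read as one character where the pragma is on and as two bytes where it is off:

```perl
use strict;
use warnings;

my ($chars, $bytes);
{
    use utf8;                 # literals in this block are decoded as UTF-8
    $chars = length("é");     # one character, U+00E9
}
{
    no utf8;                  # literals in this block are raw source bytes
    $bytes = length("é");     # two bytes, 0xC3 0xA9
}
print "use utf8: $chars character; no utf8: $bytes bytes\n";
```

If the file were saved in another encoding, the counts would differ, which is why the pragma should only be used for genuinely UTF-8 source.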
	100
	101	=head2 Utility functions
	102
	103	The following functions are defined in the C<utf8::> package by the
	104	Perl core. You do not need to say C<use utf8> to use these and in fact
	105	you should not say that unless you really want to have UTF-8 source code.
	106
	107	=over 4
	108
	109	=item * C<$num_octets = utf8::upgrade($string)>
	110
	111	(Since Perl v5.8.0)
	112	Converts in-place the internal representation of the string from an octet
	113	sequence in the native encoding (Latin-1 or EBCDIC) to UTF-8. The
	114	logical character sequence itself is unchanged. If I<$string> is already
	115	upgraded, then this is a no-op. Returns the
	116	number of octets necessary to represent the string as UTF-8.
	117	Since Perl v5.38, if C<$string> is C<undef> no action is taken; prior to that,
	118	it would be converted to be defined and zero-length.
	119
	120	If your code needs to be compatible with versions of perl without
	121	C<use feature 'unicode_strings';>, you can force Unicode semantics on
	122	a given string:
	123
	124	# force unicode semantics for $string without the
	125	# "unicode_strings" feature
	126	utf8::upgrade($string);
	127
	128	For example:
	129
	130	# without explicit or implicit use feature 'unicode_strings'
	131	my $x = "\xDF"; # LATIN SMALL LETTER SHARP S
	132	$x =~ /ss/i; # won't match
	133	my $y = uc($x); # won't convert
	134	utf8::upgrade($x);
	135	$x =~ /ss/i; # matches
	136	my $z = uc($x); # converts to "SS"
	137
	138	B<Note that this function does not handle arbitrary encodings>;
	139	use L<Encode> instead.
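
As a further sketch (assuming an ASCII platform; the variable names are illustrative), the following shows that upgrading changes only the storage. The return value counts the octets of the UTF-8 form, while the length in characters and the result of string comparison stay the same:

```perl
use strict;
use warnings;

my $plain    = "caf\xE9";          # four characters, stored as four native octets
my $upgraded = $plain;

my $num_octets = utf8::upgrade($upgraded);

# Only the storage changed: "\xE9" now occupies two octets internally,
# but the string still has four characters and compares equal.
printf "octets as UTF-8: %d, length: %d, equal: %s\n",
    $num_octets, length($upgraded), ($plain eq $upgraded ? "yes" : "no");
```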
	140
	141	=item * C<$success = utf8::downgrade($string[, $fail_ok])>
	142
	143	(Since Perl v5.8.0)
	144	Converts in-place the internal representation of the string from UTF-8 to the
	145	equivalent octet sequence in the native encoding (Latin-1 or EBCDIC). The
	146	logical character sequence itself is unchanged. If I<$string> is already
	147	stored as native 8 bit, then this is a no-op. Can be used to make sure that
	148	the UTF-8 flag is off, e.g. when you want to make sure that the substr() or
	149	length() function works with the usually faster byte algorithm.
	150
	151	Fails if the original UTF-8 sequence cannot be represented in the
	152	native 8 bit encoding. On failure dies or, if the value of I<$fail_ok> is
	153	true, returns false.
	154
	155	Returns true on success.
	156
	157	If your code expects an octet sequence this can be used to validate
	158	that you've received one:
	159
	160	# throw an exception if not representable as octets
	161	utf8::downgrade($string);
	162
	163	# or do your own error handling
	164	utf8::downgrade($string, 1) or die "string must be octets";
	165
	166	B<Note that this function does not handle arbitrary encodings>;
	167	use L<Encode> instead.
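
The three outcomes described above (success, a false return with I<$fail_ok>, and an exception without it) can be sketched as follows. The variable names are illustrative, and C<eval> is just one way to trap the exception:

```perl
use strict;
use warnings;

my $narrow = "\xE9";
utf8::upgrade($narrow);                      # stored as UTF-8, still one character
my $narrow_ok = utf8::downgrade($narrow);    # true: the character fits in one octet

my $wide    = "\x{100}";                     # cannot be stored in one octet
my $wide_ok = utf8::downgrade($wide, 1);     # false, because $fail_ok is true

# Without $fail_ok the same call dies, so trap it with eval.
my $died = !eval { utf8::downgrade($wide); 1 };

printf "narrow: %s, wide: %s, died: %s\n",
    ($narrow_ok ? "ok" : "failed"),
    ($wide_ok   ? "ok" : "failed"),
    ($died      ? "yes" : "no");
```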
	168
	169	=item * C<utf8::encode($string)>
	170
	171	(Since Perl v5.8.0)
	172	Converts in-place the character sequence to the corresponding octet
	173	sequence in Perl's extended UTF-8. That is, every (possibly wide) character
	174	gets replaced with a sequence of one or more characters that represent the
	175	individual UTF-8 bytes of the character. The UTF8 flag is turned off.
	176	Returns nothing.
	177
	178	my $x = "\x{100}"; # $x contains one character, with ord 0x100
	179	utf8::encode($x); # $x contains two characters, with ords (on
	180	# ASCII platforms) 0xc4 and 0x80. On EBCDIC
	181	# 1047, this would instead be 0x8C and 0x41.
	182
	183	Similar to:
	184
	185	use Encode;
	186	$x = Encode::encode("utf8", $x);
	187
	188	B<Note that this function does not handle arbitrary encodings>;
	189	use L<Encode> instead.
	190
	191	=item * C<$success = utf8::decode($string)>
	192
	193	(Since Perl v5.8.0)
	194	Attempts to convert in-place the octet sequence encoded in Perl's extended
	195	UTF-8 to the corresponding character sequence. That is, it replaces each
	196	sequence of characters in the string whose ords represent a valid (extended)
	197	UTF-8 byte sequence, with the corresponding single character. The UTF-8 flag
	198	is turned on only if the source string contains multiple-byte UTF-8
	199	characters. If I<$string> is invalid as extended UTF-8, returns false;
	200	otherwise returns true.
	201
	202	my $x = "\xc4\x80"; # $x contains two characters, with ords
	203	# 0xc4 and 0x80
	204	utf8::decode($x); # On ASCII platforms, $x contains one char,
	205	# with ord 0x100. Since these bytes aren't
	206	# legal UTF-EBCDIC, on EBCDIC platforms, $x is
	207	# unchanged and the function returns FALSE.
	208	my $y = "\xc3\x83\xc2\xab"; # This has been encoded twice; this
	209	# example is only for ASCII platforms
	210	utf8::decode($y); # Converts $y to \xc3\xab, returns TRUE;
	211	utf8::decode($y); # Further converts to \xeb, returns TRUE;
	212	utf8::decode($y); # Returns FALSE, leaves $y unchanged
	213
	214	B<Note that this function does not handle arbitrary encodings>;
	215	use L<Encode> instead.
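
A short round-trip sketch, assuming an ASCII platform and using illustrative variable names, shows C<utf8::encode> and C<utf8::decode> as inverses. It also shows why the return value of C<utf8::decode> is worth checking: Latin-1 input is not valid UTF-8 and is left unchanged.

```perl
use strict;
use warnings;

my $text   = "\x{263A}";      # one character, WHITE SMILING FACE
my $octets = $text;
utf8::encode($octets);        # now three characters: 0xE2 0x98 0xBA

my $round_trip = $octets;
my $decoded_ok = utf8::decode($round_trip);   # true; back to one character

my $latin1    = "caf\xE9";    # 0xE9 starts a UTF-8 sequence that never completes
my $before    = $latin1;
my $latin1_ok = utf8::decode($latin1);        # false; $latin1 is left alone

printf "encoded length: %d, round trip: %s, latin1 decoded: %s\n",
    length($octets),
    ($decoded_ok && $round_trip eq $text ? "ok" : "failed"),
    ($latin1_ok ? "yes" : "no");
```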
	216
	217	=item * C<$unicode = utf8::native_to_unicode($code_point)>
	218
	219	(Since Perl v5.8.0)
	220	This takes an unsigned integer (which represents the ordinal number of a
	221	character (or a code point) on the platform the program is being run on) and
	222	returns its Unicode equivalent value. Since ASCII platforms natively use the
	223	Unicode code points, this function returns its input on them. On EBCDIC
	224	platforms it converts from EBCDIC to Unicode.
	225
	226	A meaningless value will currently be returned if the input is not an unsigned
	227	integer.
	228
	229	Since Perl v5.22.0, calls to this function are optimized out on ASCII
	230	platforms, so there is no performance hit in using it there.
	231
	232	=item * C<$native = utf8::unicode_to_native($code_point)>
	233
	234	(Since Perl v5.8.0)
	235	This is the inverse of C<utf8::native_to_unicode()>, converting the other
	236	direction. Again, on ASCII platforms, this returns its input, but on EBCDIC
	237	platforms it will find the native platform code point, given any Unicode one.
	238
	239	A meaningless value will currently be returned if the input is not an unsigned
	240	integer.
	241
	242	Since Perl v5.22.0, calls to this function are optimized out on ASCII
	243	platforms, so there is no performance hit in using it there.
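
The two conversion functions can be exercised together in a small sketch (variable names are illustrative). It is written to print the same result on ASCII and EBCDIC platforms, because it only compares against C<ord()> and C<chr()> of the running platform:

```perl
use strict;
use warnings;

my $unicode = utf8::native_to_unicode(ord('A'));   # 65 on every platform
my $native  = utf8::unicode_to_native(65);         # the platform's ord('A')

# The two functions are inverses of each other.
my $same = utf8::unicode_to_native(utf8::native_to_unicode(ord('Z'))) == ord('Z');

printf "unicode: %d, native char: %s, inverse holds: %s\n",
    $unicode, chr($native), ($same ? "yes" : "no");
```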
	244
	245	=item * C<$flag = utf8::is_utf8($string)>
	246
	247	(Since Perl 5.8.1) Test whether I<$string> is marked internally as encoded in
	248	UTF-8. Functionally the same as C<Encode::is_utf8($string)>.
	249
	250	Typically only necessary for debugging and testing. If you need to
	251	dump the internals of an SV, L<Devel::Peek's|Devel::Peek> Dump()
	252	provides more detail in a compact form.
	253
	254	If you still think you need this outside of debugging, testing or
	255	dealing with filenames, you should probably read L<perlunitut> and
	256	L<perlunifaq/What is "the UTF8 flag"?>.
	257
	258	Don't use this flag as a marker to distinguish character and binary
	259	data: that should be decided for each variable when you write your
	260	code.
	261
	262	To force Unicode semantics in code portable to perl 5.8 and 5.10, call
	263	C<utf8::upgrade($string)> unconditionally.
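
To see why the flag is a poor marker for "text versus binary", consider this sketch (assuming an ASCII platform; the variable names are illustrative). The two strings hold the same single character and compare equal, yet C<utf8::is_utf8> reports different values for them:

```perl
use strict;
use warnings;

my $octets   = "\xE9";        # stored as one native octet, flag off
my $upgraded = "\xE9";
utf8::upgrade($upgraded);     # same character, stored as UTF-8, flag on

# Equal strings, different flags: the flag describes storage, not content.
printf "equal: %s, flags: %d and %d\n",
    ($octets eq $upgraded ? "yes" : "no"),
    (utf8::is_utf8($octets)   ? 1 : 0),
    (utf8::is_utf8($upgraded) ? 1 : 0);
```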
	264
	265	=item * C<$flag = utf8::valid($string)>
	266
	267	[INTERNAL] Test whether I<$string> is in a consistent state regarding
	268	UTF-8. Will return true if it is well-formed Perl extended UTF-8 and has the
	269	UTF-8 flag
	270	on B<or> if I<$string> is held as bytes (both these states are 'consistent').
	271	The main reason for this routine is to allow Perl's test suite to check
	272	that operations have left strings in a consistent state.
	273
	274	=back
	275
	276	C<utf8::encode> is like C<utf8::upgrade>, but the UTF8 flag is
	277	cleared. See L<perlunicode>, and the C API
	278	functions C<L<sv_utf8_upgrade|perlapi/sv_utf8_upgrade>>,
	279	C<L<perlapi/sv_utf8_downgrade>>, C<L<perlapi/sv_utf8_encode>>,
	280	and C<L<perlapi/sv_utf8_decode>>, which are wrapped by the Perl functions
	281	C<utf8::upgrade>, C<utf8::downgrade>, C<utf8::encode> and
	282	C<utf8::decode>. Also, the functions C<utf8::is_utf8>, C<utf8::valid>,
	283	C<utf8::encode>, C<utf8::decode>, C<utf8::upgrade>, and C<utf8::downgrade> are
	284	actually internal, and thus always available, without a C<require utf8>
	285	statement.
	286
	287	=head1 BUGS
	288
	289	Some filesystems may not support UTF-8 file names, or they may be supported
	290	incompatibly with Perl. Therefore UTF-8 names that are visible to the
	291	filesystem, such as module names, may not work.
	292
	293	=head1 SEE ALSO
	294
	295	L<perlunitut>, L<perluniintro>, L<perlrun>, L<bytes>, L<perlunicode>
	296
	297	=cut