	1	package utf8;
	2
	3	$utf8::hint_bits = 0x00800000;
	4
	5	our $VERSION = '1.13';
	6
	7	sub import {
	8	    $^H |= $utf8::hint_bits;
	9	}
	10
	11	sub unimport {
	12	    $^H &= ~$utf8::hint_bits;
	13	}
	14
	15	sub AUTOLOAD {
	16	    require "utf8_heavy.pl";
	17	    goto &$AUTOLOAD if defined &$AUTOLOAD;
	18	    require Carp;
	19	    Carp::croak("Undefined subroutine $AUTOLOAD called");
	20	}
	21
	22	1;
	23	__END__
	24
	25	=head1 NAME
	26
	27	utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source code
	28
	29	=head1 SYNOPSIS
	30
	31	    use utf8;
	32	    no utf8;
	33
	34	    # Convert the internal representation of a Perl scalar to/from UTF-8.
	35
	36	    $num_octets = utf8::upgrade($string);
	37	    $success    = utf8::downgrade($string[, $fail_ok]);
	38
	39	    # Change each character of a Perl scalar to/from a series of
	40	    # characters that represent the UTF-8 bytes of each original character.
	41
	42	    utf8::encode($string);  # "\x{100}"  becomes "\xc4\x80"
	43	    utf8::decode($string);  # "\xc4\x80" becomes "\x{100}"
	44
	45	    $flag = utf8::is_utf8($string); # since Perl 5.8.1
	46	    $flag = utf8::valid($string);
	47
	48	=head1 DESCRIPTION
	49
	50	The C<use utf8> pragma tells the Perl parser to allow UTF-8 in the
	51	program text in the current lexical scope (allow UTF-EBCDIC on EBCDIC based
	52	platforms). The C<no utf8> pragma tells Perl to switch back to treating
	53	the source text as literal bytes in the current lexical scope.
	54
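The lexical scoping described above can be seen with a small sketch. It assumes that the file holding it is saved as UTF-8, so that the literal C<"é"> occupies two bytes on disk; the variable names are illustrative only.

```perl
use strict;
use warnings;

# Assumption: this file is saved as UTF-8, so "é" is two bytes on disk.
my ($chars, $bytes);
{
    use utf8;                # lexically scoped: ends at the closing brace
    $chars = length("é");    # the literal is parsed as one character
}
{
    no utf8;
    $bytes = length("é");    # the literal is parsed as its raw bytes
}
print "with utf8: $chars, without utf8: $bytes\n";
```

The same literal yields a different length in each block because only the parser's view of the source bytes changes; nothing is converted at run time.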
	55	B<Do not use this pragma for anything other than telling Perl that your
	56	script is written in UTF-8.> The utility functions described below are
	57	directly usable without C<use utf8;>.
	58
	59	Because it is not possible to reliably tell UTF-8 from native 8 bit
	60	encodings, you need either a Byte Order Mark at the beginning of your
	61	source code, or C<use utf8;>, to instruct perl.
	62
	63	When UTF-8 becomes the standard source format, this pragma will
	64	effectively become a no-op. For convenience in what follows the term
	65	I<UTF-X> is used to refer to UTF-8 on ASCII and ISO Latin based
	66	platforms and UTF-EBCDIC on EBCDIC based platforms.
	67
	68	See also the effects of the C<-C> switch and its cousin, the
	69	C<$ENV{PERL_UNICODE}>, in L<perlrun>.
	70
	71	Enabling the C<utf8> pragma has the following effect:
	72
	73	=over 4
	74
	75	=item *
	76
	77	Bytes in the source text that have their high-bit set will be treated
	78	as being part of a literal UTF-X sequence. This includes most
	79	literals such as identifier names, string constants, and constant
	80	regular expression patterns.
	81
	82	On EBCDIC platforms characters in the Latin 1 character set are
	83	treated as being part of a literal UTF-EBCDIC character.
	84
	85	=back
	86
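As a rough illustration of the effect listed above, the following sketch uses a non-ASCII identifier, string constant, and constant pattern together. It assumes the file is saved as UTF-8 on an ASCII-based platform; the names C<$café> and C<$has_diaeresis> are made up for the example.

```perl
use strict;
use warnings;
use utf8;    # assumption: this file is saved as UTF-8

my $café = "naïve";                          # identifier and string constant
my $has_diaeresis = ($café =~ /ï/) ? 1 : 0;  # pattern holds one character
my $length = length $café;                   # counts characters, not bytes

print "length=$length match=$has_diaeresis\n";
```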
	87	Note that if you have bytes with the eighth bit on in your script
	88	(for example embedded Latin-1 in your string literals), C<use utf8>
	89	will be unhappy since the bytes are most probably not well-formed
	90	UTF-X. If you want to have such bytes under C<use utf8>, you can disable
	91	this pragma until the end of the block (or file, if at top level) with
	92	C<no utf8;>.
	93
	94	=head2 Utility functions
	95
	96	The following functions are defined in the C<utf8::> package by the
	97	Perl core. You do not need to say C<use utf8> to use these and in fact
	98	you should not say that unless you really want to have UTF-8 source code.
	99
	100	=over 4
	101
	102	=item * $num_octets = utf8::upgrade($string)
	103
	104	Converts in-place the internal representation of the string from an octet
	105	sequence in the native encoding (Latin-1 or EBCDIC) to I<UTF-X>. The
	106	logical character sequence itself is unchanged. If I<$string> is already
	107	stored as I<UTF-X>, then this is a no-op. Returns the
	108	number of octets necessary to represent the string as I<UTF-X>. Can be
	109	used to make sure that the UTF-8 flag is on, so that C<\w> or C<lc()>
	110	work as Unicode on strings containing characters in the range 0x80-0xFF
	111	(on ASCII and derivatives).
	112
	113	B<Note that this function does not handle arbitrary encodings.>
	114	Therefore Encode is recommended for general purposes; see also
	115	L<Encode>.
	116
	117	=item * $success = utf8::downgrade($string[, $fail_ok])
	118
	119	Converts in-place the internal representation of the string from
	120	I<UTF-X> to the equivalent octet sequence in the native encoding (Latin-1
	121	or EBCDIC). The logical character sequence itself is unchanged. If
	122	I<$string> is already stored as native 8 bit, then this is a no-op. Can
	123	be used to
	124	make sure that the UTF-8 flag is off, e.g. when you want to make sure
	125	that the substr() or length() function works with the usually faster
	126	byte algorithm.
	127
	128	Fails if the original I<UTF-X> sequence cannot be represented in the
	129	native 8 bit encoding. On failure dies or, if the value of I<$fail_ok> is
	130	true, returns false.
	131
	132	Returns true on success.
	133
	134	B<Note that this function does not handle arbitrary encodings.>
	135	Therefore Encode is recommended for general purposes; see also
	136	L<Encode>.
	137
	138	=item * utf8::encode($string)
	139
	140	Converts in-place the character sequence to the corresponding octet
	141	sequence in I<UTF-X>. That is, every (possibly wide) character gets
	142	replaced with a sequence of one or more characters that represent the
	143	individual I<UTF-X> bytes of the character. The UTF8 flag is turned off.
	144	Returns nothing.
	145
	146	    my $a = "\x{100}"; # $a contains one character, with ord 0x100
	147	    utf8::encode($a);  # $a contains two characters, with ords 0xc4 and
	148	                       # 0x80
	149
	150	B<Note that this function does not handle arbitrary encodings.>
	151	Therefore Encode is recommended for general purposes; see also
	152	L<Encode>.
	153
	154	=item * $success = utf8::decode($string)
	155
	156	Attempts to convert in-place the octet sequence encoded as I<UTF-X> to the
	157	corresponding character sequence. That is, it replaces each sequence of
	158	characters in the string whose ords represent a valid UTF-X byte
	159	sequence, with the corresponding single character. The UTF-8 flag is
	160	turned on only if the source string contains multiple-byte I<UTF-X>
	161	characters. If I<$string> is invalid as I<UTF-X>, returns false;
	162	otherwise returns true.
	163
	164	    my $a = "\xc4\x80"; # $a contains two characters, with ords
	165	                        # 0xc4 and 0x80
	166	    utf8::decode($a);   # $a contains one character, with ord 0x100
	167
	168	B<Note that this function does not handle arbitrary encodings.>
	169	Therefore Encode is recommended for general purposes; see also
	170	L<Encode>.
	171
	172	=item * $flag = utf8::is_utf8($string)
	173
	174	(Since Perl 5.8.1) Test whether I<$string> is marked internally as encoded in
	175	UTF-8. Functionally the same as Encode::is_utf8().
	176
	177	=item * $flag = utf8::valid($string)
	178
	179	[INTERNAL] Tests whether I<$string> is in a consistent state regarding
	180	UTF-8. Returns true if it is well-formed UTF-8 and has the UTF-8 flag
	181	on B<or> if I<$string> is held as bytes (both these states are 'consistent').
	182	The main reason for this routine is to allow Perl's test suite to check
	183	that operations have left strings in a consistent state. You most
	184	probably want to use C<utf8::is_utf8()> instead.
	185
	186	=back
	187
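The behaviour of C<utf8::upgrade> and C<utf8::downgrade> described in the list above can be sketched as follows. This is a minimal example that assumes an ASCII-based platform and Perl's default regex rules (no C<unicode_strings> feature and no C</u> modifier); the variable names are illustrative.

```perl
use strict;
use warnings;

# "\xe9" (e with acute) starts out stored as a single native byte.
my $string = "caf\xe9";
my $word_before = ($string =~ /\w\z/) ? 1 : 0;   # 0 under the default rules

my $num_octets = utf8::upgrade($string);         # storage becomes UTF-8
my $word_after = ($string =~ /\w\z/) ? 1 : 0;    # 1: Unicode rules now apply

# Still four characters; only the storage grew to five octets.
printf "length=%d octets=%d word before=%d after=%d\n",
    length($string), $num_octets, $word_before, $word_after;

utf8::downgrade($string);                        # back to native 8-bit storage
print "flag is ", (utf8::is_utf8($string) ? "on" : "off"), "\n";

# U+0100 has no native 8-bit form, so downgrading cannot succeed.
my $wide = "\x{100}";
my $ok = utf8::downgrade($wide, 1);              # $fail_ok true: false, no die
print "downgrade of wide string ", ($ok ? "succeeded" : "failed"), "\n";
```

Passing a true C<$fail_ok> is a convenient way to probe whether a string fits in the native 8-bit encoding without wrapping the call in C<eval>.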
	188	C<utf8::encode> is like C<utf8::upgrade>, but the UTF-8 flag is
	189	cleared. See L<perlunicode> for more on the UTF-8 flag and the C API
	190	functions C<sv_utf8_upgrade>, C<sv_utf8_downgrade>, C<sv_utf8_encode>,
	191	and C<sv_utf8_decode>, which are wrapped by the Perl functions
	192	C<utf8::upgrade>, C<utf8::downgrade>, C<utf8::encode> and
	193	C<utf8::decode>. Also, the functions C<utf8::is_utf8>, C<utf8::valid>,
	194	C<utf8::encode>, C<utf8::decode>, C<utf8::upgrade>, and C<utf8::downgrade>
	195	are actually internal, and thus always available, without a C<require utf8>
	196	statement.
	197
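To make the contrast between C<utf8::upgrade> and C<utf8::encode> concrete, here is a small sketch that applies each to the same one-character string and then reverses the encoding with C<utf8::decode>. It assumes an ASCII-based platform; the variable names are illustrative.

```perl
use strict;
use warnings;

my $upgraded = "\xe9";
my $encoded  = "\xe9";

utf8::upgrade($upgraded);   # still one character; UTF-8 storage, flag on
utf8::encode($encoded);     # now two characters, 0xC3 and 0xA9; flag off

printf "upgraded: length=%d flag=%s\n",
    length($upgraded), (utf8::is_utf8($upgraded) ? "on" : "off");
printf "encoded:  length=%d flag=%s\n",
    length($encoded), (utf8::is_utf8($encoded) ? "on" : "off");

my $encoded_length = length($encoded);

# decode reverses encode and reports whether the octets were valid UTF-8.
my $decoded_ok = utf8::decode($encoded);     # true; $encoded is "\xe9" again

# A lead byte with no continuation byte is not valid UTF-8.
my $truncated = "\xc4";
my $truncated_ok = utf8::decode($truncated); # false

print "decode ok: ", ($decoded_ok ? 1 : 0),
      ", truncated ok: ", ($truncated_ok ? 1 : 0), "\n";
```

Checking the return value of C<utf8::decode> is the only signal that the input was malformed, so it is worth testing whenever the octets come from outside the program.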
	198	=head1 BUGS
	199
	200	One can have Unicode in identifier names, but not in package/class or
	201	subroutine names. While some limited functionality towards this does
	202	exist as of Perl 5.8.0, that is more accidental than designed; use of
	203	Unicode for the said purposes is unsupported.
	204
	205	One reason this remains unfinished is its (currently) inherent
	206	unportability: since both package names and subroutine names may need
	207	to be mapped to file and directory names, the Unicode capability of
	208	the filesystem becomes important, and unfortunately there are no
	209	portable answers.
	210
	211	=head1 SEE ALSO
	212
	213	L<perlunitut>, L<perluniintro>, L<perlrun>, L<bytes>, L<perlunicode>
	214
	215	=cut