	1	=head1 NAME
	2
	3	perluniintro - Perl Unicode introduction
	4
	5	=head1 DESCRIPTION
	6
	7	This document gives a general idea of Unicode and how to use Unicode
	8	in Perl.
	9
	10	=head2 Unicode
	11
	12	Unicode is a character set standard which plans to codify all of the
	13	writing systems of the world, plus many other symbols.
	14
	15	Unicode and ISO/IEC 10646 are coordinated standards that provide code
	16	points for characters in almost all modern character set standards,
	17	covering more than 30 writing systems and hundreds of languages,
	18	including all commercially-important modern languages. All characters
	19	in the largest Chinese, Japanese, and Korean dictionaries are also
	20	encoded. The standards will eventually cover almost all characters in
	21	more than 250 writing systems and thousands of languages.
	22
	23	A Unicode I<character> is an abstract entity. It is not bound to any
	24	particular integer width, especially not to the C language C<char>.
	25	Unicode is language-neutral and display-neutral: it does not encode the
	26	language of the text and it does not define fonts or other graphical
	27	layout details. Unicode operates on characters and on text built from
	28	those characters.
	29
	30	Unicode defines characters like C<LATIN CAPITAL LETTER A> or C<GREEK
	31	SMALL LETTER ALPHA> and unique numbers for the characters, in this
	32	case 0x0041 and 0x03B1, respectively. These unique numbers are called
	33	I<code points>.
	34
	35	The Unicode standard prefers using hexadecimal notation for the code
	36	points. If numbers like C<0x0041> are unfamiliar to
	37	you, take a peek at a later section, L</"Hexadecimal Notation">.
	38	The Unicode standard uses the notation C<U+0041 LATIN CAPITAL LETTER A>,
	39	to give the hexadecimal code point and the normative name of
	40	the character.
	41
	42	Unicode also defines various I<properties> for the characters, like
	43	"uppercase" or "lowercase", "decimal digit", or "punctuation";
	44	these properties are independent of the names of the characters.
	45	Furthermore, various operations on the characters like uppercasing,
	46	lowercasing, and collating (sorting) are defined.
	47
	48	A Unicode character consists either of a single code point, or a
	49	I<base character> (like C<LATIN CAPITAL LETTER A>), followed by one or
	50	more I<modifiers> (like C<COMBINING ACUTE ACCENT>). This sequence of
	51	base character and modifiers is called a I<combining character
	52	sequence>.
	53
	54	Whether to call these combining character sequences "characters"
	55	depends on your point of view. If you are a programmer, you probably
	56	would tend towards seeing each element in the sequences as one unit,
	57	or "character". The whole sequence could be seen as one "character",
	58	however, from the user's point of view, since that's probably what it
	59	looks like in the context of the user's language.
	60
	61	With this "whole sequence" view of characters, the total number of
	62	characters is open-ended. But in the programmer's "one unit is one
	63	character" point of view, the concept of "characters" is more
	64	deterministic. In this document, we take that second point of view:
	65	one "character" is one Unicode code point, be it a base character or
	66	a combining character.
	67
	68	For some combinations, there are I<precomposed> characters.
	69	C<LATIN CAPITAL LETTER A WITH ACUTE>, for example, is defined as
	70	a single code point. These precomposed characters are, however,
	71	only available for some combinations, and are mainly
	72	meant to support round-trip conversions between Unicode and legacy
	73	standards (like the ISO 8859). In the general case, the composing
	74	method is more extensible. To support conversion between
	75	different compositions of the characters, various I<normalization
	76	forms> to standardize representations are also defined.
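The relationship between precomposed characters, combining character sequences, and normalization forms can be sketched as follows. This is a minimal illustration that assumes the C<Unicode::Normalize> module is installed (it is mentioned later in this document); the variable names are made up for the example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# LATIN CAPITAL LETTER A WITH ACUTE as one precomposed code point ...
my $precomposed = "\x{C1}";
# ... and as a base character followed by COMBINING ACUTE ACCENT.
my $decomposed  = "A\x{301}";

# The raw strings differ because their code points differ.
my $raw_equal = $precomposed eq $decomposed ? 1 : 0;

# After normalizing both sides to the same form, they compare equal.
my $nfc_equal = NFC($precomposed) eq NFC($decomposed) ? 1 : 0;
my $nfd_equal = NFD($precomposed) eq NFD($decomposed) ? 1 : 0;

# NFC composes to one code point, NFD decomposes to two.
my $nfc_length = length(NFC($decomposed));
my $nfd_length = length(NFD($precomposed));

printf "raw: %d, NFC: %d, NFD: %d\n", $raw_equal, $nfc_equal, $nfd_equal;
printf "NFC length: %d, NFD length: %d\n", $nfc_length, $nfd_length;
```

Which form to normalize to is a design choice: the composed form is usually more compact, while the decomposed form makes the base characters and modifiers individually visible.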
	77
	78	Because of backward compatibility with legacy encodings, the "a unique
	79	number for every character" idea breaks down a bit: instead, there is
	80	"at least one number for every character". The same character could
	81	be represented differently in several legacy encodings. The
	82	converse is also not true: some code points do not have an assigned
	83	character. Firstly, there are unallocated code points within
	84	otherwise used blocks. Secondly, there are special Unicode control
	85	characters that do not represent true characters.
	86
	87	A common myth about Unicode is that it would be "16-bit", that is,
	88	Unicode is only represented as C<0x10000> (or 65536) characters from
	89	C<0x0000> to C<0xFFFF>. B<This is untrue.> Since Unicode 2.0, Unicode
	90	has been defined all the way up to 21 bits (C<0x10FFFF>), and since
	91	Unicode 3.1, characters have been defined beyond C<0xFFFF>. The first
	92	C<0x10000> characters are called the I<Plane 0>, or the I<Basic
	93	Multilingual Plane> (BMP). With Unicode 3.1, 17 planes in all are
	94	defined--but nowhere near full of defined characters, yet.
	95
	96	Another myth is that the 256-character blocks have something to
	97	do with languages--that each block would define the characters used
	98	by a language or a set of languages. B<This is also untrue.>
	99	The division into blocks exists, but it is almost completely
	100	accidental--an artifact of how the characters have been and
	101	still are allocated. Instead, there is a concept called I<scripts>,
	102	which is more useful: there is C<Latin> script, C<Greek> script, and
	103	so on. Scripts usually span varied parts of several blocks.
	104	For further information see L<Unicode::UCD>.
	105
	106	The Unicode code points are just abstract numbers. To input and
	107	output these abstract numbers, the numbers must be I<encoded> somehow.
	108	Unicode defines several I<character encoding forms>, of which I<UTF-8>
	109	is perhaps the most popular. UTF-8 is a variable length encoding that
	110	encodes Unicode characters as 1 to 6 bytes (only 4 with the currently
	111	defined characters). Other encodings include UTF-16 and UTF-32 and their
	112	big- and little-endian variants (UTF-8 is byte-order independent).
	113	ISO/IEC 10646 defines the UCS-2 and UCS-4 encoding forms.
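To make the variable-length nature of UTF-8 concrete, here is a small sketch that encodes a few code points and shows the resulting bytes. It assumes the C<Encode> module (discussed later in this document); the helper name C<utf8_hex> is hypothetical and exists only for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the UTF-8 bytes of one code point as space-separated hex pairs.
# (utf8_hex is an illustrative helper, not part of any module.)
sub utf8_hex {
    my ($code_point) = @_;
    my $bytes = encode('UTF-8', chr($code_point));
    return join ' ', map { sprintf '%02X', $_ } unpack('C*', $bytes);
}

# One code point from each of the 1-, 2-, 3-, and 4-byte ranges.
for my $code_point (0x41, 0x3B1, 0x263A, 0x10400) {
    printf "U+%04X -> %s\n", $code_point, utf8_hex($code_point);
}
```

The ASCII letter stays a single byte, while the code point beyond C<0xFFFF> needs four.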
	114
	115	For more information about encodings--for instance, to learn what
	116	I<surrogates> and I<byte order marks> (BOMs) are--see L<perlunicode>.
	117
	118	=head2 Perl's Unicode Support
	119
	120	Starting from Perl 5.6.0, Perl has had the capacity to handle Unicode
	121	natively. Perl 5.8.0, however, is the first recommended release for
	122	serious Unicode work. The maintenance release 5.6.1 fixed many of the
	123	problems of the initial Unicode implementation, but for example
	124	regular expressions still do not work with Unicode in 5.6.1.
	125
	126	B<Starting from Perl 5.8.0, the use of C<use utf8> is no longer
	127	necessary.> In earlier releases the C<utf8> pragma was used to declare
	128	that operations in the current block or file would be Unicode-aware.
	129	This model was found to be wrong, or at least clumsy: the "Unicodeness"
	130	is now carried with the data, instead of being attached to the
	131	operations. Only one case remains where an explicit C<use utf8> is
	132	needed: if your Perl script itself is encoded in UTF-8, you can use
	133	UTF-8 in your identifier names, and in string and regular expression
	134	literals, by saying C<use utf8>. This is not the default because
	135	scripts with legacy 8-bit data in them would break.
	136
	137	=head2 Perl's Unicode Model
	138
	139	Perl supports both pre-5.6 strings of eight-bit native bytes, and
	140	strings of Unicode characters. The principle is that Perl tries to
	141	keep its data as eight-bit bytes for as long as possible, but as soon
	142	as Unicodeness cannot be avoided, the data is transparently upgraded
	143	to Unicode.
	144
	145	Internally, Perl currently uses either the native eight-bit character
	146	set of the platform (for example Latin-1) or UTF-8 to encode
	147	Unicode strings. Specifically, if all code points in
	148	the string are C<0xFF> or less, Perl uses the native eight-bit
	149	character set. Otherwise, it uses UTF-8.
	150
	151	A user of Perl does not normally need to know nor care how Perl
	152	happens to encode its internal strings, but it becomes relevant when
	153	outputting Unicode strings to a stream without a discipline--one with
	154	the "default" encoding. In such a case, the raw bytes used internally
	155	(the native character set or UTF-8, as appropriate for each string)
	156	will be used, and a "Wide character" warning will be issued if those
	157	strings contain a character beyond 0x00FF.
	158
	159	For example,
	160
	161	perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'
	162
	163	produces a fairly useless mixture of native bytes and UTF-8, as well
	164	as a warning:
	165
	166	Wide character in print at ...
	167
	168	To output UTF-8, use the C<:utf8> output discipline. Prepending
	169
	170	binmode(STDOUT, ":utf8");
	171
	172	to this sample program ensures that the output is completely UTF-8,
	173	and removes the program's warning.
	174
	175	If your locale environment variables (C<LANGUAGE>, C<LC_ALL>,
	176	C<LC_CTYPE>, C<LANG>) contain the strings 'UTF-8' or 'UTF8',
	177	regardless of case, then the default encoding of your STDIN, STDOUT,
	178	and STDERR and of B<any subsequent file open>, is UTF-8. Note that
	179	this means that Perl expects other software to work, too: if Perl has
	180	been led to believe that STDIN should be UTF-8, but then STDIN coming
	181	in from another command is not UTF-8, Perl will complain about the
	182	malformed UTF-8.
	183
	184	=head2 Unicode and EBCDIC
	185
	186	Perl 5.8.0 also supports Unicode on EBCDIC platforms. There,
	187	Unicode support is somewhat more complex to implement since
	188	additional conversions are needed at every step. Some problems
	189	remain; see L<perlebcdic> for details.
	190
	191	In any case, the Unicode support on EBCDIC platforms is better than
	192	in the 5.6 series, which didn't work much at all for EBCDIC platforms.
	193	On EBCDIC platforms, the internal Unicode encoding form is UTF-EBCDIC
	194	instead of UTF-8. The difference is that UTF-8 is "ASCII-safe" in
	195	that ASCII characters encode to UTF-8 as-is, whereas UTF-EBCDIC is
	196	"EBCDIC-safe".
	197
	198	=head2 Creating Unicode
	199
	200	To create Unicode characters in literals for code points above C<0xFF>,
	201	use the C<\x{...}> notation in double-quoted strings:
	202
	203	my $smiley = "\x{263a}";
	204
	205	Similarly, it can be used in regular expression literals
	206
	207	$smiley =~ /\x{263a}/;
	208
	209	At run-time you can use C<chr()>:
	210
	211	my $hebrew_alef = chr(0x05d0);
	212
	213	See L</"Further Resources"> for how to find all these numeric codes.
	214
	215	Naturally, C<ord()> will do the reverse: it turns a character into
	216	a code point.
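As a quick sketch of that round trip, using only C<chr()> and C<ord()> as described above (the variable names are illustrative):

```perl
use strict;
use warnings;

# chr() turns a code point into a character ...
my $hebrew_alef = chr(0x05d0);

# ... and ord() turns the character back into its code point.
my $code_point = ord($hebrew_alef);

# Applying chr() to the result of ord() gives back the same character.
my $round_trip_ok = chr($code_point) eq $hebrew_alef ? 1 : 0;

printf "U+%04X (round trip ok: %d)\n", $code_point, $round_trip_ok;
```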
	217
	218	Note that C<\x..> (no C<{}> and only two hexadecimal digits), C<\x{...}>,
	219	and C<chr(...)> for arguments less than C<0x100> (decimal 256)
	220	generate an eight-bit character for backward compatibility with older
	221	Perls. For arguments of C<0x100> or more, Unicode characters are
	222	always produced. If you want to force the production of Unicode
	223	characters regardless of the numeric value, use C<pack("U", ...)>
	224	instead of C<\x..>, C<\x{...}>, or C<chr()>.
	225
	226	You can also use the C<charnames> pragma to invoke characters
	227	by name in double-quoted strings:
	228
	229	use charnames ':full';
	230	my $arabic_alef = "\N{ARABIC LETTER ALEF}";
	231
	232	And, as mentioned above, you can also C<pack()> numbers into Unicode
	233	characters:
	234
	235	my $georgian_an = pack("U", 0x10a0);
	236
	237	Note that both C<\x{...}> and C<\N{...}> are compile-time string
	238	constants: you cannot use variables in them. If you want similar
	239	run-time functionality, use C<chr()> and C<charnames::vianame()>.
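A run-time lookup by name might look like the following sketch. It assumes a Perl whose C<charnames> pragma provides both C<charnames::vianame()> and C<charnames::viacode()>; the variable names are made up for the example.

```perl
use strict;
use warnings;
use charnames ':full';

# The name is held in a variable, so \N{...} cannot be used here.
my $name = 'GREEK SMALL LETTER ALPHA';

# Look the name up at run time, then build the character with chr().
my $code_point = charnames::vianame($name);
my $alpha      = chr($code_point);

# charnames::viacode() performs the reverse lookup, from code point to name.
my $name_again = charnames::viacode(ord($alpha));

printf "%s is U+%04X\n", $name, $code_point;
printf "U+%04X is %s\n", ord($alpha), $name_again;
```

C<charnames::vianame()> returns C<undef> for a name it does not know, so real code would want to check the result before passing it to C<chr()>.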
	240
	241	Also note that if all the code points for pack "U" are below 0x100,
	242	bytes will be generated, just as if you were using C<chr()>.
	243
	244	my $bytes = pack("U*", 0x80, 0xFF);
	245
	246	If you want to force the result to Unicode characters, use the special
	247	C<"U0"> prefix. It consumes no arguments but forces the result to be
	248	in Unicode characters, instead of bytes.
	249
	250	my $chars = pack("U0U*", 0x80, 0xFF);
	251
	252	=head2 Handling Unicode
	253
	254	Handling Unicode is for the most part transparent: just use the
	255	strings as usual. Functions like C<index()>, C<length()>, and
	256	C<substr()> will work on the Unicode characters; regular expressions
	257	will work on the Unicode characters (see L<perlunicode> and L<perlretut>).
	258
	259	Note that Perl does I<not> consider combining character sequences to be
	260	single characters, so for example
	261
	262	use charnames ':full';
	263	print length("\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}"), "\n";
	264
	265	will print 2, not 1. The only exception is that regular expressions
	266	have C<\X> for matching a combining character sequence.
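The difference between counting code points and counting combining character sequences can be sketched like this, assuming nothing beyond core Perl (the variable names are illustrative):

```perl
use strict;
use warnings;

# "A" + COMBINING ACUTE ACCENT, followed by two plain letters.
my $text = "A\x{301}bc";

# length() counts code points: A, the accent, b, c.
my $code_points = length($text);

# Each match of \X consumes one whole combining character sequence,
# so counting the matches counts what a user would call characters.
my $sequences = () = $text =~ /\X/g;

print "code points: $code_points, sequences: $sequences\n";
```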
	267
	268	Life is not quite so transparent, however, when working with legacy
	269	encodings, I/O, and certain special cases:
	270
	271	=head2 Legacy Encodings
	272
	273	When you combine legacy data and Unicode, the legacy data needs
	274	to be upgraded to Unicode. Normally ISO 8859-1 (or EBCDIC, if
	275	applicable) is assumed. You can override this assumption by
	276	using the C<encoding> pragma, for example
	277
	278	use encoding 'latin2'; # ISO 8859-2
	279
	280	in which case literals (string or regular expressions), C<chr()>,
	281	and C<ord()> in your whole script are assumed to produce Unicode
	282	characters from ISO 8859-2 code points. Note that the matching for
	283	encoding names is forgiving: instead of C<latin2> you could have
	284	said C<Latin 2>, or C<iso8859-2>, or other variations. With just
	285
	286	use encoding;
	287
	288	the environment variable C<PERL_ENCODING> will be consulted.
	289	If that variable isn't set, the encoding pragma will fail.
	290
	291	The C<Encode> module knows about many encodings and has interfaces
	292	for doing conversions between those encodings:
	293
	294	use Encode 'from_to';
	295	from_to($data, "iso-8859-3", "utf-8"); # from legacy to utf-8
	296
	297	=head2 Unicode I/O
	298
	299	Normally, writing out Unicode data
	300
	301	print FH $some_string_with_unicode, "\n";
	302
	303	produces raw bytes that Perl happens to use to internally encode the
	304	Unicode string. Perl's internal encoding depends on the system as
	305	well as what characters happen to be in the string at the time. If
	306	any of the characters are at code points C<0x100> or above, you will get
	307	a warning. To ensure that the output is explicitly rendered in the
	308	encoding you desire--and to avoid the warning--open the stream with
	309	the desired encoding. Some examples:
	310
	311	open FH, ">:utf8", "file";
	312
	313	open FH, ">:encoding(ucs2)", "file";
	314	open FH, ">:encoding(UTF-8)", "file";
	315	open FH, ">:encoding(shift_jis)", "file";
	316
	317	and on already open streams, use C<binmode()>:
	318
	319	binmode(STDOUT, ":utf8");
	320
	321	binmode(STDOUT, ":encoding(ucs2)");
	322	binmode(STDOUT, ":encoding(UTF-8)");
	323	binmode(STDOUT, ":encoding(shift_jis)");
	324
	325	The matching of encoding names is loose: case does not matter, and
	326	many encodings have several aliases. Note that C<:utf8> discipline
	327	must always be specified exactly like that; it is I<not> subject to
	328	the loose matching of encoding names.
	329
	330	See L<PerlIO> for the C<:utf8> layer, L<PerlIO::encoding> and
	331	L<Encode::PerlIO> for the C<:encoding()> layer, and
	332	L<Encode::Supported> for many encodings supported by the C<Encode>
	333	module.
	334
	335	Reading in a file that you know happens to be encoded in one of the
	336	Unicode or legacy encodings does not magically turn the data into
	337	Unicode in Perl's eyes. To do that, specify the appropriate
	338	discipline when opening files
	339
	340	open(my $fh,'<:utf8', 'anything');
	341	my $line_of_unicode = <$fh>;
	342
	343	open(my $fh,'<:encoding(Big5)', 'anything');
	344	my $line_of_unicode = <$fh>;
	345
	346	The I/O disciplines can also be specified more flexibly with
	347	the C<open> pragma. See L<open>, or look at the following example.
	348
	349	use open ':utf8'; # input and output default discipline will be UTF-8
	350	open X, ">file";
	351	print X chr(0x100), "\n";
	352	close X;
	353	open Y, "<file";
	354	printf "%#x\n", ord(<Y>); # this should print 0x100
	355	close Y;
	356
	357	With the C<open> pragma you can use the C<:locale> discipline
	358
	359	$ENV{LC_ALL} = $ENV{LANG} = 'ru_RU.KOI8-R';
	360	# the :locale will probe the locale environment variables like LC_ALL
	361	use open OUT => ':locale'; # russki parusski
	362	open(O, ">koi8");
	363	print O chr(0x430); # Unicode CYRILLIC SMALL LETTER A = KOI8-R 0xc1
	364	close O;
	365	open(I, "<koi8");
	366	printf "%#x\n", ord(<I>); # this should print 0xc1
	367	close I;
	368
	369	or you can also use the C<':encoding(...)'> discipline
	370
	371	open(my $epic,'<:encoding(iso-8859-7)','iliad.greek');
	372	my $line_of_unicode = <$epic>;
	373
	374	These methods install a transparent filter on the I/O stream that
	375	converts data from the specified encoding when it is read in from the
	376	stream. The result is always Unicode.
	377
	378	The L<open> pragma affects all the C<open()> calls after the pragma by
	379	setting default disciplines. If you want to affect only certain
	380	streams, use explicit disciplines directly in the C<open()> call.
	381
	382	You can switch encodings on an already opened stream by using
	383	C<binmode()>; see L<perlfunc/binmode>.
	384
	385	The C<:locale> does not currently (as of Perl 5.8.0) work with
	386	C<open()> and C<binmode()>, only with the C<open> pragma. The
	387	C<:utf8> and C<:encoding(...)> methods do work with all of C<open()>,
	388	C<binmode()>, and the C<open> pragma.
	389
	390	Similarly, you may use these I/O disciplines on output streams to
	391	automatically convert Unicode to the specified encoding when it is
	392	written to the stream. For example, the following snippet copies the
	393	contents of the file "text.jis" (encoded as ISO-2022-JP, aka JIS) to
	394	the file "text.utf8", encoded as UTF-8:
	395
	396	open(my $nihongo, '<:encoding(iso2022-jp)', 'text.jis');
	397	open(my $unicode, '>:utf8', 'text.utf8');
	398	while (<$nihongo>) { print $unicode }
	399
	400	The naming of encodings, both by the C<open()> and by the C<open>
	401	pragma, is similar to the C<encoding> pragma in that it allows for
	402	flexible names: C<koi8-r> and C<KOI8R> will both be understood.
	403
	404	Common encodings recognized by ISO, MIME, IANA, and various other
	405	standardisation organisations are recognised; for a more detailed
	406	list see L<Encode::Supported>.
	407
	408	C<read()> reads characters and returns the number of characters.
	409	C<seek()> and C<tell()> operate on byte counts, as do C<sysread()>
	410	and C<sysseek()>.
	411
	412	Notice that because of the default behaviour of not doing any
	413	conversion upon input if there is no default discipline,
	414	it is easy to mistakenly write code that keeps on expanding a file
	415	by repeatedly encoding the data:
	416
	417	# BAD CODE WARNING
	418	open F, "file";
	419	local $/; ## read in the whole file of 8-bit characters
	420	$t = <F>;
	421	close F;
	422	open F, ">:utf8", "file";
	423	print F $t; ## convert to UTF-8 on output
	424	close F;
	425
	426	If you run this code twice, the contents of the F<file> will be twice
	427	UTF-8 encoded. A C<use open ':utf8'> would have avoided the bug, as
	428	would explicitly opening the F<file> for input as UTF-8 as well.
	429
	430	B<NOTE>: the C<:utf8> and C<:encoding> features work only if your
	431	Perl has been built with the new "perlio" feature. Almost all
	432	Perl 5.8 platforms do use "perlio", though: you can see whether
	433	yours is by running "perl -V" and looking for C<useperlio=define>.
	434
	435	=head2 Displaying Unicode As Text
	436
	437	Sometimes you might want to display Perl scalars containing Unicode as
	438	simple ASCII (or EBCDIC) text. The following subroutine converts
	439	its argument so that Unicode characters with code points greater than
	440	255 are displayed as C<\x{...}>, control characters (like C<\n>) are
	441	displayed as C<\x..>, and the rest of the characters as themselves:
	442
	443	    sub nice_string {
	444	        join("",
	445	          map { $_ > 255 ?                  # if wide character...
	446	                sprintf("\\x{%04X}", $_) :  # \x{...}
	447	                chr($_) =~ /[[:cntrl:]]/ ?  # else if control character ...
	448	                sprintf("\\x%02X", $_) :    # \x..
	449	                chr($_)                     # else as themselves
	450	          } unpack("U*", $_[0]));           # unpack Unicode characters
	451	    }
	452
	453	For example,
	454
	455	nice_string("foo\x{100}bar\n")
	456
	457	returns:
	458
	459	"foo\x{0100}bar\x0A"
	460
	461	=head2 Special Cases
	462
	463	=over 4
	464
	465	=item *
	466
	467	Bit Complement Operator ~ And vec()
	468
	469	The bit complement operator C<~> may produce surprising results if
	470	used on strings containing characters with ordinal values above
	471	255. In such a case, the results are consistent with the internal
	472	encoding of the characters, but not with much else. So don't do
	473	that. Similarly for C<vec()>: you will be operating on the
	474	internally-encoded bit patterns of the Unicode characters, not on
	475	the code point values, which is very probably not what you want.
	476
	477	=item *
	478
	479	Peeking At Perl's Internal Encoding
	480
	481	Normal users of Perl should never care how Perl encodes any particular
	482	Unicode string (because the normal ways to get at the contents of a
	483	string with Unicode--via input and output--should always be via
	484	explicitly-defined I/O disciplines). But if you must, there are two
	485	ways of looking behind the scenes.
	486
	487	One way of peeking inside the internal encoding of Unicode characters
	488	is to use C<unpack("C", ...)> to get the bytes or C<unpack("H", ...)>
	489	to display the bytes:
	490
	491	# this prints c4 80 for the UTF-8 bytes 0xc4 0x80
	492	print join(" ", unpack("H*", pack("U", 0x100))), "\n";
	493
	494	Yet another way would be to use the Devel::Peek module:
	495
	496	perl -MDevel::Peek -e 'Dump(chr(0x100))'
	497
	498	That shows the UTF8 flag in FLAGS and both the UTF-8 bytes
	499	and Unicode characters in C<PV>. See also later in this document
	500	the discussion about the C<is_utf8> function of the C<Encode> module.
	501
	502	=back
	503
	504	=head2 Advanced Topics
	505
	506	=over 4
	507
	508	=item *
	509
	510	String Equivalence
	511
	512	The question of string equivalence turns somewhat complicated
	513	in Unicode: what do you mean by "equal"?
	514
	515	(Is C<LATIN CAPITAL LETTER A WITH ACUTE> equal to
	516	C<LATIN CAPITAL LETTER A>?)
	517
	518	The short answer is that by default Perl compares equivalence (C<eq>,
	519	C<ne>) based only on code points of the characters. In the above
	520	case, the answer is no (because 0x00C1 != 0x0041). But sometimes, any
	521	CAPITAL LETTER As should be considered equal, or even As of any case.
	522
	523	The long answer is that you need to consider character normalization
	524	and casing issues: see L<Unicode::Normalize>, Unicode Technical
	525	Reports #15 and #21, I<Unicode Normalization Forms> and I<Case
	526	Mappings>, http://www.unicode.org/unicode/reports/tr15/ and
	527	http://www.unicode.org/unicode/reports/tr21/
	528
	529	As of Perl 5.8.0, the "Full" case-folding of I<Case
	530	Mappings/SpecialCasing> is implemented.
	531
	532	=item *
	533
	534	String Collation
	535
	536	People like to see their strings nicely sorted--or as Unicode
	537	parlance goes, collated. But again, what do you mean by collate?
	538
	539	(Does C<LATIN CAPITAL LETTER A WITH ACUTE> come before or after
	540	C<LATIN CAPITAL LETTER A WITH GRAVE>?)
	541
	542	The short answer is that by default, Perl compares strings (C<lt>,
	543	C<le>, C<cmp>, C<ge>, C<gt>) based only on the code points of the
	544	characters. In the above case, the answer is "after", since
	545	C<0x00C1> > C<0x00C0>.
	546
	547	The long answer is that "it depends", and a good answer cannot be
	548	given without knowing (at the very least) the language context.
	549	See L<Unicode::Collate>, and I<Unicode Collation Algorithm>
	550	http://www.unicode.org/unicode/reports/tr10/
	551
	552	=back
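The difference between the default code point comparison and Unicode collation, discussed under "String Collation" above, can be sketched as follows. This assumes the C<Unicode::Collate> module with its default collation table and no language-specific tailoring; the word list is made up for the example.

```perl
use strict;
use warnings;
use Unicode::Collate;

binmode(STDOUT, ":utf8");

# "\x{C1}" is LATIN CAPITAL LETTER A WITH ACUTE.
my @words = ("zebra", "\x{C1}rbol", "apple", "Zulu");

# The built-in sort compares code points, so all ASCII capitals come
# first and the accented capital comes last.
my @by_code_point = sort @words;

# The collator compares letters first and treats case and accents as
# lesser differences.
my @collated = Unicode::Collate->new->sort(@words);

print "code points: @by_code_point\n";
print "collated:    @collated\n";
```

A language-aware ordering would need further tailoring of the collator, which is the "it depends" part of the long answer.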
	553
	554	=head2 Miscellaneous
	555
	556	=over 4
	557
	558	=item *
	559
	560	Character Ranges and Classes
	561
	562	Character ranges in regular expression character classes (C</[a-z]/>)
	563	and in the C<tr///> (also known as C<y///>) operator are not magically
	564	Unicode-aware. What this means is that C<[A-Za-z]> will not magically start
	565	to mean "all alphabetic letters"; not that it does mean that even for
	566	8-bit characters: you should be using C</[[:alpha:]]/> in that case.
	567
	568	For specifying character classes like that in regular expressions,
	569	you can use the various Unicode properties--C<\pL>, or perhaps
	570	C<\p{Alphabetic}>, in this particular case. You can use Unicode
	571	code points as the end points of character ranges, but there is no
	572	magic associated with specifying a certain range. For further
	573	information--there are dozens of Unicode character classes--see
	574	L<perlunicode>.
	575
	576	=item *
	577
	578	String-To-Number Conversions
	579
	580	Unicode does define several other decimal--and numeric--characters
	581	besides the familiar 0 to 9, such as the Arabic and Indic digits.
	582	Perl does not support string-to-number conversion for digits other
	583	than ASCII 0 to 9 (and ASCII a to f for hexadecimal).
	584
	585	=back
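The two items above can be sketched in one short example, assuming only core Perl. The sample strings and variable names are made up; the C<tr///> conversion at the end is one possible workaround, not something Perl does for you.

```perl
use strict;
use warnings;

# "naive" with LATIN SMALL LETTER I WITH DIAERESIS, two Greek letters,
# and two ASCII digits.
my $text = "na\x{EF}ve \x{3B1}\x{3B2} 42";

# An ASCII range only finds n, a, v, e.
my $ascii_letters = () = $text =~ /[A-Za-z]/g;

# The Unicode Letter property also finds the accented and Greek letters.
my $all_letters = () = $text =~ /\pL/g;

# ARABIC-INDIC DIGIT THREE and ARABIC-INDIC DIGIT FOUR.
my $arabic = "\x{663}\x{664}";

# \d recognizes them as decimal digits ...
my $looks_like_digits = $arabic =~ /^\d+$/ ? 1 : 0;

# ... but string-to-number conversion does not, and yields 0.
my $numeric_value = do { no warnings 'numeric'; $arabic + 0 };

# Mapping the digits to ASCII by hand makes the conversion work.
(my $ascii_digits = $arabic) =~ tr/\x{660}-\x{669}/0-9/;

print "ASCII letters: $ascii_letters, all letters: $all_letters\n";
print "digits: $looks_like_digits, value: $numeric_value, mapped: $ascii_digits\n";
```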
	586
	587	=head2 Questions With Answers
	588
	589	=over 4
	590
	591	=item *
	592
	593	Will My Old Scripts Break?
	594
	595	Very probably not. Unless you are generating Unicode characters
	596	somehow, old behaviour should be preserved. About the only behaviour
	597	that has changed and which could start generating Unicode is the old
	598	behaviour of C<chr()> where supplying an argument more than 255
	599	produced a character modulo 256. C<chr(300)>, for example, was equal
	600	to C<chr(44)> or "," (in ASCII), now it is LATIN CAPITAL LETTER I WITH
	601	BREVE.
	602
	603	=item *
	604
	605	How Do I Make My Scripts Work With Unicode?
	606
	607	Very little work should be needed since nothing changes until you
	608	generate Unicode data. The most important thing is getting input as
	609	Unicode; for that, see the earlier I/O discussion.
	610
	611	=item *
	612
	613	How Do I Know Whether My String Is In Unicode?
	614
	615	You shouldn't care. No, you really shouldn't. No, really. If you
	616	have to care--beyond the cases described above--it means that we
	617	didn't get the transparency of Unicode quite right.
	618
	619	Okay, if you insist:
	620
	621	use Encode 'is_utf8';
	622	print is_utf8($string) ? 1 : 0, "\n";
	623
	624	But note that this doesn't mean that any of the characters in the
	625	string are necessarily UTF-8 encoded, or that any of the characters have
	626	code points greater than 0xFF (255) or even 0x80 (128), or that the
	627	string has any characters at all. All that C<is_utf8()> does is
	628	return the value of the internal "utf8ness" flag attached to the
	629	C<$string>. If the flag is off, the bytes in the scalar are interpreted
	630	as a single byte encoding. If the flag is on, the bytes in the scalar
	631	are interpreted as the (multi-byte, variable-length) UTF-8 encoded code
	632	points of the characters. Bytes added to a UTF-8 encoded string are
	633	automatically upgraded to UTF-8. If mixed non-UTF8 and UTF-8 scalars
	634	are merged (double-quoted interpolation, explicit concatenation, and
	635	printf/sprintf parameter substitution), the result will be UTF-8 encoded
	636	as if copies of the byte strings were upgraded to UTF-8: for example,
	637
	638	$a = "ab\x80c";
	639	$b = "\x{100}";
	640	print "$a = $b\n";
	641
	642	the output string will be UTF-8-encoded C<ab\x80c = \x{100}\n>, but note
	643	that C<$a> will stay byte-encoded.
	644
	645	Sometimes you might really need to know the byte length of a string
	646	instead of the character length. For that use either the
	647	C<Encode::encode_utf8()> function or the C<bytes> pragma and its only
	648	defined function C<length()>:
	649
	650	my $unicode = chr(0x100);
	651	print length($unicode), "\n"; # will print 1
	652	require Encode;
	653	print length(Encode::encode_utf8($unicode)), "\n"; # will print 2
	654	use bytes;
	655	print length($unicode), "\n"; # will also print 2
	656	# (the 0xC4 0x80 of the UTF-8)
	657
	658	=item *
	659
	660	How Do I Detect Data That's Not Valid In a Particular Encoding?
	661
	662	Use the C<Encode> package to try converting it.
	663	For example,
	664
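	665	use Encode 'decode_utf8';
	666	if (eval { decode_utf8($string_of_bytes_that_I_think_is_utf8, Encode::FB_CROAK | Encode::LEAVE_SRC); 1 }) {
	667	# valid: decoding succeeded without dying
	668	} else {
	669	# invalid: decode_utf8() died on malformed data
	670	}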
	671
	672	For UTF-8 only, you can use:
	673
	674	use warnings;
	675	@chars = unpack("U0U*", $string_of_bytes_that_I_think_is_utf8);
	676
	677	If invalid, a C<Malformed UTF-8 character (byte 0x##) in unpack>
	678	warning is produced. The "U0" means "expect strictly UTF-8 encoded
	679	Unicode". Without that the C<unpack("U*", ...)> would accept also
	680	data like C<chr(0xFF)>, similarly to the C<pack> as we saw earlier.
	681
	682	=item *
	683
	684	How Do I Convert Binary Data Into a Particular Encoding, Or Vice Versa?
	685
	686	This probably isn't as useful as you might think.
	687	Normally, you shouldn't need to.
	688
	689	In one sense, what you are asking doesn't make much sense: encodings
	690	are for characters, and binary data are not "characters", so converting
	691	"data" into some encoding isn't meaningful unless you know what
	692	character set and encoding the binary data is in, in which case it's
	693	not just binary data, now is it?
	694
	695	If you have a raw sequence of bytes that you know should be
	696	interpreted via a particular encoding, you can use C<Encode>:
	697
	698	use Encode 'from_to';
	699	from_to($data, "iso-8859-1", "utf-8"); # from latin-1 to utf-8
	700
	701	The call to C<from_to()> changes the bytes in C<$data>, but nothing
	702	material about the nature of the string has changed as far as Perl is
	703	concerned. Both before and after the call, the string C<$data>
	704	contains just a bunch of 8-bit bytes. As far as Perl is concerned,
	705	the encoding of the string remains as "system-native 8-bit bytes".
	706
	707	You might relate this to a fictional 'Translate' module:
	708
	709	use Translate;
	710	my $phrase = "Yes";
	711	Translate::from_to($phrase, 'english', 'deutsch');
	712	## phrase now contains "Ja"
	713
	714	The contents of the string change, but not the nature of the string.
	715	Perl doesn't know any more after the call than before that the
	716	contents of the string indicate the affirmative.
	717
	718	Back to converting data. If you have (or want) data in your system's
	719	native 8-bit encoding (e.g. Latin-1, EBCDIC, etc.), you can use
	720	pack/unpack to convert to/from Unicode.
	721
	722	$native_string = pack("C*", unpack("U*", $Unicode_string));
	723	$Unicode_string = pack("U*", unpack("C*", $native_string));
	724
	725	If you have a sequence of bytes you B<know> is valid UTF-8,
	726	but Perl doesn't know it yet, you can make Perl a believer, too:
	727
	728	use Encode 'decode_utf8';
	729	$Unicode = decode_utf8($bytes);
	730
	731	You can convert well-formed UTF-8 to a sequence of bytes, but if
	732	you just want to convert random binary data into UTF-8, you can't.
	733	B<Not every random collection of bytes is well-formed UTF-8>. You can
	734	use C<unpack("C*", $string)> for the former, and you can create
	735	well-formed Unicode data by C<pack("U*", 0xff, ...)>.
	736
	737	=item *
	738
	739	How Do I Display Unicode? How Do I Input Unicode?
	740
	741	See http://www.alanwood.net/unicode/ and
	742	http://www.cl.cam.ac.uk/~mgk25/unicode.html
	743
	744	=item *
	745
	746	How Does Unicode Work With Traditional Locales?
	747
	748	In Perl, not very well. Avoid using locales through the C<locale>
	749	pragma. Use either Unicode or locales, but not both.
	750
	751	=back
	752
	753	=head2 Hexadecimal Notation
	754
	755	The Unicode standard prefers using hexadecimal notation because
	756	that more clearly shows the division of Unicode into blocks of 256 characters.
	757	Hexadecimal is also simply shorter than decimal. You can use decimal
	758	notation, too, but learning to use hexadecimal just makes life easier
	759	with the Unicode standard. The C<U+HHHH> notation uses hexadecimal,
	760	for example.
	761
	762	The C<0x> prefix means a hexadecimal number; the digits are 0-9 I<and>
	763	a-f (or A-F, case doesn't matter). Each hexadecimal digit represents
	764	four bits, or half a byte. C<print 0x..., "\n"> will show a
	765	hexadecimal number in decimal, and C<printf "%x\n", $decimal> will
	766	show a decimal number in hexadecimal. If you have just the
	767	"hex digits" of a hexadecimal number, you can use the C<hex()> function.
	768
	769	print 0x0009, "\n"; # 9
	770	print 0x000a, "\n"; # 10
	771	print 0x000f, "\n"; # 15
	772	print 0x0010, "\n"; # 16
	773	print 0x0011, "\n"; # 17
	774	print 0x0100, "\n"; # 256
	775
	776	print 0x0041, "\n"; # 65
	777
	778	printf "%x\n", 65; # 41
	779	printf "%#x\n", 65; # 0x41
	780
	781	print hex("41"), "\n"; # 65
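
The same functions can be combined to move between characters and the C<U+HHHH> notation. The following is a minimal sketch; the variable names are chosen for illustration only.

```perl
use strict;
use warnings;

# Character -> code point -> U+HHHH label.
my $code_point = ord("A");
my $label      = sprintf("U+%04X", $code_point);

# Bare hex digits -> number -> U+HHHH label.
my $alpha       = hex("03B1");
my $alpha_label = sprintf("U+%04X", $alpha);

print "$label\n";          # U+0041
print "$alpha\n";          # 945
print "$alpha_label\n";    # U+03B1
```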
	782
	783	=head2 Further Resources
	784
	785	=over 4
	786
	787	=item *
	788
	789	Unicode Consortium
	790
	791	http://www.unicode.org/
	792
	793	=item *
	794
	795	Unicode FAQ
	796
	797	http://www.unicode.org/unicode/faq/
	798
	799	=item *
	800
	801	Unicode Glossary
	802
	803	http://www.unicode.org/glossary/
	804
	805	=item *
	806
	807	Unicode Useful Resources
	808
	809	http://www.unicode.org/unicode/onlinedat/resources.html
	810
	811	=item *
	812
	813	Unicode and Multilingual Support in HTML, Fonts, Web Browsers and Other Applications
	814
	815	http://www.alanwood.net/unicode/
	816
	817	=item *
	818
	819	UTF-8 and Unicode FAQ for Unix/Linux
	820
	821	http://www.cl.cam.ac.uk/~mgk25/unicode.html
	822
	823	=item *
	824
	825	Legacy Character Sets
	826
	827	http://www.czyborra.com/
	828	http://www.eki.ee/letter/
	829
	830	=item *
	831
	832	The Unicode support files live within the Perl installation in the
	833	directory
	834
	835	$Config{installprivlib}/unicore
	836
	837	in Perl 5.8.0 or newer, and
	838
	839	$Config{installprivlib}/unicode
	840
	841	in the Perl 5.6 series. (The renaming to F<lib/unicore> was done to
	842	avoid naming conflicts with lib/Unicode in case-insensitive filesystems.)
	843	The main Unicode data file is F<UnicodeData.txt> (or F<Unicode.301> in
	844	Perl 5.6.1). You can find the value of C<$Config{installprivlib}> by running
	845
	846	perl "-V:installprivlib"
	847
	848	You can explore various information from the Unicode data files using
	849	the C<Unicode::UCD> module.
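
As a short sketch of what C<Unicode::UCD> offers, its C<charinfo()> function returns a hash reference describing one code point. Only three of its fields are shown here.

```perl
use strict;
use warnings;
use Unicode::UCD 'charinfo';

# Look up U+0041 in the Unicode data shipped with Perl.
my $info = charinfo(0x0041);

print "U+$info->{code} $info->{name} ($info->{category})\n";
# prints "U+0041 LATIN CAPITAL LETTER A (Lu)"
```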
	850
	851	=back
	852
	853	=head1 UNICODE IN OLDER PERLS
	854
	855	If you cannot upgrade your Perl to 5.8.0 or later, you can still
	856	do some Unicode processing by using the modules C<Unicode::String>,
	857	C<Unicode::Map8>, and C<Unicode::Map>, available from CPAN.
	858	If you have the GNU recode installed, you can also use the
	859	Perl front-end C<Convert::Recode> for character conversions.
	860
	861	The following are fast conversions between ISO 8859-1 (Latin-1) bytes
	862	and UTF-8 bytes; the code works even with older Perl 5 versions.
	863
	864	# ISO 8859-1 to UTF-8
	865	s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;
	866
	867	# UTF-8 to ISO 8859-1
	868	s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;
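
To see the two substitutions at work, here is a minimal round-trip sketch that wraps each one in a subroutine. The names C<latin1_to_utf8> and C<utf8_to_latin1> and the sample string are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical wrapper: Latin-1 bytes -> UTF-8 bytes.
sub latin1_to_utf8 {
    my ($s) = @_;
    $s =~ s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;
    return $s;
}

# Hypothetical wrapper: UTF-8 bytes -> Latin-1 bytes.
sub utf8_to_latin1 {
    my ($s) = @_;
    $s =~ s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;
    return $s;
}

my $latin1 = "caf\xE9";                  # e-acute as one Latin-1 byte
my $utf8   = latin1_to_utf8($latin1);    # e-acute becomes 0xC3 0xA9
my $back   = utf8_to_latin1($utf8);

print unpack("H*", $utf8), "\n";         # 636166c3a9
print "round trip ok\n" if $back eq $latin1;
```

For 0xE9, the top two bits give the lead byte 0xC0|3 = 0xC3 and the low six bits give the continuation byte 0x80|0x29 = 0xA9.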
	869
	870	=head1 SEE ALSO
	871
	872	L<perlunicode>, L<Encode>, L<encoding>, L<open>, L<utf8>, L<bytes>,
	873	L<perlretut>, L<Unicode::Collate>, L<Unicode::Normalize>, L<Unicode::UCD>
	874
	875	=head1 ACKNOWLEDGMENTS
	876
	877	Thanks to the kind readers of the perl5-porters@perl.org,
	878	perl-unicode@perl.org, linux-utf8@nl.linux.org, and unicore@unicode.org
	879	mailing lists for their valuable feedback.
	880
	881	=head1 AUTHOR, COPYRIGHT, AND LICENSE
	882
	883	Copyright 2001-2002 Jarkko Hietaniemi <jhi@iki.fi>
	884
	885	This document may be distributed under the same terms as Perl itself.