Commit	Line	Data
	1	=head1 NAME
	2
	3	perluniintro - Perl Unicode introduction
	4
	5	=head1 DESCRIPTION
	6
	7	This document gives a general idea of Unicode and how to use Unicode
	8	in Perl.
	9
	10	=head2 Unicode
	11
	12	Unicode is a character set standard which plans to codify all of the
	13	writing systems of the world, plus many other symbols.
	14
	15	Unicode and ISO/IEC 10646 are coordinated standards that provide code
	16	points for characters in almost all modern character set standards,
	17	covering more than 30 writing systems and hundreds of languages,
	18	including all commercially-important modern languages. All characters
	19	in the largest Chinese, Japanese, and Korean dictionaries are also
	20	encoded. The standards will eventually cover almost all characters in
	21	more than 250 writing systems and thousands of languages.
	22	Unicode 1.0 was released in October 1991, and 4.0 in April 2003.
	23
	24	A Unicode I<character> is an abstract entity. It is not bound to any
	25	particular integer width, especially not to the C language C<char>.
	26	Unicode is language-neutral and display-neutral: it does not encode the
	27	language of the text, and it does not generally define fonts or other graphical
	28	layout details. Unicode operates on characters and on text built from
	29	those characters.
	30
	31	Unicode defines characters like C<LATIN CAPITAL LETTER A> or C<GREEK
	32	SMALL LETTER ALPHA> and unique numbers for the characters, in this
	33	case 0x0041 and 0x03B1, respectively. These unique numbers are called
	34	I<code points>.
	35
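As a minimal sketch of the relationship between characters and code points (the variable names are only illustrative), C<ord()> and C<chr()> convert between the two, and C<sprintf> can produce the C<U+HHHH> notation used by the standard:

```perl
use strict;
use warnings;

# ord() maps a character to its code point; chr() maps a code point back.
my $latin_a = sprintf("U+%04X", ord("A"));        # LATIN CAPITAL LETTER A
my $alpha   = sprintf("U+%04X", ord("\x{3B1}"));  # GREEK SMALL LETTER ALPHA
my $back    = chr(0x41);                          # the character "A" again

print "$latin_a $alpha $back\n";
```

This should print C<U+0041 U+03B1 A>.
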
	36	The Unicode standard prefers using hexadecimal notation for the code
	37	points. If numbers like C<0x0041> are unfamiliar to you, take a peek
	38	at a later section, L</"Hexadecimal Notation">. The Unicode standard
	39	uses the notation C<U+0041 LATIN CAPITAL LETTER A>, to give the
	40	hexadecimal code point and the normative name of the character.
	41
	42	Unicode also defines various I<properties> for the characters, like
	43	"uppercase" or "lowercase", "decimal digit", or "punctuation";
	44	these properties are independent of the names of the characters.
	45	Furthermore, various operations on the characters like uppercasing,
	46	lowercasing, and collating (sorting) are defined.
	47
	48	A Unicode I<logical> "character" can actually consist of more than one internal
	49	I<actual> "character" or code point. For Western languages, this is adequately
	50	modelled by a I<base character> (like C<LATIN CAPITAL LETTER A>) followed
	51	by one or more I<modifiers> (like C<COMBINING ACUTE ACCENT>). This sequence of
	52	base character and modifiers is called a I<combining character
	53	sequence>. Some non-Western languages require more complicated
	54	models, so Unicode created the I<grapheme cluster> concept, and then the
	55	I<extended grapheme cluster>. For example, a Korean Hangul syllable is
	56	considered a single logical character, but most often consists of three actual
	57	Unicode characters: a leading consonant followed by an interior vowel followed
	58	by a trailing consonant.
	59
	60	Whether to call these extended grapheme clusters "characters" depends on your
	61	point of view. If you are a programmer, you probably would tend towards seeing
	62	each element in the sequences as one unit, or "character". The whole sequence
	63	could be seen as one "character", however, from the user's point of view, since
	64	that's probably what it looks like in the context of the user's language.
	65
	66	With this "whole sequence" view of characters, the total number of
	67	characters is open-ended. But in the programmer's "one unit is one
	68	character" point of view, the concept of "characters" is more
	69	deterministic. In this document, we take that second point of view:
	70	one "character" is one Unicode code point.
	71
	72	For some combinations, there are I<precomposed> characters.
	73	C<LATIN CAPITAL LETTER A WITH ACUTE>, for example, is defined as
	74	a single code point. These precomposed characters are, however,
	75	only available for some combinations, and are mainly
	76	meant to support round-trip conversions between Unicode and legacy
	77	standards (like the ISO 8859). In the general case, the composing
	78	method is more extensible. To support conversion between
	79	different compositions of the characters, various I<normalization
	80	forms> to standardize representations are also defined.
	81
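The effect of normalization can be sketched with the C<Unicode::Normalize> module. This is only an illustration; the variable names are made up for the example. A precomposed character and the equivalent combining character sequence differ as raw strings but compare equal once both are brought to the same normalization form:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $precomposed = "\x{C1}";      # LATIN CAPITAL LETTER A WITH ACUTE
my $combining   = "A\x{301}";    # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT

# Compared code point by code point, the two strings differ.
my $raw_equal = ($precomposed eq $combining) ? 1 : 0;

# After composing both (NFC), they are the same string.
my $nfc_equal = (NFC($precomposed) eq NFC($combining)) ? 1 : 0;

# Decomposing (NFD) the precomposed character yields two code points.
my $nfd_length = length(NFD($precomposed));

print "raw_equal=$raw_equal nfc_equal=$nfc_equal nfd_length=$nfd_length\n";
```

This should print C<raw_equal=0 nfc_equal=1 nfd_length=2>.
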
	82	Because of backward compatibility with legacy encodings, the "a unique
	83	number for every character" idea breaks down a bit: instead, there is
	84	"at least one number for every character". The same character could
	85	be represented differently in several legacy encodings. The
	86	converse is also not true: some code points do not have an assigned
	87	character. Firstly, there are unallocated code points within
	88	otherwise used blocks. Secondly, there are special Unicode control
	89	characters that do not represent true characters.
	90
	91	A common myth about Unicode is that it is "16-bit", that is,
	92	Unicode is only represented as C<0x10000> (or 65536) characters from
	93	C<0x0000> to C<0xFFFF>. B<This is untrue.> Since Unicode 2.0 (July
	94	1996), Unicode has been defined all the way up to 21 bits (C<0x10FFFF>),
	95	and since Unicode 3.1 (March 2001), characters have been defined
	96	beyond C<0xFFFF>. The first C<0x10000> characters are called the
	97	I<Plane 0>, or the I<Basic Multilingual Plane> (BMP). With Unicode
	98	3.1, 17 (yes, seventeen) planes in all were defined--but they are
	99	nowhere near full of defined characters, yet.
	100
	101	Another myth is about Unicode blocks--that they have something to
	102	do with languages--that each block would define the characters used
	103	by a language or a set of languages. B<This is also untrue.>
	104	The division into blocks exists, but it is almost completely
	105	accidental--an artifact of how the characters have been and
	106	still are allocated. Instead, there is a concept called I<scripts>, which is
	107	more useful: there is C<Latin> script, C<Greek> script, and so on. Scripts
	108	usually span varied parts of several blocks. For more information about
	109	scripts, see L<perlunicode/Scripts>.
	110
	111	The Unicode code points are just abstract numbers. To input and
	112	output these abstract numbers, the numbers must be I<encoded> or
	113	I<serialised> somehow. Unicode defines several I<character encoding
	114	forms>, of which I<UTF-8> is perhaps the most popular. UTF-8 is a
	115	variable length encoding that encodes Unicode characters as 1 to 4
	116	bytes. Other encodings
	117	include UTF-16 and UTF-32 and their big- and little-endian variants
	118	(UTF-8 is byte-order independent). The ISO/IEC 10646 defines the UCS-2
	119	and UCS-4 encoding forms.
	120
	121	For more information about encodings--for instance, to learn what
	122	I<surrogates> and I<byte order marks> (BOMs) are--see L<perlunicode>.
	123
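To make the idea of encoding forms concrete, here is a small sketch using the C<Encode> module. The helper C<hex_bytes()> is a hypothetical name introduced only for this example. It serialises the same code points in different encodings and shows the resulting bytes:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join " ", map { sprintf "%02X", $_ } unpack("C*", $bytes);
}

my $smiley = "\x{263A}";     # a BMP character
my $astral = "\x{1F600}";    # a character beyond 0xFFFF

my $smiley_utf8  = hex_bytes(encode("UTF-8",    $smiley));   # three bytes
my $smiley_utf16 = hex_bytes(encode("UTF-16BE", $smiley));   # one 16-bit unit
my $astral_utf8  = hex_bytes(encode("UTF-8",    $astral));   # four bytes
my $astral_utf16 = hex_bytes(encode("UTF-16BE", $astral));   # a surrogate pair

print "$smiley_utf8 | $smiley_utf16 | $astral_utf8 | $astral_utf16\n";
```

This should print C<E2 98 BA | 26 3A | F0 9F 98 80 | D8 3D DE 00>. The last group is an example of the surrogates mentioned above.
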
	124	=head2 Perl's Unicode Support
	125
	126	Starting from Perl 5.6.0, Perl has had the capacity to handle Unicode
	127	natively. Perl 5.8.0, however, is the first recommended release for
	128	serious Unicode work. The maintenance release 5.6.1 fixed many of the
	129	problems of the initial Unicode implementation, but for example
	130	regular expressions still do not work with Unicode in 5.6.1.
	131
	132	B<Starting from Perl 5.8.0, the use of C<use utf8> is needed only in much more restricted circumstances.> In earlier releases the C<utf8> pragma was used to declare
	133	that operations in the current block or file would be Unicode-aware.
	134	This model was found to be wrong, or at least clumsy: the "Unicodeness"
	135	is now carried with the data, instead of being attached to the
	136	operations. Only one case remains where an explicit C<use utf8> is
	137	needed: if your Perl script itself is encoded in UTF-8, you can use
	138	UTF-8 in your identifier names, and in string and regular expression
	139	literals, by saying C<use utf8>. This is not the default because
	140	scripts with legacy 8-bit data in them would break. See L<utf8>.
	141
	142	=head2 Perl's Unicode Model
	143
	144	Perl supports both pre-5.6 strings of eight-bit native bytes, and
	145	strings of Unicode characters. The principle is that Perl tries to
	146	keep its data as eight-bit bytes for as long as possible, but as soon
	147	as Unicodeness cannot be avoided, the data is (mostly) transparently upgraded
	148	to Unicode. There are some problems--see L<perlunicode/The "Unicode Bug">.
	149
	150	Internally, Perl currently uses either the native eight-bit
	151	character set of the platform (for example Latin-1) or
	152	UTF-8 to encode Unicode strings. Specifically, if all code points in
	153	the string are C<0xFF> or less, Perl uses the native eight-bit
	154	character set. Otherwise, it uses UTF-8.
	155
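The transparent upgrade can be observed with C<utf8::is_utf8()>, which reports the internal flag of a scalar. The following is only a sketch, assuming an ASCII (non-EBCDIC) platform; the variable names are illustrative:

```perl
use strict;
use warnings;

my $narrow = "\x{DF}";                              # fits in eight bits
my $narrow_flag = utf8::is_utf8($narrow) ? 1 : 0;   # stored as a native byte

# Appending a character above 0xFF forces the result to be stored as UTF-8.
my $wide = $narrow . "\x{100}";
my $wide_flag = utf8::is_utf8($wide) ? 1 : 0;

# The character semantics do not change: still two characters,
# and the first one still has code point 0xDF.
my $length   = length($wide);
my $first_cp = sprintf("%#x", ord($wide));

print "narrow=$narrow_flag wide=$wide_flag length=$length first=$first_cp\n";
```

This should print C<narrow=0 wide=1 length=2 first=0xdf>.
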
	156	A user of Perl does not normally need to know nor care how Perl
	157	happens to encode its internal strings, but it becomes relevant when
	158	outputting Unicode strings to a stream without a PerlIO layer (one with
	159	the "default" encoding). In such a case, the raw bytes used internally
	160	(the native character set or UTF-8, as appropriate for each string)
	161	will be used, and a "Wide character" warning will be issued if those
	162	strings contain a character beyond 0x00FF.
	163
	164	For example,
	165
	166	perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'
	167
	168	produces a fairly useless mixture of native bytes and UTF-8, as well
	169	as a warning:
	170
	171	Wide character in print at ...
	172
	173	To output UTF-8, use the C<:encoding> or C<:utf8> output layer. Prepending
	174
	175	binmode(STDOUT, ":utf8");
	176
	177	to this sample program ensures that the output is completely UTF-8,
	178	and removes the program's warning.
	179
	180	You can enable automatic UTF-8-ification of your standard file
	181	handles, default C<open()> layer, and C<@ARGV> by using either
	182	the C<-C> command line switch or the C<PERL_UNICODE> environment
	183	variable; see L<perlrun> for the documentation of the C<-C> switch.
	184
	185	Note that this means that Perl expects other software to work, too:
	186	if Perl has been led to believe that STDIN should be UTF-8, but then
	187	STDIN coming in from another command is not UTF-8, Perl will complain
	188	about the malformed UTF-8.
	189
	190	All features that combine Unicode and I/O also require using the new
	191	PerlIO feature. Almost all Perl 5.8 platforms do use PerlIO, though:
	192	you can see whether yours is by running "perl -V" and looking for
	193	C<useperlio=define>.
	194
	195	=head2 Unicode and EBCDIC
	196
	197	Perl 5.8.0 also supports Unicode on EBCDIC platforms. There,
	198	Unicode support is somewhat more complex to implement since
	199	additional conversions are needed at every step.
	200
	201	Later Perl releases have added code that will not work on EBCDIC platforms, and
	202	no one has complained, so the divergence has continued. If you want to run
	203	Perl on an EBCDIC platform, send email to perlbug@perl.org.
	204
	205	On EBCDIC platforms, the internal Unicode encoding form is UTF-EBCDIC
	206	instead of UTF-8. The difference is that UTF-8 is "ASCII-safe" in
	207	that ASCII characters encode to UTF-8 as-is, while UTF-EBCDIC is
	208	"EBCDIC-safe".
	209
	210	=head2 Creating Unicode
	211
	212	To create Unicode characters in literals for code points above C<0xFF>,
	213	use the C<\x{...}> notation in double-quoted strings:
	214
	215	my $smiley = "\x{263a}";
	216
	217	Similarly, it can be used in regular expression literals:
	218
	219	$smiley =~ /\x{263a}/;
	220
	221	At run-time you can use C<chr()>:
	222
	223	my $hebrew_alef = chr(0x05d0);
	224
	225	See L</"Further Resources"> for how to find all these numeric codes.
	226
	227	Naturally, C<ord()> will do the reverse: it turns a character into
	228	a code point.
	229
	230	Note that C<\x..> (no C<{}> and only two hexadecimal digits), C<\x{...}>,
	231	and C<chr(...)> for arguments less than C<0x100> (decimal 256)
	232	generate an eight-bit character for backward compatibility with older
	233	Perls. For arguments of C<0x100> or more, Unicode characters are
	234	always produced. If you want to force the production of Unicode
	235	characters regardless of the numeric value, use C<pack("U", ...)>
	236	instead of C<\x..>, C<\x{...}>, or C<chr()>.
	237
	238	You can also use the C<charnames> pragma to invoke characters
	239	by name in double-quoted strings:
	240
	241	use charnames ':full';
	242	my $arabic_alef = "\N{ARABIC LETTER ALEF}";
	243
	244	And, as mentioned above, you can also C<pack()> numbers into Unicode
	245	characters:
	246
	247	my $georgian_an = pack("U", 0x10a0);
	248
	249	Note that both C<\x{...}> and C<\N{...}> are compile-time string
	250	constants: you cannot use variables in them. If you want similar
	251	run-time functionality, use C<chr()> and C<charnames::vianame()>.
	252
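A brief sketch of the run-time route, with illustrative variable names: C<charnames::vianame()> looks up a code point from a name held in a variable, C<chr()> turns that code point into a character, and C<charnames::viacode()> goes from a code point back to its name.

```perl
use strict;
use warnings;
use charnames ':full';

# The name is only known at run time, so \N{...} cannot be used.
my $wanted = "GREEK SMALL LETTER ALPHA";
my $code   = charnames::vianame($wanted);   # numeric code point
my $char   = chr($code);                    # the character itself

# The reverse lookup: from a code point to its name.
my $name = charnames::viacode(0x263A);

printf "U+%04X %s\n", $code, $name;
```

This should print C<U+03B1 WHITE SMILING FACE>.
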
	253	If you want to force the result to Unicode characters, use the special
	254	C<"U0"> prefix. It consumes no arguments but causes the following bytes
	255	to be interpreted as the UTF-8 encoding of Unicode characters:
	256
	257	my $chars = pack("U0W*", 0x80, 0x42);
	258
	259	Likewise, you can stop such UTF-8 interpretation by using the special
	260	C<"C0"> prefix.
	261
	262	=head2 Handling Unicode
	263
	264	Handling Unicode is for the most part transparent: just use the
	265	strings as usual. Functions like C<index()>, C<length()>, and
	266	C<substr()> will work on the Unicode characters; regular expressions
	267	will work on the Unicode characters (see L<perlunicode> and L<perlretut>).
	268
	269	Note that Perl considers the code points of a grapheme cluster to be separate characters, so for
	270	example
	271
	272	use charnames ':full';
	273	print length("\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}"), "\n";
	274
	275	will print 2, not 1. The only exception is that regular expressions
	276	have C<\X> for matching an extended grapheme cluster.
	277
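As a sketch of how C<\X> can be used to count user-perceived characters rather than code points (C<cluster_count()> is a hypothetical helper written for this example, and a reasonably modern Perl is assumed):

```perl
use strict;
use warnings;

# Hypothetical helper: count extended grapheme clusters by matching \X repeatedly.
sub cluster_count {
    my ($string) = @_;
    my $count = () = $string =~ /\X/g;
    return $count;
}

my $a_acute = "A\x{301}";                  # base character + combining accent
my $hangul  = "\x{1100}\x{1161}\x{11A8}";  # leading consonant, vowel, trailing consonant

my $a_code_points = length($a_acute);
my $a_clusters    = cluster_count($a_acute);
my $h_code_points = length($hangul);
my $h_clusters    = cluster_count($hangul);

print "$a_code_points/$a_clusters $h_code_points/$h_clusters\n";
```

This should print C<2/1 3/1>: two and three code points respectively, but one extended grapheme cluster each.
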
	278	Life is not quite so transparent, however, when working with legacy
	279	encodings, I/O, and certain special cases:
	280
	281	=head2 Legacy Encodings
	282
	283	When you combine legacy data and Unicode, the legacy data needs
	284	to be upgraded to Unicode. Normally ISO 8859-1 (or EBCDIC, if
	285	applicable) is assumed.
	286
	287	The C<Encode> module knows about many encodings and has interfaces
	288	for doing conversions between those encodings:
	289
	290	use Encode 'decode';
	291	$data = decode("iso-8859-3", $data); # decode legacy bytes into Unicode characters
	292
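A slightly fuller sketch of such a conversion follows. The choice of ISO 8859-15 and of the byte C<0xA4> is an assumption made for the example, not something taken from the text above. The legacy byte is decoded into a Unicode character, which can then be encoded into another encoding:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $legacy_bytes = "\xA4";   # EURO SIGN in ISO 8859-15

# decode() turns bytes in a known encoding into Perl characters.
my $characters = decode("iso-8859-15", $legacy_bytes);
my $code_point = sprintf("U+%04X", ord($characters));

# encode() turns Perl characters into bytes in the requested encoding.
my $utf8_bytes = encode("UTF-8", $characters);
my $utf8_hex   = join " ", map { sprintf "%02X", $_ } unpack("C*", $utf8_bytes);

print "$code_point -> $utf8_hex\n";
```

This should print C<< U+20AC -> E2 82 AC >>.
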
	293	=head2 Unicode I/O
	294
	295	Normally, writing out Unicode data
	296
	297	print FH $some_string_with_unicode, "\n";
	298
	299	produces raw bytes that Perl happens to use to internally encode the
	300	Unicode string. Perl's internal encoding depends on the system as
	301	well as what characters happen to be in the string at the time. If
	302	any of the characters are at code points C<0x100> or above, you will get
	303	a warning. To ensure that the output is explicitly rendered in the
	304	encoding you desire--and to avoid the warning--open the stream with
	305	the desired encoding. Some examples:
	306
	307	open FH, ">:utf8", "file";
	308
	309	open FH, ">:encoding(ucs2)", "file";
	310	open FH, ">:encoding(UTF-8)", "file";
	311	open FH, ">:encoding(shift_jis)", "file";
	312
	313	and on already open streams, use C<binmode()>:
	314
	315	binmode(STDOUT, ":utf8");
	316
	317	binmode(STDOUT, ":encoding(ucs2)");
	318	binmode(STDOUT, ":encoding(UTF-8)");
	319	binmode(STDOUT, ":encoding(shift_jis)");
	320
	321	The matching of encoding names is loose: case does not matter, and
	322	many encodings have several aliases. Note that the C<:utf8> layer
	323	must always be specified exactly like that; it is I<not> subject to
	324	the loose matching of encoding names. Also note that C<:utf8> is unsafe for
	325	input, because it accepts the data without validating that it is indeed valid
	326	UTF-8.
	327
	328	See L<PerlIO> for the C<:utf8> layer, L<PerlIO::encoding> and
	329	L<Encode::PerlIO> for the C<:encoding()> layer, and
	330	L<Encode::Supported> for many encodings supported by the C<Encode>
	331	module.
	332
	333	Reading in a file that you know happens to be encoded in one of the
	334	Unicode or legacy encodings does not magically turn the data into
	335	Unicode in Perl's eyes. To do that, specify the appropriate
	336	layer when opening files
	337
	338	open(my $fh,'<:encoding(utf8)', 'anything');
	339	my $line_of_unicode = <$fh>;
	340
	341	open(my $fh,'<:encoding(Big5)', 'anything');
	342	my $line_of_unicode = <$fh>;
	343
	344	The I/O layers can also be specified more flexibly with
	345	the C<open> pragma. See L<open>, or look at the following example.
	346
	347	use open ':encoding(utf8)'; # input/output default encoding will be
	348	# UTF-8
	349	open X, ">file";
	350	print X chr(0x100), "\n";
	351	close X;
	352	open Y, "<file";
	353	printf "%#x\n", ord(<Y>); # this should print 0x100
	354	close Y;
	355
	356	With the C<open> pragma you can use the C<:locale> layer
	357
	358	BEGIN { $ENV{LC_ALL} = $ENV{LANG} = 'ru_RU.KOI8-R' }
	359	# the :locale will probe the locale environment variables like
	360	# LC_ALL
	361	use open OUT => ':locale'; # russki parusski
	362	open(O, ">koi8");
	363	print O chr(0x430); # Unicode CYRILLIC SMALL LETTER A = KOI8-R 0xc1
	364	close O;
	365	open(I, "<koi8");
	366	printf "%#x\n", ord(<I>); # this should print 0xc1
	367	close I;
	368
	369	These methods install a transparent filter on the I/O stream that
	370	converts data from the specified encoding when it is read in from the
	371	stream. The result is always Unicode.
	372
	373	The L<open> pragma affects all the C<open()> calls after the pragma by
	374	setting default layers. If you want to affect only certain
	375	streams, use explicit layers directly in the C<open()> call.
	376
	377	You can switch encodings on an already opened stream by using
	378	C<binmode()>; see L<perlfunc/binmode>.
	379
	380	The C<:locale> does not currently (as of Perl 5.8.0) work with
	381	C<open()> and C<binmode()>, only with the C<open> pragma. The
	382	C<:utf8> and C<:encoding(...)> methods do work with all of C<open()>,
	383	C<binmode()>, and the C<open> pragma.
	384
	385	Similarly, you may use these I/O layers on output streams to
	386	automatically convert Unicode to the specified encoding when it is
	387	written to the stream. For example, the following snippet copies the
	388	contents of the file "text.jis" (encoded as ISO-2022-JP, aka JIS) to
	389	the file "text.utf8", encoded as UTF-8:
	390
	391	open(my $nihongo, '<:encoding(iso-2022-jp)', 'text.jis');
	392	open(my $unicode, '>:utf8', 'text.utf8');
	393	while (<$nihongo>) { print $unicode $_ }
	394
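The conversion performed by a layer in each direction can be sketched without touching the file system by opening an in-memory file (a reference to a scalar). This is only an illustration; the variable names are made up, and KOI8-R is used because it appears in the earlier C<:locale> example:

```perl
use strict;
use warnings;

my $buffer = '';

# Output: the layer converts Unicode characters to KOI8-R bytes.
open(my $out, '>:encoding(koi8-r)', \$buffer) or die "open for write: $!";
print {$out} chr(0x430);            # CYRILLIC SMALL LETTER A
close($out) or die "close: $!";
my $raw_byte = sprintf("%#x", ord($buffer));

# Input: the layer converts the KOI8-R bytes back to Unicode characters.
open(my $in, '<:encoding(koi8-r)', \$buffer) or die "open for read: $!";
my $char = <$in>;
close($in);
my $code_point = sprintf("%#x", ord($char));

print "$raw_byte $code_point\n";
```

This should print C<0xc1 0x430>: one KOI8-R byte in the buffer, and the original code point after reading it back.
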
	395	The naming of encodings, both by the C<open()> and by the C<open>
	396	pragma, allows for flexible names: C<koi8-r> and C<KOI8R> will both be
	397	understood.
	398
	399	Common encodings recognized by ISO, MIME, IANA, and various other
	400	standardisation organisations are recognised; for a more detailed
	401	list see L<Encode::Supported>.
	402
	403	C<read()> reads characters and returns the number of characters.
	404	C<seek()> and C<tell()> operate on byte counts, as do C<sysread()>
	405	and C<sysseek()>.
	406
	407	Notice that because of the default behaviour of not doing any
	408	conversion upon input if there is no default layer,
	409	it is easy to mistakenly write code that keeps on expanding a file
	410	by repeatedly encoding the data:
	411
	412	# BAD CODE WARNING
	413	open F, "file";
	414	local $/; ## read in the whole file of 8-bit characters
	415	$t = <F>;
	416	close F;
	417	open F, ">:encoding(utf8)", "file";
	418	print F $t; ## convert to UTF-8 on output
	419	close F;
	420
	421	If you run this code twice, the contents of the F<file> will be twice
	422	UTF-8 encoded. A C<use open ':encoding(utf8)'> would have avoided the
	423	bug, as would explicitly opening the F<file> for input as UTF-8.
	424
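The growth caused by repeated encoding can be shown in isolation with C<Encode>. This is a sketch with illustrative variable names, not a reproduction of the file-based code above:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text = "\x{E9}";                      # LATIN SMALL LETTER E WITH ACUTE

# Correct: encode the characters once.
my $once = encode("UTF-8", $text);        # bytes C3 A9

# The bug: the bytes are treated as characters and encoded again.
my $twice = encode("UTF-8", $once);       # bytes C3 83 C2 A9

# The fix: decode the bytes back into characters before encoding again.
my $fixed = encode("UTF-8", decode("UTF-8", $once));

my $len_once  = length($once);
my $len_twice = length($twice);
my $len_fixed = length($fixed);

print "once=$len_once twice=$len_twice fixed=$len_fixed\n";
```

This should print C<once=2 twice=4 fixed=2>.
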
	425	B<NOTE>: the C<:utf8> and C<:encoding> features work only if your
	426	Perl has been built with the new PerlIO feature (which is the default
	427	on most systems).
	428
	429	=head2 Displaying Unicode As Text
	430
	431	Sometimes you might want to display Perl scalars containing Unicode as
	432	simple ASCII (or EBCDIC) text. The following subroutine converts
	433	its argument so that Unicode characters with code points greater than
	434	255 are displayed as C<\x{...}>, control characters (like C<\n>) are
	435	displayed as C<\x..>, and the rest of the characters as themselves:
	436
	437	sub nice_string {
	438	join("",
	439	map { $_ > 255 ? # if wide character...
	440	sprintf("\\x{%04X}", $_) : # \x{...}
	441	chr($_) =~ /[[:cntrl:]]/ ? # else if control character ...
	442	sprintf("\\x%02X", $_) : # \x..
	443	quotemeta(chr($_)) # else quoted or as themselves
	444	} unpack("W*", $_[0])); # unpack Unicode characters
	445	}
	446
	447	For example,
	448
	449	nice_string("foo\x{100}bar\n")
	450
	451	returns the string
	452
	453	'foo\x{0100}bar\x0A'
	454
	455	which is ready to be printed.
	456
	457	=head2 Special Cases
	458
	459	=over 4
	460
	461	=item *
	462
	463	Bit Complement Operator ~ And vec()
	464
	465	The bit complement operator C<~> may produce surprising results if
	466	used on strings containing characters with ordinal values above
	467	255. In such a case, the results are consistent with the internal
	468	encoding of the characters, but not with much else. So don't do
	469	that. Similarly for C<vec()>: you will be operating on the
	470	internally-encoded bit patterns of the Unicode characters, not on
	471	the code point values, which is very probably not what you want.
	472
	473	=item *
	474
	475	Peeking At Perl's Internal Encoding
	476
	477	Normal users of Perl should never care how Perl encodes any particular
	478	Unicode string (because the normal ways to get at the contents of a
	479	string with Unicode--via input and output--should always be via
	480	explicitly-defined I/O layers). But if you must, there are two
	481	ways of looking behind the scenes.
	482
	483	One way of peeking inside the internal encoding of Unicode characters
	484	is to use C<unpack("C*", ...)> to get the bytes of whatever the string
	485	encoding happens to be, or C<unpack("U0..", ...)> to get the bytes of the
	486	UTF-8 encoding:
	487
	488	# this prints c4 80 for the UTF-8 bytes 0xc4 0x80
	489	print join(" ", unpack("U0(H2)*", pack("U", 0x100))), "\n";
	490
	491	Yet another way would be to use the Devel::Peek module:
	492
	493	perl -MDevel::Peek -e 'Dump(chr(0x100))'
	494
	495	That shows the C<UTF8> flag in FLAGS and both the UTF-8 bytes
	496	and Unicode characters in C<PV>. See also later in this document
	497	the discussion about the C<utf8::is_utf8()> function.
	498
	499	=back
	500
	501	=head2 Advanced Topics
	502
	503	=over 4
	504
	505	=item *
	506
	507	String Equivalence
	508
	509	The question of string equivalence turns somewhat complicated
	510	in Unicode: what do you mean by "equal"?
	511
	512	(Is C<LATIN CAPITAL LETTER A WITH ACUTE> equal to
	513	C<LATIN CAPITAL LETTER A>?)
	514
	515	The short answer is that by default Perl compares equivalence (C<eq>,
	516	C<ne>) based only on code points of the characters. In the above
	517	case, the answer is no (because 0x00C1 != 0x0041). But sometimes, any
	518	CAPITAL LETTER As should be considered equal, or even As of any case.
	519
	520	The long answer is that you need to consider character normalization
	521	and casing issues: see L<Unicode::Normalize>, Unicode Technical Report #15,
	522	L<Unicode Normalization Forms|http://www.unicode.org/unicode/reports/tr15> and
	523	sections on case mapping in the L<Unicode Standard|http://www.unicode.org>.
	524
	525	As of Perl 5.8.0, the "Full" case-folding of I<Case
	526	Mappings/SpecialCasing> is implemented, but bugs remain in C<qr//i> with them.
	527
	528	=item *
	529
	530	String Collation
	531
	532	People like to see their strings nicely sorted--or as Unicode
	533	parlance goes, collated. But again, what do you mean by collate?
	534
	535	(Does C<LATIN CAPITAL LETTER A WITH ACUTE> come before or after
	536	C<LATIN CAPITAL LETTER A WITH GRAVE>?)
	537
	538	The short answer is that by default, Perl compares strings (C<lt>,
	539	C<le>, C<cmp>, C<ge>, C<gt>) based only on the code points of the
	540	characters. In the above case, the answer is "after", since
	541	C<0x00C1> > C<0x00C0>.
	542
	543	The long answer is that "it depends", and a good answer cannot be
	544	given without knowing (at the very least) the language context.
	545	See L<Unicode::Collate>, and I<Unicode Collation Algorithm>
	546	L<http://www.unicode.org/unicode/reports/tr10/>.
	547
	548	=back
	549
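The difference between code point order and collation described under String Collation can be sketched with the C<Unicode::Collate> module, assuming it is installed with its default collation data. The word list is invented for the example:

```perl
use strict;
use warnings;
use Unicode::Collate;

my @words = ("Zebra", "\x{C9}clair", "apple");   # \x{C9} is E WITH ACUTE

# Built-in sort compares code points: uppercase ASCII, then lowercase,
# then accented letters.
my @by_code_point = sort @words;

# Unicode::Collate applies the Unicode Collation Algorithm's default ordering.
my @collated = Unicode::Collate->new->sort(@words);

# Print with non-ASCII characters escaped, so no output layer is needed.
for my $list (\@by_code_point, \@collated) {
    print join(", ", map { s/([^\x00-\x7F])/sprintf("\\x{%X}", ord($1))/ger } @$list), "\n";
}
```

With the default collation, this should list C<Zebra, apple, \x{C9}clair> for the code point sort and C<apple, \x{C9}clair, Zebra> for the collated sort. Language-specific tailoring, as noted above, may still be needed for a particular audience.
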
	550	=head2 Miscellaneous
	551
	552	=over 4
	553
	554	=item *
	555
	556	Character Ranges and Classes
	557
	558	Character ranges in regular expression bracketed character classes (e.g.,
	559	C</[a-z]/>) and in the C<tr///> (also known as C<y///>) operator are not
	560	magically Unicode-aware. What this means is that C<[A-Za-z]> will not
	561	magically start to mean "all alphabetic letters" (not that it does mean that
	562	even for 8-bit characters; for those, if you are using locales (L<perllocale>),
	563	use C</[[:alpha:]]/>; and if not, use the 8-bit-aware property C<\p{alpha}>).
	564
	565	All the properties that begin with C<\p> (and its inverse C<\P>) are actually
	566	character classes that are Unicode-aware. There are dozens of them; see
	567	L<perluniprops>.
	568
	569	You can use Unicode code points as the end points of character ranges, and the
	570	range will include all Unicode code points that lie between those end points.
	571
	572	=item *
	573
	574	String-To-Number Conversions
	575
	576	Unicode does define several other decimal--and numeric--characters
	577	besides the familiar 0 to 9, such as the Arabic and Indic digits.
	578	Perl does not support string-to-number conversion for digits other
	579	than ASCII 0 to 9 (and ASCII a to f for hexadecimal).
	580
	581	=back
	582
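Both points in the list above can be illustrated with a short sketch. The variable names are illustrative, and the particular characters are chosen only as examples:

```perl
use strict;
use warnings;

my $alpha = "\x{3B1}";    # GREEK SMALL LETTER ALPHA

# An ASCII range does not become "all letters" ...
my $in_ascii_range = ($alpha =~ /^[A-Za-z]$/)       ? 1 : 0;
# ... but Unicode properties do know about the character.
my $is_alphabetic  = ($alpha =~ /^\p{Alphabetic}$/) ? 1 : 0;
my $is_greek       = ($alpha =~ /^\p{Greek}$/)      ? 1 : 0;

my $digit = "\x{663}";    # ARABIC-INDIC DIGIT THREE

# The regular expression engine recognises it as a decimal digit ...
my $matches_digit = ($digit =~ /^\d$/) ? 1 : 0;
# ... but string-to-number conversion does not.
my $numeric_value = do { no warnings 'numeric'; $digit + 0 };

print "$in_ascii_range $is_alphabetic $is_greek $matches_digit $numeric_value\n";
```

This should print C<0 1 1 1 0>.
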
	583	=head2 Questions With Answers
	584
	585	=over 4
	586
	587	=item *
	588
	589	Will My Old Scripts Break?
	590
	591	Very probably not. Unless you are generating Unicode characters
	592	somehow, old behaviour should be preserved. About the only behaviour
	593	that has changed and which could start generating Unicode is the old
	594	behaviour of C<chr()> where supplying an argument more than 255
	595	produced a character modulo 256. C<chr(300)>, for example, was equal
	596	to C<chr(44)> or "," (in ASCII); now it is LATIN CAPITAL LETTER I WITH
	597	BREVE.
	598
	599	=item *
	600
	601	How Do I Make My Scripts Work With Unicode?
	602
	603	Very little work should be needed since nothing changes until you
	604	generate Unicode data. The most important thing is getting input as
	605	Unicode; for that, see the earlier I/O discussion.
	606
	607	=item *
	608
	609	How Do I Know Whether My String Is In Unicode?
	610
	611	You shouldn't have to care. But you may, because currently the semantics of the
	612	characters whose ordinals are in the range 128 to 255 are different depending on
	613	whether the string they are contained within is in Unicode or not.
	614	(See L<perlunicode/When Unicode Does Not Happen>.)
	615
	616	To determine if a string is in Unicode, use:
	617
	618	print utf8::is_utf8($string) ? 1 : 0, "\n";
	619
	620	But note that this doesn't mean that any of the characters in the
	621	string are necessarily UTF-8 encoded, or that any of the characters have
	622	code points greater than 0xFF (255) or even 0x80 (128), or that the
	623	string has any characters at all. All that C<is_utf8()> does is
	624	return the value of the internal "utf8ness" flag attached to the
	625	C<$string>. If the flag is off, the bytes in the scalar are interpreted
	626	as a single byte encoding. If the flag is on, the bytes in the scalar
	627	are interpreted as the (variable-length, potentially multi-byte) UTF-8 encoded
	628	code points of the characters. Bytes added to a UTF-8 encoded string are
	629	automatically upgraded to UTF-8. If mixed non-UTF-8 and UTF-8 scalars
	630	are merged (double-quoted interpolation, explicit concatenation, and
	631	printf/sprintf parameter substitution), the result will be UTF-8 encoded
	632	as if copies of the byte strings were upgraded to UTF-8: for example,
	633
	634	$a = "ab\x80c";
	635	$b = "\x{100}";
	636	print "$a = $b\n";
	637
	638	the output string will be UTF-8-encoded C<ab\x80c = \x{100}\n>, but
	639	C<$a> will stay byte-encoded.
	640
	641	Sometimes you might really need to know the byte length of a string
	642	instead of the character length. For that use either the
	643	C<Encode::encode_utf8()> function or the C<bytes> pragma and
	644	the C<length()> function:
	645
	646	my $unicode = chr(0x100);
	647	print length($unicode), "\n"; # will print 1
	648	require Encode;
	649	print length(Encode::encode_utf8($unicode)), "\n"; # will print 2
	650	use bytes;
	651	print length($unicode), "\n"; # will also print 2
	652	# (the 0xC4 0x80 of the UTF-8)
	653	no bytes;
	654
	655	=item *
	656
	657	How Do I Detect Data That's Not Valid In a Particular Encoding?
	658
	659	Use the C<Encode> package to try converting it.
	660	For example,
	661
	662	use Encode 'decode_utf8';
	663
	664	if (eval { decode_utf8($string, Encode::FB_CROAK); 1 }) {
	665	# $string is valid utf8
	666	} else {
	667	# $string is not valid utf8
	668	}
	669
	670	Or use C<unpack> to try decoding it:
	671
	672	use warnings;
	673	@chars = unpack("C0U*", $string_of_bytes_that_I_think_is_utf8);
	674
	675	If invalid, a C<Malformed UTF-8 character> warning is produced. The "C0" means
	676	"process the string character per character". Without that, the
	677	C<unpack("U*", ...)> would work in C<U0> mode (the default if the format
	678	string starts with C<U>) and it would return the bytes making up the UTF-8
	679	encoding of the target string, something that will always work.
	680
	681	=item *
	682
	683	How Do I Convert Binary Data Into a Particular Encoding, Or Vice Versa?
	684
	685	This probably isn't as useful as you might think.
	686	Normally, you shouldn't need to.
	687
	688	In one sense, what you are asking doesn't make much sense: encodings
	689	are for characters, and binary data are not "characters", so converting
	690	"data" into some encoding isn't meaningful unless you know what
	691	character set and encoding the binary data is in, in which case it's
	692	not just binary data, now is it?
	693
	694	If you have a raw sequence of bytes that you know should be
	695	interpreted via a particular encoding, you can use C<Encode>:
	696
	697	use Encode 'from_to';
	698	from_to($data, "iso-8859-1", "utf-8"); # from latin-1 to utf-8
	699
	700	The call to C<from_to()> changes the bytes in C<$data>, but nothing
	701	material about the nature of the string has changed as far as Perl is
	702	concerned. Both before and after the call, the string C<$data>
	703	contains just a bunch of 8-bit bytes. As far as Perl is concerned,
	704	the encoding of the string remains as "system-native 8-bit bytes".
	705
	706	You might relate this to a fictional 'Translate' module:
	707
	708	use Translate;
	709	my $phrase = "Yes";
	710	Translate::from_to($phrase, 'english', 'deutsch');
	711	## phrase now contains "Ja"
	712
	713	The contents of the string change, but not the nature of the string.
	714	Perl doesn't know any more after the call than before that the
	715	contents of the string indicate the affirmative.
	716
	717	Back to converting data. If you have (or want) data in your system's
	718	native 8-bit encoding (e.g. Latin-1, EBCDIC, etc.), you can use
	719	pack/unpack to convert to/from Unicode.
	720
	721	$native_string = pack("W", unpack("U", $Unicode_string));
	722	$Unicode_string = pack("U", unpack("W", $native_string));
	723
	724	If you have a sequence of bytes you B<know> is valid UTF-8,
	725	but Perl doesn't know it yet, you can make Perl a believer, too:
	726
	727	    use Encode 'decode_utf8';
	728	    $Unicode = decode_utf8($bytes);
	729
	730	or:
	731
	732	    $Unicode = pack("U0a*", $bytes);
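
To make the effect of decoding visible, here is a minimal sketch using C<decode_utf8()>. The two sample bytes are an assumption chosen for this example (the UTF-8 encoding of U+00E9).

```perl
use strict;
use warnings;
use Encode 'decode_utf8';

# Hypothetical input: the two bytes that encode U+00E9 in UTF-8.
my $bytes   = "\xC3\xA9";
my $Unicode = decode_utf8($bytes);

printf "%d byte(s) -> %d character(s), U+%04X\n",
    length($bytes), length($Unicode), ord($Unicode);
# prints: 2 byte(s) -> 1 character(s), U+00E9
```

Before decoding, Perl sees two unrelated bytes; afterwards it sees a single character.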
	733
	734	You can find the bytes that make up a UTF-8 sequence with
	735
	736	    @bytes = unpack("C*", $Unicode_string)
	737
	738	and you can create well-formed Unicode with
	739
	740	    $Unicode_string = pack("U*", 0xff, ...)
	741
	742	=item *
	743
	744	How Do I Display Unicode? How Do I Input Unicode?
	745
	746	See L<http://www.alanwood.net/unicode/> and
	747	L<http://www.cl.cam.ac.uk/~mgk25/unicode.html>
	748
	749	=item *
	750
	751	How Does Unicode Work With Traditional Locales?
	752
	753	In Perl, not very well. Avoid combining Unicode with locales enabled
	754	through the C<locale> pragma; use one or the other, not both. But see
	755	L<perlrun> for the description of the C<-C> switch and its environment
	756	counterpart, C<$ENV{PERL_UNICODE}>, to see how to enable various Unicode
	757	features, for example by using locale settings.
	758
	759	=back
	760
	761	=head2 Hexadecimal Notation
	762
	763	The Unicode standard prefers using hexadecimal notation because
	764	that more clearly shows the division of Unicode into blocks of 256 characters.
	765	Hexadecimal is also simply shorter than decimal. You can use decimal
	766	notation, too, but learning to use hexadecimal just makes life easier
	767	with the Unicode standard. The C<U+HHHH> notation uses hexadecimal,
	768	for example.
	769
	770	The C<0x> prefix means a hexadecimal number; the digits are 0-9 I<and>
	771	a-f (or A-F, case doesn't matter). Each hexadecimal digit represents
	772	four bits, or half a byte. C<print 0x..., "\n"> will show a
	773	hexadecimal number in decimal, and C<printf "%x\n", $decimal> will
	774	show a decimal number in hexadecimal. If you have just the
	775	"hex digits" of a hexadecimal number, you can use the C<hex()> function.
	776
	777	    print 0x0009, "\n"; # 9
	778	    print 0x000a, "\n"; # 10
	779	    print 0x000f, "\n"; # 15
	780	    print 0x0010, "\n"; # 16
	781	    print 0x0011, "\n"; # 17
	782	    print 0x0100, "\n"; # 256
	783
	784	    print 0x0041, "\n"; # 65
	785
	786	    printf "%x\n", 65; # 41
	787	    printf "%#x\n", 65; # 0x41
	788
	789	    print hex("41"), "\n"; # 65
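
The same functions can be combined to move between the C<U+HHHH> notation and a numeric code point. The following is a minimal sketch; the variable names and the sample value are assumptions made up for this example.

```perl
use strict;
use warnings;

# Hypothetical input in the Unicode standard's U+HHHH notation.
my $notation = "U+03B1";

# Strip the "U+" prefix and convert the remaining hex digits.
(my $digits = $notation) =~ s/^U\+//;
my $code_point = hex($digits);

# Format the number back into U+HHHH notation.
my $back = sprintf("U+%04X", $code_point);

print "$code_point\n";   # 945
print "$back\n";         # U+03B1
```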
	790
	791	=head2 Further Resources
	792
	793	=over 4
	794
	795	=item *
	796
	797	Unicode Consortium
	798
	799	L<http://www.unicode.org/>
	800
	801	=item *
	802
	803	Unicode FAQ
	804
	805	L<http://www.unicode.org/unicode/faq/>
	806
	807	=item *
	808
	809	Unicode Glossary
	810
	811	L<http://www.unicode.org/glossary/>
	812
	813	=item *
	814
	815	Unicode Useful Resources
	816
	817	L<http://www.unicode.org/unicode/onlinedat/resources.html>
	818
	819	=item *
	820
	821	Unicode and Multilingual Support in HTML, Fonts, Web Browsers and Other Applications
	822
	823	L<http://www.alanwood.net/unicode/>
	824
	825	=item *
	826
	827	UTF-8 and Unicode FAQ for Unix/Linux
	828
	829	L<http://www.cl.cam.ac.uk/~mgk25/unicode.html>
	830
	831	=item *
	832
	833	Legacy Character Sets
	834
	835	L<http://www.czyborra.com/>
	836	L<http://www.eki.ee/letter/>
	837
	838	=item *
	839
	840	The Unicode support files live within the Perl installation in the
	841	directory
	842
	843	    $Config{installprivlib}/unicore
	844
	845	in Perl 5.8.0 or newer, and
	846
	847	    $Config{installprivlib}/unicode
	848
	849	in the Perl 5.6 series. (The renaming to F<lib/unicore> was done to
	850	avoid naming conflicts with F<lib/Unicode> on case-insensitive filesystems.)
	851	The main Unicode data file is F<UnicodeData.txt> (or F<Unicode.301> in
	852	Perl 5.6.1). You can find the value of C<$Config{installprivlib}> by running
	853
	854	    perl "-V:installprivlib"
	855
	856	You can explore various information from the Unicode data files using
	857	the C<Unicode::UCD> module.
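
As a brief sketch of what C<Unicode::UCD> offers, the example below looks up one character with C<charinfo()>. The choice of code point is an assumption made for this example; see L<Unicode::UCD> for the full list of fields and functions.

```perl
use strict;
use warnings;
use Unicode::UCD 'charinfo';

# Look up U+03B1; charinfo() returns a reference to a hash of properties.
my $info = charinfo(0x03B1);

print "U+$info->{code} $info->{name}\n";   # U+03B1 GREEK SMALL LETTER ALPHA
print "category: $info->{category}\n";     # category: Ll
```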
	858
	859	=back
	860
	861	=head1 UNICODE IN OLDER PERLS
	862
	863	If you cannot upgrade your Perl to 5.8.0 or later, you can still
	864	do some Unicode processing by using the modules C<Unicode::String>,
	865	C<Unicode::Map8>, and C<Unicode::Map>, available from CPAN.
	866	If you have GNU recode installed, you can also use the
	867	Perl front-end C<Convert::Recode> for character conversions.
	868
	869	The following are fast conversions from ISO 8859-1 (Latin-1) bytes
	870	to UTF-8 bytes and back. The code works even with older Perl 5 versions.
	871
	872	    # ISO 8859-1 to UTF-8
	873	    s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;
	874
	875	    # UTF-8 to ISO 8859-1
	876	    s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;
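
As a usage sketch, the two substitutions can be applied to copies of a string to check that they round-trip. The sample text and variable names are assumptions made up for this example.

```perl
use strict;
use warnings;

# Hypothetical Latin-1 input: "naive cafe" with i-diaeresis and e-acute.
my $latin1 = "na\xEFve caf\xE9";

# ISO 8859-1 to UTF-8: each high byte becomes a two-byte sequence.
(my $utf8 = $latin1) =~
    s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;

# UTF-8 to ISO 8859-1: each two-byte sequence collapses back to one byte.
(my $back = $utf8) =~
    s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;

print length($latin1), "\n";                            # 10
print length($utf8), "\n";                              # 12
print $back eq $latin1 ? "round trip ok\n" : "mismatch\n";  # round trip ok
```

The second substitution only handles lead bytes C<\xC2> and C<\xC3>, which is exactly the range that maps back into Latin-1.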
	877
	878	=head1 SEE ALSO
	879
	880	L<perlunitut>, L<perlunicode>, L<Encode>, L<open>, L<utf8>, L<bytes>,
	881	L<perlretut>, L<perlrun>, L<Unicode::Collate>, L<Unicode::Normalize>,
	882	L<Unicode::UCD>
	883
	884	=head1 ACKNOWLEDGMENTS
	885
	886	Thanks to the kind readers of the perl5-porters@perl.org,
	887	perl-unicode@perl.org, linux-utf8@nl.linux.org, and unicore@unicode.org
	888	mailing lists for their valuable feedback.
	889
	890	=head1 AUTHOR, COPYRIGHT, AND LICENSE
	891
	892	Copyright 2001-2002 Jarkko Hietaniemi E<lt>jhi@iki.fiE<gt>
	893
	894	This document may be distributed under the same terms as Perl itself.