=head1 NAME

perluniintro - Perl Unicode introduction

=head1 DESCRIPTION

This document gives a general idea of Unicode and how to use Unicode
in Perl. See L</Further Resources> for references to more in-depth
treatments of Unicode.

=head2 Unicode

Unicode is a character set standard which plans to codify all of the
writing systems of the world, plus many other symbols.

Unicode and ISO/IEC 10646 are coordinated standards that unify
almost all other modern character set standards,
covering more than 80 writing systems and hundreds of languages,
including all commercially-important modern languages. All characters
in the largest Chinese, Japanese, and Korean dictionaries are also
encoded. The standards will eventually cover almost all characters in
more than 250 writing systems and thousands of languages.
Unicode 1.0 was released in October 1991, and 6.0 in October 2010.

A Unicode I<character> is an abstract entity. It is not bound to any
particular integer width, especially not to the C language C<char>.
Unicode is language-neutral and display-neutral: it does not encode the
language of the text, and it does not generally define fonts or other graphical
layout details. Unicode operates on characters and on text built from
those characters.

Unicode defines characters like C<LATIN CAPITAL LETTER A> or C<GREEK
SMALL LETTER ALPHA> and unique numbers for the characters, in this
case 0x0041 and 0x03B1, respectively. These unique numbers are called
I<code points>. A code point is essentially the position of the
character within the set of all possible Unicode characters, and thus in
Perl, the term I<ordinal> is often used interchangeably with it.

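In Perl these code points are directly observable: C<ord()> gives the
ordinal (code point) of a character and C<chr()> converts back. A minimal
sketch:

```perl
use strict;
use warnings;

# The ordinal of a character is its Unicode code point:
printf "U+%04X\n", ord("A");          # U+0041
printf "U+%04X\n", ord("\x{3B1}");    # U+03B1, GREEK SMALL LETTER ALPHA

# chr() goes the other way, from code point back to character:
print chr(0x41), "\n";                # A
```
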
The Unicode standard prefers using hexadecimal notation for the code
points. If numbers like C<0x0041> are unfamiliar to you, take a peek
at a later section, L</"Hexadecimal Notation">. The Unicode standard
uses the notation C<U+0041 LATIN CAPITAL LETTER A>, to give the
hexadecimal code point and the normative name of the character.

Unicode also defines various I<properties> for the characters, like
"uppercase" or "lowercase", "decimal digit", or "punctuation";
these properties are independent of the names of the characters.
Furthermore, various operations on the characters like uppercasing,
lowercasing, and collating (sorting) are defined.

A Unicode I<logical> "character" can actually consist of more than one internal
I<actual> "character" or code point. For Western languages, this is adequately
modelled by a I<base character> (like C<LATIN CAPITAL LETTER A>) followed
by one or more I<modifiers> (like C<COMBINING ACUTE ACCENT>). This sequence of
base character and modifiers is called a I<combining character
sequence>. Some non-western languages require more complicated
models, so Unicode created the I<grapheme cluster> concept, which was
later further refined into the I<extended grapheme cluster>. For
example, a Korean Hangul syllable is considered a single logical
character, but most often consists of three actual
Unicode characters: a leading consonant followed by an interior vowel followed
by a trailing consonant.

Whether to call these extended grapheme clusters "characters" depends on your
point of view. If you are a programmer, you probably would tend towards seeing
each element in the sequences as one unit, or "character". However, from
the user's point of view, the whole sequence could be seen as one
"character", since that's probably what it looks like in the context of the
user's language. In this document, we take the programmer's point of
view: one "character" is one Unicode code point.

For some combinations of base character and modifiers, there are
I<precomposed> characters. There is a single character equivalent, for
example, to the sequence C<LATIN CAPITAL LETTER A> followed by
C<COMBINING ACUTE ACCENT>. It is called C<LATIN CAPITAL LETTER A WITH
ACUTE>. These precomposed characters are, however, only available for
some combinations, and are mainly meant to support round-trip
conversions between Unicode and legacy standards (like ISO 8859). Using
sequences, as Unicode does, allows for needing fewer basic building blocks
(code points) to express many more potential grapheme clusters. To
support conversion between equivalent forms, various I<normalization
forms> are also defined. Thus, C<LATIN CAPITAL LETTER A WITH ACUTE> is
in I<Normalization Form Composed> (abbreviated NFC), and the sequence
C<LATIN CAPITAL LETTER A> followed by C<COMBINING ACUTE ACCENT>
represents the same character in I<Normalization Form Decomposed> (NFD).

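The core module L<Unicode::Normalize> converts between these
normalization forms; a small sketch:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $precomposed = "\x{C1}";     # LATIN CAPITAL LETTER A WITH ACUTE
my $decomposed  = "A\x{301}";   # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT

# NFD decomposes, NFC recomposes; the two spellings are interconvertible:
print NFD($precomposed) eq $decomposed  ? "NFD ok\n" : "NFD not ok\n";
print NFC($decomposed)  eq $precomposed ? "NFC ok\n" : "NFC not ok\n";
```
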
Because of backward compatibility with legacy encodings, the "a unique
number for every character" idea breaks down a bit: instead, there is
"at least one number for every character". The same character could
be represented differently in several legacy encodings. The
converse, however, is not true: some code points do not have an assigned
character. Firstly, there are unallocated code points within
otherwise used blocks. Secondly, there are special Unicode control
characters that do not represent true characters.

When Unicode was first conceived, it was thought that all the world's
characters could be represented using a 16-bit word; that is, a maximum of
C<0x10000> (or 65536) characters from C<0x0000> to C<0xFFFF> would be
needed. This soon proved to be false, and since Unicode 2.0 (July
1996), Unicode has been defined all the way up to 21 bits (C<0x10FFFF>),
and Unicode 3.1 (March 2001) defined the first characters above C<0xFFFF>.
The first C<0x10000> characters are called I<Plane 0>, or the
I<Basic Multilingual Plane> (BMP). With Unicode 3.1, 17 (yes,
seventeen) planes in all were defined--but they are nowhere near full of
defined characters, yet.

When a new language is being encoded, Unicode generally will choose a
C<block> of consecutive unallocated code points for its characters. So
far, the number of code points in these blocks has always been evenly
divisible by 16. Extras in a block, not currently needed, are left
unallocated, for future growth. But there have been occasions when
a later release needed more code points than the available extras, and a
new block had to be allocated somewhere else, not contiguous to the initial
one, to handle the overflow. Thus, it became apparent early on that
"block" wasn't an adequate organizing principle, and so the C<Script>
property was created. (Later an improved script property was added as
well, the C<Script_Extensions> property.) Those code points that are in
overflow blocks can still
have the same script as the original ones. The script concept fits more
closely with natural language: there is C<Latin> script, C<Greek>
script, and so on; and there are several artificial scripts, like
C<Common> for characters that are used in multiple scripts, such as
mathematical symbols. Scripts usually span varied parts of several
blocks. For more information about scripts, see L<perlunicode/Scripts>.
The division into blocks exists, but it is almost completely
accidental--an artifact of how the characters have been and still are
allocated. (Note that this paragraph has oversimplified things for the
sake of this being an introduction. Unicode doesn't really encode
languages, but the writing systems for them--their scripts; and one
script can be used by many languages. Unicode also encodes things that
aren't really about languages, such as symbols like C<BAGGAGE CLAIM>.)

The Unicode code points are just abstract numbers. To input and
output these abstract numbers, the numbers must be I<encoded> or
I<serialised> somehow. Unicode defines several I<character encoding
forms>, of which I<UTF-8> is perhaps the most popular. UTF-8 is a
variable length encoding that encodes Unicode characters as one to
four bytes. Other encodings
include UTF-16 and UTF-32 and their big- and little-endian variants
(UTF-8 is byte-order independent). ISO/IEC 10646 defines the UCS-2
and UCS-4 encoding forms.

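The variable length of UTF-8 can be seen with the core C<Encode> module;
a small sketch (the sample code points are arbitrary):

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# One character can occupy one to four bytes in UTF-8:
for my $cp (0x41, 0xE9, 0x263A, 0x1F600) {
    my $bytes = encode_utf8(chr($cp));
    printf "U+%04X takes %d byte(s)\n", $cp, length($bytes);
}
```
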
For more information about encodings--for instance, to learn what
I<surrogates> and I<byte order marks> (BOMs) are--see L<perlunicode>.

=head2 Perl's Unicode Support

Starting from Perl v5.6.0, Perl has had the capacity to handle Unicode
natively. Perl v5.8.0, however, is the first recommended release for
serious Unicode work. The maintenance release 5.6.1 fixed many of the
problems of the initial Unicode implementation, but for example
regular expressions still do not work with Unicode in 5.6.1.
Perl v5.14.0 is the first release where Unicode support is
(almost) seamlessly integrable without some gotchas (the exception being
some differences in L<quotemeta|perlfunc/quotemeta>, which is fixed
starting in Perl 5.16.0). To enable this
seamless support, you should C<use feature 'unicode_strings'> (which is
automatically selected if you C<use 5.012> or higher). See L<feature>.
(5.14 also fixes a number of bugs and departures from the Unicode
standard.)

Before Perl v5.8.0, C<use utf8> was used to declare
that operations in the current block or file would be Unicode-aware.
This model was found to be wrong, or at least clumsy: the "Unicodeness"
is now carried with the data, instead of being attached to the
operations.
Starting with Perl v5.8.0, only one case remains where an explicit C<use
utf8> is needed: if your Perl script itself is encoded in UTF-8, you can
use UTF-8 in your identifier names, and in string and regular expression
literals, by saying C<use utf8>. This is not the default because
scripts with legacy 8-bit data in them would break. See L<utf8>.

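For example, a script saved in UTF-8 encoding might begin like this (a
hypothetical sketch; the non-ASCII literal is the point):

```perl
use utf8;    # this source file itself is saved as UTF-8

my $price = "10€";              # a non-ASCII literal in the source text
print length($price), "\n";     # 3: three characters, not five bytes
```
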
=head2 Perl's Unicode Model

Perl supports both pre-5.6 strings of eight-bit native bytes, and
strings of Unicode characters. The general principle is that Perl tries
to keep its data as eight-bit bytes for as long as possible, but as soon
as Unicodeness cannot be avoided, the data is transparently upgraded
to Unicode. Prior to Perl v5.14.0, the upgrade was not completely
transparent (see L<perlunicode/The "Unicode Bug">), and for backwards
compatibility, full transparency is not gained unless C<use feature
'unicode_strings'> (see L<feature>) or C<use 5.012> (or higher) is
selected.

Internally, Perl uses either the platform's native eight-bit character
set (for example Latin-1) or UTF-8 to encode Unicode strings.
Specifically, if all code points in the string are C<0xFF> or less,
Perl uses the native eight-bit character set. Otherwise, it uses UTF-8.

A user of Perl does not normally need to know nor care how Perl
happens to encode its internal strings, but it becomes relevant when
outputting Unicode strings to a stream without a PerlIO layer (one with
the "default" encoding). In such a case, the raw bytes used internally
(the native character set or UTF-8, as appropriate for each string)
will be used, and a "Wide character" warning will be issued if those
strings contain a character beyond C<0x00FF>.

For example,

    perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'

produces a fairly useless mixture of native bytes and UTF-8, as well
as a warning:

    Wide character in print at ...

To output UTF-8, use the C<:encoding> or C<:utf8> output layer. Prepending

    binmode(STDOUT, ":utf8");

to this sample program ensures that the output is completely UTF-8,
and removes the program's warning.

You can enable automatic UTF-8-ification of your standard file
handles, default C<open()> layer, and C<@ARGV> by using either
the C<-C> command line switch or the C<PERL_UNICODE> environment
variable; see L<perlrun> for the documentation of the C<-C> switch.

Note that this means that Perl expects other software to work the same
way:
if Perl has been led to believe that STDIN should be UTF-8, but then
STDIN coming in from another command is not UTF-8, Perl will likely
complain about the malformed UTF-8.

All features that combine Unicode and I/O also require using the new
PerlIO feature. Almost all Perl 5.8 platforms do use PerlIO, though:
you can see whether yours does by running "perl -V" and looking for
C<useperlio=define>.

=head2 Unicode and EBCDIC

Perl 5.8.0 also supports Unicode on EBCDIC platforms. There,
Unicode support is somewhat more complex to implement since
additional conversions are needed at every step.

Later Perl releases have added code that will not work on EBCDIC platforms, and
no one has complained, so the divergence has continued. If you want to run
Perl on an EBCDIC platform, send email to perlbug@perl.org.

On EBCDIC platforms, the internal Unicode encoding form is UTF-EBCDIC
instead of UTF-8. The difference is that UTF-8 is "ASCII-safe", in
that ASCII characters encode to UTF-8 as-is, while UTF-EBCDIC is
"EBCDIC-safe".

=head2 Creating Unicode

To create Unicode characters in literals for code points above C<0xFF>,
use the C<\x{...}> notation in double-quoted strings:

    my $smiley = "\x{263a}";

Similarly, it can be used in regular expression literals:

    $smiley =~ /\x{263a}/;

At run-time you can use C<chr()>:

    my $hebrew_alef = chr(0x05d0);

See L</"Further Resources"> for how to find all these numeric codes.

Naturally, C<ord()> will do the reverse: it turns a character into
a code point.

Note that C<\x..> (no C<{}> and only two hexadecimal digits), C<\x{...}>,
and C<chr(...)> for arguments less than C<0x100> (decimal 256)
generate an eight-bit character for backward compatibility with older
Perls. For arguments of C<0x100> or more, Unicode characters are
always produced. If you want to force the production of Unicode
characters regardless of the numeric value, use C<pack("U", ...)>
instead of C<\x..>, C<\x{...}>, or C<chr()>.

You can invoke characters
by name in double-quoted strings:

    my $arabic_alef = "\N{ARABIC LETTER ALEF}";

And, as mentioned above, you can also C<pack()> numbers into Unicode
characters:

    my $georgian_an = pack("U", 0x10a0);

Note that both C<\x{...}> and C<\N{...}> are compile-time string
constants: you cannot use variables in them. If you want similar
run-time functionality, use C<chr()> and C<charnames::string_vianame()>.

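A sketch of the run-time interfaces (C<charnames::string_vianame()> is
available from Perl v5.14):

```perl
use strict;
use warnings;
use charnames ();

# string_vianame() looks a character up by name at run time:
my $alpha = charnames::string_vianame("GREEK SMALL LETTER ALPHA");
printf "U+%04X\n", ord($alpha);            # U+03B1

# viacode() maps a code point back to its normative name:
print charnames::viacode(0x263A), "\n";    # WHITE SMILING FACE
```
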
If you want to force the result to Unicode characters, use the special
C<"U0"> prefix. It consumes no arguments but causes the following bytes
to be interpreted as the UTF-8 encoding of Unicode characters:

    my $chars = pack("U0W*", 0x80, 0x42);

Likewise, you can stop such UTF-8 interpretation by using the special
C<"C0"> prefix.

=head2 Handling Unicode

Handling Unicode is for the most part transparent: just use the
strings as usual. Functions like C<index()>, C<length()>, and
C<substr()> will work on the Unicode characters; regular expressions
will work on the Unicode characters (see L<perlunicode> and L<perlretut>).

Note that Perl considers grapheme clusters to be separate characters, so for
example

    print length("\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}"),
        "\n";

will print 2, not 1. The only exception is that regular expressions
have C<\X> for matching an extended grapheme cluster. (Thus C<\X> in a
regular expression would match the entire sequence of both the example
characters.)

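A sketch contrasting the two ways of counting:

```perl
use strict;
use warnings;
use charnames ':full';

my $str = "\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}";

print length($str), "\n";            # 2: length() counts code points

my $graphemes = () = $str =~ /\X/g;  # \X matches one extended grapheme cluster
print $graphemes, "\n";              # 1
```
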
Life is not quite so transparent, however, when working with legacy
encodings, I/O, and certain special cases:

=head2 Legacy Encodings

When you combine legacy data and Unicode, the legacy data needs
to be upgraded to Unicode. Normally the legacy data is assumed to be
ISO 8859-1 (or EBCDIC, if applicable).

The C<Encode> module knows about many encodings and has interfaces
for doing conversions between those encodings:

    use Encode 'decode';
    $data = decode("iso-8859-3", $data); # convert from legacy encoding
                                         # to Perl characters

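A slightly fuller sketch of the round trip, decoding legacy bytes into
Perl characters and encoding them back out (the sample byte C<0xE4> is
"a with diaeresis" in ISO 8859-1):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = "\xE4";                        # a raw ISO 8859-1 byte
my $chars = decode("iso-8859-1", $bytes);  # now a Perl character string
printf "U+%04X\n", ord($chars);            # U+00E4

my $utf8 = encode("UTF-8", $chars);        # back out, this time as UTF-8
print length($utf8), "\n";                 # 2: two bytes in UTF-8
```
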
=head2 Unicode I/O

Normally, writing out Unicode data

    print FH $some_string_with_unicode, "\n";

produces raw bytes that Perl happens to use to internally encode the
Unicode string. Perl's internal encoding depends on the system as
well as what characters happen to be in the string at the time. If
any of the characters are at code points C<0x100> or above, you will get
a warning. To ensure that the output is explicitly rendered in the
encoding you desire--and to avoid the warning--open the stream with
the desired encoding. Some examples:

    open FH, ">:utf8", "file";

    open FH, ">:encoding(ucs2)",      "file";
    open FH, ">:encoding(UTF-8)",     "file";
    open FH, ">:encoding(shift_jis)", "file";

and on already open streams, use C<binmode()>:

    binmode(STDOUT, ":utf8");

    binmode(STDOUT, ":encoding(ucs2)");
    binmode(STDOUT, ":encoding(UTF-8)");
    binmode(STDOUT, ":encoding(shift_jis)");

The matching of encoding names is loose: case does not matter, and
many encodings have several aliases. Note that the C<:utf8> layer
must always be specified exactly like that; it is I<not> subject to
the loose matching of encoding names. Also note that currently C<:utf8> is unsafe for
input, because it accepts the data without validating that it is indeed valid
UTF-8; you should instead use C<:encoding(utf-8)> (with or without a
hyphen).

See L<PerlIO> for the C<:utf8> layer, L<PerlIO::encoding> and
L<Encode::PerlIO> for the C<:encoding()> layer, and
L<Encode::Supported> for many encodings supported by the C<Encode>
module.

Reading in a file that you know happens to be encoded in one of the
Unicode or legacy encodings does not magically turn the data into
Unicode in Perl's eyes. To do that, specify the appropriate
layer when opening files:

    open(my $fh, '<:encoding(utf8)', 'anything');
    my $line_of_unicode = <$fh>;

    open(my $fh, '<:encoding(Big5)', 'anything');
    my $line_of_unicode = <$fh>;

The I/O layers can also be specified more flexibly with
the C<open> pragma. See L<open>, or look at the following example.

    use open ':encoding(utf8)'; # input/output default encoding will be
                                # UTF-8
    open X, ">file";
    print X chr(0x100), "\n";
    close X;
    open Y, "<file";
    printf "%#x\n", ord(<Y>); # this should print 0x100
    close Y;

With the C<open> pragma you can use the C<:locale> layer

    BEGIN { $ENV{LC_ALL} = $ENV{LANG} = 'ru_RU.KOI8-R' }
    # the :locale will probe the locale environment variables like
    # LC_ALL
    use open OUT => ':locale'; # russki parusski
    open(O, ">koi8");
    print O chr(0x430); # Unicode CYRILLIC SMALL LETTER A = KOI8-R 0xc1
    close O;
    open(I, "<koi8");
    printf "%#x\n", ord(<I>); # this should print 0xc1
    close I;

These methods install a transparent filter on the I/O stream that
converts data from the specified encoding when it is read in from the
stream. The result is always Unicode.

The L<open> pragma affects all the C<open()> calls after the pragma by
setting default layers. If you want to affect only certain
streams, use explicit layers directly in the C<open()> call.

You can switch encodings on an already opened stream by using
C<binmode()>; see L<perlfunc/binmode>.

The C<:locale> layer does not currently work with
C<open()> and C<binmode()>, only with the C<open> pragma. The
C<:utf8> and C<:encoding(...)> methods do work with all of C<open()>,
C<binmode()>, and the C<open> pragma.

Similarly, you may use these I/O layers on output streams to
automatically convert Unicode to the specified encoding when it is
written to the stream. For example, the following snippet copies the
contents of the file "text.jis" (encoded as ISO-2022-JP, aka JIS) to
the file "text.utf8", encoded as UTF-8:

    open(my $nihongo, '<:encoding(iso-2022-jp)', 'text.jis');
    open(my $unicode, '>:utf8',                  'text.utf8');
    while (<$nihongo>) { print $unicode $_ }

The naming of encodings, both by the C<open()> and by the C<open>
pragma, allows for flexible names: C<koi8-r> and C<KOI8R> will both be
understood.

Common encodings recognized by ISO, MIME, IANA, and various other
standardisation organisations are recognised; for a more detailed
list see L<Encode::Supported>.

C<read()> reads characters and returns the number of characters.
C<seek()> and C<tell()> operate on byte counts, as do C<sysread()>
and C<sysseek()>.

Notice that because of the default behaviour of not doing any
conversion upon input if there is no default layer,
it is easy to mistakenly write code that keeps on expanding a file
by repeatedly encoding the data:

    # BAD CODE WARNING
    open F, "file";
    local $/; ## read in the whole file of 8-bit characters
    $t = <F>;
    close F;
    open F, ">:encoding(utf8)", "file";
    print F $t; ## convert to UTF-8 on output
    close F;

If you run this code twice, the contents of the F<file> will be
UTF-8 encoded twice. A C<use open ':encoding(utf8)'>, or explicitly
opening the F<file> for input as UTF-8 as well, would have avoided
the bug.

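A corrected version opens the file for input with an explicit layer as
well, so rerunning it leaves the contents unchanged (a sketch; it creates
its own sample file F<file> first):

```perl
use strict;
use warnings;

# Set up a sample file containing UTF-8 encoded data:
open my $setup, ">:encoding(UTF-8)", "file" or die "open: $!";
print $setup "caf\x{E9}\n";
close $setup;

# Decode on input ...
open my $in, "<:encoding(UTF-8)", "file" or die "open: $!";
my $t = do { local $/; <$in> };    # slurp the whole file as characters
close $in;

# ... and encode once on output; the file does not keep expanding.
open my $out, ">:encoding(UTF-8)", "file" or die "open: $!";
print $out $t;
close $out;
```
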
B<NOTE>: the C<:utf8> and C<:encoding> features work only if your
Perl has been built with the new PerlIO feature (which is the default
on most systems).

=head2 Displaying Unicode As Text

Sometimes you might want to display Perl scalars containing Unicode as
simple ASCII (or EBCDIC) text. The following subroutine converts
its argument so that Unicode characters with code points greater than
255 are displayed as C<\x{...}>, control characters (like C<\n>) are
displayed as C<\x..>, and the rest of the characters as themselves:

    sub nice_string {
        join("",
          map { $_ > 255 ?                  # if wide character...
                sprintf("\\x{%04X}", $_) :  # \x{...}
                chr($_) =~ /[[:cntrl:]]/ ?  # else if control character...
                sprintf("\\x%02X", $_) :    # \x..
                quotemeta(chr($_))          # else quoted or as themselves
              } unpack("W*", $_[0]));       # unpack Unicode characters
    }

For example,

    nice_string("foo\x{100}bar\n")

returns the string

    'foo\x{0100}bar\x0A'

which is ready to be printed.

=head2 Special Cases

=over 4

=item *

Bit Complement Operator ~ And vec()

The bit complement operator C<~> may produce surprising results if
used on strings containing characters with ordinal values above
255. In such a case, the results are consistent with the internal
encoding of the characters, but not with much else. So don't do
that. Similarly for C<vec()>: you will be operating on the
internally-encoded bit patterns of the Unicode characters, not on
the code point values, which is very probably not what you want.

=item *

Peeking At Perl's Internal Encoding

Normal users of Perl should never care how Perl encodes any particular
Unicode string (because the normal ways to get at the contents of a
string with Unicode--via input and output--should always be via
explicitly-defined I/O layers). But if you must, there are two
ways of looking behind the scenes.

One way of peeking inside the internal encoding of Unicode characters
is to use C<unpack("C*", ...)> to get the bytes of whatever the string
encoding happens to be, or C<unpack("U0..", ...)> to get the bytes of the
UTF-8 encoding:

    # this prints  c4 80  for the UTF-8 bytes 0xc4 0x80
    print join(" ", unpack("U0(H2)*", pack("U", 0x100))), "\n";

Yet another way would be to use the Devel::Peek module:

    perl -MDevel::Peek -e 'Dump(chr(0x100))'

That shows the C<UTF8> flag in FLAGS and both the UTF-8 bytes
and Unicode characters in C<PV>. See also later in this document
the discussion about the C<utf8::is_utf8()> function.

=back

=head2 Advanced Topics

=over 4

=item *

String Equivalence

The question of string equivalence turns somewhat complicated
in Unicode: what do you mean by "equal"?

(Is C<LATIN CAPITAL LETTER A WITH ACUTE> equal to
C<LATIN CAPITAL LETTER A>?)

The short answer is that by default Perl compares equivalence (C<eq>,
C<ne>) based only on code points of the characters. In the above
case, the answer is no (because 0x00C1 != 0x0041). But sometimes, any
CAPITAL LETTER A's should be considered equal, or even A's of any case.

The long answer is that you need to consider character normalization
and casing issues: see L<Unicode::Normalize>, Unicode Technical Report #15,
L<Unicode Normalization Forms|http://www.unicode.org/unicode/reports/tr15> and
sections on case mapping in the L<Unicode Standard|http://www.unicode.org>.

As of Perl 5.8.0, the "Full" case-folding of I<Case
Mappings/SpecialCasing> is implemented, but bugs remain in C<qr//i> with them,
mostly fixed by 5.14.

=item *

String Collation

People like to see their strings nicely sorted--or as Unicode
parlance goes, collated. But again, what do you mean by collate?

(Does C<LATIN CAPITAL LETTER A WITH ACUTE> come before or after
C<LATIN CAPITAL LETTER A WITH GRAVE>?)

The short answer is that by default, Perl compares strings (C<lt>,
C<le>, C<cmp>, C<ge>, C<gt>) based only on the code points of the
characters. In the above case, the answer is "after", since
C<0x00C1> > C<0x00C0>.

The long answer is that "it depends", and a good answer cannot be
given without knowing (at the very least) the language context.
See L<Unicode::Collate>, and the I<Unicode Collation Algorithm>,
L<http://www.unicode.org/unicode/reports/tr10/>.

=back

=head2 Miscellaneous

=over 4

=item *

Character Ranges and Classes

Character ranges in regular expression bracketed character classes (e.g.,
C</[a-z]/>) and in the C<tr///> (also known as C<y///>) operator are not
magically Unicode-aware. What this means is that C<[A-Za-z]> will not
magically start to mean "all alphabetic letters" (it does not mean that
even for 8-bit characters; for those, if you are using locales
(L<perllocale>), use C</[[:alpha:]]/>; and if not, use the 8-bit-aware
property C<\p{alpha}>).

All the properties that begin with C<\p> (and its inverse C<\P>) are actually
character classes that are Unicode-aware. There are dozens of them, see
L<perluniprops>.

You can use Unicode code points as the end points of character ranges, and the
range will include all Unicode code points that lie between those end points.

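A small sketch of a Unicode-end-point range, alongside the
property-based alternative:

```perl
use strict;
use warnings;

# A range with Unicode end points includes every code point in between:
print "beta is Greek lowercase\n" if "\x{3B2}" =~ /[\x{3B1}-\x{3C9}]/;

# The Unicode-aware way to ask "is it alphabetic?" is a property:
print "beta is alphabetic\n" if "\x{3B2}" =~ /\p{Alpha}/;
```
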
=item *

String-To-Number Conversions

Unicode does define several other decimal--and numeric--characters
besides the familiar 0 to 9, such as the Arabic and Indic digits.
Perl does not support string-to-number conversion for digits other
than ASCII 0 to 9 (and ASCII a to f for hexadecimal).
To get safe conversions from any Unicode string, use
L<Unicode::UCD/num()>.

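A sketch using C<num()>, available from L<Unicode::UCD> in Perl v5.14
and later (the sample string uses ARABIC-INDIC DIGIT ONE and TWO):

```perl
use strict;
use warnings;
use Unicode::UCD qw(num);

print num("123"), "\n";               # 123: plain ASCII digits
print num("\x{661}\x{662}"), "\n";    # 12: ARABIC-INDIC digits one, two
```
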
| 622 | =back |
| 623 | |
| 624 | =head2 Questions With Answers |
| 625 | |
| 626 | =over 4 |
| 627 | |
| 628 | =item * |
| 629 | |
| 630 | Will My Old Scripts Break? |
| 631 | |
Very probably not. Unless you are generating Unicode characters
somehow, old behaviour should be preserved. About the only behaviour
that has changed and which could start generating Unicode is the old
behaviour of C<chr()> where supplying an argument of more than 255
produced a character modulo 255. C<chr(300)>, for example, was equal
to C<chr(45)> or "-" (in ASCII); now it is LATIN CAPITAL LETTER I WITH
BREVE.
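
The new behaviour can be checked directly; C<chr(300)> now yields the
single character U+012C rather than anything modulo-reduced:

```perl
my $char = chr(300);
printf "U+%04X\n", ord($char);   # U+012C, LATIN CAPITAL LETTER I WITH BREVE
print length($char), "\n";       # 1 -- one character, however it is stored
```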
| 639 | |
| 640 | =item * |
| 641 | |
| 642 | How Do I Make My Scripts Work With Unicode? |
| 643 | |
| 644 | Very little work should be needed since nothing changes until you |
| 645 | generate Unicode data. The most important thing is getting input as |
| 646 | Unicode; for that, see the earlier I/O discussion. |
| 647 | To get full seamless Unicode support, add |
| 648 | C<use feature 'unicode_strings'> (or C<use 5.012> or higher) to your |
| 649 | script. |
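
One visible effect of C<unicode_strings> is that case-changing
operations on code points 128 to 255 follow the Unicode rules
regardless of how the string is stored internally; a small sketch:

```perl
use feature 'unicode_strings';   # or: use 5.012;

my $e_acute = "\xE9";            # LATIN SMALL LETTER E WITH ACUTE
printf "%vX\n", uc($e_acute);    # C9 -- LATIN CAPITAL LETTER E WITH ACUTE
```

Without the feature (and without a locale), C<uc> would leave C<"\xE9">
unchanged.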
| 650 | |
| 651 | =item * |
| 652 | |
| 653 | How Do I Know Whether My String Is In Unicode? |
| 654 | |
You shouldn't have to care. But you may need to if your Perl is earlier
than 5.14.0, or if you haven't specified C<use feature 'unicode_strings'>
or C<use 5.012> (or higher), because otherwise the semantics of code
points in the range 128 to 255 depend on whether the string containing
them is in Unicode or not.
| 660 | (See L<perlunicode/When Unicode Does Not Happen>.) |
| 661 | |
| 662 | To determine if a string is in Unicode, use: |
| 663 | |
| 664 | print utf8::is_utf8($string) ? 1 : 0, "\n"; |
| 665 | |
But note that this doesn't mean that any of the characters in the
string are necessarily UTF-8 encoded, or that any of the characters have
code points greater than 0xFF (255) or even 0x80 (128), or that the
string has any characters at all. All C<is_utf8()> does is return
the value of the internal "utf8ness" flag attached to
C<$string>. If the flag is off, the bytes in the scalar are interpreted
| 672 | as a single byte encoding. If the flag is on, the bytes in the scalar |
| 673 | are interpreted as the (variable-length, potentially multi-byte) UTF-8 encoded |
| 674 | code points of the characters. Bytes added to a UTF-8 encoded string are |
| 675 | automatically upgraded to UTF-8. If mixed non-UTF-8 and UTF-8 scalars |
| 676 | are merged (double-quoted interpolation, explicit concatenation, or |
| 677 | printf/sprintf parameter substitution), the result will be UTF-8 encoded |
| 678 | as if copies of the byte strings were upgraded to UTF-8: for example, |
| 679 | |
| 680 | $a = "ab\x80c"; |
| 681 | $b = "\x{100}"; |
| 682 | print "$a = $b\n"; |
| 683 | |
| 684 | the output string will be UTF-8-encoded C<ab\x80c = \x{100}\n>, but |
| 685 | C<$a> will stay byte-encoded. |
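
The flag behaviour in the example above can be observed with
C<utf8::is_utf8()> itself:

```perl
my $a = "ab\x80c";       # single-byte encoded; flag off
my $b = "\x{100}";       # must be UTF-8 encoded; flag on
my $c = $a . $b;         # a copy of $a is upgraded to build the result

print utf8::is_utf8($a) ? 1 : 0, "\n";   # 0 -- $a itself stays byte-encoded
print utf8::is_utf8($c) ? 1 : 0, "\n";   # 1
```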
| 686 | |
| 687 | Sometimes you might really need to know the byte length of a string |
| 688 | instead of the character length. For that use either the |
| 689 | C<Encode::encode_utf8()> function or the C<bytes> pragma |
| 690 | and the C<length()> function: |
| 691 | |
| 692 | my $unicode = chr(0x100); |
| 693 | print length($unicode), "\n"; # will print 1 |
| 694 | require Encode; |
| 695 | print length(Encode::encode_utf8($unicode)),"\n"; # will print 2 |
| 696 | use bytes; |
| 697 | print length($unicode), "\n"; # will also print 2 |
| 698 | # (the 0xC4 0x80 of the UTF-8) |
| 699 | no bytes; |
| 700 | |
| 701 | =item * |
| 702 | |
| 703 | How Do I Find Out What Encoding a File Has? |
| 704 | |
| 705 | You might try L<Encode::Guess>, but it has a number of limitations. |
| 706 | |
| 707 | =item * |
| 708 | |
| 709 | How Do I Detect Data That's Not Valid In a Particular Encoding? |
| 710 | |
| 711 | Use the C<Encode> package to try converting it. |
| 712 | For example, |
| 713 | |
| 714 | use Encode 'decode_utf8'; |
| 715 | |
| 716 | if (eval { decode_utf8($string, Encode::FB_CROAK); 1 }) { |
| 717 | # $string is valid utf8 |
| 718 | } else { |
| 719 | # $string is not valid utf8 |
| 720 | } |
| 721 | |
| 722 | Or use C<unpack> to try decoding it: |
| 723 | |
| 724 | use warnings; |
| 725 | @chars = unpack("C0U*", $string_of_bytes_that_I_think_is_utf8); |
| 726 | |
| 727 | If invalid, a C<Malformed UTF-8 character> warning is produced. The "C0" means |
| 728 | "process the string character per character". Without that, the |
| 729 | C<unpack("U*", ...)> would work in C<U0> mode (the default if the format |
| 730 | string starts with C<U>) and it would return the bytes making up the UTF-8 |
| 731 | encoding of the target string, something that will always work. |
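
A sketch of the character-mode decoding on a two-byte string (the exact
warning text for invalid input may vary between Perl versions):

```perl
use warnings;

my $bytes = "\xC4\x80";               # the UTF-8 encoding of U+0100

# Character mode ("C0"): the bytes are decoded as UTF-8.
my @chars = unpack("C0U*", $bytes);
printf "U+%04X\n", $chars[0];         # U+0100

# An invalid sequence such as "\xC3\x28" would instead trigger a
# "Malformed UTF-8 character" warning here.
```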
| 732 | |
| 733 | =item * |
| 734 | |
| 735 | How Do I Convert Binary Data Into a Particular Encoding, Or Vice Versa? |
| 736 | |
| 737 | This probably isn't as useful as you might think. |
| 738 | Normally, you shouldn't need to. |
| 739 | |
| 740 | In one sense, what you are asking doesn't make much sense: encodings |
| 741 | are for characters, and binary data are not "characters", so converting |
| 742 | "data" into some encoding isn't meaningful unless you know in what |
| 743 | character set and encoding the binary data is in, in which case it's |
| 744 | not just binary data, now is it? |
| 745 | |
| 746 | If you have a raw sequence of bytes that you know should be |
| 747 | interpreted via a particular encoding, you can use C<Encode>: |
| 748 | |
| 749 | use Encode 'from_to'; |
| 750 | from_to($data, "iso-8859-1", "utf-8"); # from latin-1 to utf-8 |
| 751 | |
| 752 | The call to C<from_to()> changes the bytes in C<$data>, but nothing |
| 753 | material about the nature of the string has changed as far as Perl is |
| 754 | concerned. Both before and after the call, the string C<$data> |
| 755 | contains just a bunch of 8-bit bytes. As far as Perl is concerned, |
| 756 | the encoding of the string remains as "system-native 8-bit bytes". |
| 757 | |
| 758 | You might relate this to a fictional 'Translate' module: |
| 759 | |
| 760 | use Translate; |
| 761 | my $phrase = "Yes"; |
| 762 | Translate::from_to($phrase, 'english', 'deutsch'); |
| 763 | ## phrase now contains "Ja" |
| 764 | |
The contents of the string change, but not the nature of the string.
Perl doesn't know any more after the call than before that the
contents of the string indicate the affirmative.
| 768 | |
| 769 | Back to converting data. If you have (or want) data in your system's |
| 770 | native 8-bit encoding (e.g. Latin-1, EBCDIC, etc.), you can use |
| 771 | pack/unpack to convert to/from Unicode. |
| 772 | |
| 773 | $native_string = pack("W*", unpack("U*", $Unicode_string)); |
| 774 | $Unicode_string = pack("U*", unpack("W*", $native_string)); |
| 775 | |
| 776 | If you have a sequence of bytes you B<know> is valid UTF-8, |
| 777 | but Perl doesn't know it yet, you can make Perl a believer, too: |
| 778 | |
| 779 | use Encode 'decode_utf8'; |
| 780 | $Unicode = decode_utf8($bytes); |
| 781 | |
| 782 | or: |
| 783 | |
| 784 | $Unicode = pack("U0a*", $bytes); |
| 785 | |
| 786 | You can find the bytes that make up a UTF-8 sequence with |
| 787 | |
| 788 | @bytes = unpack("C*", $Unicode_string) |
| 789 | |
| 790 | and you can create well-formed Unicode with |
| 791 | |
| 792 | $Unicode_string = pack("U*", 0xff, ...) |
| 793 | |
| 794 | =item * |
| 795 | |
| 796 | How Do I Display Unicode? How Do I Input Unicode? |
| 797 | |
| 798 | See L<http://www.alanwood.net/unicode/> and |
| 799 | L<http://www.cl.cam.ac.uk/~mgk25/unicode.html> |
| 800 | |
| 801 | =item * |
| 802 | |
| 803 | How Does Unicode Work With Traditional Locales? |
| 804 | |
| 805 | Starting in Perl 5.16, you can specify |
| 806 | |
| 807 | use locale ':not_characters'; |
| 808 | |
to get Perl to work well with traditional locales. The catch is that you
| 810 | have to translate from the locale character set to/from Unicode |
| 811 | yourself. See L</Unicode IE<sol>O> above for how to |
| 812 | |
| 813 | use open ':locale'; |
| 814 | |
| 815 | to accomplish this, but full details are in L<perllocale/Unicode and |
UTF-8>, including gotchas that happen if you don't specify
| 817 | C<:not_characters>. |
| 818 | |
| 819 | =back |
| 820 | |
| 821 | =head2 Hexadecimal Notation |
| 822 | |
| 823 | The Unicode standard prefers using hexadecimal notation because |
| 824 | that more clearly shows the division of Unicode into blocks of 256 characters. |
| 825 | Hexadecimal is also simply shorter than decimal. You can use decimal |
| 826 | notation, too, but learning to use hexadecimal just makes life easier |
| 827 | with the Unicode standard. The C<U+HHHH> notation uses hexadecimal, |
| 828 | for example. |
| 829 | |
The C<0x> prefix means a hexadecimal number, whose digits are 0-9 I<and>
a-f (or A-F; case doesn't matter). Each hexadecimal digit represents
| 832 | four bits, or half a byte. C<print 0x..., "\n"> will show a |
| 833 | hexadecimal number in decimal, and C<printf "%x\n", $decimal> will |
| 834 | show a decimal number in hexadecimal. If you have just the |
| 835 | "hex digits" of a hexadecimal number, you can use the C<hex()> function. |
| 836 | |
| 837 | print 0x0009, "\n"; # 9 |
| 838 | print 0x000a, "\n"; # 10 |
| 839 | print 0x000f, "\n"; # 15 |
| 840 | print 0x0010, "\n"; # 16 |
| 841 | print 0x0011, "\n"; # 17 |
| 842 | print 0x0100, "\n"; # 256 |
| 843 | |
| 844 | print 0x0041, "\n"; # 65 |
| 845 | |
| 846 | printf "%x\n", 65; # 41 |
| 847 | printf "%#x\n", 65; # 0x41 |
| 848 | |
| 849 | print hex("41"), "\n"; # 65 |
| 850 | |
| 851 | =head2 Further Resources |
| 852 | |
| 853 | =over 4 |
| 854 | |
| 855 | =item * |
| 856 | |
| 857 | Unicode Consortium |
| 858 | |
| 859 | L<http://www.unicode.org/> |
| 860 | |
| 861 | =item * |
| 862 | |
| 863 | Unicode FAQ |
| 864 | |
| 865 | L<http://www.unicode.org/unicode/faq/> |
| 866 | |
| 867 | =item * |
| 868 | |
| 869 | Unicode Glossary |
| 870 | |
| 871 | L<http://www.unicode.org/glossary/> |
| 872 | |
| 873 | =item * |
| 874 | |
| 875 | Unicode Recommended Reading List |
| 876 | |
| 877 | The Unicode Consortium has a list of articles and books, some of which |
| 878 | give a much more in depth treatment of Unicode: |
| 879 | L<http://unicode.org/resources/readinglist.html> |
| 880 | |
| 881 | =item * |
| 882 | |
| 883 | Unicode Useful Resources |
| 884 | |
| 885 | L<http://www.unicode.org/unicode/onlinedat/resources.html> |
| 886 | |
| 887 | =item * |
| 888 | |
| 889 | Unicode and Multilingual Support in HTML, Fonts, Web Browsers and Other Applications |
| 890 | |
| 891 | L<http://www.alanwood.net/unicode/> |
| 892 | |
| 893 | =item * |
| 894 | |
| 895 | UTF-8 and Unicode FAQ for Unix/Linux |
| 896 | |
| 897 | L<http://www.cl.cam.ac.uk/~mgk25/unicode.html> |
| 898 | |
| 899 | =item * |
| 900 | |
| 901 | Legacy Character Sets |
| 902 | |
| 903 | L<http://www.czyborra.com/> |
| 904 | L<http://www.eki.ee/letter/> |
| 905 | |
| 906 | =item * |
| 907 | |
| 908 | You can explore various information from the Unicode data files using |
| 909 | the C<Unicode::UCD> module. |
| 910 | |
| 911 | =back |
| 912 | |
| 913 | =head1 UNICODE IN OLDER PERLS |
| 914 | |
| 915 | If you cannot upgrade your Perl to 5.8.0 or later, you can still |
| 916 | do some Unicode processing by using the modules C<Unicode::String>, |
| 917 | C<Unicode::Map8>, and C<Unicode::Map>, available from CPAN. |
| 918 | If you have the GNU recode installed, you can also use the |
| 919 | Perl front-end C<Convert::Recode> for character conversions. |
| 920 | |
The following are fast conversions from ISO 8859-1 (Latin-1) bytes
to UTF-8 bytes and back; the code works even with older Perl 5 versions.
| 923 | |
| 924 | # ISO 8859-1 to UTF-8 |
| 925 | s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg; |
| 926 | |
| 927 | # UTF-8 to ISO 8859-1 |
| 928 | s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg; |
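
Wrapped as functions (hypothetical names; they operate on a copy
rather than on C<$_>), the substitutions round-trip as expected:

```perl
# Hypothetical helper names; the substitutions are the ones shown above.
sub latin1_to_utf8 {
    my $s = shift;
    $s =~ s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;
    return $s;
}

sub utf8_to_latin1 {
    my $s = shift;
    $s =~ s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;
    return $s;
}

my $latin1 = "caf\xE9";                 # "cafe" with E ACUTE, in ISO 8859-1
my $utf8   = latin1_to_utf8($latin1);   # "caf\xC3\xA9"
print $utf8 eq "caf\xC3\xA9"          ? "ok" : "not ok", "\n";
print utf8_to_latin1($utf8) eq $latin1 ? "ok" : "not ok", "\n";
```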
| 929 | |
| 930 | =head1 SEE ALSO |
| 931 | |
| 932 | L<perlunitut>, L<perlunicode>, L<Encode>, L<open>, L<utf8>, L<bytes>, |
| 933 | L<perlretut>, L<perlrun>, L<Unicode::Collate>, L<Unicode::Normalize>, |
| 934 | L<Unicode::UCD> |
| 935 | |
| 936 | =head1 ACKNOWLEDGMENTS |
| 937 | |
| 938 | Thanks to the kind readers of the perl5-porters@perl.org, |
| 939 | perl-unicode@perl.org, linux-utf8@nl.linux.org, and unicore@unicode.org |
| 940 | mailing lists for their valuable feedback. |
| 941 | |
| 942 | =head1 AUTHOR, COPYRIGHT, AND LICENSE |
| 943 | |
| 944 | Copyright 2001-2011 Jarkko Hietaniemi E<lt>jhi@iki.fiE<gt> |
| 945 | |
| 946 | This document may be distributed under the same terms as Perl itself. |