perl5.git.perl.org Git - perl5.git/blame_incremental

Commit	Line	Data
	1	=head1 NAME
	2
	3	perlunicode - Unicode support in Perl
	4
	5	=head1 DESCRIPTION
	6
	7	=head2 Important Caveats
	8
	9	Unicode support is an extensive requirement. While Perl does not
	10	implement the Unicode standard or the accompanying technical reports
	11	from cover to cover, Perl does support many Unicode features.
	12
	13	People who want to learn to use Unicode in Perl should probably read
	14	the L<Perl Unicode tutorial, perlunitut|perlunitut> and
	15	L<perluniintro> before reading
	16	this reference document.
	17
	18	Also, the use of Unicode may present security issues that aren't obvious.
	19	Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.
	20
	21	=over 4
	22
	23	=item Safest if you "use feature 'unicode_strings'"
	24
	25	In order to preserve backward compatibility, Perl does not turn
	26	on full internal Unicode support unless the pragma
	27	C<use feature 'unicode_strings'> is specified. (This is automatically
	28	selected if you use C<use 5.012> or higher.) Failure to do this can
	29	cause surprising results. See L</The "Unicode Bug"> below.
	30
	31	This pragma doesn't affect I/O. Nor does it change the internal
	32	representation of strings, only their interpretation. There are still
	33	several places where Unicode isn't fully supported, such as in
	34	filenames.
	35
	36	=item Input and Output Layers
	37
	38	Perl knows when a filehandle uses Perl's internal Unicode encodings
	39	(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
	40	the ":encoding(utf8)" layer. Other encodings can be converted to Perl's
	41	encoding on input or from Perl's encoding on output by use of the
	42	":encoding(...)" layer. See L<open>.
	43
	44	To indicate that Perl source itself is in UTF-8, use C<use utf8;>.
	45
	46	=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts
	47
	48	As a compatibility measure, the C<use utf8> pragma must be explicitly
	49	included to enable recognition of UTF-8 in the Perl scripts themselves
	50	(in string or regular expression literals, or in identifier names) on
	51	ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
	52	machines. B<These are the only times when an explicit C<use utf8>
	53	is needed.> See L<utf8>.
	54
	55	=item BOM-marked scripts and UTF-16 scripts autodetected
	56
	57	If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF-16BE,
	58	or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either
	59	endianness, Perl will correctly read in the script as Unicode.
	60	(BOMless UTF-8 cannot be effectively recognized or differentiated from
	61	ISO 8859-1 or other eight-bit encodings.)
	62
	63	=item C<use encoding> needed to upgrade non-Latin-1 byte strings
	64
	65	By default, there is a fundamental asymmetry in Perl's Unicode model:
	66	implicit upgrading from byte strings to Unicode strings assumes that
	67	they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
	68	downgraded with UTF-8 encoding. This happens because the first 256
	69	codepoints in Unicode happen to agree with Latin-1.
	70
	71	See L</"Byte and Character Semantics"> for more details.
	72
	73	=back
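
As a rough illustration of the input and output layers described in the
list above, the following sketch writes one character through an
C<:encoding(UTF-8)> layer and reads it back. The in-memory filehandle and
all variable names are assumptions chosen to keep the example
self-contained; a real program would normally open a file instead.

```perl
use strict;
use warnings;

# One character: U+263A WHITE SMILING FACE.
my $char = "\N{U+263A}";

# Encode on output: the layer turns the character into UTF-8 bytes.
# (\my $buffer is an in-memory file, used here only for illustration.)
open(my $out, '>:encoding(UTF-8)', \my $buffer) or die "open: $!";
print {$out} $char;
close($out) or die "close: $!";

# $buffer now holds the encoded bytes, not the character.
my $byte_count = length $buffer;

# Decode on input: the layer turns the bytes back into one character.
open(my $in, '<:encoding(UTF-8)', \$buffer) or die "open: $!";
my $read_back = <$in>;
close($in);

my $char_count = length $read_back;
```

The point of the sketch is that the same data has a length of three when
seen as encoded bytes and a length of one once the layer has decoded it.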
	74
	75	=head2 Byte and Character Semantics
	76
	77	Perl uses logically-wide characters to represent strings internally.
	78
	79	Starting in Perl 5.14, Perl-level operations work with
	80	characters rather than bytes within the scope of a
	81	C<L<use feature 'unicode_strings'|feature>> (or equivalently
	82	C<use 5.012> or higher). (This is not true if bytes have been
	83	explicitly requested by C<L<use bytes|bytes>>, nor necessarily true
	84	for interactions with the platform's operating system.)
	85
	86	For earlier Perls, and when C<unicode_strings> is not in effect, Perl
	87	provides a fairly safe environment that can handle both types of
	88	semantics in programs. For operations where Perl can unambiguously
	89	decide that the input data are characters, Perl switches to character
	90	semantics. For operations where this determination cannot be made
	91	without additional information from the user, Perl decides in favor of
	92	compatibility and chooses to use byte semantics.
	93
	94	When C<use locale> (but not C<use locale ':not_characters'>) is in
	95	effect, Perl uses the semantics associated with the current locale.
	96	(C<use locale> overrides C<use feature 'unicode_strings'> in the same scope;
	97	while C<use locale ':not_characters'> effectively also selects
	98	C<use feature 'unicode_strings'> in its scope; see L<perllocale>.)
	99	Otherwise, Perl uses the platform's native
	100	byte semantics for characters whose code points are less than 256, and
	101	Unicode semantics for those greater than 255. That means that non-ASCII
	102	characters below 256 are undefined except for their
	103	ordinal numbers: none have case (upper and lower), nor are any
	104	a member of character classes, like C<[:alpha:]> or C<\w>. (But all do belong
	105	to the C<\W> class or the Perl regular expression extension C<[:^alpha:]>.)
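
The difference between native byte semantics and Unicode semantics for
code points 128 to 255 can be sketched as follows. This is a minimal
example that assumes an ASCII platform, no C<use locale>, and a string
that has not been upgraded internally; the variable names are
illustrative only.

```perl
use strict;
use warnings;

# U+00E9 LATIN SMALL LETTER E WITH ACUTE, written as a one-byte native string.
my $e_acute = "\xE9";

my ($native_is_word, $native_uc);
{
    # unicode_strings is off: code points 128-255 get native byte semantics.
    no feature 'unicode_strings';
    $native_is_word = ($e_acute =~ /\w/) ? 1 : 0;
    $native_uc      = uc $e_acute;
}

my ($unicode_is_word, $unicode_uc);
{
    # unicode_strings is on: the same string gets Unicode semantics.
    use feature 'unicode_strings';
    $unicode_is_word = ($e_acute =~ /\w/) ? 1 : 0;
    $unicode_uc      = uc $e_acute;
}
```

In the first block the character has no case and is not matched by
C<\w>; in the second block it uppercases to U+00C9 and is a word
character.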
	106
	107	This behavior preserves compatibility with earlier versions of Perl,
	108	which allowed byte semantics in Perl operations only if
	109	none of the program's inputs were marked as being a source of Unicode
	110	character data. Such data may come from filehandles, from calls to
	111	external programs, from information provided by the system (such as %ENV),
	112	or from literals and constants in the source text.
	113
	114	The C<utf8> pragma is primarily a compatibility device that enables
	115	recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
	116	Note that this pragma is only required while Perl defaults to byte
	117	semantics; when character semantics become the default, this pragma
	118	may become a no-op. See L<utf8>.
	119
	120	If strings operating under byte semantics and strings with Unicode
	121	character data are concatenated, the new string will have
	122	character semantics. This can cause surprises: See L</BUGS>, below.
	123	You can choose to be warned when this happens. See L<encoding::warnings>.
	124
	125	Under character semantics, many operations that formerly operated on
	126	bytes now operate on characters. A character in Perl is
	127	logically just a number ranging from 0 to 2**31 or so. Larger
	128	characters may encode into longer sequences of bytes internally, but
	129	this internal detail is mostly hidden for Perl code.
	130	See L<perluniintro> for more.
	131
	132	=head2 Effects of Character Semantics
	133
	134	Character semantics have the following effects:
	135
	136	=over 4
	137
	138	=item *
	139
	140	Strings--including hash keys--and regular expression patterns may
	141	contain characters that have an ordinal value larger than 255.
	142
	143	If you use a Unicode editor to edit your program, Unicode characters may
	144	occur directly within the literal strings in UTF-8 encoding, or UTF-16.
	145	(The former requires a BOM or C<use utf8>, the latter requires a BOM.)
	146
	147	Unicode characters can also be added to a string by using the C<\N{U+...}>
	148	notation. The Unicode code for the desired character, in hexadecimal,
	149	should be placed in the braces, after the C<U>. For instance, a smiley face is
	150	C<\N{U+263A}>.
	151
	152	Alternatively, you can use the C<\x{...}> notation for characters 0x100 and
	153	above. For characters below 0x100 you may get byte semantics instead of
	154	character semantics; see L</The "Unicode Bug">. On EBCDIC machines there is
	155	the additional problem that the value for such characters gives the EBCDIC
	156	character rather than the Unicode one, thus it is more portable to use
	157	C<\N{U+...}> instead.
	158
	159	Additionally, you can use the C<\N{...}> notation and put the official
	160	Unicode character name within the braces, such as
	161	C<\N{WHITE SMILING FACE}>. This automatically loads the L<charnames>
	162	module with the C<:full> and C<:short> options. If you prefer different
	163	options for this module, you can instead, before the C<\N{...}>,
	164	explicitly load it with your desired options; for example,
	165
	166	    use charnames ':loose';
	167
	168	=item *
	169
	170	If an appropriate L<encoding> is specified, identifiers within the
	171	Perl script may contain Unicode alphanumeric characters, including
	172	ideographs. Perl does not currently attempt to canonicalize variable
	173	names.
	174
	175	=item *
	176
	177	Regular expressions match characters instead of bytes. "." matches
	178	a character instead of a byte.
	179
	180	=item *
	181
	182	Bracketed character classes in regular expressions match characters instead of
	183	bytes and match against the character properties specified in the
	184	Unicode properties database. C<\w> can be used to match a Japanese
	185	ideograph, for instance.
	186
	187	=item *
	188
	189	Named Unicode properties, scripts, and block ranges may be used (like bracketed
	190	character classes) by using the C<\p{}> "matches property" construct and
	191	the C<\P{}> negation, "doesn't match property".
	192	See L</"Unicode Character Properties"> for more details.
	193
	194	You can define your own character properties and use them
	195	in the regular expression with the C<\p{}> or C<\P{}> construct.
	196	See L</"User-Defined Character Properties"> for more details.
	197
	198	=item *
	199
	200	The special pattern C<\X> matches a logical character, an "extended grapheme
	201	cluster" in Standardese. In Unicode what appears to the user to be a single
	202	character, for example an accented C<G>, may in fact be composed of a sequence
	203	of characters, in this case a C<G> followed by an accent character. C<\X>
	204	will match the entire sequence.
	205
	206	=item *
	207
	208	The C<tr///> operator translates characters instead of bytes. Note
	209	that the C<tr///CU> functionality has been removed. For similar
	210	functionality see C<pack('U0', ...)> and C<pack('C0', ...)>.
	211
	212	=item *
	213
	214	Case translation operators use the Unicode case translation tables
	215	when character input is provided. Note that C<uc()>, or C<\U> in
	216	interpolated strings, translates to uppercase, while C<ucfirst>,
	217	or C<\u> in interpolated strings, translates to titlecase in languages
	218	that make the distinction (which is equivalent to uppercase in languages
	219	without the distinction).
	220
	221	=item *
	222
	223	Most operators that deal with positions or lengths in a string will
	224	automatically switch to using character positions, including
	225	C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
	226	C<sprintf()>, C<write()>, and C<length()>. An operator that
	227	specifically does not switch is C<vec()>. Operators that really don't
	228	care include operators that treat strings as a bucket of bits such as
	229	C<sort()>, and operators dealing with filenames.
	230
	231	=item *
	232
	233	The C<pack()>/C<unpack()> letter C<C> does I<not> change, since it is often
	234	used for byte-oriented formats. Think of C<char> in the C language.
	235
	236	There is a new C<U> specifier that converts between Unicode characters
	237	and code points. There is also a C<W> specifier that is the equivalent of
	238	C<chr>/C<ord> and properly handles character values even if they are above 255.
	239
	240	=item *
	241
	242	The C<chr()> and C<ord()> functions work on characters, similar to
	243	C<pack("W")> and C<unpack("W")>, I<not> C<pack("C")> and
	244	C<unpack("C")>. C<pack("C")> and C<unpack("C")> are methods for
	245	emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.
	246	While these methods reveal the internal encoding of Unicode strings,
	247	that is not something one normally needs to care about at all.
	248
	249	=item *
	250
	251	The bit string operators, C<& | ^ ~>, can operate on character data.
	252	However, for backward compatibility, such as when using bit string
	253	operations when characters are all less than 256 in ordinal value, one
	254	should not use C<~> (the bit complement) with characters of both
	255	values less than 256 and values greater than 255. Most importantly,
	256	DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
	257	will not hold. The reason for this mathematical I<faux pas> is that
	258	the complement cannot return B<both> the 8-bit (byte-wide) bit
	259	complement B<and> the full character-wide bit complement.
	260
	261	=item *
	262
	263	There is a CPAN module, L<Unicode::Casing>, which allows you to define
	264	your own mappings to be used in C<lc()>, C<lcfirst()>, C<uc()>,
	265	C<ucfirst()>, and C<fc> (or their double-quoted string inlined
	266	versions such as C<\U>).
	267	(Prior to Perl 5.16, this functionality was partially provided
	268	in the Perl core, but suffered from a number of insurmountable
	269	drawbacks, so the CPAN module was written instead.)
	270
	271	=back
	272
	273	=over 4
	274
	275	=item *
	276
	277	And finally, C<scalar reverse()> reverses by character rather than by byte.
	278
	279	=back
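
Several of the effects listed above can be seen in one short sketch. The
sample strings and variable names are assumptions made for illustration,
and C<use charnames ':full'> is loaded explicitly only so that the named
form also works on older perls.

```perl
use strict;
use warnings;
use charnames ':full';   # explicit, so \N{name} also works on older perls

# Three spellings of U+263A; all yield the same one-character string.
my $by_code = "\N{U+263A}";
my $by_hex  = "\x{263A}";
my $by_name = "\N{WHITE SMILING FACE}";

# length() and reverse() count characters, not bytes.
my $word     = "na\N{U+EF}ve";          # "naive" with a diaeresis, 5 characters
my $length   = length $word;
my $reversed = scalar reverse $word;

# chr()/ord() and pack("W")/unpack("W") agree, even above 255.
my $packed   = pack("W", 0x263A);
my $unpacked = unpack("W", $by_code);

# \X matches a base character plus its combining marks as one unit.
my $g_acute   = "G\N{U+301}";           # G + COMBINING ACUTE ACCENT
my ($cluster) = $g_acute =~ /^(\X)/;

# uc() gives uppercase, ucfirst() gives titlecase where the two differ.
my $dz    = "\N{U+1C6}";                # LATIN SMALL LETTER DZ WITH CARON
my $upper = uc $dz;                     # U+01C4
my $title = ucfirst $dz;                # U+01C5
```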
	280
	281	=head2 Unicode Character Properties
	282
	283	(The only time that Perl considers a sequence of individual code
	284	points as a single logical character is in the C<\X> construct, already
	285	mentioned above. Therefore "character" in this discussion means a single
	286	Unicode code point.)
	287
	288	Very nearly all Unicode character properties are accessible through
	289	regular expressions by using the C<\p{}> "matches property" construct
	290	and the C<\P{}> "doesn't match property" for its negation.
	291
	292	For instance, C<\p{Uppercase}> matches any single character with the Unicode
	293	"Uppercase" property, while C<\p{L}> matches any character with a
	294	General_Category of "L" (letter) property. Braces are not
	295	required for single letter property names, so C<\p{L}> is equivalent to C<\pL>.
	296
	297	More formally, C<\p{Uppercase}> matches any single character whose Unicode
	298	Uppercase property value is True, and C<\P{Uppercase}> matches any character
	299	whose Uppercase property value is False, and they could have been written as
	300	C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively.
	301
	302	This formality is needed when properties are not binary; that is, if they can
	303	take on more values than just True and False. For example, the Bidi_Class (see
	304	L</"Bidirectional Character Types"> below) can take on several different
	305	values, such as Left, Right, Whitespace, and others. To match these, one needs
	306	to specify both the property name (Bidi_Class), AND the value being
	307	matched against
	308	(Left, Right, etc.). This is done, as in the examples above, by having the
	309	two components separated by an equal sign (or interchangeably, a colon), like
	310	C<\p{Bidi_Class: Left}>.
	311
	312	All Unicode-defined character properties may be written in these compound forms
	313	of C<\p{property=value}> or C<\p{property:value}>, but Perl provides some
	314	additional properties that are written only in the single form, as well as
	315	single-form short-cuts for all binary properties and certain others described
	316	below, in which you may omit the property name and the equals or colon
	317	separator.
	318
	319	Most Unicode character properties have at least two synonyms (or aliases if you
	320	prefer): a short one that is easier to type and a longer one that is more
	321	descriptive and hence easier to understand. Thus the "L" and "Letter" properties
	322	above are equivalent and can be used interchangeably. Likewise,
	323	"Upper" is a synonym for "Uppercase", and we could have written
	324	C<\p{Uppercase}> equivalently as C<\p{Upper}>. Also, there are typically
	325	various synonyms for the values the property can be. For binary properties,
	326	"True" has 3 synonyms: "T", "Yes", and "Y"; and "False" has correspondingly "F",
	327	"No", and "N". But be careful. A short form of a value for one property may
	328	not mean the same thing as the same short form for another. Thus, for the
	329	General_Category property, "L" means "Letter", but for the Bidi_Class property,
	330	"L" means "Left". A complete list of properties and synonyms is in
	331	L<perluniprops>.
	332
	333	Upper/lower case differences in property names and values are irrelevant;
	334	thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>.
	335	Similarly, you can add or subtract underscores anywhere in the middle of a
	336	word, so that these are also equivalent to C<\p{U_p_p_e_r}>. And white space
	337	is irrelevant adjacent to non-word characters, such as the braces and the equals
	338	or colon separators, so C<\p{ Upper }> and C<\p{ Upper_case : Y }> are
	339	equivalent to these as well. In fact, white space and even
	340	hyphens can usually be added or deleted anywhere. So even C<\p{ Up-per case = Yes}> is
	341	equivalent. All this is called "loose-matching" by Unicode. The few places
	342	where stricter matching is used are in the middle of numbers, and in the Perl
	343	extension properties that begin or end with an underscore. Stricter matching
	344	cares about white space (except adjacent to non-word characters),
	345	hyphens, and non-interior underscores.
	346
	347	You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
	348	(^) between the first brace and the property name: C<\p{^Tamil}> is
	349	equal to C<\P{Tamil}>.
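
The single form, the compound form, loose matching, and negation can be
compared side by side. The following is a minimal sketch; the list of
spellings is taken from the text above and the variable names are
illustrative.

```perl
use strict;
use warnings;

# Equivalent spellings of the binary Uppercase property.
my @spellings = (
    qr/\p{Uppercase}/,
    qr/\p{Upper}/,
    qr/\p{upper}/,
    qr/\p{Uppercase=True}/,
    qr/\p{ Upper_case : Y }/,
);

# Count how many spellings accept "A" and how many accept "a".
my $accept_upper = grep { "A" =~ $_ } @spellings;
my $accept_lower = grep { "a" =~ $_ } @spellings;

# \pL is shorthand for \p{L}; \p{^L} and \P{L} both negate it.
my $short_form = ("x" =~ /^\pL$/)    ? 1 : 0;
my $caret_neg  = ("7" =~ /^\p{^L}$/) ? 1 : 0;
my $big_p_neg  = ("7" =~ /^\P{L}$/)  ? 1 : 0;
```

If the spellings really are equivalent, every pattern accepts "A" and
none accepts "a".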
	350
	351	Almost all properties are immune to case-insensitive matching. That is,
	352	adding a C</i> regular expression modifier does not change what they
	353	match. There are two sets that are affected.
	354	The first set is
	355	C<Uppercase_Letter>,
	356	C<Lowercase_Letter>,
	357	and C<Titlecase_Letter>,
	358	all of which match C<Cased_Letter> under C</i> matching.
	359	And the second set is
	360	C<Uppercase>,
	361	C<Lowercase>,
	362	and C<Titlecase>,
	363	all of which match C<Cased> under C</i> matching.
	364	This set also includes its subsets C<PosixUpper> and C<PosixLower> both
	365	of which under C</i> matching match C<PosixAlpha>.
	366	(The difference between these sets is that some things, such as Roman
	367	numerals, come in both upper and lower case so they are C<Cased>, but aren't considered
	368	letters, so they aren't C<Cased_Letter>s.)
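
The effect of C</i> on the two affected sets can be checked directly. This
is a small sketch with illustrative variable names; ROMAN NUMERAL ONE is
used as the example of a character that is C<Cased> but not a
C<Cased_Letter>.

```perl
use strict;
use warnings;

# Without /i, \p{Lu} accepts only uppercase letters.
my $plain  = ("a" =~ /^\p{Lu}$/)  ? 1 : 0;

# With /i, \p{Lu} behaves like \p{Cased_Letter}, so "a" is accepted.
my $folded = ("a" =~ /^\p{Lu}$/i) ? 1 : 0;

# A property outside the two affected sets is unchanged by /i.
my $digit_plain  = ("a" =~ /^\p{Nd}$/)  ? 1 : 0;
my $digit_folded = ("a" =~ /^\p{Nd}$/i) ? 1 : 0;

# ROMAN NUMERAL ONE (U+2160) is Cased but not a Cased_Letter.
my $roman           = "\N{U+2160}";
my $roman_is_cased  = ($roman =~ /^\p{Cased}$/)        ? 1 : 0;
my $roman_is_letter = ($roman =~ /^\p{Cased_Letter}$/) ? 1 : 0;
```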
	369
	370	The result is undefined if you try to match a non-Unicode code point
	371	(that is, one above 0x10FFFF) against a Unicode property. Currently, a
	372	warning is raised, and the match will fail. In some cases, this is
	373	counterintuitive, as both these fail:
	374
	375	    chr(0x110000) =~ /\p{ASCII_Hex_Digit=True}/    # Fails.
	376	    chr(0x110000) =~ /\p{ASCII_Hex_Digit=False}/   # Fails!
	377
	378	=head3 B<General_Category>
	379
	380	Every Unicode character is assigned a general category, which is the "most
	381	usual categorization of a character" (from
	382	L<http://www.unicode.org/reports/tr44>).
	383
	384	The compound way of writing these is like C<\p{General_Category=Number}>
	385	(short, C<\p{gc:n}>). But Perl furnishes shortcuts in which everything up
	386	through the equal or colon separator is omitted. So you can instead just write
	387	C<\pN>.
	388
	389	Here are the short and long forms of the General Category properties:
	390
	391	    Short       Long
	392
	393	    L           Letter
	394	    LC, L&      Cased_Letter (that is: [\p{Ll}\p{Lu}\p{Lt}])
	395	    Lu          Uppercase_Letter
	396	    Ll          Lowercase_Letter
	397	    Lt          Titlecase_Letter
	398	    Lm          Modifier_Letter
	399	    Lo          Other_Letter
	400
	401	    M           Mark
	402	    Mn          Nonspacing_Mark
	403	    Mc          Spacing_Mark
	404	    Me          Enclosing_Mark
	405
	406	    N           Number
	407	    Nd          Decimal_Number (also Digit)
	408	    Nl          Letter_Number
	409	    No          Other_Number
	410
	411	    P           Punctuation (also Punct)
	412	    Pc          Connector_Punctuation
	413	    Pd          Dash_Punctuation
	414	    Ps          Open_Punctuation
	415	    Pe          Close_Punctuation
	416	    Pi          Initial_Punctuation
	417	                (may behave like Ps or Pe depending on usage)
	418	    Pf          Final_Punctuation
	419	                (may behave like Ps or Pe depending on usage)
	420	    Po          Other_Punctuation
	421
	422	    S           Symbol
	423	    Sm          Math_Symbol
	424	    Sc          Currency_Symbol
	425	    Sk          Modifier_Symbol
	426	    So          Other_Symbol
	427
	428	    Z           Separator
	429	    Zs          Space_Separator
	430	    Zl          Line_Separator
	431	    Zp          Paragraph_Separator
	432
	433	    C           Other
	434	    Cc          Control (also Cntrl)
	435	    Cf          Format
	436	    Cs          Surrogate
	437	    Co          Private_Use
	438	    Cn          Unassigned
	439
	440	Single-letter properties match all characters in any of the
	441	two-letter sub-properties starting with the same letter.
	442	C<LC> and C<L&> are special: both are aliases for the set consisting of everything matched by C<Ll>, C<Lu>, and C<Lt>.
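
One way to see the two-letter and single-letter forms at work is to test
a character against a handful of categories. The helper
C<sample_category> below is hypothetical and checks only a small sample
of the categories in the table; it is a sketch, not a complete
classifier.

```perl
use strict;
use warnings;

# Return the first sampled two-letter General_Category that matches the
# given character, or "??" if none of the sampled categories match.
sub sample_category {
    my ($char) = @_;
    for my $gc (qw(Lu Ll Lt Nd Nl Pd Sc Zs Cc)) {
        return $gc if $char =~ /^\p{General_Category=$gc}$/;
    }
    return "??";
}

my @categories = map { sample_category($_) } ("A", "a", "7", "-", " ");

# The single-letter form \pN covers both Nd ("7") and Nl (U+2160).
my $digit_is_number = ("7"         =~ /^\pN$/) ? 1 : 0;
my $roman_is_number = ("\N{U+2160}" =~ /^\pN$/) ? 1 : 0;

# L& (also spelled LC) covers Ll, Lu, and Lt, but not other letters.
my $cased_letter = ("a"         =~ /^\p{L&}$/) ? 1 : 0;
my $other_letter = ("\N{U+5D0}" =~ /^\p{L&}$/) ? 1 : 0;   # HEBREW LETTER ALEF is Lo
```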
	443
	444	=head3 B<Bidirectional Character Types>
	445
	446	Because scripts differ in their directionality (Hebrew and Arabic are
	447	written right to left, for example) Unicode supplies a Bidi_Class property.
	448	Some of the values this property can have are:
	449
	450	    Value       Meaning
	451
	452	    L           Left-to-Right
	453	    LRE         Left-to-Right Embedding
	454	    LRO         Left-to-Right Override
	455	    R           Right-to-Left
	456	    AL          Arabic Letter
	457	    RLE         Right-to-Left Embedding
	458	    RLO         Right-to-Left Override
	459	    PDF         Pop Directional Format
	460	    EN          European Number
	461	    ES          European Separator
	462	    ET          European Terminator
	463	    AN          Arabic Number
	464	    CS          Common Separator
	465	    NSM         Non-Spacing Mark
	466	    BN          Boundary Neutral
	467	    B           Paragraph Separator
	468	    S           Segment Separator
	469	    WS          Whitespace
	470	    ON          Other Neutrals
	471
	472	This property is always written in the compound form.
	473	For example, C<\p{Bidi_Class:R}> matches characters that are normally
	474	written right to left. Unlike the
	475	General_Category property, this
	476	property can have more values added in a future Unicode release. Those
	477	listed above comprised the complete set for many Unicode releases, but
	478	others were added in Unicode 6.3; you can always find what the
	479	current ones are in L<perluniprops>. And
	480	L<http://www.unicode.org/reports/tr9/> describes how to use them.
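
A few compound-form matches against C<Bidi_Class> are sketched below. The
sample characters and variable names are chosen for illustration only.

```perl
use strict;
use warnings;

my $latin  = "A";            # Bidi_Class L  (Left-to-Right)
my $hebrew = "\N{U+5D0}";    # HEBREW LETTER ALEF, Bidi_Class R
my $arabic = "\N{U+627}";    # ARABIC LETTER ALEF, Bidi_Class AL
my $digit  = "1";            # Bidi_Class EN (European Number)

my $latin_is_l  = ($latin  =~ /^\p{Bidi_Class:L}$/)  ? 1 : 0;
my $hebrew_is_r = ($hebrew =~ /^\p{Bidi_Class:R}$/)  ? 1 : 0;
my $arabic_is_r = ($arabic =~ /^\p{Bidi_Class:R}$/)  ? 1 : 0;   # AL is a separate value
my $arabic_isal = ($arabic =~ /^\p{bc=AL}$/)         ? 1 : 0;   # bc is the short name
my $digit_is_en = ($digit  =~ /^\p{Bidi_Class=EN}$/) ? 1 : 0;

# Count the strongly right-to-left characters (R or AL) in a string.
my $mixed     = $latin . $hebrew . $arabic . $digit;
my $rtl_count = () = $mixed =~ /[\p{bc=R}\p{bc=AL}]/g;
```

Note that the Arabic letter does not match C<\p{Bidi_Class:R}>; code that
wants every strongly right-to-left character has to test for both C<R>
and C<AL>.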
	481
	482	=head3 B<Scripts>
	483
	484	The world's languages are written in many different scripts. This sentence
	485	(unless you're reading it in translation) is written in Latin, while Russian is
	486	written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in
	487	Hiragana or Katakana. There are many more.
	488
	489	The Unicode Script and Script_Extensions properties give what script a
	490	given character is in. Either property can be specified with the
	491	compound form like
	492	C<\p{Script=Hebrew}> (short: C<\p{sc=hebr}>), or
	493	C<\p{Script_Extensions=Javanese}> (short: C<\p{scx=java}>).
	494	In addition, Perl furnishes shortcuts for all
	495	C<Script> property names. You can omit everything up through the equals
	496	(or colon), and simply write C<\p{Latin}> or C<\P{Cyrillic}>.
	497	(This is not true for C<Script_Extensions>, which is required to be
	498	written in the compound form.)
	499
	500	The difference between these two properties involves characters that are
	501	used in multiple scripts. For example the digits '0' through '9' are
	502	used in many parts of the world. These are placed in a script named
	503	C<Common>. Other characters are used in just a few scripts. For
	504	example, the "KATAKANA-HIRAGANA DOUBLE HYPHEN" is used in both Japanese
	505	scripts, Katakana and Hiragana, but nowhere else. The C<Script>
	506	property places all characters that are used in multiple scripts in the
	507	C<Common> script, while the C<Script_Extensions> property places those
	508	that are used in only a few scripts into each of those scripts; while
	509	still using C<Common> for those used in many scripts. Thus both these
	510	match:
	511
	512	    "0" =~ /\p{sc=Common}/     # Matches
	513	    "0" =~ /\p{scx=Common}/    # Matches
	514
	515	and only the first of these matches:
	516
	517	    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Common}/   # Matches
	518	    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Common}/  # No match
	519
	520	And only the last two of these match:
	521
	522	    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Hiragana}/   # No match
	523	    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Katakana}/   # No match
	524	    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Hiragana}/  # Matches
	525	    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Katakana}/  # Matches
	526
	527	C<Script_Extensions> is thus an improved C<Script>, in which there are
	528	fewer characters in the C<Common> script, and correspondingly more in
	529	other scripts. It is new in Unicode version 6.0, and its data are likely
	530	to change significantly in later releases, as things get sorted out.
	531
	532	(Actually, besides C<Common>, the C<Inherited> script contains
	533	characters that are used in multiple scripts. These are modifier
	534	characters which modify other characters, and inherit the script value
	535	of the controlling character. Some of these are used in many scripts,
	536	and so go into C<Inherited> in both C<Script> and C<Script_Extensions>.
	537	Others are used in just a few scripts, so are in C<Inherited> in
	538	C<Script>, but not in C<Script_Extensions>.)
	539
	540	It is worth stressing that there are several different sets of digits in
	541	Unicode that are equivalent to 0-9 and are matchable by C<\d> in a
	542	regular expression. If they are used in a single language only, they
	543	are in that language's C<Script> and C<Script_Extensions>. If they are
	544	used in more than one script, they will be in C<sc=Common>, but only
	545	if they are used in many scripts should they be in C<scx=Common>.
	546
	547	A complete list of scripts and their shortcuts is in L<perluniprops>.
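
As a sketch of how the C<Script> property might be used in practice, the
hypothetical helper C<sample_script> below reports which of a few sampled
scripts a character belongs to. Only the compound C<Script=> form is
used, and the list of scripts is an arbitrary sample.

```perl
use strict;
use warnings;

# Return the first sampled script that the character belongs to,
# or "other" if it is in none of them.
sub sample_script {
    my ($char) = @_;
    for my $script (qw(Latin Greek Cyrillic Hebrew Common)) {
        return $script if $char =~ /^\p{Script=$script}$/;
    }
    return "other";
}

# a, GREEK SMALL LETTER ALPHA, CYRILLIC SMALL LETTER YA, HEBREW LETTER ALEF, 0
my @scripts = map { sample_script($_) }
    ("a", "\N{U+3B1}", "\N{U+44F}", "\N{U+5D0}", "0");

# The short names behave the same way as the long ones.
my $short_name = ("\N{U+5D0}" =~ /^\p{sc=hebr}$/) ? 1 : 0;

# Digits from other scripts are still digits: ARABIC-INDIC DIGIT THREE.
my $is_digit = ("\N{U+663}" =~ /^\d$/) ? 1 : 0;
```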
	548
	549	=head3 B<Use of "Is" Prefix>
	550
	551	For backward compatibility (with Perl 5.6), all properties mentioned
	552	so far may have C<Is> or C<Is_> prepended to their name, so C<\P{Is_Lu}>, for
	553	example, is equal to C<\P{Lu}>, and C<\p{IsScript:Arabic}> is equal to
	554	C<\p{Arabic}>.
	555
	556	=head3 B<Blocks>
	557
	558	In addition to B<scripts>, Unicode also defines B<blocks> of
	559	characters. The difference between scripts and blocks is that the
	560	concept of scripts is closer to natural languages, while the concept
	561	of blocks is more of an artificial grouping based on groups of Unicode
	562	characters with consecutive ordinal values. For example, the "Basic Latin"
	563	block is all characters whose ordinals are between 0 and 127, inclusive; in
	564	other words, the ASCII characters. The "Latin" script contains some letters
	565	from this as well as several other blocks, like "Latin-1 Supplement",
	566	"Latin Extended-A", etc., but it does not contain all the characters from
	567	those blocks. It does not, for example, contain the digits 0-9, because
	568	those digits are shared across many scripts, and hence are in the
	569	C<Common> script.
	570
	571	For more about scripts versus blocks, see UAX#24 "Unicode Script Property":
	572	L<http://www.unicode.org/reports/tr24>
	573
	574	The C<Script> or C<Script_Extensions> properties are likely to be the
	575	ones you want to use when processing
	576	natural language; the Block property may occasionally be useful in working
	577	with the nuts and bolts of Unicode.
	578
	579	Block names are matched in the compound form, like C<\p{Block: Arrows}> or
	580	C<\p{Blk=Hebrew}>. Unlike most other properties, only a few block names have a
	581	Unicode-defined short name. But Perl does provide a (slight) shortcut: You
	582	can say, for example C<\p{In_Arrows}> or C<\p{In_Hebrew}>. For backwards
	583	compatibility, the C<In> prefix may be omitted if there is no naming conflict
	584	with a script or any other property, and you can even use an C<Is> prefix
	585	instead in those cases. But it is not a good idea to do this, for a couple
	586	of reasons:
	587
	588	=over 4
	589
	590	=item 1
	591
	592	It is confusing. There are many naming conflicts, and you may forget some.
	593	For example, C<\p{Hebrew}> means the I<script> Hebrew, and NOT the I<block>
	594	Hebrew. But would you remember that 6 months from now?
	595
	596	=item 2
	597
	598	It is unstable. A new version of Unicode may preempt the current meaning by
	599	creating a property with the same name. There was a time in very early Unicode
	600	releases when C<\p{Hebrew}> would have matched the I<block> Hebrew; now it
	601	doesn't.
	602
	603	=back
	604
	605	Some people prefer to always use C<\p{Block: foo}> and C<\p{Script: bar}>
	606	instead of the shortcuts, whether for clarity, because they can't remember the
	607	difference between 'In' and 'Is' anyway, or they aren't confident that those who
	608	eventually will read their code will know that difference.
	609
	610	A complete list of blocks and their shortcuts is in L<perluniprops>.
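
The distinction between blocks and scripts drawn in this section can be
checked with the digit "0" and an accented Latin letter. This is a
minimal sketch; the variable names are illustrative.

```perl
use strict;
use warnings;

my $digit   = "0";           # in the Basic Latin block, Common script
my $e_acute = "\N{U+E9}";    # in the Latin-1 Supplement block, Latin script

# Blocks are ranges of code points.
my $digit_in_block   = ($digit   =~ /^\p{Block: Basic_Latin}$/)       ? 1 : 0;
my $eacute_in_block  = ($e_acute =~ /^\p{Block: Basic_Latin}$/)       ? 1 : 0;
my $eacute_in_latin1 = ($e_acute =~ /^\p{Block=Latin_1_Supplement}$/) ? 1 : 0;

# Scripts follow usage, not position.
my $digit_in_script  = ($digit   =~ /^\p{Script: Latin}$/) ? 1 : 0;
my $eacute_in_script = ($e_acute =~ /^\p{Script: Latin}$/) ? 1 : 0;

# The In_ shortcut is the same test as the compound Block form.
my $digit_in_shortcut = ($digit =~ /^\p{In_Basic_Latin}$/) ? 1 : 0;
```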
	611
	612	=head3 B<Other Properties>
	613
	614	There are many more properties than the very basic ones described here.
	615	A complete list is in L<perluniprops>.
	616
	617	Unicode defines all its properties in the compound form, so all single-form
	618	properties are Perl extensions. Most of these are just synonyms for the
	619	Unicode ones, but some are genuine extensions, including several that are in
	620	the compound form. And quite a few of these are actually recommended by Unicode
	621	(in L<http://www.unicode.org/reports/tr18>).
	622
	623	This section gives some details on all extensions that aren't just
	624	synonyms for compound-form Unicode properties
	625	(for those properties, you'll have to refer to the
	626	L<Unicode Standard|http://www.unicode.org/reports/tr44>).
	627
	628	=over
	629
	630	=item B<C<\p{All}>>
	631
	632	This matches any of the 1_114_112 Unicode code points. It is a synonym for
	633	C<\p{Any}>.
	634
	635	=item B<C<\p{Alnum}>>
	636
	637	This matches any C<\p{Alphabetic}> or C<\p{Decimal_Number}> character.
	638
	639	=item B<C<\p{Any}>>
	640
	641	This matches any of the 1_114_112 Unicode code points. It is a synonym for
	642	C<\p{All}>.
	643
	644	=item B<C<\p{ASCII}>>
	645
	646	This matches any of the 128 characters in the US-ASCII character set,
	647	which is a subset of Unicode.
	648
	649	=item B<C<\p{Assigned}>>
	650
	651	This matches any assigned code point; that is, any code point whose general
	652	category is not Unassigned (or equivalently, not Cn).
	653
	654	=item B<C<\p{Blank}>>
	655
	656	This is the same as C<\h> and C<\p{HorizSpace}>: A character that changes the
	657	spacing horizontally.
	658
	659	=item B<C<\p{Decomposition_Type: Non_Canonical}>> (Short: C<\p{Dt=NonCanon}>)
	660
	661	Matches a character that has a non-canonical decomposition.
	662
	663	To understand the use of this rarely used property=value combination, it is
	664	necessary to know some basics about decomposition.
	665	Consider a character, say H. It could appear with various marks around it,
	666	such as an acute accent, or a circumflex, or various hooks, circles, arrows,
	667	I<etc.>, above, below, to one side or the other. There are many
	668	possibilities among the world's languages. The number of combinations is
	669	astronomical, and if there were a character for each combination, it would
	670	soon exhaust Unicode's more than a million possible characters. So Unicode
	671	took a different approach: there is a character for the base H, and a
	672	character for each of the possible marks, and these can be variously combined
	673	to get a final logical character. So a logical character--what appears to be a
	674	single character--can be a sequence of more than one individual character.
	675	This is called an "extended grapheme cluster"; Perl furnishes the C<\X>
	676	regular expression construct to match such sequences.
	677
	678	But Unicode's intent is to unify the existing character set standards and
	679	practices, and several pre-existing standards have single characters that
	680	mean the same thing as some of these combinations. An example is ISO-8859-1,
	681	which has quite a few of these in the Latin-1 range, an example being "LATIN
	682	CAPITAL LETTER E WITH ACUTE". Because this character was in this pre-existing
	683	standard, Unicode added it to its repertoire. But this character is considered
	684	by Unicode to be equivalent to the sequence consisting of the character
	685	"LATIN CAPITAL LETTER E" followed by the character "COMBINING ACUTE ACCENT".
	686
	687	"LATIN CAPITAL LETTER E WITH ACUTE" is called a "pre-composed" character, and
	688	its equivalence with the sequence is called canonical equivalence. All
	689	pre-composed characters are said to have a decomposition (into the equivalent
	690	sequence), and the decomposition type is also called canonical.
	691
	692	However, many more characters have a different type of decomposition, a
	693	"compatible" or "non-canonical" decomposition. The sequences that form these
	694	decompositions are not considered canonically equivalent to the pre-composed
	695	character. An example, again in the Latin-1 range, is the "SUPERSCRIPT ONE".
	696	It is somewhat like a regular digit 1, but not exactly; its decomposition
	697	into the digit 1 is called a "compatible" decomposition, specifically a
	698	"super" decomposition. There are several such compatibility
	699	decompositions (see L<http://www.unicode.org/reports/tr44>), including one
	700	called "compat", which means some miscellaneous type of decomposition
	701	that doesn't fit into the decomposition categories that Unicode has chosen.
	702
	703	Note that most Unicode characters don't have a decomposition, so their
	704	decomposition type is "None".
	705
	706	For your convenience, Perl has added the C<Non_Canonical> decomposition
	707	type to mean any of the several compatibility decompositions.
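
As a rough illustration of the decomposition types described above, the
following sketch uses the core L<Unicode::Normalize> module to decompose the
two Latin-1 examples and then checks them against C<\X> and
C<\p{Dt=NonCanon}>. The variable names are arbitrary; nothing here is meant
as the definitive way to inspect decompositions.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFKD);

my $e_acute = "\x{C9}";   # LATIN CAPITAL LETTER E WITH ACUTE
my $sup_one = "\x{B9}";   # SUPERSCRIPT ONE

# Canonical decomposition: one code point becomes base + combining mark.
my $nfd = NFD($e_acute);
my @code_points = map { sprintf "U+%04X", ord } split //, $nfd;
print "NFD of U+00C9: @code_points\n";

# \X treats the decomposed pair as a single extended grapheme cluster.
my @clusters = $nfd =~ /(\X)/g;
print "grapheme clusters: ", scalar(@clusters), "\n";

# SUPERSCRIPT ONE has only a compatibility ("super") decomposition, so
# canonical decomposition leaves it alone while NFKD maps it to "1".
my $nfd_same = NFD($sup_one)  eq $sup_one ? 1 : 0;
my $nfkd_one = NFKD($sup_one) eq "1"      ? 1 : 0;

# Perl's Non_Canonical value matches only the compatibility case.
my $sup_is_noncanon = $sup_one =~ /\p{Dt=NonCanon}/ ? 1 : 0;
my $e_is_noncanon   = $e_acute =~ /\p{Dt=NonCanon}/ ? 1 : 0;
print "U+00B9 non-canonical: $sup_is_noncanon\n";
print "U+00C9 non-canonical: $e_is_noncanon\n";
```

The pre-composed E WITH ACUTE has a canonical decomposition, so it does not
match C<\p{Dt=NonCanon}>, whereas SUPERSCRIPT ONE does.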
	708
	709	=item B<C<\p{Graph}>>
	710
	711	Matches any character that is graphic. Theoretically, this means a character
	712	that on a printer would cause ink to be used.
	713
	714	=item B<C<\p{HorizSpace}>>
	715
	716	This is the same as C<\h> and C<\p{Blank}>: a character that changes the
	717	spacing horizontally.
	718
	719	=item B<C<\p{In=*}>>
	720
	721	This is a synonym for C<\p{Present_In=*}>.
	722
	723	=item B<C<\p{PerlSpace}>>
	724
	725	This is the same as C<\s>, restricted to ASCII, namely C<S<[ \f\n\r\t]>>
	726	and starting in Perl v5.18, experimentally, a vertical tab.
	727
	728	Mnemonic: Perl's (original) space.
	729
	730	=item B<C<\p{PerlWord}>>
	731
	732	This is the same as C<\w>, restricted to ASCII, namely C<[A-Za-z0-9_]>.
	733
	734	Mnemonic: Perl's (original) word.
	735
	736	=item B<C<\p{Posix...}>>
	737
	738	There are several of these, which are equivalents using the C<\p>
	739	notation for Posix classes and are described in
	740	L<perlrecharclass/POSIX Character Classes>.
	741
	742	=item B<C<\p{Present_In: *}>> (Short: C<\p{In=*}>)
	743
	744	This property is used when you need to know in what Unicode version(s) a
	745	character is.
	746
	747	The "*" above stands for some two digit Unicode version number, such as
	748	C<1.1> or C<4.0>; or the "*" can also be C<Unassigned>. This property will
	749	match the code points whose final disposition has been settled as of the
	750	Unicode release given by the version number; C<\p{Present_In: Unassigned}>
	751	will match those code points whose meaning has yet to be assigned.
	752
	753	For example, C<U+0041> "LATIN CAPITAL LETTER A" was present in the very first
	754	Unicode release available, which is C<1.1>, so this property is true for all
	755	valid "*" versions. On the other hand, C<U+1EFF> was not assigned until version
	756	5.1 when it became "LATIN SMALL LETTER Y WITH LOOP", so the only "*" values that
	757	match it are 5.1, 5.2, and later.
	758
	759	Unicode furnishes the C<Age> property from which this is derived. The problem
	760	with Age is that a strict interpretation of it (which Perl takes) has it
	761	matching the precise release a code point's meaning is introduced in. Thus
	762	C<U+0041> would match only 1.1; and C<U+1EFF> only 5.1. This is not usually what
	763	you want.
	764
	765	Some non-Perl implementations of the Age property may change its meaning to be
	766	the same as the Perl Present_In property; just be aware of that.
	767
	768	Another confusion with both these properties is that the definition is not
	769	that the code point has been I<assigned>, but that the meaning of the code point
	770	has been I<determined>. This is because 66 code points will always be
	771	unassigned, and so the Age for them is the Unicode version in which the decision
	772	to make them so was made. For example, C<U+FDD0> is to be permanently
	773	unassigned to a character, and the decision to do that was made in version 3.1,
	774	so C<\p{Age=3.1}> matches this character, as also does C<\p{Present_In: 3.1}> and up.
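
The difference between C<Age> and C<Present_In> can be seen with the two code
points used as examples above. This is only a small sketch; the hash keys
are labels made up for the printout.

```perl
use strict;
use warnings;

my $cap_a  = "\x{41}";     # LATIN CAPITAL LETTER A, in Unicode since 1.1
my $y_loop = "\x{1EFF}";   # LATIN SMALL LETTER Y WITH LOOP, added in 5.1

my %match = (
    'U+0041 Age=1.1' => ($cap_a  =~ /\p{Age=1.1}/        ? 1 : 0),
    'U+0041 Age=5.1' => ($cap_a  =~ /\p{Age=5.1}/        ? 1 : 0),
    'U+0041 In=5.1'  => ($cap_a  =~ /\p{Present_In=5.1}/ ? 1 : 0),
    'U+1EFF Age=5.1' => ($y_loop =~ /\p{Age=5.1}/        ? 1 : 0),
    'U+1EFF In=4.0'  => ($y_loop =~ /\p{In=4.0}/         ? 1 : 0),
    'U+1EFF In=5.2'  => ($y_loop =~ /\p{In=5.2}/         ? 1 : 0),
);

# Age matches only the release that introduced the code point;
# Present_In matches that release and every later one.
for my $label (sort keys %match) {
    printf "%-16s %s\n", $label, $match{$label} ? "matches" : "no match";
}
```

C<Age> answers "when was this introduced?", while C<Present_In> answers
"did this exist as of that release?", which is usually the more useful
question.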
	775
	776	=item B<C<\p{Print}>>
	777
	778	This matches any character that is graphical or blank, except controls.
	779
	780	=item B<C<\p{SpacePerl}>>
	781
	782	This is the same as C<\s>, including beyond ASCII.
	783
	784	Mnemonic: Space, as modified by Perl. (It doesn't include the vertical tab
	785	which both the Posix standard and Unicode consider white space.)
	786
	787	=item B<C<\p{Title}>> and B<C<\p{Titlecase}>>
	788
	789	Under case-sensitive matching, these both match the same code points as
	790	C<\p{General Category=Titlecase_Letter}> (C<\p{gc=lt}>). The difference
	791	is that under C</i> caseless matching, these match the same as
	792	C<\p{Cased}>, whereas C<\p{gc=lt}> matches C<\p{Cased_Letter}>.
	793
	794	=item B<C<\p{VertSpace}>>
	795
	796	This is the same as C<\v>: A character that changes the spacing vertically.
	797
	798	=item B<C<\p{Word}>>
	799
	800	This is the same as C<\w>, including over 100_000 characters beyond ASCII.
	801
	802	=item B<C<\p{XPosix...}>>
	803
	804	There are several of these, which are the standard Posix classes
	805	extended to the full Unicode range. They are described in
	806	L<perlrecharclass/POSIX Character Classes>.
	807
	808	=back
	809
	810	=head2 User-Defined Character Properties
	811
	812	You can define your own binary character properties by defining subroutines
	813	whose names begin with "In" or "Is". (The experimental feature
	814	L<perlre/(?[ ])> provides an alternative which allows more complex
	815	definitions.) The subroutines can be defined in any
	816	package. The user-defined properties can be used in the regular expression
	817	C<\p> and C<\P> constructs; if you are using a user-defined property from a
	818	package other than the one you are in, you must specify its package in the
	819	C<\p> or C<\P> construct.
	820
	821	# assuming property IsForeign is defined in package Lang
	822	package main; # property package name required
	823	if ($txt =~ /\p{Lang::IsForeign}+/) { ... }
	824
	825	package Lang; # property package name not required
	826	if ($txt =~ /\p{IsForeign}+/) { ... }
	827
	828
	829	Note that the effect is compile-time and immutable once defined.
	830	However, the subroutines are passed a single parameter, which is 0 if
	831	case-sensitive matching is in effect and non-zero if caseless matching
	832	is in effect. The subroutine may return different values depending on
	833	the value of the flag, and one set of values will immutably be in effect
	834	for all case-sensitive matches, and the other set for all case-insensitive
	835	matches.
	836
	837	Note that if the regular expression is tainted, then Perl will die rather
	838	than calling the subroutine, where the name of the subroutine is
	839	determined by the tainted data.
	840
	841	The subroutines must return a specially-formatted string, with one
	842	or more newline-separated lines. Each line must be one of the following:
	843
	844	=over 4
	845
	846	=item *
	847
	848	A single hexadecimal number denoting a code point to include.
	849
	850	=item *
	851
	852	Two hexadecimal numbers separated by horizontal whitespace (space or
	853	tab characters) denoting a range of code points to include.
	854
	855	=item *
	856
	857	Something to include, prefixed by "+": a built-in character
	858	property (prefixed by "utf8::") or a fully qualified (including package
	859	name) user-defined character property,
	860	to represent all the characters in that property; two hexadecimal code
	861	points for a range; or a single hexadecimal code point.
	862
	863	=item *
	864
	865	Something to exclude, prefixed by "-": an existing character
	866	property (prefixed by "utf8::") or a fully qualified (including package
	867	name) user-defined character property,
	868	to represent all the characters in that property; two hexadecimal code
	869	points for a range; or a single hexadecimal code point.
	870
	871	=item *
	872
	873	Something to negate, prefixed by "!": an existing character
	874	property (prefixed by "utf8::") or a fully qualified (including package
	875	name) user-defined character property,
	876	to represent all the characters in that property; two hexadecimal code
	877	points for a range; or a single hexadecimal code point.
	878
	879	=item *
	880
	881	Something to intersect with, prefixed by "&": an existing character
	882	property (prefixed by "utf8::") or a fully qualified (including package
	883	name) user-defined character property,
	884	to keep only the characters that are also in that property; two
	885	hexadecimal code points for a range; or a single hexadecimal code point.
	886
	887	=back
	888
	889	For example, to define a property that covers both the Japanese
	890	syllabaries (hiragana and katakana), you can define
	891
	892	sub InKana {
	893	return <<END;
	894	3040\t309F
	895	30A0\t30FF
	896	END
	897	}
	898
	899	Imagine that the here-doc end marker is at the beginning of the line.
	900	Now you can use C<\p{InKana}> and C<\P{InKana}>.
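
A complete script built around the range-based definition might look like
the following sketch. The test characters and the C<%is_kana> hash are
chosen purely for illustration.

```perl
use strict;
use warnings;

# User-defined property: the Hiragana and Katakana blocks as hex ranges.
# The interpolating here-doc turns each \t into a real tab separator.
sub InKana {
    return <<"END";
3040\t309F
30A0\t30FF
END
}

# HIRAGANA LETTER A, KATAKANA LETTER A, LATIN CAPITAL LETTER A
my %is_kana;
for my $hex (qw(3042 30A2 0041)) {
    $is_kana{$hex} = chr(hex $hex) =~ /\p{InKana}/ ? 1 : 0;
}

# \P{...} is the complement of \p{...}
my $latin_is_not_kana = "A" =~ /\P{InKana}/ ? 1 : 0;

printf "U+%s in kana: %d\n", $_, $is_kana{$_} for sort keys %is_kana;
print "A matches \\P{InKana}: $latin_is_not_kana\n";
```

Because the subroutine lives in C<main> and the patterns are compiled in
C<main>, no package qualifier is needed inside the braces.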
	901
	902	You could also have used the existing block property names:
	903
	904	sub InKana {
	905	return <<'END';
	906	+utf8::InHiragana
	907	+utf8::InKatakana
	908	END
	909	}
	910
	911	Suppose you wanted to match only the allocated characters,
	912	not the raw block ranges: in other words, you want to remove
	913	the unassigned code points:
	914
	915	sub InKana {
	916	return <<'END';
	917	+utf8::InHiragana
	918	+utf8::InKatakana
	919	-utf8::IsCn
	920	END
	921	}
	922
	923	The negation is useful for defining (surprise!) negated classes.
	924
	925	sub InNotKana {
	926	return <<'END';
	927	!utf8::InHiragana
	928	-utf8::InKatakana
	929	+utf8::IsCn
	930	END
	931	}
	932
	933	This will match all non-Unicode code points, since every one of them is
	934	not in Kana. You can use intersection to exclude these, if desired, as
	935	this modified example shows:
	936
	937	sub InNotKana {
	938	return <<'END';
	939	!utf8::InHiragana
	940	-utf8::InKatakana
	941	+utf8::IsCn
	942	&utf8::Any
	943	END
	944	}
	945
	946	C<&utf8::Any> must be the last line in the definition.
	947
	948	Intersection is used generally for getting the common characters matched
	949	by two (or more) classes. It's important to remember not to use "&" for
	950	the first set; that would be intersecting with nothing, resulting in an
	951	empty set.
	952
	953	(Note that official Unicode properties differ from these in that they
	954	automatically exclude non-Unicode code points and a warning is raised if
	955	a match is attempted on one of those.)
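
The case-sensitivity flag mentioned earlier in this section can be used as in
the following sketch. C<IsAsciiLower> is a hypothetical property name
invented for this example; it returns a wider set when the subroutine is
called for a caseless match.

```perl
use strict;
use warnings;

# Hypothetical user-defined property. The single argument is 0 for
# case-sensitive matching and non-zero under /i.
sub IsAsciiLower {
    my ($caseless) = @_;
    return $caseless
        ? "41\t5A\n61\t7A\n"   # /i in effect: accept A-Z as well as a-z
        : "61\t7A\n";          # case-sensitive: a-z only
}

my $upper_plain    = "Q" =~ /\p{IsAsciiLower}/  ? 1 : 0;
my $upper_caseless = "Q" =~ /\p{IsAsciiLower}/i ? 1 : 0;
my $lower_plain    = "q" =~ /\p{IsAsciiLower}/  ? 1 : 0;

print "Q without /i: $upper_plain\n";
print "Q with /i:    $upper_caseless\n";
print "q without /i: $lower_plain\n";
```

Each of the two sets is fixed the first time it is needed, so the subroutine
should depend only on its argument and not on other program state.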
	956
	957	=head2 User-Defined Case Mappings (for serious hackers only)
	958
	959	B<This feature has been removed as of Perl 5.16.>
	960	The CPAN module L<Unicode::Casing> provides better functionality without
	961	the drawbacks that this feature had. If you are using a Perl earlier
	962	than 5.16, this feature was most fully documented in the 5.14 version of
	963	this pod:
	964	L<http://perldoc.perl.org/5.14.0/perlunicode.html#User-Defined-Case-Mappings-%28for-serious-hackers-only%29>
	965
	966	=head2 Character Encodings for Input and Output
	967
	968	See L<Encode>.
	969
	970	=head2 Unicode Regular Expression Support Level
	971
	972	The following list of Unicode supported features for regular expressions describes
	973	all features currently directly supported by core Perl. The references to "Level N"
	974	and the section numbers refer to the Unicode Technical Standard #18,
	975	"Unicode Regular Expressions", version 13, from August 2008.
	976
	977	=over 4
	978
	979	=item *
	980
	981	Level 1 - Basic Unicode Support
	982
	983	RL1.1 Hex Notation - done [1]
	984	RL1.2 Properties - done [2][3]
	985	RL1.2a Compatibility Properties - done [4]
	986	RL1.3 Subtraction and Intersection - experimental [5]
	987	RL1.4 Simple Word Boundaries - done [6]
	988	RL1.5 Simple Loose Matches - done [7]
	989	RL1.6 Line Boundaries - MISSING [8][9]
	990	RL1.7 Supplementary Code Points - done [10]
	991
	992	=over 4
	993
	994	=item [1]
	995
	996	\x{...}
	997
	998	=item [2]
	999
	1000	\p{...} \P{...}
	1001
	1002	=item [3]
	1003
	1004	supports not only the minimal list, but all Unicode character properties (see Unicode Character Properties above)
	1005
	1006	=item [4]
	1007
	1008	\d \D \s \S \w \W \X [:prop:] [:^prop:]
	1009
	1010	=item [5]
	1011
	1012	The experimental feature in v5.18 C<"(?[...])"> accomplishes this. See
	1013	L<perlre/(?[ ])>. If you don't want to use an experimental feature,
	1014	you can use one of the following:
	1015
	1016	=over 4
	1017
	1018	=item * Regular expression look-ahead
	1019
	1020	You can mimic class subtraction using lookahead.
	1021	For example, what UTS#18 might write as
	1022
	1023	[{Block=Greek}-[{UNASSIGNED}]]
	1024
	1025	in Perl can be written as:
	1026
	1027	(?!\p{Unassigned})\p{Block=Greek}
	1028	(?=\p{Assigned})\p{Block=Greek}
	1029
	1030	But in this particular example, you probably really want
	1031
	1032	\p{Greek}
	1033
	1034	which will match assigned characters known to be part of the Greek script.
	1035
	1036	=item * CPAN module L<Unicode::Regex::Set>
	1037
	1038	It does implement the full UTS#18 grouping, intersection, union, and
	1039	removal (subtraction) syntax.
	1040
	1041	=item * L</"User-Defined Character Properties">
	1042
	1043	'+' for union, '-' for removal (set-difference), '&' for intersection
	1044
	1045	=back
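
The look-ahead form of class subtraction described in this note can be
checked with a short script. This is only a sketch: U+03A2 is used as the
test case on the assumption that it remains a reserved, unassigned slot
inside the Greek block, and the variable names are arbitrary.

```perl
use strict;
use warnings;

# "In the Greek block, but not unassigned", emulated with a look-ahead.
my $assigned_greek = qr/(?!\p{Unassigned})\p{Block=Greek}/;

my $omega    = "\x{3A9}";   # GREEK CAPITAL LETTER OMEGA, assigned
my $reserved = "\x{3A2}";   # reserved slot inside the Greek block

my $omega_in_block    = $omega    =~ /\p{Block=Greek}/ ? 1 : 0;
my $reserved_in_block = $reserved =~ /\p{Block=Greek}/ ? 1 : 0;
my $omega_assigned    = $omega    =~ $assigned_greek   ? 1 : 0;
my $reserved_assigned = $reserved =~ $assigned_greek   ? 1 : 0;

print "U+03A9 block=$omega_in_block assigned-in-block=$omega_assigned\n";
print "U+03A2 block=$reserved_in_block assigned-in-block=$reserved_assigned\n";
```

Both code points are in the block, but only the assigned one survives the
negative look-ahead.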
	1046
	1047	=item [6]
	1048
	1049	\b \B
	1050
	1051	=item [7]
	1052
	1053	Note that Perl does Full case-folding in matching (but with bugs), not Simple: for example U+1F88 is equivalent to U+1F00 U+03B9, instead of just U+1F80. This difference matters mainly for certain Greek capital letters with certain modifiers: the Full case-folding decomposes the letter, while the Simple case-folding would map it to a single character.
	1054
	1055	=item [8]
	1056
	1057	Should do ^ and $ also on U+000B (\v in C), FF (\f), CR (\r), CRLF
	1058	(\r\n), NEL (U+0085), LS (U+2028), and PS (U+2029); should also affect
	1059	<>, $., and script line numbers; should not split lines within CRLF
	1060	(i.e. there is no empty line between \r and \n). For CRLF, try the
	1061	C<:crlf> layer (see L<PerlIO>).
	1062
	1063	=item [9]
	1064
	1065	Linebreaking conformant with UAX#14 "Unicode Line Breaking Algorithm" is available through the L<Unicode::LineBreak> module.
	1066
	1067	=item [10]
	1068
	1069	UTF-8/UTF-EBCDIC used in Perl allows not only U+10000 to
	1070	U+10FFFF but also beyond U+10FFFF.
	1071
	1072	=back
	1073
	1074	=item *
	1075
	1076	Level 2 - Extended Unicode Support
	1077
	1078	RL2.1 Canonical Equivalents - MISSING [10][11]
	1079	RL2.2 Default Grapheme Clusters - MISSING [12]
	1080	RL2.3 Default Word Boundaries - MISSING [14]
	1081	RL2.4 Default Loose Matches - MISSING [15]
	1082	RL2.5 Name Properties - DONE
	1083	RL2.6 Wildcard Properties - MISSING
	1084
	1085	[10] see UAX#15 "Unicode Normalization Forms"
	1086	[11] have Unicode::Normalize but not integrated to regexes
	1087	[12] have \X but we don't have a "Grapheme Cluster Mode"
	1088	[14] see UAX#29, Word Boundaries
	1089	[15] This is covered in Chapter 3.13 (in Unicode 6.0)
	1090
	1091	=item *
	1092
	1093	Level 3 - Tailored Support
	1094
	1095	RL3.1 Tailored Punctuation - MISSING
	1096	RL3.2 Tailored Grapheme Clusters - MISSING [17][18]
	1097	RL3.3 Tailored Word Boundaries - MISSING
	1098	RL3.4 Tailored Loose Matches - MISSING
	1099	RL3.5 Tailored Ranges - MISSING
	1100	RL3.6 Context Matching - MISSING [19]
	1101	RL3.7 Incremental Matches - MISSING
	1102	( RL3.8 Unicode Set Sharing )
	1103	RL3.9 Possible Match Sets - MISSING
	1104	RL3.10 Folded Matching - MISSING [20]
	1105	RL3.11 Submatchers - MISSING
	1106
	1107	[17] see UTS#10 "Unicode Collation Algorithm"
	1108	[18] have Unicode::Collate but not integrated to regexes
	1109	[19] have (?<=x) and (?=x), but look-aheads or look-behinds
	1110	should see outside of the target substring
	1111	[20] need insensitive matching for linguistic features other
	1112	than case; for example, hiragana to katakana, wide and
	1113	narrow, simplified Han to traditional Han (see UTR#30
	1114	"Character Foldings")
	1115
	1116	=back
	1117
	1118	=head2 Unicode Encodings
	1119
	1120	Unicode characters are assigned to I<code points>, which are abstract
	1121	numbers. To use these numbers, various encodings are needed.
	1122
	1123	=over 4
	1124
	1125	=item *
	1126
	1127	UTF-8
	1128
	1129	UTF-8 is a variable-length (1 to 4 bytes), byte-order independent
	1130	encoding. For ASCII (and we really do mean 7-bit ASCII, not another
	1131	8-bit encoding), UTF-8 is transparent.
	1132
	1133	The following table is from Unicode 3.2.
	1134
	1135	   Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte
	1136
	1137	     U+0000..U+007F       00..7F
	1138	     U+0080..U+07FF     * C2..DF    80..BF
	1139	     U+0800..U+0FFF       E0      * A0..BF    80..BF
	1140	     U+1000..U+CFFF       E1..EC    80..BF    80..BF
	1141	     U+D000..U+D7FF       ED        80..9F    80..BF
	1142	     U+D800..U+DFFF       +++++ utf16 surrogates, not legal utf8 +++++
	1143	     U+E000..U+FFFF       EE..EF    80..BF    80..BF
	1144	    U+10000..U+3FFFF      F0      * 90..BF    80..BF    80..BF
	1145	    U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
	1146	   U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF
	1147
	1148	Note the gaps marked by "*" before several of the byte entries above. These are
	1149	caused by legal UTF-8 avoiding non-shortest encodings: it is technically
	1150	possible to UTF-8-encode a single code point in different ways, but that is
	1151	explicitly forbidden, and the shortest possible encoding should always be used
	1152	(and that is what Perl does).
	1153
	1154	Another way to look at it is via bits:
	1155
	1156	                Code Points  1st Byte  2nd Byte  3rd Byte  4th Byte
	1157
	1158	                   0aaaaaaa  0aaaaaaa
	1159	           00000bbbbbaaaaaa  110bbbbb  10aaaaaa
	1160	           ccccbbbbbbaaaaaa  1110cccc  10bbbbbb  10aaaaaa
	1161	 00000dddccccccbbbbbbaaaaaa  11110ddd  10cccccc  10bbbbbb  10aaaaaa
	1162
	1163	As you can see, the continuation bytes all begin with "10", and the
	1164	leading bits of the start byte tell how many bytes there are in the
	1165	encoded character.
	1166
	1167	The original UTF-8 specification allowed up to 6 bytes, to allow
	1168	encoding of numbers up to 0x7FFF_FFFF. Perl continues to allow those,
	1169	and has extended that up to 13 bytes to encode code points up to what
	1170	can fit in a 64-bit word. However, Perl will warn if you output any of
	1171	these as being non-portable; and under strict UTF-8 input protocols,
	1172	they are forbidden.
	1173
	1174	The Unicode non-character code points are also disallowed in UTF-8 in
	1175	"open interchange". See L</Non-character code points>.
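
The bit layout shown in the second table can be sketched directly in Perl.
The helper C<utf8_bytes> below is a hypothetical name, not a Perl built-in;
it handles only the one- to four-byte forms for U+0000..U+10FFFF and leaves
it to the caller to reject surrogates. Its output is compared with the core
L<Encode> module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: split a code point into UTF-8 bytes by hand.
sub utf8_bytes {
    my ($cp) = @_;
    return ($cp) if $cp < 0x80;                       # 0aaaaaaa
    return (0xC0 | ($cp >> 6),                        # 110bbbbb
            0x80 | ($cp & 0x3F)) if $cp < 0x800;      # 10aaaaaa
    return (0xE0 | ($cp >> 12),                       # 1110cccc
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F)) if $cp < 0x10000;
    return (0xF0 | ($cp >> 18),                       # 11110ddd
            0x80 | (($cp >> 12) & 0x3F),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F));
}

for my $cp (0x41, 0xE9, 0x20AC, 0x1F600) {
    my $by_hand = join " ", map { sprintf "%02X", $_ } utf8_bytes($cp);
    my $by_lib  = join " ", map { sprintf "%02X", ord }
                            split //, encode('UTF-8', chr $cp);
    printf "U+%04X  %-12s %s\n", $cp, $by_hand,
        $by_hand eq $by_lib ? "agrees with Encode" : "MISMATCH";
}
```

Because each range uses the shortest form that fits, the helper never
produces the non-shortest encodings that the "*" gaps in the first table
rule out.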
	1176
	1177	=item *
	1178
	1179	UTF-EBCDIC
	1180
	1181	Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.
	1182
	1183	=item *
	1184
	1185	UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)
	1186
	1187	The following items are mostly for reference and general Unicode
	1188	knowledge; Perl doesn't use these constructs internally.
	1189
	1190	Like UTF-8, UTF-16 is a variable-width encoding, but where
	1191	UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units.
	1192	All code points occupy either 2 or 4 bytes in UTF-16: code points
	1193	C<U+0000..U+FFFF> are stored in a single 16-bit unit, and code
	1194	points C<U+10000..U+10FFFF> in two 16-bit units. The latter case is
	1195	using I<surrogates>, the first 16-bit unit being the I<high
	1196	surrogate>, and the second being the I<low surrogate>.
	1197
	1198	Surrogates are code points set aside to encode the C<U+10000..U+10FFFF>
	1199	range of Unicode code points in pairs of 16-bit units. The I<high
	1200	surrogates> are the range C<U+D800..U+DBFF> and the I<low surrogates>
	1201	are the range C<U+DC00..U+DFFF>. The surrogate encoding is
	1202
	1203	$hi = int(($uni - 0x10000) / 0x400) + 0xD800;
	1204	$lo = ($uni - 0x10000) % 0x400 + 0xDC00;
	1205
	1206	and the decoding is
	1207
	1208	$uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
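
The two formulas can be wrapped into a round-trip check. In this sketch the
function names C<to_surrogates> and C<from_surrogates> are made up for the
example, and shifts and masks stand in for the integer division and modulus
by 0x400. The result is compared with the core L<Encode> module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: supplementary code point -> (high, low) surrogate.
sub to_surrogates {
    my ($uni) = @_;
    die "not a supplementary code point\n"
        if $uni < 0x10000 || $uni > 0x10FFFF;
    my $offset = $uni - 0x10000;
    return (0xD800 + ($offset >> 10), 0xDC00 + ($offset & 0x3FF));
}

# Hypothetical helper: (high, low) surrogate -> code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
}

my ($hi, $lo)  = to_surrogates(0x1F600);
my $round_trip = from_surrogates($hi, $lo);
printf "U+1F600 -> %04X %04X -> U+%04X\n", $hi, $lo, $round_trip;

# UTF-16BE is just the two 16-bit units written big-endian.
my $matches_encode =
    encode('UTF-16BE', chr 0x1F600) eq pack('n2', $hi, $lo) ? 1 : 0;
print "agrees with Encode: $matches_encode\n";
```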
	1209
	1210	Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16
	1211	itself can be used for in-memory computations, but if storage or
	1212	transfer is required either UTF-16BE (big-endian) or UTF-16LE
	1213	(little-endian) encodings must be chosen.
	1214
	1215	This introduces another problem: what if you just know that your data
	1216	is UTF-16, but you don't know which endianness? Byte Order Marks, or
	1217	BOMs, are a solution to this. A special character has been reserved
	1218	in Unicode to function as a byte order marker: the character with the
	1219	code point C<U+FEFF> is the BOM.
	1220
	1221	The trick is that if you read a BOM, you will know the byte order,
	1222	since if it was written on a big-endian platform, you will read the
	1223	bytes C<0xFE 0xFF>, but if it was written on a little-endian platform,
	1224	you will read the bytes C<0xFF 0xFE>. (And if the originating platform
	1225	was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.)
	1226
	1227	The way this trick works is that the character with the code point
	1228	C<U+FFFE> is not supposed to be in input streams, so the
	1229	sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in
	1230	little-endian format" and cannot be "C<U+FFFE>, represented in big-endian
	1231	format".
	1232
	1233	Surrogates have no meaning in Unicode outside their use in pairs to
	1234	represent other code points. However, Perl allows them to be
	1235	represented individually internally, for example by saying
	1236	C<chr(0xD801)>, so that all code points, not just those valid for open
	1237	interchange, are
	1238	representable. Unicode does define semantics for them; for example, their
	1239	General Category is "Cs". But because their use is somewhat dangerous,
	1240	Perl will warn (using the warning category "surrogate", which is a
	1241	sub-category of "utf8") if an attempt is made
	1242	to do things like take the lower case of one, or match
	1243	case-insensitively, or to output them. (But don't try this on Perls
	1244	before 5.14.)
	1245
	1246	=item *
	1247
	1248	UTF-32, UTF-32BE, UTF-32LE
	1249
	1250	The UTF-32 family is pretty much like the UTF-16 family, except that
	1251	the units are 32-bit, and therefore the surrogate scheme is not
	1252	needed. UTF-32 is a fixed-width encoding. The BOM signatures are
	1253	C<0x00 0x00 0xFE 0xFF> for BE and C<0xFF 0xFE 0x00 0x00> for LE.
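
The BOM signatures given above for UTF-8, UTF-16 and UTF-32 could be checked
along the following lines. C<sniff_bom> is a hypothetical helper written for
this sketch, not part of Perl or L<Encode>, and it inspects only the leading
bytes of a byte string.

```perl
use strict;
use warnings;

# Hypothetical helper: guess an encoding from a leading BOM, if any.
sub sniff_bom {
    my ($bytes) = @_;
    # The 4-byte UTF-32 signatures are tested first, because FF FE 00 00
    # also begins with the 2-byte UTF-16LE signature. (UTF-16LE text
    # whose first character after the BOM is U+0000 is therefore
    # indistinguishable from UTF-32LE by this test alone.)
    return 'UTF-32BE' if $bytes =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-32LE' if $bytes =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return;    # no BOM found
}

my %samples = (
    "utf16le" => "\xFF\xFE\x41\x00",
    "utf16be" => "\xFE\xFF\x00\x41",
    "utf32le" => "\xFF\xFE\x00\x00\x41\x00\x00\x00",
    "utf8"    => "\xEF\xBB\xBFhi",
    "plain"   => "hi",
);
for my $name (sort keys %samples) {
    printf "%-8s %s\n", $name, sniff_bom($samples{$name}) // "(none)";
}
```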
	1254
	1255	=item *
	1256
	1257	UCS-2, UCS-4
	1258
	1259	Legacy, fixed-width encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit
	1260	encoding. Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
	1261	because it does not use surrogates. UCS-4 is a 32-bit encoding,
	1262	functionally identical to UTF-32 (the difference being that
	1263	UCS-4 forbids neither surrogates nor code points larger than 0x10_FFFF).
	1264
	1265	=item *
	1266
	1267	UTF-7
	1268
	1269	A seven-bit safe (non-eight-bit) encoding, which is useful if the
	1270	transport or storage is not eight-bit safe. Defined by RFC 2152.
	1271
	1272	=back
	1273
	1274	=head2 Non-character code points
	1275
	1276	66 code points are set aside in Unicode as "non-character code points".
	1277	These all have the Unassigned (Cn) General Category, and they never will
	1278	be assigned. These are never supposed to be in legal Unicode input
	1279	streams, so that code can use them as sentinels that can be mixed in
	1280	with character data, and they always will be distinguishable from that data.
	1281	To keep them out of Perl input streams, strict UTF-8 should be
	1282	specified, such as by using the layer C<:encoding('UTF-8')>. The
	1283	non-character code points are the 32 between U+FDD0 and U+FDEF, and the
	1284	34 code points U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF.
	1285	Some people are under the mistaken impression that these are "illegal",
	1286	but that is not true. An application or cooperating set of applications
	1287	can legally use them at will internally; but these code points are
	1288	"illegal for open interchange". Therefore, Perl will not accept these
	1289	from input streams unless lax rules are being used, and will warn
	1290	(using the warning category "nonchar", which is a sub-category of "utf8") if
	1291	an attempt is made to output them.
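
The two groups of non-character code points listed above follow a simple
arithmetic pattern, as this sketch shows. C<is_nonchar> is a hypothetical
helper that works on code point numbers only, so no characters are created
or output.

```perl
use strict;
use warnings;

# Hypothetical helper: true for the 66 Unicode non-character code points.
sub is_nonchar {
    my ($cp) = @_;
    return 0 if $cp > 0x10FFFF;                     # beyond Unicode
    return 1 if $cp >= 0xFDD0 && $cp <= 0xFDEF;     # the block of 32
    return ($cp & 0xFFFE) == 0xFFFE ? 1 : 0;        # last two of each plane
}

my $count = 0;
$count += is_nonchar($_) for 0 .. 0x10FFFF;
print "non-character code points: $count\n";
```

The mask test covers U+FFFE and U+FFFF in each of the 17 planes, which
accounts for the 34 code points in the second group.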
	1292
	1293	=head2 Beyond Unicode code points
	1294
	1295	The maximum Unicode code point is U+10FFFF. But Perl accepts code
	1296	points up to the maximum permissible unsigned number available on the
	1297	platform. However, Perl will not accept these from input streams unless
	1298	lax rules are being used, and will warn (using the warning category
	1299	"non_unicode", which is a sub-category of "utf8") if an attempt is made to
	1300	operate on or output them. For example, C<uc(chr(0x11_0000))> will generate
	1301	this warning, returning the input parameter as its result, as the upper
	1302	case of every non-Unicode code point is the code point itself.
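
A minimal sketch of this behavior follows. The C<non_unicode> warning
category is turned off around the call only to keep the output quiet; the
variable names are arbitrary.

```perl
use strict;
use warnings;

my $above = chr(0x11_0000);      # first code point beyond Unicode
my $ord   = ord $above;

my $upper;
{
    no warnings 'non_unicode';   # silence the warning described above
    $upper = uc $above;
}

printf "ord: 0x%X, length: %d, uc unchanged: %s\n",
    $ord, length $above, ($upper eq $above ? "yes" : "no");
```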
	1303
	1304	=head2 Security Implications of Unicode
	1305
	1306	Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.
	1307	Also, note the following:
	1308
	1309	=over 4
	1310
	1311	=item *
	1312
	1313	Malformed UTF-8
	1314
	1315	Unfortunately, the original specification of UTF-8 leaves some room for
	1316	interpretation of how many bytes of encoded output one should generate
	1317	from one input Unicode character. Strictly speaking, the shortest
	1318	possible sequence of UTF-8 bytes should be generated,
	1319	because otherwise there is potential for an input buffer overflow at
	1320	the receiving end of a UTF-8 connection. Perl always generates the
	1321	shortest length UTF-8, and with warnings on, Perl will warn about
	1322	non-shortest length UTF-8 along with other malformations, such as the
	1323	surrogates, which are not Unicode code points valid for interchange.
	1324
	1325	=item *
	1326
	1327	Regular expression pattern matching may surprise you if you're not
	1328	accustomed to Unicode. Starting in Perl 5.14, several pattern
	1329	modifiers are available to control this, called the character set
	1330	modifiers. Details are given in L<perlre/Character set modifiers>.
	1331
	1332	=back
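
To see how a strict decoder treats the malformations mentioned above, the
following sketch feeds a few byte sequences to the C<UTF-8> encoding of the
core L<Encode> module with C<FB_CROAK>. C<strict_decode_ok> is a hypothetical
wrapper written for this example.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical wrapper: 1 if the bytes decode as strict UTF-8, else 0.
sub strict_decode_ok {
    my ($bytes) = @_;    # work on a copy; decoding may modify its input
    my $ok = eval {
        Encode::decode('UTF-8', $bytes, Encode::FB_CROAK());
        1;
    };
    return $ok ? 1 : 0;
}

my $valid     = strict_decode_ok("\xC3\xA9");       # U+00E9, shortest form
my $overlong  = strict_decode_ok("\xC0\xAF");       # non-shortest "/"
my $surrogate = strict_decode_ok("\xED\xA0\x81");   # encodes U+D801

print "valid: $valid, overlong: $overlong, surrogate: $surrogate\n";
```

Rejecting the non-shortest form matters because a filter that looks for a
literal "/" byte would otherwise miss the same character smuggled in as
C<0xC0 0xAF>.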
	1333
	1334	As discussed elsewhere, Perl has one foot (two hooves?) planted in
	1335	each of two worlds: the old world of bytes and the new world of
	1336	characters, upgrading from bytes to characters when necessary.
	1337	If your legacy code does not explicitly use Unicode, no automatic
	1338	switch-over to characters should happen. Characters shouldn't get
	1339	downgraded to bytes, either. It is possible to accidentally mix bytes
	1340	and characters, however (see L<perluniintro>), in which case C<\w> in
	1341	regular expressions might start behaving differently (unless the C</a>
	1342	modifier is in effect). Review your code. Use warnings and the C<strict> pragma.
	1343
	1344	=head2 Unicode in Perl on EBCDIC
	1345
	1346	The way Unicode is handled on EBCDIC platforms is still
	1347	experimental. On such platforms, references to UTF-8 encoding in this
	1348	document and elsewhere should be read as meaning the UTF-EBCDIC
	1349	specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues
	1350	are specifically discussed. There is no C<utfebcdic> pragma or
	1351	":utfebcdic" layer; rather, "utf8" and ":utf8" are reused to mean
	1352	the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic>
	1353	for more discussion of the issues.
	1354
	1355	=head2 Locales
	1356
	1357	See L<perllocale/Unicode and UTF-8>.
	1358
	1359	=head2 When Unicode Does Not Happen
	1360
	1361	While Perl does have extensive ways to input and output in Unicode,
	1362	and a few other "entry points" like the @ARGV array (which can sometimes be
	1363	interpreted as UTF-8), there are still many places where Unicode
	1364	(in some encoding or another) could be given as arguments or received as
	1365	results, or both, but it is not.
	1366
	1367	The following are such interfaces. Also, see L</The "Unicode Bug">.
	1368	For all of these interfaces Perl
	1369	currently (as of v5.16.0) simply assumes byte strings both as arguments
	1370	and results, or UTF-8 strings if the (problematic) C<encoding> pragma has been used.
	1371
	1372	One reason that Perl does not attempt to resolve the role of Unicode in
	1373	these situations is that the answers are highly dependent on the operating
	1374	system and the file system(s). For example, whether filenames can be
	1375	in Unicode and in exactly what kind of encoding, is not exactly a
	1376	portable concept. Similarly for C<qx> and C<system>: how well will the
	1377	"command-line interface" (and which of them?) handle Unicode?
	1378
	1379	=over 4
	1380
	1381	=item *
	1382
	1383	chdir, chmod, chown, chroot, exec, link, lstat, mkdir,
	1384	rename, rmdir, stat, symlink, truncate, unlink, utime, -X
	1385
	1386	=item *
	1387
	1388	%ENV
	1389
	1390	=item *
	1391
	1392	glob (aka the <*>)
	1393
	1394	=item *
	1395
	1396	open, opendir, sysopen
	1397
	1398	=item *
	1399
	1400	qx (aka the backtick operator), system
	1401
	1402	=item *
	1403
	1404	readdir, readlink
	1405
	1406	=back
	1407
	1408	=head2 The "Unicode Bug"
	1409
	1410	The term "Unicode bug" has been applied to an inconsistency
	1411	on ASCII platforms with the
	1412	Unicode code points in the Latin-1 Supplement block, that
	1413	is, between 128 and 255. Without a locale specified, unlike all other
	1414	characters or code points, these characters have very different semantics in
	1415	byte semantics versus character semantics, unless
	1416	C<use feature 'unicode_strings'> is specified, directly or indirectly.
	1417	(It is indirectly specified by a C<use v5.12> or higher.)
	1418
	1419	In character semantics these upper-Latin1 characters are interpreted as
	1420	Unicode code points, which means
	1421	they have the same semantics as Latin-1 (ISO-8859-1).
	1422
	1423	In byte semantics (without C<unicode_strings>), they are considered to
	1424	be unassigned characters, meaning that the only semantics they have is
	1425	their ordinal numbers, and that they are
	1426	not members of various character classes. None are considered to match C<\w>
	1427	for example, but all match C<\W>.
	1428
	1429	Perl 5.12.0 added C<unicode_strings> to force character semantics on
	1430	these code points in some circumstances, which fixed portions of the
	1431	bug; Perl 5.14.0 fixed almost all of it; and Perl 5.16.0 fixed the
	1432	remainder (so far as we know, anyway). The lesson here is to enable
	1433	C<unicode_strings> to avoid the headaches described below.
	1434
	1435	The old, problematic behavior affects these areas:
	1436
	1437	=over 4
	1438
	1439	=item *
	1440
	1441	Changing the case of a scalar, that is, using C<uc()>, C<ucfirst()>, C<lc()>,
	1442	and C<lcfirst()>, or C<\L>, C<\U>, C<\u> and C<\l> in double-quotish
	1443	contexts, such as regular expression substitutions.
	1444	Under C<unicode_strings> starting in Perl 5.12.0, character semantics are
	1445	generally used. See L<perlfunc/lc> for details on how this works
	1446	in combination with various other pragmas.
	1447
	1448	=item *
	1449
	1450	Using caseless (C</i>) regular expression matching.
	1451	Starting in Perl 5.14.0, regular expressions compiled within
	1452	the scope of C<unicode_strings> use character semantics
	1453	even when executed or compiled into larger
	1454	regular expressions outside the scope.
	1455
	1456	=item *
	1457
	1458	Matching any of several properties in regular expressions, namely C<\b>,
	1459	C<\B>, C<\s>, C<\S>, C<\w>, C<\W>, and all the Posix character classes
	1460	I<except> C<[[:ascii:]]>.
	1461	Starting in Perl 5.14.0, regular expressions compiled within
	1462	the scope of C<unicode_strings> use character semantics
	1463	even when executed or compiled into larger
	1464	regular expressions outside the scope.
	1465
	1466	=item *
	1467
	1468	In C<quotemeta> or its inline equivalent C<\Q>, no code points above 127
	1469	are quoted in UTF-8 encoded strings, but in byte encoded strings, code
	1470	points between 128 and 255 are always quoted.
	1471	Starting in Perl 5.16.0, consistent quoting rules are used within the
	1472	scope of C<unicode_strings>, as described in L<perlfunc/quotemeta>.
	1473
	1474	=back
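
A minimal sketch of the case-changing and C<\w> items above, assuming Perl
5.14 or later with no locale in effect. The variable names are made up for
the illustration. It applies C<uc> and C<\w> to the same non-UTF-8 string
inside and outside the scope of the feature:

```perl
use strict;
use warnings;

# "\xE0" is LATIN SMALL LETTER A WITH GRAVE, stored as a single byte.
my $s = "\xE0";
my ($uc_bytes, $uc_chars, $w_bytes, $w_chars);

{
    no feature 'unicode_strings';        # old byte semantics
    $uc_bytes = uc $s;                   # left unchanged
    $w_bytes  = $s =~ /\w/ ? 1 : 0;      # not a word character
}
{
    use feature 'unicode_strings';       # character semantics
    $uc_chars = uc $s;                   # becomes "\xC0"
    $w_chars  = $s =~ /\w/ ? 1 : 0;      # is a word character
}

printf "uc: %02X vs %02X; \\w: %d vs %d\n",
    ord($uc_bytes), ord($uc_chars), $w_bytes, $w_chars;
# prints "uc: E0 vs C0; \w: 0 vs 1"
```

The string itself is identical in both blocks; only the lexical scope in
which the operation is compiled differs.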
	1475
	1476	This behavior can lead to unexpected results: appending a code point above
	1477	255 to a string, or removing one from it, silently switches the string's
	1478	semantics from byte to character or vice versa. As
	1479	an example, consider the following program and its output:
	1480
	1481	    $ perl -le'
	1482	        no feature "unicode_strings";
	1483	        $s1 = "\xC2";
	1484	        $s2 = "\x{2660}";
	1485	        for ($s1, $s2, $s1.$s2) {
	1486	            print /\w/ || 0;
	1487	        }
	1488	    '
	1489	    0
	1490	    0
	1491	    1
	1492
	1493	If there's no C<\w> in C<$s1> or in C<$s2>, why does their concatenation have one?
	1494
	1495	This anomaly stems from Perl's attempt to not disturb older programs that
	1496	didn't use Unicode, and hence had no semantics for characters outside of the
	1497	ASCII range (except in a locale), along with Perl's desire to add Unicode
	1498	support seamlessly. The result wasn't seamless: these characters were
	1499	orphaned.
	1500
	1501	For Perls earlier than those described above, or when a string is passed
	1502	to a function outside the subpragma's scope, a workaround is to always
	1503	call C<utf8::upgrade($string)>,
	1504	or to use the standard module L<Encode>. Also, a scalar that has any characters
	1505	whose ordinal is 0x100 or above, or which were specified using either of the
	1506	C<\N{...}> notations, will automatically have character semantics.
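
As a sketch of the C<utf8::upgrade> workaround, assuming no locale is in
effect (the helper name C<has_word_char> is hypothetical and stands for any
code compiled outside the feature's scope):

```perl
use strict;
use warnings;
no feature 'unicode_strings';    # simulate code outside the feature's scope

# Hypothetical helper: reports whether its argument contains a \w character.
sub has_word_char { return $_[0] =~ /\w/ ? 1 : 0 }

my $s = "\xC2";                  # one byte, UTF8 flag off
my $before = has_word_char($s);  # byte semantics: no match

utf8::upgrade($s);               # same character, now UTF-8 encoded internally
my $after = has_word_char($s);   # character semantics: match

print "$before $after\n";        # prints "0 1"
```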
	1507
	1508	=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)
	1509
	1510	Sometimes (see L</"When Unicode Does Not Happen"> or L</The "Unicode Bug">)
	1511	you simply need to force a byte
	1512	string into UTF-8, or vice versa. The low-level calls
	1513	C<utf8::upgrade($bytestring)> and C<utf8::downgrade($utf8string[, FAIL_OK])> are
	1514	the answers.
	1515
	1516	Note that C<utf8::downgrade()> can fail if the string contains characters
	1517	that don't fit into a byte; it then dies unless C<FAIL_OK> is true.
	1518
	1519	Calling either function on a string that already is in the desired state is a
	1520	no-op.
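
The following sketch exercises both calls; the sample strings and variable
names are illustrative only:

```perl
use strict;
use warnings;

my $s = "caf\xE9";                           # byte string, UTF8 flag off
my $was_flagged = utf8::is_utf8($s) ? 1 : 0;

utf8::upgrade($s);                           # internal representation changes...
my $now_flagged = utf8::is_utf8($s) ? 1 : 0;
my $same_value  = ($s eq "caf\xE9" && length($s) == 4) ? 1 : 0;  # ...the value does not

utf8::downgrade($s);                         # back to one byte per character

my $wide = "\x{263A}";                       # does not fit into a byte
my $ok   = utf8::downgrade($wide, 1) ? 1 : 0;            # FAIL_OK true: returns false
my $died = eval { utf8::downgrade($wide); 1 } ? 0 : 1;   # FAIL_OK omitted: dies

print "$was_flagged $now_flagged $same_value $ok $died\n";   # prints "0 1 1 0 1"
```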
	1521
	1522	=head2 Using Unicode in XS
	1523
	1524	If you want to handle Perl Unicode in XS extensions, you may find the
	1525	following C APIs useful. See also L<perlguts/"Unicode Support"> for an
	1526	explanation about Unicode at the XS level, and L<perlapi> for the API
	1527	details.
	1528
	1529	=over 4
	1530
	1531	=item *
	1532
	1533	C<DO_UTF8(sv)> returns true if the C<UTF8> flag is on and the bytes
	1534	pragma is not in effect. C<SvUTF8(sv)> returns true if the C<UTF8>
	1535	flag is on; the bytes pragma is ignored. The C<UTF8> flag being on
	1536	does B<not> mean that there are any characters of code points greater
	1537	than 255 (or 127) in the scalar or that there are even any characters
	1538	in the scalar. What the C<UTF8> flag means is that the sequence of
	1539	octets in the representation of the scalar is the sequence of UTF-8
	1540	encoded code points of the characters of a string. The C<UTF8> flag
	1541	being off means that each octet in this representation encodes a
	1542	single character with code point 0..255 within the string. Perl's
	1543	Unicode model is not to use UTF-8 until it is absolutely necessary.
	1544
	1545	=item *
	1546
	1547	C<uvchr_to_utf8(buf, chr)> writes a Unicode character code point into
	1548	a buffer encoding the code point as UTF-8, and returns a pointer
	1549	pointing after the UTF-8 bytes. It works appropriately on EBCDIC machines.
	1550
	1551	=item *
	1552
	1553	C<utf8_to_uvchr_buf(buf, bufend, lenp)> reads UTF-8 encoded bytes from a
	1554	buffer and
	1555	returns the Unicode character code point and, optionally, the length of
	1556	the UTF-8 byte sequence. It works appropriately on EBCDIC machines.
	1557
	1558	=item *
	1559
	1560	C<utf8_length(start, end)> returns the length of the UTF-8 encoded buffer
	1561	in characters. C<sv_len_utf8(sv)> returns the length of the UTF-8 encoded
	1562	scalar.
	1563
	1564	=item *
	1565
	1566	C<sv_utf8_upgrade(sv)> converts the string of the scalar to its UTF-8
	1567	encoded form. C<sv_utf8_downgrade(sv)> does the opposite, if
	1568	possible. C<sv_utf8_encode(sv)> is like C<sv_utf8_upgrade()> except that
	1569	it does not set the C<UTF8> flag. C<sv_utf8_decode()> does the
	1570	opposite of C<sv_utf8_encode()>. Note that none of these are to be
	1571	used as general-purpose encoding or decoding interfaces: C<use Encode>
	1572	for that. C<sv_utf8_upgrade()> is affected by the encoding pragma
	1573	but C<sv_utf8_downgrade()> is not (since the encoding pragma is
	1574	designed to be a one-way street).
	1575
	1576	=item *
	1577
	1578	C<is_utf8_string(buf, len)> returns true if C<len> bytes of the buffer
	1579	are valid UTF-8.
	1580
	1581	=item *
	1582
	1583	C<is_utf8_char_buf(buf, buf_end)> returns true if the pointer points to
	1584	a valid UTF-8 character.
	1585
	1586	=item *
	1587
	1588	C<UTF8SKIP(buf)> will return the number of bytes in the UTF-8 encoded
	1589	character in the buffer. C<UNISKIP(chr)> will return the number of bytes
	1590	required to UTF-8-encode the Unicode character code point. C<UTF8SKIP()>
	1591	is useful, for example, for iterating over the characters of a UTF-8
	1592	encoded buffer; C<UNISKIP()> is useful, for example, in computing
	1593	the size required for a UTF-8 encoded buffer.
	1594
	1595	=item *
	1596
	1597	C<utf8_distance(a, b)> will tell the distance in characters between the
	1598	two pointers pointing to the same UTF-8 encoded buffer.
	1599
	1600	=item *
	1601
	1602	C<utf8_hop(s, off)> will return a pointer to a UTF-8 encoded buffer
	1603	that is C<off> (positive or negative) Unicode characters displaced
	1604	from the UTF-8 buffer C<s>. Be careful not to overstep the buffer:
	1605	C<utf8_hop()> will merrily run off the end or the beginning of the
	1606	buffer if told to do so.
	1607
	1608	=item *
	1609
	1610	C<pv_uni_display(dsv, spv, len, pvlim, flags)> and
	1611	C<sv_uni_display(dsv, ssv, pvlim, flags)> are useful for debugging the
	1612	output of Unicode strings and scalars. By default they are useful
	1613	only for debugging--they display B<all> characters as hexadecimal code
	1614	points--but with the flags C<UNI_DISPLAY_ISPRINT>,
	1615	C<UNI_DISPLAY_BACKSLASH>, and C<UNI_DISPLAY_QQ> you can make the
	1616	output more readable.
	1617
	1618	=item *
	1619
	1620	C<foldEQ_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to
	1621	compare two strings case-insensitively in Unicode. For case-sensitive
	1622	comparisons you can just use C<memEQ()> and C<memNE()> as usual, except
	1623	if one string is in UTF-8 and the other isn't.
	1624
	1625	=back
	1626
	1627	For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h>
	1628	in the Perl source code distribution.
	1629
	1630	=head2 Hacking Perl to work on earlier Unicode versions (for very serious hackers only)
	1631
	1632	Perl by default comes with the latest supported Unicode version built in, but
	1633	you can change to use any earlier one.
	1634
	1635	Download the files in the desired version of Unicode from the Unicode web
	1636	site (L<http://www.unicode.org>). These should replace the existing files in
	1637	F<lib/unicore> in the Perl source tree. Follow the instructions in
	1638	F<README.perl> in that directory to change some of their names, and then build
	1639	perl (see L<INSTALL>).
	1640
	1641	=head1 BUGS
	1642
	1643	=head2 Interaction with Locales
	1644
	1645	See L<perllocale/Unicode and UTF-8>.
	1646
	1647	=head2 Problems with characters in the Latin-1 Supplement range
	1648
	1649	See L</The "Unicode Bug">.
	1650
	1651	=head2 Interaction with Extensions
	1652
	1653	When Perl exchanges data with an extension, the extension should be
	1654	able to understand the UTF8 flag and act accordingly. If the
	1655	extension doesn't recognize that flag, it's likely that the extension
	1656	will return incorrectly-flagged data.
	1657
	1658	So if you're working with Unicode data, consult the documentation of
	1659	every module you're using to see whether there are any issues with Unicode data
	1660	exchange. If the documentation does not talk about Unicode at all,
	1661	suspect the worst and probably look at the source to learn how the
	1662	module is implemented. Modules written completely in Perl shouldn't
	1663	cause problems. Modules that directly or indirectly access code written
	1664	in other programming languages are at risk.
	1665
	1666	For affected functions, the simple strategy to avoid data corruption is
	1667	to always make the encoding of the exchanged data explicit. Choose an
	1668	encoding that you know the extension can handle. Convert arguments passed
	1669	to the extensions to that encoding and convert results back from that
	1670	encoding. Write wrapper functions that do the conversions for you, so
	1671	you can later change the functions when the extension catches up.
	1672
	1673	To provide an example, let's say the popular C<Foo::Bar::escape_html>
	1674	function doesn't deal with Unicode data yet. The wrapper function
	1675	would convert the argument to raw UTF-8 and convert the result back to
	1676	Perl's internal representation like so:
	1677
	1678	    sub my_escape_html ($) {
	1679	        my($what) = shift;
	1680	        return unless defined $what;
	1681	        Encode::decode_utf8(Foo::Bar::escape_html(
	1682	            Encode::encode_utf8($what)));
	1683	    }
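
To try the same wrapping technique without the (hypothetical) C<Foo::Bar>
module, here is a self-contained sketch. C<bytes_escape_html> is an invented
stand-in for an extension function that only understands octets; it is not
part of any real module.

```perl
use strict;
use warnings;
use Encode ();

# Invented stand-in for Foo::Bar::escape_html: operates on octets only.
sub bytes_escape_html {
    my ($octets) = @_;
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;');
    $octets =~ s/([&<>])/$entity{$1}/g;
    return $octets;
}

# Wrapper: characters in, characters out, explicit UTF-8 at the boundary.
sub my_escape_html {
    my ($what) = @_;
    return unless defined $what;
    return Encode::decode_utf8(
        bytes_escape_html(Encode::encode_utf8($what)));
}

my $out = my_escape_html("<\x{263A}>");
print length($out), "\n";    # prints 9: "&lt;", one character, "&gt;"
```

Because the conversion lives in one place, only the wrapper needs to change
once the extension learns to handle character strings itself.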
	1684
	1685	Sometimes, when the extension does not convert data but just stores
	1686	and retrieves them, you will be able to use the otherwise
	1687	dangerous C<Encode::_utf8_on()> function. Let's say the popular
	1688	C<Foo::Bar> extension, written in C, provides a C<param> method that
	1689	lets you store and retrieve data according to these prototypes:
	1690
	1691	    $self->param($name, $value);            # set a scalar
	1692	    $value = $self->param($name);           # retrieve a scalar
	1693
	1694	If it does not yet provide support for any encoding, one could write a
	1695	derived class with such a C<param> method:
	1696
	1697	    sub param {
	1698	        my($self,$name,$value) = @_;
	1699	        utf8::upgrade($name);     # make sure it is UTF-8 encoded
	1700	        if (defined $value) {
	1701	            utf8::upgrade($value);  # make sure it is UTF-8 encoded
	1702	            return $self->SUPER::param($name,$value);
	1703	        } else {
	1704	            my $ret = $self->SUPER::param($name);
	1705	            Encode::_utf8_on($ret); # we know, it is UTF-8 encoded
	1706	            return $ret;
	1707	        }
	1708	    }
	1709
	1710	Some extensions provide filters on data entry/exit points, such as
	1711	C<DB_File::filter_store_key> and family. Look out for such filters in
	1712	the documentation of your extensions; they can make the transition to
	1713	Unicode data much easier.
	1714
	1715	=head2 Speed
	1716
	1717	Some functions are slower when working on UTF-8 encoded strings than
	1718	on byte encoded strings. All functions that need to hop over
	1719	characters, such as C<length()>, C<substr()> or C<index()>, or matching regular
	1720	expressions, can work B<much> faster when the underlying data are
	1721	byte-encoded.
	1722
	1723	In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
	1724	a caching scheme was introduced which will hopefully make the slowness
	1725	somewhat less spectacular, at least for some operations. In general,
	1726	operations with UTF-8 encoded strings are still slower. As an example,
	1727	the Unicode properties (character classes) like C<\p{Nd}> are known to
	1728	be quite a bit slower (5-20 times) than their simpler counterparts
	1729	like C<\d> (then again, there are hundreds of Unicode characters matching C<Nd>
	1730	compared with the 10 ASCII characters matching C<\d>).
	1731
	1732	=head2 Problems on EBCDIC platforms
	1733
	1734	There are several known problems with Perl on EBCDIC platforms. If you
	1735	want to use Perl there, send email to perlbug@perl.org.
	1736
	1737	In earlier versions, when byte and character data were concatenated,
	1738	the new string was sometimes created by
	1739	decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
	1740	old Unicode string used EBCDIC.
	1741
	1742	If you find any of these, please report them as bugs.
	1743
	1744	=head2 Porting code from perl-5.6.X
	1745
	1746	Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer
	1747	was required to use the C<utf8> pragma to declare that a given scope
	1748	expected to deal with Unicode data and had to make sure that only
	1749	Unicode data were reaching that scope. If you have code that is
	1750	working with 5.6, you will need some of the following adjustments to
	1751	your code. The examples are written such that the code will continue
	1752	to work under 5.6, so you should be safe to try them out.
	1753
	1754	=over 3
	1755
	1756	=item *
	1757
	1758	A filehandle that should read or write UTF-8
	1759
	1760	    if ($] > 5.008) {
	1761	        binmode $fh, ":encoding(utf8)";
	1762	    }
	1763
	1764	=item *
	1765
	1766	A scalar that is going to be passed to some extension
	1767
	1768	Be it Compress::Zlib, Apache::Request or any extension that has no
	1769	mention of Unicode in the manpage, you need to make sure that the
	1770	UTF8 flag is stripped off. Note that at the time of this writing
	1771	(January 2012) the mentioned modules are not UTF-8-aware. Please
	1772	check the documentation to verify if this is still true.
	1773
	1774	    if ($] > 5.008) {
	1775	        require Encode;
	1776	        $val = Encode::encode_utf8($val); # make octets
	1777	    }
	1778
	1779	=item *
	1780
	1781	A scalar we got back from an extension
	1782
	1783	If you believe the scalar comes back as UTF-8, you will most likely
	1784	want the UTF8 flag restored:
	1785
	1786	    if ($] > 5.008) {
	1787	        require Encode;
	1788	        $val = Encode::decode_utf8($val);
	1789	    }
	1790
	1791	=item *
	1792
	1793	Same thing, if you are really sure it is UTF-8
	1794
	1795	    if ($] > 5.008) {
	1796	        require Encode;
	1797	        Encode::_utf8_on($val);
	1798	    }
	1799
	1800	=item *
	1801
	1802	A wrapper for fetchrow_array and fetchrow_hashref
	1803
	1804	When the database contains only UTF-8, a wrapper function or method is
	1805	a convenient way to replace all your fetchrow_array and
	1806	fetchrow_hashref calls. A wrapper function will also make it easier to
	1807	adapt to future enhancements in your database driver. Note that at the
	1808	time of this writing (January 2012), the DBI has no standardized way
	1809	to deal with UTF-8 data. Please check the documentation to verify if
	1810	that is still true.
	1811
	1812	    sub fetchrow {
	1813	        # $what is one of fetchrow_{array,hashref}
	1814	        my($self, $sth, $what) = @_;
	1815	        if ($] < 5.008) {
	1816	            return $sth->$what;
	1817	        } else {
	1818	            require Encode;
	1819	            if (wantarray) {
	1820	                my @arr = $sth->$what;
	1821	                for (@arr) {
	1822	                    defined && /[^\000-\177]/ && Encode::_utf8_on($_);
	1823	                }
	1824	                return @arr;
	1825	            } else {
	1826	                my $ret = $sth->$what;
	1827	                if (ref $ret) {
	1828	                    for my $k (keys %$ret) {
	1829	                        defined
	1830	                        && /[^\000-\177]/
	1831	                        && Encode::_utf8_on($_) for $ret->{$k};
	1832	                    }
	1833	                    return $ret;
	1834	                } else {
	1835	                    defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
	1836	                    return $ret;
	1837	                }
	1838	            }
	1839	        }
	1840	    }
	1841
	1842
	1843	=item *
	1844
	1845	A large scalar that you know can only contain ASCII
	1846
	1847	Scalars that contain only ASCII and are marked as UTF-8 are sometimes
	1848	a drag to your program. If you recognize such a situation, just remove
	1849	the UTF8 flag:
	1850
	1851	    utf8::downgrade($val) if $] > 5.008;
	1852
	1853	=back
	1854
	1855	=head1 SEE ALSO
	1856
	1857	L<perlunitut>, L<perluniintro>, L<perluniprops>, L<Encode>, L<open>, L<utf8>, L<bytes>,
	1858	L<perlretut>, L<perlvar/"${^UNICODE}">,
	1859	L<http://www.unicode.org/reports/tr44>.
	1860
	1861	=cut