Commit	Line	Data
	1	=head1 NAME
	2
	3	perlunicode - Unicode support in Perl
	4
	5	=head1 DESCRIPTION
	6
	7	If you haven't already, before reading this document, you should become
	8	familiar with both L<perlunitut> and L<perluniintro>.
	9
	10	Unicode aims to B<UNI>-fy the en-B<CODE>-ings of all the world's
	11	character sets into a single Standard. For quite a few of the various
	12	coding standards that existed when Unicode was first created, converting
	13	from each to Unicode essentially meant adding a constant to each code
	14	point in the original standard, and converting back meant just
	15	subtracting that same constant. For ASCII and ISO-8859-1, the constant
	16	is 0. For ISO-8859-5 (Cyrillic), the constant is 864; for Hebrew
	17	(ISO-8859-8), it's 1488; Thai (ISO-8859-11), 3424; and so forth. This
	18	made it easy to do the conversions, and facilitated the adoption of
	19	Unicode.
	20
	21	And it worked; nowadays, those legacy standards are rarely used. Most
	22	everyone uses Unicode.
	23
	24	Unicode is a comprehensive standard. It specifies many things outside
	25	the scope of Perl, such as how to display sequences of characters. For
	26	a full discussion of all aspects of Unicode, see
	27	L<http://www.unicode.org>.
	28
	29	=head2 Important Caveats
	30
	31	Even though some of this section may not be understandable to you on
	32	first reading, we think it's important enough to highlight some of the
	33	gotchas before delving further, so here goes:
	34
	35	Unicode support is an extensive requirement. While Perl does not
	36	implement the Unicode standard or the accompanying technical reports
	37	from cover to cover, Perl does support many Unicode features.
	38
	39	Also, the use of Unicode may present security issues that aren't
	40	obvious; see L</Security Implications of Unicode> below.
	41
	42	=over 4
	43
	44	=item Safest if you C<use feature 'unicode_strings'>
	45
	46	In order to preserve backward compatibility, Perl does not turn
	47	on full internal Unicode support unless the pragma
	48	L<S<C<use feature 'unicode_strings'>>|feature/The 'unicode_strings' feature>
	49	is specified. (This is automatically
	50	selected if you S<C<use 5.012>> or higher.) Failure to do this can
	51	lead to surprising behavior. See L</The "Unicode Bug"> below.
	52
	53	This pragma doesn't affect I/O. Nor does it change the internal
	54	representation of strings, only their interpretation. There are still
	55	several places where Unicode isn't fully supported, such as in
	56	filenames.
	57
	58	=item Input and Output Layers
	59
	60	Use the C<:encoding(...)> layer to read from and write to
	61	filehandles using the specified encoding. (See L<open>.)
	62
	63	=item You must convert your non-ASCII, non-UTF-8 Perl scripts to be
	64	UTF-8.
	65
	66	The L<encoding> module has been deprecated since perl 5.18 and the
	67	perl internals it requires have been removed with perl 5.26.
	68
	69	=item C<use utf8> still needed to enable L<UTF-8|/Unicode Encodings> in scripts
	70
	71	If your Perl script is itself encoded in L<UTF-8|/Unicode Encodings>,
	72	the S<C<use utf8>> pragma must be explicitly included to enable
	73	recognition of that (in string or regular expression literals, or in
	74	identifier names). B<This is the only time when an explicit S<C<use
	75	utf8>> is needed.> (See L<utf8>).
	76
	77	If a Perl script begins with the bytes that form the UTF-8 encoding of
	78	the Unicode BYTE ORDER MARK (C<BOM>, see L</Unicode Encodings>), those
	79	bytes are completely ignored.
	80
	81	=item L<UTF-16|/Unicode Encodings> scripts autodetected
	82
	83	If a Perl script begins with the Unicode C<BOM> (UTF-16LE,
	84	UTF-16BE), or if the script looks like non-C<BOM>-marked
	85	UTF-16 of either endianness, Perl will correctly read in the script as
	86	the appropriate Unicode encoding.
	87
	88	=back
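
As a minimal sketch of the C<:encoding(...)> layer described above, the following writes a string through the layer and reads it back. The in-memory filehandle and the variable names are assumptions made only to keep the example self-contained; a real program would normally open a file on disk.

```perl
use strict;
use warnings;

# Four characters; the last is U+00E9 LATIN SMALL LETTER E WITH ACUTE.
my $text = "caf\x{E9}";

# Write through an encoding layer to an in-memory "file" (an
# assumption of this sketch, so that no real file is needed).
my $bytes = '';
open(my $out, '>:encoding(UTF-8)', \$bytes) or die "open failed: $!";
print {$out} $text;
close($out) or die "close failed: $!";

# Read the same bytes back through the layer to recover characters.
open(my $in, '<:encoding(UTF-8)', \$bytes) or die "open failed: $!";
my $round_trip = <$in>;
close($in);

my $byte_count = length $bytes;        # bytes stored in the "file"
my $char_count = length $round_trip;   # characters after decoding
print "bytes: $byte_count, characters: $char_count\n";
```

The encoded form is one byte longer than the character count because U+00E9 takes two bytes in UTF-8.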
	89
	90	=head2 Byte and Character Semantics
	91
	92	Before Unicode, most encodings used 8 bits (a single byte) to encode
	93	each character. Thus a character was a byte, and a byte was a
	94	character, and there could be only 256 or fewer possible characters.
	95	"Byte Semantics" in the title of this section refers to
	96	this behavior. There was no need to distinguish between "Byte" and
	97	"Character".
	98
	99	Then along comes Unicode which has room for over a million characters
	100	(and Perl allows for even more). This means that a character may
	101	require more than a single byte to represent it, and so the two terms
	102	are no longer equivalent. What matter are the characters as whole
	103	entities, and not usually the bytes that comprise them. That's what the
	104	term "Character Semantics" in the title of this section refers to.
	105
	106	Perl had to change internally to decouple "bytes" from "characters".
	107	It is important that you too change your ideas, if you haven't already,
	108	so that "byte" and "character" no longer mean the same thing in your
	109	mind.
	110
	111	The basic building block of Perl strings has always been a "character".
	112	The change essentially comes down to this: the implementation no longer
	113	assumes that a character is always just a single byte.
	114
	115	There are various things to note:
	116
	117	=over 4
	118
	119	=item *
	120
	121	String handling functions, for the most part, continue to operate in
	122	terms of characters. C<length()>, for example, returns the number of
	123	characters in a string, just as before. But that number no longer is
	124	necessarily the same as the number of bytes in the string (there may be
	125	more bytes than characters). The other such functions include
	126	C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
	127	C<sort()>, C<sprintf()>, and C<write()>.
	128
	129	The exceptions are:
	130
	131	=over 4
	132
	133	=item *
	134
	135	the bit-oriented C<vec>
	136
	137	E<nbsp>
	138
	139	=item *
	140
	141	the byte-oriented C<pack>/C<unpack> C<"C"> format
	142
	143	However, the C<W> specifier does operate on whole characters, as does the
	144	C<U> specifier.
	145
	146	=item *
	147
	148	some operators that interact with the platform's operating system
	149
	150	Operators dealing with filenames are examples.
	151
	152	=item *
	153
	154	when the functions are called from within the scope of the
	155	S<C<L<use bytes|bytes>>> pragma
	156
	157	Likely, you should use this only for debugging anyway.
	158
	159	=back
	160
	161	=item *
	162
	163	Strings--including hash keys--and regular expression patterns may
	164	contain characters that have ordinal values larger than 255.
	165
	166	If you use a Unicode editor to edit your program, Unicode characters may
	167	occur directly within the literal strings in UTF-8 encoding, or UTF-16.
	168	(The former requires a C<use utf8>, the latter may require a C<BOM>.)
	169
	170	L<perluniintro/Creating Unicode> gives other ways to place non-ASCII
	171	characters in your strings.
	172
	173	=item *
	174
	175	The C<chr()> and C<ord()> functions work on whole characters.
	176
	177	=item *
	178
	179	Regular expressions match whole characters. For example, C<"."> matches
	180	a whole character instead of only a single byte.
	181
	182	=item *
	183
	184	The C<tr///> operator translates whole characters. (Note that the
	185	C<tr///CU> functionality has been removed. For similar functionality to
	186	that, see C<pack('U0', ...)> and C<pack('C0', ...)>).
	187
	188	=item *
	189
	190	C<scalar reverse()> reverses by character rather than by byte.
	191
	192	=item *
	193
	194	The bit string operators, C<& | ^ ~> and (starting in v5.22)
	195	C<&. |. ^. ~.> can operate on bit strings encoded in UTF-8, but this
	196	can give unexpected results if any of the strings contain code points
	197	above 0xFF. Starting in v5.28, it is a fatal error to have such an
	198	operand. Otherwise, the operation is performed on a non-UTF-8 copy of
	199	the operand. If you're not sure about the encoding of a string,
	200	downgrade it before using any of these operators; you can use
	201	L<C<utf8::downgrade()>|utf8/Utility functions>.
	202
	203	=back
	204
	205	The bottom line is that Perl has always practiced "Character Semantics",
	206	but with the advent of Unicode, that is now different than "Byte
	207	Semantics".
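
The points in the list above can be illustrated with a short sketch. The sample string and variable names are hypothetical; C<Encode::encode> is used here only to count the bytes of one particular serialization (UTF-8).

```perl
use strict;
use warnings;
use Encode qw(encode);

# Three characters: U+263A WHITE SMILING FACE, then "a" and "b".
my $str = "\x{263A}ab";

my $chars      = length $str;                    # counts characters
my $utf8_bytes = length encode('UTF-8', $str);   # counts encoded bytes
my $reversed   = scalar reverse $str;            # reverses by character
my $first_ord  = ord substr($str, 0, 1);         # ordinal of first character
my ($dot)      = $str =~ /^(.)/;                 # "." matches a whole character

printf "%d characters, %d bytes in UTF-8\n", $chars, $utf8_bytes;
printf "first character is U+%04X\n", $first_ord;
```

The character count stays at three even though the UTF-8 form needs five bytes, which is the distinction the section title draws.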
	208
	209	=head2 ASCII Rules versus Unicode Rules
	210
	211	Before Unicode, when a character was a byte was a character,
	212	Perl knew only about the 128 characters defined by ASCII, code points 0
	213	through 127 (except for under L<S<C<use locale>>|perllocale>). That
	214	left the code
	215	points 128 to 255 as unassigned, and available for whatever use a
	216	program might want. The only semantics they have is their ordinal
	217	numbers, and that they are members of none of the non-negative character
	218	classes. None are considered to match C<\w> for example, but all match
	219	C<\W>.
	220
	221	Unicode, of course, assigns each of those code points a particular
	222	meaning (along with ones above 255). To preserve backward
	223	compatibility, Perl only uses the Unicode meanings when there is some
	224	indication that Unicode is what is intended; otherwise the non-ASCII
	225	code points remain treated as if they are unassigned.
	226
	227	Here are the ways that Perl knows that a string should be treated as
	228	Unicode:
	229
	230	=over
	231
	232	=item *
	233
	234	Within the scope of S<C<use utf8>>
	235
	236	If the whole program is Unicode (signified by using 8-bit B<U>nicode
	237	B<T>ransformation B<F>ormat), then all literal strings within it must be
	238	Unicode.
	239
	240	=item *
	241
	242	Within the scope of
	243	L<S<C<use feature 'unicode_strings'>>|feature/The 'unicode_strings' feature>
	244
	245	This pragma was created so you can explicitly tell Perl that operations
	246	executed within its scope are to use Unicode rules. More operations are
	247	affected with newer perls. See L</The "Unicode Bug">.
	248
	249	=item *
	250
	251	Within the scope of S<C<use 5.012>> or higher
	252
	253	This implicitly turns on S<C<use feature 'unicode_strings'>>.
	254
	255	=item *
	256
	257	Within the scope of
	258	L<S<C<use locale 'not_characters'>>|perllocale/Unicode and UTF-8>,
	259	or L<S<C<use locale>>|perllocale> and the current
	260	locale is a UTF-8 locale.
	261
	262	The former is defined to imply Unicode handling; and the latter
	263	indicates a Unicode locale, hence a Unicode interpretation of all
	264	strings within it.
	265
	266	=item *
	267
	268	When the string contains a Unicode-only code point
	269
	270	Perl has never accepted code points above 255 without them being
	271	Unicode, so their use implies Unicode for the whole string.
	272
	273	=item *
	274
	275	When the string contains a Unicode named code point C<\N{...}>
	276
	277	The C<\N{...}> construct explicitly refers to a Unicode code point,
	278	even if it is one that is also in ASCII. Therefore the string
	279	containing it must be Unicode.
	280
	281	=item *
	282
	283	When the string has come from an external source marked as
	284	Unicode
	285
	286	The L<C<-C>|perlrun/-C [numberE<sol>list]> command line option can
	287	specify that certain inputs to the program are Unicode, and the values
	288	of this can be read by your Perl code, see L<perlvar/"${^UNICODE}">.
	289
	290	=item * When the string has been upgraded to UTF-8
	291
	292	The function L<C<utf8::upgrade()>|utf8/Utility functions>
	293	can be explicitly used to permanently (unless a subsequent
	294	C<utf8::downgrade()> is called) cause a string to be treated as
	295	Unicode.
	296
	297	=item * There are additional methods for regular expression patterns
	298
	299	A pattern that is compiled with the C<< /u >> or C<< /a >> modifiers is
	300	treated as Unicode (though there are some restrictions with C<< /a >>).
	301	Under the C<< /d >> and C<< /l >> modifiers, there are several other
	302	indications for Unicode; see L<perlre/Character set modifiers>.
	303
	304	=back
	305
	306	Note that all of the above are overridden within the scope of
	307	C<L<use bytes|bytes>>; but you should be using this pragma only for
	308	debugging.
	309
	310	Note also that some interactions with the platform's operating system
	311	never use Unicode rules.
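
A minimal sketch of three of the indications listed above, using the code point 233 (LATIN SMALL LETTER E WITH ACUTE in Unicode). The variable names are hypothetical, and the C<no feature> lines are there only to make the default explicit.

```perl
use strict;
use warnings;

my $e_acute = "\xE9";    # an 8-bit string holding code point 233

my ($ascii_rules, $unicode_rules, $upgraded);
{
    # No indication of Unicode: 233 is treated as unassigned.
    no feature 'unicode_strings';
    $ascii_rules = $e_acute =~ /\w/ ? 1 : 0;
}
{
    # The feature is in scope: Unicode rules apply.
    use feature 'unicode_strings';
    $unicode_rules = $e_acute =~ /\w/ ? 1 : 0;
}
{
    # The string has been upgraded to UTF-8: Unicode rules apply.
    no feature 'unicode_strings';
    my $copy = $e_acute;
    utf8::upgrade($copy);
    $upgraded = $copy =~ /\w/ ? 1 : 0;
}

print "ASCII rules: $ascii_rules; feature: $unicode_rules; upgraded: $upgraded\n";
```

The same code point matches C<\w> or not depending only on context, which is the behavior that L</The "Unicode Bug"> discusses.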
	312
	313	When Unicode rules are in effect:
	314
	315	=over 4
	316
	317	=item *
	318
	319	Case translation operators use the Unicode case translation tables.
	320
	321	Note that C<uc()>, or C<\U> in interpolated strings, translates to
	322	uppercase, while C<ucfirst>, or C<\u> in interpolated strings,
	323	translates to titlecase in languages that make the distinction (which is
	324	equivalent to uppercase in languages without the distinction).
	325
	326	There is a CPAN module, C<L<Unicode::Casing>>, which allows you to
	327	define your own mappings to be used in C<lc()>, C<lcfirst()>, C<uc()>,
	328	C<ucfirst()>, and C<fc> (or their double-quoted string inlined versions
	329	such as C<\U>). (Prior to Perl 5.16, this functionality was partially
	330	provided in the Perl core, but suffered from a number of insurmountable
	331	drawbacks, so the CPAN module was written instead.)
	332
	333	=item *
	334
	335	Character classes in regular expressions match based on the character
	336	properties specified in the Unicode properties database.
	337
	338	C<\w> can be used to match a Japanese ideograph, for instance; and
	339	C<[[:digit:]]> a Bengali number.
	340
	341	=item *
	342
	343	Named Unicode properties, scripts, and block ranges may be used (like
	344	bracketed character classes) by using the C<\p{}> "matches property"
	345	construct and the C<\P{}> negation, "doesn't match property".
	346
	347	See L</"Unicode Character Properties"> for more details.
	348
	349	You can define your own character properties and use them
	350	in the regular expression with the C<\p{}> or C<\P{}> construct.
	351	See L</"User-Defined Character Properties"> for more details.
	352
	353	=back
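
The following sketch shows the case translation and character class behavior described above. The sample characters were chosen for illustration and the variable names are hypothetical.

```perl
use strict;
use warnings;
use feature 'unicode_strings';

# U+01C6 LATIN SMALL LETTER DZ WITH CARON has distinct uppercase
# (U+01C4) and titlecase (U+01C5) mappings.
my $dz    = "\x{1C6}";
my $upper = uc $dz;         # uppercase mapping
my $title = ucfirst $dz;    # titlecase mapping

# U+65E5 is a CJK ideograph; U+09E7 is BENGALI DIGIT ONE.
my $ideograph_is_word = "\x{65E5}" =~ /\w/          ? 1 : 0;
my $bengali_is_digit  = "\x{09E7}" =~ /[[:digit:]]/ ? 1 : 0;

printf "uc: U+%04X, ucfirst: U+%04X\n", ord $upper, ord $title;
print  "ideograph matches \\w: $ideograph_is_word\n";
print  "Bengali digit matches [[:digit:]]: $bengali_is_digit\n";
```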
	354
	355	=head2 Extended Grapheme Clusters (Logical characters)
	356
	357	Consider a character, say C<H>. It could appear with various marks around it,
	358	such as an acute accent, or a circumflex, or various hooks, circles, arrows,
	359	I<etc.>, above, below, to one side or the other, I<etc>. There are many
	360	possibilities among the world's languages. The number of combinations is
	361	astronomical, and if there were a character for each combination, it would
	362	soon exhaust Unicode's more than a million possible characters. So Unicode
	363	took a different approach: there is a character for the base C<H>, and a
	364	character for each of the possible marks, and these can be variously combined
	365	to get a final logical character. So a logical character--what appears to be a
	366	single character--can be a sequence of more than one individual character.
	367	The Unicode standard calls these "extended grapheme clusters" (which
	368	is an improved version of the no-longer much used "grapheme cluster");
	369	Perl furnishes the C<\X> regular expression construct to match such
	370	sequences in their entirety.
	371
	372	But Unicode's intent is to unify the existing character set standards and
	373	practices, and several pre-existing standards have single characters that
	374	mean the same thing as some of these combinations, like ISO-8859-1,
	375	which has quite a few of them. For example, C<"LATIN CAPITAL LETTER E
	376	WITH ACUTE"> was already in this standard when Unicode came along.
	377	Unicode therefore added it to its repertoire as that single character.
	378	But this character is considered by Unicode to be equivalent to the
	379	sequence consisting of the character C<"LATIN CAPITAL LETTER E">
	380	followed by the character C<"COMBINING ACUTE ACCENT">.
	381
	382	C<"LATIN CAPITAL LETTER E WITH ACUTE"> is called a "pre-composed"
	383	character, and its equivalence with the "E" and the "COMBINING ACCENT"
	384	sequence is called canonical equivalence. All pre-composed characters
	385	are said to have a decomposition (into the equivalent sequence), and the
	386	decomposition type is also called canonical. A string may be comprised
	387	as much as possible of precomposed characters, or it may be comprised of
	388	entirely decomposed characters. Unicode calls these respectively,
	389	"Normalization Form Composed" (NFC) and "Normalization Form Decomposed" (NFD).
	390	The C<L<Unicode::Normalize>> module contains functions that convert
	391	between the two. A string may also have both composed characters and
	392	decomposed characters; this module can be used to make it all one or the
	393	other.
	394
	395	You may be presented with strings in any of these equivalent forms.
	396	There is currently nothing in Perl 5 that ignores the differences. So
	397	you'll have to specially handle it. The usual advice is to convert your
	398	inputs to C<NFD> before processing further.
	399
	400	For more detailed information, see L<http://unicode.org/reports/tr15/>.
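
As a sketch of the above, the following compares the pre-composed and decomposed forms of the same logical character, using the C<NFC> and C<NFD> functions exported by C<Unicode::Normalize>. The variable names are hypothetical.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $precomposed = "\x{C9}";      # LATIN CAPITAL LETTER E WITH ACUTE
my $decomposed  = "E\x{301}";    # E followed by COMBINING ACUTE ACCENT

# Perl does not ignore the difference between equivalent forms.
my $equal_raw = $precomposed eq $decomposed ? 1 : 0;

# After normalizing both to the same form they compare equal.
my $equal_nfd = NFD($precomposed) eq NFD($decomposed) ? 1 : 0;
my $equal_nfc = NFC($precomposed) eq NFC($decomposed) ? 1 : 0;

# \X matches the whole extended grapheme cluster, although the
# decomposed string holds two code points.
my $code_points = length $decomposed;
my @clusters    = $decomposed =~ /(\X)/g;

print "raw: $equal_raw, NFD: $equal_nfd, NFC: $equal_nfc\n";
print "code points: $code_points, clusters: ", scalar(@clusters), "\n";
```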
	401
	402	=head2 Unicode Character Properties
	403
	404	(The only time that Perl considers a sequence of individual code
	405	points as a single logical character is in the C<\X> construct, already
	406	mentioned above. Therefore "character" in this discussion means a single
	407	Unicode code point.)
	408
	409	Very nearly all Unicode character properties are accessible through
	410	regular expressions by using the C<\p{}> "matches property" construct
	411	and the C<\P{}> "doesn't match property" for its negation.
	412
	413	For instance, C<\p{Uppercase}> matches any single character with the Unicode
	414	C<"Uppercase"> property, while C<\p{L}> matches any character with a
	415	C<General_Category> of C<"L"> (letter) property (see
	416	L</General_Category> below). Brackets are not
	417	required for single letter property names, so C<\p{L}> is equivalent to C<\pL>.
	418
	419	More formally, C<\p{Uppercase}> matches any single character whose Unicode
	420	C<Uppercase> property value is C<True>, and C<\P{Uppercase}> matches any character
	421	whose C<Uppercase> property value is C<False>, and they could have been written as
	422	C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively.
	423
	424	This formality is needed when properties are not binary; that is, if they can
	425	take on more values than just C<True> and C<False>. For example, the
	426	C<Bidi_Class> property (see L</"Bidirectional Character Types"> below),
	427	can take on several different
	428	values, such as C<Left>, C<Right>, C<Whitespace>, and others. To match these, one needs
	429	to specify both the property name (C<Bidi_Class>), AND the value being
	430	matched against
	431	(C<Left>, C<Right>, I<etc.>). This is done, as in the examples above, by having the
	432	two components separated by an equal sign (or interchangeably, a colon), like
	433	C<\p{Bidi_Class: Left}>.
	434
	435	All Unicode-defined character properties may be written in these compound forms
	436	of C<\p{I<property>=I<value>}> or C<\p{I<property>:I<value>}>, but Perl provides some
	437	additional properties that are written only in the single form, as well as
	438	single-form short-cuts for all binary properties and certain others described
	439	below, in which you may omit the property name and the equals or colon
	440	separator.
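
A short sketch of the single and compound forms follows. The test characters and variable names are hypothetical; the C<Bidi_Class> checks use the short value names C<L> and C<R>.

```perl
use strict;
use warnings;

# Single form and the equivalent compound form for a binary property.
my $single_true    = "A" =~ /\p{Uppercase}/       ? 1 : 0;
my $compound_true  = "A" =~ /\p{Uppercase=True}/  ? 1 : 0;
my $single_false   = "a" =~ /\P{Uppercase}/       ? 1 : 0;
my $compound_false = "a" =~ /\p{Uppercase=False}/ ? 1 : 0;

# Single-letter names need no braces.
my $letter_braced = "A" =~ /\p{L}/ ? 1 : 0;
my $letter_bare   = "A" =~ /\pL/   ? 1 : 0;

# A non-binary property needs both the name and the value; "=" and ":"
# are interchangeable separators.
my $latin_is_left   = "A"        =~ /\p{Bidi_Class: L}/ ? 1 : 0;
my $hebrew_is_right = "\x{5D0}"  =~ /\p{Bidi_Class=R}/  ? 1 : 0;   # HEBREW LETTER ALEF

print "Uppercase: $single_true $compound_true $single_false $compound_false\n";
print "Letter: $letter_braced $letter_bare\n";
print "Bidi_Class: $latin_is_left $hebrew_is_right\n";
```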
	441
	442	Most Unicode character properties have at least two synonyms (or aliases if you
	443	prefer): a short one that is easier to type and a longer one that is more
	444	descriptive and hence easier to understand. Thus the C<"L"> and
	445	C<"Letter"> properties above are equivalent and can be used
	446	interchangeably. Likewise, C<"Upper"> is a synonym for C<"Uppercase">,
	447	and we could have written C<\p{Uppercase}> equivalently as C<\p{Upper}>.
	448	Also, there are typically various synonyms for the values the property
	449	can be. For binary properties, C<"True"> has 3 synonyms: C<"T">,
	450	C<"Yes">, and C<"Y">; and C<"False"> has correspondingly C<"F">,
	451	C<"No">, and C<"N">. But be careful. A short form of a value for one
	452	property may not mean the same thing as the same short form for another.
	453	Thus, for the C<L</General_Category>> property, C<"L"> means
	454	C<"Letter">, but for the L<C<Bidi_Class>|/Bidirectional Character Types>
	455	property, C<"L"> means C<"Left">. A complete list of properties and
	456	synonyms is in L<perluniprops>.
	457
	458	Upper/lower case differences in property names and values are irrelevant;
	459	thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>.
	460	Similarly, you can add or subtract underscores anywhere in the middle of a
	461	word, so that these are also equivalent to C<\p{U_p_p_e_r}>. And white space
	462	is irrelevant adjacent to non-word characters, such as the braces and the equals
	463	or colon separators, so C<\p{ Upper }> and C<\p{ Upper_case : Y }> are
	464	equivalent to these as well. In fact, white space and even
	465	hyphens can usually be added or deleted anywhere. So even C<\p{ Up-per case = Yes}> is
	466	equivalent. All this is called "loose-matching" by Unicode. The few places
	467	where stricter matching is used are in the middle of numbers, and in the Perl
	468	extension properties that begin or end with an underscore. Stricter matching
	469	cares about white space (except adjacent to non-word characters),
	470	hyphens, and non-interior underscores.
	471
	472	You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
	473	(C<^>) between the first brace and the property name: C<\p{^Tamil}> is
	474	equal to C<\P{Tamil}>.
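
The loose-matching and negation rules can be checked with a sketch like the following, which uses several of the spellings mentioned above. The list of spellings and the variable names are chosen for illustration only.

```perl
use strict;
use warnings;

# Several spellings of the same property, taken from the text above.
my @spellings = (
    qr/\p{Upper}/,
    qr/\p{upper}/,
    qr/\p{UpPeR}/,
    qr/\p{U_p_p_e_r}/,
    qr/\p{ Upper }/,
    qr/\p{ Upper_case : Y }/,
);

# Count how many spellings match an uppercase and a lowercase letter.
my $match_upper = grep { "A" =~ $_ } @spellings;
my $match_lower = grep { "a" =~ $_ } @spellings;

# A caret after the opening brace negates, so \p{^Tamil} acts like \P{Tamil}.
my $tamil_a         = "\x{B85}";    # TAMIL LETTER A
my $caret_on_latin  = "A"      =~ /\p{^Tamil}/ ? 1 : 0;
my $caret_on_tamil  = $tamil_a =~ /\p{^Tamil}/ ? 1 : 0;
my $big_p_on_latin  = "A"      =~ /\P{Tamil}/  ? 1 : 0;
my $big_p_on_tamil  = $tamil_a =~ /\P{Tamil}/  ? 1 : 0;

print "spellings matching 'A': $match_upper of ", scalar(@spellings), "\n";
print "spellings matching 'a': $match_lower\n";
```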
	475
	476	Almost all properties are immune to case-insensitive matching. That is,
	477	adding a C</i> regular expression modifier does not change what they
	478	match. There are two sets that are affected.
	479	The first set is
	480	C<Uppercase_Letter>,
	481	C<Lowercase_Letter>,
	482	and C<Titlecase_Letter>,
	483	all of which match C<Cased_Letter> under C</i> matching.
	484	And the second set is
	485	C<Uppercase>,
	486	C<Lowercase>,
	487	and C<Titlecase>,
	488	all of which match C<Cased> under C</i> matching.
	489	This set also includes its subsets C<PosixUpper> and C<PosixLower> both
	490	of which under C</i> match C<PosixAlpha>.
	491	(The difference between these sets is that some things, such as Roman
	492	numerals, come in both upper and lower case so they are C<Cased>, but
	493	aren't considered letters, so they aren't C<Cased_Letter>'s.)
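
A sketch of the two affected sets under C</i>, using a lowercase letter and a lowercase Roman numeral. The sample characters and variable names are assumptions of this example.

```perl
use strict;
use warnings;

my $small_a     = "a";
my $small_roman = "\x{2170}";    # SMALL ROMAN NUMERAL ONE: cased, but not a letter

# Without /i, a lowercase letter is not an Uppercase_Letter.
my $lu_plain = $small_a =~ /\p{Lu}/  ? 1 : 0;

# With /i, Uppercase_Letter behaves like Cased_Letter.
my $lu_fold  = $small_a =~ /\p{Lu}/i ? 1 : 0;

# The Roman numeral is Cased but not a Cased_Letter, so only the
# Uppercase property matches it under /i.
my $roman_uppercase_fold = $small_roman =~ /\p{Uppercase}/i ? 1 : 0;
my $roman_lu_fold        = $small_roman =~ /\p{Lu}/i        ? 1 : 0;

print "a: Lu=$lu_plain, Lu/i=$lu_fold\n";
print "U+2170: Uppercase/i=$roman_uppercase_fold, Lu/i=$roman_lu_fold\n";
```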
	494
	495	See L</Beyond Unicode code points> for special considerations when
	496	matching Unicode properties against non-Unicode code points.
	497
	498	=head3 B<General_Category>
	499
	500	Every Unicode character is assigned a general category, which is the "most
	501	usual categorization of a character" (from
	502	L<http://www.unicode.org/reports/tr44>).
	503
	504	The compound way of writing these is like C<\p{General_Category=Number}>
	505	(short: C<\p{gc:n}>). But Perl furnishes shortcuts in which everything up
	506	through the equal or colon separator is omitted. So you can instead just write
	507	C<\pN>.
	508
	509	Here are the short and long forms of the values the C<General_Category> property
	510	can have:
	511
	512	    Short       Long
	513
	514	    L           Letter
	515	    LC, L&      Cased_Letter (that is: [\p{Ll}\p{Lu}\p{Lt}])
	516	    Lu          Uppercase_Letter
	517	    Ll          Lowercase_Letter
	518	    Lt          Titlecase_Letter
	519	    Lm          Modifier_Letter
	520	    Lo          Other_Letter
	521
	522	    M           Mark
	523	    Mn          Nonspacing_Mark
	524	    Mc          Spacing_Mark
	525	    Me          Enclosing_Mark
	526
	527	    N           Number
	528	    Nd          Decimal_Number (also Digit)
	529	    Nl          Letter_Number
	530	    No          Other_Number
	531
	532	    P           Punctuation (also Punct)
	533	    Pc          Connector_Punctuation
	534	    Pd          Dash_Punctuation
	535	    Ps          Open_Punctuation
	536	    Pe          Close_Punctuation
	537	    Pi          Initial_Punctuation
	538	                (may behave like Ps or Pe depending on usage)
	539	    Pf          Final_Punctuation
	540	                (may behave like Ps or Pe depending on usage)
	541	    Po          Other_Punctuation
	542
	543	    S           Symbol
	544	    Sm          Math_Symbol
	545	    Sc          Currency_Symbol
	546	    Sk          Modifier_Symbol
	547	    So          Other_Symbol
	548
	549	    Z           Separator
	550	    Zs          Space_Separator
	551	    Zl          Line_Separator
	552	    Zp          Paragraph_Separator
	553
	554	    C           Other
	555	    Cc          Control (also Cntrl)
	556	    Cf          Format
	557	    Cs          Surrogate
	558	    Co          Private_Use
	559	    Cn          Unassigned
	560
	561	Single-letter properties match all characters in any of the
	562	two-letter sub-properties starting with the same letter.
	563	C<LC> and C<L&> are special: both are aliases for the set consisting of everything matched by C<Ll>, C<Lu>, and C<Lt>.
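
A brief sketch of how the single-letter and two-letter values relate, using three kinds of number and a titlecase letter. The sample characters and variable names are hypothetical.

```perl
use strict;
use warnings;

my $digit   = "7";           # Nd, Decimal_Number
my $roman   = "\x{2167}";    # ROMAN NUMERAL EIGHT: Nl, Letter_Number
my $half    = "\x{BD}";      # VULGAR FRACTION ONE HALF: No, Other_Number
my $title_d = "\x{1C5}";     # a Titlecase_Letter (Lt)

# The single-letter form matches every two-letter sub-category.
my $all_numbers = grep { /\pN/ } $digit, $roman, $half;

# The two-letter forms are narrower.
my $decimal_only = grep { /\p{Nd}/ } $digit, $roman, $half;
my $roman_is_nl  = $roman =~ /\p{Nl}/ ? 1 : 0;
my $half_is_no   = $half  =~ /\p{No}/ ? 1 : 0;

# Compound, short compound, and shortcut spellings are equivalent.
my $compound = $digit =~ /\p{General_Category=Number}/ ? 1 : 0;
my $short    = $digit =~ /\p{gc:n}/                    ? 1 : 0;

# LC and L& both cover Ll, Lu, and Lt.
my $lc_alias  = $title_d =~ /\p{LC}/ ? 1 : 0;
my $amp_alias = $title_d =~ /\p{L&}/ ? 1 : 0;

print "N matches $all_numbers of 3; Nd matches $decimal_only of 3\n";
```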
	564
	565	=head3 B<Bidirectional Character Types>
	566
	567	Because scripts differ in their directionality (Hebrew and Arabic are
	568	written right to left, for example) Unicode supplies a C<Bidi_Class> property.
	569	Some of the values this property can have are:
	570
	571	    Value       Meaning
	572
	573	    L           Left-to-Right
	574	    LRE         Left-to-Right Embedding
	575	    LRO         Left-to-Right Override
	576	    R           Right-to-Left
	577	    AL          Arabic Letter
	578	    RLE         Right-to-Left Embedding
	579	    RLO         Right-to-Left Override
	580	    PDF         Pop Directional Format
	581	    EN          European Number
	582	    ES          European Separator
	583	    ET          European Terminator
	584	    AN          Arabic Number
	585	    CS          Common Separator
	586	    NSM         Non-Spacing Mark
	587	    BN          Boundary Neutral
	588	    B           Paragraph Separator
	589	    S           Segment Separator
	590	    WS          Whitespace
	591	    ON          Other Neutrals
	592
	593	This property is always written in the compound form.
	594	For example, C<\p{Bidi_Class:R}> matches characters that are normally
	595	written right to left. Unlike the
	596	C<L</General_Category>> property, this
	597	property can have more values added in a future Unicode release. Those
	598	listed above comprised the complete set for many Unicode releases, but
	599	others were added in Unicode 6.3; you can always find what the
	600	current ones are in L<perluniprops>. And
	601	L<http://www.unicode.org/reports/tr9/> describes how to use them.
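
For illustration, here is a sketch that tests a few characters against values from the table above. The choice of characters and the variable names are assumptions of this example.

```perl
use strict;
use warnings;

my $latin_a     = "A";
my $hebrew_alef = "\x{5D0}";    # HEBREW LETTER ALEF
my $arabic_alef = "\x{627}";    # ARABIC LETTER ALEF
my $digit_one   = "1";

my $a_is_l      = $latin_a     =~ /\p{Bidi_Class:L}/  ? 1 : 0;
my $a_is_r      = $latin_a     =~ /\p{Bidi_Class:R}/  ? 1 : 0;
my $alef_is_r   = $hebrew_alef =~ /\p{Bidi_Class:R}/  ? 1 : 0;
my $arabic_is_al = $arabic_alef =~ /\p{Bidi_Class:AL}/ ? 1 : 0;
my $one_is_en   = $digit_one   =~ /\p{Bidi_Class:EN}/ ? 1 : 0;

print "A: L=$a_is_l R=$a_is_r; alef: R=$alef_is_r; ",
      "Arabic alef: AL=$arabic_is_al; 1: EN=$one_is_en\n";
```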
	602
	603	=head3 B<Scripts>
	604
	605	The world's languages are written in many different scripts. This sentence
	606	(unless you're reading it in translation) is written in Latin, while Russian is
	607	written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in
	608	Hiragana or Katakana. There are many more.
	609
	610	The Unicode C<Script> and C<Script_Extensions> properties give what
	611	script a given character is in. The C<Script_Extensions> property is an
	612	improved version of C<Script>, as demonstrated below. Either property
	613	can be specified with the compound form like
	614	C<\p{Script=Hebrew}> (short: C<\p{sc=hebr}>), or
	615	C<\p{Script_Extensions=Javanese}> (short: C<\p{scx=java}>).
	616	In addition, Perl furnishes shortcuts for all
	617	C<Script_Extensions> property names. You can omit everything up through
	618	the equals (or colon), and simply write C<\p{Latin}> or C<\P{Cyrillic}>.
	619	(This is not true for C<Script>, which is required to be
	620	written in the compound form. Prior to Perl v5.26, the single form
	621	returned the plain old C<Script> version, but was changed because
	622	C<Script_Extensions> gives better results.)
	623
	624	The difference between these two properties involves characters that are
	625	used in multiple scripts. For example the digits '0' through '9' are
	626	used in many parts of the world. These are placed in a script named
	627	C<Common>. Other characters are used in just a few scripts. For
	628	example, the C<"KATAKANA-HIRAGANA DOUBLE HYPHEN"> is used in both Japanese
	629	scripts, Katakana and Hiragana, but nowhere else. The C<Script>
	630	property places all characters that are used in multiple scripts in the
	631	C<Common> script, while the C<Script_Extensions> property places those
	632	that are used in only a few scripts into each of those scripts; while
	633	still using C<Common> for those used in many scripts. Thus both these
	634	match:
	635
	636	 "0" =~ /\p{sc=Common}/     # Matches
	637	 "0" =~ /\p{scx=Common}/    # Matches
	638
	639	and only the first of these match:
	640
	641	 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Common}/   # Matches
	642	 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Common}/  # No match
	643
	644	And only the last two of these match:
	645
	646	 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Hiragana}/   # No match
	647	 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Katakana}/   # No match
	648	 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Hiragana}/  # Matches
	649	 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Katakana}/  # Matches
	650
	651	C<Script_Extensions> is thus an improved C<Script>, in which there are
	652	fewer characters in the C<Common> script, and correspondingly more in
	653	other scripts. It is new in Unicode version 6.0, and its data are likely
	654	to change significantly in later releases, as things get sorted out.
	655	New code should probably be using C<Script_Extensions> and not plain
	656	C<Script>. If you compile perl with a Unicode release that doesn't have
	657	C<Script_Extensions>, the single form Perl extensions will instead refer
	658	to the plain C<Script> property. If you compile with a version of
	659	Unicode that doesn't have the C<Script> property, these extensions will
	660	not be defined at all.
	661
	662	(Actually, besides C<Common>, the C<Inherited> script contains
	663	characters that are used in multiple scripts. These are modifier
	664	characters which inherit the script value
	665	of the controlling character. Some of these are used in many scripts,
	666	and so go into C<Inherited> in both C<Script> and C<Script_Extensions>.
	667	Others are used in just a few scripts, so are in C<Inherited> in
	668	C<Script>, but not in C<Script_Extensions>.)
	669
	670	It is worth stressing that there are several different sets of digits in
	671	Unicode that are equivalent to 0-9 and are matchable by C<\d> in a
	672	regular expression. If they are used in a single language only, they
	673	are in that language's C<Script> and C<Script_Extensions>. If they are
	674	used in more than one script, they will be in C<sc=Common>, but only
	675	if they are used in many scripts should they be in C<scx=Common>.
	676
	677	The explanation above has omitted some detail; refer to UAX#24 "Unicode
	678	Script Property": L<http://www.unicode.org/reports/tr24>.
	679
	680	A complete list of scripts and their shortcuts is in L<perluniprops>.
	681
	682	=head3 B<Use of the C<"Is"> Prefix>
	683
	684	For backward compatibility (with ancient Perl 5.6), all properties writable
	685	without using the compound form mentioned
	686	so far may have C<Is> or C<Is_> prepended to their name, so C<\P{Is_Lu}>, for
	687	example, is equal to C<\P{Lu}>, and C<\p{IsScript:Arabic}> is equal to
	688	C<\p{Arabic}>.
	689
	690	=head3 B<Blocks>
	691
	692	In addition to B<scripts>, Unicode also defines B<blocks> of
	693	characters. The difference between scripts and blocks is that the
	694	concept of scripts is closer to natural languages, while the concept
	695	of blocks is more of an artificial grouping based on groups of Unicode
	696	characters with consecutive ordinal values. For example, the C<"Basic Latin">
	697	block is all the characters whose ordinals are between 0 and 127, inclusive; in
	698	other words, the ASCII characters. The C<"Latin"> script contains some letters
	699	from this as well as several other blocks, like C<"Latin-1 Supplement">,
	700	C<"Latin Extended-A">, I<etc.>, but it does not contain all the characters from
	701	those blocks. It does not, for example, contain the digits 0-9, because
	702	those digits are shared across many scripts, and hence are in the
	703	C<Common> script.
	704
	705	For more about scripts versus blocks, see UAX#24 "Unicode Script Property":
	706	L<http://www.unicode.org/reports/tr24>
	707
	708	The C<Script_Extensions> or C<Script> properties are likely to be the
	709	ones you want to use when processing
	710	natural language; the C<Block> property may occasionally be useful in working
	711	with the nuts and bolts of Unicode.
	712
	713	Block names are matched in the compound form, like C<\p{Block: Arrows}> or
	714	C<\p{Blk=Hebrew}>. Unlike most other properties, only a few block names have a
	715	Unicode-defined short name.
	716
	717	Perl also defines single form synonyms for the block property in cases
	718	where these do not conflict with something else. But don't use any of
	719	these, because they are unstable. Since these are Perl extensions, they
	720	are subordinate to official Unicode property names; Unicode doesn't know
	721	nor care about Perl's extensions. It may happen that a name that
	722	currently means the Perl extension will later be changed without warning
	723	to mean a different Unicode property in a future version of the perl
	724	interpreter that uses a later Unicode release, and your code would no
	725	longer work. The extensions are mentioned here for completeness: Take
	726	the block name and prefix it with one of: C<In> (for example
	727	C<\p{Blk=Arrows}> can currently be written as C<\p{In_Arrows}>); or
	728	sometimes C<Is> (like C<\p{Is_Arrows}>); or sometimes no prefix at all
	729	(C<\p{Arrows}>). As of this writing (Unicode 9.0) there are no
	730	conflicts with using the C<In_> prefix, but there are plenty with the
	731	other two forms. For example, C<\p{Is_Hebrew}> and C<\p{Hebrew}> mean
	732	C<\p{Script_Extensions=Hebrew}> which is NOT the same thing as
	733	C<\p{Blk=Hebrew}>. Our
	734	advice used to be to use the C<In_> prefix as a single form way of
	735	specifying a block. But Unicode 8.0 added properties whose names begin
	736	with C<In>, and it's now clear that it's only luck that's so far
	737	prevented a conflict. Using C<In> is only marginally less typing than
	738	C<Blk:>, and the latter's meaning is clearer anyway, and guaranteed to
	739	never conflict. So don't take chances. Use C<\p{Blk=foo}> for new
	740	code. And be sure that a block is what you really want. In
	741	most cases scripts are what you want instead.
	742
	743	A complete list of blocks is in L<perluniprops>.
	744
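To make the script-versus-block distinction concrete, here is a minimal
sketch. The two sample characters (C<U+FB1D> and the digit 5) are our own
picks for illustration, and the sketch assumes a Perl recent enough that
the single form C<\p{Hebrew}> names the script rather than the block.

```perl
use strict;
use warnings;

# U+FB1D HEBREW LETTER YOD WITH HIRIQ belongs to the Hebrew script,
# but it lives in the "Alphabetic Presentation Forms" block.
my $yod = "\x{FB1D}";
my $in_script = $yod =~ /\p{Hebrew}/     ? 1 : 0;
my $in_block  = $yod =~ /\p{Blk=Hebrew}/ ? 1 : 0;

# The digit 5 is in the "Basic Latin" block, but its script is Common.
my $digit_in_block  = "5" =~ /\p{Blk=Basic_Latin}/ ? 1 : 0;
my $digit_in_script = "5" =~ /\p{Latin}/           ? 1 : 0;

printf "U+FB1D:  script=%d block=%d\n", $in_script, $in_block;
printf "digit 5: block=%d script=%d\n", $digit_in_block, $digit_in_script;
```

The compound form C<\p{Blk=...}> is used for the block tests so that the
meaning cannot silently change in a later Unicode release.
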
	745	=head3 B<Other Properties>
	746
	747	There are many more properties than the very basic ones described here.
	748	A complete list is in L<perluniprops>.
	749
	750	Unicode defines all its properties in the compound form, so all single-form
	751	properties are Perl extensions. Most of these are just synonyms for the
	752	Unicode ones, but some are genuine extensions, including several that are in
	753	the compound form. And quite a few of these are actually recommended by Unicode
	754	(in L<http://www.unicode.org/reports/tr18>).
	755
	756	This section gives some details on all extensions that aren't just
	757	synonyms for compound-form Unicode properties
	758	(for those properties, you'll have to refer to the
	759	L<Unicode Standard|http://www.unicode.org/reports/tr44>).
	760
	761	=over
	762
	763	=item B<C<\p{All}>>
	764
	765	This matches every possible code point. It is equivalent to C<qr/./s>.
	766	Unlike all the other non-user-defined C<\p{}> property matches, no
	767	warning is ever generated if this property is matched against a
	768	non-Unicode code point (see L</Beyond Unicode code points> below).
	769
	770	=item B<C<\p{Alnum}>>
	771
	772	This matches any C<\p{Alphabetic}> or C<\p{Decimal_Number}> character.
	773
	774	=item B<C<\p{Any}>>
	775
	776	This matches any of the 1_114_112 Unicode code points. It is a synonym
	777	for C<\p{Unicode}>.
	778
	779	=item B<C<\p{ASCII}>>
	780
	781	This matches any of the 128 characters in the US-ASCII character set,
	782	which is a subset of Unicode.
	783
	784	=item B<C<\p{Assigned}>>
	785
	786	This matches any assigned code point; that is, any code point whose L<general
	787	category|/General_Category> is not C<Unassigned> (or equivalently, not C<Cn>).
	788
	789	=item B<C<\p{Blank}>>
	790
	791	This is the same as C<\h> and C<\p{HorizSpace}>: A character that changes the
	792	spacing horizontally.
	793
	794	=item B<C<\p{Decomposition_Type: Non_Canonical}>> (Short: C<\p{Dt=NonCanon}>)
	795
	796	Matches a character that has a non-canonical decomposition.
	797
	798	The L</Extended Grapheme Clusters (Logical characters)> section above
	799	talked about canonical decompositions. However, many more characters
	800	have a different type of decomposition, a "compatible" or
	801	"non-canonical" decomposition. The sequences that form these
	802	decompositions are not considered canonically equivalent to the
	803	pre-composed character. An example is the C<"SUPERSCRIPT ONE">. It is
	804	somewhat like a regular digit 1, but not exactly; its decomposition into
	805	the digit 1 is called a "compatible" decomposition, specifically a
	806	"super" decomposition. There are several such compatibility
	807	decompositions (see L<http://www.unicode.org/reports/tr44>), including
	808	one called "compat", which means some miscellaneous type of
	809	decomposition that doesn't fit into the other decomposition categories
	810	that Unicode has chosen.
	811
	812	Note that most Unicode characters don't have a decomposition, so their
	813	decomposition type is C<"None">.
	814
	815	For your convenience, Perl has added the C<Non_Canonical> decomposition
	816	type to mean any of the several compatibility decompositions.
	817
	818	=item B<C<\p{Graph}>>
	819
	820	Matches any character that is graphic. Theoretically, this means a character
	821	that on a printer would cause ink to be used.
	822
	823	=item B<C<\p{HorizSpace}>>
	824
	825	This is the same as C<\h> and C<\p{Blank}>: a character that changes the
	826	spacing horizontally.
	827
	828	=item B<C<\p{In=*}>>
	829
	830	This is a synonym for C<\p{Present_In=*}>.
	831
	832	=item B<C<\p{PerlSpace}>>
	833
	834	This is the same as C<\s>, restricted to ASCII, namely C<S<[ \f\n\r\t]>>
	835	and, starting in Perl v5.18, a vertical tab.
	836	
	837	Mnemonic: Perl's (original) space.
	838
	839	=item B<C<\p{PerlWord}>>
	840
	841	This is the same as C<\w>, restricted to ASCII, namely C<[A-Za-z0-9_]>.
	842
	843	Mnemonic: Perl's (original) word.
	844
	845	=item B<C<\p{Posix...}>>
	846
	847	There are several of these, which are equivalents, using the C<\p{}>
	848	notation, for Posix classes and are described in
	849	L<perlrecharclass/POSIX Character Classes>.
	850
	851	=item B<C<\p{Present_In: *}>> (Short: C<\p{In=*}>)
	852
	853	This property is used when you need to know in what Unicode version(s) a
	854	character is present.
	855
	856	The "*" above stands for some Unicode version number, such as
	857	C<1.1> or C<12.0>; or the "*" can also be C<Unassigned>. This property will
	858	match the code points whose final disposition has been settled as of the
	859	Unicode release given by the version number; C<\p{Present_In: Unassigned}>
	860	will match those code points whose meaning has yet to be assigned.
	861
	862	For example, C<U+0041> C<"LATIN CAPITAL LETTER A"> was present in the very first
	863	Unicode release available, which is C<1.1>, so this property is true for all
	864	valid "*" versions. On the other hand, C<U+1EFF> was not assigned until version
	865	5.1 when it became C<"LATIN SMALL LETTER Y WITH LOOP">, so the only "*" values
	866	that would match it are 5.1, 5.2, and later.
	867
	868	Unicode furnishes the C<Age> property from which this is derived. The problem
	869	with Age is that a strict interpretation of it (which Perl takes) has it
	870	matching the precise release a code point's meaning is introduced in. Thus
	871	C<U+0041> would match only 1.1; and C<U+1EFF> only 5.1. This is not usually what
	872	you want.
	873
	874	Some non-Perl implementations of the Age property may change its meaning to be
	875	the same as the Perl C<Present_In> property; just be aware of that.
	876
	877	Another confusion with both these properties is that the definition is not
	878	that the code point has been I<assigned>, but that the meaning of the code point
	879	has been I<determined>. This is because 66 code points will always be
	880	unassigned, and so the C<Age> for them is the Unicode version in which the decision
	881	to make them so was made. For example, C<U+FDD0> is to be permanently
	882	unassigned to a character, and the decision to do that was made in version 3.1,
	883	so C<\p{Age=3.1}> matches this character, as also does C<\p{Present_In: 3.1}> and up.
	884
	885	=item B<C<\p{Print}>>
	886
	887	This matches any character that is graphical or blank, except controls.
	888
	889	=item B<C<\p{SpacePerl}>>
	890
	891	This is the same as C<\s>, including beyond ASCII.
	892
	893	Mnemonic: Space, as modified by Perl. (Until v5.18, it didn't include the
	894	vertical tab, which both the Posix standard and Unicode consider white space.)
	895
	896	=item B<C<\p{Title}>> and B<C<\p{Titlecase}>>
	897
	898	Under case-sensitive matching, these both match the same code points as
	899	C<\p{General Category=Titlecase_Letter}> (C<\p{gc=lt}>). The difference
	900	is that under C</i> caseless matching, these match the same as
	901	C<\p{Cased}>, whereas C<\p{gc=lt}> matches C<\p{Cased_Letter}>.
	902
	903	=item B<C<\p{Unicode}>>
	904
	905	This matches any of the 1_114_112 Unicode code points. It is a synonym for
	906	C<\p{Any}>.
	907
	908	=item B<C<\p{VertSpace}>>
	909
	910	This is the same as C<\v>: A character that changes the spacing vertically.
	911
	912	=item B<C<\p{Word}>>
	913
	914	This is the same as C<\w>, including over 100_000 characters beyond ASCII.
	915
	916	=item B<C<\p{XPosix...}>>
	917
	918	There are several of these, which are the standard Posix classes
	919	extended to the full Unicode range. They are described in
	920	L<perlrecharclass/POSIX Character Classes>.
	921
	922	=back
	923
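Two of the distinctions drawn in the list above are easy to check for
yourself: C<Age> versus C<Present_In>, and C<\p{All}> versus C<\p{Any}>.
The following is only a sketch; it assumes Perl 5.20 or later (for the
treatment of above-Unicode code points), and the version numbers tested
are arbitrary choices.

```perl
use strict;
use warnings;
no warnings 'non_unicode';   # we deliberately match an above-Unicode code point

my $y_loop = "\x{1EFF}";     # LATIN SMALL LETTER Y WITH LOOP, added in 5.1

# Age names the exact release; Present_In means "that release or earlier".
my %seen = (
    age_5_1 => ($y_loop =~ /\p{Age=5.1}/         ? 1 : 0),
    age_6_0 => ($y_loop =~ /\p{Age=6.0}/         ? 1 : 0),
    in_5_0  => ($y_loop =~ /\p{Present_In: 5.0}/ ? 1 : 0),
    in_6_0  => ($y_loop =~ /\p{In=6.0}/          ? 1 : 0),
);

# \p{All} accepts every code point; \p{Any} stops at U+10FFFF.
my $beyond = chr(0x110000);
my $all = $beyond =~ /\p{All}/ ? 1 : 0;
my $any = $beyond =~ /\p{Any}/ ? 1 : 0;

print "$_=$seen{$_}\n" for sort keys %seen;
print "beyond Unicode: All=$all Any=$any\n";
```
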
	924	=head2 Wildcards in Property Values
	925
	926	Starting in Perl 5.30, it is possible to do something like this:
	927
	928	qr!\p{numeric_value=/\A[0-5]\z/}!
	929
	930	or, by abbreviating and adding C</x>,
	931
	932	qr! \p{nv= /(?x) \A [0-5] \z / }!
	933
	934	This matches all code points whose numeric value is one of 0, 1, 2, 3,
	935	4, or 5. This particular example could instead have been written as
	936
	937	qr! \A [ \p{nv=0}\p{nv=1}\p{nv=2}\p{nv=3}\p{nv=4}\p{nv=5} ] \z !xx
	938
	939	in earlier perls, so in this case this feature just makes things easier
	940	and shorter to write. If we hadn't included the C<\A> and C<\z>, these
	941	would have matched things like C<1E<sol>2> because that contains a 1 (as
	942	well as a 2). As written, it matches things like subscripts that have
	943	these numeric values. If we only wanted the decimal digits with those
	944	numeric values, we could say,
	945
	946	qr! (?[ \d & \p{nv=/[0-5]/} ]) !x
	947
	948	The C<\d> gets rid of needing to anchor the pattern, since it forces the
	949	result to only match C<[0-9]>, and the C<[0-5]> further restricts it.
	950
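As a quick check of the anchored form, the sketch below tries it against
four code points of our own choosing: two whose numeric value lies in
0..5, one decimal digit outside that range, and a fraction. It assumes
Perl 5.30 or later and silences the experimental warning.

```perl
use v5.30;
use warnings;
no warnings 'experimental::uniprop_wildcards';

# Anchored subpattern: only the property values "0" through "5" qualify.
my $zero_to_five = qr!\p{nv=/\A[0-5]\z/}!;

# DIGIT THREE, SUBSCRIPT THREE, DIGIT SEVEN, VULGAR FRACTION ONE HALF
for my $cp (0x33, 0x2083, 0x37, 0xBD) {
    my $verdict = chr($cp) =~ $zero_to_five ? "matches" : "does not match";
    printf "U+%04X %s\n", $cp, $verdict;
}
```
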
	951	The text in the above examples enclosed between the C<"E<sol>">
	952	characters can be just about any regular expression. It is independent
	953	of the main pattern, so doesn't share any capturing groups, I<etc>. The
	954	delimiters for it must be ASCII punctuation, but it may NOT be
	955	delimited by C<"{">, nor C<"}"> nor contain a literal C<"}">, as that
	956	delimits the end of the enclosing C<\p{}>. Like any pattern, certain
	957	other delimiters are terminated by their mirror images. These are
	958	C<"(">, C<"[">, and C<"E<lt>">. If the delimiter is any of C<"-">,
	959	C<"_">, C<"+">, or C<"\">, or is the same delimiter as is used for the
	960	enclosing pattern, it must be preceded by a backslash escape, both
	961	fore and aft.
	962
	963	Beware of using C<"$"> to indicate to match the end of the string. It
	964	can too easily be interpreted as being a punctuation variable, like
	965	C<$/>.
	966
	967	No modifiers may follow the final delimiter. Instead, use
	968	L<perlre/(?adlupimnsx-imnsx)> and/or
	969	L<perlre/(?adluimnsx-imnsx:pattern)> to specify modifiers.
	970
	971	This feature is not available when the left-hand side is prefixed by
	972	C<Is_>, nor for any form that is marked as "Discouraged" in
	973	L<perluniprops/Discouraged>.
	974
	975	Perl wraps your pattern with C<(?iaa: ... )>. This is because nothing
	976	outside ASCII can match the Unicode property values available in this
	977	release, and they should match caselessly. If your pattern has a syntax
	978	error, this wrapping will be shown in the error message, even though you
	979	didn't specify it yourself. This could be confusing if you don't know
	980	about this.
	981
	982	This experimental feature has been added to begin to implement
	983	L<https://www.unicode.org/reports/tr18/#Wildcard_Properties>. Using it
	984	will raise a (default-on) warning in the
	985	C<experimental::uniprop_wildcards> category. We reserve the right to
	986	change its operation as we gain experience.
	987
	988	Your subpattern can be just about anything, but for it to have some
	989	utility, it should match when called with either or both of
	990	a) the full name of the property value with underscores (and/or spaces
	991	in the Block property) and some things uppercase; or b) the property
	992	value in all lowercase with spaces and underscores squeezed out. For
	993	example,
	994
	995	qr!\p{Blk=/Old I.*/}!
	996	qr!\p{Blk=/oldi.*/}!
	997
	998	would match the same things.
	999
	1000	A warning is issued if none of the legal values for a property are
	1001	matched by your pattern. It's likely that a future release will raise a
	1002	warning if your pattern ends up causing every possible code point to
	1003	match.
	1004
	1005	Another example that shows that within C<\p{...}>, C</x> isn't needed to
	1006	have spaces:
	1007
	1008	qr!\p{scx= /Hebrew|Greek/ }!
	1009
	1010	To be safe, we should have anchored the above example, to prevent
	1011	matches for something like C<Hebrew_Braille>, but there aren't
	1012	any script names like that.
	1013
	1014	There are certain properties that it doesn't currently work with. These
	1015	are:
	1016
	1017	Bidi Mirroring Glyph
	1018	Bidi Paired Bracket
	1019	Case Folding
	1020	Decomposition Mapping
	1021	Equivalent Unified Ideograph
	1022	Name
	1023	Name Alias
	1024	Lowercase Mapping
	1025	NFKC Case Fold
	1026	Titlecase Mapping
	1027	Uppercase Mapping
	1028
	1029	Nor is the C<@I<unicode_property>@> form implemented.
	1030
	1031	Here's a complete example of matching IPv4 internet protocol addresses
	1032	in any (single) script:
	1033
	1034	no warnings 'experimental::script_run';
	1035	no warnings 'experimental::regex_sets';
	1036	no warnings 'experimental::uniprop_wildcards';
	1037
	1038	# Can match a substring, so this intermediate regex needs to have
	1039	# context or anchoring in its final use. Using nt=de yields decimal
	1040	# digits. When specifying a subset of these, we must include \d to
	1041	# prevent things like U+00B2 SUPERSCRIPT TWO from matching
	1042	my $zero_through_255 =
	1043	  qr/ \b (*sr:                              # All from same script
	1044	            (?[ \p{nv=0} & \d ])*          # Optional leading zeros
	1045	        (                                   # Then one of:
	1046	                              \d{1,2}      # 0 - 99
	1047	            | (?[ \p{nv=1} & \d ])  \d{2}  # 100 - 199
	1048	            | (?[ \p{nv=2} & \d ])
	1049	               (  (?[ \p{nv=:[0-4]:} & \d ]) \d    # 200 - 249
	1050	                | (?[ \p{nv=5}       & \d ])
	1051	                  (?[ \p{nv=:[0-5]:} & \d ])       # 250 - 255
	1052	               )
	1053	        )
	1054	      )
	1055	      \b
	1056	    /x;
	1057
	1058	my $ipv4 = qr/ \A (*sr: $zero_through_255
	1059	(?: [.] $zero_through_255 ) {3}
	1060	)
	1061	\z
	1062	/x;
	1063
	1064	=head2 User-Defined Character Properties
	1065
	1066	You can define your own binary character properties by defining subroutines
	1067	whose names begin with C<"In"> or C<"Is">. (The experimental feature
	1068	L<perlre/(?[ ])> provides an alternative which allows more complex
	1069	definitions.) The subroutines can be defined in any
	1070	package. The user-defined properties can be used in the regular expression
	1071	C<\p{}> and C<\P{}> constructs; if you are using a user-defined property from a
	1072	package other than the one you are in, you must specify its package in the
	1073	C<\p{}> or C<\P{}> construct.
	1074
	1075	# assuming property IsForeign defined in Lang::
	1076	package main; # property package name required
	1077	if ($txt =~ /\p{Lang::IsForeign}+/) { ... }
	1078
	1079	package Lang; # property package name not required
	1080	if ($txt =~ /\p{IsForeign}+/) { ... }
	1081
	1082
	1083	Note that the effect is compile-time and immutable once defined.
	1084	However, the subroutines are passed a single parameter, which is 0 if
	1085	case-sensitive matching is in effect and non-zero if caseless matching
	1086	is in effect. The subroutine may return different values depending on
	1087	the value of the flag, and one set of values will immutably be in effect
	1088	for all case-sensitive matches, and the other set for all case-insensitive
	1089	matches.
	1090
	1091	Note that if the regular expression is tainted, then Perl will die rather
	1092	than calling the subroutine when the name of the subroutine is
	1093	determined by the tainted data.
	1094
	1095	The subroutines must return a specially-formatted string, with one
	1096	or more newline-separated lines. Each line must be one of the following:
	1097
	1098	=over 4
	1099
	1100	=item *
	1101
	1102	A single hexadecimal number denoting a code point to include.
	1103
	1104	=item *
	1105
	1106	Two hexadecimal numbers separated by horizontal whitespace (space or
	1107	tab characters) denoting a range of code points to include. The
	1108	second number must not be smaller than the first.
	1109
	1110	=item *
	1111
	1112	Something to include, prefixed by C<"+">: a built-in character
	1113	property (prefixed by C<"utf8::">) or a fully qualified (including package
	1114	name) user-defined character property,
	1115	to represent all the characters in that property; two hexadecimal code
	1116	points for a range; or a single hexadecimal code point.
	1117
	1118	=item *
	1119
	1120	Something to exclude, prefixed by C<"-">: an existing character
	1121	property (prefixed by C<"utf8::">) or a fully qualified (including package
	1122	name) user-defined character property,
	1123	to represent all the characters in that property; two hexadecimal code
	1124	points for a range; or a single hexadecimal code point.
	1125
	1126	=item *
	1127
	1128	Something to negate, prefixed by C<"!">: an existing character
	1129	property (prefixed by C<"utf8::">) or a fully qualified (including package
	1130	name) user-defined character property,
	1131	to represent all the characters in that property; two hexadecimal code
	1132	points for a range; or a single hexadecimal code point.
	1133
	1134	=item *
	1135
	1136	Something to intersect with, prefixed by C<"&">: an existing character
	1137	property (prefixed by C<"utf8::">) or a fully qualified (including package
	1138	name) user-defined character property,
	1139	to keep only the characters that are also in that property; two
	1140	hexadecimal code points for a range; or a single hexadecimal code point.
	1141
	1142	=back
	1143
	1144	For example, to define a property that covers both the Japanese
	1145	syllabaries (hiragana and katakana), you can define
	1146
	1147	sub InKana {
	1148	return <<END;
	1149	3040\t309F
	1150	30A0\t30FF
	1151	END
	1152	}
	1153
	1154	Imagine that the here-doc end marker is at the beginning of the line.
	1155	Now you can use C<\p{InKana}> and C<\P{InKana}>.
	1156
	1157	You could also have used the existing block property names:
	1158
	1159	sub InKana {
	1160	return <<'END';
	1161	+utf8::InHiragana
	1162	+utf8::InKatakana
	1163	END
	1164	}
	1165
	1166	Suppose you wanted to match only the allocated characters,
	1167	not the raw block ranges: in other words, you want to remove
	1168	the unassigned characters:
	1169
	1170	sub InKana {
	1171	return <<'END';
	1172	+utf8::InHiragana
	1173	+utf8::InKatakana
	1174	-utf8::IsCn
	1175	END
	1176	}
	1177
	1178	The negation is useful for defining (surprise!) negated classes.
	1179
	1180	sub InNotKana {
	1181	return <<'END';
	1182	!utf8::InHiragana
	1183	-utf8::InKatakana
	1184	+utf8::IsCn
	1185	END
	1186	}
	1187
	1188	This will match all non-Unicode code points, since every one of them is
	1189	not in Kana. You can use intersection to exclude these, if desired, as
	1190	this modified example shows:
	1191
	1192	sub InNotKana {
	1193	return <<'END';
	1194	!utf8::InHiragana
	1195	-utf8::InKatakana
	1196	+utf8::IsCn
	1197	&utf8::Any
	1198	END
	1199	}
	1200
	1201	C<&utf8::Any> must be the last line in the definition.
	1202
	1203	Intersection is used generally for getting the common characters matched
	1204	by two (or more) classes. It's important to remember not to use C<"&"> for
	1205	the first set; that would be intersecting with nothing, resulting in an
	1206	empty set. (Similarly using C<"-"> for the first set does nothing).
	1207
	1208	Unlike non-user-defined C<\p{}> property matches, no warning is ever
	1209	generated if these properties are matched against a non-Unicode code
	1210	point (see L</Beyond Unicode code points> below).
	1211
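Putting the pieces of this section together, here is a small
self-contained sketch that defines two properties and tries them out.
The name C<InAssignedKana> and the sample characters are ours, chosen
only for illustration; the definitions are built with C<join> rather
than a here-doc simply to sidestep the end-marker indentation issue
mentioned above.

```perl
use strict;
use warnings;

# Raw block ranges: every code point in the two blocks, assigned or not.
sub InKana {
    return join "\n", "3040\t309F", "30A0\t30FF", "";
}

# Same blocks minus unassigned (General_Category=Cn) code points.
sub InAssignedKana {
    return join "\n", "+utf8::InHiragana", "+utf8::InKatakana",
                      "-utf8::IsCn", "";
}

my $hiragana_a = "\x{3042}";   # HIRAGANA LETTER A
my $reserved   = "\x{3040}";   # in the Hiragana block, but unassigned

my @results = (
    $hiragana_a =~ /\p{InKana}/         ? 1 : 0,
    $reserved   =~ /\p{InKana}/         ? 1 : 0,
    $hiragana_a =~ /\p{InAssignedKana}/ ? 1 : 0,
    $reserved   =~ /\p{InAssignedKana}/ ? 1 : 0,
    "A"         =~ /\P{InKana}/         ? 1 : 0,
);
print "@results\n";
```

Both subroutines live in the package where they are used, so no package
qualifier is needed inside C<\p{}>.
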
	1212	=head2 User-Defined Case Mappings (for serious hackers only)
	1213
	1214	B<This feature has been removed as of Perl 5.16.>
	1215	The CPAN module C<L<Unicode::Casing>> provides better functionality without
	1216	the drawbacks that this feature had. If you are using a Perl earlier
	1217	than 5.16, this feature was most fully documented in the 5.14 version of
	1218	this pod:
	1219	L<http://perldoc.perl.org/5.14.0/perlunicode.html#User-Defined-Case-Mappings-%28for-serious-hackers-only%29>
	1220
	1221	=head2 Character Encodings for Input and Output
	1222
	1223	See L<Encode>.
	1224
	1225	=head2 Unicode Regular Expression Support Level
	1226
	1227	The following list of Unicode supported features for regular expressions describes
	1228	all features currently directly supported by core Perl. The references
	1229	to "Level I<N>" and the section numbers refer to
	1230	L<UTS#18 "Unicode Regular Expressions"|http://www.unicode.org/reports/tr18>,
	1231	version 18, October 2016.
	1232
	1233	=head3 Level 1 - Basic Unicode Support
	1234
	1235	RL1.1 Hex Notation - Done [1]
	1236	RL1.2 Properties - Done [2]
	1237	RL1.2a Compatibility Properties - Done [3]
	1238	RL1.3 Subtraction and Intersection - Experimental [4]
	1239	RL1.4 Simple Word Boundaries - Done [5]
	1240	RL1.5 Simple Loose Matches - Done [6]
	1241	RL1.6 Line Boundaries - Partial [7]
	1242	RL1.7 Supplementary Code Points - Done [8]
	1243
	1244	=over 4
	1245
	1246	=item [1] C<\N{U+...}> and C<\x{...}>
	1247
	1248	=item [2]
	1249	C<\p{...}> C<\P{...}>. This requirement is for a minimal list of
	1250	properties. Perl supports these and all other Unicode character
	1251	properties, as RL2.7 asks (see L</"Unicode Character Properties"> above).
	1252
	1253	=item [3]
	1254	Perl has C<\d> C<\D> C<\s> C<\S> C<\w> C<\W> C<\X> C<[:I<prop>:]>
	1255	C<[:^I<prop>:]>, plus all the properties specified by
	1256	L<http://www.unicode.org/reports/tr18/#Compatibility_Properties>. These
	1257	are described above in L</Other Properties>.
	1258
	1259	=item [4]
	1260
	1261	The experimental feature C<"(?[...])"> starting in v5.18 accomplishes
	1262	this.
	1263
	1264	See L<perlre/(?[ ])>. If you don't want to use an experimental
	1265	feature, you can use one of the following:
	1266
	1267	=over 4
	1268
	1269	=item *
	1270	Regular expression lookahead
	1271
	1272	You can mimic class subtraction using lookahead.
	1273	For example, what UTS#18 might write as
	1274
	1275	[{Block=Greek}-[{UNASSIGNED}]]
	1276
	1277	in Perl can be written as:
	1278
	1279	(?!\p{Unassigned})\p{Block=Greek}
	1280	(?=\p{Assigned})\p{Block=Greek}
	1281
	1282	But in this particular example, you probably really want
	1283
	1284	\p{Greek}
	1285
	1286	which will match assigned characters known to be part of the Greek script.
	1287
	1288	=item *
	1289
	1290	CPAN module C<L<Unicode::Regex::Set>>
	1291
	1292	It does implement the full UTS#18 grouping, intersection, union, and
	1293	removal (subtraction) syntax.
	1294
	1295	=item *
	1296
	1297	L</"User-Defined Character Properties">
	1298
	1299	C<"+"> for union, C<"-"> for removal (set-difference), C<"&"> for intersection
	1300
	1301	=back
	1302
	1303	=item [5]
	1304	C<\b> C<\B> meet most, but not all, the details of this requirement, but
	1305	C<\b{wb}> and C<\B{wb}> do, as well as the stricter RL2.3.
	1306
	1307	=item [6]
	1308
	1309	Note that Perl does Full case-folding in matching, not Simple:
	1310
	1311	For example C<U+1F88> is equivalent to C<U+1F00 U+03B9>, instead of just
	1312	C<U+1F80>. This difference matters mainly for certain Greek capital
	1313	letters with certain modifiers: the Full case-folding decomposes the
	1314	letter, while the Simple case-folding would map it to a single
	1315	character.
	1316
	1317	=item [7]
	1318
	1319	The reason this is considered to be only partially implemented is that
	1320	Perl has L<C<qrE<sol>\b{lb}E<sol>>|perlrebackslash/\b{lb}> and
	1321	C<L<Unicode::LineBreak>> that are conformant with
	1322	L<UAX#14 "Unicode Line Breaking Algorithm"|http://www.unicode.org/reports/tr14>.
	1323	The regular expression construct provides default behavior, while the
	1324	heavier-weight module provides customizable line breaking.
	1325
	1326	But Perl treats C<\n> as the start- and end-line
	1327	delimiter, whereas Unicode specifies more characters that should be
	1328	so-interpreted.
	1329
	1330	These are:
	1331
	1332	VT U+000B (\v in C)
	1333	FF U+000C (\f)
	1334	CR U+000D (\r)
	1335	NEL U+0085
	1336	LS U+2028
	1337	PS U+2029
	1338
	1339	C<^> and C<$> in regular expression patterns are supposed to match all
	1340	these, but don't.
	1341	These characters also don't, but should, affect C<< <> >> C<$.>, and
	1342	script line numbers.
	1343
	1344	Also, lines should not be split within C<CRLF> (i.e. there is no
	1345	empty line between C<\r> and C<\n>). For C<CRLF>, try the C<:crlf>
	1346	layer (see L<PerlIO>).
	1347
	1348	=item [8]
	1349	UTF-8/UTF-EBCDIC used in Perl allows not only C<U+10000> to
	1350	C<U+10FFFF> but also beyond C<U+10FFFF>.
	1351
	1352	=back
	1353
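Notes [4] and [6] above can be tried directly. This is a minimal sketch
under our own choice of test characters: C<U+03B1> is an assigned Greek
letter, while C<U+0378> sits in the same block but is unassigned.

```perl
use strict;
use warnings;

# Note [4]: Greek-block characters that are actually assigned,
# written as a lookahead instead of a set subtraction.
my $assigned_greek = qr/(?!\p{Unassigned})\p{Block=Greek}/;
my $alpha_ok = "\x{3B1}" =~ $assigned_greek ? 1 : 0;  # GREEK SMALL LETTER ALPHA
my $hole_ok  = "\x{378}" =~ $assigned_greek ? 1 : 0;  # unassigned, same block

# Note [6]: U+1F88 full-case-folds to the two-character sequence
# U+1F00 U+03B9, so a caseless match against that sequence succeeds.
my $full_fold = "\x{1F88}" =~ /\A\x{1F00}\x{3B9}\z/i ? 1 : 0;

print "alpha=$alpha_ok hole=$hole_ok full_fold=$full_fold\n";
```
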
	1354	=head3 Level 2 - Extended Unicode Support
	1355
	1356	RL2.1 Canonical Equivalents - Retracted [9]
	1357	by Unicode
	1358	RL2.2 Extended Grapheme Clusters - Partial [10]
	1359	RL2.3 Default Word Boundaries - Done [11]
	1360	RL2.4 Default Case Conversion - Done
	1361	RL2.5 Name Properties - Done
	1362	RL2.6 Wildcards in Property Values - Partial [12]
	1363	RL2.7 Full Properties - Done
	1364
	1365	=over 4
	1366
	1367	=item [9]
	1368	Unicode has rewritten this portion of UTS#18 to say that getting
	1369	canonical equivalence (see UAX#15
	1370	L<"Unicode Normalization Forms"|http://www.unicode.org/reports/tr15>)
	1371	is basically to be done at the programmer level. Use NFD to write
	1372	both your regular expressions and text to match them against (you
	1373	can use L<Unicode::Normalize>).
	1374
	1375	=item [10]
	1376	Perl has C<\X> and C<\b{gcb}> but we don't have a "Grapheme Cluster Mode".
	1377
	1378	=item [11] see
	1379	L<UAX#29 "Unicode Text Segmentation"|http://www.unicode.org/reports/tr29>.
	1380
	1381	=item [12] see
	1382	L</Wildcards in Property Values> above.
	1383
	1384	=back
	1385
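The advice in note [9], together with the C<\X> of note [10], might look
like this in practice. It is only a sketch: the word "café" is our own
sample, and real code would normalize its input once, at the point where
the text enters the program.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

my $precomposed = "caf\x{E9}";     # e-acute as a single code point
my $decomposed  = "cafe\x{301}";   # "e" followed by COMBINING ACUTE ACCENT

# Without normalization the two spellings do not match each other.
my $raw_match = $decomposed =~ /$precomposed/ ? 1 : 0;

# Normalizing both the pattern source and the text makes them comparable.
my $nfd_pattern = NFD($precomposed);
my $nfd_match   = NFD($decomposed) =~ /\A$nfd_pattern\z/ ? 1 : 0;

# \X consumes a whole extended grapheme cluster, base plus combining mark.
my @clusters = NFD($precomposed) =~ /\X/g;

printf "raw=%d nfd=%d clusters=%d\n", $raw_match, $nfd_match, scalar @clusters;
```
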
	1386	=head3 Level 3 - Tailored Support
	1387
	1388	RL3.1 Tailored Punctuation - Missing
	1389	RL3.2 Tailored Grapheme Clusters - Missing [13]
	1390	RL3.3 Tailored Word Boundaries - Missing
	1391	RL3.4 Tailored Loose Matches - Retracted by Unicode
	1392	RL3.5 Tailored Ranges - Retracted by Unicode
	1393	RL3.6 Context Matching - Partial [14]
	1394	RL3.7 Incremental Matches - Missing
	1395	RL3.8 Unicode Set Sharing - Retracted by Unicode
	1396	RL3.9 Possible Match Sets - Missing
	1397	RL3.10 Folded Matching - Retracted by Unicode
	1398	RL3.11 Submatchers - Partial [15]
	1399
	1400	=over 4
	1401
	1402	=item [13]
	1403	Perl has L<Unicode::Collate>, but it isn't integrated with regular
	1404	expressions. See
	1405	L<UTS#10 "Unicode Collation Algorithm"|http://www.unicode.org/reports/tr10>.
	1406
	1407	=item [14]
	1408	Perl has C<(?<=x)> and C<(?=x)>, but this requirement says that it
	1409	should be possible to specify that matches may occur only in a substring
	1410	with the lookaheads and lookbehinds able to see beyond that matchable
	1411	portion.
	1412
	1413	=item [15]
	1414	Perl has user-defined properties (L</"User-Defined Character
	1415	Properties">) to look at single code points in ways beyond Unicode, and
	1416	it might be possible, though probably not very clean, to use code blocks
	1417	and things like C<(?(DEFINE)...)> (see L<perlre>) to do more specialized
	1418	matching.
	1419
	1420	=back
	1421
	1422	=head2 Unicode Encodings
	1423
	1424	Unicode characters are assigned to I<code points>, which are abstract
	1425	numbers. To use these numbers, various encodings are needed.
	1426
	1427	=over 4
	1428
	1429	=item *
	1430
	1431	UTF-8
	1432
	1433	UTF-8 is a variable-length (1 to 4 bytes), byte-order independent
	1434	encoding. In most of Perl's documentation, including elsewhere in this
	1435	document, the term "UTF-8" means also "UTF-EBCDIC". But in this section,
	1436	"UTF-8" refers only to the encoding used on ASCII platforms. It is a
	1437	superset of 7-bit US-ASCII, so anything encoded in ASCII has the
	1438	identical representation when encoded in UTF-8.
	1439
	1440	The following table is from Unicode 3.2.
	1441
	1442	 Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte
	1443	
	1444	   U+0000..U+007F       00..7F
	1445	   U+0080..U+07FF     * C2..DF    80..BF
	1446	   U+0800..U+0FFF       E0      * A0..BF    80..BF
	1447	   U+1000..U+CFFF       E1..EC    80..BF    80..BF
	1448	   U+D000..U+D7FF       ED        80..9F    80..BF
	1449	   U+D800..U+DFFF       +++++ utf16 surrogates, not legal utf8 +++++
	1450	   U+E000..U+FFFF       EE..EF    80..BF    80..BF
	1451	  U+10000..U+3FFFF      F0      * 90..BF    80..BF    80..BF
	1452	  U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
	1453	 U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF
	1454
	1455	Note the gaps marked by "*" before several of the byte entries above. These are
	1456	caused by legal UTF-8 avoiding non-shortest encodings: it is technically
	1457	possible to UTF-8-encode a single code point in different ways, but that is
	1458	explicitly forbidden, and the shortest possible encoding should always be used
	1459	(and that is what Perl does).
	1460
	1461	Another way to look at it is via bits:
	1462
	1463	                Code Points  1st Byte  2nd Byte  3rd Byte  4th Byte
	1464	
	1465	                   0aaaaaaa  0aaaaaaa
	1466	           00000bbbbbaaaaaa  110bbbbb  10aaaaaa
	1467	           ccccbbbbbbaaaaaa  1110cccc  10bbbbbb  10aaaaaa
	1468	 00000dddccccccbbbbbbaaaaaa  11110ddd  10cccccc  10bbbbbb  10aaaaaa
	1469
	1470	As you can see, the continuation bytes all begin with C<"10">, and the
	1471	leading bits of the start byte tell how many bytes there are in the
	1472	encoded character.
	1473
	1474	The original UTF-8 specification allowed up to 6 bytes, to allow
	1475	encoding of numbers up to C<0x7FFF_FFFF>. Perl continues to allow those,
	1476	and has extended that up to 13 bytes to encode code points up to what
	1477	can fit in a 64-bit word. However, Perl will warn if you output any of
	1478	these as being non-portable; and under strict UTF-8 input protocols,
	1479	they are forbidden. In addition, it is now illegal to use a code point
	1480	larger than what a signed integer variable on your system can hold. On
	1481	32-bit ASCII systems, this means C<0x7FFF_FFFF> is the legal maximum
	1482	(much higher on 64-bit systems).
	1483
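To see the byte counts from the tables above for yourself, a short
sketch such as the following can help. The helper name C<utf8_hex> is
hypothetical, and the output shown by running it applies to ASCII
platforms only.

```perl
use strict;
use warnings;

# Return the UTF-8 encoding of a code point as space-separated hex bytes.
sub utf8_hex {
    my ($code_point) = @_;
    my $bytes = chr($code_point);
    utf8::encode($bytes);    # in place: characters become UTF-8 bytes
    return join " ", map { sprintf "%02X", ord } split //, $bytes;
}

# One sample from each of the 1-, 2-, 3- and 4-byte ranges.
printf "U+%04X => %s\n", $_, utf8_hex($_) for 0x41, 0xE9, 0x20AC, 0x1F600;
```
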
	1484	=item *
	1485
	1486	UTF-EBCDIC
	1487
	1488	Like UTF-8, but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.
	1489	This means that all the basic characters (which include all
	1490	those that have ASCII equivalents, like C<"A">, C<"0">, C<"%">, I<etc.>)
	1491	are the same in both EBCDIC and UTF-EBCDIC.
	1492
	1493	UTF-EBCDIC is used on EBCDIC platforms. It generally requires more
	1494	bytes to represent a given code point than UTF-8 does; the largest
	1495	Unicode code points take 5 bytes to represent (instead of 4 in UTF-8),
	1496	and, extended for 64-bit words, it uses 14 bytes instead of 13 bytes in
	1497	UTF-8.
	1498
	1499	=item *
	1500
	1501	UTF-16, UTF-16BE, UTF-16LE, Surrogates, and C<BOM>'s (Byte Order Marks)
	1502
	1503	The following items are mostly for reference and general Unicode
	1504	knowledge; Perl doesn't use these constructs internally.
	1505
	1506	Like UTF-8, UTF-16 is a variable-width encoding, but where
	1507	UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units.
	1508	All code points occupy either 2 or 4 bytes in UTF-16: code points
	1509	C<U+0000..U+FFFF> are stored in a single 16-bit unit, and code
	1510	points C<U+10000..U+10FFFF> in two 16-bit units. The latter case
	1511	uses I<surrogates>, the first 16-bit unit being the I<high
	1512	surrogate>, and the second being the I<low surrogate>.
	1513
	1514	Surrogates are code points set aside to encode the C<U+10000..U+10FFFF>
	1515	range of Unicode code points in pairs of 16-bit units. The I<high
	1516	surrogates> are the range C<U+D800..U+DBFF> and the I<low surrogates>
	1517	are the range C<U+DC00..U+DFFF>. The surrogate encoding is
	1518
	1519	$hi = int(($uni - 0x10000) / 0x400) + 0xD800;
	1520	$lo = ($uni - 0x10000) % 0x400 + 0xDC00;
	1521
	1522	and the decoding is
	1523
	1524	$uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
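
The arithmetic can be tried out with a short sketch such as the one below. The subroutine names are invented for this example. It uses shifts and masks, which for these non-negative values give the same results as integer division and remainder by C<0x400>, and it cross-checks the result against what L<Encode> produces for UTF-16BE.

```perl
use strict;
use warnings;
use Encode ();

# Split a supplementary code point (U+10000..U+10FFFF) into surrogates.
sub to_surrogates {
    my ($uni) = @_;
    my $offset = $uni - 0x10000;
    my $hi = 0xD800 + ($offset >> 10);      # top 10 bits
    my $lo = 0xDC00 + ($offset & 0x3FF);    # bottom 10 bits
    return ($hi, $lo);
}

# Recombine a surrogate pair into the original code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
}

my ($hi, $lo) = to_surrogates(0x1F600);
printf "U+1F600    => %04X %04X\n", $hi, $lo;                # D83D DE00
printf "round trip => U+%X\n", from_surrogates($hi, $lo);    # U+1F600

# Cross-check against Encode's big-endian UTF-16 encoder.
my @units = unpack "n*", Encode::encode("UTF-16BE", chr 0x1F600);
printf "Encode     => %04X %04X\n", @units;                  # D83D DE00
```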
	1525
	1526	Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16
	1527	itself can be used for in-memory computations, but if storage or
	1528	transfer is required either UTF-16BE (big-endian) or UTF-16LE
	1529	(little-endian) encodings must be chosen.
	1530
	1531	This introduces another problem: what if you just know that your data
	1532	is UTF-16, but you don't know which endianness? Byte Order Marks, or
	1533	C<BOM>'s, are a solution to this. A special character has been reserved
	1534	in Unicode to function as a byte order marker: the character with the
	1535	code point C<U+FEFF> is the C<BOM>.
	1536
	1537	The trick is that if you read a C<BOM>, you will know the byte order,
	1538	since if it was written on a big-endian platform, you will read the
	1539	bytes C<0xFE 0xFF>, but if it was written on a little-endian platform,
	1540	you will read the bytes C<0xFF 0xFE>. (And if the originating platform
	1541	was writing UTF-8 on an ASCII platform, you will read the bytes
	1542	C<0xEF 0xBB 0xBF>.)
	1543
	1544	The way this trick works is that the character with the code point
	1545	C<U+FFFE> is not supposed to be in input streams, so the
	1546	sequence of bytes C<0xFF 0xFE> is unambiguously "C<BOM>, represented in
	1547	little-endian format" and cannot be "C<U+FFFE>, represented in big-endian
	1548	format".
	1549
	1550	Surrogates have no meaning in Unicode outside their use in pairs to
	1551	represent other code points. However, Perl allows them to be
	1552	represented individually internally, for example by saying
	1553	C<chr(0xD801)>, so that all code points, not just those valid for open
	1554	interchange, are
	1555	representable. Unicode does define semantics for them, such as that their
	1556	C<L</General_Category>> is C<"Cs">. But because their use is somewhat dangerous,
	1557	Perl will warn (using the warning category C<"surrogate">, which is a
	1558	sub-category of C<"utf8">) if an attempt is made
	1559	to do things like take the lower case of one, or match
	1560	case-insensitively, or to output them. (But don't try this on Perls
	1561	before 5.14.)
	1562
	1563	=item *
	1564
	1565	UTF-32, UTF-32BE, UTF-32LE
	1566
	1567	The UTF-32 family is pretty much like the UTF-16 family, except that
	1568	the units are 32-bit, and therefore the surrogate scheme is not
	1569	needed. UTF-32 is a fixed-width encoding. The C<BOM> signatures are
	1570	C<0x00 0x00 0xFE 0xFF> for BE and C<0xFF 0xFE 0x00 0x00> for LE.
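
A minimal sketch of C<BOM> sniffing, covering the UTF-8, UTF-16 and UTF-32 signatures mentioned in this section, might look like the following. The subroutine name C<sniff_bom> and its return convention are assumptions made for this example, not part of any Perl API. Note that the bytes C<0xFF 0xFE 0x00 0x00> could also be a UTF-16LE C<BOM> followed by C<U+0000>; the sketch simply prefers the longer signature.

```perl
use strict;
use warnings;

# Hypothetical helper: guess an encoding from a leading BOM, if any.
# Returns the encoding name and the BOM length in bytes, or an empty list.
sub sniff_bom {
    my ($bytes) = @_;

    # The 4-byte UTF-32 signatures are tested before the 2-byte UTF-16
    # ones, because the UTF-32LE BOM begins with the UTF-16LE BOM.
    my @signatures = (
        [ "UTF-32BE", "\x00\x00\xFE\xFF" ],
        [ "UTF-32LE", "\xFF\xFE\x00\x00" ],
        [ "UTF-8",    "\xEF\xBB\xBF"     ],
        [ "UTF-16BE", "\xFE\xFF"         ],
        [ "UTF-16LE", "\xFF\xFE"         ],
    );

    for my $signature (@signatures) {
        my ($name, $bom) = @$signature;
        return ($name, length $bom)
            if substr($bytes, 0, length $bom) eq $bom;
    }
    return;    # no BOM recognized
}

my ($name, $skip) = sniff_bom("\xFF\xFEh\x00i\x00");
print "looks like $name, skip $skip bytes\n";    # UTF-16LE, 2
```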
	1571
	1572	=item *
	1573
	1574	UCS-2, UCS-4
	1575
	1576	Legacy, fixed-width encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit
	1577	encoding. Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
	1578	because it does not use surrogates. UCS-4 is a 32-bit encoding,
	1579	functionally identical to UTF-32 (the difference being that
	1580	UCS-4 forbids neither surrogates nor code points larger than C<0x10_FFFF>).
	1581
	1582	=item *
	1583
	1584	UTF-7
	1585
	1586	A seven-bit safe (non-eight-bit) encoding, which is useful if the
	1587	transport or storage is not eight-bit safe. Defined by RFC 2152.
	1588
	1589	=back
	1590
	1591	=head2 Noncharacter code points
	1592
	1593	66 code points are set aside in Unicode as "noncharacter code points".
	1594	These all have the C<Unassigned> (C<Cn>) C<L</General_Category>>, and
	1595	no character will ever be assigned to any of them. They are the 32 code
	1596	points between C<U+FDD0> and C<U+FDEF> inclusive, and the 34 code
	1597	points:
	1598
	1599	U+FFFE U+FFFF
	1600	U+1FFFE U+1FFFF
	1601	U+2FFFE U+2FFFF
	1602	...
	1603	U+EFFFE U+EFFFF
	1604	U+FFFFE U+FFFFF
	1605	U+10FFFE U+10FFFF
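
The two groups are easy to test for numerically. The sketch below (the subroutine name is made up for this example) checks a code point against both groups and confirms the total of 66 by brute force:

```perl
use strict;
use warnings;

# True if $cp is one of the 66 noncharacter code points.
sub is_noncharacter {
    my ($cp) = @_;

    # The contiguous block of 32.
    return 1 if $cp >= 0xFDD0 && $cp <= 0xFDEF;

    # The last two code points of each of the 17 planes.
    return 1 if $cp <= 0x10FFFF && ($cp & 0xFFFE) == 0xFFFE;

    return 0;
}

my $count = 0;
for my $cp (0 .. 0x10FFFF) {
    $count++ if is_noncharacter($cp);
}
print "noncharacters found: $count\n";    # 66
```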
	1606
	1607	Until Unicode 7.0, the noncharacters were "B<forbidden> for use in open
	1608	interchange of Unicode text data", so that code that processed those
	1609	streams could use these code points as sentinels that could be mixed in
	1610	with character data, and would always be distinguishable from that data.
	1611	(Emphasis above and in the next paragraph is added in this document.)
	1612
	1613	Unicode 7.0 changed the wording so that they are "B<not recommended> for
	1614	use in open interchange of Unicode text data". The 7.0 Standard goes on
	1615	to say:
	1616
	1617	=over 4
	1618
	1619	"If a noncharacter is received in open interchange, an application is
	1620	not required to interpret it in any way. It is good practice, however,
	1621	to recognize it as a noncharacter and to take appropriate action, such
	1622	as replacing it with C<U+FFFD> replacement character, to indicate the
	1623	problem in the text. It is not recommended to simply delete
	1624	noncharacter code points from such text, because of the potential
	1625	security issues caused by deleting uninterpreted characters. (See
	1626	conformance clause C7 in Section 3.2, Conformance Requirements, and
	1627	L<Unicode Technical Report #36, "Unicode Security
	1628	Considerations"|http://www.unicode.org/reports/tr36/#Substituting_for_Ill_Formed_Subsequences>)."
	1629
	1630	=back
	1631
	1632	This change was made because it was found that various commercial tools,
	1633	such as editors and source code control systems, had been written
	1634	so that they would not handle program files that used these code points,
	1635	effectively precluding their use almost entirely! And that was never
	1636	the intent. They've always been meant to be usable within an
	1637	application, or cooperating set of applications, at will.
	1638
	1639	If you're writing code, such as an editor, that is supposed to be able
	1640	to handle any Unicode text data, then you shouldn't be using these code
	1641	points yourself, and should instead allow them in the input. If you need
	1642	sentinels, they should instead be something that isn't legal Unicode.
	1643	For UTF-8 data, you can use the bytes 0xC0 and 0xC1 as sentinels, as
	1644	they never appear in well-formed UTF-8. (There are equivalents for
	1645	UTF-EBCDIC). You can also store your Unicode code points in integer
	1646	variables and use negative values as sentinels.
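
If you want to check for yourself which byte values are safe to use as sentinels in UTF-8 data, one approach is to encode every code point and record which bytes turn up. The sketch below does this on an ASCII platform using C<utf8::encode>, which is lax enough to encode surrogates and noncharacters as well. It loops over more than a million code points, so it takes a moment to run.

```perl
use strict;
use warnings;
no warnings 'utf8';    # surrogates and noncharacters are encoded too

my @seen = (0) x 256;
for my $cp (0 .. 0x10FFFF) {
    my $s = chr $cp;
    utf8::encode($s);    # in place: characters become UTF-8 bytes
    $seen[$_] = 1 for unpack "C*", $s;
}

my @never = grep { !$seen[$_] } 0 .. 255;
print join(" ", map { sprintf "0x%02X", $_ } @never), "\n";
# 0xC0 0xC1 0xF5 0xF6 0xF7 0xF8 0xF9 0xFA 0xFB 0xFC 0xFD 0xFE 0xFF
```

Any byte in that output cannot occur in the UTF-8 encoding of a Unicode code point. Perl's extended UTF-8 for above-Unicode code points does use some of the higher ones, so that matters if your data may contain such code points.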
	1647
	1648	If you're not writing such a tool, then whether you accept noncharacters
	1649	as input is up to you (though the Standard recommends that you not). If
	1650	you do strict input stream checking with Perl, these code points
	1651	continue to be forbidden. This is to maintain backward compatibility
	1652	(otherwise potential security holes could open up, as an unsuspecting
	1653	application that was written assuming the noncharacters would be
	1654	filtered out before getting to it, could now, without warning, start
	1655	getting them). To do strict checking, you can use the layer
	1656	C<:encoding('UTF-8')>.
	1657
	1658	Perl continues to warn (using the warning category C<"nonchar">, which
	1659	is a sub-category of C<"utf8">) if an attempt is made to output
	1660	noncharacters.
	1661
	1662	=head2 Beyond Unicode code points
	1663
	1664	The maximum Unicode code point is C<U+10FFFF>, and Unicode only defines
	1665	operations on code points up through that. But Perl works on code
	1666	points up to the maximum permissible signed number available on the
	1667	platform. However, Perl will not accept these from input streams unless
	1668	lax rules are being used, and will warn (using the warning category
	1669	C<"non_unicode">, which is a sub-category of C<"utf8">) if any are output.
	1670
	1671	Since Unicode rules are not defined on these code points, if a
	1672	Unicode-defined operation is done on them, Perl uses what we believe are
	1673	sensible rules, while generally warning, using the C<"non_unicode">
	1674	category. For example, C<uc("\x{11_0000}")> will generate such a
	1675	warning, returning the input parameter as its result, since Perl defines
	1676	the uppercase of every non-Unicode code point to be the code point
	1677	itself. (All the case changing operations, not just uppercasing, work
	1678	this way.)
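
A small sketch of that behavior follows. It captures warnings in an array rather than letting them print, and it deliberately makes no claim about how many are raised, since that can differ between Perl releases.

```perl
use strict;
use warnings;

# Collect warnings instead of printing them.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

my $above = chr 0x11_0000;    # first code point past the Unicode maximum
my $upper = uc $above;
my $lower = lc $above;

print "uc() returned its argument unchanged\n" if $upper eq $above;
print "lc() returned its argument unchanged\n" if $lower eq $above;
print "warnings captured: ", scalar(@warnings), "\n";
```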
	1679
	1680	The situation with matching Unicode properties in regular expressions,
	1681	the C<\p{}> and C<\P{}> constructs, against these code points is not as
	1682	clear cut, and how these are handled has changed as we've gained
	1683	experience.
	1684
	1685	One possibility is to treat any match against these code points as
	1686	undefined. But since Perl doesn't have the concept of a match being
	1687	undefined, it converts this to failing or C<FALSE>. This is almost, but
	1688	not quite, what Perl did from v5.14 (when use of these code points
	1689	became generally reliable) through v5.18. The difference is that Perl
	1690	treated all C<\p{}> matches as failing, but all C<\P{}> matches as
	1691	succeeding.
	1692
	1693	One problem with this is that it leads to unexpected, and confusing
	1694	results in some cases:
	1695
	1696	chr(0x110000) =~ /\p{ASCII_Hex_Digit=True}/ # Failed on <= v5.18
	1697	chr(0x110000) =~ /\p{ASCII_Hex_Digit=False}/ # Failed! on <= v5.18
	1698
	1699	That is, it treated both matches as undefined, and converted that to
	1700	false (raising a warning on each). The first case is the expected
	1701	result, but the second is likely counterintuitive: "How could both be
	1702	false when they are complements?" Another problem was that the
	1703	implementation optimized many Unicode property matches down to already
	1704	existing simpler, faster operations, which don't raise the warning. We
	1705	chose to not forgo those optimizations, which help the vast majority of
	1706	matches, just to generate a warning for the unlikely event that an
	1707	above-Unicode code point is being matched against.
	1708
	1709	As a result of these problems, starting in v5.20, what Perl does is
	1710	to treat non-Unicode code points as just typical unassigned Unicode
	1711	characters, and matches accordingly. (Note: Unicode has atypical
	1712	unassigned code points. For example, it has noncharacter code points,
	1713	and ones that, when they do get assigned, are destined to be written
	1714	Right-to-left, as Arabic and Hebrew are. Perl assumes that no
	1715	non-Unicode code point has any atypical properties.)
	1716
	1717	Perl, in most cases, will raise a warning when matching an above-Unicode
	1718	code point against a Unicode property when the result is C<TRUE> for
	1719	C<\p{}>, and C<FALSE> for C<\P{}>. For example:
	1720
	1721	chr(0x110000) =~ /\p{ASCII_Hex_Digit=True}/ # Fails, no warning
	1722	chr(0x110000) =~ /\p{ASCII_Hex_Digit=False}/ # Succeeds, with warning
	1723
	1724	In both these examples, the character being matched is non-Unicode, so
	1725	Unicode doesn't define how it should match. It clearly isn't an ASCII
	1726	hex digit, so the first example clearly should fail, and so it does,
	1727	with no warning. But it is arguable that the second example should have
	1728	an undefined, hence C<FALSE>, result. So a warning is raised for it.
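
The two matches can be run as a complete program along the following lines, assuming Perl v5.20 or later. Warnings are captured rather than printed, and only the match results are shown, since those are what the text above specifies.

```perl
use strict;
use warnings;

# Collect warnings instead of printing them.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

my $above = chr 0x110000;

my $is_hex     = $above =~ /\p{ASCII_Hex_Digit=True}/  ? 1 : 0;
my $is_not_hex = $above =~ /\p{ASCII_Hex_Digit=False}/ ? 1 : 0;

print "ASCII_Hex_Digit=True:  $is_hex\n";        # 0 on v5.20 and later
print "ASCII_Hex_Digit=False: $is_not_hex\n";    # 1 on v5.20 and later
print "warnings captured: ", scalar(@warnings), "\n";
```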
	1729
	1730	Thus the warning is raised for many fewer cases than in earlier Perls,
	1731	and only when what the result is could be arguable. It turns out that
	1732	none of the optimizations made by Perl (or ever likely to be made)
	1733	cause the warning to be skipped, so it solves both problems of Perl's
	1734	earlier approach. The most commonly used property that is affected by
	1735	this change is C<\p{Unassigned}> which is a short form for
	1736	C<\p{General_Category=Unassigned}>. Starting in v5.20, all non-Unicode
	1737	code points are considered C<Unassigned>. In earlier releases the
	1738	matches failed because the result was considered undefined.
	1739
	1740	The only place where the warning is not raised when it arguably ought to
	1741	have been is if optimizations cause the whole pattern match to not even
	1742	be attempted. For example, Perl may figure out that for a string to
	1743	match a certain regular expression pattern, the string has to contain
	1744	the substring C<"foobar">. Before attempting the match, Perl may look
	1745	for that substring, and if not found, immediately fail the match without
	1746	actually trying it; so no warning gets generated even if the string
	1747	contains an above-Unicode code point.
	1748
	1749	This behavior is more "Do what I mean" than in earlier Perls for most
	1750	applications. But it catches fewer issues for code that needs to be
	1751	strictly Unicode compliant. Therefore there is an additional mode of
	1752	operation available to accommodate such code. This mode is enabled if a
	1753	regular expression pattern is compiled within the lexical scope where
	1754	the C<"non_unicode"> warning class has been made fatal, say by:
	1755
	1756	use warnings FATAL => "non_unicode";
	1757
	1758	(see L<warnings>). In this mode of operation, Perl will raise the
	1759	warning for all matches against a non-Unicode code point (not just the
	1760	arguable ones), and it skips the optimizations that might cause the
	1761	warning to not be output. (It currently still won't warn if the match
	1762	isn't even attempted, like in the C<"foobar"> example above.)
	1763
	1764	In summary, Perl now normally treats non-Unicode code points as typical
	1765	Unicode unassigned code points for regular expression matches, raising a
	1766	warning only when it is arguable what the result should be. However, if
	1767	this warning has been made fatal, it isn't skipped.
	1768
	1769	There is one exception to all this. C<\p{All}> looks like a Unicode
	1770	property, but it is a Perl extension that is defined to be true for all
	1771	possible code points, Unicode or not, so no warning is ever generated
	1772	when matching this against a non-Unicode code point. (Prior to v5.20,
	1773	it was an exact synonym for C<\p{Any}>, matching code points C<0>
	1774	through C<0x10FFFF>.)
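
The difference between the two is easy to see with a short test, again assuming Perl v5.20 or later:

```perl
use strict;
use warnings;

my $above = chr 0x110000;    # just past the Unicode maximum
my $top   = chr 0x10FFFF;    # the Unicode maximum itself

my $all_above = $above =~ /\p{All}/ ? 1 : 0;
my $any_above = $above =~ /\p{Any}/ ? 1 : 0;
my $any_top   = $top   =~ /\p{Any}/ ? 1 : 0;

print "U+110000 matches \\p{All}: $all_above\n";    # 1
print "U+110000 matches \\p{Any}: $any_above\n";    # 0
print "U+10FFFF matches \\p{Any}: $any_top\n";      # 1
```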
	1775
	1776	=head2 Security Implications of Unicode
	1777
	1778	First, read
	1779	L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.
	1780
	1781	Also, note the following:
	1782
	1783	=over 4
	1784
	1785	=item *
	1786
	1787	Malformed UTF-8
	1788
	1789	UTF-8 is very structured, so many combinations of bytes are invalid. In
	1790	the past, Perl tried to soldier on and make some sense of invalid
	1791	combinations, but this can lead to security holes, so now, if the Perl
	1792	core needs to process an invalid combination, it will either raise a
	1793	fatal error, or will replace those bytes by the sequence that forms the
	1794	Unicode REPLACEMENT CHARACTER, for which purpose Unicode created it.
	1795
	1796	Every code point can be represented by more than one possible
	1797	syntactically valid UTF-8 sequence. Early on, both Unicode and Perl
	1798	considered any of these to be valid, but now, all sequences longer
	1799	than the shortest possible one are considered to be malformed.
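
For instance, the character C<"/"> (C<U+002F>) is properly the single byte C<0x2F>, but the two bytes C<0xC0 0xAF> spell out the same bit pattern in a longer form. The sketch below, which assumes an ASCII platform, shows one way to confirm that a strict decoder refuses the longer form. It asks L<Encode> to die on malformed input and traps the exception.

```perl
use strict;
use warnings;
use Encode ();

my $overlong = "\xC0\xAF";    # non-shortest form of U+002F

# decode() may modify its argument when a CHECK value is given,
# so hand it a copy.
my $copy    = $overlong;
my $decoded = eval { Encode::decode("UTF-8", $copy, Encode::FB_CROAK()) };

print defined $decoded ? "accepted\n" : "rejected: $@";

# The shortest form decodes normally.
my $slash = Encode::decode("UTF-8", "\x2F");
print "shortest form decodes to '$slash'\n";
```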
	1800
	1801	Unicode considers many code points to be illegal, or to be avoided.
	1802	Perl generally accepts them, once they have passed through any input
	1803	filters that may try to exclude them. These have been discussed above
	1804	(see "Surrogates" under UTF-16 in L</Unicode Encodings>,
	1805	L</Noncharacter code points>, and L</Beyond Unicode code points>).
	1806
	1807	=item *
	1808
	1809	Regular expression pattern matching may surprise you if you're not
	1810	accustomed to Unicode. Starting in Perl 5.14, several pattern
	1811	modifiers are available to control this, called the character set
	1812	modifiers. Details are given in L<perlre/Character set modifiers>.
	1813
	1814	=back
	1815
	1816	As discussed elsewhere, Perl has one foot (two hooves?) planted in
	1817	each of two worlds: the old world of ASCII and single-byte locales, and
	1818	the new world of Unicode, upgrading when necessary.
	1819	If your legacy code does not explicitly use Unicode, no automatic
	1820	switch-over to Unicode should happen.
	1821
	1822	=head2 Unicode in Perl on EBCDIC
	1823
	1824	Unicode is supported on EBCDIC platforms. See L<perlebcdic>.
	1825
	1826	Unless ASCII vs. EBCDIC issues are specifically being discussed,
	1827	references to UTF-8 encoding in this document and elsewhere should be
	1828	read as meaning UTF-EBCDIC on EBCDIC platforms.
	1829	See L<perlebcdic/Unicode and UTF>.
	1830
	1831	Because UTF-EBCDIC is so similar to UTF-8, the differences are mostly
	1832	hidden from you; S<C<use utf8>> (and NOT something like
	1833	S<C<use utfebcdic>>) declares the script is in the platform's
	1834	"native" 8-bit encoding of Unicode. (Similarly for the C<":utf8">
	1835	layer.)
	1836
	1837	=head2 Locales
	1838
	1839	See L<perllocale/Unicode and UTF-8>.
	1840
	1841	=head2 When Unicode Does Not Happen
	1842
	1843	There are still many places where Unicode (in some encoding or
	1844	another) could be given as arguments or received as results, or both in
	1845	Perl, but it is not, in spite of Perl having extensive ways to input and
	1846	output in Unicode, and a few other "entry points" like the C<@ARGV>
	1847	array (which can sometimes be interpreted as UTF-8).
	1848
	1849	The following are such interfaces. Also, see L</The "Unicode Bug">.
	1850	For all of these interfaces Perl
	1851	currently (as of v5.16.0) simply assumes byte strings both as arguments
	1852	and results, or UTF-8 strings if the (deprecated) C<encoding> pragma has been used.
	1853
	1854	One reason that Perl does not attempt to resolve the role of Unicode in
	1855	these situations is that the answers are highly dependent on the operating
	1856	system and the file system(s). For example, whether filenames can be
	1857	in Unicode and in exactly what kind of encoding, is not exactly a
	1858	portable concept. Similarly for C<qx> and C<system>: how well will the
	1859	"command-line interface" (and which of them?) handle Unicode?
	1860
	1861	=over 4
	1862
	1863	=item *
	1864
	1865	C<chdir>, C<chmod>, C<chown>, C<chroot>, C<exec>, C<link>, C<lstat>, C<mkdir>,
	1866	C<rename>, C<rmdir>, C<stat>, C<symlink>, C<truncate>, C<unlink>, C<utime>, C<-X>
	1867
	1868	=item *
	1869
	1870	C<%ENV>
	1871
	1872	=item *
	1873
	1874	C<glob> (aka C<E<lt>*E<gt>>)
	1875
	1876	=item *
	1877
	1878	C<open>, C<opendir>, C<sysopen>
	1879
	1880	=item *
	1881
	1882	C<qx> (aka the backtick operator), C<system>
	1883
	1884	=item *
	1885
	1886	C<readdir>, C<readlink>
	1887
	1888	=back
	1889
	1890	=head2 The "Unicode Bug"
	1891
	1892	The term "Unicode bug" has been applied to an inconsistency with the
	1893	code points in the C<Latin-1 Supplement> block, that is, between
	1894	128 and 255. Without a locale specified, unlike all other characters or
	1895	code points, these characters can have very different semantics
	1896	depending on the rules in effect. (Characters whose code points are
	1897	above 255 force Unicode rules; whereas the rules for ASCII characters
	1898	are the same under both ASCII and Unicode rules.)
	1899
	1900	Under Unicode rules, these upper-Latin1 characters are interpreted as
	1901	Unicode code points, which means they have the same semantics as Latin-1
	1902	(ISO-8859-1) and C1 controls.
	1903
	1904	As explained in L</ASCII Rules versus Unicode Rules>, under ASCII rules,
	1905	they are considered to be unassigned characters.
	1906
	1907	This can lead to unexpected results. For example, a string's
	1908	semantics can suddenly change if a code point above 255 is appended to
	1909	it, which changes the rules from ASCII to Unicode. As an
	1910	example, consider the following program and its output:
	1911
	1912	$ perl -le'
	1913	no feature "unicode_strings";
	1914	$s1 = "\xC2";
	1915	$s2 = "\x{2660}";
	1916	for ($s1, $s2, $s1.$s2) {
	1917	print /\w/ || 0;
	1918	}
	1919	'
	1920	0
	1921	0
	1922	1
	1923
	1924	If there's no C<\w> in C<$s1> nor in C<$s2>, why does their concatenation
	1925	have one?
	1926
	1927	This anomaly stems from Perl's attempt to not disturb older programs that
	1928	didn't use Unicode, along with Perl's desire to add Unicode support
	1929	seamlessly. But the result turned out to not be seamless. (By the way,
	1930	you can choose to be warned when things like this happen. See
	1931	C<L<encoding::warnings>>.)
	1932
	1933	L<S<C<use feature 'unicode_strings'>>|feature/The 'unicode_strings' feature>
	1934	was added, starting in Perl v5.12, to address this problem. It affects
	1935	these things:
	1936
	1937	=over 4
	1938
	1939	=item *
	1940
	1941	Changing the case of a scalar, that is, using C<uc()>, C<ucfirst()>, C<lc()>,
	1942	and C<lcfirst()>, or C<\L>, C<\U>, C<\u> and C<\l> in double-quotish
	1943	contexts, such as regular expression substitutions.
	1944
	1945	Under C<unicode_strings> starting in Perl 5.12.0, Unicode rules are
	1946	generally used. See L<perlfunc/lc> for details on how this works
	1947	in combination with various other pragmas.
	1948
	1949	=item *
	1950
	1951	Using caseless (C</i>) regular expression matching.
	1952
	1953	Starting in Perl 5.14.0, regular expressions compiled within
	1954	the scope of C<unicode_strings> use Unicode rules
	1955	even when executed or compiled into larger
	1956	regular expressions outside the scope.
	1957
	1958	=item *
	1959
	1960	Matching any of several properties in regular expressions.
	1961
	1962	These properties are C<\b> (without braces), C<\B> (without braces),
	1963	C<\s>, C<\S>, C<\w>, C<\W>, and all the Posix character classes
	1964	I<except> C<[[:ascii:]]>.
	1965
	1966	Starting in Perl 5.14.0, regular expressions compiled within
	1967	the scope of C<unicode_strings> use Unicode rules
	1968	even when executed or compiled into larger
	1969	regular expressions outside the scope.
	1970
	1971	=item *
	1972
	1973	In C<quotemeta> or its inline equivalent C<\Q>.
	1974
	1975	Starting in Perl 5.16.0, consistent quoting rules are used within the
	1976	scope of C<unicode_strings>, as described in L<perlfunc/quotemeta>.
	1977	Prior to that, or outside its scope, no code points above 127 are quoted
	1978	in UTF-8 encoded strings, but in byte encoded strings, code points
	1979	between 128-255 are always quoted.
	1980
	1981	=item *
	1982
	1983	In the C<..> or L<range|perlop/Range Operators> operator.
	1984
	1985	Starting in Perl 5.26.0, the range operator on strings treats their lengths
	1986	consistently within the scope of C<unicode_strings>. Prior to that, or
	1987	outside its scope, it could produce strings whose length in characters
	1988	exceeded that of the right-hand side, where the right-hand side took up more
	1989	bytes than the correct range endpoint.
	1990
	1991	=item *
	1992
	1993	In L<< C<split>'s special-case whitespace splitting|perlfunc/split >>.
	1994
	1995	Starting in Perl 5.28.0, the C<split> function with a pattern specified as
	1996	a string containing a single space handles whitespace characters consistently
	1997	within the scope of C<unicode_strings>. Prior to that, or outside its scope,
	1998	characters that are whitespace according to Unicode rules but not according to
	1999	ASCII rules were treated as field contents rather than field separators when
	2000	they appear in byte-encoded strings.
	2001
	2002	=back
	2003
	2004	You can see from the above that the effect of C<unicode_strings>
	2005	increased over several Perl releases. (And Perl's support for Unicode
	2006	continues to improve; it's best to use the latest available release in
	2007	order to get the most complete and accurate results possible.) Note that
	2008	C<unicode_strings> is automatically chosen if you S<C<use 5.012>> or
	2009	higher.
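
To see the effect on the earlier example, the following sketch runs the same C<\w> test on the same three strings, once in a scope without the feature and once in a scope with it. The subroutine names are made up for this example, and it assumes Perl v5.14 or later, since that is when regular expression matching started honoring the feature.

```perl
use strict;
use warnings;

my @strings = ("\xC2", "\x{2660}", "\xC2\x{2660}");

# Patterns compiled here use the legacy rules.
sub word_flags_legacy {
    no feature 'unicode_strings';
    my @flags;
    push @flags, (/\w/ ? 1 : 0) for @_;
    return "@flags";
}

# Patterns compiled here use Unicode rules.
sub word_flags_unicode {
    use feature 'unicode_strings';
    my @flags;
    push @flags, (/\w/ ? 1 : 0) for @_;
    return "@flags";
}

print "without unicode_strings: ", word_flags_legacy(@strings),  "\n";  # 0 0 1
print "with unicode_strings:    ", word_flags_unicode(@strings), "\n";  # 1 0 1
```

With the feature on, C<"\xC2"> is treated as LATIN CAPITAL LETTER A WITH CIRCUMFLEX whether or not anything is appended to it, so the first and third results agree.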
	2010
	2011	For Perls earlier than those described above, or when a string is passed
	2012	to a function outside the scope of C<unicode_strings>, see the next section.
	2013
	2014	=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)
	2015
	2016	Sometimes (see L</"When Unicode Does Not Happen"> or L</The "Unicode Bug">)
	2017	there are situations where you simply need to force a byte
	2018	string into UTF-8, or vice versa. The standard module L<Encode> can be
	2019	used for this, or the low-level calls
	2020	L<C<utf8::upgrade($bytestring)>|utf8/Utility functions> and
	2021	L<C<utf8::downgrade($utf8string[, FAIL_OK])>|utf8/Utility functions>.
	2022
	2023	Note that C<utf8::downgrade()> can fail if the string contains characters
	2024	that don't fit into a byte.
	2025
	2026	Calling either function on a string that already is in the desired state is a
	2027	no-op.
	2028
	2029	L</ASCII Rules versus Unicode Rules> gives all the ways that a string is
	2030	made to use Unicode rules.
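
A brief sketch of the two low-level calls is shown below. It uses C<utf8::is_utf8> only to reveal which internal representation a string currently has; the string's value as seen by Perl code does not change.

```perl
use strict;
use warnings;

my $s = "\xE9";       # one character, stored as one byte
utf8::upgrade($s);    # same character, now stored as UTF-8
print "upgraded, value unchanged\n"
    if utf8::is_utf8($s) && $s eq "\xE9" && length($s) == 1;

utf8::downgrade($s);    # back to the byte representation
print "downgraded\n" unless utf8::is_utf8($s);

my $wide = "\x{263A}";                 # does not fit in a byte
my $ok = utf8::downgrade($wide, 1);    # FAIL_OK: return false, don't die
print "cannot downgrade U+263A\n" unless $ok;
```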
	2031
	2032	=head2 Using Unicode in XS
	2033
	2034	See L<perlguts/"Unicode Support"> for an introduction to Unicode at
	2035	the XS level, and L<perlapi/Unicode Support> for the API details.
	2036
	2037	=head2 Hacking Perl to work on earlier Unicode versions (for very serious hackers only)
	2038
	2039	Perl by default comes with the latest supported Unicode version built-in, but
	2040	the goal is to allow you to change to use any earlier one. In Perls
	2041	v5.20 and v5.22, however, the earliest usable version is Unicode 5.1.
	2042	Perl v5.18 and v5.24 are able to handle all earlier versions.
	2043
	2044	Download the files in the desired version of Unicode from the Unicode web
	2045	site (L<http://www.unicode.org>). These should replace the existing files in
	2046	F<lib/unicore> in the Perl source tree. Follow the instructions in
	2047	F<README.perl> in that directory to change some of their names, and then build
	2048	perl (see L<INSTALL>).
	2049
	2050	=head2 Porting code from perl-5.6.X
	2051
	2052	Perls starting in 5.8 have a different Unicode model from 5.6. In 5.6 the
	2053	programmer was required to use the C<utf8> pragma to declare that a
	2054	given scope expected to deal with Unicode data and had to make sure that
	2055	only Unicode data were reaching that scope. If you have code that is
	2056	working with 5.6, you will need some of the following adjustments to
	2057	your code. The examples are written such that the code will continue to
	2058	work under 5.6, so you should be safe to try them out.
	2059
	2060	=over 3
	2061
	2062	=item *
	2063
	2064	A filehandle that should read or write UTF-8
	2065
	2066	if ($] > 5.008) {
	2067	binmode $fh, ":encoding(UTF-8)";
	2068	}
	2069
	2070	=item *
	2071
	2072	A scalar that is going to be passed to some extension
	2073
	2074	Be it C<Compress::Zlib>, C<Apache::Request> or any extension that has no
	2075	mention of Unicode in the manpage, you need to make sure that the
	2076	UTF8 flag is stripped off. Note that at the time of this writing
	2077	(January 2012) the mentioned modules are not UTF-8-aware. Please
	2078	check the documentation to verify if this is still true.
	2079
	2080	if ($] > 5.008) {
	2081	require Encode;
	2082	$val = Encode::encode("UTF-8", $val); # make octets
	2083	}
	2084
	2085	=item *
	2086
	2087	A scalar we got back from an extension
	2088
	2089	If you believe the scalar comes back as UTF-8, you will most likely
	2090	want the UTF8 flag restored:
	2091
	2092	if ($] > 5.008) {
	2093	require Encode;
	2094	$val = Encode::decode("UTF-8", $val);
	2095	}
	2096
	2097	=item *
	2098
	2099	Same thing, if you are really sure it is UTF-8
	2100
	2101	if ($] > 5.008) {
	2102	require Encode;
	2103	Encode::_utf8_on($val);
	2104	}
	2105
	2106	=item *
	2107
	2108	A wrapper for L<DBI> C<fetchrow_array> and C<fetchrow_hashref>
	2109
	2110	When the database contains only UTF-8, a wrapper function or method is
	2111	a convenient way to replace all your C<fetchrow_array> and
	2112	C<fetchrow_hashref> calls. A wrapper function will also make it easier to
	2113	adapt to future enhancements in your database driver. Note that at the
	2114	time of this writing (January 2012), the DBI has no standardized way
	2115	to deal with UTF-8 data. Please check the L<DBI documentation|DBI> to verify if
	2116	that is still true.
	2117
	2118	sub fetchrow {
	2119	# $what is one of fetchrow_{array,hashref}
	2120	my($self, $sth, $what) = @_;
	2121	if ($] < 5.008) {
	2122	return $sth->$what;
	2123	} else {
	2124	require Encode;
	2125	if (wantarray) {
	2126	my @arr = $sth->$what;
	2127	for (@arr) {
	2128	defined && /[^\000-\177]/ && Encode::_utf8_on($_);
	2129	}
	2130	return @arr;
	2131	} else {
	2132	my $ret = $sth->$what;
	2133	if (ref $ret) {
	2134	for my $k (keys %$ret) {
	2135	defined
	2136	&& /[^\000-\177]/
	2137	&& Encode::_utf8_on($_) for $ret->{$k};
	2138	}
	2139	return $ret;
	2140	} else {
	2141	defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
	2142	return $ret;
	2143	}
	2144	}
	2145	}
	2146	}
	2147
	2148
	2149	=item *
	2150
	2151	A large scalar that you know can only contain ASCII
	2152
	2153	Scalars that contain only ASCII and are marked as UTF-8 are sometimes
	2154	a drag to your program. If you recognize such a situation, just remove
	2155	the UTF8 flag:
	2156
	2157	utf8::downgrade($val) if $] > 5.008;
	2158
	2159	=back
	2160
	2161	=head1 BUGS
	2162
	2163	See also L</The "Unicode Bug"> above.
	2164
	2165	=head2 Interaction with Extensions
	2166
	2167	When Perl exchanges data with an extension, the extension should be
	2168	able to understand the UTF8 flag and act accordingly. If the
	2169	extension doesn't recognize that flag, it's likely that the extension
	2170	will return incorrectly-flagged data.
	2171
	2172	So if you're working with Unicode data, consult the documentation of
	2173	every module you're using to see if there are any issues with Unicode data
	2174	exchange. If the documentation does not talk about Unicode at all,
	2175	suspect the worst and probably look at the source to learn how the
	2176	module is implemented. Modules written completely in Perl shouldn't
	2177	cause problems. Modules that directly or indirectly access code written
	2178	in other programming languages are at risk.
	2179
	2180	For affected functions, the simple strategy to avoid data corruption is
	2181	to always make the encoding of the exchanged data explicit. Choose an
	2182	encoding that you know the extension can handle. Convert arguments passed
	2183	to the extensions to that encoding and convert results back from that
	2184	encoding. Write wrapper functions that do the conversions for you, so
	2185	you can later change the functions when the extension catches up.
	2186
	2187	To provide an example, let's say the popular C<Foo::Bar::escape_html>
	2188	function doesn't deal with Unicode data yet. The wrapper function
	2189	would convert the argument to raw UTF-8 and convert the result back to
	2190	Perl's internal representation like so:
	2191
	2192	sub my_escape_html ($) {
	2193	my($what) = shift;
	2194	return unless defined $what;
	2195	Encode::decode("UTF-8", Foo::Bar::escape_html(
	2196	Encode::encode("UTF-8", $what)));
	2197	}
	2198
	2199	Sometimes, when the extension does not convert data but just stores
	2200	and retrieves it, you will be able to use the otherwise
	2201	dangerous L<C<Encode::_utf8_on()>|Encode/_utf8_on> function. Let's say
	2202	the popular C<Foo::Bar> extension, written in C, provides a C<param>
	2203	method that lets you store and retrieve data according to these prototypes:
	2204
	2205	$self->param($name, $value); # set a scalar
	2206	$value = $self->param($name); # retrieve a scalar
	2207
	2208	If it does not yet provide support for any encoding, one could write a
	2209	derived class with such a C<param> method:
	2210
	2211	sub param {
	2212	my($self,$name,$value) = @_;
	2213	utf8::upgrade($name); # make sure it is UTF-8 encoded
	2214	if (defined $value) {
	2215	utf8::upgrade($value); # make sure it is UTF-8 encoded
	2216	return $self->SUPER::param($name,$value);
	2217	} else {
	2218	my $ret = $self->SUPER::param($name);
	2219	Encode::_utf8_on($ret); # we know, it is UTF-8 encoded
	2220	return $ret;
	2221	}
	2222	}
	2223
	2224	Some extensions provide filters on data entry/exit points, such as
	2225	C<DB_File::filter_store_key> and family. Look out for such filters in
	2226	the documentation of your extensions; they can make the transition to
	2227	Unicode data much easier.
	2228
	2229	=head2 Speed
	2230
	2231	Some functions are slower when working on UTF-8 encoded strings than
	2232	on byte encoded strings. All functions that need to hop over
	2233	characters such as C<length()>, C<substr()> or C<index()>, or matching
	2234	regular expressions can work B<much> faster when the underlying data are
	2235	byte-encoded.
	2236
	2237	In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
	2238	a caching scheme was introduced which improved the situation. In general,
	2239	operations with UTF-8 encoded strings are still slower. As an example,
	2240	the Unicode properties (character classes) like C<\p{Nd}> are known to
	2241	be quite a bit slower (5-20 times) than their simpler counterparts
	2242	like C<[0-9]> (then again, there are hundreds of Unicode characters matching
	2243	C<Nd> compared with the 10 ASCII characters matching C<[0-9]>).
	2244
	2245	=head1 SEE ALSO
	2246
	2247	L<perlunitut>, L<perluniintro>, L<perluniprops>, L<Encode>, L<open>, L<utf8>, L<bytes>,
	2248	L<perlretut>, L<perlvar/"${^UNICODE}">,
	2249	L<http://www.unicode.org/reports/tr44>.
	2250
	2251	=cut