perl5.git.perl.org Git - perl5.git/blame_incremental

Commit	Line	Data
	1	=head1 NAME
	2
	3	perlunicode - Unicode support in Perl
	4
	5	=head1 DESCRIPTION
	6
	7	=head2 Important Caveats
	8
	9	Unicode support is an extensive requirement. While Perl does not
	10	implement the Unicode standard or the accompanying technical reports
	11	from cover to cover, Perl does support many Unicode features.
	12
	13	People who want to learn to use Unicode in Perl should probably read
	14	the L<Perl Unicode tutorial, perlunitut|perlunitut>, before reading
	15	this reference document.
	16
	17	Also, the use of Unicode may present security issues that aren't obvious.
	18	Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.
	19
	20	=over 4
	21
	22	=item Input and Output Layers
	23
	24	Perl knows when a filehandle uses Perl's internal Unicode encodings
	25	(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
	26	the ":utf8" layer. Other encodings can be converted to Perl's
	27	encoding on input or from Perl's encoding on output by use of the
	28	":encoding(...)" layer. See L<open>.
	29
	30	To indicate that Perl source itself is in UTF-8, use C<use utf8;>.
	31
	32	=item Regular Expressions
	33
	34	The regular expression compiler produces polymorphic opcodes. That is,
	35	the pattern adapts to the data and automatically switches to the Unicode
	36	character scheme when presented with data that is internally encoded in
	37	UTF-8, or instead uses a traditional byte scheme when presented with
	38	byte data.
	39
	40	=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts
	41
	42	As a compatibility measure, the C<use utf8> pragma must be explicitly
	43	included to enable recognition of UTF-8 in the Perl scripts themselves
	44	(in string or regular expression literals, or in identifier names) on
	45	ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
	46	machines. B<These are the only times when an explicit C<use utf8>
	47	is needed.> See L<utf8>.
	48
	49	=item BOM-marked scripts and UTF-16 scripts autodetected
	50
	51	If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF-16BE,
	52	or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either
	53	endianness, Perl will correctly read in the script as Unicode.
	54	(BOMless UTF-8 cannot be effectively recognized or differentiated from
	55	ISO 8859-1 or other eight-bit encodings.)
	56
	57	=item C<use encoding> needed to upgrade non-Latin-1 byte strings
	58
	59	By default, there is a fundamental asymmetry in Perl's Unicode model:
	60	implicit upgrading from byte strings to Unicode strings assumes that
	61	they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
	62	downgraded with UTF-8 encoding. This happens because the first 256
	63	code points in Unicode happen to agree with Latin-1.
	64
	65	See L</"Byte and Character Semantics"> for more details.
	66
	67	=back
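
As a minimal sketch of the layer behavior described in the list above, the following writes one character through an C<:encoding(UTF-8)> layer and reads it back. It assumes an in-memory filehandle in place of a real file, and all variable names are illustrative only.

```perl
use strict;
use warnings;

# One character: U+263A WHITE SMILING FACE.
my $text = "\x{263A}";

# Encode on output: the :encoding(UTF-8) layer turns characters into bytes.
# An in-memory filehandle stands in for a real file here (an assumption
# made to keep the sketch self-contained).
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer or die "open for write: $!";
print {$out} $text;
close $out or die "close: $!";
my $byte_count = length $buffer;    # counts the encoded bytes

# Decode on input: the same layer turns the bytes back into characters.
open my $in, '<:encoding(UTF-8)', \$buffer or die "open for read: $!";
my $round_trip = <$in>;
close $in;
my $char_count = length $round_trip;    # counts characters
```

The byte count and the character count differ because the layer, not the program, is responsible for the conversion in each direction.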
	68
	69	=head2 Byte and Character Semantics
	70
	71	Beginning with version 5.6, Perl uses logically-wide characters to
	72	represent strings internally.
	73
	74	In future, Perl-level operations will be expected to work with
	75	characters rather than bytes.
	76
	77	However, as an interim compatibility measure, Perl aims to
	78	provide a safe migration path from byte semantics to character
	79	semantics for programs. For operations where Perl can unambiguously
	80	decide that the input data are characters, Perl switches to
	81	character semantics. For operations where this determination cannot
	82	be made without additional information from the user, Perl decides in
	83	favor of compatibility and chooses to use byte semantics.
	84
	85	Under byte semantics, when C<use locale> is in effect, Perl uses the
	86	semantics associated with the current locale. Absent a C<use locale>, and
	87	absent a C<use feature 'unicode_strings'> pragma, Perl currently uses US-ASCII
	88	(or Basic Latin in Unicode terminology) byte semantics, meaning that characters
	89	whose ordinal numbers are in the range 128 - 255 are undefined except for their
	90	ordinal numbers. This means that none have case (upper and lower), nor are any
	91	a member of character classes, like C<[:alpha:]> or C<\w>. (But all do belong
	92	to the C<\W> class or the Perl regular expression extension C<[:^alpha:]>.)
	93
	94	This behavior preserves compatibility with earlier versions of Perl,
	95	which allowed byte semantics in Perl operations only if
	96	none of the program's inputs were marked as being a source of Unicode
	97	character data. Such data may come from filehandles, from calls to
	98	external programs, from information provided by the system (such as %ENV),
	99	or from literals and constants in the source text.
	100
	101	The C<bytes> pragma will always, regardless of platform, force byte
	102	semantics in a particular lexical scope. See L<bytes>.
	103
	104	The C<use feature 'unicode_strings'> pragma is intended to always, regardless
	105	of platform, force character (Unicode) semantics in a particular lexical scope.
	106	In release 5.12, it is partially implemented, applying only to case changes.
	107	See L</The "Unicode Bug"> below.
	108
	109	The C<utf8> pragma is primarily a compatibility device that enables
	110	recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
	111	Note that this pragma is only required while Perl defaults to byte
	112	semantics; when character semantics become the default, this pragma
	113	may become a no-op. See L<utf8>.
	114
	115	Unless explicitly stated, Perl operators use character semantics
	116	for Unicode data and byte semantics for non-Unicode data.
	117	The decision to use character semantics is made transparently. If
	118	input data comes from a Unicode source--for example, if a character
	119	encoding layer is added to a filehandle or a literal Unicode
	120	string constant appears in a program--character semantics apply.
	121	Otherwise, byte semantics are in effect. The C<bytes> pragma should
	122	be used to force byte semantics on Unicode data, and the C<use feature
	123	'unicode_strings'> pragma to force Unicode semantics on byte data (though in
	124	5.12 it isn't fully implemented).
	125
	126	If strings operating under byte semantics and strings with Unicode
	127	character data are concatenated, the new string will have
	128	character semantics. This can cause surprises: See L</BUGS>, below.
	129	You can choose to be warned when this happens. See L<encoding::warnings>.
	130
	131	Under character semantics, many operations that formerly operated on
	132	bytes now operate on characters. A character in Perl is
	133	logically just a number ranging from 0 to 2**31 or so. Larger
	134	characters may encode into longer sequences of bytes internally, but
	135	this internal detail is mostly hidden for Perl code.
	136	See L<perluniintro> for more.
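
The difference between the two semantics can be sketched as follows. This is an illustration only, assuming an ASCII-based platform (the byte counts would differ under UTF-EBCDIC); the variable names are hypothetical.

```perl
use strict;
use warnings;

# Three ASCII characters plus U+263A, which needs three bytes in UTF-8.
my $string = "abc\x{263A}";

# Character semantics: length counts characters.
my $chars = length $string;

# Byte semantics forced lexically with the bytes pragma: length counts
# the bytes of Perl's internal representation instead.
my $bytes = do { use bytes; length $string };

# Implicit upgrade: a byte string joined to a Unicode string is treated
# as Latin-1, so byte 0xE9 becomes the character U+00E9.
my $byte_string = "\xE9";
my $joined      = $byte_string . "\x{263A}";
my $first_ord   = ord substr($joined, 0, 1);
my $joined_len  = length $joined;
```

The last three lines show the asymmetry noted earlier: the upgrade assumes Latin-1, so the byte keeps its ordinal value as a character.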
	137
	138	=head2 Effects of Character Semantics
	139
	140	Character semantics have the following effects:
	141
	142	=over 4
	143
	144	=item *
	145
	146	Strings--including hash keys--and regular expression patterns may
	147	contain characters that have an ordinal value larger than 255.
	148
	149	If you use a Unicode editor to edit your program, Unicode characters may
	150	occur directly within the literal strings in UTF-8 encoding, or UTF-16.
	151	(The former requires a BOM or C<use utf8>, the latter requires a BOM.)
	152
	153	Unicode characters can also be added to a string by using the C<\N{U+...}>
	154	notation. The Unicode code for the desired character, in hexadecimal,
	155	should be placed in the braces, after the C<U>. For instance, a smiley face is
	156	C<\N{U+263A}>.
	157
	158	Alternatively, you can use the C<\x{...}> notation for characters 0x100 and
	159	above. For characters below 0x100 you may get byte semantics instead of
	160	character semantics; see L</The "Unicode Bug">. On EBCDIC machines there is
	161	the additional problem that the value for such characters gives the EBCDIC
	162	character rather than the Unicode one.
	163
	164	Additionally, if you
	165
	166	    use charnames ':full';
	167
	168	you can use the C<\N{...}> notation and put the official Unicode
	169	character name within the braces, such as C<\N{WHITE SMILING FACE}>.
	170	See L<charnames>.
	171
	172	=item *
	173
	174	If an appropriate L<encoding> is specified, identifiers within the
	175	Perl script may contain Unicode alphanumeric characters, including
	176	ideographs. Perl does not currently attempt to canonicalize variable
	177	names.
	178
	179	=item *
	180
	181	Regular expressions match characters instead of bytes. "." matches
	182	a character instead of a byte.
	183
	184	=item *
	185
	186	Bracketed character classes in regular expressions match characters instead of
	187	bytes and match against the character properties specified in the
	188	Unicode properties database. C<\w> can be used to match a Japanese
	189	ideograph, for instance.
	190
	191	=item *
	192
	193	Named Unicode properties, scripts, and block ranges may be used (like bracketed
	194	character classes) by using the C<\p{}> "matches property" construct and
	195	the C<\P{}> negation, "doesn't match property".
	196	See L</"Unicode Character Properties"> for more details.
	197
	198	You can define your own character properties and use them
	199	in the regular expression with the C<\p{}> or C<\P{}> construct.
	200	See L</"User-Defined Character Properties"> for more details.
	201
	202	=item *
	203
	204	The special pattern C<\X> matches a logical character, an "extended grapheme
	205	cluster" in Standardese. In Unicode what appears to the user to be a single
	206	character, for example an accented C<G>, may in fact be composed of a sequence
	207	of characters, in this case a C<G> followed by an accent character. C<\X>
	208	will match the entire sequence.
	209
	210	=item *
	211
	212	The C<tr///> operator translates characters instead of bytes. Note
	213	that the C<tr///CU> functionality has been removed. For similar
	214	functionality see pack('U0', ...) and pack('C0', ...).
	215
	216	=item *
	217
	218	Case translation operators use the Unicode case translation tables
	219	when character input is provided. Note that C<uc()>, or C<\U> in
	220	interpolated strings, translates to uppercase, while C<ucfirst>,
	221	or C<\u> in interpolated strings, translates to titlecase in languages
	222	that make the distinction (which is equivalent to uppercase in languages
	223	without the distinction).
	224
	225	=item *
	226
	227	Most operators that deal with positions or lengths in a string will
	228	automatically switch to using character positions, including
	229	C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
	230	C<sprintf()>, C<write()>, and C<length()>. An operator that
	231	specifically does not switch is C<vec()>. Operators that really don't
	232	care include operators that treat strings as a bucket of bits such as
	233	C<sort()>, and operators dealing with filenames.
	234
	235	=item *
	236
	237	The C<pack()>/C<unpack()> letter C<C> does I<not> change, since it is often
	238	used for byte-oriented formats. Again, think C<char> in the C language.
	239
	240	There is a new C<U> specifier that converts between Unicode characters
	241	and code points. There is also a C<W> specifier that is the equivalent of
	242	C<chr>/C<ord> and properly handles character values even if they are above 255.
	243
	244	=item *
	245
	246	The C<chr()> and C<ord()> functions work on characters, similar to
	247	C<pack("W")> and C<unpack("W")>, I<not> C<pack("C")> and
	248	C<unpack("C")>. C<pack("C")> and C<unpack("C")> are methods for
	249	emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.
	250	While these methods reveal the internal encoding of Unicode strings,
	251	that is not something one normally needs to care about at all.
	252
	253	=item *
	254
	255	The bit string operators, C<& | ^ ~>, can operate on character data.
	256	However, for backward compatibility, such as when using bit string
	257	operations when characters are all less than 256 in ordinal value, one
	258	should not use C<~> (the bit complement) on data that mixes characters
	259	whose values are less than 256 with characters whose values are 256 or more. Most importantly,
	260	DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
	261	will not hold. The reason for this mathematical I<faux pas> is that
	262	the complement cannot return B<both> the 8-bit (byte-wide) bit
	263	complement B<and> the full character-wide bit complement.
	264
	265	=item *
	266
	267	You can define your own mappings to be used in C<lc()>,
	268	C<lcfirst()>, C<uc()>, and C<ucfirst()> (or their double-quoted string inlined
	269	versions such as C<\U>). See
	270	L<User-Defined Case-Mappings|/"User-Defined Case Mappings (for serious hackers only)">
	271	for more details.
	272
	273	=back
	274
	275	=over 4
	276
	277	=item *
	278
	279	And finally, C<scalar reverse()> reverses by character rather than by byte.
	280
	281	=back
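
Several of the effects listed above can be seen in a short sketch. It is illustrative rather than exhaustive, and the variable names are assumptions chosen for this example.

```perl
use strict;
use warnings;
use charnames ':full';

# Three spellings of the same character, U+263A.
my $by_code = "\N{U+263A}";
my $by_hex  = "\x{263A}";
my $by_name = "\N{WHITE SMILING FACE}";

# \X matches a whole extended grapheme cluster: here "G" followed by
# U+0301 COMBINING ACUTE ACCENT, which is two characters long.
my $accented_g  = "G\x{0301}";
my $one_cluster = $accented_g =~ /\A\X\z/ ? 1 : 0;

# uc() maps to uppercase, ucfirst() to titlecase. U+01C6 (the "dz with
# caron" digraph) has distinct uppercase (U+01C4) and titlecase (U+01C5).
my $upper_ord = ord(uc("\x{01C6}"));
my $title_ord = ord(ucfirst("\x{01C6}"));

# pack "W" behaves like chr, even for values above 255.
my $packed = pack('W', 0x263A);

# scalar reverse reverses by character, not by byte.
my $reversed = scalar reverse("a\x{263A}b");
```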
	282
	283	=head2 Unicode Character Properties
	284
	285	Most Unicode character properties are accessible by using regular expressions.
	286	They are used (like bracketed character classes) by using the C<\p{}> "matches
	287	property" construct and the C<\P{}> negation, "doesn't match property".
	288
	289	Note that the only time that Perl considers a sequence of individual code
	290	points as a single logical character is in the C<\X> construct, already
	291	mentioned above. Therefore "character" in this discussion means a single
	292	Unicode code point.
	293
	294	For instance, C<\p{Uppercase}> matches any single character with the Unicode
	295	"Uppercase" property, while C<\p{L}> matches any character with a
	296	General_Category of "L" (letter). Braces are not
	297	required for single-letter property names, so C<\p{L}> is equivalent to C<\pL>.
	298
	299	More formally, C<\p{Uppercase}> matches any single character whose Unicode
	300	Uppercase property value is True, and C<\P{Uppercase}> matches any character
	301	whose Uppercase property value is False, and they could have been written as
	302	C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively.
	303
	304	This formality is needed when properties are not binary, that is if they can
	305	take on more values than just True and False. For example, the Bidi_Class (see
	306	L</"Bidirectional Character Types"> below), can take on a number of different
	307	values, such as Left, Right, Whitespace, and others. To match these, one needs
	308	to specify the property name (Bidi_Class), and the value being matched against
	309	(Left, Right, etc.). This is done, as in the examples above, by having the
	310	two components separated by an equal sign (or interchangeably, a colon), like
	311	C<\p{Bidi_Class: Left}>.
	312
	313	All Unicode-defined character properties may be written in these compound forms
	314	of C<\p{property=value}> or C<\p{property:value}>, but Perl provides some
	315	additional properties that are written only in the single form, as well as
	316	single-form short-cuts for all binary properties and certain others described
	317	below, in which you may omit the property name and the equals or colon
	318	separator.
	319
	320	Most Unicode character properties have at least two synonyms (or aliases if you
	321	prefer), a short one that is easier to type, and a longer one that is more
	322	descriptive and hence easier to understand. Thus the "L"
	323	and "Letter" above are equivalent and can be used interchangeably. Likewise,
	324	"Upper" is a synonym for "Uppercase", and we could have written
	325	C<\p{Uppercase}> equivalently as C<\p{Upper}>. Also, there are typically
	326	various synonyms for the values the property can be. For binary properties,
	327	"True" has 3 synonyms: "T", "Yes", and "Y"; and "False" correspondingly has "F",
	328	"No", and "N". But be careful. A short form of a value for one property may
	329	not mean the same thing as the same short form for another. Thus, for the
	330	General_Category property, "L" means "Letter", but for the Bidi_Class property,
	331	"L" means "Left". A complete list of properties and synonyms is in
	332	L<perluniprops>.
	333
	334	Upper/lower case differences in the property names and values are irrelevant,
	335	thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>.
	336	Similarly, you can add or subtract underscores anywhere in the middle of a
	337	word, so that these are also equivalent to C<\p{U_p_p_e_r}>. And white space
	338	is irrelevant adjacent to non-word characters, such as the braces and the equals
	339	or colon separators so C<\p{ Upper }> and C<\p{ Upper_case : Y }> are
	340	equivalent to these as well. In fact, in most cases, white space and even
	341	hyphens can be added or deleted anywhere. So even C<\p{ Up-per case = Yes}> is
	342	equivalent. All this is called "loose-matching" by Unicode. The few places
	343	where stricter matching is employed are in the middle of numbers, and in the Perl
	344	extension properties that begin or end with an underscore. Stricter matching
	345	cares about white space (except adjacent to the non-word characters) and
	346	hyphens, and non-interior underscores.
	347
	348	You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
	349	(^) between the first brace and the property name: C<\p{^Tamil}> is
	350	equal to C<\P{Tamil}>.
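
The forms discussed in this section can be tried out with a sketch like the one below. The sample characters and variable names are assumptions made for illustration, and the Bidi_Class values are written with the short names C<L> and C<R>.

```perl
use strict;
use warnings;

# Binary property, in single and compound forms.
my $single   = "A" =~ /\p{Uppercase}/      ? 1 : 0;
my $compound = "A" =~ /\p{Uppercase=True}/ ? 1 : 0;
my $negated  = "a" =~ /\P{Uppercase}/      ? 1 : 0;

# Single-letter names need no braces: \pL is \p{L}.
my $no_braces = "A" =~ /\pL/ ? 1 : 0;

# Loose matching: case and surrounding white space are ignored.
my $loose = ("A" =~ /\p{upper}/ && "A" =~ /\p{ Upper }/) ? 1 : 0;

# A non-binary property needs both a name and a value.
# U+05D0 is HEBREW LETTER ALEF.
my $left_to_right = "A"        =~ /\p{Bidi_Class: L}/ ? 1 : 0;
my $right_to_left = "\x{05D0}" =~ /\p{Bidi_Class:R}/  ? 1 : 0;

# \p{^...} is the same as \P{...}. U+0B85 is TAMIL LETTER A.
my $caret_form =
    ("A" =~ /\p{^Tamil}/ && "\x{0B85}" !~ /\p{^Tamil}/) ? 1 : 0;
```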
	351
	352	=head3 B<General_Category>
	353
	354	Every Unicode character is assigned a general category, which is the "most
	355	usual categorization of a character" (from
	356	L<http://www.unicode.org/reports/tr44>).
	357
	358	The compound way of writing these is like C<\p{General_Category=Number}>
	359	(short, C<\p{gc:n}>). But Perl furnishes shortcuts in which everything up
	360	through the equal or colon separator is omitted. So you can instead just write
	361	C<\pN>.
	362
	363	Here are the short and long forms of the General Category properties:
	364
	365	    Short       Long
	366
	367	    L           Letter
	368	    LC, L&      Cased_Letter (that is: [\p{Ll}\p{Lu}\p{Lt}])
	369	    Lu          Uppercase_Letter
	370	    Ll          Lowercase_Letter
	371	    Lt          Titlecase_Letter
	372	    Lm          Modifier_Letter
	373	    Lo          Other_Letter
	374
	375	    M           Mark
	376	    Mn          Nonspacing_Mark
	377	    Mc          Spacing_Mark
	378	    Me          Enclosing_Mark
	379
	380	    N           Number
	381	    Nd          Decimal_Number (also Digit)
	382	    Nl          Letter_Number
	383	    No          Other_Number
	384
	385	    P           Punctuation (also Punct)
	386	    Pc          Connector_Punctuation
	387	    Pd          Dash_Punctuation
	388	    Ps          Open_Punctuation
	389	    Pe          Close_Punctuation
	390	    Pi          Initial_Punctuation
	391	                (may behave like Ps or Pe depending on usage)
	392	    Pf          Final_Punctuation
	393	                (may behave like Ps or Pe depending on usage)
	394	    Po          Other_Punctuation
	395
	396	    S           Symbol
	397	    Sm          Math_Symbol
	398	    Sc          Currency_Symbol
	399	    Sk          Modifier_Symbol
	400	    So          Other_Symbol
	401
	402	    Z           Separator
	403	    Zs          Space_Separator
	404	    Zl          Line_Separator
	405	    Zp          Paragraph_Separator
	406
	407	    C           Other
	408	    Cc          Control (also Cntrl)
	409	    Cf          Format
	410	    Cs          Surrogate (not usable)
	411	    Co          Private_Use
	412	    Cn          Unassigned
	413
	414	Single-letter properties match all characters in any of the
	415	two-letter sub-properties starting with the same letter.
	416	C<LC> and C<L&> are special cases, which are both aliases for the set consisting of everything matched by C<Ll>, C<Lu>, and C<Lt>.
	417
	418	Because Perl hides the need for the user to understand the internal
	419	representation of Unicode characters, there is no need to implement
	420	the somewhat messy concept of surrogates. C<Cs> is therefore not
	421	supported.
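
A short sketch of how the single-letter and two-letter categories relate, using three sample characters picked for this illustration (the names are hypothetical):

```perl
use strict;
use warnings;

my $digit = "5";           # Nd: DIGIT FIVE
my $roman = "\x{2167}";    # Nl: ROMAN NUMERAL EIGHT
my $half  = "\x{00BD}";    # No: VULGAR FRACTION ONE HALF

# \pN covers every two-letter N* category; \p{Nd} covers only one of them.
my $number_count  = grep { /\A\pN\z/ }    ($digit, $roman, $half);
my $decimal_count = grep { /\A\p{Nd}\z/ } ($digit, $roman, $half);

# The compound forms say the same thing at greater length.
my $long_form  = $roman =~ /\p{General_Category=Letter_Number}/ ? 1 : 0;
my $short_form = $roman =~ /\p{gc:n}/                           ? 1 : 0;

# LC (or L&) matches only cased letters. U+05D0 HEBREW LETTER ALEF is a
# letter (Lo) but has no case.
my $cased_a    = "a"        =~ /\p{LC}/ ? 1 : 0;
my $cased_alef = "\x{05D0}" =~ /\p{LC}/ ? 1 : 0;
my $alef_is_l  = "\x{05D0}" =~ /\pL/    ? 1 : 0;
```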
	422
	423	=head3 B<Bidirectional Character Types>
	424
	425	Because scripts differ in their directionality (Hebrew is
	426	written right to left, for example) Unicode supplies these properties in
	427	the Bidi_Class class:
	428
	429	    Property    Meaning
	430
	431	    L           Left-to-Right
	432	    LRE         Left-to-Right Embedding
	433	    LRO         Left-to-Right Override
	434	    R           Right-to-Left
	435	    AL          Arabic Letter
	436	    RLE         Right-to-Left Embedding
	437	    RLO         Right-to-Left Override
	438	    PDF         Pop Directional Format
	439	    EN          European Number
	440	    ES          European Separator
	441	    ET          European Terminator
	442	    AN          Arabic Number
	443	    CS          Common Separator
	444	    NSM         Non-Spacing Mark
	445	    BN          Boundary Neutral
	446	    B           Paragraph Separator
	447	    S           Segment Separator
	448	    WS          Whitespace
	449	    ON          Other Neutrals
	450
	451	This property is always written in the compound form.
	452	For example, C<\p{Bidi_Class:R}> matches characters that are normally
	453	written right to left.
	454
	455	=head3 B<Scripts>
	456
	457	The world's languages are written in a number of scripts. This sentence
	458	(unless you're reading it in translation) is written in Latin, while Russian is
	459	written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in
	460	Hiragana or Katakana. There are many more.
	461
	462	The Unicode Script property gives what script a given character is in,
	463	and the property can be specified with the compound form like
	464	C<\p{Script=Hebrew}> (short: C<\p{sc=hebr}>). Perl furnishes shortcuts for all
	465	script names. You can omit everything up through the equals (or colon), and
	466	simply write C<\p{Latin}> or C<\P{Cyrillic}>.
	467
	468	A complete list of scripts and their shortcuts is in L<perluniprops>.
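
As a rough illustration of the script forms above, the following matches one Hebrew letter three ways and then counts characters by script in a mixed string. The sample strings and variable names are assumptions for this sketch.

```perl
use strict;
use warnings;

my $alef = "\x{05D0}";    # HEBREW LETTER ALEF

# Compound form, short compound form, and the Perl shortcut.
my $compound = $alef =~ /\p{Script=Hebrew}/ ? 1 : 0;
my $short    = $alef =~ /\p{sc=hebr}/       ? 1 : 0;
my $shortcut = $alef =~ /\p{Hebrew}/        ? 1 : 0;

# Latin "abc" followed by two Cyrillic letters (U+0434, U+0430).
my $mixed = "abc\x{0434}\x{0430}";

# Assigning a list-context match to () and then to a scalar counts matches.
my $latin_count        = () = $mixed =~ /\p{Latin}/g;
my $cyrillic_count     = () = $mixed =~ /\p{Cyrillic}/g;
my $non_cyrillic_count = () = $mixed =~ /\P{Cyrillic}/g;
```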
	469
	470	=head3 B<Use of "Is" Prefix>
	471
	472	For backward compatibility (with Perl 5.6), all properties mentioned
	473	so far may have C<Is> or C<Is_> prepended to their name, so C<\P{Is_Lu}>, for
	474	example, is equal to C<\P{Lu}>, and C<\p{IsScript:Arabic}> is equal to
	475	C<\p{Arabic}>.
	476
	477	=head3 B<Blocks>
	478
	479	In addition to B<scripts>, Unicode also defines B<blocks> of
	480	characters. The difference between scripts and blocks is that the
	481	concept of scripts is closer to natural languages, while the concept
	482	of blocks is more of an artificial grouping based on groups of Unicode
	483	characters with consecutive ordinal values. For example, the "Basic Latin"
	484	block is all characters whose ordinals are between 0 and 127, inclusive, in
	485	other words, the ASCII characters. The "Latin" script contains some letters
	486	from this block as well as several more, like "Latin-1 Supplement",
	487	"Latin Extended-A", etc., but it does not contain all the characters from
	488	those blocks. It does not, for example, contain digits, because digits are
	489	shared across many scripts. Digits and similar groups, like punctuation, are in
	490	the script called C<Common>. There is also a script called C<Inherited> for
	491	characters that modify other characters, and inherit the script value of the
	492	controlling character.
	493
	494	For more about scripts versus blocks, see UAX#24 "Unicode Script Property":
	495	L<http://www.unicode.org/reports/tr24>
	496
	497	The Script property is likely to be the one you want to use when processing
	498	natural language; the Block property may be useful in working with the nuts and
	499	bolts of Unicode.
	500
	501	Block names are matched in the compound form, like C<\p{Block: Arrows}> or
	502	C<\p{Blk=Hebrew}>. Unlike most other properties only a few block names have a
	503	Unicode-defined short name. But Perl does provide a (slight) shortcut: You
	504	can say, for example C<\p{In_Arrows}> or C<\p{In_Hebrew}>. For backwards
	505	compatibility, the C<In> prefix may be omitted if there is no naming conflict
	506	with a script or any other property, and you can even use an C<Is> prefix
	507	instead in those cases. But it is not a good idea to do this, for a couple
	508	of reasons:
	509
	510	=over 4
	511
	512	=item 1
	513
	514	It is confusing. There are many naming conflicts, and you may forget some.
	515	For example, C<\p{Hebrew}> means the I<script> Hebrew, and NOT the I<block>
	516	Hebrew. But would you remember that 6 months from now?
	517
	518	=item 2
	519
	520	It is unstable. A new version of Unicode may pre-empt the current meaning by
	521	creating a property with the same name. There was a time in very early Unicode
	522	releases when C<\p{Hebrew}> would have matched the I<block> Hebrew; now it
	523	doesn't.
	524
	525	=back
	526
	527	Some people just prefer to always use C<\p{Block: foo}> and C<\p{Script: bar}>
	528	instead of the shortcuts, for clarity, and because they can't remember the
	529	difference between 'In' and 'Is' anyway (or aren't confident that those who
	530	eventually will read their code will know).
	531
	532	A complete list of blocks and their shortcuts is in L<perluniprops>.
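
The difference between blocks and scripts can be sketched with the digit example from the text above. The sample characters and variable names are chosen for illustration only.

```perl
use strict;
use warnings;

my $digit = "5";           # DIGIT FIVE
my $arrow = "\x{2190}";    # LEFTWARDS ARROW
my $alef  = "\x{05D0}";    # HEBREW LETTER ALEF

# A digit sits in the Basic Latin block, yet its script is Common.
my $digit_in_block  = $digit =~ /\p{Block: Basic_Latin}/ ? 1 : 0;
my $digit_in_latin  = $digit =~ /\p{Script: Latin}/      ? 1 : 0;
my $digit_in_common = $digit =~ /\p{Script: Common}/     ? 1 : 0;

# The compound form and the In_ shortcut name the same block.
my $arrow_compound = $arrow =~ /\p{Block: Arrows}/ ? 1 : 0;
my $arrow_shortcut = $arrow =~ /\p{In_Arrows}/     ? 1 : 0;

# Spelling out Block and Script avoids the Hebrew/Hebrew ambiguity.
my $alef_block  = $alef =~ /\p{Blk=Hebrew}/     ? 1 : 0;
my $alef_script = $alef =~ /\p{Script: Hebrew}/ ? 1 : 0;
```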
	533
	534	=head3 B<Other Properties>
	535
	536	There are many more properties than the very basic ones described here.
	537	A complete list is in L<perluniprops>.
	538
	539	Unicode defines all its properties in the compound form, so all single-form
	540	properties are Perl extensions. A number of these are just synonyms for the
	541	Unicode ones, but some are genuine extensions, including a couple that are in
	542	the compound form. And quite a few of these are actually recommended by Unicode
	543	(in L<http://www.unicode.org/reports/tr18>).
	544
	545	This section gives some details on all the extensions that aren't synonyms for
	546	compound-form Unicode properties (for those, you'll have to refer to the
	547	L<Unicode Standard|http://www.unicode.org/reports/tr44>).
	548
	549	=over
	550
	551	=item B<C<\p{All}>>
	552
	553	This matches any of the 1_114_112 Unicode code points. It is a synonym for
	554	C<\p{Any}>.
	555
	556	=item B<C<\p{Alnum}>>
	557
	558	This matches any C<\p{Alphabetic}> or C<\p{Decimal_Number}> character.
	559
	560	=item B<C<\p{Any}>>
	561
	562	This matches any of the 1_114_112 Unicode code points. It is a synonym for
	563	C<\p{All}>.
	564
	565	=item B<C<\p{Assigned}>>
	566
	567	This matches any assigned code point; that is, any code point whose general
	568	category is not Unassigned (or equivalently, not Cn).
	569
	570	=item B<C<\p{Blank}>>
	571
	572	This is the same as C<\h> and C<\p{HorizSpace}>: A character that changes the
	573	spacing horizontally.
	574
	575	=item B<C<\p{Decomposition_Type: Non_Canonical}>> (Short: C<\p{Dt=NonCanon}>)
	576
	577	Matches a character that has a non-canonical decomposition.
	578
	579	To understand the use of this rarely used property=value combination, it is
	580	necessary to know some basics about decomposition.
	581	Consider a character, say H. It could appear with various marks around it,
	582	such as an acute accent, or a circumflex, or various hooks, circles, arrows,
	583	I<etc.>, above, below, to one side and/or the other, etc. There are many
	584	possibilities among the world's languages. The number of combinations is
	585	astronomical, and if there were a character for each combination, it would
	586	soon exhaust Unicode's more than a million possible characters. So Unicode
	587	took a different approach: there is a character for the base H, and a
	588	character for each of the possible marks, and they can be combined variously
	589	to get a final logical character. So a logical character--what appears to be a
	590	single character--can be a sequence of more than one individual character.
	591	This is called an "extended grapheme cluster". (Perl furnishes the C<\X>
	592	construct to match such sequences.)
	593
	594	But Unicode's intent is to unify the existing character set standards and
	595	practices, and a number of pre-existing standards have single characters that
	596	mean the same thing as some of these combinations. An example is ISO-8859-1,
	597	which has quite a few of these in the Latin-1 range, an example being "LATIN
	598	CAPITAL LETTER E WITH ACUTE". Because this character was in this pre-existing
	599	standard, Unicode added it to its repertoire. But this character is considered
	600	by Unicode to be equivalent to the sequence consisting of first the character
	601	"LATIN CAPITAL LETTER E", then the character "COMBINING ACUTE ACCENT".
	602
	603	"LATIN CAPITAL LETTER E WITH ACUTE" is called a "pre-composed" character, and
	604	the equivalence with the sequence is called canonical equivalence. All
	605	pre-composed characters are said to have a decomposition (into the equivalent
	606	sequence) and the decomposition type is also called canonical.
	607
	608	However, many more characters have a different type of decomposition, a
	609	"compatible" or "non-canonical" decomposition. The sequences that form these
	610	decompositions are not considered canonically equivalent to the pre-composed
	611	character. An example, again in the Latin-1 range, is the "SUPERSCRIPT ONE".
	612	It is kind of like a regular digit 1, but not exactly; its decomposition
	613	into the digit 1 is called a "compatible" decomposition, specifically a
	614	"super" decomposition. There are several such compatibility
	615	decompositions (see L<http://www.unicode.org/reports/tr44>), including one
	616	called "compat" which means some miscellaneous type of decomposition
	617	that doesn't fit into the decomposition categories that Unicode has chosen.
	618
	619	Note that most Unicode characters don't have a decomposition, so their
	620	decomposition type is "None".
	621
	622	Perl has added the C<Non_Canonical> type, for your convenience, to mean any of
	623	the compatibility decompositions.
	624
	625	=item B<C<\p{Graph}>>
	626
	627	Matches any character that is graphic. Theoretically, this means a character
	628	that on a printer would cause ink to be used.
	629
	630	=item B<C<\p{HorizSpace}>>
	631
	632	This is the same as C<\h> and C<\p{Blank}>: A character that changes the
	633	spacing horizontally.
	634
	635	=item B<C<\p{In=*}>>
	636
	637	This is a synonym for C<\p{Present_In=*}>.
	638
	639	=item B<C<\p{PerlSpace}>>
	640
	641	This is the same as C<\s>, restricted to ASCII, namely C<S<[ \f\n\r\t]>>.
	642
	643	Mnemonic: Perl's (original) space
	644
	645	=item B<C<\p{PerlWord}>>
	646
	647	This is the same as C<\w>, restricted to ASCII, namely C<[A-Za-z0-9_]>.
	648
	649	Mnemonic: Perl's (original) word.
	650
	651	=item B<C<\p{PosixAlnum}>>
	652
	653	This matches any alphanumeric character in the ASCII range, namely
	654	C<[A-Za-z0-9]>.
	655
	656	=item B<C<\p{PosixAlpha}>>
	657
	658	This matches any alphabetic character in the ASCII range, namely C<[A-Za-z]>.
	659
	660	=item B<C<\p{PosixBlank}>>
	661
	662	This matches any blank character in the ASCII range, namely C<S<[ \t]>>.
	663
	664	=item B<C<\p{PosixCntrl}>>
	665
	666	This matches any control character in the ASCII range, namely C<[\x00-\x1F\x7F]>.
	667
	668	=item B<C<\p{PosixDigit}>>
	669
	670	This matches any digit character in the ASCII range, namely C<[0-9]>.
	671
	672	=item B<C<\p{PosixGraph}>>
	673
	674	This matches any graphical character in the ASCII range, namely C<[\x21-\x7E]>.
	675
	676	=item B<C<\p{PosixLower}>>
	677
	678	This matches any lowercase character in the ASCII range, namely C<[a-z]>.
	679
	680	=item B<C<\p{PosixPrint}>>
	681
	682	This matches any printable character in the ASCII range, namely C<[\x20-\x7E]>.
	683	These are the graphical characters plus SPACE.
	684
	685	=item B<C<\p{PosixPunct}>>
	686
	687	This matches any punctuation character in the ASCII range, namely
	688	C<[\x21-\x2F\x3A-\x40\x5B-\x60\x7B-\x7E]>. These are the
	689	graphical characters that aren't word characters. Note that the Posix standard
	690	includes in its definition of punctuation, those characters that Unicode calls
	691	"symbols."
	692
	693	=item B<C<\p{PosixSpace}>>
	694
	695	This matches any space character in the ASCII range, namely
	696	C<S<[ \f\n\r\t\x0B]>> (the last being a vertical tab).
	697
	698	=item B<C<\p{PosixUpper}>>
	699
	700	This matches any uppercase character in the ASCII range, namely C<[A-Z]>.
	701
	702	=item B<C<\p{Present_In: *}>> (Short: C<\p{In=*}>)
	703
	704	This property is used when you need to know in which Unicode version(s) a
	705	character is present.
	706
	707	The "*" above stands for some two-digit Unicode version number, such as
	708	C<1.1> or C<4.0>; or the "*" can also be C<Unassigned>. This property will
	709	match the code points whose final disposition has been settled as of the
	710	Unicode release given by the version number; C<\p{Present_In: Unassigned}>
	711	will match those code points whose meaning has yet to be assigned.
	712
	713	For example, C<U+0041> "LATIN CAPITAL LETTER A" was present in the very first
	714	Unicode release available, which is C<1.1>, so this property is true for all
	715	valid "*" versions. On the other hand, C<U+1EFF> was not assigned until version
	716	5.1 when it became "LATIN SMALL LETTER Y WITH LOOP", so the only "*" values
	717	that match it are 5.1, 5.2, and later.
	718
	719	Unicode furnishes the C<Age> property from which this is derived. The problem
	720	with Age is that a strict interpretation of it (which Perl takes) has it
	721	matching the precise release a code point's meaning is introduced in. Thus
	722	C<U+0041> would match only 1.1; and C<U+1EFF> only 5.1. This is not usually what
	723	you want.
	724
	725	Some non-Perl implementations of the Age property may change its meaning to be
	726	the same as the Perl Present_In property; just be aware of that.
	727
	728	Another confusion with both these properties is that the definition is not
	729	that the code point has been assigned, but that the meaning of the code point
	730	has been determined. This is because 66 code points will always be
	731	unassigned, so the Age for them is the Unicode version in which the
	732	decision to make them so was made. For example, C<U+FDD0> is to be
	733	permanently unassigned to a character, and the decision to do that was made
	734	in version 3.1, so C<\p{Age=3.1}> matches this character, and
	735	C<\p{Present_In: 3.1}> and up match as well.
	736
	737	=item B<C<\p{Print}>>
	738
	739	This matches any character that is graphical or blank, except controls.
	740
	741	=item B<C<\p{SpacePerl}>>
	742
	743	This is the same as C<\s>, including beyond ASCII.
	744
	745	Mnemonic: Space, as modified by Perl. (It doesn't include the vertical tab
	746	which both the Posix standard and Unicode consider to be space.)
	747
	748	=item B<C<\p{VertSpace}>>
	749
	750	This is the same as C<\v>: A character that changes the spacing vertically.
	751
	752	=item B<C<\p{Word}>>
	753
	754	This is the same as C<\w>, including beyond ASCII.
	755
	756	=back
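
The difference between C<Present_In> and C<Age> described above can be checked directly. The following is a minimal sketch: the helper names are invented for illustration, and it assumes a Perl whose Unicode tables cover version 5.2 or later.

```perl
use strict;
use warnings;

# Each hypothetical helper returns 1 or 0 so results are easy to print.
sub present_in_5_0 { return $_[0] =~ /\A\p{Present_In: 5.0}\z/ ? 1 : 0 }
sub present_in_5_2 { return $_[0] =~ /\A\p{Present_In: 5.2}\z/ ? 1 : 0 }
sub age_is_5_1     { return $_[0] =~ /\A\p{Age=5.1}\z/         ? 1 : 0 }
sub age_is_5_2     { return $_[0] =~ /\A\p{Age=5.2}\z/         ? 1 : 0 }

# "A" has been present since 1.1; U+1EFF was first assigned in 5.1.
for my $char ("A", "\x{1EFF}") {
    printf "U+%04X  In=5.0:%d  In=5.2:%d  Age=5.1:%d  Age=5.2:%d\n",
        ord($char),
        present_in_5_0($char), present_in_5_2($char),
        age_is_5_1($char),     age_is_5_2($char);
}
```

C<Present_In> is cumulative, so it stays true for every release from the one that introduced the character onward, whereas C<Age> is true for that one release only.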
	757
	758	=head2 User-Defined Character Properties
	759
	760	You can define your own binary character properties by defining subroutines
	761	whose names begin with "In" or "Is". The subroutines can be defined in any
	762	package. The user-defined properties can be used in the regular expression
	763	C<\p> and C<\P> constructs; if you are using a user-defined property from a
	764	package other than the one you are in, you must specify its package in the
	765	C<\p> or C<\P> construct.
	766
	767	# assuming property IsForeign defined in Lang::
	768	package main; # property package name required
	769	if ($txt =~ /\p{Lang::IsForeign}+/) { ... }
	770
	771	package Lang; # property package name not required
	772	if ($txt =~ /\p{IsForeign}+/) { ... }
	773
	774
	775	Note that the effect is compile-time and immutable once defined.
	776
	777	The subroutines must return a specially-formatted string, with one
	778	or more newline-separated lines. Each line must be one of the following:
	779
	780	=over 4
	781
	782	=item *
	783
	784	A single hexadecimal number denoting a Unicode code point to include.
	785
	786	=item *
	787
	788	Two hexadecimal numbers separated by horizontal whitespace (space or
	789	tabular characters) denoting a range of Unicode code points to include.
	790
	791	=item *
	792
	793	Something to include, prefixed by "+": a built-in character
	794	property (prefixed by "utf8::") or a user-defined character property,
	795	to represent all the characters in that property; two hexadecimal code
	796	points for a range; or a single hexadecimal code point.
	797
	798	=item *
	799
	800	Something to exclude, prefixed by "-": an existing character
	801	property (prefixed by "utf8::") or a user-defined character property,
	802	to represent all the characters in that property; two hexadecimal code
	803	points for a range; or a single hexadecimal code point.
	804
	805	=item *
	806
	807	Something to negate, prefixed by "!": an existing character
	808	property (prefixed by "utf8::") or a user-defined character property,
	809	to represent all the characters in that property; two hexadecimal code
	810	points for a range; or a single hexadecimal code point.
	811
	812	=item *
	813
	814	Something to intersect with, prefixed by "&": an existing character
	815	property (prefixed by "utf8::") or a user-defined character property,
	816	to keep only the characters that are also in that property; two
	817	hexadecimal code points for a range; or a single hexadecimal code point.
	818
	819	=back
	820
	821	For example, to define a property that covers both the Japanese
	822	syllabaries (hiragana and katakana), you can define
	823
	824	sub InKana {
	825	return <<END;
	826	3040\t309F
	827	30A0\t30FF
	828	END
	829	}
	830
	831	Imagine that the here-doc end marker is at the beginning of the line.
	832	Now you can use C<\p{InKana}> and C<\P{InKana}>.
	833
	834	You could also have used the existing block property names:
	835
	836	sub InKana {
	837	return <<'END';
	838	+utf8::InHiragana
	839	+utf8::InKatakana
	840	END
	841	}
	842
	843	Suppose you wanted to match only the allocated characters,
	844	not the raw block ranges: in other words, you want to remove
	845	the unassigned code points:
	846
	847	sub InKana {
	848	return <<'END';
	849	+utf8::InHiragana
	850	+utf8::InKatakana
	851	-utf8::IsCn
	852	END
	853	}
	854
	855	The negation is useful for defining (surprise!) negated classes.
	856
	857	sub InNotKana {
	858	return <<'END';
	859	!utf8::InHiragana
	860	-utf8::InKatakana
	861	+utf8::IsCn
	862	END
	863	}
	864
	865	Intersection is useful for getting the common characters matched by
	866	two (or more) classes.
	867
	868	sub InFooAndBar {
	869	return <<'END';
	870	+main::Foo
	871	&main::Bar
	872	END
	873	}
	874
	875	It's important to remember not to use "&" for the first set; that
	876	would be intersecting with nothing (resulting in an empty set).
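
Putting the pieces above together, here is a small self-contained sketch that defines two properties in package C<main> and uses them in matches. C<InKanaAssigned> and the two C<is_...> helpers are hypothetical names chosen for this illustration.

```perl
use strict;
use warnings;

# Raw block ranges, as in the first InKana example.
sub InKana {
    return <<"END";
3040\t309F
30A0\t30FF
END
}

# Hypothetical second property: the same ranges minus unassigned code points.
sub InKanaAssigned {
    return <<'END';
+main::InKana
-utf8::IsCn
END
}

sub is_kana          { return $_[0] =~ /\A\p{InKana}\z/         ? 1 : 0 }
sub is_assigned_kana { return $_[0] =~ /\A\p{InKanaAssigned}\z/ ? 1 : 0 }

# U+3042 is HIRAGANA LETTER A; U+3040 lies in the block but is unassigned.
for my $cp (0x3042, 0x3040, 0x0041) {
    printf "U+%04X  InKana:%d  InKanaAssigned:%d\n",
        $cp, is_kana(chr $cp), is_assigned_kana(chr $cp);
}
```

The first here-doc is interpolating so that C<\t> becomes a real tab; the second is single-quoted because it contains nothing to interpolate.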
	877
	878	=head2 User-Defined Case Mappings (for serious hackers only)
	879
	880	You can also define your own mappings to be used in C<lc()>,
	881	C<lcfirst()>, C<uc()>, and C<ucfirst()> (or their string-inlined versions,
	882	C<\L>, C<\l>, C<\U>, and C<\u>). The mappings are currently only valid
	883	on strings encoded in UTF-8, but see below for a partial workaround for
	884	this restriction.
	885
	886	The principle is similar to that of user-defined character
	887	properties: define subroutines that do the mappings.
	888	C<ToLower> is used for C<lc()>, C<\L>, C<lcfirst()>, and C<\l>; C<ToTitle> for
	889	C<ucfirst()> and C<\u>; and C<ToUpper> for C<uc()> and C<\U>.
	890
	891	C<ToUpper()> should look something like this:
	892
	893	sub ToUpper {
	894	return <<END;
	895	0061\t007A\t0041
	896	0101\t\t0100
	897	END
	898	}
	899
	900	This sample C<ToUpper()> has the effect of mapping "a-z" to "A-Z", 0x101
	901	to 0x100, and all other characters map to themselves. The first
	902	returned line means to map the code point at 0x61 ("a") to 0x41 ("A"),
	903	the code point at 0x62 ("b") to 0x42 ("B"), ..., 0x7A ("z") to 0x5A
	904	("Z"). The second line maps just the code point 0x101 to 0x100. Since
	905	there are no other mappings defined, all other code points map to
	906	themselves.
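
To make the meaning of each returned line concrete, the sketch below interprets such a mapping string by hand. It is only an illustration of the format, not how Perl itself consumes the data; C<parse_case_map> and C<apply_case_map> are hypothetical helpers.

```perl
use strict;
use warnings;

# Hypothetical helper: expand "START\tEND\tTARGET" lines into a lookup table
# keyed by code point. An empty END field means a single code point.
sub parse_case_map {
    my ($spec) = @_;
    my %map;
    for my $line (split /\n/, $spec) {
        next unless length $line;
        my ($start, $end, $target) = split /\t/, $line;
        $end = $start if !defined $end || $end eq '';
        for my $offset (0 .. hex($end) - hex($start)) {
            $map{ hex($start) + $offset } = hex($target) + $offset;
        }
    }
    return \%map;
}

# Code points without an entry map to themselves.
sub apply_case_map {
    my ($map, $string) = @_;
    return join '',
        map { chr($map->{ ord($_) } // ord($_)) } split //, $string;
}

my $upper = parse_case_map("0061\t007A\t0041\n0101\t\t0100\n");
my $out   = apply_case_map($upper, "az\x{101}!");
printf "%s\n", join ' ', map { sprintf 'U+%04X', ord } split //, $out;
```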
	907
	908	This mechanism is not well behaved as far as affecting other packages
	909	and scopes. All non-threaded programs have exactly one uppercasing
	910	behavior, one lowercasing behavior, and one titlecasing behavior in
	911	effect for utf8-encoded strings for the duration of the program. Each
	912	of these behaviors is irrevocably determined the first time the
	913	corresponding function is called to change a utf8-encoded string's case.
	914	If a corresponding C<To-> function has been defined in the package that
	915	makes that first call, the mapping defined by that function will be the
	916	mapping used for the duration of the program's execution across all
	917	packages and scopes. If no corresponding C<To-> function has been
	918	defined in that package, the standard official mapping will be used for
	919	all packages and scopes, and any corresponding C<To-> function anywhere
	920	will be ignored. Threaded programs have similar behavior. If the
	921	program's casing behavior has been decided at the time of a thread's
	922	creation, the thread will inherit that behavior. But, if the behavior
	923	hasn't been decided, the thread gets to decide for itself, and its
	924	decision does not affect other threads nor its creator.
	925
	926	As shown by the example above, you have to furnish a complete mapping;
	927	you can't just override a couple of characters and leave the rest
	928	unchanged. You can find all the official mappings in the directory
	929	C<$Config{privlib}>F</unicore/To/>. The mapping data is returned as the
	930	here-document. The C<utf8::ToSpecI<Foo>> hashes in those files are special
	931	exception mappings derived from
	932	C<$Config{privlib}>F</unicore/SpecialCasing.txt>. (The "Digit" and
	933	"Fold" mappings that one can see in the directory are not directly
	934	user-accessible; one can use either the L<Unicode::UCD> module, or just match
	935	case-insensitively, which is what uses the "Fold" mapping. Neither is user
	936	overridable.)
	937
	938	If you have many mappings to change, you can take the official mapping data,
	939	change by hand the affected code points, and place the whole thing into your
	940	subroutine. But this will only be valid on Perls that use the same Unicode
	941	version. Another option would be to have your subroutine read the official
	942	mapping file(s) and overwrite the affected code points.
	943
	944	If you have only a few mappings to change you can use the
	945	following trick (but see below for a big caveat), here illustrated for
	946	Turkish:
	947
	948	use Config;
	949	use charnames ":full";
	950
	951	sub ToUpper {
	952	my $official = do "$Config{privlib}/unicore/To/Upper.pl";
	953	$utf8::ToSpecUpper{'i'} =
	954	"\N{LATIN CAPITAL LETTER I WITH DOT ABOVE}";
	955	return $official;
	956	}
	957
	958	This takes the official mappings and overrides just one, for "LATIN SMALL
	959	LETTER I". Each hash key must be the string of bytes that form the UTF-8
	960	(on EBCDIC platforms, UTF-EBCDIC) of the character, as illustrated by
	961	the inverse function.
	962
	963	sub ToLower {
	964	my $official = do "$Config{privlib}/unicore/To/Lower.pl";
	965	$utf8::ToSpecLower{"\xc4\xb0"} = "i";
	966	return $official;
	967	}
	968
	969	This example is for an ASCII platform, and C<\xc4\xb0> is the string of
	970	bytes that together form the UTF-8 that represents C<\N{LATIN CAPITAL
	971	LETTER I WITH DOT ABOVE}>, C<U+0130>. You can avoid having to figure out
	972	these bytes, and at the same time make it work on all platforms by
	973	instead writing:
	974
	975	sub ToLower {
	976	my $official = do "$Config{privlib}/unicore/To/Lower.pl";
	977	my $sequence = "\N{LATIN CAPITAL LETTER I WITH DOT ABOVE}";
	978	utf8::encode($sequence);
	979	$utf8::ToSpecLower{$sequence} = "i";
	980	return $official;
	981	}
	982
	983	This works because C<utf8::encode()> takes the single character and
	984	converts it to the sequence of bytes that constitute it. Note that we took
	985	advantage of the fact that C<"i"> is represented the same way in UTF-8 or
	986	UTF-EBCDIC as it is otherwise; if it were not, we would have had to write
	987
	988	$utf8::ToSpecLower{$sequence} = "\N{LATIN SMALL LETTER I}";
	989
	990	in the ToLower example, and in the ToUpper example, use
	991
	992	my $sequence = "\N{LATIN SMALL LETTER I}";
	993	utf8::encode($sequence);
	994
	995	A big caveat to the above trick, and to this whole mechanism in general,
	996	is that they work only on strings encoded in UTF-8. You can partially
	997	get around this by using C<use subs>. For example:
	998
	999	use subs qw(uc ucfirst lc lcfirst);
	1000
	1001	sub uc($) {
	1002	my $string = shift;
	1003	utf8::upgrade($string);
	1004	return CORE::uc($string);
	1005	}
	1006
	1007	sub lc($) {
	1008	my $string = shift;
	1009	utf8::upgrade($string);
	1010
	1011	# Unless an I is before a dot_above, it turns into a dotless i.
	1012	# (The character class with the combining classes matches non-above
	1013	# marks following the I. Any number of these may be between the 'I' and
	1014	# the dot_above, and the dot_above will still apply to the 'I'.)
	1015	use charnames ":full";
	1016	$string =~
	1017	s/I
	1018	(?! [^\p{ccc=0}\p{ccc=Above}]* \N{COMBINING DOT ABOVE} )
	1019	/\N{LATIN SMALL LETTER DOTLESS I}/gx;
	1020
	1021	# But when the I is followed by a dot_above, remove the
	1022	# dot_above so the end result will be i.
	1023	$string =~ s/I
	1024	([^\p{ccc=0}\p{ccc=Above}]* )
	1025	\N{COMBINING DOT ABOVE}
	1026	/i$1/gx;
	1027	return CORE::lc($string);
	1028	}
	1029
	1030	These examples (also for Turkish) make sure the input is in UTF-8, and then
	1031	call the corresponding official function, which will use the C<ToUpper()> and
	1032	C<ToLower()> functions you have defined.
	1033	(For Turkish, there are other required functions: C<ucfirst>, C<lcfirst>,
	1034	and C<ToTitle>. These are very similar to the ones given above.)
	1035
	1036	The reason this is a partial work-around is that it doesn't affect the C<\l>,
	1037	C<\L>, C<\u>, and C<\U> case change operations, which still require the source
	1038	to be encoded in utf8 (see L</The "Unicode Bug">).
	1039
	1040	The C<lc()> example shows how you can add context-dependent casing. Note
	1041	that context-dependent casing suffers from the problem that the string
	1042	passed to the casing function may not have sufficient context to make
	1043	the proper choice. And, it will not be called for C<\l>, C<\L>, C<\u>,
	1044	and C<\U>.
	1045
	1046	=head2 Character Encodings for Input and Output
	1047
	1048	See L<Encode>.
	1049
	1050	=head2 Unicode Regular Expression Support Level
	1051
	1052	The following list of Unicode support for regular expressions describes
	1053	all the features currently supported. The references to "Level N"
	1054	and the section numbers refer to the Unicode Technical Standard #18,
	1055	"Unicode Regular Expressions", version 11, in May 2005.
	1056
	1057	=over 4
	1058
	1059	=item *
	1060
	1061	Level 1 - Basic Unicode Support
	1062
	1063	RL1.1 Hex Notation - done [1]
	1064	RL1.2 Properties - done [2][3]
	1065	RL1.2a Compatibility Properties - done [4]
	1066	RL1.3 Subtraction and Intersection - MISSING [5]
	1067	RL1.4 Simple Word Boundaries - done [6]
	1068	RL1.5 Simple Loose Matches - done [7]
	1069	RL1.6 Line Boundaries - MISSING [8]
	1070	RL1.7 Supplementary Code Points - done [9]
	1071
	1072	[1] \x{...}
	1073	[2] \p{...} \P{...}
	1074	[3] supports not only minimal list, but all Unicode character
	1075	properties (see L</Unicode Character Properties>)
	1076	[4] \d \D \s \S \w \W \X [:prop:] [:^prop:]
	1077	[5] can use regular expression look-ahead [a] or
	1078	user-defined character properties [b] to emulate set
	1079	operations
	1080	[6] \b \B
	1081	[7] note that Perl does Full case-folding in matching (but with
	1082	bugs), not Simple: for example U+1F88 is equivalent to
	1083	U+1F00 U+03B9, not with 1F80. This difference matters
	1084	mainly for certain Greek capital letters with certain
	1085	modifiers: the Full case-folding decomposes the letter,
	1086	while the Simple case-folding would map it to a single
	1087	character.
	1088	[8] should do ^ and $ also on U+000B (\v in C), FF (\f), CR
	1089	(\r), CRLF (\r\n), NEL (U+0085), LS (U+2028), and PS
	1090	(U+2029); should also affect <>, $., and script line
	1091	numbers; should not split lines within CRLF [c] (i.e. there
	1092	is no empty line between \r and \n)
	1093	[9] UTF-8/UTF-EBCDIC used in perl allows not only U+10000 to
	1094	U+10FFFF but also beyond U+10FFFF [d]
	1095
	1096	[a] You can mimic class subtraction using lookahead.
	1097	For example, what UTS#18 might write as
	1098
	1099	[{Greek}-[{UNASSIGNED}]]
	1100
	1101	in Perl can be written as:
	1102
	1103	(?!\p{Unassigned})\p{InGreekAndCoptic}
	1104	(?=\p{Assigned})\p{InGreekAndCoptic}
	1105
	1106	But in this particular example, you probably really want
	1107
	1108	\p{Greek}
	1109
	1110	which will match assigned characters known to be part of the Greek script.
	1111
	1112	Also see the Unicode::Regex::Set module, which implements the full
	1113	UTS#18 grouping, intersection, union, and removal (subtraction) syntax.
	1114
	1115	[b] '+' for union, '-' for removal (set-difference), '&' for intersection
	1116	(see L</"User-Defined Character Properties">)
	1117
	1118	[c] Try the C<:crlf> layer (see L<PerlIO>).
	1119
	1120	[d] U+FFFF will currently generate a warning message if 'utf8' warnings are
	1121	enabled
	1122
	1123	=item *
	1124
	1125	Level 2 - Extended Unicode Support
	1126
	1127	RL2.1 Canonical Equivalents - MISSING [10][11]
	1128	RL2.2 Default Grapheme Clusters - MISSING [12]
	1129	RL2.3 Default Word Boundaries - MISSING [14]
	1130	RL2.4 Default Loose Matches - MISSING [15]
	1131	RL2.5 Name Properties - MISSING [16]
	1132	RL2.6 Wildcard Properties - MISSING
	1133
	1134	[10] see UAX#15 "Unicode Normalization Forms"
	1135	[11] have Unicode::Normalize but not integrated to regexes
	1136	[12] have \X but we don't have a "Grapheme Cluster Mode"
	1137	[14] see UAX#29, Word Boundaries
	1138	[15] see UAX#21 "Case Mappings"
	1139	[16] missing loose match [e]
	1140
	1141	[e] C<\N{...}> allows namespaces (see L<charnames>).
	1142
	1143	=item *
	1144
	1145	Level 3 - Tailored Support
	1146
	1147	RL3.1 Tailored Punctuation - MISSING
	1148	RL3.2 Tailored Grapheme Clusters - MISSING [17][18]
	1149	RL3.3 Tailored Word Boundaries - MISSING
	1150	RL3.4 Tailored Loose Matches - MISSING
	1151	RL3.5 Tailored Ranges - MISSING
	1152	RL3.6 Context Matching - MISSING [19]
	1153	RL3.7 Incremental Matches - MISSING
	1154	( RL3.8 Unicode Set Sharing )
	1155	RL3.9 Possible Match Sets - MISSING
	1156	RL3.10 Folded Matching - MISSING [20]
	1157	RL3.11 Submatchers - MISSING
	1158
	1159	[17] see UAX#10 "Unicode Collation Algorithms"
	1160	[18] have Unicode::Collate but not integrated to regexes
	1161	[19] have (?<=x) and (?=x), but look-aheads or look-behinds
	1162	should see outside of the target substring
	1163	[20] need insensitive matching for linguistic features other
	1164	than case; for example, hiragana to katakana, wide and
	1165	narrow, simplified Han to traditional Han (see UTR#30
	1166	"Character Foldings")
	1167
	1168	=back
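
The look-ahead emulation of set subtraction from footnote [a] can be tried out as follows. This is a sketch only: the helper names are hypothetical, and the block is written in the explicit C<Block=> form rather than the C<In...> spelling used above.

```perl
use strict;
use warnings;

# Whole block, assigned or not.
sub in_greek_block {
    return $_[0] =~ /\A\p{Block=Greek_And_Coptic}\z/ ? 1 : 0;
}

# Block minus unassigned code points, emulated with a negative look-ahead.
sub assigned_in_greek_block {
    return $_[0] =~ /\A(?!\p{Unassigned})\p{Block=Greek_And_Coptic}\z/ ? 1 : 0;
}

# U+03B1 is GREEK SMALL LETTER ALPHA; U+03A2 is an unassigned gap in the block.
for my $cp (0x03B1, 0x03A2) {
    printf "U+%04X  block:%d  block minus unassigned:%d\n",
        $cp, in_greek_block(chr $cp), assigned_in_greek_block(chr $cp);
}
```

The look-ahead consumes nothing, so the property that follows it still matches exactly one character.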
	1169
	1170	=head2 Unicode Encodings
	1171
	1172	Unicode characters are assigned to I<code points>, which are abstract
	1173	numbers. To use these numbers, various encodings are needed.
	1174
	1175	=over 4
	1176
	1177	=item *
	1178
	1179	UTF-8
	1180
	1181	UTF-8 is a variable-length (1 to 6 bytes, current character allocations
	1182	require 4 bytes), byte-order independent encoding. For ASCII (and we
	1183	really do mean 7-bit ASCII, not another 8-bit encoding), UTF-8 is
	1184	transparent.
	1185
	1186	The following table is from Unicode 3.2.
	1187
	1188	    Code Points          1st Byte  2nd Byte  3rd Byte  4th Byte
	1189
	1190	    U+0000..U+007F       00..7F
	1191	    U+0080..U+07FF     * C2..DF    80..BF
	1192	    U+0800..U+0FFF       E0      * A0..BF    80..BF
	1193	    U+1000..U+CFFF       E1..EC    80..BF    80..BF
	1194	    U+D000..U+D7FF       ED        80..9F    80..BF
	1195	    U+D800..U+DFFF       +++++++ utf16 surrogates, not legal utf8 +++++++
	1196	    U+E000..U+FFFF       EE..EF    80..BF    80..BF
	1197	    U+10000..U+3FFFF     F0      * 90..BF    80..BF    80..BF
	1198	    U+40000..U+FFFFF     F1..F3    80..BF    80..BF    80..BF
	1199	    U+100000..U+10FFFF   F4        80..8F    80..BF    80..BF
	1200
	1201	Note the gaps before several of the byte entries above marked by '*'. These are
	1202	caused by legal UTF-8 avoiding non-shortest encodings: it is technically
	1203	possible to UTF-8-encode a single code point in different ways, but that is
	1204	explicitly forbidden, and the shortest possible encoding should always be used
	1205	(and that is what Perl does).
	1206
	1207	Another way to look at it is via bits:
	1208
	1209	    Code Points                  1st Byte  2nd Byte  3rd Byte  4th Byte
	1210
	1211	                      0aaaaaaa   0aaaaaaa
	1212	              00000bbbbbaaaaaa   110bbbbb  10aaaaaa
	1213	              ccccbbbbbbaaaaaa   1110cccc  10bbbbbb  10aaaaaa
	1214	    00000dddccccccbbbbbbaaaaaa   11110ddd  10cccccc  10bbbbbb  10aaaaaa
	1215
	1216	As you can see, the continuation bytes all begin with "10", and the
	1217	leading bits of the start byte tell how many bytes there are in the
	1218	encoded character.
	1219
	1220	=item *
	1221
	1222	UTF-EBCDIC
	1223
	1224	Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.
	1225
	1226	=item *
	1227
	1228	UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)
	1229
	1230	The following items are mostly for reference and general Unicode
	1231	knowledge; Perl doesn't use these constructs internally.
	1232
	1233	UTF-16 is a 2 or 4 byte encoding. The Unicode code points
	1234	C<U+0000..U+FFFF> are stored in a single 16-bit unit, and the code
	1235	points C<U+10000..U+10FFFF> in two 16-bit units. The latter case is
	1236	using I<surrogates>, the first 16-bit unit being the I<high
	1237	surrogate>, and the second being the I<low surrogate>.
	1238
	1239	Surrogates are code points set aside to encode the C<U+10000..U+10FFFF>
	1240	range of Unicode code points in pairs of 16-bit units. The I<high
	1241	surrogates> are the range C<U+D800..U+DBFF> and the I<low surrogates>
	1242	are the range C<U+DC00..U+DFFF>. The surrogate encoding is
	1243
	1244	$hi = int(($uni - 0x10000) / 0x400) + 0xD800;
	1245	$lo = ($uni - 0x10000) % 0x400 + 0xDC00;
	1246
	1247	and the decoding is
	1248
	1249	$uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
	1250
	1251	If you try to generate surrogates (for example by using chr()), you
	1252	will get a warning, if warnings are turned on, because those code
	1253	points are not valid for a Unicode character.
	1254
	1255	Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16
	1256	itself can be used for in-memory computations, but if storage or
	1257	transfer is required either UTF-16BE (big-endian) or UTF-16LE
	1258	(little-endian) encodings must be chosen.
	1259
	1260	This introduces another problem: what if you just know that your data
	1261	is UTF-16, but you don't know which endianness? Byte Order Marks, or
	1262	BOMs, are a solution to this. A special character has been reserved
	1263	in Unicode to function as a byte order marker: the character with the
	1264	code point C<U+FEFF> is the BOM.
	1265
	1266	The trick is that if you read a BOM, you will know the byte order,
	1267	since if it was written on a big-endian platform, you will read the
	1268	bytes C<0xFE 0xFF>, but if it was written on a little-endian platform,
	1269	you will read the bytes C<0xFF 0xFE>. (And if the originating platform
	1270	was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.)
	1271
	1272	The way this trick works is that the character with the code point
	1273	C<U+FFFE> is guaranteed not to be a valid Unicode character, so the
	1274	sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in
	1275	little-endian format" and cannot be "C<U+FFFE>, represented in big-endian
	1276	format". (Actually, C<U+FFFE> is legal for use by your program, even for
	1277	input/output, but better not use it if you need a BOM. But it is "illegal for
	1278	interchange", so that an unsuspecting program won't get confused.)
	1279
	1280	=item *
	1281
	1282	UTF-32, UTF-32BE, UTF-32LE
	1283
	1284	The UTF-32 family is pretty much like the UTF-16 family, except that
	1285	the units are 32-bit, and therefore the surrogate scheme is not
	1286	needed. The BOM signatures will be C<0x00 0x00 0xFE 0xFF> for BE and
	1287	C<0xFF 0xFE 0x00 0x00> for LE.
	1288
	1289	=item *
	1290
	1291	UCS-2, UCS-4
	1292
	1293	Encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit
	1294	encoding. Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
	1295	because it does not use surrogates. UCS-4 is a 32-bit encoding,
	1296	functionally identical to UTF-32.
	1297
	1298	=item *
	1299
	1300	UTF-7
	1301
	1302	A seven-bit safe (non-eight-bit) encoding, which is useful if the
	1303	transport or storage is not eight-bit safe. Defined by RFC 2152.
	1304
	1305	=back
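
The bit layout shown for UTF-8 above can be turned into code almost mechanically. The sketch below does so and compares the result with what C<utf8::encode()> produces. It assumes an ASCII platform, and for brevity it does not reject surrogates or code points above C<U+10FFFF>; both function names are hypothetical.

```perl
use strict;
use warnings;

# Build the UTF-8 bytes for one code point from the bit patterns in the table.
sub utf8_bytes_by_hand {
    my ($cp) = @_;
    return ($cp) if $cp < 0x80;
    return (0xC0 | ($cp >> 6), 0x80 | ($cp & 0x3F)) if $cp < 0x800;
    return (0xE0 | ($cp >> 12),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F)) if $cp < 0x10000;
    return (0xF0 | ($cp >> 18),
            0x80 | (($cp >> 12) & 0x3F),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F));
}

# Let Perl do the same job, for comparison.
sub utf8_bytes_by_perl {
    my ($cp) = @_;
    my $string = chr $cp;
    utf8::encode($string);          # characters -> UTF-8 bytes, in place
    return unpack 'C*', $string;
}

for my $cp (0x41, 0xE9, 0x263A, 0x1F600) {
    printf "U+%04X  by hand: %s  by perl: %s\n", $cp,
        join(' ', map { sprintf '%02X', $_ } utf8_bytes_by_hand($cp)),
        join(' ', map { sprintf '%02X', $_ } utf8_bytes_by_perl($cp));
}
```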
	1306
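The surrogate arithmetic and the BOM signatures listed for the UTF-16 and UTF-32 families can be sketched in a few lines. The function names are hypothetical, and C<bom_encoding> only inspects the leading bytes; it makes no attempt to validate the data that follows.

```perl
use strict;
use warnings;

# Split a supplementary code point (U+10000..U+10FFFF) into a surrogate pair.
sub to_surrogates {
    my ($uni) = @_;
    my $offset = $uni - 0x10000;
    return (0xD800 + int($offset / 0x400), 0xDC00 + $offset % 0x400);
}

# Recombine a high and a low surrogate into the original code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
}

# Guess the encoding from a leading BOM. UTF-32LE must be tested before
# UTF-16LE because its signature starts with the same two bytes; FF FE 00 00
# could also be a UTF-16LE BOM followed by U+0000, which this sketch ignores.
sub bom_encoding {
    my ($bytes) = @_;
    return 'UTF-32LE' if $bytes =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-32BE' if $bytes =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return;                                   # no BOM recognized
}

my ($hi, $lo) = to_surrogates(0x1F600);
printf "U+1F600 -> %04X %04X -> U+%X\n", $hi, $lo, from_surrogates($hi, $lo);
printf "%s\n", bom_encoding("\xFF\xFEa\x00");
```
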
	1307	=head2 Security Implications of Unicode
	1308
	1309	Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.
	1310	Also, note the following:
	1311
	1312	=over 4
	1313
	1314	=item *
	1315
	1316	Malformed UTF-8
	1317
	1318	Unfortunately, the specification of UTF-8 leaves some room for
	1319	interpretation of how many bytes of encoded output one should generate
	1320	from one input Unicode character. Strictly speaking, the shortest
	1321	possible sequence of UTF-8 bytes should be generated,
	1322	because otherwise there is potential for an input buffer overflow at
	1323	the receiving end of a UTF-8 connection. Perl always generates the
	1324	shortest length UTF-8, and with warnings on, Perl will warn about
	1325	non-shortest length UTF-8 along with other malformations, such as the
	1326	surrogates, which are not real Unicode code points.
	1327
	1328	=item *
	1329
	1330	Regular expressions behave slightly differently between byte data and
	1331	character (Unicode) data. For example, the "word character" character
	1332	class C<\w> will work differently depending on whether the data is eight-bit bytes
	1333	or Unicode.
	1334
	1335	In the first case, the set of C<\w> characters is either small--the
	1336	default set of alphabetic characters, digits, and the "_"--or, if you
	1337	are using a locale (see L<perllocale>), the C<\w> might contain a few
	1338	more letters according to your language and country.
	1339
	1340	In the second case, the C<\w> set of characters is much, much larger.
	1341	Most importantly, even in the set of the first 256 characters, it will
	1342	probably match different characters: unlike most locales, which are
	1343	specific to a language and country pair, Unicode classifies all the
	1344	characters that are letters I<somewhere> as C<\w>. For example, your
	1345	locale might not think that LATIN SMALL LETTER ETH is a letter (unless
	1346	you happen to speak Icelandic), but Unicode does.
	1347
	1348	As discussed elsewhere, Perl has one foot (two hooves?) planted in
	1349	each of two worlds: the old world of bytes and the new world of
	1350	characters, upgrading from bytes to characters when necessary.
	1351	If your legacy code does not explicitly use Unicode, no automatic
	1352	switch-over to characters should happen. Characters shouldn't get
	1353	downgraded to bytes, either. It is possible to accidentally mix bytes
	1354	and characters, however (see L<perluniintro>), in which case C<\w> in
	1355	regular expressions might start behaving differently. Review your
	1356	code. Use warnings and the C<strict> pragma.
	1357
	1358	=back
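
As a rough illustration of the malformed UTF-8 item above, the sketch below refuses input that is not well-formed, including non-shortest ("overlong") sequences. C<decode_strictly> is a hypothetical helper built on C<utf8::decode()>, which reports failure through its return value. For untrusted data the stricter C<UTF-8> encoding of the L<Encode> module may be the better choice.

```perl
use strict;
use warnings;

# Return the decoded character string, or undef if the bytes are malformed.
sub decode_strictly {
    my ($bytes) = @_;
    my $copy = $bytes;                 # utf8::decode works in place
    return utf8::decode($copy) ? $copy : undef;
}

my %samples = (
    'shortest form of U+00E9' => "\xC3\xA9",
    'overlong form of U+0000' => "\xC0\x80",
    'overlong form of "/"'    => "\xC0\xAF",
);

for my $label (sort keys %samples) {
    my $decoded = decode_strictly($samples{$label});
    printf "%-26s %s\n", $label,
        defined $decoded ? 'accepted' : 'rejected as malformed';
}
```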
	1359
	1360	=head2 Unicode in Perl on EBCDIC
	1361
	1362	The way Unicode is handled on EBCDIC platforms is still
	1363	experimental. On such platforms, references to UTF-8 encoding in this
	1364	document and elsewhere should be read as meaning the UTF-EBCDIC
	1365	specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues
	1366	are specifically discussed. There is no C<utfebcdic> pragma or
	1367	":utfebcdic" layer; rather, "utf8" and ":utf8" are reused to mean
	1368	the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic>
	1369	for more discussion of the issues.
	1370
	1371	=head2 Locales
	1372
	1373	Usually locale settings and Unicode do not affect each other, but
	1374	there are exceptions:
	1375
	1376	=over 4
	1377
	1378	=item *
	1379
	1380	You can enable automatic UTF-8-ification of your standard file
	1381	handles, default C<open()> layer, and C<@ARGV> by using either
	1382	the C<-C> command line switch or the C<PERL_UNICODE> environment
	1383	variable, see L<perlrun> for the documentation of the C<-C> switch.
	1384
	1385	=item *
	1386
	1387	Perl tries really hard to work both with Unicode and the old
	1388	byte-oriented world. Most often this is nice, but sometimes Perl's
	1389	straddling of the proverbial fence causes problems. Here's an example
	1390	of how things can go wrong. A locale can define a code point to be
	1391	anything it wants. It could make 'A' into a control character, for example.
	1392	But strings encoded in utf8 always have Unicode semantics, so an 'A' in
	1393	such a string is always an uppercase letter, never a control, no matter
	1394	what the locale says it should be.
	1395
	1396	=back
	1397
	1398	=head2 When Unicode Does Not Happen
	1399
	1400	While Perl does have extensive ways to input and output in Unicode,
	1401	and a few other 'entry points' like C<@ARGV> that can be interpreted
	1402	as Unicode (UTF-8), there still are many places where Unicode (in some
	1403	encoding or another) could be given as arguments or received as
	1404	results, or both, but it is not.
	1405
	1406	The following are such interfaces. Also, see L</The "Unicode Bug">.
	1407	For all of these interfaces Perl
	1408	currently (as of 5.8.3) simply assumes byte strings both as arguments
	1409	and results, or UTF-8 strings if the C<encoding> pragma has been used.
	1410
	1411	One reason why Perl does not attempt to resolve the role of Unicode in
	1412	these cases is that the answers are highly dependent on the operating
	1413	system and the file system(s). For example, whether filenames can be
	1414	in Unicode, and in exactly what kind of encoding, is not exactly a
	1415	portable concept. Similarly for C<qx> and C<system>: how well will the
	1416	'command line interface' (and which of them?) handle Unicode?
	1417
	1418	=over 4
	1419
	1420	=item *
	1421
	1422	chdir, chmod, chown, chroot, exec, link, lstat, mkdir,
	1423	rename, rmdir, stat, symlink, truncate, unlink, utime, -X
	1424
	1425	=item *
	1426
	1427	%ENV
	1428
	1429	=item *
	1430
	1431	glob (aka the <*>)
	1432
	1433	=item *
	1434
	1435	open, opendir, sysopen
	1436
	1437	=item *
	1438
	1439	qx (aka the backtick operator), system
	1440
	1441	=item *
	1442
	1443	readdir, readlink
	1444
	1445	=back
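
Because these interfaces deal in bytes, one common approach is to encode and decode names explicitly at the boundary. The following is a sketch under the assumption that the file system stores names as UTF-8, which is not true everywhere; C<round_trip_filename> is a hypothetical helper.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use File::Spec;
use File::Temp qw(tempdir);

# Create an empty file whose name is the UTF-8 encoding of $name, then list
# the directory and decode what readdir returns.
sub round_trip_filename {
    my ($name) = @_;
    my $dir   = tempdir(CLEANUP => 1);
    my $bytes = encode('UTF-8', $name);            # characters -> bytes
    my $path  = File::Spec->catfile($dir, $bytes);

    open my $fh, '>', $path or die "Cannot create file: $!";
    close $fh or die "Cannot close file: $!";

    opendir my $dh, $dir or die "Cannot read $dir: $!";
    my @names = map  { decode('UTF-8', $_) }       # bytes -> characters
                grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;
    return @names;
}

my @found = round_trip_filename("\x{263A}.txt");
printf "%d entry, %d characters long\n", scalar @found, length $found[0];
```

Without the explicit C<decode>, C<readdir> would hand back seven bytes rather than five characters for this name.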
	1446
	1447	=head2 The "Unicode Bug"
	1448
	1449	The term "Unicode bug" has been applied to an inconsistency with the
	1450	Unicode characters whose ordinals are in the Latin-1 Supplement block, that
	1451	is, between 128 and 255. Without a locale specified, unlike all other
	1452	characters or code points, these characters have very different semantics in
	1453	byte semantics versus character semantics.
	1454
	1455	In character semantics they are interpreted as Unicode code points, which means
	1456	they have the same semantics as Latin-1 (ISO-8859-1).
	1457
	1458	In byte semantics, they are considered to be unassigned characters, meaning
	1459	that the only semantics they have is their ordinal numbers, and that they are
	1460	not members of various character classes. None are considered to match C<\w>
	1461	for example, but all match C<\W>. (On EBCDIC platforms, the behavior may
	1462	be different from this, depending on the underlying C language library
	1463	functions.)
	1464
	1465	The behavior is known to have effects on these areas:
	1466
	1467	=over 4
	1468
	1469	=item *
	1470
	1471	Changing the case of a scalar, that is, using C<uc()>, C<ucfirst()>, C<lc()>,
	1472	and C<lcfirst()>, or C<\L>, C<\U>, C<\u> and C<\l> in regular expression
	1473	substitutions.
	1474
	1475	=item *
	1476
	1477	Using caseless (C</i>) regular expression matching
	1478
	1479	=item *
	1480
	1481	Matching a number of properties in regular expressions, namely C<\b>,
	1482	C<\B>, C<\s>, C<\S>, C<\w>, C<\W>, and all the Posix character classes
	1483	I<except> C<[[:ascii:]]>.
	1484
	1485	=item *
	1486
	1487	User-defined case change mappings. You can create a C<ToUpper()> function, for
	1488	example, which overrides Perl's built-in case mappings. The scalar must be
	1489	encoded in utf8 for your function to actually be invoked.
	1490
	1491	=back
	1492
	1493	This behavior can lead to unexpected results in which a string's semantics
	1494	suddenly change if a code point above 255 is appended to or removed from it,
	1495	which changes the string's semantics from byte to character or vice versa. As
	1496	an example, consider the following program and its output:
	1497
	1498	    $ perl -le'
	1499	        $s1 = "\xC2";
	1500	        $s2 = "\x{2660}";
	1501	        for ($s1, $s2, $s1.$s2) {
	1502	            print /\w/ || 0;
	1503	        }
	1504	    '
	1505	    0
	1506	    0
	1507	    1
	1508
	1509	If there's no C<\w> in C<$s1> or in C<$s2>, why does their concatenation have one?
	1510
	1511	This anomaly stems from Perl's attempt to not disturb older programs that
	1512	didn't use Unicode, and hence had no semantics for characters outside of the
	1513	ASCII range (except in a locale), along with Perl's desire to add Unicode
	1514	support seamlessly. The result wasn't seamless: these characters were
	1515	orphaned.
	1516
	1517	Work is being done to correct this, but only some of it is complete.
	1518	What has been finished is:
	1519
	1520	=over
	1521
	1522	=item *
	1523
	1524	the matching of C<\b>, C<\s>, C<\w> and the POSIX
	1525	character classes and their complements in regular expressions
	1526
	1527	=item *
	1528
	1529	case changing (but not user-defined casing)
	1530
	1531	=item *
	1532
	1533	case-insensitive (C</i>) regular expression matching for [bracketed
	1534	character classes] only, except for some bugs with C<LATIN SMALL
	1535	LETTER SHARP S>, which is supposed to match the two-character sequence
	1536	"ss" (or "Ss" or "sS" or "SS"). Perl has a number of bugs for all
	1537	such multi-character case-insensitive characters, of which this is just
	1538	one example.
	1539
	1540	=back
	1541
	1542	Due to concerns, and some evidence, that older code might
	1543	have come to rely on the existing behavior, the new behavior must be explicitly
	1544	enabled by the feature C<unicode_strings> in the L<feature> pragma, even though
	1545	no new syntax is involved.
	1546
	1547	See L<perlfunc/lc> for details on how this pragma works in combination with
	1548	various others for casing.
	1549
	1550	Even though the implementation is incomplete, it is planned to have this
	1551	pragma affect all the problematic behaviors in later releases: you can't
	1552	have one without them all.
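
The effect of the feature can be seen with a single character from the
Latin-1 Supplement. The following is a minimal sketch, not a complete test:
it assumes a perl recent enough that C<unicode_strings> covers both case
changing and C<\w> matching, and the variable names are purely illustrative.

```perl
use strict;
use warnings;

# LATIN SMALL LETTER A WITH GRAVE, held internally as a single byte.
my $s = "\xE0";

# Outside the feature's scope the character gets byte semantics:
# it has no uppercase form and is not a word character.
my $plain_uc = uc $s;
my $plain_w  = ($s =~ /\w/) ? 1 : 0;

my ($feat_uc, $feat_w);
{
    # Inside this block the same string gets character semantics.
    use feature 'unicode_strings';
    $feat_uc = uc $s;
    $feat_w  = ($s =~ /\w/) ? 1 : 0;
}

printf "without feature: uc=%02X \\w=%d\n", ord($plain_uc), $plain_w;
printf "with feature:    uc=%02X \\w=%d\n", ord($feat_uc),  $feat_w;
```

Because the feature is lexically scoped, the two halves of the sketch can
live in the same file without affecting each other.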
	1553
	1554	In the meantime, a workaround is to always call C<utf8::upgrade($string)>, or to
	1555	use the standard module L<Encode>. Also, a scalar that has any characters
	1556	whose ordinal is 0x100 or above, or which were specified using either of the
	1557	C<\N{...}> notations will automatically have character semantics.
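
The C<utf8::upgrade> workaround can be sketched as follows. This is only an
illustration, and it assumes that the C<unicode_strings> feature is not in
effect; the variable names are made up for the example.

```perl
use strict;
use warnings;

my $s = "\xC2";                          # one character, byte-encoded internally
my $before = ($s =~ /\w/) ? 1 : 0;       # byte semantics: not a word character

my $copy = $s;
utf8::upgrade($copy);                    # same character, UTF-8 encoded internally
my $after = ($copy =~ /\w/) ? 1 : 0;     # character semantics: a word character

my $same = ($s eq $copy) ? 1 : 0;        # the strings still compare equal

print "before=$before after=$after equal=$same\n";
```

The upgrade changes only the internal representation, which is why the two
strings compare equal even though they match C<\w> differently.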
	1558
	1559	=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)
	1560
	1561	Sometimes (see L</"When Unicode Does Not Happen"> or L</The "Unicode Bug">)
	1562	there are situations where you simply need to force a byte
	1563	string into UTF-8, or vice versa. The low-level calls
	1564	utf8::upgrade($bytestring) and utf8::downgrade($utf8string[, FAIL_OK]) are
	1565	the answers.
	1566
	1567	Note that utf8::downgrade() can fail if the string contains characters
	1568	that don't fit into a byte.
	1569
	1570	Calling either function on a string that already is in the desired state is a
	1571	no-op.
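
A short sketch of both calls, including the failure case, might look like
this. The strings and variable names are assumptions chosen for the example.

```perl
use strict;
use warnings;

my $latin1 = "caf\xE9";
utf8::upgrade($latin1);                            # force the UTF-8 representation
my $flag_after_upgrade = utf8::is_utf8($latin1) ? 1 : 0;

utf8::downgrade($latin1);                          # every character fits in a byte
my $flag_after_downgrade = utf8::is_utf8($latin1) ? 1 : 0;

my $wide = "\x{2660}";                             # cannot be stored one byte per character

# With FAIL_OK true, a failed downgrade returns false instead of dying.
my $soft_result = utf8::downgrade($wide, 1) ? 1 : 0;

# Without FAIL_OK, a failed downgrade dies, so trap it with eval.
my $died = eval { utf8::downgrade($wide); 1 } ? 0 : 1;

print "upgrade=$flag_after_upgrade downgrade=$flag_after_downgrade ",
      "soft=$soft_result died=$died\n";
```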
	1572
	1573	=head2 Using Unicode in XS
	1574
	1575	If you want to handle Perl Unicode in XS extensions, you may find the
	1576	following C APIs useful. See also L<perlguts/"Unicode Support"> for an
	1577	explanation about Unicode at the XS level, and L<perlapi> for the API
	1578	details.
	1579
	1580	=over 4
	1581
	1582	=item *
	1583
	1584	C<DO_UTF8(sv)> returns true if the C<UTF8> flag is on and the bytes
	1585	pragma is not in effect. C<SvUTF8(sv)> returns true if the C<UTF8>
	1586	flag is on; the bytes pragma is ignored. The C<UTF8> flag being on
	1587	does B<not> mean that there are any characters of code points greater
	1588	than 255 (or 127) in the scalar or that there are even any characters
	1589	in the scalar. What the C<UTF8> flag means is that the sequence of
	1590	octets in the representation of the scalar is the sequence of UTF-8
	1591	encoded code points of the characters of a string. The C<UTF8> flag
	1592	being off means that each octet in this representation encodes a
	1593	single character with code point 0..255 within the string. Perl's
	1594	Unicode model is not to use UTF-8 until it is absolutely necessary.
	1595
	1596	=item *
	1597
	1598	C<uvchr_to_utf8(buf, chr)> writes a Unicode character code point into
	1599	a buffer encoding the code point as UTF-8, and returns a pointer
	1600	pointing after the UTF-8 bytes. It works appropriately on EBCDIC machines.
	1601
	1602	=item *
	1603
	1604	C<utf8_to_uvchr(buf, lenp)> reads UTF-8 encoded bytes from a buffer and
	1605	returns the Unicode character code point and, optionally, the length of
	1606	the UTF-8 byte sequence. It works appropriately on EBCDIC machines.
	1607
	1608	=item *
	1609
	1610	C<utf8_length(start, end)> returns the length of the UTF-8 encoded buffer
	1611	in characters. C<sv_len_utf8(sv)> returns the length of the UTF-8 encoded
	1612	scalar.
	1613
	1614	=item *
	1615
	1616	C<sv_utf8_upgrade(sv)> converts the string of the scalar to its UTF-8
	1617	encoded form. C<sv_utf8_downgrade(sv)> does the opposite, if
	1618	possible. C<sv_utf8_encode(sv)> is like sv_utf8_upgrade except that
	1619	it does not set the C<UTF8> flag. C<sv_utf8_decode()> does the
	1620	opposite of C<sv_utf8_encode()>. Note that none of these are to be
	1621	used as general-purpose encoding or decoding interfaces: C<use Encode>
	1622	for that. C<sv_utf8_upgrade()> is affected by the encoding pragma
	1623	but C<sv_utf8_downgrade()> is not (since the encoding pragma is
	1624	designed to be a one-way street).
	1625
	1626	=item *
	1627
	1628	C<is_utf8_char(s)> returns true if the pointer points to a valid UTF-8
	1629	character.
	1630
	1631	=item *
	1632
	1633	C<is_utf8_string(buf, len)> returns true if C<len> bytes of the buffer
	1634	are valid UTF-8.
	1635
	1636	=item *
	1637
	1638	C<UTF8SKIP(buf)> will return the number of bytes in the UTF-8 encoded
	1639	character in the buffer. C<UNISKIP(chr)> will return the number of bytes
	1640	required to UTF-8-encode the Unicode character code point. C<UTF8SKIP()>
	1641	is useful, for example, for iterating over the characters of a UTF-8
	1642	encoded buffer; C<UNISKIP()> is useful, for example, in computing
	1643	the size required for a UTF-8 encoded buffer.
	1644
	1645	=item *
	1646
	1647	C<utf8_distance(a, b)> will tell the distance in characters between the
	1648	two pointers pointing to the same UTF-8 encoded buffer.
	1649
	1650	=item *
	1651
	1652	C<utf8_hop(s, off)> will return a pointer to a UTF-8 encoded buffer
	1653	that is C<off> (positive or negative) Unicode characters displaced
	1654	from the UTF-8 buffer C<s>. Be careful not to overstep the buffer:
	1655	C<utf8_hop()> will merrily run off the end or the beginning of the
	1656	buffer if told to do so.
	1657
	1658	=item *
	1659
	1660	C<pv_uni_display(dsv, spv, len, pvlim, flags)> and
	1661	C<sv_uni_display(dsv, ssv, pvlim, flags)> are useful for debugging the
	1662	output of Unicode strings and scalars. By default they are useful
	1663	only for debugging--they display B<all> characters as hexadecimal code
	1664	points--but with the flags C<UNI_DISPLAY_ISPRINT>,
	1665	C<UNI_DISPLAY_BACKSLASH>, and C<UNI_DISPLAY_QQ> you can make the
	1666	output more readable.
	1667
	1668	=item *
	1669
	1670	C<foldEQ_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to
	1671	compare two strings case-insensitively in Unicode. For case-sensitive
	1672	comparisons you can just use C<memEQ()> and C<memNE()> as usual, except
	1673	if one string is in utf8 and the other isn't.
	1674
	1675	=back
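
The meaning of the C<UTF8> flag described in the first item can also be
observed from Perl, without writing any XS. The sketch below is an
illustration only: it uses C<utf8::is_utf8> to read the flag and the C<bytes>
pragma to count octets in the internal representation, and the variable
names are invented for the example.

```perl
use strict;
use warnings;

# An all-ASCII string can carry the flag: the flag says nothing about
# whether any code point is above 127 or 255.
my $ascii = "abc";
utf8::upgrade($ascii);
my $ascii_flag = utf8::is_utf8($ascii) ? 1 : 0;

# The same character in both internal representations.
my $e_bytes = "\xE9";
my $e_utf8  = "\xE9";
utf8::upgrade($e_utf8);

my $chars_bytes = length $e_bytes;       # counted in characters
my $chars_utf8  = length $e_utf8;        # counted in characters

my ($octets_bytes, $octets_utf8);
{
    use bytes;                           # look at the underlying octets
    $octets_bytes = length $e_bytes;
    $octets_utf8  = length $e_utf8;
}

print "ascii flag=$ascii_flag\n";
print "chars: $chars_bytes/$chars_utf8 octets: $octets_bytes/$octets_utf8\n";
```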
	1676
	1677	For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h>
	1678	in the Perl source code distribution.
	1679
	1680	=head2 Hacking Perl to work on earlier Unicode versions (for very serious hackers only)
	1681
	1682	Perl by default comes with the latest supported Unicode version built in, but
	1683	you can change to use any earlier one.
	1684
	1685	Download the files in the version of Unicode that you want from the Unicode web
	1686	site (L<http://www.unicode.org>). These should replace the existing files in
	1687	C<$Config{privlib}>/F<unicore>. (C<%Config> is available from the Config
	1688	module.) Follow the instructions in F<README.perl> in that directory to change
	1689	some of their names, and then run F<make>.
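
Before replacing anything, it may help to check which directory and which
Unicode version the running perl uses. The following is a small sketch that
assumes the standard C<Config>, C<File::Spec>, and C<Unicode::UCD> modules
are installed; it only reads information and changes nothing.

```perl
use strict;
use warnings;
use Config;
use File::Spec;
use Unicode::UCD ();

# Directory holding the Unicode data files of this perl installation.
my $unicore = File::Spec->catdir($Config{privlib}, 'unicore');

# Version of the Unicode Character Database this perl was built with.
my $version = Unicode::UCD::UnicodeVersion();

print "unicore directory: $unicore\n";
print "Unicode version:   $version\n";
```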
	1690
	1691	It is even possible to download them to a different directory, and then change
	1692	F<utf8_heavy.pl> in the directory C<$Config{privlib}> to point to the new
	1693	directory. Alternatively, make a copy of that directory before making the
	1694	change, and use C<@INC> or the C<-I> run-time flag to switch between versions
	1695	at will (but, because of caching, not in the middle of a process). All this is
	1696	beyond the scope of these instructions.
	1697
	1698	=head1 BUGS
	1699
	1700	=head2 Interaction with Locales
	1701
	1702	Use of locales with Unicode data may lead to odd results. Currently,
	1703	Perl attempts to attach 8-bit locale info to characters in the range
	1704	0..255, but this technique is demonstrably incorrect for locales that
	1705	use characters above that range when mapped into Unicode. Perl's
	1706	Unicode support will also tend to run slower. Use of locales with
	1707	Unicode is discouraged.
	1708
	1709	=head2 Problems with characters in the Latin-1 Supplement range
	1710
	1711	See L</The "Unicode Bug">.
	1712
	1713	=head2 Problems with case-insensitive regular expression matching
	1714
	1715	There are problems with case-insensitive matches, including those involving
	1716	character classes (enclosed in [square brackets]), characters whose fold
	1717	is to multiple characters (for example, the single character LATIN SMALL
	1718	LIGATURE FFL matches case-insensitively with the 3-character string C<ffl>),
	1719	and characters in the Latin-1 Supplement.
	1720
	1721	=head2 Interaction with Extensions
	1722
	1723	When Perl exchanges data with an extension, the extension should be
	1724	able to understand the UTF8 flag and act accordingly. If the
	1725	extension doesn't know about the flag, it's likely that the extension
	1726	will return incorrectly-flagged data.
	1727
	1728	So if you're working with Unicode data, consult the documentation of
	1729	every module you're using if there are any issues with Unicode data
	1730	exchange. If the documentation does not talk about Unicode at all,
	1731	suspect the worst and probably look at the source to learn how the
	1732	module is implemented. Modules written completely in Perl shouldn't
	1733	cause problems. Modules that directly or indirectly access code written
	1734	in other programming languages are at risk.
	1735
	1736	For affected functions, the simple strategy to avoid data corruption is
	1737	to always make the encoding of the exchanged data explicit. Choose an
	1738	encoding that you know the extension can handle. Convert arguments passed
	1739	to the extensions to that encoding and convert results back from that
	1740	encoding. Write wrapper functions that do the conversions for you, so
	1741	you can later change the functions when the extension catches up.
	1742
	1743	To provide an example, let's say the popular Foo::Bar::escape_html
	1744	function doesn't deal with Unicode data yet. The wrapper function
	1745	would convert the argument to raw UTF-8 and convert the result back to
	1746	Perl's internal representation like so:
	1747
	1748	    sub my_escape_html ($) {
	1749	        my($what) = shift;
	1750	        return unless defined $what;
	1751	        Encode::decode_utf8(Foo::Bar::escape_html(
	1752	                                Encode::encode_utf8($what)));
	1753	    }
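
To try the wrapper without the real extension, one can substitute a
stand-in. In the sketch below, the C<Foo::Bar::escape_html> body is entirely
hypothetical: it is a byte-oriented placeholder written for this example,
not the behavior of any actual module.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for a byte-oriented Foo::Bar::escape_html.
{
    package Foo::Bar;
    sub escape_html {
        my ($octets) = @_;
        $octets =~ s/&/&amp;/g;
        $octets =~ s/</&lt;/g;
        $octets =~ s/>/&gt;/g;
        return $octets;
    }
}

# Encode to UTF-8 octets on the way in, decode back to characters on the way out.
sub my_escape_html {
    my ($what) = @_;
    return unless defined $what;
    return Encode::decode_utf8(
        Foo::Bar::escape_html(Encode::encode_utf8($what)));
}

my $escaped = my_escape_html("<\x{2660}>");
printf "%d characters\n", length $escaped;    # prints "9 characters"
```

The result is a character string again, so C<length> counts the spade as
one character rather than as three octets.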
	1754
	1755	Sometimes, when the extension does not convert data but just stores
	1756	and retrieves them, you will be in a position to use the otherwise
	1757	dangerous Encode::_utf8_on() function. Let's say the popular
	1758	C<Foo::Bar> extension, written in C, provides a C<param> method that
	1759	lets you store and retrieve data according to these prototypes:
	1760
	1761	    $self->param($name, $value);            # set a scalar
	1762	    $value = $self->param($name);           # retrieve a scalar
	1763
	1764	If it does not yet provide support for any encoding, one could write a
	1765	derived class with such a C<param> method:
	1766
	1767	    sub param {
	1768	        my($self,$name,$value) = @_;
	1769	        utf8::upgrade($name);     # make sure it is UTF-8 encoded
	1770	        if (defined $value) {
	1771	            utf8::upgrade($value);  # make sure it is UTF-8 encoded
	1772	            return $self->SUPER::param($name,$value);
	1773	        } else {
	1774	            my $ret = $self->SUPER::param($name);
	1775	            Encode::_utf8_on($ret); # we know, it is UTF-8 encoded
	1776	            return $ret;
	1777	        }
	1778	    }
	1779
	1780	Some extensions provide filters on data entry/exit points, such as
	1781	DB_File::filter_store_key and family. Look out for such filters in
	1782	the documentation of your extensions; they can make the transition to
	1783	Unicode data much easier.
	1784
	1785	=head2 Speed
	1786
	1787	Some functions are slower when working on UTF-8 encoded strings than
	1788	on byte encoded strings. All functions that need to hop over
	1789	characters such as length(), substr() or index(), or matching regular
	1790	expressions can work B<much> faster when the underlying data are
	1791	byte-encoded.
	1792
	1793	In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
	1794	a caching scheme was introduced which will hopefully make the slowness
	1795	somewhat less spectacular, at least for some operations. In general,
	1796	operations with UTF-8 encoded strings are still slower. As an example,
	1797	the Unicode properties (character classes) like C<\p{Nd}> are known to
	1798	be quite a bit slower (5-20 times) than their simpler counterparts
	1799	like C<\d> (then again, there are 268 Unicode characters matching C<Nd>
	1800	compared with the 10 ASCII characters matching C<d>).
	1801
	1802	=head2 Problems on EBCDIC platforms
	1803
	1804	There are a number of known problems with Perl on EBCDIC platforms. If you
	1805	want to use Perl there, send email to perlbug@perl.org.
	1806
	1807	In earlier versions, when byte and character data were concatenated,
	1808	the new string was sometimes created by
	1809	decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
	1810	old Unicode string used EBCDIC.
	1811
	1812	If you find any of these, please report them as bugs.
	1813
	1814	=head2 Porting code from perl-5.6.X
	1815
	1816	Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer
	1817	was required to use the C<utf8> pragma to declare that a given scope
	1818	expected to deal with Unicode data and had to make sure that only
	1819	Unicode data were reaching that scope. If you have code that is
	1820	working with 5.6, you will need some of the following adjustments to
	1821	your code. The examples are written such that the code will continue
	1822	to work under 5.6, so you should be safe to try them out.
	1823
	1824	=over 4
	1825
	1826	=item *
	1827
	1828	A filehandle that should read or write UTF-8
	1829
	1830	    if ($] > 5.007) {
	1831	        binmode $fh, ":encoding(utf8)";
	1832	    }
	1833
	1834	=item *
	1835
	1836	A scalar that is going to be passed to some extension
	1837
	1838	Be it Compress::Zlib, Apache::Request or any extension that has no
	1839	mention of Unicode in the manpage, you need to make sure that the
	1840	UTF8 flag is stripped off. Note that at the time of this writing
	1841	(October 2002) the mentioned modules are not UTF-8-aware. Please
	1842	check the documentation to verify if this is still true.
	1843
	1844	    if ($] > 5.007) {
	1845	        require Encode;
	1846	        $val = Encode::encode_utf8($val); # make octets
	1847	    }
	1848
	1849	=item *
	1850
	1851	A scalar we got back from an extension
	1852
	1853	If you believe the scalar comes back as UTF-8, you will most likely
	1854	want the UTF8 flag restored:
	1855
	1856	    if ($] > 5.007) {
	1857	        require Encode;
	1858	        $val = Encode::decode_utf8($val);
	1859	    }
	1860
	1861	=item *
	1862
	1863	Same thing, if you are really sure it is UTF-8
	1864
	1865	    if ($] > 5.007) {
	1866	        require Encode;
	1867	        Encode::_utf8_on($val);
	1868	    }
	1869
	1870	=item *
	1871
	1872	A wrapper for fetchrow_array and fetchrow_hashref
	1873
	1874	When the database contains only UTF-8, a wrapper function or method is
	1875	a convenient way to replace all your fetchrow_array and
	1876	fetchrow_hashref calls. A wrapper function will also make it easier to
	1877	adapt to future enhancements in your database driver. Note that at the
	1878	time of this writing (October 2002), the DBI has no standardized way
	1879	to deal with UTF-8 data. Please check the documentation to verify if
	1880	that is still true.
	1881
	1882	    sub fetchrow {
	1883	        # $what is one of fetchrow_{array,hashref}
	1884	        my($self, $sth, $what) = @_;
	1885	        if ($] < 5.007) {
	1886	            return $sth->$what;
	1887	        } else {
	1888	            require Encode;
	1889	            if (wantarray) {
	1890	                my @arr = $sth->$what;
	1891	                for (@arr) {
	1892	                    defined && /[^\000-\177]/ && Encode::_utf8_on($_);
	1893	                }
	1894	                return @arr;
	1895	            } else {
	1896	                my $ret = $sth->$what;
	1897	                if (ref $ret) {
	1898	                    for my $k (keys %$ret) {
	1899	                        defined
	1900	                        && /[^\000-\177]/
	1901	                        && Encode::_utf8_on($_) for $ret->{$k};
	1902	                    }
	1903	                    return $ret;
	1904	                } else {
	1905	                    defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
	1906	                    return $ret;
	1907	                }
	1908	            }
	1909	        }
	1910	    }
	1911
	1912
	1913	=item *
	1914
	1915	A large scalar that you know can only contain ASCII
	1916
	1917	Scalars that contain only ASCII and are marked as UTF-8 are sometimes
	1918	a drag to your program. If you recognize such a situation, just remove
	1919	the UTF8 flag:
	1920
	1921	    utf8::downgrade($val) if $] > 5.007;
	1922
	1923	=back
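
The first item in the list above, a filehandle that reads or writes UTF-8,
can be tried without touching the file system. The sketch below is an
illustration under a few assumptions: it uses an in-memory filehandle, sets
the layer in C<open> rather than with C<binmode>, and does not attempt to
stay compatible with 5.6.

```perl
use strict;
use warnings;

my $text   = "\x{2660} spade";           # 7 characters, one of them above 255
my $buffer = '';

# Writing through the layer turns characters into UTF-8 octets.
open my $out, '>:encoding(UTF-8)', \$buffer or die "open for write: $!";
print {$out} $text;
close $out or die "close: $!";

my $octets = length $buffer;             # 3 octets for the spade + 6 for " spade"

# Reading through the layer turns the octets back into characters.
open my $in, '<:encoding(UTF-8)', \$buffer or die "open for read: $!";
my $read = do { local $/; <$in> };
close $in;

printf "%d octets stored, %d characters read\n", $octets, length $read;
```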
	1924
	1925	=head1 SEE ALSO
	1926
	1927	L<perlunitut>, L<perluniintro>, L<perluniprops>, L<Encode>, L<open>, L<utf8>, L<bytes>,
	1928	L<perlretut>, L<perlvar/"${^UNICODE}">,
	1929	L<http://www.unicode.org/reports/tr44>.
	1930
	1931	=cut