X-Git-Url: https://perl5.git.perl.org/perl5.git/blobdiff_plain/755789c0fc98a11f381782159172f1870a653abc..f57d8456e7b8d6b2dad0bb49899cfdc68007b794:/pod/perlunicode.pod diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod index 5a0dd7b..9c13c35 100644 --- a/pod/perlunicode.pod +++ b/pod/perlunicode.pod @@ -4,278 +4,399 @@ perlunicode - Unicode support in Perl =head1 DESCRIPTION +If you haven't already, before reading this document, you should become +familiar with both L and L. + +Unicode aims to B<UNI>-fy the en-B<CODE>-ings of all the world's +character sets into a single Standard. For quite a few of the various +coding standards that existed when Unicode was first created, converting +from each to Unicode essentially meant adding a constant to each code +point in the original standard, and converting back meant just +subtracting that same constant. For ASCII and ISO-8859-1, the constant +is 0. For ISO-8859-5 (Cyrillic), the constant is 864; for Hebrew +(ISO-8859-8), it's 1488; Thai (ISO-8859-11), 3424; and so forth. This +made it easy to do the conversions, and facilitated the adoption of +Unicode. + +And it worked; nowadays, those legacy standards are rarely used. Most +everyone uses Unicode. + +Unicode is a comprehensive standard. It specifies many things outside +the scope of Perl, such as how to display sequences of characters. For +a full discussion of all aspects of Unicode, see +L. + =head2 Important Caveats +Even though some of this section may not be understandable to you on +first reading, we think it's important enough to highlight some of the +gotchas before delving further, so here goes: + Unicode support is an extensive requirement. While Perl does not implement the Unicode standard or the accompanying technical reports from cover to cover, Perl does support many Unicode features. -People who want to learn to use Unicode in Perl, should probably read -the L and -L, before reading -this reference document. 
- -Also, the use of Unicode may present security issues that aren't obvious. -Read L. +Also, the use of Unicode may present security issues that aren't +obvious, see L. =over 4 -=item Safest if you "use feature 'unicode_strings'" +=item Safest if you C In order to preserve backward compatibility, Perl does not turn on full internal Unicode support unless the pragma -C is specified. (This is automatically -selected if you use C or higher.) Failure to do this can +L>|feature/The 'unicode_strings' feature> +is specified. (This is automatically +selected if you S> or higher.) Failure to do this can trigger unexpected surprises. See L below. -This pragma doesn't affect I/O, and there are still several places -where Unicode isn't fully supported, such as in filenames. +This pragma doesn't affect I/O. Nor does it change the internal +representation of strings, only their interpretation. There are still +several places where Unicode isn't fully supported, such as in +filenames. =item Input and Output Layers -Perl knows when a filehandle uses Perl's internal Unicode encodings -(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with -the ":encoding(utf8)" layer. Other encodings can be converted to Perl's -encoding on input or from Perl's encoding on output by use of the -":encoding(...)" layer. See L. - -To indicate that Perl source itself is in UTF-8, use C. +Use the C<:encoding(...)> layer to read from and write to +filehandles using the specified encoding. (See L.) -=item C still needed to enable UTF-8/UTF-EBCDIC in scripts +=item You should convert your non-ASCII, non-UTF-8 Perl scripts to be +UTF-8. -As a compatibility measure, the C pragma must be explicitly -included to enable recognition of UTF-8 in the Perl scripts themselves -(in string or regular expression literals, or in identifier names) on -ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based -machines. B -is needed.> See L. +See L. 
-=item BOM-marked scripts and UTF-16 scripts autodetected +=item C still needed to enable L in scripts -If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF16-BE, -or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either -endianness, Perl will correctly read in the script as Unicode. -(BOMless UTF-8 cannot be effectively recognized or differentiated from -ISO 8859-1 or other eight-bit encodings.) +If your Perl script is itself encoded in L, +the S> pragma must be explicitly included to enable +recognition of that (in string or regular expression literals, or in +identifier names). B> is needed.> (See L). -=item C needed to upgrade non-Latin-1 byte strings +If a Perl script begins with the bytes that form the UTF-8 encoding of +the Unicode BYTE ORDER MARK (C, see L), those +bytes are completely ignored. -By default, there is a fundamental asymmetry in Perl's Unicode model: -implicit upgrading from byte strings to Unicode strings assumes that -they were encoded in I, but Unicode strings are -downgraded with UTF-8 encoding. This happens because the first 256 -codepoints in Unicode happens to agree with Latin-1. +=item L scripts autodetected -See L for more details. +If a Perl script begins with the Unicode C (UTF-16LE, +UTF16-BE), or if the script looks like non-C-marked +UTF-16 of either endianness, Perl will correctly read in the script as +the appropriate Unicode encoding. =back =head2 Byte and Character Semantics -Beginning with version 5.6, Perl uses logically-wide characters to -represent strings internally. - -Starting in Perl 5.14, Perl-level operations work with -characters rather than bytes within the scope of a -C> (or equivalently -C or higher). (This is not true if bytes have been -explicitly requested by C>, nor necessarily true -for interactions with the platform's operating system.) 
- -For earlier Perls, and when C is not in effect, Perl -provides a fairly safe environment that can handle both types of -semantics in programs. For operations where Perl can unambiguously -decide that the input data are characters, Perl switches to character -semantics. For operations where this determination cannot be made -without additional information from the user, Perl decides in favor of -compatibility and chooses to use byte semantics. - -When C is in effect (which overrides -C in the same scope), Perl uses the -semantics associated -with the current locale. Otherwise, Perl uses the platform's native -byte semantics for characters whose code points are less than 256, and -Unicode semantics for those greater than 255. On EBCDIC platforms, this -is almost seamless, as the EBCDIC code pages that Perl handles are -equivalent to Unicode's first 256 code points. (The exception is that -EBCDIC regular expression case-insensitive matching rules are not as -as robust as Unicode's.) But on ASCII platforms, Perl uses US-ASCII -(or Basic Latin in Unicode terminology) byte semantics, meaning that characters -whose ordinal numbers are in the range 128 - 255 are undefined except for their -ordinal numbers. This means that none have case (upper and lower), nor are any -a member of character classes, like C<[:alpha:]> or C<\w>. (But all do belong -to the C<\W> class or the Perl regular expression extension C<[:^alpha:]>.) - -This behavior preserves compatibility with earlier versions of Perl, -which allowed byte semantics in Perl operations only if -none of the program's inputs were marked as being a source of Unicode -character data. Such data may come from filehandles, from calls to -external programs, from information provided by the system (such as %ENV), -or from literals and constants in the source text. - -The C pragma is primarily a compatibility device that enables -recognition of UTF-(8|EBCDIC) in literals encountered by the parser. 
-Note that this pragma is only required while Perl defaults to byte -semantics; when character semantics become the default, this pragma -may become a no-op. See L. - -If strings operating under byte semantics and strings with Unicode -character data are concatenated, the new string will have -character semantics. This can cause surprises: See L, below. -You can choose to be warned when this happens. See L. - -Under character semantics, many operations that formerly operated on -bytes now operate on characters. A character in Perl is -logically just a number ranging from 0 to 2**31 or so. Larger -characters may encode into longer sequences of bytes internally, but -this internal detail is mostly hidden for Perl code. -See L for more. - -=head2 Effects of Character Semantics - -Character semantics have the following effects: +Before Unicode, most encodings used 8 bits (a single byte) to encode +each character. Thus a character was a byte, and a byte was a +character, and there could be only 256 or fewer possible characters. +"Byte Semantics" in the title of this section refers to +this behavior. There was no need to distinguish between "Byte" and +"Character". + +Then along comes Unicode which has room for over a million characters +(and Perl allows for even more). This means that a character may +require more than a single byte to represent it, and so the two terms +are no longer equivalent. What matter are the characters as whole +entities, and not usually the bytes that comprise them. That's what the +term "Character Semantics" in the title of this section refers to. + +Perl had to change internally to decouple "bytes" from "characters". +It is important that you too change your ideas, if you haven't already, +so that "byte" and "character" no longer mean the same thing in your +mind. + +The basic building block of Perl strings has always been a "character". 
+The changes basically come down to that the implementation no longer +thinks that a character is always just a single byte. + +There are various things to note: =over 4 =item * +String handling functions, for the most part, continue to operate in +terms of characters. C, for example, returns the number of +characters in a string, just as before. But that number no longer is +necessarily the same as the number of bytes in the string (there may be +more bytes than characters). The other such functions include +C, C, C, C, C, C, +C, C, and C. + +The exceptions are: + +=over 4 + +=item * + +the bit-oriented C + +E + +=item * + +the byte-oriented C/C C<"C"> format + +However, the C specifier does operate on whole characters, as does the +C specifier. + +=item * + +some operators that interact with the platform's operating system + +Operators dealing with filenames are examples. + +=item * + +when the functions are called from within the scope of the +S>> pragma + +Likely, you should use this only for debugging anyway. + +=back + +=item * + Strings--including hash keys--and regular expression patterns may -contain characters that have an ordinal value larger than 255. +contain characters that have ordinal values larger than 255. If you use a Unicode editor to edit your program, Unicode characters may occur directly within the literal strings in UTF-8 encoding, or UTF-16. -(The former requires a BOM or C, the latter requires a BOM.) +(The former requires a C, the latter may require a C.) -Unicode characters can also be added to a string by using the C<\N{U+...}> -notation. The Unicode code for the desired character, in hexadecimal, -should be placed in the braces, after the C. For instance, a smiley face is -C<\N{U+263A}>. +L gives other ways to place non-ASCII +characters in your strings. -Alternatively, you can use the C<\x{...}> notation for characters 0x100 and -above. For characters below 0x100 you may get byte semantics instead of -character semantics; see L. 
On EBCDIC machines there is -the additional problem that the value for such characters gives the EBCDIC -character rather than the Unicode one. +=item * -Additionally, if you +The C and C functions work on whole characters. - use charnames ':full'; +=item * -you can use the C<\N{...}> notation and put the official Unicode -character name within the braces, such as C<\N{WHITE SMILING FACE}>. -See L. +Regular expressions match whole characters. For example, C<"."> matches +a whole character instead of only a single byte. =item * -If an appropriate L is specified, identifiers within the -Perl script may contain Unicode alphanumeric characters, including -ideographs. Perl does not currently attempt to canonicalize variable -names. +The C operator translates whole characters. (Note that the +C functionality has been removed. For similar functionality to +that, see C and C). =item * -Regular expressions match characters instead of bytes. "." matches -a character instead of a byte. +C reverses by character rather than by byte. =item * -Bracketed character classes in regular expressions match characters instead of -bytes and match against the character properties specified in the -Unicode properties database. C<\w> can be used to match a Japanese -ideograph, for instance. +The bit string operators, C<& | ^ ~> and (starting in v5.22) +C<&. |. ^. ~.> can operate on characters that don't fit into a byte. +However, the current behavior is likely to change. You should not use +these operators on strings that are encoded in UTF-8. If you're not +sure about the encoding of a string, downgrade it before using any of +these operators; you can use +L|utf8/Utility functions>. -=item * +=back -Named Unicode properties, scripts, and block ranges may be used (like bracketed -character classes) by using the C<\p{}> "matches property" construct and -the C<\P{}> negation, "doesn't match property". -See L for more details. 
+The bottom line is that Perl has always practiced "Character Semantics", +but with the advent of Unicode, that is now different than "Byte +Semantics". -You can define your own character properties and use them -in the regular expression with the C<\p{}> or C<\P{}> construct. -See L for more details. +=head2 ASCII Rules versus Unicode Rules -=item * +Before Unicode, when a character was a byte was a character, +Perl knew only about the 128 characters defined by ASCII, code points 0 +through 127 (except for under L>|perllocale>). That +left the code +points 128 to 255 as unassigned, and available for whatever use a +program might want. The only semantics they have is their ordinal +numbers, and that they are members of none of the non-negative character +classes. None are considered to match C<\w> for example, but all match +C<\W>. + +Unicode, of course, assigns each of those code points a particular +meaning (along with ones above 255). To preserve backward +compatibility, Perl only uses the Unicode meanings when there is some +indication that Unicode is what is intended; otherwise the non-ASCII +code points remain treated as if they are unassigned. -The special pattern C<\X> matches a logical character, an "extended grapheme -cluster" in Standardese. In Unicode what appears to the user to be a single -character, for example an accented C, may in fact be composed of a sequence -of characters, in this case a C followed by an accent character. C<\X> -will match the entire sequence. +Here are the ways that Perl knows that a string should be treated as +Unicode: + +=over =item * -The C operator translates characters instead of bytes. Note -that the C functionality has been removed. For similar -functionality see pack('U0', ...) and pack('C0', ...). +Within the scope of S> + +If the whole program is Unicode (signified by using 8-bit B<U>nicode +B<T>ransformation B<F>ormat), then all strings within it must be +Unicode. 
=item * -Case translation operators use the Unicode case translation tables -when character input is provided. Note that C, or C<\U> in -interpolated strings, translates to uppercase, while C, -or C<\u> in interpolated strings, translates to titlecase in languages -that make the distinction (which is equivalent to uppercase in languages -without the distinction). +Within the scope of +L>|feature/The 'unicode_strings' feature> + +This pragma was created so you can explicitly tell Perl that operations +executed within its scope are to use Unicode rules. More operations are +affected with newer perls. See L. =item * -Most operators that deal with positions or lengths in a string will -automatically switch to using character positions, including -C, C, C, C, C, C, -C, C, and C. An operator that -specifically does not switch is C. Operators that really don't -care include operators that treat strings as a bucket of bits such as -C, and operators dealing with filenames. +Within the scope of S> or higher + +This implicitly turns on S>. =item * -The C/C letter C does I change, since it is often -used for byte-oriented formats. Again, think C in the C language. +Within the scope of +L>|perllocale/Unicode and UTF-8>, +or L>|perllocale> and the current +locale is a UTF-8 locale. -There is a new C specifier that converts between Unicode characters -and code points. There is also a C specifier that is the equivalent of -C/C and properly handles character values even if they are above 255. +The former is defined to imply Unicode handling; and the latter +indicates a Unicode locale, hence a Unicode interpretation of all +strings within it. =item * -The C and C functions work on characters, similar to -C and C, I C and -C. C and C are methods for -emulating byte-oriented C and C on Unicode strings. -While these methods reveal the internal encoding of Unicode strings, -that is not something one normally needs to care about at all. 
+When the string contains a Unicode-only code point + +Perl has never accepted code points above 255 without them being +Unicode, so their use implies Unicode for the whole string. =item * -The bit string operators, C<& | ^ ~>, can operate on character data. -However, for backward compatibility, such as when using bit string -operations when characters are all less than 256 in ordinal value, one -should not use C<~> (the bit complement) with characters of both -values less than 256 and values greater than 256. Most importantly, -DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>) -will not hold. The reason for this mathematical I is that -the complement cannot return B the 8-bit (byte-wide) bit -complement B the full character-wide bit complement. +When the string contains a Unicode named code point C<\N{...}> + +The C<\N{...}> construct explicitly refers to a Unicode code point, +even if it is one that is also in ASCII. Therefore the string +containing it must be Unicode. =item * -You can define your own mappings to be used in C, -C, C, and C (or their double-quoted string inlined -versions such as C<\U>). See -L -for more details. +When the string has come from an external source marked as +Unicode + +The L|perlrun/-C [numberElist]> command line option can +specify that certain inputs to the program are Unicode, and the values +of this can be read by your Perl code, see L. + +=item * When the string has been upgraded to UTF-8 + +The function L|utf8/Utility functions> +can be explicitly used to permanently (unless a subsequent +C is called) cause a string to be treated as +Unicode. + +=item * There are additional methods for regular expression patterns + +A pattern that is compiled with the C<< /u >> or C<< /a >> modifiers is +treated as Unicode (though there are some restrictions with C<< /a >>). +Under the C<< /d >> and C<< /l >> modifiers, there are several other +indications for Unicode; see L. 
=back +Note that all of the above are overridden within the scope of +C>; but you should be using this pragma only for +debugging. + +Note also that some interactions with the platform's operating system +never use Unicode rules. + +When Unicode rules are in effect: + =over 4 =item * -And finally, C reverses by character rather than by byte. +Case translation operators use the Unicode case translation tables. + +Note that C, or C<\U> in interpolated strings, translates to +uppercase, while C, or C<\u> in interpolated strings, +translates to titlecase in languages that make the distinction (which is +equivalent to uppercase in languages without the distinction). + +There is a CPAN module, C>, which allows you to +define your own mappings to be used in C, C, C, +C, and C (or their double-quoted string inlined versions +such as C<\U>). (Prior to Perl 5.16, this functionality was partially +provided in the Perl core, but suffered from a number of insurmountable +drawbacks, so the CPAN module was written instead.) + +=item * + +Character classes in regular expressions match based on the character +properties specified in the Unicode properties database. + +C<\w> can be used to match a Japanese ideograph, for instance; and +C<[[:digit:]]> a Bengali number. + +=item * + +Named Unicode properties, scripts, and block ranges may be used (like +bracketed character classes) by using the C<\p{}> "matches property" +construct and the C<\P{}> negation, "doesn't match property". + +See L for more details. + +You can define your own character properties and use them +in the regular expression with the C<\p{}> or C<\P{}> construct. +See L for more details. =back +=head2 Extended Grapheme Clusters (Logical characters) + +Consider a character, say C. It could appear with various marks around it, +such as an acute accent, or a circumflex, or various hooks, circles, arrows, +I, above, below, to one side or the other, I. There are many +possibilities among the world's languages. 
The number of combinations is +astronomical, and if there were a character for each combination, it would +soon exhaust Unicode's more than a million possible characters. So Unicode +took a different approach: there is a character for the base C, and a +character for each of the possible marks, and these can be variously combined +to get a final logical character. So a logical character--what appears to be a +single character--can be a sequence of more than one individual character. +The Unicode standard calls these "extended grapheme clusters" (which +is an improved version of the no longer much used "grapheme cluster"); +Perl furnishes the C<\X> regular expression construct to match such +sequences in their entirety. + +But Unicode's intent is to unify the existing character set standards and +practices, and several pre-existing standards have single characters that +mean the same thing as some of these combinations, like ISO-8859-1, +which has quite a few of them. For example, C<"LATIN CAPITAL LETTER E +WITH ACUTE"> was already in this standard when Unicode came along. +Unicode therefore added it to its repertoire as that single character. +But this character is considered by Unicode to be equivalent to the +sequence consisting of the character C<"LATIN CAPITAL LETTER E"> +followed by the character C<"COMBINING ACUTE ACCENT">. + +C<"LATIN CAPITAL LETTER E WITH ACUTE"> is called a "pre-composed" +character, and its equivalence with the "E" and the "COMBINING ACCENT" +sequence is called canonical equivalence. All pre-composed characters +are said to have a decomposition (into the equivalent sequence), and the +decomposition type is also called canonical. A string may be comprised +as much as possible of precomposed characters, or it may be comprised of +entirely decomposed characters. Unicode calls these respectively, +"Normalization Form Composed" (NFC) and "Normalization Form Decomposed" (NFD). +The C> module contains functions that convert +between the two. 
A string may also have both composed characters and +decomposed characters; this module can be used to make it all one or the +other. + +You may be presented with strings in any of these equivalent forms. +There is currently nothing in Perl 5 that ignores the differences. So +you'll have to specially handle it. The usual advice is to convert your +inputs to C before processing further. + +For more detailed information, see L. + =head2 Unicode Character Properties (The only time that Perl considers a sequence of individual code @@ -288,26 +409,29 @@ regular expressions by using the C<\p{}> "matches property" construct and the C<\P{}> "doesn't match property" for its negation. For instance, C<\p{Uppercase}> matches any single character with the Unicode -"Uppercase" property, while C<\p{L}> matches any character with a -General_Category of "L" (letter) property. Brackets are not +C<"Uppercase"> property, while C<\p{L}> matches any character with a +C of C<"L"> (letter) property (see +L below). Brackets are not required for single letter property names, so C<\p{L}> is equivalent to C<\pL>. More formally, C<\p{Uppercase}> matches any single character whose Unicode -Uppercase property value is True, and C<\P{Uppercase}> matches any character -whose Uppercase property value is False, and they could have been written as +C property value is C, and C<\P{Uppercase}> matches any character +whose C property value is C, and they could have been written as C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively. This formality is needed when properties are not binary; that is, if they can -take on more values than just True and False. For example, the Bidi_Class (see -L below), can take on several different -values, such as Left, Right, Whitespace, and others. To match these, one needs -to specify the property name (Bidi_Class), AND the value being matched against -(Left, Right, etc.). 
This is done, as in the examples above, by having the +take on more values than just C and C. For example, the +C property (see L below), +can take on several different +values, such as C, C, C, and others. To match these, one needs +to specify both the property name (C), AND the value being +matched against +(C, C, I). This is done, as in the examples above, by having the two components separated by an equal sign (or interchangeably, a colon), like C<\p{Bidi_Class: Left}>. All Unicode-defined character properties may be written in these compound forms -of C<\p{property=value}> or C<\p{property:value}>, but Perl provides some +of C<\p{I=I}> or C<\p{I:I}>, but Perl provides some additional properties that are written only in the single form, as well as single-form short-cuts for all binary properties and certain others described below, in which you may omit the property name and the equals or colon @@ -315,17 +439,19 @@ separator. Most Unicode character properties have at least two synonyms (or aliases if you prefer): a short one that is easier to type and a longer one that is more -descriptive and hence easier to understand. Thus the "L" and "Letter" properties -above are equivalent and can be used interchangeably. Likewise, -"Upper" is a synonym for "Uppercase", and we could have written -C<\p{Uppercase}> equivalently as C<\p{Upper}>. Also, there are typically -various synonyms for the values the property can be. For binary properties, -"True" has 3 synonyms: "T", "Yes", and "Y"; and "False has correspondingly "F", -"No", and "N". But be careful. A short form of a value for one property may -not mean the same thing as the same short form for another. Thus, for the -General_Category property, "L" means "Letter", but for the Bidi_Class property, -"L" means "Left". A complete list of properties and synonyms is in -L. +descriptive and hence easier to understand. Thus the C<"L"> and +C<"Letter"> properties above are equivalent and can be used +interchangeably. 
Likewise, C<"Upper"> is a synonym for C<"Uppercase">, +and we could have written C<\p{Uppercase}> equivalently as C<\p{Upper}>. +Also, there are typically various synonyms for the values the property +can be. For binary properties, C<"True"> has 3 synonyms: C<"T">, +C<"Yes">, and C<"Y">; and C<"False"> has correspondingly C<"F">, +C<"No">, and C<"N">. But be careful. A short form of a value for one +property may not mean the same thing as the same short form for another. +Thus, for the C> property, C<"L"> means +C<"Letter">, but for the L|/Bidirectional Character Types> +property, C<"L"> means C<"Left">. A complete list of properties and +synonyms is in L. Upper/lower case differences in property names and values are irrelevant; thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>. @@ -342,7 +468,7 @@ cares about white space (except adjacent to non-word characters), hyphens, and non-interior underscores. You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret -(^) between the first brace and the property name: C<\p{^Tamil}> is +(C<^>) between the first brace and the property name: C<\p{^Tamil}> is equal to C<\P{Tamil}>. Almost all properties are immune to case-insensitive matching. That is, @@ -359,10 +485,13 @@ C, and C, all of which match C under C matching. This set also includes its subsets C and C both -of which under C matching match C. +of which under C match C. (The difference between these sets is that some things, such as Roman -numerals, come in both upper and lower case so they are C, but aren't considered -letters, so they aren't Cs.) +numerals, come in both upper and lower case so they are C, but +aren't considered letters, so they aren't C's.) + +See L for special considerations when +matching Unicode properties against non-Unicode code points. =head3 B @@ -371,11 +500,12 @@ usual categorization of a character" (from L). 
The compound way of writing these is like C<\p{General_Category=Number}> -(short, C<\p{gc:n}>). But Perl furnishes shortcuts in which everything up +(short: C<\p{gc:n}>). But Perl furnishes shortcuts in which everything up through the equal or colon separator is omitted. So you can instead just write C<\pN>. -Here are the short and long forms of the General Category properties: +Here are the short and long forms of the values the C property +can have: Short Long @@ -433,10 +563,10 @@ C and C are special: both are aliases for the set consisting of everythi =head3 B Because scripts differ in their directionality (Hebrew and Arabic are -written right to left, for example) Unicode supplies these properties in -the Bidi_Class class: +written right to left, for example) Unicode supplies a C property. +Some of the values this property can have are: - Property Meaning + Value Meaning L Left-to-Right LRE Left-to-Right Embedding @@ -460,7 +590,13 @@ the Bidi_Class class: This property is always written in the compound form. For example, C<\p{Bidi_Class:R}> matches characters that are normally -written right to left. +written right to left. Unlike the +C> property, this +property can have more values added in a future Unicode release. Those +listed above comprised the complete set for many Unicode releases, but +others were added in Unicode 6.3; you can always find what the +current ones are in L. And +L describes how to use them. =head3 B @@ -469,17 +605,82 @@ The world's languages are written in many different scripts. This sentence written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in Hiragana or Katakana. There are many more. -The Unicode Script property gives what script a given character is in, -and the property can be specified with the compound form like -C<\p{Script=Hebrew}> (short: C<\p{sc=hebr}>). Perl furnishes shortcuts for all -script names. 
You can omit everything up through the equals (or colon), and -simply write C<\p{Latin}> or C<\P{Cyrillic}>. +The Unicode C