X-Git-Url: https://perl5.git.perl.org/perl5.git/blobdiff_plain/70a5eb4a0fb40c8d59c26043e7b9aa76f8ea2802..f434f3571e41ee9c418f07c8510af58cf4083f70:/pod/perlunicode.pod diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod index 9507536..44c2a98 100644 --- a/pod/perlunicode.pod +++ b/pod/perlunicode.pod @@ -4,350 +4,492 @@ perlunicode - Unicode support in Perl =head1 DESCRIPTION +If you haven't already, before reading this document, you should become +familiar with both L<perlunitut> and L<perluniintro>. + +Unicode aims to B<UNI>-fy the en-B<CODE>-ings of all the world's +character sets into a single Standard. For quite a few of the various +coding standards that existed when Unicode was first created, converting +from each to Unicode essentially meant adding a constant to each code +point in the original standard, and converting back meant just +subtracting that same constant. For ASCII and ISO-8859-1, the constant +is 0. For ISO-8859-5 (Cyrillic), the constant is 864; for Hebrew +(ISO-8859-8), it's 1264; Thai (ISO-8859-11), 3424; and so forth. This +made it easy to do the conversions, and facilitated the adoption of +Unicode. + +And it worked; nowadays, those legacy standards are rarely used. Most +everyone uses Unicode. + +Unicode is a comprehensive standard. It specifies many things outside +the scope of Perl, such as how to display sequences of characters. For +a full discussion of all aspects of Unicode, see +L<http://www.unicode.org>. + =head2 Important Caveats +Even though some of this section may not be understandable to you on +first reading, we think it's important enough to highlight some of the +gotchas before delving further, so here goes: + Unicode support is an extensive requirement. While Perl does not implement the Unicode standard or the accompanying technical reports from cover to cover, Perl does support many Unicode features. -People who want to learn to use Unicode in Perl, should probably read -the L, before reading -this reference document. 
- Also, the use of Unicode may present security issues that aren't obvious. Read L. =over 4 -=item Input and Output Layers +=item Safest if you C -Perl knows when a filehandle uses Perl's internal Unicode encodings -(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with -the ":utf8" layer. Other encodings can be converted to Perl's -encoding on input or from Perl's encoding on output by use of the -":encoding(...)" layer. See L. +In order to preserve backward compatibility, Perl does not turn +on full internal Unicode support unless the pragma +L>|feature/The 'unicode_strings' feature> +is specified. (This is automatically +selected if you S> or higher.) Failure to do this can +trigger unexpected surprises. See L below. -To indicate that Perl source itself is in UTF-8, use C. +This pragma doesn't affect I/O. Nor does it change the internal +representation of strings, only their interpretation. There are still +several places where Unicode isn't fully supported, such as in +filenames. -=item Regular Expressions - -The regular expression compiler produces polymorphic opcodes. That is, -the pattern adapts to the data and automatically switches to the Unicode -character scheme when presented with data that is internally encoded in -UTF-8, or instead uses a traditional byte scheme when presented with -byte data. +=item Input and Output Layers -=item C still needed to enable UTF-8/UTF-EBCDIC in scripts +Use the C<:encoding(...)> layer to read from and write to +filehandles using the specified encoding. (See L.) -As a compatibility measure, the C pragma must be explicitly -included to enable recognition of UTF-8 in the Perl scripts themselves -(in string or regular expression literals, or in identifier names) on -ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based -machines. B -is needed.> See L. +=item You should convert your non-ASCII, non-UTF-8 Perl scripts to be +UTF-8. -=item BOM-marked scripts and UTF-16 scripts autodetected +See L. 
-If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF16-BE, -or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either -endianness, Perl will correctly read in the script as Unicode. -(BOMless UTF-8 cannot be effectively recognized or differentiated from -ISO 8859-1 or other eight-bit encodings.) +=item C still needed to enable L in scripts -=item C needed to upgrade non-Latin-1 byte strings +If your Perl script is itself encoded in L, +the S> pragma must be explicitly included to enable +recognition of that (in string or regular expression literals, or in +identifier names). B> is needed.> (See L). -By default, there is a fundamental asymmetry in Perl's Unicode model: -implicit upgrading from byte strings to Unicode strings assumes that -they were encoded in I, but Unicode strings are -downgraded with UTF-8 encoding. This happens because the first 256 -codepoints in Unicode happens to agree with Latin-1. +=item C-marked scripts and L scripts autodetected -See L for more details. +If a Perl script begins with the Unicode C (UTF-16LE, +UTF16-BE, or UTF-8), or if the script looks like non-C-marked +UTF-16 of either endianness, Perl will correctly read in the script as +the appropriate Unicode encoding. (C-less UTF-8 cannot be +effectively recognized or differentiated from ISO 8859-1 or other +eight-bit encodings.) =back =head2 Byte and Character Semantics -Beginning with version 5.6, Perl uses logically-wide characters to -represent strings internally. - -In future, Perl-level operations will be expected to work with -characters rather than bytes. - -However, as an interim compatibility measure, Perl aims to -provide a safe migration path from byte semantics to character -semantics for programs. For operations where Perl can unambiguously -decide that the input data are characters, Perl switches to -character semantics. 
For operations where this determination cannot -be made without additional information from the user, Perl decides in -favor of compatibility and chooses to use byte semantics. - -Under byte semantics, when C is in effect, Perl uses the -semantics associated with the current locale. Absent a C, and -absent a C pragma, Perl currently uses US-ASCII -(or Basic Latin in Unicode terminology) byte semantics, meaning that characters -whose ordinal numbers are in the range 128 - 255 are undefined except for their -ordinal numbers. This means that none have case (upper and lower), nor are any -a member of character classes, like C<[:alpha:]> or C<\w>. (But all do belong -to the C<\W> class or the Perl regular expression extension C<[:^alpha:]>.) - -This behavior preserves compatibility with earlier versions of Perl, -which allowed byte semantics in Perl operations only if -none of the program's inputs were marked as being a source of Unicode -character data. Such data may come from filehandles, from calls to -external programs, from information provided by the system (such as %ENV), -or from literals and constants in the source text. - -The C pragma will always, regardless of platform, force byte -semantics in a particular lexical scope. See L. - -The C pragma is intended to always, regardless -of platform, force character (Unicode) semantics in a particular lexical scope. -In release 5.12, it is partially implemented, applying only to case changes. -See L below. - -The C pragma is primarily a compatibility device that enables -recognition of UTF-(8|EBCDIC) in literals encountered by the parser. -Note that this pragma is only required while Perl defaults to byte -semantics; when character semantics become the default, this pragma -may become a no-op. See L. - -Unless explicitly stated, Perl operators use character semantics -for Unicode data and byte semantics for non-Unicode data. -The decision to use character semantics is made transparently. 
If -input data comes from a Unicode source--for example, if a character -encoding layer is added to a filehandle or a literal Unicode -string constant appears in a program--character semantics apply. -Otherwise, byte semantics are in effect. The C pragma should -be used to force byte semantics on Unicode data, and the C pragma to force Unicode semantics on byte data (though in -5.12 it isn't fully implemented). - -If strings operating under byte semantics and strings with Unicode -character data are concatenated, the new string will have -character semantics. This can cause surprises: See L, below. -You can choose to be warned when this happens. See L. - -Under character semantics, many operations that formerly operated on -bytes now operate on characters. A character in Perl is -logically just a number ranging from 0 to 2**31 or so. Larger -characters may encode into longer sequences of bytes internally, but -this internal detail is mostly hidden for Perl code. -See L for more. - -=head2 Effects of Character Semantics - -Character semantics have the following effects: +Before Unicode, most encodings used 8 bits (a single byte) to encode +each character. Thus a character was a byte, and a byte was a +character, and there could be only 256 or fewer possible characters. +"Byte Semantics" in the title of this section refers to +this behavior. There was no need to distinguish between "Byte" and +"Character". + +Then along comes Unicode which has room for over a million characters +(and Perl allows for even more). This means that a character may +require more than a single byte to represent it, and so the two terms +are no longer equivalent. What matter are the characters as whole +entities, and not usually the bytes that comprise them. That's what the +term "Character Semantics" in the title of this section refers to. + +Perl had to change internally to decouple "bytes" from "characters". 
+It is important that you too change your ideas, if you haven't already, +so that "byte" and "character" no longer mean the same thing in your +mind. + +The basic building block of Perl strings has always been a "character". +The changes basically come down to that the implementation no longer +thinks that a character is always just a single byte. + +There are various things to note: =over 4 =item * +String handling functions, for the most part, continue to operate in +terms of characters. C, for example, returns the number of +characters in a string, just as before. But that number no longer is +necessarily the same as the number of bytes in the string (there may be +more bytes than characters). The other such functions include +C, C, C, C, C, C, +C, C, and C. + +The exceptions are: + +=over 4 + +=item * + +the bit-oriented C + +E + +=item * + +the byte-oriented C/C C<"C"> format + +However, the C specifier does operate on whole characters, as does the +C specifier. + +=item * + +some operators that interact with the platform's operating system + +Operators dealing with filenames are examples. + +=item * + +when the functions are called from within the scope of the +S>> pragma + +Likely, you should use this only for debugging anyway. + +=back + +=item * + Strings--including hash keys--and regular expression patterns may -contain characters that have an ordinal value larger than 255. +contain characters that have ordinal values larger than 255. If you use a Unicode editor to edit your program, Unicode characters may occur directly within the literal strings in UTF-8 encoding, or UTF-16. -(The former requires a BOM or C, the latter requires a BOM.) +(The former requires a C or C, the latter requires a C.) -Unicode characters can also be added to a string by using the C<\N{U+...}> -notation. The Unicode code for the desired character, in hexadecimal, -should be placed in the braces, after the C. For instance, a smiley face is -C<\N{U+263A}>. 
+L gives other ways to place non-ASCII +characters in your strings. -Alternatively, you can use the C<\x{...}> notation for characters 0x100 and -above. For characters below 0x100 you may get byte semantics instead of -character semantics; see L. On EBCDIC machines there is -the additional problem that the value for such characters gives the EBCDIC -character rather than the Unicode one. +=item * -Additionally, if you +The C and C functions work on whole characters. - use charnames ':full'; +=item * -you can use the C<\N{...}> notation and put the official Unicode -character name within the braces, such as C<\N{WHITE SMILING FACE}>. -See L. +Regular expressions match whole characters. For example, C<"."> matches +a whole character instead of only a single byte. =item * -If an appropriate L is specified, identifiers within the -Perl script may contain Unicode alphanumeric characters, including -ideographs. Perl does not currently attempt to canonicalize variable -names. +The C operator translates whole characters. (Note that the +C functionality has been removed. For similar functionality to +that, see C and C). =item * -Regular expressions match characters instead of bytes. "." matches -a character instead of a byte. +C reverses by character rather than by byte. =item * -Bracketed character classes in regular expressions match characters instead of -bytes and match against the character properties specified in the -Unicode properties database. C<\w> can be used to match a Japanese -ideograph, for instance. +The bit string operators, C<& | ^ ~> and (starting in v5.22) +C<&. |. ^. ~.> can operate on characters that don't fit into a byte. +However, the current behavior is likely to change. You should not use +these operators on strings that are encoded in UTF-8. If you're not +sure about the encoding of a string, downgrade it before using any of +these operators; you can use +L|utf8/Utility functions>. 
-=item * +=back -Named Unicode properties, scripts, and block ranges may be used (like bracketed -character classes) by using the C<\p{}> "matches property" construct and -the C<\P{}> negation, "doesn't match property". -See L for more details. +The bottom line is that Perl has always practiced "Character Semantics", +but with the advent of Unicode, that is now different than "Byte +Semantics". -You can define your own character properties and use them -in the regular expression with the C<\p{}> or C<\P{}> construct. -See L for more details. +=head2 ASCII Rules versus Unicode Rules -=item * +Before Unicode, when a character was a byte was a character, +Perl knew only about the 128 characters defined by ASCII, code points 0 +through 127 (except for under S>). That left the code +points 128 to 255 as unassigned, and available for whatever use a +program might want. The only semantics they have is their ordinal +numbers, and that they are members of none of the non-negative character +classes. None are considered to match C<\w> for example, but all match +C<\W>. + +Unicode, of course, assigns each of those code points a particular +meaning (along with ones above 255). To preserve backward +compatibility, Perl only uses the Unicode meanings when there is some +indication that Unicode is what is intended; otherwise the non-ASCII +code points remain treated as if they are unassigned. -The special pattern C<\X> matches a logical character, an "extended grapheme -cluster" in Standardese. In Unicode what appears to the user to be a single -character, for example an accented C, may in fact be composed of a sequence -of characters, in this case a C followed by an accent character. C<\X> -will match the entire sequence. +Here are the ways that Perl knows that a string should be treated as +Unicode: + +=over =item * -The C operator translates characters instead of bytes. Note -that the C functionality has been removed. For similar -functionality see pack('U0', ...) 
and pack('C0', ...). +Within the scope of S> + +If the whole program is Unicode (signified by using 8-bit Bnicode +Bransformation Bormat), then all strings within it must be +Unicode. =item * -Case translation operators use the Unicode case translation tables -when character input is provided. Note that C, or C<\U> in -interpolated strings, translates to uppercase, while C, -or C<\u> in interpolated strings, translates to titlecase in languages -that make the distinction (which is equivalent to uppercase in languages -without the distinction). +Within the scope of +L>|feature/The 'unicode_strings' feature> + +This pragma was created so you can explicitly tell Perl that operations +executed within its scope are to use Unicode rules. More operations are +affected with newer perls. See L. =item * -Most operators that deal with positions or lengths in a string will -automatically switch to using character positions, including -C, C, C, C, C, C, -C, C, and C. An operator that -specifically does not switch is C. Operators that really don't -care include operators that treat strings as a bucket of bits such as -C, and operators dealing with filenames. +Within the scope of S> or higher + +This implicitly turns on S>. =item * -The C/C letter C does I change, since it is often -used for byte-oriented formats. Again, think C in the C language. +Within the scope of +L>|perllocale/Unicode and UTF-8>, +or L>|perllocale> and the current +locale is a UTF-8 locale. -There is a new C specifier that converts between Unicode characters -and code points. There is also a C specifier that is the equivalent of -C/C and properly handles character values even if they are above 255. +The former is defined to imply Unicode handling; and the latter +indicates a Unicode locale, hence a Unicode interpretation of all +strings within it. =item * -The C and C functions work on characters, similar to -C and C, I C and -C. C and C are methods for -emulating byte-oriented C and C on Unicode strings. 
-While these methods reveal the internal encoding of Unicode strings, -that is not something one normally needs to care about at all. +When the string contains a Unicode-only code point + +Perl has never accepted code points above 255 without them being +Unicode, so their use implies Unicode for the whole string. =item * -The bit string operators, C<& | ^ ~>, can operate on character data. -However, for backward compatibility, such as when using bit string -operations when characters are all less than 256 in ordinal value, one -should not use C<~> (the bit complement) with characters of both -values less than 256 and values greater than 256. Most importantly, -DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>) -will not hold. The reason for this mathematical I is that -the complement cannot return B the 8-bit (byte-wide) bit -complement B the full character-wide bit complement. +When the string contains a Unicode named code point C<\N{...}> + +The C<\N{...}> construct explicitly refers to a Unicode code point, +even if it is one that is also in ASCII. Therefore the string +containing it must be Unicode. =item * -You can define your own mappings to be used in C, -C, C, and C (or their double-quoted string inlined -versions such as C<\U>). -See L for more details. +When the string has come from an external source marked as +Unicode + +The L|perlrun/-C [numberElist]> command line option can +specify that certain inputs to the program are Unicode, and the values +of this can be read by your Perl code, see L. + +=item * When the string has been upgraded to UTF-8 + +The function L|utf8/Utility functions> +can be explicitly used to permanently (unless a subsequent +C is called) cause a string to be treated as +Unicode. + +=item * There are additional methods for regular expression patterns + +A pattern that is compiled with the C<< /u >> or C<< /a >> modifiers is +treated as Unicode (though there are some restrictions with C<< /a >>). 
+Under the C<< /d >> and C<< /l >> modifiers, there are several other +indications for Unicode; see L. =back +Note that all of the above are overridden within the scope of +C>; but you should be using this pragma only for +debugging. + +Note also that some interactions with the platform's operating system +never use Unicode rules. + +When Unicode rules are in effect: + =over 4 =item * -And finally, C reverses by character rather than by byte. +Case translation operators use the Unicode case translation tables. + +Note that C, or C<\U> in interpolated strings, translates to +uppercase, while C, or C<\u> in interpolated strings, +translates to titlecase in languages that make the distinction (which is +equivalent to uppercase in languages without the distinction). + +There is a CPAN module, C>, which allows you to +define your own mappings to be used in C, C, C, +C, and C (or their double-quoted string inlined versions +such as C<\U>). (Prior to Perl 5.16, this functionality was partially +provided in the Perl core, but suffered from a number of insurmountable +drawbacks, so the CPAN module was written instead.) + +=item * + +Character classes in regular expressions match based on the character +properties specified in the Unicode properties database. + +C<\w> can be used to match a Japanese ideograph, for instance; and +C<[[:digit:]]> a Bengali number. + +=item * + +Named Unicode properties, scripts, and block ranges may be used (like +bracketed character classes) by using the C<\p{}> "matches property" +construct and the C<\P{}> negation, "doesn't match property". + +See L for more details. + +You can define your own character properties and use them +in the regular expression with the C<\p{}> or C<\P{}> construct. +See L for more details. =back -=head2 Unicode Character Properties +=head2 Extended Grapheme Clusters (Logical characters) -Most Unicode character properties are accessible by using regular expressions. 
-They are used (like bracketed character classes) by using the C<\p{}> "matches -property" construct and the C<\P{}> negation, "doesn't match property". +Consider a character, say C. It could appear with various marks around it, +such as an acute accent, or a circumflex, or various hooks, circles, arrows, +I, above, below, to one side or the other, I. There are many +possibilities among the world's languages. The number of combinations is +astronomical, and if there were a character for each combination, it would +soon exhaust Unicode's more than a million possible characters. So Unicode +took a different approach: there is a character for the base C, and a +character for each of the possible marks, and these can be variously combined +to get a final logical character. So a logical character--what appears to be a +single character--can be a sequence of more than one individual characters. +The Unicode standard calls these "extended grapheme clusters" (which +is an improved version of the no-longer much used "grapheme cluster"); +Perl furnishes the C<\X> regular expression construct to match such +sequences in their entirety. + +But Unicode's intent is to unify the existing character set standards and +practices, and several pre-existing standards have single characters that +mean the same thing as some of these combinations, like ISO-8859-1, +which has quite a few of them. For example, C<"LATIN CAPITAL LETTER E +WITH ACUTE"> was already in this standard when Unicode came along. +Unicode therefore added it to its repertoire as that single character. +But this character is considered by Unicode to be equivalent to the +sequence consisting of the character C<"LATIN CAPITAL LETTER E"> +followed by the character C<"COMBINING ACUTE ACCENT">. + +C<"LATIN CAPITAL LETTER E WITH ACUTE"> is called a "pre-composed" +character, and its equivalence with the "E" and the "COMBINING ACCENT" +sequence is called canonical equivalence. 
All pre-composed characters +are said to have a decomposition (into the equivalent sequence), and the +decomposition type is also called canonical. A string may be comprised +as much as possible of precomposed characters, or it may be comprised of +entirely decomposed characters. Unicode calls these respectively, +"Normalization Form Composed" (NFC) and "Normalization Form Decomposed" (NFD). +The C<L<Unicode::Normalize>> module contains functions that convert +between the two. A string may also have both composed characters and +decomposed characters; this module can be used to make it all one or the +other. + +You may be presented with strings in any of these equivalent forms. +There is currently nothing in Perl 5 that ignores the differences. So +you'll have to specially handle it. The usual advice is to convert your +inputs to C<NFD> before processing further. + +For more detailed information, see L<http://unicode.org/reports/tr15/>. -Note that the only time that Perl considers a sequence of individual code +=head2 Unicode Character Properties + +(The only time that Perl considers a sequence of individual code points as a single logical character is in the C<\X> construct, already mentioned above. Therefore "character" in this discussion means a single -Unicode code point. +Unicode code point.) + +Very nearly all Unicode character properties are accessible through +regular expressions by using the C<\p{}> "matches property" construct +and the C<\P{}> "doesn't match property" for its negation. For instance, C<\p{Uppercase}> matches any single character with the Unicode -"Uppercase" property, while C<\p{L}> matches any character with a -General_Category of "L" (letter) property. Brackets are not +C<"Uppercase"> property, while C<\p{L}> matches any character with a +C<General_Category> of C<"L"> (letter) property (see +L</General_Category> below). Brackets are not required for single letter property names, so C<\p{L}> is equivalent to C<\pL>. 
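The single-form properties just described can be tried out directly. The following is a minimal sketch, not part of the patch; the helper names C<all_uppercase> and C<count_letters> are hypothetical and exist only for illustration. It uses C<\x{...}> escapes so that the file itself stays pure ASCII.

```perl
use strict;
use warnings;

# Hypothetical helper: true (1) if every character of $str has the
# Unicode Uppercase property, false (0) otherwise.
sub all_uppercase {
    my ($str) = @_;
    return $str =~ /\A\p{Uppercase}+\z/ ? 1 : 0;
}

# Hypothetical helper: count letters twice, once with the braced
# spelling \p{L} and once with the brace-less spelling \pL.
# The two counts are expected to agree.
sub count_letters {
    my ($str) = @_;
    my $braced   = () = $str =~ /\p{L}/g;
    my $unbraced = () = $str =~ /\pL/g;
    return ($braced, $unbraced);
}

# "\x{3A9}" is GREEK CAPITAL LETTER OMEGA, "\x{3C9}" the small omega.
my $caps_only = all_uppercase("\x{3A9}\x{394}");    # two capitals
my $mixed     = all_uppercase("\x{3A9}\x{3C9}");    # capital + small
my @counts    = count_letters("a1\x{3C9}!");        # two letters in total
```

Nothing is printed here, which avoids "Wide character" warnings on an unencoded output handle; inspect the variables in a debugger or encode them before printing.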
More formally, C<\p{Uppercase}> matches any single character whose Unicode -Uppercase property value is True, and C<\P{Uppercase}> matches any character -whose Uppercase property value is False, and they could have been written as +C<Uppercase> property value is C<True>, and C<\P{Uppercase}> matches any character +whose C<Uppercase> property value is C<False>, and they could have been written as C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively. -This formality is needed when properties are not binary, that is if they can -take on more values than just True and False. For example, the Bidi_Class (see -L</"Bidirectional Character Types"> below), can take on a number of different -values, such as Left, Right, Whitespace, and others. To match these, one needs -to specify the property name (Bidi_Class), and the value being matched against -(Left, Right, etc.). This is done, as in the examples above, by having the +This formality is needed when properties are not binary; that is, if they can +take on more values than just C<True> and C<False>. For example, the +C<Bidi_Class> property (see L</"Bidirectional Character Types"> below), +can take on several different +values, such as C<Left>, C<Right>, C<Whitespace>, and others. To match these, one needs +to specify both the property name (C<Bidi_Class>), AND the value being +matched against +(C<Left>, C<Right>, I<etc.>). This is done, as in the examples above, by having the two components separated by an equal sign (or interchangeably, a colon), like C<\p{Bidi_Class: Left}>. All Unicode-defined character properties may be written in these compound forms -of C<\p{property=value}> or C<\p{property:value}>, but Perl provides some +of C<\p{I<property>=I<value>}> or C<\p{I<property>:I<value>}>, but Perl provides some additional properties that are written only in the single form, as well as single-form short-cuts for all binary properties and certain others described below, in which you may omit the property name and the equals or colon separator. 
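As an illustration of the compound form, here is a small sketch that is not part of the patch. The function names C<direction_of> and C<same_binary_match> are hypothetical. It deliberately uses the short values C<L> and C<R> that appear in the C<Bidi_Class> table further on, rather than assuming any longer spelling is accepted.

```perl
use strict;
use warnings;

# Hypothetical helper: classify one character by its Bidi_Class using
# the compound \p{property=value} form. '=' and ':' are interchangeable.
sub direction_of {
    my ($char) = @_;
    return 'left-to-right' if $char =~ /\A\p{Bidi_Class=L}\z/;
    return 'right-to-left' if $char =~ /\A\p{Bidi_Class:R}\z/;
    return 'other';
}

# Hypothetical helper: check that the single form \p{Uppercase} and the
# compound form \p{Uppercase=True} give the same answer for $char.
sub same_binary_match {
    my ($char) = @_;
    my $single   = $char =~ /\p{Uppercase}/      ? 1 : 0;
    my $compound = $char =~ /\p{Uppercase=True}/ ? 1 : 0;
    return $single == $compound ? 1 : 0;
}

my $latin  = direction_of('A');          # a Latin letter
my $hebrew = direction_of("\x{5D0}");    # HEBREW LETTER ALEF
my $digit  = direction_of('1');          # European digits are neither L nor R
```

The anchors C<\A> and C<\z> make each test apply to exactly one character, so a longer string falls through to C<'other'>.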
Most Unicode character properties have at least two synonyms (or aliases if you -prefer), a short one that is easier to type, and a longer one which is more -descriptive and hence it is easier to understand what it means. Thus the "L" -and "Letter" above are equivalent and can be used interchangeably. Likewise, -"Upper" is a synonym for "Uppercase", and we could have written -C<\p{Uppercase}> equivalently as C<\p{Upper}>. Also, there are typically -various synonyms for the values the property can be. For binary properties, -"True" has 3 synonyms: "T", "Yes", and "Y"; and "False has correspondingly "F", -"No", and "N". But be careful. A short form of a value for one property may -not mean the same thing as the same short form for another. Thus, for the -General_Category property, "L" means "Letter", but for the Bidi_Class property, -"L" means "Left". A complete list of properties and synonyms is in -L<perluniprops>. - -Upper/lower case differences in the property names and values are irrelevant, +prefer): a short one that is easier to type and a longer one that is more +descriptive and hence easier to understand. Thus the C<"L"> and +C<"Letter"> properties above are equivalent and can be used +interchangeably. Likewise, C<"Upper"> is a synonym for C<"Uppercase">, +and we could have written C<\p{Uppercase}> equivalently as C<\p{Upper}>. +Also, there are typically various synonyms for the values the property +can be. For binary properties, C<"True"> has 3 synonyms: C<"T">, +C<"Yes">, and C<"Y">; and C<"False"> has correspondingly C<"F">, +C<"No">, and C<"N">. But be careful. A short form of a value for one +property may not mean the same thing as the same short form for another. +Thus, for the C<L</General_Category>> property, C<"L"> means +C<"Letter">, but for the L<C<Bidi_Class>|/Bidirectional Character Types> +property, C<"L"> means C<"Left">. A complete list of properties and +synonyms is in L<perluniprops>. 
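The synonym rules, and the warning that a short value such as C<"L"> means different things for different properties, can be checked with a short sketch like the one below. It is an illustration only, not part of the patch. The names C<@uppercase_spellings>, C<spellings_agree>, C<is_gc_letter> and C<is_bc_left> are hypothetical, and C<gc> and C<bc> are the short property names for C<General_Category> and C<Bidi_Class>.

```perl
use strict;
use warnings;

# Several spellings of the same binary property and value.
my @uppercase_spellings = (
    qr/\p{Uppercase}/,      qr/\p{Upper}/,
    qr/\p{Uppercase=True}/, qr/\p{Uppercase=T}/,
    qr/\p{Uppercase=Yes}/,  qr/\p{Uppercase=Y}/,
);

# Hypothetical helper: true (1) if every spelling above gives the same
# match result for $char.
sub spellings_agree {
    my ($char) = @_;
    my %seen;
    $seen{ $char =~ $_ ? 1 : 0 }++ for @uppercase_spellings;
    return keys(%seen) == 1 ? 1 : 0;
}

# The value "L" is a letter under General_Category (gc) but a
# direction under Bidi_Class (bc).
sub is_gc_letter { return $_[0] =~ /\p{gc=L}/ ? 1 : 0 }
sub is_bc_left   { return $_[0] =~ /\p{bc=L}/ ? 1 : 0 }

# "\x{2160}" is ROMAN NUMERAL ONE: a letter-like number (gc=Nl) that is
# written left to right (bc=L), so the two "L" tests disagree on it.
my $roman = "\x{2160}";
my @roman_result = (is_gc_letter($roman), is_bc_left($roman));
```

Comparing the two results for one character makes the difference concrete: same short value, different property, different answer.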
+Upper/lower case differences in property names and values are irrelevant; thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>. Similarly, you can add or subtract underscores anywhere in the middle of a word, so that these are also equivalent to C<\p{U_p_p_e_r}>. And white space is irrelevant adjacent to non-word characters, such as the braces and the equals -or colon separators so C<\p{ Upper }> and C<\p{ Upper_case : Y }> are -equivalent to these as well. In fact, in most cases, white space and even -hyphens can be added or deleted anywhere. So even C<\p{ Up-per case = Yes}> is +or colon separators, so C<\p{ Upper }> and C<\p{ Upper_case : Y }> are +equivalent to these as well. In fact, white space and even +hyphens can usually be added or deleted anywhere. So even C<\p{ Up-per case = Yes}> is equivalent. All this is called "loose-matching" by Unicode. The few places -where stricter matching is employed is in the middle of numbers, and the Perl +where stricter matching is used are in the middle of numbers, and in the Perl extension properties that begin or end with an underscore. Stricter matching -cares about white space (except adjacent to the non-word characters) and +cares about white space (except adjacent to non-word characters), hyphens, and non-interior underscores. You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret -(^) between the first brace and the property name: C<\p{^Tamil}> is +(C<^>) between the first brace and the property name: C<\p{^Tamil}> is equal to C<\P{Tamil}>. +Almost all properties are immune to case-insensitive matching. That is, +adding a C</i> regular expression modifier does not change what they +match. There are two sets that are affected. +The first set is +C<Uppercase_Letter>, +C<Lowercase_Letter>, +and C<Titlecase_Letter>, +all of which match C<Cased_Letter> under C</i> matching. +And the second set is +C<Uppercase>, +C<Lowercase>, +and C<Titlecase>, +all of which match C<Cased> under C</i> matching. +This set also includes its subsets C<PosixUpper> and C<PosixLower> both +of which under C</i> match C<PosixAlpha>. 
+(The difference between these sets is that some things, such as Roman +numerals, come in both upper and lower case so they are C<Cased>, but +aren't considered letters, so they aren't C<Cased_Letter>'s.) + +See L</Beyond Unicode code points> for special considerations when +matching Unicode properties against non-Unicode code points. + =head3 B<General_Category> Every Unicode character is assigned a general category, which is the "most @@ -355,11 +497,12 @@ usual categorization of a character" (from L<http://www.unicode.org/reports/tr44>). The compound way of writing these is like C<\p{General_Category=Number}> -(short, C<\p{gc:n}>). But Perl furnishes shortcuts in which everything up +(short: C<\p{gc:n}>). But Perl furnishes shortcuts in which everything up through the equal or colon separator is omitted. So you can instead just write C<\pN>. -Here are the short and long forms of the General Category properties: +Here are the short and long forms of the values the C<General_Category> property +can have: Short Long @@ -406,26 +549,21 @@ Here are the short and long forms of the General Category properties: C Other Cc Control (also Cntrl) Cf Format - Cs Surrogate (not usable) + Cs Surrogate Co Private_Use Cn Unassigned Single-letter properties match all characters in any of the two-letter sub-properties starting with the same letter. -C<LC> and C<L&> are special cases, which are both aliases for the set consisting of everything matched by C<Ll>, C<Lu>, and C<Lt>. - -Because Perl hides the need for the user to understand the internal -representation of Unicode characters, there is no need to implement -the somewhat messy concept of surrogates. C<Cs> is therefore not -supported. +C<LC> and C<L&> are special: both are aliases for the set consisting of everything matched by C<Ll>, C<Lu>, and C<Lt>. =head3 B<Bidirectional Character Types> -Because scripts differ in their directionality (Hebrew is -written right to left, for example) Unicode supplies these properties in -the Bidi_Class class: +Because scripts differ in their directionality (Hebrew and Arabic are +written right to left, for example) Unicode supplies a C<Bidi_Class> property. 
+Some of the values this property can have are: - Property Meaning + Value Meaning L Left-to-Right LRE Left-to-Right Embedding @@ -449,26 +587,97 @@ the Bidi_Class class: This property is always written in the compound form. For example, C<\p{Bidi_Class:R}> matches characters that are normally -written right to left. +written right to left. Unlike the +C> property, this +property can have more values added in a future Unicode release. Those +listed above comprised the complete set for many Unicode releases, but +others were added in Unicode 6.3; you can always find what the +current ones are in L. And +L describes how to use them. =head3 B -The world's languages are written in a number of scripts. This sentence +The world's languages are written in many different scripts. This sentence (unless you're reading it in translation) is written in Latin, while Russian is -written in Cyrllic, and Greek is written in, well, Greek; Japanese mainly in +written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in Hiragana or Katakana. There are many more. -The Unicode Script property gives what script a given character is in, -and the property can be specified with the compound form like -C<\p{Script=Hebrew}> (short: C<\p{sc=hebr}>). Perl furnishes shortcuts for all -script names. You can omit everything up through the equals (or colon), and -simply write C<\p{Latin}> or C<\P{Cyrillic}>. +The Unicode C