X-Git-Url: https://perl5.git.perl.org/perl5.git/blobdiff_plain/2269d15c887e7326906ea6195d5970ac188c3411..370c71c555fdc393b7abe16f74b496061898b884:/pod/perlunicode.pod diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod index e893571..0482d92 100644 --- a/pod/perlunicode.pod +++ b/pod/perlunicode.pod @@ -20,7 +20,7 @@ Read L. =over 4 -=item Safest if you "use feature 'unicode_strings'" +=item Safest if you C<use feature 'unicode_strings'> In order to preserve backward compatibility, Perl does not turn on full internal Unicode support unless the pragma @@ -37,9 +37,9 @@ filenames. Perl knows when a filehandle uses Perl's internal Unicode encodings (UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with -the ":encoding(utf8)" layer. Other encodings can be converted to Perl's +the C<:encoding(utf8)> layer. Other encodings can be converted to Perl's encoding on input or from Perl's encoding on output by use of the -":encoding(...)" layer. See L. +C<:encoding(...)> layer. See L. To indicate that Perl source itself is in UTF-8, use C. @@ -52,12 +52,12 @@ ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based machines. B is needed.> See L. -=item BOM-marked scripts and UTF-16 scripts autodetected +=item C<BOM>-marked scripts and UTF-16 scripts autodetected -If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF16-BE, -or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either +If a Perl script begins marked with the Unicode C<BOM> (UTF-16LE, UTF16-BE, +or UTF-8), or if the script looks like non-C<BOM>-marked UTF-16 of either endianness, Perl will correctly read in the script as Unicode. -(BOMless UTF-8 cannot be effectively recognized or differentiated from +(C<BOM>less UTF-8 cannot be effectively recognized or differentiated from ISO 8859-1 or other eight-bit encodings.) =item C needed to upgrade non-Latin-1 byte strings @@ -74,8 +74,7 @@ See L for more details. 
=head2 Byte and Character Semantics -Beginning with version 5.6, Perl uses logically-wide characters to -represent strings internally. +Perl uses logically-wide characters to represent strings internally. Starting in Perl 5.14, Perl-level operations work with characters rather than bytes within the scope of a @@ -93,19 +92,14 @@ without additional information from the user, Perl decides in favor of compatibility and chooses to use byte semantics. When C (but not C) is in -effect, Perl uses the semantics associated with the current locale. +effect, Perl uses the rules associated with the current locale. (C overrides C in the same scope; while C effectively also selects C in its scope; see L.) Otherwise, Perl uses the platform's native byte semantics for characters whose code points are less than 256, and -Unicode semantics for those greater than 255. On EBCDIC platforms, this -is almost seamless, as the EBCDIC code pages that Perl handles are -equivalent to Unicode's first 256 code points. (The exception is that -EBCDIC regular expression case-insensitive matching rules are not as -as robust as Unicode's.) But on ASCII platforms, Perl uses US-ASCII -(or Basic Latin in Unicode terminology) byte semantics, meaning that characters -whose ordinal numbers are in the range 128 - 255 are undefined except for their +Unicode rules for those greater than 255. That means that non-ASCII +characters are undefined except for their ordinal numbers. This means that none have case (upper and lower), nor are any a member of character classes, like C<[:alpha:]> or C<\w>. (But all do belong to the C<\W> class or the Perl regular expression extension C<[:^alpha:]>.) @@ -114,7 +108,7 @@ This behavior preserves compatibility with earlier versions of Perl, which allowed byte semantics in Perl operations only if none of the program's inputs were marked as being a source of Unicode character data. 
Such data may come from filehandles, from calls to -external programs, from information provided by the system (such as %ENV), +external programs, from information provided by the system (such as C<%ENV>), or from literals and constants in the source text. The C pragma is primarily a compatibility device that enables @@ -126,7 +120,7 @@ may become a no-op. See L. If strings operating under byte semantics and strings with Unicode character data are concatenated, the new string will have character semantics. This can cause surprises: See L, below. -You can choose to be warned when this happens. See L. +You can choose to be warned when this happens. See C>. Under character semantics, many operations that formerly operated on bytes now operate on characters. A character in Perl is @@ -148,15 +142,15 @@ contain characters that have an ordinal value larger than 255. If you use a Unicode editor to edit your program, Unicode characters may occur directly within the literal strings in UTF-8 encoding, or UTF-16. -(The former requires a BOM or C, the latter requires a BOM.) +(The former requires a C<BOM> or C, the latter requires a C<BOM>.) Unicode characters can also be added to a string by using the C<\N{U+...}> notation. The Unicode code for the desired character, in hexadecimal, should be placed in the braces, after the C. For instance, a smiley face is C<\N{U+263A}>. -Alternatively, you can use the C<\x{...}> notation for characters 0x100 and -above. For characters below 0x100 you may get byte semantics instead of +Alternatively, you can use the C<\x{...}> notation for characters C<0x100> and +above. For characters below C<0x100> you may get byte semantics instead of character semantics; see L. On EBCDIC machines there is the additional problem that the value for such characters gives the EBCDIC character rather than the Unicode one, thus it is more portable to use @@ -180,7 +174,7 @@ names. =item * -Regular expressions match characters instead of bytes. "." 
matches +Regular expressions match characters instead of bytes. C<"."> matches a character instead of a byte. =item * @@ -266,7 +260,7 @@ complement B the full character-wide bit complement. =item * -There is a CPAN module, L, which allows you to define +There is a CPAN module, C>, which allows you to define your own mappings to be used in C, C, C, C, and C (or their double-quoted string inlined versions such as C<\U>). @@ -296,27 +290,29 @@ regular expressions by using the C<\p{}> "matches property" construct and the C<\P{}> "doesn't match property" for its negation. For instance, C<\p{Uppercase}> matches any single character with the Unicode -"Uppercase" property, while C<\p{L}> matches any character with a -General_Category of "L" (letter) property. Brackets are not +C<"Uppercase"> property, while C<\p{L}> matches any character with a +C<General_Category> of C<"L"> (letter) property (see +L below). Brackets are not required for single letter property names, so C<\p{L}> is equivalent to C<\pL>. More formally, C<\p{Uppercase}> matches any single character whose Unicode -Uppercase property value is True, and C<\P{Uppercase}> matches any character -whose Uppercase property value is False, and they could have been written as +C<Uppercase> property value is C<True>, and C<\P{Uppercase}> matches any character +whose C<Uppercase> property value is C<False>, and they could have been written as C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively. This formality is needed when properties are not binary; that is, if they can -take on more values than just True and False. For example, the Bidi_Class (see -L below), can take on several different -values, such as Left, Right, Whitespace, and others. To match these, one needs -to specify both the property name (Bidi_Class), AND the value being +take on more values than just C<True> and C<False>. For example, the +C<Bidi_Class> property (see L below), +can take on several different +values, such as C<Left>, C<Right>, C<Whitespace>, and others. 
To match these, one needs +to specify both the property name (C<Bidi_Class>), AND the value being matched against -(Left, Right, etc.). This is done, as in the examples above, by having the +(C<Left>, C<Right>, etc.). This is done, as in the examples above, by having the two components separated by an equal sign (or interchangeably, a colon), like C<\p{Bidi_Class: Left}>. All Unicode-defined character properties may be written in these compound forms -of C<\p{property=value}> or C<\p{property:value}>, but Perl provides some +of C<\p{I<property>=I<value>}> or C<\p{I<property>:I<value>}>, but Perl provides some additional properties that are written only in the single form, as well as single-form short-cuts for all binary properties and certain others described below, in which you may omit the property name and the equals or colon @@ -324,17 +320,19 @@ separator. Most Unicode character properties have at least two synonyms (or aliases if you prefer): a short one that is easier to type and a longer one that is more -descriptive and hence easier to understand. Thus the "L" and "Letter" properties -above are equivalent and can be used interchangeably. Likewise, -"Upper" is a synonym for "Uppercase", and we could have written -C<\p{Uppercase}> equivalently as C<\p{Upper}>. Also, there are typically -various synonyms for the values the property can be. For binary properties, -"True" has 3 synonyms: "T", "Yes", and "Y"; and "False has correspondingly "F", -"No", and "N". But be careful. A short form of a value for one property may -not mean the same thing as the same short form for another. Thus, for the -General_Category property, "L" means "Letter", but for the Bidi_Class property, -"L" means "Left". A complete list of properties and synonyms is in -L. +descriptive and hence easier to understand. Thus the C<"L"> and +C<"Letter"> properties above are equivalent and can be used +interchangeably. Likewise, C<"Upper"> is a synonym for C<"Uppercase">, +and we could have written C<\p{Uppercase}> equivalently as C<\p{Upper}>. 
+Also, there are typically various synonyms for the values the property +can be. For binary properties, C<"True"> has 3 synonyms: C<"T">, +C<"Yes">, and C<"Y">; and C<"False"> has correspondingly C<"F">, +C<"No">, and C<"N">. But be careful. A short form of a value for one +property may not mean the same thing as the same short form for another. +Thus, for the C> property, C<"L"> means +C<"Letter">, but for the L<C<Bidi_Class>|/Bidirectional Character Types> +property, C<"L"> means C<"Left">. A complete list of properties and +synonyms is in L. Upper/lower case differences in property names and values are irrelevant; thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>. @@ -351,7 +349,7 @@ cares about white space (except adjacent to non-word characters), hyphens, and non-interior underscores. You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret -(^) between the first brace and the property name: C<\p{^Tamil}> is +(C<^>) between the first brace and the property name: C<\p{^Tamil}> is equal to C<\P{Tamil}>. Almost all properties are immune to case-insensitive matching. That is, @@ -368,18 +366,13 @@ C, and C, all of which match C under C matching. This set also includes its subsets C and C both -of which under C matching match C. +of which under C match C. (The difference between these sets is that some things, such as Roman numerals, come in both upper and lower case so they are C, but aren't considered letters, so they aren't Cs.) -The result is undefined if you try to match a non-Unicode code point -(that is, one above 0x10FFFF) against a Unicode property. Currently, a -warning is raised, and the match will fail. In some cases, this is -counterintuitive, as both these fail: - - chr(0x110000) =~ \p{ASCII_Hex_Digit=True} # Fails. - chr(0x110000) =~ \p{ASCII_Hex_Digit=False} # Fails! +See L for special considerations when +matching Unicode properties against non-Unicode code points. 
=head3 B @@ -392,7 +385,8 @@ The compound way of writing these is like C<\p{General_Category=Number}> through the equal or colon separator is omitted. So you can instead just write C<\pN>. -Here are the short and long forms of the General Category properties: +Here are the short and long forms of the values the C<General_Category> property +can have: Short Long @@ -450,10 +444,10 @@ C and C are special: both are aliases for the set consisting of everythi =head3 B Because scripts differ in their directionality (Hebrew and Arabic are -written right to left, for example) Unicode supplies these properties in -the Bidi_Class class: +written right to left, for example) Unicode supplies a C<Bidi_Class> property. +Some of the values this property can have are: - Property Meaning + Value Meaning L Left-to-Right LRE Left-to-Right Embedding @@ -477,7 +471,13 @@ the Bidi_Class class: This property is always written in the compound form. For example, C<\p{Bidi_Class:R}> matches characters that are normally -written right to left. +written right to left. Unlike the +C> property, this +property can have more values added in a future Unicode release. Those +listed above comprised the complete set for many Unicode releases, but +others were added in Unicode 6.3; you can always find what the +current ones are in L. And +L describes how to use them. =head3 B @@ -501,7 +501,7 @@ The difference between these two properties involves characters that are used in multiple scripts. For example the digits '0' through '9' are used in many parts of the world. These are placed in a script named C. Other characters are used in just a few scripts. For -example, the "KATAKANA-HIRAGANA DOUBLE HYPHEN" is used in both Japanese +example, the C<"KATAKANA-HIRAGANA DOUBLE HYPHEN"> is used in both Japanese scripts, Katakana and Hiragana, but nowhere else. The C