From 82206b5ed202f6863d810b917405266fc5486eac Mon Sep 17 00:00:00 2001
From: Karl Williamson
Date: Fri, 1 Apr 2011 13:40:23 -0600
Subject: [PATCH] perlrecharclass: Update for 5.14 changes

---
 pod/perlrecharclass.pod | 396 ++++++++++++++++++++++++++----------------------
 1 file changed, 217 insertions(+), 179 deletions(-)

diff --git a/pod/perlrecharclass.pod b/pod/perlrecharclass.pod
index 9f27378..d26b035 100644
--- a/pod/perlrecharclass.pod
+++ b/pod/perlrecharclass.pod
@@ -44,7 +44,7 @@ Here are some examples: "ab" =~ /^.$/ # No match (dot matches one character) =head2 Backslash sequences -X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\p> X<\P> +X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\p> X<\P> X<\N> X<\v> X<\V> X<\h> X<\H> X X @@ -75,40 +75,49 @@ character classes, see L.) =head3 Digits C<\d> matches a single character considered to be a decimal I. -What is considered a decimal digit depends on several factors, detailed -below in L. If those factors -indicate a Unicode interpretation, C<\d> not only matches the digits -'0' - '9', but also Arabic, Devanagari, and digits from other languages. -Otherwise, if a locale is in effect, it matches whatever characters that -locale considers decimal digits. Only when neither a Unicode interpretation -nor locale prevails does C<\d> match only the digits '0' to '9' alone. - -Unicode digits may cause some confusion, and some security issues. In UTF-8 -strings, unless the C regular expression modifier is specified, -C<\d> matches the same characters matched by -C<\p{General_Category=Decimal_Number}>, or synonymously, -C<\p{General_Category=Digit}>. Starting with Unicode version 4.1, this is the -same set of characters matched by C<\p{Numeric_Type=Decimal}>. - +If the C modifier is in effect, it matches [0-9]. Otherwise, it +matches anything that is matched by C<\p{Digit}>, which includes [0-9].
+(An unlikely possible exception is that under locale matching rules, the +current locale might not have [0-9] matched by C<\d>, and/or might match +other characters whose code point is less than 256. Such a locale +definition would be in violation of the C language standard, but Perl +doesn't currently assume anything in regard to this.) + +What this means is that unless the C modifier is in effect C<\d> not +only matches the digits '0' - '9', but also Arabic, Devanagari, and +digits from other languages. This may cause some confusion, and some +security issues. + +Some digits that C<\d> matches look like some of the [0-9] ones, but +have different values. For example, BENGALI DIGIT FOUR (U+09EA) looks +very much like an ASCII DIGIT EIGHT (U+0038). An application that +is expecting only the ASCII digits might be misled, or if the match is +C<\d+>, the matched string might contain a mixture of digits from +different writing systems that look like they signify a number different +than they actually do. L can be used to safely +calculate the value, returning C if the input string contains +such a mixture. + +What C<\p{Digit}> means (and hence C<\d> except under the C +modifier) is C<\p{General_Category=Decimal_Number}>, or synonymously, +C<\p{General_Category=Digit}>. Starting with Unicode version 4.1, this +is the same set of characters matched by C<\p{Numeric_Type=Decimal}>. But Unicode also has a different property with a similar name, C<\p{Numeric_Type=Digit}>, which matches a completely different set of -characters. These characters are things such as subscripts. - -The design intent is for C<\d> to match all digits (and no other characters) -that can be used with "normal" big-endian positional decimal syntax, whereby a -sequence of such digits {N0, N1, N2, ...Nn} has the numeric value (...(N0 * 10 -+ N1) * 10 + N2) * 10 ... + Nn). 
In Unicode 5.2, the Tamil digits (U+0BE6 - -U+0BEF) can also legally be used in old-style Tamil numbers in which they would -appear no more than one in a row, separated by characters that mean "times 10", -"times 100", etc. (See L.) +characters. These characters are things such as C +or subscripts, or are from writing systems that lack all ten digits. -Some non-European digits that C<\d> matches look like European ones, but -have different values. For example, BENGALI DIGIT FOUR (U+09EA) looks -very much like an ASCII DIGIT EIGHT (U+0038). +The design intent is for C<\d> to exactly match the set of characters +that can safely be used with "normal" big-endian positional decimal +syntax, where, for example 123 means one 'hundred', plus two 'tens', +plus three 'ones'. This positional notation does not necessarily apply +to characters that match the other type of "digit", +C<\p{Numeric_Type=Digit}>, and so C<\d> doesn't match them. -It may be useful for security purposes for an application to require that all -digits in a row be from the same script. This can be checked by using -L. +In Unicode 5.2, the Tamil digits (U+0BE6 - U+0BEF) can also legally be +used in old-style Tamil numbers in which they would appear no more than +one in a row, separated by characters that mean "times 10", "times 100", +etc. (See L.) Any character not matched by C<\d> is matched by C<\D>. @@ -117,21 +126,52 @@ Any character not matched by C<\d> is matched by C<\D>. A C<\w> matches a single alphanumeric character (an alphabetic character, or a decimal digit) or a connecting punctuation character, such as an underscore ("_"). It does not match a whole word. To match a whole -word, use C<\w+>. This isn't the same thing as matching an English word, but +word, use C<\w+>. This isn't the same thing as matching an English word, but in the ASCII range it is the same as a string of Perl-identifier -characters. What is considered a -word character depends on several factors, detailed below in L. 
If those factors indicate a Unicode -interpretation, C<\w> matches the characters considered word -characters in the Unicode database. That is, it not only matches ASCII letters, -but also Thai letters, Greek letters, etc. This includes connector +characters. + +=over + +=item If the C modifier is in effect ... + +C<\w> matches the 63 characters [a-zA-Z0-9_]. + +=item otherwise ... + +=over + +=item For code points above 255 ... + +C<\w> matches the same as C<\p{Word}> matches in this range. That is, +it matches Thai letters, Greek letters, etc. This includes connector punctuation (like the underscore) which connect two words together, or diacritics, such as a C and the modifier letters, which -are generally used to add auxiliary markings to letters. If a Unicode -interpretation is not indicated, C<\w> matches those characters considered -word characters by the current locale or EBCDIC code page. Without a -locale or EBCDIC code page, C<\w> matches the underscore and ASCII letters -and digits. +are generally used to add auxiliary markings to letters. + +=item For code points below 256 ... + +=over + +=item if locale rules are in effect ... + +C<\w> matches the platform's native underscore character plus whatever +the locale considers to be alphanumeric. + +=item if Unicode rules are in effect or if on an EBCDIC platform ... + +C<\w> matches exactly what C<\p{Word}> matches. + +=item otherwise ... + +C<\w> matches [a-zA-Z0-9_]. + +=back + +=back + +=back + +Which rules apply are determined as described in L. There are a number of security issues with the full Unicode list of word characters. See L. @@ -145,30 +185,62 @@ Any character not matched by C<\w> is matched by C<\W>. =head3 Whitespace -C<\s> matches any single character considered whitespace. The exact -set of characters matched by C<\s> depends on several factors, detailed -below in L. 
If those factors -indicate a Unicode interpretation, C<\s> matches what is considered -whitespace in the Unicode database; the complete list is in the table -below. Otherwise, if a locale or EBCDIC code page is in effect, -C<\s> matches whatever is considered whitespace by the current locale or -EBCDIC code page. Without a locale or EBCDIC code page, C<\s> matches -the horizontal tab (C<\t>), the newline (C<\n>), the form feed (C<\f>), -the carriage return (C<\r>), and the space. (Note that it doesn't match -the vertical tab, C<\cK>.) Perhaps the most notable possible surprise -is that C<\s> matches a non-breaking space B if a Unicode -interpretation is indicated, or the locale or EBCDIC code page that is -in effect happens to have that character. +C<\s> matches any single character considered whitespace. + +=over + +=item If the C modifier is in effect ... + +C<\s> matches the 5 characters [\t\n\f\r ]; that is, the horizontal tab, +the newline, the form feed, the carriage return, and the space. (Note +that it doesn't match the vertical tab, C<\cK> on ASCII platforms.) + +=item otherwise ... + +=over + +=item For code points above 255 ... + +C<\s> matches exactly the code points above 255 shown with an "s" column +in the table below. + +=item For code points below 256 ... + +=over + +=item if locale rules are in effect ... + +C<\s> matches whatever the locale considers to be whitespace. Note that +this is likely to include the vertical tab, unlike non-locale C<\s> +matching. + +=item if Unicode rules are in effect or if on an EBCDIC platform ... + +C<\s> matches exactly the characters shown with an "s" column in the +table below. + +=item otherwise ... + +C<\s> matches [\t\n\f\r ]. +Note that this list doesn't include the non-breaking space. + +=back + +=back + +=back + +Which rules apply are determined as described in L. Any character not matched by C<\s> is matched by C<\S>. 
C<\h> matches any character considered horizontal whitespace; -this includes the space and tab characters and several others +this includes the space and tab characters and several others listed in the table below. C<\H> matches any character not considered horizontal whitespace. C<\v> matches any character considered vertical whitespace; -this includes the carriage return and line feed characters (newline) +this includes the carriage return and line feed characters (newline) plus several other characters, all listed in the table below. C<\V> matches any character not considered vertical whitespace. @@ -178,22 +250,16 @@ sequence. Therefore, it cannot be used inside a bracketed character class; use C<\v> instead (vertical whitespace). Details are discussed in L. -Note that unlike C<\s>, C<\d> and C<\w>, C<\h> and C<\v> always match +Note that unlike C<\s> (and C<\d> and C<\w>), C<\h> and C<\v> always match the same characters, without regard to other factors, such as whether the source string is in UTF-8 format. -One might think that C<\s> is equivalent to C<[\h\v]>. This is not true. The -vertical tab (C<"\x0b">) is not matched by C<\s>, it is however considered -vertical whitespace. Furthermore, if the source string is not in UTF-8 format, -and any locale or EBCDIC code page that is in effect doesn't include them, the -next line (ASCII-platform C<"\x85">) and the no-break space (ASCII-platform -C<"\xA0">) characters are not matched by C<\s>, but are by C<\v> and C<\h> -respectively. If the C modifier is not in effect and the source -string is in UTF-8 format, both the next line and the no-break space -are matched by C<\s>. +One might think that C<\s> is equivalent to C<[\h\v]>. This is not true. +For example, the vertical tab (C<"\x0b">) is not matched by C<\s>, it is +however considered vertical whitespace. The following table is a complete listing of characters matched by -C<\s>, C<\h> and C<\v> as of Unicode 5.2. +C<\s>, C<\h> and C<\v> as of Unicode 6.0. 
The first column gives the code point of the character (in hex format), the second column gives the (Unicode) name. The third column indicates @@ -231,16 +297,12 @@ page is in effect that changes the C<\s> matching). =item [1] -NEXT LINE and NO-BREAK SPACE only match C<\s> if the source string is in -UTF-8 format and the C modifier is not in effect, or if the locale -or EBCDIC code page in effect includes them. +NEXT LINE and NO-BREAK SPACE may or may not match C<\s> depending +on the rules in effect. See +L. =back -It is worth noting that C<\d>, C<\w>, etc, match single characters, not -complete numbers or words. To match a number (that consists of digits), -use C<\d+>; to match a word, use C<\w+>. - =head3 \N C<\N> is new in 5.12, and is experimental. It, like the dot, matches any @@ -274,9 +336,13 @@ C is valid, but means something different. It matches a two character string: a letter (Unicode property C<\pL>), followed by a lowercase C. +If neither the C modifier nor locale rules are in effect, the use of +a Unicode property will force the regular expression into using Unicode +rules. + Note that almost all properties are immune to case-insensitive matching. That is, adding a C regular expression modifier does not change what -they match. There are two sets affected. The first set is +they match. There are two sets that are affected. The first set is C, C, and C, @@ -289,8 +355,8 @@ all of which match C under C matching. (The difference between these sets is that some things, such as Roman Numerals, come in both upper and lower case so they are C, but aren't considered to be letters, so they aren't Cs. They're -actually Cs.) -This set also includes its subsets C and C, both +actually Cs.) +This set also includes its subsets C and C, both of which under C matching match C. For more details on Unicode properties, see L. # Thai Unicode class. "a" =~ /\P{Lao}/ # Match, as "a" is not a Laotian character. 
+It is worth emphasizing that C<\d>, C<\w>, etc, match single characters, not +complete numbers or words. To match a number (that consists of digits), +use C<\d+>; to match a word, use C<\w+>. But be aware of the security +considerations in doing so, as mentioned above. =head2 Bracketed Character Classes @@ -459,7 +529,7 @@ Unicode letters. This syntax make the caret a special character inside a bracketed character class, but only if it is the first character of the class. So if you want -the caret as one of the characters to match, either escape the caret or +the caret as one of the characters to match, either escape the caret or else not list it first. Examples: @@ -504,8 +574,7 @@ X X X X X X X POSIX character classes have the form C<[:class:]>, where I is name, and the C<[:> and C<:]> delimiters. POSIX character classes only appear I bracketed character classes, and are a convenient and descriptive -way of listing a group of characters, though they can suffer from -portability issues (see below and L). +way of listing a group of characters. Be careful about the syntax, @@ -517,7 +586,7 @@ Be careful about the syntax, The latter pattern would be a character class consisting of a colon, and the letters C, C, C

and C. -POSIX character classes can be part of a larger bracketed character class. +POSIX character classes can be part of a larger bracketed character class. For example, [01[:alpha:]%] @@ -552,42 +621,74 @@ the table, matches only characters in the ASCII character set. The other counterpart, in the column labelled "Full-range Unicode", matches any appropriate characters in the full Unicode character set. For example, C<\p{Alpha}> matches not just the ASCII alphabetic characters, but any -character in the entire Unicode character set considered alphabetic. +character in the entire Unicode character set considered alphabetic. The column labelled "backslash sequence" is a (short) synonym for the Full-range Unicode form. (Each of the counterparts has various synonyms as well. L lists all -synonyms, plus all characters matched by each ASCII-range property. +synonyms, plus all characters matched by each ASCII-range property. For example, C<\p{AHex}> is a synonym for C<\p{ASCII_Hex_Digit}>, and any C<\p> property name can be prefixed with "Is" such as C<\p{IsAlpha}>.) -Both the C<\p> forms are unaffected by any locale in effect, or whether -the string is in UTF-8 format or not, or whether the platform is EBCDIC or not. -In contrast, the POSIX character classes are affected, unless the -regular expression is compiled with the C modifier. If the C -modifier is not in effect, and the source string is in UTF-8 format, the -POSIX classes behave like their "Full-range" Unicode counterparts. If -C modifier is in effect; or the source string is not in UTF-8 -format, and no locale is in effect, and the platform is not EBCDIC, all -the POSIX classes behave like their ASCII-range counterparts. -Otherwise, they behave based on the rules of the locale or EBCDIC code -page. - -It is proposed to change this behavior in a future release of Perl so that the -the UTF-8-ness of the source string will be irrelevant to the behavior of the -POSIX character classes. 
This means they will always behave in strict -accordance with the official POSIX standard. That is, if either locale or -EBCDIC code page is present, they will behave in accordance with those; if -absent, the classes will match only their ASCII-range counterparts. If you -wish to comment on this proposal, send email to C. +Both the C<\p> counterparts always assume Unicode rules are in effect. +On ASCII platforms, this means they assume that the code points from 128 +to 255 are Latin-1, and that means that using them under locale rules is +unwise unless the locale is guaranteed to be Latin-1. In contrast, the +POSIX character classes are useful under locale rules. They are +affected by the actual rules in effect, as follows: + +=over + +=item If the C modifier is in effect ... + +Each of the POSIX classes matches exactly the same as their ASCII-range +counterparts. + +=item otherwise ... + +=over + +=item For code points above 255 ... + +The POSIX class matches the same as its Full-range counterpart. + +=item For code points below 256 ... + +=over + +=item if locale rules are in effect ... + +The POSIX class matches according to the locale. + +=item if Unicode rules are in effect or if on an EBCDIC platform ... + +The POSIX class matches the same as the Full-range counterpart. + +=item otherwise ... + +The POSIX class matches the same as the ASCII range counterpart. + +=back + +=back + +=back + +Which rules apply are determined as described in L. + +It is proposed to change this behavior in a future release of Perl so that +whether or not Unicode rules are in effect would not change the +behavior: Outside of locale or an EBCDIC code page, the POSIX classes +would behave like their ASCII-range counterparts. If you wish to +comment on this proposal, send email to C. 
[[:...:]] ASCII-range Full-range backslash Note Unicode Unicode sequence ----------------------------------------------------- alpha \p{PosixAlpha} \p{XPosixAlpha} alnum \p{PosixAlnum} \p{XPosixAlnum} - ascii \p{ASCII} + ascii \p{ASCII} blank \p{PosixBlank} \p{XPosixBlank} \h [1] or \p{HorizSpace} [1] cntrl \p{PosixCntrl} \p{XPosixCntrl} [2] @@ -600,7 +701,7 @@ wish to comment on this proposal, send email to C. space \p{PosixSpace} \p{XPosixSpace} [6] upper \p{PosixUpper} \p{XPosixUpper} word \p{PosixWord} \p{XPosixWord} \w - xdigit \p{ASCII_Hex_Digit} \p{XPosixXDigit} + xdigit \p{PosixXDigit} \p{XPosixXDigit} =over 4 @@ -612,12 +713,12 @@ C<\p{Blank}> and C<\p{HorizSpace}> are synonyms. Control characters don't produce output as such, but instead usually control the terminal somehow: for example, newline and backspace are control characters. -In the ASCII range, characters whose ordinals are between 0 and 31 inclusive, +In the ASCII range, characters whose code points are between 0 and 31 inclusive, plus 127 (C) are control characters. On EBCDIC platforms, it is likely that the code page will define C<[[:cntrl:]]> to be the EBCDIC equivalents of the ASCII controls, plus the controls -that in Unicode have ordinals from 128 through 159. +that in Unicode have code points from 128 through 159. =item [3] @@ -646,13 +747,14 @@ C<\p{XPosixPunct}> and (in Unicode mode) C<[[:punct:]]>, match what C<\p{PosixPunct}> matches in the ASCII range, plus what C<\p{Punct}> matches. This is different than strictly matching according to C<\p{Punct}>. Another way to say it is that -for a UTF-8 string, C<[[:punct:]]> matches all characters that Unicode -considers punctuation, plus all ASCII-range characters that Unicode -considers symbols. +if Unicode rules are in effect, C<[[:punct:]]> matches all characters +that Unicode considers punctuation, plus all ASCII-range characters that +Unicode considers symbols. 
=item [6] -C<\p{SpacePerl}> and C<\p{Space}> differ only in that C<\p{Space}> additionally +C<\p{SpacePerl}> and C<\p{Space}> differ only in that in non-locale +matching, C<\p{Space}> additionally matches the vertical tab, C<\cK>. Same for the two ASCII-only range forms. =back @@ -678,13 +780,12 @@ Some examples: [[:^word:]] \P{PerlWord} \P{XPosixWord} \W The backslash sequence can mean either ASCII- or Full-range Unicode, -depending on various factors. See L -below. +depending on various factors as described in L. =head4 [= =] and [. .] Perl recognizes the POSIX character classes C<[=class=]> and -C<[.class.]>, but does not (yet?) support them. Any attempt to use +C<[.class.]>, but does not (yet?) support them. Any attempt to use either construct raises an exception. =head4 Examples @@ -701,66 +802,3 @@ either construct raises an exception. # hex digit. The result matches all # characters except the letters 'a' to 'f' and # 'A' to 'F'. - - -=head2 Locale, EBCDIC, Unicode and UTF-8 - -Some of the character classes have a somewhat different behaviour -depending on the internal encoding of the source string, whether the regular -expression is marked as having Unicode semantics, whatever locale is in -effect, and whether the program is running on an EBCDIC platform. - -C<\w>, C<\d>, C<\s> and the POSIX character classes (and their -negations, including C<\W>, C<\D>, C<\S>) have this behaviour. (Since -the backslash sequences C<\b> and C<\B> are defined in terms of C<\w> -and C<\W>, they also are affected.) - -Starting in Perl 5.14, if the regular expression is compiled with the -C modifier, the behavior doesn't differ regardless of any other -factors. 
C<\d> matches the 10 digits 0-9; C<\D> any character but those -10; C<\s>, exactly the five characters "[ \f\n\r\t]"; C<\w> only the 63 -characters "[A-Za-z0-9_]"; and the C<"[[:posix:]]"> classes only the -appropriate ASCII characters, the same characters as are matched by the -corresponding C<\p{}> property given in the "ASCII-range Unicode" column -in the table above. (The behavior of all of their complements follows -the same paradigm.) - -Otherwise, a regular expression is marked for Unicode semantics if it is -encoded in utf8 (usually as a result of including a literal character -whose code point is above 255), or if it contains a C<\N{U+...}> or -C<\N{I}> construct, or (starting in Perl 5.14) if it was compiled -in the scope of a C> pragma and not in -the scope of a C> pragma, or has the C regular -expression modifier. - -Note that one can specify C<"use re '/l'"> for example, for any regular -expression modifier, and this has precedence over either of the -C> or C> pragmas. - -The differences in behavior between locale and non-locale semantics -can affect any character whose code point is 255 or less. The -differences in behavior between Unicode and non-Unicode semantics -affects only ASCII platforms, and only when matching against characters -whose code points are between 128 and 255 inclusive. See -L. - -For portability reasons, unless the C modifier is specified, -it may be better to not use C<\w>, C<\d>, C<\s> or the POSIX character -classes and use the Unicode properties instead. - -That way you can control whether you want matching of characters in -the ASCII character set alone, or whether to match Unicode characters. -C> allows seamless Unicode behavior -no matter the internal encodings, but won't allow restricting -to ASCII characters only. - -=head4 Examples - - $str = "\xDF"; # $str is not in UTF-8 format. - $str =~ /^\w/; # No match, as $str isn't in UTF-8 format. - $str .= "\x{0e0b}"; # Now $str is in UTF-8 format. - $str =~ /^\w/; # Match! 
$str is now in UTF-8 format. - chop $str; - $str =~ /^\w/; # Still a match! $str remains in UTF-8 format. - -=cut -- 1.8.3.1
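
The rewritten "Digits" text in this patch says that C<\d> matches only [0-9] when the ASCII-restricting regular expression modifier is in effect, and otherwise also matches digits such as BENGALI DIGIT FOUR (U+09EA). A minimal Perl sketch of that difference follows. It assumes Perl 5.14 or later and that the stripped modifier name in the text is C</a>; the helper name C<digit_match_report> is hypothetical and is not part of the patch.

```perl
use v5.14;      # the /a modifier needs Perl 5.14 or later
use warnings;

# Hypothetical helper: report whether a single character matches \d
# under the default rules and under the ASCII-restricting /a modifier.
sub digit_match_report {
    my ($char) = @_;
    return {
        default => ( $char =~ /^\d$/  ? 1 : 0 ),
        ascii   => ( $char =~ /^\d$/a ? 1 : 0 ),
    };
}

# U+0038 is ASCII DIGIT EIGHT; U+09EA is BENGALI DIGIT FOUR, which the
# patch text describes as looking very much like an ASCII eight.
for my $cp ( 0x38, 0x09EA ) {
    my $report = digit_match_report( chr $cp );
    printf "U+%04X default=%d /a=%d\n",
        $cp, $report->{default}, $report->{ascii};
}
```

An application that expects only ASCII digits could validate its input with the C</a> form, which rejects look-alike digits from other writing systems.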