The dot (or period), C<.>, is probably the most used, and certainly
the most well-known character class. By default, a dot matches any
-character, except for the newline. The default can be changed to
+character, except for the newline. That default can be changed to
add matching the newline by using the I<single line> modifier: either
for the entire regular expression with the C</s> modifier, or
locally with C<(?s)>. (The experimental C<\N> backslash sequence, described
=head3 Digits
C<\d> matches a single character considered to be a decimal I<digit>.
-If the C</a> regular expression modifier in effect, it matches [0-9].
+If the C</a> regular expression modifier is in effect, it matches [0-9].
Otherwise, it
matches anything that is matched by C<\p{Digit}>, which includes [0-9].
(An unlikely possible exception is that under locale matching rules, the
is expecting only the ASCII digits might be misled, or if the match is
C<\d+>, the matched string might contain a mixture of digits from
different writing systems that look like they signify a number different
-than they actually do. L<Unicode::UCDE<sol>num()|Unicode::UCD/num> can
+than they actually do. L<Unicode::UCD/num()> can
be used to safely
calculate the value, returning C<undef> if the input string contains
such a mixture.
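As a rough illustration of the points above, the following sketch compares C<\d> with and without C</a>, and uses C<Unicode::UCD::num()> to detect a mixed-script digit string. It assumes Perl 5.14 or later (for the C</a> modifier and for C<num()>); the variable names are made up for this example.

```perl
use 5.014;
use strict;
use warnings;
use Unicode::UCD qw(num);

my $ascii_digit  = "7";
my $arabic_digit = "\x{0663}";    # ARABIC-INDIC DIGIT THREE

# Without /a, \d follows \p{Digit}, so both digits match.
my $d_ascii  = $ascii_digit  =~ /\A\d\z/ ? 1 : 0;
my $d_arabic = $arabic_digit =~ /\A\d\z/ ? 1 : 0;

# With /a, \d is restricted to [0-9].
my $d_arabic_a = $arabic_digit =~ /\A\d\z/a ? 1 : 0;

# num() gives a value only when every digit comes from the same set.
my $same_set = num("\x{0661}\x{0662}");    # Arabic-Indic ONE, TWO
my $mixed    = num("1\x{0662}");           # ASCII ONE, Arabic-Indic TWO

printf "ascii=%d arabic=%d arabic_with_a=%d\n",
    $d_ascii, $d_arabic, $d_arabic_a;
printf "same_set=%s mixed=%s\n",
    $same_set // 'undef', $mixed // 'undef';
```

Code that expects only ASCII digits would typically add C</a> or write C<[0-9]> explicitly.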
=head3 Word characters
A C<\w> matches a single alphanumeric character (an alphabetic character, or a
-decimal digit) or a connecting punctuation character, such as an
-underscore ("_"). It does not match a whole word. To match a whole
-word, use C<\w+>. This isn't the same thing as matching an English word, but
-in the ASCII range it is the same as a string of Perl-identifier
-characters.
+decimal digit); or a connecting punctuation character, such as an
+underscore ("_"); or a "mark" character (like some sort of accent) that
+attaches to one of those. It does not match a whole word. To match a
+whole word, use C<\w+>. This isn't the same thing as matching an
+English word, but in the ASCII range it is the same as a string of
+Perl-identifier characters.
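A small sketch of what C<\w> does and does not match may help here. It assumes Unicode rules are in effect for the strings containing code points above 255 (which is the case for such strings); the variable names are illustrative only.

```perl
use strict;
use warnings;

# Single-character checks.
my $underscore = "_"        =~ /\A\w\z/ ? 1 : 0;   # connecting punctuation
my $mark       = "\x{0301}" =~ /\A\w\z/ ? 1 : 0;   # COMBINING ACUTE ACCENT
my $hyphen     = "-"        =~ /\A\w\z/ ? 1 : 0;   # not a word character

# \w matches one character; \w+ is needed for a whole "word".
my ($word) = "foo_bar baz" =~ /(\w+)/;

# A decomposed "e" + combining accent stays inside one \w+ run.
my ($decomposed) = "cafe\x{0301} au lait" =~ /\A(\w+)/;

printf "underscore=%d mark=%d hyphen=%d\n", $underscore, $mark, $hyphen;
printf "word=%s decomposed_length=%d\n", $word, length $decomposed;
```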
=over
Any character not matched by C<\s> is matched by C<\S>.
C<\h> matches any character considered horizontal whitespace;
-this includes the space and tab characters and several others
+this includes the platform's space and tab characters and several others
listed in the table below. C<\H> matches any character
-not considered horizontal whitespace.
+not considered horizontal whitespace. They use the platform's native
+character set, and do not consider any locale that may otherwise be in
+use.
C<\v> matches any character considered vertical whitespace;
-this includes the carriage return and line feed characters (newline)
+this includes the platform's carriage return and line feed characters (newline)
plus several other characters, all listed in the table below.
C<\V> matches any character not considered vertical whitespace.
+They use the platform's native character set, and do not consider any
+locale that may otherwise be in use.
C<\R> matches anything that can be considered a newline under Unicode
rules. It's not a character class, as it can match a multi-character
sequence. Therefore, it cannot be used inside a bracketed character
-class; use C<\v> instead (vertical whitespace).
+class; use C<\v> instead (vertical whitespace). It uses the platform's
+native character set, and does not consider any locale that may
+otherwise be in use.
Details are discussed in L<perlrebackslash>.
Note that unlike C<\s> (and C<\d> and C<\w>), C<\h> and C<\v> always match
-the same characters, without regard to other factors, such as whether the
-source string is in UTF-8 format.
+the same characters, without regard to other factors, such as the active
+locale or whether the source string is in UTF-8 format.
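The distinctions among C<\h>, C<\v> and C<\R> can be sketched as follows. This is only an illustration, assuming Perl 5.10 or later and an ASCII platform (so that C<"\x{A0}"> is NO-BREAK SPACE); the variable names are invented for the example.

```perl
use strict;
use warnings;

# Horizontal whitespace: tab and no-break space match, line feed does not.
my $h_tab  = "\t"     =~ /\A\h\z/ ? 1 : 0;
my $h_nbsp = "\x{A0}" =~ /\A\h\z/ ? 1 : 0;
my $h_lf   = "\n"     =~ /\A\h\z/ ? 1 : 0;

# Vertical whitespace: line feed, carriage return and vertical tab match.
my $v_lf = "\n"   =~ /\A\v\z/ ? 1 : 0;
my $v_cr = "\r"   =~ /\A\v\z/ ? 1 : 0;
my $v_vt = "\x0b" =~ /\A\v\z/ ? 1 : 0;

# \R can match the two-character sequence CR LF as a unit;
# \v is a character class and matches exactly one character.
my $r_crlf = "\r\n" =~ /\A\R\z/ ? 1 : 0;
my $v_crlf = "\r\n" =~ /\A\v\z/ ? 1 : 0;

printf "h: tab=%d nbsp=%d lf=%d\n", $h_tab, $h_nbsp, $h_lf;
printf "v: lf=%d cr=%d vt=%d\n",    $v_lf,  $v_cr,   $v_vt;
printf "crlf: R=%d v=%d\n",         $r_crlf, $v_crlf;
```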
One might think that C<\s> is equivalent to C<[\h\v]>. This is not true.
-For example, the vertical tab (C<"\x0b">) is not matched by C<\s>, it is
-however considered vertical whitespace.
+The difference is that the vertical tab (C<"\x0b">) is not matched by
+C<\s>; it is, however, considered vertical whitespace.
The following table is a complete listing of characters matched by
C<\s>, C<\h> and C<\v> as of Unicode 6.0.
It matches a two character string: a letter (Unicode property C<\pL>),
followed by a lowercase C<l>.
-If neither the C</a> modifier nor locale rules are in effect, the use of
+If locale rules are not in effect, the use of
a Unicode property will force the regular expression into using Unicode
-rules.
+rules, if it isn't already.
Note that almost all properties are immune to case-insensitive matching.
That is, adding a C</i> regular expression modifier does not change what
It is also possible to define your own properties. This is discussed in
L<perlunicode/User-Defined Character Properties>.
+Unicode properties are defined (surprise!) only on Unicode code points.
+A warning is raised and all matches fail on non-Unicode code points
+(those above the legal Unicode maximum of 0x10FFFF). This can be
+somewhat surprising:
+
+ chr(0x110000) =~ /\p{ASCII_Hex_Digit=True}/ # Fails.
+ chr(0x110000) =~ /\p{ASCII_Hex_Digit=False}/ # Also fails!
+
+Even though these two matches might be thought of as complements, they
+are so only on Unicode code points.
+
=head4 Examples
"a" =~ /\w/ # Match, "a" is a 'word' character.
-------
-* There is an exception to a bracketed character class matching only a
-single character. When the class is to match caselessely under C</i>
-matching rules, and a character inside the class matches a
+* There is an exception to a bracketed character class matching a
+single character only. When the class is to match caselessly under C</i>
+matching rules, and a character that is explicitly mentioned inside the
+class matches a
multiple-character sequence caselessly under Unicode rules, the class
(when not L<inverted|/Negation>) will also match that sequence. For
example, Unicode says that the letter C<LATIN SMALL LETTER SHARP S>
'ss' =~ /\A\N{LATIN SMALL LETTER SHARP S}\z/i # Matches
'ss' =~ /\A[aeioust\N{LATIN SMALL LETTER SHARP S}]\z/i # Matches
+For this to happen, the character must be explicitly specified, and not
+be part of a multi-character range (not even as one of its endpoints).
+(L</Character Ranges> will be explained shortly.) Therefore,
+
+ 'ss' =~ /\A[\0-\x{ff}]\z/i # Doesn't match
+ 'ss' =~ /\A[\0-\N{LATIN SMALL LETTER SHARP S}]\z/i # No match
+ 'ss' =~ /\A[\xDF-\xDF]\z/i # Matches on ASCII platforms, since \xDF
+ # is LATIN SMALL LETTER SHARP S, and the
+ # range is just a single element
+
+Note that it isn't a good idea to specify these types of ranges anyway.
+
=head3 Special Characters Inside a Bracketed Character Class
Most characters that are meta characters in regular expressions (that
C<\p{Alpha}> matches not just the ASCII alphabetic characters, but any
character in the entire Unicode character set considered alphabetic.
An entry in the column labelled "backslash sequence" is a (short)
-synonym for the Full-range Unicode form.
+equivalent.
[[:...:]] ASCII-range Full-range backslash Note
Unicode Unicode sequence
On EBCDIC platforms, it is likely that the code page will define C<[[:cntrl:]]>
to be the EBCDIC equivalents of the ASCII controls, plus the controls
-that in Unicode have code pointss from 128 through 159.
+that in Unicode have code points from 128 through 159.
=item [3]
The similarly named property, C<\p{Punct}>, matches a somewhat different
set in the ASCII range, namely
-C<[-!"#%&'()*,./:;?@[\\\]_{}]>. That is, it is missing C<[$+E<lt>=E<gt>^`|~]>.
+C<[-!"#%&'()*,./:;?@[\\\]_{}]>. That is, it is missing the nine
+characters C<[$+E<lt>=E<gt>^`|~]>.
This is because Unicode splits what POSIX considers to be punctuation into two
categories, Punctuation and Symbols.
=item if locale rules are in effect ...
-The POSIX class matches according to the locale.
+The POSIX class matches according to the locale, except that
+C<word> uses the platform's native underscore character, no matter what
+the locale is.
=item if Unicode rules are in effect or if on an EBCDIC platform ...
/[01[:lower:]]/ # Matches a character that is either a
# lowercase letter, or '0' or '1'.
/[[:digit:][:^xdigit:]]/ # Matches a character that can be anything
- # except the letters 'a' to 'f'. This is
- # because the main character class is composed
- # of two POSIX character classes that are ORed
- # together, one that matches any digit, and
- # the other that matches anything that isn't a
- # hex digit. The result matches all
- # characters except the letters 'a' to 'f' and
- # 'A' to 'F'.
+ # except the letters 'a' to 'f' and 'A' to
+ # 'F'. This is because the main character
+ # class is composed of two POSIX character
+ # classes that are ORed together, one that
+ # matches any digit, and the other that
+ # matches anything that isn't a hex digit.
+ # The OR adds the digits, leaving only the
+ # letters 'a' to 'f' and 'A' to 'F' excluded.
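The claim about C<[[:digit:][:^xdigit:]]> can be checked with a short sketch that tries every character in the ASCII range and collects the ones the class rejects. It is restricted to code points 0 through 127 on the assumption of an ASCII platform; C<$re> and C<$rejected> are names chosen for this example.

```perl
use strict;
use warnings;

# Digits, ORed with everything that is not a hex digit.
my $re = qr/\A[[:digit:][:^xdigit:]]\z/;

# Collect the ASCII characters that the class does NOT match.
my $rejected = join '', grep { $_ !~ $re } map { chr } 0 .. 127;

print "rejected: $rejected\n";
```

On such a platform the only characters left out should be the hex letters, in both cases.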