decimal digit) or a connecting punctuation character, such as an
underscore ("_"). It does not match a whole word. To match a whole
word, use C<\w+>. This isn't the same thing as matching an English word, but
-in the ASCII range is the same as a string of Perl-identifier
+in the ASCII range it is the same as a string of Perl-identifier
characters. What is considered a
word character depends on several factors, detailed below in L</Locale,
EBCDIC, Unicode and UTF-8>. If those factors indicate a Unicode
C<\f>,
C<\n>,
C<\N{I<NAME>}>,
-C<\N{U+I<wide hex char>}>,
+C<\N{U+I<hex char>}>,
C<\r>,
C<\t>,
and
matches any lowercase letter from the first half of the ASCII alphabet.
Note that the two characters on either side of the hyphen are not
-necessary both letters or both digits. Any character is possible,
+necessarily both letters or both digits. Any character is possible,
although not advisable. C<['-?]> contains a range of characters, but
most people will not know which characters that will be. Furthermore,
such ranges may lead to portability problems if the code has to run on
=head3 Backslash Sequences
You can put any backslash sequence character class (with the exception of
-C<\N>) inside a bracketed character class, and it will act just
+C<\N> and C<\R>) inside a bracketed character class, and it will act just
as if you put all the characters matched by the backslash sequence inside the
character class. For instance, C<[a-f\d]> will match any decimal digit, or any
of the lowercase letters between 'a' and 'f' inclusive.
C<\N> within a bracketed character class must be of the forms C<\N{I<name>}>
-or C<\N{U+I<wide hex char>}>, and NOT be the form that matches non-newlines,
+or C<\N{U+I<hex char>}>, and NOT be the form that matches non-newlines,
for the same reason that a dot C<.> inside a bracketed character class loses
its special meaning: it matches nearly anything, which generally isn't what you
want to happen.
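As a rough illustration of the paragraph above (a sketch only, not part of the patched text), the snippet below checks a handful of single-character strings against C<[a-f\d]> and uses the C<\N{U+...}> form inside brackets. The variable names are made up for the example, and a reasonably recent Perl is assumed.

```perl
use strict;
use warnings;

# Hypothetical sample data: single-character strings to test.
my @candidates = ('a', 'f', 'g', '7', 'F', '_');

# [a-f\d] behaves as if the digits matched by \d were listed in the class.
my @matched = grep { /\A[a-f\d]\z/ } @candidates;
print "matched: @matched\n";    # matched: a f 7

# Inside brackets, \N must name a code point; U+2C is the comma.
my $comma_ok = (',' =~ /\A[\N{U+2C}]\z/) ? 1 : 0;
print "comma matched: $comma_ok\n";    # comma matched: 1
```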
appropriate characters in the full Unicode character set. For example,
C<\p{Alpha}> will match not just the ASCII alphabetic characters, but any
character in the entire Unicode character set that is considered to be
-alphabetic. The backslash sequence column is a (short) synonym for
+alphabetic. The column labelled "backslash sequence" is a (short) synonym for
the Full-range Unicode form.
(Each of the counterparts has various synonyms as well.
EBCDIC code page.
It is proposed to change this behavior in a future release of Perl so that the
-the UTF8ness of the source string will be irrelevant to the behavior of the
+UTF-8-ness of the source string will be irrelevant to the behavior of the
POSIX character classes. This means they will always behave in strict
accordance with the official POSIX standard. That is, if either locale or
EBCDIC code page is present, they will behave in accordance with those; if
absent, the classes will match only their ASCII-range counterparts. If you
-disagree with this proposal, send email to C<perl5-porters@perl.org>.
+wish to comment on this proposal, send email to C<perl5-porters@perl.org>.
[[:...:]] ASCII-range Full-range backslash Note
Unicode Unicode sequence
This is because Unicode splits what POSIX considers to be punctuation into two
categories, Punctuation and Symbols.
-C<\p{PosixPunct>, and when the matching string is in UTF-8 format,
-C<[[:punct:]]>, match what they match in the ASCII range, plus what
-C<\p{Punct}> matches. This is different
-than strictly matching according to C<\p{Punct}>. Another way to say it is that
+C<\p{XPosixPunct}> and (in Unicode mode) C<[[:punct:]]> match what
+C<\p{PosixPunct}> matches in the ASCII range, plus what C<\p{Punct}>
+matches. This is different from strictly matching according to
+C<\p{Punct}>. Another way to say it is that
for a UTF-8 string, C<[[:punct:]]> matches all the characters that Unicode
considers to be punctuation, plus all the ASCII-range characters that Unicode
considers to be symbols.
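The distinction can be seen with a small test; this is a sketch assuming a Perl recent enough to know C<\p{XPosixPunct}>. The dollar sign is an ASCII-range character that Unicode classes as a symbol, the euro sign (U+20AC) is a symbol outside ASCII, and U+2014 is punctuation outside ASCII. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $dollar = '$';           # ASCII, Unicode category Symbol
my $euro   = "\x{20AC}";    # non-ASCII, Unicode category Symbol
my $dash   = "\x{2014}";    # non-ASCII, Unicode category Punctuation

my %result = (
    posix_dollar  => ($dollar =~ /\A[[:punct:]]\z/      ? 1 : 0),
    punct_dollar  => ($dollar =~ /\A\p{Punct}\z/        ? 1 : 0),
    xposix_dollar => ($dollar =~ /\A\p{XPosixPunct}\z/  ? 1 : 0),
    xposix_euro   => ($euro   =~ /\A\p{XPosixPunct}\z/  ? 1 : 0),
    xposix_dash   => ($dash   =~ /\A\p{XPosixPunct}\z/  ? 1 : 0),
);

# Print the results in a stable order.
print "$_: $result{$_}\n" for sort keys %result;
```

Only C<punct_dollar> and C<xposix_euro> should come out as 0: C<\p{Punct}> alone excludes ASCII symbols, and C<\p{XPosixPunct}> adds back only the ASCII-range ones.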
\P{PerlSpace} \P{XPerlSpace} \S
[[:^word:]] \P{PerlWord} \P{XPosixWord} \W
-Again, the backslash sequence means Full-range Unicode.
+The backslash sequence can mean either ASCII- or Full-range Unicode,
+depending on various factors. See L</Locale, EBCDIC, Unicode and UTF-8>
+below.
=head4 [= =] and [. .]
will match only the 63 characters "[A-Za-z0-9_]"; C<\d>, only the 10
digits 0-9; C<\s>, only the five characters "[ \f\n\r\t]"; and the
C<"[[:posix:]]"> classes only the appropriate ASCII characters. (See
-L<perlrebackslash>.) This modifier is like the C<"/u"> modifier in that
+L<perlrecharclass>.) This modifier is like the C<"/u"> modifier in that
things like "KELVIN SIGN" match the letters "k" and "K"; and non-ASCII
characters continue to have Unicode semantics. This modifier is
recommended for people who only incidentally use Unicode. One can write
C<\d> with confidence that it will only match ASCII characters, and
should the need arise to match beyond ASCII, you can use C<\p{Digit}> or
-C<\p{Word}>. (See L<perlrebackslash> for how to extend C<\s>, and the
+C<\p{Word}>. (See L<perlrecharclass> for how to extend C<\s>, and the
Posix classes beyond ASCII under this modifier.) This modifier is
automatically selected within the scope of C<use re '/a'>.
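A short sketch of the effect described above, assuming Perl 5.14 or later (where the C</a> modifier is available). It compares an ASCII digit with ARABIC-INDIC DIGIT THREE (U+0663); the variable names are chosen for the example.

```perl
use strict;
use warnings;

my $ascii_digit  = "7";
my $arabic_digit = "\x{0663}";    # ARABIC-INDIC DIGIT THREE

# Without /a, \d follows Unicode rules for this string.
my $plain_arabic = ($arabic_digit =~ /\A\d\z/)  ? 1 : 0;

# Under /a, \d is restricted to the ten ASCII digits.
my $a_ascii  = ($ascii_digit  =~ /\A\d\z/a) ? 1 : 0;
my $a_arabic = ($arabic_digit =~ /\A\d\z/a) ? 1 : 0;

# \p{Digit} keeps its Unicode meaning even under /a.
my $prop_arabic = ($arabic_digit =~ /\A\p{Digit}\z/a) ? 1 : 0;

print "plain \\d on U+0663:    $plain_arabic\n";   # 1
print "/a \\d on '7':          $a_ascii\n";        # 1
print "/a \\d on U+0663:       $a_arabic\n";       # 0
print "/a \\p{Digit} on U+0663: $prop_arabic\n";   # 1
```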