In all Perl versions, C<\s> matches the 5 characters [\t\n\f\r ]; that
is, the horizontal tab,
the newline, the form feed, the carriage return, and the space.
-Starting in Perl v5.18, experimentally, it also matches the vertical tab, C<\cK>.
+Starting in Perl v5.18, it also matches the vertical tab, C<\cK>.
See note C<[1]> below for a discussion of this.
=item otherwise ...
=item otherwise ...
-C<\s> matches [\t\n\f\r ] and, starting, experimentally in Perl
+C<\s> matches [\t\n\f\r ] and, starting in Perl
v5.18, the vertical tab, C<\cK>.
(See note C<[1]> below for a discussion of this.)
Note that this list doesn't include the non-breaking space.
locale that may otherwise be in use.
C<\R> matches anything that can be considered a newline under Unicode
-rules. It's not a character class, as it can match a multi-character
-sequence. Therefore, it cannot be used inside a bracketed character
-class; use C<\v> instead (vertical whitespace). It uses the platform's
+rules. It can match a multi-character sequence. It cannot be used inside
+a bracketed character class; use C<\v> instead (vertical whitespace).
+It uses the platform's
native character set, and does not consider any locale that may
otherwise be in use.
Details are discussed in L<perlrebackslash>.
=item [1]
-Prior to Perl v5.18, C<\s> did not match the vertical tab. The change
-in v5.18 is considered an experiment, which means it could be backed out
-in v5.22 if experience indicates that it breaks too much
-existing code. If this change adversely affects you, send email to
-C<perlbug@perl.org>; if it affects you positively, email
-C<perlthanks@perl.org>. In the meantime, C<[^\S\cK]> (obscurely)
-matches what C<\s> traditionally did.
+Prior to Perl v5.18, C<\s> did not match the vertical tab.
+C<[^\S\cK]> (obscurely) matches what C<\s> traditionally did.
=item [2]
multi-character range (not even as one of its endpoints). (L</Character
Ranges> will be explained shortly.) Therefore,
- 'ss' =~ /\A[\0-\x{ff}]\z/i # Doesn't match
- 'ss' =~ /\A[\0-\N{LATIN SMALL LETTER SHARP S}]\z/i # No match
- 'ss' =~ /\A[\xDF-\xDF]\z/i # Matches on ASCII platforms, since
- # \XDF is LATIN SMALL LETTER SHARP S,
+ 'ss' =~ /\A[\0-\x{ff}]\z/ui # Doesn't match
+ 'ss' =~ /\A[\0-\N{LATIN SMALL LETTER SHARP S}]\z/ui # No match
+ 'ss' =~ /\A[\xDF-\xDF]\z/ui # Matches on ASCII platforms, since
+ # \xDF is LATIN SMALL LETTER SHARP S,
# and the range is just a single
# element
matches, because C<\N{TAMIL SYLLABLE KAU}> is a named sequence
consisting of the two characters matched against. Like the other
-instance where a bracketed class can match multi characters, and for
+instance where a bracketed class can match multiple characters, and for
similar reasons, the class must not be inverted, and the named sequence
may not appear in a range, even one where it is both endpoints. If
these happen, it is a fatal error if the character class is within an
and
C<\x>
are also special and have the same meanings as they do outside a
-bracketed character class. (However, inside a bracketed character
-class, if C<\N{I<NAME>}> expands to a sequence of characters, only the first
-one in the sequence is used, with a warning.)
+bracketed character class.
Also, a backslash followed by two or three octal digits is considered an octal
number.
Examples:
"+" =~ /[+?*]/ # Match, "+" in a character class is not special.
- "\cH" =~ /[\b]/ # Match, \b inside in a character class.
+ "\cH" =~ /[\b]/ # Match, \b inside in a character class
# is equivalent to a backspace.
- "]" =~ /[][]/ # Match, as the character class contains.
+ "]" =~ /[][]/ # Match, as the character class contains
# both [ and ].
"[]" =~ /[[]]/ # Match, the pattern contains a character class
- # containing just ], and the character class is
+ # containing just [, and the character class is
# followed by a ].
=head3 Character Ranges
# hyphen ('-'), or the letter 'm'.
['-?] # Matches any of the characters '()*+,-./0123456789:;<=>?
# (But not on an EBCDIC platform).
-
-Perl guarantees that the ranges C<A-Z>, C<a-z>, C<0-9>, and any
+ [\N{APOSTROPHE}-\N{QUESTION MARK}]
+ # Matches any of the characters '()*+,-./0123456789:;<=>?
+ # even on an EBCDIC platform.
+ [\N{U+27}-\N{U+3F}] # Same. (U+27 is "'", and U+3F is "?")
+
+As the final two examples above show, you can achieve portablity to
+non-ASCII platforms by using the C<\N{...}> form for the range
+endpoints. These indicate that the specified range is to be interpreted
+using Unicode values, so C<[\N{U+27}-\N{U+3F}]> means to match
+C<\N{U+27}>, C<\N{U+28}>, C<\N{U+29}>, ..., C<\N{U+3D}>, C<\N{U+3E}>,
+and C<\N{U+3F}>, whatever the native code point versions for those are.
+These are called "Unicode" ranges. If either end is of the C<\N{...}>
+form, the range is considered Unicode. A C<regexp> warning is raised
+under C<S<"use re 'strict'">> if the other endpoint is specified
+non-portably:
+
+ [\N{U+00}-\x09] # Warning under re 'strict'; \x09 is non-portable
+ [\N{U+00}-\t] # No warning;
+
+Both of the above match the characters C<\N{U+00}> C<\N{U+01}>, ...
+C<\N{U+08}>, C<\N{U+09}>, but the C<\x09> looks like it could be a
+mistake so the warning is raised (under C<re 'strict'>) for it.
+
+Perl also guarantees that the ranges C<A-Z>, C<a-z>, C<0-9>, and any
subranges of these match what an English-only speaker would expect them
-to match. That is, C<[A-Z]> matches the 26 ASCII uppercase letters;
+to match on any platform. That is, C<[A-Z]> matches the 26 ASCII
+uppercase letters;
C<[a-z]> matches the 26 lowercase letters; and C<[0-9]> matches the 10
digits. Subranges, like C<[h-k]>, match correspondingly, in this case
just the four letters C<"h">, C<"i">, C<"j">, and C<"k">. This is the
word A Perl extension ("[A-Za-z0-9_]"), equivalent to "\w".
xdigit Any hexadecimal digit ("[0-9a-fA-F]").
+Like the L<Unicode properties|/Unicode Properties>, most of the POSIX
+properties match the same regardless of whether case-insensitive (C</i>)
+matching is in effect or not. The two exceptions are C<[:upper:]> and
+C<[:lower:]>. Under C</i>, they each match the union of C<[:upper:]> and
+C<[:lower:]>.
+
Most POSIX character classes have two Unicode-style C<\p> property
counterparts. (They are not official Unicode properties, but Perl extensions
derived from official Unicode properties.) The table below shows the relation
Control characters don't produce output as such, but instead usually control
the terminal somehow: for example, newline and backspace are control characters.
-In the ASCII range, characters whose code points are between 0 and 31 inclusive,
-plus 127 (C<DEL>) are control characters.
+On ASCII platforms, in the ASCII range, characters whose code points are
+between 0 and 31 inclusive, plus 127 (C<DEL>) are control characters; on
+EBCDIC platforms, their counterparts are control characters.
=item [3]
C<\p{XPerlSpace}> and C<\p{Space}> match identically starting with Perl
v5.18. In earlier versions, these differ only in that in non-locale
-matching, C<\p{XPerlSpace}> does not match the vertical tab, C<\cK>.
+matching, C<\p{XPerlSpace}> did not match the vertical tab, C<\cK>.
Same for the two ASCII-only range forms.
=back
There are various other synonyms that can be used besides the names
-listed in the table. For example, C<\p{PosixAlpha}> can be written as
+listed in the table. For example, C<\p{XPosixAlpha}> can be written as
C<\p{Alpha}>. All are listed in
L<perluniprops/Properties accessible through \p{} and \P{}>.
Comments on this feature are welcome; send email to
C<perl5-porters@perl.org>.
+The rules used by L<C<use re 'strict>|re/'strict' mode> apply to this
+construct.
+
We can extend the example above:
/(?[ ( \p{Thai} + \p{Lao} ) & \p{Digit} ])/
! complement
-All the binary operators left associate, and are of equal precedence.
-The unary operator right associates, and has higher precedence. Use
-parentheses to override the default associations. Some feedback we've
-received indicates a desire for intersection to have higher precedence
-than union. This is something that feedback from the field may cause us
-to change in future releases; you may want to parenthesize copiously to
-avoid such changes affecting your code, until this feature is no longer
-considered experimental.
+All the binary operators left associate; C<"&"> is higher precedence
+than the others, which all have equal precedence. The unary operator
+right associates, and has highest precedence. Thus this follows the
+normal Perl precedence rules for logical operators. Use parentheses to
+override the default precedence and associativity.
The main restriction is that everything is a metacharacter. Thus,
you cannot refer to single characters by doing something like this:
This last example shows the use of this construct to specify an ordinary
bracketed character class without additional set operations. Note the
-white space within it; C<E<sol>x> is turned on even within bracketed
-character classes, except you can't have comments inside them. Hence,
+white space within it; a limited version of C<E<sol>x> is turned on even
+within bracketed character classes, with only the SPACE and TAB (C<\t>)
+characters allowed, and no comments. Hence,
(?[ [#] ])
=item 1
-This construct cannot be used within the scope of
-C<use locale> (or the C<E<sol>l> regex modifier).
+When compiled within the scope of C<use locale> (or the C<E<sol>l> regex
+modifier), this construct assumes that the execution-time locale will be
+a UTF-8 one, and the generated pattern always uses Unicode rules. What
+gets matched or not thus isn't dependent on the actual runtime locale, so
+tainting is not enabled. But a C<locale> category warning is raised
+if the runtime locale turns out to not be UTF-8.
=item 2