The dot (or period), C<.> is probably the most used, and certainly
the most well-known character class. By default, a dot matches any
character, except for the newline. That default can be changed to
-add matching the newline by using the I<single line> modifier: either
+add matching the newline by using the I<single line> modifier:
for the entire regular expression with the C</s> modifier, or
-locally with C<(?s)>. (The experimental C<\N> backslash sequence, described
+locally with C<(?s)> (and even globally within the scope of
+L<C<use re '/s'>|re/'E<sol>flags' mode>). (The C<L</\N>> backslash
+sequence, described
below, matches any character except newline without regard to the
I<single line> modifier.)
\H Match a character that isn't horizontal whitespace.
\v Match a vertical whitespace character.
\V Match a character that isn't vertical whitespace.
- \N Match a character that isn't a newline. Experimental.
+ \N Match a character that isn't a newline.
\pP, \p{Prop} Match a character that has the given Unicode property.
\PP, \P{Prop} Match a character that doesn't have the Unicode property
=head3 \N
-C<\N> is new in 5.12, and is experimental. It, like the dot, matches any
+C<\N>, available starting in v5.12, like the dot, matches any
character that is not a newline. The difference is that C<\N> is not influenced
by the I<single line> regular expression modifier (see L</The dot> above). Note
that the form C<\N{...}> may mean something completely different. When the
Otherwise, it
matches anything that is matched by C<\p{Digit}>, which includes [0-9].
(An unlikely possible exception is that under locale matching rules, the
-current locale might not have [0-9] matched by C<\d>, and/or might match
-other characters whose code point is less than 256. Such a locale
-definition would be in violation of the C language standard, but Perl
-doesn't currently assume anything in regard to this.)
+current locale might not have C<[0-9]> matched by C<\d>, and/or might match
+other characters whose code point is less than 256. The only such locale
+definitions that are legal would be to match C<[0-9]> plus another set of
+10 consecutive digit characters; anything else would be in violation of
+the C language standard, but Perl doesn't currently assume anything in
+regard to this.)
What this means is that unless the C</a> modifier is in effect C<\d> not
only matches the digits '0' - '9', but also Arabic, Devanagari, and
Some digits that C<\d> matches look like some of the [0-9] ones, but
have different values. For example, BENGALI DIGIT FOUR (U+09EA) looks
-very much like an ASCII DIGIT EIGHT (U+0038). An application that
+very much like an ASCII DIGIT EIGHT (U+0038), and LEPCHA DIGIT SIX
+(U+1C46) looks very much like an ASCII DIGIT FIVE (U+0035). An
+application that
is expecting only the ASCII digits might be misled, or if the match is
C<\d+>, the matched string might contain a mixture of digits from
different writing systems that look like they signify a number different
than they actually do. L<Unicode::UCD/num()> can
be used to safely
calculate the value, returning C<undef> if the input string contains
-such a mixture.
+such a mixture. Otherwise, for example, a displayed price might be
+deliberately different than it appears.
What C<\p{Digit}> means (and hence C<\d> except under the C</a>
modifier) is C<\p{General_Category=Decimal_Number}>, or synonymously,
=head3 Word characters
A C<\w> matches a single alphanumeric character (an alphabetic character, or a
-decimal digit) or a connecting punctuation character, such as an
-underscore ("_"). It does not match a whole word. To match a whole
-word, use C<\w+>. This isn't the same thing as matching an English word, but
-in the ASCII range it is the same as a string of Perl-identifier
-characters.
+decimal digit); or a connecting punctuation character, such as an
+underscore ("_"); or a "mark" character (like some sort of accent) that
+attaches to one of those. It does not match a whole word. To match a
+whole word, use C<\w+>. This isn't the same thing as matching an
+English word, but in the ASCII range it is the same as a string of
+Perl-identifier characters.
=over
C<\w> matches the platform's native underscore character plus whatever
the locale considers to be alphanumeric.
-=item if Unicode rules are in effect or if on an EBCDIC platform ...
+=item if, instead, Unicode rules are in effect ...
C<\w> matches exactly what C<\p{Word}> matches.
=item If the C</a> modifier is in effect ...
-C<\s> matches the 5 characters [\t\n\f\r ]; that is, the horizontal tab,
-the newline, the form feed, the carriage return, and the space. (Note
-that it doesn't match the vertical tab, C<\cK> on ASCII platforms.)
+In all Perl versions, C<\s> matches the 5 characters [\t\n\f\r ]; that
+is, the horizontal tab,
+the newline, the form feed, the carriage return, and the space.
+Starting in Perl v5.18, it also matches the vertical tab, C<\cK>.
+See note C<[1]> below for a discussion of this.
=item otherwise ...
=item if locale rules are in effect ...
-C<\s> matches whatever the locale considers to be whitespace. Note that
-this is likely to include the vertical space, unlike non-locale C<\s>
-matching.
+C<\s> matches whatever the locale considers to be whitespace.
-=item if Unicode rules are in effect or if on an EBCDIC platform ...
+=item if, instead, Unicode rules are in effect ...
C<\s> matches exactly the characters shown with an "s" column in the
table below.
=item otherwise ...
-C<\s> matches [\t\n\f\r ].
+C<\s> matches [\t\n\f\r ] and, starting in Perl
+v5.18, the vertical tab, C<\cK>.
+(See note C<[1]> below for a discussion of this.)
Note that this list doesn't include the non-breaking space.
=back
locale that may otherwise be in use.
C<\R> matches anything that can be considered a newline under Unicode
-rules. It's not a character class, as it can match a multi-character
-sequence. Therefore, it cannot be used inside a bracketed character
-class; use C<\v> instead (vertical whitespace). It uses the platform's
+rules. It can match a multi-character sequence. It cannot be used inside
+a bracketed character class; use C<\v> instead (vertical whitespace).
+It uses the platform's
native character set, and does not consider any locale that may
otherwise be in use.
Details are discussed in L<perlrebackslash>.
the same characters, without regard to other factors, such as the active
locale or whether the source string is in UTF-8 format.
-One might think that C<\s> is equivalent to C<[\h\v]>. This is not true.
-The difference is that the vertical tab (C<"\x0b">) is not matched by
-C<\s>; it is however considered vertical whitespace.
+One might think that C<\s> is equivalent to C<[\h\v]>. This is indeed true
+starting in Perl v5.18, but prior to that, the sole difference was that the
+vertical tab (C<"\cK">) was not matched by C<\s>.
The following table is a complete listing of characters matched by
-C<\s>, C<\h> and C<\v> as of Unicode 6.0.
+C<\s>, C<\h> and C<\v> as of Unicode 6.3.
The first column gives the Unicode code point of the character (in hex format),
the second column gives the (Unicode) name. The third column indicates
-by which class(es) the character is matched (assuming no locale or EBCDIC code
-page is in effect that changes the C<\s> matching).
+by which class(es) the character is matched (assuming no locale is in
+effect that changes the C<\s> matching).
0x0009 CHARACTER TABULATION h s
0x000a LINE FEED (LF) vs
- 0x000b LINE TABULATION v
+ 0x000b LINE TABULATION vs [1]
0x000c FORM FEED (FF) vs
0x000d CARRIAGE RETURN (CR) vs
0x0020 SPACE h s
- 0x0085 NEXT LINE (NEL) vs [1]
- 0x00a0 NO-BREAK SPACE h s [1]
+ 0x0085 NEXT LINE (NEL) vs [2]
+ 0x00a0 NO-BREAK SPACE h s [2]
0x1680 OGHAM SPACE MARK h s
- 0x180e MONGOLIAN VOWEL SEPARATOR h s
0x2000 EN QUAD h s
0x2001 EM QUAD h s
0x2002 EN SPACE h s
=item [1]
+Prior to Perl v5.18, C<\s> did not match the vertical tab.
+C<[^\S\cK]> (obscurely) matches what C<\s> traditionally did.
+
+=item [2]
+
NEXT LINE and NO-BREAK SPACE may or may not match C<\s> depending
on the rules in effect. See
L<the beginning of this section|/Whitespace>.
It matches a two character string: a letter (Unicode property C<\pL>),
followed by a lowercase C<l>.
-If neither the C</a> modifier nor locale rules are in effect, the use of
+If locale rules are not in effect, the use of
a Unicode property will force the regular expression into using Unicode
-rules.
+rules, if it isn't already.
Note that almost all properties are immune to case-insensitive matching.
That is, adding a C</i> regular expression modifier does not change what
L<perlunicode/User-Defined Character Properties>.
Unicode properties are defined (surprise!) only on Unicode code points.
-A warning is raised and all matches fail on non-Unicode code points
-(those above the legal Unicode maximum of 0x10FFFF). This can be
-somewhat surprising,
+Starting in v5.20, when matching against C<\p> and C<\P>, Perl treats
+non-Unicode code points (those above the legal Unicode maximum of
+0x10FFFF) as if they were typical unassigned Unicode code points.
+
+Prior to v5.20, Perl raised a warning and made all matches fail on
+non-Unicode code points. This could be somewhat surprising:
- chr(0x110000) =~ \p{ASCII_Hex_Digit=True} # Fails.
- chr(0x110000) =~ \p{ASCII_Hex_Digit=False} # Also fails!
+ chr(0x110000) =~ \p{ASCII_Hex_Digit=True} # Fails on Perls < v5.20.
+ chr(0x110000) =~ \p{ASCII_Hex_Digit=False} # Also fails on Perls
+ # < v5.20
-Even though these two matches might be thought of as complements, they
-are so only on Unicode code points.
+Even though these two matches might be thought of as complements, until
+v5.20 they were so only on Unicode code points.
+
+Starting in perl v5.30, wildcards are allowed in Unicode property
+values. See L<perlunicode/Wildcards in Property Values>.
=head4 Examples
-------
-* There is an exception to a bracketed character class matching a
-single character only. When the class is to match caselessly under C</i>
-matching rules, and a character inside the class matches a
+* There are two exceptions to a bracketed character class matching a
+single character only. Each requires special handling by Perl to make
+things work:
+
+=over
+
+=item *
+
+When the class is to match caselessly under C</i> matching rules, and a
+character that is explicitly mentioned inside the class matches a
multiple-character sequence caselessly under Unicode rules, the class
-(when not L<inverted|/Negation>) will also match that sequence. For
-example, Unicode says that the letter C<LATIN SMALL LETTER SHARP S>
-should match the sequence C<ss> under C</i> rules. Thus,
+will also match that sequence. For example, Unicode says that the
+letter C<LATIN SMALL LETTER SHARP S> should match the sequence C<ss>
+under C</i> rules. Thus,
'ss' =~ /\A\N{LATIN SMALL LETTER SHARP S}\z/i # Matches
'ss' =~ /\A[aeioust\N{LATIN SMALL LETTER SHARP S}]\z/i # Matches
+For this to happen, the class must not be inverted (see L</Negation>)
+and the character must be explicitly specified, and not be part of a
+multi-character range (not even as one of its endpoints). (L</Character
+Ranges> will be explained shortly.) Therefore,
+
+ 'ss' =~ /\A[\0-\x{ff}]\z/ui # Doesn't match
+ 'ss' =~ /\A[\0-\N{LATIN SMALL LETTER SHARP S}]\z/ui # No match
+ 'ss' =~ /\A[\xDF-\xDF]\z/ui # Matches on ASCII platforms, since
+ # \xDF is LATIN SMALL LETTER SHARP S,
+ # and the range is just a single
+ # element
+
+Note that it isn't a good idea to specify these types of ranges anyway.
+
+=item *
+
+Some names known to C<\N{...}> refer to a sequence of multiple characters,
+instead of the usual single character. When one of these is included in
+the class, the entire sequence is matched. For example,
+
+ "\N{TAMIL LETTER KA}\N{TAMIL VOWEL SIGN AU}"
+ =~ / ^ [\N{TAMIL SYLLABLE KAU}] $ /x;
+
+matches, because C<\N{TAMIL SYLLABLE KAU}> is a named sequence
+consisting of the two characters matched against. Like the other
+instance where a bracketed class can match multiple characters, and for
+similar reasons, the class must not be inverted, and the named sequence
+may not appear in a range, even one where it is both endpoints. If
+these happen, it is a fatal error if the character class is within the
+scope of L<C<use re 'strict>|re/'strict' mode>, or within an extended
+L<C<(?[...])>|/Extended Bracketed Character Classes> class; otherwise
+only the first code point is used (with a C<regexp>-type warning
+raised).
+
+=back
+
=head3 Special Characters Inside a Bracketed Character Class
Most characters that are meta characters in regular expressions (that
their special meaning and can be used inside a character class without
the need to escape them. For instance, C<[()]> matches either an opening
parenthesis, or a closing parenthesis, and the parens inside the character
-class don't group or capture.
+class don't group or capture. Be aware that, unless the pattern is
+evaluated in single-quotish context, variable interpolation will take
+place before the bracketed class is parsed:
+
+ $, = "\t| ";
+ $a =~ m'[$,]'; # single-quotish: matches '$' or ','
+ $a =~ q{[$,]}' # same
+ $a =~ m/[$,]/; # double-quotish: matches "\t", "|", or " "
Characters that may carry a special meaning inside a character class are:
C<\>, C<^>, C<->, C<[> and C<]>, and are discussed below. They can be
and
C<\x>
are also special and have the same meanings as they do outside a
-bracketed character class. (However, inside a bracketed character
-class, if C<\N{I<NAME>}> expands to a sequence of characters, only the first
-one in the sequence is used, with a warning.)
+bracketed character class.
Also, a backslash followed by two or three octal digits is considered an octal
number.
"+" =~ /[+?*]/ # Match, "+" in a character class is not special.
"\cH" =~ /[\b]/ # Match, \b inside in a character class
# is equivalent to a backspace.
- "]" =~ /[][]/ # Match, as the character class contains.
+ "]" =~ /[][]/ # Match, as the character class contains
# both [ and ].
"[]" =~ /[[]]/ # Match, the pattern contains a character class
- # containing just ], and the character class is
+ # containing just [, and the character class is
# followed by a ].
+=head3 Bracketed Character Classes and the C</xx> pattern modifier
+
+Normally SPACE and TAB characters have no special meaning inside a
+bracketed character class; they are just added to the list of characters
+matched by the class. But if the L<C</xx>|perlre/E<sol>x and E<sol>xx>
+pattern modifier is in effect, they are generally ignored and can be
+added to improve readability. They can't be added in the middle of a
+single construct:
+
+ / [ \x{10 FFFF} ] /xx # WRONG!
+
+The SPACE in the middle of the hex constant is illegal.
+
+To specify a literal SPACE character, you can escape it with a
+backslash, like:
+
+ /[ a e i o u \ ]/xx
+
+This matches the English vowels plus the SPACE character.
+
+For clarity, you should already have been using C<\t> to specify a
+literal tab, and C<\t> is unaffected by C</xx>.
+
=head3 Character Ranges
It is not uncommon to want to match a range of characters. Luckily, instead
# hyphen ('-'), or the letter 'm'.
['-?] # Matches any of the characters '()*+,-./0123456789:;<=>?
# (But not on an EBCDIC platform).
-
+ [\N{APOSTROPHE}-\N{QUESTION MARK}]
+ # Matches any of the characters '()*+,-./0123456789:;<=>?
+ # even on an EBCDIC platform.
+ [\N{U+27}-\N{U+3F}] # Same. (U+27 is "'", and U+3F is "?")
+
+As the final two examples above show, you can achieve portability to
+non-ASCII platforms by using the C<\N{...}> form for the range
+endpoints. These indicate that the specified range is to be interpreted
+using Unicode values, so C<[\N{U+27}-\N{U+3F}]> means to match
+C<\N{U+27}>, C<\N{U+28}>, C<\N{U+29}>, ..., C<\N{U+3D}>, C<\N{U+3E}>,
+and C<\N{U+3F}>, whatever the native code point versions for those are.
+These are called "Unicode" ranges. If either end is of the C<\N{...}>
+form, the range is considered Unicode. A C<regexp> warning is raised
+under C<S<"use re 'strict'">> if the other endpoint is specified
+non-portably:
+
+ [\N{U+00}-\x09] # Warning under re 'strict'; \x09 is non-portable
+ [\N{U+00}-\t] # No warning;
+
+Both of the above match the characters C<\N{U+00}> C<\N{U+01}>, ...
+C<\N{U+08}>, C<\N{U+09}>, but the C<\x09> looks like it could be a
+mistake so the warning is raised (under C<re 'strict'>) for it.
+
+Perl also guarantees that the ranges C<A-Z>, C<a-z>, C<0-9>, and any
+subranges of these match what an English-only speaker would expect them
+to match on any platform. That is, C<[A-Z]> matches the 26 ASCII
+uppercase letters;
+C<[a-z]> matches the 26 lowercase letters; and C<[0-9]> matches the 10
+digits. Subranges, like C<[h-k]>, match correspondingly, in this case
+just the four letters C<"h">, C<"i">, C<"j">, and C<"k">. This is the
+natural behavior on ASCII platforms where the code points (ordinal
+values) for C<"h"> through C<"k"> are consecutive integers (0x68 through
+0x6B). But special handling to achieve this may be needed on platforms
+with a non-ASCII native character set. For example, on EBCDIC
+platforms, the code point for C<"h"> is 0x88, C<"i"> is 0x89, C<"j"> is
+0x91, and C<"k"> is 0x92. Perl specially treats C<[h-k]> to exclude the
+seven code points in the gap: 0x8A through 0x90. This special handling is
+only invoked when the range is a subrange of one of the ASCII uppercase,
+lowercase, and digit ranges, AND each end of the range is expressed
+either as a literal, like C<"A">, or as a named character (C<\N{...}>,
+including the C<\N{U+...> form).
+
+EBCDIC Examples:
+
+ [i-j] # Matches either "i" or "j"
+ [i-\N{LATIN SMALL LETTER J}] # Same
+ [i-\N{U+6A}] # Same
+ [\N{U+69}-\N{U+6A}] # Same
+ [\x{89}-\x{91}] # Matches 0x89 ("i"), 0x8A .. 0x90, 0x91 ("j")
+ [i-\x{91}] # Same
+ [\x{89}-j] # Same
+ [i-J] # Matches, 0x89 ("i") .. 0xC1 ("J"); special
+ # handling doesn't apply because range is mixed
+ # case
=head3 Negation
else don't list it first.
In inverted bracketed character classes, Perl ignores the Unicode rules
-that normally say that certain characters should match a sequence of
-multiple characters under caseless C</i> matching. Following those
-rules could lead to highly confusing situations:
+that normally say that named sequence, and certain characters should
+match a sequence of multiple characters use under caseless C</i>
+matching. Following those rules could lead to highly confusing
+situations:
"ss" =~ /^[^\xDF]+$/ui; # Matches!
says that C<"ss"> is what C<\xDF> matches under C</i>. So which one
"wins"? Do you fail the match because the string has C<ss> or accept it
because it has an C<s> followed by another C<s>? Perl has chosen the
-latter.
+latter. (See note in L</Bracketed Character Classes> above.)
Examples:
X<alpha> X<alnum> X<ascii> X<blank> X<cntrl> X<digit> X<graph>
X<lower> X<print> X<punct> X<space> X<upper> X<word> X<xdigit>
-POSIX character classes have the form C<[:class:]>, where I<class> is
+POSIX character classes have the form C<[:class:]>, where I<class> is the
name, and the C<[:> and C<:]> delimiters. POSIX character classes only appear
I<inside> bracketed character classes, and are a convenient and descriptive
way of listing a group of characters.
The latter pattern would be a character class consisting of a colon,
and the letters C<a>, C<l>, C<p> and C<h>.
+
POSIX character classes can be part of a larger bracketed character class.
For example,
Perl recognizes the following POSIX character classes:
- alpha Any alphabetical character ("[A-Za-z]").
- alnum Any alphanumeric character. ("[A-Za-z0-9]")
+ alpha Any alphabetical character (e.g., [A-Za-z]).
+ alnum Any alphanumeric character (e.g., [A-Za-z0-9]).
ascii Any character in the ASCII character set.
blank A GNU extension, equal to a space or a horizontal tab ("\t").
cntrl Any control character. See Note [2] below.
- digit Any decimal digit ("[0-9]"), equivalent to "\d".
+ digit Any decimal digit (e.g., [0-9]), equivalent to "\d".
graph Any printable character, excluding a space. See Note [3] below.
- lower Any lowercase character ("[a-z]").
+ lower Any lowercase character (e.g., [a-z]).
print Any printable character, including a space. See Note [4] below.
punct Any graphical character excluding "word" characters. Note [5].
- space Any whitespace character. "\s" plus the vertical tab ("\cK").
- upper Any uppercase character ("[A-Z]").
- word A Perl extension ("[A-Za-z0-9_]"), equivalent to "\w".
- xdigit Any hexadecimal digit ("[0-9a-fA-F]").
+ space Any whitespace character. "\s" including the vertical tab
+ ("\cK").
+ upper Any uppercase character (e.g., [A-Z]).
+ word A Perl extension (e.g., [A-Za-z0-9_]), equivalent to "\w".
+ xdigit Any hexadecimal digit (e.g., [0-9a-fA-F]). Note [7].
+
+Like the L<Unicode properties|/Unicode Properties>, most of the POSIX
+properties match the same regardless of whether case-insensitive (C</i>)
+matching is in effect or not. The two exceptions are C<[:upper:]> and
+C<[:lower:]>. Under C</i>, they each match the union of C<[:upper:]> and
+C<[:lower:]>.
Most POSIX character classes have two Unicode-style C<\p> property
counterparts. (They are not official Unicode properties, but Perl extensions
space \p{PosixSpace} \p{XPosixSpace} [6]
upper \p{PosixUpper} \p{XPosixUpper}
word \p{PosixWord} \p{XPosixWord} \w
- xdigit \p{PosixXDigit} \p{XPosixXDigit}
+ xdigit \p{PosixXDigit} \p{XPosixXDigit} [7]
=over 4
Control characters don't produce output as such, but instead usually control
the terminal somehow: for example, newline and backspace are control characters.
-In the ASCII range, characters whose code points are between 0 and 31 inclusive,
-plus 127 (C<DEL>) are control characters.
-
-On EBCDIC platforms, it is likely that the code page will define C<[[:cntrl:]]>
-to be the EBCDIC equivalents of the ASCII controls, plus the controls
-that in Unicode have code pointss from 128 through 159.
+On ASCII platforms, in the ASCII range, characters whose code points are
+between 0 and 31 inclusive, plus 127 (C<DEL>) are control characters; on
+EBCDIC platforms, their counterparts are control characters.
=item [3]
=item [6]
-C<\p{SpacePerl}> and C<\p{Space}> differ only in that in non-locale
-matching, C<\p{Space}> additionally
-matches the vertical tab, C<\cK>. Same for the two ASCII-only range forms.
+C<\p{XPerlSpace}> and C<\p{Space}> match identically starting with Perl
+v5.18. In earlier versions, these differ only in that in non-locale
+matching, C<\p{XPerlSpace}> did not match the vertical tab, C<\cK>.
+Same for the two ASCII-only range forms.
+
+=item [7]
+
+Unlike C<[[:digit:]]> which matches digits in many writing systems, such
+as Thai and Devanagari, there are currently only two sets of hexadecimal
+digits, and it is unlikely that more will be added. This is because you
+not only need the ten digits, but also the six C<[A-F]> (and C<[a-f]>)
+to correspond. That means only the Latin script is suitable for these,
+and Unicode has only two sets of these, the familiar ASCII set, and the
+fullwidth forms starting at U+FF10 (FULLWIDTH DIGIT ZERO).
=back
There are various other synonyms that can be used besides the names
-listed in the table. For example, C<\p{PosixAlpha}> can be written as
+listed in the table. For example, C<\p{XPosixAlpha}> can be written as
C<\p{Alpha}>. All are listed in
-L<perluniprops/Properties accessible through \p{} and \P{}>,
-plus all characters matched by each ASCII-range property.
+L<perluniprops/Properties accessible through \p{} and \P{}>.
Both the C<\p> counterparts always assume Unicode rules are in effect.
On ASCII platforms, this means they assume that the code points from 128
=item if locale rules are in effect ...
-The POSIX class matches according to the locale, except that
-C<word> uses the platform's native underscore character, no matter what
+The POSIX class matches according to the locale, except:
+
+=over
+
+=item C<word>
+
+also includes the platform's native underscore character, no matter what
the locale is.
-=item if Unicode rules are in effect or if on an EBCDIC platform ...
+=item C<ascii>
+
+on platforms that don't have the POSIX C<ascii> extension, this matches
+just the platform's native ASCII-range characters.
+
+=item C<blank>
+
+on platforms that don't have the POSIX C<blank> extension, this matches
+just the platform's native tab and space characters.
+
+=back
+
+=item if, instead, Unicode rules are in effect ...
The POSIX class matches the same as the Full-range counterpart.
Which rules apply are determined as described in
L<perlre/Which character set modifier is in effect?>.
-It is proposed to change this behavior in a future release of Perl so that
-whether or not Unicode rules are in effect would not change the
-behavior: Outside of locale or an EBCDIC code page, the POSIX classes
-would behave like their ASCII-range counterparts. If you wish to
-comment on this proposal, send email to C<perl5-porters@perl.org>.
-
=head4 Negation of POSIX character classes
X<character class, negation>
/[01[:lower:]]/ # Matches a character that is either a
# lowercase letter, or '0' or '1'.
/[[:digit:][:^xdigit:]]/ # Matches a character that can be anything
- # except the letters 'a' to 'f'. This is
- # because the main character class is composed
- # of two POSIX character classes that are ORed
- # together, one that matches any digit, and
- # the other that matches anything that isn't a
- # hex digit. The result matches all
- # characters except the letters 'a' to 'f' and
- # 'A' to 'F'.
+ # except the letters 'a' to 'f' and 'A' to
+ # 'F'. This is because the main character
+ # class is composed of two POSIX character
+ # classes that are ORed together, one that
+ # matches any digit, and the other that
+ # matches anything that isn't a hex digit.
+ # The OR adds the digits, leaving only the
+ # letters 'a' to 'f' and 'A' to 'F' excluded.
+
+=head3 Extended Bracketed Character Classes
+X<character class>
+X<set operations>
+
+This is a fancy bracketed character class that can be used for more
+readable and less error-prone classes, and to perform set operations,
+such as intersection. An example is
+
+ /(?[ \p{Thai} & \p{Digit} ])/
+
+This will match all the digit characters that are in the Thai script.
+
+This is an experimental feature available starting in 5.18, and is
+subject to change as we gain field experience with it. Any attempt to
+use it will raise a warning, unless disabled via
+
+ no warnings "experimental::regex_sets";
+
+Comments on this feature are welcome; send email to
+C<perl5-porters@perl.org>.
+
+The rules used by L<C<use re 'strict>|re/'strict' mode> apply to this
+construct.
+
+We can extend the example above:
+
+ /(?[ ( \p{Thai} + \p{Lao} ) & \p{Digit} ])/
+
+This matches digits that are in either the Thai or Laotian scripts.
+
+Notice the white space in these examples. This construct always has
+the C<E<sol>xx> modifier turned on within it.
+
+The available binary operators are:
+
+ & intersection
+ + union
+ | another name for '+', hence means union
+ - subtraction (the result matches the set consisting of those
+ code points matched by the first operand, excluding any that
+ are also matched by the second operand)
+ ^ symmetric difference (the union minus the intersection). This
+ is like an exclusive or, in that the result is the set of code
+ points that are matched by either, but not both, of the
+ operands.
+
+There is one unary operator:
+
+ ! complement
+
+All the binary operators left associate; C<"&"> is higher precedence
+than the others, which all have equal precedence. The unary operator
+right associates, and has highest precedence. Thus this follows the
+normal Perl precedence rules for logical operators. Use parentheses to
+override the default precedence and associativity.
+
+The main restriction is that everything is a metacharacter. Thus,
+you cannot refer to single characters by doing something like this:
+
+ /(?[ a + b ])/ # Syntax error!
+
+The easiest way to specify an individual typable character is to enclose
+it in brackets:
+
+ /(?[ [a] + [b] ])/
+
+(This is the same thing as C<[ab]>.) You could also have said the
+equivalent:
+
+ /(?[[ a b ]])/
+
+(You can, of course, specify single characters by using, C<\x{...}>,
+C<\N{...}>, etc.)
+
+This last example shows the use of this construct to specify an ordinary
+bracketed character class without additional set operations. Note the
+white space within it. This is allowed because C<E<sol>xx> is
+automatically turned on within this construct.
+
+All the other escapes accepted by normal bracketed character classes are
+accepted here as well.
+
+Because this construct compiles under
+L<C<use re 'strict>|re/'strict' mode>, unrecognized escapes that
+generate warnings in normal classes are fatal errors here, as well as
+all other warnings from these class elements, as well as some
+practices that don't currently warn outside C<re 'strict'>. For example
+you cannot say
+
+ /(?[ [ \xF ] ])/ # Syntax error!
+
+You have to have two hex digits after a braceless C<\x> (use a leading
+zero to make two). These restrictions are to lower the incidence of
+typos causing the class to not match what you thought it would.
+
+If a regular bracketed character class contains a C<\p{}> or C<\P{}> and
+is matched against a non-Unicode code point, a warning may be
+raised, as the result is not Unicode-defined. No such warning will come
+when using this extended form.
+
+The final difference between regular bracketed character classes and
+these, is that it is not possible to get these to match a
+multi-character fold. Thus,
+
+ /(?[ [\xDF] ])/iu
+
+does not match the string C<ss>.
+
+You don't have to enclose POSIX class names inside double brackets,
+hence both of the following work:
+
+ /(?[ [:word:] - [:lower:] ])/
+ /(?[ [[:word:]] - [[:lower:]] ])/
+
+Any contained POSIX character classes, including things like C<\w> and C<\D>
+respect the C<E<sol>a> (and C<E<sol>aa>) modifiers.
+
+Note that C<< (?[ ]) >> is a regex-compile-time construct. Any attempt
+to use something which isn't knowable at the time the containing regular
+expression is compiled is a fatal error. In practice, this means
+just three limitations:
+
+=over 4
+
+=item 1
+
+When compiled within the scope of C<use locale> (or the C<E<sol>l> regex
+modifier), this construct assumes that the execution-time locale will be
+a UTF-8 one, and the generated pattern always uses Unicode rules. What
+gets matched or not thus isn't dependent on the actual runtime locale, so
+tainting is not enabled. But a C<locale> category warning is raised
+if the runtime locale turns out to not be UTF-8.
+
+=item 2
+
+Any
+L<user-defined property|perlunicode/"User-Defined Character Properties">
+used must be already defined by the time the regular expression is
+compiled (but note that this construct can be used instead of such
+properties).
+
+=item 3
+
+A regular expression that otherwise would compile
+using C<E<sol>d> rules, and which uses this construct will instead
+use C<E<sol>u>. Thus this construct tells Perl that you don't want
+C<E<sol>d> rules for the entire regular expression containing it.
+
+=back
+
+Note that skipping white space applies only to the interior of this
+construct. There must not be any space between any of the characters
+that form the initial C<(?[>. Nor may there be space between the
+closing C<])> characters.
+
+Just as in all regular expressions, the pattern can be built up by
+including variables that are interpolated at regex compilation time.
+But its best to compile each sub-component.
+
+ my $thai_or_lao = qr/(?[ \p{Thai} + \p{Lao} ])/;
+ my $lower = qr/(?[ \p{Lower} + \p{Digit} ])/;
+
+When these are embedded in another pattern, what they match does not
+change, regardless of parenthesization or what modifiers are in effect
+in that outer pattern. If you fail to compile the subcomponents, you
+can get some nasty surprises. For example:
+
+ my $thai_or_lao = '\p{Thai} + \p{Lao}';
+ ...
+ qr/(?[ \p{Digit} & $thai_or_lao ])/;
+
+compiles to
+
+ qr/(?[ \p{Digit} & \p{Thai} + \p{Lao} ])/;
+
+But this does not have the effect that someone reading the source code
+would likely expect, as the intersection applies just to C<\p{Thai}>,
+excluding the Laotian. Its best to compile the subcomponents, but you
+could also parenthesize the component pieces:
+
+ my $thai_or_lao = '( \p{Thai} + \p{Lao} )';
+
+But any modifiers will still apply to all the components:
+
+ my $lower = '\p{Lower} + \p{Digit}';
+ qr/(?[ \p{Greek} & $lower ])/i;
+
+matches upper case things. So just, compile the subcomponents, as
+illustrated above.
+
+Due to the way that Perl parses things, your parentheses and brackets
+may need to be balanced, even including comments. If you run into any
+examples, please send them to C<perlbug@perl.org>, so that we can have a
+concrete example for this man page.
+
+We may change it so that things that remain legal uses in normal bracketed
+character classes might become illegal within this experimental
+construct. One proposal, for example, is to forbid adjacent uses of the
+same character, as in C<(?[ [aa] ])>. The motivation for such a change
+is that this usage is likely a typo, as the second "a" adds nothing.