The dot (or period), C<.> is probably the most used, and certainly
the most well-known character class. By default, a dot matches any
character, except for the newline. That default can be changed to
-add matching the newline by using the I<single line> modifier: either
+add matching the newline by using the I<single line> modifier:
for the entire regular expression with the C</s> modifier, or
-locally with C<(?s)>. (The C<\N> backslash sequence, described
+locally with C<(?s)> (and even globally within the scope of
+L<C<use re '/s'>|re/'E<sol>flags' mode>). (The C<L</\N>> backslash
+sequence, described
below, matches any character except newline without regard to the
I<single line> modifier.)
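
As a minimal sketch of the behavior described above (the variable names are only illustrative, and C<\N> assumes Perl v5.12 or later), the following compares the dot with and without the I<single line> modifier:

```perl
use strict;
use warnings;

my $text = "a\nb";

# Each entry records whether the pattern matched (1) or not (0).
my %dot = (
    default   => ($text =~ /\Aa.b\z/     ? 1 : 0),  # dot rejects the newline
    slash_s   => ($text =~ /\Aa.b\z/s    ? 1 : 0),  # /s lets dot match it
    inline_s  => ($text =~ /\Aa(?s).b\z/ ? 1 : 0),  # (?s) applies from here on
    N_under_s => ($text =~ /\Aa\Nb\z/s   ? 1 : 0),  # \N ignores /s
);

printf "%-10s %d\n", $_, $dot{$_} for sort keys %dot;
```

Only the C</s> and C<(?s)> variants should report a match here.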
Otherwise, it
matches anything that is matched by C<\p{Digit}>, which includes [0-9].
(An unlikely possible exception is that under locale matching rules, the
-current locale might not have [0-9] matched by C<\d>, and/or might match
-other characters whose code point is less than 256. Such a locale
-definition would be in violation of the C language standard, but Perl
-doesn't currently assume anything in regard to this.)
+current locale might not have C<[0-9]> matched by C<\d>, and/or might match
+other characters whose code point is less than 256. The only such locale
+definitions that are legal would be to match C<[0-9]> plus another set of
+10 consecutive digit characters; anything else would be in violation of
+the C language standard, but Perl doesn't currently assume anything in
+regard to this.)
What this means is that unless the C</a> modifier is in effect C<\d> not
only matches the digits '0' - '9', but also Arabic, Devanagari, and
Some digits that C<\d> matches look like some of the [0-9] ones, but
have different values. For example, BENGALI DIGIT FOUR (U+09EA) looks
-very much like an ASCII DIGIT EIGHT (U+0038). An application that
+very much like an ASCII DIGIT EIGHT (U+0038), and LEPCHA DIGIT SIX
+(U+1C46) looks very much like an ASCII DIGIT FIVE (U+0035). An
+application that
is expecting only the ASCII digits might be misled, or if the match is
C<\d+>, the matched string might contain a mixture of digits from
different writing systems that look like they signify a number different
than they actually do. L<Unicode::UCD/num()> can
be used to safely
calculate the value, returning C<undef> if the input string contains
-such a mixture.
+such a mixture. Otherwise, for example, a displayed price might be
+deliberately different than it appears.
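
The following sketch illustrates the point about look-alike digits, assuming Perl v5.14 or later (for both C</a> and C<Unicode::UCD::num()>). The variable names are hypothetical:

```perl
use strict;
use warnings;
use Unicode::UCD qw(num);

my $bengali_four = "\x{09EA}";           # BENGALI DIGIT FOUR
my $mixed        = "1" . $bengali_four;  # ASCII digit, then a Bengali digit

# \d accepts the Bengali digit under Unicode rules, but not under /a.
my $matches_d   = $bengali_four =~ /\A\d\z/  ? 1 : 0;
my $matches_d_a = $bengali_four =~ /\A\d\z/a ? 1 : 0;

# num() yields the numeric value of a single-script digit string, and
# undef when the string mixes digits from different writing systems.
my $value       = num($bengali_four);
my $mixed_value = num($mixed);

print "value: $value\n";
print "mixed string is ", (defined $mixed_value ? "accepted" : "rejected"), "\n";
```

Checking the return value of C<num()> for C<undef> is one way an application could reject such mixed strings before using them as numbers.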
What C<\p{Digit}> means (and hence C<\d> except under the C</a>
modifier) is C<\p{General_Category=Decimal_Number}>, or synonymously,
The Tamil digits (U+0BE6 - U+0BEF) can also legally be
used in old-style Tamil numbers in which they would appear no more than
one in a row, separated by characters that mean "times 10", "times 100",
-etc. (See L<http://www.unicode.org/notes/tn21>.)
+etc. (See L<https://www.unicode.org/notes/tn21>.)
Any character not matched by C<\d> is matched by C<\D>.
C<\w> matches the platform's native underscore character plus whatever
the locale considers to be alphanumeric.
-=item if Unicode rules are in effect ...
+=item if, instead, Unicode rules are in effect ...
C<\w> matches exactly what C<\p{Word}> matches.
In all Perl versions, C<\s> matches the 5 characters [\t\n\f\r ]; that
is, the horizontal tab,
the newline, the form feed, the carriage return, and the space.
-Starting in Perl v5.18, experimentally, it also matches the vertical tab, C<\cK>.
+Starting in Perl v5.18, it also matches the vertical tab, C<\cK>.
See note C<[1]> below for a discussion of this.
=item otherwise ...
C<\s> matches whatever the locale considers to be whitespace.
-=item if Unicode rules are in effect ...
+=item if, instead, Unicode rules are in effect ...
C<\s> matches exactly the characters shown with an "s" column in the
table below.
=item otherwise ...
-C<\s> matches [\t\n\f\r\cK ] and, starting, experimentally in Perl
+C<\s> matches [\t\n\f\r ] and, starting in Perl
v5.18, the vertical tab, C<\cK>.
(See note C<[1]> below for a discussion of this.)
Note that this list doesn't include the non-breaking space.
locale that may otherwise be in use.
C<\R> matches anything that can be considered a newline under Unicode
-rules. It's not a character class, as it can match a multi-character
-sequence. Therefore, it cannot be used inside a bracketed character
-class; use C<\v> instead (vertical whitespace). It uses the platform's
+rules. It can match a multi-character sequence. It cannot be used inside
+a bracketed character class; use C<\v> instead (vertical whitespace).
+It uses the platform's
native character set, and does not consider any locale that may
otherwise be in use.
Details are discussed in L<perlrebackslash>.
vertical tab (C<"\cK">) was not matched by C<\s>.
The following table is a complete listing of characters matched by
-C<\s>, C<\h> and C<\v> as of Unicode 6.0.
+C<\s>, C<\h> and C<\v> as of Unicode 6.3.
The first column gives the Unicode code point of the character (in hex format),
the second column gives the (Unicode) name. The third column indicates
0x0085 NEXT LINE (NEL) vs [2]
0x00a0 NO-BREAK SPACE h s [2]
0x1680 OGHAM SPACE MARK h s
- 0x180e MONGOLIAN VOWEL SEPARATOR h s
0x2000 EN QUAD h s
0x2001 EM QUAD h s
0x2002 EN SPACE h s
=item [1]
-Prior to Perl v5.18, C<\s> did not match the vertical tab. The change
-in v5.18 is considered an experiment, which means it could be backed out
-in v5.20 or v5.22 if experience indicates that it breaks too much
-existing code. If this change adversely affects you, send email to
-C<perlbug@perl.org>; if it affects you positively, email
-C<perlthanks@perl.org>. In the meantime, C<[^\S\cK]> (obscurely)
-matches what C<\s> traditionally did.
+Prior to Perl v5.18, C<\s> did not match the vertical tab.
+C<[^\S\cK]> (obscurely) matches what C<\s> traditionally did.
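
A short sketch of the difference, assuming Perl v5.18 or later (the variable names are illustrative only):

```perl
use strict;
use warnings;

my $vt = "\cK";                 # vertical tab

# The class recommended above for the pre-5.18 meaning of \s.
my $traditional_s = qr/[^\S\cK]/;

my $s_matches_vt      = $vt  =~ /\A\s\z/              ? 1 : 0;
my $old_s_matches_vt  = $vt  =~ /\A$traditional_s\z/  ? 1 : 0;
my $old_s_matches_tab = "\t" =~ /\A$traditional_s\z/  ? 1 : 0;

print "\\s matches VT: $s_matches_vt\n";
print "[^\\S\\cK] matches VT: $old_s_matches_vt, TAB: $old_s_matches_tab\n";
```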
=item [2]
It matches a two character string: a letter (Unicode property C<\pL>),
followed by a lowercase C<l>.
-If locale rules are not in effect, the use of
-a Unicode property will force the regular expression into using Unicode
-rules, if it isn't already.
+What a Unicode property matches is never subject to locale rules, and
+if locale rules are not otherwise in effect, the use of a Unicode
+property will force the regular expression into using Unicode rules, if
+it isn't already.
Note that almost all properties are immune to case-insensitive matching.
That is, adding a C</i> regular expression modifier does not change what
-they match. There are two sets that are affected. The first set is
+they match. But there are two sets that are affected. The first set is
C<Uppercase_Letter>,
C<Lowercase_Letter>,
and C<Titlecase_Letter>,
L<perlunicode/User-Defined Character Properties>.
Unicode properties are defined (surprise!) only on Unicode code points.
-A warning is raised and all matches fail on non-Unicode code points
-(those above the legal Unicode maximum of 0x10FFFF). This can be
-somewhat surprising,
+Starting in v5.20, when matching against C<\p> and C<\P>, Perl treats
+non-Unicode code points (those above the legal Unicode maximum of
+0x10FFFF) as if they were typical unassigned Unicode code points.
- chr(0x110000) =~ \p{ASCII_Hex_Digit=True} # Fails.
- chr(0x110000) =~ \p{ASCII_Hex_Digit=False} # Also fails!
+Prior to v5.20, Perl raised a warning and made all matches fail on
+non-Unicode code points. This could be somewhat surprising:
-Even though these two matches might be thought of as complements, they
-are so only on Unicode code points.
+ chr(0x110000) =~ /\p{ASCII_Hex_Digit=True}/ # Fails on Perls < v5.20.
+ chr(0x110000) =~ /\p{ASCII_Hex_Digit=False}/ # Also fails on Perls
+ # < v5.20
+
+Even though these two matches might be thought of as complements, until
+v5.20 they were so only on Unicode code points.
+
+Starting in perl v5.30, wildcards are allowed in Unicode property
+values. See L<perlunicode/Wildcards in Property Values>.
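
A runnable sketch of the v5.20 behavior follows. It assumes Perl v5.20 or later, and it disables the C<non_unicode> warning category only to keep the output quiet. The variable names are hypothetical:

```perl
use strict;
use warnings;
no warnings 'non_unicode';   # silence above-Unicode code point warnings

my $above_unicode = chr(0x110000);   # one past the Unicode maximum

# Treated like a typical unassigned code point: it is not a hex digit,
# so the "=False" form matches and the two tests are true complements.
my $is_hex_digit  = $above_unicode =~ /\p{ASCII_Hex_Digit=True}/  ? 1 : 0;
my $not_hex_digit = $above_unicode =~ /\p{ASCII_Hex_Digit=False}/ ? 1 : 0;

print "True: $is_hex_digit, False: $not_hex_digit\n";
```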
=head4 Examples
-------
-* There is an exception to a bracketed character class matching a
-single character only. When the class is to match caselessly under C</i>
-matching rules, and a character that is explicitly mentioned inside the
-class matches a
+* There are two exceptions to a bracketed character class matching a
+single character only. Each requires special handling by Perl to make
+things work:
+
+=over
+
+=item *
+
+When the class is to match caselessly under C</i> matching rules, and a
+character that is explicitly mentioned inside the class matches a
multiple-character sequence caselessly under Unicode rules, the class
-(when not L<inverted|/Negation>) will also match that sequence. For
-example, Unicode says that the letter C<LATIN SMALL LETTER SHARP S>
-should match the sequence C<ss> under C</i> rules. Thus,
+will also match that sequence. For example, Unicode says that the
+letter C<LATIN SMALL LETTER SHARP S> should match the sequence C<ss>
+under C</i> rules. Thus,
'ss' =~ /\A\N{LATIN SMALL LETTER SHARP S}\z/i # Matches
'ss' =~ /\A[aeioust\N{LATIN SMALL LETTER SHARP S}]\z/i # Matches
-For this to happen, the character must be explicitly specified, and not
-be part of a multi-character range (not even as one of its endpoints).
-(L</Character Ranges> will be explained shortly.) Therefore,
+For this to happen, the class must not be inverted (see L</Negation>)
+and the character must be explicitly specified, and not be part of a
+multi-character range (not even as one of its endpoints). (L</Character
+Ranges> will be explained shortly.) Therefore,
- 'ss' =~ /\A[\0-\x{ff}]\z/i # Doesn't match
- 'ss' =~ /\A[\0-\N{LATIN SMALL LETTER SHARP S}]\z/i # No match
- 'ss' =~ /\A[\xDF-\xDF]\z/i # Matches on ASCII platforms, since \XDF
- # is LATIN SMALL LETTER SHARP S, and the
- # range is just a single element
+ 'ss' =~ /\A[\0-\x{ff}]\z/ui # Doesn't match
+ 'ss' =~ /\A[\0-\N{LATIN SMALL LETTER SHARP S}]\z/ui # No match
+ 'ss' =~ /\A[\xDF-\xDF]\z/ui # Matches on ASCII platforms, since
+ # \xDF is LATIN SMALL LETTER SHARP S,
+ # and the range is just a single
+ # element
Note that it isn't a good idea to specify these types of ranges anyway.
+=item *
+
+Some names known to C<\N{...}> refer to a sequence of multiple characters,
+instead of the usual single character. When one of these is included in
+the class, the entire sequence is matched. For example,
+
+ "\N{TAMIL LETTER KA}\N{TAMIL VOWEL SIGN AU}"
+ =~ / ^ [\N{TAMIL SYLLABLE KAU}] $ /x;
+
+matches, because C<\N{TAMIL SYLLABLE KAU}> is a named sequence
+consisting of the two characters matched against. Like the other
+instance where a bracketed class can match multiple characters, and for
+similar reasons, the class must not be inverted, and the named sequence
+may not appear in a range, even one where it is both endpoints. If
+these happen, it is a fatal error if the character class is within the
+scope of L<C<use re 'strict'>|re/'strict' mode>, or within an extended
+L<C<(?[...])>|/Extended Bracketed Character Classes> class; otherwise
+only the first code point is used (with a C<regexp>-type warning
+raised).
+
+=back
+
=head3 Special Characters Inside a Bracketed Character Class
Most characters that are meta characters in regular expressions (that
their special meaning and can be used inside a character class without
the need to escape them. For instance, C<[()]> matches either an opening
parenthesis, or a closing parenthesis, and the parens inside the character
-class don't group or capture.
+class don't group or capture. Be aware that, unless the pattern is
+evaluated in single-quotish context, variable interpolation will take
+place before the bracketed class is parsed:
+
+ $, = "\t| ";
+ $a =~ m'[$,]'; # single-quotish: matches '$' or ','
+ $a =~ q{[$,]}; # same
+ $a =~ m/[$,]/; # double-quotish: matches "\t", "|", or " "
Characters that may carry a special meaning inside a character class are:
C<\>, C<^>, C<->, C<[> and C<]>, and are discussed below. They can be
and
C<\x>
are also special and have the same meanings as they do outside a
-bracketed character class. (However, inside a bracketed character
-class, if C<\N{I<NAME>}> expands to a sequence of characters, only the first
-one in the sequence is used, with a warning.)
+bracketed character class.
Also, a backslash followed by two or three octal digits is considered an octal
number.
Examples:
"+" =~ /[+?*]/ # Match, "+" in a character class is not special.
- "\cH" =~ /[\b]/ # Match, \b inside in a character class.
+ "\cH" =~ /[\b]/ # Match, \b inside a character class
# is equivalent to a backspace.
- "]" =~ /[][]/ # Match, as the character class contains.
+ "]" =~ /[][]/ # Match, as the character class contains
# both [ and ].
"[]" =~ /[[]]/ # Match, the pattern contains a character class
- # containing just ], and the character class is
+ # containing just [, and the character class is
# followed by a ].
+=head3 Bracketed Character Classes and the C</xx> pattern modifier
+
+Normally SPACE and TAB characters have no special meaning inside a
+bracketed character class; they are just added to the list of characters
+matched by the class. But if the L<C</xx>|perlre/E<sol>x and E<sol>xx>
+pattern modifier is in effect, they are generally ignored and can be
+added to improve readability. They can't be added in the middle of a
+single construct:
+
+ / [ \x{10 FFFF} ] /xx # WRONG!
+
+The SPACE in the middle of the hex constant is illegal.
+
+To specify a literal SPACE character, you can escape it with a
+backslash, like:
+
+ /[ a e i o u \ ]/xx
+
+This matches the English vowels plus the SPACE character.
+
+For clarity, you should already have been using C<\t> to specify a
+literal tab, and C<\t> is unaffected by C</xx>.
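
As a small sketch of the rules above, assuming Perl v5.26 or later (where C</xx> is available) and using illustrative variable names:

```perl
use v5.26;
use warnings;

# Under /xx the blanks inside the class are ignored ...
my $vowels          = qr/ [ a e i o u ] /xx;
# ... unless a blank is escaped with a backslash.
my $vowels_or_space = qr/ [ a e i o u \  ] /xx;

my $e_is_vowel            = "e" =~ /\A$vowels\z/          ? 1 : 0;
my $space_in_plain_class  = " " =~ /\A$vowels\z/          ? 1 : 0;
my $space_in_escaped_form = " " =~ /\A$vowels_or_space\z/ ? 1 : 0;

say "e: $e_is_vowel, ' ' unescaped: $space_in_plain_class, ",
    "' ' escaped: $space_in_escaped_form";
```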
+
=head3 Character Ranges
It is not uncommon to want to match a range of characters. Luckily, instead
# hyphen ('-'), or the letter 'm'.
['-?] # Matches any of the characters '()*+,-./0123456789:;<=>?
# (But not on an EBCDIC platform).
-
+ [\N{APOSTROPHE}-\N{QUESTION MARK}]
+ # Matches any of the characters '()*+,-./0123456789:;<=>?
+ # even on an EBCDIC platform.
+ [\N{U+27}-\N{U+3F}] # Same. (U+27 is "'", and U+3F is "?")
+
+As the final two examples above show, you can achieve portability to
+non-ASCII platforms by using the C<\N{...}> form for the range
+endpoints. These indicate that the specified range is to be interpreted
+using Unicode values, so C<[\N{U+27}-\N{U+3F}]> means to match
+C<\N{U+27}>, C<\N{U+28}>, C<\N{U+29}>, ..., C<\N{U+3D}>, C<\N{U+3E}>,
+and C<\N{U+3F}>, whatever the native code point versions for those are.
+These are called "Unicode" ranges. If either end is of the C<\N{...}>
+form, the range is considered Unicode. A C<regexp> warning is raised
+under C<S<"use re 'strict'">> if the other endpoint is specified
+non-portably:
+
+ [\N{U+00}-\x09] # Warning under re 'strict'; \x09 is non-portable
+ [\N{U+00}-\t] # No warning;
+
+Both of the above match the characters C<\N{U+00}>, C<\N{U+01}>, ...
+C<\N{U+08}>, C<\N{U+09}>, but the C<\x09> looks like it could be a
+mistake so the warning is raised (under C<re 'strict'>) for it.
+
+Perl also guarantees that the ranges C<A-Z>, C<a-z>, C<0-9>, and any
+subranges of these match what an English-only speaker would expect them
+to match on any platform. That is, C<[A-Z]> matches the 26 ASCII
+uppercase letters;
+C<[a-z]> matches the 26 lowercase letters; and C<[0-9]> matches the 10
+digits. Subranges, like C<[h-k]>, match correspondingly, in this case
+just the four letters C<"h">, C<"i">, C<"j">, and C<"k">. This is the
+natural behavior on ASCII platforms where the code points (ordinal
+values) for C<"h"> through C<"k"> are consecutive integers (0x68 through
+0x6B). But special handling to achieve this may be needed on platforms
+with a non-ASCII native character set. For example, on EBCDIC
+platforms, the code point for C<"h"> is 0x88, C<"i"> is 0x89, C<"j"> is
+0x91, and C<"k"> is 0x92. Perl specially treats C<[h-k]> to exclude the
+seven code points in the gap: 0x8A through 0x90. This special handling is
+only invoked when the range is a subrange of one of the ASCII uppercase,
+lowercase, and digit ranges, AND each end of the range is expressed
+either as a literal, like C<"A">, or as a named character (C<\N{...}>,
+including the C<\N{U+...}> form).
+
+EBCDIC Examples:
+
+ [i-j] # Matches either "i" or "j"
+ [i-\N{LATIN SMALL LETTER J}] # Same
+ [i-\N{U+6A}] # Same
+ [\N{U+69}-\N{U+6A}] # Same
+ [\x{89}-\x{91}] # Matches 0x89 ("i"), 0x8A .. 0x90, 0x91 ("j")
+ [i-\x{91}] # Same
+ [\x{89}-j] # Same
+ [i-J] # Matches 0x89 ("i") .. 0xD1 ("J"); special
+ # handling doesn't apply because range is mixed
+ # case
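
The portability guarantee can be checked with a sketch like the one below. The variable names are illustrative, and the expectations in the comments follow from the text above rather than from testing on an EBCDIC machine:

```perl
use strict;
use warnings;

# Collect every native character in 0..255 that [h-k] accepts.  Per the
# guarantee above this should be just the four letters on any platform.
my $h_to_k = join '', grep { /[h-k]/ } map { chr } 0 .. 255;

# A "Unicode" range: both endpoints are given in \N{U+...} form.
my $apostrophe_to_question = qr/[\N{U+27}-\N{U+3F}]/;

my $five_in_range = '5' =~ $apostrophe_to_question ? 1 : 0;  # U+35 is inside
my $at_in_range   = '@' =~ $apostrophe_to_question ? 1 : 0;  # U+40 is outside

print "[h-k] matches: $h_to_k\n";
print "'5' in range: $five_in_range, '\@' in range: $at_in_range\n";
```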
=head3 Negation
else don't list it first.
In inverted bracketed character classes, Perl ignores the Unicode rules
-that normally say that certain characters should match a sequence of
-multiple characters under caseless C</i> matching. Following those
-rules could lead to highly confusing situations:
+that normally say that named sequences, and certain characters under
+caseless C</i> matching, should match a sequence of multiple
+characters. Following those rules could lead to highly confusing
+situations:
"ss" =~ /^[^\xDF]+$/ui; # Matches!
says that C<"ss"> is what C<\xDF> matches under C</i>. So which one
"wins"? Do you fail the match because the string has C<ss> or accept it
because it has an C<s> followed by another C<s>? Perl has chosen the
-latter.
+latter. (See note in L</Bracketed Character Classes> above.)
Examples:
X<alpha> X<alnum> X<ascii> X<blank> X<cntrl> X<digit> X<graph>
X<lower> X<print> X<punct> X<space> X<upper> X<word> X<xdigit>
-POSIX character classes have the form C<[:class:]>, where I<class> is
+POSIX character classes have the form C<[:class:]>, where I<class> is the
name, and the C<[:> and C<:]> delimiters. POSIX character classes only appear
I<inside> bracketed character classes, and are a convenient and descriptive
way of listing a group of characters.
The latter pattern would be a character class consisting of a colon,
and the letters C<a>, C<l>, C<p> and C<h>.
+
POSIX character classes can be part of a larger bracketed character class.
For example,
Perl recognizes the following POSIX character classes:
- alpha Any alphabetical character ("[A-Za-z]").
- alnum Any alphanumeric character ("[A-Za-z0-9]").
+ alpha Any alphabetical character (e.g., [A-Za-z]).
+ alnum Any alphanumeric character (e.g., [A-Za-z0-9]).
ascii Any character in the ASCII character set.
blank A GNU extension, equal to a space or a horizontal tab ("\t").
cntrl Any control character. See Note [2] below.
- digit Any decimal digit ("[0-9]"), equivalent to "\d".
+ digit Any decimal digit (e.g., [0-9]), equivalent to "\d".
graph Any printable character, excluding a space. See Note [3] below.
- lower Any lowercase character ("[a-z]").
+ lower Any lowercase character (e.g., [a-z]).
print Any printable character, including a space. See Note [4] below.
punct Any graphical character excluding "word" characters. Note [5].
space Any whitespace character. "\s" including the vertical tab
("\cK").
- upper Any uppercase character ("[A-Z]").
- word A Perl extension ("[A-Za-z0-9_]"), equivalent to "\w".
- xdigit Any hexadecimal digit ("[0-9a-fA-F]").
+ upper Any uppercase character (e.g., [A-Z]).
+ word A Perl extension (e.g., [A-Za-z0-9_]), equivalent to "\w".
+ xdigit Any hexadecimal digit (e.g., [0-9a-fA-F]). Note [7].
+
+Like the L<Unicode properties|/Unicode Properties>, most of the POSIX
+properties match the same regardless of whether case-insensitive (C</i>)
+matching is in effect or not. The two exceptions are C<[:upper:]> and
+C<[:lower:]>. Under C</i>, they each match the union of C<[:upper:]> and
+C<[:lower:]>.
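
A minimal sketch of this effect, with hypothetical variable names:

```perl
use strict;
use warnings;

# Without /i, [:upper:] rejects a lowercase letter ...
my $upper_plain = "a" =~ /\A[[:upper:]]\z/  ? 1 : 0;
# ... but under /i it behaves like the union of [:upper:] and [:lower:].
my $upper_fold  = "a" =~ /\A[[:upper:]]\z/i ? 1 : 0;
my $lower_fold  = "A" =~ /\A[[:lower:]]\z/i ? 1 : 0;
# Other POSIX classes, such as [:digit:], are unaffected by /i.
my $digit_fold  = "a" =~ /\A[[:digit:]]\z/i ? 1 : 0;

print "upper: $upper_plain, upper/i: $upper_fold, ",
      "lower/i: $lower_fold, digit/i: $digit_fold\n";
```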
Most POSIX character classes have two Unicode-style C<\p> property
counterparts. (They are not official Unicode properties, but Perl extensions
space \p{PosixSpace} \p{XPosixSpace} [6]
upper \p{PosixUpper} \p{XPosixUpper}
word \p{PosixWord} \p{XPosixWord} \w
- xdigit \p{PosixXDigit} \p{XPosixXDigit}
+ xdigit \p{PosixXDigit} \p{XPosixXDigit} [7]
=over 4
Control characters don't produce output as such, but instead usually control
the terminal somehow: for example, newline and backspace are control characters.
-In the ASCII range, characters whose code points are between 0 and 31 inclusive,
-plus 127 (C<DEL>) are control characters.
+On ASCII platforms, in the ASCII range, characters whose code points are
+between 0 and 31 inclusive, plus 127 (C<DEL>) are control characters; on
+EBCDIC platforms, their counterparts are control characters.
=item [3]
=item [6]
-C<\p{SpacePerl}> and C<\p{Space}> match identically starting with Perl
+C<\p{XPerlSpace}> and C<\p{Space}> match identically starting with Perl
v5.18. In earlier versions, these differ only in that in non-locale
-matching, C<\p{SpacePerl}> does not match the vertical tab, C<\cK>.
+matching, C<\p{XPerlSpace}> did not match the vertical tab, C<\cK>.
Same for the two ASCII-only range forms.
+=item [7]
+
+Unlike C<[[:digit:]]> which matches digits in many writing systems, such
+as Thai and Devanagari, there are currently only two sets of hexadecimal
+digits, and it is unlikely that more will be added. This is because you
+not only need the ten digits, but also the six C<[A-F]> (and C<[a-f]>)
+to correspond. That means only the Latin script is suitable for these,
+and Unicode has only two sets of these, the familiar ASCII set, and the
+fullwidth forms starting at U+FF10 (FULLWIDTH DIGIT ZERO).
+
=back
There are various other synonyms that can be used besides the names
-listed in the table. For example, C<\p{PosixAlpha}> can be written as
+listed in the table. For example, C<\p{XPosixAlpha}> can be written as
C<\p{Alpha}>. All are listed in
-L<perluniprops/Properties accessible through \p{} and \P{}>,
-plus all characters matched by each ASCII-range property.
+L<perluniprops/Properties accessible through \p{} and \P{}>.
Both the C<\p> counterparts always assume Unicode rules are in effect.
On ASCII platforms, this means they assume that the code points from 128
=item if locale rules are in effect ...
-The POSIX class matches according to the locale, except that
-C<word> uses the platform's native underscore character, no matter what
+The POSIX class matches according to the locale, except:
+
+=over
+
+=item C<word>
+
+also includes the platform's native underscore character, no matter what
the locale is.
-=item if Unicode rules are in effect ...
+=item C<ascii>
+
+on platforms that don't have the POSIX C<ascii> extension, this matches
+just the platform's native ASCII-range characters.
+
+=item C<blank>
+
+on platforms that don't have the POSIX C<blank> extension, this matches
+just the platform's native tab and space characters.
+
+=back
+
+=item if, instead, Unicode rules are in effect ...
The POSIX class matches the same as the Full-range counterpart.
Which rules apply are determined as described in
L<perlre/Which character set modifier is in effect?>.
-It is proposed to change this behavior in a future release of Perl so that
-whether or not Unicode rules are in effect would not change the
-behavior: Outside of locale, the POSIX classes
-would behave like their ASCII-range counterparts. If you wish to
-comment on this proposal, send email to C<perl5-porters@perl.org>.
-
=head4 Negation of POSIX character classes
X<character class, negation>
Comments on this feature are welcome; send email to
C<perl5-porters@perl.org>.
+The rules used by L<C<use re 'strict'>|re/'strict' mode> apply to this
+construct.
+
We can extend the example above:
/(?[ ( \p{Thai} + \p{Lao} ) & \p{Digit} ])/
This matches digits that are in either the Thai or Laotian scripts.
Notice the white space in these examples. This construct always has
-the C<E<sol>x> modifier turned on.
+the C<E<sol>xx> modifier turned on within it.
The available binary operators are:
! complement
-All the binary operators left associate, and are of equal precedence.
-The unary operator right associates, and has higher precedence. Use
-parentheses to override the default associations. Some feedback we've
-received indicates a desire for intersection to have higher precedence
-than union. This is something that feedback from the field may cause us
-to change in future releases; you may want to parenthesize copiously to
-avoid such changes affecting your code, until this feature is no longer
-considered experimental.
+All the binary operators left associate; C<"&"> is higher precedence
+than the others, which all have equal precedence. The unary operator
+right associates, and has highest precedence. Thus this follows the
+normal Perl precedence rules for logical operators. Use parentheses to
+override the default precedence and associativity.
The main restriction is that everything is a metacharacter. Thus,
you cannot refer to single characters by doing something like this:
This last example shows the use of this construct to specify an ordinary
bracketed character class without additional set operations. Note the
-white space within it; C<E<sol>x> is turned on even within bracketed
-character classes, except you can't have comments inside them. Hence,
-
- (?[ [#] ])
-
-matches the literal character "#". To specify a literal white space character,
-you can escape it with a backslash, like:
+white space within it. This is allowed because C<E<sol>xx> is
+automatically turned on within this construct.
- /(?[ [ a e i o u \ ] ])/
-
-This matches the English vowels plus the SPACE character.
All the other escapes accepted by normal bracketed character classes are
-accepted here as well; but unrecognized escapes that generate warnings
-in normal classes are fatal errors here.
+accepted here as well.
-All warnings from these class elements are fatal, as well as some
-practices that don't currently warn. For example you cannot say
+Because this construct compiles under
+L<C<use re 'strict'>|re/'strict' mode>, unrecognized escapes that
+generate warnings in normal classes are fatal errors here, as well as
+all other warnings from these class elements, as well as some
+practices that don't currently warn outside C<re 'strict'>. For example
+you cannot say
/(?[ [ \xF ] ])/ # Syntax error!
zero to make two). These restrictions are to lower the incidence of
typos causing the class to not match what you thought it would.
+If a regular bracketed character class contains a C<\p{}> or C<\P{}> and
+is matched against a non-Unicode code point, a warning may be
+raised, as the result is not Unicode-defined. No such warning will come
+when using this extended form.
+
The final difference between regular bracketed character classes and
these, is that it is not possible to get these to match a
multi-character fold. Thus,
Any contained POSIX character classes, including things like C<\w> and C<\D>
respect the C<E<sol>a> (and C<E<sol>aa>) modifiers.
-C<< (?[ ]) >> is a regex-compile-time construct. Any attempt to use
-something which isn't knowable at the time the containing regular
+Note that C<< (?[ ]) >> is a regex-compile-time construct. Any attempt
+to use something which isn't knowable at the time the containing regular
expression is compiled is a fatal error. In practice, this means
just three limitations:
=item 1
-This construct cannot be used within the scope of
-C<use locale> (or the C<E<sol>l> regex modifier).
+When compiled within the scope of C<use locale> (or the C<E<sol>l> regex
+modifier), this construct assumes that the execution-time locale will be
+a UTF-8 one, and the generated pattern always uses Unicode rules. What
+gets matched or not thus isn't dependent on the actual runtime locale, so
+tainting is not enabled. But a C<locale> category warning is raised
+if the runtime locale turns out to not be UTF-8.
=item 2
=back
-The C<E<sol>x> processing within this class is an extended form.
-Besides the characters that are considered white space in normal C</x>
-processing, there are 5 others, recommended by the Unicode standard:
-
- U+0085 NEXT LINE
- U+200E LEFT-TO-RIGHT MARK
- U+200F RIGHT-TO-LEFT MARK
- U+2028 LINE SEPARATOR
- U+2029 PARAGRAPH SEPARATOR
-
Note that skipping white space applies only to the interior of this
construct. There must not be any space between any of the characters
that form the initial C<(?[>. Nor may there be space between the
Just as in all regular expressions, the pattern can be built up by
including variables that are interpolated at regex compilation time.
-Care must be taken to ensure that you are getting what you expect. For
-example:
+But it's best to compile each sub-component.
+
+ my $thai_or_lao = qr/(?[ \p{Thai} + \p{Lao} ])/;
+ my $lower = qr/(?[ \p{Lower} + \p{Digit} ])/;
+
+When these are embedded in another pattern, what they match does not
+change, regardless of parenthesization or what modifiers are in effect
+in that outer pattern. If you fail to compile the subcomponents, you
+can get some nasty surprises. For example:
my $thai_or_lao = '\p{Thai} + \p{Lao}';
...
qr/(?[ \p{Digit} & \p{Thai} + \p{Lao} ])/;
-But this does not have the effect that someone reading the code would
-likely expect, as the intersection applies just to C<\p{Thai}>,
-excluding the Laotian. Pitfalls like this can be avoided by
-parenthesizing the component pieces:
+But this does not have the effect that someone reading the source code
+would likely expect, as the intersection applies just to C<\p{Thai}>,
+excluding the Laotian. It's best to compile the subcomponents, but you
+could also parenthesize the component pieces:
my $thai_or_lao = '( \p{Thai} + \p{Lao} )';
my $lower = '\p{Lower} + \p{Digit}';
qr/(?[ \p{Greek} & $lower ])/i;
-matches upper case things. You can avoid surprises by making the
-components into instances of this construct by compiling them:
-
- my $thai_or_lao = qr/(?[ \p{Thai} + \p{Lao} ])/;
- my $lower = qr/(?[ \p{Lower} + \p{Digit} ])/;
-
-When these are embedded in another pattern, what they match does not
-change, regardless of parenthesization or what modifiers are in effect
-in that outer pattern.
+matches upper case things. So just compile the subcomponents, as
+illustrated above.
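
The recommendation to compile subcomponents can be sketched as follows. This assumes Perl v5.18 or later, and it disables the C<experimental::regex_sets> warning for Perls where the construct is still experimental. The variable names other than C<$thai_or_lao> are hypothetical:

```perl
use strict;
use warnings;
no warnings 'experimental::regex_sets';

# Compile the union once; it keeps its meaning wherever it is embedded.
my $thai_or_lao       = qr/(?[ \p{Thai} + \p{Lao} ])/;
my $thai_or_lao_digit = qr/(?[ \p{Digit} & $thai_or_lao ])/;

my $thai_digit  = "\x{0E51}";   # THAI DIGIT ONE
my $lao_digit   = "\x{0ED1}";   # LAO DIGIT ONE
my $thai_letter = "\x{0E01}";   # THAI CHARACTER KO KAI

my $thai_digit_ok  = $thai_digit  =~ /\A$thai_or_lao_digit\z/ ? 1 : 0;
my $lao_digit_ok   = $lao_digit   =~ /\A$thai_or_lao_digit\z/ ? 1 : 0;
my $thai_letter_ok = $thai_letter =~ /\A$thai_or_lao_digit\z/ ? 1 : 0;
my $ascii_digit_ok = "1"          =~ /\A$thai_or_lao_digit\z/ ? 1 : 0;

print "Thai digit: $thai_digit_ok, Lao digit: $lao_digit_ok, ",
      "Thai letter: $thai_letter_ok, ASCII digit: $ascii_digit_ok\n";
```

Because the union is compiled as its own unit, the intersection applies to both scripts, so digits from either script should match while letters and ASCII digits should not.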
Due to the way that Perl parses things, your parentheses and brackets
may need to be balanced, even including comments. If you run into any
-examples, please send them to C<perlbug@perl.org>, so that we can have a
-concrete example for this man page.
+examples, please submit them to L<https://github.com/Perl/perl5/issues>,
+so that we can have a concrete example for this man page.
We may change it so that things that remain legal uses in normal bracketed
character classes might become illegal within this experimental