The dot (or period), C<.> is probably the most used, and certainly
the most well-known character class. By default, a dot matches any
character, except for the newline. That default can be changed to
-add matching the newline by using the I<single line> modifier: either
+add matching the newline by using the I<single line> modifier:
for the entire regular expression with the C</s> modifier, or
-locally with C<(?s)>. (The C<L</\N>> backslash sequence, described
+locally with C<(?s)> (and even globally within the scope of
+L<C<use re '/s'>|re/'E<sol>flags' mode>). (The C<L</\N>> backslash
+sequence, described
below, matches any character except newline without regard to the
I<single line> modifier.)
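
As a minimal sketch of the behaviour described above (the sample string and variable names are illustrative assumptions, not part of the original text), the following compares C<.> with and without the I<single line> modifier, and with C<\N>:

```perl
use strict;
use warnings;

# Hypothetical sample: two letters separated by a newline.
my $text = "a\nb";

my $plain_dot   = $text =~ /a.b/     ? 1 : 0;  # default: dot skips newline
my $slash_s     = $text =~ /a.b/s    ? 1 : 0;  # /s for the whole pattern
my $inline_s    = $text =~ /(?s)a.b/ ? 1 : 0;  # (?s) applied locally
my $backslash_N = $text =~ /a\Nb/s   ? 1 : 0;  # \N ignores /s

print "plain dot: $plain_dot\n";
print "with /s:   $slash_s\n";
print "with (?s): $inline_s\n";
print "\\N and /s: $backslash_N\n";
```

Only the two forms that enable I<single line> mode for C<.> should report a match here.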
Some digits that C<\d> matches look like some of the [0-9] ones, but
have different values. For example, BENGALI DIGIT FOUR (U+09EA) looks
-very much like an ASCII DIGIT EIGHT (U+0038). An application that
+very much like an ASCII DIGIT EIGHT (U+0038), and LEPCHA DIGIT SIX
+(U+1C46) looks very much like an ASCII DIGIT FIVE (U+0035). An
+application that
is expecting only the ASCII digits might be misled, or if the match is
C<\d+>, the matched string might contain a mixture of digits from
different writing systems that look like they signify a number different
than they actually do. L<Unicode::UCD/num()> can
be used to safely
calculate the value, returning C<undef> if the input string contains
-such a mixture.
+such a mixture. Otherwise, for example, a displayed price might be
+deliberately crafted to look different from its actual value.
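
A small sketch of the check described above, assuming the C<num()> function exported by C<Unicode::UCD>; the input strings and variable names are hypothetical examples:

```perl
use strict;
use warnings;
use Unicode::UCD qw(num);

# All-ASCII digits: a safe, unambiguous number.
my $ascii = num("48");

# BENGALI DIGIT FOUR followed by BENGALI DIGIT EIGHT: one script only.
my $bengali = num("\x{09EA}\x{09EE}");

# ASCII "4" followed by BENGALI DIGIT FOUR, which resembles an ASCII 8:
# a mixture of writing systems, so no safe value is reported.
my $mixed = num("4\x{09EA}");

for my $pair ([ascii => $ascii], [bengali => $bengali], [mixed => $mixed]) {
    my ($label, $value) = @$pair;
    printf "%-8s %s\n", $label, defined $value ? $value : 'undef (unsafe)';
}
```

Calling C<num()> in scalar context, as here, keeps the C<undef> result easy to test for.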
What C<\p{Digit}> means (and hence C<\d> except under the C</a>
modifier) is C<\p{General_Category=Decimal_Number}>, or synonymously,
The Tamil digits (U+0BE6 - U+0BEF) can also legally be
used in old-style Tamil numbers in which they would appear no more than
one in a row, separated by characters that mean "times 10", "times 100",
-etc. (See L<http://www.unicode.org/notes/tn21>.)
+etc. (See L<https://www.unicode.org/notes/tn21>.)
Any character not matched by C<\d> is matched by C<\D>.
C<\w> matches the platform's native underscore character plus whatever
the locale considers to be alphanumeric.
-=item if Unicode rules are in effect ...
+=item if, instead, Unicode rules are in effect ...
C<\w> matches exactly what C<\p{Word}> matches.
C<\s> matches whatever the locale considers to be whitespace.
-=item if Unicode rules are in effect ...
+=item if, instead, Unicode rules are in effect ...
C<\s> matches exactly the characters shown with an "s" column in the
table below.
It matches a two character string: a letter (Unicode property C<\pL>),
followed by a lowercase C<l>.
-If locale rules are not in effect, the use of
-a Unicode property will force the regular expression into using Unicode
-rules, if it isn't already.
+What a Unicode property matches is never subject to locale rules, and
+if locale rules are not otherwise in effect, the use of a Unicode
+property will force the regular expression into using Unicode rules, if
+it isn't already.
Note that almost all properties are immune to case-insensitive matching.
That is, adding a C</i> regular expression modifier does not change what
-they match. There are two sets that are affected. The first set is
+they match. But there are two sets that are affected. The first set is
C<Uppercase_Letter>,
C<Lowercase_Letter>,
and C<Titlecase_Letter>,
Even though these two matches might be thought of as complements, until
v5.20 they were so only on Unicode code points.
+Starting in perl v5.30, wildcards are allowed in Unicode property
+values. See L<perlunicode/Wildcards in Property Values>.
+
=head4 Examples
"a" =~ /\w/ # Match, "a" is a 'word' character.
instance where a bracketed class can match multiple characters, and for
similar reasons, the class must not be inverted, and the named sequence
may not appear in a range, even one where it is both endpoints. If
-these happen, it is a fatal error if the character class is within an
-extended L<C<(?[...])>|/Extended Bracketed Character Classes>
-class; and only the first code point is used (with
-a C<regexp>-type warning raised) otherwise.
+these happen, it is a fatal error if the character class is within the
+scope of L<C<use re 'strict'>|re/'strict' mode>, or within an extended
+L<C<(?[...])>|/Extended Bracketed Character Classes> class; otherwise
+only the first code point is used (with a C<regexp>-type warning
+raised).
=back
their special meaning and can be used inside a character class without
the need to escape them. For instance, C<[()]> matches either an opening
parenthesis, or a closing parenthesis, and the parens inside the character
-class don't group or capture.
+class don't group or capture. Be aware that, unless the pattern is
+evaluated in single-quotish context, variable interpolation will take
+place before the bracketed class is parsed:
+
+ $, = "\t| ";
+ $a =~ m'[$,]'; # single-quotish: matches '$' or ','
+ $a =~ q{[$,]}; # same
+ $a =~ m/[$,]/; # double-quotish: Because we made an
+ # assignment to $, above, this now
+ # matches "\t", "|", or " "
Characters that may carry a special meaning inside a character class are:
C<\>, C<^>, C<->, C<[> and C<]>, and are discussed below. They can be
# containing just [, and the character class is
# followed by a ].
+=head3 Bracketed Character Classes and the C</xx> pattern modifier
+
+Normally SPACE and TAB characters have no special meaning inside a
+bracketed character class; they are just added to the list of characters
+matched by the class. But if the L<C</xx>|perlre/E<sol>x and E<sol>xx>
+pattern modifier is in effect, they are generally ignored and can be
+added to improve readability. They can't be added in the middle of a
+single construct:
+
+ / [ \x{10 FFFF} ] /xx # WRONG!
+
+The SPACE in the middle of the hex constant is illegal.
+
+To specify a literal SPACE character, you can escape it with a
+backslash, like:
+
+ /[ a e i o u \ ]/xx
+
+This matches the English vowels plus the SPACE character.
+
+For clarity, you should already have been using C<\t> to specify a
+literal tab, and C<\t> is unaffected by C</xx>.
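
The following sketch illustrates the rules above. It assumes perl v5.26 or later, where C</xx> is available; the helper variable names are made up for the illustration:

```perl
use v5.26;     # /xx needs perl v5.26 or later; this also enables strict
use warnings;

# With /xx, the blanks inside the class are ignored.
my $vowel_matches     = "e" =~ /^[ a e i o u ]$/xx   ? 1 : 0;
my $space_ignored     = " " =~ /^[ a e i o u ]$/xx   ? 1 : 0;

# An escaped SPACE is taken literally, even under /xx.
my $escaped_space     = " " =~ /^[ a e i o u \ ]$/xx ? 1 : 0;

# \t is unaffected by /xx.
my $tab_matches       = "\t" =~ /^[ a \t ]$/xx       ? 1 : 0;

# Without /xx, a blank inside the class is just another member of it.
my $space_without_xx  = " " =~ /^[ a e i o u ]$/     ? 1 : 0;

say "vowel under /xx:         $vowel_matches";
say "bare space under /xx:    $space_ignored";
say "escaped space under /xx: $escaped_space";
say "tab under /xx:           $tab_matches";
say "bare space without /xx:  $space_without_xx";
```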
+
=head3 Character Ranges
It is not uncommon to want to match a range of characters. Luckily, instead
# even on an EBCDIC platform.
[\N{U+27}-\N{U+3F}] # Same. (U+27 is "'", and U+3F is "?")
-As the final two examples above show, you can achieve portablity to
+As the final two examples above show, you can achieve portability to
non-ASCII platforms by using the C<\N{...}> form for the range
endpoints. These indicate that the specified range is to be interpreted
using Unicode values, so C<[\N{U+27}-\N{U+3F}]> means to match
Perl recognizes the following POSIX character classes:
- alpha Any alphabetical character ("[A-Za-z]").
- alnum Any alphanumeric character ("[A-Za-z0-9]").
+ alpha Any alphabetical character (e.g., [A-Za-z]).
+ alnum Any alphanumeric character (e.g., [A-Za-z0-9]).
ascii Any character in the ASCII character set.
blank A GNU extension, equal to a space or a horizontal tab ("\t").
cntrl Any control character. See Note [2] below.
- digit Any decimal digit ("[0-9]"), equivalent to "\d".
+ digit Any decimal digit (e.g., [0-9]), equivalent to "\d".
graph Any printable character, excluding a space. See Note [3] below.
- lower Any lowercase character ("[a-z]").
+ lower Any lowercase character (e.g., [a-z]).
print Any printable character, including a space. See Note [4] below.
punct Any graphical character excluding "word" characters. Note [5].
space Any whitespace character. "\s" including the vertical tab
("\cK").
- upper Any uppercase character ("[A-Z]").
- word A Perl extension ("[A-Za-z0-9_]"), equivalent to "\w".
- xdigit Any hexadecimal digit ("[0-9a-fA-F]").
+ upper Any uppercase character (e.g., [A-Z]).
+ word A Perl extension (e.g., [A-Za-z0-9_]), equivalent to "\w".
+ xdigit Any hexadecimal digit (e.g., [0-9a-fA-F]). Note [7].
Like the L<Unicode properties|/Unicode Properties>, most of the POSIX
properties match the same regardless of whether case-insensitive (C</i>)
space \p{PosixSpace} \p{XPosixSpace} [6]
upper \p{PosixUpper} \p{XPosixUpper}
word \p{PosixWord} \p{XPosixWord} \w
- xdigit \p{PosixXDigit} \p{XPosixXDigit}
+ xdigit \p{PosixXDigit} \p{XPosixXDigit} [7]
=over 4
matching, C<\p{XPerlSpace}> did not match the vertical tab, C<\cK>.
Same for the two ASCII-only range forms.
+=item [7]
+
+Unlike C<[[:digit:]]> which matches digits in many writing systems, such
+as Thai and Devanagari, there are currently only two sets of hexadecimal
+digits, and it is unlikely that more will be added. This is because you
+not only need the ten digits, but also the six C<[A-F]> (and C<[a-f]>)
+to correspond. That means only the Latin script is suitable for these,
+and Unicode has only two sets of these, the familiar ASCII set, and the
+fullwidth forms starting at U+FF10 (FULLWIDTH DIGIT ZERO).
+
=back
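
To illustrate Note [7] above, here is a hedged sketch that tests a fullwidth hexadecimal digit against the POSIX class and the two property forms from the table. The chosen code points and variable names are assumptions made for the example:

```perl
use strict;
use warnings;

my $fullwidth_A = "\x{FF21}";  # FULLWIDTH LATIN CAPITAL LETTER A
my $fullwidth_G = "\x{FF27}";  # FULLWIDTH LATIN CAPITAL LETTER G

# The target string holds a code point above 255, so Unicode rules apply.
my $posix_class = $fullwidth_A =~ /^[[:xdigit:]]$/    ? 1 : 0;

# /a restricts the class to the ASCII range.
my $ascii_only  = $fullwidth_A =~ /^[[:xdigit:]]$/a   ? 1 : 0;

# The full-range and ASCII-range properties.
my $xposix      = $fullwidth_A =~ /^\p{XPosixXDigit}$/ ? 1 : 0;
my $posix       = $fullwidth_A =~ /^\p{PosixXDigit}$/  ? 1 : 0;

# A fullwidth letter beyond F is not a hexadecimal digit.
my $beyond_f    = $fullwidth_G =~ /^[[:xdigit:]]$/    ? 1 : 0;

print "[[:xdigit:]]       : $posix_class\n";
print "[[:xdigit:]] and /a: $ascii_only\n";
print "\\p{XPosixXDigit}   : $xposix\n";
print "\\p{PosixXDigit}    : $posix\n";
print "fullwidth G        : $beyond_f\n";
```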
There are various other synonyms that can be used besides the names
-listed in the table. For example, C<\p{PosixAlpha}> can be written as
+listed in the table. For example, C<\p{XPosixAlpha}> can be written as
C<\p{Alpha}>. All are listed in
L<perluniprops/Properties accessible through \p{} and \P{}>.
=back
-=item if Unicode rules are in effect ...
+=item if, instead, Unicode rules are in effect ...
The POSIX class matches the same as the Full-range counterpart.
Which rules apply are determined as described in
L<perlre/Which character set modifier is in effect?>.
-It is proposed to change this behavior in a future release of Perl so that
-whether or not Unicode rules are in effect would not change the
-behavior: Outside of locale, the POSIX classes
-would behave like their ASCII-range counterparts. If you wish to
-comment on this proposal, send email to C<perl5-porters@perl.org>.
-
=head4 Negation of POSIX character classes
X<character class, negation>
Comments on this feature are welcome; send email to
C<perl5-porters@perl.org>.
+The rules used by L<C<use re 'strict'>|re/'strict' mode> apply to this
+construct.
+
We can extend the example above:
/(?[ ( \p{Thai} + \p{Lao} ) & \p{Digit} ])/
This matches digits that are in either the Thai or Laotian scripts.
Notice the white space in these examples. This construct always has
-the C<E<sol>x> modifier turned on within it.
+the C<E<sol>xx> modifier turned on within it.
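
A minimal sketch of the Thai-or-Lao digit example in use. The sample characters and variable names are illustrative assumptions; the C<no warnings> line is only needed on perls where this construct is still flagged as experimental:

```perl
use strict;
use warnings;
no warnings 'experimental::regex_sets';  # harmless where no longer needed

my $thai_or_lao_digit = qr/(?[ ( \p{Thai} + \p{Lao} ) & \p{Digit} ])/;

# Hypothetical samples, keyed by character name.
my %samples = (
    'THAI DIGIT THREE'      => "\x{0E53}",
    'LAO DIGIT THREE'       => "\x{0ED3}",
    'THAI CHARACTER KO KAI' => "\x{0E01}",
    'ASCII DIGIT THREE'     => "3",
);

my %result;
for my $name (sort keys %samples) {
    $result{$name} = $samples{$name} =~ /^$thai_or_lao_digit$/ ? 1 : 0;
    print "$name: ", ($result{$name} ? "match" : "no match"), "\n";
}
```

Only the two digits from the Thai and Lao scripts should match; the Thai letter fails the C<\p{Digit}> intersection, and the ASCII digit belongs to neither script.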
The available binary operators are:
This last example shows the use of this construct to specify an ordinary
bracketed character class without additional set operations. Note the
-white space within it; a limited version of C<E<sol>x> is turned on even
-within bracketed character classes, with only the SPACE and TAB (C<\t>)
-characters allowed, and no comments. Hence,
-
- (?[ [#] ])
+white space within it. This is allowed because C<E<sol>xx> is
+automatically turned on within this construct.
-matches the literal character "#". To specify a literal white space character,
-you can escape it with a backslash, like:
-
- /(?[ [ a e i o u \ ] ])/
-
-This matches the English vowels plus the SPACE character.
All the other escapes accepted by normal bracketed character classes are
-accepted here as well; but unrecognized escapes that generate warnings
-in normal classes are fatal errors here.
+accepted here as well.
-All warnings from these class elements are fatal, as well as some
-practices that don't currently warn. For example you cannot say
+Because this construct compiles under
+L<C<use re 'strict'>|re/'strict' mode>, unrecognized escapes that
+generate warnings in normal classes are fatal errors here, as are all
+other warnings from these class elements, and some practices that
+don't currently warn outside C<re 'strict'>. For example you cannot
+say
/(?[ [ \xF ] ])/ # Syntax error!
Any contained POSIX character classes, including things like C<\w> and C<\D>
respect the C<E<sol>a> (and C<E<sol>aa>) modifiers.
-C<< (?[ ]) >> is a regex-compile-time construct. Any attempt to use
-something which isn't knowable at the time the containing regular
+Note that C<< (?[ ]) >> is a regex-compile-time construct. Any attempt
+to use something which isn't knowable at the time the containing regular
expression is compiled is a fatal error. In practice, this means
just three limitations:
Just as in all regular expressions, the pattern can be built up by
including variables that are interpolated at regex compilation time.
-Care must be taken to ensure that you are getting what you expect. For
-example:
+But it's best to compile each subcomponent:
+
+ my $thai_or_lao = qr/(?[ \p{Thai} + \p{Lao} ])/;
+ my $lower = qr/(?[ \p{Lower} + \p{Digit} ])/;
+
+When these are embedded in another pattern, what they match does not
+change, regardless of parenthesization or what modifiers are in effect
+in that outer pattern. If you fail to compile the subcomponents, you
+can get some nasty surprises. For example:
my $thai_or_lao = '\p{Thai} + \p{Lao}';
...
qr/(?[ \p{Digit} & \p{Thai} + \p{Lao} ])/;
-But this does not have the effect that someone reading the code would
-likely expect, as the intersection applies just to C<\p{Thai}>,
-excluding the Laotian. Pitfalls like this can be avoided by
-parenthesizing the component pieces:
+But this does not have the effect that someone reading the source code
+would likely expect, as the intersection applies just to C<\p{Thai}>,
+excluding the Laotian. It's best to compile the subcomponents, but you
+could also parenthesize the component pieces:
my $thai_or_lao = '( \p{Thai} + \p{Lao} )';
my $lower = '\p{Lower} + \p{Digit}';
qr/(?[ \p{Greek} & $lower ])/i;
-matches upper case things. You can avoid surprises by making the
-components into instances of this construct by compiling them:
-
- my $thai_or_lao = qr/(?[ \p{Thai} + \p{Lao} ])/;
- my $lower = qr/(?[ \p{Lower} + \p{Digit} ])/;
-
-When these are embedded in another pattern, what they match does not
-change, regardless of parenthesization or what modifiers are in effect
-in that outer pattern.
+matches upper case things. So just compile the subcomponents, as
+illustrated above.
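
The difference between interpolating a plain string and a compiled subcomponent can be sketched as follows. The variable names and the sample characters are assumptions made for this illustration, and the C<no warnings> line is only needed where the construct is still experimental:

```perl
use strict;
use warnings;
no warnings 'experimental::regex_sets';  # harmless where no longer needed

my $as_string   = '\p{Thai} + \p{Lao}';
my $as_compiled = qr/(?[ \p{Thai} + \p{Lao} ])/;

# Interpolated as text: parsed as ( \p{Digit} & \p{Thai} ) + \p{Lao}
my $loose = qr/(?[ \p{Digit} & $as_string ])/;

# Interpolated as a compiled set: \p{Digit} & ( \p{Thai} + \p{Lao} )
my $tight = qr/(?[ \p{Digit} & $as_compiled ])/;

my $lao_letter = "\x{0E81}";  # LAO LETTER KO, not a digit
my $lao_digit  = "\x{0ED3}";  # LAO DIGIT THREE

my $loose_letter = $lao_letter =~ /^$loose$/ ? 1 : 0;
my $tight_letter = $lao_letter =~ /^$tight$/ ? 1 : 0;
my $tight_digit  = $lao_digit  =~ /^$tight$/ ? 1 : 0;

print "string component, Lao letter:   $loose_letter\n";
print "compiled component, Lao letter: $tight_letter\n";
print "compiled component, Lao digit:  $tight_digit\n";
```

With the string form, the Lao letter slips through because the intersection binds only to C<\p{Thai}>; the compiled form keeps the union together as a single operand.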
Due to the way that Perl parses things, your parentheses and brackets
may need to be balanced, even including comments. If you run into any
-examples, please send them to C<perlbug@perl.org>, so that we can have a
-concrete example for this man page.
+examples, please submit them to L<https://github.com/Perl/perl5/issues>,
+so that we can have a concrete example for this man page.
We may change it so that things that remain legal uses in normal bracketed
character classes might become illegal within this experimental