The dot (or period), C<.> is probably the most used, and certainly
the most well-known character class. By default, a dot matches any
-character, except for the newline. The default can be changed to
+character, except for the newline. That default can be changed to
add matching the newline by using the I<single line> modifier: either
for the entire regular expression with the C</s> modifier, or
-locally with C<(?s)>. (The experimental C<\N> backslash sequence, described
+locally with C<(?s)>. (The C<L</\N>> backslash sequence, described
below, matches any character except newline without regard to the
I<single line> modifier.)
"ab" =~ /^.$/ # No match (dot matches one character)
=head2 Backslash sequences
-X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\p> X<\P>
+X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\p> X<\P>
X<\N> X<\v> X<\V> X<\h> X<\H>
X<word> X<whitespace>
\H Match a character that isn't horizontal whitespace.
\v Match a vertical whitespace character.
\V Match a character that isn't vertical whitespace.
- \N Match a character that isn't a newline. Experimental.
+ \N Match a character that isn't a newline.
\pP, \p{Prop} Match a character that has the given Unicode property.
\PP, \P{Prop} Match a character that doesn't have the Unicode property
-=head3 Digits
+=head3 \N
-C<\d> matches a single character that is considered to be a decimal I<digit>.
-What is considered a decimal digit depends on several factors, detailed
-below in L</Locale, EBCDIC, Unicode and UTF-8>. If those factors
-indicate a Unicode interpretation, C<\d> not only matches the digits
-'0' - '9', but also Arabic, Devanagari and digits from other languages.
-Otherwise, if there is a locale in effect, it will match whatever
-characters the locale considers decimal digits. Without a locale, C<\d>
-matches just the digits '0' to '9'.
-
-Unicode digits may cause some confusion, and some security issues. In UTF-8
-strings, C<\d> matches the same characters matched by
-C<\p{General_Category=Decimal_Number}>, or synonymously,
-C<\p{General_Category=Digit}>. Starting with Unicode version 4.1, this is the
-same set of characters matched by C<\p{Numeric_Type=Decimal}>.
+C<\N>, available starting in v5.12, like the dot, matches any
+character that is not a newline. The difference is that C<\N> is not influenced
+by the I<single line> regular expression modifier (see L</The dot> above). Note
+that the form C<\N{...}> may mean something completely different. When the
+C<{...}> is a L<quantifier|perlre/Quantifiers>, it means to match a non-newline
+character that many times. For example, C<\N{3}> means to match 3
+non-newlines; C<\N{5,}> means to match 5 or more non-newlines. But if C<{...}>
+is not a legal quantifier, it is presumed to be a named character. See
+L<charnames> for those. For example, none of C<\N{COLON}>, C<\N{4F}>, and
+C<\N{F4}> contain legal quantifiers, so Perl will try to find characters whose
+names are respectively C<COLON>, C<4F>, and C<F4>.
+
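The difference between the dot and C<\N>, and the two readings of C<\N{...}>, can be sketched as follows. This is an illustrative example rather than part of the original text; the variable names are made up for the illustration, and C<use charnames ':full'> is included on the assumption that the code may run on a Perl older than v5.16.

```perl
use strict;
use warnings;
use charnames ':full';   # assumed needed for \N{COLON} on Perls before v5.16

# Compare the dot and \N against a lone newline, with and without /s.
my %newline_match = (
    dot_plain  => ("\n" =~ /\A.\z/   ? 1 : 0),   # dot does not match "\n" by default
    dot_single => ("\n" =~ /\A.\z/s  ? 1 : 0),   # /s lets the dot match "\n"
    N_single   => ("\n" =~ /\A\N\z/s ? 1 : 0),   # \N is not influenced by /s
);

# \N followed by a legal quantifier: exactly three non-newlines.
my $three_non_newlines = ("abc"  =~ /\A\N{3}\z/ ? 1 : 0);
my $with_newline       = ("ab\n" =~ /\A\N{3}\z/ ? 1 : 0);

# {COLON} is not a legal quantifier, so \N{COLON} is a named character.
my $named_colon = (":" =~ /\A\N{COLON}\z/ ? 1 : 0);

printf "%-10s %d\n", $_, $newline_match{$_} for sort keys %newline_match;
print "\\N{3} on 'abc':    $three_non_newlines\n";
print "\\N{3} on 'ab\\n':   $with_newline\n";
print "\\N{COLON} on ':':  $named_colon\n";
```

Only the dot under C</s> matches the newline; C<\N> rejects it in every case.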
+=head3 Digits
+C<\d> matches a single character considered to be a decimal I<digit>.
+If the C</a> regular expression modifier is in effect, it matches [0-9].
+Otherwise, it
+matches anything that is matched by C<\p{Digit}>, which includes [0-9].
+(An unlikely possible exception is that under locale matching rules, the
+current locale might not have C<[0-9]> matched by C<\d>, and/or might match
+other characters whose code point is less than 256. The only such locale
+definitions that are legal would be to match C<[0-9]> plus another set of
+10 consecutive digit characters; anything else would be in violation of
+the C language standard, but Perl doesn't currently assume anything in
+regard to this.)
+
+What this means is that unless the C</a> modifier is in effect, C<\d> not
+only matches the digits '0' - '9', but also Arabic, Devanagari, and
+digits from other languages. This may cause some confusion, and some
+security issues.
+
+Some digits that C<\d> matches look like some of the [0-9] ones, but
+have different values. For example, BENGALI DIGIT FOUR (U+09EA) looks
+very much like an ASCII DIGIT EIGHT (U+0038). An application that
+is expecting only the ASCII digits might be misled, or if the match is
+C<\d+>, the matched string might contain a mixture of digits from
+different writing systems that look like they signify a number different
+than they actually do. L<Unicode::UCD/num()> can
+be used to safely
+calculate the value, returning C<undef> if the input string contains
+such a mixture.
+
+What C<\p{Digit}> means (and hence C<\d> except under the C</a>
+modifier) is C<\p{General_Category=Decimal_Number}>, or synonymously,
+C<\p{General_Category=Digit}>. Starting with Unicode version 4.1, this
+is the same set of characters matched by C<\p{Numeric_Type=Decimal}>.
But Unicode also has a different property with a similar name,
C<\p{Numeric_Type=Digit}>, which matches a completely different set of
-characters. These characters are things such as subscripts.
-
-The design intent is for C<\d> to match all the digits (and no other characters)
-that can be used with "normal" big-endian positional decimal syntax, whereby a
-sequence of such digits {N0, N1, N2, ...Nn} has the numeric value (...(N0 * 10
-+ N1) * 10 + N2) * 10 ... + Nn). In Unicode 5.2, the Tamil digits (U+0BE6 -
-U+0BEF) can also legally be used in old-style Tamil numbers in which they would
-appear no more than one in a row, separated by characters that mean "times 10",
-"times 100", etc. (See L<http://www.unicode.org/notes/tn21>.)
+characters. These characters are things such as C<CIRCLED DIGIT ONE>
+or subscripts, or are from writing systems that lack all ten digits.
-Some of the non-European digits that C<\d> matches look like European ones, but
-have different values. For example, BENGALI DIGIT FOUR (U+09EA) looks
-very much like an ASCII DIGIT EIGHT (U+0038).
+The design intent is for C<\d> to exactly match the set of characters
+that can safely be used with "normal" big-endian positional decimal
+syntax, where, for example, 123 means one 'hundred', plus two 'tens',
+plus three 'ones'. This positional notation does not necessarily apply
+to characters that match the other type of "digit",
+C<\p{Numeric_Type=Digit}>, and so C<\d> doesn't match them.
-It may be useful for security purposes for an application to require that all
-digits in a row be from the same script. See L<Unicode::UCD/charscript()>.
+The Tamil digits (U+0BE6 - U+0BEF) can also legally be
+used in old-style Tamil numbers in which they would appear no more than
+one in a row, separated by characters that mean "times 10", "times 100",
+etc. (See L<http://www.unicode.org/notes/tn21>.)
-Any character that isn't matched by C<\d> will be matched by C<\D>.
+Any character not matched by C<\d> is matched by C<\D>.
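The effect of C</a> on C<\d>, and the use of C<num()> from L<Unicode::UCD> to reject mixed-script digit strings, might look like the sketch below. It assumes a Perl new enough to support C</a> and C<num()> (v5.14 or later, as far as this editor knows); the variable names are illustrative only.

```perl
use strict;
use warnings;
use Unicode::UCD qw(num);

my $devanagari_two = "\x{0968}";          # DEVANAGARI DIGIT TWO
my $mixed          = "1$devanagari_two";  # ASCII digit followed by a Devanagari digit

# Without /a, \d follows Unicode rules and accepts the Devanagari digit.
my $unicode_d = ($devanagari_two =~ /\A\d\z/  ? 1 : 0);

# With /a, \d is restricted to [0-9].
my $ascii_d   = ($devanagari_two =~ /\A\d\z/a ? 1 : 0);

# \d+ happily matches a mixture of digits from different writing systems ...
my $mixed_matches = ($mixed =~ /\A\d+\z/ ? 1 : 0);

# ... but num() declines to assign such a mixture a value.
my $mixed_value = num($mixed);
my $plain_value = num("123");

print "\\d  matches DEVANAGARI DIGIT TWO: $unicode_d\n";
print "\\d/a matches DEVANAGARI DIGIT TWO: $ascii_d\n";
print "\\d+ matches mixed string: $mixed_matches\n";
print "num(mixed) is ", (defined $mixed_value ? $mixed_value : "undef"), "\n";
print "num('123') is $plain_value\n";
```

An application that expects only ASCII digits can therefore either add C</a> or validate the matched text with C<num()> before using it.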
=head3 Word characters
A C<\w> matches a single alphanumeric character (an alphabetic character, or a
-decimal digit) or a connecting punctuation character, such as an
-underscore ("_"). It does not match a whole word. To match a whole
-word, use C<\w+>. This isn't the same thing as matching an English word, but
-in the ASCII range is the same as a string of Perl-identifier
-characters. What is considered a
-word character depends on several factors, detailed below in L</Locale,
-EBCDIC, Unicode and UTF-8>. If those factors indicate a Unicode
-interpretation, C<\w> matches the characters that are considered word
-characters in the Unicode database. That is, it not only matches ASCII letters,
-but also Thai letters, Greek letters, etc. This includes connector
+decimal digit); or a connecting punctuation character, such as an
+underscore ("_"); or a "mark" character (like some sort of accent) that
+attaches to one of those. It does not match a whole word. To match a
+whole word, use C<\w+>. This isn't the same thing as matching an
+English word, but in the ASCII range it is the same as a string of
+Perl-identifier characters.
+
+=over
+
+=item If the C</a> modifier is in effect ...
+
+C<\w> matches the 63 characters [a-zA-Z0-9_].
+
+=item otherwise ...
+
+=over
+
+=item For code points above 255 ...
+
+C<\w> matches the same as C<\p{Word}> matches in this range. That is,
+it matches Thai letters, Greek letters, etc. This includes connector
punctuation (like the underscore) which connect two words together, or
-marks, such as a C<COMBINING TILDE>, which are generally used to add
-diacritical marks to letters. If a Unicode interpretation
-is not indicated, C<\w> matches those characters that are considered
-word characters by the current locale or EBCDIC code page. Without a
-locale or EBCDIC code page, C<\w> matches the ASCII letters, digits and
-the underscore.
+diacritics, such as a C<COMBINING TILDE> and the modifier letters, which
+are generally used to add auxiliary markings to letters.
+
+=item For code points below 256 ...
+
+=over
+
+=item if locale rules are in effect ...
+
+C<\w> matches the platform's native underscore character plus whatever
+the locale considers to be alphanumeric.
+
+=item if Unicode rules are in effect ...
+
+C<\w> matches exactly what C<\p{Word}> matches.
+
+=item otherwise ...
+
+C<\w> matches [a-zA-Z0-9_].
+
+=back
+
+=back
+
+=back
+
+Which rules apply are determined as described in L<perlre/Which character set modifier is in effect?>.
There are a number of security issues with the full Unicode list of word
characters. See L<http://unicode.org/reports/tr36>.
Also, for a somewhat finer-grained set of characters that are in programming
language identifiers beyond the ASCII range, you may wish to instead use the
-more customized Unicode properties, "ID_Start", ID_Continue", "XID_Start", and
-"XID_Continue". See L<http://unicode.org/reports/tr31>.
+more customized L</Unicode Properties>, C<\p{ID_Start}>,
+C<\p{ID_Continue}>, C<\p{XID_Start}>, and C<\p{XID_Continue}>. See
+L<http://unicode.org/reports/tr31>.
-Any character that isn't matched by C<\w> will be matched by C<\W>.
+Any character not matched by C<\w> is matched by C<\W>.
=head3 Whitespace
-C<\s> matches any single character that is considered whitespace. The exact
-set of characters matched by C<\s> depends on several factors, detailed
-below in L</Locale, EBCDIC, Unicode and UTF-8>. If those factors
-indicate a Unicode interpretation, C<\s> matches what is considered
-whitespace in the Unicode database; the complete list is in the table
-below. Otherwise, if there is a locale or EBCDIC code page in effect,
-C<\s> matches whatever is considered whitespace by the current locale or
-EBCDIC code page. Without a locale or EBCDIC code page, C<\s> matches
-the horizontal tab (C<\t>), the newline (C<\n>), the form feed (C<\f>),
-the carriage return (C<\r>), and the space. (Note that it doesn't match
-the vertical tab, C<\cK>.) Perhaps the most notable possible surprise
-is that C<\s> matches a non-breaking space only if a Unicode
-interpretation is indicated, or the locale or EBCDIC code page that is
-in effect has that character.
-
-Any character that isn't matched by C<\s> will be matched by C<\S>.
-
-C<\h> will match any character that is considered horizontal whitespace;
-this includes the space and the tab characters and a number other characters,
-all of which are listed in the table below. C<\H> will match any character
-that is not considered horizontal whitespace.
-
-C<\v> will match any character that is considered vertical whitespace;
-this includes the carriage return and line feed characters (newline) plus several
-other characters, all listed in the table below.
-C<\V> will match any character that is not considered vertical whitespace.
+C<\s> matches any single character considered whitespace.
+
+=over
+
+=item If the C</a> modifier is in effect ...
+
+In all Perl versions, C<\s> matches the 5 characters [\t\n\f\r ]; that
+is, the horizontal tab,
+the newline, the form feed, the carriage return, and the space.
+Starting in Perl v5.18, experimentally, it also matches the vertical tab, C<\cK>.
+See note C<[1]> below for a discussion of this.
+
+=item otherwise ...
+
+=over
+
+=item For code points above 255 ...
+
+C<\s> matches exactly the code points above 255 shown with an "s" column
+in the table below.
+
+=item For code points below 256 ...
+
+=over
+
+=item if locale rules are in effect ...
+
+C<\s> matches whatever the locale considers to be whitespace.
+
+=item if Unicode rules are in effect ...
+
+C<\s> matches exactly the characters shown with an "s" column in the
+table below.
+
+=item otherwise ...
+
+C<\s> matches [\t\n\f\r ] and, starting experimentally in Perl
+v5.18, the vertical tab, C<\cK>.
+(See note C<[1]> below for a discussion of this.)
+Note that this list doesn't include the non-breaking space.
+
+=back
+
+=back
+
+=back
+
+Which rules apply are determined as described in L<perlre/Which character set modifier is in effect?>.
+
+Any character not matched by C<\s> is matched by C<\S>.
+
+C<\h> matches any character considered horizontal whitespace;
+this includes the platform's space and tab characters and several others
+listed in the table below. C<\H> matches any character
+not considered horizontal whitespace. They use the platform's native
+character set, and do not consider any locale that may otherwise be in
+use.
+
+C<\v> matches any character considered vertical whitespace;
+this includes the platform's carriage return and line feed characters (newline)
+plus several other characters, all listed in the table below.
+C<\V> matches any character not considered vertical whitespace.
+They use the platform's native character set, and do not consider any
+locale that may otherwise be in use.
C<\R> matches anything that can be considered a newline under Unicode
rules. It's not a character class, as it can match a multi-character
sequence. Therefore, it cannot be used inside a bracketed character
-class; use C<\v> instead (vertical whitespace).
+class; use C<\v> instead (vertical whitespace). It uses the platform's
+native character set, and does not consider any locale that may
+otherwise be in use.
Details are discussed in L<perlrebackslash>.
-Note that unlike C<\s>, C<\d> and C<\w>, C<\h> and C<\v> always match
-the same characters, without regard to other factors, such as if the
-source string is in UTF-8 format or not.
+Note that unlike C<\s> (and C<\d> and C<\w>), C<\h> and C<\v> always match
+the same characters, without regard to other factors, such as the active
+locale or whether the source string is in UTF-8 format.
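As a rough illustration of that fixed behavior, the sketch below classifies a few characters with C<\h> and C<\v>. It assumes an ASCII platform (so that C<"\xA0"> is NO-BREAK SPACE and C<"\x85"> is NEXT LINE); the hash names are invented for the example.

```perl
use strict;
use warnings;

# Characters to probe, keyed by their Unicode names (ASCII platform assumed).
my %probe = (
    'NO-BREAK SPACE'  => "\xA0",
    'NEXT LINE'       => "\x85",
    'LINE TABULATION' => "\cK",
    'SPACE'           => " ",
    'LINE FEED'       => "\n",
);

# Record an "h" and/or "v" flag for each character, "-" where it does not match.
my %class;
for my $name (sort keys %probe) {
    my $ch = $probe{$name};
    $class{$name} = join '',
        ($ch =~ /\A\h\z/ ? 'h' : '-'),
        ($ch =~ /\A\v\z/ ? 'v' : '-');
    printf "%-16s %s\n", $name, $class{$name};
}
```

None of the test strings is in UTF-8 format and no locale is in use, yet NO-BREAK SPACE and NEXT LINE are still classified as horizontal and vertical whitespace respectively.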
-One might think that C<\s> is equivalent to C<[\h\v]>. This is not true. The
-vertical tab (C<"\x0b">) is not matched by C<\s>, it is however considered
-vertical whitespace. Furthermore, if the source string is not in UTF-8 format,
-and any locale or EBCDIC code page that is in effect doesn't include them, the
-next line (ASCII-platform C<"\x85">) and the no-break space (ASCII-platform
-C<"\xA0">) characters are not matched by C<\s>, but are by C<\v> and C<\h>
-respectively. If the source string is in UTF-8 format, both the next line and
-the no-break space are matched by C<\s>.
+One might think that C<\s> is equivalent to C<[\h\v]>. This is indeed true
+starting in Perl v5.18, but prior to that, the sole difference was that the
+vertical tab (C<"\cK">) was not matched by C<\s>.
The following table is a complete listing of characters matched by
-C<\s>, C<\h> and C<\v> as of Unicode 5.2.
+C<\s>, C<\h> and C<\v> as of Unicode 6.3.
-The first column gives the code point of the character (in hex format),
+The first column gives the Unicode code point of the character (in hex format),
the second column gives the (Unicode) name. The third column indicates
-by which class(es) the character is matched (assuming no locale or EBCDIC code
-page is in effect that changes the C<\s> matching).
-
- 0x00009 CHARACTER TABULATION h s
- 0x0000a LINE FEED (LF) vs
- 0x0000b LINE TABULATION v
- 0x0000c FORM FEED (FF) vs
- 0x0000d CARRIAGE RETURN (CR) vs
- 0x00020 SPACE h s
- 0x00085 NEXT LINE (NEL) vs [1]
- 0x000a0 NO-BREAK SPACE h s [1]
- 0x01680 OGHAM SPACE MARK h s
- 0x0180e MONGOLIAN VOWEL SEPARATOR h s
- 0x02000 EN QUAD h s
- 0x02001 EM QUAD h s
- 0x02002 EN SPACE h s
- 0x02003 EM SPACE h s
- 0x02004 THREE-PER-EM SPACE h s
- 0x02005 FOUR-PER-EM SPACE h s
- 0x02006 SIX-PER-EM SPACE h s
- 0x02007 FIGURE SPACE h s
- 0x02008 PUNCTUATION SPACE h s
- 0x02009 THIN SPACE h s
- 0x0200a HAIR SPACE h s
- 0x02028 LINE SEPARATOR vs
- 0x02029 PARAGRAPH SEPARATOR vs
- 0x0202f NARROW NO-BREAK SPACE h s
- 0x0205f MEDIUM MATHEMATICAL SPACE h s
- 0x03000 IDEOGRAPHIC SPACE h s
+by which class(es) the character is matched (assuming no locale is in
+effect that changes the C<\s> matching).
+
+ 0x0009 CHARACTER TABULATION h s
+ 0x000a LINE FEED (LF) vs
+ 0x000b LINE TABULATION vs [1]
+ 0x000c FORM FEED (FF) vs
+ 0x000d CARRIAGE RETURN (CR) vs
+ 0x0020 SPACE h s
+ 0x0085 NEXT LINE (NEL) vs [2]
+ 0x00a0 NO-BREAK SPACE h s [2]
+ 0x1680 OGHAM SPACE MARK h s
+ 0x2000 EN QUAD h s
+ 0x2001 EM QUAD h s
+ 0x2002 EN SPACE h s
+ 0x2003 EM SPACE h s
+ 0x2004 THREE-PER-EM SPACE h s
+ 0x2005 FOUR-PER-EM SPACE h s
+ 0x2006 SIX-PER-EM SPACE h s
+ 0x2007 FIGURE SPACE h s
+ 0x2008 PUNCTUATION SPACE h s
+ 0x2009 THIN SPACE h s
+ 0x200a HAIR SPACE h s
+ 0x2028 LINE SEPARATOR vs
+ 0x2029 PARAGRAPH SEPARATOR vs
+ 0x202f NARROW NO-BREAK SPACE h s
+ 0x205f MEDIUM MATHEMATICAL SPACE h s
+ 0x3000 IDEOGRAPHIC SPACE h s
=over 4
=item [1]
-NEXT LINE and NO-BREAK SPACE only match C<\s> if the source string is in
-UTF-8 format, or the locale or EBCDIC code page that is in effect includes them.
+Prior to Perl v5.18, C<\s> did not match the vertical tab. The change
+in v5.18 is considered an experiment, which means it could be backed out
+in v5.22 if experience indicates that it breaks too much
+existing code. If this change adversely affects you, send email to
+C<perlbug@perl.org>; if it affects you positively, email
+C<perlthanks@perl.org>. In the meantime, C<[^\S\cK]> (obscurely)
+matches what C<\s> traditionally did.
-=back
-
-It is worth noting that C<\d>, C<\w>, etc, match single characters, not
-complete numbers or words. To match a number (that consists of digits),
-use C<\d+>; to match a word, use C<\w+>.
+=item [2]
-=head3 \N
+NEXT LINE and NO-BREAK SPACE may or may not match C<\s> depending
+on the rules in effect. See
+L<the beginning of this section|/Whitespace>.
-C<\N> is new in 5.12, and is experimental. It, like the dot, will match any
-character that is not a newline. The difference is that C<\N> is not influenced
-by the I<single line> regular expression modifier (see L</The dot> above). Note
-that the form C<\N{...}> may mean something completely different. When the
-C<{...}> is a L<quantifier|perlre/Quantifiers>, it means to match a non-newline
-character that many times. For example, C<\N{3}> means to match 3
-non-newlines; C<\N{5,}> means to match 5 or more non-newlines. But if C<{...}>
-is not a legal quantifier, it is presumed to be a named character. See
-L<charnames> for those. For example, none of C<\N{COLON}>, C<\N{4F}>, and
-C<\N{F4}> contain legal quantifiers, so Perl will try to find characters whose
-names are, respectively, C<COLON>, C<4F>, and C<F4>.
+=back
=head3 Unicode Properties
with the property name following the C<\p>, otherwise, braces are required.
When using braces, there is a single form, which is just the property name
enclosed in the braces, and a compound form which looks like C<\p{name=value}>,
-which means to match if the property "name" for the character has the particular
+which means to match if the property "name" for the character has that particular
"value".
For instance, a match for a number can be written as C</\pN/> or as
C</\p{Number}/>, or as C</\p{Number=True}/>.
Lowercase letters are matched by the property I<Lowercase_Letter> which
-has as short form I<Ll>. They need the braces, so are written as C</\p{Ll}/> or
+has the short form I<Ll>. They need the braces, so are written as C</\p{Ll}/> or
C</\p{Lowercase_Letter}/>, or C</\p{General_Category=Lowercase_Letter}/>
(the underscores are optional).
C</\pLl/> is valid, but means something different.
It matches a two character string: a letter (Unicode property C<\pL>),
followed by a lowercase C<l>.
-For more details, see L<perlunicode/Unicode Character Properties>; for a
+If locale rules are not in effect, the use of
+a Unicode property will force the regular expression into using Unicode
+rules, if it isn't already.
+
+Note that almost all properties are immune to case-insensitive matching.
+That is, adding a C</i> regular expression modifier does not change what
+they match. There are two sets that are affected. The first set is
+C<Uppercase_Letter>,
+C<Lowercase_Letter>,
+and C<Titlecase_Letter>,
+all of which match C<Cased_Letter> under C</i> matching.
+The second set is
+C<Uppercase>,
+C<Lowercase>,
+and C<Titlecase>,
+all of which match C<Cased> under C</i> matching.
+(The difference between these sets is that some things, such as Roman
+numerals, come in both upper and lower case, so they are C<Cased>, but
+aren't considered to be letters, so they aren't C<Cased_Letter>s. They're
+actually C<Letter_Number>s.)
+This set also includes its subsets C<PosixUpper> and C<PosixLower>, both
+of which under C</i> match C<PosixAlpha>.
+
+For more details on Unicode properties, see L<perlunicode/Unicode
+Character Properties>; for a
complete list of possible properties, see
-L<perluniprops/Properties accessible through \p{} and \P{}>.
+L<perluniprops/Properties accessible through \p{} and \P{}>,
+which notes all forms that have C</i> differences.
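The braces rule and the C</i> behavior described above can be checked with a short sketch like this one. It assumes a Perl that implements the C</i> handling of C<Uppercase_Letter> (v5.14 or later, to this editor's knowledge); the variable names are illustrative.

```perl
use strict;
use warnings;

# \p{Ll} needs the braces to mean Lowercase_Letter.
my $ll_braced  = ("a"  =~ /\A\p{Ll}\z/ ? 1 : 0);

# \pLl is \pL (any letter) followed by a literal "l": a two-character match.
my $pLl_two    = ("El" =~ /\A\pLl\z/ ? 1 : 0);
my $pLl_single = ("a"  =~ /\A\pLl\z/ ? 1 : 0);

# Uppercase_Letter rejects "a" normally, but under /i it matches Cased_Letter.
my $lu_plain    = ("a" =~ /\A\p{Lu}\z/  ? 1 : 0);
my $lu_caseless = ("a" =~ /\A\p{Lu}\z/i ? 1 : 0);

print "\\p{Ll} on 'a':    $ll_braced\n";
print "\\pLl on 'El':     $pLl_two\n";
print "\\pLl on 'a':      $pLl_single\n";
print "\\p{Lu} on 'a':    $lu_plain\n";
print "\\p{Lu}/i on 'a':  $lu_caseless\n";
```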
It is also possible to define your own properties. This is discussed in
L<perlunicode/User-Defined Character Properties>.
+Unicode properties are defined (surprise!) only on Unicode code points.
+Starting in v5.20, when matching against C<\p> and C<\P>, Perl treats
+non-Unicode code points (those above the legal Unicode maximum of
+0x10FFFF) as if they were typical unassigned Unicode code points.
+
+Prior to v5.20, Perl raised a warning and made all matches fail on
+non-Unicode code points. This could be somewhat surprising:
+
+ chr(0x110000) =~ /\p{ASCII_Hex_Digit=True}/ # Fails on Perls < v5.20.
+ chr(0x110000) =~ /\p{ASCII_Hex_Digit=False}/ # Also fails on Perls
+ # < v5.20
+
+Even though these two matches might be thought of as complements, until
+v5.20 they were so only on Unicode code points.
=head4 Examples
# Thai Unicode class.
"a" =~ /\P{Lao}/ # Match, as "a" is not a Laotian character.
+It is worth emphasizing that C<\d>, C<\w>, etc., match single characters, not
+complete numbers or words. To match a number (that consists of digits),
+use C<\d+>; to match a word, use C<\w+>. But be aware of the security
+considerations in doing so, as mentioned above.
=head2 Bracketed Character Classes
is the bracketed character class. In its simplest form, it lists the characters
that may be matched, surrounded by square brackets, like this: C<[aeiou]>.
This matches one of C<a>, C<e>, C<i>, C<o> or C<u>. Like the other
-character classes, exactly one character will be matched. To match
+character classes, exactly one character is matched.* To match
a longer string consisting of characters mentioned in the character
class, follow the character class with a L<quantifier|perlre/Quantifiers>. For
-instance, C<[aeiou]+> matches a string of one or more lowercase English vowels.
+instance, C<[aeiou]+> matches one or more lowercase English vowels.
Repeating a character in a character class has no
effect; it's considered to be in the set only once.
# a single character.
"ae" =~ /^[aeiou]+$/ # Match, due to the quantifier.
+ -------
+
+* There are two exceptions to a bracketed character class matching a
+single character only. Each requires special handling by Perl to make
+things work:
+
+=over
+
+=item *
+
+When the class is to match caselessly under C</i> matching rules, and a
+character that is explicitly mentioned inside the class matches a
+multiple-character sequence caselessly under Unicode rules, the class
+will also match that sequence. For example, Unicode says that the
+letter C<LATIN SMALL LETTER SHARP S> should match the sequence C<ss>
+under C</i> rules. Thus,
+
+ 'ss' =~ /\A\N{LATIN SMALL LETTER SHARP S}\z/i # Matches
+ 'ss' =~ /\A[aeioust\N{LATIN SMALL LETTER SHARP S}]\z/i # Matches
+
+For this to happen, the class must not be inverted (see L</Negation>)
+and the character must be explicitly specified, and not be part of a
+multi-character range (not even as one of its endpoints). (L</Character
+Ranges> will be explained shortly.) Therefore,
+
+ 'ss' =~ /\A[\0-\x{ff}]\z/ui # Doesn't match
+ 'ss' =~ /\A[\0-\N{LATIN SMALL LETTER SHARP S}]\z/ui # No match
+ 'ss' =~ /\A[\xDF-\xDF]\z/ui # Matches on ASCII platforms, since
+ # \xDF is LATIN SMALL LETTER SHARP S,
+ # and the range is just a single
+ # element
+
+Note that it isn't a good idea to specify these types of ranges anyway.
+
+=item *
+
+Some names known to C<\N{...}> refer to a sequence of multiple characters,
+instead of the usual single character. When one of these is included in
+the class, the entire sequence is matched. For example,
+
+ "\N{TAMIL LETTER KA}\N{TAMIL VOWEL SIGN AU}"
+ =~ / ^ [\N{TAMIL SYLLABLE KAU}] $ /x;
+
+matches, because C<\N{TAMIL SYLLABLE KAU}> is a named sequence
+consisting of the two characters matched against. Like the other
+instance where a bracketed class can match multiple characters, and for
+similar reasons, the class must not be inverted, and the named sequence
+may not appear in a range, even one where it is both endpoints. If
+these happen, it is a fatal error if the character class is within an
+extended L<C<(?[...])>|/Extended Bracketed Character Classes>
+class; and only the first code point is used (with
+a C<regexp>-type warning raised) otherwise.
+
+=back
+
=head3 Special Characters Inside a Bracketed Character Class
Most characters that are meta characters in regular expressions (that
C<\f>,
C<\n>,
C<\N{I<NAME>}>,
-C<\N{U+I<wide hex char>}>,
+C<\N{U+I<hex char>}>,
C<\r>,
C<\t>,
and
C<\x>
are also special and have the same meanings as they do outside a
-bracketed character class. (However, inside a bracketed character
-class, if C<\N{I<NAME>}> expands to a sequence of characters, only the first
-one in the sequence is used, with a warning.)
+bracketed character class.
Also, a backslash followed by two or three octal digits is considered an octal
number.
L</POSIX Character Classes> below), or it signals the end of the bracketed
character class. If you want to include a C<]> in the set of characters, you
must generally escape it.
+
However, if the C<]> is the I<first> (or the second if the first
character is a caret) character of a bracketed character class, it
does not denote the end of the class (as you cannot have an empty class)
Examples:
"+" =~ /[+?*]/ # Match, "+" in a character class is not special.
- "\cH" =~ /[\b]/ # Match, \b inside in a character class
+ "\cH" =~ /[\b]/ # Match, \b inside a character class
# is equivalent to a backspace.
"]" =~ /[][]/ # Match, as the character class contains.
# both [ and ].
=head3 Character Ranges
It is not uncommon to want to match a range of characters. Luckily, instead
-of listing all the characters in the range, one may use the hyphen (C<->).
+of listing all characters in the range, one may use the hyphen (C<->).
If inside a bracketed character class you have two characters separated
-by a hyphen, it's treated as if all the characters between the two are in
+by a hyphen, it's treated as if all characters between the two were in
the class. For instance, C<[0-9]> matches any ASCII digit, and C<[a-m]>
matches any lowercase letter from the first half of the ASCII alphabet.
Note that the two characters on either side of the hyphen are not
-necessary both letters or both digits. Any character is possible,
+necessarily both letters or both digits. Any character is possible,
although not advisable. C<['-?]> contains a range of characters, but
-most people will not know which characters that will be. Furthermore,
+most people will not know which characters that means. Furthermore,
such ranges may lead to portability problems if the code has to run on
a platform that uses a different character set, such as EBCDIC.
If a hyphen in a character class cannot syntactically be part of a range, for
instance because it is the first or the last character of the character class,
-or if it immediately follows a range, the hyphen isn't special, and will be
-considered a character that is to be matched literally. You have to escape the
-hyphen with a backslash if you want to have a hyphen in your set of characters
-to be matched, and its position in the class is such that it could be
-considered part of a range.
+or if it immediately follows a range, the hyphen isn't special, and so is
+considered a character to be matched literally. If you want a hyphen in
+your set of characters to be matched and its position in the class is such
+that it could be considered part of a range, you must escape that hyphen
+with a backslash.
Examples:
# hyphen ('-'), or the letter 'm'.
['-?] # Matches any of the characters '()*+,-./0123456789:;<=>?
# (But not on an EBCDIC platform).
-
+ [\N{APOSTROPHE}-\N{QUESTION MARK}]
+ # Matches any of the characters '()*+,-./0123456789:;<=>?
+ # even on an EBCDIC platform.
+ [\N{U+27}-\N{U+3F}] # Same. (U+27 is "'", and U+3F is "?")
+
+As the final two examples above show, you can achieve portability to
+non-ASCII platforms by using the C<\N{...}> form for the range
+endpoints. These indicate that the specified range is to be interpreted
+using Unicode values, so C<[\N{U+27}-\N{U+3F}]> means to match
+C<\N{U+27}>, C<\N{U+28}>, C<\N{U+29}>, ..., C<\N{U+3D}>, C<\N{U+3E}>,
+and C<\N{U+3F}>, whatever the native code point versions for those are.
+
+Perl also guarantees that the ranges C<A-Z>, C<a-z>, C<0-9>, and any
+subranges of these match what an English-only speaker would expect them
+to match on any platform. That is, C<[A-Z]> matches the 26 ASCII
+uppercase letters;
+C<[a-z]> matches the 26 lowercase letters; and C<[0-9]> matches the 10
+digits. Subranges, like C<[h-k]>, match correspondingly, in this case
+just the four letters C<"h">, C<"i">, C<"j">, and C<"k">. This is the
+natural behavior on ASCII platforms where the code points (ordinal
+values) for C<"h"> through C<"k"> are consecutive integers (0x68 through
+0x6B). But special handling to achieve this may be needed on platforms
+with a non-ASCII native character set. For example, on EBCDIC
+platforms, the code point for C<"h"> is 0x88, C<"i"> is 0x89, C<"j"> is
+0x91, and C<"k"> is 0x92. Perl specially treats C<[h-k]> to exclude the
+seven code points in the gap: 0x8A through 0x90. This special handling is
+only invoked when the range is a subrange of one of the ASCII uppercase,
+lowercase, and digit ranges, AND each end of the range is expressed
+either as a literal, like C<"A">, or as a named character (C<\N{...}>,
+including the C<\N{U+...}> form).
+
+EBCDIC Examples:
+
+ [i-j] # Matches either "i" or "j"
+ [i-\N{LATIN SMALL LETTER J}] # Same
+ [i-\N{U+6A}] # Same
+ [\N{U+69}-\N{U+6A}] # Same
+ [\x{89}-\x{91}] # Matches 0x89 ("i"), 0x8A .. 0x90, 0x91 ("j")
+ [i-\x{91}] # Same
+ [\x{89}-j] # Same
+ [i-J] # Matches 0x89 ("i") .. 0xC1 ("J"); special
+ # handling doesn't apply because range is mixed
+ # case
=head3 Negation
It is also possible to instead list the characters you do not want to
match. You can do so by using a caret (C<^>) as the first character in the
-character class. For instance, C<[^a-z]> matches a character that is not a
-lowercase ASCII letter.
+character class. For instance, C<[^a-z]> matches any character that is not a
+lowercase ASCII letter, which therefore includes more than a million
+Unicode code points. The class is said to be "negated" or "inverted".
This syntax make the caret a special character inside a bracketed character
class, but only if it is the first character of the class. So if you want
-to have the caret as one of the characters you want to match, you either
-have to escape the caret, or not list it first.
+the caret as one of the characters to match, either escape the caret or
+else don't list it first.
+
+In inverted bracketed character classes, Perl ignores the Unicode rules
+that normally say that named sequences, and certain characters, should
+match a sequence of multiple characters under caseless C</i>
+matching. Following those rules could lead to highly confusing
+situations:
+
+ "ss" =~ /^[^\xDF]+$/ui; # Matches!
+
+This should match any sequence of characters that are neither C<\xDF> nor
+what C<\xDF> matches under C</i>. C<"s"> isn't C<\xDF>, but Unicode
+says that C<"ss"> is what C<\xDF> matches under C</i>. So which one
+"wins"? Do you fail the match because the string has C<ss> or accept it
+because it has an C<s> followed by another C<s>? Perl has chosen the
+latter. (See note in L</Bracketed Character Classes> above.)
Examples:
=head3 Backslash Sequences
You can put any backslash sequence character class (with the exception of
-C<\N>) inside a bracketed character class, and it will act just
-as if you put all the characters matched by the backslash sequence inside the
-character class. For instance, C<[a-f\d]> will match any decimal digit, or any
+C<\N> and C<\R>) inside a bracketed character class, and it will act just
+as if you had put all characters matched by the backslash sequence inside the
+character class. For instance, C<[a-f\d]> matches any decimal digit, or any
of the lowercase letters between 'a' and 'f' inclusive.
C<\N> within a bracketed character class must be of the forms C<\N{I<name>}>
-or C<\N{U+I<wide hex char>}>, and NOT be the form that matches non-newlines,
+or C<\N{U+I<hex char>}>, and NOT be the form that matches non-newlines,
for the same reason that a dot C<.> inside a bracketed character class loses
its special meaning: it matches nearly anything, which generally isn't what you
want to happen.
X<alpha> X<alnum> X<ascii> X<blank> X<cntrl> X<digit> X<graph>
X<lower> X<print> X<punct> X<space> X<upper> X<word> X<xdigit>
-POSIX character classes have the form C<[:class:]>, where I<class> is
+POSIX character classes have the form C<[:class:]>, where I<class> is the
name, and C<[:> and C<:]> are the delimiters. POSIX character classes only appear
I<inside> bracketed character classes, and are a convenient and descriptive
-way of listing a group of characters, though they currently suffer from
-portability issues (see below and L<Locale, EBCDIC, Unicode and UTF-8>).
+way of listing a group of characters.
Be careful about the syntax,
The latter pattern would be a character class consisting of a colon,
and the letters C<a>, C<l>, C<p> and C<h>.
-POSIX character classes can be part of a larger bracketed character class. For
-example,
+
+POSIX character classes can be part of a larger bracketed character class.
+For example,
[01[:alpha:]%]
Perl recognizes the following POSIX character classes:
alpha Any alphabetical character ("[A-Za-z]").
- alnum Any alphanumerical character. ("[A-Za-z0-9]")
+ alnum Any alphanumeric character ("[A-Za-z0-9]").
ascii Any character in the ASCII character set.
blank A GNU extension, equal to a space or a horizontal tab ("\t").
cntrl Any control character. See Note [2] below.
lower Any lowercase character ("[a-z]").
print Any printable character, including a space. See Note [4] below.
punct Any graphical character excluding "word" characters. Note [5].
- space Any whitespace character. "\s" plus the vertical tab ("\cK").
+ space Any whitespace character. "\s" including the vertical tab
+ ("\cK").
upper Any uppercase character ("[A-Z]").
word A Perl extension ("[A-Za-z0-9_]"), equivalent to "\w".
xdigit Any hexadecimal digit ("[0-9a-fA-F]").
between POSIX character classes and these counterparts.
One counterpart, in the column labelled "ASCII-range Unicode" in
-the table, will only match characters in the ASCII character set.
+the table, matches only characters in the ASCII character set.
The other counterpart, in the column labelled "Full-range Unicode", matches any
appropriate characters in the full Unicode character set. For example,
-C<\p{Alpha}> will match not just the ASCII alphabetic characters, but any
-character in the entire Unicode character set that is considered to be
-alphabetic. The backslash sequence column is a (short) synonym for
-the Full-range Unicode form.
-
-(Each of the counterparts has various synonyms as well.
-L<perluniprops/Properties accessible through \p{} and \P{}> lists all the
-synonyms, plus all the characters matched by each of the ASCII-range
-properties. For example C<\p{AHex}> is a synonym for C<\p{ASCII_Hex_Digit}>,
-and any C<\p> property name can be prefixed with "Is" such as C<\p{IsAlpha}>.)
-
-Both the C<\p> forms are unaffected by any locale that is in effect, or whether
-the string is in UTF-8 format or not, or whether the platform is EBCDIC or not.
-In contrast, the POSIX character classes are affected. If the source string is
-in UTF-8 format, the POSIX classes behave like their "Full-range"
-Unicode counterparts. If the
-source string is not in UTF-8 format, and no locale is in effect, and the
-platform is not EBCDIC, all the POSIX classes behave like their ASCII-range
-counterparts. Otherwise, they behave based on the rules of the locale or
-EBCDIC code page.
-
-It is proposed to change this behavior in a future release of Perl so that the
-the UTF8ness of the source string will be irrelevant to the behavior of the
-POSIX character classes. This means they will always behave in strict
-accordance with the official POSIX standard. That is, if either locale or
-EBCDIC code page is present, they will behave in accordance with those; if
-absent, the classes will match only their ASCII-range counterparts. If you
-disagree with this proposal, send email to C<perl5-porters@perl.org>.
+C<\p{Alpha}> matches not just the ASCII alphabetic characters, but any
+character in the entire Unicode character set considered alphabetic.
+An entry in the column labelled "backslash sequence" is a (short)
+equivalent.
[[:...:]] ASCII-range Full-range backslash Note
Unicode Unicode sequence
-----------------------------------------------------
alpha \p{PosixAlpha} \p{XPosixAlpha}
alnum \p{PosixAlnum} \p{XPosixAlnum}
- ascii \p{ASCII}
+ ascii \p{ASCII}
blank \p{PosixBlank} \p{XPosixBlank} \h [1]
or \p{HorizSpace} [1]
cntrl \p{PosixCntrl} \p{XPosixCntrl} [2]
space \p{PosixSpace} \p{XPosixSpace} [6]
upper \p{PosixUpper} \p{XPosixUpper}
word \p{PosixWord} \p{XPosixWord} \w
- xdigit \p{ASCII_Hex_Digit} \p{XPosixXDigit}
+ xdigit \p{PosixXDigit} \p{XPosixXDigit}
=over 4
=item [2]
Control characters don't produce output as such, but instead usually control
-the terminal somehow: for example newline and backspace are control characters.
-In the ASCII range, characters whose ordinals are between 0 and 31 inclusive,
+the terminal somehow: for example, newline and backspace are control characters.
+In the ASCII range, characters whose code points are between 0 and 31 inclusive,
plus 127 (C<DEL>) are control characters.
-On EBCDIC platforms, it is likely that the code page will define C<[[:cntrl:]]>
-to be the EBCDIC equivalents of the ASCII controls, plus the controls
-that in Unicode have ordinals from 128 through 159.
-
=item [3]
Any character that is I<graphical>, that is, visible. This class consists
-of all the alphanumerical characters and all punctuation characters.
+of all alphanumeric characters and all punctuation characters.
=item [4]
-All printable characters, which is the set of all the graphical characters
-plus whitespace characters that are not also controls.
+All printable characters, which is the set of all graphical characters
+plus those whitespace characters which are not also controls.
=item [5]
-C<\p{PosixPunct}> and C<[[:punct:]]> in the ASCII range match all the
+C<\p{PosixPunct}> and C<[[:punct:]]> in the ASCII range match all
non-controls, non-alphanumeric, non-space characters:
C<[-!"#$%&'()*+,./:;<=E<gt>?@[\\\]^_`{|}~]> (although if a locale is in effect,
it could alter the behavior of C<[[:punct:]]>).
The similarly named property, C<\p{Punct}>, matches a somewhat different
set in the ASCII range, namely
-C<[-!"#%&'()*,./:;?@[\\\]_{}]>. That is, it is missing C<[$+E<lt>=E<gt>^`|~]>.
+C<[-!"#%&'()*,./:;?@[\\\]_{}]>. That is, it is missing the nine
+characters C<[$+E<lt>=E<gt>^`|~]>.
This is because Unicode splits what POSIX considers to be punctuation into two
categories, Punctuation and Symbols.
-C<\p{PosixPunct>, and when the matching string is in UTF-8 format,
-C<[[:punct:]]>, match what they match in the ASCII range, plus what
-C<\p{Punct}> matches. This is different
-than strictly matching according to C<\p{Punct}>. Another way to say it is that
-for a UTF-8 string, C<[[:punct:]]> matches all the characters that Unicode
-considers to be punctuation, plus all the ASCII-range characters that Unicode
-considers to be symbols.
+C<\p{XPosixPunct}> and (under Unicode rules) C<[[:punct:]]> match what
+C<\p{PosixPunct}> matches in the ASCII range, plus what C<\p{Punct}>
+matches. This is different from strictly matching according to
+C<\p{Punct}>. Another way to say it is that
+if Unicode rules are in effect, C<[[:punct:]]> matches all characters
+that Unicode considers punctuation, plus all ASCII-range characters that
+Unicode considers symbols.
=item [6]
-C<\p{SpacePerl}> and C<\p{Space}> differ only in that C<\p{Space}> additionally
-matches the vertical tab, C<\cK>. Same for the two ASCII-only range forms.
+C<\p{XPerlSpace}> and C<\p{Space}> match identically starting with Perl
+v5.18. In earlier versions, these differ only in that in non-locale
+matching, C<\p{XPerlSpace}> does not match the vertical tab, C<\cK>.
+Same for the two ASCII-only range forms.
+
+=back
+
+There are various other synonyms that can be used besides the names
+listed in the table. For example, C<\p{PosixAlpha}> can be written as
+C<\p{Alpha}>. All are listed in
+L<perluniprops/Properties accessible through \p{} and \P{}>.
+
+Both the C<\p> counterparts always assume Unicode rules are in effect.
+On ASCII platforms, this means they assume that the code points from 128
+to 255 are Latin-1, and that means that using them under locale rules is
+unwise unless the locale is guaranteed to be Latin-1 or UTF-8. In contrast, the
+POSIX character classes are useful under locale rules. They are
+affected by the actual rules in effect, as follows:
+
+=over
+
+=item If the C</a> modifier is in effect ...
+
+Each of the POSIX classes matches exactly the same as its ASCII-range
+counterpart.
+
+=item otherwise ...
+
+=over
+
+=item For code points above 255 ...
+
+The POSIX class matches the same as its Full-range counterpart.
+
+=item For code points below 256 ...
+
+=over
+
+=item if locale rules are in effect ...
+
+The POSIX class matches according to the locale, except:
+
+=over
+
+=item C<word>
+
+also includes the platform's native underscore character, no matter what
+the locale is.
+
+=item C<ascii>
+
+on platforms that don't have the POSIX C<ascii> extension, this matches
+just the platform's native ASCII-range characters.
+
+=item C<blank>
+
+on platforms that don't have the POSIX C<blank> extension, this matches
+just the platform's native tab and space characters.
=back
-There are various other synonyms that can be used for these besides
-C<\p{HorizSpace}> and \C<\p{XPosixBlank}>. For example
-C<\p{PosixAlpha}> can be written as C<\p{Alpha}>. All are listed
-in L<perluniprops/Properties accessible through \p{} and \P{}>.
+=item if Unicode rules are in effect ...
+
+The POSIX class matches the same as the Full-range counterpart.
+
+=item otherwise ...
+
+The POSIX class matches the same as the ASCII range counterpart.
+
+=back
+
+=back
+
+=back
+
+Which rules apply is determined as described in
+L<perlre/Which character set modifier is in effect?>.
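+
+The rules above can be sketched with a character from the upper half of
+Latin-1. This assumes an ASCII platform and Perl 5.14 or later (for the
+C</a> and C</u> modifiers); C<$e_acute> is just an illustrative name:
+
+ my $e_acute = "\xE9"; # LATIN SMALL LETTER E WITH ACUTE
+
+ # Unicode rules: [[:alpha:]] behaves like \p{XPosixAlpha}
+ my $under_u = $e_acute =~ /\A[[:alpha:]]\z/u ? "match" : "no match";
+
+ # /a rules: [[:alpha:]] behaves like \p{PosixAlpha}
+ my $under_a = $e_acute =~ /\A[[:alpha:]]\z/a ? "match" : "no match";
+
+ print "u: $under_u\n"; # u: match
+ print "a: $under_a\n"; # a: no match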
+
+It is proposed to change this behavior in a future release of Perl so that
+whether or not Unicode rules are in effect would not change the
+behavior: Outside of locale, the POSIX classes
+would behave like their ASCII-range counterparts. If you wish to
+comment on this proposal, send email to C<perl5-porters@perl.org>.
-=head4 Negation
+=head4 Negation of POSIX character classes
X<character class, negation>
A Perl extension to the POSIX character class is the ability to
\P{PerlSpace} \P{XPerlSpace} \S
[[:^word:]] \P{PerlWord} \P{XPosixWord} \W
-Again, the backslash sequence means Full-range Unicode.
+The backslash sequence can mean either ASCII- or Full-range Unicode,
+depending on various factors as described in L<perlre/Which character set modifier is in effect?>.
=head4 [= =] and [. .]
-Perl will recognize the POSIX character classes C<[=class=]>, and
-C<[.class.]>, but does not (yet?) support them. Use of
-such a construct will lead to an error.
-
+Perl recognizes the POSIX character classes C<[=class=]> and
+C<[.class.]>, but does not (yet?) support them. Any attempt to use
+either construct raises an exception.
=head4 Examples
/[01[:lower:]]/ # Matches a character that is either a
# lowercase letter, or '0' or '1'.
/[[:digit:][:^xdigit:]]/ # Matches a character that can be anything
- # except the letters 'a' to 'f'. This is
- # because the main character class is composed
- # of two POSIX character classes that are ORed
- # together, one that matches any digit, and
- # the other that matches anything that isn't a
- # hex digit. The result matches all
- # characters except the letters 'a' to 'f' and
- # 'A' to 'F'.
-
-
-=head2 Locale, EBCDIC, Unicode and UTF-8
-
-Some of the character classes have a somewhat different behaviour depending
-on the internal encoding of the source string, if the regular expression
-is marked as having Unicode semantics, the locale that is in effect,
-and if the program is running on an EBCDIC platform.
-
-C<\w>, C<\d>, C<\s> and the POSIX character classes (and their negations,
-including C<\W>, C<\D>, C<\S>) have this behaviour. (Since the backslash
-sequences C<\b> and C<\B> are defined in terms of C<\w> and C<\W>, they also are
-affected.)
-
-The rule is that if the source string is in UTF-8 format or the regular
-expression is marked as indicating Unicode semantics (see the next
-paragraph), the character classes match according to the Unicode
-properties. Otherwise, the character classes match according to
-whatever locale or EBCDIC code page is in effect. If there is no locale
-nor EBCDIC, they match the ASCII defaults (0 to 9 for C<\d>; 52 letters,
-10 digits and underscore for C<\w>; etc.).
-
-A regular expression is marked for Unicode semantics if it is encoded in
-utf8 (usually as a result of including a literal character whose code
-point is above 255), or if it contains a C<\N{U+...}> or C<\N{I<name>}>
-construct, or (starting in Perl 5.14) if it was compiled in the scope of a
-C<S<use feature "unicode_strings">> pragma, or has the C<"u"> regular
-expression modifier.
-
-The differences in behavior between locale and non-locale semantics
-can affect any character whose code point is 255 or less. The
-differences in behavior between Unicode and non-Unicode semantics
-affects only ASCII platforms, and only when matching against characters
-whose code points are between 128 and 255 inclusive. See
-L<perlunicode/The "Unicode Bug">.
-
-For portability reasons, it may be better to not use C<\w>, C<\d>, C<\s>
-or the POSIX character classes, and use the Unicode properties instead.
-That way you can control whether you want matching of just characters in
-the ASCII character set, or any Unicode characters.
-C<S<use feature "unicode_strings">> will allow seamless Unicode behavior
-no matter what the internal encodings are, but won't allow restricting
-to just the ASCII characters.
+ # except the letters 'a' to 'f' and 'A' to
+ # 'F'. This is because the main character
+ # class is composed of two POSIX character
+ # classes that are ORed together, one that
+ # matches any digit, and the other that
+ # matches anything that isn't a hex digit.
+ # The OR adds the digits, leaving only the
+ # letters 'a' to 'f' and 'A' to 'F' excluded.
+
+=head3 Extended Bracketed Character Classes
+X<character class>
+X<set operations>
-=head4 Examples
+This is a fancy bracketed character class that can be used for more
+readable and less error-prone classes, and to perform set operations,
+such as intersection. An example is
+
+ /(?[ \p{Thai} & \p{Digit} ])/
+
+This will match all the digit characters that are in the Thai script.
+
+This is an experimental feature available starting in 5.18, and is
+subject to change as we gain field experience with it. Any attempt to
+use it will raise a warning, unless disabled via
+
+ no warnings "experimental::regex_sets";
+
+Comments on this feature are welcome; send email to
+C<perl5-porters@perl.org>.
+
+We can extend the example above:
+
+ /(?[ ( \p{Thai} + \p{Lao} ) & \p{Digit} ])/
+
+This matches digits that are in either the Thai or Laotian scripts.
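+
+As a rough sketch of how that pattern behaves (the sample code points
+are chosen here for illustration, and Perl 5.18 or later is assumed):
+
+ no warnings "experimental::regex_sets";
+ my $re = qr/(?[ ( \p{Thai} + \p{Lao} ) & \p{Digit} ])/;
+
+ for my $cp (0x0E51, 0x0ED1, 0x0E01, 0x31) {
+ my $verdict = chr($cp) =~ $re ? "matches" : "does not match";
+ printf "U+%04X %s\n", $cp, $verdict;
+ }
+
+U+0E51 (THAI DIGIT ONE) and U+0ED1 (LAO DIGIT ONE) match. U+0E01 (THAI
+CHARACTER KO KAI) is Thai but not a digit, and U+0031 is a digit but
+neither Thai nor Lao, so those two do not match.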
+
+Notice the white space in these examples. This construct always has
+the C<E<sol>x> modifier turned on within it.
+
+The available binary operators are:
+
+ & intersection
+ + union
+ | another name for '+', hence means union
+ - subtraction (the result matches the set consisting of those
+ code points matched by the first operand, excluding any that
+ are also matched by the second operand)
+ ^ symmetric difference (the union minus the intersection). This
+ is like an exclusive or, in that the result is the set of code
+ points that are matched by either, but not both, of the
+ operands.
+
+There is one unary operator:
+
+ ! complement
+
+All the binary operators left associate, and are of equal precedence.
+The unary operator right associates, and has higher precedence. Use
+parentheses to override the default associations. Some feedback we've
+received indicates a desire for intersection to have higher precedence
+than union. This is something that feedback from the field may cause us
+to change in future releases; you may want to parenthesize copiously to
+avoid such changes affecting your code, until this feature is no longer
+considered experimental.
+
+The main restriction is that everything is a metacharacter. Thus,
+you cannot refer to single characters by doing something like this:
+
+ /(?[ a + b ])/ # Syntax error!
+
+The easiest way to specify an individual typable character is to enclose
+it in brackets:
+
+ /(?[ [a] + [b] ])/
+
+(This is the same thing as C<[ab]>.) You could also have said the
+equivalent:
+
+ /(?[[ a b ]])/
+
+(You can, of course, specify single characters by using C<\x{...}>,
+C<\N{...}>, etc.)
+
+This last example shows the use of this construct to specify an ordinary
+bracketed character class without additional set operations. Note the
+white space within it; C<E<sol>x> is turned on even within bracketed
+character classes, except you can't have comments inside them. Hence,
+
+ (?[ [#] ])
+
+matches the literal character "#". To specify a literal white space character,
+you can escape it with a backslash, like:
+
+ /(?[ [ a e i o u \ ] ])/
+
+This matches the English vowels plus the SPACE character.
+All the other escapes accepted by normal bracketed character classes are
+accepted here as well; but unrecognized escapes that generate warnings
+in normal classes are fatal errors here.
+
+All warnings from these class elements are fatal, as well as some
+practices that don't currently warn. For example you cannot say
+
+ /(?[ [ \xF ] ])/ # Syntax error!
+
+You have to have two hex digits after a braceless C<\x> (use a leading
+zero to make two). These restrictions are to lower the incidence of
+typos causing the class to not match what you thought it would.
+
+If a regular bracketed character class contains a C<\p{}> or C<\P{}> and
+is matched against a non-Unicode code point, a warning may be
+raised, as the result is not Unicode-defined. No such warning will come
+when using this extended form.
+
+The final difference between regular bracketed character classes and
+these is that it is not possible to get these to match a
+multi-character fold. Thus,
+
+ /(?[ [\xDF] ])/iu
+
+does not match the string C<ss>.
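+
+A minimal sketch contrasting an ordinary class with the extended form
+(Perl 5.18 or later is assumed; the variable names are illustrative):
+
+ no warnings "experimental::regex_sets";
+
+ # Ordinary class: \xDF can match its two-character fold under /i
+ my $plain = "ss" =~ /\A[\xDF]\z/iu ? "match" : "no match";
+
+ # Extended class: multi-character folds are never matched
+ my $extended = "ss" =~ /\A(?[ [\xDF] ])\z/iu ? "match" : "no match";
+
+ print "plain: $plain\n"; # plain: match
+ print "extended: $extended\n"; # extended: no match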
+
+You don't have to enclose POSIX class names inside double brackets,
+hence both of the following work:
+
+ /(?[ [:word:] - [:lower:] ])/
+ /(?[ [[:word:]] - [[:lower:]] ])/
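+
+A small sketch of the first form in use (the sample strings are chosen
+for illustration; Perl 5.18 or later is assumed):
+
+ no warnings "experimental::regex_sets";
+ my $re = qr/\A(?[ [:word:] - [:lower:] ])\z/;
+
+ # "A", "5" and "_" are word characters that are not lowercase;
+ # "a" is removed by the subtraction.
+ my @kept = grep { $_ =~ $re } qw(A 5 _ a);
+ print "@kept\n"; # A 5 _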
+
+Any contained POSIX character classes, including things like C<\w> and C<\D>,
+respect the C<E<sol>a> (and C<E<sol>aa>) modifiers.
+
+C<< (?[ ]) >> is a regex-compile-time construct. Any attempt to use
+something which isn't knowable at the time the containing regular
+expression is compiled is a fatal error. In practice, this means
+just three limitations:
+
+=over 4
+
+=item 1
+
+This construct cannot be used within the scope of
+C<use locale> (or the C<E<sol>l> regex modifier).
+
+=item 2
+
+Any
+L<user-defined property|perlunicode/"User-Defined Character Properties">
+used must be already defined by the time the regular expression is
+compiled (but note that this construct can be used instead of such
+properties).
+
+=item 3
+
+A regular expression that otherwise would compile
+using C<E<sol>d> rules, and which uses this construct will instead
+use C<E<sol>u>. Thus this construct tells Perl that you don't want
+C<E<sol>d> rules for the entire regular expression containing it.
+
+=back
+
+Note that skipping white space applies only to the interior of this
+construct. There must not be any space between any of the characters
+that form the initial C<(?[>. Nor may there be space between the
+closing C<])> characters.
+
+Just as in all regular expressions, the pattern can be built up by
+including variables that are interpolated at regex compilation time.
+Care must be taken to ensure that you are getting what you expect. For
+example:
+
+ my $thai_or_lao = '\p{Thai} + \p{Lao}';
+ ...
+ qr/(?[ \p{Digit} & $thai_or_lao ])/;
+
+compiles to
+
+ qr/(?[ \p{Digit} & \p{Thai} + \p{Lao} ])/;
+
+But this does not have the effect that someone reading the code would
+likely expect: the intersection applies just to C<\p{Thai}>, so all
+Laotian characters, digits or not, are then added to the result.
+Pitfalls like this can be avoided by parenthesizing the component pieces:
+
+ my $thai_or_lao = '( \p{Thai} + \p{Lao} )';
+
+But any modifiers will still apply to all the components:
+
+ my $lower = '\p{Lower} + \p{Digit}';
+ qr/(?[ \p{Greek} & $lower ])/i;
+
+matches upper case things. You can avoid surprises by making the
+components into instances of this construct by compiling them:
+
+ my $thai_or_lao = qr/(?[ \p{Thai} + \p{Lao} ])/;
+ my $lower = qr/(?[ \p{Lower} + \p{Digit} ])/;
+
+When these are embedded in another pattern, what they match does not
+change, regardless of parenthesization or what modifiers are in effect
+in that outer pattern.
- $str = "\xDF"; # $str is not in UTF-8 format.
- $str =~ /^\w/; # No match, as $str isn't in UTF-8 format.
- $str .= "\x{0e0b}"; # Now $str is in UTF-8 format.
- $str =~ /^\w/; # Match! $str is now in UTF-8 format.
- chop $str;
- $str =~ /^\w/; # Still a match! $str remains in UTF-8 format.
+Due to the way that Perl parses things, your parentheses and brackets
+may need to be balanced, even including comments. If you run into any
+examples, please send them to C<perlbug@perl.org>, so that we can have a
+concrete example for this man page.
-=cut
+We may change it so that some uses that remain legal in normal bracketed
+character classes become illegal within this experimental
+construct. One proposal, for example, is to forbid adjacent uses of the
+same character, as in C<(?[ [aa] ])>. The motivation for such a change
+is that this usage is likely a typo, as the second "a" adds nothing.