character class "..." within the outer bracketed
character class. Example: [[:upper:]] matches any
uppercase character.
+ (?[...]) [8] Extended bracketed character class
\w [3] Match a "word" character (alphanumeric plus "_", plus
other connector punctuation chars plus Unicode
marks)
when of the form C<\N{U+I<hex>}>, it matches the character whose Unicode
code point is I<hex>. Otherwise it matches any character but C<\n>.
+=item [8]
+
+See L<perlrecharclass/Extended Bracketed Character Classes> for details.
+
=back
=head3 Assertions
PAT{min,max}+ (?>PAT{min,max})
=item C<(?[ ])>
-X<set operations>
-
-This is a fancy bracketed character class that can be used for more
-readable and less error-prone classes, and to perform set operations,
-such as intersection. An example is
-
- /(?[ \p{Thai} & \p{Digit} ])/
-
-This will match all the digit characters that are in the Thai script.
-
-This is an experimental feature available starting in 5.18, but is
-subject to change as we gain field experience with it. Any attempt to
-use it will raise a warning, unless disabled via
-
- no warnings "experimental::regex_sets";
-
-Comments on this feature are welcome; send email to
-C<perl5-porters@perl.org>.
-
-We can extend the example above:
-
- /(?[ ( \p{Thai} + \p{Lao} ) & \p{Digit} ])/
-
-This matches digits that are in either the Thai or Laotian scripts.
-
-Notice the white space in these examples. This construct always has
-L</C<E<sol>x>> turned on.
-
-The available binary operators are:
-
- & intersection
- + union
- | another name for '+', hence means union
- - subtraction (the result matches the set consisting of those
- code points matched by the first operand, excluding any that
- are also matched by the second operand)
- ^ symmetric difference (the union minus the intersection). This
- is like an exclusive or, in that the result is the set of code
- points that are matched by either, but not both, of the
- operands.
-
-There is one unary operator:
-
- ! complement
-
-All the binary operators left associate, and are of equal precedence.
-The unary operator right associates, and has higher precedence. Use
-parentheses to override the default associations. Some feedback we've
-received indicates a desire for intersection to have higher precedence
-than union. This is something that feedback from the field may cause us
-to change in future releases; you may want to parenthesize copiously to
-avoid such changes affecting your code, until this feature is no longer
-considered experimental.
-
-The main restriction is that everything is a metacharacter. Thus,
-you cannot refer to single characters by doing something like this:
-
- /(?[ a + b ])/ # Syntax error!
-
-The easiest way to specify an individual typable character is to enclose
-it in brackets:
-
- /(?[ [a] + [b] ])/
-
-(This is the same thing as C<[ab]>.) You could also have said the
-equivalent:
-
- /(?[[ a b ]])/
-
-(You can, of course, specify single characters by using, C<\x{ }>,
-C<\N{ }>, etc.)
-
-This last example shows the use of this construct to specify an ordinary
-bracketed character class without additional set operations. Note the
-white space within it; L</E<sol>x> is turned on even within bracketed
-character classes, except you can't have comments inside them. Hence,
-
- (?[ [#] ])
-
-matches the literal character "#". To specify a literal white space character,
-you can escape it with a backslash, like:
-
- /(?[ [ a e i o u \ ] ])/
-
-This matches the English vowels plus the SPACE character.
-All the other escapes accepted by normal bracketed character classes are
-accepted here as well; but unrecognized escapes that generate warnings
-in normal classes are fatal errors here.
-
-All warnings from these class elements are fatal, as well as some
-practices that don't currently warn. For example you cannot say
-
- /(?[ [ \xF ] ])/ # Syntax error!
-
-You have to have two hex digits after a braceless C<\x> (use a leading
-zero to make two). These restrictions are to lower the incidence of
-typos causing the class to not match what you thought it would.
-
-The final difference between regular bracketed character classes and
-these, is that it is not possible to get these to match a
-multi-character fold. Thus,
-
- /(?[ [\xDF] ])/iu
-
-does not match the string C<ss>.
-
-You don't have to enclose Posix class names inside double brackets,
-hence both of the following work:
-
- /(?[ [:word:] - [:lower:] ])/
- /(?[ [[:word:]] - [[:lower:]] ])/
-
-The Posix character classes, including things like C<\w> and C<\D>
-respect the L</E<sol>a (and E<sol>aa)> modifiers.
-
-C<< (?[ ]) >> is a regex-compile-time construct. Any attempt to use
-something which isn't knowable at the time the containing regular
-expression is compiled is a fatal error. In practice, this means
-just three limitiations:
-
-=over 4
-
-=item 1
-
-This construct cannot be used within the scope of
-C<use locale> (or the L</C<E<sol>l>> regex modifier).
-
-=item 2
-
-Any
-L<user-defined property|perlunicode/"User-Defined Character Properties">
-used must be already defined by the time the regular expression is
-compiled (but note that this construct can be used instead of such
-properties).
-
-=item 3
-
-A regular expression that otherwise would compile
-using L</C<E<sol>d>> rules, and which uses this construct will instead
-use L</C<E<sol>u>>. Thus this construct tells Perl that you don't want
-L</E<sol>d> rules for the entire regular expression containing it.
-
-=back
-
-The L</C<E<sol>x>> processing within this class is an extended form.
-Besides the characters that are considered white space in normal C</x>
-processing, there are 5 others, recommended by the Unicode standard:
-
- U+0085 NEXT LINE
- U+200E LEFT-TO-RIGHT MARK
- U+200F RIGHT-TO-LEFT MARK
- U+2028 LINE SEPARATOR
- U+2029 PARAGRAPH SEPARATOR
-
-Note that skipping white space applies only to the interior of this
-construct. There must not be any space between any of the characters
-that form the initial C<(?[>. Nor may there be space between the
-closing C<])> characters.
-
-Just as in all regular expressions, the pattern can can be built up by
-including variables that are interpolated at regex compilation time.
-Care must be taken to ensure that you are getting what you expect. For
-example:
-
- my $thai_or_lao = '\p{Thai} + \p{Lao}';
- ...
- qr/(?[ \p{Digit} & $thai_or_lao ])/;
-
-compiles to
-
- qr/(?[ \p{Digit} & \p{Thai} + \p{Lao} ])/;
-
-But this does not have the effect that someone reading the code would
-likely expect, as the intersection applies just to C<\p{Thai}>,
-excluding the Laotian. Pitfalls like this can be avoided by
-parenthesizing the component pieces:
-
- my $thai_or_lao = '( \p{Thai} + \p{Lao} )';
-
-But any modifiers will still apply to all the components:
-
- my $lower = '\p{Lower} + \p{Digit}';
- qr/(?[ \p{Greek} & $lower ])/i;
-
-matches upper case things. You can avoid surprises by making the
-components into instances of this construct by compiling them:
-
- my $thai_or_lao = qr/(?[ \p{Thai} + \p{Lao} ])/;
- my $lower = qr/(?[ \p{Lower} + \p{Digit} ])/;
-
-When these are embedded in another pattern, what they match does not
-change, regardless of parenthesization or what modifiers are in effect
-in that outer pattern.
-
-Due to the way that Perl parses things, your parentheses and brackets
-may need to be balanced, even including comments. If you run into any
-examples, please send them to C<perlbug@perl.org>, so that we can have a
-concrete example for this man page.
-We may change it so that things that remain legal uses in normal bracketed
-character classes might become illegal within this experimental
-construct. One proposal, for example, is to forbid adjacent uses of the
-same character, as in C<(?[ [aa] ])>. The motivation for such a change
-is that this usage is likely a typo, as the second "a" adds nothing.
+See L<perlrecharclass/Extended Bracketed Character Classes>.
=back
# matches anything that isn't a hex digit.
# The OR adds the digits, leaving only the
# letters 'a' to 'f' and 'A' to 'F' excluded.
+
+=head3 Extended Bracketed Character Classes
+X<character class>
+X<set operations>
+
+This is a fancy bracketed character class that can be used for more
+readable and less error-prone classes, and to perform set operations,
+such as intersection. An example is
+
+ /(?[ \p{Thai} & \p{Digit} ])/
+
+This will match all the digit characters that are in the Thai script.
+
+This is an experimental feature available starting in 5.18, and is
+subject to change as we gain field experience with it. Any attempt to
+use it will raise a warning, unless disabled via
+
+ no warnings "experimental::regex_sets";
+
+Comments on this feature are welcome; send email to
+C<perl5-porters@perl.org>.
+
+We can extend the example above:
+
+ /(?[ ( \p{Thai} + \p{Lao} ) & \p{Digit} ])/
+
+This matches digits that are in either the Thai or Laotian scripts.
+
+Notice the white space in these examples. This construct always has
+the C<E<sol>x> modifier turned on.
+
+The available binary operators are:
+
+ & intersection
+ + union
+ | another name for '+', hence means union
+ - subtraction (the result matches the set consisting of those
+ code points matched by the first operand, excluding any that
+ are also matched by the second operand)
+ ^ symmetric difference (the union minus the intersection). This
+ is like an exclusive or, in that the result is the set of code
+ points that are matched by either, but not both, of the
+ operands.
+
+There is one unary operator:
+
+ ! complement
+
+All the binary operators left associate, and are of equal precedence.
+The unary operator right associates, and has higher precedence. Use
+parentheses to override the default associations. Some feedback we've
+received indicates a desire for intersection to have higher precedence
+than union. This is something that feedback from the field may cause us
+to change in future releases; you may want to parenthesize copiously to
+avoid such changes affecting your code, until this feature is no longer
+considered experimental.
+
+The main restriction is that everything is a metacharacter. Thus,
+you cannot refer to single characters by doing something like this:
+
+ /(?[ a + b ])/ # Syntax error!
+
+The easiest way to specify an individual typable character is to enclose
+it in brackets:
+
+ /(?[ [a] + [b] ])/
+
+(This is the same thing as C<[ab]>.) You could also have said the
+equivalent:
+
+ /(?[[ a b ]])/
+
+(You can, of course, specify single characters by using C<\x{ }>,
+C<\N{ }>, etc.)
+
+This last example shows the use of this construct to specify an ordinary
+bracketed character class without additional set operations. Note the
+white space within it; C<E<sol>x> is turned on even within bracketed
+character classes, except you can't have comments inside them. Hence,
+
+ (?[ [#] ])
+
+matches the literal character "#". To specify a literal white space character,
+you can escape it with a backslash, like:
+
+ /(?[ [ a e i o u \ ] ])/
+
+This matches the English vowels plus the SPACE character.
+All the other escapes accepted by normal bracketed character classes are
+accepted here as well; but unrecognized escapes that generate warnings
+in normal classes are fatal errors here.
+
+All warnings from these class elements are fatal, as are some
+practices that don't currently warn. For example, you cannot say
+
+ /(?[ [ \xF ] ])/ # Syntax error!
+
+You have to have two hex digits after a braceless C<\x> (use a leading
+zero to make two). These restrictions are to lower the incidence of
+typos causing the class to not match what you thought it would.
+
+The final difference between regular bracketed character classes and
+these is that it is not possible to get these to match a
+multi-character fold. Thus,
+
+ /(?[ [\xDF] ])/iu
+
+does not match the string C<ss>.
+
+You don't have to enclose POSIX class names inside double brackets,
+hence both of the following work:
+
+ /(?[ [:word:] - [:lower:] ])/
+ /(?[ [[:word:]] - [[:lower:]] ])/
+
+Any contained POSIX character classes, including things like C<\w> and C<\D>,
+respect the C<E<sol>a> (and C<E<sol>aa>) modifiers.
+
+C<< (?[ ]) >> is a regex-compile-time construct. Any attempt to use
+something which isn't knowable at the time the containing regular
+expression is compiled is a fatal error. In practice, this means
+just three limitations:
+
+=over 4
+
+=item 1
+
+This construct cannot be used within the scope of
+C<use locale> (or the C<E<sol>l> regex modifier).
+
+=item 2
+
+Any
+L<user-defined property|perlunicode/"User-Defined Character Properties">
+used must be already defined by the time the regular expression is
+compiled (but note that this construct can be used instead of such
+properties).
+
+=item 3
+
+A regular expression that otherwise would compile
+using C<E<sol>d> rules, and which uses this construct, will instead
+use C<E<sol>u>. Thus this construct tells Perl that you don't want
+C<E<sol>d> rules for the entire regular expression containing it.
+
+=back
+
+The C<E<sol>x> processing within this class is an extended form.
+Besides the characters that are considered white space in normal C<E<sol>x>
+processing, there are 5 others, recommended by the Unicode standard:
+
+ U+0085 NEXT LINE
+ U+200E LEFT-TO-RIGHT MARK
+ U+200F RIGHT-TO-LEFT MARK
+ U+2028 LINE SEPARATOR
+ U+2029 PARAGRAPH SEPARATOR
+
+Note that skipping white space applies only to the interior of this
+construct. There must not be any space between any of the characters
+that form the initial C<(?[>. Nor may there be space between the
+closing C<])> characters.
+
+Just as in all regular expressions, the pattern can be built up by
+including variables that are interpolated at regex compilation time.
+Care must be taken to ensure that you are getting what you expect. For
+example:
+
+ my $thai_or_lao = '\p{Thai} + \p{Lao}';
+ ...
+ qr/(?[ \p{Digit} & $thai_or_lao ])/;
+
+compiles to
+
+ qr/(?[ \p{Digit} & \p{Thai} + \p{Lao} ])/;
+
+But this does not have the effect that someone reading the code would
+likely expect, as the intersection applies just to C<\p{Thai}>,
+excluding the Laotian. Pitfalls like this can be avoided by
+parenthesizing the component pieces:
+
+ my $thai_or_lao = '( \p{Thai} + \p{Lao} )';
+
+But any modifiers will still apply to all the components:
+
+ my $lower = '\p{Lower} + \p{Digit}';
+ qr/(?[ \p{Greek} & $lower ])/i;
+
+matches upper case things, since the C<E<sol>i> modifier applies to the
+interpolated C<\p{Lower}> as well. You can avoid surprises by making the
+components into instances of this construct by compiling them:
+
+ my $thai_or_lao = qr/(?[ \p{Thai} + \p{Lao} ])/;
+ my $lower = qr/(?[ \p{Lower} + \p{Digit} ])/;
+
+When these are embedded in another pattern, what they match does not
+change, regardless of parenthesization or what modifiers are in effect
+in that outer pattern.
+
+Due to the way that Perl parses things, your parentheses and brackets
+may need to be balanced, even including comments. If you run into any
+examples, please send them to C<perlbug@perl.org>, so that we can have a
+concrete example for this man page.
+
+We may change things so that uses that remain legal in normal bracketed
+character classes become illegal within this experimental
+construct. One proposal, for example, is to forbid adjacent uses of the
+same character, as in C<(?[ [aa] ])>. The motivation for such a change
+is that this usage is likely a typo, as the second "a" adds nothing.