\1 Absolute backreference. Not in [].
\a Alarm or bell.
\A Beginning of string. Not in [].
- \b Word/non-word boundary. (Backspace in []).
- \B Not a word/non-word boundary. Not in [].
- \cX Control-X
+ \b{}, \b Boundary. (\b is a backspace in []).
+ \B{}, \B Not a boundary.
+ \cX Control-X.
\C Single octet, even under UTF-8. Not in [].
+ (Deprecated)
\d Character class for digits.
\D Character class for non-digits.
\e Escape character.
\E Turn off \Q, \L and \U processing. Not in [].
\f Form feed.
- \g{}, \g1 Named, absolute or relative backreference. Not in []
+ \F Foldcase till \E. Not in [].
+ \g{}, \g1 Named, absolute or relative backreference.
+ Not in [].
\G Pos assertion. Not in [].
\h Character class for horizontal whitespace.
\H Character class for non horizontal whitespace.
\l Lowercase next character. Not in [].
\L Lowercase till \E. Not in [].
\n (Logical) newline character.
- \N Any character but newline. Experimental. Not in [].
+ \N Any character but newline. Not in [].
\N{} Named or numbered (Unicode) character or sequence.
\o{} Octal escape sequence.
\p{}, \pP Character with the given Unicode property.
\P{}, \PP Character without the given Unicode property.
- \Q Quotemeta till \E. Not in [].
+ \Q Quote (disable) pattern metacharacters till \E. Not
+ in [].
\r Return character.
\R Generic new line. Not in [].
\s Character class for whitespace.
=item [1]
C<\b> is the backspace character only inside a character class. Outside a
-character class, C<\b> is a word/non-word boundary.
+character class, C<\b> alone is a word-character/non-word-character
+boundary, and C<\b{}> is some other type of boundary.
=item [2]
Certain sequences of characters also have names.
To specify by name, the name of the character or character sequence goes
-between the curly braces. In this case, you have to C<use charnames> to
-load the Unicode names of the characters; otherwise Perl will complain.
+between the curly braces.
To specify a character by Unicode code point, use the form C<\N{U+I<code
point>}>, where I<code point> is a number in hexadecimal that gives the
=head4 Example
- use charnames ':full'; # Loads the Unicode names.
$str =~ /\N{THAI CHARACTER SO SO}/; # Matches the Thai SO SO character
use charnames 'Cyrillic'; # Loads Cyrillic names.
$str = "Perl";
$str =~ /\o{120}/; # Match, "\120" is "P".
$str =~ /\120/; # Same.
- $str =~ /\o{120}+/; # Match, "\120" is "P", it's repeated at least once
+ $str =~ /\o{120}+/; # Match, "\120" is "P",
+ # it's repeated at least once.
$str =~ /\120+/; # Same.
$str =~ /P\053/; # No match, "\053" is "+" and taken literally.
/\o{23073}/ # Black foreground, white background smiling face.
- /\o{4801234567}/ # Raises a warning, and yields chr(4)
+ /\o{4801234567}/ # Raises a warning, and yields chr(4).
=head4 Disambiguation rules between old-style octal escapes and backreferences
Octal escapes of the C<\000> form outside of bracketed character classes
-potentially clash with old-style backreferences. (see L</Absolute referencing>
+potentially clash with old-style backreferences (see L</Absolute referencing>
below). They both consist of a backslash followed by numbers. So Perl has to
use heuristics to determine whether it is a backreference or an octal escape.
Perl uses the following rules to disambiguate:
$pat .= ")" x 999;
/^($pat)\1000$/; # Matches 'aa'; there are 1000 capture groups.
/^$pat\1000$/; # Matches 'a@0'; there are 999 capture groups
- # and \1000 is seen as \100 (a '@') and a '0'
+ # and \1000 is seen as \100 (a '@') and a '0'.
=back
C<\E>, whichever comes first. They provide functionality similar to what
the functions C<lc> and C<uc> provide.
-C<\Q> is used to escape all characters following, up to the next C<\E>
-or the end of the pattern. C<\Q> adds a backslash to any character that
-isn't a letter, digit, or underscore. This ensures that any character
-between C<\Q> and C<\E> shall be matched literally, not interpreted
-as a metacharacter by the regex engine.
+C<\Q> is used to quote (disable) pattern metacharacters, up to the next
+C<\E> or the end of the pattern. C<\Q> adds a backslash to any character
+that could have special meaning to Perl. In the ASCII range, it quotes
+every character that isn't a letter, digit, or underscore. See
+L<perlfunc/quotemeta> for details on what gets quoted for non-ASCII
+code points. Using this ensures that any character between C<\Q> and
+C<\E> will be matched literally, not interpreted as a metacharacter by
+the regex engine.
-Mnemonic: I<L>owercase, I<U>ppercase, I<Q>uotemeta, I<E>nd.
+C<\F> can be used to casefold all characters following, up to the next C<\E>
+or the end of the pattern. It provides functionality similar to that of
+the C<fc> function.
+
+Mnemonic: I<L>owercase, I<U>ppercase, I<F>old-case, I<Q>uotemeta, I<E>nd.
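
The effect of C<\Q> and C<\F> on interpolated text can be sketched as follows. The variable names and sample strings are illustrative assumptions only, and C<\F> needs perl v5.16 or later.

```perl
use strict;
use warnings;

# Hypothetical user-supplied text containing regex metacharacters.
my $input = 'a.b*c';

# Without \Q, "." and "*" keep their special meaning.
my $unquoted_hit = 'aXbbbc' =~ /$input/ ? 1 : 0;

# With \Q ... \E the interpolated text must appear literally.
my $quoted_miss = 'aXbbbc'  =~ /\Q$input\E/ ? 1 : 0;
my $quoted_hit  = 'xa.b*cx' =~ /\Q$input\E/ ? 1 : 0;

# \F casefolds the interpolated text before the pattern is compiled
# (assumes perl v5.16 or later).
my $name = 'HeLLo';
my $folded_hit = 'hello world' =~ /^\F$name\E/ ? 1 : 0;

print "$unquoted_hit $quoted_miss $quoted_hit $folded_hit\n";
```

Only the unquoted pattern treats C<.> and C<*> as metacharacters; the quoted form matches just the literal text.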
=head4 Examples
=head4 Examples
/(\w+) \g1/; # Finds a duplicated word, (e.g. "cat cat").
- /(\w+) \1/; # Same thing; written old-style
+ /(\w+) \1/; # Same thing; written old-style.
/(.)(.)\g2\g1/; # Match a four letter palindrome (e.g. "ABBA").
Mnemonic: I<G>lobal.
-=item \b, \B
+=item \b{}, \b, \B{}, \B
+
+C<\b{...}>, available starting in v5.22, matches a boundary (between two
+characters, or before the first character of the string, or after the
+final character of the string) based on the Unicode rules for the
+boundary type specified inside the braces. The currently known boundary
+types are given a few paragraphs below. C<\B{...}> matches at any place
+between characters where C<\b{...}> of the same type doesn't match.
+
+C<\b> when not immediately followed by a C<"{"> matches at any place
+between a word (something matched by C<\w>) and a non-word character
+(C<\W>); C<\B> when not immediately followed by a C<"{"> matches at any
+place between characters where C<\b> doesn't match. To get better
+word matching of natural language text, see L<\b{wb}> below.
-C<\b> matches at any place between a word and a non-word character; C<\B>
-matches at any place between characters where C<\b> doesn't match. C<\b>
+C<\b>
and C<\B> assume there's a non-word character before the beginning and after
the end of the source string; so C<\b> will match at the beginning (or end)
of the source string if the source string begins (or ends) with a word
Do not use something like C<\b=head\d\b> and expect it to match the
beginning of a line. It can't, because for there to be a boundary before
the non-word "=", there must be a word character immediately previous.
-All boundary determinations look for word characters alone, not for
-non-words characters nor for string ends. It may help to understand how
+All plain C<\b> and C<\B> boundary determinations look for word
+characters alone, not for
+non-word characters nor for string ends. It may help to understand how
C<\b> and C<\B> work by equating them as follows:
\b really means (?:(?<=\w)(?!\w)|(?<!\w)(?=\w))
\B really means (?:(?<=\w)(?=\w)|(?<!\w)(?!\w))
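
One way to check the equivalence for plain C<\b> is to collect the offsets at which each form matches. This is a minimal sketch; the sample string is an assumption chosen to echo the C<=head> case described above.

```perl
use strict;
use warnings;

my $text = '=head1 foo';

# Offsets at which plain \b matches.
my @plain;
push @plain, pos($text) while $text =~ /\b/g;

# Offsets at which the lookaround expansion matches.
my @expanded;
push @expanded, pos($text)
    while $text =~ /(?:(?<=\w)(?!\w)|(?<!\w)(?=\w))/g;

print "plain:    @plain\n";
print "expanded: @expanded\n";
```

Both lists come out as 1, 6, 7 and 10. Offset 0 is absent because the non-word "=" at the start of the string has no word character before it.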
+In contrast, C<\b{...}> may or may not match at the beginning and end of
+the line depending on the boundary type (and C<\B{...}> never does).
+The boundary types currently available are:
+
+=over
+
+=item C<\b{gcb}> or C<\b{g}>
+
+This matches a Unicode "Grapheme Cluster Boundary". (Actually Perl
+always uses the improved "extended" grapheme cluster). These are
+explained below under L</C<\X>>. In fact, C<\X> is another way to get
+the same functionality. It is equivalent to C</.+?\b{gcb}/>. Use
+whichever is most convenient for your situation.
+
+=item C<\b{sb}>
+
+This matches a Unicode "Sentence Boundary". This is an aid to parsing
+natural language sentences. It gives good, but imperfect results. For
+example, it thinks that "Mr. Smith" is two sentences. More details are
+at L<http://www.unicode.org/reports/tr29/>.
+
+=item C<\b{wb}>
+
+This matches a Unicode "Word Boundary". This gives better (though not
+perfect) results for natural language processing than plain C<\b>
+(without braces) does. For example, it understands that apostrophes can
+be in the middle of words. More details are at
+L<http://www.unicode.org/reports/tr29/>.
+
+=back
+
Mnemonic: I<b>oundary.
=back
print $1; # Prints 'cat'
}
+ print join ",", "I don't care" =~ m/ ( .+? \b{wb} ) /xg;
+ prints
+ I, ,don't, ,care
+
=head2 Misc
Here we document the backslash sequences that don't fall in one of the
=item \C
-C<\C> always matches a single octet, even if the source string is encoded
+(Deprecated.) C<\C> always matches a single octet, even if the source
+string is encoded
in UTF-8 format, and the character to be matched is a multi-octet character.
-C<\C> was introduced in perl 5.6. This is very dangerous, because it violates
+This is very dangerous, because it violates
the logical character abstraction and can cause UTF-8 sequences to become malformed.
+Use C<utf8::encode()> instead.
+
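
A minimal sketch of inspecting the octets of a character with C<utf8::encode()> rather than C<\C>. The variable names are illustrative assumptions.

```perl
use strict;
use warnings;

my $char = "\x{256}";    # a single character

# utf8::encode() converts its argument in place, so work on a copy.
my $octets = $char;
utf8::encode($octets);

# Each element of @hex is one octet of the UTF-8 encoding.
my @hex = map { sprintf '%02X', ord } split //, $octets;

printf "%d character, %d octets: %s\n",
    length($char), length($octets), "@hex";
```

The original string is left untouched, so the logical character abstraction is preserved while the copy exposes the encoded octets.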
Mnemonic: oI<C>tet.
=item \K
=item \N
-This is an experimental feature new to perl 5.12.0. It matches any character
+This feature, available starting in v5.12, matches any character
that is B<not> a newline. It is a short-hand for writing C<[^\n]>, and is
identical to the C<.> metasymbol, except under the C</s> flag, which changes
the meaning of C<.>, but not C<\N>.
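
The difference under C</s> can be sketched by counting matches. The sample string is an illustrative assumption.

```perl
use strict;
use warnings;

my $s = "a\nb";

# Under /s, "." also matches the newline...
my $dot_count = () = $s =~ /./sg;

# ...but \N still refuses to match it.
my $n_count = () = $s =~ /\N/sg;

print "dot: $dot_count, \\N: $n_count\n";
```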
inside a bracketed character class; C</[\R]/> is an error; use C<\v>
instead. C<\R> was introduced in perl 5.10.0.
+Note that this does not respect any locale that might be in effect; it
+matches according to the platform's native character set.
+
Mnemonic: none really. C<\R> was picked because PCRE already uses C<\R>,
and more importantly because Unicode recommends such a regular expression
metacharacter, and suggests C<\R> as its notation.
UPWARDS ARROW BELOW", and would be displayed by Unicode-aware software as if it
were a single character.
+The match is greedy and non-backtracking, so that the cluster is never
+broken up into smaller components.
+
+See also L<C<\b{gcb}>|/\b{}, \b, \B{}, \B>.
+
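
As a rough illustration, the sketch below matches one grapheme cluster with C<\X> and again with C<\b{gcb}>. It assumes perl v5.22 or later for the C<\b{gcb}> form, and the sample string is an assumption.

```perl
use strict;
use warnings;

# "e" followed by COMBINING ACUTE ACCENT, then a plain "x".
my $s = "e\x{301}x";

# \X takes the base character together with its combining mark.
my ($cluster) = $s =~ /^(\X)/;

# The same cluster, found via a grapheme cluster boundary
# (assumes perl v5.22 or later).
my ($via_gcb) = $s =~ /^(.+?\b{gcb})/;

printf "cluster length: %d, via gcb: %d\n",
    length($cluster), length($via_gcb);
```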
Mnemonic: eI<X>tended Unicode character.
=back
=head4 Examples
- "\x{256}" =~ /^\C\C$/; # Match as chr (0x256) takes 2 octets in UTF-8.
-
$str =~ s/foo\Kbar/baz/g; # Change any 'bar' following a 'foo' to 'baz'
$str =~ s/(.)\K\g1//g; # Delete duplicated characters.
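
The two C<\K> substitutions above can be tried on sample input as follows. The sample strings are illustrative assumptions.

```perl
use strict;
use warnings;

# Only the text matched after \K is replaced; "foo" is kept.
my $first = 'foobar barfoo';
$first =~ s/foo\Kbar/baz/g;

# The captured character is kept; its immediate repeat is removed.
my $second = 'bookkeeper';
$second =~ s/(.)\K\g1//g;

print "$first\n$second\n";
```

The standalone "bar" in the first string is untouched because it does not follow a "foo".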