Those not usable within a bracketed character class (like C<[\da-z]>) are marked
as C<Not in [].>
- \000 Octal escape sequence.
+ \000 Octal escape sequence. See also \o{}.
\1 Absolute backreference. Not in [].
\a Alarm or bell.
\A Beginning of string. Not in [].
\L Lowercase till \E. Not in [].
\n (Logical) newline character.
\N Any character but newline. Experimental. Not in [].
- \N{} Named or numbered (Unicode) character.
+ \N{} Named or numbered (Unicode) character or sequence.
+ \o{} Octal escape sequence.
\p{}, \pP Character with the given Unicode property.
\P{}, \PP Character without the given Unicode property.
\Q Quotemeta till \E. Not in [].
=item [1]
-C<\b> is only the backspace character inside a character class. Outside a
+C<\b> is the backspace character only inside a character class. Outside a
character class, C<\b> is a word/non-word boundary.
=item [2]
C<\n> matches a logical newline. Perl will convert between C<\n> and your
-OSses native newline character when reading from or writing to text files.
+OS's native newline character when reading from or writing to text files.
=back
$str =~ /\cK/; # Matches if $str contains a vertical tab (control-K).
-=head3 Named or numbered characters
+=head3 Named or numbered characters and character sequences
-All Unicode characters have a Unicode name and numeric ordinal value. Use the
+Unicode characters have a Unicode name and numeric ordinal value. Use the
C<\N{}> construct to specify a character by either of these values.
+Certain sequences of characters also have names.
-To specify by name, the name of the character goes between the curly braces.
-In this case, you have to C<use charnames> to load the Unicode names of the
-characters, otherwise Perl will complain.
+To specify by name, the name of the character or character sequence goes
+between the curly braces. In this case, you have to C<use charnames> to
+load the Unicode names of the characters, otherwise Perl will complain.
-To specify by Unicode ordinal number, use the form
+To specify a character by Unicode code point, use the form
C<\N{U+I<wide hex character>}>, where I<wide hex character> is a number in
hexadecimal that gives the ordinal number that Unicode has assigned to the
desired character. It is customary (but not required) to use leading zeros to
pad the number to 4 digits. Thus C<\N{U+0041}> means
-C<Latin Capital Letter A>, and you will rarely see it written without the two
-leading zeros. C<\N{U+0041}> means C<A> even on EBCDIC machines (where the
-ordinal value of C<A> is not 0x41).
+C<LATIN CAPITAL LETTER A>, and you will rarely see it written without the two
+leading zeros. C<\N{U+0041}> means "A" even on EBCDIC machines (where the
+ordinal value of "A" is not 0x41).
-It is even possible to give your own names to characters, and even to short
-sequences of characters. For details, see L<charnames>.
+It is even possible to give your own names to characters and character
+sequences. For details, see L<charnames>.
(There is an expanded internal form that you may see in debug output:
C<\N{U+I<wide hex character>.I<wide hex character>...}>.
Mnemonic: I<N>amed character.
-Note that a character that is expressed as a named or numbered character is
-considered as a character without special meaning by the regex engine, and will
-match "as is".
+Note that a character or character sequence that is expressed as a named
+or numbered character is considered as a character without special
+meaning by the regex engine, and will match "as is".
=head4 Example
=head3 Octal escapes
-Octal escapes consist of a backslash followed by two or three octal digits
-matching the code point of the character you want to use. This allows for
-512 characters (C<\00> up to C<\777>) that can be expressed this way (but
-anything above C<\377> is deprecated).
-Enough in pre-Unicode days, but most Unicode characters cannot be escaped
-this way.
+There are two forms of octal escapes. Each is used to specify a character by
+its ordinal, specified in octal notation.
+
+One form, available starting in Perl 5.14, looks like C<\o{...}>, where the dots
+represent one or more octal digits. It can be used for any Unicode character.
+
+It was introduced to avoid the potential problems with the other form,
+available in all Perls. That form consists of a backslash followed by three
+octal digits. One problem with this form is that it can look exactly like an
+old-style backreference (see
+L</Disambiguation rules between old-style octal escapes and backreferences>
+below). You can avoid this by making the first of the three digits always a
+zero, but that makes \077 the largest ordinal unambiguously specifiable by this
+form.
+
+In some contexts, a backslash followed by two or even one octal digits may be
+interpreted as an octal escape, sometimes with a warning, and because of some
+bugs, sometimes with surprising results. Also, if you are creating a regex
+out of smaller snippets concatenated together, and you use fewer than three
+digits, the beginning of one snippet may be interpreted as adding digits to the
+ending of the snippet before it. See L</Absolute referencing> for more
+discussion and examples of the snippet problem.
Note that a character that is expressed as an octal escape is considered
as a character without special meaning by the regex engine, and will match
"as is".
-=head4 Examples (assuming an ASCII platform)
+To summarize, the C<\o{}> form is always safe to use, and the other form is
+safe to use for ordinals up through \077 when you use exactly three digits to
+specify them.
- $str = "Perl";
- $str =~ /\120/; # Match, "\120" is "P".
- $str =~ /\120+/; # Match, "\120" is "P", it is repeated at least once
- $str =~ /P\053/; # No match, "\053" is "+" and taken literally.
+Mnemonic: I<0>ctal or I<o>ctal.
-=head4 Caveat
+=head4 Examples (assuming an ASCII platform)
-Octal escapes potentially clash with old-style backreferences (see L</Absolute
-referencing> below). They both consist of a backslash followed by numbers. So
-Perl has to use heuristics to determine whether it is a backreference or an
-octal escape. Perl uses the following rules:
+ $str = "Perl";
+ $str =~ /\o{120}/; # Match, "\120" is "P".
+ $str =~ /\120/; # Same.
+ $str =~ /\o{120}+/; # Match, "\120" is "P", it's repeated at least once
+ $str =~ /\120+/; # Same.
+ $str =~ /P\053/; # No match, "\053" is "+" and taken literally.
+ /\o{23073}/ # Black foreground, white background smiling face.
+ /\o{4801234567}/ # Raises a warning, and yields chr(4)
+
+=head4 Disambiguation rules between old-style octal escapes and backreferences
+
+Octal escapes of the C<\000> form outside of bracketed character classes
+potentially clash with old-style backreferences (see L</Absolute referencing>
+below). They both consist of a backslash followed by numbers. So Perl has to
+use heuristics to determine whether it is a backreference or an octal escape.
+Perl uses the following rules to disambiguate:
=over 4
=item 3
-If the number following the backslash is N (decimal), and Perl already has
+If the number following the backslash is N (in decimal), and Perl already has
seen N capture groups, Perl will consider this to be a backreference.
-Otherwise, it will consider it to be an octal escape. Note that if N > 999,
-Perl only takes the first three digits for the octal escape; the rest is
-matched as is.
+Otherwise, it will consider it to be an octal escape. Note that if N has more
+than three digits, Perl only takes the first three for the octal escape;
+the rest are matched as is.
my $pat = "(" x 999;
$pat .= "a";
$pat .= ")" x 999;
/^($pat)\1000$/; # Matches 'aa'; there are 1000 capture groups.
/^$pat\1000$/; # Matches 'a@0'; there are 999 capture groups
- # and \1000 is seen as \100 (a '@') and a '0'.
+ # and \1000 is seen as \100 (a '@') and a '0'
=back
+You can always force a backreference interpretation by using the C<\g{...}>
+form. You can always force an octal interpretation by using the C<\o{...}>
+form, or for numbers up through \077 (= 63 decimal), by using three digits,
+beginning with a "0".
+
=head3 Hexadecimal escapes
-Hexadecimal escapes start with C<\x> and are then either followed by a
-two digit hexadecimal number, or a hexadecimal number of arbitrary length
-surrounded by curly braces. The hexadecimal number is the code point of
-the character you want to express.
+As with octal escapes, there are two forms of hexadecimal escapes, but both start
+with the same thing, C<\x>. This is followed by either exactly two hexadecimal
+digits forming a number, or a hexadecimal number of arbitrary length surrounded
+by curly braces. The hexadecimal number is the code point of the character you
+want to express.
-Note that a character that is expressed as a hexadecimal escape is considered
+Note that a character that is expressed as one of these escapes is considered
as a character without special meaning by the regex engine, and will match
"as is".
discuss those here; full details of character classes can be found in
L<perlrecharclass>.
-C<\w> is a character class that matches any single I<word> character (letters,
-digits, underscore). C<\d> is a character class that matches any decimal digit,
-while the character class C<\s> matches any whitespace character.
+C<\w> is a character class that matches any single I<word> character
+(letters, digits, Unicode marks, and connector punctuation (like the
+underscore)). C<\d> is a character class that matches any decimal
+digit, while the character class C<\s> matches any whitespace character.
New in perl 5.10.0 are the classes C<\h> and C<\v> which match horizontal
and vertical whitespace characters.
The uppercase variants (C<\W>, C<\D>, C<\S>, C<\H>, and C<\V>) are
-character classes that match any character that isn't a word character,
-digit, whitespace, horizontal whitespace nor vertical whitespace.
+character classes that match, respectively, any character that isn't a
+word character, digit, whitespace, horizontal whitespace, or vertical
+whitespace.
Mnemonics: I<w>ord, I<d>igit, I<s>pace, I<h>orizontal, I<v>ertical.
is a positive (unsigned) decimal number of any length is an absolute reference
to a capturing group.
-I<N> refers to the Nth set of parentheses - so C<\gI<N>> refers to whatever has
-been matched by that set of parenthesis. Thus C<\g1> refers to the first
+I<N> refers to the Nth set of parentheses, so C<\gI<N>> refers to whatever has
+been matched by that set of parentheses. Thus C<\g1> refers to the first
capture group in the regex.
The C<\gI<N>> form can be equivalently written as C<\g{I<N>}>
=item \A
C<\A> only matches at the beginning of the string. If the C</m> modifier
-isn't used, then C</\A/> is equivalent with C</^/>. However, if the C</m>
+isn't used, then C</\A/> is equivalent to C</^/>. However, if the C</m>
modifier is used, then C</^/> matches internal newlines, but the meaning
of C</\A/> isn't changed by the C</m> modifier. C<\A> matches at the beginning
of the string regardless whether the C</m> modifier is used.
=item \z, \Z
C<\z> and C<\Z> match at the end of the string. If the C</m> modifier isn't
-used, then C</\Z/> is equivalent with C</$/>, that is, it matches at the
+used, then C</\Z/> is equivalent to C</$/>, that is, it matches at the
end of the string, or before the newline at the end of the string. If the
C</m> modifier is used, then C</$/> matches at internal newlines, but the
meaning of C</\Z/> isn't changed by the C</m> modifier. C<\Z> matches at
the meaning of C<.>, but not C<\N>.
Note that C<\N{...}> can mean a
-L<named or numbered character|/Named or numbered characters>.
+L<named or numbered character
+|/Named or numbered characters and character sequences>.
Mnemonic: Complement of I<\n>.