X-Git-Url: https://perl5.git.perl.org/perl5.git/blobdiff_plain/5f2b17ca5c1552a58b181b8cb3d24e5b2d2c74de..fa861958788779f82cb4c14d67b583bc18a75ef9:/pod/perlrebackslash.pod diff --git a/pod/perlrebackslash.pod b/pod/perlrebackslash.pod index 71e7c06..461ebd9 100644 --- a/pod/perlrebackslash.pod +++ b/pod/perlrebackslash.pod @@ -16,7 +16,6 @@ Most sequences are described in detail in different documents; the primary purpose of this document is to have a quick reference guide describing all backslash and escape sequences. - =head2 The backslash In a regular expression, the backslash can perform one of two tasks: @@ -25,19 +24,18 @@ it either takes away the special meaning of the character following it or it is the start of a backslash or escape sequence. The rules determining what it is are quite simple: if the character -following the backslash is a punctuation (non-word) character (that is, -anything that is not a letter, digit or underscore), then the backslash -just takes away the special meaning (if any) of the character following -it. - -If the character following the backslash is a letter or a digit, then the -sequence may be special; if so, it's listed below. A few letters have not -been used yet, and escaping them with a backslash is safe for now, but a -future version of Perl may assign a special meaning to it. However, if you -have warnings turned on, Perl will issue a warning if you use such a sequence. -[1]. - -It is however garanteed that backslash or escape sequences never have a +following the backslash is an ASCII punctuation (non-word) character (that is, +anything that is not a letter, digit or underscore), then the backslash just +takes away the special meaning (if any) of the character following it. + +If the character following the backslash is an ASCII letter or an ASCII digit, +then the sequence may be special; if so, it's listed below. A few letters have +not been used yet, so escaping them with a backslash doesn't change them to be +special. 
A future version of Perl may assign a special meaning to them, so if +you have warnings turned on, Perl will issue a warning if you use such a +sequence. [1]. + +It is however guaranteed that backslash or escape sequences never have a punctuation character following the backslash, not now, and not in a future version of Perl 5. So it is safe to put a backslash in front of a non-word character. @@ -61,57 +59,62 @@ quoted constructs>. =head2 All the sequences and escapes +Those not usable within a bracketed character class (like C<[\da-z]>) are marked +as C + \000 Octal escape sequence. - \1 Absolute backreference. + \1 Absolute backreference. Not in []. \a Alarm or bell. - \A Beginning of string. - \b Word/non-word boundary. (Backspace in a char class). - \B Not a word/non-word boundary. - \cX Control-X (X can be any ASCII character). - \C Single octet, even under UTF-8. + \A Beginning of string. Not in []. + \b Word/non-word boundary. (Backspace in []). + \B Not a word/non-word boundary. Not in []. + \cX Control-X + \C Single octet, even under UTF-8. Not in []. \d Character class for digits. \D Character class for non-digits. \e Escape character. - \E Turn off \Q, \L and \U processing. + \E Turn off \Q, \L and \U processing. Not in []. \f Form feed. - \g{}, \g1 Named, absolute or relative backreference. - \G Pos assertion. - \h Character class for horizontal white space. - \H Character class for non horizontal white space. - \k{}, \k<>, \k'' Named backreference. - \K Keep the stuff left of \K. - \l Lowercase next character. - \L Lowercase till \E. + \g{}, \g1 Named, absolute or relative backreference. Not in []. + \G Pos assertion. Not in []. + \h Character class for horizontal whitespace. + \H Character class for non horizontal whitespace. + \k{}, \k<>, \k'' Named backreference. Not in []. + \K Keep the stuff left of \K. Not in []. + \l Lowercase next character. Not in []. + \L Lowercase till \E. Not in []. \n (Logical) newline character. 
- \N{} Named (Unicode) character. - \p{}, \pP Character with a Unicode property. - \P{}, \PP Character without a Unicode property. - \Q Quotemeta till \E. + \N Any character but newline. Experimental. Not in []. + \N{} Named or numbered (Unicode) character. + \p{}, \pP Character with the given Unicode property. + \P{}, \PP Character without the given Unicode property. + \Q Quotemeta till \E. Not in []. \r Return character. - \R Generic new line. - \s Character class for white space. - \S Character class for non white space. + \R Generic new line. Not in []. + \s Character class for whitespace. + \S Character class for non whitespace. \t Tab character. - \u Titlecase next character. - \U Uppercase till \E. - \v Character class for vertical white space. - \V Character class for non vertical white space. + \u Titlecase next character. Not in []. + \U Uppercase till \E. Not in []. + \v Character class for vertical whitespace. + \V Character class for non vertical whitespace. \w Character class for word characters. \W Character class for non-word characters. \x{}, \x00 Hexadecimal escape sequence. - \X Extended Unicode "combining character sequence". - \z End of string. - \Z End of string. + \X Unicode "extended grapheme cluster". Not in []. + \z End of string. Not in []. + \Z End of string. Not in []. =head2 Character Escapes =head3 Fixed characters -A handful of characters have a dedidated I. The following -table shows them, along with their code points (in decimal and hex), their -ASCII name, the control escape (see below) and a short description. +A handful of characters have a dedicated I. The following +table shows them, along with their ASCII code points (in decimal and hex), +their ASCII name, the control escape on ASCII platforms and a short +description. (For EBCDIC platforms, see L.) - Seq. Code Point ASCII Cntr Description. + Seq. Code Point ASCII Cntrl Description. 
Dec Hex \a 7 07 BEL \cG alarm or bell \b 8 08 BS \cH backspace [1] @@ -142,10 +145,18 @@ OSses native newline character when reading from or writing to text files. =head3 Control characters C<\c> is used to denote a control character; the character following C<\c> -is the name of the control character. For instance, C matches the -character I (a carriage return, code point 13). The case of the -character following C<\c> doesn't matter: C<\cM> and C<\cm> match the same -character. +determines the value of the construct. For example the value of C<\cA> is +C, and the value of C<\cb> is C, etc. +The gory details are in L. A complete +list of what C, etc. means for ASCII and EBCDIC platforms is in +L. + +Note that C<\c\> alone at the end of a regular expression (or doubled-quoted +string) is not valid. The backslash must be followed by another character. +That is, C<\c\I> means C'> for all characters I. + +To write platform-independent code, you must use C<\N{I}> instead, like +C<\N{ESCAPE}> or C<\N{U+001B}>, see L. Mnemonic: Iontrol character. @@ -153,17 +164,39 @@ Mnemonic: Iontrol character. $str =~ /\cK/; # Matches if $str contains a vertical tab (control-K). -=head3 Named characters +=head3 Named or numbered characters -All Unicode characters have a Unicode name, and characters in various scripts -have names as well. It is even possible to give your own names to characters. -You can use a character by name by using the C<\N{}> construct; the name of -the character goes between the curly braces. You do have to C -to load the names of the characters, otherwise Perl will complain you use -a name it doesn't know about. For more details, see L. +All Unicode characters have a Unicode name and numeric ordinal value. Use the +C<\N{}> construct to specify a character by either of these values. + +To specify by name, the name of the character goes between the curly braces. 
+In this case, you have to C to load the Unicode names of the +characters, otherwise Perl will complain. + +To specify by Unicode ordinal number, use the form +C<\N{U+I}>, where I is a number in +hexadecimal that gives the ordinal number that Unicode has assigned to the +desired character. It is customary (but not required) to use leading zeros to +pad the number to 4 digits. Thus C<\N{U+0041}> means +C, and you will rarely see it written without the two +leading zeros. C<\N{U+0041}> means C even on EBCDIC machines (where the +ordinal value of C is not 0x41). + +It is even possible to give your own names to characters, and even to short +sequences of characters. For details, see L. + +(There is an expanded internal form that you may see in debug output: +C<\N{U+I.I...}>. +The C<...> means any number of these Is separated by dots. +This represents the sequence formed by the characters. This is an internal +form only, subject to change, and you should not try to use it yourself.) Mnemonic: Iamed character. +Note that a character that is expressed as a named or numbered character is +considered as a character without special meaning by the regex engine, and will +match "as is". + =head4 Example use charnames ':full'; # Loads the Unicode names. @@ -176,7 +209,8 @@ Mnemonic: Iamed character. Octal escapes consist of a backslash followed by two or three octal digits matching the code point of the character you want to use. This allows for -522 characters (C<\00> up to C<\777>) that can be expressed this way. +512 characters (C<\00> up to C<\777>) that can be expressed this way (but +anything above C<\377> is deprecated). Enough in pre-Unicode days, but most Unicode characters cannot be escaped this way. @@ -184,7 +218,7 @@ Note that a character that is expressed as an octal escape is considered as a character without special meaning by the regex engine, and will match "as is". 
-=head4 Examples +=head4 Examples (assuming an ASCII platform) $str = "Perl"; $str =~ /\120/; # Match, "\120" is "P". @@ -202,7 +236,7 @@ the following rules: =item 1 -If the backslash is followed by a single digit, it's a backrefence. +If the backslash is followed by a single digit, it's a backreference. =item 2 @@ -227,7 +261,7 @@ matched as is. =head3 Hexadecimal escapes -Hexadecimal escapes start with C<\x> and are then either followed by +Hexadecimal escapes start with C<\x> and are then either followed by a two digit hexadecimal number, or a hexadecimal number of arbitrary length surrounded by curly braces. The hexadecimal number is the code point of the character you want to express. @@ -238,7 +272,7 @@ as a character without special meaning by the regex engine, and will match Mnemonic: heIadecimal. -=head4 Examples +=head4 Examples (assuming an ASCII platform) $str = "Perl"; $str =~ /\x50/; # Match, "\x50" is "P". @@ -261,7 +295,7 @@ functions C and C). To uppercase or lowercase several characters, one might want to use C<\L> or C<\U>, which will lowercase/uppercase all characters following -them, until either the end of the pattern, or the next occurance of +them, until either the end of the pattern, or the next occurrence of C<\E>, whatever comes first. They perform similar functionality as the functions C and C do. @@ -290,15 +324,15 @@ the character classes are written as a backslash sequence. We will briefly discuss those here; full details of character classes can be found in L. -C<\w> is a character class that matches any I character (letters, -digits, underscore). C<\d> is a character class that matches any digit, -while the character class C<\s> matches any white space character. -New in perl 5.10 are the classes C<\h> and C<\v> which match horizontal -and vertical white space characters. +C<\w> is a character class that matches any single I character (letters, +digits, underscore). 
C<\d> is a character class that matches any decimal digit, +while the character class C<\s> matches any whitespace character. +New in perl 5.10.0 are the classes C<\h> and C<\v> which match horizontal +and vertical whitespace characters. The uppercase variants (C<\W>, C<\D>, C<\S>, C<\H>, and C<\V>) are character classes that match any character that isn't a word character, -digit, white space, horizontal white space or vertical white space. +digit, whitespace, horizontal whitespace nor vertical whitespace. Mnemonics: I<w>ord, I<d>igit, I<s>pace, I<h>orizontal, I<v>ertical. @@ -309,7 +343,7 @@ match a character that matches the given Unicode property; properties include things like "letter", or "thai character". Capitalizing the sequence to C<\PP> and C<\P{Property}> make the sequence match a character that doesn't match the given Unicode property. For more details, see -L and +L and L. Mnemonic: I<p>roperty. @@ -319,15 +353,16 @@ Mnemonic: I<p>roperty. If capturing parenthesis are used in a regular expression, we can refer to the part of the source string that was matched, and match exactly the -same thing. (Full details are discussed in L). There are -three ways of refering to such I: absolutely, relatively, -and by name. +same thing. There are three ways of referring to such I: +absolutely, relatively, and by name. + +=for later add link to perlrecapture =head3 Absolute referencing A backslash sequence that starts with a backslash and is followed by a number is an absolute reference (but be aware of the caveat mentioned above). -If the number is I<N>, it refers to the Nth set of parenthesis - whatever +If the number is I<N>, it refers to the Nth set of parentheses - whatever has been matched by that set of parenthesis has to be matched by the C<\N> as well. @@ -339,12 +374,12 @@ as well. =head3 Relative referencing -New in perl 5.10 is different way of refering to capture buffers: C<\g>. +New in perl 5.10.0 is a different way of referring to capture buffers: C<\g>. C<\g> takes a number as argument, with the number in curly braces (the braces are optional). If the number (N) does not have a sign, it's a reference to the Nth capture group (so C<\g{2}> is equivalent to C<\2> - except that C<\g> always refers to a capture group and will never be seen as an octal -escape). If the number is negative, the reference is relative, refering to +escape). If the number is negative, the reference is relative, referring to the Nth group before the C<\g{-N}>. The big advantage of C<\g{-N}> is that it makes it much easier to write @@ -368,7 +403,7 @@ Mnemonic: I<g>roup. =head3 Named referencing -Also new in perl 5.10 is the use of named capture buffers, which can be +Also new in perl 5.10.0 is the use of named capture buffers, which can be referred to by name. This is done with C<\g{name}>, which is a backreference to the capture buffer with the name I<name>. @@ -377,7 +412,7 @@ written as C<\k{name}>, C<< \k<name> >> or C<\k'name'>. 
Note that C<\g{}> has the potential to be ambiguous, as it could be a named reference, or an absolute or relative reference (if its argument is numeric). -However, names are not allowed to start with digits, nor are allowed to +However, names are not allowed to start with digits, nor are they allowed to contain a hyphen, so there is no ambiguity. =head4 Examples @@ -390,7 +425,7 @@ contain a hyphen, so there is no ambiguity. =head2 Assertions -Assertions are conditions that have to be true -- they don't actually +Assertions are conditions that have to be true; they don't actually match parts of the substring. There are six assertions that are written as backslash sequences. @@ -426,7 +461,9 @@ remember where in the source string the last match ended, and the next time, it will start the match from where it ended the previous time. C<\G> matches the point where the previous match ended, or the beginning -of the string if there was no previous match. See also L. +of the string if there was no previous match. + +=for later add link to perlremodifiers Mnemonic: I<G>lobal. @@ -479,35 +516,52 @@ Mnemonic: oI<C>tet. =item \K -This is new in perl 5.10. Anything that is matched left of C<\K> is +This is new in perl 5.10.0. Anything that is matched left of C<\K> is not included in C<$&> - and will not be replaced if the pattern is used in a substitution. This will allow you to write C instead of C or C. Mnemonic: I<K>eep. +=item \N + +This is a new experimental feature in perl 5.12.0. It matches any character +that is not a newline. It is a short-hand for writing C<[^\n]>, and is +identical to the C<.> metasymbol, except under the C</s> flag, which changes +the meaning of C<.>, but not C<\N>. + +Note that C<\N{...}> can mean a +L. + +Mnemonic: Complement of I<\n>. + =item \R +X<\R> C<\R> matches a I<generic newline>, that is, anything that is considered a newline by Unicode. 
This includes all characters matched by C<\v> -(vertical white space), and the multi character sequence C<"\x0D\x0A"> +(vertical whitespace), and the multi character sequence C<"\x0D\x0A"> (carriage return followed by a line feed, aka the network newline, or -the newline used in Windows text files). C<\R> is equivalent with -C<< (?>\x0D\x0A)|\v) >>. Since C<\R> can match a more than one character, -it cannot be put inside a bracketed character class; C</[\R]/> is an error. -C<\R> is introduced in perl 5.10. +the newline used in Windows text files). C<\R> is equivalent to +C<< (?>\x0D\x0A)|\v) >>. Since C<\R> can match a sequence of more than one +character, it cannot be put inside a bracketed character class; C</[\R]/> is an +error; use C<\v> instead. C<\R> was introduced in perl 5.10.0. -Mnemonic: none really. C<\R> was picked because PCRE already uses C<\R>. +Mnemonic: none really. C<\R> was picked because PCRE already uses C<\R>, +and more importantly because Unicode recommends such a regular expression +metacharacter, and suggests C<\R> as the notation. =item \X +X<\X> + +This matches a Unicode I<extended grapheme cluster>. -This matches an extended Unicode I<combining character sequence>, and -is equivalent to C<< (?>\PM\pM*) >>. C<\PM> matches any character that is -not considered a Unicode mark character, while C<\pM> matches any character -that is considered a Unicode mark character; so C<\X> matches any non -mark character followed by zero or more mark characters. Mark characters -include (but are not restricted to) I and -I. +C<\X> matches quite well what normal (non-Unicode-programmer) usage +would consider a single character. As an example, consider a G with some sort +of diacritic mark, such as an arrow. There is no such single character in +Unicode, but one can be composed by using a G followed by a Unicode "COMBINING +UPWARDS ARROW BELOW", and would be displayed by Unicode-aware software as if it +were a single character. Mnemonic: eI<X>tended Unicode character.
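
The C<\R>, C<\K>, C<\N> and C<\X> entries touched by this patch can be tried out with a short script. This is only a minimal sketch, assuming perl 5.12.0 or later (for C<\N>); the variable names and sample strings are illustrative and are not part of the patch. U+034E is used here as the code point of COMBINING UPWARDS ARROW BELOW.

```perl
use strict;
use warnings;

# \R: generic newline; the CR LF pair is consumed as a single unit.
my @lines = split /\R/, "one\r\ntwo\nthree";

# \K: text matched to its left is kept, so only "bar" is replaced.
(my $kept = "foobar") =~ s/foo\Kbar/BAZ/;

# \N: any character except a newline, even under the /s flag.
my ($first) = "ab\ncd" =~ /^(\N+)/s;

# \X: "G" followed by U+034E forms one extended grapheme cluster,
# so the three-character string below yields two matches.
my @clusters = "G\x{34E}!" =~ /\X/g;

printf "%s | %s | %s | %d clusters\n",
    join(",", @lines), $kept, $first, scalar @clusters;
```

Because C<\R> and C<\X> can each match more than one character, neither is usable inside a bracketed character class, which is what the "Not in []" markers in the table record.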