X-Git-Url: https://perl5.git.perl.org/perl5.git/blobdiff_plain/39b6ec1a4f067ccc33bb32fe5d2157b96637d2c5..4ee2b8db537d28b77d127a86307e426289e5c8b5:/pod/perlretut.pod diff --git a/pod/perlretut.pod b/pod/perlretut.pod index 3af0d3a..734ca5c 100644 --- a/pod/perlretut.pod +++ b/pod/perlretut.pod @@ -49,6 +49,10 @@ is harder to pronounce. The Perl pod documentation is evenly split on regexp vs regex; in Perl, there is more than one way to abbreviate it. We'll use regexp in this tutorial. +New in v5.22, L|re/'strict' mode> applies stricter +rules than otherwise when compiling regular expression patterns. It can +find things that, while legal, may not be what you intended. + =head1 Part 1: The basics =head2 Simple word matching @@ -176,7 +180,7 @@ In addition to the metacharacters, there are some ASCII characters which don't have printable character equivalents and are instead represented by I. Common examples are C<\t> for a tab, C<\n> for a newline, C<\r> for a carriage return and C<\a> for a -bell. If your string is better thought of as a sequence of arbitrary +bell (or alert). If your string is better thought of as a sequence of arbitrary bytes, the octal escape sequence, e.g., C<\033>, or hexadecimal escape sequence, e.g., C<\x1B> may be a more natural representation for your bytes. Here are some examples of escapes: @@ -292,8 +296,9 @@ class> of them. One such concept is that of a I. A character class allows a set of possible characters, rather than just a single -character, to match at a particular point in a regexp. Character -classes are denoted by brackets C<[...]>, with the set of characters +character, to match at a particular point in a regexp. You can define +your own custom character classes. These +are denoted by brackets C<[...]>, with the set of characters to be possibly matched inside. Here are some examples: /cat/; # matches 'cat' @@ -367,8 +372,9 @@ character, or the match fails. 
Then Now, even C<[0-9]> can be a bother to write multiple times, so in the interest of saving keystrokes and making regexps more readable, Perl has several abbreviations for common character classes, as shown below. -Since the introduction of Unicode, these character classes match more -than just a few characters in the ISO 8859-1 range. +Since the introduction of Unicode, unless the C modifier is in +effect, these character classes match more than just a few characters in +the ASCII range. =over 4 @@ -402,10 +408,24 @@ but also digits and characters from non-roman scripts The period '.' matches any character but "\n" (unless the modifier C is in effect, as explained below). +=item * + +\N, like the period, matches any character but "\n", but it does so +regardless of whether the modifier C is in effect. + =back +The C modifier, available starting in Perl 5.14, is used to +restrict the matches of \d, \s, and \w to just those in the ASCII range. +It is useful to keep your program from being needlessly exposed to full +Unicode (and its accompanying security considerations) when all you want +is to process English-like text. (The "a" may be doubled, C, to +provide even more restrictions, preventing case-insensitive matching of +ASCII with non-ASCII characters; otherwise a Unicode "Kelvin Sign" +would caselessly match a "k" or "K".) + The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside -of character classes. Here are some in use: +of bracketed character classes. Here are some in use: /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format /[\d\s]/; # matches any digit or whitespace character @@ -421,6 +441,11 @@ of characters, it is incorrect to think of C<[^\d\w]> as C<[\D\W]>; in fact C<[^\d\w]> is the same as C<[^\w]>, which is the same as C<[\W]>. Think DeMorgan's laws. +In actuality, the period and C<\d\s\w\D\S\W> abbreviations are +themselves types of character classes, so the ones surrounded by +brackets are just one type of character class. 
When we need to make a +distinction, we refer to them as "bracketed character classes." + An anchor useful in basic regexps is the I C<\b>. This matches a boundary between a word character and a non-word character C<\w\W> or C<\W\w>: @@ -434,6 +459,11 @@ character C<\w\W> or C<\W\w>: Note in the last example, the end of the string is considered a word boundary. +For natural language processing (so that, for example, apostrophes are +included in words), use instead C<\b{wb}> + + "don't" =~ / .+? \b{wb} /x; # matches the whole string + You might wonder why C<'.'> matches everything but C<"\n"> - why not every character? The reason is that often one is matching against lines and would like to ignore the newline characters. For instance, @@ -621,50 +651,50 @@ of what Perl does when it tries to match the regexp =over 4 -=item 0 +=item Z<>0 Start with the first letter in the string 'a'. -=item 1 +=item Z<>1 Try the first alternative in the first group 'abd'. -=item 2 +=item Z<>2 Match 'a' followed by 'b'. So far so good. -=item 3 +=item Z<>3 'd' in the regexp doesn't match 'c' in the string - a dead end. So backtrack two characters and pick the second alternative in the first group 'abc'. -=item 4 +=item Z<>4 Match 'a' followed by 'b' followed by 'c'. We are on a roll and have satisfied the first group. Set $1 to 'abc'. -=item 5 +=item Z<>5 Move on to the second group and pick the first alternative 'df'. -=item 6 +=item Z<>6 Match the 'd'. -=item 7 +=item Z<>7 'f' in the regexp doesn't match 'e' in the string, so a dead end. Backtrack one character and pick the second alternative in the second group 'd'. -=item 8 +=item Z<>8 'd' matches. The second grouping is satisfied, so set $2 to 'd'. -=item 9 +=item Z<>9 We are at the end of the regexp, so we are done! We have matched 'abcd' out of the string "abcde". @@ -766,7 +796,7 @@ so may lead to surprising and unsatisfactory results. 
=head2 Relative backreferences Counting the opening parentheses to get the correct number for a -backreference is errorprone as soon as there is more than one +backreference is error-prone as soon as there is more than one capturing group. A more convenient technique became available with Perl 5.10: relative backreferences. To refer to the immediately preceding capture group one now may write C<\g{-1}>, the next but @@ -844,9 +874,9 @@ well, and this is exactly what the parenthesized construct C<(?|...)>, set around an alternative achieves. Here is an extended version of the previous pattern: - if ( $time =~ /(?|(\d\d|\d):(\d\d)|(\d\d)(\d\d))\s+([A-Z][A-Z][A-Z])/ ){ - print "hour=$1 minute=$2 zone=$3\n"; - } + if($time =~ /(?|(\d\d|\d):(\d\d)|(\d\d)(\d\d))\s+([A-Z][A-Z][A-Z])/){ + print "hour=$1 minute=$2 zone=$3\n"; + } Within the alternative numbering group, group numbers start at the same position for each alternative. After the group, numbering continues @@ -854,7 +884,7 @@ with one higher than the maximum reached across all the alternatives. =head2 Position information -In addition to what was matched, Perl (since 5.6.0) also provides the +In addition to what was matched, Perl also provides the positions of what was matched as contents of the C<@-> and C<@+> arrays. C<$-[0]> is the position of the start of the entire match and C<$+[0]> is the position of the end. Similarly, C<$-[n]> is the @@ -864,8 +894,8 @@ this code $x = "Mmm...donut, thought Homer"; $x =~ /^(Mmm|Yech)\.\.\.(donut|peas)/; # matches - foreach $expr (1..$#-) { - print "Match $expr: '${$expr}' at position ($-[$expr],$+[$expr])\n"; + foreach $exp (1..$#-) { + print "Match $exp: '${$exp}' at position ($-[$exp],$+[$exp])\n"; } prints @@ -885,7 +915,10 @@ of the string after the match. An example: In the second match, C<$`> equals C<''> because the regexp matched at the first character position in the string and stopped; it never saw the -second 'the'. 
It is important to note that using C<$`> and C<$'> +second 'the'. + +If your code is to run on Perl versions earlier than +5.20, it is worthwhile to note that using C<$`> and C<$'> slows down regexp matching quite a bit, while C<$&> slows it down to a lesser extent, because if they are used in one regexp in a program, they are generated for I regexps in the program. So if raw @@ -898,8 +931,11 @@ C<@+> instead: $' is the same as substr( $x, $+[0] ) As of Perl 5.10, the C<${^PREMATCH}>, C<${^MATCH}> and C<${^POSTMATCH}> -variables may be used. These are only set if the C

modifier is present. -Consequently they do not penalize the rest of the program. +variables may be used. These are only set if the C

modifier is +present. Consequently they do not penalize the rest of the program. In +Perl 5.20, C<${^PREMATCH}>, C<${^MATCH}> and C<${^POSTMATCH}> are available +whether the C

has been used or not (the modifier is ignored), and +C<$`>, C<$'> and C<$&> do not cause any speed difference. =head2 Non-capturing groupings @@ -931,6 +967,12 @@ required for some reason: @num = split /(a|b)+/, $x; # @num = ('12','a','34','a','5') @num = split /(?:a|b)+/, $x; # @num = ('12','34','5') +In Perl 5.22 and later, all groups within a regexp can be set to +non-capturing by using the new C flag: + + "hello" =~ /(hi|hello)/n; # $1 is not set! + +See L for more information. =head2 Matching repetitions @@ -984,10 +1026,10 @@ Here are some examples: /y(es)?/i; # matches 'y', 'Y', or a case-insensitive 'yes' $year =~ /^\d{2,4}$/; # make sure year is at least 2 but not more # than 4 digits - $year =~ /^\d{4}$|^\d{2}$/; # better match; throw out 3-digit dates - $year =~ /^\d{2}(\d{2})?$/; # same thing written differently. However, - # this captures the last two digits in $1 - # and the other does not. + $year =~ /^\d{4}$|^\d{2}$/; # better match; throw out 3-digit dates + $year =~ /^\d{2}(\d{2})?$/; # same thing written differently. + # However, this captures the last two + # digits in $1 and the other does not. % simple_grep '^(\w+)\g1$' /usr/dict/words # isn't this easier? beriberi @@ -1228,35 +1270,35 @@ backtracking. Here is a step-by-step analysis of the example =over 4 -=item 0 +=item Z<>0 Start with the first letter in the string 't'. -=item 1 +=item Z<>1 The first quantifier '.*' starts out by matching the whole string 'the cat in the hat'. -=item 2 +=item Z<>2 'a' in the regexp element 'at' doesn't match the end of the string. Backtrack one character. -=item 3 +=item Z<>3 'a' in the regexp element 'at' still doesn't match the last letter of the string 't', so backtrack one more character. -=item 4 +=item Z<>4 Now we can match the 'a' and the 't'. -=item 5 +=item Z<>5 Move on to the third element '.*'. Since we are at the end of the string and '.*' can match 0 times, assign it the empty string. -=item 6 +=item Z<>6 We are done! 
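The backtracking walk-through that closes the hunk above can be checked with a few lines of Perl. This is only a sketch: the hunk does not show the pattern under analysis, so the regexp C</^(.*)(at)(.*)$/> used here is an assumption inferred from the steps (a leading '.*', the element 'at', and a trailing '.*'), and the variable names are illustrative rather than taken from the patch. The second match swaps in a non-greedy first quantifier purely for contrast.

```perl
use strict;
use warnings;

# Assumed subject string and pattern; the hunk itself only describes the steps.
my $x = "the cat in the hat";

# Greedy first quantifier: '.*' grabs the whole string, then backtracks
# one character at a time until 'at' can match.
my ($greedy_first, $greedy_at, $greedy_rest) = $x =~ /^(.*)(at)(.*)$/;
print "greedy:     '$greedy_first' '$greedy_at' '$greedy_rest'\n";

# Non-greedy first quantifier: '.*?' extends only as far as needed,
# so the first 'at' in the string is the one that matches.
my ($lazy_first, $lazy_at, $lazy_rest) = $x =~ /^(.*?)(at)(.*)$/;
print "non-greedy: '$lazy_first' '$lazy_at' '$lazy_rest'\n";
```

Under these assumptions the greedy match leaves the third group empty, as the final steps of the walk-through describe, while the non-greedy variant stops at the 'at' in 'cat'.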
@@ -1501,31 +1543,6 @@ single line C, multi-line C, case-insensitive C and extended C modifiers. There are a few more things you might want to know about matching operators. -=head3 Optimizing pattern evaluation - -We pointed out earlier that variables in regexps are substituted -before the regexp is evaluated: - - $pattern = 'Seuss'; - while (<>) { - print if /$pattern/; - } - -This will print any lines containing the word C. It is not as -efficient as it could be, however, because Perl has to re-evaluate -(or compile) C<$pattern> each time through the loop. If C<$pattern> won't be -changing over the lifetime of the script, we can add the C -modifier, which directs Perl to only perform variable substitutions -once: - - #!/usr/bin/perl - # Improved simple_grep - $regexp = shift; - while (<>) { - print if /$regexp/o; # a good deal faster - } - - =head3 Prohibiting substitution If you change C<$pattern> after the first substitution happens, Perl @@ -1547,7 +1564,8 @@ the regexp in the I is used instead. So we have =head3 Global matching -The final two modifiers C and C concern multiple matches. +The final two modifiers we will discuss here, +C and C, concern multiple matches. The modifier C stands for global matching and allows the matching operator to match within a string as many times as possible. In scalar context, successive invocations against a string will have @@ -1592,9 +1610,9 @@ there are no groupings, a list of matches to the whole regexp. So if we wanted just the words, we could use @words = ($x =~ /(\w+)/g); # matches, - # $word[0] = 'cat' - # $word[1] = 'dog' - # $word[2] = 'house' + # $words[0] = 'cat' + # $words[1] = 'dog' + # $words[2] = 'house' Closely associated with the C modifier is the C<\G> anchor. The C<\G> anchor matches at the point where the previous C match left @@ -1662,6 +1680,10 @@ which is the correct answer. This example illustrates that it is important not only to match what is desired, but to reject what is not desired. 
+(There are other regexp modifiers that are available, such as +C, but their specialized uses are beyond the +scope of this introduction. ) + =head3 Search and replace Regular expressions also play a big role in I @@ -1707,7 +1729,7 @@ the following program to replace it: $regexp = shift; $replacement = shift; while (<>) { - s/$regexp/$replacement/go; + s/$regexp/$replacement/g; print; } ^D @@ -1715,13 +1737,15 @@ the following program to replace it: % simple_replace regexp regex perlretut.pod In C we used the C modifier to replace all -occurrences of the regexp on each line and the C modifier to -compile the regexp only once. As with C, both the -C and the C use C<$_> implicitly. +occurrences of the regexp on each line. (Even though the regular +expression appears in a loop, Perl is smart enough to compile it +only once.) As with C, both the +C and the C use C<$_> implicitly. If you don't want C to change your original variable you can use the non-destructive substitute modifier, C. This changes the -behavior so that C returns the final substituted string: +behavior so that C returns the final substituted string +(instead of the number of substitutions): $x = "I like dogs."; $y = $x =~ s/dogs/cats/r; @@ -1741,7 +1765,8 @@ One other interesting thing that the C flag allows is chaining substitutions: $x = "Cats are great."; - print $x =~ s/Cats/Dogs/r =~ s/Dogs/Frogs/r =~ s/Frogs/Hedgehogs/r, "\n"; + print $x =~ s/Cats/Dogs/r =~ s/Dogs/Frogs/r =~ + s/Frogs/Hedgehogs/r, "\n"; # prints "Hedgehogs are great." A modifier available specifically to search and replace is the @@ -1753,7 +1778,7 @@ computation in the process of replacing text. 
This example counts character frequencies in a line: $x = "Bill the cat"; - $x =~ s/(.)/$chars{$1}++;$1/eg; # final $1 replaces char with itself + $x =~ s/(.)/$chars{$1}++;$1/eg; # final $1 replaces char with itself print "frequency of '$_' is $chars{$_}\n" foreach (sort {$chars{$b} <=> $chars{$a}} keys %chars); @@ -1871,12 +1896,14 @@ instance, It does not protect C<$> or C<@>, so that variables can still be substituted. -C<\Q>, C<\L>, C<\U> and C<\E> are actually part of the syntax of regular -expression I, and are not part of regexp syntax proper. So they -do not work in interpolated patterns. +C<\Q>, C<\L>, C<\l>, C<\U>, C<\u> and C<\E> are actually part of +double-quotish syntax, and not part of regexp syntax proper. They will +work if they appear in a regular expression embedded directly in a +program, but not when contained in a string that is interpolated in a +pattern. -With the advent of 5.6.0, Perl regexps can handle more than just the -standard ASCII character set. Perl now supports I, a standard +Perl regexps can handle more than just the +standard ASCII character set. Perl supports I, a standard for representing the alphabets from virtually all of the world's written languages, and a host of symbols. Perl's text strings are Unicode strings, so they can contain characters with a value (codepoint or character number) higher @@ -1888,8 +1915,8 @@ to know 1) how to represent Unicode characters in a regexp and 2) that a matching operation will treat the string to be searched as a sequence of characters, not bytes. The answer to 1) is that Unicode characters greater than C are represented using the C<\x{hex}> notation, because -\x hex (without curly braces) doesn't go further than 255. Starting in Perl -5.14, if you're an octal fan, you can also use C<\o{oct}>. +\x hex (without curly braces) doesn't go further than 255. (Starting in Perl +5.14, if you're an octal fan, you can also use C<\o{oct}>.) 
/\x{263a}/; # match a Unicode smiley face :) @@ -1908,80 +1935,69 @@ specified in the Unicode standard. For instance, if we wanted to represent or match the astrological sign for the planet Mercury, we could use - use charnames ":full"; # use named chars with Unicode full names $x = "abc\N{MERCURY}def"; $x =~ /\N{MERCURY}/; # matches -One can also use short names or restrict names to a certain alphabet: +One can also use "short" names: - use charnames ':full'; print "\N{GREEK SMALL LETTER SIGMA} is called sigma.\n"; - - use charnames ":short"; print "\N{greek:Sigma} is an upper-case sigma.\n"; +You can also restrict names to a certain alphabet by specifying the +L pragma: + use charnames qw(greek); print "\N{sigma} is Greek sigma\n"; -A list of full names is found in the file NamesList.txt in the -lib/perl5/X.X.X/unicore directory (where X.X.X is the perl -version number as it is installed on your system). - -The answer to requirement 2), as of 5.6.0, is that a regexp uses Unicode -characters. Internally, this is encoded to bytes using either UTF-8 or a -native 8 bit encoding, depending on the history of the string, but -conceptually it is a sequence of characters, not bytes. See -L for a tutorial about that. - -Let us now discuss Unicode character classes. Just as with Unicode -characters, there are named Unicode character classes represented by the +An index of character names is available on-line from the Unicode +Consortium, L; explanatory +material with links to other resources at +L. + +The answer to requirement 2) is that a regexp (mostly) +uses Unicode characters. The "mostly" is for messy backward +compatibility reasons, but starting in Perl 5.14, any regex compiled in +the scope of a C (which is automatically +turned on within the scope of a C or higher) will turn that +"mostly" into "always". If you want to handle Unicode properly, you +should ensure that C<'unicode_strings'> is turned on. 
+Internally, this is encoded to bytes using either UTF-8 or a native 8 +bit encoding, depending on the history of the string, but conceptually +it is a sequence of characters, not bytes. See L for a +tutorial about that. + +Let us now discuss Unicode character classes, most usually called +"character properties". These are represented by the C<\p{name}> escape sequence. Closely associated is the C<\P{name}> -character class, which is the negation of the C<\p{name}> class. For +property, which is the negation of the C<\p{name}> one. For example, to match lower and uppercase characters, - use charnames ":full"; # use named chars with Unicode full names $x = "BOB"; $x =~ /^\p{IsUpper}/; # matches, uppercase char class $x =~ /^\P{IsUpper}/; # doesn't match, char class sans uppercase $x =~ /^\p{IsLower}/; # doesn't match, lowercase char class $x =~ /^\P{IsLower}/; # matches, char class sans lowercase -Here is the association between some Perl named classes and the -traditional Unicode classes: - - Perl class name Unicode class name or regular expression - - IsAlpha /^[LM]/ - IsAlnum /^[LMN]/ - IsASCII $code <= 127 - IsCntrl /^C/ - IsBlank $code =~ /^(0020|0009)$/ || /^Z[^lp]/ - IsDigit Nd - IsGraph /^([LMNPS]|Co)/ - IsLower Ll - IsPrint /^([LMNPS]|Co|Zs)/ - IsPunct /^P/ - IsSpace /^Z/ || ($code =~ /^(0009|000A|000B|000C|000D)$/ - IsSpacePerl /^Z/ || ($code =~ /^(0009|000A|000C|000D|0085|2028|2029)$/ - IsUpper /^L[ut]/ - IsWord /^[LMN]/ || $code eq "005F" - IsXDigit $code =~ /^00(3[0-9]|[46][1-6])$/ - -You can also use the official Unicode class names with C<\p> and -C<\P>, like C<\p{L}> for Unicode 'letters', C<\p{Lu}> for uppercase -letters, or C<\P{Nd}> for non-digits. If a C is just one -letter, the braces can be dropped. For instance, C<\pM> is the -character class of Unicode 'marks', for example accent marks. -For the full list see L. - -Unicode has also been separated into various sets of characters -which you can test with C<\p{...}> (in) and C<\P{...}> (not in). 
-To test whether a character is (or is not) an element of a script -you would use the script name, for example C<\p{Latin}>, C<\p{Greek}>, -or C<\P{Katakana}>. Other sets are the Unicode blocks, the names -of which begin with "In". One such block is dedicated to mathematical -operators, and its pattern formula is }>. -For the full list see L. +(The "Is" is optional.) + +There are many, many Unicode character properties. For the full list +see L. Most of them have synonyms with shorter names, +also listed there. Some synonyms are a single character. For these, +you can drop the braces. For instance, C<\pM> is the same thing as +C<\p{Mark}>, meaning things like accent marks. + +The Unicode C<\p{Script}> and C<\p{Script_Extensions}> properties are +used to categorize every Unicode character into the language script it +is written in. (C is an improved version of +C