X-Git-Url: https://perl5.git.perl.org/perl5.git/blobdiff_plain/0bd4278638caf33618355e433372cb94a2853626..b26bd9b0ed7c520cce2594b28de3ec9521a159ab:/pod/perlretut.pod
diff --git a/pod/perlretut.pod b/pod/perlretut.pod
index b2ba390..9a3c696 100644
--- a/pod/perlretut.pod
+++ b/pod/perlretut.pod
@@ -49,6 +49,10 @@ is harder to pronounce. The Perl pod documentation is evenly split on
regexp vs regex; in Perl, there is more than one way to abbreviate it.
We'll use regexp in this tutorial.
+New in v5.22, L<C<use re 'strict'>|re/'strict' mode> applies stricter
+rules than otherwise when compiling regular expression patterns. It can
+find things that, while legal, may not be what you intended.
+
=head1 Part 1: The basics
=head2 Simple word matching
@@ -292,8 +296,9 @@ class> of them.
One such concept is that of a I<character class>. A character class
allows a set of possible characters, rather than just a single
-character, to match at a particular point in a regexp. Character
-classes are denoted by brackets C<[...]>, with the set of characters
+character, to match at a particular point in a regexp. You can define
+your own custom character classes. These
+are denoted by brackets C<[...]>, with the set of characters
to be possibly matched inside. Here are some examples:
/cat/; # matches 'cat'
@@ -420,7 +425,7 @@ ASCII with non-ASCII characters; otherwise a Unicode "Kelvin Sign"
would caselessly match a "k" or "K".)
The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside
-of character classes. Here are some in use:
+of bracketed character classes. Here are some in use:
/\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
/[\d\s]/; # matches any digit or whitespace character
@@ -436,6 +441,11 @@ of characters, it is incorrect to think of C<[^\d\w]> as C<[\D\W]>; in
fact C<[^\d\w]> is the same as C<[^\w]>, which is the same as
C<[\W]>. Think DeMorgan's laws.
+In actuality, the period and C<\d\s\w\D\S\W> abbreviations are
+themselves types of character classes, so the ones surrounded by
+brackets are just one type of character class. When we need to make a
+distinction, we refer to them as "bracketed character classes."
+
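As a quick check of the DeMorgan remark above, here is a minimal sketch that tests each ASCII character against the classes in question. The variable names are ours and purely illustrative, and the range is limited to ASCII so the result does not depend on which Unicode rules are in effect.

```perl
use strict;
use warnings;

# Walk the ASCII range and compare how each class treats every character.
my ($same, $differ) = (0, 0);
for my $ord (0 .. 127) {
    my $ch       = chr $ord;
    my $neg_both = $ch =~ /[^\d\w]/ ? 1 : 0;   # neither a digit nor a word char
    my $neg_word = $ch =~ /[^\w]/   ? 1 : 0;
    my $not_word = $ch =~ /[\W]/    ? 1 : 0;
    my $either   = $ch =~ /[\D\W]/  ? 1 : 0;   # a non-digit OR a non-word char

    $same++   if $neg_both == $neg_word && $neg_word == $not_word;
    $differ++ if $either != $neg_both;
}

print "[^\\d\\w], [^\\w] and [\\W] agree on $same of 128 characters\n";
print "[\\D\\W] disagrees with [^\\d\\w] on $differ characters\n";
```

The disagreements are the letters and the underscore: each is a non-digit, so C<[\D\W]> accepts it, while C<[^\d\w]> rejects every word character.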
An anchor useful in basic regexps is the I<word anchor>
C<\b>. This matches a boundary between a word character and a non-word
character C<\w\W> or C<\W\w>:
@@ -449,6 +459,11 @@ character C<\w\W> or C<\W\w>:
Note in the last example, the end of the string is considered a word
boundary.
+For natural language processing (so that, for example, apostrophes are
+included in words), use instead C<\b{wb}>
+
+ "don't" =~ / .+? \b{wb} /x; # matches the whole string
+
You might wonder why C<'.'> matches everything but C<"\n"> - why not
every character? The reason is that often one is matching against
lines and would like to ignore the newline characters. For instance,
@@ -636,50 +651,50 @@ of what Perl does when it tries to match the regexp
=over 4
-=item 0
+=item Z<>0
Start with the first letter in the string 'a'.
-=item 1
+=item Z<>1
Try the first alternative in the first group 'abd'.
-=item 2
+=item Z<>2
Match 'a' followed by 'b'. So far so good.
-=item 3
+=item Z<>3
'd' in the regexp doesn't match 'c' in the string - a dead
end. So backtrack two characters and pick the second alternative in
the first group 'abc'.
-=item 4
+=item Z<>4
Match 'a' followed by 'b' followed by 'c'. We are on a roll
and have satisfied the first group. Set $1 to 'abc'.
-=item 5
+=item Z<>5
Move on to the second group and pick the first alternative
'df'.
-=item 6
+=item Z<>6
Match the 'd'.
-=item 7
+=item Z<>7
'f' in the regexp doesn't match 'e' in the string, so a dead
end. Backtrack one character and pick the second alternative in the
second group 'd'.
-=item 8
+=item Z<>8
'd' matches. The second grouping is satisfied, so set $2 to
'd'.
-=item 9
+=item Z<>9
We are at the end of the regexp, so we are done! We have
matched 'abcd' out of the string "abcde".
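To see the end result of the steps above, we can run the match and inspect the captures. This is only a sketch: the pattern is reconstructed from the alternatives named in the steps, so treat it as an assumption and not necessarily the exact regexp being analyzed.

```perl
use strict;
use warnings;

my $string = "abcde";
my ($first, $second, $whole);

# First group: 'abd' or 'abc'; second group: 'df' or 'd' (as named in the steps).
if ($string =~ /(abd|abc)(df|d)/) {
    ($first, $second, $whole) = ($1, $2, $&);
}

print "\$1 = '$first', \$2 = '$second', matched '$whole'\n";
# prints: $1 = 'abc', $2 = 'd', matched 'abcd'
```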
@@ -859,9 +874,9 @@ well, and this is exactly what the parenthesized construct C<(?|...)>,
set around an alternative achieves. Here is an extended version of the
previous pattern:
- if ( $time =~ /(?|(\d\d|\d):(\d\d)|(\d\d)(\d\d))\s+([A-Z][A-Z][A-Z])/ ){
- print "hour=$1 minute=$2 zone=$3\n";
- }
+ if($time =~ /(?|(\d\d|\d):(\d\d)|(\d\d)(\d\d))\s+([A-Z][A-Z][A-Z])/){
+ print "hour=$1 minute=$2 zone=$3\n";
+ }
Within the alternative numbering group, group numbers start at the same
position for each alternative. After the group, numbering continues
@@ -869,7 +884,7 @@ with one higher than the maximum reached across all the alternatives.
=head2 Position information
-In addition to what was matched, Perl (since 5.6.0) also provides the
+In addition to what was matched, Perl also provides the
positions of what was matched as contents of the C<@-> and C<@+>
arrays. C<$-[0]> is the position of the start of the entire match and
C<$+[0]> is the position of the end. Similarly, C<$-[n]> is the
@@ -879,8 +894,8 @@ this code
$x = "Mmm...donut, thought Homer";
$x =~ /^(Mmm|Yech)\.\.\.(donut|peas)/; # matches
- foreach $expr (1..$#-) {
- print "Match $expr: '${$expr}' at position ($-[$expr],$+[$expr])\n";
+ foreach $exp (1..$#-) {
+ print "Match $exp: '${$exp}' at position ($-[$exp],$+[$exp])\n";
}
prints
@@ -900,7 +915,10 @@ of the string after the match. An example:
In the second match, C<$`> equals C<''> because the regexp matched at the
first character position in the string and stopped; it never saw the
-second 'the'. It is important to note that using C<$`> and C<$'>
+second 'the'.
+
+If your code is to run on Perl versions earlier than
+5.20, it is worthwhile to note that using C<$`> and C<$'>
slows down regexp matching quite a bit, while C<$&> slows it down to a
lesser extent, because if they are used in one regexp in a program,
they are generated for I<all> regexps in the program. So if raw
@@ -913,8 +931,11 @@ C<@+> instead:
$' is the same as substr( $x, $+[0] )
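A minimal sketch of these equivalences, using a sample string of our own choosing: the three pieces are rebuilt from C<@-> and C<@+> alone.

```perl
use strict;
use warnings;

my $x = "the cat caught the mouse";   # sample string, chosen for illustration
$x =~ /cat/ or die "no match";

# Rebuild prematch, match and postmatch from the offset arrays.
my $pre   = substr($x, 0, $-[0]);
my $match = substr($x, $-[0], $+[0] - $-[0]);
my $post  = substr($x, $+[0]);

print "'$pre' '$match' '$post'\n";
# prints: 'the ' 'cat' ' caught the mouse'
```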
As of Perl 5.10, the C<${^PREMATCH}>, C<${^MATCH}> and C<${^POSTMATCH}>
-variables may be used. These are only set if the C</p> modifier is present.
-Consequently they do not penalize the rest of the program.
+variables may be used. These are only set if the C</p> modifier is
+present. Consequently they do not penalize the rest of the program. In
+Perl 5.20, C<${^PREMATCH}>, C<${^MATCH}> and C<${^POSTMATCH}> are available
+whether the C</p> has been used or not (the modifier is ignored), and
+C<$`>, C<$'> and C<$&> do not cause any speed difference.
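Here is a short sketch of the three variables in use. The sample string is our own; the C</p> modifier is kept so that the code behaves the same on Perl versions from 5.10 onward.

```perl
use strict;
use warnings;

my $x = "the cat caught the mouse";   # sample string, chosen for illustration
my ($pre, $match, $post);

# /p asks Perl to preserve the pieces of this one match.
if ($x =~ /cat/p) {
    ($pre, $match, $post) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});
}

print "'$pre' '$match' '$post'\n";
# prints: 'the ' 'cat' ' caught the mouse'
```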
=head2 Non-capturing groupings
@@ -946,6 +967,12 @@ required for some reason:
@num = split /(a|b)+/, $x; # @num = ('12','a','34','a','5')
@num = split /(?:a|b)+/, $x; # @num = ('12','34','5')
+In Perl 5.22 and later, all groups within a regexp can be set to
+non-capturing by using the new C</n> flag:
+
+ "hello" =~ /(hi|hello)/n; # $1 is not set!
+
+See L<perlre/"n"> for more information.
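One consequence worth a small sketch: under this flag, plain parentheses stop capturing, but named groups still do. The group name C<greet> is just an illustrative choice.

```perl
use 5.022;      # the flag is only available from 5.22 on
use warnings;

# A plain group does not capture under /n.
my $plain_captured = 0;
if ("hello" =~ /(hi|hello)/n) {
    $plain_captured = defined $1 ? 1 : 0;
}

# A named group is still captured and can be read through %+.
my $named;
if ("hello" =~ /(?<greet>hi|hello)/n) {
    $named = $+{greet};
}

print "plain group captured: ", ($plain_captured ? "yes" : "no"), "\n";
print "named group: $named\n";
```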
=head2 Matching repetitions
@@ -999,10 +1026,10 @@ Here are some examples:
/y(es)?/i; # matches 'y', 'Y', or a case-insensitive 'yes'
$year =~ /^\d{2,4}$/; # make sure year is at least 2 but not more
# than 4 digits
- $year =~ /^\d{4}$|^\d{2}$/; # better match; throw out 3-digit dates
- $year =~ /^\d{2}(\d{2})?$/; # same thing written differently. However,
- # this captures the last two digits in $1
- # and the other does not.
+ $year =~ /^\d{4}$|^\d{2}$/; # better match; throw out 3-digit dates
+ $year =~ /^\d{2}(\d{2})?$/; # same thing written differently.
+ # However, this captures the last two
+ # digits in $1 and the other does not.
% simple_grep '^(\w+)\g1$' /usr/dict/words # isn't this easier?
beriberi
@@ -1243,35 +1270,35 @@ backtracking. Here is a step-by-step analysis of the example
=over 4
-=item 0
+=item Z<>0
Start with the first letter in the string 't'.
-=item 1
+=item Z<>1
The first quantifier '.*' starts out by matching the whole
string 'the cat in the hat'.
-=item 2
+=item Z<>2
'a' in the regexp element 'at' doesn't match the end of the
string. Backtrack one character.
-=item 3
+=item Z<>3
'a' in the regexp element 'at' still doesn't match the last
letter of the string 't', so backtrack one more character.
-=item 4
+=item Z<>4
Now we can match the 'a' and the 't'.
-=item 5
+=item Z<>5
Move on to the third element '.*'. Since we are at the end of
the string and '.*' can match 0 times, assign it the empty string.
-=item 6
+=item Z<>6
We are done!
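The outcome of this analysis can be checked directly. The sketch below assumes the regexp has the three elements named in the steps, each wrapped in a group so we can see what it matched; the exact pattern is our reconstruction.

```perl
use strict;
use warnings;

my $x = "the cat in the hat";
my ($head, $at, $tail);

# Greedy '.*', then 'at', then a second '.*' (pattern assumed from the steps).
if ($x =~ /^(.*)(at)(.*)$/) {
    ($head, $at, $tail) = ($1, $2, $3);
}

print "\$1 = '$head', \$2 = '$at', \$3 = '$tail'\n";
# prints: $1 = 'the cat in the h', $2 = 'at', $3 = ''
```

The first quantifier grabs as much as it can and gives back only the two characters needed for 'at', which leaves the last group with the empty string.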
@@ -1583,9 +1610,9 @@ there are no groupings, a list of matches to the whole regexp. So if
we wanted just the words, we could use
@words = ($x =~ /(\w+)/g); # matches,
- # $word[0] = 'cat'
- # $word[1] = 'dog'
- # $word[2] = 'house'
+ # $words[0] = 'cat'
+ # $words[1] = 'dog'
+ # $words[2] = 'house'
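As a runnable sketch of the list-context form, assuming C<$x> holds the three words shown in the comments:

```perl
use strict;
use warnings;

my $x = "cat dog house";          # assumed input, matching the comments above
my @words = ($x =~ /(\w+)/g);     # list context returns every capture

print scalar(@words), " words: ", join(", ", @words), "\n";
# prints: 3 words: cat, dog, house
```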
Closely associated with the C</g> modifier is the C<\G> anchor. The
C<\G> anchor matches at the point where the previous C</g> match left
@@ -1738,7 +1765,8 @@ One other interesting thing that the C</r> flag allows is chaining
substitutions:
$x = "Cats are great.";
- print $x =~ s/Cats/Dogs/r =~ s/Dogs/Frogs/r =~ s/Frogs/Hedgehogs/r, "\n";
+ print $x =~ s/Cats/Dogs/r =~ s/Dogs/Frogs/r =~
+ s/Frogs/Hedgehogs/r, "\n";
# prints "Hedgehogs are great."
A modifier available specifically to search and replace is the
@@ -1750,7 +1778,7 @@ computation in the process of replacing text. This example counts
character frequencies in a line:
$x = "Bill the cat";
- $x =~ s/(.)/$chars{$1}++;$1/eg; # final $1 replaces char with itself
+ $x =~ s/(.)/$chars{$1}++;$1/eg; # final $1 replaces char with itself
print "frequency of '$_' is $chars{$_}\n"
foreach (sort {$chars{$b} <=> $chars{$a}} keys %chars);
@@ -1874,8 +1902,8 @@ work if they appear in a regular expression embedded directly in a
program, but not when contained in a string that is interpolated in a
pattern.
-With the advent of 5.6.0, Perl regexps can handle more than just the
-standard ASCII character set. Perl now supports I<Unicode>, a standard
+Perl regexps can handle more than just the
+standard ASCII character set. Perl supports I<Unicode>, a standard
for representing the alphabets from virtually all of the world's written
languages, and a host of symbols. Perl's text strings are Unicode strings, so
they can contain characters with a value (codepoint or character number) higher
@@ -1907,18 +1935,17 @@ specified in the Unicode standard. For instance, if we wanted to
represent or match the astrological sign for the planet Mercury, we
could use
- use charnames ":full"; # use named chars with Unicode full names
$x = "abc\N{MERCURY}def";
$x =~ /\N{MERCURY}/; # matches
-One can also use short names or restrict names to a certain alphabet:
+One can also use "short" names:
- use charnames ':full';
print "\N{GREEK SMALL LETTER SIGMA} is called sigma.\n";
-
- use charnames ":short";
print "\N{greek:Sigma} is an upper-case sigma.\n";
+You can also restrict names to a certain alphabet by specifying the
+L<charnames> pragma:
+
use charnames qw(greek);
print "\N{sigma} is Greek sigma\n";
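To confirm which character the script-restricted name refers to without printing non-ASCII output, a small sketch can show its code point instead:

```perl
use strict;
use warnings;
use charnames qw(greek);    # short names are looked up in the Greek script

my $sigma = "\N{sigma}";
printf "U+%04X\n", ord $sigma;
# prints: U+03C3
```

U+03C3 is GREEK SMALL LETTER SIGMA, the same character the full name produces.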
@@ -1927,25 +1954,24 @@ Consortium, L; explanatory
material with links to other resources at
L.
-The answer to requirement 2) is, as of 5.6.0, that a regexp (mostly)
-uses Unicode characters. (The "mostly" is for messy backward
+The answer to requirement 2) is that a regexp (mostly)
+uses Unicode characters. The "mostly" is for messy backward
compatibility reasons, but starting in Perl 5.14, any regex compiled in
the scope of a C