This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5
=head1 NAME

perlretut - Perl regular expressions tutorial

=head1 DESCRIPTION

This page provides a basic tutorial on understanding, creating and
using regular expressions in Perl. It serves as a complement to the
reference page on regular expressions L<perlre>. Regular expressions
are an integral part of the C<m//>, C<s///>, C<qr//> and C<split>
operators and so this tutorial also overlaps with
L<perlop/"Regexp Quote-Like Operators"> and L<perlfunc/split>.

Perl is widely renowned for excellence in text processing, and regular
expressions are one of the big factors behind this fame. Perl regular
expressions display an efficiency and flexibility unknown in most
other computer languages. Mastering even the basics of regular
expressions will allow you to manipulate text with surprising ease.

What is a regular expression? At its most basic, a regular expression
is a template that is used to determine if a string has certain
characteristics. The string is most often some text, such as a line,
sentence, web page, or even a whole book, but less commonly it could be
some binary data as well.
Suppose we want to determine if the text in a variable, C<$var>, contains
the sequence of characters C<m> C<u> C<s> C<h> C<r> C<o> C<o> C<m>
(blanks added for legibility). We can write in Perl

    $var =~ m/mushroom/

The value of this expression will be TRUE if C<$var> contains that
sequence of characters, and FALSE otherwise. The portion enclosed in
C<"E<sol>"> characters denotes the characteristic we are looking for.
We use the term I<pattern> for it. The process of looking to see if the
pattern occurs in the string is called I<matching>, and the C<"=~">
operator along with the C<"m//"> tells Perl to try to match the pattern
against the string. Note that the pattern is also a string, but a very
special kind of one, as we will see. Patterns are in common use these
days;
examples are the patterns typed into a search engine to find web pages
and the patterns used to list files in a directory, e.g., C<ls *.txt>
or C<dir *.*>. In Perl, the patterns described by regular expressions
are used not only to search strings, but also to extract desired parts
of strings, and to do search and replace operations.

Regular expressions have the undeserved reputation of being abstract
and difficult to understand. This reputation stems simply from the
notation used to express them, which tends to be terse and dense, and not
from any inherent complexity. We recommend using the C<"/x"> regular
expression modifier (described below) along with plenty of white space
to make them less dense, and easier to read. Regular expressions are
constructed using
simple concepts like conditionals and loops and are no more difficult
to understand than the corresponding C<if> conditionals and C<while>
loops in the Perl language itself.

This tutorial flattens the learning curve by discussing regular
expression concepts, along with their notation, one at a time and with
many examples. The first part of the tutorial will progress from the
simplest word searches to the basic regular expression concepts. If
you master the first part, you will have all the tools needed to meet
about 98% of your needs. The second part of the tutorial is for those
comfortable with the basics and hungry for more power tools. It
discusses the more advanced regular expression operators and
introduces the latest cutting-edge innovations.

A note: to save time, 'regular expression' is often abbreviated as
regexp or regex. Regexp is a more natural abbreviation than regex, but
is harder to pronounce. The Perl pod documentation is evenly split on
regexp vs regex; in Perl, there is more than one way to abbreviate it.
We'll use regexp in this tutorial.

New in v5.22, L<C<use re 'strict'>|re/'strict' mode> applies stricter
rules than otherwise when compiling regular expression patterns. It can
find things that, while legal, may not be what you intended.

=head1 Part 1: The basics

=head2 Simple word matching

The simplest regexp is simply a word, or more generally, a string of
characters. A regexp consisting of just a word matches any string that
contains that word:

    "Hello World" =~ /World/;  # matches

What is this Perl statement all about? C<"Hello World"> is a simple
double-quoted string. C<World> is the regular expression and the
C<//> enclosing C</World/> tells Perl to search a string for a match.
The operator C<=~> associates the string with the regexp match and
produces a true value if the regexp matched, or false if the regexp
did not match. In our case, C<World> matches the second word in
C<"Hello World">, so the expression is true. Expressions like this
are useful in conditionals:

    if ("Hello World" =~ /World/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }

There are useful variations on this theme. The sense of the match can
be reversed by using the C<!~> operator:

    if ("Hello World" !~ /World/) {
        print "It doesn't match\n";
    }
    else {
        print "It matches\n";
    }

The literal string in the regexp can be replaced by a variable:

    $greeting = "World";
    if ("Hello World" =~ /$greeting/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }

If you're matching against the special default variable C<$_>, the
C<$_ =~> part can be omitted:

    $_ = "Hello World";
    if (/World/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }

And finally, the C<//> default delimiters for a match can be changed
to arbitrary delimiters by putting an C<'m'> out front:

    "Hello World" =~ m!World!;      # matches, delimited by '!'
    "Hello World" =~ m{World};      # matches, note the matching '{}'
    "/usr/bin/perl" =~ m"/perl";    # matches after '/usr/bin',
                                    # '/' becomes an ordinary char

C</World/>, C<m!World!>, and C<m{World}> all represent the
same thing. When, e.g., the quote (C<">) is used as a delimiter, the forward
slash C<'/'> becomes an ordinary character and can be used in this regexp
without trouble.

Let's consider how different regexps would match C<"Hello World">:

    "Hello World" =~ /world/;  # doesn't match
    "Hello World" =~ /o W/;    # matches
    "Hello World" =~ /oW/;     # doesn't match
    "Hello World" =~ /World /; # doesn't match

The first regexp C<world> doesn't match because regexps are
case-sensitive. The second regexp matches because the substring
S<C<'o W'>> occurs in the string S<C<"Hello World">>. The space
character ' ' is treated like any other character in a regexp and is
needed to match in this case. The lack of a space character is the
reason the third regexp C<'oW'> doesn't match. The fourth regexp
C<'World '> doesn't match because there is a space at the end of the
regexp, but not at the end of the string. The lesson here is that
regexps must match a part of the string I<exactly> in order for the
statement to be true.

If a regexp matches in more than one place in the string, Perl will
always match at the earliest possible point in the string:

    "Hello World" =~ /o/;       # matches 'o' in 'Hello'
    "That hat is red" =~ /hat/; # matches 'hat' in 'That'

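One way to see the "earliest possible point" rule in action is to ask
Perl where the match started. The sketch below assumes the special array
C<@-> (documented in L<perlvar>), whose first element C<$-[0]> holds the
offset of the start of the last successful match; the helper name
C<match_offset> is made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the offset at which $regexp first matches
# $string, or undef if there is no match.
sub match_offset {
    my ($string, $regexp) = @_;
    return $string =~ /$regexp/ ? $-[0] : undef;
}

print match_offset("Hello World", "o"), "\n";        # 4, the 'o' in 'Hello'
print match_offset("That hat is red", "hat"), "\n";  # 1, the 'hat' in 'That'
```

Offsets count from zero, so the 'hat' inside 'That' is reported at
offset 1, ahead of the standalone word 'hat' at offset 5.
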
With respect to character matching, there are a few more points you
need to know about. First of all, not all characters can be used 'as
is' in a match. Some characters, called I<metacharacters>, are reserved
for use in regexp notation. The metacharacters are

    {}[]()^$.|*+?\

The significance of each of these will be explained
in the rest of the tutorial, but for now, it is important only to know
that a metacharacter can be matched by putting a backslash before it:

    "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
    "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
    "The interval is [0,1)." =~ /[0,1)./     # is a syntax error!
    "The interval is [0,1)." =~ /\[0,1\)\./  # matches
    "#!/usr/bin/perl" =~ /#!\/usr\/bin\/perl/;  # matches

In the last regexp, the forward slash C<'/'> is also backslashed,
because it is used to delimit the regexp. This can lead to LTS
(leaning toothpick syndrome), however, and it is often more readable
to change delimiters.

    "#!/usr/bin/perl" =~ m!#\!/usr/bin/perl!;  # easier to read

The backslash character C<'\'> is a metacharacter itself and needs to
be backslashed:

    'C:\WIN32' =~ /C:\\WIN/;   # matches

In addition to the metacharacters, there are some ASCII characters
which don't have printable character equivalents and are instead
represented by I<escape sequences>. Common examples are C<\t> for a
tab, C<\n> for a newline, C<\r> for a carriage return and C<\a> for a
bell (or alert). If your string is better thought of as a sequence of arbitrary
bytes, the octal escape sequence, e.g., C<\033>, or hexadecimal escape
sequence, e.g., C<\x1B> may be a more natural representation for your
bytes. Here are some examples of escapes:

    "1000\t2000" =~ m(0\t2)    # matches
    "1000\n2000" =~ /0\n20/    # matches
    "1000\t2000" =~ /\000\t2/  # doesn't match, "0" ne "\000"
    "cat" =~ /\o{143}\x61\x74/ # matches in ASCII, but a weird way
                               # to spell cat

If you've been around Perl a while, all this talk of escape sequences
may seem familiar. Similar escape sequences are used in double-quoted
strings and in fact the regexps in Perl are mostly treated as
double-quoted strings. This means that variables can be used in
regexps as well. Just like double-quoted strings, the values of the
variables in the regexp will be substituted in before the regexp is
evaluated for matching purposes. So we have:

    $foo = 'house';
    'housecat' =~ /$foo/;      # matches
    'cathouse' =~ /cat$foo/;   # matches
    'housecat' =~ /${foo}cat/; # matches

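Because the variable's value is substituted in before matching, any
metacharacters it contains keep their special meaning. The sketch below
shows one way to deal with that, assuming the C<\Q...\E> quoting escape
described in L<perlop>, which this tutorial has not introduced at this
point; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $sum = '2+2';    # contains the metacharacter '+'

# Interpolated as-is, '+' keeps its special meaning.
my $plain  = "2+2=4" =~ /$sum/     ? 1 : 0;

# \Q...\E backslashes the metacharacters in the interpolated value.
my $quoted = "2+2=4" =~ /\Q$sum\E/ ? 1 : 0;

print "plain: $plain, quoted: $quoted\n";   # plain: 0, quoted: 1
```

This matters most when the interpolated text comes from outside the
program, such as a command line argument.
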
So far, so good. With the knowledge above you can already perform
searches with just about any literal string regexp you can dream up.
Here is a I<very simple> emulation of the Unix grep program:

    % cat > simple_grep
    #!/usr/bin/perl
    $regexp = shift;
    while (<>) {
        print if /$regexp/;
    }
    ^D

    % chmod +x simple_grep

    % simple_grep abba /usr/dict/words
    Babbage
    cabbage
    cabbages
    sabbath
    Sabbathize
    Sabbathizes
    sabbatical
    scabbard
    scabbards

This program is easy to understand. C<#!/usr/bin/perl> is the standard
way to invoke a perl program from the shell.
S<C<$regexp = shift;>> saves the first command line argument as the
regexp to be used, leaving the rest of the command line arguments to
be treated as files. S<C<< while (<>) >>> loops over all the lines in
all the files. For each line, S<C<print if /$regexp/;>> prints the
line if the regexp matches the line. In this line, both C<print> and
C</$regexp/> use the default variable C<$_> implicitly.

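The same loop can be tried without a words file by passing the lines in
directly. This is only a sketch of the idea above: the subroutine name
C<grep_lines> and the sample word list are invented for the
illustration, and C<use strict> is added so that typos in variable names
are caught.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines that match $regexp.
sub grep_lines {
    my ($regexp, @lines) = @_;
    my @hits;
    for (@lines) {                    # each line is aliased to $_
        push @hits, $_ if /$regexp/;  # /$regexp/ tests $_ implicitly
    }
    return @hits;
}

my @words = ("Babbage\n", "apple\n", "sabbatical\n");
print grep_lines('abba', @words);     # prints Babbage and sabbatical
```

As in C<simple_grep>, the match is tested against C<$_>, which the
C<for> loop sets to each line in turn.
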
With all of the regexps above, if the regexp matched anywhere in the
string, it was considered a match. Sometimes, however, we'd like to
specify I<where> in the string the regexp should try to match. To do
this, we would use the I<anchor> metacharacters C<^> and C<$>. The
anchor C<^> means match at the beginning of the string and the anchor
C<$> means match at the end of the string, or before a newline at the
end of the string. Here is how they are used:

    "housekeeper" =~ /keeper/;    # matches
    "housekeeper" =~ /^keeper/;   # doesn't match
    "housekeeper" =~ /keeper$/;   # matches
    "housekeeper\n" =~ /keeper$/; # matches

The second regexp doesn't match because C<^> constrains C<keeper> to
match only at the beginning of the string, but C<"housekeeper"> has
keeper starting in the middle. The third regexp does match, since the
C<$> constrains C<keeper> to match only at the end of the string.

When both C<^> and C<$> are used at the same time, the regexp has to
match both the beginning and the end of the string, i.e., the regexp
matches the whole string. Consider

    "keeper" =~ /^keep$/;   # doesn't match
    "keeper" =~ /^keeper$/; # matches
    ""       =~ /^$/;       # ^$ matches an empty string

The first regexp doesn't match because the string has more to it than
C<keep>. Since the second regexp is exactly the string, it
matches. Using both C<^> and C<$> in a regexp forces the complete
string to match, so it gives you complete control over which strings
match and which don't. Suppose you are looking for a fellow named
bert, off in a string by himself:

    "dogbert" =~ /bert/;   # matches, but not what you want

    "dilbert" =~ /^bert/;  # doesn't match, but ..
    "bertram" =~ /^bert/;  # matches, so still not good enough

    "bertram" =~ /^bert$/; # doesn't match, good
    "dilbert" =~ /^bert$/; # doesn't match, good
    "bert"    =~ /^bert$/; # matches, perfect

Of course, in the case of a literal string, one could just as easily
use the string comparison S<C<$string eq 'bert'>> and it would be
more efficient. The C<^...$> regexp really becomes useful when we
add in the more powerful regexp tools below.

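The regexp and the C<eq> comparison are not quite interchangeable when
the string ends in a newline, because C<$> is allowed to match just
before a final C<"\n">. The sketch below compares the two, and also uses
the C<\A> and C<\z> anchors that are described later in this tutorial;
the variable names are illustrative.

```perl
use strict;
use warnings;

my $line = "bert\n";    # e.g. a line read from a file, newline included

my $by_regexp = $line =~ /^bert$/   ? 1 : 0;  # $ may match before the final "\n"
my $by_eq     = $line eq 'bert'     ? 1 : 0;  # eq compares every character
my $exact     = $line =~ /\Abert\z/ ? 1 : 0;  # \z matches only at the very end

print "$by_regexp $by_eq $exact\n";   # 1 0 0
```
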
=head2 Using character classes

Although one can already do quite a lot with the literal string
regexps above, we've only scratched the surface of regular expression
technology. In this and subsequent sections we will introduce regexp
concepts (and associated metacharacter notations) that will allow a
regexp to represent not just a single character sequence, but a I<whole
class> of them.

One such concept is that of a I<character class>. A character class
allows a set of possible characters, rather than just a single
character, to match at a particular point in a regexp. You can define
your own custom character classes. These
are denoted by brackets C<[...]>, with the set of characters
to be possibly matched inside. Here are some examples:

    /cat/;       # matches 'cat'
    /[bcr]at/;   # matches 'bat', 'cat', or 'rat'
    /item[0123456789]/;  # matches 'item0' or ... or 'item9'
    "abc" =~ /[cab]/;    # matches 'a'

In the last statement, even though C<'c'> is the first character in
the class, C<'a'> matches because the first character position in the
string is the earliest point at which the regexp can match.

    /[yY][eE][sS]/;      # match 'yes' in a case-insensitive way
                         # 'yes', 'Yes', 'YES', etc.

This regexp displays a common task: perform a case-insensitive
match. Perl provides a way of avoiding all those brackets by simply
appending an C<'i'> to the end of the match. Then C</[yY][eE][sS]/;>
can be rewritten as C</yes/i;>. The C<'i'> stands for
case-insensitive and is an example of a I<modifier> of the matching
operation. We will meet other modifiers later in the tutorial.

We saw in the section above that there were ordinary characters, which
represented themselves, and special characters, which needed a
backslash C<\> to represent themselves. The same is true in a
character class, but the sets of ordinary and special characters
inside a character class are different from those outside a character
class. The special characters for a character class are C<-]\^$> (and
the pattern delimiter, whatever it is).
C<]> is special because it denotes the end of a character class. C<$> is
special because it denotes a scalar variable. C<\> is special because
it is used in escape sequences, just like above. Here is how the
special characters C<]$\> are handled:

    /[\]c]def/; # matches ']def' or 'cdef'
    $x = 'bcr';
    /[$x]at/;   # matches 'bat', 'cat', or 'rat'
    /[\$x]at/;  # matches '$at' or 'xat'
    /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'

The last two are a little tricky. In C<[\$x]>, the backslash protects
the dollar sign, so the character class has two members C<$> and C<x>.
In C<[\\$x]>, the backslash is protected, so C<$x> is treated as a
variable and substituted in double quote fashion.

The special character C<'-'> acts as a range operator within character
classes, so that a contiguous set of characters can be written as a
range. With ranges, the unwieldy C<[0123456789]> and C<[abc...xyz]>
become the svelte C<[0-9]> and C<[a-z]>. Some examples are

    /item[0-9]/;    # matches 'item0' or ... or 'item9'
    /[0-9bx-z]aa/;  # matches '0aa', ..., '9aa',
                    # 'baa', 'xaa', 'yaa', or 'zaa'
    /[0-9a-fA-F]/;  # matches a hexadecimal digit
    /[0-9a-zA-Z_]/; # matches a "word" character,
                    # like those in a Perl variable name

If C<'-'> is the first or last character in a character class, it is
treated as an ordinary character; C<[-ab]>, C<[ab-]> and C<[a\-b]> are
all equivalent.

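As a small illustration of ranges, the sketch below combines the
hexadecimal class with the C<^> and C<$> anchors from the previous
section. The helper name C<is_hex_digit> is invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $s is a single hexadecimal digit
# (the $ anchor would also tolerate one trailing newline).
sub is_hex_digit {
    my ($s) = @_;
    return $s =~ /^[0-9a-fA-F]$/ ? 1 : 0;
}

for my $c ('7', 'c', 'F', 'g', '-') {
    printf "%s: %d\n", $c, is_hex_digit($c);   # 1 for 7, c, F; 0 for g, -
}

# A leading '-' is an ordinary member of the class.
my $dash_is_literal = '-' =~ /[-ab]/ ? 1 : 0;
print "dash: $dash_is_literal\n";              # dash: 1
```
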
The special character C<^> in the first position of a character class
denotes a I<negated character class>, which matches any character but
those in the brackets. Both C<[...]> and C<[^...]> must match a
character, or the match fails. Then

    /[^a]at/;  # doesn't match 'aat' or 'at', but matches
               # all other 'bat', 'cat', '0at', '%at', etc.
    /[^0-9]/;  # matches a non-numeric character
    /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary

Now, even C<[0-9]> can be a bother to write multiple times, so in the
interest of saving keystrokes and making regexps more readable, Perl
has several abbreviations for common character classes, as shown below.
Since the introduction of Unicode, unless the C<//a> modifier is in
effect, these character classes match more than just a few characters in
the ASCII range.

=over 4

=item *

\d matches a digit, not just [0-9] but also digits from non-Roman scripts

=item *

\s matches a whitespace character, the set [\ \t\r\n\f] and others

=item *

\w matches a word character (alphanumeric or _), not just [0-9a-zA-Z_]
but also digits and characters from non-Roman scripts

=item *

\D is a negated \d; it represents any other character than a digit, or [^\d]

=item *

\S is a negated \s; it represents any non-whitespace character [^\s]

=item *

\W is a negated \w; it represents any non-word character [^\w]

=item *

The period '.' matches any character but "\n" (unless the modifier C<//s> is
in effect, as explained below).

=item *

\N, like the period, matches any character but "\n", but it does so
regardless of whether the modifier C<//s> is in effect.

=back

The C<//a> modifier, available starting in Perl 5.14, is used to
restrict the matches of \d, \s, and \w to just those in the ASCII range.
It is useful to keep your program from being needlessly exposed to full
Unicode (and its accompanying security considerations) when all you want
is to process English-like text. (The "a" may be doubled, C<//aa>, to
provide even more restrictions, preventing case-insensitive matching of
ASCII with non-ASCII characters; otherwise a Unicode "Kelvin Sign"
would caselessly match a "k" or "K".)

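The effect of C<//a> and C<//aa> can be checked with a couple of
non-ASCII characters. This is a sketch that assumes Perl 5.14 or later;
the two code points chosen here (ARABIC-INDIC DIGIT THREE and KELVIN
SIGN) and the variable names are just convenient examples.

```perl
use strict;
use warnings;

my $arabic_three = "\x{0663}";   # ARABIC-INDIC DIGIT THREE
my $kelvin       = "\x{212A}";   # KELVIN SIGN

my $d_unicode = $arabic_three =~ /\d/  ? 1 : 0;   # 1: a Unicode digit
my $d_ascii   = $arabic_three =~ /\d/a ? 1 : 0;   # 0: not in [0-9]

my $k_loose   = $kelvin =~ /k/i   ? 1 : 0;   # 1: caselessly matches "k"
my $k_strict  = $kelvin =~ /k/iaa ? 1 : 0;   # 0: //aa keeps ASCII and
                                             #    non-ASCII apart

print "$d_unicode $d_ascii $k_loose $k_strict\n";   # 1 0 1 0
```
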
The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside
of bracketed character classes. Here are some in use:

    /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
    /[\d\s]/;         # matches any digit or whitespace character
    /\w\W\w/;         # matches a word char, followed by a
                      # non-word char, followed by a word char
    /..rt/;           # matches any two chars, followed by 'rt'
    /end\./;          # matches 'end.'
    /end[.]/;         # same thing, matches 'end.'

Because a period is a metacharacter, it needs to be escaped to match
as an ordinary period. Because, for example, C<\d> and C<\w> are sets
of characters, it is incorrect to think of C<[^\d\w]> as C<[\D\W]>; in
fact C<[^\d\w]> is the same as C<[^\w]>, which is the same as
C<[\W]>. Think De Morgan's laws.

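The difference between C<[^\d\w]> and C<[\D\W]> is easy to check on a
few sample characters. The helper names C<in_negated> and C<in_union>
below are made up for this sketch.

```perl
use strict;
use warnings;

# True if $c is in the negated class [^\d\w], which equals [\W].
sub in_negated { my ($c) = @_; return $c =~ /[^\d\w]/ ? 1 : 0 }

# True if $c is in [\D\W], the union of non-digits and non-word
# characters, which is every character except the digits.
sub in_union   { my ($c) = @_; return $c =~ /[\D\W]/  ? 1 : 0 }

for my $c ('a', '5', '_', '!') {
    printf "%s  negated=%d  union=%d\n", $c, in_negated($c), in_union($c);
}
```

Only C<'!'> is in the negated class, while the union class accepts
everything here except the digit C<'5'>.
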
In actuality, the period and C<\d\s\w\D\S\W> abbreviations are
themselves types of character classes, so the ones surrounded by
brackets are just one type of character class. When we need to make a
distinction, we refer to them as "bracketed character classes."

An anchor useful in basic regexps is the I<word anchor>
C<\b>. This matches a boundary between a word character and a non-word
character C<\w\W> or C<\W\w>:

    $x = "Housecat catenates house and cat";
    $x =~ /cat/;     # matches cat in 'Housecat'
    $x =~ /\bcat/;   # matches cat in 'catenates'
    $x =~ /cat\b/;   # matches cat in 'Housecat'
    $x =~ /\bcat\b/; # matches 'cat' at end of string

Note in the last example, the end of the string is considered a word
boundary.

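To confirm which C<cat> each of these regexps picks, one can print the
offset where the match begins. The sketch assumes the C<@-> array from
L<perlvar>, whose element C<$-[0]> is the start of the last successful
match; C<offset_of> is a name invented for this example.

```perl
use strict;
use warnings;

my $x = "Housecat catenates house and cat";

# Hypothetical helper: offset in $x where $regexp matches, or undef.
sub offset_of {
    my ($regexp) = @_;
    return $x =~ /$regexp/ ? $-[0] : undef;
}

print offset_of('cat'),       "\n";   # 5,  inside 'Housecat'
print offset_of('\bcat'),     "\n";   # 9,  start of 'catenates'
print offset_of('cat\b'),     "\n";   # 5,  end of 'Housecat'
print offset_of('\bcat\b'),   "\n";   # 29, the final word 'cat'
```
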
For natural language processing (so that, for example, apostrophes are
included in words), use instead C<\b{wb}>

    "don't" =~ / .+? \b{wb} /x;  # matches the whole string

You might wonder why C<'.'> matches everything but C<"\n"> - why not
every character? The reason is that often one is matching against
lines and would like to ignore the newline characters. For instance,
while the string C<"\n"> represents one line, we would like to think
of it as empty. Then

    ""   =~ /^$/;    # matches
    "\n" =~ /^$/;    # matches, $ anchors before "\n"

    ""   =~ /./;     # doesn't match; it needs a char
    ""   =~ /^.$/;   # doesn't match; it needs a char
    "\n" =~ /^.$/;   # doesn't match; it needs a char other than "\n"
    "a"  =~ /^.$/;   # matches
    "a\n" =~ /^.$/;  # matches, $ anchors before "\n"

This behavior is convenient, because we usually want to ignore
newlines when we count and match characters in a line. Sometimes,
however, we want to keep track of newlines. We might even want C<^>
and C<$> to anchor at the beginning and end of lines within the
string, rather than just the beginning and end of the string. Perl
allows us to choose between ignoring and paying attention to newlines
by using the C<//s> and C<//m> modifiers. C<//s> and C<//m> stand for
single line and multi-line and they determine whether a string is to
be treated as one continuous string, or as a set of lines. The two
modifiers affect two aspects of how the regexp is interpreted: 1) how
the C<'.'> character class is defined, and 2) where the anchors C<^>
and C<$> are able to match. Here are the four possible combinations:

=over 4

=item *

no modifiers (//): Default behavior. C<'.'> matches any character
except C<"\n">. C<^> matches only at the beginning of the string and
C<$> matches only at the end or before a newline at the end.

=item *

s modifier (//s): Treat string as a single long line. C<'.'> matches
any character, even C<"\n">. C<^> matches only at the beginning of
the string and C<$> matches only at the end or before a newline at the
end.

=item *

m modifier (//m): Treat string as a set of multiple lines. C<'.'>
matches any character except C<"\n">. C<^> and C<$> are able to match
at the start or end of I<any> line within the string.

=item *

both s and m modifiers (//sm): Treat string as a single long line, but
detect multiple lines. C<'.'> matches any character, even
C<"\n">. C<^> and C<$>, however, are able to match at the start or end
of I<any> line within the string.

=back

Here are examples of C<//s> and C<//m> in action:

    $x = "There once was a girl\nWho programmed in Perl\n";

    $x =~ /^Who/;   # doesn't match, "Who" not at start of string
    $x =~ /^Who/s;  # doesn't match, "Who" not at start of string
    $x =~ /^Who/m;  # matches, "Who" at start of second line
    $x =~ /^Who/sm; # matches, "Who" at start of second line

    $x =~ /girl.Who/;   # doesn't match, "." doesn't match "\n"
    $x =~ /girl.Who/s;  # matches, "." matches "\n"
    $x =~ /girl.Who/m;  # doesn't match, "." doesn't match "\n"
    $x =~ /girl.Who/sm; # matches, "." matches "\n"

Most of the time, the default behavior is what is wanted, but C<//s> and
C<//m> are occasionally very useful. If C<//m> is being used, the start
of the string can still be matched with C<\A> and the end of the string
can still be matched with the anchors C<\Z> (matches both the end and
the newline before, like C<$>), and C<\z> (matches only the end):

    $x =~ /^Who/m;   # matches, "Who" at start of second line
    $x =~ /\AWho/m;  # doesn't match, "Who" is not at start of string

    $x =~ /girl$/m;  # matches, "girl" at end of first line
    $x =~ /girl\Z/m; # doesn't match, "girl" is not at end of string

    $x =~ /Perl\Z/m; # matches, "Perl" is at newline before end
    $x =~ /Perl\z/m; # doesn't match, "Perl" is not at end of string

We now know how to create choices among classes of characters in a
regexp. What about choices among words or character strings? Such
choices are described in the next section.

=head2 Matching this or that

Sometimes we would like our regexp to be able to match different
possible words or character strings. This is accomplished by using
the I<alternation> metacharacter C<|>. To match C<dog> or C<cat>, we
form the regexp C<dog|cat>. As before, Perl will try to match the
regexp at the earliest possible point in the string. At each
character position, Perl will first try to match the first
alternative, C<dog>. If C<dog> doesn't match, Perl will then try the
next alternative, C<cat>. If C<cat> doesn't match either, then the
match fails and Perl moves to the next position in the string. Some
examples:

    "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
    "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"

Even though C<dog> is the first alternative in the second regexp,
C<cat> is able to match earlier in the string.

    "cats" =~ /c|ca|cat|cats/; # matches "c"
    "cats" =~ /cats|cat|ca|c/; # matches "cats"

Here, all the alternatives match at the first string position, so the
first alternative is the one that matches. If some of the
alternatives are truncations of the others, put the longest ones first
to give them a chance to match.

    "cab" =~ /a|b|c/ # matches "c"
                     # /a|b|c/ == /[abc]/

The last example points out that character classes are like
alternations of characters. At a given character position, the first
alternative that allows the regexp match to succeed will be the one
that matches.

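To see which alternative actually won, one can look at the text that
was matched. The sketch below assumes the special variable C<$&>
(documented in L<perlvar>), which holds the string matched by the last
successful match; C<matched_text> is a name invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: the text matched by $regexp in $string, or undef.
sub matched_text {
    my ($string, $regexp) = @_;
    return $string =~ /$regexp/ ? $& : undef;
}

print matched_text("cats", 'c|ca|cat|cats'), "\n";           # c
print matched_text("cats", 'cats|cat|ca|c'), "\n";           # cats
print matched_text("cats and dogs", 'dog|cat|bird'), "\n";   # cat
```
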
=head2 Grouping things and hierarchical matching

Alternation allows a regexp to choose among alternatives, but by
itself it is unsatisfying. The reason is that each alternative is a whole
regexp, but sometimes we want alternatives for just part of a
regexp. For instance, suppose we want to search for housecats or
housekeepers. The regexp C<housecat|housekeeper> fits the bill, but is
inefficient because we had to type C<house> twice. It would be nice to
have parts of the regexp be constant, like C<house>, and some
parts have alternatives, like C<cat|keeper>.

The I<grouping> metacharacters C<()> solve this problem. Grouping
allows parts of a regexp to be treated as a single unit. Parts of a
regexp are grouped by enclosing them in parentheses. Thus we could solve
the C<housecat|housekeeper> problem by forming the regexp as
C<house(cat|keeper)>. The regexp C<house(cat|keeper)> means match
C<house> followed by either C<cat> or C<keeper>. Some more examples
are

    /(a|b)b/;    # matches 'ab' or 'bb'
    /(ac|b)b/;   # matches 'acb' or 'bb'
    /(^a|b)c/;   # matches 'ac' at start of string or 'bc' anywhere
    /(a|[bc])d/; # matches 'ad', 'bd', or 'cd'

    /house(cat|)/;  # matches either 'housecat' or 'house'
    /house(cat(s|)|)/;  # matches either 'housecats' or 'housecat' or
                        # 'house'.  Note groups can be nested.

    /(19|20|)\d\d/;  # match years 19xx, 20xx, or the Y2K problem, xx
    "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
                             # because '20\d\d' can't match

Alternations behave the same way in groups as out of them: at a given
string position, the leftmost alternative that allows the regexp to
match is taken. So in the last example at the first string position,
C<"20"> matches the second alternative, but there is nothing left over
to match the next two digits C<\d\d>. So Perl moves on to the next
alternative, which is the null alternative and that works, since
C<"20"> is two digits.

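The choice of alternative in the year example can be observed through
the variable C<$1>, which Perl sets to the text matched by the first
group. The sketch below adds C<^> and C<$> anchors so the whole string
must be a year; the helper name C<century_prefix> is invented for the
illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: the text matched by the first group, or undef
# if $year does not look like a two- or four-digit year.
sub century_prefix {
    my ($year) = @_;
    return $year =~ /^(19|20|)\d\d$/ ? $1 : undef;
}

for my $year ("1999", "2024", "20") {
    printf "'%s': prefix is '%s'\n", $year, century_prefix($year);
}
# '1999': prefix is '19'
# '2024': prefix is '20'
# '20': prefix is ''
```

For C<"20">, the second alternative is tried first and abandoned, so
the group ends up matching the empty string.
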
The process of trying one alternative, seeing if it matches, and, if
it doesn't, going back in the string to where the previous alternative
was tried and moving on to the next alternative, is called
I<backtracking>. The term 'backtracking' comes from the idea that
matching a regexp is like a walk in the woods. Successfully matching
a regexp is like arriving at a destination. There are many possible
trailheads, one for each string position, and each one is tried in
order, left to right. From each trailhead there may be many paths,
some of which get you there, and some of which are dead ends. When you
walk along a trail and hit a dead end, you have to backtrack along the
trail to an earlier point to try another trail. If you hit your
destination, you stop immediately and forget about trying all the
other trails. You are persistent, and only if you have tried all the
trails from all the trailheads and not arrived at your destination, do
you declare failure. To be concrete, here is a step-by-step analysis
of what Perl does when it tries to match the regexp

    "abcde" =~ /(abd|abc)(df|d|de)/;

=over 4

=item Z<>0

Start with the first letter in the string 'a'.

=item Z<>1

Try the first alternative in the first group 'abd'.

=item Z<>2

Match 'a' followed by 'b'. So far so good.

=item Z<>3

'd' in the regexp doesn't match 'c' in the string - a dead
end. So backtrack two characters and pick the second alternative in
the first group 'abc'.

=item Z<>4

Match 'a' followed by 'b' followed by 'c'. We are on a roll
and have satisfied the first group. Set $1 to 'abc'.

=item Z<>5

Move on to the second group and pick the first alternative
'df'.

=item Z<>6

Match the 'd'.

=item Z<>7

'f' in the regexp doesn't match 'e' in the string, so a dead
end. Backtrack one character and pick the second alternative in the
second group 'd'.

=item Z<>8

'd' matches. The second grouping is satisfied, so set $2 to
'd'.

=item Z<>9

We are at the end of the regexp, so we are done! We have
matched 'abcd' out of the string "abcde".

=back

725There are a couple of things to note about this analysis. First, the
726third alternative in the second group 'de' also allows a match, but we
727stopped before we got to it - at a given character position, leftmost
728wins. Second, we were able to get a match at the first character
729position of the string 'a'. If there were no matches at the first
7638d2dc 730position, Perl would move to the second character position 'b' and
47f9c88b 731attempt the match all over again. Only when all possible paths at all
7638d2dc
WL
732possible character positions have been exhausted does Perl give
733up and declare S<C<$string =~ /(abd|abc)(df|d|de)/;>> to be false.
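
The outcome of the analysis above can be checked directly. The following
is only a minimal sketch; the variable names C<$first>, C<$second> and
C<$failed> are ours, chosen for illustration.

```perl
use strict;
use warnings;

# Leftmost alternatives win: 'abc' and then 'd', although 'de' would also fit.
my ($first, $second) = "abcde" =~ /(abd|abc)(df|d|de)/;
print "first='$first' second='$second'\n";

# Neither 'abd' nor 'abc' occurs in this string, so the match is declared
# false only after every starting position has been tried.
my $failed = "abxde" !~ /(abd|abc)(df|d|de)/;
print "no match\n" if $failed;
```

Swapping the order of the alternatives in the second group is a quick way
to watch "leftmost wins" change what ends up in C<$2>.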
47f9c88b
GS
734
735Even with all this work, regexp matching happens remarkably fast. To
353c6505
DL
736speed things up, Perl compiles the regexp into a compact sequence of
737opcodes that can often fit inside a processor cache. When the code is
7638d2dc
WL
738executed, these opcodes can then run at full throttle and search very
739quickly.
47f9c88b
GS
740
741=head2 Extracting matches
742
743The grouping metacharacters C<()> also serve another completely
744different function: they allow the extraction of the parts of a string
745that matched. This is very useful to find out what matched and for
746text processing in general. For each grouping, the part that matched
747inside goes into the special variables C<$1>, C<$2>, etc. They can be
748used just as ordinary variables:
749
    # extract hours, minutes, seconds
    if ($time =~ /(\d\d):(\d\d):(\d\d)/) {    # match hh:mm:ss format
        $hours = $1;
        $minutes = $2;
        $seconds = $3;
    }

Now, we know that in scalar context,
S<C<$time =~ /(\d\d):(\d\d):(\d\d)/>> returns a true or false
value. In list context, however, it returns the list of matched values
C<($1,$2,$3)>. So we could write the code more compactly as

    # extract hours, minutes, seconds
    ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);
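
As a small runnable sketch of the list-context form, assuming a sample
time string of our own choosing, note that a failed match returns an
empty list, which is worth checking before the values are used:

```perl
use strict;
use warnings;

my $time = "12:34:56";    # sample input, chosen for illustration

# In list context the match returns the captured substrings.
my ($hours, $minutes, $seconds) = $time =~ /(\d\d):(\d\d):(\d\d)/;
print "h=$hours m=$minutes s=$seconds\n";

# On failure the list is empty, so nothing is assigned.
my @none = "noon" =~ /(\d\d):(\d\d):(\d\d)/;
print "captures on failure: ", scalar(@none), "\n";
```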
764
If the groupings in a regexp are nested, C<$1> gets the group with the
leftmost opening parenthesis, C<$2> the next opening parenthesis,
etc. Here is a regexp with nested groups:

    /(ab(cd|ef)((gi)|j))/;
     1  2      34

If this regexp matches, C<$1> contains a string starting with
C<'ab'>, C<$2> is either set to C<'cd'> or C<'ef'>, C<$3> equals either
C<'gi'> or C<'j'>, and C<$4> is either set to C<'gi'>, just like C<$3>,
or it remains undefined.

For convenience, Perl sets C<$+> to the string held by the highest numbered
C<$1>, C<$2>,... that got assigned (and, somewhat related, C<$^N> to the
value of the C<$1>, C<$2>,... most-recently assigned; i.e. the C<$1>,
C<$2>,... associated with the rightmost closing parenthesis used in the
match).
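
To make the numbering of nested groups concrete, here is a short sketch
using two sample strings of our own. It also shows C<$+> and C<$^N> for
the case where all four groups take part in the match.

```perl
use strict;
use warnings;

my ($whole, $mid, $tail, $inner, $last_paren, $last_closed);
if ( "abcdgi" =~ /(ab(cd|ef)((gi)|j))/ ) {
    ($whole, $mid, $tail, $inner) = ($1, $2, $3, $4);
    $last_paren  = $+;     # highest-numbered group that was set, here $4
    $last_closed = $^N;    # group closed most recently, here the outer $1
}
print "$whole $mid $tail $inner | $last_paren | $last_closed\n";

# With the 'j' branch, group 4 takes no part in the match,
# so the fourth element of the returned list is undefined.
my @groups = "abefj" =~ /(ab(cd|ef)((gi)|j))/;
print defined $groups[3] ? "group 4 set\n" : "group 4 undefined\n";
```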
47f9c88b 782
7638d2dc
WL
783
784=head2 Backreferences
785
Closely associated with the matching variables C<$1>, C<$2>, ... are
the I<backreferences> C<\g1>, C<\g2>,... Backreferences are simply
matching variables that can be used I<inside> a regexp. This is a
really nice feature; what matches later in a regexp is made to depend on
what matched earlier in the regexp. Suppose we wanted to look
for doubled words in a text, like 'the the'. The following regexp finds
all 3-letter doubles with a space in between:

    /\b(\w\w\w)\s\g1\b/;

The grouping assigns a value to C<\g1>, so that the same 3-letter sequence
is used for both parts.

A similar task is to find words consisting of two identical parts:

    % simple_grep '^(\w\w\w\w|\w\w\w|\w\w|\w)\g1$' /usr/dict/words
    beriberi
    booboo
    coco
    mama
    murmur
    papa

The regexp has a single grouping which considers 4-letter
combinations, then 3-letter combinations, etc., and uses C<\g1> to look for
a repeat. Although C<$1> and C<\g1> represent the same thing, care should be
taken to use matched variables C<$1>, C<$2>,... only I<outside> a regexp
and backreferences C<\g1>, C<\g2>,... only I<inside> a regexp; not doing
so may lead to surprising and unsatisfactory results.
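
Both backreference patterns can be tried without a dictionary file. The
sketch below uses a sample sentence and a short word list of our own in
place of F</usr/dict/words>.

```perl
use strict;
use warnings;

# Find a doubled 3-letter word; \g1 must repeat whatever group 1 matched.
my ($doubled) = "Paris in the the spring" =~ /\b(\w\w\w)\s\g1\b/;
print "doubled word: $doubled\n";

# Keep only the words made of two identical halves.
my @words   = qw(beriberi banana coco coconut mama murmur);
my @twofold = grep { /^(\w\w\w\w|\w\w\w|\w\w|\w)\g1$/ } @words;
print "@twofold\n";
```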
815
816
817=head2 Relative backreferences
818
Counting the opening parentheses to get the correct number for a
backreference is error-prone as soon as there is more than one
capturing group. A more convenient technique became available
with Perl 5.10: relative backreferences. To refer to the immediately
preceding capture group one may now write C<\g{-1}>, the one before
that is available via C<\g{-2}>, and so on.

Another good reason, in addition to readability and maintainability,
for using relative backreferences is illustrated by the following example,
where a simple pattern for matching peculiar strings is used:

    $a99a = '([a-z])(\d)\g2\g1';   # matches a11a, g22g, x33x, etc.

Now that we have this pattern stored as a handy string, we might feel
tempted to use it as a part of some other pattern:

    $line = "code=e99e";
    if ($line =~ /^(\w+)=$a99a$/){   # unexpected behavior!
        print "$1 is valid\n";
    } else {
        print "bad line: '$line'\n";
    }

But this doesn't match, at least not the way one might expect. Only
after inserting the interpolated C<$a99a> and looking at the resulting
full text of the regexp is it obvious that the backreferences have
backfired. The subexpression C<(\w+)> has snatched number 1 and
demoted the groups in C<$a99a> by one rank. This can be avoided by
using relative backreferences:

    $a99a = '([a-z])(\d)\g{-1}\g{-2}';  # safe for being interpolated
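
The difference between the two forms can be seen side by side. In this
sketch the names C<$absolute> and C<$relative> are ours, and C<\z> is
used as the end anchor simply to keep the interpolation easy to read.

```perl
use strict;
use warnings;

my $absolute = '([a-z])(\d)\g2\g1';          # numbering breaks when embedded
my $relative = '([a-z])(\d)\g{-1}\g{-2}';    # counts back from where it stands
my $line     = "code=e99e";

# On its own the absolute form works as intended.
my $alone_ok = "a11a" =~ /^$absolute\z/ ? 1 : 0;

# Embedded after (\w+), \g2 now refers to ([a-z]) and \g1 to (\w+).
my $abs_ok = $line =~ /^(\w+)=$absolute\z/ ? 1 : 0;
my $rel_ok = $line =~ /^(\w+)=$relative\z/ ? 1 : 0;

print "alone=$alone_ok absolute=$abs_ok relative=$rel_ok\n";
```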
850
851
852=head2 Named backreferences
853
Perl 5.10 also introduced named capture groups and named backreferences.
To attach a name to a capturing group, you write either
C<< (?<name>...) >> or C<< (?'name'...) >>. The backreference may
then be written as C<\g{name}>. It is permissible to attach the
same name to more than one group, but then only the leftmost one of the
eponymous set can be referenced. Outside of the pattern a named
capture group is accessible through the C<%+> hash.

Assuming that we have to match calendar dates which may be given in one
of the three formats yyyy-mm-dd, mm/dd/yyyy or dd.mm.yyyy, we can write
three suitable patterns where we use 'd', 'm' and 'y' respectively as the
names of the groups capturing the pertaining components of a date. The
matching operation combines the three patterns as alternatives:

    $fmt1 = '(?<y>\d\d\d\d)-(?<m>\d\d)-(?<d>\d\d)';
    $fmt2 = '(?<m>\d\d)/(?<d>\d\d)/(?<y>\d\d\d\d)';
    $fmt3 = '(?<d>\d\d)\.(?<m>\d\d)\.(?<y>\d\d\d\d)';
    for my $d (qw( 2006-10-21 15.01.2007 10/31/2005 )) {
        if ( $d =~ m{$fmt1|$fmt2|$fmt3} ){
            print "day=$+{d} month=$+{m} year=$+{y}\n";
        }
    }

If any of the alternatives matches, the hash C<%+> is bound to contain the
three key-value pairs.
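
A named backreference inside the pattern, and the C<%+> hash outside it,
can be sketched as follows. The helper C<iso_date> is hypothetical, written
only for this illustration; it anchors the three formats and returns
nothing when none of them matches.

```perl
use strict;
use warnings;

# \g{char} must repeat whatever the group named 'char' matched.
my $first_double;
if ( "bookkeeper" =~ /(?<char>\w)\g{char}/ ) {
    $first_double = $+{char};
}
print "first doubled letter: $first_double\n";

my $fmt1 = '(?<y>\d\d\d\d)-(?<m>\d\d)-(?<d>\d\d)';
my $fmt2 = '(?<m>\d\d)/(?<d>\d\d)/(?<y>\d\d\d\d)';
my $fmt3 = '(?<d>\d\d)\.(?<m>\d\d)\.(?<y>\d\d\d\d)';

# Hypothetical helper: normalize any of the three formats to yyyy-mm-dd.
sub iso_date {
    my ($text) = @_;
    return unless $text =~ m{^(?:$fmt1|$fmt2|$fmt3)\z};
    return "$+{y}-$+{m}-$+{d}";
}

print iso_date($_), "\n" for qw( 2006-10-21 15.01.2007 10/31/2005 );
```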
879
880
881=head2 Alternative capture group numbering
882
Yet another capturing group numbering technique (also available as of
Perl 5.10) deals with the problem of referring to groups within a set of
alternatives. Consider a pattern for matching a time of the day, civil or
military style:

    if ( $time =~ /(\d\d|\d):(\d\d)|(\d\d)(\d\d)/ ){
        # process hour and minute
    }

Processing the results requires an additional if statement to determine
whether C<$1> and C<$2> or C<$3> and C<$4> contain the goodies. It would
be easier if we could use group numbers 1 and 2 in the second alternative
as well, and this is exactly what the parenthesized construct C<(?|...)>,
set around the alternatives, achieves. Here is an extended version of the
previous pattern:

    if($time =~ /(?|(\d\d|\d):(\d\d)|(\d\d)(\d\d))\s+([A-Z][A-Z][A-Z])/){
        print "hour=$1 minute=$2 zone=$3\n";
    }

Within the alternative numbering group, group numbers start at the same
position for each alternative. After the group, numbering continues
with one higher than the maximum reached across all the alternatives.
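
As a brief sketch of this numbering, the same pattern can be applied to
one civil and one military time. The sample strings and the names
C<@civil> and C<@military> are ours.

```perl
use strict;
use warnings;

my $re = qr/(?|(\d\d|\d):(\d\d)|(\d\d)(\d\d))\s+([A-Z][A-Z][A-Z])/;

# Whichever alternative matches, hour and minute land in groups 1 and 2,
# and the zone is always group 3.
my @civil    = "12:30 UTC" =~ $re;
my @military = "0915 GMT"  =~ $re;

print "hour=$civil[0] minute=$civil[1] zone=$civil[2]\n";
print "hour=$military[0] minute=$military[1] zone=$military[2]\n";
```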
7638d2dc
WL
905
906=head2 Position information
907
In addition to what was matched, Perl also provides the
positions of what was matched as contents of the C<@-> and C<@+>
arrays. C<$-[0]> is the position of the start of the entire match and
C<$+[0]> is the position of the end. Similarly, C<$-[n]> is the
position of the start of the C<$n> match and C<$+[n]> is the position
of the end. If C<$n> is undefined, so are C<$-[n]> and C<$+[n]>. Then
this code

    $x = "Mmm...donut, thought Homer";
    $x =~ /^(Mmm|Yech)\.\.\.(donut|peas)/; # matches
    foreach $exp (1..$#-) {
        print "Match $exp: '${$exp}' at position ($-[$exp],$+[$exp])\n";
    }

prints

    Match 1: 'Mmm' at position (0,3)
    Match 2: 'donut' at position (6,11)
927Even if there are no groupings in a regexp, it is still possible to
7638d2dc 928find out what exactly matched in a string. If you use them, Perl
47f9c88b
GS
929will set C<$`> to the part of the string before the match, will set C<$&>
930to the part of the string that matched, and will set C<$'> to the part
931of the string after the match. An example:
932
933 $x = "the cat caught the mouse";
934 $x =~ /cat/; # $` = 'the ', $& = 'cat', $' = ' caught the mouse'
935 $x =~ /the/; # $` = '', $& = 'the', $' = ' cat caught the mouse'
936
7638d2dc
WL
937In the second match, C<$`> equals C<''> because the regexp matched at the
938first character position in the string and stopped; it never saw the
13b0f67d
DM
939second 'the'.
940
941If your code is to run on Perl versions earlier than
9425.20, it is worthwhile to note that using C<$`> and C<$'>
7638d2dc 943slows down regexp matching quite a bit, while C<$&> slows it down to a
47f9c88b 944lesser extent, because if they are used in one regexp in a program,
7638d2dc 945they are generated for I<all> regexps in the program. So if raw
47f9c88b 946performance is a goal of your application, they should be avoided.
7638d2dc
WL
947If you need to extract the corresponding substrings, use C<@-> and
948C<@+> instead:
47f9c88b
GS
949
950 $` is the same as substr( $x, 0, $-[0] )
951 $& is the same as substr( $x, $-[0], $+[0]-$-[0] )
952 $' is the same as substr( $x, $+[0] )
953
78622607 954As of Perl 5.10, the C<${^PREMATCH}>, C<${^MATCH}> and C<${^POSTMATCH}>
13b0f67d
DM
955variables may be used. These are only set if the C</p> modifier is
956present. Consequently they do not penalize the rest of the program. In
957Perl 5.20, C<${^PREMATCH}>, C<${^MATCH}> and C<${^POSTMATCH}> are available
958whether the C</p> has been used or not (the modifier is ignored), and
959C<$`>, C<$'> and C<$&> do not cause any speed difference.
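
The two alternatives to C<$`>, C<$&> and C<$'> described above can be
compared in a short sketch. The variable names are ours; the C</p>
modifier is included so that the second half also behaves as described
on Perl versions before 5.20.

```perl
use strict;
use warnings;

my $x = "the cat caught the mouse";

# Rebuild the three substrings from the offsets in @- and @+.
my ($pre, $match, $post);
if ( $x =~ /cat/ ) {
    $pre   = substr( $x, 0, $-[0] );
    $match = substr( $x, $-[0], $+[0] - $-[0] );
    $post  = substr( $x, $+[0] );
}

# The same information through the /p variables.
my ($p_pre, $p_match, $p_post);
if ( $x =~ /cat/p ) {
    ($p_pre, $p_match, $p_post) = ( ${^PREMATCH}, ${^MATCH}, ${^POSTMATCH} );
}

print "[$pre][$match][$post]\n";
print "[$p_pre][$p_match][$p_post]\n";
```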
7638d2dc
WL
960
961=head2 Non-capturing groupings
962
353c6505 963A group that is required to bundle a set of alternatives may or may not be
7638d2dc 964useful as a capturing group. If it isn't, it just creates a superfluous
c27a5cfe 965addition to the set of available capture group values, inside as well as
7638d2dc 966outside the regexp. Non-capturing groupings, denoted by C<(?:regexp)>,
353c6505 967still allow the regexp to be treated as a single unit, but don't establish
c27a5cfe 968a capturing group at the same time. Both capturing and non-capturing
7638d2dc
WL
969groupings are allowed to co-exist in the same regexp. Because there is
970no extraction, non-capturing groupings are faster than capturing
971groupings. Non-capturing groupings are also handy for choosing exactly
972which parts of a regexp are to be extracted to matching variables:
973
    # match a number, $1-$4 are set, but we only want $1
    /([+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?)/;

    # match a number faster, only $1 is set
    /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?)/;

    # match a number, get $1 = whole number, $2 = exponent
    /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE]([+-]?\d+))?)/;

Non-capturing groupings are also useful for removing nuisance
elements gathered from a split operation where parentheses are
required for some reason:

    $x = '12aba34ba5';
    @num = split /(a|b)+/, $x;    # @num = ('12','a','34','a','5')
    @num = split /(?:a|b)+/, $x;  # @num = ('12','34','5')
990
33be4c61
MH
991In Perl 5.22 and later, all groups within a regexp can be set to
992non-capturing by using the new C</n> flag:
993
994 "hello" =~ /(hi|hello)/n; # $1 is not set!
995
996See L<perlre/"n"> for more information.
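
Here is a small sketch of choosing exactly which parts are captured,
using the third number pattern from above on a sample string of our own.
It deliberately avoids the C</n> flag so that it also runs on Perl
versions before 5.22.

```perl
use strict;
use warnings;

# Only two groups capture: the whole number and the exponent digits.
my $number = qr/([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE]([+-]?\d+))?)/;
my ($whole, $exponent) = "x = -12.5e3;" =~ $number;
print "whole=$whole exponent=$exponent\n";

# With a non-capturing group, split returns only the fields.
my @num = split /(?:a|b)+/, '12aba34ba5';
print "@num\n";
```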
7638d2dc 997
47f9c88b
GS
998=head2 Matching repetitions
999
1000The examples in the previous section display an annoying weakness. We
7638d2dc
WL
1001were only matching 3-letter words, or chunks of words of 4 letters or
1002less. We'd like to be able to match words or, more generally, strings
1003of any length, without writing out tedious alternatives like
47f9c88b
GS
1004C<\w\w\w\w|\w\w\w|\w\w|\w>.
1005
7638d2dc
WL
1006This is exactly the problem the I<quantifier> metacharacters C<?>,
1007C<*>, C<+>, and C<{}> were created for. They allow us to delimit the
1008number of repeats for a portion of a regexp we consider to be a
47f9c88b
GS
1009match. Quantifiers are put immediately after the character, character
1010class, or grouping that we want to specify. They have the following
1011meanings:
1012
=over 4

=item *

C<a?> means: match 'a' 1 or 0 times

=item *

C<a*> means: match 'a' 0 or more times, i.e., any number of times

=item *

C<a+> means: match 'a' 1 or more times, i.e., at least once

=item *

C<a{n,m}> means: match at least C<n> times, but not more than C<m>
times.

=item *

C<a{n,}> means: match at least C<n> times

=item *

C<a{n}> means: match exactly C<n> times

=back
1041
1042Here are some examples:
1043
    /[a-z]+\s+\d*/;  # match a lowercase word, at least one space, and
                     # any number of digits
    /(\w+)\s+\g1/;   # match doubled words of arbitrary length
    /y(es)?/i;       # matches 'y', 'Y', or a case-insensitive 'yes'
    $year =~ /^\d{2,4}$/;  # make sure year is at least 2 but not more
                           # than 4 digits
    $year =~ /^\d{4}$|^\d{2}$/; # better match; throw out 3-digit dates
    $year =~ /^\d{2}(\d{2})?$/; # same thing written differently.
                                # However, this captures the last two
                                # digits in $1 and the other does not.

    % simple_grep '^(\w+)\g1$' /usr/dict/words   # isn't this easier?
    beriberi
    booboo
    coco
    mama
    murmur
    papa
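
The three year patterns differ only in what they let through, which is
easy to see by filtering a few candidate strings. The candidates below
are our own sample data.

```perl
use strict;
use warnings;

my @candidates = qw(07 207 2007 20071);

# {2,4} accepts 2, 3 or 4 digits.
my @loose  = grep { /^\d{2,4}$/ } @candidates;

# The alternation accepts exactly 2 or exactly 4 digits.
my @strict = grep { /^\d{4}$|^\d{2}$/ } @candidates;

# The optional group does the same job and captures the last two digits.
my ($last_two) = "2007" =~ /^\d{2}(\d{2})?$/;

print "loose:  @loose\n";
print "strict: @strict\n";
print "last two digits: $last_two\n";
```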
1062
For all of these quantifiers, Perl will try to match as much of the
string as possible, while still allowing the regexp to succeed. Thus
with C</a?.../>, Perl will first try to match the regexp with the C<a>
present; if that fails, Perl will try to match the regexp without the
C<a> present. For the quantifier C<*>, we get the following:

    $x = "the cat in the hat";
    $x =~ /^(.*)(cat)(.*)$/; # matches,
                             # $1 = 'the '
                             # $2 = 'cat'
                             # $3 = ' in the hat'

This is what we might expect: the match finds the only C<cat> in the
string and locks onto it. Consider, however, this regexp:

    $x =~ /^(.*)(at)(.*)$/; # matches,
                            # $1 = 'the cat in the h'
                            # $2 = 'at'
                            # $3 = ''   (0 characters match)

One might initially guess that Perl would find the C<at> in C<cat> and
stop there, but that wouldn't give the longest possible string to the
first quantifier C<.*>. Instead, the first quantifier C<.*> grabs as
much of the string as possible while still having the regexp match. In
this example, that means matching the C<at> sequence against the final
C<at> in the string. The other important principle illustrated here is
that, when there are two or more elements in a regexp, the I<leftmost>
quantifier, if there is one, gets to grab as much of the string as
possible, leaving the rest of the regexp to fight over scraps. Thus in
our example, the first quantifier C<.*> grabs most of the string, while
the second quantifier C<.*> gets the empty string. Quantifiers that
grab as much of the string as possible are called I<maximal match> or
I<greedy> quantifiers.
1096
1097When a regexp can match a string in several different ways, we can use
1098the principles above to predict which way the regexp will match:
1099
=over 4

=item *

Principle 0: Taken as a whole, any regexp will be matched at the
earliest possible position in the string.

=item *

Principle 1: In an alternation C<a|b|c...>, the leftmost alternative
that allows a match for the whole regexp will be the one used.

=item *

Principle 2: The maximal matching quantifiers C<?>, C<*>, C<+> and
C<{n,m}> will in general match as much of the string as possible while
still allowing the whole regexp to match.

=item *

Principle 3: If there are two or more elements in a regexp, the
leftmost greedy quantifier, if any, will match as much of the string
as possible while still allowing the whole regexp to match. The next
leftmost greedy quantifier, if any, will try to match as much of the
string remaining available to it as possible, while still allowing the
whole regexp to match. And so on, until all the regexp elements are
satisfied.

=back
1129
ac036724 1130As we have seen above, Principle 0 overrides the others. The regexp
47f9c88b
GS
1131will be matched as early as possible, with the other principles
1132determining how the regexp matches at that earliest character
1133position.
1134
1135Here is an example of these principles in action:
1136
1137 $x = "The programming republic of Perl";
1138 $x =~ /^(.+)(e|r)(.*)$/; # matches,
1139 # $1 = 'The programming republic of Pe'
1140 # $2 = 'r'
1141 # $3 = 'l'
1142
1143This regexp matches at the earliest string position, C<'T'>. One
1144might think that C<e>, being leftmost in the alternation, would be
1145matched, but C<r> produces the longest string in the first quantifier.
1146
1147 $x =~ /(m{1,2})(.*)$/; # matches,
1148 # $1 = 'mm'
1149 # $2 = 'ing republic of Perl'
1150
Here, the earliest possible match is at the first C<'m'> in
C<programming>. C<m{1,2}> is the first quantifier, so it gets to match
a maximal C<mm>.
1154
1155 $x =~ /.*(m{1,2})(.*)$/; # matches,
1156 # $1 = 'm'
1157 # $2 = 'ing republic of Perl'
1158
1159Here, the regexp matches at the start of the string. The first
1160quantifier C<.*> grabs as much as possible, leaving just a single
1161C<'m'> for the second quantifier C<m{1,2}>.
1162
1163 $x =~ /(.?)(m{1,2})(.*)$/; # matches,
1164 # $1 = 'a'
1165 # $2 = 'mm'
1166 # $3 = 'ing republic of Perl'
1167
1168Here, C<.?> eats its maximal one character at the earliest possible
1169position in the string, C<'a'> in C<programming>, leaving C<m{1,2}>
1170the opportunity to match both C<m>'s. Finally,
1171
1172 "aXXXb" =~ /(X*)/; # matches with $1 = ''
1173
1174because it can match zero copies of C<'X'> at the beginning of the
1175string. If you definitely want to match at least one C<'X'>, use
1176C<X+>, not C<X*>.
1177
Sometimes greed is not good. At times, we would like quantifiers to
match a I<minimal> piece of string, rather than a maximal piece. For
this purpose, Larry Wall created the I<minimal match> or
I<non-greedy> quantifiers C<??>, C<*?>, C<+?>, and C<{}?>. These are
the usual quantifiers with a C<?> appended to them. They have the
following meanings:

=over 4

=item *

C<a??> means: match 'a' 0 or 1 times. Try 0 first, then 1.

=item *

C<a*?> means: match 'a' 0 or more times, i.e., any number of times,
but as few times as possible

=item *

C<a+?> means: match 'a' 1 or more times, i.e., at least once, but
as few times as possible

=item *

C<a{n,m}?> means: match at least C<n> times, not more than C<m>
times, as few times as possible

=item *

C<a{n,}?> means: match at least C<n> times, but as few times as
possible

=item *

C<a{n}?> means: match exactly C<n> times. Because we match exactly
C<n> times, C<a{n}?> is equivalent to C<a{n}> and is just there for
notational consistency.

=back
1218
1219Let's look at the example above, but with minimal quantifiers:
1220
1221 $x = "The programming republic of Perl";
1222 $x =~ /^(.+?)(e|r)(.*)$/; # matches,
1223 # $1 = 'Th'
1224 # $2 = 'e'
1225 # $3 = ' programming republic of Perl'
1226
1227The minimal string that will allow both the start of the string C<^>
1228and the alternation to match is C<Th>, with the alternation C<e|r>
1229matching C<e>. The second quantifier C<.*> is free to gobble up the
1230rest of the string.
1231
1232 $x =~ /(m{1,2}?)(.*?)$/; # matches,
1233 # $1 = 'm'
1234 # $2 = 'ming republic of Perl'
1235
1236The first string position that this regexp can match is at the first
1237C<'m'> in C<programming>. At this position, the minimal C<m{1,2}?>
1238matches just one C<'m'>. Although the second quantifier C<.*?> would
1239prefer to match no characters, it is constrained by the end-of-string
1240anchor C<$> to match the rest of the string.
1241
1242 $x =~ /(.*?)(m{1,2}?)(.*)$/; # matches,
1243 # $1 = 'The progra'
1244 # $2 = 'm'
1245 # $3 = 'ming republic of Perl'
1246
1247In this regexp, you might expect the first minimal quantifier C<.*?>
1248to match the empty string, because it is not constrained by a C<^>
1249anchor to match the beginning of the word. Principle 0 applies here,
1250however. Because it is possible for the whole regexp to match at the
1251start of the string, it I<will> match at the start of the string. Thus
1252the first quantifier has to match everything up to the first C<m>. The
1253second minimal quantifier matches just one C<m> and the third
1254quantifier matches the rest of the string.
1255
1256 $x =~ /(.??)(m{1,2})(.*)$/; # matches,
1257 # $1 = 'a'
1258 # $2 = 'mm'
1259 # $3 = 'ing republic of Perl'
1260
1261Just as in the previous regexp, the first quantifier C<.??> can match
1262earliest at position C<'a'>, so it does. The second quantifier is
1263greedy, so it matches C<mm>, and the third matches the rest of the
1264string.
1265
1266We can modify principle 3 above to take into account non-greedy
1267quantifiers:
1268
=over 4

=item *

Principle 3: If there are two or more elements in a regexp, the
leftmost greedy (non-greedy) quantifier, if any, will match as much
(little) of the string as possible while still allowing the whole
regexp to match. The next leftmost greedy (non-greedy) quantifier, if
any, will try to match as much (little) of the string remaining
available to it as possible, while still allowing the whole regexp to
match. And so on, until all the regexp elements are satisfied.

=back
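
A common place where the choice between greedy and non-greedy matters is
delimited text. The following sketch uses a sample string of our own and
only illustrates the principle; it is not a suggestion for parsing
markup in general.

```perl
use strict;
use warnings;

my $html = "<b>bold</b> and <i>italic</i>";

# Greedy: .+ runs to the end, then backs off to the last '>'.
my ($greedy) = $html =~ /<(.+)>/;

# Non-greedy: .+? stops at the first '>' that lets the match succeed.
my ($minimal) = $html =~ /<(.+?)>/;

print "greedy:  $greedy\n";
print "minimal: $minimal\n";
```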
1282
Just like alternation, quantifiers are also susceptible to
backtracking. Here is a step-by-step analysis of the example

    $x = "the cat in the hat";
    $x =~ /^(.*)(at)(.*)$/; # matches,
                            # $1 = 'the cat in the h'
                            # $2 = 'at'
                            # $3 = ''   (0 characters match)

=over 4

=item Z<>0

Start with the first letter in the string 't'.

=item Z<>1

The first quantifier '.*' starts out by matching the whole
string 'the cat in the hat'.

=item Z<>2

'a' in the regexp element 'at' doesn't match the end of the
string. Backtrack one character.

=item Z<>3

'a' in the regexp element 'at' still doesn't match the last
letter of the string 't', so backtrack one more character.

=item Z<>4

Now we can match the 'a' and the 't'.

=item Z<>5

Move on to the third element '.*'. Since we are at the end of
the string and '.*' can match 0 times, assign it the empty string.

=item Z<>6

We are done!

=back
1327
Most of the time, all this moving forward and backtracking happens
quickly and searching is fast. There are some pathological regexps,
however, whose execution time grows exponentially with the size of the
string. A typical structure that blows up in your face is of the form

    /(a|b+)*/;

The problem is the nested indeterminate quantifiers. There are many
different ways of partitioning a string of length n between the C<+>
and C<*>: one repetition with C<b+> of length n, two repetitions with
the first C<b+> length k and the second with length n-k, m repetitions
whose bits add up to length n, etc. In fact there are an exponential
number of ways to partition a string as a function of its length. A
regexp may get lucky and match early in the process, but if there is
no match, Perl will try I<every> possibility before giving up. So be
careful with nested C<*>'s, C<{n,m}>'s, and C<+>'s. The book
I<Mastering Regular Expressions> by Jeffrey Friedl gives a wonderful
discussion of this and other efficiency issues.
1346
7638d2dc
WL
1347
1348=head2 Possessive quantifiers
1349
1350Backtracking during the relentless search for a match may be a waste
1351of time, particularly when the match is bound to fail. Consider
1352the simple pattern
1353
1354 /^\w+\s+\w+$/; # a word, spaces, a word
1355
Whenever this is applied to a string which doesn't quite meet the
pattern's expectations such as S<C<"abc ">> or S<C<"abc def ">>,
the regex engine will backtrack, approximately once for each character
in the string. But we know that there is no way around taking I<all>
of the initial word characters to match the first repetition, that I<all>
spaces must be eaten by the middle part, and the same goes for the second
word.

With the introduction of the I<possessive quantifiers> in Perl 5.10, we
have a way of instructing the regex engine not to backtrack: append a
C<+> to the usual quantifiers. This makes them greedy as
well as stingy; once they succeed they won't give anything back to permit
another solution. They have the following meanings:

=over 4

=item *

C<a{n,m}+> means: match at least C<n> times, not more than C<m> times,
as many times as possible, and don't give anything up. C<a?+> is short
for C<a{0,1}+>.

=item *

C<a{n,}+> means: match at least C<n> times, but as many times as possible,
and don't give anything up. C<a*+> is short for C<a{0,}+> and C<a++> is
short for C<a{1,}+>.

=item *

C<a{n}+> means: match exactly C<n> times. It is just there for
notational consistency.

=back

These possessive quantifiers represent a special case of a more general
concept, the I<independent subexpression>, see below.
WL
1393
As an example where a possessive quantifier is suitable we consider
matching a quoted string, as it appears in several programming languages.
The backslash is used as an escape character that indicates that the
next character is to be taken literally, as another character for the
string. Therefore, after the opening quote, we expect a (possibly
empty) sequence of alternatives: either a run of characters other than
a quote or backslash, or an escaped character.

    /"(?:[^"\\]++|\\.)*+"/;
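
The effect of refusing to give characters back can be seen in a tiny
sketch. The sample strings are ours, and a capturing group has been
wrapped around the quoted-string pattern only so that the result can be
printed.

```perl
use strict;
use warnings;

# a+ gives back one 'a' so the final 'a' can match; a++ never does.
my $plain      = "aaa" =~ /a+a/  ? 1 : 0;
my $possessive = "aaa" =~ /a++a/ ? 1 : 0;
print "plain=$plain possessive=$possessive\n";

# A quoted string containing escaped quotes (single-quoted Perl literal,
# so the backslashes are kept as they are).
my $text = 'say "hi \"there\"" now';
my ($quoted) = $text =~ /("(?:[^"\\]++|\\.)*+")/;
print "$quoted\n";
```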
1403
1404
=head2 Building a regexp
1406
1407At this point, we have all the basic regexp concepts covered, so let's
1408give a more involved example of a regular expression. We will build a
1409regexp that matches numbers.
1410
1411The first task in building a regexp is to decide what we want to match
1412and what we want to exclude. In our case, we want to match both
1413integers and floating point numbers and we want to reject any string
1414that isn't a number.
1415
1416The next task is to break the problem down into smaller problems that
1417are easily converted into a regexp.
1418
1419The simplest case is integers. These consist of a sequence of digits,
1420with an optional sign in front. The digits we can represent with
1421C<\d+> and the sign can be matched with C<[+-]>. Thus the integer
1422regexp is
1423
1424 /[+-]?\d+/; # matches integers
1425
1426A floating point number potentially has a sign, an integral part, a
1427decimal point, a fractional part, and an exponent. One or more of these
1428parts is optional, so we need to check out the different
1429possibilities. Floating point numbers which are in proper form include
1430123., 0.345, .34, -1e6, and 25.4E-72. As with integers, the sign out
1431front is completely optional and can be matched by C<[+-]?>. We can
1432see that if there is no exponent, floating point numbers must have a
1433decimal point, otherwise they are integers. We might be tempted to
1434model these with C<\d*\.\d*>, but this would also match just a single
1435decimal point, which is not a number. So the three cases of floating
7638d2dc 1436point number without exponent are
47f9c88b
GS
1437
1438 /[+-]?\d+\./; # 1., 321., etc.
1439 /[+-]?\.\d+/; # .1, .234, etc.
1440 /[+-]?\d+\.\d+/; # 1.0, 30.56, etc.
1441
These can be combined into a single regexp with a three-way alternation:

    /[+-]?(\d+\.\d+|\d+\.|\.\d+)/;  # floating point, no exponent

In this alternation, it is important to put C<'\d+\.\d+'> before
C<'\d+\.'>. If C<'\d+\.'> were first, the regexp would happily match that
and ignore the fractional part of the number.

Now consider floating point numbers with exponents. The key
observation here is that I<both> integers and numbers with decimal
points are allowed in front of an exponent. Then exponents, like the
overall sign, are independent of whether we are matching numbers with
or without decimal points, and can be 'decoupled' from the
mantissa. The overall form of the regexp now becomes clear:

    /^(optional sign)(integer | f.p. mantissa)(optional exponent)$/;

The exponent is an C<e> or C<E>, followed by an integer. So the
exponent regexp is

    /[eE][+-]?\d+/;  # exponent

Putting all the parts together, we get a regexp that matches numbers:

    /^[+-]?(\d+\.\d+|\d+\.|\.\d+|\d+)([eE][+-]?\d+)?$/;  # Ta da!

Long regexps like this may impress your friends, but can be hard to
decipher. In complex situations like this, the C<//x> modifier for a
match is invaluable. It allows one to put nearly arbitrary whitespace
and comments into a regexp without affecting its meaning. Using it,
we can rewrite our 'extended' regexp in the more pleasing form

    /^
       [+-]?         # first, match an optional sign
       (             # then match integers or f.p. mantissas:
           \d+\.\d+  # mantissa of the form a.b
          |\d+\.     # mantissa of the form a.
          |\.\d+     # mantissa of the form .b
          |\d+       # integer of the form a
       )
       ( [eE] [+-]? \d+ )?  # finally, optionally match an exponent
    $/x;

If whitespace is mostly irrelevant, how does one include space
characters in an extended regexp? The answer is to backslash it
S<C<'\ '>> or put it in a character class S<C<[ ]>>. The same thing
goes for pound signs: use C<\#> or C<[#]>. For instance, Perl allows
a space between the sign and the mantissa or integer, and we could add
this to our regexp as follows:

    /^
       [+-]?\ *      # first, match an optional sign *and space*
       (             # then match integers or f.p. mantissas:
           \d+\.\d+  # mantissa of the form a.b
          |\d+\.     # mantissa of the form a.
          |\.\d+     # mantissa of the form .b
          |\d+       # integer of the form a
       )
       ( [eE] [+-]? \d+ )?  # finally, optionally match an exponent
    $/x;

In this form, it is easier to see a way to simplify the
alternation. Alternatives 1, 2, and 4 all start with C<\d+>, so it
could be factored out:

    /^
       [+-]?\ *      # first, match an optional sign and space
       (             # then match integers or f.p. mantissas:
           \d+       # start out with a ...
           (
               \.\d* # mantissa of the form a.b or a.
           )?        # ? takes care of integers of the form a
          |\.\d+     # mantissa of the form .b
       )
       ( [eE] [+-]? \d+ )?  # finally, optionally match an exponent
    $/x;

or written in the compact form,

    /^[+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/;

This is our final regexp. To recap, we built a regexp by

=over 4

=item *

specifying the task in detail,

=item *

breaking down the problem into smaller parts,

=item *

translating the small parts into regexps,

=item *

combining the regexps,

=item *

and optimizing the final combined regexp.

=back

These are also the typical steps involved in writing a computer
program. This makes perfect sense, because regular expressions are
essentially programs written in a little computer language that specifies
patterns.

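One way to gain confidence in the final number regexp is to run it
against inputs that should and should not match. The following is a
minimal sketch; the helper name C<is_number> and the sample inputs are
made up for illustration.

```perl
use strict;
use warnings;

# is_number() is a hypothetical helper wrapping the compact regexp
# developed in this section.
sub is_number {
    my ($text) = @_;
    return $text =~ /^[+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/ ? 1 : 0;
}

for my $candidate ('123.', '0.345', '.34', '-1e6', '25.4E-72', '.', '1e', 'abc') {
    printf "%-10s %s\n", "'$candidate'",
        is_number($candidate) ? 'is a number' : 'is not a number';
}
```

The last three inputs are rejected: a lone decimal point has no
digits, C<1e> has an exponent marker without an integer after it, and
C<abc> contains no digits at all.
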
=head2 Using regular expressions in Perl

The last topic of Part 1 briefly covers how regexps are used in Perl
programs. Where do they fit into Perl syntax?

We have already introduced the matching operator in its default
C</regexp/> and arbitrary delimiter C<m!regexp!> forms. We have used
the binding operator C<=~> and its negation C<!~> to test for string
matches. Associated with the matching operator, we have discussed the
single line C<//s>, multi-line C<//m>, case-insensitive C<//i> and
extended C<//x> modifiers. There are a few more things you might
want to know about matching operators.

=head3 Prohibiting substitution

If you change C<$pattern> after the first substitution happens, Perl
will ignore it. If you don't want any substitutions at all, use the
special delimiter C<m''>:

    @pattern = ('Seuss');
    while (<>) {
        print if m'@pattern';  # matches literal '@pattern', not 'Seuss'
    }

Similar to strings, C<m''> acts like apostrophes on a regexp; all other
C<m> delimiters act like quotes. If the regexp evaluates to the empty string,
the regexp in the I<last successful match> is used instead. So we have

    "dog" =~ /d/;      # 'd' matches
    "dogbert" =~ //;   # this matches the 'd' regexp used before

=head3 Global matching

The final two modifiers we will discuss here,
C<//g> and C<//c>, concern multiple matches.
The modifier C<//g> stands for global matching and allows the
matching operator to match within a string as many times as possible.
In scalar context, successive invocations against a string will have
C<//g> jump from match to match, keeping track of position in the
string as it goes along. You can get or set the position with the
C<pos()> function.

The use of C<//g> is shown in the following example. Suppose we have
a string that consists of words separated by spaces. If we know how
many words there are in advance, we could extract the words using
groupings:

    $x = "cat dog house"; # 3 words
    $x =~ /^\s*(\w+)\s+(\w+)\s+(\w+)\s*$/; # matches,
                                           # $1 = 'cat'
                                           # $2 = 'dog'
                                           # $3 = 'house'

But what if we had an indeterminate number of words? This is the sort
of task C<//g> was made for. To extract all words, form the simple
regexp C<(\w+)> and loop over all matches with C</(\w+)/g>:

    while ($x =~ /(\w+)/g) {
        print "Word is $1, ends at position ", pos $x, "\n";
    }

prints

    Word is cat, ends at position 3
    Word is dog, ends at position 7
    Word is house, ends at position 13

A failed match or changing the target string resets the position. If
you don't want the position reset after failure to match, add the
C<//c> modifier, as in C</regexp/gc>. The current position in the string is
associated with the string, not the regexp. This means that different
strings have different positions and their respective positions can be
set or read independently.

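The effect of a failed match on the position, with and without
C<//c>, can be observed directly through C<pos()>. The following
sketch records the position after each step; the helper name
C<position_demo> is made up for illustration.

```perl
use strict;
use warnings;

# position_demo() is a hypothetical helper that returns what pos()
# reports after each step, so the values can be inspected.
sub position_demo {
    my $x = "cat dog house";
    my @seen;

    $x =~ /\w+/g;          # matches 'cat'
    push @seen, pos $x;    # position is now 3

    $x =~ /\d+/g;          # fails (no digits): the position is reset
    push @seen, pos $x;    # position is now undefined

    $x =~ /\w+/g;          # starts over and matches 'cat' again
    $x =~ /\d+/gc;         # fails, but //c keeps the position
    push @seen, pos $x;    # position is still 3

    pos($x) = 8;           # set the position by hand
    $x =~ /(\w+)/g;        # the search resumes at offset 8
    push @seen, $1;        # the word found from there

    return @seen;
}

print join(', ', map { defined $_ ? $_ : 'undef' } position_demo()), "\n";
```
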
In list context, C<//g> returns a list of matched groupings, or if
there are no groupings, a list of matches to the whole regexp. So if
we wanted just the words, we could use

    @words = ($x =~ /(\w+)/g);  # matches,
                                # $words[0] = 'cat'
                                # $words[1] = 'dog'
                                # $words[2] = 'house'

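When the regexp has more than one grouping, each match contributes
all of its groupings to the returned list. A common use, sketched
below, is to fill a hash from C<name=value> pairs. The helper name
C<parse_settings> and the input format are assumptions of this
example.

```perl
use strict;
use warnings;

# parse_settings() is a hypothetical helper name.
sub parse_settings {
    my ($line) = @_;
    # Every match adds two elements, ($1, $2), to the list, so the flat
    # list of name, value, name, value ... can be assigned to a hash.
    my %setting = $line =~ /(\w+)=(\d+)/g;
    return %setting;
}

my %setting = parse_settings("width=80 height=24 depth=8");
print "$_ is $setting{$_}\n" for sort keys %setting;
```
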
Closely associated with the C<//g> modifier is the C<\G> anchor. The
C<\G> anchor matches at the point where the previous C<//g> match left
off. C<\G> allows us to easily do context-sensitive matching:

    $metric = 1;  # use metric units
    ...
    $x = <FILE>;  # read in measurement
    $x =~ /^([+-]?\d+)\s*/g;  # get magnitude
    $weight = $1;
    if ($metric) { # error checking
        print "Units error!" unless $x =~ /\Gkg\./g;
    }
    else {
        print "Units error!" unless $x =~ /\Glbs\./g;
    }
    $x =~ /\G\s+(widget|sprocket)/g;  # continue processing

The combination of C<//g> and C<\G> allows us to process the string a
bit at a time and use arbitrary Perl logic to decide what to do next.
Currently, the C<\G> anchor is only fully supported when used to anchor
to the start of the pattern.

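Processing a string "a bit at a time" is the basis of a simple
tokenizer. The sketch below combines C<\G> with C<//gc> so that a
failed alternative leaves the position untouched and the next
alternative is tried at the same place. The helper name C<tokenize>
and the token labels are made up for this example.

```perl
use strict;
use warnings;

# tokenize() is a hypothetical helper that splits $text into labeled tokens.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    pos($text) = 0;                      # start at the beginning
    while (pos($text) < length $text) {
        # Each regexp is anchored with \G at the current position.
        # //c keeps that position when an alternative fails.
        if    ($text =~ /\G\s+/gc)            { next }   # skip whitespace
        elsif ($text =~ /\G(\d+)/gc)          { push @tokens, "NUM($1)" }
        elsif ($text =~ /\G([a-zA-Z_]\w*)/gc) { push @tokens, "WORD($1)" }
        elsif ($text =~ /\G(.)/gcs)           { push @tokens, "OTHER($1)" }
    }
    return @tokens;
}

print join(' ', tokenize("12 kg + x1")), "\n";
```

The last alternative matches any single character, so the loop always
advances and cannot get stuck on unexpected input.
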
C<\G> is also invaluable in processing fixed-length records with
regexps. Suppose we have a snippet of coding region DNA, encoded as
base pair letters C<ATCGTTGAAT...> and we want to find all the stop
codons C<TGA>. In a coding region, codons are 3-letter sequences, so
we can think of the DNA snippet as a sequence of 3-letter records. The
naive regexp

    # expanded, this is "ATC GTT GAA TGC AAA TGA CAT GAC"
    $dna = "ATCGTTGAATGCAAATGACATGAC";
    $dna =~ /TGA/;

doesn't work; it may match a C<TGA>, but there is no guarantee that
the match is aligned with codon boundaries, e.g., the substring
S<C<GTT GAA>> gives a match. A better solution is

    while ($dna =~ /(\w\w\w)*?TGA/g) {  # note the minimal *?
        print "Got a TGA stop codon at position ", pos $dna, "\n";
    }

which prints

    Got a TGA stop codon at position 18
    Got a TGA stop codon at position 23

Position 18 is good, but position 23 is bogus. What happened?

The answer is that our regexp works well until we get past the last
real match. Then the regexp will fail to match a synchronized C<TGA>
and start stepping ahead one character position at a time, not what we
want. The solution is to use C<\G> to anchor the match to the codon
alignment:

    while ($dna =~ /\G(\w\w\w)*?TGA/g) {
        print "Got a TGA stop codon at position ", pos $dna, "\n";
    }

This prints

    Got a TGA stop codon at position 18

which is the correct answer. This example illustrates that it is
important not only to match what is desired, but to reject what is not
desired.

(There are other regexp modifiers that are available, such as
C<//o>, but their specialized uses are beyond the
scope of this introduction.)

=head3 Search and replace

Regular expressions also play a big role in I<search and replace>
operations in Perl. Search and replace is accomplished with the
C<s///> operator. The general form is
C<s/regexp/replacement/modifiers>, with everything we know about
regexps and modifiers applying in this case as well. The
C<replacement> is a Perl double-quoted string that replaces in the
string whatever is matched with the C<regexp>. The operator C<=~> is
also used here to associate a string with C<s///>. If matching
against C<$_>, the S<C<$_ =~>> can be dropped. If there is a match,
C<s///> returns the number of substitutions made; otherwise it returns
false. Here are a few examples:

    $x = "Time to feed the cat!";
    $x =~ s/cat/hacker/;   # $x contains "Time to feed the hacker!"
    if ($x =~ s/^(Time.*hacker)!$/$1 now!/) {
        $more_insistent = 1;
    }
    $y = "'quoted words'";
    $y =~ s/^'(.*)'$/$1/;  # strip single quotes,
                           # $y contains "quoted words"

In the last example, the whole string was matched, but only the part
inside the single quotes was grouped. With the C<s///> operator, the
matched variables C<$1>, C<$2>, etc. are immediately available for use
in the replacement expression, so we use C<$1> to replace the quoted
string with just what was quoted. With the global modifier, C<s///g>
will search and replace all occurrences of the regexp in the string:

    $x = "I batted 4 for 4";
    $x =~ s/4/four/;   # doesn't do it all:
                       # $x contains "I batted four for 4"
    $x = "I batted 4 for 4";
    $x =~ s/4/four/g;  # does it all:
                       # $x contains "I batted four for four"

If you prefer 'regex' over 'regexp' in this tutorial, you could use
the following program to replace it:

    % cat > simple_replace
    #!/usr/bin/perl
    $regexp = shift;
    $replacement = shift;
    while (<>) {
        s/$regexp/$replacement/g;
        print;
    }
    ^D

    % simple_replace regexp regex perlretut.pod

In C<simple_replace> we used the C<s///g> modifier to replace all
occurrences of the regexp on each line. (Even though the regular
expression appears in a loop, Perl is smart enough to compile it
only once.) As with C<simple_grep>, both the
C<print> and the C<s/$regexp/$replacement/g> use C<$_> implicitly.

If you don't want C<s///> to change your original variable you can use
the non-destructive substitute modifier, C<s///r>. This changes the
behavior so that C<s///r> returns the final substituted string
(instead of the number of substitutions):

    $x = "I like dogs.";
    $y = $x =~ s/dogs/cats/r;
    print "$x $y\n";

That example will print "I like dogs. I like cats.". Notice the original
C<$x> variable has not been affected. The overall
result of the substitution is instead stored in C<$y>. If the
substitution doesn't affect anything then the original string is
returned:

    $x = "I like dogs.";
    $y = $x =~ s/elephants/cougars/r;
    print "$x $y\n"; # prints "I like dogs. I like dogs."

One other interesting thing that the C<s///r> flag allows is chaining
substitutions:

    $x = "Cats are great.";
    print $x =~ s/Cats/Dogs/r =~ s/Dogs/Frogs/r =~
        s/Frogs/Hedgehogs/r, "\n";
    # prints "Hedgehogs are great."

A modifier available specifically to search and replace is the
C<s///e> evaluation modifier. C<s///e> treats the
replacement text as Perl code, rather than a double-quoted
string. The value that the code returns is substituted for the
matched substring. C<s///e> is useful if you need to do a bit of
computation in the process of replacing text. This example counts
character frequencies in a line:

    $x = "Bill the cat";
    $x =~ s/(.)/$chars{$1}++;$1/eg; # final $1 replaces char with itself
    print "frequency of '$_' is $chars{$_}\n"
        foreach (sort {$chars{$b} <=> $chars{$a}} keys %chars);

This prints (the order of characters with equal counts may vary)

    frequency of ' ' is 2
    frequency of 't' is 2
    frequency of 'l' is 2
    frequency of 'B' is 1
    frequency of 'c' is 1
    frequency of 'e' is 1
    frequency of 'h' is 1
    frequency of 'i' is 1
    frequency of 'a' is 1

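A simpler use of C<s///e> is to replace each match with a value
computed from it. The following sketch doubles every integer found in
a string; the helper name C<double_numbers> is made up for
illustration.

```perl
use strict;
use warnings;

# double_numbers() is a hypothetical helper name.
sub double_numbers {
    my ($text) = @_;
    # With //e the replacement is Perl code: its value, $1 * 2,
    # replaces the digits that were matched.
    $text =~ s/(\d+)/$1 * 2/eg;
    return $text;
}

print double_numbers("3 apples and 4 pears"), "\n";
```

Without C<//e>, the replacement would be the literal text of the
expression with C<$1> interpolated, for example C<3 * 2>, rather than
its value.
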
As with the match C<m//> operator, C<s///> can use other delimiters,
such as C<s!!!> and C<s{}{}>, and even C<s{}//>. If single quotes are
used C<s'''>, then the regexp and replacement are
treated as single-quoted strings and there are no
variable substitutions. C<s///> in list context
returns the same thing as in scalar context, i.e., the number of
matches.

=head3 The split function

The C<split()> function is another place where a regexp is used.
C<split /regexp/, string, limit> separates the C<string> operand into
a list of substrings and returns that list. The regexp must be designed
to match whatever constitutes the separators for the desired substrings.
The C<limit>, if present, constrains splitting into no more than C<limit>
number of strings. For example, to split a string into words, use

    $x = "Calvin and Hobbes";
    @words = split /\s+/, $x;  # $words[0] = 'Calvin'
                               # $words[1] = 'and'
                               # $words[2] = 'Hobbes'

If the empty regexp C<//> is used, the regexp always matches and
the string is split into individual characters. If the regexp has
groupings, then the resulting list contains the matched substrings from the
groupings as well. For instance,

    $x = "/usr/bin/perl";
    @dirs = split m!/!, $x;    # $dirs[0] = ''
                               # $dirs[1] = 'usr'
                               # $dirs[2] = 'bin'
                               # $dirs[3] = 'perl'
    @parts = split m!(/)!, $x; # $parts[0] = ''
                               # $parts[1] = '/'
                               # $parts[2] = 'usr'
                               # $parts[3] = '/'
                               # $parts[4] = 'bin'
                               # $parts[5] = '/'
                               # $parts[6] = 'perl'

Since the first character of C<$x> matched the regexp, C<split> prepended
an empty initial element to the list.

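The C<limit> argument and the empty regexp are easiest to understand
by example. The sketch below uses two hypothetical helpers,
C<first_field_and_rest> and C<characters>, and a colon-separated
record that is made up for illustration.

```perl
use strict;
use warnings;

# Split off the first colon-separated field; a limit of 2 leaves the
# remainder of the record, colons included, in the second element.
sub first_field_and_rest {
    my ($record) = @_;
    return split /:/, $record, 2;
}

# The empty regexp matches between every pair of characters.
sub characters {
    my ($string) = @_;
    return split //, $string;
}

my ($user, $rest) = first_field_and_rest("root:x:0:0:/root:/bin/sh");
print "user is '$user', rest is '$rest'\n";
print join('-', characters("perl")), "\n";
```
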
If you have read this far, congratulations! You now have all the basic
tools needed to use regular expressions to solve a wide range of text
processing problems. If this is your first time through the tutorial,
why not stop here and play around with regexps a while.... S<Part 2>
concerns the more esoteric aspects of regular expressions and those
concepts certainly aren't needed right at the start.

=head1 Part 2: Power tools

OK, you know the basics of regexps and you want to know more. If
matching regular expressions is analogous to a walk in the woods, then
the tools discussed in Part 1 are analogous to topo maps and a
compass, basic tools we use all the time. Most of the tools in part 2
are analogous to flare guns and satellite phones. They aren't used
too often on a hike, but when we are stuck, they can be invaluable.

What follows are the more advanced, less used, or sometimes esoteric
capabilities of Perl regexps. In Part 2, we will assume you are
comfortable with the basics and concentrate on the advanced features.

=head2 More on characters, strings, and character classes

There are a number of escape sequences and character classes that we
haven't covered yet.

There are several escape sequences that convert characters or strings
between upper and lower case, and they are also available within
patterns. C<\l> and C<\u> convert the next character to lower or
upper case, respectively:

    $x = "perl";
    $string =~ /\u$x/;   # matches 'Perl' in $string
    $x = "M(rs?|s)\\.";  # note the double backslash
    $string =~ /\l$x/;   # matches 'mr.', 'mrs.', and 'ms.'

A C<\L> or C<\U> indicates a lasting conversion of case, until
terminated by C<\E> or thrown over by another C<\U> or C<\L>:

    $x = "This word is in lower case:\L SHOUT\E";
    $x =~ /shout/;       # matches
    $x = "I STILL KEYPUNCH CARDS FOR MY 360";
    $x =~ /\Ukeypunch/;  # matches punch card string

If there is no C<\E>, case is converted until the end of the
string. The regexps C<\L\u$word> or C<\u\L$word> convert the first
character of C<$word> to uppercase and the rest of the characters to
lowercase.

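The C<\u\L> combination works the same way in a double-quoted string
and in a pattern. The following sketch shows both; the helper names
C<capitalize> and C<matches_capitalized> are made up for this
example.

```perl
use strict;
use warnings;

# In a double-quoted string, \u\L upper-cases the first character and
# lower-cases the rest.
sub capitalize {
    my ($word) = @_;
    return "\u\L$word";
}

# The same escapes inside a pattern: $string must be exactly the
# capitalized form of $word. \E ends the case conversion.
sub matches_capitalized {
    my ($string, $word) = @_;
    return $string =~ /\A\u\L$word\E\z/ ? 1 : 0;
}

print capitalize("pERL"), "\n";
print "match\n" if matches_capitalized("Perl", "pERL");
```
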
Control characters can be escaped with C<\c>, so that a control-Z
character would be matched with C<\cZ>. The escape sequence
C<\Q>...C<\E> quotes, or protects most non-alphabetic characters. For
instance,

    $x = "That !^*&%~& cat!";
    $x =~ /\Q!^*&%~&\E/;  # check for rough language

It does not protect C<$> or C<@>, so that variables can still be
substituted.

C<\Q>, C<\L>, C<\l>, C<\U>, C<\u> and C<\E> are actually part of
double-quotish syntax, and not part of regexp syntax proper. They will
work if they appear in a regular expression embedded directly in a
program, but not when contained in a string that is interpolated in a
pattern.

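Because C<\Q> does not protect C<$>, it can be wrapped around an
interpolated variable to match that variable's contents literally.
The sketch below contrasts this with plain interpolation and with the
built-in C<quotemeta> function, which is the way to get the same
protection for a pattern string built elsewhere. The helper names are
made up for illustration.

```perl
use strict;
use warnings;

# $needle is interpolated first, then \Q...\E protects its metacharacters.
sub contains_literal {
    my ($text, $needle) = @_;
    return $text =~ /\Q$needle\E/ ? 1 : 0;
}

# Without \Q, $needle is treated as a regexp, so '.' matches any character.
sub contains_pattern {
    my ($text, $needle) = @_;
    return $text =~ /$needle/ ? 1 : 0;
}

# quotemeta() does the quoting as an ordinary function call, which is
# useful when the pattern is stored in a string before it is used.
sub contains_quoted {
    my ($text, $needle) = @_;
    my $pattern = quotemeta $needle;
    return $text =~ /$pattern/ ? 1 : 0;
}

for my $text ("price: 3.50", "price: 3450") {
    printf "%-12s literal=%d pattern=%d quoted=%d\n", $text,
        contains_literal($text, "3.50"),
        contains_pattern($text, "3.50"),
        contains_quoted($text, "3.50");
}
```
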
Perl regexps can handle more than just the
standard ASCII character set. Perl supports I<Unicode>, a standard
for representing the alphabets from virtually all of the world's written
languages, and a host of symbols. Perl's text strings are Unicode strings, so
they can contain characters with a value (codepoint or character number) higher
than 255.

What does this mean for regexps? Well, regexp users don't need to know
much about Perl's internal representation of strings. But they do need
to know 1) how to represent Unicode characters in a regexp and 2) that
a matching operation will treat the string to be searched as a sequence
of characters, not bytes. The answer to 1) is that Unicode characters
greater than C<chr(255)> are represented using the C<\x{hex}> notation, because
C<\x>I<hex> (without curly braces) doesn't go further than 255. (Starting in Perl
5.14, if you're an octal fan, you can also use C<\o{oct}>.)

    /\x{263a}/;  # match a Unicode smiley face :)

B<NOTE>: In Perl 5.6.0 it used to be that one needed to say C<use
utf8> to use any Unicode features. This is no longer the case: for
almost all Unicode processing, the explicit C<utf8> pragma is not
needed. (The only case where it matters is if your Perl script is in
Unicode and encoded in UTF-8; then an explicit C<use utf8> is needed.)

Figuring out the hexadecimal sequence of a Unicode character you want
or deciphering someone else's hexadecimal Unicode regexp is about as
much fun as programming in machine code. So another way to specify
Unicode characters is to use the I<named character> escape
sequence C<\N{I<name>}>. I<name> is a name for the Unicode character, as
specified in the Unicode standard. For instance, if we wanted to
represent or match the astrological sign for the planet Mercury, we
could use

    $x = "abc\N{MERCURY}def";
    $x =~ /\N{MERCURY}/;  # matches

One can also use "short" names:

    print "\N{GREEK SMALL LETTER SIGMA} is called sigma.\n";
    print "\N{greek:Sigma} is an upper-case sigma.\n";

You can also restrict names to a certain alphabet by specifying the
L<charnames> pragma:

    use charnames qw(greek);
    print "\N{sigma} is Greek sigma\n";

An index of character names is available on-line from the Unicode
Consortium, L<http://www.unicode.org/charts/charindex.html>; explanatory
material with links to other resources at
L<http://www.unicode.org/standard/where>.

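The two notations name the same characters, which can be checked
directly. In the following sketch the helper name C<has_smiley> is
made up, and C<use charnames ':full'> is included on the assumption
that the code may run on a perl too old to load the names
automatically.

```perl
use strict;
use warnings;
use charnames ':full';   # makes \N{...} with full names available

# has_smiley() is a hypothetical helper name.
sub has_smiley {
    my ($text) = @_;
    return $text =~ /\N{WHITE SMILING FACE}/ ? 1 : 0;
}

my $by_name = "\N{GREEK SMALL LETTER SIGMA}";
my $by_hex  = "\x{3c3}";
print "name and hex give the same character\n" if $by_name eq $by_hex;
print "found a smiley\n" if has_smiley("hello \x{263a}");
```
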
The answer to requirement 2) is that a regexp (mostly)
uses Unicode characters. The "mostly" is for messy backward
compatibility reasons, but starting in Perl 5.14, any regex compiled in
the scope of a C<use feature 'unicode_strings'> (which is automatically
turned on within the scope of a C<use 5.012> or higher) will turn that
"mostly" into "always". If you want to handle Unicode properly, you
should ensure that C<'unicode_strings'> is turned on.
Internally, this is encoded to bytes using either UTF-8 or a native 8
bit encoding, depending on the history of the string, but conceptually
it is a sequence of characters, not bytes. See L<perlunitut> for a
tutorial about that.

Let us now discuss Unicode character classes, most usually called
"character properties". These are represented by the
C<\p{name}> escape sequence. Closely associated is the C<\P{name}>
property, which is the negation of the C<\p{name}> one. For
example, to match lower and uppercase characters,

    $x = "BOB";
    $x =~ /^\p{IsUpper}/;   # matches, uppercase char class
    $x =~ /^\P{IsUpper}/;   # doesn't match, char class sans uppercase
    $x =~ /^\p{IsLower}/;   # doesn't match, lowercase char class
    $x =~ /^\P{IsLower}/;   # matches, char class sans lowercase

(The "Is" is optional.)

There are many, many Unicode character properties. For the full list
see L<perluniprops>. Most of them have synonyms with shorter names,
also listed there. Some synonyms are a single character. For these,
you can drop the braces. For instance, C<\pM> is the same thing as
C<\p{Mark}>, meaning things like accent marks.

The Unicode C<\p{Script}> and C<\p{Script_Extensions}> properties are
used to categorize every Unicode character into the language script it
is written in. (C<Script_Extensions> is an improved version of
C<Script>, which is retained for backward compatibility, and so you
should generally use C<Script_Extensions>.)
For example,
English, French, and a bunch of other European languages are written in
the Latin script. But there is also the Greek script, the Thai script,
the Katakana script, etc. You can test whether a character is in a
particular script (based on C<Script_Extensions>) with, for example
C<\p{Latin}>, C<\p{Greek}>, or C<\p{Katakana}>. To test if it isn't in
the Balinese script, you would use C<\P{Balinese}>.

What we have described so far is the single form of the C<\p{...}> character
classes. There is also a compound form which you may run into. These
look like C<\p{name=value}> or C<\p{name:value}> (the equals sign and colon
can be used interchangeably). These are more general than the single form,
and in fact most of the single forms are just Perl-defined shortcuts for common
compound forms. For example, the script examples in the previous paragraph
could be written equivalently as C<\p{Script_Extensions=Latin}>,
C<\p{Script_Extensions:Greek}>, C<\p{script_extensions=katakana}>, and
C<\P{script_extensions=balinese}> (case is irrelevant
between the C<{}> braces). You may
never have to use the compound forms, but sometimes it is necessary, and their
use can make your code easier to understand.

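To see single and compound property forms side by side, the following
sketch tests one character against a handful of properties. The
helper name C<describe> and its labels are made up for illustration.
It uses C<Script> rather than C<Script_Extensions> only on the
assumption that the older property is available on more perl
versions; that choice is not a recommendation.

```perl
use strict;
use warnings;

# describe() is a hypothetical helper: it returns a comma-separated list
# of the properties, out of a small fixed set, that $char has.
sub describe {
    my ($char) = @_;
    my @traits;
    push @traits, 'uppercase' if $char =~ /\A\p{Upper}\z/;         # single form
    push @traits, 'lowercase' if $char =~ /\A\p{Lower}\z/;
    push @traits, 'Latin'     if $char =~ /\A\p{Script=Latin}\z/;  # compound form
    push @traits, 'Greek'     if $char =~ /\A\p{Script=Greek}\z/;
    push @traits, 'mark'      if $char =~ /\A\pM\z/;               # one-letter synonym
    return join ', ', @traits;
}

# Only ASCII text is printed, so no output encoding layer is needed.
printf "U+%04X: %s\n", ord($_), describe($_)
    for 'B', "\x{3c3}", "\x{30a}";
```
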
C<\X> is an abbreviation for a character class that comprises
a Unicode I<extended grapheme cluster>. This represents a "logical character":
what appears to be a single character, but may be represented internally by more
than one. As an example, using the Unicode full names, e.g., S<C<A + COMBINING
RING>> is a grapheme cluster with base character C<A> and combining character
S<C<COMBINING RING>>, which translates in Danish to A with the circle atop it,
as in the word E<Aring>ngstrom.

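The difference between characters and logical characters shows up
when counting. In the sketch below, the helper name
C<grapheme_count> is made up, and the test word is spelled with a
base letter C<A> followed by the combining ring above, C<\x{30a}>.

```perl
use strict;
use warnings;

# grapheme_count() is a hypothetical helper: it counts \X matches, that
# is, extended grapheme clusters, rather than individual characters.
sub grapheme_count {
    my ($text) = @_;
    my $count = () = $text =~ /\X/g;   # force list context, then count
    return $count;
}

my $word = "A\x{30a}ngstrom";          # 'A' + COMBINING RING ABOVE + 'ngstrom'
print "characters: ", length($word), "\n";
print "graphemes:  ", grapheme_count($word), "\n";
```

C<length> counts the base letter and the combining character
separately, while C<\X> treats them as one unit.
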
For the full and latest information about Unicode see the latest
Unicode standard, or the Unicode Consortium's website L<http://www.unicode.org>.

As if all those classes weren't enough, Perl also defines POSIX-style
character classes. These have the form C<[:name:]>, with C<name> the
name of the POSIX class. The POSIX classes are C<alpha>, C<alnum>,
C<ascii>, C<cntrl>, C<digit>, C<graph>, C<lower>, C<print>, C<punct>,
C<space>, C<upper>, and C<xdigit>, and two extensions, C<word> (a Perl
extension to match C<\w>), and C<blank> (a GNU extension). The C<//a>
modifier restricts these to matching just in the ASCII range; otherwise
they can match the same as their corresponding Perl Unicode classes:
C<[:upper:]> is the same as C<\p{IsUpper}>, etc. (There are some
exceptions and gotchas with this; see L<perlrecharclass> for a full
discussion.) The C<[:digit:]>, C<[:word:]>, and
C<[:space:]> correspond to the familiar C<\d>, C<\w>, and C<\s>
character classes. To negate a POSIX class, put a C<^> in front of
the name, so that, e.g., C<[:^digit:]> corresponds to C<\D> and, under
Unicode, C<\P{IsDigit}>. The Unicode and POSIX character classes can
be used just like C<\d>, with the exception that POSIX character
classes can only be used inside of a character class:

    /\s+[abc[:digit:]xyz]\s*/;    # match a,b,c,x,y,z, or a digit
    /^=item\s[[:digit:]]/;        # match '=item',
                                  # followed by a space and a digit
    /\s+[abc\p{IsDigit}xyz]\s+/;  # match a,b,c,x,y,z, or a digit
    /^=item\s\p{IsDigit}/;        # match '=item',
                                  # followed by a space and a digit

Whew! That is all the rest of the characters and character classes.

=head2 Compiling and saving regular expressions

In Part 1 we mentioned that Perl compiles a regexp into a compact
sequence of opcodes. Thus, a compiled regexp is a data structure
that can be stored once and used again and again. The regexp quote
C<qr//> does exactly that: C<qr/string/> compiles the C<string> as a
regexp and transforms the result into a form that can be assigned to a
variable:

    $reg = qr/foo+bar?/;  # $reg contains a compiled regexp

Then C<$reg> can be used as a regexp:

    $x = "fooooba";
    $x =~ $reg;     # matches, just like /foo+bar?/
    $x =~ /$reg/;   # same thing, alternate form

C<$reg> can also be interpolated into a larger regexp:

    $x =~ /(abc)?$reg/;  # still matches

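Interpolating compiled regexps makes it possible to build a larger
pattern from named pieces. The following is a minimal sketch that
reuses the integer regexp from Part 1; the variable names C<$integer>
and C<$pair>, the helper C<parse_point>, and the C<(x, y)> input
format are all assumptions of this example.

```perl
use strict;
use warnings;

# A small building block, compiled once ...
my $integer = qr/[+-]?\d+/;

# ... and interpolated twice into a larger compiled regexp that matches
# a pair such as "(3, -4)", capturing both integers.
my $pair = qr/\(\s*($integer)\s*,\s*($integer)\s*\)/;

# parse_point() is a hypothetical helper: it returns the two integers,
# or an empty list if $text is not a well-formed pair.
sub parse_point {
    my ($text) = @_;
    return $text =~ /\A$pair\z/ ? ($1, $2) : ();
}

my ($px, $py) = parse_point("(3, -4)");
print "x is $px, y is $py\n";
```
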
As with the matching operator, the regexp quote can use different
delimiters, e.g., C<qr!!>, C<qr{}> or C<qr~~>. Apostrophes
as delimiters (C<qr''>) inhibit any interpolation.

Pre-compiled regexps are useful for creating dynamic matches that
don't need to be recompiled each time they are encountered. Using
pre-compiled regexps, we write a C<grep_step> program which greps
for a sequence of patterns, advancing to the next pattern as soon
as one has been satisfied.

    % cat > grep_step
    #!/usr/bin/perl
    # grep_step - match <number> regexps, one after the other
    # usage: grep_step <number> regexp1 regexp2 ... file1 file2 ...

    $number = shift;
    $regexp[$_] = shift foreach (0..$number-1);
    @compiled = map qr/$_/, @regexp;
    while ($line = <>) {
        if ($line =~ /$compiled[0]/) {
            print $line;
            shift @compiled;
            last unless @compiled;
        }
    }
    ^D

    % grep_step 3 shift print last grep_step
    $number = shift;
            print $line;
            last unless @compiled;

Storing pre-compiled regexps in an array C<@compiled> allows us to
simply loop through the regexps without any recompilation, thus gaining
flexibility without sacrificing speed.

2131=head2 Composing regular expressions at runtime
2132
2133Backtracking is more efficient than repeated tries with different regular
2134expressions. If there are several regular expressions and a match with
353c6505 2135any of them is acceptable, then it is possible to combine them into a set
7638d2dc 2136of alternatives. If the individual expressions are input data, this
353c6505
DL
2137can be done by programming a join operation. We'll exploit this idea in
2138an improved version of the C<simple_grep> program: a program that matches
7638d2dc
WL
2139multiple patterns:
2140
    % cat > multi_grep
    #!/usr/bin/perl
    # multi_grep - match any of <number> regexps
    # usage: multi_grep <number> regexp1 regexp2 ... file1 file2 ...

    $number = shift;
    $regexp[$_] = shift foreach (0..$number-1);
    $pattern = join '|', @regexp;

    while ($line = <>) {
        print $line if $line =~ /$pattern/;
    }
    ^D

    % multi_grep 2 shift for multi_grep
    $number = shift;
    $regexp[$_] = shift foreach (0..$number-1);
2158
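The join can also be packaged so that the combined pattern is compiled
once with C<qr//>. The sketch below is one possible variation, not the
program above: the helper name C<any_of> is hypothetical, and wrapping each
alternative in C<(?:...)> is an added precaution that only matters if the
result is later embedded in a larger regexp.

```perl
use strict;
use warnings;

# Hypothetical helper: build one compiled regexp matching any of the inputs.
sub any_of {
    my @regexps = @_;
    my $alternatives = join '|', map { "(?:$_)" } @regexps;
    return qr/$alternatives/;
}

my $pattern = any_of('shift', 'for');
for my $line ('$number = shift;', 'while ($line = <>) {', 'foreach (0..3)') {
    print "$line\n" if $line =~ $pattern;
}
```
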
2159Sometimes it is advantageous to construct a pattern from the I<input>
2160that is to be analyzed and use the permissible values on the left
2161hand side of the matching operations. As an example for this somewhat
353c6505 2162paradoxical situation, let's assume that our input contains a command
7638d2dc 2163verb which should match one out of a set of available command verbs,
353c6505 2164with the additional twist that commands may be abbreviated as long as
7638d2dc
WL
2165the given string is unique. The program below demonstrates the basic
2166algorithm.
2167
    % cat > keymatch
    #!/usr/bin/perl
    $kwds = 'copy compare list print';
    while( $cmd = <> ){
        $cmd =~ s/^\s+|\s+$//g;  # trim leading and trailing spaces
        if( ( @matches = $kwds =~ /\b$cmd\w*/g ) == 1 ){
            print "command: '@matches'\n";
        } elsif( @matches == 0 ){
            print "no such command: '$cmd'\n";
        } else {
            print "not unique: '$cmd' (could be one of: @matches)\n";
        }
    }
    ^D

    % keymatch
    li
    command: 'list'
    co
    not unique: 'co' (could be one of: copy compare)
    printer
    no such command: 'printer'
2190
Rather than trying to match the input against the keywords, we match the
combined set of keywords against the input. The pattern matching
operation S<C<$kwds =~ /\b$cmd\w*/g>> does several things at the
same time. It makes sure that the given command begins where a keyword
begins (C<\b>). It tolerates abbreviations due to the added C<\w*>. It
tells us the number of matches (C<scalar @matches>) and all the keywords
that were actually matched. You could hardly ask for more.
7638d2dc 2198
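Since the command is typed by a user, it may contain regexp
metacharacters. The sketch below shows one way to guard against that with
C<\Q...\E>. Both the guard and the helper name C<expand_command> are
additions for illustration and are not part of the C<keymatch> program.

```perl
use strict;
use warnings;

# Hypothetical helper: return every keyword in $kwds that $cmd abbreviates.
sub expand_command {
    my ($kwds, $cmd) = @_;
    $cmd =~ s/^\s+|\s+$//g;            # trim leading and trailing spaces
    return () unless length $cmd;      # an empty command matches nothing
    # \Q...\E (an addition to the original) keeps metacharacters literal
    my @matches = $kwds =~ /\b\Q$cmd\E\w*/g;
    return @matches;
}

my $kwds = 'copy compare list print';
for my $cmd ('li', 'co', 'printer', 'c.') {
    my @m = expand_command($kwds, $cmd);
    printf "%-8s => %d match(es): %s\n", $cmd, scalar @m, "@m";
}
```

Without the quoting, the input C<c.> would be read as "C<c> followed by any
character" and would be reported as an abbreviation of both C<copy> and
C<compare>.
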
47f9c88b
GS
2199=head2 Embedding comments and modifiers in a regular expression
2200
2201Starting with this section, we will be discussing Perl's set of
7638d2dc 2202I<extended patterns>. These are extensions to the traditional regular
47f9c88b
GS
2203expression syntax that provide powerful new tools for pattern
2204matching. We have already seen extensions in the form of the minimal
6b3ddc02
FC
2205matching constructs C<??>, C<*?>, C<+?>, C<{n,m}?>, and C<{n,}?>. Most
2206of the extensions below have the form C<(?char...)>, where the
47f9c88b
GS
2207C<char> is a character that determines the type of extension.
2208
2209The first extension is an embedded comment C<(?#text)>. This embeds a
2210comment into the regular expression without affecting its meaning. The
2211comment should not have any closing parentheses in the text. An
2212example is
2213
2214 /(?# Match an integer:)[+-]?\d+/;
2215
2216This style of commenting has been largely superseded by the raw,
2217freeform commenting that is allowed with the C<//x> modifier.
2218
5f67e4c9 2219Most modifiers, such as C<//i>, C<//m>, C<//s> and C<//x> (or any
24549070 2220combination thereof) can also be embedded in
47f9c88b
GS
2221a regexp using C<(?i)>, C<(?m)>, C<(?s)>, and C<(?x)>. For instance,
2222
2223 /(?i)yes/; # match 'yes' case insensitively
2224 /yes/i; # same thing
2225 /(?x)( # freeform version of an integer regexp
2226 [+-]? # match an optional sign
2227 \d+ # match a sequence of digits
2228 )
2229 /x;
2230
2231Embedded modifiers can have two important advantages over the usual
2232modifiers. Embedded modifiers allow a custom set of modifiers to
2233I<each> regexp pattern. This is great for matching an array of regexps
2234that must have different modifiers:
2235
2236 $pattern[0] = '(?i)doctor';
2237 $pattern[1] = 'Johnson';
2238 ...
2239 while (<>) {
2240 foreach $patt (@pattern) {
2241 print if /$patt/;
2242 }
2243 }
2244
24549070 2245The second advantage is that embedded modifiers (except C<//p>, which
7638d2dc 2246modifies the entire regexp) only affect the regexp
47f9c88b
GS
2247inside the group the embedded modifier is contained in. So grouping
2248can be used to localize the modifier's effects:
2249
2250 /Answer: ((?i)yes)/; # matches 'Answer: yes', 'Answer: YES', etc.
2251
2252Embedded modifiers can also turn off any modifiers already present
2253by using, e.g., C<(?-i)>. Modifiers can also be combined into
2254a single expression, e.g., C<(?s-i)> turns on single line mode and
2255turns off case insensitivity.
2256
7638d2dc 2257Embedded modifiers may also be added to a non-capturing grouping.
47f9c88b
GS
2258C<(?i-m:regexp)> is a non-capturing grouping that matches C<regexp>
2259case insensitively and turns off multi-line mode.
2260
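The scoping rules can be checked with a short script. This is only a
sketch; the variable names and test strings are made up for illustration.

```perl
use strict;
use warnings;

# (?i) inside the group affects only 'yes'; 'Answer: ' stays case sensitive.
my $scoped = qr/Answer: ((?i)yes)/;

# (?i:...) is the non-capturing form; ' NOW' lies outside its scope.
my $cluster = qr/^(?i:stop) NOW$/;

for my $text ('Answer: YES', 'ANSWER: yes') {
    printf "%-12s %s\n", $text, ($text =~ $scoped ? 'matches' : 'no match');
}
for my $text ('Stop NOW', 'stop now') {
    printf "%-12s %s\n", $text, ($text =~ $cluster ? 'matches' : 'no match');
}
```

In each pair, only the first string matches, because the part of the
pattern outside the group is still compared case sensitively.
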
7638d2dc 2261
47f9c88b
GS
2262=head2 Looking ahead and looking behind
2263
2264This section concerns the lookahead and lookbehind assertions. First,
2265a little background.
2266
2267In Perl regular expressions, most regexp elements 'eat up' a certain
2268amount of string when they match. For instance, the regexp element
22bf43da 2269C<[abc]> eats up one character of the string when it matches, in the
7638d2dc 2270sense that Perl moves to the next character position in the string
47f9c88b
GS
2271after the match. There are some elements, however, that don't eat up
2272characters (advance the character position) if they match. The examples
2273we have seen so far are the anchors. The anchor C<^> matches the
2274beginning of the line, but doesn't eat any characters. Similarly, the
7638d2dc 2275word boundary anchor C<\b> matches wherever a character matching C<\w>
353c6505 2276is next to a character that doesn't, but it doesn't eat up any
6b3ddc02
FC
2277characters itself. Anchors are examples of I<zero-width assertions>:
2278zero-width, because they consume
47f9c88b
GS
2279no characters, and assertions, because they test some property of the
2280string. In the context of our walk in the woods analogy to regexp
2281matching, most regexp elements move us along a trail, but anchors have
2282us stop a moment and check our surroundings. If the local environment
2283checks out, we can proceed forward. But if the local environment
2284doesn't satisfy us, we must backtrack.
2285
2286Checking the environment entails either looking ahead on the trail,
2287looking behind, or both. C<^> looks behind, to see that there are no
2288characters before. C<$> looks ahead, to see that there are no
2289characters after. C<\b> looks both ahead and behind, to see if the
7638d2dc 2290characters on either side differ in their "word-ness".
47f9c88b
GS
2291
2292The lookahead and lookbehind assertions are generalizations of the
2293anchor concept. Lookahead and lookbehind are zero-width assertions
2294that let us specify which characters we want to test for. The
2295lookahead assertion is denoted by C<(?=regexp)> and the lookbehind
a6b2f353 2296assertion is denoted by C<< (?<=fixed-regexp) >>. Some examples are
47f9c88b
GS
2297
2298 $x = "I catch the housecat 'Tom-cat' with catnip";
7638d2dc 2299 $x =~ /cat(?=\s)/; # matches 'cat' in 'housecat'
47f9c88b
GS
2300 @catwords = ($x =~ /(?<=\s)cat\w+/g); # matches,
2301 # $catwords[0] = 'catch'
2302 # $catwords[1] = 'catnip'
2303 $x =~ /\bcat\b/; # matches 'cat' in 'Tom-cat'
2304 $x =~ /(?<=\s)cat(?=\s)/; # doesn't match; no isolated 'cat' in
2305 # middle of $x
2306
a6b2f353 2307Note that the parentheses in C<(?=regexp)> and C<< (?<=regexp) >> are
47f9c88b
GS
2308non-capturing, since these are zero-width assertions. Thus in the
2309second regexp, the substrings captured are those of the whole regexp
a6b2f353
GS
2310itself. Lookahead C<(?=regexp)> can match arbitrary regexps, but
2311lookbehind C<< (?<=fixed-regexp) >> only works for regexps of fixed
2312width, i.e., a fixed number of characters long. Thus
2313C<< (?<=(ab|bc)) >> is fine, but C<< (?<=(ab)*) >> is not. The
2314negated versions of the lookahead and lookbehind assertions are
2315denoted by C<(?!regexp)> and C<< (?<!fixed-regexp) >> respectively.
2316They evaluate true if the regexps do I<not> match:
47f9c88b
GS
2317
2318 $x = "foobar";
2319 $x =~ /foo(?!bar)/; # doesn't match, 'bar' follows 'foo'
2320 $x =~ /foo(?!baz)/; # matches, 'baz' doesn't follow 'foo'
2321 $x =~ /(?<!\s)foo/; # matches, there is no \s before 'foo'
2322
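Because lookahead and lookbehind consume no characters, they can also mark
positions for a substitution. The following sketch inserts thousands
separators into a string of digits; the helper name C<commify> is an
assumption, and the code handles plain unsigned integers only.

```perl
use strict;
use warnings;

# Hypothetical helper: insert commas between groups of three digits.
sub commify {
    my ($number) = @_;
    # Substitute at every position that has a digit behind it and a
    # whole number of three-digit groups ahead of it, up to the end.
    $number =~ s/(?<=\d)(?=(?:\d{3})+\z)/,/g;
    return $number;
}

print commify($_), "\n" for qw(123 1234 1234567);
```

The lookbehind C<< (?<=\d) >> has a fixed width of one character, so it
satisfies the fixed-width restriction, while the lookahead is free to use
a quantifier.
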
Here is an example where a string containing blank-separated words,
numbers and single dashes is to be split into its components.
Using C</\s+/> alone won't work, because spaces are not required between
dashes, or between a word and a dash. Additional places for a split are
established by looking ahead and behind:
47f9c88b 2328
    $str = "one two - --6-8";
    @toks = split / \s+              # a run of spaces
                  | (?<=\S) (?=-)    # any non-space followed by '-'
                  | (?<=-)  (?=\S)   # a '-' followed by any non-space
                  /x, $str;          # @toks = qw(one two - - - 6 - 8)
47f9c88b 2334
7638d2dc
WL
2335
2336=head2 Using independent subexpressions to prevent backtracking
2337
2338I<Independent subexpressions> are regular expressions, in the
47f9c88b
GS
2339context of a larger regular expression, that function independently of
2340the larger regular expression. That is, they consume as much or as
2341little of the string as they wish without regard for the ability of
2342the larger regexp to match. Independent subexpressions are represented
2343by C<< (?>regexp) >>. We can illustrate their behavior by first
2344considering an ordinary regexp:
2345
2346 $x = "ab";
2347 $x =~ /a*ab/; # matches
2348
2349This obviously matches, but in the process of matching, the
2350subexpression C<a*> first grabbed the C<a>. Doing so, however,
2351wouldn't allow the whole regexp to match, so after backtracking, C<a*>
2352eventually gave back the C<a> and matched the empty string. Here, what
2353C<a*> matched was I<dependent> on what the rest of the regexp matched.
2354
2355Contrast that with an independent subexpression:
2356
2357 $x =~ /(?>a*)ab/; # doesn't match!
2358
2359The independent subexpression C<< (?>a*) >> doesn't care about the rest
2360of the regexp, so it sees an C<a> and grabs it. Then the rest of the
2361regexp C<ab> cannot match. Because C<< (?>a*) >> is independent, there
da75cd15 2362is no backtracking and the independent subexpression does not give
47f9c88b
GS
2363up its C<a>. Thus the match of the regexp as a whole fails. A similar
2364behavior occurs with completely independent regexps:
2365
2366 $x = "ab";
2367 $x =~ /a*/g; # matches, eats an 'a'
2368 $x =~ /\Gab/g; # doesn't match, no 'a' available
2369
2370Here C<//g> and C<\G> create a 'tag team' handoff of the string from
2371one regexp to the other. Regexps with an independent subexpression are
2372much like this, with a handoff of the string to the independent
2373subexpression, and a handoff of the string back to the enclosing
2374regexp.
2375
2376The ability of an independent subexpression to prevent backtracking
2377can be quite useful. Suppose we want to match a non-empty string
2378enclosed in parentheses up to two levels deep. Then the following
2379regexp matches:
2380
2381 $x = "abc(de(fg)h"; # unbalanced parentheses
2382 $x =~ /\( ( [^()]+ | \([^()]*\) )+ \)/x;
2383
2384The regexp matches an open parenthesis, one or more copies of an
2385alternation, and a close parenthesis. The alternation is two-way, with
2386the first alternative C<[^()]+> matching a substring with no
2387parentheses and the second alternative C<\([^()]*\)> matching a
2388substring delimited by parentheses. The problem with this regexp is
2389that it is pathological: it has nested indeterminate quantifiers
07698885 2390of the form C<(a+|b)+>. We discussed in Part 1 how nested quantifiers
47f9c88b
GS
2391like this could take an exponentially long time to execute if there
2392was no match possible. To prevent the exponential blowup, we need to
2393prevent useless backtracking at some point. This can be done by
2394enclosing the inner quantifier as an independent subexpression:
2395
2396 $x =~ /\( ( (?>[^()]+) | \([^()]*\) )+ \)/x;
2397
2398Here, C<< (?>[^()]+) >> breaks the degeneracy of string partitioning
2399by gobbling up as much of the string as possible and keeping it. Then
2400match failures fail much more quickly.
2401
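The behavior described in this section can be verified with a few lines of
Perl. This is a sketch under the assumption of Perl 5.10 or later, since it
also uses the possessive quantifier C<a*+>, which behaves like
C<< (?>a*) >>.

```perl
use strict;
use warnings;

my $x = "ab";
my $ordinary    = $x =~ /a*ab/     ? 1 : 0;   # a* gives back its 'a'
my $independent = $x =~ /(?>a*)ab/ ? 1 : 0;   # (?>a*) keeps its 'a'
my $possessive  = $x =~ /a*+ab/    ? 1 : 0;   # same behavior, shorter syntax

print "ordinary=$ordinary independent=$independent possessive=$possessive\n";

# The unbalanced string from the text: the only balanced piece is '(fg)'.
my $y = "abc(de(fg)h";
my ($found) = $y =~ /( \( (?: (?>[^()]+) | \([^()]*\) )+ \) )/x;
print "found: $found\n";
```

Only the ordinary regexp matches C<"ab">. On the unbalanced string, the
attempt starting at the first parenthesis fails and the regexp then
matches C<(fg)>, starting at the second one.
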
7638d2dc 2402
47f9c88b
GS
2403=head2 Conditional expressions
2404
A I<conditional expression> is a form of if-then-else statement
that allows one to choose which patterns are to be matched, based on
some condition. There are two types of conditional expression:
C<(?(condition)yes-regexp)> and
C<(?(condition)yes-regexp|no-regexp)>. C<(?(condition)yes-regexp)> is
like an S<C<'if () {}'>> statement in Perl. If the C<condition> is true,
the C<yes-regexp> will be matched. If the C<condition> is false, the
C<yes-regexp> will be skipped and Perl will move on to the next regexp
element. The second form is like an S<C<'if () {} else {}'>> statement
in Perl. If the C<condition> is true, the C<yes-regexp> will be
matched, otherwise the C<no-regexp> will be matched.
2416
7638d2dc 2417The C<condition> can have several forms. The first form is simply an
47f9c88b 2418integer in parentheses C<(integer)>. It is true if the corresponding
7638d2dc 2419backreference C<\integer> matched earlier in the regexp. The same
c27a5cfe 2420thing can be done with a name associated with a capture group, written
7638d2dc 2421as C<< (<name>) >> or C<< ('name') >>. The second form is a bare
6b3ddc02 2422zero-width assertion C<(?...)>, either a lookahead, a lookbehind, or a
7638d2dc
WL
2423code assertion (discussed in the next section). The third set of forms
2424provides tests that return true if the expression is executed within
2425a recursion (C<(R)>) or is being called from some capturing group,
2426referenced either by number (C<(R1)>, C<(R2)>,...) or by name
2427(C<(R&name)>).
2428
2429The integer or name form of the C<condition> allows us to choose,
2430with more flexibility, what to match based on what matched earlier in the
2431regexp. This searches for words of the form C<"$x$x"> or C<"$x$y$y$x">:
47f9c88b 2432
d8b950dc 2433 % simple_grep '^(\w+)(\w+)?(?(2)\g2\g1|\g1)$' /usr/dict/words
47f9c88b
GS
2434 beriberi
2435 coco
2436 couscous
2437 deed
2438 ...
2439 toot
2440 toto
2441 tutu
2442
2443The lookbehind C<condition> allows, along with backreferences,
2444an earlier part of the match to influence a later part of the
2445match. For instance,
2446
2447 /[ATGC]+(?(?<=AA)G|C)$/;
2448
2449matches a DNA sequence such that it either ends in C<AAG>, or some
2450other base pair combination and C<C>. Note that the form is
a6b2f353
GS
2451C<< (?(?<=AA)G|C) >> and not C<< (?((?<=AA))G|C) >>; for the
2452lookahead, lookbehind or code assertions, the parentheses around the
2453conditional are not needed.
47f9c88b 2454
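As a small illustration of the integer form of the C<condition>, the
following sketch accepts a number that is either bare or enclosed in
parentheses, but not one with a single unmatched parenthesis. The variable
name and test strings are invented for this example.

```perl
use strict;
use warnings;

# Group 1 captures an optional '('; the conditional (?(1)\)) demands a
# closing ')' only when that group took part in the match.
my $maybe_parenthesized = qr/^(\()?\d+(?(1)\))$/;

for my $text ('123', '(123)', '(123', '123)') {
    printf "%-6s %s\n", $text,
        ($text =~ $maybe_parenthesized ? 'matches' : 'no match');
}
```

Here the conditional has no C<no-regexp>, so when group 1 did not match,
nothing further is required before the C<$> anchor.
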
7638d2dc
WL
2455
2456=head2 Defining named patterns
2457
Some regular expressions use identical subpatterns in several places.
Starting with Perl 5.10, it is possible to define named subpatterns in
a section of the pattern so that they can be called up by name
anywhere in the pattern. The syntax for this definition
group is C<< (?(DEFINE)(?<name>pattern)...) >>. An insertion
of a named pattern is written as C<(?&name)>.
2464
2465The example below illustrates this feature using the pattern for
2466floating point numbers that was presented earlier on. The three
2467subpatterns that are used more than once are the optional sign, the
2468digit sequence for an integer and the decimal fraction. The DEFINE
2469group at the end of the pattern contains their definition. Notice
2470that the decimal fraction pattern is the first place where we can
2471reuse the integer pattern.
2472
    /^ (?&osg)\ * ( (?&int)(?&dec)? | (?&dec) )
       (?: [eE](?&osg)(?&int) )?
     $
     (?(DEFINE)
       (?<osg>[-+]?)         # optional sign
       (?<int>\d++)          # integer
       (?<dec>\.(?&int))     # decimal fraction
     )/x
2481
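To try the pattern out, it can be compiled with C<qr//> and applied to a
few strings. This is only a sketch: the variable name C<$float> and the
sample inputs are assumptions, and Perl 5.10 or later is required.

```perl
use strict;
use warnings;

# The DEFINE pattern from above, compiled once with qr//.
my $float = qr/^ (?&osg)\ * ( (?&int)(?&dec)? | (?&dec) )
                 (?: [eE](?&osg)(?&int) )?
               $
               (?(DEFINE)
                 (?<osg>[-+]?)         # optional sign
                 (?<int>\d++)          # integer
                 (?<dec>\.(?&int))     # decimal fraction
               )/x;

for my $text ('42', '-3.14', '.5e-3', '1.', 'e10') {
    printf "%-6s %s\n", $text, ($text =~ $float ? 'is a number' : 'is not');
}
```

The first three strings are accepted. C<1.> is rejected because the
decimal fraction requires at least one digit after the point, and C<e10>
is rejected because an exponent alone is not a number.
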
2482
2483=head2 Recursive patterns
2484
2485This feature (introduced in Perl 5.10) significantly extends the
2486power of Perl's pattern matching. By referring to some other
2487capture group anywhere in the pattern with the construct
353c6505 2488C<(?group-ref)>, the I<pattern> within the referenced group is used
7638d2dc
WL
2489as an independent subpattern in place of the group reference itself.
2490Because the group reference may be contained I<within> the group it
2491refers to, it is now possible to apply pattern matching to tasks that
2492hitherto required a recursive parser.
2493
To illustrate this feature, we'll design a pattern that matches if
a string contains a palindrome. (This is a word or a sentence that,
while ignoring spaces, punctuation and case, reads the same backwards
as forwards.) We begin by observing that the empty string or a string
containing just one word character is a palindrome. Otherwise it must
have a word character up front and the same at its end, with another
palindrome in between.
2501
fd2b7f55 2502 /(?: (\w) (?...Here be a palindrome...) \g{-1} | \w? )/x
7638d2dc 2503
e57a4e52 2504Adding C<\W*> at either end to eliminate what is to be ignored, we already
7638d2dc
WL
2505have the full pattern:
2506
2507 my $pp = qr/^(\W* (?: (\w) (?1) \g{-1} | \w? ) \W*)$/ix;
2508 for $s ( "saippuakauppias", "A man, a plan, a canal: Panama!" ){
2509 print "'$s' is a palindrome\n" if $s =~ /$pp/;
2510 }
2511
2512In C<(?...)> both absolute and relative backreferences may be used.
2513The entire pattern can be reinserted with C<(?R)> or C<(?0)>.
c27a5cfe
KW
2514If you prefer to name your groups, you can use C<(?&name)> to
2515recurse into that group.
7638d2dc
WL
2516
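Recursion by name can be sketched with the classic task of matching
parentheses nested to any depth. The group name C<bal> and the test
strings below are assumptions chosen for this illustration.

```perl
use strict;
use warnings;

# Hypothetical example: the named group 'bal' calls itself via (?&bal)
# to match parentheses nested to any depth.
my $balanced = qr/^ (?<bal> \( (?: [^()]++ | (?&bal) )* \) ) $/x;

for my $text ('()', '(a(b)c)', '(a(b(c)))', '(a(b)c', 'a)b(') {
    printf "%-10s %s\n", $text,
        ($text =~ $balanced ? 'balanced' : 'not balanced');
}
```

Using a named group rather than C<(?R)> keeps the recursion confined to
the group itself, so the C<^> and C<$> anchors are applied only once.
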
2517
47f9c88b
GS
2518=head2 A bit of magic: executing Perl code in a regular expression
2519
2520Normally, regexps are a part of Perl expressions.
7638d2dc 2521I<Code evaluation> expressions turn that around by allowing
da75cd15 2522arbitrary Perl code to be a part of a regexp. A code evaluation
7638d2dc 2523expression is denoted C<(?{code})>, with I<code> a string of Perl
47f9c88b
GS
2524statements.
2525
353c6505 2526Be warned that this feature is considered experimental, and may be
7638d2dc
WL
2527changed without notice.
2528
47f9c88b
GS
2529Code expressions are zero-width assertions, and the value they return
2530depends on their environment. There are two possibilities: either the
2531code expression is used as a conditional in a conditional expression
2532C<(?(condition)...)>, or it is not. If the code expression is a
2533conditional, the code is evaluated and the result (i.e., the result of
2534the last statement) is used to determine truth or falsehood. If the
2535code expression is not used as a conditional, the assertion always
2536evaluates true and the result is put into the special variable
2537C<$^R>. The variable C<$^R> can then be used in code expressions later
2538in the regexp. Here are some silly examples:
2539
2540 $x = "abcdef";
2541 $x =~ /abc(?{print "Hi Mom!";})def/; # matches,
2542 # prints 'Hi Mom!'
2543 $x =~ /aaa(?{print "Hi Mom!";})def/; # doesn't match,
2544 # no 'Hi Mom!'
745e1e41
DC
2545
2546Pay careful attention to the next example:
2547
47f9c88b
GS
2548 $x =~ /abc(?{print "Hi Mom!";})ddd/; # doesn't match,
2549 # no 'Hi Mom!'
745e1e41
DC
2550 # but why not?
2551
2552At first glance, you'd think that it shouldn't print, because obviously
2553the C<ddd> isn't going to match the target string. But look at this
2554example:
2555
87167316
RGS
2556 $x =~ /abc(?{print "Hi Mom!";})[dD]dd/; # doesn't match,
2557 # but _does_ print
745e1e41
DC
2558
2559Hmm. What happened here? If you've been following along, you know that
ac036724 2560the above pattern should be effectively (almost) the same as the last one;
2561enclosing the C<d> in a character class isn't going to change what it
745e1e41
DC
2562matches. So why does the first not print while the second one does?
2563
7638d2dc 2564The answer lies in the optimizations the regex engine makes. In the first
745e1e41
DC
2565case, all the engine sees are plain old characters (aside from the
2566C<?{}> construct). It's smart enough to realize that the string 'ddd'
2567doesn't occur in our target string before actually running the pattern
2568through. But in the second case, we've tricked it into thinking that our
87167316 2569pattern is more complicated. It takes a look, sees our
745e1e41
DC
2570character class, and decides that it will have to actually run the
2571pattern to determine whether or not it matches, and in the process of
2572running it hits the print statement before it discovers that we don't
2573have a match.
2574
2575To take a closer look at how the engine does optimizations, see the
5a0de581 2576section L</"Pragmas and debugging"> below.
745e1e41
DC
2577
2578More fun with C<?{}>:
2579
47f9c88b
GS
2580 $x =~ /(?{print "Hi Mom!";})/; # matches,
2581 # prints 'Hi Mom!'
2582 $x =~ /(?{$c = 1;})(?{print "$c";})/; # matches,
2583 # prints '1'
2584 $x =~ /(?{$c = 1;})(?{print "$^R";})/; # matches,
2585 # prints '1'
2586
2587The bit of magic mentioned in the section title occurs when the regexp
2588backtracks in the process of searching for a match. If the regexp
2589backtracks over a code expression and if the variables used within are
2590localized using C<local>, the changes in the variables produced by the
2591code expression are undone! Thus, if we wanted to count how many times
2592a character got matched inside a group, we could use, e.g.,
2593
    $x = "aaaa";
    $count = 0;  # initialize 'a' count
    $c = "bob";  # test if $c gets clobbered
    $x =~ /(?{local $c = 0;})         # initialize count
           ( a                        # match 'a'
             (?{local $c = $c + 1;})  # increment count
           )*                         # do this any number of times,
           aa                         # but match 'aa' at the end
           (?{$count = $c;})          # copy local $c var into $count
          /x;
    print "'a' count is $count, \$c variable is '$c'\n";
2605
2606This prints
2607
2608 'a' count is 2, $c variable is 'bob'
2609
7638d2dc
WL
2610If we replace the S<C< (?{local $c = $c + 1;})>> with
2611S<C< (?{$c = $c + 1;})>>, the variable changes are I<not> undone
47f9c88b
GS
2612during backtracking, and we get
2613
2614 'a' count is 4, $c variable is 'bob'
2615
2616Note that only localized variable changes are undone. Other side
2617effects of code expression execution are permanent. Thus
2618
2619 $x = "aaaa";
2620 $x =~ /(a(?{print "Yow\n";}))*aa/;
2621
2622produces
2623
2624 Yow
2625 Yow
2626 Yow
2627 Yow
2628
2629The result C<$^R> is automatically localized, so that it will behave
2630properly in the presence of backtracking.
2631
7638d2dc
WL
2632This example uses a code expression in a conditional to match a
2633definite article, either 'the' in English or 'der|die|das' in German:
47f9c88b 2634
    $lang = 'DE';  # use German
    ...
    $text = "das";
    print "matched\n"
        if $text =~ /(?(?{
                          $lang eq 'EN'; # is the language English?
                         })
                       the |             # if so, then match 'the'
                       (der|die|das)     # else, match 'der|die|das'
                     )
                    /xi;
2646
2647Note that the syntax here is C<(?(?{...})yes-regexp|no-regexp)>, not
2648C<(?((?{...}))yes-regexp|no-regexp)>. In other words, in the case of a
2649code expression, we don't need the extra parentheses around the
2650conditional.
2651
e128ab2c
DM
2652If you try to use code expressions where the code text is contained within
2653an interpolated variable, rather than appearing literally in the pattern,
2654Perl may surprise you:
a6b2f353
GS
2655
2656 $bar = 5;
2657 $pat = '(?{ 1 })';
2658 /foo(?{ $bar })bar/; # compiles ok, $bar not interpolated
e128ab2c 2659 /foo(?{ 1 })$bar/; # compiles ok, $bar interpolated
a6b2f353
GS
2660 /foo${pat}bar/; # compile error!
2661
2662 $pat = qr/(?{ $foo = 1 })/; # precompile code regexp
2663 /foo${pat}bar/; # compiles ok
2664
e128ab2c
DM
2665If a regexp has a variable that interpolates a code expression, Perl
2666treats the regexp as an error. If the code expression is precompiled into
2667a variable, however, interpolating is ok. The question is, why is this an
2668error?
a6b2f353
GS
2669
2670The reason is that variable interpolation and code expressions
2671together pose a security risk. The combination is dangerous because
2672many programmers who write search engines often take user input and
2673plug it directly into a regexp:
47f9c88b
GS
2674
    $regexp = <>;       # read user-supplied regexp
    chomp $regexp;      # get rid of possible newline
    $text =~ /$regexp/; # search $text for the $regexp
2678
a6b2f353
GS
2679If the C<$regexp> variable contains a code expression, the user could
2680then execute arbitrary Perl code. For instance, some joker could
7638d2dc
WL
2681search for S<C<system('rm -rf *');>> to erase your files. In this
2682sense, the combination of interpolation and code expressions I<taints>
47f9c88b 2683your regexp. So by default, using both interpolation and code
a6b2f353
GS
2684expressions in the same regexp is not allowed. If you're not
2685concerned about malicious users, it is possible to bypass this
7638d2dc 2686security check by invoking S<C<use re 'eval'>>:
a6b2f353
GS
2687
2688 use re 'eval'; # throw caution out the door
2689 $bar = 5;
2690 $pat = '(?{ 1 })';
a6b2f353 2691 /foo${pat}bar/; # compiles ok
47f9c88b 2692
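When the user is meant to supply a plain string rather than a regexp, a
different technique avoids the problem altogether: quoting the input with
C<\Q...\E> so that it is matched literally. The sketch below shows that
alternative. It is not a replacement for S<C<use re 'eval'>>, and the
helper name C<contains_literal> is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: search $text for $needle as a literal string.
sub contains_literal {
    my ($text, $needle) = @_;
    return 1 unless length $needle;          # avoid the empty-pattern case
    return $text =~ /\Q$needle\E/ ? 1 : 0;   # \Q...\E escapes metacharacters
}

print contains_literal('price: $5.00', '$5.00'), "\n";
print contains_literal('abc', '.'), "\n";
print contains_literal('x(?{ die })y', '(?{ die })'), "\n";
```

The third call finds the text C<(?{ die })> as ordinary characters;
because every metacharacter is escaped, no code expression is compiled or
run.
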
7638d2dc 2693Another form of code expression is the I<pattern code expression>.
47f9c88b
GS
2694The pattern code expression is like a regular code expression, except
2695that the result of the code evaluation is treated as a regular
2696expression and matched immediately. A simple example is
2697
2698 $length = 5;
2699 $char = 'a';
2700 $x = 'aaaaabb';
2701 $x =~ /(??{$char x $length})/x; # matches, there are 5 of 'a'
2702
2703
2704This final example contains both ordinary and pattern code
7638d2dc 2705expressions. It detects whether a binary string C<1101010010001...> has a
47f9c88b
GS
2706Fibonacci spacing 0,1,1,2,3,5,... of the C<1>'s:
2707
    $x = "1101010010001000001";
    $z0 = ''; $z1 = '0';   # initial conditions
    print "It is a Fibonacci sequence\n"
        if $x =~ /^1         # match an initial '1'
            (?:
               ((??{ $z0 })) # match some '0'
               1             # and then a '1'
               (?{ $z0 = $z1; $z1 .= $^N; })
            )+   # repeat as needed
          $      # that is all there is
         /x;
    printf "Largest sequence matched was %d\n", length($z1)-length($z0);
47f9c88b 2720
7638d2dc
WL
2721Remember that C<$^N> is set to whatever was matched by the last
2722completed capture group. This prints
47f9c88b
GS
2723
2724 It is a Fibonacci sequence
2725 Largest sequence matched was 5
2726
2727Ha! Try that with your garden variety regexp package...
2728
7638d2dc 2729Note that the variables C<$z0> and C<$z1> are not substituted when the
47f9c88b 2730regexp is compiled, as happens for ordinary variables outside a code
e128ab2c
DM
2731expression. Rather, the whole code block is parsed as perl code at the
2732same time as perl is compiling the code containing the literal regexp
2733pattern.
47f9c88b
GS
2734
2735The regexp without the C<//x> modifier is
2736
7638d2dc
WL
2737 /^1(?:((??{ $z0 }))1(?{ $z0 = $z1; $z1 .= $^N; }))+$/
2738
2739which shows that spaces are still possible in the code parts. Nevertheless,
353c6505 2740when working with code and conditional expressions, the extended form of
7638d2dc
WL
2741regexps is almost necessary in creating and debugging regexps.
2742
2743
2744=head2 Backtracking control verbs
2745
2746Perl 5.10 introduced a number of control verbs intended to provide
2747detailed control over the backtracking process, by directly influencing
2748the regexp engine and by providing monitoring techniques. As all
2749the features in this group are experimental and subject to change or
2750removal in a future version of Perl, the interested reader is
2751referred to L<perlre/"Special Backtracking Control Verbs"> for a
2752detailed description.
2753
2754Below is just one example, illustrating the control verb C<(*FAIL)>,
2755which may be abbreviated as C<(*F)>. If this is inserted in a regexp
6b3ddc02
FC
2756it will cause it to fail, just as it would at some
2757mismatch between the pattern and the string. Processing
2758of the regexp continues as it would after any "normal"
353c6505
DL
2759failure, so that, for instance, the next position in the string or another
2760alternative will be tried. As failing to match doesn't preserve capture
c27a5cfe 2761groups or produce results, it may be necessary to use this in
7638d2dc
WL
2762combination with embedded code.
2763
    %count = ();
    "supercalifragilisticexpialidocious" =~
        /([aeiou])(?{ $count{$1}++; })(*FAIL)/i;
    printf "%3d '%s'\n", $count{$_}, $_ for (sort keys %count);
2768
353c6505
DL
2769The pattern begins with a class matching a subset of letters. Whenever
2770this matches, a statement like C<$count{'a'}++;> is executed, incrementing
2771the letter's counter. Then C<(*FAIL)> does what it says, and
6b3ddc02
FC
2772the regexp engine proceeds according to the book: as long as the end of
2773the string hasn't been reached, the position is advanced before looking
7638d2dc 2774for another vowel. Thus, match or no match makes no difference, and the
e1020413 2775regexp engine proceeds until the entire string has been inspected.
7638d2dc
WL
(It's remarkable that an alternative solution using something like

    $count2{lc($_)}++ for split('', "supercalifragilisticexpialidocious");
    printf "%3d '%s'\n", $count2{$_}, $_ for ( qw{ a e i o u } );

is considerably slower.)
47f9c88b 2782
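For comparison, the same counts can be obtained without control verbs or
embedded code, using an ordinary C<//g> match in a loop. This sketch is an
alternative, not the technique shown above, and the helper name
C<count_vowels> is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count vowels with an ordinary //g loop.
sub count_vowels {
    my ($text) = @_;
    my %count;
    $count{ lc $1 }++ while $text =~ /([aeiou])/gi;
    return %count;
}

my %count = count_vowels("supercalifragilisticexpialidocious");
printf "%3d '%s'\n", $count{$_}, $_ for sort keys %count;
```

In scalar context, each C<//g> match resumes where the previous one left
off, so the loop visits every vowel exactly once.
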
47f9c88b
GS
2783
2784=head2 Pragmas and debugging
2785
2786Speaking of debugging, there are several pragmas available to control
2787and debug regexps in Perl. We have already encountered one pragma in
7638d2dc 2788the previous section, S<C<use re 'eval';>>, that allows variable
a6b2f353
GS
2789interpolation and code expressions to coexist in a regexp. The other
2790pragmas are
47f9c88b
GS
2791
    use re 'taint';
    $tainted = <>;
    @parts = ($tainted =~ /(\w+)\s+(\w+)/); # @parts is now tainted
2795
2796The C<taint> pragma causes any substrings from a match with a tainted
2797variable to be tainted as well. This is not normally the case, as
2798regexps are often used to extract the safe bits from a tainted
2799variable. Use C<taint> when you are not extracting safe bits, but are
2800performing some other processing. Both C<taint> and C<eval> pragmas
a6b2f353 2801are lexically scoped, which means they are in effect only until
47f9c88b
GS
2802the end of the block enclosing the pragmas.
2803
511eb430
FC
2804 use re '/m'; # or any other flags
2805 $multiline_string =~ /^foo/; # /m is implied
2806
9fa86798
FC
2807The C<re '/flags'> pragma (introduced in Perl
28085.14) turns on the given regular expression flags
3fd67154
KW
2809until the end of the lexical scope. See
2810L<re/"'E<sol>flags' mode"> for more
511eb430
FC
2811detail.
2812
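The lexical scoping of the flags pragma can be demonstrated with a bare
block. This is a minimal sketch assuming Perl 5.14 or later; the variable
names are invented for the example.

```perl
use strict;
use warnings;

my $text = "bar\nfoo";
my ($inside, $outside);
{
    use re '/m';                              # applies to this block only
    $inside = $text =~ /^foo/ ? 1 : 0;        # ^ may match after a newline
}
$outside = $text =~ /^foo/ ? 1 : 0;           # ^ matches only at the start

print "inside=$inside outside=$outside\n";
```

Only the match inside the block succeeds, because the implied C</m> ends
with the enclosing scope.
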
47f9c88b
GS
2813 use re 'debug';
2814 /^(.*)$/s; # output debugging info
2815
2816 use re 'debugcolor';
2817 /^(.*)$/s; # output debugging info in living color
2818
2819The global C<debug> and C<debugcolor> pragmas allow one to get
2820detailed debugging info about regexp compilation and
2821execution. C<debugcolor> is the same as debug, except the debugging
2822information is displayed in color on terminals that can display
2823termcap color sequences. Here is example output:
2824
    % perl -e 'use re "debug"; "abc" =~ /a*b+c/;'
    Compiling REx 'a*b+c'
    size 9 first at 1
       1: STAR(4)
       2:   EXACT <a>(0)
       4: PLUS(7)
       5:   EXACT <b>(0)
       7: EXACT <c>(9)
       9: END(0)
    floating 'bc' at 0..2147483647 (checking floating) minlen 2
    Guessing start of match, REx 'a*b+c' against 'abc'...
    Found floating substr 'bc' at offset 1...
    Guessed: match at offset 0
    Matching REx 'a*b+c' against 'abc'
      Setting an EVAL scope, savestack=3
       0 <> <abc>           |  1:  STAR
                             EXACT <a> can match 1 times out of 32767...
      Setting an EVAL scope, savestack=3
       1 <a> <bc>           |  4:    PLUS
                             EXACT <b> can match 1 times out of 32767...
      Setting an EVAL scope, savestack=3
       2 <ab> <c>           |  7:      EXACT <c>
       3 <abc> <>           |  9:      END
    Match successful!
    Freeing REx: 'a*b+c'

If you have gotten this far into the tutorial, you can probably guess
what the different parts of the debugging output tell you. The first
part

    Compiling REx 'a*b+c'
    size 9 first at 1
       1: STAR(4)
       2:   EXACT <a>(0)
       4: PLUS(7)
       5:   EXACT <b>(0)
       7: EXACT <c>(9)
       9: END(0)

describes the compilation stage. C<STAR(4)> means that there is a
starred object, in this case C<'a'>, and if it matches, go to line 4,
i.e., C<PLUS(7)>. The middle lines describe some heuristics and
optimizations performed before a match:

    floating 'bc' at 0..2147483647 (checking floating) minlen 2
    Guessing start of match, REx 'a*b+c' against 'abc'...
    Found floating substr 'bc' at offset 1...
    Guessed: match at offset 0

Then the match is executed and the remaining lines describe the
process:

    Matching REx 'a*b+c' against 'abc'
      Setting an EVAL scope, savestack=3
       0 <> <abc>           |  1:  STAR
                             EXACT <a> can match 1 times out of 32767...
      Setting an EVAL scope, savestack=3
       1 <a> <bc>           |  4:    PLUS
                             EXACT <b> can match 1 times out of 32767...
      Setting an EVAL scope, savestack=3
       2 <ab> <c>           |  7:      EXACT <c>
       3 <abc> <>           |  9:      END
    Match successful!
    Freeing REx: 'a*b+c'

Each step is of the form S<C<< n <x> <y> >>>, with C<< <x> >> the
part of the string matched and C<< <y> >> the part not yet
matched. The S<C<< | 1: STAR >>> says that Perl is at line number 1
in the compilation list above. See
L<perldebguts/"Debugging Regular Expressions"> for much more detail.

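The full trace can get long for realistic patterns. The C<re> pragma
also accepts C<Debug> (with a capital C<D>) followed by a list of
options that select which parts of the output to produce; see L<re>
for the available options. The following is a small sketch, assuming
Perl 5.10 or later, that asks only for the trace of the matching phase
by passing the C<EXECUTE> option:

    use strict;
    use warnings;
    use re qw(Debug EXECUTE);        # assumption: Perl 5.10 or later

    my $matched = "abc" =~ /a*b+c/;  # execution trace goes to STDERR
    print "matched\n" if $matched;

The exact layout of the trace differs between Perl versions, so it is
best treated as a diagnostic aid rather than something to parse.
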
An alternative method of debugging regexps is to embed C<print>
statements within the regexp. This provides a blow-by-blow account of
the backtracking in an alternation:

    "that this" =~ m@(?{print "Start at position ", pos, "\n";})
                     t(?{print "t1\n";})
                     h(?{print "h1\n";})
                     i(?{print "i1\n";})
                     s(?{print "s1\n";})
                         |
                     t(?{print "t2\n";})
                     h(?{print "h2\n";})
                     a(?{print "a2\n";})
                     t(?{print "t2\n";})
                     (?{print "Done at position ", pos, "\n";})
                    @x;

prints

    Start at position 0
    t1
    h1
    t2
    h2
    a2
    t2
    Done at position 4

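The embedded code does not have to print. As a variation on the
example above, the following sketch records the labels in an array so
that the trace can be inspected after the match. The helper C<note>
and the array C<@trace> are hypothetical names introduced only for
this illustration, and the last label is changed to C<t3> so that the
two C<t>s of the second branch can be told apart.

    use strict;
    use warnings;

    my @trace;
    sub note { push @trace, @_ }     # hypothetical helper: record a label

    "that this" =~ m{
          t(?{ note("t1") })
          h(?{ note("h1") })
          i(?{ note("i1") })
          s(?{ note("s1") })
        |
          t(?{ note("t2") })
          h(?{ note("h2") })
          a(?{ note("a2") })
          t(?{ note("t3") })
    }x;

    print join(" ", @trace), "\n";

The recorded labels should read C<t1 h1 t2 h2 a2 t3>: the first branch
is abandoned once C<i> fails to match C<a>, and the second branch then
runs to completion.
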
=head1 BUGS

Code expressions, conditional expressions, and independent expressions
are I<experimental>. Don't use them in production code. Yet.

=head1 SEE ALSO

This is just a tutorial. For the full story on Perl regular
expressions, see the L<perlre> regular expressions reference page.

For more information on the matching C<m//> and substitution C<s///>
operators, see L<perlop/"Regexp Quote-Like Operators">. For
information on the C<split> operation, see L<perlfunc/split>.

For an excellent all-around resource on the care and feeding of
regular expressions, see the book I<Mastering Regular Expressions> by
Jeffrey Friedl (published by O'Reilly, ISBN 1-56592-257-3).

=head1 AUTHOR AND COPYRIGHT

Copyright (c) 2000 Mark Kvale
All rights reserved.

This document may be distributed under the same terms as Perl itself.

=head2 Acknowledgments

The inspiration for the stop codon DNA example came from the ZIP
code example in chapter 7 of I<Mastering Regular Expressions>.

The author would like to thank Jeff Pinyan, Andrew Johnson, Peter
Haworth, Ronald J Kimball, and Joe Smith for all their helpful
comments.

=cut
