This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5
perlretut: "-" is sometimes a metacharacter
[perl5.git] / pod / perlretut.pod
=head1 NAME

perlretut - Perl regular expressions tutorial

=head1 DESCRIPTION

This page provides a basic tutorial on understanding, creating and
using regular expressions in Perl. It serves as a complement to the
reference page on regular expressions L<perlre>. Regular expressions
are an integral part of the C<m//>, C<s///>, C<qr//> and C<split>
operators and so this tutorial also overlaps with
L<perlop/"Regexp Quote-Like Operators"> and L<perlfunc/split>.

Perl is widely renowned for excellence in text processing, and regular
expressions are one of the big factors behind this fame. Perl regular
expressions display an efficiency and flexibility unknown in most
other computer languages. Mastering even the basics of regular
expressions will allow you to manipulate text with surprising ease.

What is a regular expression? At its most basic, a regular expression
is a template that is used to determine if a string has certain
characteristics. The string is most often some text, such as a line,
sentence, web page, or even a whole book, but less commonly it could be
some binary data as well.
Suppose we want to determine if the text in the variable C<$var> contains
the sequence of characters S<C<m u s h r o o m>>
(blanks added for legibility). We can write in Perl

    $var =~ m/mushroom/

The value of this expression will be TRUE if C<$var> contains that
sequence of characters, and FALSE otherwise. The portion enclosed in
C<'E<sol>'> characters denotes the characteristic we are looking for.
We use the term I<pattern> for it. The process of looking to see if the
pattern occurs in the string is called I<matching>, and the C<"=~">
operator along with the C<m//> tell Perl to try to match the pattern
against the string. Note that the pattern is also a string, but a very
special kind of one, as we will see. Patterns are in common use these
days; examples are the patterns typed into a search engine to find web
pages and the patterns used to list files in a directory, I<e.g.>,
"C<ls *.txt>" or "C<dir *.*>". In Perl, the patterns described by regular
expressions are used not only to search strings, but also to extract
desired parts of strings, and to do search and replace operations.

Regular expressions have the undeserved reputation of being abstract
and difficult to understand. This stems simply from the fact that the
notation used to express them tends to be terse and dense, and not
from any inherent complexity. We recommend using the C</x> regular
expression modifier (described below) along with plenty of white space
to make them less dense, and easier to read. Regular expressions are
constructed using simple concepts like conditionals and loops and are
no more difficult to understand than the corresponding C<if>
conditionals and C<while> loops in the Perl language itself.

This tutorial flattens the learning curve by discussing regular
expression concepts, along with their notation, one at a time and with
many examples. The first part of the tutorial will progress from the
simplest word searches to the basic regular expression concepts. If
you master the first part, you will have all the tools needed to solve
about 98% of your needs. The second part of the tutorial is for those
comfortable with the basics and hungry for more power tools. It
discusses the more advanced regular expression operators and
introduces the latest cutting-edge innovations.

A note: to save time, "regular expression" is often abbreviated as
regexp or regex. Regexp is a more natural abbreviation than regex, but
is harder to pronounce. The Perl pod documentation is evenly split on
regexp vs regex; in Perl, there is more than one way to abbreviate it.
We'll use regexp in this tutorial.

New in v5.22, L<C<use re 'strict'>|re/'strict' mode> applies stricter
rules than otherwise when compiling regular expression patterns. It can
find things that, while legal, may not be what you intended.

=head1 Part 1: The basics

=head2 Simple word matching

The simplest regexp is simply a word, or more generally, a string of
characters. A regexp consisting of just a word matches any string that
contains that word:

    "Hello World" =~ /World/;  # matches

What is this Perl statement all about? C<"Hello World"> is a simple
double-quoted string. C<World> is the regular expression and the
C<//> enclosing C</World/> tells Perl to search a string for a match.
The operator C<=~> associates the string with the regexp match and
produces a true value if the regexp matched, or false if the regexp
did not match. In our case, C<World> matches the second word in
C<"Hello World">, so the expression is true. Expressions like this
are useful in conditionals:

    if ("Hello World" =~ /World/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }

There are useful variations on this theme. The sense of the match can
be reversed by using the C<!~> operator:

    if ("Hello World" !~ /World/) {
        print "It doesn't match\n";
    }
    else {
        print "It matches\n";
    }

The literal string in the regexp can be replaced by a variable:

    my $greeting = "World";
    if ("Hello World" =~ /$greeting/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }

If you're matching against the special default variable C<$_>, the
C<$_ =~> part can be omitted:

    $_ = "Hello World";
    if (/World/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }

And finally, the C<//> default delimiters for a match can be changed
to arbitrary delimiters by putting an C<'m'> out front:

    "Hello World" =~ m!World!;   # matches, delimited by '!'
    "Hello World" =~ m{World};   # matches, note the matching '{}'
    "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
                                 # '/' becomes an ordinary char

C</World/>, C<m!World!>, and C<m{World}> all represent the
same thing. When, I<e.g.>, the quote (C<'"'>) is used as a delimiter, the forward
slash C<'/'> becomes an ordinary character and can be used in this regexp
without trouble.

Let's consider how different regexps would match C<"Hello World">:

    "Hello World" =~ /world/;  # doesn't match
    "Hello World" =~ /o W/;    # matches
    "Hello World" =~ /oW/;     # doesn't match
    "Hello World" =~ /World /; # doesn't match

The first regexp C<world> doesn't match because regexps are
case-sensitive. The second regexp matches because the substring
S<C<'o W'>> occurs in the string S<C<"Hello World">>. The space
character C<' '> is treated like any other character in a regexp and is
needed to match in this case. The lack of a space character is the
reason the third regexp C<'oW'> doesn't match. The fourth regexp
"C<World >" doesn't match because there is a space at the end of the
regexp, but not at the end of the string. The lesson here is that
regexps must match a part of the string I<exactly> in order for the
statement to be true.

If a regexp matches in more than one place in the string, Perl will
always match at the earliest possible point in the string:

    "Hello World" =~ /o/;       # matches 'o' in 'Hello'
    "That hat is red" =~ /hat/; # matches 'hat' in 'That'

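One way to see the "earliest possible point" rule in action is to ask Perl where the match started. The sketch below relies on the built-in array C<@-> (documented in L<perlvar>), whose first element holds the start offset of the last successful match; the variable names C<$string> and C<$offset> are illustrative only.

```perl
use strict;
use warnings;

# $-[0] holds the offset where the last successful match started.
my $string = "That hat is red";
my $offset = $string =~ /hat/ ? $-[0] : -1;

# The 'hat' inside 'That' starts at offset 1; the word 'hat' itself
# starts at offset 5 and is never reached.
print "first 'hat' starts at offset $offset\n";
```

The offset is counted from zero, so a value of 1 points at the C<h> of C<That>.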
With respect to character matching, there are a few more points you
need to know about. First of all, not all characters can be used "as
is" in a match. Some characters, called I<metacharacters>, are reserved
for use in regexp notation. The metacharacters are

    {}[]()^$.|*+?-\

The significance of each of these will be explained
in the rest of the tutorial, but for now, it is important only to know
that a metacharacter can be matched as-is by putting a backslash before
it:

    "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
    "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
    "The interval is [0,1)." =~ /[0,1)./     # is a syntax error!
    "The interval is [0,1)." =~ /\[0,1\)\./  # matches
    "#!/usr/bin/perl" =~ /#!\/usr\/bin\/perl/;  # matches

In the last regexp, the forward slash C<'/'> is also backslashed,
because it is used to delimit the regexp. This can lead to LTS
(leaning toothpick syndrome), however, and it is often more readable
to change delimiters.

    "#!/usr/bin/perl" =~ m!#\!/usr/bin/perl!;  # easier to read

The backslash character C<'\'> is a metacharacter itself and needs to
be backslashed:

    'C:\WIN32' =~ /C:\\WIN/;   # matches

In situations where it doesn't make sense for a particular metacharacter
to mean what it normally does, it automatically loses its
metacharacter-ness and becomes an ordinary character that is to be
matched literally. For example, the C<'}'> is a metacharacter only when
it is the mate of a C<'{'> metacharacter. Otherwise it is treated as a
literal RIGHT CURLY BRACKET. This may lead to unexpected results.
L<C<use re 'strict'>|re/'strict' mode> can catch some of these.

In addition to the metacharacters, there are some ASCII characters
which don't have printable character equivalents and are instead
represented by I<escape sequences>. Common examples are C<\t> for a
tab, C<\n> for a newline, C<\r> for a carriage return and C<\a> for a
bell (or alert). If your string is better thought of as a sequence of arbitrary
bytes, the octal escape sequence, I<e.g.>, C<\033>, or hexadecimal escape
sequence, I<e.g.>, C<\x1B> may be a more natural representation for your
bytes. Here are some examples of escapes:

    "1000\t2000" =~ m(0\t2)   # matches
    "1000\n2000" =~ /0\n20/   # matches
    "1000\t2000" =~ /\000\t2/ # doesn't match, "0" ne "\000"
    "cat" =~ /\o{143}\x61\x74/ # matches in ASCII, but a weird way
                               # to spell cat

If you've been around Perl a while, all this talk of escape sequences
may seem familiar. Similar escape sequences are used in double-quoted
strings and in fact the regexps in Perl are mostly treated as
double-quoted strings. This means that variables can be used in
regexps as well. Just like double-quoted strings, the values of the
variables in the regexp will be substituted in before the regexp is
evaluated for matching purposes. So we have:

    $foo = 'house';
    'housecat' =~ /$foo/;      # matches
    'cathouse' =~ /cat$foo/;   # matches
    'housecat' =~ /${foo}cat/; # matches

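Because the variable's value is substituted before matching, any metacharacters it contains keep their special meaning. The following sketch, which goes slightly beyond what this section has covered, shows one way to match an interpolated value literally using the C<\Q...\E> escapes described in L<perlre>; the names C<$needle>, C<$text>, C<$plain> and C<$quoted> are illustrative.

```perl
use strict;
use warnings;

my $needle = '2+2';       # contains the metacharacter '+'
my $text   = '2+2=4';

# Interpolated as-is, '+' is a quantifier: the pattern means "one or
# more 2s followed by a 2", which does not occur in '2+2=4'.
my $plain  = $text =~ /$needle/     ? 1 : 0;

# \Q...\E backslashes the metacharacters in the interpolated value,
# so the pattern behaves like /2\+2/.
my $quoted = $text =~ /\Q$needle\E/ ? 1 : 0;

print "plain: $plain, quoted: $quoted\n";
```

This mirrors the earlier C</2+2/> versus C</2\+2/> example, with the backslashing done at interpolation time rather than by hand.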
So far, so good. With the knowledge above you can already perform
searches with just about any literal string regexp you can dream up.
Here is a I<very simple> emulation of the Unix grep program:

    % cat > simple_grep
    #!/usr/bin/perl
    $regexp = shift;
    while (<>) {
        print if /$regexp/;
    }
    ^D

    % chmod +x simple_grep

    % simple_grep abba /usr/dict/words
    Babbage
    cabbage
    cabbages
    sabbath
    Sabbathize
    Sabbathizes
    sabbatical
    scabbard
    scabbards

This program is easy to understand. C<#!/usr/bin/perl> is the standard
way to invoke a perl program from the shell.
S<C<$regexp = shift;>> saves the first command line argument as the
regexp to be used, leaving the rest of the command line arguments to
be treated as files. S<C<< while (<>) >>> loops over all the lines in
all the files. For each line, S<C<print if /$regexp/;>> prints the
line if the regexp matches the line. In this line, both C<print> and
C</$regexp/> use the default variable C<$_> implicitly.

With all of the regexps above, if the regexp matched anywhere in the
string, it was considered a match. Sometimes, however, we'd like to
specify I<where> in the string the regexp should try to match. To do
this, we would use the I<anchor> metacharacters C<'^'> and C<'$'>. The
anchor C<'^'> means match at the beginning of the string and the anchor
C<'$'> means match at the end of the string, or before a newline at the
end of the string. Here is how they are used:

    "housekeeper" =~ /keeper/;    # matches
    "housekeeper" =~ /^keeper/;   # doesn't match
    "housekeeper" =~ /keeper$/;   # matches
    "housekeeper\n" =~ /keeper$/; # matches

The second regexp doesn't match because C<'^'> constrains C<keeper> to
match only at the beginning of the string, but C<"housekeeper"> has
keeper starting in the middle. The third regexp does match, since the
C<'$'> constrains C<keeper> to match only at the end of the string.

When both C<'^'> and C<'$'> are used at the same time, the regexp has to
match both the beginning and the end of the string, I<i.e.>, the regexp
matches the whole string. Consider

    "keeper" =~ /^keep$/;      # doesn't match
    "keeper" =~ /^keeper$/;    # matches
    "" =~ /^$/;                # ^$ matches an empty string

The first regexp doesn't match because the string has more to it than
C<keep>. Since the second regexp is exactly the string, it
matches. Using both C<'^'> and C<'$'> in a regexp forces the complete
string to match, so it gives you complete control over which strings
match and which don't. Suppose you are looking for a fellow named
bert, off in a string by himself:

    "dogbert" =~ /bert/;   # matches, but not what you want

    "dilbert" =~ /^bert/;  # doesn't match, but ..
    "bertram" =~ /^bert/;  # matches, so still not good enough

    "bertram" =~ /^bert$/; # doesn't match, good
    "dilbert" =~ /^bert$/; # doesn't match, good
    "bert" =~ /^bert$/;    # matches, perfect

Of course, in the case of a literal string, one could just as easily
use the string comparison S<C<$string eq 'bert'>> and it would be
more efficient. The C<^...$> regexp really becomes useful when we
add in the more powerful regexp tools below.

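One subtlety is worth spelling out: because C<'$'> may also match before a newline at the end of the string, C</^bert$/> and C<eq 'bert'> are not quite interchangeable. The sketch below assumes a line that still carries its trailing newline; the variable names are illustrative.

```perl
use strict;
use warnings;

my $line = "bert\n";    # e.g. a line read from a file, newline intact

# '$' may match just before a final "\n", so the regexp accepts the line.
my $regexp_says = $line =~ /^bert$/ ? 1 : 0;

# eq compares every character, including the newline, so it does not.
my $eq_says     = $line eq 'bert'   ? 1 : 0;

print "regexp: $regexp_says, eq: $eq_says\n";
```

If the two must agree, removing the newline first (for instance with C<chomp>) is one option.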
=head2 Using character classes

Although one can already do quite a lot with the literal string
regexps above, we've only scratched the surface of regular expression
technology. In this and subsequent sections we will introduce regexp
concepts (and associated metacharacter notations) that will allow a
regexp to represent not just a single character sequence, but a I<whole
class> of them.

One such concept is that of a I<character class>. A character class
allows a set of possible characters, rather than just a single
character, to match at a particular point in a regexp. You can define
your own custom character classes. These
are denoted by brackets C<[...]>, with the set of characters
to be possibly matched inside. Here are some examples:

    /cat/;       # matches 'cat'
    /[bcr]at/;   # matches 'bat', 'cat', or 'rat'
    /item[0123456789]/;  # matches 'item0' or ... or 'item9'
    "abc" =~ /[cab]/;    # matches 'a'

In the last statement, even though C<'c'> is the first character in
the class, C<'a'> matches because the first character position in the
string is the earliest point at which the regexp can match.

    /[yY][eE][sS]/;      # match 'yes' in a case-insensitive way
                         # 'yes', 'Yes', 'YES', etc.

This regexp displays a common task: perform a case-insensitive
match. Perl provides a way of avoiding all those brackets by simply
appending an C<'i'> to the end of the match. Then C</[yY][eE][sS]/;>
can be rewritten as C</yes/i;>. The C<'i'> stands for
case-insensitive and is an example of a I<modifier> of the matching
operation. We will meet other modifiers later in the tutorial.

We saw in the section above that there were ordinary characters, which
represented themselves, and special characters, which needed a
backslash C<'\'> to represent themselves. The same is true in a
character class, but the sets of ordinary and special characters
inside a character class are different than those outside a character
class. The special characters for a character class are C<-]\^$> (and
the pattern delimiter, whatever it is).
C<']'> is special because it denotes the end of a character class. C<'$'> is
special because it denotes a scalar variable. C<'\'> is special because
it is used in escape sequences, just like above. Here is how the
special characters C<]$\> are handled:

    /[\]c]def/; # matches ']def' or 'cdef'
    $x = 'bcr';
    /[$x]at/;   # matches 'bat', 'cat', or 'rat'
    /[\$x]at/;  # matches '$at' or 'xat'
    /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'

The last two are a little tricky. In C<[\$x]>, the backslash protects
the dollar sign, so the character class has two members C<'$'> and C<'x'>.
In C<[\\$x]>, the backslash is protected, so C<$x> is treated as a
variable and substituted in double quote fashion.

The special character C<'-'> acts as a range operator within character
classes, so that a contiguous set of characters can be written as a
range. With ranges, the unwieldy C<[0123456789]> and C<[abc...xyz]>
become the svelte C<[0-9]> and C<[a-z]>. Some examples are

    /item[0-9]/;    # matches 'item0' or ... or 'item9'
    /[0-9bx-z]aa/;  # matches '0aa', ..., '9aa',
                    # 'baa', 'xaa', 'yaa', or 'zaa'
    /[0-9a-fA-F]/;  # matches a hexadecimal digit
    /[0-9a-zA-Z_]/; # matches a "word" character,
                    # like those in a Perl variable name

If C<'-'> is the first or last character in a character class, it is
treated as an ordinary character; C<[-ab]>, C<[ab-]> and C<[a\-b]> are
all equivalent.

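The equivalence of those three spellings can be spot-checked with a short script. This is only a sketch over a handful of sample characters, not a proof; it uses C<qr//> to hold the three classes, and the names C<@classes> and C<$all_agree> are illustrative.

```perl
use strict;
use warnings;

# The three spellings from the text, each with a literal '-' in the class.
my @classes = (qr/^[-ab]$/, qr/^[ab-]$/, qr/^[a\-b]$/);

# Every class should accept '-', 'a' and 'b' and reject 'c'.
my $all_agree = 1;
for my $class (@classes) {
    for my $char ('-', 'a', 'b') {
        $all_agree = 0 unless $char =~ $class;
    }
    $all_agree = 0 if 'c' =~ $class;
}

print $all_agree ? "equivalent on these inputs\n" : "they differ\n";
```

Anchoring each class with C<^> and C<$> makes sure the single test character is matched by the class itself.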
The special character C<'^'> in the first position of a character class
denotes a I<negated character class>, which matches any character but
those in the brackets. Both C<[...]> and C<[^...]> must match a
character, or the match fails. Then

    /[^a]at/;  # doesn't match 'aat' or 'at', but matches
               # all other 'bat', 'cat', '0at', '%at', etc.
    /[^0-9]/;  # matches a non-numeric character
    /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary

Now, even C<[0-9]> can be a bother to write multiple times, so in the
interest of saving keystrokes and making regexps more readable, Perl
has several abbreviations for common character classes, as shown below.
Since the introduction of Unicode, unless the C</a> modifier is in
effect, these character classes match more than just a few characters in
the ASCII range.

=over 4

=item *

C<\d> matches a digit, not just C<[0-9]> but also digits from non-roman scripts

=item *

C<\s> matches a whitespace character, the set C<[\ \t\r\n\f]> and others

=item *

C<\w> matches a word character (alphanumeric or C<'_'>), not just C<[0-9a-zA-Z_]>
but also digits and characters from non-roman scripts

=item *

C<\D> is a negated C<\d>; it represents any other character than a digit, or C<[^\d]>

=item *

C<\S> is a negated C<\s>; it represents any non-whitespace character C<[^\s]>

=item *

C<\W> is a negated C<\w>; it represents any non-word character C<[^\w]>

=item *

The period C<'.'> matches any character but C<"\n"> (unless the modifier C</s> is
in effect, as explained below).

=item *

C<\N>, like the period, matches any character but C<"\n">, but it does so
regardless of whether the modifier C</s> is in effect.

=back

The C</a> modifier, available starting in Perl 5.14, is used to
restrict the matches of C<\d>, C<\s>, and C<\w> to just those in the ASCII range.
It is useful to keep your program from being needlessly exposed to full
Unicode (and its accompanying security considerations) when all you want
is to process English-like text. (The "a" may be doubled, C</aa>, to
provide even more restrictions, preventing case-insensitive matching of
ASCII with non-ASCII characters; otherwise a Unicode "Kelvin Sign"
would caselessly match a "k" or "K".)

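The effect of C</a> and C</aa> can be observed with two non-ASCII characters. The sketch below assumes Perl 5.14 or later; the choice of ARABIC-INDIC DIGIT THREE as the sample digit and the hash name C<%result> are illustrative.

```perl
use v5.14;
use warnings;

my $arabic_three = "\x{0663}";  # ARABIC-INDIC DIGIT THREE
my $kelvin       = "\x{212A}";  # KELVIN SIGN

my %result = (
    digit_default => ($arabic_three =~ /\d/   ? 1 : 0),  # Unicode digit
    digit_ascii   => ($arabic_three =~ /\d/a  ? 1 : 0),  # ASCII digits only
    kelvin_i      => ($kelvin       =~ /k/i   ? 1 : 0),  # caseless match
    kelvin_aai    => ($kelvin       =~ /k/aai ? 1 : 0),  # no ASCII/non-ASCII mixing
);

printf "%-13s %d\n", $_, $result{$_} for sort keys %result;
```

Only the two rows without the C</a> family of modifiers should report a match.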
The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside
of bracketed character classes. Here are some in use:

    /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
    /[\d\s]/;         # matches any digit or whitespace character
    /\w\W\w/;         # matches a word char, followed by a
                      # non-word char, followed by a word char
    /..rt/;           # matches any two chars, followed by 'rt'
    /end\./;          # matches 'end.'
    /end[.]/;         # same thing, matches 'end.'

Because a period is a metacharacter, it needs to be escaped to match
as an ordinary period. Because, for example, C<\d> and C<\w> are sets
of characters, it is incorrect to think of C<[^\d\w]> as C<[\D\W]>; in
fact C<[^\d\w]> is the same as C<[^\w]>, which is the same as
C<[\W]>. Think De Morgan's laws.

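A single test character is enough to tell the two classes apart. In this sketch the letter C<'a'> is chosen because it is a word character but not a digit; the variable names are illustrative.

```perl
use strict;
use warnings;

my $letter = 'a';   # a word character that is not a digit

# [^\d\w] excludes digits and word characters alike, so 'a' is rejected.
my $negated_union    = $letter =~ /[^\d\w]/ ? 1 : 0;

# [\D\W] accepts anything that is a non-digit OR a non-word character;
# 'a' is a non-digit, so it is accepted.
my $union_of_negated = $letter =~ /[\D\W]/  ? 1 : 0;

printf "%s: %d, %s: %d\n",
    '[^\d\w]', $negated_union, '[\D\W]', $union_of_negated;
```

In set terms, negating a union gives the intersection of the negations, not their union.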
In actuality, the period and C<\d\s\w\D\S\W> abbreviations are
themselves types of character classes, so the ones surrounded by
brackets are just one type of character class. When we need to make a
distinction, we refer to them as "bracketed character classes."

An anchor useful in basic regexps is the I<word anchor>
C<\b>. This matches a boundary between a word character and a non-word
character C<\w\W> or C<\W\w>:

    $x = "Housecat catenates house and cat";
    $x =~ /cat/;    # matches cat in 'Housecat'
    $x =~ /\bcat/;  # matches cat in 'catenates'
    $x =~ /cat\b/;  # matches cat in 'Housecat'
    $x =~ /\bcat\b/;  # matches 'cat' at end of string

Note in the last example, the end of the string is considered a word
boundary.

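To confirm which occurrence of C<cat> each pattern picks, the start offset of every match can be recorded. This sketch again leans on C<$-[0]> from L<perlvar>, the start offset of the last successful match; the hash C<%start> is illustrative.

```perl
use strict;
use warnings;

my $x = "Housecat catenates house and cat";

# Record the start offset ($-[0]) of the first match for each pattern.
my @patterns = ('cat', '\bcat', 'cat\b', '\bcat\b');
my %start;
for my $pattern (@patterns) {
    $start{$pattern} = $x =~ /$pattern/ ? $-[0] : -1;
}

printf "%-8s starts at offset %d\n", $_, $start{$_} for @patterns;
```

Offsets 5, 9 and 29 correspond to the C<cat> in C<Housecat>, the start of C<catenates>, and the final word C<cat>.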
For natural language processing (so that, for example, apostrophes are
included in words), use instead C<\b{wb}>

    "don't" =~ / .+? \b{wb} /x;  # matches the whole string

You might wonder why C<'.'> matches everything but C<"\n"> - why not
every character? The reason is that often one is matching against
lines and would like to ignore the newline characters. For instance,
while the string C<"\n"> represents one line, we would like to think
of it as empty. Then

    "" =~ /^$/;    # matches
    "\n" =~ /^$/;  # matches, $ anchors before "\n"

    "" =~ /./;      # doesn't match; it needs a char
    "" =~ /^.$/;    # doesn't match; it needs a char
    "\n" =~ /^.$/;  # doesn't match; it needs a char other than "\n"
    "a" =~ /^.$/;   # matches
    "a\n" =~ /^.$/; # matches, $ anchors before "\n"

This behavior is convenient, because we usually want to ignore
newlines when we count and match characters in a line. Sometimes,
however, we want to keep track of newlines. We might even want C<'^'>
and C<'$'> to anchor at the beginning and end of lines within the
string, rather than just the beginning and end of the string. Perl
allows us to choose between ignoring and paying attention to newlines
by using the C</s> and C</m> modifiers. C</s> and C</m> stand for
single line and multi-line and they determine whether a string is to
be treated as one continuous string, or as a set of lines. The two
modifiers affect two aspects of how the regexp is interpreted: 1) how
the C<'.'> character class is defined, and 2) where the anchors C<'^'>
and C<'$'> are able to match. Here are the four possible combinations:

=over 4

=item *

no modifiers: Default behavior. C<'.'> matches any character
except C<"\n">. C<'^'> matches only at the beginning of the string and
C<'$'> matches only at the end or before a newline at the end.

=item *

s modifier (C</s>): Treat string as a single long line. C<'.'> matches
any character, even C<"\n">. C<'^'> matches only at the beginning of
the string and C<'$'> matches only at the end or before a newline at the
end.

=item *

m modifier (C</m>): Treat string as a set of multiple lines. C<'.'>
matches any character except C<"\n">. C<'^'> and C<'$'> are able to match
at the start or end of I<any> line within the string.

=item *

both s and m modifiers (C</sm>): Treat string as a single long line, but
detect multiple lines. C<'.'> matches any character, even
C<"\n">. C<'^'> and C<'$'>, however, are able to match at the start or end
of I<any> line within the string.

=back

Here are examples of C</s> and C</m> in action:

    $x = "There once was a girl\nWho programmed in Perl\n";

    $x =~ /^Who/;   # doesn't match, "Who" not at start of string
    $x =~ /^Who/s;  # doesn't match, "Who" not at start of string
    $x =~ /^Who/m;  # matches, "Who" at start of second line
    $x =~ /^Who/sm; # matches, "Who" at start of second line

    $x =~ /girl.Who/;   # doesn't match, "." doesn't match "\n"
    $x =~ /girl.Who/s;  # matches, "." matches "\n"
    $x =~ /girl.Who/m;  # doesn't match, "." doesn't match "\n"
    $x =~ /girl.Who/sm; # matches, "." matches "\n"

Most of the time, the default behavior is what is wanted, but C</s> and
C</m> are occasionally very useful. If C</m> is being used, the start
of the string can still be matched with C<\A> and the end of the string
can still be matched with the anchors C<\Z> (matches at the end, or
before a newline at the end, like C<'$'>), and C<\z> (matches only at
the end):

    $x =~ /^Who/m;   # matches, "Who" at start of second line
    $x =~ /\AWho/m;  # doesn't match, "Who" is not at start of string

    $x =~ /girl$/m;  # matches, "girl" at end of first line
    $x =~ /girl\Z/m; # doesn't match, "girl" is not at end of string

    $x =~ /Perl\Z/m; # matches, "Perl" is at newline before end
    $x =~ /Perl\z/m; # doesn't match, "Perl" is not at end of string

We now know how to create choices among classes of characters in a
regexp. What about choices among words or character strings? Such
choices are described in the next section.

=head2 Matching this or that

Sometimes we would like our regexp to be able to match different
possible words or character strings. This is accomplished by using
the I<alternation> metacharacter C<'|'>. To match C<dog> or C<cat>, we
form the regexp C<dog|cat>. As before, Perl will try to match the
regexp at the earliest possible point in the string. At each
character position, Perl will first try to match the first
alternative, C<dog>. If C<dog> doesn't match, Perl will then try the
next alternative, C<cat>. If C<cat> doesn't match either, then the
match fails and Perl moves to the next position in the string. Some
examples:

    "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
    "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"

Even though C<dog> is the first alternative in the second regexp,
C<cat> is able to match earlier in the string.

    "cats" =~ /c|ca|cat|cats/; # matches "c"
    "cats" =~ /cats|cat|ca|c/; # matches "cats"

Here, all the alternatives match at the first string position, so the
first alternative is the one that matches. If some of the
alternatives are truncations of the others, put the longest ones first
to give them a chance to match.

    "cab" =~ /a|b|c/ # matches "c"
                     # /a|b|c/ == /[abc]/

The last example points out that character classes are like
alternations of characters. At a given character position, the first
alternative that allows the regexp match to succeed will be the one
that matches.

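The comments in the examples above can be checked by inspecting what was actually matched. This sketch uses the built-in variable C<$&> (see L<perlvar>), which holds the text of the last successful match; the hash C<%matched> is illustrative.

```perl
use strict;
use warnings;

# $& holds the text matched by the last successful match.
my %matched;
$matched{'c|ca|cat|cats'} = "cats" =~ /c|ca|cat|cats/ ? $& : undef;
$matched{'cats|cat|ca|c'} = "cats" =~ /cats|cat|ca|c/ ? $& : undef;
$matched{'dog|cat|bird'}  = "cats and dogs" =~ /dog|cat|bird/ ? $& : undef;

print "$_ matched '$matched{$_}'\n" for sort keys %matched;
```

The first two results differ only because of the order of the alternatives; the third shows that an earlier string position wins over an earlier alternative.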
=head2 Grouping things and hierarchical matching

Alternation allows a regexp to choose among alternatives, but by
itself it is unsatisfying. The reason is that each alternative is a whole
regexp, but sometimes we want alternatives for just part of a
regexp. For instance, suppose we want to search for housecats or
housekeepers. The regexp C<housecat|housekeeper> fits the bill, but is
inefficient because we had to type C<house> twice. It would be nice to
have parts of the regexp be constant, like C<house>, and some
parts have alternatives, like C<cat|keeper>.

The I<grouping> metacharacters C<()> solve this problem. Grouping
allows parts of a regexp to be treated as a single unit. Parts of a
regexp are grouped by enclosing them in parentheses. Thus we could solve
the C<housecat|housekeeper> problem by forming the regexp as
C<house(cat|keeper)>. The regexp C<house(cat|keeper)> means match
C<house> followed by either C<cat> or C<keeper>. Some more examples
are

    /(a|b)b/;    # matches 'ab' or 'bb'
    /(ac|b)b/;   # matches 'acb' or 'bb'
    /(^a|b)c/;   # matches 'ac' at start of string or 'bc' anywhere
    /(a|[bc])d/; # matches 'ad', 'bd', or 'cd'

    /house(cat|)/;  # matches either 'housecat' or 'house'
    /house(cat(s|)|)/;  # matches either 'housecats' or 'housecat' or
                        # 'house'.  Note groups can be nested.

    /(19|20|)\d\d/;  # match years 19xx, 20xx, or the Y2K problem, xx
    "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
                             # because '20\d\d' can't match

Alternations behave the same way in groups as out of them: at a given
string position, the leftmost alternative that allows the regexp to
match is taken. So in the last example at the first string position,
C<"20"> matches the second alternative, but there is nothing left over
to match the next two digits C<\d\d>. So Perl moves on to the next
alternative, which is the null alternative and that works, since
C<"20"> is two digits.

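Which alternative the group finally settled on can be read back from the capture variable C<$1>, which holds whatever the first pair of parentheses matched. The sketch below runs the same regexp over three sample inputs; the inputs and the hash C<%century> are illustrative.

```perl
use strict;
use warnings;

my @years = ('1999', '2024', '20');

# Record what the group (19|20|) captured for each input.
my %century;
for my $year (@years) {
    $century{$year} = $year =~ /(19|20|)\d\d/ ? $1 : undef;
}

printf "%-4s -> group captured '%s'\n", $_, $century{$_} for @years;
```

For the input C<"20"> the group captures the empty string, which is the null alternative at work.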
The process of trying one alternative, seeing if it matches, and
moving on to the next alternative, while going back in the string
from where the previous alternative was tried, if it doesn't, is called
I<backtracking>. The term "backtracking" comes from the idea that
matching a regexp is like a walk in the woods. Successfully matching
a regexp is like arriving at a destination. There are many possible
trailheads, one for each string position, and each one is tried in
order, left to right. From each trailhead there may be many paths,
some of which get you there, and some of which are dead ends. When you
walk along a trail and hit a dead end, you have to backtrack along the
trail to an earlier point to try another trail. If you hit your
destination, you stop immediately and forget about trying all the
other trails. You are persistent, and only if you have tried all the
trails from all the trailheads and not arrived at your destination, do
you declare failure. To be concrete, here is a step-by-step analysis
of what Perl does when it tries to match the regexp

680 "abcde" =~ /(abd|abc)(df|d|de)/;
681
=over 4

=item Z<>0. Start with the first letter in the string C<'a'>.

E<nbsp>

=item Z<>1. Try the first alternative in the first group C<'abd'>.

E<nbsp>

=item Z<>2. Match C<'a'> followed by C<'b'>. So far so good.

E<nbsp>

=item Z<>3. C<'d'> in the regexp doesn't match C<'c'> in the string - a
dead end. So backtrack two characters and pick the second alternative
in the first group C<'abc'>.

E<nbsp>

=item Z<>4. Match C<'a'> followed by C<'b'> followed by C<'c'>. We are on a roll
and have satisfied the first group. Set C<$1> to C<'abc'>.

E<nbsp>

=item Z<>5. Move on to the second group and pick the first alternative C<'df'>.

E<nbsp>

=item Z<>6. Match the C<'d'>.

E<nbsp>

=item Z<>7. C<'f'> in the regexp doesn't match C<'e'> in the string, so a dead
end. Backtrack one character and pick the second alternative in the
second group C<'d'>.

E<nbsp>

=item Z<>8. C<'d'> matches. The second grouping is satisfied, so set
C<$2> to C<'d'>.

E<nbsp>

=item Z<>9. We are at the end of the regexp, so we are done! We have
matched C<'abcd'> out of the string C<"abcde">.

=back
730
731There are a couple of things to note about this analysis. First, the
15776bb0 732third alternative in the second group C<'de'> also allows a match, but we
733stopped before we got to it - at a given character position, leftmost
734wins. Second, we were able to get a match at the first character
735position of the string C<'a'>. If there were no matches at the first
736position, Perl would move to the second character position C<'b'> and
47f9c88b 737attempt the match all over again. Only when all possible paths at all
738possible character positions have been exhausted does Perl give
739up and declare S<C<$string =~ /(abd|abc)(df|d|de)/;>> to be false.
740
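The outcome of the analysis can be checked with a short script. This is
only a sketch: the variable names are ours, and the second pattern
(with C<de> moved ahead of C<d>) is an illustrative variation that does
not appear above.

```perl
use strict;
use warnings;

my $string = "abcde";

# The pattern from the walk-through: 'd' is tried before 'de'.
my ($first, $second, $whole);
if ($string =~ /(abd|abc)(df|d|de)/) {
    ($first, $second, $whole) = ($1, $2, $&);   # $& is the whole matched text
    print "group 1 = '$first', group 2 = '$second', matched '$whole'\n";
}

# Same alternatives, but 'de' now comes before 'd', so it wins.
my ($second_reordered) = $string =~ /(?:abd|abc)(df|de|d)/;
print "reordered: group = '$second_reordered'\n";
```

Reordering the alternatives changes what is captured even though the
set of alternatives is the same, because the leftmost alternative that
works is always preferred.
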
741Even with all this work, regexp matching happens remarkably fast. To
353c6505
DL
742speed things up, Perl compiles the regexp into a compact sequence of
743opcodes that can often fit inside a processor cache. When the code is
7638d2dc
WL
744executed, these opcodes can then run at full throttle and search very
745quickly.
47f9c88b
GS
746
747=head2 Extracting matches
748
749The grouping metacharacters C<()> also serve another completely
750different function: they allow the extraction of the parts of a string
751that matched. This is very useful to find out what matched and for
752text processing in general. For each grouping, the part that matched
15776bb0 753inside goes into the special variables C<$1>, C<$2>, I<etc>. They can be
47f9c88b
GS
754used just as ordinary variables:
755
756 # extract hours, minutes, seconds
2275acdc
RGS
757 if ($time =~ /(\d\d):(\d\d):(\d\d)/) { # match hh:mm:ss format
758 $hours = $1;
759 $minutes = $2;
760 $seconds = $3;
761 }
47f9c88b
GS
762
763Now, we know that in scalar context,
7638d2dc 764S<C<$time =~ /(\d\d):(\d\d):(\d\d)/>> returns a true or false
47f9c88b
GS
765value. In list context, however, it returns the list of matched values
766C<($1,$2,$3)>. So we could write the code more compactly as
767
    # extract hours, minutes, seconds
    ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);
770
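A minimal sketch of list-context extraction under C<use strict>; the
sample strings and variable names are made up for illustration. It also
shows what happens when the match fails:

```perl
use strict;
use warnings;

my $time = "The build finished at 09:41:07 today";

# In list context the match returns ($1, $2, $3), or an empty list on failure.
my ($hours, $minutes, $seconds) = $time =~ /(\d\d):(\d\d):(\d\d)/;

if (defined $hours) {
    print "hours=$hours minutes=$minutes seconds=$seconds\n";
}

# A failed match assigns an empty list, leaving all three undefined.
my ($h, $m, $s) = "no time here" =~ /(\d\d):(\d\d):(\d\d)/;
my $failed = !defined $h;
print "second string did not match\n" if $failed;
```

Testing the first variable with C<defined> is one simple way to tell a
successful extraction from a failed one.
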
771If the groupings in a regexp are nested, C<$1> gets the group with the
772leftmost opening parenthesis, C<$2> the next opening parenthesis,
15776bb0 773I<etc>. Here is a regexp with nested groups:
47f9c88b
GS
774
    /(ab(cd|ef)((gi)|j))/;
     1  2      34
777
7638d2dc
WL
778If this regexp matches, C<$1> contains a string starting with
779C<'ab'>, C<$2> is either set to C<'cd'> or C<'ef'>, C<$3> equals either
780C<'gi'> or C<'j'>, and C<$4> is either set to C<'gi'>, just like C<$3>,
781or it remains undefined.
782
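To see the numbering in practice, here is a small sketch that applies
the nested pattern to two strings of our own choosing and records the
four groups (the hash C<%groups> is just a convenience for this
example):

```perl
use strict;
use warnings;

my $re = qr/(ab(cd|ef)((gi)|j))/;

my %groups;
for my $string ("abcdj", "abefgi") {
    if ($string =~ $re) {
        $groups{$string} = [$1, $2, $3, $4];
        # $4 is only set when the inner (gi) group took part in the match
        my $fourth = defined $4 ? "'$4'" : 'undef';
        print "$string: \$1='$1' \$2='$2' \$3='$3' \$4=$fourth\n";
    }
}
```

For C<"abcdj"> the fourth group never takes part in the match, so C<$4>
stays undefined, while for C<"abefgi"> both C<$3> and C<$4> hold
C<'gi'>.
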
783For convenience, Perl sets C<$+> to the string held by the highest numbered
784C<$1>, C<$2>,... that got assigned (and, somewhat related, C<$^N> to the
15776bb0 785value of the C<$1>, C<$2>,... most-recently assigned; I<i.e.> the C<$1>,
7638d2dc 786C<$2>,... associated with the rightmost closing parenthesis used in the
a01268b5 787match).
47f9c88b 788
7638d2dc
WL
789
790=head2 Backreferences
791
47f9c88b 792Closely associated with the matching variables C<$1>, C<$2>, ... are
d8b950dc 793the I<backreferences> C<\g1>, C<\g2>,... Backreferences are simply
47f9c88b 794matching variables that can be used I<inside> a regexp. This is a
ac036724 795really nice feature; what matches later in a regexp is made to depend on
47f9c88b 796what matched earlier in the regexp. Suppose we wanted to look
15776bb0 797for doubled words in a text, like "the the". The following regexp finds
47f9c88b
GS
798all 3-letter doubles with a space in between:
799
d8b950dc 800 /\b(\w\w\w)\s\g1\b/;
47f9c88b 801
15776bb0 802The grouping assigns a value to C<\g1>, so that the same 3-letter sequence
7638d2dc
WL
803is used for both parts.
804
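The following sketch runs the doubled-word regexp against two sample
strings of our own. The second string shows why the trailing C<\b>
matters:

```perl
use strict;
use warnings;

my $text = "Paris in the the spring";

# \g1 must repeat exactly what group 1 captured
my ($doubled) = $text =~ /\b(\w\w\w)\s\g1\b/;
print "doubled word: '$doubled'\n" if defined $doubled;

# "the then" is not a doubled word: after the second 'the' there is
# no word boundary, so the final \b rejects it.
my $false_alarm = "the then" =~ /\b(\w\w\w)\s\g1\b/ ? 1 : 0;
print "false alarm: $false_alarm\n";
```
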
805A similar task is to find words consisting of two identical parts:
47f9c88b 806
d8b950dc 807 % simple_grep '^(\w\w\w\w|\w\w\w|\w\w|\w)\g1$' /usr/dict/words
47f9c88b
GS
808 beriberi
809 booboo
810 coco
811 mama
812 murmur
813 papa
814
815The regexp has a single grouping which considers 4-letter
15776bb0 816combinations, then 3-letter combinations, I<etc>., and uses C<\g1> to look for
d8b950dc 817a repeat. Although C<$1> and C<\g1> represent the same thing, care should be
7638d2dc 818taken to use matched variables C<$1>, C<$2>,... only I<outside> a regexp
d8b950dc 819and backreferences C<\g1>, C<\g2>,... only I<inside> a regexp; not doing
7638d2dc
WL
820so may lead to surprising and unsatisfactory results.
821
822
823=head2 Relative backreferences
824
825Counting the opening parentheses to get the correct number for a
7698aede 826backreference is error-prone as soon as there is more than one
7638d2dc
WL
827capturing group. A more convenient technique became available
828with Perl 5.10: relative backreferences. To refer to the immediately
829preceding capture group one now may write C<\g{-1}>, the next but
830last is available via C<\g{-2}>, and so on.
831
832Another good reason in addition to readability and maintainability
8ccb1477 833for using relative backreferences is illustrated by the following example,
7638d2dc
WL
834where a simple pattern for matching peculiar strings is used:
835
d8b950dc 836 $a99a = '([a-z])(\d)\g2\g1'; # matches a11a, g22g, x33x, etc.
7638d2dc
WL
837
838Now that we have this pattern stored as a handy string, we might feel
839tempted to use it as a part of some other pattern:
840
841 $line = "code=e99e";
842 if ($line =~ /^(\w+)=$a99a$/){ # unexpected behavior!
843 print "$1 is valid\n";
844 } else {
845 print "bad line: '$line'\n";
846 }
847
ac036724 848But this doesn't match, at least not the way one might expect. Only
7638d2dc
WL
849after inserting the interpolated C<$a99a> and looking at the resulting
850full text of the regexp is it obvious that the backreferences have
ac036724 851backfired. The subexpression C<(\w+)> has snatched number 1 and
7638d2dc
WL
852demoted the groups in C<$a99a> by one rank. This can be avoided by
853using relative backreferences:
854
855 $a99a = '([a-z])(\d)\g{-1}\g{-2}'; # safe for being interpolated
856
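A sketch comparing the two variants side by side; the variable names
C<$absolute> and C<$relative> are ours and stand in for the two
versions of C<$a99a>:

```perl
use strict;
use warnings;

my $absolute = '([a-z])(\d)\g2\g1';         # counted from the start of the full regexp
my $relative = '([a-z])(\d)\g{-1}\g{-2}';   # counted back from the backreference

my $line = "code=e99e";

# Once interpolated, (\w+) becomes group 1, so \g1 means 'code' and
# \g2 means the letter, which is not what the sub-pattern intended.
my $absolute_ok = $line =~ /^(\w+)=${absolute}$/ ? 1 : 0;

# The relative form still refers to the two groups just before it.
my $relative_ok = $line =~ /^(\w+)=${relative}$/ ? 1 : 0;
my $key         = $relative_ok ? $1 : undef;

print "absolute: $absolute_ok, relative: $relative_ok\n";
```

Only the relative form survives being embedded after another capturing
group.
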
857
858=head2 Named backreferences
859
c27a5cfe 860Perl 5.10 also introduced named capture groups and named backreferences.
7638d2dc
WL
861To attach a name to a capturing group, you write either
862C<< (?<name>...) >> or C<< (?'name'...) >>. The backreference may
863then be written as C<\g{name}>. It is permissible to attach the
864same name to more than one group, but then only the leftmost one of the
865eponymous set can be referenced. Outside of the pattern a named
c27a5cfe 866capture group is accessible through the C<%+> hash.
7638d2dc 867
353c6505 868Assuming that we have to match calendar dates which may be given in one
7638d2dc 869of the three formats yyyy-mm-dd, mm/dd/yyyy or dd.mm.yyyy, we can write
15776bb0 870three suitable patterns where we use C<'d'>, C<'m'> and C<'y'> respectively as the
c27a5cfe 871names of the groups capturing the pertaining components of a date. The
7638d2dc
WL
872matching operation combines the three patterns as alternatives:
873
    $fmt1 = '(?<y>\d\d\d\d)-(?<m>\d\d)-(?<d>\d\d)';
    $fmt2 = '(?<m>\d\d)/(?<d>\d\d)/(?<y>\d\d\d\d)';
    $fmt3 = '(?<d>\d\d)\.(?<m>\d\d)\.(?<y>\d\d\d\d)';
    # the qw() list must be parenthesized; a bare qw() after
    # "for my $d" is a syntax error in Perl 5.18 and later
    for my $d (qw( 2006-10-21 15.01.2007 10/31/2005 )) {
        if ( $d =~ m{$fmt1|$fmt2|$fmt3} ){
            print "day=$+{d} month=$+{m} year=$+{y}\n";
        }
    }
882
883If any of the alternatives matches, the hash C<%+> is bound to contain the
884three key-value pairs.
885
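The date example reads the named groups through C<%+> only. As a
complementary sketch, here is a named group used with its
backreference C<\g{name}> inside the regexp; the group name C<word>
and the sample text are our own:

```perl
use strict;
use warnings;

my $text = "Paris in the the spring";

# (?<word>...) names the group; \g{word} refers back to it inside the regexp
my $repeated;
if ($text =~ /\b(?<word>\w+)\s+\g{word}\b/) {
    $repeated = $+{word};      # outside the regexp, use the %+ hash
    print "repeated word: $repeated\n";
}
```

A named group is still numbered as usual, so here C<$1> holds the same
text as C<$+{word}>.
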
886
887=head2 Alternative capture group numbering
888
Yet another capturing group numbering technique (also available as of
Perl 5.10) deals with the problem of referring to groups within a set of
alternatives. Consider a pattern for matching a time of the day, civil or
military style:
47f9c88b 892
7638d2dc
WL
893 if ( $time =~ /(\d\d|\d):(\d\d)|(\d\d)(\d\d)/ ){
894 # process hour and minute
895 }
896
Processing the results requires an additional C<if> statement to determine
whether C<$1> and C<$2> or C<$3> and C<$4> contain the goodies. It would
be easier if we could use group numbers 1 and 2 in the second alternative
as well, and this is exactly what the parenthesized construct C<(?|...)>,
set around the alternatives, achieves. Here is an extended version of the
previous pattern:
903
555bd962
BG
904 if($time =~ /(?|(\d\d|\d):(\d\d)|(\d\d)(\d\d))\s+([A-Z][A-Z][A-Z])/){
905 print "hour=$1 minute=$2 zone=$3\n";
906 }
7638d2dc 907
c27a5cfe 908Within the alternative numbering group, group numbers start at the same
7638d2dc 909position for each alternative. After the group, numbering continues
353c6505 910with one higher than the maximum reached across all the alternatives.
7638d2dc
WL
911
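A short sketch of the extended pattern applied to one civil and one
military time; the sample strings and the array C<@parsed> are
illustrative only:

```perl
use strict;
use warnings;

my $re = qr/(?|(\d\d|\d):(\d\d)|(\d\d)(\d\d))\s+([A-Z][A-Z][A-Z])/;

my @parsed;
for my $time ("9:30 CET", "0930 UTC") {
    if ($time =~ $re) {
        # whichever alternative matched, hour and minute are $1 and $2,
        # and the zone after the (?|...) group is $3
        push @parsed, "hour=$1 minute=$2 zone=$3";
    }
}
print "$_\n" for @parsed;
```

No extra C<if> is needed to find out which alternative supplied the
hour and minute.
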
912=head2 Position information
913
13e5d9cd 914In addition to what was matched, Perl also provides the
7638d2dc 915positions of what was matched as contents of the C<@-> and C<@+>
47f9c88b
GS
916arrays. C<$-[0]> is the position of the start of the entire match and
917C<$+[0]> is the position of the end. Similarly, C<$-[n]> is the
918position of the start of the C<$n> match and C<$+[n]> is the position
919of the end. If C<$n> is undefined, so are C<$-[n]> and C<$+[n]>. Then
920this code
921
922 $x = "Mmm...donut, thought Homer";
923 $x =~ /^(Mmm|Yech)\.\.\.(donut|peas)/; # matches
555bd962
BG
924 foreach $exp (1..$#-) {
925 print "Match $exp: '${$exp}' at position ($-[$exp],$+[$exp])\n";
47f9c88b
GS
926 }
927
928prints
929
930 Match 1: 'Mmm' at position (0,3)
931 Match 2: 'donut' at position (6,11)
932
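The loop above looks up C<$1>, C<$2>, ... by name, which is not allowed
under C<use strict 'refs'>. A possible strict-safe variation, sketched
here with our own variable names, recovers each group from its offsets
with C<substr>:

```perl
use strict;
use warnings;

my $x = "Mmm...donut, thought Homer";
my @report;
if ($x =~ /^(Mmm|Yech)\.\.\.(donut|peas)/) {
    for my $n (1 .. $#-) {
        # the text of group $n lies between offsets $-[$n] and $+[$n]
        my $text = substr($x, $-[$n], $+[$n] - $-[$n]);
        push @report, sprintf("Match %d: '%s' at position (%d,%d)",
                              $n, $text, $-[$n], $+[$n]);
    }
}
print "$_\n" for @report;
```
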
Even if there are no groupings in a regexp, it is still possible to
find out what exactly matched in a string. If you use them, Perl
will set C<$`> to the part of the string before the match, will set C<$&>
to the part of the string that matched, and will set C<$'> to the part
of the string after the match. An example:
938
939 $x = "the cat caught the mouse";
940 $x =~ /cat/; # $` = 'the ', $& = 'cat', $' = ' caught the mouse'
941 $x =~ /the/; # $` = '', $& = 'the', $' = ' cat caught the mouse'
942
7638d2dc
WL
943In the second match, C<$`> equals C<''> because the regexp matched at the
944first character position in the string and stopped; it never saw the
15776bb0 945second "the".
13b0f67d
DM
946
If your code is to run on Perl versions earlier than
5.20, it is worthwhile to note that using C<$`> and C<$'>
slows down regexp matching quite a bit, while C<$&> slows it down to a
lesser extent, because if they are used in one regexp in a program,
they are generated for I<all> regexps in the program. So if raw
performance is a goal of your application, they should be avoided.
If you need to extract the corresponding substrings, use C<@-> and
C<@+> instead:
955
956 $` is the same as substr( $x, 0, $-[0] )
957 $& is the same as substr( $x, $-[0], $+[0]-$-[0] )
958 $' is the same as substr( $x, $+[0] )
959
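The three equivalences can be sketched as follows, reusing the string
from the earlier example; the variable names C<$before>, C<$matched>
and C<$after> are ours:

```perl
use strict;
use warnings;

my $x = "the cat caught the mouse";
my ($before, $matched, $after);
if ($x =~ /cat/) {
    $before  = substr($x, 0, $-[0]);                # text before the match
    $matched = substr($x, $-[0], $+[0] - $-[0]);    # the matched text itself
    $after   = substr($x, $+[0]);                   # text after the match
}
print "before='$before' matched='$matched' after='$after'\n";
```
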
As of Perl 5.10, the C<${^PREMATCH}>, C<${^MATCH}> and C<${^POSTMATCH}>
variables may be used. These are only set if the C</p> modifier is
present. Consequently they do not penalize the rest of the program. In
Perl 5.20, C<${^PREMATCH}>, C<${^MATCH}> and C<${^POSTMATCH}> are available
whether the C</p> has been used or not (the modifier is ignored), and
C<$`>, C<$'> and C<$&> do not cause any speed difference.
966
967=head2 Non-capturing groupings
968
353c6505 969A group that is required to bundle a set of alternatives may or may not be
7638d2dc 970useful as a capturing group. If it isn't, it just creates a superfluous
c27a5cfe 971addition to the set of available capture group values, inside as well as
7638d2dc 972outside the regexp. Non-capturing groupings, denoted by C<(?:regexp)>,
353c6505 973still allow the regexp to be treated as a single unit, but don't establish
c27a5cfe 974a capturing group at the same time. Both capturing and non-capturing
7638d2dc
WL
975groupings are allowed to co-exist in the same regexp. Because there is
976no extraction, non-capturing groupings are faster than capturing
977groupings. Non-capturing groupings are also handy for choosing exactly
978which parts of a regexp are to be extracted to matching variables:
979
980 # match a number, $1-$4 are set, but we only want $1
981 /([+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?)/;
982
983 # match a number faster , only $1 is set
984 /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?)/;
985
986 # match a number, get $1 = whole number, $2 = exponent
987 /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE]([+-]?\d+))?)/;
988
989Non-capturing groupings are also useful for removing nuisance
990elements gathered from a split operation where parentheses are
991required for some reason:
992
993 $x = '12aba34ba5';
9b846e30 994 @num = split /(a|b)+/, $x; # @num = ('12','a','34','a','5')
7638d2dc
WL
995 @num = split /(?:a|b)+/, $x; # @num = ('12','34','5')
996
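The sketch below runs the C<split> comparison and the last number
pattern from above; the sample number C<"-12.5e+3"> and the variable
names are chosen for illustration:

```perl
use strict;
use warnings;

my $x = '12aba34ba5';
my @with_captures    = split /(a|b)+/, $x;     # separators leak into the list
my @without_captures = split /(?:a|b)+/, $x;   # only the fields remain

print "capturing:     @with_captures\n";
print "non-capturing: @without_captures\n";

# Only the whole number ($1) and the exponent ($2) are captured here.
my ($number, $exponent) =
    "-12.5e+3" =~ /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE]([+-]?\d+))?)/;
print "number=$number exponent=$exponent\n";
```
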
33be4c61
MH
997In Perl 5.22 and later, all groups within a regexp can be set to
998non-capturing by using the new C</n> flag:
999
1000 "hello" =~ /(hi|hello)/n; # $1 is not set!
1001
1002See L<perlre/"n"> for more information.
7638d2dc 1003
47f9c88b
GS
1004=head2 Matching repetitions
1005
1006The examples in the previous section display an annoying weakness. We
7638d2dc
WL
1007were only matching 3-letter words, or chunks of words of 4 letters or
1008less. We'd like to be able to match words or, more generally, strings
1009of any length, without writing out tedious alternatives like
47f9c88b
GS
1010C<\w\w\w\w|\w\w\w|\w\w|\w>.
1011
15776bb0
KW
1012This is exactly the problem the I<quantifier> metacharacters C<'?'>,
1013C<'*'>, C<'+'>, and C<{}> were created for. They allow us to delimit the
7638d2dc 1014number of repeats for a portion of a regexp we consider to be a
47f9c88b
GS
1015match. Quantifiers are put immediately after the character, character
1016class, or grouping that we want to specify. They have the following
1017meanings:
1018
=over 4

=item *

C<a?> means: match C<'a'> 1 or 0 times

=item *

C<a*> means: match C<'a'> 0 or more times, I<i.e.>, any number of times

=item *

C<a+> means: match C<'a'> 1 or more times, I<i.e.>, at least once

=item *

C<a{n,m}> means: match at least C<n> times, but not more than C<m>
times.

=item *

C<a{n,}> means: match at least C<n> times

=item *

C<a{n}> means: match exactly C<n> times

=back
1047
1048Here are some examples:
1049
7638d2dc 1050 /[a-z]+\s+\d*/; # match a lowercase word, at least one space, and
47f9c88b 1051 # any number of digits
d8b950dc 1052 /(\w+)\s+\g1/; # match doubled words of arbitrary length
47f9c88b 1053 /y(es)?/i; # matches 'y', 'Y', or a case-insensitive 'yes'
c2ac8995
NS
1054 $year =~ /^\d{2,4}$/; # make sure year is at least 2 but not more
1055 # than 4 digits
555bd962
BG
1056 $year =~ /^\d{4}$|^\d{2}$/; # better match; throw out 3-digit dates
1057 $year =~ /^\d{2}(\d{2})?$/; # same thing written differently.
1058 # However, this captures the last two
1059 # digits in $1 and the other does not.
47f9c88b 1060
d8b950dc 1061 % simple_grep '^(\w+)\g1$' /usr/dict/words # isn't this easier?
47f9c88b
GS
1062 beriberi
1063 booboo
1064 coco
1065 mama
1066 murmur
1067 papa
1068
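To make the difference between the year patterns concrete, here is a
small sketch that filters a list of made-up inputs. The second pattern
uses a non-capturing group, a minor variation on the example above:

```perl
use strict;
use warnings;

my (@two_to_four, @two_or_four);
for my $year (qw( 7 98 199 2024 20245 )) {
    push @two_to_four, $year if $year =~ /^\d{2,4}$/;          # 2, 3, or 4 digits
    push @two_or_four, $year if $year =~ /^\d{2}(?:\d{2})?$/;  # exactly 2 or 4 digits
}
print "2 to 4 digits: @two_to_four\n";
print "2 or 4 digits: @two_or_four\n";
```

The 3-digit input passes the first test but is thrown out by the
second.
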
7638d2dc 1069For all of these quantifiers, Perl will try to match as much of the
47f9c88b 1070string as possible, while still allowing the regexp to succeed. Thus
15776bb0 1071with C</a?.../>, Perl will first try to match the regexp with the C<'a'>
7638d2dc 1072present; if that fails, Perl will try to match the regexp without the
15776bb0 1073C<'a'> present. For the quantifier C<'*'>, we get the following:
47f9c88b
GS
1074
1075 $x = "the cat in the hat";
1076 $x =~ /^(.*)(cat)(.*)$/; # matches,
1077 # $1 = 'the '
1078 # $2 = 'cat'
1079 # $3 = ' in the hat'
1080
This is what we might expect: the match finds the only C<cat> in the
string and locks onto it. Consider, however, this regexp:
1083
1084 $x =~ /^(.*)(at)(.*)$/; # matches,
1085 # $1 = 'the cat in the h'
1086 # $2 = 'at'
7638d2dc 1087 # $3 = '' (0 characters match)
47f9c88b 1088
One might initially guess that Perl would find the C<at> in C<cat> and
stop there, but that wouldn't give the longest possible string to the
first quantifier C<.*>. Instead, the first quantifier C<.*> grabs as
much of the string as possible while still having the regexp match. In
this example, that means matching the C<at> sequence against the final
C<at> in the string. The other important principle illustrated here is
that, when there are two or more elements in a regexp, the I<leftmost>
quantifier, if there is one, gets to grab as much of the string as
possible, leaving the rest of the regexp to fight over scraps. Thus in
our example, the first quantifier C<.*> grabs most of the string, while
the second quantifier C<.*> gets the empty string. Quantifiers that
grab as much of the string as possible are called I<maximal match> or
I<greedy> quantifiers.
1102
1103When a regexp can match a string in several different ways, we can use
1104the principles above to predict which way the regexp will match:
1105
1106=over 4
1107
1108=item *
551e1d92 1109
47f9c88b
GS
1110Principle 0: Taken as a whole, any regexp will be matched at the
1111earliest possible position in the string.
1112
1113=item *
551e1d92 1114
47f9c88b
GS
1115Principle 1: In an alternation C<a|b|c...>, the leftmost alternative
1116that allows a match for the whole regexp will be the one used.
1117
1118=item *
551e1d92 1119
15776bb0 1120Principle 2: The maximal matching quantifiers C<'?'>, C<'*'>, C<'+'> and
47f9c88b
GS
1121C<{n,m}> will in general match as much of the string as possible while
1122still allowing the whole regexp to match.
1123
1124=item *
551e1d92 1125
47f9c88b
GS
1126Principle 3: If there are two or more elements in a regexp, the
1127leftmost greedy quantifier, if any, will match as much of the string
1128as possible while still allowing the whole regexp to match. The next
1129leftmost greedy quantifier, if any, will try to match as much of the
1130string remaining available to it as possible, while still allowing the
1131whole regexp to match. And so on, until all the regexp elements are
1132satisfied.
1133
1134=back
1135
ac036724 1136As we have seen above, Principle 0 overrides the others. The regexp
47f9c88b
GS
1137will be matched as early as possible, with the other principles
1138determining how the regexp matches at that earliest character
1139position.
1140
1141Here is an example of these principles in action:
1142
1143 $x = "The programming republic of Perl";
1144 $x =~ /^(.+)(e|r)(.*)$/; # matches,
1145 # $1 = 'The programming republic of Pe'
1146 # $2 = 'r'
1147 # $3 = 'l'
1148
1149This regexp matches at the earliest string position, C<'T'>. One
15776bb0
KW
1150might think that C<'e'>, being leftmost in the alternation, would be
1151matched, but C<'r'> produces the longest string in the first quantifier.
47f9c88b
GS
1152
1153 $x =~ /(m{1,2})(.*)$/; # matches,
1154 # $1 = 'mm'
1155 # $2 = 'ing republic of Perl'
1156
Here, the earliest possible match is at the first C<'m'> in
C<programming>. C<m{1,2}> is the first quantifier, so it gets to match
a maximal C<mm>.
1160
1161 $x =~ /.*(m{1,2})(.*)$/; # matches,
1162 # $1 = 'm'
1163 # $2 = 'ing republic of Perl'
1164
1165Here, the regexp matches at the start of the string. The first
1166quantifier C<.*> grabs as much as possible, leaving just a single
1167C<'m'> for the second quantifier C<m{1,2}>.
1168
1169 $x =~ /(.?)(m{1,2})(.*)$/; # matches,
1170 # $1 = 'a'
1171 # $2 = 'mm'
1172 # $3 = 'ing republic of Perl'
1173
1174Here, C<.?> eats its maximal one character at the earliest possible
1175position in the string, C<'a'> in C<programming>, leaving C<m{1,2}>
15776bb0 1176the opportunity to match both C<'m'>'s. Finally,
47f9c88b
GS
1177
1178 "aXXXb" =~ /(X*)/; # matches with $1 = ''
1179
1180because it can match zero copies of C<'X'> at the beginning of the
1181string. If you definitely want to match at least one C<'X'>, use
1182C<X+>, not C<X*>.
1183
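A sketch of the difference, using C<$-[0]> (the start offset of the
match) to show where each regexp matched; the variable names are ours:

```perl
use strict;
use warnings;

# X* is satisfied by zero X's, so it matches at offset 0, before the 'a'.
my ($star)  = "aXXXb" =~ /(X*)/;
my $star_at = $-[0];

# X+ needs at least one X, so the match has to start at offset 1.
my ($plus)  = "aXXXb" =~ /(X+)/;
my $plus_at = $-[0];

print "X* matched '$star' at offset $star_at\n";
print "X+ matched '$plus' at offset $plus_at\n";
```

Principle 0 is what makes the empty match win: an empty match at
offset 0 is earlier than a longer match at offset 1.
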
1184Sometimes greed is not good. At times, we would like quantifiers to
1185match a I<minimal> piece of string, rather than a maximal piece. For
7638d2dc
WL
1186this purpose, Larry Wall created the I<minimal match> or
1187I<non-greedy> quantifiers C<??>, C<*?>, C<+?>, and C<{}?>. These are
15776bb0 1188the usual quantifiers with a C<'?'> appended to them. They have the
47f9c88b
GS
1189following meanings:
1190
1191=over 4
1192
551e1d92
RB
1193=item *
1194
15776bb0 1195C<a??> means: match C<'a'> 0 or 1 times. Try 0 first, then 1.
47f9c88b 1196
551e1d92
RB
1197=item *
1198
15776bb0 1199C<a*?> means: match C<'a'> 0 or more times, I<i.e.>, any number of times,
47f9c88b
GS
1200but as few times as possible
1201
551e1d92
RB
1202=item *
1203
15776bb0 1204C<a+?> means: match C<'a'> 1 or more times, I<i.e.>, at least once, but
47f9c88b
GS
1205as few times as possible
1206
551e1d92
RB
1207=item *
1208
7638d2dc 1209C<a{n,m}?> means: match at least C<n> times, not more than C<m>
47f9c88b
GS
1210times, as few times as possible
1211
551e1d92
RB
1212=item *
1213
7638d2dc 1214C<a{n,}?> means: match at least C<n> times, but as few times as
47f9c88b
GS
1215possible
1216
551e1d92
RB
1217=item *
1218
7638d2dc 1219C<a{n}?> means: match exactly C<n> times. Because we match exactly
47f9c88b
GS
1220C<n> times, C<a{n}?> is equivalent to C<a{n}> and is just there for
1221notational consistency.
1222
1223=back
1224
1225Let's look at the example above, but with minimal quantifiers:
1226
1227 $x = "The programming republic of Perl";
1228 $x =~ /^(.+?)(e|r)(.*)$/; # matches,
1229 # $1 = 'Th'
1230 # $2 = 'e'
1231 # $3 = ' programming republic of Perl'
1232
15776bb0 1233The minimal string that will allow both the start of the string C<'^'>
47f9c88b 1234and the alternation to match is C<Th>, with the alternation C<e|r>
15776bb0 1235matching C<'e'>. The second quantifier C<.*> is free to gobble up the
47f9c88b
GS
1236rest of the string.
1237
1238 $x =~ /(m{1,2}?)(.*?)$/; # matches,
1239 # $1 = 'm'
1240 # $2 = 'ming republic of Perl'
1241
1242The first string position that this regexp can match is at the first
1243C<'m'> in C<programming>. At this position, the minimal C<m{1,2}?>
1244matches just one C<'m'>. Although the second quantifier C<.*?> would
1245prefer to match no characters, it is constrained by the end-of-string
15776bb0 1246anchor C<'$'> to match the rest of the string.
47f9c88b
GS
1247
1248 $x =~ /(.*?)(m{1,2}?)(.*)$/; # matches,
1249 # $1 = 'The progra'
1250 # $2 = 'm'
1251 # $3 = 'ming republic of Perl'
1252
1253In this regexp, you might expect the first minimal quantifier C<.*?>
15776bb0 1254to match the empty string, because it is not constrained by a C<'^'>
47f9c88b
GS
1255anchor to match the beginning of the word. Principle 0 applies here,
1256however. Because it is possible for the whole regexp to match at the
1257start of the string, it I<will> match at the start of the string. Thus
15776bb0
KW
1258the first quantifier has to match everything up to the first C<'m'>. The
1259second minimal quantifier matches just one C<'m'> and the third
47f9c88b
GS
1260quantifier matches the rest of the string.
1261
1262 $x =~ /(.??)(m{1,2})(.*)$/; # matches,
1263 # $1 = 'a'
1264 # $2 = 'mm'
1265 # $3 = 'ing republic of Perl'
1266
1267Just as in the previous regexp, the first quantifier C<.??> can match
1268earliest at position C<'a'>, so it does. The second quantifier is
1269greedy, so it matches C<mm>, and the third matches the rest of the
1270string.
1271
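The greedy and minimal versions of the examples above can be compared
directly. This sketch collects the captured groups in arrays whose
names are ours:

```perl
use strict;
use warnings;

my $x = "The programming republic of Perl";

my @greedy  = $x =~ /^(.+)(e|r)(.*)$/;    # .+ takes as much as it can
my @minimal = $x =~ /^(.+?)(e|r)(.*)$/;   # .+? takes as little as it can

print "greedy:  '", join("', '", @greedy),  "'\n";
print "minimal: '", join("', '", @minimal), "'\n";

# A minimal m{1,2}? settles for one 'm'; the anchor $ then forces
# the minimal .*? to take everything up to the end of the string.
my @minimal_m = $x =~ /(m{1,2}?)(.*?)$/;
print "minimal m: '", join("', '", @minimal_m), "'\n";
```
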
1272We can modify principle 3 above to take into account non-greedy
1273quantifiers:
1274
1275=over 4
1276
1277=item *
551e1d92 1278
47f9c88b
GS
1279Principle 3: If there are two or more elements in a regexp, the
1280leftmost greedy (non-greedy) quantifier, if any, will match as much
1281(little) of the string as possible while still allowing the whole
1282regexp to match. The next leftmost greedy (non-greedy) quantifier, if
1283any, will try to match as much (little) of the string remaining
1284available to it as possible, while still allowing the whole regexp to
1285match. And so on, until all the regexp elements are satisfied.
1286
1287=back
1288
1289Just like alternation, quantifiers are also susceptible to
1290backtracking. Here is a step-by-step analysis of the example
1291
1292 $x = "the cat in the hat";
1293 $x =~ /^(.*)(at)(.*)$/; # matches,
1294 # $1 = 'the cat in the h'
1295 # $2 = 'at'
1296 # $3 = '' (0 matches)
1297
1298=over 4
1299
15776bb0 1300=item Z<>0. Start with the first letter in the string C<'t'>.
47f9c88b 1301
15776bb0 1302E<nbsp>
551e1d92 1303
15776bb0
KW
1304=item Z<>1. The first quantifier C<'.*'> starts out by matching the whole
1305string "C<the cat in the hat>".
47f9c88b 1306
15776bb0 1307E<nbsp>
551e1d92 1308
15776bb0
KW
1309=item Z<>2. C<'a'> in the regexp element C<'at'> doesn't match the end
1310of the string. Backtrack one character.
47f9c88b 1311
15776bb0 1312E<nbsp>
551e1d92 1313
15776bb0
KW
1314=item Z<>3. C<'a'> in the regexp element C<'at'> still doesn't match
1315the last letter of the string C<'t'>, so backtrack one more character.
47f9c88b 1316
15776bb0 1317E<nbsp>
551e1d92 1318
15776bb0 1319=item Z<>4. Now we can match the C<'a'> and the C<'t'>.
47f9c88b 1320
15776bb0 1321E<nbsp>
551e1d92 1322
15776bb0
KW
1323=item Z<>5. Move on to the third element C<'.*'>. Since we are at the
1324end of the string and C<'.*'> can match 0 times, assign it the empty
1325string.
47f9c88b 1326
15776bb0 1327E<nbsp>
551e1d92 1328
15776bb0 1329=item Z<>6. We are done!
47f9c88b
GS
1330
1331=back
1332
Most of the time, all this moving forward and backtracking happens
quickly and searching is fast. There are some pathological regexps,
however, whose execution time grows exponentially with the size of the
string. A typical structure that blows up in your face is of the form
1337
1338 /(a|b+)*/;
1339
1340The problem is the nested indeterminate quantifiers. There are many
15776bb0
KW
1341different ways of partitioning a string of length n between the C<'+'>
1342and C<'*'>: one repetition with C<b+> of length n, two repetitions with
47f9c88b 1343the first C<b+> length k and the second with length n-k, m repetitions
15776bb0 1344whose bits add up to length n, I<etc>. In fact there are an exponential
7638d2dc 1345number of ways to partition a string as a function of its length. A
47f9c88b 1346regexp may get lucky and match early in the process, but if there is
7638d2dc 1347no match, Perl will try I<every> possibility before giving up. So be
15776bb0 1348careful with nested C<'*'>'s, C<{n,m}>'s, and C<'+'>'s. The book
7638d2dc 1349I<Mastering Regular Expressions> by Jeffrey Friedl gives a wonderful
47f9c88b
GS
1350discussion of this and other efficiency issues.
1351
7638d2dc
WL
1352
1353=head2 Possessive quantifiers
1354
1355Backtracking during the relentless search for a match may be a waste
1356of time, particularly when the match is bound to fail. Consider
1357the simple pattern
1358
1359 /^\w+\s+\w+$/; # a word, spaces, a word
1360
1361Whenever this is applied to a string which doesn't quite meet the
1362pattern's expectations such as S<C<"abc ">> or S<C<"abc def ">>,
15776bb0 1363the regexp engine will backtrack, approximately once for each character
353c6505
DL
1364in the string. But we know that there is no way around taking I<all>
1365of the initial word characters to match the first repetition, that I<all>
7638d2dc 1366spaces must be eaten by the middle part, and the same goes for the second
353c6505
DL
1367word.
1368
1369With the introduction of the I<possessive quantifiers> in Perl 5.10, we
15776bb0
KW
1370have a way of instructing the regexp engine not to backtrack, with the
1371usual quantifiers with a C<'+'> appended to them. This makes them greedy as
353c6505
DL
1372well as stingy; once they succeed they won't give anything back to permit
1373another solution. They have the following meanings:
7638d2dc
WL
1374
1375=over 4
1376
1377=item *
1378
353c6505
DL
1379C<a{n,m}+> means: match at least C<n> times, not more than C<m> times,
1380as many times as possible, and don't give anything up. C<a?+> is short
7638d2dc
WL
1381for C<a{0,1}+>
1382
1383=item *
1384
1385C<a{n,}+> means: match at least C<n> times, but as many times as possible,
353c6505 1386and don't give anything up. C<a*+> is short for C<a{0,}+> and C<a++> is
7638d2dc
WL
1387short for C<a{1,}+>.
1388
1389=item *
1390
1391C<a{n}+> means: match exactly C<n> times. It is just there for
1392notational consistency.
1393
1394=back
1395
These possessive quantifiers represent a special case of a more general
concept, the I<independent subexpression>; see below.
1398
As an example where a possessive quantifier is suitable we consider
matching a quoted string, as it appears in several programming languages.
The backslash is used as an escape character that indicates that the
next character is to be taken literally, as another character for the
string. Therefore, after the opening quote, we expect a (possibly
empty) sequence of alternatives: either some character except an
unescaped quote or backslash or an escaped character.

    /"(?:[^"\\]++|\\.)*+"/;

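
The difference between ordinary and possessive quantifiers can be
checked with a short script. This is only a sketch: the sample strings
and the variable names (C<$greedy>, C<$possessive>, C<$quoted>) are made
up for illustration.

```perl
use strict;
use warnings;

# An ordinary quantifier backtracks: a+ gives back one 'a' so that
# the final 'a' in the pattern can still match.
my $greedy = "aaa" =~ /^a+a$/ ? 1 : 0;

# A possessive quantifier never gives anything back, so this fails.
my $possessive = "aaa" =~ /^a++a$/ ? 1 : 0;

# The quoted-string pattern from the text, applied to a sample string
# that contains escaped quotes.
my $text = 'say "hello \"world\"" now';
my ($quoted) = $text =~ /("(?:[^"\\]++|\\.)*+")/;

print "greedy: $greedy, possessive: $possessive\n";
print "quoted string: $quoted\n";
```

The possessive form is the right choice only when giving characters back
could never lead to a useful match, as with the quoted string.
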
=head2 Building a regexp

At this point, we have all the basic regexp concepts covered, so let's
give a more involved example of a regular expression. We will build a
regexp that matches numbers.

The first task in building a regexp is to decide what we want to match
and what we want to exclude. In our case, we want to match both
integers and floating point numbers and we want to reject any string
that isn't a number.

The next task is to break the problem down into smaller problems that
are easily converted into a regexp.

The simplest case is integers. These consist of a sequence of digits,
with an optional sign in front. The digits we can represent with
C<\d+> and the sign can be matched with C<[+-]>. Thus the integer
regexp is

    /[+-]?\d+/;  # matches integers

A floating point number potentially has a sign, an integral part, a
decimal point, a fractional part, and an exponent. One or more of these
parts is optional, so we need to check out the different
possibilities. Floating point numbers which are in proper form include
123., 0.345, .34, -1e6, and 25.4E-72. As with integers, the sign out
front is completely optional and can be matched by C<[+-]?>. We can
see that if there is no exponent, floating point numbers must have a
decimal point, otherwise they are integers. We might be tempted to
model these with C<\d*\.\d*>, but this would also match just a single
decimal point, which is not a number. So the three cases of floating
point number without exponent are

    /[+-]?\d+\./;     # 1., 321., etc.
    /[+-]?\.\d+/;     # .1, .234, etc.
    /[+-]?\d+\.\d+/;  # 1.0, 30.56, etc.

These can be combined into a single regexp with a three-way alternation:

    /[+-]?(\d+\.\d+|\d+\.|\.\d+)/;  # floating point, no exponent

In this alternation, it is important to put C<'\d+\.\d+'> before
C<'\d+\.'>. If C<'\d+\.'> were first, the regexp would happily match that
and ignore the fractional part of the number.

Now consider floating point numbers with exponents. The key
observation here is that I<both> integers and numbers with decimal
points are allowed in front of an exponent. Then exponents, like the
overall sign, are independent of whether we are matching numbers with
or without decimal points, and can be "decoupled" from the
mantissa. The overall form of the regexp now becomes clear:

    /^(optional sign)(integer | f.p. mantissa)(optional exponent)$/;

The exponent is an C<'e'> or C<'E'>, followed by an integer. So the
exponent regexp is

    /[eE][+-]?\d+/;  # exponent

Putting all the parts together, we get a regexp that matches numbers:

    /^[+-]?(\d+\.\d+|\d+\.|\.\d+|\d+)([eE][+-]?\d+)?$/;  # Ta da!

Long regexps like this may impress your friends, but can be hard to
decipher. In complex situations like this, the C</x> modifier for a
match is invaluable. It allows one to put nearly arbitrary whitespace
and comments into a regexp without affecting its meaning. Using it,
we can rewrite our "extended" regexp in the more pleasing form

    /^
       [+-]?         # first, match an optional sign
       (             # then match integers or f.p. mantissas:
           \d+\.\d+  # mantissa of the form a.b
          |\d+\.     # mantissa of the form a.
          |\.\d+     # mantissa of the form .b
          |\d+       # integer of the form a
       )
       ( [eE] [+-]? \d+ )?  # finally, optionally match an exponent
    $/x;

If whitespace is mostly irrelevant, how does one include space
characters in an extended regexp? The answer is to backslash it
S<C<'\ '>> or put it in a character class S<C<[ ]>>. The same thing
goes for pound signs: use C<\#> or C<[#]>. For instance, Perl allows
a space between the sign and the mantissa or integer, and we could add
this to our regexp as follows:

    /^
       [+-]?\ *      # first, match an optional sign *and space*
       (             # then match integers or f.p. mantissas:
           \d+\.\d+  # mantissa of the form a.b
          |\d+\.     # mantissa of the form a.
          |\.\d+     # mantissa of the form .b
          |\d+       # integer of the form a
       )
       ( [eE] [+-]? \d+ )?  # finally, optionally match an exponent
    $/x;

In this form, it is easier to see a way to simplify the
alternation. Alternatives 1, 2, and 4 all start with C<\d+>, so it
could be factored out:

    /^
       [+-]?\ *      # first, match an optional sign
       (             # then match integers or f.p. mantissas:
           \d+       # start out with a ...
           (
               \.\d* # mantissa of the form a.b or a.
           )?        # ? takes care of integers of the form a
          |\.\d+     # mantissa of the form .b
       )
       ( [eE] [+-]? \d+ )?  # finally, optionally match an exponent
    $/x;

Starting in Perl v5.26, specifying C</xx> changes the square-bracketed
portions of a pattern to ignore tabs and space characters unless they
are escaped by preceding them with a backslash. So, we could write

    /^
       [ + - ]?\ *   # first, match an optional sign
       (             # then match integers or f.p. mantissas:
           \d+       # start out with a ...
           (
               \.\d* # mantissa of the form a.b or a.
           )?        # ? takes care of integers of the form a
          |\.\d+     # mantissa of the form .b
       )
       ( [ e E ] [ + - ]? \d+ )?  # finally, optionally match an exponent
    $/xx;

This doesn't really improve the legibility of this example, but it's
available in case you want it. Squashing the pattern down to the
compact form, we have

    /^[+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/;

This is our final regexp. To recap, we built a regexp by

=over 4

=item *

specifying the task in detail,

=item *

breaking down the problem into smaller parts,

=item *

translating the small parts into regexps,

=item *

combining the regexps,

=item *

and optimizing the final combined regexp.

=back

These are also the typical steps involved in writing a computer
program. This makes perfect sense, because regular expressions are
essentially programs written in a little computer language that specifies
patterns.

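
As with any program, the finished number regexp deserves a test. The
following sketch runs the compact form against a handful of strings; the
test strings and the names C<$number>, C<@accepted> and C<@rejected> are
our own choices for illustration.

```perl
use strict;
use warnings;

# The compact number regexp built in this section.
my $number = qr/^[+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/;

my @candidates = ('123.', '0.345', '.34', '-1e6', '25.4E-72', '- 12',
                  '.', 'e5', '1.2.3', 'abc');

# Sort the candidates into those that look like numbers and the rest.
my @accepted = grep { $_ =~ $number } @candidates;
my @rejected = grep { $_ !~ $number } @candidates;

print "accepted: ", join(', ', map { "'$_'" } @accepted), "\n";
print "rejected: ", join(', ', map { "'$_'" } @rejected), "\n";
```

A lone decimal point, a bare exponent and a string with two decimal
points are all rejected, which is the behavior we specified at the start.
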
=head2 Using regular expressions in Perl

The last topic of Part 1 briefly covers how regexps are used in Perl
programs. Where do they fit into Perl syntax?

We have already introduced the matching operator in its default
C</regexp/> and arbitrary delimiter C<m!regexp!> forms. We have used
the binding operator C<=~> and its negation C<!~> to test for string
matches. Associated with the matching operator, we have discussed the
single line C</s>, multi-line C</m>, case-insensitive C</i> and
extended C</x> modifiers. There are a few more things you might
want to know about matching operators.

=head3 Prohibiting substitution

Variables in a regexp are normally interpolated before the match is
attempted. If you don't want any substitutions at all, use the
special delimiter C<m''>:

    @pattern = ('Seuss');
    while (<>) {
        print if m'@pattern';  # matches literal '@pattern', not 'Seuss'
    }

Similar to strings, C<m''> acts like apostrophes on a regexp; all other
C<'m'> delimiters act like quotes. If the regexp evaluates to the empty string,
the regexp in the I<last successful match> is used instead. So we have

    "dog" =~ /d/;     # 'd' matches
    "dogbert" =~ //;  # this matches the 'd' regexp used before

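
The effect of the C<m''> delimiter is easy to see side by side with an
ordinary match. In this sketch the sample line and the result variables
are invented for the purpose:

```perl
use strict;
use warnings;

my @pattern = ('Seuss');
my $line    = 'Dr. Seuss wrote The Lorax';

# With ordinary delimiters the array is interpolated, so we search
# for 'Seuss'.
my $interpolated = $line =~ m/@pattern/ ? 1 : 0;

# With apostrophes nothing is interpolated, so we search for the
# eight literal characters '@pattern'.
my $literal    = $line =~ m'@pattern' ? 1 : 0;
my $at_literal = 'mail @pattern here' =~ m'@pattern' ? 1 : 0;

print "interpolated: $interpolated\n";
print "literal: $literal, literal text present: $at_literal\n";
```
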
=head3 Global matching

The final two modifiers we will discuss here,
C</g> and C</c>, concern multiple matches.
The modifier C</g> stands for global matching and allows the
matching operator to match within a string as many times as possible.
In scalar context, successive invocations against a string will have
C</g> jump from match to match, keeping track of position in the
string as it goes along. You can get or set the position with the
C<pos()> function.

The use of C</g> is shown in the following example. Suppose we have
a string that consists of words separated by spaces. If we know how
many words there are in advance, we could extract the words using
groupings:

    $x = "cat dog house";  # 3 words
    $x =~ /^\s*(\w+)\s+(\w+)\s+(\w+)\s*$/;  # matches,
                                            # $1 = 'cat'
                                            # $2 = 'dog'
                                            # $3 = 'house'

But what if we had an indeterminate number of words? This is the sort
of task C</g> was made for. To extract all words, form the simple
regexp C<(\w+)> and loop over all matches with C</(\w+)/g>:

    while ($x =~ /(\w+)/g) {
        print "Word is $1, ends at position ", pos $x, "\n";
    }

prints

    Word is cat, ends at position 3
    Word is dog, ends at position 7
    Word is house, ends at position 13

A failed match or changing the target string resets the position. If
you don't want the position reset after failure to match, add the
C</c> modifier, as in C</regexp/gc>. The current position in the string is
associated with the string, not the regexp. This means that different
strings have different positions and their respective positions can be
set or read independently.

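
The rules for the position can be traced step by step. The following
sketch reuses the C<"cat dog house"> string; the variables that record
the position after each step are introduced here only for illustration.

```perl
use strict;
use warnings;

my $x = "cat dog house";

$x =~ /\w+/g;                  # matches 'cat'
my $after_match = pos $x;      # 3

$x =~ /\d+/g;                  # fails, so the position is reset
my $after_fail = pos $x;       # undef

$x =~ /\w+/g;                  # matches 'cat' again
$x =~ /\d+/gc;                 # fails, but /c keeps the position
my $after_fail_c = pos $x;     # 3

pos($x) = 8;                   # set the position by hand
$x =~ /(\w+)/g;                # the search starts at offset 8
my $word = $1;                 # 'house'

print "after match: $after_match\n";
print "after failed /g: ", defined $after_fail ? $after_fail : 'undef', "\n";
print "after failed /gc: $after_fail_c\n";
print "word found from position 8: $word\n";
```
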
In list context, C</g> returns a list of matched groupings, or if
there are no groupings, a list of matches to the whole regexp. So if
we wanted just the words, we could use

    @words = ($x =~ /(\w+)/g);  # matches,
                                # $words[0] = 'cat'
                                # $words[1] = 'dog'
                                # $words[2] = 'house'

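
When the regexp has more than one grouping, the list returned by C</g>
contains the groupings of every match in turn, which is convenient for
filling a hash. A minimal sketch, using a made-up C<key=value> string:

```perl
use strict;
use warnings;

my $config = "width=80 height=24 depth=8";

# Two groupings per match: the flat list (key, value, key, value, ...)
# initializes the hash directly.
my %setting = $config =~ /(\w+)=(\d+)/g;

# No groupings: the list holds what the whole regexp matched each time.
my @numbers = $config =~ /\d+/g;

print "$_ is $setting{$_}\n" for sort keys %setting;
print "numbers: @numbers\n";
```
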
Closely associated with the C</g> modifier is the C<\G> anchor. The
C<\G> anchor matches at the point where the previous C</g> match left
off. C<\G> allows us to easily do context-sensitive matching:

    $metric = 1;  # use metric units
    ...
    $x = <FILE>;  # read in measurement
    $x =~ /^([+-]?\d+)\s*/g;  # get magnitude
    $weight = $1;
    if ($metric) { # error checking
        print "Units error!" unless $x =~ /\Gkg\./g;
    }
    else {
        print "Units error!" unless $x =~ /\Glbs\./g;
    }
    $x =~ /\G\s+(widget|sprocket)/g;  # continue processing

The combination of C</g> and C<\G> allows us to process the string a
bit at a time and use arbitrary Perl logic to decide what to do next.
Currently, the C<\G> anchor is only fully supported when used to anchor
to the start of the pattern.

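
A common use of C<\G> together with C</gc> is a small tokenizer that
tries one pattern after another at the current position. The sketch
below is only an illustration: the input string and the token labels
C<NUM>, C<ID> and C<OP> are invented for this example.

```perl
use strict;
use warnings;

my $src = "x = 42 + y";
my @tokens;

for ($src) {                   # alias $_ to $src for the matches below
    while (1) {
        # Each test starts where the last successful match ended;
        # /c keeps that position when a test fails.
        if    (/\G\s+/gc)    { next }                     # skip blanks
        elsif (/\G(\d+)/gc)  { push @tokens, "NUM($1)" }
        elsif (/\G(\w+)/gc)  { push @tokens, "ID($1)" }
        elsif (/\G([=+])/gc) { push @tokens, "OP($1)" }
        else                 { last }                     # nothing fits
    }
}

print "@tokens\n";
```

Without C</c>, the first failed test would reset the position and the
loop would start over from the beginning of the string.
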
C<\G> is also invaluable in processing fixed-length records with
regexps. Suppose we have a snippet of coding region DNA, encoded as
base pair letters C<ATCGTTGAAT...> and we want to find all the stop
codons C<TGA>. In a coding region, codons are 3-letter sequences, so
we can think of the DNA snippet as a sequence of 3-letter records. The
naive regexp

    # expanded, this is "ATC GTT GAA TGC AAA TGA CAT GAC"
    $dna = "ATCGTTGAATGCAAATGACATGAC";
    $dna =~ /TGA/;

doesn't work; it may match a C<TGA>, but there is no guarantee that
the match is aligned with codon boundaries, I<e.g.>, the substring
S<C<GTT GAA>> gives a match. A better solution is

    while ($dna =~ /(\w\w\w)*?TGA/g) {  # note the minimal *?
        print "Got a TGA stop codon at position ", pos $dna, "\n";
    }

which prints

    Got a TGA stop codon at position 18
    Got a TGA stop codon at position 23

Position 18 is good, but position 23 is bogus. What happened?

The answer is that our regexp works well until we get past the last
real match. Then the regexp will fail to match a synchronized C<TGA>
and start stepping ahead one character position at a time, not what we
want. The solution is to use C<\G> to anchor the match to the codon
alignment:

    while ($dna =~ /\G(\w\w\w)*?TGA/g) {
        print "Got a TGA stop codon at position ", pos $dna, "\n";
    }

This prints

    Got a TGA stop codon at position 18

which is the correct answer. This example illustrates that it is
important not only to match what is desired, but to reject what is not
desired.

(There are other regexp modifiers that are available, such as
C</o>, but their specialized uses are beyond the
scope of this introduction.)

=head3 Search and replace

Regular expressions also play a big role in I<search and replace>
operations in Perl. Search and replace is accomplished with the
C<s///> operator. The general form is
C<s/regexp/replacement/modifiers>, with everything we know about
regexps and modifiers applying in this case as well. The
I<replacement> is a Perl double-quoted string that replaces in the
string whatever is matched with the C<regexp>. The operator C<=~> is
also used here to associate a string with C<s///>. If matching
against C<$_>, the S<C<$_ =~>> can be dropped. If there is a match,
C<s///> returns the number of substitutions made; otherwise it returns
false. Here are a few examples:

    $x = "Time to feed the cat!";
    $x =~ s/cat/hacker/;   # $x contains "Time to feed the hacker!"
    if ($x =~ s/^(Time.*hacker)!$/$1 now!/) {
        $more_insistent = 1;
    }
    $y = "'quoted words'";
    $y =~ s/^'(.*)'$/$1/;  # strip single quotes,
                           # $y contains "quoted words"

In the last example, the whole string was matched, but only the part
inside the single quotes was grouped. With the C<s///> operator, the
matched variables C<$1>, C<$2>, I<etc>. are immediately available for use
in the replacement expression, so we use C<$1> to replace the quoted
string with just what was quoted. With the global modifier, C<s///g>
will search and replace all occurrences of the regexp in the string:

    $x = "I batted 4 for 4";
    $x =~ s/4/four/;   # doesn't do it all:
                       # $x contains "I batted four for 4"
    $x = "I batted 4 for 4";
    $x =~ s/4/four/g;  # does it all:
                       # $x contains "I batted four for four"

If you prefer "regex" over "regexp" in this tutorial, you could use
the following program to replace it:

    % cat > simple_replace
    #!/usr/bin/perl
    $regexp = shift;
    $replacement = shift;
    while (<>) {
        s/$regexp/$replacement/g;
        print;
    }
    ^D

    % simple_replace regexp regex perlretut.pod

In C<simple_replace> we used the C<s///g> modifier to replace all
occurrences of the regexp on each line. (Even though the regular
expression appears in a loop, Perl is smart enough to compile it
only once.) As with C<simple_grep>, both the
C<print> and the C<s/$regexp/$replacement/g> use C<$_> implicitly.

If you don't want C<s///> to change your original variable you can use
the non-destructive substitute modifier, C<s///r>. This changes the
behavior so that C<s///r> returns the final substituted string
(instead of the number of substitutions):

    $x = "I like dogs.";
    $y = $x =~ s/dogs/cats/r;
    print "$x $y\n";

That example will print "I like dogs. I like cats." Notice the original
C<$x> variable has not been affected. The overall
result of the substitution is instead stored in C<$y>. If the
substitution doesn't affect anything then the original string is
returned:

    $x = "I like dogs.";
    $y = $x =~ s/elephants/cougars/r;
    print "$x $y\n";  # prints "I like dogs. I like dogs."

One other interesting thing that the C<s///r> flag allows is chaining
substitutions:

    $x = "Cats are great.";
    print $x =~ s/Cats/Dogs/r =~ s/Dogs/Frogs/r =~
        s/Frogs/Hedgehogs/r, "\n";
    # prints "Hedgehogs are great."

A modifier available specifically to search and replace is the
C<s///e> evaluation modifier. C<s///e> treats the
replacement text as Perl code, rather than a double-quoted
string. The value that the code returns is substituted for the
matched substring. C<s///e> is useful if you need to do a bit of
computation in the process of replacing text. This example counts
character frequencies in a line:

    $x = "Bill the cat";
    $x =~ s/(.)/$chars{$1}++;$1/eg;  # final $1 replaces char with itself
    print "frequency of '$_' is $chars{$_}\n"
        foreach (sort {$chars{$b} <=> $chars{$a}} keys %chars);

This prints (the order of characters with equal counts may vary from
run to run)

    frequency of ' ' is 2
    frequency of 't' is 2
    frequency of 'l' is 2
    frequency of 'B' is 1
    frequency of 'c' is 1
    frequency of 'e' is 1
    frequency of 'h' is 1
    frequency of 'i' is 1
    frequency of 'a' is 1

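
A simpler use of C<s///e> is to do arithmetic on the text that was
matched. In the following sketch the sample sentence and the variable
names are our own; the replacement C<$1 * 2> is evaluated as Perl code
for every number found.

```perl
use strict;
use warnings;

my $recipe  = "3 eggs, 12 apples and 2 lemons";
my $doubled = $recipe;

# Replace every run of digits by twice its value; s///g returns the
# number of substitutions made.
my $count = ($doubled =~ s/(\d+)/$1 * 2/eg);

print "$doubled\n";
print "$count numbers were doubled\n";
```
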
As with the match C<m//> operator, C<s///> can use other delimiters,
such as C<s!!!> and C<s{}{}>, and even C<s{}//>. If single quotes are
used, as in C<s'''>, then the regexp and replacement are
treated as single-quoted strings and there are no
variable substitutions. C<s///> in list context
returns the same thing as in scalar context, I<i.e.>, the number of
matches.

=head3 The split function

The C<split()> function is another place where a regexp is used.
C<split /regexp/, string, limit> separates the C<string> operand into
a list of substrings and returns that list. The regexp must be designed
to match whatever constitutes the separators for the desired substrings.
The C<limit>, if present, constrains splitting into no more than C<limit>
number of strings. For example, to split a string into words, use

    $x = "Calvin and Hobbes";
    @words = split /\s+/, $x;  # $words[0] = 'Calvin'
                               # $words[1] = 'and'
                               # $words[2] = 'Hobbes'

If the empty regexp C<//> is used, the regexp always matches and
the string is split into individual characters. If the regexp has
groupings, then the resulting list contains the matched substrings from the
groupings as well. For instance,

    $x = "/usr/bin/perl";
    @dirs = split m!/!, $x;     # $dirs[0] = ''
                                # $dirs[1] = 'usr'
                                # $dirs[2] = 'bin'
                                # $dirs[3] = 'perl'
    @parts = split m!(/)!, $x;  # $parts[0] = ''
                                # $parts[1] = '/'
                                # $parts[2] = 'usr'
                                # $parts[3] = '/'
                                # $parts[4] = 'bin'
                                # $parts[5] = '/'
                                # $parts[6] = 'perl'

Since the first character of C<$x> matched the regexp, C<split> prepended
an empty initial element to the list.

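
The C<limit> argument and the empty regexp were described above without
an example. Here is a minimal sketch of both; the colon-separated record
is a made-up sample.

```perl
use strict;
use warnings;

my $entry = "daemon:x:1:1:System Daemon:/usr/sbin:/bin/false";

# With a limit of 3, splitting stops after two separators and the
# remainder of the string is returned as the third field.
my ($user, $password, $rest) = split /:/, $entry, 3;

# The empty regexp splits a string into individual characters.
my @chars = split //, "abc";

print "user: $user\n";
print "rest: $rest\n";
print "chars: @chars\n";
```
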
If you have read this far, congratulations! You now have all the basic
tools needed to use regular expressions to solve a wide range of text
processing problems. If this is your first time through the tutorial,
why not stop here and play around with regexps a while? S<Part 2>
concerns the more esoteric aspects of regular expressions and those
concepts certainly aren't needed right at the start.

=head1 Part 2: Power tools

OK, you know the basics of regexps and you want to know more. If
matching regular expressions is analogous to a walk in the woods, then
the tools discussed in Part 1 are analogous to topo maps and a
compass, basic tools we use all the time. Most of the tools in Part 2
are analogous to flare guns and satellite phones. They aren't used
too often on a hike, but when we are stuck, they can be invaluable.

What follows are the more advanced, less used, or sometimes esoteric
capabilities of Perl regexps. In Part 2, we will assume you are
comfortable with the basics and concentrate on the advanced features.

=head2 More on characters, strings, and character classes

There are a number of escape sequences and character classes that we
haven't covered yet.

There are several escape sequences that convert characters or strings
between upper and lower case, and they are also available within
patterns. C<\l> and C<\u> convert the next character to lower or
upper case, respectively:

    $x = "perl";
    $string =~ /\u$x/;   # matches 'Perl' in $string
    $x = "M(rs?|s)\\.";  # note the double backslash
    $string =~ /\l$x/;   # matches 'mr.', 'mrs.', and 'ms.'

A C<\L> or C<\U> indicates a lasting conversion of case, until
terminated by C<\E> or overridden by another C<\U> or C<\L>:

    $x = "This word is in lower case:\L SHOUT\E";
    $x =~ /shout/;       # matches
    $x = "I STILL KEYPUNCH CARDS FOR MY 360";
    $x =~ /\Ukeypunch/;  # matches punch card string

If there is no C<\E>, case is converted until the end of the
string. The regexps C<\L\u$word> or C<\u\L$word> convert the first
character of C<$word> to uppercase and the rest of the characters to
lowercase.

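
The case-conversion escapes behave the same way in double-quoted strings
and in regexps, which makes them easy to try out. A small sketch with an
arbitrary sample word:

```perl
use strict;
use warnings;

my $word = "pERL";

my $title = "\u\L$word";     # first letter upper case, rest lower case
my $shout = "\U$word\E!";    # \E ends the conversion before the '!'

# The same escapes inside a regexp: this looks for 'Perl'.
my $found = "I like Perl" =~ /\u\L$word/ ? 1 : 0;

print "$title $shout $found\n";
```
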
Control characters can be escaped with C<\c>, so that a control-Z
character would be matched with C<\cZ>. The escape sequence
C<\Q>...C<\E> quotes, or protects, most non-alphabetic characters. For
instance,

    $x = "\QThat !^*&%~& cat!";
    $x =~ /\Q!^*&%~&\E/;  # check for rough language

It does not protect C<'$'> or C<'@'>, so that variables can still be
substituted.

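
Because variables are still interpolated inside C<\Q>...C<\E>, the
sequence is a convenient way to match the contents of a variable
literally. The following sketch compares the quoted and unquoted forms;
the input string is a made-up example, and C<quotemeta> is the built-in
function that performs the same quoting.

```perl
use strict;
use warnings;

my $input = "a.b*c";    # contains the metacharacters '.' and '*'

# Unquoted, '.' and '*' keep their special meaning.
my $unquoted = "aXbbc" =~ /^$input$/ ? 1 : 0;

# Quoted, the variable's contents are matched character for character.
my $quoted = "aXbbc" =~ /^\Q$input\E$/ ? 1 : 0;
my $exact  = "a.b*c" =~ /^\Q$input\E$/ ? 1 : 0;

my $escaped = quotemeta $input;

print "unquoted: $unquoted, quoted: $quoted, exact: $exact\n";
print "quotemeta gives: $escaped\n";
```
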
C<\Q>, C<\L>, C<\l>, C<\U>, C<\u> and C<\E> are actually part of
double-quotish syntax, and not part of regexp syntax proper. They will
work if they appear in a regular expression embedded directly in a
program, but not when contained in a string that is interpolated in a
pattern.

Perl regexps can handle more than just the
standard ASCII character set. Perl supports I<Unicode>, a standard
for representing the alphabets from virtually all of the world's written
languages, and a host of symbols. Perl's text strings are Unicode strings, so
they can contain characters with a value (codepoint or character number) higher
than 255.

What does this mean for regexps? Well, regexp users don't need to know
much about Perl's internal representation of strings. But they do need
to know 1) how to represent Unicode characters in a regexp and 2) that
a matching operation will treat the string to be searched as a sequence
of characters, not bytes. The answer to 1) is that Unicode characters
greater than C<chr(255)> are represented using the C<\x{hex}> notation, because
C<\x>I<XY> (without curly braces, where I<XY> are two hex digits) doesn't
go further than 255. (Starting in Perl 5.14, if you're an octal fan,
you can also use C<\o{oct}>.)

    /\x{263a}/;  # match a Unicode smiley face :)

B<NOTE>: In Perl 5.6.0 it used to be that one needed to say C<use
utf8> to use any Unicode features. This is no longer the case: for
almost all Unicode processing, the explicit C<utf8> pragma is not
needed. (The only case where it matters is if your Perl script is in
Unicode and encoded in UTF-8; then an explicit C<use utf8> is needed.)

Figuring out the hexadecimal sequence of a Unicode character you want
or deciphering someone else's hexadecimal Unicode regexp is about as
much fun as programming in machine code. So another way to specify
Unicode characters is to use the I<named character> escape
sequence C<\N{I<name>}>. I<name> is a name for the Unicode character, as
specified in the Unicode standard. For instance, if we wanted to
represent or match the astrological sign for the planet Mercury, we
could use

    $x = "abc\N{MERCURY}def";
    $x =~ /\N{MERCURY}/;  # matches

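
The hexadecimal and named notations refer to the same character and can
be mixed freely. The sketch below checks this for the Mercury sign,
U+263F. The explicit C<use charnames ':full'> is included on the
assumption that the script may run on an older Perl that does not load
the names automatically, and C<charnames::viacode> is used to look the
name up from the code point.

```perl
use strict;
use warnings;
use charnames ':full';    # makes \N{...} available on older Perls too

my $x = "abc\x{263F}def";            # written with the hex notation

# The named form matches the character written in hex, and vice versa.
my $by_name = $x =~ /\N{MERCURY}/ ? 1 : 0;
my $by_hex  = "\N{MERCURY}" =~ /\x{263F}/ ? 1 : 0;

# Look up the official name of a code point.
my $name = charnames::viacode(0x263F);

printf "U+%04X is named %s\n", ord("\N{MERCURY}"), $name;
print "by name: $by_name, by hex: $by_hex\n";
```
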
One can also use "short" names:

    print "\N{GREEK SMALL LETTER SIGMA} is called sigma.\n";
    print "\N{greek:Sigma} is an upper-case sigma.\n";

You can also restrict names to a certain alphabet by specifying the
L<charnames> pragma:

    use charnames qw(greek);
    print "\N{sigma} is Greek sigma\n";

An index of character names is available on-line from the Unicode
Consortium, L<http://www.unicode.org/charts/charindex.html>; explanatory
material with links to other resources is available at
L<http://www.unicode.org/standard/where>.

The answer to requirement 2) is that a regexp (mostly)
uses Unicode characters. The "mostly" is for messy backward
compatibility reasons, but starting in Perl 5.14, any regexp compiled in
the scope of a C<use feature 'unicode_strings'> (which is automatically
turned on within the scope of a C<use 5.012> or higher) will turn that
"mostly" into "always". If you want to handle Unicode properly, you
should ensure that C<'unicode_strings'> is turned on.
Internally, a string is encoded to bytes using either UTF-8 or a native 8
bit encoding, depending on the history of the string, but conceptually
it is a sequence of characters, not bytes. See L<perlunitut> for a
tutorial about that.

Let us now discuss Unicode character classes, most usually called
"character properties". These are represented by the C<\p{I<name>}>
escape sequence. The negation of this is C<\P{I<name>}>. For example,
to match lower and uppercase characters,

    $x = "BOB";
    $x =~ /^\p{IsUpper}/;  # matches, uppercase char class
    $x =~ /^\P{IsUpper}/;  # doesn't match, char class sans uppercase
    $x =~ /^\p{IsLower}/;  # doesn't match, lowercase char class
    $x =~ /^\P{IsLower}/;  # matches, char class sans lowercase

(The "C<Is>" is optional.)

There are many, many Unicode character properties. For the full list
see L<perluniprops>. Most of them have synonyms with shorter names,
also listed there. Some synonyms are a single character. For these,
you can drop the braces. For instance, C<\pM> is the same thing as
C<\p{Mark}>, meaning things like accent marks.

The Unicode C<\p{Script}> and C<\p{Script_Extensions}> properties are
used to categorize every Unicode character into the language script it
is written in. (C<Script_Extensions> is an improved version of
C<Script>, which is retained for backward compatibility, and so you
should generally use C<Script_Extensions>.)
For example,
English, French, and a bunch of other European languages are written in
the Latin script. But there is also the Greek script, the Thai script,
the Katakana script, I<etc>. You can test whether a character is in a
particular script (based on C<Script_Extensions>) with, for example
C<\p{Latin}>, C<\p{Greek}>, or C<\p{Katakana}>. To test if it isn't in
the Balinese script, you would use C<\P{Balinese}>.

What we have described so far is the single form of the C<\p{...}> character
classes. There is also a compound form which you may run into. These
look like C<\p{I<name>=I<value>}> or C<\p{I<name>:I<value>}> (the equals sign and colon
can be used interchangeably). These are more general than the single form,
and in fact most of the single forms are just Perl-defined shortcuts for common
compound forms. For example, the script examples in the previous paragraph
could be written equivalently as C<\p{Script_Extensions=Latin}>, C<\p{Script_Extensions:Greek}>,
C<\p{script_extensions=katakana}>, and C<\P{script_extensions=balinese}> (case is irrelevant
between the C<{}> braces). You may
never have to use the compound forms, but sometimes it is necessary, and their
use can make your code easier to understand.

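
The script properties can be tried out on a few individual characters.
The following sketch assumes a Perl recent enough to know the
C<Script_Extensions> property; the three sample characters and the
C<%script> hash are chosen here just for illustration.

```perl
use strict;
use warnings;

my %sample = (
    'U+0041' => "A",           # LATIN CAPITAL LETTER A
    'U+03B1' => "\x{3B1}",     # GREEK SMALL LETTER ALPHA
    'U+30A2' => "\x{30A2}",    # KATAKANA LETTER A
);

# Classify each character with the single-form script properties.
my %script;
for my $label (sort keys %sample) {
    my $char = $sample{$label};
    $script{$label} = $char =~ /\p{Latin}/    ? 'Latin'
                    : $char =~ /\p{Greek}/    ? 'Greek'
                    : $char =~ /\p{Katakana}/ ? 'Katakana'
                    :                           'other';
    print "$label: $script{$label}\n";
}

# The compound form and a negated property.
my $compound = "\x{3B1}" =~ /\p{Script_Extensions=Greek}/ ? 1 : 0;
my $negated  = "A" =~ /\P{Greek}/ ? 1 : 0;

print "compound: $compound, negated: $negated\n";
```
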
C<\X> is an abbreviation for a character class that comprises
a Unicode I<extended grapheme cluster>. This represents a "logical character":
what appears to be a single character, but may be represented internally by more
than one. As an example, using the Unicode full names, "S<A + COMBINING
RING ABOVE>" is a grapheme cluster with base character "A" and combining character
"S<COMBINING RING ABOVE>", which translates in Danish to "A" with the circle atop it,
as in the word E<Aring>ngstrom.

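
The difference between characters and grapheme clusters shows up as soon
as we count them. In this sketch the word is spelled with a base "A"
followed by the combining ring (U+030A); the variable names are ours.

```perl
use strict;
use warnings;

# 'A' followed by the combining ring, then the rest of the word.
my $word = "A\x{30A}ngstrom";

my $characters = length $word;        # counts code points
my @clusters   = $word =~ /(\X)/g;    # one element per logical character
my $graphemes  = @clusters;

print "characters: $characters, grapheme clusters: $graphemes\n";
print "the first cluster holds ", length $clusters[0], " characters\n";
```
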
For the full and latest information about Unicode see the latest
Unicode standard, or the Unicode Consortium's website L<http://www.unicode.org>.

As if all those classes weren't enough, Perl also defines POSIX-style
character classes. These have the form C<[:I<name>:]>, with I<name> the
name of the POSIX class. The POSIX classes are C<alpha>, C<alnum>,
C<ascii>, C<cntrl>, C<digit>, C<graph>, C<lower>, C<print>, C<punct>,
C<space>, C<upper>, and C<xdigit>, and two extensions, C<word> (a Perl
extension to match C<\w>), and C<blank> (a GNU extension). The C</a>
modifier restricts these to matching just in the ASCII range; otherwise
they can match the same as their corresponding Perl Unicode classes:
C<[:upper:]> is the same as C<\p{IsUpper}>, I<etc>. (There are some
exceptions and gotchas with this; see L<perlrecharclass> for a full
discussion.) The C<[:digit:]>, C<[:word:]>, and
C<[:space:]> correspond to the familiar C<\d>, C<\w>, and C<\s>
character classes. To negate a POSIX class, put a C<'^'> in front of
the name, so that, I<e.g.>, C<[:^digit:]> corresponds to C<\D> and, under
Unicode, C<\P{IsDigit}>. The Unicode and POSIX character classes can
be used just like C<\d>, with the exception that POSIX character
classes can only be used inside of a character class:

    /\s+[abc[:digit:]xyz]\s*/;  # match a,b,c,x,y,z, or a digit
    /^=item\s[[:digit:]]/;      # match '=item',
                                # followed by a space and a digit
    /\s+[abc\p{IsDigit}xyz]\s+/;  # match a,b,c,x,y,z, or a digit
    /^=item\s\p{IsDigit}/;        # match '=item',
                                  # followed by a space and a digit

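The POSIX and Unicode forms discussed above can be compared side by side. The following is only a small sketch; the sample strings and variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample lines; only the first has a digit right after one space.
my @samples = ('=item 1', '=item x', '=item  2');

# POSIX class, which must sit inside a bracketed character class.
my @posix   = grep { /^=item\s[[:digit:]]/ } @samples;
# The corresponding Unicode property, usable outside brackets.
my @unicode = grep { /^=item\s\p{IsDigit}/ } @samples;
# Negated POSIX class: the character after the space is not a digit.
my @negated = grep { /^=item\s[[:^digit:]]/ } @samples;

print "posix:   @posix\n";
print "unicode: @unicode\n";
print "negated: @negated\n";
```

The first two patterns accept the same strings here; the negated class accepts the remaining ones.
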
Whew! That is all the rest of the characters and character classes.

=head2 Compiling and saving regular expressions

In Part 1 we mentioned that Perl compiles a regexp into a compact
sequence of opcodes. Thus, a compiled regexp is a data structure
that can be stored once and used again and again. The regexp quote
C<qr//> does exactly that: C<qr/string/> compiles the C<string> as a
regexp and transforms the result into a form that can be assigned to a
variable:

    $reg = qr/foo+bar?/;  # $reg contains a compiled regexp

Then C<$reg> can be used as a regexp:

    $x = "fooooba";
    $x =~ $reg;     # matches, just like /foo+bar?/
    $x =~ /$reg/;   # same thing, alternate form

C<$reg> can also be interpolated into a larger regexp:

    $x =~ /(abc)?$reg/;  # still matches

As with the matching operator, the regexp quote can use different
delimiters, I<e.g.>, C<qr!!>, C<qr{}> or C<qr~~>. Apostrophes
as delimiters (C<qr''>) inhibit any interpolation.

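As a rough illustration of the points above, this sketch uses a compiled regexp directly, interpolates it into a larger pattern, and uses apostrophe delimiters to keep C<@example> from being treated as an array. All variable names and sample strings are assumptions chosen for the demonstration.

```perl
use strict;
use warnings;

my $reg = qr/foo+bar?/;                       # compiled once
my $x   = "fooooba";

my $direct  = $x =~ $reg         ? 1 : 0;     # used on its own
my $wrapped = $x =~ /(abc)?$reg/ ? 1 : 0;     # interpolated into a larger regexp

# Modifiers given to qr// stay attached to the compiled regexp.
my $ci     = qr{hello}i;
my $ci_hit = "Say HELLO" =~ /^Say $ci$/ ? 1 : 0;

# Apostrophe delimiters: '@example' is taken literally, not interpolated.
my $raw     = qr'user@example';
my $raw_hit = 'mail user@example.com' =~ $raw ? 1 : 0;

print "direct=$direct wrapped=$wrapped ci=$ci_hit raw=$raw_hit\n";
```
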
Pre-compiled regexps are useful for creating dynamic matches that
don't need to be recompiled each time they are encountered. Using
pre-compiled regexps, we write a C<grep_step> program which greps
for a sequence of patterns, advancing to the next pattern as soon
as one has been satisfied.

    % cat > grep_step
    #!/usr/bin/perl
    # grep_step - match <number> regexps, one after the other
    # usage: grep_step <number> regexp1 regexp2 ... file1 file2 ...

    $number = shift;
    $regexp[$_] = shift foreach (0..$number-1);
    @compiled = map qr/$_/, @regexp;
    while ($line = <>) {
        if ($line =~ /$compiled[0]/) {
            print $line;
            shift @compiled;
            last unless @compiled;
        }
    }
    ^D

    % grep_step 3 shift print last grep_step
    $number = shift;
            print $line;
            last unless @compiled;

Storing pre-compiled regexps in an array C<@compiled> allows us to
simply loop through the regexps without any recompilation, thus gaining
flexibility without sacrificing speed.

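The same stepping logic can be tried without files or command-line arguments. This is a minimal sketch in which an in-memory list of made-up lines stands in for the input that C<grep_step> reads.

```perl
use strict;
use warnings;

# Hypothetical input lines and patterns, standing in for files and @ARGV.
my @lines    = ('alpha one', 'beta two', 'alpha three', 'gamma four', 'beta five');
my @compiled = map { qr/$_/ } ('alpha', 'beta', 'gamma');   # compiled once

my @hits;
for my $line (@lines) {
    if ($line =~ $compiled[0]) {
        push @hits, $line;
        shift @compiled;          # advance to the next pattern
        last unless @compiled;    # stop once every pattern has been satisfied
    }
}
print "$_\n" for @hits;
```

Note that C<'alpha three'> is skipped: by the time it is read, the program is already waiting for C<gamma>.
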
=head2 Composing regular expressions at runtime

Backtracking is more efficient than repeated tries with different regular
expressions. If there are several regular expressions and a match with
any of them is acceptable, then it is possible to combine them into a set
of alternatives. If the individual expressions are input data, this
can be done by programming a join operation. We'll exploit this idea in
an improved version of the C<simple_grep> program: a program that matches
multiple patterns:

    % cat > multi_grep
    #!/usr/bin/perl
    # multi_grep - match any of <number> regexps
    # usage: multi_grep <number> regexp1 regexp2 ... file1 file2 ...

    $number = shift;
    $regexp[$_] = shift foreach (0..$number-1);
    $pattern = join '|', @regexp;

    while ($line = <>) {
        print $line if $line =~ /$pattern/;
    }
    ^D

    % multi_grep 2 shift for multi_grep
    $number = shift;
    $regexp[$_] = shift foreach (0..$number-1);

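A variation on the join idea, offered as a sketch rather than as a replacement for C<multi_grep>: each alternative is wrapped in a non-capturing group before joining, so that an alternation inside one user pattern cannot bleed into its neighbours, and the result is compiled once with C<qr//>. The patterns and lines are made up.

```perl
use strict;
use warnings;

my @regexp  = ('sh.ft', 'for\b');                   # hypothetical user-supplied patterns
my $joined  = join '|', map { "(?:$_)" } @regexp;   # group each alternative
my $pattern = qr/$joined/;                          # compile the alternation once

my @lines = ('$number = shift;', 'print $line;', 'foreach (@list) {', 'wait for it');
my @hits  = grep { $_ =~ $pattern } @lines;
print "$_\n" for @hits;
```

Here C<'foreach (@list) {'> is rejected because the second pattern demands a word boundary right after C<for>.
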
Sometimes it is advantageous to construct a pattern from the I<input>
that is to be analyzed and use the permissible values on the left
hand side of the matching operations. As an example of this somewhat
paradoxical situation, let's assume that our input contains a command
verb which should match one out of a set of available command verbs,
with the additional twist that commands may be abbreviated as long as
the given string is unique. The program below demonstrates the basic
algorithm.

    % cat > keymatch
    #!/usr/bin/perl
    $kwds = 'copy compare list print';
    while( $cmd = <> ){
        $cmd =~ s/^\s+|\s+$//g;  # trim leading and trailing spaces
        if( ( @matches = $kwds =~ /\b$cmd\w*/g ) == 1 ){
            print "command: '@matches'\n";
        } elsif( @matches == 0 ){
            print "no such command: '$cmd'\n";
        } else {
            print "not unique: '$cmd' (could be one of: @matches)\n";
        }
    }
    ^D

    % keymatch
    li
    command: 'list'
    co
    not unique: 'co' (could be one of: copy compare)
    printer
    no such command: 'printer'

Rather than trying to match the input against the keywords, we match the
combined set of keywords against the input. The pattern matching
operation S<C<$kwds =~ /\b$cmd\w*/g>> does several things at the
same time. It makes sure that the given command begins where a keyword
begins (C<\b>). It tolerates abbreviations due to the added C<\w*>. It
tells us the number of matches (C<scalar @matches>) and all the keywords
that were actually matched. You could hardly ask for more.

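The matching step of C<keymatch> can be isolated in a subroutine. The sketch below makes two additions that are not part of the original program: it quotes the input with C<\Q...\E> so that metacharacters typed by the user are taken literally, and it rejects an empty command. The subroutine name C<candidates> is invented for this example.

```perl
use strict;
use warnings;

my $kwds = 'copy compare list print';

# Return every keyword that the (possibly abbreviated) command could mean.
sub candidates {
    my ($cmd) = @_;
    $cmd =~ s/^\s+|\s+$//g;             # trim leading and trailing spaces
    return () if $cmd eq '';            # an empty command would match every keyword
    return $kwds =~ /\b\Q$cmd\E\w*/g;   # \Q...\E: treat the input literally
}

for my $cmd ('li', 'co', 'printer', 'c.') {
    my @m = candidates($cmd);
    printf "%-10s -> %s\n", "'$cmd'", @m ? "@m" : '(none)';
}
```

Without the quoting, an input such as C<c.> would be read as "c followed by any character" and would match both C<copy> and C<compare>.
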
=head2 Embedding comments and modifiers in a regular expression

Starting with this section, we will be discussing Perl's set of
I<extended patterns>. These are extensions to the traditional regular
expression syntax that provide powerful new tools for pattern
matching. We have already seen extensions in the form of the minimal
matching constructs C<??>, C<*?>, C<+?>, C<{n,m}?>, and C<{n,}?>. Most
of the extensions below have the form C<(?char...)>, where the
C<char> is a character that determines the type of extension.

The first extension is an embedded comment C<(?#text)>. This embeds a
comment into the regular expression without affecting its meaning. The
comment should not have any closing parentheses in the text. An
example is

    /(?# Match an integer:)[+-]?\d+/;

This style of commenting has been largely superseded by the raw,
freeform commenting that is allowed with the C</x> modifier.

Most modifiers, such as C</i>, C</m>, C</s> and C</x> (or any
combination thereof) can also be embedded in
a regexp using C<(?i)>, C<(?m)>, C<(?s)>, and C<(?x)>. For instance,

    /(?i)yes/;  # match 'yes' case insensitively
    /yes/i;     # same thing
    /(?x)(          # freeform version of an integer regexp
             [+-]?  # match an optional sign
             \d+    # match a sequence of digits
         )
    /x;

Embedded modifiers can have two important advantages over the usual
modifiers. First, embedded modifiers allow a custom set of modifiers for
I<each> regexp pattern. This is great for matching an array of regexps
that must have different modifiers:

    $pattern[0] = '(?i)doctor';
    $pattern[1] = 'Johnson';
    ...
    while (<>) {
        foreach $patt (@pattern) {
            print if /$patt/;
        }
    }

The second advantage is that embedded modifiers (except C</p>, which
modifies the entire regexp) only affect the regexp
inside the group the embedded modifier is contained in. So grouping
can be used to localize the modifier's effects:

    /Answer: ((?i)yes)/;  # matches 'Answer: yes', 'Answer: YES', etc.

Embedded modifiers can also turn off any modifiers already present
by using, I<e.g.>, C<(?-i)>. Modifiers can also be combined into
a single expression, I<e.g.>, C<(?s-i)> turns on single line mode and
turns off case insensitivity.

Embedded modifiers may also be added to a non-capturing grouping.
C<(?i-m:regexp)> is a non-capturing grouping that matches C<regexp>
case insensitively and turns off multi-line mode.

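To see the scoping of embedded modifiers in action, here is a small sketch. The test strings and variable names are assumptions made up for the demonstration.

```perl
use strict;
use warnings;

my $scoped = qr/Answer: ((?i)yes)/;   # only 'yes' is case-insensitive

my %result;
for my $text ('Answer: YES', 'answer: yes', 'Answer: Yes') {
    $result{$text} = $text =~ $scoped ? 'match' : 'no match';
}

# A modifier attached to a non-capturing group ...
my $group = 'say HELLO'   =~ /say (?i:hello)/      ? 1 : 0;
# ... and (?-i) switching /i off for the rest of the pattern.
my $off   = 'HELLO world' =~ /hello (?-i)WORLD/i   ? 1 : 0;

print "$_: $result{$_}\n" for sort keys %result;
print "group=$group off=$off\n";
```

The lower-case C<'answer: yes'> fails because C<Answer:> lies outside the group that carries C<(?i)>, and the last match fails because C<WORLD> is compared case-sensitively after C<(?-i)>.
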
=head2 Looking ahead and looking behind

This section concerns the lookahead and lookbehind assertions. First,
a little background.

In Perl regular expressions, most regexp elements "eat up" a certain
amount of string when they match. For instance, the regexp element
C<[abc]> eats up one character of the string when it matches, in the
sense that Perl moves to the next character position in the string
after the match. There are some elements, however, that don't eat up
characters (that is, don't advance the character position) if they
match. The examples
we have seen so far are the anchors. The anchor C<'^'> matches the
beginning of the line, but doesn't eat any characters. Similarly, the
word boundary anchor C<\b> matches wherever a character matching C<\w>
is next to a character that doesn't, but it doesn't eat up any
characters itself. Anchors are examples of I<zero-width assertions>:
zero-width, because they consume
no characters, and assertions, because they test some property of the
string. In the context of our walk in the woods analogy to regexp
matching, most regexp elements move us along a trail, but anchors have
us stop a moment and check our surroundings. If the local environment
checks out, we can proceed forward. But if the local environment
doesn't satisfy us, we must backtrack.

Checking the environment entails either looking ahead on the trail,
looking behind, or both. C<'^'> looks behind, to see that there are no
characters before. C<'$'> looks ahead, to see that there are no
characters after. C<\b> looks both ahead and behind, to see if the
characters on either side differ in their "word-ness".

The lookahead and lookbehind assertions are generalizations of the
anchor concept. Lookahead and lookbehind are zero-width assertions
that let us specify which characters we want to test for. The
lookahead assertion is denoted by C<(?=regexp)> and the lookbehind
assertion is denoted by C<< (?<=fixed-regexp) >>. Some examples are

    $x = "I catch the housecat 'Tom-cat' with catnip";
    $x =~ /cat(?=\s)/;   # matches 'cat' in 'housecat'
    @catwords = ($x =~ /(?<=\s)cat\w+/g);  # matches,
                                           # $catwords[0] = 'catch'
                                           # $catwords[1] = 'catnip'
    $x =~ /\bcat\b/;  # matches 'cat' in 'Tom-cat'
    $x =~ /(?<=\s)cat(?=\s)/; # doesn't match; no isolated 'cat' in
                              # middle of $x

Note that the parentheses in C<(?=regexp)> and C<< (?<=regexp) >> are
non-capturing, since these are zero-width assertions. Thus in the
second regexp, the substrings captured are those of the whole regexp
itself. Lookahead C<(?=regexp)> can match arbitrary regexps, but
lookbehind C<< (?<=fixed-regexp) >> only works for regexps of fixed
width, I<i.e.>, a fixed number of characters long. Thus
C<< (?<=(ab|bc)) >> is fine, but C<< (?<=(ab)*) >> is not. The
negated versions of the lookahead and lookbehind assertions are
denoted by C<(?!regexp)> and C<< (?<!fixed-regexp) >> respectively.
They evaluate true if the regexps do I<not> match:

    $x = "foobar";
    $x =~ /foo(?!bar)/;  # doesn't match, 'bar' follows 'foo'
    $x =~ /foo(?!baz)/;  # matches, 'baz' doesn't follow 'foo'
    $x =~ /(?<!\s)foo/;  # matches, there is no \s before 'foo'

Here is an example where a string containing blank-separated words,
numbers and single dashes is to be split into its components.
Using C</\s+/> alone won't work, because spaces are not required between
dashes, or between a word and a dash. Additional places for a split are
established by looking ahead and behind:

    $str = "one two - --6-8";
    @toks = split / \s+              # a run of spaces
                  | (?<=\S) (?=-)    # any non-space followed by '-'
                  | (?<=-)  (?=\S)   # a '-' followed by any non-space
                  /x, $str;          # @toks = qw(one two - - - 6 - 8)

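Because lookaround assertions consume nothing, a substitution can insert text at a position defined purely by its surroundings. The sketch below, whose subroutine name C<commify> is invented here, inserts thousands separators into a plain integer; it is not meant to handle decimal fractions.

```perl
use strict;
use warnings;

# Insert a comma at every position that has a digit behind it and a
# whole number of three-digit groups ahead of it, up to the end of the digits.
sub commify {
    my ($n) = @_;
    $n =~ s/(?<=\d)(?=(?:\d{3})+(?!\d))/,/g;
    return $n;
}

print commify($_), "\n" for qw(7 1234 1234567 100000);
```

The pattern itself matches the empty string; the lookbehind and lookahead alone decide where a comma goes.
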
=head2 Using independent subexpressions to prevent backtracking

I<Independent subexpressions> are regular expressions, in the
context of a larger regular expression, that function independently of
the larger regular expression. That is, they consume as much or as
little of the string as they wish without regard for the ability of
the larger regexp to match. Independent subexpressions are represented
by C<< (?>regexp) >>. We can illustrate their behavior by first
considering an ordinary regexp:

    $x = "ab";
    $x =~ /a*ab/;  # matches

This obviously matches, but in the process of matching, the
subexpression C<a*> first grabbed the C<'a'>. Doing so, however,
wouldn't allow the whole regexp to match, so after backtracking, C<a*>
eventually gave back the C<'a'> and matched the empty string. Here, what
C<a*> matched was I<dependent> on what the rest of the regexp matched.

Contrast that with an independent subexpression:

    $x =~ /(?>a*)ab/;  # doesn't match!

The independent subexpression C<< (?>a*) >> doesn't care about the rest
of the regexp, so it sees an C<'a'> and grabs it. Then the rest of the
regexp C<ab> cannot match. Because C<< (?>a*) >> is independent, there
is no backtracking and the independent subexpression does not give
up its C<'a'>. Thus the match of the regexp as a whole fails. A similar
behavior occurs with completely independent regexps:

    $x = "ab";
    $x =~ /a*/g;   # matches, eats an 'a'
    $x =~ /\Gab/g; # doesn't match, no 'a' available

Here C</g> and C<\G> create a "tag team" handoff of the string from
one regexp to the other. Regexps with an independent subexpression are
much like this, with a handoff of the string to the independent
subexpression, and a handoff of the string back to the enclosing
regexp.

The ability of an independent subexpression to prevent backtracking
can be quite useful. Suppose we want to match a non-empty string
enclosed in parentheses up to two levels deep. Then the following
regexp matches:

    $x = "abc(de(fg)h";  # unbalanced parentheses
    $x =~ /\( ( [ ^ () ]+ | \( [ ^ () ]* \) )+ \)/xx;

The regexp matches an open parenthesis, one or more copies of an
alternation, and a close parenthesis. The alternation is two-way, with
the first alternative C<[^()]+> matching a substring with no
parentheses and the second alternative C<\([^()]*\)> matching a
substring delimited by parentheses. The problem with this regexp is
that it is pathological: it has nested indeterminate quantifiers
of the form C<(a+|b)+>. We discussed in Part 1 how nested quantifiers
like this could take an exponentially long time to execute if there
was no match possible. To prevent the exponential blowup, we need to
prevent useless backtracking at some point. This can be done by
enclosing the inner quantifier as an independent subexpression:

    $x =~ /\( ( (?> [ ^ () ]+ ) | \([ ^ () ]* \) )+ \)/xx;

Here, C<< (?>[^()]+) >> breaks the degeneracy of string partitioning
by gobbling up as much of the string as possible and keeping it. Then
match failures fail much more quickly.

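The behaviour described in this section can be checked directly. The following sketch also uses the possessive quantifier C<a*+>, available since Perl 5.10, which behaves like C<< (?>a*) >>; the variable names and the balanced test string are assumptions for illustration.

```perl
use strict;
use warnings;

my $x = "ab";
my $ordinary    = $x =~ /a*ab/     ? 1 : 0;   # a* gives its 'a' back
my $independent = $x =~ /(?>a*)ab/ ? 1 : 0;   # (?>a*) keeps its 'a'
my $possessive  = $x =~ /a*+ab/    ? 1 : 0;   # a*+ behaves like (?>a*)

# The two-level parenthesis matcher with the inner run made independent.
my $paren = qr/\( (?: (?> [^()]+ ) | \( [^()]* \) )+ \)/x;
my ($found) = "abc(de(fg)h)ij" =~ /($paren)/;

print "ordinary=$ordinary independent=$independent possessive=$possessive\n";
print "found: $found\n";
```
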
=head2 Conditional expressions

A I<conditional expression> is a form of if-then-else statement
that allows one to choose which patterns are to be matched, based on
some condition. There are two types of conditional expression:
C<(?(I<condition>)I<yes-regexp>)> and
C<(?(I<condition>)I<yes-regexp>|I<no-regexp>)>.
C<(?(I<condition>)I<yes-regexp>)> is
like an S<C<'if () {}'>> statement in Perl. If the I<condition> is true,
the I<yes-regexp> will be matched. If the I<condition> is false, the
I<yes-regexp> will be skipped and Perl will move on to the next regexp
element. The second form is like an S<C<'if () {} else {}'>> statement
in Perl. If the I<condition> is true, the I<yes-regexp> will be
matched, otherwise the I<no-regexp> will be matched.

The I<condition> can have several forms. The first form is simply an
integer in parentheses C<(I<integer>)>. It is true if the corresponding
backreference C<\I<integer>> matched earlier in the regexp. The same
thing can be done with a name associated with a capture group, written
as C<<< (E<lt>I<name>E<gt>) >>> or C<< ('I<name>') >>. The second form is a bare
zero-width assertion C<(?...)>, either a lookahead, a lookbehind, or a
code assertion (discussed in a later section). The third set of forms
provides tests that return true if the expression is executed within
a recursion (C<(R)>) or is being called from some capturing group,
referenced either by number (C<(R1)>, C<(R2)>,...) or by name
(C<(R&I<name>)>).

The integer or name form of the I<condition> allows us to choose,
with more flexibility, what to match based on what matched earlier in the
regexp. This searches for words of the form C<"$x$x"> or C<"$x$y$y$x">:

    % simple_grep '^(\w+)(\w+)?(?(2)\g2\g1|\g1)$' /usr/dict/words
    beriberi
    coco
    couscous
    deed
    ...
    toot
    toto
    tutu

The lookbehind I<condition> allows, along with backreferences,
an earlier part of the match to influence a later part of the
match. For instance,

    /[ATGC]+(?(?<=AA)G|C)$/;

matches a DNA sequence such that it either ends in C<AAG>, or some
other base pair combination and C<'C'>. Note that the form is
C<< (?(?<=AA)G|C) >> and not C<< (?((?<=AA))G|C) >>; for the
lookahead, lookbehind or code assertions, the parentheses around the
conditional are not needed.

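A compact way to try the integer form of the condition is a pattern that demands a closing parenthesis only if an opening one was captured. This is a sketch with made-up test strings, not part of the original tutorial.

```perl
use strict;
use warnings;

# If group 1 captured an opening parenthesis, require the closing one;
# if group 1 did not take part in the match, the conditional matches nothing.
my $num = qr/^ (\()? \d+ (?(1) \) ) $/x;

my %verdict = map { $_ => ($_ =~ $num ? 'ok' : 'rejected') }
              ('42', '(42)', '(42', '42)');
print "$_ => $verdict{$_}\n" for sort keys %verdict;
```

Only the bare number and the fully parenthesized number are accepted; the two unbalanced forms are rejected.
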
=head2 Defining named patterns

Some regular expressions use identical subpatterns in several places.
Starting with Perl 5.10, it is possible to define named subpatterns in
a section of the pattern so that they can be called up by name
anywhere in the pattern. The syntactic pattern for this definition
group is C<< (?(DEFINE)(?<I<name>>I<pattern>)...) >>. An insertion
of a named pattern is written as C<(?&I<name>)>.

The example below illustrates this feature using the pattern for
floating point numbers that was presented earlier on. The three
subpatterns that are used more than once are the optional sign, the
digit sequence for an integer and the decimal fraction. The C<DEFINE>
group at the end of the pattern contains their definition. Notice
that the decimal fraction pattern is the first place where we can
reuse the integer pattern.

    /^ (?&osg)\ * ( (?&int)(?&dec)? | (?&dec) )
       (?: [eE](?&osg)(?&int) )?
     $
     (?(DEFINE)
       (?<osg>[-+]?)         # optional sign
       (?<int>\d++)          # integer
       (?<dec>\.(?&int))     # decimal fraction
     )/x

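The floating point pattern above can be compiled with C<qr//> and tried against a few candidates. In this sketch the variable names and sample strings are assumptions chosen for illustration.

```perl
use strict;
use warnings;

my $float = qr/^ (?&osg)\ * ( (?&int)(?&dec)? | (?&dec) )
                 (?: [eE](?&osg)(?&int) )?
               $
               (?(DEFINE)
                 (?<osg>[-+]?)         # optional sign
                 (?<int>\d++)          # integer
                 (?<dec>\.(?&int))     # decimal fraction
               )/x;

# Hypothetical candidates: the last three are not valid under this pattern.
my @candidates = ('3.14', '-.5e10', '+ 12', '1e', 'abc', '7.');
my @accepted   = grep { $_ =~ $float } @candidates;
print "accepted: ", join(' | ', @accepted), "\n";
```

Note that C<'7.'> is rejected, since the decimal fraction subpattern requires at least one digit after the point.
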
=head2 Recursive patterns

This feature (introduced in Perl 5.10) significantly extends the
power of Perl's pattern matching. By referring to some other
capture group anywhere in the pattern with the construct
C<(?I<group-ref>)>, the I<pattern> within the referenced group is used
as an independent subpattern in place of the group reference itself.
Because the group reference may be contained I<within> the group it
refers to, it is now possible to apply pattern matching to tasks that
hitherto required a recursive parser.

To illustrate this feature, we'll design a pattern that matches if
a string contains a palindrome. (This is a word or a sentence that,
while ignoring spaces, punctuation and case, reads the same backwards
as forwards.) We begin by observing that the empty string or a string
containing just one word character is a palindrome. Otherwise it must
have a word character up front and the same at its end, with another
palindrome in between.

    /(?: (\w) (?...Here be a palindrome...) \g{-1} | \w? )/x

Adding C<\W*> at either end to eliminate what is to be ignored, we already
have the full pattern:

    my $pp = qr/^(\W* (?: (\w) (?1) \g{-1} | \w? ) \W*)$/ix;
    for $s ( "saippuakauppias", "A man, a plan, a canal: Panama!" ){
        print "'$s' is a palindrome\n" if $s =~ /$pp/;
    }

In C<(?...)> both absolute and relative backreferences may be used.
The entire pattern can be reinserted with C<(?R)> or C<(?0)>.
If you prefer to name your groups, you can use C<(?&I<name>)> to
recurse into that group.

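Another task that traditionally needed a recursive parser is checking for balanced parentheses. The sketch below recurses into group 1 with C<(?1)>; the test strings are made up for illustration.

```perl
use strict;
use warnings;

# Group 1 matches one balanced pair of parentheses; (?1) recurses into
# group 1 for every nested pair.
my $balanced = qr/^ ( \( (?: [^()]++ | (?1) )* \) ) $/x;

my @ok = grep { $_ =~ $balanced } ('()', '(a(b)c)', '((x)(y))', '(a(b)c', '(a))');
print "balanced: @ok\n";
```

The possessive C<[^()]++> keeps the run of ordinary characters from being repartitioned on failure, in the spirit of the independent subexpressions discussed earlier.
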
=head2 A bit of magic: executing Perl code in a regular expression

Normally, regexps are a part of Perl expressions.
I<Code evaluation> expressions turn that around by allowing
arbitrary Perl code to be a part of a regexp. A code evaluation
expression is denoted C<(?{I<code>})>, with I<code> a string of Perl
statements.

Code expressions are zero-width assertions, and the value they return
depends on their environment. There are two possibilities: either the
code expression is used as a conditional in a conditional expression
C<(?(I<condition>)...)>, or it is not. If the code expression is a
conditional, the code is evaluated and the result (I<i.e.>, the result of
the last statement) is used to determine truth or falsehood. If the
code expression is not used as a conditional, the assertion always
evaluates true and the result is put into the special variable
C<$^R>. The variable C<$^R> can then be used in code expressions later
in the regexp. Here are some silly examples:

    $x = "abcdef";
    $x =~ /abc(?{print "Hi Mom!";})def/; # matches,
                                         # prints 'Hi Mom!'
    $x =~ /aaa(?{print "Hi Mom!";})def/; # doesn't match,
                                         # no 'Hi Mom!'

Pay careful attention to the next example:

    $x =~ /abc(?{print "Hi Mom!";})ddd/; # doesn't match,
                                         # no 'Hi Mom!'
                                         # but why not?

At first glance, you'd think that it shouldn't print, because obviously
the C<ddd> isn't going to match the target string. But look at this
example:

    $x =~ /abc(?{print "Hi Mom!";})[dD]dd/; # doesn't match,
                                            # but _does_ print

Hmm. What happened here? If you've been following along, you know that
the above pattern should be effectively (almost) the same as the last one;
enclosing the C<'d'> in a character class isn't going to change what it
matches. So why does the first not print while the second one does?

The answer lies in the optimizations the regexp engine makes. In the first
case, all the engine sees are plain old characters (aside from the
C<?{}> construct). It's smart enough to realize that the string C<'ddd'>
doesn't occur in our target string before actually running the pattern
through. But in the second case, we've tricked it into thinking that our
pattern is more complicated. It takes a look, sees our
character class, and decides that it will have to actually run the
pattern to determine whether or not it matches, and in the process of
running it hits the print statement before it discovers that we don't
have a match.

To take a closer look at how the engine does optimizations, see the
section L</"Pragmas and debugging"> below.

More fun with C<?{}>:

    $x =~ /(?{print "Hi Mom!";})/;         # matches,
                                           # prints 'Hi Mom!'
    $x =~ /(?{$c = 1;})(?{print "$c";})/;  # matches,
                                           # prints '1'
    $x =~ /(?{$c = 1;})(?{print "$^R";})/; # matches,
                                           # prints '1'

The bit of magic mentioned in the section title occurs when the regexp
backtracks in the process of searching for a match. If the regexp
backtracks over a code expression and if the variables used within are
localized using C<local>, the changes in the variables produced by the
code expression are undone! Thus, if we wanted to count how many times
a character got matched inside a group, we could use, I<e.g.>,

    $x = "aaaa";
    $count = 0;  # initialize 'a' count
    $c = "bob";  # test if $c gets clobbered
    $x =~ /(?{local $c = 0;})         # initialize count
           ( a                        # match 'a'
             (?{local $c = $c + 1;})  # increment count
           )*                         # do this any number of times,
           aa                         # but match 'aa' at the end
           (?{$count = $c;})          # copy local $c var into $count
          /x;
    print "'a' count is $count, \$c variable is '$c'\n";

This prints

    'a' count is 2, $c variable is 'bob'

If we replace the S<C< (?{local $c = $c + 1;})>> with
S<C< (?{$c = $c + 1;})>>, the variable changes are I<not> undone
during backtracking, and we get

    'a' count is 4, $c variable is 'bob'

Note that only localized variable changes are undone. Other side
effects of code expression execution are permanent. Thus

    $x = "aaaa";
    $x =~ /(a(?{print "Yow\n";}))*aa/;

produces

    Yow
    Yow
    Yow
    Yow

The result C<$^R> is automatically localized, so that it will behave
properly in the presence of backtracking.

This example uses a code expression in a conditional to match a
definite article, either C<'the'> in English or C<'der|die|das'> in
German:

    $lang = 'DE';  # use German
    ...
    $text = "das";
    print "matched\n"
        if $text =~ /(?(?{
                          $lang eq 'EN'; # is the language English?
                         })
                       the |             # if so, then match 'the'
                       (der|die|das)     # else, match 'der|die|das'
                     )
                    /xi;

Note that the syntax here is C<(?(?{...})I<yes-regexp>|I<no-regexp>)>, not
C<(?((?{...}))I<yes-regexp>|I<no-regexp>)>. In other words, in the case of a
code expression, we don't need the extra parentheses around the
conditional.

If you try to use code expressions where the code text is contained within
an interpolated variable, rather than appearing literally in the pattern,
Perl may surprise you:

    $bar = 5;
    $pat = '(?{ 1 })';
    /foo(?{ $bar })bar/; # compiles ok, $bar not interpolated
    /foo(?{ 1 })$bar/;   # compiles ok, $bar interpolated
    /foo${pat}bar/;      # compile error!

    $pat = qr/(?{ $foo = 1 })/;  # precompile code regexp
    /foo${pat}bar/;      # compiles ok

If a regexp has a variable that interpolates a code expression, Perl
treats the regexp as an error. If the code expression is precompiled into
a variable, however, interpolating is ok. The question is, why is this an
error?

The reason is that variable interpolation and code expressions
together pose a security risk. The combination is dangerous because
programmers who write search engines often take user input and
plug it directly into a regexp:

    $regexp = <>;       # read user-supplied regexp
    chomp $regexp;      # get rid of possible newline
    $text =~ /$regexp/; # search $text for the $regexp

If the C<$regexp> variable contains a code expression, the user could
then execute arbitrary Perl code. For instance, some joker could
search for S<C<system('rm -rf *');>> to erase your files. In this
sense, the combination of interpolation and code expressions I<taints>
your regexp. So by default, using both interpolation and code
expressions in the same regexp is not allowed. If you're not
concerned about malicious users, it is possible to bypass this
security check by invoking S<C<use re 'eval'>>:

    use re 'eval';       # throw caution out the door
    $bar = 5;
    $pat = '(?{ 1 })';
    /foo${pat}bar/;      # compiles ok

Another form of code expression is the I<pattern code expression>.
The pattern code expression is like a regular code expression, except
that the result of the code evaluation is treated as a regular
expression and matched immediately. A simple example is

    $length = 5;
    $char = 'a';
    $x = 'aaaaabb';
    $x =~ /(??{$char x $length})/x; # matches, there are 5 of 'a'

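A pattern code expression can also build its sub-pattern from something captured earlier in the same match. The sketch below assumes a made-up input format in which a leading digit announces how many C<x> characters must follow; the variable names are invented for the example.

```perl
use strict;
use warnings;

# The leading digit says how many 'x' characters must follow; the
# (??{ ... }) block builds the sub-pattern 'x{N}' from $1 while matching.
my $counted = qr/^(\d)((??{ "x{$1}" }))$/;

my @ok = grep { $_ =~ $counted } ('3xxx', '2xxx', '1x');
print "accepted: @ok\n";
```

The string C<'2xxx'> is rejected because the generated sub-pattern C<x{2}> leaves one C<x> before the end of the string.
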
This final example contains both ordinary and pattern code
expressions. It detects whether a binary string C<1101010010001...> has a
Fibonacci spacing 0,1,1,2,3,5,... of the C<'1'>'s:

    $x = "1101010010001000001";
    $z0 = ''; $z1 = '0';   # initial conditions
    print "It is a Fibonacci sequence\n"
        if $x =~ /^1         # match an initial '1'
            (?:
               ((??{ $z0 })) # match some '0'
               1             # and then a '1'
               (?{ $z0 = $z1; $z1 .= $^N; })
            )+   # repeat as needed
          $      # that is all there is
         /x;
    printf "Largest sequence matched was %d\n", length($z1)-length($z0);

Remember that C<$^N> is set to whatever was matched by the last
completed capture group. This prints

    It is a Fibonacci sequence
    Largest sequence matched was 5

Ha! Try that with your garden variety regexp package...

Note that the variables C<$z0> and C<$z1> are not substituted when the
regexp is compiled, as happens for ordinary variables outside a code
expression. Rather, the whole code block is parsed as perl code at the
same time as perl is compiling the code containing the literal regexp
pattern.

This regexp without the C</x> modifier is

    /^1(?:((??{ $z0 }))1(?{ $z0 = $z1; $z1 .= $^N; }))+$/

which shows that spaces are still possible in the code parts. Nevertheless,
when working with code and conditional expressions, the extended form of
regexps is almost necessary in creating and debugging regexps.

2766=head2 Backtracking control verbs
2767
2768Perl 5.10 introduced a number of control verbs intended to provide
2769detailed control over the backtracking process, by directly influencing
15776bb0
KW
2770the regexp engine and by providing monitoring techniques. See
2771L<perlre/"Special Backtracking Control Verbs"> for a detailed
2772description.
7638d2dc
WL
2773
2774Below is just one example, illustrating the control verb C<(*FAIL)>,
2775which may be abbreviated as C<(*F)>. If this is inserted in a regexp
6b3ddc02
FC
2776it will cause it to fail, just as it would at some
2777mismatch between the pattern and the string. Processing
2778of the regexp continues as it would after any "normal"
353c6505
DL
2779failure, so that, for instance, the next position in the string or another
2780alternative will be tried. As failing to match doesn't preserve capture
c27a5cfe 2781groups or produce results, it may be necessary to use this in
7638d2dc
WL
2782combination with embedded code.
2783
2784 %count = ();
b539c2c9 2785 "supercalifragilisticexpialidocious" =~
c2e2285d 2786 /([aeiou])(?{ $count{$1}++; })(*FAIL)/i;
7638d2dc
WL
2787 printf "%3d '%s'\n", $count{$_}, $_ for (sort keys %count);
2788
353c6505
DL
2789The pattern begins with a class matching a subset of letters. Whenever
2790this matches, a statement like C<$count{'a'}++;> is executed, incrementing
2791the letter's counter. Then C<(*FAIL)> does what it says, and
6b3ddc02
FC
2792the regexp engine proceeds according to the book: as long as the end of
2793the string hasn't been reached, the position is advanced before looking
7638d2dc 2794for another vowel. Thus, match or no match makes no difference, and the
e1020413 2795regexp engine proceeds until the entire string has been inspected.
7638d2dc
WL
2796(It's remarkable that an alternative solution using something like
2797
b539c2c9 2798 $count{lc($_)}++ for split('', "supercalifragilisticexpialidocious");
7638d2dc
WL
2799 printf "%3d '%s'\n", $count2{$_}, $_ for ( qw{ a e i o u } );
2800
2801is considerably slower.)
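
The same combination of embedded code and C<(*FAIL)> can be used to visit
every starting position in a string, including overlapping ones. The
following is only a sketch; the array C<@pairs> and the sample string are
invented for the illustration.

```perl
use strict;
use warnings;

my @pairs;    # hypothetical accumulator for the substrings seen so far

# Record each two-character window, then force a failure so that the
# engine advances one position and tries again.
"abcd" =~ /(\w\w)(?{ push @pairs, $1 })(*FAIL)/;

print "@pairs\n";
```

Because the overall match always fails, the recorded substrings are
available only through C<@pairs>, not through C<$1> after the match.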


=head2 Pragmas and debugging

Speaking of debugging, there are several pragmas available to control
and debug regexps in Perl. We have already encountered one pragma in
the previous section, S<C<use re 'eval';>>, that allows variable
interpolation and code expressions to coexist in a regexp. The other
pragmas are

    use re 'taint';
    $tainted = <>;
    @parts = ($tainted =~ /(\w+)\s+(\w+)/); # @parts is now tainted

The C<taint> pragma causes any substrings from a match with a tainted
variable to be tainted as well. This is not normally the case, as
regexps are often used to extract the safe bits from a tainted
variable. Use C<taint> when you are not extracting safe bits, but are
performing some other processing. Both C<taint> and C<eval> pragmas
are lexically scoped, which means they are in effect only until
the end of the block enclosing the pragmas.

    use re '/m';  # or any other flags
    $multiline_string =~ /^foo/; # /m is implied

The C<re '/flags'> pragma (introduced in Perl
5.14) turns on the given regular expression flags
until the end of the lexical scope. See
L<re/"'E<sol>flags' mode"> for more
detail.
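
A short sketch of the lexical scoping follows. The variable names
C<$text>, C<$inside> and C<$outside> are invented for the illustration,
and Perl 5.14 or later is assumed.

```perl
use strict;
use warnings;

my $text = "first line\nfoo on the second line\n";

my ($inside, $outside);
{
    use re '/m';                        # flags apply to this block only
    $inside = $text =~ /^foo/ ? 1 : 0;  # ^ may match after the embedded newline
}
$outside = $text =~ /^foo/ ? 1 : 0;     # without /m, ^ matches only at the start

print "inside block: $inside, outside block: $outside\n";
```

The pattern is textually identical in both places; only the enclosing
scope decides whether C</m> is in effect.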

    use re 'debug';
    /^(.*)$/s;       # output debugging info

    use re 'debugcolor';
    /^(.*)$/s;       # output debugging info in living color

The global C<debug> and C<debugcolor> pragmas allow one to get
detailed debugging info about regexp compilation and
execution. C<debugcolor> is the same as C<debug>, except the debugging
information is displayed in color on terminals that can display
termcap color sequences. Here is example output:

    % perl -e 'use re "debug"; "abc" =~ /a*b+c/;'
    Compiling REx 'a*b+c'
    size 9 first at 1
       1: STAR(4)
       2:   EXACT <a>(0)
       4: PLUS(7)
       5:   EXACT <b>(0)
       7: EXACT <c>(9)
       9: END(0)
    floating 'bc' at 0..2147483647 (checking floating) minlen 2
    Guessing start of match, REx 'a*b+c' against 'abc'...
     Found floating substr 'bc' at offset 1...
     Guessed: match at offset 0
    Matching REx 'a*b+c' against 'abc'
      Setting an EVAL scope, savestack=3
       0 <> <abc>              |  1:  STAR
                               EXACT <a> can match 1 times out of 32767...
      Setting an EVAL scope, savestack=3
       1 <a> <bc>              |  4:    PLUS
                               EXACT <b> can match 1 times out of 32767...
      Setting an EVAL scope, savestack=3
       2 <ab> <c>              |  7:      EXACT <c>
       3 <abc> <>              |  9:      END
    Match successful!
    Freeing REx: 'a*b+c'

If you have gotten this far into the tutorial, you can probably guess
what the different parts of the debugging output tell you. The first
part

    Compiling REx 'a*b+c'
    size 9 first at 1
       1: STAR(4)
       2:   EXACT <a>(0)
       4: PLUS(7)
       5:   EXACT <b>(0)
       7: EXACT <c>(9)
       9: END(0)

describes the compilation stage. C<STAR(4)> means that there is a
starred object, in this case C<'a'>, and if it matches, go to line 4,
I<i.e.>, C<PLUS(7)>. The middle lines describe some heuristics and
optimizations performed before a match:

    floating 'bc' at 0..2147483647 (checking floating) minlen 2
    Guessing start of match, REx 'a*b+c' against 'abc'...
     Found floating substr 'bc' at offset 1...
     Guessed: match at offset 0

Then the match is executed and the remaining lines describe the
process:

    Matching REx 'a*b+c' against 'abc'
      Setting an EVAL scope, savestack=3
       0 <> <abc>              |  1:  STAR
                               EXACT <a> can match 1 times out of 32767...
      Setting an EVAL scope, savestack=3
       1 <a> <bc>              |  4:    PLUS
                               EXACT <b> can match 1 times out of 32767...
      Setting an EVAL scope, savestack=3
       2 <ab> <c>              |  7:      EXACT <c>
       3 <abc> <>              |  9:      END
    Match successful!
    Freeing REx: 'a*b+c'

Each step is of the form S<C<< n <x> <y> >>>, with C<< <x> >> the
part of the string matched and C<< <y> >> the part not yet
matched. The S<C<< |  1:  STAR >>> says that Perl is at line number 1
in the compilation list above. See
L<perldebguts/"Debugging Regular Expressions"> for much more detail.

An alternative method of debugging regexps is to embed C<print>
statements within the regexp. This provides a blow-by-blow account of
the backtracking in an alternation:

    "that this" =~ m@(?{print "Start at position ", pos, "\n";})
                     t(?{print "t1\n";})
                     h(?{print "h1\n";})
                     i(?{print "i1\n";})
                     s(?{print "s1\n";})
                         |
                     t(?{print "t2\n";})
                     h(?{print "h2\n";})
                     a(?{print "a2\n";})
                     t(?{print "t2\n";})
                     (?{print "Done at position ", pos, "\n";})
                    @x;

prints

    Start at position 0
    t1
    h1
    t2
    h2
    a2
    t2
    Done at position 4
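
When the trace is to be inspected by a program rather than read from the
terminal, the code blocks can append to a list instead of printing. This
is a sketch of that variation, not part of the example above: the helper
C<note>, the array C<@trace> and the label C<t2b> (used to tell the two
C<t> characters of the second alternative apart) are all invented here.

```perl
use strict;
use warnings;

my @trace;                        # hypothetical log of the steps taken
sub note { push @trace, $_[0] }   # hypothetical helper called from the code blocks

"that this" =~ m{
    (?:
        t (?{ note("t1") })
        h (?{ note("h1") })
        i (?{ note("i1") })
        s (?{ note("s1") })
      |
        t (?{ note("t2") })
        h (?{ note("h2") })
        a (?{ note("a2") })
        t (?{ note("t2b") })
    )
}x;

print join(" ", @trace), "\n";
```

The entries C<t1> and C<h1> stay in the list even though the first
alternative is abandoned, which is what makes the backtracking visible.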

=head1 SEE ALSO

This is just a tutorial. For the full story on Perl regular
expressions, see the L<perlre> regular expressions reference page.

For more information on the matching C<m//> and substitution C<s///>
operators, see L<perlop/"Regexp Quote-Like Operators">. For
information on the C<split> operation, see L<perlfunc/split>.

For an excellent all-around resource on the care and feeding of
regular expressions, see the book I<Mastering Regular Expressions> by
Jeffrey Friedl (published by O'Reilly, ISBN 1-56592-257-3).

=head1 AUTHOR AND COPYRIGHT

Copyright (c) 2000 Mark Kvale.
All rights reserved.
Now maintained by Perl porters.

This document may be distributed under the same terms as Perl itself.

=head2 Acknowledgments

The inspiration for the stop codon DNA example came from the ZIP
code example in chapter 7 of I<Mastering Regular Expressions>.

The author would like to thank Jeff Pinyan, Andrew Johnson, Peter
Haworth, Ronald J Kimball, and Joe Smith for all their helpful
comments.

=cut