Commit | Line | Data |
---|---|---|
47f9c88b GS |
1 | =head1 NAME |
2 | ||
3 | perlretut - Perl regular expressions tutorial | |
4 | ||
5 | =head1 DESCRIPTION | |
6 | ||
7 | This page provides a basic tutorial on understanding, creating and | |
8 | using regular expressions in Perl. It serves as a complement to the | |
9 | reference page on regular expressions L<perlre>. Regular expressions | |
10 | are an integral part of the C<m//>, C<s///>, C<qr//> and C<split> | |
11 | operators and so this tutorial also overlaps with | |
12 | L<perlop/"Regexp Quote-Like Operators"> and L<perlfunc/split>. | |
13 | ||
14 | Perl is widely renowned for excellence in text processing, and regular | |
15 | expressions are one of the big factors behind this fame. Perl regular | |
16 | expressions display an efficiency and flexibility unknown in most | |
17 | other computer languages. Mastering even the basics of regular | |
18 | expressions will allow you to manipulate text with surprising ease. | |
19 | ||
28285ecf KW |
20 | What is a regular expression? At its most basic, a regular expression |
21 | is a template that is used to determine if a string has certain | |
22 | characteristics. The string is most often some text, such as a line, | |
23 | sentence, web page, or even a whole book, but less commonly it could be | |
24 | some binary data as well. | |
25 | Suppose we want to determine if the text in the variable C<$var> contains | |
15776bb0 | 26 | the sequence of characters S<C<m u s h r o o m>> |
28285ecf KW |
27 | (blanks added for legibility). We can write in Perl |
28 | ||
29 | $var =~ m/mushroom/ | |
30 | ||
31 | The value of this expression will be TRUE if C<$var> contains that | |
32 | sequence of characters, and FALSE otherwise. The portion enclosed in | |
15776bb0 | 33 | C<'E<sol>'> characters denotes the characteristic we are looking for. |
28285ecf KW |
34 | We use the term I<pattern> for it. The process of looking to see if the |
35 | pattern occurs in the string is called I<matching>, and the C<"=~"> | |
15776bb0 | 36 | operator along with the C<m//> tell Perl to try to match the pattern |
28285ecf KW |
37 | against the string. Note that the pattern is also a string, but a very |
38 | special kind of one, as we will see. Patterns are in common use these | |
39 | days; | |
47f9c88b | 40 | examples are the patterns typed into a search engine to find web pages |
15776bb0 KW |
41 | and the patterns used to list files in a directory, I<e.g.>, "C<ls *.txt>" |
42 | or "C<dir *.*>". In Perl, the patterns described by regular expressions | |
28285ecf KW |
43 | are used not only to search strings, but also to extract desired parts |
44 | of strings, and to do search and replace operations. | |
47f9c88b GS |
45 | |
46 | Regular expressions have the undeserved reputation of being abstract | |
28285ecf KW |
47 | and difficult to understand. This reputation arises simply because the |
48 | notation used to express them tends to be terse and dense, and not | |
f1dc5bb2 | 49 | because of inherent complexity. We recommend using the C</x> regular |
28285ecf KW |
50 | expression modifier (described below) along with plenty of white space |
51 | to make them less dense, and easier to read. Regular expressions are | |
52 | constructed using | |
47f9c88b GS |
53 | simple concepts like conditionals and loops and are no more difficult |
54 | to understand than the corresponding C<if> conditionals and C<while> | |
28285ecf | 55 | loops in the Perl language itself. |
47f9c88b GS |
56 | |
57 | This tutorial flattens the learning curve by discussing regular | |
58 | expression concepts, along with their notation, one at a time and with | |
59 | many examples. The first part of the tutorial will progress from the | |
60 | simplest word searches to the basic regular expression concepts. If | |
61 | you master the first part, you will have all the tools needed to meet | |
62 | about 98% of your needs. The second part of the tutorial is for those | |
63 | comfortable with the basics and hungry for more power tools. It | |
64 | discusses the more advanced regular expression operators and | |
8ccb1477 | 65 | introduces the latest cutting-edge innovations. |
47f9c88b | 66 | |
15776bb0 | 67 | A note: to save time, "regular expression" is often abbreviated as |
47f9c88b GS |
68 | regexp or regex. Regexp is a more natural abbreviation than regex, but |
69 | is harder to pronounce. The Perl pod documentation is evenly split on | |
70 | regexp vs regex; in Perl, there is more than one way to abbreviate it. | |
71 | We'll use regexp in this tutorial. | |
72 | ||
67cdf558 KW |
73 | New in v5.22, L<C<use re 'strict'>|re/'strict' mode> applies stricter |
74 | rules than otherwise when compiling regular expression patterns. It can | |
75 | find things that, while legal, may not be what you intended. | |
76 | ||
47f9c88b GS |
77 | =head1 Part 1: The basics |
78 | ||
79 | =head2 Simple word matching | |
80 | ||
81 | The simplest regexp is simply a word, or more generally, a string of | |
28285ecf | 82 | characters. A regexp consisting of just a word matches any string that |
47f9c88b GS |
83 | contains that word: |
84 | ||
85 | "Hello World" =~ /World/; # matches | |
86 | ||
7638d2dc | 87 | What is this Perl statement all about? C<"Hello World"> is a simple |
8ccb1477 | 88 | double-quoted string. C<World> is the regular expression and the |
7638d2dc | 89 | C<//> enclosing C</World/> tells Perl to search a string for a match. |
47f9c88b GS |
90 | The operator C<=~> associates the string with the regexp match and |
91 | produces a true value if the regexp matched, or false if the regexp | |
92 | did not match. In our case, C<World> matches the second word in | |
93 | C<"Hello World">, so the expression is true. Expressions like this | |
94 | are useful in conditionals: | |
95 | ||
96 | if ("Hello World" =~ /World/) { | |
97 | print "It matches\n"; | |
98 | } | |
99 | else { | |
100 | print "It doesn't match\n"; | |
101 | } | |
102 | ||
103 | There are useful variations on this theme. The sense of the match can | |
7638d2dc | 104 | be reversed by using the C<!~> operator: |
47f9c88b GS |
105 | |
106 | if ("Hello World" !~ /World/) { | |
107 | print "It doesn't match\n"; | |
108 | } | |
109 | else { | |
110 | print "It matches\n"; | |
111 | } | |
112 | ||
113 | The literal string in the regexp can be replaced by a variable: | |
114 | ||
15776bb0 | 115 | my $greeting = "World"; |
47f9c88b GS |
116 | if ("Hello World" =~ /$greeting/) { |
117 | print "It matches\n"; | |
118 | } | |
119 | else { | |
120 | print "It doesn't match\n"; | |
121 | } | |
122 | ||
123 | If you're matching against the special default variable C<$_>, the | |
124 | C<$_ =~> part can be omitted: | |
125 | ||
126 | $_ = "Hello World"; | |
127 | if (/World/) { | |
128 | print "It matches\n"; | |
129 | } | |
130 | else { | |
131 | print "It doesn't match\n"; | |
132 | } | |
133 | ||
134 | And finally, the C<//> default delimiters for a match can be changed | |
135 | to arbitrary delimiters by putting an C<'m'> out front: | |
136 | ||
137 | "Hello World" =~ m!World!; # matches, delimited by '!' | |
138 | "Hello World" =~ m{World}; # matches, note the matching '{}' | |
a6b2f353 GS |
139 | "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin', |
140 | # '/' becomes an ordinary char | |
47f9c88b GS |
141 | |
142 | C</World/>, C<m!World!>, and C<m{World}> all represent the | |
15776bb0 | 143 | same thing. When, I<e.g.>, the quote (C<'"'>) is used as a delimiter, the forward |
7638d2dc | 144 | slash C<'/'> becomes an ordinary character and can be used in this regexp |
47f9c88b GS |
145 | without trouble. |
146 | ||
147 | Let's consider how different regexps would match C<"Hello World">: | |
148 | ||
149 | "Hello World" =~ /world/; # doesn't match | |
150 | "Hello World" =~ /o W/; # matches | |
151 | "Hello World" =~ /oW/; # doesn't match | |
152 | "Hello World" =~ /World /; # doesn't match | |
153 | ||
154 | The first regexp C<world> doesn't match because regexps are | |
155 | case-sensitive. The second regexp matches because the substring | |
7638d2dc | 156 | S<C<'o W'>> occurs in the string S<C<"Hello World">>. The space |
15776bb0 | 157 | character C<' '> is treated like any other character in a regexp and is |
47f9c88b GS |
158 | needed to match in this case. The lack of a space character is the |
159 | reason the third regexp C<'oW'> doesn't match. The fourth regexp | |
15776bb0 | 160 | "C<World >" doesn't match because there is a space at the end of the |
47f9c88b GS |
161 | regexp, but not at the end of the string. The lesson here is that |
162 | regexps must match a part of the string I<exactly> in order for the | |
163 | statement to be true. | |
164 | ||
7638d2dc | 165 | If a regexp matches in more than one place in the string, Perl will |
47f9c88b GS |
166 | always match at the earliest possible point in the string: |
167 | ||
168 | "Hello World" =~ /o/; # matches 'o' in 'Hello' | |
169 | "That hat is red" =~ /hat/; # matches 'hat' in 'That' | |
170 | ||
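The "earliest possible point" rule can be checked by asking Perl where a match started. The sketch below uses the special array C<@->, which this tutorial has not introduced yet; C<$-[0]> holds the offset (counting from 0) of the start of the last successful match. The variable name C<@starts> is only illustrative.

```perl
use strict;
use warnings;

my @starts;   # illustrative name: start offsets of the two matches

# 'o' occurs in both 'Hello' and 'World'; the one in 'Hello' comes first.
if ("Hello World" =~ /o/)       { push @starts, $-[0] }

# 'hat' occurs inside 'That' before the separate word 'hat'.
if ("That hat is red" =~ /hat/) { push @starts, $-[0] }

print "match offsets: @starts\n";   # prints "match offsets: 4 1"
```

An offset of 1 for the second string shows that the match began inside C<That>, not at the later word C<hat>.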
171 | With respect to character matching, there are a few more points you | |
15776bb0 | 172 | need to know about. First of all, not all characters can be used "as |
d3a1131a KW |
173 | is" in a match. Some characters, called I<metacharacters>, are |
174 | generally reserved for use in regexp notation. The metacharacters are | |
47f9c88b | 175 | |
d3a1131a KW |
176 | {}[]()^$.|*+?-#\ |
177 | ||
178 | This list is not as definitive as it may appear (or be claimed to be in | |
179 | other documentation). For example, C<"#"> is a metacharacter only when | |
180 | the C</x> pattern modifier (described below) is used, and both C<"}"> | |
181 | and C<"]"> are metacharacters only when paired with opening C<"{"> or | |
182 | C<"["> respectively; other gotchas apply. | |
47f9c88b GS |
183 | |
184 | The significance of each of these will be explained | |
185 | in the rest of the tutorial, but for now, it is important only to know | |
15776bb0 KW |
186 | that a metacharacter can be matched as-is by putting a backslash before |
187 | it: | |
47f9c88b GS |
188 | |
189 | "2+2=4" =~ /2+2/; # doesn't match, + is a metacharacter | |
190 | "2+2=4" =~ /2\+2/; # matches, \+ is treated like an ordinary + | |
191 | "The interval is [0,1)." =~ /[0,1)./ # is a syntax error! | |
192 | "The interval is [0,1)." =~ /\[0,1\)\./ # matches | |
7638d2dc | 193 | "#!/usr/bin/perl" =~ /#!\/usr\/bin\/perl/; # matches |
47f9c88b GS |
194 | |
195 | In the last regexp, the forward slash C<'/'> is also backslashed, | |
196 | because it is used to delimit the regexp. This can lead to LTS | |
197 | (leaning toothpick syndrome), however, and it is often more readable | |
198 | to change delimiters. | |
199 | ||
7638d2dc | 200 | "#!/usr/bin/perl" =~ m!#\!/usr/bin/perl!; # easier to read |
47f9c88b GS |
201 | |
202 | The backslash character C<'\'> is a metacharacter itself and needs to | |
203 | be backslashed: | |
204 | ||
205 | 'C:\WIN32' =~ /C:\\WIN/; # matches | |
206 | ||
2d5e9bac KW |
207 | In situations where it doesn't make sense for a particular metacharacter |
208 | to mean what it normally does, it automatically loses its | |
209 | metacharacter-ness and becomes an ordinary character that is to be | |
210 | matched literally. For example, the C<'}'> is a metacharacter only when | |
211 | it is the mate of a C<'{'> metacharacter. Otherwise it is treated as a | |
212 | literal RIGHT CURLY BRACKET. This may lead to unexpected results. | |
213 | L<C<use re 'strict'>|re/'strict' mode> can catch some of these. | |
214 | ||
47f9c88b GS |
215 | In addition to the metacharacters, there are some ASCII characters |
216 | which don't have printable character equivalents and are instead | |
7638d2dc | 217 | represented by I<escape sequences>. Common examples are C<\t> for a |
47f9c88b | 218 | tab, C<\n> for a newline, C<\r> for a carriage return and C<\a> for a |
43e59f7b | 219 | bell (or alert). If your string is better thought of as a sequence of arbitrary |
15776bb0 KW |
220 | bytes, the octal escape sequence, I<e.g.>, C<\033>, or hexadecimal escape |
221 | sequence, I<e.g.>, C<\x1B> may be a more natural representation for your | |
47f9c88b GS |
222 | bytes. Here are some examples of escapes: |
223 | ||
224 | "1000\t2000" =~ m(0\t2) # matches | |
225 | "1000\n2000" =~ /0\n20/ # matches | |
226 | "1000\t2000" =~ /\000\t2/ # doesn't match, "0" ne "\000" | |
f0a2b745 KW |
227 | "cat" =~ /\o{143}\x61\x74/ # matches in ASCII, but a weird way |
228 | # to spell cat | |
47f9c88b GS |
229 | |
230 | If you've been around Perl a while, all this talk of escape sequences | |
231 | may seem familiar. Similar escape sequences are used in double-quoted | |
232 | strings and in fact the regexps in Perl are mostly treated as | |
233 | double-quoted strings. This means that variables can be used in | |
234 | regexps as well. Just like double-quoted strings, the values of the | |
235 | variables in the regexp will be substituted in before the regexp is | |
236 | evaluated for matching purposes. So we have: | |
237 | ||
238 | $foo = 'house'; | |
239 | 'housecat' =~ /$foo/; # matches | |
240 | 'cathouse' =~ /cat$foo/; # matches | |
47f9c88b GS |
241 | 'housecat' =~ /${foo}cat/; # matches |
242 | ||
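One consequence of this substitution is that any metacharacters inside the variable keep their special meaning. The following minimal sketch illustrates that, and also shows the C<\Q...\E> escape, which backslashes the interpolated text so that it is matched literally. C<\Q...\E> is not covered at this point in the tutorial, and the variable names are only illustrative.

```perl
use strict;
use warnings;

my $sum = '2+2';   # the value contains the metacharacter '+'

# Interpolated as is, the regexp is /2+2/, which needs "22" somewhere.
my $as_pattern = ("2+2=4" =~ /$sum/)     ? 1 : 0;

# With \Q...\E the regexp behaves like /2\+2/.
my $as_literal = ("2+2=4" =~ /\Q$sum\E/) ? 1 : 0;

print "interpolated as a pattern: $as_pattern\n";   # prints 0
print "interpolated as a literal: $as_literal\n";   # prints 1
```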
243 | So far, so good. With the knowledge above you can already perform | |
244 | searches with just about any literal string regexp you can dream up. | |
245 | Here is a I<very simple> emulation of the Unix grep program: | |
246 | ||
247 | % cat > simple_grep | |
248 | #!/usr/bin/perl | |
249 | $regexp = shift; | |
250 | while (<>) { | |
251 | print if /$regexp/; | |
252 | } | |
253 | ^D | |
254 | ||
255 | % chmod +x simple_grep | |
256 | ||
257 | % simple_grep abba /usr/dict/words | |
258 | Babbage | |
259 | cabbage | |
260 | cabbages | |
261 | sabbath | |
262 | Sabbathize | |
263 | Sabbathizes | |
264 | sabbatical | |
265 | scabbard | |
266 | scabbards | |
267 | ||
268 | This program is easy to understand. C<#!/usr/bin/perl> is the standard | |
269 | way to invoke a perl program from the shell. | |
7638d2dc | 270 | S<C<$regexp = shift;>> saves the first command line argument as the |
47f9c88b | 271 | regexp to be used, leaving the rest of the command line arguments to |
7638d2dc WL |
272 | be treated as files. S<C<< while (<>) >>> loops over all the lines in |
273 | all the files. For each line, S<C<print if /$regexp/;>> prints the | |
47f9c88b GS |
274 | line if the regexp matches the line. In this line, both C<print> and |
275 | C</$regexp/> use the default variable C<$_> implicitly. | |
276 | ||
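To try the same logic without a dictionary file, the loop can be run over a list held in memory. This is only a sketch: the word list is made up for illustration, and C<$regexp> is set directly instead of being taken from the command line.

```perl
use strict;
use warnings;

my $regexp = 'abba';   # stands in for the first command line argument
my @words  = qw(Babbage cabbage rabbit sabbath table);   # made-up input lines

my @hits;
for (@words) {                      # each word is aliased to $_ in turn
    push @hits, $_ if /$regexp/;    # same implicit-$_ match as in simple_grep
}

print "$_\n" for @hits;   # prints Babbage, cabbage and sabbath, one per line
```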
277 | With all of the regexps above, if the regexp matched anywhere in the | |
278 | string, it was considered a match. Sometimes, however, we'd like to | |
279 | specify I<where> in the string the regexp should try to match. To do | |
15776bb0 KW |
280 | this, we would use the I<anchor> metacharacters C<'^'> and C<'$'>. The |
281 | anchor C<'^'> means match at the beginning of the string and the anchor | |
282 | C<'$'> means match at the end of the string, or before a newline at the | |
47f9c88b GS |
283 | end of the string. Here is how they are used: |
284 | ||
285 | "housekeeper" =~ /keeper/; # matches | |
286 | "housekeeper" =~ /^keeper/; # doesn't match | |
287 | "housekeeper" =~ /keeper$/; # matches | |
288 | "housekeeper\n" =~ /keeper$/; # matches | |
289 | ||
15776bb0 | 290 | The second regexp doesn't match because C<'^'> constrains C<keeper> to |
47f9c88b GS |
291 | match only at the beginning of the string, but C<"housekeeper"> has |
292 | keeper starting in the middle. The third regexp does match, since the | |
15776bb0 | 293 | C<'$'> constrains C<keeper> to match only at the end of the string. |
47f9c88b | 294 | |
15776bb0 KW |
295 | When both C<'^'> and C<'$'> are used at the same time, the regexp has to |
296 | match both the beginning and the end of the string, I<i.e.>, the regexp | |
47f9c88b GS |
297 | matches the whole string. Consider |
298 | ||
299 | "keeper" =~ /^keep$/; # doesn't match | |
300 | "keeper" =~ /^keeper$/; # matches | |
301 | "" =~ /^$/; # ^$ matches an empty string | |
302 | ||
303 | The first regexp doesn't match because the string has more to it than | |
304 | C<keep>. Since the second regexp is exactly the string, it | |
15776bb0 | 305 | matches. Using both C<'^'> and C<'$'> in a regexp forces the complete |
47f9c88b GS |
306 | string to match, so it gives you complete control over which strings |
307 | match and which don't. Suppose you are looking for a fellow named | |
308 | bert, off in a string by himself: | |
309 | ||
310 | "dogbert" =~ /bert/; # matches, but not what you want | |
311 | ||
312 | "dilbert" =~ /^bert/; # doesn't match, but .. | |
313 | "bertram" =~ /^bert/; # matches, so still not good enough | |
314 | ||
315 | "bertram" =~ /^bert$/; # doesn't match, good | |
316 | "dilbert" =~ /^bert$/; # doesn't match, good | |
317 | "bert" =~ /^bert$/; # matches, perfect | |
318 | ||
319 | Of course, in the case of a literal string, one could just as easily | |
7638d2dc | 320 | use the string comparison S<C<$string eq 'bert'>> and it would be |
47f9c88b GS |
321 | more efficient. The C<^...$> regexp really becomes useful when we |
322 | add in the more powerful regexp tools below. | |
323 | ||
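The two tests are not quite interchangeable, because C<'$'> can also match before a newline at the end of the string while C<eq> compares every character. A small sketch of the difference, using illustrative variable names:

```perl
use strict;
use warnings;

my $line = "bert\n";   # for example, a line read from a file

my $regexp_says = ($line =~ /^bert$/) ? 1 : 0;   # $ matches before the final "\n"
my $eq_says     = ($line eq 'bert')   ? 1 : 0;   # the "\n" makes the strings differ

chomp(my $chomped = $line);                      # remove the trailing newline
my $eq_after_chomp = ($chomped eq 'bert') ? 1 : 0;

print "regexp: $regexp_says, eq: $eq_says, eq after chomp: $eq_after_chomp\n";
# prints "regexp: 1, eq: 0, eq after chomp: 1"
```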
324 | =head2 Using character classes | |
325 | ||
326 | Although one can already do quite a lot with the literal string | |
327 | regexps above, we've only scratched the surface of regular expression | |
328 | technology. In this and subsequent sections we will introduce regexp | |
329 | concepts (and associated metacharacter notations) that will allow a | |
8ccb1477 | 330 | regexp to represent not just a single character sequence, but a I<whole |
47f9c88b GS |
331 | class> of them. |
332 | ||
7638d2dc | 333 | One such concept is that of a I<character class>. A character class |
47f9c88b | 334 | allows a set of possible characters, rather than just a single |
0b635837 KW |
335 | character, to match at a particular point in a regexp. You can define |
336 | your own custom character classes. These | |
337 | are denoted by brackets C<[...]>, with the set of characters | |
47f9c88b GS |
338 | to be possibly matched inside. Here are some examples: |
339 | ||
340 | /cat/; # matches 'cat' | |
341 | /[bcr]at/; # matches 'bat', 'cat', or 'rat' | |
342 | /item[0123456789]/; # matches 'item0' or ... or 'item9' | |
a6b2f353 | 343 | "abc" =~ /[cab]/; # matches 'a' |
47f9c88b GS |
344 | |
345 | In the last statement, even though C<'c'> is the first character in | |
346 | the class, C<'a'> matches because the first character position in the | |
347 | string is the earliest point at which the regexp can match. | |
348 | ||
349 | /[yY][eE][sS]/; # match 'yes' in a case-insensitive way | |
350 | # 'yes', 'Yes', 'YES', etc. | |
351 | ||
da75cd15 | 352 | This regexp displays a common task: perform a case-insensitive |
28c3722c | 353 | match. Perl provides a way of avoiding all those brackets by simply |
47f9c88b GS |
354 | appending an C<'i'> to the end of the match. Then C</[yY][eE][sS]/;> |
355 | can be rewritten as C</yes/i;>. The C<'i'> stands for | |
7638d2dc | 356 | case-insensitive and is an example of a I<modifier> of the matching |
47f9c88b GS |
357 | operation. We will meet other modifiers later in the tutorial. |
358 | ||
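As a quick check that the two spellings select the same strings, the sketch below filters a made-up list of answers with each regexp. It uses Perl's built-in C<grep> function, which returns the list elements for which the block is true.

```perl
use strict;
use warnings;

my @answers = ('yes', 'Yes', 'YES', 'yEs', 'no');   # made-up test strings

# grep sets $_ to each element and keeps those where the match succeeds.
my @with_classes  = grep { /[yY][eE][sS]/ } @answers;
my @with_modifier = grep { /yes/i }         @answers;

print "classes:  @with_classes\n";    # prints "classes:  yes Yes YES yEs"
print "modifier: @with_modifier\n";   # prints "modifier: yes Yes YES yEs"
```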
359 | We saw in the section above that there were ordinary characters, which | |
360 | represented themselves, and special characters, which needed a | |
15776bb0 | 361 | backslash C<'\'> to represent themselves. The same is true in a |
47f9c88b GS |
362 | character class, but the sets of ordinary and special characters |
363 | inside a character class are different than those outside a character | |
7638d2dc | 364 | class. The special characters for a character class are C<-]\^$> (and |
353c6505 | 365 | the pattern delimiter, whatever it is). |
15776bb0 KW |
366 | C<']'> is special because it denotes the end of a character class. C<'$'> is |
367 | special because it denotes a scalar variable. C<'\'> is special because | |
47f9c88b GS |
368 | it is used in escape sequences, just like above. Here is how the |
369 | special characters C<]$\> are handled: | |
370 | ||
371 | /[\]c]def/; # matches ']def' or 'cdef' | |
372 | $x = 'bcr'; | |
a6b2f353 | 373 | /[$x]at/; # matches 'bat', 'cat', or 'rat' |
47f9c88b GS |
374 | /[\$x]at/; # matches '$at' or 'xat' |
375 | /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat' | |
376 | ||
353c6505 | 377 | The last two are a little tricky. In C<[\$x]>, the backslash protects |
15776bb0 | 378 | the dollar sign, so the character class has two members C<'$'> and C<'x'>. |
47f9c88b GS |
379 | In C<[\\$x]>, the backslash is protected, so C<$x> is treated as a |
380 | variable and substituted in double quote fashion. | |
381 | ||
382 | The special character C<'-'> acts as a range operator within character | |
383 | classes, so that a contiguous set of characters can be written as a | |
384 | range. With ranges, the unwieldy C<[0123456789]> and C<[abc...xyz]> | |
385 | become the svelte C<[0-9]> and C<[a-z]>. Some examples are | |
386 | ||
387 | /item[0-9]/; # matches 'item0' or ... or 'item9' | |
388 | /[0-9bx-z]aa/; # matches '0aa', ..., '9aa', | |
389 | # 'baa', 'xaa', 'yaa', or 'zaa' | |
390 | /[0-9a-fA-F]/; # matches a hexadecimal digit | |
36bbe248 | 391 | /[0-9a-zA-Z_]/; # matches a "word" character, |
7638d2dc | 392 | # like those in a Perl variable name |
47f9c88b GS |
393 | |
394 | If C<'-'> is the first or last character in a character class, it is | |
395 | treated as an ordinary character; C<[-ab]>, C<[ab-]> and C<[a\-b]> are | |
396 | all equivalent. | |
397 | ||
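The following sketch tests a lone hyphen against each of the three spellings, and against C<[a-b]> for contrast, where the C<'-'> sits between two characters and therefore forms a range. The variable names are only illustrative.

```perl
use strict;
use warnings;

my $first   = ("-" =~ /[-ab]/)  ? 1 : 0;   # '-' first: ordinary character
my $last    = ("-" =~ /[ab-]/)  ? 1 : 0;   # '-' last: ordinary character
my $escaped = ("-" =~ /[a\-b]/) ? 1 : 0;   # backslashed: ordinary character
my $range   = ("-" =~ /[a-b]/)  ? 1 : 0;   # range 'a' to 'b': no hyphen in it

print "first=$first last=$last escaped=$escaped range=$range\n";
# prints "first=1 last=1 escaped=1 range=0"
```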
15776bb0 | 398 | The special character C<'^'> in the first position of a character class |
7638d2dc | 399 | denotes a I<negated character class>, which matches any character but |
a6b2f353 | 400 | those in the brackets. Both C<[...]> and C<[^...]> must match a |
47f9c88b GS |
401 | character, or the match fails. Then |
402 | ||
403 | /[^a]at/; # doesn't match 'aat' or 'at', but matches | |
404 | # all other 'bat', 'cat', '0at', '%at', etc. | |
405 | /[^0-9]/; # matches a non-numeric character | |
406 | /[a^]at/; # matches 'aat' or '^at'; here '^' is ordinary | |
407 | ||
28c3722c | 408 | Now, even C<[0-9]> can be a bother to write multiple times, so in the |
47f9c88b | 409 | interest of saving keystrokes and making regexps more readable, Perl |
7638d2dc | 410 | has several abbreviations for common character classes, as shown below. |
f1dc5bb2 | 411 | Since the introduction of Unicode, unless the C</a> modifier is in |
0bd5a82d KW |
412 | effect, these character classes match more than just a few characters in |
413 | the ASCII range. | |
47f9c88b GS |
414 | |
415 | =over 4 | |
416 | ||
417 | =item * | |
551e1d92 | 418 | |
15776bb0 | 419 | C<\d> matches a digit, not just C<[0-9]> but also digits from non-roman scripts |
47f9c88b GS |
420 | |
421 | =item * | |
551e1d92 | 422 | |
15776bb0 | 423 | C<\s> matches a whitespace character, the set C<[\ \t\r\n\f]> and others |
47f9c88b GS |
424 | |
425 | =item * | |
551e1d92 | 426 | |
15776bb0 | 427 | C<\w> matches a word character (alphanumeric or C<'_'>), not just C<[0-9a-zA-Z_]> |
7638d2dc | 428 | but also digits and characters from non-roman scripts |
47f9c88b GS |
429 | |
430 | =item * | |
551e1d92 | 431 | |
15776bb0 | 432 | C<\D> is a negated C<\d>; it represents any character other than a digit, or C<[^\d]> |
47f9c88b GS |
433 | |
434 | =item * | |
551e1d92 | 435 | |
15776bb0 | 436 | C<\S> is a negated C<\s>; it represents any non-whitespace character C<[^\s]> |
47f9c88b GS |
437 | |
438 | =item * | |
551e1d92 | 439 | |
15776bb0 | 440 | C<\W> is a negated C<\w>; it represents any non-word character C<[^\w]> |
47f9c88b GS |
441 | |
442 | =item * | |
551e1d92 | 443 | |
15776bb0 | 444 | The period C<'.'> matches any character but C<"\n"> (unless the modifier C</s> is |
7638d2dc | 445 | in effect, as explained below). |
47f9c88b | 446 | |
1ca4ba9b KW |
447 | =item * |
448 | ||
15776bb0 | 449 | C<\N>, like the period, matches any character but C<"\n">, but it does so |
f1dc5bb2 | 450 | regardless of whether the modifier C</s> is in effect. |
1ca4ba9b | 451 | |
47f9c88b GS |
452 | =back |
453 | ||
f1dc5bb2 | 454 | The C</a> modifier, available starting in Perl 5.14, is used to |
15776bb0 | 455 | restrict the matches of C<\d>, C<\s>, and C<\w> to just those in the ASCII range. |
0bd5a82d KW |
456 | It is useful to keep your program from being needlessly exposed to full |
457 | Unicode (and its accompanying security considerations) when all you want | |
f1dc5bb2 | 458 | is to process English-like text. (The "a" may be doubled, C</aa>, to |
0bd5a82d KW |
459 | provide even more restrictions, preventing case-insensitive matching of |
460 | ASCII with non-ASCII characters; otherwise a Unicode "Kelvin Sign" | |
461 | would caselessly match a "k" or "K".) | |
462 | ||
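A minimal sketch of both effects, assuming Perl 5.14 or later. The two test characters are written as C<\x{...}> escapes so that the source file stays plain ASCII: U+0663 is ARABIC-INDIC DIGIT THREE and U+212A is KELVIN SIGN. The variable names are only illustrative.

```perl
use strict;
use warnings;

my $arabic_three = "\x{0663}";   # ARABIC-INDIC DIGIT THREE
my $kelvin       = "\x{212A}";   # KELVIN SIGN

my $digit_default = ($arabic_three =~ /\d/)  ? 1 : 0;   # Unicode digit counts
my $digit_ascii   = ($arabic_three =~ /\d/a) ? 1 : 0;   # /a: only [0-9] counts

my $kelvin_i   = ($kelvin =~ /k/i)   ? 1 : 0;   # caseless match crosses into ASCII
my $kelvin_aai = ($kelvin =~ /k/aai) ? 1 : 0;   # /aa forbids that crossing

print "\\d: $digit_default, \\d with /a: $digit_ascii\n";   # prints 1, then 0
print "k with /i: $kelvin_i, k with /aai: $kelvin_aai\n";   # prints 1, then 0
```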
47f9c88b | 463 | The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside |
0b635837 | 464 | of bracketed character classes. Here are some in use: |
47f9c88b GS |
465 | |
466 | /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format | |
467 | /[\d\s]/; # matches any digit or whitespace character | |
468 | /\w\W\w/; # matches a word char, followed by a | |
469 | # non-word char, followed by a word char | |
470 | /..rt/; # matches any two chars, followed by 'rt' | |
471 | /end\./; # matches 'end.' | |
472 | /end[.]/; # same thing, matches 'end.' | |
473 | ||
474 | Because a period is a metacharacter, it needs to be escaped to match | |
475 | as an ordinary period. Because, for example, C<\d> and C<\w> are sets | |
476 | of characters, it is incorrect to think of C<[^\d\w]> as C<[\D\W]>; in | |
477 | fact C<[^\d\w]> is the same as C<[^\w]>, which is the same as | |
478 | C<[\W]>. Think DeMorgan's laws. | |
479 | ||
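The difference is easy to see with a single letter. In the sketch below, C<'a'> is a word character, so the negated class rejects it, but it is also a non-digit, so C<[\D\W]> accepts it. A punctuation character is accepted by both. The variable names are only illustrative.

```perl
use strict;
use warnings;

# [^\d\w] means "neither a digit nor a word character".
my $letter_negated = ("a" =~ /[^\d\w]/) ? 1 : 0;   # 0: 'a' is a word character

# [\D\W] means "a non-digit OR a non-word character".
my $letter_union   = ("a" =~ /[\D\W]/)  ? 1 : 0;   # 1: 'a' is a non-digit

my $bang_negated   = ("!" =~ /[^\d\w]/) ? 1 : 0;   # 1
my $bang_union     = ("!" =~ /[\D\W]/)  ? 1 : 0;   # 1

print "a: $letter_negated vs $letter_union\n";   # prints "a: 0 vs 1"
print "!: $bang_negated vs $bang_union\n";       # prints "!: 1 vs 1"
```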
0b635837 KW |
480 | In actuality, the period and C<\d\s\w\D\S\W> abbreviations are |
481 | themselves types of character classes, so the ones surrounded by | |
482 | brackets are just one type of character class. When we need to make a | |
483 | distinction, we refer to them as "bracketed character classes." | |
484 | ||
7638d2dc | 485 | An anchor useful in basic regexps is the I<word anchor> |
47f9c88b GS |
486 | C<\b>. This matches a boundary between a word character and a non-word |
487 | character C<\w\W> or C<\W\w>: | |
488 | ||
489 | $x = "Housecat catenates house and cat"; | |
490 | $x =~ /cat/; # matches cat in 'Housecat' | |
491 | $x =~ /\bcat/; # matches cat in 'catenates' | |
492 | $x =~ /cat\b/; # matches cat in 'Housecat' | |
493 | $x =~ /\bcat\b/; # matches 'cat' at end of string | |
494 | ||
495 | Note in the last example, the end of the string is considered a word | |
496 | boundary. | |
497 | ||
ae3bb8ea KW |
498 | For natural language processing (so that, for example, apostrophes are |
499 | included in words), use C<\b{wb}> instead: | |
500 | ||
501 | "don't" =~ / .+? \b{wb} /x; # matches the whole string | |
502 | ||
47f9c88b GS |
503 | You might wonder why C<'.'> matches everything but C<"\n"> - why not |
504 | every character? The reason is that often one is matching against | |
505 | lines and would like to ignore the newline characters. For instance, | |
506 | while the string C<"\n"> represents one line, we would like to think | |
28c3722c | 507 | of it as empty. Then |
47f9c88b GS |
508 | |
509 | "" =~ /^$/; # matches | |
7638d2dc | 510 | "\n" =~ /^$/; # matches, $ anchors before "\n" |
47f9c88b GS |
511 | |
512 | "" =~ /./; # doesn't match; it needs a char | |
513 | "" =~ /^.$/; # doesn't match; it needs a char | |
514 | "\n" =~ /^.$/; # doesn't match; it needs a char other than "\n" | |
515 | "a" =~ /^.$/; # matches | |
7638d2dc | 516 | "a\n" =~ /^.$/; # matches, $ anchors before "\n" |
47f9c88b GS |
517 | |
518 | This behavior is convenient, because we usually want to ignore | |
519 | newlines when we count and match characters in a line. Sometimes, | |
15776bb0 KW |
520 | however, we want to keep track of newlines. We might even want C<'^'> |
521 | and C<'$'> to anchor at the beginning and end of lines within the | |
47f9c88b GS |
522 | string, rather than just the beginning and end of the string. Perl |
523 | allows us to choose between ignoring and paying attention to newlines | |
f1dc5bb2 | 524 | by using the C</s> and C</m> modifiers. C</s> and C</m> stand for |
47f9c88b GS |
525 | single line and multi-line and they determine whether a string is to |
526 | be treated as one continuous string, or as a set of lines. The two | |
527 | modifiers affect two aspects of how the regexp is interpreted: 1) how | |
15776bb0 KW |
528 | the C<'.'> character class is defined, and 2) where the anchors C<'^'> |
529 | and C<'$'> are able to match. Here are the four possible combinations: | |
47f9c88b GS |
530 | |
531 | =over 4 | |
532 | ||
533 | =item * | |
551e1d92 | 534 | |
f1dc5bb2 | 535 | no modifiers: Default behavior. C<'.'> matches any character |
15776bb0 KW |
536 | except C<"\n">. C<'^'> matches only at the beginning of the string and |
537 | C<'$'> matches only at the end or before a newline at the end. | |
47f9c88b GS |
538 | |
539 | =item * | |
551e1d92 | 540 | |
f1dc5bb2 | 541 | s modifier (C</s>): Treat string as a single long line. C<'.'> matches |
15776bb0 KW |
542 | any character, even C<"\n">. C<'^'> matches only at the beginning of |
543 | the string and C<'$'> matches only at the end or before a newline at the | |
47f9c88b GS |
544 | end. |
545 | ||
546 | =item * | |
551e1d92 | 547 | |
f1dc5bb2 | 548 | m modifier (C</m>): Treat string as a set of multiple lines. C<'.'> |
15776bb0 | 549 | matches any character except C<"\n">. C<'^'> and C<'$'> are able to match |
47f9c88b GS |
550 | at the start or end of I<any> line within the string. |
551 | ||
552 | =item * | |
551e1d92 | 553 | |
f1dc5bb2 | 554 | both s and m modifiers (C</sm>): Treat string as a single long line, but |
47f9c88b | 555 | detect multiple lines. C<'.'> matches any character, even |
15776bb0 | 556 | C<"\n">. C<'^'> and C<'$'>, however, are able to match at the start or end |
47f9c88b GS |
557 | of I<any> line within the string. |
558 | ||
559 | =back | |
560 | ||
f1dc5bb2 | 561 | Here are examples of C</s> and C</m> in action: |
47f9c88b GS |
562 | |
563 | $x = "There once was a girl\nWho programmed in Perl\n"; | |
564 | ||
565 | $x =~ /^Who/; # doesn't match, "Who" not at start of string | |
566 | $x =~ /^Who/s; # doesn't match, "Who" not at start of string | |
567 | $x =~ /^Who/m; # matches, "Who" at start of second line | |
568 | $x =~ /^Who/sm; # matches, "Who" at start of second line | |
569 | ||
570 | $x =~ /girl.Who/; # doesn't match, "." doesn't match "\n" | |
571 | $x =~ /girl.Who/s; # matches, "." matches "\n" | |
572 | $x =~ /girl.Who/m; # doesn't match, "." doesn't match "\n" | |
573 | $x =~ /girl.Who/sm; # matches, "." matches "\n" | |
574 | ||
f1dc5bb2 KW |
575 | Most of the time, the default behavior is what is wanted, but C</s> and |
576 | C</m> are occasionally very useful. If C</m> is being used, the start | |
28c3722c | 577 | of the string can still be matched with C<\A> and the end of the string |
47f9c88b | 578 | can still be matched with the anchors C<\Z> (matches at the end or |
15776bb0 | 579 | before a newline at the end, like C<'$'>), and C<\z> (matches only at the end): |
47f9c88b GS |
580 | |
581 | $x =~ /^Who/m; # matches, "Who" at start of second line | |
582 | $x =~ /\AWho/m; # doesn't match, "Who" is not at start of string | |
583 | ||
584 | $x =~ /girl$/m; # matches, "girl" at end of first line | |
585 | $x =~ /girl\Z/m; # doesn't match, "girl" is not at end of string | |
586 | ||
587 | $x =~ /Perl\Z/m; # matches, "Perl" is at newline before end | |
588 | $x =~ /Perl\z/m; # doesn't match, "Perl" is not at end of string | |
589 | ||
590 | We now know how to create choices among classes of characters in a | |
591 | regexp. What about choices among words or character strings? Such | |
592 | choices are described in the next section. | |
593 | ||
594 | =head2 Matching this or that | |
595 | ||
28c3722c | 596 | Sometimes we would like our regexp to be able to match different |
47f9c88b | 597 | possible words or character strings. This is accomplished by using |
15776bb0 | 598 | the I<alternation> metacharacter C<'|'>. To match C<dog> or C<cat>, we |
7638d2dc | 599 | form the regexp C<dog|cat>. As before, Perl will try to match the |
47f9c88b | 600 | regexp at the earliest possible point in the string. At each |
7638d2dc WL |
601 | character position, Perl will first try to match the first |
602 | alternative, C<dog>. If C<dog> doesn't match, Perl will then try the | |
47f9c88b | 603 | next alternative, C<cat>. If C<cat> doesn't match either, then the |
7638d2dc | 604 | match fails and Perl moves to the next position in the string. Some |
47f9c88b GS |
605 | examples: |
606 | ||
607 | "cats and dogs" =~ /cat|dog|bird/; # matches "cat" | |
608 | "cats and dogs" =~ /dog|cat|bird/; # matches "cat" | |
609 | ||
610 | Even though C<dog> is the first alternative in the second regexp, | |
611 | C<cat> is able to match earlier in the string. | |
612 | ||
613 | "cats" =~ /c|ca|cat|cats/; # matches "c" | |
614 | "cats" =~ /cats|cat|ca|c/; # matches "cats" | |
615 | ||
616 | Here, all the alternatives match at the first string position, so the | |
617 | first alternative is the one that matches. If some of the | |
618 | alternatives are truncations of the others, put the longest ones first | |
619 | to give them a chance to match. | |
620 | ||
621 | "cab" =~ /a|b|c/ # matches "c" | |
622 | # /a|b|c/ == /[abc]/ | |
623 | ||
624 | The last example points out that character classes are like | |
625 | alternations of characters. At a given character position, the first | |
210b36aa | 626 | alternative that allows the regexp match to succeed will be the one |
47f9c88b GS |
627 | that matches. |
628 | ||
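To see which alternative actually matched, we can inspect C<$&>, which holds the matched text (it is described in more detail later in this tutorial). The following is a minimal sketch; the variable names are chosen here only for illustration.

```perl
use strict;
use warnings;

# Shortest alternative first: "c" wins because it is tried first.
"cats" =~ /c|ca|cat|cats/;
my $short_first = $&;

# Longest alternative first: "cats" gets its chance to match.
"cats" =~ /cats|cat|ca|c/;
my $long_first = $&;

# The earliest position in the string beats the order of alternatives.
"cats and dogs" =~ /dog|cat|bird/;
my $earliest = $&;

print "short first: $short_first\n";
print "long first:  $long_first\n";
print "earliest:    $earliest\n";
```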
629 | =head2 Grouping things and hierarchical matching | |
630 | ||
631 | Alternation allows a regexp to choose among alternatives, but by | |
7638d2dc | 632 | itself it is unsatisfying. The reason is that each alternative is a whole |
47f9c88b GS |
633 | regexp, but sometimes we want alternatives for just part of a |
634 | regexp. For instance, suppose we want to search for housecats or | |
635 | housekeepers. The regexp C<housecat|housekeeper> fits the bill, but is | |
636 | inefficient because we had to type C<house> twice. It would be nice to | |
da75cd15 | 637 | have parts of the regexp be constant, like C<house>, and some |
47f9c88b GS |
638 | parts have alternatives, like C<cat|keeper>. |
639 | ||
7638d2dc | 640 | The I<grouping> metacharacters C<()> solve this problem. Grouping |
47f9c88b GS |
641 | allows parts of a regexp to be treated as a single unit. Parts of a |
642 | regexp are grouped by enclosing them in parentheses. Thus we could solve | |
643 | the C<housecat|housekeeper> problem by forming the regexp as |
644 | C<house(cat|keeper)>. The regexp C<house(cat|keeper)> means match | |
645 | C<house> followed by either C<cat> or C<keeper>. Some more examples | |
646 | are | |
647 | ||
648 | /(a|b)b/; # matches 'ab' or 'bb' | |
649 | /(ac|b)b/; # matches 'acb' or 'bb' | |
650 | /(^a|b)c/; # matches 'ac' at start of string or 'bc' anywhere | |
651 | /(a|[bc])d/; # matches 'ad', 'bd', or 'cd' | |
652 | ||
653 | /house(cat|)/; # matches either 'housecat' or 'house' | |
654 | /house(cat(s|)|)/; # matches either 'housecats' or 'housecat' or | |
655 | # 'house'. Note groups can be nested. | |
656 | ||
657 | /(19|20|)\d\d/; # match years 19xx, 20xx, or the Y2K problem, xx | |
658 | "20" =~ /(19|20|)\d\d/; # matches the null alternative '()\d\d', | |
659 | # because '20\d\d' can't match | |
660 | ||
661 | Alternations behave the same way in groups as out of them: at a given | |
662 | string position, the leftmost alternative that allows the regexp to | |
210b36aa | 663 | match is taken. So in the last example at the first string position, |
47f9c88b | 664 | C<"20"> matches the second alternative, but there is nothing left over |
7638d2dc | 665 | to match the next two digits C<\d\d>. So Perl moves on to the next |
47f9c88b GS |
666 | alternative, which is the null alternative and that works, since |
667 | C<"20"> is two digits. | |
668 | ||
669 | The process of trying one alternative, seeing whether it matches, and, |
7638d2dc WL |
670 | if it doesn't, going back in the string to where that alternative was |
671 | tried and moving on to the next alternative, is called |
15776bb0 | 672 | I<backtracking>. The term "backtracking" comes from the idea that |
47f9c88b GS |
673 | matching a regexp is like a walk in the woods. Successfully matching |
674 | a regexp is like arriving at a destination. There are many possible | |
675 | trailheads, one for each string position, and each one is tried in | |
676 | order, left to right. From each trailhead there may be many paths, | |
677 | some of which get you there, and some which are dead ends. When you | |
678 | walk along a trail and hit a dead end, you have to backtrack along the | |
679 | trail to an earlier point to try another trail. If you hit your | |
680 | destination, you stop immediately and forget about trying all the | |
681 | other trails. You are persistent, and only if you have tried all the | |
682 | trails from all the trailheads and not arrived at your destination, do | |
683 | you declare failure. To be concrete, here is a step-by-step analysis | |
7638d2dc | 684 | of what Perl does when it tries to match the regexp |
47f9c88b GS |
685 | |
686 | "abcde" =~ /(abd|abc)(df|d|de)/; | |
687 | ||
688 | =over 4 | |
689 | ||
15776bb0 | 690 | =item Z<>0. Start with the first letter in the string C<'a'>. |
551e1d92 | 691 | |
15776bb0 | 692 | E<nbsp> |
551e1d92 | 693 | |
15776bb0 | 694 | =item Z<>1. Try the first alternative in the first group C<'abd'>. |
47f9c88b | 695 | |
15776bb0 | 696 | E<nbsp> |
47f9c88b | 697 | |
15776bb0 | 698 | =item Z<>2. Match C<'a'> followed by C<'b'>. So far so good. |
47f9c88b | 699 | |
15776bb0 | 700 | E<nbsp> |
551e1d92 | 701 | |
15776bb0 KW |
702 | =item Z<>3. C<'d'> in the regexp doesn't match C<'c'> in the string - a |
703 | dead end. So backtrack two characters and pick the second alternative | |
704 | in the first group C<'abc'>. | |
551e1d92 | 705 | |
15776bb0 | 706 | E<nbsp> |
47f9c88b | 707 | |
15776bb0 KW |
708 | =item Z<>4. Match C<'a'> followed by C<'b'> followed by C<'c'>. We are on a roll |
709 | and have satisfied the first group. Set C<$1> to C<'abc'>. | |
551e1d92 | 710 | |
15776bb0 | 711 | E<nbsp> |
47f9c88b | 712 | |
15776bb0 | 713 | =item Z<>5. Move on to the second group and pick the first alternative C<'df'>. |
551e1d92 | 714 | |
15776bb0 | 715 | E<nbsp> |
47f9c88b | 716 | |
15776bb0 | 717 | =item Z<>6. Match the C<'d'>. |
47f9c88b | 718 | |
15776bb0 | 719 | E<nbsp> |
551e1d92 | 720 | |
15776bb0 | 721 | =item Z<>7. C<'f'> in the regexp doesn't match C<'e'> in the string, so a dead |
47f9c88b | 722 | end. Backtrack one character and pick the second alternative in the |
15776bb0 | 723 | second group C<'d'>. |
47f9c88b | 724 | |
15776bb0 | 725 | E<nbsp> |
551e1d92 | 726 | |
15776bb0 KW |
727 | =item Z<>8. C<'d'> matches. The second grouping is satisfied, so set |
728 | C<$2> to C<'d'>. | |
47f9c88b | 729 | |
15776bb0 | 730 | E<nbsp> |
551e1d92 | 731 | |
15776bb0 KW |
732 | =item Z<>9. We are at the end of the regexp, so we are done! We have |
733 | matched C<'abcd'> out of the string C<"abcde">. | |
47f9c88b GS |
734 | |
735 | =back | |
736 | ||
737 | There are a couple of things to note about this analysis. First, the | |
15776bb0 | 738 | third alternative in the second group C<'de'> also allows a match, but we |
47f9c88b GS |
739 | stopped before we got to it - at a given character position, leftmost |
740 | wins. Second, we were able to get a match at the first character | |
15776bb0 KW |
741 | position of the string C<'a'>. If there were no matches at the first |
742 | position, Perl would move to the second character position C<'b'> and | |
47f9c88b | 743 | attempt the match all over again. Only when all possible paths at all |
7638d2dc WL |
744 | possible character positions have been exhausted does Perl give |
745 | up and declare S<C<$string =~ /(abd|abc)(df|d|de)/;>> to be false. | |
47f9c88b GS |
746 | |
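The outcome of the step-by-step analysis can be checked directly. The sketch below runs the same match and then a variant of our own in which C<'de'> is moved ahead of C<'d'> in the second group, to show that the order of the alternatives decides the result. The variable names are illustrative only.

```perl
use strict;
use warnings;

my ($first, $second, $whole);
if ("abcde" =~ /(abd|abc)(df|d|de)/) {
    # 'abd' and 'df' were dead ends; 'abc' and 'd' succeeded.
    ($first, $second, $whole) = ($1, $2, $&);
}
print "original:  \$1='$first' \$2='$second' matched='$whole'\n";

my ($first2, $second2, $whole2);
if ("abcde" =~ /(abd|abc)(df|de|d)/) {
    # With 'de' listed before 'd', leftmost wins again, so 'de' is taken.
    ($first2, $second2, $whole2) = ($1, $2, $&);
}
print "reordered: \$1='$first2' \$2='$second2' matched='$whole2'\n";
```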
747 | Even with all this work, regexp matching happens remarkably fast. To | |
353c6505 DL |
748 | speed things up, Perl compiles the regexp into a compact sequence of |
749 | opcodes that can often fit inside a processor cache. When the code is | |
7638d2dc WL |
750 | executed, these opcodes can then run at full throttle and search very |
751 | quickly. | |
47f9c88b GS |
752 | |
753 | =head2 Extracting matches | |
754 | ||
755 | The grouping metacharacters C<()> also serve another completely | |
756 | different function: they allow the extraction of the parts of a string | |
757 | that matched. This is very useful to find out what matched and for | |
758 | text processing in general. For each grouping, the part that matched | |
15776bb0 | 759 | inside goes into the special variables C<$1>, C<$2>, I<etc>. They can be |
47f9c88b GS |
760 | used just as ordinary variables: |
761 | ||
762 | # extract hours, minutes, seconds | |
2275acdc RGS |
763 | if ($time =~ /(\d\d):(\d\d):(\d\d)/) { # match hh:mm:ss format |
764 | $hours = $1; | |
765 | $minutes = $2; | |
766 | $seconds = $3; | |
767 | } | |
47f9c88b GS |
768 | |
769 | Now, we know that in scalar context, | |
7638d2dc | 770 | S<C<$time =~ /(\d\d):(\d\d):(\d\d)/>> returns a true or false |
47f9c88b GS |
771 | value. In list context, however, it returns the list of matched values |
772 | C<($1,$2,$3)>. So we could write the code more compactly as | |
773 | ||
774 | # extract hours, minutes, seconds | |
775 | ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/); |
776 | ||
777 | If the groupings in a regexp are nested, C<$1> gets the group with the | |
778 | leftmost opening parenthesis, C<$2> the next opening parenthesis, | |
15776bb0 | 779 | I<etc>. Here is a regexp with nested groups: |
47f9c88b GS |
780 | |
781 | /(ab(cd|ef)((gi)|j))/; | |
782 |  1  2      34 |
783 | ||
7638d2dc WL |
784 | If this regexp matches, C<$1> contains a string starting with |
785 | C<'ab'>, C<$2> is either set to C<'cd'> or C<'ef'>, C<$3> equals either | |
786 | C<'gi'> or C<'j'>, and C<$4> is either set to C<'gi'>, just like C<$3>, | |
787 | or it remains undefined. | |
788 | ||
789 | For convenience, Perl sets C<$+> to the string held by the highest numbered | |
790 | C<$1>, C<$2>,... that got assigned (and, somewhat related, C<$^N> to the | |
15776bb0 | 791 | value of the C<$1>, C<$2>,... most-recently assigned; I<i.e.> the C<$1>, |
7638d2dc | 792 | C<$2>,... associated with the rightmost closing parenthesis used in the |
a01268b5 | 793 | match). |
47f9c88b | 794 | |
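The numbering of the nested groups can be verified by matching two sample strings in list context. This is a sketch with test strings and a helper, C<show>, that we made up for the purpose; it prints C<undef> for a group that did not take part in the match.

```perl
use strict;
use warnings;

my $re = qr/(ab(cd|ef)((gi)|j))/;

# In list context the match returns ($1, $2, $3, $4).
my @with_j  = "abcdj"  =~ $re;
my @with_gi = "abefgi" =~ $re;

# Format a list of captures, marking undefined ones.
sub show {
    return join ', ', map { defined($_) ? "'$_'" : 'undef' } @_;
}

print "abcdj:  ", show(@with_j),  "\n";
print "abefgi: ", show(@with_gi), "\n";
```

In the first string the C<'j'> branch is taken, so group 4 is never set; in the second, groups 3 and 4 both hold C<'gi'>.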
7638d2dc WL |
795 | |
796 | =head2 Backreferences | |
797 | ||
47f9c88b | 798 | Closely associated with the matching variables C<$1>, C<$2>, ... are |
d8b950dc | 799 | the I<backreferences> C<\g1>, C<\g2>,... Backreferences are simply |
47f9c88b | 800 | matching variables that can be used I<inside> a regexp. This is a |
ac036724 | 801 | really nice feature; what matches later in a regexp is made to depend on |
47f9c88b | 802 | what matched earlier in the regexp. Suppose we wanted to look |
15776bb0 | 803 | for doubled words in a text, like "the the". The following regexp finds |
47f9c88b GS |
804 | all 3-letter doubles with a space in between: |
805 | ||
d8b950dc | 806 | /\b(\w\w\w)\s\g1\b/; |
47f9c88b | 807 | |
15776bb0 | 808 | The grouping assigns a value to C<\g1>, so that the same 3-letter sequence |
7638d2dc WL |
809 | is used for both parts. |
810 | ||
811 | A similar task is to find words consisting of two identical parts: | |
47f9c88b | 812 | |
d8b950dc | 813 | % simple_grep '^(\w\w\w\w|\w\w\w|\w\w|\w)\g1$' /usr/dict/words |
47f9c88b GS |
814 | beriberi |
815 | booboo | |
816 | coco | |
817 | mama | |
818 | murmur | |
819 | papa | |
820 | ||
821 | The regexp has a single grouping which considers 4-letter | |
15776bb0 | 822 | combinations, then 3-letter combinations, I<etc>., and uses C<\g1> to look for |
d8b950dc | 823 | a repeat. Although C<$1> and C<\g1> represent the same thing, care should be |
7638d2dc | 824 | taken to use matched variables C<$1>, C<$2>,... only I<outside> a regexp |
d8b950dc | 825 | and backreferences C<\g1>, C<\g2>,... only I<inside> a regexp; not doing |
7638d2dc WL |
826 | so may lead to surprising and unsatisfactory results. |
827 | ||
828 | ||
829 | =head2 Relative backreferences | |
830 | ||
831 | Counting the opening parentheses to get the correct number for a | |
7698aede | 832 | backreference is error-prone as soon as there is more than one |
7638d2dc WL |
833 | capturing group. A more convenient technique became available |
834 | with Perl 5.10: relative backreferences. To refer to the immediately | |
835 | preceding capture group one now may write C<\g{-1}>, the next but | |
836 | last is available via C<\g{-2}>, and so on. | |
837 | ||
838 | Another good reason in addition to readability and maintainability | |
8ccb1477 | 839 | for using relative backreferences is illustrated by the following example, |
7638d2dc WL |
840 | where a simple pattern for matching peculiar strings is used: |
841 | ||
d8b950dc | 842 | $a99a = '([a-z])(\d)\g2\g1'; # matches a11a, g22g, x33x, etc. |
7638d2dc WL |
843 | |
844 | Now that we have this pattern stored as a handy string, we might feel | |
845 | tempted to use it as a part of some other pattern: | |
846 | ||
847 | $line = "code=e99e"; | |
848 | if ($line =~ /^(\w+)=$a99a$/){ # unexpected behavior! | |
849 | print "$1 is valid\n"; | |
850 | } else { | |
851 | print "bad line: '$line'\n"; | |
852 | } | |
853 | ||
ac036724 | 854 | But this doesn't match, at least not the way one might expect. Only |
7638d2dc WL |
855 | after inserting the interpolated C<$a99a> and looking at the resulting |
856 | full text of the regexp is it obvious that the backreferences have | |
ac036724 | 857 | backfired. The subexpression C<(\w+)> has snatched number 1 and |
7638d2dc WL |
858 | demoted the groups in C<$a99a> by one rank. This can be avoided by |
859 | using relative backreferences: | |
860 | ||
861 | $a99a = '([a-z])(\d)\g{-1}\g{-2}'; # safe for being interpolated | |
862 | ||
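The difference between the two versions of the pattern can be seen by running both against the same line. The sketch below uses the names C<$absolute> and C<$relative> (our own) for the two pattern strings and requires Perl 5.10 or later.

```perl
use strict;
use warnings;
use 5.010;

my $absolute = '([a-z])(\d)\g2\g1';         # counts groups from the left
my $relative = '([a-z])(\d)\g{-1}\g{-2}';   # counts back from its own position
my $line     = "code=e99e";

# On its own, the absolute version works as intended.
my $abs_alone = "e99e" =~ /^$absolute$/ ? 1 : 0;

# Embedded after another group, \g1 now refers to (\w+), so it fails.
my $abs_embedded = $line =~ /^(\w+)=$absolute$/ ? 1 : 0;

# The relative version still refers to its own two groups.
my ($key) = $line =~ /^(\w+)=$relative$/;

print "absolute alone:    $abs_alone\n";
print "absolute embedded: $abs_embedded\n";
print "relative embedded: ", (defined $key ? "matched, key=$key" : "no match"), "\n";
```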
863 | ||
864 | =head2 Named backreferences | |
865 | ||
c27a5cfe | 866 | Perl 5.10 also introduced named capture groups and named backreferences. |
7638d2dc WL |
867 | To attach a name to a capturing group, you write either |
868 | C<< (?<name>...) >> or C<< (?'name'...) >>. The backreference may | |
869 | then be written as C<\g{name}>. It is permissible to attach the | |
870 | same name to more than one group, but then only the leftmost one of the | |
871 | eponymous set can be referenced. Outside of the pattern a named | |
c27a5cfe | 872 | capture group is accessible through the C<%+> hash. |
7638d2dc | 873 | |
353c6505 | 874 | Assuming that we have to match calendar dates which may be given in one |
7638d2dc | 875 | of the three formats yyyy-mm-dd, mm/dd/yyyy or dd.mm.yyyy, we can write |
15776bb0 | 876 | three suitable patterns where we use C<'d'>, C<'m'> and C<'y'> respectively as the |
c27a5cfe | 877 | names of the groups capturing the pertaining components of a date. The |
7638d2dc WL |
878 | matching operation combines the three patterns as alternatives: |
879 | ||
880 | $fmt1 = '(?<y>\d\d\d\d)-(?<m>\d\d)-(?<d>\d\d)'; | |
881 | $fmt2 = '(?<m>\d\d)/(?<d>\d\d)/(?<y>\d\d\d\d)'; | |
882 | $fmt3 = '(?<d>\d\d)\.(?<m>\d\d)\.(?<y>\d\d\d\d)'; | |
14ccab5a | 883 | for my $d (qw(2006-10-21 15.01.2007 10/31/2005)) { |
7638d2dc WL |
884 | if ( $d =~ m{$fmt1|$fmt2|$fmt3} ){ |
885 | print "day=$+{d} month=$+{m} year=$+{y}\n"; | |
886 | } | |
887 | } | |
888 | ||
889 | If any of the alternatives matches, the hash C<%+> is bound to contain the | |
890 | three key-value pairs. | |
891 | ||
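The date example reads the named groups through C<%+> but does not use a named backreference inside the pattern. As a small sketch of the C<\g{name}> form, here is the doubled-word search from the backreferences section rewritten with a group that we have chosen to call C<word>; the sample text is also our own.

```perl
use strict;
use warnings;
use 5.010;

my $text = "it was the the best of times";

my $doubled;
# (?<word>...) names the group; \g{word} must match the same text again.
if ($text =~ /\b(?<word>\w+)\s+\g{word}\b/) {
    $doubled = $+{word};
}

print defined $doubled ? "doubled word: $doubled\n" : "no doubled word\n";
```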
892 | ||
893 | =head2 Alternative capture group numbering | |
894 | ||
895 | Yet another capturing group numbering technique (also available as of Perl 5.10) |
896 | deals with the problem of referring to groups within a set of alternatives. | |
897 | Consider a pattern for matching a time of the day, civil or military style: | |
47f9c88b | 898 | |
7638d2dc WL |
899 | if ( $time =~ /(\d\d|\d):(\d\d)|(\d\d)(\d\d)/ ){ |
900 | # process hour and minute | |
901 | } | |
902 | ||
903 | Processing the results requires an additional if statement to determine | |
353c6505 | 904 | whether C<$1> and C<$2> or C<$3> and C<$4> contain the goodies. It would |
c27a5cfe | 905 | be easier if we could use group numbers 1 and 2 in the second alternative as |
353c6505 | 906 | well, and this is exactly what the parenthesized construct C<(?|...)>, |
7638d2dc WL |
907 | set around the alternatives, achieves. Here is an extended version of the |
908 | previous pattern: | |
909 | ||
555bd962 BG |
910 | if($time =~ /(?|(\d\d|\d):(\d\d)|(\d\d)(\d\d))\s+([A-Z][A-Z][A-Z])/){ |
911 | print "hour=$1 minute=$2 zone=$3\n"; | |
912 | } | |
7638d2dc | 913 | |
c27a5cfe | 914 | Within the alternative numbering group, group numbers start at the same |
7638d2dc | 915 | position for each alternative. After the group, numbering continues |
353c6505 | 916 | with one higher than the maximum reached across all the alternatives. |
7638d2dc WL |
917 | |
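To watch the numbering at work, we can feed the extended pattern one civil-style and one military-style time. The two sample strings below are invented for this sketch, and the pattern needs Perl 5.10 or later.

```perl
use strict;
use warnings;
use 5.010;

my $re = qr/(?|(\d\d|\d):(\d\d)|(\d\d)(\d\d))\s+([A-Z][A-Z][A-Z])/;

my @parsed;
for my $time ("9:30 UTC", "0930 EST") {
    # Whichever alternative matched, hour is $1 and minute is $2;
    # the zone, which follows the (?|...) group, is always $3.
    if ($time =~ $re) {
        push @parsed, "hour=$1 minute=$2 zone=$3";
    }
}

print "$_\n" for @parsed;
```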
918 | =head2 Position information | |
919 | ||
13e5d9cd | 920 | In addition to what was matched, Perl also provides the |
7638d2dc | 921 | positions of what was matched as contents of the C<@-> and C<@+> |
47f9c88b GS |
922 | arrays. C<$-[0]> is the position of the start of the entire match and |
923 | C<$+[0]> is the position of the end. Similarly, C<$-[n]> is the | |
924 | position of the start of the C<$n> match and C<$+[n]> is the position | |
925 | of the end. If C<$n> is undefined, so are C<$-[n]> and C<$+[n]>. Then | |
926 | this code | |
927 | ||
928 | $x = "Mmm...donut, thought Homer"; | |
929 | $x =~ /^(Mmm|Yech)\.\.\.(donut|peas)/; # matches | |
555bd962 BG |
930 | foreach $exp (1..$#-) { |
931 | print "Match $exp: '${$exp}' at position ($-[$exp],$+[$exp])\n"; | |
47f9c88b GS |
932 | } |
933 | ||
934 | prints | |
935 | ||
936 | Match 1: 'Mmm' at position (0,3) | |
937 | Match 2: 'donut' at position (6,11) | |
938 | ||
939 | Even if there are no groupings in a regexp, it is still possible to | |
7638d2dc | 940 | find out what exactly matched in a string. If you use them, Perl |
47f9c88b | 941 | will set C<$`> to the part of the string before the match, will set C<$&> |
15776bb0 | 942 | to the part of the string that matched, and will set C<$'> to the part |
47f9c88b GS |
943 | of the string after the match. An example: |
944 | ||
945 | $x = "the cat caught the mouse"; | |
946 | $x =~ /cat/; # $` = 'the ', $& = 'cat', $' = ' caught the mouse' | |
947 | $x =~ /the/; # $` = '', $& = 'the', $' = ' cat caught the mouse' | |
948 | ||
7638d2dc WL |
949 | In the second match, C<$`> equals C<''> because the regexp matched at the |
950 | first character position in the string and stopped; it never saw the | |
15776bb0 | 951 | second "the". |
13b0f67d DM |
952 | |
953 | If your code is to run on Perl versions earlier than | |
15776bb0 | 954 | 5.20, it is worthwhile to note that using C<$`> and C<$'> |
7638d2dc | 955 | slows down regexp matching quite a bit, while C<$&> slows it down to a |
47f9c88b | 956 | lesser extent, because if they are used in one regexp in a program, |
7638d2dc | 957 | they are generated for I<all> regexps in the program. So if raw |
47f9c88b | 958 | performance is a goal of your application, they should be avoided. |
7638d2dc WL |
959 | If you need to extract the corresponding substrings, use C<@-> and |
960 | C<@+> instead: | |
47f9c88b GS |
961 | |
962 | $` is the same as substr( $x, 0, $-[0] ) | |
963 | $& is the same as substr( $x, $-[0], $+[0]-$-[0] ) | |
964 | $' is the same as substr( $x, $+[0] ) | |
965 | ||
78622607 | 966 | As of Perl 5.10, the C<${^PREMATCH}>, C<${^MATCH}> and C<${^POSTMATCH}> |
13b0f67d DM |
967 | variables may be used. These are only set if the C</p> modifier is |
968 | present. Consequently they do not penalize the rest of the program. In | |
969 | Perl 5.20, C<${^PREMATCH}>, C<${^MATCH}> and C<${^POSTMATCH}> are available | |
970 | whether the C</p> has been used or not (the modifier is ignored), and | |
15776bb0 | 971 | C<$`>, C<$'> and C<$&> do not cause any speed difference. |
7638d2dc WL |
972 | |
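The two alternatives to C<$`>, C<$&> and C<$'> can be compared side by side. The following sketch computes the three pieces once from C<@-> and C<@+> and once from the C</p> variables; the hash names are ours, and the C</p> modifier requires Perl 5.10 or later.

```perl
use strict;
use warnings;
use 5.010;

my $x = "the cat caught the mouse";

my (%by_offset, %by_p);
if ($x =~ /cat/p) {
    # Pieces cut out with substr, using the match offsets.
    %by_offset = (
        pre   => substr($x, 0, $-[0]),
        match => substr($x, $-[0], $+[0] - $-[0]),
        post  => substr($x, $+[0]),
    );
    # The same pieces as provided by the /p modifier.
    %by_p = (
        pre   => ${^PREMATCH},
        match => ${^MATCH},
        post  => ${^POSTMATCH},
    );
}

for my $part (qw(pre match post)) {
    print "$part: '$by_offset{$part}' / '$by_p{$part}'\n";
}
```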
973 | =head2 Non-capturing groupings | |
974 | ||
353c6505 | 975 | A group that is required to bundle a set of alternatives may or may not be |
7638d2dc | 976 | useful as a capturing group. If it isn't, it just creates a superfluous |
c27a5cfe | 977 | addition to the set of available capture group values, inside as well as |
7638d2dc | 978 | outside the regexp. Non-capturing groupings, denoted by C<(?:regexp)>, |
353c6505 | 979 | still allow the regexp to be treated as a single unit, but don't establish |
c27a5cfe | 980 | a capturing group at the same time. Both capturing and non-capturing |
7638d2dc WL |
981 | groupings are allowed to co-exist in the same regexp. Because there is |
982 | no extraction, non-capturing groupings are faster than capturing | |
983 | groupings. Non-capturing groupings are also handy for choosing exactly | |
984 | which parts of a regexp are to be extracted to matching variables: | |
985 | ||
986 | # match a number, $1-$4 are set, but we only want $1 | |
987 | /([+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?)/; | |
988 | ||
989 | # match a number faster, only $1 is set |
990 | /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?)/; | |
991 | ||
992 | # match a number, get $1 = whole number, $2 = exponent | |
993 | /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE]([+-]?\d+))?)/; | |
994 | ||
995 | Non-capturing groupings are also useful for removing nuisance | |
996 | elements gathered from a split operation where parentheses are | |
997 | required for some reason: | |
998 | ||
999 | $x = '12aba34ba5'; | |
9b846e30 | 1000 | @num = split /(a|b)+/, $x; # @num = ('12','a','34','a','5') |
7638d2dc WL |
1001 | @num = split /(?:a|b)+/, $x; # @num = ('12','34','5') |
1002 | ||
33be4c61 MH |
1003 | In Perl 5.22 and later, all groups within a regexp can be set to |
1004 | non-capturing by using the new C</n> flag: | |
1005 | ||
1006 | "hello" =~ /(hi|hello)/n; # $1 is not set! | |
1007 | ||
1008 | See L<perlre/"n"> for more information. | |
7638d2dc | 1009 | |
47f9c88b GS |
1010 | =head2 Matching repetitions |
1011 | ||
1012 | The backreference examples shown earlier display an annoying weakness. We |
7638d2dc WL |
1013 | were only matching 3-letter words, or chunks of words of 4 letters or |
1014 | less. We'd like to be able to match words or, more generally, strings | |
1015 | of any length, without writing out tedious alternatives like | |
47f9c88b GS |
1016 | C<\w\w\w\w|\w\w\w|\w\w|\w>. |
1017 | ||
15776bb0 KW |
1018 | This is exactly the problem the I<quantifier> metacharacters C<'?'>, |
1019 | C<'*'>, C<'+'>, and C<{}> were created for. They allow us to delimit the | |
7638d2dc | 1020 | number of repeats for a portion of a regexp we consider to be a |
47f9c88b GS |
1021 | match. Quantifiers are put immediately after the character, character |
1022 | class, or grouping that we want to specify. They have the following | |
1023 | meanings: | |
1024 | ||
1025 | =over 4 | |
1026 | ||
551e1d92 | 1027 | =item * |
47f9c88b | 1028 | |
15776bb0 | 1029 | C<a?> means: match C<'a'> 1 or 0 times |
47f9c88b | 1030 | |
551e1d92 RB |
1031 | =item * |
1032 | ||
15776bb0 | 1033 | C<a*> means: match C<'a'> 0 or more times, I<i.e.>, any number of times |
551e1d92 RB |
1034 | |
1035 | =item * | |
47f9c88b | 1036 | |
15776bb0 | 1037 | C<a+> means: match C<'a'> 1 or more times, I<i.e.>, at least once |
551e1d92 RB |
1038 | |
1039 | =item * | |
1040 | ||
7638d2dc | 1041 | C<a{n,m}> means: match at least C<n> times, but not more than C<m> |
47f9c88b GS |
1042 | times. |
1043 | ||
551e1d92 RB |
1044 | =item * |
1045 | ||
7638d2dc | 1046 | C<a{n,}> means: match C<n> or more times |
551e1d92 RB |
1047 | |
1048 | =item * | |
47f9c88b | 1049 | |
7638d2dc | 1050 | C<a{n}> means: match exactly C<n> times |
47f9c88b GS |
1051 | |
1052 | =back | |
1053 | ||
1054 | Here are some examples: | |
1055 | ||
7638d2dc | 1056 | /[a-z]+\s+\d*/; # match a lowercase word, at least one space, and |
47f9c88b | 1057 | # any number of digits |
d8b950dc | 1058 | /(\w+)\s+\g1/; # match doubled words of arbitrary length |
47f9c88b | 1059 | /y(es)?/i; # matches 'y', 'Y', or a case-insensitive 'yes' |
c2ac8995 NS |
1060 | $year =~ /^\d{2,4}$/; # make sure year is at least 2 but not more |
1061 | # than 4 digits | |
555bd962 BG |
1062 | $year =~ /^\d{4}$|^\d{2}$/; # better match; throw out 3-digit dates |
1063 | $year =~ /^\d{2}(\d{2})?$/; # same thing written differently. | |
1064 | # However, this captures the last two | |
1065 | # digits in $1 and the other does not. | |
47f9c88b | 1066 | |
d8b950dc | 1067 | % simple_grep '^(\w+)\g1$' /usr/dict/words # isn't this easier? |
47f9c88b GS |
1068 | beriberi |
1069 | booboo | |
1070 | coco | |
1071 | mama | |
1072 | murmur | |
1073 | papa | |
1074 | ||
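The difference between the two year checks above is easiest to see on a few sample values. The candidate strings in this sketch are made up for illustration.

```perl
use strict;
use warnings;

my @candidates = qw(07 199 2024 12345);

# Two to four digits: lets a 3-digit "year" through.
my @loose = grep { /^\d{2,4}$/ } @candidates;

# Exactly four or exactly two digits.
my @exact = grep { /^\d{4}$|^\d{2}$/ } @candidates;

print "loose: @loose\n";
print "exact: @exact\n";
```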
7638d2dc | 1075 | For all of these quantifiers, Perl will try to match as much of the |
47f9c88b | 1076 | string as possible, while still allowing the regexp to succeed. Thus |
15776bb0 | 1077 | with C</a?.../>, Perl will first try to match the regexp with the C<'a'> |
7638d2dc | 1078 | present; if that fails, Perl will try to match the regexp without the |
15776bb0 | 1079 | C<'a'> present. For the quantifier C<'*'>, we get the following: |
47f9c88b GS |
1080 | |
1081 | $x = "the cat in the hat"; | |
1082 | $x =~ /^(.*)(cat)(.*)$/; # matches, | |
1083 | # $1 = 'the ' | |
1084 | # $2 = 'cat' | |
1085 | # $3 = ' in the hat' | |
1086 | ||
1087 | This is what we might expect: the match finds the only C<cat> in the |
1088 | string and locks onto it. Consider, however, this regexp: | |
1089 | ||
1090 | $x =~ /^(.*)(at)(.*)$/; # matches, | |
1091 | # $1 = 'the cat in the h' | |
1092 | # $2 = 'at' | |
7638d2dc | 1093 | # $3 = '' (0 characters match) |
47f9c88b | 1094 | |
7638d2dc | 1095 | One might initially guess that Perl would find the C<at> in C<cat> and |
47f9c88b GS |
1096 | stop there, but that wouldn't give the longest possible string to the |
1097 | first quantifier C<.*>. Instead, the first quantifier C<.*> grabs as | |
1098 | much of the string as possible while still having the regexp match. In | |
a6b2f353 | 1099 | this example, that means matching the C<at> sequence against the final C<at> |
f5b885cd | 1100 | in the string. The other important principle illustrated here is that, |
47f9c88b | 1101 | when there are two or more elements in a regexp, the I<leftmost> |
f5b885cd | 1102 | quantifier, if there is one, gets to grab as much of the string as |
47f9c88b GS |
1103 | possible, leaving the rest of the regexp to fight over scraps. Thus in |
1104 | our example, the first quantifier C<.*> grabs most of the string, while | |
1105 | the second quantifier C<.*> gets the empty string. Quantifiers that | |
7638d2dc WL |
1106 | grab as much of the string as possible are called I<maximal match> or |
1107 | I<greedy> quantifiers. | |
47f9c88b GS |
1108 | |
1109 | When a regexp can match a string in several different ways, we can use | |
1110 | the principles above to predict which way the regexp will match: | |
1111 | ||
1112 | =over 4 | |
1113 | ||
1114 | =item * | |
551e1d92 | 1115 | |
47f9c88b GS |
1116 | Principle 0: Taken as a whole, any regexp will be matched at the |
1117 | earliest possible position in the string. | |
1118 | ||
1119 | =item * | |
551e1d92 | 1120 | |
47f9c88b GS |
1121 | Principle 1: In an alternation C<a|b|c...>, the leftmost alternative |
1122 | that allows a match for the whole regexp will be the one used. | |
1123 | ||
1124 | =item * | |
551e1d92 | 1125 | |
15776bb0 | 1126 | Principle 2: The maximal matching quantifiers C<'?'>, C<'*'>, C<'+'> and |
47f9c88b GS |
1127 | C<{n,m}> will in general match as much of the string as possible while |
1128 | still allowing the whole regexp to match. | |
1129 | ||
1130 | =item * | |
551e1d92 | 1131 | |
47f9c88b GS |
1132 | Principle 3: If there are two or more elements in a regexp, the |
1133 | leftmost greedy quantifier, if any, will match as much of the string | |
1134 | as possible while still allowing the whole regexp to match. The next | |
1135 | leftmost greedy quantifier, if any, will try to match as much of the | |
1136 | string remaining available to it as possible, while still allowing the | |
1137 | whole regexp to match. And so on, until all the regexp elements are | |
1138 | satisfied. | |
1139 | ||
1140 | =back | |
1141 | ||
ac036724 | 1142 | As we have seen above, Principle 0 overrides the others. The regexp |
47f9c88b GS |
1143 | will be matched as early as possible, with the other principles |
1144 | determining how the regexp matches at that earliest character | |
1145 | position. | |
1146 | ||
1147 | Here is an example of these principles in action: | |
1148 | ||
1149 | $x = "The programming republic of Perl"; | |
1150 | $x =~ /^(.+)(e|r)(.*)$/; # matches, | |
1151 | # $1 = 'The programming republic of Pe' | |
1152 | # $2 = 'r' | |
1153 | # $3 = 'l' | |
1154 | ||
1155 | This regexp matches at the earliest string position, C<'T'>. One | |
15776bb0 KW |
1156 | might think that C<'e'>, being leftmost in the alternation, would be |
1157 | matched, but C<'r'> produces the longest string in the first quantifier. | |
47f9c88b GS |
1158 | |
1159 | $x =~ /(m{1,2})(.*)$/; # matches, | |
1160 | # $1 = 'mm' | |
1161 | # $2 = 'ing republic of Perl' | |
1162 | ||
1163 | Here, the earliest possible match is at the first C<'m'> in |
1164 | C<programming>. C<m{1,2}> is the first quantifier, so it gets to match | |
1165 | a maximal C<mm>. | |
1166 | ||
1167 | $x =~ /.*(m{1,2})(.*)$/; # matches, | |
1168 | # $1 = 'm' | |
1169 | # $2 = 'ing republic of Perl' | |
1170 | ||
1171 | Here, the regexp matches at the start of the string. The first | |
1172 | quantifier C<.*> grabs as much as possible, leaving just a single | |
1173 | C<'m'> for the second quantifier C<m{1,2}>. | |
1174 | ||
1175 | $x =~ /(.?)(m{1,2})(.*)$/; # matches, | |
1176 | # $1 = 'a' | |
1177 | # $2 = 'mm' | |
1178 | # $3 = 'ing republic of Perl' | |
1179 | ||
1180 | Here, C<.?> eats its maximal one character at the earliest possible | |
1181 | position in the string, C<'a'> in C<programming>, leaving C<m{1,2}> | |
15776bb0 | 1182 | the opportunity to match both C<'m'>'s. Finally, |
47f9c88b GS |
1183 | |
1184 | "aXXXb" =~ /(X*)/; # matches with $1 = '' | |
1185 | ||
1186 | because it can match zero copies of C<'X'> at the beginning of the | |
1187 | string. If you definitely want to match at least one C<'X'>, use | |
1188 | C<X+>, not C<X*>. | |
1189 | ||
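A short sketch makes the difference between C<X*> and C<X+> visible by recording both what was captured and where the match started, taken from C<$-[0]>. The variable names are ours.

```perl
use strict;
use warnings;

# X* is satisfied by zero X's, so it succeeds at position 0.
my ($star)   = "aXXXb" =~ /(X*)/;
my $star_pos = $-[0];

# X+ needs at least one X, so the match has to move on to position 1.
my ($plus)   = "aXXXb" =~ /(X+)/;
my $plus_pos = $-[0];

print "X*: captured '$star' at position $star_pos\n";
print "X+: captured '$plus' at position $plus_pos\n";
```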
1190 | Sometimes greed is not good. At times, we would like quantifiers to | |
1191 | match a I<minimal> piece of string, rather than a maximal piece. For | |
7638d2dc WL |
1192 | this purpose, Larry Wall created the I<minimal match> or |
1193 | I<non-greedy> quantifiers C<??>, C<*?>, C<+?>, and C<{}?>. These are | |
15776bb0 | 1194 | the usual quantifiers with a C<'?'> appended to them. They have the |
47f9c88b GS |
1195 | following meanings: |
1196 | ||
1197 | =over 4 | |
1198 | ||
551e1d92 RB |
1199 | =item * |
1200 | ||
15776bb0 | 1201 | C<a??> means: match C<'a'> 0 or 1 times. Try 0 first, then 1. |
47f9c88b | 1202 | |
551e1d92 RB |
1203 | =item * |
1204 | ||
15776bb0 | 1205 | C<a*?> means: match C<'a'> 0 or more times, I<i.e.>, any number of times, |
47f9c88b GS |
1206 | but as few times as possible |
1207 | ||
551e1d92 RB |
1208 | =item * |
1209 | ||
15776bb0 | 1210 | C<a+?> means: match C<'a'> 1 or more times, I<i.e.>, at least once, but |
47f9c88b GS |
1211 | as few times as possible |
1212 | ||
551e1d92 RB |
1213 | =item * |
1214 | ||
7638d2dc | 1215 | C<a{n,m}?> means: match at least C<n> times, not more than C<m> |
47f9c88b GS |
1216 | times, as few times as possible |
1217 | ||
551e1d92 RB |
1218 | =item * |
1219 | ||
7638d2dc | 1220 | C<a{n,}?> means: match at least C<n> times, but as few times as |
47f9c88b GS |
1221 | possible |
1222 | ||
551e1d92 RB |
1223 | =item * |
1224 | ||
7638d2dc | 1225 | C<a{n}?> means: match exactly C<n> times. Because we match exactly |
47f9c88b GS |
1226 | C<n> times, C<a{n}?> is equivalent to C<a{n}> and is just there for |
1227 | notational consistency. | |
1228 | ||
1229 | =back | |
1230 | ||
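To see the list above in action, here is a short sketch that applies each greedy quantifier and its minimal counterpart to the same string. The test string C<"aaaa"> and the hash C<%captured> are our own illustrative choices:

```perl
use strict;
use warnings;

my $str = "aaaa";

# Apply each quantifier to 'a' and record what the group captures.
# The pattern is unanchored, and every variant can match at offset 0.
my %captured;
for my $quant ('?', '??', '*', '*?', '+', '+?',
               '{2,3}', '{2,3}?', '{2,}', '{2,}?') {
    my ($got) = $str =~ /(a$quant)/;
    $captured{$quant} = $got;
    printf "%-8s captures '%s'\n", "a$quant", $got;
}
```

The greedy forms take as many C<'a'> characters as their limits allow, while the minimal forms stop at the lower bound (which is the empty string for C<a??> and C<a*?>).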
1231 | Let's look at the example above, but with minimal quantifiers: | |
1232 | ||
1233 | $x = "The programming republic of Perl"; | |
1234 | $x =~ /^(.+?)(e|r)(.*)$/; # matches, | |
1235 | # $1 = 'Th' | |
1236 | # $2 = 'e' | |
1237 | # $3 = ' programming republic of Perl' | |
1238 | ||
15776bb0 | 1239 | The minimal string that will allow both the start of the string C<'^'> |
47f9c88b | 1240 | and the alternation to match is C<Th>, with the alternation C<e|r> |
15776bb0 | 1241 | matching C<'e'>. The second quantifier C<.*> is free to gobble up the |
47f9c88b GS |
1242 | rest of the string. |
1243 | ||
1244 | $x =~ /(m{1,2}?)(.*?)$/; # matches, | |
1245 | # $1 = 'm' | |
1246 | # $2 = 'ming republic of Perl' | |
1247 | ||
1248 | The first string position that this regexp can match is at the first | |
1249 | C<'m'> in C<programming>. At this position, the minimal C<m{1,2}?> | |
1250 | matches just one C<'m'>. Although the second quantifier C<.*?> would | |
1251 | prefer to match no characters, it is constrained by the end-of-string | |
15776bb0 | 1252 | anchor C<'$'> to match the rest of the string. |
47f9c88b GS |
1253 | |
1254 | $x =~ /(.*?)(m{1,2}?)(.*)$/; # matches, | |
1255 | # $1 = 'The progra' | |
1256 | # $2 = 'm' | |
1257 | # $3 = 'ming republic of Perl' | |
1258 | ||
1259 | In this regexp, you might expect the first minimal quantifier C<.*?> | |
15776bb0 | 1260 | to match the empty string, because it is not constrained by a C<'^'> |
47f9c88b GS |
1261 | anchor to match the beginning of the string. Principle 0 applies here, | |
1262 | however. Because it is possible for the whole regexp to match at the | |
1263 | start of the string, it I<will> match at the start of the string. Thus | |
15776bb0 KW |
1264 | the first quantifier has to match everything up to the first C<'m'>. The |
1265 | second minimal quantifier matches just one C<'m'> and the third | |
47f9c88b GS |
1266 | quantifier matches the rest of the string. |
1267 | ||
1268 | $x =~ /(.??)(m{1,2})(.*)$/; # matches, | |
1269 | # $1 = 'a' | |
1270 | # $2 = 'mm' | |
1271 | # $3 = 'ing republic of Perl' | |
1272 | ||
1273 | Just as in the previous regexp, the first quantifier C<.??> can match | |
1274 | earliest at position C<'a'>, so it does. The second quantifier is | |
1275 | greedy, so it matches C<mm>, and the third matches the rest of the | |
1276 | string. | |
1277 | ||
1278 | We can modify principle 3 above to take into account non-greedy | |
1279 | quantifiers: | |
1280 | ||
1281 | =over 4 | |
1282 | ||
1283 | =item * | |
551e1d92 | 1284 | |
47f9c88b GS |
1285 | Principle 3: If there are two or more elements in a regexp, the |
1286 | leftmost greedy (non-greedy) quantifier, if any, will match as much | |
1287 | (little) of the string as possible while still allowing the whole | |
1288 | regexp to match. The next leftmost greedy (non-greedy) quantifier, if | |
1289 | any, will try to match as much (little) of the string remaining | |
1290 | available to it as possible, while still allowing the whole regexp to | |
1291 | match. And so on, until all the regexp elements are satisfied. | |
1292 | ||
1293 | =back | |
1294 | ||
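Principle 3 can be observed by putting two quantifiers on either side of a separator and switching the leftmost one between greedy and minimal. This is a sketch on a made-up string, C<"key=value=more">, not an example from elsewhere in this tutorial:

```perl
use strict;
use warnings;

my $s = "key=value=more";

# Leftmost quantifier greedy: it takes as much as it can,
# leaving only what the rest of the regexp needs.
my ($g1, $g2) = $s =~ /^(.+)=(.+)$/;

# Leftmost quantifier minimal: it takes as little as it can,
# so the split happens at the first '='.
my ($m1, $m2) = $s =~ /^(.+?)=(.+)$/;

# Both minimal: the second one would prefer one character,
# but the '$' anchor forces it to take the rest of the string.
my ($n1, $n2) = $s =~ /^(.+?)=(.+?)$/;

print "greedy:  '$g1' / '$g2'\n";
print "minimal: '$m1' / '$m2'\n";
print "both:    '$n1' / '$n2'\n";
```

In each case the leftmost quantifier gets its preference first, and the later ones make do with what remains.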
1295 | Just like alternation, quantifiers are also susceptible to | |
1296 | backtracking. Here is a step-by-step analysis of the example | |
1297 | ||
1298 | $x = "the cat in the hat"; | |
1299 | $x =~ /^(.*)(at)(.*)$/; # matches, | |
1300 | # $1 = 'the cat in the h' | |
1301 | # $2 = 'at' | |
1302 | # $3 = '' (0 characters matched) | |
1303 | ||
1304 | =over 4 | |
1305 | ||
15776bb0 | 1306 | =item Z<>0. Start with the first letter in the string C<'t'>. |
47f9c88b | 1307 | |
15776bb0 | 1308 | E<nbsp> |
551e1d92 | 1309 | |
15776bb0 KW |
1310 | =item Z<>1. The first quantifier C<'.*'> starts out by matching the whole |
1311 | string "C<the cat in the hat>". | |
47f9c88b | 1312 | |
15776bb0 | 1313 | E<nbsp> |
551e1d92 | 1314 | |
15776bb0 KW |
1315 | =item Z<>2. C<'a'> in the regexp element C<'at'> doesn't match the end |
1316 | of the string. Backtrack one character. | |
47f9c88b | 1317 | |
15776bb0 | 1318 | E<nbsp> |
551e1d92 | 1319 | |
15776bb0 KW |
1320 | =item Z<>3. C<'a'> in the regexp element C<'at'> still doesn't match |
1321 | the last letter of the string C<'t'>, so backtrack one more character. | |
47f9c88b | 1322 | |
15776bb0 | 1323 | E<nbsp> |
551e1d92 | 1324 | |
15776bb0 | 1325 | =item Z<>4. Now we can match the C<'a'> and the C<'t'>. |
47f9c88b | 1326 | |
15776bb0 | 1327 | E<nbsp> |
551e1d92 | 1328 | |
15776bb0 KW |
1329 | =item Z<>5. Move on to the third element C<'.*'>. Since we are at the |
1330 | end of the string and C<'.*'> can match 0 times, assign it the empty | |
1331 | string. | |
47f9c88b | 1332 | |
15776bb0 | 1333 | E<nbsp> |
551e1d92 | 1334 | |
15776bb0 | 1335 | =item Z<>6. We are done! |
47f9c88b GS |
1336 | |
1337 | =back | |
1338 | ||
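The outcome of the backtracking walk above can be confirmed by looking at where the C<at> group ends up. The sketch below uses the C<@-> array to report the starting offset of group 2, and repeats the match with a minimal first quantifier for contrast (that second regexp is our addition):

```perl
use strict;
use warnings;

my $x = "the cat in the hat";

# Greedy first group: backtracks from the end until 'at' fits.
$x =~ /^(.*)(at)(.*)$/ or die "no match";
my @greedy    = ($1, $2, $3);
my $greedy_at = $-[2];          # offset where 'at' was matched

# Minimal first group: grows from the start until 'at' fits.
$x =~ /^(.*?)(at)(.*)$/ or die "no match";
my @minimal    = ($1, $2, $3);
my $minimal_at = $-[2];

print "greedy:  'at' at offset $greedy_at, \$1 = '$greedy[0]'\n";
print "minimal: 'at' at offset $minimal_at, \$1 = '$minimal[0]'\n";
```

The greedy version finds the C<at> in C<hat> (offset 16), while the minimal version settles for the C<at> in C<cat> (offset 5).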
1339 | Most of the time, all this moving forward and backtracking happens | |
7638d2dc | 1340 | quickly and searching is fast. There are some pathological regexps, |
47f9c88b GS |
1341 | however, whose execution time exponentially grows with the size of the |
1342 | string. A typical structure that blows up in your face is of the form | |
1343 | ||
1344 | /(a|b+)*/; | |
1345 | ||
1346 | The problem is the nested indeterminate quantifiers. There are many | |
15776bb0 KW |
1347 | different ways of partitioning a string of length n between the C<'+'> |
1348 | and C<'*'>: one repetition with C<b+> of length n, two repetitions with | |
47f9c88b | 1349 | the first C<b+> length k and the second with length n-k, m repetitions |
15776bb0 | 1350 | whose bits add up to length n, I<etc>. In fact there are an exponential |
7638d2dc | 1351 | number of ways to partition a string as a function of its length. A |
47f9c88b | 1352 | regexp may get lucky and match early in the process, but if there is |
7638d2dc | 1353 | no match, Perl will try I<every> possibility before giving up. So be |
15776bb0 | 1354 | careful with nested C<'*'>'s, C<{n,m}>'s, and C<'+'>'s. The book |
7638d2dc | 1355 | I<Mastering Regular Expressions> by Jeffrey Friedl gives a wonderful |
47f9c88b GS |
1356 | discussion of this and other efficiency issues. |
1357 | ||
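To get a feel for how fast the number of partitions grows, we can simply count them. The sketch below does not time Perl's regexp engine (which has optimizations of its own); it only counts the ways a run of n characters can be cut into consecutive non-empty pieces, which is what the nested C<b+> and C<*> could in principle try. The function name C<count_partitions> is hypothetical:

```perl
use strict;
use warnings;

# Count the ways to cut a run of $n characters into consecutive
# non-empty pieces: choose the length of the first piece, then
# count the ways to cut whatever is left.
sub count_partitions {
    my ($n) = @_;
    return 1 if $n == 0;        # nothing left: exactly one way
    my $ways = 0;
    for my $first (1 .. $n) {
        $ways += count_partitions($n - $first);
    }
    return $ways;
}

for my $n (1 .. 10) {
    printf "n = %2d: %4d partitions\n", $n, count_partitions($n);
}
```

The count doubles with each extra character (it equals 2 to the power n-1), which is why a failing match against a long string can take so long.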
7638d2dc WL |
1358 | |
1359 | =head2 Possessive quantifiers | |
1360 | ||
1361 | Backtracking during the relentless search for a match may be a waste | |
1362 | of time, particularly when the match is bound to fail. Consider | |
1363 | the simple pattern | |
1364 | ||
1365 | /^\w+\s+\w+$/; # a word, spaces, a word | |
1366 | ||
1367 | Whenever this is applied to a string which doesn't quite meet the | |
1368 | pattern's expectations such as S<C<"abc ">> or S<C<"abc def ">>, | |
15776bb0 | 1369 | the regexp engine will backtrack, approximately once for each character |
353c6505 DL |
1370 | in the string. But we know that there is no way around taking I<all> |
1371 | of the initial word characters to match the first repetition, that I<all> | |
7638d2dc | 1372 | spaces must be eaten by the middle part, and the same goes for the second |
353c6505 DL |
1373 | word. |
1374 | ||
1375 | With the introduction of the I<possessive quantifiers> in Perl 5.10, we | |
15776bb0 KW |
1376 | have a way of instructing the regexp engine not to backtrack, with the |
1377 | usual quantifiers with a C<'+'> appended to them. This makes them greedy as | |
353c6505 DL |
1378 | well as stingy; once they succeed they won't give anything back to permit |
1379 | another solution. They have the following meanings: | |
7638d2dc WL |
1380 | |
1381 | =over 4 | |
1382 | ||
1383 | =item * | |
1384 | ||
353c6505 DL |
1385 | C<a{n,m}+> means: match at least C<n> times, not more than C<m> times, |
1386 | as many times as possible, and don't give anything up. C<a?+> is short | |
7638d2dc WL |
1387 | for C<a{0,1}+>. | |
1388 | ||
1389 | =item * | |
1390 | ||
1391 | C<a{n,}+> means: match at least C<n> times, but as many times as possible, | |
353c6505 | 1392 | and don't give anything up. C<a*+> is short for C<a{0,}+> and C<a++> is |
7638d2dc WL |
1393 | short for C<a{1,}+>. |
1394 | ||
1395 | =item * | |
1396 | ||
1397 | C<a{n}+> means: match exactly C<n> times. It is just there for | |
1398 | notational consistency. | |
1399 | ||
1400 | =back | |
1401 | ||
353c6505 DL |
1402 | These possessive quantifiers represent a special case of a more general |
1403 | concept, the I<independent subexpression>, see below. | |
7638d2dc WL |
1404 | |
1405 | As an example where a possessive quantifier is suitable we consider | |
1406 | matching a quoted string, as it appears in several programming languages. | |
1407 | The backslash is used as an escape character that indicates that the | |
1408 | next character is to be taken literally, as another character for the | |
1409 | string. Therefore, after the opening quote, we expect a (possibly | |
353c6505 | 1410 | empty) sequence of alternatives: either some character except an |
7638d2dc WL |
1411 | unescaped quote or backslash or an escaped character. |
1412 | ||
1413 | /"(?:[^"\\]++|\\.)*+"/; | |
1414 | ||
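Here is a brief sketch that tries this pattern on two made-up inputs, one with escaped quotes inside the string and one with a missing closing quote. The sample strings are written with single quotes so that the backslashes stay literal:

```perl
use strict;
use warnings;

# The quoted-string pattern, compiled once (needs Perl 5.10+).
my $quoted = qr/"(?:[^"\\]++|\\.)*+"/;

# Hypothetical sample inputs.
my $good = 'say "a \"quoted\" word" now';
my $bad  = 'say "unterminated';

my ($found)     = $good =~ /($quoted)/;
my $bad_matches = $bad =~ $quoted ? 1 : 0;

print "found: $found\n";
print "unterminated string matches: $bad_matches\n";
```

In the second case the possessive quantifiers let the match fail as soon as the end of the input is reached, rather than retrying every shorter split of the string body.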
1415 | ||
47f9c88b GS |
1416 | =head2 Building a regexp |
1417 | ||
1418 | At this point, we have all the basic regexp concepts covered, so let's | |
1419 | give a more involved example of a regular expression. We will build a | |
1420 | regexp that matches numbers. | |
1421 | ||
1422 | The first task in building a regexp is to decide what we want to match | |
1423 | and what we want to exclude. In our case, we want to match both | |
1424 | integers and floating point numbers and we want to reject any string | |
1425 | that isn't a number. | |
1426 | ||
1427 | The next task is to break the problem down into smaller problems that | |
1428 | are easily converted into a regexp. | |
1429 | ||
1430 | The simplest case is integers. These consist of a sequence of digits, | |
1431 | with an optional sign in front. The digits we can represent with | |
1432 | C<\d+> and the sign can be matched with C<[+-]>. Thus the integer | |
1433 | regexp is | |
1434 | ||
1435 | /[+-]?\d+/; # matches integers | |
1436 | ||
1437 | A floating point number potentially has a sign, an integral part, a | |
1438 | decimal point, a fractional part, and an exponent. One or more of these | |
1439 | parts is optional, so we need to check out the different | |
1440 | possibilities. Floating point numbers which are in proper form include | |
1441 | 123., 0.345, .34, -1e6, and 25.4E-72. As with integers, the sign out | |
1442 | front is completely optional and can be matched by C<[+-]?>. We can | |
1443 | see that if there is no exponent, floating point numbers must have a | |
1444 | decimal point, otherwise they are integers. We might be tempted to | |
1445 | model these with C<\d*\.\d*>, but this would also match just a single | |
1446 | decimal point, which is not a number. So the three cases of floating | |
7638d2dc | 1447 | point number without exponent are |
47f9c88b GS |
1448 | |
1449 | /[+-]?\d+\./; # 1., 321., etc. | |
1450 | /[+-]?\.\d+/; # .1, .234, etc. | |
1451 | /[+-]?\d+\.\d+/; # 1.0, 30.56, etc. | |
1452 | ||
1453 | These can be combined into a single regexp with a three-way alternation: | |
1454 | ||
1455 | /[+-]?(\d+\.\d+|\d+\.|\.\d+)/; # floating point, no exponent | |
1456 | ||
1457 | In this alternation, it is important to put C<'\d+\.\d+'> before | |
1458 | C<'\d+\.'>. If C<'\d+\.'> were first, the regexp would happily match that | |
1459 | and ignore the fractional part of the number. | |
1460 | ||
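The effect of the ordering is easy to demonstrate. In this sketch we match the same number with the alternatives in the wrong and in the right order; the outer capture group and non-capturing C<(?:...)> are our additions so that we can see exactly what was matched:

```perl
use strict;
use warnings;

my $num = "30.56";

# '\d+\.' first: the alternation succeeds on "30." and stops there.
my ($wrong) = $num =~ /([+-]?(?:\d+\.|\d+\.\d+|\.\d+))/;

# '\d+\.\d+' first: the fractional part is included.
my ($right) = $num =~ /([+-]?(?:\d+\.\d+|\d+\.|\.\d+))/;

print "wrong order matched '$wrong'\n";
print "right order matched '$right'\n";
```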
1461 | Now consider floating point numbers with exponents. The key | |
1462 | observation here is that I<both> integers and numbers with decimal | |
1463 | points are allowed in front of an exponent. Then exponents, like the | |
1464 | overall sign, are independent of whether we are matching numbers with | |
15776bb0 | 1465 | or without decimal points, and can be "decoupled" from the |
47f9c88b GS |
1466 | mantissa. The overall form of the regexp now becomes clear: |
1467 | ||
1468 | /^(optional sign)(integer | f.p. mantissa)(optional exponent)$/; | |
1469 | ||
15776bb0 | 1470 | The exponent is an C<'e'> or C<'E'>, followed by an integer. So the |
47f9c88b GS |
1471 | exponent regexp is |
1472 | ||
1473 | /[eE][+-]?\d+/; # exponent | |
1474 | ||
1475 | Putting all the parts together, we get a regexp that matches numbers: | |
1476 | ||
1477 | /^[+-]?(\d+\.\d+|\d+\.|\.\d+|\d+)([eE][+-]?\d+)?$/; # Ta da! | |
1478 | ||
1479 | Long regexps like this may impress your friends, but can be hard to | |
f1dc5bb2 | 1480 | decipher. In complex situations like this, the C</x> modifier for a |
47f9c88b GS |
1481 | match is invaluable. It allows one to put nearly arbitrary whitespace |
1482 | and comments into a regexp without affecting its meaning. Using it, | |
15776bb0 | 1483 | we can rewrite our "extended" regexp in the more pleasing form |
47f9c88b GS |
1484 | |
1485 | /^ | |
1486 | [+-]? # first, match an optional sign | |
1487 | ( # then match integers or f.p. mantissas: | |
1488 | \d+\.\d+ # mantissa of the form a.b | |
1489 | |\d+\. # mantissa of the form a. | |
1490 | |\.\d+ # mantissa of the form .b | |
1491 | |\d+ # integer of the form a | |
1492 | ) | |
563642b4 | 1493 | ( [eE] [+-]? \d+ )? # finally, optionally match an exponent |
47f9c88b GS |
1494 | $/x; |
1495 | ||
1496 | If whitespace is mostly irrelevant, how does one include space | |
1497 | characters in an extended regexp? The answer is to backslash it | |
7638d2dc | 1498 | S<C<'\ '>> or put it in a character class S<C<[ ]>>. The same thing |
f5b885cd | 1499 | goes for pound signs: use C<\#> or C<[#]>. For instance, Perl allows |
7638d2dc | 1500 | a space between the sign and the mantissa or integer, and we could add |
47f9c88b GS |
1501 | this to our regexp as follows: |
1502 | ||
1503 | /^ | |
1504 | [+-]?\ * # first, match an optional sign *and space* | |
1505 | ( # then match integers or f.p. mantissas: | |
1506 | \d+\.\d+ # mantissa of the form a.b | |
1507 | |\d+\. # mantissa of the form a. | |
1508 | |\.\d+ # mantissa of the form .b | |
1509 | |\d+ # integer of the form a | |
1510 | ) | |
563642b4 | 1511 | ( [eE] [+-]? \d+ )? # finally, optionally match an exponent |
47f9c88b GS |
1512 | $/x; |
1513 | ||
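The rules for literal spaces and pound signs under C</x> can be checked with a few one-line matches. This is a minimal sketch on made-up strings:

```perl
use strict;
use warnings;

# Without /x the space in the pattern is significant.
my $plain  = "a b" =~ /a b/    ? 1 : 0;

# With /x the bare space is ignored, so the pattern is really 'ab'.
my $x_bare = "a b" =~ /a b/x   ? 1 : 0;
my $x_ab   = "ab"  =~ /a b/x   ? 1 : 0;

# A backslashed space or a character class brings the space back.
my $x_esc  = "a b" =~ /a\ b/x  ? 1 : 0;
my $x_cls  = "a b" =~ /a[ ]b/x ? 1 : 0;

# A backslashed pound sign is a literal '#', not a comment start.
my ($item) = "item #7" =~ /item \  \# (\d+)/x;

print "plain=$plain x_bare=$x_bare x_ab=$x_ab x_esc=$x_esc x_cls=$x_cls\n";
print "item number: $item\n";
```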
1514 | In this form, it is easier to see a way to simplify the | |
1515 | alternation. Alternatives 1, 2, and 4 all start with C<\d+>, so it | |
1516 | could be factored out: | |
1517 | ||
1518 | /^ | |
1519 | [+-]?\ * # first, match an optional sign | |
1520 | ( # then match integers or f.p. mantissas: | |
1521 | \d+ # start out with a ... | |
1522 | ( | |
1523 | \.\d* # mantissa of the form a.b or a. | |
1524 | )? # ? takes care of integers of the form a | |
1525 | |\.\d+ # mantissa of the form .b | |
1526 | ) | |
563642b4 | 1527 | ( [eE] [+-]? \d+ )? # finally, optionally match an exponent |
47f9c88b GS |
1528 | $/x; |
1529 | ||
77c8f263 KW |
1530 | Starting in Perl v5.26, specifying C</xx> changes the square-bracketed |
1531 | portions of a pattern to ignore tabs and space characters unless they | |
1532 | are escaped by preceding them with a backslash. So, we could write | |
1533 | ||
1534 | /^ | |
1535 | [ + - ]?\ * # first, match an optional sign | |
1536 | ( # then match integers or f.p. mantissas: | |
1537 | \d+ # start out with a ... | |
1538 | ( | |
1539 | \.\d* # mantissa of the form a.b or a. | |
1540 | )? # ? takes care of integers of the form a | |
1541 | |\.\d+ # mantissa of the form .b | |
1542 | ) | |
1543 | ( [ e E ] [ + - ]? \d+ )? # finally, optionally match an exponent | |
1544 | $/xx; | |
1545 | ||
1546 | This doesn't really improve the legibility of this example, but it's | |
1547 | available in case you want it. Squashing the pattern down to the | |
1548 | compact form, we have | |
47f9c88b GS |
1549 | |
1550 | /^[+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/; | |
1551 | ||
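A regexp like this deserves a test against both the strings it should accept and those it should reject. The sketch below does so with the final regexp; the lists C<@accept> and C<@reject> are our own samples, drawn partly from the examples earlier in this section:

```perl
use strict;
use warnings;

my $number = qr/^[+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/;

my @accept = ("123.", "0.345", ".34", "-1e6", "25.4E-72", "+ 7");
my @reject = (".", "1e", "1.2.3", "abc", "");

# Collect any strings the regexp gets wrong in either direction.
my @wrongly_rejected = grep { $_ !~ $number } @accept;
my @wrongly_accepted = grep { $_ =~ $number } @reject;

print "wrongly rejected: @wrongly_rejected\n";
print "wrongly accepted: @wrongly_accepted\n";
```

Both lists come out empty. Keeping such a pair of lists around makes it much safer to "optimize" a regexp later.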
1552 | This is our final regexp. To recap, we built a regexp by | |
1553 | ||
1554 | =over 4 | |
1555 | ||
551e1d92 RB |
1556 | =item * |
1557 | ||
1558 | specifying the task in detail, | |
47f9c88b | 1559 | |
551e1d92 RB |
1560 | =item * |
1561 | ||
1562 | breaking down the problem into smaller parts, | |
1563 | ||
1564 | =item * | |
47f9c88b | 1565 | |
551e1d92 | 1566 | translating the small parts into regexps, |
47f9c88b | 1567 | |
551e1d92 RB |
1568 | =item * |
1569 | ||
1570 | combining the regexps, | |
1571 | ||
1572 | =item * | |
47f9c88b | 1573 | |
551e1d92 | 1574 | and optimizing the final combined regexp. |
47f9c88b GS |
1575 | |
1576 | =back | |
1577 | ||
1578 | These are also the typical steps involved in writing a computer | |
1579 | program. This makes perfect sense, because regular expressions are | |
7638d2dc | 1580 | essentially programs written in a little computer language that specifies |
47f9c88b GS |
1581 | patterns. |
1582 | ||
1583 | =head2 Using regular expressions in Perl | |
1584 | ||
1585 | The last topic of Part 1 briefly covers how regexps are used in Perl | |
1586 | programs. Where do they fit into Perl syntax? | |
1587 | ||
1588 | We have already introduced the matching operator in its default | |
1589 | C</regexp/> and arbitrary delimiter C<m!regexp!> forms. We have used | |
1590 | the binding operator C<=~> and its negation C<!~> to test for string | |
1591 | matches. Associated with the matching operator, we have discussed the | |
f1dc5bb2 KW |
1592 | single line C</s>, multi-line C</m>, case-insensitive C</i> and |
1593 | extended C</x> modifiers. There are a few more things you might | |
353c6505 | 1594 | want to know about matching operators. |
47f9c88b | 1595 | |
7638d2dc WL |
1596 | =head3 Prohibiting substitution |
1597 | ||
1598 | Variables in a regexp are normally interpolated before matching, just as | |
47f9c88b GS |
1599 | in a double-quoted string. If you don't want any substitutions at all, use the | |
1600 | special delimiter C<m''>: | |
1601 | ||
16e8b840 | 1602 | @pattern = ('Seuss'); |
47f9c88b | 1603 | while (<>) { |
16e8b840 | 1604 | print if m'@pattern'; # matches literal '@pattern', not 'Seuss' |
47f9c88b GS |
1605 | } |
1606 | ||
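The following sketch contrasts the ordinary delimiters with C<m''> on two made-up lines of text, so that the effect does not depend on reading from a file:

```perl
use strict;
use warnings;

my @pattern = ('Seuss');
my $line_a  = 'Dr. Seuss wrote it';
my $line_b  = 'the text @pattern appears here';

# Ordinary delimiters interpolate @pattern, giving the regexp /Seuss/.
my $interp_a  = $line_a =~ /@pattern/  ? 1 : 0;

# m'' does not interpolate: it looks for the eight characters '@pattern'.
my $literal_a = $line_a =~ m'@pattern' ? 1 : 0;
my $literal_b = $line_b =~ m'@pattern' ? 1 : 0;

print "interpolated on line a: $interp_a\n";
print "literal on line a:      $literal_a\n";
print "literal on line b:      $literal_b\n";
```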
353c6505 | 1607 | Similar to strings, C<m''> acts like apostrophes on a regexp; all other |
15776bb0 | 1608 | C<'m'> delimiters act like quotes. If the regexp evaluates to the empty string, |
47f9c88b GS |
1609 | the regexp in the I<last successful match> is used instead. So we have |
1610 | ||
1611 | "dog" =~ /d/; # 'd' matches | |
1612 | "dogbert" =~ //; # this matches the 'd' regexp used before | |
1613 | ||
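This rule can bite when a pattern comes from a variable that happens to be empty. The sketch below shows the effect and one common guard, wrapping the variable in a non-capturing group so the regexp itself is never the empty string. The variable C<$pat> is hypothetical:

```perl
use strict;
use warnings;

"dog" =~ /d/ or die "setup match failed";

my $pat = '';

# The pattern evaluates to the empty string, so /d/ is reused;
# "cat" contains no 'd', and the match fails.
my $reused  = ("cat" =~ /$pat/)     ? 1 : 0;

# (?:) is not an empty pattern, so it is used as written
# and matches the empty string at the start of "cat".
my $guarded = ("cat" =~ /(?:$pat)/) ? 1 : 0;

print "reused=$reused guarded=$guarded\n";
```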
7638d2dc WL |
1614 | |
1615 | =head3 Global matching | |
1616 | ||
7698aede | 1617 | The final two modifiers we will discuss here, |
f1dc5bb2 KW |
1618 | C</g> and C</c>, concern multiple matches. |
1619 | The modifier C</g> stands for global matching and allows the | |
47f9c88b GS |
1620 | matching operator to match within a string as many times as possible. |
1621 | In scalar context, successive invocations against a string will have | |
f1dc5bb2 | 1622 | C</g> jump from match to match, keeping track of position in the |
47f9c88b GS |
1623 | string as it goes along. You can get or set the position with the |
1624 | C<pos()> function. | |
1625 | ||
f1dc5bb2 | 1626 | The use of C</g> is shown in the following example. Suppose we have |
47f9c88b GS |
1627 | a string that consists of words separated by spaces. If we know how |
1628 | many words there are in advance, we could extract the words using | |
1629 | groupings: | |
1630 | ||
1631 | $x = "cat dog house"; # 3 words | |
1632 | $x =~ /^\s*(\w+)\s+(\w+)\s+(\w+)\s*$/; # matches, | |
1633 | # $1 = 'cat' | |
1634 | # $2 = 'dog' | |
1635 | # $3 = 'house' | |
1636 | ||
1637 | But what if we had an indeterminate number of words? This is the sort | |
f1dc5bb2 | 1638 | of task C</g> was made for. To extract all words, form the simple |
47f9c88b GS |
1639 | regexp C<(\w+)> and loop over all matches with C</(\w+)/g>: |
1640 | ||
1641 | while ($x =~ /(\w+)/g) { | |
1642 | print "Word is $1, ends at position ", pos $x, "\n"; | |
1643 | } | |
1644 | ||
1645 | prints | |
1646 | ||
1647 | Word is cat, ends at position 3 | |
1648 | Word is dog, ends at position 7 | |
1649 | Word is house, ends at position 13 | |
1650 | ||
1651 | A failed match or changing the target string resets the position. If | |
1652 | you don't want the position reset after failure to match, add the | |
f1dc5bb2 | 1653 | C</c>, as in C</regexp/gc>. The current position in the string is |
47f9c88b GS |
1654 | associated with the string, not the regexp. This means that different |
1655 | strings have different positions and their respective positions can be | |
1656 | set or read independently. | |
1657 | ||
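Here is a small sketch of how C<pos()> behaves across successful and failed matches, with and without C</c>. The scalar names such as C<$kept> are ours:

```perl
use strict;
use warnings;

my $x = "cat dog house";

# Collect the position after each word.
my @ends;
while ($x =~ /(\w+)/g) {
    push @ends, pos($x);
}
my $after_loop = pos($x);     # the final, failed match reset it

my $ok1  = $x =~ /\w+/g;      # matches 'cat' again, from the start
my $ok2  = $x =~ /\d+/gc;     # fails, but /c keeps the position
my $kept = pos($x);

my $ok3   = $x =~ /\d+/g;     # fails without /c: position is reset
my $reset = pos($x);

print "ends: @ends\n";
print "after loop: ", defined $after_loop ? $after_loop : 'undef', "\n";
print "kept: $kept\n";
print "reset: ", defined $reset ? $reset : 'undef', "\n";
```

An undefined C<pos()> means the next C</g> match starts from the beginning of the string.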
f1dc5bb2 | 1658 | In list context, C</g> returns a list of matched groupings, or if |
47f9c88b GS |
1659 | there are no groupings, a list of matches to the whole regexp. So if |
1660 | we wanted just the words, we could use | |
1661 | ||
1662 | @words = ($x =~ /(\w+)/g); # matches, | |
5a0c7e9d PJ |
1663 | # $words[0] = 'cat' |
1664 | # $words[1] = 'dog' | |
1665 | # $words[2] = 'house' | |
47f9c88b | 1666 | |
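When the regexp has more than one grouping, list context returns the groups of every match in order, which is handy for filling a hash. The sketch below also shows a common idiom for counting matches; the C<$config> string is made up:

```perl
use strict;
use warnings;

my $config = "width=80, height=24, depth=8";

# Two groups per match: the flat list (name, value, name, value, ...)
# is exactly what a hash assignment expects.
my %setting = $config =~ /(\w+)=(\d+)/g;

# Assigning to an empty list forces list context; the scalar
# assignment then yields the number of elements returned.
my $count = () = $config =~ /\w+=\d+/g;

print "$_ => $setting{$_}\n" for sort keys %setting;
print "number of settings: $count\n";
```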
f1dc5bb2 KW |
1667 | Closely associated with the C</g> modifier is the C<\G> anchor. The |
1668 | C<\G> anchor matches at the point where the previous C</g> match left | |
47f9c88b GS |
1669 | off. C<\G> allows us to easily do context-sensitive matching: |
1670 | ||
1671 | $metric = 1; # use metric units | |
1672 | ... | |
1673 | $x = <FILE>; # read in measurement | |
1674 | $x =~ /^([+-]?\d+)\s*/g; # get magnitude | |
1675 | $weight = $1; | |
1676 | if ($metric) { # error checking | |
1677 | print "Units error!" unless $x =~ /\Gkg\./g; | |
1678 | } | |
1679 | else { | |
1680 | print "Units error!" unless $x =~ /\Glbs\./g; | |
1681 | } | |
1682 | $x =~ /\G\s+(widget|sprocket)/g; # continue processing | |
1683 | ||
f1dc5bb2 | 1684 | The combination of C</g> and C<\G> allows us to process the string a |
47f9c88b | 1685 | bit at a time and use arbitrary Perl logic to decide what to do next. |
25cf8c22 HS |
1686 | Currently, the C<\G> anchor is only fully supported when used to anchor |
1687 | to the start of the pattern. | |
47f9c88b | 1688 | |
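A typical use of this bit-at-a-time processing is a simple tokenizer. The following is a minimal sketch, assuming a toy input language of identifiers, integers, and the operators C<=> and C<+>; every alternative uses C</gc> so that a failed attempt leaves the position where it was:

```perl
use strict;
use warnings;

my $src = "x = 42 + y1";
my @tokens;

pos($src) = 0;
while (pos($src) < length $src) {
    if    ($src =~ /\G\s+/gc)            { next }                        # skip whitespace
    elsif ($src =~ /\G(\d+)/gc)          { push @tokens, "NUM:$1" }
    elsif ($src =~ /\G([A-Za-z_]\w*)/gc) { push @tokens, "ID:$1" }
    elsif ($src =~ /\G([=+])/gc)         { push @tokens, "OP:$1" }
    else  { die "Unexpected character at offset " . pos($src) . "\n" }
}

print join(' ', @tokens), "\n";
```

Each branch is anchored by C<\G>, so a token can only be recognized exactly where the previous one ended.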
f5b885cd | 1689 | C<\G> is also invaluable in processing fixed-length records with |
47f9c88b GS |
1690 | regexps. Suppose we have a snippet of coding region DNA, encoded as |
1691 | base pair letters C<ATCGTTGAAT...> and we want to find all the stop | |
1692 | codons C<TGA>. In a coding region, codons are 3-letter sequences, so | |
1693 | we can think of the DNA snippet as a sequence of 3-letter records. The | |
1694 | naive regexp | |
1695 | ||
1696 | # expanded, this is "ATC GTT GAA TGC AAA TGA CAT GAC" | |
1697 | $dna = "ATCGTTGAATGCAAATGACATGAC"; | |
1698 | $dna =~ /TGA/; | |
1699 | ||
d1be9408 | 1700 | doesn't work; it may match a C<TGA>, but there is no guarantee that |
15776bb0 | 1701 | the match is aligned with codon boundaries, I<e.g.>, the substring |
7638d2dc | 1702 | S<C<GTT GAA>> gives a match. A better solution is |
47f9c88b GS |
1703 | |
1704 | while ($dna =~ /(\w\w\w)*?TGA/g) { # note the minimal *? | |
1705 | print "Got a TGA stop codon at position ", pos $dna, "\n"; | |
1706 | } | |
1707 | ||
1708 | which prints | |
1709 | ||
1710 | Got a TGA stop codon at position 18 | |
1711 | Got a TGA stop codon at position 23 | |
1712 | ||
1713 | Position 18 is good, but position 23 is bogus. What happened? | |
1714 | ||
1715 | The answer is that our regexp works well until we get past the last | |
1716 | real match. Then the regexp will fail to match a synchronized C<TGA> | |
1717 | and start stepping ahead one character position at a time, not what we | |
1718 | want. The solution is to use C<\G> to anchor the match to the codon | |
1719 | alignment: | |
1720 | ||
1721 | while ($dna =~ /\G(\w\w\w)*?TGA/g) { | |
1722 | print "Got a TGA stop codon at position ", pos $dna, "\n"; | |
1723 | } | |
1724 | ||
1725 | This prints | |
1726 | ||
1727 | Got a TGA stop codon at position 18 | |
1728 | ||
1729 | which is the correct answer. This example illustrates that it is | |
1730 | important not only to match what is desired, but to reject what is not | |
1731 | desired. | |
1732 | ||
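The two loops can be compared side by side by collecting the positions instead of printing them. As a cross-check that does not rely on regexps for alignment, the sketch also splits the string into 3-letter records with C<unpack>; that cross-check is our addition:

```perl
use strict;
use warnings;

my $dna = "ATCGTTGAATGCAAATGACATGAC";

my (@unanchored, @anchored);
push @unanchored, pos($dna) while $dna =~ /(\w\w\w)*?TGA/g;
push @anchored,   pos($dna) while $dna =~ /\G(\w\w\w)*?TGA/g;

# Cross-check: cut into codons and report where each TGA codon ends.
my @codons = unpack "(A3)*", $dna;
my @by_record = map  { ($_ + 1) * 3 }
                grep { $codons[$_] eq 'TGA' } 0 .. $#codons;

print "without \\G: @unanchored\n";
print "with \\G:    @anchored\n";
print "by records: @by_record\n";
```

Only the C<\G> version agrees with the record-by-record count.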
0bd5a82d | 1733 | (There are other regexp modifiers that are available, such as |
f1dc5bb2 | 1734 | C</o>, but their specialized uses are beyond the |
0bd5a82d KW |
1735 | scope of this introduction.) | |
1736 | ||
7638d2dc | 1737 | =head3 Search and replace |
47f9c88b | 1738 | |
7638d2dc | 1739 | Regular expressions also play a big role in I<search and replace> |
47f9c88b GS |
1740 | operations in Perl. Search and replace is accomplished with the |
1741 | C<s///> operator. The general form is | |
1742 | C<s/regexp/replacement/modifiers>, with everything we know about | |
1743 | regexps and modifiers applying in this case as well. The | |
15776bb0 | 1744 | I<replacement> is a Perl double-quoted string that replaces in the |
47f9c88b GS |
1745 | string whatever is matched with the C<regexp>. The operator C<=~> is |
1746 | also used here to associate a string with C<s///>. If matching | |
7638d2dc | 1747 | against C<$_>, the S<C<$_ =~>> can be dropped. If there is a match, |
f5b885cd | 1748 | C<s///> returns the number of substitutions made; otherwise it returns |
47f9c88b GS |
1749 | false. Here are a few examples: |
1750 | ||
1751 | $x = "Time to feed the cat!"; | |
1752 | $x =~ s/cat/hacker/; # $x contains "Time to feed the hacker!" | |
1753 | if ($x =~ s/^(Time.*hacker)!$/$1 now!/) { | |
1754 | $more_insistent = 1; | |
1755 | } | |
1756 | $y = "'quoted words'"; | |
1757 | $y =~ s/^'(.*)'$/$1/; # strip single quotes, | |
1758 | # $y contains "quoted words" | |
1759 | ||
1760 | In the last example, the whole string was matched, but only the part | |
1761 | inside the single quotes was grouped. With the C<s///> operator, the | |
15776bb0 | 1762 | matched variables C<$1>, C<$2>, I<etc>. are immediately available for use |
47f9c88b GS |
1763 | in the replacement expression, so we use C<$1> to replace the quoted |
1764 | string with just what was quoted. With the global modifier, C<s///g> | |
1765 | will search and replace all occurrences of the regexp in the string: | |
1766 | ||
1767 | $x = "I batted 4 for 4"; | |
1768 | $x =~ s/4/four/; # doesn't do it all: | |
1769 | # $x contains "I batted four for 4" | |
1770 | $x = "I batted 4 for 4"; | |
1771 | $x =~ s/4/four/g; # does it all: | |
1772 | # $x contains "I batted four for four" | |
1773 | ||
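Since C<s///> returns the number of substitutions made, the return value can be used both as a count and as a success test. A small sketch:

```perl
use strict;
use warnings;

my $x = "I batted 4 for 4";

# With /g the return value is the number of replacements.
my $count = ($x =~ s/4/four/g);

# When nothing matches, the return value is false
# and the string is left untouched.
my $none = ($x =~ s/9/nine/g);

print "replaced $count occurrence(s): $x\n";
print "second substitution ", ($none ? "changed" : "did not change"), " the string\n";
```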
15776bb0 | 1774 | If you prefer "regex" over "regexp" in this tutorial, you could use |
47f9c88b GS |
1775 | the following program to replace it: |
1776 | ||
1777 | % cat > simple_replace | |
1778 | #!/usr/bin/perl | |
1779 | $regexp = shift; | |
1780 | $replacement = shift; | |
1781 | while (<>) { | |
c2e2285d | 1782 | s/$regexp/$replacement/g; |
47f9c88b GS |
1783 | print; |
1784 | } | |
1785 | ^D | |
1786 | ||
1787 | % simple_replace regexp regex perlretut.pod | |
1788 | ||
1789 | In C<simple_replace> we used the C<s///g> modifier to replace all | |
c2e2285d KW |
1790 | occurrences of the regexp on each line. (Even though the regular |
1791 | expression appears in a loop, Perl is smart enough to compile it | |
1792 | only once.) As with C<simple_grep>, both the | |
1793 | C<print> and the C<s/$regexp/$replacement/g> use C<$_> implicitly. | |
47f9c88b | 1794 | |
4f4d7508 DC |
1795 | If you don't want C<s///> to change your original variable you can use |
1796 | the non-destructive substitute modifier, C<s///r>. This changes the | |
d6b8a906 KW |
1797 | behavior so that C<s///r> returns the final substituted string |
1798 | (instead of the number of substitutions): | |
4f4d7508 DC |
1799 | |
1800 | $x = "I like dogs."; | |
1801 | $y = $x =~ s/dogs/cats/r; | |
1802 | print "$x $y\n"; | |
1803 | ||
1804 | That example will print "I like dogs. I like cats." Notice the original | |
f5b885cd | 1805 | C<$x> variable has not been affected. The overall |
4f4d7508 DC |
1806 | result of the substitution is instead stored in C<$y>. If the |
1807 | substitution doesn't affect anything then the original string is | |
1808 | returned: | |
1809 | ||
1810 | $x = "I like dogs."; | |
1811 | $y = $x =~ s/elephants/cougars/r; | |
1812 | print "$x $y\n"; # prints "I like dogs. I like dogs." | |
1813 | ||
1814 | One other interesting thing that the C<s///r> flag allows is chaining | |
1815 | substitutions: | |
1816 | ||
1817 | $x = "Cats are great."; | |
555bd962 BG |
1818 | print $x =~ s/Cats/Dogs/r =~ s/Dogs/Frogs/r =~ |
1819 | s/Frogs/Hedgehogs/r, "\n"; | |
4f4d7508 DC |
1820 | # prints "Hedgehogs are great." |
1821 | ||
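Because C<s///r> returns a new string and leaves its operand alone, it fits naturally inside C<map>. The sketch below (Perl 5.14 or later) trims whitespace from a made-up list without modifying the original array:

```perl
use strict;
use warnings;

my @raw = ("  cat ", "dog  ", " house");

# Each element is copied, trimmed, and returned; @raw is unchanged.
my @trimmed = map { s/^\s+|\s+$//gr } @raw;

print "trimmed: ", join('|', @trimmed), "\n";
print "raw:     ", join('|', @raw), "\n";
```

Without C</r>, the same C<map> would modify C<@raw> in place and return substitution counts instead of strings.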
47f9c88b | 1822 | A modifier available specifically to search and replace is the |
f5b885cd FC |
1823 | C<s///e> evaluation modifier. C<s///e> treats the |
1824 | replacement text as Perl code, rather than a double-quoted | |
1825 | string. The value that the code returns is substituted for the | |
47f9c88b GS |
1826 | matched substring. C<s///e> is useful if you need to do a bit of |
1827 | computation in the process of replacing text. This example counts | |
1828 | character frequencies in a line: | |
1829 | ||
1830 | $x = "Bill the cat"; | |
555bd962 | 1831 | $x =~ s/(.)/$chars{$1}++;$1/eg; # final $1 replaces char with itself |
47f9c88b GS |
1832 | print "frequency of '$_' is $chars{$_}\n" |
1833 | foreach (sort {$chars{$b} <=> $chars{$a}} keys %chars); | |
1834 | ||
1835 | This prints | |
1836 | ||
1837 | frequency of ' ' is 2 | |
1838 | frequency of 't' is 2 | |
1839 | frequency of 'l' is 2 | |
1840 | frequency of 'B' is 1 | |
1841 | frequency of 'c' is 1 | |
1842 | frequency of 'e' is 1 | |
1843 | frequency of 'h' is 1 | |
1844 | frequency of 'i' is 1 | |
1845 | frequency of 'a' is 1 | |
1846 | ||
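Two more small uses of C<s///e>, given as a sketch on made-up strings: doing arithmetic on each number in a string, and reformatting a number with C<sprintf>:

```perl
use strict;
use warnings;

my $order = "3 apples and 10 pears";

# The replacement is Perl code: each number is replaced by twice its value.
(my $doubled = $order) =~ s/(\d+)/$1 * 2/ge;

my $price = "total: 7.5";

# Any expression works, including a function call.
(my $formatted = $price) =~ s/(\d+(?:\.\d+)?)/sprintf("%.2f", $1)/e;

print "$doubled\n";
print "$formatted\n";
```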
1847 | As with the match C<m//> operator, C<s///> can use other delimiters, | |
1848 | such as C<s!!!> and C<s{}{}>, and even C<s{}//>. If single quotes are | |
f5b885cd FC |
1849 | used C<s'''>, then the regexp and replacement are |
1850 | treated as single-quoted strings and there are no | |
1851 | variable substitutions. C<s///> in list context | |
15776bb0 | 1852 | returns the same thing as in scalar context, I<i.e.>, the number of |
47f9c88b GS |
1853 | matches. |
1854 | ||
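These delimiter choices are illustrated in the following sketch. The strings and variable names are invented for the example:

```perl
use strict;
use warnings;

# Braces as delimiters: slashes in a path need no escaping.
my $path = "/usr/bin/perl";
(my $local = $path) =~ s{/usr/bin}{/usr/local/bin};

my $name = "Seuss";

# Ordinary delimiters interpolate the replacement ...
(my $dq = "Dear NAME") =~ s/NAME/$name/;

# ... single-quote delimiters insert the characters '$name' as they are.
(my $sq = "Dear NAME") =~ s'NAME'$name';

# In list context the result is still the substitution count.
my $t   = "a-b-c";
my @ret = ($t =~ s/-/+/g);

print "$local\n$dq\n$sq\n";
print "list-context return: @ret\n";
```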
7638d2dc | 1855 | =head3 The split function |
47f9c88b | 1856 | |
7638d2dc | 1857 | The C<split()> function is another place where a regexp is used. |
353c6505 DL |
1858 | C<split /regexp/, string, limit> separates the C<string> operand into |
1859 | a list of substrings and returns that list. The regexp must be designed | |
7638d2dc | 1860 | to match whatever constitutes the separators for the desired substrings. |
353c6505 | 1861 | The C<limit>, if present, constrains splitting into no more than C<limit> |
7638d2dc | 1862 | number of strings. For example, to split a string into words, use |
47f9c88b GS |
1863 | |
1864 | $x = "Calvin and Hobbes"; | |
1865 | @words = split /\s+/, $x; # $words[0] = 'Calvin' | |
1866 | # $words[1] = 'and' | |
1867 | # $words[2] = 'Hobbes' | |
1868 | ||
1869 | If the empty regexp C<//> is used, the regexp always matches and | |
1870 | the string is split into individual characters. If the regexp has | |
7638d2dc | 1871 | groupings, then the resulting list contains the matched substrings from the |
47f9c88b GS |
1872 | groupings as well. For instance, |
1873 | ||
1874 | $x = "/usr/bin/perl"; | |
1875 | @dirs = split m!/!, $x; # $dirs[0] = '' | |
1876 | # $dirs[1] = 'usr' | |
1877 | # $dirs[2] = 'bin' | |
1878 | # $dirs[3] = 'perl' | |
1879 | @parts = split m!(/)!, $x; # $parts[0] = '' | |
1880 | # $parts[1] = '/' | |
1881 | # $parts[2] = 'usr' | |
1882 | # $parts[3] = '/' | |
1883 | # $parts[4] = 'bin' | |
1884 | # $parts[5] = '/' | |
1885 | # $parts[6] = 'perl' | |
1886 | ||
15776bb0 | 1887 | Since the first character of C<$x> matched the regexp, C<split> prepended |
47f9c88b GS |
1888 | an empty initial element to the list. |
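The effect of the C<limit> operand is easiest to see on a record with more separators than wanted fields. The following is a small sketch; the sample record and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# A colon-separated record (sample data, invented for this example).
my $record = "root:x:0:0:System Administrator:/root:/bin/sh";

# No limit: every ':' acts as a separator.
my @all = split /:/, $record;

# Limit of 3: at most three strings are returned, and the last one
# keeps whatever separators remain in the rest of the string.
my @first3 = split /:/, $record, 3;

print scalar(@all), " fields without a limit\n";
print "third field with limit 3: $first3[2]\n";
```

With a limit, the final element is simply the unsplit remainder of the string.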
1889 | ||
1890 | If you have read this far, congratulations! You now have all the basic | |
1891 | tools needed to use regular expressions to solve a wide range of text | |
1892 | processing problems. If this is your first time through the tutorial, | |
f5b885cd | 1893 | why not stop here and play around with regexps for a while? S<Part 2> | |
47f9c88b GS |
1894 | concerns the more esoteric aspects of regular expressions and those |
1895 | concepts certainly aren't needed right at the start. | |
1896 | ||
1897 | =head1 Part 2: Power tools | |
1898 | ||
1899 | OK, you know the basics of regexps and you want to know more. If | |
1900 | matching regular expressions is analogous to a walk in the woods, then | |
1901 | the tools discussed in Part 1 are analogous to topo maps and a | |
1902 | compass, basic tools we use all the time. Most of the tools in Part 2 | |
da75cd15 | 1903 | are analogous to flare guns and satellite phones. They aren't used |
47f9c88b GS |
1904 | too often on a hike, but when we are stuck, they can be invaluable. |
1905 | ||
1906 | What follows are the more advanced, less used, or sometimes esoteric | |
7638d2dc | 1907 | capabilities of Perl regexps. In Part 2, we will assume you are |
7c579eed | 1908 | comfortable with the basics and concentrate on the advanced features. |
47f9c88b GS |
1909 | |
1910 | =head2 More on characters, strings, and character classes | |
1911 | ||
1912 | There are a number of escape sequences and character classes that we | |
1913 | haven't covered yet. | |
1914 | ||
1915 | There are several escape sequences that convert characters or strings | |
7638d2dc | 1916 | between upper and lower case, and they are also available within |
353c6505 | 1917 | patterns. C<\l> and C<\u> convert the next character to lower or |
7638d2dc | 1918 | upper case, respectively: |
47f9c88b GS |
1919 | |
1920 | $x = "perl"; | |
1921 | $string =~ /\u$x/; # matches 'Perl' in $string | |
1922 | $x = "M(rs?|s)\\."; # note the double backslash | |
1923 | $string =~ /\l$x/; # matches 'mr.', 'mrs.', and 'ms.' | |
1924 | ||
7638d2dc WL |
1925 | A C<\L> or C<\U> indicates a lasting conversion of case, until |
1926 | terminated by C<\E> or overridden by another C<\U> or C<\L>: | |
47f9c88b GS |
1927 | |
1928 | $x = "This word is in lower case:\L SHOUT\E"; | |
1929 | $x =~ /shout/; # matches | |
1930 | $x = "I STILL KEYPUNCH CARDS FOR MY 360"; | |
1931 | $x =~ /\Ukeypunch/; # matches punch card string | |
1932 | ||
1933 | If there is no C<\E>, case is converted until the end of the | |
1934 | string. The regexps C<\L\u$word> or C<\u\L$word> convert the first | |
1935 | character of C<$word> to uppercase and the rest of the characters to | |
1936 | lowercase. | |
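As a quick check of the capitalization idiom, here is a minimal sketch using C<\u\L> both in a double-quoted string and in a pattern. The variable names and sample strings are ours.

```perl
use strict;
use warnings;

my $word = "hOBBES";

# \L lowercases the interpolated text, \u then uppercases its first character.
my $name = "\u\L$word";

# The same escapes work inside a pattern, because patterns are
# interpolated like double-quoted strings.
my $found = "Calvin and Hobbes" =~ /\u\L$word/ ? 1 : 0;

print "$name\n";
print "found: $found\n";
```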
1937 | ||
1938 | Control characters can be escaped with C<\c>, so that a control-Z | |
1939 | character would be matched with C<\cZ>. The escape sequence | |
1940 | C<\Q>...C<\E> quotes, or protects, most non-alphabetic characters. For | |
1941 | instance, | |
1942 | ||
1943 | $x = "That !^*&%~& cat!"; | |
1944 | $x =~ /\Q!^*&%~&\E/; # check for rough language | |
1945 | ||
15776bb0 | 1946 | It does not protect C<'$'> or C<'@'>, so that variables can still be |
47f9c88b GS |
1947 | substituted. |
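Because variables are still interpolated inside C<\Q>...C<\E>, the construct is a convenient way to match the contents of a variable literally. A small sketch, with made-up variable names:

```perl
use strict;
use warnings;

my $literal = "a.b";    # the '.' is a metacharacter if left unquoted

# Unquoted: '.' matches any character, so "axb" matches.
my $unquoted_hit = "axb" =~ /$literal/ ? 1 : 0;

# Quoted: '.' is protected, so only a real dot matches.
my $quoted_miss  = "axb" =~ /\Q$literal\E/ ? 1 : 0;
my $quoted_hit   = "a.b" =~ /\Q$literal\E/ ? 1 : 0;

print "unquoted vs 'axb': $unquoted_hit\n";
print "quoted   vs 'axb': $quoted_miss\n";
print "quoted   vs 'a.b': $quoted_hit\n";
```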
1948 | ||
8e71069f FC |
1949 | C<\Q>, C<\L>, C<\l>, C<\U>, C<\u> and C<\E> are actually part of |
1950 | double-quotish syntax, and not part of regexp syntax proper. They will | |
7698aede | 1951 | work if they appear in a regular expression embedded directly in a |
8e71069f FC |
1952 | program, but not when contained in a string that is interpolated in a |
1953 | pattern. | |
7c579eed | 1954 | |
13e5d9cd BF |
1955 | Perl regexps can handle more than just the |
1956 | standard ASCII character set. Perl supports I<Unicode>, a standard | |
7638d2dc | 1957 | for representing the alphabets from virtually all of the world's written |
38a44b82 | 1958 | languages, and a host of symbols. Perl's text strings are Unicode strings, so |
2575c402 | 1959 | they can contain characters with a value (codepoint or character number) higher |
7c579eed | 1960 | than 255. |
47f9c88b GS |
1961 | |
1962 | What does this mean for regexps? Well, regexp users don't need to know | |
7638d2dc | 1963 | much about Perl's internal representation of strings. But they do need |
2575c402 JW |
1964 | to know 1) how to represent Unicode characters in a regexp and 2) that |
1965 | a matching operation will treat the string to be searched as a sequence | |
1966 | of characters, not bytes. The answer to 1) is that Unicode characters | |
f0a2b745 | 1967 | greater than C<chr(255)> are represented using the C<\x{hex}> notation, because |
15776bb0 KW |
1968 | C<\x>I<XY> (without curly braces and I<XY> are two hex digits) doesn't |
1969 | go further than 255. (Starting in Perl 5.14, if you're an octal fan, | |
1970 | you can also use C<\o{oct}>.) | |
47f9c88b | 1971 | |
47f9c88b GS |
1972 | /\x{263a}/; # match a Unicode smiley face :) |
1973 | ||
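To see that matching works on characters rather than bytes, one can inspect a string containing the smiley. This is only a sketch; the variable names are ours, and nothing is printed as a wide character, so no output layer is needed.

```perl
use strict;
use warnings;

my $smiley = "\x{263a}";
my $text   = "Have a nice day $smiley";

# One character, whatever its internal byte representation is.
my $len  = length $smiley;
my $code = sprintf "U+%04X", ord $smiley;

# The pattern sees the same single character.
my $ends_with_smiley = $text =~ /\x{263a}$/ ? 1 : 0;

print "length: $len, code point: $code, match: $ends_with_smiley\n";
```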
7638d2dc | 1974 | B<NOTE>: In Perl 5.6.0 it used to be that one needed to say C<use |
72ff2908 JH |
1975 | utf8> to use any Unicode features. This is no longer the case: for | |
1976 | almost all Unicode processing, the explicit C<utf8> pragma is not | |
1977 | needed. (The only case where it matters is if your Perl script is in | |
1978 | Unicode and encoded in UTF-8, then an explicit C<use utf8> is needed.) | |
47f9c88b GS |
1979 | |
1980 | Figuring out the hexadecimal sequence of a Unicode character you want | |
1981 | or deciphering someone else's hexadecimal Unicode regexp is about as | |
1982 | much fun as programming in machine code. So another way to specify | |
e526e8bb KW |
1983 | Unicode characters is to use the I<named character> escape |
1984 | sequence C<\N{I<name>}>. I<name> is a name for the Unicode character, as | |
55eda711 JH |
1985 | specified in the Unicode standard. For instance, if we wanted to |
1986 | represent or match the astrological sign for the planet Mercury, we | |
1987 | could use | |
47f9c88b | 1988 | |
47f9c88b GS |
1989 | $x = "abc\N{MERCURY}def"; |
1990 | $x =~ /\N{MERCURY}/; # matches | |
1991 | ||
fbb93542 | 1992 | One can also use "short" names: |
47f9c88b | 1993 | |
47f9c88b | 1994 | print "\N{GREEK SMALL LETTER SIGMA} is called sigma.\n"; |
47f9c88b GS |
1995 | print "\N{greek:Sigma} is an upper-case sigma.\n"; |
1996 | ||
fbb93542 KW |
1997 | You can also restrict names to a certain alphabet by specifying the |
1998 | L<charnames> pragma: | |
1999 | ||
47f9c88b GS |
2000 | use charnames qw(greek); |
2001 | print "\N{sigma} is Greek sigma\n"; | |
2002 | ||
0bd42786 KW |
2003 | An index of character names is available on-line from the Unicode |
2004 | Consortium, L<http://www.unicode.org/charts/charindex.html>; explanatory | |
2005 | material with links to other resources at | |
2006 | L<http://www.unicode.org/standard/where>. | |
47f9c88b | 2007 | |
13e5d9cd BF |
2008 | The answer to requirement 2) is that a regexp (mostly) |
2009 | uses Unicode characters. The "mostly" is for messy backward | |
15776bb0 | 2010 | compatibility reasons, but starting in Perl 5.14, any regexp compiled in |
615d795d KW |
2011 | the scope of a C<use feature 'unicode_strings'> (which is automatically |
2012 | turned on within the scope of a C<use 5.012> or higher) will turn that | |
2013 | "mostly" into "always". If you want to handle Unicode properly, you | |
13e5d9cd | 2014 | should ensure that C<'unicode_strings'> is turned on. |
0bd5a82d KW |
2015 | Internally, a string is encoded to bytes using either UTF-8 or a native 8 | |
2016 | bit encoding, depending on the history of the string, but conceptually | |
2017 | it is a sequence of characters, not bytes. See L<perlunitut> for a | |
2018 | tutorial about that. | |
2575c402 | 2019 | |
2c9972cc | 2020 | Let us now discuss Unicode character classes, most usually called |
15776bb0 KW |
2021 | "character properties". These are represented by the C<\p{I<name>}> |
2022 | escape sequence. The negation of this is C<\P{I<name>}>. For example, | |
2023 | to match lower and uppercase characters, | |
47f9c88b | 2024 | |
47f9c88b GS |
2025 | $x = "BOB"; |
2026 | $x =~ /^\p{IsUpper}/; # matches, uppercase char class | |
2027 | $x =~ /^\P{IsUpper}/; # doesn't match, char class sans uppercase | |
2028 | $x =~ /^\p{IsLower}/; # doesn't match, lowercase char class | |
2029 | $x =~ /^\P{IsLower}/; # matches, char class sans lowercase | |
2030 | ||
15776bb0 | 2031 | (The "C<Is>" is optional.) |
5f67e4c9 | 2032 | |
2c9972cc KW |
2033 | There are many, many Unicode character properties. For the full list |
2034 | see L<perluniprops>. Most of them have synonyms with shorter names, | |
2035 | also listed there. Some synonyms are a single character. For these, | |
2036 | you can drop the braces. For instance, C<\pM> is the same thing as | |
2037 | C<\p{Mark}>, meaning things like accent marks. | |
2038 | ||
48791bf1 KW |
2039 | The Unicode C<\p{Script}> and C<\p{Script_Extensions}> properties are |
2040 | used to categorize every Unicode character into the language script it | |
2041 | is written in. (C<Script_Extensions> is an improved version of | |
2042 | C<Script>, which is retained for backward compatibility, and so you | |
2043 | should generally use C<Script_Extensions>.) | |
2044 | For example, | |
2c9972cc KW |
2045 | English, French, and a bunch of other European languages are written in |
2046 | the Latin script. But there is also the Greek script, the Thai script, | |
15776bb0 | 2047 | the Katakana script, I<etc>. You can test whether a character is in a |
48791bf1 KW |
2048 | particular script (based on C<Script_Extensions>) with, for example |
2049 | C<\p{Latin}>, C<\p{Greek}>, or C<\p{Katakana}>. To test if it isn't in | |
2050 | the Balinese script, you would use C<\P{Balinese}>. | |
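A short sketch of script tests, assuming a reasonably recent Perl in which C<Script_Extensions> is available. The sample characters and variable names are our own choice.

```perl
use strict;
use warnings;

my $sigma = "\x{3A3}";    # GREEK CAPITAL LETTER SIGMA

my $is_greek     = $sigma =~ /\p{Greek}/ ? 1 : 0;
my $is_latin     = $sigma =~ /\p{Latin}/ ? 1 : 0;
my $not_balinese = $sigma =~ /\P{Balinese}/ ? 1 : 0;

# The same Greek test, spelled out with an explicit property name.
my $is_greek_long = $sigma =~ /\p{Script_Extensions=Greek}/ ? 1 : 0;

print "Greek: $is_greek, Latin: $is_latin, ",
      "not Balinese: $not_balinese, long form: $is_greek_long\n";
```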
e1b711da KW |
2051 | |
2052 | What we have described so far is the single form of the C<\p{...}> character | |
2053 | classes. There is also a compound form which you may run into. These | |
15776bb0 | 2054 | look like C<\p{I<name>=I<value>}> or C<\p{I<name>:I<value>}> (the equals sign and colon |
e1b711da KW |
2055 | can be used interchangeably). These are more general than the single form, |
2056 | and in fact most of the single forms are just Perl-defined shortcuts for common | |
2057 | compound forms. For example, the script examples in the previous paragraph | |
48791bf1 KW |
2058 | could be written equivalently as C<\p{Script_Extensions=Latin}>, C<\p{Script_Extensions:Greek}>, |
2059 | C<\p{script_extensions=katakana}>, and C<\P{script_extensions=balinese}> (case is irrelevant | |
2c9972cc | 2060 | between the C<{}> braces). You may |
e1b711da KW |
2061 | never have to use the compound forms, but sometimes it is necessary, and their |
2062 | use can make your code easier to understand. | |
47f9c88b | 2063 | |
7638d2dc | 2064 | C<\X> is an abbreviation for a character class that comprises |
5f67e4c9 | 2065 | a Unicode I<extended grapheme cluster>. This represents a "logical character": |
e1b711da | 2066 | what appears to be a single character, but may be represented internally by more |
15776bb0 KW |
2067 | than one. As an example, using the Unicode full names, I<e.g.>, "S<A + COMBINING |
2068 | RING>" is a grapheme cluster with base character "A" and combining character | |
2069 | "S<COMBINING RING>", which translates in Danish to "A" with the circle atop it, | |
360633e8 | 2070 | as in the word E<Aring>ngstrom. |
47f9c88b | 2071 | |
da75cd15 | 2072 | For the full and latest information about Unicode see the latest |
e1b711da | 2073 | Unicode standard, or the Unicode Consortium's website L<http://www.unicode.org>. | |
5e42d7b4 | 2074 | |
7c579eed | 2075 | As if all those classes weren't enough, Perl also defines POSIX-style |
15776bb0 | 2076 | character classes. These have the form C<[:I<name>:]>, with I<name> the |
aaa51d5e JF |
2077 | name of the POSIX class. The POSIX classes are C<alpha>, C<alnum>, |
2078 | C<ascii>, C<cntrl>, C<digit>, C<graph>, C<lower>, C<print>, C<punct>, | |
2079 | C<space>, C<upper>, and C<xdigit>, and two extensions, C<word> (a Perl | |
f1dc5bb2 | 2080 | extension to match C<\w>), and C<blank> (a GNU extension). The C</a> |
0bd5a82d KW |
2081 | modifier restricts these to matching just in the ASCII range; otherwise |
2082 | they can match the same as their corresponding Perl Unicode classes: | |
15776bb0 | 2083 | C<[:upper:]> is the same as C<\p{IsUpper}>, I<etc>. (There are some |
0bd5a82d KW |
2084 | exceptions and gotchas with this; see L<perlrecharclass> for a full |
2085 | discussion.) The C<[:digit:]>, C<[:word:]>, and | |
47f9c88b | 2086 | C<[:space:]> correspond to the familiar C<\d>, C<\w>, and C<\s> |
15776bb0 KW |
2087 | character classes. To negate a POSIX class, put a C<'^'> in front of |
2088 | the name, so that, I<e.g.>, C<[:^digit:]> corresponds to C<\D> and, under | |
7c579eed | 2089 | Unicode, C<\P{IsDigit}>. The Unicode and POSIX character classes can |
54c18d04 MK |
2090 | be used just like C<\d>, with the exception that POSIX character |
2091 | classes can only be used inside of a character class: | |
47f9c88b GS |
2092 | |
2093 | /\s+[abc[:digit:]xyz]\s*/; # match a,b,c,x,y,z, or a digit | |
54c18d04 | 2094 | /^=item\s[[:digit:]]/; # match '=item', |
47f9c88b | 2095 | # followed by a space and a digit |
47f9c88b GS |
2096 | /\s+[abc\p{IsDigit}xyz]\s+/; # match a,b,c,x,y,z, or a digit |
2097 | /^=item\s\p{IsDigit}/; # match '=item', | |
2098 | # followed by a space and a digit | |
2099 | ||
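The POSIX classes, including the negated form, can be tried out on a few strings. In this sketch the sample lines and variable names are invented.

```perl
use strict;
use warnings;

my @lines = ("=item 3", "=item x", "=item  7");

# Keep only lines where '=item' is followed by one space and a digit.
my @numbered = grep { /^=item\s[[:digit:]]/ } @lines;

# [:^digit:] negates the class: capture the leading run of non-digits.
my ($prefix) = "abc123" =~ /^([[:^digit:]]+)/;

print "numbered items: @numbered\n";
print "non-digit prefix: $prefix\n";
```

Note the doubled brackets: the outer pair is the character class, and the inner C<[:digit:]> is the POSIX class inside it.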
2100 | Whew! That is all the rest of the characters and character classes. | |
2101 | ||
2102 | =head2 Compiling and saving regular expressions | |
2103 | ||
c2e2285d KW |
2104 | In Part 1 we mentioned that Perl compiles a regexp into a compact |
2105 | sequence of opcodes. Thus, a compiled regexp is a data structure | |
47f9c88b GS |
2106 | that can be stored once and used again and again. The regexp quote |
2107 | C<qr//> does exactly that: C<qr/string/> compiles the C<string> as a | |
2108 | regexp and transforms the result into a form that can be assigned to a | |
2109 | variable: | |
2110 | ||
2111 | $reg = qr/foo+bar?/; # $reg contains a compiled regexp | |
2112 | ||
2113 | Then C<$reg> can be used as a regexp: | |
2114 | ||
2115 | $x = "fooooba"; | |
2116 | $x =~ $reg; # matches, just like /foo+bar?/ | |
2117 | $x =~ /$reg/; # same thing, alternate form | |
2118 | ||
2119 | C<$reg> can also be interpolated into a larger regexp: | |
2120 | ||
2121 | $x =~ /(abc)?$reg/; # still matches | |
2122 | ||
2123 | As with the matching operator, the regexp quote can use different | |
15776bb0 | 2124 | delimiters, I<e.g.>, C<qr!!>, C<qr{}> or C<qr~~>. Apostrophes |
7638d2dc | 2125 | as delimiters (C<qr''>) inhibit any interpolation. |
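A compiled regexp remembers the modifiers it was compiled with, even when it is interpolated into a larger pattern. A small sketch, with variable names of our own:

```perl
use strict;
use warnings;

# Compiled with /i; the case insensitivity travels with the object.
my $word = qr/perl/i;

# Interpolate it into a larger pattern that has no /i of its own.
my $only_word = qr/^\s*$word\s*$/;

my $hit  = "  PERL " =~ $only_word ? 1 : 0;
my $miss = "perlish" =~ $only_word ? 1 : 0;

print "hit: $hit, miss: $miss\n";
```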
47f9c88b GS |
2126 | |
2127 | Pre-compiled regexps are useful for creating dynamic matches that | |
2128 | don't need to be recompiled each time they are encountered. Using | |
7638d2dc WL |
2129 | pre-compiled regexps, we write a C<grep_step> program which greps |
2130 | for a sequence of patterns, advancing to the next pattern as soon | |
2131 | as one has been satisfied. | |
47f9c88b | 2132 | |
7638d2dc | 2133 | % cat > grep_step |
47f9c88b | 2134 | #!/usr/bin/perl |
7638d2dc | 2135 | # grep_step - match <number> regexps, one after the other |
47f9c88b GS |
2136 | # usage: grep_step <number> regexp1 regexp2 ... file1 file2 ... | |
2137 | ||
2138 | $number = shift; | |
2139 | $regexp[$_] = shift foreach (0..$number-1); | |
2140 | @compiled = map qr/$_/, @regexp; | |
2141 | while ($line = <>) { | |
7638d2dc WL |
2142 | if ($line =~ /$compiled[0]/) { |
2143 | print $line; | |
2144 | shift @compiled; | |
2145 | last unless @compiled; | |
47f9c88b GS |
2146 | } |
2147 | } | |
2148 | ^D | |
2149 | ||
7638d2dc WL |
2150 | % grep_step 3 shift print last grep_step |
2151 | $number = shift; | |
2152 | print $line; | |
2153 | last unless @compiled; | |
47f9c88b GS |
2154 | |
2155 | Storing pre-compiled regexps in an array C<@compiled> allows us to | |
2156 | simply loop through the regexps without any recompilation, thus gaining | |
2157 | flexibility without sacrificing speed. | |
2158 | ||
7638d2dc WL |
2159 | |
2160 | =head2 Composing regular expressions at runtime | |
2161 | ||
2162 | Backtracking is more efficient than repeated tries with different regular | |
2163 | expressions. If there are several regular expressions and a match with | |
353c6505 | 2164 | any of them is acceptable, then it is possible to combine them into a set |
7638d2dc | 2165 | of alternatives. If the individual expressions are input data, this |
353c6505 DL |
2166 | can be done by programming a join operation. We'll exploit this idea in |
2167 | an improved version of the C<simple_grep> program: a program that matches | |
7638d2dc WL |
2168 | multiple patterns: |
2169 | ||
2170 | % cat > multi_grep | |
2171 | #!/usr/bin/perl | |
2172 | # multi_grep - match any of <number> regexps | |
2173 | # usage: multi_grep <number> regexp1 regexp2 ... file1 file2 ... | |
2174 | ||
2175 | $number = shift; | |
2176 | $regexp[$_] = shift foreach (0..$number-1); | |
2177 | $pattern = join '|', @regexp; | |
2178 | ||
2179 | while ($line = <>) { | |
c2e2285d | 2180 | print $line if $line =~ /$pattern/; |
7638d2dc WL |
2181 | } |
2182 | ^D | |
2183 | ||
2184 | % multi_grep 2 shift for multi_grep | |
2185 | $number = shift; | |
2186 | $regexp[$_] = shift foreach (0..$number-1); | |
2187 | ||
2188 | Sometimes it is advantageous to construct a pattern from the I<input> | |
2189 | that is to be analyzed and use the permissible values on the left | |
2190 | hand side of the matching operations. As an example for this somewhat | |
353c6505 | 2191 | paradoxical situation, let's assume that our input contains a command |
7638d2dc | 2192 | verb which should match one out of a set of available command verbs, |
353c6505 | 2193 | with the additional twist that commands may be abbreviated as long as |
7638d2dc WL |
2194 | the given string is unique. The program below demonstrates the basic |
2195 | algorithm. | |
2196 | ||
2197 | % cat > keymatch | |
2198 | #!/usr/bin/perl | |
2199 | $kwds = 'copy compare list print'; | |
555bd962 BG |
2200 | while( $cmd = <> ){ |
2201 | $cmd =~ s/^\s+|\s+$//g; # trim leading and trailing spaces | |
2202 | if( ( @matches = $kwds =~ /\b$cmd\w*/g ) == 1 ){ | |
92a24ac3 | 2203 | print "command: '@matches'\n"; |
7638d2dc | 2204 | } elsif( @matches == 0 ){ |
555bd962 | 2205 | print "no such command: '$cmd'\n"; |
7638d2dc | 2206 | } else { |
555bd962 | 2207 | print "not unique: '$cmd' (could be one of: @matches)\n"; |
7638d2dc WL |
2208 | } |
2209 | } | |
2210 | ^D | |
2211 | ||
2212 | % keymatch | |
2213 | li | |
2214 | command: 'list' | |
2215 | co | |
2216 | not unique: 'co' (could be one of: copy compare) | |
2217 | printer | |
2218 | no such command: 'printer' | |
2219 | ||
2220 | Rather than trying to match the input against the keywords, we match the | |
2221 | combined set of keywords against the input. The pattern matching | |
555bd962 | 2222 | operation S<C<$kwds =~ /\b$cmd\w*/g>> does several things at the | |
353c6505 DL |
2223 | same time. It makes sure that the given command begins where a keyword |
2224 | begins (C<\b>). It tolerates abbreviations due to the added C<\w*>. It | |
2225 | tells us the number of matches (C<scalar @matches>) and all the keywords | |
7638d2dc | 2226 | that were actually matched. You could hardly ask for more. |
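The same idea can be wrapped in a subroutine so that it is testable without reading from standard input. This is a sketch only: the name C<match_command> is hypothetical, and C<\Q>...C<\E> has been added (it is not in the program above) so that metacharacters in the input are taken literally.

```perl
use strict;
use warnings;

my $kwds = 'copy compare list print';

# Hypothetical helper: return every keyword that $cmd abbreviates.
sub match_command {
    my ($cmd, $keywords) = @_;
    # In list context, m//g returns all matches (no capture groups here).
    return $keywords =~ /\b\Q$cmd\E\w*/g;
}

for my $cmd (qw(li co printer)) {
    my @matches = match_command($cmd, $kwds);
    if    (@matches == 1) { print "command: '@matches'\n" }
    elsif (@matches == 0) { print "no such command: '$cmd'\n" }
    else  { print "not unique: '$cmd' (could be one of: @matches)\n" }
}
```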
7638d2dc | 2227 | |
47f9c88b GS |
2228 | =head2 Embedding comments and modifiers in a regular expression |
2229 | ||
2230 | Starting with this section, we will be discussing Perl's set of | |
7638d2dc | 2231 | I<extended patterns>. These are extensions to the traditional regular |
47f9c88b GS |
2232 | expression syntax that provide powerful new tools for pattern |
2233 | matching. We have already seen extensions in the form of the minimal | |
6b3ddc02 FC |
2234 | matching constructs C<??>, C<*?>, C<+?>, C<{n,m}?>, and C<{n,}?>. Most |
2235 | of the extensions below have the form C<(?char...)>, where the | |
47f9c88b GS |
2236 | C<char> is a character that determines the type of extension. |
2237 | ||
2238 | The first extension is an embedded comment C<(?#text)>. This embeds a | |
2239 | comment into the regular expression without affecting its meaning. The | |
2240 | comment should not have any closing parentheses in the text. An | |
2241 | example is | |
2242 | ||
2243 | /(?# Match an integer:)[+-]?\d+/; | |
2244 | ||
2245 | This style of commenting has been largely superseded by the raw, | |
f1dc5bb2 | 2246 | freeform commenting that is allowed with the C</x> modifier. |
47f9c88b | 2247 | |
f1dc5bb2 | 2248 | Most modifiers, such as C</i>, C</m>, C</s> and C</x> (or any |
24549070 | 2249 | combination thereof) can also be embedded in |
47f9c88b GS |
2250 | a regexp using C<(?i)>, C<(?m)>, C<(?s)>, and C<(?x)>. For instance, |
2251 | ||
2252 | /(?i)yes/; # match 'yes' case insensitively | |
2253 | /yes/i; # same thing | |
2254 | /(?x)( # freeform version of an integer regexp | |
2255 | [+-]? # match an optional sign | |
2256 | \d+ # match a sequence of digits | |
2257 | ) | |
2258 | /x; | |
2259 | ||
2260 | Embedded modifiers can have two important advantages over the usual | |
15776bb0 | 2261 | modifiers. Embedded modifiers allow a custom set of modifiers for |
47f9c88b GS |
2262 | I<each> regexp pattern. This is great for matching an array of regexps |
2263 | that must have different modifiers: | |
2264 | ||
2265 | $pattern[0] = '(?i)doctor'; | |
2266 | $pattern[1] = 'Johnson'; | |
2267 | ... | |
2268 | while (<>) { | |
2269 | foreach $patt (@pattern) { | |
2270 | print if /$patt/; | |
2271 | } | |
2272 | } | |
2273 | ||
f1dc5bb2 | 2274 | The second advantage is that embedded modifiers (except C</p>, which |
7638d2dc | 2275 | modifies the entire regexp) only affect the regexp |
47f9c88b GS |
2276 | inside the group the embedded modifier is contained in. So grouping |
2277 | can be used to localize the modifier's effects: | |
2278 | ||
2279 | /Answer: ((?i)yes)/; # matches 'Answer: yes', 'Answer: YES', etc. | |
2280 | ||
2281 | Embedded modifiers can also turn off any modifiers already present | |
15776bb0 KW |
2282 | by using, I<e.g.>, C<(?-i)>. Modifiers can also be combined into |
2283 | a single expression, I<e.g.>, C<(?s-i)> turns on single line mode and | |
47f9c88b GS |
2284 | turns off case insensitivity. |
2285 | ||
7638d2dc | 2286 | Embedded modifiers may also be added to a non-capturing grouping. |
47f9c88b GS |
2287 | C<(?i-m:regexp)> is a non-capturing grouping that matches C<regexp> |
2288 | case insensitively and turns off multi-line mode. | |
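To see the localizing effect, compare what the grouped modifier does and does not cover. A minimal sketch, with variable names of our own:

```perl
use strict;
use warnings;

# Only 'yes' is matched case insensitively; 'Answer: ' is not.
my $re = qr/Answer: (?i:yes)/;

my $upper_yes    = "Answer: YES" =~ $re ? 1 : 0;
my $upper_answer = "ANSWER: yes" =~ $re ? 1 : 0;

print "Answer: YES -> $upper_yes\n";
print "ANSWER: yes -> $upper_answer\n";
```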
2289 | ||
7638d2dc | 2290 | |
47f9c88b GS |
2291 | =head2 Looking ahead and looking behind |
2292 | ||
2293 | This section concerns the lookahead and lookbehind assertions. First, | |
2294 | a little background. | |
2295 | ||
15776bb0 | 2296 | In Perl regular expressions, most regexp elements "eat up" a certain |
47f9c88b | 2297 | amount of string when they match. For instance, the regexp element |
22bf43da | 2298 | C<[abc]> eats up one character of the string when it matches, in the |
7638d2dc | 2299 | sense that Perl moves to the next character position in the string |
47f9c88b GS |
2300 | after the match. There are some elements, however, that don't eat up |
2301 | characters (advance the character position) if they match. The examples | |
15776bb0 | 2302 | we have seen so far are the anchors. The anchor C<'^'> matches the |
47f9c88b | 2303 | beginning of the line, but doesn't eat any characters. Similarly, the |
7638d2dc | 2304 | word boundary anchor C<\b> matches wherever a character matching C<\w> |
353c6505 | 2305 | is next to a character that doesn't, but it doesn't eat up any |
6b3ddc02 FC |
2306 | characters itself. Anchors are examples of I<zero-width assertions>: |
2307 | zero-width, because they consume | |
47f9c88b GS |
2308 | no characters, and assertions, because they test some property of the |
2309 | string. In the context of our walk in the woods analogy to regexp | |
2310 | matching, most regexp elements move us along a trail, but anchors have | |
2311 | us stop a moment and check our surroundings. If the local environment | |
2312 | checks out, we can proceed forward. But if the local environment | |
2313 | doesn't satisfy us, we must backtrack. | |
2314 | ||
2315 | Checking the environment entails either looking ahead on the trail, | |
15776bb0 KW |
2316 | looking behind, or both. C<'^'> looks behind, to see that there are no |
2317 | characters before. C<'$'> looks ahead, to see that there are no | |
47f9c88b | 2318 | characters after. C<\b> looks both ahead and behind, to see if the |
7638d2dc | 2319 | characters on either side differ in their "word-ness". |
47f9c88b GS |
2320 | |
2321 | The lookahead and lookbehind assertions are generalizations of the | |
2322 | anchor concept. Lookahead and lookbehind are zero-width assertions | |
2323 | that let us specify which characters we want to test for. The | |
2324 | lookahead assertion is denoted by C<(?=regexp)> and the lookbehind | |
a6b2f353 | 2325 | assertion is denoted by C<< (?<=fixed-regexp) >>. Some examples are |
47f9c88b GS |
2326 | |
2327 | $x = "I catch the housecat 'Tom-cat' with catnip"; | |
7638d2dc | 2328 | $x =~ /cat(?=\s)/; # matches 'cat' in 'housecat' |
47f9c88b GS |
2329 | @catwords = ($x =~ /(?<=\s)cat\w+/g); # matches, |
2330 | # $catwords[0] = 'catch' | |
2331 | # $catwords[1] = 'catnip' | |
2332 | $x =~ /\bcat\b/; # matches 'cat' in 'Tom-cat' | |
2333 | $x =~ /(?<=\s)cat(?=\s)/; # doesn't match; no isolated 'cat' in | |
2334 | # middle of $x | |
2335 | ||
a6b2f353 | 2336 | Note that the parentheses in C<(?=regexp)> and C<< (?<=regexp) >> are |
47f9c88b GS |
2337 | non-capturing, since these are zero-width assertions. Thus in the |
2338 | second regexp, the substrings captured are those of the whole regexp | |
a6b2f353 GS |
2339 | itself. Lookahead C<(?=regexp)> can match arbitrary regexps, but |
2340 | lookbehind C<< (?<=fixed-regexp) >> only works for regexps of fixed | |
15776bb0 | 2341 | width, I<i.e.>, a fixed number of characters long. Thus |
a6b2f353 GS |
2342 | C<< (?<=(ab|bc)) >> is fine, but C<< (?<=(ab)*) >> is not. The |
2343 | negated versions of the lookahead and lookbehind assertions are | |
2344 | denoted by C<(?!regexp)> and C<< (?<!fixed-regexp) >> respectively. | |
2345 | They evaluate true if the regexps do I<not> match: | |
47f9c88b GS |
2346 | |
2347 | $x = "foobar"; | |
2348 | $x =~ /foo(?!bar)/; # doesn't match, 'bar' follows 'foo' | |
2349 | $x =~ /foo(?!baz)/; # matches, 'baz' doesn't follow 'foo' | |
2350 | $x =~ /(?<!\s)foo/; # matches, there is no \s before 'foo' | |
2351 | ||
7638d2dc WL |
2352 | Here is an example where a string containing blank-separated words, |
2353 | numbers and single dashes is to be split into its components. | |
2354 | Using C</\s+/> alone won't work, because spaces are not required between | |
2355 | dashes, or between a word and a dash. Additional places for a split are established | |
2356 | by looking ahead and behind: | |
47f9c88b | 2357 | |
7638d2dc WL |
2358 | $str = "one two - --6-8"; |
2359 | @toks = split / \s+ # a run of spaces | |
2360 | | (?<=\S) (?=-) # any non-space followed by '-' | |
2361 | | (?<=-) (?=\S) # a '-' followed by any non-space | |
2362 | /x, $str; # @toks = qw(one two - - - 6 - 8) | |
47f9c88b | 2363 | |
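Another common use of paired lookbehind and lookahead is inserting text at positions that are defined only by their surroundings. The following sketch adds thousands separators to a string of digits; the example is ours and assumes the input consists of digits only.

```perl
use strict;
use warnings;

my $n = "1234567";

# Insert a comma at every position that has a digit behind it and a
# multiple of three digits between it and the end of the string.
(my $pretty = $n) =~ s/(?<=\d)(?=(?:\d{3})+$)/,/g;

print "$pretty\n";
```

Both assertions are zero-width, so the substitution replaces an empty string and no digits are lost.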
e7206367 KW |
2364 | Starting in Perl 5.28, experimentally, alphabetic equivalents to these |
2365 | assertions are added, so you can use whichever is most memorable for | |
2366 | your tastes. | |
2367 | ||
2368 | (?=...) (*pla:...) or (*positive_lookahead:...) | |
2369 | (?!...) (*nla:...) or (*negative_lookahead:...) | |
2370 | (?<=...) (*plb:...) or (*positive_lookbehind:...) | |
2371 | (?<!...) (*nlb:...) or (*negative_lookbehind:...) | |
2372 | (?>...) (*atomic:...) | |
2373 | ||
2374 | Using any of these will raise (unless turned off) a warning in the | |
2375 | C<experimental::alpha_assertions> category. | |
7638d2dc WL |
2376 | |
2377 | =head2 Using independent subexpressions to prevent backtracking | |
2378 | ||
2379 | I<Independent subexpressions> are regular expressions, in the | |
47f9c88b GS |
2380 | context of a larger regular expression, that function independently of |
2381 | the larger regular expression. That is, they consume as much or as | |
2382 | little of the string as they wish without regard for the ability of | |
2383 | the larger regexp to match. Independent subexpressions are represented | |
2384 | by C<< (?>regexp) >>. We can illustrate their behavior by first | |
2385 | considering an ordinary regexp: | |
2386 | ||
2387 | $x = "ab"; | |
2388 | $x =~ /a*ab/; # matches | |
2389 | ||
2390 | This obviously matches, but in the process of matching, the | |
15776bb0 | 2391 | subexpression C<a*> first grabbed the C<'a'>. Doing so, however, |
47f9c88b | 2392 | wouldn't allow the whole regexp to match, so after backtracking, C<a*> |
15776bb0 | 2393 | eventually gave back the C<'a'> and matched the empty string. Here, what |
47f9c88b GS |
2394 | C<a*> matched was I<dependent> on what the rest of the regexp matched. |
2395 | ||
2396 | Contrast that with an independent subexpression: | |
2397 | ||
2398 | $x =~ /(?>a*)ab/; # doesn't match! | |
2399 | ||
2400 | The independent subexpression C<< (?>a*) >> doesn't care about the rest | |
15776bb0 | 2401 | of the regexp, so it sees an C<'a'> and grabs it. Then the rest of the |
47f9c88b | 2402 | regexp C<ab> cannot match. Because C<< (?>a*) >> is independent, there |
da75cd15 | 2403 | is no backtracking and the independent subexpression does not give |
15776bb0 | 2404 | up its C<'a'>. Thus the match of the regexp as a whole fails. A similar |
47f9c88b GS |
2405 | behavior occurs with completely independent regexps: |
2406 | ||
2407 | $x = "ab"; | |
2408 | $x =~ /a*/g; # matches, eats an 'a' | |
2409 | $x =~ /\Gab/g; # doesn't match, no 'a' available | |
2410 | ||
15776bb0 | 2411 | Here C</g> and C<\G> create a "tag team" handoff of the string from |
47f9c88b GS |
2412 | one regexp to the other. Regexps with an independent subexpression are |
2413 | much like this, with a handoff of the string to the independent | |
2414 | subexpression, and a handoff of the string back to the enclosing | |
2415 | regexp. | |
2416 | ||
2417 | The ability of an independent subexpression to prevent backtracking | |
2418 | can be quite useful. Suppose we want to match a non-empty string | |
2419 | enclosed in parentheses up to two levels deep. Then the following | |
2420 | regexp matches: | |
2421 | ||
2422 | $x = "abc(de(fg)h"; # unbalanced parentheses | |
77c8f263 | 2423 | $x =~ /\( ( [ ^ () ]+ | \( [ ^ () ]* \) )+ \)/xx; |
47f9c88b GS |
2424 | |
2425 | The regexp matches an open parenthesis, one or more copies of an | |
2426 | alternation, and a close parenthesis. The alternation is two-way, with | |
2427 | the first alternative C<[^()]+> matching a substring with no | |
2428 | parentheses and the second alternative C<\([^()]*\)> matching a | |
2429 | substring delimited by parentheses. The problem with this regexp is | |
2430 | that it is pathological: it has nested indeterminate quantifiers | |
07698885 | 2431 | of the form C<(a+|b)+>. We discussed in Part 1 how nested quantifiers |
47f9c88b GS |
2432 | like this could take an exponentially long time to execute if there |
2433 | was no match possible. To prevent the exponential blowup, we need to | |
2434 | prevent useless backtracking at some point. This can be done by | |
2435 | enclosing the inner quantifier as an independent subexpression: | |
2436 | ||
77c8f263 | 2437 | $x =~ /\( ( (?> [ ^ () ]+ ) | \([ ^ () ]* \) )+ \)/xx; |
47f9c88b GS |
2438 | |
2439 | Here, C<< (?>[^()]+) >> breaks the degeneracy of string partitioning | |
2440 | by gobbling up as much of the string as possible and keeping it. Then | |
2441 | match failures fail much more quickly. | |
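As a rough check that the independent subexpression changes only the amount of backtracking and not the result, the two regexps can be run side by side. This is a sketch under assumptions: the names C<$plain>, C<$atomic> and C<%found> are hypothetical, and the balanced test string is added for illustration.

```perl
use strict;
use warnings;

# The pathological form and the form with an independent subexpression.
my $plain  = qr/\( ( [^()]+ | \( [^()]* \) )+ \)/x;
my $atomic = qr/\( ( (?> [^()]+ ) | \( [^()]* \) )+ \)/x;

my %found;
for my $s ("abc(de(fg)h)", "abc(de(fg)h") {
    # In list context the first element is the outermost capture,
    # that is, the whole parenthesized substring.
    my ($got_plain)  = $s =~ /($plain)/;
    my ($got_atomic) = $s =~ /($atomic)/;
    $found{$s} = [ $got_plain, $got_atomic ];
    print "'$s': plain found '$got_plain', atomic found '$got_atomic'\n";
}
```

Both regexps find the same substrings; in the unbalanced string the match is the inner C<(fg)>, which is why the original regexp "matches" there at all.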
2442 | ||
7638d2dc | 2443 | |
47f9c88b GS |
2444 | =head2 Conditional expressions |
2445 | ||
7638d2dc | 2446 | A I<conditional expression> is a form of if-then-else statement |
47f9c88b GS |
2447 | that allows one to choose which patterns are to be matched, based on |
2448 | some condition. There are two types of conditional expression: | |
15776bb0 KW |
2449 | C<(?(I<condition>)I<yes-regexp>)> and |
2450 | C<(?(I<condition>)I<yes-regexp>|I<no-regexp>)>. |
2451 | C<(?(I<condition>)I<yes-regexp>)> is | |
2452 | like an S<C<'if () {}'>> statement in Perl. If the I<condition> is true, | |
2453 | the I<yes-regexp> will be matched. If the I<condition> is false, the | |
2454 | I<yes-regexp> will be skipped and Perl will move on to the next regexp |
7638d2dc | 2455 | element. The second form is like an S<C<'if () {} else {}'>> statement |
15776bb0 KW |
2456 | in Perl. If the I<condition> is true, the I<yes-regexp> will be |
2457 | matched, otherwise the I<no-regexp> will be matched. | |
47f9c88b | 2458 | |
15776bb0 KW |
2459 | The I<condition> can have several forms. The first form is simply an |
2460 | integer in parentheses C<(I<integer>)>. It is true if the corresponding | |
2461 | backreference C<\I<integer>> matched earlier in the regexp. The same | |
c27a5cfe | 2462 | thing can be done with a name associated with a capture group, written |
15776bb0 | 2463 | as C<<< (E<lt>I<name>E<gt>) >>> or C<< ('I<name>') >>. The second form is a bare |
6b3ddc02 | 2464 | zero-width assertion C<(?...)>, either a lookahead, a lookbehind, or a |
7638d2dc WL |
2465 | code assertion (discussed in the next section). The third set of forms |
2466 | provides tests that return true if the expression is executed within | |
2467 | a recursion (C<(R)>) or is being called from some capturing group, | |
2468 | referenced either by number (C<(R1)>, C<(R2)>,...) or by name | |
15776bb0 | 2469 | (C<(R&I<name>)>). |
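A small sketch of the integer form may help. The pattern below is a hypothetical example, not one from this tutorial: it accepts a number that is either bare or fully parenthesized, requiring the closing parenthesis only if group 1 matched the opening one. The names C<$re> and C<%matched> are illustrative.

```perl
use strict;
use warnings;

my $re = qr/^ (\()?        # group 1: optional opening parenthesis
              \d+          # the number itself
              (?(1) \) )   # if group 1 matched, require a closing one
            $/x;

my %matched;
for my $s ("(123)", "123", "(123", "123)") {
    $matched{$s} = $s =~ $re ? 1 : 0;
    printf "%-6s %s\n", $s, $matched{$s} ? "matches" : "does not match";
}
```

Because there is no I<no-regexp>, a false condition simply matches nothing, and the trailing C<$> then rejects the stray C<')'> in the last string.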
7638d2dc WL |
2470 | |
2471 | The integer or name form of the C<condition> allows us to choose, | |
2472 | with more flexibility, what to match based on what matched earlier in the | |
2473 | regexp. This searches for words of the form C<"$x$x"> or C<"$x$y$y$x">: | |
47f9c88b | 2474 | |
d8b950dc | 2475 | % simple_grep '^(\w+)(\w+)?(?(2)\g2\g1|\g1)$' /usr/dict/words |
47f9c88b GS |
2476 | beriberi |
2477 | coco | |
2478 | couscous | |
2479 | deed | |
2480 | ... | |
2481 | toot | |
2482 | toto | |
2483 | tutu | |
2484 | ||
2485 | The lookbehind C<condition> allows, along with backreferences, | |
2486 | an earlier part of the match to influence a later part of the | |
2487 | match. For instance, | |
2488 | ||
2489 | /[ATGC]+(?(?<=AA)G|C)$/; | |
2490 | ||
2491 | matches a DNA sequence such that it either ends in C<AAG>, or some | |
15776bb0 | 2492 | other base pair combination and C<'C'>. Note that the form is |
a6b2f353 GS |
2493 | C<< (?(?<=AA)G|C) >> and not C<< (?((?<=AA))G|C) >>; for the |
2494 | lookahead, lookbehind or code assertions, the parentheses around the | |
2495 | conditional are not needed. | |
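The behavior of this regexp can be checked on a few short sequences. The test strings and the names C<$re> and C<%ends_ok> below are made up for illustration only.

```perl
use strict;
use warnings;

my $re = qr/[ATGC]+(?(?<=AA)G|C)$/;

my %ends_ok;
for my $dna (qw(ATGAAG ATGC ATGAAC ATGAG)) {
    $ends_ok{$dna} = $dna =~ $re ? 1 : 0;
    printf "%-7s %s\n", $dna, $ends_ok{$dna} ? "matches" : "does not match";
}
```

C<ATGAAG> matches through the I<yes-regexp> because C<AA> precedes the final base, and C<ATGC> matches through the I<no-regexp>. C<ATGAAC> fails because after C<AA> only C<'G'> is accepted, and C<ATGAG> fails because without C<AA> only C<'C'> is accepted.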
47f9c88b | 2496 | |
7638d2dc WL |
2497 | |
2498 | =head2 Defining named patterns | |
2499 | ||
2500 | Some regular expressions use identical subpatterns in several places. | |
2501 | Starting with Perl 5.10, it is possible to define named subpatterns in | |
2502 | a section of the pattern so that they can be called up by name | |
2503 | anywhere in the pattern. The syntactic pattern for this definition |
15776bb0 KW |
2504 | group is C<< (?(DEFINE)(?<I<name>>I<pattern>)...) >>. An insertion |
2505 | of a named pattern is written as C<(?&I<name>)>. | |
7638d2dc WL |
2506 | |
2507 | The example below illustrates this feature using the pattern for | |
2508 | floating point numbers that was presented earlier on. The three | |
2509 | subpatterns that are used more than once are the optional sign, the | |
15776bb0 | 2510 | digit sequence for an integer and the decimal fraction. The C<DEFINE> |
7638d2dc WL |
2511 | group at the end of the pattern contains their definition. Notice |
2512 | that the decimal fraction pattern is the first place where we can | |
2513 | reuse the integer pattern. | |
2514 | ||
353c6505 | 2515 | /^ (?&osg)\ * ( (?&int)(?&dec)? | (?&dec) ) |
7638d2dc WL |
2516 | (?: [eE](?&osg)(?&int) )? |
2517 | $ | |
2518 | (?(DEFINE) | |
2519 | (?<osg>[-+]?) # optional sign | |
2520 | (?<int>\d++) # integer | |
2521 | (?<dec>\.(?&int)) # decimal fraction | |
2522 | )/x | |
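To try the pattern out, it can be stored with C<qr//> and applied to a handful of candidates. The variable names C<$float> and C<%is_float> and the test strings are assumptions made for this sketch.

```perl
use strict;
use warnings;

my $float = qr/^ (?&osg)\ * ( (?&int)(?&dec)? | (?&dec) )
                 (?: [eE](?&osg)(?&int) )?
               $
               (?(DEFINE)
                 (?<osg>[-+]?)         # optional sign
                 (?<int>\d++)          # integer
                 (?<dec>\.(?&int))     # decimal fraction
               )/x;

my %is_float;
for my $s ("3.14", "-12e+5", ".5", "+ 7", "1.", "1e", "abc") {
    $is_float{$s} = $s =~ $float ? 1 : 0;
    printf "%-8s %s\n", "'$s'", $is_float{$s} ? "accepted" : "rejected";
}
```

Note that C<"1."> is rejected: the decimal fraction subpattern requires at least one digit after the point, because it reuses the integer subpattern.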
2523 | ||
2524 | ||
2525 | =head2 Recursive patterns | |
2526 | ||
2527 | This feature (introduced in Perl 5.10) significantly extends the | |
2528 | power of Perl's pattern matching. By referring to some other | |
2529 | capture group anywhere in the pattern with the construct | |
15776bb0 | 2530 | C<(?I<group-ref>)>, the I<pattern> within the referenced group is used |
7638d2dc WL |
2531 | as an independent subpattern in place of the group reference itself. |
2532 | Because the group reference may be contained I<within> the group it | |
2533 | refers to, it is now possible to apply pattern matching to tasks that | |
2534 | hitherto required a recursive parser. | |
2535 | ||
2536 | To illustrate this feature, we'll design a pattern that matches if | |
2537 | a string contains a palindrome. (This is a word or a sentence that, |
2538 | ignoring spaces, punctuation and case, reads the same backwards |
2539 | as forwards.) We begin by observing that the empty string or a string |
2540 | containing just one word character is a palindrome. Otherwise it must | |
2541 | have a word character up front and the same at its end, with another | |
2542 | palindrome in between. | |
2543 | ||
fd2b7f55 | 2544 | /(?: (\w) (?...Here be a palindrome...) \g{-1} | \w? )/x |
7638d2dc | 2545 | |
e57a4e52 | 2546 | Adding C<\W*> at either end to eliminate what is to be ignored, we already |
7638d2dc WL |
2547 | have the full pattern: |
2548 | ||
2549 | my $pp = qr/^(\W* (?: (\w) (?1) \g{-1} | \w? ) \W*)$/ix; | |
2550 | for $s ( "saippuakauppias", "A man, a plan, a canal: Panama!" ){ | |
2551 | print "'$s' is a palindrome\n" if $s =~ /$pp/; | |
2552 | } | |
2553 | ||
2554 | In C<(?...)> both absolute and relative backreferences may be used. | |
2555 | The entire pattern can be reinserted with C<(?R)> or C<(?0)>. | |
15776bb0 | 2556 | If you prefer to name your groups, you can use C<(?&I<name>)> to |
c27a5cfe | 2557 | recurse into that group. |
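As a further sketch, recursion by number and recursion by name can be compared on the classic task of matching balanced parentheses. The patterns and the names C<$by_number>, C<$by_name>, C<paren> and C<%balanced> are illustrative assumptions, not part of the palindrome example.

```perl
use strict;
use warnings;

# Group 1 recurses into itself with (?1).
my $by_number = qr/^ ( \( (?: [^()]++ | (?1) )* \) ) $/x;

# The same idea with a named group and (?&paren).
my $by_name = qr/^ (?<paren> \( (?: [^()]++ | (?&paren) )* \) ) $/x;

my %balanced;
for my $s ("(a(b)c)", "((()))", "(a(b)c", "a(b)") {
    my $n = $s =~ $by_number ? 1 : 0;
    my $m = $s =~ $by_name   ? 1 : 0;
    $balanced{$s} = [ $n, $m ];
    printf "%-8s by number: %d, by name: %d\n", $s, $n, $m;
}
```

Both forms should agree on every input; the last string is rejected only because these patterns are anchored and require the string to start with C<'('>.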
7638d2dc WL |
2558 | |
2559 | ||
47f9c88b GS |
2560 | =head2 A bit of magic: executing Perl code in a regular expression |
2561 | ||
2562 | Normally, regexps are a part of Perl expressions. | |
7638d2dc | 2563 | I<Code evaluation> expressions turn that around by allowing |
da75cd15 | 2564 | arbitrary Perl code to be a part of a regexp. A code evaluation |
15776bb0 | 2565 | expression is denoted C<(?{I<code>})>, with I<code> a string of Perl |
47f9c88b GS |
2566 | statements. |
2567 | ||
2568 | Code expressions are zero-width assertions, and the value they return | |
2569 | depends on their environment. There are two possibilities: either the | |
2570 | code expression is used as a conditional in a conditional expression | |
15776bb0 KW |
2571 | C<(?(I<condition>)...)>, or it is not. If the code expression is a |
2572 | conditional, the code is evaluated and the result (I<i.e.>, the result of | |
47f9c88b GS |
2573 | the last statement) is used to determine truth or falsehood. If the |
2574 | code expression is not used as a conditional, the assertion always | |
2575 | evaluates true and the result is put into the special variable | |
2576 | C<$^R>. The variable C<$^R> can then be used in code expressions later | |
2577 | in the regexp. Here are some silly examples: | |
2578 | ||
2579 | $x = "abcdef"; | |
2580 | $x =~ /abc(?{print "Hi Mom!";})def/; # matches, | |
2581 | # prints 'Hi Mom!' | |
2582 | $x =~ /aaa(?{print "Hi Mom!";})def/; # doesn't match, | |
2583 | # no 'Hi Mom!' | |
745e1e41 DC |
2584 | |
2585 | Pay careful attention to the next example: | |
2586 | ||
47f9c88b GS |
2587 | $x =~ /abc(?{print "Hi Mom!";})ddd/; # doesn't match, |
2588 | # no 'Hi Mom!' | |
745e1e41 DC |
2589 | # but why not? |
2590 | ||
2591 | At first glance, you'd think that it shouldn't print, because obviously | |
2592 | the C<ddd> isn't going to match the target string. But look at this | |
2593 | example: | |
2594 | ||
87167316 RGS |
2595 | $x =~ /abc(?{print "Hi Mom!";})[dD]dd/; # doesn't match, |
2596 | # but _does_ print | |
745e1e41 DC |
2597 | |
2598 | Hmm. What happened here? If you've been following along, you know that | |
ac036724 | 2599 | the above pattern should be effectively (almost) the same as the last one; |
15776bb0 | 2600 | enclosing the C<'d'> in a character class isn't going to change what it |
745e1e41 DC |
2601 | matches. So why does the first not print while the second one does? |
2602 | ||
15776bb0 | 2603 | The answer lies in the optimizations the regexp engine makes. In the first |
745e1e41 | 2604 | case, all the engine sees are plain old characters (aside from the |
15776bb0 | 2605 | C<?{}> construct). It's smart enough to realize that the string C<'ddd'> |
745e1e41 DC |
2606 | doesn't occur in our target string before actually running the pattern |
2607 | through. But in the second case, we've tricked it into thinking that our | |
87167316 | 2608 | pattern is more complicated. It takes a look, sees our |
745e1e41 DC |
2609 | character class, and decides that it will have to actually run the |
2610 | pattern to determine whether or not it matches, and in the process of | |
2611 | running it hits the print statement before it discovers that we don't | |
2612 | have a match. | |
2613 | ||
2614 | To take a closer look at how the engine does optimizations, see the | |
5a0de581 | 2615 | section L</"Pragmas and debugging"> below. |
745e1e41 DC |
2616 | |
2617 | More fun with C<?{}>: | |
2618 | ||
47f9c88b GS |
2619 | $x =~ /(?{print "Hi Mom!";})/; # matches, |
2620 | # prints 'Hi Mom!' | |
2621 | $x =~ /(?{$c = 1;})(?{print "$c";})/; # matches, | |
2622 | # prints '1' | |
2623 | $x =~ /(?{$c = 1;})(?{print "$^R";})/; # matches, | |
2624 | # prints '1' | |
2625 | ||
2626 | The bit of magic mentioned in the section title occurs when the regexp | |
2627 | backtracks in the process of searching for a match. If the regexp | |
2628 | backtracks over a code expression and if the variables used within are | |
2629 | localized using C<local>, the changes in the variables produced by the | |
2630 | code expression are undone! Thus, if we wanted to count how many times | |
15776bb0 | 2631 | a character got matched inside a group, we could use, I<e.g.>, |
47f9c88b GS |
2632 | |
2633 | $x = "aaaa"; | |
2634 | $count = 0; # initialize 'a' count | |
2635 | $c = "bob"; # test if $c gets clobbered | |
2636 | $x =~ /(?{local $c = 0;}) # initialize count | |
2637 | ( a # match 'a' | |
2638 | (?{local $c = $c + 1;}) # increment count | |
2639 | )* # do this any number of times, | |
2640 | aa # but match 'aa' at the end | |
2641 | (?{$count = $c;}) # copy local $c var into $count | |
2642 | /x; | |
2643 | print "'a' count is $count, \$c variable is '$c'\n"; | |
2644 | ||
2645 | This prints | |
2646 | ||
2647 | 'a' count is 2, $c variable is 'bob' | |
2648 | ||
7638d2dc WL |
2649 | If we replace the S<C< (?{local $c = $c + 1;})>> with |
2650 | S<C< (?{$c = $c + 1;})>>, the variable changes are I<not> undone | |
47f9c88b GS |
2651 | during backtracking, and we get |
2652 | ||
2653 | 'a' count is 4, $c variable is 'bob' | |
2654 | ||
2655 | Note that only localized variable changes are undone. Other side | |
2656 | effects of code expression execution are permanent. Thus | |
2657 | ||
2658 | $x = "aaaa"; | |
2659 | $x =~ /(a(?{print "Yow\n";}))*aa/; | |
2660 | ||
2661 | produces | |
2662 | ||
2663 | Yow | |
2664 | Yow | |
2665 | Yow | |
2666 | Yow | |
2667 | ||
2668 | The result C<$^R> is automatically localized, so that it will behave | |
2669 | properly in the presence of backtracking. | |
2670 | ||
7638d2dc | 2671 | This example uses a code expression in a conditional to match a |
15776bb0 KW |
2672 | definite article, either C<'the'> in English or C<'der|die|das'> in |
2673 | German: | |
47f9c88b | 2674 | |
47f9c88b GS |
2675 | $lang = 'DE'; # use German |
2676 | ... | |
2677 | $text = "das"; | |
2678 | print "matched\n" | |
2679 | if $text =~ /(?(?{ | |
2680 | $lang eq 'EN'; # is the language English? | |
2681 | }) | |
2682 | the | # if so, then match 'the' | |
7638d2dc | 2683 | (der|die|das) # else, match 'der|die|das' |
47f9c88b GS |
2684 | ) |
2685 | /xi; | |
2686 | ||
15776bb0 KW |
2687 | Note that the syntax here is C<(?(?{...})I<yes-regexp>|I<no-regexp>)>, not |
2688 | C<(?((?{...}))I<yes-regexp>|I<no-regexp>)>. In other words, in the case of a | |
47f9c88b GS |
2689 | code expression, we don't need the extra parentheses around the |
2690 | conditional. | |
2691 | ||
e128ab2c DM |
2692 | If you try to use code expressions where the code text is contained within |
2693 | an interpolated variable, rather than appearing literally in the pattern, | |
2694 | Perl may surprise you: | |
a6b2f353 GS |
2695 | |
2696 | $bar = 5; | |
2697 | $pat = '(?{ 1 })'; | |
2698 | /foo(?{ $bar })bar/; # compiles ok, $bar not interpolated | |
e128ab2c | 2699 | /foo(?{ 1 })$bar/; # compiles ok, $bar interpolated |
a6b2f353 GS |
2700 | /foo${pat}bar/; # compile error! |
2701 | ||
2702 | $pat = qr/(?{ $foo = 1 })/; # precompile code regexp | |
2703 | /foo${pat}bar/; # compiles ok | |
2704 | ||
e128ab2c DM |
2705 | If a regexp has a variable that interpolates a code expression, Perl |
2706 | treats the regexp as an error. If the code expression is precompiled into | |
2707 | a variable, however, interpolating is ok. The question is, why is this an | |
2708 | error? | |
a6b2f353 GS |
2709 | |
2710 | The reason is that variable interpolation and code expressions | |
2711 | together pose a security risk. The combination is dangerous because | |
2712 | programmers who write search engines often take user input and |
2713 | plug it directly into a regexp: | |
47f9c88b GS |
2714 | |
2715 | $regexp = <>; # read user-supplied regexp | |
2716 | chomp $regexp; # get rid of possible newline |
2717 | $text =~ /$regexp/; # search $text for the $regexp | |
2718 | ||
a6b2f353 GS |
2719 | If the C<$regexp> variable contains a code expression, the user could |
2720 | then execute arbitrary Perl code. For instance, some joker could | |
7638d2dc WL |
2721 | search for S<C<(?{ system('rm -rf *'); })>> to erase your files. In this |
2722 | sense, the combination of interpolation and code expressions I<taints> | |
47f9c88b | 2723 | your regexp. So by default, using both interpolation and code |
a6b2f353 GS |
2724 | expressions in the same regexp is not allowed. If you're not |
2725 | concerned about malicious users, it is possible to bypass this | |
7638d2dc | 2726 | security check by invoking S<C<use re 'eval'>>: |
a6b2f353 GS |
2727 | |
2728 | use re 'eval'; # throw caution out the door | |
2729 | $bar = 5; | |
2730 | $pat = '(?{ 1 })'; | |
a6b2f353 | 2731 | /foo${pat}bar/; # compiles ok |
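The effect of the pragma can be seen by trapping the error with C<eval>. This is a sketch only; the variables C<$without>, C<$error> and C<$with> are hypothetical names, and the exact wording of the error message may differ between Perl versions.

```perl
use strict;
use warnings;

my $pat = '(?{ 1 })';    # code expression held in a plain string

# Without the pragma, interpolating $pat makes the match die at run time.
my $without = eval { "foobar" =~ /foo${pat}bar/ ? "matched" : "no match" };
my $error   = $@;

# With the pragma in a lexical scope, the same match is allowed.
my $with;
{
    use re 'eval';
    $with = "foobar" =~ /foo${pat}bar/ ? "matched" : "no match";
}

print "without 're eval': ", defined $without ? $without : "died", "\n";
print "with    're eval': $with\n";
```

Keeping C<use re 'eval'> inside the smallest possible block limits where interpolated code expressions are trusted.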
47f9c88b | 2732 | |
7638d2dc | 2733 | Another form of code expression is the I<pattern code expression>. |
47f9c88b GS |
2734 | The pattern code expression is like a regular code expression, except |
2735 | that the result of the code evaluation is treated as a regular | |
2736 | expression and matched immediately. A simple example is | |
2737 | ||
2738 | $length = 5; | |
2739 | $char = 'a'; | |
2740 | $x = 'aaaaabb'; | |
2741 | $x =~ /(??{$char x $length})/x; # matches, there are 5 of 'a' | |
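A pattern code expression can also build its regexp from something matched earlier. In this hypothetical sketch (the format, C<$counted> and C<%ok> are assumptions for illustration), a leading number states how many C<'X'> characters must follow:

```perl
use strict;
use warnings;

my $counted = qr/^ (\d+)              # the announced count
                   (??{ "X{$1}" })    # becomes e.g. the regexp X{3}
                 $/x;

my %ok;
for my $s ("3XXX", "2XX", "2XXX") {
    $ok{$s} = $s =~ $counted ? 1 : 0;
    printf "%-5s %s\n", $s, $ok{$s} ? "matches" : "does not match";
}
```

The string returned by the code block is compiled and matched at that point in the pattern, so the quantifier is determined by the text being matched rather than fixed in the regexp.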
2742 | ||
2743 | ||
2744 | This final example contains both ordinary and pattern code | |
7638d2dc | 2745 | expressions. It detects whether a binary string C<1101010010001...> has a |
15776bb0 | 2746 | Fibonacci spacing 0,1,1,2,3,5,... of the C<'1'>'s: |
47f9c88b | 2747 | |
47f9c88b | 2748 | $x = "1101010010001000001"; |
7638d2dc | 2749 | $z0 = ''; $z1 = '0'; # initial conditions |
47f9c88b GS |
2750 | print "It is a Fibonacci sequence\n" |
2751 | if $x =~ /^1 # match an initial '1' | |
7638d2dc WL |
2752 | (?: |
2753 | ((??{ $z0 })) # match some '0' | |
2754 | 1 # and then a '1' | |
2755 | (?{ $z0 = $z1; $z1 .= $^N; }) | |
47f9c88b GS |
2756 | )+ # repeat as needed |
2757 | $ # that is all there is | |
2758 | /x; | |
7638d2dc | 2759 | printf "Largest sequence matched was %d\n", length($z1)-length($z0); |
47f9c88b | 2760 | |
7638d2dc WL |
2761 | Remember that C<$^N> is set to whatever was matched by the last |
2762 | completed capture group. This prints | |
47f9c88b GS |
2763 | |
2764 | It is a Fibonacci sequence | |
2765 | Largest sequence matched was 5 | |
2766 | ||
2767 | Ha! Try that with your garden variety regexp package... | |
2768 | ||
7638d2dc | 2769 | Note that the variables C<$z0> and C<$z1> are not substituted when the |
47f9c88b | 2770 | regexp is compiled, as happens for ordinary variables outside a code |
e128ab2c DM |
2771 | expression. Rather, the whole code block is parsed as Perl code at the |
2772 | same time as Perl is compiling the code containing the literal regexp |
2773 | pattern. | |
47f9c88b | 2774 | |
15776bb0 | 2775 | This regexp without the C</x> modifier is |
47f9c88b | 2776 | |
7638d2dc WL |
2777 | /^1(?:((??{ $z0 }))1(?{ $z0 = $z1; $z1 .= $^N; }))+$/ |
2778 | ||
2779 | which shows that spaces are still possible in the code parts. Nevertheless, | |
353c6505 | 2780 | when working with code and conditional expressions, the extended form of |
7638d2dc WL |
2781 | regexps is almost necessary in creating and debugging regexps. |
2782 | ||
2783 | ||
2784 | =head2 Backtracking control verbs | |
2785 | ||
2786 | Perl 5.10 introduced a number of control verbs intended to provide | |
2787 | detailed control over the backtracking process, by directly influencing | |
15776bb0 KW |
2788 | the regexp engine and by providing monitoring techniques. See |
2789 | L<perlre/"Special Backtracking Control Verbs"> for a detailed | |
2790 | description. | |
7638d2dc WL |
2791 | |
2792 | Below is just one example, illustrating the control verb C<(*FAIL)>, | |
2793 | which may be abbreviated as C<(*F)>. If this is inserted in a regexp | |
6b3ddc02 FC |
2794 | it will cause it to fail, just as it would at some |
2795 | mismatch between the pattern and the string. Processing | |
2796 | of the regexp continues as it would after any "normal" | |
353c6505 DL |
2797 | failure, so that, for instance, the next position in the string or another |
2798 | alternative will be tried. As failing to match doesn't preserve capture | |
c27a5cfe | 2799 | groups or produce results, it may be necessary to use this in |
7638d2dc WL |
2800 | combination with embedded code. |
2801 | ||
2802 | %count = (); | |
b539c2c9 | 2803 | "supercalifragilisticexpialidocious" =~ |
c2e2285d | 2804 | /([aeiou])(?{ $count{$1}++; })(*FAIL)/i; |
7638d2dc WL |
2805 | printf "%3d '%s'\n", $count{$_}, $_ for (sort keys %count); |
2806 | ||
353c6505 DL |
2807 | The pattern begins with a class matching a subset of letters. Whenever |
2808 | this matches, a statement like C<$count{'a'}++;> is executed, incrementing | |
2809 | the letter's counter. Then C<(*FAIL)> does what it says, and | |
6b3ddc02 FC |
2810 | the regexp engine proceeds according to the book: as long as the end of |
2811 | the string hasn't been reached, the position is advanced before looking | |
7638d2dc | 2812 | for another vowel. Thus, match or no match makes no difference, and the |
e1020413 | 2813 | regexp engine proceeds until the entire string has been inspected. |
7638d2dc WL |
2814 | (It's remarkable that an alternative solution using something like |
2815 | ||
b539c2c9 | 2816 | $count2{lc($_)}++ for split('', "supercalifragilisticexpialidocious"); |
7638d2dc WL |
2817 | printf "%3d '%s'\n", $count2{$_}, $_ for ( qw{ a e i o u } ); |
2818 | ||
2819 | is considerably slower.) | |
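The two approaches can be checked against each other. This sketch assumes a reasonably recent Perl (lexical variables inside code blocks behave reliably from 5.18 on); C<$word>, C<%count> and C<%count2> are declared with C<my> here only so that the snippet runs under C<strict>.

```perl
use strict;
use warnings;

my $word = "supercalifragilisticexpialidocious";

# Method 1: count vowels with a code block and (*FAIL).
my %count;
$word =~ /([aeiou])(?{ $count{$1}++; })(*FAIL)/i;

# Method 2: count every letter with split, then look only at the vowels.
my %count2;
$count2{lc($_)}++ for split('', $word);

for my $v (qw{ a e i o u }) {
    printf "%s: regexp counted %d, split counted %d\n",
        $v, $count{$v}, $count2{$v};
}
```

The two counts should agree for every vowel. No timing is measured here; comparing the speed of the two methods would need something like the core C<Benchmark> module.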
47f9c88b | 2820 | |
47f9c88b GS |
2821 | |
2822 | =head2 Pragmas and debugging | |
2823 | ||
2824 | Speaking of debugging, there are several pragmas available to control | |
2825 | and debug regexps in Perl. We have already encountered one pragma in | |
7638d2dc | 2826 | the previous section, S<C<use re 'eval';>>, that allows variable |
a6b2f353 GS |
2827 | interpolation and code expressions to coexist in a regexp. The other |
2828 | pragmas are | |
47f9c88b GS |
2829 | |
2830 | use re 'taint'; | |
2831 | $tainted = <>; | |
2832 | @parts = ($tainted =~ /(\w+)\s+(\w+)/); # @parts is now tainted |
2833 | ||
2834 | The C<taint> pragma causes any substrings from a match with a tainted | |
2835 | variable to be tainted as well. This is not normally the case, as | |
2836 | regexps are often used to extract the safe bits from a tainted | |
2837 | variable. Use C<taint> when you are not extracting safe bits, but are | |
2838 | performing some other processing. Both C<taint> and C<eval> pragmas | |
a6b2f353 | 2839 | are lexically scoped, which means they are in effect only until |
47f9c88b GS |
2840 | the end of the block enclosing the pragmas. |
2841 | ||
511eb430 FC |
2842 | use re '/m'; # or any other flags |
2843 | $multiline_string =~ /^foo/; # /m is implied | |
2844 | ||
9fa86798 FC |
2845 | The C<re '/flags'> pragma (introduced in Perl |
2846 | 5.14) turns on the given regular expression flags | |
3fd67154 KW |
2847 | until the end of the lexical scope. See |
2848 | L<re/"'E<sol>flags' mode"> for more | |
511eb430 FC |
2849 | detail. |
2850 | ||
47f9c88b GS |
2851 | use re 'debug'; |
2852 | /^(.*)$/s; # output debugging info | |
2853 | ||
2854 | use re 'debugcolor'; | |
2855 | /^(.*)$/s; # output debugging info in living color | |
2856 | ||
2857 | The global C<debug> and C<debugcolor> pragmas allow one to get | |
2858 | detailed debugging info about regexp compilation and | |
2859 | execution. C<debugcolor> is the same as C<debug>, except the debugging |
2860 | information is displayed in color on terminals that can display | |
2861 | termcap color sequences. Here is example output: | |
2862 | ||
2863 | % perl -e 'use re "debug"; "abc" =~ /a*b+c/;' | |
ccf3535a | 2864 | Compiling REx 'a*b+c' |
47f9c88b GS |
2865 | size 9 first at 1 |
2866 | 1: STAR(4) | |
2867 | 2: EXACT <a>(0) | |
2868 | 4: PLUS(7) | |
2869 | 5: EXACT <b>(0) | |
2870 | 7: EXACT <c>(9) | |
2871 | 9: END(0) | |
ccf3535a JK |
2872 | floating 'bc' at 0..2147483647 (checking floating) minlen 2 |
2873 | Guessing start of match, REx 'a*b+c' against 'abc'... | |
2874 | Found floating substr 'bc' at offset 1... | |
47f9c88b | 2875 | Guessed: match at offset 0 |
ccf3535a | 2876 | Matching REx 'a*b+c' against 'abc' |
47f9c88b | 2877 | Setting an EVAL scope, savestack=3 |
555bd962 BG |
2878 | 0 <> <abc> | 1: STAR |
2879 | EXACT <a> can match 1 times out of 32767... | |
47f9c88b | 2880 | Setting an EVAL scope, savestack=3 |
555bd962 BG |
2881 | 1 <a> <bc> | 4: PLUS |
2882 | EXACT <b> can match 1 times out of 32767... | |
47f9c88b | 2883 | Setting an EVAL scope, savestack=3 |
555bd962 BG |
2884 | 2 <ab> <c> | 7: EXACT <c> |
2885 | 3 <abc> <> | 9: END | |
47f9c88b | 2886 | Match successful! |
ccf3535a | 2887 | Freeing REx: 'a*b+c' |
47f9c88b GS |
2888 | |
2889 | If you have gotten this far into the tutorial, you can probably guess | |
2890 | what the different parts of the debugging output tell you. The first | |
2891 | part | |
2892 | ||
ccf3535a | 2893 | Compiling REx 'a*b+c' |
47f9c88b GS |
2894 | size 9 first at 1 |
2895 | 1: STAR(4) | |
2896 | 2: EXACT <a>(0) | |
2897 | 4: PLUS(7) | |
2898 | 5: EXACT <b>(0) | |
2899 | 7: EXACT <c>(9) | |
2900 | 9: END(0) | |
2901 | ||
2902 | describes the compilation stage. C<STAR(4)> means that there is a | |
2903 | starred object, in this case C<'a'>, and if it matches, go to line 4, |
15776bb0 | 2904 | I<i.e.>, C<PLUS(7)>. The middle lines describe some heuristics and |
47f9c88b GS |
2905 | optimizations performed before a match: |
2906 | ||
ccf3535a JK |
2907 | floating 'bc' at 0..2147483647 (checking floating) minlen 2 |
2908 | Guessing start of match, REx 'a*b+c' against 'abc'... | |
2909 | Found floating substr 'bc' at offset 1... | |
47f9c88b GS |
2910 | Guessed: match at offset 0 |
2911 | ||
2912 | Then the match is executed and the remaining lines describe the | |
2913 | process: | |
2914 | ||
ccf3535a | 2915 | Matching REx 'a*b+c' against 'abc' |
47f9c88b | 2916 | Setting an EVAL scope, savestack=3 |
555bd962 BG |
2917 | 0 <> <abc> | 1: STAR |
2918 | EXACT <a> can match 1 times out of 32767... | |
47f9c88b | 2919 | Setting an EVAL scope, savestack=3 |
555bd962 BG |
2920 | 1 <a> <bc> | 4: PLUS |
2921 | EXACT <b> can match 1 times out of 32767... | |
47f9c88b | 2922 | Setting an EVAL scope, savestack=3 |
555bd962 BG |
2923 | 2 <ab> <c> | 7: EXACT <c> |
2924 | 3 <abc> <> | 9: END | |
47f9c88b | 2925 | Match successful! |
ccf3535a | 2926 | Freeing REx: 'a*b+c' |
47f9c88b | 2927 | |
7638d2dc | 2928 | Each step is of the form S<C<< n <x> <y> >>>, with C<< <x> >> the |
47f9c88b | 2929 | part of the string matched and C<< <y> >> the part not yet |
7638d2dc | 2930 | matched. The S<C<< | 1: STAR >>> says that Perl is at line number 1 |
39b6ec1a | 2931 | in the compilation list above. See |
d9f2b251 | 2932 | L<perldebguts/"Debugging Regular Expressions"> for much more detail. |
47f9c88b GS |
2933 | |
2934 | An alternative method of debugging regexps is to embed C<print> | |
2935 | statements within the regexp. This provides a blow-by-blow account of | |
2936 | the backtracking in an alternation: | |
2937 | ||
2938 | "that this" =~ m@(?{print "Start at position ", pos, "\n";}) | |
2939 | t(?{print "t1\n";}) | |
2940 | h(?{print "h1\n";}) | |
2941 | i(?{print "i1\n";}) | |
2942 | s(?{print "s1\n";}) | |
2943 | | | |
2944 | t(?{print "t2\n";}) | |
2945 | h(?{print "h2\n";}) | |
2946 | a(?{print "a2\n";}) | |
2947 | t(?{print "t2\n";}) | |
2948 | (?{print "Done at position ", pos, "\n";}) | |
2949 | @x; | |
2950 | ||
2951 | prints | |
2952 | ||
2953 | Start at position 0 | |
2954 | t1 | |
2955 | h1 | |
2956 | t2 | |
2957 | h2 | |
2958 | a2 | |
2959 | t2 | |
2960 | Done at position 4 | |
2961 | ||
47f9c88b GS |
2962 | =head1 SEE ALSO |
2963 | ||
7638d2dc | 2964 | This is just a tutorial. For the full story on Perl regular |
47f9c88b GS |
2965 | expressions, see the L<perlre> regular expressions reference page. |
2966 | ||
2967 | For more information on the matching C<m//> and substitution C<s///> | |
2968 | operators, see L<perlop/"Regexp Quote-Like Operators">. For | |
2969 | information on the C<split> operation, see L<perlfunc/split>. | |
2970 | ||
2971 | For an excellent all-around resource on the care and feeding of | |
2972 | regular expressions, see the book I<Mastering Regular Expressions> by | |
2973 | Jeffrey Friedl (published by O'Reilly, ISBN 1-56592-257-3). |
2974 | ||
2975 | =head1 AUTHOR AND COPYRIGHT | |
2976 | ||
15776bb0 | 2977 | Copyright (c) 2000 Mark Kvale. |
47f9c88b | 2978 | All rights reserved. |
15776bb0 | 2979 | Now maintained by Perl porters. |
47f9c88b GS |
2980 | |
2981 | This document may be distributed under the same terms as Perl itself. | |
2982 | ||
2983 | =head2 Acknowledgments | |
2984 | ||
2985 | The inspiration for the stop codon DNA example came from the ZIP | |
2986 | code example in chapter 7 of I<Mastering Regular Expressions>. | |
2987 | ||
a6b2f353 GS |
2988 | The author would like to thank Jeff Pinyan, Andrew Johnson, Peter |
2989 | Haworth, Ronald J Kimball, and Joe Smith for all their helpful | |
2990 | comments. | |
47f9c88b GS |
2991 | |
2992 | =cut | |
a6b2f353 | 2993 |