=head1 NAME

perlfaq6 - Regular Expressions

=head1 DESCRIPTION

This section is surprisingly small because the rest of the FAQ is
littered with answers involving regular expressions. For example,
decoding a URL and checking whether something is a number are handled
with regular expressions, but those answers are found elsewhere in
this document (in L<perlfaq9>: "How do I decode or create those %-encodings
on the web" and L<perlfaq4>: "How do I determine whether a scalar is
a number/whole/integer/float", to be precise).

=head2 How can I hope to use regular expressions without creating illegible and unmaintainable code?
X<regex, legibility> X<regexp, legibility>
X<regular expression, legibility> X</x>

Three techniques can make regular expressions maintainable and
understandable.

=over 4

=item Comments Outside the Regex

Describe what you're doing and how you're doing it, using normal Perl
comments.

    # turn the line into the first word, a colon, and the
    # number of characters on the rest of the line
    s/^(\w+)(.*)/ lc($1) . ":" . length($2) /meg;

=item Comments Inside the Regex

The C</x> modifier causes whitespace to be ignored in a regex pattern
(except in a character class and a few other places), and also allows you to
use normal comments there, too. As you can imagine, whitespace and comments
help a lot.

C</x> lets you turn this:

    s{<(?:[^>'"]*|".*?"|'.*?')+>}{}gs;

into this:

    s{ <                    # opening angle bracket
        (?:                 # Non-backreffing grouping paren
            [^>'"] *        # 0 or more things that are neither > nor ' nor "
          |                 #    or else
            ".*?"           # a section between double quotes (stingy match)
          |                 #    or else
            '.*?'           # a section between single quotes (stingy match)
        ) +                 #   all occurring one or more times
       >                    # closing angle bracket
    }{}gsx;                 # replace with nothing, i.e. delete

It's still not quite so clear as prose, but it is very useful for
describing the meaning of each part of the pattern.

=item Different Delimiters

While we normally think of patterns as being delimited with C</>
characters, they can be delimited by almost any character. L<perlre>
describes this. For example, the C<s///> above uses braces as
delimiters. Selecting another delimiter can avoid quoting the
delimiter within the pattern:

    s/\/usr\/local/\/usr\/share/g;    # bad delimiter choice
    s#/usr/local#/usr/share#g;        # better

=back

=head2 I'm having trouble matching over more than one line. What's wrong?
X<regex, multiline> X<regexp, multiline> X<regular expression, multiline>

Either you don't have more than one line in the string you're looking
at (probably), or else you aren't using the correct modifier(s) on
your pattern (possibly).

There are many ways to get multiline data into a string. If you want
it to happen automatically while reading input, you'll want to set $/
(probably to '' for paragraphs or C<undef> for the whole file) to
allow you to read more than one line at a time.

Read L<perlre> to help you decide which of C</s> and C</m> (or both)
you might want to use: C</s> allows dot to include newline, and C</m>
allows caret and dollar to match next to a newline, not just at the
end of the string. You do need to make sure that you've actually
got a multiline string in there.

For example, this program detects duplicate words, even when they span
line breaks (but not paragraph ones). For this example, we don't need
C</s> because we aren't using dot in a regular expression that we want
to cross line boundaries. Neither do we need C</m> because we aren't
wanting caret or dollar to match at any point inside the record next
to newlines. But it's imperative that $/ be set to something other
than the default, or else we won't actually ever have a multiline
record read in.

    $/ = '';    # read in whole paragraph, not just one line
    while ( <> ) {
        while ( /\b([\w'-]+)(\s+\g1)+\b/gi ) {    # \g1 is the first capture again
            print "Duplicate $1 at paragraph $.\n";
        }
    }

Here's code that finds sentences that begin with "From " (which would
be mangled by many mailers):

    $/ = '';    # read in whole paragraph, not just one line
    while ( <> ) {
        while ( /^From /gm ) {    # /m makes ^ match next to \n
            print "leading from in paragraph $.\n";
        }
    }

Here's code that finds everything between START and END in a file:

    undef $/;    # read in whole file, not just one line or paragraph
    while ( <> ) {
        while ( /START(.*?)END/sgm ) {    # /s makes . cross line boundaries
            print "$1\n";
        }
    }

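The effect of the two modifiers is easiest to see on a string that is
already in memory. The following is only a small sketch; the sample
text and the variable names are made up for illustration.

    use strict;
    use warnings;

    my $text = "first line\nFrom here\nlast line\n";

    # Without /s, dot stops at a newline; with /s it crosses it.
    my $dot_plain = $text =~ /first.*last/  ? 1 : 0;
    my $dot_s     = $text =~ /first.*last/s ? 1 : 0;

    # Without /m, ^ matches only at the start of the string;
    # with /m it also matches just after each embedded newline.
    my $caret_plain = $text =~ /^From /  ? 1 : 0;
    my $caret_m     = $text =~ /^From /m ? 1 : 0;

    print "dot: $dot_plain $dot_s; caret: $caret_plain $caret_m\n";
    # prints "dot: 0 1; caret: 0 1"
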
=head2 How can I pull out lines between two patterns that are themselves on different lines?
X<..>

You can use Perl's somewhat exotic C<..> operator (documented in
L<perlop>):

    perl -ne 'print if /START/ .. /END/' file1 file2 ...

If you wanted text and not lines, you would use

    perl -0777 -ne 'print "$1\n" while /START(.*?)END/gs' file1 file2 ...

But if you want nested occurrences of C<START> through C<END>, you'll
run up against the problem described in the question in this section
on matching balanced text.

Here's another example of using C<..>:

    while (<>) {
        $in_header =   1  .. /^$/;
        $in_body   = /^$/ .. eof;
        # now choose between them
    } continue {
        $. = 0 if eof;    # fix $.
    }

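The same flip-flop behavior works on any loop over lines, not only on
input read with C<-n>. This sketch uses a made-up array in place of a
file to show that both marker lines are included in the result:

    use strict;
    use warnings;

    my @lines = ( "preamble", "START", "wanted 1", "wanted 2", "END", "trailer" );

    my @picked;
    for (@lines) {
        # The flip-flop turns true on the START line and stays true
        # through the END line.
        push @picked, $_ if /START/ .. /END/;
    }

    print join( ",", @picked ), "\n";    # START,wanted 1,wanted 2,END
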
=head2 How do I match XML, HTML, or other nasty, ugly things with a regex?
X<regex, XML> X<regex, HTML> X<XML> X<HTML> X<pain> X<frustration>
X<sucking out, will to live>

(contributed by brian d foy)

If you just want to get work done, use a module and forget about the
regular expressions. The C<XML::Parser> and C<HTML::Parser> modules
are good starts, although each namespace has other parsing modules
specialized for certain tasks and different ways of doing it. Start at
CPAN Search ( http://search.cpan.org ) and wonder at all the work people
have done for you already! :)

The problem with things such as XML is that they have balanced text
containing multiple levels of balanced text, but sometimes it isn't
balanced text, as in an empty tag (C<< <br/> >>, for instance). Even then,
things can occur out-of-order. Just when you think you've got a
pattern that matches your input, someone throws you a curveball.

If you'd like to do it the hard way, scratching and clawing your way
toward a right answer but constantly being disappointed, besieged by
bug reports, and weary from the inordinate amount of time you have to
spend reinventing a triangular wheel, then there are several things
you can try before you give up in frustration:

=over 4

=item * Solve the balanced text problem from another question in L<perlfaq6>

=item * Try the recursive regex features in Perl 5.10 and later. See L<perlre>

=item * Try defining a grammar using Perl 5.10's C<(?(DEFINE)...)> feature.

=item * Break the problem down into sub-problems instead of trying to use a single regex

=item * Convince everyone not to use XML or HTML in the first place

=back

Good luck!

=head2 I put a regular expression into $/ but it didn't work. What's wrong?
X<$/, regexes in> X<$INPUT_RECORD_SEPARATOR, regexes in>
X<$RS, regexes in>

$/ has to be a string. You can use these examples if you really need to
do this.

If you have File::Stream, this is easy.

    use File::Stream;

    my $stream = File::Stream->new(
        $filehandle,
        separator => qr/\s*,\s*/,
        );

    print "$_\n" while <$stream>;

If you don't have File::Stream, you have to do a little more work.

You can use the four-argument form of sysread to continually add to
a buffer. After you add to the buffer, you check if you have a
complete line (using your regular expression).

    local $_ = "";
    while( sysread FH, $_, 8192, length ) {
        while( s/^((?s).*?)your_pattern// ) {
            my $record = $1;
            # do stuff here.
        }
    }

You can do the same thing with foreach and a match using the
c flag and the \G anchor, if you do not mind your entire file
being in memory at the end.

    local $_ = "";
    while( sysread FH, $_, 8192, length ) {
        foreach my $record ( m/\G((?s).*?)your_pattern/gc ) {
            # do stuff here.
        }
        substr( $_, 0, pos ) = "" if pos;
    }

=head2 How do I substitute case insensitively on the LHS while preserving case on the RHS?
X<replace, case preserving> X<substitute, case preserving>
X<substitution, case preserving> X<s, case preserving>

Here's a lovely Perlish solution by Larry Rosler. It exploits
properties of bitwise xor on ASCII strings.

    $_= "this is a TEsT case";

    $old = 'test';
    $new = 'success';

    s{(\Q$old\E)}
     { uc $new | (uc $1 ^ $1) .
        (uc(substr $1, -1) ^ substr $1, -1) x
        (length($new) - length $1)
     }egi;

    print;

And here it is as a subroutine, modeled after the above:

    sub preserve_case($$) {
        my ($old, $new) = @_;
        my $mask = uc $old ^ $old;

        uc $new | $mask .
            substr($mask, -1) x (length($new) - length($old))
    }

    $string = "this is a TEsT case";
    $string =~ s/(test)/preserve_case($1, "success")/egi;
    print "$string\n";

This prints:

    this is a SUcCESS case

As an alternative, to keep the case of the replacement word if it is
longer than the original, you can use this code, by Jeff Pinyan:

    sub preserve_case {
        my ($from, $to) = @_;
        my ($lf, $lt) = map length, @_;

        if ($lt < $lf) { $from = substr $from, 0, $lt }
        else { $from .= substr $to, $lf }

        return uc $to | ($from ^ uc $from);
    }

This changes the sentence to "this is a SUcCess case."

Just to show that C programmers can write C in any programming language,
if you prefer a more C-like solution, the following script makes the
substitution have the same case, letter by letter, as the original.
(It also happens to run about 240% slower than the Perlish solution runs.)
If the substitution has more characters than the string being substituted,
the case of the last character is used for the rest of the substitution.

    # Original by Nathan Torkington, massaged by Jeffrey Friedl
    #
    sub preserve_case($$)
    {
        my ($old, $new) = @_;
        my ($state) = 0; # 0 = no change; 1 = lc; 2 = uc
        my ($i, $oldlen, $newlen, $c) = (0, length($old), length($new));
        my ($len) = $oldlen < $newlen ? $oldlen : $newlen;

        for ($i = 0; $i < $len; $i++) {
            if ($c = substr($old, $i, 1), $c =~ /[\W\d_]/) {
                $state = 0;
            } elsif (lc $c eq $c) {
                substr($new, $i, 1) = lc(substr($new, $i, 1));
                $state = 1;
            } else {
                substr($new, $i, 1) = uc(substr($new, $i, 1));
                $state = 2;
            }
        }
        # finish up with any remaining new (for when new is longer than old)
        if ($newlen > $oldlen) {
            if ($state == 1) {
                substr($new, $oldlen) = lc(substr($new, $oldlen));
            } elsif ($state == 2) {
                substr($new, $oldlen) = uc(substr($new, $oldlen));
            }
        }
        return $new;
    }

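The xor trick in the Perlish solutions relies on ASCII upper- and
lowercase letters differing only in one bit (0x20, which is also the
space character). Here is a minimal sketch of the mask on its own,
using illustrative words that are not part of the original examples:

    use strict;
    use warnings;

    my $old  = 'TEsT';
    my $mask = uc($old) ^ $old;    # "\0\0 \0": a space where $old was lowercase

    # OR-ing the mask into an uppercase word lowers exactly those positions.
    my $result = uc('best') | $mask;

    print "$result\n";    # BEsT

This only behaves as described for ASCII letters and while the
C<bitwise> feature is not in effect, since that feature makes C<^> and
C<|> numeric operators.
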
=head2 How can I make C<\w> match national character sets?
X<\w>

Put C<use locale;> in your script. The C<\w> character class is taken
from the current locale.

See L<perllocale> for details.

=head2 How can I match a locale-smart version of C</[a-zA-Z]/>?
X<alpha>

You can use the POSIX character class syntax C</[[:alpha:]]/>
documented in L<perlre>.

No matter which locale you are in, the alphabetic characters are
the characters in C<\w> without the digits and the underscore.
As a regex, that looks like C</[^\W\d_]/>. Its complement,
the non-alphabetics, is then everything in C<\W> along with
the digits and the underscore, or C</[\W\d_]/>.

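As a quick check of the two spellings, this sketch filters a made-up
list of single characters with each pattern. For these ASCII inputs
both keep the same items:

    use strict;
    use warnings;

    my @chars = ( 'a', 'Z', '7', '_', '-' );

    my @alpha_posix = grep { /[[:alpha:]]/ } @chars;
    my @alpha_w     = grep { /[^\W\d_]/ } @chars;

    print "@alpha_posix | @alpha_w\n";    # a Z | a Z
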
=head2 How can I quote a variable to use in a regex?
X<regex, escaping> X<regexp, escaping> X<regular expression, escaping>

The Perl parser will expand $variable and @variable references in
regular expressions unless the delimiter is a single quote. Remember,
too, that the right-hand side of a C<s///> substitution is considered
a double-quoted string (see L<perlop> for more details). Remember
also that any regex special characters in the interpolated value will
be acted on unless you precede the variable with C<\Q>. Here's an example:

    $string = "Placido P. Octopus";
    $regex  = "P.";

    $string =~ s/$regex/Polyp/;
    # $string is now "Polypacido P. Octopus"

Because C<.> is special in regular expressions, and can match any
single character, the regex C<P.> here has matched the C<Pl> in the
original string.

To escape the special meaning of C<.>, we use C<\Q>:

    $string = "Placido P. Octopus";
    $regex  = "P.";

    $string =~ s/\Q$regex/Polyp/;
    # $string is now "Placido Polyp Octopus"

The use of C<\Q> causes the C<.> in the regex to be treated as a
regular character, so that C<P.> matches a C<P> followed by a dot.

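The built-in C<quotemeta> function performs the same escaping as
C<\Q>, which can be handy when you want to prepare or inspect the
escaped string before using it. A small sketch, reusing the sample
string from above with otherwise made-up variable names:

    use strict;
    use warnings;

    my $literal = 'P.';
    my $escaped = quotemeta $literal;    # the dot is now backslashed

    my $string = 'Placido P. Octopus';
    ( my $copy = $string ) =~ s/$escaped/Polyp/;

    print "$escaped -> $copy\n";    # P\. -> Placido Polyp Octopus
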
=head2 What is C</o> really for?
X</o, regular expressions> X<compile, regular expressions>

(contributed by brian d foy)

The C</o> option for regular expressions (documented in L<perlop> and
L<perlreref>) tells Perl to compile the regular expression only once.
This is only useful when the pattern contains a variable. Perls 5.6
and later handle this automatically if the pattern does not change.

Since the match operator C<m//>, the substitution operator C<s///>,
and the regular expression quoting operator C<qr//> are double-quotish
constructs, you can interpolate variables into the pattern. See the
answer to "How can I quote a variable to use in a regex?" for more
details.

This example takes a regular expression from the argument list and
prints the lines of input that match it:

    my $pattern = shift @ARGV;

    while( <> ) {
        print if m/$pattern/;
    }

Versions of Perl prior to 5.6 would recompile the regular expression
for each iteration, even if C<$pattern> had not changed. The C</o>
would prevent this by telling Perl to compile the pattern the first
time, then reuse that for subsequent iterations:

    my $pattern = shift @ARGV;

    while( <> ) {
        print if m/$pattern/o;    # useful for Perl < 5.6
    }

In versions 5.6 and later, Perl won't recompile the regular expression
if the variable hasn't changed, so you probably don't need the C</o>
option. It doesn't hurt, but it doesn't help either. If you want any
version of Perl to compile the regular expression only once even if
the variable changes (thus, only using its initial value), you still
need the C</o>.

You can watch Perl's regular expression engine at work to verify for
yourself whether Perl is recompiling a regular expression. The C<use re
'debug'> pragma (comes with Perl 5.005 and later) shows the details.
With Perls before 5.6, you should see C<re> reporting that it's
compiling the regular expression on each iteration. With Perl 5.6 or
later, you should only see C<re> report that for the first iteration.

    use re 'debug';

    $regex = 'Perl';
    foreach ( qw(Perl Java Ruby Python) ) {
        print STDERR "-" x 73, "\n";
        print STDERR "Trying $_...\n";
        print STDERR "\t$_ is good!\n" if m/$regex/;
    }

=head2 How do I use a regular expression to strip C style comments from a file?

While this actually can be done, it's much harder than you'd think.
For example, this one-liner

    perl -0777 -pe 's{/\*.*?\*/}{}gs' foo.c

will work in many but not all cases. You see, it's too simple-minded for
certain kinds of C programs, in particular, those with what appear to be
comments in quoted strings. For that, you'd need something like this,
created by Jeffrey Friedl and later modified by Fred Curtis.

    $/ = undef;
    $_ = <>;
    s#/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $2 ? $2 : ""#gse;
    print;

This could, of course, be more legibly written with the C</x> modifier, adding
whitespace and comments. Here it is expanded, courtesy of Fred Curtis.

    s{
       /\*         ##  Start of /* ... */ comment
       [^*]*\*+    ##  Non-* followed by 1-or-more *'s
       (
         [^/*][^*]*\*+
       )*          ##  0-or-more things which don't start with /
                   ##    but do end with '*'
       /           ##  End of /* ... */ comment

     |         ##     OR  various things which aren't comments:

       (
         "           ##  Start of " ... " string
         (
           \\.           ##  Escaped char
         |               ##    OR
           [^"\\]        ##  Non "\
         )*
         "           ##  End of " ... " string

       |         ##     OR

         '           ##  Start of ' ... ' string
         (
           \\.           ##  Escaped char
         |               ##    OR
           [^'\\]        ##  Non '\
         )*
         '           ##  End of ' ... ' string

       |         ##     OR

         .           ##  Any other char
         [^/"'\\]*   ##  Chars which don't start a comment, string or escape
       )
     }{defined $2 ? $2 : ""}gxse;

A slight modification also removes C++ comments, possibly spanning multiple lines
using a continuation character:

    s#/\*[^*]*\*+([^/*][^*]*\*+)*/|//([^\\]|[^\n][\n]?)*?\n|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $3 ? $3 : ""#gse;

=head2 Can I use Perl regular expressions to match balanced text?
X<regex, matching balanced test> X<regexp, matching balanced test>
X<regular expression, matching balanced test> X<possessive> X<PARNO>
X<Text::Balanced> X<Regexp::Common> X<backtracking> X<recursion>

(contributed by brian d foy)

Your first try should probably be the C<Text::Balanced> module, which
has been in the Perl standard library since Perl 5.8. It has a variety of
functions to deal with tricky text. The C<Regexp::Common> module can
also help by providing canned patterns you can use.

As of Perl 5.10, you can match balanced text with regular expressions
using recursive patterns. Before Perl 5.10, you had to resort to
various tricks such as using Perl code in C<(??{})> sequences.

Here's an example using a recursive regular expression. The goal is to
capture all of the text within angle brackets, including the text in
nested angle brackets. This sample text has two "major" groups: a
group with one level of nesting and a group with two levels of
nesting. There are five total groups in angle brackets:

    I have some <brackets in <nested brackets> > and
    <another group <nested once <nested twice> > >
    and that's it.

The regular expression to match the balanced text uses two new (to
Perl 5.10) regular expression features. These are covered in L<perlre>
and this example is a modified version of one in that documentation.

First, adding the new possessive C<+> to any quantifier finds the
longest match and does not backtrack. That's important since you want
to handle any angle brackets through the recursion, not backtracking.
The group C<< [^<>]++ >> finds one or more non-angle brackets without
backtracking.

Second, the new C<(?PARNO)> refers to the sub-pattern in the
particular capture group given by C<PARNO>. In the following regex,
the first capture group finds (and remembers) the balanced text, and
you need that same pattern within the first buffer to get past the
nested text. That's the recursive part. The C<(?1)> uses the pattern
in the outer capture group as an independent part of the regex.

Putting it all together, you have:

    #!/usr/local/bin/perl5.10.0

    my $string =<<"HERE";
    I have some <brackets in <nested brackets> > and
    <another group <nested once <nested twice> > >
    and that's it.
    HERE

    my @groups = $string =~ m/
            (                   # start of capture group 1
            <                   # match an opening angle bracket
                (?:
                    [^<>]++     # one or more non angle brackets, non backtracking
                      |
                    (?1)        # found < or >, so recurse to capture group 1
                )*
            >                   # match a closing angle bracket
            )                   # end of capture group 1
            /xg;

    $" = "\n\t";
    print "Found:\n\t@groups\n";

The output shows that Perl found the two major groups:

    Found:
        <brackets in <nested brackets> >
        <another group <nested once <nested twice> > >

With a little extra work, you can get all of the groups in angle
brackets even if they are in other angle brackets too. Each time you
get a balanced match, remove its outer delimiter (that's the one you
just matched so don't match it again) and add it to a queue of strings
to process. Keep doing that until you get no matches:

    #!/usr/local/bin/perl5.10.0

    my @queue =<<"HERE";
    I have some <brackets in <nested brackets> > and
    <another group <nested once <nested twice> > >
    and that's it.
    HERE

    my $regex = qr/
            (                   # start of bracket 1
            <                   # match an opening angle bracket
                (?:
                    [^<>]++     # one or more non angle brackets, non backtracking
                      |
                    (?1)        # recurse to bracket 1
                )*
            >                   # match a closing angle bracket
            )                   # end of bracket 1
            /x;

    $" = "\n\t";

    while( @queue )
        {
        my $string = shift @queue;

        my @groups = $string =~ m/$regex/g;
        print "Found:\n\t@groups\n\n" if @groups;

        unshift @queue, map { s/^<//; s/>$//; $_ } @groups;
        }

The output shows all of the groups. The outermost matches show up
first and the nested matches show up later:

    Found:
        <brackets in <nested brackets> >
        <another group <nested once <nested twice> > >

    Found:
        <nested brackets>

    Found:
        <nested once <nested twice> >

    Found:
        <nested twice>

=head2 What does it mean that regexes are greedy? How can I get around it?
X<greedy> X<greediness>

Most people mean that greedy regexes match as much as they can.
Technically speaking, it's actually the quantifiers (C<?>, C<*>, C<+>,
C<{}>) that are greedy rather than the whole pattern; Perl prefers local
greed and immediate gratification to overall greed. To get non-greedy
versions of the same quantifiers, use (C<??>, C<*?>, C<+?>, C<{}?>).

An example:

    $s1 = $s2 = "I am very very cold";
    $s1 =~ s/ve.*y //;      # I am cold
    $s2 =~ s/ve.*?y //;     # I am very cold

Notice how the second substitution stopped matching as soon as it
encountered "y ". The C<*?> quantifier effectively tells the regular
expression engine to find a match as quickly as possible and pass
control on to whatever is next in line, like you would if you were
playing hot potato.

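Capturing what each quantifier consumed makes the difference visible.
The sketch below uses a made-up HTML-ish string and also shows a
negated character class, a common alternative that is not discussed
above but often expresses the intent more directly than C<+?>:

    use strict;
    use warnings;

    my $html = '<b>bold</b> and <i>italic</i>';

    my ($greedy)  = $html =~ /<(.+)>/;       # runs to the last '>'
    my ($lazy)    = $html =~ /<(.+?)>/;      # stops at the first '>'
    my ($negated) = $html =~ /<([^>]+)>/;    # cannot cross a '>' at all

    print "$greedy\n$lazy\n$negated\n";

The greedy capture is C<< b>bold</b> and <i>italic</i >>, while the
other two captures are just C<b>.
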
=head2 How do I process each word on each line?
X<word>

Use the split function:

    while (<>) {
        foreach $word ( split ) {
            # do something with $word here
        }
    }

Note that this isn't really a word in the English sense; it's just
chunks of consecutive non-whitespace characters.

To work with only alphanumeric sequences (including underscores), you
might consider

    while (<>) {
        foreach $word (m/(\w+)/g) {
            # do something with $word here
        }
    }

=head2 How can I print out a word-frequency or line-frequency summary?

To do this, you have to parse out each word in the input stream. We'll
pretend that by word you mean chunk of alphabetics, hyphens, or
apostrophes, rather than the non-whitespace chunk idea of a word given
in the previous question:

    while (<>) {
        while ( /(\b[^\W_\d][\w'-]+\b)/g ) {    # misses "`sheep'"
            $seen{$1}++;
        }
    }

    while ( ($word, $count) = each %seen ) {
        print "$count $word\n";
    }

If you wanted to do the same thing for lines, you wouldn't need a
regular expression:

    while (<>) {
        $seen{$_}++;
    }

    while ( ($line, $count) = each %seen ) {
        print "$count $line";
    }

If you want these output in a sorted order, see L<perlfaq4>: "How do I
sort a hash (optionally by value instead of key)?".

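For a quick look at what a sorted report might involve, here is a
sketch that counts words in a made-up string (using the simpler
C<\w+> notion of a word) and lists the most frequent first, breaking
ties alphabetically:

    use strict;
    use warnings;

    my $text = "the cat saw the other cat near the door";

    my %seen;
    $seen{$_}++ for $text =~ /(\w+)/g;

    # Highest count first; ties broken alphabetically.
    my @report = map  { "$seen{$_} $_" }
                 sort { $seen{$b} <=> $seen{$a} || $a cmp $b } keys %seen;

    print "$_\n" for @report;    # first two lines: "3 the", "2 cat"
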
=head2 How can I do approximate matching?
X<match, approximate> X<matching, approximate>

See the module C<String::Approx> available from CPAN.

=head2 How do I efficiently match many regular expressions at once?
X<regex, efficiency> X<regexp, efficiency>
X<regular expression, efficiency>

(contributed by brian d foy)

If you have Perl 5.10 or later, this is almost trivial. You just smart
match against an array of regular expression objects:

    my @patterns = ( qr/Fr.d/, qr/B.rn.y/, qr/W.lm./ );

    if( $string ~~ @patterns ) {
        ...
    };

The smart match stops when it finds a match, so it doesn't have to try
every expression.

Earlier than Perl 5.10, you have a bit of work to do. You want to
avoid compiling a regular expression every time you want to match it.
In this example, perl must recompile the regular expression for every
iteration of the C<foreach> loop since it has no way to know what
C<$pattern> will be:

    my @patterns = qw( foo bar baz );

    LINE: while( <DATA> ) {
        foreach $pattern ( @patterns ) {
            if( /\b$pattern\b/i ) {
                print;
                next LINE;
            }
        }
    }

The C<qr//> operator showed up in perl 5.005. It compiles a regular
expression, but doesn't apply it. When you use the pre-compiled
version of the regex, perl does less work. In this example, I inserted
a C<map> to turn each pattern into its pre-compiled form. The rest of
the script is the same, but faster:

    my @patterns = map { qr/\b$_\b/i } qw( foo bar baz );

    LINE: while( <> ) {
        foreach $pattern ( @patterns ) {
            if( /$pattern/ ) {
                print;
                next LINE;
            }
        }
    }

In some cases, you may be able to make several patterns into a single
regular expression. Beware of situations that require backtracking
though.

    my $regex = join '|', qw( foo bar baz );

    LINE: while( <> ) {
        print if /\b(?:$regex)\b/i;
    }

For more details on regular expression efficiency, see I<Mastering
Regular Expressions> by Jeffrey Friedl. He explains how regular
expression engines work and why some patterns are surprisingly
inefficient. Once you understand how perl applies regular expressions,
you can tune them for individual situations.

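Smart match was later marked experimental (Perl 5.18) and deprecated
(Perl 5.38), so you may prefer an explicit early-exit search over the
compiled patterns. One possible sketch uses C<first> from the core
C<List::Util> module; the helper name C<first_matching_pattern> and the
test strings are made up for illustration:

    use strict;
    use warnings;
    use List::Util qw(first);

    my @patterns = ( qr/Fr.d/, qr/B.rn.y/, qr/W.lm./ );

    # first() stops at the first pattern that matches, so later
    # patterns are not tried. (Hypothetical helper, not a core function.)
    sub first_matching_pattern {
        my ($string) = @_;
        return first { $string =~ $_ } @patterns;
    }

    my $hit  = first_matching_pattern('Barney Rubble');
    my $miss = first_matching_pattern('Dino');

    print defined $hit  ? "Barney: matched\n" : "Barney: no match\n";
    print defined $miss ? "Dino: matched\n"   : "Dino: no match\n";
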
=head2 Why don't word-boundary searches with C<\b> work for me?
X<\b>

(contributed by brian d foy)

Ensure that you know what \b really does: it's the boundary between a
word character, \w, and something that isn't a word character. That
thing that isn't a word character might be \W, but it can also be the
start or end of the string.

It's not (not!) the boundary between whitespace and non-whitespace,
and it's not the stuff between words we use to create sentences.

In regex speak, a word boundary (\b) is a "zero width assertion",
meaning that it doesn't represent a character in the string, but a
condition at a certain position.

For the regular expression, /\bPerl\b/, there has to be a word
boundary before the "P" and after the "l". As long as something other
than a word character precedes the "P" and follows the "l", the
pattern will match. These strings match /\bPerl\b/.

    "Perl"    # no word char before P or after l
    "Perl "   # same as previous (space is not a word char)
    "'Perl'"  # the ' char is not a word char
    "Perl's"  # no word char before P, non-word char after "l"

These strings do not match /\bPerl\b/.

    "Perl_"   # _ is a word char!
    "Perler"  # no word char before P, but one after l

You don't have to use \b to match words though. You can look for
non-word characters surrounded by word characters. These strings
match the pattern /\b'\b/.

    "don't"   # the ' char is surrounded by "n" and "t"
    "qep'a'"  # the ' char is surrounded by "p" and "a"

These strings do not match /\b'\b/.

    "foo'"    # there is no word char after non-word '

You can also use the complement of \b, \B, to specify that there
should not be a word boundary.

In the pattern /\Bam\B/, there must be a word character before the "a"
and after the "m". These strings match /\Bam\B/:

    "llama"   # "am" surrounded by word chars
    "Samuel"  # same

These strings do not match /\Bam\B/:

    "Sam"      # no word boundary before "a", but one after "m"
    "I am Sam" # "am" surrounded by non-word chars

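You can confirm these cases yourself by filtering a list of
candidates. This sketch reuses several of the strings above; the
variable names are only illustrative:

    use strict;
    use warnings;

    my @candidates = ( "Perl", "Perl's", "Perl_", "Perler", "llama", "Sam" );

    my @whole_word = grep { /\bPerl\b/ } @candidates;
    my @inside     = grep { /\Bam\B/ } @candidates;

    print "whole word: @whole_word\n";    # whole word: Perl Perl's
    print "inside: @inside\n";            # inside: llama
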
=head2 Why does using $&, $`, or $' slow my program down?
X<$MATCH> X<$&> X<$POSTMATCH> X<$'> X<$PREMATCH> X<$`>

(contributed by Anno Siegel)

Once Perl sees that you need one of these variables anywhere in the
program, it provides them on each and every pattern match. That means
that on every pattern match the entire string will be copied, part of it
to $`, part to $&, and part to $'. Thus the penalty is most severe with
long strings and patterns that match often. Avoid $&, $', and $` if you
can, but if you can't, once you've used them at all, use them at will
because you've already paid the price. Remember that some algorithms
really appreciate them. As of the 5.005 release, the $& variable is no
longer "expensive" the way the other two are.

Since Perl 5.6.1 the special variables @- and @+ can functionally replace
$`, $& and $'. These arrays contain the offsets of the beginning and end
of each match (see L<perlvar> for the full story), so they give you
essentially the same information, but without the risk of excessive
string copying.

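As a sketch of that replacement, the three pieces can be rebuilt with
C<substr> from C<$-[0]> and C<$+[0]>, the start and end offsets of the
whole match. The sample string and variable names are made up for
illustration.

```perl
use strict;
use warnings;

my $string = "The quick brown fox";
my ( $prematch, $match, $postmatch );

if ( $string =~ /quick/ ) {
    # $-[0] is the offset where the match starts, $+[0] where it ends.
    $prematch  = substr( $string, 0, $-[0] );               # like $`
    $match     = substr( $string, $-[0], $+[0] - $-[0] );   # like $&
    $postmatch = substr( $string, $+[0] );                  # like $'
    print "pre=[$prematch] match=[$match] post=[$postmatch]\n";
}
```
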
Perl 5.10 added three specials, C<${^MATCH}>, C<${^PREMATCH}>, and
C<${^POSTMATCH}> to do the same job but without the global performance
penalty. Perl 5.10 only sets these variables if you compile or execute the
regular expression with the C</p> modifier.

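A minimal sketch of the C</p> form, assuming Perl 5.10 or later; the
sample string and variable names are illustrative only.

```perl
use strict;
use warnings;

my $string = "The quick brown fox";
my ( $pre, $match, $post );

# The /p modifier asks Perl to preserve the match pieces for this match.
if ( $string =~ /quick/p ) {
    ( $pre, $match, $post ) = ( ${^PREMATCH}, ${^MATCH}, ${^POSTMATCH} );
    print "pre=[$pre] match=[$match] post=[$post]\n";
}
```
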
=head2 What good is C<\G> in a regular expression?
X<\G>

You use the C<\G> anchor to start the next match on the same
string where the last match left off. The regular
expression engine cannot skip over any characters to find
the next match with this anchor, so C<\G> is similar to the
beginning of string anchor, C<^>. The C<\G> anchor is typically
used with the C<g> flag. It uses the value of C<pos()>
as the position to start the next match. As the match
operator makes successive matches, it updates C<pos()> with the
position of the next character past the last match (or the
first character of the next match, depending on how you like
to look at it). Each string has its own C<pos()> value.

Suppose you want to match all consecutive pairs of digits
in a string like "1122a44" and stop matching when you
encounter non-digits. You want to match C<11> and C<22> but
the letter C<a> shows up between C<22> and C<44> and you want
to stop at C<a>. Simply matching pairs of digits skips over
the C<a> and still matches C<44>.

    $_ = "1122a44";
    my @pairs = m/(\d\d)/g;   # qw( 11 22 44 )

If you use the C<\G> anchor, you force the match after C<22> to
start with the C<a>. The regular expression cannot match
there since it does not find a digit, so the next match
fails and the match operator returns the pairs it already
found.

    $_ = "1122a44";
    my @pairs = m/\G(\d\d)/g; # qw( 11 22 )

You can also use the C<\G> anchor in scalar context. You
still need the C<g> flag.

    $_ = "1122a44";
    while( m/\G(\d\d)/g )
    {
        print "Found $1\n";
    }

After the match fails at the letter C<a>, perl resets C<pos()>
and the next match on the same string starts at the beginning.

    $_ = "1122a44";
    while( m/\G(\d\d)/g )
    {
        print "Found $1\n";
    }

    print "Found $1 after while" if m/(\d\d)/g; # finds "11"

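You can watch the reset happen by printing C<pos()> as you go. This is
a small sketch; C<@positions> and C<$pos_after_failure> are names
invented for the illustration.

```perl
use strict;
use warnings;

my $string = "1122a44";
my @positions;

while ( $string =~ m/\G(\d\d)/g ) {
    # pos() reports the offset just past the match that succeeded.
    push @positions, pos($string);
    print "Found $1, pos() is now ", pos($string), "\n";
}

# The failed match at "a" reset pos() to undef, so the next /g match
# on this string starts over from the beginning.
my $pos_after_failure = pos($string);
print "pos() after the loop is ",
    ( defined $pos_after_failure ? $pos_after_failure : "undef" ), "\n";
```
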
You can disable C<pos()> resets on fail with the C<c> flag, documented
in L<perlop> and L<perlreref>. Subsequent matches start where the last
successful match ended (the value of C<pos()>) even if a match on the
same string has failed in the meantime. In this case, the match after
the C<while()> loop starts at the C<a> (where the last match stopped),
and since it does not use any anchor it can skip over the C<a> to find
C<44>.

    $_ = "1122a44";
    while( m/\G(\d\d)/gc )
    {
        print "Found $1\n";
    }

    print "Found $1 after while" if m/(\d\d)/g; # finds "44"

Typically you use the C<\G> anchor with the C<c> flag
when you want to try a different match if one fails,
such as in a tokenizer. Jeffrey Friedl offers this example
which works in 5.004 or later.

    while (<>) {
        chomp;
        PARSER: {
            m/ \G( \d+\b    )/gcx && do { print "number: $1\n"; redo; };
            m/ \G( \w+      )/gcx && do { print "word:   $1\n"; redo; };
            m/ \G( \s+      )/gcx && do { print "space:  $1\n"; redo; };
            m/ \G( [^\w\d]+ )/gcx && do { print "other:  $1\n"; redo; };
        }
    }

For each line, the C<PARSER> loop first tries to match a series
of digits followed by a word boundary. This match has to
start at the place the last match left off (or the beginning
of the string on the first match). Since C<m/ \G( \d+\b
)/gcx> uses the C<c> flag, if the string does not match that
regular expression, perl does not reset C<pos()> and the next
match starts at the same position to try a different
pattern.

=head2 Are Perl regexes DFAs or NFAs? Are they POSIX compliant?
X<DFA> X<NFA> X<POSIX>

While it's true that Perl's regular expressions resemble the DFAs
(deterministic finite automata) of the egrep(1) program, they are in
fact implemented as NFAs (non-deterministic finite automata) to allow
backtracking and backreferencing. And they aren't POSIX-style either,
because those guarantee worst-case behavior for all cases. (It seems
that some people prefer guarantees of consistency, even when what's
guaranteed is slowness.) See the book "Mastering Regular Expressions"
(from O'Reilly) by Jeffrey Friedl for all the details you could ever
hope to know on these matters (a full citation appears in
L<perlfaq2>).

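Two visible consequences of the backtracking engine can be sketched as
follows. With alternation, Perl takes the first alternative that lets
the match succeed, where a POSIX leftmost-longest matcher would report
"foobar" for both patterns. Backreferences work because the engine can
retry shorter captures. The variable names are illustrative only.

```perl
use strict;
use warnings;

# Alternatives are tried left to right; the first one that lets the
# whole pattern succeed wins, even if a later one would match more.
my ($first)  = "foobar" =~ /(foo|foobar)/;
my ($second) = "foobar" =~ /(foobar|foo)/;
print "foo|foobar matched '$first'\n";
print "foobar|foo matched '$second'\n";

# A backreference: \1 must repeat whatever (\w+) captured, so the engine
# backtracks until the capture is exactly half of the string.
my ($half) = "abcabc" =~ /^(\w+)\1$/;
print "repeated unit is '$half'\n";
```
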
=head2 What's wrong with using grep in a void context?
X<grep>

The problem is that grep builds a return list, regardless of the context.
This means you're making Perl go to the trouble of building a list that
you then just throw away. If the list is large, you waste both time and space.
If your intent is to iterate over the list, then use a for loop for this
purpose.

In perls older than 5.8.1, map suffers from this problem as well.
But since 5.8.1, this has been fixed, and map is context aware: in void
context, no lists are constructed.

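In other words, loop when you want the side effect and keep C<grep> for
the cases where you use the list it returns. A minimal sketch with
made-up data:

```perl
use strict;
use warnings;

my @names = qw( alice bob carol );
my @shouted;

# Iterating for the side effect: use a for loop, not grep in void context.
for my $name (@names) {
    push @shouted, uc $name;
}

# grep earns its keep when you actually use the list it returns.
my @short = grep { length($_) <= 3 } @names;

print "@shouted\n";
print "@short\n";
```
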
=head2 How can I match strings with multibyte characters?
X<regex, and multibyte characters> X<regexp, and multibyte characters>
X<regular expression, and multibyte characters> X<martian> X<encoding, Martian>

Starting with Perl 5.6, Perl has had some level of multibyte character
support. Perl 5.8 or later is recommended. Supported multibyte
character repertoires include Unicode, and legacy encodings
through the Encode module. See L<perluniintro>, L<perlunicode>,
and L<Encode>.

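As a small sketch of why decoding matters, assuming Perl 5.8 or later
with the core C<Encode> module: the same data behaves differently as raw
bytes and as decoded characters. The sample bytes are made up for the
illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Five bytes: "caf" plus the two-byte UTF-8 encoding of e-acute (U+00E9).
my $bytes = "caf\xC3\xA9";
my $text  = decode( 'UTF-8', $bytes );

# "." matches one byte in the raw string but one character after decoding.
my $bytes_match = $bytes =~ /^caf.$/ ? 1 : 0;
my $text_match  = $text  =~ /^caf.$/ ? 1 : 0;

print "bytes: length ", length($bytes), ", matches: $bytes_match\n";
print "text:  length ", length($text),  ", matches: $text_match\n";
```
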
If you are stuck with older Perls, you can do Unicode with the
C<Unicode::String> module, and character conversions using the
C<Unicode::Map8> and C<Unicode::Map> modules. If you are using
Japanese encodings, you might try using jperl 5.005_03.

Finally, the following set of approaches was offered by Jeffrey
Friedl, whose article in issue #5 of The Perl Journal talks about
this very matter.

Let's suppose you have some weird Martian encoding where pairs of
ASCII uppercase letters encode single Martian letters (i.e. the two
bytes "CV" make a single Martian letter, as do the two bytes "SG",
"VS", "XX", etc.). Other bytes represent single characters, just like
ASCII.

So, the string of Martian "I am CVSGXX!" uses 12 bytes to encode the
nine characters 'I', ' ', 'a', 'm', ' ', 'CV', 'SG', 'XX', '!'.

Now, say you want to search for the single character C</GX/>. Perl
doesn't know about Martian, so it'll find the two bytes "GX" in the "I
am CVSGXX!" string, even though that character isn't there: it just
looks like it is because "SG" is next to "XX", but there's no real
"GX". This is a big problem.

Here are a few ways, all painful, to deal with it:

    # Make sure adjacent "martian" bytes are no longer adjacent.
    $martian =~ s/([A-Z][A-Z])/ $1 /g;

    print "found GX!\n" if $martian =~ /GX/;

Or like this:

    @chars = $martian =~ m/([A-Z][A-Z]|[^A-Z])/g;
    # above is conceptually similar to:     @chars = $text =~ m/(.)/g;
    #
    foreach $char (@chars) {
        # print first, then leave the loop; in "print ..., last" the
        # last runs before print ever receives its arguments
        if ($char eq 'GX') { print "found GX!\n"; last }
    }

Or like this:

    while ($martian =~ m/\G([A-Z][A-Z]|.)/gs) {  # \G probably unneeded
        if ($1 eq 'GX') { print "found GX!\n"; last }
    }

Here's another, slightly less painful, way to do it from Benjamin
Goldberg, who uses a zero-width negative look-behind assertion.

    print "found GX!\n" if $martian =~ m/
        (?<![A-Z])
        (?:[A-Z][A-Z])*?
        GX
        /x;

This succeeds if the "martian" character GX is in the string, and fails
otherwise. If you don't like using (?<!), a zero-width negative
look-behind assertion, you can replace (?<![A-Z]) with (?:^|[^A-Z]).

It does have the drawback of putting the wrong thing in $-[0] and $+[0],
but this usually can be worked around.

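To see the look-behind approach at work, here is a sketch that wraps it
in a subroutine and tries it on two strings. The name C<has_martian_gx>
and the second test string are hypothetical, chosen only for this
illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the Martian character GX occurs in $martian.
sub has_martian_gx {
    my ($martian) = @_;
    return $martian =~ m/
        (?<![A-Z])          # start where a run of uppercase bytes begins
        (?:[A-Z][A-Z])*?    # skip whole Martian characters, two bytes each
        GX                  # then find GX aligned on a pair boundary
    /x ? 1 : 0;
}

my $no_gx  = has_martian_gx("I am CVSGXX!");   # SG next to XX, no real GX
my $has_gx = has_martian_gx("I am CVGXSG!");   # CV, GX, SG

print "CVSGXX: ", ( $no_gx  ? "found GX" : "no GX" ), "\n";
print "CVGXSG: ", ( $has_gx ? "found GX" : "no GX" ), "\n";
```
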
=head2 How do I match a regular expression that's in a variable?
X<regex, in variable> X<eval> X<regex> X<quotemeta> X<\Q, regex>
X<\E, regex> X<qr//>

(contributed by brian d foy)

We don't have to hard-code patterns into the match operator (or
anything else that works with regular expressions). We can put the
pattern in a variable for later use.

The match operator is a double quote context, so you can interpolate
your variable just like a double quoted string. In this case, you
read the regular expression as user input and store it in C<$regex>.
Once you have the pattern in C<$regex>, you use that variable in the
match operator.

    chomp( my $regex = <STDIN> );

    if( $string =~ m/$regex/ ) { ... }

Any regular expression special characters in C<$regex> are still
special, and the pattern still has to be valid or Perl will complain.
For instance, in this pattern there is an unpaired parenthesis.

    my $regex = "Unmatched ( paren";

    "Two parens to bind them all" =~ m/$regex/;

When Perl compiles the regular expression, it treats the parenthesis
as the start of a capture group. When it doesn't find the closing
parenthesis, it complains:

    Unmatched ( in regex; marked by <-- HERE in m/Unmatched ( <-- HERE paren/ at script line 3.

You can get around this in several ways depending on your situation.
First, if you don't want any of the characters in the string to be
special, you can escape them with C<quotemeta> before you use the string.

    chomp( my $regex = <STDIN> );
    $regex = quotemeta( $regex );

    if( $string =~ m/$regex/ ) { ... }

You can also do this directly in the match operator using the C<\Q>
and C<\E> sequences. The C<\Q> tells Perl where to start escaping
special characters, and the C<\E> tells it where to stop (see L<perlop>
for more details).

    chomp( my $regex = <STDIN> );

    if( $string =~ m/\Q$regex\E/ ) { ... }

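To make the escaping concrete, this sketch prints what C<quotemeta>
produces for the troublesome string and checks that both forms match it
literally. The target string and variable names are made up for the
illustration.

```perl
use strict;
use warnings;

my $input   = "Unmatched ( paren";
my $escaped = quotemeta($input);   # backslashes every non-word character
print "$escaped\n";

# With the metacharacters escaped, the pattern matches the text literally.
my $target      = "An Unmatched ( paren here";
my $literal_hit = $target =~ m/$escaped/   ? 1 : 0;
my $same_with_q = $target =~ m/\Q$input\E/ ? 1 : 0;
print "quotemeta: $literal_hit, \\Q...\\E: $same_with_q\n";
```
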
Alternatively, you can use C<qr//>, the regular expression quote operator (see
L<perlop> for more details). It quotes and perhaps compiles the pattern,
and you can apply regular expression flags to the pattern.

    chomp( my $input = <STDIN> );

    my $regex = qr/$input/is;

    $string =~ m/$regex/;  # same as m/$input/is

You might also want to trap any errors by wrapping an C<eval> block
around the whole thing.

    chomp( my $input = <STDIN> );

    eval {
        if( $string =~ m/\Q$input\E/ ) { ... }
    };
    warn $@ if $@;

Or...

    my $regex = eval { qr/$input/is };
    if( defined $regex ) {
        $string =~ m/$regex/;
    }
    else {
        warn $@;
    }

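If you validate patterns in more than one place, the C<eval> and
C<qr//> combination can be folded into a small subroutine. This is only
a sketch; C<compile_pattern> is a hypothetical name, not something Perl
provides.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a compiled regex, or undef if $input
# is not a valid pattern.
sub compile_pattern {
    my ($input) = @_;
    my $regex = eval { qr/$input/ };
    return $regex;
}

for my $input ( 'a+b', 'Unmatched ( paren' ) {
    if ( defined compile_pattern($input) ) {
        print "ok: $input\n";
    }
    else {
        print "bad pattern: $input\n";
    }
}
```
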
=head1 AUTHOR AND COPYRIGHT

Copyright (c) 1997-2010 Tom Christiansen, Nathan Torkington, and
other authors as noted. All rights reserved.

This documentation is free; you can redistribute it and/or modify it
under the same terms as Perl itself.

Irrespective of its distribution, all code examples in this file
are hereby placed into the public domain. You are permitted and
encouraged to use this code in your own programs for fun
or for profit as you see fit. A simple comment in the code giving
credit would be courteous but is not required.