=head1 NAME

perlfaq6 - Regular Expressions

=head1 DESCRIPTION

This section is surprisingly small because the rest of the FAQ is
littered with answers involving regular expressions. For example,
decoding a URL and checking whether something is a number are handled
with regular expressions, but those answers are found elsewhere in
this document (in L<perlfaq9>: "How do I decode or create those %-encodings
on the web" and L<perlfaq4>: "How do I determine whether a scalar is
a number/whole/integer/float", to be precise).

=head2 How can I hope to use regular expressions without creating illegible and unmaintainable code?
X<regex, legibility> X<regexp, legibility>
X<regular expression, legibility> X</x>

Three techniques can make regular expressions maintainable and
understandable.

=over 4

=item Comments Outside the Regex

Describe what you're doing and how you're doing it, using normal Perl
comments.

    # turn the line into the first word, a colon, and the
    # number of characters on the rest of the line
    s/^(\w+)(.*)/ lc($1) . ":" . length($2) /meg;

=item Comments Inside the Regex

The C</x> modifier causes whitespace to be ignored in a regex pattern
(except in a character class and a few other places), and also allows you to
use normal comments there, too. As you can imagine, whitespace and comments
help a lot.

C</x> lets you turn this:

    s{<(?:[^>'"]*|".*?"|'.*?')+>}{}gs;

into this:

    s{ <                    # opening angle bracket
        (?:                 # Non-backreffing grouping paren
            [^>'"] *        # 0 or more things that are neither > nor ' nor "
          |                 # or else
            ".*?"           # a section between double quotes (stingy match)
          |                 # or else
            '.*?'           # a section between single quotes (stingy match)
        ) +                 # all occurring one or more times
       >                    # closing angle bracket
    }{}gsx;                 # replace with nothing, i.e. delete

It's still not quite so clear as prose, but it is very useful for
describing the meaning of each part of the pattern.

=item Different Delimiters

While we normally think of patterns as being delimited with C</>
characters, they can be delimited by almost any character. L<perlre>
describes this. For example, the C<s///> above uses braces as
delimiters. Selecting another delimiter can avoid quoting the
delimiter within the pattern:

    s/\/usr\/local/\/usr\/share/g;  # bad delimiter choice
    s#/usr/local#/usr/share#g;      # better

=back

=head2 I'm having trouble matching over more than one line. What's wrong?
X<regex, multiline> X<regexp, multiline> X<regular expression, multiline>

Either you don't have more than one line in the string you're looking
at (probably), or else you aren't using the correct modifier(s) on
your pattern (possibly).

There are many ways to get multiline data into a string. If you want
it to happen automatically while reading input, you'll want to set $/
(probably to '' for paragraphs or C<undef> for the whole file) to
allow you to read more than one line at a time.

Read L<perlre> to help you decide which of C</s> and C</m> (or both)
you might want to use: C</s> allows dot to include newline, and C</m>
allows caret and dollar to match next to a newline, not just at the
start and end of the string. You do need to make sure that you've
actually got a multiline string in there.

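Here's a minimal sketch of what the two modifiers change. The sample
string and the variable names are made up for illustration; nothing
here comes from a real program.

```perl
use strict;
use warnings;

# Hypothetical sample: one string holding two lines.
my $text = "first line\nsecond line\n";

# Without /s, dot stops at the newline; with /s it runs to the end.
my ($plain_dot) = $text =~ /first(.*)/;
my ($s_dot)     = $text =~ /first(.*)/s;

# Without /m, ^ matches only at the start of the string;
# with /m it also matches just after each embedded newline.
my $plain_count = () = $text =~ /^\w+/g;
my $m_count     = () = $text =~ /^\w+/mg;

printf "dot without /s captured %d characters\n", length $plain_dot;
printf "dot with /s captured %d characters\n",    length $s_dot;
print  "line starts found: $plain_count without /m, $m_count with /m\n";
```

The C<() = ...> assignment is only there to count matches in list
context; it is not required by either modifier.
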
For example, this program detects duplicate words, even when they span
line breaks (but not paragraph ones). For this example, we don't need
C</s> because we aren't using dot in a regular expression that we want
to cross line boundaries. Neither do we need C</m> because we aren't
wanting caret or dollar to match at any point inside the record next
to newlines. But it's imperative that $/ be set to something other
than the default, or else we won't actually ever have a multiline
record read in.

    $/ = '';  # read in whole paragraph, not just one line
    while ( <> ) {
        while ( /\b([\w'-]+)(\s+\1)+\b/gi ) {  # word starts alpha
            print "Duplicate $1 at paragraph $.\n";
        }
    }

Here's code that finds lines that begin with "From " (which would
be mangled by many mailers):

    $/ = '';  # read in whole paragraph, not just one line
    while ( <> ) {
        while ( /^From /gm ) { # /m makes ^ match next to \n
            print "leading from in paragraph $.\n";
        }
    }

Here's code that finds everything between START and END in a file:

    undef $/;  # read in whole file, not just one line or paragraph
    while ( <> ) {
        while ( /START(.*?)END/sgm ) { # /s makes . cross line boundaries
            print "$1\n";
        }
    }

=head2 How can I pull out lines between two patterns that are themselves on different lines?
X<..>

You can use Perl's somewhat exotic C<..> operator (documented in
L<perlop>):

    perl -ne 'print if /START/ .. /END/' file1 file2 ...

If you wanted text and not lines, you would use

    perl -0777 -ne 'print "$1\n" while /START(.*?)END/gs' file1 file2 ...

But if you want nested occurrences of C<START> through C<END>, you'll
run up against the problem described in the question in this section
on matching balanced text.

Here's another example of using C<..>:

    while (<>) {
        $in_header =   1  .. /^$/;
        $in_body   = /^$/ .. eof;
        # now choose between them
    } continue {
        $. = 0 if eof;  # fix $.
    }

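The same flip-flop behavior can be tried without any files. This is
only a sketch: the C<@lines> array and its contents are hypothetical
stand-ins for lines read from input.

```perl
use strict;
use warnings;

# Hypothetical input lines held in memory rather than read from files.
my @lines = ( "junk", "START here", "middle", "END here", "more junk" );

# In scalar context the range operator acts as a flip-flop: false until
# /START/ matches, then true up to and including the line matching /END/.
my @kept = grep { /START/ .. /END/ } @lines;

print "$_\n" for @kept;
```

Both boundary lines are kept, which mirrors what the one-liner with
C<-ne> prints.
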
=head2 How do I match XML, HTML, or other nasty, ugly things with a regex?
X<regex, XML> X<regex, HTML> X<XML> X<HTML> X<pain> X<frustration>
X<sucking out, will to live>

(contributed by brian d foy)

If you just want to get work done, use a module and forget about the
regular expressions. The C<XML::Parser> and C<HTML::Parser> modules
are good starts, although each namespace has other parsing modules
specialized for certain tasks and different ways of doing it. Start at
CPAN Search ( http://search.cpan.org ) and wonder at all the work people
have done for you already! :)

The problem with things such as XML is that they have balanced text
containing multiple levels of balanced text, but sometimes it isn't
balanced text, as in an empty tag (C<< <br/> >>, for instance). Even then,
things can occur out-of-order. Just when you think you've got a
pattern that matches your input, someone throws you a curveball.

If you'd like to do it the hard way, scratching and clawing your way
toward a right answer but constantly being disappointed, besieged by
bug reports, and weary from the inordinate amount of time you have to
spend reinventing a triangular wheel, then there are several things
you can try before you give up in frustration:

=over 4

=item * Solve the balanced text problem from another question in L<perlfaq6>

=item * Try the recursive regex features in Perl 5.10 and later. See L<perlre>

=item * Try defining a grammar using Perl 5.10's C<(?DEFINE)> feature.

=item * Break the problem down into sub-problems instead of trying to use a single regex

=item * Convince everyone not to use XML or HTML in the first place

=back

Good luck!

=head2 I put a regular expression into $/ but it didn't work. What's wrong?
X<$/, regexes in> X<$INPUT_RECORD_SEPARATOR, regexes in>
X<$RS, regexes in>

$/ has to be a string. You can use these examples if you really need to
do this.

If you have File::Stream, this is easy.

    use File::Stream;

    my $stream = File::Stream->new(
        $filehandle,
        separator => qr/\s*,\s*/,
        );

    print "$_\n" while <$stream>;

If you don't have File::Stream, you have to do a little more work.

You can use the four-argument form of sysread to continually add to
a buffer. After you add to the buffer, you check if you have a
complete line (using your regular expression).

    local $_ = "";
    while( sysread FH, $_, 8192, length ) {
        while( s/^((?s).*?)your_pattern// ) {
            my $record = $1;
            # do stuff here.
        }
    }

You can do the same thing with foreach and a match using the
c flag and the \G anchor, if you do not mind your entire file
being in memory at the end.

    local $_ = "";
    while( sysread FH, $_, 8192, length ) {
        foreach my $record ( m/\G((?s).*?)your_pattern/gc ) {
            # do stuff here.
        }
        substr( $_, 0, pos ) = "" if pos;
    }

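If the whole input fits comfortably in memory, a simpler route is to
slurp it and C<split> on the pattern. Note that this is a different
technique from the buffered C<sysread> loops, not a variation on them.
The sketch below assumes made-up sample data and uses an in-memory
filehandle as a stand-in for a real file.

```perl
use strict;
use warnings;

# Hypothetical sample data; the in-memory handle stands in for a real file.
my $data = "alpha , beta,gamma ,  delta";
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";

my @records = do {
    local $/;                        # undef $/ makes <$fh> return everything
    split /\s*,\s*/, scalar <$fh>;   # the regex plays the separator's role
};
close $fh;

print "$_\n" for @records;
```

The trade-off is memory: every record exists at once, so this is a
poor fit for very large inputs.
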
=head2 How do I substitute case insensitively on the LHS while preserving case on the RHS?
X<replace, case preserving> X<substitute, case preserving>
X<substitution, case preserving> X<s, case preserving>

Here's a lovely Perlish solution by Larry Rosler. It exploits
properties of bitwise xor on ASCII strings.

    $_= "this is a TEsT case";

    $old = 'test';
    $new = 'success';

    s{(\Q$old\E)}
     { uc $new | (uc $1 ^ $1) .
        (uc(substr $1, -1) ^ substr $1, -1) x
        (length($new) - length $1)
     }egi;

    print;

And here it is as a subroutine, modeled after the above:

    sub preserve_case($$) {
        my ($old, $new) = @_;
        my $mask = uc $old ^ $old;

        uc $new | $mask .
            substr($mask, -1) x (length($new) - length($old))
    }

    $string = "this is a TEsT case";
    $string =~ s/(test)/preserve_case($1, "success")/egi;
    print "$string\n";

This prints:

    this is a SUcCESS case

As an alternative, to keep the case of the replacement word if it is
longer than the original, you can use this code, by Jeff Pinyan:

    sub preserve_case {
        my ($from, $to) = @_;
        my ($lf, $lt) = map length, @_;

        if ($lt < $lf) { $from = substr $from, 0, $lt }
        else { $from .= substr $to, $lf }

        return uc $to | ($from ^ uc $from);
    }

This changes the sentence to "this is a SUcCess case."

Just to show that C programmers can write C in any programming language,
if you prefer a more C-like solution, the following script makes the
substitution have the same case, letter by letter, as the original.
(It also happens to run about 240% slower than the Perlish solution runs.)
If the substitution has more characters than the string being substituted,
the case of the last character is used for the rest of the substitution.

    # Original by Nathan Torkington, massaged by Jeffrey Friedl
    #
    sub preserve_case($$)
    {
        my ($old, $new) = @_;
        my ($state) = 0; # 0 = no change; 1 = lc; 2 = uc
        my ($i, $oldlen, $newlen, $c) = (0, length($old), length($new));
        my ($len) = $oldlen < $newlen ? $oldlen : $newlen;

        for ($i = 0; $i < $len; $i++) {
            if ($c = substr($old, $i, 1), $c =~ /[\W\d_]/) {
                $state = 0;
            } elsif (lc $c eq $c) {
                substr($new, $i, 1) = lc(substr($new, $i, 1));
                $state = 1;
            } else {
                substr($new, $i, 1) = uc(substr($new, $i, 1));
                $state = 2;
            }
        }
        # finish up with any remaining new (for when new is longer than old)
        if ($newlen > $oldlen) {
            if ($state == 1) {
                substr($new, $oldlen) = lc(substr($new, $oldlen));
            } elsif ($state == 2) {
                substr($new, $oldlen) = uc(substr($new, $oldlen));
            }
        }
        return $new;
    }

=head2 How can I make C<\w> match national character sets?
X<\w>

Put C<use locale;> in your script. The \w character class is taken
from the current locale.

See L<perllocale> for details.

=head2 How can I match a locale-smart version of C</[a-zA-Z]/>?
X<alpha>

You can use the POSIX character class syntax C</[[:alpha:]]/>
documented in L<perlre>.

No matter which locale you are in, the alphabetic characters are
the characters in \w without the digits and the underscore.
As a regex, that looks like C</[^\W\d_]/>. Its complement,
the non-alphabetics, is then everything in \W along with
the digits and the underscore, or C</[\W\d_]/>.

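A quick way to convince yourself that the two spellings agree is to
run both over the same text. This sketch uses a made-up ASCII sample
and does not switch on C<use locale>, so it only shows the default
behavior.

```perl
use strict;
use warnings;

# Hypothetical sample text mixing letters, digits, and an underscore.
my $text = "route_66 and cafe9";

# [^\W\d_]: a word character that is neither a digit nor an underscore.
my @alpha_runs = $text =~ /([^\W\d_]+)/g;

# The POSIX class finds the same runs in this ASCII sample.
my @posix_runs = $text =~ /([[:alpha:]]+)/g;

print "@alpha_runs\n";
print "@posix_runs\n";
```
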
=head2 How can I quote a variable to use in a regex?
X<regex, escaping> X<regexp, escaping> X<regular expression, escaping>

The Perl parser will expand $variable and @variable references in
regular expressions unless the delimiter is a single quote. Remember,
too, that the right-hand side of a C<s///> substitution is considered
a double-quoted string (see L<perlop> for more details). Remember
also that any regex special characters will be acted on unless you
precede the interpolated variable with \Q. Here's an example:

    $string = "Placido P. Octopus";
    $regex  = "P.";

    $string =~ s/$regex/Polyp/;
    # $string is now "Polypacido P. Octopus"

Because C<.> is special in regular expressions, and can match any
single character, the regex C<P.> here has matched the C<Pl> in the
original string.

To escape the special meaning of C<.>, we use C<\Q>:

    $string = "Placido P. Octopus";
    $regex  = "P.";

    $string =~ s/\Q$regex/Polyp/;
    # $string is now "Placido Polyp Octopus"

The use of C<\Q> causes the C<.> in the regex to be treated as a
regular character, so that C<P.> matches a C<P> followed by a dot.

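The built-in C<quotemeta> function performs the same escaping as
C<\Q>, which is handy when you want to see the escaped text or store
it for later. The literal below is a hypothetical stand-in for
user-supplied input.

```perl
use strict;
use warnings;

my $literal = 'a.b*c';             # hypothetical user-supplied text
my $quoted  = quotemeta $literal;  # backslashes every non-word character

print "quoted form: $quoted\n";

# Unquoted, the metacharacters are live, so unrelated text can match.
my $loose_hit  = 'aXbbbc' =~ /^$literal$/       ? 1 : 0;

# Quoted with \Q...\E, only the literal text itself matches.
my $strict_hit = 'aXbbbc' =~ /^\Q$literal\E$/   ? 1 : 0;
my $exact_hit  = 'a.b*c'  =~ /^\Q$literal\E$/   ? 1 : 0;

print "unquoted matches aXbbbc: $loose_hit\n";
print "quoted matches aXbbbc:   $strict_hit\n";
print "quoted matches a.b*c:    $exact_hit\n";
```
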
=head2 What is C</o> really for?
X</o, regular expressions> X<compile, regular expressions>

(contributed by brian d foy)

The C</o> option for regular expressions (documented in L<perlop> and
L<perlreref>) tells Perl to compile the regular expression only once.
This is only useful when the pattern contains a variable. Perls 5.6
and later handle this automatically if the pattern does not change.

Since the match operator C<m//>, the substitution operator C<s///>,
and the regular expression quoting operator C<qr//> are double-quotish
constructs, you can interpolate variables into the pattern. See the
answer to "How can I quote a variable to use in a regex?" for more
details.

This example takes a regular expression from the argument list and
prints the lines of input that match it:

    my $pattern = shift @ARGV;

    while( <> ) {
        print if m/$pattern/;
    }

Versions of Perl prior to 5.6 would recompile the regular expression
for each iteration, even if C<$pattern> had not changed. The C</o>
would prevent this by telling Perl to compile the pattern the first
time, then reuse that for subsequent iterations:

    my $pattern = shift @ARGV;

    while( <> ) {
        print if m/$pattern/o; # useful for Perl < 5.6
    }

In versions 5.6 and later, Perl won't recompile the regular expression
if the variable hasn't changed, so you probably don't need the C</o>
option. It doesn't hurt, but it doesn't help either. If you want any
version of Perl to compile the regular expression only once even if
the variable changes (thus, only using its initial value), you still
need the C</o>.

You can watch Perl's regular expression engine at work to verify for
yourself if Perl is recompiling a regular expression. The C<use re
'debug'> pragma (comes with Perl 5.005 and later) shows the details.
With Perls before 5.6, you should see C<re> reporting that it's
compiling the regular expression on each iteration. With Perl 5.6 or
later, you should only see C<re> report that for the first iteration.

    use re 'debug';

    $regex = 'Perl';
    foreach ( qw(Perl Java Ruby Python) ) {
        print STDERR "-" x 73, "\n";
        print STDERR "Trying $_...\n";
        print STDERR "\t$_ is good!\n" if m/$regex/;
    }

=head2 How do I use a regular expression to strip C style comments from a file?

While this actually can be done, it's much harder than you'd think.
For example, this one-liner

    perl -0777 -pe 's{/\*.*?\*/}{}gs' foo.c

will work in many but not all cases. You see, it's too simple-minded for
certain kinds of C programs, in particular, those with what appear to be
comments in quoted strings. For that, you'd need something like this,
created by Jeffrey Friedl and later modified by Fred Curtis.

    $/ = undef;
    $_ = <>;
    s#/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $2 ? $2 : ""#gse;
    print;

This could, of course, be more legibly written with the C</x> modifier, adding
whitespace and comments. Here it is expanded, courtesy of Fred Curtis.

    s{
       /\*         ##  Start of /* ... */ comment
       [^*]*\*+    ##  Non-* followed by 1-or-more *'s
       (
         [^/*][^*]*\*+
       )*          ##  0-or-more things which don't start with /
                   ##    but do end with '*'
       /           ##  End of /* ... */ comment

     |         ##     OR  various things which aren't comments:

       (
         "           ##  Start of " ... " string
         (
           \\.           ##  Escaped char
         |               ##    OR
           [^"\\]        ##  Non "\
         )*
         "           ##  End of " ... " string

       |         ##     OR

         '           ##  Start of ' ... ' string
         (
           \\.           ##  Escaped char
         |               ##    OR
           [^'\\]        ##  Non '\
         )*
         '           ##  End of ' ... ' string

       |         ##     OR

         .           ##  Any other char
         [^/"'\\]*   ##  Chars which don't start a comment, string or escape
       )
     }{defined $2 ? $2 : ""}gxse;

A slight modification also removes C++ comments, possibly spanning multiple lines
using a continuation character:

    s#/\*[^*]*\*+([^/*][^*]*\*+)*/|//([^\\]|[^\n][\n]?)*?\n|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $3 ? $3 : ""#gse;

=head2 Can I use Perl regular expressions to match balanced text?
X<regex, matching balanced text> X<regexp, matching balanced text>
X<regular expression, matching balanced text> X<possessive> X<PARNO>
X<Text::Balanced> X<Regexp::Common> X<backtracking> X<recursion>

(contributed by brian d foy)

Your first try should probably be the C<Text::Balanced> module, which
is in the Perl standard library since Perl 5.8. It has a variety of
functions to deal with tricky text. The C<Regexp::Common> module can
also help by providing canned patterns you can use.

As of Perl 5.10, you can match balanced text with regular expressions
using recursive patterns. Before Perl 5.10, you had to resort to
various tricks such as using Perl code in C<(??{})> sequences.

Here's an example using a recursive regular expression. The goal is to
capture all of the text within angle brackets, including the text in
nested angle brackets. This sample text has two "major" groups: a
group with one level of nesting and a group with two levels of
nesting. There are five total groups in angle brackets:

    I have some <brackets in <nested brackets> > and
    <another group <nested once <nested twice> > >
    and that's it.

The regular expression to match the balanced text uses two new (to
Perl 5.10) regular expression features. These are covered in L<perlre>
and this example is a modified version of one in that documentation.

First, adding the new possessive C<+> to any quantifier finds the
longest match and does not backtrack. That's important since you want
to handle any angle brackets through the recursion, not backtracking.
The group C<< [^<>]++ >> finds one or more non-angle brackets without
backtracking.

Second, the new C<(?PARNO)> refers to the sub-pattern in the
particular capture buffer given by C<PARNO>. In the following regex,
the first capture buffer finds (and remembers) the balanced text, and
you need that same pattern within the first buffer to get past the
nested text. That's the recursive part. The C<(?1)> uses the pattern
in the outer capture buffer as an independent part of the regex.

Putting it all together, you have:

    #!/usr/local/bin/perl5.10.0

    my $string =<<"HERE";
    I have some <brackets in <nested brackets> > and
    <another group <nested once <nested twice> > >
    and that's it.
    HERE

    my @groups = $string =~ m/
            (                   # start of capture buffer 1
            <                   # match an opening angle bracket
                (?:
                    [^<>]++     # one or more non angle brackets, non backtracking
                      |
                    (?1)        # found < or >, so recurse to capture buffer 1
                )*
            >                   # match a closing angle bracket
            )                   # end of capture buffer 1
            /xg;

    $" = "\n\t";
    print "Found:\n\t@groups\n";

The output shows that Perl found the two major groups:

    Found:
        <brackets in <nested brackets> >
        <another group <nested once <nested twice> > >

With a little extra work, you can get all of the groups in angle
brackets even if they are in other angle brackets too. Each time you
get a balanced match, remove its outer delimiter (that's the one you
just matched so don't match it again) and add it to a queue of strings
to process. Keep doing that until you get no matches:

    #!/usr/local/bin/perl5.10.0

    my @queue =<<"HERE";
    I have some <brackets in <nested brackets> > and
    <another group <nested once <nested twice> > >
    and that's it.
    HERE

    my $regex = qr/
            (                   # start of bracket 1
            <                   # match an opening angle bracket
                (?:
                    [^<>]++     # one or more non angle brackets, non backtracking
                      |
                    (?1)        # recurse to bracket 1
                )*
            >                   # match a closing angle bracket
            )                   # end of bracket 1
            /x;

    $" = "\n\t";

    while( @queue )
        {
        my $string = shift @queue;

        my @groups = $string =~ m/$regex/g;
        print "Found:\n\t@groups\n\n" if @groups;

        unshift @queue, map { s/^<//; s/>$//; $_ } @groups;
        }

The output shows all of the groups. The outermost matches show up
first and the nested matches show up later:

    Found:
        <brackets in <nested brackets> >
        <another group <nested once <nested twice> > >

    Found:
        <nested brackets>

    Found:
        <nested once <nested twice> >

    Found:
        <nested twice>

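For comparison, here's a rough sketch of the C<Text::Balanced> route
mentioned at the top of this answer, using its C<extract_bracketed>
function. The input string is invented for the example, and the
function expects the bracketed text to sit at the start of the string
(leading whitespace is skipped by default).

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Hypothetical input that begins with a balanced, nested group.
my $text = '<brackets in <nested brackets> > and the rest';

# In list context the function returns the balanced substring first
# and whatever follows it second.
my ( $extracted, $remainder ) = extract_bracketed( $text, '<>' );

print "extracted: $extracted\n";
print "remainder: $remainder\n";
```

Unlike the recursive regex, this pulls off one leading group per call,
so scanning a whole document means calling it repeatedly on the
remainder.
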
=head2 What does it mean that regexes are greedy? How can I get around it?
X<greedy> X<greediness>

Most people mean that greedy regexes match as much as they can.
Technically speaking, it's actually the quantifiers (C<?>, C<*>, C<+>,
C<{}>) that are greedy rather than the whole pattern; Perl prefers local
greed and immediate gratification to overall greed. To get non-greedy
versions of the same quantifiers, use (C<??>, C<*?>, C<+?>, C<{}?>).

An example:

    $s1 = $s2 = "I am very very cold";
    $s1 =~ s/ve.*y //;      # I am cold
    $s2 =~ s/ve.*?y //;     # I am very cold

Notice how the second substitution stopped matching as soon as it
encountered "y ". The C<*?> quantifier effectively tells the regular
expression engine to find a match as quickly as possible and pass
control on to whatever is next in line, like you would if you were
playing hot potato.

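The difference is easiest to see when you capture what the quantifier
consumed. This is a small sketch with a made-up string; the variable
names are illustrative only.

```perl
use strict;
use warnings;

my $markup = '<b>bold</b> and <i>italic</i>';   # hypothetical sample

# Greedy: .+ grabs as much as it can, then backs off to the last ">".
my ($greedy) = $markup =~ /<(.+)>/;

# Non-greedy: .+? stops at the first ">" that lets the match succeed.
my ($lazy)   = $markup =~ /<(.+?)>/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```
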
=head2 How do I process each word on each line?
X<word>

Use the split function:

    while (<>) {
        foreach $word ( split ) {
            # do something with $word here
        }
    }

Note that this isn't really a word in the English sense; it's just
chunks of consecutive non-whitespace characters.

To work with only alphanumeric sequences (including underscores), you
might consider

    while (<>) {
        foreach $word (m/(\w+)/g) {
            # do something with $word here
        }
    }

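To see how the two approaches carve up the same line, here's a short
sketch. The sample sentence is made up, and C<split ' ', $line> is the
explicit spelling of what a bare C<split> does to C<$_>.

```perl
use strict;
use warnings;

my $line = "Hello, world! It's fine.";   # hypothetical input line

# Whitespace-separated chunks: punctuation stays attached.
my @chunks = split ' ', $line;

# Runs of word characters: punctuation is dropped, and the
# apostrophe breaks "It's" into two pieces.
my @words = $line =~ /(\w+)/g;

print join( "|", @chunks ), "\n";
print join( "|", @words ),  "\n";
```
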
=head2 How can I print out a word-frequency or line-frequency summary?

To do this, you have to parse out each word in the input stream. We'll
pretend that by word you mean chunk of alphabetics, hyphens, or
apostrophes, rather than the non-whitespace chunk idea of a word given
in the previous question:

    while (<>) {
        while ( /(\b[^\W_\d][\w'-]+\b)/g ) {   # misses "`sheep'"
            $seen{$1}++;
        }
    }

    while ( ($word, $count) = each %seen ) {
        print "$count $word\n";
    }

If you wanted to do the same thing for lines, you wouldn't need a
regular expression:

    while (<>) {
        $seen{$_}++;
    }

    while ( ($line, $count) = each %seen ) {
        print "$count $line";
    }

If you want these output in a sorted order, see L<perlfaq4>: "How do I
sort a hash (optionally by value instead of key)?".

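As a rough illustration of the sorted variant, here's one way it might
look. The sample text is invented, and breaking ties alphabetically is
just one reasonable choice, not something this FAQ prescribes.

```perl
use strict;
use warnings;

# Hypothetical sample text standing in for the input stream.
my $text = "the cat and the hat and the bat";

my %seen;
$seen{$1}++ while $text =~ /(\b[^\W_\d][\w'-]+\b)/g;

# Highest count first; ties broken alphabetically for a stable order.
my @by_count = sort { $seen{$b} <=> $seen{$a} or $a cmp $b } keys %seen;

print "$seen{$_} $_\n" for @by_count;
```

Note that the pattern needs at least two characters per word, so
one-letter words such as "a" are not counted.
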
=head2 How can I do approximate matching?
X<match, approximate> X<matching, approximate>

See the module String::Approx available from CPAN.

=head2 How do I efficiently match many regular expressions at once?
X<regex, efficiency> X<regexp, efficiency>
X<regular expression, efficiency>

(contributed by brian d foy)

Avoid asking Perl to compile a regular expression every time
you want to match it. In this example, perl must recompile
the regular expression for every iteration of the C<foreach>
loop since it has no way to know what $pattern will be.

    @patterns = qw( foo bar baz );

    LINE: while( <DATA> )
        {
        foreach $pattern ( @patterns )
            {
            if( /\b$pattern\b/i )
                {
                print;
                next LINE;
                }
            }
        }

The C<qr//> operator showed up in perl 5.005. It compiles a
regular expression, but doesn't apply it. When you use the
pre-compiled version of the regex, perl does less work. In
this example, I inserted a C<map> to turn each pattern into
its pre-compiled form. The rest of the script is the same,
but faster.

    @patterns = map { qr/\b$_\b/i } qw( foo bar baz );

    LINE: while( <> )
        {
        foreach $pattern ( @patterns )
            {
            if( /$pattern/ )
                {
                print;
                next LINE;
                }
            }
        }

In some cases, you may be able to make several patterns into
a single regular expression. Beware of situations that require
backtracking though.

    $regex = join '|', qw( foo bar baz );

    LINE: while( <> )
        {
        print if /\b(?:$regex)\b/i;
        }

For more details on regular expression efficiency, see I<Mastering
Regular Expressions> by Jeffrey Friedl. He explains how regular
expression engines work and why some patterns are surprisingly
inefficient. Once you understand how perl applies regular
expressions, you can tune them for individual situations.

=head2 Why don't word-boundary searches with C<\b> work for me?
X<\b>

(contributed by brian d foy)

Ensure that you know what \b really does: it's the boundary between a
word character, \w, and something that isn't a word character. That
thing that isn't a word character might be \W, but it can also be the
start or end of the string.

It's not (not!) the boundary between whitespace and non-whitespace,
and it's not the stuff between words we use to create sentences.

In regex speak, a word boundary (\b) is a "zero width assertion",
meaning that it doesn't represent a character in the string, but a
condition at a certain position.

For the regular expression, /\bPerl\b/, there has to be a word
boundary before the "P" and after the "l". As long as something other
than a word character precedes the "P" and follows the "l", the
pattern will match. These strings match /\bPerl\b/.

    "Perl"    # no word char before P or after l
    "Perl "   # same as previous (space is not a word char)
    "'Perl'"  # the ' char is not a word char
    "Perl's"  # no word char before P, non-word char after "l"

These strings do not match /\bPerl\b/.

    "Perl_"   # _ is a word char!
    "Perler"  # no word char before P, but one after l

You don't have to use \b to match words though. You can look for
non-word characters surrounded by word characters. These strings
match the pattern /\b'\b/.

    "don't"   # the ' char is surrounded by "n" and "t"
    "qep'a'"  # the ' char is surrounded by "p" and "a"

These strings do not match /\b'\b/.

    "foo'"    # there is no word char after non-word '

You can also use the complement of \b, \B, to specify that there
should not be a word boundary.

In the pattern /\Bam\B/, there must be a word character before the "a"
and after the "m". These strings match /\Bam\B/:

    "llama"   # "am" surrounded by word chars
    "Samuel"  # same

These strings do not match /\Bam\B/.

    "Sam"      # no word boundary before "a", but one after "m"
    "I am Sam" # "am" surrounded by non-word chars

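If you'd rather check these claims than take them on faith, a loop
like this one does the job. It's a sketch only; the C<%result> hash is
there just so the outcome can be inspected afterwards.

```perl
use strict;
use warnings;

# Sample strings taken from the lists in this answer.
my @samples = ( "Perl", "'Perl'", "Perl's", "Perl_", "Perler" );

my %result;
for my $string (@samples) {
    # \b needs a word/non-word transition (or a string edge) on each side.
    $result{$string} = $string =~ /\bPerl\b/ ? "matches" : "does not match";
    printf "%-8s %s\n", $string, $result{$string};
}
```
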
830=head2 Why does using $&, $`, or $' slow my program down?
d74e8afc 831X<$MATCH> X<$&> X<$POSTMATCH> X<$'> X<$PREMATCH> X<$`>
68dc0745 832
571e049f 833(contributed by Anno Siegel)
68dc0745 834
571e049f 835Once Perl sees that you need one of these variables anywhere in the
b68463f7
RGS
836program, it provides them on each and every pattern match. That means
837that on every pattern match the entire string will be copied, part of it
838to $`, part to $&, and part to $'. Thus the penalty is most severe with
839long strings and patterns that match often. Avoid $&, $', and $` if you
840can, but if you can't, once you've used them at all, use them at will
841because you've already paid the price. Remember that some algorithms
842really appreciate them. As of the 5.005 release, the $& variable is no
843longer "expensive" the way the other two are.
844
845Since Perl 5.6.1 the special variables @- and @+ can functionally replace
846$`, $& and $'. These arrays contain pointers to the beginning and end
847of each match (see perlvar for the full story), so they give you
848essentially the same information, but without the risk of excessive
849string copying.
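A minimal sketch of that replacement, assuming a sample string and pattern invented for the illustration: C<$-[0]> and C<$+[0]> are the start and end offsets of the whole match, so C<substr> can rebuild the three pieces.

```perl
use strict;
use warnings;

my $text = "The quick brown fox";    # sample data, not from the FAQ

my ( $pre, $match, $post );
if ( $text =~ /quick/ ) {
    $pre   = substr( $text, 0, $-[0] );                 # same text as $`
    $match = substr( $text, $-[0], $+[0] - $-[0] );     # same text as $&
    $post  = substr( $text, $+[0] );                    # same text as $'
}

print join( "|", $pre, $match, $post ), "\n";           # The |quick| brown fox
```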

Perl 5.10 added three specials, C<${^MATCH}>, C<${^PREMATCH}>, and
C<${^POSTMATCH}> to do the same job but without the global performance
penalty. Perl 5.10 only sets these variables if you compile or execute the
regular expression with the C</p> modifier.
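For comparison, here is a short sketch of the C</p> form. It needs Perl 5.10 or later, and the sample string is again made up for the illustration.

```perl
use strict;
use warnings;

my $text = "The quick brown fox";    # sample data, not from the FAQ

my ( $pre, $match, $post );
if ( $text =~ /quick/p ) {           # /p asks Perl to keep the three pieces
    ( $pre, $match, $post ) = ( ${^PREMATCH}, ${^MATCH}, ${^POSTMATCH} );
}

print join( "|", $pre, $match, $post ), "\n";    # The |quick| brown fox
```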

=head2 What good is C<\G> in a regular expression?
X<\G>

You use the C<\G> anchor to start the next match on the same
string where the last match left off. The regular
expression engine cannot skip over any characters to find
the next match with this anchor, so C<\G> is similar to the
beginning of string anchor, C<^>. The C<\G> anchor is typically
used with the C<g> flag. It uses the value of C<pos()>
as the position to start the next match. As the match
operator makes successive matches, it updates C<pos()> with the
position of the next character past the last match (or the
first character of the next match, depending on how you like
to look at it). Each string has its own C<pos()> value.

Suppose you want to match all consecutive pairs of digits
in a string like "1122a44" and stop matching when you
encounter non-digits. You want to match C<11> and C<22> but
the letter C<a> shows up between C<22> and C<44> and you want
to stop at C<a>. Simply matching pairs of digits skips over
the C<a> and still matches C<44>.

    $_ = "1122a44";
    my @pairs = m/(\d\d)/g;   # qw( 11 22 44 )

If you use the C<\G> anchor, you force the match after C<22> to
start with the C<a>. The regular expression cannot match
there since it does not find a digit, so the next match
fails and the match operator returns the pairs it already
found.

    $_ = "1122a44";
    my @pairs = m/\G(\d\d)/g; # qw( 11 22 )

You can also use the C<\G> anchor in scalar context. You
still need the C<g> flag.

    $_ = "1122a44";
    while( m/\G(\d\d)/g )
        {
        print "Found $1\n";
        }

After the match fails at the letter C<a>, perl resets C<pos()>
and the next match on the same string starts at the beginning.

    $_ = "1122a44";
    while( m/\G(\d\d)/g )
        {
        print "Found $1\n";
        }

    print "Found $1 after while" if m/(\d\d)/g; # finds "11"

You can disable C<pos()> resets on fail with the C<c> flag, documented
in L<perlop> and L<perlreref>. Subsequent matches start where the last
successful match ended (the value of C<pos()>) even if a match on the
same string has failed in the meantime. In this case, the match after
the C<while()> loop starts at the C<a> (where the last match stopped),
and since it does not use any anchor it can skip over the C<a> to find
C<44>.

    $_ = "1122a44";
    while( m/\G(\d\d)/gc )
        {
        print "Found $1\n";
        }

    print "Found $1 after while" if m/(\d\d)/g; # finds "44"

Typically you use the C<\G> anchor with the C<c> flag
when you want to try a different match if one fails,
such as in a tokenizer. Jeffrey Friedl offers this example,
which works in 5.004 or later.

    while (<>) {
        chomp;
        PARSER: {
            m/ \G( \d+\b    )/gcx && do { print "number: $1\n"; redo; };
            m/ \G( \w+      )/gcx && do { print "word: $1\n";   redo; };
            m/ \G( \s+      )/gcx && do { print "space: $1\n";  redo; };
            m/ \G( [^\w\d]+ )/gcx && do { print "other: $1\n";  redo; };
        }
    }

For each line, the C<PARSER> loop first tries to match a series
of digits followed by a word boundary. This match has to
start at the place the last match left off (or the beginning
of the string on the first match). Since C<m/ \G( \d+\b )/gcx>
uses the C<c> flag, if the string does not match that
regular expression, perl does not reset C<pos()> and the next
match starts at the same position to try a different
pattern.
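The same idea can be sketched on a fixed string, collecting the tokens into a list rather than printing them. This is a variation written for illustration, not Friedl's code: the input line, the C<@tokens> array, and the token labels are all assumptions.

```perl
use strict;
use warnings;

my $line = "foo 42 bar!";    # sample input, not from the FAQ
my @tokens;

pos($line) = 0;              # start scanning at the beginning
while ( pos($line) < length $line ) {
    # Each alternative is anchored with \G; /c keeps pos() when one fails.
    if    ( $line =~ m/ \G ( \d+ \b ) /gcx ) { push @tokens, "number:$1" }
    elsif ( $line =~ m/ \G ( \w+ )    /gcx ) { push @tokens, "word:$1" }
    elsif ( $line =~ m/ \G ( \s+ )    /gcx ) { push @tokens, "space" }
    elsif ( $line =~ m/ \G ( \W+ )    /gcx ) { push @tokens, "other:$1" }
    else                                     { last }
}

print join( ", ", @tokens ), "\n";
# word:foo, space, number:42, space, word:bar, other:!
```

Without the C<c> flag, the first failed alternative would reset C<pos()> and the later alternatives would start over at the beginning of the string.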

=head2 Are Perl regexes DFAs or NFAs? Are they POSIX compliant?
X<DFA> X<NFA> X<POSIX>

While it's true that Perl's regular expressions resemble the DFAs
(deterministic finite automata) of the egrep(1) program, they are in
fact implemented as NFAs (non-deterministic finite automata) to allow
backtracking and backreferencing. And they aren't POSIX-style either,
because those guarantee worst-case behavior for all cases. (It seems
that some people prefer guarantees of consistency, even when what's
guaranteed is slowness.) See the book "Mastering Regular Expressions"
(from O'Reilly) by Jeffrey Friedl for all the details you could ever
hope to know on these matters (a full citation appears in
L<perlfaq2>).

=head2 What's wrong with using grep in a void context?
X<grep>

The problem is that grep builds a return list, regardless of the context.
This means you're making Perl go to the trouble of building a list that
you then just throw away. If the list is large, you waste both time and space.
If your intent is to iterate over the list, then use a for loop for this
purpose.

In perls older than 5.8.1, map suffers from this problem as well.
But since 5.8.1, this has been fixed, and map is context aware: in void
context, no lists are constructed.

=head2 How can I match strings with multibyte characters?
X<regex, and multibyte characters> X<regexp, and multibyte characters>
X<regular expression, and multibyte characters> X<martian> X<encoding, Martian>

Since Perl 5.6, Perl has had some level of multibyte character
support. Perl 5.8 or later is recommended. Supported multibyte
character repertoires include Unicode, and legacy encodings
through the Encode module. See L<perluniintro>, L<perlunicode>,
and L<Encode>.

If you are stuck with older Perls, you can do Unicode with the
C<Unicode::String> module, and character conversions using the
C<Unicode::Map8> and C<Unicode::Map> modules. If you are using
Japanese encodings, you might try using jperl 5.005_03.

Finally, the following set of approaches was offered by Jeffrey
Friedl, whose article in issue #5 of The Perl Journal talks about
this very matter.

Let's suppose you have some weird Martian encoding where pairs of
ASCII uppercase letters encode single Martian letters (e.g. the two
bytes "CV" make a single Martian letter, as do the two bytes "SG",
"VS", "XX", etc.). Other bytes represent single characters, just like
ASCII.

So, the Martian string "I am CVSGXX!" uses 12 bytes to encode the
nine characters 'I', ' ', 'a', 'm', ' ', 'CV', 'SG', 'XX', '!'.

Now, say you want to search for the single character C</GX/>. Perl
doesn't know about Martian, so it'll find the two bytes "GX" in the "I
am CVSGXX!" string, even though that character isn't there: it just
looks like it is because "SG" is next to "XX", but there's no real
"GX". This is a big problem.

Here are a few ways, all painful, to deal with it:

    # Make sure adjacent "martian" bytes are no longer adjacent.
    $martian =~ s/([A-Z][A-Z])/ $1 /g;

    print "found GX!\n" if $martian =~ /GX/;

Or like this:

    my @chars = $martian =~ m/([A-Z][A-Z]|[^A-Z])/g;
    # above is conceptually similar to:     @chars = $text =~ m/(.)/g;
    #
    foreach my $char (@chars) {
        if ( $char eq 'GX' ) {
            print "found GX!\n";
            last;
        }
    }

Or like this:

    while ($martian =~ m/\G([A-Z][A-Z]|.)/gs) {  # \G probably unneeded
        if ( $1 eq 'GX' ) {
            print "found GX!\n";
            last;
        }
    }

Here's another, slightly less painful, way to do it from Benjamin
Goldberg, who uses a zero-width negative look-behind assertion.

    print "found GX!\n" if $martian =~ m/
        (?<![A-Z])
        (?:[A-Z][A-Z])*?
        GX
        /x;

This succeeds if the "martian" character GX is in the string, and fails
otherwise. If you don't like using (?<!), a zero-width negative
look-behind assertion, you can replace (?<![A-Z]) with (?:^|[^A-Z]).

It does have the drawback of putting the wrong thing in $-[0] and $+[0],
but this usually can be worked around.
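To see the look-behind version at work, this sketch wraps it in a helper and tries it on the string from the text plus a second string that really does contain the Martian letter GX. The helper name C<has_martian_gx> and the second test string are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the Martian letter GX occurs in $text.
sub has_martian_gx {
    my ($text) = @_;
    return $text =~ m/
        (?<![A-Z])          # begin at the left edge of a run of uppercase bytes
        (?:[A-Z][A-Z])*?    # step over whole Martian letters, two bytes each
        GX
    /x ? 1 : 0;
}

my @results = map { has_martian_gx($_) } ( "I am CVSGXX!", "I am CVGXXX!" );
print "@results\n";         # 0 1
```

In the first string the bytes "GX" straddle the letters SG and XX, so the pattern fails; in the second, CV is followed by a genuine GX.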

=head2 How do I match a regular expression that's in a variable?
X<regex, in variable> X<eval> X<regex> X<quotemeta> X<\Q, regex>
X<\E, regex> X<qr//>

(contributed by brian d foy)

We don't have to hard-code patterns into the match operator (or
anything else that works with regular expressions). We can put the
pattern in a variable for later use.

The match operator is a double quote context, so you can interpolate
your variable just like a double quoted string. In this case, you
read the regular expression as user input and store it in C<$regex>.
Once you have the pattern in C<$regex>, you use that variable in the
match operator.

    chomp( my $regex = <STDIN> );

    if( $string =~ m/$regex/ ) { ... }

Any regular expression special characters in C<$regex> are still
special, and the pattern still has to be valid or Perl will complain.
For instance, in this pattern there is an unpaired parenthesis.

    my $regex = "Unmatched ( paren";

    "Two parens to bind them all" =~ m/$regex/;

When Perl compiles the regular expression, it treats the parenthesis
as the start of a capture group. When it doesn't find the closing
parenthesis, it complains:

    Unmatched ( in regex; marked by <-- HERE in m/Unmatched ( <-- HERE  paren/ at script line 3.

You can get around this in several ways depending on your situation.
First, if you don't want any of the characters in the string to be
special, you can escape them with C<quotemeta> before you use the string.

    chomp( my $regex = <STDIN> );
    $regex = quotemeta( $regex );

    if( $string =~ m/$regex/ ) { ... }

You can also do this directly in the match operator using the C<\Q>
and C<\E> sequences. The C<\Q> tells Perl where to start escaping
special characters, and the C<\E> tells it where to stop (see L<perlop>
for more details).

    chomp( my $regex = <STDIN> );

    if( $string =~ m/\Q$regex\E/ ) { ... }
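The effect of the escaping is easy to see with a fixed string standing in for user input. In this sketch the input value and the variable names are assumptions made for the illustration.

```perl
use strict;
use warnings;

my $input  = "3.14 (approx)";    # stands in for a line read from STDIN
my $quoted = quotemeta($input);  # every non-word character gets a backslash

# With \Q...\E the dot and parentheses match only themselves.
my $literal_hit  = "pi is 3.14 (approx)" =~ /\Q$input\E/ ? 1 : 0;
my $wildcard_hit = "3x14 (approx)"       =~ /\Q$input\E/ ? 1 : 0;

print "$quoted\n";                       # 3\.14\ \(approx\)
print "$literal_hit $wildcard_hit\n";    # 1 0
```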

Alternatively, you can use C<qr//>, the regular expression quote operator (see
L<perlop> for more details). It quotes and perhaps compiles the pattern,
and you can apply regular expression flags to the pattern.

    chomp( my $input = <STDIN> );

    my $regex = qr/$input/is;

    $string =~ m/$regex/;  # same as m/$input/is

You might also want to trap any errors by wrapping an C<eval> block
around the whole thing.

    chomp( my $input = <STDIN> );

    eval {
        if( $string =~ m/\Q$input\E/ ) { ... }
    };
    warn $@ if $@;

Or...

    my $regex = eval { qr/$input/is };
    if( defined $regex ) {
        $string =~ m/$regex/;
    }
    else {
        warn $@;
    }
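Putting the last form into a reusable shape might look like the following sketch. The helper name C<compile_pattern> and the sample patterns are hypothetical; only C<eval> and C<qr//> come from the text above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a compiled regex, or undef if $input is
# not a valid pattern (the compile error is left in $@).
sub compile_pattern {
    my ($input) = @_;
    my $regex = eval { qr/$input/is };
    return $regex;
}

my $good = compile_pattern("two parens");
my $bad  = compile_pattern("Unmatched ( paren");

print "good pattern matches\n"
    if defined $good && "Two parens to bind them all" =~ $good;
print "bad pattern rejected\n" unless defined $bad;
```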

=head1 AUTHOR AND COPYRIGHT

Copyright (c) 1997-2010 Tom Christiansen, Nathan Torkington, and
other authors as noted. All rights reserved.

This documentation is free; you can redistribute it and/or modify it
under the same terms as Perl itself.

Irrespective of its distribution, all code examples in this file
are hereby placed into the public domain. You are permitted and
encouraged to use this code in your own programs for fun
or for profit as you see fit. A simple comment in the code giving
credit would be courteous but is not required.