pod/perlfaq6.pod

   1 =head1 NAME
   2
   3 perlfaq6 - Regexps ($Revision: 1.17 $, $Date: 1997/04/24 22:44:10 $)
   4
   5 =head1 DESCRIPTION
   6
This section is surprisingly small because the rest of the FAQ is
littered with answers involving regular expressions.  For example,
decoding a URL and checking whether something is a number are handled
with regular expressions, but those answers are found elsewhere in
this document (in the sections on Data and on Networking, to be
precise).
  13
  14 =head2 How can I hope to use regular expressions without creating illegible and unmaintainable code?
  15
  16 Three techniques can make regular expressions maintainable and
  17 understandable.
  18
  19 =over 4
  20
  21 =item Comments Outside the Regexp
  22
  23 Describe what you're doing and how you're doing it, using normal Perl
  24 comments.
  25
  26     # turn the line into the first word, a colon, and the
  27     # number of characters on the rest of the line
  28     s/^(\w+)(.*)/ lc($1) . ":" . length($2) /ge;
  29
  30 =item Comments Inside the Regexp
  31
The C</x> modifier causes whitespace in a regexp pattern to be ignored
(except in a character class) and lets you put normal comments inside
the pattern.  As you can imagine, whitespace and comments help a lot.
  36
  37 C</x> lets you turn this:
  38
  39     s{<(?:[^>'"]*|".*?"|'.*?')+>}{}gs;
  40
  41 into this:
  42
  43     s{ <                    # opening angle bracket
  44         (?:                 # Non-backreffing grouping paren
  45              [^>'"] *       # 0 or more things that are neither > nor ' nor "
  46                 |           #    or else
  47              ".*?"          # a section between double quotes (stingy match)
  48                 |           #    or else
  49              '.*?'          # a section between single quotes (stingy match)
  50         ) +                 #   all occurring one or more times
  51        >                    # closing angle bracket
  52     }{}gsx;                 # replace with nothing, i.e. delete
  53
  54 It's still not quite so clear as prose, but it is very useful for
  55 describing the meaning of each part of the pattern.
  56
  57 =item Different Delimiters
  58
  59 While we normally think of patterns as being delimited with C</>
  60 characters, they can be delimited by almost any character.  L<perlre>
  61 describes this.  For example, the C<s///> above uses braces as
  62 delimiters.  Selecting another delimiter can avoid quoting the
  63 delimiter within the pattern:
  64
  65     s/\/usr\/local/\/usr\/share/g;      # bad delimiter choice
  66     s#/usr/local#/usr/share#g;          # better
  67
  68 =back
  69
  70 =head2 I'm having trouble matching over more than one line.  What's wrong?
  71
  72 Either you don't have newlines in your string, or you aren't using the
  73 correct modifier(s) on your pattern.
  74
  75 There are many ways to get multiline data into a string.  If you want
  76 it to happen automatically while reading input, you'll want to set $/
  77 (probably to '' for paragraphs or C<undef> for the whole file) to
  78 allow you to read more than one line at a time.
  79
  80 Read L<perlre> to help you decide which of C</s> and C</m> (or both)
  81 you might want to use: C</s> allows dot to include newline, and C</m>
  82 allows caret and dollar to match next to a newline, not just at the
  83 end of the string.  You do need to make sure that you've actually
  84 got a multiline string in there.
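
To see the two modifiers side by side, here is a minimal sketch run against a made-up two-line string.  The variable names are invented for the illustration and nothing here is specific to any one program.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# Without /s, dot stops at a newline, so this cannot span both lines.
my $dot_plain   = $text =~ /first.*second/  ? 1 : 0;

# With /s, dot also matches newline.
my $dot_s       = $text =~ /first.*second/s ? 1 : 0;

# Without /m, ^ matches only at the very start of the string.
my $caret_plain = $text =~ /^second/        ? 1 : 0;

# With /m, ^ also matches just after any embedded newline.
my $caret_m     = $text =~ /^second/m       ? 1 : 0;

print "dot:   plain=$dot_plain /s=$dot_s\n";
print "caret: plain=$caret_plain /m=$caret_m\n";
```

Both "plain" results are 0 and both modified results are 1: the data was multiline all along, and only the modifiers changed what dot and caret were allowed to do with it.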
  85
  86 For example, this program detects duplicate words, even when they span
  87 line breaks (but not paragraph ones).  For this example, we don't need
  88 C</s> because we aren't using dot in a regular expression that we want
  89 to cross line boundaries.  Neither do we need C</m> because we aren't
  90 wanting caret or dollar to match at any point inside the record next
  91 to newlines.  But it's imperative that $/ be set to something other
  92 than the default, or else we won't actually ever have a multiline
  93 record read in.
  94
    $/ = '';            # read in a whole paragraph, not just one line
    while ( <> ) {
        while ( /\b(\w\S+)(\s+\1)+\b/gi ) {
            print "Duplicate $1 at paragraph $.\n";
        }
    }
 101
Here's code that finds lines that begin with "From " (which would
be mangled by many mailers):

    $/ = '';            # read in a whole paragraph, not just one line
    while ( <> ) {
        while ( /^From /gm ) { # /m makes ^ match next to \n
            print "leading from in paragraph $.\n";
        }
    }
 111
Here's code that finds everything between START and END in a file:

    undef $/;           # read in whole file, not just one line or paragraph
    while ( <> ) {
        # /g moves on to the next match each time; without it this loops forever
        while ( /START(.*?)END/sg ) { # /s makes . cross line boundaries
            print "$1\n";
        }
    }
 120
 121 =head2 How can I pull out lines between two patterns that are themselves on different lines?
 122
 123 You can use Perl's somewhat exotic C<..> operator (documented in
 124 L<perlop>):
 125
 126     perl -ne 'print if /START/ .. /END/' file1 file2 ...
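
Inside a script the same operator works on whatever is in C<$_>.  The following is only a sketch: the array of lines is made up, and in real code the lines would more likely come from C<E<lt>E<gt>>.

```perl
use strict;
use warnings;

# Hypothetical input lines standing in for a file.
my @lines = ("a\n", "START\n", "b\n", "c\n", "END\n", "d\n");

my @between;
for (@lines) {
    # In scalar context the range operator turns true on the line where
    # /START/ matches and stays true through the line where /END/ matches.
    push @between, $_ if /START/ .. /END/;
}
print @between;
```

Note that the marker lines themselves are part of the result, so this prints START, b, c, and END, each on its own line.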
 127
If you wanted text and not lines, you would use

    perl -0777 -ne 'print "$1\n" while /START(.*?)END/gs' file1 file2 ...
 131
 132 But if you want nested occurrences of C<START> through C<END>, you'll
 133 run up against the problem described in the question in this section
 134 on matching balanced text.
 135
 136 =head2 I put a regular expression into $/ but it didn't work. What's wrong?
 137
 138 $/ must be a string, not a regular expression.  Awk has to be better
 139 for something. :-)
 140
 141 Actually, you could do this if you don't mind reading the whole file
 142 into memory:
 143
 144     undef $/;
 145     @records = split /your_pattern/, <FH>;
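
Here is a slightly fuller sketch of the same idea.  The data, the separator pattern, and the use of an in-memory filehandle (which needs a reasonably modern Perl) are all assumptions made so that the example can run by itself.

```perl
use strict;
use warnings;

# Hypothetical data: records separated by runs of digits.
my $data = "alpha12beta345gamma\n";

# An in-memory filehandle stands in for a real file here.
open my $fh, '<', \$data or die "can't open in-memory file: $!";

my @records;
{
    local $/;                        # slurp mode, restored at block exit
    @records = split /\d+/, <$fh>;   # the pattern acts as the separator
}
close $fh;

print join('|', @records);           # alpha|beta|gamma
```

Using C<local> keeps the change to C<$/> from leaking into the rest of the program, which a bare C<undef $/> would not do.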
 146
The Net::Telnet module (available from CPAN) has the capability to
wait for a pattern in the input stream, or time out if it doesn't
appear within a certain time.

    ## Create a file with three lines.
    open FH, ">file" or die "can't create file: $!";
    print FH "The first line\nThe second line\nThe third line\n";
    close FH;

    ## Get a read/write filehandle to it.
    use FileHandle;
    $fh = new FileHandle "+<file";

    ## Attach it to a "stream" object.
    use Net::Telnet;
    $file = new Net::Telnet (-fhopen => $fh);

    ## Search for the second line and print out the third.
    $file->waitfor('/second line\n/');
    print $file->getline;
 166
=head2 How do I substitute case insensitively on the LHS while preserving case on the RHS?
 168
 169 It depends on what you mean by "preserving case".  The following
 170 script makes the substitution have the same case, letter by letter, as
 171 the original.  If the substitution has more characters than the string
 172 being substituted, the case of the last character is used for the rest
 173 of the substitution.
 174
 175     # Original by Nathan Torkington, massaged by Jeffrey Friedl
 176     #
 177     sub preserve_case($$)
 178     {
 179         my ($old, $new) = @_;
 180         my ($state) = 0; # 0 = no change; 1 = lc; 2 = uc
 181         my ($i, $oldlen, $newlen, $c) = (0, length($old), length($new));
 182         my ($len) = $oldlen < $newlen ? $oldlen : $newlen;
 183
 184         for ($i = 0; $i < $len; $i++) {
 185             if ($c = substr($old, $i, 1), $c =~ /[\W\d_]/) {
 186                 $state = 0;
 187             } elsif (lc $c eq $c) {
 188                 substr($new, $i, 1) = lc(substr($new, $i, 1));
 189                 $state = 1;
 190             } else {
 191                 substr($new, $i, 1) = uc(substr($new, $i, 1));
 192                 $state = 2;
 193             }
 194         }
 195         # finish up with any remaining new (for when new is longer than old)
 196         if ($newlen > $oldlen) {
 197             if ($state == 1) {
 198                 substr($new, $oldlen) = lc(substr($new, $oldlen));
 199             } elsif ($state == 2) {
 200                 substr($new, $oldlen) = uc(substr($new, $oldlen));
 201             }
 202         }
 203         return $new;
 204     }
 205
 206     $a = "this is a TEsT case";
 207     $a =~ s/(test)/preserve_case($1, "success")/gie;
 208     print "$a\n";
 209
 210 This prints:
 211
 212     this is a SUcCESS case
 213
 214 =head2 How can I make C<\w> match accented characters?
 215
 216 See L<perllocale>.
 217
 218 =head2 How can I match a locale-smart version of C</[a-zA-Z]/>?
 219
 220 One alphabetic character would be C</[^\W\d_]/>, no matter what locale
 221 you're in.  Non-alphabetics would be C</[\W\d_]/> (assuming you don't
 222 consider an underscore a letter).
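
A quick sketch of both classes at work on a made-up string (plain ASCII here, so the counts are the same in any locale):

```perl
use strict;
use warnings;

my $sample = "abc_123 XYZ-9";

# [^\W\d_] reads as: a \w character that is neither a digit nor "_".
my @letters     = $sample =~ /[^\W\d_]/g;

# [\W\d_] is the complement: everything that is not a letter.
my @non_letters = $sample =~ /[\W\d_]/g;

print scalar(@letters), " letters, ", scalar(@non_letters), " others\n";
```

This prints "6 letters, 7 others", and the two counts add up to the length of the string because every character falls into exactly one of the two classes.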
 223
 224 =head2 How can I quote a variable to use in a regexp?
 225
The Perl parser will expand $variable and @variable references in
regular expressions unless the delimiter is a single quote.  Remember,
too, that the right-hand side of a C<s///> substitution is considered
a double-quoted string (see L<perlop> for more details).  Remember
also that any regexp special characters in the variable will be acted
on unless you precede the variable with \Q.  Here's an example:
 232
 233     $string = "to die?";
 234     $lhs = "die?";
 235     $rhs = "sleep no more";
 236
 237     $string =~ s/\Q$lhs/$rhs/;
 238     # $string is now "to sleep no more"
 239
 240 Without the \Q, the regexp would also spuriously match "di".
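
The difference is easy to see by running the substitution both ways.  This sketch reuses the strings from the example above; the names C<$with_q>, C<$without_q>, and C<$quoted> are made up for the comparison.

```perl
use strict;
use warnings;

my $lhs = "die?";

my $with_q    = "to die?";
my $without_q = "to die?";

$with_q    =~ s/\Q$lhs/sleep no more/;   # matches the literal text "die?"
$without_q =~ s/$lhs/sleep no more/;     # "?" just makes the "e" optional

# quotemeta() is the function form of \Q
my $quoted = quotemeta $lhs;

print "$with_q\n";       # to sleep no more
print "$without_q\n";    # to sleep no more?
print "$quoted\n";       # die\?
```

Without the \Q the question mark is treated as a quantifier, so it is never matched as a character and survives in the result.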
 241
 242 =head2 What is C</o> really for?
 243
 244 Using a variable in a regular expression match forces a re-evaluation
 245 (and perhaps recompilation) each time through.  The C</o> modifier
 246 locks in the regexp the first time it's used.  This always happens in a
 247 constant regular expression, and in fact, the pattern was compiled
 248 into the internal format at the same time your entire program was.
 249
 250 Use of C</o> is irrelevant unless variable interpolation is used in
 251 the pattern, and if so, the regexp engine will neither know nor care
 252 whether the variables change after the pattern is evaluated the I<very
 253 first> time.
 254
 255 C</o> is often used to gain an extra measure of efficiency by not
 256 performing subsequent evaluations when you know it won't matter
 257 (because you know the variables won't change), or more rarely, when
 258 you don't want the regexp to notice if they do.
 259
 260 For example, here's a "paragrep" program:
 261
 262     $/ = '';  # paragraph mode
 263     $pat = shift;
 264     while (<>) {
 265         print if /$pat/o;
 266     }
 267
 268 =head2 How do I use a regular expression to strip C style comments from a file?
 269
 270 While this actually can be done, it's much harder than you'd think.
 271 For example, this one-liner
 272
 273     perl -0777 -pe 's{/\*.*?\*/}{}gs' foo.c
 274
 275 will work in many but not all cases.  You see, it's too simple-minded for
 276 certain kinds of C programs, in particular, those with what appear to be
 277 comments in quoted strings.  For that, you'd need something like this,
 278 created by Jeffrey Friedl:
 279
 280     $/ = undef;
 281     $_ = <>;
 282     s#/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|\n+|.[^/"'\\]*)#$2#g;
 283     print;
 284
 285 This could, of course, be more legibly written with the C</x> modifier, adding
 286 whitespace and comments.
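
One possible C</x> layout is sketched below.  It is a rearrangement for readability rather than the original author's own version: the wrapper name C<strip_c_comments> is made up, and the inner groups have been made non-capturing, so the text to keep lands in C<$1> instead of C<$2>.

```perl
use strict;
use warnings;

sub strip_c_comments {
    my ($code) = @_;
    $code =~ s{
        /\*                          # start of a comment
        [^*]* \*+                    # non-stars, then a run of stars
        (?: [^/*] [^*]* \*+ )*       # more of the same, unless a slash came next
        /                            # the slash that closes the comment
      |                              # OR something that is not a comment:
        (
            " (?: \\. | [^"\\] )* "  # a double-quoted string
          | ' (?: \\. | [^'\\] )* '  # a single-quoted character constant
          | \n+                      # a run of newlines
          | . [^/"'\\]*              # any other character, then harmless text
        )
    }{ defined $1 ? $1 : '' }gex;
    return $code;
}

print strip_c_comments(qq{int a; /* gone */ char *s = "/* kept */";\n});
```

The comment is deleted while the comment-like text inside the string is left alone, because the string alternative swallows the whole quoted section before the comment alternative ever gets a look at it.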
 287
 288 =head2 Can I use Perl regular expressions to match balanced text?
 289
 290 Although Perl regular expressions are more powerful than "mathematical"
 291 regular expressions, because they feature conveniences like backreferences
 292 (C<\1> and its ilk), they still aren't powerful enough. You still need
 293 to use non-regexp techniques to parse balanced text, such as the text
 294 enclosed between matching parentheses or braces, for example.
 295
 296 An elaborate subroutine (for 7-bit ASCII only) to pull out balanced
 297 and possibly nested single chars, like C<`> and C<'>, C<{> and C<}>,
 298 or C<(> and C<)> can be found in
 299 http://www.perl.com/CPAN/authors/id/TOMC/scripts/pull_quotes.gz .
 300
 301 The C::Scan module from CPAN contains such subs for internal usage,
 302 but they are undocumented.
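
For simple cases, a non-regexp technique can be as plain as counting depth while walking the string.  The subroutine below is a minimal sketch under that assumption: C<first_balanced> is a made-up name, it handles only parentheses, and it knows nothing about quoting or escapes.

```perl
use strict;
use warnings;

# Return the first balanced parenthesized chunk in $text,
# or nothing if there isn't one.
sub first_balanced {
    my ($text) = @_;
    my ($depth, $start) = (0, undef);
    for my $i (0 .. length($text) - 1) {
        my $c = substr($text, $i, 1);
        if ($c eq '(') {
            $start = $i if $depth == 0;   # remember the outermost opener
            $depth++;
        }
        elsif ($c eq ')' && $depth > 0) {
            $depth--;
            return substr($text, $start, $i - $start + 1) if $depth == 0;
        }
    }
    return;    # never closed, or no parentheses at all
}

print first_balanced("f(a, g(b, c), d) + h(e)"), "\n";   # (a, g(b, c), d)
```

The counter is what a classical regular expression lacks: it remembers how many openers are still waiting for their closers.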
 303
 304 =head2 What does it mean that regexps are greedy?  How can I get around it?
 305
 306 Most people mean that greedy regexps match as much as they can.
 307 Technically speaking, it's actually the quantifiers (C<?>, C<*>, C<+>,
 308 C<{}>) that are greedy rather than the whole pattern; Perl prefers local
 309 greed and immediate gratification to overall greed.  To get non-greedy
 310 versions of the same quantifiers, use (C<??>, C<*?>, C<+?>, C<{}?>).
 311
 312 An example:
 313
 314         $s1 = $s2 = "I am very very cold";
 315         $s1 =~ s/ve.*y //;      # I am cold
 316         $s2 =~ s/ve.*?y //;     # I am very cold
 317
 318 Notice how the second substitution stopped matching as soon as it
 319 encountered "y ".  The C<*?> quantifier effectively tells the regular
 320 expression engine to find a match as quickly as possible and pass
 321 control on to whatever is next in line, like you would if you were
 322 playing hot potato.
 323
=head2 How do I process each word on each line?
 325
 326 Use the split function:
 327
 328     while (<>) {
 329         foreach $word ( split ) {
 330             # do something with $word here
 331         }
 332     }
 333
 334 Note that this isn't really a word in the English sense; it's just
 335 chunks of consecutive non-whitespace characters.
 336
 337 To work with only alphanumeric sequences, you might consider
 338
 339     while (<>) {
 340         foreach $word (m/(\w+)/g) {
 341             # do something with $word here
 342         }
 343     }
 344
 345 =head2 How can I print out a word-frequency or line-frequency summary?
 346
 347 To do this, you have to parse out each word in the input stream.  We'll
 348 pretend that by word you mean chunk of alphabetics, hyphens, or
 349 apostrophes, rather than the non-whitespace chunk idea of a word given
 350 in the previous question:
 351
 352     while (<>) {
 353         while ( /(\b[^\W_\d][\w'-]+\b)/g ) {   # misses "`sheep'"
 354             $seen{$1}++;
 355         }
 356     }
 357     while ( ($word, $count) = each %seen ) {
 358         print "$count $word\n";
 359     }
 360
 361 If you wanted to do the same thing for lines, you wouldn't need a
 362 regular expression:
 363
 364     while (<>) {
 365         $seen{$_}++;
 366     }
 367     while ( ($line, $count) = each %seen ) {
 368         print "$count $line";
 369     }
 370
 371 If you want these output in a sorted order, see the section on Hashes.
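
As a quick taste of what that involves, here is a sketch that prints the counts from highest to lowest.  The contents of C<%seen> are made up to stand in for what the loops above would have collected.

```perl
use strict;
use warnings;

# Hypothetical counts, as the loops above would have left them in %seen.
my %seen = (the => 3, sheep => 1, cat => 2);

# Highest count first; ties are broken alphabetically.
my @by_count = sort { $seen{$b} <=> $seen{$a} or $a cmp $b } keys %seen;

print "$seen{$_} $_\n" for @by_count;
```

This prints "3 the", "2 cat", and "1 sheep", one per line.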
 372
 373 =head2 How can I do approximate matching?
 374
 375 See the module String::Approx available from CPAN.
 376
 377 =head2 How do I efficiently match many regular expressions at once?
 378
 379 The following is super-inefficient:
 380
 381     while (<FH>) {
 382         foreach $pat (@patterns) {
 383             if ( /$pat/ ) {
 384                 # do something
 385             }
 386         }
 387     }
 388
Instead, you either need to use one of the experimental Regexp extension
modules from CPAN (which might well be overkill for your purposes),
or else put together something like this, inspired by a routine
in Jeffrey Friedl's book:
 393
 394     sub _bm_build {
 395         my $condition = shift;
 396         my @regexp = @_;  # this MUST not be local(); need my()
 397         my $expr = join $condition => map { "m/\$regexp[$_]/o" } (0..$#regexp);
 398         my $match_func = eval "sub { $expr }";
 399         die if $@;  # propagate $@; this shouldn't happen!
 400         return $match_func;
 401     }
 402
 403     sub bm_and { _bm_build('&&', @_) }
 404     sub bm_or  { _bm_build('||', @_) }
 405
 406     $f1 = bm_and qw{
 407             xterm
 408             (?i)window
 409     };
 410
 411     $f2 = bm_or qw{
 412             \b[Ff]ree\b
 413             \bBSD\B
 414             (?i)sys(tem)?\s*[V5]\b
 415     };
 416
 417     # feed me /etc/termcap, prolly
 418     while ( <> ) {
 419         print "1: $_" if &$f1;
 420         print "2: $_" if &$f2;
 421     }
 422
 423 =head2 Why don't word-boundary searches with C<\b> work for me?
 424
 425 Two common misconceptions are that C<\b> is a synonym for C<\s+>, and
 426 that it's the edge between whitespace characters and non-whitespace
 427 characters.  Neither is correct.  C<\b> is the place between a C<\w>
 428 character and a C<\W> character (that is, C<\b> is the edge of a
 429 "word").  It's a zero-width assertion, just like C<^>, C<$>, and all
 430 the other anchors, so it doesn't consume any characters.  L<perlre>
 431 describes the behaviour of all the regexp metacharacters.
 432
 433 Here are examples of the incorrect application of C<\b>, with fixes:
 434
 435     "two words" =~ /(\w+)\b(\w+)/;          # WRONG
 436     "two words" =~ /(\w+)\s+(\w+)/;         # right
 437
 438     " =matchless= text" =~ /\b=(\w+)=\b/;   # WRONG
 439     " =matchless= text" =~ /=(\w+)=/;       # right
 440
 441 Although they may not do what you thought they did, C<\b> and C<\B>
 442 can still be quite useful.  For an example of the correct use of
 443 C<\b>, see the example of matching duplicate words over multiple
 444 lines.
 445
 446 An example of using C<\B> is the pattern C<\Bis\B>.  This will find
 447 occurrences of "is" on the insides of words only, as in "thistle", but
 448 not "this" or "island".
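
A short sketch that checks those three words (the hash name C<%inside> is made up for the illustration):

```perl
use strict;
use warnings;

my %inside;
for my $word (qw(thistle this island)) {
    # \B on both sides demands a \w character before and after "is"
    $inside{$word} = $word =~ /\Bis\B/ ? 1 : 0;
}
print "$_: $inside{$_}\n" for qw(thistle this island);
```

Only "thistle" comes out as 1.  In "this" the "is" sits at the end of the word and in "island" it sits at the start, so in each case one of the two C<\B> assertions fails.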
 449
 450 =head2 Why does using $&, $`, or $' slow my program down?
 451
 452 Because once Perl sees that you need one of these variables anywhere
 453 in the program, it has to provide them on each and every pattern
 454 match.  The same mechanism that handles these provides for the use of
 455 $1, $2, etc., so you pay the same price for each regexp that contains
 456 capturing parentheses. But if you never use $&, etc., in your script,
 457 then regexps I<without> capturing parentheses won't be penalized. So
 458 avoid $&, $', and $` if you can, but if you can't (and some algorithms
 459 really appreciate them), once you've used them once, use them at will,
 460 because you've already paid the price.
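
If all you need is the text around one particular match, capturing parentheses can stand in for the three variables.  This is a sketch of that workaround, not something the paragraph above prescribes; the string and variable names are made up, and this one pattern still pays the usual price for capturing.

```perl
use strict;
use warnings;

my $string = "the quick brown fox";

# Three groups stand in for $`, $& and $' for this one match.
my ($before, $matched, $after) = $string =~ /^(.*?)(quick)(.*)$/s;

print "before=[$before] matched=[$matched] after=[$after]\n";
```

This prints "before=[the ] matched=[quick] after=[ brown fox]", and no other pattern in the program is affected.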
 461
 462 =head2 What good is C<\G> in a regular expression?
 463
The notation C<\G> is used in a match or substitution in conjunction with
the C</g> modifier (and ignored if there's no C</g>) to anchor the regular
expression to the point just past where the last match occurred, i.e. the
pos() point.
 468
For example, suppose you had a line of text quoted in standard mail
and Usenet notation (that is, with leading C<E<gt>> characters), and
you want to change each leading C<E<gt>> into a corresponding C<:>.  You
could do so in this way:
 473
 474      s/^(>+)/':' x length($1)/gem;
 475
 476 Or, using C<\G>, the much simpler (and faster):
 477
 478     s/\G>/:/g;
 479
 480 A more sophisticated use might involve a tokenizer.  The following
 481 lex-like example is courtesy of Jeffrey Friedl.  It did not work in
 482 5.003 due to bugs in that release, but does work in 5.004 or better.
 483 (Note the use of C</c>, which prevents a failed match with C</g> from
 484 resetting the search position back to the beginning of the string.)
 485
 486     while (<>) {
 487       chomp;
 488       PARSER: {
 489            m/ \G( \d+\b    )/gcx    && do { print "number: $1\n";  redo; };
 490            m/ \G( \w+      )/gcx    && do { print "word:   $1\n";  redo; };
 491            m/ \G( \s+      )/gcx    && do { print "space:  $1\n";  redo; };
 492            m/ \G( [^\w\d]+ )/gcx    && do { print "other:  $1\n";  redo; };
 493       }
 494     }
 495
 496 Of course, that could have been written as
 497
    while (<>) {
      chomp;
      PARSER: {
           if ( /\G( \d+\b    )/gcx ) {
                print "number: $1\n";
                redo PARSER;
           }
           if ( /\G( \w+      )/gcx ) {
                print "word: $1\n";
                redo PARSER;
           }
           if ( /\G( \s+      )/gcx ) {
                print "space: $1\n";
                redo PARSER;
           }
           if ( /\G( [^\w\d]+ )/gcx ) {
                print "other: $1\n";
                redo PARSER;
           }
      }
    }
 517       }
 518     }
 519
 520 But then you lose the vertical alignment of the regular expressions.
 521
 522 =head2 Are Perl regexps DFAs or NFAs?  Are they POSIX compliant?
 523
 524 While it's true that Perl's regular expressions resemble the DFAs
 525 (deterministic finite automata) of the egrep(1) program, they are in
 526 fact implemented as NFAs (non-deterministic finite automata) to allow
 527 backtracking and backreferencing.  And they aren't POSIX-style either,
 528 because those guarantee worst-case behavior for all cases.  (It seems
 529 that some people prefer guarantees of consistency, even when what's
 530 guaranteed is slowness.)  See the book "Mastering Regular Expressions"
 531 (from O'Reilly) by Jeffrey Friedl for all the details you could ever
 532 hope to know on these matters (a full citation appears in
 533 L<perlfaq2>).
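
One place where the difference shows up is alternation.  Perl tries the alternatives from left to right and takes the first one that lets the match succeed, whereas POSIX rules call for the longest overall match.  A minimal sketch, with made-up variable names:

```perl
use strict;
use warnings;

# Perl tries alternatives left to right and keeps the first that works.
my ($first)  = "foobar" =~ /(foo|foobar)/;

# Put the longer alternative first if that's the one you want.
my ($second) = "foobar" =~ /(foobar|foo)/;

print "$first $second\n";    # foo foobar
```

So in Perl the order in which you write the alternatives matters.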
 534
 535 =head2 What's wrong with using grep or map in a void context?
 536
 537 Strictly speaking, nothing.  Stylistically speaking, it's not a good
 538 way to write maintainable code.  That's because you're using these
 539 constructs not for their return values but rather for their
 540 side-effects, and side-effects can be mystifying.  There's no void
 541 grep() that's not better written as a C<for> (well, C<foreach>,
 542 technically) loop.
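
To make the contrast concrete, here is a small sketch; the list of names is made up.

```perl
use strict;
use warnings;

my @names = qw(alice bob carol);
my @shouted;

# Void-context style, run only for its side effect:
#     grep { push @shouted, uc($_) } @names;

# The same work as a loop, which says what it means.
push @shouted, uc($_) for @names;

# grep is at its best when the list it returns is what you're after.
my @with_o = grep { /o/ } @names;

print "@shouted\n";    # ALICE BOB CAROL
print "@with_o\n";     # bob carol
```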
 543
 544 =head2 How can I match strings with multibyte characters?
 545
 546 This is hard, and there's no good way.  Perl does not directly support
 547 wide characters.  It pretends that a byte and a character are
 548 synonymous.  The following set of approaches was offered by Jeffrey
 549 Friedl, whose article in issue #5 of The Perl Journal talks about this
 550 very matter.
 551
 552 Let's suppose you have some weird Martian encoding where pairs of
 553 ASCII uppercase letters encode single Martian letters (i.e. the two
 554 bytes "CV" make a single Martian letter, as do the two bytes "SG",
 555 "VS", "XX", etc.). Other bytes represent single characters, just like
 556 ASCII.
 557
 558 So, the string of Martian "I am CVSGXX!" uses 12 bytes to encode the
 559 nine characters 'I', ' ', 'a', 'm', ' ', 'CV', 'SG', 'XX', '!'.
 560
 561 Now, say you want to search for the single character C</GX/>. Perl
 562 doesn't know about Martian, so it'll find the two bytes "GX" in the "I
 563 am CVSGXX!"  string, even though that character isn't there: it just
 564 looks like it is because "SG" is next to "XX", but there's no real
 565 "GX".  This is a big problem.
 566
 567 Here are a few ways, all painful, to deal with it:
 568
 569    $martian =~ s/([A-Z][A-Z])/ $1 /g; # Make sure adjacent ``martian'' bytes
 570                                       # are no longer adjacent.
 571    print "found GX!\n" if $martian =~ /GX/;
 572
 573 Or like this:
 574
   @chars = $martian =~ m/([A-Z][A-Z]|[^A-Z])/g;
   # above is conceptually similar to:     @chars = $text =~ m/(.)/g;
   #
   foreach $char (@chars) {
       # parens keep last out of print's argument list, so print runs first
       print("found GX!\n"), last if $char eq 'GX';
   }
 581
 582 Or like this:
 583
   while ($martian =~ m/\G([A-Z][A-Z]|.)/gs) {  # \G probably unneeded
       # parens keep last out of print's argument list, so print runs first
       print("found GX!\n"), last if $1 eq 'GX';
   }
 587
 588 Or like this:
 589
 590    die "sorry, Perl doesn't (yet) have Martian support )-:\n";
 591
 592 In addition, a sample program which converts half-width to full-width
 593 katakana (in Shift-JIS or EUC encoding) is available from CPAN as
 594
 595 =for Tom make it so
 596
 597 There are many double- (and multi-) byte encodings commonly used these
 598 days.  Some versions of these have 1-, 2-, 3-, and 4-byte characters,
 599 all mixed.
 600
 601 =head1 AUTHOR AND COPYRIGHT
 602
 603 Copyright (c) 1997 Tom Christiansen and Nathan Torkington.
 604 All rights reserved.  See L<perlfaq> for distribution information.
 605