This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5
=head1 NAME

perlfaq6 - Regexps ($Revision: 1.17 $, $Date: 1997/04/24 22:44:10 $)

=head1 DESCRIPTION

This section is surprisingly small because the rest of the FAQ is
littered with answers involving regular expressions. For example,
decoding a URL and checking whether something is a number are handled
with regular expressions, but those answers are found elsewhere in
this document (in the section on Data and the one on Networking, to
be precise).

=head2 How can I hope to use regular expressions without creating illegible and unmaintainable code?

Three techniques can make regular expressions maintainable and
understandable.

=over 4

=item Comments Outside the Regexp

Describe what you're doing and how you're doing it, using normal Perl
comments.

    # turn the line into the first word, a colon, and the
    # number of characters on the rest of the line
    s/^(\w+)(.*)/ lc($1) . ":" . length($2) /ge;

=item Comments Inside the Regexp

The C</x> modifier causes whitespace to be ignored in a regexp pattern
(except in a character class), and also allows you to use normal
comments there, too. As you can imagine, whitespace and comments help
a lot.

C</x> lets you turn this:

    s{<(?:[^>'"]*|".*?"|'.*?')+>}{}gs;

into this:

    s{ <                    # opening angle bracket
        (?:                 # Non-backreffing grouping paren
             [^>'"] *       # 0 or more things that are neither > nor ' nor "
                |           #    or else
             ".*?"          # a section between double quotes (stingy match)
                |           #    or else
             '.*?'          # a section between single quotes (stingy match)
        ) +                 #   all occurring one or more times
       >                    # closing angle bracket
    }{}gsx;                 # replace with nothing, i.e. delete

It's still not quite so clear as prose, but it is very useful for
describing the meaning of each part of the pattern.

=item Different Delimiters

While we normally think of patterns as being delimited with C</>
characters, they can be delimited by almost any character. L<perlre>
describes this. For example, the C<s///> above uses braces as
delimiters. Selecting another delimiter can avoid quoting the
delimiter within the pattern:

    s/\/usr\/local/\/usr\/share/g;      # bad delimiter choice
    s#/usr/local#/usr/share#g;          # better

=back

=head2 I'm having trouble matching over more than one line. What's wrong?

Either you don't have newlines in your string, or you aren't using the
correct modifier(s) on your pattern.

There are many ways to get multiline data into a string. If you want
it to happen automatically while reading input, you'll want to set $/
(probably to '' for paragraphs or C<undef> for the whole file) to
allow you to read more than one line at a time.

Read L<perlre> to help you decide which of C</s> and C</m> (or both)
you might want to use: C</s> allows dot to include newline, and C</m>
allows caret and dollar to match next to a newline, not just at the
end of the string. You do need to make sure that you've actually
got a multiline string in there.

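As a minimal sketch of the difference between the two modifiers, the
following runs the same patterns with and without them on a two-line
string. The sample string and variable names are made up for
illustration.

```perl
my $text = "first line\nsecond line\n";

# Without /s, dot stops at the newline, so this cannot span both lines.
my $plain  = ($text =~ /first.*second/)  ? 1 : 0;

# With /s, dot also matches newline.
my $with_s = ($text =~ /first.*second/s) ? 1 : 0;

# Without /m, ^ matches only at the very start of the string.
my $no_m   = ($text =~ /^second/)  ? 1 : 0;

# With /m, ^ also matches just after an embedded newline.
my $with_m = ($text =~ /^second/m) ? 1 : 0;

print "plain=$plain s=$with_s no_m=$no_m m=$with_m\n";
```
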
For example, this program detects duplicate words, even when they span
line breaks (but not paragraph ones). For this example, we don't need
C</s> because we aren't using dot in a regular expression that we want
to cross line boundaries. Neither do we need C</m> because we aren't
wanting caret or dollar to match at any point inside the record next
to newlines. But it's imperative that $/ be set to something other
than the default, or else we won't actually ever have a multiline
record read in.

    $/ = '';          # read in whole paragraphs, not just one line
    while ( <> ) {
        while ( /\b(\w\S+)(\s+\1)+\b/gi ) {
            print "Duplicate $1 at paragraph $.\n";
        }
    }

Here's code that finds lines that begin with "From " (which would
be mangled by many mailers):

    $/ = '';          # read in whole paragraphs, not just one line
    while ( <> ) {
        while ( /^From /gm ) {      # /m makes ^ match next to \n
            print "leading from in paragraph $.\n";
        }
    }

Here's code that finds everything between START and END in a file:

    undef $/;         # read in whole file, not just one line or paragraph
    while ( <> ) {
        while ( /START(.*?)END/sgm ) {  # /s makes . cross line boundaries;
                                        # /g moves on to the next match
            print "$1\n";
        }
    }

=head2 How can I pull out lines between two patterns that are themselves on different lines?

You can use Perl's somewhat exotic C<..> operator (documented in
L<perlop>):

    perl -ne 'print if /START/ .. /END/' file1 file2 ...

If you wanted text and not lines, you would use

    perl -0777 -ne 'print "$1\n" while /START(.*?)END/gs' file1 file2 ...

But if you want nested occurrences of C<START> through C<END>, you'll
run up against the problem described in the question in this section
on matching balanced text.

=head2 I put a regular expression into $/ but it didn't work. What's wrong?

$/ must be a string, not a regular expression. Awk has to be better
for something. :-)

Actually, you could do this if you don't mind reading the whole file
into memory:

    undef $/;
    @records = split /your_pattern/, <FH>;

The Net::Telnet module (available from CPAN) has the capability to
wait for a pattern in the input stream, or time out if it doesn't
appear within a certain time.

    ## Create a file with three lines.
    open FH, ">file";
    print FH "The first line\nThe second line\nThe third line\n";
    close FH;

    ## Get a read/write filehandle to it.
    use FileHandle;
    $fh = new FileHandle "+<file";

    ## Attach it to a "stream" object.
    use Net::Telnet;
    $file = new Net::Telnet (-fhopen => $fh);

    ## Search for the second line and print out the third.
    $file->waitfor('/second line\n/');
    print $file->getline;

=head2 How do I substitute case insensitively on the LHS, but preserving case on the RHS?

It depends on what you mean by "preserving case". The following
script makes the substitution have the same case, letter by letter, as
the original. If the substitution has more characters than the string
being substituted, the case of the last character is used for the rest
of the substitution.

    # Original by Nathan Torkington, massaged by Jeffrey Friedl
    #
    sub preserve_case($$)
    {
        my ($old, $new) = @_;
        my ($state) = 0; # 0 = no change; 1 = lc; 2 = uc
        my ($i, $oldlen, $newlen, $c) = (0, length($old), length($new));
        my ($len) = $oldlen < $newlen ? $oldlen : $newlen;

        for ($i = 0; $i < $len; $i++) {
            if ($c = substr($old, $i, 1), $c =~ /[\W\d_]/) {
                $state = 0;
            } elsif (lc $c eq $c) {
                substr($new, $i, 1) = lc(substr($new, $i, 1));
                $state = 1;
            } else {
                substr($new, $i, 1) = uc(substr($new, $i, 1));
                $state = 2;
            }
        }
        # finish up with any remaining new (for when new is longer than old)
        if ($newlen > $oldlen) {
            if ($state == 1) {
                substr($new, $oldlen) = lc(substr($new, $oldlen));
            } elsif ($state == 2) {
                substr($new, $oldlen) = uc(substr($new, $oldlen));
            }
        }
        return $new;
    }

    $a = "this is a TEsT case";
    $a =~ s/(test)/preserve_case($1, "success")/gie;
    print "$a\n";

This prints:

    this is a SUcCESS case

=head2 How can I make C<\w> match accented characters?

See L<perllocale>.

=head2 How can I match a locale-smart version of C</[a-zA-Z]/>?

One alphabetic character would be C</[^\W\d_]/>, no matter what locale
you're in. Non-alphabetics would be C</[\W\d_]/> (assuming you don't
consider an underscore a letter).

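A small sketch of how the two classes partition a handful of
characters. The sample list is an assumption chosen for illustration,
and it uses only ASCII so that the result does not depend on the
locale in effect.

```perl
my @chars = ('a', 'Z', '7', '_', '-', ' ');

# Alphabetic: a word character that is neither a digit nor an underscore.
my @letters = grep { /^[^\W\d_]$/ } @chars;

# Everything else: non-word characters, digits, and the underscore.
my @others  = grep { /^[\W\d_]$/ } @chars;

print "letters: @letters\n";
print "others:  ", scalar(@others), " characters\n";
```
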
=head2 How can I quote a variable to use in a regexp?

The Perl parser will expand $variable and @variable references in
regular expressions unless the delimiter is a single quote. Remember,
too, that the right-hand side of a C<s///> substitution is considered
a double-quoted string (see L<perlop> for more details). Remember
also that any regexp special characters will be acted on unless you
precede the interpolated variable with \Q. Here's an example:

    $string = "to die?";
    $lhs = "die?";
    $rhs = "sleep no more";

    $string =~ s/\Q$lhs/$rhs/;
    # $string is now "to sleep no more"

Without the \Q, the regexp would also spuriously match "di".

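To make the spurious match concrete, here is a short sketch that tries
the same interpolated pattern with and without \Q, and also shows the
built-in quotemeta() function, which does the same escaping outside a
pattern. The test string is invented for illustration.

```perl
my $lhs    = "die?";
my $target = "to di or not";    # contains "di" but not "die?"

# Unquoted: the "?" is a quantifier, so the "e" is optional and "di" matches.
my $without = ($target =~ /$lhs/)   ? 1 : 0;

# Quoted: the "?" is a literal character, so nothing matches here.
my $with    = ($target =~ /\Q$lhs/) ? 1 : 0;

# quotemeta() returns the string with regexp metacharacters backslashed.
my $quoted  = quotemeta($lhs);

print "without=$without with=$with quoted=$quoted\n";
```
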
=head2 What is C</o> really for?

Using a variable in a regular expression match forces a re-evaluation
(and perhaps recompilation) each time through. The C</o> modifier
locks in the regexp the first time it's used. This always happens in a
constant regular expression, and in fact, the pattern was compiled
into the internal format at the same time your entire program was.

Use of C</o> is irrelevant unless variable interpolation is used in
the pattern, and if so, the regexp engine will neither know nor care
whether the variables change after the pattern is evaluated the I<very
first> time.

C</o> is often used to gain an extra measure of efficiency by not
performing subsequent evaluations when you know it won't matter
(because you know the variables won't change), or more rarely, when
you don't want the regexp to notice if they do.

For example, here's a "paragrep" program:

    $/ = '';  # paragraph mode
    $pat = shift;
    while (<>) {
        print if /$pat/o;
    }

=head2 How do I use a regular expression to strip C style comments from a file?

While this actually can be done, it's much harder than you'd think.
For example, this one-liner

    perl -0777 -pe 's{/\*.*?\*/}{}gs' foo.c

will work in many but not all cases. You see, it's too simple-minded for
certain kinds of C programs, in particular, those with what appear to be
comments in quoted strings. For that, you'd need something like this,
created by Jeffrey Friedl:

    $/ = undef;
    $_ = <>;
    s#/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|\n+|.[^/"'\\]*)#$2#g;
    print;

This could, of course, be more legibly written with the C</x> modifier, adding
whitespace and comments.

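One possible C</x> rendering of that pattern is sketched below. It is
an adaptation rather than Friedl's own layout: the inner groups are
made non-capturing, so the text to keep lands in $1 instead of $2, the
delimiters are braces so that C<#> can introduce comments, and the
replacement uses C</e> to avoid interpolating an undefined value. The
sample C fragment is made up for illustration.

```perl
my $code = qq{int x = 1; /* set x */ char *s = "/* not a comment */";\n};

$code =~ s{
    /\*                       # comment opener
    [^*]* \*+                 # comment text up to a run of stars
    (?: [^/*] [^*]* \*+ )*    # more text and stars, unless a slash follows
    /                         # the slash that closes the comment
  |                           # OR something that must be kept:
    (
        " (?: \\. | [^"\\] )* "   # a double-quoted string
      | ' (?: \\. | [^'\\] )* '   # a single-quoted character constant
      | \n+                       # a run of newlines
      | . [^/"'\\]*               # any other char, plus what safely follows
    )
}{ defined $1 ? $1 : "" }gex;

print $code;
```
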
=head2 Can I use Perl regular expressions to match balanced text?

Although Perl regular expressions are more powerful than "mathematical"
regular expressions, because they feature conveniences like backreferences
(C<\1> and its ilk), they still aren't powerful enough. You still need
to use non-regexp techniques to parse balanced text, such as the text
enclosed between matching parentheses or braces, for example.

An elaborate subroutine (for 7-bit ASCII only) to pull out balanced
and possibly nested single chars, like C<`> and C<'>, C<{> and C<}>,
or C<(> and C<)> can be found in
http://www.perl.com/CPAN/authors/id/TOMC/scripts/pull_quotes.gz .

The C::Scan module from CPAN contains such subs for internal usage,
but they are undocumented.

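As a much simpler illustration of a non-regexp technique, the sketch
below walks the string one character at a time and keeps a nesting
count. It is not the pull_quotes script mentioned above. The name
balanced_span is hypothetical, and the routine handles only
parentheses and ignores quoting and escapes.

```perl
# Return the first balanced (...) group in $text, or nothing if none exists.
sub balanced_span {
    my ($text) = @_;
    my $start = index($text, '(');
    return unless $start >= 0;          # no opening paren at all

    my $depth = 0;
    for my $i ($start .. length($text) - 1) {
        my $c = substr($text, $i, 1);
        $depth++ if $c eq '(';
        $depth-- if $c eq ')';
        # Depth back at zero means the opening paren has been closed.
        return substr($text, $start, $i - $start + 1) if $depth == 0;
    }
    return;                             # ran out of text: unbalanced
}

my $found = balanced_span("f(a(b)c)d");
print "$found\n";
```
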
=head2 What does it mean that regexps are greedy? How can I get around it?

Most people mean that greedy regexps match as much as they can.
Technically speaking, it's actually the quantifiers (C<?>, C<*>, C<+>,
C<{}>) that are greedy rather than the whole pattern; Perl prefers local
greed and immediate gratification to overall greed. To get non-greedy
versions of the same quantifiers, use (C<??>, C<*?>, C<+?>, C<{}?>).

An example:

    $s1 = $s2 = "I am very very cold";
    $s1 =~ s/ve.*y //;      # I am cold
    $s2 =~ s/ve.*?y //;     # I am very cold

Notice how the second substitution stopped matching as soon as it
encountered "y ". The C<*?> quantifier effectively tells the regular
expression engine to find a match as quickly as possible and pass
control on to whatever is next in line, like you would if you were
playing hot potato.

=head2 How do I process each word on each line?

Use the split function:

    while (<>) {
        foreach $word ( split ) {
            # do something with $word here
        }
    }

Note that this isn't really a word in the English sense; it's just
chunks of consecutive non-whitespace characters.

To work with only alphanumeric sequences, you might consider

    while (<>) {
        foreach $word (m/(\w+)/g) {
            # do something with $word here
        }
    }

=head2 How can I print out a word-frequency or line-frequency summary?

To do this, you have to parse out each word in the input stream. We'll
pretend that by word you mean chunk of alphabetics, hyphens, or
apostrophes, rather than the non-whitespace chunk idea of a word given
in the previous question:

    while (<>) {
        while ( /(\b[^\W_\d][\w'-]+\b)/g ) {   # misses "`sheep'"
            $seen{$1}++;
        }
    }
    while ( ($word, $count) = each %seen ) {
        print "$count $word\n";
    }

If you wanted to do the same thing for lines, you wouldn't need a
regular expression:

    while (<>) {
        $seen{$_}++;
    }
    while ( ($line, $count) = each %seen ) {
        print "$count $line";
    }

If you want these output in a sorted order, see the section on Hashes.

=head2 How can I do approximate matching?

See the module String::Approx available from CPAN.

=head2 How do I efficiently match many regular expressions at once?

The following is super-inefficient:

    while (<FH>) {
        foreach $pat (@patterns) {
            if ( /$pat/ ) {
                # do something
            }
        }
    }

Instead, you either need to use one of the experimental Regexp extension
modules from CPAN (which might well be overkill for your purposes),
or else put together something like this, inspired from a routine
in Jeffrey Friedl's book:

    sub _bm_build {
        my $condition = shift;
        my @regexp = @_;  # this MUST not be local(); need my()
        my $expr = join $condition => map { "m/\$regexp[$_]/o" } (0..$#regexp);
        my $match_func = eval "sub { $expr }";
        die if $@;  # propagate $@; this shouldn't happen!
        return $match_func;
    }

    sub bm_and { _bm_build('&&', @_) }
    sub bm_or  { _bm_build('||', @_) }

    $f1 = bm_and qw{
            xterm
            (?i)window
    };

    $f2 = bm_or qw{
            \b[Ff]ree\b
            \bBSD\B
            (?i)sys(tem)?\s*[V5]\b
    };

    # feed me /etc/termcap, prolly
    while ( <> ) {
        print "1: $_" if &$f1;
        print "2: $_" if &$f2;
    }

=head2 Why don't word-boundary searches with C<\b> work for me?

Two common misconceptions are that C<\b> is a synonym for C<\s+>, and
that it's the edge between whitespace characters and non-whitespace
characters. Neither is correct. C<\b> is the place between a C<\w>
character and a C<\W> character (that is, C<\b> is the edge of a
"word"). It's a zero-width assertion, just like C<^>, C<$>, and all
the other anchors, so it doesn't consume any characters. L<perlre>
describes the behaviour of all the regexp metacharacters.

Here are examples of the incorrect application of C<\b>, with fixes:

    "two words" =~ /(\w+)\b(\w+)/;          # WRONG
    "two words" =~ /(\w+)\s+(\w+)/;         # right

    " =matchless= text" =~ /\b=(\w+)=\b/;   # WRONG
    " =matchless= text" =~ /=(\w+)=/;       # right

Although they may not do what you thought they did, C<\b> and C<\B>
can still be quite useful. For an example of the correct use of
C<\b>, see the example of matching duplicate words over multiple
lines.

An example of using C<\B> is the pattern C<\Bis\B>. This will find
occurrences of "is" on the insides of words only, as in "thistle", but
not "this" or "island".

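The C<\Bis\B> claim is easy to check with a short sketch. The word
list is the one from the paragraph above, plus a bare "is" added here
as an extra case.

```perl
my @words = qw(thistle this island is);

# Keep only the words where "is" has a word character on both sides.
my @inside = grep { /\Bis\B/ } @words;

print "@inside\n";
```
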
=head2 Why does using $&, $`, or $' slow my program down?

Because once Perl sees that you need one of these variables anywhere
in the program, it has to provide them on each and every pattern
match. The same mechanism that handles these provides for the use of
$1, $2, etc., so you pay the same price for each regexp that contains
capturing parentheses. But if you never use $&, etc., in your script,
then regexps I<without> capturing parentheses won't be penalized. So
avoid $&, $', and $` if you can, but if you can't (and some algorithms
really appreciate them), once you've used them once, use them at will,
because you've already paid the price.

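One way to avoid the three variables, sketched here under the
assumption that you only need them for a single match, is to capture
the same three pieces with parentheses. Only this one regexp then pays
the capturing cost. The sample text and variable names are
illustrative.

```perl
my $text = "order 66 shipped";
my ($pre, $match, $post) = ("", "", "");

# $1 stands in for $`, $2 for $&, and $3 for $'.
if ($text =~ /^(.*?)(\d+)(.*)$/s) {
    ($pre, $match, $post) = ($1, $2, $3);
}

print "pre=[$pre] match=[$match] post=[$post]\n";
```
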
=head2 What good is C<\G> in a regular expression?

The notation C<\G> is used in a match or substitution in conjunction with
the C</g> modifier (and ignored if there's no C</g>) to anchor the regular
expression to the point just past where the last match occurred, i.e. the
pos() point.

For example, suppose you had a line of text quoted in standard mail
and Usenet notation, (that is, with leading C<E<gt>> characters), and
you want to change each leading C<E<gt>> into a corresponding C<:>. You
could do so in this way:

    s/^(>+)/':' x length($1)/gem;

Or, using C<\G>, the much simpler (and faster):

    s/\G>/:/g;

A more sophisticated use might involve a tokenizer. The following
lex-like example is courtesy of Jeffrey Friedl. It did not work in
5.003 due to bugs in that release, but does work in 5.004 or better.
(Note the use of C</c>, which prevents a failed match with C</g> from
resetting the search position back to the beginning of the string.)

    while (<>) {
        chomp;
        PARSER: {
            m/ \G( \d+\b    )/gcx   && do { print "number: $1\n"; redo; };
            m/ \G( \w+      )/gcx   && do { print "word: $1\n";   redo; };
            m/ \G( \s+      )/gcx   && do { print "space: $1\n";  redo; };
            m/ \G( [^\w\d]+ )/gcx   && do { print "other: $1\n";  redo; };
        }
    }

Of course, that could have been written as

    while (<>) {
        chomp;
        PARSER: {
            if ( /\G( \d+\b )/gcx ) {
                print "number: $1\n";
                redo PARSER;
            }
            if ( /\G( \w+ )/gcx ) {
                print "word: $1\n";
                redo PARSER;
            }
            if ( /\G( \s+ )/gcx ) {
                print "space: $1\n";
                redo PARSER;
            }
            if ( /\G( [^\w\d]+ )/gcx ) {
                print "other: $1\n";
                redo PARSER;
            }
        }
    }

But then you lose the vertical alignment of the regular expressions.

=head2 Are Perl regexps DFAs or NFAs? Are they POSIX compliant?

While it's true that Perl's regular expressions resemble the DFAs
(deterministic finite automata) of the egrep(1) program, they are in
fact implemented as NFAs (non-deterministic finite automata) to allow
backtracking and backreferencing. And they aren't POSIX-style either,
because those guarantee worst-case behavior for all cases. (It seems
that some people prefer guarantees of consistency, even when what's
guaranteed is slowness.) See the book "Mastering Regular Expressions"
(from O'Reilly) by Jeffrey Friedl for all the details you could ever
hope to know on these matters (a full citation appears in
L<perlfaq2>).

=head2 What's wrong with using grep or map in a void context?

Strictly speaking, nothing. Stylistically speaking, it's not a good
way to write maintainable code. That's because you're using these
constructs not for their return values but rather for their
side-effects, and side-effects can be mystifying. There's no void
grep() that's not better written as a C<for> (well, C<foreach>,
technically) loop.

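A small sketch of the stylistic point, with made-up data: both forms
below compute the same total, but the loop states its intent directly
instead of hiding it in a side-effect.

```perl
my @nums = (1, 2, 3, 4);

# Void-context grep: legal, but the reader has to notice that the
# return value is thrown away and only the side-effect matters.
my $grep_total = 0;
grep { $grep_total += $_ } @nums;

# The same work as a plain loop.
my $loop_total = 0;
foreach my $n (@nums) {
    $loop_total += $n;
}

print "grep: $grep_total, loop: $loop_total\n";
```
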
=head2 How can I match strings with multibyte characters?

This is hard, and there's no good way. Perl does not directly support
wide characters. It pretends that a byte and a character are
synonymous. The following set of approaches was offered by Jeffrey
Friedl, whose article in issue #5 of The Perl Journal talks about this
very matter.

Let's suppose you have some weird Martian encoding where pairs of
ASCII uppercase letters encode single Martian letters (i.e. the two
bytes "CV" make a single Martian letter, as do the two bytes "SG",
"VS", "XX", etc.). Other bytes represent single characters, just like
ASCII.

So, the string of Martian "I am CVSGXX!" uses 12 bytes to encode the
nine characters 'I', ' ', 'a', 'm', ' ', 'CV', 'SG', 'XX', '!'.

Now, say you want to search for the single character C</GX/>. Perl
doesn't know about Martian, so it'll find the two bytes "GX" in the "I
am CVSGXX!" string, even though that character isn't there: it just
looks like it is because "SG" is next to "XX", but there's no real
"GX". This is a big problem.

Here are a few ways, all painful, to deal with it:

    $martian =~ s/([A-Z][A-Z])/ $1 /g; # Make sure adjacent ``martian'' bytes
                                       # are no longer adjacent.
    print "found GX!\n" if $martian =~ /GX/;

Or like this:

    @chars = $martian =~ m/([A-Z][A-Z]|[^A-Z])/g;
    # above is conceptually similar to: @chars = $text =~ m/(.)/g;
    #
    foreach $char (@chars) {
        if ($char eq 'GX') {
            print "found GX!\n";    # print first, then leave the loop
            last;
        }
    }

Or like this:

    while ($martian =~ m/\G([A-Z][A-Z]|.)/gs) {  # \G probably unneeded
        if ($1 eq 'GX') {
            print "found GX!\n";    # print first, then leave the loop
            last;
        }
    }

Or like this:

    die "sorry, Perl doesn't (yet) have Martian support )-:\n";

In addition, a sample program which converts half-width to full-width
katakana (in Shift-JIS or EUC encoding) is available from CPAN as

=for Tom make it so

There are many double- (and multi-) byte encodings commonly used these
days. Some versions of these have 1-, 2-, 3-, and 4-byte characters,
all mixed.

=head1 AUTHOR AND COPYRIGHT

Copyright (c) 1997 Tom Christiansen and Nathan Torkington.
All rights reserved. See L<perlfaq> for distribution information.
