pod/perlrequick.pod

   1 =head1 NAME
   2
   3 perlrequick - Perl regular expressions quick start
   4
   5 =head1 DESCRIPTION
   6
   7 This page covers the very basics of understanding, creating and
   8 using regular expressions ('regexes') in Perl.
   9
  10
  11 =head1 The Guide
  12
   13 This page assumes you already know what a "pattern" is and the basic
   14 syntax for using one.  If you don't, see L<perlretut>.
  15
  16 =head2 Simple word matching
  17
  18 The simplest regex is simply a word, or more generally, a string of
  19 characters.  A regex consisting of a word matches any string that
  20 contains that word:
  21
  22     "Hello World" =~ /World/;  # matches
  23
  24 In this statement, C<World> is a regex and the C<//> enclosing
  25 C</World/> tells Perl to search a string for a match.  The operator
  26 C<=~> associates the string with the regex match and produces a true
  27 value if the regex matched, or false if the regex did not match.  In
  28 our case, C<World> matches the second word in C<"Hello World">, so the
  29 expression is true.  This idea has several variations.
  30
  31 Expressions like this are useful in conditionals:
  32
  33     print "It matches\n" if "Hello World" =~ /World/;
  34
   35 The sense of the match can be reversed by using the C<!~> operator:
  36
  37     print "It doesn't match\n" if "Hello World" !~ /World/;
  38
  39 The literal string in the regex can be replaced by a variable:
  40
  41     $greeting = "World";
  42     print "It matches\n" if "Hello World" =~ /$greeting/;
  43
  44 If you're matching against C<$_>, the C<$_ =~> part can be omitted:
  45
  46     $_ = "Hello World";
  47     print "It matches\n" if /World/;
  48
  49 Finally, the C<//> default delimiters for a match can be changed to
  50 arbitrary delimiters by putting an C<'m'> out front:
  51
  52     "Hello World" =~ m!World!;   # matches, delimited by '!'
  53     "Hello World" =~ m{World};   # matches, note the matching '{}'
  54     "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
  55                                  # '/' becomes an ordinary char
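
The variations above can be combined in a short script.  This is only a
minimal sketch: the names C<@lines>, C<$word>, C<@with>, C<@without> and
C<$count>, and the sample strings, are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data, not taken from any real program.
my @lines = ("Hello World", "Goodbye", "World Cup");
my $word  = "World";

# Keep the strings that contain $word, and the ones that don't.
my @with    = grep { $_ =~ /$word/ } @lines;
my @without = grep { $_ !~ /$word/ } @lines;

# Inside the loop each string is in $_, so the '$_ =~' part can be
# omitted; m{...} is the same match with different delimiters.
my $count = 0;
for (@lines) {
    $count++ if m{World};
}

print "with: @with\n";
print "without: @without\n";
print "count: $count\n";
```

Here C<@with> ends up holding the two strings that contain C<World>,
and C<@without> holds the remaining one.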
  56
  57 Regexes must match a part of the string I<exactly> in order for the
  58 statement to be true:
  59
  60     "Hello World" =~ /world/;  # doesn't match, case sensitive
  61     "Hello World" =~ /o W/;    # matches, ' ' is an ordinary char
  62     "Hello World" =~ /World /; # doesn't match, no ' ' at end
  63
  64 Perl will always match at the earliest possible point in the string:
  65
  66     "Hello World" =~ /o/;       # matches 'o' in 'Hello'
  67     "That hat is red" =~ /hat/; # matches 'hat' in 'That'
  68
  69 Not all characters can be used 'as is' in a match.  Some characters,
  70 called B<metacharacters>, are considered special, and reserved for use
  71 in regex notation.  The metacharacters are
  72
  73     {}[]()^$.|*+?\
  74
  75 A metacharacter can be matched literally by putting a backslash before
  76 it:
  77
  78     "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
  79     "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
  80     'C:\WIN32' =~ /C:\\WIN/;                       # matches
  81     "/usr/bin/perl" =~ /\/usr\/bin\/perl/;  # matches
  82
  83 In the last regex, the forward slash C<'/'> is also backslashed,
  84 because it is used to delimit the regex.
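
As a rough illustration of escaping, the sketch below records whether
each of four matches succeeds.  The variable names are hypothetical, and
the C<m{}> form is simply one way to avoid backslashing every C<'/'>.

```perl
use strict;
use warnings;

my $sum  = "2+2=4";
my $path = "/usr/bin/perl";

# Unescaped, '+' is a quantifier: /2+2/ needs at least two adjacent '2's.
my $unescaped = ($sum =~ /2+2/)  ? 1 : 0;
# Escaped, '\+' matches a literal plus sign.
my $escaped   = ($sum =~ /2\+2/) ? 1 : 0;

# With '/' as the delimiter, every slash in the pattern is backslashed ...
my $slashes = ($path =~ /\/usr\/bin\/perl/) ? 1 : 0;
# ... but with m{} as the delimiter, '/' is an ordinary character.
my $braces  = ($path =~ m{/usr/bin/perl})   ? 1 : 0;

print "$unescaped $escaped $slashes $braces\n";   # prints "0 1 1 1"
```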
  85
  86 Most of the metacharacters aren't always special, and other characters
   87 (such as the ones delimiting the pattern) become special under various
  88 circumstances.  This can be confusing and lead to unexpected results.
  89 L<S<C<use re 'strict'>>|re/'strict' mode> can notify you of potential
  90 pitfalls.
  91
  92 Non-printable ASCII characters are represented by B<escape sequences>.
  93 Common examples are C<\t> for a tab, C<\n> for a newline, and C<\r>
  94 for a carriage return.  Arbitrary bytes are represented by octal
  95 escape sequences, e.g., C<\033>, or hexadecimal escape sequences,
  96 e.g., C<\x1B>:
  97
   98     "1000\t2000" =~ m(0\t2);  # matches
   99     "cat" =~ /\143\x61\x74/;  # matches in ASCII, but
  100                               # a weird way to spell cat
 101
 102 Regexes are treated mostly as double-quoted strings, so variable
 103 substitution works:
 104
 105     $foo = 'house';
 106     'cathouse' =~ /cat$foo/;   # matches
 107     'housecat' =~ /${foo}cat/; # matches
 108
 109 With all of the regexes above, if the regex matched anywhere in the
 110 string, it was considered a match.  To specify I<where> it should
 111 match, we would use the B<anchor> metacharacters C<^> and C<$>.  The
 112 anchor C<^> means match at the beginning of the string and the anchor
 113 C<$> means match at the end of the string, or before a newline at the
 114 end of the string.  Some examples:
 115
 116     "housekeeper" =~ /keeper/;         # matches
 117     "housekeeper" =~ /^keeper/;        # doesn't match
 118     "housekeeper" =~ /keeper$/;        # matches
 119     "housekeeper\n" =~ /keeper$/;      # matches
 120     "housekeeper" =~ /^housekeeper$/;  # matches
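
One possible use of the anchors is to tell a match anywhere in the
string apart from a match of the whole string.  In this sketch the
candidate strings and array names are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical candidates; note the trailing newline on the second one.
my @candidates = (
    "housekeeper",
    "housekeeper\n",
    "housekeepers",
    "the housekeeper",
);

# No anchors: the regex may match anywhere in the string.
my @anywhere = grep { /keeper/ } @candidates;
# '^' pins the match to the start of the string.
my @at_start = grep { /^house/ } @candidates;
# '^' and '$' together require the whole string to match;
# '$' still allows one newline at the very end.
my @whole    = grep { /^housekeeper$/ } @candidates;

printf "anywhere: %d, at start: %d, whole: %d\n",
    scalar @anywhere, scalar @at_start, scalar @whole;
# prints "anywhere: 4, at start: 3, whole: 2"
```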
 121
 122 =head2 Using character classes
 123
 124 A B<character class> allows a set of possible characters, rather than
 125 just a single character, to match at a particular point in a regex.
  126 There are several types of character classes, but the term
  127 usually refers to the type described in this section, technically
  128 called a "bracketed character class" because it is denoted by
  129 brackets C<[...]> enclosing the set of characters that may match.
  130 We'll drop the "bracketed" below to correspond with common
  131 usage.  Here are some examples of (bracketed) character
  132 classes:
 133
 134     /cat/;            # matches 'cat'
 135     /[bcr]at/;        # matches 'bat', 'cat', or 'rat'
 136     "abc" =~ /[cab]/; # matches 'a'
 137
 138 In the last statement, even though C<'c'> is the first character in
 139 the class, the earliest point at which the regex can match is C<'a'>.
 140
 141     /[yY][eE][sS]/; # match 'yes' in a case-insensitive way
 142                     # 'yes', 'Yes', 'YES', etc.
 143     /yes/i;         # also match 'yes' in a case-insensitive way
 144
 145 The last example shows a match with an C<'i'> B<modifier>, which makes
 146 the match case-insensitive.
 147
 148 Character classes also have ordinary and special characters, but the
 149 sets of ordinary and special characters inside a character class are
 150 different than those outside a character class.  The special
 151 characters for a character class are C<-]\^$> and are matched using an
 152 escape:
 153
 154    /[\]c]def/; # matches ']def' or 'cdef'
 155    $x = 'bcr';
  156    /[$x]at/;   # matches 'bat', 'cat', or 'rat'
  157    /[\$x]at/;  # matches '$at' or 'xat'
  158    /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'
 159
 160 The special character C<'-'> acts as a range operator within character
 161 classes, so that the unwieldy C<[0123456789]> and C<[abc...xyz]>
 162 become the svelte C<[0-9]> and C<[a-z]>:
 163
 164     /item[0-9]/;  # matches 'item0' or ... or 'item9'
 165     /[0-9a-fA-F]/;  # matches a hexadecimal digit
 166
 167 If C<'-'> is the first or last character in a character class, it is
 168 treated as an ordinary character.
 169
 170 The special character C<^> in the first position of a character class
 171 denotes a B<negated character class>, which matches any character but
 172 those in the brackets.  Both C<[...]> and C<[^...]> must match a
  173 character, or the match fails.  For example:
  174
  175     /[^a]at/;  # doesn't match 'aat' or 'at', but matches
  176                # all others: 'bat', 'cat', '0at', '%at', etc.
 177     /[^0-9]/;  # matches a non-numeric character
 178     /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary
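
To see ranges and negation side by side, here is a small sketch that
filters one list with three classes.  The input strings and array names
are assumptions chosen to exercise each class.

```perl
use strict;
use warnings;

# Hypothetical inputs chosen to exercise each class.
my @inputs = ('item7', 'itemX', 'cat', 'aat', '^at');

# A range: 'item' followed by one digit.
my @numbered = grep { /item[0-9]/ } @inputs;
# A negated class: any character except 'a', then 'at'.
my @not_a    = grep { /[^a]at/ } @inputs;
# '^' is ordinary when it is not first: 'a' or '^', then 'at'.
my @a_or_hat = grep { /[a^]at/ } @inputs;

print "numbered: @numbered\n";   # prints "numbered: item7"
print "not_a: @not_a\n";         # prints "not_a: cat ^at"
print "a_or_hat: @a_or_hat\n";   # prints "a_or_hat: aat ^at"
```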
 179
 180 Perl has several abbreviations for common character classes. (These
 181 definitions are those that Perl uses in ASCII-safe mode with the C</a> modifier.
 182 Otherwise they could match many more non-ASCII Unicode characters as
 183 well.  See L<perlrecharclass/Backslash sequences> for details.)
 184
 185 =over 4
 186
 187 =item *
 188
 189 \d is a digit and represents
 190
 191     [0-9]
 192
 193 =item *
 194
 195 \s is a whitespace character and represents
 196
 197     [\ \t\r\n\f]
 198
 199 =item *
 200
 201 \w is a word character (alphanumeric or _) and represents
 202
 203     [0-9a-zA-Z_]
 204
 205 =item *
 206
 207 \D is a negated \d; it represents any character but a digit
 208
 209     [^0-9]
 210
 211 =item *
 212
 213 \S is a negated \s; it represents any non-whitespace character
 214
 215     [^\s]
 216
 217 =item *
 218
 219 \W is a negated \w; it represents any non-word character
 220
 221     [^\w]
 222
 223 =item *
 224
 225 The period '.' matches any character but "\n"
 226
 227 =back
 228
 229 The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside
 230 of character classes.  Here are some in use:
 231
 232     /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
 233     /[\d\s]/;         # matches any digit or whitespace character
 234     /\w\W\w/;         # matches a word char, followed by a
 235                       # non-word char, followed by a word char
 236     /..rt/;           # matches any two chars, followed by 'rt'
 237     /end\./;          # matches 'end.'
 238     /end[.]/;         # same thing, matches 'end.'
 239
 240 The S<B<word anchor> > C<\b> matches a boundary between a word
 241 character and a non-word character C<\w\W> or C<\W\w>:
 242
 243     $x = "Housecat catenates house and cat";
 244     $x =~ /\bcat/;  # matches cat in 'catenates'
  245     $x =~ /cat\b/;  # matches cat in 'Housecat'
 246     $x =~ /\bcat\b/;  # matches 'cat' at end of string
 247
 248 In the last example, the end of the string is considered a word
 249 boundary.
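
To check where each of these matches lands, one option is the special
array C<@->, whose first element C<$-[0]> holds the offset at which the
last successful match started (see L<perlvar>).  The variable names in
this sketch are made up.

```perl
use strict;
use warnings;

my $x = "Housecat catenates house and cat";

# Record the start offset of each match; $-[0] is only read
# when the match on the right-hand side has succeeded.
my ($word_start, $word_end, $whole_word);
$word_start = $-[0] if $x =~ /\bcat/;     # 'cat' in 'catenates'
$word_end   = $-[0] if $x =~ /cat\b/;     # 'cat' in 'Housecat'
$whole_word = $-[0] if $x =~ /\bcat\b/;   # 'cat' at end of string

print "$word_start $word_end $whole_word\n";   # prints "9 5 29"
```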
 250
 251 For natural language processing (so that, for example, apostrophes are
  252 included in words), use C<\b{wb}> instead:
 253
 254     "don't" =~ / .+? \b{wb} /x;  # matches the whole string
 255
 256 =head2 Matching this or that
 257
 258 We can match different character strings with the B<alternation>
 259 metacharacter C<'|'>.  To match C<dog> or C<cat>, we form the regex
 260 C<dog|cat>.  As before, Perl will try to match the regex at the
 261 earliest possible point in the string.  At each character position,
 262 Perl will first try to match the first alternative, C<dog>.  If
 263 C<dog> doesn't match, Perl will then try the next alternative, C<cat>.
 264 If C<cat> doesn't match either, then the match fails and Perl moves to
 265 the next position in the string.  Some examples:
 266
 267     "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
 268     "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"
 269
 270 Even though C<dog> is the first alternative in the second regex,
 271 C<cat> is able to match earlier in the string.
 272
 273     "cats"          =~ /c|ca|cat|cats/; # matches "c"
 274     "cats"          =~ /cats|cat|ca|c/; # matches "cats"
 275
 276 At a given character position, the first alternative that allows the
 277 regex match to succeed will be the one that matches. Here, all the
 278 alternatives match at the first string position, so the first matches.
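
The following sketch makes the chosen alternative visible by returning
the text that matched.  It relies on two things not covered in this
guide: the C<qr//> operator, which builds a regex object, and the
special variable C<$&>, which holds the text of the last successful
match.  The helper C<matched_text> is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text matched by $regex, or undef.
sub matched_text {
    my ($string, $regex) = @_;
    return $string =~ $regex ? $& : undef;
}

# 'cat' wins because it can match earlier in the string than 'dog'.
my $earliest = matched_text("cats and dogs", qr/dog|cat|bird/);
# At the same position, the first alternative that succeeds wins.
my $short    = matched_text("cats", qr/c|ca|cat|cats/);
my $long     = matched_text("cats", qr/cats|cat|ca|c/);

print "$earliest $short $long\n";   # prints "cat c cats"
```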
 279
 280 =head2 Grouping things and hierarchical matching
 281
 282 The B<grouping> metacharacters C<()> allow a part of a regex to be
 283 treated as a single unit.  Parts of a regex are grouped by enclosing
 284 them in parentheses.  The regex C<house(cat|keeper)> means match
 285 C<house> followed by either C<cat> or C<keeper>.  Some more examples
 286 are
 287
 288     /(a|b)b/;    # matches 'ab' or 'bb'
 289     /(^a|b)c/;   # matches 'ac' at start of string or 'bc' anywhere
 290
 291     /house(cat|)/;  # matches either 'housecat' or 'house'
 292     /house(cat(s|)|)/;  # matches either 'housecats' or 'housecat' or
 293                         # 'house'.  Note groups can be nested.
 294
 295     "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
 296                              # because '20\d\d' can't match
 297
 298 =head2 Extracting matches
 299
 300 The grouping metacharacters C<()> also allow the extraction of the
 301 parts of a string that matched.  For each grouping, the part that
 302 matched inside goes into the special variables C<$1>, C<$2>, etc.
 303 They can be used just as ordinary variables:
 304
 305     # extract hours, minutes, seconds
 306     $time =~ /(\d\d):(\d\d):(\d\d)/;  # match hh:mm:ss format
 307     $hours = $1;
 308     $minutes = $2;
 309     $seconds = $3;
 310
 311 In list context, a match C</regex/> with groupings will return the
 312 list of matched values C<($1,$2,...)>.  So we could rewrite it as
 313
  314     ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);
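
A slightly more defensive version checks that the match succeeded before
using the values.  This is a sketch only; the input string and the
C<$total> calculation are invented for illustration.

```perl
use strict;
use warnings;

my $time = "12:34:56";   # hypothetical input

my $total;
# In list context the match returns ($1, $2, $3), or an empty
# list on failure, which makes the condition false.
if (my ($hours, $minutes, $seconds) = $time =~ /(\d\d):(\d\d):(\d\d)/) {
    $total = $hours * 3600 + $minutes * 60 + $seconds;
    print "$hours h, $minutes min, $seconds s = $total seconds\n";
}
else {
    print "'$time' is not in hh:mm:ss format\n";
}
# prints "12 h, 34 min, 56 s = 45296 seconds"
```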
 315
 316 If the groupings in a regex are nested, C<$1> gets the group with the
 317 leftmost opening parenthesis, C<$2> the next opening parenthesis,
 318 etc.  For example, here is a complex regex and the matching variables
 319 indicated below it:
 320
 321     /(ab(cd|ef)((gi)|j))/;
 322      1  2      34
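
The numbering can be checked by running that regex against two strings
and listing what each group captured.  In this sketch, C<qr//> is used
only to avoid writing the regex twice, and the sample strings are
assumptions.

```perl
use strict;
use warnings;

my $regex = qr/(ab(cd|ef)((gi)|j))/;

# In list context a match returns ($1, $2, $3, $4).
my @first  = "abcdgi" =~ $regex;
my @second = "abefj"  =~ $regex;

# A group that took no part in the match comes back as undef.
for my $groups (\@first, \@second) {
    print join(", ", map { defined $_ ? "'$_'" : 'undef' } @$groups), "\n";
}
# prints:
#   'abcdgi', 'cd', 'gi', 'gi'
#   'abefj', 'ef', 'j', undef
```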
 323
 324 Associated with the matching variables C<$1>, C<$2>, ... are
 325 the B<backreferences> C<\g1>, C<\g2>, ...  Backreferences are
 326 matching variables that can be used I<inside> a regex:
 327
 328     /(\w\w\w)\s\g1/; # find sequences like 'the the' in string
 329
 330 C<$1>, C<$2>, ... should only be used outside of a regex, and C<\g1>,
 331 C<\g2>, ... only inside a regex.
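
As a small sketch of a backreference in use, the code below looks for a
repeated three-letter sequence and reports it.  The sample sentences and
variable names are made up.

```perl
use strict;
use warnings;

my $typo  = "Paris in the the spring";
my $clean = "Paris in the spring";

# \g1 must match the same text that group 1 captured.
my $doubled = $typo  =~ /(\w\w\w)\s\g1/ ? $1 : undef;
my $found   = $clean =~ /(\w\w\w)\s\g1/ ? 1  : 0;

print "doubled: $doubled\n";          # prints "doubled: the"
print "found in clean: $found\n";     # prints "found in clean: 0"
```

Note that this regex only checks three word characters on each side, so
it would also match in text such as C<"the theory">.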
 332
 333 =head2 Matching repetitions
 334
 335 The B<quantifier> metacharacters C<?>, C<*>, C<+>, and C<{}> allow us
 336 to determine the number of repeats of a portion of a regex we
 337 consider to be a match.  Quantifiers are put immediately after the
 338 character, character class, or grouping that we want to specify.  They
 339 have the following meanings:
 340
 341 =over 4
 342
 343 =item *
 344
 345 C<a?> = match 'a' 1 or 0 times
 346
 347 =item *
 348
 349 C<a*> = match 'a' 0 or more times, i.e., any number of times
 350
 351 =item *
 352
 353 C<a+> = match 'a' 1 or more times, i.e., at least once
 354
 355 =item *
 356
 357 C<a{n,m}> = match at least C<n> times, but not more than C<m>
 358 times.
 359
 360 =item *
 361
  362 C<a{n,}> = match at least C<n> times
 363
 364 =item *
 365
 366 C<a{,n}> = match C<n> times or fewer
 367
 368 =item *
 369
 370 C<a{n}> = match exactly C<n> times
 371
 372 =back
 373
 374 Here are some examples:
 375
 376     /[a-z]+\s+\d*/;  # match a lowercase word, at least some space, and
 377                      # any number of digits
 378     /(\w+)\s+\g1/;    # match doubled words of arbitrary length
 379     $year =~ /^\d{2,4}$/;  # make sure year is at least 2 but not more
 380                            # than 4 digits
 381     $year =~ /^\d{ 4 }$|^\d{2}$/; # better match; throw out 3 digit dates
 382
 383 These quantifiers will try to match as much of the string as possible,
 384 while still allowing the regex to match.  So we have
 385
 386     $x = 'the cat in the hat';
 387     $x =~ /^(.*)(at)(.*)$/; # matches,
 388                             # $1 = 'the cat in the h'
 389                             # $2 = 'at'
 390                             # $3 = ''   (0 matches)
 391
 392 The first quantifier C<.*> grabs as much of the string as possible
 393 while still having the regex match. The second quantifier C<.*> has
 394 no string left to it, so it matches 0 times.
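
The greedy behaviour and the year check can be tried together in a few
lines.  This is a sketch under assumed inputs; the candidate years are
invented.

```perl
use strict;
use warnings;

my $x = 'the cat in the hat';

# The first .* takes as much as it can while still letting
# 'at' and the second .* match.
my ($before, $at, $after) = $x =~ /^(.*)(at)(.*)$/;
print "[$before] [$at] [$after]\n";   # prints "[the cat in the h] [at] []"

# Accept exactly 2 or exactly 4 digits, nothing else.
my @years = grep { /^\d{4}$|^\d{2}$/ } qw(99 1999 123 20245);
print "valid years: @years\n";        # prints "valid years: 99 1999"
```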
 395
 396 =head2 More matching
 397
 398 There are a few more things you might want to know about matching
 399 operators.
 400 The global modifier C</g> allows the matching operator to match
 401 within a string as many times as possible.  In scalar context,
 402 successive matches against a string will have C</g> jump from match
 403 to match, keeping track of position in the string as it goes along.
 404 You can get or set the position with the C<pos()> function.
 405 For example,
 406
 407     $x = "cat dog house"; # 3 words
 408     while ($x =~ /(\w+)/g) {
 409         print "Word is $1, ends at position ", pos $x, "\n";
 410     }
 411
 412 prints
 413
 414     Word is cat, ends at position 3
 415     Word is dog, ends at position 7
 416     Word is house, ends at position 13
 417
 418 A failed match or changing the target string resets the position.  If
 419 you don't want the position reset after failure to match, add the
  420 C</c> modifier, as in C</regex/gc>.
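
The reset can be observed directly.  In this sketch (variable names are
hypothetical) the end positions are collected during the loop, and
C<pos> is checked again once the loop's final match has failed.

```perl
use strict;
use warnings;

my $x = "cat dog house";

# Collect the position after each successful match.
my @ends;
while ($x =~ /(\w+)/g) {
    push @ends, pos($x);
}

# The loop ended on a failed match, so the position has been reset.
my $state = defined pos($x) ? 'set' : 'reset';

print "ends: @ends\n";            # prints "ends: 3 7 13"
print "position is $state\n";     # prints "position is reset"
```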
 421
 422 In list context, C</g> returns a list of matched groupings, or if
 423 there are no groupings, a list of matches to the whole regex.  So
 424
 425     @words = ($x =~ /(\w+)/g);  # matches,
  426                                 # $words[0] = 'cat'
  427                                 # $words[1] = 'dog'
  428                                 # $words[2] = 'house'
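
Because a regex with two groupings returns the captures of every match
as one flat list, the result can be assigned straight to a hash.  This
is a minimal sketch; the C<key=value> string and the names C<$config>,
C<%settings> and C<@numbers> are assumptions.

```perl
use strict;
use warnings;

# Hypothetical input in key=value form.
my $config = "host=example.org port=8080 user=guest";

# Each match contributes ($1, $2), so the list alternates key, value.
my %settings = $config =~ /(\w+)=(\S+)/g;

# With no groupings, /g returns each whole match instead.
my @numbers = "4 apples and 12 pears" =~ /\d+/g;

print "$_ => $settings{$_}\n" for sort keys %settings;
print "numbers: @numbers\n";   # prints "numbers: 4 12"
```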
 429
 430 =head2 Search and replace
 431
 432 Search and replace is performed using C<s/regex/replacement/modifiers>.
 433 The C<replacement> is a Perl double-quoted string that replaces in the
 434 string whatever is matched with the C<regex>.  The operator C<=~> is
 435 also used here to associate a string with C<s///>.  If matching
 436 against C<$_>, the S<C<$_ =~>> can be dropped.  If there is a match,
 437 C<s///> returns the number of substitutions made; otherwise it returns
 438 false.  Here are a few examples:
 439
 440     $x = "Time to feed the cat!";
 441     $x =~ s/cat/hacker/;   # $x contains "Time to feed the hacker!"
 442     $y = "'quoted words'";
 443     $y =~ s/^'(.*)'$/$1/;  # strip single quotes,
 444                            # $y contains "quoted words"
 445
 446 With the C<s///> operator, the matched variables C<$1>, C<$2>, etc.
 447 are immediately available for use in the replacement expression. With
 448 the global modifier, C<s///g> will search and replace all occurrences
 449 of the regex in the string:
 450
 451     $x = "I batted 4 for 4";
 452     $x =~ s/4/four/;   # $x contains "I batted four for 4"
 453     $x = "I batted 4 for 4";
 454     $x =~ s/4/four/g;  # $x contains "I batted four for four"
 455
 456 The non-destructive modifier C<s///r> causes the result of the substitution
 457 to be returned instead of modifying C<$_> (or whatever variable the
 458 substitute was bound to with C<=~>):
 459
 460     $x = "I like dogs.";
 461     $y = $x =~ s/dogs/cats/r;
 462     print "$x $y\n"; # prints "I like dogs. I like cats."
 463
 464     $x = "Cats are great.";
 465     print $x =~ s/Cats/Dogs/r =~ s/Dogs/Frogs/r =~
 466         s/Frogs/Hedgehogs/r, "\n";
 467     # prints "Hedgehogs are great."
 468
 469     @foo = map { s/[a-z]/X/r } qw(a b c 1 2 3);
 470     # @foo is now qw(X X X 1 2 3)
 471
 472 The evaluation modifier C<s///e> wraps an C<eval{...}> around the
 473 replacement string and the evaluated result is substituted for the
 474 matched substring.  Some examples:
 475
 476     # reverse all the words in a string
 477     $x = "the cat in the hat";
 478     $x =~ s/(\w+)/reverse $1/ge;   # $x contains "eht tac ni eht tah"
 479
 480     # convert percentage to decimal
 481     $x = "A 39% hit rate";
 482     $x =~ s!(\d+)%!$1/100!e;       # $x contains "A 0.39 hit rate"
 483
 484 The last example shows that C<s///> can use other delimiters, such as
 485 C<s!!!> and C<s{}{}>, and even C<s{}//>.  If single quotes are used
 486 C<s'''>, then the regex and replacement are treated as single-quoted
 487 strings.
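
The return value and the modifiers can be combined, as in the sketch
below.  The variable names are made up, and the C</r> modifier requires
Perl 5.14 or later.

```perl
use strict;
use warnings;

my $x = "I batted 4 for 4";

# With /g, s/// returns the number of substitutions made.
my $count = ($x =~ s/4/four/g);
print "$count changes: $x\n";   # prints "2 changes: I batted four for four"

# A failed substitution returns false and leaves the string alone.
my $outcome = ($x =~ s/seven/7/) ? 'replaced' : 'no match';
print "$outcome\n";             # prints "no match"

# /e evaluates the replacement as code; /r returns the result
# and leaves the original string unmodified.
my $original = "A 39% hit rate";
my $decimal  = $original =~ s!(\d+)%!$1/100!er;
print "$original\n";            # prints "A 39% hit rate"
print "$decimal\n";             # prints "A 0.39 hit rate"
```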
 488
 489 =head2 The split operator
 490
 491 C<split /regex/, string> splits C<string> into a list of substrings
 492 and returns that list.  The regex determines the character sequence
 493 that C<string> is split with respect to.  For example, to split a
 494 string into words, use
 495
 496     $x = "Calvin and Hobbes";
 497     @word = split /\s+/, $x;  # $word[0] = 'Calvin'
 498                               # $word[1] = 'and'
 499                               # $word[2] = 'Hobbes'
 500
 501 To extract a comma-delimited list of numbers, use
 502
 503     $x = "1.618,2.718,   3.142";
 504     @const = split /,\s*/, $x;  # $const[0] = '1.618'
 505                                 # $const[1] = '2.718'
 506                                 # $const[2] = '3.142'
 507
 508 If the empty regex C<//> is used, the string is split into individual
 509 characters.  If the regex has groupings, then the list produced contains
 510 the matched substrings from the groupings as well:
 511
 512     $x = "/usr/bin";
 513     @parts = split m!(/)!, $x;  # $parts[0] = ''
 514                                 # $parts[1] = '/'
 515                                 # $parts[2] = 'usr'
 516                                 # $parts[3] = '/'
 517                                 # $parts[4] = 'bin'
 518
  519 Since the first character of C<$x> matched the regex, C<split>
  520 prepended an empty initial element to the list.
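
The three behaviours described in this section can be compared in one
short sketch.  The array names are hypothetical, and the element counts
are printed to make the extra fields easy to spot.

```perl
use strict;
use warnings;

# Separator: a comma followed by optional whitespace.
my @const = split /,\s*/, "1.618,2.718,   3.142";

# Empty regex: one element per character.
my @chars = split //, "cat";

# A grouping in the regex keeps the separators in the result.
my @parts = split m!(/)!, "/usr/bin";

print scalar(@const), ": @const\n";           # prints "3: 1.618 2.718 3.142"
print scalar(@chars), ": @chars\n";           # prints "3: c a t"
print scalar(@parts), ": ", join("|", @parts), "\n";
# prints "5: |/|usr|/|bin"
```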
 521
 522 =head2 C<use re 'strict'>
 523
 524 New in v5.22, this applies stricter rules than otherwise when compiling
 525 regular expression patterns.  It can find things that, while legal, may
 526 not be what you intended.
 527
 528 See L<'strict' in re|re/'strict' mode>.
 529
 530 =head1 BUGS
 531
 532 None.
 533
 534 =head1 SEE ALSO
 535
 536 This is just a quick start guide.  For a more in-depth tutorial on
 537 regexes, see L<perlretut> and for the reference page, see L<perlre>.
 538
 539 =head1 AUTHOR AND COPYRIGHT
 540
 541 Copyright (c) 2000 Mark Kvale
 542 All rights reserved.
 543
 544 This document may be distributed under the same terms as Perl itself.
 545
 546 =head2 Acknowledgments
 547
 548 The author would like to thank Mark-Jason Dominus, Tom Christiansen,
 549 Ilya Zakharevich, Brad Hughes, and Mike Giroux for all their helpful
 550 comments.
 551
 552 =cut
 553