perlrequick - Perl regular expressions quick start

This page covers the very basics of understanding, creating and
using regular expressions ('regexes') in Perl.

This page assumes you already know things, like what a "pattern" is, and
the basic syntax of using them. If you don't, see L<perlretut>.

=head2 Simple word matching

The simplest regex is simply a word, or more generally, a string of
characters. A regex consisting of a word matches any string that
contains that word:

    "Hello World" =~ /World/;  # matches

In this statement, C<World> is a regex and the C<//> enclosing
C</World/> tells Perl to search a string for a match. The operator
C<=~> associates the string with the regex match and produces a true
value if the regex matched, or false if the regex did not match. In
our case, C<World> matches the second word in C<"Hello World">, so the
expression is true. This idea has several variations.

Expressions like this are useful in conditionals:

    print "It matches\n" if "Hello World" =~ /World/;

The sense of the match can be reversed by using the C<!~> operator:

    print "It doesn't match\n" if "Hello World" !~ /World/;

The literal string in the regex can be replaced by a variable:

    $greeting = "World";
    print "It matches\n" if "Hello World" =~ /$greeting/;

If you're matching against C<$_>, the C<$_ =~> part can be omitted:

    $_ = "Hello World";
    print "It matches\n" if /World/;

Finally, the C<//> default delimiters for a match can be changed to
arbitrary delimiters by putting an C<'m'> out front:

    "Hello World" =~ m!World!;      # matches, delimited by '!'
    "Hello World" =~ m{World};      # matches, note the matching '{}'
    "/usr/bin/perl" =~ m"/perl";    # matches after '/usr/bin',
                                    # '/' becomes an ordinary char

Regexes must match a part of the string I<exactly> in order for the
statement to be true:

    "Hello World" =~ /world/;  # doesn't match, case sensitive
    "Hello World" =~ /o W/;    # matches, ' ' is an ordinary char
    "Hello World" =~ /World /; # doesn't match, no ' ' at end

Perl will always match at the earliest possible point in the string:

    "Hello World" =~ /o/;       # matches 'o' in 'Hello'
    "That hat is red" =~ /hat/; # matches 'hat' in 'That'

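The "earliest possible point" rule can be checked by asking Perl where a match began. The following is a minimal sketch, assuming the built-in C<@-> array is used for this (its first element holds the start offset of the last successful match); the variable names are illustrative only:

```perl
use strict;
use warnings;

my @offsets;
for my $pair (["Hello World", "o"], ["That hat is red", "hat"]) {
    my ($string, $word) = @$pair;
    # $-[0] holds the offset where the last successful match began
    push @offsets, $-[0] if $string =~ /$word/;
}
print "match offsets: @offsets\n";
```

The first offset points at the 'o' in 'Hello' rather than the one in 'World', and the second points inside 'That' rather than at the standalone word 'hat'.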
Not all characters can be used 'as is' in a match. Some characters,
called B<metacharacters>, are reserved for use in regex notation.
The metacharacters are

    {}[]()^$.|*+?\

A metacharacter can be matched by putting a backslash before it:

    "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
    "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
    'C:\WIN32' =~ /C:\\WIN/;                       # matches
    "/usr/bin/perl" =~ /\/usr\/bin\/perl/;  # matches

In the last regex, the forward slash C<'/'> is also backslashed,
because it is used to delimit the regex.

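When the text to match comes from a variable, backslashing each metacharacter by hand is error-prone. The sketch below assumes the built-in C<quotemeta> function, or the equivalent C<\Q...\E> escape inside the regex, is acceptable for the job; the variable names are illustrative only:

```perl
use strict;
use warnings;

my $literal = '2+2';                 # text we want to match literally
my $escaped = quotemeta $literal;    # backslashes every non-word character

my $plain  = ("2+2=4" =~ /$literal/)     ? 1 : 0;   # '+' acts as a quantifier here
my $quoted = ("2+2=4" =~ /$escaped/)     ? 1 : 0;   # '+' is matched literally
my $inline = ("2+2=4" =~ /\Q$literal\E/) ? 1 : 0;   # same effect as quotemeta

print "escaped=$escaped plain=$plain quoted=$quoted inline=$inline\n";
```

Only the two escaped forms match, because the unescaped pattern looks for one or more '2' characters followed by another '2'.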
Non-printable ASCII characters are represented by B<escape sequences>.
Common examples are C<\t> for a tab, C<\n> for a newline, and C<\r>
for a carriage return. Arbitrary bytes are represented by octal
escape sequences, e.g., C<\033>, or hexadecimal escape sequences,
e.g., C<\x1B>:

    "1000\t2000" =~ m(0\t2);      # matches
    "cat" =~ /\143\x61\x74/;      # matches in ASCII, but
                                  # a weird way to spell cat

Regexes are treated mostly as double-quoted strings, so variable
substitution works:

    $foo = 'house';
    'cathouse' =~ /cat$foo/;   # matches
    'housecat' =~ /${foo}cat/; # matches

With all of the regexes above, if the regex matched anywhere in the
string, it was considered a match. To specify I<where> it should
match, we would use the B<anchor> metacharacters C<^> and C<$>. The
anchor C<^> means match at the beginning of the string and the anchor
C<$> means match at the end of the string, or before a newline at the
end of the string. Some examples:

    "housekeeper" =~ /keeper/;         # matches
    "housekeeper" =~ /^keeper/;        # doesn't match
    "housekeeper" =~ /keeper$/;        # matches
    "housekeeper\n" =~ /keeper$/;      # matches
    "housekeeper" =~ /^housekeeper$/;  # matches

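Using both anchors together is a common way to test that a whole string, and not just part of it, has a given form. A minimal sketch with a made-up candidate list:

```perl
use strict;
use warnings;

# Keep only the strings that are exactly the word "keeper",
# optionally followed by one trailing newline.
my @candidates = ("keeper", "keeper\n", "housekeeper", "keepers");
my @exact = grep { /^keeper$/ } @candidates;

printf "%d of %d candidates matched exactly\n", scalar @exact, scalar @candidates;
```

The second candidate is accepted because C<$> also matches before a newline at the end of the string.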
=head2 Using character classes

A B<character class> allows a set of possible characters, rather than
just a single character, to match at a particular point in a regex.
Character classes are denoted by brackets C<[...]>, with the set of
characters to be possibly matched inside. Here are some examples:

    /cat/;            # matches 'cat'
    /[bcr]at/;        # matches 'bat', 'cat', or 'rat'
    "abc" =~ /[cab]/; # matches 'a'

In the last statement, even though C<'c'> is the first character in
the class, the earliest point at which the regex can match is C<'a'>.

    /[yY][eE][sS]/; # match 'yes' in a case-insensitive way
                    # 'yes', 'Yes', 'YES', etc.
    /yes/i;         # also match 'yes' in a case-insensitive way

The last example shows a match with an C<'i'> B<modifier>, which makes
the match case-insensitive.

Character classes also have ordinary and special characters, but the
sets of ordinary and special characters inside a character class are
different than those outside a character class. The special
characters for a character class are C<-]\^$> and are matched using an
escape:

    /[\]c]def/; # matches ']def' or 'cdef'
    $x = 'bcr';
    /[$x]at/;   # matches 'bat', 'cat', or 'rat'
    /[\$x]at/;  # matches '$at' or 'xat'
    /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'

The special character C<'-'> acts as a range operator within character
classes, so that the unwieldy C<[0123456789]> and C<[abc...xyz]>
become the svelte C<[0-9]> and C<[a-z]>:

    /item[0-9]/;    # matches 'item0' or ... or 'item9'
    /[0-9a-fA-F]/;  # matches a hexadecimal digit

If C<'-'> is the first or last character in a character class, it is
treated as an ordinary character.

The special character C<^> in the first position of a character class
denotes a B<negated character class>, which matches any character but
those in the brackets. Both C<[...]> and C<[^...]> must match a
character, or the match fails. For example:

    /[^a]at/;  # doesn't match 'aat' or 'at', but matches
               # all other 'bat', 'cat', '0at', '%at', etc.
    /[^0-9]/;  # matches a non-numeric character
    /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary

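The rules for ordinary, negated, and dash-containing classes can be compared side by side. This is only an illustrative sketch; the test strings are made up, and the anchors are added so that each string must match in full:

```perl
use strict;
use warnings;

my @results;
for my $s ('bat', 'cat', 'aat', 'at', '-at') {
    my $in_class = ($s =~ /^[bcr]at$/) ? 1 : 0;   # one of b, c, r, then 'at'
    my $negated  = ($s =~ /^[^a]at$/)  ? 1 : 0;   # any char except 'a', then 'at'
    my $dash     = ($s =~ /^[a-]at$/)  ? 1 : 0;   # trailing '-' is a literal dash
    push @results, "$s:$in_class$negated$dash";
}
print "@results\n";
```

Note that 'at' fails all three tests: a character class, negated or not, must always consume one character.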
Perl has several abbreviations for common character classes. (These
definitions are those that Perl uses in ASCII-safe mode with the C</a> modifier.
Otherwise they could match many more non-ASCII Unicode characters as
well. See L<perlrecharclass/Backslash sequences> for details.)

\d is a digit and represents

    [0-9]

\s is a whitespace character and represents

    [\ \t\r\n\f]

\w is a word character (alphanumeric or _) and represents

    [0-9a-zA-Z_]

\D is a negated \d; it represents any character but a digit

    [^0-9]

\S is a negated \s; it represents any non-whitespace character

    [^\s]

\W is a negated \w; it represents any non-word character

    [^\w]

The period '.' matches any character but "\n"

The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside
of character classes. Here are some in use:

    /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
    /[\d\s]/;         # matches any digit or whitespace character
    /\w\W\w/;         # matches a word char, followed by a
                      # non-word char, followed by a word char
    /..rt/;           # matches any two chars, followed by 'rt'
    /end\./;          # matches 'end.'
    /end[.]/;         # same thing, matches 'end.'

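The abbreviations combine naturally with the anchors described earlier. A small sketch, using made-up input lists, that filters strings by shape:

```perl
use strict;
use warnings;

my @candidates = ('12:34:56', '1:23:45', '12-34-56', 'ab:cd:ef');
# Anchored hh:mm:ss check: exactly two digits in each field
my @times = grep { /^\d\d:\d\d:\d\d$/ } @candidates;

# A word char, a non-word char, then a word char, anywhere in the string
my @mixed = grep { /\w\W\w/ } ('a-b', 'ab', 'a b', '--');

print "times: @times\n";
print "mixed: ", join(", ", @mixed), "\n";
```

Without the anchors, the first regex would also accept longer strings that merely contain a time somewhere inside them.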
The S<B<word anchor> > C<\b> matches a boundary between a word
character and a non-word character C<\w\W> or C<\W\w>:

    $x = "Housecat catenates house and cat";
    $x =~ /\bcat/;    # matches cat in 'catenates'
    $x =~ /cat\b/;    # matches cat in 'Housecat'
    $x =~ /\bcat\b/;  # matches 'cat' at end of string

In the last example, the end of the string is considered a word
boundary.

For natural language processing (so that, for example, apostrophes are
included in words), use instead C<\b{wb}>:

    "don't" =~ / .+? \b{wb} /x;  # matches the whole string

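To see which words each form of the word anchor picks out, the matches can be collected into lists. This sketch uses only the plain C<\b> anchor, plus the C</g> modifier in list context and the C<*> quantifier, both of which are described later in this page:

```perl
use strict;
use warnings;

my $x = "Housecat catenates house and cat";

my @prefix = $x =~ /\bcat\w*/g;   # words that start with 'cat'
my @whole  = $x =~ /\bcat\b/g;    # 'cat' standing alone as a word
my @suffix = $x =~ /\w*cat\b/g;   # words that end with 'cat'

print "prefix: @prefix\n";
print "whole:  @whole\n";
print "suffix: @suffix\n";
```

The standalone 'cat' at the end of the string appears in all three lists, since it both starts and ends at a word boundary.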
=head2 Matching this or that

We can match different character strings with the B<alternation>
metacharacter C<'|'>. To match C<dog> or C<cat>, we form the regex
C<dog|cat>. As before, Perl will try to match the regex at the
earliest possible point in the string. At each character position,
Perl will first try to match the first alternative, C<dog>. If
C<dog> doesn't match, Perl will then try the next alternative, C<cat>.
If C<cat> doesn't match either, then the match fails and Perl moves to
the next position in the string. Some examples:

    "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
    "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"

Even though C<dog> is the first alternative in the second regex,
C<cat> is able to match earlier in the string.

    "cats" =~ /c|ca|cat|cats/; # matches "c"
    "cats" =~ /cats|cat|ca|c/; # matches "cats"

At a given character position, the first alternative that allows the
regex match to succeed will be the one that matches. Here, all the
alternatives match at the first string position, so the first matches.

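The effect of alternative order can be observed directly by printing what was matched. The sketch below assumes the special variable C<$&>, which holds the text matched by the last successful match; the patterns are the ones discussed above, held in an illustrative array:

```perl
use strict;
use warnings;

my @patterns = ('c|ca|cat|cats', 'cats|cat|ca|c', 'dog|cat|bird');
my @found;
for my $p (@patterns) {
    if ("cats and dogs" =~ /$p/) {
        push @found, $&;   # $& is the text matched by the last successful match
    }
}
print join(", ", @found), "\n";
```

Putting the longest alternative first, as in the second pattern, is the usual way to get the longest match when alternatives share a prefix.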
=head2 Grouping things and hierarchical matching

The B<grouping> metacharacters C<()> allow a part of a regex to be
treated as a single unit. Parts of a regex are grouped by enclosing
them in parentheses. The regex C<house(cat|keeper)> means match
C<house> followed by either C<cat> or C<keeper>. Some more examples
are

    /(a|b)b/;    # matches 'ab' or 'bb'
    /(^a|b)c/;   # matches 'ac' at start of string or 'bc' anywhere

    /house(cat|)/;  # matches either 'housecat' or 'house'
    /house(cat(s|)|)/;  # matches either 'housecats' or 'housecat' or
                        # 'house'. Note groups can be nested.

    "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
                             # because '20\d\d' can't match

=head2 Extracting matches

The grouping metacharacters C<()> also allow the extraction of the
parts of a string that matched. For each grouping, the part that
matched inside goes into the special variables C<$1>, C<$2>, etc.
They can be used just as ordinary variables:

    # extract hours, minutes, seconds
    $time =~ /(\d\d):(\d\d):(\d\d)/;  # match hh:mm:ss format
    $hours = $1;
    $minutes = $2;
    $seconds = $3;

In list context, a match C</regex/> with groupings will return the
list of matched values C<($1,$2,...)>. So we could rewrite it as

    ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

If the groupings in a regex are nested, C<$1> gets the group with the
leftmost opening parenthesis, C<$2> the next opening parenthesis,
etc. For example, here is a complex regex and the matching variables
indicated below it:

    /(ab(cd|ef)((gi)|j))/;
     1  2      34

Associated with the matching variables C<$1>, C<$2>, ... are
the B<backreferences> C<\g1>, C<\g2>, ... Backreferences are
matching variables that can be used I<inside> a regex:

    /(\w\w\w)\s\g1/; # find sequences like 'the the' in string

C<$1>, C<$2>, ... should only be used outside of a regex, and C<\g1>,
C<\g2>, ... only inside a regex.

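Both uses of groupings, extraction into variables and backreferences inside the regex, can be tried in a few lines. A minimal sketch with made-up input strings; the variable names are illustrative only:

```perl
use strict;
use warnings;

# List-context match: the three groups are returned as a list
my $time = "The build finished at 07:45:09 today";
my ($hours, $minutes, $seconds) = $time =~ /(\d\d):(\d\d):(\d\d)/;
my $total = $hours * 3600 + $minutes * 60 + $seconds;

# \g1 refers back to whatever the first group matched
my $text = "Paris in the the spring";
my $doubled = ($text =~ /\b(\w\w\w)\s\g1\b/) ? $1 : '';

print "seconds since midnight: $total\n";
print "doubled word: $doubled\n";
```

Inside the regex the repeated word is written C<\g1>; once the match has succeeded, the same text is read from C<$1>.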
=head2 Matching repetitions

The B<quantifier> metacharacters C<?>, C<*>, C<+>, and C<{}> allow us
to determine the number of repeats of a portion of a regex we
consider to be a match. Quantifiers are put immediately after the
character, character class, or grouping that we want to specify. They
have the following meanings:

C<a?> = match 'a' 1 or 0 times

C<a*> = match 'a' 0 or more times, i.e., any number of times

C<a+> = match 'a' 1 or more times, i.e., at least once

C<a{n,m}> = match at least C<n> times, but not more than C<m>
times

C<a{n,}> = match at least C<n> times

C<a{n}> = match exactly C<n> times

Here are some examples:

    /[a-z]+\s+\d*/;  # match a lowercase word, at least some space, and
                     # any number of digits
    /(\w+)\s+\g1/;   # match doubled words of arbitrary length
    $year =~ /^\d{2,4}$/;  # make sure year is at least 2 but not more
                           # than 4 digits
    $year =~ /^\d{4}$|^\d{2}$/; # better match; throw out 3 digit dates

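The difference between the two year checks shows up as soon as a three-digit value is tested. A short sketch with made-up year strings:

```perl
use strict;
use warnings;

my @years = ('99', '999', '1999', '19999');

my @loose  = grep { /^\d{2,4}$/ }       @years;   # 2, 3, or 4 digits
my @strict = grep { /^\d{4}$|^\d{2}$/ } @years;   # exactly 2 or exactly 4 digits

print "loose:  @loose\n";
print "strict: @strict\n";
```

Both regexes reject the five-digit value because the anchors force the quantified digits to cover the whole string.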
These quantifiers will try to match as much of the string as possible,
while still allowing the regex to match. So we have

    $x = 'the cat in the hat';
    $x =~ /^(.*)(at)(.*)$/; # matches,
                            # $1 = 'the cat in the h'
                            # $2 = 'at'
                            # $3 = ''   (0 matches)

The first quantifier C<.*> grabs as much of the string as possible
while still having the regex match. The second quantifier C<.*> has
no string left to it, so it matches 0 times.

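The same greedy behaviour can be confirmed on other input. In this sketch the second string is a made-up file path; because the first C<.*> takes as much as it can, the regex settles on the I<last> 'at' in the string:

```perl
use strict;
use warnings;

my $x = 'the cat in the hat';
my @parts = $x =~ /^(.*)(at)(.*)$/;    # greedy: first .* takes as much as it can
my @lens  = map { length } @parts;     # lengths of $1, $2, $3

my $path = 'bat/cat/rat.txt';
my ($head, $at, $tail) = $path =~ /^(.*)(at)(.*)$/;

print "lengths: @lens\n";
print "head='$head' tail='$tail'\n";
```

Here the third group is non-empty for the path, since text remains after the last 'at'.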
There are a few more things you might want to know about matching
operators.
The global modifier C</g> allows the matching operator to match
within a string as many times as possible. In scalar context,
successive matches against a string will have C</g> jump from match
to match, keeping track of position in the string as it goes along.
You can get or set the position with the C<pos()> function.
For example,

    $x = "cat dog house"; # 3 words
    while ($x =~ /(\w+)/g) {
        print "Word is $1, ends at position ", pos $x, "\n";
    }

prints

    Word is cat, ends at position 3
    Word is dog, ends at position 7
    Word is house, ends at position 13

A failed match or changing the target string resets the position. If
you don't want the position reset after failure to match, add the
C</c>, as in C</regex/gc>.

In list context, C</g> returns a list of matched groupings, or if
there are no groupings, a list of matches to the whole regex. So

    @words = ($x =~ /(\w+)/g);  # matches,
                                # $words[0] = 'cat'
                                # $words[1] = 'dog'
                                # $words[2] = 'house'

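The scalar and list forms of C</g> can be contrasted in one short script. This is a sketch only; the C<width=80,height=24> string and the C<%setting> hash are made-up illustrations of a regex with two groupings:

```perl
use strict;
use warnings;

my $x = "cat dog house";

# Scalar context: one match per iteration, pos() advances each time
my @ends;
while ($x =~ /\w+/g) {
    push @ends, pos $x;
}

# List context, no groupings: every match of the whole regex
my @words = $x =~ /\w+/g;

# List context, two groupings: all captures, flattened into one list
my %setting = "width=80,height=24" =~ /(\w+)=(\d+)/g;

print "ends: @ends\n";
print "words: @words\n";
print "$_ => $setting{$_}\n" for sort keys %setting;
```

Because the list-context match returns the captures pairwise, assigning it to a hash turns each key and value pair into one entry.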
=head2 Search and replace

Search and replace is performed using C<s/regex/replacement/modifiers>.
The C<replacement> is a Perl double-quoted string that replaces in the
string whatever is matched with the C<regex>. The operator C<=~> is
also used here to associate a string with C<s///>. If matching
against C<$_>, the S<C<$_ =~>> can be dropped. If there is a match,
C<s///> returns the number of substitutions made; otherwise it returns
false. Here are a few examples:

    $x = "Time to feed the cat!";
    $x =~ s/cat/hacker/;   # $x contains "Time to feed the hacker!"
    $y = "'quoted words'";
    $y =~ s/^'(.*)'$/$1/;  # strip single quotes,
                           # $y contains "quoted words"

With the C<s///> operator, the matched variables C<$1>, C<$2>, etc.
are immediately available for use in the replacement expression. With
the global modifier, C<s///g> will search and replace all occurrences
of the regex in the string:

    $x = "I batted 4 for 4";
    $x =~ s/4/four/;   # $x contains "I batted four for 4"
    $x = "I batted 4 for 4";
    $x =~ s/4/four/g;  # $x contains "I batted four for four"

The non-destructive modifier C<s///r> causes the result of the substitution
to be returned instead of modifying C<$_> (or whatever variable the
substitute was bound to with C<=~>):

    $x = "I like dogs.";
    $y = $x =~ s/dogs/cats/r;
    print "$x $y\n"; # prints "I like dogs. I like cats."

    $x = "Cats are great.";
    print $x =~ s/Cats/Dogs/r =~ s/Dogs/Frogs/r =~
        s/Frogs/Hedgehogs/r, "\n";
    # prints "Hedgehogs are great."

    @foo = map { s/[a-z]/X/r } qw(a b c 1 2 3);
    # @foo is now qw(X X X 1 2 3)

The evaluation modifier C<s///e> wraps an C<eval{...}> around the
replacement string and the evaluated result is substituted for the
matched substring. Some examples:

    # reverse all the words in a string
    $x = "the cat in the hat";
    $x =~ s/(\w+)/reverse $1/ge;   # $x contains "eht tac ni eht tah"

    # convert percentage to decimal
    $x = "A 39% hit rate";
    $x =~ s!(\d+)%!$1/100!e;       # $x contains "A 0.39 hit rate"

The last example shows that C<s///> can use other delimiters, such as
C<s!!!> and C<s{}{}>, and even C<s{}//>. If single quotes are used
C<s'''>, then the regex and replacement are treated as single-quoted
strings.

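The modifiers C</g>, C</e>, and C</r> can be combined, and the return value of C<s///> can be captured. A minimal sketch on a made-up string; the variable names are illustrative only:

```perl
use strict;
use warnings;

my $template = "Total: 3 apples and 12 pears";

# Copy first, then modify the copy; /e evaluates the replacement as code
(my $doubled = $template) =~ s/(\d+)/$1 * 2/ge;

# /r returns the changed string and leaves $template alone
my $masked = $template =~ s/\d/#/gr;

# Without /r the target is modified and the substitution count is returned
my $count = ($template =~ s/\d+/N/g);

print "$doubled\n$masked\n$template ($count substitutions)\n";
```

Only the last statement changes C<$template> itself, which is why it is placed after the two non-destructive uses.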
=head2 The split operator

C<split /regex/, string> splits C<string> into a list of substrings
and returns that list. The regex determines the character sequence
that C<string> is split with respect to. For example, to split a
string into words, use

    $x = "Calvin and Hobbes";
    @word = split /\s+/, $x;  # $word[0] = 'Calvin'
                              # $word[1] = 'and'
                              # $word[2] = 'Hobbes'

To extract a comma-delimited list of numbers, use

    $x = "1.618,2.718,   3.142";
    @const = split /,\s*/, $x;  # $const[0] = '1.618'
                                # $const[1] = '2.718'
                                # $const[2] = '3.142'

If the empty regex C<//> is used, the string is split into individual
characters. If the regex has groupings, then the list produced contains
the matched substrings from the groupings as well:

    $x = "/usr/bin";
    @parts = split m!(/)!, $x;  # $parts[0] = ''
                                # $parts[1] = '/'
                                # $parts[2] = 'usr'
                                # $parts[3] = '/'
                                # $parts[4] = 'bin'

Since the first character of C<$x> matched the regex, C<split> prepended
an empty initial element to the list.

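Two calls to C<split> are often nested to parse simple delimited records. The sketch below assumes a made-up C<key = value ; key = value> line format, and also exercises the empty regex and a grouping in the separator:

```perl
use strict;
use warnings;

my $line = "name = Calvin ; pet=Hobbes;  kind =tiger";

my %field;
for my $pair (split /\s*;\s*/, $line) {          # records, whitespace trimmed
    my ($key, $value) = split /\s*=\s*/, $pair;  # key and value of one record
    $field{$key} = $value;
}

my @chars    = split //, "abc";       # empty regex: individual characters
my @with_sep = split /(,)/, "1,2";    # grouping: separators are kept too

print "$_: $field{$_}\n" for sort keys %field;
print "chars: @chars\n";
print "with separators: @with_sep\n";
```

Allowing optional whitespace on both sides of each separator keeps the resulting keys and values free of stray spaces.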
=head2 C<use re 'strict'>

New in v5.22, this applies stricter rules than otherwise when compiling
regular expression patterns. It can find things that, while legal, may
not be what you intended.

See L<'strict' in re|re/'strict' mode>.

This is just a quick start guide. For a more in-depth tutorial on
regexes, see L<perlretut> and for the reference page, see L<perlre>.

=head1 AUTHOR AND COPYRIGHT

Copyright (c) 2000 Mark Kvale

This document may be distributed under the same terms as Perl itself.

=head2 Acknowledgments

The author would like to thank Mark-Jason Dominus, Tom Christiansen,
Ilya Zakharevich, Brad Hughes, and Mike Giroux for all their helpful
comments.