3 perlrequick - Perl regular expressions quick start
7 This page covers the very basics of understanding, creating and
8 using regular expressions ('regexes') in Perl.
This page assumes you already know what a "pattern" is and the basic
syntax for using one. If you don't, see L<perlretut>.
16 =head2 Simple word matching
The simplest regex is simply a word, or more generally, a string of
characters. A regex consisting of a word matches any string that
contains that word:

    "Hello World" =~ /World/;  # matches

24 In this statement, C<World> is a regex and the C<//> enclosing
25 C</World/> tells Perl to search a string for a match. The operator
26 C<=~> associates the string with the regex match and produces a true
27 value if the regex matched, or false if the regex did not match. In
28 our case, C<World> matches the second word in C<"Hello World">, so the
29 expression is true. This idea has several variations.
31 Expressions like this are useful in conditionals:
33 print "It matches\n" if "Hello World" =~ /World/;
The sense of the match can be reversed by using the C<!~> operator:
37 print "It doesn't match\n" if "Hello World" !~ /World/;
The literal string in the regex can be replaced by a variable:

    $greeting = "World";
    print "It matches\n" if "Hello World" =~ /$greeting/;

If you're matching against C<$_>, the C<$_ =~> part can be omitted:

    $_ = "Hello World";
    print "It matches\n" if /World/;

49 Finally, the C<//> default delimiters for a match can be changed to
50 arbitrary delimiters by putting an C<'m'> out front:
52 "Hello World" =~ m!World!; # matches, delimited by '!'
53 "Hello World" =~ m{World}; # matches, note the matching '{}'
54 "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
55 # '/' becomes an ordinary char
Regexes must match a part of the string I<exactly> in order for the
statement to be true:

    "Hello World" =~ /world/;  # doesn't match, case sensitive
    "Hello World" =~ /o W/;    # matches, ' ' is an ordinary char
    "Hello World" =~ /World /; # doesn't match, no ' ' at end

64 Perl will always match at the earliest possible point in the string:
66 "Hello World" =~ /o/; # matches 'o' in 'Hello'
67 "That hat is red" =~ /hat/; # matches 'hat' in 'That'
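
To see where the earliest match actually landed, one option is to inspect
Perl's built-in C<@-> and C<@+> arrays, which hold the start and end offsets
of the last successful match. They are not introduced in this guide, so treat
the following as a minimal sketch rather than part of the tutorial proper:

```perl
use strict;
use warnings;

# Report where /hat/ first matches, using the built-in @- and @+ arrays.
my $string = "That hat is red";
my ($start, $end);
if ($string =~ /hat/) {
    ($start, $end) = ($-[0], $+[0]);   # offsets of the whole match
    print "matched '", substr($string, $start, $end - $start),
          "' at offsets $start..$end\n";
    # prints "matched 'hat' at offsets 1..4", i.e. the 'hat' inside 'That'
}
```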
Not all characters can be used 'as is' in a match. Some characters,
called B<metacharacters>, are considered special, and reserved for use
in regex notation. The metacharacters are

    {}[]()^$.|*+?\

A metacharacter can be matched literally by putting a backslash before
it:

    "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
    "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
    'C:\WIN32' =~ /C:\\WIN/;                # matches
    "/usr/bin/perl" =~ /\/usr\/bin\/perl/;  # matches

83 In the last regex, the forward slash C<'/'> is also backslashed,
84 because it is used to delimit the regex.
86 Most of the metacharacters aren't always special, and other characters
87 (such as the ones delimiting the pattern) become special under various
88 circumstances. This can be confusing and lead to unexpected results.
L<S<C<use re 'strict'>>|re/'strict' mode> can notify you of potential
pitfalls.
Non-printable ASCII characters are represented by B<escape sequences>.
Common examples are C<\t> for a tab, C<\n> for a newline, and C<\r>
for a carriage return. Arbitrary bytes are represented by octal
escape sequences, e.g., C<\033>, or hexadecimal escape sequences,
e.g., C<\x1B>:

    "1000\t2000" =~ m(0\t2);   # matches
    "cat" =~ /\143\x61\x74/;   # matches in ASCII, but
                               # a weird way to spell cat

Regexes are treated mostly as double-quoted strings, so variable
substitution works:

    $foo = 'house';
    'cathouse' =~ /cat$foo/;    # matches
    'housecat' =~ /${foo}cat/;  # matches

109 With all of the regexes above, if the regex matched anywhere in the
110 string, it was considered a match. To specify I<where> it should
111 match, we would use the B<anchor> metacharacters C<^> and C<$>. The
112 anchor C<^> means match at the beginning of the string and the anchor
113 C<$> means match at the end of the string, or before a newline at the
114 end of the string. Some examples:
116 "housekeeper" =~ /keeper/; # matches
117 "housekeeper" =~ /^keeper/; # doesn't match
118 "housekeeper" =~ /keeper$/; # matches
119 "housekeeper\n" =~ /keeper$/; # matches
120 "housekeeper" =~ /^housekeeper$/; # matches
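
Using both anchors together is a common way to require that a whole string,
not just part of it, matches. The sketch below assumes a hypothetical helper
named C<is_exactly> (it is not part of Perl or of this guide) and a word that
contains no metacharacters:

```perl
use strict;
use warnings;

# is_exactly() is a hypothetical helper for illustration only.
# It returns 1 if $string consists of $word alone, optionally followed
# by a single trailing newline, and 0 otherwise.
sub is_exactly {
    my ($string, $word) = @_;
    return $string =~ /^$word$/ ? 1 : 0;
}

my @results = map { is_exactly($_, "keeper") }
    ("keeper", "keeper\n", "housekeeper", "keepers");
print "@results\n";   # prints "1 1 0 0"
```

The second string still matches because C<$> also matches just before a
newline at the end of the string.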
122 =head2 Using character classes
124 A B<character class> allows a set of possible characters, rather than
125 just a single character, to match at a particular point in a regex.
126 There are a number of different types of character classes, but usually
127 when people use this term, they are referring to the type described in
128 this section, which are technically called "Bracketed character
129 classes", because they are denoted by brackets C<[...]>, with the set of
130 characters to be possibly matched inside. But we'll drop the "bracketed"
131 below to correspond with common usage. Here are some examples of
132 (bracketed) character classes:
134 /cat/; # matches 'cat'
135 /[bcr]at/; # matches 'bat', 'cat', or 'rat'
136 "abc" =~ /[cab]/; # matches 'a'
138 In the last statement, even though C<'c'> is the first character in
139 the class, the earliest point at which the regex can match is C<'a'>.
141 /[yY][eE][sS]/; # match 'yes' in a case-insensitive way
142 # 'yes', 'Yes', 'YES', etc.
143 /yes/i; # also match 'yes' in a case-insensitive way
145 The last example shows a match with an C<'i'> B<modifier>, which makes
146 the match case-insensitive.
148 Character classes also have ordinary and special characters, but the
149 sets of ordinary and special characters inside a character class are
150 different than those outside a character class. The special
characters for a character class are C<-]\^$> and are matched using an
escape:

    /[\]c]def/;  # matches ']def' or 'cdef'
    $x = 'bcr';
    /[$x]at/;    # matches 'bat', 'cat', or 'rat'
    /[\$x]at/;   # matches '$at' or 'xat'
    /[\\$x]at/;  # matches '\at', 'bat', 'cat', or 'rat'

160 The special character C<'-'> acts as a range operator within character
161 classes, so that the unwieldy C<[0123456789]> and C<[abc...xyz]>
162 become the svelte C<[0-9]> and C<[a-z]>:
164 /item[0-9]/; # matches 'item0' or ... or 'item9'
165 /[0-9a-fA-F]/; # matches a hexadecimal digit
167 If C<'-'> is the first or last character in a character class, it is
168 treated as an ordinary character.
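
As a small illustration of a literal C<'-'>, the sketch below keeps only the
strings made up entirely of digits and hyphens. The sample strings are made up
for this example:

```perl
use strict;
use warnings;

# '-' is the last character in the class, so it is an ordinary hyphen,
# while '0-9' is still a range.
my @candidates = ("555-1234", "5551234", "555 1234", "555_1234");
my @matched = grep { /^[0-9-]+$/ } @candidates;
print join(", ", @matched), "\n";   # prints "555-1234, 5551234"
```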
170 The special character C<^> in the first position of a character class
171 denotes a B<negated character class>, which matches any character but
172 those in the brackets. Both C<[...]> and C<[^...]> must match a
173 character, or the match fails. Then
    /[^a]at/;  # doesn't match 'aat' or 'at', but matches
               # all other 'bat', 'cat', '0at', '%at', etc.
177 /[^0-9]/; # matches a non-numeric character
178 /[a^]at/; # matches 'aat' or '^at'; here '^' is ordinary
180 Perl has several abbreviations for common character classes. (These
181 definitions are those that Perl uses in ASCII-safe mode with the C</a> modifier.
182 Otherwise they could match many more non-ASCII Unicode characters as
183 well. See L<perlrecharclass/Backslash sequences> for details.)
=over 4

=item *

\d is a digit and represents

    [0-9]

=item *

\s is a whitespace character and represents

    [\ \t\r\n\f]

=item *

\w is a word character (alphanumeric or _) and represents

    [0-9a-zA-Z_]

=item *

\D is a negated \d; it represents any character but a digit

    [^0-9]

=item *

\S is a negated \s; it represents any non-whitespace character

    [^\s]

=item *

\W is a negated \w; it represents any non-word character

    [^\w]

=item *

The period '.' matches any character but "\n"

=back

229 The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside
230 of character classes. Here are some in use:
232 /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
233 /[\d\s]/; # matches any digit or whitespace character
234 /\w\W\w/; # matches a word char, followed by a
235 # non-word char, followed by a word char
236 /..rt/; # matches any two chars, followed by 'rt'
237 /end\./; # matches 'end.'
238 /end[.]/; # same thing, matches 'end.'
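
The effect of the C</a> modifier mentioned above can be seen with a digit from
outside ASCII. The following is a minimal sketch, assuming Perl 5.14 or later
(where C</a> is available); the variable names are illustrative:

```perl
use strict;
use warnings;

# U+0661 is ARABIC-INDIC DIGIT ONE, a decimal digit outside ASCII.
my $arabic_one = "\x{0661}";

my $unicode_match = $arabic_one =~ /^\d$/  ? 1 : 0;  # Unicode rules
my $ascii_match   = $arabic_one =~ /^\d$/a ? 1 : 0;  # \d restricted to [0-9]

print "without /a: $unicode_match, with /a: $ascii_match\n";
# prints "without /a: 1, with /a: 0"
```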
240 The S<B<word anchor> > C<\b> matches a boundary between a word
241 character and a non-word character C<\w\W> or C<\W\w>:
    $x = "Housecat catenates house and cat";
    $x =~ /\bcat/;    # matches cat in 'catenates'
    $x =~ /cat\b/;    # matches cat in 'Housecat'
    $x =~ /\bcat\b/;  # matches 'cat' at end of string

In the last example, the end of the string is considered a word
boundary.

For natural language processing (so that, for example, apostrophes are
included in words), use C<\b{wb}> instead:

    "don't" =~ / .+? \b{wb} /x;  # matches the whole string

256 =head2 Matching this or that
258 We can match different character strings with the B<alternation>
259 metacharacter C<'|'>. To match C<dog> or C<cat>, we form the regex
260 C<dog|cat>. As before, Perl will try to match the regex at the
261 earliest possible point in the string. At each character position,
262 Perl will first try to match the first alternative, C<dog>. If
263 C<dog> doesn't match, Perl will then try the next alternative, C<cat>.
264 If C<cat> doesn't match either, then the match fails and Perl moves to
265 the next position in the string. Some examples:
267 "cats and dogs" =~ /cat|dog|bird/; # matches "cat"
268 "cats and dogs" =~ /dog|cat|bird/; # matches "cat"
270 Even though C<dog> is the first alternative in the second regex,
271 C<cat> is able to match earlier in the string.
273 "cats" =~ /c|ca|cat|cats/; # matches "c"
274 "cats" =~ /cats|cat|ca|c/; # matches "cats"
276 At a given character position, the first alternative that allows the
277 regex match to succeed will be the one that matches. Here, all the
278 alternatives match at the first string position, so the first matches.
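
One way to check which alternative won is to look at C<$&>, the built-in
variable holding the text of the last successful match. It is not covered in
this guide, so the following is only a sketch of the ordering rules described
above:

```perl
use strict;
use warnings;

# Collect the text matched by each regex against the same string.
my @found;
for my $regex (qr/c|ca|cat|cats/, qr/cats|cat|ca|c/, qr/dog|cat|bird/) {
    push @found, $& if "cats and dogs" =~ $regex;
}
print join(", ", @found), "\n";   # prints "c, cats, cat"
```

The first two results depend on the order of the alternatives; the third shows
that an earlier position in the string beats an earlier alternative.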
280 =head2 Grouping things and hierarchical matching
282 The B<grouping> metacharacters C<()> allow a part of a regex to be
283 treated as a single unit. Parts of a regex are grouped by enclosing
284 them in parentheses. The regex C<house(cat|keeper)> means match
C<house> followed by either C<cat> or C<keeper>. Some more examples:
288 /(a|b)b/; # matches 'ab' or 'bb'
289 /(^a|b)c/; # matches 'ac' at start of string or 'bc' anywhere
291 /house(cat|)/; # matches either 'housecat' or 'house'
292 /house(cat(s|)|)/; # matches either 'housecats' or 'housecat' or
293 # 'house'. Note groups can be nested.
295 "20" =~ /(19|20|)\d\d/; # matches the null alternative '()\d\d',
296 # because '20\d\d' can't match
298 =head2 Extracting matches
300 The grouping metacharacters C<()> also allow the extraction of the
301 parts of a string that matched. For each grouping, the part that
302 matched inside goes into the special variables C<$1>, C<$2>, etc.
303 They can be used just as ordinary variables:

    # extract hours, minutes, seconds
    $time =~ /(\d\d):(\d\d):(\d\d)/;  # match hh:mm:ss format
    $hours = $1;
    $minutes = $2;
    $seconds = $3;

311 In list context, a match C</regex/> with groupings will return the
312 list of matched values C<($1,$2,...)>. So we could rewrite it as

    ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

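
Because C<$1>, C<$2>, etc. keep their old values when a match fails, it is
usually safer to test the match before using the captured values. A minimal
sketch, with a made-up input string:

```perl
use strict;
use warnings;

my $time = "The time is 09:41:07 now";   # sample input for illustration

# A list assignment used as a condition is true only if the match
# returned at least one value, i.e. only if the regex matched.
my ($hours, $minutes, $seconds);
if (($hours, $minutes, $seconds) = $time =~ /(\d\d):(\d\d):(\d\d)/) {
    print "hours=$hours minutes=$minutes seconds=$seconds\n";
    # prints "hours=09 minutes=41 seconds=07"
}
else {
    print "no time found\n";
}
```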
If the groupings in a regex are nested, C<$1> gets the group with the
leftmost opening parenthesis, C<$2> the next opening parenthesis,
etc. For example, here is a complex regex and the matching variables
indicated below it:

    /(ab(cd|ef)((gi)|j))/;
     1  2      34

324 Associated with the matching variables C<$1>, C<$2>, ... are
325 the B<backreferences> C<\g1>, C<\g2>, ... Backreferences are
326 matching variables that can be used I<inside> a regex:
328 /(\w\w\w)\s\g1/; # find sequences like 'the the' in string
330 C<$1>, C<$2>, ... should only be used outside of a regex, and C<\g1>,
331 C<\g2>, ... only inside a regex.
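
As a sketch of a backreference at work, the following collects every doubled
word in a string. It combines C<\g1> with the C<\b> word anchor described
earlier and with a C</g> match in list context; the sample text is made up:

```perl
use strict;
use warnings;

my $text = "Paris in the the spring, and and so on";

# (\w+) captures a word; \g1 requires the same word again after whitespace.
# The \b anchors keep the regex from matching inside longer words.
my @doubled = $text =~ /\b(\w+)\s+\g1\b/g;
print "doubled words: @doubled\n";   # prints "doubled words: the and"
```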
333 =head2 Matching repetitions
335 The B<quantifier> metacharacters C<?>, C<*>, C<+>, and C<{}> allow us
336 to determine the number of repeats of a portion of a regex we
337 consider to be a match. Quantifiers are put immediately after the
338 character, character class, or grouping that we want to specify. They
339 have the following meanings:
=over 4

=item *

C<a?> = match 'a' 1 or 0 times

=item *

C<a*> = match 'a' 0 or more times, i.e., any number of times

=item *

C<a+> = match 'a' 1 or more times, i.e., at least once

=item *

C<a{n,m}> = match at least C<n> times, but not more than C<m>
times

=item *

C<a{n,}> = match at least C<n> times

=item *

C<a{,n}> = match C<n> times or fewer

=item *

C<a{n}> = match exactly C<n> times

=back

374 Here are some examples:
376 /[a-z]+\s+\d*/; # match a lowercase word, at least some space, and
377 # any number of digits
378 /(\w+)\s+\g1/; # match doubled words of arbitrary length
    $year =~ /^\d{2,4}$/;  # make sure year is at least 2 but not more
                           # than 4 digits
    $year =~ /^\d{ 4 }$|^\d{2}$/;  # better match; throw out 3 digit dates
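
The difference between the two year checks can be seen by running both over a
few candidate strings. This is a minimal sketch with made-up inputs; the
second regex is written as C<\d{4}> without the inner spaces so that it also
behaves the same on older Perls:

```perl
use strict;
use warnings;

my @years = ("99", "199", "1999", "19999");

my @loose = grep { /^\d{2,4}$/ }       @years;  # 2, 3, or 4 digits
my @exact = grep { /^\d{4}$|^\d{2}$/ } @years;  # exactly 4 or exactly 2 digits

print "loose: @loose\n";   # prints "loose: 99 199 1999"
print "exact: @exact\n";   # prints "exact: 99 1999"
```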
383 These quantifiers will try to match as much of the string as possible,
384 while still allowing the regex to match. So we have
    $x = 'the cat in the hat';
    $x =~ /^(.*)(at)(.*)$/;  # matches,
                             # $1 = 'the cat in the h'
                             # $2 = 'at'
                             # $3 = ''   (0 matches)

392 The first quantifier C<.*> grabs as much of the string as possible
393 while still having the regex match. The second quantifier C<.*> has
394 no string left to it, so it matches 0 times.
There are a few more things you might want to know about matching
operators.
400 The global modifier C</g> allows the matching operator to match
401 within a string as many times as possible. In scalar context,
402 successive matches against a string will have C</g> jump from match
403 to match, keeping track of position in the string as it goes along.
404 You can get or set the position with the C<pos()> function.

    $x = "cat dog house";  # 3 words
    while ($x =~ /(\w+)/g) {
        print "Word is $1, ends at position ", pos $x, "\n";
    }

prints

    Word is cat, ends at position 3
    Word is dog, ends at position 7
    Word is house, ends at position 13

418 A failed match or changing the target string resets the position. If
you don't want the position reset after failure to match, add the
C</c> modifier, as in C</regex/gc>.
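
The effect of C</c> can be checked by looking at C<pos()> after a failed
match. The following is a minimal sketch; the variable names are illustrative:

```perl
use strict;
use warnings;

my $x = "cat dog house";

my $ok = $x =~ /cat/g;           # scalar context: succeeds
my $after_match = pos $x;        # 3

$ok = $x =~ /bird/gc;            # fails, but /c keeps the position
my $after_failed_gc = pos $x;    # still 3

$ok = $x =~ /bird/g;             # fails without /c: position is reset
my $after_failed_g = pos $x;     # undef

printf "%s %s %s\n", $after_match, $after_failed_gc,
    defined($after_failed_g) ? $after_failed_g : "undef";
# prints "3 3 undef"
```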
422 In list context, C</g> returns a list of matched groupings, or if
there are no groupings, a list of matches to the whole regex. So

    @words = ($x =~ /(\w+)/g);  # matches,
                                # $words[0] = 'cat'
                                # $words[1] = 'dog'
                                # $words[2] = 'house'

430 =head2 Search and replace
432 Search and replace is performed using C<s/regex/replacement/modifiers>.
433 The C<replacement> is a Perl double-quoted string that replaces in the
434 string whatever is matched with the C<regex>. The operator C<=~> is
435 also used here to associate a string with C<s///>. If matching
436 against C<$_>, the S<C<$_ =~>> can be dropped. If there is a match,
437 C<s///> returns the number of substitutions made; otherwise it returns
438 false. Here are a few examples:
440 $x = "Time to feed the cat!";
441 $x =~ s/cat/hacker/; # $x contains "Time to feed the hacker!"
442 $y = "'quoted words'";
443 $y =~ s/^'(.*)'$/$1/; # strip single quotes,
444 # $y contains "quoted words"
446 With the C<s///> operator, the matched variables C<$1>, C<$2>, etc.
447 are immediately available for use in the replacement expression. With
448 the global modifier, C<s///g> will search and replace all occurrences
449 of the regex in the string:
451 $x = "I batted 4 for 4";
452 $x =~ s/4/four/; # $x contains "I batted four for 4"
453 $x = "I batted 4 for 4";
454 $x =~ s/4/four/g; # $x contains "I batted four for four"
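
Since C<s///> returns the number of substitutions made, or false when nothing
matched, its return value can be used to count or test replacements. A minimal
sketch:

```perl
use strict;
use warnings;

my $x = "I batted 4 for 4";

my $count = ($x =~ s/4/four/g);      # number of substitutions made
print "$count replacements: $x\n";
# prints "2 replacements: I batted four for four"

my $none = ($x =~ s/7/seven/g);      # no match, so a false value is returned
print "nothing replaced\n" unless $none;
```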
456 The non-destructive modifier C<s///r> causes the result of the substitution
457 to be returned instead of modifying C<$_> (or whatever variable the
458 substitute was bound to with C<=~>):

    $x = "I like dogs.";
    $y = $x =~ s/dogs/cats/r;
    print "$x $y\n";  # prints "I like dogs. I like cats."

464 $x = "Cats are great.";
465 print $x =~ s/Cats/Dogs/r =~ s/Dogs/Frogs/r =~
466 s/Frogs/Hedgehogs/r, "\n";
467 # prints "Hedgehogs are great."
469 @foo = map { s/[a-z]/X/r } qw(a b c 1 2 3);
470 # @foo is now qw(X X X 1 2 3)
472 The evaluation modifier C<s///e> wraps an C<eval{...}> around the
473 replacement string and the evaluated result is substituted for the
474 matched substring. Some examples:
476 # reverse all the words in a string
477 $x = "the cat in the hat";
478 $x =~ s/(\w+)/reverse $1/ge; # $x contains "eht tac ni eht tah"
480 # convert percentage to decimal
481 $x = "A 39% hit rate";
482 $x =~ s!(\d+)%!$1/100!e; # $x contains "A 0.39 hit rate"
484 The last example shows that C<s///> can use other delimiters, such as
C<s!!!> and C<s{}{}>, and even C<s{}//>. If single quotes are used, as in
C<s'''>, then the regex and replacement are treated as single-quoted
strings.
489 =head2 The split operator
491 C<split /regex/, string> splits C<string> into a list of substrings
492 and returns that list. The regex determines the character sequence
493 that C<string> is split with respect to. For example, to split a
494 string into words, use
496 $x = "Calvin and Hobbes";
    @word = split /\s+/, $x;  # $word[0] = 'Calvin'
                              # $word[1] = 'and'
                              # $word[2] = 'Hobbes'
501 To extract a comma-delimited list of numbers, use
503 $x = "1.618,2.718, 3.142";
504 @const = split /,\s*/, $x; # $const[0] = '1.618'
505 # $const[1] = '2.718'
506 # $const[2] = '3.142'
508 If the empty regex C<//> is used, the string is split into individual
509 characters. If the regex has groupings, then the list produced contains
510 the matched substrings from the groupings as well:

    $x = "/usr/bin";
    @parts = split m!(/)!, $x;  # $parts[0] = ''
                                # $parts[1] = '/'
                                # $parts[2] = 'usr'
                                # $parts[3] = '/'
                                # $parts[4] = 'bin'

Since the first character of C<$x> matched the regex, C<split> prepended
an empty initial element to the list.
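
The empty-regex case mentioned above has no example of its own, so here is a
minimal sketch of splitting a string into individual characters:

```perl
use strict;
use warnings;

# An empty regex splits between every pair of characters.
my @chars = split //, "Hobbes";
print scalar(@chars), " characters: @chars\n";
# prints "6 characters: H o b b e s"
```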
522 =head2 C<use re 'strict'>
524 New in v5.22, this applies stricter rules than otherwise when compiling
525 regular expression patterns. It can find things that, while legal, may
526 not be what you intended.
528 See L<'strict' in re|re/'strict' mode>.
536 This is just a quick start guide. For a more in-depth tutorial on
537 regexes, see L<perlretut> and for the reference page, see L<perlre>.
539 =head1 AUTHOR AND COPYRIGHT
541 Copyright (c) 2000 Mark Kvale
544 This document may be distributed under the same terms as Perl itself.
546 =head2 Acknowledgments
The author would like to thank Mark-Jason Dominus, Tom Christiansen,
Ilya Zakharevich, Brad Hughes, and Mike Giroux for all their helpful
comments.