=head1 NAME

perlrequick - Perl regular expressions quick start

=head1 DESCRIPTION

This page covers the very basics of understanding, creating and
using regular expressions ('regexes') in Perl.

=head1 The Guide

This page assumes you already know the basics, such as what a "pattern"
is and the basic syntax for using one. If you don't, see L<perlretut>.

=head2 Simple word matching

The simplest regex is simply a word, or more generally, a string of
characters. A regex consisting of a word matches any string that
contains that word:

    "Hello World" =~ /World/;  # matches

In this statement, C<World> is a regex and the C<//> enclosing
C</World/> tells Perl to search a string for a match. The operator
C<=~> associates the string with the regex match and produces a true
value if the regex matched, or false if the regex did not match. In
our case, C<World> matches the second word in C<"Hello World">, so the
expression is true. This idea has several variations.

Expressions like this are useful in conditionals:

    print "It matches\n" if "Hello World" =~ /World/;

The sense of the match can be reversed by using the C<!~> operator:

    print "It doesn't match\n" if "Hello World" !~ /World/;

The literal string in the regex can be replaced by a variable:

    $greeting = "World";
    print "It matches\n" if "Hello World" =~ /$greeting/;

If you're matching against C<$_>, the C<$_ =~> part can be omitted:

    $_ = "Hello World";
    print "It matches\n" if /World/;

Finally, the C<//> default delimiters for a match can be changed to
arbitrary delimiters by putting an C<'m'> out front:

    "Hello World" =~ m!World!;      # matches, delimited by '!'
    "Hello World" =~ m{World};      # matches, note the matching '{}'
    "/usr/bin/perl" =~ m"/perl";    # matches after '/usr/bin',
                                    # '/' becomes an ordinary char

Regexes must match a part of the string I<exactly> in order for the
statement to be true:

    "Hello World" =~ /world/;  # doesn't match, case sensitive
    "Hello World" =~ /o W/;    # matches, ' ' is an ordinary char
    "Hello World" =~ /World /; # doesn't match, no ' ' at end

Perl will always match at the earliest possible point in the string:

    "Hello World" =~ /o/;       # matches 'o' in 'Hello'
    "That hat is red" =~ /hat/; # matches 'hat' in 'That'

Not all characters can be used 'as is' in a match. Some characters,
called B<metacharacters>, are considered special, and reserved for use
in regex notation. The metacharacters are

    {}[]()^$.|*+?\

A metacharacter can be matched literally by putting a backslash before
it:

    "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
    "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
    'C:\WIN32' =~ /C:\\WIN/;                # matches
    "/usr/bin/perl" =~ /\/usr\/bin\/perl/;  # matches

In the last regex, the forward slash C<'/'> is also backslashed,
because it is used to delimit the regex.

Most of the metacharacters aren't always special, and other characters
(such as the ones delimiting the pattern) become special under various
circumstances. This can be confusing and lead to unexpected results.
L<S<C<use re 'strict'>>|re/'strict' mode> can notify you of potential
pitfalls.

Non-printable ASCII characters are represented by B<escape sequences>.
Common examples are C<\t> for a tab, C<\n> for a newline, and C<\r>
for a carriage return. Arbitrary bytes are represented by octal
escape sequences, e.g., C<\033>, or hexadecimal escape sequences,
e.g., C<\x1B>:

    "1000\t2000" =~ m(0\t2);      # matches
    "cat" =~ /\143\x61\x74/;      # matches in ASCII, but
                                  # a weird way to spell cat

Regexes are treated mostly as double-quoted strings, so variable
substitution works:

    $foo = 'house';
    'cathouse' =~ /cat$foo/;   # matches
    'housecat' =~ /${foo}cat/; # matches

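Interpolation also means that any metacharacters inside the variable
keep their special meaning. As a minimal sketch (the variable names are
invented for this illustration), C<\Q> and C<\E> quote the interpolated
text so that it is matched literally:

    use strict;
    use warnings;

    my $expr   = '2+2';        # contains the metacharacter '+'
    my $string = 'Is 2+2=4?';

    # Interpolated as-is, '+' acts as a quantifier, so this looks
    # for '22', '222', ... and finds nothing.
    my $raw    = $string =~ /$expr/     ? 1 : 0;

    # \Q...\E backslashes the interpolated text, so '+' is literal.
    my $quoted = $string =~ /\Q$expr\E/ ? 1 : 0;

    print "raw: $raw, quoted: $quoted\n";   # raw: 0, quoted: 1

The C<quotemeta> function performs the same quoting outside of a regex.
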
With all of the regexes above, if the regex matched anywhere in the
string, it was considered a match. To specify I<where> it should
match, we would use the B<anchor> metacharacters C<^> and C<$>. The
anchor C<^> means match at the beginning of the string and the anchor
C<$> means match at the end of the string, or before a newline at the
end of the string. Some examples:

    "housekeeper" =~ /keeper/;         # matches
    "housekeeper" =~ /^keeper/;        # doesn't match
    "housekeeper" =~ /keeper$/;        # matches
    "housekeeper\n" =~ /keeper$/;      # matches
    "housekeeper" =~ /^housekeeper$/;  # matches

=head2 Using character classes

A B<character class> allows a set of possible characters, rather than
just a single character, to match at a particular point in a regex.
There are a number of different types of character classes, but usually
when people use this term, they are referring to the type described in
this section, technically called "bracketed character classes" because
they are denoted by brackets C<[...]>, with the set of characters to be
possibly matched inside. But we'll drop the "bracketed" below to
correspond with common usage. Here are some examples of (bracketed)
character classes:

    /cat/;            # matches 'cat'
    /[bcr]at/;        # matches 'bat', 'cat', or 'rat'
    "abc" =~ /[cab]/; # matches 'a'

In the last statement, even though C<'c'> is the first character in
the class, the earliest point at which the regex can match is C<'a'>.

    /[yY][eE][sS]/; # match 'yes' in a case-insensitive way
                    # 'yes', 'Yes', 'YES', etc.
    /yes/i;         # also match 'yes' in a case-insensitive way

The last example shows a match with an C<'i'> B<modifier>, which makes
the match case-insensitive.

Character classes also have ordinary and special characters, but the
sets of ordinary and special characters inside a character class are
different than those outside a character class. The special
characters for a character class are C<-]\^$> and are matched using an
escape:

    /[\]c]def/; # matches ']def' or 'cdef'
    $x = 'bcr';
    /[$x]at/;   # matches 'bat', 'cat', or 'rat'
    /[\$x]at/;  # matches '$at' or 'xat'
    /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'

The special character C<'-'> acts as a range operator within character
classes, so that the unwieldy C<[0123456789]> and C<[abc...xyz]>
become the svelte C<[0-9]> and C<[a-z]>:

    /item[0-9]/;    # matches 'item0' or ... or 'item9'
    /[0-9a-fA-F]/;  # matches a hexadecimal digit

If C<'-'> is the first or last character in a character class, it is
treated as an ordinary character.

The special character C<^> in the first position of a character class
denotes a B<negated character class>, which matches any character but
those in the brackets. Both C<[...]> and C<[^...]> must match a
character, or the match fails. Then

    /[^a]at/;  # doesn't match 'aat' or 'at', but matches
               # all other 'bat', 'cat', '0at', '%at', etc.
    /[^0-9]/;  # matches a non-numeric character
    /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary

Perl has several abbreviations for common character classes. (These
definitions are those that Perl uses in ASCII-safe mode with the C</a> modifier.
Otherwise they could match many more non-ASCII Unicode characters as
well. See L<perlrecharclass/Backslash sequences> for details.)

=over 4

=item *

\d is a digit and represents

    [0-9]

=item *

\s is a whitespace character and represents

    [\ \t\r\n\f]

=item *

\w is a word character (alphanumeric or _) and represents

    [0-9a-zA-Z_]

=item *

\D is a negated \d; it represents any character but a digit

    [^0-9]

=item *

\S is a negated \s; it represents any non-whitespace character

    [^\s]

=item *

\W is a negated \w; it represents any non-word character

    [^\w]

=item *

The period '.' matches any character but "\n"

=back

The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside
of character classes. Here are some in use:

    /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
    /[\d\s]/;         # matches any digit or whitespace character
    /\w\W\w/;         # matches a word char, followed by a
                      # non-word char, followed by a word char
    /..rt/;           # matches any two chars, followed by 'rt'
    /end\./;          # matches 'end.'
    /end[.]/;         # same thing, matches 'end.'

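Combining these abbreviations with the anchors C<^> and C<$> turns a
"contains" test into a "has exactly this shape" test. The following is
only a sketch with made-up input strings; it checks the shape of the
text, not whether the numbers form a sensible time of day:

    use strict;
    use warnings;

    # Candidate strings invented for this illustration.
    my @candidates = ('09:30:00', 'at 09:30:00', '9:30:00', "23:59:59\n");

    # Keep only the strings that are hh:mm:ss from start to end.
    my @ok = grep { /^\d\d:\d\d:\d\d$/ } @candidates;

    print scalar(@ok), " of ", scalar(@candidates), " look like hh:mm:ss\n";
    # prints "2 of 4 look like hh:mm:ss"

The last candidate is accepted because C<$> also matches before a
newline at the end of the string.
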
The S<B<word anchor> > C<\b> matches a boundary between a word
character and a non-word character C<\w\W> or C<\W\w>:

    $x = "Housecat catenates house and cat";
    $x =~ /\bcat/;    # matches cat in 'catenates'
    $x =~ /cat\b/;    # matches cat in 'Housecat'
    $x =~ /\bcat\b/;  # matches 'cat' at end of string

In the last example, the end of the string is considered a word
boundary.

For natural language processing (so that, for example, apostrophes are
included in words), use C<\b{wb}> instead:

    "don't" =~ / .+? \b{wb} /x;  # matches the whole string

=head2 Matching this or that

We can match different character strings with the B<alternation>
metacharacter C<'|'>. To match C<dog> or C<cat>, we form the regex
C<dog|cat>. As before, Perl will try to match the regex at the
earliest possible point in the string. At each character position,
Perl will first try to match the first alternative, C<dog>. If
C<dog> doesn't match, Perl will then try the next alternative, C<cat>.
If C<cat> doesn't match either, then the match fails and Perl moves to
the next position in the string. Some examples:

    "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
    "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"

Even though C<dog> is the first alternative in the second regex,
C<cat> is able to match earlier in the string.

    "cats" =~ /c|ca|cat|cats/; # matches "c"
    "cats" =~ /cats|cat|ca|c/; # matches "cats"

At a given character position, the first alternative that allows the
regex match to succeed will be the one that matches. Here, all the
alternatives match at the first string position, so the first matches.

=head2 Grouping things and hierarchical matching

The B<grouping> metacharacters C<()> allow a part of a regex to be
treated as a single unit. Parts of a regex are grouped by enclosing
them in parentheses. The regex C<house(cat|keeper)> means match
C<house> followed by either C<cat> or C<keeper>. Some more examples
are

    /(a|b)b/;    # matches 'ab' or 'bb'
    /(^a|b)c/;   # matches 'ac' at start of string or 'bc' anywhere

    /house(cat|)/;      # matches either 'housecat' or 'house'
    /house(cat(s|)|)/;  # matches either 'housecats' or 'housecat' or
                        # 'house'. Note groups can be nested.

    "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
                             # because '20\d\d' can't match

=head2 Extracting matches

The grouping metacharacters C<()> also allow the extraction of the
parts of a string that matched. For each grouping, the part that
matched inside goes into the special variables C<$1>, C<$2>, etc.
They can be used just as ordinary variables:

    # extract hours, minutes, seconds
    $time =~ /(\d\d):(\d\d):(\d\d)/;  # match hh:mm:ss format
    $hours = $1;
    $minutes = $2;
    $seconds = $3;

In list context, a match C</regex/> with groupings will return the
list of matched values C<($1,$2,...)>. So we could rewrite it as

    ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

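The capture variables are only set by a successful match; after a
failed match they still hold whatever an earlier successful match left
in them. A minimal sketch of guarding the extraction with a conditional
follows (the sub name C<parse_time> and the sample strings are made up
for this illustration):

    use strict;
    use warnings;

    # Hypothetical helper: returns (hours, minutes, seconds), or an
    # empty list if the string contains no hh:mm:ss substring.
    sub parse_time {
        my ($time) = @_;
        if ($time =~ /(\d\d):(\d\d):(\d\d)/) {
            return ($1, $2, $3);
        }
        return;
    }

    my @good = parse_time("started at 09:30:15");
    my @bad  = parse_time("no time here");

    print "good: @good\n";                         # good: 09 30 15
    print "bad has ", scalar(@bad), " parts\n";    # bad has 0 parts

Testing the match before reading C<$1> and friends avoids silently
reusing values from an unrelated earlier match.
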
If the groupings in a regex are nested, C<$1> gets the group with the
leftmost opening parenthesis, C<$2> the next opening parenthesis,
etc. For example, here is a complex regex and the matching variables
indicated below it:

    /(ab(cd|ef)((gi)|j))/;
     1  2      34

Associated with the matching variables C<$1>, C<$2>, ... are
the B<backreferences> C<\g1>, C<\g2>, ... Backreferences are
matching variables that can be used I<inside> a regex:

    /(\w\w\w)\s\g1/; # find sequences like 'the the' in string

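As a small runnable sketch of the same idea (the sample sentence is
invented), the word anchor C<\b> can be added on both sides so that the
repeated text is not found inside longer words:

    use strict;
    use warnings;

    my $text = "Paris in the the spring";

    # \g1 must match the same three characters that group 1 captured;
    # \b keeps the match from starting or ending inside a longer word.
    my $doubled = '';
    if ($text =~ /\b(\w\w\w)\s\g1\b/) {
        $doubled = $1;
    }
    print "Doubled word: $doubled\n";   # Doubled word: the

Inside the regex the repeat is written C<\g1>, while the code after the
match reads the same text from C<$1>.
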
C<$1>, C<$2>, ... should only be used outside of a regex, and C<\g1>,
C<\g2>, ... only inside a regex.

=head2 Matching repetitions

The B<quantifier> metacharacters C<?>, C<*>, C<+>, and C<{}> allow us
to determine the number of repeats of a portion of a regex we
consider to be a match. Quantifiers are put immediately after the
character, character class, or grouping that we want to specify. They
have the following meanings:

=over 4

=item *

C<a?> = match 'a' 1 or 0 times

=item *

C<a*> = match 'a' 0 or more times, i.e., any number of times

=item *

C<a+> = match 'a' 1 or more times, i.e., at least once

=item *

C<a{n,m}> = match at least C<n> times, but not more than C<m>
times.

=item *

C<a{n,}> = match C<n> or more times

=item *

C<a{,n}> = match C<n> times or fewer

=item *

C<a{n}> = match exactly C<n> times

=back

Here are some examples:

    /[a-z]+\s+\d*/;  # match a lowercase word, at least some space, and
                     # any number of digits
    /(\w+)\s+\g1/;   # match doubled words of arbitrary length
    $year =~ /^\d{2,4}$/;  # make sure year is at least 2 but not more
                           # than 4 digits
    $year =~ /^\d{ 4 }$|^\d{2}$/;  # better match; throw out 3 digit dates

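The year check can be tried out on a handful of strings. This is a
sketch only: the candidate values are invented, and the blanks inside
the braces are left out so that it also runs on older perls.

    use strict;
    use warnings;

    # Candidate strings invented for this illustration.
    my @candidates = ('99', '1999', '199', '19990', 'y2k');

    # Accept exactly 4 digits or exactly 2 digits, anchored at both ends.
    my @valid = grep { /^\d{4}$|^\d{2}$/ } @candidates;

    print "valid years: @valid\n";   # valid years: 99 1999

Without the anchors, C<19990> would also be accepted, because four of
its digits satisfy C<\d{4}>.
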
These quantifiers will try to match as much of the string as possible,
while still allowing the regex to match. So we have

    $x = 'the cat in the hat';
    $x =~ /^(.*)(at)(.*)$/; # matches,
                            # $1 = 'the cat in the h'
                            # $2 = 'at'
                            # $3 = ''   (0 matches)

The first quantifier C<.*> grabs as much of the string as possible
while still having the regex match. The second quantifier C<.*> has
no string left to it, so it matches 0 times.

=head2 More matching

There are a few more things you might want to know about matching
operators.
The global modifier C</g> allows the matching operator to match
within a string as many times as possible. In scalar context,
successive matches against a string will have C</g> jump from match
to match, keeping track of position in the string as it goes along.
You can get or set the position with the C<pos()> function.
For example,

    $x = "cat dog house"; # 3 words
    while ($x =~ /(\w+)/g) {
        print "Word is $1, ends at position ", pos $x, "\n";
    }

prints

    Word is cat, ends at position 3
    Word is dog, ends at position 7
    Word is house, ends at position 13

A failed match or changing the target string resets the position. If
you don't want the position reset after failure to match, add the
C</c> modifier, as in C</regex/gc>.

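The difference between C</g> and C</gc> after a failure can be observed
with C<pos()>. A minimal sketch, with invented variable names:

    use strict;
    use warnings;

    my $x = "cat dog";
    my $ok;

    $ok = $x =~ /cat/g;             # succeeds, position is now 3
    my $after_match = pos($x);

    $ok = $x =~ /bird/g;            # fails, so the position is reset
    my $after_fail = defined pos($x) ? pos($x) : 'undef';

    $ok = $x =~ /cat/g;             # starts over from the beginning
    $ok = $x =~ /bird/gc;           # fails, but /c keeps the position
    my $after_fail_c = pos($x);

    print "$after_match $after_fail $after_fail_c\n";   # 3 undef 3

After a reset, C<pos()> returns the undefined value, and the next
C</g> match starts again at the beginning of the string.
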
In list context, C</g> returns a list of matched groupings, or if
there are no groupings, a list of matches to the whole regex. So

    @words = ($x =~ /(\w+)/g);  # matches,
                                # $words[0] = 'cat'
                                # $words[1] = 'dog'
                                # $words[2] = 'house'

=head2 Search and replace

Search and replace is performed using C<s/regex/replacement/modifiers>.
The C<replacement> is a Perl double-quoted string that replaces in the
string whatever is matched with the C<regex>. The operator C<=~> is
also used here to associate a string with C<s///>. If matching
against C<$_>, the S<C<$_ =~>> can be dropped. If there is a match,
C<s///> returns the number of substitutions made; otherwise it returns
false. Here are a few examples:

    $x = "Time to feed the cat!";
    $x =~ s/cat/hacker/;   # $x contains "Time to feed the hacker!"
    $y = "'quoted words'";
    $y =~ s/^'(.*)'$/$1/;  # strip single quotes,
                           # $y contains "quoted words"

With the C<s///> operator, the matched variables C<$1>, C<$2>, etc.
are immediately available for use in the replacement expression. With
the global modifier, C<s///g> will search and replace all occurrences
of the regex in the string:

    $x = "I batted 4 for 4";
    $x =~ s/4/four/;   # $x contains "I batted four for 4"
    $x = "I batted 4 for 4";
    $x =~ s/4/four/g;  # $x contains "I batted four for four"

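The return value of C<s///> described above can be captured like any
other value. A short sketch with invented variable names:

    use strict;
    use warnings;

    my $x = "I batted 4 for 4";

    # With /g, the return value counts every replacement made.
    my $count = ($x =~ s/4/four/g);

    # No 'z' in the string: nothing is replaced, the result is false.
    my $none = ($x =~ s/z/zed/g) ? 'true' : 'false';

    print "$count replacements, then $none\n";
    # prints "2 replacements, then false"
    print "$x\n";
    # prints "I batted four for four"

This makes C<s///> usable directly in a conditional when code should
run only if something was actually replaced.
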
The non-destructive modifier C<s///r> causes the result of the substitution
to be returned instead of modifying C<$_> (or whatever variable the
substitute was bound to with C<=~>):

    $x = "I like dogs.";
    $y = $x =~ s/dogs/cats/r;
    print "$x $y\n"; # prints "I like dogs. I like cats."

    $x = "Cats are great.";
    print $x =~ s/Cats/Dogs/r =~ s/Dogs/Frogs/r =~
        s/Frogs/Hedgehogs/r, "\n";
    # prints "Hedgehogs are great."

    @foo = map { s/[a-z]/X/r } qw(a b c 1 2 3);
    # @foo is now qw(X X X 1 2 3)

The evaluation modifier C<s///e> wraps an C<eval{...}> around the
replacement string and the evaluated result is substituted for the
matched substring. Some examples:

    # reverse all the words in a string
    $x = "the cat in the hat";
    $x =~ s/(\w+)/reverse $1/ge;   # $x contains "eht tac ni eht tah"

    # convert percentage to decimal
    $x = "A 39% hit rate";
    $x =~ s!(\d+)%!$1/100!e;       # $x contains "A 0.39 hit rate"

The last example shows that C<s///> can use other delimiters, such as
C<s!!!> and C<s{}{}>, and even C<s{}//>. If single quotes are used,
as in C<s'''>, then the regex and replacement are treated as
single-quoted strings.

=head2 The split operator

C<split /regex/, string> splits C<string> into a list of substrings
and returns that list. The regex determines the character sequence
that C<string> is split with respect to. For example, to split a
string into words, use

    $x = "Calvin and Hobbes";
    @word = split /\s+/, $x;  # $word[0] = 'Calvin'
                              # $word[1] = 'and'
                              # $word[2] = 'Hobbes'

To extract a comma-delimited list of numbers, use

    $x = "1.618,2.718,   3.142";
    @const = split /,\s*/, $x;  # $const[0] = '1.618'
                                # $const[1] = '2.718'
                                # $const[2] = '3.142'

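One thing to watch when splitting into words is leading whitespace.
The sketch below (sample string invented) compares C</\s+/> with the
special single-space string C<' '>, which is described in the C<split>
entry of L<perlfunc> rather than in this guide:

    use strict;
    use warnings;

    my $line = "  Calvin and Hobbes";   # note the leading blanks

    # /\s+/ matches at the very start, so the first field is empty.
    my @by_regex = split /\s+/, $line;

    # The special string ' ' skips leading whitespace first.
    my @by_space = split ' ', $line;

    print scalar(@by_regex), " fields vs ", scalar(@by_space), " fields\n";
    # prints "4 fields vs 3 fields"

Which form is appropriate depends on whether an empty leading field
carries meaning in the data being split.
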
If the empty regex C<//> is used, the string is split into individual
characters. If the regex has groupings, then the list produced contains
the matched substrings from the groupings as well:

    $x = "/usr/bin";
    @parts = split m!(/)!, $x;  # $parts[0] = ''
                                # $parts[1] = '/'
                                # $parts[2] = 'usr'
                                # $parts[3] = '/'
                                # $parts[4] = 'bin'

Since the first character of C<$x> matched the regex, C<split> prepended
an empty initial element to the list.

=head2 C<use re 'strict'>

New in v5.22, this applies stricter rules than otherwise when compiling
regular expression patterns. It can find things that, while legal, may
not be what you intended.

See L<'strict' in re|re/'strict' mode>.

=head1 BUGS

None.

=head1 SEE ALSO

This is just a quick start guide. For a more in-depth tutorial on
regexes, see L<perlretut> and for the reference page, see L<perlre>.

=head1 AUTHOR AND COPYRIGHT

Copyright (c) 2000 Mark Kvale
All rights reserved.

This document may be distributed under the same terms as Perl itself.

=head2 Acknowledgments

The author would like to thank Mark-Jason Dominus, Tom Christiansen,
Ilya Zakharevich, Brad Hughes, and Mike Giroux for all their helpful
comments.

=cut
553