=head1 NAME

perlrequick - Perl regular expressions quick start

=head1 DESCRIPTION

This page covers the very basics of understanding, creating and
using regular expressions ('regexes') in Perl.

=head1 The Guide

=head2 Simple word matching

The simplest regex is simply a word, or more generally, a string of
characters. A regex consisting of a word matches any string that
contains that word:

    "Hello World" =~ /World/;  # matches

In this statement, C<World> is a regex and the C<//> enclosing
C</World/> tells Perl to search a string for a match. The operator
C<=~> associates the string with the regex match and produces a true
value if the regex matched, or false if the regex did not match. In
our case, C<World> matches the second word in C<"Hello World">, so the
expression is true. This idea has several variations.

Expressions like this are useful in conditionals:

    print "It matches\n" if "Hello World" =~ /World/;

The sense of the match can be reversed by using the C<!~> operator:

    print "It doesn't match\n" if "Hello World" !~ /World/;

The literal string in the regex can be replaced by a variable:

    $greeting = "World";
    print "It matches\n" if "Hello World" =~ /$greeting/;

If you're matching against C<$_>, the C<$_ =~> part can be omitted:

    $_ = "Hello World";
    print "It matches\n" if /World/;

Finally, the C<//> default delimiters for a match can be changed to
arbitrary delimiters by putting an C<'m'> out front:

    "Hello World" =~ m!World!;   # matches, delimited by '!'
    "Hello World" =~ m{World};   # matches, note the matching '{}'
    "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
                                 # '/' becomes an ordinary char

Regexes must match a part of the string I<exactly> in order for the
statement to be true:

    "Hello World" =~ /world/;  # doesn't match, case sensitive
    "Hello World" =~ /o W/;    # matches, ' ' is an ordinary char
    "Hello World" =~ /World /; # doesn't match, no ' ' at end

Perl will always match at the earliest possible point in the string:

    "Hello World" =~ /o/;       # matches 'o' in 'Hello'
    "That hat is red" =~ /hat/; # matches 'hat' in 'That'

Not all characters can be used 'as is' in a match. Some characters,
called B<metacharacters>, are reserved for use in regex notation.
The metacharacters are

    {}[]()^$.|*+?\

A metacharacter can be matched by putting a backslash before it:

    "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
    "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
    'C:\WIN32' =~ /C:\\WIN/;                       # matches
    "/usr/bin/perl" =~ /\/usr\/bin\/perl/;  # matches

In the last regex, the forward slash C<'/'> is also backslashed,
because it is used to delimit the regex.

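The escaping rule can be checked with a short script. This is only a sketch: the variable names are made up for illustration, and the C<m{}> form is used to show that choosing another delimiter avoids having to backslash each C<'/'>.

```perl
use strict;
use warnings;

# '+' is a quantifier here, so the regex looks for "22", which is absent.
my $plain   = ("2+2=4" =~ /2+2/)  ? "matches" : "doesn't match";

# '\+' is a literal plus sign, so the regex looks for "2+2".
my $escaped = ("2+2=4" =~ /2\+2/) ? "matches" : "doesn't match";

# With m{} as the delimiter, '/' is an ordinary character.
my $slashes = ("/usr/bin/perl" =~ m{/usr/bin/perl}) ? "matches" : "doesn't match";

print "unescaped: $plain\n";
print "escaped:   $escaped\n";
print "m{}:       $slashes\n";
```
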
Non-printable ASCII characters are represented by B<escape sequences>.
Common examples are C<\t> for a tab, C<\n> for a newline, and C<\r>
for a carriage return. Arbitrary bytes are represented by octal
escape sequences, e.g., C<\033>, or hexadecimal escape sequences,
e.g., C<\x1B>:

    "1000\t2000" =~ m(0\t2)  # matches
    "cat" =~ /\143\x61\x74/  # matches in ASCII, but a weird way to spell cat

Regexes are treated mostly as double-quoted strings, so variable
substitution works:

    $foo = 'house';
    'cathouse' =~ /cat$foo/;   # matches
    'housecat' =~ /${foo}cat/; # matches

With all of the regexes above, if the regex matched anywhere in the
string, it was considered a match. To specify I<where> it should
match, we would use the B<anchor> metacharacters C<^> and C<$>. The
anchor C<^> means match at the beginning of the string and the anchor
C<$> means match at the end of the string, or before a newline at the
end of the string. Some examples:

    "housekeeper" =~ /keeper/;         # matches
    "housekeeper" =~ /^keeper/;        # doesn't match
    "housekeeper" =~ /keeper$/;        # matches
    "housekeeper\n" =~ /keeper$/;      # matches
    "housekeeper" =~ /^housekeeper$/;  # matches

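As a small illustration of anchoring, the sketch below filters a list of strings with and without C<^>. The list and the variable names are invented for this example, and C<grep>, which this guide does not cover, is used only to apply each regex to every string.

```perl
use strict;
use warnings;

my @strings = ("housekeeper", "keeper", "keepers", "housekeeper\n");

# Strings that end in 'keeper'; '$' also matches before a final newline.
my @ends_in_keeper = grep { /keeper$/ } @strings;

# Strings that consist of 'keeper' and nothing else.
my @exactly_keeper = grep { /^keeper$/ } @strings;

print scalar(@ends_in_keeper), " string(s) end in 'keeper'\n";
print scalar(@exactly_keeper), " string(s) are exactly 'keeper'\n";
```
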
=head2 Using character classes

A B<character class> allows a set of possible characters, rather than
just a single character, to match at a particular point in a regex.
Character classes are denoted by brackets C<[...]>, with the set of
characters to be possibly matched inside. Here are some examples:

    /cat/;            # matches 'cat'
    /[bcr]at/;        # matches 'bat', 'cat', or 'rat'
    "abc" =~ /[cab]/; # matches 'a'

In the last statement, even though C<'c'> is the first character in
the class, the earliest point at which the regex can match is C<'a'>.

    /[yY][eE][sS]/; # match 'yes' in a case-insensitive way
                    # 'yes', 'Yes', 'YES', etc.
    /yes/i;         # also match 'yes' in a case-insensitive way

The last example shows a match with an C<'i'> B<modifier>, which makes
the match case-insensitive.

Character classes also have ordinary and special characters, but the
sets of ordinary and special characters inside a character class are
different than those outside a character class. The special
characters for a character class are C<-]\^$> and are matched using an
escape:

    /[\]c]def/; # matches ']def' or 'cdef'
    $x = 'bcr';
    /[$x]at/;   # matches 'bat', 'cat', or 'rat'
    /[\$x]at/;  # matches '$at' or 'xat'
    /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'

The special character C<'-'> acts as a range operator within character
classes, so that the unwieldy C<[0123456789]> and C<[abc...xyz]>
become the svelte C<[0-9]> and C<[a-z]>:

    /item[0-9]/;    # matches 'item0' or ... or 'item9'
    /[0-9a-fA-F]/;  # matches a hexadecimal digit

If C<'-'> is the first or last character in a character class, it is
treated as an ordinary character.

The special character C<^> in the first position of a character class
denotes a B<negated character class>, which matches any character but
those in the brackets. Both C<[...]> and C<[^...]> must match a
character, or the match fails. Then

    /[^a]at/;  # doesn't match 'aat' or 'at', but matches
               # all other 'bat', 'cat', '0at', '%at', etc.
    /[^0-9]/;  # matches a non-numeric character
    /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary

Perl has several abbreviations for common character classes. (These
definitions are those that Perl uses in ASCII-safe mode with the C</a> modifier.
Otherwise they could match many more non-ASCII Unicode characters as
well. See L<perlrecharclass/Backslash sequences> for details.)

=over 4

=item *

\d is a digit and represents

    [0-9]

=item *

\s is a whitespace character and represents

    [\ \t\r\n\f]

=item *

\w is a word character (alphanumeric or _) and represents

    [0-9a-zA-Z_]

=item *

\D is a negated \d; it represents any character but a digit

    [^0-9]

=item *

\S is a negated \s; it represents any non-whitespace character

    [^\s]

=item *

\W is a negated \w; it represents any non-word character

    [^\w]

=item *

The period '.' matches any character but "\n"

=back

The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside
of character classes. Here are some in use:

    /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
    /[\d\s]/;         # matches any digit or whitespace character
    /\w\W\w/;         # matches a word char, followed by a
                      # non-word char, followed by a word char
    /..rt/;           # matches any two chars, followed by 'rt'
    /end\./;          # matches 'end.'
    /end[.]/;         # same thing, matches 'end.'

The S<B<word anchor> > C<\b> matches a boundary between a word
character and a non-word character C<\w\W> or C<\W\w>:

    $x = "Housecat catenates house and cat";
    $x =~ /\bcat/;    # matches cat in 'catenates'
    $x =~ /cat\b/;    # matches cat in 'Housecat'
    $x =~ /\bcat\b/;  # matches 'cat' at end of string

In the last example, the end of the string is considered a word
boundary.

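The class abbreviations and the word anchor combine naturally. The sketch below picks out strings that contain a free-standing hh:mm:ss time; the candidate strings and variable names are made up for illustration, and C<grep> is used only to test each string in turn.

```perl
use strict;
use warnings;

my @candidates = ("12:34:56", "x12:34:56", "1:2:3", "12:34:567");

# \b on each side rejects times glued to other word characters,
# such as the leading 'x' or the extra trailing digit.
my @times = grep { /\b\d\d:\d\d:\d\d\b/ } @candidates;

print "accepted: @times\n";
```

Without the two C<\b> anchors, C<"x12:34:56"> and C<"12:34:567"> would also be accepted, because the regex would be free to match in the middle of them.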
=head2 Matching this or that

We can match different character strings with the B<alternation>
metacharacter C<'|'>. To match C<dog> or C<cat>, we form the regex
C<dog|cat>. As before, Perl will try to match the regex at the
earliest possible point in the string. At each character position,
Perl will first try to match the first alternative, C<dog>. If
C<dog> doesn't match, Perl will then try the next alternative, C<cat>.
If C<cat> doesn't match either, then the match fails and Perl moves to
the next position in the string. Some examples:

    "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
    "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"

Even though C<dog> is the first alternative in the second regex,
C<cat> is able to match earlier in the string.

    "cats" =~ /c|ca|cat|cats/; # matches "c"
    "cats" =~ /cats|cat|ca|c/; # matches "cats"

At a given character position, the first alternative that allows the
regex match to succeed will be the one that matches. Here, all the
alternatives match at the first string position, so the first matches.

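To see which alternative actually matched, one option is to wrap the alternation in parentheses and read back the matched text. The sketch below relies on capturing groups and list-context matching, which this guide describes under L</Extracting matches>; the variable names are invented for the example.

```perl
use strict;
use warnings;

# At position 0 every alternative can match, so the first one listed wins.
my ($short_first) = "cats" =~ /(c|ca|cat|cats)/;
my ($long_first)  = "cats" =~ /(cats|cat|ca|c)/;

# 'dog' is listed first, but 'cat' matches earlier in the string.
my ($earliest) = "cats and dogs" =~ /(dog|cat|bird)/;

print "short first: $short_first\n";
print "long first:  $long_first\n";
print "earliest:    $earliest\n";
```
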
=head2 Grouping things and hierarchical matching

The B<grouping> metacharacters C<()> allow a part of a regex to be
treated as a single unit. Parts of a regex are grouped by enclosing
them in parentheses. The regex C<house(cat|keeper)> means match
C<house> followed by either C<cat> or C<keeper>. Some more examples
are

    /(a|b)b/;    # matches 'ab' or 'bb'
    /(^a|b)c/;   # matches 'ac' at start of string or 'bc' anywhere

    /house(cat|)/;  # matches either 'housecat' or 'house'
    /house(cat(s|)|)/;  # matches either 'housecats' or 'housecat' or
                        # 'house'.  Note groups can be nested.

    "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
                             # because '20\d\d' can't match

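The effect of the empty alternative can be observed directly. In this sketch the regexes come from the examples above, with C<^> and C<$> added to the first one so that the whole string must match; the test strings and variable names are made up, and the captured text is read back in list context.

```perl
use strict;
use warnings;

# Nested groups with empty alternatives: 'house', 'housecat', 'housecats'.
my @found = grep { /^house(cat(s|)|)$/ } qw(housecats housecat house houseboat);

# With only two digits available, the empty alternative is the one that works.
my ($two_digits)  = "20"   =~ /(19|20|)\d\d/;
# With four digits, the first alternative '19' succeeds.
my ($four_digits) = "1999" =~ /(19|20|)\d\d/;

print "found: @found\n";
print "group for '20':   '$two_digits'\n";
print "group for '1999': '$four_digits'\n";
```
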
=head2 Extracting matches

The grouping metacharacters C<()> also allow the extraction of the
parts of a string that matched. For each grouping, the part that
matched inside goes into the special variables C<$1>, C<$2>, etc.
They can be used just as ordinary variables:

    # extract hours, minutes, seconds
    $time =~ /(\d\d):(\d\d):(\d\d)/;  # match hh:mm:ss format
    $hours = $1;
    $minutes = $2;
    $seconds = $3;

In list context, a match C</regex/> with groupings will return the
list of matched values C<($1,$2,...)>. So we could rewrite it as

    ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

If the groupings in a regex are nested, C<$1> gets the group with the
leftmost opening parenthesis, C<$2> the next opening parenthesis,
etc. For example, here is a complex regex and the matching variables
indicated below it:

    /(ab(cd|ef)((gi)|j))/;
     1  2      34

Associated with the matching variables C<$1>, C<$2>, ... are
the B<backreferences> C<\g1>, C<\g2>, ... Backreferences are
matching variables that can be used I<inside> a regex:

    /(\w\w\w)\s\g1/; # find sequences like 'the the' in string

C<$1>, C<$2>, ... should only be used outside of a regex, and C<\g1>,
C<\g2>, ... only inside a regex.

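Here is a short sketch that puts both ideas to work: list-context extraction of a time, and a backreference that finds a repeated three-letter word. The sample strings and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $time = "09:05:30";

# In list context the match returns ($1, $2, $3).
my ($hours, $minutes, $seconds) = $time =~ /(\d\d):(\d\d):(\d\d)/;
my $total = $hours * 3600 + $minutes * 60 + $seconds;

# \g1 must match the same text that group 1 captured.
my $text = "Paris in the the spring";
my ($doubled) = $text =~ /(\w\w\w)\s\g1/;

print "$time is $total seconds after midnight\n";
print "doubled word: $doubled\n";
```
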
=head2 Matching repetitions

The B<quantifier> metacharacters C<?>, C<*>, C<+>, and C<{}> allow us
to determine the number of repeats of a portion of a regex we
consider to be a match. Quantifiers are put immediately after the
character, character class, or grouping that we want to specify. They
have the following meanings:

=over 4

=item *

C<a?> = match 'a' 1 or 0 times

=item *

C<a*> = match 'a' 0 or more times, i.e., any number of times

=item *

C<a+> = match 'a' 1 or more times, i.e., at least once

=item *

C<a{n,m}> = match at least C<n> times, but not more than C<m>
times.

=item *

C<a{n,}> = match at least C<n> times

=item *

C<a{n}> = match exactly C<n> times

=back

Here are some examples:

    /[a-z]+\s+\d*/;  # match a lowercase word, at least some space, and
                     # any number of digits
    /(\w+)\s+\g1/;   # match doubled words of arbitrary length
    $year =~ /^\d{2,4}$/;  # make sure year is at least 2 but not more
                           # than 4 digits
    $year =~ /^\d{4}$|^\d{2}$/; # better match; throw out 3 digit dates

These quantifiers will try to match as much of the string as possible,
while still allowing the regex to match. So we have

    $x = 'the cat in the hat';
    $x =~ /^(.*)(at)(.*)$/; # matches,
                            # $1 = 'the cat in the h'
                            # $2 = 'at'
                            # $3 = ''   (0 matches)

The first quantifier C<.*> grabs as much of the string as possible
while still having the regex match. The second quantifier C<.*> has
no string left to it, so it matches 0 times.

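The greedy behaviour and the year check can both be tried out in a few lines. This is a sketch only; the variable names and the list of candidate years are invented for the example.

```perl
use strict;
use warnings;

my $x = 'the cat in the hat';

# The first .* takes as much as it can while still leaving an 'at' to match.
my ($before, $at, $after) = $x =~ /^(.*)(at)(.*)$/;

# Accept exactly two or exactly four digits, nothing else.
my @years = grep { /^\d{4}$|^\d{2}$/ } qw(99 1999 199 19990);

print "before: '$before', at: '$at', after: '$after'\n";
print "valid years: @years\n";
```
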
=head2 More matching

There are a few more things you might want to know about matching
operators.
The global modifier C<//g> allows the matching operator to match
within a string as many times as possible. In scalar context,
successive matches against a string will have C<//g> jump from match
to match, keeping track of position in the string as it goes along.
You can get or set the position with the C<pos()> function.
For example,

    $x = "cat dog house"; # 3 words
    while ($x =~ /(\w+)/g) {
        print "Word is $1, ends at position ", pos $x, "\n";
    }

prints

    Word is cat, ends at position 3
    Word is dog, ends at position 7
    Word is house, ends at position 13

A failed match or changing the target string resets the position. If
you don't want the position reset after failure to match, add the
C<//c> modifier, as in C</regex/gc>.

In list context, C<//g> returns a list of matched groupings, or if
there are no groupings, a list of matches to the whole regex. So

    @words = ($x =~ /(\w+)/g);  # matches,
                                # $words[0] = 'cat'
                                # $words[1] = 'dog'
                                # $words[2] = 'house'

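The two contexts can be compared side by side. In this sketch the scalar-context loop records where each match ends, and the list-context match then collects all the words at once; the array names are made up for the example.

```perl
use strict;
use warnings;

my $x = "cat dog house";

# Scalar context: one match per iteration, pos() advances each time.
my @ends;
while ($x =~ /\w+/g) {
    push @ends, pos($x);
}

# The loop ended on a failed match, so the position was reset and the
# list-context match below starts again from the beginning of the string.
my @words = ($x =~ /(\w+)/g);

print "end positions: @ends\n";
print "words: @words\n";
```
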
=head2 Search and replace

Search and replace is performed using C<s/regex/replacement/modifiers>.
The C<replacement> is a Perl double-quoted string that replaces in the
string whatever is matched with the C<regex>. The operator C<=~> is
also used here to associate a string with C<s///>. If matching
against C<$_>, the S<C<$_ =~>> can be dropped. If there is a match,
C<s///> returns the number of substitutions made; otherwise it returns
false. Here are a few examples:

    $x = "Time to feed the cat!";
    $x =~ s/cat/hacker/;   # $x contains "Time to feed the hacker!"
    $y = "'quoted words'";
    $y =~ s/^'(.*)'$/$1/;  # strip single quotes,
                           # $y contains "quoted words"

With the C<s///> operator, the matched variables C<$1>, C<$2>, etc.
are immediately available for use in the replacement expression. With
the global modifier, C<s///g> will search and replace all occurrences
of the regex in the string:

    $x = "I batted 4 for 4";
    $x =~ s/4/four/;   # $x contains "I batted four for 4"
    $x = "I batted 4 for 4";
    $x =~ s/4/four/g;  # $x contains "I batted four for four"

The non-destructive modifier C<s///r> causes the result of the substitution
to be returned instead of modifying C<$_> (or whatever variable the
substitute was bound to with C<=~>):

    $x = "I like dogs.";
    $y = $x =~ s/dogs/cats/r;
    print "$x $y\n"; # prints "I like dogs. I like cats."

    $x = "Cats are great.";
    print $x =~ s/Cats/Dogs/r =~ s/Dogs/Frogs/r =~ s/Frogs/Hedgehogs/r, "\n";
    # prints "Hedgehogs are great."

    @foo = map { s/[a-z]/X/r } qw(a b c 1 2 3);
    # @foo is now qw(X X X 1 2 3)

The evaluation modifier C<s///e> wraps an C<eval{...}> around the
replacement string and the evaluated result is substituted for the
matched substring. Some examples:

    # reverse all the words in a string
    $x = "the cat in the hat";
    $x =~ s/(\w+)/reverse $1/ge;   # $x contains "eht tac ni eht tah"

    # convert percentage to decimal
    $x = "A 39% hit rate";
    $x =~ s!(\d+)%!$1/100!e;       # $x contains "A 0.39 hit rate"

The last example shows that C<s///> can use other delimiters, such as
C<s!!!> and C<s{}{}>, and even C<s{}//>. If single quotes are used
C<s'''>, then the regex and replacement are treated as single-quoted
strings.

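The modifiers can be combined. The sketch below applies C<g>, C<e> and C<r> together to convert every percentage in a string without changing the original, and separately shows the count that C<s///g> returns. The sample strings and variable names are made up, and the C<r> modifier assumes Perl 5.14 or later.

```perl
use strict;
use warnings;

my $x = "A 39% hit rate and a 5% miss rate";

# /g replaces every percentage, /e evaluates '$1/100' as Perl code,
# and /r returns the new string while leaving $x untouched.
my $y = $x =~ s{(\d+)%}{$1/100}ger;

# Without /r the string is changed in place and the count is returned.
my $score = "I batted 4 for 4";
my $count = ($score =~ s/4/four/g);

print "$x\n";
print "$y\n";
print "$count substitutions: $score\n";
```
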
=head2 The split operator

C<split /regex/, string> splits C<string> into a list of substrings
and returns that list. The regex determines the character sequence
that C<string> is split with respect to. For example, to split a
string into words, use

    $x = "Calvin and Hobbes";
    @word = split /\s+/, $x;  # $word[0] = 'Calvin'
                              # $word[1] = 'and'
                              # $word[2] = 'Hobbes'

To extract a comma-delimited list of numbers, use

    $x = "1.618,2.718,   3.142";
    @const = split /,\s*/, $x;  # $const[0] = '1.618'
                                # $const[1] = '2.718'
                                # $const[2] = '3.142'

If the empty regex C<//> is used, the string is split into individual
characters. If the regex has groupings, then the list produced contains
the matched substrings from the groupings as well:

    $x = "/usr/bin";
    @parts = split m!(/)!, $x;  # $parts[0] = ''
                                # $parts[1] = '/'
                                # $parts[2] = 'usr'
                                # $parts[3] = '/'
                                # $parts[4] = 'bin'

Since the first character of $x matched the regex, C<split> prepended
an empty initial element to the list.

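The three behaviours described above can be compared in one short script. This is a sketch with made-up variable names; the C<grep { length }> step is an addition for illustration, showing one way to drop the empty leading element when it is not wanted.

```perl
use strict;
use warnings;

# Empty regex: one element per character.
my @chars = split //, "abc";

# Grouping in the regex: the separators are returned as well, and the
# leading '/' produces an empty first element.
my @parts = split m!(/)!, "/usr/bin";

# No grouping, and empty elements filtered out afterwards.
my @dirs = grep { length } split m!/!, "/usr/bin";

print "chars: @chars\n";
print "parts: ", scalar(@parts), " elements\n";
print "dirs:  @dirs\n";
```
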
=head1 BUGS

None.

=head1 SEE ALSO

This is just a quick start guide. For a more in-depth tutorial on
regexes, see L<perlretut> and for the reference page, see L<perlre>.

=head1 AUTHOR AND COPYRIGHT

Copyright (c) 2000 Mark Kvale
All rights reserved.

This document may be distributed under the same terms as Perl itself.

=head2 Acknowledgments

The author would like to thank Mark-Jason Dominus, Tom Christiansen,
Ilya Zakharevich, Brad Hughes, and Mike Giroux for all their helpful
comments.

=cut
	515	Ilya Zakharevich, Brad Hughes, and Mike Giroux for all their helpful
	516	comments.
	517
	518	=cut
	519