	1	=head1 NAME
	2
	3	perlre - Perl regular expressions
	4
	5	=head1 DESCRIPTION
	6
	7	This page describes the syntax of regular expressions in Perl. For a
	8	description of how to actually I<use> regular expressions in matching
	9	operations, plus various examples of the same, see C<m//> and C<s///> in
	10	L<perlop>.
	11
	12	The matching operations can
	13	have various modifiers, some of which relate to the interpretation of
	14	the regular expression inside. These are:
	15
	16	=over 4
	17
	18	=item i
	19
	20	Do case-insensitive pattern matching.
	21
	22	=item m
	23
	24	Treat string as multiple lines. That is, change "^" and "$" from matching
	25	only at the very start or end of the string to the start or end of any
	26	line anywhere within the string.
	27
	28	=item s
	29
	30	Treat string as single line. That is, change "." to match any character
	31	whatsoever, even a newline, which it normally would not match.
	32
	33	=item x
	34
	35	Extend your pattern's legibility by permitting whitespace and comments.
	36
	37	=back
	38
	39	These are usually written as "the C</x> modifier", even though the delimiter
	40	in question might not actually be a slash. In fact, any of these
	41	modifiers may also be embedded within the regular expression itself using
	42	the new C<(?...)> construct. See below.
	43
	44	The C</x> modifier itself needs a little more explanation. It tells
	45	the regular expression parser to ignore whitespace that is neither
	46	backslashed nor within a character class. You can use this to break up
	47	your regular expression into (slightly) more readable parts. The C<#>
	48	character is also treated as a metacharacter introducing a comment,
	49	just as in ordinary Perl code. This also means that if you want real
	50	whitespace or C<#> characters in the pattern, you'll have to either
	51	escape them or encode them using octal or hex escapes. Taken together,
	52	these features go a long way towards making Perl's regular expressions
	53	more readable. See the C comment deletion code in L<perlop>.
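
As a rough sketch of what C</x> buys you (the sample string and the variable names are invented for illustration), here is one pattern written twice, once compactly and once spread out with comments:

```perl
use strict;
use warnings;

my $line = "Time: 12:34:56";   # hypothetical sample input

# Compact form.
my @compact = $line =~ /Time: (\d\d):(\d\d):(\d\d)/;

# Same pattern under /x: unescaped whitespace is ignored and "#" starts a comment,
# so the space after the label has to be spelled \s here.
my @spaced = $line =~ /
    Time: \s      # label, then one whitespace character
    (\d\d) :      # hours
    (\d\d) :      # minutes
    (\d\d)        # seconds
/x;

print "@compact\n";
print "@spaced\n";
```

Both lines print C<12 34 56>.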
	54
	55	=head2 Regular Expressions
	56
	57	The patterns used in pattern matching are regular expressions such as
	58	those supplied in the Version 8 regexp routines. (In fact, the
	59	routines are derived (distantly) from Henry Spencer's freely
	60	redistributable reimplementation of the V8 routines.)
	61	See L<Version 8 Regular Expressions> for details.
	62
	63	In particular the following metacharacters have their standard I<egrep>-ish
	64	meanings:
	65
	66	\ Quote the next metacharacter
	67	^ Match the beginning of the line
	68	. Match any character (except newline)
	69	$ Match the end of the line (or before newline at the end)
	70	| Alternation
	71	() Grouping
	72	[] Character class
	73
	74	By default, the "^" character is guaranteed to match only at the
	75	beginning of the string, the "$" character only at the end (or before the
	76	newline at the end) and Perl does certain optimizations with the
	77	assumption that the string contains only one line. Embedded newlines
	78	will not be matched by "^" or "$". You may, however, wish to treat a
	79	string as a multi-line buffer, such that the "^" will match after any
	80	newline within the string, and "$" will match before any newline. At the
	81	cost of a little more overhead, you can do this by using the /m modifier
	82	on the pattern match operator. (Older programs did this by setting C<$*>,
	83	but this practice is deprecated in Perl 5.)
	84
	85	To facilitate multi-line substitutions, the "." character never matches a
	86	newline unless you use the C</s> modifier, which in effect tells Perl to pretend
	87	the string is a single line--even if it isn't. The C</s> modifier also
	88	overrides the setting of C<$*>, in case you have some (badly behaved) older
	89	code that sets it in another module.
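
A minimal sketch of the difference the C</m> and C</s> modifiers make, assuming a made-up two-line buffer:

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";   # hypothetical two-line buffer

# Without /m, "^" matches only at the very start of the string.
my @plain = $text =~ /^(\w+)/g;     # ("first")

# With /m, "^" also matches just after each embedded newline.
my @multi = $text =~ /^(\w+)/mg;    # ("first", "second")

# Without /s, "." stops at a newline; with /s it crosses it.
my ($no_s)   = $text =~ /(first.*)/;    # "first line"
my ($with_s) = $text =~ /(first.*)/s;   # the whole buffer

print scalar(@plain), " ", scalar(@multi), "\n";
```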
	90
	91	The following standard quantifiers are recognized:
	92
	93	* Match 0 or more times
	94	+ Match 1 or more times
	95	? Match 1 or 0 times
	96	{n} Match exactly n times
	97	{n,} Match at least n times
	98	{n,m} Match at least n but not more than m times
	99
	100	(If a curly bracket occurs in any other context, it is treated
	101	as a regular character.) The "*" quantifier is equivalent to C<{0,}>, the "+"
	102	quantifier to C<{1,}>, and the "?" quantifier to C<{0,1}>. n and m are limited
	103	to integral values less than 65536.
	104
	105	By default, a quantified subpattern is "greedy": given a particular
	106	starting location, it will match as many times as possible without
	107	causing the rest of the pattern to fail. If you want it to match the
	108	minimum number of times possible instead, follow the quantifier with
	109	a "?". Note that the meanings don't change, just the "gravity", that
	110	is, whether the quantifier prefers the longest or the shortest match
	111	that still lets the whole pattern succeed:
	112
	113	*? Match 0 or more times
	114	+? Match 1 or more times
	115	?? Match 0 or 1 time
	116	{n}? Match exactly n times
	117	{n,}? Match at least n times
	118	{n,m}? Match at least n but not more than m times
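
To make the two flavors concrete, here is a small sketch (the subject string is an arbitrary choice) that applies a greedy and a minimal quantifier to the same text:

```perl
use strict;
use warnings;

my $s = "aaaa";   # hypothetical subject string

my ($greedy)  = $s =~ /(a{2,3})/;    # takes as many as allowed: "aaa"
my ($minimal) = $s =~ /(a{2,3}?)/;   # takes as few as allowed:  "aa"
my ($plus)    = $s =~ /(a+?)/;       # one is enough:            "a"
my ($star)    = $s =~ /(a*?)/;       # zero is enough:           ""

print "greedy=<$greedy> minimal=<$minimal> plus=<$plus> star=<$star>\n";
```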
	119
	120	Since patterns are processed as double quoted strings, the following
	121	also work:
	122
	123	\t tab (HT, TAB)
	124	\n newline (LF, NL)
	125	\r return (CR)
	126	\f form feed (FF)
	127	\a alarm (bell) (BEL)
	128	\e escape (think troff) (ESC)
	129	\033 octal char (think of a PDP-11)
	130	\x1B hex char
	131	\c[ control char
	132	\l lowercase next char (think vi)
	133	\u uppercase next char (think vi)
	134	\L lowercase till \E (think vi)
	135	\U uppercase till \E (think vi)
	136	\E end case modification (think vi)
	137	\Q quote regexp metacharacters till \E
	138
	139	In addition, Perl defines the following:
	140
	141	\w Match a "word" character (alphanumeric plus "_")
	142	\W Match a non-word character
	143	\s Match a whitespace character
	144	\S Match a non-whitespace character
	145	\d Match a digit character
	146	\D Match a non-digit character
	147
	148	Note that C<\w> matches a single alphanumeric character, not a whole
	149	word. To match a word you'd need to say C<\w+>. You may use C<\w>,
	150	C<\W>, C<\s>, C<\S>, C<\d> and C<\D> within character classes (though not
	151	as either end of a range).
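
For instance (a sketch with an invented input string), the class escapes can be used on their own or mixed into a bracketed character class:

```perl
use strict;
use warnings;

my $s = "order_42: 3.5 kg";   # hypothetical input

my ($word)   = $s =~ /(\w+)/;         # \w+ grabs a whole "word": "order_42"
my ($number) = $s =~ /([\d.]+)\s/;    # \d inside a class, plus a literal dot: "3.5"
my @punct    = $s =~ /([^\w\s])/g;    # neither word nor whitespace: ":" and "."

print "$word | $number | @punct\n";
```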
	152
	153	Perl defines the following zero-width assertions:
	154
	155	\b Match a word boundary
	156	\B Match a non-(word boundary)
	157	\A Match only at beginning of string
	158	\Z Match only at end of string (or before newline at the end)
	159	\G Match only where previous m//g left off
	160
	161	A word boundary (C<\b>) is defined as a spot between two characters that
	162	has a C<\w> on one side of it and a C<\W> on the other side of it (in
	163	either order), counting the imaginary characters off the beginning and
	164	end of the string as matching a C<\W>. (Within character classes C<\b>
	165	represents backspace rather than a word boundary.) The C<\A> and C<\Z> are
	166	just like "^" and "$" except that they won't match multiple times when the
	167	C</m> modifier is used, while "^" and "$" will match at every internal line
	168	boundary. To match the actual end of the string, not ignoring newline,
	169	you can use C<\Z(?!\n)>.
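
A short sketch of these assertions at work, assuming a sample string that ends in a newline:

```perl
use strict;
use warnings;

my $s = "cat concat cat\n";   # hypothetical input ending in a newline

my $any   = () = $s =~ /cat/g;       # 3: also counts the "cat" inside "concat"
my $whole = () = $s =~ /\bcat\b/g;   # 2: only the free-standing words

my $before_nl = $s =~ /cat\Z/       ? 1 : 0;   # 1: \Z may match before a final newline
my $true_end  = $s =~ /cat\Z(?!\n)/ ? 1 : 0;   # 0: the newline is still ahead

print "$any $whole $before_nl $true_end\n";
```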
	170
	171	When the bracketing construct C<( ... )> is used, \E<lt>digitE<gt> matches the
	172	digit'th substring. Outside of the pattern, always use "$" instead of "\"
	173	in front of the digit. (While the \E<lt>digitE<gt> notation can on rare occasion work
	174	outside the current pattern, this should not be relied upon. See the
	175	WARNING below.) The scope of $E<lt>digitE<gt> (and C<$`>, C<$&>, and C<$'>)
	176	extends to the end of the enclosing BLOCK or eval string, or to the next
	177	successful pattern match, whichever comes first. If you want to use
	178	parentheses to delimit a subpattern (e.g. a set of alternatives) without
	179	saving it as a subpattern, follow the ( with a ?:.
	180
	181	You may have as many parentheses as you wish. If you have more
	182	than 9 substrings, the variables $10, $11, ... refer to the
	183	corresponding substring. Within the pattern, \10, \11, etc. refer back
	184	to substrings if there have been at least that many left parens before
	185	the backreference. Otherwise (for backward compatibility) \10 is the
	186	same as \010, a backspace, and \11 the same as \011, a tab. And so
	187	on. (\1 through \9 are always backreferences.)
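
The following sketch (inputs invented for the purpose) shows a backreference inside a pattern, a non-capturing C<(?:...)> group, and a two-digit match variable:

```perl
use strict;
use warnings;

my $s = "the the quick brown fox";   # hypothetical input

# \1 inside the pattern must match the same text that group 1 matched.
my ($dup) = $s =~ /\b(\w+) \1\b/;            # "the"

# (?:...) groups the alternatives without using up a group number,
# so the word after them is $1.
my ($after) = $s =~ /(?:quick|slow) (\w+)/;  # "brown"

# With ten groups, the tenth is available as $10 after the match.
my $tenth = "abcdefghij" =~ /(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)/ ? $10 : undef;

print "$dup $after $tenth\n";
```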
	188
	189	C<$+> returns whatever the last bracket match matched. C<$&> returns the
	190	entire matched string. (C<$0> used to return the same thing, but not any
	191	more.) C<$`> returns everything before the matched string. C<$'> returns
	192	everything after the matched string. Examples:
	193
	194	s/^([^ ]*) *([^ ]*)/$2 $1/; # swap first two words
	195
	196	if (/Time: (..):(..):(..)/) {
	197	$hours = $1;
	198	$minutes = $2;
	199	$seconds = $3;
	200	}
	201
	202	You will note that all backslashed metacharacters in Perl are
	203	alphanumeric, such as C<\b>, C<\w>, C<\n>. Unlike some other regular expression
	204	languages, there are no backslashed symbols that aren't alphanumeric.
	205	So anything that looks like \\, \(, \), \E<lt>, \E<gt>, \{, or \} is always
	206	interpreted as a literal character, not a metacharacter. This makes it
	207	simple to quote a string that you want to use for a pattern but that
	208	you are afraid might contain metacharacters. Simply quote all the
	209	non-alphanumeric characters:
	210
	211	$pattern =~ s/(\W)/\\$1/g;
	212
	213	You can also use the built-in quotemeta() function to do this.
	214	An even easier way to quote metacharacters right in the match operator
	215	is to say
	216
	217	/$unquoted\Q$quoted\E$unquoted/
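
Here is a sketch of both quoting approaches side by side; the strings are made up, and the unquoted attempt is included only to show why quoting matters:

```perl
use strict;
use warnings;

my $literal = "3.5*(x+y)";                # hypothetical text full of metacharacters
my $target  = "cost = 3.5*(x+y) units";   # hypothetical string to search

my $quoted = quotemeta($literal);         # every non-word character gets a backslash

my $hit_fn  = $target =~ /$quoted/      ? 1 : 0;   # 1: matches the literal text
my $hit_Q   = $target =~ /\Q$literal\E/ ? 1 : 0;   # 1: same effect, done inline
my $hit_raw = $target =~ /$literal/     ? 1 : 0;   # 0: "*", "(" and "+" act as metacharacters

print "$quoted\n$hit_fn $hit_Q $hit_raw\n";
```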
	218
	219	Perl 5 defines a consistent extension syntax for regular expressions.
	220	The syntax is a pair of parens with a question mark as the first thing
	221	within the parens (this was a syntax error in Perl 4). The character
	222	after the question mark gives the function of the extension. Several
	223	extensions are already supported:
	224
	225	=over 10
	226
	227	=item (?#text)
	228
	229	A comment. The text is ignored. If the C</x> switch is used to enable
	230	whitespace formatting, a simple C<#> will suffice.
	231
	232	=item (?:regexp)
	233
	234	This groups things like "()" but doesn't make backreferences like "()" does. So
	235
	236	split(/\b(?:a|b|c)\b/)
	237
	238	is like
	239
	240	split(/\b(a|b|c)\b/)
	241
	242	but doesn't spit out extra fields.
	243
	244	=item (?=regexp)
	245
	246	A zero-width positive lookahead assertion. For example, C</\w+(?=\t)/>
	247	matches a word followed by a tab, without including the tab in C<$&>.
	248
	249	=item (?!regexp)
	250
	251	A zero-width negative lookahead assertion. For example C</foo(?!bar)/>
	252	matches any occurrence of "foo" that isn't followed by "bar". Note
	253	however that lookahead and lookbehind are NOT the same thing. You cannot
	254	use this for lookbehind: C</(?!foo)bar/> will not find an occurrence of
	255	"bar" that is preceded by something which is not "foo". That's because
	256	the C<(?!foo)> is just saying that the next thing cannot be "foo"--and
	257	it's not, it's a "bar", so "foobar" will match. You would have to do
	258	something like C</(?!foo)...bar/> for that. We say "like" because there's
	259	the case of your "bar" not having three characters before it. You could
	260	cover that this way: C</(?:(?!foo)...|^.{0,2})bar/>. Sometimes it's still
	261	easier just to say:
	262
	263	if (/bar/ && $` !~ /foo$/)
	264
	265
	266	=item (?imsx)
	267
	268	One or more embedded pattern-match modifiers. This is particularly
	269	useful for patterns that are specified in a table somewhere, some of
	270	which want to be case sensitive, and some of which don't. The case
	271	insensitive ones merely need to include C<(?i)> at the front of the
	272	pattern. For example:
	273
	274	$pattern = "foobar";
	275	if ( /$pattern/i )
	276
	277	# more flexible:
	278
	279	$pattern = "(?i)foobar";
	280	if ( /$pattern/ )
	281
	282	=back
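
The extensions listed above can be tried out with a sketch like this one; all of the sample strings are assumptions chosen to keep the results easy to check:

```perl
use strict;
use warnings;

# (?:...) versus (...) as a split separator.
my $s = "x a y b z";                                # hypothetical input
my @plain   = split /\s*\b(?:a|b)\b\s*/, $s;        # ("x", "y", "z")
my @capture = split /\s*\b(a|b)\b\s*/,   $s;        # ("x", "a", "y", "b", "z")

# (?=...) looks ahead without consuming: the tab is not part of the match.
my ($word) = "foo\tbar" =~ /(\w+)(?=\t)/;           # "foo"

# (?!...) rejects a "foo" that is followed by "bar".
my @not_bar = "foobar foobaz" =~ /(foo(?!bar)\w*)/g;   # ("foobaz")

# (?i) carries the modifier inside the pattern string itself.
my $pattern = "(?i)foobar";
my $ci = "FooBar" =~ /$pattern/ ? 1 : 0;            # 1

print "@plain | @capture | $word | @not_bar | $ci\n";
```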
	283
	284	The specific choice of question mark for this and the new minimal
	285	matching construct was because 1) question mark is pretty rare in older
	286	regular expressions, and 2) whenever you see one, you should stop
	287	and "question" exactly what is going on. That's psychology...
	288
	289	=head2 Backtracking
	290
	291	A fundamental feature of regular expression matching involves the notion
	292	called I<backtracking>, which is used (when needed) by all regular
	293	expression quantifiers, namely C<*>, C<*?>, C<+>, C<+?>, C<{n,m}>, and
	294	C<{n,m}?>.
	295
	296	For a regular expression to match, the I<entire> regular expression must
	297	match, not just part of it. So if the beginning of a pattern containing a
	298	quantifier succeeds in a way that causes later parts in the pattern to
	299	fail, the matching engine backs up and recalculates the beginning
	300	part--that's why it's called backtracking.
	301
	302	Here is an example of backtracking: Let's say you want to find the
	303	word following "foo" in the string "Food is on the foo table.":
	304
	305	$_ = "Food is on the foo table.";
	306	if ( /\b(foo)\s+(\w+)/i ) {
	307	print "$2 follows $1.\n";
	308	}
	309
	310	When the match runs, the first part of the regular expression (C<\b(foo)>)
	311	finds a possible match right at the beginning of the string, and loads up
	312	$1 with "Foo". However, as soon as the matching engine sees that there's
	313	no whitespace following the "Foo" that it had saved in $1, it realizes its
	314	mistake and starts over again one character after where it had made the
	315	tentative match. This time it goes all the way until the next occurrence
	316	of "foo". The complete regular expression matches this time, and you get
	317	the expected output of "table follows foo."
	318
	319	Sometimes minimal matching can help a lot. Imagine you'd like to match
	320	everything between "foo" and "bar". Initially, you write something
	321	like this:
	322
	323	$_ = "The food is under the bar in the barn.";
	324	if ( /foo(.*)bar/ ) {
	325	print "got <$1>\n";
	326	}
	327
	328	Which perhaps unexpectedly yields:
	329
	330	got <d is under the bar in the >
	331
	332	That's because C<.*> was greedy, so you get everything between the
	333	I<first> "foo" and the I<last> "bar". In this case, it's more effective
	334	to use minimal matching to make sure you get the text between a "foo"
	335	and the first "bar" thereafter.
	336
	337	if ( /foo(.*?)bar/ ) { print "got <$1>\n" }
	338	got <d is under the >
	339
	340	Here's another example: let's say you'd like to match a number at the end
	341	of a string, and you also want to keep the preceding part of the match.
	342	So you write this:
	343
	344	$_ = "I have 2 numbers: 53147";
	345	if ( /(.*)(\d*)/ ) { # Wrong!
	346	print "Beginning is <$1>, number is <$2>.\n";
	347	}
	348
	349	That won't work at all, because C<.*> was greedy and gobbled up the
	350	whole string. As C<\d*> can match on an empty string the complete
	351	regular expression matched successfully.
	352
	353	Beginning is <I have 2 numbers: 53147>, number is <>.
	354
	355	Here are some variants, most of which don't work:
	356
	357	$_ = "I have 2 numbers: 53147";
	358	@pats = qw{
	359	(.*)(\d*)
	360	(.*)(\d+)
	361	(.*?)(\d*)
	362	(.*?)(\d+)
	363	(.*)(\d+)$
	364	(.*?)(\d+)$
	365	(.*)\b(\d+)$
	366	(.*\D)(\d+)$
	367	};
	368
	369	for $pat (@pats) {
	370	printf "%-12s ", $pat;
	371	if ( /$pat/ ) {
	372	print "<$1> <$2>\n";
	373	} else {
	374	print "FAIL\n";
	375	}
	376	}
	377
	378	That will print out:
	379
	380	(.*)(\d*) <I have 2 numbers: 53147> <>
	381	(.*)(\d+) <I have 2 numbers: 5314> <7>
	382	(.*?)(\d*) <> <>
	383	(.*?)(\d+) <I have > <2>
	384	(.*)(\d+)$ <I have 2 numbers: 5314> <7>
	385	(.*?)(\d+)$ <I have 2 numbers: > <53147>
	386	(.*)\b(\d+)$ <I have 2 numbers: > <53147>
	387	(.*\D)(\d+)$ <I have 2 numbers: > <53147>
	388
	389	As you see, this can be a bit tricky. It's important to realize that a
	390	regular expression is merely a set of assertions that gives a definition
	391	of success. There may be 0, 1, or several different ways that the
	392	definition might succeed against a particular string. And if there are
	393	multiple ways it might succeed, you need to understand backtracking in
	394	order to know which variety of success you will achieve.
	395
	396	When using lookahead assertions and negations, this can all get even
	397	trickier. Imagine you'd like to find a sequence of nondigits not
	398	followed by "123". You might try to write that as
	399
	400	$_ = "ABC123";
	401	if ( /^\D*(?!123)/ ) { # Wrong!
	402	print "Yup, no 123 in $_\n";
	403	}
	404
	405	But that isn't going to behave the way you're hoping: the pattern
	406	matches, and so claims that there is no 123 in the string. Here's a clearer picture of
	407	why that pattern matches, contrary to popular expectations:
	408
	409	$x = 'ABC123' ;
	410	$y = 'ABC445' ;
	411
	412	print "1: got $1\n" if $x =~ /^(ABC)(?!123)/ ;
	413	print "2: got $1\n" if $y =~ /^(ABC)(?!123)/ ;
	414
	415	print "3: got $1\n" if $x =~ /^(\D*)(?!123)/ ;
	416	print "4: got $1\n" if $y =~ /^(\D*)(?!123)/ ;
	417
	418	This prints
	419
	420	2: got ABC
	421	3: got AB
	422	4: got ABC
	423
	424	You might have expected test 3 to fail because it just seems to be a more
	425	general-purpose version of test 1. The important difference between
	426	them is that test 3 contains a quantifier (C<\D*>) and so can use
	427	backtracking, whereas test 1 will not. What's happening is
	428	that you've asked "Is it true that at the start of $x, following 0 or more
	429	nondigits, you have something that's not 123?" If the pattern matcher had
	430	let C<\D*> expand to "ABC", this would have caused the whole pattern to
	431	fail.
	432	The search engine will initially match C<\D*> with "ABC". Then it will
	433	try to match C<(?!123)> with "123" which, of course, fails. But because
	434	a quantifier (C<\D*>) has been used in the regular expression, the
	435	search engine can backtrack and retry the match differently
	436	in the hope of matching the complete regular expression.
	437
	438	Well now,
	439	the pattern really, I<really> wants to succeed, so it uses the
	440	standard regexp backoff-and-retry and lets C<\D*> expand to just "AB" this
	441	time. Now there's indeed something following "AB" that is not
	442	"123". It's in fact "C123", which suffices.
	443
	444	We can deal with this by using both an assertion and a negation. We'll
	445	say that the first part in $1 must be followed by a digit, and in fact, it
	446	must also be followed by something that's not "123". Remember that the
	447	lookaheads are zero-width expressions--they only look, but don't consume
	448	any of the string in their match. So rewriting this way produces what
	449	you'd expect; that is, case 5 will fail, but case 6 succeeds:
	450
	451	print "5: got $1\n" if $x =~ /^(\D*)(?=\d)(?!123)/ ;
	452	print "6: got $1\n" if $y =~ /^(\D*)(?=\d)(?!123)/ ;
	453
	454	6: got ABC
	455
	456	In other words, the two zero-width assertions next to each other work like
	457	they're ANDed together, just as you'd use any builtin assertions: C</^$/>
	458	matches only if you're at the beginning of the line AND the end of the
	459	line simultaneously. The deeper underlying truth is that juxtaposition in
	460	regular expressions always means AND, except when you write an explicit OR
	461	using the vertical bar. C</ab/> means match "a" AND (then) match "b",
	462	although the attempted matches are made at different positions because "a"
	463	is not a zero-width assertion, but a one-width assertion.
	464
	465	One warning: particularly complicated regular expressions can take
	466	exponential time to solve due to the immense number of possible ways they
	467	can use backtracking to try to match. For example, this will take a very long
	468	time to run:
	469
	470	/((a{0,5}){0,5}){0,5}/
	471
	472	And if you used C<*>'s instead of limiting it to 0 through 5 matches, then
	473	it would take literally forever--or until you ran out of stack space.
	474
	475	=head2 Version 8 Regular Expressions
	476
	477	In case you're not familiar with the "regular" Version 8 regexp
	478	routines, here are the pattern-matching rules not described above.
	479
	480	Any single character matches itself, unless it is a I<metacharacter>
	481	with a special meaning described here or above. You can cause
	482	characters which normally function as metacharacters to be interpreted
	483	literally by prefixing them with a "\" (e.g. "\." matches a ".", not any
	484	character; "\\" matches a "\"). A series of characters matches that
	485	series of characters in the target string, so the pattern C<blurfl>
	486	would match "blurfl" in the target string.
	487
	488	You can specify a character class, by enclosing a list of characters
	489	in C<[]>, which will match any one of the characters in the list. If the
	490	first character after the "[" is "^", the class matches any character not
	491	in the list. Within a list, the "-" character is used to specify a
	492	range, so that C<a-z> represents all the characters between "a" and "z",
	493	inclusive.
	494
	495	Characters may be specified using a metacharacter syntax much like that
	496	used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,
	497	"\f" a form feed, etc. More generally, \I<nnn>, where I<nnn> is a string
	498	of octal digits, matches the character whose ASCII value is I<nnn>.
	499	Similarly, \xI<nn>, where I<nn> are hexadecimal digits, matches the
	500	character whose ASCII value is I<nn>. The expression \cI<x> matches the
	501	ASCII character control-I<x>. Finally, the "." metacharacter matches any
	502	character except "\n" (unless you use C</s>).
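
As a quick check of these escapes (a sketch that assumes an ASCII platform, as the text does), the different spellings can be compared directly:

```perl
use strict;
use warnings;

# Four spellings of the same character (ESC, decimal 27 in ASCII).
my $esc  = "\e";
my $same = ($esc eq "\033" && $esc eq "\x1B" && $esc eq "\c[") ? 1 : 0;

# The same escapes work inside a pattern; both of these look for a tab.
my $s   = "a\tb";   # hypothetical input
my $oct = $s =~ /a\011b/ ? 1 : 0;
my $hex = $s =~ /a\x09b/ ? 1 : 0;

# "." refuses a newline unless /s is in effect.
my $dot   = "a\nb" =~ /a.b/  ? 1 : 0;   # 0
my $dot_s = "a\nb" =~ /a.b/s ? 1 : 0;   # 1

print "$same $oct $hex $dot $dot_s\n";
```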
	503
	504	You can specify a series of alternatives for a pattern using "|" to
	505	separate them, so that C<fee|fie|foe> will match any of "fee", "fie",
	506	or "foe" in the target string (as would C<f(e|i|o)e>). Note that the
	507	first alternative includes everything from the last pattern delimiter
	508	("(", "[", or the beginning of the pattern) up to the first "|", and
	509	the last alternative contains everything from the last "|" to the next
	510	pattern delimiter. For this reason, it's common practice to include
	511	alternatives in parentheses, to minimize confusion about where they
	512	start and end. Note however that "|" is interpreted as a literal within
	513	square brackets, so if you write C<[fee|fie|foe]> you're really only
	514	matching C<[feio|]>.
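
The reach of each alternative is easy to get wrong, so here is a sketch (word list and test strings are invented) that contrasts grouped and ungrouped alternation, and shows the vertical bar inside a character class:

```perl
use strict;
use warnings;

my @words = qw(fee fie foe fum);   # hypothetical word list

my @alt   = grep { /^(?:fee|fie|foe)$/ } @words;   # fee fie foe
my @group = grep { /^f(?:e|i|o)e$/ }     @words;   # the same three words

# Without parentheses each anchor belongs to only one alternative:
# this reads as "^fie" OR "fee$", so "coffee" matches.
my $loose = "coffee" =~ /^fie|fee$/     ? 1 : 0;   # 1
my $tight = "coffee" =~ /^(?:fie|fee)$/ ? 1 : 0;   # 0

# Inside square brackets "|" is just another character in the class.
my $bar = "|" =~ /^[fee|fie|foe]$/ ? 1 : 0;        # 1

print "@alt | @group | $loose $tight $bar\n";
```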
	515
	516	Within a pattern, you may designate subpatterns for later reference by
	517	enclosing them in parentheses, and you may refer back to the I<n>th
	518	subpattern later in the pattern using the metacharacter \I<n>.
	519	Subpatterns are numbered based on the left to right order of their
	520	opening parenthesis. Note that a backreference matches whatever
	521	actually matched the subpattern in the string being examined, not the
	522	rules for that subpattern. Therefore, C<(0|0x)\d*\s\1\d*> will
	523	match "0x1234 0x4321", but not "0x1234 01234", since subpattern 1
	524	actually matched "0x", even though the rule C<0|0x> could
	525	potentially match the leading 0 in the second number.
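
A small sketch that runs the pattern C<(0|0x)\d*\s\1\d*> against the two strings discussed here (the variable names are only for illustration):

```perl
use strict;
use warnings;

my $re = qr/(0|0x)\d*\s\1\d*/;

my $both_hex = "0x1234 0x4321" =~ $re ? 1 : 0;   # 1: \1 must repeat "0x", and it does
my $prefix   = $1;                               # what group 1 actually matched: "0x"
my $mixed    = "0x1234 01234"  =~ $re ? 1 : 0;   # 0: "01" is not a repeat of "0x"

print "$both_hex <$prefix> $mixed\n";
```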
	526
	527	=head2 WARNING on \1 vs $1
	528
	529	Some people get too used to writing things like
	530
	531	$pattern =~ s/(\W)/\\\1/g;
	532
	533	This is grandfathered for the RHS of a substitute to avoid shocking the
	534	B<sed> addicts, but it's a dirty habit to get into. That's because in
	535	PerlThink, the right-hand side of a C<s///> is a double-quoted string. C<\1> in
	536	the usual double-quoted string means a control-A. The customary Unix
	537	meaning of C<\1> is kludged in for C<s///>. However, if you get into the habit
	538	of doing that, you get yourself into trouble if you then add an C</e>
	539	modifier.
	540
	541	s/(\d+)/ \1 + 1 /eg;
	542
	543	Or if you try to do
	544
	545	s/(\d+)/\1000/;
	546
	547	You can't disambiguate that by saying C<\{1}000>, whereas you can fix it with
	548	C<${1}000>. Basically, the operation of interpolation should not be confused
	549	with the operation of matching a backreference. Certainly they mean two
	550	different things on the I<left> side of the C<s///>.
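
For comparison, a sketch of the same two substitutions written with C<$1> instead of C<\1> (the input value is made up):

```perl
use strict;
use warnings;

my $s = "42";   # hypothetical input

# ${1} keeps the group number separate from the digits that follow it.
(my $padded = $s) =~ s/(\d+)/${1}000/;   # "42000"

# Under /e the right-hand side is Perl code, where $1 is an ordinary variable.
(my $next = $s) =~ s/(\d+)/$1 + 1/e;     # "43"

print "$padded $next\n";
```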