Commit	Line	Data
	1	=head1 NAME
	2
	3	perlretut - Perl regular expressions tutorial
	4
	5	=head1 DESCRIPTION
	6
	7	This page provides a basic tutorial on understanding, creating and
	8	using regular expressions in Perl. It serves as a complement to the
	9	reference page on regular expressions L<perlre>. Regular expressions
	10	are an integral part of the C<m//>, C<s///>, C<qr//> and C<split>
	11	operators and so this tutorial also overlaps with
	12	L<perlop/"Regexp Quote-Like Operators"> and L<perlfunc/split>.
	13
	14	Perl is widely renowned for excellence in text processing, and regular
	15	expressions are one of the big factors behind this fame. Perl regular
	16	expressions display an efficiency and flexibility unknown in most
	17	other computer languages. Mastering even the basics of regular
	18	expressions will allow you to manipulate text with surprising ease.
	19
	20	What is a regular expression? A regular expression is simply a string
	21	that describes a pattern. Patterns are in common use these days;
	22	examples are the patterns typed into a search engine to find web pages
	23	and the patterns used to list files in a directory, e.g., C<ls *.txt>
	24	or C<dir *.*>. In Perl, the patterns described by regular expressions
	25	are used to search strings, extract desired parts of strings, and to
	26	do search and replace operations.
	27
	28	Regular expressions have the undeserved reputation of being abstract
	29	and difficult to understand. Regular expressions are constructed using
	30	simple concepts like conditionals and loops and are no more difficult
	31	to understand than the corresponding C<if> conditionals and C<while>
	32	loops in the Perl language itself. In fact, the main challenge in
	33	learning regular expressions is just getting used to the terse
	34	notation used to express these concepts.
	35
	36	This tutorial flattens the learning curve by discussing regular
	37	expression concepts, along with their notation, one at a time and with
	38	many examples. The first part of the tutorial will progress from the
	39	simplest word searches to the basic regular expression concepts. If
	40	you master the first part, you will have all the tools needed to meet
	41	about 98% of your needs. The second part of the tutorial is for those
	42	comfortable with the basics and hungry for more power tools. It
	43	discusses the more advanced regular expression operators and
	44	introduces the latest cutting-edge innovations in 5.6.0.
	45
	46	A note: to save time, 'regular expression' is often abbreviated as
	47	regexp or regex. Regexp is a more natural abbreviation than regex, but
	48	is harder to pronounce. The Perl pod documentation is evenly split on
	49	regexp vs regex; in Perl, there is more than one way to abbreviate it.
	50	We'll use regexp in this tutorial.
	51
	52	=head1 Part 1: The basics
	53
	54	=head2 Simple word matching
	55
	56	The simplest regexp is simply a word, or more generally, a string of
	57	characters. A regexp consisting of a word matches any string that
	58	contains that word:
	59
	60	"Hello World" =~ /World/; # matches
	61
	62	What is this perl statement all about? C<"Hello World"> is a simple
	63	double quoted string. C<World> is the regular expression and the
	64	C<//> enclosing C</World/> tells perl to search a string for a match.
	65	The operator C<=~> associates the string with the regexp match and
	66	produces a true value if the regexp matched, or false if the regexp
	67	did not match. In our case, C<World> matches the second word in
	68	C<"Hello World">, so the expression is true. Expressions like this
	69	are useful in conditionals:
	70
	71	if ("Hello World" =~ /World/) {
	72	    print "It matches\n";
	73	}
	74	else {
	75	    print "It doesn't match\n";
	76	}
	77
	78	There are useful variations on this theme. The sense of the match can
	79	be reversed by using the C<!~> operator:
	80
	81	if ("Hello World" !~ /World/) {
	82	    print "It doesn't match\n";
	83	}
	84	else {
	85	    print "It matches\n";
	86	}
	87
	88	The literal string in the regexp can be replaced by a variable:
	89
	90	$greeting = "World";
	91	if ("Hello World" =~ /$greeting/) {
	92	    print "It matches\n";
	93	}
	94	else {
	95	    print "It doesn't match\n";
	96	}
	97
	98	If you're matching against the special default variable C<$_>, the
	99	C<$_ =~> part can be omitted:
	100
	101	$_ = "Hello World";
	102	if (/World/) {
	103	    print "It matches\n";
	104	}
	105	else {
	106	    print "It doesn't match\n";
	107	}
	108
	109	And finally, the C<//> default delimiters for a match can be changed
	110	to arbitrary delimiters by putting an C<'m'> out front:
	111
	112	"Hello World" =~ m!World!; # matches, delimited by '!'
	113	"Hello World" =~ m{World}; # matches, note the matching '{}'
	114	"/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
	115	# '/' becomes an ordinary char
	116
	117	C</World/>, C<m!World!>, and C<m{World}> all represent the
	118	same thing. When, e.g., C<""> is used as a delimiter, the forward
	119	slash C<'/'> becomes an ordinary character and can be used in a regexp
	120	without trouble.
	121
	122	Let's consider how different regexps would match C<"Hello World">:
	123
	124	"Hello World" =~ /world/; # doesn't match
	125	"Hello World" =~ /o W/; # matches
	126	"Hello World" =~ /oW/; # doesn't match
	127	"Hello World" =~ /World /; # doesn't match
	128
	129	The first regexp C<world> doesn't match because regexps are
	130	case-sensitive. The second regexp matches because the substring
	131	S<C<'o W'> > occurs in the string S<C<"Hello World"> >. The space
	132	character ' ' is treated like any other character in a regexp and is
	133	needed to match in this case. The lack of a space character is the
	134	reason the third regexp C<'oW'> doesn't match. The fourth regexp
	135	C<'World '> doesn't match because there is a space at the end of the
	136	regexp, but not at the end of the string. The lesson here is that
	137	regexps must match a part of the string I<exactly> in order for the
	138	statement to be true.
	139
	140	If a regexp matches in more than one place in the string, perl will
	141	always match at the earliest possible point in the string:
	142
	143	"Hello World" =~ /o/; # matches 'o' in 'Hello'
	144	"That hat is red" =~ /hat/; # matches 'hat' in 'That'
	145
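To see the leftmost rule in action, the following sketch records where each match starts. It assumes the special variable C<$-[0]>, which holds the offset of the start of the last successful match (the C<@-> array is covered later in this tutorial); the hash C<%start> is simply our own bookkeeping.

```perl
use strict;
use warnings;

# Record the offset at which each regexp first matches.
my %start;
if ("Hello World" =~ /o/) {
    $start{o} = $-[0];      # offset of the 'o' in 'Hello'
}
if ("That hat is red" =~ /hat/) {
    $start{hat} = $-[0];    # offset of the 'hat' inside 'That'
}

print "o matched at offset $start{o}\n";
print "hat matched at offset $start{hat}\n";
```

Offsets count from zero, so the first match starts at offset 4 and the second at offset 1, before the standalone word 'hat' at offset 5 is ever considered.
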
	146	With respect to character matching, there are a few more points you
	147	need to know about. First of all, not all characters can be used 'as
	148	is' in a match. Some characters, called B<metacharacters>, are reserved
	149	for use in regexp notation. The metacharacters are
	150
	151	{}[]()^$.|*+?\
	152
	153	The significance of each of these will be explained
	154	in the rest of the tutorial, but for now, it is important only to know
	155	that a metacharacter can be matched by putting a backslash before it:
	156
	157	"2+2=4" =~ /2+2/; # doesn't match, + is a metacharacter
	158	"2+2=4" =~ /2\+2/; # matches, \+ is treated like an ordinary +
	159	"The interval is [0,1)." =~ /[0,1)./; # is a syntax error!
	160	"The interval is [0,1)." =~ /\[0,1\)\./; # matches
	161	"/usr/bin/perl" =~ /\/usr\/bin\/perl/; # matches
	162
	163	In the last regexp, the forward slash C<'/'> is also backslashed,
	164	because it is used to delimit the regexp. This can lead to LTS
	165	(leaning toothpick syndrome), however, and it is often more readable
	166	to change delimiters.
	167
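The two points above, escaping a metacharacter and changing delimiters to avoid leaning toothpicks, can be checked with a short sketch. The variable names are ours and each result is stored as 1 (matched) or 0 (did not match).

```perl
use strict;
use warnings;

# An unescaped '+' is a metacharacter; '\+' matches a literal plus sign.
my $unescaped = "2+2=4" =~ /2+2/  ? 1 : 0;
my $escaped   = "2+2=4" =~ /2\+2/ ? 1 : 0;

# The same path match with '/' delimiters and with '{}' delimiters.
my $path         = "/usr/bin/perl";
my $with_slashes = $path =~ /\/usr\/bin\/perl/ ? 1 : 0;
my $with_braces  = $path =~ m{/usr/bin/perl}   ? 1 : 0;

print "unescaped=$unescaped escaped=$escaped\n";
print "slashes=$with_slashes braces=$with_braces\n";
```

Both path regexps match the same thing; the C<m{}> form is simply easier to read.
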
	168
	169	The backslash character C<'\'> is a metacharacter itself and needs to
	170	be backslashed:
	171
	172	'C:\WIN32' =~ /C:\\WIN/; # matches
	173
	174	In addition to the metacharacters, there are some ASCII characters
	175	which don't have printable character equivalents and are instead
	176	represented by B<escape sequences>. Common examples are C<\t> for a
	177	tab, C<\n> for a newline, C<\r> for a carriage return and C<\a> for a
	178	bell. If your string is better thought of as a sequence of arbitrary
	179	bytes, the octal escape sequence, e.g., C<\033>, or hexadecimal escape
	180	sequence, e.g., C<\x1B> may be a more natural representation for your
	181	bytes. Here are some examples of escapes:
	182
	183	"1000\t2000" =~ m(0\t2) # matches
	184	"1000\n2000" =~ /0\n20/ # matches
	185	"1000\t2000" =~ /\000\t2/ # doesn't match, "0" ne "\000"
	186	"cat" =~ /\143\x61\x74/ # matches, but a weird way to spell cat
	187
	188	If you've been around Perl a while, all this talk of escape sequences
	189	may seem familiar. Similar escape sequences are used in double-quoted
	190	strings and in fact the regexps in Perl are mostly treated as
	191	double-quoted strings. This means that variables can be used in
	192	regexps as well. Just like double-quoted strings, the values of the
	193	variables in the regexp will be substituted in before the regexp is
	194	evaluated for matching purposes. So we have:
	195
	196	$foo = 'house';
	197	'housecat' =~ /$foo/; # matches
	198	'cathouse' =~ /cat$foo/; # matches
	199	'housecat' =~ /${foo}cat/; # matches
	200
	201	So far, so good. With the knowledge above you can already perform
	202	searches with just about any literal string regexp you can dream up.
	203	Here is a I<very simple> emulation of the Unix grep program:
	204
	205	% cat > simple_grep
	206	#!/usr/bin/perl
	207	$regexp = shift;
	208	while (<>) {
	209	    print if /$regexp/;
	210	}
	211	^D
	212
	213	% chmod +x simple_grep
	214
	215	% simple_grep abba /usr/dict/words
	216	Babbage
	217	cabbage
	218	cabbages
	219	sabbath
	220	Sabbathize
	221	Sabbathizes
	222	sabbatical
	223	scabbard
	224	scabbards
	225
	226	This program is easy to understand. C<#!/usr/bin/perl> is the standard
	227	way to invoke a perl program from the shell.
	228	S<C<$regexp = shift;> > saves the first command line argument as the
	229	regexp to be used, leaving the rest of the command line arguments to
	230	be treated as files. S<C<< while (<>) >> > loops over all the lines in
	231	all the files. For each line, S<C<print if /$regexp/;> > prints the
	232	line if the regexp matches the line. In this line, both C<print> and
	233	C</$regexp/> use the default variable C<$_> implicitly.
	234
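The same filtering logic can be tried without creating a script file. The sketch below is only an illustration: C<simple_grep_lines> is a hypothetical helper of ours, and the word list is a small made-up sample rather than the contents of F</usr/dict/words>.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines that match the regexp,
# mirroring 'print if /$regexp/' in the script above.
sub simple_grep_lines {
    my ($regexp, @lines) = @_;
    return grep { /$regexp/ } @lines;
}

my @words = qw(Babbage cabbage rabbit sabbath scabbard);
my @hits  = simple_grep_lines('abba', @words);

print "$_\n" for @hits;
```

Here C<grep> sets C<$_> to each element in turn, so C</$regexp/> matches against it implicitly, just as it does inside the C<while> loop.
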
	235	With all of the regexps above, if the regexp matched anywhere in the
	236	string, it was considered a match. Sometimes, however, we'd like to
	237	specify I<where> in the string the regexp should try to match. To do
	238	this, we would use the B<anchor> metacharacters C<^> and C<$>. The
	239	anchor C<^> means match at the beginning of the string and the anchor
	240	C<$> means match at the end of the string, or before a newline at the
	241	end of the string. Here is how they are used:
	242
	243	"housekeeper" =~ /keeper/; # matches
	244	"housekeeper" =~ /^keeper/; # doesn't match
	245	"housekeeper" =~ /keeper$/; # matches
	246	"housekeeper\n" =~ /keeper$/; # matches
	247
	248	The second regexp doesn't match because C<^> constrains C<keeper> to
	249	match only at the beginning of the string, but C<"housekeeper"> has
	250	keeper starting in the middle. The third regexp does match, since the
	251	C<$> constrains C<keeper> to match only at the end of the string.
	252
	253	When both C<^> and C<$> are used at the same time, the regexp has to
	254	match both the beginning and the end of the string, i.e., the regexp
	255	matches the whole string. Consider
	256
	257	"keeper" =~ /^keep$/; # doesn't match
	258	"keeper" =~ /^keeper$/; # matches
	259	"" =~ /^$/; # ^$ matches an empty string
	260
	261	The first regexp doesn't match because the string has more to it than
	262	C<keep>. Since the second regexp is exactly the string, it
	263	matches. Using both C<^> and C<$> in a regexp forces the complete
	264	string to match, so it gives you complete control over which strings
	265	match and which don't. Suppose you are looking for a fellow named
	266	bert, off in a string by himself:
	267
	268	"dogbert" =~ /bert/; # matches, but not what you want
	269
	270	"dilbert" =~ /^bert/; # doesn't match, but ..
	271	"bertram" =~ /^bert/; # matches, so still not good enough
	272
	273	"bertram" =~ /^bert$/; # doesn't match, good
	274	"dilbert" =~ /^bert$/; # doesn't match, good
	275	"bert" =~ /^bert$/; # matches, perfect
	276
	277	Of course, in the case of a literal string, one could just as easily
	278	use the string equivalence S<C<$string eq 'bert'> > and it would be
	279	more efficient. The C<^...$> regexp really becomes useful when we
	280	add in the more powerful regexp tools below.
	281
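One difference between C</^bert$/> and C<eq> is worth seeing in code: because C<$> can also match before a newline at the end of the string, the regexp accepts C<"bert\n"> while the string comparison does not. The following is a small sketch with an arbitrary list of test strings.

```perl
use strict;
use warnings;

my @candidates = ("dogbert", "bertram", "bert", "bert\n");

# 1 where the test accepts the string, 0 where it rejects it.
my @by_regexp = map { /^bert$/     ? 1 : 0 } @candidates;
my @by_eq     = map { $_ eq 'bert' ? 1 : 0 } @candidates;

print "regexp: @by_regexp\n";
print "eq:     @by_eq\n";
```

If lines are read from a file without removing the trailing newline, the regexp form is the more forgiving of the two.
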
	282	=head2 Using character classes
	283
	284	Although one can already do quite a lot with the literal string
	285	regexps above, we've only scratched the surface of regular expression
	286	technology. In this and subsequent sections we will introduce regexp
	287	concepts (and associated metacharacter notations) that will allow a
	288	regexp to not just represent a single character sequence, but a I<whole
	289	class> of them.
	290
	291	One such concept is that of a B<character class>. A character class
	292	allows a set of possible characters, rather than just a single
	293	character, to match at a particular point in a regexp. Character
	294	classes are denoted by brackets C<[...]>, with the set of characters
	295	to be possibly matched inside. Here are some examples:
	296
	297	/cat/; # matches 'cat'
	298	/[bcr]at/; # matches 'bat', 'cat', or 'rat'
	299	/item[0123456789]/; # matches 'item0' or ... or 'item9'
	300	"abc" =~ /[cab]/; # matches 'a'
	301
	302	In the last statement, even though C<'c'> is the first character in
	303	the class, C<'a'> matches because the first character position in the
	304	string is the earliest point at which the regexp can match.
	305
	306	/[yY][eE][sS]/; # match 'yes' in a case-insensitive way
	307	# 'yes', 'Yes', 'YES', etc.
	308
	309	This regexp displays a common task: perform a case-insensitive
	310	match. Perl provides a way of avoiding all those brackets by simply
	311	appending an C<'i'> to the end of the match. Then C</[yY][eE][sS]/;>
	312	can be rewritten as C</yes/i;>. The C<'i'> stands for
	313	case-insensitive and is an example of a B<modifier> of the matching
	314	operation. We will meet other modifiers later in the tutorial.
	315
	316	We saw in the section above that there were ordinary characters, which
	317	represented themselves, and special characters, which needed a
	318	backslash C<\> to represent themselves. The same is true in a
	319	character class, but the sets of ordinary and special characters
	320	inside a character class are different than those outside a character
	321	class. The special characters for a character class are C<-]\^$>. C<]>
	322	is special because it denotes the end of a character class. C<$> is
	323	special because it denotes a scalar variable. C<\> is special because
	324	it is used in escape sequences, just like above. Here is how the
	325	special characters C<]$\> are handled:
	326
	327	/[\]c]def/; # matches ']def' or 'cdef'
	328	$x = 'bcr';
	329	/[$x]at/; # matches 'bat', 'cat', or 'rat'
	330	/[\$x]at/; # matches '$at' or 'xat'
	331	/[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'
	332
	333	The last two are a little tricky. In C<[\$x]>, the backslash protects
	334	the dollar sign, so the character class has two members C<$> and C<x>.
	335	In C<[\\$x]>, the backslash is protected, so C<$x> is treated as a
	336	variable and substituted in double quote fashion.
	337
	338	The special character C<'-'> acts as a range operator within character
	339	classes, so that a contiguous set of characters can be written as a
	340	range. With ranges, the unwieldy C<[0123456789]> and C<[abc...xyz]>
	341	become the svelte C<[0-9]> and C<[a-z]>. Some examples are
	342
	343	/item[0-9]/; # matches 'item0' or ... or 'item9'
	344	/[0-9bx-z]aa/; # matches '0aa', ..., '9aa',
	345	# 'baa', 'xaa', 'yaa', or 'zaa'
	346	/[0-9a-fA-F]/; # matches a hexadecimal digit
	347	/[0-9a-zA-Z_]/; # matches a "word" character,
	348	# like those in a perl variable name
	349
	350	If C<'-'> is the first or last character in a character class, it is
	351	treated as an ordinary character; C<[-ab]>, C<[ab-]> and C<[a\-b]> are
	352	all equivalent.
	353
	354	The special character C<^> in the first position of a character class
	355	denotes a B<negated character class>, which matches any character but
	356	those in the brackets. Both C<[...]> and C<[^...]> must match a
	357	character, or the match fails. Then
	358
	359	/[^a]at/; # doesn't match 'aat' or 'at', but matches
	360	# all other 'bat', 'cat', '0at', '%at', etc.
	361	/[^0-9]/; # matches a non-numeric character
	362	/[a^]at/; # matches 'aat' or '^at'; here '^' is ordinary
	363
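The sketch below pulls these rules together: a range, a literal C<'-'> in each of the three positions mentioned above, and a negated class. The sample strings are arbitrary, and we add C<^> and C<$> to the hexadecimal test so that the whole string must consist of exactly two hex digits.

```perl
use strict;
use warnings;

# Keep only the strings made of exactly two hexadecimal digits.
my @hex = grep { /^[0-9a-fA-F][0-9a-fA-F]$/ } ("ff", "0A", "g1", "7");

# '-' is ordinary when first, last, or backslashed inside a class.
my $dash_ok = "a-b" =~ /[-ab][ab-][a\-b]/ ? 1 : 0;

# A negated class must still match one character.
my @not_a = map { /[^a]at/ ? 1 : 0 } ("bat", "aat", "at");

print "hex: @hex\n";
print "dash: $dash_ok\n";
print "negated: @not_a\n";
```

Note that C<'at'> fails the last test not because it contains an C<'a'> but because there is no character in front of C<'at'> for C<[^a]> to consume.
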
	364	Now, even C<[0-9]> can be a bother to write multiple times, so in the
	365	interest of saving keystrokes and making regexps more readable, Perl
	366	has several abbreviations for common character classes:
	367
	368	=over 4
	369
	370	=item *
	371	\d is a digit and represents [0-9]
	372
	373	=item *
	374	\s is a whitespace character and represents [\ \t\r\n\f]
	375
	376	=item *
	377	\w is a word character (alphanumeric or _) and represents [0-9a-zA-Z_]
	378
	379	=item *
	380	\D is a negated \d; it represents any character but a digit [^0-9]
	381
	382	=item *
	383	\S is a negated \s; it represents any non-whitespace character [^\s]
	384
	385	=item *
	386	\W is a negated \w; it represents any non-word character [^\w]
	387
	388	=item *
	389	The period '.' matches any character but "\n"
	390
	391	=back
	392
	393	The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside
	394	of character classes. Here are some in use:
	395
	396	/\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
	397	/[\d\s]/; # matches any digit or whitespace character
	398	/\w\W\w/; # matches a word char, followed by a
	399	# non-word char, followed by a word char
	400	/..rt/; # matches any two chars, followed by 'rt'
	401	/end\./; # matches 'end.'
	402	/end[.]/; # same thing, matches 'end.'
	403
	404	Because a period is a metacharacter, it needs to be escaped to match
	405	as an ordinary period. Because, for example, C<\d> and C<\w> are sets
	406	of characters, it is incorrect to think of C<[^\d\w]> as C<[\D\W]>; in
	407	fact C<[^\d\w]> is the same as C<[^\w]>, which is the same as
	408	C<[\W]>. Think DeMorgan's laws.
	409
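The claim about C<[^\d\w]> can be checked by brute force. This sketch counts, over the 95 printable ASCII characters (an arbitrary test set that we picked), how many match each class. C<[^\d\w]> and C<\W> agree, while C<[\D\W]> is much larger because it accepts anything that is a non-digit I<or> a non-word character.

```perl
use strict;
use warnings;

my @chars = map { chr($_) } 32 .. 126;    # printable ASCII only

# grep in scalar context returns the number of matching elements.
my $negated  = grep { /[^\d\w]/ } @chars;   # neither a digit nor a word char
my $non_word = grep { /\W/ }      @chars;   # not a word char
my $union    = grep { /[\D\W]/ }  @chars;   # non-digit or non-word char

print "[^\\d\\w] matches $negated characters\n";
print "\\W matches $non_word characters\n";
print "[\\D\\W] matches $union characters\n";
```

Of the 95 characters, 63 are word characters (10 digits, 52 letters and the underscore), which leaves 32 for C<\W>; only the 10 digits fail C<[\D\W]>.
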
	410	An anchor useful in basic regexps is the S<B<word anchor> >
	411	C<\b>. This matches a boundary between a word character and a non-word
	412	character C<\w\W> or C<\W\w>:
	413
	414	$x = "Housecat catenates house and cat";
	415	$x =~ /cat/; # matches cat in 'housecat'
	416	$x =~ /\bcat/; # matches cat in 'catenates'
	417	$x =~ /cat\b/; # matches cat in 'housecat'
	418	$x =~ /\bcat\b/; # matches 'cat' at end of string
	419
	420	Note in the last example, the end of the string is considered a word
	421	boundary.
	422
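To make the word-anchor examples concrete, this sketch reports the offset at which each variant matches in the same string. It relies on C<$-[0]>, the start offset of the last successful match (the C<@-> array is discussed later in this tutorial); the hash C<%offset> is just our bookkeeping.

```perl
use strict;
use warnings;

my $x = "Housecat catenates house and cat";

# Keys are the regexp text, values are the offset where it matched.
my %offset;
$offset{'cat'}     = $-[0] if $x =~ /cat/;       # 'cat' inside 'Housecat'
$offset{'\bcat'}   = $-[0] if $x =~ /\bcat/;     # start of 'catenates'
$offset{'cat\b'}   = $-[0] if $x =~ /cat\b/;     # end of 'Housecat'
$offset{'\bcat\b'} = $-[0] if $x =~ /\bcat\b/;   # the final word 'cat'

print "$_ => $offset{$_}\n" for sort keys %offset;
```

Only the last regexp skips both embedded occurrences and lands on the standalone word at offset 29.
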
	423	You might wonder why C<'.'> matches everything but C<"\n"> - why not
	424	every character? The reason is that often one is matching against
	425	lines and would like to ignore the newline characters. For instance,
	426	while the string C<"\n"> represents one line, we would like to think
	427	of it as empty. Then
	428
	429	"" =~ /^$/; # matches
	430	"\n" =~ /^$/; # matches, "\n" is ignored
	431
	432	"" =~ /./; # doesn't match; it needs a char
	433	"" =~ /^.$/; # doesn't match; it needs a char
	434	"\n" =~ /^.$/; # doesn't match; it needs a char other than "\n"
	435	"a" =~ /^.$/; # matches
	436	"a\n" =~ /^.$/; # matches, ignores the "\n"
	437
	438	This behavior is convenient, because we usually want to ignore
	439	newlines when we count and match characters in a line. Sometimes,
	440	however, we want to keep track of newlines. We might even want C<^>
	441	and C<$> to anchor at the beginning and end of lines within the
	442	string, rather than just the beginning and end of the string. Perl
	443	allows us to choose between ignoring and paying attention to newlines
	444	by using the C<//s> and C<//m> modifiers. C<//s> and C<//m> stand for
	445	single line and multi-line and they determine whether a string is to
	446	be treated as one continuous string, or as a set of lines. The two
	447	modifiers affect two aspects of how the regexp is interpreted: 1) how
	448	the C<'.'> character class is defined, and 2) where the anchors C<^>
	449	and C<$> are able to match. Here are the four possible combinations:
	450
	451	=over 4
	452
	453	=item *
	454	no modifiers (//): Default behavior. C<'.'> matches any character
	455	except C<"\n">. C<^> matches only at the beginning of the string and
	456	C<$> matches only at the end or before a newline at the end.
	457
	458	=item *
	459	s modifier (//s): Treat string as a single long line. C<'.'> matches
	460	any character, even C<"\n">. C<^> matches only at the beginning of
	461	the string and C<$> matches only at the end or before a newline at the
	462	end.
	463
	464	=item *
	465	m modifier (//m): Treat string as a set of multiple lines. C<'.'>
	466	matches any character except C<"\n">. C<^> and C<$> are able to match
	467	at the start or end of I<any> line within the string.
	468
	469	=item *
	470	both s and m modifiers (//sm): Treat string as a single long line, but
	471	detect multiple lines. C<'.'> matches any character, even
	472	C<"\n">. C<^> and C<$>, however, are able to match at the start or end
	473	of I<any> line within the string.
	474
	475	=back
	476
	477	Here are examples of C<//s> and C<//m> in action:
	478
	479	$x = "There once was a girl\nWho programmed in Perl\n";
	480
	481	$x =~ /^Who/; # doesn't match, "Who" not at start of string
	482	$x =~ /^Who/s; # doesn't match, "Who" not at start of string
	483	$x =~ /^Who/m; # matches, "Who" at start of second line
	484	$x =~ /^Who/sm; # matches, "Who" at start of second line
	485
	486	$x =~ /girl.Who/; # doesn't match, "." doesn't match "\n"
	487	$x =~ /girl.Who/s; # matches, "." matches "\n"
	488	$x =~ /girl.Who/m; # doesn't match, "." doesn't match "\n"
	489	$x =~ /girl.Who/sm; # matches, "." matches "\n"
	490
	491	Most of the time, the default behavior is what is wanted, but C<//s> and
	492	C<//m> are occasionally very useful. If C<//m> is being used, the start
	493	of the string can still be matched with C<\A> and the end of string
	494	can still be matched with the anchors C<\Z> (matches both the end and
	495	the newline before, like C<$>), and C<\z> (matches only the end):
	496
	497	$x =~ /^Who/m; # matches, "Who" at start of second line
	498	$x =~ /\AWho/m; # doesn't match, "Who" is not at start of string
	499
	500	$x =~ /girl$/m; # matches, "girl" at end of first line
	501	$x =~ /girl\Z/m; # doesn't match, "girl" is not at end of string
	502
	503	$x =~ /Perl\Z/m; # matches, "Perl" is at newline before end
	504	$x =~ /Perl\z/m; # doesn't match, "Perl" is not at end of string
	505
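A practical consequence of the difference between C<$> and C<\z>: when checking that a string is a single word, C<$> tolerates a trailing newline but C<\z> does not, and C<//m> relaxes the check further. A minimal sketch, with sample strings of our own choosing:

```perl
use strict;
use warnings;

my @inputs = ("Perl", "Perl\n", "Perl\nrocks");

# 1 if the check accepts the string, 0 otherwise.
my @dollar = map { /^\w+$/   ? 1 : 0 } @inputs;   # $ allows a final "\n"
my @z      = map { /\A\w+\z/ ? 1 : 0 } @inputs;   # \z means the very end
my @multi  = map { /^\w+$/m  ? 1 : 0 } @inputs;   # //m anchors at every line

print "\$:   @dollar\n";
print "\\z:  @z\n";
print "\$/m: @multi\n";
```

The third string shows why C<//m> should be added deliberately: with it, a single matching line anywhere in the string is enough.
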
	506	We now know how to create choices among classes of characters in a
	507	regexp. What about choices among words or character strings? Such
	508	choices are described in the next section.
	509
	510	=head2 Matching this or that
	511
	512	Sometimes we would like our regexp to be able to match different
	513	possible words or character strings. This is accomplished by using
	514	the B<alternation> metacharacter C<|>. To match C<dog> or C<cat>, we
	515	form the regexp C<dog|cat>. As before, perl will try to match the
	516	regexp at the earliest possible point in the string. At each
	517	character position, perl will first try to match the first
	518	alternative, C<dog>. If C<dog> doesn't match, perl will then try the
	519	next alternative, C<cat>. If C<cat> doesn't match either, then the
	520	match fails and perl moves to the next position in the string. Some
	521	examples:
	522
	523	"cats and dogs" =~ /cat|dog|bird/; # matches "cat"
	524	"cats and dogs" =~ /dog|cat|bird/; # matches "cat"
	525
	526	Even though C<dog> is the first alternative in the second regexp,
	527	C<cat> is able to match earlier in the string.
	528
	529	"cats" =~ /c|ca|cat|cats/; # matches "c"
	530	"cats" =~ /cats|cat|ca|c/; # matches "cats"
	531
	532	Here, all the alternatives match at the first string position, so the
	533	first alternative is the one that matches. If some of the
	534	alternatives are truncations of the others, put the longest ones first
	535	to give them a chance to match.
	536
	537	"cab" =~ /a|b|c/; # matches "c"
	538	# /a|b|c/ == /[abc]/
	539
	540	The last example points out that character classes are like
	541	alternations of characters. At a given character position, the first
	542	alternative that allows the regexp match to succeed will be the one
	543	that matches.
	544
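The ordering rules can be observed directly by asking perl what text it matched. This sketch uses C<$&>, which holds the text of the last successful match and is described later in this tutorial; the array and variable names are our own.

```perl
use strict;
use warnings;

# Same alternatives in two different orders, matched against "cats".
my @matched;
for my $re (qr/c|ca|cat|cats/, qr/cats|cat|ca|c/) {
    push @matched, $& if "cats" =~ $re;
}

# Position beats order: 'cat' occurs earlier in the string than 'dog'.
my $earliest = "cats and dogs" =~ /dog|cat|bird/ ? $& : undef;

print "matched: @matched\n";
print "earliest: $earliest\n";
```

The first loop iteration stops at C<'c'> because that alternative already succeeds; only with the longest alternative listed first does the whole word match.
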
	545	=head2 Grouping things and hierarchical matching
	546
	547	Alternation allows a regexp to choose among alternatives, but by
	548	itself it is unsatisfying. The reason is that each alternative is a whole
	549	regexp, but sometimes we want alternatives for just part of a
	550	regexp. For instance, suppose we want to search for housecats or
	551	housekeepers. The regexp C<housecat|housekeeper> fits the bill, but is
	552	inefficient because we had to type C<house> twice. It would be nice to
	553	have parts of the regexp be constant, like C<house>, and some
	554	parts have alternatives, like C<cat|keeper>.
	555
	556	The B<grouping> metacharacters C<()> solve this problem. Grouping
	557	allows parts of a regexp to be treated as a single unit. Parts of a
	558	regexp are grouped by enclosing them in parentheses. Thus we could solve
	559	the C<housecat|housekeeper> problem by forming the regexp as
	560	C<house(cat|keeper)>. The regexp C<house(cat|keeper)> means match
	561	C<house> followed by either C<cat> or C<keeper>. Some more examples
	562	are
	563
	564	/(a|b)b/; # matches 'ab' or 'bb'
	565	/(ac|b)b/; # matches 'acb' or 'bb'
	566	/(^a|b)c/; # matches 'ac' at start of string or 'bc' anywhere
	567	/(a|[bc])d/; # matches 'ad', 'bd', or 'cd'
	568
	569	/house(cat|)/; # matches either 'housecat' or 'house'
	570	/house(cat(s|)|)/; # matches either 'housecats' or 'housecat' or
	571	# 'house'. Note groups can be nested.
	572
	573	/(19|20|)\d\d/; # match years 19xx, 20xx, or the Y2K problem, xx
	574	"20" =~ /(19|20|)\d\d/; # matches the null alternative '()\d\d',
	575	# because '20\d\d' can't match
	576
	577	Alternations behave the same way in groups as out of them: at a given
	578	string position, the leftmost alternative that allows the regexp to
	579	match is taken. So in the last example at the first string position,
	580	C<"20"> matches the second alternative, but there is nothing left over
	581	to match the next two digits C<\d\d>. So perl moves on to the next
	582	alternative, which is the null alternative and that works, since
	583	C<"20"> is two digits.
	584
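To watch the null alternative being chosen, the sketch below records which alternative the group ended up with for a few sample years. It uses C<$1>, the text matched by the first group (covered in the section on extracting matches), and it adds C<^> and C<$> so that the whole string must be a two- or four-digit year; both the anchors and the sample values are our additions.

```perl
use strict;
use warnings;

my %century;
for my $year ("1999", "2024", "20", "87") {
    # $1 is '19', '20', or the empty string for the null alternative.
    $century{$year} = $1 if $year =~ /^(19|20|)\d\d$/;
}

printf "%-4s => '%s'\n", $_, $century{$_} for sort keys %century;
```

For C<"20">, the group first takes C<'20'>, finds no digits left for C<\d\d>, and only then falls back to the empty alternative.
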
	585	The process of trying one alternative, seeing if it matches, and
	586	moving on to the next alternative if it doesn't, is called
	587	B<backtracking>. The term 'backtracking' comes from the idea that
	588	matching a regexp is like a walk in the woods. Successfully matching
	589	a regexp is like arriving at a destination. There are many possible
	590	trailheads, one for each string position, and each one is tried in
	591	order, left to right. From each trailhead there may be many paths,
	592	some of which get you there, and some which are dead ends. When you
	593	walk along a trail and hit a dead end, you have to backtrack along the
	594	trail to an earlier point to try another trail. If you hit your
	595	destination, you stop immediately and forget about trying all the
	596	other trails. You are persistent, and only if you have tried all the
	597	trails from all the trailheads and not arrived at your destination, do
	598	you declare failure. To be concrete, here is a step-by-step analysis
	599	of what perl does when it tries to match the regexp
	600
	601	"abcde" =~ /(abd|abc)(df|d|de)/;
	602
	603	=over 4
	604
	605	=item 0 Start with the first letter in the string 'a'.
	606
	607	=item 1 Try the first alternative in the first group 'abd'.
	608
	609	=item 2 Match 'a' followed by 'b'. So far so good.
	610
	611	=item 3 'd' in the regexp doesn't match 'c' in the string - a dead
	612	end. So backtrack two characters and pick the second alternative in
	613	the first group 'abc'.
	614
	615	=item 4 Match 'a' followed by 'b' followed by 'c'. We are on a roll
	616	and have satisfied the first group. Set $1 to 'abc'.
	617
	618	=item 5 Move on to the second group and pick the first alternative
	619	'df'.
	620
	621	=item 6 Match the 'd'.
	622
	623	=item 7 'f' in the regexp doesn't match 'e' in the string, so a dead
	624	end. Backtrack one character and pick the second alternative in the
	625	second group 'd'.
	626
	627	=item 8 'd' matches. The second grouping is satisfied, so set $2 to
	628	'd'.
	629
	630	=item 9 We are at the end of the regexp, so we are done! We have
	631	matched 'abcd' out of the string "abcde".
	632
	633	=back
	634
	635	There are a couple of things to note about this analysis. First, the
	636	third alternative in the second group 'de' also allows a match, but we
	637	stopped before we got to it - at a given character position, leftmost
	638	wins. Second, we were able to get a match at the first character
	639	position of the string 'a'. If there were no matches at the first
	640	position, perl would move to the second character position 'b' and
	641	attempt the match all over again. Only when all possible paths at all
	642	possible character positions have been exhausted does perl give
	643	up and declare S<C<$string =~ /(abd|abc)(df|d|de)/;> > to be false.
	644
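The outcome of this walk-through can be confirmed by inspecting C<$1> and C<$2> after the match. The sketch also tries a reordered second group, which is our own variation, to show that the result depends on the order in which alternatives are listed.

```perl
use strict;
use warnings;

my %seen;
if ("abcde" =~ /(abd|abc)(df|d|de)/) {
    # Leftmost alternative that works wins: 'd' is tried before 'de'.
    %seen = (first => $1, second => $2, whole => "$1$2");
}

my $longer;
if ("abcde" =~ /(abd|abc)(df|de|d)/) {
    # With 'de' listed before 'd', the same string yields a longer match.
    $longer = "$1$2";
}

print "groups: '$seen{first}' and '$seen{second}', matched '$seen{whole}'\n";
print "reordered: matched '$longer'\n";
```

Perl does not look for the longest overall match here; it simply reports the first path through the alternatives that reaches the end of the regexp.
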
	645	Even with all this work, regexp matching happens remarkably fast. To
	646	speed things up, during the compilation stage, perl compiles the regexp
	647	into a compact sequence of opcodes that can often fit inside a
	648	processor cache. When the code is executed, these opcodes can then run
	649	at full throttle and search very quickly.
	650
	651	=head2 Extracting matches
	652
	653	The grouping metacharacters C<()> also serve another completely
	654	different function: they allow the extraction of the parts of a string
	655	that matched. This is very useful to find out what matched and for
	656	text processing in general. For each grouping, the part that matched
	657	inside goes into the special variables C<$1>, C<$2>, etc. They can be
	658	used just as ordinary variables:
	659
	660	# extract hours, minutes, seconds
	661	$time =~ /(\d\d):(\d\d):(\d\d)/; # match hh:mm:ss format
	662	$hours = $1;
	663	$minutes = $2;
	664	$seconds = $3;
	665
	666	Now, we know that in scalar context,
	667	S<C<$time =~ /(\d\d):(\d\d):(\d\d)/> > returns a true or false
	668	value. In list context, however, it returns the list of matched values
	669	C<($1,$2,$3)>. So we could write the code more compactly as
	670
	671	# extract hours, minutes, seconds
	672	($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);
	673
	674	If the groupings in a regexp are nested, C<$1> gets the group with the
	675	leftmost opening parenthesis, C<$2> the next opening parenthesis,
	676	etc. For example, here is a complex regexp and the matching variables
	677	indicated below it:
	678
	679	/(ab(cd|ef)((gi)|j))/;
	680	 1  2      34
	681
	682	so that if the regexp matched, e.g., C<$2> would contain 'cd' or 'ef'.
	683	For convenience, perl sets C<$+> to the highest numbered C<$1>, C<$2>,
	684	... that got assigned.
	685
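Here is a short sketch of both ideas: list-context extraction and the numbering of nested groups. The input strings are samples we made up, and a group that did not take part in the match is printed as the word C<undef>.

```perl
use strict;
use warnings;

my $time = "12:34:56";     # sample input, chosen for illustration
my ($hours, $minutes, $seconds) = $time =~ /(\d\d):(\d\d):(\d\d)/;

# Nested groups are numbered by the position of their opening parenthesis.
my @with_gi = "abcdgi" =~ /(ab(cd|ef)((gi)|j))/;
my @with_j  = "abefj"  =~ /(ab(cd|ef)((gi)|j))/;

print "$hours hours, $minutes minutes, $seconds seconds\n";
print join(",", map { defined($_) ? $_ : "undef" } @with_gi), "\n";
print join(",", map { defined($_) ? $_ : "undef" } @with_j),  "\n";
```

In the second string the fourth group never matches, so the fourth element of the returned list is undefined while the first three are set.
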
	686	Closely associated with the matching variables C<$1>, C<$2>, ... are
	687	the B<backreferences> C<\1>, C<\2>, ... . Backreferences are simply
	688	matching variables that can be used I<inside> a regexp. This is a
	689	really nice feature - what matches later in a regexp can depend on
	690	what matched earlier in the regexp. Suppose we wanted to look
	691	for doubled words in text, like 'the the'. The following regexp finds
	692	all 3-letter doubles with a space in between:
	693
	694	/(\w\w\w)\s\1/;
	695
	696	The grouping assigns a value to C<\1>, so that the same 3-letter sequence
	697	is used for both parts. Here are some words with repeated parts:
	698
	699	% simple_grep '^(\w\w\w\w|\w\w\w|\w\w|\w)\1$' /usr/dict/words
	700	beriberi
	701	booboo
	702	coco
	703	mama
	704	murmur
	705	papa
	706
	707	The regexp has a single grouping which considers 4-letter
	708	combinations, then 3-letter combinations, etc. and uses C<\1> to look for
	709	a repeat. Although C<$1> and C<\1> represent the same thing, care should be
	710	taken to use matched variables C<$1>, C<$2>, ... only outside a regexp
	711	and backreferences C<\1>, C<\2>, ... only inside a regexp; not doing
	712	so may lead to surprising and/or undefined results.
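
A backreference only requires the same characters to appear again; it knows nothing about word boundaries. The sketch below compares the 3-letter doubled-word regexp with a variant that adds the word anchor C<\b> on both sides. Both sub names are made up for this illustration.

```perl
use strict;
use warnings;

# Both subs are hypothetical helpers written for this illustration.
sub has_double_loose  { return $_[0] =~ /(\w\w\w)\s\1/     ? 1 : 0 }
sub has_double_strict { return $_[0] =~ /\b(\w\w\w)\s\1\b/ ? 1 : 0 }

for my $text ("the the cat", "bathe then") {
    printf "%-12s loose=%d strict=%d\n",
        $text, has_double_loose($text), has_double_strict($text);
}
```

The loose form reports a double in C<bathe then> because C<the> occurs on both sides of the space; the anchored form does not.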
	713
	714	In addition to what was matched, Perl 5.6.0 also provides the
	715	positions of what was matched with the C<@-> and C<@+>
	716	arrays. C<$-[0]> is the position of the start of the entire match and
	717	C<$+[0]> is the position of the end. Similarly, C<$-[n]> is the
	718	position of the start of the C<$n> match and C<$+[n]> is the position
	719	of the end. If C<$n> is undefined, so are C<$-[n]> and C<$+[n]>. Then
	720	this code
	721
	722	$x = "Mmm...donut, thought Homer";
	723	$x =~ /^(Mmm|Yech)\.\.\.(donut|peas)/; # matches
	724	foreach $expr (1..$#-) {
	725	    print "Match $expr: '${$expr}' at position ($-[$expr],$+[$expr])\n";
	726	}
	727
	728	prints
	729
	730	Match 1: 'Mmm' at position (0,3)
	731	Match 2: 'donut' at position (6,11)
	732
	733	Even if there are no groupings in a regexp, it is still possible to
	734	find out what exactly matched in a string. If you use them, perl
	735	will set C<$`> to the part of the string before the match, will set C<$&>
	736	to the part of the string that matched, and will set C<$'> to the part
	737	of the string after the match. An example:
	738
	739	$x = "the cat caught the mouse";
	740	$x =~ /cat/; # $` = 'the ', $& = 'cat', $' = ' caught the mouse'
	741	$x =~ /the/; # $` = '', $& = 'the', $' = ' cat caught the mouse'
	742
	743	In the second match, S<C<$` = ''> > because the regexp matched at the
	744	first character position in the string and stopped; it never saw the
	745	second 'the'. It is important to note that using C<$`> and C<$'>
	746	slows down regexp matching quite a bit, and C<$&> slows it down to a
	747	lesser extent, because if they are used in one regexp in a program,
	748	they are generated for I<all> regexps in the program. So if raw
	749	performance is a goal of your application, they should be avoided.
	750	If you need them, use C<@-> and C<@+> instead:
	751
	752	$` is the same as substr( $x, 0, $-[0] )
	753	$& is the same as substr( $x, $-[0], $+[0]-$-[0] )
	754	$' is the same as substr( $x, $+[0] )
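
The three equivalences can be bundled into a small helper. This is a sketch only; C<match_parts> is a made-up name, and the pattern is passed in as a C<qr//> object.

```perl
use strict;
use warnings;

# match_parts is a hypothetical helper: it returns the text before the
# match, the matched text, and the text after it, using only @- and @+.
sub match_parts {
    my ($string, $regexp) = @_;
    return unless $string =~ $regexp;
    return (
        substr($string, 0, $-[0]),               # text before the match
        substr($string, $-[0], $+[0] - $-[0]),   # the match itself
        substr($string, $+[0]),                  # text after the match
    );
}

my ($before, $matched, $after) =
    match_parts("the cat caught the mouse", qr/cat/);
print "before='$before' matched='$matched' after='$after'\n";
```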
	755
	756	=head2 Matching repetitions
	757
	758	The examples in the previous section display an annoying weakness. We
	759	were only matching 3-letter words, or syllables of 4 letters or
	760	less. We'd like to be able to match words or syllables of any length,
	761	without writing out tedious alternatives like
	762	C<\w\w\w\w|\w\w\w|\w\w|\w>.
	763
	764	This is exactly the problem the B<quantifier> metacharacters C<?>,
	765	C<*>, C<+>, and C<{}> were created for. They allow us to determine the
	766	number of repeats of a portion of a regexp we consider to be a
	767	match. Quantifiers are put immediately after the character, character
	768	class, or grouping that we want to specify. They have the following
	769	meanings:
	770
	771	=over 4
	772
	773	=item * C<a?> = match 'a' 1 or 0 times
	774
	775	=item * C<a*> = match 'a' 0 or more times, i.e., any number of times
	776
	777	=item * C<a+> = match 'a' 1 or more times, i.e., at least once
	778
	779	=item * C<a{n,m}> = match at least C<n> times, but not more than C<m>
	780	times.
	781
	782	=item * C<a{n,}> = match at least C<n> times
	783
	784	=item * C<a{n}> = match exactly C<n> times
	785
	786	=back
	787
	788	Here are some examples:
	789
	790	/[a-z]+\s+\d*/; # match a lowercase word, at least some space, and
	791	# any number of digits
	792	/(\w+)\s+\1/; # match doubled words of arbitrary length
	793	/y(es)?/i; # matches 'y', 'Y', or a case-insensitive 'yes'
	794	$year =~ /\d{2,4}/; # make sure year is at least 2 but not more
	795	# than 4 digits
	796	$year =~ /\d{4}|\d{2}/; # better match; throw out 3 digit dates
	797	$year =~ /\d{2}(\d{2})?/; # same thing written differently. However,
	798	# this produces $1 and the other does not.
	799
	800	% simple_grep '^(\w+)\1$' /usr/dict/words # isn't this easier?
	801	beriberi
	802	booboo
	803	coco
	804	mama
	805	murmur
	806	papa
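
A caveat about the year examples: a quantifier only limits what the regexp itself consumes, so without anchors C</\d{2,4}/> also succeeds on a longer run of digits by matching part of it. The sketch below anchors both ends of the pattern; C<is_year> is a name made up for this example.

```perl
use strict;
use warnings;

# is_year is a hypothetical helper: true only for 2- or 4-digit strings.
sub is_year {
    my ($year) = @_;
    return $year =~ /^(\d{4}|\d{2})$/ ? 1 : 0;
}

for my $candidate (qw(99 2001 123 12345)) {
    my $unanchored = $candidate =~ /\d{2,4}/ ? "yes" : "no";
    my $anchored   = is_year($candidate)     ? "yes" : "no";
    print "$candidate: unanchored=$unanchored anchored=$anchored\n";
}
```

The unanchored test says yes to all four candidates, while the anchored one accepts only C<99> and C<2001>.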
	807
	808	For all of these quantifiers, perl will try to match as much of the
	809	string as possible, while still allowing the regexp to succeed. Thus
	810	with C</a?.../>, perl will first try to match the regexp with the C<a>
	811	present; if that fails, perl will try to match the regexp without the
	812	C<a> present. For the quantifier C<*>, we get the following:
	813
	814	$x = "the cat in the hat";
	815	$x =~ /^(.*)(cat)(.*)$/; # matches,
	816	# $1 = 'the '
	817	# $2 = 'cat'
	818	# $3 = ' in the hat'
	819
	820	This is what we might expect: the match finds the only C<cat> in the
	821	string and locks onto it. Consider, however, this regexp:
	822
	823	$x =~ /^(.*)(at)(.*)$/; # matches,
	824	# $1 = 'the cat in the h'
	825	# $2 = 'at'
	826	# $3 = '' (0 matches)
	827
	828	One might initially guess that perl would find the C<at> in C<cat> and
	829	stop there, but that wouldn't give the longest possible string to the
	830	first quantifier C<.*>. Instead, the first quantifier C<.*> grabs as
	831	much of the string as possible while still having the regexp match. In
	832	this example, that means matching the C<at> sequence against the final
	833	C<at> in the string. The other important principle illustrated here is
	834	that when there are two or more elements in a regexp, the I<leftmost>
	835	quantifier, if there is one, gets to grab as much of the string as
	836	possible, leaving the rest of the regexp to fight over scraps. Thus in
	837	our example, the first quantifier C<.*> grabs most of the string, while
	838	the second quantifier C<.*> gets the empty string. Quantifiers that
	839	grab as much of the string as possible are called B<maximal match> or
	840	B<greedy> quantifiers.
	841
	842	When a regexp can match a string in several different ways, we can use
	843	the principles above to predict which way the regexp will match:
	844
	845	=over 4
	846
	847	=item *
	848	Principle 0: Taken as a whole, any regexp will be matched at the
	849	earliest possible position in the string.
	850
	851	=item *
	852	Principle 1: In an alternation C<a|b|c...>, the leftmost alternative
	853	that allows a match for the whole regexp will be the one used.
	854
	855	=item *
	856	Principle 2: The maximal matching quantifiers C<?>, C<*>, C<+> and
	857	C<{n,m}> will in general match as much of the string as possible while
	858	still allowing the whole regexp to match.
	859
	860	=item *
	861	Principle 3: If there are two or more elements in a regexp, the
	862	leftmost greedy quantifier, if any, will match as much of the string
	863	as possible while still allowing the whole regexp to match. The next
	864	leftmost greedy quantifier, if any, will try to match as much of the
	865	string remaining available to it as possible, while still allowing the
	866	whole regexp to match. And so on, until all the regexp elements are
	867	satisfied.
	868
	869	=back
	870
	871	As we have seen above, Principle 0 overrides the others - the regexp
	872	will be matched as early as possible, with the other principles
	873	determining how the regexp matches at that earliest character
	874	position.
	875
	876	Here is an example of these principles in action:
	877
	878	$x = "The programming republic of Perl";
	879	$x =~ /^(.+)(e|r)(.*)$/; # matches,
	880	# $1 = 'The programming republic of Pe'
	881	# $2 = 'r'
	882	# $3 = 'l'
	883
	884	This regexp matches at the earliest string position, C<'T'>. One
	885	might think that C<e>, being leftmost in the alternation, would be
	886	matched, but C<r> produces the longest string in the first quantifier.
	887
	888	$x =~ /(m{1,2})(.*)$/; # matches,
	889	# $1 = 'mm'
	890	# $2 = 'ing republic of Perl'
	891
	892	Here, the earliest possible match is at the first C<'m'> in
	893	C<programming>. C<m{1,2}> is the first quantifier, so it gets to match
	894	a maximal C<mm>.
	895
	896	$x =~ /.*(m{1,2})(.*)$/; # matches,
	897	# $1 = 'm'
	898	# $2 = 'ing republic of Perl'
	899
	900	Here, the regexp matches at the start of the string. The first
	901	quantifier C<.*> grabs as much as possible, leaving just a single
	902	C<'m'> for the second quantifier C<m{1,2}>.
	903
	904	$x =~ /(.?)(m{1,2})(.*)$/; # matches,
	905	# $1 = 'a'
	906	# $2 = 'mm'
	907	# $3 = 'ing republic of Perl'
	908
	909	Here, C<.?> eats its maximal one character at the earliest possible
	910	position in the string, C<'a'> in C<programming>, leaving C<m{1,2}>
	911	the opportunity to match both C<m>'s. Finally,
	912
	913	"aXXXb" =~ /(X*)/; # matches with $1 = ''
	914
	915	because it can match zero copies of C<'X'> at the beginning of the
	916	string. If you definitely want to match at least one C<'X'>, use
	917	C<X+>, not C<X*>.
	918
	919	Sometimes greed is not good. At times, we would like quantifiers to
	920	match a I<minimal> piece of string, rather than a maximal piece. For
	921	this purpose, Larry Wall created the S<B<minimal match> > or
	922	B<non-greedy> quantifiers C<??>, C<*?>, C<+?>, and C<{}?>. These are
	923	the usual quantifiers with a C<?> appended to them. They have the
	924	following meanings:
	925
	926	=over 4
	927
	928	=item * C<a??> = match 'a' 0 or 1 times. Try 0 first, then 1.
	929
	930	=item * C<a*?> = match 'a' 0 or more times, i.e., any number of times,
	931	but as few times as possible
	932
	933	=item * C<a+?> = match 'a' 1 or more times, i.e., at least once, but
	934	as few times as possible
	935
	936	=item * C<a{n,m}?> = match at least C<n> times, not more than C<m>
	937	times, as few times as possible
	938
	939	=item * C<a{n,}?> = match at least C<n> times, but as few times as
	940	possible
	941
	942	=item * C<a{n}?> = match exactly C<n> times. Because we match exactly
	943	C<n> times, C<a{n}?> is equivalent to C<a{n}> and is just there for
	944	notational consistency.
	945
	946	=back
	947
	948	Let's look at the example above, but with minimal quantifiers:
	949
	950	$x = "The programming republic of Perl";
	951	$x =~ /^(.+?)(e|r)(.*)$/; # matches,
	952	# $1 = 'Th'
	953	# $2 = 'e'
	954	# $3 = ' programming republic of Perl'
	955
	956	The minimal string that will allow both the start of the string C<^>
	957	and the alternation to match is C<Th>, with the alternation C<e|r>
	958	matching C<e>. The second quantifier C<.*> is free to gobble up the
	959	rest of the string.
	960
	961	$x =~ /(m{1,2}?)(.*?)$/; # matches,
	962	# $1 = 'm'
	963	# $2 = 'ming republic of Perl'
	964
	965	The first string position that this regexp can match is at the first
	966	C<'m'> in C<programming>. At this position, the minimal C<m{1,2}?>
	967	matches just one C<'m'>. Although the second quantifier C<.*?> would
	968	prefer to match no characters, it is constrained by the end-of-string
	969	anchor C<$> to match the rest of the string.
	970
	971	$x =~ /(.*?)(m{1,2}?)(.*)$/; # matches,
	972	# $1 = 'The progra'
	973	# $2 = 'm'
	974	# $3 = 'ming republic of Perl'
	975
	976	In this regexp, you might expect the first minimal quantifier C<.*?>
	977	to match the empty string, because it is not constrained by a C<^>
	978	anchor to match the beginning of the word. Principle 0 applies here,
	979	however. Because it is possible for the whole regexp to match at the
	980	start of the string, it I<will> match at the start of the string. Thus
	981	the first quantifier has to match everything up to the first C<m>. The
	982	second minimal quantifier matches just one C<m> and the third
	983	quantifier matches the rest of the string.
	984
	985	$x =~ /(.??)(m{1,2})(.*)$/; # matches,
	986	# $1 = 'a'
	987	# $2 = 'mm'
	988	# $3 = 'ing republic of Perl'
	989
	990	Just as in the previous regexp, the first quantifier C<.??> can match
	991	earliest at position C<'a'>, so it does. The second quantifier is
	992	greedy, so it matches C<mm>, and the third matches the rest of the
	993	string.
	994
	995	We can modify principle 3 above to take into account non-greedy
	996	quantifiers:
	997
	998	=over 4
	999
	1000	=item *
	1001	Principle 3: If there are two or more elements in a regexp, the
	1002	leftmost greedy (non-greedy) quantifier, if any, will match as much
	1003	(little) of the string as possible while still allowing the whole
	1004	regexp to match. The next leftmost greedy (non-greedy) quantifier, if
	1005	any, will try to match as much (little) of the string remaining
	1006	available to it as possible, while still allowing the whole regexp to
	1007	match. And so on, until all the regexp elements are satisfied.
	1008
	1009	=back
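
A common place where the difference between greedy and non-greedy quantifiers shows up is in pulling out delimited text. The sketch below uses a made-up sample string with two quoted words.

```perl
use strict;
use warnings;

my $text = 'say "yes" or "no"';

my ($greedy)  = $text =~ /"(.*)"/;    # .* runs on to the last quote
my ($minimal) = $text =~ /"(.*?)"/;   # .*? stops at the first closing quote

print "greedy:  $greedy\n";    # yes" or "no
print "minimal: $minimal\n";   # yes
```

Both matches start at the first quote, as Principle 0 requires; only the amount taken by the quantifier differs.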
	1010
	1011	Just like alternation, quantifiers are also susceptible to
	1012	backtracking. Here is a step-by-step analysis of the example
	1013
	1014	$x = "the cat in the hat";
	1015	$x =~ /^(.*)(at)(.*)$/; # matches,
	1016	# $1 = 'the cat in the h'
	1017	# $2 = 'at'
	1018	# $3 = '' (0 matches)
	1019
	1020	=over 4
	1021
	1022	=item 0 Start with the first letter in the string 't'.
	1023
	1024	=item 1 The first quantifier '.*' starts out by matching the whole
	1025	string 'the cat in the hat'.
	1026
	1027	=item 2 'a' in the regexp element 'at' doesn't match the end of the
	1028	string. Backtrack one character.
	1029
	1030	=item 3 'a' in the regexp element 'at' still doesn't match the last
	1031	letter of the string 't', so backtrack one more character.
	1032
	1033	=item 4 Now we can match the 'a' and the 't'.
	1034
	1035	=item 5 Move on to the third element '.*'. Since we are at the end of
	1036	the string and '.*' can match 0 times, assign it the empty string.
	1037
	1038	=item 6 We are done!
	1039
	1040	=back
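
The steps above can be mimicked in ordinary Perl code. This is only a model of the backtracking order for one fixed pattern shape on a single-line string, not a description of how perl's regexp engine is implemented; C<greedy_split> is a made-up name.

```perl
use strict;
use warnings;

# greedy_split is a hypothetical illustration of /^(.*)(LITERAL)(.*)$/.
# The first part starts with the whole string and gives back one
# character at a time until the literal matches at that point.
sub greedy_split {
    my ($string, $literal) = @_;
    for (my $taken = length $string; $taken >= 0; $taken--) {
        next unless substr($string, $taken, length $literal) eq $literal;
        return (
            substr($string, 0, $taken),                  # what the first .* keeps
            $literal,                                    # the literal element
            substr($string, $taken + length $literal),   # what is left for the last .*
        );
    }
    return;    # no way of dividing the string allows a match
}

print join("|", greedy_split("the cat in the hat", "at")), "\n";
# the cat in the h|at|
```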
	1041
	1042	Most of the time, all this moving forward and backtracking happens
	1043	quickly and searching is fast. There are some pathological regexps,
	1044	however, whose execution time exponentially grows with the size of the
	1045	string. A typical structure that blows up in your face is of the form
	1046
	1047	/(a|b+)*/;
	1048
	1049	The problem is the nested indeterminate quantifiers. There are many
	1050	different ways of partitioning a string of length n between the C<+>
	1051	and C<*>: one repetition with C<b+> of length n, two repetitions with
	1052	the first C<b+> length k and the second with length n-k, m repetitions
	1053	whose bits add up to length n, etc. In fact there are an exponential
	1054	number of ways to partition a string as a function of length. A
	1055	regexp may get lucky and match early in the process, but if there is
	1056	no match, perl will try I<every> possibility before giving up. So be
	1057	careful with nested C<*>'s, C<{n,m}>'s, and C<+>'s. The book
	1058	I<Mastering Regular Expressions> by Jeffrey Friedl gives a wonderful
	1059	discussion of this and other efficiency issues.
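
One way to sidestep the problem, when the task allows it, is to rewrite the nested form so that there is only one way to divide the string. The sketch below assumes we only care whether the whole string consists of C<a>'s and C<b>'s; how slow the nested form actually gets depends on the perl version and its optimizations.

```perl
use strict;
use warnings;

# Assumption for this sketch: only a yes/no answer for the whole string
# is wanted, so the nested quantifiers can be replaced by one
# character class with a single quantifier.
my $nested = qr/^(a|b+)*$/;
my $flat   = qr/^[ab]*$/;

for my $string ("", "abba", "bbbbbbbbbb", "bbbbbbbbbbc") {
    my $n = $string =~ $nested ? "match" : "no match";
    my $f = $string =~ $flat   ? "match" : "no match";
    print "'$string': nested=$n flat=$f\n";
}
```

Both forms accept and reject the same strings here, but the flat form never has to choose between different partitions of a run of C<b>'s.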
	1060
	1061	=head2 Building a regexp
	1062
	1063	At this point, we have all the basic regexp concepts covered, so let's
	1064	give a more involved example of a regular expression. We will build a
	1065	regexp that matches numbers.
	1066
	1067	The first task in building a regexp is to decide what we want to match
	1068	and what we want to exclude. In our case, we want to match both
	1069	integers and floating point numbers and we want to reject any string
	1070	that isn't a number.
	1071
	1072	The next task is to break the problem down into smaller problems that
	1073	are easily converted into a regexp.
	1074
	1075	The simplest case is integers. These consist of a sequence of digits,
	1076	with an optional sign in front. The digits we can represent with
	1077	C<\d+> and the sign can be matched with C<[+-]>. Thus the integer
	1078	regexp is
	1079
	1080	/[+-]?\d+/; # matches integers
	1081
	1082	A floating point number potentially has a sign, an integral part, a
	1083	decimal point, a fractional part, and an exponent. One or more of these
	1084	parts is optional, so we need to check out the different
	1085	possibilities. Floating point numbers which are in proper form include
	1086	123., 0.345, .34, -1e6, and 25.4E-72. As with integers, the sign out
	1087	front is completely optional and can be matched by C<[+-]?>. We can
	1088	see that if there is no exponent, floating point numbers must have a
	1089	decimal point, otherwise they are integers. We might be tempted to
	1090	model these with C<\d*\.\d*>, but this would also match just a single
	1091	decimal point, which is not a number. So the three cases of floating
	1092	point number sans exponent are
	1093
	1094	/[+-]?\d+\./; # 1., 321., etc.
	1095	/[+-]?\.\d+/; # .1, .234, etc.
	1096	/[+-]?\d+\.\d+/; # 1.0, 30.56, etc.
	1097
	1098	These can be combined into a single regexp with a three-way alternation:
	1099
	1100	/[+-]?(\d+\.\d+|\d+\.|\.\d+)/; # floating point, no exponent
	1101
	1102	In this alternation, it is important to put C<'\d+\.\d+'> before
	1103	C<'\d+\.'>. If C<'\d+\.'> were first, the regexp would happily match that
	1104	and ignore the fractional part of the number.
	1105
	1106	Now consider floating point numbers with exponents. The key
	1107	observation here is that I<both> integers and numbers with decimal
	1108	points are allowed in front of an exponent. Then exponents, like the
	1109	overall sign, are independent of whether we are matching numbers with
	1110	or without decimal points, and can be 'decoupled' from the
	1111	mantissa. The overall form of the regexp now becomes clear:
	1112
	1113	/^(optional sign)(integer | f.p. mantissa)(optional exponent)$/;
	1114
	1115	The exponent is an C<e> or C<E>, followed by an integer. So the
	1116	exponent regexp is
	1117
	1118	/[eE][+-]?\d+/; # exponent
	1119
	1120	Putting all the parts together, we get a regexp that matches numbers:
	1121
	1122	/^[+-]?(\d+\.\d+|\d+\.|\.\d+|\d+)([eE][+-]?\d+)?$/; # Ta da!
	1123
	1124	Long regexps like this may impress your friends, but can be hard to
	1125	decipher. In complex situations like this, the C<//x> modifier for a
	1126	match is invaluable. It allows one to put nearly arbitrary whitespace
	1127	and comments into a regexp without affecting its meaning. Using it,
	1128	we can rewrite our 'extended' regexp in the more pleasing form
	1129
	1130	/^
	1131	   [+-]?         # first, match an optional sign
	1132	   (             # then match integers or f.p. mantissas:
	1133	       \d+\.\d+  # mantissa of the form a.b
	1134	      |\d+\.     # mantissa of the form a.
	1135	      |\.\d+     # mantissa of the form .b
	1136	      |\d+       # integer of the form a
	1137	   )
	1138	   ([eE][+-]?\d+)?  # finally, optionally match an exponent
	1139	$/x;
	1140
	1141	If whitespace is mostly irrelevant, how does one include space
	1142	characters in an extended regexp? The answer is to backslash it
	1143	S<C<'\ '> > or put it in a character class S<C<[ ]> >. The same thing
	1144	goes for pound signs: use C<\#> or C<[#]>. For instance, Perl allows
	1145	a space between the sign and the mantissa/integer, and we could add
	1146	this to our regexp as follows:
	1147
	1148	/^
	1149	   [+-]?\ *      # first, match an optional sign and space
	1150	   (             # then match integers or f.p. mantissas:
	1151	       \d+\.\d+  # mantissa of the form a.b
	1152	      |\d+\.     # mantissa of the form a.
	1153	      |\.\d+     # mantissa of the form .b
	1154	      |\d+       # integer of the form a
	1155	   )
	1156	   ([eE][+-]?\d+)?  # finally, optionally match an exponent
	1157	$/x;
	1158
	1159	In this form, it is easier to see a way to simplify the
	1160	alternation. Alternatives 1, 2, and 4 all start with C<\d+>, so it
	1161	could be factored out:
	1162
	1163	/^
	1164	   [+-]?\ *      # first, match an optional sign
	1165	   (             # then match integers or f.p. mantissas:
	1166	       \d+       # start out with a ...
	1167	       (
	1168	           \.\d* # mantissa of the form a.b or a.
	1169	       )?        # ? takes care of integers of the form a
	1170	      |\.\d+     # mantissa of the form .b
	1171	   )
	1172	   ([eE][+-]?\d+)?  # finally, optionally match an exponent
	1173	$/x;
	1174
	1175	or written in the compact form,
	1176
	1177	/^[+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/;
	1178
	1179	This is our final regexp. To recap, we built a regexp by
	1180
	1181	=over 4
	1182
	1183	=item * specifying the task in detail,
	1184
	1185	=item * breaking down the problem into smaller parts,
	1186
	1187	=item * translating the small parts into regexps,
	1188
	1189	=item * combining the regexps,
	1190
	1191	=item * and optimizing the final combined regexp.
	1192
	1193	=back
	1194
	1195	These are also the typical steps involved in writing a computer
	1196	program. This makes perfect sense, because regular expressions are
	1197	essentially programs written in a little computer language that specifies
	1198	patterns.
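
As with any program, the finished regexp deserves a few test cases. The following sketch stores the final number regexp with C<qr//x> and runs it over the sample numbers from this section plus a few strings that should be rejected. The names C<$number> and C<is_number> are chosen only for illustration.

```perl
use strict;
use warnings;

# The final regexp from this section, stored with qr//x for reuse.
my $number = qr/^
    [+-]?\ *             # optional sign, optional spaces
    (                    # integer or f.p. mantissa:
        \d+ (\.\d*)?     #   a, a. or a.b
      | \.\d+            #   .b
    )
    ([eE][+-]?\d+)?      # optional exponent
$/x;

# is_number is a hypothetical wrapper used only in this sketch.
sub is_number { return $_[0] =~ $number ? 1 : 0 }

for my $candidate ("123.", "0.345", ".34", "-1e6", "25.4E-72", ".", "1e", "abc") {
    printf "%-10s %s\n", $candidate,
        is_number($candidate) ? "number" : "not a number";
}
```

The first five candidates are reported as numbers; a lone decimal point, a dangling exponent marker and plain text are not.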
	1199
	1200	=head2 Using regular expressions in Perl
	1201
	1202	The last topic of Part 1 briefly covers how regexps are used in Perl
	1203	programs. Where do they fit into Perl syntax?
	1204
	1205	We have already introduced the matching operator in its default
	1206	C</regexp/> and arbitrary delimiter C<m!regexp!> forms. We have used
	1207	the binding operator C<=~> and its negation C<!~> to test for string
	1208	matches. Associated with the matching operator, we have discussed the
	1209	single line C<//s>, multi-line C<//m>, case-insensitive C<//i> and
	1210	extended C<//x> modifiers.
	1211
	1212	There are a few more things you might want to know about matching
	1213	operators. First, we pointed out earlier that variables in regexps are
	1214	substituted before the regexp is evaluated:
	1215
	1216	$pattern = 'Seuss';
	1217	while (<>) {
	1218	print if /$pattern/;
	1219	}
	1220
	1221	This will print any lines containing the word C<Seuss>. It is not as
	1222	efficient as it could be, however, because perl has to re-evaluate
	1223	C<$pattern> each time through the loop. If C<$pattern> won't be
	1224	changing over the lifetime of the script, we can add the C<//o>
	1225	modifier, which directs perl to only perform variable substitutions
	1226	once:
	1227
	1228	#!/usr/bin/perl
	1229	# Improved simple_grep
	1230	$regexp = shift;
	1231	while (<>) {
	1232	print if /$regexp/o; # a good deal faster
	1233	}
	1234
	1235	If you change C<$pattern> after the first substitution happens, perl
	1236	will ignore it. If you don't want any substitutions at all, use the
	1237	special delimiter C<m''>:
	1238
	1239	$pattern = 'Seuss';
	1240	while (<>) {
	1241	print if m'$pattern'; # matches '$pattern', not 'Seuss'
	1242	}
	1243
	1244	C<m''> acts like single quotes on a regexp; all other C<m> delimiters
	1245	act like double quotes. If the regexp evaluates to the empty string,
	1246	the regexp in the I<last successful match> is used instead. So we have
	1247
	1248	"dog" =~ /d/; # 'd' matches
	1249	"dogbert" =~ //; # this matches the 'd' regexp used before
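
Returning to C<//o>: the fact that later changes to the variable are ignored is easy to trip over. The sketch below shows the effect in a loop and, as one alternative, uses the C<qr//> operator mentioned in the introduction to compile each pattern explicitly. The variable names are made up for this example.

```perl
use strict;
use warnings;

my (@with_o, @with_qr);
for my $pattern ('cat', 'dog') {
    # //o interpolates $pattern only the first time this match runs,
    # so the second pass still searches for 'cat'.
    push @with_o, ('hot dog' =~ /$pattern/o ? 1 : 0);

    # qr// compiles the current value of $pattern into a regexp object.
    my $compiled = qr/$pattern/;
    push @with_qr, ('hot dog' =~ $compiled ? 1 : 0);
}
print "with o modifier: @with_o\n";    # 0 0
print "with qr object:  @with_qr\n";   # 0 1
```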
	1250
	1251	The final two modifiers C<//g> and C<//c> concern multiple matches.
	1252	The modifier C<//g> stands for global matching and allows the
	1253	matching operator to match within a string as many times as possible.
	1254	In scalar context, successive invocations against a string will have
	1255	C<//g> jump from match to match, keeping track of position in the
	1256	string as it goes along. You can get or set the position with the
	1257	C<pos()> function.
	1258
	1259	The use of C<//g> is shown in the following example. Suppose we have
	1260	a string that consists of words separated by spaces. If we know how
	1261	many words there are in advance, we could extract the words using
	1262	groupings:
	1263
	1264	$x = "cat dog house"; # 3 words
	1265	$x =~ /^\s*(\w+)\s+(\w+)\s+(\w+)\s*$/; # matches,
	1266	# $1 = 'cat'
	1267	# $2 = 'dog'
	1268	# $3 = 'house'
	1269
	1270	But what if we had an indeterminate number of words? This is the sort
	1271	of task C<//g> was made for. To extract all words, form the simple
	1272	regexp C<(\w+)> and loop over all matches with C</(\w+)/g>:
	1273
	1274	while ($x =~ /(\w+)/g) {
	1275	print "Word is $1, ends at position ", pos $x, "\n";
	1276	}
	1277
	1278	prints
	1279
	1280	Word is cat, ends at position 3
	1281	Word is dog, ends at position 7
	1282	Word is house, ends at position 13
	1283
	1284	A failed match or changing the target string resets the position. If
	1285	you don't want the position reset after failure to match, add the
	1286	C<//c> modifier, as in C</regexp/gc>. The current position in the string is
	1287	associated with the string, not the regexp. This means that different
	1288	strings have different positions and their respective positions can be
	1289	set or read independently.
	1290
	1291	In list context, C<//g> returns a list of matched groupings, or if
	1292	there are no groupings, a list of matches to the whole regexp. So if
	1293	we wanted just the words, we could use
	1294
	1295	@words = ($x =~ /(\w+)/g); # matches,
	1296	# $words[0] = 'cat'
	1297	# $words[1] = 'dog'
	1298	# $words[2] = 'house'
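
When the regexp has more than one grouping, the list returned by C<//g> contains the groupings of every match in turn. A short sketch, using a made-up configuration string, shows how that flat list can fill a hash directly.

```perl
use strict;
use warnings;

my $config = "width=80 height=24 depth=8";

# With two groupings, //g in list context returns ($1, $2) for each
# match in turn: a flat list of name, value, name, value, ...
my %setting = $config =~ /(\w+)=(\d+)/g;

print "$_ => $setting{$_}\n" for sort keys %setting;
```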
	1299
	1300	Closely associated with the C<//g> modifier is the C<\G> anchor. The
	1301	C<\G> anchor matches at the point where the previous C<//g> match left
	1302	off. C<\G> allows us to easily do context-sensitive matching:
	1303
	1304	$metric = 1; # use metric units
	1305	...
	1306	$x = <FILE>; # read in measurement
	1307	$x =~ /^([+-]?\d+)\s*/g; # get magnitude
	1308	$weight = $1;
	1309	if ($metric) { # error checking
	1310	print "Units error!" unless $x =~ /\Gkg\./g;
	1311	}
	1312	else {
	1313	print "Units error!" unless $x =~ /\Glbs\./g;
	1314	}
	1315	$x =~ /\G\s+(widget|sprocket)/g; # continue processing
	1316
	1317	The combination of C<//g> and C<\G> allows us to process the string a
	1318	bit at a time and use arbitrary Perl logic to decide what to do next.
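
The same idea scales up to a small tokenizer. The sketch below is an illustration only: C<tokenize> and the token names are invented here, and the three token types are an assumption about what the input may contain.

```perl
use strict;
use warnings;

# tokenize is a hypothetical helper written for this sketch.
# Each branch is anchored with \G, so it can only match exactly where
# the previous match left off; //c keeps pos() intact when a branch fails.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    pos($text) = 0;
    while (pos($text) < length($text)) {
        if    ($text =~ /\G\s+/gc)        { next }
        elsif ($text =~ /\G([+-]?\d+)/gc) { push @tokens, "NUM($1)" }
        elsif ($text =~ /\G(\w+\.?)/gc)   { push @tokens, "WORD($1)" }
        else {
            die "Unexpected character at position " . pos($text) . "\n";
        }
    }
    return @tokens;
}

print join(" ", tokenize("+12 kg. widget")), "\n";
# NUM(+12) WORD(kg.) WORD(widget)
```

Ordinary Perl control flow decides which regexp to try next, while C<\G> guarantees that no input is skipped between tokens.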
	1319
	1320	C<\G> is also invaluable in processing fixed length records with
	1321	regexps. Suppose we have a snippet of coding region DNA, encoded as
	1322	base pair letters C<ATCGTTGAAT...> and we want to find all the stop
	1323	codons C<TGA>. In a coding region, codons are 3-letter sequences, so
	1324	we can think of the DNA snippet as a sequence of 3-letter records. The
	1325	naive regexp
	1326
	1327	# expanded, this is "ATC GTT GAA TGC AAA TGA CAT GAC"
	1328	$dna = "ATCGTTGAATGCAAATGACATGAC";
	1329	$dna =~ /TGA/;
	1330
	1331	doesn't work; it may match a C<TGA>, but there is no guarantee that
	1332	the match is aligned with codon boundaries, e.g., the substring
	1333	S<C<GTT GAA> > gives a match. A better solution is
	1334
	1335	while ($dna =~ /(\w\w\w)*?TGA/g) { # note the minimal *?
	1336	    print "Got a TGA stop codon at position ", pos $dna, "\n";
	1337	}
	1338
	1339	which prints
	1340
	1341	Got a TGA stop codon at position 18
	1342	Got a TGA stop codon at position 23
	1343
	1344	Position 18 is good, but position 23 is bogus. What happened?
	1345
	1346	The answer is that our regexp works well until we get past the last
	1347	real match. Then the regexp will fail to match a synchronized C<TGA>
	1348	and start stepping ahead one character position at a time, not what we
	1349	want. The solution is to use C<\G> to anchor the match to the codon
	1350	alignment:
	1351
	1352	while ($dna =~ /\G(\w\w\w)*?TGA/g) {
	1353	print "Got a TGA stop codon at position ", pos $dna, "\n";
	1354	}
	1355
	1356	This prints
	1357
	1358	Got a TGA stop codon at position 18
	1359
	1360	which is the correct answer. This example illustrates that it is
	1361	important not only to match what is desired, but to reject what is not
	1362	desired.
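
The anchored loop can be packaged as a function and checked against a second sequence. This is a sketch; C<stop_codon_ends> is a made-up name, and the second test string is invented for illustration.

```perl
use strict;
use warnings;

# stop_codon_ends is a hypothetical helper: it returns the end position
# of every in-frame TGA, reading codons from the start of the string.
sub stop_codon_ends {
    my ($dna) = @_;
    my @ends;
    # \G pins each attempt to where the last match ended, so the
    # regexp can never slip out of the 3-letter reading frame.
    while ($dna =~ /\G(\w\w\w)*?TGA/g) {
        push @ends, pos $dna;
    }
    return @ends;
}

print join(",", stop_codon_ends("ATCGTTGAATGCAAATGACATGAC")), "\n";   # 18
print join(",", stop_codon_ends("TGATGA")), "\n";                     # 3,6
```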
	1363
	1364	B<search and replace>
	1365
	1366	Regular expressions also play a big role in B<search and replace>
	1367	operations in Perl. Search and replace is accomplished with the
	1368	C<s///> operator. The general form is
	1369	C<s/regexp/replacement/modifiers>, with everything we know about
	1370	regexps and modifiers applying in this case as well. The
	1371	C<replacement> is a Perl double quoted string that replaces in the
	1372	string whatever is matched with the C<regexp>. The operator C<=~> is
	1373	also used here to associate a string with C<s///>. If matching
	1374	against C<$_>, the S<C<$_ =~> > can be dropped. If there is a match,
	1375	C<s///> returns the number of substitutions made, otherwise it returns
	1376	false. Here are a few examples:
	1377
	1378	$x = "Time to feed the cat!";
	1379	$x =~ s/cat/hacker/; # $x contains "Time to feed the hacker!"
	1380	if ($x =~ s/^(Time.*hacker)!$/$1 now!/) {
	1381	$more_insistent = 1;
	1382	}
	1383	$y = "'quoted words'";
	1384	$y =~ s/^'(.*)'$/$1/; # strip single quotes,
	1385	# $y contains "quoted words"
	1386
	1387	In the last example, the whole string was matched, but only the part
	1388	inside the single quotes was grouped. With the C<s///> operator, the
	1389	matched variables C<$1>, C<$2>, etc. are immediately available for use
	1390	in the replacement expression, so we use C<$1> to replace the quoted
	1391	string with just what was quoted. With the global modifier, C<s///g>
	1392	will search and replace all occurrences of the regexp in the string:
	1393
	1394	$x = "I batted 4 for 4";
	1395	$x =~ s/4/four/; # doesn't do it all:
	1396	# $x contains "I batted four for 4"
	1397	$x = "I batted 4 for 4";
	1398	$x =~ s/4/four/g; # does it all:
	1399	# $x contains "I batted four for four"
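
Because C<s///> returns the number of substitutions made, the return value can double as a counter. A minimal sketch, reusing the string above:

```perl
use strict;
use warnings;

my $x = "I batted 4 for 4";
my $count = ($x =~ s/4/four/g);   # number of substitutions made
print "$count replacements: $x\n";

my $none = ($x =~ s/5/five/g);    # nothing matches, so a false value
print "nothing replaced\n" unless $none;
```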
	1400
	1401	If you prefer 'regex' over 'regexp' in this tutorial, you could use
	1402	the following program to replace it:
	1403
	1404	% cat > simple_replace
	1405	#!/usr/bin/perl
	1406	$regexp = shift;
	1407	$replacement = shift;
	1408	while (<>) {
	1409	s/$regexp/$replacement/go;
	1410	print;
	1411	}
	1412	^D
	1413
	1414	% simple_replace regexp regex perlretut.pod
	1415
	1416	In C<simple_replace> we used the C<s///g> modifier to replace all
	1417	occurrences of the regexp on each line and the C<s///o> modifier to
	1418	compile the regexp only once. As with C<simple_grep>, both the
	1419	C<print> and the C<s/$regexp/$replacement/go> use C<$_> implicitly.
	1420
	1421	A modifier available specifically to search and replace is the
	1422	C<s///e> evaluation modifier. C<s///e> wraps an C<eval{...}> around
	1423	the replacement string and the evaluated result is substituted for the
	1424	matched substring. C<s///e> is useful if you need to do a bit of
	1425	computation in the process of replacing text. This example counts
	1426	character frequencies in a line:
	1427
	1428	$x = "Bill the cat";
	1429	$x =~ s/(.)/$chars{$1}++;$1/eg; # final $1 replaces char with itself
	1430	print "frequency of '$_' is $chars{$_}\n"
	1431	foreach (sort {$chars{$b} <=> $chars{$a}} keys %chars);
	1432
	1433	This prints (the order of characters with equal counts may vary)
	1434
	1435	frequency of ' ' is 2
	1436	frequency of 't' is 2
	1437	frequency of 'l' is 2
	1438	frequency of 'B' is 1
	1439	frequency of 'c' is 1
	1440	frequency of 'e' is 1
	1441	frequency of 'h' is 1
	1442	frequency of 'i' is 1
	1443	frequency of 'a' is 1
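
A second small sketch of C<s///e>, this time where the computed value actually changes the text: every number in a made-up recipe line is doubled.

```perl
use strict;
use warnings;

my $recipe = "3 eggs and 10 grams of salt";

# Copy the string, then replace each run of digits with twice its value.
(my $doubled = $recipe) =~ s/(\d+)/2 * $1/eg;

print "$doubled\n";   # 6 eggs and 20 grams of salt
```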
	1444
	1445	As with the match C<m//> operator, C<s///> can use other delimiters,
	1446	such as C<s!!!> and C<s{}{}>, and even C<s{}//>. If single quotes are
	1447	used C<s'''>, then the regexp and replacement are treated as single
	1448	quoted strings and there are no substitutions. C<s///> in list context
	1449	returns the same thing as in scalar context, i.e., the number of
	1450	matches.
	1451
	1452	B<The split operator>
	1453
	1454	The B<C<split> > function can also optionally use a matching operator
	1455	C<m//> to split a string. C<split /regexp/, string, limit> splits
	1456	C<string> into a list of substrings and returns that list. The regexp
	1457	is used to match the character sequence that the C<string> is split
	1458	with respect to. The C<limit>, if present, constrains splitting into
	1459	no more than C<limit> number of strings. For example, to split a
	1460	string into words, use
	1461
	1462	$x = "Calvin and Hobbes";
	1463	@words = split /\s+/, $x; # $words[0] = 'Calvin'
	1464	# $words[1] = 'and'
	1465	# $words[2] = 'Hobbes'
	1466
	1467	If the empty regexp C<//> is used, the regexp always matches and
	1468	the string is split into individual characters. If the regexp has
	1469	groupings, then the list produced contains the matched substrings from the
	1470	groupings as well. For instance,
	1471
	1472	$x = "/usr/bin/perl";
	1473	@dirs = split m!/!, $x; # $dirs[0] = ''
	1474	# $dirs[1] = 'usr'
	1475	# $dirs[2] = 'bin'
	1476	# $dirs[3] = 'perl'
	1477	@parts = split m!(/)!, $x; # $parts[0] = ''
	1478	# $parts[1] = '/'
	1479	# $parts[2] = 'usr'
	1480	# $parts[3] = '/'
	1481	# $parts[4] = 'bin'
	1482	# $parts[5] = '/'
	1483	# $parts[6] = 'perl'
	1484
	1485	Since the first character of $x matched the regexp, C<split> prepended
	1486	an empty initial element to the list.
	1487
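The C<limit> argument and the empty regexp are easiest to see side by side. This is a small sketch; the record string is made up for illustration:

```perl
use strict;
use warnings;

my $record = "root:x:0:0:System Administrator:/root:/bin/sh";

my @fields = split /:/, $record;             # no limit: all seven fields
my ($user, $rest) = split /:/, $record, 2;   # limit 2: split at the first ':' only
my @chars = split //, "abc";                 # empty regexp: one element per character
```

With a limit of 2, everything after the first separator stays together in the last element, which is handy when only the leading field matters.
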
	1488	If you have read this far, congratulations! You now have all the basic
	1489	tools needed to use regular expressions to solve a wide range of text
	1490	processing problems. If this is your first time through the tutorial,
	1491	why not stop here and play around with regexps a while... S<Part 2>
	1492	concerns the more esoteric aspects of regular expressions and those
	1493	concepts certainly aren't needed right at the start.
	1494
	1495	=head1 Part 2: Power tools
	1496
	1497	OK, you know the basics of regexps and you want to know more. If
	1498	matching regular expressions is analogous to a walk in the woods, then
	1499	the tools discussed in Part 1 are analogous to topo maps and a
	1500	compass, basic tools we use all the time. Most of the tools in Part 2
	1501	are analogous to flare guns and satellite phones. They aren't used
	1502	too often on a hike, but when we are stuck, they can be invaluable.
	1503
	1504	What follows are the more advanced, less used, or sometimes esoteric
	1505	capabilities of perl regexps. In Part 2, we will assume you are
	1506	comfortable with the basics and concentrate on the new features.
	1507
	1508	=head2 More on characters, strings, and character classes
	1509
	1510	There are a number of escape sequences and character classes that we
	1511	haven't covered yet.
	1512
	1513	There are several escape sequences that convert characters or strings
	1514	between upper and lower case. C<\l> and C<\u> convert the next
	1515	character to lower or upper case, respectively:
	1516
	1517	$x = "perl";
	1518	$string =~ /\u$x/; # matches 'Perl' in $string
	1519	$x = "M(rs?|s)\\."; # note the double backslash
	1520	$string =~ /\l$x/; # matches 'mr.', 'mrs.', and 'ms.'
	1521
	1522	C<\L> and C<\U> convert a whole substring, delimited by C<\L> or
	1523	C<\U> and C<\E>, to lower or upper case:
	1524
	1525	$x = "This word is in lower case:\L SHOUT\E";
	1526	$x =~ /shout/; # matches
	1527	$x = "I STILL KEYPUNCH CARDS FOR MY 360";
	1528	$x =~ /\Ukeypunch/; # matches punch card string
	1529
	1530	If there is no C<\E>, case is converted until the end of the
	1531	string. The regexps C<\L\u$word> or C<\u\L$word> convert the first
	1532	character of C<$word> to uppercase and the rest of the characters to
	1533	lowercase.
	1534
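A minimal sketch of the capitalization idiom just described. The sample word and variable names are only illustrative:

```perl
use strict;
use warnings;

my $word = "pERL";

my $capitalized = "\u\L$word";   # lowercase the word, uppercase its first letter
my $same        = "\L\u$word";   # the other order gives the same result

# The same escapes work when the variable is interpolated into a regexp.
my $found = "I like Perl" =~ /\u\L$word\E/ ? 1 : 0;
```
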
	1535	Control characters can be escaped with C<\c>, so that a control-Z
	1536	character would be matched with C<\cZ>. The escape sequence
	1537	C<\Q>...C<\E> quotes, or protects, most non-alphabetic characters. For
	1538	instance,
	1539
	1540	$x = "That !^*&%~& cat!";
	1541	$x =~ /\Q!^*&%~&\E/; # check for rough language
	1542
	1543	It does not protect C<$> or C<@>, so that variables can still be
	1544	substituted.
	1545
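Because variables are still interpolated inside C<\Q>...C<\E>, the usual way to match the literal contents of a variable is C</\Q$var\E/>. A short sketch with invented sample strings:

```perl
use strict;
use warnings;

my $literal = "a.b*c";    # contains the metacharacters '.' and '*'

my $quoted   = "x a.b*c y" =~ /\Q$literal\E/ ? 1 : 0;  # the literal text is found
my $no_match = "x aXbbc y" =~ /\Q$literal\E/ ? 1 : 0;  # '.' and '*' are no longer special
my $unquoted = "x aXbbc y" =~ /$literal/     ? 1 : 0;  # without \Q they act as metacharacters
```
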
	1546	With the advent of 5.6.0, perl regexps can handle more than just the
	1547	standard ASCII character set. Perl now supports B<Unicode>, a standard
	1548	for encoding the character sets from many of the world's written
	1549	languages. Unicode does this by allowing characters to be more than
	1550	one byte wide. Perl uses the UTF-8 encoding, in which ASCII characters
	1551	are still encoded as one byte, but characters greater than C<chr(127)>
	1552	may be stored as two or more bytes.
	1553
	1554	What does this mean for regexps? Well, regexp users don't need to know
	1555	much about perl's internal representation of strings. But they do need
	1556	to know 1) how to represent Unicode characters in a regexp and 2) when
	1557	a matching operation will treat the string to be searched as a
	1558	sequence of bytes (the old way) or as a sequence of Unicode characters
	1559	(the new way). The answer to 1) is that Unicode characters greater
	1560	than C<chr(127)> may be represented using the C<\x{hex}> notation,
	1561	with C<hex> a hexadecimal integer:
	1562
	1563	use utf8; # We will be doing Unicode processing
	1564	/\x{263a}/; # match a Unicode smiley face :)
	1565
	1566	Unicode characters in the range of 128-255 use two hexadecimal digits
	1567	with braces: C<\x{ab}>. Note that this is different than C<\xab>,
	1568	which is just a hexadecimal byte with no Unicode
	1569	significance.
	1570
	1571	Figuring out the hexadecimal sequence of a Unicode character you want
	1572	or deciphering someone else's hexadecimal Unicode regexp is about as
	1573	much fun as programming in machine code. So another way to specify
	1574	Unicode characters is to use the S<B<named character> > escape
	1575	sequence C<\N{name}>. C<name> is a name for the Unicode character, as
	1576	specified in the Unicode standard. For instance, if we wanted to
	1577	represent or match the astrological sign for the planet Mercury, we
	1578	could use
	1579
	1580	use utf8; # We will be doing Unicode processing
	1581	use charnames ":full"; # use named chars with Unicode full names
	1582	$x = "abc\N{MERCURY}def";
	1583	$x =~ /\N{MERCURY}/; # matches
	1584
	1585	One can also use short names or restrict names to a certain alphabet:
	1586
	1587	use utf8; # We will be doing Unicode processing
	1588
	1589	use charnames ':full';
	1590	print "\N{GREEK SMALL LETTER SIGMA} is called sigma.\n";
	1591
	1592	use charnames ":short";
	1593	print "\N{greek:Sigma} is an upper-case sigma.\n";
	1594
	1595	use charnames qw(greek);
	1596	print "\N{sigma} is Greek sigma\n";
	1597
	1598	A list of full names is found in the file Names.txt in the
	1599	lib/perl5/5.6.0/unicode directory.
	1600
	1601	The answer to requirement 2), as of 5.6.0, is that if a regexp
	1602	contains Unicode characters, the string is searched as a sequence of
	1603	Unicode characters. Otherwise, the string is searched as a sequence of
	1604	bytes. If the string is being searched as a sequence of Unicode
	1605	characters, but matching a single byte is required, we can use the C<\C>
	1606	escape sequence. C<\C> is a character class akin to C<.> except that
	1607	it matches I<any> byte 0-255. So
	1608
	1609	use utf8; # We will be doing Unicode processing
	1610	use charnames ":full"; # use named chars with Unicode full names
	1611	$x = "a";
	1612	$x =~ /\C/; # matches 'a', eats one byte
	1613	$x = "";
	1614	$x =~ /\C/; # doesn't match, no bytes to match
	1615	$x = "\N{MERCURY}"; # three-byte Unicode character
	1616	$x =~ /\C/; # matches, but dangerous!
	1617
	1618	The last regexp matches, but is dangerous because the string
	1619	I<character> position is no longer synchronized to the string I<byte>
	1620	position. This generates the warning 'Malformed UTF-8
	1621	character'. C<\C> is best used for matching the binary data in strings
	1622	with binary data intermixed with Unicode characters.
	1623
	1624	Let us now discuss the rest of the character classes. Just as with
	1625	Unicode characters, there are named Unicode character classes
	1626	represented by the C<\p{name}> escape sequence. Closely associated is
	1627	the C<\P{name}> character class, which is the negation of the
	1628	C<\p{name}> class. For example, to match lower and uppercase
	1629	characters,
	1630
	1631	use utf8; # We will be doing Unicode processing
	1632	use charnames ":full"; # use named chars with Unicode full names
	1633	$x = "BOB";
	1634	$x =~ /^\p{IsUpper}/; # matches, uppercase char class
	1635	$x =~ /^\P{IsUpper}/; # doesn't match, char class sans uppercase
	1636	$x =~ /^\p{IsLower}/; # doesn't match, lowercase char class
	1637	$x =~ /^\P{IsLower}/; # matches, char class sans lowercase
	1638
	1639	If a C<name> is just one letter, the braces can be dropped. For
	1640	instance, C<\pM> is the character class of Unicode 'marks'. Here is
	1641	the association between some Perl named classes and the traditional
	1642	Unicode classes:
	1643
	1644	Perl class name  Unicode class name
	1645
	1646	IsAlpha          Lu, Ll, or Lo
	1647	IsAlnum          Lu, Ll, Lo, or Nd
	1648	IsASCII          $code le 127
	1649	IsCntrl          C
	1650	IsDigit          Nd
	1651	IsGraph          [^C] and $code ne "0020"
	1652	IsLower          Ll
	1653	IsPrint          [^C]
	1654	IsPunct          P
	1655	IsSpace          Z, or ($code lt "0020" and chr(hex $code) is a \s)
	1656	IsUpper          Lu
	1657	IsWord           Lu, Ll, Lo, Nd or $code eq "005F"
	1658	IsXDigit         $code =~ /^00(3[0-9]|[46][1-6])$/
	1659
	1660	For a full list of Perl class names, consult the mktables.PL program
	1661	in the lib/perl5/5.6.0/unicode directory.
	1662
	1663	C<\X> is an abbreviation for a character class sequence that includes
	1664	the Unicode 'combining character sequences'. A 'combining character
	1665	sequence' is a base character followed by any number of combining
	1666	characters. An example of a combining character is an accent. Using
	1667	the Unicode full names, e.g., S<C<A + COMBINING RING> > is a combining
	1668	character sequence with base character C<A> and combining character
	1669	S<C<COMBINING RING> >, which translates in Danish to A with the circle
	1670	atop it, as in the word Angstrom. C<\X> is equivalent to C<\PM\pM*>,
	1671	i.e., a non-mark followed by zero or more marks.
	1672
	1673	As if all those classes weren't enough, Perl also defines POSIX style
	1674	character classes. These have the form C<[:name:]>, with C<name> the
	1675	name of the POSIX class. The POSIX classes are C<alpha>, C<alnum>,
	1676	C<ascii>, C<cntrl>, C<digit>, C<graph>, C<lower>, C<print>, C<punct>,
	1677	C<space>, C<upper>, and C<xdigit>, and two extensions, C<word> (a Perl
	1678	extension to match C<\w>), and C<blank> (a GNU extension). If C<utf8>
	1679	is being used, then these classes are defined the same as their
	1680	corresponding perl Unicode classes: C<[:upper:]> is the same as
	1681	C<\p{IsUpper}>, etc. The POSIX character classes, however, don't
	1682	require using C<utf8>. The C<[:digit:]>, C<[:word:]>, and
	1683	C<[:space:]> correspond to the familiar C<\d>, C<\w>, and C<\s>
	1684	character classes. To negate a POSIX class, put a C<^> in front of
	1685	the name, so that, e.g., C<[:^digit:]> corresponds to C<\D> and under
	1686	C<utf8>, C<\P{IsDigit}>. The Unicode and POSIX character classes can
	1687	be used just like C<\d>, but POSIX classes work only inside brackets:
	1688
	1689	/\s+[abc[:digit:]xyz]\s*/; # match a,b,c,x,y,z, or a digit
	1690	/^=item\s[[:digit:]]/; # match '=item',
	1691	# followed by a space and a digit
	1692	use utf8;
	1693	use charnames ":full";
	1694	/\s+[abc\p{IsDigit}xyz]\s+/; # match a,b,c,x,y,z, or a digit
	1695	/^=item\s\p{IsDigit}/; # match '=item',
	1696	# followed by a space and a digit
	1697
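A brief sketch of POSIX classes written inside a bracketed character class, including the negated form. The word list is invented for illustration:

```perl
use strict;
use warnings;

# Identifier-like words: a letter, then letters, digits, or '_'.
my @names = grep { /^[[:alpha:]][[:alnum:]_]*$/ } qw(foo 9lives bar_2 a-b);

# [:^digit:] negates the class, much like \D.
my $no_digits = "abc" =~ /^[[:^digit:]]+$/ ? 1 : 0;
my $has_digit = "ab3" =~ /^[[:^digit:]]+$/ ? 1 : 0;
```
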
	1698	Whew! That is all the rest of the characters and character classes.
	1699
	1700	=head2 Compiling and saving regular expressions
	1701
	1702	In Part 1 we discussed the C<//o> modifier, which compiles a regexp
	1703	just once. This suggests that a compiled regexp is some data structure
	1704	that can be stored once and used again and again. The regexp quote
	1705	C<qr//> does exactly that: C<qr/string/> compiles the C<string> as a
	1706	regexp and transforms the result into a form that can be assigned to a
	1707	variable:
	1708
	1709	$reg = qr/foo+bar?/; # reg contains a compiled regexp
	1710
	1711	Then C<$reg> can be used as a regexp:
	1712
	1713	$x = "fooooba";
	1714	$x =~ $reg; # matches, just like /foo+bar?/
	1715	$x =~ /$reg/; # same thing, alternate form
	1716
	1717	C<$reg> can also be interpolated into a larger regexp:
	1718
	1719	$x =~ /(abc)?$reg/; # still matches
	1720
	1721	As with the matching operator, the regexp quote can use different
	1722	delimiters, e.g., C<qr!!>, C<qr{}> and C<qr~~>. The single quote
	1723	delimiters C<qr''> prevent any interpolation from taking place.
	1724
	1725	Pre-compiled regexps are useful for creating dynamic matches that
	1726	don't need to be recompiled each time they are encountered. Using
	1727	pre-compiled regexps, the C<simple_grep> program can be expanded into a
	1728	program that matches multiple patterns:
	1729
	1730	% cat > multi_grep
	1731	#!/usr/bin/perl
	1732	# multi_grep - match any of <number> regexps
	1733	# usage: multi_grep <number> regexp1 regexp2 ... file1 file2 ...
	1734
	1735	$number = shift;
	1736	$regexp[$_] = shift foreach (0..$number-1);
	1737	@compiled = map qr/$_/, @regexp;
	1738	while ($line = <>) {
	1739	    foreach $pattern (@compiled) {
	1740	        if ($line =~ /$pattern/) {
	1741	            print $line;
	1742	            last; # we matched, so move onto the next line
	1743	        }
	1744	    }
	1745	}
	1746	^D
	1747
	1748	% multi_grep 2 last for multi_grep
	1749	$regexp[$_] = shift foreach (0..$number-1);
	1750	foreach $pattern (@compiled) {
	1751	last;
	1752
	1753	Storing pre-compiled regexps in an array C<@compiled> allows us to
	1754	simply loop through the regexps without any recompilation, thus gaining
	1755	flexibility without sacrificing speed.
	1756
	1757	=head2 Embedding comments and modifiers in a regular expression
	1758
	1759	Starting with this section, we will be discussing Perl's set of
	1760	B<extended patterns>. These are extensions to the traditional regular
	1761	expression syntax that provide powerful new tools for pattern
	1762	matching. We have already seen extensions in the form of the minimal
	1763	matching constructs C<??>, C<*?>, C<+?>, C<{n,m}?>, and C<{n,}?>. The
	1764	rest of the extensions below have the form C<(?char...)>, where the
	1765	C<char> is a character that determines the type of extension.
	1766
	1767	The first extension is an embedded comment C<(?#text)>. This embeds a
	1768	comment into the regular expression without affecting its meaning. The
	1769	comment should not have any closing parentheses in the text. An
	1770	example is
	1771
	1772	/(?# Match an integer:)[+-]?\d+/;
	1773
	1774	This style of commenting has been largely superseded by the raw,
	1775	freeform commenting that is allowed with the C<//x> modifier.
	1776
	1777	The modifiers C<//i>, C<//m>, C<//s>, and C<//x> can also be embedded in
	1778	a regexp using C<(?i)>, C<(?m)>, C<(?s)>, and C<(?x)>. For instance,
	1779
	1780	/(?i)yes/; # match 'yes' case insensitively
	1781	/yes/i; # same thing
	1782	/(?x)( # freeform version of an integer regexp
	1783	[+-]? # match an optional sign
	1784	\d+ # match a sequence of digits
	1785	)
	1786	/x;
	1787
	1788	Embedded modifiers can have two important advantages over the usual
	1789	modifiers. Embedded modifiers allow a custom set of modifiers for
	1790	I<each> regexp pattern. This is great for matching an array of regexps
	1791	that must have different modifiers:
	1792
	1793	$pattern[0] = '(?i)doctor';
	1794	$pattern[1] = 'Johnson';
	1795	...
	1796	while (<>) {
	1797	foreach $patt (@pattern) {
	1798	print if /$patt/;
	1799	}
	1800	}
	1801
	1802	The second advantage is that embedded modifiers only affect the regexp
	1803	inside the group the embedded modifier is contained in. So grouping
	1804	can be used to localize the modifier's effects:
	1805
	1806	/Answer: ((?i)yes)/; # matches 'Answer: yes', 'Answer: YES', etc.
	1807
	1808	Embedded modifiers can also turn off any modifiers already present
	1809	by using, e.g., C<(?-i)>. Modifiers can also be combined into
	1810	a single expression, e.g., C<(?s-i)> turns on single line mode and
	1811	turns off case insensitivity.
	1812
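The scoping rules for embedded modifiers can be checked with a few test strings. This is only a sketch; the variable names are hypothetical:

```perl
use strict;
use warnings;

# (?i) applies from where it appears to the end of its enclosing group.
my $scoped = qr/Answer: ((?i)yes)/;
my $a1 = "Answer: YES" =~ $scoped ? 1 : 0;   # the 'yes' part ignores case
my $a2 = "ANSWER: yes" =~ $scoped ? 1 : 0;   # 'Answer' is still case-sensitive

# (?-i) switches case insensitivity off again inside a //i regexp.
my $mixed = qr/answer: ((?-i)yes)/i;
my $b1 = "ANSWER: yes" =~ $mixed ? 1 : 0;
my $b2 = "answer: YES" =~ $mixed ? 1 : 0;
```
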
	1813	=head2 Non-capturing groupings
	1814
	1815	We noted in Part 1 that groupings C<()> had two distinct functions: 1)
	1816	group regexp elements together as a single unit, and 2) extract, or
	1817	capture, substrings that matched the regexp in the
	1818	grouping. Non-capturing groupings, denoted by C<(?:regexp)>, allow the
	1819	regexp to be treated as a single unit, but don't extract substrings or
	1820	set matching variables C<$1>, etc. Both capturing and non-capturing
	1821	groupings are allowed to co-exist in the same regexp. Because there is
	1822	no extraction, non-capturing groupings are faster than capturing
	1823	groupings. Non-capturing groupings are also handy for choosing exactly
	1824	which parts of a regexp are to be extracted to matching variables:
	1825
	1826	# match a number, $1-$4 are set, but we only want $1
	1827	/([+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?)/;
	1828
	1829	# match a number faster, only $1 is set
	1830	/([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?)/;
	1831
	1832	# match a number, get $1 = whole number, $2 = exponent
	1833	/([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE]([+-]?\d+))?)/;
	1834
	1835	Non-capturing groupings are also useful for removing nuisance
	1836	elements gathered from a split operation:
	1837
	1838	$x = '12a34b5';
	1839	@num = split /(a|b)/, $x; # @num = ('12','a','34','b','5')
	1840	@num = split /(?:a|b)/, $x; # @num = ('12','34','5')
	1841
	1842	Non-capturing groupings may also have embedded modifiers:
	1843	C<(?i-m:regexp)> is a non-capturing grouping that matches C<regexp>
	1844	case insensitively and turns off multi-line mode.
	1845
	1846	=head2 Looking ahead and looking behind
	1847
	1848	This section concerns the lookahead and lookbehind assertions. First,
	1849	a little background.
	1850
	1851	In Perl regular expressions, most regexp elements 'eat up' a certain
	1852	amount of string when they match. For instance, the regexp element
	1853	C<[abc]> eats up one character of the string when it matches, in the
	1854	sense that perl moves to the next character position in the string
	1855	after the match. There are some elements, however, that don't eat up
	1856	characters (advance the character position) if they match. The examples
	1857	we have seen so far are the anchors. The anchor C<^> matches the
	1858	beginning of the line, but doesn't eat any characters. Similarly, the
	1859	word boundary anchor C<\b> matches, e.g., if the character to the left
	1860	is a word character and the character to the right is a non-word
	1861	character, but it doesn't eat up any characters itself. Anchors are
	1862	examples of 'zero-width assertions'. Zero-width, because they consume
	1863	no characters, and assertions, because they test some property of the
	1864	string. In the context of our walk in the woods analogy to regexp
	1865	matching, most regexp elements move us along a trail, but anchors have
	1866	us stop a moment and check our surroundings. If the local environment
	1867	checks out, we can proceed forward. But if the local environment
	1868	doesn't satisfy us, we must backtrack.
	1869
	1870	Checking the environment entails either looking ahead on the trail,
	1871	looking behind, or both. C<^> looks behind, to see that there are no
	1872	characters before. C<$> looks ahead, to see that there are no
	1873	characters after. C<\b> looks both ahead and behind, to see if the
	1874	characters on either side differ in their 'word'-ness.
	1875
	1876	The lookahead and lookbehind assertions are generalizations of the
	1877	anchor concept. Lookahead and lookbehind are zero-width assertions
	1878	that let us specify which characters we want to test for. The
	1879	lookahead assertion is denoted by C<(?=regexp)> and the lookbehind
	1880	assertion is denoted by C<< (?<=fixed-regexp) >>. Some examples are
	1881
	1882	$x = "I catch the housecat 'Tom-cat' with catnip";
	1883	$x =~ /cat(?=\s+)/; # matches 'cat' in 'housecat'
	1884	@catwords = ($x =~ /(?<=\s)cat\w+/g); # matches,
	1885	# $catwords[0] = 'catch'
	1886	# $catwords[1] = 'catnip'
	1887	$x =~ /\bcat\b/; # matches 'cat' in 'Tom-cat'
	1888	$x =~ /(?<=\s)cat(?=\s)/; # doesn't match; no isolated 'cat' in
	1889	# middle of $x
	1890
	1891	Note that the parentheses in C<(?=regexp)> and C<< (?<=regexp) >> are
	1892	non-capturing, since these are zero-width assertions. Thus in the
	1893	second regexp, the substrings captured are those of the whole regexp
	1894	itself. Lookahead C<(?=regexp)> can match arbitrary regexps, but
	1895	lookbehind C<< (?<=fixed-regexp) >> only works for regexps of fixed
	1896	width, i.e., a fixed number of characters long. Thus
	1897	C<< (?<=(ab|bc)) >> is fine, but C<< (?<=(ab)*) >> is not. The
	1898	negated versions of the lookahead and lookbehind assertions are
	1899	denoted by C<(?!regexp)> and C<< (?<!fixed-regexp) >> respectively.
	1900	They evaluate true if the regexps do I<not> match:
	1901
	1902	$x = "foobar";
	1903	$x =~ /foo(?!bar)/; # doesn't match, 'bar' follows 'foo'
	1904	$x =~ /foo(?!baz)/; # matches, 'baz' doesn't follow 'foo'
	1905	$x =~ /(?<!\s)foo/; # matches, there is no \s before 'foo'
	1906
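Because lookahead and lookbehind consume no characters, they can also select positions I<between> characters. The following sketch, which is not from the original examples, combines a fixed-width lookbehind with a variable-width lookahead to insert thousands separators:

```perl
use strict;
use warnings;

my $n = "1234567";

# Insert a comma at every position that has a digit behind it and a
# whole number of three-digit groups ahead of it, up to the end.
(my $pretty = $n) =~ s/(?<=\d)(?=(?:\d{3})+$)/,/g;

# A three-digit number has no such position and is left alone.
(my $short = "999") =~ s/(?<=\d)(?=(?:\d{3})+$)/,/g;
```
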
	1907	=head2 Using independent subexpressions to prevent backtracking
	1908
	1909	The last few extended patterns in this tutorial are experimental as of
	1910	5.6.0. Play with them, use them in some code, but don't rely on them
	1911	just yet for production code.
	1912
	1913	S<B<Independent subexpressions> > are regular expressions, in the
	1914	context of a larger regular expression, that function independently of
	1915	the larger regular expression. That is, they consume as much or as
	1916	little of the string as they wish without regard for the ability of
	1917	the larger regexp to match. Independent subexpressions are represented
	1918	by C<< (?>regexp) >>. We can illustrate their behavior by first
	1919	considering an ordinary regexp:
	1920
	1921	$x = "ab";
	1922	$x =~ /a*ab/; # matches
	1923
	1924	This obviously matches, but in the process of matching, the
	1925	subexpression C<a*> first grabbed the C<a>. Doing so, however,
	1926	wouldn't allow the whole regexp to match, so after backtracking, C<a*>
	1927	eventually gave back the C<a> and matched the empty string. Here, what
	1928	C<a*> matched was I<dependent> on what the rest of the regexp matched.
	1929
	1930	Contrast that with an independent subexpression:
	1931
	1932	$x =~ /(?>a*)ab/; # doesn't match!
	1933
	1934	The independent subexpression C<< (?>a*) >> doesn't care about the rest
	1935	of the regexp, so it sees an C<a> and grabs it. Then the rest of the
	1936	regexp C<ab> cannot match. Because C<< (?>a*) >> is independent, there
	1937	is no backtracking and the independent subexpression does not give
	1938	up its C<a>. Thus the match of the regexp as a whole fails. A similar
	1939	behavior occurs with completely independent regexps:
	1940
	1941	$x = "ab";
	1942	$x =~ /a*/g; # matches, eats an 'a'
	1943	$x =~ /\Gab/g; # doesn't match, no 'a' available
	1944
	1945	Here C<//g> and C<\G> create a 'tag team' handoff of the string from
	1946	one regexp to the other. Regexps with an independent subexpression are
	1947	much like this, with a handoff of the string to the independent
	1948	subexpression, and a handoff of the string back to the enclosing
	1949	regexp.
	1950
	1951	The ability of an independent subexpression to prevent backtracking
	1952	can be quite useful. Suppose we want to match a non-empty string
	1953	enclosed in parentheses up to two levels deep. Then the following
	1954	regexp matches:
	1955
	1956	$x = "abc(de(fg)h"; # unbalanced parentheses
	1957	$x =~ /\( ( [^()]+ | \([^()]*\) )+ \)/x;
	1958
	1959	The regexp matches an open parenthesis, one or more copies of an
	1960	alternation, and a close parenthesis. The alternation is two-way, with
	1961	the first alternative C<[^()]+> matching a substring with no
	1962	parentheses and the second alternative C<\([^()]*\)> matching a
	1963	substring delimited by parentheses. The problem with this regexp is
	1964	that it is pathological: it has nested indeterminate quantifiers
	1965	of the form C<(a+|b)+>. We discussed in Part 1 how nested quantifiers
	1966	like this could take an exponentially long time to execute if there
	1967	was no match possible. To prevent the exponential blowup, we need to
	1968	prevent useless backtracking at some point. This can be done by
	1969	enclosing the inner quantifier as an independent subexpression:
	1970
	1971	$x =~ /\( ( (?>[^()]+) | \([^()]*\) )+ \)/x;
	1972
	1973	Here, C<< (?>[^()]+) >> breaks the degeneracy of string partitioning
	1974	by gobbling up as much of the string as possible and keeping it. Then
	1975	match failures fail much more quickly.
	1976
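The independent subexpression changes how much backtracking is attempted, not what these particular regexps can match. The sketch below checks that on a few sample strings chosen for illustration. It makes no attempt to measure timing:

```perl
use strict;
use warnings;

my $plain  = qr/\( ( [^()]+ | \([^()]*\) )+ \)/x;
my $atomic = qr/\( ( (?>[^()]+) | \([^()]*\) )+ \)/x;

# Record what each regexp matches (or "" for no match) per string.
my %result;
for my $str ("abc(de(fg)h)", "abc(de(fg)h", "()", "no parens") {
    my $p = $str =~ $plain  ? $& : "";
    my $q = $str =~ $atomic ? $& : "";
    $result{$str} = [$p, $q];
}
```
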
	1977	=head2 Conditional expressions
	1978
	1979	A S<B<conditional expression> > is a form of if-then-else statement
	1980	that allows one to choose which patterns are to be matched, based on
	1981	some condition. There are two types of conditional expression:
	1982	C<(?(condition)yes-regexp)> and
	1983	C<(?(condition)yes-regexp|no-regexp)>. C<(?(condition)yes-regexp)> is
	1984	like an S<C<'if () {}'> > statement in Perl. If the C<condition> is true,
	1985	the C<yes-regexp> will be matched. If the C<condition> is false, the
	1986	C<yes-regexp> will be skipped and perl will move onto the next regexp
	1987	element. The second form is like an S<C<'if () {} else {}'> > statement
	1988	in Perl. If the C<condition> is true, the C<yes-regexp> will be
	1989	matched, otherwise the C<no-regexp> will be matched.
	1990
	1991	The C<condition> can have two forms. The first form is simply an
	1992	integer in parentheses C<(integer)>. It is true if the corresponding
	1993	backreference C<\integer> matched earlier in the regexp. The second
	1994	form is a bare zero width assertion C<(?...)>, either a
	1995	lookahead, a lookbehind, or a code assertion (discussed in the next
	1996	section).
	1997
	1998	The integer form of the C<condition> allows us to choose, with more
	1999	flexibility, what to match based on what matched earlier in the
	2000	regexp. This searches for words of the form C<"$x$x"> or
	2001	C<"$x$y$y$x">:
	2002
	2003	% simple_grep '^(\w+)(\w+)?(?(2)\2\1|\1)$' /usr/dict/words
	2004	beriberi
	2005	coco
	2006	couscous
	2007	deed
	2008	...
	2009	toot
	2010	toto
	2011	tutu
	2012
	2013	The lookbehind C<condition> allows, along with backreferences,
	2014	an earlier part of the match to influence a later part of the
	2015	match. For instance,
	2016
	2017	/[ATGC]+(?(?<=AA)G|C)$/;
	2018
	2019	matches a DNA sequence such that it either ends in C<AAG>, or some
	2020	other base pair combination and C<C>. Note that the form is
	2021	C<< (?(?<=AA)G|C) >> and not C<< (?((?<=AA))G|C) >>; for the
	2022	lookahead, lookbehind or code assertions, the parentheses around the
	2023	conditional are not needed.
	2024
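Both kinds of condition can be tried on a handful of strings. The first regexp below is an illustrative assumption, not an example from the original text: it uses the integer form to require a closing parenthesis only when an opening one was matched. The second is the DNA regexp from above:

```perl
use strict;
use warnings;

# A number that may be parenthesized: if group 1 matched an opening
# parenthesis, the conditional demands the closing one.
my $number = qr/^(\()?\d+(?(1)\))$/;
my @ok = grep { $_ =~ $number } ("42", "(42)", "(42", "42)");

# The lookbehind condition: 'G' must follow 'AA', otherwise 'C' ends it.
my $dna = qr/[ATGC]+(?(?<=AA)G|C)$/;
my @ends = grep { $_ =~ $dna } ("TTAAG", "TTC", "TTG", "AAC");
```
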
	2025	=head2 A bit of magic: executing Perl code in a regular expression
	2026
	2027	Normally, regexps are a part of Perl expressions.
	2028	S<B<Code evaluation> > expressions turn that around by allowing
	2029	arbitrary Perl code to be a part of a regexp. A code evaluation
	2030	expression is denoted C<(?{code})>, with C<code> a string of Perl
	2031	statements.
	2032
	2033	Code expressions are zero-width assertions, and the value they return
	2034	depends on their environment. There are two possibilities: either the
	2035	code expression is used as a conditional in a conditional expression
	2036	C<(?(condition)...)>, or it is not. If the code expression is a
	2037	conditional, the code is evaluated and the result (i.e., the result of
	2038	the last statement) is used to determine truth or falsehood. If the
	2039	code expression is not used as a conditional, the assertion always
	2040	evaluates true and the result is put into the special variable
	2041	C<$^R>. The variable C<$^R> can then be used in code expressions later
	2042	in the regexp. Here are some silly examples:
	2043
	2044	$x = "abcdef";
	2045	$x =~ /abc(?{print "Hi Mom!";})def/; # matches,
	2046	# prints 'Hi Mom!'
	2047	$x =~ /aaa(?{print "Hi Mom!";})def/; # doesn't match,
	2048	# no 'Hi Mom!'
	2049
	2050	Pay careful attention to the next example:
	2051
	2052	$x =~ /abc(?{print "Hi Mom!";})ddd/; # doesn't match,
	2053	# no 'Hi Mom!'
	2054	# but why not?
	2055
	2056	At first glance, you'd think that it shouldn't print, because obviously
	2057	the C<ddd> isn't going to match the target string. But look at this
	2058	example:
	2059
	2060	$x =~ /abc(?{print "Hi Mom!";})[d]dd/; # doesn't match,
	2061	# but _does_ print
	2062
	2063	Hmm. What happened here? If you've been following along, you know that
	2064	the above pattern should be effectively the same as the last one --
	2065	enclosing the d in a character class isn't going to change what it
	2066	matches. So why does the first not print while the second one does?
	2067
	2068	The answer lies in the optimizations the REx engine makes. In the first
	2069	case, all the engine sees are plain old characters (aside from the
	2070	C<?{}> construct). It's smart enough to realize that the string 'ddd'
	2071	doesn't occur in our target string before actually running the pattern
	2072	through. But in the second case, we've tricked it into thinking that our
	2073	pattern is more complicated than it is. It takes a look, sees our
	2074	character class, and decides that it will have to actually run the
	2075	pattern to determine whether or not it matches, and in the process of
	2076	running it hits the print statement before it discovers that we don't
	2077	have a match.
	2078
	2079	To take a closer look at how the engine does optimizations, see the
	2080	section L<"Pragmas and debugging"> below.
	2081
	2082	More fun with C<?{}>:
	2083
	2084	$x =~ /(?{print "Hi Mom!";})/; # matches,
	2085	# prints 'Hi Mom!'
	2086	$x =~ /(?{$c = 1;})(?{print "$c";})/; # matches,
	2087	# prints '1'
	2088	$x =~ /(?{$c = 1;})(?{print "$^R";})/; # matches,
	2089	# prints '1'
	2090
	2091	The bit of magic mentioned in the section title occurs when the regexp
	2092	backtracks in the process of searching for a match. If the regexp
	2093	backtracks over a code expression and if the variables used within are
	2094	localized using C<local>, the changes in the variables produced by the
	2095	code expression are undone! Thus, if we wanted to count how many times
	2096	a character got matched inside a group, we could use, e.g.,
	2097
	2098	$x = "aaaa";
	2099	$count = 0; # initialize 'a' count
	2100	$c = "bob"; # test if $c gets clobbered
	2101	$x =~ /(?{local $c = 0;}) # initialize count
	2102	( a # match 'a'
	2103	(?{local $c = $c + 1;}) # increment count
	2104	)* # do this any number of times,
	2105	aa # but match 'aa' at the end
	2106	(?{$count = $c;}) # copy local $c var into $count
	2107	/x;
	2108	print "'a' count is $count, \$c variable is '$c'\n";
	2109
	2110	This prints
	2111
	2112	'a' count is 2, $c variable is 'bob'
	2113
	2114	If we replace the S<C< (?{local $c = $c + 1;})> > with
	2115	S<C< (?{$c = $c + 1;})> >, the variable changes are I<not> undone
	2116	during backtracking, and we get
	2117
	2118	'a' count is 4, $c variable is 'bob'
	2119
	2120	Note that only localized variable changes are undone. Other side
	2121	effects of code expression execution are permanent. Thus
	2122
	2123	$x = "aaaa";
	2124	$x =~ /(a(?{print "Yow\n";}))*aa/;
	2125
	2126	produces
	2127
	2128	Yow
	2129	Yow
	2130	Yow
	2131	Yow
	2132
	2133	The result C<$^R> is automatically localized, so that it will behave
	2134	properly in the presence of backtracking.
	2135
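Since C<$^R> is restored on backtracking, the counting example can be written without an explicit C<local>. This is a sketch that assumes a reasonably recent perl in which lexical variables are visible inside code expressions. The comments inside the regexp deliberately avoid C<$> and C<@>, because comments in a C<//x> regexp are still subject to interpolation:

```perl
use strict;
use warnings;

my $x = "aaaa";
my $count;
$x =~ /(?{ 0 })                 # the running result starts at 0
       (?: a (?{ $^R + 1 }) )*  # each matched a adds 1 to the previous result
       aa                       # forces the loop to give back two of them
       (?{ $count = $^R })      # copy the surviving result out
      /x;
```

Here C<$count> ends up as 2, the same answer the C<local> version gives.
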
	2136	This example uses a code expression in a conditional to match the
	2137	article 'the' in either English or German:
	2138
	2139	$lang = 'DE'; # use German
	2140	...
	2141	$text = "das";
	2142	print "matched\n"
	2143	if $text =~ /(?(?{
	2144	$lang eq 'EN'; # is the language English?
	2145	})
	2146	the | # if so, then match 'the'
	2147	(die|das|der) # else, match 'die|das|der'
	2148	)
	2149	/xi;
	2150
	2151	Note that the syntax here is C<(?(?{...})yes-regexp|no-regexp)>, not
	2152	C<(?((?{...}))yes-regexp|no-regexp)>. In other words, in the case of a
	2153	code expression, we don't need the extra parentheses around the
	2154	conditional.
	2155
	2156	If you try to use code expressions with interpolating variables, perl
	2157	may surprise you:
	2158
	2159	$bar = 5;
	2160	$pat = '(?{ 1 })';
	2161	/foo(?{ $bar })bar/; # compiles ok, $bar not interpolated
	2162	/foo(?{ 1 })$bar/; # compile error!
	2163	/foo${pat}bar/; # compile error!
	2164
	2165	$pat = qr/(?{ $foo = 1 })/; # precompile code regexp
	2166	/foo${pat}bar/; # compiles ok
	2167
	2168	If a regexp has (1) code expressions and interpolating variables, or
	2169	(2) a variable that interpolates a code expression, perl treats the
	2170	regexp as an error. If the code expression is precompiled into a
	2171	variable, however, interpolating is ok. The question is, why is this
	2172	an error?
	2173
	2174	The reason is that variable interpolation and code expressions
	2175	together pose a security risk. The combination is dangerous because
	2176	programmers who write search engines often take user input and
	2177	plug it directly into a regexp:
	2178
	2179	$regexp = <>; # read user-supplied regexp
	2180	chomp $regexp; # get rid of possible newline
	2181	$text =~ /$regexp/; # search $text for the $regexp
	2182
	2183	If the C<$regexp> variable contains a code expression, the user could
	2184	then execute arbitrary Perl code. For instance, some joker could
	2185	search for S<C<(?{ system('rm -rf *'); })> > to erase your files. In this
	2186	sense, the combination of interpolation and code expressions B<taints>
	2187	your regexp. So by default, using both interpolation and code
	2188	expressions in the same regexp is not allowed. If you're not
	2189	concerned about malicious users, it is possible to bypass this
	2190	security check by invoking S<C<use re 'eval'> >:
	2191
	2192	use re 'eval'; # throw caution out the door
	2193	$bar = 5;
	2194	$pat = '(?{ 1 })';
	2195	/foo(?{ 1 })$bar/; # compiles ok
	2196	/foo${pat}bar/; # compiles ok
	2197
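When the user input is meant as a literal search term rather than as a regexp, quoting it avoids the problem without C<use re 'eval'>. The sketch below assumes a perl that reports a refused run-time code expression as a trappable error; the variable names and the hostile input are illustrative.

```perl
use strict;
use warnings;

my $user_input = '(?{ die "pwned" })';   # hostile "search term"
my $text       = 'harmless text with (?{ die "pwned" }) inside';

# \Q...\E quotes every metacharacter, so the input is matched literally
my $found = $text =~ /\Q$user_input\E/ ? 1 : 0;

# Without 'use re "eval"', the interpolated code expression is refused
my $ok    = eval { my $m = $text =~ /$user_input/; 1 };
my $error = $ok ? '' : $@;

print "literal match found: $found\n";
print "unquoted interpolation refused: ", ($ok ? "no" : "yes"), "\n";
```

The quoted form treats the parentheses and braces as ordinary characters, so no code is ever run on behalf of the user.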
	2198	Another form of code expression is the S<B<pattern code expression> >.
	2199	The pattern code expression is like a regular code expression, except
	2200	that the result of the code evaluation is treated as a regular
	2201	expression and matched immediately. A simple example is
	2202
	2203	$length = 5;
	2204	$char = 'a';
	2205	$x = 'aaaaabb';
	2206	$x =~ /(??{$char x $length})/x; # matches, there are 5 of 'a'
	2207
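A pattern code expression can also build its regexp from text captured earlier in the same match. The following sketch, with the illustrative name C<$counted>, matches a digit followed by exactly that many lowercase letters.

```perl
use strict;
use warnings;

# A digit, then exactly that many lowercase letters, e.g. "3abc"
my $counted = qr/^(\d)(??{ '[a-z]' x $1 })$/;

for my $s ('3abc', '3ab', '2ab') {
    printf "%-5s %s\n", $s, ($s =~ $counted ? 'matches' : 'does not match');
}
```

The string returned by the code expression, here C<[a-z]> repeated C<$1> times, is compiled and matched at the point where the expression appears.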
	2208
	2209	This final example contains both ordinary and pattern code
	2210	expressions. It detects if a binary string C<1101010010001...> has a
	2211	Fibonacci spacing 0,1,1,2,3,5,... of the C<1>'s:
	2212
	2213	$s0 = 0; $s1 = 1; # initial conditions
	2214	$x = "1101010010001000001";
	2215	print "It is a Fibonacci sequence\n"
	2216	if $x =~ /^1 # match an initial '1'
	2217	(
	2218	(??{'0' x $s0}) # match $s0 of '0'
	2219	1 # and then a '1'
	2220	(?{
	2221	$largest = $s0; # largest seq so far
	2222	$s2 = $s1 + $s0; # compute next term
	2223	$s0 = $s1; # in Fibonacci sequence
	2224	$s1 = $s2;
	2225	})
	2226	)+ # repeat as needed
	2227	$ # that is all there is
	2228	/x;
	2229	print "Largest sequence matched was $largest\n";
	2230
	2231	This prints
	2232
	2233	It is a Fibonacci sequence
	2234	Largest sequence matched was 5
	2235
	2236	Ha! Try that with your garden variety regexp package...
	2237
	2238	Note that the variables C<$s0> and C<$s1> are not substituted when the
	2239	regexp is compiled, as happens for ordinary variables outside a code
	2240	expression. Rather, the code expressions are evaluated when perl
	2241	encounters them during the search for a match.
	2242
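The difference in timing can be seen by changing a variable after the regexps have been compiled. This is a minimal sketch; the names C<$n>, C<$interpolated> and C<$deferred> are assumptions made for the example, and C<$n> is declared with C<our> so that the code expression sees its current value.

```perl
use strict;
use warnings;

our $n = 2;
my $interpolated = qr/^a{$n}$/;              # $n substituted now: a{2}
my $deferred     = qr/^(??{ 'a' x $n })$/;   # $n read when a match runs

$n = 3;   # change the variable after both regexps are compiled

print "interpolated matches 'aa':  ", ('aa'  =~ $interpolated ? 1 : 0), "\n";
print "deferred matches 'aaa':     ", ('aaa' =~ $deferred     ? 1 : 0), "\n";
```

The interpolated regexp keeps the value C<$n> had at compile time, while the pattern code expression picks up the new value.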
	2243	The regexp without the C<//x> modifier is
	2244
	2245	/^1((??{'0'x$s0})1(?{$largest=$s0;$s2=$s1+$s0;$s0=$s1;$s1=$s2;}))+$/;
	2246
	2247	and is a great start on an Obfuscated Perl entry :-) When working with
	2248	code and conditional expressions, the extended form of regexps is
	2249	almost necessary in creating and debugging regexps.
	2250
	2251	=head2 Pragmas and debugging
	2252
	2253	Speaking of debugging, there are several pragmas available to control
	2254	and debug regexps in Perl. We have already encountered one pragma in
	2255	the previous section, S<C<use re 'eval';> >, that allows variable
	2256	interpolation and code expressions to coexist in a regexp. The other
	2257	pragmas are
	2258
	2259	use re 'taint';
	2260	$tainted = <>;
	2261	@parts = ($tainted =~ /(\w+)\s+(\w+)/); # @parts is now tainted
	2262
	2263	The C<taint> pragma causes any substrings from a match with a tainted
	2264	variable to be tainted as well. This is not normally the case, as
	2265	regexps are often used to extract the safe bits from a tainted
	2266	variable. Use C<taint> when you are not extracting safe bits, but are
	2267	performing some other processing. Both C<taint> and C<eval> pragmas
	2268	are lexically scoped, which means they are in effect only until
	2269	the end of the block enclosing the pragmas.
	2270
	2271	use re 'debug';
	2272	/^(.*)$/s; # output debugging info
	2273
	2274	use re 'debugcolor';
	2275	/^(.*)$/s; # output debugging info in living color
	2276
	2277	The global C<debug> and C<debugcolor> pragmas allow one to get
	2278	detailed debugging info about regexp compilation and
	2279	execution. C<debugcolor> is the same as C<debug>, except the debugging
	2280	information is displayed in color on terminals that can display
	2281	termcap color sequences. Here is example output:
	2282
	2283	% perl -e 'use re "debug"; "abc" =~ /a*b+c/;'
	2284	Compiling REx `a*b+c'
	2285	size 9 first at 1
	2286	1: STAR(4)
	2287	2: EXACT <a>(0)
	2288	4: PLUS(7)
	2289	5: EXACT <b>(0)
	2290	7: EXACT <c>(9)
	2291	9: END(0)
	2292	floating `bc' at 0..2147483647 (checking floating) minlen 2
	2293	Guessing start of match, REx `a*b+c' against `abc'...
	2294	Found floating substr `bc' at offset 1...
	2295	Guessed: match at offset 0
	2296	Matching REx `a*b+c' against `abc'
	2297	Setting an EVAL scope, savestack=3
	2298	0 <> <abc> | 1: STAR
	2299	EXACT <a> can match 1 times out of 32767...
	2300	Setting an EVAL scope, savestack=3
	2301	1 <a> <bc> | 4: PLUS
	2302	EXACT <b> can match 1 times out of 32767...
	2303	Setting an EVAL scope, savestack=3
	2304	2 <ab> <c> | 7: EXACT <c>
	2305	3 <abc> <> | 9: END
	2306	Match successful!
	2307	Freeing REx: `a*b+c'
	2308
	2309	If you have gotten this far into the tutorial, you can probably guess
	2310	what the different parts of the debugging output tell you. The first
	2311	part
	2312
	2313	Compiling REx `a*b+c'
	2314	size 9 first at 1
	2315	1: STAR(4)
	2316	2: EXACT <a>(0)
	2317	4: PLUS(7)
	2318	5: EXACT <b>(0)
	2319	7: EXACT <c>(9)
	2320	9: END(0)
	2321
	2322	describes the compilation stage. C<STAR(4)> means that there is a
	2323	starred object, in this case C<'a'>, and if it matches, goto line 4,
	2324	i.e., C<PLUS(7)>. The middle lines describe some heuristics and
	2325	optimizations performed before a match:
	2326
	2327	floating `bc' at 0..2147483647 (checking floating) minlen 2
	2328	Guessing start of match, REx `a*b+c' against `abc'...
	2329	Found floating substr `bc' at offset 1...
	2330	Guessed: match at offset 0
	2331
	2332	Then the match is executed and the remaining lines describe the
	2333	process:
	2334
	2335	Matching REx `a*b+c' against `abc'
	2336	Setting an EVAL scope, savestack=3
	2337	0 <> <abc> | 1: STAR
	2338	EXACT <a> can match 1 times out of 32767...
	2339	Setting an EVAL scope, savestack=3
	2340	1 <a> <bc> | 4: PLUS
	2341	EXACT <b> can match 1 times out of 32767...
	2342	Setting an EVAL scope, savestack=3
	2343	2 <ab> <c> | 7: EXACT <c>
	2344	3 <abc> <> | 9: END
	2345	Match successful!
	2346	Freeing REx: `a*b+c'
	2347
	2348	Each step is of the form S<C<< n <x> <y> >> >, with C<< <x> >> the
	2349	part of the string matched and C<< <y> >> the part not yet
	2350	matched. The S<C<< | 1: STAR >> > says that perl is at line number 1
	2351	in the compilation list above. See
	2352	L<perldebguts/"Debugging regular expressions"> for much more detail.
	2353
	2354	An alternative method of debugging regexps is to embed C<print>
	2355	statements within the regexp. This provides a blow-by-blow account of
	2356	the backtracking in an alternation:
	2357
	2358	"that this" =~ m@(?{print "Start at position ", pos, "\n";})
	2359	t(?{print "t1\n";})
	2360	h(?{print "h1\n";})
	2361	i(?{print "i1\n";})
	2362	s(?{print "s1\n";})
	2363	|
	2364	t(?{print "t2\n";})
	2365	h(?{print "h2\n";})
	2366	a(?{print "a2\n";})
	2367	t(?{print "t2\n";})
	2368	(?{print "Done at position ", pos, "\n";})
	2369	@x;
	2370
	2371	prints
	2372
	2373	Start at position 0
	2374	t1
	2375	h1
	2376	t2
	2377	h2
	2378	a2
	2379	t2
	2380	Done at position 4
	2381
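A variation on this technique is to collect the trace in an array instead of printing it, which makes the backtracking easier to inspect afterwards. This is only a sketch: the array name C<@trace> and the step labels are illustrative, and the regexp uses C<m{...}x> rather than C<m@...@x> so that the C<@> of the array name cannot be taken as the closing delimiter.

```perl
use strict;
use warnings;

our @trace;   # filled by the code expressions below

"that this" =~ m{
      t (?{ push @trace, 't1' })
      h (?{ push @trace, 'h1' })
      i (?{ push @trace, 'i1' })
      s (?{ push @trace, 's1' })
    |
      t (?{ push @trace, 't2' })
      h (?{ push @trace, 'h2' })
      a (?{ push @trace, 'a2' })
      t (?{ push @trace, 't3' })
}x;

print join(' ', @trace), "\n";
```

The steps recorded for the first alternative stay in the array even though that alternative fails, since pushing onto an array is a side effect that backtracking does not undo.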
	2382	=head1 BUGS
	2383
	2384	Code expressions, conditional expressions, and independent expressions
	2385	are B<experimental>. Don't use them in production code. Yet.
	2386
	2387	=head1 SEE ALSO
	2388
	2389	This is just a tutorial. For the full story on perl regular
	2390	expressions, see the L<perlre> regular expressions reference page.
	2391
	2392	For more information on the matching C<m//> and substitution C<s///>
	2393	operators, see L<perlop/"Regexp Quote-Like Operators">. For
	2394	information on the C<split> operation, see L<perlfunc/split>.
	2395
	2396	For an excellent all-around resource on the care and feeding of
	2397	regular expressions, see the book I<Mastering Regular Expressions> by
	2398	Jeffrey Friedl (published by O'Reilly, ISBN 1-56592-257-3).
	2399
	2400	=head1 AUTHOR AND COPYRIGHT
	2401
	2402	Copyright (c) 2000 Mark Kvale
	2403	All rights reserved.
	2404
	2405	This document may be distributed under the same terms as Perl itself.
	2406
	2407	=head2 Acknowledgments
	2408
	2409	The inspiration for the stop codon DNA example came from the ZIP
	2410	code example in chapter 7 of I<Mastering Regular Expressions>.
	2411
	2412	The author would like to thank Jeff Pinyan, Andrew Johnson, Peter
	2413	Haworth, Ronald J Kimball, and Joe Smith for all their helpful
	2414	comments.
	2415
	2416	=cut
	2417