perl5.git.perl.org Git - perl5.git/blame_incremental

Commit	Line	Data
	1	=head1 NAME
	2	X<regular expression> X<regex> X<regexp>
	3
	4	perlre - Perl regular expressions
	5
	6	=head1 DESCRIPTION
	7
	8	This page describes the syntax of regular expressions in Perl.
	9
	10	If you haven't used regular expressions before, a quick-start
	11	introduction is available in L<perlrequick>, and a longer tutorial
	12	introduction is available in L<perlretut>.
	13
	14	For reference on how regular expressions are used in matching
	15	operations, plus various examples of the same, see discussions of
	16	C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/"Regexp Quote-Like
	17	Operators">.
	18
	19	Matching operations can have various modifiers. Modifiers
	20	that relate to the interpretation of the regular expression inside
	21	are listed below. Modifiers that alter the way a regular expression
	22	is used by Perl are detailed in L<perlop/"Regexp Quote-Like Operators"> and
	23	L<perlop/"Gory details of parsing quoted constructs">.
	24
	25	=over 4
	26
	27	=item i
	28	X</i> X<regex, case-insensitive> X<regexp, case-insensitive>
	29	X<regular expression, case-insensitive>
	30
	31	Do case-insensitive pattern matching.
	32
	33	If C<use locale> is in effect, the case map is taken from the current
	34	locale. See L<perllocale>.
	35
	36	=item m
	37	X</m> X<regex, multiline> X<regexp, multiline> X<regular expression, multiline>
	38
	39	Treat string as multiple lines. That is, change "^" and "$" from matching
	40	the start or end of the string to matching the start or end of any
	41	line anywhere within the string.
	42
	43	=item s
	44	X</s> X<regex, single-line> X<regexp, single-line>
	45	X<regular expression, single-line>
	46
	47	Treat string as single line. That is, change "." to match any character
	48	whatsoever, even a newline, which normally it would not match.
	49
	50	Used together, as /ms, they let the "." match any character whatsoever,
	51	while still allowing "^" and "$" to match, respectively, just after
	52	and just before newlines within the string.
	53
	54	=item x
	55	X</x>
	56
	57	Extend your pattern's legibility by permitting whitespace and comments.
	58
	59	=back
	60
	61	These are usually written as "the C</x> modifier", even though the delimiter
	62	in question might not really be a slash. Any of these
	63	modifiers may also be embedded within the regular expression itself using
	64	the C<(?...)> construct. See below.
	65
	66	The C</x> modifier itself needs a little more explanation. It tells
	67	the regular expression parser to ignore whitespace that is neither
	68	backslashed nor within a character class. You can use this to break up
	69	your regular expression into (slightly) more readable parts. The C<#>
	70	character is also treated as a metacharacter introducing a comment,
	71	just as in ordinary Perl code. This also means that if you want real
	72	whitespace or C<#> characters in the pattern (outside a character
	73	class, where they are unaffected by C</x>), you'll either have to
	74	escape them or encode them using octal or hex escapes. Taken together,
	75	these features go a long way towards making Perl's regular expressions
	76	more readable. Note that you have to be careful not to include the
	77	pattern delimiter in the comment--perl has no way of knowing you did
	78	not intend to close the pattern early. See the C-comment deletion code
	79	in L<perlop>.
	80	X</x>
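As a rough illustration of the four modifiers described above, the
following sketch applies each of them to a small two-line string. The
string and all variable names are invented for the example.

```perl
use strict;
use warnings;

my $text = "first line\nSecond line\n";

# /i: case-insensitive matching.
my $ci = $text =~ /second/i ? 1 : 0;

# Without /m, "^" matches only at the start of the string.
my $no_m = $text =~ /^Second/ ? 1 : 0;

# With /m, "^" also matches just after an embedded newline.
my $with_m = $text =~ /^Second/m ? 1 : 0;

# Without /s, "." does not match a newline ...
my $no_s = $text =~ /line.Second/ ? 1 : 0;

# ... with /s it does.
my $with_s = $text =~ /line.Second/s ? 1 : 0;

# /x: whitespace and comments in the pattern are ignored, so a
# literal space has to be escaped (or written as \x20).
my ($word) = $text =~ /
    ^ first     # the literal word "first"
    \           # an escaped, literal space
    (\w+)       # capture the following word
/x;

# prints: i=1 no_m=0 m=1 no_s=0 s=1 word=line
print "i=$ci no_m=$no_m m=$with_m no_s=$no_s s=$with_s word=$word\n";
```

Note that none of the comments inside the C</x> pattern contains the
pattern delimiter, for the reason given above.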
	81
	82	=head2 Regular Expressions
	83
	84	The patterns used in Perl pattern matching derive from those supplied in
	85	the Version 8 regex routines. (The routines are derived
	86	(distantly) from Henry Spencer's freely redistributable reimplementation
	87	of the V8 routines.) See L<Version 8 Regular Expressions> for
	88	details.
	89
	90	In particular the following metacharacters have their standard I<egrep>-ish
	91	meanings:
	92	X<metacharacter>
	93	X<\> X<^> X<.> X<$> X<|> X<(> X<()> X<[> X<[]>
	94
	95
	96	    \   Quote the next metacharacter
	97	    ^   Match the beginning of the line
	98	    .   Match any character (except newline)
	99	    $   Match the end of the line (or before newline at the end)
	100	    |   Alternation
	101	    ()  Grouping
	102	    []  Character class
	103
	104	By default, the "^" character is guaranteed to match only the
	105	beginning of the string, the "$" character only the end (or before the
	106	newline at the end), and Perl does certain optimizations with the
	107	assumption that the string contains only one line. Embedded newlines
	108	will not be matched by "^" or "$". You may, however, wish to treat a
	109	string as a multi-line buffer, such that the "^" will match after any
	110	newline within the string, and "$" will match before any newline. At the
	111	cost of a little more overhead, you can do this by using the /m modifier
	112	on the pattern match operator. (Older programs did this by setting C<$*>,
	113	but this practice has been removed in perl 5.9.)
	114	X<^> X<$> X</m>
	115
	116	To simplify multi-line substitutions, the "." character never matches a
	117	newline unless you use the C</s> modifier, which in effect tells Perl to pretend
	118	the string is a single line--even if it isn't.
	119	X<.> X</s>
	120
	121	The following standard quantifiers are recognized:
	122	X<metacharacter> X<quantifier> X<*> X<+> X<?> X<{n}> X<{n,}> X<{n,m}>
	123
	124	    *      Match 0 or more times
	125	    +      Match 1 or more times
	126	    ?      Match 1 or 0 times
	127	    {n}    Match exactly n times
	128	    {n,}   Match at least n times
	129	    {n,m}  Match at least n but not more than m times
	130
	131	(If a curly bracket occurs in any other context, it is treated
	132	as a regular character. In particular, the lower bound
	133	is not optional.) The "*" quantifier is equivalent to C<{0,}>, the "+"
	134	quantifier to C<{1,}>, and the "?" quantifier to C<{0,1}>. n and m are limited
	135	to integral values less than a preset limit defined when perl is built.
	136	This is usually 32766 on the most common platforms. The actual limit can
	137	be seen in the error message generated by code such as this:
	138
	139	    $_ **= $_ , / {$_} / for 2 .. 42;
	140
	141	By default, a quantified subpattern is "greedy", that is, it will match as
	142	many times as possible (given a particular starting location) while still
	143	allowing the rest of the pattern to match. If you want it to match the
	144	minimum number of times possible, follow the quantifier with a "?". Note
	145	that the meanings don't change, just the "greediness":
	146	X<metacharacter> X<greedy> X<greedyness>
	147	X<?> X<*?> X<+?> X<??> X<{n}?> X<{n,}?> X<{n,m}?>
	148
	149	    *?     Match 0 or more times
	150	    +?     Match 1 or more times
	151	    ??     Match 0 or 1 time
	152	    {n}?   Match exactly n times
	153	    {n,}?  Match at least n times
	154	    {n,m}? Match at least n but not more than m times
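
The difference between greedy and minimal matching is perhaps easiest
to see side by side. The following is only a small sketch; the sample
strings and variable names are made up for the purpose.

```perl
use strict;
use warnings;

my $html = "<b>bold</b> and <i>italic</i>";

# Greedy: ".+" grabs as much as it can while still letting ">" match.
my ($greedy) = $html =~ /<(.+)>/;

# Minimal: ".+?" stops at the first place where ">" can match.
my ($minimal) = $html =~ /<(.+?)>/;

# The same distinction applies to counted quantifiers.
my ($most)  = "aaaaa" =~ /(a{2,4})/;     # as many as allowed
my ($least) = "aaaaa" =~ /(a{2,4}?)/;    # as few as allowed

print "greedy:  $greedy\n";     # b>bold</b> and <i>italic</i
print "minimal: $minimal\n";    # b
print "most:    $most\n";       # aaaa
print "least:   $least\n";      # aa
```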
	155
	156	Because patterns are processed as double quoted strings, the following
	157	also work:
	158	X<\t> X<\n> X<\r> X<\f> X<\a> X<\l> X<\u> X<\L> X<\U> X<\E> X<\Q>
	159	X<\0> X<\c> X<\N> X<\x>
	160
	161	    \t          tab                   (HT, TAB)
	162	    \n          newline               (LF, NL)
	163	    \r          return                (CR)
	164	    \f          form feed             (FF)
	165	    \a          alarm (bell)          (BEL)
	166	    \e          escape (think troff)  (ESC)
	167	    \033        octal char            (think of a PDP-11)
	168	    \x1B        hex char
	169	    \x{263a}    wide hex char         (Unicode SMILEY)
	170	    \c[         control char
	171	    \N{name}    named char
	172	    \l          lowercase next char (think vi)
	173	    \u          uppercase next char (think vi)
	174	    \L          lowercase till \E (think vi)
	175	    \U          uppercase till \E (think vi)
	176	    \E          end case modification (think vi)
	177	    \Q          quote (disable) pattern metacharacters till \E
	178
	179	If C<use locale> is in effect, the case map used by C<\l>, C<\L>, C<\u>
	180	and C<\U> is taken from the current locale. See L<perllocale>. For
	181	documentation of C<\N{name}>, see L<charnames>.
	182
	183	You cannot include a literal C<$> or C<@> within a C<\Q> sequence.
	184	An unescaped C<$> or C<@> interpolates the corresponding variable,
	185	while escaping will cause the literal string C<\$> to be matched.
	186	You'll need to write something like C<m/\Quser\E\@\Qhost/>.
	187
	188	In addition, Perl defines the following:
	189	X<metacharacter>
	190	X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\X> X<\p> X<\P> X<\C>
	191	X<word> X<whitespace>
	192
	193	    \w  Match a "word" character (alphanumeric plus "_")
	194	    \W  Match a non-"word" character
	195	    \s  Match a whitespace character
	196	    \S  Match a non-whitespace character
	197	    \d  Match a digit character
	198	    \D  Match a non-digit character
	199	    \pP Match P, named property. Use \p{Prop} for longer names.
	200	    \PP Match non-P
	201	    \X  Match eXtended Unicode "combining character sequence",
	202	        equivalent to (?:\PM\pM*)
	203	    \C  Match a single C char (octet) even under Unicode.
	204	        NOTE: breaks up characters into their UTF-8 bytes,
	205	        so you may end up with malformed pieces of UTF-8.
	206	        Unsupported in lookbehind.
	207
	208	A C<\w> matches a single alphanumeric character (an alphabetic
	209	character, or a decimal digit) or C<_>, not a whole word. Use C<\w+>
	210	to match a string of Perl-identifier characters (which isn't the same
	211	as matching an English word). If C<use locale> is in effect, the list
	212	of alphabetic characters generated by C<\w> is taken from the current
	213	locale. See L<perllocale>. You may use C<\w>, C<\W>, C<\s>, C<\S>,
	214	C<\d>, and C<\D> within character classes, but if you try to use them
	215	as endpoints of a range, that's not a range, the "-" is understood
	216	literally. If Unicode is in effect, C<\s> also matches "\x{85}",
	217	"\x{2028}", and "\x{2029}". See L<perlunicode> for more details about
	218	C<\pP>, C<\PP>, and C<\X>, and L<perluniintro> about Unicode in general.
	219	You can define your own C<\p> and C<\P> properties, see L<perlunicode>.
	220	X<\w> X<\W> X<word>
	221
	222	The POSIX character class syntax
	223	X<character class>
	224
	225	    [:class:]
	226
	227	is also available. Note that the C<[> and C<]> brackets are I<literal>;
	228	they must always be used within a character class expression.
	229
	230	    # this is correct:
	231	    $string =~ /[[:alpha:]]/;
	232
	233	    # this is not, and will generate a warning:
	234	    $string =~ /[:alpha:]/;
	235
	236	The available classes and their backslash equivalents (if available) are
	237	as follows:
	238	X<character class>
	239	X<alpha> X<alnum> X<ascii> X<blank> X<cntrl> X<digit> X<graph>
	240	X<lower> X<print> X<punct> X<space> X<upper> X<word> X<xdigit>
	241
	242	    alpha
	243	    alnum
	244	    ascii
	245	    blank               [1]
	246	    cntrl
	247	    digit       \d
	248	    graph
	249	    lower
	250	    print
	251	    punct
	252	    space       \s      [2]
	253	    upper
	254	    word        \w      [3]
	255	    xdigit
	256
	257	=over
	258
	259	=item [1]
	260
	261	A GNU extension equivalent to C<[ \t]>, "all horizontal whitespace".
	262
	263	=item [2]
	264
	265	Not exactly equivalent to C<\s> since the C<[[:space:]]> includes
	266	also the (very rare) "vertical tabulator", "\ck", chr(11).
	267
	268	=item [3]
	269
	270	A Perl extension, see above.
	271
	272	=back
	273
	274	For example use C<[:upper:]> to match all the uppercase characters.
	275	Note that the C<[]> are part of the C<[::]> construct, not part of the
	276	whole character class. For example:
	277
	278	    [01[:alpha:]%]
	279
	280	matches zero, one, any alphabetic character, and the percentage sign.
	281
	282	The following equivalences to Unicode \p{} constructs and equivalent
	283	backslash character classes (if available), will hold:
	284	X<character class> X<\p> X<\p{}>
	285
	286	    [[:...:]]   \p{...}         backslash
	287
	288	    alpha       IsAlpha
	289	    alnum       IsAlnum
	290	    ascii       IsASCII
	291	    blank       IsSpace
	292	    cntrl       IsCntrl
	293	    digit       IsDigit         \d
	294	    graph       IsGraph
	295	    lower       IsLower
	296	    print       IsPrint
	297	    punct       IsPunct
	298	    space       IsSpace
	299	                IsSpacePerl     \s
	300	    upper       IsUpper
	301	    word        IsWord
	302	    xdigit      IsXDigit
	303
	304	For example C<[[:lower:]]> and C<\p{IsLower}> are equivalent.
	305
	306	If the C<utf8> pragma is not used but the C<locale> pragma is, the
	307	classes correlate with the usual isalpha(3) interface (except for
	308	"word" and "blank").
	309
	310	The classes whose names may not be self-explanatory are:
	311
	312	=over 4
	313
	314	=item cntrl
	315	X<cntrl>
	316
	317	Any control character. Usually characters that don't produce output as
	318	such but instead control the terminal somehow: for example newline and
	319	backspace are control characters. All characters with ord() less than
	320	32 are most often classified as control characters (assuming ASCII,
	321	the ISO Latin character sets, and Unicode), as is the character with
	322	the ord() value of 127 (C<DEL>).
	323
	324	=item graph
	325	X<graph>
	326
	327	Any alphanumeric or punctuation (special) character.
	328
	329	=item print
	330	X<print>
	331
	332	Any alphanumeric or punctuation (special) character or the space character.
	333
	334	=item punct
	335	X<punct>
	336
	337	Any punctuation (special) character.
	338
	339	=item xdigit
	340	X<xdigit>
	341
	342	Any hexadecimal digit. Though this may feel silly ([0-9A-Fa-f] would
	343	work just fine) it is included for completeness.
	344
	345	=back
	346
	347	You can negate the [::] character classes by prefixing the class name
	348	with a '^'. This is a Perl extension. For example:
	349	X<character class, negation>
	350
	351	    POSIX           traditional     Unicode
	352
	353	    [[:^digit:]]    \D              \P{IsDigit}
	354	    [[:^space:]]    \S              \P{IsSpace}
	355	    [[:^word:]]     \W              \P{IsWord}
	356
	357	Perl respects the POSIX standard in that POSIX character classes are
	358	only supported within a character class. The POSIX character classes
	359	[.cc.] and [=cc=] are recognized but B<not> supported and trying to
	360	use them will cause an error.
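
A short sketch of the POSIX class syntax in use may help here. It
reuses the C<[01[:alpha:]%]> class from above together with a negated
class; the test strings and variable names are invented for the
example, and the default (non-locale) rules are assumed.

```perl
use strict;
use warnings;

my @candidates = ("abc", "0a1%", "a2", "x-y");

# [01[:alpha:]%] matches "0", "1", any alphabetic character, or "%".
my @accepted = grep { /\A[01[:alpha:]%]+\z/ } @candidates;

# [[:^digit:]] is the negated form, equivalent to \D.
my $non_digits = join '', "a1b22c" =~ /[[:^digit:]]/g;

print "accepted: @accepted\n";          # abc 0a1%
print "non-digits: $non_digits\n";      # abc
```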
	361
	362	Perl defines the following zero-width assertions:
	363	X<zero-width assertion> X<assertion> X<regex, zero-width assertion>
	364	X<regexp, zero-width assertion>
	365	X<regular expression, zero-width assertion>
	366	X<\b> X<\B> X<\A> X<\Z> X<\z> X<\G>
	367
	368	    \b  Match a word boundary
	369	    \B  Match a non-(word boundary)
	370	    \A  Match only at beginning of string
	371	    \Z  Match only at end of string, or before newline at the end
	372	    \z  Match only at end of string
	373	    \G  Match only at pos() (e.g. at the end-of-match position
	374	        of prior m//g)
	375
	376	A word boundary (C<\b>) is a spot between two characters
	377	that has a C<\w> on one side of it and a C<\W> on the other side
	378	of it (in either order), counting the imaginary characters off the
	379	beginning and end of the string as matching a C<\W>. (Within
	380	character classes C<\b> represents backspace rather than a word
	381	boundary, just as it normally does in any double-quoted string.)
	382	The C<\A> and C<\Z> are just like "^" and "$", except that they
	383	won't match multiple times when the C</m> modifier is used, while
	384	"^" and "$" will match at every internal line boundary. To match
	385	the actual end of the string and not ignore an optional trailing
	386	newline, use C<\z>.
	387	X<\b> X<\A> X<\Z> X<\z> X</m>
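
The differences between these anchors show up most clearly on a string
that ends in a newline. The following sketch, with a made-up sample
string and variable names, compares them:

```perl
use strict;
use warnings;

my $s = "one\ntwo\n";

# Under /m, "^" and "$" match at every line boundary ...
my @lines = $s =~ /^(\w+)$/mg;

# ... whereas \A still matches only at the very start of the string.
my @start = $s =~ /\A(\w+)/mg;

# \Z tolerates a trailing newline; \z does not.
my $with_Z  = $s =~ /two\Z/   ? 1 : 0;
my $with_z  = $s =~ /two\z/   ? 1 : 0;
my $with_nz = $s =~ /two\n\z/ ? 1 : 0;

print "lines: @lines\n";                            # one two
print "start: @start\n";                            # one
print "Z=$with_Z z=$with_z nz=$with_nz\n";          # Z=1 z=0 nz=1
```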
	388
	389	The C<\G> assertion can be used to chain global matches (using
	390	C<m//g>), as described in L<perlop/"Regexp Quote-Like Operators">.
	391	It is also useful when writing C<lex>-like scanners, when you have
	392	several patterns that you want to match against consequent substrings
	393	of your string, see the previous reference. The actual location
	394	where C<\G> will match can also be influenced by using C<pos()> as
	395	an lvalue: see L<perlfunc/pos>. Currently C<\G> is only fully
	396	supported when anchored to the start of the pattern; while it
	397	is permitted to use it elsewhere, as in C</(?<=\G..)./g>, some
	398	such uses (C</.\G/g>, for example) currently cause problems, and
	399	it is recommended that you avoid such usage for now.
	400	X<\G>
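
A C<lex>-like scanner along these lines might be sketched as follows.
The token classes and names are hypothetical, and the sketch relies on
the C</c> modifier (documented in L<perlop>), which keeps C<pos()>
unchanged when a match fails so that the next alternative can be tried
at the same position.

```perl
use strict;
use warnings;

my $input = "x1 = 42 + y";
my @tokens;

pos($input) = 0;                        # start scanning at the beginning
while (pos($input) < length $input) {
    if    ($input =~ /\G\s+/gc)            { next }   # skip whitespace
    elsif ($input =~ /\G([A-Za-z_]\w*)/gc) { push @tokens, "IDENT($1)" }
    elsif ($input =~ /\G(\d+)/gc)          { push @tokens, "NUM($1)" }
    elsif ($input =~ /\G([=+])/gc)         { push @tokens, "OP($1)" }
    else  { die "unexpected character at offset " . pos($input) . "\n" }
}

# prints: IDENT(x1) OP(=) NUM(42) OP(+) IDENT(y)
print join(' ', @tokens), "\n";
```

Every pattern is anchored with C<\G> at its start, which is the usage
described above as fully supported.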
	401
	402	The bracketing construct C<( ... )> creates capture buffers. To
	403	refer to the digit'th buffer use \<digit> within the
	404	match. Outside the match use "$" instead of "\". (The
	405	\<digit> notation works in certain circumstances outside
	406	the match. See the warning below about \1 vs $1 for details.)
	407	Referring back to another part of the match is called a
	408	I<backreference>.
	409	X<regex, capture buffer> X<regexp, capture buffer>
	410	X<regular expression, capture buffer> X<backreference>
	411
	412	There is no limit to the number of captured substrings that you may
	413	use. However Perl also uses \10, \11, etc. as aliases for \010,
	414	\011, etc. (Recall that 0 means octal, so \011 is the character at
	415	number 9 in your coded character set; which would be the 10th character,
	416	a horizontal tab under ASCII.) Perl resolves this
	417	ambiguity by interpreting \10 as a backreference only if at least 10
	418	left parentheses have opened before it. Likewise \11 is a
	419	backreference only if at least 11 left parentheses have opened
	420	before it. And so on. \1 through \9 are always interpreted as
	421	backreferences.
	422
	423	Examples:
	424
	425	    s/^([^ ]*) *([^ ]*)/$2 $1/;     # swap first two words
	426
	427	    if (/(.)\1/) {                  # find first doubled char
	428	        print "'$1' is the first doubled character\n";
	429	    }
	430
	431	    if (/Time: (..):(..):(..)/) {   # parse out values
	432	        $hours = $1;
	433	        $minutes = $2;
	434	        $seconds = $3;
	435	    }
	436
	437	Several special variables also refer back to portions of the previous
	438	match. C<$+> returns whatever the last bracket match matched.
	439	C<$&> returns the entire matched string. (At one point C<$0> did
	440	also, but now it returns the name of the program.) C<$`> returns
	441	everything before the matched string. C<$'> returns everything
	442	after the matched string. And C<$^N> contains whatever was matched by
	443	the most-recently closed group (submatch). C<$^N> can be used in
	444	extended patterns (see below), for example to assign a submatch to a
	445	variable.
	446	X<$+> X<$^N> X<$&> X<$`> X<$'>
	447
	448	The numbered match variables ($1, $2, $3, etc.) and the related punctuation
	449	set (C<$+>, C<$&>, C<$`>, C<$'>, and C<$^N>) are all dynamically scoped
	450	until the end of the enclosing block or until the next successful
	451	match, whichever comes first. (See L<perlsyn/"Compound Statements">.)
	452	X<$+> X<$^N> X<$&> X<$`> X<$'>
	453	X<$1> X<$2> X<$3> X<$4> X<$5> X<$6> X<$7> X<$8> X<$9>
	454
	455
	456	B<NOTE>: failed matches in Perl do not reset the match variables,
	457	which makes it easier to write code that tests for a series of more
	458	specific cases and remembers the best match.
	459
	460	B<WARNING>: Once Perl sees that you need one of C<$&>, C<$`>, or
	461	C<$'> anywhere in the program, it has to provide them for every
	462	pattern match. This may substantially slow your program. Perl
	463	uses the same mechanism to produce $1, $2, etc, so you also pay a
	464	price for each pattern that contains capturing parentheses. (To
	465	avoid this cost while retaining the grouping behaviour, use the
	466	extended regular expression C<(?: ... )> instead.) But if you never
	467	use C<$&>, C<$`> or C<$'>, then patterns I<without> capturing
	468	parentheses will not be penalized. So avoid C<$&>, C<$'>, and C<$`>
	469	if you can, but if you can't (and some algorithms really appreciate
	470	them), once you've used them once, use them at will, because you've
	471	already paid the price. As of 5.005, C<$&> is not so costly as the
	472	other two.
	473	X<$&> X<$`> X<$'>
	474
	475	Backslashed metacharacters in Perl are alphanumeric, such as C<\b>,
	476	C<\w>, C<\n>. Unlike some other regular expression languages, there
	477	are no backslashed symbols that aren't alphanumeric. So anything
	478	that looks like \\, \(, \), \<, \>, \{, or \} is always
	479	interpreted as a literal character, not a metacharacter. This was
	480	once used in a common idiom to disable or quote the special meanings
	481	of regular expression metacharacters in a string that you want to
	482	use for a pattern. Simply quote all non-"word" characters:
	483
	484	    $pattern =~ s/(\W)/\\$1/g;
	485
	486	(If C<use locale> is set, then this depends on the current locale.)
	487	Today it is more common to use the quotemeta() function or the C<\Q>
	488	metaquoting escape sequence to disable all metacharacters' special
	489	meanings like this:
	490
	491	    /$unquoted\Q$quoted\E$unquoted/
	492
	493	Beware that if you put literal backslashes (those not inside
	494	interpolated variables) between C<\Q> and C<\E>, double-quotish
	495	backslash interpolation may lead to confusing results. If you
	496	I<need> to use literal backslashes within C<\Q...\E>,
	497	consult L<perlop/"Gory details of parsing quoted constructs">.
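
To make the effect of metaquoting concrete, here is a small sketch
comparing an interpolated string with and without C<\Q>. The sample
strings and variable names are invented for the example.

```perl
use strict;
use warnings;

my $literal = "a.b*c";      # text that should be matched literally

# Unquoted, "." and "*" keep their special meanings.
my $loose = "aXbbc" =~ /\A$literal\z/     ? 1 : 0;

# With \Q...\E they are matched as ordinary characters.
my $tight = "aXbbc" =~ /\A\Q$literal\E\z/ ? 1 : 0;
my $exact = "a.b*c" =~ /\A\Q$literal\E\z/ ? 1 : 0;

# quotemeta() produces the same quoting as a plain string.
my $quoted = quotemeta $literal;

print "loose=$loose tight=$tight exact=$exact\n";   # loose=1 tight=0 exact=1
print "quotemeta: $quoted\n";                       # a\.b\*c
```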
	498
	499	=head2 Extended Patterns
	500
	501	Perl also defines a consistent extension syntax for features not
	502	found in standard tools like B<awk> and B<lex>. The syntax is a
	503	pair of parentheses with a question mark as the first thing within
	504	the parentheses. The character after the question mark indicates
	505	the extension.
	506
	507	The stability of these extensions varies widely. Some have been
	508	part of the core language for many years. Others are experimental
	509	and may change without warning or be completely removed. Check
	510	the documentation on an individual feature to verify its current
	511	status.
	512
	513	A question mark was chosen for this and for the minimal-matching
	514	construct because 1) question marks are rare in older regular
	515	expressions, and 2) whenever you see one, you should stop and
	516	"question" exactly what is going on. That's psychology...
	517
	518	=over 10
	519
	520	=item C<(?#text)>
	521	X<(?#)>
	522
	523	A comment. The text is ignored. If the C</x> modifier enables
	524	whitespace formatting, a simple C<#> will suffice. Note that Perl closes
	525	the comment as soon as it sees a C<)>, so there is no way to put a literal
	526	C<)> in the comment.
	527
	528	=item C<(?imsx-imsx)>
	529	X<(?)>
	530
	531	One or more embedded pattern-match modifiers, to be turned on (or
	532	turned off, if preceded by C<->) for the remainder of the pattern or
	533	the remainder of the enclosing pattern group (if any). This is
	534	particularly useful for dynamic patterns, such as those read in from a
	535	configuration file, taken from an argument, or specified in a table
	536	somewhere. Consider the case where some patterns want to be case
	537	sensitive and some do not. The case-insensitive ones merely need to
	538	include C<(?i)> at the front of the pattern. For example:
	539
	540	    $pattern = "foobar";
	541	    if ( /$pattern/i ) { }
	542
	543	    # more flexible:
	544
	545	    $pattern = "(?i)foobar";
	546	    if ( /$pattern/ ) { }
	547
	548	These modifiers are restored at the end of the enclosing group. For example,
	549
	550	    ( (?i) blah ) \s+ \1
	551
	552	will match a repeated (I<including the case>!) word C<blah> in any
	553	case, assuming C<x> modifier, and no C<i> modifier outside this
	554	group.
	555
	556	=item C<(?:pattern)>
	557	X<(?:)>
	558
	559	=item C<(?imsx-imsx:pattern)>
	560
	561	This is for clustering, not capturing; it groups subexpressions like
	562	"()", but doesn't make backreferences as "()" does. So
	563
	564	    @fields = split(/\b(?:a|b|c)\b/)
	565
	566	is like
	567
	568	    @fields = split(/\b(a|b|c)\b/)
	569
	570	but doesn't spit out extra fields. It's also cheaper not to capture
	571	characters if you don't need to.
	572
	573	Any letters between C<?> and C<:> act as flags modifiers as with
	574	C<(?imsx-imsx)>. For example,
	575
	576	    /(?s-i:more.than).million/i
	577
	578	is equivalent to the more verbose
	579
	580	    /(?:(?s-i)more.than).million/i
	581
	582	=item C<(?=pattern)>
	583	X<(?=)> X<look-ahead, positive> X<lookahead, positive>
	584
	585	A zero-width positive look-ahead assertion. For example, C</\w+(?=\t)/>
	586	matches a word followed by a tab, without including the tab in C<$&>.
	587
	588	=item C<(?!pattern)>
	589	X<(?!)> X<look-ahead, negative> X<lookahead, negative>
	590
	591	A zero-width negative look-ahead assertion. For example C</foo(?!bar)/>
	592	matches any occurrence of "foo" that isn't followed by "bar". Note
	593	however that look-ahead and look-behind are NOT the same thing. You cannot
	594	use this for look-behind.
	595
	596	If you are looking for a "bar" that isn't preceded by a "foo", C</(?!foo)bar/>
	597	will not do what you want. That's because the C<(?!foo)> is just saying that
	598	the next thing cannot be "foo"--and it's not, it's a "bar", so "foobar" will
	599	match. You would have to do something like C</(?!foo)...bar/> for that. We
	600	say "like" because there's the case of your "bar" not having three characters
	601	before it. You could cover that this way: C</(?:(?!foo)...|^.{0,2})bar/>.
	602	Sometimes it's still easier just to say:
	603
	604	    if (/bar/ && $` !~ /foo$/)
	605
	606	For look-behind see below.
	607
	608	=item C<(?<=pattern)>
	609	X<(?<=)> X<look-behind, positive> X<lookbehind, positive>
	610
	611	A zero-width positive look-behind assertion. For example, C</(?<=\t)\w+/>
	612	matches a word that follows a tab, without including the tab in C<$&>.
	613	Works only for fixed-width look-behind.
	614
	615	=item C<(?<!pattern)>
	616	X<(?<!)> X<look-behind, negative> X<lookbehind, negative>
	617
	618	A zero-width negative look-behind assertion. For example C</(?<!bar)foo/>
	619	matches any occurrence of "foo" that does not follow "bar". Works
	620	only for fixed-width look-behind.
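
The four look-around assertions above can be compared in a short
sketch. The test strings and variable names are made up; captures are
used instead of C<$&> to avoid the performance cost described earlier.

```perl
use strict;
use warnings;

my @strings = ("foobar", "bazbar", "bar");

# Look-ahead checks what FOLLOWS the current position, so (?!foo)bar
# accepts every "bar", including the one in "foobar".
my @ahead = grep { /(?!foo)bar/ } @strings;

# Look-behind checks what PRECEDES the current position.
my @behind = grep { /(?<!foo)bar/ } @strings;

# Positive forms: the tab is required but is not part of the capture.
my ($before_tab) = "key\tvalue" =~ /(\w+)(?=\t)/;
my ($after_tab)  = "key\tvalue" =~ /(?<=\t)(\w+)/;

print "ahead:  @ahead\n";               # foobar bazbar bar
print "behind: @behind\n";              # bazbar bar
print "$before_tab / $after_tab\n";     # key / value
```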
	621
	622	=item C<(?{ code })>
	623	X<(?{})> X<regex, code in> X<regexp, code in> X<regular expression, code in>
	624
	625	B<WARNING>: This extended regular expression feature is considered
	626	highly experimental, and may be changed or deleted without notice.
	627
	628	This zero-width assertion evaluates any embedded Perl code. It
	629	always succeeds, and its C<code> is not interpolated. Currently,
	630	the rules to determine where the C<code> ends are somewhat convoluted.
	631
	632	This feature can be used together with the special variable C<$^N> to
	633	capture the results of submatches in variables without having to keep
	634	track of the number of nested parentheses. For example:
	635
	636	    $_ = "The brown fox jumps over the lazy dog";
	637	    /the (\S+)(?{ $color = $^N }) (\S+)(?{ $animal = $^N })/i;
	638	    print "color = $color, animal = $animal\n";
	639
	640	Inside the C<(?{...})> block, C<$_> refers to the string the regular
	641	expression is matching against. You can also use C<pos()> to know what is
	642	the current position of matching within this string.
	643
	644	The C<code> is properly scoped in the following sense: If the assertion
	645	is backtracked (compare L<"Backtracking">), all changes introduced after
	646	C<local>ization are undone, so that
	647
	648	    $_ = 'a' x 8;
	649	    m<
	650	       (?{ $cnt = 0 })                  # Initialize $cnt.
	651	       (
	652	         a
	653	         (?{
	654	             local $cnt = $cnt + 1;     # Update $cnt, backtracking-safe.
	655	         })
	656	       )*
	657	       aaaa
	658	       (?{ $res = $cnt })               # On success copy to non-localized
	659	                                        # location.
	660	     >x;
	661
	662	will set C<$res = 4>. Note that after the match, $cnt returns to the globally
	663	introduced value, because the scopes that restrict C<local> operators
	664	are unwound.
	665
	666	This assertion may be used as a C<(?(condition)yes-pattern|no-pattern)>
	667	switch. If I<not> used in this way, the result of evaluation of
	668	C<code> is put into the special variable C<$^R>. This happens
	669	immediately, so C<$^R> can be used from other C<(?{ code })> assertions
	670	inside the same regular expression.
	671
	672	The assignment to C<$^R> above is properly localized, so the old
	673	value of C<$^R> is restored if the assertion is backtracked; compare
	674	L<"Backtracking">.
	675
	676	For reasons of security, this construct is forbidden if the regular
	677	expression involves run-time interpolation of variables, unless the
	678	perilous C<use re 'eval'> pragma has been used (see L<re>), or the
	679	variables contain results of C<qr//> operator (see
	680	L<perlop/"qr/STRING/imosx">).
	681
	682	This restriction is because of the wide-spread and remarkably convenient
	683	custom of using run-time determined strings as patterns. For example:
	684
	685	    $re = <>;
	686	    chomp $re;
	687	    $string =~ /$re/;
	688
	689	Before Perl knew how to execute interpolated code within a pattern,
	690	this operation was completely safe from a security point of view,
	691	although it could raise an exception from an illegal pattern. If
	692	you turn on the C<use re 'eval'>, though, it is no longer secure,
	693	so you should only do so if you are also using taint checking.
	694	Better yet, use the carefully constrained evaluation within a Safe
	695	compartment. See L<perlsec> for details about both these mechanisms.
	696
	697	=item C<(??{ code })>
	698	X<(??{})>
	699	X<regex, postponed> X<regexp, postponed> X<regular expression, postponed>
	700	X<regex, recursive> X<regexp, recursive> X<regular expression, recursive>
	701
	702	B<WARNING>: This extended regular expression feature is considered
	703	highly experimental, and may be changed or deleted without notice.
	704	A simplified version of the syntax may be introduced for commonly
	705	used idioms.
	706
	707	This is a "postponed" regular subexpression. The C<code> is evaluated
	708	at run time, at the moment this subexpression may match. The result
	709	of evaluation is considered as a regular expression and matched as
	710	if it were inserted instead of this construct.
	711
	712	The C<code> is not interpolated. As before, the rules to determine
	713	where the C<code> ends are currently somewhat convoluted.
	714
	715	The following pattern matches a parenthesized group:
	716
	717	    $re = qr{
	718	               \(
	719	               (?:
	720	                  (?> [^()]+ )    # Non-parens without backtracking
	721	                |
	722	                  (??{ $re })     # Group with matching parens
	723	               )*
	724	               \)
	725	            }x;
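
A possible way of using such a pattern is sketched below. The pattern
is the same as above; the up-front C<my $re> declaration (needed under
C<use strict> so that the code block can refer to the variable), the
test strings, and the other variable names are assumptions of this
example.

```perl
use strict;
use warnings;

my $re;                     # declared first so the pattern can refer to itself
$re = qr{
    \(
    (?:
        (?> [^()]+ )        # Non-parens without backtracking
      |
        (??{ $re })         # Group with matching parens
    )*
    \)
}x;

# Extract the first balanced group from a longer string.
my ($group) = "f(a(b)c)(d" =~ /($re)/;

# Anchored at both ends, an unbalanced string is rejected.
my $balanced   = "(a(b)c)" =~ /\A$re\z/ ? 1 : 0;
my $unbalanced = "((a)"    =~ /\A$re\z/ ? 1 : 0;

print "group: $group\n";                                # (a(b)c)
print "balanced=$balanced unbalanced=$unbalanced\n";    # balanced=1 unbalanced=0
```

Because C<$re> holds the result of C<qr//>, interpolating it into the
later matches does not require C<use re 'eval'>.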
	726
	727	=item C<< (?>pattern) >>
	728	X<backtrack> X<backtracking>
	729
	730	B<WARNING>: This extended regular expression feature is considered
	731	highly experimental, and may be changed or deleted without notice.
	732
	733	An "independent" subexpression, one which matches the substring
	734	that a I<standalone> C<pattern> would match if anchored at the given
	735	position, and it matches I<nothing other than this substring>. This
	736	construct is useful for optimizations of what would otherwise be
	737	"eternal" matches, because it will not backtrack (see L<"Backtracking">).
	738	It may also be useful in places where the "grab all you can, and do not
	739	give anything back" semantic is desirable.
	740
	741	For example: C<< ^(?>a*)ab >> will never match, since C<< (?>a*) >>
	742	(anchored at the beginning of string, as above) will match I<all>
	743	characters C<a> at the beginning of string, leaving no C<a> for
	744	C<ab> to match. In contrast, C<a*ab> will match the same as C<a+b>,
	745	since the match of the subgroup C<a*> is influenced by the following
	746	group C<ab> (see L<"Backtracking">). In particular, C<a*> inside
	747	C<a*ab> will match fewer characters than a standalone C<a*>, since
	748	this makes the tail match.
	748	this makes the tail match.
	749
	750	An effect similar to C<< (?>pattern) >> may be achieved by writing
	751	C<(?=(pattern))\1>. This matches the same substring as a standalone
	752	C<a+>, and the following C<\1> eats the matched string; it therefore
	753	makes a zero-length assertion into an analogue of C<< (?>...) >>.
	754	(The difference between these two constructs is that the second one
	755	uses a capturing group, thus shifting ordinals of backreferences
	756	in the rest of a regular expression.)
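
The behaviour of an independent subexpression, and of its look-ahead
emulation, can be checked with a few one-line matches. This is just a
sketch; the sample string and variable names are invented.

```perl
use strict;
use warnings;

my $s = "aaab";

# Ordinary quantifier: a* gives back one "a" so that "ab" can match.
my $plain  = $s =~ /^a*ab/         ? 1 : 0;

# Independent group: (?>a*) keeps all three "a"s, so "ab" cannot match.
my $atomic = $s =~ /^(?>a*)ab/     ? 1 : 0;

# Look-ahead emulation: same behaviour, at the cost of a capture group.
my $emul   = $s =~ /^(?=(a*))\1ab/ ? 1 : 0;

# prints: plain=1 atomic=0 emulated=0
print "plain=$plain atomic=$atomic emulated=$emul\n";
```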
	757
	758	Consider this pattern:
	759
	760	    m{ \(
	761	          (
	762	            [^()]+              # x+
	763	          |
	764	            \( [^()]* \)
	765	          )+
	766	       \)
	767	     }x
	768
	769	That will efficiently match a nonempty group with matching parentheses
	770	two levels deep or less. However, if there is no such group, it
	771	will take virtually forever on a long string. That's because there
	772	are so many different ways to split a long string into several
	773	substrings. This is what C<(.+)+> is doing, and C<(.+)+> is similar
	774	to a subpattern of the above pattern. Consider how the pattern
	775	above detects no-match on C<((()aaaaaaaaaaaaaaaaaa> in several
	776	seconds, but that each extra letter doubles this time. This
	777	exponential performance will make it appear that your program has
	778	hung. However, a tiny change to this pattern
	779
	780	    m{ \(
	781	          (
	782	            (?> [^()]+ )        # change x+ above to (?> x+ )
	783	          |
	784	            \( [^()]* \)
	785	          )+
	786	       \)
	787	     }x
	788
	789	which uses C<< (?>...) >> matches exactly when the one above does (verifying
	790	this yourself would be a productive exercise), but finishes in a fourth
	791	of the time when used on a similar string with 1000000 C<a>s. Be aware,
	792	however, that this pattern currently triggers a warning message under
	793	the C<use warnings> pragma or B<-w> switch saying it
	794	C<"matches null string many times in regex">.
	795
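The suggested exercise can be started with a small script. The following is
a sketch only: the sample strings and the helper C<whole_match> are
hypothetical, and agreement on a handful of short inputs is of course not a
proof that the two patterns are equivalent.

```perl
use strict;
use warnings;

my $plain = qr{ \(
                  (
                    [^()]+
                  |
                    \( [^()]* \)
                  )+
                \)
              }x;

my $atomic = qr{ \(
                   (
                     (?> [^()]+ )
                   |
                     \( [^()]* \)
                   )+
                 \)
               }x;

# Hypothetical helper: the text matched by the whole pattern, or undef.
sub whole_match {
    my ($re, $s) = @_;
    return $s =~ $re ? substr($s, $-[0], $+[0] - $-[0]) : undef;
}

# Short samples, so that the backtracking version cannot run for long.
my @samples = ('(a)', '(a(b)c)', '((a)(b))', '()', '(((a)))',
               '((()aaaa', 'abc', '(a)(b)');

my $disagreements = 0;
for my $s (@samples) {
    my $got_plain  = whole_match($plain,  $s);
    my $got_atomic = whole_match($atomic, $s);
    my $same = ($got_plain // '<no match>') eq ($got_atomic // '<no match>');
    $disagreements++ unless $same;
    printf "%-10s %s\n", $s, $got_plain // '<no match>';
}
print "disagreements: $disagreements\n";
```
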
	796	On simple groups, such as the pattern C<< (?> [^()]+ ) >>, a comparable
	797	effect may be achieved by negative look-ahead, as in C<[^()]+ (?! [^()] )>.
	798	This was only 4 times slower on a string with 1000000 C<a>s.
	799
	800	The "grab all you can, and do not give anything back" semantic is desirable
	801	in many situations where at first sight a simple C<()*> looks like
	802	the correct solution. Suppose we parse text with comments being delimited
	803	by C<#> followed by some optional (horizontal) whitespace. Contrary to
	804	its appearance, C<#[ \t]*> I<is not> the correct subexpression to match
	805	the comment delimiter, because it may "give up" some whitespace if
	806	the remainder of the pattern can be made to match that way. The correct
	807	answer is either one of these:
	808
	809	(?>#[ \t]*)
	810	#[ \t]*(?![ \t])
	811
	812	For example, to grab non-empty comments into $1, one should use either
	813	one of these:
	814
	815	/ (?> \# [ \t]* ) ( .+ ) /x;
	816	/ \# [ \t]* ( [^ \t] .* ) /x;
	817
	818	Which one you pick depends on which of these expressions better reflects
	819	the above specification of comments.
	820
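To see the whitespace being "given up", consider a comment that consists of
nothing but blanks. This is a minimal sketch; the sample lines are
assumptions made for illustration.

```perl
use strict;
use warnings;

# A hypothetical line whose comment consists only of whitespace.
my $line = "code #   ";

# Plain greedy [ \t]* hands back one blank so that (.+) can succeed,
# so a "comment" consisting of a single space is captured.
my ($naive) = $line =~ /#[ \t]*(.+)/;

# The independent group refuses to give the whitespace back: no match.
my ($atomic) = $line =~ /(?>#[ \t]*)(.+)/;

# The look-ahead form rejects this line as well.
my ($lookahead) = $line =~ /#[ \t]*(?![ \t])(.+)/;

# A line with a real comment is captured identically by all three.
my $real = "code # note";
my @real;
for my $re (qr/#[ \t]*(.+)/, qr/(?>#[ \t]*)(.+)/, qr/#[ \t]*(?![ \t])(.+)/) {
    push @real, $real =~ $re ? $1 : undef;
}

print "naive captured <$naive>\n";                      # a single space
print "atomic matched: ", defined $atomic ? 1 : 0, "\n";       # 0
print "look-ahead matched: ", defined $lookahead ? 1 : 0, "\n"; # 0
```
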
	821	=item C<(?(condition)yes-pattern|no-pattern)>
	822	X<(?()>
	823
	824	=item C<(?(condition)yes-pattern)>
	825
	826	B<WARNING>: This extended regular expression feature is considered
	827	highly experimental, and may be changed or deleted without notice.
	828
	829	Conditional expression. C<(condition)> should be either an integer in
	830	parentheses (which is valid if the corresponding pair of parentheses
	831	matched), or a look-ahead/look-behind/evaluate zero-width assertion.
	832
	833	For example:
	834
	835	m{ ( \( )?
	836	[^()]+
	837	(?(1) \) )
	838	}x
	839
	840	matches a chunk of non-parentheses, possibly included in parentheses
	841	themselves.
	842
	843	=back
	844
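The conditional example above can be tried out as follows. This is a sketch
under one assumption that is not in the original pattern: C<^> and C<$>
anchors have been added so that a partial match cannot hide the effect of
the condition.

```perl
use strict;
use warnings;

# Anchored variant of the conditional pattern; the anchors are an
# addition made for this illustration.
my $re = qr{ ^ ( \( )?
               [^()]+
               (?(1) \) )
             $ }x;

my %result;
for my $s ('abc', '(abc)', '(abc', 'abc)') {
    $result{$s} = $s =~ $re ? 1 : 0;
}

# A closing parenthesis is required exactly when an opening one matched.
printf "%-6s %d\n", $_, $result{$_} for 'abc', '(abc)', '(abc', 'abc)';
```
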
	845	=head2 Backtracking
	846	X<backtrack> X<backtracking>
	847
	848	NOTE: This section presents an abstract approximation of regular
	849	expression behavior. For a more rigorous (and complicated) view of
	850	the rules involved in selecting a match among possible alternatives,
	851	see L<Combining pieces together>.
	852
	853	A fundamental feature of regular expression matching involves the
	854	notion called I<backtracking>, which is currently used (when needed)
	855	by all regular expression quantifiers, namely C<*>, C<*?>, C<+>,
	856	C<+?>, C<{n,m}>, and C<{n,m}?>. Backtracking is often optimized
	857	internally, but the general principle outlined here is valid.
	858
	859	For a regular expression to match, the I<entire> regular expression must
	860	match, not just part of it. So if the beginning of a pattern containing a
	861	quantifier succeeds in a way that causes later parts in the pattern to
	862	fail, the matching engine backs up and recalculates the beginning
	863	part--that's why it's called backtracking.
	864
	865	Here is an example of backtracking: Let's say you want to find the
	866	word following "foo" in the string "Food is on the foo table.":
	867
	868	$_ = "Food is on the foo table.";
	869	if ( /\b(foo)\s+(\w+)/i ) {
	870	print "$2 follows $1.\n";
	871	}
	872
	873	When the match runs, the first part of the regular expression (C<\b(foo)>)
	874	finds a possible match right at the beginning of the string, and loads up
	875	$1 with "Foo". However, as soon as the matching engine sees that there's
	876	no whitespace following the "Foo" that it had saved in $1, it realizes its
	877	mistake and starts over again one character after where it had the
	878	tentative match. This time it goes all the way until the next occurrence
	879	of "foo". The complete regular expression matches this time, and you get
	880	the expected output of "table follows foo."
	881
	882	Sometimes minimal matching can help a lot. Imagine you'd like to match
	883	everything between "foo" and "bar". Initially, you write something
	884	like this:
	885
	886	$_ = "The food is under the bar in the barn.";
	887	if ( /foo(.*)bar/ ) {
	888	print "got <$1>\n";
	889	}
	890
	891	Which perhaps unexpectedly yields:
	892
	893	got <d is under the bar in the >
	894
	895	That's because C<.*> was greedy, so you get everything between the
	896	I<first> "foo" and the I<last> "bar". Here it's more effective
	897	to use minimal matching to make sure you get the text between a "foo"
	898	and the first "bar" thereafter.
	899
	900	if ( /foo(.*?)bar/ ) { print "got <$1>\n" }
	901	got <d is under the >
	902
	903	Here's another example: let's say you'd like to match a number at the end
	904	of a string, and you also want to keep the preceding part of the match.
	905	So you write this:
	906
	907	    $_ = "I have 2 numbers: 53147";
	908	    if ( /(.*)(\d*)/ ) {                                # Wrong!
	909	        print "Beginning is <$1>, number is <$2>.\n";
	910	    }
	911
	912	That won't work at all, because C<.*> was greedy and gobbled up the
	913	whole string. As C<\d*> can match on an empty string the complete
	914	regular expression matched successfully.
	915
	916	Beginning is <I have 2 numbers: 53147>, number is <>.
	917
	918	Here are some variants, most of which don't work:
	919
	920	    $_ = "I have 2 numbers: 53147";
	921	    @pats = qw{
	922	        (.*)(\d*)
	923	        (.*)(\d+)
	924	        (.*?)(\d*)
	925	        (.*?)(\d+)
	926	        (.*)(\d+)$
	927	        (.*?)(\d+)$
	928	        (.*)\b(\d+)$
	929	        (.*\D)(\d+)$
	930	    };
	931	
	932	    for $pat (@pats) {
	933	        printf "%-12s ", $pat;
	934	        if ( /$pat/ ) {
	935	            print "<$1> <$2>\n";
	936	        } else {
	937	            print "FAIL\n";
	938	        }
	939	    }
	940
	941	That will print out:
	942
	943	    (.*)(\d*)    <I have 2 numbers: 53147> <>
	944	    (.*)(\d+)    <I have 2 numbers: 5314> <7>
	945	    (.*?)(\d*)   <> <>
	946	    (.*?)(\d+)   <I have > <2>
	947	    (.*)(\d+)$   <I have 2 numbers: 5314> <7>
	948	    (.*?)(\d+)$  <I have 2 numbers: > <53147>
	949	    (.*)\b(\d+)$ <I have 2 numbers: > <53147>
	950	    (.*\D)(\d+)$ <I have 2 numbers: > <53147>
	951
	952	As you see, this can be a bit tricky. It's important to realize that a
	953	regular expression is merely a set of assertions that gives a definition
	954	of success. There may be 0, 1, or several different ways that the
	955	definition might succeed against a particular string. And if there are
	956	multiple ways it might succeed, you need to understand backtracking to
	957	know which variety of success you will achieve.
	958
	959	When using look-ahead assertions and negations, this can all get even
	960	trickier. Imagine you'd like to find a sequence of non-digits not
	961	followed by "123". You might try to write that as
	962
	963	$_ = "ABC123";
	964	if ( /^\D*(?!123)/ ) { # Wrong!
	965	print "Yup, no 123 in $_\n";
	966	}
	967
	968	But that isn't going to match; at least, not the way you're hoping. It
	969	claims that there is no 123 in the string. Here's a clearer picture of
	970	why that pattern matches, contrary to popular expectations:
	971
	972	$x = 'ABC123';
	973	$y = 'ABC445';
	974
	975	print "1: got $1\n" if $x =~ /^(ABC)(?!123)/;
	976	print "2: got $1\n" if $y =~ /^(ABC)(?!123)/;
	977
	978	print "3: got $1\n" if $x =~ /^(\D*)(?!123)/;
	979	print "4: got $1\n" if $y =~ /^(\D*)(?!123)/;
	980
	981	This prints
	982
	983	2: got ABC
	984	3: got AB
	985	4: got ABC
	986
	987	You might have expected test 3 to fail because it seems to be a more
	988	general purpose version of test 1. The important difference between
	989	them is that test 3 contains a quantifier (C<\D*>) and so can use
	990	backtracking, whereas test 1 will not. What's happening is
	991	that you've asked "Is it true that at the start of $x, following 0 or more
	992	non-digits, you have something that's not 123?" If the pattern matcher had
	993	let C<\D*> expand to "ABC", this would have caused the whole pattern to
	994	fail.
	995
	996	The search engine will initially match C<\D*> with "ABC". Then it will
	997	try to match C<(?!123)> with "123", which fails. But because
	998	a quantifier (C<\D*>) has been used in the regular expression, the
	999	search engine can backtrack and retry the match differently
	1000	in the hope of matching the complete regular expression.
	1001
	1002	The pattern really, I<really> wants to succeed, so it uses the
	1003	standard pattern back-off-and-retry and lets C<\D*> expand to just "AB" this
	1004	time. Now there's indeed something following "AB" that is not
	1005	"123". It's "C123", which suffices.
	1006
	1007	We can deal with this by using both an assertion and a negation.
	1008	We'll say that the first part in $1 must be followed both by a digit
	1009	and by something that's not "123". Remember that the look-aheads
	1010	are zero-width expressions--they only look, but don't consume any
	1011	of the string in their match. So rewriting this way produces what
	1012	you'd expect; that is, case 5 will fail, but case 6 succeeds:
	1013
	1014	print "5: got $1\n" if $x =~ /^(\D*)(?=\d)(?!123)/;
	1015	print "6: got $1\n" if $y =~ /^(\D*)(?=\d)(?!123)/;
	1016
	1017	6: got ABC
	1018
	1019	In other words, the two zero-width assertions next to each other work as though
	1020	they're ANDed together, just as you'd use any built-in assertions: C</^$/>
	1021	matches only if you're at the beginning of the line AND the end of the
	1022	line simultaneously. The deeper underlying truth is that juxtaposition in
	1023	regular expressions always means AND, except when you write an explicit OR
	1024	using the vertical bar. C</ab/> means match "a" AND (then) match "b",
	1025	although the attempted matches are made at different positions because "a"
	1026	is not a zero-width assertion, but a one-width assertion.
	1027
	1028	B<WARNING>: particularly complicated regular expressions can take
	1029	exponential time to solve because of the immense number of possible
	1030	ways they can use backtracking to try for a match. For example, without
	1031	internal optimizations done by the regular expression engine, this will
	1032	take a painfully long time to run:
	1033
	1034	'aaaaaaaaaaaa' =~ /((a{0,5}){0,5})*[c]/
	1035
	1036	And if you used C<*>'s in the internal groups instead of limiting them
	1037	to 0 through 5 matches, then it would take forever--or until you ran
	1038	out of stack space. Moreover, these internal optimizations are not
	1039	always applicable. For example, if you put C<{0,5}> instead of C<*>
	1040	on the external group, no current optimization is applicable, and the
	1041	match takes a long time to finish.
	1042
	1043	A powerful tool for optimizing such beasts is what is known as an
	1044	"independent group",
	1045	which does not backtrack (see L<C<< (?>pattern) >>>). Note also that
	1046	zero-length look-ahead/look-behind assertions will not backtrack to make
	1047	the tail match, since they are in "logical" context: only
	1048	whether they match is considered relevant. For an example
	1049	where side-effects of look-ahead I<might> have influenced the
	1050	following match, see L<C<< (?>pattern) >>>.
	1051
	1052	=head2 Version 8 Regular Expressions
	1053	X<regular expression, version 8> X<regex, version 8> X<regexp, version 8>
	1054
	1055	In case you're not familiar with the "regular" Version 8 regex
	1056	routines, here are the pattern-matching rules not described above.
	1057
	1058	Any single character matches itself, unless it is a I<metacharacter>
	1059	with a special meaning described here or above. You can cause
	1060	characters that normally function as metacharacters to be interpreted
	1061	literally by prefixing them with a "\" (e.g., "\." matches a ".", not any
	1062	character; "\\" matches a "\"). A series of characters matches that
	1063	series of characters in the target string, so the pattern C<blurfl>
	1064	would match "blurfl" in the target string.
	1065
	1066	You can specify a character class, by enclosing a list of characters
	1067	in C<[]>, which will match any one character from the list. If the
	1068	first character after the "[" is "^", the class matches any character not
	1069	in the list. Within a list, the "-" character specifies a
	1070	range, so that C<a-z> represents all characters between "a" and "z",
	1071	inclusive. If you want either "-" or "]" itself to be a member of a
	1072	class, put it at the start of the list (possibly after a "^"), or
	1073	escape it with a backslash. "-" is also taken literally when it is
	1074	at the end of the list, just before the closing "]". (The
	1075	following all specify the same class of three characters: C<[-az]>,
	1076	C<[az-]>, and C<[a\-z]>. All are different from C<[a-z]>, which
	1077	specifies a class containing twenty-six characters, even on EBCDIC
	1078	based coded character sets.) Also, if you try to use the character
	1079	classes C<\w>, C<\W>, C<\s>, C<\S>, C<\d>, or C<\D> as endpoints of
	1080	a range, that's not a range, the "-" is understood literally.
	1081
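The placement rules for "-" can be checked with a few probe characters. This
is a minimal sketch; the probe list and variable names are assumptions made
for illustration.

```perl
use strict;
use warnings;

# "m" lies inside the range a-z but is not one of the three literal
# characters "-", "a" and "z".
my @probe = ('-', 'a', 'z', 'm');

my %accepted;
for my $class ('[-az]', '[az-]', '[a\-z]', '[a-z]') {
    # Collect the probe characters that the class matches.
    $accepted{$class} = join '', grep { /\A$class\z/ } @probe;
}

printf "%-7s accepts: %s\n", $_, $accepted{$_}
    for '[-az]', '[az-]', '[a\-z]', '[a-z]';
```

The first three classes accept C<->, C<a> and C<z> only, while C<[a-z]>
accepts C<a>, C<z> and C<m> but not the hyphen.
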
	1082	Note also that the whole range idea is rather unportable between
	1083	character sets--and even within character sets ranges may cause results
	1084	you probably didn't expect. A sound principle is to use only ranges
	1085	that begin from and end at either alphabets of equal case ([a-e],
	1086	[A-E]), or digits ([0-9]). Anything else is unsafe. If in doubt,
	1087	spell out the character sets in full.
	1088
	1089	Characters may be specified using a metacharacter syntax much like that
	1090	used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,
	1091	"\f" a form feed, etc. More generally, \I<nnn>, where I<nnn> is a string
	1092	of octal digits, matches the character whose coded character set value
	1093	is I<nnn>. Similarly, \xI<nn>, where I<nn> are hexadecimal digits,
	1094	matches the character whose numeric value is I<nn>. The expression \cI<x>
	1095	matches the character control-I<x>. Finally, the "." metacharacter
	1096	matches any character except "\n" (unless you use C</s>).
	1097
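The different spellings of one character can be compared side by side. The
sketch below assumes an ASCII-based platform, where a horizontal tab has
code point 9; the variable names are illustrative.

```perl
use strict;
use warnings;

my $tab = "\t";   # horizontal tab; code point 9 on ASCII platforms (assumed)

my $octal   = $tab =~ /\A\011\z/ ? 1 : 0;   # \nnn with octal digits
my $hex     = $tab =~ /\A\x09\z/ ? 1 : 0;   # \xnn with hexadecimal digits
my $control = $tab =~ /\A\cI\z/  ? 1 : 0;   # control-I is also a tab

# "." does not match a newline unless /s is in effect.
my $dot_plain = "\n" =~ /\A.\z/  ? 1 : 0;
my $dot_s     = "\n" =~ /\A.\z/s ? 1 : 0;

print "octal=$octal hex=$hex control=$control\n";   # all 1
print "dot_plain=$dot_plain dot_s=$dot_s\n";        # 0 and 1
```
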
	1098	You can specify a series of alternatives for a pattern using "|" to
	1099	separate them, so that C<fee|fie|foe> will match any of "fee", "fie",
	1100	or "foe" in the target string (as would C<f(e|i|o)e>). The
	1101	first alternative includes everything from the last pattern delimiter
	1102	("(", "[", or the beginning of the pattern) up to the first "|", and
	1103	the last alternative contains everything from the last "|" to the next
	1104	pattern delimiter. That's why it's common practice to include
	1105	alternatives in parentheses: to minimize confusion about where they
	1106	start and end.
	1107
	1108	Alternatives are tried from left to right, so the first
	1109	alternative found for which the entire expression matches, is the one that
	1110	is chosen. This means that alternatives are not necessarily greedy. For
	1111	example: when matching C<foo|foot> against "barefoot", only the "foo"
	1112	part will match, as that is the first alternative tried, and it successfully
	1113	matches the target string. (This might not seem important, but it is
	1114	important when you are capturing matched text using parentheses.)
	1115
	1116	Also remember that "|" is interpreted as a literal within square brackets,
	1117	so if you write C<[fee|fie|foe]> you're really only matching C<[feio|]>.
	1118
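The left-to-right rule and the bracket pitfall can both be seen in a short
script. This is a sketch only; the variable names are assumptions made for
illustration.

```perl
use strict;
use warnings;

# The first alternative that lets the whole pattern succeed wins, even
# though a longer alternative would also have matched.
my ($first) = "barefoot" =~ /(foo|foot)/;

# Reordering the alternatives changes what is captured.
my ($longer) = "barefoot" =~ /(foot|foo)/;

# When the rest of the pattern demands it, the next alternative is tried.
my ($forced) = "barefoot" =~ /(foo|foot)\z/;

# Inside square brackets "|" is just another member of the class.
my $pipe_in_class = "|" =~ /\A[fee|fie|foe]\z/ ? 1 : 0;

print "first=<$first> longer=<$longer> forced=<$forced>\n";
print "pipe_in_class=$pipe_in_class\n";   # 1
```
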
	1119	Within a pattern, you may designate subpatterns for later reference
	1120	by enclosing them in parentheses, and you may refer back to the
	1121	I<n>th subpattern later in the pattern using the metacharacter
	1122	\I<n>. Subpatterns are numbered based on the left to right order
	1123	of their opening parenthesis. A backreference matches whatever
	1124	actually matched the subpattern in the string being examined, not
	1125	the rules for that subpattern. Therefore, C<(0|0x)\d*\s\1\d*> will
	1126	match "0x1234 0x4321", but not "0x1234 01234", because subpattern
	1127	1 matched "0x", even though the rule C<0|0x> could potentially match
	1128	the leading 0 in the second number.
	1129
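Both points, the numbering of subpatterns and the fact that a backreference
repeats the matched text rather than the rule, can be checked directly. This
is a minimal sketch; the string C<abc> and the variable names are
assumptions made for illustration.

```perl
use strict;
use warnings;

# Groups are numbered by the position of their opening parenthesis.
my @groups = "abc" =~ /((a)(b))(c)/;   # ("ab", "a", "b", "c")

# \1 must repeat the text that group 1 actually matched ("0x").
my $re = qr/(0|0x)\d*\s\1\d*/;
my $same_prefix = "0x1234 0x4321" =~ $re ? 1 : 0;
my $mixed       = "0x1234 01234"  =~ $re ? 1 : 0;

print "groups: @groups\n";
print "same_prefix=$same_prefix mixed=$mixed\n";   # 1 and 0
```
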
	1130	=head2 Warning on \1 vs $1
	1131
	1132	Some people get too used to writing things like:
	1133
	1134	$pattern =~ s/(\W)/\\\1/g;
	1135
	1136	This is grandfathered for the RHS of a substitute to avoid shocking the
	1137	B<sed> addicts, but it's a dirty habit to get into. That's because in
	1138	PerlThink, the righthand side of an C<s///> is a double-quoted string. C<\1> in
	1139	the usual double-quoted string means a control-A. The customary Unix
	1140	meaning of C<\1> is kludged in for C<s///>. However, if you get into the habit
	1141	of doing that, you get yourself into trouble if you then add an C</e>
	1142	modifier.
	1143
	1144	s/(\d+)/ \1 + 1 /eg; # causes warning under -w
	1145
	1146	Or if you try to do
	1147
	1148	s/(\d+)/\1000/;
	1149
	1150	You can't disambiguate that by saying C<\{1}000>, whereas you can fix it with
	1151	C<${1}000>. The operation of interpolation should not be confused
	1152	with the operation of matching a backreference. Certainly they mean two
	1153	different things on the I<left> side of the C<s///>.
	1154
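The C<$1> spelling handles both troublesome cases. The following sketch uses
a made-up input string and illustrative variable names.

```perl
use strict;
use warnings;

my $text = "5 apples";   # illustrative input

# ${1}000 : the braces end the variable name, so "000" is appended to $1.
(my $appended = $text) =~ s/(\d+)/${1}000/;

# With /e the right-hand side is Perl code, where $1 is the natural spelling.
(my $incremented = $text) =~ s/(\d+)/ $1 + 1 /e;

print "$appended\n";      # 5000 apples
print "$incremented\n";   # 6 apples
```
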
	1155	=head2 Repeated patterns matching zero-length substring
	1156
	1157	B<WARNING>: Difficult material (and prose) ahead. This section needs a rewrite.
	1158
	1159	Regular expressions provide a terse and powerful programming language. As
	1160	with most other power tools, power comes together with the ability
	1161	to wreak havoc.
	1162
	1163	A common abuse of this power stems from the ability to make infinite
	1164	loops using regular expressions, with something as innocuous as:
	1165
	1166	'foo' =~ m{ ( o? )* }x;
	1167
	1168	The C<o?> can match at the beginning of C<'foo'>, and since the position
	1169	in the string is not moved by the match, C<o?> would match again and again
	1170	because of the C<*> quantifier. Another common way to create a similar cycle
	1171	is with the looping modifier C<//g>:
	1172
	1173	@matches = ( 'foo' =~ m{ o? }xg );
	1174
	1175	or
	1176
	1177	print "match: <$&>\n" while 'foo' =~ m{ o? }xg;
	1178
	1179	or the loop implied by split().
	1180
	1181	However, long experience has shown that many programming tasks may
	1182	be significantly simplified by using repeated subexpressions that
	1183	may match zero-length substrings. Here's a simple example:
	1184
	1185	@chars = split //, $string; # // is not magic in split
	1186	($whitewashed = $string) =~ s/()/ /g; # parens avoid magic s// /
	1187
	1188	Thus Perl allows such constructs, by I<forcefully breaking
	1189	the infinite loop>. The rules for this are different for lower-level
	1190	loops given by the greedy quantifiers C<*+{}>, and for higher-level
	1191	ones like the C</g> modifier or split() operator.
	1192
	1193	The lower-level loops are I<interrupted> (that is, the loop is
	1194	broken) when Perl detects that a repeated expression matched a
	1195	zero-length substring. Thus
	1196
	1197	   m{ (?: NON_ZERO_LENGTH | ZERO_LENGTH )* }x;
	1198	
	1199	is made equivalent to
	1200	
	1201	   m{   (?: NON_ZERO_LENGTH )*
	1202	      |
	1203	        (?: ZERO_LENGTH )?
	1204	    }x;
	1205
	1206	The higher-level loops preserve an additional state between iterations:
	1207	whether the last match was zero-length. To break the loop, the following
	1208	match after a zero-length match is prohibited to have a length of zero.
	1209	This prohibition interacts with backtracking (see L<"Backtracking">),
	1210	and so the I<second best> match is chosen if the I<best> match is of
	1211	zero length.
	1212
	1213	For example:
	1214
	1215	$_ = 'bar';
	1216	s/\w??/<$&>/g;
	1217
	1218	results in C<< <><b><><a><><r><> >>. At each position of the string the best
	1219	match given by non-greedy C<??> is the zero-length match, and the I<second
	1220	best> match is what is matched by C<\w>. Thus zero-length matches
	1221	alternate with one-character-long matches.
	1222
	1223	Similarly, for repeated C<m/()/g> the second-best match is the match at the
	1224	position one notch further in the string.
	1225
	1226	The additional state of being I<matched with zero-length> is associated with
	1227	the matched string, and is reset by each assignment to pos().
	1228	Zero-length matches at the end of the previous match are ignored
	1229	during C<split>.
	1230
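That these loops really do terminate can be confirmed with a short script.
This is a sketch only; it reuses the strings from this section, and the
variable names are assumptions.

```perl
use strict;
use warnings;

# The substitution from the text: empty and one-character matches alternate.
(my $marked = 'bar') =~ s/\w??/<$&>/g;

# m//g in list context stops on its own even though o? can match nothing.
my @matches = ( 'foo' =~ m{ o? }xg );

# split with an empty pattern yields the individual characters.
my @chars = split //, 'bar';

print "$marked\n";                            # <><b><><a><><r><>
print scalar(@matches), " matches\n";         # an empty match, two "o", an empty match
print join(',', @chars), "\n";                # b,a,r
```
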
	1231	=head2 Combining pieces together
	1232
	1233	Each of the elementary pieces of regular expressions which were described
	1234	before (such as C<ab> or C<\Z>) could match at most one substring
	1235	at the given position of the input string. However, in a typical regular
	1236	expression these elementary pieces are combined into more complicated
	1237	patterns using combining operators C<ST>, C<S|T>, C<S*> etc
	1238	(in these examples C<S> and C<T> are regular subexpressions).
	1239
	1240	Such combinations can include alternatives, leading to a problem of choice:
	1241	if we match a regular expression C<a|ab> against C<"abc">, will it match
	1242	substring C<"a"> or C<"ab">? One way to describe which substring is
	1243	actually matched is the concept of backtracking (see L<"Backtracking">).
	1244	However, this description is too low-level and makes you think
	1245	in terms of a particular implementation.
	1246
	1247	Another description starts with notions of "better"/"worse". All the
	1248	substrings which may be matched by the given regular expression can be
	1249	sorted from the "best" match to the "worst" match, and it is the "best"
	1250	match which is chosen. This substitutes the question of "what is chosen?"
	1251	by the question of "which matches are better, and which are worse?".
	1252
	1253	Again, for elementary pieces there is no such question, since at most
	1254	one match at a given position is possible. This section describes the
	1255	notion of better/worse for combining operators. In the description
	1256	below C<S> and C<T> are regular subexpressions.
	1257
	1258	=over 4
	1259
	1260	=item C<ST>
	1261
	1262	Consider two possible matches, C<AB> and C<A'B'>, where C<A> and C<A'> are
	1263	substrings which can be matched by C<S>, and C<B> and C<B'> are substrings
	1264	which can be matched by C<T>.
	1265	
	1266	If C<A> is a better match for C<S> than C<A'>, C<AB> is a better
	1267	match than C<A'B'>.
	1268	
	1269	If C<A> and C<A'> coincide: C<AB> is a better match than C<AB'> if
	1270	C<B> is a better match for C<T> than C<B'>.
	1271
	1272	=item C<S|T>
	1273
	1274	When C<S> can match, it is a better match than when only C<T> can match.
	1275
	1276	Ordering of two matches for C<S> is the same as for C<S>. Similar for
	1277	two matches for C<T>.
	1278
	1279	=item C<S{REPEAT_COUNT}>
	1280
	1281	Matches as C<SSS...S> (repeated as many times as necessary).
	1282
	1283	=item C<S{min,max}>
	1284
	1285	Matches as C<S{max}|S{max-1}|...|S{min+1}|S{min}>.
	1286
	1287	=item C<S{min,max}?>
	1288
	1289	Matches as C<S{min}|S{min+1}|...|S{max-1}|S{max}>.
	1290
	1291	=item C<S?>, C<S*>, C<S+>
	1292
	1293	Same as C<S{0,1}>, C<S{0,BIG_NUMBER}>, C<S{1,BIG_NUMBER}> respectively.
	1294
	1295	=item C<S??>, C<S*?>, C<S+?>
	1296
	1297	Same as C<S{0,1}?>, C<S{0,BIG_NUMBER}?>, C<S{1,BIG_NUMBER}?> respectively.
	1298
	1299	=item C<< (?>S) >>
	1300
	1301	Matches the best match for C<S> and only that.
	1302
	1303	=item C<(?=S)>, C<(?<=S)>
	1304
	1305	Only the best match for C<S> is considered. (This is important only if
	1306	C<S> has capturing parentheses, and backreferences are used somewhere
	1307	else in the whole regular expression.)
	1308
	1309	=item C<(?!S)>, C<(?<!S)>
	1310
	1311	For this grouping operator there is no need to describe the ordering, since
	1312	only whether or not C<S> can match is important.
	1313
	1314	=item C<(??{ EXPR })>
	1315
	1316	The ordering is the same as for the regular expression which is
	1317	the result of EXPR.
	1318
	1319	=item C<(?(condition)yes-pattern|no-pattern)>
	1320
	1321	Recall that which of C<yes-pattern> or C<no-pattern> actually matches is
	1322	already determined. The ordering of the matches is the same as for the
	1323	chosen subexpression.
	1324
	1325	=back
	1326
	1327	The above recipes describe the ordering of matches I<at a given position>.
	1328	One more rule is needed to understand how a match is determined for the
	1329	whole regular expression: a match at an earlier position is always better
	1330	than a match at a later position.
	1331
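A few of the ordering rules above can be observed by asking which substring
is actually chosen. This is a minimal sketch; the helper C<matched> and the
sample strings are hypothetical and only illustrate the rules.

```perl
use strict;
use warnings;

# Hypothetical helper: the text matched by the whole pattern, or undef.
sub matched {
    my ($string, $re) = @_;
    return $string =~ $re ? substr($string, $-[0], $+[0] - $-[0]) : undef;
}

my %got = (
    'a|ab'      => matched('abc',  qr/a|ab/),       # S|T: S preferred, "a"
    'ab|a'      => matched('abc',  qr/ab|a/),       # now "ab" comes first
    'a{2,3}'    => matched('aaaa', qr/a{2,3}/),     # tries max first, "aaa"
    'a{2,3}?'   => matched('aaaa', qr/a{2,3}?/),    # tries min first, "aa"
    '(?:a|ab)c' => matched('abc',  qr/(?:a|ab)c/),  # only "ab" lets c match
    'b|ab'      => matched('ab',   qr/b|ab/),       # earlier position wins
);

printf "%-10s <%s>\n", $_, $got{$_} for sort keys %got;
```

The last entry shows the final rule: although C<b> is the first alternative,
the match C<ab> starts earlier in the string and is therefore chosen.
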
	1332	=head2 Creating custom RE engines
	1333
	1334	Overloaded constants (see L<overload>) provide a simple way to extend
	1335	the functionality of the RE engine.
	1336
	1337	Suppose that we want to enable a new RE escape-sequence C<\Y|> which
	1338	matches at a boundary between whitespace characters and non-whitespace
	1339	characters. Note that C<(?=\S)(?<!\S)|(?!\S)(?<=\S)> matches exactly
	1340	at these positions, so we want to have each C<\Y|> in the place of the
	1341	more complicated version. We can create a module C<customre> to do
	1342	this:
	1343
	1344	    package customre;
	1345	    use overload;
	1346	
	1347	    sub import {
	1348	      shift;
	1349	      die "No argument to customre::import allowed" if @_;
	1350	      overload::constant 'qr' => \&convert;
	1351	    }
	1352	
	1353	    sub invalid { die "/$_[0]/: invalid escape '\\$_[1]'"}
	1354	
	1355	    # We must also take care of not escaping the legitimate \\Y|
	1356	    # sequence, hence the presence of '\\' in the conversion rules.
	1357	    my %rules = ( '\\' => '\\\\',
	1358	                  'Y|' => qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/ );
	1359	    sub convert {
	1360	      my $re = shift;
	1361	      $re =~ s{
	1362	                \\ ( \\ | Y . )
	1363	              }
	1364	              { $rules{$1} or invalid($re,$1) }sgex;
	1365	      return $re;
	1366	    }
	1367
	1368	Now C<use customre> enables the new escape in constant regular
	1369	expressions, i.e., those without any runtime variable interpolations.
	1370	As documented in L<overload>, this conversion will work only over
	1371	literal parts of regular expressions. For C<\Y|$re\Y|> the variable
	1372	part of this regular expression needs to be converted explicitly
	1373	(but only if the special meaning of C<\Y|> should be enabled inside $re):
	1374
	1375	    use customre;
	1376	    $re = <>;
	1377	    chomp $re;
	1378	    $re = customre::convert $re;
	1379	    /\Y|$re\Y|/;
	1380
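The explicit conversion step can be tried without installing a module. The
sketch below is an assumption-laden illustration: it copies the conversion
rules into a stand-alone subroutine with the hypothetical name
C<expand_boundary>, does not use C<overload::constant> at all, and applies
the conversion to a pattern held in a string.

```perl
use strict;
use warnings;

# Stand-alone copy of the conversion rules, so that this sketch does not
# depend on a customre.pm file. The name expand_boundary is hypothetical.
my %rules = ( '\\' => '\\\\',
              'Y|' => qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/ );

sub expand_boundary {
    my $re = shift;
    $re =~ s{ \\ ( \\ | Y . ) }
            { $rules{$1} or die "/$re/: invalid escape '\\$1'" }sgex;
    return $re;
}

my $source = '\Y|foo\Y|';               # as it might be read from input
my $re     = expand_boundary($source);

my $separate = "a foo b" =~ /$re/ ? 1 : 0;   # "foo" stands between blanks
my $embedded = "afoob"   =~ /$re/ ? 1 : 0;   # no whitespace boundary here

print "separate=$separate embedded=$embedded\n";   # 1 and 0
```
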
	1381	=head1 BUGS
	1382
	1383	This document varies from difficult to understand to completely
	1384	and utterly opaque. The wandering prose riddled with jargon is
	1385	hard to fathom in several places.
	1386
	1387	This document needs a rewrite that separates the tutorial content
	1388	from the reference content.
	1389
	1390	=head1 SEE ALSO
	1391
	1392	L<perlrequick>.
	1393
	1394	L<perlretut>.
	1395
	1396	L<perlop/"Regexp Quote-Like Operators">.
	1397
	1398	L<perlop/"Gory details of parsing quoted constructs">.
	1399
	1400	L<perlfaq6>.
	1401
	1402	L<perlfunc/pos>.
	1403
	1404	L<perllocale>.
	1405
	1406	L<perlebcdic>.
	1407
	1408	I<Mastering Regular Expressions> by Jeffrey Friedl, published
	1409	by O'Reilly and Associates.