This is patch.2b1f to perl5.002beta1.
=head1 NAME

perlre - Perl regular expressions

=head1 DESCRIPTION

For a description of how to use regular expressions in matching
operations, see C<m//> and C<s///> in L<perlop>. The matching operations can
have various modifiers, some of which relate to the interpretation of
the regular expression inside. These are:

    i   Do case-insensitive pattern matching.
    m   Treat string as multiple lines.
    s   Treat string as single line.
    x   Use extended regular expressions.

These are usually written as "the C</x> modifier", even though the delimiter
in question might not actually be a slash. In fact, any of these
modifiers may also be embedded within the regular expression itself using
the new C<(?...)> construct. See below.

The C</x> modifier itself needs a little more explanation. It tells
the regular expression parser to ignore whitespace that is not
backslashed or within a character class. You can use this to break up
your regular expression into (slightly) more readable parts. The C<#>
character is also treated as a metacharacter introducing a comment,
just as in ordinary Perl code. Taken together, these features go a
long way towards making Perl 5 a readable language. See the C comment
deletion code in L<perlop>.

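As a minimal sketch of the C</x> modifier, the following breaks a pattern across several lines and comments each part. The input string and variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical input line; the field layout is an assumption for illustration.
my $line = "Time: 12:34:56";

# Under /x, unescaped whitespace in the pattern is ignored and "#" starts a comment.
my ($hours, $minutes, $seconds) = $line =~ m/
    Time: \s*      # literal label, then optional whitespace
    (\d\d) :       # hours
    (\d\d) :       # minutes
    (\d\d)         # seconds
/x;

print "$hours h $minutes m $seconds s\n";
```

Note that a comment inside the pattern must not contain the pattern delimiter, since the delimiter still ends the pattern.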
=head2 Regular Expressions

The patterns used in pattern matching are regular expressions such as
those supplied in the Version 8 regexp routines. (In fact, the
routines are derived (distantly) from Henry Spencer's freely
redistributable reimplementation of the V8 routines.)
See L<Version 8 Regular Expressions> for details.

In particular the following metacharacters have their standard I<egrep>-ish
meanings:

    \   Quote the next metacharacter
    ^   Match the beginning of the line
    .   Match any character (except newline)
    $   Match the end of the line
    |   Alternation
    ()  Grouping
    []  Character class

By default, the "^" character is guaranteed to match only at the
beginning of the string, the "$" character only at the end (or before the
newline at the end) and Perl does certain optimizations with the
assumption that the string contains only one line. Embedded newlines
will not be matched by "^" or "$". You may, however, wish to treat a
string as a multi-line buffer, such that the "^" will match after any
newline within the string, and "$" will match before any newline. At the
cost of a little more overhead, you can do this by using the /m modifier
on the pattern match operator. (Older programs did this by setting C<$*>,
but this practice is deprecated in Perl 5.)

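The difference the C</m> modifier makes can be sketched as follows. The sample text and variable names are assumptions for illustration only.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\nthird line\n";

# Without /m, "^" matches only at the very start of the string.
my @plain = $text =~ /^(\w+)/g;

# With /m, "^" also matches just after each embedded newline.
my @multi = $text =~ /^(\w+)/mg;

print "without /m: @plain\n";
print "with /m:    @multi\n";
```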
To facilitate multi-line substitutions, the "." character never matches a
newline unless you use the C</s> modifier, which tells Perl to pretend
the string is a single line--even if it isn't. The C</s> modifier also
overrides the setting of C<$*>, in case you have some (badly behaved) older
code that sets it in another module.

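A short sketch of how C</s> changes what "." matches, using a made-up three-line string:

```perl
use strict;
use warnings;

my $text = "BEGIN\nbody\nEND";

# "." stops at newlines by default, so ".*" cannot span the lines.
my $without_s = $text =~ /BEGIN.*END/  ? 1 : 0;

# With /s, "." matches newlines too, so the whole string is matched.
my $with_s    = $text =~ /BEGIN.*END/s ? 1 : 0;

print "without /s: $without_s, with /s: $with_s\n";
```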
The following standard quantifiers are recognized:

    *      Match 0 or more times
    +      Match 1 or more times
    ?      Match 1 or 0 times
    {n}    Match exactly n times
    {n,}   Match at least n times
    {n,m}  Match at least n but not more than m times

(If a curly bracket occurs in any other context, it is treated
as a regular character.) The "*" quantifier is equivalent to C<{0,}>, the "+"
quantifier to C<{1,}>, and the "?" quantifier to C<{0,1}>. n and m are limited
to integral values less than 65536.

By default, a quantified subpattern is "greedy": it will match as
many times as possible (given a particular starting location) without
causing the rest of the pattern to fail. All of the standard quantifiers
above behave this way. If you want a subpattern to match the minimum
number of times possible instead, follow its quantifier with a "?".
Note that the meanings don't change, just the "greediness":

    *?      Match 0 or more times
    +?      Match 1 or more times
    ??      Match 0 or 1 time
    {n}?    Match exactly n times
    {n,}?   Match at least n times
    {n,m}?  Match at least n but not more than m times

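To make the distinction concrete, here is a small sketch comparing a greedy quantifier with its minimal counterpart on the same string. The markup-like input is only an illustrative assumption.

```perl
use strict;
use warnings;

my $text = "<b>bold</b> and <i>italic</i>";

# Greedy: ".+" grabs as much as it can, backing off only to the LAST ">".
my ($greedy)  = $text =~ /<(.+)>/;

# Minimal: ".+?" grabs as little as it can, stopping at the FIRST ">".
my ($minimal) = $text =~ /<(.+?)>/;

print "greedy:  $greedy\n";
print "minimal: $minimal\n";
```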
Since patterns are processed as double quoted strings, the following
also work:

    \t      tab
    \n      newline
    \r      return
    \f      form feed
    \v      vertical tab, whatever that is
    \a      alarm (bell)
    \e      escape
    \033    octal char
    \x1b    hex char
    \c[     control char
    \l      lowercase next char
    \u      uppercase next char
    \L      lowercase till \E
    \U      uppercase till \E
    \E      end case modification
    \Q      quote regexp metacharacters till \E

In addition, Perl defines the following:

    \w  Match a "word" character (alphanumeric plus "_")
    \W  Match a non-word character
    \s  Match a whitespace character
    \S  Match a non-whitespace character
    \d  Match a digit character
    \D  Match a non-digit character

Note that C<\w> matches a single alphanumeric character, not a whole
word. To match a word you'd need to say C<\w+>. You may use C<\w>, C<\W>, C<\s>,
C<\S>, C<\d> and C<\D> within character classes (though not as either end of a
range).

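The following sketch shows C<\w+> matching whole words, and C<\w> used inside a character class. The input string is an assumption for illustration.

```perl
use strict;
use warnings;

my $text = "order_id=42, total=19.99";

# \w+ matches runs of word characters; "." and "=" end each run.
my @words  = $text =~ /(\w+)/g;

# \w may appear inside a character class; here "." is added to the set.
my @tokens = $text =~ /([\w.]+)/g;

print join("|", @words), "\n";
print join("|", @tokens), "\n";
```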
Perl defines the following zero-width assertions:

    \b  Match a word boundary
    \B  Match a non-(word boundary)
    \A  Match only at beginning of string
    \Z  Match only at end of string
    \G  Match only where previous m//g left off

A word boundary (C<\b>) is defined as a spot between two characters that
has a C<\w> on one side of it and a C<\W> on the other side of it (in
either order), counting the imaginary characters off the beginning and
end of the string as matching a C<\W>. (Within character classes C<\b>
represents backspace rather than a word boundary.) The C<\A> and C<\Z> are
just like "^" and "$" except that they won't match multiple times when the
C</m> modifier is used, while "^" and "$" will match at every internal line
boundary.

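As a rough illustration of these assertions, the sketch below counts matches of C<\b>, of "^" under C</m>, and of C<\A> under C</m> in one multi-line string. The text and names are assumptions.

```perl
use strict;
use warnings;

my $text = "cat nap\nbobcat and cat\ncat's cradle";

# \b requires a word/non-word transition on each side, so "bobcat" is skipped.
my $whole_words  = () = $text =~ /\bcat\b/g;

# Under /m, "^" matches at the start of every line.
my $line_starts  = () = $text =~ /^cat/mg;

# \A matches only at the start of the string, even under /m.
my $string_start = () = $text =~ /\Acat/mg;

print "$whole_words $line_starts $string_start\n";
```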
When the bracketing construct C<( ... )> is used, \<digit> matches the
digit'th substring. Outside of the pattern, always use "$" instead of
"\" in front of the digit. (The \<digit> notation sometimes works outside
the current pattern, but should not be relied upon.) The scope of
$<digit> (and C<$`>, C<$&>, and C<$'>) extends to the end of the enclosing
BLOCK or eval string, or to the next successful pattern match, whichever
comes first. If you want to use parentheses to delimit a subpattern
(e.g. a set of alternatives) without saving it as a subpattern, follow
the ( with a ?: (see the C<(?:regexp)> extension below).

You may have as many parentheses as you wish. If you have more
than 9 substrings, the variables $10, $11, ... refer to the
corresponding substring. Within the pattern, \10, \11, etc. refer back
to substrings if there have been at least that many left parens before
the backreference. Otherwise (for backward compatibility) \10 is the
same as \010, a backspace, and \11 the same as \011, a tab. And so
on. (\1 through \9 are always backreferences.)

C<$+> returns whatever the last bracket match matched. C<$&> returns the
entire matched string. ($0 used to return the same thing, but not any
more.) C<$`> returns everything before the matched string. C<$'> returns
everything after the matched string. Examples:

    s/^([^ ]*) *([^ ]*)/$2 $1/;     # swap first two words

    if (/Time: (..):(..):(..)/) {
        $hours = $1;
        $minutes = $2;
        $seconds = $3;
    }

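A further sketch, under the assumption of a sample sentence containing a doubled word, shows a backreference inside the pattern together with the match variables afterwards:

```perl
use strict;
use warnings;

my $text = "this is is a test";
my ($before, $matched, $after, $word);

# \1 inside the pattern must repeat exactly what the first group matched.
if ($text =~ /\b(\w+) \1\b/) {
    # Copy the match variables right away; the next successful match resets them.
    ($before, $matched, $after, $word) = ($`, $&, $', $1);
    print "doubled word '$word' found after '$before'\n";
}
```

Copying the variables immediately is a defensive habit here, since their scope ends at the next successful pattern match.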
You will note that all backslashed metacharacters in Perl are
alphanumeric, such as C<\b>, C<\w>, C<\n>. Unlike some other regular expression
languages, there are no backslashed symbols that aren't alphanumeric.
So anything that looks like \\, \(, \), \<, \>, \{, or \} is always
interpreted as a literal character, not a metacharacter. This makes it
simple to quote a string that you want to use for a pattern but that
you are afraid might contain metacharacters. Simply quote all the
non-alphanumeric characters:

    $pattern =~ s/(\W)/\\$1/g;

You can also use the built-in quotemeta() function to do this.
An even easier way to quote metacharacters right in the match operator
is to say

    /$unquoted\Q$quoted\E$unquoted/

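The effect of quoting can be sketched like this. The literal string and the test strings are assumptions for illustration.

```perl
use strict;
use warnings;

# A string we want to match literally, although it contains metacharacters.
my $literal = "3.14 (approx)";

# quotemeta() backslashes every non-word character.
my $quoted = quotemeta($literal);

# Unquoted, "." matches any character and "(...)" becomes a group,
# so the pattern matches text it was never meant to match.
my $unquoted_hit = "3x14 approx" =~ /$literal/ ? 1 : 0;

# With \Q...\E the same variable matches only the literal text.
my $quoted_miss = "3x14 approx"          =~ /\Q$literal\E/ ? 1 : 0;
my $quoted_hit  = "pi is 3.14 (approx)." =~ /\Q$literal\E/ ? 1 : 0;

print "$quoted\n";
print "$unquoted_hit $quoted_miss $quoted_hit\n";
```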
Perl 5 defines a consistent extension syntax for regular expressions.
The syntax is a pair of parens with a question mark as the first thing
within the parens (this was a syntax error in Perl 4). The character
after the question mark gives the function of the extension. Several
extensions are already supported:

=over 10

=item (?#text)

A comment. The text is ignored.

=item (?:regexp)

This groups things like "()" but doesn't make backreferences like "()" does. So

    split(/\b(?:a|b|c)\b/)

is like

    split(/\b(a|b|c)\b/)

but doesn't spit out extra fields.

=item (?=regexp)

A zero-width positive lookahead assertion. For example, C</\w+(?=\t)/>
matches a word followed by a tab, without including the tab in C<$&>.

=item (?!regexp)

A zero-width negative lookahead assertion. For example C</foo(?!bar)/>
matches any occurrence of "foo" that isn't followed by "bar". Note
however that lookahead and lookbehind are NOT the same thing. You cannot
use this for lookbehind: C</(?!foo)bar/> will not find an occurrence of
"bar" that is preceded by something which is not "foo". That's because
the C<(?!foo)> is just saying that the next thing cannot be "foo"--and
it's not, it's a "bar", so "foobar" will match. You would have to do
something like C</(?!foo)...bar/> for that. We say "like" because there's
the case of your "bar" not having three characters before it. You could
cover that this way: C</(?:(?!foo)...|^.{0,2})bar/>. Sometimes it's still
easier just to say:

    if (/bar/ && $` !~ /foo$/)

=item (?imsx)

One or more embedded pattern-match modifiers. This is particularly
useful for patterns that are specified in a table somewhere, some of
which want to be case sensitive, and some of which don't. The case
insensitive ones merely need to include C<(?i)> at the front of the
pattern. For example:

    $pattern = "foobar";
    if ( /$pattern/i )

    # more flexible:

    $pattern = "(?i)foobar";
    if ( /$pattern/ )

=back

The specific choice of question mark for this and the new minimal
matching construct was because 1) question mark is pretty rare in older
regular expressions, and 2) whenever you see one, you should stop
and "question" exactly what is going on. That's psychology...

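The extensions listed above can be exercised with a short sketch. All strings, word lists, and variable names below are assumptions made up for illustration, and the C<^.{0,2}> alternative is just one way to cover a "bar" with fewer than three characters in front of it.

```perl
use strict;
use warnings;

# (?:...) groups without capturing, so split() returns no separator fields.
my @capturing     = split(/\s*(,|;)\s*/,   "a, b; c");
my @non_capturing = split(/\s*(?:,|;)\s*/, "a, b; c");

# (?=...) checks what follows without including it in the match.
my ($word) = "key\tvalue" =~ /(\w+)(?=\t)/;

# (?!...) rejects a "foo" that is immediately followed by "bar".
my @not_foobar = grep { /foo(?!bar)/ } qw(foobar foobaz food);

# A "bar" that is not preceded by "foo"; ^.{0,2} covers a "bar"
# with fewer than three characters before it.
my @bar_no_foo = grep { /(?:(?!foo)...|^.{0,2})bar/ } qw(foobar bazbar bar xbar);

# (?i) lets each pattern in a table choose its own case sensitivity.
my @patterns = ("(?i)foobar", "FooBar");
my @results  = map { "FOOBAR" =~ /$_/ ? 1 : 0 } @patterns;

print "@capturing\n@non_capturing\n$word\n";
print "@not_foobar\n@bar_no_foo\n@results\n";
```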
=head2 Version 8 Regular Expressions

In case you're not familiar with the "regular" Version 8 regexp
routines, here are the pattern-matching rules not described above.

Any single character matches itself, unless it is a I<metacharacter>
with a special meaning described here or above. You can cause
characters which normally function as metacharacters to be interpreted
literally by prefixing them with a "\" (e.g. "\." matches a ".", not any
character; "\\" matches a "\"). A series of characters matches that
series of characters in the target string, so the pattern C<blurfl>
would match "blurfl" in the target string.

You can specify a character class, by enclosing a list of characters
in C<[]>, which will match any one of the characters in the list. If the
first character after the "[" is "^", the class matches any character not
in the list. Within a list, the "-" character is used to specify a
range, so that C<a-z> represents all the characters between "a" and "z",
inclusive.

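A brief sketch of character classes, ranges, and negation, using an assumed sample string:

```perl
use strict;
use warnings;

my $text = "Room 42B, floor 3";

# A range inside a class: any single digit.
my @digits = $text =~ /[0-9]/g;

# A second range: any single uppercase letter.
my @upper  = $text =~ /[A-Z]/g;

# A leading "^" negates the class: anything that is NOT a letter or digit.
my $others = () = $text =~ /[^a-zA-Z0-9]/g;

print "@digits | @upper | $others\n";
```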
Characters may be specified using a metacharacter syntax much like that
used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,
"\f" a form feed, etc. More generally, \I<nnn>, where I<nnn> is a string
of octal digits, matches the character whose ASCII value is I<nnn>.
Similarly, \xI<nn>, where I<nn> are hexadecimal digits, matches the
character whose ASCII value is I<nn>. The expression \cI<x> matches the
ASCII character control-I<x>. Finally, the "." metacharacter matches any
character except "\n" (unless you use C</s>).

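The equivalence of these notations can be checked with a small sketch. It builds the escape character with C<\c[> in an ordinary double-quoted string and then matches it using the octal and hexadecimal forms; the variable names are assumptions.

```perl
use strict;
use warnings;

# Control-[ is the ASCII escape character (octal 033, hex 1b).
my $esc = "\c[";

my $octal_ok = $esc =~ /^\033$/ ? 1 : 0;
my $hex_ok   = $esc =~ /^\x1b$/ ? 1 : 0;

# "." refuses to match a newline unless /s is in effect.
my $dot_plain = "\n" =~ /^.$/  ? 1 : 0;
my $dot_s     = "\n" =~ /^.$/s ? 1 : 0;

print "$octal_ok $hex_ok $dot_plain $dot_s\n";
```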
You can specify a series of alternatives for a pattern using "|" to
separate them, so that C<fee|fie|foe> will match any of "fee", "fie",
or "foe" in the target string (as would C<f(e|i|o)e>). Note that the
first alternative includes everything from the last pattern delimiter
("(", "[", or the beginning of the pattern) up to the first "|", and
the last alternative contains everything from the last "|" to the next
pattern delimiter. For this reason, it's common practice to include
alternatives in parentheses, to minimize confusion about where they
start and end. Note however that "|" is interpreted as a literal within
square brackets, so if you write C<[fee|fie|foe]> you're really only
matching C<[feio|]>.

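The scoping of alternatives is easy to get wrong, so here is a hedged sketch with assumed test strings. It contrasts an unparenthesized alternation with a parenthesized one, and shows "|" behaving as a literal inside square brackets.

```perl
use strict;
use warnings;

# Without parentheses the anchors bind to the outer alternatives only:
# this reads as (^fee) or (fie) or (foe$), so "feet" matches.
my $loose  = "feet" =~ /^fee|fie|foe$/   ? 1 : 0;

# With parentheses both anchors apply to every alternative.
my $strict = "feet" =~ /^(fee|fie|foe)$/ ? 1 : 0;

# Inside [] the "|" is an ordinary character and the class matches ONE character.
my $pipe   = "|"   =~ /^[fee|fie|foe]$/ ? 1 : 0;
my $whole  = "fee" =~ /^[fee|fie|foe]$/ ? 1 : 0;

print "$loose $strict $pipe $whole\n";
```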
Within a pattern, you may designate subpatterns for later reference by
enclosing them in parentheses, and you may refer back to the I<n>th
subpattern later in the pattern using the metacharacter \I<n>.
Subpatterns are numbered based on the left to right order of their
opening parenthesis. Note that a backreference matches whatever
actually matched the subpattern in the string being examined, not the
rules for that subpattern. Therefore, C<(0|0x)\d*\s\1\d*> will
match "0x1234 0x4321", but not "0x1234 01234", since subpattern 1
actually matched "0x", even though the rule C<0|0x> could
potentially match the leading 0 in the second number.
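
The example pattern above can be tried directly. This sketch runs it against both strings from the text; the variable names are assumptions.

```perl
use strict;
use warnings;

# \1 must repeat the text that group 1 actually matched ("0x"),
# not merely something that the rule 0|0x could match.
my $same_prefix  = "0x1234 0x4321" =~ /(0|0x)\d*\s\1\d*/ ? 1 : 0;
my $captured     = $1;
my $mixed_prefix = "0x1234 01234"  =~ /(0|0x)\d*\s\1\d*/ ? 1 : 0;

print "$same_prefix ($captured) $mixed_prefix\n";
```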