Commit | Line | Data |
---|---|---|
30487ceb RGS |
1 | =head1 NAME |
2 | ||
3 | perlreref - Perl Regular Expressions Reference | |
4 | ||
5 | =head1 DESCRIPTION | |
6 | ||
7 | This is a quick reference to Perl's regular expressions. | |
8 | For full information see L<perlre> and L<perlop>, as well | |
6d014f17 | 9 | as the L</"SEE ALSO"> section in this document. |
30487ceb | 10 | |
a5365663 | 11 | =head2 OPERATORS |
30487ceb | 12 | |
e17472c5 RGS |
13 | C<=~> determines to which variable the regex is applied. |
14 | In its absence, $_ is used. | |
30487ceb | 15 | |
e17472c5 | 16 | $var =~ /foo/; |
30487ceb | 17 | |
e17472c5 RGS |
18 | C<!~> determines to which variable the regex is applied, |
19 | and negates the result of the match; it returns | |
20 | false if the match succeeds, and true if it fails. | |
6d014f17 | 21 | |
e17472c5 | 22 | $var !~ /foo/; |
6d014f17 | 23 | |
e17472c5 RGS |
24 | C<m/pattern/msixpogc> searches a string for a pattern match, |
25 | applying the given options. | |
30487ceb | 26 | |
e17472c5 RGS |
27 | m Multiline mode - ^ and $ match internal lines |
28 | s match as a Single line - . matches \n | |
29 | i case-Insensitive | |
30 | x eXtended legibility - free whitespace and comments | |
31 | p Preserve a copy of the matched string - | |
32 | ${^PREMATCH}, ${^MATCH}, ${^POSTMATCH} will be defined. | |
33 | o compile pattern Once | |
34 | g Global - all occurrences | |
35 | c don't reset pos on failed matches when using /g | |
30487ceb | 36 | |
e17472c5 RGS |
37 | If 'pattern' is an empty string, the last I<successfully> matched |
38 | regex is used. Delimiters other than '/' may be used for both this | |
64c5a566 | 39 | operator and the following ones. The leading C<m> can be omitted |
e17472c5 | 40 | if the delimiter is '/'. |
30487ceb | 41 | |
e17472c5 RGS |
42 | C<qr/pattern/msixpo> lets you store a regex in a variable, |
43 | or pass one around. Modifiers are as for C<m//>, and are stored | |
44 | within the regex. | |
30487ceb | 45 | |
e17472c5 RGS |
46 | C<s/pattern/replacement/msixpogce> substitutes matches of |
47 | 'pattern' with 'replacement'. Modifiers are as for C<m//>, | |
48 | with one addition: | |
30487ceb | 49 | |
e17472c5 | 50 | e Evaluate 'replacement' as an expression |
30487ceb | 51 | |
e17472c5 RGS |
52 | 'e' may be specified multiple times. 'replacement' is interpreted |
53 | as a double quoted string unless a single-quote (C<'>) is the delimiter. | |
30487ceb | 54 | |
e17472c5 RGS |
55 | C<?pattern?> is like C<m/pattern/> but matches only once. No alternate |
56 | delimiters can be used. Must be reset with reset(). | |
30487ceb | 57 | |
a5365663 | 58 | =head2 SYNTAX |
30487ceb | 59 | |
9f4a55d4 KW |
60 | \ Escapes the character immediately following it |
61 | . Matches any single character except a newline (unless /s is | |
62 | used) | |
63 | ^ Matches at the beginning of the string (or line, if /m is used) | |
64 | $ Matches at the end of the string (or line, if /m is used) | |
65 | * Matches the preceding element 0 or more times | |
66 | + Matches the preceding element 1 or more times | |
67 | ? Matches the preceding element 0 or 1 times | |
68 | {...} Specifies a range of occurrences for the element preceding it | |
69 | [...] Matches any one of the characters contained within the brackets | |
70 | (...) Groups subexpressions for capturing to $1, $2... | |
71 | (?:...) Groups subexpressions without capturing (cluster) | |
72 | | Matches either the subexpression preceding or following it | |
73 | \1, \2, \3 ... Matches the text from the Nth group | |
74 | \g1 or \g{1}, \g2 ... Matches the text from the Nth group | |
75 | \g-1 or \g{-1}, \g-2 ... Matches the text from the Nth previous group | |
76 | \g{name} Named backreference | |
77 | \k<name> Named backreference | |
78 | \k'name' Named backreference | |
79 | (?P=name) Named backreference (Python syntax) | |
30487ceb RGS |
80 | |
81 | =head2 ESCAPE SEQUENCES | |
82 | ||
83 | These work as in normal strings. | |
84 | ||
85 | \a Alarm (beep) | |
86 | \e Escape | |
87 | \f Formfeed | |
88 | \n Newline | |
89 | \r Carriage return | |
90 | \t Tab | |
6ed007ae | 91 | \037 Any octal ASCII value |
30487ceb RGS |
92 | \x7f Any hexadecimal ASCII value |
93 | \x{263a} A wide hexadecimal value | |
94 | \cx Control-x | |
95 | \N{name} A named character | |
e526e8bb | 96 | \N{U+263D} A Unicode character by hex ordinal |
30487ceb | 97 | |
6d014f17 | 98 | \l Lowercase next character |
d3b55b48 | 99 | \u Titlecase next character |
30487ceb | 100 | \L Lowercase until \E |
d3b55b48 | 101 | \U Uppercase until \E |
30487ceb | 102 | \Q Disable pattern metacharacters until \E |
e17472c5 | 103 | \E End modification |
30487ceb | 104 | |
47e8a552 IT |
105 | For Titlecase, see L</Titlecase>. |
106 | ||
30487ceb RGS |
107 | This one works differently from normal strings: |
108 | ||
109 | \b An assertion, not backspace, except in a character class | |
110 | ||
111 | =head2 CHARACTER CLASSES | |
112 | ||
113 | [amy] Match 'a', 'm' or 'y' | |
114 | [f-j] Dash specifies "range" | |
115 | [f-j-] Dash escaped or at start or end means 'dash' | |
6d014f17 | 116 | [^f-j] Caret indicates "match any character _except_ these" |
30487ceb | 117 | |
df225385 | 118 | The following sequences (except C<\N>) work inside or outside a character class.
e17472c5 RGS |
119 | The first six are locale aware; all are Unicode aware. See L<perllocale> |
120 | and L<perlunicode> for details. | |
121 | ||
122 | \d A digit | |
123 | \D A nondigit | |
124 | \w A word character | |
125 | \W A non-word character | |
126 | \s A whitespace character | |
127 | \S A non-whitespace character | |
418e7b04 KW |
128 | \h A horizontal whitespace |
129 | \H A non-horizontal whitespace | |
9f4a55d4 KW |
130 | \N A non newline (when not followed by '{NAME}'; experimental; |
131 | not valid in a character class; equivalent to [^\n]; it's | |
132 | like '.' without /s modifier) | |
418e7b04 KW |
133 | \v A vertical whitespace |
134 | \V A non-vertical whitespace | |
e17472c5 | 135 | \R A generic newline (?>\v|\x0D\x0A) |
e04a154e JH |
136 | |
137 | \C Match a byte (with Unicode, '.' matches a character) | |
30487ceb | 138 | \pP Match P-named (Unicode) property |
e1b711da | 139 | \p{...} Match Unicode property with name longer than 1 character |
30487ceb | 140 | \PP Match non-P |
e1b711da | 141 | \P{...} Match lack of Unicode property with name longer than 1 char |
0111a78f | 142 | \X Match Unicode extended grapheme cluster |
30487ceb RGS |
143 | |
144 | POSIX character classes and their Unicode and Perl equivalents: | |
145 | ||
9f4a55d4 KW |
146 | ASCII- Full- |
147 | range range backslash | |
148 | POSIX \p{...} \p{} sequence Description | |
149 | ----------------------------------------------------------------------- | |
150 | alnum PosixAlnum Alnum Alpha plus Digit | |
151 | alpha PosixAlpha Alpha Alphabetic characters | |
152 | ascii ASCII Any ASCII character | |
153 | blank PosixBlank Blank \h Horizontal whitespace; | |
154 | full-range also written | |
155 | as \p{HorizSpace} (GNU | |
156 | extension) | |
157 | cntrl PosixCntrl Cntrl Control characters | |
158 | digit PosixDigit Digit \d Decimal digits | |
159 | graph PosixGraph Graph Alnum plus Punct | |
160 | lower PosixLower Lower Lowercase characters | |
161 | print PosixPrint Print Graph plus Space, but not | |
162 | any Cntrls | |
163 | punct PosixPunct Punct These aren't precisely | |
164 | equivalent. See NOTE, | |
165 | below. | |
166 | space PosixSpace Space [\s\cK] Whitespace | |
167 | PerlSpace SpacePerl \s Perl's whitespace | |
168 | definition | |
169 | upper PosixUpper Upper Uppercase characters | |
170 | word PerlWord Word \w Alnum plus '_' (Perl | |
171 | extension) | |
172 | xdigit ASCII_Hex_Digit XDigit Hexadecimal digit, | |
173 | ASCII-range is | |
174 | [0-9A-Fa-f] | |
175 | ||
176 | NOTE on C<[[:punct:]]>, C<\p{PosixPunct}> and C<\p{Punct}>: | |
177 | In the ASCII range, C<[[:punct:]]> and C<\p{PosixPunct}> match | |
178 | C<[-!"#$%&'()*+,./:;<=E<gt>?@[\\\]^_`{|}~]> (although if a locale is in | |
179 | effect, it could alter the behavior of C<[[:punct:]]>); and C<\p{Punct}> | |
180 | matches C<[-!"#%&'()*,./:;?@[\\\]_{}]>. When matching a UTF-8 string, | |
181 | C<[[:punct:]]> matches what it does in the ASCII range, plus what | |
182 | C<\p{Punct}> matches. C<\p{Punct}> matches anything that isn't a | |
183 | control, an alphanumeric, a space, or a symbol. | |
30487ceb RGS |
184 | |
185 | Within a character class: | |
186 | ||
9f4a55d4 KW |
187 | POSIX traditional Unicode |
188 | [:digit:] \d \p{Digit} | |
189 | [:^digit:] \D \P{Digit} | |
30487ceb RGS |
190 | |
191 | =head2 ANCHORS | |
192 | ||
193 | All are zero-width assertions. | |
194 | ||
195 | ^ Match string start (or line, if /m is used) | |
196 | $ Match string end (or line, if /m is used) or before newline | |
197 | \b Match word boundary (between \w and \W) | |
6d014f17 | 198 | \B Match except at word boundary (between \w and \w or \W and \W) |
30487ceb | 199 | \A Match string start (regardless of /m) |
6d014f17 | 200 | \Z Match string end (before optional newline) |
30487ceb RGS |
201 | \z Match absolute string end |
202 | \G Match where previous m//g left off | |
64c5a566 RGS |
203 | \K Keep the stuff left of the \K, don't include it in $& |
204 | ||
30487ceb RGS |
205 | =head2 QUANTIFIERS |
206 | ||
ac036724 | 207 | Quantifiers are greedy by default and match the B<longest> leftmost. |
30487ceb | 208 | |
64c5a566 RGS |
209 | Maximal Minimal Possessive Allowed range |
210 | ------- ------- ---------- ------------- | |
211 | {n,m} {n,m}? {n,m}+ Must occur at least n times | |
212 | but no more than m times | |
213 | {n,} {n,}? {n,}+ Must occur at least n times | |
214 | {n} {n}? {n}+ Must occur exactly n times | |
215 | * *? *+ 0 or more times (same as {0,}) | |
216 | + +? ++ 1 or more times (same as {1,}) | |
217 | ? ?? ?+ 0 or 1 time (same as {0,1}) | |
218 | ||
219 | The possessive forms (new in Perl 5.10) prevent backtracking: what gets | |
220 | matched by a pattern with a possessive quantifier will not be backtracked | |
221 | into, even if that causes the whole match to fail. | |
30487ceb | 222 | |
ac036724 | 223 | There is no quantifier C<{,n}>. That's interpreted as a literal string. |
6d014f17 | 224 | |
30487ceb RGS |
225 | =head2 EXTENDED CONSTRUCTS |
226 | ||
64c5a566 RGS |
227 | (?#text) A comment |
228 | (?:...) Groups subexpressions without capturing (cluster) | |
229 | (?pimsx-imsx:...) Enable/disable option (as per m// modifiers) | |
230 | (?=...) Zero-width positive lookahead assertion | |
231 | (?!...) Zero-width negative lookahead assertion | |
232 | (?<=...) Zero-width positive lookbehind assertion | |
233 | (?<!...) Zero-width negative lookbehind assertion | |
234 | (?>...) Grab what we can, prohibit backtracking | |
235 | (?|...) Branch reset | |
236 | (?<name>...) Named capture | |
237 | (?'name'...) Named capture | |
238 | (?P<name>...) Named capture (Python syntax) | |
239 | (?{ code }) Embedded code, return value becomes $^R | |
240 | (??{ code }) Dynamic regex, return value used as regex | |
241 | (?N) Recurse into subpattern number N | |
242 | (?-N), (?+N) Recurse into Nth previous/next subpattern | |
243 | (?R), (?0) Recurse at the beginning of the whole pattern | |
244 | (?&name) Recurse into a named subpattern | |
245 | (?P>name) Recurse into a named subpattern (Python syntax) | |
246 | (?(cond)yes|no) | |
247 | (?(cond)yes) Conditional expression, where "cond" can be: | |
248 | (N) subpattern N has matched something | |
249 | (<name>) named subpattern has matched something | |
250 | ('name') named subpattern has matched something | |
251 | (?{code}) code condition | |
252 | (R) true if recursing | |
253 | (RN) true if recursing into Nth subpattern | |
254 | (R&name) true if recursing into named subpattern | |
255 | (DEFINE) always false; a no-pattern is not allowed | |
30487ceb | 256 | |
a5365663 | 257 | =head2 VARIABLES |
30487ceb RGS |
258 | |
259 | $_ Default variable for operators to use | |
30487ceb | 260 | |
30487ceb | 261 | $` Everything prior to matched string |
e17472c5 | 262 | $& Entire matched string |
30487ceb RGS |
263 | $' Everything after matched string |
264 | ||
e17472c5 RGS |
265 | ${^PREMATCH} Everything prior to matched string |
266 | ${^MATCH} Entire matched string | |
267 | ${^POSTMATCH} Everything after matched string | |
268 | ||
269 | The use of C<$`>, C<$&> or C<$'> will slow down B<all> regex use | |
64c5a566 | 270 | within your program. Consult L<perlvar> for C<@-> |
30487ceb | 271 | to see equivalent expressions that won't cause a slowdown. |
e17472c5 RGS |
272 | See also L<Devel::SawAmpersand>. Starting with Perl 5.10, you |
273 | can also use the equivalent variables C<${^PREMATCH}>, C<${^MATCH}> | |
274 | and C<${^POSTMATCH}>, but for them to be defined, you have to | |
275 | specify the C</p> (preserve) modifier on your regular expression. | |
30487ceb RGS |
276 | |
277 | $1, $2 ... hold the Xth captured expr | |
278 | $+ Last parenthesized pattern match | |
279 | $^N Holds the most recently closed capture | |
280 | $^R Holds the result of the last (?{...}) expr | |
6d014f17 JH |
281 | @- Offsets of starts of groups. $-[0] holds start of whole match |
282 | @+ Offsets of ends of groups. $+[0] holds end of whole match | |
e17472c5 RGS |
283 | %+ Named capture buffers |
284 | %- Named capture buffers, as array refs | |
30487ceb | 285 | |
6d014f17 | 286 | Captured groups are numbered according to their I<opening> paren. |
30487ceb | 287 | |
a5365663 | 288 | =head2 FUNCTIONS |
30487ceb RGS |
289 | |
290 | lc Lowercase a string | |
291 | lcfirst Lowercase first char of a string | |
292 | uc Uppercase a string | |
47e8a552 IT |
293 | ucfirst Titlecase first char of a string |
294 | ||
30487ceb RGS |
295 | pos Return or set current match position |
296 | quotemeta Quote metacharacters | |
297 | reset Reset ?pattern? status | |
298 | study Analyze string for optimizing matching | |
299 | ||
e17472c5 | 300 | split Use a regex to split a string into parts |
30487ceb | 301 | |
d3b55b48 JH |
302 | The first four of these are like the escape sequences C<\L>, C<\l>, |
303 | C<\U>, and C<\u>. For Titlecase, see L</Titlecase>. | |
47e8a552 | 304 | |
1501d360 | 305 | =head2 TERMINOLOGY |
47e8a552 | 306 | |
a5365663 | 307 | =head3 Titlecase |
47e8a552 IT |
308 | |
309 | A Unicode concept that is usually the same as uppercase, but that | |
310 | differs for certain characters such as the German "sharp s". | |
311 | ||
40506b5d | 312 | =head1 AUTHOR |
30487ceb | 313 | |
64c5a566 | 314 | Iain Truskett. Updated by the Perl 5 Porters. |
30487ceb RGS |
315 | |
316 | This document may be distributed under the same terms as Perl itself. | |
317 | ||
40506b5d | 318 | =head1 SEE ALSO |
30487ceb RGS |
319 | |
320 | =over 4 | |
321 | ||
322 | =item * | |
323 | ||
324 | L<perlretut> for a tutorial on regular expressions. | |
325 | ||
326 | =item * | |
327 | ||
328 | L<perlrequick> for a rapid tutorial. | |
329 | ||
330 | =item * | |
331 | ||
332 | L<perlre> for more details. | |
333 | ||
334 | =item * | |
335 | ||
336 | L<perlvar> for details on the variables. | |
337 | ||
338 | =item * | |
339 | ||
340 | L<perlop> for details on the operators. | |
341 | ||
342 | =item * | |
343 | ||
344 | L<perlfunc> for details on the functions. | |
345 | ||
346 | =item * | |
347 | ||
348 | L<perlfaq6> for FAQs on regular expressions. | |
349 | ||
350 | =item * | |
351 | ||
64c5a566 RGS |
352 | L<perlrebackslash> for a reference on backslash sequences. |
353 | ||
354 | =item * | |
355 | ||
356 | L<perlrecharclass> for a reference on character classes. | |
357 | ||
358 | =item * | |
359 | ||
30487ceb RGS |
360 | The L<re> module to alter behaviour and aid |
361 | debugging. | |
362 | ||
363 | =item * | |
364 | ||
365 | L<perldebug/"Debugging regular expressions"> | |
366 | ||
367 | =item * | |
368 | ||
e17472c5 | 369 | L<perluniintro>, L<perlunicode>, L<charnames> and L<perllocale> |
30487ceb RGS |
370 | for details on regexes and internationalisation. |
371 | ||
372 | =item * | |
373 | ||
374 | I<Mastering Regular Expressions> by Jeffrey Friedl | |
08d7a6b2 | 375 | (F<http://oreilly.com/catalog/9780596528126/>) for a thorough grounding and |
30487ceb RGS |
376 | reference on the topic. |
377 | ||
378 | =back | |
379 | ||
40506b5d | 380 | =head1 THANKS |
30487ceb RGS |
381 | |
382 | David P.C. Wollmann, | |
383 | Richard Soderberg, | |
384 | Sean M. Burke, | |
385 | Tom Christiansen, | |
e5a7b003 | 386 | Jim Cromie, |
30487ceb RGS |
387 | and |
388 | Jeffrey Goff | |
389 | for useful advice. | |
6d014f17 JH |
390 | |
391 | =cut |
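
Several of the constructs listed in the reference above (named captures and C<%+>, possessive quantifiers, C<\K>, the C</e> and C</p> modifiers) are easier to follow with a small worked example. The following is a minimal sketch, not part of the original document; it assumes Perl 5.10 or later, and the sample strings and variable names are purely illustrative.

```perl
use strict;
use warnings;

# Named capture: (?<name>...) fills %+ in addition to $1, $2, ...
my $date = '2024-05-17';    # illustrative input
my ($year, $month) = ('', '');
if ($date =~ /^(?<year>\d{4})-(?<month>\d{2})/) {
    ($year, $month) = ($+{year}, $+{month});
}

# Possessive quantifier: a++ never gives back what it matched, so the
# trailing 'a' cannot match; the greedy a+ backtracks and succeeds.
my $greedy     = 'aaa' =~ /^a+a$/  ? 1 : 0;
my $possessive = 'aaa' =~ /^a++a$/ ? 1 : 0;

# \K keeps the text to its left out of the part that s/// replaces.
(my $price = 'cost: 100') =~ s/cost: \K\d+/250/;

# /e evaluates the replacement as Perl code; /g applies it to every match.
(my $doubled = '1 2 3') =~ s/(\d+)/$1 * 2/ge;

# /p makes ${^PREMATCH}, ${^MATCH} and ${^POSTMATCH} available.
my ($pre, $match, $post) = ('', '', '');
if ('hello world' =~ /o w/p) {
    ($pre, $match, $post) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});
}

print join(' ', $year, $month), "\n";
print join(' ', $greedy, $possessive), "\n";
print join('|', $price, $doubled), "\n";
print join('|', $pre, $match, $post), "\n";
```

The copy-then-modify idiom C<(my $copy = $original) =~ s/.../.../> is used so that the literal sample strings are left untouched.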