=head1 NAME

perlreref - Perl Regular Expressions Reference

=head1 DESCRIPTION

This is a quick reference to Perl's regular expressions.
For full information see L<perlre> and L<perlop>, as well
as the L</"SEE ALSO"> section in this document.

=head2 OPERATORS

C<=~> determines to which variable the regex is applied.
In its absence, C<$_> is used.

    $var =~ /foo/;

C<!~> determines to which variable the regex is applied,
and negates the result of the match; it returns
false if the match succeeds, and true if it fails.

    $var !~ /foo/;

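A minimal sketch of the two binding operators and of the implicit C<$_>
match. The variable names are made up for illustration.

```perl
use strict;
use warnings;

my $var = 'food for thought';

# =~ binds the match to $var and is true when the pattern matches.
my $bound = ($var =~ /foo/) ? 1 : 0;

# !~ binds the same way but returns the negated result.
my $negated = ($var !~ /foo/) ? 1 : 0;

# Without a binding operator the match is applied to $_.
my $implicit;
for ('foobar') {
    $implicit = /foo/ ? 1 : 0;
}

print "bound=$bound negated=$negated implicit=$implicit\n";
```
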
C<m/pattern/msixpogcdualn> searches a string for a pattern match,
applying the given options.

    m  Multiline mode - ^ and $ match internal lines
    s  match as a Single line - . matches \n
    i  case-Insensitive
    x  eXtended legibility - free whitespace and comments
    p  Preserve a copy of the matched string -
       ${^PREMATCH}, ${^MATCH}, ${^POSTMATCH} will be defined.
    o  compile pattern Once
    g  Global - all occurrences
    c  don't reset pos on failed matches when using /g
    a  restrict \d, \s, \w and [:posix:] to match ASCII only
    aa (two a's) also /i matches exclude ASCII/non-ASCII
    l  match according to current locale
    u  match according to Unicode rules
    d  match according to native rules unless something indicates
       Unicode
    n  Non-capture mode. Don't let () fill in $1, $2, etc...

If 'pattern' is an empty string, the last I<successfully> matched
regex is used. Delimiters other than '/' may be used for both this
operator and the following ones. The leading C<m> can be omitted
if the delimiter is '/'.

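As an illustration only, the following sketch combines several of the
modifiers above with an alternate delimiter. The sample text and variable
names are invented for the example.

```perl
use strict;
use warnings;

my $text = "Alpha: 1\nbeta: 22\nGAMMA: 333\n";

# /m lets ^ match after each internal newline, /i ignores case,
# /x permits whitespace and comments, /g returns every match.
my @keys = $text =~ m{
    ^ ( [a-z]+ )   # a key at the start of a line
    :              # followed by a colon
}gimx;

# /g in scalar context walks the string one match at a time.
my @numbers;
while ($text =~ /(\d+)/g) {
    push @numbers, $1;
}

print "@keys | @numbers\n";
```

In list context C</g> returns all captures at once; in scalar context each
evaluation continues from where the previous match ended.
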
C<qr/pattern/msixpodualn> lets you store a regex in a variable,
or pass one around. Modifiers as for C<m//>, and are stored
within the regex.

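A short sketch of storing a regex with C<qr//> and reusing it inside a
larger pattern. The names C<$word> and C<$pair> are hypothetical.

```perl
use strict;
use warnings;

# The /i modifier is stored inside the compiled regex object.
my $word = qr/(\w+)/i;

# A qr// object can be interpolated into a larger pattern,
# and it keeps its own modifiers there.
my $pair = qr/$word\s*=\s*$word/;

# A qr// object can also be used directly on the right of =~.
my ($key, $value) = 'Colour = Red' =~ $pair;

print "$key -> $value\n";
```
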
C<s/pattern/replacement/msixpogcedual> substitutes matches of
'pattern' with 'replacement'. Modifiers as for C<m//>,
with two additions:

    e  Evaluate 'replacement' as an expression
    r  Return substitution and leave the original string untouched.

'e' may be specified multiple times. 'replacement' is interpreted
as a double quoted string unless a single-quote (C<'>) is the delimiter.

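The sketch below shows C</e>, C</r> and the single-quote delimiter
together. It assumes Perl 5.14 or later for C</r>; the strings and variable
names are illustrative.

```perl
use strict;
use warnings;

my $original = 'price: 10, tax: 2';

# /e evaluates the replacement as code; /r returns the changed copy
# and leaves $original untouched.
my $doubled = $original =~ s/(\d+)/$1 * 2/ger;

# Without /r the substitution edits in place and returns the number
# of replacements made.
my $copy  = $original;
my $count = ($copy =~ s/\d+/N/g);

# With ' as the delimiter the replacement is not interpolated,
# so the text $1 is inserted literally.
(my $literal = 'a b') =~ s' '$1'g;

print "$doubled | $copy ($count) | $literal\n";
```
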
C<m?pattern?> is like C<m/pattern/> but matches only once. No alternate
delimiters can be used. Must be reset with reset().

=head2 SYNTAX

 \       Escapes the character immediately following it
 .       Matches any single character except a newline (unless /s is
           used)
 ^       Matches at the beginning of the string (or line, if /m is used)
 $       Matches at the end of the string (or line, if /m is used)
 *       Matches the preceding element 0 or more times
 +       Matches the preceding element 1 or more times
 ?       Matches the preceding element 0 or 1 times
 {...}   Specifies a range of occurrences for the element preceding it
 [...]   Matches any one of the characters contained within the brackets
 (...)   Groups subexpressions for capturing to $1, $2...
 (?:...) Groups subexpressions without capturing (cluster)
 |       Matches either the subexpression preceding or following it
 \g1 or \g{1}, \g2 ...    Matches the text from the Nth group
 \1, \2, \3 ...           Matches the text from the Nth group
 \g-1 or \g{-1}, \g-2 ... Matches the text from the Nth previous group
 \g{name}     Named backreference
 \k<name>     Named backreference
 \k'name'     Named backreference
 (?P=name)    Named backreference (python syntax)

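To illustrate the backreference forms listed above, here is a small sketch
that finds a doubled word three ways. It assumes Perl 5.10 or later; the
pattern and variable names are invented.

```perl
use strict;
use warnings;

my $line = 'this is is a test';

my $numbered = qr/\b(\w+)\s+\g1\b/;               # absolute group number
my $relative = qr/\b(\w+)\s+\g{-1}\b/;            # previous group
my $named    = qr/\b(?<word>\w+)\s+\k<word>\b/;   # named group

# Each pattern captures the repeated word in $1
# (named groups are numbered as well).
my @hits = map { $line =~ $_ ? $1 : () } ($numbered, $relative, $named);

print "@hits\n";
```
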
=head2 ESCAPE SEQUENCES

These work as in normal strings.

   \a       Alarm (beep)
   \e       Escape
   \f       Formfeed
   \n       Newline
   \r       Carriage return
   \t       Tab
   \037     Char whose ordinal is the 3 octal digits, max \777
   \o{2307} Char whose ordinal is the octal number, unrestricted
   \x7f     Char whose ordinal is the 2 hex digits, max \xFF
   \x{263a} Char whose ordinal is the hex number, unrestricted
   \cx      Control-x
   \N{name} A named Unicode character or character sequence
   \N{U+263D} A Unicode character by hex ordinal

   \l  Lowercase next character
   \u  Titlecase next character
   \L  Lowercase until \E
   \U  Uppercase until \E
   \F  Foldcase until \E
   \Q  Disable pattern metacharacters until \E
   \E  End modification

For Titlecase, see L</Titlecase>.

This one works differently from normal strings:

   \b  An assertion, not backspace, except in a character class

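A brief sketch of a few of these sequences in use: case modification in a
string, C<\Q...\E> around interpolated text, and C<\b> inside a character
class. All names and sample strings are hypothetical.

```perl
use strict;
use warnings;

# \u and \L combine: lowercase everything, then titlecase the first char.
my $name  = 'jOHN';
my $title = "\u\L$name\E smith";

# \Q...\E makes interpolated metacharacters match literally.
my $literal      = 'a.b*c';
my $quoted_exact = ('a.b*c' =~ /^\Q$literal\E\z/) ? 1 : 0;
my $quoted_other = ('axbbc' =~ /^\Q$literal\E\z/) ? 1 : 0;
my $unquoted     = ('axbbc' =~ /^$literal\z/)     ? 1 : 0;

# Inside a character class \b is a backspace, not a word boundary.
my $backspace = ("a\bb" =~ /a[\b]b/) ? 1 : 0;

print "$title $quoted_exact $quoted_other $unquoted $backspace\n";
```
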
=head2 CHARACTER CLASSES

   [amy]    Match 'a', 'm' or 'y'
   [f-j]    Dash specifies "range"
   [f-j-]   Dash escaped or at start or end means 'dash'
   [^f-j]   Caret indicates "match any character _except_ these"

The following sequences (except C<\N>) work within or without a character class.
The first six are locale aware, all are Unicode aware. See L<perllocale>
and L<perlunicode> for details.

   \d      A digit
   \D      A nondigit
   \w      A word character
   \W      A non-word character
   \s      A whitespace character
   \S      A non-whitespace character
   \h      A horizontal whitespace
   \H      A non horizontal whitespace
   \N      A non newline (when not followed by '{NAME}';
           not valid in a character class; equivalent to [^\n]; it's
           like '.' without /s modifier)
   \v      A vertical whitespace
   \V      A non vertical whitespace
   \R      A generic newline           (?>\v|\x0D\x0A)

   \pP     Match P-named (Unicode) property
   \p{...} Match Unicode property with name longer than 1 character
   \PP     Match non-P
   \P{...} Match lack of Unicode property with name longer than 1 char
   \X      Match Unicode extended grapheme cluster

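The following sketch probes a handful of these classes. It assumes Perl
5.14 or later (for C</a>); the variable names are made up, and the sample
characters are given by code point so the file needs no special encoding.

```perl
use strict;
use warnings;

my $arabic_three = "\x{0663}";    # ARABIC-INDIC DIGIT THREE

# \d is Unicode aware; /a restricts it to the ASCII digits 0-9.
my $unicode_digit = ($arabic_three =~ /^\d\z/)  ? 1 : 0;
my $ascii_digit   = ($arabic_three =~ /^\d\z/a) ? 1 : 0;

# \h and \v split whitespace into horizontal and vertical.
my $tab_is_h = ("\t" =~ /^\h\z/) ? 1 : 0;
my $nl_is_h  = ("\n" =~ /^\h\z/) ? 1 : 0;
my $nl_is_v  = ("\n" =~ /^\v\z/) ? 1 : 0;

# \R treats CR LF as a single generic newline; \N never matches \n.
my $crlf_is_R = ("a\r\nb" =~ /^a\Rb\z/) ? 1 : 0;
my $N_matches = ("a\nb"   =~ /^a\Nb\z/) ? 1 : 0;

# \p{...} tests a Unicode property (here: uppercase letter).
my $delta_is_upper = ("\x{0394}" =~ /^\p{Lu}\z/) ? 1 : 0;
```
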
POSIX character classes and their Unicode and Perl equivalents:

            ASCII-          Full-
   POSIX    range           range        backslash
 [[:...:]]  \p{...}         \p{...}      sequence  Description

 -----------------------------------------------------------------------
 alnum   PosixAlnum      XPosixAlnum      'alpha' plus 'digit'
 alpha   PosixAlpha      XPosixAlpha      Alphabetic characters
 ascii   ASCII                            Any ASCII character
 blank   PosixBlank      XPosixBlank  \h  Horizontal whitespace;
                                            full-range also
                                            written as
                                            \p{HorizSpace} (GNU
                                            extension)
 cntrl   PosixCntrl      XPosixCntrl      Control characters
 digit   PosixDigit      XPosixDigit  \d  Decimal digits
 graph   PosixGraph      XPosixGraph      'alnum' plus 'punct'
 lower   PosixLower      XPosixLower      Lowercase characters
 print   PosixPrint      XPosixPrint      'graph' plus 'space',
                                            but not any Controls
 punct   PosixPunct      XPosixPunct      Punctuation and Symbols
                                            in ASCII-range; just
                                            punct outside it
 space   PosixSpace      XPosixSpace  \s  Whitespace
 upper   PosixUpper      XPosixUpper      Uppercase characters
 word    PosixWord       XPosixWord   \w  'alnum' + Unicode marks
                                            + connectors, like
                                            '_' (Perl extension)
 xdigit  ASCII_Hex_Digit XPosixXDigit     Hexadecimal digit,
                                            ASCII-range is
                                            [0-9A-Fa-f]

Also, various synonyms like C<\p{Alpha}> for C<\p{XPosixAlpha}>; all listed
in L<perluniprops/Properties accessible through \p{} and \P{}>.

Within a character class:

    POSIX      traditional   Unicode
  [:digit:]       \d        \p{Digit}
  [:^digit:]      \D        \P{Digit}

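As a rough illustration of the table above, this sketch uses POSIX classes
inside bracketed classes and contrasts an ASCII-range property with its
full-range counterpart. The sample strings and names are invented.

```perl
use strict;
use warnings;

my $s = 'id_42: ok!';

# POSIX classes are only valid inside a bracketed character class.
my @punct = $s =~ /([[:punct:]])/g;

# [:^digit:] is the negated form, equivalent to \D.
my $non_digits = () = $s =~ /[[:^digit:]]/g;

# Deleting every [:digit:] character from a copy of the string.
(my $stripped = $s) =~ s/[[:digit:]]//g;

# The Posix* properties cover ASCII only; the XPosix* ones cover
# the full Unicode range.
my $e_acute    = "\x{E9}";    # LATIN SMALL LETTER E WITH ACUTE
my $ascii_only = ($e_acute =~ /\p{PosixAlpha}/)  ? 1 : 0;
my $full_range = ($e_acute =~ /\p{XPosixAlpha}/) ? 1 : 0;
```
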
=head2 ANCHORS

All are zero-width assertions.

   ^  Match string start (or line, if /m is used)
   $  Match string end (or line, if /m is used) or before newline
   \b{} Match boundary of type specified within the braces
   \B{} Match wherever \b{} doesn't match
   \b Match word boundary (between \w and \W)
   \B Match except at word boundary (between \w and \w or \W and \W)
   \A Match string start (regardless of /m)
   \Z Match string end (before optional newline)
   \z Match absolute string end
   \G Match where previous m//g left off
   \K Keep the stuff left of the \K, don't include it in $&

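The differences between the end-of-string anchors, and the use of C<\G>
and C<\K>, can be sketched as follows. This assumes Perl 5.10 or later for
C<\K>; the tiny tokenizer and all names are hypothetical.

```perl
use strict;
use warnings;

my $line = "foo\n";

# $ and \Z allow one trailing newline; \z does not.
my $dollar  = ($line =~ /foo$/)  ? 1 : 0;
my $upper_z = ($line =~ /foo\Z/) ? 1 : 0;
my $lower_z = ($line =~ /foo\z/) ? 1 : 0;

# \G anchors at pos(); /c keeps pos() when a match attempt fails,
# so alternative token patterns can be tried at the same position.
my $input = 'abc123';
my @tokens;
pos($input) = 0;
while (pos($input) < length $input) {
    if    ($input =~ /\G([a-z]+)/gc) { push @tokens, "word:$1" }
    elsif ($input =~ /\G(\d+)/gc)    { push @tokens, "num:$1" }
    else                             { last }
}

# \K drops everything matched so far from the text being replaced.
(my $path = '/usr/local/bin') =~ s{.*/\K.*}{sbin};
```
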
=head2 QUANTIFIERS

Quantifiers are greedy by default and match the B<longest> leftmost.

   Maximal Minimal Possessive Allowed range
   ------- ------- ---------- -------------
   {n,m}   {n,m}?  {n,m}+     Must occur at least n times
                              but no more than m times
   {n,}    {n,}?   {n,}+      Must occur at least n times
   {,n}    {,n}?   {,n}+      Must occur at most n times
   {n}     {n}?    {n}+       Must occur exactly n times
   *       *?      *+         0 or more times (same as {0,})
   +       +?      ++         1 or more times (same as {1,})
   ?       ??      ?+         0 or 1 time (same as {0,1})

The possessive forms (new in Perl 5.10) prevent backtracking: what gets
matched by a pattern with a possessive quantifier will not be backtracked
into, even if that causes the whole match to fail.

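A small sketch contrasting the three quantifier styles, assuming Perl 5.10
or later for the possessive form. The sample strings are illustrative.

```perl
use strict;
use warnings;

my $html = '<b>one</b><b>two</b>';

# Maximal (greedy): .* runs to the last closing tag.
my ($greedy) = $html =~ /<b>(.*)<\/b>/;

# Minimal: .*? stops at the first closing tag.
my ($minimal) = $html =~ /<b>(.*?)<\/b>/;

# Possessive: a++ takes every 'a' and never gives one back,
# so the final 'a' in the pattern can no longer match.
my $backtracking = ('aaa' =~ /^a+a\z/)  ? 1 : 0;
my $possessive   = ('aaa' =~ /^a++a\z/) ? 1 : 0;

print "$greedy | $minimal | $backtracking $possessive\n";
```
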
=head2 EXTENDED CONSTRUCTS

   (?#text)          A comment
   (?:...)           Groups subexpressions without capturing (cluster)
   (?pimsx-imsx:...) Enable/disable option (as per m// modifiers)
   (?=...)           Zero-width positive lookahead assertion
   (*pla:...)        Same, starting in 5.32; experimentally in 5.28
   (*positive_lookahead:...) Same, same versions as *pla
   (?!...)           Zero-width negative lookahead assertion
   (*nla:...)        Same, starting in 5.32; experimentally in 5.28
   (*negative_lookahead:...) Same, same versions as *nla
   (?<=...)          Zero-width positive lookbehind assertion
   (*plb:...)        Same, starting in 5.32; experimentally in 5.28
   (*positive_lookbehind:...) Same, same versions as *plb
   (?<!...)          Zero-width negative lookbehind assertion
   (*nlb:...)        Same, starting in 5.32; experimentally in 5.28
   (*negative_lookbehind:...) Same, same versions as *nlb
   (?>...)           Grab what we can, prohibit backtracking
   (*atomic:...)     Same, starting in 5.32; experimentally in 5.28
   (?|...)           Branch reset
   (?<name>...)      Named capture
   (?'name'...)      Named capture
   (?P<name>...)     Named capture (python syntax)
   (?[...])          Extended bracketed character class
   (?{ code })       Embedded code, return value becomes $^R
   (??{ code })      Dynamic regex, return value used as regex
   (?N)              Recurse into subpattern number N
   (?-N), (?+N)      Recurse into Nth previous/next subpattern
   (?R), (?0)        Recurse at the beginning of the whole pattern
   (?&name)          Recurse into a named subpattern
   (?P>name)         Recurse into a named subpattern (python syntax)
   (?(cond)yes|no)
   (?(cond)yes)      Conditional expression, where "(cond)" can be:
                     (?=pat)   lookahead; also (*pla:pat)
                               (*positive_lookahead:pat)
                     (?!pat)   negative lookahead; also (*nla:pat)
                               (*negative_lookahead:pat)
                     (?<=pat)  lookbehind; also (*plb:pat)
                               (*positive_lookbehind:pat)
                     (?<!pat)  negative lookbehind; also (*nlb:pat)
                               (*negative_lookbehind:pat)
                     (N)       subpattern N has matched something
                     (<name>)  named subpattern has matched something
                     ('name')  named subpattern has matched something
                     (?{code}) code condition
                     (R)       true if recursing
                     (RN)      true if recursing into Nth subpattern
                     (R&name)  true if recursing into named subpattern
                     (DEFINE)  always false, no no-pattern allowed

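A few of the constructs above, sketched under the assumption of Perl 5.10
or later: lookaround, named captures, recursion and a conditional. The
patterns and names are examples only, not canonical recipes.

```perl
use strict;
use warnings;

# Lookbehind plus lookahead: insert a comma at every position that has
# a digit before it and a multiple of three digits after it.
(my $grouped = '1234567') =~ s/(?<=\d)(?=(?:\d{3})+\z)/,/g;

# Named captures are available through %+ after a successful match.
my %date;
if ('2024-05-17' =~ /\A(?<year>\d{4})-(?<month>\d\d)-(?<day>\d\d)\z/) {
    %date = %+;
}

# (?1) recurses into group 1, which lets the pattern match nested
# parentheses of any depth.
my $balanced = qr/\A(\((?:[^()]++|(?1))*\))\z/;
my $nested_ok  = ('(a(b)c)' =~ $balanced) ? 1 : 0;
my $nested_bad = ('(a(b)c'  =~ $balanced) ? 1 : 0;

# (?(1)...) requires the closing paren only if group 1 matched.
my $optional_parens = qr/\A(\()?\w+(?(1)\))\z/;
my @verdicts = map { $_ =~ $optional_parens ? 1 : 0 }
               ('abc', '(abc)', '(abc');
```
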
=head2 VARIABLES

   $_    Default variable for operators to use

   $`    Everything prior to matched string
   $&    Entire matched string
   $'    Everything after matched string

   ${^PREMATCH}   Everything prior to matched string
   ${^MATCH}      Entire matched string
   ${^POSTMATCH}  Everything after matched string

Note to those still using Perl 5.18 or earlier:
The use of C<$`>, C<$&> or C<$'> will slow down B<all> regex use
within your program. Consult L<perlvar> for C<@->
to see equivalent expressions that won't cause a slowdown.
See also L<Devel::SawAmpersand>. Starting with Perl 5.10, you
can also use the equivalent variables C<${^PREMATCH}>, C<${^MATCH}>
and C<${^POSTMATCH}>, but for them to be defined, you have to
specify the C</p> (preserve) modifier on your regular expression.
In Perl 5.20, the use of C<$`>, C<$&> and C<$'> makes no speed difference.

   $1, $2 ...  hold the Xth captured expr
   $+    Last parenthesized pattern match
   $^N   Holds the most recently closed capture
   $^R   Holds the result of the last (?{...}) expr
   @-    Offsets of starts of groups. $-[0] holds start of whole match
   @+    Offsets of ends of groups. $+[0] holds end of whole match
   %+    Named capture groups
   %-    Named capture groups, as array refs

Captured groups are numbered according to their I<opening> paren.

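To show how these variables relate to one another, the sketch below
records several of them after a single match. It assumes Perl 5.10 or
later for C</p> and named captures; the hash C<%seen> and its keys are
invented for the example.

```perl
use strict;
use warnings;

my $str = 'say key=value now';
my %seen;

if ($str =~ /(?<k>\w+)=(?<v>\w+)/p) {
    %seen = (
        prematch   => ${^PREMATCH},    # text before the match
        match      => ${^MATCH},       # the matched text
        postmatch  => ${^POSTMATCH},   # text after the match
        first      => $1,              # first capture group
        last_paren => $+,              # last parenthesized match
        named_k    => $+{k},           # named capture 'k'
        start      => $-[0],           # offset where the match starts
        end        => $+[0],           # offset just past the match
        # The @- and @+ offsets reproduce ${^MATCH} without /p.
        via_substr => substr($str, $-[0], $+[0] - $-[0]),
    );
}
```
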
=head2 FUNCTIONS

   lc          Lowercase a string
   lcfirst     Lowercase first char of a string
   uc          Uppercase a string
   ucfirst     Titlecase first char of a string
   fc          Foldcase a string

   pos         Return or set current match position
   quotemeta   Quote metacharacters
   reset       Reset m?pattern? status
   study       Analyze string for optimizing matching

   split       Use a regex to split a string into parts

The first five of these are like the escape sequences C<\L>, C<\l>,
C<\U>, C<\u>, and C<\F>. For Titlecase, see L</Titlecase>; for
Foldcase, see L</Foldcase>.

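A quick sketch of some of these functions next to their escape-sequence
counterparts. It assumes Perl 5.16 or later, where C<fc> is enabled by
C<use feature 'fc'>; all variable names are hypothetical.

```perl
use strict;
use warnings;
use feature 'fc';

# ucfirst/lc behave like \u and \L inside a double-quoted string.
my $title  = ucfirst lc 'pERL';
my $interp = "\u\LpERL\E";

# fc gives a form suitable for case-insensitive comparison.
my $same_folded = (fc('HeLLo') eq fc('hEllO')) ? 1 : 0;

# pos reports where the last /g match on the string ended.
my $text = 'aXbXc';
$text =~ /X/g;
my $after_first = pos $text;

# quotemeta escapes metacharacters, as \Q does in a pattern.
my $quoted  = quotemeta '1+1=2';
my $literal = ('1+1=2' =~ /^$quoted\z/) ? 1 : 0;

# split uses a regex as the field separator.
my @fields = split /\s*,\s*/, 'a , b,c';
```
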
=head2 TERMINOLOGY

=head3 Titlecase

Unicode concept which most often is equal to uppercase, but for
certain characters like the German "sharp s" there is a difference.

=head3 Foldcase

Unicode form that is useful when comparing strings regardless of case,
as certain characters have complex one-to-many case mappings. Primarily a
variant of lowercase.

=head1 AUTHOR

Iain Truskett. Updated by the Perl 5 Porters.

This document may be distributed under the same terms as Perl itself.

=head1 SEE ALSO

=over 4

=item *

L<perlretut> for a tutorial on regular expressions.

=item *

L<perlrequick> for a rapid tutorial.

=item *

L<perlre> for more details.

=item *

L<perlvar> for details on the variables.

=item *

L<perlop> for details on the operators.

=item *

L<perlfunc> for details on the functions.

=item *

L<perlfaq6> for FAQs on regular expressions.

=item *

L<perlrebackslash> for a reference on backslash sequences.

=item *

L<perlrecharclass> for a reference on character classes.

=item *

The L<re> module to alter behaviour and aid
debugging.

=item *

L<perldebug/"Debugging Regular Expressions">

=item *

L<perluniintro>, L<perlunicode>, L<charnames> and L<perllocale>
for details on regexes and internationalisation.

=item *

I<Mastering Regular Expressions> by Jeffrey Friedl
(L<http://oreilly.com/catalog/9780596528126/>) for a thorough grounding and
reference on the topic.

=back

=head1 THANKS

David P.C. Wollmann,
Richard Soderberg,
Sean M. Burke,
Tom Christiansen,
Jim Cromie,
and
Jeffrey Goff
for useful advice.

=cut