=head1 NAME

perlreref - Perl Regular Expressions Reference

=head1 DESCRIPTION

This is a quick reference to Perl's regular expressions.
For full information see L<perlre> and L<perlop>, as well
as the L</"SEE ALSO"> section in this document.

=head2 OPERATORS

C<=~> determines to which variable the regex is applied.
In its absence, $_ is used.

  $var =~ /foo/;

C<!~> determines to which variable the regex is applied,
and negates the result of the match; it returns
false if the match succeeds, and true if it fails.

  $var !~ /foo/;

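As a minimal sketch of the two binding operators and of the implicit
C<$_> default (the variable names and sample strings are illustrative
assumptions, not part of the reference):

  use strict;
  use warnings;

  my $var = "seafood";
  my $hit  = ($var =~ /foo/) ? 1 : 0;   # 1: "seafood" contains "foo"
  my $miss = ($var !~ /foo/) ? 1 : 0;   # 0: !~ negates the same match

  my $implicit = 0;
  for ("foobar") {                      # the loop aliases $_ to the string
      $implicit = 1 if /foo/;           # no =~ here, so $_ is searched
  }
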
C<m/pattern/msixpogc> searches a string for a pattern match,
applying the given options.

  m  Multiline mode - ^ and $ match internal lines
  s  match as a Single line - . matches \n
  i  case-Insensitive
  x  eXtended legibility - free whitespace and comments
  p  Preserve a copy of the matched string -
     ${^PREMATCH}, ${^MATCH}, ${^POSTMATCH} will be defined.
  o  compile pattern Once
  g  Global - all occurrences
  c  don't reset pos on failed matches when using /g

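The modifiers can be combined freely. The following sketch exercises
C</g>, C</i>, C</m>, C</s> and C</x> on an assumed two-line sample
string; the variable names are illustrative only:

  use strict;
  use warnings;

  my $text = "Foo\nfoo bar\n";

  my @all    = $text =~ /foo/gi;      # /g and /i: ("Foo", "foo")
  my @heads  = $text =~ /^(\w+)/mg;   # /m: ^ also matches after "\n"
  my $joined = ($text =~ /Foo.foo/s) ? 1 : 0;   # /s: . matches "\n"

  # /x: whitespace and comments inside the pattern are ignored
  my ($word) = $text =~ / (\w+) \s+ bar   # the word before "bar"
                        /x;
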
If 'pattern' is an empty string, the last I<successfully> matched
regex is used. Delimiters other than '/' may be used for both this
operator and the following ones. The leading C<m> can be omitted
if the delimiter is '/'.

C<qr/pattern/msixpo> lets you store a regex in a variable,
or pass one around. Modifiers are as for C<m//>, and are stored
within the regex.

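A short sketch of how a C<qr//> object carries its own modifiers when
used alone or interpolated into a larger pattern (the sample strings
and variable names are assumptions made for illustration):

  use strict;
  use warnings;

  my $word = qr/perl/i;              # the /i flag is stored in $word

  my $alone  = ("I like Perl" =~ $word)             ? 1 : 0;   # 1
  my $inside = ("PERL rocks"  =~ /^$word\s+rocks$/) ? 1 : 0;   # 1
  my $strict = ("PERL ROCKS"  =~ /^$word\s+rocks$/) ? 1 : 0;   # 0

The last match fails because only the interpolated part is
case-insensitive; the surrounding pattern has no C</i>.
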
C<s/pattern/replacement/msixpogce> substitutes matches of
'pattern' with 'replacement'. Modifiers are as for C<m//>,
with one addition:

  e  Evaluate 'replacement' as an expression

'e' may be specified multiple times. 'replacement' is interpreted
as a double quoted string unless a single-quote (C<'>) is the delimiter.

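The sketch below contrasts C</e>, ordinary double-quote interpolation
of the replacement, and single-quote delimiters. The strings and
variable names are assumed for the example:

  use strict;
  use warnings;

  my $order = "3 apples and 5 pears";
  (my $doubled = $order) =~ s/(\d+)/$1 * 2/ge;   # "6 apples and 10 pears"

  my $name = "world";
  (my $plain = 'Hello, NAME') =~ s/NAME/$name/;  # "Hello, world"
  (my $raw   = 'Hello, NAME') =~ s'NAME'$name';  # 'Hello, $name'
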
C<?pattern?> is like C<m/pattern/> but matches only once. No alternate
delimiters can be used. It must be reset with reset().

=head2 SYNTAX

  \       Escapes the character immediately following it
  .       Matches any single character except a newline (unless /s is
            used)
  ^       Matches at the beginning of the string (or line, if /m is used)
  $       Matches at the end of the string (or line, if /m is used)
  *       Matches the preceding element 0 or more times
  +       Matches the preceding element 1 or more times
  ?       Matches the preceding element 0 or 1 times
  {...}   Specifies a range of occurrences for the element preceding it
  [...]   Matches any one of the characters contained within the brackets
  (...)   Groups subexpressions for capturing to $1, $2...
  (?:...) Groups subexpressions without capturing (cluster)
  |       Matches either the subexpression preceding or following it
  \1, \2, \3 ...           Matches the text from the Nth group
  \g1 or \g{1}, \g2 ...    Matches the text from the Nth group
  \g-1 or \g{-1}, \g-2 ... Matches the text from the Nth previous group
  \g{name}     Named backreference
  \k<name>     Named backreference
  \k'name'     Named backreference
  (?P=name)    Named backreference (Python syntax)

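A brief sketch of the backreference forms and of non-capturing
grouping; the inputs and variable names are illustrative assumptions:

  use strict;
  use warnings;

  my $repeat   = ("hello hello world" =~ /\b(\w+) \1\b/) ? 1 : 0;   # 1
  my $relative = ("abab" =~ /(ab)\g{-1}/)                ? 1 : 0;   # 1

  # \k<q> must match the same quote character that group "q" captured
  my $quoted;
  $quoted = $2 if q{say "hi" now} =~ /(?<q>["'])(.*?)\k<q>/;        # hi

  # (?:...) groups without capturing, so $1 holds the digits
  my ($num) = "item-42" =~ /(?:item|part)-(\d+)/;                   # 42
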
=head2 ESCAPE SEQUENCES

These work as in normal strings.

  \a       Alarm (beep)
  \e       Escape
  \f       Formfeed
  \n       Newline
  \r       Carriage return
  \t       Tab
  \037     Any octal ASCII value
  \x7f     Any hexadecimal ASCII value
  \x{263a} A wide hexadecimal value
  \cx      Control-x
  \N{name} A named character
  \N{U+263D} A Unicode character by hex ordinal

  \l  Lowercase next character
  \u  Titlecase next character
  \L  Lowercase until \E
  \U  Uppercase until \E
  \Q  Disable pattern metacharacters until \E
  \E  End modification

For Titlecase, see L</Titlecase>.

This one works differently from normal strings:

  \b  An assertion, not backspace, except in a character class

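A small sketch of the case-modification and quoting escapes, and of
the two meanings of C<\b>. The sample values are assumptions chosen
for the example:

  use strict;
  use warnings;

  my $name  = "aLICE";
  my $fixed = "\u\L$name\E!";        # "Alice!"

  my $literal = "a.c";
  my $loose = ("abc" =~ /^$literal$/)     ? 1 : 0;  # 1: "." is a wildcard
  my $exact = ("abc" =~ /^\Q$literal\E$/) ? 1 : 0;  # 0: "." must be a dot

  my $tab       = ("a\tb" =~ /a\tb/) ? 1 : 0;  # 1: \t is a tab in both
  my $backspace = ("\b"   =~ /[\b]/) ? 1 : 0;  # 1: \b in a class is backspace
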
=head2 CHARACTER CLASSES

  [amy]    Match 'a', 'm' or 'y'
  [f-j]    Dash specifies "range"
  [f-j-]   Dash escaped or at start or end means 'dash'
  [^f-j]   Caret indicates "match any character _except_ these"

The following sequences (except C<\N>) work within or without a character class.
The first six are locale aware, all are Unicode aware. See L<perllocale>
and L<perlunicode> for details.

  \d      A digit
  \D      A nondigit
  \w      A word character
  \W      A non-word character
  \s      A whitespace character
  \S      A non-whitespace character
  \h      A horizontal whitespace
  \H      A non horizontal whitespace
  \N      A non newline (when not followed by '{NAME}'; experimental;
          not valid in a character class; equivalent to [^\n]; it's
          like '.' without /s modifier)
  \v      A vertical whitespace
  \V      A non vertical whitespace
  \R      A generic newline           (?>\v|\x0D\x0A)

  \C      Match a byte (with Unicode, '.' matches a character)
  \pP     Match P-named (Unicode) property
  \p{...} Match Unicode property with name longer than 1 character
  \PP     Match non-P
  \P{...} Match lack of Unicode property with name longer than 1 char
  \X      Match Unicode extended grapheme cluster

POSIX character classes and their Unicode and Perl equivalents:

          ASCII-           Full-
          range            range     backslash
  POSIX   \p{...}          \p{}      sequence  Description
  ---------------------------------------------------------------------
  alnum   PosixAlnum       Alnum               Alpha plus Digit
  alpha   PosixAlpha       Alpha               Alphabetic characters
  ascii   ASCII                                Any ASCII character
  blank   PosixBlank       Blank     \h        Horizontal whitespace;
                                               full-range also written
                                               as \p{HorizSpace} (GNU
                                               extension)
  cntrl   PosixCntrl       Cntrl               Control characters
  digit   PosixDigit       Digit     \d        Decimal digits
  graph   PosixGraph       Graph               Alnum plus Punct
  lower   PosixLower       Lower               Lowercase characters
  print   PosixPrint       Print               Graph plus Space, but not
                                               any Cntrls
  punct   PosixPunct       Punct               These aren't precisely
                                               equivalent. See NOTE,
                                               below.
  space   PosixSpace       Space     [\s\cK]   Whitespace
          PerlSpace        SpacePerl \s        Perl's whitespace
                                               definition
  upper   PosixUpper       Upper               Uppercase characters
  word    PerlWord         Word      \w        Alnum plus '_' (Perl
                                               extension)
  xdigit  ASCII_Hex_Digit  XDigit              Hexadecimal digit,
                                               ASCII-range is
                                               [0-9A-Fa-f]

NOTE on C<[[:punct:]]>, C<\p{PosixPunct}> and C<\p{Punct}>:
In the ASCII range, C<[[:punct:]]> and C<\p{PosixPunct}> match
C<[-!"#$%&'()*+,./:;<=E<gt>?@[\\\]^_`{|}~]> (although if a locale is in
effect, it could alter the behavior of C<[[:punct:]]>); and C<\p{Punct}>
matches C<[-!"#%&'()*,./:;?@[\\\]_{}]>. When matching a UTF-8 string,
C<[[:punct:]]> matches what it does in the ASCII range, plus what
C<\p{Punct}> matches. C<\p{Punct}> matches anything that isn't a
control, an alphanumeric, a space, or a symbol.

Within a character class:

  POSIX      traditional   Unicode
  [:digit:]       \d       \p{Digit}
  [:^digit:]      \D       \P{Digit}

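The following sketch applies a few of these classes to an assumed
ASCII sample string with no locale in effect; the variable names are
illustrative:

  use strict;
  use warnings;

  my $s = "Tel:\t555-0100 ext. 7";

  my @runs   = $s =~ /(\d+)/g;                # ("555", "0100", "7")
  my $digits = () = $s =~ /[[:digit:]]/g;     # 8 single digits
  my $tab    = ($s =~ /\h/) ? 1 : 0;          # 1: tab is horizontal space
  my @punct  = $s =~ /([[:punct:]])/g;        # (":", "-", ".")
  my $none   = ("abc" =~ /^[^f-j]+$/) ? 1 : 0;   # 1: nothing in f..j

Note that a POSIX class such as C<[:digit:]> is only valid inside a
bracketed class, hence the doubled brackets.
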
=head2 ANCHORS

All are zero-width assertions.

  ^  Match string start (or line, if /m is used)
  $  Match string end (or line, if /m is used) or before newline
  \b Match word boundary (between \w and \W)
  \B Match except at word boundary (between \w and \w or \W and \W)
  \A Match string start (regardless of /m)
  \Z Match string end (before optional newline)
  \z Match absolute string end
  \G Match where previous m//g left off
  \K Keep the stuff left of the \K, don't include it in $&

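A sketch of the less obvious anchors: C<\Z> versus C<\z>, word
boundaries, a C<\G> loop, and C<\K> in a substitution. The token
labels C<NUM> and C<OP> and all sample data are assumptions made for
the example:

  use strict;
  use warnings;

  my $line = "foo\n";
  my $Z = ($line =~ /foo\Z/) ? 1 : 0;   # 1: \Z tolerates a final newline
  my $z = ($line =~ /foo\z/) ? 1 : 0;   # 0: \z is the absolute end

  my $word  = ("a cat sat" =~ /\bcat\b/) ? 1 : 0;   # 1
  my $inner = ("concat"    =~ /\bcat\b/) ? 1 : 0;   # 0

  # \G with /gc: each match resumes where the previous one ended
  my $src = "12+34";
  my @tokens;
  while (1) {
      if    ($src =~ /\G(\d+)/gc)  { push @tokens, "NUM:$1" }
      elsif ($src =~ /\G([-+])/gc) { push @tokens, "OP:$1" }
      else                         { last }
  }

  # \K: "dir/" must match but is not part of what gets replaced
  (my $path = "dir/file.txt") =~ s{dir/\Kfile}{FILE};
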
=head2 QUANTIFIERS

Quantifiers are greedy by default and match the B<longest> leftmost.

  Maximal Minimal Possessive Allowed range
  ------- ------- ---------- -------------
  {n,m}   {n,m}?  {n,m}+     Must occur at least n times
                             but no more than m times
  {n,}    {n,}?   {n,}+      Must occur at least n times
  {n}     {n}?    {n}+       Must occur exactly n times
  *       *?      *+         0 or more times (same as {0,})
  +       +?      ++         1 or more times (same as {1,})
  ?       ??      ?+         0 or 1 time (same as {0,1})

The possessive forms (new in Perl 5.10) prevent backtracking: what gets
matched by a pattern with a possessive quantifier will not be backtracked
into, even if that causes the whole match to fail.

There is no quantifier C<{,n}>. That's interpreted as a literal string.

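To make the three columns concrete, here is a minimal sketch comparing
maximal, minimal and possessive matching on assumed sample strings:

  use strict;
  use warnings;

  my $s = "<b>bold</b>";
  my ($greedy) = $s =~ /<(.+)>/;     # "b>bold</b": as much as possible
  my ($lazy)   = $s =~ /<(.+?)>/;    # "b": as little as possible

  my $gives_back = ("aaa" =~ /a+a/)  ? 1 : 0;   # 1: a+ releases one "a"
  my $possessive = ("aaa" =~ /a++a/) ? 1 : 0;   # 0: a++ keeps all three

The possessive match fails precisely because C<a++> refuses to hand
back the final "a" that the trailing C<a> needs.
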
=head2 EXTENDED CONSTRUCTS

  (?#text)          A comment
  (?:...)           Groups subexpressions without capturing (cluster)
  (?pimsx-imsx:...) Enable/disable option (as per m// modifiers)
  (?=...)           Zero-width positive lookahead assertion
  (?!...)           Zero-width negative lookahead assertion
  (?<=...)          Zero-width positive lookbehind assertion
  (?<!...)          Zero-width negative lookbehind assertion
  (?>...)           Grab what we can, prohibit backtracking
  (?|...)           Branch reset
  (?<name>...)      Named capture
  (?'name'...)      Named capture
  (?P<name>...)     Named capture (Python syntax)
  (?{ code })       Embedded code, return value becomes $^R
  (??{ code })      Dynamic regex, return value used as regex
  (?N)              Recurse into subpattern number N
  (?-N), (?+N)      Recurse into Nth previous/next subpattern
  (?R), (?0)        Recurse at the beginning of the whole pattern
  (?&name)          Recurse into a named subpattern
  (?P>name)         Recurse into a named subpattern (Python syntax)
  (?(cond)yes|no)
  (?(cond)yes)      Conditional expression, where "cond" can be:
                    (N)       subpattern N has matched something
                    (<name>)  named subpattern has matched something
                    ('name')  named subpattern has matched something
                    (?{code}) code condition
                    (R)       true if recursing
                    (RN)      true if recursing into Nth subpattern
                    (R&name)  true if recursing into named subpattern
                    (DEFINE)  always false, no no-pattern allowed

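A sketch touching several of these constructs: named capture,
lookaround, branch reset, recursion and a conditional. It assumes
Perl 5.10 or later; the patterns and data are illustrative, not taken
from the reference:

  use strict;
  use warnings;

  # Named captures are read back through %+
  my %date;
  if ("2009-10-31" =~ /(?<year>\d{4})-(?<month>\d\d)-(?<day>\d\d)/) {
      %date = %+;
  }

  # Lookaround asserts context without consuming it
  my ($amount) = "cost: USD42" =~ /(?<=USD)(\d+)/;      # "42"
  my $no_bar   = ("foobaz" =~ /foo(?!bar)/) ? 1 : 0;    # 1

  # Branch reset: either alternative captures into $1
  my ($val) = "key=7" =~ /(?|id=(\d+)|key=(\d+))/;      # "7"

  # (?1) recurses into group 1 to match nested parentheses
  my $nested = qr/^(\((?:[^()]++|(?1))*\))$/;
  my $ok  = ("(a(b)c)" =~ $nested) ? 1 : 0;             # 1
  my $bad = ("(a(b)c"  =~ $nested) ? 1 : 0;             # 0

  # Conditional: demand ")" only if group 1 matched "("
  my $cond = qr/^(\()?\d+(?(1)\))$/;
  my @verdicts = map { /$cond/ ? 1 : 0 } "(42)", "42", "(42";  # (1, 1, 0)
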
=head2 VARIABLES

  $_    Default variable for operators to use

  $`    Everything prior to matched string
  $&    Entire matched string
  $'    Everything after the matched string

  ${^PREMATCH}   Everything prior to matched string
  ${^MATCH}      Entire matched string
  ${^POSTMATCH}  Everything after the matched string

The use of C<$`>, C<$&> or C<$'> will slow down B<all> regex use
within your program. Consult L<perlvar> for C<@->
to see equivalent expressions that won't cause a slowdown.
See also L<Devel::SawAmpersand>. Starting with Perl 5.10, you
can also use the equivalent variables C<${^PREMATCH}>, C<${^MATCH}>
and C<${^POSTMATCH}>, but for them to be defined, you have to
specify the C</p> (preserve) modifier on your regular expression.

  $1, $2 ...  hold the Xth captured expr
  $+    Last parenthesized pattern match
  $^N   Holds the most recently closed capture
  $^R   Holds the result of the last (?{...}) expr
  @-    Offsets of starts of groups. $-[0] holds start of whole match
  @+    Offsets of ends of groups. $+[0] holds end of whole match
  %+    Named capture buffers
  %-    Named capture buffers, as array refs

Captured groups are numbered according to their I<opening> paren.

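As a minimal sketch of the C</p> variables, the offset arrays and
C<%+>, assuming Perl 5.10 or later and an invented sample sentence:

  use strict;
  use warnings;

  my $s = "The quick brown fox";
  my ($pre, $match, $post, $start, $end, $second, $named, $last);

  if ($s =~ /(?<adj>quick) (brown)/p) {
      ($pre, $match, $post) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});
      ($start, $end) = ($-[0], $+[0]);                # 4 and 15
      $second = substr($s, $-[2], $+[2] - $-[2]);     # same text as $2
      $named  = $+{adj};                              # "quick"
      $last   = $+;                                   # "brown"
  }

The C<substr> call shows how C<@-> and C<@+> can recover a group's
text without touching C<$&> and its relatives.
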
=head2 FUNCTIONS

  lc          Lowercase a string
  lcfirst     Lowercase first char of a string
  uc          Uppercase a string
  ucfirst     Titlecase first char of a string

  pos         Return or set current match position
  quotemeta   Quote metacharacters
  reset       Reset ?pattern? status
  study       Analyze string for optimizing matching

  split       Use a regex to split a string into parts

The first four of these are like the escape sequences C<\L>, C<\l>,
C<\U>, and C<\u>. For Titlecase, see L</Titlecase>.

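A short sketch of several of these functions working alongside a
regex; the inputs and variable names are assumptions for the example:

  use strict;
  use warnings;

  my $title = ucfirst(lc("pERL"));           # "Perl"

  my $str   = "aXbXc";
  my $hit   = $str =~ /X/g;                  # scalar /g: first "X" only
  my $where = pos($str);                     # 2, just past that "X"

  my $pat   = quotemeta("1+1=2");            # backslashes "+" and "="
  my $found = ("so 1+1=2 holds" =~ /$pat/) ? 1 : 0;   # 1

  my @fields = split /\s*,\s*/, "a , b,c";   # ("a", "b", "c")
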
=head2 TERMINOLOGY

=head3 Titlecase

A Unicode concept which is most often equal to uppercase, but for
certain characters like the German "sharp s" there is a difference.

=head1 AUTHOR

Iain Truskett. Updated by the Perl 5 Porters.

This document may be distributed under the same terms as Perl itself.

=head1 SEE ALSO

=over 4

=item *

L<perlretut> for a tutorial on regular expressions.

=item *

L<perlrequick> for a rapid tutorial.

=item *

L<perlre> for more details.

=item *

L<perlvar> for details on the variables.

=item *

L<perlop> for details on the operators.

=item *

L<perlfunc> for details on the functions.

=item *

L<perlfaq6> for FAQs on regular expressions.

=item *

L<perlrebackslash> for a reference on backslash sequences.

=item *

L<perlrecharclass> for a reference on character classes.

=item *

The L<re> module to alter behaviour and aid
debugging.

=item *

L<perldebug/"Debugging regular expressions">

=item *

L<perluniintro>, L<perlunicode>, L<charnames> and L<perllocale>
for details on regexes and internationalisation.

=item *

I<Mastering Regular Expressions> by Jeffrey Friedl
(F<http://oreilly.com/catalog/9780596528126/>) for a thorough grounding and
reference on the topic.

=back

=head1 THANKS

David P.C. Wollmann,
Richard Soderberg,
Sean M. Burke,
Tom Christiansen,
Jim Cromie,
and
Jeffrey Goff
for useful advice.

=cut