| 1 | =head1 NAME |
| 2 | |
| 3 | perlreref - Perl Regular Expressions Reference |
| 4 | |
| 5 | =head1 DESCRIPTION |
| 6 | |
| 7 | This is a quick reference to Perl's regular expressions. |
| 8 | For full information see L<perlre> and L<perlop>, as well |
| 9 | as the L</"SEE ALSO"> section in this document. |
| 10 | |
| 11 | =head2 OPERATORS |
| 12 | |
| 13 | C<=~> determines to which variable the regex is applied. |
| 14 | In its absence, $_ is used. |
| 15 | |
| 16 | $var =~ /foo/; |
| 17 | |
| 18 | C<!~> determines to which variable the regex is applied, |
| 19 | and negates the result of the match; it returns |
| 20 | false if the match succeeds, and true if it fails. |
| 21 | |
| 22 | $var !~ /foo/; |
| 23 | |
| 24 | C<m/pattern/msixpogcdual> searches a string for a pattern match, |
| 25 | applying the given options. |
| 26 | |
| 27 | m Multiline mode - ^ and $ match internal lines |
| 28 | s match as a Single line - . matches \n |
| 29 | i case-Insensitive |
| 30 | x eXtended legibility - free whitespace and comments |
| 31 | p Preserve a copy of the matched string - |
| 32 | ${^PREMATCH}, ${^MATCH}, ${^POSTMATCH} will be defined. |
| 33 | o compile pattern Once |
| 34 | g Global - all occurrences |
| 35 | c don't reset pos on failed matches when using /g |
| 36 | a restrict \d, \s, \w and [:posix:] to match ASCII only |
| 37 | aa (two a's) also, under /i, never match an ASCII character against a non-ASCII one |
| 38 | l match according to current locale |
| 39 | u match according to Unicode rules |
| 40 | d match according to native rules unless something indicates |
| 41 | Unicode |
| 42 | |
| 43 | If 'pattern' is an empty string, the last I<successfully> matched |
| 44 | regex is used. Delimiters other than '/' may be used for both this |
| 45 | operator and the following ones. The leading C<m> can be omitted |
| 46 | if the delimiter is '/'. |
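A minimal sketch of a few of the modifiers listed above. The sample text and variable names are illustrative assumptions, not part of the reference.

```perl
use strict;
use warnings;

# Hypothetical sample text, chosen only for illustration.
my $text = "Foo one\nfoo two\nbar three\n";

# /i ignores case; in list context, /g returns every match.
my @foos = $text =~ /foo/gi;               # ('Foo', 'foo')

# /m lets ^ match at the start of each internal line.
my @first_words = $text =~ /^(\w+)/mg;     # ('Foo', 'foo', 'bar')

# /x permits whitespace and comments inside the pattern.
my ($word, $next) = $text =~ /
    ^ (bar)      # the word bar at the start of a line
    \s+ (\w+)    # followed by another word
/mx;

# In scalar context, /g resumes from pos() on each call.
my $o_count = 0;
$o_count++ while $text =~ /o/g;

print "@foos | @first_words | $word $next | $o_count\n";
```

Combining modifiers (here C</mg> and C</mx>) is allowed in any order.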
| 47 | |
| 48 | C<qr/pattern/msixpodual> lets you store a regex in a variable, |
| 49 | or pass one around. Modifiers are as for C<m//> and are stored |
| 50 | within the regex. |
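As a hedged illustration of how modifiers travel with a C<qr//> object, the sketch below compiles a pattern with C</i> and then embeds it in a larger, case-sensitive pattern. The strings and variable names are made up for the example.

```perl
use strict;
use warnings;

# Compile once; the /i modifier is stored inside the regex object.
my $word = qr/perl/i;

# A qr// object can be used directly or interpolated into a larger pattern.
my $direct   = "I like PERL" =~ $word        ? 1 : 0;
my $embedded = "Perl5"       =~ /^${word}\d$/ ? 1 : 0;

# The stored /i applies only to the embedded part: the outer X stays
# case-sensitive, so a lowercase x does not match.
my $outer = "PERLx" =~ /^${word}X$/ ? 1 : 0;

print "$direct $embedded $outer\n";
```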
| 51 | |
| 52 | C<s/pattern/replacement/msixpogcedualr> substitutes matches of |
| 53 | 'pattern' with 'replacement'. Modifiers as for C<m//>, |
| 54 | with two additions: |
| 55 | |
| 56 | e Evaluate 'replacement' as an expression |
| 57 | r Return substitution and leave the original string untouched. |
| 58 | |
| 59 | 'e' may be specified multiple times. 'replacement' is interpreted |
| 60 | as a double quoted string unless a single-quote (C<'>) is the delimiter. |
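The following sketch shows the two additional modifiers and the single-quote delimiter. It assumes Perl 5.14 or later for C</r>; the sample string is hypothetical.

```perl
use strict;
use warnings;

my $price = "cost: 10 units";

# /r returns the modified copy and leaves $price untouched.
my $renamed = $price =~ s/units/items/r;

# /e evaluates the replacement as a Perl expression.
(my $doubled = $price) =~ s/(\d+)/$1 * 2/e;

# With ' as the delimiter, the replacement is not interpolated,
# so the two characters "$1" are inserted literally.
(my $literal = $price) =~ s'10'$1';

print "$renamed\n$doubled\n$literal\n$price\n";
```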
| 61 | |
| 62 | C<m?pattern?> is like C<m/pattern/> but matches only once. No alternate |
| 63 | delimiters can be used. Must be reset with reset(). |
| 64 | |
| 65 | =head2 SYNTAX |
| 66 | |
| 67 | \ Escapes the character immediately following it |
| 68 | . Matches any single character except a newline (unless /s is |
| 69 | used) |
| 70 | ^ Matches at the beginning of the string (or line, if /m is used) |
| 71 | $ Matches at the end of the string (or line, if /m is used) |
| 72 | * Matches the preceding element 0 or more times |
| 73 | + Matches the preceding element 1 or more times |
| 74 | ? Matches the preceding element 0 or 1 times |
| 75 | {...} Specifies a range of occurrences for the element preceding it |
| 76 | [...] Matches any one of the characters contained within the brackets |
| 77 | (...) Groups subexpressions for capturing to $1, $2... |
| 78 | (?:...) Groups subexpressions without capturing (cluster) |
| 79 | | Matches either the subexpression preceding or following it |
| 80 | \g1 or \g{1}, \g2 ... Matches the text from the Nth group |
| 81 | \1, \2, \3 ... Matches the text from the Nth group |
| 82 | \g-1 or \g{-1}, \g-2 ... Matches the text from the Nth previous group |
| 83 | \g{name} Named backreference |
| 84 | \k<name> Named backreference |
| 85 | \k'name' Named backreference |
| 86 | (?P=name) Named backreference (python syntax) |
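A short sketch of the backreference forms above (numbered, relative and named). The input strings are invented for the example.

```perl
use strict;
use warnings;

# \g1 refers to the text captured by group 1: find doubled words.
my $text    = "this is is a test test";
my @doubled = $text =~ /\b(\w+) \g1\b/g;          # ('is', 'test')

# \g{-1} and \g{-2} count backwards from the current position.
my $palindrome = "abba" =~ /^(\w)(\w)\g{-1}\g{-2}$/ ? 1 : 0;

# \k<name> refers to a named group: the closing quote must
# be the same character as the opening one.
my $quoted = q{say "hi" or 'bye'};
my @strings;
while ($quoted =~ /(?<q>["'])(.*?)\k<q>/g) {
    push @strings, $2;
}

print "@doubled | $palindrome | @strings\n";
```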
| 87 | |
| 88 | =head2 ESCAPE SEQUENCES |
| 89 | |
| 90 | These work as in normal strings. |
| 91 | |
| 92 | \a Alarm (beep) |
| 93 | \e Escape |
| 94 | \f Formfeed |
| 95 | \n Newline |
| 96 | \r Carriage return |
| 97 | \t Tab |
| 98 | \037 Char whose ordinal is the 3 octal digits, max \777 |
| 99 | \o{2307} Char whose ordinal is the octal number, unrestricted |
| 100 | \x7f Char whose ordinal is the 2 hex digits, max \xFF |
| 101 | \x{263a} Char whose ordinal is the hex number, unrestricted |
| 102 | \cx Control-x |
| 103 | \N{name} A named Unicode character or character sequence |
| 104 | \N{U+263D} A Unicode character by hex ordinal |
| 105 | |
| 106 | \l Lowercase next character |
| 107 | \u Titlecase next character |
| 108 | \L Lowercase until \E |
| 109 | \U Uppercase until \E |
| 110 | \F Foldcase until \E |
| 111 | \Q Disable pattern metacharacters until \E |
| 112 | \E End modification |
| 113 | |
| 114 | For Titlecase, see L</Titlecase>. |
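The case-modifying and quoting escapes can be sketched as follows. The names and strings are illustrative assumptions only.

```perl
use strict;
use warnings;

my $name = "john SMITH";

# \L lowercases up to \E; \u then titlecases the first character.
my $title = "\u\L$name\E";          # "John smith"

# \U uppercases up to \E.
my $shout = "\U$name\E!";           # "JOHN SMITH!"

# \Q ... \E disables metacharacters in interpolated text.
my $literal  = "a.b*c";
my $quoted   = "aXbbc" =~ /^\Q$literal\E$/ ? 1 : 0;   # treated literally
my $unquoted = "aXbbc" =~ /^$literal$/      ? 1 : 0;   # . and * are special

# \x{...} and \N{U+...} name the same character by hex ordinal.
my $same = "\x{263a}" eq "\N{U+263A}" ? 1 : 0;

print "$title | $shout | $quoted $unquoted $same\n";
```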
| 115 | |
| 116 | This one works differently from normal strings: |
| 117 | |
| 118 | \b An assertion, not backspace, except in a character class |
| 119 | |
| 120 | =head2 CHARACTER CLASSES |
| 121 | |
| 122 | [amy] Match 'a', 'm' or 'y' |
| 123 | [f-j] Dash specifies "range" |
| 124 | [f-j-] Dash escaped or at start or end means 'dash' |
| 125 | [^f-j] Caret indicates "match any character _except_ these" |
| 126 | |
| 127 | The following sequences (except C<\N>) work within or without a character class. |
| 128 | The first six are locale aware, all are Unicode aware. See L<perllocale> |
| 129 | and L<perlunicode> for details. |
| 130 | |
| 131 | \d A digit |
| 132 | \D A nondigit |
| 133 | \w A word character |
| 134 | \W A non-word character |
| 135 | \s A whitespace character |
| 136 | \S A non-whitespace character |
| 137 | \h A horizontal whitespace character |
| 138 | \H A non-horizontal-whitespace character |
| 139 | \N A non-newline character (when not followed by '{NAME}'; |
| 140 | not valid in a character class; equivalent to [^\n]; it's |
| 141 | like '.' without the /s modifier) |
| 142 | \v A vertical whitespace character |
| 143 | \V A non-vertical-whitespace character |
| 144 | \R A generic newline (?>\v|\x0D\x0A) |
| 145 | |
| 146 | \C Match a byte (with Unicode, '.' matches a character) |
| 147 | (Deprecated.) |
| 148 | \pP Match P-named (Unicode) property |
| 149 | \p{...} Match Unicode property with name longer than 1 character |
| 150 | \PP Match non-P |
| 151 | \P{...} Match lack of Unicode property with name longer than 1 char |
| 152 | \X Match Unicode extended grapheme cluster |
| 153 | |
| 154 | POSIX character classes and their Unicode and Perl equivalents: |
| 155 | |
| 156 |            ASCII-           Full- |
| 157 | POSIX      range            range         backslash |
| 158 | [[:...:]]  \p{...}          \p{...}       sequence  Description |
| 159 | |
| 160 | ----------------------------------------------------------------------- |
| 161 | alnum      PosixAlnum       XPosixAlnum             Alpha plus Digit |
| 162 | alpha      PosixAlpha       XPosixAlpha             Alphabetic characters |
| 163 | ascii      ASCII                                    Any ASCII character |
| 164 | blank      PosixBlank       XPosixBlank   \h        Horizontal whitespace; |
| 165 |                                                       full-range also |
| 166 |                                                       written as |
| 167 |                                                       \p{HorizSpace} (GNU |
| 168 |                                                       extension) |
| 169 | cntrl      PosixCntrl       XPosixCntrl             Control characters |
| 170 | digit      PosixDigit       XPosixDigit   \d        Decimal digits |
| 171 | graph      PosixGraph       XPosixGraph             Alnum plus Punct |
| 172 | lower      PosixLower       XPosixLower             Lowercase characters |
| 173 | print      PosixPrint       XPosixPrint             Graph plus Space, but |
| 174 |                                                       not any Cntrls |
| 175 | punct      PosixPunct       XPosixPunct             Punctuation and Symbols |
| 176 |                                                       in ASCII-range; just |
| 177 |                                                       punct outside it |
| 178 | space      PosixSpace       XPosixSpace             [\s\cK] |
| 179 |            PerlSpace        XPerlSpace    \s        Perl's whitespace def'n |
| 180 | upper      PosixUpper       XPosixUpper             Uppercase characters |
| 181 | word       PosixWord        XPosixWord    \w        Alnum + Unicode marks + |
| 182 |                                                       connectors, like '_' |
| 183 |                                                       (Perl extension) |
| 184 | xdigit     ASCII_Hex_Digit  XPosixXDigit            Hexadecimal digit, |
| 185 |                                                       ASCII-range is |
| 186 |                                                       [0-9A-Fa-f] |
| 187 | |
| 188 | There are also various synonyms, like C<\p{Alpha}> for C<\p{XPosixAlpha}>; all are listed |
| 189 | in L<perluniprops/Properties accessible through \p{} and \P{}>. |
| 190 | |
| 191 | Within a character class: |
| 192 | |
| 193 | POSIX traditional Unicode |
| 194 | [:digit:] \d \p{Digit} |
| 195 | [:^digit:] \D \P{Digit} |
| 196 | |
| 197 | =head2 ANCHORS |
| 198 | |
| 199 | All are zero-width assertions. |
| 200 | |
| 201 | ^ Match string start (or line, if /m is used) |
| 202 | $ Match string end (or line, if /m is used) or before newline |
| 203 | \b Match word boundary (between \w and \W) |
| 204 | \B Match except at word boundary (between \w and \w or \W and \W) |
| 205 | \A Match string start (regardless of /m) |
| 206 | \Z Match string end (before optional newline) |
| 207 | \z Match absolute string end |
| 208 | \G Match where previous m//g left off |
| 209 | \K Keep the stuff left of the \K, don't include it in $& |
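A small sketch contrasting several of these anchors. The strings are hypothetical, and C<\K> assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

my $s = "one\ntwo\n";

# \Z allows a trailing newline; \z does not.
my $cap_z   = $s =~ /two\Z/ ? 1 : 0;     # matches
my $small_z = $s =~ /two\z/ ? 1 : 0;     # fails: "\n" follows

# With /m, $ matches before every newline.
my $line_ends = () = $s =~ /\w$/mg;      # 2

# \K discards what was matched so far, so only "bin" is replaced.
(my $path = "/usr/local/bin") =~ s{.*/\Kbin}{sbin};

# \G anchors each attempt where the previous /g match ended,
# so scanning stops at the first element that is not a number.
my $csv = "12,34,x,56";
my @nums;
push @nums, $1 while $csv =~ /\G(\d+),?/g;

print "$cap_z $small_z $line_ends | $path | @nums\n";
```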
| 210 | |
| 211 | =head2 QUANTIFIERS |
| 212 | |
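| 213 | Quantifiers are greedy by default: starting at the leftmost position where a match is possible, each one matches as B<many> times as it can while still letting the overall pattern match. |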
| 214 | |
| 215 | Maximal  Minimal  Possessive  Allowed range |
| 216 | -------  -------  ----------  ------------- |
| 217 | {n,m}    {n,m}?   {n,m}+      Must occur at least n times |
| 218 |                              but no more than m times |
| 219 | {n,}     {n,}?    {n,}+       Must occur at least n times |
| 220 | {n}      {n}?     {n}+        Must occur exactly n times |
| 221 | *        *?       *+          0 or more times (same as {0,}) |
| 222 | +        +?       ++          1 or more times (same as {1,}) |
| 223 | ?        ??       ?+          0 or 1 time (same as {0,1}) |
| 224 | |
| 225 | The possessive forms (new in Perl 5.10) prevent backtracking: what gets |
| 226 | matched by a pattern with a possessive quantifier will not be backtracked |
| 227 | into, even if that causes the whole match to fail. |
| 228 | |
| 229 | There is no quantifier C<{,n}>. That's interpreted as a literal string. |
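The three quantifier styles can be compared with a short sketch. The input strings are invented for illustration.

```perl
use strict;
use warnings;

my $html = "<b>bold</b> and <i>italic</i>";

# Maximal (greedy): runs to the last '>' in the string.
my ($greedy) = $html =~ /<(.+)>/;

# Minimal: stops at the first '>' that allows a match.
my ($lazy) = $html =~ /<(.+?)>/;

# Possessive: a++ takes every 'a' and never gives one back,
# so the trailing 'a' in the pattern can never match.
my $normal     = "aaa" =~ /a+a/  ? 1 : 0;
my $possessive = "aaa" =~ /a++a/ ? 1 : 0;

print "$greedy\n$lazy\n$normal $possessive\n";
```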
| 230 | |
| 231 | =head2 EXTENDED CONSTRUCTS |
| 232 | |
| 233 | (?#text) A comment |
| 234 | (?:...) Groups subexpressions without capturing (cluster) |
| 235 | (?pimsx-imsx:...) Enable/disable option (as per m// modifiers) |
| 236 | (?=...) Zero-width positive lookahead assertion |
| 237 | (?!...) Zero-width negative lookahead assertion |
| 238 | (?<=...) Zero-width positive lookbehind assertion |
| 239 | (?<!...) Zero-width negative lookbehind assertion |
| 240 | (?>...) Grab what we can, prohibit backtracking |
| 241 | (?|...) Branch reset |
| 242 | (?<name>...) Named capture |
| 243 | (?'name'...) Named capture |
| 244 | (?P<name>...) Named capture (python syntax) |
| 245 | (?{ code }) Embedded code, return value becomes $^R |
| 246 | (??{ code }) Dynamic regex, return value used as regex |
| 247 | (?N) Recurse into subpattern number N |
| 248 | (?-N), (?+N) Recurse into Nth previous/next subpattern |
| 249 | (?R), (?0) Recurse at the beginning of the whole pattern |
| 250 | (?&name) Recurse into a named subpattern |
| 251 | (?P>name) Recurse into a named subpattern (python syntax) |
| 252 | (?(cond)yes|no) |
| 253 | (?(cond)yes) Conditional expression, where "cond" can be: |
| 254 | (?=pat) look-ahead |
| 255 | (?!pat) negative look-ahead |
| 256 | (?<=pat) look-behind |
| 257 | (?<!pat) negative look-behind |
| 258 | (N) subpattern N has matched something |
| 259 | (<name>) named subpattern has matched something |
| 260 | ('name') named subpattern has matched something |
| 261 | (?{code}) code condition |
| 262 | (R) true if recursing |
| 263 | (RN) true if recursing into Nth subpattern |
| 264 | (R&name) true if recursing into named subpattern |
| 265 | (DEFINE) always false, no no-pattern allowed |
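A sketch of four of the constructs above: lookaround, named capture, recursion and a conditional. The patterns and inputs are illustrative assumptions, and Perl 5.10 or later is assumed.

```perl
use strict;
use warnings;

# Lookbehind plus lookahead: insert a comma at every position that is
# preceded by a digit and followed by a multiple of three digits.
(my $number = "1234567") =~ s/(?<=\d)(?=(?:\d{3})+\z)/,/g;

# Named captures are read back through %+.
my %date;
if ("2024-05-17" =~ /(?<y>\d{4})-(?<m>\d\d)-(?<d>\d\d)/) {
    %date = %+;
}

# (?1) recurses into group 1, which matches balanced parentheses.
my $balanced = qr/^(\((?:[^()]++|(?1))*\))$/;
my $ok  = "(a(b)c)" =~ $balanced ? 1 : 0;
my $bad = "(a(b c"  =~ $balanced ? 1 : 0;

# Conditional: require ')' only if group 1 matched '('.
my $cond = qr/^(\()?\d+(?(1)\))$/;
my @results = map { $_ =~ $cond ? 1 : 0 } "(42)", "42", "(42", "42)";

print "$number | $date{y}/$date{m}/$date{d} | $ok $bad | @results\n";
```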
| 266 | |
| 267 | =head2 VARIABLES |
| 268 | |
| 269 | $_ Default variable for operators to use |
| 270 | |
| 271 | $` Everything prior to matched string |
| 272 | $& Entire matched string |
| 273 | $' Everything after matched string |
| 274 | |
| 275 | ${^PREMATCH} Everything prior to matched string |
| 276 | ${^MATCH} Entire matched string |
| 277 | ${^POSTMATCH} Everything after matched string |
| 278 | |
| 279 | Note to those still using Perl 5.18 or earlier: |
| 280 | The use of C<$`>, C<$&> or C<$'> will slow down B<all> regex use |
| 281 | within your program. Consult L<perlvar> for C<@-> |
| 282 | to see equivalent expressions that won't cause a slowdown. |
| 283 | See also L<Devel::SawAmpersand>. Starting with Perl 5.10, you |
| 284 | can also use the equivalent variables C<${^PREMATCH}>, C<${^MATCH}> |
| 285 | and C<${^POSTMATCH}>, but for them to be defined, you have to |
| 286 | specify the C</p> (preserve) modifier on your regular expression. |
| 287 | From Perl 5.20 onward, using C<$`>, C<$&> and C<$'> makes no speed difference. |
| 288 | |
| 289 | $1, $2 ... hold the Xth captured expr |
| 290 | $+ Last parenthesized pattern match |
| 291 | $^N Holds the most recently closed capture |
| 292 | $^R Holds the result of the last (?{...}) expr |
| 293 | @- Offsets of starts of groups. $-[0] holds start of whole match |
| 294 | @+ Offsets of ends of groups. $+[0] holds end of whole match |
| 295 | %+ Named capture groups |
| 296 | %- Named capture groups, as array refs |
| 297 | |
| 298 | Captured groups are numbered according to their I<opening> paren. |
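The match variables can be inspected with a sketch like the one below. The input string and variable names are made up; C</p> and C<%+> assume Perl 5.10 or later.

```perl
use strict;
use warnings;

my $s = "say key=value now";
my ($pre, $match, $post, $start, $end, $value, $key, $last);

if ($s =~ /(?<k>\w+)=(?<v>\w+)/p) {
    # /p makes the ${^...} variables available.
    ($pre, $match, $post) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});

    # @- and @+ give offsets: index 0 is the whole match, 2 is group 2.
    ($start, $end) = ($-[0], $+[0]);
    $value = substr($s, $-[2], $+[2] - $-[2]);

    $key  = $+{k};    # named capture
    $last = $+;       # last parenthesized group that matched
}

# Groups are numbered by their opening parenthesis.
my @nested = "ab" =~ /((a)(b))/;     # ('ab', 'a', 'b')

print "[$pre][$match][$post] $start-$end $key=$value $last | @nested\n";
```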
| 299 | |
| 300 | =head2 FUNCTIONS |
| 301 | |
| 302 | lc Lowercase a string |
| 303 | lcfirst Lowercase first char of a string |
| 304 | uc Uppercase a string |
| 305 | ucfirst Titlecase first char of a string |
| 306 | fc Foldcase a string |
| 307 | |
| 308 | pos Return or set current match position |
| 309 | quotemeta Quote metacharacters |
| 310 | reset Reset m?pattern? status |
| 311 | study Analyze string for optimizing matching |
| 312 | |
| 313 | split Use a regex to split a string into parts |
| 314 | |
| 315 | The first five of these are like the escape sequences C<\L>, C<\l>, |
| 316 | C<\U>, C<\u>, and C<\F>. For Titlecase, see L</Titlecase>; for |
| 317 | Foldcase, see L</Foldcase>. |
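A few of these functions in a minimal sketch. The inputs are hypothetical, and C<fc> assumes Perl 5.16 or later with the C<fc> feature enabled.

```perl
use strict;
use warnings;
use feature 'fc';

# Case functions.
my $proper = ucfirst(lc("pERL"));                  # "Perl"
my $lower1 = lcfirst("ABC");                       # "aBC"
my $equal  = fc("HeLLo") eq fc("hEllO") ? 1 : 0;   # caseless comparison

# quotemeta escapes regex metacharacters, like \Q in a pattern.
my $safe = quotemeta("a.b");                       # 'a\.b'

# split breaks a string apart on a regex.
my @parts = split /\s*,\s*/, "a , b,c";            # ('a', 'b', 'c')

# pos reports where the last m//g match on the string ended.
my $str   = "aXbXc";
my $found = $str =~ /X/g;
my $where = pos($str);                             # 2

print "$proper $lower1 $equal $safe @parts $where\n";
```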
| 318 | |
| 319 | =head2 TERMINOLOGY |
| 320 | |
| 321 | =head3 Titlecase |
| 322 | |
| 323 | Unicode concept which most often is equal to uppercase, but for |
| 324 | certain characters like the German "sharp s" there is a difference. |
| 325 | |
| 326 | =head3 Foldcase |
| 327 | |
| 328 | Unicode form that is useful when comparing strings regardless of case, |
| 329 | as certain characters have complex one-to-many case mappings. Primarily a |
| 330 | variant of lowercase. |
| 331 | |
| 332 | =head1 AUTHOR |
| 333 | |
| 334 | Iain Truskett. Updated by the Perl 5 Porters. |
| 335 | |
| 336 | This document may be distributed under the same terms as Perl itself. |
| 337 | |
| 338 | =head1 SEE ALSO |
| 339 | |
| 340 | =over 4 |
| 341 | |
| 342 | =item * |
| 343 | |
| 344 | L<perlretut> for a tutorial on regular expressions. |
| 345 | |
| 346 | =item * |
| 347 | |
| 348 | L<perlrequick> for a rapid tutorial. |
| 349 | |
| 350 | =item * |
| 351 | |
| 352 | L<perlre> for more details. |
| 353 | |
| 354 | =item * |
| 355 | |
| 356 | L<perlvar> for details on the variables. |
| 357 | |
| 358 | =item * |
| 359 | |
| 360 | L<perlop> for details on the operators. |
| 361 | |
| 362 | =item * |
| 363 | |
| 364 | L<perlfunc> for details on the functions. |
| 365 | |
| 366 | =item * |
| 367 | |
| 368 | L<perlfaq6> for FAQs on regular expressions. |
| 369 | |
| 370 | =item * |
| 371 | |
| 372 | L<perlrebackslash> for a reference on backslash sequences. |
| 373 | |
| 374 | =item * |
| 375 | |
| 376 | L<perlrecharclass> for a reference on character classes. |
| 377 | |
| 378 | =item * |
| 379 | |
| 380 | The L<re> module to alter behaviour and aid |
| 381 | debugging. |
| 382 | |
| 383 | =item * |
| 384 | |
| 385 | L<perldebug/"Debugging Regular Expressions"> |
| 386 | |
| 387 | =item * |
| 388 | |
| 389 | L<perluniintro>, L<perlunicode>, L<charnames> and L<perllocale> |
| 390 | for details on regexes and internationalisation. |
| 391 | |
| 392 | =item * |
| 393 | |
| 394 | I<Mastering Regular Expressions> by Jeffrey Friedl |
| 395 | (F<http://oreilly.com/catalog/9780596528126/>) for a thorough grounding and |
| 396 | reference on the topic. |
| 397 | |
| 398 | =back |
| 399 | |
| 400 | =head1 THANKS |
| 401 | |
| 402 | David P.C. Wollmann, |
| 403 | Richard Soderberg, |
| 404 | Sean M. Burke, |
| 405 | Tom Christiansen, |
| 406 | Jim Cromie, |
| 407 | and |
| 408 | Jeffrey Goff |
| 409 | for useful advice. |
| 410 | |
| 411 | =cut |