pod/perlreref.pod

   1 =head1 NAME
   2
   3 perlreref - Perl Regular Expressions Reference
   4
   5 =head1 DESCRIPTION
   6
   7 This is a quick reference to Perl's regular expressions.
   8 For full information see L<perlre> and L<perlop>, as well
   9 as the L</"SEE ALSO"> section in this document.
  10
  11 =head2 OPERATORS
  12
  13 C<=~> determines to which variable the regex is applied.
  14 In its absence, $_ is used.
  15
  16     $var =~ /foo/;
  17
  18 C<!~> determines to which variable the regex is applied,
  19 and negates the result of the match; it returns
  20 false if the match succeeds, and true if it fails.
  21
  22     $var !~ /foo/;
  23
  24 C<m/pattern/msixpogc> searches a string for a pattern match,
  25 applying the given options.
  26
  27     m  Multiline mode - ^ and $ match internal lines
  28     s  match as a Single line - . matches \n
  29     i  case-Insensitive
  30     x  eXtended legibility - free whitespace and comments
  31     p  Preserve a copy of the matched string -
  32        ${^PREMATCH}, ${^MATCH}, ${^POSTMATCH} will be defined.
  33     o  compile pattern Once
  34     g  Global - all occurrences
  35     c  don't reset pos on failed matches when using /g
  36
  37 If 'pattern' is an empty string, the last I<successfully> matched
  38 regex is used. Delimiters other than '/' may be used for both this
  39 operator and the following ones. The leading C<m> can be omitted
  40 if the delimiter is '/'.
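
The modifiers above can be combined freely. The following is a minimal sketch rather than an exhaustive demonstration; the sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = "Foo bar\nfoo baz\n";

# /g in list context returns every match; /i ignores case.
my @all = $text =~ /foo/gi;                  # ("Foo", "foo")

# /m lets ^ match at the start of each line, not only of the string.
my @line_starts = $text =~ /^(\w+)/mg;       # ("Foo", "foo")

# /s lets . match the newline, so the match can span two lines.
my ($span) = $text =~ /(bar.foo)/s;          # "bar\nfoo"

# /x ignores unescaped whitespace and allows comments in the pattern.
my ($word) = $text =~ / \b (ba\w) \b   # first three-letter word in "ba"
                      /x;                    # "bar"

# !~ negates the match: true because $text contains no digit.
my $no_digits = $text !~ /\d/;

print "all=@all line_starts=@line_starts word=$word\n";
```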
  41
  42 C<qr/pattern/msixpo> lets you store a regex in a variable,
  43 or pass one around. The modifiers are as for C<m//> and are stored
  44 within the regex.
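
As a short illustration of storing and reusing a compiled regex (the variable names and sample strings are hypothetical):

```perl
use strict;
use warnings;

# Compile once; the /i modifier travels with the regex object.
my $word = qr/perl/i;

my @hits = grep { $_ =~ $word } ("Perl 5", "python", "PERL!");
# @hits is ("Perl 5", "PERL!")

# A qr// object can be interpolated into a larger pattern
# and keeps its own modifiers there.
my $versioned = qr/$word\s+(\d+)/;
my ($major) = "I use PERL 5" =~ $versioned;   # "5"

print "hits: @hits; major: $major\n";
```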
  45
  46 C<s/pattern/replacement/msixpogcer> substitutes matches of
  47 'pattern' with 'replacement'. Modifiers as for C<m//>,
  48 with two additions:
  49
  50     e  Evaluate 'replacement' as an expression
  51     r  Return substitution and leave the original string untouched.
  52
  53 'e' may be specified multiple times. 'replacement' is interpreted
  54 as a double quoted string unless a single-quote (C<'>) is the delimiter.
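
A brief sketch of the two extra modifiers and of the single-quote delimiter. The strings are invented for the example, and C</r> assumes Perl 5.14 or later.

```perl
use strict;
use warnings;

my $price = "apples: 3, pears: 10";

# /e evaluates the replacement as Perl code; /g repeats it for every match.
(my $doubled = $price) =~ s/(\d+)/$1 * 2/ge;     # "apples: 6, pears: 20"

# /r returns the modified copy and leaves $price untouched.
my $shout = $price =~ s/(\w+):/\U$1\E:/gr;       # "APPLES: 3, PEARS: 10"

# With ' as the delimiter the replacement is not interpolated,
# so the result contains a literal dollar sign followed by 1.
my $literal = "cost" =~ s'cost'$1'r;

print "$doubled\n$shout\n$literal\n";
```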
  55
  56 C<?pattern?> is like C<m/pattern/> but matches only once. No alternate
  57 delimiters can be used.  Must be reset with reset().
  58
  59 =head2 SYNTAX
  60
  61  \       Escapes the character immediately following it
  62  .       Matches any single character except a newline (unless /s is
  63            used)
  64  ^       Matches at the beginning of the string (or line, if /m is used)
  65  $       Matches at the end of the string (or line, if /m is used)
  66  *       Matches the preceding element 0 or more times
  67  +       Matches the preceding element 1 or more times
  68  ?       Matches the preceding element 0 or 1 times
  69  {...}   Specifies a range of occurrences for the element preceding it
  70  [...]   Matches any one of the characters contained within the brackets
  71  (...)   Groups subexpressions for capturing to $1, $2...
  72  (?:...) Groups subexpressions without capturing (cluster)
  73  |       Matches either the subexpression preceding or following it
  74  \g1 or \g{1}, \g2 ...    Matches the text from the Nth group
  75  \1, \2, \3 ...           Matches the text from the Nth group
  76  \g-1 or \g{-1}, \g-2 ... Matches the text from the Nth previous group
  77  \g{name}     Named backreference
  78  \k<name>     Named backreference
  79  \k'name'     Named backreference
  80  (?P=name)    Named backreference (python syntax)
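
The grouping and backreference forms can be compared side by side. This is only a sketch; the test string is arbitrary.

```perl
use strict;
use warnings;

my $s = "hello hello world";

# (...) captures; \1 matches the same text again.
my ($dup) = $s =~ /\b(\w+) \1\b/;            # "hello"

# \g{-1} refers to the group just before it, whatever its number.
my ($rel) = $s =~ /\b(\w+) \g{-1}\b/;        # "hello"

# A named group can be referenced with \k<name>.
my $named = $s =~ /\b(?<w>\w+) \k<w>\b/ ? $+{w} : undef;   # "hello"

# (?:...) groups the alternation without using up a capture number.
my ($year) = "died 1999" =~ /(?:born|died) (\d{4})/;       # "1999"

print "$dup $rel $named $year\n";
```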
  81
  82 =head2 ESCAPE SEQUENCES
  83
  84 These work as in normal strings.
  85
  86    \a       Alarm (beep)
  87    \e       Escape
  88    \f       Formfeed
  89    \n       Newline
  90    \r       Carriage return
  91    \t       Tab
  92    \037     Char whose ordinal is the 3 octal digits, max \777
  93    \o{2307} Char whose ordinal is the octal number, unrestricted
  94    \x7f     Char whose ordinal is the 2 hex digits, max \xFF
  95    \x{263a} Char whose ordinal is the hex number, unrestricted
  96    \cx      Control-x
  97    \N{name} A named Unicode character
  98    \N{U+263D} A Unicode character by hex ordinal
  99
 100    \l  Lowercase next character
 101    \u  Titlecase next character
 102    \L  Lowercase until \E
 103    \U  Uppercase until \E
 104    \Q  Disable pattern metacharacters until \E
 105    \E  End modification
 106
 107 For Titlecase, see L</Titlecase>.
 108
 109 This one works differently from normal strings:
 110
 111    \b  An assertion, not backspace, except in a character class
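
A few of these sequences in use, as a hedged sketch; C<$input> stands in for any hypothetical user-supplied text.

```perl
use strict;
use warnings;

my $input = "a.b*c";   # hypothetical user-supplied text

# Unquoted, "." and "*" are metacharacters, so this also matches "aXbbc".
my $wild = "aXbbc" =~ /$input/ ? 1 : 0;                # 1

# \Q...\E makes every metacharacter literal.
my $quoted_miss = "aXbbc"   =~ /\Q$input\E/ ? 1 : 0;   # 0
my $quoted_hit  = "xa.b*cx" =~ /\Q$input\E/ ? 1 : 0;   # 1

# \u and \L combine: lowercase up to \E, then titlecase the first char.
my $name = "\u\LjOHN\E SMITH";                         # "John SMITH"

# \x{...} and \N{U+...} name the same character.
my $same = "\x{263A}" eq "\N{U+263A}" ? 1 : 0;         # 1

# \b is a word boundary, but inside [...] it is a backspace.
my $boundary  = "a b"    =~ /a\b/    ? 1 : 0;          # 1
my $backspace = "a\x08b" =~ /a[\b]b/ ? 1 : 0;          # 1

print "$wild $quoted_miss $quoted_hit $name\n";
```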
 112
 113 =head2 CHARACTER CLASSES
 114
 115    [amy]    Match 'a', 'm' or 'y'
 116    [f-j]    Dash specifies "range"
 117    [f-j-]   Dash escaped or at start or end means 'dash'
 118    [^f-j]   Caret indicates "match any character _except_ these"
 119
 120 The following sequences (except C<\N>) work inside or outside a character class.
 121 The first six are locale aware; all are Unicode aware. See L<perllocale>
 122 and L<perlunicode> for details.
 123
 124    \d      A digit
 125    \D      A nondigit
 126    \w      A word character
 127    \W      A non-word character
 128    \s      A whitespace character
 129    \S      A non-whitespace character
 130    \h      A horizontal whitespace character
 131    \H      A non-horizontal whitespace character
 132    \N      A non-newline (when not followed by '{NAME}'; experimental;
 133            not valid in a character class; equivalent to [^\n]; it's
 134            like '.' without /s modifier)
 135    \v      A vertical whitespace character
 136    \V      A non-vertical whitespace character
 137    \R      A generic newline           (?>\v|\x0D\x0A)
 138
 139    \C      Match a byte (with Unicode, '.' matches a character)
 140    \pP     Match P-named (Unicode) property
 141    \p{...} Match Unicode property with name longer than 1 character
 142    \PP     Match non-P
 143    \P{...} Match lack of Unicode property with name longer than 1 char
 144    \X      Match Unicode extended grapheme cluster
 145
 146 POSIX character classes and their Unicode and Perl equivalents:
 147
 148            ASCII-         Full-
 149            range          range   backslash
 150  POSIX    \p{...}         \p{}    sequence       Description
 151  -----------------------------------------------------------------------
 152  alnum   PosixAlnum       Alnum               Alpha plus Digit
 153  alpha   PosixAlpha       Alpha               Alphabetic characters
 154  ascii   ASCII                                Any ASCII character
 155  blank   PosixBlank       Blank     \h        Horizontal whitespace;
 156                                                 full-range also written
 157                                                 as \p{HorizSpace} (GNU
 158                                                 extension)
 159  cntrl   PosixCntrl       Cntrl               Control characters
 160  digit   PosixDigit       Digit     \d        Decimal digits
 161  graph   PosixGraph       Graph               Alnum plus Punct
 162  lower   PosixLower       Lower               Lowercase characters
 163  print   PosixPrint       Print               Graph plus Space, but not
 164                                                 any Cntrls
 165  punct   PosixPunct       Punct               These aren't precisely
 166                                                 equivalent.  See NOTE,
 167                                                 below.
 168  space   PosixSpace       Space     [\s\cK]   Whitespace
 169          PerlSpace        SpacePerl \s        Perl's whitespace
 170                                                 definition
 171  upper   PosixUpper       Upper               Uppercase characters
 172  word    PerlWord         Word      \w        Alnum plus '_' (Perl
 173                                                 extension)
 174  xdigit  ASCII_Hex_Digit  XDigit              Hexadecimal digit,
 175                                                 ASCII-range is
 176                                                 [0-9A-Fa-f]
 177
 178 NOTE on C<[[:punct:]]>, C<\p{PosixPunct}> and C<\p{Punct}>:
 179 In the ASCII range, C<[[:punct:]]> and C<\p{PosixPunct}> match
 180 C<[-!"#$%&'()*+,./:;<=E<gt>?@[\\\]^_`{|}~]> (although if a locale is in
 181 effect, it could alter the behavior of C<[[:punct:]]>); and C<\p{Punct}>
 182 matches C<[-!"#%&'()*,./:;?@[\\\]_{}]>.  When matching a UTF-8 string,
 183 C<[[:punct:]]> matches what it does in the ASCII range, plus what
 184 C<\p{Punct}> matches.  C<\p{Punct}> matches anything that isn't a
 185 control, an alphanumeric, a space, or a symbol.
 186
 187 Within a character class:
 188
 189     POSIX      traditional   Unicode
 190   [:digit:]       \d        \p{Digit}
 191   [:^digit:]      \D        \P{Digit}
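
The backslash, POSIX and Unicode-property notations can be tried against an ASCII sample. This is an illustrative sketch only; the input line is made up, and C<\h> assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

my $line = "id=42;\tname=foo_bar";

my @digits  = $line =~ /\d+/g;            # ("42")
my @words   = $line =~ /\w+/g;            # ("id", "42", "name", "foo_bar")
my $has_tab = $line =~ /\h/ ? 1 : 0;      # 1, the tab is horizontal whitespace

# POSIX classes are written inside a bracketed character class.
my @punct = $line =~ /[[:punct:]]/g;      # ("=", ";", "=", "_")

# Three spellings of "not a digit".
my ($posix)   = "123abc" =~ /([[:^digit:]]+)/;   # "abc"
my ($classic) = "123abc" =~ /(\D+)/;             # "abc"
my ($unicode) = "123abc" =~ /(\P{Digit}+)/;      # "abc"

print "digits=@digits words=@words punct=@punct\n";
```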
 192
 193 =head2 ANCHORS
 194
 195 All are zero-width assertions.
 196
 197    ^  Match string start (or line, if /m is used)
 198    $  Match string end (or line, if /m is used) or before newline
 199    \b Match word boundary (between \w and \W)
 200    \B Match except at word boundary (between \w and \w or \W and \W)
 201    \A Match string start (regardless of /m)
 202    \Z Match string end (before optional newline)
 203    \z Match absolute string end
 204    \G Match where previous m//g left off
 205    \K Keep the stuff left of the \K, don't include it in $&
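
The differences between the anchors show up most clearly on a string that ends in a newline. The following sketch uses invented sample data; C<\K> assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

my $s = "line1\nline2\n";

my $Z     = $s =~ /line2\Z/  ? 1 : 0;   # 1: \Z tolerates a final newline
my $z     = $s =~ /line2\z/  ? 1 : 0;   # 0: \z is the absolute end
my $A     = $s =~ /\Aline2/m ? 1 : 0;   # 0: \A ignores /m
my $caret = $s =~ /^line2/m  ? 1 : 0;   # 1: ^ with /m matches after a newline

# \b: "cat" as a whole word occurs once, not inside "concat".
my @cats = "cat concat" =~ /\bcat\b/g;  # ("cat")

# \G continues where the previous m//g stopped; /c keeps pos on failure.
my $data = "aaabbb";
my @as;
push @as, $1 while $data =~ /\G(a)/gc;
my $stopped_at = pos($data);            # 3

# \K keeps what is to its left out of the matched (and replaced) text.
(my $path = "dir/file.txt") =~ s/\.\Ktxt/md/;   # "dir/file.md"

print "Z=$Z z=$z A=$A caret=$caret pos=$stopped_at path=$path\n";
```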
 206
 207 =head2 QUANTIFIERS
 208
 209 Quantifiers are greedy by default and match the B<longest> string possible at the leftmost position.
 210
 211    Maximal Minimal Possessive Allowed range
 212    ------- ------- ---------- -------------
 213    {n,m}   {n,m}?  {n,m}+     Must occur at least n times
 214                               but no more than m times
 215    {n,}    {n,}?   {n,}+      Must occur at least n times
 216    {n}     {n}?    {n}+       Must occur exactly n times
 217    *       *?      *+         0 or more times (same as {0,})
 218    +       +?      ++         1 or more times (same as {1,})
 219    ?       ??      ?+         0 or 1 time (same as {0,1})
 220
 221 The possessive forms (new in Perl 5.10) prevent backtracking: what gets
 222 matched by a pattern with a possessive quantifier will not be backtracked
 223 into, even if that causes the whole match to fail.
 224
 225 There is no quantifier C<{,n}>. That's interpreted as a literal string.
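
To make the three columns of the table concrete, here is a small sketch with arbitrary sample strings; the possessive form assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

my $html = "<b>one</b><b>two</b>";

# Maximal (greedy): .* runs to the last </b>.
my ($greedy) = $html =~ /<b>(.*)<\/b>/;     # "one</b><b>two"

# Minimal: .*? stops at the first </b>.
my ($lazy) = $html =~ /<b>(.*?)<\/b>/;      # "one"

# Range quantifiers follow the same rule.
my ($max) = "aaaaa" =~ /(a{2,3})/;          # "aaa"
my ($min) = "aaaaa" =~ /(a{2,3}?)/;         # "aa"

# Possessive: a++ takes every "a" and never gives one back,
# so the trailing "a" in the pattern has nothing left to match.
my $greedy_ok  = "aaa" =~ /a+a/  ? 1 : 0;   # 1
my $possessive = "aaa" =~ /a++a/ ? 1 : 0;   # 0

print "$greedy | $lazy | $max | $min | $greedy_ok | $possessive\n";
```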
 226
 227 =head2 EXTENDED CONSTRUCTS
 228
 229    (?#text)          A comment
 230    (?:...)           Groups subexpressions without capturing (cluster)
 231    (?pimsx-imsx:...) Enable/disable option (as per m// modifiers)
 232    (?=...)           Zero-width positive lookahead assertion
 233    (?!...)           Zero-width negative lookahead assertion
 234    (?<=...)          Zero-width positive lookbehind assertion
 235    (?<!...)          Zero-width negative lookbehind assertion
 236    (?>...)           Grab what we can, prohibit backtracking
 237    (?|...)           Branch reset
 238    (?<name>...)      Named capture
 239    (?'name'...)      Named capture
 240    (?P<name>...)     Named capture (python syntax)
 241    (?{ code })       Embedded code, return value becomes $^R
 242    (??{ code })      Dynamic regex, return value used as regex
 243    (?N)              Recurse into subpattern number N
 244    (?-N), (?+N)      Recurse into Nth previous/next subpattern
 245    (?R), (?0)        Recurse at the beginning of the whole pattern
 246    (?&name)          Recurse into a named subpattern
 247    (?P>name)         Recurse into a named subpattern (python syntax)
 248    (?(cond)yes|no)
 249    (?(cond)yes)      Conditional expression, where "cond" can be:
 250                      (N)       subpattern N has matched something
 251                      (<name>)  named subpattern has matched something
 252                      ('name')  named subpattern has matched something
 253                      (?{code}) code condition
 254                      (R)       true if recursing
 255                      (RN)      true if recursing into Nth subpattern
 256                      (R&name)  true if recursing into named subpattern
 257                      (DEFINE)  always false, no no-pattern allowed
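
Several of these constructs are easier to follow with a concrete input. The examples below are a sketch with invented strings and assume Perl 5.10 or later for named captures, branch reset and recursion.

```perl
use strict;
use warnings;

# Named captures land in %+.
my %date;
if ("2024-03-15" =~ /(?<y>\d{4})-(?<m>\d\d)-(?<d>\d\d)/) {
    %date = %+;                                   # y=2024, m=03, d=15
}

# Lookahead and lookbehind assert context without consuming it.
my @px    = "10px 20em 30px" =~ /(\d+)(?=px)/g;   # (10, 30)
my ($eur) = "USD30 EUR40"    =~ /(?<=EUR)(\d+)/;  # 40

# (?i:...) turns on case-insensitivity for part of the pattern only.
my $partial = "HELLO world" =~ /(?i:hello) world/ ? 1 : 0;   # 1

# Branch reset: both alternatives store their capture in $1.
my @reset;
for my $t ("a1", "2b") {
    push @reset, $1 if $t =~ /(?|a(\d)|(\d)b)/;   # collects 1, then 2
}

# (?1) recurses into group 1 to match balanced parentheses.
my ($balanced) = "f(a(b)c)" =~ /(\((?:[^()]++|(?1))*\))/;    # "(a(b)c)"

# Conditional: require ">" only if group 1 matched "<".
my @cond = grep { /^(<)?\w+(?(1)>)$/ } "tag", "<tag>", "<tag";

print "px=@px eur=$eur reset=@reset balanced=$balanced cond=@cond\n";
```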
 258
 259 =head2 VARIABLES
 260
 261    $_    Default variable for operators to use
 262
 263    $`    Everything prior to matched string
 264    $&    Entire matched string
 265    $'    Everything after matched string
 266
 267    ${^PREMATCH}   Everything prior to matched string
 268    ${^MATCH}      Entire matched string
 269    ${^POSTMATCH}  Everything after matched string
 270
 271 The use of C<$`>, C<$&> or C<$'> will slow down B<all> regex use
 272 within your program. Consult L<perlvar> for C<@->
 273 to see equivalent expressions that won't cause a slowdown.
 274 See also L<Devel::SawAmpersand>. Starting with Perl 5.10, you
 275 can also use the equivalent variables C<${^PREMATCH}>, C<${^MATCH}>
 276 and C<${^POSTMATCH}>, but for them to be defined, you have to
 277 specify the C</p> (preserve) modifier on your regular expression.
 278
 279    $1, $2 ...  Hold the Nth captured expression
 280    $+    Last parenthesized pattern match
 281    $^N   Holds the most recently closed capture
 282    $^R   Holds the result of the last (?{...}) expr
 283    @-    Offsets of starts of groups. $-[0] holds start of whole match
 284    @+    Offsets of ends of groups. $+[0] holds end of whole match
 285    %+    Named capture groups
 286    %-    Named capture groups, as array refs
 287
 288 Captured groups are numbered according to their I<opening> paren.
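
The following sketch shows what these variables hold after one successful match. The input string and hash keys are made up for illustration, and the C</p> variables assume Perl 5.10 or later.

```perl
use strict;
use warnings;

my $s = "say key=value now";

my (%seen, @offsets);
if ($s =~ /((\w+)=(\w+))/p) {
    %seen = (
        pre         => ${^PREMATCH},    # "say "
        match       => ${^MATCH},       # "key=value"
        post        => ${^POSTMATCH},   # " now"
        outer       => $1,              # "key=value", its paren opens first
        key         => $2,              # "key"
        value       => $3,              # "value"
        last_paren  => $+,              # "value", highest-numbered group used
        last_closed => $^N,             # "key=value", group 1 closes last
    );
    # @- and @+ give offsets; index 0 is the whole match.
    @offsets = ($-[0], $+[0], $-[3], $+[3]);   # (4, 13, 8, 13)
}

# %+ holds named captures; %- holds them as array references.
my ($k, $v);
if ("a=1" =~ /(?<k>\w)=(?<v>\d)/) {
    ($k, $v) = ($+{k}, $-{v}[0]);       # ("a", "1")
}

print "match=$seen{match} offsets=@offsets k=$k v=$v\n";
```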
 289
 290 =head2 FUNCTIONS
 291
 292    lc          Lowercase a string
 293    lcfirst     Lowercase first char of a string
 294    uc          Uppercase a string
 295    ucfirst     Titlecase first char of a string
 296
 297    pos         Return or set current match position
 298    quotemeta   Quote metacharacters
 299    reset       Reset ?pattern? status
 300    study       Analyze string for optimizing matching
 301
 302    split       Use a regex to split a string into parts
 303
 304 The first four of these are like the escape sequences C<\L>, C<\l>,
 305 C<\U>, and C<\u>.  For Titlecase, see L</Titlecase>.
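
A short, hedged sketch of some of these functions working alongside a regex; all strings are arbitrary examples.

```perl
use strict;
use warnings;

# split takes a regex as its separator.
my @fields = split /\s*,\s*/, "a , b,c";      # ("a", "b", "c")

# quotemeta does what \Q...\E does inside a pattern.
my $safe = quotemeta("a.b");                   # 'a\.b'
my $only_literal = ("a.b" =~ /^$safe$/ && "axb" !~ /^$safe$/) ? 1 : 0;   # 1

# pos reports, and can set, where the next m//g will start.
my $str = "aXbXc";
$str =~ /X/g;
my $after_first = pos($str);                   # 2
pos($str) = 3;                                 # skip ahead by hand
$str =~ /X/g;
my $after_second = pos($str);                  # 4

# lc and ucfirst mirror \L and \u.
my $title = ucfirst(lc("hELLO"));              # "Hello"
my $same  = $title eq "\u\LhELLO\E" ? 1 : 0;   # 1

print "fields=@fields safe=$safe pos=$after_first,$after_second title=$title\n";
```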
 306
 307 =head2 TERMINOLOGY
 308
 309 =head3 Titlecase
 310
 311 A Unicode concept that is most often equal to uppercase, but for
 312 certain characters, such as the German "sharp s", there is a difference.
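
One character where the distinction is easy to see is U+01C6, a single code point that displays as the two letters "dz" with a caron. The sketch below uses it instead of the sharp s purely as an illustration; the variable names are hypothetical.

```perl
use strict;
use warnings;

my $dz = "\x{01C6}";        # U+01C6 LATIN SMALL LETTER DZ WITH CARON

my $upper = uc($dz);        # U+01C4, both letters capital
my $title = ucfirst($dz);   # U+01C5, only the first letter capital

# The \u escape applies the same titlecase mapping as ucfirst.
my $escape_matches = "\u$dz" eq $title ? 1 : 0;   # 1

printf "uc: U+%04X, ucfirst: U+%04X\n", ord($upper), ord($title);
# prints "uc: U+01C4, ucfirst: U+01C5"
```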
 313
 314 =head1 AUTHOR
 315
 316 Iain Truskett. Updated by the Perl 5 Porters.
 317
 318 This document may be distributed under the same terms as Perl itself.
 319
 320 =head1 SEE ALSO
 321
 322 =over 4
 323
 324 =item *
 325
 326 L<perlretut> for a tutorial on regular expressions.
 327
 328 =item *
 329
 330 L<perlrequick> for a rapid tutorial.
 331
 332 =item *
 333
 334 L<perlre> for more details.
 335
 336 =item *
 337
 338 L<perlvar> for details on the variables.
 339
 340 =item *
 341
 342 L<perlop> for details on the operators.
 343
 344 =item *
 345
 346 L<perlfunc> for details on the functions.
 347
 348 =item *
 349
 350 L<perlfaq6> for FAQs on regular expressions.
 351
 352 =item *
 353
 354 L<perlrebackslash> for a reference on backslash sequences.
 355
 356 =item *
 357
 358 L<perlrecharclass> for a reference on character classes.
 359
 360 =item *
 361
 362 The L<re> module to alter behaviour and aid
 363 debugging.
 364
 365 =item *
 366
 367 L<perldebug/"Debugging regular expressions">
 368
 369 =item *
 370
 371 L<perluniintro>, L<perlunicode>, L<charnames> and L<perllocale>
 372 for details on regexes and internationalisation.
 373
 374 =item *
 375
 376 I<Mastering Regular Expressions> by Jeffrey Friedl
 377 (F<http://oreilly.com/catalog/9780596528126/>) for a thorough grounding and
 378 reference on the topic.
 379
 380 =back
 381
 382 =head1 THANKS
 383
 384 David P.C. Wollmann,
 385 Richard Soderberg,
 386 Sean M. Burke,
 387 Tom Christiansen,
 388 Jim Cromie,
 389 and
 390 Jeffrey Goff
 391 for useful advice.
 392
 393 =cut