=head1 NAME

perlreref - Perl Regular Expressions Reference

=head1 DESCRIPTION

This is a quick reference to Perl's regular expressions.
For full information see L<perlre> and L<perlop>, as well
as the L</"SEE ALSO"> section in this document.

=head2 OPERATORS

C<=~> determines to which variable the regex is applied.
In its absence, C<$_> is used.

    $var =~ /foo/;

C<!~> determines to which variable the regex is applied,
and negates the result of the match; it returns
false if the match succeeds, and true if it fails.

    $var !~ /foo/;

C<m/pattern/msixpogc> searches a string for a pattern match,
applying the given options.

    m  Multiline mode - ^ and $ match internal lines
    s  match as a Single line - . matches \n
    i  case-Insensitive
    x  eXtended legibility - free whitespace and comments
    p  Preserve a copy of the matched string -
       ${^PREMATCH}, ${^MATCH}, ${^POSTMATCH} will be defined.
    o  compile pattern Once
    g  Global - all occurrences
    c  don't reset pos on failed matches when using /g

If 'pattern' is an empty string, the last I<successfully> matched
regex is used. Delimiters other than '/' may be used for both this
operator and the following ones. The leading C<m> can be omitted
if the delimiter is '/'.
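
As a minimal sketch of the binding operators and the C</g> and C</x> modifiers described above (the sample string and variable names are made up for illustration):

```perl
use strict;
use warnings;

my $text = "foo=1, bar=22, baz=333";

# =~ binds the match to $text; !~ negates the result.
my $has_bar   = $text =~ /bar/ ? 1 : 0;
my $lacks_qux = $text !~ /qux/ ? 1 : 0;

# /g in list context returns every capture; /x allows whitespace
# inside the pattern, and m{...} uses alternate delimiters.
my @numbers = $text =~ m{ = (\d+) }gx;

# /g in scalar context walks the string one match at a time.
my @keys;
while ($text =~ /(\w+)=/g) {
    push @keys, $1;
}

print "numbers: @numbers\n";   # numbers: 1 22 333
print "keys: @keys\n";         # keys: foo bar baz
```

The list-context form is usually the shorter one when every match is wanted at once; the C<while> loop is useful when each match needs its own processing.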

C<qr/pattern/msixpo> lets you store a regex in a variable,
or pass one around. The modifiers are the same as for C<m//> and are
stored within the regex.
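
A short sketch of storing a regex with C<qr//> and reusing it; the variable names are hypothetical:

```perl
use strict;
use warnings;

# qr// builds a regex object; the /i modifier travels with it.
my $word = qr/perl/i;

# A stored regex can be matched directly or interpolated into
# a larger pattern.
my $direct   = "I like PERL" =~ $word          ? 1 : 0;
my $embedded = "Perl 5"      =~ /^$word\s+\d/  ? 1 : 0;

print "$direct $embedded\n";   # 1 1
```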

C<s/pattern/replacement/msixpogcer> substitutes matches of
'pattern' with 'replacement'. The modifiers are the same as for C<m//>,
with two additions:

    e  Evaluate 'replacement' as an expression
    r  Return substitution and leave the original string untouched.

'e' may be specified multiple times. 'replacement' is interpreted
as a double-quoted string unless a single-quote (C<'>) is the delimiter.
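
The two extra modifiers might be used as follows. This is only a sketch; the data is invented, and C</r> assumes Perl 5.14 or later.

```perl
use strict;
use warnings;

my $price = "apples 3, pears 10";

# /e evaluates the replacement as Perl code; /g repeats it for
# every match.
(my $doubled = $price) =~ s/(\d+)/$1 * 2/ge;

# /r returns the modified copy and leaves $price untouched.
# The replacement is a double-quoted string, so \u applies.
my $titled = $price =~ s/(\w+)/\u$1/gr;

print "$doubled\n";   # apples 6, pears 20
print "$titled\n";    # Apples 3, Pears 10
print "$price\n";     # apples 3, pears 10
```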

C<?pattern?> is like C<m/pattern/> but matches only once. No alternate
delimiters can be used.  Must be reset with reset().

=head2 SYNTAX

 \       Escapes the character immediately following it
 .       Matches any single character except a newline (unless /s is
           used)
 ^       Matches at the beginning of the string (or line, if /m is used)
 $       Matches at the end of the string (or line, if /m is used)
 *       Matches the preceding element 0 or more times
 +       Matches the preceding element 1 or more times
 ?       Matches the preceding element 0 or 1 times
 {...}   Specifies a range of occurrences for the element preceding it
 [...]   Matches any one of the characters contained within the brackets
 (...)   Groups subexpressions for capturing to $1, $2...
 (?:...) Groups subexpressions without capturing (cluster)
 |       Matches either the subexpression preceding or following it
 \g1 or \g{1}, \g2 ...    Matches the text from the Nth group
 \1, \2, \3 ...           Matches the text from the Nth group
 \g-1 or \g{-1}, \g-2 ... Matches the text from the Nth previous group
 \g{name}     Named backreference
 \k<name>     Named backreference
 \k'name'     Named backreference
 (?P=name)    Named backreference (python syntax)
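
The backreference forms in the table can be compared side by side. The following is an illustrative sketch with made-up input, not an exhaustive tour:

```perl
use strict;
use warnings;

my $line = "it was the the best of times";

# \1 and \g{1} both refer back to the first capture group.
my ($dup_a) = $line =~ /\b(\w+)\s+\1\b/;
my ($dup_b) = $line =~ /\b(\w+)\s+\g{1}\b/;

# \g{-1} is relative: the nearest capture group before it.
my ($dup_c) = $line =~ /\b(\w+)\s+\g{-1}\b/;

# \k<q> refers to the named group "q", so the closing quote must
# be the same character as the opening one.
my $quoted = q{say "hi" or 'bye'} =~ /(?<q>["'])(.*?)\k<q>/ ? $2 : undef;

print "$dup_a $dup_b $dup_c $quoted\n";   # the the the hi
```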

=head2 ESCAPE SEQUENCES

These work as in normal strings.

   \a       Alarm (beep)
   \e       Escape
   \f       Formfeed
   \n       Newline
   \r       Carriage return
   \t       Tab
   \037     Char whose ordinal is the 3 octal digits, max \777
   \o{2307} Char whose ordinal is the octal number, unrestricted
   \x7f     Char whose ordinal is the 2 hex digits, max \xFF
   \x{263a} Char whose ordinal is the hex number, unrestricted
   \cx      Control-x
   \N{name} A named Unicode character or character sequence
   \N{U+263D} A Unicode character by hex ordinal

   \l  Lowercase next character
   \u  Titlecase next character
   \L  Lowercase until \E
   \U  Uppercase until \E
   \Q  Disable pattern metacharacters until \E
   \E  End modification

For Titlecase, see L</Titlecase>.

This one works differently from normal strings:

   \b  An assertion, not backspace, except in a character class
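
A few of these escapes in use, as a sketch with invented sample strings:

```perl
use strict;
use warnings;

# \Q...\E quotes metacharacters in an interpolated variable, so
# the dot and plus in $literal match only themselves.
my $literal = "a.b+c";
my $exact   = "xa.b+cx" =~ /\Q$literal\E/  ? 1 : 0;
# Without \Q the same characters act as metacharacters.
my $loose   = "aXbbc"   =~ /\A$literal\z/  ? 1 : 0;

# \x{...} works as in a double-quoted string.
my $tabbed = "a\tb" =~ /a\x{09}b/ ? 1 : 0;

# Case escapes also apply in the replacement part of s///.
(my $name = "jOHN") =~ s/(\w+)/\u\L$1/;

# Inside a pattern \b is a word boundary, not a backspace.
my @cats = "cat concat cat." =~ /\bcat\b/g;

print "$exact $loose $tabbed $name ", scalar(@cats), "\n";   # 1 1 1 John 2
```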

=head2 CHARACTER CLASSES

   [amy]    Match 'a', 'm' or 'y'
   [f-j]    Dash specifies "range"
   [f-j-]   Dash escaped or at start or end means 'dash'
   [^f-j]   Caret indicates "match any character _except_ these"

The following sequences (except C<\N>) work inside or outside a
character class. The first six are locale aware, all are Unicode aware.
See L<perllocale> and L<perlunicode> for details.

   \d      A digit
   \D      A nondigit
   \w      A word character
   \W      A non-word character
   \s      A whitespace character
   \S      A non-whitespace character
   \h      A horizontal whitespace
   \H      A non horizontal whitespace
   \N      A non newline (when not followed by '{NAME}'; experimental;
           not valid in a character class; equivalent to [^\n]; it's
           like '.' without /s modifier)
   \v      A vertical whitespace
   \V      A non vertical whitespace
   \R      A generic newline           (?>\v|\x0D\x0A)

   \C      Match a byte (with Unicode, '.' matches a character)
   \pP     Match P-named (Unicode) property
   \p{...} Match Unicode property with name longer than 1 character
   \PP     Match non-P
   \P{...} Match lack of Unicode property with name longer than 1 char
   \X      Match Unicode extended grapheme cluster

POSIX character classes and their Unicode and Perl equivalents:

            ASCII-         Full-
   POSIX    range          range    backslash
 [[:...:]]  \p{...}        \p{...}   sequence    Description

 -----------------------------------------------------------------------
 alnum   PosixAlnum       XPosixAlnum            Alpha plus Digit
 alpha   PosixAlpha       XPosixAlpha            Alphabetic characters
 ascii   ASCII                                   Any ASCII character
 blank   PosixBlank       XPosixBlank   \h       Horizontal whitespace;
                                                 full-range also
                                                 written as
                                                 \p{HorizSpace} (GNU
                                                 extension)
 cntrl   PosixCntrl       XPosixCntrl            Control characters
 digit   PosixDigit       XPosixDigit   \d       Decimal digits
 graph   PosixGraph       XPosixGraph            Alnum plus Punct
 lower   PosixLower       XPosixLower            Lowercase characters
 print   PosixPrint       XPosixPrint            Graph plus Space, but
                                                 not any Cntrls
 punct   PosixPunct       XPosixPunct            Punctuation and Symbols
                                                 in ASCII-range; just
                                                 punct outside it
 space   PosixSpace       XPosixSpace   [\s\cK]  Whitespace
         PerlSpace        XPerlSpace    \s       Perl's whitespace def'n
 upper   PosixUpper       XPosixUpper            Uppercase characters
 word    PerlWord         XPosixWord    \w       Alnum + Unicode marks +
                                                 connectors, like '_'
                                                 (Perl extension)
 xdigit  ASCII_Hex_Digit  XPosixXDigit           Hexadecimal digit,
                                                 ASCII-range is
                                                 [0-9A-Fa-f]

There are also various synonyms, such as C<\p{Alpha}> for
C<\p{XPosixAlpha}>; all are listed in
L<perluniprops/Properties accessible through \p{} and \P{}>.

Within a character class:

    POSIX      traditional   Unicode
  [:digit:]       \d        \p{Digit}
  [:^digit:]      \D        \P{Digit}
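
The following sketch exercises a few of the classes above on an invented string. Note that a POSIX class such as C<[:alpha:]> is written inside an enclosing bracketed class.

```perl
use strict;
use warnings;

my $s = "Room 42b,\tfloor-3";

# \d matches digits.
my @digits = $s =~ /\d+/g;

# A POSIX class is written inside an outer [...].
my @words = $s =~ /[[:alpha:]]+/g;

# \h works inside a bracketed class as well as outside one.
my @fields = split /[\h,]+/, $s;

# A dash at the end of a class is a literal dash.
my ($level) = $s =~ /([a-z-]+\d)/;

# \p{...} tests a Unicode property; here, uppercase letters.
my $upper_count = () = $s =~ /\p{Lu}/g;

print "@digits | @words | @fields | $level | $upper_count\n";
# 42 3 | Room b floor | Room 42b floor-3 | floor-3 | 1
```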

=head2 ANCHORS

All are zero-width assertions.

   ^  Match string start (or line, if /m is used)
   $  Match string end (or line, if /m is used) or before newline
   \b Match word boundary (between \w and \W)
   \B Match except at word boundary (between \w and \w or \W and \W)
   \A Match string start (regardless of /m)
   \Z Match string end (before optional newline)
   \z Match absolute string end
   \G Match where previous m//g left off
   \K Keep the stuff left of the \K, don't include it in $&
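
The differences between these anchors are easiest to see on a string with embedded newlines. This is a sketch under made-up inputs; the small tokenizer is only one possible use of C<\G>.

```perl
use strict;
use warnings;

my $two_lines = "first\nsecond\n";

# Without /m, ^ anchors to the string; with /m, to each line.
my @starts = $two_lines =~ /^(\w+)/g;
my @lines  = $two_lines =~ /^(\w+)$/mg;

# \Z allows a final newline, \z does not.
my $cap_z   = $two_lines =~ /second\Z/ ? 1 : 0;
my $small_z = $two_lines =~ /second\z/ ? 1 : 0;

# \G with /gc resumes exactly where the previous match ended,
# and /c keeps pos() intact when one alternative fails.
my $input = "12+345";
my @tokens;
for ($input) {
    while (1) {
        if    (/\G(\d+)/gc)   { push @tokens, "NUM:$1" }
        elsif (/\G([+\-])/gc) { push @tokens, "OP:$1" }
        else                  { last }
    }
}

# \K drops what was matched so far, so only the digits are replaced.
(my $masked = "id=12345") =~ s/id=\K\d+/*****/;

print "@starts | @lines | $cap_z $small_z | @tokens | $masked\n";
# first | first second | 1 0 | NUM:12 OP:+ NUM:345 | id=*****
```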

=head2 QUANTIFIERS

Quantifiers are greedy by default and match the B<longest> leftmost.

   Maximal Minimal Possessive Allowed range
   ------- ------- ---------- -------------
   {n,m}   {n,m}?  {n,m}+     Must occur at least n times
                              but no more than m times
   {n,}    {n,}?   {n,}+      Must occur at least n times
   {n}     {n}?    {n}+       Must occur exactly n times
   *       *?      *+         0 or more times (same as {0,})
   +       +?      ++         1 or more times (same as {1,})
   ?       ??      ?+         0 or 1 time (same as {0,1})

The possessive forms (new in Perl 5.10) prevent backtracking: what gets
matched by a pattern with a possessive quantifier will not be backtracked
into, even if that causes the whole match to fail.

There is no quantifier C<{,n}>. That's interpreted as a literal string.
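
A small sketch contrasting the maximal, minimal and possessive forms from the table; the inputs are invented for illustration:

```perl
use strict;
use warnings;

my $html = "<b>one</b><b>two</b>";

# Greedy .* runs to the last </b>; minimal .*? stops at the first.
my ($greedy)  = $html =~ /<b>(.*)<\/b>/;
my ($minimal) = $html =~ /<b>(.*?)<\/b>/;

# \d+ gives back one digit so the trailing \d can match.
my $backtracks = "12345" =~ /^\d+\d$/  ? 1 : 0;
# \d++ keeps every digit, so the trailing \d has nothing left.
my $possessive = "12345" =~ /^\d++\d$/ ? 1 : 0;

# {2,3} takes as many as allowed, up to three.
my ($run) = "aaaaa" =~ /(a{2,3})/;

print "$greedy\n";                        # one</b><b>two
print "$minimal\n";                       # one
print "$backtracks $possessive $run\n";   # 1 0 aaa
```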

=head2 EXTENDED CONSTRUCTS

   (?#text)          A comment
   (?:...)           Groups subexpressions without capturing (cluster)
   (?pimsx-imsx:...) Enable/disable option (as per m// modifiers)
   (?=...)           Zero-width positive lookahead assertion
   (?!...)           Zero-width negative lookahead assertion
   (?<=...)          Zero-width positive lookbehind assertion
   (?<!...)          Zero-width negative lookbehind assertion
   (?>...)           Grab what we can, prohibit backtracking
   (?|...)           Branch reset
   (?<name>...)      Named capture
   (?'name'...)      Named capture
   (?P<name>...)     Named capture (python syntax)
   (?{ code })       Embedded code, return value becomes $^R
   (??{ code })      Dynamic regex, return value used as regex
   (?N)              Recurse into subpattern number N
   (?-N), (?+N)      Recurse into Nth previous/next subpattern
   (?R), (?0)        Recurse at the beginning of the whole pattern
   (?&name)          Recurse into a named subpattern
   (?P>name)         Recurse into a named subpattern (python syntax)
   (?(cond)yes|no)
   (?(cond)yes)      Conditional expression, where "cond" can be:
                     (N)       subpattern N has matched something
                     (<name>)  named subpattern has matched something
                     ('name')  named subpattern has matched something
                     (?{code}) code condition
                     (R)       true if recursing
                     (RN)      true if recursing into Nth subpattern
                     (R&name)  true if recursing into named subpattern
                     (DEFINE)  always false, no no-pattern allowed
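
Several of the constructs above can be tried out in a few lines. The following is a hedged sketch; the group name C<paren> and all sample data are made up for illustration.

```perl
use strict;
use warnings;

# Named captures are read back through %+.
my %date;
if ("2024-03-15" =~ /(?<year>\d{4})-(?<month>\d\d)-(?<day>\d\d)/) {
    %date = %+;
}

# Lookbehind checks the context without making it part of the match.
my @prices = 'cost: $30, weight: 30kg, fee: $5' =~ /(?<=\$)\d+/g;

# Negative lookahead rejects strings that start with "foo".
my @not_foo = grep { /^(?!foo)/ } qw(foobar bar food baz);

# Branch reset: both alternatives capture into $1.
my @sizes = map { /^(?|(\d+)px|(\d+)em)$/ ? $1 : () } qw(12px 3em 7pt);

# Recursing into a named group matches balanced parentheses.
my $balanced = qr/^(?<paren>\((?:[^()]++|(?&paren))*\))$/;
my $ok  = "(a(b)c)" =~ $balanced ? 1 : 0;
my $bad = "(a(b)c"  =~ $balanced ? 1 : 0;

print join(",", @date{qw(year month day)}), "\n";   # 2024,03,15
print "@prices | @not_foo | @sizes | $ok $bad\n";
# 30 5 | bar baz | 12 3 | 1 0
```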

=head2 VARIABLES

   $_    Default variable for operators to use

   $`    Everything prior to matched string
   $&    Entire matched string
   $'    Everything after matched string

   ${^PREMATCH}   Everything prior to matched string
   ${^MATCH}      Entire matched string
   ${^POSTMATCH}  Everything after matched string

The use of C<$`>, C<$&> or C<$'> will slow down B<all> regex use
within your program. Consult L<perlvar> for C<@->
to see equivalent expressions that won't cause a slowdown.
See also L<Devel::SawAmpersand>. Starting with Perl 5.10, you
can also use the equivalent variables C<${^PREMATCH}>, C<${^MATCH}>
and C<${^POSTMATCH}>, but for them to be defined, you have to
specify the C</p> (preserve) modifier on your regular expression.

   $1, $2 ...  hold the Xth captured expr
   $+    Last parenthesized pattern match
   $^N   Holds the most recently closed capture
   $^R   Holds the result of the last (?{...}) expr
   @-    Offsets of starts of groups. $-[0] holds start of whole match
   @+    Offsets of ends of groups. $+[0] holds end of whole match
   %+    Named capture groups
   %-    Named capture groups, as array refs

Captured groups are numbered according to their I<opening> paren.
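
As an illustration of the C</p> variables, the offset arrays and the group numbering rule, here is a sketch with an invented sample string:

```perl
use strict;
use warnings;

my $str = "The quick brown fox";

my ($pre, $match, $post, $first, $start, $end);
if ($str =~ /(\w+)\s+brown/p) {
    # /p makes these three available without using $`, $& and $'.
    ($pre, $match, $post) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});
    $first = $1;
    # $-[0] and $+[0] are the start and end offsets of the whole match.
    ($start, $end) = ($-[0], $+[0]);
}

# substr with the offsets reproduces the matched text.
my $same = substr($str, $start, $end - $start);

# Groups are numbered by their opening parenthesis: the outer
# group is $1, then the inner ones in order.
my @groups = "ab" =~ /((a)(b))/;

print "[$pre][$match][$post] $first $start-$end\n";
# [The ][quick brown][ fox] quick 4-15
print "@groups\n";   # ab a b
```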

=head2 FUNCTIONS

   lc          Lowercase a string
   lcfirst     Lowercase first char of a string
   uc          Uppercase a string
   ucfirst     Titlecase first char of a string

   pos         Return or set current match position
   quotemeta   Quote metacharacters
   reset       Reset ?pattern? status
   study       Analyze string for optimizing matching

   split       Use a regex to split a string into parts

The first four of these are like the escape sequences C<\L>, C<\l>,
C<\U>, and C<\u>.  For Titlecase, see L</Titlecase>.
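
A brief sketch of some of these functions working alongside regexes; the strings are made up for illustration:

```perl
use strict;
use warnings;

# split takes a regex as its separator.
my @cells = split /\s*,\s*/, "a, b,c ,d";

# quotemeta does at run time what \Q does inside a pattern.
my $safe  = quotemeta("1+1=2?");
my $found = "is 1+1=2? yes" =~ /$safe/ ? 1 : 0;

# pos reports where the last /g match on a string ended,
# and can be assigned to in order to move that position.
my $text = "aXbXc";
$text =~ /X/g;
my $after_first = pos($text);
pos($text) = 0;

# lc and ucfirst correspond to \L and \u.
my $title = ucfirst(lc("hELLO"));

print "@cells | $safe | $found | $after_first | $title\n";
# a b c d | 1\+1\=2\? | 1 | 2 | Hello
```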

=head2 TERMINOLOGY

=head3 Titlecase

A Unicode concept that is most often equal to uppercase, but for
certain characters like the German "sharp s" there is a difference.

=head1 AUTHOR

Iain Truskett. Updated by the Perl 5 Porters.

This document may be distributed under the same terms as Perl itself.

=head1 SEE ALSO

=over 4

=item *

L<perlretut> for a tutorial on regular expressions.

=item *

L<perlrequick> for a rapid tutorial.

=item *

L<perlre> for more details.

=item *

L<perlvar> for details on the variables.

=item *

L<perlop> for details on the operators.

=item *

L<perlfunc> for details on the functions.

=item *

L<perlfaq6> for FAQs on regular expressions.

=item *

L<perlrebackslash> for a reference on backslash sequences.

=item *

L<perlrecharclass> for a reference on character classes.

=item *

The L<re> module to alter behaviour and aid
debugging.

=item *

L<perldebug/"Debugging Regular Expressions">

=item *

L<perluniintro>, L<perlunicode>, L<charnames> and L<perllocale>
for details on regexes and internationalisation.

=item *

I<Mastering Regular Expressions> by Jeffrey Friedl
(F<http://oreilly.com/catalog/9780596528126/>) for a thorough grounding and
reference on the topic.

=back

=head1 THANKS

David P.C. Wollmann,
Richard Soderberg,
Sean M. Burke,
Tom Christiansen,
Jim Cromie,
and
Jeffrey Goff
for useful advice.

=cut