pod/perlreref.pod

   1 =head1 NAME
   2
   3 perlreref - Perl Regular Expressions Reference
   4
   5 =head1 DESCRIPTION
   6
   7 This is a quick reference to Perl's regular expressions.
   8 For full information see L<perlre> and L<perlop>, as well
   9 as the L</"SEE ALSO"> section in this document.
  10
  11 =head2 OPERATORS
  12
  13 C<=~> determines to which variable the regex is applied.
  14 In its absence, $_ is used.
  15
  16     $var =~ /foo/;
  17
  18 C<!~> determines to which variable the regex is applied,
  19 and negates the result of the match; it returns
  20 false if the match succeeds, and true if it fails.
  21
  22     $var !~ /foo/;
  23
  24 C<m/pattern/msixpogc> searches a string for a pattern match,
  25 applying the given options.
  26
  27     m  Multiline mode - ^ and $ match internal lines
  28     s  match as a Single line - . matches \n
  29     i  case-Insensitive
  30     x  eXtended legibility - free whitespace and comments
  31     p  Preserve a copy of the matched string -
  32        ${^PREMATCH}, ${^MATCH}, ${^POSTMATCH} will be defined.
  33     o  compile pattern Once
  34     g  Global - all occurrences
  35     c  don't reset pos on failed matches when using /g
  36
  37 If 'pattern' is an empty string, the last I<successfully> matched
  38 regex is used. Delimiters other than '/' may be used for both this
  39 operator and the following ones. The leading C<m> can be omitted
  40 if the delimiter is '/'.
  41
  42 C<qr/pattern/msixpo> lets you store a regex in a variable,
  43 or pass one around. The modifiers are the same as for C<m//>
  44 and are stored within the regex.
  45
  46 C<s/pattern/replacement/msixpogce> substitutes matches of
  47 'pattern' with 'replacement'. Modifiers as for C<m//>,
  48 with one addition:
  49
  50     e  Evaluate 'replacement' as an expression
  51
  52 'e' may be specified multiple times. 'replacement' is interpreted
  53 as a double quoted string unless a single-quote (C<'>) is the delimiter.
  54
  55 C<?pattern?> is like C<m/pattern/> but matches only once. No alternate
  56 delimiters can be used.  Must be reset with reset().
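
The operators above can be sketched together as follows. This is a minimal illustration, not a canonical usage pattern; the sample string and the variable names (C<$text>, C<$pair>, and so on) are assumptions made up for the example.

```perl
use strict;
use warnings;

my $text = "apple=1, banana=22, cherry=333";   # made-up sample data

# =~ binds the match to $text; without it, $_ would be searched.
my $has_banana = $text =~ /banana/ ? 1 : 0;

# !~ negates the result of the match.
my $lacks_durian = $text !~ /durian/ ? 1 : 0;

# qr// stores a compiled regex, modifiers included, for later reuse.
my $pair = qr/(\w+)=(\d+)/i;

# With /g in list context, m// returns the captures of every match.
my @fields = $text =~ /$pair/g;

# s///ge evaluates the replacement as an expression for each match.
(my $doubled = $text) =~ s/(\d+)/$1 * 2/ge;

print "fields:  @fields\n";
print "doubled: $doubled\n";
```

Copying into C<$doubled> before substituting leaves C<$text> untouched, which is a common way to keep the original string around.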
  57
  58 =head2 SYNTAX
  59
  60  \       Escapes the character immediately following it
  61  .       Matches any single character except a newline (unless /s is
  62            used)
  63  ^       Matches at the beginning of the string (or line, if /m is used)
  64  $       Matches at the end of the string (or line, if /m is used)
  65  *       Matches the preceding element 0 or more times
  66  +       Matches the preceding element 1 or more times
  67  ?       Matches the preceding element 0 or 1 times
  68  {...}   Specifies a range of occurrences for the element preceding it
  69  [...]   Matches any one of the characters contained within the brackets
  70  (...)   Groups subexpressions for capturing to $1, $2...
  71  (?:...) Groups subexpressions without capturing (cluster)
  72  |       Matches either the subexpression preceding or following it
  73  \1, \2, \3 ...           Matches the text from the Nth group
  74  \g1 or \g{1}, \g2 ...    Matches the text from the Nth group
  75  \g-1 or \g{-1}, \g-2 ... Matches the text from the Nth previous group
  76  \g{name}     Named backreference
  77  \k<name>     Named backreference
  78  \k'name'     Named backreference
  79  (?P=name)    Named backreference (python syntax)
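
As a small sketch of the backreference forms listed above, the following finds doubled words in three equivalent ways. The sample sentence and the group name C<word> are assumptions chosen for illustration.

```perl
use strict;
use warnings;

my $line = "this is is a test test";   # made-up sample with doubled words

# \1 matches the text captured by the first group.
my @numbered = $line =~ /\b(\w+) \1\b/g;

# \g{-1} matches the text of the group opened most recently before it.
my @relative = $line =~ /\b(\w+) \g{-1}\b/g;

# \k<word> matches the text captured by the group named "word".
my @named = $line =~ /\b(?<word>\w+) \k<word>\b/g;

print "numbered: @numbered\n";
print "relative: @relative\n";
print "named:    @named\n";
```

The C<\b> assertions keep the "is" inside "this" from pairing with the following "is".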
  80
  81 =head2 ESCAPE SEQUENCES
  82
  83 These work as in normal strings.
  84
  85    \a       Alarm (beep)
  86    \e       Escape
  87    \f       Formfeed
  88    \n       Newline
  89    \r       Carriage return
  90    \t       Tab
  91    \037     Any octal ASCII value
  92    \x7f     Any hexadecimal ASCII value
  93    \x{263a} A wide hexadecimal value
  94    \cx      Control-x
  95    \N{name} A named character
  96    \N{U+263D} A Unicode character by hex ordinal
  97
  98    \l  Lowercase next character
  99    \u  Titlecase next character
 100    \L  Lowercase until \E
 101    \U  Uppercase until \E
 102    \Q  Disable pattern metacharacters until \E
 103    \E  End modification
 104
 105 For Titlecase, see L</Titlecase>.
 106
 107 This one works differently from normal strings:
 108
 109    \b  An assertion, not backspace, except in a character class
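
A short sketch of a few of these sequences in use. The strings are hypothetical; C<$literal> stands in for any text that happens to contain pattern metacharacters.

```perl
use strict;
use warnings;

my $literal = "a.b*c";   # hypothetical input containing "." and "*"

# Interpolated as-is, "." matches any character and "b*" is a quantified "b".
my $unquoted_hit = "aXbbc" =~ /^$literal$/ ? 1 : 0;

# Between \Q and \E every metacharacter is matched literally.
my $quoted_hit  = "aXbbc" =~ /^\Q$literal\E$/ ? 1 : 0;
my $quoted_self = "a.b*c" =~ /^\Q$literal\E$/ ? 1 : 0;

# \L lowercases up to \E, and \u titlecases the first character.
my $name = "\u\LjOHN sMITH\E";

# Inside a character class, \b is a backspace rather than a word boundary.
my $backspace_hit = "x\by" =~ /x[\b]y/ ? 1 : 0;

print "$unquoted_hit $quoted_hit $quoted_self $name $backspace_hit\n";
```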
 110
 111 =head2 CHARACTER CLASSES
 112
 113    [amy]    Match 'a', 'm' or 'y'
 114    [f-j]    Dash specifies "range"
 115    [f-j-]   Dash escaped or at start or end means 'dash'
 116    [^f-j]   Caret indicates "match any character _except_ these"
 117
 118 The following sequences (except C<\N>) work both inside and outside a
 119 character class. The first six are locale aware; all are Unicode aware.
 120 See L<perllocale> and L<perlunicode> for details.
 121
 122    \d      A digit
 123    \D      A nondigit
 124    \w      A word character
 125    \W      A non-word character
 126    \s      A whitespace character
 127    \S      A non-whitespace character
 128    \h      A horizontal whitespace character
 129    \H      A non-horizontal-whitespace character
 130    \N      A non-newline (when not followed by '{NAME}'; experimental;
 131            not valid in a character class; equivalent to [^\n]; it's
 132            like '.' without /s modifier)
 133    \v      A vertical whitespace character
 134    \V      A non-vertical-whitespace character
 135    \R      A generic newline           (?>\v|\x0D\x0A)
 136
 137    \C      Match a byte (with Unicode, '.' matches a character)
 138    \pP     Match P-named (Unicode) property
 139    \p{...} Match Unicode property with name longer than 1 character
 140    \PP     Match non-P
 141    \P{...} Match lack of Unicode property with name longer than 1 char
 142    \X      Match Unicode extended grapheme cluster
 143
 144 POSIX character classes and their Unicode and Perl equivalents:
 145
 146            ASCII-         Full-
 147            range          range   backslash
 148  POSIX    \p{...}         \p{}    sequence       Description
 149  -----------------------------------------------------------------------
 150  alnum   PosixAlnum       Alnum               Alpha plus Digit
 151  alpha   PosixAlpha       Alpha               Alphabetic characters
 152  ascii   ASCII                                Any ASCII character
 153  blank   PosixBlank       Blank     \h        Horizontal whitespace;
 154                                                 full-range also written
 155                                                 as \p{HorizSpace} (GNU
 156                                                 extension)
 157  cntrl   PosixCntrl       Cntrl               Control characters
 158  digit   PosixDigit       Digit     \d        Decimal digits
 159  graph   PosixGraph       Graph               Alnum plus Punct
 160  lower   PosixLower       Lower               Lowercase characters
 161  print   PosixPrint       Print               Graph plus Space, but not
 162                                                 any Cntrls
 163  punct   PosixPunct       Punct               These aren't precisely
 164                                                 equivalent.  See NOTE,
 165                                                 below.
 166  space   PosixSpace       Space     [\s\cK]   Whitespace
 167          PerlSpace        SpacePerl \s        Perl's whitespace
 168                                                 definition
 169  upper   PosixUpper       Upper               Uppercase characters
 170  word    PerlWord         Word      \w        Alnum plus '_' (Perl
 171                                                 extension)
 172  xdigit  ASCII_Hex_Digit  XDigit              Hexadecimal digit,
 173                                                 ASCII-range is
 174                                                 [0-9A-Fa-f]
 175
 176 NOTE on C<[[:punct:]]>, C<\p{PosixPunct}> and C<\p{Punct}>:
 177 In the ASCII range, C<[[:punct:]]> and C<\p{PosixPunct}> match
 178 C<[-!"#$%&'()*+,./:;<=E<gt>?@[\\\]^_`{|}~]> (although if a locale is in
 179 effect, it could alter the behavior of C<[[:punct:]]>); and C<\p{Punct}>
 180 matches C<[-!"#%&'()*,./:;?@[\\\]_{}]>.  When matching a UTF-8 string,
 181 C<[[:punct:]]> matches what it does in the ASCII range, plus what
 182 C<\p{Punct}> matches.  C<\p{Punct}> matches anything that isn't a
 183 control, an alphanumeric, a space, or a symbol.
 184
 185 Within a character class:
 186
 187     POSIX      traditional   Unicode
 188   [:digit:]       \d        \p{Digit}
 189   [:^digit:]      \D        \P{Digit}
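
The following sketch exercises several of the classes above on one string. The input and the variable names are assumptions for illustration only, and C<\h> needs Perl 5.10 or later.

```perl
use strict;
use warnings;

my $input = "id:\t42  name: Ann_7";   # made-up sample: one tab, three spaces

my @digits = $input =~ /(\d+)/g;      # runs of digits
my @words  = $input =~ /(\w+)/g;      # runs of word characters

# Count horizontal whitespace characters (the tab and the spaces).
my $hspace = () = $input =~ /\h/g;

# POSIX classes are written inside a bracketed character class.
my @posix = $input =~ /([[:upper:]][[:lower:]]+)/g;

# Unicode properties: an uppercase letter followed by lowercase letters.
my @by_prop = $input =~ /(\p{Lu}\p{Ll}+)/g;

print "digits: @digits\nwords: @words\nhspace: $hspace\n";
print "posix: @posix\nby_prop: @by_prop\n";
```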
 190
 191 =head2 ANCHORS
 192
 193 All are zero-width assertions.
 194
 195    ^  Match string start (or line, if /m is used)
 196    $  Match string end (or line, if /m is used) or before newline
 197    \b Match word boundary (between \w and \W)
 198    \B Match except at word boundary (between \w and \w or \W and \W)
 199    \A Match string start (regardless of /m)
 200    \Z Match string end (before optional newline)
 201    \z Match absolute string end
 202    \G Match where previous m//g left off
 203    \K Keep the stuff left of the \K, don't include it in $&
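
A minimal sketch contrasting some of these anchors. The strings are hypothetical samples, and C<\K> needs Perl 5.10 or later.

```perl
use strict;
use warnings;

my $text = "one\ntwo\nthree\n";   # made-up three-line sample

# With /m, ^ matches at the start of every line.
my @line_starts = $text =~ /^(\w+)/mg;

# \A matches only at the start of the string, regardless of /m.
my @string_start = $text =~ /\A(\w+)/mg;

# \Z allows one trailing newline after the match; \z does not.
my $before_final_newline = $text =~ /three\Z/ ? 1 : 0;
my $absolute_end         = $text =~ /three\z/ ? 1 : 0;

# \G continues exactly where the previous m//g match left off.
my $src = "aab";
my @run;
push @run, $1 while $src =~ /\G(a)/g;

# \K keeps "cost: " out of the matched text, so only the digits are replaced.
(my $price = "cost: 10 USD") =~ s/cost: \K\d+/20/;

print "@line_starts | @string_start | @run | $price\n";
```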
 204
 205 =head2 QUANTIFIERS
 206
 207 Quantifiers are greedy by default and match the B<longest> leftmost.
 208
 209    Maximal Minimal Possessive Allowed range
 210    ------- ------- ---------- -------------
 211    {n,m}   {n,m}?  {n,m}+     Must occur at least n times
 212                               but no more than m times
 213    {n,}    {n,}?   {n,}+      Must occur at least n times
 214    {n}     {n}?    {n}+       Must occur exactly n times
 215    *       *?      *+         0 or more times (same as {0,})
 216    +       +?      ++         1 or more times (same as {1,})
 217    ?       ??      ?+         0 or 1 time (same as {0,1})
 218
 219 The possessive forms (new in Perl 5.10) prevent backtracking: what gets
 220 matched by a pattern with a possessive quantifier will not be backtracked
 221 into, even if that causes the whole match to fail.
 222
 223 There is no quantifier C<{,n}>. That's interpreted as a literal string.
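
To illustrate the difference between the maximal, minimal and possessive forms, here is a small sketch. The sample strings are assumptions, and the possessive form requires Perl 5.10 or later.

```perl
use strict;
use warnings;

my $html = "<b>one</b><b>two</b>";   # made-up sample markup

# Maximal (greedy): .* runs as far as it can, then gives back characters.
my ($greedy) = $html =~ /<b>(.*)<\/b>/;

# Minimal: .*? stops at the first place the rest of the pattern can match.
my ($minimal) = $html =~ /<b>(.*?)<\/b>/;

# a+ can give back one "a" so that the final "a" still matches.
my $with_backtracking = "aaa" =~ /a+a/ ? 1 : 0;

# a++ never gives anything back, so the final "a" has nothing left to match.
my $possessive = "aaa" =~ /a++a/ ? 1 : 0;

print "greedy:  $greedy\n";
print "minimal: $minimal\n";
print "a+a: $with_backtracking, a++a: $possessive\n";
```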
 224
 225 =head2 EXTENDED CONSTRUCTS
 226
 227    (?#text)          A comment
 228    (?:...)           Groups subexpressions without capturing (cluster)
 229    (?pimsx-imsx:...) Enable/disable option (as per m// modifiers)
 230    (?=...)           Zero-width positive lookahead assertion
 231    (?!...)           Zero-width negative lookahead assertion
 232    (?<=...)          Zero-width positive lookbehind assertion
 233    (?<!...)          Zero-width negative lookbehind assertion
 234    (?>...)           Grab what we can, prohibit backtracking
 235    (?|...)           Branch reset
 236    (?<name>...)      Named capture
 237    (?'name'...)      Named capture
 238    (?P<name>...)     Named capture (python syntax)
 239    (?{ code })       Embedded code, return value becomes $^R
 240    (??{ code })      Dynamic regex, return value used as regex
 241    (?N)              Recurse into subpattern number N
 242    (?-N), (?+N)      Recurse into Nth previous/next subpattern
 243    (?R), (?0)        Recurse at the beginning of the whole pattern
 244    (?&name)          Recurse into a named subpattern
 245    (?P>name)         Recurse into a named subpattern (python syntax)
 246    (?(cond)yes|no)
 247    (?(cond)yes)      Conditional expression, where "cond" can be:
 248                      (N)       subpattern N has matched something
 249                      (<name>)  named subpattern has matched something
 250                      ('name')  named subpattern has matched something
 251                      (?{code}) code condition
 252                      (R)       true if recursing
 253                      (RN)      true if recursing into Nth subpattern
 254                      (R&name)  true if recursing into named subpattern
 255                      (DEFINE)  always false; a "no" pattern is not allowed
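
A few of the constructs above, sketched on hypothetical inputs. The variable names and the group names (C<year>, C<month>, C<day>) are assumptions for the example; all of these constructs need Perl 5.10 or later.

```perl
use strict;
use warnings;

# Lookahead: digits that are followed by "px", without consuming "px".
my ($width) = "width: 120px" =~ /(\d+)(?=px)/;

# Lookbehind: digits that are preceded by a dollar sign.
my ($amount) = 'total $45 of 90' =~ /(?<=\$)(\d+)/;

# Named captures are available through %+ after a successful match.
my %date;
if ("2010-04-12" =~ /(?<year>\d{4})-(?<month>\d\d)-(?<day>\d\d)/) {
    %date = %+;
}

# Branch reset: both alternatives capture into $1.
my @values = map { /(?|a=(\d+)|b:(\d+))/ ? $1 : () } ("a=1", "b:2");

# Recursion: (?1) re-enters group 1 to match nested parentheses.
my $balanced = qr/\A(\((?:[^()]++|(?1))*\))\z/;
my $ok  = "(a(b)c)" =~ $balanced ? 1 : 0;
my $bad = "(a(b)c"  =~ $balanced ? 1 : 0;

print "$width $amount $date{year}/$date{month}/$date{day} @values $ok $bad\n";
```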
 256
 257 =head2 VARIABLES
 258
 259    $_    Default variable for operators to use
 260
 261    $`    Everything prior to matched string
 262    $&    Entire matched string
 263    $'    Everything after matched string
 264
 265    ${^PREMATCH}   Everything prior to matched string
 266    ${^MATCH}      Entire matched string
 267    ${^POSTMATCH}  Everything after matched string
 268
 269 The use of C<$`>, C<$&> or C<$'> will slow down B<all> regex use
 270 within your program. Consult L<perlvar> for C<@->
 271 to see equivalent expressions that won't cause a slowdown.
 272 See also L<Devel::SawAmpersand>. Starting with Perl 5.10, you
 273 can also use the equivalent variables C<${^PREMATCH}>, C<${^MATCH}>
 274 and C<${^POSTMATCH}>, but for them to be defined, you have to
 275 specify the C</p> (preserve) modifier on your regular expression.
 276
 277    $1, $2 ...  hold the Xth captured expr
 278    $+    Last parenthesized pattern match
 279    $^N   Holds the most recently closed capture
 280    $^R   Holds the result of the last (?{...}) expr
 281    @-    Offsets of starts of groups. $-[0] holds start of whole match
 282    @+    Offsets of ends of groups. $+[0] holds end of whole match
 283    %+    Named capture buffers
 284    %-    Named capture buffers, as array refs
 285
 286 Captured groups are numbered according to their I<opening> paren.
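
The following sketch reads several of these variables after a single match. The sample sentence, the group name C<adj> and the variable names are assumptions for illustration; C</p> and C<%+> need Perl 5.10 or later.

```perl
use strict;
use warnings;

my $s = "The quick brown fox";   # made-up sample sentence
my ($pre, $match, $post, $start, $end, $second, $named, $last_paren);

if ($s =~ /(?<adj>quick) (brown)/p) {
    # /p makes these three available without using $`, $& and $'.
    ($pre, $match, $post) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});

    # Offsets of the whole match.
    ($start, $end) = ($-[0], $+[0]);

    # The second group rebuilt from its start and end offsets.
    $second = substr($s, $-[2], $+[2] - $-[2]);

    $named      = $+{adj};   # named capture
    $last_paren = $+;        # last parenthesized group that matched
}

print "[$pre][$match][$post] $start-$end $second $named $last_paren\n";
```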
 287
 288 =head2 FUNCTIONS
 289
 290    lc          Lowercase a string
 291    lcfirst     Lowercase first char of a string
 292    uc          Uppercase a string
 293    ucfirst     Titlecase first char of a string
 294
 295    pos         Return or set current match position
 296    quotemeta   Quote metacharacters
 297    reset       Reset ?pattern? status
 298    study       Analyze string for optimizing matching
 299
 300    split       Use a regex to split a string into parts
 301
 302 The first four of these are like the escape sequences C<\L>, C<\l>,
 303 C<\U>, and C<\u>.  For Titlecase, see L</Titlecase>.
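
A brief sketch of some of these functions working alongside a regex. All strings and variable names here are hypothetical.

```perl
use strict;
use warnings;

# lc and ucfirst mirror \L and \u.
my $title = ucfirst lc "hELLO";

# quotemeta mirrors \Q: the result matches its input literally.
my $safe    = quotemeta "1+1=2";
my $literal = "so 1+1=2 holds" =~ /$safe/ ? 1 : 0;

# After a successful scalar-context m//g, pos() is the offset just past the match.
my $str   = "aXbXc";
my $found = $str =~ /X/g;
my $after_first = pos($str);

# split uses a regex as the separator.
my @parts = split /\s*,\s*/, "a , b,c";

print "$title $literal $after_first @parts\n";
```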
 304
 305 =head2 TERMINOLOGY
 306
 307 =head3 Titlecase
 308
 309 Unicode concept which most often is equal to uppercase, but for
 310 certain characters like the German "sharp s" there is a difference.
 311
 312 =head1 AUTHOR
 313
 314 Iain Truskett. Updated by the Perl 5 Porters.
 315
 316 This document may be distributed under the same terms as Perl itself.
 317
 318 =head1 SEE ALSO
 319
 320 =over 4
 321
 322 =item *
 323
 324 L<perlretut> for a tutorial on regular expressions.
 325
 326 =item *
 327
 328 L<perlrequick> for a rapid tutorial.
 329
 330 =item *
 331
 332 L<perlre> for more details.
 333
 334 =item *
 335
 336 L<perlvar> for details on the variables.
 337
 338 =item *
 339
 340 L<perlop> for details on the operators.
 341
 342 =item *
 343
 344 L<perlfunc> for details on the functions.
 345
 346 =item *
 347
 348 L<perlfaq6> for FAQs on regular expressions.
 349
 350 =item *
 351
 352 L<perlrebackslash> for a reference on backslash sequences.
 353
 354 =item *
 355
 356 L<perlrecharclass> for a reference on character classes.
 357
 358 =item *
 359
 360 The L<re> module to alter behaviour and aid
 361 debugging.
 362
 363 =item *
 364
 365 L<perldebug/"Debugging regular expressions">
 366
 367 =item *
 368
 369 L<perluniintro>, L<perlunicode>, L<charnames> and L<perllocale>
 370 for details on regexes and internationalisation.
 371
 372 =item *
 373
 374 I<Mastering Regular Expressions> by Jeffrey Friedl
 375 (F<http://oreilly.com/catalog/9780596528126/>) for a thorough grounding and
 376 reference on the topic.
 377
 378 =back
 379
 380 =head1 THANKS
 381
 382 David P.C. Wollmann,
 383 Richard Soderberg,
 384 Sean M. Burke,
 385 Tom Christiansen,
 386 Jim Cromie,
 387 and
 388 Jeffrey Goff
 389 for useful advice.
 390
 391 =cut