| 1 | =head1 NAME |
| 2 | |
| 3 | perlreguts - Description of the Perl regular expression engine. |
| 4 | |
| 5 | =head1 DESCRIPTION |
| 6 | |
| 7 | This document is an attempt to shine some light on the guts of the regex |
| 8 | engine and how it works. The regex engine represents a significant chunk |
| 9 | of the perl codebase, but is relatively poorly understood. This document |
| 10 | is a meagre attempt at addressing this situation. It is derived from the |
| 11 | author's experience, comments in the source code, other papers on the |
| 12 | regex engine, feedback on the perl5-porters mail list, and no doubt other |
| 13 | places as well. |
| 14 | |
| 15 | B<NOTICE!> It should be clearly understood that the behavior and |
| 16 | structures discussed in this represents the state of the engine as the |
| 17 | author understood it at the time of writing. It is B<NOT> an API |
| 18 | definition, it is purely an internals guide for those who want to hack |
| 19 | the regex engine, or understand how the regex engine works. Readers of |
| 20 | this document are expected to understand perl's regex syntax and its |
| 21 | usage in detail. If you want to learn about the basics of Perl's |
| 22 | regular expressions, see L<perlre>. And if you want to replace the |
| 23 | regex engine with your own, see L<perlreapi>. |
| 24 | |
| 25 | =head1 OVERVIEW |
| 26 | |
| 27 | =head2 A quick note on terms |
| 28 | |
| 29 | There is some debate as to whether to say "regexp" or "regex". In this |
| 30 | document we will use the term "regex" unless there is a special reason |
| 31 | not to, in which case we will explain why. |
| 32 | |
| 33 | When speaking about regexes we need to distinguish between their source |
| 34 | code form and their internal form. In this document we will use the term |
| 35 | "pattern" when we speak of their textual, source code form, and the term |
| 36 | "program" when we speak of their internal representation. These |
| 37 | correspond to the terms I<S-regex> and I<B-regex> that Mark Jason |
| 38 | Dominus employs in his paper on "Rx" ([1] in L</REFERENCES>). |
| 39 | |
| 40 | =head2 What is a regular expression engine? |
| 41 | |
| 42 | A regular expression engine is a program that takes a set of constraints |
| 43 | specified in a mini-language, and then applies those constraints to a |
| 44 | target string, and determines whether or not the string satisfies the |
| 45 | constraints. See L<perlre> for a full definition of the language. |
| 46 | |
| 47 | In less grandiose terms, the first part of the job is to turn a pattern into |
| 48 | something the computer can efficiently use to find the matching point in |
| 49 | the string, and the second part is performing the search itself. |
| 50 | |
| 51 | To do this we need to produce a program by parsing the text. We then |
| 52 | need to execute the program to find the point in the string that |
| 53 | matches. And we need to do the whole thing efficiently. |
| 54 | |
| 55 | =head2 Structure of a Regexp Program |
| 56 | |
| 57 | =head3 High Level |
| 58 | |
| 59 | Although it is a bit confusing and some people object to the terminology, it |
| 60 | is worth taking a look at a comment that has |
| 61 | been in F<regexp.h> for years: |
| 62 | |
| 63 | I<This is essentially a linear encoding of a nondeterministic |
| 64 | finite-state machine (aka syntax charts or "railroad normal form" in |
| 65 | parsing technology).> |
| 66 | |
| 67 | The term "railroad normal form" is a bit esoteric, with "syntax |
| 68 | diagram/charts", or "railroad diagram/charts" being more common terms. |
| 69 | Nevertheless it provides a useful mental image of a regex program: each |
| 70 | node can be thought of as a unit of track, with a single entry and in |
| 71 | most cases a single exit point (there are pieces of track that fork, but |
| 72 | statistically not many), and the whole forms a layout with a |
| 73 | single entry and single exit point. The matching process can be thought |
| 74 | of as a car that moves along the track, with the particular route through |
| 75 | the system being determined by the character read at each possible |
| 76 | connector point. A car can fall off the track at any point but it may |
| 77 | only proceed as long as it matches the track. |
| 78 | |
| 79 | Thus the pattern C</foo(?:\w+|\d+|\s+)bar/> can be thought of as the |
| 80 | following chart: |
| 81 | |
| 82 | [start] |
| 83 | | |
| 84 | <foo> |
| 85 | | |
| 86 | +-----+-----+ |
| 87 | | | | |
| 88 | <\w+> <\d+> <\s+> |
| 89 | | | | |
| 90 | +-----+-----+ |
| 91 | | |
| 92 | <bar> |
| 93 | | |
| 94 | [end] |
| 95 | |
| 96 | The truth of the matter is that perl's regular expressions these days are |
| 97 | much more complex than this kind of structure, but visualising it this way |
| 98 | can help when trying to get your bearings, and it matches the |
| 99 | current implementation pretty closely. |
| 100 | |
| 101 | To be more precise, we will say that a regex program is an encoding |
| 102 | of a graph. Each node in the graph corresponds to part of |
| 103 | the original regex pattern, such as a literal string or a branch, |
| 104 | and has a pointer to the nodes representing the next component |
| 105 | to be matched. Since "node" and "opcode" already have other meanings in the |
| 106 | perl source, we will call the nodes in a regex program "regops". |
| 107 | |
| 108 | The program is represented by an array of C<regnode> structures, one or |
| 109 | more of which represent a single regop of the program. Struct |
| 110 | C<regnode> is the smallest struct needed, and has a field structure which is |
| 111 | shared with all the other larger structures. |
| 112 | |
| 113 | The "next" pointers of all regops except C<BRANCH> implement concatenation; |
| 114 | a "next" pointer with a C<BRANCH> on both ends of it is connecting two |
| 115 | alternatives. [Here we have one of the subtle syntax dependencies: an |
| 116 | individual C<BRANCH> (as opposed to a collection of them) is never |
| 117 | concatenated with anything because of operator precedence.] |
| 118 | |
| 119 | The operand of some types of regop is a literal string; for others, |
| 120 | it is a regop leading into a sub-program. In particular, the operand |
| 121 | of a C<BRANCH> node is the first regop of the branch. |
| 122 | |
| 123 | B<NOTE>: As the railroad metaphor suggests, this is B<not> a tree |
| 124 | structure: the tail of the branch connects to the thing following the |
| 125 | set of C<BRANCH>es. It is a like a single line of railway track that |
| 126 | splits as it goes into a station or railway yard and rejoins as it comes |
| 127 | out the other side. |
| 128 | |
| 129 | =head3 Regops |
| 130 | |
| 131 | The base structure of a regop is defined in F<regexp.h> as follows: |
| 132 | |
| 133 | struct regnode { |
| 134 | U8 flags; /* Various purposes, sometimes overridden */ |
| 135 | U8 type; /* Opcode value as specified by regnodes.h */ |
| 136 | U16 next_off; /* Offset in size regnode */ |
| 137 | }; |
| 138 | |
| 139 | Other larger C<regnode>-like structures are defined in F<regcomp.h>. They |
| 140 | are almost like subclasses in that they have the same fields as |
| 141 | C<regnode>, with possibly additional fields following in |
| 142 | the structure, and in some cases the specific meaning (and name) |
| 143 | of some of base fields are overridden. The following is a more |
| 144 | complete description. |
| 145 | |
| 146 | =over 4 |
| 147 | |
| 148 | =item C<regnode_1> |
| 149 | |
| 150 | =item C<regnode_2> |
| 151 | |
| 152 | C<regnode_1> structures have the same header, followed by a single |
| 153 | four-byte argument; C<regnode_2> structures contain two two-byte |
| 154 | arguments instead: |
| 155 | |
| 156 | regnode_1 U32 arg1; |
| 157 | regnode_2 U16 arg1; U16 arg2; |
| 158 | |
| 159 | =item C<regnode_string> |
| 160 | |
| 161 | C<regnode_string> structures, used for literal strings, follow the header |
| 162 | with a one-byte length and then the string data. Strings are padded on |
| 163 | the end with zero bytes so that the total length of the node is a |
| 164 | multiple of four bytes: |
| 165 | |
| 166 | regnode_string char string[1]; |
| 167 | U8 str_len; /* overrides flags */ |
| 168 | |
| 169 | =item C<regnode_charclass> |
| 170 | |
| 171 | Bracketed character classes are represented by C<regnode_charclass> |
| 172 | structures, which have a four-byte argument and then a 32-byte (256-bit) |
| 173 | bitmap indicating which characters in the Latin1 range are included in |
| 174 | the class. |
| 175 | |
| 176 | regnode_charclass U32 arg1; |
| 177 | char bitmap[ANYOF_BITMAP_SIZE]; |
| 178 | |
| 179 | Various flags whose names begin with C<ANYOF_> are used for special |
| 180 | situations. Above Latin1 matches and things not known until run-time |
| 181 | are stored in L</Perl's pprivate structure>. |
| 182 | |
| 183 | =item C<regnode_charclass_posixl> |
| 184 | |
| 185 | There is also a larger form of a char class structure used to represent |
| 186 | POSIX char classes under C</l> matching, |
| 187 | called C<regnode_charclass_posixl> which has an |
| 188 | additional 32-bit bitmap indicating which POSIX char classes |
| 189 | have been included. |
| 190 | |
| 191 | regnode_charclass_posixl U32 arg1; |
| 192 | char bitmap[ANYOF_BITMAP_SIZE]; |
| 193 | U32 classflags; |
| 194 | |
| 195 | =back |
| 196 | |
| 197 | F<regnodes.h> defines an array called C<regarglen[]> which gives the size |
| 198 | of each opcode in units of C<size regnode> (4-byte). A macro is used |
| 199 | to calculate the size of an C<EXACT> node based on its C<str_len> field. |
| 200 | |
| 201 | The regops are defined in F<regnodes.h> which is generated from |
| 202 | F<regcomp.sym> by F<regcomp.pl>. Currently the maximum possible number |
| 203 | of distinct regops is restricted to 256, with about a quarter already |
| 204 | used. |
| 205 | |
| 206 | A set of macros makes accessing the fields |
| 207 | easier and more consistent. These include C<OP()>, which is used to determine |
| 208 | the type of a C<regnode>-like structure; C<NEXT_OFF()>, which is the offset to |
| 209 | the next node (more on this later); C<ARG()>, C<ARG1()>, C<ARG2()>, C<ARG_SET()>, |
| 210 | and equivalents for reading and setting the arguments; and C<STR_LEN()>, |
| 211 | C<STRING()> and C<OPERAND()> for manipulating strings and regop bearing |
| 212 | types. |
| 213 | |
| 214 | =head3 What regop is next? |
| 215 | |
| 216 | There are three distinct concepts of "next" in the regex engine, and |
| 217 | it is important to keep them clear. |
| 218 | |
| 219 | =over 4 |
| 220 | |
| 221 | =item * |
| 222 | |
| 223 | There is the "next regnode" from a given regnode, a value which is |
| 224 | rarely useful except that sometimes it matches up in terms of value |
| 225 | with one of the others, and that sometimes the code assumes this to |
| 226 | always be so. |
| 227 | |
| 228 | =item * |
| 229 | |
| 230 | There is the "next regop" from a given regop/regnode. This is the |
| 231 | regop physically located after the current one, as determined by |
| 232 | the size of the current regop. This is often useful, such as when |
| 233 | dumping the structure we use this order to traverse. Sometimes the code |
| 234 | assumes that the "next regnode" is the same as the "next regop", or in |
| 235 | other words assumes that the sizeof a given regop type is always going |
| 236 | to be one regnode large. |
| 237 | |
| 238 | =item * |
| 239 | |
| 240 | There is the "regnext" from a given regop. This is the regop which |
| 241 | is reached by jumping forward by the value of C<NEXT_OFF()>, |
| 242 | or in a few cases for longer jumps by the C<arg1> field of the C<regnode_1> |
| 243 | structure. The subroutine C<regnext()> handles this transparently. |
| 244 | This is the logical successor of the node, which in some cases, like |
| 245 | that of the C<BRANCH> regop, has special meaning. |
| 246 | |
| 247 | =back |
| 248 | |
| 249 | =head1 Process Overview |
| 250 | |
| 251 | Broadly speaking, performing a match of a string against a pattern |
| 252 | involves the following steps: |
| 253 | |
| 254 | =over 5 |
| 255 | |
| 256 | =item A. Compilation |
| 257 | |
| 258 | =over 5 |
| 259 | |
| 260 | =item 1. Parsing for size |
| 261 | |
| 262 | =item 2. Parsing for construction |
| 263 | |
| 264 | =item 3. Peep-hole optimisation and analysis |
| 265 | |
| 266 | =back |
| 267 | |
| 268 | =item B. Execution |
| 269 | |
| 270 | =over 5 |
| 271 | |
| 272 | =item 4. Start position and no-match optimisations |
| 273 | |
| 274 | =item 5. Program execution |
| 275 | |
| 276 | =back |
| 277 | |
| 278 | =back |
| 279 | |
| 280 | |
| 281 | Where these steps occur in the actual execution of a perl program is |
| 282 | determined by whether the pattern involves interpolating any string |
| 283 | variables. If interpolation occurs, then compilation happens at run time. If it |
| 284 | does not, then compilation is performed at compile time. (The C</o> modifier changes this, |
| 285 | as does C<qr//> to a certain extent.) The engine doesn't really care that |
| 286 | much. |
| 287 | |
| 288 | =head2 Compilation |
| 289 | |
| 290 | This code resides primarily in F<regcomp.c>, along with the header files |
| 291 | F<regcomp.h>, F<regexp.h> and F<regnodes.h>. |
| 292 | |
| 293 | Compilation starts with C<pregcomp()>, which is mostly an initialisation |
| 294 | wrapper which farms work out to two other routines for the heavy lifting: the |
| 295 | first is C<reg()>, which is the start point for parsing; the second, |
| 296 | C<study_chunk()>, is responsible for optimisation. |
| 297 | |
| 298 | Initialisation in C<pregcomp()> mostly involves the creation and data-filling |
| 299 | of a special structure, C<RExC_state_t> (defined in F<regcomp.c>). |
| 300 | Almost all internally-used routines in F<regcomp.h> take a pointer to one |
| 301 | of these structures as their first argument, with the name C<pRExC_state>. |
| 302 | This structure is used to store the compilation state and contains many |
| 303 | fields. Likewise there are many macros which operate on this |
| 304 | variable: anything that looks like C<RExC_xxxx> is a macro that operates on |
| 305 | this pointer/structure. |
| 306 | |
| 307 | =head3 Parsing for size |
| 308 | |
| 309 | In this pass the input pattern is parsed in order to calculate how much |
| 310 | space is needed for each regop we would need to emit. The size is also |
| 311 | used to determine whether long jumps will be required in the program. |
| 312 | |
| 313 | This stage is controlled by the macro C<SIZE_ONLY> being set. |
| 314 | |
| 315 | The parse proceeds pretty much exactly as it does during the |
| 316 | construction phase, except that most routines are short-circuited to |
| 317 | change the size field C<RExC_size> and not do anything else. |
| 318 | |
| 319 | =head3 Parsing for construction |
| 320 | |
| 321 | Once the size of the program has been determined, the pattern is parsed |
| 322 | again, but this time for real. Now C<SIZE_ONLY> will be false, and the |
| 323 | actual construction can occur. |
| 324 | |
| 325 | C<reg()> is the start of the parse process. It is responsible for |
| 326 | parsing an arbitrary chunk of pattern up to either the end of the |
| 327 | string, or the first closing parenthesis it encounters in the pattern. |
| 328 | This means it can be used to parse the top-level regex, or any section |
| 329 | inside of a grouping parenthesis. It also handles the "special parens" |
| 330 | that perl's regexes have. For instance when parsing C</x(?:foo)y/> C<reg()> |
| 331 | will at one point be called to parse from the "?" symbol up to and |
| 332 | including the ")". |
| 333 | |
| 334 | Additionally, C<reg()> is responsible for parsing the one or more |
| 335 | branches from the pattern, and for "finishing them off" by correctly |
| 336 | setting their next pointers. In order to do the parsing, it repeatedly |
| 337 | calls out to C<regbranch()>, which is responsible for handling up to the |
| 338 | first C<|> symbol it sees. |
| 339 | |
| 340 | C<regbranch()> in turn calls C<regpiece()> which |
| 341 | handles "things" followed by a quantifier. In order to parse the |
| 342 | "things", C<regatom()> is called. This is the lowest level routine, which |
| 343 | parses out constant strings, character classes, and the |
| 344 | various special symbols like C<$>. If C<regatom()> encounters a "(" |
| 345 | character it in turn calls C<reg()>. |
| 346 | |
| 347 | The routine C<regtail()> is called by both C<reg()> and C<regbranch()> |
| 348 | in order to "set the tail pointer" correctly. When executing and |
| 349 | we get to the end of a branch, we need to go to the node following the |
| 350 | grouping parens. When parsing, however, we don't know where the end will |
| 351 | be until we get there, so when we do we must go back and update the |
| 352 | offsets as appropriate. C<regtail> is used to make this easier. |
| 353 | |
| 354 | A subtlety of the parsing process means that a regex like C</foo/> is |
| 355 | originally parsed into an alternation with a single branch. It is only |
| 356 | afterwards that the optimiser converts single branch alternations into the |
| 357 | simpler form. |
| 358 | |
| 359 | =head3 Parse Call Graph and a Grammar |
| 360 | |
| 361 | The call graph looks like this: |
| 362 | |
| 363 | reg() # parse a top level regex, or inside of |
| 364 | # parens |
| 365 | regbranch() # parse a single branch of an alternation |
| 366 | regpiece() # parse a pattern followed by a quantifier |
| 367 | regatom() # parse a simple pattern |
| 368 | regclass() # used to handle a class |
| 369 | reg() # used to handle a parenthesised |
| 370 | # subpattern |
| 371 | .... |
| 372 | ... |
| 373 | regtail() # finish off the branch |
| 374 | ... |
| 375 | regtail() # finish off the branch sequence. Tie each |
| 376 | # branch's tail to the tail of the |
| 377 | # sequence |
| 378 | # (NEW) In Debug mode this is |
| 379 | # regtail_study(). |
| 380 | |
| 381 | A grammar form might be something like this: |
| 382 | |
| 383 | atom : constant | class |
| 384 | quant : '*' | '+' | '?' | '{min,max}' |
| 385 | _branch: piece |
| 386 | | piece _branch |
| 387 | | nothing |
| 388 | branch: _branch |
| 389 | | _branch '|' branch |
| 390 | group : '(' branch ')' |
| 391 | _piece: atom | group |
| 392 | piece : _piece |
| 393 | | _piece quant |
| 394 | |
| 395 | =head3 Parsing complications |
| 396 | |
| 397 | The implication of the above description is that a pattern containing nested |
| 398 | parentheses will result in a call graph which cycles through C<reg()>, |
| 399 | C<regbranch()>, C<regpiece()>, C<regatom()>, C<reg()>, C<regbranch()> I<etc> |
| 400 | multiple times, until the deepest level of nesting is reached. All the above |
| 401 | routines return a pointer to a C<regnode>, which is usually the last regnode |
| 402 | added to the program. However, one complication is that reg() returns NULL |
| 403 | for parsing C<(?:)> syntax for embedded modifiers, setting the flag |
| 404 | C<TRYAGAIN>. The C<TRYAGAIN> propagates upwards until it is captured, in |
| 405 | some cases by C<regatom()>, but otherwise unconditionally by |
| 406 | C<regbranch()>. Hence it will never be returned by C<regbranch()> to |
| 407 | C<reg()>. This flag permits patterns such as C<(?i)+> to be detected as |
| 408 | errors (I<Quantifier follows nothing in regex; marked by <-- HERE in m/(?i)+ |
| 409 | <-- HERE />). |
| 410 | |
| 411 | Another complication is that the representation used for the program differs |
| 412 | if it needs to store Unicode, but it's not always possible to know for sure |
| 413 | whether it does until midway through parsing. The Unicode representation for |
| 414 | the program is larger, and cannot be matched as efficiently. (See L</Unicode |
| 415 | and Localisation Support> below for more details as to why.) If the pattern |
| 416 | contains literal Unicode, it's obvious that the program needs to store |
| 417 | Unicode. Otherwise, the parser optimistically assumes that the more |
| 418 | efficient representation can be used, and starts sizing on this basis. |
| 419 | However, if it then encounters something in the pattern which must be stored |
| 420 | as Unicode, such as an C<\x{...}> escape sequence representing a character |
| 421 | literal, then this means that all previously calculated sizes need to be |
| 422 | redone, using values appropriate for the Unicode representation. Currently, |
| 423 | all regular expression constructions which can trigger this are parsed by code |
| 424 | in C<regatom()>. |
| 425 | |
| 426 | To avoid wasted work when a restart is needed, the sizing pass is abandoned |
| 427 | - C<regatom()> immediately returns NULL, setting the flag C<RESTART_UTF8>. |
| 428 | (This action is encapsulated using the macro C<REQUIRE_UTF8>.) This restart |
| 429 | request is propagated up the call chain in a similar fashion, until it is |
| 430 | "caught" in C<Perl_re_op_compile()>, which marks the pattern as containing |
| 431 | Unicode, and restarts the sizing pass. It is also possible for constructions |
| 432 | within run-time code blocks to turn out to need Unicode representation., |
| 433 | which is signalled by C<S_compile_runtime_code()> returning false to |
| 434 | C<Perl_re_op_compile()>. |
| 435 | |
| 436 | The restart was previously implemented using a C<longjmp> in C<regatom()> |
| 437 | back to a C<setjmp> in C<Perl_re_op_compile()>, but this proved to be |
| 438 | problematic as the latter is a large function containing many automatic |
| 439 | variables, which interact badly with the emergent control flow of C<setjmp>. |
| 440 | |
| 441 | =head3 Debug Output |
| 442 | |
| 443 | In the 5.9.x development version of perl you can C<< use re Debug => 'PARSE' >> |
| 444 | to see some trace information about the parse process. We will start with some |
| 445 | simple patterns and build up to more complex patterns. |
| 446 | |
| 447 | So when we parse C</foo/> we see something like the following table. The |
| 448 | left shows what is being parsed, and the number indicates where the next regop |
| 449 | would go. The stuff on the right is the trace output of the graph. The |
| 450 | names are chosen to be short to make it less dense on the screen. 'tsdy' |
| 451 | is a special form of C<regtail()> which does some extra analysis. |
| 452 | |
| 453 | >foo< 1 reg |
| 454 | brnc |
| 455 | piec |
| 456 | atom |
| 457 | >< 4 tsdy~ EXACT <foo> (EXACT) (1) |
| 458 | ~ attach to END (3) offset to 2 |
| 459 | |
| 460 | The resulting program then looks like: |
| 461 | |
| 462 | 1: EXACT <foo>(3) |
| 463 | 3: END(0) |
| 464 | |
| 465 | As you can see, even though we parsed out a branch and a piece, it was ultimately |
| 466 | only an atom. The final program shows us how things work. We have an C<EXACT> regop, |
| 467 | followed by an C<END> regop. The number in parens indicates where the C<regnext> of |
| 468 | the node goes. The C<regnext> of an C<END> regop is unused, as C<END> regops mean |
| 469 | we have successfully matched. The number on the left indicates the position of |
| 470 | the regop in the regnode array. |
| 471 | |
| 472 | Now let's try a harder pattern. We will add a quantifier, so now we have the pattern |
| 473 | C</foo+/>. We will see that C<regbranch()> calls C<regpiece()> twice. |
| 474 | |
| 475 | >foo+< 1 reg |
| 476 | brnc |
| 477 | piec |
| 478 | atom |
| 479 | >o+< 3 piec |
| 480 | atom |
| 481 | >< 6 tail~ EXACT <fo> (1) |
| 482 | 7 tsdy~ EXACT <fo> (EXACT) (1) |
| 483 | ~ PLUS (END) (3) |
| 484 | ~ attach to END (6) offset to 3 |
| 485 | |
| 486 | And we end up with the program: |
| 487 | |
| 488 | 1: EXACT <fo>(3) |
| 489 | 3: PLUS(6) |
| 490 | 4: EXACT <o>(0) |
| 491 | 6: END(0) |
| 492 | |
| 493 | Now we have a special case. The C<EXACT> regop has a C<regnext> of 0. This is |
| 494 | because if it matches it should try to match itself again. The C<PLUS> regop |
| 495 | handles the actual failure of the C<EXACT> regop and acts appropriately (going |
| 496 | to regnode 6 if the C<EXACT> matched at least once, or failing if it didn't). |
| 497 | |
| 498 | Now for something much more complex: C</x(?:foo*|b[a][rR])(foo|bar)$/> |
| 499 | |
| 500 | >x(?:foo*|b... 1 reg |
| 501 | brnc |
| 502 | piec |
| 503 | atom |
| 504 | >(?:foo*|b[... 3 piec |
| 505 | atom |
| 506 | >?:foo*|b[a... reg |
| 507 | >foo*|b[a][... brnc |
| 508 | piec |
| 509 | atom |
| 510 | >o*|b[a][rR... 5 piec |
| 511 | atom |
| 512 | >|b[a][rR])... 8 tail~ EXACT <fo> (3) |
| 513 | >b[a][rR])(... 9 brnc |
| 514 | 10 piec |
| 515 | atom |
| 516 | >[a][rR])(f... 12 piec |
| 517 | atom |
| 518 | >a][rR])(fo... clas |
| 519 | >[rR])(foo|... 14 tail~ EXACT <b> (10) |
| 520 | piec |
| 521 | atom |
| 522 | >rR])(foo|b... clas |
| 523 | >)(foo|bar)... 25 tail~ EXACT <a> (12) |
| 524 | tail~ BRANCH (3) |
| 525 | 26 tsdy~ BRANCH (END) (9) |
| 526 | ~ attach to TAIL (25) offset to 16 |
| 527 | tsdy~ EXACT <fo> (EXACT) (4) |
| 528 | ~ STAR (END) (6) |
| 529 | ~ attach to TAIL (25) offset to 19 |
| 530 | tsdy~ EXACT <b> (EXACT) (10) |
| 531 | ~ EXACT <a> (EXACT) (12) |
| 532 | ~ ANYOF[Rr] (END) (14) |
| 533 | ~ attach to TAIL (25) offset to 11 |
| 534 | >(foo|bar)$< tail~ EXACT <x> (1) |
| 535 | piec |
| 536 | atom |
| 537 | >foo|bar)$< reg |
| 538 | 28 brnc |
| 539 | piec |
| 540 | atom |
| 541 | >|bar)$< 31 tail~ OPEN1 (26) |
| 542 | >bar)$< brnc |
| 543 | 32 piec |
| 544 | atom |
| 545 | >)$< 34 tail~ BRANCH (28) |
| 546 | 36 tsdy~ BRANCH (END) (31) |
| 547 | ~ attach to CLOSE1 (34) offset to 3 |
| 548 | tsdy~ EXACT <foo> (EXACT) (29) |
| 549 | ~ attach to CLOSE1 (34) offset to 5 |
| 550 | tsdy~ EXACT <bar> (EXACT) (32) |
| 551 | ~ attach to CLOSE1 (34) offset to 2 |
| 552 | >$< tail~ BRANCH (3) |
| 553 | ~ BRANCH (9) |
| 554 | ~ TAIL (25) |
| 555 | piec |
| 556 | atom |
| 557 | >< 37 tail~ OPEN1 (26) |
| 558 | ~ BRANCH (28) |
| 559 | ~ BRANCH (31) |
| 560 | ~ CLOSE1 (34) |
| 561 | 38 tsdy~ EXACT <x> (EXACT) (1) |
| 562 | ~ BRANCH (END) (3) |
| 563 | ~ BRANCH (END) (9) |
| 564 | ~ TAIL (END) (25) |
| 565 | ~ OPEN1 (END) (26) |
| 566 | ~ BRANCH (END) (28) |
| 567 | ~ BRANCH (END) (31) |
| 568 | ~ CLOSE1 (END) (34) |
| 569 | ~ EOL (END) (36) |
| 570 | ~ attach to END (37) offset to 1 |
| 571 | |
| 572 | Resulting in the program |
| 573 | |
| 574 | 1: EXACT <x>(3) |
| 575 | 3: BRANCH(9) |
| 576 | 4: EXACT <fo>(6) |
| 577 | 6: STAR(26) |
| 578 | 7: EXACT <o>(0) |
| 579 | 9: BRANCH(25) |
| 580 | 10: EXACT <ba>(14) |
| 581 | 12: OPTIMIZED (2 nodes) |
| 582 | 14: ANYOF[Rr](26) |
| 583 | 25: TAIL(26) |
| 584 | 26: OPEN1(28) |
| 585 | 28: TRIE-EXACT(34) |
| 586 | [StS:1 Wds:2 Cs:6 Uq:5 #Sts:7 Mn:3 Mx:3 Stcls:bf] |
| 587 | <foo> |
| 588 | <bar> |
| 589 | 30: OPTIMIZED (4 nodes) |
| 590 | 34: CLOSE1(36) |
| 591 | 36: EOL(37) |
| 592 | 37: END(0) |
| 593 | |
| 594 | Here we can see a much more complex program, with various optimisations in |
| 595 | play. At regnode 10 we see an example where a character class with only |
| 596 | one character in it was turned into an C<EXACT> node. We can also see where |
| 597 | an entire alternation was turned into a C<TRIE-EXACT> node. As a consequence, |
| 598 | some of the regnodes have been marked as optimised away. We can see that |
| 599 | the C<$> symbol has been converted into an C<EOL> regop, a special piece of |
| 600 | code that looks for C<\n> or the end of the string. |
| 601 | |
| 602 | The next pointer for C<BRANCH>es is interesting in that it points at where |
| 603 | execution should go if the branch fails. When executing, if the engine |
| 604 | tries to traverse from a branch to a C<regnext> that isn't a branch then |
| 605 | the engine will know that the entire set of branches has failed. |
| 606 | |
| 607 | =head3 Peep-hole Optimisation and Analysis |
| 608 | |
| 609 | The regular expression engine can be a weighty tool to wield. On long |
| 610 | strings and complex patterns it can end up having to do a lot of work |
| 611 | to find a match, and even more to decide that no match is possible. |
| 612 | Consider a situation like the following pattern. |
| 613 | |
| 614 | 'ababababababababababab' =~ /(a|b)*z/ |
| 615 | |
| 616 | The C<(a|b)*> part can match at every char in the string, and then fail |
| 617 | every time because there is no C<z> in the string. So obviously we can |
| 618 | avoid using the regex engine unless there is a C<z> in the string. |
| 619 | Likewise in a pattern like: |
| 620 | |
| 621 | /foo(\w+)bar/ |
| 622 | |
| 623 | In this case we know that the string must contain a C<foo> which must be |
| 624 | followed by C<bar>. We can use Fast Boyer-Moore matching as implemented |
| 625 | in C<fbm_instr()> to find the location of these strings. If they don't exist |
| 626 | then we don't need to resort to the much more expensive regex engine. |
| 627 | Even better, if they do exist then we can use their positions to |
| 628 | reduce the search space that the regex engine needs to cover to determine |
| 629 | if the entire pattern matches. |
| 630 | |
| 631 | There are various aspects of the pattern that can be used to facilitate |
| 632 | optimisations along these lines: |
| 633 | |
| 634 | =over 5 |
| 635 | |
| 636 | =item * anchored fixed strings |
| 637 | |
| 638 | =item * floating fixed strings |
| 639 | |
| 640 | =item * minimum and maximum length requirements |
| 641 | |
| 642 | =item * start class |
| 643 | |
| 644 | =item * Beginning/End of line positions |
| 645 | |
| 646 | =back |
| 647 | |
| 648 | Another form of optimisation that can occur is the post-parse "peep-hole" |
| 649 | optimisation, where inefficient constructs are replaced by more efficient |
| 650 | constructs. The C<TAIL> regops which are used during parsing to mark the end |
| 651 | of branches and the end of groups are examples of this. These regops are used |
| 652 | as place-holders during construction and "always match" so they can be |
| 653 | "optimised away" by making the things that point to the C<TAIL> point to the |
| 654 | thing that C<TAIL> points to, thus "skipping" the node. |
| 655 | |
| 656 | Another optimisation that can occur is that of "C<EXACT> merging" which is |
| 657 | where two consecutive C<EXACT> nodes are merged into a single |
| 658 | regop. An even more aggressive form of this is that a branch |
| 659 | sequence of the form C<EXACT BRANCH ... EXACT> can be converted into a |
| 660 | C<TRIE-EXACT> regop. |
| 661 | |
| 662 | All of this occurs in the routine C<study_chunk()> which uses a special |
| 663 | structure C<scan_data_t> to store the analysis that it has performed, and |
| 664 | does the "peep-hole" optimisations as it goes. |
| 665 | |
| 666 | The code involved in C<study_chunk()> is extremely cryptic. Be careful. :-) |
| 667 | |
| 668 | =head2 Execution |
| 669 | |
| 670 | Execution of a regex generally involves two phases, the first being |
| 671 | finding the start point in the string where we should match from, |
| 672 | and the second being running the regop interpreter. |
| 673 | |
| 674 | If we can tell that there is no valid start point then we don't bother running |
| 675 | the interpreter at all. Likewise, if we know from the analysis phase that we |
| 676 | cannot detect a short-cut to the start position, we go straight to the |
| 677 | interpreter. |
| 678 | |
| 679 | The two entry points are C<re_intuit_start()> and C<pregexec()>. These routines |
| 680 | have a somewhat incestuous relationship with overlap between their functions, |
| 681 | and C<pregexec()> may even call C<re_intuit_start()> on its own. Nevertheless |
| 682 | other parts of the perl source code may call into either, or both. |
| 683 | |
| 684 | Execution of the interpreter itself used to be recursive, but thanks to the |
| 685 | efforts of Dave Mitchell in the 5.9.x development track, that has changed: now an |
| 686 | internal stack is maintained on the heap and the routine is fully |
| 687 | iterative. This can make it tricky as the code is quite conservative |
| 688 | about what state it stores, with the result that two consecutive lines in the |
| 689 | code can actually be running in totally different contexts due to the |
| 690 | simulated recursion. |
| 691 | |
| 692 | =head3 Start position and no-match optimisations |
| 693 | |
| 694 | C<re_intuit_start()> is responsible for handling start points and no-match |
| 695 | optimisations as determined by the results of the analysis done by |
| 696 | C<study_chunk()> (and described in L<Peep-hole Optimisation and Analysis>). |
| 697 | |
| 698 | The basic structure of this routine is to try to find the start- and/or |
| 699 | end-points of where the pattern could match, and to ensure that the string |
| 700 | is long enough to match the pattern. It tries to use more efficient |
| 701 | methods over less efficient methods and may involve considerable |
| 702 | cross-checking of constraints to find the place in the string that matches. |
| 703 | For instance it may try to determine that a given fixed string must be |
| 704 | not only present but a certain number of chars before the end of the |
| 705 | string, or whatever. |
| 706 | |
| 707 | It calls several other routines, such as C<fbm_instr()> which does |
| 708 | Fast Boyer Moore matching and C<find_byclass()> which is responsible for |
| 709 | finding the start using the first mandatory regop in the program. |
| 710 | |
| 711 | When the optimisation criteria have been satisfied, C<reg_try()> is called |
| 712 | to perform the match. |
| 713 | |
| 714 | =head3 Program execution |
| 715 | |
| 716 | C<pregexec()> is the main entry point for running a regex. It contains |
| 717 | support for initialising the regex interpreter's state, running |
| 718 | C<re_intuit_start()> if needed, and running the interpreter on the string |
| 719 | from various start positions as needed. When it is necessary to use |
| 720 | the regex interpreter C<pregexec()> calls C<regtry()>. |
| 721 | |
| 722 | C<regtry()> is the entry point into the regex interpreter. It expects |
| 723 | as arguments a pointer to a C<regmatch_info> structure and a pointer to |
| 724 | a string. It returns an integer 1 for success and a 0 for failure. |
| 725 | It is basically a set-up wrapper around C<regmatch()>. |
| 726 | |
| 727 | C<regmatch> is the main "recursive loop" of the interpreter. It is |
| 728 | basically a giant switch statement that implements a state machine, where |
| 729 | the possible states are the regops themselves, plus a number of additional |
| 730 | intermediate and failure states. A few of the states are implemented as |
| 731 | subroutines but the bulk are inline code. |
| 732 | |
| 733 | =head1 MISCELLANEOUS |
| 734 | |
| 735 | =head2 Unicode and Localisation Support |
| 736 | |
| 737 | When dealing with strings containing characters that cannot be represented |
| 738 | using an eight-bit character set, perl uses an internal representation |
| 739 | that is a permissive version of Unicode's UTF-8 encoding[2]. This uses single |
| 740 | bytes to represent characters from the ASCII character set, and sequences |
| 741 | of two or more bytes for all other characters. (See L<perlunitut> |
| 742 | for more information about the relationship between UTF-8 and perl's |
| 743 | encoding, utf8. The difference isn't important for this discussion.) |
| 744 | |
| 745 | No matter how you look at it, Unicode support is going to be a pain in a |
| 746 | regex engine. Tricks that might be fine when you have 256 possible |
| 747 | characters often won't scale to handle the size of the UTF-8 character |
| 748 | set. Things you can take for granted with ASCII may not be true with |
| 749 | Unicode. For instance, in ASCII, it is safe to assume that |
| 750 | C<sizeof(char1) == sizeof(char2)>, but in UTF-8 it isn't. Unicode case folding is |
| 751 | vastly more complex than the simple rules of ASCII, and even when not |
| 752 | using Unicode but only localised single byte encodings, things can get |
| 753 | tricky (for example, B<LATIN SMALL LETTER SHARP S> (U+00DF, E<szlig>) |
| 754 | should match 'SS' in localised case-insensitive matching). |
| 755 | |
| 756 | Making things worse is that UTF-8 support was a later addition to the |
| 757 | regex engine (as it was to perl) and this necessarily made things a lot |
| 758 | more complicated. Obviously it is easier to design a regex engine with |
| 759 | Unicode support in mind from the beginning than it is to retrofit it to |
| 760 | one that wasn't. |
| 761 | |
| 762 | Nearly all regops that involve looking at the input string have |
| 763 | two cases, one for UTF-8, and one not. In fact, it's often more complex |
| 764 | than that, as the pattern may be UTF-8 as well. |
| 765 | |
| 766 | Care must be taken when making changes to make sure that you handle |
| 767 | UTF-8 properly, both at compile time and at execution time, including |
| 768 | when the string and pattern are mismatched. |
| 769 | |
| 770 | =head2 Base Structures |
| 771 | |
| 772 | The C<regexp> structure described in L<perlreapi> is common to all |
| 773 | regex engines. Two of its fields are intended for the private use |
| 774 | of the regex engine that compiled the pattern. These are the |
| 775 | C<intflags> and pprivate members. The C<pprivate> is a void pointer to |
| 776 | an arbitrary structure whose use and management is the responsibility |
| 777 | of the compiling engine. perl will never modify either of these |
| 778 | values. In the case of the stock engine the structure pointed to by |
| 779 | C<pprivate> is called C<regexp_internal>. |
| 780 | |
| 781 | Its C<pprivate> and C<intflags> fields contain data |
| 782 | specific to each engine. |
| 783 | |
| 784 | There are two structures used to store a compiled regular expression. |
| 785 | One, the C<regexp> structure described in L<perlreapi> is populated by |
| 786 | the engine currently being. used and some of its fields read by perl to |
| 787 | implement things such as the stringification of C<qr//>. |
| 788 | |
| 789 | |
| 790 | The other structure is pointed to by the C<regexp> struct's |
| 791 | C<pprivate> and is in addition to C<intflags> in the same struct |
| 792 | considered to be the property of the regex engine which compiled the |
| 793 | regular expression; |
| 794 | |
| 795 | The regexp structure contains all the data that perl needs to be aware of |
| 796 | to properly work with the regular expression. It includes data about |
| 797 | optimisations that perl can use to determine if the regex engine should |
| 798 | really be used, and various other control info that is needed to properly |
| 799 | execute patterns in various contexts such as is the pattern anchored in |
| 800 | some way, or what flags were used during the compile, or whether the |
| 801 | program contains special constructs that perl needs to be aware of. |
| 802 | |
| 803 | In addition it contains two fields that are intended for the private use |
| 804 | of the regex engine that compiled the pattern. These are the C<intflags> |
| 805 | and pprivate members. The C<pprivate> is a void pointer to an arbitrary |
| 806 | structure whose use and management is the responsibility of the compiling |
| 807 | engine. perl will never modify either of these values. |
| 808 | |
| 809 | As mentioned earlier, in the case of the default engines, the C<pprivate> |
| 810 | will be a pointer to a regexp_internal structure which holds the compiled |
| 811 | program and any additional data that is private to the regex engine |
| 812 | implementation. |
| 813 | |
| 814 | =head3 Perl's C<pprivate> structure |
| 815 | |
| 816 | The following structure is used as the C<pprivate> struct by perl's |
| 817 | regex engine. Since it is specific to perl it is only of curiosity |
| 818 | value to other engine implementations. |
| 819 | |
| 820 | typedef struct regexp_internal { |
| 821 | U32 *offsets; /* offset annotations 20001228 MJD |
| 822 | * data about mapping the program to |
| 823 | * the string*/ |
| 824 | regnode *regstclass; /* Optional startclass as identified or |
| 825 | * constructed by the optimiser */ |
| 826 | struct reg_data *data; /* Additional miscellaneous data used |
| 827 | * by the program. Used to make it |
| 828 | * easier to clone and free arbitrary |
| 829 | * data that the regops need. Often the |
| 830 | * ARG field of a regop is an index |
| 831 | * into this structure */ |
| 832 | regnode program[1]; /* Unwarranted chumminess with |
| 833 | * compiler. */ |
| 834 | } regexp_internal; |
| 835 | |
| 836 | =over 5 |
| 837 | |
| 838 | =item C<offsets> |
| 839 | |
| 840 | Offsets holds a mapping of offset in the C<program> |
| 841 | to offset in the C<precomp> string. This is only used by ActiveState's |
| 842 | visual regex debugger. |
| 843 | |
| 844 | =item C<regstclass> |
| 845 | |
| 846 | Special regop that is used by C<re_intuit_start()> to check if a pattern |
| 847 | can match at a certain position. For instance if the regex engine knows |
| 848 | that the pattern must start with a 'Z' then it can scan the string until |
| 849 | it finds one and then launch the regex engine from there. The routine |
| 850 | that handles this is called C<find_by_class()>. Sometimes this field |
| 851 | points at a regop embedded in the program, and sometimes it points at |
| 852 | an independent synthetic regop that has been constructed by the optimiser. |
| 853 | |
| 854 | =item C<data> |
| 855 | |
| 856 | This field points at a C<reg_data> structure, which is defined as follows |
| 857 | |
| 858 | struct reg_data { |
| 859 | U32 count; |
| 860 | U8 *what; |
| 861 | void* data[1]; |
| 862 | }; |
| 863 | |
| 864 | This structure is used for handling data structures that the regex engine |
| 865 | needs to handle specially during a clone or free operation on the compiled |
| 866 | product. Each element in the data array has a corresponding element in the |
| 867 | what array. During compilation regops that need special structures stored |
| 868 | will add an element to each array using the add_data() routine and then store |
| 869 | the index in the regop. |
| 870 | |
| 871 | =item C<program> |
| 872 | |
| 873 | Compiled program. Inlined into the structure so the entire struct can be |
| 874 | treated as a single blob. |
| 875 | |
| 876 | =back |
| 877 | |
| 878 | =head1 SEE ALSO |
| 879 | |
| 880 | L<perlreapi> |
| 881 | |
| 882 | L<perlre> |
| 883 | |
| 884 | L<perlunitut> |
| 885 | |
| 886 | =head1 AUTHOR |
| 887 | |
| 888 | by Yves Orton, 2006. |
| 889 | |
| 890 | With excerpts from Perl, and contributions and suggestions from |
| 891 | Ronald J. Kimball, Dave Mitchell, Dominic Dunlop, Mark Jason Dominus, |
| 892 | Stephen McCamant, and David Landgren. |
| 893 | |
| 894 | =head1 LICENCE |
| 895 | |
| 896 | Same terms as Perl. |
| 897 | |
| 898 | =head1 REFERENCES |
| 899 | |
| 900 | [1] L<http://perl.plover.com/Rx/paper/> |
| 901 | |
| 902 | [2] L<http://www.unicode.org> |
| 903 | |
| 904 | =cut |