3 perlreguts - Description of the Perl regular expression engine.
7 This document is an attempt to shine some light on the guts of the regex
8 engine and how it works. The regex engine represents a signifigant chunk
9 of the perl codebase, but is relatively poorly understood. This document
10 is a meagre attempt at addressing this situation. It is derived from the
11 authors experience, comments in the source code, other papers on the
12 regex engine, feedback in p5p, and no doubt other places as well.
14 B<WARNING!> It should be clearly understood that this document
15 represents the state of the regex engine as the author understands it at
16 the time of writing. It is B<NOT> an API definition, it is purely an
17 internals guide for those who want to hack the regex engine, or
18 understand how the regex engine works. Readers of this document are
19 expected to understand perls regex syntax and its usage in detail, if
20 you are a beginner you are in the wrong the place.
24 =head2 A quick note on terms
26 There is some debate as to whether to say 'regexp' or 'regex'. In this
27 document we will use the term "regex" unless there is a special reason
28 not to, and then we will explain why.
30 When speaking about regexes we need to distinguish between their source
31 code form and their internal form. In this document we will use the term
32 "pattern" when we speak of their textual, source code form, the term
33 "program" when we speak of their internal representation. These
34 correspond to the terms C<S-regex> and C<B-regex> that Mark Jason
35 Dominus employs in his paper on "Rx"[1].
37 =head2 What is a regular expression engine?
39 A regular expression engine is a program whose job is to efficiently
40 find a section of a string that matches a set criteria of criteria. The
41 criteria is expressed in text using a formal language. See perlre for a
42 full definition of the language.
44 So the job in less grandiose terms is to some turn a pattern into
45 something the computer can efficiently use to find the matching point in
48 To do this we need to produce a program by parsing the text. We then
49 need to execute the program to find the point in the string that
50 matches. And we need to do the whole thing efficiently.
52 =head2 Structure of a Regexp Program
56 Although it is a bit confusing and some object to the terminology it
57 is worth taking a look at a comment that has
58 been in regexp.h for years:
60 I<This is essentially a linear encoding of a nondeterministic
61 finite-state machine (aka syntax charts or "railroad normal form" in
64 The term "railroad normal form" is a bit esoteric, with "syntax
65 diagram/charts", or "railroad diagram/charts" being more common terms.
66 Nevertheless it provides a useful mental image of a regex program: Each
67 node can be thought of as a unit of track, with a single entry and in
68 most cases a single exit point (there are pieces of track that fork, but
69 statistically not many), and the total forms a system of track with a
70 single entry and single exit point. The matching process can be thought
71 of as a car that moves on the track, with the particular route through
72 the system being determined by the character read at each possible
73 connector point. A car can roll off the track at any point but it may
74 not procede unless it matches the track...
76 Thus the pattern C</foo(?:\w+|\d+|\s+)bar/> can be thought of as the
94 The truth of the matter is that perls regular expressions these days are
95 way beyond such models, but they can help when trying to get your
96 bearings, and they do match pretty closely with the current
99 To be more precise we will say that a regex program is an encoding
100 of a graph. Each node in the graph corresponds to part of
101 the original regex pattern, such as a literal string or a branch,
102 and has a pointer to the nodes representing the next component
103 to be matched. Since "node" and opcode are overloaded terms in the
104 perl source, we will call the nodes in a regex program 'regops'.
106 The program is represented by an array of regnode structures, one or
107 more of which together represent a single regop of the program. Struct
108 regnode is the smallest struct needed and has a field structure which is
109 shared with all the other larger structures.
111 "Next" pointers of all regops except BRANCH implement concatenation; a
112 "next" pointer with a BRANCH on both ends of it is connecting two
113 alternatives. [Here we have one of the subtle syntax dependencies: an
114 individual BRANCH (as opposed to a collection of them) is never
115 concatenated with anything because of operator precedence.
117 The operand of some types of regop is a literal string; for others,
118 it is a regop leading into a sub-program. In particular, the operand
119 of a BRANCH node is the first regop of the branch.
121 B<NOTE>: As the railroad metaphor suggests this is B<not> a tree
122 structure: the tail of the branch connects to the thing following the
123 set of BRANCHes. It is a like a single line of railway track that
124 splits as it goes into a station or railway yard and rejoins as it comes
129 The base structure of a regop is defined in regexp.h as follows:
132 U8 flags; /* Various purposes, sometimes overriden */
133 U8 type; /* Opcode value as specified by regnodes.h */
134 U16 next_off; /* Offset in size regnode */
137 Other larger regnode-like structures are defined in regcomp.h. They
138 are almost like subclasses in that they have the same fields as
139 regnode, with possibly additional fields following in
140 the structure, and in some cases the specific meaning (and name)
141 of some of base fields are overriden. The following is a more
142 complete description.
150 regnode_1 structures have the same header, followed by a single
151 four-byte argument; regnode_2 structures contain two two-byte
155 regnode_2 U16 arg1; U16 arg2;
159 regnode_string structures, used for literal strings, follow the header
160 with a one-byte length and then the string data. Strings are padded on
161 the end with zero bytes so that the total length of the node is a
162 multiple of four bytes:
164 regnode_string char string[1];
165 U8 str_len; (overides flags)
167 =item regnode_charclass
169 character classes are represented by regnode_charclass structures,
170 which have a four-byte argument and then a 32-byte (256-bit) bitmap
171 indicating which characters are included in the class.
173 regnode_charclass U32 arg1;
174 char bitmap[ANYOF_BITMAP_SIZE];
176 =item regnode_charclass_class
178 There is also a larger form of a char class structure used to represent
179 POSIX char classes called regnode_charclass_class which contains the
180 same fields plus an additional 4-byte (32-bit) bitmap indicating which
181 POSIX char class have been included.
183 regnode_charclass_class U32 arg1;
184 char bitmap[ANYOF_BITMAP_SIZE];
185 char classflags[ANYOF_CLASSBITMAP_SIZE];
189 regnodes.h defines an array called regarglen[] which gives the size
190 of each opcode in units of size regnode (4-byte). A macro is used
191 to calculate the size of an EXACT node based on its C<str_len> field.
193 The opcodes are defined in regnodes.h which is generated from
194 regcomp.sym by regcomp.pl. Currently the maximum possible number
195 of distinct opcodes is restricted to 256, with about 1/4 already
198 There's a set of macros provided to make accessing the fields
199 easier and more consistent. These include C<OP()> which is used to tell
200 the type of a regnode-like structure, NEXT_OFF() which is the offset to
201 the next node (more on this later), ARG(), ARG1(), ARG2(), ARG_SET(),
202 and equivelents for reading and setting the arguments, STR_LEN(),
203 STRING(), and OPERAND() for manipulating strings and regop bearing
206 =head3 What opcode is next?
208 There are three distinct concepts of "next" in the regex engine, and
209 it is important to keep them clear.
215 There is the "next regnode" from a given regnode, a value which is
216 rarely useful except that sometimes it matches up in terms of value
217 with one of the others, and that sometimes the code assumes this to
222 There is the "next opcode" from a given opcode/regnode. This is the
223 opcode physically located after the the current one, as determined by
224 the size of the current opcode. This is often useful, such as when
225 dumping the structure we use this order to traverse. Sometimes the code
226 assumes that the "next regnode" is the same as the "next opcode", or in
227 other words assumes that the sizeof a given opcode type is always going
228 to be 1 regnode large.
232 There is the "regnext" from a given opcode. This is the opcode which
233 is reached by jumping forward by the value of NEXT_OFF(),
234 or in a few cases for longer jumps by the arg1 field of the regnode_1
235 structure. The subroutine regnext() handles this transparently.
236 This is the logical successor of the node, which in some cases, like
237 that of the BRANCH opcode, has special meaning.
241 =head1 PROCESS OVERVIEW
243 Broadly speaking performing a match of a string against a pattern
244 involves the following steps
248 2. Parsing for construction
249 3. Peep-hole Optimisation and Analysis
251 4. Start position and no-match optimisations
254 Where these steps occur in the actual execution of a perl program is
255 determined by whether the pattern involves interpolating any string
256 variables. If it does then compilation happens at run time. If it
257 doesn't then it happens at compile time. (The C</o> modifier changes this,
258 as does C<qr//> to a certain extent.) The engine doesn't really care that
263 This code exists primarily in regcomp.c, along with the header files
264 regcomp.h, regexp.h, regnodes.h.
266 Compilation starts with C<pregcomp()>, which is mostly an initialization
267 wrapper which farms out two other routines for the heavy lifting. The
268 first being C<reg()> which is the start point for parsing, and
269 C<study_chunk()> which is responsible for optimisation.
271 Initialization in C<pregcomp()> mostly involves the creation and data
272 filling of a special structure RExC_state_t, (defined in regcomp.c).
273 Almost all internally used routines in regcomp.h take a pointer to one
274 of these structures as their first argument, with the name *pRExC_state.
275 This structure is used to store the compilation state and contains many
276 fields. Likewise their are many macros defined which operate on this
277 variable. Anything that looks like RExC_xxxx is a macro that operates on
278 this pointer/structure.
280 =head3 Parsing for size
282 In this pass the input pattern is parsed in order to calculate how much
283 space is needed for each opcode we would need to emit. The size is also
284 used to determine whether long jumps will be required in the program.
286 This stage is controlled by the macro SIZE_ONLY being set.
288 The parse procedes pretty much exactly as it does during the
289 construction phase except that most routines are shortcircuited to
290 change the size field RExC_size and not actually do anything.
292 =head3 Parsing for construcution
294 Once the size of the program has been determine the pattern is parsed
295 again, but this time for real. Now SIZE_ONLY will be false, and the
296 actual construction can occur.
298 C<reg()> is the start of the parse process. It is responsible for
299 parsing an arbitrary chunk of pattern up to either the end of the
300 string, or the first closing parenthesis it encounters in the pattern.
301 This means it can be used to parse the toplevel regex, or any section
302 inside of a grouping parenthesis. It also handles the "special parens"
303 that perls regexes have. For instance when parsing C</x(?:foo)y/> C<reg()>
304 will at one point be called to parse from the '?' symbol up to and
307 Additionally C<reg()> is responsible for parsing the one or more
308 branches from the pattern, and for "finishing them off" by correctly
309 setting their next pointers. In order to do the parsing it repeatedly
310 calls out to C<regbranch()> which is responsible for handling up to the
311 first C<|> symbol it sees.
313 C<regbranch()> in turn calls C<regpiece()> which is responsible for
314 handling "things" followed by a quantifier. In order to parse the
315 "things" C<regatom()> is called. This is the lowest level routine which
316 is responsible for parsing out constant strings, char classes, and the
317 various special symbols like C<$>. If C<regatom()> encounters a '('
318 character it in turn calls C<reg()>.
320 The routine C<regtail()> is called by both C<reg()>, C<regbranch()>
321 in order to "set the tail pointer" correctly. When executing and
322 we get to the end of a branch we need to go to node following the
323 grouping parens. When parsing however we don't know where the end will
324 be until we get there, so when we do we must go back and update the
325 offsets as appropriate. C<regtail> is used to make this easier.
327 A subtlety of the parse process means that a regex like C</foo/> is
328 originally parsed into an alternation with a single branch. It is only
329 afterwards that the optimizer converts single branch alternations into the
332 =head3 Parse Call Graph and a Grammar
334 The call graph looks like this:
336 reg() # parse a top level regex, or inside of parens
337 regbranch() # parse a single branch of an alternation
338 regpiece() # parse a pattern followed by a quantifier
339 regatom() # parse a simple pattern
340 regclass() # used to handle a class
341 reg() # used to handle a parenthesized subpattern
344 regtail() # finish off the branch
346 regtail() # finish off the branch sequence. Tie each
347 # branches tail to the tail of the sequence
348 # (NEW) In Debug mode this is
351 A grammar form might be something like this:
353 atom : constant | class
354 quant : '*' | '+' | '?' | '{min,max}'
360 group : '(' branch ')'
367 In bleadperl you can C<< use re Debug => 'PARSE'; >> to see some trace
368 information about the parse process. We will start with some simple
369 patterns and build up to more complex patterns.
371 So when we parse C</foo/> we see something like the following table. The
372 left shows whats being parsed, the number indicates where the next regop
373 would go. The stuff on the right is the trace output of the graph. The
374 names are chosen to be short to make it less dense on the screen. 'tsdy'
375 is a special form of C<regtail()> which does some extra analysis.
381 >< 4 tsdy~ EXACT <foo> (EXACT) (1)
382 ~ attach to END (3) offset to 2
384 The resulting program then looks like:
389 As you can see, even though we parsed out a branch and a piece, it was ultimately
390 only an atom. The final program shows us how things work. We have an EXACT regop,
391 followed by an END regop. The number in parens indicates where the 'regnext' of
392 the node goes. The 'regnext' of an END regop is unused, as END regops mean
393 we have successfully matched. The number on the left indicates the position of
394 the regop in the regnode array.
396 Now lets try a harder pattern. We will add a quantifier so we have the pattern
397 C</foo+/>. We will see that C<regbranch()> calls C<regpiece()> regpiece twice.
405 >< 6 tail~ EXACT <fo> (1)
406 7 tsdy~ EXACT <fo> (EXACT) (1)
408 ~ attach to END (6) offset to 3
410 And we end up with the program:
417 Now we have a special case. The EXACT regop has a regnext of 0. This is
418 because if it matches it should try to match itself again. The PLUS regop
419 handles the actual failure of the EXACT regop and acts appropriately (going
420 to regnode 6 if the EXACT matched at least once, or failing if it didn't.)
422 Now for something much more complex: C</x(?:foo*|b[a][rR])(foo|bar)$/>
428 >(?:foo*|b[... 3 piec
434 >o*|b[a][rR... 5 piec
436 >|b[a][rR])... 8 tail~ EXACT <fo> (3)
437 >b[a][rR])(... 9 brnc
440 >[a][rR])(f... 12 piec
443 >[rR])(foo|... 14 tail~ EXACT <b> (10)
447 >)(foo|bar)... 25 tail~ EXACT <a> (12)
449 26 tsdy~ BRANCH (END) (9)
450 ~ attach to TAIL (25) offset to 16
451 tsdy~ EXACT <fo> (EXACT) (4)
453 ~ attach to TAIL (25) offset to 19
454 tsdy~ EXACT <b> (EXACT) (10)
455 ~ EXACT <a> (EXACT) (12)
456 ~ ANYOF[Rr] (END) (14)
457 ~ attach to TAIL (25) offset to 11
458 >(foo|bar)$< tail~ EXACT <x> (1)
465 >|bar)$< 31 tail~ OPEN1 (26)
469 >)$< 34 tail~ BRANCH (28)
470 36 tsdy~ BRANCH (END) (31)
471 ~ attach to CLOSE1 (34) offset to 3
472 tsdy~ EXACT <foo> (EXACT) (29)
473 ~ attach to CLOSE1 (34) offset to 5
474 tsdy~ EXACT <bar> (EXACT) (32)
475 ~ attach to CLOSE1 (34) offset to 2
481 >< 37 tail~ OPEN1 (26)
485 38 tsdy~ EXACT <x> (EXACT) (1)
494 ~ attach to END (37) offset to 1<div></div>
496 Resulting in the program
505 12: OPTIMIZED (2 nodes)
510 [StS:1 Wds:2 Cs:6 Uq:5 #Sts:7 Mn:3 Mx:3 Stcls:bf]
513 30: OPTIMIZED (4 nodes)
518 Here we can see a much more complex program, with various optimisations in
519 play. At regnode 10 we can see an example where a char class with only
520 one character in it was turned into an EXACT node. We can also see where
521 an entire alternation was turned into a TRIE-EXACT node. As a consequence
522 some of the regnodes have been marked as optimised away. We can see that
523 the C<$> symbol has been converted into an EOL regop, a special piece of
524 code that looks for \n or the end of a string.
526 The next pointer for BRANCHes is interesting in that it points at where
527 execution should go if the branch fails. When executing if the engine
528 tries to traverse from a branch to a regnext that isnt a branch then
529 the engine will know the overall series of branches have failed.
531 =head3 Peep-hole Optimisation and Analysis
533 The regular expression engine can be a weighty tool to wield. On long
534 strings and complex patterns it can end up having to do a lot of work
535 to find a match, and even more to decide that no match is possible.
536 Consider a situation like the following pattern.
538 'ababababababababababab' =~ /(a|b)*z/
540 The C<(a|b)*> part can match at every char in the string, and then fail
541 every time because there is no C<z> in the string. So obviously we can
542 not bother to use the regex engine unless there is a 'z' in the string.
543 Likewise in a pattern like:
547 In this case we know that the string must contain a C<foo> which must be
548 followed by C<bar>. We can use Fast Boyer-More matching as implemented
549 in fbm_instr() to find the location of these strings. If they dont exist
550 then we dont need to resort to the much more expensive regex engine.
551 Even better if they do exist then we can use their positions to
552 reduce the search space that the regex engine needs to cover to determine
553 if the entire pattern does match.
555 There are various aspects of the pattern that can be used to facilitate
556 optimisations along these lines:
558 * anchored fixed strings
559 * floating fixed strings
560 * minimum and maximum length requirements
562 * Beginning/End of line positions
564 Another form of optimisation that can occur is post-parse "peep-hole"
565 optimisations, where inefficient constructs are modified so that they
566 are more efficient. An example of this is TAIL regops which are used
567 during parsing to mark the end of branches and the end of groups. These
568 regops are used as place holders during construction and "always match"
569 so they can be "optimised away" by making the things that point to the
570 TAIL point to thing the TAIL points to, in essence "skipping" the node.
572 Another optimisation that can occur is that of "EXACT merging" which is
573 where two consecutive EXACT nodes are merged into a single more efficient
574 to execute regop. An even more agressive form of this is that a branch
575 sequence of the form EXACT BRANCH ... EXACT can be converted into a TRIE
578 All of this occurs in the routine study_chunk() which uses a special
579 structure scan_data_t to store the analysis that it has performed, and
580 as it goes does the "peep-hole" optimisations.
582 The code involved in study_chunk() is extremely cryptic. Be careful. :-)
586 Execution of a regex generally involves two phases, the first being
587 finding the start point in the string where we should match from,
588 and the second being running the regop interpreter.
590 If we can tell that there is no valid start point we don't bother running
591 interpreter at all. Likewise if we know from the analysis phase that we
592 can not optimise detection of the start position we go straight to the
595 The two entry points are re_intuit_start() and pregexec(). These routines
596 have a somewhat incestuous relationship with overlap between their functions,
597 and pregexec() may even call re_intuit_start() on its own. Nevertheless
598 the perl source code may call into either, or both.
600 Execution of the interpreter itself used to be recursive. Due to the
601 efforts of Dave Mitchel in blead perl it no longer is. Instead an
602 internal stack is maintained on the heap and the routine is fully
603 iterative. This can make it tricky as the code is quite conservative
604 about what state it stores which means that two consecutive lines in the
605 code can actually be running in totally different contexts due to the
608 =head3 Start position and no-match optimisations
610 re_intuit_start() is responsible for handling start point and no match
611 optimisations as determined by the results of the analysis done by
612 study_chunk() (and described in L<Peep-hole Optimisation and Analysis>).
614 The basic structure of this routine is to try to find the start and/or
615 end points of where the pattern could match, and to ensure that the string
616 is long enough to match the pattern. It tries to use more efficent
617 methods over less efficient methods and may involve considerable cross
618 checking of constraints to find the place in the string that matches.
619 For instance it may try to determine that a given fixed string must be
620 not only present but a certain number of chars before the end of the
623 It calls out into several other routines, like fbm_instr() which does
624 "Fast Boyer More" matching and find_byclass() which is responsible for
625 finding the start using the first mandatory regop in the program.
627 When the optimisation criteria have been satisfied reg_try() is called
628 to perform the match.
630 =head3 Program execution
632 C<pregexec()> is the main entry point for running a regex. It contains
633 support for initializing the regex interpreters state, running
634 re_intuit_start() if needed, and running the intepreter on the string
635 from various start positions as needed. When its necessary to use
636 the regex interpreter C<pregexec()> calls C<regtry()>.
638 C<regtry()> is the entry point into the regex interpreter. It expects
639 as arguments a pointer to a regmatch_info structure and a pointer to
640 a string. It returns an integer 1 for success and a 0 for failure.
641 It is basically a setup wrapper around C<regmatch()>.
643 C<regmatch> is the main "recursive loop" of the interpreter. It is
644 basically a giant switch statement that executes the regops based on
645 their type. A few of the regops are implemented as subroutines but
646 the bulk are inline code.
650 =head2 UNICODE and Localization Support
652 No matter how you look at it unicode support is going to be a pain in a
653 regex engine. Tricks that might be fine when you have 256 possible
654 characters often won't scale to handle the size of the 'utf8' character
655 set. Things you can take for granted with ASCII may not be true with
656 unicode. For instance in ASCII its safe to assume that
657 C<sizeof(char1) == sizeof(char2)>, in utf8 it isn't. Unicode case folding is
658 vastly more complex than the simple rules of English, and even when not
659 using unicode but only localized single byte encodings things can get
660 tricky (technically GERMAN-SHARP-ESS should match 'ss' in localized case
661 insensitive matching.)
663 Making things worse is that C<utf8> support was a later addition to the
664 regex engine (as it was to perl) and necessarily this made things a lot
665 more complicated. Obviously its easier to design a regex engine with
666 unicode support from the beginning than it is to retrofit one that
667 wasn't designed with it in mind.
669 Pretty well every regop that involves looking at the input string has
670 two cases, one for 'utf8' and one not. In fact its often more complex
671 than that, as the pattern may be 'utf8' as well.
673 Care must be taken when making changes to make sure that you handle
674 utf8 properly both at compile time and at execution time, including
675 when the string and pattern are mismatched.
677 The following comment in regcomp.h gives an example of exactly how
680 Two problematic code points in Unicode casefolding of EXACT nodes:
682 U+0390 - GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
683 U+03B0 - GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
689 U+03B9 U+0308 U+0301 0xCE 0xB9 0xCC 0x88 0xCC 0x81
690 U+03C5 U+0308 U+0301 0xCF 0x85 0xCC 0x88 0xCC 0x81
692 This means that in case-insensitive matching (or "loose matching",
693 as Unicode calls it), an EXACTF of length six (the UTF-8 encoded
694 byte length of the above casefolded versions) can match a target
695 string of length two (the byte length of UTF-8 encoded U+0390 or
696 U+03B0). This would rather mess up the minimum length computation.
698 What we'll do is to look for the tail four bytes, and then peek
699 at the preceding two bytes to see whether we need to decrease
700 the minimum length by four (six minus two).
702 Thanks to the design of UTF-8, there cannot be false matches:
703 A sequence of valid UTF-8 bytes cannot be a subsequence of
704 another valid sequence of UTF-8 bytes.
710 With excerpts from Perl, and contributions and suggestions from
711 Ronald J. Kimball, Dave Mitchell, Dominic Dunlop, Mark Jason Dominus,
712 and Stephen McCamant.
720 [1] http://perl.plover.com/Rx/paper/