This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5
Regenerated mktables.lst per Yves Orton's suggestion.
[perl5.git] / pod / perlreguts.pod
CommitLineData
b23a565d
RGS
1=head1 NAME
2
3perlreguts - Description of the Perl regular expression engine.
4
5=head1 DESCRIPTION
6
7This document is an attempt to shine some light on the guts of the regex
4ccfbf60 8engine and how it works. The regex engine represents a significant chunk
b23a565d
RGS
9of the perl codebase, but is relatively poorly understood. This document
10is a meagre attempt at addressing this situation. It is derived from the
be8e71aa
YO
11author's experience, comments in the source code, other papers on the
12regex engine, feedback on the perl5-porters mail list, and no doubt other
13places as well.
b23a565d
RGS
14
15B<WARNING!> It should be clearly understood that this document
16represents the state of the regex engine as the author understands it at
be8e71aa 17the time of writing. It is B<NOT> an API definition; it is purely an
b23a565d
RGS
18internals guide for those who want to hack the regex engine, or
19understand how the regex engine works. Readers of this document are
be8e71aa
YO
20expected to understand perl's regex syntax and its usage in detail. If
21you want to learn about the basics of Perl's regular expressions, see
22L<perlre>.
b23a565d
RGS
23
24=head1 OVERVIEW
25
26=head2 A quick note on terms
27
be8e71aa 28There is some debate as to whether to say "regexp" or "regex". In this
b23a565d 29document we will use the term "regex" unless there is a special reason
be8e71aa 30not to, in which case we will explain why.
b23a565d
RGS
31
32When speaking about regexes we need to distinguish between their source
33code form and their internal form. In this document we will use the term
34"pattern" when we speak of their textual, source code form, the term
35"program" when we speak of their internal representation. These
be8e71aa 36correspond to the terms I<S-regex> and I<B-regex> that Mark Jason
e3950ac3 37Dominus employs in his paper on "Rx" ([1] in L</REFERENCES>).
b23a565d
RGS
38
39=head2 What is a regular expression engine?
40
be8e71aa
YO
41A regular expression engine is a program that takes a set of constraints
42specified in a mini-language, and then applies those constraints to a
43target string, and determines whether or not the string satisfies the
44constraints. See L<perlre> for a full definition of the language.
b23a565d 45
4ccfbf60 46So in less grandiose terms the first part of the job is to turn a pattern into
b23a565d 47something the computer can efficiently use to find the matching point in
be8e71aa 48the string, and the second part is performing the search itself.
b23a565d
RGS
49
50To do this we need to produce a program by parsing the text. We then
51need to execute the program to find the point in the string that
52matches. And we need to do the whole thing efficiently.
53
54=head2 Structure of a Regexp Program
55
56=head3 High Level
57
be8e71aa 58Although it is a bit confusing and some people object to the terminology, it
b23a565d 59is worth taking a look at a comment that has
be8e71aa 60been in F<regexp.h> for years:
b23a565d
RGS
61
62I<This is essentially a linear encoding of a nondeterministic
63finite-state machine (aka syntax charts or "railroad normal form" in
64parsing technology).>
65
66The term "railroad normal form" is a bit esoteric, with "syntax
67diagram/charts", or "railroad diagram/charts" being more common terms.
4ccfbf60 68Nevertheless it provides a useful mental image of a regex program: each
b23a565d
RGS
69node can be thought of as a unit of track, with a single entry and in
70most cases a single exit point (there are pieces of track that fork, but
be8e71aa 71statistically not many), and the whole forms a layout with a
b23a565d 72single entry and single exit point. The matching process can be thought
be8e71aa 73of as a car that moves along the track, with the particular route through
b23a565d 74the system being determined by the character read at each possible
be8e71aa
YO
75connector point. A car can fall off the track at any point but it may
76only proceed as long as it matches the track.
b23a565d
RGS
77
78Thus the pattern C</foo(?:\w+|\d+|\s+)bar/> can be thought of as the
79following chart:
80
be8e71aa
YO
81 [start]
82 |
83 <foo>
84 |
85 +-----+-----+
86 | | |
87 <\w+> <\d+> <\s+>
88 | | |
89 +-----+-----+
90 |
91 <bar>
92 |
93 [end]
94
95The truth of the matter is that perl's regular expressions these days are
4ccfbf60
RGS
96much more complex than this kind of structure, but visualising it this way
97can help when trying to get your bearings, and it matches the
98current implementation pretty closely.
be8e71aa
YO
99
100To be more precise, we will say that a regex program is an encoding
101of a graph. Each node in the graph corresponds to part of
b23a565d
RGS
102the original regex pattern, such as a literal string or a branch,
103and has a pointer to the nodes representing the next component
be8e71aa
YO
104to be matched. Since "node" and "opcode" already have other meanings in the
105perl source, we will call the nodes in a regex program "regops".
b23a565d 106
be8e71aa
YO
107The program is represented by an array of C<regnode> structures, one or
108more of which represent a single regop of the program. Struct
4ccfbf60 109C<regnode> is the smallest struct needed, and has a field structure which is
b23a565d
RGS
110shared with all the other larger structures.
111
be8e71aa
YO
112The "next" pointers of all regops except C<BRANCH> implement concatenation;
113a "next" pointer with a C<BRANCH> on both ends of it is connecting two
114alternatives. [Here we have one of the subtle syntax dependencies: an
115individual C<BRANCH> (as opposed to a collection of them) is never
116concatenated with anything because of operator precedence.]
b23a565d
RGS
117
118The operand of some types of regop is a literal string; for others,
119it is a regop leading into a sub-program. In particular, the operand
be8e71aa 120of a C<BRANCH> node is the first regop of the branch.
b23a565d 121
4ccfbf60 122B<NOTE>: As the railroad metaphor suggests, this is B<not> a tree
b23a565d 123structure: the tail of the branch connects to the thing following the
be8e71aa 124set of C<BRANCH>es. It is a like a single line of railway track that
b23a565d
RGS
125splits as it goes into a station or railway yard and rejoins as it comes
126out the other side.
127
128=head3 Regops
129
be8e71aa 130The base structure of a regop is defined in F<regexp.h> as follows:
b23a565d
RGS
131
132 struct regnode {
be8e71aa 133 U8 flags; /* Various purposes, sometimes overridden */
b23a565d
RGS
134 U8 type; /* Opcode value as specified by regnodes.h */
135 U16 next_off; /* Offset in size regnode */
136 };
137
be8e71aa 138Other larger C<regnode>-like structures are defined in F<regcomp.h>. They
b23a565d 139are almost like subclasses in that they have the same fields as
4ccfbf60 140C<regnode>, with possibly additional fields following in
b23a565d 141the structure, and in some cases the specific meaning (and name)
4ccfbf60 142of some of base fields are overridden. The following is a more
b23a565d
RGS
143complete description.
144
145=over 4
146
be8e71aa 147=item C<regnode_1>
b23a565d 148
be8e71aa 149=item C<regnode_2>
b23a565d 150
be8e71aa
YO
151C<regnode_1> structures have the same header, followed by a single
152four-byte argument; C<regnode_2> structures contain two two-byte
b23a565d
RGS
153arguments instead:
154
155 regnode_1 U32 arg1;
156 regnode_2 U16 arg1; U16 arg2;
157
be8e71aa 158=item C<regnode_string>
b23a565d 159
be8e71aa 160C<regnode_string> structures, used for literal strings, follow the header
b23a565d
RGS
161with a one-byte length and then the string data. Strings are padded on
162the end with zero bytes so that the total length of the node is a
163multiple of four bytes:
164
165 regnode_string char string[1];
be8e71aa 166 U8 str_len; /* overrides flags */
b23a565d 167
be8e71aa 168=item C<regnode_charclass>
b23a565d 169
be8e71aa 170Character classes are represented by C<regnode_charclass> structures,
b23a565d
RGS
171which have a four-byte argument and then a 32-byte (256-bit) bitmap
172indicating which characters are included in the class.
173
174 regnode_charclass U32 arg1;
175 char bitmap[ANYOF_BITMAP_SIZE];
176
be8e71aa 177=item C<regnode_charclass_class>
b23a565d
RGS
178
179There is also a larger form of a char class structure used to represent
be8e71aa
YO
180POSIX char classes called C<regnode_charclass_class> which has an
181additional 4-byte (32-bit) bitmap indicating which POSIX char class
182have been included.
b23a565d
RGS
183
184 regnode_charclass_class U32 arg1;
185 char bitmap[ANYOF_BITMAP_SIZE];
186 char classflags[ANYOF_CLASSBITMAP_SIZE];
187
188=back
189
be8e71aa
YO
190F<regnodes.h> defines an array called C<regarglen[]> which gives the size
191of each opcode in units of C<size regnode> (4-byte). A macro is used
192to calculate the size of an C<EXACT> node based on its C<str_len> field.
b23a565d 193
be8e71aa
YO
194The regops are defined in F<regnodes.h> which is generated from
195F<regcomp.sym> by F<regcomp.pl>. Currently the maximum possible number
196of distinct regops is restricted to 256, with about a quarter already
b23a565d
RGS
197used.
198
be8e71aa 199A set of macros makes accessing the fields
4ccfbf60
RGS
200easier and more consistent. These include C<OP()>, which is used to determine
201the type of a C<regnode>-like structure; C<NEXT_OFF()>, which is the offset to
202the next node (more on this later); C<ARG()>, C<ARG1()>, C<ARG2()>, C<ARG_SET()>,
203and equivalents for reading and setting the arguments; and C<STR_LEN()>,
be8e71aa 204C<STRING()> and C<OPERAND()> for manipulating strings and regop bearing
b23a565d
RGS
205types.
206
be8e71aa 207=head3 What regop is next?
b23a565d
RGS
208
209There are three distinct concepts of "next" in the regex engine, and
210it is important to keep them clear.
211
212=over 4
213
214=item *
215
216There is the "next regnode" from a given regnode, a value which is
217rarely useful except that sometimes it matches up in terms of value
218with one of the others, and that sometimes the code assumes this to
219always be so.
220
221=item *
222
be8e71aa
YO
223There is the "next regop" from a given regop/regnode. This is the
224regop physically located after the the current one, as determined by
225the size of the current regop. This is often useful, such as when
b23a565d 226dumping the structure we use this order to traverse. Sometimes the code
be8e71aa
YO
227assumes that the "next regnode" is the same as the "next regop", or in
228other words assumes that the sizeof a given regop type is always going
229to be one regnode large.
b23a565d
RGS
230
231=item *
232
be8e71aa
YO
233There is the "regnext" from a given regop. This is the regop which
234is reached by jumping forward by the value of C<NEXT_OFF()>,
235or in a few cases for longer jumps by the C<arg1> field of the C<regnode_1>
236structure. The subroutine C<regnext()> handles this transparently.
b23a565d 237This is the logical successor of the node, which in some cases, like
be8e71aa 238that of the C<BRANCH> regop, has special meaning.
b23a565d
RGS
239
240=back
241
be8e71aa 242=head1 Process Overview
b23a565d 243
be8e71aa
YO
244Broadly speaking, performing a match of a string against a pattern
245involves the following steps:
246
247=over 5
248
249=item A. Compilation
250
251=over 5
252
253=item 1. Parsing for size
254
255=item 2. Parsing for construction
256
257=item 3. Peep-hole optimisation and analysis
258
259=back
260
261=item B. Execution
262
263=over 5
264
265=item 4. Start position and no-match optimisations
266
267=item 5. Program execution
268
269=back
270
271=back
b23a565d 272
b23a565d
RGS
273
274Where these steps occur in the actual execution of a perl program is
275determined by whether the pattern involves interpolating any string
be8e71aa
YO
276variables. If interpolation occurs, then compilation happens at run time. If it
277does not, then compilation is performed at compile time. (The C</o> modifier changes this,
4ccfbf60 278as does C<qr//> to a certain extent.) The engine doesn't really care that
b23a565d
RGS
279much.
280
281=head2 Compilation
282
be8e71aa
YO
283This code resides primarily in F<regcomp.c>, along with the header files
284F<regcomp.h>, F<regexp.h> and F<regnodes.h>.
b23a565d 285
4ccfbf60
RGS
286Compilation starts with C<pregcomp()>, which is mostly an initialisation
287wrapper which farms work out to two other routines for the heavy lifting: the
288first is C<reg()>, which is the start point for parsing; the second,
289C<study_chunk()>, is responsible for optimisation.
b23a565d 290
4ccfbf60
RGS
291Initialisation in C<pregcomp()> mostly involves the creation and data-filling
292of a special structure, C<RExC_state_t> (defined in F<regcomp.c>).
293Almost all internally-used routines in F<regcomp.h> take a pointer to one
be8e71aa 294of these structures as their first argument, with the name C<pRExC_state>.
b23a565d 295This structure is used to store the compilation state and contains many
be8e71aa 296fields. Likewise there are many macros which operate on this
4ccfbf60 297variable: anything that looks like C<RExC_xxxx> is a macro that operates on
b23a565d
RGS
298this pointer/structure.
299
300=head3 Parsing for size
301
302In this pass the input pattern is parsed in order to calculate how much
be8e71aa 303space is needed for each regop we would need to emit. The size is also
b23a565d
RGS
304used to determine whether long jumps will be required in the program.
305
be8e71aa 306This stage is controlled by the macro C<SIZE_ONLY> being set.
b23a565d 307
4ccfbf60 308The parse proceeds pretty much exactly as it does during the
be8e71aa
YO
309construction phase, except that most routines are short-circuited to
310change the size field C<RExC_size> and not do anything else.
b23a565d 311
4ccfbf60 312=head3 Parsing for construction
b23a565d 313
be8e71aa
YO
314Once the size of the program has been determined, the pattern is parsed
315again, but this time for real. Now C<SIZE_ONLY> will be false, and the
b23a565d
RGS
316actual construction can occur.
317
318C<reg()> is the start of the parse process. It is responsible for
319parsing an arbitrary chunk of pattern up to either the end of the
320string, or the first closing parenthesis it encounters in the pattern.
4ccfbf60 321This means it can be used to parse the top-level regex, or any section
b23a565d 322inside of a grouping parenthesis. It also handles the "special parens"
be8e71aa
YO
323that perl's regexes have. For instance when parsing C</x(?:foo)y/> C<reg()>
324will at one point be called to parse from the "?" symbol up to and
325including the ")".
b23a565d 326
be8e71aa 327Additionally, C<reg()> is responsible for parsing the one or more
b23a565d 328branches from the pattern, and for "finishing them off" by correctly
be8e71aa
YO
329setting their next pointers. In order to do the parsing, it repeatedly
330calls out to C<regbranch()>, which is responsible for handling up to the
b23a565d
RGS
331first C<|> symbol it sees.
332
be8e71aa
YO
333C<regbranch()> in turn calls C<regpiece()> which
334handles "things" followed by a quantifier. In order to parse the
335"things", C<regatom()> is called. This is the lowest level routine which
336parses out constant strings, character classes, and the
337various special symbols like C<$>. If C<regatom()> encounters a "("
b23a565d
RGS
338character it in turn calls C<reg()>.
339
340The routine C<regtail()> is called by both C<reg()>, C<regbranch()>
341in order to "set the tail pointer" correctly. When executing and
be8e71aa
YO
342we get to the end of a branch, we need to go to the node following the
343grouping parens. When parsing, however, we don't know where the end will
b23a565d
RGS
344be until we get there, so when we do we must go back and update the
345offsets as appropriate. C<regtail> is used to make this easier.
346
be8e71aa 347A subtlety of the parsing process means that a regex like C</foo/> is
b23a565d 348originally parsed into an alternation with a single branch. It is only
4ccfbf60 349afterwards that the optimiser converts single branch alternations into the
b23a565d
RGS
350simpler form.
351
352=head3 Parse Call Graph and a Grammar
353
354The call graph looks like this:
355
356 reg() # parse a top level regex, or inside of parens
357 regbranch() # parse a single branch of an alternation
358 regpiece() # parse a pattern followed by a quantifier
359 regatom() # parse a simple pattern
360 regclass() # used to handle a class
4ccfbf60 361 reg() # used to handle a parenthesised subpattern
b23a565d
RGS
362 ....
363 ...
364 regtail() # finish off the branch
365 ...
366 regtail() # finish off the branch sequence. Tie each
4ccfbf60 367 # branch's tail to the tail of the sequence
b23a565d
RGS
368 # (NEW) In Debug mode this is
369 # regtail_study().
370
371A grammar form might be something like this:
372
373 atom : constant | class
374 quant : '*' | '+' | '?' | '{min,max}'
375 _branch: piece
376 | piece _branch
377 | nothing
378 branch: _branch
379 | _branch '|' branch
380 group : '(' branch ')'
381 _piece: atom | group
382 piece : _piece
383 | _piece quant
384
385=head3 Debug Output
386
4ccfbf60 387In the 5.9.x development version of perl you can C<< use re Debug => 'PARSE'; >> to see some trace
b23a565d
RGS
388information about the parse process. We will start with some simple
389patterns and build up to more complex patterns.
390
391So when we parse C</foo/> we see something like the following table. The
4ccfbf60 392left shows what is being parsed, and the number indicates where the next regop
b23a565d
RGS
393would go. The stuff on the right is the trace output of the graph. The
394names are chosen to be short to make it less dense on the screen. 'tsdy'
395is a special form of C<regtail()> which does some extra analysis.
396
4ccfbf60
RGS
397 >foo< 1 reg
398 brnc
399 piec
400 atom
401 >< 4 tsdy~ EXACT <foo> (EXACT) (1)
402 ~ attach to END (3) offset to 2
b23a565d
RGS
403
404The resulting program then looks like:
405
406 1: EXACT <foo>(3)
407 3: END(0)
408
409As you can see, even though we parsed out a branch and a piece, it was ultimately
be8e71aa
YO
410only an atom. The final program shows us how things work. We have an C<EXACT> regop,
411followed by an C<END> regop. The number in parens indicates where the C<regnext> of
412the node goes. The C<regnext> of an C<END> regop is unused, as C<END> regops mean
b23a565d
RGS
413we have successfully matched. The number on the left indicates the position of
414the regop in the regnode array.
415
be8e71aa 416Now let's try a harder pattern. We will add a quantifier, so now we have the pattern
4ccfbf60
RGS
417C</foo+/>. We will see that C<regbranch()> calls C<regpiece()> twice.
418
419 >foo+< 1 reg
420 brnc
421 piec
422 atom
423 >o+< 3 piec
424 atom
425 >< 6 tail~ EXACT <fo> (1)
426 7 tsdy~ EXACT <fo> (EXACT) (1)
427 ~ PLUS (END) (3)
428 ~ attach to END (6) offset to 3
b23a565d
RGS
429
430And we end up with the program:
431
432 1: EXACT <fo>(3)
433 3: PLUS(6)
434 4: EXACT <o>(0)
435 6: END(0)
436
be8e71aa 437Now we have a special case. The C<EXACT> regop has a C<regnext> of 0. This is
4ccfbf60 438because if it matches it should try to match itself again. The C<PLUS> regop
be8e71aa 439handles the actual failure of the C<EXACT> regop and acts appropriately (going
4ccfbf60 440to regnode 6 if the C<EXACT> matched at least once, or failing if it didn't).
b23a565d
RGS
441
442Now for something much more complex: C</x(?:foo*|b[a][rR])(foo|bar)$/>
443
4ccfbf60
RGS
444 >x(?:foo*|b... 1 reg
445 brnc
446 piec
447 atom
448 >(?:foo*|b[... 3 piec
449 atom
450 >?:foo*|b[a... reg
451 >foo*|b[a][... brnc
b23a565d
RGS
452 piec
453 atom
4ccfbf60
RGS
454 >o*|b[a][rR... 5 piec
455 atom
456 >|b[a][rR])... 8 tail~ EXACT <fo> (3)
457 >b[a][rR])(... 9 brnc
458 10 piec
459 atom
460 >[a][rR])(f... 12 piec
b23a565d 461 atom
4ccfbf60
RGS
462 >a][rR])(fo... clas
463 >[rR])(foo|... 14 tail~ EXACT <b> (10)
b23a565d
RGS
464 piec
465 atom
4ccfbf60
RGS
466 >rR])(foo|b... clas
467 >)(foo|bar)... 25 tail~ EXACT <a> (12)
468 tail~ BRANCH (3)
469 26 tsdy~ BRANCH (END) (9)
470 ~ attach to TAIL (25) offset to 16
471 tsdy~ EXACT <fo> (EXACT) (4)
472 ~ STAR (END) (6)
473 ~ attach to TAIL (25) offset to 19
474 tsdy~ EXACT <b> (EXACT) (10)
475 ~ EXACT <a> (EXACT) (12)
476 ~ ANYOF[Rr] (END) (14)
477 ~ attach to TAIL (25) offset to 11
478 >(foo|bar)$< tail~ EXACT <x> (1)
479 piec
480 atom
481 >foo|bar)$< reg
482 28 brnc
b23a565d
RGS
483 piec
484 atom
4ccfbf60
RGS
485 >|bar)$< 31 tail~ OPEN1 (26)
486 >bar)$< brnc
487 32 piec
488 atom
489 >)$< 34 tail~ BRANCH (28)
490 36 tsdy~ BRANCH (END) (31)
491 ~ attach to CLOSE1 (34) offset to 3
492 tsdy~ EXACT <foo> (EXACT) (29)
493 ~ attach to CLOSE1 (34) offset to 5
494 tsdy~ EXACT <bar> (EXACT) (32)
495 ~ attach to CLOSE1 (34) offset to 2
496 >$< tail~ BRANCH (3)
497 ~ BRANCH (9)
498 ~ TAIL (25)
499 piec
500 atom
501 >< 37 tail~ OPEN1 (26)
502 ~ BRANCH (28)
503 ~ BRANCH (31)
504 ~ CLOSE1 (34)
505 38 tsdy~ EXACT <x> (EXACT) (1)
506 ~ BRANCH (END) (3)
507 ~ BRANCH (END) (9)
508 ~ TAIL (END) (25)
509 ~ OPEN1 (END) (26)
510 ~ BRANCH (END) (28)
511 ~ BRANCH (END) (31)
512 ~ CLOSE1 (END) (34)
513 ~ EOL (END) (36)
514 ~ attach to END (37) offset to 1
b23a565d
RGS
515
516Resulting in the program
517
518 1: EXACT <x>(3)
519 3: BRANCH(9)
520 4: EXACT <fo>(6)
521 6: STAR(26)
522 7: EXACT <o>(0)
523 9: BRANCH(25)
524 10: EXACT <ba>(14)
525 12: OPTIMIZED (2 nodes)
526 14: ANYOF[Rr](26)
527 25: TAIL(26)
528 26: OPEN1(28)
529 28: TRIE-EXACT(34)
530 [StS:1 Wds:2 Cs:6 Uq:5 #Sts:7 Mn:3 Mx:3 Stcls:bf]
531 <foo>
532 <bar>
533 30: OPTIMIZED (4 nodes)
534 34: CLOSE1(36)
535 36: EOL(37)
536 37: END(0)
537
538Here we can see a much more complex program, with various optimisations in
be8e71aa
YO
539play. At regnode 10 we see an example where a character class with only
540one character in it was turned into an C<EXACT> node. We can also see where
541an entire alternation was turned into a C<TRIE-EXACT> node. As a consequence,
b23a565d 542some of the regnodes have been marked as optimised away. We can see that
be8e71aa
YO
543the C<$> symbol has been converted into an C<EOL> regop, a special piece of
544code that looks for C<\n> or the end of the string.
b23a565d 545
be8e71aa 546The next pointer for C<BRANCH>es is interesting in that it points at where
b23a565d 547execution should go if the branch fails. When executing if the engine
be8e71aa
YO
548tries to traverse from a branch to a C<regnext> that isn't a branch then
549the engine will know that the entire set of branches have failed.
b23a565d
RGS
550
551=head3 Peep-hole Optimisation and Analysis
552
553The regular expression engine can be a weighty tool to wield. On long
554strings and complex patterns it can end up having to do a lot of work
555to find a match, and even more to decide that no match is possible.
556Consider a situation like the following pattern.
557
558 'ababababababababababab' =~ /(a|b)*z/
559
560The C<(a|b)*> part can match at every char in the string, and then fail
561every time because there is no C<z> in the string. So obviously we can
4ccfbf60 562avoid using the regex engine unless there is a C<z> in the string.
b23a565d
RGS
563Likewise in a pattern like:
564
565 /foo(\w+)bar/
566
567In this case we know that the string must contain a C<foo> which must be
4ccfbf60 568followed by C<bar>. We can use Fast Boyer-Moore matching as implemented
be8e71aa
YO
569in C<fbm_instr()> to find the location of these strings. If they don't exist
570then we don't need to resort to the much more expensive regex engine.
571Even better, if they do exist then we can use their positions to
b23a565d 572reduce the search space that the regex engine needs to cover to determine
be8e71aa 573if the entire pattern matches.
b23a565d
RGS
574
575There are various aspects of the pattern that can be used to facilitate
576optimisations along these lines:
577
4ccfbf60
RGS
578=over 5
579
580=item * anchored fixed strings
581
582=item * floating fixed strings
583
584=item * minimum and maximum length requirements
585
586=item * start class
587
588=item * Beginning/End of line positions
589
590=back
b23a565d
RGS
591
592Another form of optimisation that can occur is post-parse "peep-hole"
be8e71aa
YO
593optimisations, where inefficient constructs are replaced by
594more efficient constructs. An example of this are C<TAIL> regops which are used
b23a565d 595during parsing to mark the end of branches and the end of groups. These
4ccfbf60 596regops are used as place-holders during construction and "always match"
b23a565d 597so they can be "optimised away" by making the things that point to the
4ccfbf60 598C<TAIL> point to thing that the C<TAIL> points to, thus "skipping" the node.
b23a565d 599
be8e71aa
YO
600Another optimisation that can occur is that of "C<EXACT> merging" which is
601where two consecutive C<EXACT> nodes are merged into a single
4ccfbf60
RGS
602regop. An even more aggressive form of this is that a branch
603sequence of the form C<EXACT BRANCH ... EXACT> can be converted into a
be8e71aa 604C<TRIE-EXACT> regop.
b23a565d 605
be8e71aa
YO
606All of this occurs in the routine C<study_chunk()> which uses a special
607structure C<scan_data_t> to store the analysis that it has performed, and
608does the "peep-hole" optimisations as it goes.
b23a565d 609
be8e71aa 610The code involved in C<study_chunk()> is extremely cryptic. Be careful. :-)
b23a565d
RGS
611
612=head2 Execution
613
614Execution of a regex generally involves two phases, the first being
615finding the start point in the string where we should match from,
616and the second being running the regop interpreter.
617
be8e71aa
YO
618If we can tell that there is no valid start point then we don't bother running
619interpreter at all. Likewise, if we know from the analysis phase that we
620cannot detect a short-cut to the start position, we go straight to the
b23a565d
RGS
621interpreter.
622
be8e71aa 623The two entry points are C<re_intuit_start()> and C<pregexec()>. These routines
b23a565d 624have a somewhat incestuous relationship with overlap between their functions,
be8e71aa 625and C<pregexec()> may even call C<re_intuit_start()> on its own. Nevertheless
4ccfbf60 626other parts of the the perl source code may call into either, or both.
b23a565d
RGS
627
628Execution of the interpreter itself used to be recursive. Due to the
4ccfbf60 629efforts of Dave Mitchell in the 5.9.x development track, it is now iterative. Now an
b23a565d
RGS
630internal stack is maintained on the heap and the routine is fully
631iterative. This can make it tricky as the code is quite conservative
4ccfbf60 632about what state it stores, with the result that that two consecutive lines in the
b23a565d
RGS
633code can actually be running in totally different contexts due to the
634simulated recursion.
635
636=head3 Start position and no-match optimisations
637
4ccfbf60 638C<re_intuit_start()> is responsible for handling start points and no-match
b23a565d 639optimisations as determined by the results of the analysis done by
be8e71aa 640C<study_chunk()> (and described in L<Peep-hole Optimisation and Analysis>).
b23a565d 641
4ccfbf60
RGS
642The basic structure of this routine is to try to find the start- and/or
643end-points of where the pattern could match, and to ensure that the string
644is long enough to match the pattern. It tries to use more efficient
645methods over less efficient methods and may involve considerable
646cross-checking of constraints to find the place in the string that matches.
b23a565d
RGS
647For instance it may try to determine that a given fixed string must be
648not only present but a certain number of chars before the end of the
649string, or whatever.
650
be8e71aa 651It calls several other routines, such as C<fbm_instr()> which does
4ccfbf60 652Fast Boyer Moore matching and C<find_byclass()> which is responsible for
b23a565d
RGS
653finding the start using the first mandatory regop in the program.
654
4ccfbf60 655When the optimisation criteria have been satisfied, C<reg_try()> is called
b23a565d
RGS
656to perform the match.
657
658=head3 Program execution
659
660C<pregexec()> is the main entry point for running a regex. It contains
4ccfbf60
RGS
661support for initialising the regex interpreter's state, running
662C<re_intuit_start()> if needed, and running the interpreter on the string
663from various start positions as needed. When it is necessary to use
b23a565d
RGS
664the regex interpreter C<pregexec()> calls C<regtry()>.
665
666C<regtry()> is the entry point into the regex interpreter. It expects
be8e71aa 667as arguments a pointer to a C<regmatch_info> structure and a pointer to
b23a565d 668a string. It returns an integer 1 for success and a 0 for failure.
4ccfbf60 669It is basically a set-up wrapper around C<regmatch()>.
b23a565d
RGS
670
671C<regmatch> is the main "recursive loop" of the interpreter. It is
e3950ac3
RGS
672basically a giant switch statement that implements a state machine, where
673the possible states are the regops themselves, plus a number of additional
674intermediate and failure states. A few of the states are implemented as
675subroutines but the bulk are inline code.
b23a565d
RGS
676
677=head1 MISCELLANEOUS
678
4ccfbf60
RGS
679=head2 Unicode and Localisation Support
680
681When dealing with strings containing characters that cannot be represented
682using an eight-bit character set, perl uses an internal representation
683that is a permissive version of Unicode's UTF-8 encoding[2]. This uses single
684bytes to represent characters from the ASCII character set, and sequences
685of two or more bytes for all other characters. (See L<perlunitut>
686for more information about the relationship between UTF-8 and perl's
687encoding, utf8 -- the difference isn't important for this discussion.)
b23a565d 688
be8e71aa 689No matter how you look at it, Unicode support is going to be a pain in a
b23a565d 690regex engine. Tricks that might be fine when you have 256 possible
be8e71aa 691characters often won't scale to handle the size of the UTF-8 character
b23a565d 692set. Things you can take for granted with ASCII may not be true with
4ccfbf60 693Unicode. For instance, in ASCII, it is safe to assume that
be8e71aa
YO
694C<sizeof(char1) == sizeof(char2)>, but in UTF-8 it isn't. Unicode case folding is
695vastly more complex than the simple rules of ASCII, and even when not
4ccfbf60
RGS
696using Unicode but only localised single byte encodings, things can get
697tricky (for example, GERMAN-SHARP-ESS should match 'SS' in localised
698case-insensitive matching).
be8e71aa
YO
699
700Making things worse is that UTF-8 support was a later addition to the
701regex engine (as it was to perl) and this necessarily made things a lot
702more complicated. Obviously it is easier to design a regex engine with
703Unicode support in mind from the beginning than it is to retrofit it to
704one that wasn't.
705
4ccfbf60 706Nearly all regops that involve looking at the input string have
be8e71aa
YO
707two cases, one for UTF-8, and one not. In fact, it's often more complex
708than that, as the pattern may be UTF-8 as well.
b23a565d
RGS
709
710Care must be taken when making changes to make sure that you handle
be8e71aa 711UTF-8 properly, both at compile time and at execution time, including
b23a565d
RGS
712when the string and pattern are mismatched.
713
be8e71aa 714The following comment in F<regcomp.h> gives an example of exactly how
b23a565d
RGS
715tricky this can be:
716
717 Two problematic code points in Unicode casefolding of EXACT nodes:
718
719 U+0390 - GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
720 U+03B0 - GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
721
722 which casefold to
723
724 Unicode UTF-8
725
726 U+03B9 U+0308 U+0301 0xCE 0xB9 0xCC 0x88 0xCC 0x81
727 U+03C5 U+0308 U+0301 0xCF 0x85 0xCC 0x88 0xCC 0x81
728
729 This means that in case-insensitive matching (or "loose matching",
730 as Unicode calls it), an EXACTF of length six (the UTF-8 encoded
731 byte length of the above casefolded versions) can match a target
732 string of length two (the byte length of UTF-8 encoded U+0390 or
733 U+03B0). This would rather mess up the minimum length computation.
734
735 What we'll do is to look for the tail four bytes, and then peek
736 at the preceding two bytes to see whether we need to decrease
737 the minimum length by four (six minus two).
738
739 Thanks to the design of UTF-8, there cannot be false matches:
740 A sequence of valid UTF-8 bytes cannot be a subsequence of
741 another valid sequence of UTF-8 bytes.
742
4ccfbf60 743=head2 Base Struct
be8e71aa 744
4ccfbf60 745F<regexp.h> contains the base structure definition:
be8e71aa
YO
746
747 typedef struct regexp {
748 I32 *startp;
749 I32 *endp;
750 regnode *regstclass;
751 struct reg_substr_data *substrs;
752 char *precomp; /* pre-compilation regular expression */
753 struct reg_data *data; /* Additional data. */
754 char *subbeg; /* saved or original string
755 so \digit works forever. */
756 #ifdef PERL_OLD_COPY_ON_WRITE
757 SV *saved_copy; /* If non-NULL, SV which is COW from original */
758 #endif
759 U32 *offsets; /* offset annotations 20001228 MJD */
760 I32 sublen; /* Length of string pointed by subbeg */
761 I32 refcnt;
4ccfbf60 762 I32 minlen; /* minimum possible length of $& */
be8e71aa
YO
763 I32 prelen; /* length of precomp */
764 U32 nparens; /* number of parentheses */
765 U32 lastparen; /* last paren matched */
766 U32 lastcloseparen; /* last paren matched */
767 U32 reganch; /* Internal use only +
768 Tainted information used by regexec? */
769 regnode program[1]; /* Unwarranted chumminess with compiler. */
770 } regexp;
771
772C<program>, and C<data> are the primary fields of concern in terms of
4ccfbf60 773program structure. C<program> is the actual array of nodes, and C<data> is
be8e71aa
YO
774an array of "whatever", with each whatever being typed by letter, and
775freed or cloned as needed based on this type. regops use the data
776array to store reference data that isn't convenient to store in the regop
4ccfbf60
RGS
777itself. It also means memory management code doesn't need to traverse the
778program to find pointers. So for instance, if a regop needs a pointer, the
779normal procedure is use a C<regnode_arg1> store the data index in the C<ARG>
be8e71aa
YO
780field and look it up from the data array.
781
4ccfbf60
RGS
782=over 5
783
784=item -
785
786C<startp>, C<endp>, C<nparens>, C<lasparen>, and C<lastcloseparen> are used to manage capture
be8e71aa
YO
787buffers.
788
4ccfbf60
RGS
789=item -
790
791C<subbeg> and optional C<saved_copy> are used during the execution phase for managing
be8e71aa
YO
792replacements.
793
4ccfbf60 794=item -
be8e71aa 795
4ccfbf60 796C<offsets> and C<precomp> are used for debugging purposes.
be8e71aa 797
4ccfbf60 798=item -
be8e71aa 799
4ccfbf60
RGS
800The rest are used for start point optimisations.
801
802=back
803
804=head2 De-allocation and Cloning
be8e71aa
YO
805
806Any patch that adds data items to the regexp will need to include
4ccfbf60 807changes to F<sv.c> (C<Perl_re_dup()>) and F<regcomp.c> (C<pregfree()>). This
be8e71aa 808involves freeing or cloning items in the regexes data array based
4ccfbf60
RGS
809on the data item's type.
810
811=head1 SEE ALSO
812
813L<perlre>
814
815L<perlunitut>
be8e71aa 816
b23a565d
RGS
817=head1 AUTHOR
818
819by Yves Orton, 2006.
820
821With excerpts from Perl, and contributions and suggestions from
822Ronald J. Kimball, Dave Mitchell, Dominic Dunlop, Mark Jason Dominus,
be8e71aa 823Stephen McCamant, and David Landgren.
b23a565d 824
4ccfbf60 825=head1 LICENCE
b23a565d
RGS
826
827Same terms as Perl.
828
829=head1 REFERENCES
830
4ccfbf60
RGS
831[1] L<http://perl.plover.com/Rx/paper/>
832
833[2] L<http://www.unicode.org>
b23a565d
RGS
834
835=cut