Commit | Line | Data |
---|---|---|
b23a565d RGS |
1 | =head1 NAME |
2 | ||
3 | perlreguts - Description of the Perl regular expression engine. | |
4 | ||
5 | =head1 DESCRIPTION | |
6 | ||
7 | This document is an attempt to shine some light on the guts of the regex | |
4ccfbf60 | 8 | engine and how it works. The regex engine represents a significant chunk |
b23a565d RGS |
9 | of the perl codebase, but is relatively poorly understood. This document |
10 | is a meagre attempt at addressing this situation. It is derived from the | |
be8e71aa YO |
11 | author's experience, comments in the source code, other papers on the |
12 | regex engine, feedback on the perl5-porters mail list, and no doubt other | |
13 | places as well. | |
b23a565d | 14 | |
108003db RGS |
15 | B<NOTICE!> It should be clearly understood that the behavior and |
16 | structures discussed in this represents the state of the engine as the | |
17 | author understood it at the time of writing. It is B<NOT> an API | |
18 | definition, it is purely an internals guide for those who want to hack | |
19 | the regex engine, or understand how the regex engine works. Readers of | |
20 | this document are expected to understand perl's regex syntax and its | |
21 | usage in detail. If you want to learn about the basics of Perl's | |
22 | regular expressions, see L<perlre>. And if you want to replace the | |
e1020413 | 23 | regex engine with your own, see L<perlreapi>. |
b23a565d RGS |
24 | |
25 | =head1 OVERVIEW | |
26 | ||
27 | =head2 A quick note on terms | |
28 | ||
be8e71aa | 29 | There is some debate as to whether to say "regexp" or "regex". In this |
b23a565d | 30 | document we will use the term "regex" unless there is a special reason |
be8e71aa | 31 | not to, in which case we will explain why. |
b23a565d RGS |
32 | |
33 | When speaking about regexes we need to distinguish between their source | |
34 | code form and their internal form. In this document we will use the term | |
edc977ff | 35 | "pattern" when we speak of their textual, source code form, and the term |
b23a565d | 36 | "program" when we speak of their internal representation. These |
be8e71aa | 37 | correspond to the terms I<S-regex> and I<B-regex> that Mark Jason |
e3950ac3 | 38 | Dominus employs in his paper on "Rx" ([1] in L</REFERENCES>). |
b23a565d RGS |
39 | |
40 | =head2 What is a regular expression engine? | |
41 | ||
be8e71aa YO |
42 | A regular expression engine is a program that takes a set of constraints |
43 | specified in a mini-language, and then applies those constraints to a | |
44 | target string, and determines whether or not the string satisfies the | |
45 | constraints. See L<perlre> for a full definition of the language. | |
b23a565d | 46 | |
edc977ff | 47 | In less grandiose terms, the first part of the job is to turn a pattern into |
b23a565d | 48 | something the computer can efficiently use to find the matching point in |
be8e71aa | 49 | the string, and the second part is performing the search itself. |
b23a565d RGS |
50 | |
51 | To do this we need to produce a program by parsing the text. We then | |
52 | need to execute the program to find the point in the string that | |
53 | matches. And we need to do the whole thing efficiently. | |
54 | ||
55 | =head2 Structure of a Regexp Program | |
56 | ||
57 | =head3 High Level | |
58 | ||
be8e71aa | 59 | Although it is a bit confusing and some people object to the terminology, it |
b23a565d | 60 | is worth taking a look at a comment that has |
be8e71aa | 61 | been in F<regexp.h> for years: |
b23a565d RGS |
62 | |
63 | I<This is essentially a linear encoding of a nondeterministic | |
64 | finite-state machine (aka syntax charts or "railroad normal form" in | |
65 | parsing technology).> | |
66 | ||
67 | The term "railroad normal form" is a bit esoteric, with "syntax | |
68 | diagram/charts", or "railroad diagram/charts" being more common terms. | |
4ccfbf60 | 69 | Nevertheless it provides a useful mental image of a regex program: each |
b23a565d RGS |
70 | node can be thought of as a unit of track, with a single entry and in |
71 | most cases a single exit point (there are pieces of track that fork, but | |
be8e71aa | 72 | statistically not many), and the whole forms a layout with a |
b23a565d | 73 | single entry and single exit point. The matching process can be thought |
be8e71aa | 74 | of as a car that moves along the track, with the particular route through |
b23a565d | 75 | the system being determined by the character read at each possible |
be8e71aa YO |
76 | connector point. A car can fall off the track at any point but it may |
77 | only proceed as long as it matches the track. | |
b23a565d RGS |
78 | |
79 | Thus the pattern C</foo(?:\w+|\d+|\s+)bar/> can be thought of as the | |
80 | following chart: | |
81 | ||
be8e71aa YO |
82 | [start] |
83 | | | |
84 | <foo> | |
85 | | | |
86 | +-----+-----+ | |
87 | | | | | |
88 | <\w+> <\d+> <\s+> | |
89 | | | | | |
90 | +-----+-----+ | |
91 | | | |
92 | <bar> | |
93 | | | |
94 | [end] | |
95 | ||
96 | The truth of the matter is that perl's regular expressions these days are | |
4ccfbf60 RGS |
97 | much more complex than this kind of structure, but visualising it this way |
98 | can help when trying to get your bearings, and it matches the | |
99 | current implementation pretty closely. | |
be8e71aa YO |
100 | |
101 | To be more precise, we will say that a regex program is an encoding | |
102 | of a graph. Each node in the graph corresponds to part of | |
b23a565d RGS |
103 | the original regex pattern, such as a literal string or a branch, |
104 | and has a pointer to the nodes representing the next component | |
be8e71aa YO |
105 | to be matched. Since "node" and "opcode" already have other meanings in the |
106 | perl source, we will call the nodes in a regex program "regops". | |
b23a565d | 107 | |
be8e71aa YO |
108 | The program is represented by an array of C<regnode> structures, one or |
109 | more of which represent a single regop of the program. Struct | |
4ccfbf60 | 110 | C<regnode> is the smallest struct needed, and has a field structure which is |
e3f228df KW |
111 | shared with all the other larger structures. (Outside this document, the term |
112 | "regnode" is sometimes used to mean "regop", which could be confusing.) | |
b23a565d | 113 | |
be8e71aa YO |
114 | The "next" pointers of all regops except C<BRANCH> implement concatenation; |
115 | a "next" pointer with a C<BRANCH> on both ends of it is connecting two | |
116 | alternatives. [Here we have one of the subtle syntax dependencies: an | |
117 | individual C<BRANCH> (as opposed to a collection of them) is never | |
118 | concatenated with anything because of operator precedence.] | |
b23a565d RGS |
119 | |
120 | The operand of some types of regop is a literal string; for others, | |
121 | it is a regop leading into a sub-program. In particular, the operand | |
be8e71aa | 122 | of a C<BRANCH> node is the first regop of the branch. |
b23a565d | 123 | |
4ccfbf60 | 124 | B<NOTE>: As the railroad metaphor suggests, this is B<not> a tree |
b23a565d | 125 | structure: the tail of the branch connects to the thing following the |
be8e71aa | 126 | set of C<BRANCH>es. It is a like a single line of railway track that |
b23a565d RGS |
127 | splits as it goes into a station or railway yard and rejoins as it comes |
128 | out the other side. | |
129 | ||
130 | =head3 Regops | |
131 | ||
be8e71aa | 132 | The base structure of a regop is defined in F<regexp.h> as follows: |
b23a565d RGS |
133 | |
134 | struct regnode { | |
be8e71aa | 135 | U8 flags; /* Various purposes, sometimes overridden */ |
b23a565d RGS |
136 | U8 type; /* Opcode value as specified by regnodes.h */ |
137 | U16 next_off; /* Offset in size regnode */ | |
138 | }; | |
139 | ||
be8e71aa | 140 | Other larger C<regnode>-like structures are defined in F<regcomp.h>. They |
b23a565d | 141 | are almost like subclasses in that they have the same fields as |
4ccfbf60 | 142 | C<regnode>, with possibly additional fields following in |
b23a565d | 143 | the structure, and in some cases the specific meaning (and name) |
4ccfbf60 | 144 | of some of base fields are overridden. The following is a more |
b23a565d RGS |
145 | complete description. |
146 | ||
147 | =over 4 | |
148 | ||
be8e71aa | 149 | =item C<regnode_1> |
b23a565d | 150 | |
be8e71aa | 151 | =item C<regnode_2> |
b23a565d | 152 | |
be8e71aa YO |
153 | C<regnode_1> structures have the same header, followed by a single |
154 | four-byte argument; C<regnode_2> structures contain two two-byte | |
b23a565d RGS |
155 | arguments instead: |
156 | ||
157 | regnode_1 U32 arg1; | |
158 | regnode_2 U16 arg1; U16 arg2; | |
159 | ||
be8e71aa | 160 | =item C<regnode_string> |
b23a565d | 161 | |
be8e71aa | 162 | C<regnode_string> structures, used for literal strings, follow the header |
b23a565d | 163 | with a one-byte length and then the string data. Strings are padded on |
e3f228df | 164 | the tail end with zero bytes so that the total length of the node is a |
b23a565d RGS |
165 | multiple of four bytes: |
166 | ||
167 | regnode_string char string[1]; | |
be8e71aa | 168 | U8 str_len; /* overrides flags */ |
b23a565d | 169 | |
be8e71aa | 170 | =item C<regnode_charclass> |
b23a565d | 171 | |
c8849eb1 KW |
172 | Bracketed character classes are represented by C<regnode_charclass> |
173 | structures, which have a four-byte argument and then a 32-byte (256-bit) | |
174 | bitmap indicating which characters in the Latin1 range are included in | |
175 | the class. | |
b23a565d RGS |
176 | |
177 | regnode_charclass U32 arg1; | |
178 | char bitmap[ANYOF_BITMAP_SIZE]; | |
179 | ||
c8849eb1 KW |
180 | Various flags whose names begin with C<ANYOF_> are used for special |
181 | situations. Above Latin1 matches and things not known until run-time | |
182 | are stored in L</Perl's pprivate structure>. | |
183 | ||
d1d040e5 | 184 | =item C<regnode_charclass_posixl> |
b23a565d RGS |
185 | |
186 | There is also a larger form of a char class structure used to represent | |
c8849eb1 | 187 | POSIX char classes under C</l> matching, |
d1d040e5 | 188 | called C<regnode_charclass_posixl> which has an |
c8849eb1 | 189 | additional 32-bit bitmap indicating which POSIX char classes |
be8e71aa | 190 | have been included. |
b23a565d | 191 | |
d1d040e5 | 192 | regnode_charclass_posixl U32 arg1; |
2bdc80de | 193 | char bitmap[ANYOF_BITMAP_SIZE]; |
c8849eb1 | 194 | U32 classflags; |
b23a565d RGS |
195 | |
196 | =back | |
197 | ||
be8e71aa YO |
198 | F<regnodes.h> defines an array called C<regarglen[]> which gives the size |
199 | of each opcode in units of C<size regnode> (4-byte). A macro is used | |
200 | to calculate the size of an C<EXACT> node based on its C<str_len> field. | |
b23a565d | 201 | |
be8e71aa YO |
202 | The regops are defined in F<regnodes.h> which is generated from |
203 | F<regcomp.sym> by F<regcomp.pl>. Currently the maximum possible number | |
204 | of distinct regops is restricted to 256, with about a quarter already | |
b23a565d RGS |
205 | used. |
206 | ||
be8e71aa | 207 | A set of macros makes accessing the fields |
4ccfbf60 RGS |
208 | easier and more consistent. These include C<OP()>, which is used to determine |
209 | the type of a C<regnode>-like structure; C<NEXT_OFF()>, which is the offset to | |
210 | the next node (more on this later); C<ARG()>, C<ARG1()>, C<ARG2()>, C<ARG_SET()>, | |
211 | and equivalents for reading and setting the arguments; and C<STR_LEN()>, | |
be8e71aa | 212 | C<STRING()> and C<OPERAND()> for manipulating strings and regop bearing |
b23a565d RGS |
213 | types. |
214 | ||
be8e71aa | 215 | =head3 What regop is next? |
b23a565d RGS |
216 | |
217 | There are three distinct concepts of "next" in the regex engine, and | |
218 | it is important to keep them clear. | |
219 | ||
220 | =over 4 | |
221 | ||
222 | =item * | |
223 | ||
224 | There is the "next regnode" from a given regnode, a value which is | |
225 | rarely useful except that sometimes it matches up in terms of value | |
226 | with one of the others, and that sometimes the code assumes this to | |
227 | always be so. | |
228 | ||
229 | =item * | |
230 | ||
be8e71aa | 231 | There is the "next regop" from a given regop/regnode. This is the |
e1020413 | 232 | regop physically located after the current one, as determined by |
be8e71aa | 233 | the size of the current regop. This is often useful, such as when |
b23a565d | 234 | dumping the structure we use this order to traverse. Sometimes the code |
be8e71aa YO |
235 | assumes that the "next regnode" is the same as the "next regop", or in |
236 | other words assumes that the sizeof a given regop type is always going | |
237 | to be one regnode large. | |
b23a565d RGS |
238 | |
239 | =item * | |
240 | ||
be8e71aa YO |
241 | There is the "regnext" from a given regop. This is the regop which |
242 | is reached by jumping forward by the value of C<NEXT_OFF()>, | |
243 | or in a few cases for longer jumps by the C<arg1> field of the C<regnode_1> | |
244 | structure. The subroutine C<regnext()> handles this transparently. | |
b23a565d | 245 | This is the logical successor of the node, which in some cases, like |
be8e71aa | 246 | that of the C<BRANCH> regop, has special meaning. |
b23a565d RGS |
247 | |
248 | =back | |
249 | ||
be8e71aa | 250 | =head1 Process Overview |
b23a565d | 251 | |
be8e71aa YO |
252 | Broadly speaking, performing a match of a string against a pattern |
253 | involves the following steps: | |
254 | ||
255 | =over 5 | |
256 | ||
257 | =item A. Compilation | |
258 | ||
259 | =over 5 | |
260 | ||
e3f228df | 261 | =item 1. Parsing |
be8e71aa | 262 | |
e3f228df | 263 | =item 2. Peep-hole optimisation and analysis |
be8e71aa YO |
264 | |
265 | =back | |
266 | ||
267 | =item B. Execution | |
268 | ||
269 | =over 5 | |
270 | ||
e3f228df | 271 | =item 3. Start position and no-match optimisations |
be8e71aa | 272 | |
e3f228df | 273 | =item 4. Program execution |
be8e71aa YO |
274 | |
275 | =back | |
276 | ||
277 | =back | |
b23a565d | 278 | |
b23a565d RGS |
279 | |
280 | Where these steps occur in the actual execution of a perl program is | |
281 | determined by whether the pattern involves interpolating any string | |
be8e71aa YO |
282 | variables. If interpolation occurs, then compilation happens at run time. If it |
283 | does not, then compilation is performed at compile time. (The C</o> modifier changes this, | |
4ccfbf60 | 284 | as does C<qr//> to a certain extent.) The engine doesn't really care that |
b23a565d RGS |
285 | much. |
286 | ||
287 | =head2 Compilation | |
288 | ||
be8e71aa YO |
289 | This code resides primarily in F<regcomp.c>, along with the header files |
290 | F<regcomp.h>, F<regexp.h> and F<regnodes.h>. | |
b23a565d | 291 | |
4ccfbf60 RGS |
292 | Compilation starts with C<pregcomp()>, which is mostly an initialisation |
293 | wrapper which farms work out to two other routines for the heavy lifting: the | |
294 | first is C<reg()>, which is the start point for parsing; the second, | |
295 | C<study_chunk()>, is responsible for optimisation. | |
b23a565d | 296 | |
4ccfbf60 RGS |
297 | Initialisation in C<pregcomp()> mostly involves the creation and data-filling |
298 | of a special structure, C<RExC_state_t> (defined in F<regcomp.c>). | |
299 | Almost all internally-used routines in F<regcomp.h> take a pointer to one | |
be8e71aa | 300 | of these structures as their first argument, with the name C<pRExC_state>. |
b23a565d | 301 | This structure is used to store the compilation state and contains many |
be8e71aa | 302 | fields. Likewise there are many macros which operate on this |
4ccfbf60 | 303 | variable: anything that looks like C<RExC_xxxx> is a macro that operates on |
b23a565d RGS |
304 | this pointer/structure. |
305 | ||
b23a565d RGS |
306 | C<reg()> is the start of the parse process. It is responsible for |
307 | parsing an arbitrary chunk of pattern up to either the end of the | |
308 | string, or the first closing parenthesis it encounters in the pattern. | |
4ccfbf60 | 309 | This means it can be used to parse the top-level regex, or any section |
b23a565d | 310 | inside of a grouping parenthesis. It also handles the "special parens" |
e3f228df KW |
311 | that perl's regexes have. For instance when parsing C</x(?:foo)y/>, |
312 | C<reg()> will at one point be called to parse from the "?" symbol up to | |
313 | and including the ")". | |
b23a565d | 314 | |
be8e71aa | 315 | Additionally, C<reg()> is responsible for parsing the one or more |
b23a565d | 316 | branches from the pattern, and for "finishing them off" by correctly |
be8e71aa YO |
317 | setting their next pointers. In order to do the parsing, it repeatedly |
318 | calls out to C<regbranch()>, which is responsible for handling up to the | |
b23a565d RGS |
319 | first C<|> symbol it sees. |
320 | ||
be8e71aa YO |
321 | C<regbranch()> in turn calls C<regpiece()> which |
322 | handles "things" followed by a quantifier. In order to parse the | |
edc977ff | 323 | "things", C<regatom()> is called. This is the lowest level routine, which |
be8e71aa YO |
324 | parses out constant strings, character classes, and the |
325 | various special symbols like C<$>. If C<regatom()> encounters a "(" | |
b23a565d RGS |
326 | character it in turn calls C<reg()>. |
327 | ||
e3f228df KW |
328 | There used to be two main passes involved in parsing, the first to |
329 | calculate the size of the compiled program, and the second to actually | |
330 | compile it. But now there is only one main pass, with an initial crude | |
331 | guess based on the length of the input pattern, which is increased if | |
332 | necessary as parsing proceeds, and afterwards, trimmed to the actual | |
333 | amount used. | |
334 | ||
335 | However, it may happen that parsing must be restarted at the beginning | |
336 | when various circumstances occur along the way. An example is if the | |
337 | program turns out to be so large that there are jumps in it that won't | |
338 | fit in the normal 16 bits available. There are two special regops that | |
339 | can hold bigger jump destinations, BRANCHJ and LONGBRANCH. The parse is | |
340 | restarted, and these are used instead of the normal shorter ones. | |
341 | Whenever restarting the parse is required, the function returns failure | |
342 | and sets a flag as to what needs to be done. This is passed up to the | |
343 | top level routine which takes the appropriate action and restarts from | |
344 | scratch. In the case of needing longer jumps, the C<RExC_use_BRANCHJ> | |
345 | flag is set in the C<RExC_state_t> structure, which the functions know | |
346 | to inspect before deciding how to do branches. | |
347 | ||
348 | In most instances, the function that discovers the issue sets the causal | |
349 | flag and returns failure immediately. L</Parsing complications> | |
350 | contains an explicit example of how this works. In other cases, such as | |
351 | a forward reference to a numbered parenthetical grouping, we need to | |
352 | finish the parse to know if that numbered grouping actually appears in | |
353 | the pattern. In those cases, the parse is just redone at the end, with | |
354 | the knowledge of how many groupings occur in it. | |
355 | ||
edc977ff | 356 | The routine C<regtail()> is called by both C<reg()> and C<regbranch()> |
b23a565d | 357 | in order to "set the tail pointer" correctly. When executing and |
be8e71aa YO |
358 | we get to the end of a branch, we need to go to the node following the |
359 | grouping parens. When parsing, however, we don't know where the end will | |
b23a565d RGS |
360 | be until we get there, so when we do we must go back and update the |
361 | offsets as appropriate. C<regtail> is used to make this easier. | |
362 | ||
be8e71aa | 363 | A subtlety of the parsing process means that a regex like C</foo/> is |
b23a565d | 364 | originally parsed into an alternation with a single branch. It is only |
4ccfbf60 | 365 | afterwards that the optimiser converts single branch alternations into the |
b23a565d RGS |
366 | simpler form. |
367 | ||
368 | =head3 Parse Call Graph and a Grammar | |
369 | ||
370 | The call graph looks like this: | |
371 | ||
2bdc80de KW |
372 | reg() # parse a top level regex, or inside of |
373 | # parens | |
374 | regbranch() # parse a single branch of an alternation | |
375 | regpiece() # parse a pattern followed by a quantifier | |
376 | regatom() # parse a simple pattern | |
377 | regclass() # used to handle a class | |
378 | reg() # used to handle a parenthesised | |
379 | # subpattern | |
380 | .... | |
381 | ... | |
382 | regtail() # finish off the branch | |
383 | ... | |
384 | regtail() # finish off the branch sequence. Tie each | |
385 | # branch's tail to the tail of the | |
386 | # sequence | |
387 | # (NEW) In Debug mode this is | |
388 | # regtail_study(). | |
b23a565d RGS |
389 | |
390 | A grammar form might be something like this: | |
391 | ||
392 | atom : constant | class | |
393 | quant : '*' | '+' | '?' | '{min,max}' | |
394 | _branch: piece | |
395 | | piece _branch | |
396 | | nothing | |
397 | branch: _branch | |
398 | | _branch '|' branch | |
399 | group : '(' branch ')' | |
400 | _piece: atom | group | |
401 | piece : _piece | |
402 | | _piece quant | |
403 | ||
a35e7505 NC |
404 | =head3 Parsing complications |
405 | ||
406 | The implication of the above description is that a pattern containing nested | |
407 | parentheses will result in a call graph which cycles through C<reg()>, | |
408 | C<regbranch()>, C<regpiece()>, C<regatom()>, C<reg()>, C<regbranch()> I<etc> | |
409 | multiple times, until the deepest level of nesting is reached. All the above | |
410 | routines return a pointer to a C<regnode>, which is usually the last regnode | |
411 | added to the program. However, one complication is that reg() returns NULL | |
412 | for parsing C<(?:)> syntax for embedded modifiers, setting the flag | |
413 | C<TRYAGAIN>. The C<TRYAGAIN> propagates upwards until it is captured, in | |
5bb4462f | 414 | some cases by C<regatom()>, but otherwise unconditionally by |
a35e7505 NC |
415 | C<regbranch()>. Hence it will never be returned by C<regbranch()> to |
416 | C<reg()>. This flag permits patterns such as C<(?i)+> to be detected as | |
417 | errors (I<Quantifier follows nothing in regex; marked by <-- HERE in m/(?i)+ | |
418 | <-- HERE />). | |
419 | ||
420 | Another complication is that the representation used for the program differs | |
421 | if it needs to store Unicode, but it's not always possible to know for sure | |
422 | whether it does until midway through parsing. The Unicode representation for | |
423 | the program is larger, and cannot be matched as efficiently. (See L</Unicode | |
424 | and Localisation Support> below for more details as to why.) If the pattern | |
425 | contains literal Unicode, it's obvious that the program needs to store | |
426 | Unicode. Otherwise, the parser optimistically assumes that the more | |
427 | efficient representation can be used, and starts sizing on this basis. | |
428 | However, if it then encounters something in the pattern which must be stored | |
429 | as Unicode, such as an C<\x{...}> escape sequence representing a character | |
430 | literal, then this means that all previously calculated sizes need to be | |
e3f228df KW |
431 | redone, using values appropriate for the Unicode representation. This |
432 | is another instance where the parsing needs to be restarted, and it can | |
433 | and is done immediately. The function returns failure, and sets the | |
434 | flag C<RESTART_UTF8> (encapsulated by using the macro C<REQUIRE_UTF8>). | |
435 | This restart request is propagated up the call chain in a similar | |
436 | fashion, until it is "caught" in C<Perl_re_op_compile()>, which marks | |
437 | the pattern as containing Unicode, and restarts the sizing pass. It is | |
438 | also possible for constructions within run-time code blocks to turn out | |
439 | to need Unicode representation., which is signalled by | |
440 | C<S_compile_runtime_code()> returning false to C<Perl_re_op_compile()>. | |
a35e7505 NC |
441 | |
442 | The restart was previously implemented using a C<longjmp> in C<regatom()> | |
443 | back to a C<setjmp> in C<Perl_re_op_compile()>, but this proved to be | |
444 | problematic as the latter is a large function containing many automatic | |
445 | variables, which interact badly with the emergent control flow of C<setjmp>. | |
446 | ||
b23a565d RGS |
447 | =head3 Debug Output |
448 | ||
e3f228df KW |
449 | Starting in the 5.9.x development version of perl you can C<< use re |
450 | Debug => 'PARSE' >> to see some trace information about the parse | |
451 | process. We will start with some simple patterns and build up to more | |
452 | complex patterns. | |
b23a565d RGS |
453 | |
454 | So when we parse C</foo/> we see something like the following table. The | |
4ccfbf60 | 455 | left shows what is being parsed, and the number indicates where the next regop |
b23a565d RGS |
456 | would go. The stuff on the right is the trace output of the graph. The |
457 | names are chosen to be short to make it less dense on the screen. 'tsdy' | |
458 | is a special form of C<regtail()> which does some extra analysis. | |
459 | ||
4ccfbf60 RGS |
460 | >foo< 1 reg |
461 | brnc | |
462 | piec | |
463 | atom | |
464 | >< 4 tsdy~ EXACT <foo> (EXACT) (1) | |
465 | ~ attach to END (3) offset to 2 | |
b23a565d RGS |
466 | |
467 | The resulting program then looks like: | |
468 | ||
469 | 1: EXACT <foo>(3) | |
470 | 3: END(0) | |
471 | ||
472 | As you can see, even though we parsed out a branch and a piece, it was ultimately | |
be8e71aa YO |
473 | only an atom. The final program shows us how things work. We have an C<EXACT> regop, |
474 | followed by an C<END> regop. The number in parens indicates where the C<regnext> of | |
475 | the node goes. The C<regnext> of an C<END> regop is unused, as C<END> regops mean | |
b23a565d RGS |
476 | we have successfully matched. The number on the left indicates the position of |
477 | the regop in the regnode array. | |
478 | ||
be8e71aa | 479 | Now let's try a harder pattern. We will add a quantifier, so now we have the pattern |
4ccfbf60 RGS |
480 | C</foo+/>. We will see that C<regbranch()> calls C<regpiece()> twice. |
481 | ||
482 | >foo+< 1 reg | |
483 | brnc | |
484 | piec | |
485 | atom | |
486 | >o+< 3 piec | |
487 | atom | |
488 | >< 6 tail~ EXACT <fo> (1) | |
489 | 7 tsdy~ EXACT <fo> (EXACT) (1) | |
490 | ~ PLUS (END) (3) | |
491 | ~ attach to END (6) offset to 3 | |
b23a565d RGS |
492 | |
493 | And we end up with the program: | |
494 | ||
495 | 1: EXACT <fo>(3) | |
496 | 3: PLUS(6) | |
497 | 4: EXACT <o>(0) | |
498 | 6: END(0) | |
499 | ||
be8e71aa | 500 | Now we have a special case. The C<EXACT> regop has a C<regnext> of 0. This is |
4ccfbf60 | 501 | because if it matches it should try to match itself again. The C<PLUS> regop |
be8e71aa | 502 | handles the actual failure of the C<EXACT> regop and acts appropriately (going |
4ccfbf60 | 503 | to regnode 6 if the C<EXACT> matched at least once, or failing if it didn't). |
b23a565d RGS |
504 | |
505 | Now for something much more complex: C</x(?:foo*|b[a][rR])(foo|bar)$/> | |
506 | ||
4ccfbf60 RGS |
507 | >x(?:foo*|b... 1 reg |
508 | brnc | |
509 | piec | |
510 | atom | |
511 | >(?:foo*|b[... 3 piec | |
512 | atom | |
513 | >?:foo*|b[a... reg | |
514 | >foo*|b[a][... brnc | |
b23a565d RGS |
515 | piec |
516 | atom | |
4ccfbf60 RGS |
517 | >o*|b[a][rR... 5 piec |
518 | atom | |
519 | >|b[a][rR])... 8 tail~ EXACT <fo> (3) | |
520 | >b[a][rR])(... 9 brnc | |
521 | 10 piec | |
522 | atom | |
523 | >[a][rR])(f... 12 piec | |
b23a565d | 524 | atom |
4ccfbf60 RGS |
525 | >a][rR])(fo... clas |
526 | >[rR])(foo|... 14 tail~ EXACT <b> (10) | |
b23a565d RGS |
527 | piec |
528 | atom | |
4ccfbf60 RGS |
529 | >rR])(foo|b... clas |
530 | >)(foo|bar)... 25 tail~ EXACT <a> (12) | |
531 | tail~ BRANCH (3) | |
532 | 26 tsdy~ BRANCH (END) (9) | |
533 | ~ attach to TAIL (25) offset to 16 | |
534 | tsdy~ EXACT <fo> (EXACT) (4) | |
535 | ~ STAR (END) (6) | |
536 | ~ attach to TAIL (25) offset to 19 | |
537 | tsdy~ EXACT <b> (EXACT) (10) | |
538 | ~ EXACT <a> (EXACT) (12) | |
539 | ~ ANYOF[Rr] (END) (14) | |
540 | ~ attach to TAIL (25) offset to 11 | |
541 | >(foo|bar)$< tail~ EXACT <x> (1) | |
542 | piec | |
543 | atom | |
544 | >foo|bar)$< reg | |
545 | 28 brnc | |
b23a565d RGS |
546 | piec |
547 | atom | |
4ccfbf60 RGS |
548 | >|bar)$< 31 tail~ OPEN1 (26) |
549 | >bar)$< brnc | |
550 | 32 piec | |
551 | atom | |
552 | >)$< 34 tail~ BRANCH (28) | |
553 | 36 tsdy~ BRANCH (END) (31) | |
2bdc80de | 554 | ~ attach to CLOSE1 (34) offset to 3 |
4ccfbf60 | 555 | tsdy~ EXACT <foo> (EXACT) (29) |
2bdc80de | 556 | ~ attach to CLOSE1 (34) offset to 5 |
4ccfbf60 | 557 | tsdy~ EXACT <bar> (EXACT) (32) |
2bdc80de | 558 | ~ attach to CLOSE1 (34) offset to 2 |
4ccfbf60 RGS |
559 | >$< tail~ BRANCH (3) |
560 | ~ BRANCH (9) | |
561 | ~ TAIL (25) | |
562 | piec | |
563 | atom | |
564 | >< 37 tail~ OPEN1 (26) | |
565 | ~ BRANCH (28) | |
566 | ~ BRANCH (31) | |
567 | ~ CLOSE1 (34) | |
568 | 38 tsdy~ EXACT <x> (EXACT) (1) | |
569 | ~ BRANCH (END) (3) | |
570 | ~ BRANCH (END) (9) | |
571 | ~ TAIL (END) (25) | |
572 | ~ OPEN1 (END) (26) | |
573 | ~ BRANCH (END) (28) | |
574 | ~ BRANCH (END) (31) | |
575 | ~ CLOSE1 (END) (34) | |
576 | ~ EOL (END) (36) | |
577 | ~ attach to END (37) offset to 1 | |
b23a565d RGS |
578 | |
579 | Resulting in the program | |
580 | ||
581 | 1: EXACT <x>(3) | |
582 | 3: BRANCH(9) | |
583 | 4: EXACT <fo>(6) | |
584 | 6: STAR(26) | |
585 | 7: EXACT <o>(0) | |
586 | 9: BRANCH(25) | |
587 | 10: EXACT <ba>(14) | |
588 | 12: OPTIMIZED (2 nodes) | |
589 | 14: ANYOF[Rr](26) | |
590 | 25: TAIL(26) | |
591 | 26: OPEN1(28) | |
592 | 28: TRIE-EXACT(34) | |
593 | [StS:1 Wds:2 Cs:6 Uq:5 #Sts:7 Mn:3 Mx:3 Stcls:bf] | |
594 | <foo> | |
595 | <bar> | |
596 | 30: OPTIMIZED (4 nodes) | |
597 | 34: CLOSE1(36) | |
598 | 36: EOL(37) | |
599 | 37: END(0) | |
600 | ||
601 | Here we can see a much more complex program, with various optimisations in | |
be8e71aa YO |
602 | play. At regnode 10 we see an example where a character class with only |
603 | one character in it was turned into an C<EXACT> node. We can also see where | |
604 | an entire alternation was turned into a C<TRIE-EXACT> node. As a consequence, | |
b23a565d | 605 | some of the regnodes have been marked as optimised away. We can see that |
be8e71aa YO |
606 | the C<$> symbol has been converted into an C<EOL> regop, a special piece of |
607 | code that looks for C<\n> or the end of the string. | |
b23a565d | 608 | |
be8e71aa | 609 | The next pointer for C<BRANCH>es is interesting in that it points at where |
edc977ff | 610 | execution should go if the branch fails. When executing, if the engine |
be8e71aa | 611 | tries to traverse from a branch to a C<regnext> that isn't a branch then |
edc977ff | 612 | the engine will know that the entire set of branches has failed. |
b23a565d RGS |
613 | |
614 | =head3 Peep-hole Optimisation and Analysis | |
615 | ||
616 | The regular expression engine can be a weighty tool to wield. On long | |
617 | strings and complex patterns it can end up having to do a lot of work | |
618 | to find a match, and even more to decide that no match is possible. | |
619 | Consider a situation like the following pattern. | |
620 | ||
621 | 'ababababababababababab' =~ /(a|b)*z/ | |
622 | ||
623 | The C<(a|b)*> part can match at every char in the string, and then fail | |
624 | every time because there is no C<z> in the string. So obviously we can | |
4ccfbf60 | 625 | avoid using the regex engine unless there is a C<z> in the string. |
b23a565d RGS |
626 | Likewise in a pattern like: |
627 | ||
628 | /foo(\w+)bar/ | |
629 | ||
630 | In this case we know that the string must contain a C<foo> which must be | |
4ccfbf60 | 631 | followed by C<bar>. We can use Fast Boyer-Moore matching as implemented |
be8e71aa YO |
632 | in C<fbm_instr()> to find the location of these strings. If they don't exist |
633 | then we don't need to resort to the much more expensive regex engine. | |
634 | Even better, if they do exist then we can use their positions to | |
b23a565d | 635 | reduce the search space that the regex engine needs to cover to determine |
be8e71aa | 636 | if the entire pattern matches. |
b23a565d RGS |
637 | |
638 | There are various aspects of the pattern that can be used to facilitate | |
639 | optimisations along these lines: | |
640 | ||
4ccfbf60 RGS |
641 | =over 5 |
642 | ||
643 | =item * anchored fixed strings | |
644 | ||
645 | =item * floating fixed strings | |
646 | ||
647 | =item * minimum and maximum length requirements | |
648 | ||
649 | =item * start class | |
650 | ||
651 | =item * Beginning/End of line positions | |
652 | ||
653 | =back | |
b23a565d | 654 | |
edc977ff MH |
655 | Another form of optimisation that can occur is the post-parse "peep-hole" |
656 | optimisation, where inefficient constructs are replaced by more efficient | |
657 | constructs. The C<TAIL> regops which are used during parsing to mark the end | |
658 | of branches and the end of groups are examples of this. These regops are used | |
659 | as place-holders during construction and "always match" so they can be | |
660 | "optimised away" by making the things that point to the C<TAIL> point to the | |
661 | thing that C<TAIL> points to, thus "skipping" the node. | |
b23a565d | 662 | |
be8e71aa YO |
663 | Another optimisation that can occur is that of "C<EXACT> merging" which is |
664 | where two consecutive C<EXACT> nodes are merged into a single | |
4ccfbf60 RGS |
665 | regop. An even more aggressive form of this is that a branch |
666 | sequence of the form C<EXACT BRANCH ... EXACT> can be converted into a | |
be8e71aa | 667 | C<TRIE-EXACT> regop. |
b23a565d | 668 | |
be8e71aa YO |
669 | All of this occurs in the routine C<study_chunk()> which uses a special |
670 | structure C<scan_data_t> to store the analysis that it has performed, and | |
671 | does the "peep-hole" optimisations as it goes. | |
b23a565d | 672 | |
be8e71aa | 673 | The code involved in C<study_chunk()> is extremely cryptic. Be careful. :-) |
b23a565d RGS |
674 | |
675 | =head2 Execution | |
676 | ||
677 | Execution of a regex generally involves two phases, the first being | |
678 | finding the start point in the string where we should match from, | |
679 | and the second being running the regop interpreter. | |
680 | ||
be8e71aa | 681 | If we can tell that there is no valid start point then we don't bother running |
bd650281 | 682 | the interpreter at all. Likewise, if we know from the analysis phase that we |
be8e71aa | 683 | cannot detect a short-cut to the start position, we go straight to the |
b23a565d RGS |
684 | interpreter. |
685 | ||
be8e71aa | 686 | The two entry points are C<re_intuit_start()> and C<pregexec()>. These routines |
b23a565d | 687 | have a somewhat incestuous relationship with overlap between their functions, |
be8e71aa | 688 | and C<pregexec()> may even call C<re_intuit_start()> on its own. Nevertheless |
e1020413 | 689 | other parts of the perl source code may call into either, or both. |
b23a565d | 690 | |
edc977ff MH |
691 | Execution of the interpreter itself used to be recursive, but thanks to the |
692 | efforts of Dave Mitchell in the 5.9.x development track, that has changed: now an | |
b23a565d RGS |
693 | internal stack is maintained on the heap and the routine is fully |
694 | iterative. This can make it tricky as the code is quite conservative | |
e1020413 | 695 | about what state it stores, with the result that two consecutive lines in the |
b23a565d RGS |
696 | code can actually be running in totally different contexts due to the |
697 | simulated recursion. | |
698 | ||
699 | =head3 Start position and no-match optimisations | |
700 | ||
4ccfbf60 | 701 | C<re_intuit_start()> is responsible for handling start points and no-match |
b23a565d | 702 | optimisations as determined by the results of the analysis done by |
5a0de581 | 703 | C<study_chunk()> (and described in L</Peep-hole Optimisation and Analysis>). |
b23a565d | 704 | |
4ccfbf60 RGS |
705 | The basic structure of this routine is to try to find the start- and/or |
706 | end-points of where the pattern could match, and to ensure that the string | |
707 | is long enough to match the pattern. It tries to use more efficient | |
708 | methods over less efficient methods and may involve considerable | |
709 | cross-checking of constraints to find the place in the string that matches. | |
b23a565d RGS |
710 | For instance it may try to determine that a given fixed string must be |
711 | not only present but a certain number of chars before the end of the | |
712 | string, or whatever. | |
713 | ||
be8e71aa | 714 | It calls several other routines, such as C<fbm_instr()> which does |
4ccfbf60 | 715 | Fast Boyer Moore matching and C<find_byclass()> which is responsible for |
b23a565d RGS |
716 | finding the start using the first mandatory regop in the program. |
717 | ||
4ccfbf60 | 718 | When the optimisation criteria have been satisfied, C<reg_try()> is called |
b23a565d RGS |
719 | to perform the match. |
720 | ||
721 | =head3 Program execution | |
722 | ||
723 | C<pregexec()> is the main entry point for running a regex. It contains | |
4ccfbf60 RGS |
724 | support for initialising the regex interpreter's state, running |
725 | C<re_intuit_start()> if needed, and running the interpreter on the string | |
726 | from various start positions as needed. When it is necessary to use | |
b23a565d RGS |
727 | the regex interpreter C<pregexec()> calls C<regtry()>. |
728 | ||
729 | C<regtry()> is the entry point into the regex interpreter. It expects | |
be8e71aa | 730 | as arguments a pointer to a C<regmatch_info> structure and a pointer to |
b23a565d | 731 | a string. It returns an integer 1 for success and a 0 for failure. |
4ccfbf60 | 732 | It is basically a set-up wrapper around C<regmatch()>. |
b23a565d RGS |
733 | |
734 | C<regmatch> is the main "recursive loop" of the interpreter. It is | |
e3950ac3 RGS |
735 | basically a giant switch statement that implements a state machine, where |
736 | the possible states are the regops themselves, plus a number of additional | |
737 | intermediate and failure states. A few of the states are implemented as | |
738 | subroutines but the bulk are inline code. | |
b23a565d RGS |
739 | |
740 | =head1 MISCELLANEOUS | |
741 | ||
4ccfbf60 RGS |
742 | =head2 Unicode and Localisation Support |
743 | ||
744 | When dealing with strings containing characters that cannot be represented | |
9af228c6 | 745 | using an eight-bit character set, perl uses an internal representation |
4ccfbf60 | 746 | that is a permissive version of Unicode's UTF-8 encoding[2]. This uses single |
9af228c6 | 747 | bytes to represent characters from the ASCII character set, and sequences |
4ccfbf60 RGS |
748 | of two or more bytes for all other characters. (See L<perlunitut> |
749 | for more information about the relationship between UTF-8 and perl's | |
ac036724 | 750 | encoding, utf8. The difference isn't important for this discussion.) |
b23a565d | 751 | |
be8e71aa | 752 | No matter how you look at it, Unicode support is going to be a pain in a |
b23a565d | 753 | regex engine. Tricks that might be fine when you have 256 possible |
be8e71aa | 754 | characters often won't scale to handle the size of the UTF-8 character |
b23a565d | 755 | set. Things you can take for granted with ASCII may not be true with |
4ccfbf60 | 756 | Unicode. For instance, in ASCII, it is safe to assume that |
be8e71aa YO |
757 | C<sizeof(char1) == sizeof(char2)>, but in UTF-8 it isn't. Unicode case folding is |
758 | vastly more complex than the simple rules of ASCII, and even when not | |
4ccfbf60 | 759 | using Unicode but only localised single byte encodings, things can get |
d38f6844 AB |
760 | tricky (for example, B<LATIN SMALL LETTER SHARP S> (U+00DF, E<szlig>) |
761 | should match 'SS' in localised case-insensitive matching). | |
be8e71aa YO |
762 | |
763 | Making things worse is that UTF-8 support was a later addition to the | |
764 | regex engine (as it was to perl) and this necessarily made things a lot | |
765 | more complicated. Obviously it is easier to design a regex engine with | |
766 | Unicode support in mind from the beginning than it is to retrofit it to | |
767 | one that wasn't. | |
768 | ||
4ccfbf60 | 769 | Nearly all regops that involve looking at the input string have |
be8e71aa YO |
770 | two cases, one for UTF-8, and one not. In fact, it's often more complex |
771 | than that, as the pattern may be UTF-8 as well. | |
b23a565d RGS |
772 | |
773 | Care must be taken when making changes to make sure that you handle | |
be8e71aa | 774 | UTF-8 properly, both at compile time and at execution time, including |
b23a565d RGS |
775 | when the string and pattern are mismatched. |
776 | ||
f8149455 | 777 | =head2 Base Structures |
be8e71aa | 778 | |
108003db | 779 | The C<regexp> structure described in L<perlreapi> is common to all |
bd650281 | 780 | regex engines. Two of its fields are intended for the private use |
108003db RGS |
781 | of the regex engine that compiled the pattern. These are the |
782 | C<intflags> and pprivate members. The C<pprivate> is a void pointer to | |
783 | an arbitrary structure whose use and management is the responsibility | |
784 | of the compiling engine. perl will never modify either of these | |
785 | values. In the case of the stock engine the structure pointed to by | |
786 | C<pprivate> is called C<regexp_internal>. | |
787 | ||
788 | Its C<pprivate> and C<intflags> fields contain data | |
789 | specific to each engine. | |
790 | ||
f8149455 | 791 | There are two structures used to store a compiled regular expression. |
108003db RGS |
792 | One, the C<regexp> structure described in L<perlreapi> is populated by |
793 | the engine currently being. used and some of its fields read by perl to | |
794 | implement things such as the stringification of C<qr//>. | |
795 | ||
796 | ||
bd650281 | 797 | The other structure is pointed to by the C<regexp> struct's |
108003db RGS |
798 | C<pprivate> and is in addition to C<intflags> in the same struct |
799 | considered to be the property of the regex engine which compiled the | |
2bdc80de | 800 | regular expression; |
be8e71aa | 801 | |
f8149455 YO |
802 | The regexp structure contains all the data that perl needs to be aware of |
803 | to properly work with the regular expression. It includes data about | |
804 | optimisations that perl can use to determine if the regex engine should | |
805 | really be used, and various other control info that is needed to properly | |
806 | execute patterns in various contexts such as is the pattern anchored in | |
807 | some way, or what flags were used during the compile, or whether the | |
808 | program contains special constructs that perl needs to be aware of. | |
9af228c6 | 809 | |
f8149455 YO |
810 | In addition it contains two fields that are intended for the private use |
811 | of the regex engine that compiled the pattern. These are the C<intflags> | |
812 | and pprivate members. The C<pprivate> is a void pointer to an arbitrary | |
813 | structure whose use and management is the responsibility of the compiling | |
814 | engine. perl will never modify either of these values. | |
9af228c6 | 815 | |
f8149455 YO |
816 | As mentioned earlier, in the case of the default engines, the C<pprivate> |
817 | will be a pointer to a regexp_internal structure which holds the compiled | |
818 | program and any additional data that is private to the regex engine | |
819 | implementation. | |
9af228c6 | 820 | |
108003db | 821 | =head3 Perl's C<pprivate> structure |
f8149455 | 822 | |
108003db RGS |
823 | The following structure is used as the C<pprivate> struct by perl's |
824 | regex engine. Since it is specific to perl it is only of curiosity | |
825 | value to other engine implementations. | |
f8149455 | 826 | |
2bdc80de | 827 | typedef struct regexp_internal { |
2bdc80de KW |
828 | U32 *offsets; /* offset annotations 20001228 MJD |
829 | * data about mapping the program to | |
830 | * the string*/ | |
831 | regnode *regstclass; /* Optional startclass as identified or | |
832 | * constructed by the optimiser */ | |
833 | struct reg_data *data; /* Additional miscellaneous data used | |
834 | * by the program. Used to make it | |
835 | * easier to clone and free arbitrary | |
836 | * data that the regops need. Often the | |
837 | * ARG field of a regop is an index | |
838 | * into this structure */ | |
839 | regnode program[1]; /* Unwarranted chumminess with | |
840 | * compiler. */ | |
841 | } regexp_internal; | |
f8149455 YO |
842 | |
843 | =over 5 | |
844 | ||
f8149455 YO |
845 | =item C<offsets> |
846 | ||
847 | Offsets holds a mapping of offset in the C<program> | |
edc977ff | 848 | to offset in the C<precomp> string. This is only used by ActiveState's |
f8149455 YO |
849 | visual regex debugger. |
850 | ||
851 | =item C<regstclass> | |
852 | ||
853 | Special regop that is used by C<re_intuit_start()> to check if a pattern | |
854 | can match at a certain position. For instance if the regex engine knows | |
855 | that the pattern must start with a 'Z' then it can scan the string until | |
856 | it finds one and then launch the regex engine from there. The routine | |
857 | that handles this is called C<find_by_class()>. Sometimes this field | |
858 | points at a regop embedded in the program, and sometimes it points at | |
859 | an independent synthetic regop that has been constructed by the optimiser. | |
860 | ||
861 | =item C<data> | |
862 | ||
bd650281 | 863 | This field points at a C<reg_data> structure, which is defined as follows |
f8149455 YO |
864 | |
865 | struct reg_data { | |
866 | U32 count; | |
867 | U8 *what; | |
868 | void* data[1]; | |
869 | }; | |
870 | ||
871 | This structure is used for handling data structures that the regex engine | |
872 | needs to handle specially during a clone or free operation on the compiled | |
873 | product. Each element in the data array has a corresponding element in the | |
874 | what array. During compilation regops that need special structures stored | |
875 | will add an element to each array using the add_data() routine and then store | |
876 | the index in the regop. | |
877 | ||
878 | =item C<program> | |
879 | ||
880 | Compiled program. Inlined into the structure so the entire struct can be | |
881 | treated as a single blob. | |
4ccfbf60 RGS |
882 | |
883 | =back | |
884 | ||
4ccfbf60 RGS |
885 | =head1 SEE ALSO |
886 | ||
108003db RGS |
887 | L<perlreapi> |
888 | ||
4ccfbf60 RGS |
889 | L<perlre> |
890 | ||
891 | L<perlunitut> | |
be8e71aa | 892 | |
b23a565d RGS |
893 | =head1 AUTHOR |
894 | ||
895 | by Yves Orton, 2006. | |
896 | ||
897 | With excerpts from Perl, and contributions and suggestions from | |
898 | Ronald J. Kimball, Dave Mitchell, Dominic Dunlop, Mark Jason Dominus, | |
be8e71aa | 899 | Stephen McCamant, and David Landgren. |
b23a565d | 900 | |
e3f228df KW |
901 | Now maintained by Perl 5 Porters. |
902 | ||
4ccfbf60 | 903 | =head1 LICENCE |
b23a565d RGS |
904 | |
905 | Same terms as Perl. | |
906 | ||
907 | =head1 REFERENCES | |
908 | ||
71c89d21 | 909 | [1] L<https://perl.plover.com/Rx/paper/> |
4ccfbf60 | 910 | |
71c89d21 | 911 | [2] L<https://www.unicode.org/> |
b23a565d RGS |
912 | |
913 | =cut |