Commit | Line | Data |
---|---|---|
108003db RGS |
1 | =head1 NAME |
2 | ||
3 | perlreapi - perl regular expression plugin interface | |
4 | ||
5 | =head1 DESCRIPTION | |
6 | ||
a0e97681 RGS |
7 | As of Perl 5.9.5 there is a new interface for plugging in and using |
8 | regular expression engines other than the default one. | |
9 | ||
10 | Each engine is supposed to provide access to a constant structure of the | |
11 | following format: | |
108003db RGS |
12 | |
13 | typedef struct regexp_engine { | |
3ab4a224 | 14 | REGEXP* (*comp) (pTHX_ const SV * const pattern, const U32 flags); |
49d7dfbc | 15 | I32 (*exec) (pTHX_ REGEXP * const rx, char* stringarg, char* strend, |
2fdbfb4d AB |
16 | char* strbeg, I32 minend, SV* screamer, |
17 | void* data, U32 flags); | |
49d7dfbc | 18 | char* (*intuit) (pTHX_ REGEXP * const rx, SV *sv, char *strpos, |
2fdbfb4d AB |
19 | char *strend, U32 flags, |
20 | struct re_scream_pos_data_s *data); | |
49d7dfbc AB |
21 | SV* (*checkstr) (pTHX_ REGEXP * const rx); |
22 | void (*free) (pTHX_ REGEXP * const rx); | |
2fdbfb4d AB |
23 | void (*numbered_buff_FETCH) (pTHX_ REGEXP * const rx, const I32 paren, |
24 | SV * const sv); | |
25 | void (*numbered_buff_STORE) (pTHX_ REGEXP * const rx, const I32 paren, | |
26 | SV const * const value); | |
27 | I32 (*numbered_buff_LENGTH) (pTHX_ REGEXP * const rx, const SV * const sv, | |
28 | const I32 paren); | |
192b9cd1 AB |
29 | SV* (*named_buff) (pTHX_ REGEXP * const rx, SV * const key, |
30 | SV * const value, U32 flags); | |
31 | SV* (*named_buff_iter) (pTHX_ REGEXP * const rx, const SV * const lastkey, | |
32 | const U32 flags); | |
49d7dfbc | 33 | SV* (*qr_package)(pTHX_ REGEXP * const rx); |
108003db | 34 | #ifdef USE_ITHREADS |
49d7dfbc | 35 | void* (*dupe) (pTHX_ REGEXP * const rx, CLONE_PARAMS *param); |
108003db | 36 | #endif
| | } regexp_engine; |
108003db RGS |
37 | |
38 | When a regexp is compiled, its C<engine> field is then set to point at | |
a0e97681 | 39 | the appropriate structure, so that when it needs to be used Perl can find |
108003db RGS |
40 | the right routines to do so. |
41 | ||
42 | In order to install a new regexp handler, C<$^H{regcomp}> is set | |
43 | to an integer which (when cast appropriately) resolves to one of these | |
44 | structures. When compiling, the C<comp> method is executed, and the | |
45 | resulting regexp structure's engine field is expected to point back at | |
46 | the same structure. | |
47 | ||
48 | The pTHX_ symbol in the definition is a macro used by perl under threading | |
49 | to provide an extra argument to the routine holding a pointer back to | |
50 | the interpreter that is executing the regexp. So under threading all | |
51 | routines get an extra argument. | |
52 | ||
882227b7 | 53 | =head1 Callbacks |
108003db RGS |
54 | |
55 | =head2 comp | |
56 | ||
3ab4a224 | 57 | REGEXP* comp(pTHX_ const SV * const pattern, const U32 flags); |
108003db | 58 | |
3ab4a224 AB |
59 | Compile the pattern stored in C<pattern> using the given C<flags> and |
60 | return a pointer to a prepared C<REGEXP> structure that can perform | |
61 | the match. See L</The REGEXP structure> below for an explanation of | |
62 | the individual fields in the REGEXP struct. | |
63 | ||
64 | The C<pattern> parameter is the scalar that was used as the | |
65 | pattern. Previous versions of perl would pass two C<char*> indicating | |
a0e97681 | 66 | the start and end of the stringified pattern; the following snippet can |
3ab4a224 AB |
67 | be used to get the old parameters: |
68 | ||
69 | STRLEN plen; | |
70 | char* exp = SvPV(pattern, plen); | |
71 | char* xend = exp + plen; | |
72 | ||
73 | Since any scalar can be passed as a pattern it's possible to implement | |
74 | an engine that does something with an array (C<< "ook" =~ [ qw/ eek | |
75 | hlagh / ] >>) or with the non-stringified form of a compiled regular | |
76 | expression (C<< "ook" =~ qr/eek/ >>). perl's own engine will always | |
77 | stringify everything using the snippet above but that doesn't mean | |
78 | other engines have to. | |
108003db | 79 | |
a0e97681 | 80 | The C<flags> parameter is a bitfield which indicates which of the |
c998b245 AB |
81 | C<msixp> flags the regex was compiled with. It also contains |
82 | additional info such as whether C<use locale> is in effect. | |
108003db RGS |
83 | |
84 | The C<eogc> flags are stripped out before being passed to the comp | |
85 | routine. The regex engine does not need to know whether any of these | |
3ab4a224 | 86 | are set as those flags should only affect what perl does with the |
c998b245 AB |
87 | pattern and its match variables, not how it gets compiled and |
88 | executed. | |
108003db | 89 | |
c998b245 AB |
90 | By the time the comp callback is called, some of these flags have |
91 | already had effect (noted below where applicable). However most of | |
92 | their effect occurs after the comp callback has run in routines that | |
93 | read the C<< rx->extflags >> field which it populates. | |
108003db | 94 | |
c998b245 AB |
95 | In general the flags should be preserved in C<< rx->extflags >> after |
96 | compilation, although the regex engine might want to add or delete | |
97 | some of them to invoke or disable some special behavior in perl. The | |
98 | flags along with any special behavior they cause are documented below: | |
108003db | 99 | |
c998b245 | 100 | The pattern modifiers: |
108003db | 101 | |
c998b245 | 102 | =over 4 |
108003db | 103 | |
c998b245 | 104 | =item C</m> - RXf_PMf_MULTILINE |
108003db | 105 | |
c998b245 AB |
106 | If this is in C<< rx->extflags >> it will be passed to |
107 | C<Perl_fbm_instr> by C<pp_split> which will treat the subject string | |
108 | as a multi-line string. | |
108003db | 109 | |
c998b245 | 110 | =item C</s> - RXf_PMf_SINGLELINE |
108003db | 111 | |
c998b245 | 112 | =item C</i> - RXf_PMf_FOLD |
108003db | 113 | |
c998b245 | 114 | =item C</x> - RXf_PMf_EXTENDED |
108003db | 115 | |
c998b245 AB |
116 | If present on a regex C<#> comments will be handled differently by the |
117 | tokenizer in some cases. | |
108003db | 118 | |
c998b245 | 119 | TODO: Document those cases. |
108003db | 120 | |
c998b245 | 121 | =item C</p> - RXf_PMf_KEEPCOPY |
108003db | 122 | |
e72ec78c KW |
123 | TODO: Document this |
124 | ||
f0f9b3b8 KW |
125 | =item Character set |
126 | ||
127 | The character set semantics are determined by an enum that is contained | |
128 | in this field. This is still experimental and subject to change, but | |
129 | the current interface returns the rules by use of the in-line function | |
130 | C<get_regex_charset(const U32 flags)>. The only currently documented | |
131 | value returned from it is REGEX_LOCALE_CHARSET, which is set if | |
e72ec78c KW |
132 | C<use locale> is in effect. If present in C<< rx->extflags >>, |
133 | C<split> will use the locale dependent definition of whitespace | |
134 | when RXf_SKIPWHITE or RXf_WHITE is in effect. ASCII whitespace | |
96090e4f | 135 | is defined as per L<isSPACE|perlapi/isSPACE>, and by the internal |
e72ec78c | 136 | macros C<is_utf8_space> under UTF-8, and C<isSPACE_LC> under C<use |
c998b245 | 137 | locale>. |
108003db | 138 | |
f0f9b3b8 KW |
139 | =back |
140 | ||
141 | Additional flags: | |
142 | ||
143 | =over 4 | |
144 | ||
108003db RGS |
145 | =item RXf_UTF8 |
146 | ||
147 | Set if the pattern is L<SvUTF8()|perlapi/SvUTF8>, set by Perl_pmruntime. | |
148 | ||
c998b245 AB |
149 | A regex engine may want to set or disable this flag during |
150 | compilation. The perl engine for instance may upgrade non-UTF-8 | |
151 | strings to UTF-8 if the pattern includes constructs such as C<\x{...}> | |
152 | that can only match Unicode values. | |
153 | ||
0ac6acae AB |
154 | =item RXf_SPLIT |
155 | ||
156 | If C<split> is invoked as C<split ' '> or with no arguments (which | |
5137fa37 | 157 | really means C<split(' ', $_)>, see L<split|perlfunc/split>), perl will |
0ac6acae AB |
158 | set this flag. The regex engine can then check for it and set the |
159 | SKIPWHITE and WHITE extflags. To do this the perl engine does: | |
160 | ||
161 | if (flags & RXf_SPLIT && r->prelen == 1 && r->precomp[0] == ' ') | |
162 | r->extflags |= (RXf_SKIPWHITE|RXf_WHITE); | |
163 | ||
108003db RGS |
164 | =back |
165 | ||
c998b245 AB |
166 | These flags can be set during compilation to enable optimizations in |
167 | the C<split> operator. | |
168 | ||
169 | =over 4 | |
170 | ||
0ac6acae AB |
171 | =item RXf_SKIPWHITE |
172 | ||
173 | If the flag is present in C<< rx->extflags >> C<split> will delete | |
174 | whitespace from the start of the subject string before it's operated | |
175 | on. What is considered whitespace depends on whether the subject is a | |
176 | UTF-8 string and whether the C<RXf_PMf_LOCALE> flag is set. | |
177 | ||
178 | If RXf_WHITE is set in addition to this flag C<split> will behave like | |
179 | C<split " "> under the perl engine. | |
180 | ||
c998b245 AB |
181 | =item RXf_START_ONLY |
182 | ||
183 | Tells the split operator to split the target string on newlines | |
184 | (C<\n>) without invoking the regex engine. | |
185 | ||
186 | Perl's engine sets this if the pattern is C</^/> (C<plen == 1 && *exp | |
187 | == '^'>), even under C</^/s>, see L<split|perlfunc/split>. Of course, a | |
188 | different regex engine might want to use the same optimizations | |
189 | with a different syntax. | |
190 | ||
191 | =item RXf_WHITE | |
192 | ||
193 | Tells the split operator to split the target string on whitespace | |
194 | without invoking the regex engine. The definition of whitespace varies | |
195 | depending on whether the target string is a UTF-8 string and on | |
196 | whether RXf_PMf_LOCALE is set. | |
197 | ||
0ac6acae | 198 | Perl's engine sets this flag if the pattern is C<\s+>. |
c998b245 | 199 | |
640f820d AB |
200 | =item RXf_NULL |
201 | ||
a0e97681 | 202 | Tells the split operator to split the target string on |
640f820d AB |
203 | characters. The definition of character varies depending on whether |
204 | the target string is a UTF-8 string. | |
205 | ||
206 | Perl's engine sets this flag on empty patterns; this optimization | |
a0e97681 | 207 | makes C<split //> much faster than it would otherwise be. It's even |
640f820d AB |
208 | faster than C<unpack>. |
209 | ||
c998b245 | 210 | =back |
108003db RGS |
211 | |
212 | =head2 exec | |
213 | ||
49d7dfbc | 214 | I32 exec(pTHX_ REGEXP * const rx, |
108003db RGS |
215 | char *stringarg, char* strend, char* strbeg, |
216 | I32 minend, SV* screamer, | |
217 | void* data, U32 flags); | |
218 | ||
219 | Execute a regexp. | |
220 | ||
221 | =head2 intuit | |
222 | ||
49d7dfbc | 223 | char* intuit(pTHX_ REGEXP * const rx, |
108003db | 224 | SV *sv, char *strpos, char *strend, |
49d7dfbc | 225 | const U32 flags, struct re_scream_pos_data_s *data); |
108003db RGS |
226 | |
227 | Find the start position where a regex match should be attempted, | |
228 | or possibly whether the regex engine should not be run because the | |
229 | pattern can't match. This is called as appropriate by the core | |
230 | depending on the values of the extflags member of the regexp | |
231 | structure. | |
232 | ||
233 | =head2 checkstr | |
234 | ||
49d7dfbc | 235 | SV* checkstr(pTHX_ REGEXP * const rx); |
108003db RGS |
236 | |
237 | Return a SV containing a string that must appear in the pattern. Used | |
238 | by C<split> for optimising matches. | |
239 | ||
240 | =head2 free | |
241 | ||
49d7dfbc | 242 | void free(pTHX_ REGEXP * const rx); |
108003db RGS |
243 | |
244 | Called by perl when it is freeing a regexp pattern so that the engine | |
245 | can release any resources pointed to by the C<pprivate> member of the | |
246 | regexp structure. This is only responsible for freeing private data; | |
247 | perl will handle releasing anything else contained in the regexp structure. | |
248 | ||
192b9cd1 | 249 | =head2 Numbered capture callbacks |
108003db | 250 | |
192b9cd1 AB |
251 | Called to get/set the value of C<$`>, C<$'>, C<$&> and their named |
252 | equivalents, ${^PREMATCH}, ${^POSTMATCH} and ${^MATCH}, as well as the | |
c27a5cfe | 253 | numbered capture groups (C<$1>, C<$2>, ...). |
49d7dfbc | 254 | |
a0e97681 | 255 | The C<paren> parameter will be C<-2> for C<$`>, C<-1> for C<$'>, C<0> |
49d7dfbc AB |
256 | for C<$&>, C<1> for C<$1> and so forth. |
257 | ||
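The C<paren> numbering above can be sketched in plain C. This is a minimal illustration, not perl's implementation: C<paren_pair> is a stand-in for C<regexp_paren_pair>, the C<IDX_*> names and the function C<capture_range> are hypothetical, and the offsets are assumed to be byte offsets into a subject of C<sublen> bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for perl's regexp_paren_pair; not the real header. */
typedef struct { int start; int end; } paren_pair;

/* Hypothetical names for the special paren values described in the text. */
enum { IDX_PREMATCH = -2, IDX_POSTMATCH = -1, IDX_FULLMATCH = 0 };

/*
 * Resolve `paren` to a half-open byte range [*from, *to) of the subject.
 * Returns 1 on success, 0 if the requested capture is undefined.
 */
int
capture_range(const paren_pair *offs, int nparens, int sublen,
              int paren, int *from, int *to)
{
    assert(offs != NULL && from != NULL && to != NULL);

    /* Nothing is defined unless the overall match succeeded. */
    if (offs[0].start == -1 || offs[0].end == -1)
        return 0;

    if (paren == IDX_PREMATCH) {                  /* $` */
        *from = 0;
        *to   = offs[0].start;
    } else if (paren == IDX_POSTMATCH) {          /* $' */
        *from = offs[0].end;
        *to   = sublen;
    } else if (paren >= IDX_FULLMATCH && paren <= nparens) {
        /* $& (paren 0), $1, $2, ... */
        if (offs[paren].start == -1 || offs[paren].end == -1)
            return 0;                             /* group did not match */
        *from = offs[paren].start;
        *to   = offs[paren].end;
    } else {
        return 0;                                 /* no such capture */
    }
    return 1;
}
```

A FETCH callback in a custom engine could use a helper of this shape to decide which slice of the saved subject to copy into C<sv>.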
192b9cd1 AB |
258 | The names have been chosen by analogy with L<Tie::Scalar> method |
259 | names, with an additional B<LENGTH> callback for efficiency. However, | |
260 | numbered capture variables are currently not tied internally but | |
261 | implemented via magic. | |
262 | ||
263 | =head3 numbered_buff_FETCH | |
264 | ||
265 | void numbered_buff_FETCH(pTHX_ REGEXP * const rx, const I32 paren, | |
266 | SV * const sv); | |
267 | ||
268 | Fetch a specified numbered capture. C<sv> should be set to the scalar | |
269 | to return. The scalar is passed as an argument rather than being | |
270 | returned from the function because perl already has a scalar to | |
271 | store the value when the callback is invoked, so creating another one would be | |
272 | redundant. The scalar can be set with C<sv_setsv>, C<sv_setpvn> and | |
273 | friends; see L<perlapi>. | |
49d7dfbc AB |
274 | |
275 | This callback is where perl untaints its own capture variables under | |
c998b245 | 276 | taint mode (see L<perlsec>). See the C<Perl_reg_numbered_buff_fetch> |
49d7dfbc AB |
277 | function in F<regcomp.c> for how to untaint capture variables if |
278 | that's something you'd like your engine to do as well. | |
108003db | 279 | |
192b9cd1 | 280 | =head3 numbered_buff_STORE |
108003db | 281 | |
2fdbfb4d AB |
282 | void (*numbered_buff_STORE) (pTHX_ REGEXP * const rx, const I32 paren, |
283 | SV const * const value); | |
108003db | 284 | |
192b9cd1 AB |
285 | Set the value of a numbered capture variable. C<value> is the scalar |
286 | that is to be used as the new value. It's up to the engine to make | |
287 | sure this is used as the new value (or reject it). | |
2fdbfb4d AB |
288 | |
289 | Example: | |
290 | ||
291 | if ("ook" =~ /(o*)/) { | |
ccf3535a | 292 | # 'paren' will be '1' and 'value' will be 'ee' |
2fdbfb4d AB |
293 | $1 =~ tr/o/e/; |
294 | } | |
295 | ||
296 | Perl's own engine will croak on any attempt to modify the capture | |
a0e97681 | 297 | variables; to do this in another engine use the following callback |
2fdbfb4d AB |
298 | (copied from C<Perl_reg_numbered_buff_store>): |
299 | ||
300 | void | |
301 | Example_reg_numbered_buff_store(pTHX_ REGEXP * const rx, const I32 paren, | |
302 | SV const * const value) | |
303 | { | |
304 | PERL_UNUSED_ARG(rx); | |
305 | PERL_UNUSED_ARG(paren); | |
306 | PERL_UNUSED_ARG(value); | |
307 | ||
308 | if (!PL_localizing) | |
309 | Perl_croak(aTHX_ PL_no_modify); | |
310 | } | |
311 | ||
99d59c4d | 312 | Actually perl will not I<always> croak in a statement that looks |
2fdbfb4d AB |
313 | like it would modify a numbered capture variable. This is because the |
314 | STORE callback will not be called if perl can determine that it | |
315 | doesn't have to modify the value. This is exactly how tied variables | |
316 | behave in the same situation: | |
317 | ||
318 | package CaptureVar; | |
319 | use base 'Tie::Scalar'; | |
320 | ||
321 | sub TIESCALAR { bless [] } | |
322 | sub FETCH { undef } | |
323 | sub STORE { die "This doesn't get called" } | |
324 | ||
325 | package main; | |
326 | ||
c69ca1d4 | 327 | tie my $sv => "CaptureVar"; |
2fdbfb4d AB |
328 | $sv =~ y/a/b/; |
329 | ||
330 | Because C<$sv> is C<undef> when the C<y///> operator is applied to it | |
331 | the transliteration won't actually execute and the program won't | |
192b9cd1 AB |
332 | C<die>. This differs from how 5.8 and earlier versions behaved, |
333 | since the capture variables were READONLY variables then; now they'll | |
334 | just die when assigned to in the default engine. | |
2fdbfb4d | 335 | |
192b9cd1 | 336 | =head3 numbered_buff_LENGTH |
2fdbfb4d AB |
337 | |
338 | I32 numbered_buff_LENGTH (pTHX_ REGEXP * const rx, const SV * const sv, | |
339 | const I32 paren); | |
340 | ||
341 | Get the C<length> of a capture variable. There's a special callback | |
342 | for this so that perl doesn't have to do a FETCH and run C<length> on | |
192b9cd1 | 343 | the result. Since the length is (in perl's case) known from the offsets
0a3a8dc0 | 344 | stored in C<< rx->offs >>, this is much more efficient:
2fdbfb4d AB |
345 | |
346 | I32 s1 = rx->offs[paren].start; | |
347 | I32 s2 = rx->offs[paren].end; | |
348 | I32 len = s2 - s1; | |
349 | ||
350 | This is a little bit more complex in the case of UTF-8, see what | |
351 | C<Perl_reg_numbered_buff_length> does with | |
352 | L<is_utf8_string_loclen|perlapi/is_utf8_string_loclen>. | |
353 | ||
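The byte-versus-character distinction mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions, not what C<Perl_reg_numbered_buff_length> does: C<paren_pair> stands in for C<regexp_paren_pair>, the functions C<utf8_char_count> and C<capture_length> are hypothetical, and the subject is assumed to be well-formed UTF-8 when C<is_utf8> is true.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for perl's regexp_paren_pair; not the real header. */
typedef struct { int start; int end; } paren_pair;

/*
 * Count the characters in nbytes of well-formed UTF-8 by counting every
 * byte that is not a continuation byte (continuation bytes are 10xxxxxx).
 */
size_t
utf8_char_count(const unsigned char *s, size_t nbytes)
{
    size_t i, chars = 0;

    assert(s != NULL || nbytes == 0);
    for (i = 0; i < nbytes; i++)
        if ((s[i] & 0xC0) != 0x80)
            chars++;
    return chars;
}

/* Length of one capture: bytes for a byte string, characters for UTF-8. */
size_t
capture_length(const unsigned char *subbeg, const paren_pair *pair,
               int is_utf8)
{
    size_t nbytes;

    assert(subbeg != NULL && pair != NULL);
    if (pair->start == -1 || pair->end == -1)
        return 0;                       /* group did not match */

    nbytes = (size_t)(pair->end - pair->start);
    return is_utf8 ? utf8_char_count(subbeg + pair->start, nbytes)
                   : nbytes;
}
```

For a byte string the answer is the plain subtraction shown earlier; only the UTF-8 case needs to look at the subject itself.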
192b9cd1 AB |
354 | =head2 Named capture callbacks |
355 | ||
356 | Called to get/set the value of C<%+> and C<%-> as well as by some | |
357 | utility functions in L<re>. | |
358 | ||
359 | There are two callbacks: C<named_buff> is called in all the cases the | |
360 | FETCH, STORE, DELETE, CLEAR, EXISTS and SCALAR L<Tie::Hash> callbacks | |
361 | would be on changes to C<%+> and C<%-> and C<named_buff_iter> in the | |
362 | same cases as FIRSTKEY and NEXTKEY. | |
363 | ||
364 | The C<flags> parameter can be used to determine which of these | |
365 | operations the callbacks should respond to; the following flags are | |
366 | currently defined: | |
367 | ||
368 | Which L<Tie::Hash> operation is being performed from the Perl level on | |
369 | C<%+> or C<%->, if any: | |
370 | ||
f1b875a0 YO |
371 | RXapif_FETCH |
372 | RXapif_STORE | |
373 | RXapif_DELETE | |
374 | RXapif_CLEAR | |
375 | RXapif_EXISTS | |
376 | RXapif_SCALAR | |
377 | RXapif_FIRSTKEY | |
378 | RXapif_NEXTKEY | |
192b9cd1 AB |
379 | |
380 | Whether C<%+> or C<%-> is being operated on, if any. | |
2fdbfb4d | 381 | |
f1b875a0 YO |
382 | RXapif_ONE /* %+ */ |
383 | RXapif_ALL /* %- */ | |
2fdbfb4d | 384 | |
192b9cd1 | 385 | Whether this is being called as C<re::regname>, C<re::regnames> or |
c998b245 | 386 | C<re::regnames_count>, if any. The first two will be combined with |
f1b875a0 | 387 | C<RXapif_ONE> or C<RXapif_ALL>. |
192b9cd1 | 388 | |
f1b875a0 YO |
389 | RXapif_REGNAME |
390 | RXapif_REGNAMES | |
391 | RXapif_REGNAMES_COUNT | |
192b9cd1 AB |
392 | |
393 | Internally C<%+> and C<%-> are implemented with a real tied interface | |
394 | via L<Tie::Hash::NamedCapture>. The methods in that package will call | |
395 | back into these functions. However the usage of | |
396 | L<Tie::Hash::NamedCapture> for this purpose might change in future | |
397 | releases. For instance this might be implemented by magic instead | |
398 | (would need an extension to mgvtbl). | |
399 | ||
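A C<named_buff> implementation typically starts by decoding C<flags>. The sketch below shows one way that decoding might look. The C<EX_> constants are stand-ins defined here for illustration only (a real engine would use the C<RXapif_*> definitions from F<regexp.h>), and the functions C<named_buff_op> and C<named_buff_target> are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins; real code should use RXapif_* from regexp.h. */
#define EX_RXapif_FETCH   0x0001u
#define EX_RXapif_STORE   0x0002u
#define EX_RXapif_DELETE  0x0004u
#define EX_RXapif_CLEAR   0x0008u
#define EX_RXapif_EXISTS  0x0010u
#define EX_RXapif_SCALAR  0x0020u
#define EX_RXapif_ONE     0x0100u   /* %+ */
#define EX_RXapif_ALL     0x0200u   /* %- */

/* Name of the Tie::Hash-style operation encoded in flags, or NULL. */
const char *
named_buff_op(unsigned int flags)
{
    if (flags & EX_RXapif_FETCH)  return "FETCH";
    if (flags & EX_RXapif_STORE)  return "STORE";
    if (flags & EX_RXapif_DELETE) return "DELETE";
    if (flags & EX_RXapif_CLEAR)  return "CLEAR";
    if (flags & EX_RXapif_EXISTS) return "EXISTS";
    if (flags & EX_RXapif_SCALAR) return "SCALAR";
    return NULL;
}

/* '+' for %+, '-' for %-, or 0 when neither hash is involved. */
char
named_buff_target(unsigned int flags)
{
    /* Assumption of this sketch: one call never addresses both hashes. */
    assert(!((flags & EX_RXapif_ONE) && (flags & EX_RXapif_ALL)));

    if (flags & EX_RXapif_ONE) return '+';
    if (flags & EX_RXapif_ALL) return '-';
    return 0;
}
```

An engine that treats named captures as read-only would, for example, croak for C<STORE>, C<DELETE> and C<CLEAR> and only do real work for the remaining operations.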
400 | =head3 named_buff | |
401 | ||
402 | SV* (*named_buff) (pTHX_ REGEXP * const rx, SV * const key, | |
403 | SV * const value, U32 flags); | |
404 | ||
405 | =head3 named_buff_iter | |
406 | ||
407 | SV* (*named_buff_iter) (pTHX_ REGEXP * const rx, const SV * const lastkey, | |
408 | const U32 flags); | |
108003db | 409 | |
49d7dfbc | 410 | =head2 qr_package |
108003db | 411 | |
49d7dfbc | 412 | SV* qr_package(pTHX_ REGEXP * const rx); |
108003db RGS |
413 | |
414 | The package the qr// magic object is blessed into (as seen by C<ref | |
49d7dfbc AB |
415 | qr//>). It is recommended that engines change this to their package |
416 | name for identification regardless of whether they implement methods | |
417 | on the object. | |
418 | ||
192b9cd1 | 419 | The package this method returns should also have the internal |
d5213412 | 420 | C<Regexp> package in its C<@ISA>. C<< qr//->isa("Regexp") >> should always |
192b9cd1 AB |
421 | be true regardless of what engine is being used. |
422 | ||
423 | An example implementation might be: | |
108003db RGS |
424 | |
425 | SV* | |
192b9cd1 | 426 | Example_qr_package(pTHX_ REGEXP * const rx) |
108003db RGS |
427 | { |
428 | PERL_UNUSED_ARG(rx); | |
429 | return newSVpvs("re::engine::Example"); | |
430 | } | |
431 | ||
432 | Any method calls on an object created with C<qr//> will be dispatched to the | |
433 | package as a normal object. | |
434 | ||
435 | use re::engine::Example; | |
436 | my $re = qr//; | |
437 | $re->meth; # dispatched to re::engine::Example::meth() | |
438 | ||
f7e71195 AB |
439 | To retrieve the C<REGEXP> object from the scalar in an XS function use |
440 | the C<SvRX> macro, see L<"REGEXP Functions" in perlapi|perlapi/REGEXP | |
441 | Functions>. | |
108003db RGS |
442 | |
443 | void meth(SV * rv) | |
444 | PPCODE: | |
f7e71195 | 445 | REGEXP * re = SvRX(rv);
108003db | 446 | |
108003db RGS |
447 | =head2 dupe |
448 | ||
49d7dfbc | 449 | void* dupe(pTHX_ REGEXP * const rx, CLONE_PARAMS *param); |
108003db RGS |
450 | |
451 | On threaded builds a regexp may need to be duplicated so that the pattern | |
a0e97681 | 452 | can be used by multiple threads. This routine is expected to handle the |
108003db RGS |
453 | duplication of any private data pointed to by the C<pprivate> member of |
454 | the regexp structure. It will be called with the preconstructed new | |
455 | regexp structure as an argument, the C<pprivate> member will point at | |
a0e97681 | 456 | the B<old> private structure, and it is this routine's responsibility to |
108003db RGS |
457 | construct a copy and return a pointer to it (which perl will then use to |
458 | overwrite the field as passed to this routine.) | |
459 | ||
460 | This allows the engine to dupe its private data but also if necessary | |
461 | modify the final structure if it really must. | |
462 | ||
463 | On unthreaded builds this field doesn't exist. | |
464 | ||
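The deep copy that C<dupe> is expected to perform can be sketched independently of perl. Everything here is an assumption made for illustration: C<example_private> is a hypothetical private structure, C<example_dupe_private> is a hypothetical helper, and plain C<malloc> is used where a real engine would presumably use perl's own allocation routines.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical engine-private data; a real engine defines its own layout. */
typedef struct {
    char  *program;       /* compiled form, owned by this struct */
    size_t program_len;   /* number of bytes in program */
} example_private;

/*
 * Deep-copy the old private data so that the new thread owns its own
 * buffers. Returns NULL if an allocation fails.
 */
void *
example_dupe_private(const example_private *old)
{
    example_private *copy;

    assert(old != NULL);

    copy = malloc(sizeof *copy);
    if (copy == NULL)
        return NULL;

    copy->program_len = old->program_len;
    /* Allocate at least one byte so the pointer is always valid. */
    copy->program = malloc(old->program_len ? old->program_len : 1);
    if (copy->program == NULL) {
        free(copy);
        return NULL;
    }
    if (old->program_len)
        memcpy(copy->program, old->program, old->program_len);

    return copy;
}
```

The important property is that no pointer inside the copy aliases memory owned by the original, so freeing one regexp in one thread cannot invalidate the other.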
465 | =head1 The REGEXP structure | |
466 | ||
467 | The REGEXP struct is defined in F<regexp.h>. All regex engines must be able to | |
468 | correctly build such a structure in their L</comp> routine. | |
469 | ||
470 | The REGEXP structure contains all the data that perl needs to be aware of | |
471 | to properly work with the regular expression. It includes data about | |
472 | optimisations that perl can use to determine if the regex engine should | |
473 | really be used, and various other control info that is needed to properly | |
474 | execute patterns in various contexts, such as whether the pattern is anchored in | |
475 | some way, what flags were used during the compile, or whether the | |
476 | program contains special constructs that perl needs to be aware of. | |
477 | ||
882227b7 AB |
478 | In addition it contains two fields that are intended for the private |
479 | use of the regex engine that compiled the pattern. These are the | |
480 | C<intflags> and C<pprivate> members. C<pprivate> is a void pointer to | |
481 | an arbitrary structure whose use and management is the responsibility | |
482 | of the compiling engine. perl will never modify either of these | |
483 | values. | |
108003db RGS |
484 | |
485 | typedef struct regexp { | |
486 | /* what engine created this regexp? */ | |
487 | const struct regexp_engine* engine; | |
488 | ||
489 | /* what re is this a lightweight copy of? */ | |
490 | struct regexp* mother_re; | |
491 | ||
492 | /* Information about the match that the perl core uses to manage things */ | |
493 | U32 extflags; /* Flags used both externally and internally */ | |
494 | I32 minlen; /* minimum possible length of string to match */ | |
495 | I32 minlenret; /* minimum possible length of $& */ | |
496 | U32 gofs; /* chars left of pos that we search from */ | |
497 | ||
498 | /* substring data about strings that must appear | |
499 | in the final match, used for optimisations */ | |
500 | struct reg_substr_data *substrs; | |
501 | ||
c27a5cfe | 502 | U32 nparens; /* number of capture groups */ |
108003db RGS |
503 | |
504 | /* private engine specific data */ | |
505 | U32 intflags; /* Engine Specific Internal flags */ | |
506 | void *pprivate; /* Data private to the regex engine which | |
507 | created this object. */ | |
508 | ||
509 | /* Data about the last/current match. These are modified during matching*/ | |
510 | U32 lastparen; /* last open paren matched */ | |
511 | U32 lastcloseparen; /* last close paren matched */ | |
512 | regexp_paren_pair *swap; /* Swap copy of *offs */ | |
513 | regexp_paren_pair *offs; /* Array of offsets for (@-) and (@+) */ | |
514 | ||
515 | char *subbeg; /* saved or original string so \digit works forever. */ | |
516 | SV_SAVED_COPY /* If non-NULL, SV which is COW from original */ | |
517 | I32 sublen; /* Length of string pointed by subbeg */ | |
518 | ||
519 | /* Information about the match that isn't often used */ | |
520 | I32 prelen; /* length of precomp */ | |
521 | const char *precomp; /* pre-compilation regular expression */ | |
522 | ||
108003db RGS |
523 | char *wrapped; /* wrapped version of the pattern */ |
524 | I32 wraplen; /* length of wrapped */ | |
525 | ||
526 | I32 seen_evals; /* number of eval groups in the pattern - for security checks */ | |
527 | HV *paren_names; /* Optional hash of paren names */ | |
528 | ||
529 | /* Refcount of this regexp */ | |
530 | I32 refcnt; /* Refcount of this regexp */ | |
531 | } regexp; | |
532 | ||
533 | The fields are discussed in more detail below: | |
534 | ||
882227b7 | 535 | =head2 C<engine> |
108003db RGS |
536 | |
537 | This field points at a regexp_engine structure which contains pointers | |
538 | to the subroutines that are to be used for performing a match. It | |
539 | is the compiling routine's responsibility to populate this field before | |
540 | returning the regexp object. | |
541 | ||
542 | Internally this is set to C<NULL> unless a custom engine is specified in | |
543 | C<$^H{regcomp}>; perl's own set of callbacks can be accessed in the struct | |
544 | pointed to by C<RE_ENGINE_PTR>. | |
545 | ||
882227b7 | 546 | =head2 C<mother_re> |
108003db RGS |
547 | |
548 | TODO, see L<http://www.mail-archive.com/perl5-changes@perl.org/msg17328.html> | |
549 | ||
882227b7 | 550 | =head2 C<extflags> |
108003db | 551 | |
192b9cd1 AB |
552 | This will be used by perl to see what flags the regexp was compiled |
553 | with; this will normally be set to the value of the flags parameter by | |
c998b245 AB |
554 | the L<comp|/comp> callback. See the L<comp|/comp> documentation for |
555 | valid flags. | |
108003db | 556 | |
882227b7 | 557 | =head2 C<minlen> C<minlenret> |
108003db RGS |
558 | |
559 | The minimum string length required for the pattern to match. This is used to | |
560 | prune the search space by not bothering to match any closer to the end of a | |
561 | string than would allow a match. For instance there is no point in even | |
562 | starting the regex engine if the minlen is 10 but the string is only 5 | |
563 | characters long. There is no way that the pattern can match. | |
564 | ||
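The pruning that C<minlen> allows amounts to a simple bound on where a match attempt may start. A minimal sketch, assuming both lengths are measured in the same unit; the function name C<last_viable_start> is hypothetical:

```c
#include <assert.h>

/*
 * Last offset at which a match attempt can still succeed, or -1 if the
 * subject is shorter than minlen and the engine need not run at all.
 */
long
last_viable_start(long subject_len, long minlen)
{
    assert(subject_len >= 0 && minlen >= 0);

    if (minlen > subject_len)
        return -1;                  /* e.g. minlen 10, subject length 5 */
    return subject_len - minlen;
}
```

With a C<minlen> of 10 and a 5-character subject the result is -1, which corresponds to the case above where the engine is never started.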
565 | C<minlenret> is the minimum length of the string that would be found | |
566 | in $& after a match. | |
567 | ||
568 | The difference between C<minlen> and C<minlenret> can be seen in the | |
569 | following pattern: | |
570 | ||
571 | /ns(?=\d)/ | |
572 | ||
573 | where the C<minlen> would be 3 but C<minlenret> would only be 2 as the \d is | |
574 | required to match but is not actually included in the matched content. This | |
575 | distinction is particularly important as the substitution logic uses the | |
a0e97681 | 576 | C<minlenret> to tell whether it can do in-place substitution which can result in |
108003db RGS |
577 | considerable speedup. |
578 | ||
882227b7 | 579 | =head2 C<gofs> |
108003db RGS |
580 | |
581 | Left offset from pos() to start match at. | |
582 | ||
882227b7 | 583 | =head2 C<substrs> |
108003db | 584 | |
192b9cd1 AB |
585 | Substring data about strings that must appear in the final match. This |
586 | is currently only used internally by perl's engine but might be | |
c998b245 | 587 | used in the future for all engines for optimisations. |
108003db | 588 | |
1cecf2c0 | 589 | =head2 C<nparens>, C<lastparen>, and C<lastcloseparen> |
108003db RGS |
590 | |
591 | These fields are used to keep track of how many paren groups could be matched | |
592 | in the pattern, which was the last open paren to be entered, and which was | |
593 | the last close paren to be entered. | |
594 | ||
882227b7 | 595 | =head2 C<intflags> |
108003db RGS |
596 | |
597 | The engine's private copy of the flags the pattern was compiled with. Usually | |
192b9cd1 | 598 | this is the same as C<extflags> unless the engine chose to modify one of them. |
108003db | 599 | |
882227b7 | 600 | =head2 C<pprivate> |
108003db RGS |
601 | |
602 | A void* pointing to an engine-defined data structure. The perl engine uses the | |
603 | C<regexp_internal> structure (see L<perlreguts/Base Structures>) but a custom | |
604 | engine should use something else. | |
605 | ||
882227b7 | 606 | =head2 C<swap> |
108003db | 607 | |
e9105d30 | 608 | Unused. Left in for compatibility with perl 5.10.0. |
108003db | 609 | |
882227b7 | 610 | =head2 C<offs> |
108003db RGS |
611 | |
612 | An array of C<regexp_paren_pair> structures which define offsets into the string being |
613 | matched, corresponding to the C<$&> and C<$1>, C<$2> etc. captures. The | |
614 | C<regexp_paren_pair> struct is defined as follows: | |
615 | ||
616 | typedef struct regexp_paren_pair { | |
617 | I32 start; | |
618 | I32 end; | |
619 | } regexp_paren_pair; | |
620 | ||
621 | If C<< ->offs[num].start >> or C<< ->offs[num].end >> is C<-1> then that | |
c27a5cfe | 622 | capture group did not match. C<< ->offs[0].start/end >> represents C<$&> (or |
108003db RGS |
623 | C<${^MATCH}> under C<//p>) and C<< ->offs[paren].end >> matches C<$$paren> where |
624 | C<< $paren >= 1 >>. | |
625 | ||
882227b7 | 626 | =head2 C<precomp> C<prelen> |
108003db | 627 | |
192b9cd1 AB |
628 | Used for optimisations. C<precomp> holds a copy of the pattern that |
629 | was compiled and C<prelen> its length. When a new pattern is to be | |
630 | compiled (such as inside a loop) the internal C<regcomp> operator | |
631 | checks whether the last compiled C<REGEXP>'s C<precomp> and C<prelen> | |
632 | are equivalent to the new one, and if so uses the old pattern instead | |
633 | of compiling a new one. | |
634 | ||
635 | The relevant snippet from C<Perl_pp_regcomp>: | |
636 | ||
637 | if (!re || !re->precomp || re->prelen != (I32)len || | |
638 | memNE(re->precomp, t, len)) | |
639 | /* Compile a new pattern */ | |
108003db | 640 | |
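The same comparison can be written in portable C without perl's C<memNE> macro. This is a sketch of the logic only; the function name C<needs_recompile> and its parameter names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if the new pattern text differs from the cached one and a
 * fresh compile is needed, 0 if the cached regexp can be reused.
 */
int
needs_recompile(const char *cached_pre, size_t cached_len,
                const char *pat, size_t pat_len)
{
    assert(pat != NULL);

    if (cached_pre == NULL)          /* nothing compiled yet */
        return 1;
    if (cached_len != pat_len)       /* cheap length check first */
        return 1;
    return memcmp(cached_pre, pat, pat_len) != 0;
}
```

Checking the length before comparing bytes keeps the common mismatch case cheap, which matters when the check runs on every iteration of a loop.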
882227b7 | 641 | =head2 C<paren_names> |
108003db | 642 | |
c27a5cfe | 643 | This is a hash used internally to track named capture groups and their |
108003db RGS |
644 | offsets. The keys are the names of the buffers; the values are dualvars, |
645 | with the IV slot holding the number of buffers with the given name and the | |
646 | pv being an embedded array of I32. The values may also be contained | |
647 | independently in the data array in cases where named backreferences are | |
648 | used. | |
649 | ||
c998b245 | 650 | =head2 C<substrs> |
108003db RGS |
651 | |
652 | Holds information on the longest string that must occur at a fixed | |
653 | offset from the start of the pattern, and the longest string that must | |
654 | occur at a floating offset from the start of the pattern. Used to do | |
655 | Fast-Boyer-Moore searches on the string to find out if it's worth using | |
656 | the regex engine at all, and if so where in the string to search. | |
657 | ||
882227b7 | 658 | =head2 C<subbeg> C<sublen> C<saved_copy> |
108003db | 659 | |
c998b245 | 660 | Used during execution phase for managing search and replace patterns. |
108003db | 661 | |
882227b7 | 662 | =head2 C<wrapped> C<wraplen> |
108003db | 663 | |
c998b245 | 664 | Stores the string C<qr//> stringifies to. The perl engine for example |
ed215d3c | 665 | stores C<(?^:eek)> in the case of C<qr/eek/>. |
108003db | 666 | |
c998b245 AB |
667 | When using a custom engine that doesn't support the C<(?:)> construct |
668 | for inline modifiers, it's probably best to have C<qr//> stringify to | |
669 | the supplied pattern. Note that this will create undesired patterns in | |
670 | cases such as: | |
108003db RGS |
671 | |
672 | my $x = qr/a|b/; # "a|b" | |
192b9cd1 | 673 | my $y = qr/c/i; # "c" |
108003db RGS |
674 | my $z = qr/$x$y/; # "a|bc" |
675 | ||
192b9cd1 AB |
676 | There's no solution for this problem other than making the custom |
677 | engine understand a construct like C<(?:)>. | |
108003db | 678 | |
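The interpolation problem can be reproduced with plain string handling. The sketch below is illustrative only: C<wrap_pattern> and C<join_patterns> are hypothetical helpers, the fixed 128-byte buffers are an arbitrary choice, and modifiers such as C</i> are ignored entirely.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write "(?:" pat ")" into out. Returns 0 on success, -1 if it won't fit. */
int
wrap_pattern(char *out, size_t outsz, const char *pat)
{
    int n;

    assert(out != NULL && pat != NULL);
    n = snprintf(out, outsz, "(?:%s)", pat);
    return (n < 0 || (size_t)n >= outsz) ? -1 : 0;
}

/*
 * Concatenate the stringified forms of two patterns, as interpolating
 * them into a larger pattern would. If wrap is nonzero each part is
 * grouped first so that alternations keep their meaning.
 */
int
join_patterns(char *out, size_t outsz, const char *x, const char *y,
              int wrap)
{
    char wx[128], wy[128];
    int n;

    assert(out != NULL && x != NULL && y != NULL);
    if (wrap) {
        if (wrap_pattern(wx, sizeof wx, x) != 0 ||
            wrap_pattern(wy, sizeof wy, y) != 0)
            return -1;
        x = wx;
        y = wy;
    }
    n = snprintf(out, outsz, "%s%s", x, y);
    return (n < 0 || (size_t)n >= outsz) ? -1 : 0;
}
```

Joining C<a|b> and C<c> without wrapping yields C<a|bc>, which matches either C<a> or C<bc>; with wrapping the result is C<(?:a|b)(?:c)>, which preserves the intended grouping.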
882227b7 | 679 | =head2 C<seen_evals> |
108003db RGS |
680 | |
681 | This stores the number of eval groups in the pattern. This is used for security | |
682 | purposes when embedding compiled regexes into larger patterns with C<qr//>. | |
683 | ||
882227b7 | 684 | =head2 C<refcnt> |
108003db RGS |
685 | |
686 | The number of times the structure is referenced. When this falls to 0 the | |
687 | regexp is automatically freed by a call to pregfree. This should be set to 1 in | |
688 | each engine's L</comp> routine. | |
689 | ||
108003db RGS |
690 | =head1 HISTORY |
691 | ||
692 | Originally part of L<perlreguts>. | |
693 | ||
694 | =head1 AUTHORS | |
695 | ||
696 | Originally written by Yves Orton, expanded by E<AElig>var ArnfjE<ouml>rE<eth> | |
697 | Bjarmason. | |
698 | ||
699 | =head1 LICENSE | |
700 | ||
701 | Copyright 2006 Yves Orton and 2007 E<AElig>var ArnfjE<ouml>rE<eth> Bjarmason. | |
702 | ||
703 | This program is free software; you can redistribute it and/or modify it under | |
704 | the same terms as Perl itself. | |
705 | ||
706 | =cut |