Commit | Line | Data |
---|---|---|
108003db RGS |
1 | =head1 NAME |
2 | ||
5a2b28ce | 3 | perlreapi - Perl regular expression plugin interface |
108003db RGS |
4 | |
5 | =head1 DESCRIPTION | |
6 | ||
5a2b28ce KW |
7 | As of Perl 5.9.5 there is a new interface for plugging in and using |
8 | regular expression engines other than the default one. | |
a0e97681 RGS |
9 | |
10 | Each engine is supposed to provide access to a constant structure of the | |
11 | following format: | |
108003db RGS |
12 | |
13 | typedef struct regexp_engine { | |
02c01adb KW |
14 | REGEXP* (*comp) (pTHX_ |
15 | const SV * const pattern, const U32 flags); | |
16 | I32 (*exec) (pTHX_ | |
17 | REGEXP * const rx, | |
18 | char* stringarg, | |
19 | char* strend, char* strbeg, | |
20 | I32 minend, SV* screamer, | |
2fdbfb4d | 21 | void* data, U32 flags); |
02c01adb KW |
22 | char* (*intuit) (pTHX_ |
23 | REGEXP * const rx, SV *sv, | |
52a21eb3 | 24 | const char * const strbeg, |
02c01adb | 25 | char *strpos, char *strend, U32 flags, |
2fdbfb4d | 26 | struct re_scream_pos_data_s *data); |
49d7dfbc AB |
27 | SV* (*checkstr) (pTHX_ REGEXP * const rx); |
28 | void (*free) (pTHX_ REGEXP * const rx); | |
02c01adb KW |
29 | void (*numbered_buff_FETCH) (pTHX_ |
30 | REGEXP * const rx, | |
31 | const I32 paren, | |
32 | SV * const sv); | |
33 | void (*numbered_buff_STORE) (pTHX_ | |
34 | REGEXP * const rx, | |
35 | const I32 paren, | |
5a2b28ce | 36 | SV const * const value); |
02c01adb KW |
37 | I32 (*numbered_buff_LENGTH) (pTHX_ |
38 | REGEXP * const rx, | |
39 | const SV * const sv, | |
40 | const I32 paren); | |
41 | SV* (*named_buff) (pTHX_ | |
42 | REGEXP * const rx, | |
43 | SV * const key, | |
44 | SV * const value, | |
45 | U32 flags); | |
46 | SV* (*named_buff_iter) (pTHX_ | |
47 | REGEXP * const rx, | |
48 | const SV * const lastkey, | |
192b9cd1 | 49 | const U32 flags); |
49d7dfbc | 50 | SV* (*qr_package)(pTHX_ REGEXP * const rx); |
108003db | 51 | #ifdef USE_ITHREADS |
49d7dfbc | 52 | void* (*dupe) (pTHX_ REGEXP * const rx, CLONE_PARAMS *param); |
108003db | 53 | #endif |
3c13cae6 DM |
54 | REGEXP* (*op_comp) (...); |
55 | } regexp_engine; |
108003db RGS |
56 | |
57 | When a regexp is compiled, its C<engine> field is then set to point at | |
a0e97681 | 58 | the appropriate structure, so that when it needs to be used Perl can find |
108003db RGS |
59 | the right routines to do so. |
60 | ||
61 | In order to install a new regexp handler, C<$^H{regcomp}> is set | |
62 | to an integer which (when cast appropriately) resolves to one of these |
2edc787c | 63 | structures. When compiling, the C<comp> method is executed, and the |
5a2b28ce | 64 | resulting C<regexp> structure's engine field is expected to point back at |
108003db RGS |
65 | the same structure. |
66 | ||
5a2b28ce | 67 | The pTHX_ symbol in the definition is a macro used by Perl under threading |
108003db RGS |
68 | to provide an extra argument to the routine holding a pointer back to |
69 | the interpreter that is executing the regexp. So under threading all | |
70 | routines get an extra argument. | |
71 | ||
882227b7 | 72 | =head1 Callbacks |
108003db RGS |
73 | |
74 | =head2 comp | |
75 | ||
3ab4a224 | 76 | REGEXP* comp(pTHX_ const SV * const pattern, const U32 flags); |
108003db | 77 | |
3ab4a224 AB |
78 | Compile the pattern stored in C<pattern> using the given C<flags> and |
79 | return a pointer to a prepared C<REGEXP> structure that can perform | |
2edc787c | 80 | the match. See L</The REGEXP structure> below for an explanation of |
3ab4a224 AB |
81 | the individual fields in the REGEXP struct. |
82 | ||
83 | The C<pattern> parameter is the scalar that was used as the | |
5a2b28ce KW |
84 | pattern. Previous versions of Perl would pass two C<char*> indicating |
85 | the start and end of the stringified pattern; the following snippet can | |
3ab4a224 AB |
86 | be used to get the old parameters: |
87 | ||
88 | STRLEN plen; | |
89 | char* exp = SvPV(pattern, plen); | |
90 | char* xend = exp + plen; | |
91 | ||
5a2b28ce | 92 | Since any scalar can be passed as a pattern, it's possible to implement |
3ab4a224 AB |
93 | an engine that does something with an array (C<< "ook" =~ [ qw/ eek |
94 | hlagh / ] >>) or with the non-stringified form of a compiled regular | |
5a2b28ce KW |
95 | expression (C<< "ook" =~ qr/eek/ >>). Perl's own engine will always |
96 | stringify everything using the snippet above, but that doesn't mean | |
3ab4a224 | 97 | other engines have to. |
108003db | 98 | |
a0e97681 | 99 | The C<flags> parameter is a bitfield which indicates which of the |
2edc787c | 100 | C<msixp> flags the regex was compiled with. It also contains |
5a2b28ce | 101 | additional info, such as whether C<use locale> is in effect. |
108003db RGS |
102 | |
103 | The C<eogc> flags are stripped out before being passed to the comp | |
2edc787c | 104 | routine. The regex engine does not need to know if any of these |
5a2b28ce | 105 | are set, as those flags should only affect what Perl does with the |
c998b245 AB |
106 | pattern and its match variables, not how it gets compiled and |
107 | executed. | |
108003db | 108 | |
c998b245 | 109 | By the time the comp callback is called, some of these flags have |
2edc787c | 110 | already had effect (noted below where applicable). However most of |
5a2b28ce | 111 | their effect occurs after the comp callback has run, in routines that |
c998b245 | 112 | read the C<< rx->extflags >> field which it populates. |
108003db | 113 | |
c998b245 AB |
114 | In general the flags should be preserved in C<< rx->extflags >> after |
115 | compilation, although the regex engine might want to add or delete | |
2edc787c | 116 | some of them to invoke or disable some special behavior in Perl. The |
c998b245 | 117 | flags along with any special behavior they cause are documented below: |
108003db | 118 | |
c998b245 | 119 | The pattern modifiers: |
108003db | 120 | |
c998b245 | 121 | =over 4 |
108003db | 122 | |
c998b245 | 123 | =item C</m> - RXf_PMf_MULTILINE |
108003db | 124 | |
c998b245 AB |
125 | If this is in C<< rx->extflags >> it will be passed to |
126 | C<Perl_fbm_instr> by C<pp_split> which will treat the subject string | |
127 | as a multi-line string. | |
108003db | 128 | |
c998b245 | 129 | =item C</s> - RXf_PMf_SINGLELINE |
108003db | 130 | |
c998b245 | 131 | =item C</i> - RXf_PMf_FOLD |
108003db | 132 | |
c998b245 | 133 | =item C</x> - RXf_PMf_EXTENDED |
108003db | 134 | |
5a2b28ce | 135 | If present on a regex, C<"#"> comments will be handled differently by the |
c998b245 | 136 | tokenizer in some cases. |
108003db | 137 | |
c998b245 | 138 | TODO: Document those cases. |
108003db | 139 | |
c998b245 | 140 | =item C</p> - RXf_PMf_KEEPCOPY |
108003db | 141 | |
e72ec78c KW |
142 | TODO: Document this |
143 | ||
f0f9b3b8 KW |
144 | =item Character set |
145 | ||
146 | The character set semantics are determined by an enum that is contained | |
147 | in this field. This is still experimental and subject to change, but | |
148 | the current interface returns the rules by use of the in-line function | |
149 | C<get_regex_charset(const U32 flags)>. The only currently documented | |
150 | value returned from it is REGEX_LOCALE_CHARSET, which is set if | |
e72ec78c KW |
151 | C<use locale> is in effect. If present in C<< rx->extflags >>, |
152 | C<split> will use the locale dependent definition of whitespace | |
2edc787c | 153 | when RXf_SKIPWHITE or RXf_WHITE is in effect. ASCII whitespace |
96090e4f | 154 | is defined as per L<isSPACE|perlapi/isSPACE>, and by the internal |
e72ec78c | 155 | macros C<is_utf8_space> under UTF-8, and C<isSPACE_LC> under C<use |
c998b245 | 156 | locale>. |
108003db | 157 | |
f0f9b3b8 KW |
158 | =back |
159 | ||
160 | Additional flags: | |
161 | ||
162 | =over 4 | |
163 | ||
0ac6acae AB |
164 | =item RXf_SPLIT |
165 | ||
9f6ecbda FC |
166 | This flag was removed in perl 5.18.0. C<split ' '> is now special-cased |
167 | solely in the parser. RXf_SPLIT is still #defined, so you can test for it. | |
168 | This is how it used to work: | |
169 | ||
0ac6acae | 170 | If C<split> is invoked as C<split ' '> or with no arguments (which |
5a2b28ce | 171 | really means C<split(' ', $_)>, see L<split|perlfunc/split>), Perl will |
2edc787c FC |
172 | set this flag. The regex engine can then check for it and set the |
173 | SKIPWHITE and WHITE extflags. To do this, the Perl engine does: | |
0ac6acae AB |
174 | |
175 | if (flags & RXf_SPLIT && r->prelen == 1 && r->precomp[0] == ' ') | |
176 | r->extflags |= (RXf_SKIPWHITE|RXf_WHITE); | |
177 | ||
108003db RGS |
178 | =back |
179 | ||
c998b245 AB |
180 | These flags can be set during compilation to enable optimizations in |
181 | the C<split> operator. | |
182 | ||
183 | =over 4 | |
184 | ||
0ac6acae AB |
185 | =item RXf_SKIPWHITE |
186 | ||
51b1fee8 FC |
187 | This flag was removed in perl 5.18.0. It is still #defined, so you can |
188 | set it, but doing so will have no effect. This is how it used to work: | |
189 | ||
0ac6acae AB |
190 | If the flag is present in C<< rx->extflags >> C<split> will delete |
191 | whitespace from the start of the subject string before it's operated | |
2edc787c | 192 | on. What is considered whitespace depends on whether the subject is a |
5a2b28ce | 193 | UTF-8 string and whether the C<RXf_PMf_LOCALE> flag is set. |
0ac6acae | 194 | |
5a2b28ce KW |
195 | If RXf_WHITE is set in addition to this flag, C<split> will behave like |
196 | C<split " "> under the Perl engine. | |
0ac6acae | 197 | |
c998b245 AB |
198 | =item RXf_START_ONLY |
199 | ||
200 | Tells the split operator to split the target string on newlines | |
201 | (C<\n>) without invoking the regex engine. | |
202 | ||
203 | Perl's engine sets this if the pattern is C</^/> (C<plen == 1 && *exp | |
2edc787c | 204 | == '^'>), even under C</^/s>; see L<split|perlfunc/split>. Of course a |
c998b245 AB |
205 | different regex engine might want to use the same optimizations |
206 | with a different syntax. | |
207 | ||
208 | =item RXf_WHITE | |
209 | ||
210 | Tells the split operator to split the target string on whitespace | |
2edc787c | 211 | without invoking the regex engine. The definition of whitespace varies |
5a2b28ce KW |
212 | depending on whether the target string is a UTF-8 string and on |
213 | whether RXf_PMf_LOCALE is set. |
c998b245 | 214 | |
0ac6acae | 215 | Perl's engine sets this flag if the pattern is C<\s+>. |
c998b245 | 216 | |
640f820d AB |
217 | =item RXf_NULL |
218 | ||
a0e97681 | 219 | Tells the split operator to split the target string on |
2edc787c | 220 | characters. The definition of character varies depending on whether |
640f820d AB |
221 | the target string is a UTF-8 string. |
222 | ||
223 | Perl's engine sets this flag on empty patterns; this optimization |
2edc787c | 224 | makes C<split //> much faster than it would otherwise be. It's even |
640f820d AB |
225 | faster than C<unpack>. |
226 | ||
dbc200c5 | 227 | =item RXf_NO_INPLACE_SUBST |
94bbc698 FC |
228 | |
229 | Added in perl 5.18.0, this flag indicates that a regular expression might | |
579794ec | 230 | perform an operation that would interfere with inplace substitution. For |
dbc200c5 YO |
231 | instance it might contain lookbehind, or assign to non-magical variables |
232 | (such as $REGMARK and $REGERROR) during matching. C<s///> will skip | |
233 | certain optimisations when this is set. | |
94bbc698 | 234 | |
c998b245 | 235 | =back |
108003db RGS |
236 | |
237 | =head2 exec | |
238 | ||
49d7dfbc | 239 | I32 exec(pTHX_ REGEXP * const rx, |
108003db RGS |
240 | char *stringarg, char* strend, char* strbeg, |
241 | I32 minend, SV* screamer, | |
242 | void* data, U32 flags); | |
243 | ||
8fd1a950 DM |
244 | Execute a regexp. The arguments are |
245 | ||
246 | =over 4 | |
247 | ||
248 | =item rx | |
249 | ||
250 | The regular expression to execute. | |
251 | ||
252 | =item screamer | |
253 | ||
2edc787c | 254 | This strangely-named arg is the SV to be matched against. Note that the |
8fd1a950 DM |
255 | actual char array to be matched against is supplied by the arguments |
256 | described below; the SV is just used to determine UTF8ness, C<pos()> etc. | |
257 | ||
258 | =item strbeg | |
259 | ||
260 | Pointer to the physical start of the string. | |
261 | ||
262 | =item strend | |
263 | ||
264 | Pointer to the character following the physical end of the string (i.e. | |
5a2b28ce | 265 | the C<\0>). |
8fd1a950 DM |
266 | |
267 | =item stringarg | |
268 | ||
269 | Pointer to the position in the string where matching should start; it might | |
270 | not be equal to C<strbeg> (for example in a later iteration of C</.../g>). | |
271 | ||
272 | =item minend | |
273 | ||
274 | Minimum length of string (measured in bytes from C<stringarg>) that must | |
275 | match; if the engine reaches the end of the match but hasn't reached this | |
276 | position in the string, it should fail. | |
277 | ||
278 | =item data | |
279 | ||
280 | Optimisation data; subject to change. | |
281 | ||
282 | =item flags | |
283 | ||
284 | Optimisation flags; subject to change. | |
285 | ||
286 | =back | |
108003db RGS |
287 | |
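The C<minend> rule described above can be illustrated without any of Perl's headers. The following standalone C sketch is a simplified model, not code from the core: C<reaches_minend> and its C<match_end> parameter are hypothetical names, and the check only compares the number of bytes consumed from C<stringarg> against C<minend>.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: nonzero if a match ending at match_end has
 * consumed at least minend bytes, counted from stringarg. */
static int
reaches_minend(const char *stringarg, const char *match_end, ptrdiff_t minend)
{
    assert(stringarg != NULL && match_end != NULL);
    assert(match_end >= stringarg);
    return (match_end - stringarg) >= minend;
}
```

An engine modelled this way would treat a candidate match as a failure whenever the helper returns zero, and keep looking for a longer or later match.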
288 | =head2 intuit | |
289 | ||
52a21eb3 DM |
290 | char* intuit(pTHX_ |
291 | REGEXP * const rx, | |
292 | SV *sv, | |
293 | const char * const strbeg, | |
294 | char *strpos, | |
295 | char *strend, | |
296 | const U32 flags, | |
297 | struct re_scream_pos_data_s *data); | |
108003db RGS |
298 | |
299 | Find the start position where a regex match should be attempted, | |
5a2b28ce | 300 | or possibly whether the regex engine should not be run because the |
2edc787c | 301 | pattern can't match. This is called, as appropriate, by the core, |
5a2b28ce | 302 | depending on the values of the C<extflags> member of the C<regexp> |
108003db RGS |
303 | structure. |
304 | ||
52a21eb3 DM |
305 | Arguments: |
306 | ||
307 | rx: the regex to match against | |
308 | sv: the SV being matched: only used for utf8 flag; the string | |
309 | itself is accessed via the pointers below. Note that on | |
310 | something like an overloaded SV, SvPOK(sv) may be false | |
311 | and the string pointers may point to something unrelated to | |
312 | the SV itself. | |
313 | strbeg: real beginning of string | |
314 | strpos: the point in the string at which to begin matching | |
315 | strend: pointer to the byte following the last char of the string | |
316 | flags: currently unused; set to 0 |
317 | data: currently unused; set to NULL | |
318 | ||
319 | ||
108003db RGS |
320 | =head2 checkstr |
321 | ||
49d7dfbc | 322 | SV* checkstr(pTHX_ REGEXP * const rx); |
108003db RGS |
323 | |
324 | Return a SV containing a string that must appear in the pattern. Used | |
325 | by C<split> for optimising matches. | |
326 | ||
327 | =head2 free | |
328 | ||
49d7dfbc | 329 | void free(pTHX_ REGEXP * const rx); |
108003db | 330 | |
5a2b28ce | 331 | Called by Perl when it is freeing a regexp pattern so that the engine |
108003db | 332 | can release any resources pointed to by the C<pprivate> member of the |
2edc787c | 333 | C<regexp> structure. This is only responsible for freeing private data; |
5a2b28ce | 334 | Perl will handle releasing anything else contained in the C<regexp> structure. |
108003db | 335 | |
192b9cd1 | 336 | =head2 Numbered capture callbacks |
108003db | 337 | |
192b9cd1 | 338 | Called to get/set the value of C<$`>, C<$'>, C<$&> and their named |
3b543d84 | 339 | equivalents, ${^PREMATCH}, ${^POSTMATCH} and ${^MATCH}, as well as the |
c27a5cfe | 340 | numbered capture groups (C<$1>, C<$2>, ...). |
49d7dfbc | 341 | |
c149d39e DM |
342 | The C<paren> parameter will be C<1> for C<$1>, C<2> for C<$2> and so |
343 | forth, and have these symbolic values for the special variables: | |
344 | ||
345 | ${^PREMATCH} RX_BUFF_IDX_CARET_PREMATCH | |
346 | ${^POSTMATCH} RX_BUFF_IDX_CARET_POSTMATCH | |
347 | ${^MATCH} RX_BUFF_IDX_CARET_FULLMATCH | |
348 | $` RX_BUFF_IDX_PREMATCH | |
349 | $' RX_BUFF_IDX_POSTMATCH | |
350 | $& RX_BUFF_IDX_FULLMATCH | |
351 | ||
5a2b28ce | 352 | Note that in Perl 5.17.3 and earlier, the last three constants were also |
c149d39e DM |
353 | used for the caret variants of the variables. |
354 | ||
49d7dfbc | 355 | |
192b9cd1 | 356 | The names have been chosen by analogy with L<Tie::Scalar> method |
2edc787c | 357 | names, with an additional B<LENGTH> callback for efficiency. However, |
192b9cd1 AB |
358 | named capture variables are currently not tied internally but |
359 | implemented via magic. | |
360 | ||
361 | =head3 numbered_buff_FETCH | |
362 | ||
363 | void numbered_buff_FETCH(pTHX_ REGEXP * const rx, const I32 paren, | |
364 | SV * const sv); | |
365 | ||
2edc787c | 366 | Fetch a specified numbered capture. C<sv> should be set to the scalar |
192b9cd1 | 367 | to return. The scalar is passed as an argument rather than being |
5a2b28ce | 368 | returned from the function because when it's called Perl already has a |
192b9cd1 | 369 | scalar to store the value, so creating another one would be |
2edc787c | 370 | redundant. The scalar can be set with C<sv_setsv>, C<sv_setpvn> and |
192b9cd1 | 371 | friends; see L<perlapi>. |
49d7dfbc | 372 | |
5a2b28ce | 373 | This callback is where Perl untaints its own capture variables under |
2edc787c | 374 | taint mode (see L<perlsec>). See the C<Perl_reg_numbered_buff_fetch> |
49d7dfbc AB |
375 | function in F<regcomp.c> for how to untaint capture variables if |
376 | that's something you'd like your engine to do as well. | |
108003db | 377 | |
192b9cd1 | 378 | =head3 numbered_buff_STORE |
108003db | 379 | |
02c01adb KW |
380 | void (*numbered_buff_STORE) (pTHX_ |
381 | REGEXP * const rx, | |
382 | const I32 paren, | |
2fdbfb4d | 383 | SV const * const value); |
108003db | 384 | |
2edc787c FC |
385 | Set the value of a numbered capture variable. C<value> is the scalar |
386 | that is to be used as the new value. It's up to the engine to make | |
192b9cd1 | 387 | sure this is used as the new value (or reject it). |
2fdbfb4d AB |
388 | |
389 | Example: | |
390 | ||
391 | if ("ook" =~ /(o*)/) { | |
ccf3535a | 392 | # 'paren' will be '1' and 'value' will be 'ee' |
2fdbfb4d AB |
393 | $1 =~ tr/o/e/; |
394 | } | |
395 | ||
396 | Perl's own engine will croak on any attempt to modify the capture | |
a0e97681 | 397 | variables. To do the same in another engine, use the following callback |
2fdbfb4d AB |
398 | (copied from C<Perl_reg_numbered_buff_store>): |
399 | ||
400 | void | |
02c01adb KW |
401 | Example_reg_numbered_buff_store(pTHX_ |
402 | REGEXP * const rx, | |
403 | const I32 paren, | |
404 | SV const * const value) | |
2fdbfb4d AB |
405 | { |
406 | PERL_UNUSED_ARG(rx); | |
407 | PERL_UNUSED_ARG(paren); | |
408 | PERL_UNUSED_ARG(value); | |
409 | ||
410 | if (!PL_localizing) | |
411 | Perl_croak(aTHX_ PL_no_modify); | |
412 | } | |
413 | ||
5a2b28ce | 414 | Actually Perl will not I<always> croak in a statement that looks |
2edc787c | 415 | like it would modify a numbered capture variable. This is because the |
5a2b28ce | 416 | STORE callback will not be called if Perl can determine that it |
2edc787c | 417 | doesn't have to modify the value. This is exactly how tied variables |
2fdbfb4d AB |
418 | behave in the same situation: |
419 | ||
420 | package CaptureVar; | |
421 | use base 'Tie::Scalar'; | |
422 | ||
423 | sub TIESCALAR { bless [] } | |
424 | sub FETCH { undef } | |
425 | sub STORE { die "This doesn't get called" } | |
426 | ||
427 | package main; | |
428 | ||
c69ca1d4 | 429 | tie my $sv => "CaptureVar"; |
2fdbfb4d AB |
430 | $sv =~ y/a/b/; |
431 | ||
5a2b28ce | 432 | Because C<$sv> is C<undef> when the C<y///> operator is applied to it, |
2fdbfb4d | 433 | the transliteration won't actually execute and the program won't |
2edc787c | 434 | C<die>. This is different to how 5.8 and earlier versions behaved |
5a2b28ce | 435 | since the capture variables were READONLY variables then; now they'll |
192b9cd1 | 436 | just die when assigned to in the default engine. |
2fdbfb4d | 437 | |
192b9cd1 | 438 | =head3 numbered_buff_LENGTH |
2fdbfb4d | 439 | |
02c01adb KW |
440 | I32 numbered_buff_LENGTH (pTHX_ |
441 | REGEXP * const rx, | |
442 | const SV * const sv, | |
2fdbfb4d AB |
443 | const I32 paren); |
444 | ||
2edc787c | 445 | Get the C<length> of a capture variable. There's a special callback |
5a2b28ce KW |
446 | for this so that Perl doesn't have to do a FETCH and run C<length> on |
447 | the result. Since the length is (in Perl's case) known from the offsets |
448 | stored in C<< rx->offs >>, this is much more efficient: |
2fdbfb4d AB |
449 | |
450 | I32 s1 = rx->offs[paren].start; | |
451 | I32 s2 = rx->offs[paren].end; | |
452 | I32 len = s2 - s1; |
453 | ||
454 | This is a little bit more complex in the case of UTF-8, see what | |
455 | C<Perl_reg_numbered_buff_length> does with | |
456 | L<is_utf8_string_loclen|perlapi/is_utf8_string_loclen>. | |
457 | ||
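The offset arithmetic in this section can be modelled outside of Perl. The following is a minimal standalone sketch, not the code of C<Perl_reg_numbered_buff_length>: C<paren_pair> and C<capture_byte_length> are hypothetical names, and it assumes that an unmatched group is recorded with negative offsets. It counts bytes only, so it ignores the UTF-8 case mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one entry of rx->offs. */
typedef struct {
    long start;  /* byte offset where the capture begins, -1 if unset */
    long end;    /* byte offset just past the capture, -1 if unset */
} paren_pair;

/* Return the byte length of capture group paren, or 0 if the group
 * is out of range or did not take part in the match. Entry 0 of offs
 * describes the whole match, so offs holds nparens + 1 entries. */
static long
capture_byte_length(const paren_pair *offs, size_t nparens, size_t paren)
{
    assert(offs != NULL);
    if (paren > nparens)
        return 0;
    if (offs[paren].start < 0 || offs[paren].end < 0)
        return 0;
    return offs[paren].end - offs[paren].start;
}
```

Because only two stored offsets are subtracted, no scalar has to be fetched or copied to answer a C<length> query, which is the efficiency argument made above.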
192b9cd1 AB |
458 | =head2 Named capture callbacks |
459 | ||
5a2b28ce | 460 | Called to get/set the value of C<%+> and C<%->, as well as by some |
192b9cd1 AB |
461 | utility functions in L<re>. |
462 | ||
463 | There are two callbacks: C<named_buff> is called in all the cases the |
464 | FETCH, STORE, DELETE, CLEAR, EXISTS and SCALAR L<Tie::Hash> callbacks |
465 | would be on changes to C<%+> and C<%->, and C<named_buff_iter> in the |
466 | same cases as FIRSTKEY and NEXTKEY. | |
467 | ||
468 | The C<flags> parameter can be used to determine which of these | |
5a2b28ce | 469 | operations the callbacks should respond to. The following flags are |
192b9cd1 AB |
470 | currently defined: |
471 | ||
472 | Which L<Tie::Hash> operation is being performed from the Perl level on | |
473 | C<%+> or C<%->, if any: |
474 | ||
f1b875a0 YO |
475 | RXapif_FETCH |
476 | RXapif_STORE | |
477 | RXapif_DELETE | |
478 | RXapif_CLEAR | |
479 | RXapif_EXISTS | |
480 | RXapif_SCALAR | |
481 | RXapif_FIRSTKEY | |
482 | RXapif_NEXTKEY | |
192b9cd1 | 483 | |
5a2b28ce | 484 | Whether C<%+> or C<%-> is being operated on, if any. |
2fdbfb4d | 485 | |
f1b875a0 YO |
486 | RXapif_ONE /* %+ */ |
487 | RXapif_ALL /* %- */ | |
2fdbfb4d | 488 | |
5a2b28ce | 489 | Whether this is being called as C<re::regname>, C<re::regnames> or |
2edc787c | 490 | C<re::regnames_count>, if any. The first two will be combined with |
f1b875a0 | 491 | C<RXapif_ONE> or C<RXapif_ALL>. |
192b9cd1 | 492 | |
f1b875a0 YO |
493 | RXapif_REGNAME |
494 | RXapif_REGNAMES | |
495 | RXapif_REGNAMES_COUNT | |
192b9cd1 AB |
496 | |
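How a callback might decode such a C<flags> value can be sketched in standalone C. This is only an illustration under stated assumptions: the C<EX_*> constants, their bit values, and the two helper functions are hypothetical, and a real engine would test the C<RXapif_*> constants from Perl's headers instead.

```c
#include <assert.h>
#include <string.h>

/* Illustrative bit values only. A real engine must use the RXapif_*
 * constants from Perl's headers, not these hypothetical EX_* ones. */
enum {
    EX_FETCH    = 1 << 0,
    EX_STORE    = 1 << 1,
    EX_EXISTS   = 1 << 2,
    EX_FIRSTKEY = 1 << 3,
    EX_NEXTKEY  = 1 << 4,
    EX_ONE      = 1 << 5,  /* stands in for RXapif_ONE, i.e. %+ */
    EX_ALL      = 1 << 6   /* stands in for RXapif_ALL, i.e. %- */
};

/* Name the tie-style operation encoded in flags, or "none". */
static const char *
operation_name(unsigned flags)
{
    if (flags & EX_FETCH)    return "FETCH";
    if (flags & EX_STORE)    return "STORE";
    if (flags & EX_EXISTS)   return "EXISTS";
    if (flags & EX_FIRSTKEY) return "FIRSTKEY";
    if (flags & EX_NEXTKEY)  return "NEXTKEY";
    return "none";
}

/* Name the hash being operated on, or "none". */
static const char *
target_hash(unsigned flags)
{
    if (flags & EX_ONE) return "%+";
    if (flags & EX_ALL) return "%-";
    return "none";
}
```

The operation bit and the hash bit are tested independently because, as described above, the two kinds of flag are combined in a single value.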
497 | Internally C<%+> and C<%-> are implemented with a real tied interface | |
2edc787c FC |
498 | via L<Tie::Hash::NamedCapture>. The methods in that package will call |
499 | back into these functions. However the usage of | |
192b9cd1 | 500 | L<Tie::Hash::NamedCapture> for this purpose might change in future |
2edc787c | 501 | releases. For instance this might be implemented by magic instead |
192b9cd1 AB |
502 | (would need an extension to mgvtbl). |
503 | ||
504 | =head3 named_buff | |
505 | ||
506 | SV* (*named_buff) (pTHX_ REGEXP * const rx, SV * const key, | |
507 | SV * const value, U32 flags); | |
508 | ||
509 | =head3 named_buff_iter | |
510 | ||
02c01adb KW |
511 | SV* (*named_buff_iter) (pTHX_ |
512 | REGEXP * const rx, | |
513 | const SV * const lastkey, | |
192b9cd1 | 514 | const U32 flags); |
108003db | 515 | |
49d7dfbc | 516 | =head2 qr_package |
108003db | 517 | |
49d7dfbc | 518 | SV* qr_package(pTHX_ REGEXP * const rx); |
108003db RGS |
519 | |
520 | The package the qr// magic object is blessed into (as seen by C<ref | |
2edc787c | 521 | qr//>). It is recommended that engines change this to their package |
5a2b28ce | 522 | name for identification regardless of whether they implement methods |
49d7dfbc AB |
523 | on the object. |
524 | ||
192b9cd1 | 525 | The package this method returns should also have the internal |
2edc787c | 526 | C<Regexp> package in its C<@ISA>. C<< qr//->isa("Regexp") >> should always |
192b9cd1 AB |
527 | be true regardless of what engine is being used. |
528 | ||
529 | Example implementation might be: | |
108003db RGS |
530 | |
531 | SV* | |
192b9cd1 | 532 | Example_qr_package(pTHX_ REGEXP * const rx) |
108003db RGS |
533 | { |
534 | PERL_UNUSED_ARG(rx); | |
535 | return newSVpvs("re::engine::Example"); | |
536 | } | |
537 | ||
538 | Any method calls on an object created with C<qr//> will be dispatched to the | |
539 | package as a normal object. | |
540 | ||
541 | use re::engine::Example; | |
542 | my $re = qr//; | |
543 | $re->meth; # dispatched to re::engine::Example::meth() | |
544 | ||
f7e71195 AB |
545 | To retrieve the C<REGEXP> object from the scalar in an XS function use |
546 | the C<SvRX> macro, see L<"REGEXP Functions" in perlapi|perlapi/REGEXP | |
547 | Functions>. | |
108003db RGS |
548 | |
549 | void meth(SV * rv) | |
550 | PPCODE: | |
f7e71195 | 551 | REGEXP * re = SvRX(rv); |
108003db | 552 | |
108003db RGS |
553 | =head2 dupe |
554 | ||
49d7dfbc | 555 | void* dupe(pTHX_ REGEXP * const rx, CLONE_PARAMS *param); |
108003db RGS |
556 | |
557 | On threaded builds a regexp may need to be duplicated so that the pattern | |
2edc787c | 558 | can be used by multiple threads. This routine is expected to handle the |
108003db | 559 | duplication of any private data pointed to by the C<pprivate> member of |
5a2b28ce KW |
560 | the C<regexp> structure. It will be called with the preconstructed new |
561 | C<regexp> structure as an argument; the C<pprivate> member will point at |
a0e97681 | 562 | the B<old> private structure, and it is this routine's responsibility to |
5a2b28ce | 563 | construct a copy and return a pointer to it (which Perl will then use to |
108003db RGS |
564 | overwrite the field as passed to this routine.) |
565 | ||
566 | This allows the engine to dupe its private data but also if necessary | |
567 | modify the final structure if it really must. | |
568 | ||
569 | On unthreaded builds this field doesn't exist. | |
570 | ||
3c13cae6 DM |
571 | =head2 op_comp |
572 | ||
5a2b28ce | 573 | This is private to the Perl core and subject to change. It should be left |
3c13cae6 DM |
574 | null. |
575 | ||
108003db RGS |
576 | =head1 The REGEXP structure |
577 | ||
2edc787c FC |
578 | The REGEXP struct is defined in F<regexp.h>. |
579 | All regex engines must be able to | |
108003db RGS |
580 | correctly build such a structure in their L</comp> routine. |
581 | ||
5a2b28ce | 582 | The REGEXP structure contains all the data that Perl needs to be aware of |
2edc787c | 583 | to properly work with the regular expression. It includes data about |
5a2b28ce | 584 | optimisations that Perl can use to determine if the regex engine should |
108003db | 585 | really be used, and various other control info that is needed to properly |
5a2b28ce KW |
586 | execute patterns in various contexts, such as whether the pattern is anchored in |
587 | some way, or what flags were used during the compile, or whether the |
588 | program contains special constructs that Perl needs to be aware of. | |
108003db | 589 | |
882227b7 | 590 | In addition it contains two fields that are intended for the private |
2edc787c FC |
591 | use of the regex engine that compiled the pattern. These are the |
592 | C<intflags> and C<pprivate> members. C<pprivate> is a void pointer to | |
5a2b28ce | 593 | an arbitrary structure, whose use and management are the responsibility |
2edc787c | 594 | of the compiling engine. Perl will never modify either of these |
882227b7 | 595 | values. |
108003db RGS |
596 | |
597 | typedef struct regexp { | |
598 | /* what engine created this regexp? */ | |
599 | const struct regexp_engine* engine; | |
600 | ||
601 | /* what re is this a lightweight copy of? */ | |
602 | struct regexp* mother_re; | |
603 | ||
5a2b28ce | 604 | /* Information about the match that the Perl core uses to manage |
02c01adb | 605 | * things */ |
108003db | 606 | U32 extflags; /* Flags used both externally and internally */ |
2d608413 KW |
607 | I32 minlen; /* minimum possible number of chars in |
608 | string to match */ |
609 | I32 minlenret; /* minimum possible number of chars in $& */ |
108003db RGS |
610 | U32 gofs; /* chars left of pos that we search from */ |
611 | ||
612 | /* substring data about strings that must appear | |
613 | in the final match, used for optimisations */ | |
614 | struct reg_substr_data *substrs; | |
615 | ||
c27a5cfe | 616 | U32 nparens; /* number of capture groups */ |
108003db RGS |
617 | |
618 | /* private engine specific data */ | |
619 | U32 intflags; /* Engine Specific Internal flags */ | |
620 | void *pprivate; /* Data private to the regex engine which | |
621 | created this object. */ | |
622 | ||
02c01adb KW |
623 | /* Data about the last/current match. These are modified during |
624 | * matching*/ | |
82f14494 DM |
625 | U32 lastparen; /* highest close paren matched ($+) */ |
626 | U32 lastcloseparen; /* last close paren matched ($^N) */ | |
108003db | 627 | regexp_paren_pair *swap; /* Swap copy of *offs */ |
02c01adb KW |
628 | regexp_paren_pair *offs; /* Array of offsets for (@-) and |
629 | (@+) */ | |
108003db | 630 | |
02c01adb KW |
631 | char *subbeg; /* saved or original string so \digit works |
632 | forever. */ | |
108003db RGS |
633 | SV_SAVED_COPY /* If non-NULL, SV which is COW from original */ |
634 | I32 sublen; /* Length of string pointed by subbeg */ | |
02c01adb KW |
635 | I32 suboffset; /* byte offset of subbeg from logical start of |
636 | str */ | |
6502e081 | 637 | I32 subcoffset; /* suboffset equiv, but in chars (for @-/@+) */ |
108003db RGS |
638 | |
639 | /* Information about the match that isn't often used */ | |
640 | I32 prelen; /* length of precomp */ | |
641 | const char *precomp; /* pre-compilation regular expression */ | |
642 | ||
108003db RGS |
643 | char *wrapped; /* wrapped version of the pattern */ |
644 | I32 wraplen; /* length of wrapped */ | |
645 | ||
02c01adb KW |
646 | I32 seen_evals; /* number of eval groups in the pattern - for |
647 | security checks */ | |
108003db RGS |
648 | HV *paren_names; /* Optional hash of paren names */ |
649 | ||
650 | /* Refcount of this regexp */ | |
651 | I32 refcnt; /* Refcount of this regexp */ | |
652 | } regexp; | |
653 | ||
654 | The fields are discussed in more detail below: | |
655 | ||
882227b7 | 656 | =head2 C<engine> |
108003db | 657 | |
5a2b28ce | 658 | This field points at a C<regexp_engine> structure which contains pointers |
2edc787c | 659 | to the subroutines that are to be used for performing a match. It |
108003db RGS |
660 | is the compiling routine's responsibility to populate this field before |
661 | returning the regexp object. | |
662 | ||
663 | Internally this is set to C<NULL> unless a custom engine is specified in | |
5a2b28ce | 664 | C<$^H{regcomp}>; Perl's own set of callbacks can be accessed in the struct |
108003db RGS |
665 | pointed to by C<RE_ENGINE_PTR>. |
666 | ||
882227b7 | 667 | =head2 C<mother_re> |
108003db RGS |
668 | |
669 | TODO, see L<http://www.mail-archive.com/perl5-changes@perl.org/msg17328.html> | |
670 | ||
882227b7 | 671 | =head2 C<extflags> |
108003db | 672 | |
5a2b28ce | 673 | This will be used by Perl to see what flags the regexp was compiled |
192b9cd1 | 674 | with; this will normally be set to the value of the flags parameter by |
2edc787c | 675 | the L<comp|/comp> callback. See the L<comp|/comp> documentation for |
c998b245 | 676 | valid flags. |
108003db | 677 | |
=head2 C<minlen> C<minlenret>

The minimum string length (in characters) required for the pattern to match.
This is used to prune the search space by not bothering to match any closer
to the end of a string than would allow a match. For instance, there is no
point in even starting the regex engine if the C<minlen> is 10 but the string
is only 5 characters long: there is no way that the pattern can match.

C<minlenret> is the minimum length (in characters) of the string that would
be found in C<$&> after a match.

The difference between C<minlen> and C<minlenret> can be seen in the
following pattern:

    /ns(?=\d)/

where the C<minlen> would be 3 but C<minlenret> would only be 2, as the C<\d>
is required to match but is not actually included in the matched content. This
distinction is particularly important as the substitution logic uses the
C<minlenret> to tell if it can do in-place substitutions (these can
result in considerable speed-up).
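The pruning described above can be sketched in plain C. This is only an illustration under stated assumptions, not code from the Perl core: the C<length_hints> struct and the functions C<worth_matching> and C<last_start_offset> are hypothetical names, and C<int> stands in for C<I32>.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two length fields of the regexp struct. */
typedef struct {
    int minlen;     /* shortest subject (in characters) that can match */
    int minlenret;  /* shortest text that can end up in $& (not used below) */
} length_hints;

/* Returns 1 if a subject of subject_chars characters is long enough for a
 * match to be possible, 0 if the engine need not be started at all. */
int worth_matching(const length_hints *h, size_t subject_chars)
{
    assert(h != NULL && h->minlen >= 0);
    return subject_chars >= (size_t)h->minlen;
}

/* Returns the last character offset at which a match may still begin,
 * or -1 if no start position leaves room for minlen characters. */
long last_start_offset(const length_hints *h, size_t subject_chars)
{
    if (!worth_matching(h, subject_chars))
        return -1;
    return (long)subject_chars - (long)h->minlen;
}
```

For C</ns(?=\d)/> the hints would be C<minlen> 3 and C<minlenret> 2, so a 5-character subject allows start offsets 0 through 2 and a 2-character subject is rejected without running the engine.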
108003db | 701 | |
=head2 C<gofs>

Left offset from C<pos()> to start the match at.

=head2 C<substrs>

Substring data about strings that must appear in the final match. This
is currently only used internally by Perl's engine, but might be
used in the future for all engines for optimisations.

=head2 C<nparens>, C<lastparen>, and C<lastcloseparen>

These fields are used to keep track of how many paren groups could be matched
in the pattern, which was the last open paren to be entered, and which was
the last close paren to be entered.

=head2 C<intflags>

The engine's private copy of the flags the pattern was compiled with. Usually
this is the same as C<extflags> unless the engine chose to modify one of them.

=head2 C<pprivate>

A C<void*> pointing to an engine-defined data structure. The Perl engine uses
the C<regexp_internal> structure (see L<perlreguts/Base Structures>) but a
custom engine should use something else.
729 | ||
=head2 C<swap>

Unused. Left in for compatibility with Perl 5.10.0.

=head2 C<offs>

An array of C<regexp_paren_pair> structures which define offsets into the
string being matched which correspond to the C<$&> and C<$1>, C<$2> etc.
captures. The C<regexp_paren_pair> struct is defined as follows:

    typedef struct regexp_paren_pair {
        I32 start;
        I32 end;
    } regexp_paren_pair;

If C<< ->offs[num].start >> or C<< ->offs[num].end >> is C<-1> then that
capture group did not match.
C<< ->offs[0].start/end >> represents C<$&> (or
C<${^MATCH}> under C<//p>) and C<< ->offs[paren].end >> matches C<$$paren> where
C<< $paren >= 1 >>.
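As a rough illustration of how the offsets above could be consumed, the following C sketch copies one capture out of the subject string. The names C<paren_pair> and C<copy_capture> are hypothetical, C<int> stands in for C<I32>, and the offsets are treated as byte offsets into a plain C string; none of this is taken from the Perl source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Mirrors the regexp_paren_pair layout, with int standing in for I32. */
typedef struct {
    int start;
    int end;
} paren_pair;

/* Copies capture group `paren` of `subject` into `out` (NUL-terminated).
 * Returns the capture length, or -1 if the group did not match or `out`
 * is too small. Index 0 corresponds to $&, index N to $N. */
int copy_capture(const char *subject, const paren_pair *offs, int paren,
                 char *out, size_t outsize)
{
    assert(subject != NULL && offs != NULL && out != NULL && paren >= 0);
    const paren_pair *p = &offs[paren];
    if (p->start == -1 || p->end == -1)
        return -1;                      /* group did not take part in the match */
    size_t len = (size_t)(p->end - p->start);
    if (len + 1 > outsize)
        return -1;                      /* no room for the text plus NUL */
    memcpy(out, subject + p->start, len);
    out[len] = '\0';
    return (int)len;
}
```

With the subject C<foo=bar> and offsets C<{0,7}>, C<{0,3}>, C<{4,7}>, C<{-1,-1}>, index 0 yields the whole match, index 2 yields C<bar>, and index 3 reports an unmatched group.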
750 | ||
=head2 C<precomp> C<prelen>

Used for optimisations. C<precomp> holds a copy of the pattern that
was compiled and C<prelen> its length. When a new pattern is to be
compiled (such as inside a loop) the internal C<regcomp> operator
checks if the last compiled C<REGEXP>'s C<precomp> and C<prelen>
are equivalent to the new one, and if so uses the old pattern instead
of compiling a new one.

The relevant snippet from C<Perl_pp_regcomp>:

    if (!re || !re->precomp || re->prelen != (I32)len ||
        memNE(re->precomp, t, len))
    /* Compile a new pattern */
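The same test can be written as a self-contained C function for experimentation. This is a sketch under assumptions: C<cached_re> and C<needs_recompile> are hypothetical names, C<int> replaces C<I32>, and C<memNE> is assumed to behave like a C<memcmp(...) != 0> comparison.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical reduced regexp holding only the two cached fields. */
typedef struct {
    const char *precomp;  /* copy of the pattern text that was compiled */
    int prelen;           /* length of precomp in bytes */
} cached_re;

/* Returns 1 if the pattern t[0..len) has to be compiled afresh,
 * 0 if the previously compiled regexp `re` can be reused. */
int needs_recompile(const cached_re *re, const char *t, size_t len)
{
    assert(t != NULL);
    return !re || !re->precomp || re->prelen != (int)len
        || memcmp(re->precomp, t, len) != 0;  /* assumed equivalent of memNE */
}
```

The length comparison comes first so that the byte comparison only runs when both patterns have the same length.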
108003db | 765 | |
=head2 C<paren_names>

This is a hash used internally to track named capture groups and their
offsets. The keys are the names of the buffers; the values are dualvars,
with the IV slot holding the number of buffers with the given name and the
PV being an embedded array of I32. The values may also be contained
independently in the data array in cases where named backreferences are
used.

=head2 C<substrs>

Holds information on the longest string that must occur at a fixed
offset from the start of the pattern, and the longest string that must
occur at a floating offset from the start of the pattern. Used to do
Fast-Boyer-Moore searches on the string to find out if it's worth using
the regex engine at all, and if so where in the string to search.
782 | ||
=head2 C<subbeg> C<sublen> C<saved_copy> C<suboffset> C<subcoffset>

Used during the execution phase for managing search and replace patterns,
and for providing the text for C<$&>, C<$1> etc. C<subbeg> points to a
buffer (either the original string, or a copy in the case of
C<RX_MATCH_COPIED(rx)>), and C<sublen> is the length of the buffer. The
C<RX_OFFS> start and end indices index into this buffer.

In the presence of the C<REXEC_COPY_STR> flag, but with the addition of
the C<REXEC_COPY_SKIP_PRE> or C<REXEC_COPY_SKIP_POST> flags, an engine
can choose not to copy the full buffer (although it must still do so in
the presence of C<RXf_PMf_KEEPCOPY> or the relevant bits being set in
C<PL_sawampersand>). In this case, it may set C<suboffset> to indicate the
number of bytes from the logical start of the buffer to the physical start
(i.e. C<subbeg>). It should also set C<subcoffset>, the number of
characters in the offset. The latter is needed to support C<@-> and C<@+>
which work in characters, not bytes.
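One way to picture the role of C<suboffset> is the following C sketch, which maps a capture range to a pointer into a partially copied buffer. It assumes that the start and end offsets are byte offsets relative to the logical start of the buffer; C<saved_buffer> and C<capture_ptr> are hypothetical names, and the real engine's bookkeeping may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduced view of the buffer-related regexp fields. */
typedef struct {
    const char *subbeg;   /* physical start of the saved buffer */
    size_t sublen;        /* length of the saved buffer in bytes */
    size_t suboffset;     /* bytes skipped between logical and physical start */
} saved_buffer;

/* Maps the logical byte range [start, end) to a pointer into the saved
 * buffer. Returns NULL if the range is unset or lies outside the part
 * of the string that was actually copied. */
const char *capture_ptr(const saved_buffer *b, long start, long end)
{
    assert(b != NULL && b->subbeg != NULL);
    if (start < 0 || end < start)
        return NULL;                       /* unmatched or malformed range */
    if ((size_t)start < b->suboffset)
        return NULL;                       /* begins before the copied part */
    size_t phys = (size_t)start - b->suboffset;
    size_t len = (size_t)(end - start);
    if (phys + len > b->sublen)
        return NULL;                       /* runs past the copied part */
    return b->subbeg + phys;
}
```

If only C<world> was saved from C<hello world>, then C<suboffset> is 6 and a capture at logical bytes 8 to 11 resolves to C<subbeg + 2>, while a capture in the skipped prefix cannot be served from the buffer.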
108003db | 800 | |
=head2 C<wrapped> C<wraplen>

Stores the string C<qr//> stringifies to. The Perl engine, for example,
stores C<(?^:eek)> in the case of C<qr/eek/>.

When using a custom engine that doesn't support the C<(?:)> construct
for inline modifiers, it's probably best to have C<qr//> stringify to
the supplied pattern. Note that this will create undesired patterns in
cases such as:

    my $x = qr/a|b/;  # "a|b"
    my $y = qr/c/i;   # "c"
    my $z = qr/$x$y/; # "a|bc"

There's no solution for this problem other than making the custom
engine understand a construct like C<(?:)>.
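The effect of the two stringification strategies can be reproduced with ordinary string formatting. The C function below is a hypothetical helper (C<interpolate> is not part of any Perl API) that only builds the pattern text an interpolation would produce, with or without a C<(?:)> wrapper around each piece.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes the text obtained by interpolating pattern strings x and y into
 * one pattern. With wrap != 0 each piece is enclosed in a (?:...) group,
 * otherwise the raw pattern texts are simply concatenated.
 * Returns the length written, or -1 if `out` is too small. */
int interpolate(char *out, size_t outsize, const char *x, const char *y,
                int wrap)
{
    assert(out != NULL && x != NULL && y != NULL);
    int n;
    if (wrap)
        n = snprintf(out, outsize, "(?:%s)(?:%s)", x, y);
    else
        n = snprintf(out, outsize, "%s%s", x, y);
    return (n >= 0 && (size_t)n < outsize) ? n : -1;
}
```

Without wrapping, C<a|b> followed by C<c> becomes C<a|bc>, where the alternation now covers C<bc>; with wrapping the result is C<(?:a|b)(?:c)>, which keeps the alternation confined to the first piece.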
108003db | 817 | |
=head2 C<seen_evals>

This stores the number of eval groups in the pattern. This is used for
security purposes when embedding compiled regexes into larger patterns
with C<qr//>.

=head2 C<refcnt>

The number of times the structure is referenced. When this falls to 0,
the regexp is automatically freed by a call to C<pregfree>. This should
be set to 1 in each engine's L</comp> routine.
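The reference-counting discipline can be sketched with a toy structure in C. Everything here is hypothetical (C<counted_re>, C<re_new>, C<re_ref>, C<re_unref>), and plain C<free()> stands in for the C<pregfree> call that Perl makes.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reduced regexp that carries only the reference count. */
typedef struct {
    int refcnt;
} counted_re;

/* Plays the role of a comp routine: the new object starts with one reference. */
counted_re *re_new(void)
{
    counted_re *re = malloc(sizeof *re);
    if (re != NULL)
        re->refcnt = 1;
    return re;
}

/* Registers one more holder of the object. */
counted_re *re_ref(counted_re *re)
{
    assert(re != NULL && re->refcnt > 0);
    re->refcnt++;
    return re;
}

/* Drops one reference. Frees the object and returns 1 when the count
 * reaches 0, otherwise returns 0 and leaves the object alive. */
int re_unref(counted_re *re)
{
    assert(re != NULL && re->refcnt > 0);
    if (--re->refcnt == 0) {
        free(re);
        return 1;
    }
    return 0;
}
```

Starting the count at 1 means the creator owns the first reference, so an object that is never shared is released by its first and only C<re_unref> call.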
830 | ||
=head1 HISTORY

Originally part of L<perlreguts>.

=head1 AUTHORS

Originally written by Yves Orton, expanded by E<AElig>var ArnfjE<ouml>rE<eth>
Bjarmason.

=head1 LICENSE

Copyright 2006 Yves Orton and 2007 E<AElig>var ArnfjE<ouml>rE<eth> Bjarmason.

This program is free software; you can redistribute it and/or modify it under
the same terms as Perl itself.

=cut