| 1 | =head1 NAME |
| 2 | |
| 3 | perlreapi - Perl regular expression plugin interface |
| 4 | |
| 5 | =head1 DESCRIPTION |
| 6 | |
| 7 | As of Perl 5.9.5 there is a new interface for plugging and using |
| 8 | regular expression engines other than the default one. |
| 9 | |
| 10 | Each engine is supposed to provide access to a constant structure of the |
| 11 | following format: |
| 12 | |
| 13 | typedef struct regexp_engine { |
| 14 | REGEXP* (*comp) (pTHX_ |
| 15 | const SV * const pattern, const U32 flags); |
| 16 | I32 (*exec) (pTHX_ |
| 17 | REGEXP * const rx, |
| 18 | char* stringarg, |
| 19 | char* strend, char* strbeg, |
| 20 | SSize_t minend, SV* sv, |
| 21 | void* data, U32 flags); |
| 22 | char* (*intuit) (pTHX_ |
| 23 | REGEXP * const rx, SV *sv, |
| 24 | const char * const strbeg, |
| 25 | char *strpos, char *strend, U32 flags, |
| 26 | struct re_scream_pos_data_s *data); |
| 27 | SV* (*checkstr) (pTHX_ REGEXP * const rx); |
| 28 | void (*free) (pTHX_ REGEXP * const rx); |
| 29 | void (*numbered_buff_FETCH) (pTHX_ |
| 30 | REGEXP * const rx, |
| 31 | const I32 paren, |
| 32 | SV * const sv); |
| 33 | void (*numbered_buff_STORE) (pTHX_ |
| 34 | REGEXP * const rx, |
| 35 | const I32 paren, |
| 36 | SV const * const value); |
| 37 | I32 (*numbered_buff_LENGTH) (pTHX_ |
| 38 | REGEXP * const rx, |
| 39 | const SV * const sv, |
| 40 | const I32 paren); |
| 41 | SV* (*named_buff) (pTHX_ |
| 42 | REGEXP * const rx, |
| 43 | SV * const key, |
| 44 | SV * const value, |
| 45 | U32 flags); |
| 46 | SV* (*named_buff_iter) (pTHX_ |
| 47 | REGEXP * const rx, |
| 48 | const SV * const lastkey, |
| 49 | const U32 flags); |
| 50 | SV* (*qr_package)(pTHX_ REGEXP * const rx); |
| 51 | #ifdef USE_ITHREADS |
| 52 | void* (*dupe) (pTHX_ REGEXP * const rx, CLONE_PARAMS *param); |
| 53 | #endif |
| 54 | REGEXP* (*op_comp) (...); |
| 55 | } regexp_engine; |
| 56 | |
| 57 | When a regexp is compiled, its C<engine> field is then set to point at |
| 58 | the appropriate structure, so that when it needs to be used Perl can find |
| 59 | the right routines to do so. |
| 60 | |
| 61 | In order to install a new regexp handler, C<$^H{regcomp}> is set |
| 62 | to an integer which (when cast appropriately) resolves to one of these |
| 63 | structures. When compiling, the C<comp> method is executed, and the |
| 64 | resulting C<regexp> structure's engine field is expected to point back at |
| 65 | the same structure. |
| 66 | |
| 67 | The pTHX_ symbol in the definition is a macro used by Perl under threading |
| 68 | to provide an extra argument to the routine holding a pointer back to |
| 69 | the interpreter that is executing the regexp. So under threading all |
| 70 | routines get an extra argument. |
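The dispatch described above can be pictured with a cut-down model in plain C. This is only a sketch: C<demo_engine>, C<demo_regexp>, C<substr_engine> and C<demo_match> are hypothetical stand-ins rather than the real C<regexp_engine>, C<REGEXP> or any Perl API, the C<pTHX_> argument is left out, and the "engine" merely performs a substring search.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct demo_regexp;

/* Hypothetical, cut-down analogue of regexp_engine: a table of callbacks. */
typedef struct demo_engine {
    struct demo_regexp *(*comp)(const char *pattern);
    int (*exec)(struct demo_regexp *rx, const char *subject);
    void (*free_rx)(struct demo_regexp *rx);
} demo_engine;

/* Hypothetical analogue of the regexp struct: it remembers its engine. */
typedef struct demo_regexp {
    const demo_engine *engine;
    char *pprivate;            /* engine-private data: a copy of the pattern */
} demo_regexp;

demo_regexp *substr_comp(const char *pattern);
int substr_exec(demo_regexp *rx, const char *subject);
void substr_free(demo_regexp *rx);

/* The constant structure that this toy engine exposes. */
const demo_engine substr_engine = { substr_comp, substr_exec, substr_free };

demo_regexp *substr_comp(const char *pattern)
{
    size_t len = strlen(pattern) + 1;
    demo_regexp *rx = malloc(sizeof *rx);
    assert(rx != NULL);
    rx->pprivate = malloc(len);
    assert(rx->pprivate != NULL);
    memcpy(rx->pprivate, pattern, len);
    rx->engine = &substr_engine;   /* point back at the compiling engine */
    return rx;
}

int substr_exec(demo_regexp *rx, const char *subject)
{
    return strstr(subject, rx->pprivate) != NULL;
}

void substr_free(demo_regexp *rx)
{
    free(rx->pprivate);
    free(rx);
}

/* The caller never names an engine; it dispatches through rx->engine. */
int demo_match(demo_regexp *rx, const char *subject)
{
    return rx->engine->exec(rx, subject);
}
```

Because the compiled object carries its own engine pointer, code that only holds a C<demo_regexp> can match and free it without knowing which engine produced it.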
| 71 | |
| 72 | =head1 Callbacks |
| 73 | |
| 74 | =head2 comp |
| 75 | |
| 76 | REGEXP* comp(pTHX_ const SV * const pattern, const U32 flags); |
| 77 | |
| 78 | Compile the pattern stored in C<pattern> using the given C<flags> and |
| 79 | return a pointer to a prepared C<REGEXP> structure that can perform |
| 80 | the match. See L</The REGEXP structure> below for an explanation of |
| 81 | the individual fields in the REGEXP struct. |
| 82 | |
| 83 | The C<pattern> parameter is the scalar that was used as the |
| 84 | pattern. Previous versions of Perl would pass two C<char*> indicating |
| 85 | the start and end of the stringified pattern; the following snippet can |
| 86 | be used to get the old parameters: |
| 87 | |
| 88 | STRLEN plen; |
| 89 | char* exp = SvPV(pattern, plen); |
| 90 | char* xend = exp + plen; |
| 91 | |
| 92 | Since any scalar can be passed as a pattern, it's possible to implement |
| 93 | an engine that does something with an array (C<< "ook" =~ [ qw/ eek |
| 94 | hlagh / ] >>) or with the non-stringified form of a compiled regular |
| 95 | expression (C<< "ook" =~ qr/eek/ >>). Perl's own engine will always |
| 96 | stringify everything using the snippet above, but that doesn't mean |
| 97 | other engines have to. |
| 98 | |
| 99 | The C<flags> parameter is a bitfield which indicates which of the |
| 100 | C<msixpn> flags the regex was compiled with. It also contains |
| 101 | additional info, such as if C<use locale> is in effect. |
| 102 | |
| 103 | The C<eogc> flags are stripped out before being passed to the comp |
| 104 | routine. The regex engine does not need to know if any of these |
| 105 | are set, as those flags should only affect what Perl does with the |
| 106 | pattern and its match variables, not how it gets compiled and |
| 107 | executed. |
| 108 | |
| 109 | By the time the comp callback is called, some of these flags have |
| 110 | already had effect (noted below where applicable). However most of |
| 111 | their effect occurs after the comp callback has run, in routines that |
| 112 | read the C<< rx->extflags >> field which it populates. |
| 113 | |
| 114 | In general the flags should be preserved in C<< rx->extflags >> after |
| 115 | compilation, although the regex engine might want to add or delete |
| 116 | some of them to invoke or disable some special behavior in Perl. The |
| 117 | flags along with any special behavior they cause are documented below: |
| 118 | |
| 119 | The pattern modifiers: |
| 120 | |
| 121 | =over 4 |
| 122 | |
| 123 | =item C</m> - RXf_PMf_MULTILINE |
| 124 | |
| 125 | If this is in C<< rx->extflags >> it will be passed to |
| 126 | C<Perl_fbm_instr> by C<pp_split> which will treat the subject string |
| 127 | as a multi-line string. |
| 128 | |
| 129 | =item C</s> - RXf_PMf_SINGLELINE |
| 130 | |
| 131 | =item C</i> - RXf_PMf_FOLD |
| 132 | |
| 133 | =item C</x> - RXf_PMf_EXTENDED |
| 134 | |
| 135 | If present on a regex, C<"#"> comments will be handled differently by the |
| 136 | tokenizer in some cases. |
| 137 | |
| 138 | TODO: Document those cases. |
| 139 | |
| 140 | =item C</p> - RXf_PMf_KEEPCOPY |
| 141 | |
| 142 | TODO: Document this |
| 143 | |
| 144 | =item Character set |
| 145 | |
| 146 | The character set rules are determined by an enum that is contained |
| 147 | in this field. This is still experimental and subject to change, but |
| 148 | the current interface returns the rules by use of the in-line function |
| 149 | C<get_regex_charset(const U32 flags)>. The only currently documented |
| 150 | value returned from it is REGEX_LOCALE_CHARSET, which is set if |
| 151 | C<use locale> is in effect. If present in C<< rx->extflags >>, |
| 152 | C<split> will use the locale dependent definition of whitespace |
| 153 | when RXf_SKIPWHITE or RXf_WHITE is in effect. ASCII whitespace |
| 154 | is defined as per L<isSPACE|perlapi/isSPACE>, and by the internal |
| 155 | macros C<is_utf8_space> under UTF-8, and C<isSPACE_LC> under C<use |
| 156 | locale>. |
| 157 | |
| 158 | =back |
| 159 | |
| 160 | Additional flags: |
| 161 | |
| 162 | =over 4 |
| 163 | |
| 164 | =item RXf_SPLIT |
| 165 | |
| 166 | This flag was removed in perl 5.18.0. C<split ' '> is now special-cased |
| 167 | solely in the parser. RXf_SPLIT is still #defined, so you can test for it. |
| 168 | This is how it used to work: |
| 169 | |
| 170 | If C<split> is invoked as C<split ' '> or with no arguments (which |
| 171 | really means C<split(' ', $_)>, see L<split|perlfunc/split>), Perl will |
| 172 | set this flag. The regex engine can then check for it and set the |
| 173 | SKIPWHITE and WHITE extflags. To do this, the Perl engine does: |
| 174 | |
| 175 | if (flags & RXf_SPLIT && r->prelen == 1 && r->precomp[0] == ' ') |
| 176 | r->extflags |= (RXf_SKIPWHITE|RXf_WHITE); |
| 177 | |
| 178 | =back |
| 179 | |
| 180 | These flags can be set during compilation to enable optimizations in |
| 181 | the C<split> operator. |
| 182 | |
| 183 | =over 4 |
| 184 | |
| 185 | =item RXf_SKIPWHITE |
| 186 | |
| 187 | This flag was removed in perl 5.18.0. It is still #defined, so you can |
| 188 | set it, but doing so will have no effect. This is how it used to work: |
| 189 | |
| 190 | If the flag is present in C<< rx->extflags >> C<split> will delete |
| 191 | whitespace from the start of the subject string before it's operated |
| 192 | on. What is considered whitespace depends on if the subject is a |
| 193 | UTF-8 string and if the C<RXf_PMf_LOCALE> flag is set. |
| 194 | |
| 195 | If RXf_WHITE is set in addition to this flag, C<split> will behave like |
| 196 | C<split " "> under the Perl engine. |
| 197 | |
| 198 | =item RXf_START_ONLY |
| 199 | |
| 200 | Tells the split operator to split the target string on newlines |
| 201 | (C<\n>) without invoking the regex engine. |
| 202 | |
| 203 | Perl's engine sets this if the pattern is C</^/> (C<plen == 1 && *exp |
| 204 | == '^'>), even under C</^/s>; see L<split|perlfunc/split>. Of course a |
| 205 | different regex engine might want to use the same optimizations |
| 206 | with a different syntax. |
| 207 | |
| 208 | =item RXf_WHITE |
| 209 | |
| 210 | Tells the split operator to split the target string on whitespace |
| 211 | without invoking the regex engine. The definition of whitespace varies |
| 212 | depending on if the target string is a UTF-8 string and on |
| 213 | if RXf_PMf_LOCALE is set. |
| 214 | |
| 215 | Perl's engine sets this flag if the pattern is C<\s+>. |
| 216 | |
| 217 | =item RXf_NULL |
| 218 | |
| 219 | Tells the split operator to split the target string on |
| 220 | characters. The definition of character varies depending on if |
| 221 | the target string is a UTF-8 string. |
| 222 | |
| 223 | Perl's engine sets this flag on empty patterns; this optimization |
| 224 | makes C<split //> much faster than it would otherwise be. It's even |
| 225 | faster than C<unpack>. |
| 226 | |
| 227 | =item RXf_NO_INPLACE_SUBST |
| 228 | |
| 229 | Added in perl 5.18.0, this flag indicates that a regular expression might |
| 230 | perform an operation that would interfere with inplace substitution. For |
| 231 | instance it might contain lookbehind, or assign to non-magical variables |
| 232 | (such as $REGMARK and $REGERROR) during matching. C<s///> will skip |
| 233 | certain optimisations when this is set. |
| 234 | |
| 235 | =back |
| 236 | |
| 237 | =head2 exec |
| 238 | |
| 239 | I32 exec(pTHX_ REGEXP * const rx, |
| 240 | char *stringarg, char* strend, char* strbeg, |
| 241 | SSize_t minend, SV* sv, |
| 242 | void* data, U32 flags); |
| 243 | |
| 244 | Execute a regexp. The arguments are |
| 245 | |
| 246 | =over 4 |
| 247 | |
| 248 | =item rx |
| 249 | |
| 250 | The regular expression to execute. |
| 251 | |
| 252 | =item sv |
| 253 | |
| 254 | This is the SV to be matched against. Note that the |
| 255 | actual char array to be matched against is supplied by the arguments |
| 256 | described below; the SV is just used to determine UTF8ness, C<pos()> etc. |
| 257 | |
| 258 | =item strbeg |
| 259 | |
| 260 | Pointer to the physical start of the string. |
| 261 | |
| 262 | =item strend |
| 263 | |
| 264 | Pointer to the character following the physical end of the string (i.e. |
| 265 | the C<\0>, if any). |
| 266 | |
| 267 | =item stringarg |
| 268 | |
| 269 | Pointer to the position in the string where matching should start; it might |
| 270 | not be equal to C<strbeg> (for example in a later iteration of C</.../g>). |
| 271 | |
| 272 | =item minend |
| 273 | |
| 274 | Minimum length of string (measured in bytes from C<stringarg>) that must |
| 275 | match; if the engine reaches the end of the match but hasn't reached this |
| 276 | position in the string, it should fail. |
| 277 | |
| 278 | =item data |
| 279 | |
| 280 | Optimisation data; subject to change. |
| 281 | |
| 282 | =item flags |
| 283 | |
| 284 | Optimisation flags; subject to change. |
| 285 | |
| 286 | =back |
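The C<minend> rule can be sketched as a single check made once a candidate match has been found. The function name C<demo_minend_ok> and the use of C<ptrdiff_t> are assumptions of this sketch; a real engine receives C<minend> as an C<SSize_t> and would fold the test into its own matching loop.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if a candidate match that ends at match_end has consumed at
 * least minend bytes counted from stringarg, and 0 if it must be rejected. */
int demo_minend_ok(const char *stringarg, const char *match_end,
                   ptrdiff_t minend)
{
    assert(stringarg != NULL && match_end != NULL);
    return (match_end - stringarg) >= minend;
}
```

Comparing the pointer difference with C<minend>, rather than computing C<stringarg + minend>, avoids forming a pointer past the end of the buffer.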
| 287 | |
| 288 | =head2 intuit |
| 289 | |
| 290 | char* intuit(pTHX_ |
| 291 | REGEXP * const rx, |
| 292 | SV *sv, |
| 293 | const char * const strbeg, |
| 294 | char *strpos, |
| 295 | char *strend, |
| 296 | const U32 flags, |
| 297 | struct re_scream_pos_data_s *data); |
| 298 | |
| 299 | Find the start position where a regex match should be attempted, |
| 300 | or possibly if the regex engine should not be run because the |
| 301 | pattern can't match. This is called, as appropriate, by the core, |
| 302 | depending on the values of the C<extflags> member of the C<regexp> |
| 303 | structure. |
| 304 | |
| 305 | Arguments: |
| 306 | |
| 307 | rx: the regex to match against |
| 308 | sv: the SV being matched: only used for utf8 flag; the string |
| 309 | itself is accessed via the pointers below. Note that on |
| 310 | something like an overloaded SV, SvPOK(sv) may be false |
| 311 | and the string pointers may point to something unrelated to |
| 312 | the SV itself. |
| 313 | strbeg: real beginning of string |
| 314 | strpos: the point in the string at which to begin matching |
| 315 | strend: pointer to the byte following the last char of the string |
| 316 | flags: currently unused; set to 0 |
| 317 | data: currently unused; set to NULL |
| 318 | |
| 319 | |
| 320 | =head2 checkstr |
| 321 | |
| 322 | SV* checkstr(pTHX_ REGEXP * const rx); |
| 323 | |
| 324 | Return an SV containing a string that must appear in the pattern. Used |
| 325 | by C<split> for optimising matches. |
| 326 | |
| 327 | =head2 free |
| 328 | |
| 329 | void free(pTHX_ REGEXP * const rx); |
| 330 | |
| 331 | Called by Perl when it is freeing a regexp pattern so that the engine |
| 332 | can release any resources pointed to by the C<pprivate> member of the |
| 333 | C<regexp> structure. This is only responsible for freeing private data; |
| 334 | Perl will handle releasing anything else contained in the C<regexp> structure. |
| 335 | |
| 336 | =head2 Numbered capture callbacks |
| 337 | |
| 338 | Called to get/set the value of C<$`>, C<$'>, C<$&> and their named |
| 339 | equivalents, ${^PREMATCH}, ${^POSTMATCH} and ${^MATCH}, as well as the |
| 340 | numbered capture groups (C<$1>, C<$2>, ...). |
| 341 | |
| 342 | The C<paren> parameter will be C<1> for C<$1>, C<2> for C<$2> and so |
| 343 | forth, and have these symbolic values for the special variables: |
| 344 | |
| 345 | ${^PREMATCH} RX_BUFF_IDX_CARET_PREMATCH |
| 346 | ${^POSTMATCH} RX_BUFF_IDX_CARET_POSTMATCH |
| 347 | ${^MATCH} RX_BUFF_IDX_CARET_FULLMATCH |
| 348 | $` RX_BUFF_IDX_PREMATCH |
| 349 | $' RX_BUFF_IDX_POSTMATCH |
| 350 | $& RX_BUFF_IDX_FULLMATCH |
| 351 | |
| 352 | Note that in Perl 5.17.3 and earlier, the last three constants were also |
| 353 | used for the caret variants of the variables. |
| 354 | |
| 355 | |
| 356 | The names have been chosen by analogy with L<Tie::Scalar> method |
| 357 | names with an additional B<LENGTH> callback for efficiency. However |
| 358 | named capture variables are currently not tied internally but |
| 359 | implemented via magic. |
| 360 | |
| 361 | =head3 numbered_buff_FETCH |
| 362 | |
| 363 | void numbered_buff_FETCH(pTHX_ REGEXP * const rx, const I32 paren, |
| 364 | SV * const sv); |
| 365 | |
| 366 | Fetch a specified numbered capture. C<sv> should be set to the scalar |
| 367 | to return. The scalar is passed as an argument rather than being |
| 368 | returned from the function because when it's called Perl already has a |
| 369 | scalar to store the value; creating another one would be |
| 370 | redundant. The scalar can be set with C<sv_setsv>, C<sv_setpvn> and |
| 371 | friends; see L<perlapi>. |
| 372 | |
| 373 | This callback is where Perl untaints its own capture variables under |
| 374 | taint mode (see L<perlsec>). See the C<Perl_reg_numbered_buff_fetch> |
| 375 | function in F<regcomp.c> for how to untaint capture variables if |
| 376 | that's something you'd like your engine to do as well. |
| 377 | |
| 378 | =head3 numbered_buff_STORE |
| 379 | |
| 380 | void (*numbered_buff_STORE) (pTHX_ |
| 381 | REGEXP * const rx, |
| 382 | const I32 paren, |
| 383 | SV const * const value); |
| 384 | |
| 385 | Set the value of a numbered capture variable. C<value> is the scalar |
| 386 | that is to be used as the new value. It's up to the engine to make |
| 387 | sure this is used as the new value (or reject it). |
| 388 | |
| 389 | Example: |
| 390 | |
| 391 | if ("ook" =~ /(o*)/) { |
| 392 | # 'paren' will be '1' and 'value' will be 'ee' |
| 393 | $1 =~ tr/o/e/; |
| 394 | } |
| 395 | |
| 396 | Perl's own engine will croak on any attempt to modify the capture |
| 397 | variables. To do this in another engine, use the following callback |
| 398 | (copied from C<Perl_reg_numbered_buff_store>): |
| 399 | |
| 400 | void |
| 401 | Example_reg_numbered_buff_store(pTHX_ |
| 402 | REGEXP * const rx, |
| 403 | const I32 paren, |
| 404 | SV const * const value) |
| 405 | { |
| 406 | PERL_UNUSED_ARG(rx); |
| 407 | PERL_UNUSED_ARG(paren); |
| 408 | PERL_UNUSED_ARG(value); |
| 409 | |
| 410 | if (!PL_localizing) |
| 411 | Perl_croak(aTHX_ PL_no_modify); |
| 412 | } |
| 413 | |
| 414 | Actually Perl will not I<always> croak in a statement that looks |
| 415 | like it would modify a numbered capture variable. This is because the |
| 416 | STORE callback will not be called if Perl can determine that it |
| 417 | doesn't have to modify the value. This is exactly how tied variables |
| 418 | behave in the same situation: |
| 419 | |
| 420 | package CaptureVar; |
| 421 | use parent 'Tie::Scalar'; |
| 422 | |
| 423 | sub TIESCALAR { bless [] } |
| 424 | sub FETCH { undef } |
| 425 | sub STORE { die "This doesn't get called" } |
| 426 | |
| 427 | package main; |
| 428 | |
| 429 | tie my $sv => "CaptureVar"; |
| 430 | $sv =~ y/a/b/; |
| 431 | |
| 432 | Because C<$sv> is C<undef> when the C<y///> operator is applied to it, |
| 433 | the transliteration won't actually execute and the program won't |
| 434 | C<die>. This is different to how 5.8 and earlier versions behaved |
| 435 | since the capture variables were READONLY variables then; now they'll |
| 436 | just die when assigned to in the default engine. |
| 437 | |
| 438 | =head3 numbered_buff_LENGTH |
| 439 | |
| 440 | I32 numbered_buff_LENGTH (pTHX_ |
| 441 | REGEXP * const rx, |
| 442 | const SV * const sv, |
| 443 | const I32 paren); |
| 444 | |
| 445 | Get the C<length> of a capture variable. There's a special callback |
| 446 | for this so that Perl doesn't have to do a FETCH and run C<length> on |
| 447 | the result. Since the length is (in Perl's case) known from an offset |
| 448 | stored in C<< rx->offs >>, this is much more efficient: |
| 449 | |
| 450 | I32 s1 = rx->offs[paren].start; |
| 451 | I32 s2 = rx->offs[paren].end; |
| 452 | I32 len = s2 - s1; |
| 453 | |
| 454 | This is a little bit more complex in the case of UTF-8, see what |
| 455 | C<Perl_reg_numbered_buff_length> does with |
| 456 | L<is_utf8_string_loclen|perlapi/is_utf8_string_loclen>. |
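To illustrate why UTF-8 complicates the calculation, here is a self-contained sketch that derives a capture's length from its byte offsets. It does not use the Perl API: C<demo_paren_pair>, C<demo_utf8_chars> and C<demo_capture_length> are hypothetical names, the input is assumed to be well-formed UTF-8, and returning 0 for a group that did not match is a choice made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for regexp_paren_pair: byte offsets, -1 if unset. */
typedef struct { long start; long end; } demo_paren_pair;

/* Count characters in a byte range, assuming well-formed UTF-8: every byte
 * that is not a continuation byte (10xxxxxx) begins a new character. */
size_t demo_utf8_chars(const unsigned char *s, size_t nbytes)
{
    size_t i, chars = 0;
    for (i = 0; i < nbytes; i++)
        if ((s[i] & 0xC0) != 0x80)
            chars++;
    return chars;
}

/* Length of one capture: bytes for a byte string, characters for UTF-8. */
long demo_capture_length(const char *subbeg, const demo_paren_pair *offs,
                         int paren, int is_utf8)
{
    long s1, s2;
    assert(subbeg != NULL && offs != NULL);
    s1 = offs[paren].start;
    s2 = offs[paren].end;
    if (s1 < 0 || s2 < 0)
        return 0;               /* the group took no part in the match */
    if (!is_utf8)
        return s2 - s1;         /* one byte per character */
    return (long)demo_utf8_chars((const unsigned char *)subbeg + s1,
                                 (size_t)(s2 - s1));
}
```

For a byte string the answer is a single subtraction; only the UTF-8 case needs to walk the bytes between the two offsets.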
| 457 | |
| 458 | =head2 Named capture callbacks |
| 459 | |
| 460 | Called to get/set the value of C<%+> and C<%->, as well as by some |
| 461 | utility functions in L<re>. |
| 462 | |
| 463 | There are two callbacks: C<named_buff> is called in all the cases the |
| 464 | FETCH, STORE, DELETE, CLEAR, EXISTS and SCALAR L<Tie::Hash> callbacks |
| 465 | would be on changes to C<%+> and C<%->, and C<named_buff_iter> in the |
| 466 | same cases as FIRSTKEY and NEXTKEY. |
| 467 | |
| 468 | The C<flags> parameter can be used to determine which of these |
| 469 | operations the callbacks should respond to. The following flags are |
| 470 | currently defined: |
| 471 | |
| 472 | Which L<Tie::Hash> operation is being performed from the Perl level on |
| 473 | C<%+> or C<%->, if any: |
| 474 | |
| 475 | RXapif_FETCH |
| 476 | RXapif_STORE |
| 477 | RXapif_DELETE |
| 478 | RXapif_CLEAR |
| 479 | RXapif_EXISTS |
| 480 | RXapif_SCALAR |
| 481 | RXapif_FIRSTKEY |
| 482 | RXapif_NEXTKEY |
| 483 | |
| 484 | Whether C<%+> or C<%-> is being operated on, if any: |
| 485 | |
| 486 | RXapif_ONE /* %+ */ |
| 487 | RXapif_ALL /* %- */ |
| 488 | |
| 489 | Whether this is being called as C<re::regname>, C<re::regnames> or |
| 490 | C<re::regnames_count>, if any. The first two will be combined with |
| 491 | C<RXapif_ONE> or C<RXapif_ALL>. |
| 492 | |
| 493 | RXapif_REGNAME |
| 494 | RXapif_REGNAMES |
| 495 | RXapif_REGNAMES_COUNT |
| 496 | |
| 497 | Internally C<%+> and C<%-> are implemented with a real tied interface |
| 498 | via L<Tie::Hash::NamedCapture>. The methods in that package will call |
| 499 | back into these functions. However the usage of |
| 500 | L<Tie::Hash::NamedCapture> for this purpose might change in future |
| 501 | releases. For instance this might be implemented by magic instead |
| 502 | (would need an extension to mgvtbl). |
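A C<named_buff> implementation typically begins by inspecting C<flags> to decide what is being asked of it. The sketch below shows one possible shape for that dispatch. The C<DEMO_*> constants and their numeric values are invented for this example (real code would test the C<RXapif_*> macros from F<regexp.h>), and refusing every write is merely a policy chosen for the sketch.

```c
#include <assert.h>

/* Hypothetical bit values, for this demo only. */
enum {
    DEMO_FETCH  = 0x01, DEMO_STORE  = 0x02, DEMO_DELETE = 0x04,
    DEMO_CLEAR  = 0x08, DEMO_EXISTS = 0x10, DEMO_SCALAR = 0x20,
    DEMO_ONE    = 0x100,   /* operating on %+ */
    DEMO_ALL    = 0x200    /* operating on %- */
};

typedef enum {
    ACT_READ, ACT_TEST, ACT_COUNT, ACT_REFUSE, ACT_UNKNOWN
} demo_action;

/* Decide what a named_buff-style callback should do for these flags. */
demo_action demo_named_buff_action(unsigned flags)
{
    if (flags & (DEMO_STORE | DEMO_DELETE | DEMO_CLEAR))
        return ACT_REFUSE;     /* this sketch treats the hashes as read-only */
    if (flags & DEMO_FETCH)
        return ACT_READ;
    if (flags & DEMO_EXISTS)
        return ACT_TEST;
    if (flags & DEMO_SCALAR)
        return ACT_COUNT;
    return ACT_UNKNOWN;
}

/* Non-zero when the operation targets %- (all groups) rather than %+. */
int demo_wants_all(unsigned flags)
{
    return (flags & DEMO_ALL) != 0;
}
```

The operation bits and the C<%+>/C<%-> selector are independent, so they are tested separately rather than in one C<switch>.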
| 503 | |
| 504 | =head3 named_buff |
| 505 | |
| 506 | SV* (*named_buff) (pTHX_ REGEXP * const rx, SV * const key, |
| 507 | SV * const value, U32 flags); |
| 508 | |
| 509 | =head3 named_buff_iter |
| 510 | |
| 511 | SV* (*named_buff_iter) (pTHX_ |
| 512 | REGEXP * const rx, |
| 513 | const SV * const lastkey, |
| 514 | const U32 flags); |
| 515 | |
| 516 | =head2 qr_package |
| 517 | |
| 518 | SV* qr_package(pTHX_ REGEXP * const rx); |
| 519 | |
| 520 | The package the qr// magic object is blessed into (as seen by C<ref |
| 521 | qr//>). It is recommended that engines change this to their package |
| 522 | name for identification regardless of if they implement methods |
| 523 | on the object. |
| 524 | |
| 525 | The package this method returns should also have the internal |
| 526 | C<Regexp> package in its C<@ISA>. C<< qr//->isa("Regexp") >> should always |
| 527 | be true regardless of what engine is being used. |
| 528 | |
| 529 | Example implementation might be: |
| 530 | |
| 531 | SV* |
| 532 | Example_qr_package(pTHX_ REGEXP * const rx) |
| 533 | { |
| 534 | PERL_UNUSED_ARG(rx); |
| 535 | return newSVpvs("re::engine::Example"); |
| 536 | } |
| 537 | |
| 538 | Any method calls on an object created with C<qr//> will be dispatched to the |
| 539 | package as a normal object. |
| 540 | |
| 541 | use re::engine::Example; |
| 542 | my $re = qr//; |
| 543 | $re->meth; # dispatched to re::engine::Example::meth() |
| 544 | |
| 545 | To retrieve the C<REGEXP> object from the scalar in an XS function use |
| 546 | the C<SvRX> macro, see L<"REGEXP Functions" in perlapi|perlapi/REGEXP |
| 547 | Functions>. |
| 548 | |
| 549 | void meth(SV * rv) |
| 550 | PPCODE: |
| 551 | REGEXP * re = SvRX(rv); |
| 552 | |
| 553 | =head2 dupe |
| 554 | |
| 555 | void* dupe(pTHX_ REGEXP * const rx, CLONE_PARAMS *param); |
| 556 | |
| 557 | On threaded builds a regexp may need to be duplicated so that the pattern |
| 558 | can be used by multiple threads. This routine is expected to handle the |
| 559 | duplication of any private data pointed to by the C<pprivate> member of |
| 560 | the C<regexp> structure. It will be called with the preconstructed new |
| 561 | C<regexp> structure as an argument; the C<pprivate> member will point at |
| 562 | the B<old> private structure, and it is this routine's responsibility to |
| 563 | construct a copy and return a pointer to it (which Perl will then use to |
| 564 | overwrite the field as passed to this routine.) |
| 565 | |
| 566 | This allows the engine to dupe its private data but also if necessary |
| 567 | modify the final structure if it really must. |
| 568 | |
| 569 | On unthreaded builds this field doesn't exist. |
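The contract described above (receive the new structure whose C<pprivate> still points at the old data, and return a fresh copy) can be sketched without the Perl headers. Here C<demo_private> and C<demo_regexp> are hypothetical types, the C<pTHX_> and C<CLONE_PARAMS> arguments are omitted, and plain C<malloc> stands in for whatever allocator a real engine would use.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical engine-private data: a compiled program of nbytes bytes. */
typedef struct { size_t nbytes; char *program; } demo_private;

/* Hypothetical stand-in for the regexp struct, reduced to one field. */
typedef struct { void *pprivate; } demo_regexp;

/* On entry new_rx->pprivate still points at the OLD private data. Return a
 * deep copy for the caller to store, or NULL if allocation fails. */
void *demo_dupe(const demo_regexp *new_rx)
{
    const demo_private *old;
    demo_private *copy;

    assert(new_rx != NULL && new_rx->pprivate != NULL);
    old = new_rx->pprivate;

    copy = malloc(sizeof *copy);
    if (copy == NULL)
        return NULL;
    copy->nbytes = old->nbytes;
    copy->program = malloc(old->nbytes ? old->nbytes : 1);
    if (copy->program == NULL) {
        free(copy);
        return NULL;
    }
    memcpy(copy->program, old->program, old->nbytes);
    return copy;
}
```

The routine never writes to C<new_rx> itself; storing the returned pointer is left to the caller, mirroring the way Perl overwrites the field after the callback returns.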
| 570 | |
| 571 | =head2 op_comp |
| 572 | |
| 573 | This is private to the Perl core and subject to change. Should be left |
| 574 | null. |
| 575 | |
| 576 | =head1 The REGEXP structure |
| 577 | |
| 578 | The REGEXP struct is defined in F<regexp.h>. |
| 579 | All regex engines must be able to |
| 580 | correctly build such a structure in their L</comp> routine. |
| 581 | |
| 582 | The REGEXP structure contains all the data that Perl needs to be aware of |
| 583 | to properly work with the regular expression. It includes data about |
| 584 | optimisations that Perl can use to determine if the regex engine should |
| 585 | really be used, and various other control info that is needed to properly |
| 586 | execute patterns in various contexts, such as whether the pattern is anchored in |
| 587 | some way, or what flags were used during the compile, or if the |
| 588 | program contains special constructs that Perl needs to be aware of. |
| 589 | |
| 590 | In addition it contains two fields that are intended for the private |
| 591 | use of the regex engine that compiled the pattern. These are the |
| 592 | C<intflags> and C<pprivate> members. C<pprivate> is a void pointer to |
| 593 | an arbitrary structure, whose use and management is the responsibility |
| 594 | of the compiling engine. Perl will never modify either of these |
| 595 | values. |
| 596 | |
| 597 | typedef struct regexp { |
| 598 | /* what engine created this regexp? */ |
| 599 | const struct regexp_engine* engine; |
| 600 | |
| 601 | /* what re is this a lightweight copy of? */ |
| 602 | struct regexp* mother_re; |
| 603 | |
| 604 | /* Information about the match that the Perl core uses to manage |
| 605 | * things */ |
| 606 | U32 extflags; /* Flags used both externally and internally */ |
| 607 | I32 minlen; /* minimum possible number of chars in |
| 608 | string to match */ |
| 609 | I32 minlenret; /* minimum possible number of chars in $& */ |
| 610 | U32 gofs; /* chars left of pos that we search from */ |
| 611 | |
| 612 | /* substring data about strings that must appear |
| 613 | in the final match, used for optimisations */ |
| 614 | struct reg_substr_data *substrs; |
| 615 | |
| 616 | U32 nparens; /* number of capture groups */ |
| 617 | |
| 618 | /* private engine specific data */ |
| 619 | U32 intflags; /* Engine Specific Internal flags */ |
| 620 | void *pprivate; /* Data private to the regex engine which |
| 621 | created this object. */ |
| 622 | |
| 623 | /* Data about the last/current match. These are modified during |
| 624 | * matching*/ |
| 625 | U32 lastparen; /* highest close paren matched ($+) */ |
| 626 | U32 lastcloseparen; /* last close paren matched ($^N) */ |
| 627 | regexp_paren_pair *offs; /* Array of offsets for (@-) and |
| 628 | (@+) */ |
| 629 | |
| 630 | char *subbeg; /* saved or original string so \digit works |
| 631 | forever. */ |
| 632 | SV_SAVED_COPY /* If non-NULL, SV which is COW from original */ |
| 633 | I32 sublen; /* Length of string pointed by subbeg */ |
| 634 | I32 suboffset; /* byte offset of subbeg from logical start of |
| 635 | str */ |
| 636 | I32 subcoffset; /* suboffset equiv, but in chars (for @-/@+) */ |
| 637 | |
| 638 | /* Information about the match that isn't often used */ |
| 639 | I32 prelen; /* length of precomp */ |
| 640 | const char *precomp; /* pre-compilation regular expression */ |
| 641 | |
| 642 | char *wrapped; /* wrapped version of the pattern */ |
| 643 | I32 wraplen; /* length of wrapped */ |
| 644 | |
| 645 | I32 seen_evals; /* number of eval groups in the pattern - for |
| 646 | security checks */ |
| 647 | HV *paren_names; /* Optional hash of paren names */ |
| 648 | |
| 649 | /* Refcount of this regexp */ |
| 650 | I32 refcnt; /* Refcount of this regexp */ |
| 651 | } regexp; |
| 652 | |
| 653 | The fields are discussed in more detail below: |
| 654 | |
| 655 | =head2 C<engine> |
| 656 | |
| 657 | This field points at a C<regexp_engine> structure which contains pointers |
| 658 | to the subroutines that are to be used for performing a match. It |
| 659 | is the compiling routine's responsibility to populate this field before |
| 660 | returning the regexp object. |
| 661 | |
| 662 | Internally this is set to C<NULL> unless a custom engine is specified in |
| 663 | C<$^H{regcomp}>. Perl's own set of callbacks can be accessed in the struct |
| 664 | pointed to by C<RE_ENGINE_PTR>. |
| 665 | |
| 666 | =head2 C<mother_re> |
| 667 | |
| 668 | TODO, see L<http://www.mail-archive.com/perl5-changes@perl.org/msg17328.html> |
| 669 | |
| 670 | =head2 C<extflags> |
| 671 | |
| 672 | This will be used by Perl to see what flags the regexp was compiled |
| 673 | with. It will normally be set to the value of the flags parameter by |
| 674 | the L<comp|/comp> callback. See the L<comp|/comp> documentation for |
| 675 | valid flags. |
| 676 | |
| 677 | =head2 C<minlen> C<minlenret> |
| 678 | |
| 679 | The minimum string length (in characters) required for the pattern to match. |
| 680 | This is used to |
| 681 | prune the search space by not bothering to match any closer to the end of a |
| 682 | string than would allow a match. For instance there is no point in even |
| 683 | starting the regex engine if the minlen is 10 but the string is only 5 |
| 684 | characters long. There is no way that the pattern can match. |
| 685 | |
| 686 | C<minlenret> is the minimum length (in characters) of the string that would |
| 687 | be found in $& after a match. |
| 688 | |
| 689 | The difference between C<minlen> and C<minlenret> can be seen in the |
| 690 | following pattern: |
| 691 | |
| 692 | /ns(?=\d)/ |
| 693 | |
| 694 | where the C<minlen> would be 3 but C<minlenret> would only be 2 as the \d is |
| 695 | required to match but is not actually |
| 696 | included in the matched content. This |
| 697 | distinction is particularly important as the substitution logic uses the |
| 698 | C<minlenret> to tell if it can do in-place substitutions (these can |
| 699 | result in considerable speed-up). |
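The pruning that C<minlen> allows amounts to a little arithmetic on character counts. The helper below is a hypothetical illustration, not code from Perl: given the subject length and C<minlen>, both in characters, it returns the last start position worth trying, or -1 when no match is possible.

```c
#include <assert.h>

/* Last character offset at which a match of at least minlen characters
 * could still begin, or -1 if the subject is too short to match at all. */
long demo_last_start(long subject_chars, long minlen)
{
    assert(subject_chars >= 0 && minlen >= 0);
    return subject_chars >= minlen ? subject_chars - minlen : -1;
}
```

With C<minlen> of 10 and a 5-character subject the result is -1, which corresponds to the case above where the engine need not be started.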
| 700 | |
| 701 | =head2 C<gofs> |
| 702 | |
| 703 | Left offset from pos() to start match at. |
| 704 | |
| 705 | =head2 C<substrs> |
| 706 | |
| 707 | Substring data about strings that must appear in the final match. This |
| 708 | is currently only used internally by Perl's engine, but might be |
| 709 | used in the future for all engines for optimisations. |
| 710 | |
| 711 | =head2 C<nparens>, C<lastparen>, and C<lastcloseparen> |
| 712 | |
| 713 | These fields are used to keep track of: how many paren capture groups |
| 714 | there are in the pattern; which was the highest paren to be closed (see |
| 715 | L<perlvar/$+>); and which was the most recent paren to be closed (see |
| 716 | L<perlvar/$^N>). |
| 717 | |
| 718 | =head2 C<intflags> |
| 719 | |
| 720 | The engine's private copy of the flags the pattern was compiled with. Usually |
| 721 | this is the same as C<extflags> unless the engine chose to modify one of them. |
| 722 | |
| 723 | =head2 C<pprivate> |
| 724 | |
| 725 | A void* pointing to an engine-defined |
| 726 | data structure. The Perl engine uses the |
| 727 | C<regexp_internal> structure (see L<perlreguts/Base Structures>) but a custom |
| 728 | engine should use something else. |
| 729 | |
| 730 | =head2 C<offs> |
| 731 | |
| 732 | An array of C<regexp_paren_pair> structures which define offsets into the string being |
| 733 | matched which correspond to the C<$&> and C<$1>, C<$2> etc. captures. The |
| 734 | C<regexp_paren_pair> struct is defined as follows: |
| 735 | |
| 736 | typedef struct regexp_paren_pair { |
| 737 | I32 start; |
| 738 | I32 end; |
| 739 | } regexp_paren_pair; |
| 740 | |
| 741 | If C<< ->offs[num].start >> or C<< ->offs[num].end >> is C<-1> then that |
| 742 | capture group did not match. |
| 743 | C<< ->offs[0].start/end >> represents C<$&> (or |
| 744 | C<${^MATCH}> under C</p>) and C<< ->offs[paren].end >> matches C<$$paren> where |
| 745 | C<$paren >= 1>. |
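As an illustration of how these offsets are consumed, the following sketch copies one capture out of a subject string and honours the C<-1> convention. C<demo_paren_pair> and C<demo_copy_capture> are hypothetical; the sketch assumes byte offsets relative to the start of C<subject> and ignores C<suboffset>.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for regexp_paren_pair. */
typedef struct { long start; long end; } demo_paren_pair;

/* Copy capture number paren into out as a NUL-terminated string.
 * Returns its byte length, or -1 if the group did not match or the
 * output buffer is too small. */
long demo_copy_capture(const char *subject, const demo_paren_pair *offs,
                       int paren, char *out, size_t outsize)
{
    long start, end;
    size_t len;

    assert(subject != NULL && offs != NULL && out != NULL);
    start = offs[paren].start;
    end = offs[paren].end;
    if (start == -1 || end == -1)
        return -1;                  /* this group did not match */
    len = (size_t)(end - start);
    if (len + 1 > outsize)
        return -1;                  /* no room for the text plus NUL */
    memcpy(out, subject + start, len);
    out[len] = '\0';
    return (long)len;
}
```

Index 0 yields the text of C<$&>, and index C<n> yields C<$n> for C<n E<gt>= 1>.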
| 746 | |
| 747 | =head2 C<precomp> C<prelen> |
| 748 | |
| 749 | Used for optimisations. C<precomp> holds a copy of the pattern that |
| 750 | was compiled and C<prelen> its length. When a new pattern is to be |
| 751 | compiled (such as inside a loop) the internal C<regcomp> operator |
| 752 | checks if the last compiled C<REGEXP>'s C<precomp> and C<prelen> |
| 753 | are equivalent to the new one, and if so uses the old pattern instead |
| 754 | of compiling a new one. |
| 755 | |
| 756 | The relevant snippet from C<Perl_pp_regcomp>: |
| 757 | |
| 758 | if (!re || !re->precomp || re->prelen != (I32)len || |
| 759 | memNE(re->precomp, t, len)) |
| 760 | /* Compile a new pattern */ |
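The same test can be written as a standalone predicate. This is a sketch rather than the core's code: C<demo_rx> and C<demo_needs_recompile> are hypothetical names, and C<memcmp> stands in for Perl's C<memNE> macro.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in holding just the two fields involved. */
typedef struct { const char *precomp; size_t prelen; } demo_rx;

/* Returns 1 when the new pattern text differs from the cached one, so a
 * fresh compile is needed, and 0 when the cached regexp can be reused. */
int demo_needs_recompile(const demo_rx *re, const char *pat, size_t len)
{
    assert(pat != NULL);
    return re == NULL
        || re->precomp == NULL
        || re->prelen != len
        || memcmp(re->precomp, pat, len) != 0;
}
```

Checking the length first means the byte comparison only runs when the two patterns could actually be equal.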
| 761 | |
| 762 | =head2 C<paren_names> |
| 763 | |
| 764 | This is a hash used internally to track named capture groups and their |
| 765 | offsets. The keys are the names of the buffers; the values are dualvars, |
| 766 | with the IV slot holding the number of buffers with the given name and the |
| 767 | pv being an embedded array of I32. The values may also be contained |
| 768 | independently in the data array in cases where named backreferences are |
| 769 | used. |
| 770 | |
| 771 | =head2 C<substrs> |
| 772 | |
| 773 | Holds information on the longest string that must occur at a fixed |
| 774 | offset from the start of the pattern, and the longest string that must |
| 775 | occur at a floating offset from the start of the pattern. Used to do |
| 776 | Fast-Boyer-Moore searches on the string to find out if it's worth using |
| 777 | the regex engine at all, and if so where in the string to search. |
| 778 | |
| 779 | =head2 C<subbeg> C<sublen> C<saved_copy> C<suboffset> C<subcoffset> |
| 780 | |
| 781 | Used during the execution phase for managing search and replace patterns, |
| 782 | and for providing the text for C<$&>, C<$1> etc. C<subbeg> points to a |
| 783 | buffer (either the original string, or a copy in the case of |
| 784 | C<RX_MATCH_COPIED(rx)>), and C<sublen> is the length of the buffer. The |
| 785 | C<RX_OFFS> start and end indices index into this buffer. |
| 786 | |
| 787 | In the presence of the C<REXEC_COPY_STR> flag, but with the addition of |
| 788 | the C<REXEC_COPY_SKIP_PRE> or C<REXEC_COPY_SKIP_POST> flags, an engine |
| 789 | can choose not to copy the full buffer (although it must still do so in |
| 790 | the presence of C<RXf_PMf_KEEPCOPY> or the relevant bits being set in |
| 791 | C<PL_sawampersand>). In this case, it may set C<suboffset> to indicate the |
| 792 | number of bytes from the logical start of the buffer to the physical start |
| 793 | (i.e. C<subbeg>). It should also set C<subcoffset>, the number of |
| 794 | characters in the offset. The latter is needed to support C<@-> and C<@+> |
| 795 | which work in characters, not bytes. |
| 796 | |
| 797 | =head2 C<wrapped> C<wraplen> |
| 798 | |
| 799 | Stores the string C<qr//> stringifies to. The Perl engine for example |
| 800 | stores C<(?^:eek)> in the case of C<qr/eek/>. |
| 801 | |
| 802 | When using a custom engine that doesn't support the C<(?:)> construct |
| 803 | for inline modifiers, it's probably best to have C<qr//> stringify to |
| 804 | the supplied pattern. Note that this will create undesired patterns in |
| 805 | cases such as: |
| 806 | |
| 807 | my $x = qr/a|b/; # "a|b" |
| 808 | my $y = qr/c/i; # "c" |
| 809 | my $z = qr/$x$y/; # "a|bc" |
| 810 | |
| 811 | There's no solution for this problem other than making the custom |
| 812 | engine understand a construct like C<(?:)>. |
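For an engine that does understand a grouping construct, building the wrapped form is a small string operation. The sketch below uses a hypothetical helper, C<demo_wrap>, and plain C strings; it adds only the C<(?:)> group and makes no attempt to encode modifiers such as C</i>, which a real engine would also need to carry.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write "(?:pattern)" into out. Returns 0 on success, or -1 if the result
 * (including its terminating NUL) does not fit in outsize bytes. */
int demo_wrap(const char *pattern, char *out, size_t outsize)
{
    int n;
    assert(pattern != NULL && out != NULL);
    n = snprintf(out, outsize, "(?:%s)", pattern);
    return (n >= 0 && (size_t)n < outsize) ? 0 : -1;
}
```

Wrapping C<a|b> and C<c> this way and concatenating the results gives C<(?:a|b)(?:c)>, so the alternation no longer leaks into the following pattern as it does in C<a|bc>.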
| 813 | |
| 814 | =head2 C<seen_evals> |
| 815 | |
| 816 | This stores the number of eval groups in |
| 817 | the pattern. This is used for security |
| 818 | purposes when embedding compiled regexes into larger patterns with C<qr//>. |
| 819 | |
| 820 | =head2 C<refcnt> |
| 821 | |
| 822 | The number of times the structure is referenced. When |
| 823 | this falls to 0, the regexp is automatically freed |
| 824 | by a call to pregfree. This should be set to 1 in |
| 825 | each engine's L</comp> routine. |
| 826 | |
| 827 | =head1 HISTORY |
| 828 | |
| 829 | Originally part of L<perlreguts>. |
| 830 | |
| 831 | =head1 AUTHORS |
| 832 | |
| 833 | Originally written by Yves Orton, expanded by E<AElig>var ArnfjE<ouml>rE<eth> |
| 834 | Bjarmason. |
| 835 | |
| 836 | =head1 LICENSE |
| 837 | |
| 838 | Copyright 2006 Yves Orton and 2007 E<AElig>var ArnfjE<ouml>rE<eth> Bjarmason. |
| 839 | |
| 840 | This program is free software; you can redistribute it and/or modify it under |
| 841 | the same terms as Perl itself. |
| 842 | |
| 843 | =cut |