=head1 NAME

perlreapi - perl regular expression plugin interface

=head1 DESCRIPTION

As of Perl 5.9.5 there is a new interface for plugging in and using
regular expression engines other than the default one.

Each engine is supposed to provide access to a constant structure of the
following format:

    typedef struct regexp_engine {
        REGEXP* (*comp) (pTHX_ const SV * const pattern, const U32 flags);
        I32     (*exec) (pTHX_ REGEXP * const rx, char* stringarg, char* strend,
                         char* strbeg, I32 minend, SV* screamer,
                         void* data, U32 flags);
        char*   (*intuit) (pTHX_ REGEXP * const rx, SV *sv, char *strpos,
                           char *strend, U32 flags,
                           struct re_scream_pos_data_s *data);
        SV*     (*checkstr) (pTHX_ REGEXP * const rx);
        void    (*free) (pTHX_ REGEXP * const rx);
        void    (*numbered_buff_FETCH) (pTHX_ REGEXP * const rx, const I32 paren,
                                        SV * const sv);
        void    (*numbered_buff_STORE) (pTHX_ REGEXP * const rx, const I32 paren,
                                        SV const * const value);
        I32     (*numbered_buff_LENGTH) (pTHX_ REGEXP * const rx, const SV * const sv,
                                         const I32 paren);
        SV*     (*named_buff) (pTHX_ REGEXP * const rx, SV * const key,
                               SV * const value, U32 flags);
        SV*     (*named_buff_iter) (pTHX_ REGEXP * const rx, const SV * const lastkey,
                                    const U32 flags);
        SV*     (*qr_package)(pTHX_ REGEXP * const rx);
    #ifdef USE_ITHREADS
        void*   (*dupe) (pTHX_ REGEXP * const rx, CLONE_PARAMS *param);
    #endif
        REGEXP* (*op_comp) (...);
    } regexp_engine;

When a regexp is compiled, its C<engine> field is then set to point at
the appropriate structure, so that when it needs to be used Perl can find
the right routines to do so.

In order to install a new regexp handler, C<$^H{regcomp}> is set
to an integer which (when cast appropriately) resolves to one of these
structures. When compiling, the C<comp> method is executed, and the
resulting regexp structure's engine field is expected to point back at
the same structure.

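The round trip through an integer described above can be sketched in
plain C. This is a minimal standalone illustration, not code from the
perl core: C<sketch_engine>, C<my_engine>, C<engine_to_hint> and
C<hint_to_engine> are hypothetical names, and C<uintptr_t> stands in for
perl's integer type. A real engine would normally use the C<PTR2IV> and
C<INT2PTR> macros for the two casts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for regexp_engine: a single callback slot. */
typedef struct sketch_engine {
    int (*comp)(const char *pattern);
} sketch_engine;

/* Toy comp callback: accepts any non-NULL pattern. */
int sketch_comp(const char *pattern)
{
    return pattern != NULL;
}

/* The constant structure an engine would expose. */
const sketch_engine my_engine = { sketch_comp };

/* Store direction: turn the structure's address into an integer,
 * as would be placed in $^H{regcomp}. */
uintptr_t engine_to_hint(const sketch_engine *eng)
{
    assert(eng != NULL);
    return (uintptr_t)eng;
}

/* Fetch direction: cast the integer back to a structure pointer. */
const sketch_engine *hint_to_engine(uintptr_t hint)
{
    return (const sketch_engine *)hint;
}
```

Calling C<hint_to_engine(engine_to_hint(&my_engine))> yields
C<&my_engine> again, which is the property the C<comp> method relies on
when it points the new regexp's engine field back at the same structure.
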
The pTHX_ symbol in the definition is a macro used by perl under threading
to provide an extra argument to the routine holding a pointer back to
the interpreter that is executing the regexp. So under threading all
routines get an extra argument.

=head1 Callbacks

=head2 comp

    REGEXP* comp(pTHX_ const SV * const pattern, const U32 flags);

Compile the pattern stored in C<pattern> using the given C<flags> and
return a pointer to a prepared C<REGEXP> structure that can perform
the match. See L</The REGEXP structure> below for an explanation of
the individual fields in the REGEXP struct.

The C<pattern> parameter is the scalar that was used as the
pattern. Previous versions of perl would pass two C<char*> indicating
the start and end of the stringified pattern; the following snippet can
be used to get the old parameters:

    STRLEN plen;
    char*  exp = SvPV(pattern, plen);
    char* xend = exp + plen;

Since any scalar can be passed as a pattern it's possible to implement
an engine that does something with an array (C<< "ook" =~ [ qw/ eek
hlagh / ] >>) or with the non-stringified form of a compiled regular
expression (C<< "ook" =~ qr/eek/ >>). Perl's own engine will always
stringify everything using the snippet above, but that doesn't mean
other engines have to.

The C<flags> parameter is a bitfield which indicates which of the
C<msixp> flags the regex was compiled with. It also contains
additional info such as whether C<use locale> is in effect.

The C<eogc> flags are stripped out before being passed to the comp
routine. The regex engine does not need to know whether any of these
are set as those flags should only affect what perl does with the
pattern and its match variables, not how it gets compiled and
executed.

By the time the comp callback is called, some of these flags have
already had effect (noted below where applicable). However most of
their effect occurs after the comp callback has run in routines that
read the C<< rx->extflags >> field which it populates.

In general the flags should be preserved in C<< rx->extflags >> after
compilation, although the regex engine might want to add or delete
some of them to invoke or disable some special behavior in perl. The
flags along with any special behavior they cause are documented below:

The pattern modifiers:

=over 4

=item C</m> - RXf_PMf_MULTILINE

If this is in C<< rx->extflags >> it will be passed to
C<Perl_fbm_instr> by C<pp_split> which will treat the subject string
as a multi-line string.

=item C</s> - RXf_PMf_SINGLELINE

=item C</i> - RXf_PMf_FOLD

=item C</x> - RXf_PMf_EXTENDED

If present on a regex C<#> comments will be handled differently by the
tokenizer in some cases.

TODO: Document those cases.

=item C</p> - RXf_PMf_KEEPCOPY

TODO: Document this

=item Character set

The character set semantics are determined by an enum that is contained
in this field. This is still experimental and subject to change, but
the current interface returns the rules by use of the in-line function
C<get_regex_charset(const U32 flags)>. The only currently documented
value returned from it is REGEX_LOCALE_CHARSET, which is set if
C<use locale> is in effect. If present in C<< rx->extflags >>,
C<split> will use the locale dependent definition of whitespace
when RXf_SKIPWHITE or RXf_WHITE is in effect. ASCII whitespace
is defined as per L<isSPACE|perlapi/isSPACE>, and by the internal
macros C<is_utf8_space> under UTF-8, and C<isSPACE_LC> under C<use
locale>.

=back

Additional flags:

=over 4

=item RXf_UTF8

Set if the pattern is L<SvUTF8()|perlapi/SvUTF8>, set by Perl_pmruntime.

A regex engine may want to set or disable this flag during
compilation. The perl engine for instance may upgrade non-UTF-8
strings to UTF-8 if the pattern includes constructs such as C<\x{...}>
that can only match Unicode values.

=item RXf_SPLIT

If C<split> is invoked as C<split ' '> or with no arguments (which
really means C<split(' ', $_)>, see L<split|perlfunc/split>), perl will
set this flag. The regex engine can then check for it and set the
SKIPWHITE and WHITE extflags. To do this the perl engine does:

    if (flags & RXf_SPLIT && r->prelen == 1 && r->precomp[0] == ' ')
        r->extflags |= (RXf_SKIPWHITE|RXf_WHITE);

=back

These flags can be set during compilation to enable optimizations in
the C<split> operator.

=over 4

=item RXf_SKIPWHITE

If the flag is present in C<< rx->extflags >> C<split> will delete
whitespace from the start of the subject string before it's operated
on. What is considered whitespace depends on whether the subject is a
UTF-8 string and whether the C<RXf_PMf_LOCALE> flag is set.

If RXf_WHITE is set in addition to this flag C<split> will behave like
C<split " "> under the perl engine.

=item RXf_START_ONLY

Tells the split operator to split the target string on newlines
(C<\n>) without invoking the regex engine.

Perl's engine sets this if the pattern is C</^/> (C<plen == 1 && *exp
== '^'>), even under C</^/s>; see L<split|perlfunc/split>. Of course a
different regex engine might want to use the same optimizations
with a different syntax.

=item RXf_WHITE

Tells the split operator to split the target string on whitespace
without invoking the regex engine. The definition of whitespace varies
depending on whether the target string is a UTF-8 string and on
whether RXf_PMf_LOCALE is set.

Perl's engine sets this flag if the pattern is C<\s+>.

=item RXf_NULL

Tells the split operator to split the target string on
characters. The definition of character varies depending on whether
the target string is a UTF-8 string.

Perl's engine sets this flag on empty patterns; this optimization
makes C<split //> much faster than it would otherwise be. It's even
faster than C<unpack>.

=back

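The conditions under which perl's engine is described as setting the
C<split> flags can be gathered into one routine. The following is only a
sketch under stated assumptions: the C<SK_RXf_*> values and the
C<sketch_regexp> type are hypothetical stand-ins invented here (the real
C<RXf_*> constants and C<REGEXP> come from perl's headers), and the
order of the tests is a choice made for this example, not something the
perl source is known to guarantee.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical flag values chosen for this sketch only; the real
 * RXf_* constants are defined by perl and differ between versions. */
#define SK_RXf_SPLIT      0x01u
#define SK_RXf_SKIPWHITE  0x02u
#define SK_RXf_START_ONLY 0x04u
#define SK_RXf_WHITE      0x08u
#define SK_RXf_NULL       0x10u

/* Hypothetical stand-in for the few REGEXP fields used here. */
typedef struct sketch_regexp {
    const char  *precomp;   /* pattern text */
    int          prelen;    /* length of precomp */
    unsigned int extflags;  /* flags later read by split */
} sketch_regexp;

/* Choose split optimisation flags from the literal pattern text,
 * following the conditions given in the item descriptions above. */
void sketch_set_split_flags(sketch_regexp *r, unsigned int flags)
{
    assert(r != NULL && r->precomp != NULL);

    r->extflags = flags;  /* preserve the incoming flags */

    if ((flags & SK_RXf_SPLIT) && r->prelen == 1 && r->precomp[0] == ' ')
        r->extflags |= (SK_RXf_SKIPWHITE | SK_RXf_WHITE);  /* split ' ' */
    else if (r->prelen == 0)
        r->extflags |= SK_RXf_NULL;                        /* split // */
    else if (r->prelen == 1 && r->precomp[0] == '^')
        r->extflags |= SK_RXf_START_ONLY;                  /* split /^/ */
    else if (r->prelen == 3 && memcmp(r->precomp, "\\s+", 3) == 0)
        r->extflags |= SK_RXf_WHITE;                       /* split /\s+/ */
}
```

An engine with a different pattern syntax could keep the same flag
choices and replace only the textual tests.
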
=head2 exec

    I32 exec(pTHX_ REGEXP * const rx,
             char *stringarg, char* strend, char* strbeg,
             I32 minend, SV* screamer,
             void* data, U32 flags);

Execute a regexp.

=head2 intuit

    char* intuit(pTHX_ REGEXP * const rx,
                 SV *sv, char *strpos, char *strend,
                 const U32 flags, struct re_scream_pos_data_s *data);

Find the start position where a regex match should be attempted,
or possibly whether the regex engine should not be run because the
pattern can't match. This is called as appropriate by the core
depending on the values of the extflags member of the regexp
structure.

=head2 checkstr

    SV* checkstr(pTHX_ REGEXP * const rx);

Return a SV containing a string that must appear in the pattern. Used
by C<split> for optimising matches.

=head2 free

    void free(pTHX_ REGEXP * const rx);

Called by perl when it is freeing a regexp pattern so that the engine
can release any resources pointed to by the C<pprivate> member of the
regexp structure. This is only responsible for freeing private data;
perl will handle releasing anything else contained in the regexp structure.

=head2 Numbered capture callbacks

Called to get/set the value of C<$`>, C<$'>, C<$&> and their named
equivalents, C<${^PREMATCH}>, C<${^POSTMATCH}> and C<${^MATCH}>, as well
as the numbered capture groups (C<$1>, C<$2>, ...).

The C<paren> parameter will be C<-2> for C<$`>, C<-1> for C<$'>, C<0>
for C<$&>, C<1> for C<$1> and so forth.

The names have been chosen by analogy with L<Tie::Scalar> method
names with an additional B<LENGTH> callback for efficiency. However
named capture variables are currently not tied internally but
implemented via magic.

=head3 numbered_buff_FETCH

    void numbered_buff_FETCH(pTHX_ REGEXP * const rx, const I32 paren,
                             SV * const sv);

Fetch a specified numbered capture. C<sv> should be set to the scalar
to return. The scalar is passed as an argument rather than being
returned from the function because when it's called perl already has a
scalar to store the value; creating another one would be
redundant. The scalar can be set with C<sv_setsv>, C<sv_setpvn> and
friends, see L<perlapi>.

This callback is where perl untaints its own capture variables under
taint mode (see L<perlsec>). See the C<Perl_reg_numbered_buff_fetch>
function in F<regcomp.c> for how to untaint capture variables if
that's something you'd like your engine to do as well.

=head3 numbered_buff_STORE

    void    (*numbered_buff_STORE) (pTHX_ REGEXP * const rx, const I32 paren,
                                    SV const * const value);

Set the value of a numbered capture variable. C<value> is the scalar
that is to be used as the new value. It's up to the engine to make
sure this is used as the new value (or reject it).

Example:

    if ("ook" =~ /(o*)/) {
        # 'paren' will be '1' and 'value' will be 'ee'
        $1 =~ tr/o/e/;
    }

Perl's own engine will croak on any attempt to modify the capture
variables. To do this in another engine use the following callback
(copied from C<Perl_reg_numbered_buff_store>):

    void
    Example_reg_numbered_buff_store(pTHX_ REGEXP * const rx, const I32 paren,
                                    SV const * const value)
    {
        PERL_UNUSED_ARG(rx);
        PERL_UNUSED_ARG(paren);
        PERL_UNUSED_ARG(value);

        if (!PL_localizing)
            Perl_croak(aTHX_ PL_no_modify);
    }

Actually perl will not I<always> croak in a statement that looks
like it would modify a numbered capture variable. This is because the
STORE callback will not be called if perl can determine that it
doesn't have to modify the value. This is exactly how tied variables
behave in the same situation:

    package CaptureVar;
    use base 'Tie::Scalar';

    sub TIESCALAR { bless [] }
    sub FETCH { undef }
    sub STORE { die "This doesn't get called" }

    package main;

    tie my $sv => "CaptureVar";
    $sv =~ y/a/b/;

Because C<$sv> is C<undef> when the C<y///> operator is applied to it
the transliteration won't actually execute and the program won't
C<die>. This is different from how 5.8 and earlier versions behaved:
the capture variables were READONLY variables then, whereas now they'll
just die when assigned to in the default engine.

=head3 numbered_buff_LENGTH

    I32 numbered_buff_LENGTH (pTHX_ REGEXP * const rx, const SV * const sv,
                              const I32 paren);

Get the C<length> of a capture variable. There's a special callback
for this so that perl doesn't have to do a FETCH and run C<length> on
the result. Since the length is (in perl's case) known from an offset
stored in C<< rx->offs >>, this is much more efficient:

    I32 s1  = rx->offs[paren].start;
    I32 t1  = rx->offs[paren].end;
    I32 len = t1 - s1;

This is a little bit more complex in the case of UTF-8, see what
C<Perl_reg_numbered_buff_length> does with
L<is_utf8_string_loclen|perlapi/is_utf8_string_loclen>.

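To show how the offset arithmetic extends to C<$`> and C<$'>, here is a
standalone sketch that computes byte lengths only and ignores UTF-8.
The types C<sketch_pair> and C<sketch_regexp_len> and the function
C<sketch_buff_length> are hypothetical stand-ins for this example.
Returning C<0> when there is no value to measure is an assumption of
the sketch rather than documented behaviour.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins mirroring regexp_paren_pair and the few
 * REGEXP fields needed here. */
typedef struct sketch_pair { int start; int end; } sketch_pair;

typedef struct sketch_regexp_len {
    sketch_pair *offs;     /* offs[0] is $&, offs[n] is $n */
    int          nparens;  /* number of capture groups */
    int          sublen;   /* length of the string matched against */
} sketch_regexp_len;

/* Byte length of a capture; paren is -2 for $`, -1 for $', 0 for $&
 * and n for $n. Returns 0 when there is nothing to measure (an
 * assumption of this sketch). */
int sketch_buff_length(const sketch_regexp_len *rx, int paren)
{
    int s1, t1;

    assert(rx != NULL && rx->offs != NULL);

    if (paren == -2) {                 /* $`: start of string up to $& */
        if (rx->offs[0].start == -1) return 0;
        s1 = 0;
        t1 = rx->offs[0].start;
    } else if (paren == -1) {          /* $': end of $& to end of string */
        if (rx->offs[0].end == -1) return 0;
        s1 = rx->offs[0].end;
        t1 = rx->sublen;
    } else {                           /* $& and the numbered groups */
        if (paren < 0 || paren > rx->nparens) return 0;
        s1 = rx->offs[paren].start;
        t1 = rx->offs[paren].end;
        if (s1 == -1 || t1 == -1) return 0;   /* group did not match */
    }
    return t1 - s1;
}
```

For a six-byte subject where C<$&> spans offsets 2 to 5, this gives a
length of 2 for C<$`>, 3 for C<$&> and 1 for C<$'>.
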
=head2 Named capture callbacks

Called to get/set the value of C<%+> and C<%-> as well as by some
utility functions in L<re>.

There are two callbacks: C<named_buff> is called in all the cases the
FETCH, STORE, DELETE, CLEAR, EXISTS and SCALAR L<Tie::Hash> callbacks
would be on changes to C<%+> and C<%->, and C<named_buff_iter> in the
same cases as FIRSTKEY and NEXTKEY.

The C<flags> parameter can be used to determine which of these
operations the callbacks should respond to. The following flags are
currently defined:

Which L<Tie::Hash> operation is being performed from the Perl level on
C<%+> or C<%->, if any:

    RXapif_FETCH
    RXapif_STORE
    RXapif_DELETE
    RXapif_CLEAR
    RXapif_EXISTS
    RXapif_SCALAR
    RXapif_FIRSTKEY
    RXapif_NEXTKEY

Whether C<%+> or C<%-> is being operated on, if any.

    RXapif_ONE /* %+ */
    RXapif_ALL /* %- */

Whether this is being called as C<re::regname>, C<re::regnames> or
C<re::regnames_count>, if any. The first two will be combined with
C<RXapif_ONE> or C<RXapif_ALL>.

    RXapif_REGNAME
    RXapif_REGNAMES
    RXapif_REGNAMES_COUNT

Internally C<%+> and C<%-> are implemented with a real tied interface
via L<Tie::Hash::NamedCapture>. The methods in that package will call
back into these functions. However the usage of
L<Tie::Hash::NamedCapture> for this purpose might change in future
releases. For instance this might be implemented by magic instead
(would need an extension to mgvtbl).

=head3 named_buff

    SV*     (*named_buff) (pTHX_ REGEXP * const rx, SV * const key,
                           SV * const value, U32 flags);

=head3 named_buff_iter

    SV*     (*named_buff_iter) (pTHX_ REGEXP * const rx, const SV * const lastkey,
                                const U32 flags);

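One way a C<named_buff> implementation might begin is by decoding
C<flags> into the operation requested and the hash it applies to. The
sketch below is standalone and illustrative only: the C<SK_RXapif_*> bit
values, the C<sketch_op> and C<sketch_request> types and the function
name are all hypothetical, and a real engine would test the C<RXapif_*>
constants from perl's headers instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit values for this sketch only; the real RXapif_*
 * constants come from perl's headers. */
#define SK_RXapif_FETCH   0x0001u
#define SK_RXapif_STORE   0x0002u
#define SK_RXapif_DELETE  0x0004u
#define SK_RXapif_CLEAR   0x0008u
#define SK_RXapif_EXISTS  0x0010u
#define SK_RXapif_SCALAR  0x0020u
#define SK_RXapif_ONE     0x0100u  /* %+ */
#define SK_RXapif_ALL     0x0200u  /* %- */

typedef enum sketch_op {
    SK_OP_NONE, SK_OP_FETCH, SK_OP_STORE, SK_OP_DELETE,
    SK_OP_CLEAR, SK_OP_EXISTS, SK_OP_SCALAR
} sketch_op;

typedef struct sketch_request {
    sketch_op op;      /* which Tie::Hash-style operation, if any */
    char      target;  /* '+' for %+, '-' for %-, 0 if neither */
} sketch_request;

/* Split flags into the requested operation and the target hash. */
void sketch_decode_named_flags(unsigned int flags, sketch_request *out)
{
    assert(out != NULL);

    if      (flags & SK_RXapif_FETCH)  out->op = SK_OP_FETCH;
    else if (flags & SK_RXapif_STORE)  out->op = SK_OP_STORE;
    else if (flags & SK_RXapif_DELETE) out->op = SK_OP_DELETE;
    else if (flags & SK_RXapif_CLEAR)  out->op = SK_OP_CLEAR;
    else if (flags & SK_RXapif_EXISTS) out->op = SK_OP_EXISTS;
    else if (flags & SK_RXapif_SCALAR) out->op = SK_OP_SCALAR;
    else                               out->op = SK_OP_NONE;

    if      (flags & SK_RXapif_ONE) out->target = '+';
    else if (flags & SK_RXapif_ALL) out->target = '-';
    else                            out->target = 0;
}
```

A callback could then switch on C<op> and use C<target> to decide
whether to return a single value (C<%+>) or every value recorded under
the name (C<%->).
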
=head2 qr_package

    SV* qr_package(pTHX_ REGEXP * const rx);

The package the qr// magic object is blessed into (as seen by C<ref
qr//>). It is recommended that engines change this to their package
name for identification regardless of whether they implement methods
on the object.

The package this method returns should also have the internal
C<Regexp> package in its C<@ISA>. C<< qr//->isa("Regexp") >> should always
be true regardless of what engine is being used.

An example implementation might be:

    SV*
    Example_qr_package(pTHX_ REGEXP * const rx)
    {
        PERL_UNUSED_ARG(rx);
        return newSVpvs("re::engine::Example");
    }

Any method calls on an object created with C<qr//> will be dispatched to the
package as a normal object.

    use re::engine::Example;
    my $re = qr//;
    $re->meth; # dispatched to re::engine::Example::meth()

To retrieve the C<REGEXP> object from the scalar in an XS function use
the C<SvRX> macro, see L<"REGEXP Functions" in perlapi|perlapi/REGEXP
Functions>.

    void meth(SV * rv)
    PPCODE:
        REGEXP * re = SvRX(rv);

=head2 dupe

    void* dupe(pTHX_ REGEXP * const rx, CLONE_PARAMS *param);

On threaded builds a regexp may need to be duplicated so that the pattern
can be used by multiple threads. This routine is expected to handle the
duplication of any private data pointed to by the C<pprivate> member of
the regexp structure. It will be called with the preconstructed new
regexp structure as an argument, the C<pprivate> member will point at
the B<old> private structure, and it is this routine's responsibility to
construct a copy and return a pointer to it (which perl will then use to
overwrite the field as passed to this routine).

This allows the engine to dupe its private data but also if necessary
modify the final structure if it really must.

On unthreaded builds this field doesn't exist.

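The contract described above (read the old private data through
C<pprivate>, build a copy, return the copy) might look like the
following standalone sketch. C<sketch_private>, C<sketch_regexp_dup> and
C<sketch_dupe> are hypothetical names, the layout of the private data is
invented for the example, and plain C<malloc> is used where a real
engine would more likely use perl's own memory routines. The real
callback also receives C<pTHX_> and a C<CLONE_PARAMS> pointer, which are
omitted here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical private data an engine might hang off pprivate. */
typedef struct sketch_private {
    char  *program;   /* engine-owned compiled form */
    size_t proglen;   /* its length in bytes */
} sketch_private;

/* Hypothetical stand-in for the one REGEXP field used here. */
typedef struct sketch_regexp_dup {
    void *pprivate;   /* on entry: still points at the OLD private data */
} sketch_regexp_dup;

/* Deep-copy the old private data reachable from rx->pprivate and
 * return the copy; the caller overwrites pprivate with the result.
 * Returns NULL if allocation fails. */
void *sketch_dupe(const sketch_regexp_dup *rx)
{
    const sketch_private *old;
    sketch_private *copy;

    assert(rx != NULL && rx->pprivate != NULL);
    old = (const sketch_private *)rx->pprivate;
    assert(old->program != NULL);

    copy = malloc(sizeof *copy);
    if (copy == NULL)
        return NULL;

    copy->proglen = old->proglen;
    /* Allocate at least one byte so an empty program is still valid. */
    copy->program = malloc(old->proglen ? old->proglen : 1);
    if (copy->program == NULL) {
        free(copy);
        return NULL;
    }
    memcpy(copy->program, old->program, old->proglen);
    return copy;
}
```

The copy shares no memory with the original, so each thread can later
release its own private data independently in its C<free> callback.
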
=head2 op_comp

This is private to the perl core and subject to change. Should be left
null.

=head1 The REGEXP structure

The REGEXP struct is defined in F<regexp.h>. All regex engines must be able to
correctly build such a structure in their L</comp> routine.

The REGEXP structure contains all the data that perl needs to be aware of
to properly work with the regular expression. It includes data about
optimisations that perl can use to determine if the regex engine should
really be used, and various other control info that is needed to properly
execute patterns in various contexts, such as whether the pattern is
anchored in some way, what flags were used during the compile, or whether
the program contains special constructs that perl needs to be aware of.

In addition it contains two fields that are intended for the private
use of the regex engine that compiled the pattern. These are the
C<intflags> and C<pprivate> members. C<pprivate> is a void pointer to
an arbitrary structure whose use and management is the responsibility
of the compiling engine. perl will never modify either of these
values.

    typedef struct regexp {
        /* what engine created this regexp? */
        const struct regexp_engine* engine;

        /* what re is this a lightweight copy of? */
        struct regexp* mother_re;

        /* Information about the match that the perl core uses to manage things */
        U32 extflags;   /* Flags used both externally and internally */
        I32 minlen;     /* minimum possible length of string to match */
        I32 minlenret;  /* minimum possible length of $& */
        U32 gofs;       /* chars left of pos that we search from */

        /* substring data about strings that must appear
           in the final match, used for optimisations */
        struct reg_substr_data *substrs;

        U32 nparens;  /* number of capture groups */

        /* private engine specific data */
        U32 intflags;   /* Engine Specific Internal flags */
        void *pprivate; /* Data private to the regex engine which
                           created this object. */

        /* Data about the last/current match. These are modified during matching */
        U32 lastparen;            /* highest close paren matched ($+) */
        U32 lastcloseparen;       /* last close paren matched ($^N) */
        regexp_paren_pair *swap;  /* Swap copy of *offs */
        regexp_paren_pair *offs;  /* Array of offsets for (@-) and (@+) */

        char *subbeg;  /* saved or original string so \digit works forever. */
        SV_SAVED_COPY  /* If non-NULL, SV which is COW from original */
        I32 sublen;    /* Length of string pointed by subbeg */

        /* Information about the match that isn't often used */
        I32 prelen;           /* length of precomp */
        const char *precomp;  /* pre-compilation regular expression */

        char *wrapped;  /* wrapped version of the pattern */
        I32 wraplen;    /* length of wrapped */

        I32 seen_evals;   /* number of eval groups in the pattern - for security checks */
        HV *paren_names;  /* Optional hash of paren names */

        /* Refcount of this regexp */
        I32 refcnt;             /* Refcount of this regexp */
    } regexp;

The fields are discussed in more detail below:

=head2 C<engine>

This field points at a regexp_engine structure which contains pointers
to the subroutines that are to be used for performing a match. It
is the compiling routine's responsibility to populate this field before
returning the regexp object.

Internally this is set to C<NULL> unless a custom engine is specified in
C<$^H{regcomp}>. Perl's own set of callbacks can be accessed in the struct
pointed to by C<RE_ENGINE_PTR>.

=head2 C<mother_re>

TODO, see L<http://www.mail-archive.com/perl5-changes@perl.org/msg17328.html>

=head2 C<extflags>

This will be used by perl to see what flags the regexp was compiled
with. This will normally be set to the value of the flags parameter by
the L<comp|/comp> callback. See the L<comp|/comp> documentation for
valid flags.

=head2 C<minlen> C<minlenret>

The minimum string length required for the pattern to match. This is used to
prune the search space by not bothering to match any closer to the end of a
string than would allow a match. For instance there is no point in even
starting the regex engine if the minlen is 10 but the string is only 5
characters long. There is no way that the pattern can match.

C<minlenret> is the minimum length of the string that would be found
in $& after a match.

The difference between C<minlen> and C<minlenret> can be seen in the
following pattern:

    /ns(?=\d)/

where the C<minlen> would be 3 but C<minlenret> would only be 2 as the \d is
required to match but is not actually included in the matched content. This
distinction is particularly important as the substitution logic uses the
C<minlenret> to tell whether it can do in-place substitution which can result in
considerable speedup.

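The pruning that C<minlen> allows amounts to a single subtraction. The
helper below is a hypothetical illustration (C<sketch_last_start> is not
a perl function): given a string length and a C<minlen>, it returns the
last start offset at which a match could still fit, or C<-1> when the
string is too short for the engine to be worth starting at all.

```c
#include <assert.h>

/* Return the last start offset worth trying in a string of length len
 * for a pattern whose minimum match length is minlen, or -1 if the
 * string is too short for any match. */
int sketch_last_start(int len, int minlen)
{
    assert(len >= 0 && minlen >= 0);
    if (len < minlen)
        return -1;
    return len - minlen;
}
```

With the example from the text, a string of length 5 and a C<minlen> of
10 gives C<-1>, so the engine need not run.
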
=head2 C<gofs>

Left offset from pos() to start match at.

=head2 C<substrs>

Substring data about strings that must appear in the final match. This
is currently only used internally by perl's engine but might be
used in the future for all engines for optimisations.

=head2 C<nparens>, C<lastparen>, and C<lastcloseparen>

These fields are used to keep track of how many capture groups there are
in the pattern, which was the highest-numbered close paren to be matched
(C<$+>), and which was the most recent close paren to be matched (C<$^N>).

=head2 C<intflags>

The engine's private copy of the flags the pattern was compiled with. Usually
this is the same as C<extflags> unless the engine chose to modify one of them.

=head2 C<pprivate>

A void* pointing to an engine-defined data structure. The perl engine uses the
C<regexp_internal> structure (see L<perlreguts/Base Structures>) but a custom
engine should use something else.

=head2 C<swap>

Unused. Left in for compatibility with perl 5.10.0.

=head2 C<offs>

A C<regexp_paren_pair> structure which defines offsets into the string being
matched which correspond to the C<$&> and C<$1>, C<$2> etc. captures. The
C<regexp_paren_pair> struct is defined as follows:

    typedef struct regexp_paren_pair {
        I32 start;
        I32 end;
    } regexp_paren_pair;

If C<< ->offs[num].start >> or C<< ->offs[num].end >> is C<-1> then that
capture group did not match. C<< ->offs[0].start/end >> represents C<$&> (or
C<${^MATCH}> under C<//p>) and C<< ->offs[paren].end >> matches C<$$paren> where
C<< $paren >= 1 >>.

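As an illustration of reading these offsets, the following standalone
sketch locates a capture inside the subject string. The names
C<sketch_paren_pair> and C<sketch_capture> are hypothetical, plain
C<int> stands in for C<I32>, and the caller is assumed to pass an index
no greater than C<nparens>.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of regexp_paren_pair using plain int. */
typedef struct sketch_paren_pair { int start; int end; } sketch_paren_pair;

/* Return a pointer to capture n inside subbeg and store its byte
 * length in *len. Returns NULL (and sets *len to 0) when either
 * offset is -1, meaning the group did not match. */
const char *sketch_capture(const char *subbeg,
                           const sketch_paren_pair *offs,
                           int n, int *len)
{
    assert(subbeg != NULL && offs != NULL && len != NULL && n >= 0);

    if (offs[n].start == -1 || offs[n].end == -1) {
        *len = 0;
        return NULL;
    }
    *len = offs[n].end - offs[n].start;
    return subbeg + offs[n].start;
}
```

Index C<0> yields the span for C<$&>; index C<1> and up yield the
numbered groups.
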
=head2 C<precomp> C<prelen>

Used for optimisations. C<precomp> holds a copy of the pattern that
was compiled and C<prelen> its length. When a new pattern is to be
compiled (such as inside a loop) the internal C<regcomp> operator
checks whether the last compiled C<REGEXP>'s C<precomp> and C<prelen>
are equivalent to the new one, and if so uses the old pattern instead
of compiling a new one.

The relevant snippet from C<Perl_pp_regcomp>:

    if (!re || !re->precomp || re->prelen != (I32)len ||
        memNE(re->precomp, t, len))
        /* Compile a new pattern */

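The same test can be tried outside of perl. In this standalone sketch
C<sketch_regexp_pre> and C<sketch_needs_recompile> are hypothetical
names, C<int> stands in for C<I32>, and the standard C<memcmp> stands in
for perl's C<memNE>.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in holding only the two fields compared. */
typedef struct sketch_regexp_pre {
    const char *precomp;  /* text the regexp was compiled from */
    int         prelen;   /* length of precomp */
} sketch_regexp_pre;

/* Nonzero when the pattern text t of length len differs from the one
 * the previous regexp was compiled from, so a recompile is needed. */
int sketch_needs_recompile(const sketch_regexp_pre *re,
                           const char *t, size_t len)
{
    assert(t != NULL);
    return !re || !re->precomp || re->prelen != (int)len
        || memcmp(re->precomp, t, len) != 0;
}
```

Checking the length first means the byte comparison only runs when the
two patterns could actually be identical.
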
=head2 C<paren_names>

This is a hash used internally to track named capture groups and their
offsets. The keys are the names of the buffers; the values are dualvars,
with the IV slot holding the number of buffers with the given name and the
pv being an embedded array of I32. The values may also be contained
independently in the data array in cases where named backreferences are
used.

=head2 C<substrs>

Holds information on the longest string that must occur at a fixed
offset from the start of the pattern, and the longest string that must
occur at a floating offset from the start of the pattern. Used to do
Fast-Boyer-Moore searches on the string to find out if it's worth using
the regex engine at all, and if so where in the string to search.

=head2 C<subbeg> C<sublen> C<saved_copy>

Used during execution phase for managing search and replace patterns.

=head2 C<wrapped> C<wraplen>

Stores the string C<qr//> stringifies to. The perl engine for example
stores C<(?^:eek)> in the case of C<qr/eek/>.

When using a custom engine that doesn't support the C<(?:)> construct
for inline modifiers, it's probably best to have C<qr//> stringify to
the supplied pattern. Note that this will create undesired patterns in
cases such as:

    my $x = qr/a|b/;  # "a|b"
    my $y = qr/c/i;   # "c"
    my $z = qr/$x$y/; # "a|bc"

There's no solution for this problem other than making the custom
engine understand a construct like C<(?:)>.

=head2 C<seen_evals>

This stores the number of eval groups in the pattern. This is used for security
purposes when embedding compiled regexes into larger patterns with C<qr//>.

=head2 C<refcnt>

The number of times the structure is referenced. When this falls to 0 the
regexp is automatically freed by a call to pregfree. This should be set to 1 in
each engine's L</comp> routine.

=head1 HISTORY

Originally part of L<perlreguts>.

=head1 AUTHORS

Originally written by Yves Orton, expanded by E<AElig>var ArnfjE<ouml>rE<eth>
Bjarmason.

=head1 LICENSE

Copyright 2006 Yves Orton and 2007 E<AElig>var ArnfjE<ouml>rE<eth> Bjarmason.

This program is free software; you can redistribute it and/or modify it under
the same terms as Perl itself.

=cut