This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5
perlapi: Remove per-thread section; move to real scns
[perl5.git] / regcomp.c
CommitLineData
a0d0e21e
LW
1/* regcomp.c
2 */
3
4/*
4ac71550
TC
5 * 'A fair jaw-cracker dwarf-language must be.' --Samwise Gamgee
6 *
7 * [p.285 of _The Lord of the Rings_, II/iii: "The Ring Goes South"]
a0d0e21e
LW
8 */
9
61296642
DM
10/* This file contains functions for compiling a regular expression. See
11 * also regexec.c which funnily enough, contains functions for executing
166f8a29 12 * a regular expression.
e4a054ea
DM
13 *
14 * This file is also copied at build time to ext/re/re_comp.c, where
15 * it's built with -DPERL_EXT_RE_BUILD -DPERL_EXT_RE_DEBUG -DPERL_EXT.
16 * This causes the main functions to be compiled under new names and with
17 * debugging support added, which makes "use re 'debug'" work.
166f8a29
DM
18 */
19
a687059c
LW
20/* NOTE: this is derived from Henry Spencer's regexp code, and should not
21 * confused with the original package (see point 3 below). Thanks, Henry!
22 */
23
24/* Additional note: this code is very heavily munged from Henry's version
25 * in places. In some spots I've traded clarity for efficiency, so don't
26 * blame Henry for some of the lack of readability.
27 */
28
e50aee73 29/* The names of the functions have been changed from regcomp and
3b753521 30 * regexec to pregcomp and pregexec in order to avoid conflicts
e50aee73
AD
31 * with the POSIX routines of the same names.
32*/
33
b9d5759e 34#ifdef PERL_EXT_RE_BUILD
54df2634 35#include "re_top.h"
b81d288d 36#endif
56953603 37
a687059c 38/*
e50aee73 39 * pregcomp and pregexec -- regsub and regerror are not used in perl
a687059c
LW
40 *
41 * Copyright (c) 1986 by University of Toronto.
42 * Written by Henry Spencer. Not derived from licensed software.
43 *
44 * Permission is granted to anyone to use this software for any
45 * purpose on any computer system, and to redistribute it freely,
46 * subject to the following restrictions:
47 *
48 * 1. The author is not responsible for the consequences of use of
49 * this software, no matter how awful, even if they arise
50 * from defects in it.
51 *
52 * 2. The origin of this software must not be misrepresented, either
53 * by explicit claim or by omission.
54 *
55 * 3. Altered versions must be plainly marked as such, and must not
56 * be misrepresented as being the original software.
57 *
58 *
59 **** Alterations to Henry's code are...
60 ****
4bb101f2 61 **** Copyright (C) 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999,
1129b882
NC
62 **** 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008
63 **** by Larry Wall and others
a687059c 64 ****
9ef589d8
LW
65 **** You may distribute under the terms of either the GNU General Public
66 **** License or the Artistic License, as specified in the README file.
67
a687059c
LW
68 *
69 * Beware that some of this code is subtly aware of the way operator
70 * precedence is structured in regular expressions. Serious changes in
71 * regular-expression syntax might require a total rethink.
72 */
aa2885ce
KW
73
74/* Note on debug output:
75 *
76 * This is set up so that -Dr turns on debugging like all other flags that are
77 * enabled by -DDEBUGGING. -Drv gives more verbose output. This applies to
78 * all regular expressions encountered in a program, and gives a huge amount of
79 * output for all but the shortest programs.
80 *
81 * The ability to output pattern debugging information lexically, and with much
82 * finer grained control was added, with 'use re qw(Debug ....);' available even
83 * in non-DEBUGGING builds. This is accomplished by copying the contents of
84 * regcomp.c to ext/re/re_comp.c, and regexec.c is copied to ext/re/re_exec.c.
85 * Those files are compiled and linked into the perl executable, and they are
86 * compiled essentially as if DEBUGGING were enabled, and controlled by calls
87 * to re.pm.
88 *
89 * That would normally mean linking errors when two functions of the same name
90 * are attempted to be placed into the same executable. That is solved in one
91 * of four ways:
92 * 1) Static functions aren't known outside the file they are in, so for the
93 * many functions of that type in this file, it just isn't a problem.
94 * 2) Most externally known functions are enclosed in
95 * #ifndef PERL_IN_XSUB_RE
96 * ...
97 * #endif
98 * blocks, so there is only one defintion for them in the whole
99 * executable, the one in regcomp.c (or regexec.c). The implication of
100 * that is any debugging info that comes from them is controlled only by
101 * -Dr. Further, any static function they call will also be the version
102 * in regcomp.c (or regexec.c), so its debugging will also be by -Dr.
103 * 3) About a dozen external functions are re-#defined in ext/re/re_top.h, to
104 * have different names, so that what gets loaded in the executable is
105 * 'Perl_foo' from regcomp.c (and regexec.c), and the identical function
106 * from re_comp.c (and re_exec.c), but with the name 'my_foo' Debugging
107 * in the 'Perl_foo' versions is controlled by -Dr, but the 'my_foo'
108 * versions and their callees are under control of re.pm. The catch is
109 * that references to all these go through the regexp_engine structure,
110 * which is initialized in regcomp.h to the Perl_foo versions, and
111 * substituted out in lexical scopes where 'use re' is in effect to the
112 * 'my_foo' ones. That structure is public API, so it would be a hard
113 * sell to add any additional members.
114 * 4) For functions in regcomp.c and re_comp.c that are called only from,
115 * respectively, regexec.c and re_exec.c, they can have two different
116 * names, depending on #ifdef'ing PERL_IN_XSUB_RE, in both regexec.c and
117 * embed.fnc.
118 *
119 * The bottom line is that if you add code to one of the public functions
120 * listed in ext/re/re_top.h, debugging automagically works. But if you write
121 * a new function that needs to do debugging or there is a chain of calls from
122 * it that need to do debugging, all functions in the chain should use options
123 * 2) or 4) above.
124 *
125 * A function may have to be split so that debugging stuff is static, but it
126 * calls out to some other function that only gets compiled in regcomp.c to
127 * access data that we don't want to duplicate.
128 */
129
a687059c 130#include "EXTERN.h"
864dbfa3 131#define PERL_IN_REGCOMP_C
a687059c 132#include "perl.h"
d06ea78c 133
c277df42 134#define REG_COMP_C
54df2634
NC
135#ifdef PERL_IN_XSUB_RE
136# include "re_comp.h"
afda64e8 137EXTERN_C const struct regexp_engine my_reg_engine;
51bd1941 138EXTERN_C const struct regexp_engine wild_reg_engine;
54df2634
NC
139#else
140# include "regcomp.h"
141#endif
a687059c 142
b992490d 143#include "invlist_inline.h"
1b0f46bf 144#include "unicode_constants.h"
04e98a4d 145
a687059c
LW
146#ifndef STATIC
147#define STATIC static
148#endif
149
3f910e62
YO
150/* this is a chain of data about sub patterns we are processing that
151 need to be handled separately/specially in study_chunk. Its so
152 we can simulate recursion without losing state. */
153struct scan_frame;
154typedef struct scan_frame {
155 regnode *last_regnode; /* last node to process in this frame */
156 regnode *next_regnode; /* next node to process when last is reached */
157 U32 prev_recursed_depth;
158 I32 stopparen; /* what stopparen do we use */
089ad25d 159 bool in_gosub; /* this or an outer frame is for GOSUB */
3f910e62
YO
160
161 struct scan_frame *this_prev_frame; /* this previous frame */
162 struct scan_frame *prev_frame; /* previous frame */
163 struct scan_frame *next_frame; /* next frame */
164} scan_frame;
165
f4ae5a27
KW
166/* Certain characters are output as a sequence with the first being a
167 * backslash. */
4aada8b9 168#define isBACKSLASHED_PUNCT(c) memCHRs("-[]\\^", c)
f4ae5a27
KW
169
170
09b2b2e6 171struct RExC_state_t {
514a91f1
DM
172 U32 flags; /* RXf_* are we folding, multilining? */
173 U32 pm_flags; /* PMf_* stuff from the calling PMOP */
830247a4 174 char *precomp; /* uncompiled string. */
711b303b 175 char *precomp_end; /* pointer to end of uncompiled string. */
288b8c02 176 REGEXP *rx_sv; /* The SV that is the regexp. */
f8fc2ecf 177 regexp *rx; /* perl core regexp structure */
538e84ed
KW
178 regexp_internal *rxi; /* internal data for regexp object
179 pprivate field */
fac92740 180 char *start; /* Start of input for compile */
830247a4
IZ
181 char *end; /* End of input for compile */
182 char *parse; /* Input-scan pointer. */
51684e3c
KW
183 char *copy_start; /* start of copy of input within
184 constructed parse string */
cc16d262
KW
185 char *save_copy_start; /* Provides one level of saving
186 and restoring 'copy_start' */
51684e3c
KW
187 char *copy_start_in_input; /* Position in input string
188 corresponding to copy_start */
ea3daa5d 189 SSize_t whilem_seen; /* number of WHILEM in this expr */
fac92740 190 regnode *emit_start; /* Start of emitted-code area */
f55b7b26 191 regnode_offset emit; /* Code-emit pointer */
830247a4
IZ
192 I32 naughty; /* How bad is this pattern? */
193 I32 sawback; /* Did we see \1, ...? */
88f063b4
KW
194 SSize_t size; /* Number of regnode equivalents in
195 pattern */
d8d1dede
KW
196 Size_t sets_depth; /* Counts recursion depth of already-
197 compiled regex set patterns */
2475254e
RL
198 U32 seen;
199
200 I32 parens_buf_size; /* #slots malloced open/close_parens */
201 regnode_offset *open_parens; /* offsets to open parens */
202 regnode_offset *close_parens; /* offsets to close parens */
203 HV *paren_names; /* Paren names */
7c932d07
KW
204
205 /* position beyond 'precomp' of the warning message furthest away from
206 * 'precomp'. During the parse, no warnings are raised for any problems
207 * earlier in the parse than this position. This works if warnings are
208 * raised the first time a given spot is parsed, and if only one
209 * independent warning is raised for any given spot */
210 Size_t latest_warn_offset;
211
212 I32 npar; /* Capture buffer count so far in the
213 parse, (OPEN) plus one. ("par" 0 is
214 the whole pattern)*/
215 I32 total_par; /* During initial parse, is either 0,
216 or -1; the latter indicating a
217 reparse is needed. After that pass,
218 it is what 'npar' became after the
219 pass. Hence, it being > 0 indicates
220 we are in a reparse situation */
538e84ed
KW
221 I32 nestroot; /* root parens we are in - used by
222 accept */
830247a4 223 I32 seen_zerolen;
d5a00e4a 224 regnode *end_op; /* END node in program */
02daf0ab
YO
225 I32 utf8; /* whether the pattern is utf8 or not */
226 I32 orig_utf8; /* whether the pattern was originally in utf8 */
227 /* XXX use this for future optimisation of case
228 * where pattern must be upgraded to utf8. */
e40e74fe
KW
229 I32 uni_semantics; /* If a d charset modifier should use unicode
230 rules, even if the pattern is not in
231 utf8 */
538e84ed 232
88f063b4 233 I32 recurse_count; /* Number of recurse regops we have generated */
2475254e 234 regnode **recurse; /* Recurse regops */
4286711a 235 U8 *study_chunk_recursed; /* bitmap of which subs we have moved
538e84ed 236 through */
09a65838 237 U32 study_chunk_recursed_bytes; /* bytes in bitmap */
80f44cf4 238 I32 in_lookaround;
4624b182 239 I32 contains_locale;
bb3f3ed2 240 I32 override_recoding;
4caef9c3 241 I32 recode_x_to_native;
9d53c457 242 I32 in_multi_char_class;
2475254e 243 int code_index; /* next code_blocks[] slot */
1acab4c5 244 struct reg_code_blocks *code_blocks;/* positions of literal (?{})
68e2671b 245 within pattern */
ee273784 246 SSize_t maxlen; /* mininum possible number of chars in string to match */
3f910e62
YO
247 scan_frame *frame_head;
248 scan_frame *frame_last;
249 U32 frame_count;
7eec73eb 250 AV *warn_text;
945961fd 251 HV *unlexed_names;
d24ca0c5 252 SV *runtime_code_qr; /* qr with the runtime code blocks */
3dab1dad 253#ifdef DEBUGGING
be8e71aa 254 const char *lastparse;
3dab1dad 255 I32 lastnum;
d9a72fcc 256 U32 study_chunk_recursed_count;
2475254e 257 AV *paren_name_list; /* idx -> name */
c9f0d54c
YO
258 SV *mysv1;
259 SV *mysv2;
88f063b4 260
3dab1dad
YO
261#define RExC_lastparse (pRExC_state->lastparse)
262#define RExC_lastnum (pRExC_state->lastnum)
1f1031fe 263#define RExC_paren_name_list (pRExC_state->paren_name_list)
d9a72fcc 264#define RExC_study_chunk_recursed_count (pRExC_state->study_chunk_recursed_count)
c9f0d54c
YO
265#define RExC_mysv (pRExC_state->mysv1)
266#define RExC_mysv1 (pRExC_state->mysv1)
267#define RExC_mysv2 (pRExC_state->mysv2)
268
3dab1dad 269#endif
bd0b62a6 270 bool seen_d_op;
911bd04e 271 bool strict;
da7cf1cc 272 bool study_started;
034602eb 273 bool in_script_run;
7c932d07 274 bool use_BRANCHJ;
5aabce33
KW
275 bool sWARN_EXPERIMENTAL__VLB;
276 bool sWARN_EXPERIMENTAL__REGEX_SETS;
09b2b2e6 277};
830247a4 278
e2509266 279#define RExC_flags (pRExC_state->flags)
514a91f1 280#define RExC_pm_flags (pRExC_state->pm_flags)
830247a4 281#define RExC_precomp (pRExC_state->precomp)
51684e3c
KW
282#define RExC_copy_start_in_input (pRExC_state->copy_start_in_input)
283#define RExC_copy_start_in_constructed (pRExC_state->copy_start)
cc16d262 284#define RExC_save_copy_start_in_constructed (pRExC_state->save_copy_start)
711b303b 285#define RExC_precomp_end (pRExC_state->precomp_end)
288b8c02 286#define RExC_rx_sv (pRExC_state->rx_sv)
830247a4 287#define RExC_rx (pRExC_state->rx)
f8fc2ecf 288#define RExC_rxi (pRExC_state->rxi)
fac92740 289#define RExC_start (pRExC_state->start)
830247a4
IZ
290#define RExC_end (pRExC_state->end)
291#define RExC_parse (pRExC_state->parse)
7c932d07 292#define RExC_latest_warn_offset (pRExC_state->latest_warn_offset )
830247a4 293#define RExC_whilem_seen (pRExC_state->whilem_seen)
bd0b62a6
KW
294#define RExC_seen_d_op (pRExC_state->seen_d_op) /* Seen something that differs
295 under /d from /u ? */
512c0f5a 296
7122b237 297#ifdef RE_TRACK_PATTERN_OFFSETS
44f36cd3 298# define RExC_offsets (RExC_rxi->u.offsets) /* I am not like the
538e84ed 299 others */
7122b237 300#endif
830247a4 301#define RExC_emit (pRExC_state->emit)
fac92740 302#define RExC_emit_start (pRExC_state->emit_start)
830247a4
IZ
303#define RExC_sawback (pRExC_state->sawback)
304#define RExC_seen (pRExC_state->seen)
305#define RExC_size (pRExC_state->size)
ee273784 306#define RExC_maxlen (pRExC_state->maxlen)
830247a4 307#define RExC_npar (pRExC_state->npar)
1533d1d2 308#define RExC_total_parens (pRExC_state->total_par)
0275405c 309#define RExC_parens_buf_size (pRExC_state->parens_buf_size)
e2e6a0f1 310#define RExC_nestroot (pRExC_state->nestroot)
830247a4 311#define RExC_seen_zerolen (pRExC_state->seen_zerolen)
1aa99e6b 312#define RExC_utf8 (pRExC_state->utf8)
e40e74fe 313#define RExC_uni_semantics (pRExC_state->uni_semantics)
02daf0ab 314#define RExC_orig_utf8 (pRExC_state->orig_utf8)
40d049e4
YO
315#define RExC_open_parens (pRExC_state->open_parens)
316#define RExC_close_parens (pRExC_state->close_parens)
5bd2d46e 317#define RExC_end_op (pRExC_state->end_op)
81714fb9 318#define RExC_paren_names (pRExC_state->paren_names)
40d049e4
YO
319#define RExC_recurse (pRExC_state->recurse)
320#define RExC_recurse_count (pRExC_state->recurse_count)
d8d1dede 321#define RExC_sets_depth (pRExC_state->sets_depth)
09a65838 322#define RExC_study_chunk_recursed (pRExC_state->study_chunk_recursed)
538e84ed
KW
323#define RExC_study_chunk_recursed_bytes \
324 (pRExC_state->study_chunk_recursed_bytes)
80f44cf4 325#define RExC_in_lookaround (pRExC_state->in_lookaround)
4624b182 326#define RExC_contains_locale (pRExC_state->contains_locale)
4caef9c3
KW
327#define RExC_recode_x_to_native (pRExC_state->recode_x_to_native)
328
b6d67071 329#ifdef EBCDIC
4caef9c3
KW
330# define SET_recode_x_to_native(x) \
331 STMT_START { RExC_recode_x_to_native = (x); } STMT_END
332#else
333# define SET_recode_x_to_native(x) NOOP
b6d67071 334#endif
4caef9c3 335
9d53c457 336#define RExC_in_multi_char_class (pRExC_state->in_multi_char_class)
3f910e62
YO
337#define RExC_frame_head (pRExC_state->frame_head)
338#define RExC_frame_last (pRExC_state->frame_last)
339#define RExC_frame_count (pRExC_state->frame_count)
67cdf558 340#define RExC_strict (pRExC_state->strict)
da7cf1cc 341#define RExC_study_started (pRExC_state->study_started)
7eec73eb 342#define RExC_warn_text (pRExC_state->warn_text)
034602eb 343#define RExC_in_script_run (pRExC_state->in_script_run)
7c932d07 344#define RExC_use_BRANCHJ (pRExC_state->use_BRANCHJ)
5aabce33
KW
345#define RExC_warned_WARN_EXPERIMENTAL__VLB (pRExC_state->sWARN_EXPERIMENTAL__VLB)
346#define RExC_warned_WARN_EXPERIMENTAL__REGEX_SETS (pRExC_state->sWARN_EXPERIMENTAL__REGEX_SETS)
945961fd 347#define RExC_unlexed_names (pRExC_state->unlexed_names)
830247a4 348
99807a43
HS
349/* Heuristic check on the complexity of the pattern: if TOO_NAUGHTY, we set
350 * a flag to disable back-off on the fixed/floating substrings - if it's
351 * a high complexity pattern we assume the benefit of avoiding a full match
352 * is worth the cost of checking for the substrings even if they rarely help.
353 */
354#define RExC_naughty (pRExC_state->naughty)
355#define TOO_NAUGHTY (10)
356#define MARK_NAUGHTY(add) \
357 if (RExC_naughty < TOO_NAUGHTY) \
358 RExC_naughty += (add)
359#define MARK_NAUGHTY_EXP(exp, add) \
360 if (RExC_naughty < TOO_NAUGHTY) \
361 RExC_naughty += RExC_naughty / (exp) + (add)
cde0cee5 362
a687059c 363#define ISMULT1(c) ((c) == '*' || (c) == '+' || (c) == '?')
55507ac1 364#define ISMULT2(s) (ISMULT1(*s) || ((*s) == '{' && regcurly(s)))
a687059c
LW
365
366/*
367 * Flags to be passed up and down.
368 */
c4ead770
KW
369#define HASWIDTH 0x01 /* Known to not match null strings, could match
370 non-null ones. */
fc7dfe8e
HS
371#define SIMPLE 0x02 /* Exactly one character wide */
372 /* (or LNBREAK as a special case) */
8d9c2815
NC
373#define POSTPONED 0x08 /* (?1),(?&name), (??{...}) or similar */
374#define TRYAGAIN 0x10 /* Weeded out a declaration. */
c36361b2
KW
375#define RESTART_PARSE 0x20 /* Need to redo the parse */
376#define NEED_UTF8 0x40 /* In conjunction with RESTART_PARSE, need to
377 calcuate sizes as UTF-8 */
a687059c 378
3dab1dad
YO
379#define REG_NODE_NUM(x) ((x) ? (int)((x)-RExC_emit_start) : -1)
380
07be1b83
YO
381/* whether trie related optimizations are enabled */
382#if PERL_ENABLE_EXTENDED_TRIE_OPTIMISATION
383#define TRIE_STUDY_OPT
786e8c11 384#define FULL_TRIE_STUDY
07be1b83
YO
385#define TRIE_STCLASS
386#endif
1de06328
YO
387
388
40d049e4
YO
389
390#define PBYTE(u8str,paren) ((U8*)(u8str))[(paren) >> 3]
391#define PBITVAL(paren) (1 << ((paren) & 7))
ac3ef366
HS
392#define PAREN_OFFSET(depth) \
393 (RExC_study_chunk_recursed + (depth) * RExC_study_chunk_recursed_bytes)
394#define PAREN_TEST(depth, paren) \
395 (PBYTE(PAREN_OFFSET(depth), paren) & PBITVAL(paren))
396#define PAREN_SET(depth, paren) \
397 (PBYTE(PAREN_OFFSET(depth), paren) |= PBITVAL(paren))
398#define PAREN_UNSET(depth, paren) \
399 (PBYTE(PAREN_OFFSET(depth), paren) &= ~PBITVAL(paren))
40d049e4 400
82a6ada4 401#define REQUIRE_UTF8(flagp) STMT_START { \
8d9c2815 402 if (!UTF) { \
c36361b2 403 *flagp = RESTART_PARSE|NEED_UTF8; \
f55b7b26 404 return 0; \
8d9c2815 405 } \
82a6ada4 406 } STMT_END
40d049e4 407
bb58640a
KW
408/* /u is to be chosen if we are supposed to use Unicode rules, or if the
409 * pattern is in UTF-8. This latter condition is in case the outermost rules
410 * are locale. See GH #17278 */
411#define toUSE_UNI_CHARSET_NOT_DEPENDS (RExC_uni_semantics || UTF)
412
d63d75f5 413/* Change from /d into /u rules, and restart the parse. RExC_uni_semantics is
f6e8f31e
KW
414 * a flag that indicates we need to override /d with /u as a result of
415 * something in the pattern. It should only be used in regards to calling
b2090782 416 * set_regex_charset() or get_regex_charset() */
512c0f5a
KW
417#define REQUIRE_UNI_RULES(flagp, restart_retval) \
418 STMT_START { \
419 if (DEPENDS_SEMANTICS) { \
512c0f5a
KW
420 set_regex_charset(&RExC_flags, REGEX_UNICODE_CHARSET); \
421 RExC_uni_semantics = 1; \
b76e1c04 422 if (RExC_seen_d_op && LIKELY(! IN_PARENS_PASS)) { \
bd0b62a6
KW
423 /* No need to restart the parse if we haven't seen \
424 * anything that differs between /u and /d, and no need \
425 * to restart immediately if we're going to reparse \
426 * anyway to count parens */ \
c36361b2 427 *flagp |= RESTART_PARSE; \
512c0f5a 428 return restart_retval; \
d1be211c 429 } \
512c0f5a
KW
430 } \
431 } STMT_END
432
7c932d07
KW
433#define REQUIRE_BRANCHJ(flagp, restart_retval) \
434 STMT_START { \
435 RExC_use_BRANCHJ = 1; \
439a3bfe
KW
436 *flagp |= RESTART_PARSE; \
437 return restart_retval; \
7c932d07
KW
438 } STMT_END
439
b76e1c04
KW
440/* Until we have completed the parse, we leave RExC_total_parens at 0 or
441 * less. After that, it must always be positive, because the whole re is
442 * considered to be surrounded by virtual parens. Setting it to negative
443 * indicates there is some construct that needs to know the actual number of
444 * parens to be properly handled. And that means an extra pass will be
445 * required after we've counted them all */
446#define ALL_PARENS_COUNTED (RExC_total_parens > 0)
7c932d07 447#define REQUIRE_PARENS_PASS \
b76e1c04
KW
448 STMT_START { /* No-op if have completed a pass */ \
449 if (! ALL_PARENS_COUNTED) RExC_total_parens = -1; \
7c932d07 450 } STMT_END
b76e1c04
KW
451#define IN_PARENS_PASS (RExC_total_parens < 0)
452
7c932d07 453
f358228e
KW
454/* This is used to return failure (zero) early from the calling function if
455 * various flags in 'flags' are set. Two flags always cause a return:
456 * 'RESTART_PARSE' and 'NEED_UTF8'. 'extra' can be used to specify any
457 * additional flags that should cause a return; 0 if none. If the return will
458 * be done, '*flagp' is first set to be all of the flags that caused the
459 * return. */
460#define RETURN_FAIL_ON_RESTART_OR_FLAGS(flags,flagp,extra) \
fd7fc6ae 461 STMT_START { \
c36361b2
KW
462 if ((flags) & (RESTART_PARSE|NEED_UTF8|(extra))) { \
463 *(flagp) = (flags) & (RESTART_PARSE|NEED_UTF8|(extra)); \
f358228e 464 return 0; \
fd7fc6ae
YO
465 } \
466 } STMT_END
467
c36361b2 468#define MUST_RESTART(flags) ((flags) & (RESTART_PARSE))
fd7fc6ae 469
5d555bb8 470#define RETURN_FAIL_ON_RESTART(flags,flagp) \
f358228e 471 RETURN_FAIL_ON_RESTART_OR_FLAGS( flags, flagp, 0)
5d555bb8 472#define RETURN_FAIL_ON_RESTART_FLAGP(flagp) \
f358228e 473 if (MUST_RESTART(*(flagp))) return 0
fd7fc6ae 474
f19b1a63
KW
475/* This converts the named class defined in regcomp.h to its equivalent class
476 * number defined in handy.h. */
477#define namedclass_to_classnum(class) ((int) ((class) / 2))
478#define classnum_to_namedclass(classnum) ((classnum) * 2)
479
de92f5e6
KW
480#define _invlist_union_complement_2nd(a, b, output) \
481 _invlist_union_maybe_complement_2nd(a, b, TRUE, output)
482#define _invlist_intersection_complement_2nd(a, b, output) \
483 _invlist_intersection_maybe_complement_2nd(a, b, TRUE, output)
484
4a808c87
KW
485/* We add a marker if we are deferring expansion of a property that is both
486 * 1) potentiallly user-defined; and
487 * 2) could also be an official Unicode property.
488 *
489 * Without this marker, any deferred expansion can only be for a user-defined
490 * one. This marker shouldn't conflict with any that could be in a legal name,
491 * and is appended to its name to indicate this. There is a string and
492 * character form */
043cdb9a
KW
493#define DEFERRED_COULD_BE_OFFICIAL_MARKERs "~"
494#define DEFERRED_COULD_BE_OFFICIAL_MARKERc '~'
1c2f3d7a 495
0069caf1
KW
496/* What is infinity for optimization purposes */
497#define OPTIMIZE_INFTY SSize_t_MAX
498
1de06328
YO
499/* About scan_data_t.
500
501 During optimisation we recurse through the regexp program performing
502 various inplace (keyhole style) optimisations. In addition study_chunk
503 and scan_commit populate this data structure with information about
538e84ed 504 what strings MUST appear in the pattern. We look for the longest
3b753521 505 string that must appear at a fixed location, and we look for the
1de06328
YO
506 longest string that may appear at a floating location. So for instance
507 in the pattern:
538e84ed 508
1de06328 509 /FOO[xX]A.*B[xX]BAR/
538e84ed 510
1de06328
YO
511 Both 'FOO' and 'A' are fixed strings. Both 'B' and 'BAR' are floating
512 strings (because they follow a .* construct). study_chunk will identify
513 both FOO and BAR as being the longest fixed and floating strings respectively.
538e84ed 514
1de06328 515 The strings can be composites, for instance
538e84ed 516
1de06328 517 /(f)(o)(o)/
538e84ed 518
1de06328 519 will result in a composite fixed substring 'foo'.
538e84ed 520
1de06328 521 For each string some basic information is maintained:
538e84ed 522
6ea77169 523 - min_offset
1de06328
YO
524 This is the position the string must appear at, or not before.
525 It also implicitly (when combined with minlenp) tells us how many
3b753521
FN
526 characters must match before the string we are searching for.
527 Likewise when combined with minlenp and the length of the string it
538e84ed 528 tells us how many characters must appear after the string we have
1de06328 529 found.
538e84ed 530
1de06328
YO
531 - max_offset
532 Only used for floating strings. This is the rightmost point that
0069caf1 533 the string can appear at. If set to OPTIMIZE_INFTY it indicates that the
1de06328 534 string can occur infinitely far to the right.
6e12f832 535 For fixed strings, it is equal to min_offset.
538e84ed 536
1de06328 537 - minlenp
2d608413
KW
538 A pointer to the minimum number of characters of the pattern that the
539 string was found inside. This is important as in the case of positive
538e84ed 540 lookahead or positive lookbehind we can have multiple patterns
1de06328 541 involved. Consider
538e84ed 542
1de06328 543 /(?=FOO).*F/
538e84ed 544
1de06328
YO
545 The minimum length of the pattern overall is 3, the minimum length
546 of the lookahead part is 3, but the minimum length of the part that
538e84ed 547 will actually match is 1. So 'FOO's minimum length is 3, but the
1de06328 548 minimum length for the F is 1. This is important as the minimum length
538e84ed 549 is used to determine offsets in front of and behind the string being
1de06328 550 looked for. Since strings can be composites this is the length of the
486ec47a 551 pattern at the time it was committed with a scan_commit. Note that
1de06328 552 the length is calculated by study_chunk, so that the minimum lengths
538e84ed 553 are not known until the full pattern has been compiled, thus the
1de06328 554 pointer to the value.
538e84ed 555
1de06328 556 - lookbehind
538e84ed 557
1de06328 558 In the case of lookbehind the string being searched for can be
538e84ed 559 offset past the start point of the final matching string.
1de06328
YO
560 If this value was just blithely removed from the min_offset it would
561 invalidate some of the calculations for how many chars must match
562 before or after (as they are derived from min_offset and minlen and
538e84ed 563 the length of the string being searched for).
1de06328
YO
564 When the final pattern is compiled and the data is moved from the
565 scan_data_t structure into the regexp structure the information
538e84ed
KW
566 about lookbehind is factored in, with the information that would
567 have been lost precalculated in the end_shift field for the
1de06328
YO
568 associated string.
569
570 The fields pos_min and pos_delta are used to store the minimum offset
538e84ed 571 and the delta to the maximum offset at the current point in the pattern.
1de06328
YO
572
573*/
2c2d71f5 574
cb6aef8b
DM
575struct scan_data_substrs {
576 SV *str; /* longest substring found in pattern */
577 SSize_t min_offset; /* earliest point in string it can appear */
578 SSize_t max_offset; /* latest point in string it can appear */
579 SSize_t *minlenp; /* pointer to the minlen relevant to the string */
580 SSize_t lookbehind; /* is the pos of the string modified by LB */
11683ecb 581 I32 flags; /* per substring SF_* and SCF_* flags */
cb6aef8b
DM
582};
583
2c2d71f5 584typedef struct scan_data_t {
1de06328
YO
585 /*I32 len_min; unused */
586 /*I32 len_delta; unused */
49f55535 587 SSize_t pos_min;
ea3daa5d 588 SSize_t pos_delta;
2c2d71f5 589 SV *last_found;
ea3daa5d 590 SSize_t last_end; /* min value, <0 unless valid. */
49f55535 591 SSize_t last_start_min;
ea3daa5d 592 SSize_t last_start_max;
2099df82
DM
593 U8 cur_is_floating; /* whether the last_* values should be set as
594 * the next fixed (0) or floating (1)
595 * substring */
6ea77169 596
2099df82
DM
597 /* [0] is longest fixed substring so far, [1] is longest float so far */
598 struct scan_data_substrs substrs[2];
6ea77169 599
11683ecb 600 I32 flags; /* common SF_* and SCF_* flags */
2c2d71f5 601 I32 whilem_c;
49f55535 602 SSize_t *last_closep;
b8f7bb16 603 regnode_ssc *start_class;
2c2d71f5
JH
604} scan_data_t;
605
a687059c 606/*
e50aee73 607 * Forward declarations for pregcomp()'s friends.
a687059c 608 */
a0d0e21e 609
6ea77169 610static const scan_data_t zero_scan_data = {
a2ca2017 611 0, 0, NULL, 0, 0, 0, 0,
6ea77169 612 {
11683ecb
DM
613 { NULL, 0, 0, 0, 0, 0 },
614 { NULL, 0, 0, 0, 0, 0 },
6ea77169
DM
615 },
616 0, 0, NULL, NULL
617};
c277df42 618
11683ecb
DM
619/* study flags */
620
07be1b83
YO
621#define SF_BEFORE_SEOL 0x0001
622#define SF_BEFORE_MEOL 0x0002
11683ecb 623#define SF_BEFORE_EOL (SF_BEFORE_SEOL|SF_BEFORE_MEOL)
c277df42 624
07be1b83
YO
625#define SF_IS_INF 0x0040
626#define SF_HAS_PAR 0x0080
627#define SF_IN_PAR 0x0100
628#define SF_HAS_EVAL 0x0200
bc604ad8
KW
629
630
631/* SCF_DO_SUBSTR is the flag that tells the regexp analyzer to track the
632 * longest substring in the pattern. When it is not set the optimiser keeps
633 * track of position, but does not keep track of the actual strings seen,
634 *
635 * So for instance /foo/ will be parsed with SCF_DO_SUBSTR being true, but
636 * /foo/i will not.
637 *
638 * Similarly, /foo.*(blah|erm|huh).*fnorble/ will have "foo" and "fnorble"
639 * parsed with SCF_DO_SUBSTR on, but while processing the (...) it will be
640 * turned off because of the alternation (BRANCH). */
07be1b83 641#define SCF_DO_SUBSTR 0x0400
bc604ad8 642
653099ff
GS
643#define SCF_DO_STCLASS_AND 0x0800
644#define SCF_DO_STCLASS_OR 0x1000
645#define SCF_DO_STCLASS (SCF_DO_STCLASS_AND|SCF_DO_STCLASS_OR)
e1901655 646#define SCF_WHILEM_VISITED_POS 0x2000
c277df42 647
786e8c11 648#define SCF_TRIE_RESTUDY 0x4000 /* Do restudy? */
538e84ed 649#define SCF_SEEN_ACCEPT 0x8000
688e0391 650#define SCF_TRIE_DOING_RESTUDY 0x10000
4286711a
YO
651#define SCF_IN_DEFINE 0x20000
652
653
654
07be1b83 655
43fead97 656#define UTF cBOOL(RExC_utf8)
00b27cfc
KW
657
658/* The enums for all these are ordered so things work out correctly */
a62b1201 659#define LOC (get_regex_charset(RExC_flags) == REGEX_LOCALE_CHARSET)
538e84ed
KW
660#define DEPENDS_SEMANTICS (get_regex_charset(RExC_flags) \
661 == REGEX_DEPENDS_CHARSET)
00b27cfc 662#define UNI_SEMANTICS (get_regex_charset(RExC_flags) == REGEX_UNICODE_CHARSET)
538e84ed
KW
663#define AT_LEAST_UNI_SEMANTICS (get_regex_charset(RExC_flags) \
664 >= REGEX_UNICODE_CHARSET)
665#define ASCII_RESTRICTED (get_regex_charset(RExC_flags) \
666 == REGEX_ASCII_RESTRICTED_CHARSET)
667#define AT_LEAST_ASCII_RESTRICTED (get_regex_charset(RExC_flags) \
668 >= REGEX_ASCII_RESTRICTED_CHARSET)
669#define ASCII_FOLD_RESTRICTED (get_regex_charset(RExC_flags) \
670 == REGEX_ASCII_MORE_RESTRICTED_CHARSET)
a62b1201 671
43fead97 672#define FOLD cBOOL(RExC_flags & RXf_PMf_FOLD)
a0ed51b3 673
f2c2a6ab
KW
674/* For programs that want to be strictly Unicode compatible by dying if any
675 * attempt is made to match a non-Unicode code point against a Unicode
676 * property. */
677#define ALWAYS_WARN_SUPER ckDEAD(packWARN(WARN_NON_UNICODE))
678
93733859 679#define OOB_NAMEDCLASS -1
b8c5462f 680
8e661ac5
KW
681/* There is no code point that is out-of-bounds, so this is problematic. But
682 * its only current use is to initialize a variable that is always set before
683 * looked at. */
684#define OOB_UNICODE 0xDEADBEEF
685
a0ed51b3 686#define CHR_SVLEN(sv) (UTF ? sv_len_utf8(sv) : SvCUR(sv))
a0ed51b3 687
8615cb43 688
b45f050a
JF
689/* length of regex to show in messages that don't mark a position within */
690#define RegexLengthToShowInErrorMessages 127
691
692/*
693 * If MARKER[12] are adjusted, be sure to adjust the constants at the top
694 * of t/op/regmesg.t, the tests in t/op/re_tests, and those in
695 * op/pragma/warn/regcomp.
696 */
7253e4e3
RK
697#define MARKER1 "<-- HERE" /* marker as it appears in the description */
698#define MARKER2 " <-- HERE " /* marker as it appears within the regex */
b81d288d 699
538e84ed 700#define REPORT_LOCATION " in regex; marked by " MARKER1 \
147e3846 701 " in m/%" UTF8f MARKER2 "%" UTF8f "/"
b45f050a 702
285b5ca0
KW
703/* The code in this file in places uses one level of recursion with parsing
704 * rebased to an alternate string constructed by us in memory. This can take
705 * the form of something that is completely different from the input, or
706 * something that uses the input as part of the alternate. In the first case,
707 * there should be no possibility of an error, as we are in complete control of
51684e3c
KW
708 * the alternate string. But in the second case we don't completely control
709 * the input portion, so there may be errors in that. Here's an example:
285b5ca0
KW
710 * /[abc\x{DF}def]/ui
711 * is handled specially because \x{df} folds to a sequence of more than one
51684e3c 712 * character: 'ss'. What is done is to create and parse an alternate string,
285b5ca0
KW
713 * which looks like this:
714 * /(?:\x{DF}|[abc\x{DF}def])/ui
715 * where it uses the input unchanged in the middle of something it constructs,
716 * which is a branch for the DF outside the character class, and clustering
717 * parens around the whole thing. (It knows enough to skip the DF inside the
718 * class while in this substitute parse.) 'abc' and 'def' may have errors that
719 * need to be reported. The general situation looks like this:
720 *
51684e3c 721 * |<------- identical ------>|
285b5ca0 722 * sI tI xI eI
51684e3c 723 * Input: ---------------------------------------------------------------
285b5ca0
KW
724 * Constructed: ---------------------------------------------------
725 * sC tC xC eC EC
51684e3c 726 * |<------- identical ------>|
285b5ca0 727 *
51684e3c
KW
728 * sI..eI is the portion of the input pattern we are concerned with here.
729 * sC..EC is the constructed substitute parse string.
730 * sC..tC is constructed by us
731 * tC..eC is an exact duplicate of the portion of the input pattern tI..eI.
732 * In the diagram, these are vertically aligned.
733 * eC..EC is also constructed by us.
734 * xC is the position in the substitute parse string where we found a
735 * problem.
736 * xI is the position in the original pattern corresponding to xC.
285b5ca0 737 *
51684e3c
KW
738 * We want to display a message showing the real input string. Thus we need to
739 * translate from xC to xI. We know that xC >= tC, since the portion of the
740 * string sC..tC has been constructed by us, and so shouldn't have errors. We
741 * get:
742 * xI = tI + (xC - tC)
285b5ca0 743 *
51684e3c
KW
744 * When the substitute parse is constructed, the code needs to set:
745 * RExC_start (sC)
746 * RExC_end (eC)
747 * RExC_copy_start_in_input (tI)
748 * RExC_copy_start_in_constructed (tC)
749 * and restore them when done.
285b5ca0 750 *
51684e3c
KW
751 * During normal processing of the input pattern, both
752 * 'RExC_copy_start_in_input' and 'RExC_copy_start_in_constructed' are set to
753 * sI, so that xC equals xI.
285b5ca0
KW
754 */
755
51684e3c
KW
756#define sI RExC_precomp
757#define eI RExC_precomp_end
758#define sC RExC_start
759#define eC RExC_end
760#define tI RExC_copy_start_in_input
761#define tC RExC_copy_start_in_constructed
762#define xI(xC) (tI + (xC - tC))
763#define xI_offset(xC) (xI(xC) - sI)
285b5ca0
KW
764
765#define REPORT_LOCATION_ARGS(xC) \
766 UTF8fARG(UTF, \
51684e3c 767 (xI(xC) > eI) /* Don't run off end */ \
232b691f 768 ? eI - sI /* Length before the <--HERE */ \
bc929b61
KW
769 : ((xI_offset(xC) >= 0) \
770 ? xI_offset(xC) \
771 : (Perl_croak(aTHX_ "panic: %s: %d: negative offset: %" \
772 IVdf " trying to output message for " \
773 " pattern %.*s", \
51684e3c 774 __FILE__, __LINE__, (IV) xI_offset(xC), \
bc929b61 775 ((int) (eC - sC)), sC), 0)), \
51684e3c 776 sI), /* The input pattern printed up to the <--HERE */ \
285b5ca0 777 UTF8fARG(UTF, \
51684e3c
KW
778 (xI(xC) > eI) ? 0 : eI - xI(xC), /* Length after <--HERE */ \
779 (xI(xC) > eI) ? eI : xI(xC)) /* pattern after <--HERE */
c1d900c3 780
8a6d8ec6
HS
781/* Used to point after bad bytes for an error message, but avoid skipping
782 * past a nul byte. */
e3304382 783#define SKIP_IF_CHAR(s, e) (!*(s) ? 0 : UTF ? UTF8_SAFE_SKIP(s, e) : 1)
8a6d8ec6 784
ed68fe11
KW
785/* Set up to clean up after our imminent demise */
786#define PREPARE_TO_DIE \
787 STMT_START { \
788 if (RExC_rx_sv) \
789 SAVEFREESV(RExC_rx_sv); \
7c932d07
KW
790 if (RExC_open_parens) \
791 SAVEFREEPV(RExC_open_parens); \
792 if (RExC_close_parens) \
793 SAVEFREEPV(RExC_close_parens); \
ed68fe11
KW
794 } STMT_END
795
b45f050a
JF
796/*
797 * Calls SAVEDESTRUCTOR_X if needed, then calls Perl_croak with the given
798 * arg. Show regex, up to a maximum length. If it's too long, chop and add
799 * "...".
800 */
58e23c8d 801#define _FAIL(code) STMT_START { \
bfed75c6 802 const char *ellipses = ""; \
88f063b4 803 IV len = RExC_precomp_end - RExC_precomp; \
ccb2c380 804 \
ed68fe11 805 PREPARE_TO_DIE; \
ccb2c380
MP
806 if (len > RegexLengthToShowInErrorMessages) { \
807 /* chop 10 shorter than the max, to ensure meaning of "..." */ \
808 len = RegexLengthToShowInErrorMessages - 10; \
809 ellipses = "..."; \
810 } \
58e23c8d 811 code; \
ccb2c380 812} STMT_END
8615cb43 813
58e23c8d 814#define FAIL(msg) _FAIL( \
147e3846 815 Perl_croak(aTHX_ "%s in regex m/%" UTF8f "%s/", \
c1d900c3 816 msg, UTF8fARG(UTF, len, RExC_precomp), ellipses))
58e23c8d
YO
817
818#define FAIL2(msg,arg) _FAIL( \
147e3846 819 Perl_croak(aTHX_ msg " in regex m/%" UTF8f "%s/", \
c1d900c3 820 arg, UTF8fARG(UTF, len, RExC_precomp), ellipses))
58e23c8d 821
4b7718d6
KW
822#define FAIL3(msg,arg1,arg2) _FAIL( \
823 Perl_croak(aTHX_ msg " in regex m/%" UTF8f "%s/", \
824 arg1, arg2, UTF8fARG(UTF, len, RExC_precomp), ellipses))
825
b45f050a 826/*
b45f050a
JF
827 * Simple_vFAIL -- like FAIL, but marks the current location in the scan
828 */
ccb2c380 829#define Simple_vFAIL(m) STMT_START { \
ccb2c380 830 Perl_croak(aTHX_ "%s" REPORT_LOCATION, \
d528642a 831 m, REPORT_LOCATION_ARGS(RExC_parse)); \
ccb2c380 832} STMT_END
b45f050a
JF
833
834/*
835 * Calls SAVEDESTRUCTOR_X if needed, then Simple_vFAIL()
836 */
ccb2c380 837#define vFAIL(m) STMT_START { \
ed68fe11 838 PREPARE_TO_DIE; \
ccb2c380
MP
839 Simple_vFAIL(m); \
840} STMT_END
b45f050a
JF
841
842/*
843 * Like Simple_vFAIL(), but accepts two arguments.
844 */
ccb2c380 845#define Simple_vFAIL2(m,a1) STMT_START { \
c6572a6d 846 S_re_croak(aTHX_ UTF, m REPORT_LOCATION, a1, \
d528642a 847 REPORT_LOCATION_ARGS(RExC_parse)); \
ccb2c380 848} STMT_END
b45f050a
JF
849
850/*
851 * Calls SAVEDESTRUCTOR_X if needed, then Simple_vFAIL2().
852 */
ccb2c380 853#define vFAIL2(m,a1) STMT_START { \
ed68fe11 854 PREPARE_TO_DIE; \
ccb2c380
MP
855 Simple_vFAIL2(m, a1); \
856} STMT_END
b45f050a
JF
857
858
859/*
860 * Like Simple_vFAIL(), but accepts three arguments.
861 */
ccb2c380 862#define Simple_vFAIL3(m, a1, a2) STMT_START { \
c6572a6d 863 S_re_croak(aTHX_ UTF, m REPORT_LOCATION, a1, a2, \
d528642a 864 REPORT_LOCATION_ARGS(RExC_parse)); \
ccb2c380 865} STMT_END
b45f050a
JF
866
867/*
868 * Calls SAVEDESTRUCTOR_X if needed, then Simple_vFAIL3().
869 */
ccb2c380 870#define vFAIL3(m,a1,a2) STMT_START { \
ed68fe11 871 PREPARE_TO_DIE; \
ccb2c380
MP
872 Simple_vFAIL3(m, a1, a2); \
873} STMT_END
b45f050a
JF
874
875/*
876 * Like Simple_vFAIL(), but accepts four arguments.
877 */
ccb2c380 878#define Simple_vFAIL4(m, a1, a2, a3) STMT_START { \
c6572a6d 879 S_re_croak(aTHX_ UTF, m REPORT_LOCATION, a1, a2, a3, \
d528642a 880 REPORT_LOCATION_ARGS(RExC_parse)); \
ccb2c380 881} STMT_END
b45f050a 882
95db3ffa 883#define vFAIL4(m,a1,a2,a3) STMT_START { \
ed68fe11 884 PREPARE_TO_DIE; \
95db3ffa
KW
885 Simple_vFAIL4(m, a1, a2, a3); \
886} STMT_END
887
946095af 888/* A specialized version of vFAIL2 that works with UTF8f */
d528642a 889#define vFAIL2utf8f(m, a1) STMT_START { \
ed68fe11 890 PREPARE_TO_DIE; \
c6572a6d 891 S_re_croak(aTHX_ UTF, m REPORT_LOCATION, a1, \
d528642a 892 REPORT_LOCATION_ARGS(RExC_parse)); \
946095af
BF
893} STMT_END
894
3ba22297 895#define vFAIL3utf8f(m, a1, a2) STMT_START { \
ed68fe11 896 PREPARE_TO_DIE; \
c6572a6d 897 S_re_croak(aTHX_ UTF, m REPORT_LOCATION, a1, a2, \
d528642a 898 REPORT_LOCATION_ARGS(RExC_parse)); \
3ba22297
KW
899} STMT_END
900
d62e749e 901/* Setting this to NULL is a signal to not output warnings */
cc16d262
KW
902#define TURN_OFF_WARNINGS_IN_SUBSTITUTE_PARSE \
903 STMT_START { \
904 RExC_save_copy_start_in_constructed = RExC_copy_start_in_constructed;\
905 RExC_copy_start_in_constructed = NULL; \
906 } STMT_END
907#define RESTORE_WARNINGS \
908 RExC_copy_start_in_constructed = RExC_save_copy_start_in_constructed
d62e749e 909
7c932d07
KW
910/* Since a warning can be generated multiple times as the input is reparsed, we
911 * output it the first time we come to that point in the parse, but suppress it
912 * otherwise. 'RExC_copy_start_in_constructed' being NULL is a flag to not
913 * generate any warnings */
55bcf857 914#define TO_OUTPUT_WARNINGS(loc) \
7c932d07
KW
915 ( RExC_copy_start_in_constructed \
916 && ((xI(loc)) - RExC_precomp) > (Ptrdiff_t) RExC_latest_warn_offset)
55bcf857 917
7c932d07
KW
918/* After we've emitted a warning, we save the position in the input so we don't
919 * output it again */
920#define UPDATE_WARNINGS_LOC(loc) \
921 STMT_START { \
922 if (TO_OUTPUT_WARNINGS(loc)) { \
5d894ca5
KW
923 RExC_latest_warn_offset = MAX(sI, MIN(eI, xI(loc))) \
924 - RExC_precomp; \
7c932d07
KW
925 } \
926 } STMT_END
77ccba3c 927
086f21bf 928/* 'warns' is the output of the packWARNx macro used in 'code' */
f8220fe2
KW
929#define _WARN_HELPER(loc, warns, code) \
930 STMT_START { \
d62e749e
KW
931 if (! RExC_copy_start_in_constructed) { \
932 Perl_croak( aTHX_ "panic! %s: %d: Tried to warn when none" \
933 " expected at '%s'", \
934 __FILE__, __LINE__, loc); \
935 } \
55bcf857 936 if (TO_OUTPUT_WARNINGS(loc)) { \
086f21bf
KW
937 if (ckDEAD(warns)) \
938 PREPARE_TO_DIE; \
338190a1 939 code; \
77ccba3c 940 UPDATE_WARNINGS_LOC(loc); \
338190a1 941 } \
f8220fe2 942 } STMT_END
946095af 943
5e0a247b 944/* m is not necessarily a "literal string", in this macro */
58ad8160
KW
945#define warn_non_literal_string(loc, packed_warn, m) \
946 _WARN_HELPER(loc, packed_warn, \
947 Perl_warner(aTHX_ packed_warn, \
d528642a 948 "%s" REPORT_LOCATION, \
f8220fe2 949 m, REPORT_LOCATION_ARGS(loc)))
58ad8160
KW
950#define reg_warn_non_literal_string(loc, m) \
951 warn_non_literal_string(loc, packWARN(WARN_REGEXP), m)
5e0a247b 952
eb761011
KW
953#define ckWARN2_non_literal_string(loc, packwarn, m, a1) \
954 STMT_START { \
955 char * format; \
956 Size_t format_size = strlen(m) + strlen(REPORT_LOCATION)+ 1;\
957 Newx(format, format_size, char); \
958 my_strlcpy(format, m, format_size); \
959 my_strlcat(format, REPORT_LOCATION, format_size); \
960 SAVEFREEPV(format); \
961 _WARN_HELPER(loc, packwarn, \
962 Perl_ck_warner(aTHX_ packwarn, \
963 format, \
964 a1, REPORT_LOCATION_ARGS(loc))); \
965 } STMT_END
966
f8220fe2
KW
967#define ckWARNreg(loc,m) \
968 _WARN_HELPER(loc, packWARN(WARN_REGEXP), \
969 Perl_ck_warner(aTHX_ packWARN(WARN_REGEXP), \
d528642a 970 m REPORT_LOCATION, \
f8220fe2 971 REPORT_LOCATION_ARGS(loc)))
ccb2c380 972
f8220fe2
KW
973#define vWARN(loc, m) \
974 _WARN_HELPER(loc, packWARN(WARN_REGEXP), \
975 Perl_warner(aTHX_ packWARN(WARN_REGEXP), \
d528642a 976 m REPORT_LOCATION, \
f8220fe2 977 REPORT_LOCATION_ARGS(loc))) \
b927b7e9 978
f8220fe2
KW
979#define vWARN_dep(loc, m) \
980 _WARN_HELPER(loc, packWARN(WARN_DEPRECATED), \
981 Perl_warner(aTHX_ packWARN(WARN_DEPRECATED), \
d528642a 982 m REPORT_LOCATION, \
f8220fe2 983 REPORT_LOCATION_ARGS(loc)))
0d6106aa 984
f8220fe2
KW
985#define ckWARNdep(loc,m) \
986 _WARN_HELPER(loc, packWARN(WARN_DEPRECATED), \
987 Perl_ck_warner_d(aTHX_ packWARN(WARN_DEPRECATED), \
d528642a 988 m REPORT_LOCATION, \
f8220fe2 989 REPORT_LOCATION_ARGS(loc)))
147508a2 990
f8220fe2
KW
991#define ckWARNregdep(loc,m) \
992 _WARN_HELPER(loc, packWARN2(WARN_DEPRECATED, WARN_REGEXP), \
993 Perl_ck_warner_d(aTHX_ packWARN2(WARN_DEPRECATED, \
d528642a
KW
994 WARN_REGEXP), \
995 m REPORT_LOCATION, \
f8220fe2 996 REPORT_LOCATION_ARGS(loc)))
ccb2c380 997
f8220fe2
KW
998#define ckWARN2reg_d(loc,m, a1) \
999 _WARN_HELPER(loc, packWARN(WARN_REGEXP), \
1000 Perl_ck_warner_d(aTHX_ packWARN(WARN_REGEXP), \
d528642a 1001 m REPORT_LOCATION, \
f8220fe2 1002 a1, REPORT_LOCATION_ARGS(loc)))
2335b3d3 1003
f8220fe2
KW
1004#define ckWARN2reg(loc, m, a1) \
1005 _WARN_HELPER(loc, packWARN(WARN_REGEXP), \
1006 Perl_ck_warner(aTHX_ packWARN(WARN_REGEXP), \
d528642a 1007 m REPORT_LOCATION, \
f8220fe2 1008 a1, REPORT_LOCATION_ARGS(loc)))
ccb2c380 1009
f8220fe2
KW
1010#define vWARN3(loc, m, a1, a2) \
1011 _WARN_HELPER(loc, packWARN(WARN_REGEXP), \
1012 Perl_warner(aTHX_ packWARN(WARN_REGEXP), \
d528642a 1013 m REPORT_LOCATION, \
f8220fe2 1014 a1, a2, REPORT_LOCATION_ARGS(loc)))
ccb2c380 1015
f8220fe2
KW
1016#define ckWARN3reg(loc, m, a1, a2) \
1017 _WARN_HELPER(loc, packWARN(WARN_REGEXP), \
1018 Perl_ck_warner(aTHX_ packWARN(WARN_REGEXP), \
d528642a
KW
1019 m REPORT_LOCATION, \
1020 a1, a2, \
f8220fe2 1021 REPORT_LOCATION_ARGS(loc)))
668c081a 1022
f8220fe2
KW
1023#define vWARN4(loc, m, a1, a2, a3) \
1024 _WARN_HELPER(loc, packWARN(WARN_REGEXP), \
1025 Perl_warner(aTHX_ packWARN(WARN_REGEXP), \
d528642a
KW
1026 m REPORT_LOCATION, \
1027 a1, a2, a3, \
f8220fe2 1028 REPORT_LOCATION_ARGS(loc)))
ccb2c380 1029
f8220fe2
KW
1030#define ckWARN4reg(loc, m, a1, a2, a3) \
1031 _WARN_HELPER(loc, packWARN(WARN_REGEXP), \
1032 Perl_ck_warner(aTHX_ packWARN(WARN_REGEXP), \
d528642a
KW
1033 m REPORT_LOCATION, \
1034 a1, a2, a3, \
f8220fe2 1035 REPORT_LOCATION_ARGS(loc)))
668c081a 1036
f8220fe2
KW
1037#define vWARN5(loc, m, a1, a2, a3, a4) \
1038 _WARN_HELPER(loc, packWARN(WARN_REGEXP), \
1039 Perl_warner(aTHX_ packWARN(WARN_REGEXP), \
d528642a
KW
1040 m REPORT_LOCATION, \
1041 a1, a2, a3, a4, \
f8220fe2 1042 REPORT_LOCATION_ARGS(loc)))
9d1d55b5 1043
aea5f181 1044#define ckWARNexperimental(loc, class, m) \
5aabce33
KW
1045 STMT_START { \
1046 if (! RExC_warned_ ## class) { /* warn once per compilation */ \
1047 RExC_warned_ ## class = 1; \
1048 _WARN_HELPER(loc, packWARN(class), \
aea5f181
KW
1049 Perl_ck_warner_d(aTHX_ packWARN(class), \
1050 m REPORT_LOCATION, \
5aabce33
KW
1051 REPORT_LOCATION_ARGS(loc)));\
1052 } \
1053 } STMT_END
aea5f181 1054
1a322bb4
KW
1055/* Convert between a pointer to a node and its offset from the beginning of the
1056 * program */
1057#define REGNODE_p(offset) (RExC_emit_start + (offset))
1058#define REGNODE_OFFSET(node) ((node) - RExC_emit_start)
1059
538e84ed 1060/* Macros for recording node offsets. 20001227 mjd@plover.com
fac92740
MJD
1061 * Nodes are numbered 1, 2, 3, 4. Node #n's position is recorded in
1062 * element 2*n-1 of the array. Element #2n holds the byte length node #n.
1063 * Element 0 holds the number n.
07be1b83 1064 * Position is 1 indexed.
fac92740 1065 */
7122b237 1066#ifndef RE_TRACK_PATTERN_OFFSETS
d7c442ce 1067#define Set_Node_Offset_To_R(offset,byte)
7122b237
YO
1068#define Set_Node_Offset(node,byte)
1069#define Set_Cur_Node_Offset
1070#define Set_Node_Length_To_R(node,len)
1071#define Set_Node_Length(node,len)
6a86c6ad 1072#define Set_Node_Cur_Length(node,start)
538e84ed
KW
1073#define Node_Offset(n)
1074#define Node_Length(n)
7122b237
YO
1075#define Set_Node_Offset_Length(node,offset,len)
1076#define ProgLen(ri) ri->u.proglen
1077#define SetProgLen(ri,x) ri->u.proglen = x
4cbeff97 1078#define Track_Code(code)
7122b237
YO
1079#else
1080#define ProgLen(ri) ri->u.offsets[0]
1081#define SetProgLen(ri,x) ri->u.offsets[0] = x
d7c442ce 1082#define Set_Node_Offset_To_R(offset,byte) STMT_START { \
ccb2c380 1083 MJD_OFFSET_DEBUG(("** (%d) offset of node %d is %d.\n", \
d7c442ce
KW
1084 __LINE__, (int)(offset), (int)(byte))); \
1085 if((offset) < 0) { \
538e84ed 1086 Perl_croak(aTHX_ "value of node is %d in Offset macro", \
d7c442ce 1087 (int)(offset)); \
ccb2c380 1088 } else { \
d7c442ce 1089 RExC_offsets[2*(offset)-1] = (byte); \
ccb2c380 1090 } \
ccb2c380
MP
1091} STMT_END
1092
88f063b4 1093#define Set_Node_Offset(node,byte) \
1a322bb4 1094 Set_Node_Offset_To_R(REGNODE_OFFSET(node), (byte)-RExC_start)
ccb2c380
MP
1095#define Set_Cur_Node_Offset Set_Node_Offset(RExC_emit, RExC_parse)
1096
1097#define Set_Node_Length_To_R(node,len) STMT_START { \
ccb2c380 1098 MJD_OFFSET_DEBUG(("** (%d) size of node %d is %d.\n", \
551405c4 1099 __LINE__, (int)(node), (int)(len))); \
ccb2c380 1100 if((node) < 0) { \
538e84ed
KW
1101 Perl_croak(aTHX_ "value of node is %d in Length macro", \
1102 (int)(node)); \
ccb2c380
MP
1103 } else { \
1104 RExC_offsets[2*(node)] = (len); \
1105 } \
ccb2c380
MP
1106} STMT_END
1107
1108#define Set_Node_Length(node,len) \
1a322bb4 1109 Set_Node_Length_To_R(REGNODE_OFFSET(node), len)
6a86c6ad
NC
1110#define Set_Node_Cur_Length(node, start) \
1111 Set_Node_Length(node, RExC_parse - start)
fac92740
MJD
1112
1113/* Get offsets and lengths */
1a322bb4
KW
1114#define Node_Offset(n) (RExC_offsets[2*(REGNODE_OFFSET(n))-1])
1115#define Node_Length(n) (RExC_offsets[2*(REGNODE_OFFSET(n))])
fac92740 1116
07be1b83 1117#define Set_Node_Offset_Length(node,offset,len) STMT_START { \
1a322bb4
KW
1118 Set_Node_Offset_To_R(REGNODE_OFFSET(node), (offset)); \
1119 Set_Node_Length_To_R(REGNODE_OFFSET(node), (len)); \
07be1b83 1120} STMT_END
4cbeff97
TC
1121
1122#define Track_Code(code) STMT_START { code } STMT_END
7122b237 1123#endif
07be1b83
YO
1124
1125#if PERL_ENABLE_EXPERIMENTAL_REGEX_OPTIMISATIONS
1126#define EXPERIMENTAL_INPLACESCAN
f427392e 1127#endif /*PERL_ENABLE_EXPERIMENTAL_REGEX_OPTIMISATIONS*/
07be1b83 1128
cb41e5d6
YO
1129#ifdef DEBUGGING
1130int
6ad9a8ab 1131Perl_re_printf(pTHX_ const char *fmt, ...)
cb41e5d6
YO
1132{
1133 va_list ap;
1134 int result;
1135 PerlIO *f= Perl_debug_log;
1136 PERL_ARGS_ASSERT_RE_PRINTF;
1137 va_start(ap, fmt);
1138 result = PerlIO_vprintf(f, fmt, ap);
1139 va_end(ap);
1140 return result;
1141}
1142
1143int
7b031478 1144Perl_re_indentf(pTHX_ const char *fmt, U32 depth, ...)
cb41e5d6
YO
1145{
1146 va_list ap;
1147 int result;
1148 PerlIO *f= Perl_debug_log;
1149 PERL_ARGS_ASSERT_RE_INDENTF;
1150 va_start(ap, depth);
daeb874b 1151 PerlIO_printf(f, "%*s", ( (int)depth % 20 ) * 2, "");
cb41e5d6
YO
1152 result = PerlIO_vprintf(f, fmt, ap);
1153 va_end(ap);
1154 return result;
1155}
1156#endif /* DEBUGGING */
1157
7b031478 1158#define DEBUG_RExC_seen() \
538e84ed 1159 DEBUG_OPTIMISE_MORE_r({ \
88f063b4 1160 Perl_re_printf( aTHX_ "RExC_seen: "); \
538e84ed 1161 \
e384d5c1 1162 if (RExC_seen & REG_ZERO_LEN_SEEN) \
88f063b4 1163 Perl_re_printf( aTHX_ "REG_ZERO_LEN_SEEN "); \
538e84ed 1164 \
e384d5c1 1165 if (RExC_seen & REG_LOOKBEHIND_SEEN) \
88f063b4 1166 Perl_re_printf( aTHX_ "REG_LOOKBEHIND_SEEN "); \
538e84ed 1167 \
e384d5c1 1168 if (RExC_seen & REG_GPOS_SEEN) \
88f063b4 1169 Perl_re_printf( aTHX_ "REG_GPOS_SEEN "); \
538e84ed 1170 \
e384d5c1 1171 if (RExC_seen & REG_RECURSE_SEEN) \
88f063b4 1172 Perl_re_printf( aTHX_ "REG_RECURSE_SEEN "); \
538e84ed 1173 \
7b031478 1174 if (RExC_seen & REG_TOP_LEVEL_BRANCHES_SEEN) \
88f063b4 1175 Perl_re_printf( aTHX_ "REG_TOP_LEVEL_BRANCHES_SEEN "); \
538e84ed 1176 \
e384d5c1 1177 if (RExC_seen & REG_VERBARG_SEEN) \
88f063b4 1178 Perl_re_printf( aTHX_ "REG_VERBARG_SEEN "); \
538e84ed 1179 \
e384d5c1 1180 if (RExC_seen & REG_CUTGROUP_SEEN) \
88f063b4 1181 Perl_re_printf( aTHX_ "REG_CUTGROUP_SEEN "); \
538e84ed 1182 \
e384d5c1 1183 if (RExC_seen & REG_RUN_ON_COMMENT_SEEN) \
88f063b4 1184 Perl_re_printf( aTHX_ "REG_RUN_ON_COMMENT_SEEN "); \
538e84ed 1185 \
e384d5c1 1186 if (RExC_seen & REG_UNFOLDED_MULTI_SEEN) \
88f063b4 1187 Perl_re_printf( aTHX_ "REG_UNFOLDED_MULTI_SEEN "); \
538e84ed 1188 \
7b031478 1189 if (RExC_seen & REG_UNBOUNDED_QUANTIFIER_SEEN) \
88f063b4 1190 Perl_re_printf( aTHX_ "REG_UNBOUNDED_QUANTIFIER_SEEN "); \
ee273784 1191 \
88f063b4 1192 Perl_re_printf( aTHX_ "\n"); \
9e9ecfdf
YO
1193 });
1194
fdfb4f21 1195#define DEBUG_SHOW_STUDY_FLAG(flags,flag) \
6ad9a8ab 1196 if ((flags) & flag) Perl_re_printf( aTHX_ "%s ", #flag)
fdfb4f21 1197
fdfb4f21 1198
f5a36d78
DM
1199#ifdef DEBUGGING
1200static void
1201S_debug_show_study_flags(pTHX_ U32 flags, const char *open_str,
1202 const char *close_str)
1203{
1204 if (!flags)
1205 return;
1206
1207 Perl_re_printf( aTHX_ "%s", open_str);
11683ecb
DM
1208 DEBUG_SHOW_STUDY_FLAG(flags, SF_BEFORE_SEOL);
1209 DEBUG_SHOW_STUDY_FLAG(flags, SF_BEFORE_MEOL);
f5a36d78
DM
1210 DEBUG_SHOW_STUDY_FLAG(flags, SF_IS_INF);
1211 DEBUG_SHOW_STUDY_FLAG(flags, SF_HAS_PAR);
1212 DEBUG_SHOW_STUDY_FLAG(flags, SF_IN_PAR);
1213 DEBUG_SHOW_STUDY_FLAG(flags, SF_HAS_EVAL);
1214 DEBUG_SHOW_STUDY_FLAG(flags, SCF_DO_SUBSTR);
1215 DEBUG_SHOW_STUDY_FLAG(flags, SCF_DO_STCLASS_AND);
1216 DEBUG_SHOW_STUDY_FLAG(flags, SCF_DO_STCLASS_OR);
1217 DEBUG_SHOW_STUDY_FLAG(flags, SCF_DO_STCLASS);
1218 DEBUG_SHOW_STUDY_FLAG(flags, SCF_WHILEM_VISITED_POS);
1219 DEBUG_SHOW_STUDY_FLAG(flags, SCF_TRIE_RESTUDY);
1220 DEBUG_SHOW_STUDY_FLAG(flags, SCF_SEEN_ACCEPT);
1221 DEBUG_SHOW_STUDY_FLAG(flags, SCF_TRIE_DOING_RESTUDY);
1222 DEBUG_SHOW_STUDY_FLAG(flags, SCF_IN_DEFINE);
1223 Perl_re_printf( aTHX_ "%s", close_str);
1224}
1225
1226
1227static void
1228S_debug_studydata(pTHX_ const char *where, scan_data_t *data,
1229 U32 depth, int is_inf)
1230{
271b36b1 1231 DECLARE_AND_GET_RE_DEBUG_FLAGS;
f5a36d78
DM
1232
1233 DEBUG_OPTIMISE_MORE_r({
1234 if (!data)
1235 return;
1236 Perl_re_indentf(aTHX_ "%s: Pos:%" IVdf "/%" IVdf " Flags: 0x%" UVXf,
1237 depth,
1238 where,
1239 (IV)data->pos_min,
1240 (IV)data->pos_delta,
1241 (UV)data->flags
1242 );
1243
1244 S_debug_show_study_flags(aTHX_ data->flags," [","]");
fdfb4f21 1245
f5a36d78
DM
1246 Perl_re_printf( aTHX_
1247 " Whilem_c: %" IVdf " Lcp: %" IVdf " %s",
1248 (IV)data->whilem_c,
1249 (IV)(data->last_closep ? *((data)->last_closep) : -1),
1250 is_inf ? "INF " : ""
1251 );
1252
11683ecb 1253 if (data->last_found) {
37b6262f 1254 int i;
11683ecb
DM
1255 Perl_re_printf(aTHX_
1256 "Last:'%s' %" IVdf ":%" IVdf "/%" IVdf,
1257 SvPVX_const(data->last_found),
1258 (IV)data->last_end,
1259 (IV)data->last_start_min,
1260 (IV)data->last_start_max
1261 );
1262
2099df82 1263 for (i = 0; i < 2; i++) {
37b6262f
DM
1264 Perl_re_printf(aTHX_
1265 " %s%s: '%s' @ %" IVdf "/%" IVdf,
2099df82
DM
1266 data->cur_is_floating == i ? "*" : "",
1267 i ? "Float" : "Fixed",
37b6262f
DM
1268 SvPVX_const(data->substrs[i].str),
1269 (IV)data->substrs[i].min_offset,
1270 (IV)data->substrs[i].max_offset
1271 );
1272 S_debug_show_study_flags(aTHX_ data->substrs[i].flags," [","]");
1273 }
11683ecb 1274 }
f5a36d78
DM
1275
1276 Perl_re_printf( aTHX_ "\n");
1277 });
1278}
1279
1280
1281static void
1282S_debug_peep(pTHX_ const char *str, const RExC_state_t *pRExC_state,
1283 regnode *scan, U32 depth, U32 flags)
1284{
271b36b1 1285 DECLARE_AND_GET_RE_DEBUG_FLAGS;
f5a36d78
DM
1286
1287 DEBUG_OPTIMISE_r({
1288 regnode *Next;
1289
1290 if (!scan)
1291 return;
1292 Next = regnext(scan);
1293 regprop(RExC_rx, RExC_mysv, scan, NULL, pRExC_state);
1294 Perl_re_indentf( aTHX_ "%s>%3d: %s (%d)",
1295 depth,
1296 str,
1297 REG_NODE_NUM(scan), SvPV_nolen_const(RExC_mysv),
1298 Next ? (REG_NODE_NUM(Next)) : 0 );
1299 S_debug_show_study_flags(aTHX_ flags," [ ","]");
1300 Perl_re_printf( aTHX_ "\n");
1301 });
1302}
1303
1304
1305# define DEBUG_STUDYDATA(where, data, depth, is_inf) \
1306 S_debug_studydata(aTHX_ where, data, depth, is_inf)
1307
1308# define DEBUG_PEEP(str, scan, depth, flags) \
1309 S_debug_peep(aTHX_ str, pRExC_state, scan, depth, flags)
1310
1311#else
1312# define DEBUG_STUDYDATA(where, data, depth, is_inf) NOOP
1313# define DEBUG_PEEP(str, scan, depth, flags) NOOP
1314#endif
1de06328 1315
cb41e5d6 1316
c6871b76
KW
1317/* =========================================================
1318 * BEGIN edit_distance stuff.
1319 *
1320 * This calculates how many single character changes of any type are needed to
1321 * transform a string into another one. It is taken from version 3.1 of
1322 *
1323 * https://metacpan.org/pod/Text::Levenshtein::Damerau::XS
1324 */
1325
1326/* Our unsorted dictionary linked list. */
1327/* Note we use UVs, not chars. */
1328
1329struct dictionary{
1330 UV key;
1331 UV value;
1332 struct dictionary* next;
1333};
1334typedef struct dictionary item;
1335
1336
1337PERL_STATIC_INLINE item*
88f063b4 1338push(UV key, item* curr)
c6871b76
KW
1339{
1340 item* head;
d09f14bf 1341 Newx(head, 1, item);
c6871b76
KW
1342 head->key = key;
1343 head->value = 0;
1344 head->next = curr;
1345 return head;
1346}
1347
1348
1349PERL_STATIC_INLINE item*
1350find(item* head, UV key)
1351{
1352 item* iterator = head;
1353 while (iterator){
1354 if (iterator->key == key){
1355 return iterator;
1356 }
1357 iterator = iterator->next;
1358 }
1359
1360 return NULL;
1361}
1362
1363PERL_STATIC_INLINE item*
88f063b4 1364uniquePush(item* head, UV key)
c6871b76
KW
1365{
1366 item* iterator = head;
1367
1368 while (iterator){
1369 if (iterator->key == key) {
1370 return head;
1371 }
1372 iterator = iterator->next;
1373 }
1374
88f063b4 1375 return push(key, head);
c6871b76
KW
1376}
1377
1378PERL_STATIC_INLINE void
1379dict_free(item* head)
1380{
1381 item* iterator = head;
1382
1383 while (iterator) {
1384 item* temp = iterator;
1385 iterator = iterator->next;
1386 Safefree(temp);
1387 }
1388
1389 head = NULL;
1390}
1391
1392/* End of Dictionary Stuff */
1393
1394/* All calculations/work are done here */
1395STATIC int
1396S_edit_distance(const UV* src,
1397 const UV* tgt,
1398 const STRLEN x, /* length of src[] */
1399 const STRLEN y, /* length of tgt[] */
1400 const SSize_t maxDistance
1401)
1402{
1403 item *head = NULL;
88f063b4 1404 UV swapCount, swapScore, targetCharCount, i, j;
c6871b76
KW
1405 UV *scores;
1406 UV score_ceil = x + y;
1407
1408 PERL_ARGS_ASSERT_EDIT_DISTANCE;
1409
1410 /* intialize matrix start values */
d09f14bf 1411 Newx(scores, ( (x + 2) * (y + 2)), UV);
c6871b76
KW
1412 scores[0] = score_ceil;
1413 scores[1 * (y + 2) + 0] = score_ceil;
1414 scores[0 * (y + 2) + 1] = score_ceil;
1415 scores[1 * (y + 2) + 1] = 0;
88f063b4 1416 head = uniquePush(uniquePush(head, src[0]), tgt[0]);
c6871b76
KW
1417
1418 /* work loops */
1419 /* i = src index */
1420 /* j = tgt index */
1421 for (i=1;i<=x;i++) {
1422 if (i < x)
88f063b4 1423 head = uniquePush(head, src[i]);
c6871b76
KW
1424 scores[(i+1) * (y + 2) + 1] = i;
1425 scores[(i+1) * (y + 2) + 0] = score_ceil;
1426 swapCount = 0;
1427
1428 for (j=1;j<=y;j++) {
1429 if (i == 1) {
1430 if(j < y)
88f063b4 1431 head = uniquePush(head, tgt[j]);
c6871b76
KW
1432 scores[1 * (y + 2) + (j + 1)] = j;
1433 scores[0 * (y + 2) + (j + 1)] = score_ceil;
1434 }
1435
88f063b4 1436 targetCharCount = find(head, tgt[j-1])->value;
c6871b76
KW
1437 swapScore = scores[targetCharCount * (y + 2) + swapCount] + i - targetCharCount - 1 + j - swapCount;
1438
1439 if (src[i-1] != tgt[j-1]){
1440 scores[(i+1) * (y + 2) + (j + 1)] = MIN(swapScore,(MIN(scores[i * (y + 2) + j], MIN(scores[(i+1) * (y + 2) + j], scores[i * (y + 2) + (j + 1)])) + 1));
1441 }
1442 else {
1443 swapCount = j;
1444 scores[(i+1) * (y + 2) + (j + 1)] = MIN(scores[i * (y + 2) + j], swapScore);
1445 }
1446 }
1447
88f063b4 1448 find(head, src[i-1])->value = i;
c6871b76
KW
1449 }
1450
1451 {
1452 IV score = scores[(x+1) * (y + 2) + (y + 1)];
1453 dict_free(head);
1454 Safefree(scores);
1455 return (maxDistance != 0 && maxDistance < score)?(-1):score;
1456 }
1457}
1458
1459/* END of edit_distance() stuff
1460 * ========================================================= */
1461
653099ff 1462/* Mark that we cannot extend a found fixed substring at this point.
37b6262f 1463 Update the longest found anchored substring or the longest found
653099ff
GS
1464 floating substrings if needed. */
1465
4327152a 1466STATIC void
ea3daa5d
FC
1467S_scan_commit(pTHX_ const RExC_state_t *pRExC_state, scan_data_t *data,
1468 SSize_t *minlenp, int is_inf)
c277df42 1469{
e1ec3a88 1470 const STRLEN l = CHR_SVLEN(data->last_found);
2099df82 1471 SV * const longest_sv = data->substrs[data->cur_is_floating].str;
a2ca2017 1472 const STRLEN old_l = CHR_SVLEN(longest_sv);
271b36b1 1473 DECLARE_AND_GET_RE_DEBUG_FLAGS;
b81d288d 1474
7918f24d
NC
1475 PERL_ARGS_ASSERT_SCAN_COMMIT;
1476
c277df42 1477 if ((l >= old_l) && ((l > old_l) || (data->flags & SF_BEFORE_EOL))) {
2099df82 1478 const U8 i = data->cur_is_floating;
a2ca2017 1479 SvSetMagicSV(longest_sv, data->last_found);
37b6262f
DM
1480 data->substrs[i].min_offset = l ? data->last_start_min : data->pos_min;
1481
2099df82
DM
1482 if (!i) /* fixed */
1483 data->substrs[0].max_offset = data->substrs[0].min_offset;
1484 else { /* float */
f6231ebf
KW
1485 data->substrs[1].max_offset =
1486 (is_inf)
1487 ? OPTIMIZE_INFTY
1488 : (l
646e8787 1489 ? data->last_start_max
d23733db
HS
1490 /* temporary underflow guard for 5.32 */
1491 : data->pos_delta < 0 ? OPTIMIZE_INFTY
0069caf1
KW
1492 : (data->pos_delta > OPTIMIZE_INFTY - data->pos_min
1493 ? OPTIMIZE_INFTY
ea3daa5d 1494 : data->pos_min + data->pos_delta));
37b6262f 1495 }
11683ecb 1496
5de22a40
HS
1497 data->substrs[i].flags &= ~SF_BEFORE_EOL;
1498 data->substrs[i].flags |= data->flags & SF_BEFORE_EOL;
37b6262f
DM
1499 data->substrs[i].minlenp = minlenp;
1500 data->substrs[i].lookbehind = 0;
c277df42 1501 }
37b6262f 1502
c277df42 1503 SvCUR_set(data->last_found, 0);
0eda9292 1504 {
a28509cc 1505 SV * const sv = data->last_found;
097eb12c
AL
1506 if (SvUTF8(sv) && SvMAGICAL(sv)) {
1507 MAGIC * const mg = mg_find(sv, PERL_MAGIC_utf8);
1508 if (mg)
1509 mg->mg_len = 0;
1510 }
0eda9292 1511 }
c277df42
IZ
1512 data->last_end = -1;
1513 data->flags &= ~SF_BEFORE_EOL;
f5a36d78 1514 DEBUG_STUDYDATA("commit", data, 0, is_inf);
c277df42
IZ
1515}
1516
cdd87c1d
KW
1517/* An SSC is just a regnode_charclass_posix with an extra field: the inversion
1518 * list that describes which code points it matches */
1519
653099ff 1520STATIC void
3420edd7 1521S_ssc_anything(pTHX_ regnode_ssc *ssc)
653099ff 1522{
cdd87c1d
KW
1523 /* Set the SSC 'ssc' to match an empty string or any code point */
1524
557bd3fb 1525 PERL_ARGS_ASSERT_SSC_ANYTHING;
7918f24d 1526
71068078 1527 assert(is_ANYOF_SYNTHETIC(ssc));
3fffb88a 1528
0854ea0b
KW
1529 /* mortalize so won't leak */
1530 ssc->invlist = sv_2mortal(_add_range_to_invlist(NULL, 0, UV_MAX));
93e92956 1531 ANYOF_FLAGS(ssc) |= SSC_MATCHES_EMPTY_STRING; /* Plus matches empty */
653099ff
GS
1532}
1533
653099ff 1534STATIC int
dc3bf405 1535S_ssc_is_anything(const regnode_ssc *ssc)
653099ff 1536{
c144baaa
KW
1537 /* Returns TRUE if the SSC 'ssc' can match the empty string and any code
1538 * point; FALSE otherwise. Thus, this is used to see if using 'ssc' buys
1539 * us anything: if the function returns TRUE, 'ssc' hasn't been restricted
1540 * in any way, so there's no point in using it */
cdd87c1d
KW
1541
1542 UV start, end;
1543 bool ret;
653099ff 1544
557bd3fb 1545 PERL_ARGS_ASSERT_SSC_IS_ANYTHING;
7918f24d 1546
71068078 1547 assert(is_ANYOF_SYNTHETIC(ssc));
cdd87c1d 1548
93e92956 1549 if (! (ANYOF_FLAGS(ssc) & SSC_MATCHES_EMPTY_STRING)) {
cdd87c1d
KW
1550 return FALSE;
1551 }
1552
1553 /* See if the list consists solely of the range 0 - Infinity */
1554 invlist_iterinit(ssc->invlist);
1555 ret = invlist_iternext(ssc->invlist, &start, &end)
1556 && start == 0
1557 && end == UV_MAX;
1558
1559 invlist_iterfinish(ssc->invlist);
1560
1561 if (ret) {
1562 return TRUE;
1563 }
1564
1565 /* If e.g., both \w and \W are set, matches everything */
e0e1be5f 1566 if (ANYOF_POSIXL_SSC_TEST_ANY_SET(ssc)) {
cdd87c1d
KW
1567 int i;
1568 for (i = 0; i < ANYOF_POSIXL_MAX; i += 2) {
1569 if (ANYOF_POSIXL_TEST(ssc, i) && ANYOF_POSIXL_TEST(ssc, i+1)) {
1570 return TRUE;
1571 }
1572 }
1573 }
1574
1575 return FALSE;
653099ff
GS
1576}
1577
653099ff 1578STATIC void
cdd87c1d 1579S_ssc_init(pTHX_ const RExC_state_t *pRExC_state, regnode_ssc *ssc)
653099ff 1580{
cdd87c1d
KW
1581 /* Initializes the SSC 'ssc'. This includes setting it to match an empty
1582 * string, any code point, or any posix class under locale */
1583
557bd3fb 1584 PERL_ARGS_ASSERT_SSC_INIT;
7918f24d 1585
557bd3fb 1586 Zero(ssc, 1, regnode_ssc);
71068078 1587 set_ANYOF_SYNTHETIC(ssc);
93e92956 1588 ARG_SET(ssc, ANYOF_ONLY_HAS_BITMAP);
3420edd7 1589 ssc_anything(ssc);
cdd87c1d 1590
2f306ab9
KW
1591 /* If any portion of the regex is to operate under locale rules that aren't
1592 * fully known at compile time, initialization includes it. The reason
1593 * this isn't done for all regexes is that the optimizer was written under
1594 * the assumption that locale was all-or-nothing. Given the complexity and
1595 * lack of documentation in the optimizer, and that there are inadequate
1596 * test cases for locale, many parts of it may not work properly, it is
1597 * safest to avoid locale unless necessary. */
cdd87c1d
KW
1598 if (RExC_contains_locale) {
1599 ANYOF_POSIXL_SETALL(ssc);
cdd87c1d
KW
1600 }
1601 else {
1602 ANYOF_POSIXL_ZERO(ssc);
1603 }
653099ff
GS
1604}
1605
b423522f 1606STATIC int
dc3bf405
BF
1607S_ssc_is_cp_posixl_init(const RExC_state_t *pRExC_state,
1608 const regnode_ssc *ssc)
b423522f
KW
1609{
1610 /* Returns TRUE if the SSC 'ssc' is in its initial state with regard only
1611 * to the list of code points matched, and locale posix classes; hence does
1612 * not check its flags) */
1613
1614 UV start, end;
1615 bool ret;
1616
1617 PERL_ARGS_ASSERT_SSC_IS_CP_POSIXL_INIT;
1618
71068078 1619 assert(is_ANYOF_SYNTHETIC(ssc));
b423522f
KW
1620
1621 invlist_iterinit(ssc->invlist);
1622 ret = invlist_iternext(ssc->invlist, &start, &end)
1623 && start == 0
1624 && end == UV_MAX;
1625
1626 invlist_iterfinish(ssc->invlist);
1627
1628 if (! ret) {
1629 return FALSE;
1630 }
1631
e0e1be5f 1632 if (RExC_contains_locale && ! ANYOF_POSIXL_SSC_TEST_ALL_SET(ssc)) {
31f05a37 1633 return FALSE;
b423522f
KW
1634 }
1635
1636 return TRUE;
1637}
1638
4ebed06a
KW
1639#define INVLIST_INDEX 0
1640#define ONLY_LOCALE_MATCHES_INDEX 1
1641#define DEFERRED_USER_DEFINED_INDEX 2
1642
b423522f
KW
1643STATIC SV*
1644S_get_ANYOF_cp_list_for_ssc(pTHX_ const RExC_state_t *pRExC_state,
5c0f85ef 1645 const regnode_charclass* const node)
b423522f
KW
1646{
1647 /* Returns a mortal inversion list defining which code points are matched
1648 * by 'node', which is of type ANYOF. Handles complementing the result if
1649 * appropriate. If some code points aren't knowable at this time, the
31f05a37
KW
1650 * returned list must, and will, contain every code point that is a
1651 * possibility. */
b423522f 1652
e2506fa7 1653 SV* invlist = NULL;
1ee208c4 1654 SV* only_utf8_locale_invlist = NULL;
b423522f
KW
1655 unsigned int i;
1656 const U32 n = ARG(node);
31f05a37 1657 bool new_node_has_latin1 = FALSE;
2d5613be 1658 const U8 flags = (inRANGE(OP(node), ANYOFH, ANYOFRb))
29a889ef
KW
1659 ? 0
1660 : ANYOF_FLAGS(node);
b423522f
KW
1661
1662 PERL_ARGS_ASSERT_GET_ANYOF_CP_LIST_FOR_SSC;
1663
1664 /* Look at the data structure created by S_set_ANYOF_arg() */
93e92956 1665 if (n != ANYOF_ONLY_HAS_BITMAP) {
b423522f
KW
1666 SV * const rv = MUTABLE_SV(RExC_rxi->data->data[n]);
1667 AV * const av = MUTABLE_AV(SvRV(rv));
1668 SV **const ary = AvARRAY(av);
1669 assert(RExC_rxi->data->what[n] == 's');
1670
4ebed06a 1671 if (av_tindex_skip_len_mg(av) >= DEFERRED_USER_DEFINED_INDEX) {
b423522f 1672
6ed02bb6
KW
1673 /* Here there are things that won't be known until runtime -- we
1674 * have to assume it could be anything */
e2506fa7 1675 invlist = sv_2mortal(_new_invlist(1));
b423522f
KW
1676 return _add_range_to_invlist(invlist, 0, UV_MAX);
1677 }
4ebed06a 1678 else if (ary[INVLIST_INDEX]) {
b423522f 1679
6ed02bb6 1680 /* Use the node's inversion list */
4ebed06a 1681 invlist = sv_2mortal(invlist_clone(ary[INVLIST_INDEX], NULL));
1ee208c4
KW
1682 }
1683
1684 /* Get the code points valid only under UTF-8 locales */
766d6d33 1685 if ( (flags & ANYOFL_FOLD)
4ebed06a 1686 && av_tindex_skip_len_mg(av) >= ONLY_LOCALE_MATCHES_INDEX)
1ee208c4 1687 {
4ebed06a 1688 only_utf8_locale_invlist = ary[ONLY_LOCALE_MATCHES_INDEX];
b423522f
KW
1689 }
1690 }
1691
e2506fa7
KW
1692 if (! invlist) {
1693 invlist = sv_2mortal(_new_invlist(0));
1694 }
1695
dcb20b36
KW
1696 /* An ANYOF node contains a bitmap for the first NUM_ANYOF_CODE_POINTS
1697 * code points, and an inversion list for the others, but if there are code
1698 * points that should match only conditionally on the target string being
1699 * UTF-8, those are placed in the inversion list, and not the bitmap.
1700 * Since there are circumstances under which they could match, they are
1701 * included in the SSC. But if the ANYOF node is to be inverted, we have
1702 * to exclude them here, so that when we invert below, the end result
1703 * actually does include them. (Think about "\xe0" =~ /[^\xc0]/di;). We
1704 * have to do this here before we add the unconditionally matched code
1705 * points */
766d6d33 1706 if (flags & ANYOF_INVERT) {
b423522f
KW
1707 _invlist_intersection_complement_2nd(invlist,
1708 PL_UpperLatin1,
1709 &invlist);
1710 }
1711
1712 /* Add in the points from the bit map */
2d5613be 1713 if (! inRANGE(OP(node), ANYOFH, ANYOFRb)) {
90973738
KW
1714 for (i = 0; i < NUM_ANYOF_CODE_POINTS; i++) {
1715 if (ANYOF_BITMAP_TEST(node, i)) {
1716 unsigned int start = i++;
1717
1718 for (; i < NUM_ANYOF_CODE_POINTS
1719 && ANYOF_BITMAP_TEST(node, i); ++i)
1720 {
1721 /* empty */
1722 }
1723 invlist = _add_range_to_invlist(invlist, start, i-1);
1724 new_node_has_latin1 = TRUE;
6f8848d5 1725 }
b423522f
KW
1726 }
1727 }
1728
1729 /* If this can match all upper Latin1 code points, have to add them
ac33c516
KW
1730 * as well. But don't add them if inverting, as when that gets done below,
1731 * it would exclude all these characters, including the ones it shouldn't
1732 * that were added just above */
766d6d33
KW
1733 if (! (flags & ANYOF_INVERT) && OP(node) == ANYOFD
1734 && (flags & ANYOF_SHARED_d_MATCHES_ALL_NON_UTF8_NON_ASCII_non_d_WARN_SUPER))
f240c685 1735 {
b423522f
KW
1736 _invlist_union(invlist, PL_UpperLatin1, &invlist);
1737 }
1738
1739 /* Similarly for these */
766d6d33 1740 if (flags & ANYOF_MATCHES_ALL_ABOVE_BITMAP) {
e0a1ff7a 1741 _invlist_union_complement_2nd(invlist, PL_InBitmap, &invlist);
b423522f
KW
1742 }
1743
766d6d33 1744 if (flags & ANYOF_INVERT) {
b423522f
KW
1745 _invlist_invert(invlist);
1746 }
766d6d33 1747 else if (flags & ANYOFL_FOLD) {
35b8412f 1748 if (new_node_has_latin1) {
31f05a37 1749
26be5fe6
KW
1750 /* Under /li, any 0-255 could fold to any other 0-255, depending on
1751 * the locale. We can skip this if there are no 0-255 at all. */
1752 _invlist_union(invlist, PL_Latin1, &invlist);
35b8412f
KW
1753
1754 invlist = add_cp_to_invlist(invlist, LATIN_SMALL_LETTER_DOTLESS_I);
1755 invlist = add_cp_to_invlist(invlist, LATIN_CAPITAL_LETTER_I_WITH_DOT_ABOVE);
1756 }
1757 else {
1758 if (_invlist_contains_cp(invlist, LATIN_SMALL_LETTER_DOTLESS_I)) {
1759 invlist = add_cp_to_invlist(invlist, 'I');
1760 }
1761 if (_invlist_contains_cp(invlist,
1762 LATIN_CAPITAL_LETTER_I_WITH_DOT_ABOVE))
1763 {
1764 invlist = add_cp_to_invlist(invlist, 'i');
1765 }
1766 }
31f05a37
KW
1767 }
1768
1ee208c4
KW
1769 /* Similarly add the UTF-8 locale possible matches. These have to be
1770 * deferred until after the non-UTF-8 locale ones are taken care of just
1771 * above, or it leads to wrong results under ANYOF_INVERT */
1772 if (only_utf8_locale_invlist) {
31f05a37 1773 _invlist_union_maybe_complement_2nd(invlist,
1ee208c4 1774 only_utf8_locale_invlist,
766d6d33 1775 flags & ANYOF_INVERT,
31f05a37
KW
1776 &invlist);
1777 }
b423522f
KW
1778
1779 return invlist;
1780}
1781
1051e1c4 1782/* These two functions currently do the exact same thing */
557bd3fb 1783#define ssc_init_zero ssc_init
653099ff 1784
cdd87c1d
KW
1785#define ssc_add_cp(ssc, cp) ssc_add_range((ssc), (cp), (cp))
1786#define ssc_match_all_cp(ssc) ssc_add_range(ssc, 0, UV_MAX)
1787
557bd3fb 1788/* 'AND' a given class with another one. Can create false positives. 'ssc'
93e92956
KW
1789 * should not be inverted. 'and_with->flags & ANYOF_MATCHES_POSIXL' should be
1790 * 0 if 'and_with' is a regnode_charclass instead of a regnode_ssc. */
cdd87c1d 1791
653099ff 1792STATIC void
b423522f 1793S_ssc_and(pTHX_ const RExC_state_t *pRExC_state, regnode_ssc *ssc,
7dcac5f6 1794 const regnode_charclass *and_with)
653099ff 1795{
cdd87c1d
KW
1796 /* Accumulate into SSC 'ssc' its 'AND' with 'and_with', which is either
1797 * another SSC or a regular ANYOF class. Can create false positives. */
40d049e4 1798
a0dd4231 1799 SV* anded_cp_list;
2d5613be 1800 U8 and_with_flags = inRANGE(OP(and_with), ANYOFH, ANYOFRb)
29a889ef
KW
1801 ? 0
1802 : ANYOF_FLAGS(and_with);
a0dd4231 1803 U8 anded_flags;
1e6ade67 1804
cdd87c1d 1805 PERL_ARGS_ASSERT_SSC_AND;
653099ff 1806
71068078 1807 assert(is_ANYOF_SYNTHETIC(ssc));
a0dd4231
KW
1808
1809 /* 'and_with' is used as-is if it too is an SSC; otherwise have to extract
1810 * the code point inversion list and just the relevant flags */
71068078 1811 if (is_ANYOF_SYNTHETIC(and_with)) {
7dcac5f6 1812 anded_cp_list = ((regnode_ssc *)and_with)->invlist;
765e6ecf 1813 anded_flags = and_with_flags;
e9b08962
KW
1814
1815 /* XXX This is a kludge around what appears to be deficiencies in the
1816 * optimizer. If we make S_ssc_anything() add in the WARN_SUPER flag,
1817 * there are paths through the optimizer where it doesn't get weeded
1818 * out when it should. And if we don't make some extra provision for
1819 * it like the code just below, it doesn't get added when it should.
1820 * This solution is to add it only when AND'ing, which is here, and
1821 * only when what is being AND'ed is the pristine, original node
1822 * matching anything. Thus it is like adding it to ssc_anything() but
1823 * only when the result is to be AND'ed. Probably the same solution
1824 * could be adopted for the same problem we have with /l matching,
1825 * which is solved differently in S_ssc_init(), and that would lead to
1826 * fewer false positives than that solution has. But if this solution
1827 * creates bugs, the consequences are only that a warning isn't raised
1828 * that should be; while the consequences for having /l bugs is
1829 * incorrect matches */
7dcac5f6 1830 if (ssc_is_anything((regnode_ssc *)and_with)) {
f240c685 1831 anded_flags |= ANYOF_SHARED_d_MATCHES_ALL_NON_UTF8_NON_ASCII_non_d_WARN_SUPER;
e9b08962 1832 }
a0dd4231
KW
1833 }
1834 else {
5c0f85ef 1835 anded_cp_list = get_ANYOF_cp_list_for_ssc(pRExC_state, and_with);
f240c685 1836 if (OP(and_with) == ANYOFD) {
765e6ecf 1837 anded_flags = and_with_flags & ANYOF_COMMON_FLAGS;
f240c685
KW
1838 }
1839 else {
765e6ecf 1840 anded_flags = and_with_flags
f240c685 1841 &( ANYOF_COMMON_FLAGS
108316fb
KW
1842 |ANYOF_SHARED_d_MATCHES_ALL_NON_UTF8_NON_ASCII_non_d_WARN_SUPER
1843 |ANYOF_SHARED_d_UPPER_LATIN1_UTF8_STRING_MATCHES_non_d_RUNTIME_USER_PROP);
765e6ecf 1844 if (ANYOFL_UTF8_LOCALE_REQD(and_with_flags)) {
d1c40ef5
KW
1845 anded_flags &=
1846 ANYOFL_SHARED_UTF8_LOCALE_fold_HAS_MATCHES_nonfold_REQD;
1847 }
f240c685 1848 }
a0dd4231
KW
1849 }
1850
1851 ANYOF_FLAGS(ssc) &= anded_flags;
cdd87c1d
KW
1852
1853 /* Below, C1 is the list of code points in 'ssc'; P1, its posix classes.
1854 * C2 is the list of code points in 'and-with'; P2, its posix classes.
1855 * 'and_with' may be inverted. When not inverted, we have the situation of
1856 * computing:
1857 * (C1 | P1) & (C2 | P2)
1858 * = (C1 & (C2 | P2)) | (P1 & (C2 | P2))
1859 * = ((C1 & C2) | (C1 & P2)) | ((P1 & C2) | (P1 & P2))
1860 * <= ((C1 & C2) | P2)) | ( P1 | (P1 & P2))
1861 * <= ((C1 & C2) | P1 | P2)
1862 * Alternatively, the last few steps could be:
1863 * = ((C1 & C2) | (C1 & P2)) | ((P1 & C2) | (P1 & P2))
1864 * <= ((C1 & C2) | C1 ) | ( C2 | (P1 & P2))
1865 * <= (C1 | C2 | (P1 & P2))
1866 * We favor the second approach if either P1 or P2 is non-empty. This is
1867 * because these components are a barrier to doing optimizations, as what
1868 * they match cannot be known until the moment of matching as they are
1869 * dependent on the current locale, 'AND"ing them likely will reduce or
1870 * eliminate them.
1871 * But we can do better if we know that C1,P1 are in their initial state (a
1872 * frequent occurrence), each matching everything:
1873 * (<everything>) & (C2 | P2) = C2 | P2
1874 * Similarly, if C2,P2 are in their initial state (again a frequent
1875 * occurrence), the result is a no-op
1876 * (C1 | P1) & (<everything>) = C1 | P1
1877 *
1878 * Inverted, we have
1879 * (C1 | P1) & ~(C2 | P2) = (C1 | P1) & (~C2 & ~P2)
1880 * = (C1 & (~C2 & ~P2)) | (P1 & (~C2 & ~P2))
1881 * <= (C1 & ~C2) | (P1 & ~P2)
1882 * */
1aa99e6b 1883
765e6ecf 1884 if ((and_with_flags & ANYOF_INVERT)
71068078 1885 && ! is_ANYOF_SYNTHETIC(and_with))
a0dd4231 1886 {
cdd87c1d 1887 unsigned int i;
8951c461 1888
cdd87c1d
KW
1889 ssc_intersection(ssc,
1890 anded_cp_list,
1891 FALSE /* Has already been inverted */
1892 );
c6b76537 1893
cdd87c1d
KW
1894 /* If either P1 or P2 is empty, the intersection will be also; can skip
1895 * the loop */
765e6ecf 1896 if (! (and_with_flags & ANYOF_MATCHES_POSIXL)) {
cdd87c1d
KW
1897 ANYOF_POSIXL_ZERO(ssc);
1898 }
e0e1be5f 1899 else if (ANYOF_POSIXL_SSC_TEST_ANY_SET(ssc)) {
cdd87c1d
KW
1900
1901 /* Note that the Posix class component P from 'and_with' actually
1902 * looks like:
1903 * P = Pa | Pb | ... | Pn
1904 * where each component is one posix class, such as in [\w\s].
1905 * Thus
1906 * ~P = ~(Pa | Pb | ... | Pn)
1907 * = ~Pa & ~Pb & ... & ~Pn
1908 * <= ~Pa | ~Pb | ... | ~Pn
1909 * The last is something we can easily calculate, but unfortunately
1910 * is likely to have many false positives. We could do better
1911 * in some (but certainly not all) instances if two classes in
1912 * P have known relationships. For example
1913 * :lower: <= :alpha: <= :alnum: <= \w <= :graph: <= :print:
1914 * So
1915 * :lower: & :print: = :lower:
1916 * And similarly for classes that must be disjoint. For example,
1917 * since \s and \w can have no elements in common based on rules in
1918 * the POSIX standard,
1919 * \w & ^\S = nothing
1920 * Unfortunately, some vendor locales do not meet the Posix
1921 * standard, in particular almost everything by Microsoft.
1922 * The loop below just changes e.g., \w into \W and vice versa */
1923
1ee208c4 1924 regnode_charclass_posixl temp;
cdd87c1d
KW
1925 int add = 1; /* To calculate the index of the complement */
1926
b1234259 1927 Zero(&temp, 1, regnode_charclass_posixl);
cdd87c1d
KW
1928 ANYOF_POSIXL_ZERO(&temp);
1929 for (i = 0; i < ANYOF_MAX; i++) {
1930 assert(i % 2 != 0
7dcac5f6
KW
1931 || ! ANYOF_POSIXL_TEST((regnode_charclass_posixl*) and_with, i)
1932 || ! ANYOF_POSIXL_TEST((regnode_charclass_posixl*) and_with, i + 1));
cdd87c1d 1933
7dcac5f6 1934 if (ANYOF_POSIXL_TEST((regnode_charclass_posixl*) and_with, i)) {
cdd87c1d
KW
1935 ANYOF_POSIXL_SET(&temp, i + add);
1936 }
1937 add = 0 - add; /* 1 goes to -1; -1 goes to 1 */
1938 }
1939 ANYOF_POSIXL_AND(&temp, ssc);
c6b76537 1940
cdd87c1d
KW
1941 } /* else ssc already has no posixes */
1942 } /* else: Not inverted. This routine is a no-op if 'and_with' is an SSC
1943 in its initial state */
71068078 1944 else if (! is_ANYOF_SYNTHETIC(and_with)
7dcac5f6 1945 || ! ssc_is_cp_posixl_init(pRExC_state, (regnode_ssc *)and_with))
cdd87c1d
KW
1946 {
1947 /* But if 'ssc' is in its initial state, the result is just 'and_with';
1948 * copy it over 'ssc' */
1949 if (ssc_is_cp_posixl_init(pRExC_state, ssc)) {
71068078 1950 if (is_ANYOF_SYNTHETIC(and_with)) {
cdd87c1d
KW
1951 StructCopy(and_with, ssc, regnode_ssc);
1952 }
1953 else {
1954 ssc->invlist = anded_cp_list;
1955 ANYOF_POSIXL_ZERO(ssc);
765e6ecf 1956 if (and_with_flags & ANYOF_MATCHES_POSIXL) {
7dcac5f6 1957 ANYOF_POSIXL_OR((regnode_charclass_posixl*) and_with, ssc);
cdd87c1d
KW
1958 }
1959 }
1960 }
e0e1be5f 1961 else if (ANYOF_POSIXL_SSC_TEST_ANY_SET(ssc)
765e6ecf 1962 || (and_with_flags & ANYOF_MATCHES_POSIXL))
cdd87c1d
KW
1963 {
1964 /* One or the other of P1, P2 is non-empty. */
765e6ecf 1965 if (and_with_flags & ANYOF_MATCHES_POSIXL) {
1ea8b7fe
KW
1966 ANYOF_POSIXL_AND((regnode_charclass_posixl*) and_with, ssc);
1967 }
cdd87c1d
KW
1968 ssc_union(ssc, anded_cp_list, FALSE);
1969 }
1970 else { /* P1 = P2 = empty */
1971 ssc_intersection(ssc, anded_cp_list, FALSE);
1972 }
137165a6 1973 }
653099ff
GS
1974}
1975
653099ff 1976STATIC void
cdd87c1d 1977S_ssc_or(pTHX_ const RExC_state_t *pRExC_state, regnode_ssc *ssc,
7dcac5f6 1978 const regnode_charclass *or_with)
653099ff 1979{
cdd87c1d
KW
1980 /* Accumulate into SSC 'ssc' its 'OR' with 'or_with', which is either
1981 * another SSC or a regular ANYOF class. Can create false positives if
1982 * 'or_with' is to be inverted. */
7918f24d 1983
a0dd4231
KW
1984 SV* ored_cp_list;
1985 U8 ored_flags;
2d5613be 1986 U8 or_with_flags = inRANGE(OP(or_with), ANYOFH, ANYOFRb)
29a889ef
KW
1987 ? 0
1988 : ANYOF_FLAGS(or_with);
c6b76537 1989
cdd87c1d 1990 PERL_ARGS_ASSERT_SSC_OR;
c6b76537 1991
71068078 1992 assert(is_ANYOF_SYNTHETIC(ssc));
a0dd4231
KW
1993
1994 /* 'or_with' is used as-is if it too is an SSC; otherwise have to extract
1995 * the code point inversion list and just the relevant flags */
71068078 1996 if (is_ANYOF_SYNTHETIC(or_with)) {
7dcac5f6 1997 ored_cp_list = ((regnode_ssc*) or_with)->invlist;
765e6ecf 1998 ored_flags = or_with_flags;
a0dd4231
KW
1999 }
2000 else {
5c0f85ef 2001 ored_cp_list = get_ANYOF_cp_list_for_ssc(pRExC_state, or_with);
765e6ecf 2002 ored_flags = or_with_flags & ANYOF_COMMON_FLAGS;
f240c685
KW
2003 if (OP(or_with) != ANYOFD) {
2004 ored_flags
765e6ecf 2005 |= or_with_flags
108316fb
KW
2006 & ( ANYOF_SHARED_d_MATCHES_ALL_NON_UTF8_NON_ASCII_non_d_WARN_SUPER
2007 |ANYOF_SHARED_d_UPPER_LATIN1_UTF8_STRING_MATCHES_non_d_RUNTIME_USER_PROP);
765e6ecf 2008 if (ANYOFL_UTF8_LOCALE_REQD(or_with_flags)) {
d1c40ef5
KW
2009 ored_flags |=
2010 ANYOFL_SHARED_UTF8_LOCALE_fold_HAS_MATCHES_nonfold_REQD;
2011 }
f240c685 2012 }
a0dd4231
KW
2013 }
2014
2015 ANYOF_FLAGS(ssc) |= ored_flags;
cdd87c1d
KW
2016
2017 /* Below, C1 is the list of code points in 'ssc'; P1, its posix classes.
2018 * C2 is the list of code points in 'or-with'; P2, its posix classes.
2019 * 'or_with' may be inverted. When not inverted, we have the simple
2020 * situation of computing:
2021 * (C1 | P1) | (C2 | P2) = (C1 | C2) | (P1 | P2)
2022 * If P1|P2 yields a situation with both a class and its complement are
2023 * set, like having both \w and \W, this matches all code points, and we
2024 * can delete these from the P component of the ssc going forward. XXX We
2025 * might be able to delete all the P components, but I (khw) am not certain
2026 * about this, and it is better to be safe.
2027 *
2028 * Inverted, we have
2029 * (C1 | P1) | ~(C2 | P2) = (C1 | P1) | (~C2 & ~P2)
2030 * <= (C1 | P1) | ~C2
2031 * <= (C1 | ~C2) | P1
2032 * (which results in actually simpler code than the non-inverted case)
2033 * */
9826f543 2034
765e6ecf 2035 if ((or_with_flags & ANYOF_INVERT)
71068078 2036 && ! is_ANYOF_SYNTHETIC(or_with))
a0dd4231 2037 {
cdd87c1d 2038 /* We ignore P2, leaving P1 going forward */
1ea8b7fe 2039 } /* else Not inverted */
765e6ecf 2040 else if (or_with_flags & ANYOF_MATCHES_POSIXL) {
7dcac5f6 2041 ANYOF_POSIXL_OR((regnode_charclass_posixl*)or_with, ssc);
e0e1be5f 2042 if (ANYOF_POSIXL_SSC_TEST_ANY_SET(ssc)) {
cdd87c1d
KW
2043 unsigned int i;
2044 for (i = 0; i < ANYOF_MAX; i += 2) {
2045 if (ANYOF_POSIXL_TEST(ssc, i) && ANYOF_POSIXL_TEST(ssc, i + 1))
2046 {
2047 ssc_match_all_cp(ssc);
2048 ANYOF_POSIXL_CLEAR(ssc, i);
2049 ANYOF_POSIXL_CLEAR(ssc, i+1);
cdd87c1d
KW
2050 }
2051 }
2052 }
1aa99e6b 2053 }
cdd87c1d
KW
2054
2055 ssc_union(ssc,
2056 ored_cp_list,
2057 FALSE /* Already has been inverted */
2058 );
653099ff
GS
2059}
2060
19bb2455 2061STATIC void
b423522f
KW
2062S_ssc_union(pTHX_ regnode_ssc *ssc, SV* const invlist, const bool invert2nd)
2063{
2064 PERL_ARGS_ASSERT_SSC_UNION;
2065
71068078 2066 assert(is_ANYOF_SYNTHETIC(ssc));
b423522f
KW
2067
2068 _invlist_union_maybe_complement_2nd(ssc->invlist,
2069 invlist,
2070 invert2nd,
2071 &ssc->invlist);
2072}
2073
19bb2455 2074STATIC void
b423522f
KW
2075S_ssc_intersection(pTHX_ regnode_ssc *ssc,
2076 SV* const invlist,
2077 const bool invert2nd)
2078{
2079 PERL_ARGS_ASSERT_SSC_INTERSECTION;
2080
71068078 2081 assert(is_ANYOF_SYNTHETIC(ssc));
b423522f
KW
2082
2083 _invlist_intersection_maybe_complement_2nd(ssc->invlist,
2084 invlist,
2085 invert2nd,
2086 &ssc->invlist);
2087}
2088
19bb2455 2089STATIC void
b423522f
KW
2090S_ssc_add_range(pTHX_ regnode_ssc *ssc, const UV start, const UV end)
2091{
2092 PERL_ARGS_ASSERT_SSC_ADD_RANGE;
2093
71068078 2094 assert(is_ANYOF_SYNTHETIC(ssc));
b423522f
KW
2095
2096 ssc->invlist = _add_range_to_invlist(ssc->invlist, start, end);
2097}
2098
19bb2455 2099STATIC void
b423522f
KW
2100S_ssc_cp_and(pTHX_ regnode_ssc *ssc, const UV cp)
2101{
2102 /* AND just the single code point 'cp' into the SSC 'ssc' */
2103
2104 SV* cp_list = _new_invlist(2);
2105
2106 PERL_ARGS_ASSERT_SSC_CP_AND;
2107
71068078 2108 assert(is_ANYOF_SYNTHETIC(ssc));
b423522f
KW
2109
2110 cp_list = add_cp_to_invlist(cp_list, cp);
2111 ssc_intersection(ssc, cp_list,
2112 FALSE /* Not inverted */
2113 );
2114 SvREFCNT_dec_NN(cp_list);
2115}
2116
19bb2455 2117STATIC void
dc3bf405 2118S_ssc_clear_locale(regnode_ssc *ssc)
b423522f
KW
2119{
2120 /* Set the SSC 'ssc' to not match any locale things */
b423522f
KW
2121 PERL_ARGS_ASSERT_SSC_CLEAR_LOCALE;
2122
71068078 2123 assert(is_ANYOF_SYNTHETIC(ssc));
b423522f
KW
2124
2125 ANYOF_POSIXL_ZERO(ssc);
2126 ANYOF_FLAGS(ssc) &= ~ANYOF_LOCALE_FLAGS;
2127}
2128
b35552de
KW
2129STATIC bool
2130S_is_ssc_worth_it(const RExC_state_t * pRExC_state, const regnode_ssc * ssc)
2131{
2132 /* The synthetic start class is used to hopefully quickly winnow down
2133 * places where a pattern could start a match in the target string. If it
2134 * doesn't really narrow things down that much, there isn't much point to
2135 * having the overhead of using it. This function uses some very crude
2136 * heuristics to decide if to use the ssc or not.
2137 *
2138 * It returns TRUE if 'ssc' rules out more than half what it considers to
2139 * be the "likely" possible matches, but of course it doesn't know what the
2140 * actual things being matched are going to be; these are only guesses
2141 *
2142 * For /l matches, it assumes that the only likely matches are going to be
2143 * in the 0-255 range, uniformly distributed, so half of that is 127
2144 * For /a and /d matches, it assumes that the likely matches will be just
2145 * the ASCII range, so half of that is 63
2146 * For /u and there isn't anything matching above the Latin1 range, it
2147 * assumes that that is the only range likely to be matched, and uses
2148 * half that as the cut-off: 127. If anything matches above Latin1,
2149 * it assumes that all of Unicode could match (uniformly), except for
2150 * non-Unicode code points and things in the General Category "Other"
2151 * (unassigned, private use, surrogates, controls and formats). This
2152 * is a much large number. */
2153
b35552de
KW
2154 U32 count = 0; /* Running total of number of code points matched by
2155 'ssc' */
2156 UV start, end; /* Start and end points of current range in inversion
26be5fe6 2157 XXX outdated. UTF-8 locales are common, what about invert? list */
72400949
KW
2158 const U32 max_code_points = (LOC)
2159 ? 256
f6e8f31e 2160 : (( ! UNI_SEMANTICS
1a26cbcb 2161 || invlist_highest(ssc->invlist) < 256)
72400949
KW
2162 ? 128
2163 : NON_OTHER_COUNT);
2164 const U32 max_match = max_code_points / 2;
b35552de
KW
2165
2166 PERL_ARGS_ASSERT_IS_SSC_WORTH_IT;
2167
2168 invlist_iterinit(ssc->invlist);
2169 while (invlist_iternext(ssc->invlist, &start, &end)) {
72400949
KW
2170 if (start >= max_code_points) {
2171 break;
b35552de 2172 }
72400949 2173 end = MIN(end, max_code_points - 1);
b35552de 2174 count += end - start + 1;
72400949 2175 if (count >= max_match) {
b35552de
KW
2176 invlist_iterfinish(ssc->invlist);
2177 return FALSE;
2178 }
2179 }
2180
2181 return TRUE;
2182}
2183
2184
b423522f
KW
2185STATIC void
2186S_ssc_finalize(pTHX_ RExC_state_t *pRExC_state, regnode_ssc *ssc)
2187{
2188 /* The inversion list in the SSC is marked mortal; now we need a more
2189 * permanent copy, which is stored the same way that is done in a regular
dcb20b36
KW
2190 * ANYOF node, with the first NUM_ANYOF_CODE_POINTS code points in a bit
2191 * map */
b423522f 2192
28118b9c 2193 SV* invlist = invlist_clone(ssc->invlist, NULL);
b423522f
KW
2194
2195 PERL_ARGS_ASSERT_SSC_FINALIZE;
2196
71068078 2197 assert(is_ANYOF_SYNTHETIC(ssc));
b423522f 2198
a0dd4231 2199 /* The code in this file assumes that all but these flags aren't relevant
93e92956
KW
2200 * to the SSC, except SSC_MATCHES_EMPTY_STRING, which should be cleared
2201 * by the time we reach here */
f240c685
KW
2202 assert(! (ANYOF_FLAGS(ssc)
2203 & ~( ANYOF_COMMON_FLAGS
108316fb
KW
2204 |ANYOF_SHARED_d_MATCHES_ALL_NON_UTF8_NON_ASCII_non_d_WARN_SUPER
2205 |ANYOF_SHARED_d_UPPER_LATIN1_UTF8_STRING_MATCHES_non_d_RUNTIME_USER_PROP)));
a0dd4231 2206
b423522f
KW
2207 populate_ANYOF_from_invlist( (regnode *) ssc, &invlist);
2208
4c404f26 2209 set_ANYOF_arg(pRExC_state, (regnode *) ssc, invlist, NULL, NULL);
21c3fd9d 2210 SvREFCNT_dec(invlist);
b423522f 2211
85c8e306
KW
2212 /* Make sure is clone-safe */
2213 ssc->invlist = NULL;
2214
e0e1be5f 2215 if (ANYOF_POSIXL_SSC_TEST_ANY_SET(ssc)) {
93e92956 2216 ANYOF_FLAGS(ssc) |= ANYOF_MATCHES_POSIXL;
d156f5cb 2217 OP(ssc) = ANYOFPOSIXL;
e0e1be5f 2218 }
d156f5cb 2219 else if (RExC_contains_locale) {
b2e90ddf
KW
2220 OP(ssc) = ANYOFL;
2221 }
2222
1462525b 2223 assert(! (ANYOF_FLAGS(ssc) & ANYOF_LOCALE_FLAGS) || RExC_contains_locale);
b423522f
KW
2224}
2225
a3621e74
YO
2226#define TRIE_LIST_ITEM(state,idx) (trie->states[state].trans.list)[ idx ]
2227#define TRIE_LIST_CUR(state) ( TRIE_LIST_ITEM( state, 0 ).forid )
2228#define TRIE_LIST_LEN(state) ( TRIE_LIST_ITEM( state, 0 ).newstate )
538e84ed
KW
2229#define TRIE_LIST_USED(idx) ( trie->states[state].trans.list \
2230 ? (TRIE_LIST_CUR( idx ) - 1) \
2231 : 0 )
a3621e74 2232
3dab1dad
YO
2233
2234#ifdef DEBUGGING
07be1b83 2235/*
2b8b4781
NC
2236 dump_trie(trie,widecharmap,revcharmap)
2237 dump_trie_interim_list(trie,widecharmap,revcharmap,next_alloc)
2238 dump_trie_interim_table(trie,widecharmap,revcharmap,next_alloc)
3dab1dad
YO
2239
2240 These routines dump out a trie in a somewhat readable format.
07be1b83
YO
2241 The _interim_ variants are used for debugging the interim
2242 tables that are used to generate the final compressed
2243 representation which is what dump_trie expects.
2244
486ec47a 2245 Part of the reason for their existence is to provide a form
3dab1dad 2246 of documentation as to how the different representations function.
07be1b83
YO
2247
2248*/
3dab1dad
YO
2249
2250/*
3dab1dad
YO
2251 Dumps the final compressed table form of the trie to Perl_debug_log.
2252 Used for debugging make_trie().
2253*/
b9a59e08 2254
3dab1dad 2255STATIC void
2b8b4781
NC
2256S_dump_trie(pTHX_ const struct _reg_trie_data *trie, HV *widecharmap,
2257 AV *revcharmap, U32 depth)
3dab1dad
YO
2258{
2259 U32 state;
ab3bbdeb 2260 SV *sv=sv_newmortal();
55eed653 2261 int colwidth= widecharmap ? 6 : 4;
2e64971a 2262 U16 word;
271b36b1 2263 DECLARE_AND_GET_RE_DEBUG_FLAGS;
3dab1dad 2264
7918f24d 2265 PERL_ARGS_ASSERT_DUMP_TRIE;
ab3bbdeb 2266
6ad9a8ab 2267 Perl_re_indentf( aTHX_ "Char : %-6s%-6s%-4s ",
1e37780e 2268 depth+1, "Match","Base","Ofs" );
3dab1dad
YO
2269
2270 for( state = 0 ; state < trie->uniquecharcount ; state++ ) {
2b8b4781 2271 SV ** const tmp = av_fetch( revcharmap, state, 0);
3dab1dad 2272 if ( tmp ) {
6ad9a8ab 2273 Perl_re_printf( aTHX_ "%*s",
ab3bbdeb 2274 colwidth,
538e84ed 2275 pv_pretty(sv, SvPV_nolen_const(*tmp), SvCUR(*tmp), colwidth,
ab3bbdeb
YO
2276 PL_colors[0], PL_colors[1],
2277 (SvUTF8(*tmp) ? PERL_PV_ESCAPE_UNI : 0) |
538e84ed
KW
2278 PERL_PV_ESCAPE_FIRSTCHAR
2279 )
ab3bbdeb 2280 );
3dab1dad
YO
2281 }
2282 }
1e37780e
YO
2283 Perl_re_printf( aTHX_ "\n");
2284 Perl_re_indentf( aTHX_ "State|-----------------------", depth+1);
3dab1dad
YO
2285
2286 for( state = 0 ; state < trie->uniquecharcount ; state++ )
6ad9a8ab
YO
2287 Perl_re_printf( aTHX_ "%.*s", colwidth, "--------");
2288 Perl_re_printf( aTHX_ "\n");
3dab1dad 2289
1e2e3d02 2290 for( state = 1 ; state < trie->statecount ; state++ ) {
be8e71aa 2291 const U32 base = trie->states[ state ].trans.base;
3dab1dad 2292
147e3846 2293 Perl_re_indentf( aTHX_ "#%4" UVXf "|", depth+1, (UV)state);
3dab1dad
YO
2294
2295 if ( trie->states[ state ].wordnum ) {
1e37780e 2296 Perl_re_printf( aTHX_ " W%4X", trie->states[ state ].wordnum );
3dab1dad 2297 } else {
6ad9a8ab 2298 Perl_re_printf( aTHX_ "%6s", "" );
3dab1dad
YO
2299 }
2300
147e3846 2301 Perl_re_printf( aTHX_ " @%4" UVXf " ", (UV)base );
3dab1dad
YO
2302
2303 if ( base ) {
2304 U32 ofs = 0;
2305
2306 while( ( base + ofs < trie->uniquecharcount ) ||
2307 ( base + ofs - trie->uniquecharcount < trie->lasttrans
538e84ed
KW
2308 && trie->trans[ base + ofs - trie->uniquecharcount ].check
2309 != state))
3dab1dad
YO
2310 ofs++;
2311
147e3846 2312 Perl_re_printf( aTHX_ "+%2" UVXf "[ ", (UV)ofs);
3dab1dad
YO
2313
2314 for ( ofs = 0 ; ofs < trie->uniquecharcount ; ofs++ ) {
538e84ed
KW
2315 if ( ( base + ofs >= trie->uniquecharcount )
2316 && ( base + ofs - trie->uniquecharcount
2317 < trie->lasttrans )
2318 && trie->trans[ base + ofs
2319 - trie->uniquecharcount ].check == state )
3dab1dad 2320 {
147e3846 2321 Perl_re_printf( aTHX_ "%*" UVXf, colwidth,
1e37780e
YO
2322 (UV)trie->trans[ base + ofs - trie->uniquecharcount ].next
2323 );
3dab1dad 2324 } else {
88f063b4 2325 Perl_re_printf( aTHX_ "%*s", colwidth," ." );
3dab1dad
YO
2326 }
2327 }
2328
6ad9a8ab 2329 Perl_re_printf( aTHX_ "]");
3dab1dad
YO
2330
2331 }
6ad9a8ab 2332 Perl_re_printf( aTHX_ "\n" );
3dab1dad 2333 }
6ad9a8ab 2334 Perl_re_indentf( aTHX_ "word_info N:(prev,len)=",
cb41e5d6 2335 depth);
2e64971a 2336 for (word=1; word <= trie->wordcount; word++) {
6ad9a8ab 2337 Perl_re_printf( aTHX_ " %d:(%d,%d)",
2e64971a
DM
2338 (int)word, (int)(trie->wordinfo[word].prev),
2339 (int)(trie->wordinfo[word].len));
2340 }
6ad9a8ab 2341 Perl_re_printf( aTHX_ "\n" );
538e84ed 2342}
3dab1dad 2343/*
3dab1dad 2344 Dumps a fully constructed but uncompressed trie in list form.
538e84ed 2345 List tries normally only are used for construction when the number of
3dab1dad
YO
2346 possible chars (trie->uniquecharcount) is very high.
2347 Used for debugging make_trie().
2348*/
2349STATIC void
55eed653 2350S_dump_trie_interim_list(pTHX_ const struct _reg_trie_data *trie,
2b8b4781
NC
2351 HV *widecharmap, AV *revcharmap, U32 next_alloc,
2352 U32 depth)
3dab1dad
YO
2353{
2354 U32 state;
ab3bbdeb 2355 SV *sv=sv_newmortal();
55eed653 2356 int colwidth= widecharmap ? 6 : 4;
271b36b1 2357 DECLARE_AND_GET_RE_DEBUG_FLAGS;
7918f24d
NC
2358
2359 PERL_ARGS_ASSERT_DUMP_TRIE_INTERIM_LIST;
2360
3dab1dad 2361 /* print out the table precompression. */
6ad9a8ab 2362 Perl_re_indentf( aTHX_ "State :Word | Transition Data\n",
cb41e5d6 2363 depth+1 );
6ad9a8ab 2364 Perl_re_indentf( aTHX_ "%s",
cb41e5d6 2365 depth+1, "------:-----+-----------------\n" );
538e84ed 2366
3dab1dad
YO
2367 for( state=1 ; state < next_alloc ; state ++ ) {
2368 U16 charid;
538e84ed 2369
147e3846 2370 Perl_re_indentf( aTHX_ " %4" UVXf " :",
cb41e5d6 2371 depth+1, (UV)state );
3dab1dad 2372 if ( ! trie->states[ state ].wordnum ) {
6ad9a8ab 2373 Perl_re_printf( aTHX_ "%5s| ","");
3dab1dad 2374 } else {
6ad9a8ab 2375 Perl_re_printf( aTHX_ "W%4x| ",
3dab1dad
YO
2376 trie->states[ state ].wordnum
2377 );
2378 }
2379 for( charid = 1 ; charid <= TRIE_LIST_USED( state ) ; charid++ ) {
538e84ed 2380 SV ** const tmp = av_fetch( revcharmap,
88f063b4 2381 TRIE_LIST_ITEM(state, charid).forid, 0);
ab3bbdeb 2382 if ( tmp ) {
147e3846 2383 Perl_re_printf( aTHX_ "%*s:%3X=%4" UVXf " | ",
ab3bbdeb 2384 colwidth,
538e84ed
KW
2385 pv_pretty(sv, SvPV_nolen_const(*tmp), SvCUR(*tmp),
2386 colwidth,
2387 PL_colors[0], PL_colors[1],
2388 (SvUTF8(*tmp) ? PERL_PV_ESCAPE_UNI : 0)
2389 | PERL_PV_ESCAPE_FIRSTCHAR
ab3bbdeb 2390 ) ,
88f063b4
KW
2391 TRIE_LIST_ITEM(state, charid).forid,
2392 (UV)TRIE_LIST_ITEM(state, charid).newstate
1e2e3d02 2393 );
538e84ed 2394 if (!(charid % 10))
6ad9a8ab 2395 Perl_re_printf( aTHX_ "\n%*s| ",
664e119d 2396 (int)((depth * 2) + 14), "");
1e2e3d02 2397 }
ab3bbdeb 2398 }
6ad9a8ab 2399 Perl_re_printf( aTHX_ "\n");
3dab1dad 2400 }
538e84ed 2401}
3dab1dad
YO
2402
2403/*
3dab1dad 2404 Dumps a fully constructed but uncompressed trie in table form.
538e84ed
KW
2405 This is the normal DFA style state transition table, with a few
2406 twists to facilitate compression later.
3dab1dad
YO
2407 Used for debugging make_trie().
2408*/
2409STATIC void
55eed653 2410S_dump_trie_interim_table(pTHX_ const struct _reg_trie_data *trie,
2b8b4781
NC
2411 HV *widecharmap, AV *revcharmap, U32 next_alloc,
2412 U32 depth)
3dab1dad
YO
2413{
2414 U32 state;
2415 U16 charid;
ab3bbdeb 2416 SV *sv=sv_newmortal();
55eed653 2417 int colwidth= widecharmap ? 6 : 4;
271b36b1 2418 DECLARE_AND_GET_RE_DEBUG_FLAGS;
7918f24d
NC
2419
2420 PERL_ARGS_ASSERT_DUMP_TRIE_INTERIM_TABLE;
538e84ed 2421
3dab1dad
YO
2422 /*
2423 print out the table precompression so that we can do a visual check
2424 that they are identical.
2425 */
538e84ed 2426
6ad9a8ab 2427 Perl_re_indentf( aTHX_ "Char : ", depth+1 );
3dab1dad
YO
2428
2429 for( charid = 0 ; charid < trie->uniquecharcount ; charid++ ) {
2b8b4781 2430 SV ** const tmp = av_fetch( revcharmap, charid, 0);
3dab1dad 2431 if ( tmp ) {
6ad9a8ab 2432 Perl_re_printf( aTHX_ "%*s",
ab3bbdeb 2433 colwidth,
538e84ed 2434 pv_pretty(sv, SvPV_nolen_const(*tmp), SvCUR(*tmp), colwidth,
ab3bbdeb
YO
2435 PL_colors[0], PL_colors[1],
2436 (SvUTF8(*tmp) ? PERL_PV_ESCAPE_UNI : 0) |
538e84ed
KW
2437 PERL_PV_ESCAPE_FIRSTCHAR
2438 )
ab3bbdeb 2439 );
3dab1dad
YO
2440 }
2441 }
2442
4aaafc03
YO
2443 Perl_re_printf( aTHX_ "\n");
2444 Perl_re_indentf( aTHX_ "State+-", depth+1 );
3dab1dad
YO
2445
2446 for( charid=0 ; charid < trie->uniquecharcount ; charid++ ) {
6ad9a8ab 2447 Perl_re_printf( aTHX_ "%.*s", colwidth,"--------");
3dab1dad
YO
2448 }
2449
6ad9a8ab 2450 Perl_re_printf( aTHX_ "\n" );
3dab1dad
YO
2451
2452 for( state=1 ; state < next_alloc ; state += trie->uniquecharcount ) {
2453
147e3846 2454 Perl_re_indentf( aTHX_ "%4" UVXf " : ",
cb41e5d6 2455 depth+1,
3dab1dad
YO
2456 (UV)TRIE_NODENUM( state ) );
2457
2458 for( charid = 0 ; charid < trie->uniquecharcount ; charid++ ) {
ab3bbdeb
YO
2459 UV v=(UV)SAFE_TRIE_NODENUM( trie->trans[ state + charid ].next );
2460 if (v)
147e3846 2461 Perl_re_printf( aTHX_ "%*" UVXf, colwidth, v );
ab3bbdeb 2462 else
6ad9a8ab 2463 Perl_re_printf( aTHX_ "%*s", colwidth, "." );
3dab1dad
YO
2464 }
2465 if ( ! trie->states[ TRIE_NODENUM( state ) ].wordnum ) {
147e3846 2466 Perl_re_printf( aTHX_ " (%4" UVXf ")\n",
538e84ed 2467 (UV)trie->trans[ state ].check );
3dab1dad 2468 } else {
147e3846 2469 Perl_re_printf( aTHX_ " (%4" UVXf ") W%4X\n",
538e84ed 2470 (UV)trie->trans[ state ].check,
3dab1dad
YO
2471 trie->states[ TRIE_NODENUM( state ) ].wordnum );
2472 }
2473 }
07be1b83 2474}
3dab1dad
YO
2475
2476#endif
2477
2e64971a 2478
786e8c11
YO
2479/* make_trie(startbranch,first,last,tail,word_count,flags,depth)
2480 startbranch: the first branch in the whole branch sequence
2481 first : start branch of sequence of branch-exact nodes.
2482 May be the same as startbranch
2483 last : Thing following the last branch.
2484 May be the same as tail.
2485 tail : item following the branch sequence
2486 count : words in the sequence
a4525e78 2487 flags : currently the OP() type we will be building one of /EXACT(|F|FA|FU|FU_SS|L|FLU8)/
786e8c11 2488 depth : indent depth
3dab1dad 2489
786e8c11 2490Inplace optimizes a sequence of 2 or more Branch-Exact nodes into a TRIE node.
07be1b83 2491
786e8c11
YO
2492A trie is an N'ary tree where the branches are determined by digital
2493decomposition of the key. IE, at the root node you look up the 1st character and
2494follow that branch repeat until you find the end of the branches. Nodes can be
2495marked as "accepting" meaning they represent a complete word. Eg:
07be1b83 2496
786e8c11 2497 /he|she|his|hers/
72f13be8 2498
786e8c11
YO
2499would convert into the following structure. Numbers represent states, letters
2500following numbers represent valid transitions on the letter from that state, if
2501the number is in square brackets it represents an accepting state, otherwise it
2502will be in parenthesis.
07be1b83 2503
786e8c11
YO
2504 +-h->+-e->[3]-+-r->(8)-+-s->[9]
2505 | |
2506 | (2)
2507 | |
2508 (1) +-i->(6)-+-s->[7]
2509 |
2510 +-s->(3)-+-h->(4)-+-e->[5]
07be1b83 2511
786e8c11
YO
2512 Accept Word Mapping: 3=>1 (he),5=>2 (she), 7=>3 (his), 9=>4 (hers)
2513
2514This shows that when matching against the string 'hers' we will begin at state 1
2515read 'h' and move to state 2, read 'e' and move to state 3 which is accepting,
2516then read 'r' and go to state 8 followed by 's' which takes us to state 9 which
2517is also accepting. Thus we know that we can match both 'he' and 'hers' with a
2518single traverse. We store a mapping from accepting to state to which word was
2519matched, and then when we have multiple possibilities we try to complete the
b8fda935 2520rest of the regex in the order in which they occurred in the alternation.
786e8c11
YO
2521
2522The only prior NFA like behaviour that would be changed by the TRIE support is
2523the silent ignoring of duplicate alternations which are of the form:
2524
2525 / (DUPE|DUPE) X? (?{ ... }) Y /x
2526
4b714af6 2527Thus EVAL blocks following a trie may be called a different number of times with
786e8c11 2528and without the optimisation. With the optimisations dupes will be silently
486ec47a 2529ignored. This inconsistent behaviour of EVAL type nodes is well established as
786e8c11
YO
2530the following demonstrates:
2531
2532 'words'=~/(word|word|word)(?{ print $1 })[xyz]/
2533
2534which prints out 'word' three times, but
2535
2536 'words'=~/(word|word|word)(?{ print $1 })S/
2537
2538which doesnt print it out at all. This is due to other optimisations kicking in.
2539
2540Example of what happens on a structural level:
2541
486ec47a 2542The regexp /(ac|ad|ab)+/ will produce the following debug output:
786e8c11
YO
2543
2544 1: CURLYM[1] {1,32767}(18)
2545 5: BRANCH(8)
2546 6: EXACT <ac>(16)
2547 8: BRANCH(11)
2548 9: EXACT <ad>(16)
2549 11: BRANCH(14)
2550 12: EXACT <ab>(16)
2551 16: SUCCEED(0)
2552 17: NOTHING(18)
2553 18: END(0)
2554
2555This would be optimizable with startbranch=5, first=5, last=16, tail=16
2556and should turn into:
2557
2558 1: CURLYM[1] {1,32767}(18)
2559 5: TRIE(16)
2560 [Words:3 Chars Stored:6 Unique Chars:4 States:5 NCP:1]
2561 <ac>
2562 <ad>
2563 <ab>
2564 16: SUCCEED(0)
2565 17: NOTHING(18)
2566 18: END(0)
2567
2568Cases where tail != last would be like /(?foo|bar)baz/:
2569
2570 1: BRANCH(4)
2571 2: EXACT <foo>(8)
2572 4: BRANCH(7)
2573 5: EXACT <bar>(8)
2574 7: TAIL(8)
2575 8: EXACT <baz>(10)
2576 10: END(0)
2577
2578which would be optimizable with startbranch=1, first=1, last=7, tail=8
2579and would end up looking like:
2580
2581 1: TRIE(8)
2582 [Words:2 Chars Stored:6 Unique Chars:5 States:7 NCP:1]
2583 <foo>
2584 <bar>
2585 7: TAIL(8)
2586 8: EXACT <baz>(10)
2587 10: END(0)
2588
c80e42f3 2589 d = uvchr_to_utf8_flags(d, uv, 0);
786e8c11
YO
2590
2591is the recommended Unicode-aware way of saying
2592
2593 *(d++) = uv;
2594*/
2595
fab2782b 2596#define TRIE_STORE_REVCHAR(val) \
786e8c11 2597 STMT_START { \
73031816 2598 if (UTF) { \
668fcfea 2599 SV *zlopp = newSV(UTF8_MAXBYTES); \
88c9ea1e 2600 unsigned char *flrbbbbb = (unsigned char *) SvPVX(zlopp); \
1d84a256
DM
2601 unsigned char *const kapow = uvchr_to_utf8(flrbbbbb, val); \
2602 *kapow = '\0'; \
73031816
NC
2603 SvCUR_set(zlopp, kapow - flrbbbbb); \
2604 SvPOK_on(zlopp); \
2605 SvUTF8_on(zlopp); \
2606 av_push(revcharmap, zlopp); \
2607 } else { \
fab2782b 2608 char ooooff = (char)val; \
73031816
NC
2609 av_push(revcharmap, newSVpvn(&ooooff, 1)); \
2610 } \
2611 } STMT_END
786e8c11 2612
914a25d5
KW
2613/* This gets the next character from the input, folding it if not already
2614 * folded. */
2615#define TRIE_READ_CHAR STMT_START { \
2616 wordlen++; \
2617 if ( UTF ) { \
2618 /* if it is UTF then it is either already folded, or does not need \
2619 * folding */ \
1c1d615a 2620 uvc = valid_utf8_to_uvchr( (const U8*) uc, &len); \
914a25d5
KW
2621 } \
2622 else if (folder == PL_fold_latin1) { \
7d006b13
KW
2623 /* This folder implies Unicode rules, which in the range expressible \
2624 * by not UTF is the lower case, with the two exceptions, one of \
2625 * which should have been taken care of before calling this */ \
2626 assert(*uc != LATIN_SMALL_LETTER_SHARP_S); \
2627 uvc = toLOWER_L1(*uc); \
2628 if (UNLIKELY(uvc == MICRO_SIGN)) uvc = GREEK_SMALL_LETTER_MU; \
2629 len = 1; \
914a25d5
KW
2630 } else { \
2631 /* raw data, will be folded later if needed */ \
2632 uvc = (U32)*uc; \
2633 len = 1; \
2634 } \
786e8c11
YO
2635} STMT_END
2636
2637
2638
2639#define TRIE_LIST_PUSH(state,fid,ns) STMT_START { \
2640 if ( TRIE_LIST_CUR( state ) >=TRIE_LIST_LEN( state ) ) { \
00195859 2641 U32 ging = TRIE_LIST_LEN( state ) * 2; \
f9003953 2642 Renew( trie->states[ state ].trans.list, ging, reg_trie_trans_le ); \
00195859 2643 TRIE_LIST_LEN( state ) = ging; \
786e8c11
YO
2644 } \
2645 TRIE_LIST_ITEM( state, TRIE_LIST_CUR( state ) ).forid = fid; \
2646 TRIE_LIST_ITEM( state, TRIE_LIST_CUR( state ) ).newstate = ns; \
2647 TRIE_LIST_CUR( state )++; \
2648} STMT_END
07be1b83 2649
786e8c11 2650#define TRIE_LIST_NEW(state) STMT_START { \
d09f14bf 2651 Newx( trie->states[ state ].trans.list, \
786e8c11
YO
2652 4, reg_trie_trans_le ); \
2653 TRIE_LIST_CUR( state ) = 1; \
2654 TRIE_LIST_LEN( state ) = 4; \
2655} STMT_END
07be1b83 2656
786e8c11
YO
2657#define TRIE_HANDLE_WORD(state) STMT_START { \
2658 U16 dupe= trie->states[ state ].wordnum; \
2659 regnode * const noper_next = regnext( noper ); \
2660 \
786e8c11
YO
2661 DEBUG_r({ \
2662 /* store the word for dumping */ \
2663 SV* tmp; \
2664 if (OP(noper) != NOTHING) \
740cce10 2665 tmp = newSVpvn_utf8(STRING(noper), STR_LEN(noper), UTF); \
786e8c11 2666 else \
740cce10 2667 tmp = newSVpvn_utf8( "", 0, UTF ); \
2b8b4781 2668 av_push( trie_words, tmp ); \
786e8c11
YO
2669 }); \
2670 \
2671 curword++; \
2e64971a
DM
2672 trie->wordinfo[curword].prev = 0; \
2673 trie->wordinfo[curword].len = wordlen; \
2674 trie->wordinfo[curword].accept = state; \
786e8c11
YO
2675 \
2676 if ( noper_next < tail ) { \
2677 if (!trie->jump) \
538e84ed
KW
2678 trie->jump = (U16 *) PerlMemShared_calloc( word_count + 1, \
2679 sizeof(U16) ); \
7f69552c 2680 trie->jump[curword] = (U16)(noper_next - convert); \
786e8c11
YO
2681 if (!jumper) \
2682 jumper = noper_next; \
2683 if (!nextbranch) \
2684 nextbranch= regnext(cur); \
2685 } \
2686 \
2687 if ( dupe ) { \
2e64971a
DM
2688 /* It's a dupe. Pre-insert into the wordinfo[].prev */\
2689 /* chain, so that when the bits of chain are later */\
2690 /* linked together, the dups appear in the chain */\
2691 trie->wordinfo[curword].prev = trie->wordinfo[dupe].prev; \
2692 trie->wordinfo[dupe].prev = curword; \
786e8c11
YO
2693 } else { \
2694 /* we haven't inserted this word yet. */ \
2695 trie->states[ state ].wordnum = curword; \
2696 } \
2697} STMT_END
07be1b83 2698
3dab1dad 2699
786e8c11
YO
2700#define TRIE_TRANS_STATE(state,base,ucharcount,charid,special) \
2701 ( ( base + charid >= ucharcount \
2702 && base + charid < ubound \
2703 && state == trie->trans[ base - ucharcount + charid ].check \
2704 && trie->trans[ base - ucharcount + charid ].next ) \
2705 ? trie->trans[ base - ucharcount + charid ].next \
2706 : ( state==1 ? special : 0 ) \
2707 )
3dab1dad 2708
8bcafbf4
YO
2709#define TRIE_BITMAP_SET_FOLDED(trie, uvc, folder) \
2710STMT_START { \
2711 TRIE_BITMAP_SET(trie, uvc); \
2712 /* store the folded codepoint */ \
2713 if ( folder ) \
2714 TRIE_BITMAP_SET(trie, folder[(U8) uvc ]); \
2715 \
2716 if ( !UTF ) { \
2717 /* store first byte of utf8 representation of */ \
2718 /* variant codepoints */ \
2719 if (! UVCHR_IS_INVARIANT(uvc)) { \
2720 TRIE_BITMAP_SET(trie, UTF8_TWO_BYTE_HI(uvc)); \
2721 } \
2722 } \
2723} STMT_END
786e8c11
YO
2724#define MADE_TRIE 1
2725#define MADE_JUMP_TRIE 2
2726#define MADE_EXACT_TRIE 4
3dab1dad 2727
a3621e74 2728STATIC I32
538e84ed
KW
2729S_make_trie(pTHX_ RExC_state_t *pRExC_state, regnode *startbranch,
2730 regnode *first, regnode *last, regnode *tail,
2731 U32 word_count, U32 flags, U32 depth)
a3621e74
YO
2732{
2733 /* first pass, loop through and scan words */
2734 reg_trie_data *trie;
55eed653 2735 HV *widecharmap = NULL;
2b8b4781 2736 AV *revcharmap = newAV();
a3621e74 2737 regnode *cur;
a3621e74
YO
2738 STRLEN len = 0;
2739 UV uvc = 0;
2740 U16 curword = 0;
2741 U32 next_alloc = 0;
786e8c11
YO
2742 regnode *jumper = NULL;
2743 regnode *nextbranch = NULL;
7f69552c 2744 regnode *convert = NULL;
2e64971a 2745 U32 *prev_states; /* temp array mapping each state to previous one */
a3621e74 2746 /* we just use folder as a flag in utf8 */
1e696034 2747 const U8 * folder = NULL;
a3621e74 2748
3a611511
YO
2749 /* in the below add_data call we are storing either 'tu' or 'tuaa'
2750 * which stands for one trie structure, one hash, optionally followed
2751 * by two arrays */
2b8b4781 2752#ifdef DEBUGGING
3a611511 2753 const U32 data_slot = add_data( pRExC_state, STR_WITH_LEN("tuaa"));
2b8b4781
NC
2754 AV *trie_words = NULL;
2755 /* along with revcharmap, this only used during construction but both are
2756 * useful during debugging so we store them in the struct when debugging.
8e11feef 2757 */
2b8b4781 2758#else
cf78de0b 2759 const U32 data_slot = add_data( pRExC_state, STR_WITH_LEN("tu"));
3dab1dad 2760 STRLEN trie_charcount=0;
3dab1dad 2761#endif
2b8b4781 2762 SV *re_trie_maxbuff;
271b36b1 2763 DECLARE_AND_GET_RE_DEBUG_FLAGS;
7918f24d
NC
2764
2765 PERL_ARGS_ASSERT_MAKE_TRIE;
72f13be8
YO
2766#ifndef DEBUGGING
2767 PERL_UNUSED_ARG(depth);
2768#endif
a3621e74 2769
1e696034 2770 switch (flags) {
3f2416ae 2771 case EXACT: case EXACT_REQ8: case EXACTL: break;
89829bb5 2772 case EXACTFAA:
0ea669f4 2773 case EXACTFUP:
a4525e78
KW
2774 case EXACTFU:
2775 case EXACTFLU8: folder = PL_fold_latin1; break;
1e696034 2776 case EXACTF: folder = PL_fold; break;
fab2782b 2777 default: Perl_croak( aTHX_ "panic! In trie construction, unknown node type %u %s", (unsigned) flags, PL_reg_name[flags] );
1e696034
KW
2778 }
2779
c944940b 2780 trie = (reg_trie_data *) PerlMemShared_calloc( 1, sizeof(reg_trie_data) );
a3621e74 2781 trie->refcount = 1;
3dab1dad 2782 trie->startstate = 1;
786e8c11 2783 trie->wordcount = word_count;
f8fc2ecf 2784 RExC_rxi->data->data[ data_slot ] = (void*)trie;
c944940b 2785 trie->charmap = (U16 *) PerlMemShared_calloc( 256, sizeof(U16) );
3f2416ae 2786 if (flags == EXACT || flags == EXACT_REQ8 || flags == EXACTL)
c944940b 2787 trie->bitmap = (char *) PerlMemShared_calloc( ANYOF_BITMAP_SIZE, 1 );
2e64971a
DM
2788 trie->wordinfo = (reg_trie_wordinfo *) PerlMemShared_calloc(
2789 trie->wordcount+1, sizeof(reg_trie_wordinfo));
2790
a3621e74 2791 DEBUG_r({
2b8b4781 2792 trie_words = newAV();
a3621e74 2793 });
a3621e74 2794
3b89859a 2795 re_trie_maxbuff = get_sv(RE_TRIE_MAXBUF_NAME, GV_ADD);
316ebaf2 2796 assert(re_trie_maxbuff);
a3621e74 2797 if (!SvIOK(re_trie_maxbuff)) {
0111c4fd 2798 sv_setiv(re_trie_maxbuff, RE_TRIE_MAXBUF_INIT);
a3621e74 2799 }
df826430 2800 DEBUG_TRIE_COMPILE_r({
6ad9a8ab 2801 Perl_re_indentf( aTHX_
cb41e5d6
YO
2802 "make_trie start==%d, first==%d, last==%d, tail==%d depth=%d\n",
2803 depth+1,
88f063b4 2804 REG_NODE_NUM(startbranch), REG_NODE_NUM(first),
538e84ed 2805 REG_NODE_NUM(last), REG_NODE_NUM(tail), (int)depth);
3dab1dad 2806 });
538e84ed 2807
7f69552c
YO
2808 /* Find the node we are going to overwrite */
2809 if ( first == startbranch && OP( last ) != BRANCH ) {
2810 /* whole branch chain */
2811 convert = first;
2812 } else {
2813 /* branch sub-chain */
2814 convert = NEXTOPER( first );
2815 }
538e84ed 2816
a3621e74
YO
2817 /* -- First loop and Setup --
2818
2819 We first traverse the branches and scan each word to determine if it
2820 contains widechars, and how many unique chars there are, this is
2821 important as we have to build a table with at least as many columns as we
2822 have unique chars.
2823
2824 We use an array of integers to represent the character codes 0..255
538e84ed
KW
2825 (trie->charmap) and we use a an HV* to store Unicode characters. We use
2826 the native representation of the character value as the key and IV's for
2827 the coded index.
a3621e74
YO
2828
2829 *TODO* If we keep track of how many times each character is used we can
2830 remap the columns so that the table compression later on is more
3b753521 2831 efficient in terms of memory by ensuring the most common value is in the
a3621e74
YO
2832 middle and the least common are on the outside. IMO this would be better
2833 than a most to least common mapping as theres a decent chance the most
2834 common letter will share a node with the least common, meaning the node
486ec47a 2835 will not be compressible. With a middle is most common approach the worst
a3621e74
YO
2836 case is when we have the least common nodes twice.
2837
2838 */
2839
a3621e74 2840 for ( cur = first ; cur < last ; cur = regnext( cur ) ) {
df826430 2841 regnode *noper = NEXTOPER( cur );
944e05e3
YO
2842 const U8 *uc;
2843 const U8 *e;
bc031a7d 2844 int foldlen = 0;
07be1b83 2845 U32 wordlen = 0; /* required init */
bc031a7d
KW
2846 STRLEN minchars = 0;
2847 STRLEN maxchars = 0;
538e84ed
KW
2848 bool set_bit = trie->bitmap ? 1 : 0; /*store the first char in the
2849 bitmap?*/
a3621e74 2850
3dab1dad 2851 if (OP(noper) == NOTHING) {
20ed8c88
YO
2852 /* skip past a NOTHING at the start of an alternation
2853 * eg, /(?:)a|(?:b)/ should be the same as /a|b/
ca902fb8
YO
2854 *
2855 * If the next node is not something we are supposed to process
2856 * we will just ignore it due to the condition guarding the
2857 * next block.
20ed8c88 2858 */
ca902fb8 2859
df826430 2860 regnode *noper_next= regnext(noper);
944e05e3
YO
2861 if (noper_next < tail)
2862 noper= noper_next;
2863 }
2864
f6b4b99d
KW
2865 if ( noper < tail
2866 && ( OP(noper) == flags
3f2416ae
KW
2867 || (flags == EXACT && OP(noper) == EXACT_REQ8)
2868 || (flags == EXACTFU && ( OP(noper) == EXACTFU_REQ8
0ea669f4 2869 || OP(noper) == EXACTFUP))))
f6b4b99d 2870 {
944e05e3
YO
2871 uc= (U8*)STRING(noper);
2872 e= uc + STR_LEN(noper);
2873 } else {
2874 trie->minlen= 0;
2875 continue;
3dab1dad 2876 }
df826430 2877
944e05e3 2878
fab2782b 2879 if ( set_bit ) { /* bitmap only alloced when !(UTF&&Folding) */
02daf0ab
YO
2880 TRIE_BITMAP_SET(trie,*uc); /* store the raw first byte
2881 regardless of encoding */
0ea669f4 2882 if (OP( noper ) == EXACTFUP) {
fab2782b 2883 /* false positives are ok, so just set this */
0dc4a61d 2884 TRIE_BITMAP_SET(trie, LATIN_SMALL_LETTER_SHARP_S);
fab2782b
YO
2885 }
2886 }
dca5fc4c 2887
bc031a7d
KW
2888 for ( ; uc < e ; uc += len ) { /* Look at each char in the current
2889 branch */
3dab1dad 2890 TRIE_CHARCOUNT(trie)++;
a3621e74 2891 TRIE_READ_CHAR;
645de4ce 2892
bc031a7d
KW
2893 /* TRIE_READ_CHAR returns the current character, or its fold if /i
2894 * is in effect. Under /i, this character can match itself, or
2895 * anything that folds to it. If not under /i, it can match just
2896 * itself. Most folds are 1-1, for example k, K, and KELVIN SIGN
2897 * all fold to k, and all are single characters. But some folds
2898 * expand to more than one character, so for example LATIN SMALL
2899 * LIGATURE FFI folds to the three character sequence 'ffi'. If
2900 * the string beginning at 'uc' is 'ffi', it could be matched by
2901 * three characters, or just by the one ligature character. (It
2902 * could also be matched by two characters: LATIN SMALL LIGATURE FF
2903 * followed by 'i', or by 'f' followed by LATIN SMALL LIGATURE FI).
2904 * (Of course 'I' and/or 'F' instead of 'i' and 'f' can also
2905 * match.) The trie needs to know the minimum and maximum number
2906 * of characters that could match so that it can use size alone to
2907 * quickly reject many match attempts. The max is simple: it is
2908 * the number of folded characters in this branch (since a fold is
2909 * never shorter than what folds to it. */
2910
2911 maxchars++;
2912
2913 /* And the min is equal to the max if not under /i (indicated by
2914 * 'folder' being NULL), or there are no multi-character folds. If
2915 * there is a multi-character fold, the min is incremented just
2916 * once, for the character that folds to the sequence. Each
2917 * character in the sequence needs to be added to the list below of
2918 * characters in the trie, but we count only the first towards the
2919 * min number of characters needed. This is done through the
2920 * variable 'foldlen', which is returned by the macros that look
2921 * for these sequences as the number of bytes the sequence
2922 * occupies. Each time through the loop, we decrement 'foldlen' by
2923 * how many bytes the current char occupies. Only when it reaches
2924 * 0 do we increment 'minchars' or look for another multi-character
2925 * sequence. */
2926 if (folder == NULL) {
2927 minchars++;
2928 }
2929 else if (foldlen > 0) {
2930 foldlen -= (UTF) ? UTF8SKIP(uc) : 1;
645de4ce
KW
2931 }
2932 else {
bc031a7d
KW
2933 minchars++;
2934
2935 /* See if *uc is the beginning of a multi-character fold. If
2936 * so, we decrement the length remaining to look at, to account
2937 * for the current character this iteration. (We can use 'uc'
cf9d46fd
KW
2938 * instead of the fold returned by TRIE_READ_CHAR because the
2939 * macro is smart enough to account for any unfolded
2940 * characters. */
bc031a7d
KW
2941 if (UTF) {
2942 if ((foldlen = is_MULTI_CHAR_FOLD_utf8_safe(uc, e))) {
2943 foldlen -= UTF8SKIP(uc);
645de4ce
KW
2944 }
2945 }
bc031a7d
KW
2946 else if ((foldlen = is_MULTI_CHAR_FOLD_latin1_safe(uc, e))) {
2947 foldlen--;
2948 }
645de4ce 2949 }
bc031a7d
KW
2950
2951 /* The current character (and any potential folds) should be added
2952 * to the possible matching characters for this position in this
2953 * branch */
a3621e74 2954 if ( uvc < 256 ) {
fab2782b
YO
2955 if ( folder ) {
2956 U8 folded= folder[ (U8) uvc ];
2957 if ( !trie->charmap[ folded ] ) {
2958 trie->charmap[ folded ]=( ++trie->uniquecharcount );
2959 TRIE_STORE_REVCHAR( folded );
2960 }
2961 }
a3621e74
YO
2962 if ( !trie->charmap[ uvc ] ) {
2963 trie->charmap[ uvc ]=( ++trie->uniquecharcount );
fab2782b 2964 TRIE_STORE_REVCHAR( uvc );
a3621e74 2965 }
02daf0ab 2966 if ( set_bit ) {
62012aee
KW
2967 /* store the codepoint in the bitmap, and its folded
2968 * equivalent. */
8bcafbf4 2969 TRIE_BITMAP_SET_FOLDED(trie, uvc, folder);
02daf0ab
YO
2970 set_bit = 0; /* We've done our bit :-) */
2971 }
a3621e74 2972 } else {
bc031a7d
KW
2973
2974 /* XXX We could come up with the list of code points that fold
2975 * to this using PL_utf8_foldclosures, except not for
2976 * multi-char folds, as there may be multiple combinations
2977 * there that could work, which needs to wait until runtime to
2978 * resolve (The comment about LIGATURE FFI above is such an
2979 * example */
2980
a3621e74 2981 SV** svpp;
55eed653
NC
2982 if ( !widecharmap )
2983 widecharmap = newHV();
a3621e74 2984
55eed653 2985 svpp = hv_fetch( widecharmap, (char*)&uvc, sizeof( UV ), 1 );
a3621e74
YO
2986
2987 if ( !svpp )
147e3846 2988 Perl_croak( aTHX_ "error creating/fetching widecharmap entry for 0x%" UVXf, uvc );
a3621e74
YO
2989
2990 if ( !SvTRUE( *svpp ) ) {
2991 sv_setiv( *svpp, ++trie->uniquecharcount );
fab2782b 2992 TRIE_STORE_REVCHAR(uvc);
a3621e74
YO
2993 }
2994 }
bc031a7d
KW
2995 } /* end loop through characters in this branch of the trie */
2996
2997 /* We take the min and max for this branch and combine to find the min
2998 * and max for all branches processed so far */
3dab1dad 2999 if( cur == first ) {
bc031a7d
KW
3000 trie->minlen = minchars;
3001 trie->maxlen = maxchars;
3002 } else if (minchars < trie->minlen) {
3003 trie->minlen = minchars;
3004 } else if (maxchars > trie->maxlen) {
3005 trie->maxlen = maxchars;
fab2782b 3006 }
a3621e74
YO
3007 } /* end first pass */
3008 DEBUG_TRIE_COMPILE_r(
6ad9a8ab 3009 Perl_re_indentf( aTHX_
cb41e5d6
YO
3010 "TRIE(%s): W:%d C:%d Uq:%d Min:%d Max:%d\n",
3011 depth+1,
55eed653 3012 ( widecharmap ? "UTF8" : "NATIVE" ), (int)word_count,
be8e71aa
YO
3013 (int)TRIE_CHARCOUNT(trie), trie->uniquecharcount,
3014 (int)trie->minlen, (int)trie->maxlen )
a3621e74 3015 );
a3621e74
YO
3016
3017 /*
3018 We now know what we are dealing with in terms of unique chars and
3019 string sizes so we can calculate how much memory a naive
0111c4fd
RGS
3020 representation using a flat table will take. If it's over a reasonable
3021 limit (as specified by ${^RE_TRIE_MAXBUF}) we use a more memory
a3621e74
YO
3022 conservative but potentially much slower representation using an array
3023 of lists.
3024
3025 At the end we convert both representations into the same compressed
3026 form that will be used in regexec.c for matching with. The latter
3027 is a form that cannot be used to construct with but has memory
3028 properties similar to the list form and access properties similar
3029 to the table form making it both suitable for fast searches and
3030 small enough that its feasable to store for the duration of a program.
3031
3032 See the comment in the code where the compressed table is produced
3033 inplace from the flat tabe representation for an explanation of how
3034 the compression works.
3035
3036 */
3037
3038
2e64971a
DM
3039 Newx(prev_states, TRIE_CHARCOUNT(trie) + 2, U32);
3040 prev_states[1] = 0;
3041
538e84ed
KW
3042 if ( (IV)( ( TRIE_CHARCOUNT(trie) + 1 ) * trie->uniquecharcount + 1)
3043 > SvIV(re_trie_maxbuff) )
3044 {
a3621e74
YO
3045 /*
3046 Second Pass -- Array Of Lists Representation
3047
3048 Each state will be represented by a list of charid:state records
3049 (reg_trie_trans_le) the first such element holds the CUR and LEN
3050 points of the allocated array. (See defines above).
3051
3052 We build the initial structure using the lists, and then convert
3053 it into the compressed table form which allows faster lookups
3054 (but cant be modified once converted).
a3621e74
YO
3055 */
3056
a3621e74
YO
3057 STRLEN transcount = 1;
3058
6ad9a8ab 3059 DEBUG_TRIE_COMPILE_MORE_r( Perl_re_indentf( aTHX_ "Compiling trie using list compiler\n",
cb41e5d6 3060 depth+1));
686b73d4 3061
c944940b
JH
3062 trie->states = (reg_trie_state *)
3063 PerlMemShared_calloc( TRIE_CHARCOUNT(trie) + 2,
3064 sizeof(reg_trie_state) );
a3621e74
YO
3065 TRIE_LIST_NEW(1);
3066 next_alloc = 2;
3067
3068 for ( cur = first ; cur < last ; cur = regnext( cur ) ) {
3069
df826430 3070 regnode *noper = NEXTOPER( cur );
c445ea15
AL
3071 U32 state = 1; /* required init */
3072 U16 charid = 0; /* sanity init */
07be1b83 3073 U32 wordlen = 0; /* required init */
c445ea15 3074
df826430
YO
3075 if (OP(noper) == NOTHING) {
3076 regnode *noper_next= regnext(noper);
944e05e3
YO
3077 if (noper_next < tail)
3078 noper= noper_next;
ca902fb8
YO
3079 /* we will undo this assignment if noper does not
3080 * point at a trieable type in the else clause of
3081 * the following statement. */
df826430
YO
3082 }
3083
f6b4b99d
KW
3084 if ( noper < tail
3085 && ( OP(noper) == flags
3f2416ae
KW
3086 || (flags == EXACT && OP(noper) == EXACT_REQ8)
3087 || (flags == EXACTFU && ( OP(noper) == EXACTFU_REQ8
0ea669f4 3088 || OP(noper) == EXACTFUP))))
f6b4b99d 3089 {
944e05e3
YO
3090 const U8 *uc= (U8*)STRING(noper);
3091 const U8 *e= uc + STR_LEN(noper);
3092
786e8c11 3093 for ( ; uc < e ; uc += len ) {
c445ea15 3094
786e8c11 3095 TRIE_READ_CHAR;
c445ea15 3096
786e8c11
YO
3097 if ( uvc < 256 ) {
3098 charid = trie->charmap[ uvc ];
c445ea15 3099 } else {
538e84ed
KW
3100 SV** const svpp = hv_fetch( widecharmap,
3101 (char*)&uvc,
3102 sizeof( UV ),
3103 0);
786e8c11
YO
3104 if ( !svpp ) {
3105 charid = 0;
3106 } else {
3107 charid=(U16)SvIV( *svpp );
3108 }
c445ea15 3109 }
538e84ed
KW
3110 /* charid is now 0 if we dont know the char read, or
3111 * nonzero if we do */
786e8c11 3112 if ( charid ) {
a3621e74 3113
786e8c11
YO
3114 U16 check;
3115 U32 newstate = 0;
a3621e74 3116
786e8c11
YO
3117 charid--;
3118 if ( !trie->states[ state ].trans.list ) {
3119 TRIE_LIST_NEW( state );
c445ea15 3120 }
538e84ed
KW
3121 for ( check = 1;
3122 check <= TRIE_LIST_USED( state );
3123 check++ )
3124 {
3125 if ( TRIE_LIST_ITEM( state, check ).forid
3126 == charid )
3127 {
786e8c11
YO
3128 newstate = TRIE_LIST_ITEM( state, check ).newstate;
3129 break;
3130 }
3131 }
3132 if ( ! newstate ) {
3133 newstate = next_alloc++;
2e64971a 3134 prev_states[newstate] = state;
786e8c11
YO
3135 TRIE_LIST_PUSH( state, charid, newstate );
3136 transcount++;
3137 }
3138 state = newstate;
3139 } else {
147e3846 3140 Perl_croak( aTHX_ "panic! In trie construction, no char mapping for %" IVdf, uvc );
c445ea15 3141 }
a28509cc 3142 }
ca902fb8
YO
3143 } else {
3144 /* If we end up here it is because we skipped past a NOTHING, but did not end up
3145 * on a trieable type. So we need to reset noper back to point at the first regop
3146 * in the branch before we call TRIE_HANDLE_WORD()
3147 */
3148 noper= NEXTOPER(cur);
3149 }
3dab1dad 3150 TRIE_HANDLE_WORD(state);
a3621e74
YO
3151
3152 } /* end second pass */
3153
1e2e3d02 3154 /* next alloc is the NEXT state to be allocated */
538e84ed 3155 trie->statecount = next_alloc;
c944940b
JH
3156 trie->states = (reg_trie_state *)
3157 PerlMemShared_realloc( trie->states,
3158 next_alloc
3159 * sizeof(reg_trie_state) );
a3621e74 3160
3dab1dad 3161 /* and now dump it out before we compress it */
2b8b4781
NC
3162 DEBUG_TRIE_COMPILE_MORE_r(dump_trie_interim_list(trie, widecharmap,
3163 revcharmap, next_alloc,
3164 depth+1)
1e2e3d02 3165 );
a3621e74 3166
c944940b
JH
3167 trie->trans = (reg_trie_trans *)
3168 PerlMemShared_calloc( transcount, sizeof(reg_trie_trans) );
a3621e74
YO
3169 {
3170 U32 state;
a3621e74
YO
3171 U32 tp = 0;
3172 U32 zp = 0;
3173
3174
3175 for( state=1 ; state < next_alloc ; state ++ ) {
3176 U32 base=0;
3177
3178 /*
3179 DEBUG_TRIE_COMPILE_MORE_r(
6ad9a8ab 3180 Perl_re_printf( aTHX_ "tp: %d zp: %d ",tp,zp)
a3621e74
YO
3181 );
3182 */
3183
3184 if (trie->states[state].trans.list) {
3185 U16 minid=TRIE_LIST_ITEM( state, 1).forid;
3186 U16 maxid=minid;
a28509cc 3187 U16 idx;
a3621e74
YO
3188
3189 for( idx = 2 ; idx <= TRIE_LIST_USED( state ) ; idx++ ) {
c445ea15
AL
3190 const U16 forid = TRIE_LIST_ITEM( state, idx).forid;
3191 if ( forid < minid ) {
3192 minid=forid;
3193 } else if ( forid > maxid ) {
3194 maxid=forid;
3195 }
a3621e74
YO
3196 }
3197 if ( transcount < tp + maxid - minid + 1) {
3198 transcount *= 2;
c944940b
JH
3199 trie->trans = (reg_trie_trans *)
3200 PerlMemShared_realloc( trie->trans,
446bd890
NC
3201 transcount
3202 * sizeof(reg_trie_trans) );
538e84ed
KW
3203 Zero( trie->trans + (transcount / 2),
3204 transcount / 2,
3205 reg_trie_trans );
a3621e74
YO
3206 }
3207 base = trie->uniquecharcount + tp - minid;
3208 if ( maxid == minid ) {
3209 U32 set = 0;
3210 for ( ; zp < tp ; zp++ ) {
3211 if ( ! trie->trans[ zp ].next ) {
3212 base = trie->uniquecharcount + zp - minid;
538e84ed
KW
3213 trie->trans[ zp ].next = TRIE_LIST_ITEM( state,
3214 1).newstate;
a3621e74
YO
3215 trie->trans[ zp ].check = state;
3216 set = 1;
3217 break;
3218 }
3219 }
3220 if ( !set ) {
538e84ed
KW
3221 trie->trans[ tp ].next = TRIE_LIST_ITEM( state,
3222 1).newstate;
a3621e74
YO
3223 trie->trans[ tp ].check = state;
3224 tp++;
3225 zp = tp;
3226 }
3227 } else {
3228 for ( idx=1; idx <= TRIE_LIST_USED( state ) ; idx++ ) {
538e84ed
KW
3229 const U32 tid = base
3230 - trie->uniquecharcount
3231 + TRIE_LIST_ITEM( state, idx ).forid;
3232 trie->trans[ tid ].next = TRIE_LIST_ITEM( state,
3233 idx ).newstate;
a3621e74
YO
3234 trie->trans[ tid ].check = state;
3235 }
3236 tp += ( maxid - minid + 1 );
3237 }
3238 Safefree(trie->states[ state ].trans.list);
3239 }
3240 /*
3241 DEBUG_TRIE_COMPILE_MORE_r(
6ad9a8ab 3242 Perl_re_printf( aTHX_ " base: %d\n",base);
a3621e74
YO
3243 );
3244 */
3245 trie->states[ state ].trans.base=base;
3246 }
cc601c31 3247 trie->lasttrans = tp + 1;
a3621e74
YO
3248 }
3249 } else {
3250 /*
3251 Second Pass -- Flat Table Representation.
3252
b423522f
KW
3253 we dont use the 0 slot of either trans[] or states[] so we add 1 to
3254 each. We know that we will need Charcount+1 trans at most to store
3255 the data (one row per char at worst case) So we preallocate both
3256 structures assuming worst case.
a3621e74
YO
3257
3258 We then construct the trie using only the .next slots of the entry
3259 structs.
3260
b423522f
KW
3261 We use the .check field of the first entry of the node temporarily
3262 to make compression both faster and easier by keeping track of how
3263 many non zero fields are in the node.
a3621e74
YO
3264
3265 Since trans are numbered from 1 any 0 pointer in the table is a FAIL
3266 transition.
3267
b423522f
KW
3268 There are two terms at use here: state as a TRIE_NODEIDX() which is
3269 a number representing the first entry of the node, and state as a
3270 TRIE_NODENUM() which is the trans number. state 1 is TRIE_NODEIDX(1)
3271 and TRIE_NODENUM(1), state 2 is TRIE_NODEIDX(2) and TRIE_NODENUM(3)
3272 if there are 2 entrys per node. eg:
a3621e74
YO
3273
3274 A B A B
3275 1. 2 4 1. 3 7
3276 2. 0 3 3. 0 5
3277 3. 0 0 5. 0 0
3278 4. 0 0 7. 0 0
3279
b423522f
KW
3280 The table is internally in the right hand, idx form. However as we
3281 also have to deal with the states array which is indexed by nodenum
3282 we have to use TRIE_NODENUM() to convert.
a3621e74
YO
3283
3284 */
6ad9a8ab 3285 DEBUG_TRIE_COMPILE_MORE_r( Perl_re_indentf( aTHX_ "Compiling trie using table compiler\n",
cb41e5d6 3286 depth+1));
3dab1dad 3287
c944940b
JH
3288 trie->trans = (reg_trie_trans *)
3289 PerlMemShared_calloc( ( TRIE_CHARCOUNT(trie) + 1 )
3290 * trie->uniquecharcount + 1,
3291 sizeof(reg_trie_trans) );
3292 trie->states = (reg_trie_state *)
3293 PerlMemShared_calloc( TRIE_CHARCOUNT(trie) + 2,
3294 sizeof(reg_trie_state) );
a3621e74
YO
3295 next_alloc = trie->uniquecharcount + 1;
3296
3dab1dad 3297
a3621e74
YO
3298 for ( cur = first ; cur < last ; cur = regnext( cur ) ) {
3299
df826430 3300 regnode *noper = NEXTOPER( cur );
a3621e74
YO
3301
3302 U32 state = 1; /* required init */
3303
3304 U16 charid = 0; /* sanity init */
3305 U32 accept_state = 0; /* sanity init */
a3621e74 3306
07be1b83 3307 U32 wordlen = 0; /* required init */
a3621e74 3308
df826430
YO
3309 if (OP(noper) == NOTHING) {
3310 regnode *noper_next= regnext(noper);
944e05e3
YO
3311 if (noper_next < tail)
3312 noper= noper_next;
ca902fb8
YO
3313 /* we will undo this assignment if noper does not
3314 * point at a trieable type in the else clause of
3315 * the following statement. */
df826430 3316 }
fab2782b 3317
f6b4b99d
KW
3318 if ( noper < tail
3319 && ( OP(noper) == flags
3f2416ae
KW
3320 || (flags == EXACT && OP(noper) == EXACT_REQ8)
3321 || (flags == EXACTFU && ( OP(noper) == EXACTFU_REQ8
0ea669f4 3322 || OP(noper) == EXACTFUP))))
f6b4b99d 3323 {
944e05e3
YO
3324 const U8 *uc= (U8*)STRING(noper);
3325 const U8 *e= uc + STR_LEN(noper);
3326
786e8c11 3327 for ( ; uc < e ; uc += len ) {
a3621e74 3328
786e8c11 3329 TRIE_READ_CHAR;
a3621e74 3330
786e8c11
YO
3331 if ( uvc < 256 ) {
3332 charid = trie->charmap[ uvc ];
3333 } else {
538e84ed
KW
3334 SV* const * const svpp = hv_fetch( widecharmap,
3335 (char*)&uvc,
3336 sizeof( UV ),
3337 0);
786e8c11 3338 charid = svpp ? (U16)SvIV(*svpp) : 0;
a3621e74 3339 }
786e8c11
YO
3340 if ( charid ) {
3341 charid--;
3342 if ( !trie->trans[ state + charid ].next ) {
3343 trie->trans[ state + charid ].next = next_alloc;
3344 trie->trans[ state ].check++;
2e64971a
DM
3345 prev_states[TRIE_NODENUM(next_alloc)]
3346 = TRIE_NODENUM(state);
786e8c11
YO
3347 next_alloc += trie->uniquecharcount;
3348 }
3349 state = trie->trans[ state + charid ].next;
3350 } else {
147e3846 3351 Perl_croak( aTHX_ "panic! In trie construction, no char mapping for %" IVdf, uvc );
786e8c11 3352 }
538e84ed
KW
3353 /* charid is now 0 if we dont know the char read, or
3354 * nonzero if we do */
a3621e74 3355 }
ca902fb8
YO
3356 } else {
3357 /* If we end up here it is because we skipped past a NOTHING, but did not end up
3358 * on a trieable type. So we need to reset noper back to point at the first regop
3359 * in the branch before we call TRIE_HANDLE_WORD().
3360 */
3361 noper= NEXTOPER(cur);
a3621e74 3362 }
3dab1dad
YO
3363 accept_state = TRIE_NODENUM( state );
3364 TRIE_HANDLE_WORD(accept_state);
a3621e74
YO
3365
3366 } /* end second pass */
3367
3dab1dad 3368 /* and now dump it out before we compress it */
2b8b4781
NC
3369 DEBUG_TRIE_COMPILE_MORE_r(dump_trie_interim_table(trie, widecharmap,
3370 revcharmap,
3371 next_alloc, depth+1));
a3621e74 3372
a3621e74
YO
3373 {
3374 /*
3375 * Inplace compress the table.*
3376
3377 For sparse data sets the table constructed by the trie algorithm will
3378 be mostly 0/FAIL transitions or to put it another way mostly empty.
3379 (Note that leaf nodes will not contain any transitions.)
3380
3381 This algorithm compresses the tables by eliminating most such
3382 transitions, at the cost of a modest bit of extra work during lookup:
3383
3384 - Each states[] entry contains a .base field which indicates the
3385 index in the state[] array wheres its transition data is stored.
3386
3b753521 3387 - If .base is 0 there are no valid transitions from that node.
a3621e74
YO
3388
3389 - If .base is nonzero then charid is added to it to find an entry in
3390 the trans array.
3391
3392 -If trans[states[state].base+charid].check!=state then the
3393 transition is taken to be a 0/Fail transition. Thus if there are fail
3394 transitions at the front of the node then the .base offset will point
3395 somewhere inside the previous nodes data (or maybe even into a node
3396 even earlier), but the .check field determines if the transition is
3397 valid.
3398
786e8c11 3399 XXX - wrong maybe?
a3621e74 3400 The following process inplace converts the table to the compressed
3b753521 3401 table: We first do not compress the root node 1,and mark all its
a3621e74 3402 .check pointers as 1 and set its .base pointer as 1 as well. This
3b753521
FN
3403 allows us to do a DFA construction from the compressed table later,
3404 and ensures that any .base pointers we calculate later are greater
3405 than 0.
a3621e74
YO