This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5
David Mitchell [Sun, 27 May 2012 09:54:23 +0000 (10:54 +0100)]
fix refcount of rex attached to PL_reg_curpm
There were two mistakes; firstly, refcounts weren't consistently managed,
leading to premature frees when recursively calling a rex via (??{}).
It should be that PL_reg_curpm holds a single ref count for the rex it
points to, and that nothing else should hold a count.
Secondly, when making PL_reg_curpm point to a new rex, increment the new
*before* decrementing the old, in case they are both the same rex.
Both of these are now enforced by doing it within the SET_reg_curpm
(formerly SETREX) macro.
Tests will come in the next commit. I can't add them yet because they
also fail for other reasons (paren related).
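The ordering rule above (take the new count before dropping the old) can be sketched in isolation. This is only an illustration: toy_rex, toy_rex_inc, toy_rex_dec and toy_set_curpm are hypothetical stand-ins, not perl's REGEXP type or the real SET_reg_curpm macro.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical refcounted object standing in for a regex. */
typedef struct {
    int refcnt;
    int freed;      /* set once the count drops to zero */
} toy_rex;

static void toy_rex_inc(toy_rex *rx)
{
    if (rx)
        rx->refcnt++;
}

static void toy_rex_dec(toy_rex *rx)
{
    if (!rx)
        return;
    assert(rx->refcnt > 0);
    if (--rx->refcnt == 0)
        rx->freed = 1;      /* a real implementation would free here */
}

/* Make *slot refer to new_rx; the slot owns exactly one count. */
static void toy_set_curpm(toy_rex **slot, toy_rex *new_rx)
{
    toy_rex *old = *slot;

    toy_rex_inc(new_rx);    /* increment the new rex first ...           */
    *slot = new_rx;
    toy_rex_dec(old);       /* ... so old == new_rx never reaches zero   */
}
```

If the decrement came first and both pointers named the same rex, the count would briefly hit zero and the object would be released while still in use.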
David Mitchell [Sat, 26 May 2012 15:51:15 +0000 (16:51 +0100)]
perlreapi: fix documentation on last(close)?paren
lastparen was described as being last open paren; it's actually highest
close paren. Also make it clear these correspond to $+ and $^N
David Mitchell [Sat, 26 May 2012 15:32:46 +0000 (16:32 +0100)]
Remove redundant comment.
PL_reglastparen and PL_reglastcloseparen have been removed
David Mitchell [Sat, 26 May 2012 15:25:05 +0000 (16:25 +0100)]
remove some redundant code from CURLY rex ops
CURLY_B_*_fail all currently do:
    if (ST.paren && ST.count)
        rex->offs[ST.paren].end = -1;
but this is unnecessary. If B has just failed in the pattern /...(A)*B/,
then we will either adjust the amount of matched A, update rex->offs
(overwriting that -1) then call B again; or fail completely, do sayNO, and
backtrack to an op somewhere in the '...' before A. In this latter case,
the "somewhere else" op is the one responsible for unwinding the matched
parentheses, not us.
David Mitchell [Fri, 25 May 2012 14:03:29 +0000 (15:03 +0100)]
$+ and $^N not always correct on backtracking
Certain ops (TRIE, BRANCH, CURLYM, CURLY and related) didn't always
correctly restore or update lastcloseparen ($^N) - and sometimes lastparen
($+) - when backtracking, or doing a zero-length match (i.e. A* matching
zero times).
Fix this by saving lastcloseparen and lastparen in the relevant
regmatch_state union sub structs, then restoring as necessary.
David Mitchell [Thu, 24 May 2012 11:23:44 +0000 (12:23 +0100)]
reduce size of struct regmatch_state
Currently the trie struct is the largest sub-struct in the union;
reduce its size by 2 x 8 bytes (on a 64-bit system) by
1) aligning a bool better
2) eliminating ST.B, which can be trivially derived as
ST.me + NEXT_OFF(ST.me)
(I'm going to partially spoil this by adding a new field in the next commit)
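A rough sketch of point 2, dropping a stored pointer that can be recomputed on demand. The types below are hypothetical and much simpler than the real regmatch_state; next_off here counts whole nodes, which is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node: next_off is the distance, in nodes, to the next op. */
typedef struct {
    unsigned short next_off;
} toy_node;

/* Before: the follow-on node "B" is stored alongside "me". */
typedef struct {
    const toy_node *me;
    bool            flag;
    const toy_node *B;
} toy_state_wide;

/* After: B is not stored; it is derived from me when needed. */
typedef struct {
    const toy_node *me;
    bool            flag;
} toy_state_narrow;

static const toy_node *toy_derive_B(const toy_state_narrow *st)
{
    assert(st && st->me);
    return st->me + st->me->next_off;
}
```

The saving depends on the platform's pointer size and padding rules, so the sketch makes no claim about exact byte counts.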
David Mitchell [Wed, 23 May 2012 11:19:55 +0000 (12:19 +0100)]
regcppush(): don't bother saving each paren number
regcppush() saves a contiguous range of slots in rex->offs[].
Currently it saves the index (paren number) along with each slot's data.
Since we know the top index (the saved PL_regsize) and the number of
elements, on restoring we can deduce each paren number without saving it.
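The idea can be sketched with a toy save frame that stores only slot contents plus the top index and the count. All names here (toy_pair, toy_frame, toy_cppush, toy_cppop) are hypothetical; the real code uses perl's save stack rather than a fixed array.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_SAVED 8

typedef struct {
    long start, end;
} toy_pair;

typedef struct {
    toy_pair data[TOY_MAX_SAVED];   /* slot contents only, no indices */
    size_t   count;                 /* number of slots saved          */
    size_t   top;                   /* highest paren number saved     */
} toy_frame;

/* Save offs[top - count + 1 .. top], walking down from the top. */
static void toy_cppush(toy_frame *f, const toy_pair *offs,
                       size_t top, size_t count)
{
    size_t i;

    assert(count <= TOY_MAX_SAVED && count <= top + 1);
    for (i = 0; i < count; i++)
        f->data[i] = offs[top - i];
    f->count = count;
    f->top   = top;
}

/* Restore; each paren number is deduced as top - i. */
static void toy_cppop(const toy_frame *f, toy_pair *offs)
{
    size_t i;

    for (i = 0; i < f->count; i++)
        offs[f->top - i] = f->data[i];
}
```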
David Mitchell [Wed, 23 May 2012 10:39:28 +0000 (11:39 +0100)]
set PL_reg_starttry correctly
PL_reg_starttry is only used for debugging, and S_regtry only set it
within DEBUG_EXECUTE_r; however, its value is also used within
DEBUG_STACK_r. So always set it when debugging, not just when
execute-debugging is enabled.
David Mitchell [Tue, 22 May 2012 09:40:55 +0000 (10:40 +0100)]
pp_match(): clarify intuit parens behaviour
There was some dodgy code, (flagged by dmq), that as well
as setting RX_LASTPAREN(rx), RX_LASTCLOSEPAREN(rx) to zero,
also set RX_NPARENS(rx) to zero.
The actual logic is that if we reach that point, the pattern shouldn't
have any capturing parentheses, so instead of assigning zero to it, assert
that RX_NPARENS(rx) is zero.
David Mitchell [Mon, 21 May 2012 19:44:50 +0000 (20:44 +0100)]
S_regcppush/pop : don't save PL_reginput
currently, S_regcppush() pushes PL_reginput, then S_regcppop() pops its
value and returns it. However, all calls to S_regcppop() are currently in
void context, so nothing actually uses this value. So don't save it in the
first place.
David Mitchell [Fri, 18 May 2012 11:40:39 +0000 (12:40 +0100)]
improve -Mre=Debug,BUFFERS debugging
as well as showing save/restore of capture buffer contents,
also show buffer swaps and setting of individual elements
(at least for the common OPEN/CLOSE ops; I've skipped all the
harder CURLY stuff for now).
David Mitchell [Wed, 16 May 2012 09:06:30 +0000 (10:06 +0100)]
make regexp_paren_pair.start_tmp an offset
Currently the start_tmp field is a pointer into the string, whereas the
start and end fields are offsets within that string. Make start_tmp
an offset too for consistency.
David Mitchell [Tue, 15 May 2012 20:01:39 +0000 (21:01 +0100)]
eliminate PL_reg_start_tmp, PL_reg_start_tmpl
PL_reg_start_tmp is a global array of temporary parentheses start
positions. An element is set when a '(' is first encountered,
while when a ')' is seen, the per-regex offs array is updated with the
start and end position: the end derived from the position where the ')'
was encountered, and the start position derived from PL_reg_start_tmp[n].
This allows us to differentiate between pending and fully-processed
captures.
Change it so that the tmp start value becomes a third field in the offs
array (.start_tmp), along with the existing .start and .end fields. This
makes the value now per regex rather than global. Although it uses a bit
more memory (the start_tmp values aren't needed after the match has
completed), it simplifies the code, and will make it easier to make a
(??{}) switch to the new regex without having to dump everything on the
save stack.
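A minimal sketch of the pending/committed distinction once start_tmp lives next to start and end. The struct and helpers are hypothetical simplifications (plain long offsets, -1 meaning unset), not the real regexp_paren_pair or the OPEN/CLOSE op handlers.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    long start;       /* committed start offset, -1 if unset     */
    long end;         /* committed end offset, -1 if unset       */
    long start_tmp;   /* pending start, recorded when '(' is hit */
} toy_paren_pair;

/* '(' seen at pos: remember it, but do not touch the committed fields. */
static void toy_open(toy_paren_pair *offs, size_t n, long pos)
{
    assert(offs);
    offs[n].start_tmp = pos;
}

/* ')' seen at pos: the capture is now complete, so commit both ends. */
static void toy_close(toy_paren_pair *offs, size_t n, long pos)
{
    assert(offs);
    offs[n].start = offs[n].start_tmp;
    offs[n].end   = pos;
}
```

Because all three values sit in the per-regex array, switching to another regex only means switching which array is consulted.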
David Mitchell [Tue, 15 May 2012 11:55:22 +0000 (12:55 +0100)]
eliminate PL_reglast(close)?paren, PL_regoffs
eliminate the three vars
PL_reglastcloseparen
PL_reglastparen
PL_regoffs
(which are actually aliases to PL_reg_state struct elements).
These three vars always point to the corresponding fields within the
currently executing regex; so just access those fields directly instead.
This makes switching between regexes with (??{}) simpler: just update
rex, and everything automatically references the new fields.
David Mitchell [Tue, 5 Jun 2012 21:50:03 +0000 (22:50 +0100)]
make Perl_... and my_re_op_compile sigs match
David Mitchell [Tue, 5 Jun 2012 21:29:23 +0000 (22:29 +0100)]
make perl build again on non-DEBUGGING builds
The PL_block_type debugging-only array is now used indirectly in
ext/re/re-exec.c, which enables debugging even on non-debugging builds
David Mitchell [Fri, 4 May 2012 15:34:01 +0000 (16:34 +0100)]
make calling of /(?{}) code blocks correct
Formerly, it just updated PL_comppad, set PL_op to the first op of the
code block, and did CALLRUNOPS().
This had a lot of problems, e.g. depth of recursion, and not having
anything on the context stack for die/caller/next/goto etc to see, usually
leading to segfaults.
Make it so that it uses the MULTICALL API instead. This makes it push a
new stack and a CxSUB context stack frame; it also makes us share code
rather than rolling our own.
MULTICALL had to be extended in two ways to make this work; but these have
not yet been made part of the public API. First, it had to allow changing
of the current CV while leaving the current CxSUB frame in place, and
secondly it had to allow pushing a CV with a zero increment of CvDEPTH.
This latter is to handle direct literal blocks:
/(?{...})/
which are compiled into the same CV as the surrounding scope; therefore we
need to push the same sub twice at the same depth (usually 1), i.e.
$ ./perl -Dstv -e'sub f { /(?{$x})/ } f'
...
(29912:-e:1) gvsv(main::x)
STACK 0: MAIN
CX 0: BLOCK =>
CX 1: SUB => <=== the same sub ...
retop=leave
STACK 1: SORT
CX 0: SUB => UNDEF <==== ... as this
retop=(null)
(note that stack 1 is misidentified as SORT; this is a bug in MULTICALL
to be fixed later).
One has to be very careful with the save stack; /(?{})/ is designed
not to introduce a new scope, so that the effects of 'local' etc
accumulate across multiple block invocations (but get popped on
backtracking). This is why we couldn't just do a POP_MULTICALL/PUSH_MULTICALL
pair to change the current CV; the former would pop the save stack too.
Note that in the current implementation, after calling out to the first
code block, we leave the CxSUB and PL_comppad value in place, on the
assumption that it may be soon re-used, and only pop the CxSUB at the end
of S_regmatch(). However, when popping the savestack on backtracking, this
will restore PL_comppad to its original value; so when calling a new code
block with the same CV, we can't rely on PL_comppad still being correct.
Also, this means that outside of a code block call, the context stack and
PL_comppad are wrong; I can't think of anything within the regex code
that could be using these; but if it turns out not to be the case,
then we'd have to change it so that after each code block call, we pop the
CxSUB off the stack and restore PL_comppad, but without popping the save
stack.
David Mitchell [Sat, 21 Apr 2012 19:25:33 +0000 (20:25 +0100)]
make is_bare_re bool, not int, in re_op_compile
This flag pointer only stores truth, so make it a pointer to a bool rather
than to an int.
David Mitchell [Sun, 1 Apr 2012 09:21:22 +0000 (10:21 +0100)]
make OP_REGCRESET only for taint handling
The OP_REGCRESET op, which is sometimes prepended to the chain of ops
leading to OP_REGCOMP, currently serves two purposes; first to reset the
taint flag, and second to initialise PL_reginterp_cnt. The second purpose
is no longer needed, and the first has a bug, in that the op isn't
prepended when "use re 'eval'" is in scope.
Fix this by prepending the op solely when PL_tainting is in effect.
This also makes run-time regexes slightly more efficient in the
non-tainting case.
David Mitchell [Sun, 1 Apr 2012 13:29:33 +0000 (14:29 +0100)]
S_doeval(): saveop can never be null now
David Mitchell [Sun, 1 Apr 2012 13:26:37 +0000 (14:26 +0100)]
reindent S_doeval() following a code purge.
Only whitespace changes.
David Mitchell [Sun, 1 Apr 2012 13:21:18 +0000 (14:21 +0100)]
eliminate sv_compile_2op, sv_compile_2op_is_broken
These two functions, which have been a pimple on the face of perl for
far too long, are no longer needed, now that regex code blocks are
compiled in a sensible manner.
This also allows S_doeval() to be simplified, now that it is no longer
called from sv_compile_2op_is_broken().
David Mitchell [Sun, 1 Apr 2012 12:59:58 +0000 (13:59 +0100)]
eliminate OP_4tree type
This was an alias to OP, and formerly used by the old re_eval mechanism
David Mitchell [Sun, 1 Apr 2012 12:32:24 +0000 (13:32 +0100)]
rename and simplify PL_reg_eval_set
These days PL_reg_eval_set is actually aliased to
PL_reg_state.re_state_reg_eval_set.
Rename this field to re_state_eval_setup_done to make it clearer what it
represents; remove the PL_ backcompat macro; make it boolean;
and remove the two redundant macros RS_init and RS_set.
David Mitchell [Sun, 1 Apr 2012 12:22:14 +0000 (13:22 +0100)]
eliminate RExC_seen_evals and RExC_rx->seen_evals
these were used as part of the old "use re 'eval'" security
mechanism used by the now-eliminated PL_reginterp_cnt
David Mitchell [Sun, 1 Apr 2012 12:14:09 +0000 (13:14 +0100)]
eliminate PL_reginterp_cnt
This used to be the mechanism to determine whether "use re 'eval'" needed
to be in scope; but now that we make a clear distinction between literal
and runtime code blocks, it's no longer needed.
David Mitchell [Sun, 1 Apr 2012 12:04:05 +0000 (13:04 +0100)]
eliminate REG_SEEN_EVAL
This flag was set during pattern compilation if a (?{}) was encountered;
but is redundant now that we have pRExC_state->num_code_blocks.
David Mitchell [Sun, 1 Apr 2012 10:37:33 +0000 (11:37 +0100)]
add some more tests for PL_cv_has_eval
In particular, the use re 'eval' case wasn't tested for
David Mitchell [Sat, 31 Mar 2012 14:47:04 +0000 (15:47 +0100)]
pat_re_eval.t; test "use re 'eval'"
In this test file, firstly reduce all "use re 'eval'"s into the smallest
scope possible, and secondly, for each pattern which still requires this
in scope, also test that pattern without it in scope, but under eval, and
check that it dies.
Note that during the course of this branch, much that formerly needed
"use re 'eval'" no longer does, which is the main reason we can dispense
with it so much in this commit.
David Mitchell [Sat, 31 Mar 2012 12:43:12 +0000 (13:43 +0100)]
bump re.pm version number
David Mitchell [Fri, 30 Mar 2012 15:30:26 +0000 (16:30 +0100)]
ensure regex evals report the right location
make sure that PL_curcop is set correctly on entry to a regex code block,
since (unlike a normal eval) there isn't always an initial OP_NEXTSTATE to
cause it to get set. Otherwise, warning messages etc in the first
statement of the code block will appear to come from the wrong place.
David Mitchell [Sun, 18 Mar 2012 15:53:40 +0000 (15:53 +0000)]
Fix up runtime regex codeblocks.
The previous commits in this branch have brought literal code blocks
into the New World Order; now do the same for runtime blocks, i.e. those
needing "use re 'eval'".
The main user-visible changes from this commit are that:
* the code is now fully parsed, rather than needing balanced {}'s; i.e.
this now works:
my $code = q[ (?{ $a = '{' }) ];
use re 'eval';
/$code/
* warnings and errors are now reported as coming from "(eval NNN)" rather
than "(re_eval NNN)" (although see the next commit for some fixups to
that). Indeed, the string "re_eval" has been expunged from the source
and documentation.
The big internal difference is that the sv_compile_2op() and
sv_compile_2op_is_broken() functions are no longer used, and will be
removed shortly.
It works by the regex compiler detecting the presence of run-time code
blocks, and feeding the whole pattern string back into the parser (where
the run-time blocks are now seen as compile-time), then extracting out
any compiled code blocks and adding them to the mix.
For example, in the following:
$c = '(?{"runtime"})d';
use re 'eval';
/a(?{"literal"})\b'$c/
At the point the regex compiler is called, the perl parser will already
have compiled the literal code block and presented it to the regex engine.
The engine examines the pattern string, sees two '(?{', but only one
accounted for by the parser, and so constructs a short string to be
evalled: based on the pattern, but with literal code-blocks blanked out,
and \ and ' escaped. In the above example, the pattern string is
a(?{"literal"})\b'(?{"runtime"})d
and we call eval_sv() with an SV containing the text
qr'a \\b\'(?{"runtime"})d'
The returned qr will contain the new code-block (and associated CV and
pad) which can be extracted and added to the list of compiled code blocks
of the original pattern.
Note that with this scheme, the requirement for "use re 'eval'" is easily
determined, and no longer requires all the pp_regcreset / PL_reginterp_cnt
machinery, which will be removed shortly.
Two subtleties of this scheme are that normally, \\ isn't collapsed into \
for literal regexes (unlike literal strings), and hints aren't inherited
when using eval_sv(). We get round both of these by adding and setting a
new flag, PL_reg_state.re_reparsing, which indicates that we are refeeding
a pattern into the perl parser.
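The string construction described above (blank out the literal code blocks, escape \ and ', wrap in qr'...') can be sketched as follows. This is not the code in regcomp.c: toy_block and toy_build_eval_text are hypothetical, the literal blocks are assumed to be sorted and non-overlapping with inclusive byte offsets, and replacing each blanked character by exactly one space is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record of one already-compiled literal code block:
 * inclusive byte offsets of its first and last characters. */
typedef struct {
    size_t start, end;
} toy_block;

/* Write qr'...' for pat into out. Returns the length written,
 * or 0 if cap is too small for the worst case. */
static size_t toy_build_eval_text(const char *pat,
                                  const toy_block *blocks, size_t nblocks,
                                  char *out, size_t cap)
{
    size_t len, i, b = 0, o = 0;

    assert(pat && out);
    len = strlen(pat);
    if (cap < 2 * len + 5)      /* every char escaped + qr'' + NUL */
        return 0;

    out[o++] = 'q';
    out[o++] = 'r';
    out[o++] = '\'';
    for (i = 0; i < len; i++) {
        while (b < nblocks && i > blocks[b].end)
            b++;                            /* past this literal block */
        if (b < nblocks && i >= blocks[b].start) {
            out[o++] = ' ';                 /* blank out literal code  */
            continue;
        }
        if (pat[i] == '\\' || pat[i] == '\'')
            out[o++] = '\\';                /* escape \ and '          */
        out[o++] = pat[i];
    }
    out[o++] = '\'';
    out[o] = '\0';
    return o;
}
```

Keeping one output character per blanked input character keeps the remaining text aligned with the original pattern, apart from the added escapes.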
David Mitchell [Wed, 28 Mar 2012 08:44:01 +0000 (09:44 +0100)]
pmruntime: make more use of Perl_re_op_compile
In the compile-time, non-code branch of pmruntime(), rather than
assembling the pattern string ourselves from the list of constant strings,
just pass the whole thing to Perl_re_op_compile() (or whatever eng->op_comp
is in scope) to assemble.
This has two benefits. First it avoids duplicating the concatenation code
(which already has to be present in Perl_re_op_compile), and it provides
more information to that function; in particular so that we can identify
a late-bound compile-time pattern, i.e. $foo =~ 'string-not-regex'.
David Mitchell [Thu, 29 Mar 2012 10:20:32 +0000 (11:20 +0100)]
free PL_regex_padav later
At the point where formerly it was freed, instead just free the individual
regexes and put &PL_sv_undef placeholders in the vacated PL_regex_padav
slots. Then free the actual array a lot later.
This is because at the point where PL_regex_padav used to be freed,
regexes can still hold pointers to CVs that have PMOPs that index into
PL_regex_padav; and while the array is being freed, its state is
inconsistent (in particular, sv_clear temporarily stores old indexes in
the top slot.)
This is a band-aid really; the proper solution is to get rid of
PL_regex_padav altogether and store the regexps in the pad.
David Mitchell [Mon, 12 Mar 2012 20:50:02 +0000 (20:50 +0000)]
improve skipping of regex [..] char class in toker
Recently S_scan_const() was enhanced to know when it was within a [...]
of a regex, so that it could ignore anything that looked like a code block
within it; i.e.
[(?{...]
wasn't misinterpreted as the start of a (?{...})
However, the code was too simplistic, and didn't handle \[ and \] escapes
well enough; e.g. are these char classes or not?:
\\\[abc\\\]
\\\\[abc\\\\]
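One common way to answer the question posed by those two examples is to count the backslashes immediately before the bracket: an odd count means the bracket itself is escaped. The helper below is a hypothetical sketch of that rule only, not the logic actually used in S_scan_const().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if s[i] is preceded by an odd number of consecutive backslashes,
 * i.e. the character at i is itself escaped. */
static bool toy_is_escaped(const char *s, size_t i)
{
    size_t n = 0;

    assert(s);
    while (i > n && s[i - n - 1] == '\\')
        n++;
    return (n % 2) == 1;
}
```

Under this rule the '[' after three backslashes is a literal bracket, while the '[' after four backslashes opens a character class.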
David Mitchell [Thu, 8 Mar 2012 16:00:24 +0000 (16:00 +0000)]
add PMf_USE_RE_EVAL flag
This isn't actually used in the pm_flags field of a PMOP, but is
used in the pm_flags arg of Perl_re_op_compile() to indicate
that this pattern should be compiled with "use re 'eval'" in scope
David Mitchell [Tue, 6 Mar 2012 21:31:06 +0000 (21:31 +0000)]
skip re_eval leak test under -Dmad
Under MAD, eval CVs are explicitly not freed, so the leak test
can't succeed.
David Mitchell [Tue, 28 Feb 2012 15:05:15 +0000 (15:05 +0000)]
mark a var as volatile to avoid longjmp warning
David Mitchell [Tue, 28 Feb 2012 12:44:43 +0000 (12:44 +0000)]
re-enable some threaded regex TODO tests
these were fixed via commit
6eea2b427407da46a602a3ca17cbe055f57c24c0
David Mitchell [Fri, 17 Feb 2012 13:22:17 +0000 (13:22 +0000)]
make _REGEXP_COMMON work under win32
There's an extra trailing semicolon which the win32 compiler doesn't like.
David Mitchell [Thu, 16 Feb 2012 15:53:50 +0000 (15:53 +0000)]
re/pat_re_eval.t: tidy some 'use re eval' tests
reduce the scope of 'use re eval' to the barest minimum;
I also discovered while doing this that one block had the use; ... no;
swapped, and this didn't die, so I've added a TODO test.
Father Chrysostomos [Tue, 14 Feb 2012 19:50:10 +0000 (19:50 +0000)]
[perl #108780] Make /foo$qr/ work under ‘no overloading’
This commit redoes
37c07a4b2, which had to be reverted to allow the
re-eval work to be rebased onto blead.
It changes the code in re_op_compile (formerly in pp_regcomp) to use
the underlying REGEXP instead of the reference to it, when concatenating
pieces to make a larger regular expression. This makes /foo$qr/
work even under ‘no overloading’. It stopped working with commit
a75c6ed6b.
Karl Williamson [Fri, 23 Dec 2011 00:58:20 +0000 (17:58 -0700)]
regcomp.c: Silence valgrind warning
This happens only in doing debug output. Initialize these two debugging
variables
[ this commit was previously reverted to make rebasing easier; added back
now ]
David Mitchell [Tue, 14 Feb 2012 19:42:39 +0000 (19:42 +0000)]
undo temporarily reverted lib/overload.t tests
some tests in lib/overload.t were temporarily removed earlier to ease
rebasing; add them back in now.
David Mitchell [Mon, 19 Dec 2011 12:27:48 +0000 (12:27 +0000)]
add tests for regex recompilation
The run-time regexp compilation (invoked via pp_regcomp()) has a mechanism
to skip the recompilation if the pattern text hasn't changed since the
last recompile. Astonishingly this mechanism isn't actually tested, so
here's a test file.
All the tests now pass, but this is due to the various recent fixes in
this branch. In particular, it never used to consider the UTF8ness of the
pattern string, or whether the pattern contained code blocks.
It works by checking the output of 'use re debug' (and -Dr if available)
to detect how many times the pattern was compiled.
This file then is also an indirect test of whether the correct debugging
output is generated, i.e. whether the regcomp.c or ext/re/re_comp.c
versions of functions are getting called.
David Mitchell [Mon, 19 Dec 2011 11:33:07 +0000 (11:33 +0000)]
force recompiling of regex where closures matter
There are some cases where on the second run of a run-time regex, the
text of the pattern hasn't changed, but we should still recompile to
ensure that closure behaviour is correct.
These cases are:
1) run-time code:
    my $code = '(??{$x})';
    for my $x (1..3) {
        $x =~ /$code/; # recompile to see fresh value of $x
    }
2) embedded regexes with code:
    for my $x (1..3) {
        my $r = qr/(??{$x})/;
        "A$x" =~ /A$r/; # recompile to see new $r
    }
With this fix, all the TODO tests in re/pat_re_eval.t now pass. (Note that
a couple of those TODO tests were actually broken and are fixed in this
commit)
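Taken together with the UTF8 fix elsewhere in this branch, the "can we skip recompilation?" decision can be sketched like this. The toy_pattern struct and its has_closure_code flag are hypothetical; the real check works on SVs and regex internals rather than C strings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *text;              /* pattern source                      */
    bool        utf8;              /* is the source UTF-8 encoded?        */
    bool        has_closure_code;  /* hypothetical: run-time code, or an
                                      embedded qr// carrying code blocks  */
} toy_pattern;

/* Return true only when reusing the old compiled pattern is safe. */
static bool toy_can_skip_recompile(const toy_pattern *old_pat,
                                   const toy_pattern *new_pat)
{
    assert(new_pat && new_pat->text);

    if (!old_pat || !old_pat->text)
        return false;               /* nothing compiled yet             */
    if (new_pat->has_closure_code)
        return false;               /* must see fresh closure values    */
    if (old_pat->utf8 != new_pat->utf8)
        return false;               /* same bytes, different meaning    */
    return strcmp(old_pat->text, new_pat->text) == 0;
}
```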
David Mitchell [Tue, 13 Dec 2011 12:17:00 +0000 (12:17 +0000)]
fix =/== typo in ext/re/t/regop.t
David Mitchell [Tue, 13 Dec 2011 12:00:12 +0000 (12:00 +0000)]
add op_comp field to regexp_engine API
Perl's internal function for compiling regexes that knows about code
blocks, Perl_re_op_compile, isn't part of the engine API. However, the
way that regcomp.c is dual-lifed as ext/re/re_comp.c with debugging
compiled in, means that Perl_re_op_compile is also compiled as
my_re_op_compile. These days the mechanism to choose whether to call
the main functions or the debugging my_* functions when 'use re debug' is
in scope, is the re engine API jump table. Ergo, to ensure that
my_re_op_compile gets called appropriately, this method needs adding to
the jump table.
So, I've added it, but documented as 'for perl internal use only, set to
null in your engine'.
I've also updated current_re_engine() to always return a pointer to a jump
table, even if we're using the internal engine (formerly it returned
null). This then allows us to use the simple condition (eng->op_comp)
to determine whether the current engine supports code blocks.
David Mitchell [Sat, 10 Dec 2011 11:09:51 +0000 (11:09 +0000)]
re_op_compile(): merge the two 'eq old_re' checks
The code had two similar checks for whether the new pattern was the same
as the old (and so skip recompilation); the second was for after a longjmp
and upgrade to UTF8. By moving both checks to just after the
"if (jump_ret == 0)" block we should achieve code simplification without
any change in functionality
David Mitchell [Thu, 8 Dec 2011 16:42:19 +0000 (16:42 +0000)]
pat_re_eval.t: reduce scope of 'use re eval'.
My closure tests were initially covered by a blanket "use re 'eval'".
Now that no literal code blocks require this (especially where qr//s are
interpolated into another regex), only wrap the bits of test code that
really need it.
David Mitchell [Thu, 8 Dec 2011 16:27:33 +0000 (16:27 +0000)]
pat_re_eval.t: remove 'no warnings'
Now that the closure behaviour of code blocks within regexes is mostly
fixed, remove the
no warnings qw(uninitialized closure);
that protected many of the closure tests.
This also showed up that one of my recent tests didn't realise
the whole test file is wrapped in a sub, so my test code destructor
looked like
    sub run_tests {
        ...
        my $d;
        sub DESTRUCTOR { $d++ }
        ...
    }
so fix that test to remove the closure warning.
David Mitchell [Thu, 8 Dec 2011 16:08:07 +0000 (16:08 +0000)]
add more tests for embedded qr// and code blocks
David Mitchell [Thu, 8 Dec 2011 15:43:41 +0000 (15:43 +0000)]
re_op_compile(): split flags into two arguments
There are two sets of regex-related flags; the RXf_* which
end up in the extflags field of a REGEXP, and the PMf_*, which
are in the op_pmflags field of a PMOP.
Since I added the PMf_HAS_CV and PMf_IS_QR flags, I've been conflating
these two meanings in the single flags arg to re_op_compile(), which meant
that some bits were being misinterpreted. The only test that was failing
was peek.t, but it may have quietly broken other things that simply
weren't tested for (for example PMf_HAS_CV and RXf_SPLIT share the same
value, so something with split qr/(?{...})/ might get messed up).
So, split this arg into two; one for the RXf* flags, and one for the PMf_*
flags.
The public regexp API continues to have only a single flags arg,
which should only be accepting RXf_* flags.
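The hazard of sharing one argument between two flag namespaces, and the effect of splitting it, can be shown with made-up values. The TOY_ constants below are hypothetical and deliberately collide; they are not the real RXf_SPLIT or PMf_HAS_CV values.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values, chosen to collide on purpose. */
#define TOY_RXf_SPLIT   0x0001u
#define TOY_PMf_HAS_CV  0x0001u

/* Conflated interface: one word carries both kinds of flag, so a
 * PMf bit can be misread as an RXf bit. */
static bool toy_is_split_conflated(unsigned flags)
{
    return (flags & TOY_RXf_SPLIT) != 0;
}

/* Split interface: each namespace gets its own argument. */
typedef struct {
    unsigned rx_flags;   /* RXf_* style bits only */
    unsigned pm_flags;   /* PMf_* style bits only */
} toy_compile_args;

static bool toy_is_split(const toy_compile_args *a)
{
    assert(a);
    return (a->rx_flags & TOY_RXf_SPLIT) != 0;
}

static bool toy_has_cv(const toy_compile_args *a)
{
    assert(a);
    return (a->pm_flags & TOY_PMf_HAS_CV) != 0;
}
```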
David Mitchell [Thu, 8 Dec 2011 15:11:08 +0000 (15:11 +0000)]
re_op_compile(): rename pm_flags to rx_flags
The orig_pm_flags argument and its modified copy, pm_flags, actually
contain bits related to REGEXPs (i.e. RXf_*) rather than PMOPs
(i.e. PMf_*); although there is some overlap between the two sets of
bit flags.
Rename the variables to make this less unclear.
Ditto for re_compile().
David Mitchell [Wed, 7 Dec 2011 11:29:27 +0000 (11:29 +0000)]
add PMf_IS_QR flag
This indicates that a particular PMOP is in fact OP_QR. We should of
course be able to tell this from op_type, but the regex-compiling API
only gets passed op_flags.
This then allows us to fix a bug where we were deciding during compilation
whether to hang on to the code_blocks based on whether the PMOP was
PMf_HAS_CV rather than PMf_IS_QR; the latter implies the former, but not
the other way round.
David Mitchell [Wed, 7 Dec 2011 11:26:14 +0000 (11:26 +0000)]
fix scanning for code blocks
The toke code that scanned a pattern string to see if it looked like it
had any code blocks in it, used the wrong pointers. This then failed if
the pattern spread over multiple lines and buffers got reassigned.
David Mitchell [Tue, 6 Dec 2011 10:50:51 +0000 (10:50 +0000)]
tidy up the description of re_op_compile()
(also, it was incorrectly described as op_re_compile in a couple of places)
David Mitchell [Wed, 30 Nov 2011 13:40:15 +0000 (13:40 +0000)]
preserve code blocks in interpolated qr//s
This now works:
{ my $x = 1; $r = qr/(??{$x})/ }
my $x = 2;
print "ok\n" if "1" =~ /^$r$/;
When a qr// is interpolated into another pattern, the pattern is still
recompiled using the stringified qr, but now the pre-compiled code blocks
from the qr are reused rather than being re-compiled, so it behaves like a
closure.
Note that this makes some tests in regexp_qr_embed_thr.t fail, due to a
pre-existing threads bug, which can be summarised as:
use threads;
my $s = threads->new(sub { return sub { $::x = 1} })->join;
$s->();
print "\$::x=[$::x]\n";
which prints undef, not 1, since the *::x is cloned into the child thread,
then cloned back into the parent as part of the CV (linked from the pad)
being returned in the join. The cloning/join code isn't clever enough
to realise that the back-cloned *::x is the same as the original *::x, so
the main thread ends up with two copies.
This manifests itself in the re tests as
my $re = threads->new( sub { qr/(?{$::x = 1 })/ })->join();
where, since the returned qr// is now a closure, it suffers from the same
glob duplication in the parent.
So I've disabled 4 re_tests tests under threads for now.
David Mitchell [Fri, 25 Nov 2011 11:29:33 +0000 (11:29 +0000)]
in re_op_compile(), keep code_blocks for qr//
code_blocks is a temporary list of start/end indices and pointers to DO
blocks, that is used during the regexp compilation. Change it so that in
the qr// case, this structure is preserved (attached to regexp_internal),
so that in a forthcoming commit it will be available for use when
interpolating a qr within another pattern.
David Mitchell [Wed, 23 Nov 2011 14:19:52 +0000 (14:19 +0000)]
pp_regcomp(): fix refcnt issue with qr_anoncv
Assigning to qr_anoncv was introduced a few commits ago, but
I forgot to increase the reference count
David Mitchell [Fri, 18 Nov 2011 17:16:14 +0000 (17:16 +0000)]
pm_runtime(): tidy some local vars
inline and delete the 'ext_eng' variable, as it's only used once;
and reduce the scope of 'eng'.
David Mitchell [Fri, 18 Nov 2011 16:54:10 +0000 (16:54 +0000)]
handle /$not_utf8(?{...})$utf8/
the bit of code that concats the args into a pattern and at the same time
notes the start and end indices of the text of the code blocks, got it
wrong if the pattern got upgraded half way through concatting. So work out
in advance whether the string is likely to be utf8.
David Mitchell [Fri, 18 Nov 2011 16:30:17 +0000 (16:30 +0000)]
inline S_get_pat_and_code_indices()
Now that that static function is only used in one place, get rid of it.
David Mitchell [Fri, 18 Nov 2011 15:21:27 +0000 (15:21 +0000)]
fix dumping of PMf_CODELIST_PRIVATE flag
David Mitchell [Fri, 18 Nov 2011 14:48:49 +0000 (14:48 +0000)]
"don't recompile pattern" check: account for UTF8
When recompiling a pattern (e.g. for $x (x,y) { /$x/ }),
it tests whether the new pattern string matches the old one, and if so
skips recompiling it. However, it doesn't take account of the UTF8ness of
the old and new patterns, so can falsely skip recompiling. Now fixed.
Also, there is a feature in re_op_compile() that may abort a pattern
compilation, upgrade the pattern to UTF8, then begin the compile again.
I've added a second check for whether the pattern matches the old pattern,
against the upgraded string. I can't see a way to test this, since it's
just an optimisation. Arguably I could add a BEGIN in an embedded code
block to see if it gets compiled twice, but soon I'm going to make it so
that embedded code blocks always get recompiled anyway.
David Mitchell [Fri, 18 Nov 2011 12:37:59 +0000 (12:37 +0000)]
re_op_compile: recalc code indexes on utf8 upgrade
As part of the compilation, we calculate the start and end positions
of the text of each literal code block within the pattern string.
The 'if pattern gets unexpectedly upgraded to UTF8, longjmp and restart
the compilation' mechanism, means that these indices can become invalid,
so if this happens, recalculate them. We do this by unrolling a call
to Perl_bytes_to_utf8(), which updates the indices at the same time that
it updates the string.
Note that some of the new TODO tests are actually passing, but this is for
the wrong reason. They're supposed to test for forced recompilation of
non-literal code blocks, even if the pattern string hasn't changed (which I
haven't implemented yet), but instead they're passing because the "don't
recompile if strings match" check isn't UTF8-aware. I'll fix this
(pre-existing) bug in the next commit.
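A sketch of upgrading a byte string to UTF-8 while remapping code-block offsets in the same pass. It illustrates the idea only and is not perl's Perl_bytes_to_utf8(): toy_block and toy_upgrade_with_blocks are hypothetical, the input is assumed to be Latin-1, offsets are inclusive, and dst must have room for 2 * len bytes.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t start, end;   /* inclusive byte offsets of one code block */
} toy_block;

/* Copy src to dst as UTF-8, writing remapped offsets into out[].
 * Returns the number of bytes written to dst. */
static size_t toy_upgrade_with_blocks(const unsigned char *src, size_t len,
                                      unsigned char *dst,
                                      const toy_block *in, toy_block *out,
                                      size_t nblocks)
{
    size_t i, b, o = 0;

    assert(src && dst);
    for (i = 0; i < len; i++) {
        /* Record where input offset i lands in the output. */
        for (b = 0; b < nblocks; b++) {
            if (in[b].start == i)
                out[b].start = o;
            if (in[b].end == i)
                out[b].end = o;
        }
        if (src[i] < 0x80) {
            dst[o++] = src[i];                       /* ASCII: 1 byte */
        } else {
            dst[o++] = (unsigned char)(0xC0 | (src[i] >> 6));
            dst[o++] = (unsigned char)(0x80 | (src[i] & 0x3F));
        }
    }
    return o;
}
```

The remapped offsets go to a separate array so that an already-updated offset is never mistaken for an input offset later in the scan.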
David Mitchell [Wed, 16 Nov 2011 10:42:36 +0000 (10:42 +0000)]
in re_op_compile, change code_blocks[].end offset
previously, code_blocks[].end pointed to the char *after* the
end of (?{...}); make it instead the last char, i.e. ')'.
This will make the code for the next commit slightly easier.
David Mitchell [Mon, 14 Nov 2011 16:21:42 +0000 (16:21 +0000)]
pp_regcomp(): fix casting issue from prev commit
David Mitchell [Sat, 12 Nov 2011 20:51:27 +0000 (20:51 +0000)]
Handle literal code blocks in runtime regexes
In the following types of regex:
/$runtime(?{...})/
qr/$runtime(?{...})/
make it so that the code block is compiled at the same time that the
surrounding code is compiled, then is incorporated, rather than
re-compiled, when the regex source is assembled and compiled at runtime.
This fixes a bunch of closure-related TODO tests.
Note that this still doesn't yet handle the cases where $runtime contains:
    $runtime = qr/...(?{...})/;  # block will be stringified and recompiled
    $runtime = '(?{...})';       # block compiled the old way, with
                                 # matching nesting of {} required
It also doesn't yet handle the case where the pattern getting compiled is
upgraded to utf8 and so is restarted.
Note that this is rather complex, because in something like
$str =~ qr/$a(?{...})$b[1]/
there are four separate phases
* perl compile time; we also compile the code block at the same time,
but within a separate anon CV (with a separate pad)
* at run time, we execute the code that generates the list of SVs
(i.e. $a, $b[1] etc), but have to execute them within the context of the
anon sub, since that's what they were compiled in; we then have to
concat the arguments, while remembering which were literal code blocks;
* then qr// clones the compiled regex, and clones the anon CV at the same
time;
* finally, the pattern is executed.
Through all this we have to ensure that the code blocks and associated
anon CV and pad get preserved and incorporated into the right places
for eventual use.
The changes in this commit build upon the work in the previous few
commits, and work by:
* at (perl) compile time, in pmruntime(), the anon CV (if any) associated
with a qr//, as well as being referred to by the op_targ of the
anoncode op, is also made the targ of the regcomp op;
* at pattern assembly and compile time,
* Perl_re_op_compile() takes the list of SVs gathered by pp_regcomp(),
along with the op tree (from op_code_list) that was used to generate
those SVs (as well as containing the individual DO blocks), and
concatenates them to get a final pattern source string, while
noting the start and end positions of any literal (?{..})'s,
and which block they must correspond to.
* after compilation, pp_regcomp() then uses op_targ to locate
the anon CV and store a pointer to it in the regex.
qr// instantiation and execution work unchanged.
David Mitchell [Fri, 11 Nov 2011 14:14:39 +0000 (14:14 +0000)]
re_op_compile(): rejig code-block handling code
In the code that keeps track of the start and end positions of each
code block, i.e. (?{....}), refactor it:
Rather than an array of STRLENs, with two slots to mark start and end
positions, turn it into an array of structs, each holding the start and end
position, plus a pointer to the code block op. This will make it easier
when we shortly allow literal code from run-time patterns too.
Also rename some of the variables accordingly.
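
A minimal sketch of the array-of-structs idea, assuming hypothetical names (struct piece, struct code_block, concat_pieces) and modelling the code block op as an opaque pointer; the real fields and types in regcomp.c may differ. The inclusive end offset used here is also an assumption. The pieces are concatenated into one pattern string while one record per code block is filled in:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical piece of a pattern: literal text, optionally tied to a
 * pre-compiled block (modelled here as an opaque pointer). */
struct piece {
    const char *text;
    const void *block; /* non-NULL if text is the source of a code block */
};

/* One record per code block, replacing a flat array of offsets. */
struct code_block {
    size_t start;      /* offset of '(' in the concatenated pattern */
    size_t end;        /* offset of the closing ')' (assumed inclusive) */
    const void *block; /* the associated block */
};

/* Concatenate the pieces into out (capacity outsz, NUL-terminated) and
 * fill blocks[] (capacity maxblocks). Returns the number of blocks
 * recorded, or (size_t)-1 if either buffer is too small. */
static size_t concat_pieces(const struct piece *p, size_t n,
                            char *out, size_t outsz,
                            struct code_block *blocks, size_t maxblocks)
{
    size_t pos = 0, nblocks = 0;

    if (outsz == 0)
        return (size_t)-1;
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(p[i].text);

        assert(pos < outsz);
        if (len > outsz - pos - 1)      /* keep room for the final NUL */
            return (size_t)-1;
        if (p[i].block != NULL && len > 0) {
            if (nblocks == maxblocks)
                return (size_t)-1;
            blocks[nblocks].start = pos;
            blocks[nblocks].end = pos + len - 1;
            blocks[nblocks].block = p[i].block;
            nblocks++;
        }
        memcpy(out + pos, p[i].text, len);
        pos += len;
    }
    out[pos] = '\0';
    return nblocks;
}
```

Keeping the pointer next to the offsets means a later pass can go from a position in the pattern string straight to the block that belongs there.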
David Mitchell [Fri, 11 Nov 2011 11:31:37 +0000 (11:31 +0000)]
re_op_compile(): refactor some code
The 'scan the op tree looking for DO blocks' code will need to
be used in both the 'list of SV' and 'OP tree' cases, so hoist the code
up a level
David Mitchell [Wed, 9 Nov 2011 14:38:26 +0000 (14:38 +0000)]
remove target from REGCOMP op
after the previous commit, the target is no longer needed;
so don't allocate it.
David Mitchell [Fri, 4 Nov 2011 10:12:20 +0000 (10:12 +0000)]
Move bulk of pp_regcomp() into re_op_compile()
When called, pp_regcomp() is presented with a list of SVs on the stack.
Previously, it would perform (amongst other things):
* overloading those SVs;
* concatenating them;
* detection of bare /$qr/;
* detection of unchanged pattern;
optionally followed by a call to the built-in or an external regexp
compiler.
Since we want to avoid premature concatenation (so that we can handle
/$runtime(?{...})/), move all these activities from pp_regcomp() into
re_op_compile().
This makes re_op_compile() a bit cumbersome, with a large arg list,
but I haven't found any way of moving only a subset of the above.
Note that a side-effect of this is that qr-overloading now works for all
regex compilations, not just those reached via pp_regcomp(); in particular
this now invokes the qr method rather than the "" method if available:
/(??{ $overloaded_object })/
David Mitchell [Tue, 1 Nov 2011 16:50:16 +0000 (16:50 +0000)]
add PMf_CODELIST_PRIVATE flag
This indicates that the op_code_list field in a PMOP is "private";
that is, it points to a list of DO blocks that we don't own, and
shouldn't free, and whose pad may not match ours.
This will allow us to use the op_code_list field in the runtime case of
literal code, e.g. /$runtime(?{...})/ and qr/$runtime(?{...})/. Here, at
compile-time, we need to make the pre-compiled (?{..}) blocks available to
pp_regcomp, but the list containing those blocks is also the list that is
executed in the lead-up to executing pp_regcomp (while skipping the DO
blocks), so the code is already embedded, and doesn't need freeing.
Furthermore, in the qr// case, the code blocks are actually within a
different sub (an anon one) than the PMOP, so the pads won't match.
David Mitchell [Tue, 1 Nov 2011 13:57:50 +0000 (13:57 +0000)]
remove private flag 1 from OP_REGCOMP
When an OP_REGCOMP is hand-rolled, the op_private flags field is
set to 1. This has been so since 1993, but it is undocumented and appears
completely unused. So remove it.
Father Chrysostomos [Sat, 29 Oct 2011 20:40:06 +0000 (13:40 -0700)]
Fix =~ $str_overloaded (5.10 regression)
[ DAPM: I just cherry-picked the tests from this commit, since my own
changes have already fixed this bug. FC's two commits:
15d9c083b08647e489d279a1059b4f14a3df187b
3e1022372a8200bc4c7354e0f588c7f71584a888
were unrolled at the start of this branch since they clashed with my own
stuff; this commit is re-adding the bits of those commits that are
still needed: i.e. just the tests.
]
In 5.8.x, this code:
use overload '""'=>sub { warn "stringify"; --$| ? "gonzo" : chr 256 };
my $obj = bless\do{my $x};
warn "$obj";
print "match\n" if chr(256) =~ $obj;
prints
stringify at - line 1.
gonzo at - line 3.
stringify at - line 1.
match
which is to be expected.
In 5.10+, the stringification happens one extra time, causing a failed match:
stringify at - line 1.
gonzo at - line 3.
stringify at - line 1.
stringify at - line 1.
This logic in pp_regcomp is faulty:
if (DO_UTF8(tmpstr)) {
assert (SvUTF8(tmpstr));
} else if (SvUTF8(tmpstr)) {
... copy under ‘use bytes’...
}
else if (SvAMAGIC(tmpstr)) {
/* make a copy to avoid extra stringifies */
tmpstr = newSVpvn_flags(t, len, SVs_TEMP | SvUTF8(tmpstr));
}
The SvAMAGIC check never happens when the UTF8 flag is on.
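
The unreachable branch can be modelled with plain booleans. This is a sketch under assumptions: struct sv_model and both functions are hypothetical, DO_UTF8 is taken to mean "UTF8 flag on and not under 'use bytes'", and chain_overload_first is just one possible reordering, not necessarily the fix that was committed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the three properties the quoted chain tests. */
struct sv_model {
    bool utf8;     /* the UTF8 flag is on */
    bool in_bytes; /* running under 'use bytes' */
    bool amagic;   /* overloaded, as SvAMAGIC would report */
};

enum action { NO_COPY, COPY_FOR_BYTES, COPY_FOR_OVERLOAD };

/* Mirrors the shape of the quoted if / else-if chain. */
static enum action chain_as_quoted(struct sv_model s)
{
    bool do_utf8 = s.utf8 && !s.in_bytes; /* assumed meaning of DO_UTF8 */

    assert(!(do_utf8 && !s.utf8));
    if (do_utf8)
        return NO_COPY;
    else if (s.utf8)
        return COPY_FOR_BYTES;
    else if (s.amagic)
        return COPY_FOR_OVERLOAD;
    return NO_COPY;
}

/* One possible reordering: test for overloading before the UTF8 cases. */
static enum action chain_overload_first(struct sv_model s)
{
    if (s.amagic)
        return COPY_FOR_OVERLOAD;
    if (s.utf8 && s.in_bytes)
        return COPY_FOR_BYTES;
    return NO_COPY;
}
```

With utf8 and amagic both set, chain_as_quoted never returns COPY_FOR_OVERLOAD, which corresponds to the extra stringification shown above.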
David Mitchell [Mon, 31 Oct 2011 12:10:26 +0000 (12:10 +0000)]
add volatile decl to fix previous commit
The previous commit changed the nature of the exp variable in
Perl_re_op_compile(), so it needs to be marked with VOL to work around
setjmp.
Karl Williamson [Sat, 29 Oct 2011 17:20:40 +0000 (11:20 -0600)]
PATCH: [perl #101940]: BBC Tk
The commit that turned up this bug turns out merely to expose an
underlying problem that could be generated via other means.
regcomp.c was looking at the SvUTF8 flag on the input pattern before
doing an SvPV on it. Generally the flag is considered not reliable
unless checked immediately after a SvPV.
I haven't been able to come up with a simple test case that reproduces
the bug. I suspect that XS code is required to trigger it.
[ this is a re-application by davem of commit
11951bcbfcaf4c260b0da0421e72fc80b4654f17 ]
Karl Williamson [Sun, 30 Oct 2011 16:47:21 +0000 (16:47 +0000)]
regcomp.c: Use no_mg for 2nd fetch of pattern
The pattern could be tied, for example, and so we only want to access it
once. I couldn't come up with a test case that actually exercised this,
but I can think of future changes to regcomp that would.
[ this is a re-application by davem of commit
3e0b93e82af0f1a033bcdb918b413113f1d61cf0 ]
David Mitchell [Sun, 30 Oct 2011 12:35:13 +0000 (12:35 +0000)]
pp_regcomp: don't special-case n->1 arg folding
Currently, pp_regcomp() specially handles the case where the result of
concating all its args is a regexp. Change this so that this only occurs
if there is a single arg.
This means that if $a is an overloaded object whose concat op returns
a regexp, then /$a+/ no longer special-cases, and thus the returned regex
is now stringified and recompiled. This slight loss of efficiency in a
rare case is necessary to allow us to delay concating for as long as possible.
David Mitchell [Wed, 26 Oct 2011 12:21:10 +0000 (13:21 +0100)]
pp_regcomp: split overloading and concat tasks
Make two passes through the list of args: first to apply
magic/overloading, and secondly to concatenate them. This will
shortly allow us to pass a processed, but unconcatenated list to
re_op_compile().
Also, simplify the code by treating the 1-arg case as an arg list
of length 1. This also allows us to use the tryAMAGICregexp macro
in only one place, and thus to unroll and eliminate it.
David Mitchell [Wed, 26 Oct 2011 11:23:52 +0000 (12:23 +0100)]
change re_op_compile() to take a list of SVs
rather than passing a single SV string containing the pattern,
allow a list of SVs (plus count) to be passed. For the moment, only allow
that list to be one element long, but this will allow us to directly
pass in the list of SVs normally pre-processed into a single SV by
pp_regcomp.
David Mitchell [Tue, 25 Oct 2011 14:42:44 +0000 (15:42 +0100)]
fix for overload/stringify and pp_regcomp
(bug found by code inspection)
The code in pp_regcomp to make a mortal copy of the pattern string
in the case of overload or utf8+'use bytes', missed the case of overload
with utf8. Fix this and at the same time simplify the code.
David Mitchell [Fri, 21 Oct 2011 14:00:47 +0000 (15:00 +0100)]
unlink re_eval code blocks from op list
In the list of ops generated by something like /abc(?{...})def/,
const(abc)
null/special
...
const(...)
const(def)
link the list, but skip the DO blocks. This means that for the runtime
case, we no longer need the temporary measure of deleting the DO blocks,
and it will facilitate the next step of handling literal code at runtime,
i.e. /$runtime(?{...})/.
David Mitchell [Wed, 19 Oct 2011 10:49:40 +0000 (11:49 +0100)]
In Perl_re_op_compile, make a var volatile
This function includes a setjmp to allow for abort and retry if the
pattern getting compiled suddenly becomes UTF8. My recent changes to
it left one var generating the dreaded "warning: variable ‘pat’ might be
clobbered" warning. So declare it VOL to fix this.
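
For context, a small standalone sketch of why the qualifier matters (restart_env, compile_pass and compile_with_restart are hypothetical names, not the regcomp.c code): a local that is modified between setjmp() and longjmp() only has a well-defined value after the jump if it is declared volatile.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf restart_env;

/* Hypothetical stand-in for a compile pass that aborts the first time
 * round and asks the caller to start again. */
static void compile_pass(int attempt)
{
    assert(attempt >= 1);
    if (attempt == 1)
        longjmp(restart_env, 1);
}

/* Returns the number of attempts made. */
static int compile_with_restart(void)
{
    /* Modified between setjmp() and longjmp(), so it must be volatile
     * for its value to be well defined after the jump. */
    volatile int attempts = 0;

    if (setjmp(restart_env) != 0) {
        /* arrived here via longjmp(): fall through and retry */
    }
    attempts++;
    compile_pass(attempts);
    return attempts;
}
```

Here the first pass jumps back, the second completes, and the counter survives the jump because of the volatile qualifier.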
David Mitchell [Wed, 12 Oct 2011 13:01:50 +0000 (14:01 +0100)]
make qr/(?{})/ behave with closures
With this commit, qr// with a literal (compile-time) code block
will Do the Right Thing as regards closures and the scope of lexical vars;
in particular, the following now creates three regexes that match 0, 1
and 2:
for my $i (0..2) {
push @r, qr/^(??{$i})$/;
}
"1" =~ $r[1]; # matches
Previously, $i would be evaluated as undef in all 3 patterns.
This is achieved by wrapping the compilation of the pattern within a
new anonymous CV, which is then attached to the pattern. At run-time
pp_qr() clones the CV as well as copying the REGEXP; and when the code
block is executed, it does so using the pad of the cloned CV.
Which makes everything come out all right in the wash.
The CV is stored in a new field of the REGEXP, called qr_anoncv.
Note that run-time qr//s are still not fixed, e.g. qr/$foo(?{...})/;
nor is it yet fixed where the qr// is embedded within another pattern:
continuing with the code example from above,
my $i = 99;
"1" =~ $r[1]; # bare qr: matches: correct!
"X99" =~ /X$r[1]/; # embedded qr: matches: whoops, it's still
seeing the wrong $i
David Mitchell [Sat, 17 Sep 2011 11:55:29 +0000 (12:55 +0100)]
Ignore code blocks within /[...]/
Now that RE code blocks are being handled by the lexer rather than the RE
compiler, make sure that stuff like
/[(?{]/
isn't mis-interpreted as the start of a code block.
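
A simplified sketch of the idea, not the lexer's actual logic: find_code_block is a hypothetical helper that treats a backslash as escaping the next char and a ']' directly after '[' or "[^" as a literal member, and does not model POSIX classes or other bracket subtleties.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the offset of the first "(?{" or "(??{" that lies outside a
 * [...] class, or -1 if there is none. */
static long find_code_block(const char *pat)
{
    size_t len;
    int in_class = 0;

    assert(pat != NULL);
    len = strlen(pat);
    for (size_t i = 0; i < len; i++) {
        char c = pat[i];

        if (c == '\\') {        /* skip the escaped char */
            i++;
            continue;
        }
        if (in_class) {
            if (c == ']')
                in_class = 0;
            continue;
        }
        if (c == '[') {
            in_class = 1;
            /* a leading ']' (optionally after '^') is a literal member */
            if (pat[i + 1] == '^')
                i++;
            if (pat[i + 1] == ']')
                i++;
            continue;
        }
        if (strncmp(pat + i, "(?{", 3) == 0 ||
            strncmp(pat + i, "(??{", 4) == 0)
            return (long)i;
    }
    return -1;
}
```

Under these assumptions "[(?{]" yields -1, while "a(?{1})" yields 1.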
David Mitchell [Mon, 12 Sep 2011 13:32:11 +0000 (14:32 +0100)]
make recent re_eval changes compile under -Dmad
Note that this doesn't mean that MAD output is correct; I haven't tested
that. In particular, I have no idea when and how CURMAD() is supposed to
be used.
David Mitchell [Thu, 25 Aug 2011 10:41:49 +0000 (11:41 +0100)]
Mostly complete fix for literal /(?{..})/ blocks
Change the way that code blocks in patterns are parsed and executed,
especially as regards lexical and scoping behaviour.
(Note that this fix only applies to literal code blocks appearing within
patterns: run-time patterns, and literals within qr//, are still done the
old broken way for now).
This change means that for literal /(?{..})/ and /(??{..})/:
* the code block is now fully parsed in the same pass as the surrounding
code, which means that the compiler no longer just does a simplistic
count of balancing {} to find the limits of the code block;
i.e. stuff like /(?{ $x = "{" })/ now works (in the same way
that subscripts in double quoted strings always have: "$a{'{'}" )
* Error and warning messages will now appear to emanate from the main body
rather than an re_eval; e.g. the output from
#!/usr/bin/perl
/(?{ warn "boo" })/
has changed from
boo at (re_eval 1) line 1.
to
boo at /tmp/p line 2.
* scope and closures now behave as you might expect; for example
for my $x (qw(a b c)) { "" =~ /(?{ print $x })/ }
now prints "abc" rather than ""
* with recursion, it now finds the lexical within the appropriate depth
of pad: this code now prints "012" rather than "000":
sub recurse {
my ($n) = @_;
return if $n > 2;
"" =~ /^(?{print $n})/;
recurse($n+1);
}
recurse(0);
* an earlier fix that stopped 'my' declarations within code blocks causing
crashes, required the accumulating of two SAVECOMPPADs on the stack for
each iteration of the code block; this is no longer needed;
* UNITCHECK blocks within literal code blocks are now run as part of the
main body of code (run-time code blocks still trigger an immediate
call to the UNITCHECK block though)
This is all achieved by building upon the efforts of the commits which led
up to this; those altered the parser to parse literal code blocks
directly, but up until now those code blocks were discarded by
Perl_pmruntime and the block re-compiled using the original re_eval
mechanism. As of this commit, for the non-qr and non-runtime variants,
those code blocks are no longer thrown away. Instead:
* the LISTOP generated by the parser, which contains all the code
blocks plus OP_CONSTs that collectively make up the literal pattern,
is now stored in a new field in PMOPs, called op_code_list. For example
in /A(?{BLOCK})C/, the listop stored in op_code_list looks like
LIST
PUSHMARK
CONST['A']
NULL/special (aka a DO block)
BLOCK
CONST['(?{BLOCK})']
CONST['C']
* each of the code blocks has its last op set to null and is individually
run through the peephole optimiser, so each one becomes a little
self-contained block of code, rather than a list of blocks that run into
each other;
* then in re_op_compile(), we concatenate the list of CONSTs to produce a
string to be compiled, but at the same time we note any DO blocks and
note the start and end positions of the corresponding CONST['(?{BLOCK})'];
* (if the current regex engine isn't the built-in perl one, then we just
throw away the code blocks and pass the concatenated string to the engine)
* then during regex compilation, whenever we encounter a '(?{', we see if
it matches the index of one of the pre-compiled blocks, and if so, we
store a pointer to that block in an 'l' data slot, and use the end index
to skip over the text of the code body. Conversely, if the index doesn't
match, then we know that it's a run-time pattern and (for now), compile
it in the old way.
* During execution, when an EVAL op is encountered, if data->what is 'l',
then we just use the pad that was in effect when the pattern was called;
i.e. we use the current pad slot of the currently executing CV that the
pattern is embedded within.
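
The index-matching step described above might be sketched as follows. All names are hypothetical, the block is an opaque pointer, the end offset is assumed to be inclusive, and only "(?{" is recognised; this is an illustration of the lookup, not the regcomp.c implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record of one pre-compiled block. */
struct code_block {
    size_t start;      /* offset of '(' in the pattern string */
    size_t end;        /* offset of the closing ')' (assumed inclusive) */
    const void *block; /* the pre-compiled block */
};

/* Return the pre-compiled block starting at offset, or NULL. */
static const struct code_block *
block_at(const struct code_block *blocks, size_t n, size_t offset)
{
    for (size_t i = 0; i < n; i++)
        if (blocks[i].start == offset)
            return &blocks[i];
    return NULL;
}

/* Walk the pattern and count how many "(?{" match a pre-compiled block
 * (literal) and how many do not (run-time, compiled the old way). */
static void count_blocks(const char *pat,
                         const struct code_block *blocks, size_t n,
                         size_t *literal, size_t *runtime)
{
    size_t len = strlen(pat);

    *literal = *runtime = 0;
    for (size_t i = 0; i < len; i++) {
        const struct code_block *b;

        if (strncmp(pat + i, "(?{", 3) != 0)
            continue;
        b = block_at(blocks, n, i);
        if (b != NULL) {
            assert(b->end >= i);
            (*literal)++;
            i = b->end;     /* skip the code body; i++ steps past ')' */
        } else {
            (*runtime)++;
        }
    }
}
```

Skipping to the recorded end offset is what lets the regex compiler avoid re-scanning the text of a block that has already been parsed.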
David Mitchell [Fri, 19 Aug 2011 11:10:01 +0000 (12:10 +0100)]
add Perl_re_op_compile function
Make Perl_re_compile() a thin wrapper around a new function,
Perl_re_op_compile(). This function can take either a string pattern or a
list of ops. Then make pmruntime() pass a list of ops directly to it, rather
than concatenating all the consts into a single string and passing the const to
Perl_re_compile(). For now, Perl_re_op_compile just does the same: if it's
passed an op tree rather than an SV, then it just concats the consts.
So this is just the next step towards eventually allowing the regex
engine to use the ops directly.
David Mitchell [Wed, 17 Aug 2011 15:41:04 +0000 (16:41 +0100)]
add Perl_current_re_engine() function
Abstract out into a separate function the task of finding the current
in-scope regex engine ($^H{regex}). Currently this task is only done in
one place each for compile- and run-time, but shortly we'll need it in
other places too.
David Mitchell [Mon, 8 Aug 2011 16:56:10 +0000 (17:56 +0100)]
re_eval and closures: add lots of TODO tests
re_evals currently almost always do the wrong thing as regards what
lexical variable they refer to. This commit adds lots of TODO tests that
show what behaviour I think there should be. Note that because hardly any
of these tests pass yet, I haven't been able to verify whether they have
any subtle typos etc.
The basic philosophy behind these tests is:
* literal code is compiled once at compile-time and shares the same
lexical environment as its surroundings; i.e.
/A(?{..$x..})B/
is like
/A/ && do {..$x..} && /B/
* qr is treated as a closure: compiling once, but capturing its
environment anew each time it is instantiated; i.e.
for my $x (...) { push @r, qr/A(?{..$x..})B/ }
is like
for my $x (...) { push @r, sub { /A/ && do {..$x..} && /B/ } }
* run-time code is recompiled each time the regex is compiled; literal
code in the same expression isn't recompiled; i.e.
$code = '(?{ BEGIN{$y++} })';
for (1..3) { /(?{ BEGIN{$x++}})$code/ }
# x==1, y==3
* an embedded qr is not stringified, so the qr retains its original
lexical environment; i.e.
$x = 1;
{ my $x = 2; $r = qr/(??{$x})/ }
/A$r/; # matches A2, not A1
David Mitchell [Mon, 1 Aug 2011 15:53:00 +0000 (16:53 +0100)]
fix the descriptions for pregcomp/re_compile
Some while ago most of the body of Perl_pregcomp was extracted out into
a new sub, Perl_re_compile, but the comments describing the function stayed
with pregcomp. Move them down to re_compile, since they mostly apply to
that sub now.
David Mitchell [Mon, 1 Aug 2011 11:40:32 +0000 (12:40 +0100)]
disable lexing of (?{}) within \Q, \U etc
when (?{..}) or (??{..}) is enclosed in one of the modifiers like \Q,
don't parse it; treat it as plain characters that are modified by \Q etc.
This means that any code blocks will be interpreted by the regexp compiler
rather than the lexer. This restores behaviour similar to 5.14 and
earlier, except that the (?{}) is *completely* uninterpreted by the lexer,
whereas 5.14 would skip till matching {}.
David Mitchell [Mon, 1 Aug 2011 11:06:00 +0000 (12:06 +0100)]
update diagnostics message
The description of the "Sequence (?{...}) not terminated with ')'" error
was cut and pasted from another error without the text being updated!
David Mitchell [Sat, 23 Jul 2011 20:29:02 +0000 (21:29 +0100)]
make re_evals be seen by the toker/parser
This commit is a first step to making the handling of /(?{...})/ more sane.
But see the big proviso at the end.
Currently a pattern like /a(?{...})b/ is uninterpreted by the lexer and
parser, and is instead passed as-is to the regex compiler, which is
responsible for ensuring that the embedded perl code is extracted and
compiled. The only thing the quoted string code in the lexer currently
does is to skip nested matched {}'s, in order to get to the end of the code
block and restart looking for interpolated variables, \Q etc.
This commit makes the lexer smarter.
Consider the following pattern:
/FOO(?{BLOCK})BAR$var/
This is currently tokenised as
op_match
(
op_const["FOO(?{BLOCK})BAR"]
,
$
"var"
)
Instead, tokenise it as:
op_match
(
op_const["FOO"]
,
DO
{
BLOCK
;
}
,
op_const["(?{BLOCK})"]
,
op_const["BAR"]
,
$
"var"
)
This means that BLOCK is itself tokenised and parsed. We also insert
a const into the stream to include the original source text of BLOCK so
that it's available for stringifying qr's etc.
Note that by allowing the lexer/parser direct access to BLOCK, we can now
handle things like
/(?{"{"})/
This mechanism is similar to the way something like
"abc $a[foo(q(]))] def"
is currently parsed: the double-quoted string handler in the lexer stops
at $a[, the 'foo(q(]))' is treated as perl code, then at the end control is
passed back to the string handler to handle the ' def'.
This commit includes a new error message:
Sequence (?{...}) not terminated with ')'
since when control is passed back to the quoted-string handler, it expects
to find the ')' as the next char. This new error mostly replaces the old
Sequence (?{...}) not terminated or not {}-balanced in regex
Big proviso:
This commit updates toke.c to recognise the embedded code, but doesn't
then do anything with it. The parser will pass both a compiled do block
and a const for each embedded (?{..}), and Perl_pmruntime just throws
away the do block and keeps the constant text instead which is passed to
the regex compiler. So currently each code block gets compiled twice (!)
with two sets of warnings etc. The next stage will be to pass these do
blocks to the regex compiler.
This commit is based on a patch I had originally worked up about 6 years
ago, which has been sitting bit-rotting ever since.
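
To see why plain brace counting cannot cope with /(?{"{"})/, here is a sketch of a naive matcher. naive_block_end is a hypothetical helper written in the spirit of the old skipping of nested matched {}'s; it is not the actual toke.c or regcomp.c code.

```c
#include <assert.h>
#include <stddef.h>

/* Given the offset of the '{' that opens a code block, return the
 * offset of the '}' that balances it by counting braces only, or -1 if
 * the braces never balance. Quoting is deliberately ignored. */
static long naive_block_end(const char *pat, size_t open)
{
    int depth = 0;

    assert(pat[open] == '{');
    for (size_t i = open; pat[i] != '\0'; i++) {
        if (pat[i] == '{')
            depth++;
        else if (pat[i] == '}' && --depth == 0)
            return (long)i;
    }
    return -1;
}
```

For (?{ {1} }) the counter finds the closing brace, but for (?{"{"}) the quoted '{' is counted too, so the braces never balance and the result is -1. Handing BLOCK to the real parser avoids that class of problem.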
David Mitchell [Fri, 22 Jul 2011 13:56:17 +0000 (14:56 +0100)]
correct comment about how strings are tokenised
The stuff about "foo\lbar" being tokenised as a list which you need to
apply join() to, was wrong; the tokeniser outputs the necessary concats
rather than commas.
David Mitchell [Sun, 30 Oct 2011 16:12:02 +0000 (16:12 +0000)]
Revert 4 regex commits to ease rebasing
Revert "Remove some repeated code in pp_regcomp"
This reverts commit
3e1022372a8200bc4c7354e0f588c7f71584a888.
Revert "regcomp.c: Use no_mg for 2nd fetch of pattern"
This reverts commit
3e0b93e82af0f1a033bcdb918b413113f1d61cf0.
Revert "PATCH: [perl #101940]: BBC Tk"
This reverts commit
11951bcbfcaf4c260b0da0421e72fc80b4654f17.
Revert "Fix =~ $str_overloaded (5.10 regression)"
This reverts commit
15d9c083b08647e489d279a1059b4f14a3df187b.
These four recent commits on the blead branch overlap with work on the
re_eval branch. To make rebasing re_eval easier, revert them at the
beginning of the re_eval branch. Any remaining value will be re-added
later in the re_eval branch.
David Mitchell [Sat, 12 Nov 2011 22:11:28 +0000 (22:11 +0000)]
Revert two commits that fix a VOL declaration.
This has also been done on my davem/re_eval branch,
and it's easiest to revert these at the start of my branch
Revert "regcomp.c: Silence compiler warning about longjump"
This reverts commit
24efd69ba77ba76cd714519dccee88f45820d8b4.
Revert "Fix volatile declaration"
This reverts commit
4d6b28934825c9c735195140271a6f93f9c07348.
David Mitchell [Tue, 14 Feb 2012 17:25:32 +0000 (17:25 +0000)]
revert a trailing whitespace removal
this is to make rebasing easier. One of the rebased commits
also removes this whitespace