I haven't done the digging, but this appears to be a failure to include
UTF-8 processing when 'use utf8' was added to Perl.
The code that was causing this in toke.c had found the beginning of a
qr/(?#... comment in a pattern. It attempted to space up to, but not including,
the final character, which is handled later. (In most instances that
final character is a single-byte ')', but not in this test case.) It
spaced per-byte. The problem is that if the final character is in UTF-8
and isn't a single byte, it leaves the input position pointing at the
final byte of that character, which creates malformed UTF-8, which the
assertion discovered.
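
For illustration, here is a minimal standalone C sketch of that failure mode
(not the toke.c code itself; the buffer contents and variable names are made
up, and the copy into the destination buffer is omitted): when the character
just before the comment's end is multi-byte, a byte-at-a-time advance that
stops one byte short of the end leaves the position in the middle of that
character.

    #include <stdio.h>

    int main(void) {
        /* Pattern tail "(?#é": 'é' is the two-byte UTF-8 sequence 0xC3 0xA9
         * and is the final character, meant to be handled by later code. */
        const char buf[] = "(?#\xC3\xA9";
        const char *s    = buf;
        const char *send = buf + sizeof(buf) - 1;   /* one past the last byte */

        /* Byte-at-a-time advance, stopping one *byte* short of the end,
         * mirroring the shape of the buggy loop */
        while (s + 1 < send && *s != ')')
            s++;

        /* s now points at 0xA9, the trailing byte of 'é'; any code that
         * treats s as the start of a character sees malformed UTF-8 */
        printf("stopped at byte 0x%02X (offset %ld)\n",
               (unsigned)(unsigned char)*s, (long)(s - buf));
        return 0;
    }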
The fix is to be cognizant that this is UTF-8 when spacing to the end,
so that the final position begins at the first byte of it.
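
Under the same assumptions (a standalone sketch with a hypothetical utf8_skip
helper standing in for character-length logic such as Perl's UTF8SKIP and
UTF8_SAFE_SKIP), advancing a whole character at a time leaves the position at
the first byte of the final character:

    #include <stdio.h>

    /* Simplified stand-in for a UTF-8 character-length lookup; assumes a
     * valid lead byte (ASCII-range bytes count as 1). */
    static size_t utf8_skip(const unsigned char *p) {
        if (*p < 0x80) return 1;
        if (*p < 0xE0) return 2;
        if (*p < 0xF0) return 3;
        return 4;
    }

    int main(void) {
        const unsigned char buf[] = "(?#\xC3\xA9";
        const unsigned char *s    = buf;
        const unsigned char *send = buf + sizeof(buf) - 1;
        size_t len = utf8_skip(s);

        /* Advance a whole character at a time, stopping before the final
         * character rather than before the final byte */
        while (s + len < send && *s != ')') {
            s   += len;
            len  = utf8_skip(s);
        }

        /* s now points at 0xC3, the first byte of 'é', so later code
         * still sees well-formed UTF-8 */
        printf("stopped at byte 0x%02X (offset %ld)\n",
               (unsigned)*s, (long)(s - buf));
        return 0;
    }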
skip_all('no re module') unless defined &DynaLoader::boot_DynaLoader;
skip_all_without_unicode_tables();
-plan tests => 862; # Update this when adding/deleting tests.
+plan tests => 863; # Update this when adding/deleting tests.
run_tests() unless caller;
'[^0] doesnt crash on UTF-8 target string');
}
+    {   # [perl #133992]  This is a tokenizer bug when parsing a pattern
+        fresh_perl_is(q:$z = do {
+                            use utf8;
+                            "q!тест! =~ m'"
+                        };
+                        $z .= 'è(?#\84';
+                        $z .= "'";
+                        eval $z;:, "", {}, 'foo');
+    }
+
} # End of sub run_tests
1;
     * friends */
    else if (*s == '(' && PL_lex_inpat && s[1] == '?' && !in_charclass) {
        if (s[2] == '#') {
-           while (s+1 < send && *s != ')')
-               *d++ = *s++;
+           if (s_is_utf8) {
+               PERL_UINT_FAST8_T len = UTF8SKIP(s);
+
+               while (s + len < send && *s != ')') {
+                   Copy(s, d, len, U8);
+                   d += len;
+                   s += len;
+                   len = UTF8_SAFE_SKIP(s, send);
+               }
+           }
+           else while (s+1 < send && *s != ')') {
+               *d++ = *s++;
+           }
        }
        else if (!PL_lex_casemods
                 && ( s[2] == '{' /* This should match regcomp.c */