When not explicitly quoted, tokenization of the HERE-document terminator
dealt improperly with multi-byte characters, advancing one byte at a
time instead of one character at a time. This led to
incomprehensible-to-the-user errors of the form:
    Passing malformed UTF-8 to "XPosixWord" is deprecated
    Malformed UTF-8 character (unexpected continuation byte 0xa7, with
      no preceding start byte)
    Can't find string terminator "EnFra�" anywhere before EOF
If the terminator was enclosed in single or double quotes, it was parsed
correctly: delimcpy also advances byte-by-byte, but it looks only for
the single-byte closing quote character.
The fix: when doing a \w+ match looking for the end of the word, advance
character-by-character instead of byte-by-byte, ensuring that the copied
length does not extend past the space available in PL_tokenbuf.
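
For illustration only, the scan-then-bounded-copy idea can be sketched
outside of perl as below. This is not the code in toke.c: the names
sketch_utf8_skip, sketch_is_wordchar and sketch_copy_word are
hypothetical, the word-character test is a crude stand-in for
isWORDCHAR_lazy_if (it accepts any UTF-8 lead byte rather than
consulting Unicode properties), and the step_by_char flag exists only
to contrast the old and new stepping.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for UTF8SKIP: byte length of the UTF-8
 * character whose lead byte is c. Malformed lead bytes count as 1. */
static size_t sketch_utf8_skip(unsigned char c)
{
    if (c < 0x80) return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Hypothetical, simplified stand-in for isWORDCHAR_lazy_if: ASCII
 * alphanumerics, '_', or any UTF-8 lead byte. A bare continuation
 * byte (0x80..0xBF) is deliberately NOT a word character. */
static int sketch_is_wordchar(const char *p)
{
    unsigned char c = (unsigned char)*p;
    return c == '_' || (c < 0x80 && isalnum(c)) || c >= 0xC0;
}

/* Copy the leading "word" of src into dst (cap bytes, including the
 * terminating NUL) and return the number of bytes copied.
 * step_by_char != 0 advances a whole character at a time;
 * step_by_char == 0 advances a single byte at a time. */
static size_t sketch_copy_word(const char *src, char *dst, size_t cap,
                               int step_by_char)
{
    const char *peek = src;
    size_t len;

    assert(cap > 0);

    /* First find the end of the word without copying anything. */
    while (sketch_is_wordchar(peek)) {
        size_t skip = step_by_char
                    ? sketch_utf8_skip((unsigned char)*peek) : 1;
        size_t i = 1;
        /* Never step over the NUL of a truncated sequence. */
        while (i < skip && peek[i] != '\0')
            i++;
        peek += i;
    }

    /* Then copy once, clamped to the space left in the buffer. */
    len = (size_t)(peek - src);
    if (len > cap - 1)
        len = cap - 1;
    memcpy(dst, src, len);
    dst[len] = '\0';
    return len;
}
```

With the input "EnFra\xC3\xA7" "ais;" (the UTF-8 bytes of EnFraçais
followed by a semicolon), stepping by character copies all ten bytes of
the word, while stepping by byte stops at the continuation byte 0xA7
and yields the truncated "EnFra\xC3", matching the terminator shown in
the error above. In this sketch a result equal to cap - 1 means the
word may have been clamped, which a caller would treat as "too long",
much as the patched code croaks when the buffer is full.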
my $v = $array[ 0 + $𝛃 ];
$v = $array[ $𝛃 + 0 ];
EXPECT
+########
+# toke.c
+# Allow Unicode here doc boundaries
+use warnings;
+use utf8;
+my $v = <<EnFraçais;
+Comme ca!
+EnFraçais
+print $v;
+EXPECT
+Comme ca!
term = '"';
if (!isWORDCHAR_lazy_if(s,UTF))
deprecate("bare << to mean <<\"\"");
- for (; isWORDCHAR_lazy_if(s,UTF); s++) {
- if (d < e)
- *d++ = *s;
+ peek = s;
+ while (isWORDCHAR_lazy_if(peek,UTF)) {
+ peek += UTF ? UTF8SKIP(peek) : 1;
}
+ len = (peek - s >= e - d) ? (e - d) : (peek - s);
+ Copy(s, d, len, char);
+ s += len;
+ d += len;
}
if (d >= PL_tokenbuf + sizeof PL_tokenbuf - 1)
Perl_croak(aTHX_ "Delimiter for here document is too long");