When converting a byte pattern to UTF-8, the needed size may increase
because some bytes (the UTF-8 variants) occupy two bytes under UTF-8
instead of one.
Prior to this commit, the pattern was assumed to contain only variants,
and enough memory was allocated for the worst case.
This commit actually calculates how much space is needed and allocates
only that.
There is extra work involved in doing this calculation, but the pattern
is scanned a word at a time. For short strings, it doesn't much matter
either way. But for very long strings, it seems to me that the
consequences of potentially allocating way too much memory outweigh the
cost of this extra work. If field experience proves me wrong, revert
this commit.
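For illustration, here is a minimal, hypothetical sketch of such a count
(it is not Perl's actual variant_under_utf8_count()). On ASCII platforms
the variants are exactly the bytes with the high bit set, and the scan
can proceed a machine word at a time as described above:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch, not the Perl source: count the bytes that expand
 * to two bytes under UTF-8 (on ASCII platforms, those with bit 7 set),
 * reading a machine word at a time where possible. */
static size_t
count_two_byte_expansions(const unsigned char *s, size_t len)
{
    size_t count = 0;
    size_t i = 0;

    /* A word at a time: keep only bit 7 of each byte, then count bits. */
    for (; i + sizeof(uintptr_t) <= len; i += sizeof(uintptr_t)) {
        uintptr_t w;
        memcpy(&w, s + i, sizeof w);             /* alignment-safe load */
        w &= (uintptr_t) 0x8080808080808080ULL;  /* high bit of every byte */
        for (; w; w &= w - 1)                    /* clear lowest set bit */
            count++;
    }

    /* Any leftover tail bytes, one at a time. */
    for (; i < len; i++)
        count += s[i] >> 7;

    return count;
}

With such a count, the destination needs len + count + 1 bytes (one
extra byte per variant plus a trailing NUL), which is the formula the
Newx() call below uses.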
DEBUG_PARSE_r(Perl_re_printf( aTHX_
"UTF8 mismatch! Converting to utf8 for resizing and compile\n"));
- Newx(dst, *plen_p * 2 + 1, U8);
+ /* 1 for each byte + 1 for each byte that expands to two, + trailing NUL */
+ Newx(dst, *plen_p + variant_under_utf8_count(src, src + *plen_p) + 1, U8);
d = dst;
while (s < *plen_p) {