This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5
Refactor utf8 to code point conversion
Most such conversions occur in the inlined function
Perl_utf8n_to_uvchr_msgs(), which several macros like utf8n_to_uvchr()
expand to.
This commit effectively removes a conditional from inside the loop, and
avoids some conditionals when converting the common case of the input
being UTF-8 invariant (ASCII on ASCII platforms).
Prior to this commit, the code did something different the first time
through the loop than the other times. Hoisting that work into pre-loop
initialization removes the conditional from each iteration. Doing so
meant rearranging the loop to be a while(1) with its exit conditions in
the middle.
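
As a rough illustration of that loop shape only (this is not the code in
this commit): the sketch below is a hypothetical toy decoder,
toy_utf8_decode(), that handles the first byte before the loop and then
runs a while(1) whose exits sit in the middle of the body. It assumes
len >= 1 and does none of the overlong, surrogate, or range checks a
real decoder needs.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy decoder, for illustration only. Returns the code point
 * and stores the number of bytes consumed in *retlen. On malformed or
 * truncated input it returns 0 and sets *retlen to 0. Assumes len >= 1. */
static uint32_t
toy_utf8_decode(const uint8_t *s, size_t len, size_t *retlen)
{
    const uint8_t *p = s;
    const uint8_t *end = s + len;
    uint32_t cp;
    unsigned remaining;   /* continuation bytes still expected */

    /* First-byte handling is done once, before the loop. */
    uint8_t b = *p++;
    if (b < 0x80) {                     /* invariant: no loop needed */
        *retlen = 1;
        return b;
    }
    else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; remaining = 1; }
    else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; remaining = 2; }
    else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; remaining = 3; }
    else {                              /* not a valid start byte */
        *retlen = 0;
        return 0;
    }

    /* Every pass now does the same thing; the exits are mid-body. */
    while (1) {
        if (p >= end) break;            /* input ended too early */
        b = *p;
        if ((b & 0xC0) != 0x80) break;  /* not a continuation byte */
        p++;
        cp = (cp << 6) | (b & 0x3F);
        if (--remaining == 0) {         /* normal exit: sequence complete */
            *retlen = (size_t)(p - s);
            return cp;
        }
    }

    *retlen = 0;
    return 0;
}
```

Because the start byte is classified up front, the loop body carries no
"is this the first iteration?" test.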
All calls to this function from the Perl core pass in a non-empty
string. But outside calls could conceivably pass an empty one, which
could lead to reading outside the buffer. An extra check is added for
non-core calls, as is already done elsewhere.
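
A minimal sketch of that kind of guard, under stated assumptions:
TOY_CORE is a made-up macro standing in for however a core build is
identified, and toy_first_byte() is a hypothetical helper, not a
function from this commit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: defining TOY_CORE would mark a "core" build, where every
 * caller is known to pass a non-empty buffer. It is left undefined here. */
/* #define TOY_CORE */

/* Returns the first byte of the buffer, or -1 if there is nothing to read. */
static int
toy_first_byte(const uint8_t *s, size_t len)
{
#ifndef TOY_CORE
    /* Outside callers might pass an empty buffer; return before *s is read. */
    if (len == 0) {
        return -1;
    }
#endif
    return *s;
}
```

Core builds compile the check away, so only outside callers pay for it.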
This change means that calls from core execute no more conditionals than
the typical:
    if (UTF8_IS_INVARIANT(*s)) {
        code_point = *s;
    }
    else {
        code_point = utf8n_to_uvchr(s, ...)
    }
I'm therefore thinking these can now just be replaced by the simpler
    code_point = utf8n_to_uvchr(s, ...)
without a noticeable hit in performance. The essential difference is
that the former gets its code point from the string already being
examined, and the latter looks up data in a 450-byte static array that
is referred to constantly, so it is likely to be cached.