This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5
toke.c: Potentially avoid work when converting to UTF-8
author Karl Williamson <khw@cpan.org>
Fri, 19 Aug 2016 00:54:13 +0000 (18:54 -0600)
committer Karl Williamson <khw@cpan.org>
Tue, 3 Jan 2017 04:46:41 +0000 (21:46 -0700)
commit af9be36c89322d2469f27b3c98c20c32044697fe
tree 99b4f7199707b44bafbd59c5ea5352251e89e87f
parent 8d0042a898ed988cce011b3be47ac52b8c37b727

Some code points below 256 have the same representation whether or not
the string is in UTF-8.  The others require 2 bytes in UTF-8.  When
parsing a string, we avoid UTF-8 unless it is necessary, because
processing UTF-8 carries extra overhead.  This means that when we
discover partway through the parse that we need to convert to UTF-8,
we have to, in effect, reparse whatever we have so far, so that the
code points whose representation differs under UTF-8 get their proper
form.  This reparsing is unnecessary if we know that the string
contains no such code points.

It turns out that keeping track of whether any UTF-8 variant code
points have been seen is cheap, requiring no extra branch instructions.
And the payoff is potentially large: the string need not be reparsed.
This commit adds that tracking.
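
The idea can be illustrated with a minimal sketch.  This is not the code
in toke.c; the function names copy_tracking() and upgrade_to_utf8() are
hypothetical, and the sketch assumes an ASCII platform, where the
variant code points below 256 are exactly the bytes with the high bit
set.  Each byte is OR-ed into an accumulator as it is copied, which adds
no branch to the loop; at upgrade time a single test of the accumulator
decides whether the buffer can be reused as-is or must be expanded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy len bytes from src to dst, OR-ing each byte
 * into *seen.  The OR is unconditional, so the loop gains no branch. */
static void copy_tracking(unsigned char *dst, const unsigned char *src,
                          size_t len, unsigned char *seen)
{
    for (size_t i = 0; i < len; i++) {
        dst[i] = src[i];
        *seen |= src[i];
    }
}

/* Hypothetical helper: return a malloc'ed, NUL-terminated UTF-8 copy of
 * the Latin-1 bytes in buf, storing its length in *out_len.  Assumes an
 * ASCII platform.  Returns NULL if allocation fails. */
static unsigned char *upgrade_to_utf8(const unsigned char *buf, size_t len,
                                      unsigned char seen, size_t *out_len)
{
    assert(out_len != NULL);

    /* Worst case: every byte expands to two. */
    unsigned char *out = malloc(2 * len + 1);
    if (out == NULL)
        return NULL;

    if (!(seen & 0x80)) {
        /* No variant byte was ever seen: the bytes are already valid
         * UTF-8, so a straight copy suffices and no rescan is needed. */
        memcpy(out, buf, len);
        out[len] = '\0';
        *out_len = len;
        return out;
    }

    /* At least one variant byte: rescan and expand each one to 2 bytes. */
    size_t j = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] < 0x80) {
            out[j++] = buf[i];
        } else {
            out[j++] = (unsigned char)(0xC0 | (buf[i] >> 6));
            out[j++] = (unsigned char)(0x80 | (buf[i] & 0x3F));
        }
    }
    out[j] = '\0';
    *out_len = j;
    return out;
}
```

In this sketch the accumulator only records that some variant byte was
seen, not where or how many, which is all the fast path needs.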
toke.c