This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5
Add variant_under_utf8_count() core function
This function takes a string that isn't encoded in UTF-8 (hence is
assumed to be in Latin1), and counts how many of the bytes therein
would change if it were to be translated into UTF-8. Each such byte
would occupy two UTF-8 bytes.
This function is useful for calculating the expansion factor precisely
when converting to UTF-8, so as to know how much to malloc.
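As a point of reference, the byte-at-a-time calculation and the resulting buffer size can be sketched as below. This is only an illustration: the names count_variants_bytewise() and utf8_length_needed() are hypothetical and are not taken from the Perl source.

```c
#include <stddef.h>

/* Hypothetical helper (not a Perl core function): count the bytes whose
 * high bit is set.  In Latin1 these are exactly the bytes whose UTF-8
 * representation differs, each becoming two bytes. */
static size_t count_variants_bytewise(const unsigned char *s, size_t len)
{
    size_t count = 0;

    for (size_t i = 0; i < len; i++) {
        count += s[i] >> 7;   /* 1 if the high bit is set, else 0 */
    }
    return count;
}

/* Hypothetical helper: the exact number of bytes the UTF-8 form needs,
 * not counting any trailing NUL.  Every variant byte adds one byte. */
static size_t utf8_length_needed(const unsigned char *s, size_t len)
{
    return len + count_variants_bytewise(s, len);
}
```

For the four Latin1 bytes 'c', 'a', 'f', 0xE9, the count is 1, so the UTF-8 form needs 5 bytes.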
This function uses a non-obvious method to do the calculations
word-at-a-time, as opposed to the byte-at-a-time method used now, and
hence should be much faster than the current methods.
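One well-known way to do such a count a word at a time is sketched below. It is a minimal illustration assuming a 64-bit word, and is not claimed to be the code this commit adds. The name count_variants_wordwise() is hypothetical, and memcpy() is used here to sidestep alignment concerns rather than handling unaligned leading bytes separately.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not a Perl core function): count the bytes with
 * the high bit set, eight at a time where possible. */
static size_t count_variants_wordwise(const unsigned char *s, size_t len)
{
    const uint64_t ones = UINT64_C(0x0101010101010101);
    size_t count = 0;

    while (len >= sizeof(uint64_t)) {
        uint64_t word;

        memcpy(&word, s, sizeof word);  /* safe for unaligned input */

        /* Move each byte's high bit down to bit 0 of that byte, and
         * clear everything else: every byte is now 0 or 1. */
        word = (word >> 7) & ones;

        /* Multiplying by 0x01...01 adds all eight bytes together into
         * the most significant byte; the sum is at most 8, so it
         * cannot overflow that byte. */
        count += (size_t) ((word * ones) >> 56);

        s   += sizeof word;
        len -= sizeof word;
    }

    /* Finish any trailing partial word one byte at a time. */
    for (; len > 0; len--) {
        count += *s++ >> 7;
    }
    return count;
}
```

The loop body contains no per-byte branch, which is consistent with the large drop in COND counts for long strings in the measurements.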
The performance change for short strings is equivocal. Here are the
results for a single-character string with a 64-bit word size.
          bytes    words Ratio %
       -------- -------- -------
    Ir    932.0    947.0    98.4
    Dr    325.0    325.0   100.0
    Dw    104.0    104.0   100.0
  COND    136.0    137.0    99.3
   IND     28.0     28.0   100.0
COND_m      1.0      0.0     Inf
 IND_m      6.0      6.0   100.0
There are some extra instructions executed and an extra branch to check
for and handle the case where we can go word-by-word vs. not. But the
one conditional branch misprediction (COND_m) is removed.
The results are essentially the same until the string is long enough to
fill a full word. Some of the extra instructions are there to ensure
that performance doesn't suffer when the input is not aligned on a word
boundary.
Here are the results for 8 bytes on a 64-bit system.
          bytes    words Ratio %
       -------- -------- -------
    Ir    974.0    955.0   102.0
    Dr    332.0    325.0   102.2
    Dw    104.0    104.0   100.0
  COND    143.0    138.0   103.6
   IND     28.0     28.0   100.0
COND_m      1.0      0.0     Inf
 IND_m      6.0      6.0   100.0
Things keep improving as the strings get longer. Here are the results
for 24 bytes.
          bytes    words Ratio %
       -------- -------- -------
    Ir   1070.0    975.0   109.7
    Dr    348.0    327.0   106.4
    Dw    104.0    104.0   100.0
  COND    159.0    140.0   113.6
   IND     28.0     28.0   100.0
COND_m      2.0      0.0     Inf
 IND_m      6.0      6.0   100.0
And for 96 bytes:
          bytes    words Ratio %
       -------- -------- -------
    Ir   1502.0   1065.0   141.0
    Dr    420.0    336.0   125.0
    Dw    104.0    104.0   100.0
  COND    231.0    149.0   155.0
   IND     28.0     28.0   100.0
COND_m      2.0      1.0   200.0
 IND_m      6.0      6.0   100.0
And for 10,000 bytes:
          bytes    words Ratio %
       -------- -------- -------
    Ir  60926.0  13445.0   453.1
    Dr  10324.0   1574.0   655.9
    Dw    104.0    104.0   100.0
  COND  10135.0   1387.0   730.7
   IND     28.0     28.0   100.0
COND_m      2.0      1.0   200.0
 IND_m      6.0      6.0   100.0
I found this trick on the internet many years ago, but I can't seem to
find it again to credit its author.