This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5
Skip casing for high code points
author Karl Williamson <khw@cpan.org>
Thu, 3 Dec 2015 20:27:21 +0000 (13:27 -0700)
committer Karl Williamson <khw@cpan.org>
Thu, 10 Dec 2015 06:43:22 +0000 (23:43 -0700)
commit 3bfc1e7044659f9ec4cc4f1bc9eea7a8b00061fb
tree ec472afda925dc0c5b5f0f45b48aadb3e241061c
parent 36eaa8111efe6b0ebe974f6b26ed667c1206dc9f

As discussed in the previous commit, most code points in Unicode
don't change when upper- or lower-cased, etc.  In fact, as of Unicode
v8.0, 93% of the available code points are above the highest one that
does change.

This commit skips trying to case these 93%.  A regen/ script keeps track
of the highest case-changing code point in the current Unicode release,
and casing is skipped for every code point above it.  Thus currently,
casing emoji will be skipped.
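
The check described above can be sketched roughly as follows.  This is
a minimal illustration, not the code in utf8.c: every sketch_* name is
hypothetical, the threshold literal is an assumed value for Unicode 8.0
(the real one is generated by the regen/ script), and the slow path is
a stand-in that only knows ASCII letters.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: taken here to be the highest code point whose case
 * mapping differs from itself in Unicode 8.0.  Illustrative only; the
 * real value is generated, not hand-written. */
#define SKETCH_HIGHEST_CASE_CHANGING_CP 0x118DFu

/* Stand-in for the expensive table (swash) lookup.  It handles only
 * ASCII lowercase letters, just enough to make the sketch runnable. */
static uint32_t sketch_slow_upper(uint32_t cp)
{
    if (cp >= 0x61u && cp <= 0x7Au)     /* 'a' .. 'z' */
        return cp - 0x20u;
    return cp;
}

/* Uppercase one code point, skipping the lookup entirely when the
 * code point lies above the highest one that can change. */
static uint32_t sketch_upper(uint32_t cp)
{
    assert(cp <= 0x10FFFFu);            /* stay within Unicode's range */
    if (cp > SKETCH_HIGHEST_CASE_CHANGING_CP)
        return cp;                      /* fast path: nothing to do */
    return sketch_slow_upper(cp);
}

/* Fraction of the 0x110000 available code points that take the fast
 * path under the assumed threshold. */
static double sketch_fraction_skipped(void)
{
    const double total   = 0x110000;
    const double skipped = 0x10FFFFu - SKETCH_HIGHEST_CASE_CHANGING_CP;
    return skipped / total;
}
```

With this threshold, sketch_upper(0x1F570) returns its argument without
touching the slow path, and sketch_fraction_skipped() lands between
0.93 and 0.94, in line with the 93% figure quoted above.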

Together with the previous commits that dealt with casing, the potential
for huge memory requirements for the swash hashes for casing is
severely limited.

If the following command is run on a perl compiled with -O2 and no
DEBUGGING:

    blead Porting/bench.pl --raw --perlargs="-Ilib -X" --benchfile=plane1_case_perf /path_to_prior_perl=before_this_commit /path_to_new_perl=after

and the file 'plane1_case_perf' contains

    [
        'string::casing::emoji' => {
            desc    => 'yes swash vs no swash',
            setup   => 'my $a = "\x{1F570}"',  # MANTELPIECE CLOCK
            code    => 'uc($a)'
        },
    ];

the following results are obtained:

The numbers represent raw counts per loop iteration.

string::casing::emoji
yes swash vs no swash

       before_this_commit    after
       ------------------ --------
    Ir              981.0    306.0
    Dr              228.0     94.0
    Dw              100.0     45.0
  COND              137.0     49.0
   IND                7.0      4.0

COND_m                5.5      0.0
 IND_m                4.0      2.0

 Ir_m1                0.1     -0.1
 Dr_m1                0.0      0.0
 Dw_m1                0.0      0.0

 Ir_mm                0.0      0.0
 Dr_mm                0.0      0.0
 Dw_mm                0.0      0.0
regen/unicode_constants.pl
unicode_constants.h
utf8.c