Skip casing for some non-cased scripts
author Karl Williamson <khw@cpan.org>
Thu, 3 Dec 2015 20:12:51 +0000 (13:12 -0700)
committer Karl Williamson <khw@cpan.org>
Thu, 10 Dec 2015 06:43:21 +0000 (23:43 -0700)
commit 36eaa8111efe6b0ebe974f6b26ed667c1206dc9f
tree 111d349b4deb4ee6cd3cb2d8b030107a4fd616e7
parent 5af9bc9750ba392c2a4adfdc3ced4b0b301f656a

Characters whose upper, lower, title, or fold case differ from the
character itself amount to just 1.5% of the assigned Unicode characters,
and this percentage falls with each new Unicode release, as almost all
cased scripts have already been encoded.  But a lot of code is written
assuming a cased language, such as calling uc() or lcfirst(), or doing
qr//i.  When such code is run on a non-cased language, the work expended
in doing the casing is wasted.  And casing is expensive.  But finding
out if a character is cased or not is nearly as expensive, so one might
as well just do the casing.

However, the Unicode code space is organized so that there are some long
stretches of contiguous code points that aren't cased.  By adding tests
to see if the input code point is in just a few of these ranges, we can
quickly rule casing out for most of the non-cased scripts that are of
commercial use today, at essentially no expense to handling the more
common cased scripts.  Testing for just 3 ranges in Plane 0 of Unicode
(where most of the code points in common use today reside) allows us to
skip doing casing for more than 82% of code points in the plane,
including the following languages: Arabic, Chinese, Hebrew, Japanese,
Korean, Thai, and the major scripts of India.  No longer is a swash
generated when trying to case one of these, so runtime memory usage is
decreased.

(It should be noted that some of these languages have characters
scattered in other areas, because the original allocation for them
turned out to be not large enough.  When changing the case of these
other characters, the lookups won't be skipped.  But that original
allocation included all or nearly all the characters in current common
use, so these other characters are comparatively rare.)
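
The range tests described above can be sketched as follows.  This is a
minimal illustration, not the code in utf8.c: the function names and the
three range boundaries are assumptions, chosen only so that the Hebrew,
CJK, and Hangul code points fall inside a range and Greek falls outside.
The authoritative ranges are the ones in utf8.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bounds only (assumptions, not copied from utf8.c): three
 * half-open Plane 0 ranges [START, END) assumed to hold no cased
 * characters. */
#define UNCASED_1_START 0x0590u  /* Hebrew, Arabic, Indic scripts, Thai, ... */
#define UNCASED_1_END   0x10A0u
#define UNCASED_2_START 0x2D30u  /* CJK, Hiragana, Katakana, ... */
#define UNCASED_2_END   0xA640u
#define UNCASED_3_START 0xAC00u  /* Hangul syllables, ... */
#define UNCASED_3_END   0xFB00u

/* Returns true when cp lies in one of the assumed non-cased ranges, so a
 * caller could return cp unchanged without any table lookup. */
static bool in_uncased_range(uint32_t cp)
{
    /* Latin, Greek, and Cyrillic leave on this first comparison. */
    if (cp < UNCASED_1_START) return false;
    if (cp < UNCASED_1_END)   return true;
    if (cp < UNCASED_2_START) return false;
    if (cp < UNCASED_2_END)   return true;
    if (cp < UNCASED_3_START) return false;
    return cp < UNCASED_3_END;
}

/* Counts how many Plane 0 code points the fast path would cover. */
static unsigned plane0_skipped(void)
{
    unsigned n = 0;
    for (uint32_t cp = 0; cp <= 0xFFFFu; cp++)
        if (in_uncased_range(cp))
            n++;
    return n;
}
```

With these assumed bounds the fast path covers 54048 of the 65536 code
points in Plane 0, a little over 82%.  A cased character such as U+03B1
fails the first comparison and proceeds to the normal casing lookup, so
the cost to the common cased scripts is a single compare.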

The comments in the code indicate some candidate non-cased ranges that I
chose not to treat specially at this time.  The next commit will address
planes above Plane 0.

When this command is run on a perl compiled with -O2, no DEBUGGING:

    blead Porting/bench.pl --perlargs="-Ilib -X" --benchfile=plane0_casing_perf /path_to_prior_perl=before_this_commit /path_to_new_perl=after

and the file 'plane0_casing_perf' contains:
    [
        'string::casing::greek' => {
            desc    => 'should be no change',
            setup   => 'my $a = "\x{3B1}"',  # GREEK SMALL LETTER ALPHA
            code    => 'uc($a)'
        },
        'string::casing::hebrew' => {
            desc    => 'yes swash vs no swash',
            setup   => 'my $a = "\x{5D0}"',  # HEBREW LETTER ALEF
            code    => 'uc($a)'
        },
        'string::casing::cjk' => {
            desc    => 'yes swash vs no swash',
            setup   => 'my $a = "\x{4E01}"',  # CJK UNIFIED IDEOGRAPH-4E01
            code    => 'uc($a)'
        },
        'string::casing::korean' => {
            desc    => 'yes swash vs no swash',
            setup   => 'my $a = "\x{AC00}"',  # HANGUL SYLLABLE GA
            code    => 'uc($a)'
        },
    ];

These are the results:

The numbers represent raw counts per loop iteration.

string::casing::cjk
yes swash vs no swash

       before_this_commit    after
       ------------------ --------
    Ir              931.0    300.0
    Dr              217.0     93.0
    Dw               94.0     45.0
  COND              129.0     48.0
   IND                7.0      4.0

COND_m                1.5      0.0
 IND_m                4.0      2.0

 Ir_m1                0.1      0.0
 Dr_m1                0.0      0.0
 Dw_m1                0.0      0.0

 Ir_mm                0.0      0.0
 Dr_mm                0.0      0.0
 Dw_mm                0.0      0.0

string::casing::greek
should be no change

       before_this_commit    after
       ------------------ --------
    Ir              946.0    920.0
    Dr              218.0    220.0
    Dw              100.0    100.0
  COND              127.0    121.0
   IND                6.0      8.0

COND_m                0.5      1.3
 IND_m                2.0      2.0

 Ir_m1                0.1      0.0
 Dr_m1                0.0      0.0
 Dw_m1                0.0      0.0

 Ir_mm                0.0      0.0
 Dr_mm                0.0      0.0
 Dw_mm                0.0      0.0

string::casing::hebrew
yes swash vs no swash

       before_this_commit    after
       ------------------ --------
    Ir              928.0    290.0
    Dr              224.0     92.0
    Dw              100.0     45.0
  COND              129.0     46.0
   IND                6.0      4.0

COND_m                0.5      0.0
 IND_m                2.0      2.0

 Ir_m1                0.1      0.0
 Dr_m1                0.0      0.0
 Dw_m1                0.0      0.0

 Ir_mm                0.0      0.0
 Dr_mm                0.0      0.0
 Dw_mm                0.0      0.0

string::casing::korean
yes swash vs no swash

       before_this_commit    after
       ------------------ --------
    Ir              953.0    307.6
    Dr              224.0     93.0
    Dw              100.0     45.0
  COND              131.0     50.9
   IND                7.0      4.0

COND_m                1.5      0.0
 IND_m                4.0      2.0

 Ir_m1                0.1      0.0
 Dr_m1                0.0      0.0
 Dw_m1                0.0      0.0

 Ir_mm                0.0      0.0
 Dr_mm                0.0      0.0
 Dw_mm                0.0      0.0
utf8.c