perl5.git.perl.org Git - perl5.git/commit

author	Karl Williamson <public@khwilliamson.com>
	Mon, 28 Feb 2011 01:44:43 +0000 (18:44 -0700)
committer	Karl Williamson <public@khwilliamson.com>
	Mon, 28 Feb 2011 02:21:33 +0000 (19:21 -0700)
commit	d50a4f90cab527593b2dd218f71b66a6be555490
tree	37c9334aa808d276506f002fcc5a34ae770073c2
parent	2335b3d39eb70759d992779a5e8e11443648e5dd

Handle [folds] of 0-255 without swashes

Commit 56ca34cada940c7f6aae9a59da266e541530041e had the side effect of
causing regular expressions with things like [a-z], or even just [k] to
go out to disk to read tables to create swashes because it knew that
some of those characters matched outside the bitmap (and due to
l1_char_class_tab.h it knew which ones had those matches), but it didn't
know what the characters were that participated in those folds.

This patch hard-codes the Unicode 6.0 rules into regcomp.c for the
code points 0-255, so that the very slow utf8_heavy is not invoked on
them.  (Code points above 255 will continue to invoke it.)  It would,
of course, be better if these rules could be regen'd into regcomp.c, as
there is a risk that the standard will change, and the code will not.
But I don't think that has ever happened; in other words, I think that
the rules haven't changed so far since Day 1 of Unicode.  (That would
not be the case if we were doing simple case folding, as the capital
sharp ss, which folds to U+00DF, was added later.)  And the Standard is
getting more stable in this area.  I believe one of their stability
policies now forbids them from adding something that simply folds to
one of the characters that already has a fold, such as M and m.
Ligatures are frowned on, and I doubt that new ones would be encoded,
so that leaves a new Unicode character that folds to a Latin-1 character
plus some sort of mark.  For those, this code is a no-op, so those aren't a
problem either.
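
As a rough illustration of what hard-coding these rules can look like,
here is a minimal sketch in C.  It is not the code added to regcomp.c:
the table name, struct, and function are hypothetical.  It lists only
single-character partners above 255 that I am confident of and is not
claimed to be exhaustive; multi-character folds such as the ligatures
are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table (not from regcomp.c): a Latin-1 code point paired
 * with a code point above 255 that matches it case-insensitively. */
struct fold_pair {
    uint8_t  latin1;
    uint32_t above;
};

static const struct fold_pair latin1_fold_partners[] = {
    { 0x4B, 0x212A },  /* 'K'  <-> KELVIN SIGN */
    { 0x6B, 0x212A },  /* 'k'  <-> KELVIN SIGN */
    { 0x53, 0x017F },  /* 'S'  <-> LATIN SMALL LETTER LONG S */
    { 0x73, 0x017F },  /* 's'  <-> LATIN SMALL LETTER LONG S */
    { 0xC5, 0x212B },  /* A WITH RING ABOVE <-> ANGSTROM SIGN */
    { 0xE5, 0x212B },  /* a with ring above <-> ANGSTROM SIGN */
    { 0xB5, 0x039C },  /* MICRO SIGN <-> GREEK CAPITAL LETTER MU */
    { 0xB5, 0x03BC },  /* MICRO SIGN <-> GREEK SMALL LETTER MU */
    { 0xFF, 0x0178 },  /* y with diaeresis <-> Y WITH DIAERESIS */
    { 0xDF, 0x1E9E },  /* sharp s <-> CAPITAL SHARP S */
};

/* Copy up to `max` partners of `cp` into `out` and return how many
 * partners exist in the table (which may exceed `max`).  No disk
 * access or swash is involved; it is a plain table scan. */
size_t
latin1_fold_partners_above_255(uint8_t cp, uint32_t *out, size_t max)
{
    size_t count = 0;
    size_t i;
    const size_t n =
        sizeof latin1_fold_partners / sizeof latin1_fold_partners[0];

    assert(out != NULL || max == 0);

    for (i = 0; i < n; i++) {
        if (latin1_fold_partners[i].latin1 == cp) {
            if (count < max)
                out[count] = latin1_fold_partners[i].above;
            count++;
        }
    }
    return count;
}
```

With a table like this, a class such as [k] can have U+212A added to
its match set directly, while a class such as [a] gets nothing extra.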

pod/perldiag.pod
regcomp.c