From 2f833f5208e26b208886e51e09e2c072b5eabb46 Mon Sep 17 00:00:00 2001
From: Karl Williamson
Date: Mon, 31 Jan 2011 22:50:00 -0700
Subject: [PATCH] regcomp.c: Generate different property for /i matching

This patch causes regcomp.c to generate a different property name under
/i than it does otherwise. utf8_heavy.pl will later resolve whether the
property is to match the same under /i or not, based on the data
structure generated by mktables.

This is part of moving non-locale folding into regcomp from regexec.
The reasons are primarily security, but the move has been planned for
some time anyway for performance.

Case-insensitive matching of properties did not mostly work until a
5.13.X build fixed the regexec code. With that change, things like
/\p{ASCII_Hex_Digit}+/i would match non-ASCII characters, such as LATIN
SMALL LIGATURE FF, and almost certainly that would not be the
expectation of the coder.

The Unicode Standard is silent on the matter, but as of this writing, it
appears that they will act to recommend against caseless matching of
properties; I get the sense that they would never have thought someone
would think to do it, but Perl has. I ran some experiments, and very few
properties actually have differences under caseless matching anyway. I
have submitted a proposal to them that says that, but suggests that
certain properties can be grandfathered in.

Perl users have come to expect that /\p{Uppercase}/i would match lower
case letters, and have written bug reports that it doesn't. 5.13.X
fixed those, but in addition added the unintended wrinkle from the
example above.

The design is for mktables to generate tables for /i matching for the
few properties that have differences, and to create a hash mapping the
standard table to the /i table, which is read by utf8_heavy.pl.
regcomp.c munges the names of all properties under /i to be __foo_i.
The two initial underscores make sure there is no conflict with existing
tables whose names begin with a single underscore.
utf8_heavy strips these off, and computes the table as normal from the
remaining unmunged name. At the last moment, it looks up that name in
the list of those that have /i tables, and substitutes if found. This
completely hides all this from the swash mechanism and regexec.c.

This can't be completely hidden from user-defined properties. Now, a
boolean will be passed to those subroutines indicating if /i is in
effect or not. They are free to ignore it, but they can return a
different set of code points depending on its value. They will be called
once for each type, and the results cached by the normal swash
mechanism, which thinks these are two different properties.
---
 regcomp.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/regcomp.c b/regcomp.c
index 5bece17..aa05006 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -8616,8 +8616,18 @@ parseit:
             n--;
         }
     }
-    Perl_sv_catpvf(aTHX_ listsv, "%cutf8::%.*s\n",
-        (value=='p' ? '+' : '!'), (int)n, RExC_parse);
+
+    /* Add the property name to the list. If /i matching, give
+     * a different name which consists of the normal name
+     * sandwiched between two underscores and '_i'. The design
+     * is discussed in the commit message for this. */
+    Perl_sv_catpvf(aTHX_ listsv, "%cutf8::%s%.*s%s\n",
+        (value=='p' ? '+' : '!'),
+        (FOLD) ? "__" : "",
+        (int)n,
+        RExC_parse,
+        (FOLD) ? "_i" : ""
+        );
 }
 RExC_parse = e + 1;
-- 
1.8.3.1
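
For illustration only, the name munging described in the commit message can be sketched in standalone C. This is not code from regcomp.c or utf8_heavy.pl: the helper names build_entry and strip_fold_munging are hypothetical, plain snprintf stands in for Perl_sv_catpvf, and the stripping step (which utf8_heavy.pl performs in Perl) is shown in C only to make the round trip visible.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not in regcomp.c): format one list entry the way
 * the patched Perl_sv_catpvf call does. 'value' is 'p' or 'P', 'fold' is
 * nonzero under /i, and 'name' holds at least 'n' bytes of property name. */
static int build_entry(char *buf, size_t size, char value, int fold,
                       const char *name, int n)
{
    return snprintf(buf, size, "%cutf8::%s%.*s%s\n",
                    (value == 'p') ? '+' : '!',
                    fold ? "__" : "",
                    n, name,
                    fold ? "_i" : "");
}

/* Hypothetical helper: undo the munging, as the commit message says
 * utf8_heavy does. Copies the unmunged name into 'out' and returns 1 if
 * the "__" prefix and "_i" suffix were present, otherwise copies the name
 * unchanged and returns 0. */
static int strip_fold_munging(const char *name, char *out, size_t size)
{
    size_t len = strlen(name);

    if (len > 4 && strncmp(name, "__", 2) == 0
        && strcmp(name + len - 2, "_i") == 0) {
        snprintf(out, size, "%.*s", (int)(len - 4), name + 2);
        return 1;
    }
    snprintf(out, size, "%s", name);
    return 0;
}
```

Under these assumptions, build_entry with value 'p', fold set, and the name Uppercase yields the entry +utf8::__Uppercase_i, and strip_fold_munging on __Uppercase_i recovers Uppercase and reports that /i was in effect. The lookup in the mktables-generated hash would follow that step and is not shown here.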