This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5
regen/regcharclass.pl: Rmv unused macro
[perl5.git] / regen / regcharclass.pl
CommitLineData
1f00b0d6 1#!perl
e64b1bd1 2package CharClass::Matcher;
12b72891 3use strict;
8770da0e 4use 5.008;
12b72891 5use warnings;
e64b1bd1 6use warnings FATAL => 'all';
12b72891 7use Data::Dumper;
e64b1bd1 8$Data::Dumper::Useqq= 1;
12b72891 9
75929b4b
KW
10sub DEBUG () { 0 }
11$|=1 if DEBUG;
12
3d7c117d
MB
13require './regen/regen_lib.pl';
14require './regen/charset_translations.pl';
4c9bc717 15require "./regen/regcharclass_multi_char_folds.pl";
8770da0e 16
ab84f958 17=head1 NAME
0ccab2bc 18
e64b1bd1 19CharClass::Matcher -- Generate C macros that match character classes efficiently
12b72891 20
e64b1bd1
YO
21=head1 SYNOPSIS
22
944ff787 23 perl regen/regcharclass.pl
e64b1bd1
YO
24
25=head1 DESCRIPTION
12b72891
RGS
26
27Dynamically generates macros for detecting special charclasses
e64b1bd1 28in latin-1, utf8, and codepoint forms. Macros can be set to return
cc08b31c 29the length (in bytes) of the matched codepoint, and/or the codepoint itself.
12b72891 30
cc08b31c 31To regenerate F<regcharclass.h>, run this script from perl-root. No arguments
12b72891
RGS
32are necessary.
33
cc08b31c
KW
34Using WHATEVER as an example the following macros can be produced, depending
35on the input parameters (how to get each is described by internal comments at
36the C<__DATA__> line):
12b72891
RGS
37
38=over 4
39
cc08b31c 40=item C<is_WHATEVER(s,is_utf8)>
12b72891 41
cc08b31c 42=item C<is_WHATEVER_safe(s,e,is_utf8)>
12b72891 43
cc08b31c 44Do a lookup as appropriate based on the C<is_utf8> flag. When possible
ab473f03 45comparisons involving octet<128 are done before checking the C<is_utf8>
12b72891
RGS
46flag, hopefully saving time.
47
cc08b31c
KW
48The version without the C<_safe> suffix should be used only when the input is
49known to be well-formed.
12b72891 50
cc08b31c
KW
51=item C<is_WHATEVER_utf8(s)>
52
53=item C<is_WHATEVER_utf8_safe(s,e)>
12b72891
RGS
54
55Do a lookup assuming the string is encoded in (normalized) UTF8.
56
cc08b31c
KW
57The version without the C<_safe> suffix should be used only when the input is
58known to be well-formed.
59
60=item C<is_WHATEVER_latin1(s)>
12b72891 61
cc08b31c 62=item C<is_WHATEVER_latin1_safe(s,e)>
12b72891
RGS
63
64Do a lookup assuming the string is encoded in latin-1 (aka plan octets).
65
cc08b31c
KW
66The version without the C<_safe> suffix should be used only when it is known
67that C<s> contains at least one character.
68
69=item C<is_WHATEVER_cp(cp)>
12b72891 70
47e01c32 71Check to see if the string matches a given codepoint (hypothetically a
b6a6e956 72U32). The condition is constructed as to "break out" as early as
12b72891
RGS
73possible if the codepoint is out of range of the condition.
74
75IOW:
76
77 (cp==X || (cp>X && (cp==Y || (cp>Y && ...))))
78
79Thus if the character is X+1 only two comparisons will be done. Making
80matching lookups slower, but non-matching faster.
81
cc08b31c
KW
82=item C<what_len_WHATEVER_FOO(arg1, ..., len)>
83
84A variant form of each of the macro types described above can be generated, in
85which the code point is returned by the macro, and an extra parameter (in the
86final position) is added, which is a pointer for the macro to set the byte
87length of the returned code point.
88
89These forms all have a C<what_len> prefix instead of the C<is_>, for example
90C<what_len_WHATEVER_safe(s,e,is_utf8,len)> and
91C<what_len_WHATEVER_utf8(s,len)>.
92
93These forms should not be used I<except> on small sets of mostly widely
94separated code points; otherwise the code generated is inefficient. For these
95cases, it is best to use the C<is_> forms, and then find the code point with
96C<utf8_to_uvchr_buf>(). This program can fail with a "deep recursion"
97message on the worst of the inappropriate sets. Examine the generated macro
98to see if it is acceptable.
12b72891 99
cc08b31c
KW
100=item C<what_WHATEVER_FOO(arg1, ...)>
101
102A variant form of each of the C<is_> macro types described above can be generated, in
103which the code point and not the length is returned by the macro. These have
104the same caveat as L</what_len_WHATEVER_FOO(arg1, ..., len)>, plus they should
105not be used where the set contains a NULL, as 0 is returned for two different
106cases: a) the set doesn't include the input code point; b) the set does
107include it, and it is a NULL.
108
109=back
e64b1bd1 110
40f914fd
KW
111The above isn't quite complete, as for specialized purposes one can get a
112macro like C<is_WHATEVER_utf8_no_length_checks(s)>, which assumes that it is
113already known that there is enough space to hold the character starting at
114C<s>, but otherwise checks that it is well-formed. In other words, this is
115intermediary in checking between C<is_WHATEVER_utf8(s)> and
116C<is_WHATEVER_utf8_safe(s,e)>.
117
e64b1bd1
YO
118=head2 CODE FORMAT
119
120perltidy -st -bt=1 -bbt=0 -pt=0 -sbt=1 -ce -nwls== "%f"
121
122
123=head1 AUTHOR
124
cc08b31c 125Author: Yves Orton (demerphq) 2007. Maintained by Perl5 Porters.
e64b1bd1
YO
126
127=head1 BUGS
128
129No tests directly here (although the regex engine will fail tests
130if this code is broken). Insufficient documentation and no Getopts
131handler for using the module as a script.
132
133=head1 LICENSE
134
135You may distribute under the terms of either the GNU General Public
136License or the Artistic License, as specified in the README file.
137
12b72891
RGS
138=cut
139
e64b1bd1
YO
140# Sub naming convention:
141# __func : private subroutine, can not be called as a method
142# _func : private method, not meant for external use
143# func : public method.
144
145# private subs
146#-------------------------------------------------------------------------------
147#
148# ($cp,$n,$l,$u)=__uni_latin($str);
149#
47e01c32 150# Return a list of arrays, each of which when interpreted correctly
e64b1bd1
YO
151# represent the string in some given encoding with specific conditions.
152#
153# $cp - list of codepoints that make up the string.
295bcca9
KW
154# $n - list of octets that make up the string if all codepoints are invariant
155# regardless of if the string is in UTF-8 or not.
e64b1bd1 156# $l - list of octets that make up the string in latin1 encoding if all
295bcca9
KW
157# codepoints < 256, and at least one codepoint is UTF-8 variant.
158# $u - list of octets that make up the string in utf8 if any codepoint is
159# UTF-8 variant
e64b1bd1
YO
160#
161# High CP | Defined
162#-----------+----------
295bcca9 163# 0 - 127 : $n (127/128 are the values for ASCII platforms)
e64b1bd1
YO
164# 128 - 255 : $l, $u
165# 256 - ... : $u
166#
167
168sub __uni_latin1 {
e272994f
KW
169 my $charset = shift;
170 my $a2n= shift;
e64b1bd1
YO
171 my $str= shift;
172 my $max= 0;
173 my @cp;
900c17f9 174 my @cp_high;
295bcca9 175 my $only_has_invariants = 1;
e64b1bd1
YO
176 for my $ch ( split //, $str ) {
177 my $cp= ord $ch;
e64b1bd1 178 $max= $cp if $max < $cp;
a1b2a50f
KW
179 if ($cp > 255) {
180 push @cp, $cp;
181 push @cp_high, $cp;
182 }
183 else {
c30a0cf2 184 push @cp, $a2n->[$cp];
295bcca9 185 }
e64b1bd1
YO
186 }
187 my ( $n, $l, $u );
a1b2a50f 188 $only_has_invariants = ($charset =~ /ascii/i) ? $max < 128 : $max < 160;
295bcca9 189 if ($only_has_invariants) {
e64b1bd1
YO
190 $n= [@cp];
191 } else {
192 $l= [@cp] if $max && $max < 256;
193
a1b2a50f
KW
194 my @u;
195 for my $ch ( split //, $str ) {
196 push @u, map { ord } split //, cp_2_utfbytes(ord $ch, $charset);
197 }
198 $u = \@u;
12b72891 199 }
900c17f9 200 return ( \@cp, \@cp_high, $n, $l, $u );
12b72891
RGS
201}
202
12b72891 203#
e64b1bd1
YO
204# $clean= __clean($expr);
205#
206# Cleanup a ternary expression, removing unnecessary parens and apply some
207# simplifications using regexes.
208#
209
210sub __clean {
211 my ( $expr )= @_;
8fdb8a9d 212
9a3182e9
YO
213 #return $expr;
214
e64b1bd1
YO
215 our $parens;
216 $parens= qr/ (?> \( (?> (?: (?> [^()]+ ) | (??{ $parens }) )* ) \) ) /x;
217
8fdb8a9d 218 ## remove redundant parens
e64b1bd1 219 1 while $expr =~ s/ \( \s* ( $parens ) \s* \) /$1/gx;
8fdb8a9d
YO
220
221
222 # repeatedly simplify conditions like
223 # ( (cond1) ? ( (cond2) ? X : Y ) : Y )
224 # into
225 # ( ( (cond1) && (cond2) ) ? X : Y )
6c4f0678
YO
226 # Also similarly handles expressions like:
227 # : (cond1) ? ( (cond2) ? X : Y ) : Y )
228 # Note the inclusion of the close paren in ([:()]) and the open paren in ([()]) is
229 # purely to ensure we have a balanced set of parens in the expression which makes
230 # it easier to understand the pattern in an editor that understands paren's, we do
231 # not expect either of these cases to actually fire. - Yves
8fdb8a9d 232 1 while $expr =~ s/
6c4f0678 233 ([:()]) \s*
8fdb8a9d
YO
234 ($parens) \s*
235 \? \s*
236 \( \s* ($parens) \s*
6c4f0678
YO
237 \? \s* ($parens|[^()?:\s]+?) \s*
238 : \s* ($parens|[^()?:\s]+?) \s*
8fdb8a9d 239 \) \s*
6c4f0678
YO
240 : \s* \5 \s*
241 ([()])
242 /$1 ( $2 && $3 ) ? $4 : $5 $6/gx;
39a0f513
YO
243 #$expr=~s/\(\(U8\*\)s\)\[(\d+)\]/S$1/g if length $expr > 8000;
244 #$expr=~s/\s+//g if length $expr > 8000;
245
246 die "Expression too long" if length $expr > 8000;
8fdb8a9d 247
e64b1bd1 248 return $expr;
12b72891
RGS
249}
250
e64b1bd1
YO
251#
252# $text= __macro(@args);
253# Join args together by newlines, and then neatly add backslashes to the end
254# of every line as expected by the C pre-processor for #define's.
255#
256
257sub __macro {
258 my $str= join "\n", @_;
259 $str =~ s/\s*$//;
260 my @lines= map { s/\s+$//; s/\t/ /g; $_ } split /\n/, $str;
261 my $last= pop @lines;
262 $str= join "\n", ( map { sprintf "%-76s\\", $_ } @lines ), $last;
263 1 while $str =~ s/^(\t*) {8}/$1\t/gm;
264 return $str . "\n";
12b72891
RGS
265}
266
e64b1bd1
YO
267#
268# my $op=__incrdepth($op);
269#
270# take an 'op' hashref and add one to it and all its childrens depths.
271#
272
273sub __incrdepth {
274 my $op= shift;
275 return unless ref $op;
276 $op->{depth} += 1;
277 __incrdepth( $op->{yes} );
278 __incrdepth( $op->{no} );
279 return $op;
280}
281
282# join two branches of an opcode together with a condition, incrementing
283# the depth on the yes branch when we do so.
284# returns the new root opcode of the tree.
285sub __cond_join {
286 my ( $cond, $yes, $no )= @_;
5ab0c3af 287 if (ref $yes) {
3ff97bcf
KW
288 return {
289 test => $cond,
290 yes => __incrdepth( $yes ),
291 no => $no,
292 depth => 0,
293 };
5ab0c3af
KW
294 }
295 else {
296 return {
297 test => $cond,
298 yes => $yes,
299 no => __incrdepth($no),
300 depth => 0,
301 };
302 }
e64b1bd1
YO
303}
304
8c112bb9
KW
305my $hex_fmt= "0x%02X";
306
307sub val_fmt
308{
309 my $self = shift;
310 my $arg = shift;
311
fdc26d94
KW
312 # Format 'arg' using the printable character if it has one, or a %x if
313 # not, returning a string containing the result
314
8c112bb9
KW
315 # Return what always returned for an unexpected argument
316 return $hex_fmt unless defined $arg && $arg !~ /\D/;
317
fdc26d94
KW
318 # We convert only things inside Latin1
319 if ($arg < 256) {
320
321 # Find the ASCII equivalent of this argument (as the current character
322 # set might not be ASCII)
323 my $char = chr $self->{n2a}->[$arg];
324
325 # If printable, return it, escaping \ and '
326 return "'$char'" if $char =~ /[^\\'[:^print:]]/a;
327 return "'\\\\'" if $char eq "\\";
328 return "'\''" if $char eq "'";
329
330 # Handle the mnemonic controls
331 my $pos = index("\a\b\e\f\n\r\t\cK", $char);
332 return "'\\" . substr("abefnrtv", $pos, 1) . "'" if $pos >= 0;
333 }
334
8c112bb9
KW
335 # Otherwise, just the input, formatted
336 return sprintf $hex_fmt, $arg;
337}
338
e64b1bd1
YO
339# Methods
340
341# constructor
342#
343# my $obj=CLASS->new(op=>'SOMENAME',title=>'blah',txt=>[..]);
344#
345# Create a new CharClass::Matcher object by parsing the text in
346# the txt array. Currently applies the following rules:
347#
348# Element starts with C<0x>, line is evaled the result treated as
349# a number which is passed to chr().
350#
351# Element starts with C<">, line is evaled and the result treated
352# as a string.
353#
354# Each string is then stored in the 'strs' subhash as a hash record
355# made up of the results of __uni_latin1, using the keynames
b1af8fef 356# 'low','latin1','utf8', as well as the synthesized 'LATIN1', 'high', and
b6a6e956 357# 'UTF8' which hold a merge of 'low' and their lowercase equivalents.
e64b1bd1
YO
358#
359# Size data is tracked per type in the 'size' subhash.
360#
361# Return an object
fdc26d94
KW
362
363my %n2a; # Inversion of a2n, for each character set
364
12b72891
RGS
365sub new {
366 my $class= shift;
e64b1bd1
YO
367 my %opt= @_;
368 for ( qw(op txt) ) {
369 die "in " . __PACKAGE__ . " constructor '$_;' is a mandatory field"
370 if !exists $opt{$_};
371 }
372
373 my $self= bless {
374 op => $opt{op},
375 title => $opt{title} || '',
376 }, $class;
e272994f
KW
377
378 my $charset = $opt{charset};
379 my $a2n = get_a2n($charset);
380
fdc26d94
KW
381 # We need to construct the map going the other way if not already done
382 unless (defined $n2a{$charset}) {
383 for (my $i = 0; $i < 256; $i++) {
384 $n2a{$charset}->[$a2n->[$i]] = $i;
385 }
386 }
387
e64b1bd1
YO
388 foreach my $txt ( @{ $opt{txt} } ) {
389 my $str= $txt;
390 if ( $str =~ /^[""]/ ) {
391 $str= eval $str;
05b688d9
KW
392 } elsif ($str =~ / - /x ) { # A range: Replace this element on the
393 # list with its expansion
394 my ($lower, $upper) = $str =~ / 0x (.+?) \s* - \s* 0x (.+) /x;
395 die "Format must be like '0xDEAD - 0xBEAF'; instead was '$str'" if ! defined $lower || ! defined $upper;
396 foreach my $cp (hex $lower .. hex $upper) {
397 push @{$opt{txt}}, sprintf "0x%X", $cp;
398 }
399 next;
295bcca9
KW
400 } elsif ($str =~ s/ ^ N (?= 0x ) //x ) {
401 # Otherwise undocumented, a leading N means is already in the
402 # native character set; don't convert.
e64b1bd1 403 $str= chr eval $str;
295bcca9
KW
404 } elsif ( $str =~ /^0x/ ) {
405 $str= eval $str;
295bcca9 406 $str = chr $str;
05b688d9
KW
407 } elsif ( $str =~ / \s* \\p \{ ( .*? ) \} /x) {
408 my $property = $1;
409 use Unicode::UCD qw(prop_invlist);
410
411 my @invlist = prop_invlist($property, '_perl_core_internal_ok');
412 if (! @invlist) {
413
414 # An empty return could mean an unknown property, or merely
415 # that it is empty. Call in scalar context to differentiate
416 my $count = prop_invlist($property, '_perl_core_internal_ok');
417 die "$property not found" unless defined $count;
418 }
419
420 # Replace this element on the list with the property's expansion
421 for (my $i = 0; $i < @invlist; $i += 2) {
422 foreach my $cp ($invlist[$i] .. $invlist[$i+1] - 1) {
295bcca9
KW
423
424 # prop_invlist() returns native values; add leading 'N'
425 # to indicate that.
426 push @{$opt{txt}}, sprintf "N0x%X", $cp;
05b688d9
KW
427 }
428 }
429 next;
60910c93
KW
430 } elsif ($str =~ / ^ do \s+ ( .* ) /x) {
431 die "do '$1' failed: $!$@" if ! do $1 or $@;
432 next;
433 } elsif ($str =~ / ^ & \s* ( .* ) /x) { # user-furnished sub() call
434 my @results = eval "$1";
435 die "eval '$1' failed: $@" if $@;
436 push @{$opt{txt}}, @results;
437 next;
12b72891 438 } else {
5e6c6c1e 439 die "Unparsable line: $txt\n";
12b72891 440 }
e272994f 441 my ( $cp, $cp_high, $low, $latin1, $utf8 )= __uni_latin1($charset, $a2n, $str );
e64b1bd1
YO
442 my $UTF8= $low || $utf8;
443 my $LATIN1= $low || $latin1;
b1af8fef 444 my $high = (scalar grep { $_ < 256 } @$cp) ? 0 : $utf8;
dda856b2
YO
445 #die Dumper($txt,$cp,$low,$latin1,$utf8)
446 # if $txt=~/NEL/ or $utf8 and @$utf8>3;
e64b1bd1 447
900c17f9
KW
448 @{ $self->{strs}{$str} }{qw( str txt low utf8 latin1 high cp cp_high UTF8 LATIN1 )}=
449 ( $str, $txt, $low, $utf8, $latin1, $high, $cp, $cp_high, $UTF8, $LATIN1 );
e64b1bd1 450 my $rec= $self->{strs}{$str};
900c17f9 451 foreach my $key ( qw(low utf8 latin1 high cp cp_high UTF8 LATIN1) ) {
e64b1bd1
YO
452 $self->{size}{$key}{ 0 + @{ $self->{strs}{$str}{$key} } }++
453 if $self->{strs}{$str}{$key};
12b72891 454 }
e64b1bd1
YO
455 $self->{has_multi} ||= @$cp > 1;
456 $self->{has_ascii} ||= $latin1 && @$latin1;
457 $self->{has_low} ||= $low && @$low;
458 $self->{has_high} ||= !$low && !$latin1;
12b72891 459 }
fdc26d94 460 $self->{n2a} = $n2a{$charset};
e64b1bd1 461 $self->{count}= 0 + keys %{ $self->{strs} };
12b72891
RGS
462 return $self;
463}
464
e64b1bd1 465# my $trie = make_trie($type,$maxlen);
12b72891 466#
47e01c32 467# using the data stored in the object build a trie of a specific type,
e64b1bd1
YO
468# and with specific maximum depth. The trie is made up the elements of
469# the given types array for each string in the object (assuming it is
470# not too long.)
471#
47e01c32 472# returns the trie, or undef if there was no relevant data in the object.
e64b1bd1
YO
473#
474
475sub make_trie {
476 my ( $self, $type, $maxlen )= @_;
477
478 my $strs= $self->{strs};
479 my %trie;
480 foreach my $rec ( values %$strs ) {
481 die "panic: unknown type '$type'"
482 if !exists $rec->{$type};
483 my $dat= $rec->{$type};
484 next unless $dat;
485 next if $maxlen && @$dat > $maxlen;
486 my $node= \%trie;
487 foreach my $elem ( @$dat ) {
488 $node->{$elem} ||= {};
489 $node= $node->{$elem};
12b72891 490 }
e64b1bd1 491 $node->{''}= $rec->{str};
12b72891 492 }
e64b1bd1 493 return 0 + keys( %trie ) ? \%trie : undef;
12b72891
RGS
494}
495
2efb8143
KW
496sub pop_count ($) {
497 my $word = shift;
498
499 # This returns a list of the positions of the bits in the input word that
500 # are 1.
501
502 my @positions;
503 my $position = 0;
504 while ($word) {
505 push @positions, $position if $word & 1;
506 $position++;
507 $word >>= 1;
508 }
509 return @positions;
510}
511
e64b1bd1
YO
512# my $optree= _optree()
513#
514# recursively convert a trie to an optree where every node represents
515# an if else branch.
12b72891 516#
12b72891 517#
12b72891 518
e64b1bd1
YO
519sub _optree {
520 my ( $self, $trie, $test_type, $ret_type, $else, $depth )= @_;
521 return unless defined $trie;
522 if ( $self->{has_multi} and $ret_type =~ /cp|both/ ) {
523 die "Can't do 'cp' optree from multi-codepoint strings";
12b72891 524 }
e64b1bd1
YO
525 $ret_type ||= 'len';
526 $else= 0 unless defined $else;
527 $depth= 0 unless defined $depth;
528
b6a6e956 529 # if we have an empty string as a key it means we are in an
e405c23a
YO
530 # accepting state and unless we can match further on should
531 # return the value of the '' key.
895e25a5 532 if (exists $trie->{''} ) {
e405c23a
YO
533 # we can now update the "else" value, anything failing to match
534 # after this point should return the value from this.
e64b1bd1
YO
535 if ( $ret_type eq 'cp' ) {
536 $else= $self->{strs}{ $trie->{''} }{cp}[0];
8c112bb9 537 $else= $self->val_fmt($else) if $else > 9;
e64b1bd1
YO
538 } elsif ( $ret_type eq 'len' ) {
539 $else= $depth;
540 } elsif ( $ret_type eq 'both') {
541 $else= $self->{strs}{ $trie->{''} }{cp}[0];
8c112bb9 542 $else= $self->val_fmt($else) if $else > 9;
e64b1bd1 543 $else= "len=$depth, $else";
12b72891 544 }
e64b1bd1 545 }
e405c23a
YO
546 # extract the meaningful keys from the trie, filter out '' as
547 # it means we are an accepting state (end of sequence).
548 my @conds= sort { $a <=> $b } grep { length $_ } keys %$trie;
549
b6a6e956 550 # if we haven't any keys there is no further we can match and we
e405c23a 551 # can return the "else" value.
e64b1bd1 552 return $else if !@conds;
e405c23a 553
29f3ce8f 554 my $test = $test_type =~ /^cp/ ? "cp" : "((const U8*)s)[$depth]";
e405c23a 555
c7c8bf55 556 # First we loop over the possible keys/conditions and find out what they
3ff97bcf 557 # look like; we group conditions with the same optree together.
9a3182e9
YO
558 my %dmp_res;
559 my @res_order;
e405c23a
YO
560 local $Data::Dumper::Sortkeys=1;
561 foreach my $cond ( @conds ) {
562
563 # get the optree for this child/condition
564 my $res= $self->_optree( $trie->{$cond}, $test_type, $ret_type, $else, $depth + 1 );
565 # convert it to a string with Dumper
e64b1bd1 566 my $res_code= Dumper( $res );
e405c23a 567
9a3182e9
YO
568 push @{$dmp_res{$res_code}{vals}}, $cond;
569 if (!$dmp_res{$res_code}{optree}) {
570 $dmp_res{$res_code}{optree}= $res;
571 push @res_order, $res_code;
572 }
573 }
574
575 # now that we have deduped the optrees we construct a new optree containing the merged
576 # results.
577 my %root;
578 my $node= \%root;
579 foreach my $res_code_idx (0 .. $#res_order) {
580 my $res_code= $res_order[$res_code_idx];
581 $node->{vals}= $dmp_res{$res_code}{vals};
582 $node->{test}= $test;
583 $node->{yes}= $dmp_res{$res_code}{optree};
584 $node->{depth}= $depth;
585 if ($res_code_idx < $#res_order) {
586 $node= $node->{no}= {};
12b72891 587 } else {
9a3182e9 588 $node->{no}= $else;
12b72891
RGS
589 }
590 }
e405c23a
YO
591
592 # return the optree.
593 return \%root;
12b72891
RGS
594}
595
e64b1bd1
YO
596# my $optree= optree(%opts);
597#
598# Convert a trie to an optree, wrapper for _optree
599
600sub optree {
601 my $self= shift;
602 my %opt= @_;
603 my $trie= $self->make_trie( $opt{type}, $opt{max_depth} );
604 $opt{ret_type} ||= 'len';
900c17f9 605 my $test_type= $opt{type} =~ /^cp/ ? 'cp' : 'depth';
e64b1bd1 606 return $self->_optree( $trie, $test_type, $opt{ret_type}, $opt{else}, 0 );
12b72891
RGS
607}
608
e64b1bd1
YO
609# my $optree= generic_optree(%opts);
610#
611# build a "generic" optree out of the three 'low', 'latin1', 'utf8'
612# sets of strings, including a branch for handling the string type check.
613#
614
615sub generic_optree {
616 my $self= shift;
617 my %opt= @_;
618
619 $opt{ret_type} ||= 'len';
620 my $test_type= 'depth';
621 my $else= $opt{else} || 0;
622
623 my $latin1= $self->make_trie( 'latin1', $opt{max_depth} );
624 my $utf8= $self->make_trie( 'utf8', $opt{max_depth} );
625
626 $_= $self->_optree( $_, $test_type, $opt{ret_type}, $else, 0 )
627 for $latin1, $utf8;
628
629 if ( $utf8 ) {
630 $else= __cond_join( "( is_utf8 )", $utf8, $latin1 || $else );
631 } elsif ( $latin1 ) {
632 $else= __cond_join( "!( is_utf8 )", $latin1, $else );
633 }
87894a24 634 if ($opt{type} eq 'generic') {
61de6bbc
KW
635 my $low= $self->make_trie( 'low', $opt{max_depth} );
636 if ( $low ) {
637 $else= $self->_optree( $low, $test_type, $opt{ret_type}, $else, 0 );
638 }
87894a24 639 }
e64b1bd1
YO
640
641 return $else;
12b72891
RGS
642}
643
e64b1bd1 644# length_optree()
12b72891 645#
e64b1bd1 646# create a string length guarded optree.
12b72891 647#
e64b1bd1
YO
648
649sub length_optree {
650 my $self= shift;
651 my %opt= @_;
652 my $type= $opt{type};
653
654 die "Can't do a length_optree on type 'cp', makes no sense."
900c17f9 655 if $type =~ /^cp/;
e64b1bd1 656
5ab0c3af
KW
657 my $else= ( $opt{else} ||= 0 );
658
c03e41dd
KW
659 return $else if $self->{count} == 0;
660
5ab0c3af
KW
661 my $method = $type =~ /generic/ ? 'generic_optree' : 'optree';
662 if ($method eq 'optree' && scalar keys %{$self->{size}{$type}} == 1) {
663
664 # Here is non-generic output (meaning that we are only generating one
665 # type), and all things that match have the same number ('size') of
666 # bytes. The length guard is simply that we have that number of
667 # bytes.
668 my @size = keys %{$self->{size}{$type}};
669 my $cond= "((e) - (s)) >= $size[0]";
670 my $optree = $self->$method(%opt);
671 $else= __cond_join( $cond, $optree, $else );
672 }
673 elsif ($self->{has_multi}) {
674 my @size;
e64b1bd1 675
5ab0c3af
KW
676 # Here, there can be a match of a multiple character string. We use
677 # the traditional method which is to have a branch for each possible
678 # size (longest first) and test for the legal values for that size.
e64b1bd1
YO
679 my %sizes= (
680 %{ $self->{size}{low} || {} },
681 %{ $self->{size}{latin1} || {} },
682 %{ $self->{size}{utf8} || {} }
683 );
5ab0c3af
KW
684 if ($method eq 'generic_optree') {
685 @size= sort { $a <=> $b } keys %sizes;
686 } else {
687 @size= sort { $a <=> $b } keys %{ $self->{size}{$type} };
688 }
689 for my $size ( @size ) {
690 my $optree= $self->$method( %opt, type => $type, max_depth => $size );
691 my $cond= "((e)-(s) > " . ( $size - 1 ).")";
692 $else= __cond_join( $cond, $optree, $else );
693 }
12b72891 694 }
5ab0c3af
KW
695 else {
696 my $utf8;
697
698 # Here, has more than one possible size, and only matches a single
699 # character. For non-utf8, the needed length is 1; for utf8, it is
700 # found by array lookup 'UTF8SKIP'.
701
702 # If want just the code points above 255, set up to look for those;
703 # otherwise assume will be looking for all non-UTF-8-invariant code
704 # poiints.
705 my $trie_type = ($type eq 'high') ? 'high' : 'utf8';
706
707 # If we do want more than the 0-255 range, find those, and if they
708 # exist...
709 if ($opt{type} !~ /latin1/i && ($utf8 = $self->make_trie($trie_type, 0))) {
710
711 # ... get them into an optree, and set them up as the 'else' clause
712 $utf8 = $self->_optree( $utf8, 'depth', $opt{ret_type}, 0, 0 );
713
714 # We could make this
715 # UTF8_IS_START(*s) && ((e) - (s)) >= UTF8SKIP(s))";
716 # to avoid doing the UTF8SKIP and subsequent branches for invariants
717 # that don't match. But the current macros that get generated
718 # have only a few things that can match past this, so I (khw)
719 # don't think it is worth it. (Even better would be to use
720 # calculate_mask(keys %$utf8) instead of UTF8_IS_START, and use it
dd9bc2b0
KW
721 # if it saves a bunch. We assume that input text likely to be
722 # well-formed .
723 my $cond = "LIKELY(((e) - (s)) >= UTF8SKIP(s))";
5ab0c3af
KW
724 $else = __cond_join($cond, $utf8, $else);
725
726 # For 'generic', we also will want the latin1 UTF-8 variants for
727 # the case where the input isn't UTF-8.
728 my $latin1;
729 if ($method eq 'generic_optree') {
730 $latin1 = $self->make_trie( 'latin1', 1);
731 $latin1= $self->_optree( $latin1, 'depth', $opt{ret_type}, 0, 0 );
732 }
e64b1bd1 733
5ab0c3af
KW
734 # If we want the UTF-8 invariants, get those.
735 my $low;
736 if ($opt{type} !~ /non_low|high/
737 && ($low= $self->make_trie( 'low', 1)))
738 {
739 $low= $self->_optree( $low, 'depth', $opt{ret_type}, 0, 0 );
740
741 # Expand out the UTF-8 invariants as a string so that we
742 # can use them as the conditional
743 $low = $self->_cond_as_str( $low, 0, \%opt);
744
745 # If there are Latin1 variants, add a test for them.
746 if ($latin1) {
747 $else = __cond_join("(! is_utf8 )", $latin1, $else);
748 }
749 elsif ($method eq 'generic_optree') {
750
751 # Otherwise for 'generic' only we know that what
752 # follows must be valid for just UTF-8 strings,
753 $else->{test} = "( is_utf8 && $else->{test} )";
754 }
755
756 # If the invariants match, we are done; otherwise we have
757 # to go to the 'else' clause.
758 $else = __cond_join($low, 1, $else);
759 }
760 elsif ($latin1) { # Here, didn't want or didn't have invariants,
761 # but we do have latin variants
762 $else = __cond_join("(! is_utf8)", $latin1, $else);
763 }
764
765 # We need at least one byte available to start off the tests
dd9bc2b0 766 $else = __cond_join("LIKELY((e) > (s))", $else, 0);
5ab0c3af
KW
767 }
768 else { # Here, we don't want or there aren't any variants. A single
769 # byte available is enough.
770 my $cond= "((e) > (s))";
771 my $optree = $self->$method(%opt);
772 $else= __cond_join( $cond, $optree, $else );
773 }
e64b1bd1 774 }
5ab0c3af 775
e64b1bd1 776 return $else;
12b72891
RGS
777}
778
2efb8143 779sub calculate_mask(@) {
75929b4b
KW
780 # Look at the input list of byte values. This routine returns an array of
781 # mask/base pairs to generate that list.
782
2efb8143
KW
783 my @list = @_;
784 my $list_count = @list;
785
75929b4b
KW
786 # Consider a set of byte values, A, B, C .... If we want to determine if
787 # <c> is one of them, we can write c==A || c==B || c==C .... If the
e42cde6b
KW
788 # values are consecutive, we can shorten that to inRANGE(c, 'A', 'Z'),
789 # which uses far fewer branches. If only some of them are consecutive we
790 # can still save some branches by creating range tests for just those that
791 # are consecutive. _cond_as_str() does this work for looking for ranges.
75929b4b
KW
792 #
793 # Another approach is to look at the bit patterns for A, B, C .... and see
794 # if they have some commonalities. That's what this function does. For
795 # example, consider a set consisting of the bytes
e42cde6b
KW
796 # 0x42, 0x43, 0x62, and 0x63. We could write:
797 # inRANGE(c, 0x42, 0x43) || inRANGE(c, 0x62, 0x63)
798 # which through the magic of casting has not 4, but 2 tests. But the
799 # following mask/compare also works, and has just one test:
800 # (c & 0xDE) == 0x42
801 # The reason it works is that the set consists of exactly the 4 bit
802 # patterns which have either 0 or 1 in the two bit positions that are 0 in
803 # the mask. They have the same value in each bit position where the mask
804 # is 1. The comparison makes sure that the result matches all bytes which
805 # match those six 1 bits exactly. This can be applied to bytes that
806 # differ in 1 through all 8 bit positions. In order to be a candidate for
807 # this optimization, the number of bytes in the set must be a power of 2.
75929b4b 808 #
e42cde6b
KW
809 # It may be that the bytes needing to be matched can't be done with a
810 # single mask. But it may be possible to have two (or more) sets, each
811 # with a separate mask. This function attempts to find some way to save
812 # some branches using the mask technique. If not, it returns an empty
813 # list; if so, it returns a list consisting of
75929b4b
KW
814 # [ [compare1, mask1], [compare2, mask2], ...
815 # [compare_n, undef], [compare_m, undef], ...
816 # ]
817 # The <mask> is undef in the above for those bytes that must be tested
818 # for individually.
819 #
820 # This function does not attempt to find the optimal set. To do so would
821 # probably require testing all possible combinations, and keeping track of
822 # the current best one.
823 #
824 # There are probably much better algorithms, but this is the one I (khw)
825 # came up with. We start with doing a bit-wise compare of every byte in
826 # the set with every other byte. The results are sorted into arrays of
827 # all those that differ by the same bit positions. These are stored in a
828 # hash with the each key being the bits they differ in. Here is the hash
829 # for the 0x53, 0x54, 0x73, 0x74 set:
830 # {
831 # 4 => {
832 # "0,1,2,5" => [
833 # 83,
834 # 116,
835 # 84,
836 # 115
837 # ]
838 # },
839 # 3 => {
840 # "0,1,2" => [
841 # 83,
842 # 84,
843 # 115,
844 # 116
845 # ]
846 # }
847 # 1 => {
848 # 5 => [
849 # 83,
850 # 115,
851 # 84,
852 # 116
853 # ]
854 # },
855 # }
856 #
857 # The set consisting of values which differ in the 4 bit positions 0, 1,
858 # 2, and 5 from some other value in the set consists of all 4 values.
859 # Likewise all 4 values differ from some other value in the 3 bit
860 # positions 0, 1, and 2; and all 4 values differ from some other value in
861 # the single bit position 5. The keys at the uppermost level in the above
862 # hash, 1, 3, and 4, give the number of bit positions that each sub-key
863 # below it has. For example, the 4 key could have as its value an array
864 # consisting of "0,1,2,5", "0,1,2,6", and "3,4,6,7", if the inputs were
865 # such. The best optimization will group the most values into a single
866 # mask. The most values will be the ones that differ in the most
867 # positions, the ones with the largest value for the topmost key. These
868 # keys, are thus just for convenience of sorting by that number, and do
869 # not have any bearing on the core of the algorithm.
870 #
871 # We start with an element from largest number of differing bits. The
872 # largest in this case is 4 bits, and there is only one situation in this
873 # set which has 4 differing bits, "0,1,2,5". We look for any subset of
874 # this set which has 16 values that differ in these 4 bits. There aren't
875 # any, because there are only 4 values in the entire set. We then look at
876 # the next possible thing, which is 3 bits differing in positions "0,1,2".
877 # We look for a subset that has 8 values that differ in these 3 bits.
878 # Again there are none. So we go to look for the next possible thing,
879 # which is a subset of 2**1 values that differ only in bit position 5. 83
880 # and 115 do, so we calculate a mask and base for those and remove them
881 # from every set. Since there is only the one set remaining, we remove
882 # them from just this one. We then look to see if there is another set of
883 # 2 values that differ in bit position 5. 84 and 116 do, so we calculate
884 # a mask and base for those and remove them from every set (again only
885 # this set remains in this example). The set is now empty, and there are
886 # no more sets to look at, so we are done.
887
888 if ($list_count == 256) { # All 256 is trivially masked
2efb8143
KW
889 return (0, 0);
890 }
891
75929b4b
KW
892 my %hash;
893
894 # Generate bits-differing lists for each element compared against each
895 # other element
896 for my $i (0 .. $list_count - 2) {
897 for my $j ($i + 1 .. $list_count - 1) {
898 my @bits_that_differ = pop_count($list[$i] ^ $list[$j]);
899 my $differ_count = @bits_that_differ;
900 my $key = join ",", @bits_that_differ;
901 push @{$hash{$differ_count}{$key}}, $list[$i] unless grep { $_ == $list[$i] } @{$hash{$differ_count}{$key}};
902 push @{$hash{$differ_count}{$key}}, $list[$j];
903 }
904 }
2efb8143 905
75929b4b 906 print STDERR __LINE__, ": calculate_mask() called: List of values grouped by differing bits: ", Dumper \%hash if DEBUG;
2efb8143 907
75929b4b
KW
908 my @final_results;
909 foreach my $count (reverse sort { $a <=> $b } keys %hash) {
910 my $need = 2 ** $count; # Need 8 values for 3 differing bits, etc
de6cb0ab 911 foreach my $bits (sort keys $hash{$count}->%*) {
2efb8143 912
75929b4b 913 print STDERR __LINE__, ": For $count bit(s) difference ($bits), need $need; have ", scalar @{$hash{$count}{$bits}}, "\n" if DEBUG;
2efb8143 914
75929b4b
KW
915 # Look only as long as there are at least as many elements in the
916 # subset as are needed
917 while ((my $cur_count = @{$hash{$count}{$bits}}) >= $need) {
2efb8143 918
75929b4b 919 print STDERR __LINE__, ": Looking at bit positions ($bits): ", Dumper $hash{$count}{$bits} if DEBUG;
2efb8143 920
75929b4b
KW
921 # Start with the first element in it
922 my $try_base = $hash{$count}{$bits}[0];
923 my @subset = $try_base;
924
925 # If it succeeds, we return a mask and a base to compare
926 # against the masked value. That base will be the AND of
927 # every element in the subset. Initialize to the one element
928 # we have so far.
929 my $compare = $try_base;
930
931 # We are trying to find a subset of this that has <need>
932 # elements that differ in the bit positions given by the
933 # string $bits, which is comma separated.
934 my @bits = split ",", $bits;
935
936 TRY: # Look through the remainder of the list for other
937 # elements that differ only by these bit positions.
938
939 for (my $i = 1; $i < $cur_count; $i++) {
940 my $try_this = $hash{$count}{$bits}[$i];
941 my @positions = pop_count($try_base ^ $try_this);
942
943 print STDERR __LINE__, ": $try_base vs $try_this: is (", join(',', @positions), ") a subset of ($bits)?" if DEBUG;;
944
945 foreach my $pos (@positions) {
946 unless (grep { $pos == $_ } @bits) {
947 print STDERR " No\n" if DEBUG;
948 my $remaining = $cur_count - $i - 1;
949 if ($remaining && @subset + $remaining < $need) {
950 print STDERR __LINE__, ": Can stop trying $try_base, because even if all the remaining $remaining values work, they wouldn't add up to the needed $need when combined with the existing ", scalar @subset, " ones\n" if DEBUG;
951 last TRY;
952 }
953 next TRY;
954 }
955 }
956
957 print STDERR " Yes\n" if DEBUG;
958 push @subset, $try_this;
959
960 # Add this to the mask base, in case it ultimately
961 # succeeds,
962 $compare &= $try_this;
963 }
964
965 print STDERR __LINE__, ": subset (", join(", ", @subset), ") has ", scalar @subset, " elements; needs $need\n" if DEBUG;
966
967 if (@subset < $need) {
968 shift @{$hash{$count}{$bits}};
969 next; # Try with next value
970 }
2efb8143 971
75929b4b
KW
972 # Create the mask
973 my $mask = 0;
974 foreach my $position (@bits) {
975 $mask |= 1 << $position;
976 }
977 $mask = ~$mask & 0xFF;
978 push @final_results, [$compare, $mask];
979
980 printf STDERR "%d: Got it: compare=%d=0x%X; mask=%X\n", __LINE__, $compare, $compare, $mask if DEBUG;
981
982 # These values are now spoken for. Remove them from future
983 # consideration
122a2d8f
YO
984 foreach my $remove_count (sort keys %hash) {
985 foreach my $bits (sort keys %{$hash{$remove_count}}) {
75929b4b
KW
986 foreach my $to_remove (@subset) {
987 @{$hash{$remove_count}{$bits}} = grep { $_ != $to_remove } @{$hash{$remove_count}{$bits}};
988 }
989 }
990 }
991 }
992 }
2efb8143
KW
993 }
994
75929b4b
KW
995 # Any values that remain in the list are ones that have to be tested for
996 # individually.
997 my @individuals;
998 foreach my $count (reverse sort { $a <=> $b } keys %hash) {
de6cb0ab 999 foreach my $bits (sort keys $hash{$count}->%*) {
75929b4b
KW
1000 foreach my $remaining (@{$hash{$count}{$bits}}) {
1001
1002 # If we already know about this value, just ignore it.
1003 next if grep { $remaining == $_ } @individuals;
1004
1005 # Otherwise it needs to be returned as something to match
1006 # individually
1007 push @final_results, [$remaining, undef];
1008 push @individuals, $remaining;
1009 }
1010 }
2efb8143 1011 }
2efb8143 1012
75929b4b
KW
1013 # Sort by increasing numeric value
1014 @final_results = sort { $a->[0] <=> $b->[0] } @final_results;
1015
1016 print STDERR __LINE__, ": Final return: ", Dumper \@final_results if DEBUG;
1017
1018 return @final_results;
2efb8143
KW
1019}
1020
e64b1bd1
YO
1021# _cond_as_str
1022# turn a list of conditions into a text expression
1023# - merges ranges of conditions, and joins the result with ||
1024sub _cond_as_str {
ba073cf2 1025 my ( $self, $op, $combine, $opts_ref )= @_;
e64b1bd1
YO
1026 my $cond= $op->{vals};
1027 my $test= $op->{test};
2efb8143 1028 my $is_cp_ret = $opts_ref->{ret_type} eq "cp";
e64b1bd1
YO
1029 return "( $test )" if !defined $cond;
1030
f5772832 1031 # rangify the list.
e64b1bd1
YO
1032 my @ranges;
1033 my $Update= sub {
f5772832
KW
1034 # We skip this if there are optimizations that
1035 # we can apply (below) to the individual ranges
1036 if ( ($is_cp_ret || $combine) && @ranges && ref $ranges[-1]) {
e64b1bd1
YO
1037 if ( $ranges[-1][0] == $ranges[-1][1] ) {
1038 $ranges[-1]= $ranges[-1][0];
1039 } elsif ( $ranges[-1][0] + 1 == $ranges[-1][1] ) {
1040 $ranges[-1]= $ranges[-1][0];
1041 push @ranges, $ranges[-1] + 1;
1042 }
1043 }
1044 };
4a8ca70e
KW
1045 for my $condition ( @$cond ) {
1046 if ( !@ranges || $condition != $ranges[-1][1] + 1 ) {
e64b1bd1 1047 $Update->();
4a8ca70e 1048 push @ranges, [ $condition, $condition ];
e64b1bd1
YO
1049 } else {
1050 $ranges[-1][1]++;
1051 }
1052 }
1053 $Update->();
f5772832 1054
e64b1bd1
YO
1055 return $self->_combine( $test, @ranges )
1056 if $combine;
f5772832
KW
1057
1058 if ($is_cp_ret) {
1f063c57
KW
1059 @ranges= map {
1060 ref $_
8c112bb9
KW
1061 ? "isRANGE( $test, "
1062 . $self->val_fmt($_[0]) . ", "
1063 . $self->val_fmt($_[1]) . " )"
1064 : $self->val_fmt($_) . " == $test";
1f063c57 1065 } @ranges;
6a52943c
KW
1066
1067 return "( " . join( " || ", @ranges ) . " )";
f5772832 1068 }
75929b4b 1069
2358c533
KW
1070 # If the input set has certain characteristics, we can optimize tests
1071 # for it. This doesn't apply if returning the code point, as we want
1072 # each element of the set individually. The code above is for this
1073 # simpler case.
1074
1075 return 1 if @$cond == 256; # If all bytes match, is trivially true
1076
75929b4b 1077 my @masks;
2358c533 1078 if (@ranges > 1) {
75929b4b 1079
b6a6e956 1080 # See if the entire set shares optimizable characteristics, and if so,
e42cde6b
KW
1081 # return the optimization. There is no need to do this on sets with
1082 # just a single range, as that can be expressed with a single
1083 # conditional.
75929b4b
KW
1084 @masks = calculate_mask(@$cond);
1085
1086 # Stringify the output of calculate_mask()
1087 if (@masks) {
1088 my @return;
1089 foreach my $mask_ref (@masks) {
1090 if (defined $mask_ref->[1]) {
8c112bb9
KW
1091 push @return, "( ( $test & "
1092 . $self->val_fmt($mask_ref->[1]) . " ) == "
1093 . $self->val_fmt($mask_ref->[0]) . " )";
75929b4b
KW
1094 }
1095 else { # An undefined mask means to use the value as-is
8c112bb9 1096 push @return, "$test == " . $self->val_fmt($mask_ref->[0]);
75929b4b
KW
1097 }
1098 }
1099
1100 # The best possible case below for specifying this set of values via
1101 # ranges is 1 branch per range. If our mask method yielded better
1102 # results, there is no sense trying something that is bound to be
1103 # worse.
1104 if (@return < @ranges) {
1105 return "( " . join( " || ", @return ) . " )";
1106 }
1107
1108 @masks = @return;
6e130234 1109 }
2358c533 1110 }
f5772832 1111
75929b4b
KW
1112 # Here, there was no entire-class optimization that was clearly better
1113 # than doing things by ranges. Look at each range.
1114 my $range_count_extra = 0;
2358c533
KW
1115 for (my $i = 0; $i < @ranges; $i++) {
1116 if (! ref $ranges[$i]) { # Trivial case: no range
8c112bb9 1117 $ranges[$i] = $self->val_fmt($ranges[$i]) . " == $test";
2358c533
KW
1118 }
1119 elsif ($ranges[$i]->[0] == $ranges[$i]->[1]) {
1120 $ranges[$i] = # Trivial case: single element range
8c112bb9 1121 $self->val_fmt($ranges[$i]->[0]) . " == $test";
2358c533 1122 }
726137b5
KW
1123 elsif ($ranges[$i]->[0] == 0) {
1124 # If the range matches all 256 possible bytes, it is trivially
1125 # true.
1126 return 1 if $ranges[0]->[1] == 0xFF; # @ranges must be 1 in
1127 # this case
8c112bb9
KW
1128 $ranges[$i] = "( $test <= "
1129 . $self->val_fmt($ranges[$i]->[1]) . " )";
726137b5
KW
1130 }
1131 elsif ($ranges[$i]->[1] == 255) {
1132
1133 # Similarly the max possible is 255, so can omit an upper bound
1134 # test if the calculated max is the max possible one.
8c112bb9 1135 $ranges[$i] = "( $test >= " . $self->val_fmt($ranges[0]->[0]) . " )";
726137b5 1136 }
2358c533
KW
1137 else {
1138 my $output = "";
1139
1140 # Well-formed UTF-8 continuation bytes on ascii platforms must be
1141 # in the range 0x80 .. 0xBF. If we know that the input is
1142 # well-formed (indicated by not trying to be 'safe'), we can omit
1143 # tests that verify that the input is within either of these
1144 # bounds. (No legal UTF-8 character can begin with anything in
1145 # this range, so we don't have to worry about this being a
1146 # continuation byte or not.)
a1b2a50f
KW
1147 if ($opts_ref->{charset} =~ /ascii/i
1148 && (! $opts_ref->{safe} && ! $opts_ref->{no_length_checks})
2358c533
KW
1149 && $opts_ref->{type} =~ / ^ (?: utf8 | high ) $ /xi)
1150 {
2358c533
KW
1151 # If the range is the entire legal range, it matches any legal
1152 # byte, so we can omit both tests. (This should happen only
1153 # if the number of ranges is 1.)
e42cde6b 1154 if ($ranges[$i]->[0] == 0x80 && $ranges[$i]->[1] == 0xBF) {
2358c533 1155 return 1;
6e130234 1156 }
f5772832 1157 }
2358c533 1158
e42cde6b
KW
1159 # Here, it isn't the full range of legal continuation bytes. We
1160 # could just assume that there's nothing outside of the legal
1161 # bounds. But inRANGE() allows us to have a single conditional,
1162 # so the only cost of making sure it's a legal UTF-8 continuation
1163 # byte is an extra subtraction instruction, a trivial expense.
8c112bb9
KW
1164 $ranges[$i] = "inRANGE($test, "
1165 . $self->val_fmt($ranges[$i]->[0]) .", "
1166 . $self->val_fmt($ranges[$i]->[1]) . ")";
f5772832 1167 }
2358c533 1168 }
f5772832 1169
75929b4b
KW
1170 # We have generated the list of bytes in two ways; one trying to use masks
1171 # to cut the number of branches down, and the other to look at individual
1172 # ranges (some of which could be cut down by using a mask for just it).
1173 # We return whichever method uses the fewest branches.
1174 return "( "
1175 . join( " || ", (@masks && @masks < @ranges + $range_count_extra)
1176 ? @masks
1177 : @ranges)
1178 . " )";
12b72891
RGS
1179}
1180
e64b1bd1
YO
1181# _combine
1182# recursively turn a list of conditions into a fast break-out condition
1183# used by _cond_as_str() for 'cp' type macros.
1184sub _combine {
1185 my ( $self, $test, @cond )= @_;
1186 return if !@cond;
1187 my $item= shift @cond;
1188 my ( $cstr, $gtv );
6c62bf0f
KW
1189 if ( ref $item ) { # @item should be a 2-element array giving range start
1190 # and end
1191 if ($item->[0] == 0) { # UV's are never negative, so skip "0 <= "
1192 # test which could generate a compiler warning
1193 # that test is always true
8c112bb9 1194 $cstr= "$test <= " . $self->val_fmt($item->[1]);
6c62bf0f
KW
1195 }
1196 else {
8c112bb9
KW
1197 $cstr = "inRANGE($test, "
1198 . $self->val_fmt($item->[0]) . ", "
1199 . $self->val_fmt($item->[1]) . ")";
6c62bf0f 1200 }
8c112bb9 1201 $gtv= $self->val_fmt($item->[1]);
12b72891 1202 } else {
8c112bb9
KW
1203 $cstr= $self->val_fmt($item) . " == $test";
1204 $gtv= $self->val_fmt($item)
12b72891 1205 }
e64b1bd1 1206 if ( @cond ) {
ee98d22d
YO
1207 my $combine= $self->_combine( $test, @cond );
1208 if (@cond >1) {
1209 return "( $cstr || ( $gtv < $test &&\n"
1210 . $combine . " ) )";
1211 } else {
1212 return "( $cstr || $combine )";
1213 }
12b72891 1214 } else {
e64b1bd1 1215 return $cstr;
12b72891 1216 }
e64b1bd1 1217}
12b72891 1218
e64b1bd1
YO
1219# _render()
1220# recursively convert an optree to text with reasonably neat formatting
1221sub _render {
39a0f513 1222 my ( $self, $op, $combine, $brace, $opts_ref, $def, $submacros )= @_;
2e39f0c2 1223 return 0 if ! defined $op; # The set is empty
e64b1bd1
YO
1224 if ( !ref $op ) {
1225 return $op;
12b72891 1226 }
ba073cf2 1227 my $cond= $self->_cond_as_str( $op, $combine, $opts_ref );
cc08b31c
KW
1228 #no warnings 'recursion'; # This would allow really really inefficient
1229 # code to be generated. See pod
39a0f513 1230 my $yes= $self->_render( $op->{yes}, $combine, 1, $opts_ref, $def, $submacros );
30188af7
KW
1231 return $yes if $cond eq '1';
1232
39a0f513 1233 my $no= $self->_render( $op->{no}, $combine, 0, $opts_ref, $def, $submacros );
e64b1bd1
YO
1234 return "( $cond )" if $yes eq '1' and $no eq '0';
1235 my ( $lb, $rb )= $brace ? ( "( ", " )" ) : ( "", "" );
1236 return "$lb$cond ? $yes : $no$rb"
1237 if !ref( $op->{yes} ) && !ref( $op->{no} );
1238 my $ind1= " " x 4;
1239 my $ind= "\n" . ( $ind1 x $op->{depth} );
1240
1241 if ( ref $op->{yes} ) {
1242 $yes= $ind . $ind1 . $yes;
1243 } else {
1244 $yes= " " . $yes;
1245 }
1246
39a0f513
YO
1247 my $str= "$lb$cond ?$yes$ind: $no$rb";
1248 if (length $str > 6000) {
1249 push @$submacros, sprintf "#define $def\n( %s )", "_part" . (my $yes_idx= 0+@$submacros), $yes;
1250 push @$submacros, sprintf "#define $def\n( %s )", "_part" . (my $no_idx= 0+@$submacros), $no;
1251 return sprintf "%s%s ? $def : $def%s", $lb, $cond, "_part$yes_idx", "_part$no_idx", $rb;
1252 }
1253 return $str;
12b72891 1254}
32e6a07c 1255
e64b1bd1
YO
1256# $expr=render($op,$combine)
1257#
1258# convert an optree to text with reasonably neat formatting. If $combine
1259# is true then the condition is created using "fast breakouts" which
1260# produce uglier expressions that are more efficient for common case,
1261# longer lists such as that resulting from type 'cp' output.
1262# Currently only used for type 'cp' macros.
1263sub render {
39a0f513
YO
1264 my ( $self, $op, $combine, $opts_ref, $def_fmt )= @_;
1265
1266 my @submacros;
1267 my $macro= sprintf "#define $def_fmt\n( %s )", "", $self->_render( $op, $combine, 0, $opts_ref, $def_fmt, \@submacros );
1268
1269 return join "\n\n", map { "/*** GENERATED CODE ***/\n" . __macro( __clean( $_ ) ) } @submacros, $macro;
12b72891 1270}
e64b1bd1
YO
1271
1272# make_macro
1273# make a macro of a given type.
1274# calls into make_trie and (generic_|length_)optree as needed
1275# Opts are:
40f914fd
KW
1276# type : 'cp','cp_high', 'generic','high','low','latin1','utf8','LATIN1','UTF8'
1277# ret_type : 'cp' or 'len'
1278# safe : don't assume is well-formed UTF-8, so don't skip any range
1279# checks, and add length guards to macro
1280# no_length_checks : like safe, but don't add length guards.
e64b1bd1
YO
1281#
1282# type defaults to 'generic', and ret_type to 'len' unless type is 'cp'
1283# in which case it defaults to 'cp' as well.
1284#
3ff97bcf 1285# It is illegal to do a type 'cp' macro on a pattern with multi-codepoint
e64b1bd1
YO
1286# sequences in it, as the generated macro will accept only a single codepoint
1287# as an argument.
1288#
6b94381d
KW
1289# It is also illegal to do a non-safe macro on a pattern with multi-codepoint
1290# sequences in it, as even if it is known to be well-formed, we need to not
91e83b73 1291# run off the end of the buffer when, say, the buffer ends with the first two
6b94381d
KW
1292# characters, but three are looked at by the macro.
1293#
e64b1bd1
YO
1294# returns the macro.
1295
1296
1297sub make_macro {
1298 my $self= shift;
1299 my %opts= @_;
1300 my $type= $opts{type} || 'generic';
6b94381d
KW
1301 if ($self->{has_multi}) {
1302 if ($type =~ /^cp/) {
1303 die "Can't do a 'cp' on multi-codepoint character class '$self->{op}'"
1304 }
1305 elsif (! $opts{safe}) {
1306 die "'safe' is required on multi-codepoint character class '$self->{op}'"
1307 }
1308 }
900c17f9 1309 my $ret_type= $opts{ret_type} || ( $opts{type} =~ /^cp/ ? 'cp' : 'len' );
e64b1bd1
YO
1310 my $method;
1311 if ( $opts{safe} ) {
1312 $method= 'length_optree';
87894a24 1313 } elsif ( $type =~ /generic/ ) {
e64b1bd1
YO
1314 $method= 'generic_optree';
1315 } else {
1316 $method= 'optree';
1317 }
900c17f9 1318 my @args= $type =~ /^cp/ ? 'cp' : 's';
e64b1bd1 1319 push @args, "e" if $opts{safe};
87894a24 1320 push @args, "is_utf8" if $type =~ /generic/;
e64b1bd1
YO
1321 push @args, "len" if $ret_type eq 'both';
1322 my $pfx= $ret_type eq 'both' ? 'what_len_' :
1323 $ret_type eq 'cp' ? 'what_' : 'is_';
87894a24
KW
1324 my $ext= $type =~ /generic/ ? '' : '_' . lc( $type );
1325 $ext .= '_non_low' if $type eq 'generic_non_low';
e64b1bd1 1326 $ext .= "_safe" if $opts{safe};
40f914fd 1327 $ext .= "_no_length_checks" if $opts{no_length_checks};
e64b1bd1 1328 my $argstr= join ",", @args;
39a0f513
YO
1329 my $def_fmt="$pfx$self->{op}$ext%s($argstr)";
1330 my $optree= $self->$method( %opts, type => $type, ret_type => $ret_type );
1331 return $self->render( $optree, ($type =~ /^cp/) ? 1 : 0, \%opts, $def_fmt );
32e6a07c 1332}
e64b1bd1 1333
b6a6e956 1334# if we aren't being used as a module (highly likely) then process
e64b1bd1
YO
1335# the __DATA__ below and produce macros in regcharclass.h
1336# if an argument is provided to the script then it is assumed to
1337# be the path of the file to output to, if the arg is '-' outputs
1338# to STDOUT.
1339if ( !caller ) {
e64b1bd1 1340 $|++;
8770da0e 1341 my $path= shift @ARGV || "regcharclass.h";
e64b1bd1
YO
1342 my $out_fh;
1343 if ( $path eq '-' ) {
1344 $out_fh= \*STDOUT;
1345 } else {
29c22b52 1346 $out_fh = open_new( $path );
e64b1bd1 1347 }
8770da0e
NC
1348 print $out_fh read_only_top( lang => 'C', by => $0,
1349 file => 'regcharclass.h', style => '*',
212b6c86
KW
1350 copyright => [2007, 2011],
1351 final => <<EOF,
1352WARNING: These macros are for internal Perl core use only, and may be
1353changed or removed without notice.
1354EOF
1355 );
6a5bc5ac 1356 print $out_fh "\n#ifndef PERL_REGCHARCLASS_H_ /* Guard against nested #includes */\n#define PERL_REGCHARCLASS_H_\n";
12b72891 1357
bb949220 1358 my ( $op, $title, @txt, @types, %mods );
a1b2a50f 1359 my $doit= sub ($) {
e64b1bd1 1360 return unless $op;
ae1d4929 1361
a1b2a50f
KW
1362 my $charset = shift;
1363
ae1d4929 1364 # Skip if to compile on a different platform.
a1b2a50f
KW
1365 return if delete $mods{only_ascii_platform} && $charset !~ /ascii/i;
1366 return if delete $mods{only_ebcdic_platform} && $charset !~ /ebcdic/i;
ae1d4929 1367
e64b1bd1
YO
1368 print $out_fh "/*\n\t$op: $title\n\n";
1369 print $out_fh join "\n", ( map { "\t$_" } @txt ), "*/", "";
a1b2a50f 1370 my $obj= __PACKAGE__->new( op => $op, title => $title, txt => \@txt, charset => $charset);
e64b1bd1 1371
bb949220
KW
1372 #die Dumper(\@types,\%mods);
1373
1374 my @mods;
1375 push @mods, 'safe' if delete $mods{safe};
40f914fd 1376 push @mods, 'no_length_checks' if delete $mods{no_length_checks};
bb949220
KW
1377 unshift @mods, 'fast' if delete $mods{fast} || ! @mods; # Default to 'fast'
1378 # do this one
1379 # first, as
1380 # traditional
1381 if (%mods) {
122a2d8f 1382 die "Unknown modifiers: ", join ", ", map { "'$_'" } sort keys %mods;
bb949220 1383 }
e64b1bd1
YO
1384
1385 foreach my $type_spec ( @types ) {
1386 my ( $type, $ret )= split /-/, $type_spec;
1387 $ret ||= 'len';
1388 foreach my $mod ( @mods ) {
f71bd789
KW
1389
1390 # 'safe' is irrelevant with code point macros, so skip if
1391 # there is also a 'fast', but don't skip if this is the only
1392 # way a cp macro will get generated. Below we convert 'safe'
1393 # to 'fast' in this instance
1394 next if $type =~ /^cp/
40f914fd
KW
1395 && ($mod eq 'safe' || $mod eq 'no_length_checks')
1396 && grep { 'fast' =~ $_ } @mods;
bb949220 1397 delete $mods{$mod};
e64b1bd1
YO
1398 my $macro= $obj->make_macro(
1399 type => $type,
1400 ret_type => $ret,
81200454 1401 safe => $mod eq 'safe' && $type !~ /^cp/,
a1b2a50f 1402 charset => $charset,
40f914fd 1403 no_length_checks => $mod eq 'no_length_checks' && $type !~ /^cp/,
e64b1bd1
YO
1404 );
1405 print $out_fh $macro, "\n";
1406 }
32e6a07c 1407 }
e64b1bd1
YO
1408 };
1409
a1b2a50f
KW
1410 my @data = <DATA>;
1411 foreach my $charset (get_supported_code_pages()) {
1412 my $first_time = 1;
1413 undef $op;
1414 undef $title;
1415 undef @txt;
1416 undef @types;
1417 undef %mods;
1418 print $out_fh "\n", get_conditional_compile_line_start($charset);
1419 my @data_copy = @data;
1420 for (@data_copy) {
91e83b73
KW
1421 s/^ \s* (?: \# .* ) ? $ //x; # squeeze out comment and blanks
1422 next unless /\S/;
1423 chomp;
1424 if ( /^[A-Z]/ ) {
1425 $doit->($charset) unless $first_time; # This starts a new
1426 # definition; do the
1427 # previous one
1428 $first_time = 0;
1429 ( $op, $title )= split /\s*:\s*/, $_, 2;
1430 @txt= ();
1431 } elsif ( s/^=>// ) {
1432 my ( $type, $modifier )= split /:/, $_;
1433 @types= split ' ', $type;
1434 undef %mods;
1435 map { $mods{$_} = 1 } split ' ', $modifier;
1436 } else {
1437 push @txt, "$_";
1438 }
12b72891 1439 }
91e83b73 1440 $doit->($charset);
a1b2a50f
KW
1441 print $out_fh get_conditional_compile_line_end();
1442 }
d10c72f2 1443
6a5bc5ac 1444 print $out_fh "\n#endif /* PERL_REGCHARCLASS_H_ */\n";
d10c72f2 1445
8770da0e
NC
1446 if($path eq '-') {
1447 print $out_fh "/* ex: set ro: */\n";
1448 } else {
0c36c41b
KW
1449 # Some of the sources for these macros come from Unicode tables
1450 my $sources_list = "lib/unicore/mktables.lst";
b60dc4b9
KW
1451 my @sources = ($0, qw(lib/unicore/mktables
1452 lib/Unicode/UCD.pm
1453 regen/regcharclass_multi_char_folds.pl
1454 regen/charset_translations.pl
1455 ));
0c36c41b
KW
1456 {
1457 # Depend on mktables’ own sources. It’s a shorter list of files than
1458 # those that Unicode::UCD uses.
1ae6ead9 1459 if (! open my $mktables_list, '<', $sources_list) {
0c36c41b
KW
1460
1461 # This should force a rebuild once $sources_list exists
1462 push @sources, $sources_list;
1463 }
1464 else {
1465 while(<$mktables_list>) {
1466 last if /===/;
1467 chomp;
1468 push @sources, "lib/unicore/$_" if /^[^#]/;
1469 }
1470 }
1471 }
1472 read_only_bottom_close_and_rename($out_fh, \@sources)
8770da0e 1473 }
12b72891 1474}
e64b1bd1 1475
cc08b31c
KW
1476# The form of the input is a series of definitions to make macros for.
1477# The first line gives the base name of the macro, followed by a colon, and
1478# then text to be used in comments associated with the macro that are its
1479# title or description. In all cases the first (perhaps only) parameter to
1480# the macro is a pointer to the first byte of the code point it is to test to
1481# see if it is in the class determined by the macro. In the case of non-UTF8,
1482# the code point consists only of a single byte.
1483#
1484# The second line must begin with a '=>' and be followed by the types of
1485# macro(s) to be generated; these are specified below. A colon follows the
1486# types, followed by the modifiers, also specified below. At least one
1487# modifier is required.
1488#
1489# The subsequent lines give what code points go into the class defined by the
1490# macro. Multiple characters may be specified via a string like "\x0D\x0A",
60910c93
KW
1491# enclosed in quotes. Otherwise the lines consist of one of:
1492# 1) a single Unicode code point, prefaced by 0x
1493# 2) a single range of Unicode code points separated by a minus (and
1494# optional space)
1495# 3) a single Unicode property specified in the standard Perl form
1496# "\p{...}"
1497# 4) a line like 'do path'. This will do a 'do' on the file given by
1498# 'path'. It is assumed that this does nothing but load subroutines
1499# (See item 5 below). The reason 'require path' is not used instead is
1500# because 'do' doesn't assume that path is in @INC.
1501# 5) a subroutine call
1502# &pkg::foo(arg1, ...)
1503# where pkg::foo was loaded by a 'do' line (item 4). The subroutine
1504# returns an array of entries of forms like items 1-3 above. This
1505# allows more complex inputs than achievable from the other input types.
cc08b31c
KW
1506#
1507# A blank line or one whose first non-blank character is '#' is a comment.
1508# The definition of the macro is terminated by a line unlike those described.
1509#
1510# Valid types:
1511# low generate a macro whose name is 'is_BASE_low' and defines a
1512# class that includes only ASCII-range chars. (BASE is the
1513# input macro base name.)
1514# latin1 generate a macro whose name is 'is_BASE_latin1' and defines a
1515# class that includes only upper-Latin1-range chars. It is not
1516# designed to take a UTF-8 input parameter.
b1af8fef
KW
1517# high generate a macro whose name is 'is_BASE_high' and defines a
1518# class that includes all relevant code points that are above
1519# the Latin1 range. This is for very specialized uses only.
1520# It is designed to take only an input UTF-8 parameter.
cc08b31c
KW
1521# utf8 generate a macro whose name is 'is_BASE_utf8' and defines a
1522# class that includes all relevant characters that aren't ASCII.
1523# It is designed to take only an input UTF-8 parameter.
1524# LATIN1 generate a macro whose name is 'is_BASE_latin1' and defines a
1525# class that includes both ASCII and upper-Latin1-range chars.
1526# It is not designed to take a UTF-8 input parameter.
1527# UTF8 generate a macro whose name is 'is_BASE_utf8' and defines a
1528# class that can include any code point, adding the 'low' ones
1529# to what 'utf8' works on. It is designed to take only an input
1530# UTF-8 parameter.
1531# generic generate a macro whose name is 'is_BASE". It has a 2nd,
1532# boolean, parameter which indicates if the first one points to
1533# a UTF-8 string or not. Thus it works in all circumstances.
87894a24
KW
1534# generic_non_low generate a macro whose name is 'is_BASE_non_low". It has
1535# a 2nd, boolean, parameter which indicates if the first one
1536# points to a UTF-8 string or not. It excludes any ASCII-range
1537# matches, but otherwise it works in all circumstances.
cc08b31c
KW
1538# cp generate a macro whose name is 'is_BASE_cp' and defines a
1539# class that returns true if the UV parameter is a member of the
1540# class; false if not.
900c17f9
KW
1541# cp_high like cp, but it is assumed that it is known that the UV
1542# parameter is above Latin1. The name of the generated macro is
1543# 'is_BASE_cp_high'. This is different from high-cp, derived
1544# below.
cc08b31c
KW
1545# A macro of the given type is generated for each type listed in the input.
1546# The default return value is the number of octets read to generate the match.
1547# Append "-cp" to the type to have it instead return the matched codepoint.
1548# The macro name is changed to 'what_BASE...'. See pod for
1549# caveats
1550# Appending '-both" instead adds an extra parameter to the end of the argument
1551# list, which is a pointer as to where to store the number of
1552# bytes matched, while also returning the code point. The macro
1553# name is changed to 'what_len_BASE...'. See pod for caveats
1554#
1555# Valid modifiers:
1556# safe The input string is not necessarily valid UTF-8. In
1557# particular an extra parameter (always the 2nd) to the macro is
1558# required, which points to one beyond the end of the string.
1559# The macro will make sure not to read off the end of the
1560# string. In the case of non-UTF8, it makes sure that the
1561# string has at least one byte in it. The macro name has
1562# '_safe' appended to it.
40f914fd
KW
1563# no_length_checks The input string is not necessarily valid UTF-8, but it
1564# is to be assumed that the length has already been checked and
1565# found to be valid
cc08b31c
KW
1566# fast The input string is valid UTF-8. No bounds checking is done,
1567# and the macro can make assumptions that lead to faster
1568# execution.
a1b2a50f 1569# only_ascii_platform Skip this definition if the character set is for
ae1d4929 1570# a non-ASCII platform.
a1b2a50f 1571# only_ebcdic_platform Skip this definition if the character set is for
ae1d4929 1572# a non-EBCDIC platform.
cc08b31c
KW
1573# No modifier need be specified; fast is assumed for this case. If both
1574# 'fast', and 'safe' are specified, two macros will be created for each
1575# 'type'.
e90ac8de 1576#
295bcca9 1577# If run on a non-ASCII platform will automatically convert the Unicode input
cc08b31c
KW
1578# to native. The documentation above is slightly wrong in this case. 'low'
1579# actually refers to code points whose UTF-8 representation is the same as the
1580# non-UTF-8 version (invariants); and 'latin1' refers to all the rest of the
1581# code points less than 256.
5e6c6c1e
KW
1582
15831; # in the unlikely case we are being used as a module
1584
1585__DATA__
1586# This is no longer used, but retained in case it is needed some day.
e90ac8de
KW
1587# TRICKYFOLD: Problematic fold case letters. When adding to this list, also should add them to regcomp.c and fold_grind.t
1588# => generic cp generic-cp generic-both :fast safe
1589# 0x00DF # LATIN SMALL LETTER SHARP S
1590# 0x0390 # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
1591# 0x03B0 # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
1592# 0x1E9E # LATIN CAPITAL LETTER SHARP S, because maps to same as 00DF
1593# 0x1FD3 # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA; maps same as 0390
1594# 0x1FE3 # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA; maps same as 03B0
1595
12b72891 1596LNBREAK: Line Break: \R
5c025f03 1597=> generic UTF8 LATIN1 : safe
12b72891 1598"\x0D\x0A" # CRLF - Network (Windows) line ending
05b688d9 1599\p{VertSpace}
12b72891
RGS
1600
1601HORIZWS: Horizontal Whitespace: \h \H
507ce328 1602=> high cp_high : fast
05b688d9 1603\p{HorizSpace}
12b72891
RGS
1604
1605VERTWS: Vertical Whitespace: \v \V
507ce328 1606=> high cp_high : fast
05b688d9 1607\p{VertSpace}
612ead59 1608
4ac6419d 1609XDIGIT: Hexadecimal digits
507ce328 1610=> high cp_high : fast
4ac6419d
KW
1611\p{XDigit}
1612
bedac28b 1613XPERLSPACE: \p{XPerlSpace}
507ce328 1614=> high cp_high : fast
bedac28b
KW
1615\p{XPerlSpace}
1616
b96a92fb 1617NONCHAR: Non character code points
89d986df 1618=> UTF8 :safe
099323b4 1619\p{_Perl_Nchar}
b96a92fb 1620
21cb232c 1621SURROGATE: Surrogate code points
89d986df 1622=> UTF8 :safe
099323b4 1623\p{_Perl_Surrogate}
b96a92fb 1624
685289b5
KW
1625QUOTEMETA: Meta-characters that \Q should quote
1626=> high :fast
1627\p{_Perl_Quotemeta}
8769f413
KW
1628
1629MULTI_CHAR_FOLD: multi-char strings that are folded to by a single character
251b239f 1630=> UTF8 :safe
42d7c910 1631&regcharclass_multi_char_folds::multi_char_folds('u', 'a')
8769f413 1632
40b1ba4f 1633MULTI_CHAR_FOLD: multi-char strings that are folded to by a single character
251b239f 1634=> LATIN1 : safe
42d7c910
KW
1635&regcharclass_multi_char_folds::multi_char_folds('l', 'a')
1636
1637THREE_CHAR_FOLD: A three-character multi-char fold
1638=> UTF8 :safe
1639&regcharclass_multi_char_folds::multi_char_folds('u', '3')
1640
1641THREE_CHAR_FOLD: A three-character multi-char fold
1642=> LATIN1 :safe
1643&regcharclass_multi_char_folds::multi_char_folds('l', '3')
8769f413 1644
42d7c910
KW
1645THREE_CHAR_FOLD_HEAD: The first two of three-character multi-char folds
1646=> UTF8 :safe
1647&regcharclass_multi_char_folds::multi_char_folds('u', 'h')
1648
1649THREE_CHAR_FOLD_HEAD: The first two of three-character multi-char folds
1650=> LATIN1 :safe
1651&regcharclass_multi_char_folds::multi_char_folds('l', 'h')
1652#
1653#THREE_CHAR_FOLD_NON_FINAL: The first or middle character of multi-char folds
1654#=> UTF8 :safe
1655#&regcharclass_multi_char_folds::multi_char_folds('u', 'fm')
1656#
1657#THREE_CHAR_FOLD_NON_FINAL: The first or middle character of multi-char folds
1658#=> LATIN1 :safe
1659#&regcharclass_multi_char_folds::multi_char_folds('l', 'fm')
0b50d62a 1660
1a27eb96
KW
1661FOLDS_TO_MULTI: characters that fold to multi-char strings
1662=> UTF8 :fast
1663\p{_Perl_Folds_To_Multi_Char}
1664
31f05a37
KW
1665PROBLEMATIC_LOCALE_FOLD : characters whose fold is problematic under locale
1666=> UTF8 cp :fast
1667\p{_Perl_Problematic_Locale_Folds}
1668
1669PROBLEMATIC_LOCALE_FOLDEDS_START : The first folded character of folds which are problematic under locale
1670=> UTF8 cp :fast
1671\p{_Perl_Problematic_Locale_Foldeds_Start}
1672
0b50d62a 1673PATWS: pattern white space
ef06e936 1674=> generic : safe
099323b4 1675\p{_Perl_PatWS}
6c12993c
KW
1676
1677HANGUL_ED: Hangul syllables whose first character is \xED
1678=> UTF8 :only_ascii_platform safe
16790xD000 - 0xD7FF