This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5
regcharclass.pl: White-space comment only
[perl5.git] / regen / regcharclass.pl
CommitLineData
1f00b0d6 1#!perl
e64b1bd1 2package CharClass::Matcher;
12b72891 3use strict;
8770da0e 4use 5.008;
12b72891 5use warnings;
e64b1bd1 6use warnings FATAL => 'all';
12b72891 7use Data::Dumper;
e64b1bd1 8$Data::Dumper::Useqq= 1;
12b72891 9
75929b4b
KW
10sub DEBUG () { 0 }
11$|=1 if DEBUG;
12
3d7c117d
MB
13require './regen/regen_lib.pl';
14require './regen/charset_translations.pl';
4c9bc717 15require "./regen/regcharclass_multi_char_folds.pl";
8770da0e 16
ab84f958 17=head1 NAME
0ccab2bc 18
e64b1bd1 19CharClass::Matcher -- Generate C macros that match character classes efficiently
12b72891 20
e64b1bd1
YO
21=head1 SYNOPSIS
22
944ff787 23 perl regen/regcharclass.pl
e64b1bd1
YO
24
25=head1 DESCRIPTION
12b72891
RGS
26
27Dynamically generates macros for detecting special charclasses
e64b1bd1 28in latin-1, utf8, and codepoint forms. Macros can be set to return
cc08b31c 29the length (in bytes) of the matched codepoint, and/or the codepoint itself.
12b72891 30
cc08b31c 31To regenerate F<regcharclass.h>, run this script from perl-root. No arguments
12b72891
RGS
32are necessary.
33
cc08b31c
KW
34Using WHATEVER as an example the following macros can be produced, depending
35on the input parameters (how to get each is described by internal comments at
36the C<__DATA__> line):
12b72891
RGS
37
38=over 4
39
cc08b31c 40=item C<is_WHATEVER(s,is_utf8)>
12b72891 41
cc08b31c 42=item C<is_WHATEVER_safe(s,e,is_utf8)>
12b72891 43
cc08b31c 44Do a lookup as appropriate based on the C<is_utf8> flag. When possible
ab473f03 45comparisons involving octet<128 are done before checking the C<is_utf8>
12b72891
RGS
46flag, hopefully saving time.
47
cc08b31c
KW
48The version without the C<_safe> suffix should be used only when the input is
49known to be well-formed.
12b72891 50
cc08b31c
KW
51=item C<is_WHATEVER_utf8(s)>
52
53=item C<is_WHATEVER_utf8_safe(s,e)>
12b72891
RGS
54
55Do a lookup assuming the string is encoded in (normalized) UTF8.
56
cc08b31c
KW
57The version without the C<_safe> suffix should be used only when the input is
58known to be well-formed.
59
60=item C<is_WHATEVER_latin1(s)>
12b72891 61
cc08b31c 62=item C<is_WHATEVER_latin1_safe(s,e)>
12b72891
RGS
63
64Do a lookup assuming the string is encoded in latin-1 (aka plan octets).
65
cc08b31c
KW
66The version without the C<_safe> suffix should be used only when it is known
67that C<s> contains at least one character.
68
69=item C<is_WHATEVER_cp(cp)>
12b72891 70
47e01c32 71Check to see if the string matches a given codepoint (hypothetically a
b6a6e956 72U32). The condition is constructed as to "break out" as early as
12b72891
RGS
73possible if the codepoint is out of range of the condition.
74
75IOW:
76
77 (cp==X || (cp>X && (cp==Y || (cp>Y && ...))))
78
79Thus if the character is X+1 only two comparisons will be done. Making
80matching lookups slower, but non-matching faster.
81
cc08b31c
KW
82=item C<what_len_WHATEVER_FOO(arg1, ..., len)>
83
84A variant form of each of the macro types described above can be generated, in
85which the code point is returned by the macro, and an extra parameter (in the
86final position) is added, which is a pointer for the macro to set the byte
87length of the returned code point.
88
89These forms all have a C<what_len> prefix instead of the C<is_>, for example
90C<what_len_WHATEVER_safe(s,e,is_utf8,len)> and
91C<what_len_WHATEVER_utf8(s,len)>.
92
93These forms should not be used I<except> on small sets of mostly widely
94separated code points; otherwise the code generated is inefficient. For these
95cases, it is best to use the C<is_> forms, and then find the code point with
96C<utf8_to_uvchr_buf>(). This program can fail with a "deep recursion"
97message on the worst of the inappropriate sets. Examine the generated macro
98to see if it is acceptable.
12b72891 99
cc08b31c
KW
100=item C<what_WHATEVER_FOO(arg1, ...)>
101
102A variant form of each of the C<is_> macro types described above can be generated, in
103which the code point and not the length is returned by the macro. These have
104the same caveat as L</what_len_WHATEVER_FOO(arg1, ..., len)>, plus they should
105not be used where the set contains a NULL, as 0 is returned for two different
106cases: a) the set doesn't include the input code point; b) the set does
107include it, and it is a NULL.
108
109=back
e64b1bd1 110
40f914fd
KW
111The above isn't quite complete, as for specialized purposes one can get a
112macro like C<is_WHATEVER_utf8_no_length_checks(s)>, which assumes that it is
113already known that there is enough space to hold the character starting at
114C<s>, but otherwise checks that it is well-formed. In other words, this is
115intermediary in checking between C<is_WHATEVER_utf8(s)> and
116C<is_WHATEVER_utf8_safe(s,e)>.
117
e64b1bd1
YO
118=head2 CODE FORMAT
119
120perltidy -st -bt=1 -bbt=0 -pt=0 -sbt=1 -ce -nwls== "%f"
121
122
123=head1 AUTHOR
124
cc08b31c 125Author: Yves Orton (demerphq) 2007. Maintained by Perl5 Porters.
e64b1bd1
YO
126
127=head1 BUGS
128
129No tests directly here (although the regex engine will fail tests
130if this code is broken). Insufficient documentation and no Getopts
131handler for using the module as a script.
132
133=head1 LICENSE
134
135You may distribute under the terms of either the GNU General Public
136License or the Artistic License, as specified in the README file.
137
12b72891
RGS
138=cut
139
e64b1bd1
YO
140# Sub naming convention:
141# __func : private subroutine, can not be called as a method
142# _func : private method, not meant for external use
143# func : public method.
144
145# private subs
146#-------------------------------------------------------------------------------
147#
148# ($cp,$n,$l,$u)=__uni_latin($str);
149#
47e01c32 150# Return a list of arrays, each of which when interpreted correctly
e64b1bd1
YO
151# represent the string in some given encoding with specific conditions.
152#
153# $cp - list of codepoints that make up the string.
295bcca9
KW
154# $n - list of octets that make up the string if all codepoints are invariant
155# regardless of if the string is in UTF-8 or not.
e64b1bd1 156# $l - list of octets that make up the string in latin1 encoding if all
295bcca9
KW
157# codepoints < 256, and at least one codepoint is UTF-8 variant.
158# $u - list of octets that make up the string in utf8 if any codepoint is
159# UTF-8 variant
e64b1bd1
YO
160#
161# High CP | Defined
162#-----------+----------
295bcca9 163# 0 - 127 : $n (127/128 are the values for ASCII platforms)
e64b1bd1
YO
164# 128 - 255 : $l, $u
165# 256 - ... : $u
166#
167
168sub __uni_latin1 {
e272994f
KW
169 my $charset = shift;
170 my $a2n= shift;
e64b1bd1
YO
171 my $str= shift;
172 my $max= 0;
173 my @cp;
900c17f9 174 my @cp_high;
295bcca9 175 my $only_has_invariants = 1;
e64b1bd1
YO
176 for my $ch ( split //, $str ) {
177 my $cp= ord $ch;
e64b1bd1 178 $max= $cp if $max < $cp;
a1b2a50f
KW
179 if ($cp > 255) {
180 push @cp, $cp;
181 push @cp_high, $cp;
182 }
183 else {
c30a0cf2 184 push @cp, $a2n->[$cp];
295bcca9 185 }
e64b1bd1
YO
186 }
187 my ( $n, $l, $u );
a1b2a50f 188 $only_has_invariants = ($charset =~ /ascii/i) ? $max < 128 : $max < 160;
295bcca9 189 if ($only_has_invariants) {
e64b1bd1
YO
190 $n= [@cp];
191 } else {
192 $l= [@cp] if $max && $max < 256;
193
a1b2a50f
KW
194 my @u;
195 for my $ch ( split //, $str ) {
196 push @u, map { ord } split //, cp_2_utfbytes(ord $ch, $charset);
197 }
198 $u = \@u;
12b72891 199 }
900c17f9 200 return ( \@cp, \@cp_high, $n, $l, $u );
12b72891
RGS
201}
202
12b72891 203#
e64b1bd1
YO
204# $clean= __clean($expr);
205#
206# Cleanup a ternary expression, removing unnecessary parens and apply some
207# simplifications using regexes.
208#
209
210sub __clean {
211 my ( $expr )= @_;
8fdb8a9d 212
9a3182e9
YO
213 #return $expr;
214
e64b1bd1
YO
215 our $parens;
216 $parens= qr/ (?> \( (?> (?: (?> [^()]+ ) | (??{ $parens }) )* ) \) ) /x;
217
8fdb8a9d 218 ## remove redundant parens
e64b1bd1 219 1 while $expr =~ s/ \( \s* ( $parens ) \s* \) /$1/gx;
8fdb8a9d
YO
220
221
222 # repeatedly simplify conditions like
223 # ( (cond1) ? ( (cond2) ? X : Y ) : Y )
224 # into
225 # ( ( (cond1) && (cond2) ) ? X : Y )
6c4f0678
YO
226 # Also similarly handles expressions like:
227 # : (cond1) ? ( (cond2) ? X : Y ) : Y )
4fad5f9f
KW
228 # Note the inclusion of the close paren in ([:()]) and the open paren in
229 # ([()]) is purely to ensure we have a balanced set of parens in the
230 # expression which makes it easier to understand the pattern in an editor
231 # that understands paren's, we do not expect either of these cases to
232 # actually fire. - Yves
8fdb8a9d 233 1 while $expr =~ s/
6c4f0678 234 ([:()]) \s*
8fdb8a9d
YO
235 ($parens) \s*
236 \? \s*
237 \( \s* ($parens) \s*
6c4f0678
YO
238 \? \s* ($parens|[^()?:\s]+?) \s*
239 : \s* ($parens|[^()?:\s]+?) \s*
8fdb8a9d 240 \) \s*
6c4f0678
YO
241 : \s* \5 \s*
242 ([()])
243 /$1 ( $2 && $3 ) ? $4 : $5 $6/gx;
39a0f513
YO
244 #$expr=~s/\(\(U8\*\)s\)\[(\d+)\]/S$1/g if length $expr > 8000;
245 #$expr=~s/\s+//g if length $expr > 8000;
246
247 die "Expression too long" if length $expr > 8000;
8fdb8a9d 248
e64b1bd1 249 return $expr;
12b72891
RGS
250}
251
e64b1bd1
YO
252#
253# $text= __macro(@args);
254# Join args together by newlines, and then neatly add backslashes to the end
255# of every line as expected by the C pre-processor for #define's.
256#
257
258sub __macro {
259 my $str= join "\n", @_;
260 $str =~ s/\s*$//;
261 my @lines= map { s/\s+$//; s/\t/ /g; $_ } split /\n/, $str;
262 my $last= pop @lines;
263 $str= join "\n", ( map { sprintf "%-76s\\", $_ } @lines ), $last;
264 1 while $str =~ s/^(\t*) {8}/$1\t/gm;
265 return $str . "\n";
12b72891
RGS
266}
267
e64b1bd1
YO
268#
269# my $op=__incrdepth($op);
270#
271# take an 'op' hashref and add one to it and all its childrens depths.
272#
273
274sub __incrdepth {
275 my $op= shift;
276 return unless ref $op;
277 $op->{depth} += 1;
278 __incrdepth( $op->{yes} );
279 __incrdepth( $op->{no} );
280 return $op;
281}
282
283# join two branches of an opcode together with a condition, incrementing
284# the depth on the yes branch when we do so.
285# returns the new root opcode of the tree.
286sub __cond_join {
287 my ( $cond, $yes, $no )= @_;
5ab0c3af 288 if (ref $yes) {
3ff97bcf
KW
289 return {
290 test => $cond,
291 yes => __incrdepth( $yes ),
292 no => $no,
293 depth => 0,
294 };
5ab0c3af
KW
295 }
296 else {
297 return {
298 test => $cond,
299 yes => $yes,
300 no => __incrdepth($no),
301 depth => 0,
302 };
303 }
e64b1bd1
YO
304}
305
8c112bb9
KW
306my $hex_fmt= "0x%02X";
307
308sub val_fmt
309{
310 my $self = shift;
311 my $arg = shift;
0dc7c810 312 my $always_hex = shift // 0; # Use \x{}; don't look for a mnemonic
8c112bb9 313
fdc26d94
KW
314 # Format 'arg' using the printable character if it has one, or a %x if
315 # not, returning a string containing the result
316
8c112bb9
KW
317 # Return what always returned for an unexpected argument
318 return $hex_fmt unless defined $arg && $arg !~ /\D/;
319
fdc26d94 320 # We convert only things inside Latin1
0dc7c810 321 if (! $always_hex && $arg < 256) {
fdc26d94
KW
322
323 # Find the ASCII equivalent of this argument (as the current character
324 # set might not be ASCII)
325 my $char = chr $self->{n2a}->[$arg];
326
327 # If printable, return it, escaping \ and '
328 return "'$char'" if $char =~ /[^\\'[:^print:]]/a;
329 return "'\\\\'" if $char eq "\\";
330 return "'\''" if $char eq "'";
331
332 # Handle the mnemonic controls
333 my $pos = index("\a\b\e\f\n\r\t\cK", $char);
334 return "'\\" . substr("abefnrtv", $pos, 1) . "'" if $pos >= 0;
335 }
336
8c112bb9
KW
337 # Otherwise, just the input, formatted
338 return sprintf $hex_fmt, $arg;
339}
340
e64b1bd1
YO
341# Methods
342
343# constructor
344#
345# my $obj=CLASS->new(op=>'SOMENAME',title=>'blah',txt=>[..]);
346#
347# Create a new CharClass::Matcher object by parsing the text in
348# the txt array. Currently applies the following rules:
349#
350# Element starts with C<0x>, line is evaled the result treated as
351# a number which is passed to chr().
352#
353# Element starts with C<">, line is evaled and the result treated
354# as a string.
355#
356# Each string is then stored in the 'strs' subhash as a hash record
357# made up of the results of __uni_latin1, using the keynames
b1af8fef 358# 'low','latin1','utf8', as well as the synthesized 'LATIN1', 'high', and
b6a6e956 359# 'UTF8' which hold a merge of 'low' and their lowercase equivalents.
e64b1bd1
YO
360#
361# Size data is tracked per type in the 'size' subhash.
362#
363# Return an object
fdc26d94 364
66af77d6
KW
365my %n2a; # Inversion of a2n, for each character set
366my %utf_2_I8; # Inversion of I8_2_utf, for each EBCDIC character set
fdc26d94 367
12b72891
RGS
368sub new {
369 my $class= shift;
e64b1bd1 370 my %opt= @_;
a50454ce 371 my %hash_return;
e64b1bd1
YO
372 for ( qw(op txt) ) {
373 die "in " . __PACKAGE__ . " constructor '$_;' is a mandatory field"
374 if !exists $opt{$_};
375 }
376
377 my $self= bless {
378 op => $opt{op},
379 title => $opt{title} || '',
380 }, $class;
e272994f
KW
381
382 my $charset = $opt{charset};
383 my $a2n = get_a2n($charset);
384
66af77d6 385 # We need to construct the maps going the other way if not already done
fdc26d94
KW
386 unless (defined $n2a{$charset}) {
387 for (my $i = 0; $i < 256; $i++) {
388 $n2a{$charset}->[$a2n->[$i]] = $i;
389 }
390 }
391
66af77d6
KW
392 if ($charset =~ /ebcdic/i) {
393 my $I8_2_utf = get_I8_2_utf($charset);
394 unless (defined $utf_2_I8{$charset}) {
395 for (my $i = 0; $i < 256; $i++) {
396 $utf_2_I8{$charset}->[$I8_2_utf->[$i]] = $i;
397 }
398 }
399 }
400
e64b1bd1
YO
401 foreach my $txt ( @{ $opt{txt} } ) {
402 my $str= $txt;
403 if ( $str =~ /^[""]/ ) {
404 $str= eval $str;
05b688d9
KW
405 } elsif ($str =~ / - /x ) { # A range: Replace this element on the
406 # list with its expansion
407 my ($lower, $upper) = $str =~ / 0x (.+?) \s* - \s* 0x (.+) /x;
4fad5f9f
KW
408 die "Format must be like '0xDEAD - 0xBEAF'; instead was '$str'"
409 if ! defined $lower || ! defined $upper;
05b688d9
KW
410 foreach my $cp (hex $lower .. hex $upper) {
411 push @{$opt{txt}}, sprintf "0x%X", $cp;
412 }
413 next;
295bcca9
KW
414 } elsif ($str =~ s/ ^ N (?= 0x ) //x ) {
415 # Otherwise undocumented, a leading N means is already in the
416 # native character set; don't convert.
e64b1bd1 417 $str= chr eval $str;
295bcca9
KW
418 } elsif ( $str =~ /^0x/ ) {
419 $str= eval $str;
295bcca9 420 $str = chr $str;
05b688d9
KW
421 } elsif ( $str =~ / \s* \\p \{ ( .*? ) \} /x) {
422 my $property = $1;
423 use Unicode::UCD qw(prop_invlist);
424
425 my @invlist = prop_invlist($property, '_perl_core_internal_ok');
426 if (! @invlist) {
427
428 # An empty return could mean an unknown property, or merely
429 # that it is empty. Call in scalar context to differentiate
430 my $count = prop_invlist($property, '_perl_core_internal_ok');
431 die "$property not found" unless defined $count;
432 }
433
434 # Replace this element on the list with the property's expansion
435 for (my $i = 0; $i < @invlist; $i += 2) {
436 foreach my $cp ($invlist[$i] .. $invlist[$i+1] - 1) {
295bcca9
KW
437
438 # prop_invlist() returns native values; add leading 'N'
439 # to indicate that.
440 push @{$opt{txt}}, sprintf "N0x%X", $cp;
05b688d9
KW
441 }
442 }
443 next;
60910c93
KW
444 } elsif ($str =~ / ^ do \s+ ( .* ) /x) {
445 die "do '$1' failed: $!$@" if ! do $1 or $@;
446 next;
447 } elsif ($str =~ / ^ & \s* ( .* ) /x) { # user-furnished sub() call
448 my @results = eval "$1";
449 die "eval '$1' failed: $@" if $@;
450 push @{$opt{txt}}, @results;
451 next;
a50454ce
KW
452 } elsif ($str =~ / ^ % \s* ( .* ) /x) { # user-furnished sub() call
453 %hash_return = eval "$1";
454 die "eval '$1' failed: $@" if $@;
455 push @{$opt{txt}}, keys %hash_return;
456 die "Only one multi character expansion currently allowed per rule"
457 if $self->{multi_maps};
458 next;
12b72891 459 } else {
5e6c6c1e 460 die "Unparsable line: $txt\n";
12b72891 461 }
4fad5f9f
KW
462 my ( $cp, $cp_high, $low, $latin1, $utf8 )
463 = __uni_latin1($charset, $a2n, $str );
a50454ce
KW
464 my $from;
465 if (defined $hash_return{"\"$str\""}) {
466 $from = $hash_return{"\"$str\""};
467 $from = $a2n->[$from] if $from < 256;
468 }
e64b1bd1
YO
469 my $UTF8= $low || $utf8;
470 my $LATIN1= $low || $latin1;
b1af8fef 471 my $high = (scalar grep { $_ < 256 } @$cp) ? 0 : $utf8;
dda856b2
YO
472 #die Dumper($txt,$cp,$low,$latin1,$utf8)
473 # if $txt=~/NEL/ or $utf8 and @$utf8>3;
e64b1bd1 474
a50454ce
KW
475 @{ $self->{strs}{$str} }{qw( str txt low utf8 latin1 high cp cp_high UTF8 LATIN1 from )}=
476 ( $str, $txt, $low, $utf8, $latin1, $high, $cp, $cp_high, $UTF8, $LATIN1, $from );
e64b1bd1 477 my $rec= $self->{strs}{$str};
900c17f9 478 foreach my $key ( qw(low utf8 latin1 high cp cp_high UTF8 LATIN1) ) {
e64b1bd1
YO
479 $self->{size}{$key}{ 0 + @{ $self->{strs}{$str}{$key} } }++
480 if $self->{strs}{$str}{$key};
12b72891 481 }
e64b1bd1
YO
482 $self->{has_multi} ||= @$cp > 1;
483 $self->{has_ascii} ||= $latin1 && @$latin1;
484 $self->{has_low} ||= $low && @$low;
485 $self->{has_high} ||= !$low && !$latin1;
12b72891 486 }
fdc26d94 487 $self->{n2a} = $n2a{$charset};
e64b1bd1 488 $self->{count}= 0 + keys %{ $self->{strs} };
12b72891
RGS
489 return $self;
490}
491
e64b1bd1 492# my $trie = make_trie($type,$maxlen);
12b72891 493#
47e01c32 494# using the data stored in the object build a trie of a specific type,
e64b1bd1
YO
495# and with specific maximum depth. The trie is made up the elements of
496# the given types array for each string in the object (assuming it is
497# not too long.)
498#
47e01c32 499# returns the trie, or undef if there was no relevant data in the object.
e64b1bd1
YO
500#
501
502sub make_trie {
503 my ( $self, $type, $maxlen )= @_;
504
505 my $strs= $self->{strs};
506 my %trie;
507 foreach my $rec ( values %$strs ) {
508 die "panic: unknown type '$type'"
509 if !exists $rec->{$type};
510 my $dat= $rec->{$type};
511 next unless $dat;
512 next if $maxlen && @$dat > $maxlen;
513 my $node= \%trie;
514 foreach my $elem ( @$dat ) {
515 $node->{$elem} ||= {};
516 $node= $node->{$elem};
12b72891 517 }
e64b1bd1 518 $node->{''}= $rec->{str};
12b72891 519 }
e64b1bd1 520 return 0 + keys( %trie ) ? \%trie : undef;
12b72891
RGS
521}
522
2efb8143
KW
523sub pop_count ($) {
524 my $word = shift;
525
526 # This returns a list of the positions of the bits in the input word that
527 # are 1.
528
529 my @positions;
530 my $position = 0;
531 while ($word) {
532 push @positions, $position if $word & 1;
533 $position++;
534 $word >>= 1;
535 }
536 return @positions;
537}
538
e64b1bd1
YO
539# my $optree= _optree()
540#
541# recursively convert a trie to an optree where every node represents
542# an if else branch.
12b72891 543#
12b72891 544#
12b72891 545
e64b1bd1
YO
546sub _optree {
547 my ( $self, $trie, $test_type, $ret_type, $else, $depth )= @_;
548 return unless defined $trie;
e64b1bd1
YO
549 $ret_type ||= 'len';
550 $else= 0 unless defined $else;
551 $depth= 0 unless defined $depth;
552
b6a6e956 553 # if we have an empty string as a key it means we are in an
e405c23a
YO
554 # accepting state and unless we can match further on should
555 # return the value of the '' key.
895e25a5 556 if (exists $trie->{''} ) {
e405c23a
YO
557 # we can now update the "else" value, anything failing to match
558 # after this point should return the value from this.
a50454ce 559 my $prefix = $self->{strs}{ $trie->{''} };
e64b1bd1 560 if ( $ret_type eq 'cp' ) {
a50454ce
KW
561 $else= $prefix->{from};
562 $else= $self->{strs}{ $trie->{''} }{cp}[0] unless defined $else;
8c112bb9 563 $else= $self->val_fmt($else) if $else > 9;
e64b1bd1
YO
564 } elsif ( $ret_type eq 'len' ) {
565 $else= $depth;
566 } elsif ( $ret_type eq 'both') {
a50454ce
KW
567 $else= $prefix->{from};
568 $else= $self->{strs}{ $trie->{''} }{cp}[0] unless defined $else;
8c112bb9 569 $else= $self->val_fmt($else) if $else > 9;
e64b1bd1 570 $else= "len=$depth, $else";
12b72891 571 }
e64b1bd1 572 }
e405c23a
YO
573 # extract the meaningful keys from the trie, filter out '' as
574 # it means we are an accepting state (end of sequence).
575 my @conds= sort { $a <=> $b } grep { length $_ } keys %$trie;
576
b6a6e956 577 # if we haven't any keys there is no further we can match and we
e405c23a 578 # can return the "else" value.
e64b1bd1 579 return $else if !@conds;
e405c23a 580
29f3ce8f 581 my $test = $test_type =~ /^cp/ ? "cp" : "((const U8*)s)[$depth]";
e405c23a 582
c7c8bf55 583 # First we loop over the possible keys/conditions and find out what they
3ff97bcf 584 # look like; we group conditions with the same optree together.
9a3182e9
YO
585 my %dmp_res;
586 my @res_order;
e405c23a
YO
587 local $Data::Dumper::Sortkeys=1;
588 foreach my $cond ( @conds ) {
589
590 # get the optree for this child/condition
4fad5f9f
KW
591 my $res= $self->_optree( $trie->{$cond}, $test_type, $ret_type,
592 $else, $depth + 1 );
e405c23a 593 # convert it to a string with Dumper
e64b1bd1 594 my $res_code= Dumper( $res );
e405c23a 595
9a3182e9
YO
596 push @{$dmp_res{$res_code}{vals}}, $cond;
597 if (!$dmp_res{$res_code}{optree}) {
598 $dmp_res{$res_code}{optree}= $res;
599 push @res_order, $res_code;
600 }
601 }
602
4fad5f9f
KW
603 # now that we have deduped the optrees we construct a new optree
604 # containing the merged
9a3182e9
YO
605 # results.
606 my %root;
607 my $node= \%root;
608 foreach my $res_code_idx (0 .. $#res_order) {
609 my $res_code= $res_order[$res_code_idx];
610 $node->{vals}= $dmp_res{$res_code}{vals};
611 $node->{test}= $test;
612 $node->{yes}= $dmp_res{$res_code}{optree};
613 $node->{depth}= $depth;
614 if ($res_code_idx < $#res_order) {
615 $node= $node->{no}= {};
12b72891 616 } else {
9a3182e9 617 $node->{no}= $else;
12b72891
RGS
618 }
619 }
e405c23a
YO
620
621 # return the optree.
622 return \%root;
12b72891
RGS
623}
624
e64b1bd1
YO
625# my $optree= optree(%opts);
626#
627# Convert a trie to an optree, wrapper for _optree
628
629sub optree {
630 my $self= shift;
631 my %opt= @_;
632 my $trie= $self->make_trie( $opt{type}, $opt{max_depth} );
633 $opt{ret_type} ||= 'len';
900c17f9 634 my $test_type= $opt{type} =~ /^cp/ ? 'cp' : 'depth';
e64b1bd1 635 return $self->_optree( $trie, $test_type, $opt{ret_type}, $opt{else}, 0 );
12b72891
RGS
636}
637
e64b1bd1
YO
638# my $optree= generic_optree(%opts);
639#
640# build a "generic" optree out of the three 'low', 'latin1', 'utf8'
641# sets of strings, including a branch for handling the string type check.
642#
643
644sub generic_optree {
645 my $self= shift;
646 my %opt= @_;
647
648 $opt{ret_type} ||= 'len';
649 my $test_type= 'depth';
650 my $else= $opt{else} || 0;
651
652 my $latin1= $self->make_trie( 'latin1', $opt{max_depth} );
653 my $utf8= $self->make_trie( 'utf8', $opt{max_depth} );
654
655 $_= $self->_optree( $_, $test_type, $opt{ret_type}, $else, 0 )
656 for $latin1, $utf8;
657
658 if ( $utf8 ) {
659 $else= __cond_join( "( is_utf8 )", $utf8, $latin1 || $else );
660 } elsif ( $latin1 ) {
661 $else= __cond_join( "!( is_utf8 )", $latin1, $else );
662 }
87894a24 663 if ($opt{type} eq 'generic') {
61de6bbc
KW
664 my $low= $self->make_trie( 'low', $opt{max_depth} );
665 if ( $low ) {
666 $else= $self->_optree( $low, $test_type, $opt{ret_type}, $else, 0 );
667 }
87894a24 668 }
e64b1bd1
YO
669
670 return $else;
12b72891
RGS
671}
672
e64b1bd1 673# length_optree()
12b72891 674#
e64b1bd1 675# create a string length guarded optree.
12b72891 676#
e64b1bd1
YO
677
678sub length_optree {
679 my $self= shift;
680 my %opt= @_;
681 my $type= $opt{type};
682
683 die "Can't do a length_optree on type 'cp', makes no sense."
900c17f9 684 if $type =~ /^cp/;
e64b1bd1 685
5ab0c3af
KW
686 my $else= ( $opt{else} ||= 0 );
687
c03e41dd
KW
688 return $else if $self->{count} == 0;
689
5ab0c3af
KW
690 my $method = $type =~ /generic/ ? 'generic_optree' : 'optree';
691 if ($method eq 'optree' && scalar keys %{$self->{size}{$type}} == 1) {
692
693 # Here is non-generic output (meaning that we are only generating one
694 # type), and all things that match have the same number ('size') of
695 # bytes. The length guard is simply that we have that number of
696 # bytes.
697 my @size = keys %{$self->{size}{$type}};
698 my $cond= "((e) - (s)) >= $size[0]";
699 my $optree = $self->$method(%opt);
700 $else= __cond_join( $cond, $optree, $else );
701 }
702 elsif ($self->{has_multi}) {
703 my @size;
e64b1bd1 704
5ab0c3af
KW
705 # Here, there can be a match of a multiple character string. We use
706 # the traditional method which is to have a branch for each possible
707 # size (longest first) and test for the legal values for that size.
e64b1bd1
YO
708 my %sizes= (
709 %{ $self->{size}{low} || {} },
710 %{ $self->{size}{latin1} || {} },
711 %{ $self->{size}{utf8} || {} }
712 );
5ab0c3af
KW
713 if ($method eq 'generic_optree') {
714 @size= sort { $a <=> $b } keys %sizes;
715 } else {
716 @size= sort { $a <=> $b } keys %{ $self->{size}{$type} };
717 }
718 for my $size ( @size ) {
4fad5f9f 719 my $optree= $self->$method(%opt, type => $type, max_depth => $size);
5ab0c3af
KW
720 my $cond= "((e)-(s) > " . ( $size - 1 ).")";
721 $else= __cond_join( $cond, $optree, $else );
722 }
12b72891 723 }
5ab0c3af
KW
724 else {
725 my $utf8;
726
727 # Here, has more than one possible size, and only matches a single
728 # character. For non-utf8, the needed length is 1; for utf8, it is
729 # found by array lookup 'UTF8SKIP'.
730
731 # If want just the code points above 255, set up to look for those;
732 # otherwise assume will be looking for all non-UTF-8-invariant code
733 # poiints.
734 my $trie_type = ($type eq 'high') ? 'high' : 'utf8';
735
736 # If we do want more than the 0-255 range, find those, and if they
737 # exist...
4fad5f9f
KW
738 if ( $opt{type} !~ /latin1/i
739 && ($utf8 = $self->make_trie($trie_type, 0)))
740 {
5ab0c3af
KW
741
742 # ... get them into an optree, and set them up as the 'else' clause
743 $utf8 = $self->_optree( $utf8, 'depth', $opt{ret_type}, 0, 0 );
744
745 # We could make this
746 # UTF8_IS_START(*s) && ((e) - (s)) >= UTF8SKIP(s))";
747 # to avoid doing the UTF8SKIP and subsequent branches for invariants
748 # that don't match. But the current macros that get generated
749 # have only a few things that can match past this, so I (khw)
750 # don't think it is worth it. (Even better would be to use
751 # calculate_mask(keys %$utf8) instead of UTF8_IS_START, and use it
dd9bc2b0
KW
752 # if it saves a bunch. We assume that input text likely to be
753 # well-formed .
754 my $cond = "LIKELY(((e) - (s)) >= UTF8SKIP(s))";
5ab0c3af
KW
755 $else = __cond_join($cond, $utf8, $else);
756
757 # For 'generic', we also will want the latin1 UTF-8 variants for
758 # the case where the input isn't UTF-8.
759 my $latin1;
760 if ($method eq 'generic_optree') {
761 $latin1 = $self->make_trie( 'latin1', 1);
4fad5f9f 762 $latin1= $self->_optree($latin1, 'depth', $opt{ret_type}, 0, 0);
5ab0c3af 763 }
e64b1bd1 764
5ab0c3af
KW
765 # If we want the UTF-8 invariants, get those.
766 my $low;
767 if ($opt{type} !~ /non_low|high/
768 && ($low= $self->make_trie( 'low', 1)))
769 {
770 $low= $self->_optree( $low, 'depth', $opt{ret_type}, 0, 0 );
771
772 # Expand out the UTF-8 invariants as a string so that we
773 # can use them as the conditional
774 $low = $self->_cond_as_str( $low, 0, \%opt);
775
776 # If there are Latin1 variants, add a test for them.
777 if ($latin1) {
778 $else = __cond_join("(! is_utf8 )", $latin1, $else);
779 }
780 elsif ($method eq 'generic_optree') {
781
782 # Otherwise for 'generic' only we know that what
783 # follows must be valid for just UTF-8 strings,
784 $else->{test} = "( is_utf8 && $else->{test} )";
785 }
786
787 # If the invariants match, we are done; otherwise we have
788 # to go to the 'else' clause.
789 $else = __cond_join($low, 1, $else);
790 }
791 elsif ($latin1) { # Here, didn't want or didn't have invariants,
792 # but we do have latin variants
793 $else = __cond_join("(! is_utf8)", $latin1, $else);
794 }
795
796 # We need at least one byte available to start off the tests
dd9bc2b0 797 $else = __cond_join("LIKELY((e) > (s))", $else, 0);
5ab0c3af
KW
798 }
799 else { # Here, we don't want or there aren't any variants. A single
800 # byte available is enough.
801 my $cond= "((e) > (s))";
802 my $optree = $self->$method(%opt);
803 $else= __cond_join( $cond, $optree, $else );
804 }
e64b1bd1 805 }
5ab0c3af 806
e64b1bd1 807 return $else;
12b72891
RGS
808}
809
2efb8143 810sub calculate_mask(@) {
75929b4b
KW
811 # Look at the input list of byte values. This routine returns an array of
812 # mask/base pairs to generate that list.
813
2efb8143
KW
814 my @list = @_;
815 my $list_count = @list;
816
75929b4b
KW
817 # Consider a set of byte values, A, B, C .... If we want to determine if
818 # <c> is one of them, we can write c==A || c==B || c==C .... If the
e42cde6b
KW
819 # values are consecutive, we can shorten that to inRANGE(c, 'A', 'Z'),
820 # which uses far fewer branches. If only some of them are consecutive we
821 # can still save some branches by creating range tests for just those that
822 # are consecutive. _cond_as_str() does this work for looking for ranges.
75929b4b
KW
823 #
824 # Another approach is to look at the bit patterns for A, B, C .... and see
825 # if they have some commonalities. That's what this function does. For
826 # example, consider a set consisting of the bytes
e42cde6b
KW
827 # 0x42, 0x43, 0x62, and 0x63. We could write:
828 # inRANGE(c, 0x42, 0x43) || inRANGE(c, 0x62, 0x63)
829 # which through the magic of casting has not 4, but 2 tests. But the
830 # following mask/compare also works, and has just one test:
831 # (c & 0xDE) == 0x42
832 # The reason it works is that the set consists of exactly the 4 bit
833 # patterns which have either 0 or 1 in the two bit positions that are 0 in
834 # the mask. They have the same value in each bit position where the mask
835 # is 1. The comparison makes sure that the result matches all bytes which
836 # match those six 1 bits exactly. This can be applied to bytes that
837 # differ in 1 through all 8 bit positions. In order to be a candidate for
838 # this optimization, the number of bytes in the set must be a power of 2.
75929b4b 839 #
e42cde6b
KW
840 # It may be that the bytes needing to be matched can't be done with a
841 # single mask. But it may be possible to have two (or more) sets, each
842 # with a separate mask. This function attempts to find some way to save
843 # some branches using the mask technique. If not, it returns an empty
844 # list; if so, it returns a list consisting of
75929b4b
KW
845 # [ [compare1, mask1], [compare2, mask2], ...
846 # [compare_n, undef], [compare_m, undef], ...
847 # ]
848 # The <mask> is undef in the above for those bytes that must be tested
849 # for individually.
850 #
851 # This function does not attempt to find the optimal set. To do so would
852 # probably require testing all possible combinations, and keeping track of
853 # the current best one.
854 #
855 # There are probably much better algorithms, but this is the one I (khw)
856 # came up with. We start with doing a bit-wise compare of every byte in
857 # the set with every other byte. The results are sorted into arrays of
858 # all those that differ by the same bit positions. These are stored in a
859 # hash with the each key being the bits they differ in. Here is the hash
860 # for the 0x53, 0x54, 0x73, 0x74 set:
861 # {
862 # 4 => {
863 # "0,1,2,5" => [
864 # 83,
865 # 116,
866 # 84,
867 # 115
868 # ]
869 # },
870 # 3 => {
871 # "0,1,2" => [
872 # 83,
873 # 84,
874 # 115,
875 # 116
876 # ]
877 # }
878 # 1 => {
879 # 5 => [
880 # 83,
881 # 115,
882 # 84,
883 # 116
884 # ]
885 # },
886 # }
887 #
888 # The set consisting of values which differ in the 4 bit positions 0, 1,
889 # 2, and 5 from some other value in the set consists of all 4 values.
890 # Likewise all 4 values differ from some other value in the 3 bit
891 # positions 0, 1, and 2; and all 4 values differ from some other value in
892 # the single bit position 5. The keys at the uppermost level in the above
893 # hash, 1, 3, and 4, give the number of bit positions that each sub-key
894 # below it has. For example, the 4 key could have as its value an array
895 # consisting of "0,1,2,5", "0,1,2,6", and "3,4,6,7", if the inputs were
896 # such. The best optimization will group the most values into a single
897 # mask. The most values will be the ones that differ in the most
898 # positions, the ones with the largest value for the topmost key. These
899 # keys, are thus just for convenience of sorting by that number, and do
900 # not have any bearing on the core of the algorithm.
901 #
902 # We start with an element from largest number of differing bits. The
903 # largest in this case is 4 bits, and there is only one situation in this
904 # set which has 4 differing bits, "0,1,2,5". We look for any subset of
905 # this set which has 16 values that differ in these 4 bits. There aren't
906 # any, because there are only 4 values in the entire set. We then look at
907 # the next possible thing, which is 3 bits differing in positions "0,1,2".
908 # We look for a subset that has 8 values that differ in these 3 bits.
909 # Again there are none. So we go to look for the next possible thing,
910 # which is a subset of 2**1 values that differ only in bit position 5. 83
911 # and 115 do, so we calculate a mask and base for those and remove them
912 # from every set. Since there is only the one set remaining, we remove
913 # them from just this one. We then look to see if there is another set of
914 # 2 values that differ in bit position 5. 84 and 116 do, so we calculate
915 # a mask and base for those and remove them from every set (again only
916 # this set remains in this example). The set is now empty, and there are
917 # no more sets to look at, so we are done.
918
919 if ($list_count == 256) { # All 256 is trivially masked
2efb8143
KW
920 return (0, 0);
921 }
922
75929b4b
KW
923 my %hash;
924
925 # Generate bits-differing lists for each element compared against each
926 # other element
927 for my $i (0 .. $list_count - 2) {
928 for my $j ($i + 1 .. $list_count - 1) {
929 my @bits_that_differ = pop_count($list[$i] ^ $list[$j]);
930 my $differ_count = @bits_that_differ;
931 my $key = join ",", @bits_that_differ;
4fad5f9f
KW
932 push @{$hash{$differ_count}{$key}}, $list[$i]
933 unless grep { $_ == $list[$i] } @{$hash{$differ_count}{$key}};
75929b4b
KW
934 push @{$hash{$differ_count}{$key}}, $list[$j];
935 }
936 }
2efb8143 937
4fad5f9f
KW
938 print STDERR __LINE__, ": calculate_mask() called: List of values grouped",
939 " by differing bits: ", Dumper \%hash if DEBUG;
2efb8143 940
75929b4b
KW
941 my @final_results;
942 foreach my $count (reverse sort { $a <=> $b } keys %hash) {
943 my $need = 2 ** $count; # Need 8 values for 3 differing bits, etc
de6cb0ab 944 foreach my $bits (sort keys $hash{$count}->%*) {
2efb8143 945
4fad5f9f
KW
946 print STDERR __LINE__, ": For $count bit(s) difference ($bits),",
947 " need $need; have ", scalar @{$hash{$count}{$bits}}, "\n" if DEBUG;
2efb8143 948
75929b4b
KW
949 # Look only as long as there are at least as many elements in the
950 # subset as are needed
951 while ((my $cur_count = @{$hash{$count}{$bits}}) >= $need) {
2efb8143 952
4fad5f9f
KW
953 print STDERR __LINE__, ": Looking at bit positions ($bits): ",
954 Dumper $hash{$count}{$bits} if DEBUG;
2efb8143 955
75929b4b
KW
956 # Start with the first element in it
957 my $try_base = $hash{$count}{$bits}[0];
958 my @subset = $try_base;
959
960 # If it succeeds, we return a mask and a base to compare
961 # against the masked value. That base will be the AND of
962 # every element in the subset. Initialize to the one element
963 # we have so far.
964 my $compare = $try_base;
965
966 # We are trying to find a subset of this that has <need>
967 # elements that differ in the bit positions given by the
968 # string $bits, which is comma separated.
969 my @bits = split ",", $bits;
970
971 TRY: # Look through the remainder of the list for other
972 # elements that differ only by these bit positions.
973
974 for (my $i = 1; $i < $cur_count; $i++) {
975 my $try_this = $hash{$count}{$bits}[$i];
976 my @positions = pop_count($try_base ^ $try_this);
977
4fad5f9f
KW
978 print STDERR __LINE__, ": $try_base vs $try_this: is (",
979 join(',', @positions), ") a subset of ($bits)?" if DEBUG;
75929b4b
KW
980
981 foreach my $pos (@positions) {
982 unless (grep { $pos == $_ } @bits) {
983 print STDERR " No\n" if DEBUG;
984 my $remaining = $cur_count - $i - 1;
985 if ($remaining && @subset + $remaining < $need) {
4fad5f9f
KW
986 print STDERR __LINE__, ": Can stop trying",
987 " $try_base, because even if all the",
988 " remaining $remaining values work, they",
989 " wouldn't add up to the needed $need when",
990 " combined with the existing ",
991 scalar @subset, " ones\n" if DEBUG;
75929b4b
KW
992 last TRY;
993 }
994 next TRY;
995 }
996 }
997
998 print STDERR " Yes\n" if DEBUG;
999 push @subset, $try_this;
1000
1001 # Add this to the mask base, in case it ultimately
1002 # succeeds,
1003 $compare &= $try_this;
1004 }
1005
4fad5f9f
KW
1006 print STDERR __LINE__, ": subset (", join(", ", @subset),
1007 ") has ", scalar @subset, " elements; needs $need\n" if DEBUG;
75929b4b
KW
1008
1009 if (@subset < $need) {
1010 shift @{$hash{$count}{$bits}};
1011 next; # Try with next value
1012 }
2efb8143 1013
75929b4b
KW
1014 # Create the mask
1015 my $mask = 0;
1016 foreach my $position (@bits) {
1017 $mask |= 1 << $position;
1018 }
1019 $mask = ~$mask & 0xFF;
1020 push @final_results, [$compare, $mask];
1021
4fad5f9f
KW
1022 printf STDERR "%d: Got it: compare=%d=0x%X; mask=%X\n",
1023 __LINE__, $compare, $compare, $mask if DEBUG;
75929b4b
KW
1024
1025 # These values are now spoken for. Remove them from future
1026 # consideration
122a2d8f
YO
1027 foreach my $remove_count (sort keys %hash) {
1028 foreach my $bits (sort keys %{$hash{$remove_count}}) {
75929b4b 1029 foreach my $to_remove (@subset) {
4fad5f9f
KW
1030 @{$hash{$remove_count}{$bits}}
1031 = grep { $_ != $to_remove }
1032 @{$hash{$remove_count}{$bits}};
75929b4b
KW
1033 }
1034 }
1035 }
1036 }
1037 }
2efb8143
KW
1038 }
1039
75929b4b
KW
1040 # Any values that remain in the list are ones that have to be tested for
1041 # individually.
1042 my @individuals;
1043 foreach my $count (reverse sort { $a <=> $b } keys %hash) {
de6cb0ab 1044 foreach my $bits (sort keys $hash{$count}->%*) {
75929b4b
KW
1045 foreach my $remaining (@{$hash{$count}{$bits}}) {
1046
1047 # If we already know about this value, just ignore it.
1048 next if grep { $remaining == $_ } @individuals;
1049
1050 # Otherwise it needs to be returned as something to match
1051 # individually
1052 push @final_results, [$remaining, undef];
1053 push @individuals, $remaining;
1054 }
1055 }
2efb8143 1056 }
2efb8143 1057
75929b4b
KW
1058 # Sort by increasing numeric value
1059 @final_results = sort { $a->[0] <=> $b->[0] } @final_results;
1060
1061 print STDERR __LINE__, ": Final return: ", Dumper \@final_results if DEBUG;
1062
1063 return @final_results;
2efb8143
KW
1064}
1065
e64b1bd1
YO
1066# _cond_as_str
1067# turn a list of conditions into a text expression
1068# - merges ranges of conditions, and joins the result with ||
1069sub _cond_as_str {
ba073cf2 1070 my ( $self, $op, $combine, $opts_ref )= @_;
cef3e2d6
KW
1071 my @cond = ();
1072 @cond = $op->{vals}->@* if defined $op->{vals};
e64b1bd1 1073 my $test= $op->{test};
2efb8143 1074 my $is_cp_ret = $opts_ref->{ret_type} eq "cp";
cef3e2d6 1075 return "( $test )" unless @cond;
e64b1bd1 1076
6d2bbfb0
KW
1077 # rangify the list. As we encounter a new value, it is placed in a new
1078 # subarray by itself. If the next value is adjacent to it, the end point
1079 # of the subarray is merely incremented; and so on. When the next value
1080 # that isn't adjacent to the previous one is encountered, Update() is
1081 # called to hoist any single-element subarray to be a scalar.
e64b1bd1
YO
1082 my @ranges;
1083 my $Update= sub {
f5772832
KW
1084 # We skip this if there are optimizations that
1085 # we can apply (below) to the individual ranges
1086 if ( ($is_cp_ret || $combine) && @ranges && ref $ranges[-1]) {
6d2bbfb0 1087 $ranges[-1] = $ranges[-1][0] if $ranges[-1][0] == $ranges[-1][1];
e64b1bd1
YO
1088 }
1089 };
8787aefa
KW
1090
1091 # Go through the code points (@cond) and collapse them as much as
1092 # possible into ranges
1093 for my $condition ( @cond ) {
1094 if ( !@ranges || $condition != $ranges[-1][1] + 1 ) {
1095 # Not adjacent to the existing range. Remove that from being a
1096 # range if only a single value;
1097 $Update->();
1098 push @ranges, [ $condition, $condition ];
1099 } else { # Adjacent to the existing range; add to the range
1100 $ranges[-1][1]++;
1101 }
e64b1bd1 1102 }
8787aefa 1103 $Update->();
f5772832 1104
8787aefa
KW
1105 # _combine is used for cp type matching.
1106 return $self->_combine( $test, @ranges ) if $combine;
f5772832 1107
8787aefa
KW
1108 # If the input set has certain characteristics, we can optimize tests
1109 # for it.
2358c533 1110
8787aefa
KW
1111 # Return if all bytes match, hence is trivially true
1112 return 1 if @cond == 256;
2358c533 1113
bc5a92d8
KW
1114 # If this is a single UTF-8 range which includes all possible
1115 # continuation bytes, and we aren't checking for well-formedness, this
1116 # is trivially true.
1117 if ( @ranges == 1
1118 && ! $opts_ref->{safe}
1119 && ! $opts_ref->{no_length_checks}
1120 && $opts_ref->{type} =~ / ^ (?: utf8 | high ) $ /xi
1121 && $ranges[0]->[1] == 0xBF
1122 && $ranges[0]->[0] == 0x80)
1123 {
1124 return 1;
1125 }
1126
1127 my $loop_start = 0;
1128 if (ref $ranges[0] && $ranges[0]->[0] == 0) {
1129
1130 # If the first range matches all 256 possible bytes, it is
1131 # trivially true.
1dff551d
KW
1132 if ($ranges[0]->[1] == 0xFF) {
1133 die "Range spanning all bytes must be the only one"
1134 if @ranges > 1;
1135 return 1;
1136 }
bc5a92d8
KW
1137 # this case
1138 # Here, the first range starts at 0, but doesn't match everything.
1139 # But the condition doesn't have to worry about being < 0
1140 $ranges[0] = "( $test <= "
1141 . $self->val_fmt($ranges[0]->[1]) . " )";
1142 $loop_start++;
1143 }
1144
1145 my $loop_end = @ranges;
1146 if ( @ranges
1147 && ref $ranges[-1]
1148 && $ranges[-1]->[1] == 0xFF
1149 && $ranges[-1]->[0] != 0xFF)
1150 {
1151 # If the final range consists of more than one byte ending with
1152 # the highest possible one, the condition doesn't have to worry
1153 # about being > FF
1154 $ranges[-1] = "( $test >= " . $self->val_fmt($ranges[-1]->[0]) . " )";
1155 $loop_end--;
1156 }
1157
e5425873
KW
1158 # Look at each range to see if there any optimizations.
1159 for (my $i = $loop_start; $i < $loop_end; $i++) {
1160 if (! ref $ranges[$i]) { # Trivial case: no range
1161 $ranges[$i] = $self->val_fmt($ranges[$i]) . " == $test";
1162 }
1163 elsif ($ranges[$i]->[0] == $ranges[$i]->[1]) {
1164 $ranges[$i] = # Trivial case: single element range
1165 $self->val_fmt($ranges[$i]->[0]) . " == $test";
1166 }
1167 else {
1168 $ranges[$i] = "inRANGE_helper_(U8, $test, "
1169 . $self->val_fmt($ranges[$i]->[0]) .", "
1170 . $self->val_fmt($ranges[$i]->[1]) . ")";
1171 }
1172 }
1173
75929b4b 1174 my @masks;
2358c533 1175 if (@ranges > 1) {
75929b4b 1176
b6a6e956 1177 # See if the entire set shares optimizable characteristics, and if so,
e42cde6b
KW
1178 # return the optimization. There is no need to do this on sets with
1179 # just a single range, as that can be expressed with a single
1180 # conditional.
cef3e2d6 1181 @masks = calculate_mask(@cond);
75929b4b
KW
1182
1183 # Stringify the output of calculate_mask()
1184 if (@masks) {
bb20f3e2 1185 my @masked;
75929b4b
KW
1186 foreach my $mask_ref (@masks) {
1187 if (defined $mask_ref->[1]) {
bb20f3e2 1188 push @masked, "( ( $test & "
8c112bb9
KW
1189 . $self->val_fmt($mask_ref->[1]) . " ) == "
1190 . $self->val_fmt($mask_ref->[0]) . " )";
75929b4b
KW
1191 }
1192 else { # An undefined mask means to use the value as-is
bb20f3e2 1193 push @masked, "$test == " . $self->val_fmt($mask_ref->[0]);
75929b4b
KW
1194 }
1195 }
1196
1197 # The best possible case below for specifying this set of values via
1198 # ranges is 1 branch per range. If our mask method yielded better
1199 # results, there is no sense trying something that is bound to be
1200 # worse.
bb20f3e2
KW
1201 if (@masked < @ranges) {
1202 return "( " . join( " || ", @masked ) . " )";
75929b4b
KW
1203 }
1204
bb20f3e2 1205 @masks = @masked;
6e130234 1206 }
2358c533 1207 }
f5772832 1208
75929b4b
KW
1209 # We have generated the list of bytes in two ways; one trying to use masks
1210 # to cut the number of branches down, and the other to look at individual
1211 # ranges (some of which could be cut down by using a mask for just it).
1212 # We return whichever method uses the fewest branches.
1213 return "( "
933716a4 1214 . join( " || ", (@masks && @masks < @ranges)
75929b4b
KW
1215 ? @masks
1216 : @ranges)
1217 . " )";
12b72891
RGS
1218}
1219
e64b1bd1
YO
1220# _combine
1221# recursively turn a list of conditions into a fast break-out condition
1222# used by _cond_as_str() for 'cp' type macros.
1223sub _combine {
1224 my ( $self, $test, @cond )= @_;
1225 return if !@cond;
1226 my $item= shift @cond;
1227 my ( $cstr, $gtv );
6c62bf0f
KW
1228 if ( ref $item ) { # @item should be a 2-element array giving range start
1229 # and end
1230 if ($item->[0] == 0) { # UV's are never negative, so skip "0 <= "
1231 # test which could generate a compiler warning
1232 # that test is always true
8c112bb9 1233 $cstr= "$test <= " . $self->val_fmt($item->[1]);
6c62bf0f
KW
1234 }
1235 else {
88086fd8 1236 $cstr = "inRANGE_helper_(UV, $test, "
8c112bb9
KW
1237 . $self->val_fmt($item->[0]) . ", "
1238 . $self->val_fmt($item->[1]) . ")";
6c62bf0f 1239 }
8c112bb9 1240 $gtv= $self->val_fmt($item->[1]);
12b72891 1241 } else {
8c112bb9
KW
1242 $cstr= $self->val_fmt($item) . " == $test";
1243 $gtv= $self->val_fmt($item)
12b72891 1244 }
e64b1bd1 1245 if ( @cond ) {
ee98d22d
YO
1246 my $combine= $self->_combine( $test, @cond );
1247 if (@cond >1) {
1248 return "( $cstr || ( $gtv < $test &&\n"
1249 . $combine . " ) )";
1250 } else {
1251 return "( $cstr || $combine )";
1252 }
12b72891 1253 } else {
e64b1bd1 1254 return $cstr;
12b72891 1255 }
e64b1bd1 1256}
12b72891 1257
e64b1bd1
YO
1258# _render()
1259# recursively convert an optree to text with reasonably neat formatting
1260sub _render {
39a0f513 1261 my ( $self, $op, $combine, $brace, $opts_ref, $def, $submacros )= @_;
2e39f0c2 1262 return 0 if ! defined $op; # The set is empty
e64b1bd1
YO
1263 if ( !ref $op ) {
1264 return $op;
12b72891 1265 }
ba073cf2 1266 my $cond= $self->_cond_as_str( $op, $combine, $opts_ref );
cc08b31c
KW
1267 #no warnings 'recursion'; # This would allow really really inefficient
1268 # code to be generated. See pod
4fad5f9f
KW
1269 my $yes= $self->_render( $op->{yes}, $combine, 1, $opts_ref, $def,
1270 $submacros);
30188af7
KW
1271 return $yes if $cond eq '1';
1272
4fad5f9f
KW
1273 my $no= $self->_render( $op->{no}, $combine, 0, $opts_ref, $def,
1274 $submacros);
e64b1bd1
YO
1275 return "( $cond )" if $yes eq '1' and $no eq '0';
1276 my ( $lb, $rb )= $brace ? ( "( ", " )" ) : ( "", "" );
1277 return "$lb$cond ? $yes : $no$rb"
1278 if !ref( $op->{yes} ) && !ref( $op->{no} );
1279 my $ind1= " " x 4;
1280 my $ind= "\n" . ( $ind1 x $op->{depth} );
1281
1282 if ( ref $op->{yes} ) {
1283 $yes= $ind . $ind1 . $yes;
1284 } else {
1285 $yes= " " . $yes;
1286 }
1287
39a0f513
YO
1288 my $str= "$lb$cond ?$yes$ind: $no$rb";
1289 if (length $str > 6000) {
4fad5f9f 1290 push @$submacros, sprintf "#define $def\n( %s )", "_part"
2f338e94 1291 . (my $yes_idx= 0+@$submacros) . "_", $yes;
4fad5f9f 1292 push @$submacros, sprintf "#define $def\n( %s )", "_part"
2f338e94
KW
1293 . (my $no_idx= 0+@$submacros) . "_", $no;
1294 return sprintf "%s%s ? $def : $def%s", $lb, $cond,
1295 "_part${yes_idx}_", "_part${no_idx}_", $rb;
39a0f513
YO
1296 }
1297 return $str;
12b72891 1298}
32e6a07c 1299
e64b1bd1
YO
1300# $expr=render($op,$combine)
1301#
1302# convert an optree to text with reasonably neat formatting. If $combine
1303# is true then the condition is created using "fast breakouts" which
1304# produce uglier expressions that are more efficient for common case,
1305# longer lists such as that resulting from type 'cp' output.
1306# Currently only used for type 'cp' macros.
1307sub render {
39a0f513 1308 my ( $self, $op, $combine, $opts_ref, $def_fmt )= @_;
4fad5f9f 1309
39a0f513 1310 my @submacros;
4fad5f9f
KW
1311 my $macro= sprintf "#define $def_fmt\n( %s )", "",
1312 $self->_render( $op, $combine, 0, $opts_ref, $def_fmt,
1313 \@submacros);
39a0f513 1314
4fad5f9f
KW
1315 return join "\n\n",
1316 map { "/*** GENERATED CODE ***/\n" . __macro( __clean( $_ ) ) }
1317 @submacros, $macro;
12b72891 1318}
e64b1bd1
YO
1319
1320# make_macro
1321# make a macro of a given type.
1322# calls into make_trie and (generic_|length_)optree as needed
1323# Opts are:
40f914fd
KW
1324# type : 'cp','cp_high', 'generic','high','low','latin1','utf8','LATIN1','UTF8'
1325# ret_type : 'cp' or 'len'
1326# safe : don't assume is well-formed UTF-8, so don't skip any range
1327# checks, and add length guards to macro
1328# no_length_checks : like safe, but don't add length guards.
e64b1bd1
YO
1329#
1330# type defaults to 'generic', and ret_type to 'len' unless type is 'cp'
1331# in which case it defaults to 'cp' as well.
1332#
3ff97bcf 1333# It is illegal to do a type 'cp' macro on a pattern with multi-codepoint
e64b1bd1
YO
1334# sequences in it, as the generated macro will accept only a single codepoint
1335# as an argument.
1336#
6b94381d
KW
1337# It is also illegal to do a non-safe macro on a pattern with multi-codepoint
1338# sequences in it, as even if it is known to be well-formed, we need to not
91e83b73 1339# run off the end of the buffer when, say, the buffer ends with the first two
6b94381d
KW
1340# characters, but three are looked at by the macro.
1341#
e64b1bd1
YO
1342# returns the macro.
1343
1344
1345sub make_macro {
1346 my $self= shift;
1347 my %opts= @_;
1348 my $type= $opts{type} || 'generic';
6b94381d
KW
1349 if ($self->{has_multi}) {
1350 if ($type =~ /^cp/) {
4fad5f9f
KW
1351 die "Can't do a 'cp' on multi-codepoint character class"
1352 . " '$self->{op}'"
6b94381d
KW
1353 }
1354 elsif (! $opts{safe}) {
4fad5f9f
KW
1355 die "'safe' is required on multi-codepoint character class"
1356 ." '$self->{op}'"
6b94381d
KW
1357 }
1358 }
900c17f9 1359 my $ret_type= $opts{ret_type} || ( $opts{type} =~ /^cp/ ? 'cp' : 'len' );
e64b1bd1
YO
1360 my $method;
1361 if ( $opts{safe} ) {
1362 $method= 'length_optree';
87894a24 1363 } elsif ( $type =~ /generic/ ) {
e64b1bd1
YO
1364 $method= 'generic_optree';
1365 } else {
1366 $method= 'optree';
1367 }
900c17f9 1368 my @args= $type =~ /^cp/ ? 'cp' : 's';
e64b1bd1 1369 push @args, "e" if $opts{safe};
87894a24 1370 push @args, "is_utf8" if $type =~ /generic/;
e64b1bd1 1371 push @args, "len" if $ret_type eq 'both';
4fad5f9f 1372 my $pfx= $ret_type eq 'both' ? 'what_len_' :
e64b1bd1 1373 $ret_type eq 'cp' ? 'what_' : 'is_';
87894a24
KW
1374 my $ext= $type =~ /generic/ ? '' : '_' . lc( $type );
1375 $ext .= '_non_low' if $type eq 'generic_non_low';
e64b1bd1 1376 $ext .= "_safe" if $opts{safe};
40f914fd 1377 $ext .= "_no_length_checks" if $opts{no_length_checks};
e64b1bd1 1378 my $argstr= join ",", @args;
39a0f513
YO
1379 my $def_fmt="$pfx$self->{op}$ext%s($argstr)";
1380 my $optree= $self->$method( %opts, type => $type, ret_type => $ret_type );
1381 return $self->render( $optree, ($type =~ /^cp/) ? 1 : 0, \%opts, $def_fmt );
32e6a07c 1382}
e64b1bd1 1383
b6a6e956 1384# if we aren't being used as a module (highly likely) then process
e64b1bd1
YO
1385# the __DATA__ below and produce macros in regcharclass.h
1386# if an argument is provided to the script then it is assumed to
1387# be the path of the file to output to, if the arg is '-' outputs
1388# to STDOUT.
1389if ( !caller ) {
e64b1bd1 1390 $|++;
8770da0e 1391 my $path= shift @ARGV || "regcharclass.h";
e64b1bd1
YO
1392 my $out_fh;
1393 if ( $path eq '-' ) {
1394 $out_fh= \*STDOUT;
1395 } else {
9824c081 1396 $out_fh = open_new( $path );
e64b1bd1 1397 }
8770da0e 1398 print $out_fh read_only_top( lang => 'C', by => $0,
9824c081
MS
1399 file => 'regcharclass.h', style => '*',
1400 copyright => [2007, 2011],
212b6c86
KW
1401 final => <<EOF,
1402WARNING: These macros are for internal Perl core use only, and may be
1403changed or removed without notice.
1404EOF
1405 );
4fad5f9f
KW
1406 print $out_fh "\n#ifndef PERL_REGCHARCLASS_H_ /* Guard against nested",
1407 " #includes */\n#define PERL_REGCHARCLASS_H_\n";
12b72891 1408
bb949220 1409 my ( $op, $title, @txt, @types, %mods );
a1b2a50f 1410 my $doit= sub ($) {
e64b1bd1 1411 return unless $op;
ae1d4929 1412
a1b2a50f
KW
1413 my $charset = shift;
1414
ae1d4929 1415 # Skip if to compile on a different platform.
a1b2a50f
KW
1416 return if delete $mods{only_ascii_platform} && $charset !~ /ascii/i;
1417 return if delete $mods{only_ebcdic_platform} && $charset !~ /ebcdic/i;
ae1d4929 1418
e64b1bd1
YO
1419 print $out_fh "/*\n\t$op: $title\n\n";
1420 print $out_fh join "\n", ( map { "\t$_" } @txt ), "*/", "";
4fad5f9f
KW
1421 my $obj= __PACKAGE__->new( op => $op, title => $title, txt => \@txt,
1422 charset => $charset);
e64b1bd1 1423
bb949220
KW
1424 #die Dumper(\@types,\%mods);
1425
1426 my @mods;
1427 push @mods, 'safe' if delete $mods{safe};
40f914fd 1428 push @mods, 'no_length_checks' if delete $mods{no_length_checks};
4fad5f9f
KW
1429
1430 # Default to 'fast' do this one first, as traditional
1431 unshift @mods, 'fast' if delete $mods{fast} || ! @mods;
bb949220 1432 if (%mods) {
122a2d8f 1433 die "Unknown modifiers: ", join ", ", map { "'$_'" } sort keys %mods;
bb949220 1434 }
e64b1bd1
YO
1435
1436 foreach my $type_spec ( @types ) {
1437 my ( $type, $ret )= split /-/, $type_spec;
1438 $ret ||= 'len';
1439 foreach my $mod ( @mods ) {
f71bd789
KW
1440
1441 # 'safe' is irrelevant with code point macros, so skip if
1442 # there is also a 'fast', but don't skip if this is the only
1443 # way a cp macro will get generated. Below we convert 'safe'
1444 # to 'fast' in this instance
1445 next if $type =~ /^cp/
40f914fd
KW
1446 && ($mod eq 'safe' || $mod eq 'no_length_checks')
1447 && grep { 'fast' =~ $_ } @mods;
bb949220 1448 delete $mods{$mod};
e64b1bd1
YO
1449 my $macro= $obj->make_macro(
1450 type => $type,
1451 ret_type => $ret,
81200454 1452 safe => $mod eq 'safe' && $type !~ /^cp/,
a1b2a50f 1453 charset => $charset,
4fad5f9f
KW
1454 no_length_checks => $mod eq 'no_length_checks'
1455 && $type !~ /^cp/,
e64b1bd1
YO
1456 );
1457 print $out_fh $macro, "\n";
1458 }
32e6a07c 1459 }
e64b1bd1
YO
1460 };
1461
a1b2a50f
KW
1462 my @data = <DATA>;
1463 foreach my $charset (get_supported_code_pages()) {
1464 my $first_time = 1;
1465 undef $op;
1466 undef $title;
1467 undef @txt;
1468 undef @types;
1469 undef %mods;
1470 print $out_fh "\n", get_conditional_compile_line_start($charset);
1471 my @data_copy = @data;
1472 for (@data_copy) {
91e83b73
KW
1473 s/^ \s* (?: \# .* ) ? $ //x; # squeeze out comment and blanks
1474 next unless /\S/;
1475 chomp;
1476 if ( /^[A-Z]/ ) {
1477 $doit->($charset) unless $first_time; # This starts a new
1478 # definition; do the
1479 # previous one
1480 $first_time = 0;
1481 ( $op, $title )= split /\s*:\s*/, $_, 2;
1482 @txt= ();
1483 } elsif ( s/^=>// ) {
1484 my ( $type, $modifier )= split /:/, $_;
1485 @types= split ' ', $type;
1486 undef %mods;
1487 map { $mods{$_} = 1 } split ' ', $modifier;
1488 } else {
1489 push @txt, "$_";
1490 }
12b72891 1491 }
91e83b73 1492 $doit->($charset);
a1b2a50f
KW
1493 print $out_fh get_conditional_compile_line_end();
1494 }
d10c72f2 1495
6a5bc5ac 1496 print $out_fh "\n#endif /* PERL_REGCHARCLASS_H_ */\n";
d10c72f2 1497
8770da0e 1498 if($path eq '-') {
9824c081 1499 print $out_fh "/* ex: set ro: */\n";
8770da0e 1500 } else {
0c36c41b
KW
1501 # Some of the sources for these macros come from Unicode tables
1502 my $sources_list = "lib/unicore/mktables.lst";
b60dc4b9
KW
1503 my @sources = ($0, qw(lib/unicore/mktables
1504 lib/Unicode/UCD.pm
1505 regen/regcharclass_multi_char_folds.pl
1506 regen/charset_translations.pl
1507 ));
0c36c41b
KW
1508 {
1509 # Depend on mktables’ own sources. It’s a shorter list of files than
1510 # those that Unicode::UCD uses.
1ae6ead9 1511 if (! open my $mktables_list, '<', $sources_list) {
0c36c41b
KW
1512
1513 # This should force a rebuild once $sources_list exists
1514 push @sources, $sources_list;
1515 }
1516 else {
1517 while(<$mktables_list>) {
1518 last if /===/;
1519 chomp;
1520 push @sources, "lib/unicore/$_" if /^[^#]/;
1521 }
1522 }
1523 }
1524 read_only_bottom_close_and_rename($out_fh, \@sources)
8770da0e 1525 }
12b72891 1526}
e64b1bd1 1527
cc08b31c
KW
1528# The form of the input is a series of definitions to make macros for.
1529# The first line gives the base name of the macro, followed by a colon, and
1530# then text to be used in comments associated with the macro that are its
1531# title or description. In all cases the first (perhaps only) parameter to
1532# the macro is a pointer to the first byte of the code point it is to test to
1533# see if it is in the class determined by the macro. In the case of non-UTF8,
1534# the code point consists only of a single byte.
1535#
1536# The second line must begin with a '=>' and be followed by the types of
1537# macro(s) to be generated; these are specified below. A colon follows the
1538# types, followed by the modifiers, also specified below. At least one
1539# modifier is required.
1540#
1541# The subsequent lines give what code points go into the class defined by the
1542# macro. Multiple characters may be specified via a string like "\x0D\x0A",
60910c93
KW
1543# enclosed in quotes. Otherwise the lines consist of one of:
1544# 1) a single Unicode code point, prefaced by 0x
1545# 2) a single range of Unicode code points separated by a minus (and
1546# optional space)
1547# 3) a single Unicode property specified in the standard Perl form
1548# "\p{...}"
1549# 4) a line like 'do path'. This will do a 'do' on the file given by
1550# 'path'. It is assumed that this does nothing but load subroutines
1551# (See item 5 below). The reason 'require path' is not used instead is
1552# because 'do' doesn't assume that path is in @INC.
1553# 5) a subroutine call
1554# &pkg::foo(arg1, ...)
1555# where pkg::foo was loaded by a 'do' line (item 4). The subroutine
1556# returns an array of entries of forms like items 1-3 above. This
1557# allows more complex inputs than achievable from the other input types.
cc08b31c
KW
1558#
1559# A blank line or one whose first non-blank character is '#' is a comment.
1560# The definition of the macro is terminated by a line unlike those described.
1561#
1562# Valid types:
1563# low generate a macro whose name is 'is_BASE_low' and defines a
1564# class that includes only ASCII-range chars. (BASE is the
1565# input macro base name.)
1566# latin1 generate a macro whose name is 'is_BASE_latin1' and defines a
1567# class that includes only upper-Latin1-range chars. It is not
1568# designed to take a UTF-8 input parameter.
b1af8fef
KW
1569# high generate a macro whose name is 'is_BASE_high' and defines a
1570# class that includes all relevant code points that are above
1571# the Latin1 range. This is for very specialized uses only.
1572# It is designed to take only an input UTF-8 parameter.
cc08b31c
KW
1573# utf8 generate a macro whose name is 'is_BASE_utf8' and defines a
1574# class that includes all relevant characters that aren't ASCII.
1575# It is designed to take only an input UTF-8 parameter.
1576# LATIN1 generate a macro whose name is 'is_BASE_latin1' and defines a
1577# class that includes both ASCII and upper-Latin1-range chars.
1578# It is not designed to take a UTF-8 input parameter.
1579# UTF8 generate a macro whose name is 'is_BASE_utf8' and defines a
1580# class that can include any code point, adding the 'low' ones
1581# to what 'utf8' works on. It is designed to take only an input
1582# UTF-8 parameter.
1583# generic generate a macro whose name is 'is_BASE". It has a 2nd,
1584# boolean, parameter which indicates if the first one points to
1585# a UTF-8 string or not. Thus it works in all circumstances.
87894a24
KW
1586# generic_non_low generate a macro whose name is 'is_BASE_non_low". It has
1587# a 2nd, boolean, parameter which indicates if the first one
1588# points to a UTF-8 string or not. It excludes any ASCII-range
1589# matches, but otherwise it works in all circumstances.
cc08b31c
KW
1590# cp generate a macro whose name is 'is_BASE_cp' and defines a
1591# class that returns true if the UV parameter is a member of the
1592# class; false if not.
900c17f9
KW
1593# cp_high like cp, but it is assumed that it is known that the UV
1594# parameter is above Latin1. The name of the generated macro is
1595# 'is_BASE_cp_high'. This is different from high-cp, derived
1596# below.
cc08b31c
KW
1597# A macro of the given type is generated for each type listed in the input.
1598# The default return value is the number of octets read to generate the match.
1599# Append "-cp" to the type to have it instead return the matched codepoint.
1600# The macro name is changed to 'what_BASE...'. See pod for
1601# caveats
1602# Appending '-both" instead adds an extra parameter to the end of the argument
1603# list, which is a pointer as to where to store the number of
1604# bytes matched, while also returning the code point. The macro
1605# name is changed to 'what_len_BASE...'. See pod for caveats
1606#
1607# Valid modifiers:
1608# safe The input string is not necessarily valid UTF-8. In
1609# particular an extra parameter (always the 2nd) to the macro is
1610# required, which points to one beyond the end of the string.
1611# The macro will make sure not to read off the end of the
1612# string. In the case of non-UTF8, it makes sure that the
1613# string has at least one byte in it. The macro name has
1614# '_safe' appended to it.
40f914fd
KW
1615# no_length_checks The input string is not necessarily valid UTF-8, but it
1616# is to be assumed that the length has already been checked and
1617# found to be valid
cc08b31c
KW
1618# fast The input string is valid UTF-8. No bounds checking is done,
1619# and the macro can make assumptions that lead to faster
1620# execution.
a1b2a50f 1621# only_ascii_platform Skip this definition if the character set is for
ae1d4929 1622# a non-ASCII platform.
a1b2a50f 1623# only_ebcdic_platform Skip this definition if the character set is for
ae1d4929 1624# a non-EBCDIC platform.
cc08b31c
KW
1625# No modifier need be specified; fast is assumed for this case. If both
1626# 'fast', and 'safe' are specified, two macros will be created for each
1627# 'type'.
e90ac8de 1628#
295bcca9 1629# If run on a non-ASCII platform will automatically convert the Unicode input
cc08b31c
KW
1630# to native. The documentation above is slightly wrong in this case. 'low'
1631# actually refers to code points whose UTF-8 representation is the same as the
1632# non-UTF-8 version (invariants); and 'latin1' refers to all the rest of the
1633# code points less than 256.
5e6c6c1e
KW
1634
16351; # in the unlikely case we are being used as a module
1636
1637__DATA__
1638# This is no longer used, but retained in case it is needed some day.
e90ac8de
KW
1639# TRICKYFOLD: Problematic fold case letters. When adding to this list, also should add them to regcomp.c and fold_grind.t
1640# => generic cp generic-cp generic-both :fast safe
1641# 0x00DF # LATIN SMALL LETTER SHARP S
1642# 0x0390 # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
1643# 0x03B0 # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
1644# 0x1E9E # LATIN CAPITAL LETTER SHARP S, because maps to same as 00DF
1645# 0x1FD3 # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA; maps same as 0390
1646# 0x1FE3 # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA; maps same as 03B0
1647
12b72891 1648LNBREAK: Line Break: \R
5c025f03 1649=> generic UTF8 LATIN1 : safe
12b72891 1650"\x0D\x0A" # CRLF - Network (Windows) line ending
05b688d9 1651\p{VertSpace}
12b72891
RGS
1652
1653HORIZWS: Horizontal Whitespace: \h \H
507ce328 1654=> high cp_high : fast
05b688d9 1655\p{HorizSpace}
12b72891
RGS
1656
1657VERTWS: Vertical Whitespace: \v \V
507ce328 1658=> high cp_high : fast
05b688d9 1659\p{VertSpace}
612ead59 1660
4ac6419d 1661XDIGIT: Hexadecimal digits
507ce328 1662=> high cp_high : fast
4ac6419d
KW
1663\p{XDigit}
1664
bedac28b 1665XPERLSPACE: \p{XPerlSpace}
507ce328 1666=> high cp_high : fast
bedac28b
KW
1667\p{XPerlSpace}
1668
b96a92fb 1669NONCHAR: Non character code points
89d986df 1670=> UTF8 :safe
099323b4 1671\p{_Perl_Nchar}
b96a92fb 1672
6b28089c
KW
1673SHORTER_NON_CHARS: # 3 bytes
1674=> UTF8 :only_ascii_platform fast
16750xFDD0 - 0xFDEF
16760xFFFE - 0xFFFF
1677
1678LARGER_NON_CHARS: # 4 bytes
1679=> UTF8 :only_ascii_platform fast
16800x1FFFE - 0x1FFFF
16810x2FFFE - 0x2FFFF
16820x3FFFE - 0x3FFFF
16830x4FFFE - 0x4FFFF
16840x5FFFE - 0x5FFFF
16850x6FFFE - 0x6FFFF
16860x7FFFE - 0x7FFFF
16870x8FFFE - 0x8FFFF
16880x9FFFE - 0x9FFFF
16890xAFFFE - 0xAFFFF
16900xBFFFE - 0xBFFFF
16910xCFFFE - 0xCFFFF
16920xDFFFE - 0xDFFFF
16930xEFFFE - 0xEFFFF
16940xFFFFE - 0xFFFFF
16950x10FFFE - 0x10FFFF
1696
1697SHORTER_NON_CHARS: # 4 bytes
1698=> UTF8 :only_ebcdic_platform fast
16990xFDD0 - 0xFDEF
17000xFFFE - 0xFFFF
17010x1FFFE - 0x1FFFF
17020x2FFFE - 0x2FFFF
17030x3FFFE - 0x3FFFF
1704
1705LARGER_NON_CHARS: # 5 bytes
1706=> UTF8 :only_ebcdic_platform fast
17070x4FFFE - 0x4FFFF
17080x5FFFE - 0x5FFFF
17090x6FFFE - 0x6FFFF
17100x7FFFE - 0x7FFFF
17110x8FFFE - 0x8FFFF
17120x9FFFE - 0x9FFFF
17130xAFFFE - 0xAFFFF
17140xBFFFE - 0xBFFFF
17150xCFFFE - 0xCFFFF
17160xDFFFE - 0xDFFFF
17170xEFFFE - 0xEFFFF
17180xFFFFE - 0xFFFFF
17190x10FFFE - 0x10FFFF
1720
21cb232c 1721SURROGATE: Surrogate code points
89d986df 1722=> UTF8 :safe
099323b4 1723\p{_Perl_Surrogate}
b96a92fb 1724
685289b5
KW
1725QUOTEMETA: Meta-characters that \Q should quote
1726=> high :fast
1727\p{_Perl_Quotemeta}
8769f413
KW
1728
1729MULTI_CHAR_FOLD: multi-char strings that are folded to by a single character
a50454ce
KW
1730=> UTF8 UTF8-cp :safe
1731%regcharclass_multi_char_folds::multi_char_folds('u', 'a')
8769f413 1732
40b1ba4f 1733MULTI_CHAR_FOLD: multi-char strings that are folded to by a single character
a50454ce
KW
1734=> LATIN1 LATIN1-cp : safe
1735%regcharclass_multi_char_folds::multi_char_folds('l', 'a')
42d7c910
KW
1736
1737THREE_CHAR_FOLD: A three-character multi-char fold
1738=> UTF8 :safe
a50454ce 1739%regcharclass_multi_char_folds::multi_char_folds('u', '3')
42d7c910
KW
1740
1741THREE_CHAR_FOLD: A three-character multi-char fold
1742=> LATIN1 :safe
a50454ce 1743%regcharclass_multi_char_folds::multi_char_folds('l', '3')
8769f413 1744
42d7c910
KW
1745THREE_CHAR_FOLD_HEAD: The first two of three-character multi-char folds
1746=> UTF8 :safe
a50454ce 1747%regcharclass_multi_char_folds::multi_char_folds('u', 'h')
42d7c910
KW
1748
1749THREE_CHAR_FOLD_HEAD: The first two of three-character multi-char folds
1750=> LATIN1 :safe
a50454ce 1751%regcharclass_multi_char_folds::multi_char_folds('l', 'h')
42d7c910
KW
1752#
1753#THREE_CHAR_FOLD_NON_FINAL: The first or middle character of multi-char folds
1754#=> UTF8 :safe
a50454ce 1755#%regcharclass_multi_char_folds::multi_char_folds('u', 'fm')
42d7c910
KW
1756#
1757#THREE_CHAR_FOLD_NON_FINAL: The first or middle character of multi-char folds
1758#=> LATIN1 :safe
a50454ce 1759#%regcharclass_multi_char_folds::multi_char_folds('l', 'fm')
0b50d62a 1760
1a27eb96
KW
1761FOLDS_TO_MULTI: characters that fold to multi-char strings
1762=> UTF8 :fast
1763\p{_Perl_Folds_To_Multi_Char}
1764
31f05a37
KW
1765PROBLEMATIC_LOCALE_FOLD : characters whose fold is problematic under locale
1766=> UTF8 cp :fast
1767\p{_Perl_Problematic_Locale_Folds}
1768
1769PROBLEMATIC_LOCALE_FOLDEDS_START : The first folded character of folds which are problematic under locale
1770=> UTF8 cp :fast
1771\p{_Perl_Problematic_Locale_Foldeds_Start}
1772
0b50d62a 1773PATWS: pattern white space
ef06e936 1774=> generic : safe
099323b4 1775\p{_Perl_PatWS}
6c12993c 1776
67260a96 1777HANGUL_ED: Hangul syllables whose first UTF-8 byte is \xED
6c12993c
KW
1778=> UTF8 :only_ascii_platform safe
17790xD000 - 0xD7FF
67260a96
KW
1780
1781HANGUL_ED: Hangul syllables whose first UTF-8 byte is \xED
1782=> UTF8 :only_ebcdic_platform safe
17830x1 - 0x0
1784# Alows fails on EBCDIC; there are no ED Hanguls there