This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5
De-globalize regcomp inversion lists.
[perl5.git] / regen / regcharclass.pl
CommitLineData
1f00b0d6 1#!perl
e64b1bd1 2package CharClass::Matcher;
12b72891 3use strict;
8770da0e 4use 5.008;
12b72891 5use warnings;
e64b1bd1 6use warnings FATAL => 'all';
12b72891 7use Text::Wrap qw(wrap);
12b72891 8use Data::Dumper;
e64b1bd1
YO
9$Data::Dumper::Useqq= 1;
10our $hex_fmt= "0x%02X";
12b72891 11
75929b4b
KW
12sub DEBUG () { 0 }
13$|=1 if DEBUG;
14
295bcca9
KW
15sub ASCII_PLATFORM { (ord('A') == 65) }
16
8770da0e
NC
17require 'regen/regen_lib.pl';
18
ab84f958 19=head1 NAME
0ccab2bc 20
e64b1bd1 21CharClass::Matcher -- Generate C macros that match character classes efficiently
12b72891 22
e64b1bd1
YO
23=head1 SYNOPSIS
24
ab84f958 25 perl Porting/regcharclass.pl
e64b1bd1
YO
26
27=head1 DESCRIPTION
12b72891
RGS
28
29Dynamically generates macros for detecting special charclasses
e64b1bd1 30in latin-1, utf8, and codepoint forms. Macros can be set to return
cc08b31c 31the length (in bytes) of the matched codepoint, and/or the codepoint itself.
12b72891 32
cc08b31c 33To regenerate F<regcharclass.h>, run this script from perl-root. No arguments
12b72891
RGS
34are necessary.
35
cc08b31c
KW
36Using WHATEVER as an example the following macros can be produced, depending
37on the input parameters (how to get each is described by internal comments at
38the C<__DATA__> line):
12b72891
RGS
39
40=over 4
41
cc08b31c 42=item C<is_WHATEVER(s,is_utf8)>
12b72891 43
cc08b31c 44=item C<is_WHATEVER_safe(s,e,is_utf8)>
12b72891 45
cc08b31c
KW
46Do a lookup as appropriate based on the C<is_utf8> flag. When possible
47comparisons involving octect<128 are done before checking the C<is_utf8>
12b72891
RGS
48flag, hopefully saving time.
49
cc08b31c
KW
50The version without the C<_safe> suffix should be used only when the input is
51known to be well-formed.
12b72891 52
cc08b31c
KW
53=item C<is_WHATEVER_utf8(s)>
54
55=item C<is_WHATEVER_utf8_safe(s,e)>
12b72891
RGS
56
57Do a lookup assuming the string is encoded in (normalized) UTF8.
58
cc08b31c
KW
59The version without the C<_safe> suffix should be used only when the input is
60known to be well-formed.
61
62=item C<is_WHATEVER_latin1(s)>
12b72891 63
cc08b31c 64=item C<is_WHATEVER_latin1_safe(s,e)>
12b72891
RGS
65
66Do a lookup assuming the string is encoded in latin-1 (aka plan octets).
67
cc08b31c
KW
68The version without the C<_safe> suffix should be used only when it is known
69that C<s> contains at least one character.
70
71=item C<is_WHATEVER_cp(cp)>
12b72891 72
47e01c32 73Check to see if the string matches a given codepoint (hypothetically a
12b72891
RGS
74U32). The condition is constructed as as to "break out" as early as
75possible if the codepoint is out of range of the condition.
76
77IOW:
78
79 (cp==X || (cp>X && (cp==Y || (cp>Y && ...))))
80
81Thus if the character is X+1 only two comparisons will be done. Making
82matching lookups slower, but non-matching faster.
83
cc08b31c
KW
84=item C<what_len_WHATEVER_FOO(arg1, ..., len)>
85
86A variant form of each of the macro types described above can be generated, in
87which the code point is returned by the macro, and an extra parameter (in the
88final position) is added, which is a pointer for the macro to set the byte
89length of the returned code point.
90
91These forms all have a C<what_len> prefix instead of the C<is_>, for example
92C<what_len_WHATEVER_safe(s,e,is_utf8,len)> and
93C<what_len_WHATEVER_utf8(s,len)>.
94
95These forms should not be used I<except> on small sets of mostly widely
96separated code points; otherwise the code generated is inefficient. For these
97cases, it is best to use the C<is_> forms, and then find the code point with
98C<utf8_to_uvchr_buf>(). This program can fail with a "deep recursion"
99message on the worst of the inappropriate sets. Examine the generated macro
100to see if it is acceptable.
12b72891 101
cc08b31c
KW
102=item C<what_WHATEVER_FOO(arg1, ...)>
103
104A variant form of each of the C<is_> macro types described above can be generated, in
105which the code point and not the length is returned by the macro. These have
106the same caveat as L</what_len_WHATEVER_FOO(arg1, ..., len)>, plus they should
107not be used where the set contains a NULL, as 0 is returned for two different
108cases: a) the set doesn't include the input code point; b) the set does
109include it, and it is a NULL.
110
111=back
e64b1bd1
YO
112
113=head2 CODE FORMAT
114
115perltidy -st -bt=1 -bbt=0 -pt=0 -sbt=1 -ce -nwls== "%f"
116
117
118=head1 AUTHOR
119
cc08b31c 120Author: Yves Orton (demerphq) 2007. Maintained by Perl5 Porters.
e64b1bd1
YO
121
122=head1 BUGS
123
124No tests directly here (although the regex engine will fail tests
125if this code is broken). Insufficient documentation and no Getopts
126handler for using the module as a script.
127
128=head1 LICENSE
129
130You may distribute under the terms of either the GNU General Public
131License or the Artistic License, as specified in the README file.
132
12b72891
RGS
133=cut
134
e64b1bd1
YO
135# Sub naming convention:
136# __func : private subroutine, can not be called as a method
137# _func : private method, not meant for external use
138# func : public method.
139
140# private subs
141#-------------------------------------------------------------------------------
142#
143# ($cp,$n,$l,$u)=__uni_latin($str);
144#
47e01c32 145# Return a list of arrays, each of which when interpreted correctly
e64b1bd1
YO
146# represent the string in some given encoding with specific conditions.
147#
148# $cp - list of codepoints that make up the string.
295bcca9
KW
149# $n - list of octets that make up the string if all codepoints are invariant
150# regardless of if the string is in UTF-8 or not.
e64b1bd1 151# $l - list of octets that make up the string in latin1 encoding if all
295bcca9
KW
152# codepoints < 256, and at least one codepoint is UTF-8 variant.
153# $u - list of octets that make up the string in utf8 if any codepoint is
154# UTF-8 variant
e64b1bd1
YO
155#
156# High CP | Defined
157#-----------+----------
295bcca9 158# 0 - 127 : $n (127/128 are the values for ASCII platforms)
e64b1bd1
YO
159# 128 - 255 : $l, $u
160# 256 - ... : $u
161#
162
163sub __uni_latin1 {
164 my $str= shift;
165 my $max= 0;
166 my @cp;
295bcca9 167 my $only_has_invariants = 1;
e64b1bd1
YO
168 for my $ch ( split //, $str ) {
169 my $cp= ord $ch;
170 push @cp, $cp;
171 $max= $cp if $max < $cp;
295bcca9
KW
172 if (! ASCII_PLATFORM && $only_has_invariants) {
173 if ($cp > 255) {
174 $only_has_invariants = 0;
175 }
176 else {
177 my $temp = chr($cp);
178 utf8::upgrade($temp);
179 my @utf8 = unpack "U0C*", $temp;
180 $only_has_invariants = (@utf8 == 1 && $utf8[0] == $cp);
181 }
182 }
e64b1bd1
YO
183 }
184 my ( $n, $l, $u );
295bcca9
KW
185 $only_has_invariants = $max < 128 if ASCII_PLATFORM;
186 if ($only_has_invariants) {
e64b1bd1
YO
187 $n= [@cp];
188 } else {
189 $l= [@cp] if $max && $max < 256;
190
ca51670f
KW
191 $u= $str;
192 utf8::upgrade($u);
193 $u= [ unpack "U0C*", $u ] if defined $u;
12b72891 194 }
e64b1bd1 195 return ( \@cp, $n, $l, $u );
12b72891
RGS
196}
197
12b72891 198#
e64b1bd1
YO
199# $clean= __clean($expr);
200#
201# Cleanup a ternary expression, removing unnecessary parens and apply some
202# simplifications using regexes.
203#
204
205sub __clean {
206 my ( $expr )= @_;
8fdb8a9d 207
9a3182e9
YO
208 #return $expr;
209
e64b1bd1
YO
210 our $parens;
211 $parens= qr/ (?> \( (?> (?: (?> [^()]+ ) | (??{ $parens }) )* ) \) ) /x;
212
8fdb8a9d 213 ## remove redundant parens
e64b1bd1 214 1 while $expr =~ s/ \( \s* ( $parens ) \s* \) /$1/gx;
8fdb8a9d
YO
215
216
217 # repeatedly simplify conditions like
218 # ( (cond1) ? ( (cond2) ? X : Y ) : Y )
219 # into
220 # ( ( (cond1) && (cond2) ) ? X : Y )
6c4f0678
YO
221 # Also similarly handles expressions like:
222 # : (cond1) ? ( (cond2) ? X : Y ) : Y )
223 # Note the inclusion of the close paren in ([:()]) and the open paren in ([()]) is
224 # purely to ensure we have a balanced set of parens in the expression which makes
225 # it easier to understand the pattern in an editor that understands paren's, we do
226 # not expect either of these cases to actually fire. - Yves
8fdb8a9d 227 1 while $expr =~ s/
6c4f0678 228 ([:()]) \s*
8fdb8a9d
YO
229 ($parens) \s*
230 \? \s*
231 \( \s* ($parens) \s*
6c4f0678
YO
232 \? \s* ($parens|[^()?:\s]+?) \s*
233 : \s* ($parens|[^()?:\s]+?) \s*
8fdb8a9d 234 \) \s*
6c4f0678
YO
235 : \s* \5 \s*
236 ([()])
237 /$1 ( $2 && $3 ) ? $4 : $5 $6/gx;
8fdb8a9d 238
e64b1bd1 239 return $expr;
12b72891
RGS
240}
241
e64b1bd1
YO
242#
243# $text= __macro(@args);
244# Join args together by newlines, and then neatly add backslashes to the end
245# of every line as expected by the C pre-processor for #define's.
246#
247
248sub __macro {
249 my $str= join "\n", @_;
250 $str =~ s/\s*$//;
251 my @lines= map { s/\s+$//; s/\t/ /g; $_ } split /\n/, $str;
252 my $last= pop @lines;
253 $str= join "\n", ( map { sprintf "%-76s\\", $_ } @lines ), $last;
254 1 while $str =~ s/^(\t*) {8}/$1\t/gm;
255 return $str . "\n";
12b72891
RGS
256}
257
e64b1bd1
YO
258#
259# my $op=__incrdepth($op);
260#
261# take an 'op' hashref and add one to it and all its childrens depths.
262#
263
264sub __incrdepth {
265 my $op= shift;
266 return unless ref $op;
267 $op->{depth} += 1;
268 __incrdepth( $op->{yes} );
269 __incrdepth( $op->{no} );
270 return $op;
271}
272
273# join two branches of an opcode together with a condition, incrementing
274# the depth on the yes branch when we do so.
275# returns the new root opcode of the tree.
276sub __cond_join {
277 my ( $cond, $yes, $no )= @_;
278 return {
279 test => $cond,
280 yes => __incrdepth( $yes ),
281 no => $no,
282 depth => 0,
283 };
284}
285
286# Methods
287
288# constructor
289#
290# my $obj=CLASS->new(op=>'SOMENAME',title=>'blah',txt=>[..]);
291#
292# Create a new CharClass::Matcher object by parsing the text in
293# the txt array. Currently applies the following rules:
294#
295# Element starts with C<0x>, line is evaled the result treated as
296# a number which is passed to chr().
297#
298# Element starts with C<">, line is evaled and the result treated
299# as a string.
300#
301# Each string is then stored in the 'strs' subhash as a hash record
302# made up of the results of __uni_latin1, using the keynames
b1af8fef 303# 'low','latin1','utf8', as well as the synthesized 'LATIN1', 'high', and
e64b1bd1
YO
304# 'UTF8' which hold a merge of 'low' and their lowercase equivelents.
305#
306# Size data is tracked per type in the 'size' subhash.
307#
308# Return an object
309#
12b72891
RGS
310sub new {
311 my $class= shift;
e64b1bd1
YO
312 my %opt= @_;
313 for ( qw(op txt) ) {
314 die "in " . __PACKAGE__ . " constructor '$_;' is a mandatory field"
315 if !exists $opt{$_};
316 }
317
318 my $self= bless {
319 op => $opt{op},
320 title => $opt{title} || '',
321 }, $class;
322 foreach my $txt ( @{ $opt{txt} } ) {
323 my $str= $txt;
324 if ( $str =~ /^[""]/ ) {
325 $str= eval $str;
05b688d9
KW
326 } elsif ($str =~ / - /x ) { # A range: Replace this element on the
327 # list with its expansion
328 my ($lower, $upper) = $str =~ / 0x (.+?) \s* - \s* 0x (.+) /x;
329 die "Format must be like '0xDEAD - 0xBEAF'; instead was '$str'" if ! defined $lower || ! defined $upper;
330 foreach my $cp (hex $lower .. hex $upper) {
331 push @{$opt{txt}}, sprintf "0x%X", $cp;
332 }
333 next;
295bcca9
KW
334 } elsif ($str =~ s/ ^ N (?= 0x ) //x ) {
335 # Otherwise undocumented, a leading N means is already in the
336 # native character set; don't convert.
e64b1bd1 337 $str= chr eval $str;
295bcca9
KW
338 } elsif ( $str =~ /^0x/ ) {
339 $str= eval $str;
340
341 # Convert from Unicode/ASCII to native, if necessary
342 $str = utf8::unicode_to_native($str) if ! ASCII_PLATFORM
343 && $str <= 0xFF;
344 $str = chr $str;
05b688d9
KW
345 } elsif ( $str =~ / \s* \\p \{ ( .*? ) \} /x) {
346 my $property = $1;
347 use Unicode::UCD qw(prop_invlist);
348
349 my @invlist = prop_invlist($property, '_perl_core_internal_ok');
350 if (! @invlist) {
351
352 # An empty return could mean an unknown property, or merely
353 # that it is empty. Call in scalar context to differentiate
354 my $count = prop_invlist($property, '_perl_core_internal_ok');
355 die "$property not found" unless defined $count;
356 }
357
358 # Replace this element on the list with the property's expansion
359 for (my $i = 0; $i < @invlist; $i += 2) {
360 foreach my $cp ($invlist[$i] .. $invlist[$i+1] - 1) {
295bcca9
KW
361
362 # prop_invlist() returns native values; add leading 'N'
363 # to indicate that.
364 push @{$opt{txt}}, sprintf "N0x%X", $cp;
05b688d9
KW
365 }
366 }
367 next;
60910c93
KW
368 } elsif ($str =~ / ^ do \s+ ( .* ) /x) {
369 die "do '$1' failed: $!$@" if ! do $1 or $@;
370 next;
371 } elsif ($str =~ / ^ & \s* ( .* ) /x) { # user-furnished sub() call
372 my @results = eval "$1";
373 die "eval '$1' failed: $@" if $@;
374 push @{$opt{txt}}, @results;
375 next;
12b72891 376 } else {
5e6c6c1e 377 die "Unparsable line: $txt\n";
12b72891 378 }
e64b1bd1
YO
379 my ( $cp, $low, $latin1, $utf8 )= __uni_latin1( $str );
380 my $UTF8= $low || $utf8;
381 my $LATIN1= $low || $latin1;
b1af8fef 382 my $high = (scalar grep { $_ < 256 } @$cp) ? 0 : $utf8;
dda856b2
YO
383 #die Dumper($txt,$cp,$low,$latin1,$utf8)
384 # if $txt=~/NEL/ or $utf8 and @$utf8>3;
e64b1bd1 385
b1af8fef
KW
386 @{ $self->{strs}{$str} }{qw( str txt low utf8 latin1 high cp UTF8 LATIN1 )}=
387 ( $str, $txt, $low, $utf8, $latin1, $high, $cp, $UTF8, $LATIN1 );
e64b1bd1 388 my $rec= $self->{strs}{$str};
b1af8fef 389 foreach my $key ( qw(low utf8 latin1 high cp UTF8 LATIN1) ) {
e64b1bd1
YO
390 $self->{size}{$key}{ 0 + @{ $self->{strs}{$str}{$key} } }++
391 if $self->{strs}{$str}{$key};
12b72891 392 }
e64b1bd1
YO
393 $self->{has_multi} ||= @$cp > 1;
394 $self->{has_ascii} ||= $latin1 && @$latin1;
395 $self->{has_low} ||= $low && @$low;
396 $self->{has_high} ||= !$low && !$latin1;
12b72891 397 }
e64b1bd1
YO
398 $self->{val_fmt}= $hex_fmt;
399 $self->{count}= 0 + keys %{ $self->{strs} };
12b72891
RGS
400 return $self;
401}
402
e64b1bd1 403# my $trie = make_trie($type,$maxlen);
12b72891 404#
47e01c32 405# using the data stored in the object build a trie of a specific type,
e64b1bd1
YO
406# and with specific maximum depth. The trie is made up the elements of
407# the given types array for each string in the object (assuming it is
408# not too long.)
409#
47e01c32 410# returns the trie, or undef if there was no relevant data in the object.
e64b1bd1
YO
411#
412
413sub make_trie {
414 my ( $self, $type, $maxlen )= @_;
415
416 my $strs= $self->{strs};
417 my %trie;
418 foreach my $rec ( values %$strs ) {
419 die "panic: unknown type '$type'"
420 if !exists $rec->{$type};
421 my $dat= $rec->{$type};
422 next unless $dat;
423 next if $maxlen && @$dat > $maxlen;
424 my $node= \%trie;
425 foreach my $elem ( @$dat ) {
426 $node->{$elem} ||= {};
427 $node= $node->{$elem};
12b72891 428 }
e64b1bd1 429 $node->{''}= $rec->{str};
12b72891 430 }
e64b1bd1 431 return 0 + keys( %trie ) ? \%trie : undef;
12b72891
RGS
432}
433
2efb8143
KW
434sub pop_count ($) {
435 my $word = shift;
436
437 # This returns a list of the positions of the bits in the input word that
438 # are 1.
439
440 my @positions;
441 my $position = 0;
442 while ($word) {
443 push @positions, $position if $word & 1;
444 $position++;
445 $word >>= 1;
446 }
447 return @positions;
448}
449
e64b1bd1
YO
450# my $optree= _optree()
451#
452# recursively convert a trie to an optree where every node represents
453# an if else branch.
12b72891 454#
12b72891 455#
12b72891 456
e64b1bd1
YO
457sub _optree {
458 my ( $self, $trie, $test_type, $ret_type, $else, $depth )= @_;
459 return unless defined $trie;
460 if ( $self->{has_multi} and $ret_type =~ /cp|both/ ) {
461 die "Can't do 'cp' optree from multi-codepoint strings";
12b72891 462 }
e64b1bd1
YO
463 $ret_type ||= 'len';
464 $else= 0 unless defined $else;
465 $depth= 0 unless defined $depth;
466
e405c23a
YO
467 # if we have an emptry string as a key it means we are in an
468 # accepting state and unless we can match further on should
469 # return the value of the '' key.
895e25a5 470 if (exists $trie->{''} ) {
e405c23a
YO
471 # we can now update the "else" value, anything failing to match
472 # after this point should return the value from this.
e64b1bd1
YO
473 if ( $ret_type eq 'cp' ) {
474 $else= $self->{strs}{ $trie->{''} }{cp}[0];
475 $else= sprintf "$self->{val_fmt}", $else if $else > 9;
476 } elsif ( $ret_type eq 'len' ) {
477 $else= $depth;
478 } elsif ( $ret_type eq 'both') {
479 $else= $self->{strs}{ $trie->{''} }{cp}[0];
480 $else= sprintf "$self->{val_fmt}", $else if $else > 9;
481 $else= "len=$depth, $else";
12b72891 482 }
e64b1bd1 483 }
e405c23a
YO
484 # extract the meaningful keys from the trie, filter out '' as
485 # it means we are an accepting state (end of sequence).
486 my @conds= sort { $a <=> $b } grep { length $_ } keys %$trie;
487
488 # if we havent any keys there is no further we can match and we
489 # can return the "else" value.
e64b1bd1 490 return $else if !@conds;
e405c23a
YO
491
492
e64b1bd1 493 my $test= $test_type eq 'cp' ? "cp" : "((U8*)s)[$depth]";
9a3182e9
YO
494 # first we loop over the possible keys/conditions and find out what they look like
495 # we group conditions with the same optree together.
496 my %dmp_res;
497 my @res_order;
e405c23a
YO
498 local $Data::Dumper::Sortkeys=1;
499 foreach my $cond ( @conds ) {
500
501 # get the optree for this child/condition
502 my $res= $self->_optree( $trie->{$cond}, $test_type, $ret_type, $else, $depth + 1 );
503 # convert it to a string with Dumper
e64b1bd1 504 my $res_code= Dumper( $res );
e405c23a 505
9a3182e9
YO
506 push @{$dmp_res{$res_code}{vals}}, $cond;
507 if (!$dmp_res{$res_code}{optree}) {
508 $dmp_res{$res_code}{optree}= $res;
509 push @res_order, $res_code;
510 }
511 }
512
513 # now that we have deduped the optrees we construct a new optree containing the merged
514 # results.
515 my %root;
516 my $node= \%root;
517 foreach my $res_code_idx (0 .. $#res_order) {
518 my $res_code= $res_order[$res_code_idx];
519 $node->{vals}= $dmp_res{$res_code}{vals};
520 $node->{test}= $test;
521 $node->{yes}= $dmp_res{$res_code}{optree};
522 $node->{depth}= $depth;
523 if ($res_code_idx < $#res_order) {
524 $node= $node->{no}= {};
12b72891 525 } else {
9a3182e9 526 $node->{no}= $else;
12b72891
RGS
527 }
528 }
e405c23a
YO
529
530 # return the optree.
531 return \%root;
12b72891
RGS
532}
533
e64b1bd1
YO
534# my $optree= optree(%opts);
535#
536# Convert a trie to an optree, wrapper for _optree
537
538sub optree {
539 my $self= shift;
540 my %opt= @_;
541 my $trie= $self->make_trie( $opt{type}, $opt{max_depth} );
542 $opt{ret_type} ||= 'len';
543 my $test_type= $opt{type} eq 'cp' ? 'cp' : 'depth';
544 return $self->_optree( $trie, $test_type, $opt{ret_type}, $opt{else}, 0 );
12b72891
RGS
545}
546
e64b1bd1
YO
547# my $optree= generic_optree(%opts);
548#
549# build a "generic" optree out of the three 'low', 'latin1', 'utf8'
550# sets of strings, including a branch for handling the string type check.
551#
552
553sub generic_optree {
554 my $self= shift;
555 my %opt= @_;
556
557 $opt{ret_type} ||= 'len';
558 my $test_type= 'depth';
559 my $else= $opt{else} || 0;
560
561 my $latin1= $self->make_trie( 'latin1', $opt{max_depth} );
562 my $utf8= $self->make_trie( 'utf8', $opt{max_depth} );
563
564 $_= $self->_optree( $_, $test_type, $opt{ret_type}, $else, 0 )
565 for $latin1, $utf8;
566
567 if ( $utf8 ) {
568 $else= __cond_join( "( is_utf8 )", $utf8, $latin1 || $else );
569 } elsif ( $latin1 ) {
570 $else= __cond_join( "!( is_utf8 )", $latin1, $else );
571 }
572 my $low= $self->make_trie( 'low', $opt{max_depth} );
573 if ( $low ) {
574 $else= $self->_optree( $low, $test_type, $opt{ret_type}, $else, 0 );
12b72891 575 }
e64b1bd1
YO
576
577 return $else;
12b72891
RGS
578}
579
e64b1bd1 580# length_optree()
12b72891 581#
e64b1bd1 582# create a string length guarded optree.
12b72891 583#
e64b1bd1
YO
584
585sub length_optree {
586 my $self= shift;
587 my %opt= @_;
588 my $type= $opt{type};
589
590 die "Can't do a length_optree on type 'cp', makes no sense."
591 if $type eq 'cp';
592
593 my ( @size, $method );
594
595 if ( $type eq 'generic' ) {
596 $method= 'generic_optree';
597 my %sizes= (
598 %{ $self->{size}{low} || {} },
599 %{ $self->{size}{latin1} || {} },
600 %{ $self->{size}{utf8} || {} }
601 );
602 @size= sort { $a <=> $b } keys %sizes;
603 } else {
604 $method= 'optree';
605 @size= sort { $a <=> $b } keys %{ $self->{size}{$type} };
12b72891 606 }
e64b1bd1
YO
607
608 my $else= ( $opt{else} ||= 0 );
609 for my $size ( @size ) {
610 my $optree= $self->$method( %opt, type => $type, max_depth => $size );
611 my $cond= "((e)-(s) > " . ( $size - 1 ).")";
612 $else= __cond_join( $cond, $optree, $else );
613 }
614 return $else;
12b72891
RGS
615}
616
2efb8143 617sub calculate_mask(@) {
75929b4b
KW
618 # Look at the input list of byte values. This routine returns an array of
619 # mask/base pairs to generate that list.
620
2efb8143
KW
621 my @list = @_;
622 my $list_count = @list;
623
75929b4b
KW
624 # Consider a set of byte values, A, B, C .... If we want to determine if
625 # <c> is one of them, we can write c==A || c==B || c==C .... If the
626 # values are consecutive, we can shorten that to A<=c && c<=Z, which uses
627 # far fewer branches. If only some of them are consecutive we can still
628 # save some branches by creating range tests for just those that are
629 # consecutive. _cond_as_str() does this work for looking for ranges.
630 #
631 # Another approach is to look at the bit patterns for A, B, C .... and see
632 # if they have some commonalities. That's what this function does. For
633 # example, consider a set consisting of the bytes
634 # 0xF0, 0xF1, 0xF2, and 0xF3. We could write:
2efb8143
KW
635 # 0xF0 <= c && c <= 0xF4
636 # But the following mask/compare also works, and has just one test:
75929b4b
KW
637 # (c & 0xFC) == 0xF0
638 # The reason it works is that the set consists of exactly those bytes
2efb8143 639 # whose first 4 bits are 1, and the next two are 0. (The value of the
75929b4b 640 # other 2 bits is immaterial in determining if a byte is in the set or
2efb8143 641 # not.) The mask masks out those 2 irrelevant bits, and the comparison
75929b4b
KW
642 # makes sure that the result matches all bytes which match those 6
643 # material bits exactly. In other words, the set of bytes contains
2efb8143
KW
644 # exactly those whose bottom two bit positions are either 0 or 1. The
645 # same principle applies to bit positions that are not necessarily
646 # adjacent. And it can be applied to bytes that differ in 1 through all 8
647 # bit positions. In order to be a candidate for this optimization, the
75929b4b
KW
648 # number of bytes in the set must be a power of 2.
649 #
650 # Consider a different example, the set 0x53, 0x54, 0x73, and 0x74. That
651 # requires 4 tests using either ranges or individual values, and even
652 # though the number in the set is a power of 2, it doesn't qualify for the
653 # mask optimization described above because the number of bits that are
654 # different is too large for that. However, the set can be expressed as
655 # two branches with masks thusly:
656 # (c & 0xDF) == 0x53 || (c & 0xDF) == 0x54
657 # a branch savings of 50%. This is done by splitting the set into two
658 # subsets each of which has 2 elements, and within each set the values
659 # differ by 1 byte.
660 #
661 # This function attempts to find some way to save some branches using the
662 # mask technique. If not, it returns an empty list; if so, it
663 # returns a list consisting of
664 # [ [compare1, mask1], [compare2, mask2], ...
665 # [compare_n, undef], [compare_m, undef], ...
666 # ]
667 # The <mask> is undef in the above for those bytes that must be tested
668 # for individually.
669 #
670 # This function does not attempt to find the optimal set. To do so would
671 # probably require testing all possible combinations, and keeping track of
672 # the current best one.
673 #
674 # There are probably much better algorithms, but this is the one I (khw)
675 # came up with. We start with doing a bit-wise compare of every byte in
676 # the set with every other byte. The results are sorted into arrays of
677 # all those that differ by the same bit positions. These are stored in a
678 # hash with the each key being the bits they differ in. Here is the hash
679 # for the 0x53, 0x54, 0x73, 0x74 set:
680 # {
681 # 4 => {
682 # "0,1,2,5" => [
683 # 83,
684 # 116,
685 # 84,
686 # 115
687 # ]
688 # },
689 # 3 => {
690 # "0,1,2" => [
691 # 83,
692 # 84,
693 # 115,
694 # 116
695 # ]
696 # }
697 # 1 => {
698 # 5 => [
699 # 83,
700 # 115,
701 # 84,
702 # 116
703 # ]
704 # },
705 # }
706 #
707 # The set consisting of values which differ in the 4 bit positions 0, 1,
708 # 2, and 5 from some other value in the set consists of all 4 values.
709 # Likewise all 4 values differ from some other value in the 3 bit
710 # positions 0, 1, and 2; and all 4 values differ from some other value in
711 # the single bit position 5. The keys at the uppermost level in the above
712 # hash, 1, 3, and 4, give the number of bit positions that each sub-key
713 # below it has. For example, the 4 key could have as its value an array
714 # consisting of "0,1,2,5", "0,1,2,6", and "3,4,6,7", if the inputs were
715 # such. The best optimization will group the most values into a single
716 # mask. The most values will be the ones that differ in the most
717 # positions, the ones with the largest value for the topmost key. These
718 # keys, are thus just for convenience of sorting by that number, and do
719 # not have any bearing on the core of the algorithm.
720 #
721 # We start with an element from largest number of differing bits. The
722 # largest in this case is 4 bits, and there is only one situation in this
723 # set which has 4 differing bits, "0,1,2,5". We look for any subset of
724 # this set which has 16 values that differ in these 4 bits. There aren't
725 # any, because there are only 4 values in the entire set. We then look at
726 # the next possible thing, which is 3 bits differing in positions "0,1,2".
727 # We look for a subset that has 8 values that differ in these 3 bits.
728 # Again there are none. So we go to look for the next possible thing,
729 # which is a subset of 2**1 values that differ only in bit position 5. 83
730 # and 115 do, so we calculate a mask and base for those and remove them
731 # from every set. Since there is only the one set remaining, we remove
732 # them from just this one. We then look to see if there is another set of
733 # 2 values that differ in bit position 5. 84 and 116 do, so we calculate
734 # a mask and base for those and remove them from every set (again only
735 # this set remains in this example). The set is now empty, and there are
736 # no more sets to look at, so we are done.
737
738 if ($list_count == 256) { # All 256 is trivially masked
2efb8143
KW
739 return (0, 0);
740 }
741
75929b4b
KW
742 my %hash;
743
744 # Generate bits-differing lists for each element compared against each
745 # other element
746 for my $i (0 .. $list_count - 2) {
747 for my $j ($i + 1 .. $list_count - 1) {
748 my @bits_that_differ = pop_count($list[$i] ^ $list[$j]);
749 my $differ_count = @bits_that_differ;
750 my $key = join ",", @bits_that_differ;
751 push @{$hash{$differ_count}{$key}}, $list[$i] unless grep { $_ == $list[$i] } @{$hash{$differ_count}{$key}};
752 push @{$hash{$differ_count}{$key}}, $list[$j];
753 }
754 }
2efb8143 755
75929b4b 756 print STDERR __LINE__, ": calculate_mask() called: List of values grouped by differing bits: ", Dumper \%hash if DEBUG;
2efb8143 757
75929b4b
KW
758 my @final_results;
759 foreach my $count (reverse sort { $a <=> $b } keys %hash) {
760 my $need = 2 ** $count; # Need 8 values for 3 differing bits, etc
761 foreach my $bits (keys $hash{$count}) {
2efb8143 762
75929b4b 763 print STDERR __LINE__, ": For $count bit(s) difference ($bits), need $need; have ", scalar @{$hash{$count}{$bits}}, "\n" if DEBUG;
2efb8143 764
75929b4b
KW
765 # Look only as long as there are at least as many elements in the
766 # subset as are needed
767 while ((my $cur_count = @{$hash{$count}{$bits}}) >= $need) {
2efb8143 768
75929b4b 769 print STDERR __LINE__, ": Looking at bit positions ($bits): ", Dumper $hash{$count}{$bits} if DEBUG;
2efb8143 770
75929b4b
KW
771 # Start with the first element in it
772 my $try_base = $hash{$count}{$bits}[0];
773 my @subset = $try_base;
774
775 # If it succeeds, we return a mask and a base to compare
776 # against the masked value. That base will be the AND of
777 # every element in the subset. Initialize to the one element
778 # we have so far.
779 my $compare = $try_base;
780
781 # We are trying to find a subset of this that has <need>
782 # elements that differ in the bit positions given by the
783 # string $bits, which is comma separated.
784 my @bits = split ",", $bits;
785
786 TRY: # Look through the remainder of the list for other
787 # elements that differ only by these bit positions.
788
789 for (my $i = 1; $i < $cur_count; $i++) {
790 my $try_this = $hash{$count}{$bits}[$i];
791 my @positions = pop_count($try_base ^ $try_this);
792
793 print STDERR __LINE__, ": $try_base vs $try_this: is (", join(',', @positions), ") a subset of ($bits)?" if DEBUG;;
794
795 foreach my $pos (@positions) {
796 unless (grep { $pos == $_ } @bits) {
797 print STDERR " No\n" if DEBUG;
798 my $remaining = $cur_count - $i - 1;
799 if ($remaining && @subset + $remaining < $need) {
800 print STDERR __LINE__, ": Can stop trying $try_base, because even if all the remaining $remaining values work, they wouldn't add up to the needed $need when combined with the existing ", scalar @subset, " ones\n" if DEBUG;
801 last TRY;
802 }
803 next TRY;
804 }
805 }
806
807 print STDERR " Yes\n" if DEBUG;
808 push @subset, $try_this;
809
810 # Add this to the mask base, in case it ultimately
811 # succeeds,
812 $compare &= $try_this;
813 }
814
815 print STDERR __LINE__, ": subset (", join(", ", @subset), ") has ", scalar @subset, " elements; needs $need\n" if DEBUG;
816
817 if (@subset < $need) {
818 shift @{$hash{$count}{$bits}};
819 next; # Try with next value
820 }
2efb8143 821
75929b4b
KW
822 # Create the mask
823 my $mask = 0;
824 foreach my $position (@bits) {
825 $mask |= 1 << $position;
826 }
827 $mask = ~$mask & 0xFF;
828 push @final_results, [$compare, $mask];
829
830 printf STDERR "%d: Got it: compare=%d=0x%X; mask=%X\n", __LINE__, $compare, $compare, $mask if DEBUG;
831
832 # These values are now spoken for. Remove them from future
833 # consideration
834 foreach my $remove_count (keys %hash) {
835 foreach my $bits (keys %{$hash{$remove_count}}) {
836 foreach my $to_remove (@subset) {
837 @{$hash{$remove_count}{$bits}} = grep { $_ != $to_remove } @{$hash{$remove_count}{$bits}};
838 }
839 }
840 }
841 }
842 }
2efb8143
KW
843 }
844
75929b4b
KW
845 # Any values that remain in the list are ones that have to be tested for
846 # individually.
847 my @individuals;
848 foreach my $count (reverse sort { $a <=> $b } keys %hash) {
849 foreach my $bits (keys $hash{$count}) {
850 foreach my $remaining (@{$hash{$count}{$bits}}) {
851
852 # If we already know about this value, just ignore it.
853 next if grep { $remaining == $_ } @individuals;
854
855 # Otherwise it needs to be returned as something to match
856 # individually
857 push @final_results, [$remaining, undef];
858 push @individuals, $remaining;
859 }
860 }
2efb8143 861 }
2efb8143 862
75929b4b
KW
863 # Sort by increasing numeric value
864 @final_results = sort { $a->[0] <=> $b->[0] } @final_results;
865
866 print STDERR __LINE__, ": Final return: ", Dumper \@final_results if DEBUG;
867
868 return @final_results;
2efb8143
KW
869}
870
e64b1bd1
YO
871# _cond_as_str
872# turn a list of conditions into a text expression
873# - merges ranges of conditions, and joins the result with ||
874sub _cond_as_str {
ba073cf2 875 my ( $self, $op, $combine, $opts_ref )= @_;
e64b1bd1
YO
876 my $cond= $op->{vals};
877 my $test= $op->{test};
2efb8143 878 my $is_cp_ret = $opts_ref->{ret_type} eq "cp";
e64b1bd1
YO
879 return "( $test )" if !defined $cond;
880
f5772832 881 # rangify the list.
e64b1bd1
YO
882 my @ranges;
883 my $Update= sub {
f5772832
KW
884 # We skip this if there are optimizations that
885 # we can apply (below) to the individual ranges
886 if ( ($is_cp_ret || $combine) && @ranges && ref $ranges[-1]) {
e64b1bd1
YO
887 if ( $ranges[-1][0] == $ranges[-1][1] ) {
888 $ranges[-1]= $ranges[-1][0];
889 } elsif ( $ranges[-1][0] + 1 == $ranges[-1][1] ) {
890 $ranges[-1]= $ranges[-1][0];
891 push @ranges, $ranges[-1] + 1;
892 }
893 }
894 };
4a8ca70e
KW
895 for my $condition ( @$cond ) {
896 if ( !@ranges || $condition != $ranges[-1][1] + 1 ) {
e64b1bd1 897 $Update->();
4a8ca70e 898 push @ranges, [ $condition, $condition ];
e64b1bd1
YO
899 } else {
900 $ranges[-1][1]++;
901 }
902 }
903 $Update->();
f5772832 904
e64b1bd1
YO
905 return $self->_combine( $test, @ranges )
906 if $combine;
f5772832
KW
907
908 if ($is_cp_ret) {
1f063c57
KW
909 @ranges= map {
910 ref $_
911 ? sprintf(
912 "( $self->{val_fmt} <= $test && $test <= $self->{val_fmt} )",
913 @$_ )
914 : sprintf( "$self->{val_fmt} == $test", $_ );
915 } @ranges;
6a52943c
KW
916
917 return "( " . join( " || ", @ranges ) . " )";
f5772832 918 }
75929b4b 919
2358c533
KW
920 # If the input set has certain characteristics, we can optimize tests
921 # for it. This doesn't apply if returning the code point, as we want
922 # each element of the set individually. The code above is for this
923 # simpler case.
924
925 return 1 if @$cond == 256; # If all bytes match, is trivially true
926
75929b4b 927 my @masks;
2358c533 928 if (@ranges > 1) {
75929b4b 929
2358c533
KW
930 # See if the entire set shares optimizable characterstics, and if so,
931 # return the optimization. We delay checking for this on sets with
932 # just a single range, as there may be better optimizations available
933 # in that case.
75929b4b
KW
934 @masks = calculate_mask(@$cond);
935
936 # Stringify the output of calculate_mask()
937 if (@masks) {
938 my @return;
939 foreach my $mask_ref (@masks) {
940 if (defined $mask_ref->[1]) {
941 push @return, sprintf "( ( $test & $self->{val_fmt} ) == $self->{val_fmt} )", $mask_ref->[1], $mask_ref->[0];
942 }
943 else { # An undefined mask means to use the value as-is
944 push @return, sprintf "$test == $self->{val_fmt}", $mask_ref->[0];
945 }
946 }
947
948 # The best possible case below for specifying this set of values via
949 # ranges is 1 branch per range. If our mask method yielded better
950 # results, there is no sense trying something that is bound to be
951 # worse.
952 if (@return < @ranges) {
953 return "( " . join( " || ", @return ) . " )";
954 }
955
956 @masks = @return;
6e130234 957 }
2358c533 958 }
f5772832 959
75929b4b
KW
960 # Here, there was no entire-class optimization that was clearly better
961 # than doing things by ranges. Look at each range.
962 my $range_count_extra = 0;
2358c533
KW
963 for (my $i = 0; $i < @ranges; $i++) {
964 if (! ref $ranges[$i]) { # Trivial case: no range
965 $ranges[$i] = sprintf "$self->{val_fmt} == $test", $ranges[$i];
966 }
967 elsif ($ranges[$i]->[0] == $ranges[$i]->[1]) {
968 $ranges[$i] = # Trivial case: single element range
969 sprintf "$self->{val_fmt} == $test", $ranges[$i]->[0];
970 }
971 else {
972 my $output = "";
973
974 # Well-formed UTF-8 continuation bytes on ascii platforms must be
975 # in the range 0x80 .. 0xBF. If we know that the input is
976 # well-formed (indicated by not trying to be 'safe'), we can omit
977 # tests that verify that the input is within either of these
978 # bounds. (No legal UTF-8 character can begin with anything in
979 # this range, so we don't have to worry about this being a
980 # continuation byte or not.)
981 if (ASCII_PLATFORM
982 && ! $opts_ref->{safe}
983 && $opts_ref->{type} =~ / ^ (?: utf8 | high ) $ /xi)
984 {
985 my $lower_limit_is_80 = ($ranges[$i]->[0] == 0x80);
986 my $upper_limit_is_BF = ($ranges[$i]->[1] == 0xBF);
987
988 # If the range is the entire legal range, it matches any legal
989 # byte, so we can omit both tests. (This should happen only
990 # if the number of ranges is 1.)
991 if ($lower_limit_is_80 && $upper_limit_is_BF) {
992 return 1;
6e130234 993 }
2358c533
KW
994 elsif ($lower_limit_is_80) { # Just use the upper limit test
995 $output = sprintf("( $test <= $self->{val_fmt} )",
996 $ranges[$i]->[1]);
f5772832 997 }
2358c533
KW
998 elsif ($upper_limit_is_BF) { # Just use the lower limit test
999 $output = sprintf("( $test >= $self->{val_fmt} )",
1000 $ranges[$i]->[0]);
f5772832 1001 }
2358c533
KW
1002 }
1003
1004 # If we didn't change to omit a test above, see if the number of
1005 # elements is a power of 2 (only a single bit in the
1006 # representation of its count will be set) and if so, it may be
1007 # that a mask/compare optimization is possible.
1008 if ($output eq ""
1009 && pop_count($ranges[$i]->[1] - $ranges[$i]->[0] + 1) == 1)
1010 {
1011 my @list;
1012 push @list, $_ for ($ranges[$i]->[0] .. $ranges[$i]->[1]);
75929b4b
KW
1013 my @this_masks = calculate_mask(@list);
1014
1015 # Use the mask if there is just one for the whole range.
1016 # Otherwise there is no savings over the two branches that can
1017 # define the range.
1018 if (@this_masks == 1 && defined $this_masks[0][1]) {
1019 $output = sprintf "( $test & $self->{val_fmt} ) == $self->{val_fmt}", $this_masks[0][1], $this_masks[0][0];
f5772832
KW
1020 }
1021 }
2358c533
KW
1022
1023 if ($output ne "") { # Prefer any optimization
1024 $ranges[$i] = $output;
1025 }
75929b4b 1026 else {
2358c533
KW
1027 # No optimization happened. We need a test that the code
1028 # point is within both bounds. But, if the bounds are
1029 # adjacent code points, it is cleaner to say
1030 # 'first == test || second == test'
1031 # than it is to say
1032 # 'first <= test && test <= second'
75929b4b
KW
1033
1034 $range_count_extra++; # This range requires 2 branches to
1035 # represent
e2a80cb5
KW
1036 if ($ranges[$i]->[0] + 1 == $ranges[$i]->[1]) {
1037 $ranges[$i] = "( "
1038 . join( " || ", ( map
1039 { sprintf "$self->{val_fmt} == $test", $_ }
1040 @{$ranges[$i]} ) )
1041 . " )";
1042 }
1043 else { # Full bounds checking
1044 $ranges[$i] = sprintf("( $self->{val_fmt} <= $test && $test <= $self->{val_fmt} )", $ranges[$i]->[0], $ranges[$i]->[1]);
1045 }
75929b4b 1046 }
f5772832 1047 }
2358c533 1048 }
f5772832 1049
75929b4b
KW
1050 # We have generated the list of bytes in two ways; one trying to use masks
1051 # to cut the number of branches down, and the other to look at individual
1052 # ranges (some of which could be cut down by using a mask for just it).
1053 # We return whichever method uses the fewest branches.
1054 return "( "
1055 . join( " || ", (@masks && @masks < @ranges + $range_count_extra)
1056 ? @masks
1057 : @ranges)
1058 . " )";
12b72891
RGS
1059}
1060
e64b1bd1
YO
1061# _combine
1062# recursively turn a list of conditions into a fast break-out condition
1063# used by _cond_as_str() for 'cp' type macros.
1064sub _combine {
1065 my ( $self, $test, @cond )= @_;
1066 return if !@cond;
1067 my $item= shift @cond;
1068 my ( $cstr, $gtv );
1069 if ( ref $item ) {
1070 $cstr=
1071 sprintf( "( $self->{val_fmt} <= $test && $test <= $self->{val_fmt} )",
1072 @$item );
1073 $gtv= sprintf "$self->{val_fmt}", $item->[1];
12b72891 1074 } else {
e64b1bd1
YO
1075 $cstr= sprintf( "$self->{val_fmt} == $test", $item );
1076 $gtv= sprintf "$self->{val_fmt}", $item;
12b72891 1077 }
e64b1bd1
YO
1078 if ( @cond ) {
1079 return "( $cstr || ( $gtv < $test &&\n"
1080 . $self->_combine( $test, @cond ) . " ) )";
12b72891 1081 } else {
e64b1bd1 1082 return $cstr;
12b72891 1083 }
e64b1bd1 1084}
12b72891 1085
e64b1bd1
YO
1086# _render()
1087# recursively convert an optree to text with reasonably neat formatting
1088sub _render {
ba073cf2 1089 my ( $self, $op, $combine, $brace, $opts_ref )= @_;
2e39f0c2 1090 return 0 if ! defined $op; # The set is empty
e64b1bd1
YO
1091 if ( !ref $op ) {
1092 return $op;
12b72891 1093 }
ba073cf2 1094 my $cond= $self->_cond_as_str( $op, $combine, $opts_ref );
cc08b31c
KW
1095 #no warnings 'recursion'; # This would allow really really inefficient
1096 # code to be generated. See pod
ba073cf2 1097 my $yes= $self->_render( $op->{yes}, $combine, 1, $opts_ref );
30188af7
KW
1098 return $yes if $cond eq '1';
1099
ba073cf2 1100 my $no= $self->_render( $op->{no}, $combine, 0, $opts_ref );
e64b1bd1
YO
1101 return "( $cond )" if $yes eq '1' and $no eq '0';
1102 my ( $lb, $rb )= $brace ? ( "( ", " )" ) : ( "", "" );
1103 return "$lb$cond ? $yes : $no$rb"
1104 if !ref( $op->{yes} ) && !ref( $op->{no} );
1105 my $ind1= " " x 4;
1106 my $ind= "\n" . ( $ind1 x $op->{depth} );
1107
1108 if ( ref $op->{yes} ) {
1109 $yes= $ind . $ind1 . $yes;
1110 } else {
1111 $yes= " " . $yes;
1112 }
1113
1114 return "$lb$cond ?$yes$ind: $no$rb";
12b72891 1115}
32e6a07c 1116
e64b1bd1
YO
1117# $expr=render($op,$combine)
1118#
1119# convert an optree to text with reasonably neat formatting. If $combine
1120# is true then the condition is created using "fast breakouts" which
1121# produce uglier expressions that are more efficient for common case,
1122# longer lists such as that resulting from type 'cp' output.
1123# Currently only used for type 'cp' macros.
1124sub render {
ba073cf2
KW
1125 my ( $self, $op, $combine, $opts_ref )= @_;
1126 my $str= "( " . $self->_render( $op, $combine, 0, $opts_ref ) . " )";
e64b1bd1 1127 return __clean( $str );
12b72891 1128}
e64b1bd1
YO
1129
1130# make_macro
1131# make a macro of a given type.
1132# calls into make_trie and (generic_|length_)optree as needed
1133# Opts are:
b1af8fef 1134# type : 'cp','generic','high','low','latin1','utf8','LATIN1','UTF8'
e64b1bd1
YO
1135# ret_type : 'cp' or 'len'
1136# safe : add length guards to macro
1137#
1138# type defaults to 'generic', and ret_type to 'len' unless type is 'cp'
1139# in which case it defaults to 'cp' as well.
1140#
1141# it is illegal to do a type 'cp' macro on a pattern with multi-codepoint
1142# sequences in it, as the generated macro will accept only a single codepoint
1143# as an argument.
1144#
1145# returns the macro.
1146
1147
1148sub make_macro {
1149 my $self= shift;
1150 my %opts= @_;
1151 my $type= $opts{type} || 'generic';
1152 die "Can't do a 'cp' on multi-codepoint character class '$self->{op}'"
1153 if $type eq 'cp'
1154 and $self->{has_multi};
1155 my $ret_type= $opts{ret_type} || ( $opts{type} eq 'cp' ? 'cp' : 'len' );
1156 my $method;
1157 if ( $opts{safe} ) {
1158 $method= 'length_optree';
1159 } elsif ( $type eq 'generic' ) {
1160 $method= 'generic_optree';
1161 } else {
1162 $method= 'optree';
1163 }
1164 my $optree= $self->$method( %opts, type => $type, ret_type => $ret_type );
ba073cf2 1165 my $text= $self->render( $optree, $type eq 'cp', \%opts );
e64b1bd1
YO
1166 my @args= $type eq 'cp' ? 'cp' : 's';
1167 push @args, "e" if $opts{safe};
1168 push @args, "is_utf8" if $type eq 'generic';
1169 push @args, "len" if $ret_type eq 'both';
1170 my $pfx= $ret_type eq 'both' ? 'what_len_' :
1171 $ret_type eq 'cp' ? 'what_' : 'is_';
1172 my $ext= $type eq 'generic' ? '' : '_' . lc( $type );
1173 $ext .= "_safe" if $opts{safe};
1174 my $argstr= join ",", @args;
1175 return "/*** GENERATED CODE ***/\n"
1176 . __macro( "#define $pfx$self->{op}$ext($argstr)\n$text" );
32e6a07c 1177}
e64b1bd1
YO
1178
1179# if we arent being used as a module (highly likely) then process
1180# the __DATA__ below and produce macros in regcharclass.h
1181# if an argument is provided to the script then it is assumed to
1182# be the path of the file to output to, if the arg is '-' outputs
1183# to STDOUT.
1184if ( !caller ) {
e64b1bd1 1185 $|++;
8770da0e 1186 my $path= shift @ARGV || "regcharclass.h";
e64b1bd1
YO
1187 my $out_fh;
1188 if ( $path eq '-' ) {
1189 $out_fh= \*STDOUT;
1190 } else {
29c22b52 1191 $out_fh = open_new( $path );
e64b1bd1 1192 }
8770da0e
NC
1193 print $out_fh read_only_top( lang => 'C', by => $0,
1194 file => 'regcharclass.h', style => '*',
2eee27d7 1195 copyright => [2007, 2011] );
d10c72f2 1196 print $out_fh "\n#ifndef H_REGCHARCLASS /* Guard against nested #includes */\n#define H_REGCHARCLASS 1\n\n";
12b72891 1197
bb949220 1198 my ( $op, $title, @txt, @types, %mods );
e64b1bd1
YO
1199 my $doit= sub {
1200 return unless $op;
ae1d4929
KW
1201
1202 # Skip if to compile on a different platform.
1203 return if delete $mods{only_ascii_platform} && ! ASCII_PLATFORM;
1204 return if delete $mods{only_ebcdic_platform} && ord 'A' != 193;
1205
e64b1bd1
YO
1206 print $out_fh "/*\n\t$op: $title\n\n";
1207 print $out_fh join "\n", ( map { "\t$_" } @txt ), "*/", "";
1208 my $obj= __PACKAGE__->new( op => $op, title => $title, txt => \@txt );
1209
bb949220
KW
1210 #die Dumper(\@types,\%mods);
1211
1212 my @mods;
1213 push @mods, 'safe' if delete $mods{safe};
1214 unshift @mods, 'fast' if delete $mods{fast} || ! @mods; # Default to 'fast'
1215 # do this one
1216 # first, as
1217 # traditional
1218 if (%mods) {
1219 die "Unknown modifiers: ", join ", ", map { "'$_'" } keys %mods;
1220 }
e64b1bd1
YO
1221
1222 foreach my $type_spec ( @types ) {
1223 my ( $type, $ret )= split /-/, $type_spec;
1224 $ret ||= 'len';
1225 foreach my $mod ( @mods ) {
1226 next if $mod eq 'safe' and $type eq 'cp';
bb949220 1227 delete $mods{$mod};
e64b1bd1
YO
1228 my $macro= $obj->make_macro(
1229 type => $type,
1230 ret_type => $ret,
1231 safe => $mod eq 'safe'
1232 );
1233 print $out_fh $macro, "\n";
1234 }
32e6a07c 1235 }
e64b1bd1
YO
1236 };
1237
1238 while ( <DATA> ) {
5e6c6c1e 1239 s/^ \s* (?: \# .* ) ? $ //x; # squeeze out comment and blanks
e64b1bd1
YO
1240 next unless /\S/;
1241 chomp;
fbd1cbdd 1242 if ( /^[A-Z]/ ) {
cc08b31c 1243 $doit->(); # This starts a new definition; do the previous one
e64b1bd1
YO
1244 ( $op, $title )= split /\s*:\s*/, $_, 2;
1245 @txt= ();
1246 } elsif ( s/^=>// ) {
1247 my ( $type, $modifier )= split /:/, $_;
1248 @types= split ' ', $type;
bb949220
KW
1249 undef %mods;
1250 map { $mods{$_} = 1 } split ' ', $modifier;
e64b1bd1
YO
1251 } else {
1252 push @txt, "$_";
12b72891
RGS
1253 }
1254 }
e64b1bd1 1255 $doit->();
d10c72f2
KW
1256
1257 print $out_fh "\n#endif /* H_REGCHARCLASS */\n";
1258
8770da0e
NC
1259 if($path eq '-') {
1260 print $out_fh "/* ex: set ro: */\n";
1261 } else {
1262 read_only_bottom_close_and_rename($out_fh)
1263 }
12b72891 1264}
e64b1bd1 1265
cc08b31c
KW
1266# The form of the input is a series of definitions to make macros for.
1267# The first line gives the base name of the macro, followed by a colon, and
1268# then text to be used in comments associated with the macro that are its
1269# title or description. In all cases the first (perhaps only) parameter to
1270# the macro is a pointer to the first byte of the code point it is to test to
1271# see if it is in the class determined by the macro. In the case of non-UTF8,
1272# the code point consists only of a single byte.
1273#
1274# The second line must begin with a '=>' and be followed by the types of
1275# macro(s) to be generated; these are specified below. A colon follows the
1276# types, followed by the modifiers, also specified below. At least one
1277# modifier is required.
1278#
1279# The subsequent lines give what code points go into the class defined by the
1280# macro. Multiple characters may be specified via a string like "\x0D\x0A",
60910c93
KW
1281# enclosed in quotes. Otherwise the lines consist of one of:
1282# 1) a single Unicode code point, prefaced by 0x
1283# 2) a single range of Unicode code points separated by a minus (and
1284# optional space)
1285# 3) a single Unicode property specified in the standard Perl form
1286# "\p{...}"
1287# 4) a line like 'do path'. This will do a 'do' on the file given by
1288# 'path'. It is assumed that this does nothing but load subroutines
1289# (See item 5 below). The reason 'require path' is not used instead is
1290# because 'do' doesn't assume that path is in @INC.
1291# 5) a subroutine call
1292# &pkg::foo(arg1, ...)
1293# where pkg::foo was loaded by a 'do' line (item 4). The subroutine
1294# returns an array of entries of forms like items 1-3 above. This
1295# allows more complex inputs than achievable from the other input types.
cc08b31c
KW
1296#
1297# A blank line or one whose first non-blank character is '#' is a comment.
1298# The definition of the macro is terminated by a line unlike those described.
1299#
1300# Valid types:
1301# low generate a macro whose name is 'is_BASE_low' and defines a
1302# class that includes only ASCII-range chars. (BASE is the
1303# input macro base name.)
1304# latin1 generate a macro whose name is 'is_BASE_latin1' and defines a
1305# class that includes only upper-Latin1-range chars. It is not
1306# designed to take a UTF-8 input parameter.
b1af8fef
KW
1307# high generate a macro whose name is 'is_BASE_high' and defines a
1308# class that includes all relevant code points that are above
1309# the Latin1 range. This is for very specialized uses only.
1310# It is designed to take only an input UTF-8 parameter.
cc08b31c
KW
1311# utf8 generate a macro whose name is 'is_BASE_utf8' and defines a
1312# class that includes all relevant characters that aren't ASCII.
1313# It is designed to take only an input UTF-8 parameter.
1314# LATIN1 generate a macro whose name is 'is_BASE_latin1' and defines a
1315# class that includes both ASCII and upper-Latin1-range chars.
1316# It is not designed to take a UTF-8 input parameter.
1317# UTF8 generate a macro whose name is 'is_BASE_utf8' and defines a
1318# class that can include any code point, adding the 'low' ones
1319# to what 'utf8' works on. It is designed to take only an input
1320# UTF-8 parameter.
1321# generic generate a macro whose name is 'is_BASE". It has a 2nd,
1322# boolean, parameter which indicates if the first one points to
1323# a UTF-8 string or not. Thus it works in all circumstances.
1324# cp generate a macro whose name is 'is_BASE_cp' and defines a
1325# class that returns true if the UV parameter is a member of the
1326# class; false if not.
1327# A macro of the given type is generated for each type listed in the input.
1328# The default return value is the number of octets read to generate the match.
1329# Append "-cp" to the type to have it instead return the matched codepoint.
1330# The macro name is changed to 'what_BASE...'. See pod for
1331# caveats
1332# Appending '-both" instead adds an extra parameter to the end of the argument
1333# list, which is a pointer as to where to store the number of
1334# bytes matched, while also returning the code point. The macro
1335# name is changed to 'what_len_BASE...'. See pod for caveats
1336#
1337# Valid modifiers:
1338# safe The input string is not necessarily valid UTF-8. In
1339# particular an extra parameter (always the 2nd) to the macro is
1340# required, which points to one beyond the end of the string.
1341# The macro will make sure not to read off the end of the
1342# string. In the case of non-UTF8, it makes sure that the
1343# string has at least one byte in it. The macro name has
1344# '_safe' appended to it.
1345# fast The input string is valid UTF-8. No bounds checking is done,
1346# and the macro can make assumptions that lead to faster
1347# execution.
ae1d4929
KW
1348# only_ascii_platform Skip this definition if this program is being run on
1349# a non-ASCII platform.
1350# only_ebcdic_platform Skip this definition if this program is being run on
1351# a non-EBCDIC platform.
cc08b31c
KW
1352# No modifier need be specified; fast is assumed for this case. If both
1353# 'fast', and 'safe' are specified, two macros will be created for each
1354# 'type'.
e90ac8de 1355#
295bcca9 1356# If run on a non-ASCII platform will automatically convert the Unicode input
cc08b31c
KW
1357# to native. The documentation above is slightly wrong in this case. 'low'
1358# actually refers to code points whose UTF-8 representation is the same as the
1359# non-UTF-8 version (invariants); and 'latin1' refers to all the rest of the
1360# code points less than 256.
5e6c6c1e
KW
1361
13621; # in the unlikely case we are being used as a module
1363
1364__DATA__
1365# This is no longer used, but retained in case it is needed some day.
e90ac8de
KW
1366# TRICKYFOLD: Problematic fold case letters. When adding to this list, also should add them to regcomp.c and fold_grind.t
1367# => generic cp generic-cp generic-both :fast safe
1368# 0x00DF # LATIN SMALL LETTER SHARP S
1369# 0x0390 # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
1370# 0x03B0 # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
1371# 0x1E9E # LATIN CAPITAL LETTER SHARP S, because maps to same as 00DF
1372# 0x1FD3 # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA; maps same as 0390
1373# 0x1FE3 # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA; maps same as 03B0
1374
12b72891 1375LNBREAK: Line Break: \R
e64b1bd1 1376=> generic UTF8 LATIN1 :fast safe
12b72891 1377"\x0D\x0A" # CRLF - Network (Windows) line ending
05b688d9 1378\p{VertSpace}
12b72891
RGS
1379
1380HORIZWS: Horizontal Whitespace: \h \H
e64b1bd1 1381=> generic UTF8 LATIN1 cp :fast safe
05b688d9 1382\p{HorizSpace}
12b72891
RGS
1383
1384VERTWS: Vertical Whitespace: \v \V
e64b1bd1 1385=> generic UTF8 LATIN1 cp :fast safe
05b688d9 1386\p{VertSpace}
612ead59 1387
b96a92fb
KW
1388REPLACEMENT: Unicode REPLACEMENT CHARACTER
1389=> UTF8 :safe
13900xFFFD
1391
1392NONCHAR: Non character code points
1393=> UTF8 :fast
1394\p{Nchar}
1395
1396SURROGATE: Surrogate characters
1397=> UTF8 :fast
1398\p{Gc=Cs}
1399
612ead59
KW
1400GCB_L: Grapheme_Cluster_Break=L
1401=> UTF8 :fast
1402\p{_X_GCB_L}
1403
1404GCB_LV_LVT_V: Grapheme_Cluster_Break=(LV or LVT or V)
1405=> UTF8 :fast
1406\p{_X_LV_LVT_V}
1407
1408GCB_Prepend: Grapheme_Cluster_Break=Prepend
1409=> UTF8 :fast
1410\p{_X_GCB_Prepend}
1411
1412GCB_RI: Grapheme_Cluster_Break=RI
1413=> UTF8 :fast
1414\p{_X_RI}
1415
1416GCB_SPECIAL_BEGIN: Grapheme_Cluster_Break=special_begins
1417=> UTF8 :fast
1418\p{_X_Special_Begin}
1419
1420GCB_T: Grapheme_Cluster_Break=T
1421=> UTF8 :fast
1422\p{_X_GCB_T}
1423
1424GCB_V: Grapheme_Cluster_Break=V
1425=> UTF8 :fast
1426\p{_X_GCB_V}
685289b5 1427
4d646140
KW
1428# This program was run with this enabled, and the results copied to utf8.h;
1429# then this was commented out because it takes so long to figure out these 2
1430# million code points. The results would not change unless utf8.h decides it
1431# wants a maximum other than 4 bytes, or this program creates better
1432# optimizations
1433#UTF8_CHAR: Matches utf8 from 1 to 4 bytes
1434#=> UTF8 :safe only_ascii_platform
1435#0x0 - 0x1FFFFF
1436
1437# This hasn't been commented out, because we haven't an EBCDIC platform to run
1438# it on, and the 3 types of EBCDIC allegedly supported by Perl would have
1439# different results
1440UTF8_CHAR: Matches utf8 from 1 to 5 bytes
1441=> UTF8 :safe only_ebcdic_platform
14420x0 - 0x3FFFFF:
1443
685289b5
KW
1444QUOTEMETA: Meta-characters that \Q should quote
1445=> high :fast
1446\p{_Perl_Quotemeta}
8769f413
KW
1447
1448MULTI_CHAR_FOLD: multi-char strings that are folded to by a single character
1449=> UTF8 :safe
1450do regen/regcharclass_multi_char_folds.pl
1451
1452# 1 => All folds
1453&regcharclass_multi_char_folds::multi_char_folds(1)
1454
40b1ba4f
KW
1455MULTI_CHAR_FOLD: multi-char strings that are folded to by a single character
1456=> LATIN1 :safe
8769f413 1457
8769f413 1458&regcharclass_multi_char_folds::multi_char_folds(0)
40b1ba4f 1459# 0 => Latin1-only