This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5
Make isUTF8_char() an inline function
[perl5.git] / regen / regcharclass.pl
CommitLineData
1f00b0d6 1#!perl
e64b1bd1 2package CharClass::Matcher;
12b72891 3use strict;
8770da0e 4use 5.008;
12b72891 5use warnings;
e64b1bd1 6use warnings FATAL => 'all';
12b72891 7use Data::Dumper;
e64b1bd1
YO
8$Data::Dumper::Useqq= 1;
9our $hex_fmt= "0x%02X";
12b72891 10
75929b4b
KW
11sub DEBUG () { 0 }
12$|=1 if DEBUG;
13
3d7c117d
MB
14require './regen/regen_lib.pl';
15require './regen/charset_translations.pl';
4c9bc717 16require "./regen/regcharclass_multi_char_folds.pl";
8770da0e 17
ab84f958 18=head1 NAME
0ccab2bc 19
e64b1bd1 20CharClass::Matcher -- Generate C macros that match character classes efficiently
12b72891 21
e64b1bd1
YO
22=head1 SYNOPSIS
23
ab84f958 24 perl Porting/regcharclass.pl
e64b1bd1
YO
25
26=head1 DESCRIPTION
12b72891
RGS
27
28Dynamically generates macros for detecting special charclasses
e64b1bd1 29in latin-1, utf8, and codepoint forms. Macros can be set to return
cc08b31c 30the length (in bytes) of the matched codepoint, and/or the codepoint itself.
12b72891 31
cc08b31c 32To regenerate F<regcharclass.h>, run this script from perl-root. No arguments
12b72891
RGS
33are necessary.
34
cc08b31c
KW
35Using WHATEVER as an example the following macros can be produced, depending
36on the input parameters (how to get each is described by internal comments at
37the C<__DATA__> line):
12b72891
RGS
38
39=over 4
40
cc08b31c 41=item C<is_WHATEVER(s,is_utf8)>
12b72891 42
cc08b31c 43=item C<is_WHATEVER_safe(s,e,is_utf8)>
12b72891 44
cc08b31c
KW
45Do a lookup as appropriate based on the C<is_utf8> flag. When possible
46comparisons involving octect<128 are done before checking the C<is_utf8>
12b72891
RGS
47flag, hopefully saving time.
48
cc08b31c
KW
49The version without the C<_safe> suffix should be used only when the input is
50known to be well-formed.
12b72891 51
cc08b31c
KW
52=item C<is_WHATEVER_utf8(s)>
53
54=item C<is_WHATEVER_utf8_safe(s,e)>
12b72891
RGS
55
56Do a lookup assuming the string is encoded in (normalized) UTF8.
57
cc08b31c
KW
58The version without the C<_safe> suffix should be used only when the input is
59known to be well-formed.
60
61=item C<is_WHATEVER_latin1(s)>
12b72891 62
cc08b31c 63=item C<is_WHATEVER_latin1_safe(s,e)>
12b72891
RGS
64
65Do a lookup assuming the string is encoded in latin-1 (aka plan octets).
66
cc08b31c
KW
67The version without the C<_safe> suffix should be used only when it is known
68that C<s> contains at least one character.
69
70=item C<is_WHATEVER_cp(cp)>
12b72891 71
47e01c32 72Check to see if the string matches a given codepoint (hypothetically a
b6a6e956 73U32). The condition is constructed as to "break out" as early as
12b72891
RGS
74possible if the codepoint is out of range of the condition.
75
76IOW:
77
78 (cp==X || (cp>X && (cp==Y || (cp>Y && ...))))
79
80Thus if the character is X+1 only two comparisons will be done. Making
81matching lookups slower, but non-matching faster.
82
cc08b31c
KW
83=item C<what_len_WHATEVER_FOO(arg1, ..., len)>
84
85A variant form of each of the macro types described above can be generated, in
86which the code point is returned by the macro, and an extra parameter (in the
87final position) is added, which is a pointer for the macro to set the byte
88length of the returned code point.
89
90These forms all have a C<what_len> prefix instead of the C<is_>, for example
91C<what_len_WHATEVER_safe(s,e,is_utf8,len)> and
92C<what_len_WHATEVER_utf8(s,len)>.
93
94These forms should not be used I<except> on small sets of mostly widely
95separated code points; otherwise the code generated is inefficient. For these
96cases, it is best to use the C<is_> forms, and then find the code point with
97C<utf8_to_uvchr_buf>(). This program can fail with a "deep recursion"
98message on the worst of the inappropriate sets. Examine the generated macro
99to see if it is acceptable.
12b72891 100
cc08b31c
KW
101=item C<what_WHATEVER_FOO(arg1, ...)>
102
103A variant form of each of the C<is_> macro types described above can be generated, in
104which the code point and not the length is returned by the macro. These have
105the same caveat as L</what_len_WHATEVER_FOO(arg1, ..., len)>, plus they should
106not be used where the set contains a NULL, as 0 is returned for two different
107cases: a) the set doesn't include the input code point; b) the set does
108include it, and it is a NULL.
109
110=back
e64b1bd1 111
40f914fd
KW
112The above isn't quite complete, as for specialized purposes one can get a
113macro like C<is_WHATEVER_utf8_no_length_checks(s)>, which assumes that it is
114already known that there is enough space to hold the character starting at
115C<s>, but otherwise checks that it is well-formed. In other words, this is
116intermediary in checking between C<is_WHATEVER_utf8(s)> and
117C<is_WHATEVER_utf8_safe(s,e)>.
118
e64b1bd1
YO
119=head2 CODE FORMAT
120
121perltidy -st -bt=1 -bbt=0 -pt=0 -sbt=1 -ce -nwls== "%f"
122
123
124=head1 AUTHOR
125
cc08b31c 126Author: Yves Orton (demerphq) 2007. Maintained by Perl5 Porters.
e64b1bd1
YO
127
128=head1 BUGS
129
130No tests directly here (although the regex engine will fail tests
131if this code is broken). Insufficient documentation and no Getopts
132handler for using the module as a script.
133
134=head1 LICENSE
135
136You may distribute under the terms of either the GNU General Public
137License or the Artistic License, as specified in the README file.
138
12b72891
RGS
139=cut
140
e64b1bd1
YO
141# Sub naming convention:
142# __func : private subroutine, can not be called as a method
143# _func : private method, not meant for external use
144# func : public method.
145
146# private subs
147#-------------------------------------------------------------------------------
148#
149# ($cp,$n,$l,$u)=__uni_latin($str);
150#
47e01c32 151# Return a list of arrays, each of which when interpreted correctly
e64b1bd1
YO
152# represent the string in some given encoding with specific conditions.
153#
154# $cp - list of codepoints that make up the string.
295bcca9
KW
155# $n - list of octets that make up the string if all codepoints are invariant
156# regardless of if the string is in UTF-8 or not.
e64b1bd1 157# $l - list of octets that make up the string in latin1 encoding if all
295bcca9
KW
158# codepoints < 256, and at least one codepoint is UTF-8 variant.
159# $u - list of octets that make up the string in utf8 if any codepoint is
160# UTF-8 variant
e64b1bd1
YO
161#
162# High CP | Defined
163#-----------+----------
295bcca9 164# 0 - 127 : $n (127/128 are the values for ASCII platforms)
e64b1bd1
YO
165# 128 - 255 : $l, $u
166# 256 - ... : $u
167#
168
169sub __uni_latin1 {
a1b2a50f 170 my $charset= shift;
e64b1bd1
YO
171 my $str= shift;
172 my $max= 0;
173 my @cp;
900c17f9 174 my @cp_high;
295bcca9 175 my $only_has_invariants = 1;
c30a0cf2 176 my $a2n = get_a2n($charset);
e64b1bd1
YO
177 for my $ch ( split //, $str ) {
178 my $cp= ord $ch;
e64b1bd1 179 $max= $cp if $max < $cp;
a1b2a50f
KW
180 if ($cp > 255) {
181 push @cp, $cp;
182 push @cp_high, $cp;
183 }
184 else {
c30a0cf2 185 push @cp, $a2n->[$cp];
295bcca9 186 }
e64b1bd1
YO
187 }
188 my ( $n, $l, $u );
a1b2a50f 189 $only_has_invariants = ($charset =~ /ascii/i) ? $max < 128 : $max < 160;
295bcca9 190 if ($only_has_invariants) {
e64b1bd1
YO
191 $n= [@cp];
192 } else {
193 $l= [@cp] if $max && $max < 256;
194
a1b2a50f
KW
195 my @u;
196 for my $ch ( split //, $str ) {
197 push @u, map { ord } split //, cp_2_utfbytes(ord $ch, $charset);
198 }
199 $u = \@u;
12b72891 200 }
900c17f9 201 return ( \@cp, \@cp_high, $n, $l, $u );
12b72891
RGS
202}
203
12b72891 204#
e64b1bd1
YO
205# $clean= __clean($expr);
206#
207# Cleanup a ternary expression, removing unnecessary parens and apply some
208# simplifications using regexes.
209#
210
211sub __clean {
212 my ( $expr )= @_;
8fdb8a9d 213
9a3182e9
YO
214 #return $expr;
215
e64b1bd1
YO
216 our $parens;
217 $parens= qr/ (?> \( (?> (?: (?> [^()]+ ) | (??{ $parens }) )* ) \) ) /x;
218
8fdb8a9d 219 ## remove redundant parens
e64b1bd1 220 1 while $expr =~ s/ \( \s* ( $parens ) \s* \) /$1/gx;
8fdb8a9d
YO
221
222
223 # repeatedly simplify conditions like
224 # ( (cond1) ? ( (cond2) ? X : Y ) : Y )
225 # into
226 # ( ( (cond1) && (cond2) ) ? X : Y )
6c4f0678
YO
227 # Also similarly handles expressions like:
228 # : (cond1) ? ( (cond2) ? X : Y ) : Y )
229 # Note the inclusion of the close paren in ([:()]) and the open paren in ([()]) is
230 # purely to ensure we have a balanced set of parens in the expression which makes
231 # it easier to understand the pattern in an editor that understands paren's, we do
232 # not expect either of these cases to actually fire. - Yves
8fdb8a9d 233 1 while $expr =~ s/
6c4f0678 234 ([:()]) \s*
8fdb8a9d
YO
235 ($parens) \s*
236 \? \s*
237 \( \s* ($parens) \s*
6c4f0678
YO
238 \? \s* ($parens|[^()?:\s]+?) \s*
239 : \s* ($parens|[^()?:\s]+?) \s*
8fdb8a9d 240 \) \s*
6c4f0678
YO
241 : \s* \5 \s*
242 ([()])
243 /$1 ( $2 && $3 ) ? $4 : $5 $6/gx;
39a0f513
YO
244 #$expr=~s/\(\(U8\*\)s\)\[(\d+)\]/S$1/g if length $expr > 8000;
245 #$expr=~s/\s+//g if length $expr > 8000;
246
247 die "Expression too long" if length $expr > 8000;
8fdb8a9d 248
e64b1bd1 249 return $expr;
12b72891
RGS
250}
251
e64b1bd1
YO
252#
253# $text= __macro(@args);
254# Join args together by newlines, and then neatly add backslashes to the end
255# of every line as expected by the C pre-processor for #define's.
256#
257
258sub __macro {
259 my $str= join "\n", @_;
260 $str =~ s/\s*$//;
261 my @lines= map { s/\s+$//; s/\t/ /g; $_ } split /\n/, $str;
262 my $last= pop @lines;
263 $str= join "\n", ( map { sprintf "%-76s\\", $_ } @lines ), $last;
264 1 while $str =~ s/^(\t*) {8}/$1\t/gm;
265 return $str . "\n";
12b72891
RGS
266}
267
e64b1bd1
YO
268#
269# my $op=__incrdepth($op);
270#
271# take an 'op' hashref and add one to it and all its childrens depths.
272#
273
274sub __incrdepth {
275 my $op= shift;
276 return unless ref $op;
277 $op->{depth} += 1;
278 __incrdepth( $op->{yes} );
279 __incrdepth( $op->{no} );
280 return $op;
281}
282
283# join two branches of an opcode together with a condition, incrementing
284# the depth on the yes branch when we do so.
285# returns the new root opcode of the tree.
286sub __cond_join {
287 my ( $cond, $yes, $no )= @_;
5ab0c3af 288 if (ref $yes) {
3ff97bcf
KW
289 return {
290 test => $cond,
291 yes => __incrdepth( $yes ),
292 no => $no,
293 depth => 0,
294 };
5ab0c3af
KW
295 }
296 else {
297 return {
298 test => $cond,
299 yes => $yes,
300 no => __incrdepth($no),
301 depth => 0,
302 };
303 }
e64b1bd1
YO
304}
305
306# Methods
307
308# constructor
309#
310# my $obj=CLASS->new(op=>'SOMENAME',title=>'blah',txt=>[..]);
311#
312# Create a new CharClass::Matcher object by parsing the text in
313# the txt array. Currently applies the following rules:
314#
315# Element starts with C<0x>, line is evaled the result treated as
316# a number which is passed to chr().
317#
318# Element starts with C<">, line is evaled and the result treated
319# as a string.
320#
321# Each string is then stored in the 'strs' subhash as a hash record
322# made up of the results of __uni_latin1, using the keynames
b1af8fef 323# 'low','latin1','utf8', as well as the synthesized 'LATIN1', 'high', and
b6a6e956 324# 'UTF8' which hold a merge of 'low' and their lowercase equivalents.
e64b1bd1
YO
325#
326# Size data is tracked per type in the 'size' subhash.
327#
328# Return an object
329#
12b72891
RGS
330sub new {
331 my $class= shift;
e64b1bd1
YO
332 my %opt= @_;
333 for ( qw(op txt) ) {
334 die "in " . __PACKAGE__ . " constructor '$_;' is a mandatory field"
335 if !exists $opt{$_};
336 }
337
338 my $self= bless {
339 op => $opt{op},
340 title => $opt{title} || '',
341 }, $class;
342 foreach my $txt ( @{ $opt{txt} } ) {
343 my $str= $txt;
344 if ( $str =~ /^[""]/ ) {
345 $str= eval $str;
05b688d9
KW
346 } elsif ($str =~ / - /x ) { # A range: Replace this element on the
347 # list with its expansion
348 my ($lower, $upper) = $str =~ / 0x (.+?) \s* - \s* 0x (.+) /x;
349 die "Format must be like '0xDEAD - 0xBEAF'; instead was '$str'" if ! defined $lower || ! defined $upper;
350 foreach my $cp (hex $lower .. hex $upper) {
351 push @{$opt{txt}}, sprintf "0x%X", $cp;
352 }
353 next;
295bcca9
KW
354 } elsif ($str =~ s/ ^ N (?= 0x ) //x ) {
355 # Otherwise undocumented, a leading N means is already in the
356 # native character set; don't convert.
e64b1bd1 357 $str= chr eval $str;
295bcca9
KW
358 } elsif ( $str =~ /^0x/ ) {
359 $str= eval $str;
295bcca9 360 $str = chr $str;
05b688d9
KW
361 } elsif ( $str =~ / \s* \\p \{ ( .*? ) \} /x) {
362 my $property = $1;
363 use Unicode::UCD qw(prop_invlist);
364
365 my @invlist = prop_invlist($property, '_perl_core_internal_ok');
366 if (! @invlist) {
367
368 # An empty return could mean an unknown property, or merely
369 # that it is empty. Call in scalar context to differentiate
370 my $count = prop_invlist($property, '_perl_core_internal_ok');
371 die "$property not found" unless defined $count;
372 }
373
374 # Replace this element on the list with the property's expansion
375 for (my $i = 0; $i < @invlist; $i += 2) {
376 foreach my $cp ($invlist[$i] .. $invlist[$i+1] - 1) {
295bcca9
KW
377
378 # prop_invlist() returns native values; add leading 'N'
379 # to indicate that.
380 push @{$opt{txt}}, sprintf "N0x%X", $cp;
05b688d9
KW
381 }
382 }
383 next;
60910c93
KW
384 } elsif ($str =~ / ^ do \s+ ( .* ) /x) {
385 die "do '$1' failed: $!$@" if ! do $1 or $@;
386 next;
387 } elsif ($str =~ / ^ & \s* ( .* ) /x) { # user-furnished sub() call
388 my @results = eval "$1";
389 die "eval '$1' failed: $@" if $@;
390 push @{$opt{txt}}, @results;
391 next;
12b72891 392 } else {
5e6c6c1e 393 die "Unparsable line: $txt\n";
12b72891 394 }
a1b2a50f 395 my ( $cp, $cp_high, $low, $latin1, $utf8 )= __uni_latin1( $opt{charset}, $str );
e64b1bd1
YO
396 my $UTF8= $low || $utf8;
397 my $LATIN1= $low || $latin1;
b1af8fef 398 my $high = (scalar grep { $_ < 256 } @$cp) ? 0 : $utf8;
dda856b2
YO
399 #die Dumper($txt,$cp,$low,$latin1,$utf8)
400 # if $txt=~/NEL/ or $utf8 and @$utf8>3;
e64b1bd1 401
900c17f9
KW
402 @{ $self->{strs}{$str} }{qw( str txt low utf8 latin1 high cp cp_high UTF8 LATIN1 )}=
403 ( $str, $txt, $low, $utf8, $latin1, $high, $cp, $cp_high, $UTF8, $LATIN1 );
e64b1bd1 404 my $rec= $self->{strs}{$str};
900c17f9 405 foreach my $key ( qw(low utf8 latin1 high cp cp_high UTF8 LATIN1) ) {
e64b1bd1
YO
406 $self->{size}{$key}{ 0 + @{ $self->{strs}{$str}{$key} } }++
407 if $self->{strs}{$str}{$key};
12b72891 408 }
e64b1bd1
YO
409 $self->{has_multi} ||= @$cp > 1;
410 $self->{has_ascii} ||= $latin1 && @$latin1;
411 $self->{has_low} ||= $low && @$low;
412 $self->{has_high} ||= !$low && !$latin1;
12b72891 413 }
e64b1bd1
YO
414 $self->{val_fmt}= $hex_fmt;
415 $self->{count}= 0 + keys %{ $self->{strs} };
12b72891
RGS
416 return $self;
417}
418
e64b1bd1 419# my $trie = make_trie($type,$maxlen);
12b72891 420#
47e01c32 421# using the data stored in the object build a trie of a specific type,
e64b1bd1
YO
422# and with specific maximum depth. The trie is made up the elements of
423# the given types array for each string in the object (assuming it is
424# not too long.)
425#
47e01c32 426# returns the trie, or undef if there was no relevant data in the object.
e64b1bd1
YO
427#
428
429sub make_trie {
430 my ( $self, $type, $maxlen )= @_;
431
432 my $strs= $self->{strs};
433 my %trie;
434 foreach my $rec ( values %$strs ) {
435 die "panic: unknown type '$type'"
436 if !exists $rec->{$type};
437 my $dat= $rec->{$type};
438 next unless $dat;
439 next if $maxlen && @$dat > $maxlen;
440 my $node= \%trie;
441 foreach my $elem ( @$dat ) {
442 $node->{$elem} ||= {};
443 $node= $node->{$elem};
12b72891 444 }
e64b1bd1 445 $node->{''}= $rec->{str};
12b72891 446 }
e64b1bd1 447 return 0 + keys( %trie ) ? \%trie : undef;
12b72891
RGS
448}
449
2efb8143
KW
450sub pop_count ($) {
451 my $word = shift;
452
453 # This returns a list of the positions of the bits in the input word that
454 # are 1.
455
456 my @positions;
457 my $position = 0;
458 while ($word) {
459 push @positions, $position if $word & 1;
460 $position++;
461 $word >>= 1;
462 }
463 return @positions;
464}
465
e64b1bd1
YO
466# my $optree= _optree()
467#
468# recursively convert a trie to an optree where every node represents
469# an if else branch.
12b72891 470#
12b72891 471#
12b72891 472
e64b1bd1
YO
473sub _optree {
474 my ( $self, $trie, $test_type, $ret_type, $else, $depth )= @_;
475 return unless defined $trie;
476 if ( $self->{has_multi} and $ret_type =~ /cp|both/ ) {
477 die "Can't do 'cp' optree from multi-codepoint strings";
12b72891 478 }
e64b1bd1
YO
479 $ret_type ||= 'len';
480 $else= 0 unless defined $else;
481 $depth= 0 unless defined $depth;
482
b6a6e956 483 # if we have an empty string as a key it means we are in an
e405c23a
YO
484 # accepting state and unless we can match further on should
485 # return the value of the '' key.
895e25a5 486 if (exists $trie->{''} ) {
e405c23a
YO
487 # we can now update the "else" value, anything failing to match
488 # after this point should return the value from this.
e64b1bd1
YO
489 if ( $ret_type eq 'cp' ) {
490 $else= $self->{strs}{ $trie->{''} }{cp}[0];
491 $else= sprintf "$self->{val_fmt}", $else if $else > 9;
492 } elsif ( $ret_type eq 'len' ) {
493 $else= $depth;
494 } elsif ( $ret_type eq 'both') {
495 $else= $self->{strs}{ $trie->{''} }{cp}[0];
496 $else= sprintf "$self->{val_fmt}", $else if $else > 9;
497 $else= "len=$depth, $else";
12b72891 498 }
e64b1bd1 499 }
e405c23a
YO
500 # extract the meaningful keys from the trie, filter out '' as
501 # it means we are an accepting state (end of sequence).
502 my @conds= sort { $a <=> $b } grep { length $_ } keys %$trie;
503
b6a6e956 504 # if we haven't any keys there is no further we can match and we
e405c23a 505 # can return the "else" value.
e64b1bd1 506 return $else if !@conds;
e405c23a 507
29f3ce8f 508 my $test = $test_type =~ /^cp/ ? "cp" : "((const U8*)s)[$depth]";
e405c23a 509
c7c8bf55 510 # First we loop over the possible keys/conditions and find out what they
3ff97bcf 511 # look like; we group conditions with the same optree together.
9a3182e9
YO
512 my %dmp_res;
513 my @res_order;
e405c23a
YO
514 local $Data::Dumper::Sortkeys=1;
515 foreach my $cond ( @conds ) {
516
517 # get the optree for this child/condition
518 my $res= $self->_optree( $trie->{$cond}, $test_type, $ret_type, $else, $depth + 1 );
519 # convert it to a string with Dumper
e64b1bd1 520 my $res_code= Dumper( $res );
e405c23a 521
9a3182e9
YO
522 push @{$dmp_res{$res_code}{vals}}, $cond;
523 if (!$dmp_res{$res_code}{optree}) {
524 $dmp_res{$res_code}{optree}= $res;
525 push @res_order, $res_code;
526 }
527 }
528
529 # now that we have deduped the optrees we construct a new optree containing the merged
530 # results.
531 my %root;
532 my $node= \%root;
533 foreach my $res_code_idx (0 .. $#res_order) {
534 my $res_code= $res_order[$res_code_idx];
535 $node->{vals}= $dmp_res{$res_code}{vals};
536 $node->{test}= $test;
537 $node->{yes}= $dmp_res{$res_code}{optree};
538 $node->{depth}= $depth;
539 if ($res_code_idx < $#res_order) {
540 $node= $node->{no}= {};
12b72891 541 } else {
9a3182e9 542 $node->{no}= $else;
12b72891
RGS
543 }
544 }
e405c23a
YO
545
546 # return the optree.
547 return \%root;
12b72891
RGS
548}
549
e64b1bd1
YO
550# my $optree= optree(%opts);
551#
552# Convert a trie to an optree, wrapper for _optree
553
554sub optree {
555 my $self= shift;
556 my %opt= @_;
557 my $trie= $self->make_trie( $opt{type}, $opt{max_depth} );
558 $opt{ret_type} ||= 'len';
900c17f9 559 my $test_type= $opt{type} =~ /^cp/ ? 'cp' : 'depth';
e64b1bd1 560 return $self->_optree( $trie, $test_type, $opt{ret_type}, $opt{else}, 0 );
12b72891
RGS
561}
562
e64b1bd1
YO
563# my $optree= generic_optree(%opts);
564#
565# build a "generic" optree out of the three 'low', 'latin1', 'utf8'
566# sets of strings, including a branch for handling the string type check.
567#
568
569sub generic_optree {
570 my $self= shift;
571 my %opt= @_;
572
573 $opt{ret_type} ||= 'len';
574 my $test_type= 'depth';
575 my $else= $opt{else} || 0;
576
577 my $latin1= $self->make_trie( 'latin1', $opt{max_depth} );
578 my $utf8= $self->make_trie( 'utf8', $opt{max_depth} );
579
580 $_= $self->_optree( $_, $test_type, $opt{ret_type}, $else, 0 )
581 for $latin1, $utf8;
582
583 if ( $utf8 ) {
584 $else= __cond_join( "( is_utf8 )", $utf8, $latin1 || $else );
585 } elsif ( $latin1 ) {
586 $else= __cond_join( "!( is_utf8 )", $latin1, $else );
587 }
87894a24 588 if ($opt{type} eq 'generic') {
61de6bbc
KW
589 my $low= $self->make_trie( 'low', $opt{max_depth} );
590 if ( $low ) {
591 $else= $self->_optree( $low, $test_type, $opt{ret_type}, $else, 0 );
592 }
87894a24 593 }
e64b1bd1
YO
594
595 return $else;
12b72891
RGS
596}
597
e64b1bd1 598# length_optree()
12b72891 599#
e64b1bd1 600# create a string length guarded optree.
12b72891 601#
e64b1bd1
YO
602
603sub length_optree {
604 my $self= shift;
605 my %opt= @_;
606 my $type= $opt{type};
607
608 die "Can't do a length_optree on type 'cp', makes no sense."
900c17f9 609 if $type =~ /^cp/;
e64b1bd1 610
5ab0c3af
KW
611 my $else= ( $opt{else} ||= 0 );
612
c03e41dd
KW
613 return $else if $self->{count} == 0;
614
5ab0c3af
KW
615 my $method = $type =~ /generic/ ? 'generic_optree' : 'optree';
616 if ($method eq 'optree' && scalar keys %{$self->{size}{$type}} == 1) {
617
618 # Here is non-generic output (meaning that we are only generating one
619 # type), and all things that match have the same number ('size') of
620 # bytes. The length guard is simply that we have that number of
621 # bytes.
622 my @size = keys %{$self->{size}{$type}};
623 my $cond= "((e) - (s)) >= $size[0]";
624 my $optree = $self->$method(%opt);
625 $else= __cond_join( $cond, $optree, $else );
626 }
627 elsif ($self->{has_multi}) {
628 my @size;
e64b1bd1 629
5ab0c3af
KW
630 # Here, there can be a match of a multiple character string. We use
631 # the traditional method which is to have a branch for each possible
632 # size (longest first) and test for the legal values for that size.
e64b1bd1
YO
633 my %sizes= (
634 %{ $self->{size}{low} || {} },
635 %{ $self->{size}{latin1} || {} },
636 %{ $self->{size}{utf8} || {} }
637 );
5ab0c3af
KW
638 if ($method eq 'generic_optree') {
639 @size= sort { $a <=> $b } keys %sizes;
640 } else {
641 @size= sort { $a <=> $b } keys %{ $self->{size}{$type} };
642 }
643 for my $size ( @size ) {
644 my $optree= $self->$method( %opt, type => $type, max_depth => $size );
645 my $cond= "((e)-(s) > " . ( $size - 1 ).")";
646 $else= __cond_join( $cond, $optree, $else );
647 }
12b72891 648 }
5ab0c3af
KW
649 else {
650 my $utf8;
651
652 # Here, has more than one possible size, and only matches a single
653 # character. For non-utf8, the needed length is 1; for utf8, it is
654 # found by array lookup 'UTF8SKIP'.
655
656 # If want just the code points above 255, set up to look for those;
657 # otherwise assume will be looking for all non-UTF-8-invariant code
658 # poiints.
659 my $trie_type = ($type eq 'high') ? 'high' : 'utf8';
660
661 # If we do want more than the 0-255 range, find those, and if they
662 # exist...
663 if ($opt{type} !~ /latin1/i && ($utf8 = $self->make_trie($trie_type, 0))) {
664
665 # ... get them into an optree, and set them up as the 'else' clause
666 $utf8 = $self->_optree( $utf8, 'depth', $opt{ret_type}, 0, 0 );
667
668 # We could make this
669 # UTF8_IS_START(*s) && ((e) - (s)) >= UTF8SKIP(s))";
670 # to avoid doing the UTF8SKIP and subsequent branches for invariants
671 # that don't match. But the current macros that get generated
672 # have only a few things that can match past this, so I (khw)
673 # don't think it is worth it. (Even better would be to use
674 # calculate_mask(keys %$utf8) instead of UTF8_IS_START, and use it
dd9bc2b0
KW
675 # if it saves a bunch. We assume that input text likely to be
676 # well-formed .
677 my $cond = "LIKELY(((e) - (s)) >= UTF8SKIP(s))";
5ab0c3af
KW
678 $else = __cond_join($cond, $utf8, $else);
679
680 # For 'generic', we also will want the latin1 UTF-8 variants for
681 # the case where the input isn't UTF-8.
682 my $latin1;
683 if ($method eq 'generic_optree') {
684 $latin1 = $self->make_trie( 'latin1', 1);
685 $latin1= $self->_optree( $latin1, 'depth', $opt{ret_type}, 0, 0 );
686 }
e64b1bd1 687
5ab0c3af
KW
688 # If we want the UTF-8 invariants, get those.
689 my $low;
690 if ($opt{type} !~ /non_low|high/
691 && ($low= $self->make_trie( 'low', 1)))
692 {
693 $low= $self->_optree( $low, 'depth', $opt{ret_type}, 0, 0 );
694
695 # Expand out the UTF-8 invariants as a string so that we
696 # can use them as the conditional
697 $low = $self->_cond_as_str( $low, 0, \%opt);
698
699 # If there are Latin1 variants, add a test for them.
700 if ($latin1) {
701 $else = __cond_join("(! is_utf8 )", $latin1, $else);
702 }
703 elsif ($method eq 'generic_optree') {
704
705 # Otherwise for 'generic' only we know that what
706 # follows must be valid for just UTF-8 strings,
707 $else->{test} = "( is_utf8 && $else->{test} )";
708 }
709
710 # If the invariants match, we are done; otherwise we have
711 # to go to the 'else' clause.
712 $else = __cond_join($low, 1, $else);
713 }
714 elsif ($latin1) { # Here, didn't want or didn't have invariants,
715 # but we do have latin variants
716 $else = __cond_join("(! is_utf8)", $latin1, $else);
717 }
718
719 # We need at least one byte available to start off the tests
dd9bc2b0 720 $else = __cond_join("LIKELY((e) > (s))", $else, 0);
5ab0c3af
KW
721 }
722 else { # Here, we don't want or there aren't any variants. A single
723 # byte available is enough.
724 my $cond= "((e) > (s))";
725 my $optree = $self->$method(%opt);
726 $else= __cond_join( $cond, $optree, $else );
727 }
e64b1bd1 728 }
5ab0c3af 729
e64b1bd1 730 return $else;
12b72891
RGS
731}
732
2efb8143 733sub calculate_mask(@) {
75929b4b
KW
734 # Look at the input list of byte values. This routine returns an array of
735 # mask/base pairs to generate that list.
736
2efb8143
KW
737 my @list = @_;
738 my $list_count = @list;
739
75929b4b
KW
740 # Consider a set of byte values, A, B, C .... If we want to determine if
741 # <c> is one of them, we can write c==A || c==B || c==C .... If the
742 # values are consecutive, we can shorten that to A<=c && c<=Z, which uses
743 # far fewer branches. If only some of them are consecutive we can still
744 # save some branches by creating range tests for just those that are
745 # consecutive. _cond_as_str() does this work for looking for ranges.
746 #
747 # Another approach is to look at the bit patterns for A, B, C .... and see
748 # if they have some commonalities. That's what this function does. For
749 # example, consider a set consisting of the bytes
750 # 0xF0, 0xF1, 0xF2, and 0xF3. We could write:
2efb8143
KW
751 # 0xF0 <= c && c <= 0xF4
752 # But the following mask/compare also works, and has just one test:
75929b4b
KW
753 # (c & 0xFC) == 0xF0
754 # The reason it works is that the set consists of exactly those bytes
2efb8143 755 # whose first 4 bits are 1, and the next two are 0. (The value of the
75929b4b 756 # other 2 bits is immaterial in determining if a byte is in the set or
2efb8143 757 # not.) The mask masks out those 2 irrelevant bits, and the comparison
75929b4b
KW
758 # makes sure that the result matches all bytes which match those 6
759 # material bits exactly. In other words, the set of bytes contains
2efb8143
KW
760 # exactly those whose bottom two bit positions are either 0 or 1. The
761 # same principle applies to bit positions that are not necessarily
762 # adjacent. And it can be applied to bytes that differ in 1 through all 8
763 # bit positions. In order to be a candidate for this optimization, the
75929b4b
KW
764 # number of bytes in the set must be a power of 2.
765 #
766 # Consider a different example, the set 0x53, 0x54, 0x73, and 0x74. That
767 # requires 4 tests using either ranges or individual values, and even
768 # though the number in the set is a power of 2, it doesn't qualify for the
769 # mask optimization described above because the number of bits that are
770 # different is too large for that. However, the set can be expressed as
771 # two branches with masks thusly:
772 # (c & 0xDF) == 0x53 || (c & 0xDF) == 0x54
773 # a branch savings of 50%. This is done by splitting the set into two
774 # subsets each of which has 2 elements, and within each set the values
775 # differ by 1 byte.
776 #
777 # This function attempts to find some way to save some branches using the
778 # mask technique. If not, it returns an empty list; if so, it
779 # returns a list consisting of
780 # [ [compare1, mask1], [compare2, mask2], ...
781 # [compare_n, undef], [compare_m, undef], ...
782 # ]
783 # The <mask> is undef in the above for those bytes that must be tested
784 # for individually.
785 #
786 # This function does not attempt to find the optimal set. To do so would
787 # probably require testing all possible combinations, and keeping track of
788 # the current best one.
789 #
790 # There are probably much better algorithms, but this is the one I (khw)
791 # came up with. We start with doing a bit-wise compare of every byte in
792 # the set with every other byte. The results are sorted into arrays of
793 # all those that differ by the same bit positions. These are stored in a
794 # hash with the each key being the bits they differ in. Here is the hash
795 # for the 0x53, 0x54, 0x73, 0x74 set:
796 # {
797 # 4 => {
798 # "0,1,2,5" => [
799 # 83,
800 # 116,
801 # 84,
802 # 115
803 # ]
804 # },
805 # 3 => {
806 # "0,1,2" => [
807 # 83,
808 # 84,
809 # 115,
810 # 116
811 # ]
812 # }
813 # 1 => {
814 # 5 => [
815 # 83,
816 # 115,
817 # 84,
818 # 116
819 # ]
820 # },
821 # }
822 #
823 # The set consisting of values which differ in the 4 bit positions 0, 1,
824 # 2, and 5 from some other value in the set consists of all 4 values.
825 # Likewise all 4 values differ from some other value in the 3 bit
826 # positions 0, 1, and 2; and all 4 values differ from some other value in
827 # the single bit position 5. The keys at the uppermost level in the above
828 # hash, 1, 3, and 4, give the number of bit positions that each sub-key
829 # below it has. For example, the 4 key could have as its value an array
830 # consisting of "0,1,2,5", "0,1,2,6", and "3,4,6,7", if the inputs were
831 # such. The best optimization will group the most values into a single
832 # mask. The most values will be the ones that differ in the most
833 # positions, the ones with the largest value for the topmost key. These
834 # keys, are thus just for convenience of sorting by that number, and do
835 # not have any bearing on the core of the algorithm.
836 #
837 # We start with an element from largest number of differing bits. The
838 # largest in this case is 4 bits, and there is only one situation in this
839 # set which has 4 differing bits, "0,1,2,5". We look for any subset of
840 # this set which has 16 values that differ in these 4 bits. There aren't
841 # any, because there are only 4 values in the entire set. We then look at
842 # the next possible thing, which is 3 bits differing in positions "0,1,2".
843 # We look for a subset that has 8 values that differ in these 3 bits.
844 # Again there are none. So we go to look for the next possible thing,
845 # which is a subset of 2**1 values that differ only in bit position 5. 83
846 # and 115 do, so we calculate a mask and base for those and remove them
847 # from every set. Since there is only the one set remaining, we remove
848 # them from just this one. We then look to see if there is another set of
849 # 2 values that differ in bit position 5. 84 and 116 do, so we calculate
850 # a mask and base for those and remove them from every set (again only
851 # this set remains in this example). The set is now empty, and there are
852 # no more sets to look at, so we are done.
853
854 if ($list_count == 256) { # All 256 is trivially masked
2efb8143
KW
855 return (0, 0);
856 }
857
75929b4b
KW
858 my %hash;
859
860 # Generate bits-differing lists for each element compared against each
861 # other element
862 for my $i (0 .. $list_count - 2) {
863 for my $j ($i + 1 .. $list_count - 1) {
864 my @bits_that_differ = pop_count($list[$i] ^ $list[$j]);
865 my $differ_count = @bits_that_differ;
866 my $key = join ",", @bits_that_differ;
867 push @{$hash{$differ_count}{$key}}, $list[$i] unless grep { $_ == $list[$i] } @{$hash{$differ_count}{$key}};
868 push @{$hash{$differ_count}{$key}}, $list[$j];
869 }
870 }
2efb8143 871
75929b4b 872 print STDERR __LINE__, ": calculate_mask() called: List of values grouped by differing bits: ", Dumper \%hash if DEBUG;
2efb8143 873
75929b4b
KW
874 my @final_results;
875 foreach my $count (reverse sort { $a <=> $b } keys %hash) {
876 my $need = 2 ** $count; # Need 8 values for 3 differing bits, etc
de6cb0ab 877 foreach my $bits (sort keys $hash{$count}->%*) {
2efb8143 878
75929b4b 879 print STDERR __LINE__, ": For $count bit(s) difference ($bits), need $need; have ", scalar @{$hash{$count}{$bits}}, "\n" if DEBUG;
2efb8143 880
75929b4b
KW
881 # Look only as long as there are at least as many elements in the
882 # subset as are needed
883 while ((my $cur_count = @{$hash{$count}{$bits}}) >= $need) {
2efb8143 884
75929b4b 885 print STDERR __LINE__, ": Looking at bit positions ($bits): ", Dumper $hash{$count}{$bits} if DEBUG;
2efb8143 886
75929b4b
KW
887 # Start with the first element in it
888 my $try_base = $hash{$count}{$bits}[0];
889 my @subset = $try_base;
890
891 # If it succeeds, we return a mask and a base to compare
892 # against the masked value. That base will be the AND of
893 # every element in the subset. Initialize to the one element
894 # we have so far.
895 my $compare = $try_base;
896
897 # We are trying to find a subset of this that has <need>
898 # elements that differ in the bit positions given by the
899 # string $bits, which is comma separated.
900 my @bits = split ",", $bits;
901
902 TRY: # Look through the remainder of the list for other
903 # elements that differ only by these bit positions.
904
905 for (my $i = 1; $i < $cur_count; $i++) {
906 my $try_this = $hash{$count}{$bits}[$i];
907 my @positions = pop_count($try_base ^ $try_this);
908
909 print STDERR __LINE__, ": $try_base vs $try_this: is (", join(',', @positions), ") a subset of ($bits)?" if DEBUG;;
910
911 foreach my $pos (@positions) {
912 unless (grep { $pos == $_ } @bits) {
913 print STDERR " No\n" if DEBUG;
914 my $remaining = $cur_count - $i - 1;
915 if ($remaining && @subset + $remaining < $need) {
916 print STDERR __LINE__, ": Can stop trying $try_base, because even if all the remaining $remaining values work, they wouldn't add up to the needed $need when combined with the existing ", scalar @subset, " ones\n" if DEBUG;
917 last TRY;
918 }
919 next TRY;
920 }
921 }
922
923 print STDERR " Yes\n" if DEBUG;
924 push @subset, $try_this;
925
926 # Add this to the mask base, in case it ultimately
927 # succeeds,
928 $compare &= $try_this;
929 }
930
931 print STDERR __LINE__, ": subset (", join(", ", @subset), ") has ", scalar @subset, " elements; needs $need\n" if DEBUG;
932
933 if (@subset < $need) {
934 shift @{$hash{$count}{$bits}};
935 next; # Try with next value
936 }
2efb8143 937
75929b4b
KW
938 # Create the mask
939 my $mask = 0;
940 foreach my $position (@bits) {
941 $mask |= 1 << $position;
942 }
943 $mask = ~$mask & 0xFF;
944 push @final_results, [$compare, $mask];
945
946 printf STDERR "%d: Got it: compare=%d=0x%X; mask=%X\n", __LINE__, $compare, $compare, $mask if DEBUG;
947
948 # These values are now spoken for. Remove them from future
949 # consideration
122a2d8f
YO
950 foreach my $remove_count (sort keys %hash) {
951 foreach my $bits (sort keys %{$hash{$remove_count}}) {
75929b4b
KW
952 foreach my $to_remove (@subset) {
953 @{$hash{$remove_count}{$bits}} = grep { $_ != $to_remove } @{$hash{$remove_count}{$bits}};
954 }
955 }
956 }
957 }
958 }
2efb8143
KW
959 }
960
75929b4b
KW
961 # Any values that remain in the list are ones that have to be tested for
962 # individually.
963 my @individuals;
964 foreach my $count (reverse sort { $a <=> $b } keys %hash) {
de6cb0ab 965 foreach my $bits (sort keys $hash{$count}->%*) {
75929b4b
KW
966 foreach my $remaining (@{$hash{$count}{$bits}}) {
967
968 # If we already know about this value, just ignore it.
969 next if grep { $remaining == $_ } @individuals;
970
971 # Otherwise it needs to be returned as something to match
972 # individually
973 push @final_results, [$remaining, undef];
974 push @individuals, $remaining;
975 }
976 }
2efb8143 977 }
2efb8143 978
75929b4b
KW
979 # Sort by increasing numeric value
980 @final_results = sort { $a->[0] <=> $b->[0] } @final_results;
981
982 print STDERR __LINE__, ": Final return: ", Dumper \@final_results if DEBUG;
983
984 return @final_results;
2efb8143
KW
985}
986
e64b1bd1
YO
987# _cond_as_str
988# turn a list of conditions into a text expression
989# - merges ranges of conditions, and joins the result with ||
990sub _cond_as_str {
ba073cf2 991 my ( $self, $op, $combine, $opts_ref )= @_;
e64b1bd1
YO
992 my $cond= $op->{vals};
993 my $test= $op->{test};
2efb8143 994 my $is_cp_ret = $opts_ref->{ret_type} eq "cp";
e64b1bd1
YO
995 return "( $test )" if !defined $cond;
996
f5772832 997 # rangify the list.
e64b1bd1
YO
998 my @ranges;
999 my $Update= sub {
f5772832
KW
1000 # We skip this if there are optimizations that
1001 # we can apply (below) to the individual ranges
1002 if ( ($is_cp_ret || $combine) && @ranges && ref $ranges[-1]) {
e64b1bd1
YO
1003 if ( $ranges[-1][0] == $ranges[-1][1] ) {
1004 $ranges[-1]= $ranges[-1][0];
1005 } elsif ( $ranges[-1][0] + 1 == $ranges[-1][1] ) {
1006 $ranges[-1]= $ranges[-1][0];
1007 push @ranges, $ranges[-1] + 1;
1008 }
1009 }
1010 };
4a8ca70e
KW
1011 for my $condition ( @$cond ) {
1012 if ( !@ranges || $condition != $ranges[-1][1] + 1 ) {
e64b1bd1 1013 $Update->();
4a8ca70e 1014 push @ranges, [ $condition, $condition ];
e64b1bd1
YO
1015 } else {
1016 $ranges[-1][1]++;
1017 }
1018 }
1019 $Update->();
f5772832 1020
e64b1bd1
YO
1021 return $self->_combine( $test, @ranges )
1022 if $combine;
f5772832
KW
1023
1024 if ($is_cp_ret) {
1f063c57
KW
1025 @ranges= map {
1026 ref $_
1027 ? sprintf(
1028 "( $self->{val_fmt} <= $test && $test <= $self->{val_fmt} )",
1029 @$_ )
1030 : sprintf( "$self->{val_fmt} == $test", $_ );
1031 } @ranges;
6a52943c
KW
1032
1033 return "( " . join( " || ", @ranges ) . " )";
f5772832 1034 }
75929b4b 1035
2358c533
KW
1036 # If the input set has certain characteristics, we can optimize tests
1037 # for it. This doesn't apply if returning the code point, as we want
1038 # each element of the set individually. The code above is for this
1039 # simpler case.
1040
1041 return 1 if @$cond == 256; # If all bytes match, is trivially true
1042
75929b4b 1043 my @masks;
2358c533 1044 if (@ranges > 1) {
75929b4b 1045
b6a6e956 1046 # See if the entire set shares optimizable characteristics, and if so,
2358c533
KW
1047 # return the optimization. We delay checking for this on sets with
1048 # just a single range, as there may be better optimizations available
1049 # in that case.
75929b4b
KW
1050 @masks = calculate_mask(@$cond);
1051
1052 # Stringify the output of calculate_mask()
1053 if (@masks) {
1054 my @return;
1055 foreach my $mask_ref (@masks) {
1056 if (defined $mask_ref->[1]) {
1057 push @return, sprintf "( ( $test & $self->{val_fmt} ) == $self->{val_fmt} )", $mask_ref->[1], $mask_ref->[0];
1058 }
1059 else { # An undefined mask means to use the value as-is
1060 push @return, sprintf "$test == $self->{val_fmt}", $mask_ref->[0];
1061 }
1062 }
1063
1064 # The best possible case below for specifying this set of values via
1065 # ranges is 1 branch per range. If our mask method yielded better
1066 # results, there is no sense trying something that is bound to be
1067 # worse.
1068 if (@return < @ranges) {
1069 return "( " . join( " || ", @return ) . " )";
1070 }
1071
1072 @masks = @return;
6e130234 1073 }
2358c533 1074 }
f5772832 1075
75929b4b
KW
1076 # Here, there was no entire-class optimization that was clearly better
1077 # than doing things by ranges. Look at each range.
1078 my $range_count_extra = 0;
2358c533
KW
1079 for (my $i = 0; $i < @ranges; $i++) {
1080 if (! ref $ranges[$i]) { # Trivial case: no range
1081 $ranges[$i] = sprintf "$self->{val_fmt} == $test", $ranges[$i];
1082 }
1083 elsif ($ranges[$i]->[0] == $ranges[$i]->[1]) {
1084 $ranges[$i] = # Trivial case: single element range
1085 sprintf "$self->{val_fmt} == $test", $ranges[$i]->[0];
1086 }
726137b5
KW
1087 elsif ($ranges[$i]->[0] == 0) {
1088 # If the range matches all 256 possible bytes, it is trivially
1089 # true.
1090 return 1 if $ranges[0]->[1] == 0xFF; # @ranges must be 1 in
1091 # this case
1092 $ranges[$i] = sprintf "( $test <= $self->{val_fmt} )",
1093 $ranges[$i]->[1];
1094 }
1095 elsif ($ranges[$i]->[1] == 255) {
1096
1097 # Similarly the max possible is 255, so can omit an upper bound
1098 # test if the calculated max is the max possible one.
1099 $ranges[$i] = sprintf "( $test >= $self->{val_fmt} )",
1100 $ranges[0]->[0];
1101 }
2358c533
KW
1102 else {
1103 my $output = "";
1104
1105 # Well-formed UTF-8 continuation bytes on ascii platforms must be
1106 # in the range 0x80 .. 0xBF. If we know that the input is
1107 # well-formed (indicated by not trying to be 'safe'), we can omit
1108 # tests that verify that the input is within either of these
1109 # bounds. (No legal UTF-8 character can begin with anything in
1110 # this range, so we don't have to worry about this being a
1111 # continuation byte or not.)
a1b2a50f
KW
1112 if ($opts_ref->{charset} =~ /ascii/i
1113 && (! $opts_ref->{safe} && ! $opts_ref->{no_length_checks})
2358c533
KW
1114 && $opts_ref->{type} =~ / ^ (?: utf8 | high ) $ /xi)
1115 {
1116 my $lower_limit_is_80 = ($ranges[$i]->[0] == 0x80);
1117 my $upper_limit_is_BF = ($ranges[$i]->[1] == 0xBF);
1118
1119 # If the range is the entire legal range, it matches any legal
1120 # byte, so we can omit both tests. (This should happen only
1121 # if the number of ranges is 1.)
1122 if ($lower_limit_is_80 && $upper_limit_is_BF) {
1123 return 1;
6e130234 1124 }
2358c533
KW
1125 elsif ($lower_limit_is_80) { # Just use the upper limit test
1126 $output = sprintf("( $test <= $self->{val_fmt} )",
1127 $ranges[$i]->[1]);
f5772832 1128 }
2358c533
KW
1129 elsif ($upper_limit_is_BF) { # Just use the lower limit test
1130 $output = sprintf("( $test >= $self->{val_fmt} )",
1131 $ranges[$i]->[0]);
f5772832 1132 }
2358c533
KW
1133 }
1134
1135 # If we didn't change to omit a test above, see if the number of
1136 # elements is a power of 2 (only a single bit in the
1137 # representation of its count will be set) and if so, it may be
1138 # that a mask/compare optimization is possible.
1139 if ($output eq ""
1140 && pop_count($ranges[$i]->[1] - $ranges[$i]->[0] + 1) == 1)
1141 {
1142 my @list;
1143 push @list, $_ for ($ranges[$i]->[0] .. $ranges[$i]->[1]);
75929b4b
KW
1144 my @this_masks = calculate_mask(@list);
1145
1146 # Use the mask if there is just one for the whole range.
1147 # Otherwise there is no savings over the two branches that can
1148 # define the range.
1149 if (@this_masks == 1 && defined $this_masks[0][1]) {
1150 $output = sprintf "( $test & $self->{val_fmt} ) == $self->{val_fmt}", $this_masks[0][1], $this_masks[0][0];
f5772832
KW
1151 }
1152 }
2358c533
KW
1153
1154 if ($output ne "") { # Prefer any optimization
1155 $ranges[$i] = $output;
1156 }
75929b4b 1157 else {
2358c533
KW
1158 # No optimization happened. We need a test that the code
1159 # point is within both bounds. But, if the bounds are
1160 # adjacent code points, it is cleaner to say
1161 # 'first == test || second == test'
1162 # than it is to say
1163 # 'first <= test && test <= second'
75929b4b
KW
1164
1165 $range_count_extra++; # This range requires 2 branches to
1166 # represent
e2a80cb5
KW
1167 if ($ranges[$i]->[0] + 1 == $ranges[$i]->[1]) {
1168 $ranges[$i] = "( "
1169 . join( " || ", ( map
1170 { sprintf "$self->{val_fmt} == $test", $_ }
1171 @{$ranges[$i]} ) )
1172 . " )";
1173 }
1174 else { # Full bounds checking
1175 $ranges[$i] = sprintf("( $self->{val_fmt} <= $test && $test <= $self->{val_fmt} )", $ranges[$i]->[0], $ranges[$i]->[1]);
1176 }
75929b4b 1177 }
f5772832 1178 }
2358c533 1179 }
f5772832 1180
75929b4b
KW
1181 # We have generated the list of bytes in two ways; one trying to use masks
1182 # to cut the number of branches down, and the other to look at individual
1183 # ranges (some of which could be cut down by using a mask for just it).
1184 # We return whichever method uses the fewest branches.
1185 return "( "
1186 . join( " || ", (@masks && @masks < @ranges + $range_count_extra)
1187 ? @masks
1188 : @ranges)
1189 . " )";
12b72891
RGS
1190}
1191
e64b1bd1
YO
1192# _combine
1193# recursively turn a list of conditions into a fast break-out condition
1194# used by _cond_as_str() for 'cp' type macros.
1195sub _combine {
1196 my ( $self, $test, @cond )= @_;
1197 return if !@cond;
1198 my $item= shift @cond;
1199 my ( $cstr, $gtv );
6c62bf0f
KW
1200 if ( ref $item ) { # @item should be a 2-element array giving range start
1201 # and end
1202 if ($item->[0] == 0) { # UV's are never negative, so skip "0 <= "
1203 # test which could generate a compiler warning
1204 # that test is always true
1205 $cstr= sprintf( "$test <= $self->{val_fmt}", $item->[1] );
1206 }
1207 else {
1208 $cstr=
e64b1bd1 1209 sprintf( "( $self->{val_fmt} <= $test && $test <= $self->{val_fmt} )",
6c62bf0f
KW
1210 @$item );
1211 }
e64b1bd1 1212 $gtv= sprintf "$self->{val_fmt}", $item->[1];
12b72891 1213 } else {
e64b1bd1
YO
1214 $cstr= sprintf( "$self->{val_fmt} == $test", $item );
1215 $gtv= sprintf "$self->{val_fmt}", $item;
12b72891 1216 }
e64b1bd1 1217 if ( @cond ) {
ee98d22d
YO
1218 my $combine= $self->_combine( $test, @cond );
1219 if (@cond >1) {
1220 return "( $cstr || ( $gtv < $test &&\n"
1221 . $combine . " ) )";
1222 } else {
1223 return "( $cstr || $combine )";
1224 }
12b72891 1225 } else {
e64b1bd1 1226 return $cstr;
12b72891 1227 }
e64b1bd1 1228}
12b72891 1229
e64b1bd1
YO
1230# _render()
1231# recursively convert an optree to text with reasonably neat formatting
1232sub _render {
39a0f513 1233 my ( $self, $op, $combine, $brace, $opts_ref, $def, $submacros )= @_;
2e39f0c2 1234 return 0 if ! defined $op; # The set is empty
e64b1bd1
YO
1235 if ( !ref $op ) {
1236 return $op;
12b72891 1237 }
ba073cf2 1238 my $cond= $self->_cond_as_str( $op, $combine, $opts_ref );
cc08b31c
KW
1239 #no warnings 'recursion'; # This would allow really really inefficient
1240 # code to be generated. See pod
39a0f513 1241 my $yes= $self->_render( $op->{yes}, $combine, 1, $opts_ref, $def, $submacros );
30188af7
KW
1242 return $yes if $cond eq '1';
1243
39a0f513 1244 my $no= $self->_render( $op->{no}, $combine, 0, $opts_ref, $def, $submacros );
e64b1bd1
YO
1245 return "( $cond )" if $yes eq '1' and $no eq '0';
1246 my ( $lb, $rb )= $brace ? ( "( ", " )" ) : ( "", "" );
1247 return "$lb$cond ? $yes : $no$rb"
1248 if !ref( $op->{yes} ) && !ref( $op->{no} );
1249 my $ind1= " " x 4;
1250 my $ind= "\n" . ( $ind1 x $op->{depth} );
1251
1252 if ( ref $op->{yes} ) {
1253 $yes= $ind . $ind1 . $yes;
1254 } else {
1255 $yes= " " . $yes;
1256 }
1257
39a0f513
YO
1258 my $str= "$lb$cond ?$yes$ind: $no$rb";
1259 if (length $str > 6000) {
1260 push @$submacros, sprintf "#define $def\n( %s )", "_part" . (my $yes_idx= 0+@$submacros), $yes;
1261 push @$submacros, sprintf "#define $def\n( %s )", "_part" . (my $no_idx= 0+@$submacros), $no;
1262 return sprintf "%s%s ? $def : $def%s", $lb, $cond, "_part$yes_idx", "_part$no_idx", $rb;
1263 }
1264 return $str;
12b72891 1265}
32e6a07c 1266
e64b1bd1
YO
1267# $expr=render($op,$combine)
1268#
1269# convert an optree to text with reasonably neat formatting. If $combine
1270# is true then the condition is created using "fast breakouts" which
1271# produce uglier expressions that are more efficient for common case,
1272# longer lists such as that resulting from type 'cp' output.
1273# Currently only used for type 'cp' macros.
1274sub render {
39a0f513
YO
1275 my ( $self, $op, $combine, $opts_ref, $def_fmt )= @_;
1276
1277 my @submacros;
1278 my $macro= sprintf "#define $def_fmt\n( %s )", "", $self->_render( $op, $combine, 0, $opts_ref, $def_fmt, \@submacros );
1279
1280 return join "\n\n", map { "/*** GENERATED CODE ***/\n" . __macro( __clean( $_ ) ) } @submacros, $macro;
12b72891 1281}
e64b1bd1
YO
1282
1283# make_macro
1284# make a macro of a given type.
1285# calls into make_trie and (generic_|length_)optree as needed
1286# Opts are:
40f914fd
KW
1287# type : 'cp','cp_high', 'generic','high','low','latin1','utf8','LATIN1','UTF8'
1288# ret_type : 'cp' or 'len'
1289# safe : don't assume is well-formed UTF-8, so don't skip any range
1290# checks, and add length guards to macro
1291# no_length_checks : like safe, but don't add length guards.
e64b1bd1
YO
1292#
1293# type defaults to 'generic', and ret_type to 'len' unless type is 'cp'
1294# in which case it defaults to 'cp' as well.
1295#
3ff97bcf 1296# It is illegal to do a type 'cp' macro on a pattern with multi-codepoint
e64b1bd1
YO
1297# sequences in it, as the generated macro will accept only a single codepoint
1298# as an argument.
1299#
6b94381d
KW
1300# It is also illegal to do a non-safe macro on a pattern with multi-codepoint
1301# sequences in it, as even if it is known to be well-formed, we need to not
91e83b73 1302# run off the end of the buffer when, say, the buffer ends with the first two
6b94381d
KW
1303# characters, but three are looked at by the macro.
1304#
e64b1bd1
YO
1305# returns the macro.
1306
1307
1308sub make_macro {
1309 my $self= shift;
1310 my %opts= @_;
1311 my $type= $opts{type} || 'generic';
6b94381d
KW
1312 if ($self->{has_multi}) {
1313 if ($type =~ /^cp/) {
1314 die "Can't do a 'cp' on multi-codepoint character class '$self->{op}'"
1315 }
1316 elsif (! $opts{safe}) {
1317 die "'safe' is required on multi-codepoint character class '$self->{op}'"
1318 }
1319 }
900c17f9 1320 my $ret_type= $opts{ret_type} || ( $opts{type} =~ /^cp/ ? 'cp' : 'len' );
e64b1bd1
YO
1321 my $method;
1322 if ( $opts{safe} ) {
1323 $method= 'length_optree';
87894a24 1324 } elsif ( $type =~ /generic/ ) {
e64b1bd1
YO
1325 $method= 'generic_optree';
1326 } else {
1327 $method= 'optree';
1328 }
900c17f9 1329 my @args= $type =~ /^cp/ ? 'cp' : 's';
e64b1bd1 1330 push @args, "e" if $opts{safe};
87894a24 1331 push @args, "is_utf8" if $type =~ /generic/;
e64b1bd1
YO
1332 push @args, "len" if $ret_type eq 'both';
1333 my $pfx= $ret_type eq 'both' ? 'what_len_' :
1334 $ret_type eq 'cp' ? 'what_' : 'is_';
87894a24
KW
1335 my $ext= $type =~ /generic/ ? '' : '_' . lc( $type );
1336 $ext .= '_non_low' if $type eq 'generic_non_low';
e64b1bd1 1337 $ext .= "_safe" if $opts{safe};
40f914fd 1338 $ext .= "_no_length_checks" if $opts{no_length_checks};
e64b1bd1 1339 my $argstr= join ",", @args;
39a0f513
YO
1340 my $def_fmt="$pfx$self->{op}$ext%s($argstr)";
1341 my $optree= $self->$method( %opts, type => $type, ret_type => $ret_type );
1342 return $self->render( $optree, ($type =~ /^cp/) ? 1 : 0, \%opts, $def_fmt );
32e6a07c 1343}
e64b1bd1 1344
b6a6e956 1345# if we aren't being used as a module (highly likely) then process
e64b1bd1
YO
1346# the __DATA__ below and produce macros in regcharclass.h
1347# if an argument is provided to the script then it is assumed to
1348# be the path of the file to output to, if the arg is '-' outputs
1349# to STDOUT.
1350if ( !caller ) {
e64b1bd1 1351 $|++;
8770da0e 1352 my $path= shift @ARGV || "regcharclass.h";
e64b1bd1
YO
1353 my $out_fh;
1354 if ( $path eq '-' ) {
1355 $out_fh= \*STDOUT;
1356 } else {
29c22b52 1357 $out_fh = open_new( $path );
e64b1bd1 1358 }
8770da0e
NC
1359 print $out_fh read_only_top( lang => 'C', by => $0,
1360 file => 'regcharclass.h', style => '*',
212b6c86
KW
1361 copyright => [2007, 2011],
1362 final => <<EOF,
1363WARNING: These macros are for internal Perl core use only, and may be
1364changed or removed without notice.
1365EOF
1366 );
6a5bc5ac 1367 print $out_fh "\n#ifndef PERL_REGCHARCLASS_H_ /* Guard against nested #includes */\n#define PERL_REGCHARCLASS_H_\n";
12b72891 1368
bb949220 1369 my ( $op, $title, @txt, @types, %mods );
a1b2a50f 1370 my $doit= sub ($) {
e64b1bd1 1371 return unless $op;
ae1d4929 1372
a1b2a50f
KW
1373 my $charset = shift;
1374
ae1d4929 1375 # Skip if to compile on a different platform.
a1b2a50f
KW
1376 return if delete $mods{only_ascii_platform} && $charset !~ /ascii/i;
1377 return if delete $mods{only_ebcdic_platform} && $charset !~ /ebcdic/i;
ae1d4929 1378
e64b1bd1
YO
1379 print $out_fh "/*\n\t$op: $title\n\n";
1380 print $out_fh join "\n", ( map { "\t$_" } @txt ), "*/", "";
a1b2a50f 1381 my $obj= __PACKAGE__->new( op => $op, title => $title, txt => \@txt, charset => $charset);
e64b1bd1 1382
bb949220
KW
1383 #die Dumper(\@types,\%mods);
1384
1385 my @mods;
1386 push @mods, 'safe' if delete $mods{safe};
40f914fd 1387 push @mods, 'no_length_checks' if delete $mods{no_length_checks};
bb949220
KW
1388 unshift @mods, 'fast' if delete $mods{fast} || ! @mods; # Default to 'fast'
1389 # do this one
1390 # first, as
1391 # traditional
1392 if (%mods) {
122a2d8f 1393 die "Unknown modifiers: ", join ", ", map { "'$_'" } sort keys %mods;
bb949220 1394 }
e64b1bd1
YO
1395
1396 foreach my $type_spec ( @types ) {
1397 my ( $type, $ret )= split /-/, $type_spec;
1398 $ret ||= 'len';
1399 foreach my $mod ( @mods ) {
f71bd789
KW
1400
1401 # 'safe' is irrelevant with code point macros, so skip if
1402 # there is also a 'fast', but don't skip if this is the only
1403 # way a cp macro will get generated. Below we convert 'safe'
1404 # to 'fast' in this instance
1405 next if $type =~ /^cp/
40f914fd
KW
1406 && ($mod eq 'safe' || $mod eq 'no_length_checks')
1407 && grep { 'fast' =~ $_ } @mods;
bb949220 1408 delete $mods{$mod};
e64b1bd1
YO
1409 my $macro= $obj->make_macro(
1410 type => $type,
1411 ret_type => $ret,
81200454 1412 safe => $mod eq 'safe' && $type !~ /^cp/,
a1b2a50f 1413 charset => $charset,
40f914fd 1414 no_length_checks => $mod eq 'no_length_checks' && $type !~ /^cp/,
e64b1bd1
YO
1415 );
1416 print $out_fh $macro, "\n";
1417 }
32e6a07c 1418 }
e64b1bd1
YO
1419 };
1420
a1b2a50f
KW
1421 my @data = <DATA>;
1422 foreach my $charset (get_supported_code_pages()) {
1423 my $first_time = 1;
1424 undef $op;
1425 undef $title;
1426 undef @txt;
1427 undef @types;
1428 undef %mods;
1429 print $out_fh "\n", get_conditional_compile_line_start($charset);
1430 my @data_copy = @data;
1431 for (@data_copy) {
91e83b73
KW
1432 s/^ \s* (?: \# .* ) ? $ //x; # squeeze out comment and blanks
1433 next unless /\S/;
1434 chomp;
1435 if ( /^[A-Z]/ ) {
1436 $doit->($charset) unless $first_time; # This starts a new
1437 # definition; do the
1438 # previous one
1439 $first_time = 0;
1440 ( $op, $title )= split /\s*:\s*/, $_, 2;
1441 @txt= ();
1442 } elsif ( s/^=>// ) {
1443 my ( $type, $modifier )= split /:/, $_;
1444 @types= split ' ', $type;
1445 undef %mods;
1446 map { $mods{$_} = 1 } split ' ', $modifier;
1447 } else {
1448 push @txt, "$_";
1449 }
12b72891 1450 }
91e83b73 1451 $doit->($charset);
a1b2a50f
KW
1452 print $out_fh get_conditional_compile_line_end();
1453 }
d10c72f2 1454
6a5bc5ac 1455 print $out_fh "\n#endif /* PERL_REGCHARCLASS_H_ */\n";
d10c72f2 1456
8770da0e
NC
1457 if($path eq '-') {
1458 print $out_fh "/* ex: set ro: */\n";
1459 } else {
0c36c41b
KW
1460 # Some of the sources for these macros come from Unicode tables
1461 my $sources_list = "lib/unicore/mktables.lst";
b60dc4b9
KW
1462 my @sources = ($0, qw(lib/unicore/mktables
1463 lib/Unicode/UCD.pm
1464 regen/regcharclass_multi_char_folds.pl
1465 regen/charset_translations.pl
1466 ));
0c36c41b
KW
1467 {
1468 # Depend on mktables’ own sources. It’s a shorter list of files than
1469 # those that Unicode::UCD uses.
1ae6ead9 1470 if (! open my $mktables_list, '<', $sources_list) {
0c36c41b
KW
1471
1472 # This should force a rebuild once $sources_list exists
1473 push @sources, $sources_list;
1474 }
1475 else {
1476 while(<$mktables_list>) {
1477 last if /===/;
1478 chomp;
1479 push @sources, "lib/unicore/$_" if /^[^#]/;
1480 }
1481 }
1482 }
1483 read_only_bottom_close_and_rename($out_fh, \@sources)
8770da0e 1484 }
12b72891 1485}
e64b1bd1 1486
cc08b31c
KW
1487# The form of the input is a series of definitions to make macros for.
1488# The first line gives the base name of the macro, followed by a colon, and
1489# then text to be used in comments associated with the macro that are its
1490# title or description. In all cases the first (perhaps only) parameter to
1491# the macro is a pointer to the first byte of the code point it is to test to
1492# see if it is in the class determined by the macro. In the case of non-UTF8,
1493# the code point consists only of a single byte.
1494#
1495# The second line must begin with a '=>' and be followed by the types of
1496# macro(s) to be generated; these are specified below. A colon follows the
1497# types, followed by the modifiers, also specified below. At least one
1498# modifier is required.
1499#
1500# The subsequent lines give what code points go into the class defined by the
1501# macro. Multiple characters may be specified via a string like "\x0D\x0A",
60910c93
KW
1502# enclosed in quotes. Otherwise the lines consist of one of:
1503# 1) a single Unicode code point, prefaced by 0x
1504# 2) a single range of Unicode code points separated by a minus (and
1505# optional space)
1506# 3) a single Unicode property specified in the standard Perl form
1507# "\p{...}"
1508# 4) a line like 'do path'. This will do a 'do' on the file given by
1509# 'path'. It is assumed that this does nothing but load subroutines
1510# (See item 5 below). The reason 'require path' is not used instead is
1511# because 'do' doesn't assume that path is in @INC.
1512# 5) a subroutine call
1513# &pkg::foo(arg1, ...)
1514# where pkg::foo was loaded by a 'do' line (item 4). The subroutine
1515# returns an array of entries of forms like items 1-3 above. This
1516# allows more complex inputs than achievable from the other input types.
cc08b31c
KW
1517#
1518# A blank line or one whose first non-blank character is '#' is a comment.
1519# The definition of the macro is terminated by a line unlike those described.
1520#
1521# Valid types:
1522# low generate a macro whose name is 'is_BASE_low' and defines a
1523# class that includes only ASCII-range chars. (BASE is the
1524# input macro base name.)
1525# latin1 generate a macro whose name is 'is_BASE_latin1' and defines a
1526# class that includes only upper-Latin1-range chars. It is not
1527# designed to take a UTF-8 input parameter.
b1af8fef
KW
1528# high generate a macro whose name is 'is_BASE_high' and defines a
1529# class that includes all relevant code points that are above
1530# the Latin1 range. This is for very specialized uses only.
1531# It is designed to take only an input UTF-8 parameter.
cc08b31c
KW
1532# utf8 generate a macro whose name is 'is_BASE_utf8' and defines a
1533# class that includes all relevant characters that aren't ASCII.
1534# It is designed to take only an input UTF-8 parameter.
1535# LATIN1 generate a macro whose name is 'is_BASE_latin1' and defines a
1536# class that includes both ASCII and upper-Latin1-range chars.
1537# It is not designed to take a UTF-8 input parameter.
1538# UTF8 generate a macro whose name is 'is_BASE_utf8' and defines a
1539# class that can include any code point, adding the 'low' ones
1540# to what 'utf8' works on. It is designed to take only an input
1541# UTF-8 parameter.
1542# generic generate a macro whose name is 'is_BASE". It has a 2nd,
1543# boolean, parameter which indicates if the first one points to
1544# a UTF-8 string or not. Thus it works in all circumstances.
87894a24
KW
1545# generic_non_low generate a macro whose name is 'is_BASE_non_low". It has
1546# a 2nd, boolean, parameter which indicates if the first one
1547# points to a UTF-8 string or not. It excludes any ASCII-range
1548# matches, but otherwise it works in all circumstances.
cc08b31c
KW
1549# cp generate a macro whose name is 'is_BASE_cp' and defines a
1550# class that returns true if the UV parameter is a member of the
1551# class; false if not.
900c17f9
KW
1552# cp_high like cp, but it is assumed that it is known that the UV
1553# parameter is above Latin1. The name of the generated macro is
1554# 'is_BASE_cp_high'. This is different from high-cp, derived
1555# below.
cc08b31c
KW
1556# A macro of the given type is generated for each type listed in the input.
1557# The default return value is the number of octets read to generate the match.
1558# Append "-cp" to the type to have it instead return the matched codepoint.
1559# The macro name is changed to 'what_BASE...'. See pod for
1560# caveats
1561# Appending '-both" instead adds an extra parameter to the end of the argument
1562# list, which is a pointer as to where to store the number of
1563# bytes matched, while also returning the code point. The macro
1564# name is changed to 'what_len_BASE...'. See pod for caveats
1565#
1566# Valid modifiers:
1567# safe The input string is not necessarily valid UTF-8. In
1568# particular an extra parameter (always the 2nd) to the macro is
1569# required, which points to one beyond the end of the string.
1570# The macro will make sure not to read off the end of the
1571# string. In the case of non-UTF8, it makes sure that the
1572# string has at least one byte in it. The macro name has
1573# '_safe' appended to it.
40f914fd
KW
1574# no_length_checks The input string is not necessarily valid UTF-8, but it
1575# is to be assumed that the length has already been checked and
1576# found to be valid
cc08b31c
KW
1577# fast The input string is valid UTF-8. No bounds checking is done,
1578# and the macro can make assumptions that lead to faster
1579# execution.
a1b2a50f 1580# only_ascii_platform Skip this definition if the character set is for
ae1d4929 1581# a non-ASCII platform.
a1b2a50f 1582# only_ebcdic_platform Skip this definition if the character set is for
ae1d4929 1583# a non-EBCDIC platform.
cc08b31c
KW
1584# No modifier need be specified; fast is assumed for this case. If both
1585# 'fast', and 'safe' are specified, two macros will be created for each
1586# 'type'.
e90ac8de 1587#
295bcca9 1588# If run on a non-ASCII platform will automatically convert the Unicode input
cc08b31c
KW
1589# to native. The documentation above is slightly wrong in this case. 'low'
1590# actually refers to code points whose UTF-8 representation is the same as the
1591# non-UTF-8 version (invariants); and 'latin1' refers to all the rest of the
1592# code points less than 256.
5e6c6c1e
KW
1593
15941; # in the unlikely case we are being used as a module
1595
1596__DATA__
1597# This is no longer used, but retained in case it is needed some day.
e90ac8de
KW
1598# TRICKYFOLD: Problematic fold case letters. When adding to this list, also should add them to regcomp.c and fold_grind.t
1599# => generic cp generic-cp generic-both :fast safe
1600# 0x00DF # LATIN SMALL LETTER SHARP S
1601# 0x0390 # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
1602# 0x03B0 # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
1603# 0x1E9E # LATIN CAPITAL LETTER SHARP S, because maps to same as 00DF
1604# 0x1FD3 # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA; maps same as 0390
1605# 0x1FE3 # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA; maps same as 03B0
1606
12b72891 1607LNBREAK: Line Break: \R
5c025f03 1608=> generic UTF8 LATIN1 : safe
12b72891 1609"\x0D\x0A" # CRLF - Network (Windows) line ending
05b688d9 1610\p{VertSpace}
12b72891
RGS
1611
1612HORIZWS: Horizontal Whitespace: \h \H
507ce328 1613=> high cp_high : fast
05b688d9 1614\p{HorizSpace}
12b72891
RGS
1615
1616VERTWS: Vertical Whitespace: \v \V
507ce328 1617=> high cp_high : fast
05b688d9 1618\p{VertSpace}
612ead59 1619
4ac6419d 1620XDIGIT: Hexadecimal digits
507ce328 1621=> high cp_high : fast
4ac6419d
KW
1622\p{XDigit}
1623
bedac28b 1624XPERLSPACE: \p{XPerlSpace}
507ce328 1625=> high cp_high : fast
bedac28b
KW
1626\p{XPerlSpace}
1627
b96a92fb
KW
1628REPLACEMENT: Unicode REPLACEMENT CHARACTER
1629=> UTF8 :safe
16300xFFFD
1631
1632NONCHAR: Non character code points
89d986df 1633=> UTF8 :safe
099323b4 1634\p{_Perl_Nchar}
b96a92fb 1635
21cb232c 1636SURROGATE: Surrogate code points
89d986df 1637=> UTF8 :safe
099323b4 1638\p{_Perl_Surrogate}
b96a92fb 1639
784d4f31
KW
1640# This program was run with this enabled, and the results copied to utf8.h and
1641# utfebcdic.h; then this was commented out because it takes so long to figure
1642# out these 2 million code points. The results would not change unless utf8.h
1643# decides it wants a different maximum, or this program creates better
40f914fd
KW
1644# optimizations. Trying with 5 bytes used too much memory to calculate.
1645#
5dca9278
KW
1646# We don't generate code for invariants here because the EBCDIC form is too
1647# complicated and would slow things down; instead the user should test for
1648# invariants first.
1649#
784d4f31
KW
1650# 0x1FFFFF was chosen because for both UTF-8 and UTF-EBCDIC, its start byte
1651# is the same as 0x10FFFF, and it includes all the above-Unicode code points
1652# that have that start byte. In other words, it is the natural stopping place
1653# that includes all Unicode code points.
40f914fd 1654#
e23e8bc1 1655#STRICT_UTF8_CHAR: Matches legal Unicode UTF-8 variant code points, no surrrogates nor non-character code points
e23e8bc1
KW
1656#0x0080 - 0xD7FF
1657#0xE000 - 0xFDCF
1658#0xFDF0 - 0xFFFD
1659#0x10000 - 0x1FFFD
1660#0x20000 - 0x2FFFD
1661#0x30000 - 0x3FFFD
1662#0x40000 - 0x4FFFD
1663#0x50000 - 0x5FFFD
1664#0x60000 - 0x6FFFD
1665#0x70000 - 0x7FFFD
1666#0x80000 - 0x8FFFD
1667#0x90000 - 0x9FFFD
1668#0xA0000 - 0xAFFFD
1669#0xB0000 - 0xBFFFD
1670#0xC0000 - 0xCFFFD
1671#0xD0000 - 0xDFFFD
1672#0xE0000 - 0xEFFFD
1673#0xF0000 - 0xFFFFD
1674#0x100000 - 0x10FFFD
1675#
1676#STRICT_UTF8_CHAR: Matches legal Unicode UTF-8 variant code points, no surrrogates nor non-character code points
1677#=> UTF8 :no_length_checks only_ebcdic_platform
1678#0x00A0 - 0xD7FF
1679#0xE000 - 0xFDCF
1680#0xFDF0 - 0xFFFD
1681#0x10000 - 0x1FFFD
1682#0x20000 - 0x2FFFD
1683#0x30000 - 0x3FFFD
1684#0x40000 - 0x4FFFD
1685#0x50000 - 0x5FFFD
1686#0x60000 - 0x6FFFD
1687#0x70000 - 0x7FFFD
1688#0x80000 - 0x8FFFD
1689#0x90000 - 0x9FFFD
1690#0xA0000 - 0xAFFFD
1691#0xB0000 - 0xBFFFD
1692#0xC0000 - 0xCFFFD
1693#0xD0000 - 0xDFFFD
1694#0xE0000 - 0xEFFFD
1695#0xF0000 - 0xFFFFD
1696#0x100000 - 0x10FFFD
4d646140 1697
a82be82b
KW
1698#C9_STRICT_UTF8_CHAR: Matches legal Unicode UTF-8 variant code points, no surrogates
1699#=> UTF8 :no_length_checks only_ascii_platform
1700#0x0080 - 0xD7FF
1701#0xE000 - 0x10FFFF
1702#
1703#C9_STRICT_UTF8_CHAR: Matches legal Unicode UTF-8 variant code points including non-character code points, no surrogates
1704#=> UTF8 :no_length_checks only_ebcdic_platform
1705#0x00A0 - 0xD7FF
1706#0xE000 - 0x10FFFF
1707
685289b5
KW
1708QUOTEMETA: Meta-characters that \Q should quote
1709=> high :fast
1710\p{_Perl_Quotemeta}
8769f413
KW
1711
1712MULTI_CHAR_FOLD: multi-char strings that are folded to by a single character
251b239f 1713=> UTF8 :safe
8769f413
KW
1714
1715# 1 => All folds
1716&regcharclass_multi_char_folds::multi_char_folds(1)
1717
40b1ba4f 1718MULTI_CHAR_FOLD: multi-char strings that are folded to by a single character
251b239f 1719=> LATIN1 : safe
8769f413 1720
8769f413 1721&regcharclass_multi_char_folds::multi_char_folds(0)
40b1ba4f 1722# 0 => Latin1-only
0b50d62a 1723
1a27eb96
KW
1724FOLDS_TO_MULTI: characters that fold to multi-char strings
1725=> UTF8 :fast
1726\p{_Perl_Folds_To_Multi_Char}
1727
31f05a37
KW
1728PROBLEMATIC_LOCALE_FOLD : characters whose fold is problematic under locale
1729=> UTF8 cp :fast
1730\p{_Perl_Problematic_Locale_Folds}
1731
1732PROBLEMATIC_LOCALE_FOLDEDS_START : The first folded character of folds which are problematic under locale
1733=> UTF8 cp :fast
1734\p{_Perl_Problematic_Locale_Foldeds_Start}
1735
0b50d62a 1736PATWS: pattern white space
8373491a 1737=> generic cp : safe
099323b4 1738\p{_Perl_PatWS}