This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5
utf8.h: Use machine generated IS_UTF8_CHAR()
[perl5.git] / regen / regcharclass.pl
CommitLineData
e64b1bd1 1package CharClass::Matcher;
12b72891 2use strict;
8770da0e 3use 5.008;
12b72891 4use warnings;
e64b1bd1 5use warnings FATAL => 'all';
12b72891 6use Text::Wrap qw(wrap);
12b72891 7use Data::Dumper;
e64b1bd1
YO
8$Data::Dumper::Useqq= 1;
9our $hex_fmt= "0x%02X";
12b72891 10
295bcca9
KW
11sub ASCII_PLATFORM { (ord('A') == 65) }
12
8770da0e
NC
13require 'regen/regen_lib.pl';
14
ab84f958 15=head1 NAME
0ccab2bc 16
e64b1bd1 17CharClass::Matcher -- Generate C macros that match character classes efficiently
12b72891 18
e64b1bd1
YO
19=head1 SYNOPSIS
20
ab84f958 21 perl Porting/regcharclass.pl
e64b1bd1
YO
22
23=head1 DESCRIPTION
12b72891
RGS
24
25Dynamically generates macros for detecting special charclasses
e64b1bd1 26in latin-1, utf8, and codepoint forms. Macros can be set to return
cc08b31c 27the length (in bytes) of the matched codepoint, and/or the codepoint itself.
12b72891 28
cc08b31c 29To regenerate F<regcharclass.h>, run this script from perl-root. No arguments
12b72891
RGS
30are necessary.
31
cc08b31c
KW
32Using WHATEVER as an example the following macros can be produced, depending
33on the input parameters (how to get each is described by internal comments at
34the C<__DATA__> line):
12b72891
RGS
35
36=over 4
37
cc08b31c 38=item C<is_WHATEVER(s,is_utf8)>
12b72891 39
cc08b31c 40=item C<is_WHATEVER_safe(s,e,is_utf8)>
12b72891 41
cc08b31c
KW
42Do a lookup as appropriate based on the C<is_utf8> flag. When possible
43comparisons involving octect<128 are done before checking the C<is_utf8>
12b72891
RGS
44flag, hopefully saving time.
45
cc08b31c
KW
46The version without the C<_safe> suffix should be used only when the input is
47known to be well-formed.
12b72891 48
cc08b31c
KW
49=item C<is_WHATEVER_utf8(s)>
50
51=item C<is_WHATEVER_utf8_safe(s,e)>
12b72891
RGS
52
53Do a lookup assuming the string is encoded in (normalized) UTF8.
54
cc08b31c
KW
55The version without the C<_safe> suffix should be used only when the input is
56known to be well-formed.
57
58=item C<is_WHATEVER_latin1(s)>
12b72891 59
cc08b31c 60=item C<is_WHATEVER_latin1_safe(s,e)>
12b72891
RGS
61
62Do a lookup assuming the string is encoded in latin-1 (aka plan octets).
63
cc08b31c
KW
64The version without the C<_safe> suffix should be used only when it is known
65that C<s> contains at least one character.
66
67=item C<is_WHATEVER_cp(cp)>
12b72891 68
47e01c32 69Check to see if the string matches a given codepoint (hypothetically a
12b72891
RGS
70U32). The condition is constructed as as to "break out" as early as
71possible if the codepoint is out of range of the condition.
72
73IOW:
74
75 (cp==X || (cp>X && (cp==Y || (cp>Y && ...))))
76
77Thus if the character is X+1 only two comparisons will be done. Making
78matching lookups slower, but non-matching faster.
79
cc08b31c
KW
80=item C<what_len_WHATEVER_FOO(arg1, ..., len)>
81
82A variant form of each of the macro types described above can be generated, in
83which the code point is returned by the macro, and an extra parameter (in the
84final position) is added, which is a pointer for the macro to set the byte
85length of the returned code point.
86
87These forms all have a C<what_len> prefix instead of the C<is_>, for example
88C<what_len_WHATEVER_safe(s,e,is_utf8,len)> and
89C<what_len_WHATEVER_utf8(s,len)>.
90
91These forms should not be used I<except> on small sets of mostly widely
92separated code points; otherwise the code generated is inefficient. For these
93cases, it is best to use the C<is_> forms, and then find the code point with
94C<utf8_to_uvchr_buf>(). This program can fail with a "deep recursion"
95message on the worst of the inappropriate sets. Examine the generated macro
96to see if it is acceptable.
12b72891 97
cc08b31c
KW
98=item C<what_WHATEVER_FOO(arg1, ...)>
99
100A variant form of each of the C<is_> macro types described above can be generated, in
101which the code point and not the length is returned by the macro. These have
102the same caveat as L</what_len_WHATEVER_FOO(arg1, ..., len)>, plus they should
103not be used where the set contains a NULL, as 0 is returned for two different
104cases: a) the set doesn't include the input code point; b) the set does
105include it, and it is a NULL.
106
107=back
e64b1bd1
YO
108
109=head2 CODE FORMAT
110
111perltidy -st -bt=1 -bbt=0 -pt=0 -sbt=1 -ce -nwls== "%f"
112
113
114=head1 AUTHOR
115
cc08b31c 116Author: Yves Orton (demerphq) 2007. Maintained by Perl5 Porters.
e64b1bd1
YO
117
118=head1 BUGS
119
120No tests directly here (although the regex engine will fail tests
121if this code is broken). Insufficient documentation and no Getopts
122handler for using the module as a script.
123
124=head1 LICENSE
125
126You may distribute under the terms of either the GNU General Public
127License or the Artistic License, as specified in the README file.
128
12b72891
RGS
129=cut
130
e64b1bd1
YO
131# Sub naming convention:
132# __func : private subroutine, can not be called as a method
133# _func : private method, not meant for external use
134# func : public method.
135
136# private subs
137#-------------------------------------------------------------------------------
138#
139# ($cp,$n,$l,$u)=__uni_latin($str);
140#
47e01c32 141# Return a list of arrays, each of which when interpreted correctly
e64b1bd1
YO
142# represent the string in some given encoding with specific conditions.
143#
144# $cp - list of codepoints that make up the string.
295bcca9
KW
145# $n - list of octets that make up the string if all codepoints are invariant
146# regardless of if the string is in UTF-8 or not.
e64b1bd1 147# $l - list of octets that make up the string in latin1 encoding if all
295bcca9
KW
148# codepoints < 256, and at least one codepoint is UTF-8 variant.
149# $u - list of octets that make up the string in utf8 if any codepoint is
150# UTF-8 variant
e64b1bd1
YO
151#
152# High CP | Defined
153#-----------+----------
295bcca9 154# 0 - 127 : $n (127/128 are the values for ASCII platforms)
e64b1bd1
YO
155# 128 - 255 : $l, $u
156# 256 - ... : $u
157#
158
159sub __uni_latin1 {
160 my $str= shift;
161 my $max= 0;
162 my @cp;
295bcca9 163 my $only_has_invariants = 1;
e64b1bd1
YO
164 for my $ch ( split //, $str ) {
165 my $cp= ord $ch;
166 push @cp, $cp;
167 $max= $cp if $max < $cp;
295bcca9
KW
168 if (! ASCII_PLATFORM && $only_has_invariants) {
169 if ($cp > 255) {
170 $only_has_invariants = 0;
171 }
172 else {
173 my $temp = chr($cp);
174 utf8::upgrade($temp);
175 my @utf8 = unpack "U0C*", $temp;
176 $only_has_invariants = (@utf8 == 1 && $utf8[0] == $cp);
177 }
178 }
e64b1bd1
YO
179 }
180 my ( $n, $l, $u );
295bcca9
KW
181 $only_has_invariants = $max < 128 if ASCII_PLATFORM;
182 if ($only_has_invariants) {
e64b1bd1
YO
183 $n= [@cp];
184 } else {
185 $l= [@cp] if $max && $max < 256;
186
ca51670f
KW
187 $u= $str;
188 utf8::upgrade($u);
189 $u= [ unpack "U0C*", $u ] if defined $u;
12b72891 190 }
e64b1bd1 191 return ( \@cp, $n, $l, $u );
12b72891
RGS
192}
193
12b72891 194#
e64b1bd1
YO
195# $clean= __clean($expr);
196#
197# Cleanup a ternary expression, removing unnecessary parens and apply some
198# simplifications using regexes.
199#
200
201sub __clean {
202 my ( $expr )= @_;
203 our $parens;
204 $parens= qr/ (?> \( (?> (?: (?> [^()]+ ) | (??{ $parens }) )* ) \) ) /x;
205
206 #print "$parens\n$expr\n";
207 1 while $expr =~ s/ \( \s* ( $parens ) \s* \) /$1/gx;
208 1 while $expr =~ s/ \( \s* ($parens) \s* \? \s*
209 \( \s* ($parens) \s* \? \s* ($parens|[^:]+?) \s* : \s* ($parens|[^)]+?) \s* \)
210 \s* : \s* \4 \s* \)/( ( $1 && $2 ) ? $3 : 0 )/gx;
211 return $expr;
12b72891
RGS
212}
213
e64b1bd1
YO
214#
215# $text= __macro(@args);
216# Join args together by newlines, and then neatly add backslashes to the end
217# of every line as expected by the C pre-processor for #define's.
218#
219
220sub __macro {
221 my $str= join "\n", @_;
222 $str =~ s/\s*$//;
223 my @lines= map { s/\s+$//; s/\t/ /g; $_ } split /\n/, $str;
224 my $last= pop @lines;
225 $str= join "\n", ( map { sprintf "%-76s\\", $_ } @lines ), $last;
226 1 while $str =~ s/^(\t*) {8}/$1\t/gm;
227 return $str . "\n";
12b72891
RGS
228}
229
e64b1bd1
YO
230#
231# my $op=__incrdepth($op);
232#
233# take an 'op' hashref and add one to it and all its childrens depths.
234#
235
236sub __incrdepth {
237 my $op= shift;
238 return unless ref $op;
239 $op->{depth} += 1;
240 __incrdepth( $op->{yes} );
241 __incrdepth( $op->{no} );
242 return $op;
243}
244
245# join two branches of an opcode together with a condition, incrementing
246# the depth on the yes branch when we do so.
247# returns the new root opcode of the tree.
248sub __cond_join {
249 my ( $cond, $yes, $no )= @_;
250 return {
251 test => $cond,
252 yes => __incrdepth( $yes ),
253 no => $no,
254 depth => 0,
255 };
256}
257
258# Methods
259
260# constructor
261#
262# my $obj=CLASS->new(op=>'SOMENAME',title=>'blah',txt=>[..]);
263#
264# Create a new CharClass::Matcher object by parsing the text in
265# the txt array. Currently applies the following rules:
266#
267# Element starts with C<0x>, line is evaled the result treated as
268# a number which is passed to chr().
269#
270# Element starts with C<">, line is evaled and the result treated
271# as a string.
272#
273# Each string is then stored in the 'strs' subhash as a hash record
274# made up of the results of __uni_latin1, using the keynames
b1af8fef 275# 'low','latin1','utf8', as well as the synthesized 'LATIN1', 'high', and
e64b1bd1
YO
276# 'UTF8' which hold a merge of 'low' and their lowercase equivelents.
277#
278# Size data is tracked per type in the 'size' subhash.
279#
280# Return an object
281#
12b72891
RGS
282sub new {
283 my $class= shift;
e64b1bd1
YO
284 my %opt= @_;
285 for ( qw(op txt) ) {
286 die "in " . __PACKAGE__ . " constructor '$_;' is a mandatory field"
287 if !exists $opt{$_};
288 }
289
290 my $self= bless {
291 op => $opt{op},
292 title => $opt{title} || '',
293 }, $class;
294 foreach my $txt ( @{ $opt{txt} } ) {
295 my $str= $txt;
296 if ( $str =~ /^[""]/ ) {
297 $str= eval $str;
05b688d9
KW
298 } elsif ($str =~ / - /x ) { # A range: Replace this element on the
299 # list with its expansion
300 my ($lower, $upper) = $str =~ / 0x (.+?) \s* - \s* 0x (.+) /x;
301 die "Format must be like '0xDEAD - 0xBEAF'; instead was '$str'" if ! defined $lower || ! defined $upper;
302 foreach my $cp (hex $lower .. hex $upper) {
303 push @{$opt{txt}}, sprintf "0x%X", $cp;
304 }
305 next;
295bcca9
KW
306 } elsif ($str =~ s/ ^ N (?= 0x ) //x ) {
307 # Otherwise undocumented, a leading N means is already in the
308 # native character set; don't convert.
e64b1bd1 309 $str= chr eval $str;
295bcca9
KW
310 } elsif ( $str =~ /^0x/ ) {
311 $str= eval $str;
312
313 # Convert from Unicode/ASCII to native, if necessary
314 $str = utf8::unicode_to_native($str) if ! ASCII_PLATFORM
315 && $str <= 0xFF;
316 $str = chr $str;
05b688d9
KW
317 } elsif ( $str =~ / \s* \\p \{ ( .*? ) \} /x) {
318 my $property = $1;
319 use Unicode::UCD qw(prop_invlist);
320
321 my @invlist = prop_invlist($property, '_perl_core_internal_ok');
322 if (! @invlist) {
323
324 # An empty return could mean an unknown property, or merely
325 # that it is empty. Call in scalar context to differentiate
326 my $count = prop_invlist($property, '_perl_core_internal_ok');
327 die "$property not found" unless defined $count;
328 }
329
330 # Replace this element on the list with the property's expansion
331 for (my $i = 0; $i < @invlist; $i += 2) {
332 foreach my $cp ($invlist[$i] .. $invlist[$i+1] - 1) {
295bcca9
KW
333
334 # prop_invlist() returns native values; add leading 'N'
335 # to indicate that.
336 push @{$opt{txt}}, sprintf "N0x%X", $cp;
05b688d9
KW
337 }
338 }
339 next;
12b72891 340 } else {
5e6c6c1e 341 die "Unparsable line: $txt\n";
12b72891 342 }
e64b1bd1
YO
343 my ( $cp, $low, $latin1, $utf8 )= __uni_latin1( $str );
344 my $UTF8= $low || $utf8;
345 my $LATIN1= $low || $latin1;
b1af8fef 346 my $high = (scalar grep { $_ < 256 } @$cp) ? 0 : $utf8;
dda856b2
YO
347 #die Dumper($txt,$cp,$low,$latin1,$utf8)
348 # if $txt=~/NEL/ or $utf8 and @$utf8>3;
e64b1bd1 349
b1af8fef
KW
350 @{ $self->{strs}{$str} }{qw( str txt low utf8 latin1 high cp UTF8 LATIN1 )}=
351 ( $str, $txt, $low, $utf8, $latin1, $high, $cp, $UTF8, $LATIN1 );
e64b1bd1 352 my $rec= $self->{strs}{$str};
b1af8fef 353 foreach my $key ( qw(low utf8 latin1 high cp UTF8 LATIN1) ) {
e64b1bd1
YO
354 $self->{size}{$key}{ 0 + @{ $self->{strs}{$str}{$key} } }++
355 if $self->{strs}{$str}{$key};
12b72891 356 }
e64b1bd1
YO
357 $self->{has_multi} ||= @$cp > 1;
358 $self->{has_ascii} ||= $latin1 && @$latin1;
359 $self->{has_low} ||= $low && @$low;
360 $self->{has_high} ||= !$low && !$latin1;
12b72891 361 }
e64b1bd1
YO
362 $self->{val_fmt}= $hex_fmt;
363 $self->{count}= 0 + keys %{ $self->{strs} };
12b72891
RGS
364 return $self;
365}
366
e64b1bd1 367# my $trie = make_trie($type,$maxlen);
12b72891 368#
47e01c32 369# using the data stored in the object build a trie of a specific type,
e64b1bd1
YO
370# and with specific maximum depth. The trie is made up the elements of
371# the given types array for each string in the object (assuming it is
372# not too long.)
373#
47e01c32 374# returns the trie, or undef if there was no relevant data in the object.
e64b1bd1
YO
375#
376
377sub make_trie {
378 my ( $self, $type, $maxlen )= @_;
379
380 my $strs= $self->{strs};
381 my %trie;
382 foreach my $rec ( values %$strs ) {
383 die "panic: unknown type '$type'"
384 if !exists $rec->{$type};
385 my $dat= $rec->{$type};
386 next unless $dat;
387 next if $maxlen && @$dat > $maxlen;
388 my $node= \%trie;
389 foreach my $elem ( @$dat ) {
390 $node->{$elem} ||= {};
391 $node= $node->{$elem};
12b72891 392 }
e64b1bd1 393 $node->{''}= $rec->{str};
12b72891 394 }
e64b1bd1 395 return 0 + keys( %trie ) ? \%trie : undef;
12b72891
RGS
396}
397
2efb8143
KW
398sub pop_count ($) {
399 my $word = shift;
400
401 # This returns a list of the positions of the bits in the input word that
402 # are 1.
403
404 my @positions;
405 my $position = 0;
406 while ($word) {
407 push @positions, $position if $word & 1;
408 $position++;
409 $word >>= 1;
410 }
411 return @positions;
412}
413
e64b1bd1
YO
414# my $optree= _optree()
415#
416# recursively convert a trie to an optree where every node represents
417# an if else branch.
12b72891 418#
12b72891 419#
12b72891 420
e64b1bd1
YO
421sub _optree {
422 my ( $self, $trie, $test_type, $ret_type, $else, $depth )= @_;
423 return unless defined $trie;
424 if ( $self->{has_multi} and $ret_type =~ /cp|both/ ) {
425 die "Can't do 'cp' optree from multi-codepoint strings";
12b72891 426 }
e64b1bd1
YO
427 $ret_type ||= 'len';
428 $else= 0 unless defined $else;
429 $depth= 0 unless defined $depth;
430
431 my @conds= sort { $a <=> $b } grep { length $_ } keys %$trie;
895e25a5 432 if (exists $trie->{''} ) {
e64b1bd1
YO
433 if ( $ret_type eq 'cp' ) {
434 $else= $self->{strs}{ $trie->{''} }{cp}[0];
435 $else= sprintf "$self->{val_fmt}", $else if $else > 9;
436 } elsif ( $ret_type eq 'len' ) {
437 $else= $depth;
438 } elsif ( $ret_type eq 'both') {
439 $else= $self->{strs}{ $trie->{''} }{cp}[0];
440 $else= sprintf "$self->{val_fmt}", $else if $else > 9;
441 $else= "len=$depth, $else";
12b72891 442 }
e64b1bd1
YO
443 }
444 return $else if !@conds;
445 my $node= {};
446 my $root= $node;
447 my ( $yes_res, $as_code, @cond );
448 my $test= $test_type eq 'cp' ? "cp" : "((U8*)s)[$depth]";
449 my $Update= sub {
450 $node->{vals}= [@cond];
451 $node->{test}= $test;
452 $node->{yes}= $yes_res;
453 $node->{depth}= $depth;
454 $node->{no}= shift;
455 };
456 while ( @conds ) {
457 my $cond= shift @conds;
458 my $res=
459 $self->_optree( $trie->{$cond}, $test_type, $ret_type, $else,
460 $depth + 1 );
461 my $res_code= Dumper( $res );
462 if ( !$yes_res || $res_code ne $as_code ) {
463 if ( $yes_res ) {
464 $Update->( {} );
465 $node= $node->{no};
466 }
467 ( $yes_res, $as_code )= ( $res, $res_code );
468 @cond= ( $cond );
12b72891 469 } else {
e64b1bd1 470 push @cond, $cond;
12b72891
RGS
471 }
472 }
e64b1bd1 473 $Update->( $else );
12b72891
RGS
474 return $root;
475}
476
e64b1bd1
YO
477# my $optree= optree(%opts);
478#
479# Convert a trie to an optree, wrapper for _optree
480
481sub optree {
482 my $self= shift;
483 my %opt= @_;
484 my $trie= $self->make_trie( $opt{type}, $opt{max_depth} );
485 $opt{ret_type} ||= 'len';
486 my $test_type= $opt{type} eq 'cp' ? 'cp' : 'depth';
487 return $self->_optree( $trie, $test_type, $opt{ret_type}, $opt{else}, 0 );
12b72891
RGS
488}
489
e64b1bd1
YO
490# my $optree= generic_optree(%opts);
491#
492# build a "generic" optree out of the three 'low', 'latin1', 'utf8'
493# sets of strings, including a branch for handling the string type check.
494#
495
496sub generic_optree {
497 my $self= shift;
498 my %opt= @_;
499
500 $opt{ret_type} ||= 'len';
501 my $test_type= 'depth';
502 my $else= $opt{else} || 0;
503
504 my $latin1= $self->make_trie( 'latin1', $opt{max_depth} );
505 my $utf8= $self->make_trie( 'utf8', $opt{max_depth} );
506
507 $_= $self->_optree( $_, $test_type, $opt{ret_type}, $else, 0 )
508 for $latin1, $utf8;
509
510 if ( $utf8 ) {
511 $else= __cond_join( "( is_utf8 )", $utf8, $latin1 || $else );
512 } elsif ( $latin1 ) {
513 $else= __cond_join( "!( is_utf8 )", $latin1, $else );
514 }
515 my $low= $self->make_trie( 'low', $opt{max_depth} );
516 if ( $low ) {
517 $else= $self->_optree( $low, $test_type, $opt{ret_type}, $else, 0 );
12b72891 518 }
e64b1bd1
YO
519
520 return $else;
12b72891
RGS
521}
522
e64b1bd1 523# length_optree()
12b72891 524#
e64b1bd1 525# create a string length guarded optree.
12b72891 526#
e64b1bd1
YO
527
528sub length_optree {
529 my $self= shift;
530 my %opt= @_;
531 my $type= $opt{type};
532
533 die "Can't do a length_optree on type 'cp', makes no sense."
534 if $type eq 'cp';
535
536 my ( @size, $method );
537
538 if ( $type eq 'generic' ) {
539 $method= 'generic_optree';
540 my %sizes= (
541 %{ $self->{size}{low} || {} },
542 %{ $self->{size}{latin1} || {} },
543 %{ $self->{size}{utf8} || {} }
544 );
545 @size= sort { $a <=> $b } keys %sizes;
546 } else {
547 $method= 'optree';
548 @size= sort { $a <=> $b } keys %{ $self->{size}{$type} };
12b72891 549 }
e64b1bd1
YO
550
551 my $else= ( $opt{else} ||= 0 );
552 for my $size ( @size ) {
553 my $optree= $self->$method( %opt, type => $type, max_depth => $size );
554 my $cond= "((e)-(s) > " . ( $size - 1 ).")";
555 $else= __cond_join( $cond, $optree, $else );
556 }
557 return $else;
12b72891
RGS
558}
559
2efb8143
KW
560sub calculate_mask(@) {
561 my @list = @_;
562 my $list_count = @list;
563
564 # Look at the input list of byte values. This routine sees if the set
565 # consisting of those bytes is exactly determinable by using a
566 # mask/compare operation. If not, it returns an empty list; if so, it
567 # returns a list consisting of (mask, compare). For example, consider a
568 # set consisting of the numbers 0xF0, 0xF1, 0xF2, and 0xF3. If we want to
569 # know if a number 'c' is in the set, we could write:
570 # 0xF0 <= c && c <= 0xF4
571 # But the following mask/compare also works, and has just one test:
572 # c & 0xFC == 0xF0
573 # The reason it works is that the set consists of exactly those numbers
574 # whose first 4 bits are 1, and the next two are 0. (The value of the
575 # other 2 bits is immaterial in determining if a number is in the set or
576 # not.) The mask masks out those 2 irrelevant bits, and the comparison
577 # makes sure that the result matches all bytes that which match those 6
578 # material bits exactly. In other words, the set of numbers contains
579 # exactly those whose bottom two bit positions are either 0 or 1. The
580 # same principle applies to bit positions that are not necessarily
581 # adjacent. And it can be applied to bytes that differ in 1 through all 8
582 # bit positions. In order to be a candidate for this optimization, the
583 # number of numbers in the test must be a power of 2. Based on this
584 # count, we know the number of bit positions that must differ.
585 my $bit_diff_count = 0;
586 my $compare = $list[0];
587 if ($list_count == 2) {
588 $bit_diff_count = 1;
589 }
590 elsif ($list_count == 4) {
591 $bit_diff_count = 2;
592 }
593 elsif ($list_count == 8) {
594 $bit_diff_count = 3;
595 }
596 elsif ($list_count == 16) {
597 $bit_diff_count = 4;
598 }
599 elsif ($list_count == 32) {
600 $bit_diff_count = 5;
601 }
602 elsif ($list_count == 64) {
603 $bit_diff_count = 6;
604 }
605 elsif ($list_count == 128) {
606 $bit_diff_count = 7;
607 }
608 elsif ($list_count == 256) {
609 return (0, 0);
610 }
611
612 # If the count wasn't a power of 2, we can't apply this optimization
613 return if ! $bit_diff_count;
614
615 my %bit_map;
616
617 # For each byte in the list, find the bit positions in it whose value
618 # differs from the first byte in the set.
619 for (my $i = 1; $i < @list; $i++) {
620 my @positions = pop_count($list[0] ^ $list[$i]);
621
622 # If the number of differing bits is greater than those permitted by
623 # the set size, this optimization doesn't apply.
624 return if @positions > $bit_diff_count;
625
626 # Save the bit positions that differ.
627 foreach my $bit (@positions) {
628 $bit_map{$bit} = 1;
629 }
630
631 # If the total so far is greater than those permitted by the set size,
632 # this optimization doesn't apply.
633 return if keys %bit_map > $bit_diff_count;
634
635
636 # The value to compare against is the AND of all the members of the
637 # set. The bit positions that are the same in all will be correct in
638 # the AND, and the bit positions that differ will be 0.
639 $compare &= $list[$i];
640 }
641
642 # To get to here, we have gone through all bytes in the set,
643 # and determined that they all differ from each other in at most
644 # the number of bits allowed for the set's quantity. And since we have
645 # tested all 2**N possibilities, we know that the set includes no fewer
646 # elements than we need,, so the optimization applies.
647 die "panic: internal logic error" if keys %bit_map != $bit_diff_count;
648
649 # The mask is the bit positions where things differ, complemented.
650 my $mask = 0;
651 foreach my $position (keys %bit_map) {
652 $mask |= 1 << $position;
653 }
654 $mask = ~$mask & 0xFF;
655
656 return ($mask, $compare);
657}
658
e64b1bd1
YO
659# _cond_as_str
660# turn a list of conditions into a text expression
661# - merges ranges of conditions, and joins the result with ||
662sub _cond_as_str {
ba073cf2 663 my ( $self, $op, $combine, $opts_ref )= @_;
e64b1bd1
YO
664 my $cond= $op->{vals};
665 my $test= $op->{test};
2efb8143 666 my $is_cp_ret = $opts_ref->{ret_type} eq "cp";
e64b1bd1
YO
667 return "( $test )" if !defined $cond;
668
f5772832 669 # rangify the list.
e64b1bd1
YO
670 my @ranges;
671 my $Update= sub {
f5772832
KW
672 # We skip this if there are optimizations that
673 # we can apply (below) to the individual ranges
674 if ( ($is_cp_ret || $combine) && @ranges && ref $ranges[-1]) {
e64b1bd1
YO
675 if ( $ranges[-1][0] == $ranges[-1][1] ) {
676 $ranges[-1]= $ranges[-1][0];
677 } elsif ( $ranges[-1][0] + 1 == $ranges[-1][1] ) {
678 $ranges[-1]= $ranges[-1][0];
679 push @ranges, $ranges[-1] + 1;
680 }
681 }
682 };
4a8ca70e
KW
683 for my $condition ( @$cond ) {
684 if ( !@ranges || $condition != $ranges[-1][1] + 1 ) {
e64b1bd1 685 $Update->();
4a8ca70e 686 push @ranges, [ $condition, $condition ];
e64b1bd1
YO
687 } else {
688 $ranges[-1][1]++;
689 }
690 }
691 $Update->();
f5772832 692
e64b1bd1
YO
693 return $self->_combine( $test, @ranges )
694 if $combine;
f5772832
KW
695
696 if ($is_cp_ret) {
1f063c57
KW
697 @ranges= map {
698 ref $_
699 ? sprintf(
700 "( $self->{val_fmt} <= $test && $test <= $self->{val_fmt} )",
701 @$_ )
702 : sprintf( "$self->{val_fmt} == $test", $_ );
703 } @ranges;
f5772832
KW
704 }
705 else {
706 # If the input set has certain characteristics, we can optimize tests
707 # for it. This doesn't apply if returning the code point, as we want
708 # each element of the set individually. The code above is for this
709 # simpler case.
710
711 return 1 if @$cond == 256; # If all bytes match, is trivially true
712
6e130234 713 if (@ranges > 1) {
f5772832 714 # See if the entire set shares optimizable characterstics, and if
6e130234
KW
715 # so, return the optimization. We delay checking for this on sets
716 # with just a single range, as there may be better optimizations
717 # available in that case.
f5772832
KW
718 my ($mask, $base) = calculate_mask(@$cond);
719 if (defined $mask && defined $base) {
720 return sprintf "( ( $test & $self->{val_fmt} ) == $self->{val_fmt} )", $mask, $base;
721 }
6e130234 722 }
f5772832
KW
723
724 # Here, there was no entire-class optimization. Look at each range.
725 for (my $i = 0; $i < @ranges; $i++) {
726 if (! ref $ranges[$i]) { # Trivial case: no range
727 $ranges[$i] = sprintf "$self->{val_fmt} == $test", $ranges[$i];
728 }
729 elsif ($ranges[$i]->[0] == $ranges[$i]->[1]) {
730 $ranges[$i] = # Trivial case: single element range
731 sprintf "$self->{val_fmt} == $test", $ranges[$i]->[0];
732 }
733 else {
734 my $output = "";
735
6e130234
KW
736 # Well-formed UTF-8 continuation bytes on ascii platforms must
737 # be in the range 0x80 .. 0xBF. If we know that the input is
738 # well-formed (indicated by not trying to be 'safe'), we can
739 # omit tests that verify that the input is within either of
740 # these bounds. (No legal UTF-8 character can begin with
741 # anything in this range, so we don't have to worry about this
742 # being a continuation byte or not.)
743 if (ASCII_PLATFORM
744 && ! $opts_ref->{safe}
745 && $opts_ref->{type} =~ / ^ (?: utf8 | high ) $ /xi)
746 {
747 my $lower_limit_is_80 = ($ranges[$i]->[0] == 0x80);
748 my $upper_limit_is_BF = ($ranges[$i]->[1] == 0xBF);
749
750 # If the range is the entire legal range, it matches any
751 # legal byte, so we can omit both tests. (This should
752 # happen only if the number of ranges is 1.)
753 if ($lower_limit_is_80 && $upper_limit_is_BF) {
754 return 1;
755 }
756 elsif ($lower_limit_is_80) { # Just use the upper limit test
757 $output = sprintf("( $test <= $self->{val_fmt} )",
758 $ranges[$i]->[1]);
759 }
760 elsif ($upper_limit_is_BF) { # Just use the lower limit test
761 $output = sprintf("( $test >= $self->{val_fmt} )",
762 $ranges[$i]->[0]);
763 }
764 }
765
766 # If we didn't change to omit a test above, see if the number
767 # of elements is a power of 2 (only a single bit in the
768 # representation of its count will be set) and if so, it may
769 # be that a mask/compare optimization is possible.
770 if ($output eq ""
771 && pop_count($ranges[$i]->[1] - $ranges[$i]->[0] + 1) == 1)
772 {
f5772832
KW
773 my @list;
774 push @list, $_ for ($ranges[$i]->[0] .. $ranges[$i]->[1]);
775 my ($mask, $base) = calculate_mask(@list);
776 if (defined $mask && defined $base) {
777 $output = sprintf "( $test & $self->{val_fmt} ) == $self->{val_fmt}", $mask, $base;
778 }
779 }
780
781 if ($output ne "") { # Prefer any optimization
782 $ranges[$i] = $output;
783 }
784 elsif ($ranges[$i]->[0] + 1 == $ranges[$i]->[1]) {
785 # No optimization happened. We need a test that the code
786 # point is within both bounds. But, if the bounds are
787 # adjacent code points, it is cleaner to say
788 # 'first == test || second == test'
789 # than it is to say
790 # 'first <= test && test <= second'
791 $ranges[$i] = "( "
792 . join( " || ", ( map
793 { sprintf "$self->{val_fmt} == $test", $_ }
794 @{$ranges[$i]} ) )
795 . " )";
796 }
797 else { # Full bounds checking
798 $ranges[$i] = sprintf("( $self->{val_fmt} <= $test && $test <= $self->{val_fmt} )", $ranges[$i]->[0], $ranges[$i]->[1]);
799 }
800 }
801 }
802 }
803
e64b1bd1 804 return "( " . join( " || ", @ranges ) . " )";
f5772832 805
12b72891
RGS
806}
807
e64b1bd1
YO
808# _combine
809# recursively turn a list of conditions into a fast break-out condition
810# used by _cond_as_str() for 'cp' type macros.
811sub _combine {
812 my ( $self, $test, @cond )= @_;
813 return if !@cond;
814 my $item= shift @cond;
815 my ( $cstr, $gtv );
816 if ( ref $item ) {
817 $cstr=
818 sprintf( "( $self->{val_fmt} <= $test && $test <= $self->{val_fmt} )",
819 @$item );
820 $gtv= sprintf "$self->{val_fmt}", $item->[1];
12b72891 821 } else {
e64b1bd1
YO
822 $cstr= sprintf( "$self->{val_fmt} == $test", $item );
823 $gtv= sprintf "$self->{val_fmt}", $item;
12b72891 824 }
e64b1bd1
YO
825 if ( @cond ) {
826 return "( $cstr || ( $gtv < $test &&\n"
827 . $self->_combine( $test, @cond ) . " ) )";
12b72891 828 } else {
e64b1bd1 829 return $cstr;
12b72891 830 }
e64b1bd1 831}
12b72891 832
e64b1bd1
YO
833# _render()
834# recursively convert an optree to text with reasonably neat formatting
835sub _render {
ba073cf2 836 my ( $self, $op, $combine, $brace, $opts_ref )= @_;
2e39f0c2 837 return 0 if ! defined $op; # The set is empty
e64b1bd1
YO
838 if ( !ref $op ) {
839 return $op;
12b72891 840 }
ba073cf2 841 my $cond= $self->_cond_as_str( $op, $combine, $opts_ref );
cc08b31c
KW
842 #no warnings 'recursion'; # This would allow really really inefficient
843 # code to be generated. See pod
ba073cf2 844 my $yes= $self->_render( $op->{yes}, $combine, 1, $opts_ref );
30188af7
KW
845 return $yes if $cond eq '1';
846
ba073cf2 847 my $no= $self->_render( $op->{no}, $combine, 0, $opts_ref );
e64b1bd1
YO
848 return "( $cond )" if $yes eq '1' and $no eq '0';
849 my ( $lb, $rb )= $brace ? ( "( ", " )" ) : ( "", "" );
850 return "$lb$cond ? $yes : $no$rb"
851 if !ref( $op->{yes} ) && !ref( $op->{no} );
852 my $ind1= " " x 4;
853 my $ind= "\n" . ( $ind1 x $op->{depth} );
854
855 if ( ref $op->{yes} ) {
856 $yes= $ind . $ind1 . $yes;
857 } else {
858 $yes= " " . $yes;
859 }
860
861 return "$lb$cond ?$yes$ind: $no$rb";
12b72891 862}
32e6a07c 863
e64b1bd1
YO
864# $expr=render($op,$combine)
865#
866# convert an optree to text with reasonably neat formatting. If $combine
867# is true then the condition is created using "fast breakouts" which
868# produce uglier expressions that are more efficient for common case,
869# longer lists such as that resulting from type 'cp' output.
870# Currently only used for type 'cp' macros.
871sub render {
ba073cf2
KW
872 my ( $self, $op, $combine, $opts_ref )= @_;
873 my $str= "( " . $self->_render( $op, $combine, 0, $opts_ref ) . " )";
e64b1bd1 874 return __clean( $str );
12b72891 875}
e64b1bd1
YO
876
877# make_macro
878# make a macro of a given type.
879# calls into make_trie and (generic_|length_)optree as needed
880# Opts are:
b1af8fef 881# type : 'cp','generic','high','low','latin1','utf8','LATIN1','UTF8'
e64b1bd1
YO
882# ret_type : 'cp' or 'len'
883# safe : add length guards to macro
884#
885# type defaults to 'generic', and ret_type to 'len' unless type is 'cp'
886# in which case it defaults to 'cp' as well.
887#
888# it is illegal to do a type 'cp' macro on a pattern with multi-codepoint
889# sequences in it, as the generated macro will accept only a single codepoint
890# as an argument.
891#
892# returns the macro.
893
894
895sub make_macro {
896 my $self= shift;
897 my %opts= @_;
898 my $type= $opts{type} || 'generic';
899 die "Can't do a 'cp' on multi-codepoint character class '$self->{op}'"
900 if $type eq 'cp'
901 and $self->{has_multi};
902 my $ret_type= $opts{ret_type} || ( $opts{type} eq 'cp' ? 'cp' : 'len' );
903 my $method;
904 if ( $opts{safe} ) {
905 $method= 'length_optree';
906 } elsif ( $type eq 'generic' ) {
907 $method= 'generic_optree';
908 } else {
909 $method= 'optree';
910 }
911 my $optree= $self->$method( %opts, type => $type, ret_type => $ret_type );
ba073cf2 912 my $text= $self->render( $optree, $type eq 'cp', \%opts );
e64b1bd1
YO
913 my @args= $type eq 'cp' ? 'cp' : 's';
914 push @args, "e" if $opts{safe};
915 push @args, "is_utf8" if $type eq 'generic';
916 push @args, "len" if $ret_type eq 'both';
917 my $pfx= $ret_type eq 'both' ? 'what_len_' :
918 $ret_type eq 'cp' ? 'what_' : 'is_';
919 my $ext= $type eq 'generic' ? '' : '_' . lc( $type );
920 $ext .= "_safe" if $opts{safe};
921 my $argstr= join ",", @args;
922 return "/*** GENERATED CODE ***/\n"
923 . __macro( "#define $pfx$self->{op}$ext($argstr)\n$text" );
32e6a07c 924}
e64b1bd1
YO
925
926# if we arent being used as a module (highly likely) then process
927# the __DATA__ below and produce macros in regcharclass.h
928# if an argument is provided to the script then it is assumed to
929# be the path of the file to output to, if the arg is '-' outputs
930# to STDOUT.
931if ( !caller ) {
e64b1bd1 932 $|++;
8770da0e 933 my $path= shift @ARGV || "regcharclass.h";
e64b1bd1
YO
934 my $out_fh;
935 if ( $path eq '-' ) {
936 $out_fh= \*STDOUT;
937 } else {
29c22b52 938 $out_fh = open_new( $path );
e64b1bd1 939 }
8770da0e
NC
940 print $out_fh read_only_top( lang => 'C', by => $0,
941 file => 'regcharclass.h', style => '*',
2eee27d7 942 copyright => [2007, 2011] );
d10c72f2 943 print $out_fh "\n#ifndef H_REGCHARCLASS /* Guard against nested #includes */\n#define H_REGCHARCLASS 1\n\n";
12b72891 944
bb949220 945 my ( $op, $title, @txt, @types, %mods );
e64b1bd1
YO
946 my $doit= sub {
947 return unless $op;
ae1d4929
KW
948
949 # Skip if to compile on a different platform.
950 return if delete $mods{only_ascii_platform} && ! ASCII_PLATFORM;
951 return if delete $mods{only_ebcdic_platform} && ord 'A' != 193;
952
e64b1bd1
YO
953 print $out_fh "/*\n\t$op: $title\n\n";
954 print $out_fh join "\n", ( map { "\t$_" } @txt ), "*/", "";
955 my $obj= __PACKAGE__->new( op => $op, title => $title, txt => \@txt );
956
bb949220
KW
957 #die Dumper(\@types,\%mods);
958
959 my @mods;
960 push @mods, 'safe' if delete $mods{safe};
961 unshift @mods, 'fast' if delete $mods{fast} || ! @mods; # Default to 'fast'
962 # do this one
963 # first, as
964 # traditional
965 if (%mods) {
966 die "Unknown modifiers: ", join ", ", map { "'$_'" } keys %mods;
967 }
e64b1bd1
YO
968
969 foreach my $type_spec ( @types ) {
970 my ( $type, $ret )= split /-/, $type_spec;
971 $ret ||= 'len';
972 foreach my $mod ( @mods ) {
973 next if $mod eq 'safe' and $type eq 'cp';
bb949220 974 delete $mods{$mod};
e64b1bd1
YO
975 my $macro= $obj->make_macro(
976 type => $type,
977 ret_type => $ret,
978 safe => $mod eq 'safe'
979 );
980 print $out_fh $macro, "\n";
981 }
32e6a07c 982 }
e64b1bd1
YO
983 };
984
985 while ( <DATA> ) {
5e6c6c1e 986 s/^ \s* (?: \# .* ) ? $ //x; # squeeze out comment and blanks
e64b1bd1
YO
987 next unless /\S/;
988 chomp;
989 if ( /^([A-Z]+)/ ) {
cc08b31c 990 $doit->(); # This starts a new definition; do the previous one
e64b1bd1
YO
991 ( $op, $title )= split /\s*:\s*/, $_, 2;
992 @txt= ();
993 } elsif ( s/^=>// ) {
994 my ( $type, $modifier )= split /:/, $_;
995 @types= split ' ', $type;
bb949220
KW
996 undef %mods;
997 map { $mods{$_} = 1 } split ' ', $modifier;
e64b1bd1
YO
998 } else {
999 push @txt, "$_";
12b72891
RGS
1000 }
1001 }
e64b1bd1 1002 $doit->();
d10c72f2
KW
1003
1004 print $out_fh "\n#endif /* H_REGCHARCLASS */\n";
1005
8770da0e
NC
1006 if($path eq '-') {
1007 print $out_fh "/* ex: set ro: */\n";
1008 } else {
1009 read_only_bottom_close_and_rename($out_fh)
1010 }
12b72891 1011}
e64b1bd1 1012
cc08b31c
KW
1013# The form of the input is a series of definitions to make macros for.
1014# The first line gives the base name of the macro, followed by a colon, and
1015# then text to be used in comments associated with the macro that are its
1016# title or description. In all cases the first (perhaps only) parameter to
1017# the macro is a pointer to the first byte of the code point it is to test to
1018# see if it is in the class determined by the macro. In the case of non-UTF8,
1019# the code point consists only of a single byte.
1020#
1021# The second line must begin with a '=>' and be followed by the types of
1022# macro(s) to be generated; these are specified below. A colon follows the
1023# types, followed by the modifiers, also specified below. At least one
1024# modifier is required.
1025#
1026# The subsequent lines give what code points go into the class defined by the
1027# macro. Multiple characters may be specified via a string like "\x0D\x0A",
1028# enclosed in quotes. Otherwise the lines consist of single Unicode code
1029# point, prefaced by 0x; or a single range of Unicode code points separated by
1030# a minus (and optional space); or a single Unicode property specified in the
1031# standard Perl form "\p{...}".
1032#
1033# A blank line or one whose first non-blank character is '#' is a comment.
1034# The definition of the macro is terminated by a line unlike those described.
1035#
1036# Valid types:
1037# low generate a macro whose name is 'is_BASE_low' and defines a
1038# class that includes only ASCII-range chars. (BASE is the
1039# input macro base name.)
1040# latin1 generate a macro whose name is 'is_BASE_latin1' and defines a
1041# class that includes only upper-Latin1-range chars. It is not
1042# designed to take a UTF-8 input parameter.
b1af8fef
KW
1043# high generate a macro whose name is 'is_BASE_high' and defines a
1044# class that includes all relevant code points that are above
1045# the Latin1 range. This is for very specialized uses only.
1046# It is designed to take only an input UTF-8 parameter.
cc08b31c
KW
1047# utf8 generate a macro whose name is 'is_BASE_utf8' and defines a
1048# class that includes all relevant characters that aren't ASCII.
1049# It is designed to take only an input UTF-8 parameter.
1050# LATIN1 generate a macro whose name is 'is_BASE_latin1' and defines a
1051# class that includes both ASCII and upper-Latin1-range chars.
1052# It is not designed to take a UTF-8 input parameter.
1053# UTF8 generate a macro whose name is 'is_BASE_utf8' and defines a
1054# class that can include any code point, adding the 'low' ones
1055# to what 'utf8' works on. It is designed to take only an input
1056# UTF-8 parameter.
1057# generic generate a macro whose name is 'is_BASE". It has a 2nd,
1058# boolean, parameter which indicates if the first one points to
1059# a UTF-8 string or not. Thus it works in all circumstances.
1060# cp generate a macro whose name is 'is_BASE_cp' and defines a
1061# class that returns true if the UV parameter is a member of the
1062# class; false if not.
1063# A macro of the given type is generated for each type listed in the input.
1064# The default return value is the number of octets read to generate the match.
1065# Append "-cp" to the type to have it instead return the matched codepoint.
1066# The macro name is changed to 'what_BASE...'. See pod for
1067# caveats
1068# Appending '-both" instead adds an extra parameter to the end of the argument
1069# list, which is a pointer as to where to store the number of
1070# bytes matched, while also returning the code point. The macro
1071# name is changed to 'what_len_BASE...'. See pod for caveats
1072#
1073# Valid modifiers:
1074# safe The input string is not necessarily valid UTF-8. In
1075# particular an extra parameter (always the 2nd) to the macro is
1076# required, which points to one beyond the end of the string.
1077# The macro will make sure not to read off the end of the
1078# string. In the case of non-UTF8, it makes sure that the
1079# string has at least one byte in it. The macro name has
1080# '_safe' appended to it.
1081# fast The input string is valid UTF-8. No bounds checking is done,
1082# and the macro can make assumptions that lead to faster
1083# execution.
ae1d4929
KW
1084# only_ascii_platform Skip this definition if this program is being run on
1085# a non-ASCII platform.
1086# only_ebcdic_platform Skip this definition if this program is being run on
1087# a non-EBCDIC platform.
cc08b31c
KW
1088# No modifier need be specified; fast is assumed for this case. If both
1089# 'fast', and 'safe' are specified, two macros will be created for each
1090# 'type'.
e90ac8de 1091#
295bcca9 1092# If run on a non-ASCII platform will automatically convert the Unicode input
cc08b31c
KW
1093# to native. The documentation above is slightly wrong in this case. 'low'
1094# actually refers to code points whose UTF-8 representation is the same as the
1095# non-UTF-8 version (invariants); and 'latin1' refers to all the rest of the
1096# code points less than 256.
5e6c6c1e
KW
1097
10981; # in the unlikely case we are being used as a module
1099
1100__DATA__
1101# This is no longer used, but retained in case it is needed some day.
e90ac8de
KW
1102# TRICKYFOLD: Problematic fold case letters. When adding to this list, also should add them to regcomp.c and fold_grind.t
1103# => generic cp generic-cp generic-both :fast safe
1104# 0x00DF # LATIN SMALL LETTER SHARP S
1105# 0x0390 # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
1106# 0x03B0 # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
1107# 0x1E9E # LATIN CAPITAL LETTER SHARP S, because maps to same as 00DF
1108# 0x1FD3 # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA; maps same as 0390
1109# 0x1FE3 # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA; maps same as 03B0
1110
12b72891 1111LNBREAK: Line Break: \R
e64b1bd1 1112=> generic UTF8 LATIN1 :fast safe
12b72891 1113"\x0D\x0A" # CRLF - Network (Windows) line ending
05b688d9 1114\p{VertSpace}
12b72891
RGS
1115
1116HORIZWS: Horizontal Whitespace: \h \H
e64b1bd1 1117=> generic UTF8 LATIN1 cp :fast safe
05b688d9 1118\p{HorizSpace}
12b72891
RGS
1119
1120VERTWS: Vertical Whitespace: \v \V
e64b1bd1 1121=> generic UTF8 LATIN1 cp :fast safe
05b688d9 1122\p{VertSpace}
612ead59 1123
b96a92fb
KW
1124REPLACEMENT: Unicode REPLACEMENT CHARACTER
1125=> UTF8 :safe
11260xFFFD
1127
1128NONCHAR: Non character code points
1129=> UTF8 :fast
1130\p{Nchar}
1131
1132SURROGATE: Surrogate characters
1133=> UTF8 :fast
1134\p{Gc=Cs}
1135
612ead59
KW
1136GCB_L: Grapheme_Cluster_Break=L
1137=> UTF8 :fast
1138\p{_X_GCB_L}
1139
1140GCB_LV_LVT_V: Grapheme_Cluster_Break=(LV or LVT or V)
1141=> UTF8 :fast
1142\p{_X_LV_LVT_V}
1143
1144GCB_Prepend: Grapheme_Cluster_Break=Prepend
1145=> UTF8 :fast
1146\p{_X_GCB_Prepend}
1147
1148GCB_RI: Grapheme_Cluster_Break=RI
1149=> UTF8 :fast
1150\p{_X_RI}
1151
1152GCB_SPECIAL_BEGIN: Grapheme_Cluster_Break=special_begins
1153=> UTF8 :fast
1154\p{_X_Special_Begin}
1155
1156GCB_T: Grapheme_Cluster_Break=T
1157=> UTF8 :fast
1158\p{_X_GCB_T}
1159
1160GCB_V: Grapheme_Cluster_Break=V
1161=> UTF8 :fast
1162\p{_X_GCB_V}
685289b5 1163
4d646140
KW
1164# This program was run with this enabled, and the results copied to utf8.h;
1165# then this was commented out because it takes so long to figure out these 2
1166# million code points. The results would not change unless utf8.h decides it
1167# wants a maximum other than 4 bytes, or this program creates better
1168# optimizations
1169#UTF8_CHAR: Matches utf8 from 1 to 4 bytes
1170#=> UTF8 :safe only_ascii_platform
1171#0x0 - 0x1FFFFF
1172
1173# This hasn't been commented out, because we haven't an EBCDIC platform to run
1174# it on, and the 3 types of EBCDIC allegedly supported by Perl would have
1175# different results
1176UTF8_CHAR: Matches utf8 from 1 to 5 bytes
1177=> UTF8 :safe only_ebcdic_platform
11780x0 - 0x3FFFFF:
1179
685289b5
KW
1180QUOTEMETA: Meta-characters that \Q should quote
1181=> high :fast
1182\p{_Perl_Quotemeta}