This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5
Use macro not swash for utf8 quotemeta
[perl5.git] / regen / regcharclass.pl
CommitLineData
e64b1bd1 1package CharClass::Matcher;
12b72891 2use strict;
8770da0e 3use 5.008;
12b72891 4use warnings;
e64b1bd1 5use warnings FATAL => 'all';
12b72891 6use Text::Wrap qw(wrap);
12b72891 7use Data::Dumper;
e64b1bd1
YO
8$Data::Dumper::Useqq= 1;
9our $hex_fmt= "0x%02X";
12b72891 10
295bcca9
KW
11sub ASCII_PLATFORM { (ord('A') == 65) }
12
8770da0e
NC
13require 'regen/regen_lib.pl';
14
ab84f958 15=head1 NAME
0ccab2bc 16
e64b1bd1 17CharClass::Matcher -- Generate C macros that match character classes efficiently
12b72891 18
e64b1bd1
YO
19=head1 SYNOPSIS
20
ab84f958 21 perl Porting/regcharclass.pl
e64b1bd1
YO
22
23=head1 DESCRIPTION
12b72891
RGS
24
25Dynamically generates macros for detecting special charclasses
e64b1bd1 26in latin-1, utf8, and codepoint forms. Macros can be set to return
cc08b31c 27the length (in bytes) of the matched codepoint, and/or the codepoint itself.
12b72891 28
cc08b31c 29To regenerate F<regcharclass.h>, run this script from perl-root. No arguments
12b72891
RGS
30are necessary.
31
cc08b31c
KW
32Using WHATEVER as an example the following macros can be produced, depending
33on the input parameters (how to get each is described by internal comments at
34the C<__DATA__> line):
12b72891
RGS
35
36=over 4
37
cc08b31c 38=item C<is_WHATEVER(s,is_utf8)>
12b72891 39
cc08b31c 40=item C<is_WHATEVER_safe(s,e,is_utf8)>
12b72891 41
cc08b31c
KW
42Do a lookup as appropriate based on the C<is_utf8> flag. When possible
43comparisons involving octect<128 are done before checking the C<is_utf8>
12b72891
RGS
44flag, hopefully saving time.
45
cc08b31c
KW
46The version without the C<_safe> suffix should be used only when the input is
47known to be well-formed.
12b72891 48
cc08b31c
KW
49=item C<is_WHATEVER_utf8(s)>
50
51=item C<is_WHATEVER_utf8_safe(s,e)>
12b72891
RGS
52
53Do a lookup assuming the string is encoded in (normalized) UTF8.
54
cc08b31c
KW
55The version without the C<_safe> suffix should be used only when the input is
56known to be well-formed.
57
58=item C<is_WHATEVER_latin1(s)>
12b72891 59
cc08b31c 60=item C<is_WHATEVER_latin1_safe(s,e)>
12b72891
RGS
61
62Do a lookup assuming the string is encoded in latin-1 (aka plan octets).
63
cc08b31c
KW
64The version without the C<_safe> suffix should be used only when it is known
65that C<s> contains at least one character.
66
67=item C<is_WHATEVER_cp(cp)>
12b72891 68
47e01c32 69Check to see if the string matches a given codepoint (hypothetically a
12b72891
RGS
70U32). The condition is constructed as as to "break out" as early as
71possible if the codepoint is out of range of the condition.
72
73IOW:
74
75 (cp==X || (cp>X && (cp==Y || (cp>Y && ...))))
76
77Thus if the character is X+1 only two comparisons will be done. Making
78matching lookups slower, but non-matching faster.
79
cc08b31c
KW
80=item C<what_len_WHATEVER_FOO(arg1, ..., len)>
81
82A variant form of each of the macro types described above can be generated, in
83which the code point is returned by the macro, and an extra parameter (in the
84final position) is added, which is a pointer for the macro to set the byte
85length of the returned code point.
86
87These forms all have a C<what_len> prefix instead of the C<is_>, for example
88C<what_len_WHATEVER_safe(s,e,is_utf8,len)> and
89C<what_len_WHATEVER_utf8(s,len)>.
90
91These forms should not be used I<except> on small sets of mostly widely
92separated code points; otherwise the code generated is inefficient. For these
93cases, it is best to use the C<is_> forms, and then find the code point with
94C<utf8_to_uvchr_buf>(). This program can fail with a "deep recursion"
95message on the worst of the inappropriate sets. Examine the generated macro
96to see if it is acceptable.
12b72891 97
cc08b31c
KW
98=item C<what_WHATEVER_FOO(arg1, ...)>
99
100A variant form of each of the C<is_> macro types described above can be generated, in
101which the code point and not the length is returned by the macro. These have
102the same caveat as L</what_len_WHATEVER_FOO(arg1, ..., len)>, plus they should
103not be used where the set contains a NULL, as 0 is returned for two different
104cases: a) the set doesn't include the input code point; b) the set does
105include it, and it is a NULL.
106
107=back
e64b1bd1
YO
108
109=head2 CODE FORMAT
110
111perltidy -st -bt=1 -bbt=0 -pt=0 -sbt=1 -ce -nwls== "%f"
112
113
114=head1 AUTHOR
115
cc08b31c 116Author: Yves Orton (demerphq) 2007. Maintained by Perl5 Porters.
e64b1bd1
YO
117
118=head1 BUGS
119
120No tests directly here (although the regex engine will fail tests
121if this code is broken). Insufficient documentation and no Getopts
122handler for using the module as a script.
123
124=head1 LICENSE
125
126You may distribute under the terms of either the GNU General Public
127License or the Artistic License, as specified in the README file.
128
12b72891
RGS
129=cut
130
e64b1bd1
YO
131# Sub naming convention:
132# __func : private subroutine, can not be called as a method
133# _func : private method, not meant for external use
134# func : public method.
135
136# private subs
137#-------------------------------------------------------------------------------
138#
139# ($cp,$n,$l,$u)=__uni_latin($str);
140#
47e01c32 141# Return a list of arrays, each of which when interpreted correctly
e64b1bd1
YO
142# represent the string in some given encoding with specific conditions.
143#
144# $cp - list of codepoints that make up the string.
295bcca9
KW
145# $n - list of octets that make up the string if all codepoints are invariant
146# regardless of if the string is in UTF-8 or not.
e64b1bd1 147# $l - list of octets that make up the string in latin1 encoding if all
295bcca9
KW
148# codepoints < 256, and at least one codepoint is UTF-8 variant.
149# $u - list of octets that make up the string in utf8 if any codepoint is
150# UTF-8 variant
e64b1bd1
YO
151#
152# High CP | Defined
153#-----------+----------
295bcca9 154# 0 - 127 : $n (127/128 are the values for ASCII platforms)
e64b1bd1
YO
155# 128 - 255 : $l, $u
156# 256 - ... : $u
157#
158
159sub __uni_latin1 {
160 my $str= shift;
161 my $max= 0;
162 my @cp;
295bcca9 163 my $only_has_invariants = 1;
e64b1bd1
YO
164 for my $ch ( split //, $str ) {
165 my $cp= ord $ch;
166 push @cp, $cp;
167 $max= $cp if $max < $cp;
295bcca9
KW
168 if (! ASCII_PLATFORM && $only_has_invariants) {
169 if ($cp > 255) {
170 $only_has_invariants = 0;
171 }
172 else {
173 my $temp = chr($cp);
174 utf8::upgrade($temp);
175 my @utf8 = unpack "U0C*", $temp;
176 $only_has_invariants = (@utf8 == 1 && $utf8[0] == $cp);
177 }
178 }
e64b1bd1
YO
179 }
180 my ( $n, $l, $u );
295bcca9
KW
181 $only_has_invariants = $max < 128 if ASCII_PLATFORM;
182 if ($only_has_invariants) {
e64b1bd1
YO
183 $n= [@cp];
184 } else {
185 $l= [@cp] if $max && $max < 256;
186
ca51670f
KW
187 $u= $str;
188 utf8::upgrade($u);
189 $u= [ unpack "U0C*", $u ] if defined $u;
12b72891 190 }
e64b1bd1 191 return ( \@cp, $n, $l, $u );
12b72891
RGS
192}
193
12b72891 194#
e64b1bd1
YO
195# $clean= __clean($expr);
196#
197# Cleanup a ternary expression, removing unnecessary parens and apply some
198# simplifications using regexes.
199#
200
201sub __clean {
202 my ( $expr )= @_;
203 our $parens;
204 $parens= qr/ (?> \( (?> (?: (?> [^()]+ ) | (??{ $parens }) )* ) \) ) /x;
205
206 #print "$parens\n$expr\n";
207 1 while $expr =~ s/ \( \s* ( $parens ) \s* \) /$1/gx;
208 1 while $expr =~ s/ \( \s* ($parens) \s* \? \s*
209 \( \s* ($parens) \s* \? \s* ($parens|[^:]+?) \s* : \s* ($parens|[^)]+?) \s* \)
210 \s* : \s* \4 \s* \)/( ( $1 && $2 ) ? $3 : 0 )/gx;
211 return $expr;
12b72891
RGS
212}
213
e64b1bd1
YO
214#
215# $text= __macro(@args);
216# Join args together by newlines, and then neatly add backslashes to the end
217# of every line as expected by the C pre-processor for #define's.
218#
219
220sub __macro {
221 my $str= join "\n", @_;
222 $str =~ s/\s*$//;
223 my @lines= map { s/\s+$//; s/\t/ /g; $_ } split /\n/, $str;
224 my $last= pop @lines;
225 $str= join "\n", ( map { sprintf "%-76s\\", $_ } @lines ), $last;
226 1 while $str =~ s/^(\t*) {8}/$1\t/gm;
227 return $str . "\n";
12b72891
RGS
228}
229
e64b1bd1
YO
230#
231# my $op=__incrdepth($op);
232#
233# take an 'op' hashref and add one to it and all its childrens depths.
234#
235
236sub __incrdepth {
237 my $op= shift;
238 return unless ref $op;
239 $op->{depth} += 1;
240 __incrdepth( $op->{yes} );
241 __incrdepth( $op->{no} );
242 return $op;
243}
244
245# join two branches of an opcode together with a condition, incrementing
246# the depth on the yes branch when we do so.
247# returns the new root opcode of the tree.
248sub __cond_join {
249 my ( $cond, $yes, $no )= @_;
250 return {
251 test => $cond,
252 yes => __incrdepth( $yes ),
253 no => $no,
254 depth => 0,
255 };
256}
257
258# Methods
259
260# constructor
261#
262# my $obj=CLASS->new(op=>'SOMENAME',title=>'blah',txt=>[..]);
263#
264# Create a new CharClass::Matcher object by parsing the text in
265# the txt array. Currently applies the following rules:
266#
267# Element starts with C<0x>, line is evaled the result treated as
268# a number which is passed to chr().
269#
270# Element starts with C<">, line is evaled and the result treated
271# as a string.
272#
273# Each string is then stored in the 'strs' subhash as a hash record
274# made up of the results of __uni_latin1, using the keynames
b1af8fef 275# 'low','latin1','utf8', as well as the synthesized 'LATIN1', 'high', and
e64b1bd1
YO
276# 'UTF8' which hold a merge of 'low' and their lowercase equivelents.
277#
278# Size data is tracked per type in the 'size' subhash.
279#
280# Return an object
281#
12b72891
RGS
282sub new {
283 my $class= shift;
e64b1bd1
YO
284 my %opt= @_;
285 for ( qw(op txt) ) {
286 die "in " . __PACKAGE__ . " constructor '$_;' is a mandatory field"
287 if !exists $opt{$_};
288 }
289
290 my $self= bless {
291 op => $opt{op},
292 title => $opt{title} || '',
293 }, $class;
294 foreach my $txt ( @{ $opt{txt} } ) {
295 my $str= $txt;
296 if ( $str =~ /^[""]/ ) {
297 $str= eval $str;
05b688d9
KW
298 } elsif ($str =~ / - /x ) { # A range: Replace this element on the
299 # list with its expansion
300 my ($lower, $upper) = $str =~ / 0x (.+?) \s* - \s* 0x (.+) /x;
301 die "Format must be like '0xDEAD - 0xBEAF'; instead was '$str'" if ! defined $lower || ! defined $upper;
302 foreach my $cp (hex $lower .. hex $upper) {
303 push @{$opt{txt}}, sprintf "0x%X", $cp;
304 }
305 next;
295bcca9
KW
306 } elsif ($str =~ s/ ^ N (?= 0x ) //x ) {
307 # Otherwise undocumented, a leading N means is already in the
308 # native character set; don't convert.
e64b1bd1 309 $str= chr eval $str;
295bcca9
KW
310 } elsif ( $str =~ /^0x/ ) {
311 $str= eval $str;
312
313 # Convert from Unicode/ASCII to native, if necessary
314 $str = utf8::unicode_to_native($str) if ! ASCII_PLATFORM
315 && $str <= 0xFF;
316 $str = chr $str;
05b688d9
KW
317 } elsif ( $str =~ / \s* \\p \{ ( .*? ) \} /x) {
318 my $property = $1;
319 use Unicode::UCD qw(prop_invlist);
320
321 my @invlist = prop_invlist($property, '_perl_core_internal_ok');
322 if (! @invlist) {
323
324 # An empty return could mean an unknown property, or merely
325 # that it is empty. Call in scalar context to differentiate
326 my $count = prop_invlist($property, '_perl_core_internal_ok');
327 die "$property not found" unless defined $count;
328 }
329
330 # Replace this element on the list with the property's expansion
331 for (my $i = 0; $i < @invlist; $i += 2) {
332 foreach my $cp ($invlist[$i] .. $invlist[$i+1] - 1) {
295bcca9
KW
333
334 # prop_invlist() returns native values; add leading 'N'
335 # to indicate that.
336 push @{$opt{txt}}, sprintf "N0x%X", $cp;
05b688d9
KW
337 }
338 }
339 next;
12b72891 340 } else {
5e6c6c1e 341 die "Unparsable line: $txt\n";
12b72891 342 }
e64b1bd1
YO
343 my ( $cp, $low, $latin1, $utf8 )= __uni_latin1( $str );
344 my $UTF8= $low || $utf8;
345 my $LATIN1= $low || $latin1;
b1af8fef 346 my $high = (scalar grep { $_ < 256 } @$cp) ? 0 : $utf8;
dda856b2
YO
347 #die Dumper($txt,$cp,$low,$latin1,$utf8)
348 # if $txt=~/NEL/ or $utf8 and @$utf8>3;
e64b1bd1 349
b1af8fef
KW
350 @{ $self->{strs}{$str} }{qw( str txt low utf8 latin1 high cp UTF8 LATIN1 )}=
351 ( $str, $txt, $low, $utf8, $latin1, $high, $cp, $UTF8, $LATIN1 );
e64b1bd1 352 my $rec= $self->{strs}{$str};
b1af8fef 353 foreach my $key ( qw(low utf8 latin1 high cp UTF8 LATIN1) ) {
e64b1bd1
YO
354 $self->{size}{$key}{ 0 + @{ $self->{strs}{$str}{$key} } }++
355 if $self->{strs}{$str}{$key};
12b72891 356 }
e64b1bd1
YO
357 $self->{has_multi} ||= @$cp > 1;
358 $self->{has_ascii} ||= $latin1 && @$latin1;
359 $self->{has_low} ||= $low && @$low;
360 $self->{has_high} ||= !$low && !$latin1;
12b72891 361 }
e64b1bd1
YO
362 $self->{val_fmt}= $hex_fmt;
363 $self->{count}= 0 + keys %{ $self->{strs} };
12b72891
RGS
364 return $self;
365}
366
e64b1bd1 367# my $trie = make_trie($type,$maxlen);
12b72891 368#
47e01c32 369# using the data stored in the object build a trie of a specific type,
e64b1bd1
YO
370# and with specific maximum depth. The trie is made up the elements of
371# the given types array for each string in the object (assuming it is
372# not too long.)
373#
47e01c32 374# returns the trie, or undef if there was no relevant data in the object.
e64b1bd1
YO
375#
376
377sub make_trie {
378 my ( $self, $type, $maxlen )= @_;
379
380 my $strs= $self->{strs};
381 my %trie;
382 foreach my $rec ( values %$strs ) {
383 die "panic: unknown type '$type'"
384 if !exists $rec->{$type};
385 my $dat= $rec->{$type};
386 next unless $dat;
387 next if $maxlen && @$dat > $maxlen;
388 my $node= \%trie;
389 foreach my $elem ( @$dat ) {
390 $node->{$elem} ||= {};
391 $node= $node->{$elem};
12b72891 392 }
e64b1bd1 393 $node->{''}= $rec->{str};
12b72891 394 }
e64b1bd1 395 return 0 + keys( %trie ) ? \%trie : undef;
12b72891
RGS
396}
397
e64b1bd1
YO
398# my $optree= _optree()
399#
400# recursively convert a trie to an optree where every node represents
401# an if else branch.
12b72891 402#
12b72891 403#
12b72891 404
e64b1bd1
YO
405sub _optree {
406 my ( $self, $trie, $test_type, $ret_type, $else, $depth )= @_;
407 return unless defined $trie;
408 if ( $self->{has_multi} and $ret_type =~ /cp|both/ ) {
409 die "Can't do 'cp' optree from multi-codepoint strings";
12b72891 410 }
e64b1bd1
YO
411 $ret_type ||= 'len';
412 $else= 0 unless defined $else;
413 $depth= 0 unless defined $depth;
414
415 my @conds= sort { $a <=> $b } grep { length $_ } keys %$trie;
895e25a5 416 if (exists $trie->{''} ) {
e64b1bd1
YO
417 if ( $ret_type eq 'cp' ) {
418 $else= $self->{strs}{ $trie->{''} }{cp}[0];
419 $else= sprintf "$self->{val_fmt}", $else if $else > 9;
420 } elsif ( $ret_type eq 'len' ) {
421 $else= $depth;
422 } elsif ( $ret_type eq 'both') {
423 $else= $self->{strs}{ $trie->{''} }{cp}[0];
424 $else= sprintf "$self->{val_fmt}", $else if $else > 9;
425 $else= "len=$depth, $else";
12b72891 426 }
e64b1bd1
YO
427 }
428 return $else if !@conds;
429 my $node= {};
430 my $root= $node;
431 my ( $yes_res, $as_code, @cond );
432 my $test= $test_type eq 'cp' ? "cp" : "((U8*)s)[$depth]";
433 my $Update= sub {
434 $node->{vals}= [@cond];
435 $node->{test}= $test;
436 $node->{yes}= $yes_res;
437 $node->{depth}= $depth;
438 $node->{no}= shift;
439 };
440 while ( @conds ) {
441 my $cond= shift @conds;
442 my $res=
443 $self->_optree( $trie->{$cond}, $test_type, $ret_type, $else,
444 $depth + 1 );
445 my $res_code= Dumper( $res );
446 if ( !$yes_res || $res_code ne $as_code ) {
447 if ( $yes_res ) {
448 $Update->( {} );
449 $node= $node->{no};
450 }
451 ( $yes_res, $as_code )= ( $res, $res_code );
452 @cond= ( $cond );
12b72891 453 } else {
e64b1bd1 454 push @cond, $cond;
12b72891
RGS
455 }
456 }
e64b1bd1 457 $Update->( $else );
12b72891
RGS
458 return $root;
459}
460
e64b1bd1
YO
461# my $optree= optree(%opts);
462#
463# Convert a trie to an optree, wrapper for _optree
464
465sub optree {
466 my $self= shift;
467 my %opt= @_;
468 my $trie= $self->make_trie( $opt{type}, $opt{max_depth} );
469 $opt{ret_type} ||= 'len';
470 my $test_type= $opt{type} eq 'cp' ? 'cp' : 'depth';
471 return $self->_optree( $trie, $test_type, $opt{ret_type}, $opt{else}, 0 );
12b72891
RGS
472}
473
e64b1bd1
YO
474# my $optree= generic_optree(%opts);
475#
476# build a "generic" optree out of the three 'low', 'latin1', 'utf8'
477# sets of strings, including a branch for handling the string type check.
478#
479
480sub generic_optree {
481 my $self= shift;
482 my %opt= @_;
483
484 $opt{ret_type} ||= 'len';
485 my $test_type= 'depth';
486 my $else= $opt{else} || 0;
487
488 my $latin1= $self->make_trie( 'latin1', $opt{max_depth} );
489 my $utf8= $self->make_trie( 'utf8', $opt{max_depth} );
490
491 $_= $self->_optree( $_, $test_type, $opt{ret_type}, $else, 0 )
492 for $latin1, $utf8;
493
494 if ( $utf8 ) {
495 $else= __cond_join( "( is_utf8 )", $utf8, $latin1 || $else );
496 } elsif ( $latin1 ) {
497 $else= __cond_join( "!( is_utf8 )", $latin1, $else );
498 }
499 my $low= $self->make_trie( 'low', $opt{max_depth} );
500 if ( $low ) {
501 $else= $self->_optree( $low, $test_type, $opt{ret_type}, $else, 0 );
12b72891 502 }
e64b1bd1
YO
503
504 return $else;
12b72891
RGS
505}
506
e64b1bd1 507# length_optree()
12b72891 508#
e64b1bd1 509# create a string length guarded optree.
12b72891 510#
e64b1bd1
YO
511
512sub length_optree {
513 my $self= shift;
514 my %opt= @_;
515 my $type= $opt{type};
516
517 die "Can't do a length_optree on type 'cp', makes no sense."
518 if $type eq 'cp';
519
520 my ( @size, $method );
521
522 if ( $type eq 'generic' ) {
523 $method= 'generic_optree';
524 my %sizes= (
525 %{ $self->{size}{low} || {} },
526 %{ $self->{size}{latin1} || {} },
527 %{ $self->{size}{utf8} || {} }
528 );
529 @size= sort { $a <=> $b } keys %sizes;
530 } else {
531 $method= 'optree';
532 @size= sort { $a <=> $b } keys %{ $self->{size}{$type} };
12b72891 533 }
e64b1bd1
YO
534
535 my $else= ( $opt{else} ||= 0 );
536 for my $size ( @size ) {
537 my $optree= $self->$method( %opt, type => $type, max_depth => $size );
538 my $cond= "((e)-(s) > " . ( $size - 1 ).")";
539 $else= __cond_join( $cond, $optree, $else );
540 }
541 return $else;
12b72891
RGS
542}
543
e64b1bd1
YO
544# _cond_as_str
545# turn a list of conditions into a text expression
546# - merges ranges of conditions, and joins the result with ||
547sub _cond_as_str {
548 my ( $self, $op, $combine )= @_;
549 my $cond= $op->{vals};
550 my $test= $op->{test};
551 return "( $test )" if !defined $cond;
552
553 # rangify the list
554 my @ranges;
555 my $Update= sub {
556 if ( @ranges ) {
557 if ( $ranges[-1][0] == $ranges[-1][1] ) {
558 $ranges[-1]= $ranges[-1][0];
559 } elsif ( $ranges[-1][0] + 1 == $ranges[-1][1] ) {
560 $ranges[-1]= $ranges[-1][0];
561 push @ranges, $ranges[-1] + 1;
562 }
563 }
564 };
565 for my $cond ( @$cond ) {
566 if ( !@ranges || $cond != $ranges[-1][1] + 1 ) {
567 $Update->();
568 push @ranges, [ $cond, $cond ];
569 } else {
570 $ranges[-1][1]++;
571 }
572 }
573 $Update->();
574 return $self->_combine( $test, @ranges )
575 if $combine;
576 @ranges= map {
577 ref $_
578 ? sprintf(
579 "( $self->{val_fmt} <= $test && $test <= $self->{val_fmt} )",
580 @$_ )
581 : sprintf( "$self->{val_fmt} == $test", $_ );
582 } @ranges;
583 return "( " . join( " || ", @ranges ) . " )";
12b72891
RGS
584}
585
e64b1bd1
YO
586# _combine
587# recursively turn a list of conditions into a fast break-out condition
588# used by _cond_as_str() for 'cp' type macros.
589sub _combine {
590 my ( $self, $test, @cond )= @_;
591 return if !@cond;
592 my $item= shift @cond;
593 my ( $cstr, $gtv );
594 if ( ref $item ) {
595 $cstr=
596 sprintf( "( $self->{val_fmt} <= $test && $test <= $self->{val_fmt} )",
597 @$item );
598 $gtv= sprintf "$self->{val_fmt}", $item->[1];
12b72891 599 } else {
e64b1bd1
YO
600 $cstr= sprintf( "$self->{val_fmt} == $test", $item );
601 $gtv= sprintf "$self->{val_fmt}", $item;
12b72891 602 }
e64b1bd1
YO
603 if ( @cond ) {
604 return "( $cstr || ( $gtv < $test &&\n"
605 . $self->_combine( $test, @cond ) . " ) )";
12b72891 606 } else {
e64b1bd1 607 return $cstr;
12b72891 608 }
e64b1bd1 609}
12b72891 610
e64b1bd1
YO
611# _render()
612# recursively convert an optree to text with reasonably neat formatting
613sub _render {
614 my ( $self, $op, $combine, $brace )= @_;
2e39f0c2 615 return 0 if ! defined $op; # The set is empty
e64b1bd1
YO
616 if ( !ref $op ) {
617 return $op;
12b72891 618 }
e64b1bd1 619 my $cond= $self->_cond_as_str( $op, $combine );
cc08b31c
KW
620 #no warnings 'recursion'; # This would allow really really inefficient
621 # code to be generated. See pod
e64b1bd1
YO
622 my $yes= $self->_render( $op->{yes}, $combine, 1 );
623 my $no= $self->_render( $op->{no}, $combine, 0 );
624 return "( $cond )" if $yes eq '1' and $no eq '0';
625 my ( $lb, $rb )= $brace ? ( "( ", " )" ) : ( "", "" );
626 return "$lb$cond ? $yes : $no$rb"
627 if !ref( $op->{yes} ) && !ref( $op->{no} );
628 my $ind1= " " x 4;
629 my $ind= "\n" . ( $ind1 x $op->{depth} );
630
631 if ( ref $op->{yes} ) {
632 $yes= $ind . $ind1 . $yes;
633 } else {
634 $yes= " " . $yes;
635 }
636
637 return "$lb$cond ?$yes$ind: $no$rb";
12b72891 638}
32e6a07c 639
e64b1bd1
YO
640# $expr=render($op,$combine)
641#
642# convert an optree to text with reasonably neat formatting. If $combine
643# is true then the condition is created using "fast breakouts" which
644# produce uglier expressions that are more efficient for common case,
645# longer lists such as that resulting from type 'cp' output.
646# Currently only used for type 'cp' macros.
647sub render {
648 my ( $self, $op, $combine )= @_;
649 my $str= "( " . $self->_render( $op, $combine ) . " )";
650 return __clean( $str );
12b72891 651}
e64b1bd1
YO
652
653# make_macro
654# make a macro of a given type.
655# calls into make_trie and (generic_|length_)optree as needed
656# Opts are:
b1af8fef 657# type : 'cp','generic','high','low','latin1','utf8','LATIN1','UTF8'
e64b1bd1
YO
658# ret_type : 'cp' or 'len'
659# safe : add length guards to macro
660#
661# type defaults to 'generic', and ret_type to 'len' unless type is 'cp'
662# in which case it defaults to 'cp' as well.
663#
664# it is illegal to do a type 'cp' macro on a pattern with multi-codepoint
665# sequences in it, as the generated macro will accept only a single codepoint
666# as an argument.
667#
668# returns the macro.
669
670
671sub make_macro {
672 my $self= shift;
673 my %opts= @_;
674 my $type= $opts{type} || 'generic';
675 die "Can't do a 'cp' on multi-codepoint character class '$self->{op}'"
676 if $type eq 'cp'
677 and $self->{has_multi};
678 my $ret_type= $opts{ret_type} || ( $opts{type} eq 'cp' ? 'cp' : 'len' );
679 my $method;
680 if ( $opts{safe} ) {
681 $method= 'length_optree';
682 } elsif ( $type eq 'generic' ) {
683 $method= 'generic_optree';
684 } else {
685 $method= 'optree';
686 }
687 my $optree= $self->$method( %opts, type => $type, ret_type => $ret_type );
688 my $text= $self->render( $optree, $type eq 'cp' );
689 my @args= $type eq 'cp' ? 'cp' : 's';
690 push @args, "e" if $opts{safe};
691 push @args, "is_utf8" if $type eq 'generic';
692 push @args, "len" if $ret_type eq 'both';
693 my $pfx= $ret_type eq 'both' ? 'what_len_' :
694 $ret_type eq 'cp' ? 'what_' : 'is_';
695 my $ext= $type eq 'generic' ? '' : '_' . lc( $type );
696 $ext .= "_safe" if $opts{safe};
697 my $argstr= join ",", @args;
698 return "/*** GENERATED CODE ***/\n"
699 . __macro( "#define $pfx$self->{op}$ext($argstr)\n$text" );
32e6a07c 700}
e64b1bd1
YO
701
702# if we arent being used as a module (highly likely) then process
703# the __DATA__ below and produce macros in regcharclass.h
704# if an argument is provided to the script then it is assumed to
705# be the path of the file to output to, if the arg is '-' outputs
706# to STDOUT.
707if ( !caller ) {
e64b1bd1 708 $|++;
8770da0e 709 my $path= shift @ARGV || "regcharclass.h";
e64b1bd1
YO
710 my $out_fh;
711 if ( $path eq '-' ) {
712 $out_fh= \*STDOUT;
713 } else {
29c22b52 714 $out_fh = open_new( $path );
e64b1bd1 715 }
8770da0e
NC
716 print $out_fh read_only_top( lang => 'C', by => $0,
717 file => 'regcharclass.h', style => '*',
2eee27d7 718 copyright => [2007, 2011] );
d10c72f2 719 print $out_fh "\n#ifndef H_REGCHARCLASS /* Guard against nested #includes */\n#define H_REGCHARCLASS 1\n\n";
12b72891 720
bb949220 721 my ( $op, $title, @txt, @types, %mods );
e64b1bd1
YO
722 my $doit= sub {
723 return unless $op;
724 print $out_fh "/*\n\t$op: $title\n\n";
725 print $out_fh join "\n", ( map { "\t$_" } @txt ), "*/", "";
726 my $obj= __PACKAGE__->new( op => $op, title => $title, txt => \@txt );
727
bb949220
KW
728 #die Dumper(\@types,\%mods);
729
730 my @mods;
731 push @mods, 'safe' if delete $mods{safe};
732 unshift @mods, 'fast' if delete $mods{fast} || ! @mods; # Default to 'fast'
733 # do this one
734 # first, as
735 # traditional
736 if (%mods) {
737 die "Unknown modifiers: ", join ", ", map { "'$_'" } keys %mods;
738 }
e64b1bd1
YO
739
740 foreach my $type_spec ( @types ) {
741 my ( $type, $ret )= split /-/, $type_spec;
742 $ret ||= 'len';
743 foreach my $mod ( @mods ) {
744 next if $mod eq 'safe' and $type eq 'cp';
bb949220 745 delete $mods{$mod};
e64b1bd1
YO
746 my $macro= $obj->make_macro(
747 type => $type,
748 ret_type => $ret,
749 safe => $mod eq 'safe'
750 );
751 print $out_fh $macro, "\n";
752 }
32e6a07c 753 }
e64b1bd1
YO
754 };
755
756 while ( <DATA> ) {
5e6c6c1e 757 s/^ \s* (?: \# .* ) ? $ //x; # squeeze out comment and blanks
e64b1bd1
YO
758 next unless /\S/;
759 chomp;
760 if ( /^([A-Z]+)/ ) {
cc08b31c 761 $doit->(); # This starts a new definition; do the previous one
e64b1bd1
YO
762 ( $op, $title )= split /\s*:\s*/, $_, 2;
763 @txt= ();
764 } elsif ( s/^=>// ) {
765 my ( $type, $modifier )= split /:/, $_;
766 @types= split ' ', $type;
bb949220
KW
767 undef %mods;
768 map { $mods{$_} = 1 } split ' ', $modifier;
e64b1bd1
YO
769 } else {
770 push @txt, "$_";
12b72891
RGS
771 }
772 }
e64b1bd1 773 $doit->();
d10c72f2
KW
774
775 print $out_fh "\n#endif /* H_REGCHARCLASS */\n";
776
8770da0e
NC
777 if($path eq '-') {
778 print $out_fh "/* ex: set ro: */\n";
779 } else {
780 read_only_bottom_close_and_rename($out_fh)
781 }
12b72891 782}
e64b1bd1 783
cc08b31c
KW
784# The form of the input is a series of definitions to make macros for.
785# The first line gives the base name of the macro, followed by a colon, and
786# then text to be used in comments associated with the macro that are its
787# title or description. In all cases the first (perhaps only) parameter to
788# the macro is a pointer to the first byte of the code point it is to test to
789# see if it is in the class determined by the macro. In the case of non-UTF8,
790# the code point consists only of a single byte.
791#
792# The second line must begin with a '=>' and be followed by the types of
793# macro(s) to be generated; these are specified below. A colon follows the
794# types, followed by the modifiers, also specified below. At least one
795# modifier is required.
796#
797# The subsequent lines give what code points go into the class defined by the
798# macro. Multiple characters may be specified via a string like "\x0D\x0A",
799# enclosed in quotes. Otherwise the lines consist of single Unicode code
800# point, prefaced by 0x; or a single range of Unicode code points separated by
801# a minus (and optional space); or a single Unicode property specified in the
802# standard Perl form "\p{...}".
803#
804# A blank line or one whose first non-blank character is '#' is a comment.
805# The definition of the macro is terminated by a line unlike those described.
806#
807# Valid types:
808# low generate a macro whose name is 'is_BASE_low' and defines a
809# class that includes only ASCII-range chars. (BASE is the
810# input macro base name.)
811# latin1 generate a macro whose name is 'is_BASE_latin1' and defines a
812# class that includes only upper-Latin1-range chars. It is not
813# designed to take a UTF-8 input parameter.
b1af8fef
KW
814# high generate a macro whose name is 'is_BASE_high' and defines a
815# class that includes all relevant code points that are above
816# the Latin1 range. This is for very specialized uses only.
817# It is designed to take only an input UTF-8 parameter.
cc08b31c
KW
818# utf8 generate a macro whose name is 'is_BASE_utf8' and defines a
819# class that includes all relevant characters that aren't ASCII.
820# It is designed to take only an input UTF-8 parameter.
821# LATIN1 generate a macro whose name is 'is_BASE_latin1' and defines a
822# class that includes both ASCII and upper-Latin1-range chars.
823# It is not designed to take a UTF-8 input parameter.
824# UTF8 generate a macro whose name is 'is_BASE_utf8' and defines a
825# class that can include any code point, adding the 'low' ones
826# to what 'utf8' works on. It is designed to take only an input
827# UTF-8 parameter.
828# generic generate a macro whose name is 'is_BASE". It has a 2nd,
829# boolean, parameter which indicates if the first one points to
830# a UTF-8 string or not. Thus it works in all circumstances.
831# cp generate a macro whose name is 'is_BASE_cp' and defines a
832# class that returns true if the UV parameter is a member of the
833# class; false if not.
834# A macro of the given type is generated for each type listed in the input.
835# The default return value is the number of octets read to generate the match.
836# Append "-cp" to the type to have it instead return the matched codepoint.
837# The macro name is changed to 'what_BASE...'. See pod for
838# caveats
839# Appending '-both" instead adds an extra parameter to the end of the argument
840# list, which is a pointer as to where to store the number of
841# bytes matched, while also returning the code point. The macro
842# name is changed to 'what_len_BASE...'. See pod for caveats
843#
844# Valid modifiers:
845# safe The input string is not necessarily valid UTF-8. In
846# particular an extra parameter (always the 2nd) to the macro is
847# required, which points to one beyond the end of the string.
848# The macro will make sure not to read off the end of the
849# string. In the case of non-UTF8, it makes sure that the
850# string has at least one byte in it. The macro name has
851# '_safe' appended to it.
852# fast The input string is valid UTF-8. No bounds checking is done,
853# and the macro can make assumptions that lead to faster
854# execution.
855# No modifier need be specified; fast is assumed for this case. If both
856# 'fast', and 'safe' are specified, two macros will be created for each
857# 'type'.
e90ac8de 858#
295bcca9 859# If run on a non-ASCII platform will automatically convert the Unicode input
cc08b31c
KW
860# to native. The documentation above is slightly wrong in this case. 'low'
861# actually refers to code points whose UTF-8 representation is the same as the
862# non-UTF-8 version (invariants); and 'latin1' refers to all the rest of the
863# code points less than 256.
5e6c6c1e
KW
864
8651; # in the unlikely case we are being used as a module
866
867__DATA__
868# This is no longer used, but retained in case it is needed some day.
e90ac8de
KW
869# TRICKYFOLD: Problematic fold case letters. When adding to this list, also should add them to regcomp.c and fold_grind.t
870# => generic cp generic-cp generic-both :fast safe
871# 0x00DF # LATIN SMALL LETTER SHARP S
872# 0x0390 # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
873# 0x03B0 # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
874# 0x1E9E # LATIN CAPITAL LETTER SHARP S, because maps to same as 00DF
875# 0x1FD3 # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA; maps same as 0390
876# 0x1FE3 # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA; maps same as 03B0
877
12b72891 878LNBREAK: Line Break: \R
e64b1bd1 879=> generic UTF8 LATIN1 :fast safe
12b72891 880"\x0D\x0A" # CRLF - Network (Windows) line ending
05b688d9 881\p{VertSpace}
12b72891
RGS
882
883HORIZWS: Horizontal Whitespace: \h \H
e64b1bd1 884=> generic UTF8 LATIN1 cp :fast safe
05b688d9 885\p{HorizSpace}
12b72891
RGS
886
887VERTWS: Vertical Whitespace: \v \V
e64b1bd1 888=> generic UTF8 LATIN1 cp :fast safe
05b688d9 889\p{VertSpace}
612ead59
KW
890
891GCB_L: Grapheme_Cluster_Break=L
892=> UTF8 :fast
893\p{_X_GCB_L}
894
895GCB_LV_LVT_V: Grapheme_Cluster_Break=(LV or LVT or V)
896=> UTF8 :fast
897\p{_X_LV_LVT_V}
898
899GCB_Prepend: Grapheme_Cluster_Break=Prepend
900=> UTF8 :fast
901\p{_X_GCB_Prepend}
902
903GCB_RI: Grapheme_Cluster_Break=RI
904=> UTF8 :fast
905\p{_X_RI}
906
907GCB_SPECIAL_BEGIN: Grapheme_Cluster_Break=special_begins
908=> UTF8 :fast
909\p{_X_Special_Begin}
910
911GCB_T: Grapheme_Cluster_Break=T
912=> UTF8 :fast
913\p{_X_GCB_T}
914
915GCB_V: Grapheme_Cluster_Break=V
916=> UTF8 :fast
917\p{_X_GCB_V}
685289b5
KW
918
919QUOTEMETA: Meta-characters that \Q should quote
920=> high :fast
921\p{_Perl_Quotemeta}