This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5
Unicode::UCD::prop_invmap(): New improved API
[perl5.git] / lib / Unicode / UCD.pm
CommitLineData
55d7b906 1package Unicode::UCD;
561c79ed
JH
2
3use strict;
4use warnings;
36c2430c 5no warnings 'surrogate'; # surrogates can be inputs to this
98ef7649 6use charnames ();
94c91ffc 7use Unicode::Normalize qw(getCombinClass NFD);
561c79ed 8
47f2dcf2 9our $VERSION = '0.41';
561c79ed 10
741297c1
JH
11use Storable qw(dclone);
12
561c79ed
JH
13require Exporter;
14
15our @ISA = qw(Exporter);
74f8133e 16
10a6ecd2
JH
17our @EXPORT_OK = qw(charinfo
18 charblock charscript
19 charblocks charscripts
b08cd201 20 charinrange
ea508aee 21 general_categories bidi_types
b08cd201 22 compexcl
a2bd7410 23 casefold casespec
7319f91d
KW
24 namedseq
25 num
7ef25837
KW
26 prop_aliases
27 prop_value_aliases
681d705c 28 prop_invlist
62b3b855 29 prop_invmap
681d705c 30 MAX_CP
7319f91d 31 );
561c79ed
JH
32
33use Carp;
34
35=head1 NAME
36
55d7b906 37Unicode::UCD - Unicode character database
561c79ed
JH
38
39=head1 SYNOPSIS
40
55d7b906 41 use Unicode::UCD 'charinfo';
b08cd201 42 my $charinfo = charinfo($codepoint);
561c79ed 43
956cae9a
KW
44 use Unicode::UCD 'casefold';
45 my $casefold = casefold(0xFB00);
46
5d8e6e41
KW
47 use Unicode::UCD 'casespec';
48 my $casespec = casespec(0xFB00);
49
55d7b906 50 use Unicode::UCD 'charblock';
e882dd67
JH
51 my $charblock = charblock($codepoint);
52
55d7b906 53 use Unicode::UCD 'charscript';
65044554 54 my $charscript = charscript($codepoint);
561c79ed 55
55d7b906 56 use Unicode::UCD 'charblocks';
e145285f
JH
57 my $charblocks = charblocks();
58
55d7b906 59 use Unicode::UCD 'charscripts';
ea508aee 60 my $charscripts = charscripts();
e145285f 61
55d7b906 62 use Unicode::UCD qw(charscript charinrange);
e145285f
JH
63 my $range = charscript($script);
64 print "looks like $script\n" if charinrange($range, $codepoint);
65
ea508aee
JH
66 use Unicode::UCD qw(general_categories bidi_types);
67 my $categories = general_categories();
68 my $types = bidi_types();
69
7ef25837
KW
70 use Unicode::UCD 'prop_aliases';
71 my @space_names = prop_aliases("space");
72
73 use Unicode::UCD 'prop_value_aliases';
74 my @gc_punct_names = prop_value_aliases("Gc", "Punct");
75
681d705c
KW
76 use Unicode::UCD 'prop_invlist';
77 my @puncts = prop_invlist("gc=punctuation");
78
62b3b855
KW
79 use Unicode::UCD 'prop_invmap';
80 my ($list_ref, $map_ref, $format, $missing)
81 = prop_invmap("General Category");
82
55d7b906 83 use Unicode::UCD 'compexcl';
e145285f
JH
84 my $compexcl = compexcl($codepoint);
85
a2bd7410
JH
86 use Unicode::UCD 'namedseq';
87 my $namedseq = namedseq($named_sequence_name);
88
55d7b906 89 my $unicode_version = Unicode::UCD::UnicodeVersion();
e145285f 90
7319f91d 91 my $convert_to_numeric =
62a8c8c2 92 Unicode::UCD::num("\N{RUMI DIGIT ONE}\N{RUMI DIGIT TWO}");
7319f91d 93
561c79ed
JH
94=head1 DESCRIPTION
95
a452d459
KW
96The Unicode::UCD module offers a series of functions that
97provide a simple interface to the Unicode
8b731da2 98Character Database.
561c79ed 99
a452d459
KW
100=head2 code point argument
101
102Some of the functions are called with a I<code point argument>, which is either
103a decimal or a hexadecimal scalar designating a Unicode code point, or C<U+>
104followed by hexadecimals designating a Unicode code point. In other words, if
105you want a code point to be interpreted as a hexadecimal number, you must
106prefix it with either C<0x> or C<U+>, because a string like e.g. C<123> will be
f200dd12
KW
107interpreted as a decimal code point. Note that the largest code point in
108Unicode is U+10FFFF.
c3e5bc54 109
561c79ed
JH
110=cut
111
10a6ecd2 112my $BLOCKSFH;
10a6ecd2 113my $VERSIONFH;
b08cd201
JH
114my $CASEFOLDFH;
115my $CASESPECFH;
a2bd7410 116my $NAMEDSEQFH;
561c79ed
JH
117
118sub openunicode {
119 my ($rfh, @path) = @_;
120 my $f;
121 unless (defined $$rfh) {
122 for my $d (@INC) {
123 use File::Spec;
55d7b906 124 $f = File::Spec->catfile($d, "unicore", @path);
32c16050 125 last if open($$rfh, $f);
e882dd67 126 undef $f;
561c79ed 127 }
e882dd67
JH
128 croak __PACKAGE__, ": failed to find ",
129 File::Spec->catfile(@path), " in @INC"
130 unless defined $f;
561c79ed
JH
131 }
132 return $f;
133}
134
a452d459 135=head2 B<charinfo()>
561c79ed 136
55d7b906 137 use Unicode::UCD 'charinfo';
561c79ed 138
b08cd201 139 my $charinfo = charinfo(0x41);
561c79ed 140
a452d459
KW
141This returns information about the input L</code point argument>
142as a reference to a hash of fields as defined by the Unicode
143standard. If the L</code point argument> is not assigned in the standard
144(i.e., has the general category C<Cn> meaning C<Unassigned>)
145or is a non-character (meaning it is guaranteed to never be assigned in
146the standard),
a18e976f 147C<undef> is returned.
a452d459
KW
148
149Fields that aren't applicable to the particular code point argument exist in the
150returned hash, and are empty.
151
152The keys in the hash with the meanings of their values are:
153
154=over
155
156=item B<code>
157
158the input L</code point argument> expressed in hexadecimal, with leading zeros
159added if necessary to make it contain at least four hexdigits
160
161=item B<name>
162
163name of I<code>, all IN UPPER CASE.
164Some control-type code points do not have names.
165This field will be empty for C<Surrogate> and C<Private Use> code points,
166and for the others without a name,
167it will contain a description enclosed in angle brackets, like
168C<E<lt>controlE<gt>>.
169
170
171=item B<category>
172
173The short name of the general category of I<code>.
174This will match one of the keys in the hash returned by L</general_categories()>.
175
7ef25837
KW
176The L</prop_value_aliases()> function can be used to get all the synonyms
177of the category name.
178
a452d459
KW
179=item B<combining>
180
181the combining class number for I<code> used in the Canonical Ordering Algorithm.
182For Unicode 5.1, this is described in Section 3.11 C<Canonical Ordering Behavior>
183available at
184L<http://www.unicode.org/versions/Unicode5.1.0/>
185
7ef25837
KW
186The L</prop_value_aliases()> function can be used to get all the synonyms
187of the combining class number.
188
a452d459
KW
189=item B<bidi>
190
191bidirectional type of I<code>.
192This will match one of the keys in the hash returned by L</bidi_types()>.
193
7ef25837
KW
194The L</prop_value_aliases()> function can be used to get all the synonyms
195of the bidi type name.
196
a452d459
KW
197=item B<decomposition>
198
199is empty if I<code> has no decomposition; or is one or more codes
a18e976f 200(separated by spaces) that, taken in order, represent a decomposition for
a452d459
KW
201I<code>. Each has at least four hexdigits.
202The codes may be preceded by a word enclosed in angle brackets then a space,
203like C<E<lt>compatE<gt> >, giving the type of decomposition
204
06bba7d5
KW
205This decomposition may be an intermediate one whose components are also
206decomposable. Use L<Unicode::Normalize> to get the final decomposition.
207
a452d459
KW
208=item B<decimal>
209
210if I<code> is a decimal digit this is its integer numeric value
211
212=item B<digit>
213
89e4a205
KW
214if I<code> represents some other digit-like number, this is its integer
215numeric value
a452d459
KW
216
217=item B<numeric>
218
219if I<code> represents a whole or rational number, this is its numeric value.
220Rational values are expressed as a string like C<1/4>.
221
222=item B<mirrored>
223
224C<Y> or C<N> designating if I<code> is mirrored in bidirectional text
225
226=item B<unicode10>
227
228name of I<code> in the Unicode 1.0 standard if one
229existed for this code point and is different from the current name
230
231=item B<comment>
232
89e4a205 233As of Unicode 6.0, this is always empty.
a452d459
KW
234
235=item B<upper>
236
06bba7d5 237is empty if there is no single code point uppercase mapping for I<code>
4f66642e 238(its uppercase mapping is itself);
a452d459
KW
239otherwise it is that mapping expressed as at least four hexdigits.
240(L</casespec()> should be used in addition to B<charinfo()>
241for case mappings when the calling program can cope with multiple code point
242mappings.)
243
244=item B<lower>
245
06bba7d5 246is empty if there is no single code point lowercase mapping for I<code>
4f66642e 247(its lowercase mapping is itself);
a452d459
KW
248otherwise it is that mapping expressed as at least four hexdigits.
249(L</casespec()> should be used in addition to B<charinfo()>
250for case mappings when the calling program can cope with multiple code point
251mappings.)
252
253=item B<title>
254
06bba7d5 255is empty if there is no single code point titlecase mapping for I<code>
4f66642e 256(its titlecase mapping is itself);
a452d459
KW
257otherwise it is that mapping expressed as at least four hexdigits.
258(L</casespec()> should be used in addition to B<charinfo()>
259for case mappings when the calling program can cope with multiple code point
260mappings.)
261
262=item B<block>
263
a18e976f 264the block I<code> belongs to (used in C<\p{Blk=...}>).
a452d459
KW
265See L</Blocks versus Scripts>.
266
267
268=item B<script>
269
a18e976f 270the script I<code> belongs to.
a452d459
KW
271See L</Blocks versus Scripts>.
272
273=back
32c16050
JH
274
275Note that you cannot do (de)composition and casing based solely on the
a452d459
KW
276I<decomposition>, I<combining>, I<lower>, I<upper>, and I<title> fields;
277you will need also the L</compexcl()>, and L</casespec()> functions.
561c79ed
JH
278
279=cut
280
e10d7780 281# NB: This function is nearly duplicated in charnames.pm
10a6ecd2
JH
282sub _getcode {
283 my $arg = shift;
284
dc0a4417 285 if ($arg =~ /^[1-9]\d*$/) {
10a6ecd2 286 return $arg;
dc0a4417 287 } elsif ($arg =~ /^(?:[Uu]\+|0[xX])?([[:xdigit:]]+)$/) {
10a6ecd2
JH
288 return hex($1);
289 }
290
291 return;
292}
293
05dbc6f8
KW
294# Populated by _num. Converts real number back to input rational
295my %real_to_rational;
296
297# To store the contents of files found on disk.
298my @BIDIS;
299my @CATEGORIES;
300my @DECOMPOSITIONS;
301my @NUMERIC_TYPES;
5c3b35c9
KW
302my %SIMPLE_LOWER;
303my %SIMPLE_TITLE;
304my %SIMPLE_UPPER;
305my %UNICODE_1_NAMES;
05dbc6f8 306
05dbc6f8 307sub charinfo {
a6fa416b 308
05dbc6f8
KW
309 # This function has traditionally mimicked what is in UnicodeData.txt,
310 # warts and all. This is a re-write that avoids UnicodeData.txt so that
311 # it can be removed to save disk space. Instead, this assembles
312 # information gotten by other methods that get data from various other
313 # files. It uses charnames to get the character name; and various
314 # mktables tables.
324f9e44 315
05dbc6f8 316 use feature 'unicode_strings';
a6fa416b 317
10a6ecd2
JH
318 my $arg = shift;
319 my $code = _getcode($arg);
05dbc6f8
KW
320 croak __PACKAGE__, "::charinfo: unknown code '$arg'" unless defined $code;
321
322 # Non-unicode implies undef.
323 return if $code > 0x10FFFF;
324
325 my %prop;
326 my $char = chr($code);
327
35a865d4 328 @CATEGORIES =_read_table("To/Gc.pl") unless @CATEGORIES;
05dbc6f8
KW
329 $prop{'category'} = _search(\@CATEGORIES, 0, $#CATEGORIES, $code)
330 // $utf8::SwashInfo{'ToGc'}{'missing'};
331
332 return if $prop{'category'} eq 'Cn'; # Unassigned code points are undef
333
334 $prop{'code'} = sprintf "%04X", $code;
335 $prop{'name'} = ($char =~ /\p{Cntrl}/) ? '<control>'
336 : (charnames::viacode($code) // "");
337
338 $prop{'combining'} = getCombinClass($code);
339
35a865d4 340 @BIDIS =_read_table("To/Bc.pl") unless @BIDIS;
05dbc6f8
KW
341 $prop{'bidi'} = _search(\@BIDIS, 0, $#BIDIS, $code)
342 // $utf8::SwashInfo{'ToBc'}{'missing'};
343
344 # For most code points, we can just read in "unicore/Decomposition.pl", as
345 # its contents are exactly what should be output. But that file doesn't
346 # contain the data for the Hangul syllable decompositions, which can be
94c91ffc
KW
347 # algorithmically computed, and NFD() does that, so we call NFD() for
348 # those. We can't use NFD() for everything, as it does a complete
05dbc6f8 349 # recursive decomposition, and what this function has always done is to
94c91ffc
KW
350 # return what's in UnicodeData.txt which doesn't show that recursiveness.
351 # Fortunately, the NFD() of the Hanguls doesn't have any recursion
352 # issues.
353 # Having no decomposition implies an empty field; otherwise, all but
354 # "Canonical" imply a compatible decomposition, and the type is prefixed
355 # to that, as it is in UnicodeData.txt
05dbc6f8
KW
356 if ($char =~ /\p{Block=Hangul_Syllables}/) {
357 # The code points of the decomposition are output in standard Unicode
358 # hex format, separated by blanks.
359 $prop{'decomposition'} = join " ", map { sprintf("%04X", $_)}
94c91ffc 360 unpack "U*", NFD($char);
a6fa416b 361 }
05dbc6f8 362 else {
35a865d4 363 @DECOMPOSITIONS = _read_table("Decomposition.pl")
05dbc6f8
KW
364 unless @DECOMPOSITIONS;
365 $prop{'decomposition'} = _search(\@DECOMPOSITIONS, 0, $#DECOMPOSITIONS,
366 $code) // "";
561c79ed 367 }
05dbc6f8
KW
368
369 # Can use num() to get the numeric values, if any.
370 if (! defined (my $value = num($char))) {
371 $prop{'decimal'} = $prop{'digit'} = $prop{'numeric'} = "";
372 }
373 else {
374 if ($char =~ /\d/) {
375 $prop{'decimal'} = $prop{'digit'} = $prop{'numeric'} = $value;
376 }
377 else {
378
379 # For non-decimal-digits, we have to read in the Numeric type
380 # to distinguish them. It is not just a matter of integer vs.
381 # rational, as some whole number values are not considered digits,
382 # e.g., TAMIL NUMBER TEN.
383 $prop{'decimal'} = "";
384
35a865d4 385 @NUMERIC_TYPES =_read_table("To/Nt.pl") unless @NUMERIC_TYPES;
05dbc6f8
KW
386 if ((_search(\@NUMERIC_TYPES, 0, $#NUMERIC_TYPES, $code) // "")
387 eq 'Digit')
388 {
389 $prop{'digit'} = $prop{'numeric'} = $value;
390 }
391 else {
392 $prop{'digit'} = "";
393 $prop{'numeric'} = $real_to_rational{$value} // $value;
394 }
395 }
396 }
397
398 $prop{'mirrored'} = ($char =~ /\p{Bidi_Mirrored}/) ? 'Y' : 'N';
399
35a865d4 400 %UNICODE_1_NAMES =_read_table("To/Na1.pl", "use_hash") unless %UNICODE_1_NAMES;
5c3b35c9 401 $prop{'unicode10'} = $UNICODE_1_NAMES{$code} // "";
05dbc6f8
KW
402
403 # This is true starting in 6.0, but, num() also requires 6.0, so
404 # don't need to test for version again here.
405 $prop{'comment'} = "";
406
35a865d4 407 %SIMPLE_UPPER = _read_table("To/Uc.pl", "use_hash") unless %SIMPLE_UPPER;
bf7fe2df 408 $prop{'upper'} = (defined $SIMPLE_UPPER{$code})
d11155ec 409 ? sprintf("%04X", $SIMPLE_UPPER{$code})
bf7fe2df 410 : "";
75e7c50b 411
35a865d4 412 %SIMPLE_LOWER = _read_table("To/Lc.pl", "use_hash") unless %SIMPLE_LOWER;
bf7fe2df 413 $prop{'lower'} = (defined $SIMPLE_LOWER{$code})
d11155ec 414 ? sprintf("%04X", $SIMPLE_LOWER{$code})
bf7fe2df 415 : "";
75e7c50b 416
35a865d4 417 %SIMPLE_TITLE = _read_table("To/Tc.pl", "use_hash") unless %SIMPLE_TITLE;
bf7fe2df 418 $prop{'title'} = (defined $SIMPLE_TITLE{$code})
d11155ec 419 ? sprintf("%04X", $SIMPLE_TITLE{$code})
bf7fe2df 420 : "";
05dbc6f8
KW
421
422 $prop{block} = charblock($code);
423 $prop{script} = charscript($code);
424 return \%prop;
561c79ed
JH
425}
426
e882dd67
JH
427sub _search { # Binary search in a [[lo,hi,prop],[...],...] table.
428 my ($table, $lo, $hi, $code) = @_;
429
430 return if $lo > $hi;
431
432 my $mid = int(($lo+$hi) / 2);
433
434 if ($table->[$mid]->[0] < $code) {
10a6ecd2 435 if ($table->[$mid]->[1] >= $code) {
e882dd67
JH
436 return $table->[$mid]->[2];
437 } else {
438 _search($table, $mid + 1, $hi, $code);
439 }
440 } elsif ($table->[$mid]->[0] > $code) {
441 _search($table, $lo, $mid - 1, $code);
442 } else {
443 return $table->[$mid]->[2];
444 }
445}
446
cb366075 447sub _read_table ($;$) {
3a12600d
KW
448
449 # Returns the contents of the mktables generated table file located at $1
cb366075
KW
450 # in the form of either an array of arrays or a hash, depending on if the
451 # optional second parameter is true (for hash return) or not. In the case
452 # of a hash return, each key is a code point, and its corresponding value
453 # is what the table gives as the code point's corresponding value. In the
454 # case of an array return, each outer array denotes a range with [0] the
455 # start point of that range; [1] the end point; and [2] the value that
456 # every code point in the range has. The hash return is useful for fast
457 # lookup when the table contains only single code point ranges. The array
458 # return takes much less memory when there are large ranges.
3a12600d 459 #
cb366075 460 # This function has the side effect of setting
3a12600d
KW
461 # $utf8::SwashInfo{$property}{'format'} to be the mktables format of the
462 # table; and
463 # $utf8::SwashInfo{$property}{'missing'} to be the value for all entries
464 # not listed in the table.
465 # where $property is the Unicode property name, preceded by 'To' for map
466 # properties., e.g., 'ToSc'.
467 #
468 # Table entries look like one of:
469 # 0000 0040 Common # [65]
470 # 00AA Latin
471
472 my $table = shift;
cb366075
KW
473 my $return_hash = shift;
474 $return_hash = 0 unless defined $return_hash;
3a12600d 475 my @return;
cb366075 476 my %return;
3a12600d 477 local $_;
d11155ec 478 my $list = do "unicore/$table";
3a12600d 479
d11155ec
KW
480 # Look up if this property requires adjustments, which we do below if it
481 # does.
482 require "unicore/Heavy.pl";
483 my $property = $table =~ s/\.pl//r;
484 $property = $utf8::file_to_swash_name{$property};
485 my $to_adjust = defined $property
486 && $utf8::SwashInfo{$property}{'format'} eq 'a';
487
488 for (split /^/m, $list) {
3a12600d
KW
489 my ($start, $end, $value) = / ^ (.+?) \t (.*?) \t (.+?)
490 \s* ( \# .* )? # Optional comment
491 $ /x;
83fd1222
KW
492 my $decimal_start = hex $start;
493 my $decimal_end = ($end eq "") ? $decimal_start : hex $end;
cb366075 494 if ($return_hash) {
83fd1222 495 foreach my $i ($decimal_start .. $decimal_end) {
d11155ec
KW
496 $return{$i} = ($to_adjust)
497 ? $value + $i - $decimal_start
498 : $value;
cb366075
KW
499 }
500 }
d11155ec
KW
501 elsif (! $to_adjust
502 && @return
503 && $return[-1][1] == $decimal_start - 1
9a96c106
KW
504 && $return[-1][2] eq $value)
505 {
506 # If this is merely extending the previous range, do just that.
507 $return[-1]->[1] = $decimal_end;
508 }
cb366075 509 else {
83fd1222 510 push @return, [ $decimal_start, $decimal_end, $value ];
cb366075 511 }
3a12600d 512 }
cb366075 513 return ($return_hash) ? %return : @return;
3a12600d
KW
514}
515
10a6ecd2
JH
516sub charinrange {
517 my ($range, $arg) = @_;
518 my $code = _getcode($arg);
519 croak __PACKAGE__, "::charinrange: unknown code '$arg'"
520 unless defined $code;
521 _search($range, 0, $#$range, $code);
522}
523
a452d459 524=head2 B<charblock()>
561c79ed 525
55d7b906 526 use Unicode::UCD 'charblock';
561c79ed
JH
527
528 my $charblock = charblock(0x41);
10a6ecd2 529 my $charblock = charblock(1234);
a452d459 530 my $charblock = charblock(0x263a);
10a6ecd2
JH
531 my $charblock = charblock("U+263a");
532
78bf21c2 533 my $range = charblock('Armenian');
10a6ecd2 534
a452d459 535With a L</code point argument> charblock() returns the I<block> the code point
430fe03d
KW
536belongs to, e.g. C<Basic Latin>. The old-style block name is returned (see
537L</Old-style versus new-style block names>).
a452d459 538If the code point is unassigned, this returns the block it would belong to if
a18e976f 539it were assigned.
10a6ecd2 540
78bf21c2
JH
541See also L</Blocks versus Scripts>.
542
18972f4b 543If supplied with an argument that can't be a code point, charblock() tries to
430fe03d
KW
544do the opposite and interpret the argument as an old-style block name. The
545return value
a18e976f
KW
546is a I<range set> with one range: an anonymous list with a single element that
547consists of another anonymous list whose first element is the first code point
548in the block, and whose second (and final) element is the final code point in
549the block. (The extra list consisting of just one element is so that the same
550program logic can be used to handle both this return, and the return from
551L</charscript()> which can have multiple ranges.) You can test whether a code
552point is in a range using the L</charinrange()> function. If the argument is
553not a known block, C<undef> is returned.
561c79ed 554
561c79ed
JH
555=cut
556
557my @BLOCKS;
10a6ecd2 558my %BLOCKS;
561c79ed 559
10a6ecd2 560sub _charblocks {
06bba7d5
KW
561
562 # Can't read from the mktables table because it loses the hyphens in the
563 # original.
561c79ed 564 unless (@BLOCKS) {
10a6ecd2 565 if (openunicode(\$BLOCKSFH, "Blocks.txt")) {
6c8d78fb 566 local $_;
ce066323 567 local $/ = "\n";
10a6ecd2 568 while (<$BLOCKSFH>) {
2796c109 569 if (/^([0-9A-F]+)\.\.([0-9A-F]+);\s+(.+)/) {
10a6ecd2
JH
570 my ($lo, $hi) = (hex($1), hex($2));
571 my $subrange = [ $lo, $hi, $3 ];
572 push @BLOCKS, $subrange;
573 push @{$BLOCKS{$3}}, $subrange;
561c79ed
JH
574 }
575 }
10a6ecd2 576 close($BLOCKSFH);
561c79ed
JH
577 }
578 }
10a6ecd2
JH
579}
580
581sub charblock {
582 my $arg = shift;
583
584 _charblocks() unless @BLOCKS;
585
586 my $code = _getcode($arg);
561c79ed 587
10a6ecd2 588 if (defined $code) {
c707cf8e
KW
589 my $result = _search(\@BLOCKS, 0, $#BLOCKS, $code);
590 return $result if defined $result;
591 return 'No_Block';
592 }
593 elsif (exists $BLOCKS{$arg}) {
594 return dclone $BLOCKS{$arg};
10a6ecd2 595 }
e882dd67
JH
596}
597
a452d459 598=head2 B<charscript()>
e882dd67 599
55d7b906 600 use Unicode::UCD 'charscript';
e882dd67
JH
601
602 my $charscript = charscript(0x41);
10a6ecd2
JH
603 my $charscript = charscript(1234);
604 my $charscript = charscript("U+263a");
e882dd67 605
78bf21c2 606 my $range = charscript('Thai');
10a6ecd2 607
a452d459
KW
608With a L</code point argument> charscript() returns the I<script> the
609code point belongs to, e.g. C<Latin>, C<Greek>, C<Han>.
bb2d29dc 610If the code point is unassigned, it returns C<"Unknown">.
78bf21c2 611
eb0cc9e3 612If supplied with an argument that can't be a code point, charscript() tries
a18e976f
KW
613to do the opposite and interpret the argument as a script name. The
614return value is a I<range set>: an anonymous list of lists that contain
eb0cc9e3 615I<start-of-range>, I<end-of-range> code point pairs. You can test whether a
a18e976f
KW
616code point is in a range set using the L</charinrange()> function. If the
617argument is not a known script, C<undef> is returned.
a452d459
KW
618
619See also L</Blocks versus Scripts>.
e882dd67 620
e882dd67
JH
621=cut
622
623my @SCRIPTS;
10a6ecd2 624my %SCRIPTS;
e882dd67 625
10a6ecd2 626sub _charscripts {
35a865d4 627 @SCRIPTS =_read_table("To/Sc.pl") unless @SCRIPTS;
7bccef0b 628 foreach my $entry (@SCRIPTS) {
f3d50ac9 629 $entry->[2] =~ s/(_\w)/\L$1/g; # Preserve old-style casing
7bccef0b 630 push @{$SCRIPTS{$entry->[2]}}, $entry;
e882dd67 631 }
10a6ecd2
JH
632}
633
634sub charscript {
635 my $arg = shift;
636
637 _charscripts() unless @SCRIPTS;
e882dd67 638
10a6ecd2
JH
639 my $code = _getcode($arg);
640
641 if (defined $code) {
7bccef0b
KW
642 my $result = _search(\@SCRIPTS, 0, $#SCRIPTS, $code);
643 return $result if defined $result;
8079ad82 644 return $utf8::SwashInfo{'ToSc'}{'missing'};
7bccef0b
KW
645 } elsif (exists $SCRIPTS{$arg}) {
646 return dclone $SCRIPTS{$arg};
10a6ecd2 647 }
7bccef0b
KW
648
649 return;
10a6ecd2
JH
650}
651
a452d459 652=head2 B<charblocks()>
10a6ecd2 653
55d7b906 654 use Unicode::UCD 'charblocks';
10a6ecd2 655
b08cd201 656 my $charblocks = charblocks();
10a6ecd2 657
b08cd201 658charblocks() returns a reference to a hash with the known block names
a452d459 659as the keys, and the code point ranges (see L</charblock()>) as the values.
10a6ecd2 660
430fe03d
KW
661The names are in the old-style (see L</Old-style versus new-style block
662names>).
663
62b3b855
KW
664L<prop_invmap("block")|/prop_invmap()> can be used to get this same data in a
665different type of data structure.
666
78bf21c2
JH
667See also L</Blocks versus Scripts>.
668
10a6ecd2
JH
669=cut
670
671sub charblocks {
b08cd201 672 _charblocks() unless %BLOCKS;
741297c1 673 return dclone \%BLOCKS;
10a6ecd2
JH
674}
675
a452d459 676=head2 B<charscripts()>
10a6ecd2 677
55d7b906 678 use Unicode::UCD 'charscripts';
10a6ecd2 679
ea508aee 680 my $charscripts = charscripts();
10a6ecd2 681
ea508aee 682charscripts() returns a reference to a hash with the known script
a452d459 683names as the keys, and the code point ranges (see L</charscript()>) as
ea508aee 684the values.
10a6ecd2 685
62b3b855
KW
686L<prop_invmap("script")|/prop_invmap()> can be used to get this same data in a
687different type of data structure.
688
78bf21c2
JH
689See also L</Blocks versus Scripts>.
690
10a6ecd2
JH
691=cut
692
693sub charscripts {
b08cd201 694 _charscripts() unless %SCRIPTS;
741297c1 695 return dclone \%SCRIPTS;
561c79ed
JH
696}
697
a452d459 698=head2 B<charinrange()>
10a6ecd2 699
f200dd12 700In addition to using the C<\p{Blk=...}> and C<\P{Blk=...}> constructs, you
10a6ecd2 701can also test whether a code point is in the I<range> as returned by
a452d459
KW
702L</charblock()> and L</charscript()> or as the values of the hash returned
703by L</charblocks()> and L</charscripts()> by using charinrange():
10a6ecd2 704
55d7b906 705 use Unicode::UCD qw(charscript charinrange);
10a6ecd2
JH
706
707 $range = charscript('Hiragana');
e145285f 708 print "looks like hiragana\n" if charinrange($range, $codepoint);
10a6ecd2
JH
709
710=cut
711
ea508aee
JH
712my %GENERAL_CATEGORIES =
713 (
714 'L' => 'Letter',
715 'LC' => 'CasedLetter',
716 'Lu' => 'UppercaseLetter',
717 'Ll' => 'LowercaseLetter',
718 'Lt' => 'TitlecaseLetter',
719 'Lm' => 'ModifierLetter',
720 'Lo' => 'OtherLetter',
721 'M' => 'Mark',
722 'Mn' => 'NonspacingMark',
723 'Mc' => 'SpacingMark',
724 'Me' => 'EnclosingMark',
725 'N' => 'Number',
726 'Nd' => 'DecimalNumber',
727 'Nl' => 'LetterNumber',
728 'No' => 'OtherNumber',
729 'P' => 'Punctuation',
730 'Pc' => 'ConnectorPunctuation',
731 'Pd' => 'DashPunctuation',
732 'Ps' => 'OpenPunctuation',
733 'Pe' => 'ClosePunctuation',
734 'Pi' => 'InitialPunctuation',
735 'Pf' => 'FinalPunctuation',
736 'Po' => 'OtherPunctuation',
737 'S' => 'Symbol',
738 'Sm' => 'MathSymbol',
739 'Sc' => 'CurrencySymbol',
740 'Sk' => 'ModifierSymbol',
741 'So' => 'OtherSymbol',
742 'Z' => 'Separator',
743 'Zs' => 'SpaceSeparator',
744 'Zl' => 'LineSeparator',
745 'Zp' => 'ParagraphSeparator',
746 'C' => 'Other',
747 'Cc' => 'Control',
748 'Cf' => 'Format',
749 'Cs' => 'Surrogate',
750 'Co' => 'PrivateUse',
751 'Cn' => 'Unassigned',
752 );
753
754sub general_categories {
755 return dclone \%GENERAL_CATEGORIES;
756}
757
a452d459 758=head2 B<general_categories()>
ea508aee
JH
759
760 use Unicode::UCD 'general_categories';
761
762 my $categories = general_categories();
763
a452d459 764This returns a reference to a hash which has short
ea508aee
JH
765general category names (such as C<Lu>, C<Nd>, C<Zs>, C<S>) as keys and long
766names (such as C<UppercaseLetter>, C<DecimalNumber>, C<SpaceSeparator>,
767C<Symbol>) as values. The hash is reversible in case you need to go
768from the long names to the short names. The general category is the
a452d459
KW
769one returned from
770L</charinfo()> under the C<category> key.
ea508aee 771
7ef25837
KW
772The L</prop_value_aliases()> function can be used to get all the synonyms of
773the category name.
774
ea508aee
JH
775=cut
776
777my %BIDI_TYPES =
778 (
779 'L' => 'Left-to-Right',
780 'LRE' => 'Left-to-Right Embedding',
781 'LRO' => 'Left-to-Right Override',
782 'R' => 'Right-to-Left',
783 'AL' => 'Right-to-Left Arabic',
784 'RLE' => 'Right-to-Left Embedding',
785 'RLO' => 'Right-to-Left Override',
786 'PDF' => 'Pop Directional Format',
787 'EN' => 'European Number',
788 'ES' => 'European Number Separator',
789 'ET' => 'European Number Terminator',
790 'AN' => 'Arabic Number',
791 'CS' => 'Common Number Separator',
792 'NSM' => 'Non-Spacing Mark',
793 'BN' => 'Boundary Neutral',
794 'B' => 'Paragraph Separator',
795 'S' => 'Segment Separator',
796 'WS' => 'Whitespace',
797 'ON' => 'Other Neutrals',
798 );
799
a452d459 800=head2 B<bidi_types()>
ea508aee
JH
801
802 use Unicode::UCD 'bidi_types';
803
804 my $categories = bidi_types();
805
a452d459 806This returns a reference to a hash which has the short
ea508aee
JH
807bidi (bidirectional) type names (such as C<L>, C<R>) as keys and long
808names (such as C<Left-to-Right>, C<Right-to-Left>) as values. The
809hash is reversible in case you need to go from the long names to the
a452d459
KW
810short names. The bidi type is the one returned from
811L</charinfo()>
ea508aee
JH
812under the C<bidi> key. For the exact meaning of the various bidi classes
813the Unicode TR9 is recommended reading:
a452d459 814L<http://www.unicode.org/reports/tr9/>
ea508aee
JH
815(as of Unicode 5.0.0)
816
7ef25837
KW
817The L</prop_value_aliases()> function can be used to get all the synonyms of
818the bidi type name.
819
ea508aee
JH
820=cut
821
a452d459
KW
822sub bidi_types {
823 return dclone \%BIDI_TYPES;
824}
825
826=head2 B<compexcl()>
b08cd201 827
55d7b906 828 use Unicode::UCD 'compexcl';
b08cd201 829
a452d459 830 my $compexcl = compexcl(0x09dc);
b08cd201 831
71a442a8
KW
832This routine is included for backwards compatibility, but as of Perl 5.12, for
833most purposes it is probably more convenient to use one of the following
834instead:
835
836 my $compexcl = chr(0x09dc) =~ /\p{Comp_Ex};
837 my $compexcl = chr(0x09dc) =~ /\p{Full_Composition_Exclusion};
838
839or even
840
841 my $compexcl = chr(0x09dc) =~ /\p{CE};
842 my $compexcl = chr(0x09dc) =~ /\p{Composition_Exclusion};
843
844The first two forms return B<true> if the L</code point argument> should not
76b05678
KW
845be produced by composition normalization. For the final two forms to return
846B<true>, it is additionally required that this fact not otherwise be
847determinable from the Unicode data base.
71a442a8
KW
848
849This routine behaves identically to the final two forms. That is,
850it does not return B<true> if the code point has a decomposition
a452d459
KW
851consisting of another single code point, nor if its decomposition starts
852with a code point whose combining class is non-zero. Code points that meet
853either of these conditions should also not be produced by composition
71a442a8
KW
854normalization, which is probably why you should use the
855C<Full_Composition_Exclusion> property instead, as shown above.
b08cd201 856
71a442a8 857The routine returns B<false> otherwise.
b08cd201
JH
858
859=cut
860
b08cd201
JH
861sub compexcl {
862 my $arg = shift;
863 my $code = _getcode($arg);
74f8133e
JH
864 croak __PACKAGE__, "::compexcl: unknown code '$arg'"
865 unless defined $code;
b08cd201 866
36c2430c 867 no warnings "non_unicode"; # So works on non-Unicode code points
71a442a8 868 return chr($code) =~ /\p{Composition_Exclusion}/;
b08cd201
JH
869}
870
a452d459 871=head2 B<casefold()>
b08cd201 872
55d7b906 873 use Unicode::UCD 'casefold';
b08cd201 874
a452d459
KW
875 my $casefold = casefold(0xDF);
876 if (defined $casefold) {
877 my @full_fold_hex = split / /, $casefold->{'full'};
878 my $full_fold_string =
879 join "", map {chr(hex($_))} @full_fold_hex;
880 my @turkic_fold_hex =
881 split / /, ($casefold->{'turkic'} ne "")
882 ? $casefold->{'turkic'}
883 : $casefold->{'full'};
884 my $turkic_fold_string =
885 join "", map {chr(hex($_))} @turkic_fold_hex;
886 }
887 if (defined $casefold && $casefold->{'simple'} ne "") {
888 my $simple_fold_hex = $casefold->{'simple'};
889 my $simple_fold_string = chr(hex($simple_fold_hex));
890 }
b08cd201 891
a452d459
KW
892This returns the (almost) locale-independent case folding of the
893character specified by the L</code point argument>.
b08cd201 894
a18e976f 895If there is no case folding for that code point, C<undef> is returned.
a452d459
KW
896
897If there is a case folding for that code point, a reference to a hash
b08cd201
JH
898with the following fields is returned:
899
a452d459
KW
900=over
901
902=item B<code>
903
904the input L</code point argument> expressed in hexadecimal, with leading zeros
905added if necessary to make it contain at least four hexdigits
906
907=item B<full>
908
a18e976f 909one or more codes (separated by spaces) that, taken in order, give the
a452d459
KW
910code points for the case folding for I<code>.
911Each has at least four hexdigits.
912
913=item B<simple>
914
915is empty, or is exactly one code with at least four hexdigits which can be used
916as an alternative case folding when the calling program cannot cope with the
917fold being a sequence of multiple code points. If I<full> is just one code
918point, then I<simple> equals I<full>. If there is no single code point folding
919defined for I<code>, then I<simple> is the empty string. Otherwise, it is an
920inferior, but still better-than-nothing alternative folding to I<full>.
921
922=item B<mapping>
923
924is the same as I<simple> if I<simple> is not empty, and it is the same as I<full>
925otherwise. It can be considered to be the simplest possible folding for
926I<code>. It is defined primarily for backwards compatibility.
927
928=item B<status>
b08cd201 929
a452d459
KW
930is C<C> (for C<common>) if the best possible fold is a single code point
931(I<simple> equals I<full> equals I<mapping>). It is C<S> if there are distinct
932folds, I<simple> and I<full> (I<mapping> equals I<simple>). And it is C<F> if
a18e976f
KW
933there is only a I<full> fold (I<mapping> equals I<full>; I<simple> is empty).
934Note that this
a452d459
KW
935describes the contents of I<mapping>. It is defined primarily for backwards
936compatibility.
b08cd201 937
a452d459
KW
938On versions 3.1 and earlier of Unicode, I<status> can also be
939C<I> which is the same as C<C> but is a special case for dotted uppercase I and
940dotless lowercase i:
b08cd201 941
a452d459 942=over
b08cd201 943
a18e976f 944=item B<*> If you use this C<I> mapping
a452d459 945
a18e976f 946the result is case-insensitive,
a452d459
KW
947but dotless and dotted I's are not distinguished
948
a18e976f 949=item B<*> If you exclude this C<I> mapping
a452d459 950
a18e976f 951the result is not fully case-insensitive, but
a452d459
KW
952dotless and dotted I's are distinguished
953
954=back
955
956=item B<turkic>
957
958contains any special folding for Turkic languages. For versions of Unicode
959starting with 3.2, this field is empty unless I<code> has a different folding
960in Turkic languages, in which case it is one or more codes (separated by
a18e976f 961spaces) that, taken in order, give the code points for the case folding for
a452d459
KW
962I<code> in those languages.
963Each code has at least four hexdigits.
964Note that this folding does not maintain canonical equivalence without
965additional processing.
966
967For versions of Unicode 3.1 and earlier, this field is empty unless there is a
968special folding for Turkic languages, in which case I<status> is C<I>, and
969I<mapping>, I<full>, I<simple>, and I<turkic> are all equal.
970
971=back
972
973Programs that want complete generality and the best folding results should use
974the folding contained in the I<full> field. But note that the fold for some
975code points will be a sequence of multiple code points.
976
977Programs that can't cope with the fold mapping being multiple code points can
978use the folding contained in the I<simple> field, with the loss of some
979generality. In Unicode 5.1, about 7% of the defined foldings have no single
980code point folding.
981
982The I<mapping> and I<status> fields are provided for backwards compatibility for
983existing programs. They contain the same values as in previous versions of
984this function.
985
986Locale is not completely independent. The I<turkic> field contains results to
987use when the locale is a Turkic language.
b08cd201
JH
988
989For more information about case mappings see
a452d459 990L<http://www.unicode.org/unicode/reports/tr21>
b08cd201
JH
991
992=cut
993
994my %CASEFOLD;
995
996sub _casefold {
997 unless (%CASEFOLD) {
551b6b6f 998 if (openunicode(\$CASEFOLDFH, "CaseFolding.txt")) {
6c8d78fb 999 local $_;
ce066323 1000 local $/ = "\n";
b08cd201 1001 while (<$CASEFOLDFH>) {
a452d459 1002 if (/^([0-9A-F]+); ([CFIST]); ([0-9A-F]+(?: [0-9A-F]+)*);/) {
b08cd201 1003 my $code = hex($1);
a452d459
KW
1004 $CASEFOLD{$code}{'code'} = $1;
1005 $CASEFOLD{$code}{'turkic'} = "" unless
1006 defined $CASEFOLD{$code}{'turkic'};
1007 if ($2 eq 'C' || $2 eq 'I') { # 'I' is only on 3.1 and
1008 # earlier Unicodes
1009 # Both entries there (I
1010 # only checked 3.1) are
1011 # the same as C, and
1012 # there are no other
1013 # entries for those
1014 # codepoints, so treat
1015 # as if C, but override
1016 # the turkic one for
1017 # 'I'.
1018 $CASEFOLD{$code}{'status'} = $2;
1019 $CASEFOLD{$code}{'full'} = $CASEFOLD{$code}{'simple'} =
1020 $CASEFOLD{$code}{'mapping'} = $3;
1021 $CASEFOLD{$code}{'turkic'} = $3 if $2 eq 'I';
1022 } elsif ($2 eq 'F') {
1023 $CASEFOLD{$code}{'full'} = $3;
1024 unless (defined $CASEFOLD{$code}{'simple'}) {
1025 $CASEFOLD{$code}{'simple'} = "";
1026 $CASEFOLD{$code}{'mapping'} = $3;
1027 $CASEFOLD{$code}{'status'} = $2;
1028 }
1029 } elsif ($2 eq 'S') {
1030
1031
1032 # There can't be a simple without a full, and simple
1033 # overrides all but full
1034
1035 $CASEFOLD{$code}{'simple'} = $3;
1036 $CASEFOLD{$code}{'mapping'} = $3;
1037 $CASEFOLD{$code}{'status'} = $2;
1038 } elsif ($2 eq 'T') {
1039 $CASEFOLD{$code}{'turkic'} = $3;
1040 } # else can't happen because only [CIFST] are possible
b08cd201
JH
1041 }
1042 }
1043 close($CASEFOLDFH);
1044 }
1045 }
1046}
1047
1048sub casefold {
1049 my $arg = shift;
1050 my $code = _getcode($arg);
74f8133e
JH
1051 croak __PACKAGE__, "::casefold: unknown code '$arg'"
1052 unless defined $code;
b08cd201
JH
1053
1054 _casefold() unless %CASEFOLD;
1055
1056 return $CASEFOLD{$code};
1057}
1058
a452d459 1059=head2 B<casespec()>
b08cd201 1060
55d7b906 1061 use Unicode::UCD 'casespec';
b08cd201 1062
a452d459 1063 my $casespec = casespec(0xFB00);
b08cd201 1064
a452d459
KW
1065This returns the potentially locale-dependent case mappings of the L</code point
1066argument>. The mappings may be longer than a single code point (which the basic
1067Unicode case mappings as returned by L</charinfo()> never are).
b08cd201 1068
a452d459
KW
1069If there are no case mappings for the L</code point argument>, or if all three
1070possible mappings (I<lower>, I<title> and I<upper>) result in single code
a18e976f 1071points and are locale independent and unconditional, C<undef> is returned
5d8e6e41
KW
1072(which means that the case mappings, if any, for the code point are those
1073returned by L</charinfo()>).
a452d459
KW
1074
1075Otherwise, a reference to a hash giving the mappings (or a reference to a hash
5d8e6e41
KW
1076of such hashes, explained below) is returned with the following keys and their
1077meanings:
a452d459
KW
1078
1079The keys in the bottom layer hash with the meanings of their values are:
1080
1081=over
1082
1083=item B<code>
1084
1085the input L</code point argument> expressed in hexadecimal, with leading zeros
1086added if necessary to make it contain at least four hexdigits
1087
1088=item B<lower>
1089
a18e976f 1090one or more codes (separated by spaces) that, taken in order, give the
a452d459
KW
1091code points for the lower case of I<code>.
1092Each has at least four hexdigits.
1093
1094=item B<title>
b08cd201 1095
a18e976f 1096one or more codes (separated by spaces) that, taken in order, give the
a452d459
KW
1097code points for the title case of I<code>.
1098Each has at least four hexdigits.
b08cd201 1099
d2da20e3 1100=item B<upper>
b08cd201 1101
a18e976f 1102one or more codes (separated by spaces) that, taken in order, give the
a452d459
KW
1103code points for the upper case of I<code>.
1104Each has at least four hexdigits.
1105
1106=item B<condition>
1107
1108the conditions for the mappings to be valid.
a18e976f 1109If C<undef>, the mappings are always valid.
a452d459
KW
1110When defined, this field is a list of conditions,
1111all of which must be true for the mappings to be valid.
1112The list consists of one or more
1113I<locales> (see below)
1114and/or I<contexts> (explained in the next paragraph),
1115separated by spaces.
1116(Other than as used to separate elements, spaces are to be ignored.)
1117Case distinctions in the condition list are not significant.
82c0b05b 1118Conditions preceded by "NON_" represent the negation of the condition.
b08cd201 1119
a452d459
KW
1120A I<context> is one of those defined in the Unicode standard.
1121For Unicode 5.1, they are defined in Section 3.13 C<Default Case Operations>
1122available at
5d8e6e41
KW
1123L<http://www.unicode.org/versions/Unicode5.1.0/>.
1124These are for context-sensitive casing.
f499c386 1125
a452d459
KW
1126=back
1127
5d8e6e41 1128The hash described above is returned for locale-independent casing, where
a18e976f 1129at least one of the mappings has length longer than one. If C<undef> is
5d8e6e41
KW
1130returned, the code point may have mappings, but if so, all are length one,
1131and are returned by L</charinfo()>.
1132Note that when this function does return a value, it will be for the complete
1133set of mappings for a code point, even those whose length is one.
1134
1135If there are additional casing rules that apply only in certain locales,
1136an additional key for each will be defined in the returned hash. Each such key
1137will be its locale name, defined as a 2-letter ISO 3166 country code, possibly
1138followed by a "_" and a 2-letter ISO language code (possibly followed by a "_"
1139and a variant code). You can find the lists of all possible locales, see
1140L<Locale::Country> and L<Locale::Language>.
89e4a205 1141(In Unicode 6.0, the only locales returned by this function
a452d459 1142are C<lt>, C<tr>, and C<az>.)
b08cd201 1143
5d8e6e41
KW
1144Each locale key is a reference to a hash that has the form above, and gives
1145the casing rules for that particular locale, which take precedence over the
1146locale-independent ones when in that locale.
1147
1148If the only casing for a code point is locale-dependent, then the returned
1149hash will not have any of the base keys, like C<code>, C<upper>, etc., but
1150will contain only locale keys.
1151
b08cd201 1152For more information about case mappings see
a452d459 1153L<http://www.unicode.org/unicode/reports/tr21/>
b08cd201
JH
1154
1155=cut
1156
1157my %CASESPEC;
1158
1159sub _casespec {
1160 unless (%CASESPEC) {
551b6b6f 1161 if (openunicode(\$CASESPECFH, "SpecialCasing.txt")) {
6c8d78fb 1162 local $_;
ce066323 1163 local $/ = "\n";
b08cd201
JH
1164 while (<$CASESPECFH>) {
1165 if (/^([0-9A-F]+); ([0-9A-F]+(?: [0-9A-F]+)*)?; ([0-9A-F]+(?: [0-9A-F]+)*)?; ([0-9A-F]+(?: [0-9A-F]+)*)?; (\w+(?: \w+)*)?/) {
f499c386
JH
1166 my ($hexcode, $lower, $title, $upper, $condition) =
1167 ($1, $2, $3, $4, $5);
1168 my $code = hex($hexcode);
1169 if (exists $CASESPEC{$code}) {
1170 if (exists $CASESPEC{$code}->{code}) {
1171 my ($oldlower,
1172 $oldtitle,
1173 $oldupper,
1174 $oldcondition) =
1175 @{$CASESPEC{$code}}{qw(lower
1176 title
1177 upper
1178 condition)};
822ebcc8
JH
1179 if (defined $oldcondition) {
1180 my ($oldlocale) =
f499c386 1181 ($oldcondition =~ /^([a-z][a-z](?:_\S+)?)/);
f499c386
JH
1182 delete $CASESPEC{$code};
1183 $CASESPEC{$code}->{$oldlocale} =
1184 { code => $hexcode,
1185 lower => $oldlower,
1186 title => $oldtitle,
1187 upper => $oldupper,
1188 condition => $oldcondition };
f499c386
JH
1189 }
1190 }
1191 my ($locale) =
1192 ($condition =~ /^([a-z][a-z](?:_\S+)?)/);
1193 $CASESPEC{$code}->{$locale} =
1194 { code => $hexcode,
1195 lower => $lower,
1196 title => $title,
1197 upper => $upper,
1198 condition => $condition };
1199 } else {
1200 $CASESPEC{$code} =
1201 { code => $hexcode,
1202 lower => $lower,
1203 title => $title,
1204 upper => $upper,
1205 condition => $condition };
1206 }
b08cd201
JH
1207 }
1208 }
1209 close($CASESPECFH);
1210 }
1211 }
1212}
1213
1214sub casespec {
1215 my $arg = shift;
1216 my $code = _getcode($arg);
74f8133e
JH
1217 croak __PACKAGE__, "::casespec: unknown code '$arg'"
1218 unless defined $code;
b08cd201
JH
1219
1220 _casespec() unless %CASESPEC;
1221
741297c1 1222 return ref $CASESPEC{$code} ? dclone $CASESPEC{$code} : $CASESPEC{$code};
b08cd201
JH
1223}
1224
a452d459 1225=head2 B<namedseq()>
a2bd7410
JH
1226
1227 use Unicode::UCD 'namedseq';
1228
1229 my $namedseq = namedseq("KATAKANA LETTER AINU P");
1230 my @namedseq = namedseq("KATAKANA LETTER AINU P");
1231 my %namedseq = namedseq();
1232
1233If used with a single argument in a scalar context, returns the string
a18e976f 1234consisting of the code points of the named sequence, or C<undef> if no
a2bd7410 1235named sequence by that name exists. If used with a single argument in
956cae9a
KW
1236a list context, it returns the list of the ordinals of the code points. If used
1237with no
a2bd7410
JH
1238arguments in a list context, returns a hash with the names of the
1239named sequences as the keys and the named sequences as strings as
a18e976f 1240the values. Otherwise, it returns C<undef> or an empty list depending
a2bd7410
JH
1241on the context.
1242
a452d459
KW
1243This function only operates on officially approved (not provisional) named
1244sequences.
a2bd7410 1245
27f853a0
KW
1246Note that as of Perl 5.14, C<\N{KATAKANA LETTER AINU P}> will insert the named
1247sequence into double-quoted strings, and C<charnames::string_vianame("KATAKANA
1248LETTER AINU P")> will return the same string this function does, but will also
1249operate on character names that aren't named sequences, without you having to
1250know which are which. See L<charnames>.
1251
a2bd7410
JH
1252=cut
1253
1254my %NAMEDSEQ;
1255
1256sub _namedseq {
1257 unless (%NAMEDSEQ) {
98ef7649 1258 if (openunicode(\$NAMEDSEQFH, "Name.pl")) {
a2bd7410 1259 local $_;
ce066323 1260 local $/ = "\n";
a2bd7410 1261 while (<$NAMEDSEQFH>) {
98ef7649
KW
1262 if (/^ [0-9A-F]+ \ /x) {
1263 chomp;
1264 my ($sequence, $name) = split /\t/;
1265 my @s = map { chr(hex($_)) } split(' ', $sequence);
1266 $NAMEDSEQ{$name} = join("", @s);
a2bd7410
JH
1267 }
1268 }
1269 close($NAMEDSEQFH);
1270 }
1271 }
1272}
1273
1274sub namedseq {
98ef7649
KW
1275
1276 # Use charnames::string_vianame() which now returns this information,
1277 # unless the caller wants the hash returned, in which case we read it in,
1278 # and thereafter use it instead of calling charnames, as it is faster.
1279
a2bd7410
JH
1280 my $wantarray = wantarray();
1281 if (defined $wantarray) {
1282 if ($wantarray) {
1283 if (@_ == 0) {
98ef7649 1284 _namedseq() unless %NAMEDSEQ;
a2bd7410
JH
1285 return %NAMEDSEQ;
1286 } elsif (@_ == 1) {
98ef7649
KW
1287 my $s;
1288 if (%NAMEDSEQ) {
1289 $s = $NAMEDSEQ{ $_[0] };
1290 }
1291 else {
1292 $s = charnames::string_vianame($_[0]);
1293 }
a2bd7410
JH
1294 return defined $s ? map { ord($_) } split('', $s) : ();
1295 }
1296 } elsif (@_ == 1) {
98ef7649
KW
1297 return $NAMEDSEQ{ $_[0] } if %NAMEDSEQ;
1298 return charnames::string_vianame($_[0]);
a2bd7410
JH
1299 }
1300 }
1301 return;
1302}
1303
7319f91d
KW
1304my %NUMERIC;
1305
1306sub _numeric {
1307
1308 # Unicode 6.0 instituted the rule that only digits in a consecutive
1309 # block of 10 would be considered decimal digits. Before that, the only
1310 # problematic code point that I'm (khw) aware of is U+019DA, NEW TAI LUE
1311 # THAM DIGIT ONE, which is an alternate form of U+019D1, NEW TAI LUE DIGIT
1312 # ONE. The code could be modified to handle that, but not bothering, as
1313 # in TUS 6.0, U+19DA was changed to Nt=Di.
1314 if ((pack "C*", split /\./, UnicodeVersion()) lt 6.0.0) {
1315 croak __PACKAGE__, "::num requires Unicode 6.0 or greater"
1316 }
35a865d4 1317 my @numbers = _read_table("To/Nv.pl");
98025745
KW
1318 foreach my $entry (@numbers) {
1319 my ($start, $end, $value) = @$entry;
1320
05dbc6f8
KW
1321 # If value contains a slash, convert to decimal, add a reverse hash
1322 # used by charinfo.
98025745
KW
1323 if ((my @rational = split /\//, $value) == 2) {
1324 my $real = $rational[0] / $rational[1];
05dbc6f8 1325 $real_to_rational{$real} = $value;
98025745
KW
1326 $value = $real;
1327 }
1328
1329 for my $i ($start .. $end) {
1330 $NUMERIC{$i} = $value;
7319f91d 1331 }
7319f91d 1332 }
2dc5eb26
KW
1333
1334 # Decided unsafe to use these that aren't officially part of the Unicode
1335 # standard.
1336 #use Math::Trig;
1337 #my $pi = acos(-1.0);
98025745 1338 #$NUMERIC{0x03C0} = $pi;
7319f91d
KW
1339
1340 # Euler's constant, not to be confused with Euler's number
98025745 1341 #$NUMERIC{0x2107} = 0.57721566490153286060651209008240243104215933593992;
7319f91d
KW
1342
1343 # Euler's number
98025745 1344 #$NUMERIC{0x212F} = 2.7182818284590452353602874713526624977572;
2dc5eb26 1345
7319f91d
KW
1346 return;
1347}
1348
1349=pod
1350
67592e11 1351=head2 B<num()>
7319f91d 1352
eefd7bc2
KW
1353 use Unicode::UCD 'num';
1354
1355 my $val = num("123");
1356 my $one_quarter = num("\N{VULGAR FRACTION 1/4}");
1357
7319f91d
KW
1358C<num> returns the numeric value of the input Unicode string; or C<undef> if it
1359doesn't think the entire string has a completely valid, safe numeric value.
1360
1361If the string is just one character in length, the Unicode numeric value
1362is returned if it has one, or C<undef> otherwise. Note that this need
1363not be a whole number. C<num("\N{TIBETAN DIGIT HALF ZERO}")>, for
2dc5eb26
KW
1364example returns -0.5.
1365
1366=cut
7319f91d 1367
2dc5eb26
KW
1368#A few characters to which Unicode doesn't officially
1369#assign a numeric value are considered numeric by C<num>.
1370#These are:
1371
1372# EULER CONSTANT 0.5772... (this is NOT Euler's number)
1373# SCRIPT SMALL E 2.71828... (this IS Euler's number)
1374# GREEK SMALL LETTER PI 3.14159...
1375
1376=pod
7319f91d
KW
1377
1378If the string is more than one character, C<undef> is returned unless
8bb4c8e2 1379all its characters are decimal digits (that is, they would match C<\d+>),
7319f91d
KW
1380from the same script. For example if you have an ASCII '0' and a Bengali
1381'3', mixed together, they aren't considered a valid number, and C<undef>
1382is returned. A further restriction is that the digits all have to be of
1383the same form. A half-width digit mixed with a full-width one will
1384return C<undef>. The Arabic script has two sets of digits; C<num> will
1385return C<undef> unless all the digits in the string come from the same
1386set.
1387
1388C<num> errs on the side of safety, and there may be valid strings of
1389decimal digits that it doesn't recognize. Note that Unicode defines
1390a number of "digit" characters that aren't "decimal digit" characters.
a278d14b 1391"Decimal digits" have the property that they have a positional value, i.e.,
7319f91d
KW
1392there is a units position, a 10's position, a 100's, etc, AND they are
1393arranged in Unicode in blocks of 10 contiguous code points. The Chinese
1394digits, for example, are not in such a contiguous block, and so Unicode
1395doesn't view them as decimal digits, but merely digits, and so C<\d> will not
1396match them. A single-character string containing one of these digits will
1397have its decimal value returned by C<num>, but any longer string containing
1398only these digits will return C<undef>.
1399
a278d14b
KW
1400Strings of multiple sub- and superscripts are not recognized as numbers. You
1401can use either of the compatibility decompositions in Unicode::Normalize to
7319f91d
KW
1402change these into digits, and then call C<num> on the result.
1403
1404=cut
1405
1406# To handle sub, superscripts, this could if called in list context,
1407# consider those, and return the <decomposition> type in the second
1408# array element.
1409
1410sub num {
1411 my $string = $_[0];
1412
1413 _numeric unless %NUMERIC;
1414
1415 my $length = length($string);
98025745 1416 return $NUMERIC{ord($string)} if $length == 1;
7319f91d
KW
1417 return if $string =~ /\D/;
1418 my $first_ord = ord(substr($string, 0, 1));
98025745 1419 my $value = $NUMERIC{$first_ord};
7319f91d
KW
1420 my $zero_ord = $first_ord - $value;
1421
1422 for my $i (1 .. $length -1) {
1423 my $ord = ord(substr($string, $i, 1));
1424 my $digit = $ord - $zero_ord;
1425 return unless $digit >= 0 && $digit <= 9;
1426 $value = $value * 10 + $digit;
1427 }
1428 return $value;
1429}
1430
7ef25837
KW
1431=pod
1432
1433=head2 B<prop_aliases()>
1434
1435 use Unicode::UCD 'prop_aliases';
1436
1437 my ($short_name, $full_name, @other_names) = prop_aliases("space");
1438 my $same_full_name = prop_aliases("Space"); # Scalar context
1439 my ($same_short_name) = prop_aliases("Space"); # gets 0th element
1440 print "The full name is $full_name\n";
1441 print "The short name is $short_name\n";
1442 print "The other aliases are: ", join(", ", @other_names), "\n";
1443
1444 prints:
1445 The full name is White_Space
1446 The short name is WSpace
1447 The other aliases are: Space
1448
1449Most Unicode properties have several synonymous names. Typically, there is at
1450least a short name, convenient to type, and a long name that more fully
1451describes the property, and hence is more easily understood.
1452
1453If you know one name for a Unicode property, you can use C<prop_aliases> to find
1454either the long name (when called in scalar context), or a list of all of the
1455names, somewhat ordered so that the short name is in the 0th element, the long
1456name in the next element, and any other synonyms are in the remaining
1457elements, in no particular order.
1458
1459The long name is returned in a form nicely capitalized, suitable for printing.
1460
1461The input parameter name is loosely matched, which means that white space,
1462hyphens, and underscores are ignored (except for the trailing underscore in
1463the old_form grandfathered-in C<"L_">, which is better written as C<"LC">, and
1464both of which mean C<General_Category=Cased Letter>).
1465
1466If the name is unknown, C<undef> is returned (or an empty list in list
1467context). Note that Perl typically recognizes property names in regular
1468expressions with an optional C<"Is_>" (with or without the underscore)
1469prefixed to them, such as C<\p{isgc=punct}>. This function does not recognize
1470those in the input, returning C<undef>. Nor are they included in the output
1471as possible synonyms.
1472
1473C<prop_aliases> does know about the Perl extensions to Unicode properties,
1474such as C<Any> and C<XPosixAlpha>, and the single form equivalents to Unicode
1475properties such as C<XDigit>, C<Greek>, C<In_Greek>, and C<Is_Greek>. The
1476final example demonstrates that the C<"Is_"> prefix is recognized for these
1477extensions; it is needed to resolve ambiguities. For example,
1478C<prop_aliases('lc')> returns the list C<(lc, Lowercase_Mapping)>, but
1479C<prop_aliases('islc')> returns C<(Is_LC, Cased_Letter)>. This is
1480because C<islc> is a Perl extension which is short for
1481C<General_Category=Cased Letter>. The lists returned for the Perl extensions
1482will not include the C<"Is_"> prefix (whether or not the input had it) unless
1483needed to resolve ambiguities, as shown in the C<"islc"> example, where the
1484returned list had one element containing C<"Is_">, and the other without.
1485
1486It is also possible for the reverse to happen: C<prop_aliases('isc')> returns
1487the list C<(isc, ISO_Comment)>; whereas C<prop_aliases('c')> returns
1488C<(C, Other)> (the latter being a Perl extension meaning
ee94c7d1
KW
1489C<General_Category=Other>.
1490L<perluniprops/Properties accessible through Unicode::UCD> lists the available
1491forms, including which ones are discouraged from use.
7ef25837
KW
1492
1493Those discouraged forms are accepted as input to C<prop_aliases>, but are not
1494returned in the lists. C<prop_aliases('isL&')> and C<prop_aliases('isL_')>,
1495which are old synonyms for C<"Is_LC"> and should not be used in new code, are
1496examples of this. These both return C<(Is_LC, Cased_Letter)>. Thus this
1497function allows you to take a discourarged form, and find its acceptable
1498alternatives. The same goes with single-form Block property equivalences.
1499Only the forms that begin with C<"In_"> are not discouraged; if you pass
1500C<prop_aliases> a discouraged form, you will get back the equivalent ones that
1501begin with C<"In_">. It will otherwise look like a new-style block name (see.
1502L</Old-style versus new-style block names>).
1503
1504C<prop_aliases> does not know about any user-defined properties, and will
1505return C<undef> if called with one of those. Likewise for Perl internal
1506properties, with the exception of "Perl_Decimal_Digit" which it does know
1507about (and which is documented below in L</prop_invmap()>).
1508
1509=cut
1510
1511# It may be that there are use cases where the discouraged forms should be
1512# returned. If that comes up, an optional boolean second parameter to the
1513# function could be created, for example.
1514
1515# These are created by mktables for this routine and stored in unicore/UCD.pl
1516# where their structures are described.
1517our %string_property_loose_to_name;
1518our %ambiguous_names;
1519our %loose_perlprop_to_name;
1520our %prop_aliases;
1521
1522sub prop_aliases ($) {
1523 my $prop = $_[0];
1524 return unless defined $prop;
1525
1526 require "unicore/UCD.pl";
1527 require "unicore/Heavy.pl";
1528 require "utf8_heavy.pl";
1529
1530 # The property name may be loosely or strictly matched; we don't know yet.
1531 # But both types use lower-case.
1532 $prop = lc $prop;
1533
1534 # It is loosely matched if its lower case isn't known to be strict.
1535 my $list_ref;
1536 if (! exists $utf8::stricter_to_file_of{$prop}) {
1537 my $loose = utf8::_loose_name($prop);
1538
1539 # There is a hash that converts from any loose name to its standard
1540 # form, mapping all synonyms for a name to one name that can be used
1541 # as a key into another hash. The whole concept is for memory
1542 # savings, as the second hash doesn't have to have all the
1543 # combinations. Actually, there are two hashes that do the
1544 # converstion. One is used in utf8_heavy.pl (stored in Heavy.pl) for
1545 # looking up properties matchable in regexes. This function needs to
1546 # access string properties, which aren't available in regexes, so a
1547 # second conversion hash is made for them (stored in UCD.pl). Look in
1548 # the string one now, as the rest can have an optional 'is' prefix,
1549 # which these don't.
1550 if (exists $string_property_loose_to_name{$loose}) {
1551
1552 # Convert to its standard loose name.
1553 $prop = $string_property_loose_to_name{$loose};
1554 }
1555 else {
1556 my $retrying = 0; # bool. ? Has an initial 'is' been stripped
1557 RETRY:
1558 if (exists $utf8::loose_property_name_of{$loose}
1559 && (! $retrying
1560 || ! exists $ambiguous_names{$loose}))
1561 {
1562 # Found an entry giving the standard form. We don't get here
1563 # (in the test above) when we've stripped off an
1564 # 'is' and the result is an ambiguous name. That is because
1565 # these are official Unicode properties (though Perl can have
1566 # an optional 'is' prefix meaning the official property), and
1567 # all ambiguous cases involve a Perl single-form extension
1568 # for the gc, script, or block properties, and the stripped
1569 # 'is' means that they mean one of those, and not one of
1570 # these
1571 $prop = $utf8::loose_property_name_of{$loose};
1572 }
1573 elsif (exists $loose_perlprop_to_name{$loose}) {
1574
1575 # This hash is specifically for this function to list Perl
1576 # extensions that aren't in the earlier hashes. If there is
1577 # only one element, the short and long names are identical.
1578 # Otherwise the form is already in the same form as
1579 # %prop_aliases, which is handled at the end of the function.
1580 $list_ref = $loose_perlprop_to_name{$loose};
1581 if (@$list_ref == 1) {
1582 my @list = ($list_ref->[0], $list_ref->[0]);
1583 $list_ref = \@list;
1584 }
1585 }
1586 elsif (! exists $utf8::loose_to_file_of{$loose}) {
1587
1588 # loose_to_file_of is a complete list of loose names. If not
1589 # there, the input is unknown.
1590 return;
1591 }
1592 else {
1593
1594 # Here we found the name but not its aliases, so it has to
1595 # exist. This means it must be one of the Perl single-form
1596 # extensions. First see if it is for a property-value
1597 # combination in one of the following properties.
1598 my @list;
1599 foreach my $property ("gc", "script") {
1600 @list = prop_value_aliases($property, $loose);
1601 last if @list;
1602 }
1603 if (@list) {
1604
1605 # Here, it is one of those property-value combination
1606 # single-form synonyms. There are ambiguities with some
1607 # of these. Check against the list for these, and adjust
1608 # if necessary.
1609 for my $i (0 .. @list -1) {
1610 if (exists $ambiguous_names
1611 {utf8::_loose_name(lc $list[$i])})
1612 {
1613 # The ambiguity is resolved by toggling whether or
1614 # not it has an 'is' prefix
1615 $list[$i] =~ s/^Is_// or $list[$i] =~ s/^/Is_/;
1616 }
1617 }
1618 return @list;
1619 }
1620
1621 # Here, it wasn't one of the gc or script single-form
1622 # extensions. It could be a block property single-form
1623 # extension. An 'in' prefix definitely means that, and should
2a4f2769
KW
1624 # be looked up without the prefix. However, starting in
1625 # Unicode 6.1, we have to special case 'indic...', as there
1626 # is a property that begins with that name. We shouldn't
1627 # strip the 'in' from that. I'm (khw) generalizing this to
1628 # 'indic' instead of the single property, because I suspect
1629 # that others of this class may come along in the future.
1630 # However, this could backfire and a block created whose name
1631 # begins with 'dic...', and we would want to strip the 'in'.
1632 # At which point this would have to be tweaked.
1633 my $began_with_in = $loose =~ s/^in(?!dic)//;
7ef25837
KW
1634 @list = prop_value_aliases("block", $loose);
1635 if (@list) {
1636 map { $_ =~ s/^/In_/ } @list;
1637 return @list;
1638 }
1639
1640 # Here still haven't found it. The last opportunity for it
1641 # being valid is only if it began with 'is'. We retry without
1642 # the 'is', setting a flag to that effect so that we don't
1643 # accept things that begin with 'isis...'
1644 if (! $retrying && ! $began_with_in && $loose =~ s/^is//) {
1645 $retrying = 1;
1646 goto RETRY;
1647 }
1648
1649 # Here, didn't find it. Since it was in %loose_to_file_of, we
1650 # should have been able to find it.
1651 carp __PACKAGE__, "::prop_aliases: Unexpectedly could not find '$prop'. Send bug report to perlbug\@perl.org";
1652 return;
1653 }
1654 }
1655 }
1656
1657 if (! $list_ref) {
1658 # Here, we have set $prop to a standard form name of the input. Look
1659 # it up in the structure created by mktables for this purpose, which
1660 # contains both strict and loosely matched properties. Avoid
1661 # autovivifying.
1662 $list_ref = $prop_aliases{$prop} if exists $prop_aliases{$prop};
1663 return unless $list_ref;
1664 }
1665
1666 # The full name is in element 1.
1667 return $list_ref->[1] unless wantarray;
1668
1669 return @{dclone $list_ref};
1670}
1671
1672=pod
1673
1674=head2 B<prop_value_aliases()>
1675
1676 use Unicode::UCD 'prop_value_aliases';
1677
1678 my ($short_name, $full_name, @other_names)
1679 = prop_value_aliases("Gc", "Punct");
1680 my $same_full_name = prop_value_aliases("Gc", "P"); # Scalar cntxt
1681 my ($same_short_name) = prop_value_aliases("Gc", "P"); # gets 0th
1682 # element
1683 print "The full name is $full_name\n";
1684 print "The short name is $short_name\n";
1685 print "The other aliases are: ", join(", ", @other_names), "\n";
1686
1687 prints:
1688 The full name is Punctuation
1689 The short name is P
1690 The other aliases are: Punct
1691
1692Some Unicode properties have a restricted set of legal values. For example,
1693all binary properties are restricted to just C<true> or C<false>; and there
1694are only a few dozen possible General Categories.
1695
1696For such properties, there are usually several synonyms for each possible
1697value. For example, in binary properties, I<truth> can be represented by any of
1698the strings "Y", "Yes", "T", or "True"; and the General Category
1699"Punctuation" by that string, or "Punct", or simply "P".
1700
1701Like property names, there is typically at least a short name for each such
1702property-value, and a long name. If you know any name of the property-value,
1703you can use C<prop_value_aliases>() to get the long name (when called in
1704scalar context), or a list of all the names, with the short name in the 0th
1705element, the long name in the next element, and any other synonyms in the
1706remaining elements, in no particular order, except that any all-numeric
1707synonyms will be last.
1708
1709The long name is returned in a form nicely capitalized, suitable for printing.
1710
1711Case, white space, hyphens, and underscores are ignored in the input parameters
1712(except for the trailing underscore in the old-form grandfathered-in general
1713category property value C<"L_">, which is better written as C<"LC">).
1714
1715If either name is unknown, C<undef> is returned. Note that Perl typically
1716recognizes property names in regular expressions with an optional C<"Is_>"
1717(with or without the underscore) prefixed to them, such as C<\p{isgc=punct}>.
1718This function does not recognize those in the property parameter, returning
1719C<undef>.
1720
1721If called with a property that doesn't have synonyms for its values, it
1722returns the input value, possibly normalized with capitalization and
1723underscores.
1724
1725For the block property, new-style block names are returned (see
1726L</Old-style versus new-style block names>).
1727
1728To find the synonyms for single-forms, such as C<\p{Any}>, use
1729L</prop_aliases()> instead.
1730
1731C<prop_value_aliases> does not know about any user-defined properties, and
1732will return C<undef> if called with one of those.
1733
1734=cut
1735
1736# These are created by mktables for this routine and stored in unicore/UCD.pl
1737# where their structures are described.
1738our %loose_to_standard_value;
1739our %prop_value_aliases;
1740
1741sub prop_value_aliases ($$) {
1742 my ($prop, $value) = @_;
1743 return unless defined $prop && defined $value;
1744
1745 require "unicore/UCD.pl";
1746 require "utf8_heavy.pl";
1747
1748 # Find the property name synonym that's used as the key in other hashes,
1749 # which is element 0 in the returned list.
1750 ($prop) = prop_aliases($prop);
1751 return if ! $prop;
1752 $prop = utf8::_loose_name(lc $prop);
1753
1754 # Here is a legal property, but the hash below (created by mktables for
1755 # this purpose) only knows about the properties that have a very finite
1756 # number of potential values, that is not ones whose value could be
1757 # anything, like most (if not all) string properties. These don't have
1758 # synonyms anyway. Simply return the input. For example, there is no
1759 # synonym for ('Uppercase_Mapping', A').
1760 return $value if ! exists $prop_value_aliases{$prop};
1761
1762 # The value name may be loosely or strictly matched; we don't know yet.
1763 # But both types use lower-case.
1764 $value = lc $value;
1765
1766 # If the name isn't found under loose matching, it certainly won't be
1767 # found under strict
1768 my $loose_value = utf8::_loose_name($value);
1769 return unless exists $loose_to_standard_value{"$prop=$loose_value"};
1770
1771 # Similarly if the combination under loose matching doesn't exist, it
1772 # won't exist under strict.
1773 my $standard_value = $loose_to_standard_value{"$prop=$loose_value"};
1774 return unless exists $prop_value_aliases{$prop}{$standard_value};
1775
1776 # Here we did find a combination under loose matching rules. But it could
1777 # be that is a strict property match that shouldn't have matched.
1778 # %prop_value_aliases is set up so that the strict matches will appear as
1779 # if they were in loose form. Thus, if the non-loose version is legal,
1780 # we're ok, can skip the further check.
1781 if (! exists $utf8::stricter_to_file_of{"$prop=$value"}
1782
1783 # We're also ok and skip the further check if value loosely matches.
1784 # mktables has verified that no strict name under loose rules maps to
1785 # an existing loose name. This code relies on the very limited
1786 # circumstances that strict names can be here. Strict name matching
1787 # happens under two conditions:
1788 # 1) when the name begins with an underscore. But this function
1789 # doesn't accept those, and %prop_value_aliases doesn't have
1790 # them.
1791 # 2) When the values are numeric, in which case we need to look
1792 # further, but their squeezed-out loose values will be in
1793 # %stricter_to_file_of
1794 && exists $utf8::stricter_to_file_of{"$prop=$loose_value"})
1795 {
1796 # The only thing that's legal loosely under strict is that can have an
1797 # underscore between digit pairs XXX
1798 while ($value =~ s/(\d)_(\d)/$1$2/g) {}
1799 return unless exists $utf8::stricter_to_file_of{"$prop=$value"};
1800 }
1801
1802 # Here, we know that the combination exists. Return it.
1803 my $list_ref = $prop_value_aliases{$prop}{$standard_value};
1804 if (@$list_ref > 1) {
1805 # The full name is in element 1.
1806 return $list_ref->[1] unless wantarray;
1807
1808 return @{dclone $list_ref};
1809 }
1810
1811 return $list_ref->[0] unless wantarray;
1812
1813 # Only 1 element means that it repeats
1814 return ( $list_ref->[0], $list_ref->[0] );
1815}
7319f91d 1816
681d705c
KW
1817# All 1 bits is the largest possible UV.
1818$Unicode::UCD::MAX_CP = ~0;
1819
1820=pod
1821
1822=head2 B<prop_invlist()>
1823
1824C<prop_invlist> returns an inversion list (described below) that defines all the
1825code points for the binary Unicode property (or "property=value" pair) given
1826by the input parameter string:
1827
1828 use feature 'say';
1829 use Unicode::UCD 'prop_invlist';
1830 say join ", ", prop_invlist("Any");
1831
1832 prints:
1833 0, 1114112
1834
1835An empty list is returned if the input is unknown; the number of elements in
1836the list is returned if called in scalar context.
1837
1838L<perluniprops|perluniprops/Properties accessible through \p{} and \P{}> gives
1839the list of properties that this function accepts, as well as all the possible
1840forms for them (including with the optional "Is_" prefixes). (Except this
1841function doesn't accept any Perl-internal properties, some of which are listed
1842there.) This function uses the same loose or tighter matching rules for
1843resolving the input property's name as is done for regular expressions. These
1844are also specified in L<perluniprops|perluniprops/Properties accessible
1845through \p{} and \P{}>. Examples of using the "property=value" form are:
1846
1847 say join ", ", prop_invlist("Script=Shavian");
1848
1849 prints:
1850 66640, 66688
1851
1852 say join ", ", prop_invlist("ASCII_Hex_Digit=No");
1853
1854 prints:
1855 0, 48, 58, 65, 71, 97, 103
1856
1857 say join ", ", prop_invlist("ASCII_Hex_Digit=Yes");
1858
1859 prints:
1860 48, 58, 65, 71, 97, 103
1861
1862Inversion lists are a compact way of specifying Unicode property-value
1863definitions. The 0th item in the list is the lowest code point that has the
1864property-value. The next item (item [1]) is the lowest code point beyond that
1865one that does NOT have the property-value. And the next item beyond that
1866([2]) is the lowest code point beyond that one that does have the
1867property-value, and so on. Put another way, each element in the list gives
1868the beginning of a range that has the property-value (for even numbered
1869elements), or doesn't have the property-value (for odd numbered elements).
1870The name for this data structure stems from the fact that each element in the
1871list toggles (or inverts) whether the corresponding range is or isn't on the
1872list.
1873
1874In the final example above, the first ASCII Hex digit is code point 48, the
1875character "0", and all code points from it through 57 (a "9") are ASCII hex
1876digits. Code points 58 through 64 aren't, but 65 (an "A") through 70 (an "F")
1877are, as are 97 ("a") through 102 ("f"). 103 starts a range of code points
1878that aren't ASCII hex digits. That range extends to infinity, which on your
1879computer can be found in the variable C<$Unicode::UCD::MAX_CP>. (This
1880variable is as close to infinity as Perl can get on your platform, and may be
1881too high for some operations to work; you may wish to use a smaller number for
1882your purposes.)
1883
1884Note that the inversion lists returned by this function can possibly include
1885non-Unicode code points, that is anything above 0x10FFFF. This is in
1886contrast to Perl regular expression matches on those code points, in which a
1887non-Unicode code point always fails to match. For example, both of these have
1888the same result:
1889
1890 chr(0x110000) =~ \p{ASCII_Hex_Digit=True} # Fails.
1891 chr(0x110000) =~ \p{ASCII_Hex_Digit=False} # Fails!
1892
1893And both raise a warning that a Unicode property is being used on a
1894non-Unicode code point. It is arguable as to which is the correct thing to do
1895here. This function has chosen the way opposite to the Perl regular
1896expression behavior. This allows you to easily flip to to the Perl regular
1897expression way (for you to go in the other direction would be far harder).
1898Simply add 0x110000 at the end of the non-empty returned list if it isn't
1899already that value; and pop that value if it is; like:
1900
1901 my @list = prop_invlist("foo");
1902 if (@list) {
1903 if ($list[-1] == 0x110000) {
1904 pop @list; # Defeat the turning on for above Unicode
1905 }
1906 else {
1907 push @list, 0x110000; # Turn off for above Unicode
1908 }
1909 }
1910
1911It is a simple matter to expand out an inversion list to a full list of all
1912code points that have the property-value:
1913
1914 my @invlist = prop_invlist($property_name);
1915 die "empty" unless @invlist;
1916 my @full_list;
1917 for (my $i = 0; $i < @invlist; $i += 2) {
1918 my $upper = ($i + 1) < @invlist
1919 ? $invlist[$i+1] - 1 # In range
1920 : $Unicode::UCD::MAX_CP; # To infinity. You may want
1921 # to stop much much earlier;
1922 # going this high may expose
1923 # perl deficiencies with very
1924 # large numbers.
1925 for my $j ($invlist[$i] .. $upper) {
1926 push @full_list, $j;
1927 }
1928 }
1929
1930C<prop_invlist> does not know about any user-defined nor Perl internal-only
1931properties, and will return C<undef> if called with one of those.
1932
1933=cut
1934
1935# User-defined properties could be handled with some changes to utf8_heavy.pl;
1936# and implementing here of dealing with EXTRAS. If done, consideration should
1937# be given to the fact that the user subroutine could return different results
1938# with each call; security issues need to be thought about.
1939
1940# These are created by mktables for this routine and stored in unicore/UCD.pl
1941# where their structures are described.
1942our %loose_defaults;
1943our $MAX_UNICODE_CODEPOINT;
1944
1945sub prop_invlist ($) {
1946 my $prop = $_[0];
1947 return if ! defined $prop;
1948
1949 require "utf8_heavy.pl";
1950
1951 # Warnings for these are only for regexes, so not applicable to us
1952 no warnings 'deprecated';
1953
1954 # Get the swash definition of the property-value.
1955 my $swash = utf8::SWASHNEW(__PACKAGE__, $prop, undef, 1, 0);
1956
1957 # Fail if not found, or isn't a boolean property-value, or is a
1958 # user-defined property, or is internal-only.
1959 return if ! $swash
1960 || ref $swash eq ""
1961 || $swash->{'BITS'} != 1
1962 || $swash->{'USER_DEFINED'}
1963 || $prop =~ /^\s*_/;
1964
1965 if ($swash->{'EXTRAS'}) {
1966 carp __PACKAGE__, "::prop_invlist: swash returned for $prop unexpectedly has EXTRAS magic";
1967 return;
1968 }
1969 if ($swash->{'SPECIALS'}) {
1970 carp __PACKAGE__, "::prop_invlist: swash returned for $prop unexpectedly has SPECIALS magic";
1971 return;
1972 }
1973
1974 my @invlist;
1975
1976 # The input lines look like:
1977 # 0041\t005A # [26]
1978 # 005F
1979
1980 # Split into lines, stripped of trailing comments
1981 foreach my $range (split "\n",
1982 $swash->{'LIST'} =~ s/ \s* (?: \# .* )? $ //xmgr)
1983 {
1984 # And find the beginning and end of the range on the line
1985 my ($hex_begin, $hex_end) = split "\t", $range;
1986 my $begin = hex $hex_begin;
1987
a39cc031
KW
1988 # If the new range merely extends the old, we remove the marker
1989 # created the last time through the loop for the old's end, which
1990 # causes the new one's end to be used instead.
1991 if (@invlist && $begin == $invlist[-1]) {
1992 pop @invlist;
1993 }
1994 else {
2f3f243e
KW
1995 # Add the beginning of the range
1996 push @invlist, $begin;
a39cc031 1997 }
681d705c
KW
1998
1999 if (defined $hex_end) { # The next item starts with the code point 1
2000 # beyond the end of the range.
2001 push @invlist, hex($hex_end) + 1;
2002 }
2003 else { # No end of range, is a single code point.
2004 push @invlist, $begin + 1;
2005 }
2006 }
2007
2008 require "unicore/UCD.pl";
2009 my $FIRST_NON_UNICODE = $MAX_UNICODE_CODEPOINT + 1;
2010
2011 # Could need to be inverted: add or subtract a 0 at the beginning of the
2012 # list. And to keep it from matching non-Unicode, add or subtract the
2013 # first non-unicode code point.
2014 if ($swash->{'INVERT_IT'}) {
2015 if (@invlist && $invlist[0] == 0) {
2016 shift @invlist;
2017 }
2018 else {
2019 unshift @invlist, 0;
2020 }
2021 if (@invlist && $invlist[-1] == $FIRST_NON_UNICODE) {
2022 pop @invlist;
2023 }
2024 else {
2025 push @invlist, $FIRST_NON_UNICODE;
2026 }
2027 }
2028
2029 # Here, the list is set up to include only Unicode code points. But, if
2030 # the table is the default one for the property, it should contain all
2031 # non-Unicode code points. First calculate the loose name for the
2032 # property. This is done even for strict-name properties, as the data
2033 # structure that mktables generates for us is set up so that we don't have
2034 # to worry about that. The property-value needs to be split if compound,
2035 # as the loose rules need to be independently calculated on each part. We
2036 # know that it is syntactically valid, or SWASHNEW would have failed.
2037
2038 $prop = lc $prop;
2039 my ($prop_only, $table) = split /\s*[:=]\s*/, $prop;
2040 if ($table) {
2041
2042 # May have optional prefixed 'is'
2043 $prop = utf8::_loose_name($prop_only) =~ s/^is//r;
2044 $prop = $utf8::loose_property_name_of{$prop};
2045 $prop .= "=" . utf8::_loose_name($table);
2046 }
2047 else {
2048 $prop = utf8::_loose_name($prop);
2049 }
2050 if (exists $loose_defaults{$prop}) {
2051
2052 # Here, is the default table. If a range ended with 10ffff, instead
2053 # continue that range to infinity, by popping the 110000; otherwise,
2054 # add the range from 11000 to infinity
2055 if (! @invlist || $invlist[-1] != $FIRST_NON_UNICODE) {
2056 push @invlist, $FIRST_NON_UNICODE;
2057 }
2058 else {
2059 pop @invlist;
2060 }
2061 }
2062
2063 return @invlist;
2064}
7319f91d 2065
62b3b855
KW
2066sub _search_invlist {
2067 # Find the range in the inversion list which contains a code point; that
2068 # is, find i such that l[i] <= code_point < l[i+1]
2069
2070 # If this is ever made public, could use to speed up .t specials. Would
2071 # need to use code point argument, as in other functions in this pm
2072
2073 my $list_ref = shift;
2074 my $code_point = shift;
2075 # Verify non-neg numeric XXX
2076
2077 my $max_element = @$list_ref - 1;
2078 return if ! $max_element < 0; # Undef if list is empty.
2079
2080 # Short cut something at the far-end of the table. This also allows us to
2081 # refer to element [$i+1] without fear of being out-of-bounds in the loop
2082 # below.
2083 return $max_element if $code_point >= $list_ref->[$max_element];
2084
2085 use integer; # want integer division
2086
2087 my $i = $max_element / 2;
2088
2089 my $lower = 0;
2090 my $upper = $max_element;
2091 while (1) {
2092
2093 if ($code_point >= $list_ref->[$i]) {
2094
2095 # Here we have met the lower constraint. We can quit if we
2096 # also meet the upper one.
2097 last if $code_point < $list_ref->[$i+1];
2098
2099 $lower = $i; # Still too low.
2100
2101 }
2102 else {
2103
2104 # Here, $code_point < $list_ref[$i], so look lower down.
2105 $upper = $i;
2106 }
2107
2108 # Split search domain in half to try again.
2109 my $temp = ($upper + $lower) / 2;
2110
2111 # No point in continuing unless $i changes for next time
2112 # in the loop.
2113 return $i if $temp == $i;
2114 $i = $temp;
2115 } # End of while loop
2116
2117 # Here we have found the offset
2118 return $i;
2119}
2120
2121=pod
2122
2123=head2 B<prop_invmap()>
2124
2125 use Unicode::UCD 'prop_invmap';
2126 my ($list_ref, $map_ref, $format, $missing)
2127 = prop_invmap("General Category");
2128
2129C<prop_invmap> is used to get the complete mapping definition for a property,
2130in the form of an inversion map. An inversion map consists of two parallel
2131arrays. One is an ordered list of code points that mark range beginnings, and
2132the other gives the value (or mapping) that all code points in the
2133corresponding range have.
2134
2135C<prop_invmap> is called with the name of the desired property. The name is
2136loosely matched, meaning that differences in case, white-space, hyphens, and
2137underscores are not meaningful (except for the trailing underscore in the
2138old-form grandfathered-in property C<"L_">, which is better written as C<"LC">,
2139or even better, C<"Gc=LC">).
2140
2141Many Unicode properties have more than one name (or alias). C<prop_invmap>
2142understands all of these, including Perl extensions to them. Ambiguities are
2143resolved as described above for L</prop_aliases()>. The Perl internal
2144property "Perl_Decimal_Digit, described below, is also accepted. C<undef> is
2145returned if the property name is unknown.
ee94c7d1
KW
2146See L<perluniprops/Properties accessible through Unicode::UCD> for the
2147properties acceptable as inputs to this function.
62b3b855
KW
2148
2149It is a fatal error to call this function except in list context.
2150
2151In addition to the the two arrays that form the inversion map, C<prop_invmap>
2152returns two other values; one is a scalar that gives some details as to the
2153format of the entries of the map array; the other is used for specialized
2154purposes, described at the end of this section.
2155
2156This means that C<prop_invmap> returns a 4 element list. For example,
2157
2158 my ($blocks_ranges_ref, $blocks_maps_ref, $format, $default)
2159 = prop_invmap("Block");
2160
2161In this call, the two arrays will be populated as shown below (for Unicode
21626.0):
2163
2164 Index @blocks_ranges @blocks_maps
2165 0 0x0000 Basic Latin
2166 1 0x0080 Latin-1 Supplement
2167 2 0x0100 Latin Extended-A
2168 3 0x0180 Latin Extended-B
2169 4 0x0250 IPA Extensions
2170 5 0x02B0 Spacing Modifier Letters
2171 6 0x0300 Combining Diacritical Marks
2172 7 0x0370 Greek and Coptic
2173 8 0x0400 Cyrillic
2174 ...
2175 233 0x2B820 No_Block
2176 234 0x2F800 CJK Compatibility Ideographs Supplement
2177 235 0x2FA20 No_Block
2178 236 0xE0000 Tags
2179 237 0xE0080 No_Block
2180 238 0xE0100 Variation Selectors Supplement
2181 239 0xE01F0 No_Block
2182 240 0xF0000 Supplementary Private Use Area-A
2183 241 0x100000 Supplementary Private Use Area-B
2184 242 0x110000 No_Block
2185
2186The first line (with Index [0]) means that the value for code point 0 is "Basic
2187Latin". The entry "0x0080" in the @blocks_ranges column in the second line
2188means that the value from the first line, "Basic Latin", extends to all code
2189points in the range from 0 up to but not including 0x0080, that is, through
647396da 2190127. In other words, the code points from 0 to 127 are all in the "Basic
62b3b855
KW
2191Latin" block. Similarly, all code points in the range from 0x0080 up to (but
2192not including) 0x0100 are in the block named "Latin-1 Supplement", etc.
2193(Notice that the return is the old-style block names; see L</Old-style versus
2194new-style block names>).
2195
2196The final line (with Index [242]) means that the value for all code points above
2197the legal Unicode maximum code point have the value "No_Block", which is the
2198term Unicode uses for a non-existing block.
2199
2200The arrays completely specify the mappings for all possible code points.
2201The final element in an inversion map returned by this function will always be
2202for the range that consists of all the code points that aren't legal Unicode,
2203but that are expressible on the platform. (That is, it starts with code point
22040x110000, the first code point above the legal Unicode maximum, and extends to
2205infinity.) The value for that range will be the same that any typical
2206unassigned code point has for the specified property. (Certain unassigned
2207code points are not "typical"; for example the non-character code points, or
2208those in blocks that are to be written right-to-left. The above-Unicode
2209range's value is not based on these atypical code points.) It could be argued
2210that, instead of treating these as unassigned Unicode code points, the value
2211for this range should be C<undef>. If you wish, you can change the returned
2212arrays accordingly.
2213
2214The maps are almost always simple scalars that should be interpreted as-is.
2215These values are those given in the Unicode-supplied data files, which may be
2216inconsistent as to capitalization and as to which synonym for a property-value
2217is given. The results may be normalized by using the L</prop_value_aliases()>
2218function.
2219
2220There are exceptions to the simple scalar maps. Some properties have some
2221elements in their map list that are themselves lists of scalars; and some
2222special strings are returned that are not to be interpreted as-is. Element
2223[2] (placed into C<$format> in the example above) of the returned four element
647396da 2224list tells you if the map has any of these special elements or not, as follows:
62b3b855
KW
2225
2226=over
2227
dc8d8ea6 2228=item B<C<s>>
62b3b855
KW
2229
2230means all the elements of the map array are simple scalars, with no special
2231elements. Almost all properties are like this, like the C<block> example
2232above.
2233
dc8d8ea6 2234=item B<C<sl>>
62b3b855 2235
647396da 2236means that some of the map array elements have the form given by C<"s">, and
62b3b855
KW
2237the rest are lists of scalars. For example, here is a portion of the output
2238of calling C<prop_invmap>() with the "Script Extensions" property:
2239
2240 @scripts_ranges @scripts_maps
2241 ...
c2ca0207
KW
2242 0x0953 Devanagari
2243 0x0964 [ Bengali, Devanagari, Gurumukhi, Oriya ]
2244 0x0966 Devanagari
62b3b855
KW
2245 0x0970 Common
2246
647396da
KW
2247Here, the code points 0x964 and 0x965 are both used in Bengali,
2248Devanagari, Gurmukhi, and Oriya, but no other scripts.
62b3b855 2249
647396da 2250The Name_Alias property is also of this form. But each scalar consists of two
58b75e36 2251components: 1) the name, and 2) the type of alias this is. They are
7620cb10
KW
2252separated by a colon and a space. In Unicode 6.1, there are several alias types:
2253
2254=over
2255
2256=item C<correction>
2257
2258indicates that the name is a corrected form for the
2259original name (which remains valid) for the same code point.
2260
2261=item C<control>
2262
2263adds a new name for a control character.
2264
2265=item C<alternate>
2266
2267is an alternate name for a character
2268
2269=item C<figment>
2270
2271is a name for a character that has been documented but was never in any
2272actual standard.
2273
2274=item C<abbreviation>
2275
2276is a common abbreviation for a character
2277
2278=back
2279
2280The lists are ordered (roughly) so the most preferred names come before less
2281preferred ones.
58b75e36
KW
2282
2283For example,
2284
7620cb10
KW
2285 @aliases_ranges @alias_maps
2286 ...
2287 0x009E [ 'PRIVACY MESSAGE: control', 'PM: abbreviation' ]
2288 0x009F [ 'APPLICATION PROGRAM COMMAND: control',
2289 'APC: abbreviation'
2290 ]
2291 0x00A0 'NBSP: abbreviation'
2292 0x00A1 ""
2293 0x00AD 'SHY: abbreviation'
2294 0x00AE ""
2295 0x01A2 'LATIN CAPITAL LETTER GHA: correction'
2296 0x01A3 'LATIN SMALL LETTER GHA: correction'
2297 0x01A4 ""
58b75e36 2298 ...
58b75e36 2299
7620cb10
KW
2300A map to the empty string means that there is no alias defined for the code
2301point.
58b75e36 2302
d11155ec 2303=item B<C<a>>
62b3b855 2304
647396da 2305is like C<"s"> in that all the map array elements are scalars, but here they are
d11155ec
KW
2306restricted to all being integers, and some have to be adjusted (hence the name
2307C<"a">) to get the correct result. For example, in:
62b3b855
KW
2308
2309 my ($uppers_ranges_ref, $uppers_maps_ref, $format)
2310 = prop_invmap("Simple_Uppercase_Mapping");
2311
2312the returned arrays look like this:
2313
2314 @$uppers_ranges_ref @$uppers_maps_ref Note
bf7fe2df 2315 0 0
d11155ec 2316 97 65 'a' maps to 'A', b => B ...
bf7fe2df 2317 123 0
d11155ec 2318 181 924 MICRO SIGN => Greek Cap MU
bf7fe2df 2319 182 0
62b3b855
KW
2320 ...
2321
d11155ec
KW
2322Let's start with the second line. It says that the uppercase of code point 97
2323is 65; or C<uc("a")> == "A". But the line is for the entire range of code
2324points 97 through 122. To get the mapping for any code point in a range, you
2325take the offset it has from the beginning code point of the range, and add
2326that to the mapping for that first code point. So, the mapping for 122 ("z")
2327is derived by taking the offset of 122 from 97 (=25) and adding that to 65,
2328yielding 90 ("z"). Likewise for everything in between.
2329
2330The first line works the same way. The first map in a range is always the
2331correct value for its code point (because the adjustment is 0). Thus the
2332C<uc(chr(0))> is just itself. Also, C<uc(chr(1))> is also itself, as the
2333adjustment is 0+1-0 .. C<uc(chr(96))> is 96.
bf7fe2df 2334
d11155ec
KW
2335Requiring this simple adjustment allows the returned arrays to be
2336significantly smaller than otherwise, up to a factor of 10, speeding up
2337searching through them.
62b3b855 2338
d11155ec 2339=item B<C<al>>
62b3b855 2340
d11155ec 2341means that some of the map array elements have the form given by C<"a">, and
62b3b855
KW
2342the rest are ordered lists of code points.
2343For example, in:
2344
2345 my ($uppers_ranges_ref, $uppers_maps_ref, $format)
2346 = prop_invmap("Uppercase_Mapping");
2347
2348the returned arrays look like this:
2349
2350 @$uppers_ranges_ref @$uppers_maps_ref
bf7fe2df 2351 0 0
d11155ec 2352 97 65
bf7fe2df 2353 123 0
d11155ec 2354 181 924
bf7fe2df 2355 182 0
62b3b855
KW
2356 ...
2357 0x0149 [ 0x02BC 0x004E ]
bf7fe2df 2358 0x014A 0
d11155ec 2359 0x014B 330
62b3b855
KW
2360 ...
2361
2362This is the full Uppercase_Mapping property (as opposed to the
d11155ec 2363Simple_Uppercase_Mapping given in the example for format C<"a">). The only
62b3b855
KW
2364difference between the two in the ranges shown is that the code point at
23650x0149 (LATIN SMALL LETTER N PRECEDED BY APOSTROPHE) maps to a string of two
2366characters, 0x02BC (MODIFIER LETTER APOSTROPHE) followed by 0x004E (LATIN
2367CAPITAL LETTER N).
2368
d11155ec
KW
2369No adjustments are needed to entries that are references to arrays; each such
2370entry will have exactly one element in its range, so the offset is always 0.
bf7fe2df 2371
d11155ec 2372=item B<C<ae>>
b0b13ada 2373
d11155ec
KW
2374This is like C<"a">, but some elements are the empty string, and should not be
2375adjusted.
b0b13ada
KW
2376The one internal Perl property accessible by C<prop_invmap> is of this type:
2377"Perl_Decimal_Digit" returns an inversion map which gives the numeric values
2378that are represented by the Unicode decimal digit characters. Characters that
2379don't represent decimal digits map to the empty string, like so:
2380
2381 @digits @values
2382 0x0000 ""
d11155ec 2383 0x0030 0
b0b13ada 2384 0x003A: ""
d11155ec 2385 0x0660: 0
b0b13ada 2386 0x066A: ""
d11155ec 2387 0x06F0: 0
b0b13ada 2388 0x06FA: ""
d11155ec 2389 0x07C0: 0
b0b13ada 2390 0x07CA: ""
d11155ec 2391 0x0966: 0
b0b13ada
KW
2392 ...
2393
2394This means that the code points from 0 to 0x2F do not represent decimal digits;
d11155ec
KW
2395the code point 0x30 (DIGIT ZERO) represents 0; code point 0x31, (DIGIT ONE),
2396represents 0+1-0 = 1; ... code point 0x39, (DIGIT NINE), represents 0+9-0 = 9;
2397... code points 0x3A through 0x65F do not represent decimal digits; 0x660
2398(ARABIC-INDIC DIGIT ZERO), represents 0; ... 0x07C1 (NKO DIGIT ONE),
2399represents 0+1-0 = 1 ...
b0b13ada 2400
d11155ec 2401=item B<C<ale>>
62b3b855 2402
d11155ec
KW
2403is a combination of the C<"al"> type and the C<"ae"> type. Some of
2404the map array elements have the forms given by C<"al">, and
62b3b855
KW
2405the rest are the empty string. The property C<NFKC_Casefold> has this form.
2406An example slice is:
2407
2408 @$ranges_ref @$maps_ref Note
2409 ...
d11155ec
KW
2410 0x00AA 97 FEMININE ORDINAL INDICATOR => 'a'
2411 0x00AB 0
62b3b855 2412 0x00AD SOFT HYPHEN => ""
d11155ec 2413 0x00AE 0
62b3b855 2414 0x00AF [ 0x0020, 0x0304 ] MACRON => SPACE . COMBINING MACRON
d11155ec 2415 0x00B0 0
62b3b855
KW
2416 ...
2417
6cc45523
KW
2418=item B<C<r>>
2419
2420means that all the elements of the map array are either rational numbers or
2421the string C<"NaN">, meaning "Not a Number". A rational number is either an
2422integer, or two integers separated by a solidus (C<"/">). The second integer
2423represents the denominator of the division implied by the solidus, and is
2424guaranteed not to be 0. If you want to convert them to scalar numbers, you
2425can use something like this:
2426
2427 my ($invlist_ref, $invmap_ref, $format) = prop_invmap($property);
2428 if ($format && $format eq "r") {
2429 map { $_ = eval $_ } @$invmap_ref;
2430 }
2431
2432Here's some entries from the output of the property "Nv", which has format
2433C<"r">.
2434
2435 @numerics_ranges @numerics_maps Note
2436 0x00 "NaN"
2437 0x30 0 DIGIT 0
2438 0x31 1
2439 0x32 2
2440 ...
2441 0x37 7
2442 0x38 8
2443 0x39 9 DIGIT 9
2444 0x3A "NaN"
2445 0xB2 2 SUPERSCRIPT 2
2446 0xB3 3 SUPERSCRIPT 2
2447 0xB4 "NaN"
2448 0xB9 1 SUPERSCRIPT 1
2449 0xBA "NaN"
2450 0xBC 1/4 VULGAR FRACTION 1/4
2451 0xBD 1/2 VULGAR FRACTION 1/2
2452 0xBE 3/4 VULGAR FRACTION 3/4
2453 0xBF "NaN"
2454 0x660 0 ARABIC-INDIC DIGIT ZERO
2455
dc8d8ea6 2456=item B<C<n>>
62b3b855
KW
2457
2458means the Name property. All the elements of the map array are simple
2459scalars, but some of them contain special strings that require more work to
2460get the actual name.
2461
2462Entries such as:
2463
2464 CJK UNIFIED IDEOGRAPH-<code point>
2465
2466mean that the name for the code point is "CJK UNIFIED IDEOGRAPH-"
2467with the code point (expressed in hexadecimal) appended to it, like "CJK
647396da
KW
2468UNIFIED IDEOGRAPH-3403" (similarly for S<C<CJK COMPATIBILITY IDEOGRAPH-E<lt>code
2469pointE<gt>>>).
62b3b855
KW
2470
2471Also, entries like
2472
2473 <hangul syllable>
2474
2475means that the name is algorithmically calculated. This is easily done by
2476the function L<charnames/charnames::viacode(code)>.
2477
2478Note that for control characters (C<Gc=cc>), Unicode's data files have the
2479string "C<E<lt>controlE<gt>>", but the real name of each of these characters is the empty
7620cb10 2480string. This function returns that real name, the empty string. (There are
647396da
KW
2481names for these characters, but they are considered aliases, not the Name
2482property name, and are contained in the C<Name_Alias> property.)
62b3b855 2483
d11155ec 2484=item B<C<ad>>
62b3b855 2485
d11155ec 2486means the Decomposition_Mapping property. This property is like C<"al">
bea2c146 2487properties, except that one of the scalar elements is of the form:
62b3b855
KW
2488
2489 <hangul syllable>
2490
bea2c146
KW
2491This signifies that this entry should be replaced by the decompositions for
2492all the code points whose decomposition is algorithmically calculated. (All
647396da 2493of them are currently in one range and likely to remain so; the C<"n"> format
bea2c146 2494has this same entry.) These can be generated via the function
62b3b855
KW
2495L<Unicode::Normalize::NFD()|Unicode::Normalize>.
2496
62b3b855
KW
2497Note that the mapping is the one that is specified in the Unicode data files,
2498and to get the final decomposition, it may need to be applied recursively.
2499
2500=back
2501
d11155ec
KW
2502Note that a format begins with the letter "a" if and only the property it is
2503for requires adjustments by adding the offsets in multi-element ranges. For
2504all these properties, an entry should be adjusted only if the map is a scalar
2505which is an integer. That is, it must match the regular expression:
2506
2507 / ^ -? \d+ $ /xa
2508
2509Further, the first element in a range never needs adjustment, as the
2510adjustment would be just adding 0.
2511
62b3b855
KW
2512A binary search can be used to quickly find a code point in the inversion
2513list, and hence its corresponding mapping.
2514
2515The final element (index [3], assigned to C<$default> in the "block" example) in
2516the four element list returned by this function may be useful for applications
2517that wish to convert the returned inversion map data structure into some
2518other, such as a hash. It gives the mapping that most code points map to
2519under the property. If you establish the convention that any code point not
2520explicitly listed in your data structure maps to this value, you can
2521potentially make your data structure much smaller. As you construct your data
2522structure from the one returned by this function, simply ignore those ranges
2523that map to this value, generally called the "default" value. For example, to
2524convert to the data structure searchable by L</charinrange()>, you can follow
d11155ec 2525this recipe (valid only for properties that don't require adjustments):
62b3b855
KW
2526
2527 my ($list_ref, $map_ref, $format, $missing) = prop_invmap($property);
2528 my @range_list;
2529 for my $i (0 .. @$list_ref - 2) {
2530 next if $map_ref->[$i] eq $missing;
2531 push @range_list, [ $list_ref->[$i],
2532 $list_ref->[$i+1],
2533 $map_ref->[$i]
2534 ];
2535 }
2536
2537 print charinrange(\@range_list, $code_point), "\n";
2538
2539
2540With this, C<charinrange()> will return C<undef> if its input code point maps
2541to C<$missing>. You can avoid this by omitting the C<next> statement, and adding
2542a line after the loop to handle the final element of the inversion map.
2543
62b3b855
KW
2544
2545Note that the inversion maps returned for the C<Case_Folding> and
2546C<Simple_Case_Folding> properties do not include the Turkic-locale mappings.
2547Use L</casefold()> for these.
2548
62b3b855
KW
2549C<prop_invmap> does not know about any user-defined properties, and will
2550return C<undef> if called with one of those.
2551
2552=cut
2553
2554# User-defined properties could be handled with some changes to utf8_heavy.pl;
2555# if done, consideration should be given to the fact that the user subroutine
2556# could return different results with each call, which could lead to some
2557# security issues.
2558
2559# One could store things in memory so they don't have to be recalculated, but
2560# it is unlikely this will be called often, and some properties would take up
2561# significant memory.
2562
2563# These are created by mktables for this routine and stored in unicore/UCD.pl
2564# where their structures are described.
2565our @algorithmic_named_code_points;
2566our $HANGUL_BEGIN;
2567our $HANGUL_COUNT;
2568
2569sub prop_invmap ($) {
2570
2571 croak __PACKAGE__, "::prop_invmap: must be called in list context" unless wantarray;
2572
2573 my $prop = $_[0];
2574 return unless defined $prop;
2575
2576 # Fail internal properties
2577 return if $prop =~ /^_/;
2578
2579 # The values returned by this function.
2580 my (@invlist, @invmap, $format, $missing);
2581
2582 # The swash has two components we look at, the base list, and a hash,
2583 # named 'SPECIALS', containing any additional members whose mappings don't
2584 # fit into the the base list scheme of things. These generally 'override'
2585 # any value in the base list for the same code point.
2586 my $overrides;
2587
2588 require "utf8_heavy.pl";
2589 require "unicore/UCD.pl";
2590
2591RETRY:
2592
647396da
KW
2593 # If there are multiple entries for a single code point
2594 my $has_multiples = 0;
2595
62b3b855
KW
2596 # Try to get the map swash for the property. They have 'To' prepended to
2597 # the property name, and 32 means we will accept 32 bit return values.
647396da 2598 # The 0 means we aren't calling this from tr///.
62b3b855
KW
2599 my $swash = utf8::SWASHNEW(__PACKAGE__, "To$prop", undef, 32, 0);
2600
2601 # If didn't find it, could be because needs a proxy. And if was the
2602 # 'Block' or 'Name' property, use a proxy even if did find it. Finding it
647396da
KW
2603 # in these cases would be the result of the installation changing mktables
2604 # to output the Block or Name tables. The Block table gives block names
2605 # in the new-style, and this routine is supposed to return old-style block
2606 # names. The Name table is valid, but we need to execute the special code
2607 # below to add in the algorithmic-defined name entries.
34132297 2608 # And NFKCCF needs conversion, so handle that here too.
62b3b855 2609 if (ref $swash eq ""
34132297 2610 || $swash->{'TYPE'} =~ / ^ To (?: Blk | Na | NFKCCF ) $ /x)
62b3b855
KW
2611 {
2612
2613 # Get the short name of the input property, in standard form
2614 my ($second_try) = prop_aliases($prop);
2615 return unless $second_try;
2616 $second_try = utf8::_loose_name(lc $second_try);
2617
2618 if ($second_try eq "in") {
2619
2620 # This property is identical to age for inversion map purposes
2621 $prop = "age";
2622 goto RETRY;
2623 }
75e7c50b 2624 elsif ($second_try =~ / ^ s ( cf | [ltu] c ) $ /x) {
62b3b855 2625
75e7c50b
KW
2626 # These properties use just the LIST part of the full mapping,
2627 # which includes the simple maps that are otherwise overridden by
2628 # the SPECIALS. So all we need do is to not look at the SPECIALS;
2629 # set $overrides to indicate that
62b3b855 2630 $overrides = -1;
62b3b855 2631
75e7c50b 2632 # The full name is the simple name stripped of its initial 's'
62b3b855
KW
2633 $prop = $second_try =~ s/^s//r;
2634 goto RETRY;
2635 }
2636 elsif ($second_try eq "blk") {
2637
2638 # We use the old block names. Just create a fake swash from its
2639 # data.
2640 _charblocks();
2641 my %blocks;
2642 $blocks{'LIST'} = "";
2643 $blocks{'TYPE'} = "ToBlk";
2644 $utf8::SwashInfo{ToBlk}{'missing'} = "No_Block";
2645 $utf8::SwashInfo{ToBlk}{'format'} = "s";
2646
2647 foreach my $block (@BLOCKS) {
2648 $blocks{'LIST'} .= sprintf "%x\t%x\t%s\n",
2649 $block->[0],
2650 $block->[1],
2651 $block->[2];
2652 }
2653 $swash = \%blocks;
2654 }
2655 elsif ($second_try eq "na") {
2656
2657 # Use the combo file that has all the Name-type properties in it,
2658 # extracting just the ones that are for the actual 'Name'
2659 # property. And create a fake swash from it.
2660 my %names;
2661 $names{'LIST'} = "";
2662 my $original = do "unicore/Name.pl";
62b3b855
KW
2663 my $algorithm_names = \@algorithmic_named_code_points;
2664
3b6a8189
KW
2665 # We need to remove the names from it that are aliases. For that
2666 # we need to also read in that table. Create a hash with the keys
2667 # being the code points, and the values being a list of the
2668 # aliases for the code point key.
2669 my ($aliases_code_points, $aliases_maps, undef, undef) =
2670 &prop_invmap('Name_Alias');
2671 my %aliases;
2672 for (my $i = 0; $i < @$aliases_code_points; $i++) {
2673 my $code_point = $aliases_code_points->[$i];
2674 $aliases{$code_point} = $aliases_maps->[$i];
2675
2676 # If not already a list, make it into one, so that later we
2677 # can treat things uniformly
2678 if (! ref $aliases{$code_point}) {
2679 $aliases{$code_point} = [ $aliases{$code_point} ];
2680 }
2681
2682 # Remove the alias type from the entry, retaining just the
2683 # name.
2684 map { s/:.*// } @{$aliases{$code_point}};
2685 }
2686
62b3b855
KW
2687 my $i = 0;
2688 foreach my $line (split "\n", $original) {
2689 my ($hex_code_point, $name) = split "\t", $line;
2690
2691 # Weeds out all comments, blank lines, and named sequences
2692 next if $hex_code_point =~ /\P{ASCII_HEX_DIGIT}/;
2693
2694 my $code_point = hex $hex_code_point;
2695
2696 # The name of all controls is the default: the empty string.
2697 # The set of controls is immutable, so these hard-coded
2698 # constants work.
2699 next if $code_point <= 0x9F
2700 && ($code_point <= 0x1F || $code_point >= 0x7F);
2701
3b6a8189
KW
2702 # If this is a name_alias, it isn't a name
2703 next if grep { $_ eq $name } @{$aliases{$code_point}};
62b3b855
KW
2704
2705 # If we are beyond where one of the special lines needs to
2706 # be inserted ...
3b6a8189 2707 while ($i < @$algorithm_names
62b3b855
KW
2708 && $code_point > $algorithm_names->[$i]->{'low'})
2709 {
2710
2711 # ... then insert it, ahead of what we were about to
2712 # output
3b6a8189 2713 $names{'LIST'} .= sprintf "%x\t%x\t%s\n",
62b3b855
KW
2714 $algorithm_names->[$i]->{'low'},
2715 $algorithm_names->[$i]->{'high'},
2716 $algorithm_names->[$i]->{'name'};
2717
62b3b855
KW
2718 # Done with this range.
2719 $i++;
2720
3b6a8189
KW
2721 # We loop until all special lines that precede the next
2722 # regular one are output.
62b3b855
KW
2723 }
2724
3b6a8189
KW
2725 # Here, is a normal name.
2726 $names{'LIST'} .= sprintf "%x\t\t%s\n", $code_point, $name;
2727 } # End of loop through all the names
62b3b855
KW
2728
2729 $names{'TYPE'} = "ToNa";
2730 $utf8::SwashInfo{ToNa}{'missing'} = "";
2731 $utf8::SwashInfo{ToNa}{'format'} = "n";
2732 $swash = \%names;
2733 }
2734 elsif ($second_try =~ / ^ ( d [mt] ) $ /x) {
2735
2736 # The file is a combination of dt and dm properties. Create a
2737 # fake swash from the portion that we want.
2738 my $original = do "unicore/Decomposition.pl";
2739 my %decomps;
2740
2741 if ($second_try eq 'dt') {
2742 $decomps{'TYPE'} = "ToDt";
2743 $utf8::SwashInfo{'ToDt'}{'missing'} = "None";
2744 $utf8::SwashInfo{'ToDt'}{'format'} = "s";
d11155ec 2745 } # 'dm' is handled below, with 'nfkccf'
62b3b855
KW
2746
2747 $decomps{'LIST'} = "";
2748
2749 # This property has one special range not in the file: for the
2750 # hangul syllables
2751 my $done_hangul = 0; # Have we done the hangul range.
2752 foreach my $line (split "\n", $original) {
2753 my ($hex_lower, $hex_upper, $type_and_map) = split "\t", $line;
2754 my $code_point = hex $hex_lower;
2755 my $value;
bea2c146 2756 my $redo = 0;
62b3b855
KW
2757
2758 # The type, enclosed in <...>, precedes the mapping separated
2759 # by blanks
2760 if ($type_and_map =~ / ^ < ( .* ) > \s+ (.*) $ /x) {
2761 $value = ($second_try eq 'dt') ? $1 : $2
2762 }
2763 else { # If there is no type specified, it's canonical
2764 $value = ($second_try eq 'dt')
2765 ? "Canonical" :
2766 $type_and_map;
2767 }
2768
2769 # Insert the hangul range at the appropriate spot.
2770 if (! $done_hangul && $code_point > $HANGUL_BEGIN) {
2771 $done_hangul = 1;
2772 $decomps{'LIST'} .=
2773 sprintf "%x\t%x\t%s\n",
2774 $HANGUL_BEGIN,
2775 $HANGUL_BEGIN + $HANGUL_COUNT - 1,
2776 ($second_try eq 'dt')
2777 ? "Canonical"
2778 : "<hangul syllable>";
2779 }
2780
2781 # And append this to our constructed LIST.
2782 $decomps{'LIST'} .= "$hex_lower\t$hex_upper\t$value\n";
bea2c146
KW
2783
2784 redo if $redo;
62b3b855
KW
2785 }
2786 $swash = \%decomps;
2787 }
d11155ec
KW
2788 elsif ($second_try ne 'nfkccf') { # Don't know this property. Fail.
2789 return;
2790 }
2791
2792 if ($second_try eq 'nfkccf' || $second_try eq 'dm') {
34132297 2793
d11155ec
KW
2794 # The 'nfkccf' property is stored in the old format for backwards
2795 # compatibility for any applications that has read its file
2796 # directly before prop_invmap() existed.
2797 # And the code above has extracted the 'dm' property from its file
2798 # yielding the same format. So here we convert them to adjusted
2799 # format for compatibility with the other properties similar to
2800 # them.
2801 my %revised_swash;
34132297 2802
d11155ec 2803 # We construct a new converted list.
34132297 2804 my $list = "";
d11155ec
KW
2805
2806 my @ranges = split "\n", $swash->{'LIST'};
2807 for (my $i = 0; $i < @ranges; $i++) {
2808 my ($hex_begin, $hex_end, $map) = split "\t", $ranges[$i];
2809
2810 # The dm property has maps that are space separated sequences
2811 # of code points, as well as the special entry "<hangul
2812 # syllable>, which also contains a blank.
2813 my @map = split " ", $map;
2814 if (@map > 1) {
2815
2816 # If it's just the special entry, append as-is.
2817 if ($map eq '<hangul syllable>') {
2818 $list .= "$ranges[$i]\n";
2819 }
2820 else {
2821
2822 # These should all single-element ranges.
2823 croak __PACKAGE__, "Not expecting a mapping with multiple code points in a multi-element range, $ranges[$i]" if $hex_end ne "";
2824
2825 # Convert them to decimal, as that's what's expected.
2826 $list .= "$hex_begin\t\t"
2827 . join(" ", map { hex } @map)
2828 . "\n";
2829 }
2830 next;
2831 }
2832
2833 # Here, the mapping doesn't have a blank, is for a single code
2834 # point.
34132297
KW
2835 my $begin = hex $hex_begin;
2836 my $end = (defined $hex_end && $hex_end ne "")
2837 ? hex $hex_end
2838 : $begin;
d11155ec
KW
2839
2840 # Again, the output is to be in decimal.
34132297 2841 my $decimal_map = hex $map;
d11155ec
KW
2842
2843 # We know that multi-element ranges with the same mapping
2844 # should not be adjusted, as after the adjustment
2845 # multi-element ranges are for consecutive increasing code
2846 # points. Further, the final element in the list won't be
2847 # adjusted, as there is nothing after it to include in the
2848 # adjustment
2849 if ($begin != $end || $i == @ranges -1) {
2850
2851 # So just convert these to single-element ranges
2852 foreach my $code_point ($begin .. $end) {
2853 $list .= sprintf("%04X\t\t%d\n",
2854 $code_point, $decimal_map);
2855 }
34132297 2856 }
d11155ec 2857 else {
34132297 2858
d11155ec
KW
2859 # Here, we have a candidate for adjusting. What we do is
2860 # look through the subsequent adjacent elements in the
2861 # input. If the map to the next one differs by 1 from the
2862 # one before, then we combine into a larger range with the
2863 # initial map. Loop doing this until we find one that
2864 # can't be combined.
2865
2866 my $offset = 0; # How far away are we from the initial
2867 # map
2868 my $squished = 0; # ? Did we squish at least two
2869 # elements together into one range
2870 for ( ; $i < @ranges; $i++) {
2871 my ($next_hex_begin, $next_hex_end, $next_map)
2872 = split "\t", $ranges[$i+1];
2873
2874 # In the case of 'dm', the map may be a sequence of
2875 # multiple code points, which are never combined with
2876 # another range
2877 last if $next_map =~ / /;
2878
2879 $offset++;
2880 my $next_decimal_map = hex $next_map;
2881
2882 # If the next map is not next in sequence, it
2883 # shouldn't be combined.
2884 last if $next_decimal_map != $decimal_map + $offset;
2885
2886 my $next_begin = hex $next_hex_begin;
2887
2888 # Likewise, if the next element isn't adjacent to the
2889 # previous one, it shouldn't be combined.
2890 last if $next_begin != $begin + $offset;
2891
2892 my $next_end = (defined $next_hex_end
2893 && $next_hex_end ne "")
2894 ? hex $next_hex_end
2895 : $next_begin;
2896
2897 # And finally, if the next element is a multi-element
2898 # range, it shouldn't be combined.
2899 last if $next_end != $next_begin;
2900
2901 # Here, we will combine. Loop to see if we should
2902 # combine the next element too.
2903 $squished = 1;
2904 }
2905
2906 if ($squished) {
2907
2908 # Here, 'i' is the element number of the last element to
2909 # be combined, and the range is single-element, or we
2910 # wouldn't be combining. Get it's code point.
2911 my ($hex_end, undef, undef) = split "\t", $ranges[$i];
2912 $list .= "$hex_begin\t$hex_end\t$decimal_map\n";
2913 } else {
2914
2915 # Here, no combining done. Just appen the initial
2916 # (and current) values.
2917 $list .= "$hex_begin\t\t$decimal_map\n";
2918 }
2919 }
2920 } # End of loop constructing the converted list
2921
2922 # Finish up the data structure for our converted swash
2923 my $type = ($second_try eq 'nfkccf') ? 'ToNFKCCF' : 'ToDm';
2924 $revised_swash{'LIST'} = $list;
2925 $revised_swash{'TYPE'} = $type;
2926 $revised_swash{'SPECIALS'} = $swash->{'SPECIALS'};
2927 $swash = \%revised_swash;
2928
2929 $utf8::SwashInfo{$type}{'missing'} = 0;
2930 $utf8::SwashInfo{$type}{'format'} = 'a';
62b3b855
KW
2931 }
2932 }
2933
2934 if ($swash->{'EXTRAS'}) {
2935 carp __PACKAGE__, "::prop_invmap: swash returned for $prop unexpectedly has EXTRAS magic";
2936 return;
2937 }
2938
2939 # Here, have a valid swash return. Examine it.
34132297 2940 my $returned_prop = $swash->{'TYPE'};
62b3b855
KW
2941
2942 # All properties but binary ones should have 'missing' and 'format'
2943 # entries
2944 $missing = $utf8::SwashInfo{$returned_prop}{'missing'};
2945 $missing = 'N' unless defined $missing;
2946
2947 $format = $utf8::SwashInfo{$returned_prop}{'format'};
2948 $format = 'b' unless defined $format;
2949
d11155ec
KW
2950 my $requires_adjustment = $format =~ /^a/;
2951
62b3b855
KW
2952 # The LIST input lines look like:
2953 # ...
2954 # 0374\t\tCommon
2955 # 0375\t0377\tGreek # [3]
2956 # 037A\t037D\tGreek # [4]
2957 # 037E\t\tCommon
2958 # 0384\t\tGreek
2959 # ...
2960 #
2961 # Convert them to like
2962 # 0374 => Common
2963 # 0375 => Greek
2964 # 0378 => $missing
2965 # 037A => Greek
2966 # 037E => Common
2967 # 037F => $missing
2968 # 0384 => Greek
2969 #
2970 # For binary properties, the final non-comment column is absent, and
2971 # assumed to be 'Y'.
2972
2973 foreach my $range (split "\n", $swash->{'LIST'}) {
2974 $range =~ s/ \s* (?: \# .* )? $ //xg; # rmv trailing space, comments
2975
2976 # Find the beginning and end of the range on the line
2977 my ($hex_begin, $hex_end, $map) = split "\t", $range;
2978 my $begin = hex $hex_begin;
2979 my $end = (defined $hex_end && $hex_end ne "")
2980 ? hex $hex_end
2981 : $begin;
2982
92bcf67b
KW
2983 # Each time through the loop (after the first):
2984 # $invlist[-2] contains the beginning of the previous range processed
2985 # $invlist[-1] contains the end+1 of the previous range processed
2986 # $invmap[-2] contains the value of the previous range processed
2987 # $invmap[-1] contains the default value for missing ranges ($missing)
2988 #
2989 # Thus, things are set up for the typical case of a new non-adjacent
2990 # range of non-missings to be added. But, if the new range is
dc8d8ea6 2991 # adjacent, it needs to replace the [-1] element; and if the new
92bcf67b
KW
2992 # range is a multiple value of the previous one, it needs to be added
2993 # to the [-2] map element.
2994
2995 # The first time through, everything will be empty. If the property
2996 # doesn't have a range that begins at 0, add one that maps to $missing
62b3b855
KW
2997 if (! @invlist) {
2998 if ($begin != 0) {
2999 push @invlist, 0;
3000 push @invmap, $missing;
3001 }
3002 }
e35c6019
KW
3003 elsif (@invlist > 1 && $invlist[-2] == $begin) {
3004
3005 # Here we handle the case where the input has multiple entries for
3006 # each code point. mktables should have made sure that each such
3007 # range contains only one code point. At this point, $invlist[-1]
3008 # is the $missing that was added at the end of the last loop
3009 # iteration, and [-2] is the last real input code point, and that
3010 # code point is the same as the one we are adding now, making the
3011 # new one a multiple entry. Add it to the existing entry, either
3012 # by pushing it to the existing list of multiple entries, or
3013 # converting the single current entry into a list with both on it.
3014 # This is all we need do for this iteration.
3015
3016 if ($end != $begin) {
3017 croak __PACKAGE__, "Multiple maps per code point in '$prop' require single-element ranges: begin=$begin, end=$end, map=$map";
3018 }
3019 if (! ref $invmap[-2]) {
3020 $invmap[-2] = [ $invmap[-2], $map ];
3021 }
3022 else {
3023 push @{$invmap[-2]}, $map;
3024 }
3025 $has_multiples = 1;
3026 next;
3027 }
62b3b855
KW
3028 elsif ($invlist[-1] == $begin) {
3029
3030 # If the input isn't in the most compact form, so that there are
3031 # two adjacent ranges that map to the same thing, they should be
d11155ec
KW
3032 # combined (EXCEPT where the arrays require adjustments, in which
3033 # case everything is already set up correctly). This happens in
3034 # our constructed dt mapping, as Element [-2] is the map for the
3035 # latest range so far processed. Just set the beginning point of
3036 # the map to $missing (in invlist[-1]) to 1 beyond where this
3037 # range ends. For example, in
62b3b855
KW
3038 # 12\t13\tXYZ
3039 # 14\t17\tXYZ
3040 # we have set it up so that it looks like
3041 # 12 => XYZ
3042 # 14 => $missing
3043 #
3044 # We now see that it should be
3045 # 12 => XYZ
3046 # 18 => $missing
d11155ec 3047 if (! $requires_adjustment && @invlist > 1 && ( (defined $map)
c887f93f
KW
3048 ? $invmap[-2] eq $map
3049 : $invmap[-2] eq 'Y'))
3050 {
62b3b855
KW
3051 $invlist[-1] = $end + 1;
3052 next;
3053 }
3054
3055 # Here, the range started in the previous iteration that maps to
3056 # $missing starts at the same code point as this range. That
3057 # means there is no gap to fill that that range was intended for,
3058 # so we just pop it off the parallel arrays.
3059 pop @invlist;
3060 pop @invmap;
3061 }
3062
3063 # Add the range beginning, and the range's map.
3064 push @invlist, $begin;
d11155ec 3065 if ($returned_prop eq 'ToDm') {
62b3b855
KW
3066
3067 # The decomposition maps are either a line like <hangul syllable>
3068 # which are to be taken as is; or a sequence of code points in hex
3069 # and separated by blanks. Convert them to decimal, and if there
3070 # is more than one, use an anonymous array as the map.
3071 if ($map =~ /^ < /x) {
3072 push @invmap, $map;
3073 }
3074 else {
bea2c146 3075 my @map = split " ", $map;
62b3b855
KW
3076 if (@map == 1) {
3077 push @invmap, $map[0];
3078 }
3079 else {
3080 push @invmap, \@map;
3081 }
3082 }
3083 }
3084 else {
3085
3086 # Otherwise, convert hex formatted list entries to decimal; add a
3087 # 'Y' map for the missing value in binary properties, or
3088 # otherwise, use the input map unchanged.
3089 $map = ($format eq 'x')
3090 ? hex $map
3091 : $format eq 'b'
3092 ? 'Y'
3093 : $map;
3094 push @invmap, $map;
3095 }
3096
3097 # We just started a range. It ends with $end. The gap between it and
3098 # the next element in the list must be filled with a range that maps
3099 # to the default value. If there is no gap, the next iteration will
3100 # pop this, unless there is no next iteration, and we have filled all
3101 # of the Unicode code space, so check for that and skip.
3102 if ($end < $MAX_UNICODE_CODEPOINT) {
3103 push @invlist, $end + 1;
3104 push @invmap, $missing;
3105 }
3106 }
3107
3108 # If the property is empty, make all code points use the value for missing
3109 # ones.
3110 if (! @invlist) {
3111 push @invlist, 0;
3112 push @invmap, $missing;
3113 }
3114
647396da 3115 # And add in standard element that all non-Unicode code points map to:
62b3b855
KW
3116 # $missing
3117 push @invlist, $MAX_UNICODE_CODEPOINT + 1;
3118 push @invmap, $missing;
3119
3120 # The second component of the map are those values that require
3121 # non-standard specification, stored in SPECIALS. These override any
3122 # duplicate code points in LIST. If we are using a proxy, we may have
3123 # already set $overrides based on the proxy.
3124 $overrides = $swash->{'SPECIALS'} unless defined $overrides;
3125 if ($overrides) {
3126
3127 # A negative $overrides implies that the SPECIALS should be ignored,
d11155ec 3128 # and a simple 'a' list is the value.
62b3b855 3129 if ($overrides < 0) {
d11155ec 3130 $format = 'a';
62b3b855
KW
3131 }
3132 else {
3133
3134 # Currently, all overrides are for properties that normally map to
3135 # single code points, but now some will map to lists of code
3136 # points (but there is an exception case handled below).
d11155ec 3137 $format = 'al';
62b3b855
KW
3138
3139 # Look through the overrides.
3140 foreach my $cp_maybe_utf8 (keys %$overrides) {
3141 my $cp;
3142 my @map;
3143
3144 # If the overrides came from SPECIALS, the code point keys are
3145 # packed UTF-8.
3146 if ($overrides == $swash->{'SPECIALS'}) {
3147 $cp = unpack("C0U", $cp_maybe_utf8);
3148 @map = unpack "U0U*", $swash->{'SPECIALS'}{$cp_maybe_utf8};
3149
3150 # The empty string will show up unpacked as an empty
3151 # array.
d11155ec 3152 $format = 'ale' if @map == 0;
62b3b855
KW
3153 }
3154 else {
3155
3156 # But if we generated the overrides, we didn't bother to
3157 # pack them, and we, so far, do this only for properties
d11155ec 3158 # that are 'a' ones.
62b3b855
KW
3159 $cp = $cp_maybe_utf8;
3160 @map = hex $overrides->{$cp};
d11155ec 3161 $format = 'a';
62b3b855
KW
3162 }
3163
3164 # Find the range that the override applies to.
3165 my $i = _search_invlist(\@invlist, $cp);
3166 if ($cp < $invlist[$i] || $cp >= $invlist[$i + 1]) {
3167 croak __PACKAGE__, "wrong_range, cp=$cp; i=$i, current=$invlist[$i]; next=$invlist[$i + 1]"
3168 }
3169
3170 # And what that range currently maps to
3171 my $cur_map = $invmap[$i];
3172
3173 # If there is a gap between the next range and the code point
3174 # we are overriding, we have to add elements to both arrays to
3175 # fill that gap, using the map that applies to it, which is
3176 # $cur_map, since it is part of the current range.
3177 if ($invlist[$i + 1] > $cp + 1) {
3178 #use feature 'say';
3179 #say "Before splice:";
3180 #say 'i-2=[', $i-2, ']', sprintf("%04X maps to %s", $invlist[$i-2], $invmap[$i-2]) if $i >= 2;
3181 #say 'i-1=[', $i-1, ']', sprintf("%04X maps to %s", $invlist[$i-1], $invmap[$i-1]) if $i >= 1;
3182 #say 'i =[', $i, ']', sprintf("%04X maps to %s", $invlist[$i], $invmap[$i]);
3183 #say 'i+1=[', $i+1, ']', sprintf("%04X maps to %s", $invlist[$i+1], $invmap[$i+1]) if $i < @invlist + 1;
3184 #say 'i+2=[', $i+2, ']', sprintf("%04X maps to %s", $invlist[$i+2], $invmap[$i+2]) if $i < @invlist + 2;
3185
3186 splice @invlist, $i + 1, 0, $cp + 1;
3187 splice @invmap, $i + 1, 0, $cur_map;
3188
3189 #say "After splice:";
3190 #say 'i-2=[', $i-2, ']', sprintf("%04X maps to %s", $invlist[$i-2], $invmap[$i-2]) if $i >= 2;
3191 #say 'i-1=[', $i-1, ']', sprintf("%04X maps to %s", $invlist[$i-1], $invmap[$i-1]) if $i >= 1;
3192 #say 'i =[', $i, ']', sprintf("%04X maps to %s", $invlist[$i], $invmap[$i]);
3193 #say 'i+1=[', $i+1, ']', sprintf("%04X maps to %s", $invlist[$i+1], $invmap[$i+1]) if $i < @invlist + 1;
3194 #say 'i+2=[', $i+2, ']', sprintf("%04X maps to %s", $invlist[$i+2], $invmap[$i+2]) if $i < @invlist + 2;
3195 }
3196
3197 # If the remaining portion of the range is multiple code
3198 # points (ending with the one we are replacing, guaranteed by
3199 # the earlier splice). We must split it into two
3200 if ($invlist[$i] < $cp) {
3201 $i++; # Compensate for the new element
3202
3203 #use feature 'say';
3204 #say "Before splice:";
3205 #say 'i-2=[', $i-2, ']', sprintf("%04X maps to %s", $invlist[$i-2], $invmap[$i-2]) if $i >= 2;
3206 #say 'i-1=[', $i-1, ']', sprintf("%04X maps to %s", $invlist[$i-1], $invmap[$i-1]) if $i >= 1;
3207 #say 'i =[', $i, ']', sprintf("%04X maps to %s", $invlist[$i], $invmap[$i]);
3208 #say 'i+1=[', $i+1, ']', sprintf("%04X maps to %s", $invlist[$i+1], $invmap[$i+1]) if $i < @invlist + 1;
3209 #say 'i+2=[', $i+2, ']', sprintf("%04X maps to %s", $invlist[$i+2], $invmap[$i+2]) if $i < @invlist + 2;
3210
3211 splice @invlist, $i, 0, $cp;
3212 splice @invmap, $i, 0, 'dummy';
3213
3214 #say "After splice:";
3215 #say 'i-2=[', $i-2, ']', sprintf("%04X maps to %s", $invlist[$i-2], $invmap[$i-2]) if $i >= 2;
3216 #say 'i-1=[', $i-1, ']', sprintf("%04X maps to %s", $invlist[$i-1], $invmap[$i-1]) if $i >= 1;
3217 #say 'i =[', $i, ']', sprintf("%04X maps to %s", $invlist[$i], $invmap[$i]);
3218 #say 'i+1=[', $i+1, ']', sprintf("%04X maps to %s", $invlist[$i+1], $invmap[$i+1]) if $i < @invlist + 1;
3219 #say 'i+2=[', $i+2, ']', sprintf("%04X maps to %s", $invlist[$i+2], $invmap[$i+2]) if $i < @invlist + 2;
3220 }
3221
3222 # Here, the range we are overriding contains a single code
3223 # point. The result could be the empty string, a single
3224 # value, or a list. If the last case, we use an anonymous
3225 # array.
3226 $invmap[$i] = (scalar @map == 0)
3227 ? ""
3228 : (scalar @map > 1)
3229 ? \@map
3230 : $map[0];
3231 }
3232 }
3233 }
3234 elsif ($format eq 'x') {
3235
647396da
KW
3236 # All hex-valued properties are really to code points, and have been
3237 # converted to decimal.
bf7fe2df 3238 $format = 'i';
62b3b855 3239 }
d11155ec
KW
3240 elsif ($returned_prop eq 'ToDm') {
3241 $format = 'ad';
62b3b855
KW
3242 }
3243 elsif ($format eq 'sw') { # blank-separated elements to form a list.
3244 map { $_ = [ split " ", $_ ] if $_ =~ / / } @invmap;
3245 $format = 'sl';
3246 }
3247 elsif ($returned_prop eq 'ToNameAlias') {
3248
3249 # This property currently doesn't have any lists, but theoretically
3250 # could
3251 $format = 'sl';
3252 }
b0b13ada 3253 elsif ($returned_prop eq 'ToPerlDecimalDigit') {
d11155ec 3254 $format = 'ae';
b0b13ada 3255 }
62b3b855
KW
3256 elsif ($format ne 'n' && $format ne 'r') {
3257
3258 # All others are simple scalars
3259 $format = 's';
3260 }
e35c6019
KW
3261 if ($has_multiples && $format !~ /l/) {
3262 croak __PACKAGE__, "Wrong format '$format' for prop_invmap('$prop'); should indicate has lists";
3263 }
62b3b855
KW
3264
3265 return (\@invlist, \@invmap, $format, $missing);
3266}
3267
55d7b906 3268=head2 Unicode::UCD::UnicodeVersion
10a6ecd2 3269
a452d459
KW
3270This returns the version of the Unicode Character Database, in other words, the
3271version of the Unicode standard the database implements. The version is a
3272string of numbers delimited by dots (C<'.'>).
10a6ecd2
JH
3273
3274=cut
3275
3276my $UNICODEVERSION;
3277
3278sub UnicodeVersion {
3279 unless (defined $UNICODEVERSION) {
3280 openunicode(\$VERSIONFH, "version");
ce066323 3281 local $/ = "\n";
10a6ecd2
JH
3282 chomp($UNICODEVERSION = <$VERSIONFH>);
3283 close($VERSIONFH);
3284 croak __PACKAGE__, "::VERSION: strange version '$UNICODEVERSION'"
3285 unless $UNICODEVERSION =~ /^\d+(?:\.\d+)+$/;
3286 }
3287 return $UNICODEVERSION;
3288}
3aa957f9 3289
a452d459
KW
3290=head2 B<Blocks versus Scripts>
3291
3292The difference between a block and a script is that scripts are closer
3293to the linguistic notion of a set of code points required to present
3294languages, while block is more of an artifact of the Unicode code point
3295numbering and separation into blocks of (mostly) 256 code points.
3296
3297For example the Latin B<script> is spread over several B<blocks>, such
3298as C<Basic Latin>, C<Latin 1 Supplement>, C<Latin Extended-A>, and
3299C<Latin Extended-B>. On the other hand, the Latin script does not
3300contain all the characters of the C<Basic Latin> block (also known as
3301ASCII): it includes only the letters, and not, for example, the digits
3302or the punctuation.
3303
3304For blocks see L<http://www.unicode.org/Public/UNIDATA/Blocks.txt>
3305
3306For scripts see UTR #24: L<http://www.unicode.org/unicode/reports/tr24/>
3307
3308=head2 B<Matching Scripts and Blocks>
3309
3310Scripts are matched with the regular-expression construct
3311C<\p{...}> (e.g. C<\p{Tibetan}> matches characters of the Tibetan script),
f200dd12 3312while C<\p{Blk=...}> is used for blocks (e.g. C<\p{Blk=Tibetan}> matches
a452d459
KW
3313any of the 256 code points in the Tibetan block).
3314
430fe03d
KW
3315=head2 Old-style versus new-style block names
3316
3317Unicode publishes the names of blocks in two different styles, though the two
3318are equivalent under Unicode's loose matching rules.
3319
3320The original style uses blanks and hyphens in the block names (except for
3321C<No_Block>), like so:
3322
3323 Miscellaneous Mathematical Symbols-B
3324
3325The newer style replaces these with underscores, like this:
3326
3327 Miscellaneous_Mathematical_Symbols_B
3328
3329This newer style is consistent with the values of other Unicode properties.
3330To preserve backward compatibility, all the functions in Unicode::UCD that
3331return block names (except one) return the old-style ones. That one function,
3332L</prop_value_aliases()> can be used to convert from old-style to new-style:
3333
3334 my $new_style = prop_values_aliases("block", $old_style);
3335
3336Perl also has single-form extensions that refer to blocks, C<In_Cyrillic>,
3337meaning C<Block=Cyrillic>. These have always been written in the new style.
3338
3339To convert from new-style to old-style, follow this recipe:
3340
3341 $old_style = charblock((prop_invlist("block=$new_style"))[0]);
3342
3343(which finds the range of code points in the block using C<prop_invlist>,
3344gets the lower end of the range (0th element) and then looks up the old name
3345for its block using C<charblock>).
3346
7620cb10
KW
3347Note that starting in Unicode 6.1, many of the block names have shorter
3348synonyms. These are always given in the new style.
3349
8b731da2
JH
3350=head1 BUGS
3351
3352Does not yet support EBCDIC platforms.
3353
561c79ed
JH
3354=head1 AUTHOR
3355
a18e976f 3356Jarkko Hietaniemi. Now maintained by perl5 porters.
561c79ed
JH
3357
3358=cut
3359
33601;