no warnings 'surrogate'; # surrogates can be inputs to this
use charnames ();
-our $VERSION = '0.57';
+our $VERSION = '0.58';
require Exporter;
is empty if I<code> has no decomposition; or is one or more codes
(separated by spaces) that, taken in order, represent a decomposition for
I<code>. Each has at least four hexdigits.
-The codes may be preceded by a word enclosed in angle brackets then a space,
+The codes may be preceded by a word enclosed in angle brackets, then a space,
like C<E<lt>compatE<gt> >, giving the type of decomposition
This decomposition may be an intermediate one whose components are also
=item B<decimal>
-if I<code> is a decimal digit this is its integer numeric value
+if I<code> represents a decimal digit, this is its integer numeric value
=item B<digit>
my $range = charblock('Armenian');
-With a L</code point argument> charblock() returns the I<block> the code point
+With a L</code point argument>, C<charblock()> returns the I<block> the code point
belongs to, e.g. C<Basic Latin>. The old-style block name is returned (see
L</Old-style versus new-style block names>).
If the code point is unassigned, this returns the block it would belong to if
See also L</Blocks versus Scripts>.
-If supplied with an argument that can't be a code point, charblock() tries to
+If supplied with an argument that can't be a code point, C<charblock()> tries to
do the opposite and interpret the argument as an old-style block name. On an
ASCII platform, the return value is a I<range set> with one range: an
anonymous list with a single element that consists of another anonymous list
whose first element is the first code point in the block, and whose second
-(and final) element is the final code point in the block. On an EBCDIC
+element is the final code point in the block. On an EBCDIC
platform, the first two Unicode blocks are not contiguous. Their range sets
-are lists containing I<start-of-range>, I<end-of-range> code point pairs. You
+are lists containing I<start-of-range>, I<end-of-range> code point pairs. You
can test whether a code point is in a range set using the L</charinrange()>
-function. If the argument is not a known block, C<undef> is returned.
+function. (To be precise, each I<range set> contains a third array element,
+after the range boundary ones: the old-style block name.)
+
+If the argument to C<charblock()> is not a known block, C<undef> is
+returned.
=cut
my $range = charscript('Thai');
-With a L</code point argument> charscript() returns the I<script> the
-code point belongs to, e.g. C<Latin>, C<Greek>, C<Han>.
+With a L</code point argument>, C<charscript()> returns the I<script> the
+code point belongs to, e.g., C<Latin>, C<Greek>, C<Han>.
If the code point is unassigned or the Unicode version being used is so early
that it doesn't have scripts, this function returns C<"Unknown">.
to do the opposite and interpret the argument as a script name. The
return value is a I<range set>: an anonymous list of lists that contain
I<start-of-range>, I<end-of-range> code point pairs. You can test whether a
-code point is in a range set using the L</charinrange()> function. If the
-argument is not a known script, C<undef> is returned.
+code point is in a range set using the L</charinrange()> function.
+(To be precise, each I<range set> contains a third array element,
+after the range boundary ones: the script name.)
+
+If the C<charscript()> argument is not a known script, C<undef> is returned.
See also L</Blocks versus Scripts>.
my $charblocks = charblocks();
-charblocks() returns a reference to a hash with the known block names
+C<charblocks()> returns a reference to a hash with the known block names
as the keys, and the code point ranges (see L</charblock()>) as the values.
The names are in the old-style (see L</Old-style versus new-style block
my $charscripts = charscripts();
-charscripts() returns a reference to a hash with the known script
+C<charscripts()> returns a reference to a hash with the known script
names as the keys, and the code point ranges (see L</charscript()>) as
the values.
In addition to using the C<\p{Blk=...}> and C<\P{Blk=...}> constructs, you
can also test whether a code point is in the I<range> as returned by
L</charblock()> and L</charscript()> or as the values of the hash returned
-by L</charblocks()> and L</charscripts()> by using charinrange():
+by L</charblocks()> and L</charscripts()> by using C<charinrange()>:
use Unicode::UCD qw(charscript charinrange);
my $compexcl = compexcl(0x09dc);
This routine returns C<undef> if the Unicode version being used is so early
-that it doesn't have this property. It is included for backwards
+that it doesn't have this property.
+
+C<compexcl()> is included for backwards
compatibility, but as of Perl 5.12 and more modern Unicode versions, for
most purposes it is probably more convenient to use one of the following
instead:
If used with a single argument in a scalar context, returns the string
consisting of the code points of the named sequence, or C<undef> if no
named sequence by that name exists. If used with a single argument in
-a list context, it returns the list of the ordinals of the code points. If used
-with no
-arguments in a list context, returns a hash with the names of the
-named sequences as the keys and the named sequences as strings as
+a list context, it returns the list of the ordinals of the code points.
+
+If used with no
+arguments in a list context, it returns a hash with the names of all the
+named sequences as the keys and their sequences as strings as
the values. Otherwise, it returns C<undef> or an empty list depending
on the context.
my $val = num("123");
my $one_quarter = num("\N{VULGAR FRACTION 1/4}");
-C<num> returns the numeric value of the input Unicode string; or C<undef> if it
+C<num()> returns the numeric value of the input Unicode string, or C<undef> if it
doesn't think the entire string has a completely valid, safe numeric value.
If the string is just one character in length, the Unicode numeric value