package charnames;
use strict;
use warnings;
-our $VERSION = '1.28';
+our $VERSION = '1.36';
use unicore::Name; # mktables-generated algorithmically-defined names
use _charnames (); # The submodule for this where most of the work gets done
1;
__END__
+=encoding utf8
+
=head1 NAME
charnames - access to Unicode character names and named character sequences; also define character names
use charnames qw(cyrillic greek);
print "\N{sigma} is Greek sigma, and \N{be} is Cyrillic b.\n";
+ use utf8;
use charnames ":full", ":alias" => {
e_ACUTE => "LATIN SMALL LETTER E WITH ACUTE",
mychar => 0xE8000, # Private use area
+ "自転車に乗る人" => "BICYCLIST"
};
print "\N{e_ACUTE} is a small letter e with an acute.\n";
print "\N{mychar} allows me to name private use characters.\n";
+ print "And I can create synonyms in other languages,",
+ " such as \N{自転車に乗る人} for BICYCLIST (U+1F6B4)\n";
use charnames ();
print charnames::viacode(0x1234); # prints "ETHIOPIC SYLLABLE SEE"
=back
-All forms other than C<S<"use charnames ();">> also enable the use of
-C<\N{I<CHARNAME>}> sequences to compile a Unicode character into a
-string, based on its name.
+Starting in Perl v5.16, any occurrence of C<\N{I<CHARNAME>}> sequences
+in a double-quotish string automatically loads this module with arguments
+C<:full> and C<:short> (described below) if it hasn't already been loaded with
+different arguments, in order to compile the named Unicode character into
+position in the string. Prior to v5.16, an explicit S<C<use charnames>> was
+required to enable this usage. (However, prior to v5.16, the form C<S<"use
+charnames ();">> did not enable C<\N{I<CHARNAME>}>.)
Note that C<\N{U+I<...>}>, where the I<...> is a hexadecimal number,
-also inserts a character into a string, but doesn't require the use of
-this pragma. The character it inserts is the one whose code point
+also inserts a character into a string.
+The character it inserts is the one whose code point
(ordinal value) is equal to the number. For example, C<"\N{U+263a}"> is
-the Unicode (white background, black foreground) smiley face; it doesn't
-require this pragma, whereas the equivalent, C<"\N{WHITE SMILING FACE}">
-does.
+the Unicode (white background, black foreground) smiley face,
+equivalent to C<"\N{WHITE SMILING FACE}">.
Also note, C<\N{I<...>}> can mean a regex quantifier instead of a character
name, when the I<...> is a number (or a comma-separated pair of numbers;
see L<perlreref/QUANTIFIERS>), and is not related to this pragma.
specified order). Customized aliases can override these, and are explained in
L</CUSTOM ALIASES>.
-For lookup of I<CHARNAME> inside a given script I<SCRIPTNAME>
+For lookup of I<CHARNAME> inside a given script I<SCRIPTNAME>,
this pragma looks in the table of standard Unicode names for the names
SCRIPTNAME CAPITAL LETTER CHARNAME
functionality, use
L<charnames::string_vianame()|/charnames::string_vianame(I<name>)>.
-For the C0 and C1 control characters (U+0000..U+001F, U+0080..U+009F)
-there are no official Unicode names but you can use instead the ISO 6429
-names (LINE FEED, ESCAPE, and so forth, and their abbreviations, LF,
-ESC, ...). In Unicode 3.2 (as of Perl 5.8) some naming changes took
-place, and ISO 6429 was updated, see L</ALIASES>. Since Unicode 6.0, it
-is deprecated to use C<BELL>. Instead use C<ALERT> (but C<BEL> will continue
-to work).
+Note, starting in Perl 5.18, the name C<BELL> refers to the Unicode character
+U+1F514, instead of the traditional U+0007. For the latter, use C<ALERT>
+or C<BEL>.
-If the input name is unknown, C<\N{NAME}> raises a warning and
-substitutes the Unicode REPLACEMENT CHARACTER (U+FFFD).
+It is a syntax error to use C<\N{NAME}> where C<NAME> is unknown.
For C<\N{NAME}>, it is a fatal error if C<use bytes> is in effect and the
input name is that of a character that won't fit into a byte (i.e., whose
C<:full>, but the trade-off may be worth it to you. Each individual look-up
takes very little time, and the results are cached, so the speed difference
would become a factor only in programs that do look-ups of many different
-spellings, and probably only when those look-ups are through vianame() and
-string_vianame(), since C<\N{...}> look-ups are done at compile time.
+spellings, and probably only when those look-ups are through C<vianame()> and
+C<string_vianame()>, since C<\N{...}> look-ups are done at compile time.
=head1 ALIASES
-A few aliases have been defined for convenience; instead of having
-to use the official names,
-
- LINE FEED (LF)
- FORM FEED (FF)
- CARRIAGE RETURN (CR)
- NEXT LINE (NEL)
-
-(yes, with parentheses), one can use
-
- LINE FEED
- FORM FEED
- CARRIAGE RETURN
- NEXT LINE
- LF
- FF
- CR
- NEL
-
-All the other standard abbreviations for the controls, such as C<ACK> for
-C<ACKNOWLEDGE> also can be used.
-
-One can also use
-
- BYTE ORDER MARK
- BOM
-
-and these abbreviations
-
- Abbreviation Full Name
-
- CGJ COMBINING GRAPHEME JOINER
- FVS1 MONGOLIAN FREE VARIATION SELECTOR ONE
- FVS2 MONGOLIAN FREE VARIATION SELECTOR TWO
- FVS3 MONGOLIAN FREE VARIATION SELECTOR THREE
- LRE LEFT-TO-RIGHT EMBEDDING
- LRM LEFT-TO-RIGHT MARK
- LRO LEFT-TO-RIGHT OVERRIDE
- MMSP MEDIUM MATHEMATICAL SPACE
- MVS MONGOLIAN VOWEL SEPARATOR
- NBSP NO-BREAK SPACE
- NNBSP NARROW NO-BREAK SPACE
- PDF POP DIRECTIONAL FORMATTING
- RLE RIGHT-TO-LEFT EMBEDDING
- RLM RIGHT-TO-LEFT MARK
- RLO RIGHT-TO-LEFT OVERRIDE
- SHY SOFT HYPHEN
- VS1 VARIATION SELECTOR-1
- .
- .
- .
- VS256 VARIATION SELECTOR-256
- WJ WORD JOINER
- ZWJ ZERO WIDTH JOINER
- ZWNJ ZERO WIDTH NON-JOINER
- ZWSP ZERO WIDTH SPACE
-
-For backward compatibility one can use the old names for
-certain C0 and C1 controls
-
- old new
-
- FILE SEPARATOR INFORMATION SEPARATOR FOUR
- GROUP SEPARATOR INFORMATION SEPARATOR THREE
- HORIZONTAL TABULATION CHARACTER TABULATION
- HORIZONTAL TABULATION SET CHARACTER TABULATION SET
- HORIZONTAL TABULATION WITH JUSTIFICATION CHARACTER TABULATION
- WITH JUSTIFICATION
- PARTIAL LINE DOWN PARTIAL LINE FORWARD
- PARTIAL LINE UP PARTIAL LINE BACKWARD
- RECORD SEPARATOR INFORMATION SEPARATOR TWO
- REVERSE INDEX REVERSE LINE FEED
- UNIT SEPARATOR INFORMATION SEPARATOR ONE
- VERTICAL TABULATION LINE TABULATION
- VERTICAL TABULATION SET LINE TABULATION SET
-
-but the old names in addition to giving the character
-will also give a warning about being deprecated.
-
-And finally, certain published variants are usable, including some for
-controls that have no Unicode names:
-
- name character
-
- END OF PROTECTED AREA END OF GUARDED AREA, U+0097
- HIGH OCTET PRESET U+0081
- HOP U+0081
- IND U+0084
- INDEX U+0084
- PAD U+0080
- PADDING CHARACTER U+0080
- PRIVATE USE 1 PRIVATE USE ONE, U+0091
- PRIVATE USE 2 PRIVATE USE TWO, U+0092
- SGC U+0099
- SINGLE GRAPHIC CHARACTER INTRODUCER U+0099
- SINGLE-SHIFT 2 SINGLE SHIFT TWO, U+008E
- SINGLE-SHIFT 3 SINGLE SHIFT THREE, U+008F
- START OF PROTECTED AREA START OF GUARDED AREA, U+0096
+Starting in Unicode 6.1 and Perl v5.16, Unicode defines many abbreviations and
+names that were formerly Perl extensions, and some additional ones that Perl
+did not previously accept. The list is getting too long to reproduce here,
+but you can get the complete list from the Unicode web site:
+L<http://www.unicode.org/Public/UNIDATA/NameAliases.txt>.
+
+Earlier versions of Perl accepted almost all the 6.1 names. These were most
+extensively documented in the v5.14 version of this pod:
+L<http://perldoc.perl.org/5.14.0/charnames.html#ALIASES>.
=head1 CUSTOM ALIASES
you're twisted enough, you can change C<"\N{LATIN CAPITAL LETTER A}"> to
mean C<"B">, etc.
-Note that an alias should not be something that is a legal curly
-brace-enclosed quantifier (see L<perlreref/QUANTIFIERS>). For example
-C<\N{123}> means to match 123 non-newline characters, and is not treated as a
-charnames alias. Aliases are discouraged from beginning with anything
-other than an alphabetic character and from containing anything other
-than alphanumerics, spaces, dashes, parentheses, and underscores.
-Currently they must be ASCII.
+Aliases must begin with a character that is alphabetic. After that, each may
+contain any combination of word (C<\w>) characters, SPACE (U+0020),
+HYPHEN-MINUS (U+002D), LEFT PARENTHESIS (U+0028), RIGHT PARENTHESIS (U+0029),
+and NO-BREAK SPACE (U+00A0). These last three should never have been allowed
+in names, and are retained for backwards compatibility only; they may be
+deprecated and removed in future releases of Perl, so don't use them for new
+names. (More precisely, the first character of a name you specify must be
+something that matches all of C<\p{ID_Start}>, C<\p{Alphabetic}>, and
+C<\p{Gc=Letter}>. This makes sure it is what any reasonable person would view
+as an alphabetic character. And, the continuation characters that match C<\w>
+must also match C<\p{ID_Continue}>.) Starting with Perl v5.18, any Unicode
+characters meeting the above criteria may be used; prior to that only
+Latin1-range characters were acceptable.
An alias can map to either an official Unicode character name (not a loose
matched name) or to a
To name a sequence of characters, use a
L<custom translator|/CUSTOM TRANSLATORS> (described below).
-=head1 charnames::viacode(I<code>)
-
-Returns the full name of the character indicated by the numeric code.
-For example,
-
- print charnames::viacode(0x2722);
-
-prints "FOUR TEARDROP-SPOKED ASTERISK".
-
-The name returned is the official name for the code point, if
-available; otherwise your custom alias for it. This means that your
-alias will only be returned for code points that don't have an official
-Unicode name (nor a Unicode version 1 name), such as private use code
-points, and the 4 control characters U+0080, U+0081, U+0084, and U+0099.
-If you define more than one name for the code point, it is indeterminate
-which one will be returned.
-
-The function returns C<undef> if no name is known for the code point.
-In Unicode the proper name of these is the empty string, which
-C<undef> stringifies to. (If you ask for a code point past the legal
-Unicode maximum of U+10FFFF that you haven't assigned an alias to, you
-get C<undef> plus a warning.)
-
-The input number must be a non-negative integer or a string beginning
-with C<"U+"> or C<"0x"> with the remainder considered to be a
-hexadecimal integer. A literal numeric constant must be unsigned; it
-will be interpreted as hex if it has a leading zero or contains
-non-decimal hex digits; otherwise it will be interpreted as decimal.
-
-Notice that the name returned for U+FEFF is "ZERO WIDTH NO-BREAK
-SPACE", not "BYTE ORDER MARK".
-
=head1 charnames::string_vianame(I<name>)
This is a runtime equivalent to C<\N{...}>. I<name> can be any expression
L<script list, C<:short> option|/DESCRIPTION>, or L<custom aliases|/CUSTOM
ALIASES> you may have defined.
-The only difference is that if the input name is unknown, C<string_vianame>
-returns C<undef> instead of the REPLACEMENT CHARACTER and does not raise a
-warning message.
+The only differences are due to the fact that C<string_vianame> is run-time
+and C<\N{}> is compile time. You can't interpolate inside a C<\N{}> (so
+C<\N{$variable}> doesn't work); and if the input name is unknown,
+C<string_vianame> returns C<undef> instead of it being a syntax error.
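A minimal sketch of the run-time case, assuming the pragma is loaded with C<:full>. The variable names are illustrative only, and the second name is assumed not to be a Unicode name or an alias you have defined:

```perl
use strict;
use warnings;
use charnames ":full";

# The name is held in a variable, so \N{...} cannot be used here.
my $name = "LATIN SMALL LETTER E WITH ACUTE";
my $char = charnames::string_vianame($name);
printf "U+%04X\n", ord $char if defined $char;

# An unknown name yields undef at run time instead of a syntax error.
my $missing = charnames::string_vianame("NO SUCH CHARACTER NAME");
print "unknown name\n" unless defined $missing;
```

Checking the result with C<defined> before using it keeps an unknown name from silently turning into an empty string.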
=head1 charnames::vianame(I<name>)
This is similar to C<string_vianame>. The main difference is that under most
-circumstances, vianame returns an ordinal code
+circumstances, C<vianame> returns an ordinal code
point, whereas C<string_vianame> returns a string. For example,
printf "U+%04X", charnames::vianame("FOUR TEARDROP-SPOKED ASTERISK");
See L</BUGS> for the circumstances in which the behavior differs
from that described above.
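As a rough illustration of the difference, the following sketch looks up the same name both ways. It assumes the pragma is loaded with C<:full> and that C<use bytes> is not in effect; the variable names are illustrative only:

```perl
use strict;
use warnings;
use charnames ":full";

my $name = "FOUR TEARDROP-SPOKED ASTERISK";

# vianame returns the ordinal code point ...
my $code = charnames::vianame($name);

# ... whereas string_vianame returns the character itself as a string.
my $str = charnames::string_vianame($name);

printf "U+%04X\n", $code;
print "ord of the string matches the code point\n" if ord($str) == $code;
```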
+=head1 charnames::viacode(I<code>)
+
+Returns the full name of the character indicated by the numeric code.
+For example,
+
+ print charnames::viacode(0x2722);
+
+prints "FOUR TEARDROP-SPOKED ASTERISK".
+
+The name returned is the "best" (defined below) official name or alias
+for the code point, if
+available; otherwise your custom alias for it, if defined; otherwise C<undef>.
+This means that your alias will only be returned for code points that don't
+have an official Unicode name (nor alias) such as private use code points.
+
+If you define more than one name for the code point, it is indeterminate
+which one will be returned.
+
+As mentioned, the function returns C<undef> if no name is known for the code
+point. In Unicode the proper name for these is the empty string, which
+C<undef> stringifies to. (If you ask for a code point past the legal
+Unicode maximum of U+10FFFF that you haven't assigned an alias to, you
+get C<undef> plus a warning.)
+
+The input number must be a non-negative integer, or a string beginning
+with C<"U+"> or C<"0x"> with the remainder considered to be a
+hexadecimal integer. A literal numeric constant must be unsigned; it
+will be interpreted as hex if it has a leading zero or contains
+non-decimal hex digits; otherwise it will be interpreted as decimal.
+
+As mentioned above under L</ALIASES>, Unicode 6.1 defines extra names
+(synonyms or aliases) for some code points, most of which were already
+available as Perl extensions. All these are accepted by C<\N{...}> and the
+other functions in this module, but C<viacode> has to choose which one
+name to return for a given input code point, so it returns the "best" name.
+To understand how this works, it is helpful to know more about the Unicode
+name properties. All code points actually have only a single name, which
+(starting in Unicode 2.0) can never change once a character has been assigned
+to the code point. But mistakes have been made in assigning names; for
+example, a clerical error was sometimes made during the publishing of the
+Standard that caused words to be misspelled, and there was no way to correct
+those. The Name_Alias property was eventually created to handle these
+situations. If a name was wrong, a corrected synonym would be published for
+it, using Name_Alias. C<viacode> will return that corrected synonym as the
+"best" name for a code point. (It is even possible, though it hasn't happened
+yet, that the correction itself will need to be corrected, and so another
+Name_Alias can be created for that code point; C<viacode> will return the
+most recent correction.)
+
+The Unicode name for each of the control characters (such as LINE FEED) is the
+empty string. However almost all had names assigned by other standards, such
+as the ASCII Standard, or were in common use. C<viacode> returns these names
+as the "best" ones available. Unicode 6.1 has created Name_Aliases for each
+of them, including alternate names, like NEW LINE. C<viacode> uses the
+original name, "LINE FEED", in preference to the alternate. Similarly the
+name returned for U+FEFF is "ZERO WIDTH NO-BREAK SPACE", not "BYTE ORDER
+MARK".
+
+Until Unicode 6.1, the 4 control characters U+0080, U+0081, U+0084, and U+0099
+had neither names nor aliases.
+To preserve backwards compatibility, any alias you define for these code
+points will be returned by this function, in preference to the official name.
+
+Some code points also have abbreviated names, such as "LF" or "NL".
+C<viacode> never returns these.
+
+Because a name correction may be added in future Unicode releases, the name
+that C<viacode> returns may change as a result. This is a rare event, but it
+does happen.
+
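The accepted input forms and the C<undef> case can be sketched as follows. This is only an illustration: the variable names are hypothetical, and it assumes no custom alias has been defined for the private use code point U+E000:

```perl
use strict;
use warnings;
use charnames ();

# The same code point given as a number, a "U+" string, and a "0x" string.
my @names = map { charnames::viacode($_) } 0x2722, "U+2722", "0x2722";
print "$_\n" for @names;

# A control character: a full name is returned, never an abbreviation.
my $control = charnames::viacode(0x0A);
print "$control\n" if defined $control;

# A private use code point with no custom alias has no name to return.
my $private = charnames::viacode(0xE000);
print defined $private ? $private : "(no name)", "\n";
```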
=head1 CUSTOM TRANSLATORS
The mechanism of translation of C<\N{...}> escapes is general and not
in effect and the character won't fit into a byte, it returns C<undef> and
raises a warning.
-Names must be ASCII characters only, which means that you are out of luck if
-you want to create aliases in a language where some or all the characters of
-the desired aliases are non-ASCII.
-
Since evaluation of the translation function (see L</CUSTOM
TRANSLATORS>) happens in the middle of compilation (of a string
literal), the translation function should not do any C<eval>s or