X-Git-Url: https://perl5.git.perl.org/perl5.git/blobdiff_plain/376d9008a264d010f49fb171a6506ba64f2cb864..fc273927378ed6a1a60a5758f7e36713630d5e13:/pod/perluniintro.pod

diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index e665d1a..beccd3c 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -5,39 +5,42 @@ perluniintro - Perl Unicode introduction
 
 =head1 DESCRIPTION
 
 This document gives a general idea of Unicode and how to use Unicode
-in Perl.
+in Perl.  See L</Further Resources> for references to more in-depth
+treatments of Unicode.
 
 =head2 Unicode
 
 Unicode is a character set standard which plans to codify all of the
 writing systems of the world, plus many other symbols.
 
-Unicode and ISO/IEC 10646 are coordinated standards that provide code
-points for characters in almost all modern character set standards,
-covering more than 30 writing systems and hundreds of languages,
+Unicode and ISO/IEC 10646 are coordinated standards that unify
+almost all other modern character set standards,
+covering more than 80 writing systems and hundreds of languages,
 including all commercially-important modern languages.  All characters
 in the largest Chinese, Japanese, and Korean dictionaries are also
 encoded.  The standards will eventually cover almost all characters in
 more than 250 writing systems and thousands of languages.
+Unicode 1.0 was released in October 1991, and 6.0 in October 2010.
 
 A Unicode I<character> is an abstract entity.  It is not bound to any
 particular integer width, especially not to the C language C<char>.
 Unicode is language-neutral and display-neutral: it does not encode the
-language of the text and it does not define fonts or other graphical
+language of the text, and it does not generally define fonts or other graphical
 layout details.  Unicode operates on characters and on text built from
 those characters.
 
 Unicode defines characters like C<LATIN CAPITAL LETTER A> or C<GREEK
 SMALL LETTER ALPHA> and unique numbers for the characters, in this
 case 0x0041 and 0x03B1, respectively.  These unique numbers are called
-I<code points>.
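The code point/ordinal correspondence described in this hunk maps directly onto Perl's C<chr> and C<ord> builtins, which convert between characters and their code points. A minimal illustration (not part of the patch itself), using the two code points named above:

```perl
use strict;
use warnings;

# chr() maps a Unicode code point (an "ordinal") to a character,
# and ord() maps a character back to its code point.
my $capital_a = chr(0x0041);   # LATIN CAPITAL LETTER A
my $alpha     = chr(0x03B1);   # GREEK SMALL LETTER ALPHA

printf "U+%04X\n", ord($capital_a);   # prints "U+0041"
printf "U+%04X\n", ord($alpha);       # prints "U+03B1"
```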
+I<code points>.  A code point is essentially the position of the
+character within the set of all possible Unicode characters, and thus in
+Perl, the term I<ordinal> is often used interchangeably with it.
 
 The Unicode standard prefers using hexadecimal notation for the code
-points.  If numbers like "C<0x0041>" are unfamiliar to
-you, take a peek at a later section, L</"Hexadecimal Notation">.
-The Unicode standard uses the notation C<U+0041 LATIN CAPITAL LETTER A>,
-to give the hexadecimal code point and the normative name of
-the character.
+points.  If numbers like C<0x0041> are unfamiliar to you, take a peek
+at a later section, L</"Hexadecimal Notation">.  The Unicode standard
+uses the notation C<U+0041 LATIN CAPITAL LETTER A>, to give the
+hexadecimal code point and the normative name of the character.
 
 Unicode also defines various I<properties> for the characters, like
 "uppercase" or "lowercase", "decimal digit", or "punctuation";
@@ -45,101 +48,139 @@ these properties are independent of the names of the characters.
 Furthermore, various operations on the characters like uppercasing,
 lowercasing, and collating (sorting) are defined.
 
-A Unicode character consists either of a single code point, or a
-I<base character> (like C<LATIN CAPITAL LETTER A>), followed by one or
-more I<modifiers> (like C<COMBINING ACUTE ACCENT>).  This sequence of
+A Unicode I<logical> "character" can actually consist of more than one internal
+I<actual> "character" or code point.  For Western languages, this is adequately
+modelled by a I<base character> (like C<LATIN CAPITAL LETTER A>) followed
+by one or more I<modifiers> (like C<COMBINING ACUTE ACCENT>).  This sequence of
 base character and modifiers is called a I<combining character
-sequence>.
-
-Whether to call these combining character sequences "characters"
-depends on your point of view.  If you are a programmer, you probably
-would tend towards seeing each element in the sequences as one unit,
-or "character".  The whole sequence could be seen as one "character",
-however, from the user's point of view, since that's probably what it
-looks like in the context of the user's language.
-
-With this "whole sequence" view of characters, the total number of
-characters is open-ended.
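The hexadecimal notation and character properties discussed in this hunk both have direct Perl spellings: C<\x{...}> in strings for code points, and C<\p{...}> in regular expressions for properties. A short sketch (illustrative only, not part of the patch):

```perl
use strict;
use warnings;

# Hexadecimal code points can be written directly in strings:
my $str = "\x{0041}\x{03B1}";   # LATIN CAPITAL LETTER A, GREEK SMALL LETTER ALPHA

# Unicode properties are available in regexes via \p{...}:
print "has an uppercase character\n" if $str =~ /\p{Uppercase}/;
print "has a lowercase character\n"  if $str =~ /\p{Lowercase}/;
print "has no decimal digit\n"   unless $str =~ /\p{Nd}/;   # Nd = decimal number
```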
-But in the programmer's "one unit is one
-character" point of view, the concept of "characters" is more
-deterministic.  In this document, we take that second point of view: one
-"character" is one Unicode code point, be it a base character or a
-combining character.
-
-For some combinations, there are I<precomposed> characters.
-C<LATIN CAPITAL LETTER A WITH ACUTE>, for example, is defined as
-a single code point.  These precomposed characters are, however,
-only available for some combinations, and are mainly
-meant to support round-trip conversions between Unicode and legacy
-standards (like the ISO 8859).  In the general case, the composing
-method is more extensible.  To support conversion between
-different compositions of the characters, various I<normalization
-forms> to standardize representations are also defined.
+sequence>.  Some non-western languages require more complicated
+models, so Unicode created the I<grapheme cluster> concept, which was
+later further refined into the I<extended grapheme cluster>.  For
+example, a Korean Hangul syllable is considered a single logical
+character, but most often consists of three actual
+Unicode characters: a leading consonant followed by an interior vowel followed
+by a trailing consonant.
+
+Whether to call these extended grapheme clusters "characters" depends on your
+point of view.  If you are a programmer, you probably would tend towards seeing
+each element in the sequences as one unit, or "character".  However, from
+the user's point of view, the whole sequence could be seen as one
+"character", since that's probably what it looks like in the context of the
+user's language.  In this document, we take the programmer's point of
+view: one "character" is one Unicode code point.
+
+For some combinations of base character and modifiers, there are
+I<precomposed> characters.  There is a single character equivalent, for
+example, for the sequence C<LATIN CAPITAL LETTER A> followed by
+C<COMBINING ACUTE ACCENT>.  It is called C<LATIN CAPITAL LETTER A
+WITH ACUTE>.
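The programmer's "one code point" view versus the user's "one grapheme cluster" view described in this hunk can both be seen from Perl: C<length> counts code points, while the C<\X> regex escape matches a whole extended grapheme cluster. A small sketch (illustrative, not part of the patch):

```perl
use strict;
use warnings;

# A base character plus a modifier: LATIN SMALL LETTER A followed
# by COMBINING ACUTE ACCENT.
my $seq = "a\x{0301}";

# Two code points, the programmer's view of "characters"...
print length($seq), "\n";        # prints 2

# ...but \X matches one extended grapheme cluster, the user's view.
my @clusters = $seq =~ /(\X)/g;
print scalar(@clusters), "\n";   # prints 1
```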
+These precomposed characters are, however, only available for
+some combinations, and are mainly meant to support round-trip
+conversions between Unicode and legacy standards (like ISO 8859).  Using
+sequences, as Unicode does, allows for needing fewer basic building blocks
+(code points) to express many more potential grapheme clusters.  To
+support conversion between equivalent forms, various I<normalization
+forms> are also defined.  Thus, C<LATIN CAPITAL LETTER A WITH ACUTE> is
+in I<Normalization Form Composed> (abbreviated NFC), and the sequence
+C<LATIN CAPITAL LETTER A> followed by C<COMBINING ACUTE ACCENT>
+represents the same character in I<Normalization Form Decomposed> (NFD).
 
 Because of backward compatibility with legacy encodings, the "a unique
 number for every character" idea breaks down a bit: instead, there is
 "at least one number for every character".  The same character could
 be represented differently in several legacy encodings.  The
-converse is also not true: some code points do not have an assigned
+converse is not true: some code points do not have an assigned
 character.  Firstly, there are unallocated code points within
 otherwise used blocks.  Secondly, there are special Unicode control
 characters that do not represent true characters.
 
-A common myth about Unicode is that it would be "16-bit", that is,
-Unicode is only represented as C<0x10000> (or 65536) characters from
-C<0x0000> to C<0xFFFF>.  B<This is untrue.>  Since Unicode 2.0, Unicode
-has been defined all the way up to 21 bits (C<0x10FFFF>), and since
-Unicode 3.1, characters have been defined beyond C<0xFFFF>.  The first
-C<0x10000> characters are called the I<Plane 0>, or the I<Basic
-Multilingual Plane> (BMP).  With Unicode 3.1, 17 planes in all are
-defined--but nowhere near full of defined characters, yet.
-
-Another myth is that the 256-character blocks have something to do
-with languages--that each language is specified inside a block.
-B<This is also untrue.>  The division into blocks exists, but it is
-almost completely accidental--an artifact of how the characters have
-been historically allocated.  Instead, there is a concept called
-I<scripts>, which is more useful: there is C<Latin> script, and
-C<Greek> script, and so on.
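The NFC/NFD equivalence described in this hunk can be demonstrated with the core C<Unicode::Normalize> module (shipped with Perl since 5.8). A brief sketch (illustrative, not part of the patch):

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $composed   = "\x{00C1}";            # LATIN CAPITAL LETTER A WITH ACUTE
my $decomposed = "\x{0041}\x{0301}";    # A followed by COMBINING ACUTE ACCENT

# The two forms represent the same character in different normalizations.
print "NFD of composed equals the sequence\n" if NFD($composed) eq $decomposed;
print "NFC of the sequence equals composed\n" if NFC($decomposed) eq $composed;

# NFC is one code point here, NFD is two.
printf "NFC length %d, NFD length %d\n",
    length(NFC($composed)), length(NFD($composed));
```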
-Scripts usually span varied parts of
-several blocks.  For further information see L<Unicode::UCD>.
+When Unicode was first conceived, it was thought that all the world's
+characters could be represented using a 16-bit word; that is, a maximum of
+C<0x10000> (or 65,536) characters would be needed, from C<0x0000> to
+C<0xFFFF>.  This soon proved to be wrong, and since Unicode 2.0 (July
+1996), Unicode has been defined all the way up to 21 bits (C<0x10FFFF>),
+and Unicode 3.1 (March 2001) defined the first characters above C<0xFFFF>.
+The first C<0x10000> characters are called the I<Plane 0>, or the
+I<Basic Multilingual Plane> (BMP).  With Unicode 3.1, 17 (yes,
+seventeen) planes in all were defined--but they are nowhere near full of
+defined characters, yet.
+
+When a new language is being encoded, Unicode generally will choose a
+C<block> of consecutive unallocated code points for its characters.  So
+far, the number of code points in these blocks has always been evenly
+divisible by 16.  Extras in a block, not currently needed, are left
+unallocated, for future growth.  But there have been occasions when
+a later release needed more code points than the available extras, and a
+new block had to be allocated somewhere else, not contiguous to the initial
+one, to handle the overflow.  Thus, it became apparent early on that
+"block" wasn't an adequate organizing principle, and so the C
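The script-versus-block distinction this hunk revolves around is visible in Perl regexes, which accept both classifications as C<\p{...}> properties, and characters above the BMP work like any others. A sketch (illustrative, not part of the patch; the diff itself is truncated above):

```perl
use strict;
use warnings;

my $alpha = "\x{03B1}";   # GREEK SMALL LETTER ALPHA

# Scripts classify characters by writing system...
print "Greek script\n" if $alpha =~ /\p{Script=Greek}/;

# ...while blocks are just allocation ranges of code points.
print "Greek and Coptic block\n" if $alpha =~ /\p{Block=Greek_and_Coptic}/;

# Characters beyond the BMP (above 0xFFFF) behave the same way:
my $math_a = chr(0x1D400);   # MATHEMATICAL BOLD CAPITAL A
printf "U+%04X\n", ord($math_a);   # prints "U+1D400"
```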