X-Git-Url: https://perl5.git.perl.org/perl5.git/blobdiff_plain/58c274a11b245c0b622f3aa697372d5c1dc88354..ad4795e78e923065898354b946437030aaeca163:/pod/perluniintro.pod

diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index 67ce214..778c5de 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -5,160 +5,251 @@ perluniintro - Perl Unicode introduction
 
 =head1 DESCRIPTION
 
 This document gives a general idea of Unicode and how to use Unicode
-in Perl.
+in Perl.  See L</Further Resources> for references to more in-depth
+treatments of Unicode.
 
 =head2 Unicode
 
-Unicode is a character set standard with plans to cover all of the
+Unicode is a character set standard which plans to codify all of the
 writing systems of the world, plus many other symbols.
 
-Unicode and ISO/IEC 10646 are coordinated standards that provide code
-points for the characters in almost all modern character set standards,
-covering more than 30 writing systems and hundreds of languages,
-including all commercially important modern languages.  All characters
+Unicode and ISO/IEC 10646 are coordinated standards that unify
+almost all other modern character set standards,
+covering more than 80 writing systems and hundreds of languages,
+including all commercially-important modern languages.  All characters
 in the largest Chinese, Japanese, and Korean dictionaries are also
 encoded.  The standards will eventually cover almost all characters in
 more than 250 writing systems and thousands of languages.
+Unicode 1.0 was released in October 1991, and 6.0 in October 2010.
 
 A Unicode I<character> is an abstract entity.  It is not bound to any
-particular integer width, and especially not to the C language C<char>.
-Unicode is language neutral and display neutral: it doesn't encode the
-language of the text, and it doesn't define fonts or other graphical
+particular integer width, especially not to the C language C<char>.
+Unicode is language-neutral and display-neutral: it does not encode the
+language of the text, and it does not generally define fonts or other
+graphical
 layout details.  Unicode operates on characters and on text built from
 those characters.
 
 Unicode defines characters like C<LATIN CAPITAL LETTER A> or C<GREEK
-SMALL LETTER ALPHA>, and then unique numbers for those, hexadecimal
-0x0041 or 0x03B1 for those particular characters.  Such unique
-numbers are called I<code points>.
+SMALL LETTER ALPHA> and unique numbers for the characters, in this
+case 0x0041 and 0x03B1, respectively.  These unique numbers are called
+I<code points>.  A code point is essentially the position of the
+character within the set of all possible Unicode characters, and thus in
+Perl, the term I<ordinal> is often used interchangeably with it.
 
 The Unicode standard prefers using hexadecimal notation for the code
-points.  (In case this notation, numbers like 0x0041, is unfamiliar to
-you, take a peek at a later section, L</"Hexadecimal Notation">.)
-The Unicode standard uses the notation C<U+0041 LATIN CAPITAL LETTER A>,
-which gives the hexadecimal code point, and the normative name of
-the character.
+points.  If numbers like C<0x0041> are unfamiliar to you, take a peek
+at a later section, L</"Hexadecimal Notation">.  The Unicode standard
+uses the notation C<U+0041 LATIN CAPITAL LETTER A>, to give the
+hexadecimal code point and the normative name of the character.
 
 Unicode also defines various I<properties> for the characters, like
-"uppercase" or "lowercase", "decimal digit", or "punctuation":
+"uppercase" or "lowercase", "decimal digit", or "punctuation";
 these properties are independent of the names of the characters.
 Furthermore, various operations on the characters like uppercasing,
-lowercasing, and collating (sorting), are defined.
-
-A Unicode character consists either of a single code point, or a
-I<base character> (like C<LATIN CAPITAL LETTER A>), followed by one or
-more I<modifiers> (like C<COMBINING ACUTE ACCENT>).  This sequence of
-a base character and modifiers is called a I<combining character
-sequence>.
-
-Whether to call these combining character sequences, as a whole,
-"characters" depends on your point of view.
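As a concrete illustration of code points and the C<U+> notation, the mapping between characters, names, and numbers can be explored from Perl itself. A minimal sketch (the use of the core C<charnames> module and a perl 5.8+ baseline are assumptions of this example, not part of the diff above):

```perl
use strict;
use warnings;
use charnames ();   # core module: map between code points and character names

# Build a character from its code point, and print it in U+ notation
my $alpha = chr(0x03B1);            # GREEK SMALL LETTER ALPHA
printf "U+%04X\n", ord($alpha);     # prints "U+03B1"

# Map from a code point to its normative name, and back again
print charnames::viacode(0x0041), "\n";    # prints "LATIN CAPITAL LETTER A"
printf "0x%04X\n",
    charnames::vianame('GREEK SMALL LETTER ALPHA');   # prints "0x03B1"
```

Here C<chr()> and C<ord()> convert between a character and its ordinal (code point), while C<charnames> gives access to the normative character names.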
-If you are a programmer, you
-probably would tend towards seeing each element in the sequences as one
-unit, one "character", but from the user viewpoint, the sequence as a
-whole is probably considered one "character", since that's probably what
-it looks like in the context of the user's language.
-
-With this "as a whole" view of characters, the number of characters is
-open-ended.  But in the programmer's "one unit is one character" point of
-view, the concept of "characters" is more deterministic, and so we take
-that point of view in this document: one "character" is one Unicode
-code point, be it a base character or a combining character.
-
-For some of the combinations there are I<precomposed> characters,
-for example C<LATIN CAPITAL LETTER A WITH ACUTE> is defined as
-a single code point.  These precomposed characters are, however,
-often available only for some combinations, and mainly they are
-meant to support round-trip conversions between Unicode and legacy
-standards (like the ISO 8859), and in general case the composing
-method is more extensible.  To support conversion between the
-different compositions of the characters, various I<normalization
-forms> are also defined.
+lowercasing, and collating (sorting) are defined.
+
+A Unicode I<logical> "character" can actually consist of more than one
+internal I<actual> "character" or code point.  For Western languages,
+this is adequately modelled by a I<base character> (like C<LATIN CAPITAL
+LETTER A>) followed by one or more I<modifiers> (like C<COMBINING ACUTE
+ACCENT>).  This sequence of base character and modifiers is called a
+I<combining character sequence>.  Some non-western languages require
+more complicated models, so Unicode created the I<grapheme cluster>
+concept, which was later further refined into the I<extended grapheme
+cluster>.  For example, a Korean Hangul syllable is considered a single
+logical character, but most often consists of three actual Unicode
+characters: a leading consonant followed by an interior vowel followed
+by a trailing consonant.
+
+Whether to call these extended grapheme clusters "characters" depends on
+your point of view.  If you are a programmer, you probably would tend
+towards seeing each element in the sequences as one unit, or
+"character".  However, from the user's point of view, the whole sequence
+could be seen as one "character", since that's probably what it looks
+like in the context of the user's language.  In this document, we take
+the programmer's point of view: one "character" is one Unicode code
+point.
+
+For some combinations of base character and modifiers, there are
+I<precomposed> characters.  There is a single character equivalent, for
+example, to the sequence C<LATIN CAPITAL LETTER A> followed by
+C<COMBINING ACUTE ACCENT>.  It is called C<LATIN CAPITAL LETTER A WITH
+ACUTE>.  These precomposed characters are, however, only available for
+some combinations, and are mainly meant to support round-trip
+conversions between Unicode and legacy standards (like ISO 8859).  Using
+sequences, as Unicode does, allows for needing fewer basic building
+blocks (code points) to express many more potential grapheme clusters.
+To support conversion between equivalent forms, various I<normalization
+forms> are also defined.  Thus, C<LATIN CAPITAL LETTER A WITH ACUTE> is
+in I<Normalization Form Composed> (abbreviated NFC), and the sequence
+C<LATIN CAPITAL LETTER A> followed by C<COMBINING ACUTE ACCENT>
+represents the same character in I<Normalization Form Decomposed> (NFD).
 
 Because of backward compatibility with legacy encodings, the "a unique
-number for every character" breaks down a bit: "at least one number
-for every character" is closer to truth.  (This happens when the same
-character has been encoded in several legacy encodings.)  The converse
-is also not true: not every code point has an assigned character.
-Firstly, there are unallocated code points within otherwise used
-blocks.  Secondly, there are special Unicode control characters that
-do not represent true characters.
-
-A common myth about Unicode is that it would be "16-bit", that is,
-0x10000 (or 65536) characters from 0x0000 to 0xFFFF.  B<This is untrue.>
-Since Unicode 2.0 Unicode has been defined all the way up to 21 bits
-(0x10FFFF), and since 3.1 characters have been defined beyond 0xFFFF.
-The first 0x10000 characters are called the I<Plane 0>, or the I<Basic
-Multilingual Plane> (BMP).
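The NFC/NFD equivalence between a precomposed character and its decomposed sequence can be demonstrated with the core Unicode::Normalize module (shipped with perl since 5.8); a minimal sketch:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $composed   = "\x{00C1}";       # LATIN CAPITAL LETTER A WITH ACUTE: one code point
my $decomposed = NFD($composed);   # "A" followed by COMBINING ACUTE ACCENT

# length() counts code points, not grapheme clusters
print length($composed), "\n";     # prints "1"
print length($decomposed), "\n";   # prints "2"

# Recomposing gives back the single precomposed code point
print NFC($decomposed) eq $composed ? "same\n" : "different\n";   # prints "same"
```

Both strings represent the same logical character; only the normalization form differs.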
-With the Unicode 3.1, 17 planes in all are
-defined (but nowhere near full of defined characters yet).
-
-Another myth is that the 256-character blocks have something to do
-with languages: a block per language.  B<This is also untrue.>
-The division into the blocks exists but it is almost completely
-accidental, an artifact of how the characters have been historically
-allocated.  Instead, there is a concept called I<scripts>, which may
-be more useful: there is C<Latin> script, C<Greek> script, and so on.
-Scripts usually span several parts of several blocks.  For further
-information see L<Unicode::UCD>.
+number for every character" idea breaks down a bit: instead, there is
+"at least one number for every character".  The same character could
+be represented differently in several legacy encodings.  The
+converse is also not true: some code points do not have an assigned
+character.  Firstly, there are unallocated code points within
+otherwise used blocks.  Secondly, there are special Unicode control
+characters that do not represent true characters.
+
+When Unicode was first conceived, it was thought that all the world's
+characters could be represented using a 16-bit word; that is, a maximum
+of C<0x10000> (or 65536) characters from C<0x0000> to C<0xFFFF> would be
+needed.  This soon proved to be false, and since Unicode 2.0 (July
+1996), Unicode has been defined all the way up to 21 bits (C<0x10FFFF>),
+and Unicode 3.1 (March 2001) defined the first characters above
+C<0xFFFF>.  The first C<0x10000> characters are called the I<Plane 0>,
+or the I<Basic Multilingual Plane> (BMP).  With Unicode 3.1, 17 (yes,
+seventeen) planes in all were defined--but they are nowhere near full of
+defined characters, yet.
+
+When a new language is being encoded, Unicode generally will choose a
+C<block> of consecutive unallocated code points for its characters.  So
+far, the number of code points in these blocks has always been evenly
+divisible by 16.  Extras in a block, not currently needed, are left
+unallocated, for future growth.  But there have been occasions when
+a later release needed more code points than the available extras, and a
+new block had to be allocated somewhere else, not contiguous to the
+initial one, to handle the overflow.  Thus, it became apparent early on
+that "block" wasn't an adequate organizing principle, and so the
+C<Script> property was created.
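The two points above -- that code points extend well past C<0xFFFF>, and that scripts rather than blocks are the useful language-oriented grouping -- can be sketched in a few lines of Perl (the specific example characters and a modern-perl assumption are mine, not part of the diff):

```perl
use strict;
use warnings;

# A character outside the BMP, from Plane 1
my $gothic = chr(0x10330);           # GOTHIC LETTER AHSA
printf "U+%04X\n", ord($gothic);     # prints "U+10330", well above 0xFFFF

# Scripts group characters by writing system, spanning block boundaries
print "greek\n" if chr(0x03B1) =~ /\p{Script=Greek}/;   # prints "greek"
print "latin\n" if "A" =~ /\p{Script=Latin}/;           # prints "latin"
```

The C<\p{Script=...}> property in regular expressions is how Perl exposes the Script classification that this paragraph describes.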