otherwise used blocks. Secondly, there are special Unicode control
characters that do not represent true characters.
-A common myth about Unicode is that it is "16-bit", that is,
-Unicode is only represented as C<0x10000> (or 65536) characters from
-C<0x0000> to C<0xFFFF>. B<This is untrue.> Since Unicode 2.0 (July
+When Unicode was first conceived, it was thought that all the world's
+characters could be represented using a 16-bit word; that is, a maximum of
+C<0x10000> (or 65536) characters from C<0x0000> to C<0xFFFF> would be
+needed. This soon proved to be false, and since Unicode 2.0 (July
1996), Unicode has been defined all the way up to 21 bits (C<0x10FFFF>),
-and since Unicode 3.1 (March 2001), characters have been defined
-beyond C<0xFFFF>. The first C<0x10000> characters are called the
-I<Plane 0>, or the I<Basic Multilingual Plane> (BMP). With Unicode
-3.1, 17 (yes, seventeen) planes in all were defined--but they are
-nowhere near full of defined characters, yet.
-
-Another myth is about Unicode blocks--that they have something to
-do with languages--that each block would define the characters used
-by a language or a set of languages. B<This is also untrue.>
+and Unicode 3.1 (March 2001) defined the first characters above C<0xFFFF>.
+The first C<0x10000> characters are called the I<Plane 0>, or the
+I<Basic Multilingual Plane> (BMP). With Unicode 3.1, 17 (yes,
+seventeen) planes in all were defined--but they are nowhere near full of
+defined characters, yet.
+
+When a new language is being encoded, Unicode generally will choose a
+C<block> of consecutive unallocated code points for its characters. So
+far, the number of code points in these blocks has always been evenly
+divisible by 16. Extras in a block, not currently needed, are left
+unallocated, for future growth. But there have been occasions when
+a later release needed more code points than the available extras, and a
+new block had to be allocated somewhere else, not contiguous to the
+initial one, to handle the overflow. Thus, it became apparent early on
+that "block" wasn't an adequate organizing principle, and so the C<Script>
+property was created. (Later an improved script property was added as
+well, the C<Script_Extensions> property.) Those code points that are in
+overflow blocks can still
+have the same script as the original ones. The script concept fits more
+closely with natural language: there is C<Latin> script, C<Greek>
+script, and so on; and there are several artificial scripts, like
+C<Common> for characters that are used in multiple scripts, such as
+mathematical symbols. Scripts usually span varied parts of several
+blocks. For more information about scripts, see L<perlunicode/Scripts>.
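+
+For example, you can ask which script a character belongs to by using
+the C<\p{Script=...}> property in a pattern match (a small illustration;
+the property syntax shown requires a reasonably recent, Unicode-aware
+Perl):
+
+    use strict;
+    use warnings;
+
+    # GREEK SMALL LETTER ALPHA belongs to the Greek script
+    print "Greek\n"  if "\x{03B1}" =~ /\p{Script=Greek}/;
+
+    # PLUS SIGN is used across many scripts, so its script is Common
+    print "Common\n" if "+" =~ /\p{Script=Common}/;
+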
The division into blocks exists, but it is almost completely
-accidental--an artifact of how the characters have been and
-still are allocated. Instead, there is a concept called I<scripts>, which is
-more useful: there is C<Latin> script, C<Greek> script, and so on. Scripts
-usually span varied parts of several blocks. For more information about
-scripts, see L<perlunicode/Scripts>.
+accidental--an artifact of how the characters have been and still are
+allocated. (Note that this paragraph oversimplifies things, since this
+is just an introduction. Unicode doesn't really encode
+languages, but the writing systems for them--their scripts; and one
+script can be used by many languages. Unicode also encodes things that
+aren't really about languages, such as symbols like C<BAGGAGE CLAIM>.)
The Unicode code points are just abstract numbers. To input and
output these abstract numbers, the numbers must be I<encoded> or
must always be specified exactly like that; it is I<not> subject to
the loose matching of encoding names. Also note that currently C<:utf8> is unsafe for
input, because it accepts the data without validating that it is indeed valid
-UTF8; you should instead use C<:encoding(UTF-8)> (unfortunately this
-specification needs to be in all upper-case with the dash to get the
-safety checking; C<:encoding(utf-8)>, for example, doesn't do the
-checking).
+UTF-8; you should instead use C<:encoding(UTF-8)>. (The encoding name
+is matched loosely, so C<:encoding(utf-8)> also works; but beware that
+C<:encoding(utf8)>, without the hyphen, selects Perl's lax internal
+encoding, which does no checking.)
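+
+For example, to read a file of UTF-8 encoded text with validation (a
+minimal sketch; C<example.txt> is a made-up filename):
+
+    open my $fh, "<:encoding(UTF-8)", "example.txt"
+        or die "Cannot open example.txt: $!";
+    while (my $line = <$fh>) {
+        # $line now holds decoded characters, not raw bytes
+    }
+    close $fh;
+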
See L<PerlIO> for the C<:utf8> layer, L<PerlIO::encoding> and
L<Encode::PerlIO> for the C<:encoding()> layer, and
sections on case mapping in the L<Unicode Standard|http://www.unicode.org>.
As of Perl 5.8.0, the "Full" case-folding of I<Case
-Mappings/SpecialCasing> is implemented, but bugs remain in C<qr//i> with them.
+Mappings/SpecialCasing> is implemented, but bugs remain in C<qr//i> with
+them; most of these were fixed by Perl 5.14.
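+
+For example, "Full" case-folding means that LATIN SMALL LETTER SHARP S
+(U+00DF) is equivalent to the two-character string C<"ss"> under C</i>
+(this works reliably in Perl 5.14 and later):
+
+    print "matches\n" if "\x{00DF}" =~ /^ss$/i;
+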
=item *
How Does Unicode Work With Traditional Locales?
-In Perl, not very well. Avoid using locales through the C<locale>
-pragma. Use only one or the other. But see L<perlrun> for the
+Perl tries to keep the two worlds separated: code points above 255 are
+treated as Unicode; those below 256, generally according to the locale.
+This works reasonably well, except for some case-insensitive regular
+expression pattern matches that under Unicode rules would cross the
+255/256 boundary; these are disallowed.
+Also, the C<\p{}> and C<\N{}> constructs silently assume Unicode values
+even for code points below 256.
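+
+For example, the following match succeeds no matter what the current
+locale is (a small illustration):
+
+    # \N{U+00E9} is always LATIN SMALL LETTER E WITH ACUTE, and
+    # \p{Word} here uses the Unicode definition, even though the
+    # code point is below 256
+    print "word char\n" if "\N{U+00E9}" =~ /^\p{Word}$/;
+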
+See also L<perlrun> for the
description of the C<-C> switch and its environment counterpart,
C<$ENV{PERL_UNICODE}> to see how to enable various Unicode features,
for example by using locale settings.