[patch] :utf8 updates

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index bf21206..61d62d2 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -10,6 +10,10 @@ Unicode support is an extensive requirement. While Perl does not
  implement the Unicode standard or the accompanying technical reports
  from cover to cover, Perl does support many Unicode features.
  
+People who want to learn to use Unicode in Perl should probably read
+L<the Perl Unicode tutorial|perlunitut> before reading this reference
+document.
+
  =over 4
  
  =item Input and Output Layers
@@ -20,15 +24,15 @@ the ":utf8" layer.  Other encodings can be converted to Perl's
  encoding on input or from Perl's encoding on output by use of the
  ":encoding(...)"  layer.  See L<open>.
  
-To indicate that Perl source itself is using a particular encoding,
-see L<encoding>.
+To indicate that Perl source itself is in UTF-8, use C<use utf8;>.
  
  =item Regular Expressions
  
  The regular expression compiler produces polymorphic opcodes.  That is,
  the pattern adapts to the data and automatically switches to the Unicode
-character scheme when presented with Unicode data--or instead uses
-a traditional byte scheme when presented with byte data.
+character scheme when presented with data that is internally encoded in
+UTF-8 -- or instead uses a traditional byte scheme when presented with
+byte data.
  
  =item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts
  
@@ -39,8 +43,23 @@ ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
  machines.  B<These are the only times when an explicit C<use utf8>
  is needed.>  See L<utf8>.
  
-You can also use the C<encoding> pragma to change the default encoding
-of the data in your script; see L<encoding>.
+=item BOM-marked scripts and UTF-16 scripts autodetected
+
+If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF-16BE,
+or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either
+endianness, Perl will correctly read in the script as Unicode.
+(BOMless UTF-8 cannot be effectively recognized or differentiated from
+ISO 8859-1 or other eight-bit encodings.)
+
+=item C<use encoding> needed to upgrade non-Latin-1 byte strings
+
+By default, there is a fundamental asymmetry in Perl's Unicode model:
+implicit upgrading from byte strings to Unicode strings assumes that
+they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
+downgraded with UTF-8 encoding.  This happens because the first 256
+codepoints in Unicode happen to agree with Latin-1.
+
+See L</"Byte and Character Semantics"> for more details.
  
  =back
  
@@ -67,13 +86,6 @@ character data.  Such data may come from filehandles, from calls to
  external programs, from information provided by the system (such as %ENV),
  or from literals and constants in the source text.
  
-On Windows platforms, if the C<-C> command line switch is used or the
-${^WIDE_SYSTEM_CALLS} global flag is set to C<1>, all system calls
-will use the corresponding wide-character APIs.  This feature is
-available only on Windows to conform to the API standard already
-established for that platform--and there are very few non-Windows
-platforms that have Unicode-aware APIs.
-
  The C<bytes> pragma will always, regardless of platform, force byte
  semantics in a particular lexical scope.  See L<bytes>.
  
@@ -93,12 +105,10 @@ Otherwise, byte semantics are in effect.  The C<bytes> pragma should
  be used to force byte semantics on Unicode data.
  
  If strings operating under byte semantics and strings with Unicode
-character data are concatenated, the new string will be upgraded to
-I<ISO 8859-1 (Latin-1)>, even if the old Unicode string used EBCDIC.
-This translation is done without regard to the system's native 8-bit
-encoding, so to change this for systems with non-Latin-1 and 
-non-EBCDIC native encodings use the C<encoding> pragma.  See
-L<encoding>.
+character data are concatenated, the new string will be created by
+decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
+old Unicode string used EBCDIC.  This translation is done without
+regard to the system's native 8-bit encoding.
  
  Under character semantics, many operations that formerly operated on
  bytes now operate on characters. A character in Perl is
@@ -118,17 +128,16 @@ Character semantics have the following effects:
  Strings--including hash keys--and regular expression patterns may
  contain characters that have an ordinal value larger than 255.
  
-If you use a Unicode editor to edit your program, Unicode characters
-may occur directly within the literal strings in one of the various
-Unicode encodings (UTF-8, UTF-EBCDIC, UCS-2, etc.), but will be recognized
-as such and converted to Perl's internal representation only if the
-appropriate L<encoding> is specified.
+If you use a Unicode editor to edit your program, Unicode characters may
+occur directly within the literal strings in UTF-8 encoding, or UTF-16.
+(The former requires a BOM or C<use utf8>, the latter requires a BOM.)
  
-Unicode characters can also be added to a string by using the
-C<\x{...}> notation.  The Unicode code for the desired character, in
-hexadecimal, should be placed in the braces. For instance, a smiley
-face is C<\x{263A}>.  This encoding scheme only works for characters
-with a code of 0x100 or above.
+Unicode characters can also be added to a string by using the C<\x{...}>
+notation.  The Unicode code for the desired character, in hexadecimal,
+should be placed in the braces. For instance, a smiley face is
+C<\x{263A}>.  This encoding scheme works for all characters, but
+for characters under 0x100, note that Perl may use an 8-bit encoding
+internally, for optimization and/or backward compatibility.
  
  Additionally, if you
  
@@ -137,7 +146,6 @@ Additionally, if you
  you can use the C<\N{...}> notation and put the official Unicode
  character name within the braces, such as C<\N{WHITE SMILING FACE}>.
  
-
  =item *
  
  If an appropriate L<encoding> is specified, identifiers within the
@@ -148,8 +156,7 @@ names.
  =item *
  
  Regular expressions match characters instead of bytes.  "." matches
-a character instead of a byte.  The C<\C> pattern is provided to force
-a match a single byte--a C<char> in C, hence C<\C>.
+a character instead of a byte.
  
  =item *
  
@@ -162,7 +169,120 @@ ideograph, for instance.
  
  Named Unicode properties, scripts, and block ranges may be used like
  character classes via the C<\p{}> "matches property" construct and
-the  C<\P{}> negation, "doesn't match property".
+the C<\P{}> negation, "doesn't match property".
+
+See L</"Unicode Character Properties"> for more details.
+
+You can define your own character properties and use them
+in the regular expression with the C<\p{}> or C<\P{}> construct.
+
+See L</"User-Defined Character Properties"> for more details.
+
+=item *
+
+The special pattern C<\X> matches any extended Unicode
+sequence--"a combining character sequence" in Standardese--where the
+first character is a base character and subsequent characters are mark
+characters that apply to the base character.  C<\X> is equivalent to
+C<(?:\PM\pM*)>.
+
+=item *
+
+The C<tr///> operator translates characters instead of bytes.  Note
+that the C<tr///CU> functionality has been removed.  For similar
+functionality see pack('U0', ...) and pack('C0', ...).
+
+=item *
+
+Case translation operators use the Unicode case translation tables
+when character input is provided.  Note that C<uc()>, or C<\U> in
+interpolated strings, translates to uppercase, while C<ucfirst>,
+or C<\u> in interpolated strings, translates to titlecase in languages
+that make the distinction.
+
+=item *
+
+Most operators that deal with positions or lengths in a string will
+automatically switch to using character positions, including
+C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
+C<sprintf()>, C<write()>, and C<length()>.  An operator that
+specifically does not switch is C<vec()>.  Operators that really don't
+care include operators that treat strings as a bucket of bits such as
+C<sort()>, and operators dealing with filenames.
+
+=item *
+
+The C<pack()>/C<unpack()> letter C<C> does I<not> change, since it is often
+used for byte-oriented formats.  Again, think C<char> in the C language.
+
+There is a new C<U> specifier that converts between Unicode characters
+and code points. There is also a C<W> specifier that is the equivalent of
+C<chr>/C<ord> and properly handles character values even if they are above 255.
+
+=item *
+
+The C<chr()> and C<ord()> functions work on characters, similar to
+C<pack("W")> and C<unpack("W")>, I<not> C<pack("C")> and
+C<unpack("C")>.  C<pack("C")> and C<unpack("C")> are methods for
+emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.
+While these methods reveal the internal encoding of Unicode strings,
+that is not something one normally needs to care about at all.
+
+=item *
+
+The bit string operators, C<& | ^ ~>, can operate on character data.
+However, for backward compatibility, such as when using bit string
+operations when characters are all less than 256 in ordinal value, one
+should not use C<~> (the bit complement) with characters of both
+values less than 256 and values greater than 256.  Most importantly,
+DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
+will not hold.  The reason for this mathematical I<faux pas> is that
+the complement cannot return B<both> the 8-bit (byte-wide) bit
+complement B<and> the full character-wide bit complement.
+
+=item *
+
+lc(), uc(), lcfirst(), and ucfirst() work for the following cases:
+
+=over 8
+
+=item *
+
+the case mapping is from a single Unicode character to another
+single Unicode character, or
+
+=item *
+
+the case mapping is from a single Unicode character to more
+than one Unicode character.
+
+=back
+
+Things to do with locales (Lithuanian, Turkish, Azeri) do B<not> work
+since Perl does not understand the concept of Unicode locales.
+
+See the Unicode Technical Report #21, Case Mappings, for more details.
+
+But you can also define your own mappings to be used in lc(),
+lcfirst(), uc(), and ucfirst() (or their string-inlined versions).
+
+See L</"User-Defined Case Mappings"> for more details.
+
+=back
+
+=over 4
+
+=item *
+
+And finally, C<scalar reverse()> reverses by character rather than by byte.
+
+=back
+
+=head2 Unicode Character Properties
+
+Named Unicode properties, scripts, and block ranges may be used like
+character classes via the C<\p{}> "matches property" construct and
+the C<\P{}> negation, "doesn't match property".
  
  For instance, C<\p{Lu}> matches any character with the Unicode "Lu"
  (Letter, uppercase) property, while C<\p{M}> matches any character
@@ -184,13 +304,21 @@ You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
  (^) between the first brace and the property name: C<\p{^Tamil}> is
  equal to C<\P{Tamil}>.
  
+B<NOTE: the properties, scripts, and blocks listed here are as of
+Unicode 5.0.0 in July 2006.>
+
+=over 4
+
+=item General Category
+
  Here are the basic Unicode General Category properties, followed by their
-long form.  You can use either; C<\p{Lu}> and C<\p{LowercaseLetter}>,
+long form.  You can use either; C<\p{Lu}> and C<\p{UppercaseLetter}>,
  for instance, are identical.
  
      Short       Long
  
      L           Letter
+    LC          CasedLetter
      Lu          UppercaseLetter
      Ll          LowercaseLetter
      Lt          TitlecaseLetter
@@ -238,60 +366,69 @@ for instance, are identical.
  
  Single-letter properties match all characters in any of the
  two-letter sub-properties starting with the same letter.
-C<L&> is a special case, which is an alias for C<Ll>, C<Lu>, and C<Lt>.
+C<LC> and C<L&> are special cases, which are aliases for the set of
+C<Ll>, C<Lu>, and C<Lt>.
  
  Because Perl hides the need for the user to understand the internal
  representation of Unicode characters, there is no need to implement
  the somewhat messy concept of surrogates. C<Cs> is therefore not
  supported.
  
+=item Bidirectional Character Types
+
  Because scripts differ in their directionality--Hebrew is
-written right to left, for example--Unicode supplies these properties:
+written right to left, for example--Unicode supplies these properties in
+the BidiClass class:
  
      Property    Meaning
  
-    BidiL       Left-to-Right
-    BidiLRE     Left-to-Right Embedding
-    BidiLRO     Left-to-Right Override
-    BidiR       Right-to-Left
-    BidiAL      Right-to-Left Arabic
-    BidiRLE     Right-to-Left Embedding
-    BidiRLO     Right-to-Left Override
-    BidiPDF     Pop Directional Format
-    BidiEN      European Number
-    BidiES      European Number Separator
-    BidiET      European Number Terminator
-    BidiAN      Arabic Number
-    BidiCS      Common Number Separator
-    BidiNSM     Non-Spacing Mark
-    BidiBN      Boundary Neutral
-    BidiB       Paragraph Separator
-    BidiS       Segment Separator
-    BidiWS      Whitespace
-    BidiON      Other Neutrals
-
-For example, C<\p{BidiR}> matches characters that are normally
+    L           Left-to-Right
+    LRE         Left-to-Right Embedding
+    LRO         Left-to-Right Override
+    R           Right-to-Left
+    AL          Right-to-Left Arabic
+    RLE         Right-to-Left Embedding
+    RLO         Right-to-Left Override
+    PDF         Pop Directional Format
+    EN          European Number
+    ES          European Number Separator
+    ET          European Number Terminator
+    AN          Arabic Number
+    CS          Common Number Separator
+    NSM         Non-Spacing Mark
+    BN          Boundary Neutral
+    B           Paragraph Separator
+    S           Segment Separator
+    WS          Whitespace
+    ON          Other Neutrals
+
+For example, C<\p{BidiClass:R}> matches characters that are normally
  written right to left.
  
-=back
-
-=head2 Scripts
+=item Scripts
  
  The script names which can be used by C<\p{...}> and C<\P{...}>,
  such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows:
  
      Arabic
      Armenian
+    Balinese
      Bengali
      Bopomofo
+    Braille
+    Buginese
      Buhid
      CanadianAboriginal
      Cherokee
+    Coptic
+    Cuneiform
+    Cypriot
      Cyrillic
      Deseret
      Devanagari
      Ethiopic
      Georgian
+    Glagolitic
      Gothic
      Greek
      Gujarati
@@ -304,27 +441,43 @@ such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows:
      Inherited
      Kannada
      Katakana
+    Kharoshthi
      Khmer
      Lao
      Latin
+    Limbu
+    LinearB
      Malayalam
      Mongolian
      Myanmar
+    NewTaiLue
+    Nko
      Ogham
      OldItalic
+    OldPersian
      Oriya
+    Osmanya
+    PhagsPa
+    Phoenician
      Runic
+    Shavian
      Sinhala
+    SylotiNagri
      Syriac
      Tagalog
      Tagbanwa
+    TaiLe
      Tamil
      Telugu
      Thaana
      Thai
      Tibetan
+    Tifinagh
+    Ugaritic
      Yi
  
+=item Extended property classes
+
  Extended property classes can supplement the basic
  properties, defined by the F<PropList> Unicode database:
  
@@ -334,7 +487,6 @@ properties, defined by the F<PropList> Unicode database:
      Deprecated
      Diacritic
      Extender
-    GraphemeLink
      HexDigit
      Hyphen
      Ideographic
@@ -346,37 +498,52 @@ properties, defined by the F<PropList> Unicode database:
      OtherAlphabetic
      OtherDefaultIgnorableCodePoint
      OtherGraphemeExtend
+    OtherIDStart
+    OtherIDContinue
      OtherLowercase
      OtherMath
      OtherUppercase
+    PatternSyntax
+    PatternWhiteSpace
      QuotationMark
      Radical
      SoftDotted
+    STerm
      TerminalPunctuation
      UnifiedIdeograph
+    VariationSelector
      WhiteSpace
  
  and there are further derived properties:
  
-    Alphabetic      Lu + Ll + Lt + Lm + Lo + OtherAlphabetic
-    Lowercase       Ll + OtherLowercase
-    Uppercase       Lu + OtherUppercase
-    Math            Sm + OtherMath
+    Alphabetic  =  Lu + Ll + Lt + Lm + Lo + Nl + OtherAlphabetic
+    Lowercase   =  Ll + OtherLowercase
+    Uppercase   =  Lu + OtherUppercase
+    Math        =  Sm + OtherMath
+
+    IDStart     =  Lu + Ll + Lt + Lm + Lo + Nl + OtherIDStart
+    IDContinue  =  IDStart + Mn + Mc + Nd + Pc + OtherIDContinue
  
-    ID_Start        Lu + Ll + Lt + Lm + Lo + Nl
-    ID_Continue     ID_Start + Mn + Mc + Nd + Pc
+    DefaultIgnorableCodePoint
+                =  OtherDefaultIgnorableCodePoint
+                   + Cf + Cc + Cs + Noncharacters + VariationSelector
+                   - WhiteSpace - FFF9..FFFB (Annotation Characters)
  
-    Any             Any character
-    Assigned        Any non-Cn character (i.e. synonym for \P{Cn})
-    Unassigned      Synonym for \p{Cn}
-    Common          Any character (or unassigned code point)
-                    not explicitly assigned to a script
+    Any         =  Any code points (i.e. U+0000 to U+10FFFF)
+    Assigned    =  Any non-Cn code points (i.e. synonym for \P{Cn})
+    Unassigned  =  Synonym for \p{Cn}
+    ASCII       =  ASCII (i.e. U+0000 to U+007F)
+
+    Common      =  Any character (or unassigned code point)
+                   not explicitly assigned to a script
+
+=item Use of "Is" Prefix
  
  For backward compatibility (with Perl 5.6), all properties mentioned
  so far may have C<Is> prepended to their name, so C<\P{IsLu}>, for
  example, is equal to C<\P{Lu}>.
  
-=head2 Blocks
+=item Blocks
  
  In addition to B<scripts>, Unicode also defines B<blocks> of
  characters.  The difference between scripts and blocks is that the
@@ -388,9 +555,9 @@ blocks. It does not, for example, contain digits, because digits are
  shared across many scripts. Digits and similar groups, like
  punctuation, are in a category called C<Common>.
  
-For more about scripts, see the UTR #24:
+For more about scripts, see the UAX#24 "Script Names":
  
-   http://www.unicode.org/unicode/reports/tr24/
+   http://www.unicode.org/reports/tr24/
  
  For more about blocks, see:
  
@@ -404,12 +571,17 @@ for block tests to avoid confusion.
  
  These block names are supported:
  
+    InAegeanNumbers
      InAlphabeticPresentationForms
+    InAncientGreekMusicalNotation
+    InAncientGreekNumbers
      InArabic
      InArabicPresentationFormsA
      InArabicPresentationFormsB
+    InArabicSupplement
      InArmenian
      InArrows
+    InBalinese
      InBasicLatin
      InBengali
      InBlockElements
@@ -417,6 +589,7 @@ These block names are supported:
      InBopomofoExtended
      InBoxDrawing
      InBraillePatterns
+    InBuginese
      InBuhid
      InByzantineMusicalSymbols
      InCJKCompatibility
@@ -424,27 +597,38 @@ These block names are supported:
      InCJKCompatibilityIdeographs
      InCJKCompatibilityIdeographsSupplement
      InCJKRadicalsSupplement
+    InCJKStrokes
      InCJKSymbolsAndPunctuation
      InCJKUnifiedIdeographs
      InCJKUnifiedIdeographsExtensionA
      InCJKUnifiedIdeographsExtensionB
      InCherokee
      InCombiningDiacriticalMarks
+    InCombiningDiacriticalMarksSupplement
      InCombiningDiacriticalMarksforSymbols
      InCombiningHalfMarks
      InControlPictures
+    InCoptic
+    InCountingRodNumerals
+    InCuneiform
+    InCuneiformNumbersAndPunctuation
      InCurrencySymbols
+    InCypriotSyllabary
      InCyrillic
-    InCyrillicSupplementary
+    InCyrillicSupplement
      InDeseret
      InDevanagari
      InDingbats
      InEnclosedAlphanumerics
      InEnclosedCJKLettersAndMonths
      InEthiopic
+    InEthiopicExtended
+    InEthiopicSupplement
      InGeneralPunctuation
      InGeometricShapes
      InGeorgian
+    InGeorgianSupplement
+    InGlagolitic
      InGothic
      InGreekExtended
      InGreekAndCoptic
@@ -466,13 +650,20 @@ These block names are supported:
      InKannada
      InKatakana
      InKatakanaPhoneticExtensions
+    InKharoshthi
      InKhmer
+    InKhmerSymbols
      InLao
      InLatin1Supplement
      InLatinExtendedA
      InLatinExtendedAdditional
      InLatinExtendedB
+    InLatinExtendedC
+    InLatinExtendedD
      InLetterlikeSymbols
+    InLimbu
+    InLinearBIdeograms
+    InLinearBSyllabary
      InLowSurrogates
      InMalayalam
      InMathematicalAlphanumericSymbols
@@ -480,17 +671,28 @@ These block names are supported:
      InMiscellaneousMathematicalSymbolsA
      InMiscellaneousMathematicalSymbolsB
      InMiscellaneousSymbols
+    InMiscellaneousSymbolsAndArrows
      InMiscellaneousTechnical
+    InModifierToneLetters
      InMongolian
      InMusicalSymbols
      InMyanmar
+    InNKo
+    InNewTaiLue
      InNumberForms
      InOgham
      InOldItalic
+    InOldPersian
      InOpticalCharacterRecognition
      InOriya
+    InOsmanya
+    InPhagspa
+    InPhoenician
+    InPhoneticExtensions
+    InPhoneticExtensionsSupplement
      InPrivateUseArea
      InRunic
+    InShavian
      InSinhala
      InSmallFormVariants
      InSpacingModifierLetters
@@ -499,127 +701,51 @@ These block names are supported:
      InSupplementalArrowsA
      InSupplementalArrowsB
      InSupplementalMathematicalOperators
+    InSupplementalPunctuation
      InSupplementaryPrivateUseAreaA
      InSupplementaryPrivateUseAreaB
+    InSylotiNagri
      InSyriac
      InTagalog
      InTagbanwa
      InTags
+    InTaiLe
+    InTaiXuanJingSymbols
      InTamil
      InTelugu
      InThaana
      InThai
      InTibetan
+    InTifinagh
+    InUgaritic
      InUnifiedCanadianAboriginalSyllabics
      InVariationSelectors
+    InVariationSelectorsSupplement
+    InVerticalForms
      InYiRadicals
      InYiSyllables
-
-=over 4
-
-=item *
-
-The special pattern C<\X> matches any extended Unicode
-sequence--"a combining character sequence" in Standardese--where the
-first character is a base character and subsequent characters are mark
-characters that apply to the base character.  C<\X> is equivalent to
-C<(?:\PM\pM*)>.
-
-=item *
-
-The C<tr///> operator translates characters instead of bytes.  Note
-that the C<tr///CU> functionality has been removed.  For similar
-functionality see pack('U0', ...) and pack('C0', ...).
-
-=item *
-
-Case translation operators use the Unicode case translation tables
-when character input is provided.  Note that C<uc()>, or C<\U> in
-interpolated strings, translates to uppercase, while C<ucfirst>,
-or C<\u> in interpolated strings, translates to titlecase in languages
-that make the distinction.
-
-=item *
-
-Most operators that deal with positions or lengths in a string will
-automatically switch to using character positions, including
-C<chop()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
-C<sprintf()>, C<write()>, and C<length()>.  Operators that
-specifically do not switch include C<vec()>, C<pack()>, and
-C<unpack()>.  Operators that really don't care include C<chomp()>,
-operators that treats strings as a bucket of bits such as C<sort()>,
-and operators dealing with filenames.
-
-=item *
-
-The C<pack()>/C<unpack()> letters C<c> and C<C> do I<not> change,
-since they are often used for byte-oriented formats.  Again, think
-C<char> in the C language.
-
-There is a new C<U> specifier that converts between Unicode characters
-and code points.
-
-=item *
-
-The C<chr()> and C<ord()> functions work on characters, similar to
-C<pack("U")> and C<unpack("U")>, I<not> C<pack("C")> and
-C<unpack("C")>.  C<pack("C")> and C<unpack("C")> are methods for
-emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.
-While these methods reveal the internal encoding of Unicode strings,
-that is not something one normally needs to care about at all.
-
-=item *
-
-The bit string operators, C<& | ^ ~>, can operate on character data.
-However, for backward compatibility, such as when using bit string
-operations when characters are all less than 256 in ordinal value, one
-should not use C<~> (the bit complement) with characters of both
-values less than 256 and values greater than 256.  Most importantly,
-DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
-will not hold.  The reason for this mathematical I<faux pas> is that
-the complement cannot return B<both> the 8-bit (byte-wide) bit
-complement B<and> the full character-wide bit complement.
-
-=item *
-
-lc(), uc(), lcfirst(), and ucfirst() work for the following cases:
-
-=over 8
-
-=item *
-
-the case mapping is from a single Unicode character to another
-single Unicode character, or
-
-=item *
-
-the case mapping is from a single Unicode character to more
-than one Unicode character.
+    InYijingHexagramSymbols
  
  =back
  
-Things to do with locales (Lithuanian, Turkish, Azeri) do B<not> work
-since Perl does not understand the concept of Unicode locales.
-
-See the Unicode Technical Report #21, Case Mappings, for more details.
-
-=back
+=head2 User-Defined Character Properties
  
-=over 4
+You can define your own character properties by defining subroutines
+whose names begin with "In" or "Is".  The subroutines can be defined in
+any package.  The user-defined properties can be used in the regular
+expression C<\p> and C<\P> constructs; if you are using a user-defined
+property from a package other than the one you are in, you must specify
+its package in the C<\p> or C<\P> construct.
  
-=item *
+    # assuming property IsForeign defined in Lang::
+    package main;  # property package name required
+    if ($txt =~ /\p{Lang::IsForeign}+/) { ... }
  
-And finally, C<scalar reverse()> reverses by character rather than by byte.
+    package Lang;  # property package name not required
+    if ($txt =~ /\p{IsForeign}+/) { ... }
  
-=back
  
-=head2 User-Defined Character Properties
-
-You can define your own character properties by defining subroutines
-whose names begin with "In" or "Is".  The subroutines must be
-visible in the package that uses the properties.  The user-defined
-properties can be used in the regular expression C<\p> and C<\P>
-constructs.
+Note that the effect is compile-time and immutable once defined.
  
  The subroutines must return a specially-formatted string, with one
  or more newline-separated lines.  Each line must be one of the following:
@@ -628,29 +754,40 @@ or more newline-separated lines.  Each line must be one of the following:
  
  =item *
  
+A single hexadecimal number denoting a Unicode code point to include.
+
+=item *
+
  Two hexadecimal numbers separated by horizontal whitespace (space or
  tabular characters) denoting a range of Unicode code points to include.
  
  =item *
  
  Something to include, prefixed by "+": a built-in character
-property (prefixed by "utf8::"), to represent all the characters in that
-property; two hexadecimal code points for a range; or a single
-hexadecimal code point.
+property (prefixed by "utf8::") or a user-defined character property,
+to represent all the characters in that property; two hexadecimal code
+points for a range; or a single hexadecimal code point.
  
  =item *
  
  Something to exclude, prefixed by "-": an existing character
-property (prefixed by "utf8::"), for all the characters in that
-property; two hexadecimal code points for a range; or a single
-hexadecimal code point.
+property (prefixed by "utf8::") or a user-defined character property,
+to represent all the characters in that property; two hexadecimal code
+points for a range; or a single hexadecimal code point.
  
  =item *
  
  Something to negate, prefixed "!": an existing character
-property (prefixed by "utf8::") for all the characters except the
-characters in the property; two hexadecimal code points for a range;
-or a single hexadecimal code point.
+property (prefixed by "utf8::") or a user-defined character property,
+for all the characters except the characters in the property; two
+hexadecimal code points for a range; or a single hexadecimal code point.
+
+=item *
+
+Something to intersect with, prefixed by "&": an existing character
+property (prefixed by "utf8::") or a user-defined character property,
+to represent all the characters in that property; two hexadecimal code
+points for a range; or a single hexadecimal code point.
  
  =back
  
@@ -698,6 +835,72 @@ The negation is useful for defining (surprise!) negated classes.
      END
      }
  
+Intersection is useful for getting the common characters matched by
+two (or more) classes.
+
+    sub InFooAndBar {
+        return <<'END';
+    +main::Foo
+    &main::Bar
+    END
+    }
+
+It's important to remember not to use "&" for the first set -- that
+would be intersecting with nothing (resulting in an empty set).
+
+=head2 User-Defined Case Mappings
+
+You can also define your own mappings to be used in lc(),
+lcfirst(), uc(), and ucfirst() (or their string-inlined versions).
+The principle is similar to that of user-defined character
+properties: define subroutines in the C<main> package
+with names like C<ToLower> (for lc() and lcfirst()), C<ToTitle> (for
+the first character in ucfirst()), and C<ToUpper> (for uc(), and the
+rest of the characters in ucfirst()).
+
+The string returned by the subroutines must consist of three
+hexadecimal numbers separated by tabulators: start of the source
+range, end of the source range, and start of the destination range.
+For example:
+
+    sub ToUpper {
+       return <<END;
+    0061\t0063\t0041
+    END
+    }
+
+defines a uc() mapping that causes only the characters "a", "b", and
+"c" to be mapped to "A", "B", and "C"; all other characters remain
+unchanged.
+
+If there is no source range to speak of, that is, the mapping is from
+a single character to another single character, leave the end of the
+source range empty, but the two tabulator characters are still needed.
+For example:
+
+    sub ToLower {
+       return <<END;
+    0041\t\t0061
+    END
+    }
+
+defines an lc() mapping that causes only "A" to be mapped to "a"; all
+other characters remain unchanged.
+
+(For serious hackers only)  If you want to introspect the default
+mappings, you can find the data in the directory
+C<$Config{privlib}>/F<unicore/To/>.  The mapping data is returned as
+the here-document, and the C<utf8::ToSpecFoo> are special exception
+mappings derived from C<$Config{privlib}>/F<unicore/SpecialCasing.txt>.
+The C<Digit> and C<Fold> mappings that one can see in the directory
+are not directly user-accessible; one can use either the
+C<Unicode::UCD> module, or just match case-insensitively (that's when
+the C<Fold> mapping is used).
+
+A final note on the user-defined case mappings: they will be used
+only if the scalar has been marked as having Unicode characters.
+Old byte-style strings will not be affected.
+
  =head2 Character Encodings for Input and Output
  
  See L<Encode>.
@@ -706,8 +909,8 @@ See L<Encode>.
  
  The following list of Unicode support for regular expressions describes
  all the features currently supported.  The references to "Level N"
-and the section numbers refer to the Unicode Technical Report 18,
-"Unicode Regular Expression Guidelines".
+and the section numbers refer to the Unicode Technical Standard #18,
+"Unicode Regular Expressions", version 11, in May 2005.
  
  =over 4
  
@@ -715,35 +918,42 @@ and the section numbers refer to the Unicode Technical Report 18,
  
  Level 1 - Basic Unicode Support
  
-        2.1 Hex Notation                        - done          [1]
-            Named Notation                      - done          [2]
-        2.2 Categories                          - done          [3][4]
-        2.3 Subtraction                         - MISSING       [5][6]
-        2.4 Simple Word Boundaries              - done          [7]
-        2.5 Simple Loose Matches                - done          [8]
-        2.6 End of Line                         - MISSING       [9][10]
-
-        [ 1] \x{...}
-        [ 2] \N{...}
-        [ 3] . \p{...} \P{...}
-        [ 4] now scripts (see UTR#24 Script Names) in addition to blocks
-        [ 5] have negation
-        [ 6] can use regular expression look-ahead [a]
-             or user-defined character properties [b] to emulate subtraction
-        [ 7] include Letters in word characters
-        [ 8] note that Perl does Full case-folding in matching, not Simple:
+        RL1.1   Hex Notation                        - done          [1]
+        RL1.2   Properties                          - done          [2][3]
+        RL1.2a  Compatibility Properties            - done          [4]
+        RL1.3   Subtraction and Intersection        - MISSING       [5]
+        RL1.4   Simple Word Boundaries              - done          [6]
+        RL1.5   Simple Loose Matches                - done          [7]
+        RL1.6   Line Boundaries                     - MISSING       [8]
+        RL1.7   Supplementary Code Points           - done          [9]
+
+        [1]  \x{...}
+        [2]  \p{...} \P{...}
+        [3]  supports not only minimal list (general category, scripts,
+             Alphabetic, Lowercase, Uppercase, WhiteSpace,
+             NoncharacterCodePoint, DefaultIgnorableCodePoint, Any,
+             ASCII, Assigned), but also bidirectional types, blocks, etc.
+             (see L</"Unicode Character Properties">)
+        [4]  \d \D \s \S \w \W \X [:prop:] [:^prop:]
+        [5]  can use regular expression look-ahead [a] or
+             user-defined character properties [b] to emulate set operations
+        [6]  \b \B
+        [7]  note that Perl does Full case-folding in matching, not Simple:
               for example U+1F88 is equivalent with U+1F00 U+03B9,
               not with 1F80.  This difference matters for certain Greek
               capital letters with certain modifiers: the Full case-folding
               decomposes the letter, while the Simple case-folding would map
               it to a single character.
-        [ 9] see UTR#13 Unicode Newline Guidelines
-        [10] should do ^ and $ also on \x{85}, \x{2028} and \x{2029}
-             (should also affect <>, $., and script line numbers)
-             (the \x{85}, \x{2028} and \x{2029} do match \s)
+        [8]  should do ^ and $ also on U+000B (\v in C), FF (\f), CR (\r),
+             CRLF (\r\n), NEL (U+0085), LS (U+2028), and PS (U+2029);
+             should also affect <>, $., and script line numbers;
+             should not split lines within CRLF [c] (i.e. there is no empty
+             line between \r and \n)
+        [9]  UTF-8/UTF-EBCDIC used in perl allows not only U+10000 to U+10FFFF
+             but also beyond U+10FFFF [d]
  
  [a] You can mimic class subtraction using lookahead.
-For example, what TR18 might write as
+For example, what UTS#18 might write as
  
      [{Greek}-[{UNASSIGNED}]]
  
@@ -758,38 +968,63 @@ But in this particular example, you probably really want
  
  which will match assigned characters known to be part of the Greek script.
  
-[b] See L</"User-Defined Character Properties">.
-
-=item *
+Also see the Unicode::Regex::Set module; it implements the full
+UTS#18 grouping, intersection, union, and removal (subtraction) syntax.
  
-Level 2 - Extended Unicode Support
+[b] '+' for union, '-' for removal (set-difference), '&' for intersection
+(see L</"User-Defined Character Properties">)
  
-        3.1 Surrogates                          - MISSING      [11]
-        3.2 Canonical Equivalents               - MISSING       [12][13]
-        3.3 Locale-Independent Graphemes        - MISSING       [14]
-        3.4 Locale-Independent Words            - MISSING       [15]
-        3.5 Locale-Independent Loose Matches    - MISSING       [16]
+[c] Try the C<:crlf> layer (see L<PerlIO>).
  
-        [11] Surrogates are solely a UTF-16 concept and Perl's internal
-             representation is UTF-8.  The Encode module does UTF-16, though.
-        [12] see UTR#15 Unicode Normalization
-        [13] have Unicode::Normalize but not integrated to regexes
-        [14] have \X but at this level . should equal that
-        [15] need three classes, not just \w and \W
-        [16] see UTR#21 Case Mappings
+[d] Avoid C<use warnings 'utf8';> (or say C<no warnings 'utf8';>) to allow
+U+FFFF (C<\x{FFFF}>).
  
  =item *
  
-Level 3 - Locale-Sensitive Support
+Level 2 - Extended Unicode Support
+
+        RL2.1   Canonical Equivalents           - MISSING       [10][11]
+        RL2.2   Default Grapheme Clusters       - MISSING       [12][13]
+        RL2.3   Default Word Boundaries         - MISSING       [14]
+        RL2.4   Default Loose Matches           - MISSING       [15]
+        RL2.5   Name Properties                 - MISSING       [16]
+        RL2.6   Wildcard Properties             - MISSING
+
+        [10] see UAX#15 "Unicode Normalization Forms"
+        [11] have Unicode::Normalize but not integrated to regexes
+        [12] have \X but at this level . should equal that
+        [13] UAX#29 "Text Boundaries" considers CRLF and Hangul syllable
+             clusters as a single grapheme cluster.
+        [14] see UAX#29, Word Boundaries
+        [15] see UAX#21 "Case Mappings"
+        [16] have \N{...} but neither compute names of CJK Ideographs
+             and Hangul Syllables nor use a loose match [e]
+
+[e] C<\N{...}> allows namespaces (see L<charnames>).
  
-        4.1 Locale-Dependent Categories         - MISSING
-        4.2 Locale-Dependent Graphemes          - MISSING       [16][17]
-        4.3 Locale-Dependent Words              - MISSING
-        4.4 Locale-Dependent Loose Matches      - MISSING
-        4.5 Locale-Dependent Ranges             - MISSING
+=item *
  
-        [16] see UTR#10 Unicode Collation Algorithms
-        [17] have Unicode::Collate but not integrated to regexes
+Level 3 - Tailored Support
+
+        RL3.1   Tailored Punctuation            - MISSING
+        RL3.2   Tailored Grapheme Clusters      - MISSING       [17][18]
+        RL3.3   Tailored Word Boundaries        - MISSING
+        RL3.4   Tailored Loose Matches          - MISSING
+        RL3.5   Tailored Ranges                 - MISSING
+        RL3.6   Context Matching                - MISSING       [19]
+        RL3.7   Incremental Matches             - MISSING
+      ( RL3.8   Unicode Set Sharing )
+        RL3.9   Possible Match Sets             - MISSING
+        RL3.10  Folded Matching                 - MISSING       [20]
+        RL3.11  Submatchers                     - MISSING
+
+        [17] see UTS#10 "Unicode Collation Algorithm"
+        [18] have Unicode::Collate but not integrated to regexes
+        [19] have (?<=x) and (?=x), but look-aheads or look-behinds should see
+             outside of the target substring
+        [20] need insensitive matching for linguistic features other than case;
+             for example, hiragana to katakana, wide and narrow, simplified Han
+             to traditional Han (see UTR#30 "Character Foldings")
  
  =back
  
@@ -853,7 +1088,7 @@ Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.
  
  =item *
  
-UTF-16, UTF-16BE, UTF16-LE, Surrogates, and BOMs (Byte Order Marks)
+UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)
  
  The followings items are mostly for reference and general Unicode
  knowledge, Perl doesn't use these constructs internally.
@@ -905,7 +1140,7 @@ format".
  
  =item *
  
-UTF-32, UTF-32BE, UTF32-LE
+UTF-32, UTF-32BE, UTF-32LE
  
  The UTF-32 family is pretty much like the UTF-16 family, expect that
  the units are 32-bit, and therefore the surrogate scheme is not
@@ -1000,10 +1235,10 @@ there are a couple of exceptions:
  
  =item *
  
-If your locale environment variables (LANGUAGE, LC_ALL, LC_CTYPE, LANG)
-contain the strings 'UTF-8' or 'UTF8' (case-insensitive matching),
-the default encodings of your STDIN, STDOUT, and STDERR, and of
-B<any subsequent file open>, are considered to be UTF-8.
+You can enable automatic UTF-8-ification of your standard file
+handles, default C<open()> layer, and C<@ARGV> by using either
+the C<-C> command line switch or the C<PERL_UNICODE> environment
+variable; see L<perlrun> for the documentation of the C<-C> switch.
  
  =item *
  
@@ -1013,10 +1248,73 @@ straddling of the proverbial fence causes problems.
  
  =back
  
+=head2 When Unicode Does Not Happen
+
+While Perl does have extensive ways to input and output in Unicode,
+and a few other 'entry points' like C<@ARGV> which can be interpreted
+as Unicode (UTF-8), there still are many places where Unicode (in some
+encoding or another) could be given as arguments or received as
+results, or both, but it is not.
+
+The following are such interfaces.  For all of these interfaces Perl
+currently (as of 5.8.3) simply assumes byte strings both as arguments
+and results, or UTF-8 strings if the C<encoding> pragma has been used.
+
+One reason why Perl does not attempt to resolve the role of Unicode in
+these cases is that the answers are highly dependent on the operating
+system and the file system(s).  For example, whether filenames can be
+in Unicode, and in exactly what kind of encoding, is not exactly a
+portable concept.  Similarly for C<qx> and C<system>: how well will the
+'command line interface' (and which of them?) handle Unicode?
+
+=over 4
+
+=item *
+
+chdir, chmod, chown, chroot, exec, link, lstat, mkdir,
+rename, rmdir, stat, symlink, truncate, unlink, utime, -X
+
+=item *
+
+%ENV
+
+=item *
+
+glob (aka the <*>)
+
+=item *
+
+open, opendir, sysopen
+
+=item *
+
+qx (aka the backtick operator), system
+
+=item *
+
+readdir, readlink
+
+=back
+
+=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)
+
+Sometimes (see L</"When Unicode Does Not Happen">) there are
+situations where you simply need to force Perl to believe that a byte
+string is UTF-8, or vice versa.  The low-level calls
+utf8::upgrade($bytestring) and utf8::downgrade($utf8string) are
+the answers.
+
+Do not use them without careful thought, though: Perl may easily get
+very confused, angry, or even crash, if you suddenly change the 'nature'
+of a scalar like that.  Be especially careful if you use
+utf8::upgrade(): not every byte string is valid UTF-8.
+
  =head2 Using Unicode in XS
  
-If you want to handle Perl Unicode in XS extensions, you may find
-the following C APIs useful.  See L<perlapi> for details.
+If you want to handle Perl Unicode in XS extensions, you may find the
+following C APIs useful.  See also L<perlguts/"Unicode Support"> for an
+explanation about Unicode at the XS level, and L<perlapi> for the API
+details.
  
  =over 4
  
@@ -1036,7 +1334,7 @@ Unicode model is not to use UTF-8 until it is absolutely necessary.
  
  =item *
  
-C<uvuni_to_utf8(buf, chr>) writes a Unicode character code point into
+C<uvuni_to_utf8(buf, chr)> writes a Unicode character code point into
  a buffer encoding the code point as UTF-8, and returns a pointer
  pointing after the UTF-8 bytes.
  
@@ -1131,7 +1429,7 @@ Unicode is discouraged.
  =head2 Interaction with Extensions
  
  When Perl exchanges data with an extension, the extension should be
-able to understand the UTF-8 flag and act accordingly. If the
+able to understand the UTF8 flag and act accordingly. If the
  extension doesn't know about the flag, it's likely that the extension
  will return incorrectly-flagged data.
  
@@ -1176,7 +1474,7 @@ derived class with such a C<param> method:
      sub param {
        my($self,$name,$value) = @_;
        utf8::upgrade($name);     # make sure it is UTF-8 encoded
-      if (defined $value)
+      if (defined $value) {
          utf8::upgrade($value);  # make sure it is UTF-8 encoded
          return $self->SUPER::param($name,$value);
        } else {
@@ -1195,57 +1493,18 @@ Unicode data much easier.
  
  Some functions are slower when working on UTF-8 encoded strings than
  on byte encoded strings.  All functions that need to hop over
-characters such as length(), substr() or index() can work B<much>
-faster when the underlying data are byte-encoded. Witness the
-following benchmark:
-
-  % perl -e '
-  use Benchmark;
-  use strict;
-  our $l = 10000;
-  our $u = our $b = "x" x $l;
-  substr($u,0,1) = "\x{100}";
-  timethese(-2,{
-  LENGTH_B => q{ length($b) },
-  LENGTH_U => q{ length($u) },
-  SUBSTR_B => q{ substr($b, $l/4, $l/2) },
-  SUBSTR_U => q{ substr($u, $l/4, $l/2) },
-  });
-  '
-  Benchmark: running LENGTH_B, LENGTH_U, SUBSTR_B, SUBSTR_U for at least 2 CPU seconds...
-    LENGTH_B:  2 wallclock secs ( 2.36 usr +  0.00 sys =  2.36 CPU) @ 5649983.05/s (n=13333960)
-    LENGTH_U:  2 wallclock secs ( 2.11 usr +  0.00 sys =  2.11 CPU) @ 12155.45/s (n=25648)
-    SUBSTR_B:  3 wallclock secs ( 2.16 usr +  0.00 sys =  2.16 CPU) @ 374480.09/s (n=808877)
-    SUBSTR_U:  2 wallclock secs ( 2.11 usr +  0.00 sys =  2.11 CPU) @ 6791.00/s (n=14329)
-
-The numbers show an incredible slowness on long UTF-8 strings.  You
-should carefully avoid using these functions in tight loops. If you
-want to iterate over characters, the superior coding technique would
-split the characters into an array instead of using substr, as the following
-benchmark shows:
-
-  % perl -e '
-  use Benchmark;
-  use strict;
-  our $l = 10000;
-  our $u = our $b = "x" x $l;
-  substr($u,0,1) = "\x{100}";
-  timethese(-5,{
-  SPLIT_B => q{ for my $c (split //, $b){}  },
-  SPLIT_U => q{ for my $c (split //, $u){}  },
-  SUBSTR_B => q{ for my $i (0..length($b)-1){my $c = substr($b,$i,1);} },
-  SUBSTR_U => q{ for my $i (0..length($u)-1){my $c = substr($u,$i,1);} },
-  });
-  '
-  Benchmark: running SPLIT_B, SPLIT_U, SUBSTR_B, SUBSTR_U for at least 5 CPU seconds...
-     SPLIT_B:  6 wallclock secs ( 5.29 usr +  0.00 sys =  5.29 CPU) @ 56.14/s (n=297)
-     SPLIT_U:  5 wallclock secs ( 5.17 usr +  0.01 sys =  5.18 CPU) @ 55.21/s (n=286)
-    SUBSTR_B:  5 wallclock secs ( 5.34 usr +  0.00 sys =  5.34 CPU) @ 123.22/s (n=658)
-    SUBSTR_U:  7 wallclock secs ( 6.20 usr +  0.00 sys =  6.20 CPU) @  0.81/s (n=5)
-
-Even though the algorithm based on C<substr()> is faster than
-C<split()> for byte-encoded data, it pales in comparison to the speed
-of C<split()> when used with UTF-8 data.
+characters such as length(), substr() or index(), or matching regular
+expressions can work B<much> faster when the underlying data are
+byte-encoded.
+
+In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
+a caching scheme was introduced which will hopefully make the slowness
+somewhat less spectacular, at least for some operations.  In general,
+operations with UTF-8 encoded strings are still slower. As an example,
+the Unicode properties (character classes) like C<\p{Nd}> are known to
+be quite a bit slower (5-20 times) than their simpler counterparts
+like C<\d> (then again, there are 268 Unicode characters matching C<Nd>
+compared with the 10 ASCII characters matching C<\d>).
  
  =head2 Porting code from perl-5.6.X
  
@@ -1264,7 +1523,7 @@ to work under 5.6, so you should be safe to try them out.
  A filehandle that should read or write UTF-8
  
    if ($] > 5.007) {
-    binmode $fh, ":utf8";
+    binmode $fh, ":encoding(utf8)";
    }
  
  =item *
@@ -1273,7 +1532,7 @@ A scalar that is going to be passed to some extension
  
  Be it Compress::Zlib, Apache::Request or any extension that has no
  mention of Unicode in the manpage, you need to make sure that the
-UTF-8 flag is stripped off. Note that at the time of this writing
+UTF8 flag is stripped off. Note that at the time of this writing
  (October 2002) the mentioned modules are not UTF-8-aware. Please
  check the documentation to verify if this is still true.
  
@@ -1287,7 +1546,7 @@ check the documentation to verify if this is still true.
  A scalar we got back from an extension
  
  If you believe the scalar comes back as UTF-8, you will most likely
-want the UTF-8 flag restored:
+want the UTF8 flag restored:
  
    if ($] > 5.007) {
      require Encode;
@@ -1349,7 +1608,7 @@ A large scalar that you know can only contain ASCII
  
  Scalars that contain only ASCII and are marked as UTF-8 are sometimes
  a drag to your program. If you recognize such a situation, just remove
-the UTF-8 flag:
+the UTF8 flag:
  
    utf8::downgrade($val) if $] > 5.007;
  
@@ -1357,7 +1616,7 @@ the UTF-8 flag:
  
  =head1 SEE ALSO
  
-L<perluniintro>, L<encoding>, L<Encode>, L<open>, L<utf8>, L<bytes>,
-L<perlretut>, L<perlvar/"${^WIDE_SYSTEM_CALLS}">
+L<perlunitut>, L<perluniintro>, L<Encode>, L<open>, L<utf8>, L<bytes>,
+L<perlretut>, L<perlvar/"${^UNICODE}">
  
  =cut