implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.
+People who want to learn to use Unicode in Perl should probably read
+the L<Perl Unicode tutorial, perlunitut|perlunitut> and
+L<perluniintro>, before reading
+this reference document.
+
+Also, the use of Unicode may present security issues that aren't obvious.
+Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.
+
=over 4
+=item Safest if you "use feature 'unicode_strings'"
+
+In order to preserve backward compatibility, Perl does not turn
+on full internal Unicode support unless the pragma
+C<use feature 'unicode_strings'> is specified. (This is automatically
+selected if you C<use 5.012> or higher.) Failure to do this can
+lead to unexpected results. See L</The "Unicode Bug"> below.
+
+This pragma doesn't affect I/O, and there are still several places
+where Unicode isn't fully supported, such as in filenames.
+
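The difference the pragma makes can be seen with a character like 0xE9 (a
sketch, assuming Perl 5.12 or later on an ASCII platform, with no locale in
effect):

```perl
use strict;
use warnings;

my ($with, $without);
{
    use feature 'unicode_strings';
    $with = uc "\xe9";      # U+00E9, LATIN SMALL LETTER E WITH ACUTE
}
{
    no feature 'unicode_strings';
    $without = uc "\xe9";   # same code point, old byte semantics
}
# With the feature, Unicode rules case-map every code point, so uc
# gives "\xc9" (E WITH ACUTE, capital). Without it, code points
# 128-255 in a non-upgraded string have no case, and uc is a no-op.
```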
=item Input and Output Layers
Perl knows when a filehandle uses Perl's internal Unicode encodings
(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
-the ":utf8" layer. Other encodings can be converted to Perl's
+the ":encoding(utf8)" layer. Other encodings can be converted to Perl's
encoding on input or from Perl's encoding on output by use of the
":encoding(...)" layer. See L<open>.
-To indicate that Perl source itself is using a particular encoding,
-see L<encoding>.
-
-=item Regular Expressions
-
-The regular expression compiler produces polymorphic opcodes. That is,
-the pattern adapts to the data and automatically switches to the Unicode
-character scheme when presented with Unicode data--or instead uses
-a traditional byte scheme when presented with byte data.
+To indicate that Perl source itself is in UTF-8, use C<use utf8;>.
=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts
machines. B<These are the only times when an explicit C<use utf8>
is needed.> See L<utf8>.
-You can also use the C<encoding> pragma to change the default encoding
-of the data in your script; see L<encoding>.
+=item BOM-marked scripts and UTF-16 scripts autodetected
+
+If a Perl script begins with the Unicode BOM (UTF-16LE, UTF-16BE,
+or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either
+endianness, Perl will correctly read in the script as Unicode.
+(BOMless UTF-8 cannot be effectively recognized or differentiated from
+ISO 8859-1 or other eight-bit encodings.)
+
+=item C<use encoding> needed to upgrade non-Latin-1 byte strings
+
+By default, there is a fundamental asymmetry in Perl's Unicode model:
+implicit upgrading from byte strings to Unicode strings assumes that
+they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
+downgraded with UTF-8 encoding. This happens because the first 256
+code points in Unicode happen to coincide with those of Latin-1.
+
+See L</"Byte and Character Semantics"> for more details.
=back
Beginning with version 5.6, Perl uses logically-wide characters to
represent strings internally.
-In future, Perl-level operations will be expected to work with
-characters rather than bytes.
-
-However, as an interim compatibility measure, Perl aims to
-provide a safe migration path from byte semantics to character
-semantics for programs. For operations where Perl can unambiguously
-decide that the input data are characters, Perl switches to
-character semantics. For operations where this determination cannot
-be made without additional information from the user, Perl decides in
-favor of compatibility and chooses to use byte semantics.
+Starting in Perl 5.14, Perl-level operations work with
+characters rather than bytes within the scope of a
+C<L<use feature 'unicode_strings'|feature>> (or equivalently
+C<use 5.012> or higher). (This is not true if bytes have been
+explicitly requested by C<L<use bytes|bytes>>, nor necessarily true
+for interactions with the platform's operating system.)
+
+For earlier Perls, and when C<unicode_strings> is not in effect, Perl
+provides a fairly safe environment that can handle both types of
+semantics in programs. For operations where Perl can unambiguously
+decide that the input data are characters, Perl switches to character
+semantics. For operations where this determination cannot be made
+without additional information from the user, Perl decides in favor of
+compatibility and chooses to use byte semantics.
+
+When C<use locale> is in effect (which overrides
+C<use feature 'unicode_strings'> in the same scope), Perl uses the
+semantics associated
+with the current locale. Otherwise, Perl uses the platform's native
+byte semantics for characters whose code points are less than 256, and
+Unicode semantics for those greater than 255. On EBCDIC platforms, this
+is almost seamless, as the EBCDIC code pages that Perl handles are
+equivalent to Unicode's first 256 code points. (The exception is that
+EBCDIC regular expression case-insensitive matching rules are not as
+robust as Unicode's.) But on ASCII platforms, Perl uses US-ASCII
+(or Basic Latin in Unicode terminology) byte semantics, meaning that characters
+whose ordinal numbers are in the range 128-255 are undefined except for their
+ordinal numbers. This means that none have case (upper or lower), nor are any
+members of character classes like C<[:alpha:]> or C<\w>. (But all do belong
+to the C<\W> class or the Perl regular expression extension C<[:^alpha:]>.)
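A sketch of this (assuming an ASCII platform, Perl 5.12 or later, and no
C<use locale>):

```perl
use strict;
use warnings;

my ($with, $without);
{
    use feature 'unicode_strings';
    # Unicode semantics: U+00E9 is a letter, so it matches \w.
    $with = "\xe9" =~ /\w/;
}
{
    no feature 'unicode_strings';
    # US-ASCII byte semantics: \xE9 is undefined, so \w fails.
    $without = "\xe9" =~ /\w/;
}
```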
This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if
-none of the program's inputs were marked as being as source of Unicode
+none of the program's inputs were marked as being a source of Unicode
character data. Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as %ENV),
or from literals and constants in the source text.
-The C<bytes> pragma will always, regardless of platform, force byte
-semantics in a particular lexical scope. See L<bytes>.
-
The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
Note that this pragma is only required while Perl defaults to byte
semantics; when character semantics become the default, this pragma
may become a no-op. See L<utf8>.
-Unless explicitly stated, Perl operators use character semantics
-for Unicode data and byte semantics for non-Unicode data.
-The decision to use character semantics is made transparently. If
-input data comes from a Unicode source--for example, if a character
-encoding layer is added to a filehandle or a literal Unicode
-string constant appears in a program--character semantics apply.
-Otherwise, byte semantics are in effect. The C<bytes> pragma should
-be used to force byte semantics on Unicode data.
-
If strings operating under byte semantics and strings with Unicode
-character data are concatenated, the new string will be upgraded to
-I<ISO 8859-1 (Latin-1)>, even if the old Unicode string used EBCDIC.
-This translation is done without regard to the system's native 8-bit
-encoding, so to change this for systems with non-Latin-1 and
-non-EBCDIC native encodings use the C<encoding> pragma. See
-L<encoding>.
+character data are concatenated, the new string will have
+character semantics. This can cause surprises: See L</BUGS>, below.
+You can choose to be warned when this happens. See L<encoding::warnings>.
Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.
-If you use a Unicode editor to edit your program, Unicode characters
-may occur directly within the literal strings in one of the various
-Unicode encodings (UTF-8, UTF-EBCDIC, UCS-2, etc.), but will be recognized
-as such and converted to Perl's internal representation only if the
-appropriate L<encoding> is specified.
+If you use a Unicode editor to edit your program, Unicode characters may
+occur directly within literal strings, encoded in UTF-8 or UTF-16.
+(The former requires a BOM or C<use utf8>, the latter requires a BOM.)
-Unicode characters can also be added to a string by using the
-C<\x{...}> notation. The Unicode code for the desired character, in
-hexadecimal, should be placed in the braces. For instance, a smiley
-face is C<\x{263A}>. This encoding scheme only works for characters
-with a code of 0x100 or above.
+Unicode characters can also be added to a string by using the C<\N{U+...}>
+notation. The Unicode code for the desired character, in hexadecimal,
+should be placed in the braces, after the C<U>. For instance, a smiley face is
+C<\N{U+263A}>.
+
+Alternatively, you can use the C<\x{...}> notation for characters 0x100 and
+above. For characters below 0x100 you may get byte semantics instead of
+character semantics; see L</The "Unicode Bug">. On EBCDIC machines there is
+the additional problem that the value for such characters gives the EBCDIC
+character rather than the Unicode one.
Additionally, you can use the C<\N{...}> notation and put the official
Unicode character name within the braces, such as
C<\N{WHITE SMILING FACE}>.
-
+See L<charnames>.
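All three notations spell the same character (C<use charnames> is needed for
named characters on Perls before 5.16):

```perl
use strict;
use warnings;
use charnames ':full';

my $smiley1 = "\x{263A}";
my $smiley2 = "\N{U+263A}";
my $smiley3 = "\N{WHITE SMILING FACE}";
# Each is the single character U+263A.
```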
=item *
=item *
Regular expressions match characters instead of bytes. "." matches
-a character instead of a byte. The C<\C> pattern is provided to force
-a match a single byte--a C<char> in C, hence C<\C>.
+a character instead of a byte.
=item *
-Character classes in regular expressions match characters instead of
+Bracketed character classes in regular expressions match characters instead of
bytes and match against the character properties specified in the
Unicode properties database. C<\w> can be used to match a Japanese
ideograph, for instance.
=item *
-Named Unicode properties, scripts, and block ranges may be used like
-character classes via the C<\p{}> "matches property" construct and
-the C<\P{}> negation, "doesn't match property".
-
-For instance, C<\p{Lu}> matches any character with the Unicode "Lu"
-(Letter, uppercase) property, while C<\p{M}> matches any character
-with an "M" (mark--accents and such) property. Brackets are not
-required for single letter properties, so C<\p{M}> is equivalent to
-C<\pM>. Many predefined properties are available, such as
-C<\p{Mirrored}> and C<\p{Tibetan}>.
-
-The official Unicode script and block names have spaces and dashes as
-separators, but for convenience you can use dashes, spaces, or
-underbars, and case is unimportant. It is recommended, however, that
-for consistency you use the following naming: the official Unicode
-script, property, or block name (see below for the additional rules
-that apply to block names) with whitespace and dashes removed, and the
-words "uppercase-first-lowercase-rest". C<Latin-1 Supplement> thus
-becomes C<Latin1Supplement>.
+Named Unicode properties, scripts, and block ranges may be used (like bracketed
+character classes) by using the C<\p{}> "matches property" construct and
+the C<\P{}> negation, "doesn't match property".
+See L</"Unicode Character Properties"> for more details.
+
+You can define your own character properties and use them
+in the regular expression with the C<\p{}> or C<\P{}> construct.
+See L</"User-Defined Character Properties"> for more details.
+
+=item *
+
+The special pattern C<\X> matches a logical character, an "extended grapheme
+cluster" in Standardese. In Unicode what appears to the user to be a single
+character, for example an accented C<G>, may in fact be composed of a sequence
+of characters, in this case a C<G> followed by an accent character. C<\X>
+will match the entire sequence.
+
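For instance, a base character plus a combining mark is two code points but
one C<\X> match:

```perl
use strict;
use warnings;

# "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points,
# one user-visible character.
my $grapheme = "e\x{301}";
my $matched  = $grapheme =~ /\A\X\z/;   # \X consumes the whole cluster
```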
+=item *
+
+The C<tr///> operator translates characters instead of bytes. Note
+that the C<tr///CU> functionality has been removed. For similar
+functionality see C<pack('U0', ...)> and C<pack('C0', ...)>.
+
+=item *
+
+Case translation operators use the Unicode case translation tables
+when character input is provided. Note that C<uc()>, or C<\U> in
+interpolated strings, translates to uppercase, while C<ucfirst>,
+or C<\u> in interpolated strings, translates to titlecase in languages
+that make the distinction (which is equivalent to uppercase in languages
+without the distinction).
+
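The uppercase/titlecase distinction shows up with a character such as
U+01C6 LATIN SMALL LETTER DZ WITH CARON, which has distinct uppercase
(U+01C4) and titlecase (U+01C5) forms:

```perl
use strict;
use warnings;

my $dz    = "\x{01C6}";
my $upper = uc $dz;        # "\x{01C4}", fully uppercase
my $title = ucfirst $dz;   # "\x{01C5}", the titlecase form
```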
+=item *
+
+Most operators that deal with positions or lengths in a string will
+automatically switch to using character positions, including
+C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
+C<sprintf()>, C<write()>, and C<length()>. An operator that
+specifically does not switch is C<vec()>. Operators that really don't
+care include operators that treat strings as a bucket of bits such as
+C<sort()>, and operators dealing with filenames.
+
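So a string's length is its character count, regardless of how many bytes
the characters occupy internally:

```perl
use strict;
use warnings;

my $str  = "caf\x{e9}\x{263A}";   # c, a, f, e-acute, smiley: 5 characters
my $len  = length $str;           # counts characters, not bytes
my $last = substr $str, -1;       # the final character, the smiley
```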
+=item *
+
+The C<pack()>/C<unpack()> letter C<C> does I<not> change, since it is often
+used for byte-oriented formats. Again, think C<char> in the C language.
+
+There is a new C<U> specifier that converts between Unicode characters
+and code points. There is also a C<W> specifier that is the equivalent of
+C<chr>/C<ord> and properly handles character values even if they are above 255.
+
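A round trip between code points and characters (C<W> requires Perl 5.10
or later):

```perl
use strict;
use warnings;

my $char = pack 'U', 0x263A;    # one character, U+263A
my ($cp) = unpack 'W', $char;   # back to the code point, 9786
```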
+=item *
+
+The C<chr()> and C<ord()> functions work on characters, similar to
+C<pack("W")> and C<unpack("W")>, I<not> C<pack("C")> and
+C<unpack("C")>. C<pack("C")> and C<unpack("C")> are methods for
+emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.
+While these methods reveal the internal encoding of Unicode strings,
+that is not something one normally needs to care about at all.
+
+=item *
+
+The bit string operators, C<& | ^ ~>, can operate on character data.
+However, for backward compatibility, such as when using bit string
+operations when characters are all less than 256 in ordinal value, one
+should not use C<~> (the bit complement) with characters of both
+values less than 256 and values greater than 256. Most importantly,
+DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
+will not hold. The reason for this mathematical I<faux pas> is that
+the complement cannot return B<both> the 8-bit (byte-wide) bit
+complement B<and> the full character-wide bit complement.
+
+=item *
+
+You can define your own mappings to be used in C<lc()>,
+C<lcfirst()>, C<uc()>, and C<ucfirst()> (or their double-quoted string inlined
+versions such as C<\U>). See
+L<User-Defined Case-Mappings|/"User-Defined Case Mappings (for serious hackers only)">
+for more details.
+
+=back
+
+=over 4
+
+=item *
+
+And finally, C<scalar reverse()> reverses by character rather than by byte.
+
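A quick illustration:

```perl
use strict;
use warnings;

# Reversal is per character; the smiley moves to the front intact,
# rather than having its internal bytes scrambled.
my $rev = scalar reverse "ab\x{263A}";
```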
+=back
+
+=head2 Unicode Character Properties
+
+(The only time that Perl considers a sequence of individual code
+points as a single logical character is in the C<\X> construct, already
+mentioned above. Therefore "character" in this discussion means a single
+Unicode code point.)
+
+Very nearly all Unicode character properties are accessible through
+regular expressions by using the C<\p{}> "matches property" construct
+and the C<\P{}> "doesn't match property" for its negation.
+
+For instance, C<\p{Uppercase}> matches any single character with the Unicode
+"Uppercase" property, while C<\p{L}> matches any character with a
+General_Category of "L" (letter) property. Brackets are not
+required for single letter property names, so C<\p{L}> is equivalent to C<\pL>.
+
+More formally, C<\p{Uppercase}> matches any single character whose Unicode
+Uppercase property value is True, and C<\P{Uppercase}> matches any character
+whose Uppercase property value is False, and they could have been written as
+C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively.
+
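The equivalent forms in action:

```perl
use strict;
use warnings;

my $ok1 = "A" =~ /\p{Uppercase}/;        # bare binary property
my $ok2 = "A" =~ /\p{Uppercase=True}/;   # explicit compound form
my $ok3 = "a" =~ /\P{Uppercase}/;        # negation: Uppercase is False
```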
+This formality is needed when properties are not binary; that is, if they can
+take on more values than just True and False. For example, the Bidi_Class (see
+L</"Bidirectional Character Types"> below), can take on several different
+values, such as Left, Right, Whitespace, and others. To match these, one needs
+to specify the property name (Bidi_Class), AND the value being matched against
+(Left, Right, etc.). This is done, as in the examples above, by having the
+two components separated by an equal sign (or interchangeably, a colon), like
+C<\p{Bidi_Class: Left}>.
+
+All Unicode-defined character properties may be written in these compound forms
+of C<\p{property=value}> or C<\p{property:value}>, but Perl provides some
+additional properties that are written only in the single form, as well as
+single-form short-cuts for all binary properties and certain others described
+below, in which you may omit the property name and the equals or colon
+separator.
+
+Most Unicode character properties have at least two synonyms (or aliases if you
+prefer): a short one that is easier to type and a longer one that is more
+descriptive and hence easier to understand. Thus the "L" and "Letter" properties
+above are equivalent and can be used interchangeably. Likewise,
+"Upper" is a synonym for "Uppercase", and we could have written
+C<\p{Uppercase}> equivalently as C<\p{Upper}>. Also, there are typically
+various synonyms for the values the property can be. For binary properties,
+"True" has 3 synonyms: "T", "Yes", and "Y"; and "False has correspondingly "F",
+"No", and "N". But be careful. A short form of a value for one property may
+not mean the same thing as the same short form for another. Thus, for the
+General_Category property, "L" means "Letter", but for the Bidi_Class property,
+"L" means "Left". A complete list of properties and synonyms is in
+L<perluniprops>.
+
+Upper/lower case differences in property names and values are irrelevant;
+thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>.
+Similarly, you can add or subtract underscores anywhere in the middle of a
+word, so that these are also equivalent to C<\p{U_p_p_e_r}>. And white space
+is irrelevant adjacent to non-word characters, such as the braces and the equals
+or colon separators, so C<\p{ Upper }> and C<\p{ Upper_case : Y }> are
+equivalent to these as well. In fact, white space and even
+hyphens can usually be added or deleted anywhere. So even C<\p{ Up-per case = Yes}> is
+equivalent. All this is called "loose-matching" by Unicode. The few places
+where stricter matching is used are in the middle of numbers, and in the Perl
+extension properties that begin or end with an underscore. Stricter matching
+cares about white space (except adjacent to non-word characters),
+hyphens, and non-interior underscores.
You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
(^) between the first brace and the property name: C<\p{^Tamil}> is
equal to C<\P{Tamil}>.
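That is, the caret form and C<\P{}> are the same test:

```perl
use strict;
use warnings;

my $m_caret = "A" =~ /\p{^Tamil}/;         # "A" is not Tamil
my $m_neg   = "A" =~ /\P{Tamil}/;          # same test, other spelling
my $m_tamil = "\x{0BF5}" =~ /\p{Tamil}/;   # a Tamil character does match
```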
-B<NOTE: the properties, scripts, and blocks listed here are as of
-Unicode 3.2.0, March 2002, or Perl 5.8.0, July 2002. Unicode 4.0.0
-came out in April 2003, and Perl 5.8.1 in September 2003.>
-
-Here are the basic Unicode General Category properties, followed by their
-long form. You can use either; C<\p{Lu}> and C<\p{UppercaseLetter}>,
-for instance, are identical.
+Almost all properties are immune to case-insensitive matching. That is,
+adding a C</i> regular expression modifier does not change what they
+match. There are two sets that are affected.
+The first set is
+C<Uppercase_Letter>,
+C<Lowercase_Letter>,
+and C<Titlecase_Letter>,
+all of which match C<Cased_Letter> under C</i> matching.
+And the second set is
+C<Uppercase>,
+C<Lowercase>,
+and C<Titlecase>,
+all of which match C<Cased> under C</i> matching.
+This set also includes its subsets C<PosixUpper> and C<PosixLower>, both
+of which match C<PosixAlpha> under C</i> matching.
+(The difference between these sets is that some things, such as Roman
+numerals, come in both upper and lower case so they are C<Cased>, but aren't considered
+letters, so they aren't C<Cased_Letter>s.)
+
+=head3 B<General_Category>
+
+Every Unicode character is assigned a general category, which is the "most
+usual categorization of a character" (from
+L<http://www.unicode.org/reports/tr44>).
+
+The compound way of writing these is like C<\p{General_Category=Number}>
+(short, C<\p{gc:n}>). But Perl furnishes shortcuts in which everything up
+through the equal or colon separator is omitted. So you can instead just write
+C<\pN>.
+
+Here are the short and long forms of the General Category properties:
Short Long
L Letter
- Lu UppercaseLetter
- Ll LowercaseLetter
- Lt TitlecaseLetter
- Lm ModifierLetter
- Lo OtherLetter
+ LC, L& Cased_Letter (that is: [\p{Ll}\p{Lu}\p{Lt}])
+ Lu Uppercase_Letter
+ Ll Lowercase_Letter
+ Lt Titlecase_Letter
+ Lm Modifier_Letter
+ Lo Other_Letter
M Mark
- Mn NonspacingMark
- Mc SpacingMark
- Me EnclosingMark
+ Mn Nonspacing_Mark
+ Mc Spacing_Mark
+ Me Enclosing_Mark
N Number
- Nd DecimalNumber
- Nl LetterNumber
- No OtherNumber
-
- P Punctuation
- Pc ConnectorPunctuation
- Pd DashPunctuation
- Ps OpenPunctuation
- Pe ClosePunctuation
- Pi InitialPunctuation
+ Nd Decimal_Number (also Digit)
+ Nl Letter_Number
+ No Other_Number
+
+ P Punctuation (also Punct)
+ Pc Connector_Punctuation
+ Pd Dash_Punctuation
+ Ps Open_Punctuation
+ Pe Close_Punctuation
+ Pi Initial_Punctuation
(may behave like Ps or Pe depending on usage)
- Pf FinalPunctuation
+ Pf Final_Punctuation
(may behave like Ps or Pe depending on usage)
- Po OtherPunctuation
+ Po Other_Punctuation
S Symbol
- Sm MathSymbol
- Sc CurrencySymbol
- Sk ModifierSymbol
- So OtherSymbol
+ Sm Math_Symbol
+ Sc Currency_Symbol
+ Sk Modifier_Symbol
+ So Other_Symbol
Z Separator
- Zs SpaceSeparator
- Zl LineSeparator
- Zp ParagraphSeparator
+ Zs Space_Separator
+ Zl Line_Separator
+ Zp Paragraph_Separator
C Other
- Cc Control
+ Cc Control (also Cntrl)
Cf Format
- Cs Surrogate (not usable)
- Co PrivateUse
+ Cs Surrogate
+ Co Private_Use
Cn Unassigned
Single-letter properties match all characters in any of the
two-letter sub-properties starting with the same letter.
-C<L&> is a special case, which is an alias for C<Ll>, C<Lu>, and C<Lt>.
+C<LC> and C<L&> are special: both are aliases for the set consisting of
+everything matched by C<Ll>, C<Lu>, and C<Lt>.
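For example, a Han ideograph is a letter (C<L>) but, having no case, is not
a cased letter (C<L&>):

```perl
use strict;
use warnings;

my $cased   = "A" =~ /\p{L&}/;          # "A" is Lu, so also L&
my $uncased = "\x{5357}" =~ /\p{L&}/;   # Han ideograph: Lo, not cased
my $letter  = "\x{5357}" =~ /\p{L}/;    # but it is a letter
```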
-Because Perl hides the need for the user to understand the internal
-representation of Unicode characters, there is no need to implement
-the somewhat messy concept of surrogates. C<Cs> is therefore not
-supported.
+=head3 B<Bidirectional Character Types>
-Because scripts differ in their directionality--Hebrew is
-written right to left, for example--Unicode supplies these properties:
+Because scripts differ in their directionality (Hebrew and Arabic are
+written right to left, for example) Unicode supplies these properties in
+the Bidi_Class class:
Property Meaning
- BidiL Left-to-Right
- BidiLRE Left-to-Right Embedding
- BidiLRO Left-to-Right Override
- BidiR Right-to-Left
- BidiAL Right-to-Left Arabic
- BidiRLE Right-to-Left Embedding
- BidiRLO Right-to-Left Override
- BidiPDF Pop Directional Format
- BidiEN European Number
- BidiES European Number Separator
- BidiET European Number Terminator
- BidiAN Arabic Number
- BidiCS Common Number Separator
- BidiNSM Non-Spacing Mark
- BidiBN Boundary Neutral
- BidiB Paragraph Separator
- BidiS Segment Separator
- BidiWS Whitespace
- BidiON Other Neutrals
-
-For example, C<\p{BidiR}> matches characters that are normally
+ L Left-to-Right
+ LRE Left-to-Right Embedding
+ LRO Left-to-Right Override
+ R Right-to-Left
+ AL Arabic Letter
+ RLE Right-to-Left Embedding
+ RLO Right-to-Left Override
+ PDF Pop Directional Format
+ EN European Number
+ ES European Separator
+ ET European Terminator
+ AN Arabic Number
+ CS Common Separator
+ NSM Non-Spacing Mark
+ BN Boundary Neutral
+ B Paragraph Separator
+ S Segment Separator
+ WS Whitespace
+ ON Other Neutrals
+
+This property is always written in the compound form.
+For example, C<\p{Bidi_Class:R}> matches characters that are normally
written right to left.
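For example:

```perl
use strict;
use warnings;

# U+05D0 HEBREW LETTER ALEF has Bidi_Class R (Right-to-Left).
my $rtl = "\x{05D0}" =~ /\p{Bidi_Class:R}/;
my $ltr = "A" =~ /\p{Bidi_Class:L}/;
```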
-=back
+=head3 B<Scripts>
+
+The world's languages are written in many different scripts. This sentence
+(unless you're reading it in translation) is written in Latin, while Russian is
+written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in
+Hiragana or Katakana. There are many more.
-=head2 Scripts
-
-The script names which can be used by C<\p{...}> and C<\P{...}>,
-such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows:
-
- Arabic
- Armenian
- Bengali
- Bopomofo
- Buhid
- CanadianAboriginal
- Cherokee
- Cyrillic
- Deseret
- Devanagari
- Ethiopic
- Georgian
- Gothic
- Greek
- Gujarati
- Gurmukhi
- Han
- Hangul
- Hanunoo
- Hebrew
- Hiragana
- Inherited
- Kannada
- Katakana
- Khmer
- Lao
- Latin
- Malayalam
- Mongolian
- Myanmar
- Ogham
- OldItalic
- Oriya
- Runic
- Sinhala
- Syriac
- Tagalog
- Tagbanwa
- Tamil
- Telugu
- Thaana
- Thai
- Tibetan
- Yi
-
-Extended property classes can supplement the basic
-properties, defined by the F<PropList> Unicode database:
-
- ASCIIHexDigit
- BidiControl
- Dash
- Deprecated
- Diacritic
- Extender
- GraphemeLink
- HexDigit
- Hyphen
- Ideographic
- IDSBinaryOperator
- IDSTrinaryOperator
- JoinControl
- LogicalOrderException
- NoncharacterCodePoint
- OtherAlphabetic
- OtherDefaultIgnorableCodePoint
- OtherGraphemeExtend
- OtherLowercase
- OtherMath
- OtherUppercase
- QuotationMark
- Radical
- SoftDotted
- TerminalPunctuation
- UnifiedIdeograph
- WhiteSpace
-
-and there are further derived properties:
-
- Alphabetic Lu + Ll + Lt + Lm + Lo + OtherAlphabetic
- Lowercase Ll + OtherLowercase
- Uppercase Lu + OtherUppercase
- Math Sm + OtherMath
-
- ID_Start Lu + Ll + Lt + Lm + Lo + Nl
- ID_Continue ID_Start + Mn + Mc + Nd + Pc
-
- Any Any character
- Assigned Any non-Cn character (i.e. synonym for \P{Cn})
- Unassigned Synonym for \p{Cn}
- Common Any character (or unassigned code point)
- not explicitly assigned to a script
+The Unicode Script property gives what script a given character is in,
+and the property can be specified with the compound form like
+C<\p{Script=Hebrew}> (short: C<\p{sc=hebr}>). Perl furnishes shortcuts for all
+script names. You can omit everything up through the equals (or colon), and
+simply write C<\p{Latin}> or C<\P{Cyrillic}>.
+
+A complete list of scripts and their shortcuts is in L<perluniprops>.
+
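The compound, abbreviated, and bare forms all match the same characters:

```perl
use strict;
use warnings;

my $alef  = "\x{05D0}";   # HEBREW LETTER ALEF
my $long  = $alef =~ /\p{Script=Hebrew}/;
my $short = $alef =~ /\p{sc=hebr}/;       # short property and value names
my $bare  = $alef =~ /\p{Hebrew}/;        # Perl's bare-name shortcut
```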
+=head3 B<Use of "Is" Prefix>
For backward compatibility (with Perl 5.6), all properties mentioned
-so far may have C<Is> prepended to their name, so C<\P{IsLu}>, for
-example, is equal to C<\P{Lu}>.
+so far may have C<Is> or C<Is_> prepended to their name, so C<\P{Is_Lu}>, for
+example, is equal to C<\P{Lu}>, and C<\p{IsScript:Arabic}> is equal to
+C<\p{Arabic}>.
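For instance, these three are synonyms:

```perl
use strict;
use warnings;

my $m1 = "A" =~ /\p{Lu}/;
my $m2 = "A" =~ /\p{IsLu}/;
my $m3 = "A" =~ /\p{Is_Lu}/;
```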
-=head2 Blocks
+=head3 B<Blocks>
In addition to B<scripts>, Unicode also defines B<blocks> of
characters. The difference between scripts and blocks is that the
concept of scripts is closer to natural languages, while the concept
-of blocks is more of an artificial grouping based on groups of 256
-Unicode characters. For example, the C<Latin> script contains letters
-from many blocks but does not contain all the characters from those
-blocks. It does not, for example, contain digits, because digits are
-shared across many scripts. Digits and similar groups, like
-punctuation, are in a category called C<Common>.
-
-For more about scripts, see the UTR #24:
-
- http://www.unicode.org/unicode/reports/tr24/
-
-For more about blocks, see:
-
- http://www.unicode.org/Public/UNIDATA/Blocks.txt
-
-Block names are given with the C<In> prefix. For example, the
-Katakana block is referenced via C<\p{InKatakana}>. The C<In>
-prefix may be omitted if there is no naming conflict with a script
-or any other property, but it is recommended that C<In> always be used
-for block tests to avoid confusion.
-
-These block names are supported:
-
- InAlphabeticPresentationForms
- InArabic
- InArabicPresentationFormsA
- InArabicPresentationFormsB
- InArmenian
- InArrows
- InBasicLatin
- InBengali
- InBlockElements
- InBopomofo
- InBopomofoExtended
- InBoxDrawing
- InBraillePatterns
- InBuhid
- InByzantineMusicalSymbols
- InCJKCompatibility
- InCJKCompatibilityForms
- InCJKCompatibilityIdeographs
- InCJKCompatibilityIdeographsSupplement
- InCJKRadicalsSupplement
- InCJKSymbolsAndPunctuation
- InCJKUnifiedIdeographs
- InCJKUnifiedIdeographsExtensionA
- InCJKUnifiedIdeographsExtensionB
- InCherokee
- InCombiningDiacriticalMarks
- InCombiningDiacriticalMarksforSymbols
- InCombiningHalfMarks
- InControlPictures
- InCurrencySymbols
- InCyrillic
- InCyrillicSupplementary
- InDeseret
- InDevanagari
- InDingbats
- InEnclosedAlphanumerics
- InEnclosedCJKLettersAndMonths
- InEthiopic
- InGeneralPunctuation
- InGeometricShapes
- InGeorgian
- InGothic
- InGreekExtended
- InGreekAndCoptic
- InGujarati
- InGurmukhi
- InHalfwidthAndFullwidthForms
- InHangulCompatibilityJamo
- InHangulJamo
- InHangulSyllables
- InHanunoo
- InHebrew
- InHighPrivateUseSurrogates
- InHighSurrogates
- InHiragana
- InIPAExtensions
- InIdeographicDescriptionCharacters
- InKanbun
- InKangxiRadicals
- InKannada
- InKatakana
- InKatakanaPhoneticExtensions
- InKhmer
- InLao
- InLatin1Supplement
- InLatinExtendedA
- InLatinExtendedAdditional
- InLatinExtendedB
- InLetterlikeSymbols
- InLowSurrogates
- InMalayalam
- InMathematicalAlphanumericSymbols
- InMathematicalOperators
- InMiscellaneousMathematicalSymbolsA
- InMiscellaneousMathematicalSymbolsB
- InMiscellaneousSymbols
- InMiscellaneousTechnical
- InMongolian
- InMusicalSymbols
- InMyanmar
- InNumberForms
- InOgham
- InOldItalic
- InOpticalCharacterRecognition
- InOriya
- InPrivateUseArea
- InRunic
- InSinhala
- InSmallFormVariants
- InSpacingModifierLetters
- InSpecials
- InSuperscriptsAndSubscripts
- InSupplementalArrowsA
- InSupplementalArrowsB
- InSupplementalMathematicalOperators
- InSupplementaryPrivateUseAreaA
- InSupplementaryPrivateUseAreaB
- InSyriac
- InTagalog
- InTagbanwa
- InTags
- InTamil
- InTelugu
- InThaana
- InThai
- InTibetan
- InUnifiedCanadianAboriginalSyllabics
- InVariationSelectors
- InYiRadicals
- InYiSyllables
+of blocks is more of an artificial grouping based on groups of Unicode
+characters with consecutive ordinal values. For example, the "Basic Latin"
+block is all characters whose ordinals are between 0 and 127, inclusive; in
+other words, the ASCII characters. The "Latin" script contains some letters
+from this block as well as from several other blocks, like "Latin-1 Supplement",
+"Latin Extended-A", etc., but it does not contain all the characters from
+those blocks. It does not, for example, contain the digits 0-9, because
+those digits are shared across many scripts. The digits 0-9 and similar groups,
+like punctuation, are in the script called C<Common>. There is also a
+script called C<Inherited> for characters that modify other characters,
+and inherit the script value of the controlling character. (Note that
+there are several different sets of digits in Unicode that are
+equivalent to 0-9 and are matchable by C<\d> in a regular expression.
+If they are used in a single language only, they are in that language's
+script. Only sets that are used across several languages are in the
+C<Common> script.)
+
+For more about scripts versus blocks, see UAX#24 "Unicode Script Property":
+L<http://www.unicode.org/reports/tr24>
+
+The Script property is likely to be the one you want to use when processing
+natural language; the Block property may occasionally be useful in working
+with the nuts and bolts of Unicode.
+
+Block names are matched in the compound form, like C<\p{Block: Arrows}> or
+C<\p{Blk=Hebrew}>. Unlike most other properties, only a few block names have a
+Unicode-defined short name. But Perl does provide a (slight) shortcut: You
+can say, for example C<\p{In_Arrows}> or C<\p{In_Hebrew}>. For backwards
+compatibility, the C<In> prefix may be omitted if there is no naming conflict
+with a script or any other property, and you can even use an C<Is> prefix
+instead in those cases. But it is not a good idea to do this, for a couple
+of reasons:
=over 4
-=item *
+=item 1
-The special pattern C<\X> matches any extended Unicode
-sequence--"a combining character sequence" in Standardese--where the
-first character is a base character and subsequent characters are mark
-characters that apply to the base character. C<\X> is equivalent to
-C<(?:\PM\pM*)>.
+It is confusing. There are many naming conflicts, and you may forget some.
+For example, C<\p{Hebrew}> means the I<script> Hebrew, and NOT the I<block>
+Hebrew. But would you remember that 6 months from now?
-=item *
+=item 2
-The C<tr///> operator translates characters instead of bytes. Note
-that the C<tr///CU> functionality has been removed. For similar
-functionality see pack('U0', ...) and pack('C0', ...).
+It is unstable. A new version of Unicode may pre-empt the current meaning by
+creating a property with the same name. There was a time in very early Unicode
+releases when C<\p{Hebrew}> would have matched the I<block> Hebrew; now it
+doesn't.
-=item *
+=back
-Case translation operators use the Unicode case translation tables
-when character input is provided. Note that C<uc()>, or C<\U> in
-interpolated strings, translates to uppercase, while C<ucfirst>,
-or C<\u> in interpolated strings, translates to titlecase in languages
-that make the distinction.
+Some people prefer to always use C<\p{Block: foo}> and C<\p{Script: bar}>
+instead of the shortcuts, whether for clarity, because they can't remember the
+difference between 'In' and 'Is' anyway, or because they aren't confident that
+those who will eventually read their code will know that difference.
-=item *
+A complete list of blocks and their shortcuts is in L<perluniprops>.
-Most operators that deal with positions or lengths in a string will
-automatically switch to using character positions, including
-C<chop()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
-C<sprintf()>, C<write()>, and C<length()>. Operators that
-specifically do not switch include C<vec()>, C<pack()>, and
-C<unpack()>. Operators that really don't care include C<chomp()>,
-operators that treats strings as a bucket of bits such as C<sort()>,
-and operators dealing with filenames.
+=head3 B<Other Properties>
-=item *
+There are many more properties than the very basic ones described here.
+A complete list is in L<perluniprops>.
-The C<pack()>/C<unpack()> letters C<c> and C<C> do I<not> change,
-since they are often used for byte-oriented formats. Again, think
-C<char> in the C language.
+Unicode defines all its properties in the compound form, so all single-form
+properties are Perl extensions. Most of these are just synonyms for the
+Unicode ones, but some are genuine extensions, including several that are in
+the compound form. And quite a few of these are actually recommended by Unicode
+(in L<http://www.unicode.org/reports/tr18>).
-There is a new C<U> specifier that converts between Unicode characters
-and code points.
+This section gives some details on all extensions that aren't synonyms for
+compound-form Unicode properties (for those, you'll have to refer to the
+L<Unicode Standard|http://www.unicode.org/reports/tr44>).
-=item *
+=over
-The C<chr()> and C<ord()> functions work on characters, similar to
-C<pack("U")> and C<unpack("U")>, I<not> C<pack("C")> and
-C<unpack("C")>. C<pack("C")> and C<unpack("C")> are methods for
-emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.
-While these methods reveal the internal encoding of Unicode strings,
-that is not something one normally needs to care about at all.
+=item B<C<\p{All}>>
-=item *
+This matches any of the 1_114_112 Unicode code points. It is a synonym for
+C<\p{Any}>.
-The bit string operators, C<& | ^ ~>, can operate on character data.
-However, for backward compatibility, such as when using bit string
-operations when characters are all less than 256 in ordinal value, one
-should not use C<~> (the bit complement) with characters of both
-values less than 256 and values greater than 256. Most importantly,
-DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
-will not hold. The reason for this mathematical I<faux pas> is that
-the complement cannot return B<both> the 8-bit (byte-wide) bit
-complement B<and> the full character-wide bit complement.
+=item B<C<\p{Alnum}>>
-=item *
+This matches any C<\p{Alphabetic}> or C<\p{Decimal_Number}> character.
-lc(), uc(), lcfirst(), and ucfirst() work for the following cases:
+=item B<C<\p{Any}>>
-=over 8
+This matches any of the 1_114_112 Unicode code points. It is a synonym for
+C<\p{All}>.
-=item *
+=item B<C<\p{ASCII}>>
-the case mapping is from a single Unicode character to another
-single Unicode character, or
+This matches any of the 128 characters in the US-ASCII character set,
+which is a subset of Unicode.
-=item *
+=item B<C<\p{Assigned}>>
-the case mapping is from a single Unicode character to more
-than one Unicode character.
+This matches any assigned code point; that is, any code point whose general
+category is not Unassigned (or equivalently, not Cn).
-=back
+=item B<C<\p{Blank}>>
-Things to do with locales (Lithuanian, Turkish, Azeri) do B<not> work
-since Perl does not understand the concept of Unicode locales.
+This is the same as C<\h> and C<\p{HorizSpace}>: A character that changes the
+spacing horizontally.
-See the Unicode Technical Report #21, Case Mappings, for more details.
+=item B<C<\p{Decomposition_Type: Non_Canonical}>> (Short: C<\p{Dt=NonCanon}>)
-=back
+Matches a character that has a non-canonical decomposition.
-=over 4
+To understand the use of this rarely used property=value combination, it is
+necessary to know some basics about decomposition.
+Consider a character, say H. It could appear with various marks around it,
+such as an acute accent, or a circumflex, or various hooks, circles, arrows,
+I<etc.>, above, below, or to one side or the other. There are many
+possibilities among the world's languages. The number of combinations is
+astronomical, and if there were a character for each combination, it would
+soon exhaust Unicode's more than a million possible characters. So Unicode
+took a different approach: there is a character for the base H, and a
+character for each of the possible marks, and these can be variously combined
+to get a final logical character. So a logical character--what appears to be a
+single character--can be a sequence of more than one individual character.
+This is called an "extended grapheme cluster"; Perl furnishes the C<\X>
+regular expression construct to match such sequences.
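For example, a sketch of C<\X> seeing one grapheme cluster where
C<length()> sees two code points:

```perl
use strict;
use warnings;

# "e" followed by COMBINING ACUTE ACCENT: two code points,
# but one logical character (one extended grapheme cluster).
my $s = "e\x{301}";

my @clusters = $s =~ /\X/g;

print length($s), " code points\n";        # 2
print scalar(@clusters), " cluster(s)\n";  # 1
```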
-=item *
+But Unicode's intent is to unify the existing character set standards and
+practices, and several pre-existing standards have single characters that
+mean the same thing as some of these combinations. An example is ISO-8859-1,
+which has quite a few of these in the Latin-1 range, an example being "LATIN
+CAPITAL LETTER E WITH ACUTE". Because this character was in this pre-existing
+standard, Unicode added it to its repertoire. But this character is considered
+by Unicode to be equivalent to the sequence consisting of the character
+"LATIN CAPITAL LETTER E" followed by the character "COMBINING ACUTE ACCENT".
-And finally, C<scalar reverse()> reverses by character rather than by byte.
+"LATIN CAPITAL LETTER E WITH ACUTE" is called a "pre-composed" character, and
+its equivalence with the sequence is called canonical equivalence. All
+pre-composed characters are said to have a decomposition (into the equivalent
+sequence), and the decomposition type is also called canonical.
+
+However, many more characters have a different type of decomposition, a
+"compatible" or "non-canonical" decomposition. The sequences that form these
+decompositions are not considered canonically equivalent to the pre-composed
+character. An example, again in the Latin-1 range, is the "SUPERSCRIPT ONE".
+It is somewhat like a regular digit 1, but not exactly; its decomposition
+into the digit 1 is called a "compatible" decomposition, specifically a
+"super" decomposition. There are several such compatibility
+decompositions (see L<http://www.unicode.org/reports/tr44>), including one
+called "compat", which means some miscellaneous type of decomposition
+that doesn't fit into the decomposition categories that Unicode has chosen.
+
+Note that most Unicode characters don't have a decomposition, so their
+decomposition type is "None".
+
+For your convenience, Perl has added the C<Non_Canonical> decomposition
+type to mean any of the several compatibility decompositions.
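The two kinds of decomposition can be observed with the core
L<Unicode::Normalize> module (a sketch; C<NFD> applies canonical
decomposition, C<NFKD> also applies compatibility decompositions):

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFKD);   # core module

my $e_acute = "\x{C9}";   # LATIN CAPITAL LETTER E WITH ACUTE
my $sup_one = "\x{B9}";   # SUPERSCRIPT ONE

# Canonical decomposition: E followed by COMBINING ACUTE ACCENT.
print "canonical\n" if NFD($e_acute) eq "E\x{301}";

# SUPERSCRIPT ONE has no canonical decomposition, only a
# compatibility ("non-canonical") one, to the plain digit 1.
print "unchanged by NFD\n"  if NFD($sup_one) eq $sup_one;
print "compatibility\n"     if NFKD($sup_one) eq "1";
print "Dt=NonCanon\n"       if $sup_one =~ /\p{Dt=NonCanon}/;
```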
+
+=item B<C<\p{Graph}>>
+
+Matches any character that is graphic. Theoretically, this means a character
+that on a printer would cause ink to be used.
+
+=item B<C<\p{HorizSpace}>>
+
+This is the same as C<\h> and C<\p{Blank}>: a character that changes the
+spacing horizontally.
+
+=item B<C<\p{In=*}>>
+
+This is a synonym for C<\p{Present_In=*}>
+
+=item B<C<\p{PerlSpace}>>
+
+This is the same as C<\s>, restricted to ASCII, namely C<S<[ \f\n\r\t]>>.
+
+Mnemonic: Perl's (original) space
+
+=item B<C<\p{PerlWord}>>
+
+This is the same as C<\w>, restricted to ASCII, namely C<[A-Za-z0-9_]>
+
+Mnemonic: Perl's (original) word.
+
+=item B<C<\p{Posix...}>>
+
+There are several of these, which are equivalents using the C<\p>
+notation for Posix classes and are described in
+L<perlrecharclass/POSIX Character Classes>.
+
+=item B<C<\p{Present_In: *}>> (Short: C<\p{In=*}>)
+
+This property is used when you need to know in what Unicode version(s) a
+character is.
+
+The "*" above stands for a two-digit Unicode version number, such as
+C<1.1> or C<4.0>; or the "*" can also be C<Unassigned>. This property will
+match the code points whose final disposition has been settled as of the
+Unicode release given by the version number; C<\p{Present_In: Unassigned}>
+will match those code points whose meaning has yet to be assigned.
+
+For example, C<U+0041> "LATIN CAPITAL LETTER A" was present in the very first
+Unicode release available, which is C<1.1>, so this property is true for all
+valid "*" versions. On the other hand, C<U+1EFF> was not assigned until version
+5.1 when it became "LATIN SMALL LETTER Y WITH LOOP", so the only "*" that
+would match it are 5.1, 5.2, and later.
+
+Unicode furnishes the C<Age> property from which this is derived. The problem
+with Age is that a strict interpretation of it (which Perl takes) has it
+matching the precise release a code point's meaning is introduced in. Thus
+C<U+0041> would match only 1.1; and C<U+1EFF> only 5.1. This is not usually what
+you want.
+
+Some non-Perl implementations of the Age property may change its meaning to be
+the same as the Perl Present_In property; just be aware of that.
+
+Another confusion with both these properties is that the definition is not
+that the code point has been I<assigned>, but that the meaning of the code point
+has been I<determined>. This is because 66 code points will always be
+unassigned, and so the Age for them is the Unicode version in which the decision
+to make them so was made. For example, C<U+FDD0> is to be permanently
+unassigned to a character, and the decision to do that was made in version 3.1,
+so C<\p{Age=3.1}> matches this character, as does C<\p{Present_In: 3.1}> and up.
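A sketch of the Age/Present_In distinction (the particular assertions
here hold on any Perl whose Unicode data is from release 5.1 or later):

```perl
use strict;
use warnings;

# "A" has been in Unicode since the first release, 1.1, so it is
# Present_In every version, while its Age is exactly 1.1.
print "ok 1\n" if "A" =~ /\p{Present_In: 1.1}/;
print "ok 2\n" if "A" =~ /\p{Present_In: 5.1}/;
print "ok 3\n" if "A" =~ /\p{Age: 1.1}/;

# U+1EFF was assigned in 5.1, so it is absent from earlier versions.
print "ok 4\n" if "\x{1EFF}" =~ /\p{Present_In: 5.1}/;
print "ok 5\n" if "\x{1EFF}" !~ /\p{Present_In: 4.0}/;
```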
+
+=item B<C<\p{Print}>>
+
+This matches any character that is graphical or blank, except controls.
+
+=item B<C<\p{SpacePerl}>>
+
+This is the same as C<\s>, including beyond ASCII.
+
+Mnemonic: Space, as modified by Perl. (It doesn't include the vertical tab,
+which both the Posix standard and Unicode consider white space.)
+
+=item B<C<\p{VertSpace}>>
+
+This is the same as C<\v>: A character that changes the spacing vertically.
+
+=item B<C<\p{Word}>>
+
+This is the same as C<\w>, including over 100_000 characters beyond ASCII.
+
+=item B<C<\p{XPosix...}>>
+
+There are several of these, which are the standard Posix classes
+extended to the full Unicode range. They are described in
+L<perlrecharclass/POSIX Character Classes>.
=back
=head2 User-Defined Character Properties
-You can define your own character properties by defining subroutines
-whose names begin with "In" or "Is". The subroutines must be defined
-in the C<main> package. The user-defined properties can be used in the
-regular expression C<\p> and C<\P> constructs. Note that the effect
-is compile-time and immutable once defined.
+You can define your own binary character properties by defining subroutines
+whose names begin with "In" or "Is". The subroutines can be defined in any
+package. The user-defined properties can be used in the regular expression
+C<\p> and C<\P> constructs; if you are using a user-defined property from a
+package other than the one you are in, you must specify its package in the
+C<\p> or C<\P> construct.
+
+ # assuming property IsForeign defined in Lang::
+ package main; # property package name required
+ if ($txt =~ /\p{Lang::IsForeign}+/) { ... }
+
+ package Lang; # property package name not required
+ if ($txt =~ /\p{IsForeign}+/) { ... }
+
+Note that the effect is compile-time and immutable once defined.
+However, the subroutines are passed a single parameter, which is 0 if
+case-sensitive matching is in effect and non-zero if caseless matching
+is in effect. The subroutine may return different values depending on
+the value of the flag, and one set of values will immutably be in effect
+for all case-sensitive matches, and the other set for all case-insensitive
+matches.
+
+Note that if the regular expression is tainted, then Perl will die rather
+than call a subroutine whose name is determined by the tainted data.
The subroutines must return a specially-formatted string, with one
or more newline-separated lines. Each line must be one of the following:
=item *
+A single hexadecimal number denoting a Unicode code point to include.
+
+=item *
+
Two hexadecimal numbers separated by horizontal whitespace (space or
tabular characters) denoting a range of Unicode code points to include.
=item *
Something to include, prefixed by "+": a built-in character
-property (prefixed by "utf8::"), to represent all the characters in that
-property; two hexadecimal code points for a range; or a single
-hexadecimal code point.
+property (prefixed by "utf8::") or a user-defined character property,
+to represent all the characters in that property; two hexadecimal code
+points for a range; or a single hexadecimal code point.
=item *
Something to exclude, prefixed by "-": an existing character
-property (prefixed by "utf8::"), for all the characters in that
-property; two hexadecimal code points for a range; or a single
-hexadecimal code point.
+property (prefixed by "utf8::") or a user-defined character property,
+to represent all the characters in that property; two hexadecimal code
+points for a range; or a single hexadecimal code point.
=item *
Something to negate, prefixed "!": an existing character
-property (prefixed by "utf8::") for all the characters except the
-characters in the property; two hexadecimal code points for a range;
-or a single hexadecimal code point.
+property (prefixed by "utf8::") or a user-defined character property,
+for all the characters except the characters in the property; two
+hexadecimal code points for a range; or a single hexadecimal code point.
+
+=item *
+
+Something to intersect with, prefixed by "&": an existing character
+property (prefixed by "utf8::") or a user-defined character property,
+to represent all the characters in that property; two hexadecimal
+code points for a range; or a single hexadecimal code point.
=back
syllabaries (hiragana and katakana), you can define
sub InKana {
- return <<END;
+ return <<END;
3040\t309F
30A0\t30FF
END
You could also have used the existing block property names:
sub InKana {
- return <<'END';
+ return <<'END';
+utf8::InHiragana
+utf8::InKatakana
END
the non-characters:
sub InKana {
- return <<'END';
+ return <<'END';
+utf8::InHiragana
+utf8::InKatakana
-utf8::IsCn
The negation is useful for defining (surprise!) negated classes.
sub InNotKana {
- return <<'END';
+ return <<'END';
!utf8::InHiragana
-utf8::InKatakana
+utf8::IsCn
END
}
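As a self-contained sketch, the first C<InKana> definition above can be
exercised like this (heredoc interpolation turns the C<\t> escapes into
real tabs; the sub must be visible from the package doing the match):

```perl
use strict;
use warnings;

# User-defined property: code-point ranges for the Hiragana and
# Katakana blocks, one range per line, endpoints separated by a tab.
sub InKana {
    return <<END;
3040\t309F
30A0\t30FF
END
}

print "katakana\n" if "\x{30A2}" =~ /\p{InKana}/;   # KATAKANA LETTER A
print "not kana\n" if "A" !~ /\p{InKana}/;
```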
-You can also define your own mappings to be used in the lc(),
-lcfirst(), uc(), and ucfirst() (or their string-inlined versions).
-The principle is the same: define subroutines in the C<main> package
-with names like C<ToLower> (for lc() and lcfirst()), C<ToTitle> (for
-the first character in ucfirst()), and C<ToUpper> (for uc(), and the
-rest of the characters in ucfirst()).
+Intersection is useful for getting the common characters matched by
+two (or more) classes.
+
+ sub InFooAndBar {
+ return <<'END';
+ +main::Foo
+ &main::Bar
+ END
+ }
+
+It's important to remember not to use "&" for the first set; that
+would be intersecting with nothing, resulting in an empty set.
+
+=head2 User-Defined Case Mappings (for serious hackers only)
+
+B<This feature is deprecated and is scheduled to be removed in Perl
+5.16.>
+The CPAN module L<Unicode::Casing> provides better functionality
+without the drawbacks described below.
-The string returned by the subroutines needs now to be three
-hexadecimal numbers separated by tabulators: start of the source
-range, end of the source range, and start of the destination range.
-For example:
+You can define your own mappings to be used in C<lc()>,
+C<lcfirst()>, C<uc()>, and C<ucfirst()> (or their string-inlined versions,
+C<\L>, C<\l>, C<\U>, and C<\u>). The mappings are currently only valid
+on strings encoded in UTF-8, but see below for a partial workaround for
+this restriction.
+
+The principle is similar to that of user-defined character
+properties: define subroutines that do the mappings.
+C<ToLower> is used for C<lc()>, C<\L>, C<lcfirst()>, and C<\l>; C<ToTitle> for
+C<ucfirst()> and C<\u>; and C<ToUpper> for C<uc()> and C<\U>.
+
+C<ToUpper()> should look something like this:
sub ToUpper {
- return <<END;
- 0061\t0063\t0041
+ return <<END;
+ 0061\t007A\t0041
+ 0101\t\t0100
END
}
-defines an uc() mapping that causes only the characters "a", "b", and
-"c" to be mapped to "A", "B", "C", all other characters will remain
-unchanged.
+This sample C<ToUpper()> has the effect of mapping "a-z" to "A-Z", 0x101
+to 0x100, and all other characters map to themselves. The first
+returned line means to map the code point at 0x61 ("a") to 0x41 ("A"),
+the code point at 0x62 ("b") to 0x42 ("B"), ..., 0x7A ("z") to 0x5A
+("Z"). The second line maps just the code point 0x101 to 0x100. Since
+there are no other mappings defined, all other code points map to
+themselves.
+
+This mechanism is not well behaved as far as affecting other packages
+and scopes. All non-threaded programs have exactly one uppercasing
+behavior, one lowercasing behavior, and one titlecasing behavior in
+effect for utf8-encoded strings for the duration of the program. Each
+of these behaviors is irrevocably determined the first time the
+corresponding function is called to change a utf8-encoded string's case.
+If a corresponding C<To-> function has been defined in the package that
+makes that first call, the mapping defined by that function will be the
+mapping used for the duration of the program's execution across all
+packages and scopes. If no corresponding C<To-> function has been
+defined in that package, the standard official mapping will be used for
+all packages and scopes, and any corresponding C<To-> function anywhere
+will be ignored. Threaded programs have similar behavior. If the
+program's casing behavior has been decided at the time of a thread's
+creation, the thread will inherit that behavior. But, if the behavior
+hasn't been decided, the thread gets to decide for itself, and its
+decision does not affect other threads nor its creator.
+
+As shown by the example above, you have to furnish a complete mapping;
+you can't just override a couple of characters and leave the rest
+unchanged. You can find all the official mappings in the directory
+C<$Config{privlib}>F</unicore/To/>. The mapping data is returned as a
+here-document. The C<utf8::ToSpecI<Foo>> hashes in those files are special
+exception mappings derived from
+C<$Config{privlib}>F</unicore/SpecialCasing.txt>. (The "Digit" and
+"Fold" mappings that one can see in the directory are not directly
+user-accessible; one can use either the L<Unicode::UCD> module, or just match
+case-insensitively, which is what uses the "Fold" mapping. Neither is
+user-overridable.)
+
+If you have many mappings to change, you can take the official mapping data,
+change by hand the affected code points, and place the whole thing into your
+subroutine. But this will only be valid on Perls that use the same Unicode
+version. Another option would be to have your subroutine read the official
+mapping files and overwrite the affected code points.
+
+If you have only a few mappings to change, starting in 5.14 you can use the
+following trick, here illustrated for Turkish.
+
+ use Config;
+ use charnames ":full";
-If there is no source range to speak of, that is, the mapping is from
-a single character to another single character, leave the end of the
-source range empty, but the two tabulator characters are still needed.
-For example:
+ sub ToUpper {
+ my $official = do "$Config{privlib}/unicore/To/Upper.pl";
+ $utf8::ToSpecUpper{'i'} =
+ "\N{LATIN CAPITAL LETTER I WITH DOT ABOVE}";
+ return $official;
+ }
+
+This takes the official mappings and overrides just one, for "LATIN SMALL
+LETTER I". The keys to the hash must be the bytes that form the UTF-8
+(on EBCDIC platforms, UTF-EBCDIC) encoding of the character, as illustrated
+by the inverse function.
sub ToLower {
- return <<END;
- 0041\t\t0061
- END
+ my $official = do "$Config{privlib}/unicore/To/Lower.pl";
+ $utf8::ToSpecLower{"\xc4\xb0"} = "i";
+ return $official;
}
-defines a lc() mapping that causes only "A" to be mapped to "a", all
-other characters will remain unchanged.
+This example is for an ASCII platform, and C<\xc4\xb0> is the string of
+bytes that together form the UTF-8 that represents C<\N{LATIN CAPITAL
+LETTER I WITH DOT ABOVE}>, C<U+0130>. You can avoid having to figure out
+these bytes, and at the same time make it work on all platforms by
+instead writing:
-(For serious hackers only) If you want to introspect the default
-mappings, you can find the data in the directory
-C<$Config{privlib}>/F<unicore/To/>. The mapping data is returned as
-the here-document, and the C<utf8::ToSpecFoo> are special exception
-mappings derived from <$Config{privlib}>/F<unicore/SpecialCasing.txt>.
-The C<Digit> and C<Fold> mappings that one can see in the directory
-are not directly user-accessible, one can use either the
-C<Unicode::UCD> module, or just match case-insensitively (that's when
-the C<Fold> mapping is used).
+ sub ToLower {
+ my $official = do "$Config{privlib}/unicore/To/Lower.pl";
+ my $sequence = "\N{LATIN CAPITAL LETTER I WITH DOT ABOVE}";
+ utf8::encode($sequence);
+ $utf8::ToSpecLower{$sequence} = "i";
+ return $official;
+ }
-A final note on the user-defined property tests and mappings: they
-will be used only if the scalar has been marked as having Unicode
-characters. Old byte-style strings will not be affected.
+This works because C<utf8::encode()> takes the single character and
+converts it to the sequence of bytes that constitute it. Note that we took
+advantage of the fact that C<"i"> is the same in UTF-8 or UTF-EBCDIC as not;
+otherwise we would have had to write
+
+ $utf8::ToSpecLower{$sequence} = "\N{LATIN SMALL LETTER I}";
+
+in the ToLower example, and in the ToUpper example, use
+
+ my $sequence = "\N{LATIN SMALL LETTER I}";
+ utf8::encode($sequence);
+
+A big caveat to the above trick and to this whole mechanism in general,
+is that they work only on strings encoded in UTF-8. You can partially
+get around this by using C<use subs>. (But better to just convert to
+use L<Unicode::Casing>.)
+
+(The trick illustrated here does work in earlier releases, but only if all the
+characters you want to override have ordinal values of 256 or higher, or
+if you use the other tricks given just below.)
+
+The mappings are in effect only for the package they are defined in, and only
+on scalars that have been marked as having Unicode characters, for example by
+using C<utf8::upgrade()>. Although probably not advisable, you can
+cause the mappings to be used globally by importing into C<CORE::GLOBAL>
+(see L<CORE>).
+
+You can partially get around the restriction that the source strings
+must be encoded in UTF-8 by using C<use subs> (or by importing into
+C<CORE::GLOBAL>):
+
+ use subs qw(uc ucfirst lc lcfirst);
+
+ sub uc($) {
+ my $string = shift;
+ utf8::upgrade($string);
+ return CORE::uc($string);
+ }
+
+ sub lc($) {
+ my $string = shift;
+ utf8::upgrade($string);
+
+ # Unless an I is before a dot_above, it turns into a dotless i.
+ # (The character class with the combining classes matches non-above
+ # marks following the I. Any number of these may be between the 'I' and
+ # the dot_above, and the dot_above will still apply to the 'I'.)
+ use charnames ":full";
+ $string =~
+ s/I
+ (?! [^\p{ccc=0}\p{ccc=Above}]* \N{COMBINING DOT ABOVE} )
+ /\N{LATIN SMALL LETTER DOTLESS I}/gx;
+
+ # But when the I is followed by a dot_above, remove the
+ # dot_above so the end result will be i.
+ $string =~ s/I
+ ([^\p{ccc=0}\p{ccc=Above}]* )
+ \N{COMBINING DOT ABOVE}
+ /i$1/gx;
+ return CORE::lc($string);
+ }
+
+These examples (also for Turkish) make sure the input is in UTF-8, and then
+call the corresponding official function, which will use the C<ToUpper()> and
+C<ToLower()> functions you have defined.
+(For Turkish, there are other required functions: C<ucfirst>, C<lcfirst>,
+and C<ToTitle>. These are very similar to the ones given above.)
+
+The reason this is only a partial fix is that it doesn't affect the C<\l>,
+C<\L>, C<\u>, and C<\U> case-change operations in regular expressions,
+which still require the source to be encoded in utf8 (see L</The "Unicode
+Bug">). (Again, use L<Unicode::Casing> instead.)
+
+The C<lc()> example shows how you can add context-dependent casing. Note
+that context-dependent casing suffers from the problem that the string
+passed to the casing function may not have sufficient context to make
+the proper choice. Also, it will not be called for C<\l>, C<\L>, C<\u>,
+and C<\U>.
=head2 Character Encodings for Input and Output
=head2 Unicode Regular Expression Support Level
-The following list of Unicode support for regular expressions describes
-all the features currently supported. The references to "Level N"
-and the section numbers refer to the Unicode Technical Report 18,
-"Unicode Regular Expression Guidelines", version 6 (Unicode 3.2.0,
-Perl 5.8.0).
+The following list describes the Unicode features for regular expressions
+that are currently directly supported by core Perl. The references to "Level N"
+and the section numbers refer to Unicode Technical Standard #18,
+"Unicode Regular Expressions", version 13, from August 2008.
=over 4
Level 1 - Basic Unicode Support
- 2.1 Hex Notation - done [1]
- Named Notation - done [2]
- 2.2 Categories - done [3][4]
- 2.3 Subtraction - MISSING [5][6]
- 2.4 Simple Word Boundaries - done [7]
- 2.5 Simple Loose Matches - done [8]
- 2.6 End of Line - MISSING [9][10]
-
- [ 1] \x{...}
- [ 2] \N{...}
- [ 3] . \p{...} \P{...}
- [ 4] now scripts (see UTR#24 Script Names) in addition to blocks
- [ 5] have negation
- [ 6] can use regular expression look-ahead [a]
- or user-defined character properties [b] to emulate subtraction
- [ 7] include Letters in word characters
- [ 8] note that Perl does Full case-folding in matching, not Simple:
- for example U+1F88 is equivalent with U+1F00 U+03B9,
- not with 1F80. This difference matters for certain Greek
- capital letters with certain modifiers: the Full case-folding
- decomposes the letter, while the Simple case-folding would map
- it to a single character.
- [ 9] see UTR #13 Unicode Newline Guidelines
- [10] should do ^ and $ also on \x{85}, \x{2028} and \x{2029}
- (should also affect <>, $., and script line numbers)
- (the \x{85}, \x{2028} and \x{2029} do match \s)
+ RL1.1 Hex Notation - done [1]
+ RL1.2 Properties - done [2][3]
+ RL1.2a Compatibility Properties - done [4]
+ RL1.3 Subtraction and Intersection - MISSING [5]
+ RL1.4 Simple Word Boundaries - done [6]
+ RL1.5 Simple Loose Matches - done [7]
+ RL1.6 Line Boundaries - MISSING [8][9]
+ RL1.7 Supplementary Code Points - done [10]
+
+ [1] \x{...}
+ [2] \p{...} \P{...}
+ [3] supports not only the minimal list, but all Unicode character
+ properties (see L</Unicode Character Properties>)
+ [4] \d \D \s \S \w \W \X [:prop:] [:^prop:]
+ [5] can use regular expression look-ahead [a] or
+ user-defined character properties [b] to emulate set
+ operations
+ [6] \b \B
+ [7] note that Perl does Full case-folding in matching (but with
+ bugs), not Simple: for example U+1F88 is equivalent to
+ U+1F00 U+03B9, not to U+1F80. This difference matters
+ mainly for certain Greek capital letters with certain
+ modifiers: the Full case-folding decomposes the letter,
+ while the Simple case-folding would map it to a single
+ character.
+ [8] should do ^ and $ also on U+000B (\v in C), FF (\f), CR
+ (\r), CRLF (\r\n), NEL (U+0085), LS (U+2028), and PS
+ (U+2029); should also affect <>, $., and script line
+ numbers; should not split lines within CRLF [c] (i.e. there
+ is no empty line between \r and \n)
+ [9] Linebreaking conformant with UAX#14 "Unicode Line Breaking
+ Algorithm" is available through the Unicode::LineBreak
+ module.
+ [10] UTF-8/UTF-EBCDIC used in Perl allows not only U+10000 to
+ U+10FFFF but also beyond U+10FFFF
[a] You can mimic class subtraction using lookahead.
-For example, what UTR #18 might write as
+For example, what UTS#18 might write as
[{Greek}-[{UNASSIGNED}]]
which will match assigned characters known to be part of the Greek script.
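can be approximated in Perl with a negative lookahead, for example
(a sketch):

```perl
use strict;
use warnings;

# Greek-script characters minus unassigned code points, emulating
# the UTS#18 set [{Greek}-[{UNASSIGNED}]] with a lookahead.
my $assigned_greek = qr/(?!\p{Unassigned})\p{Greek}/;

print "alpha matches\n" if "\x{3B1}" =~ $assigned_greek;  # GREEK SMALL LETTER ALPHA
print "'a' does not\n"  if "a" !~ $assigned_greek;
```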
Also see the Unicode::Regex::Set module, it does implement the full
-UTR #18 grouping, intersection, union, and removal (subtraction) syntax.
+UTS#18 grouping, intersection, union, and removal (subtraction) syntax.
+
+[b] '+' for union, '-' for removal (set-difference), '&' for intersection
+(see L</"User-Defined Character Properties">)
-[b] See L</"User-Defined Character Properties">.
+[c] Try the C<:crlf> layer (see L<PerlIO>).
=item *
Level 2 - Extended Unicode Support
- 3.1 Surrogates - MISSING [11]
- 3.2 Canonical Equivalents - MISSING [12][13]
- 3.3 Locale-Independent Graphemes - MISSING [14]
- 3.4 Locale-Independent Words - MISSING [15]
- 3.5 Locale-Independent Loose Matches - MISSING [16]
+ RL2.1 Canonical Equivalents - MISSING [11][12]
+ RL2.2 Default Grapheme Clusters - MISSING [13]
+ RL2.3 Default Word Boundaries - MISSING [14]
+ RL2.4 Default Loose Matches - MISSING [15]
+ RL2.5 Name Properties - DONE
+ RL2.6 Wildcard Properties - MISSING
- [11] Surrogates are solely a UTF-16 concept and Perl's internal
- representation is UTF-8. The Encode module does UTF-16, though.
- [12] see UTR#15 Unicode Normalization
- [13] have Unicode::Normalize but not integrated to regexes
- [14] have \X but at this level . should equal that
- [15] need three classes, not just \w and \W
- [16] see UTR#21 Case Mappings
+ [11] see UAX#15 "Unicode Normalization Forms"
+ [12] have Unicode::Normalize but not integrated to regexes
+ [13] have \X but we don't have a "Grapheme Cluster Mode"
+ [14] see UAX#29, Word Boundaries
+ [15] see UAX#21 "Case Mappings"
=item *
-Level 3 - Locale-Sensitive Support
-
- 4.1 Locale-Dependent Categories - MISSING
- 4.2 Locale-Dependent Graphemes - MISSING [16][17]
- 4.3 Locale-Dependent Words - MISSING
- 4.4 Locale-Dependent Loose Matches - MISSING
- 4.5 Locale-Dependent Ranges - MISSING
-
- [16] see UTR#10 Unicode Collation Algorithms
- [17] have Unicode::Collate but not integrated to regexes
+Level 3 - Tailored Support
+
+ RL3.1 Tailored Punctuation - MISSING
+ RL3.2 Tailored Grapheme Clusters - MISSING [17][18]
+ RL3.3 Tailored Word Boundaries - MISSING
+ RL3.4 Tailored Loose Matches - MISSING
+ RL3.5 Tailored Ranges - MISSING
+ RL3.6 Context Matching - MISSING [19]
+ RL3.7 Incremental Matches - MISSING
+ ( RL3.8 Unicode Set Sharing )
+ RL3.9 Possible Match Sets - MISSING
+ RL3.10 Folded Matching - MISSING [20]
+ RL3.11 Submatchers - MISSING
+
+ [17] see UAX#10 "Unicode Collation Algorithms"
+ [18] have Unicode::Collate but not integrated to regexes
+ [19] have (?<=x) and (?=x), but look-aheads or look-behinds
+ should see outside of the target substring
+ [20] need insensitive matching for linguistic features other
+ than case; for example, hiragana to katakana, wide and
+ narrow, simplified Han to traditional Han (see UTR#30
+ "Character Foldings")
=back
UTF-8
-UTF-8 is a variable-length (1 to 6 bytes, current character allocations
-require 4 bytes), byte-order independent encoding. For ASCII (and we
-really do mean 7-bit ASCII, not another 8-bit encoding), UTF-8 is
-transparent.
+UTF-8 is a variable-length (1 to 4 bytes), byte-order independent
+encoding. For ASCII (and we really do mean 7-bit ASCII, not another
+8-bit encoding), UTF-8 is transparent.
The following table is from Unicode 3.2.
Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte
U+0000..U+007F 00..7F
- U+0080..U+07FF C2..DF 80..BF
- U+0800..U+0FFF E0 A0..BF 80..BF
+ U+0080..U+07FF * C2..DF 80..BF
+ U+0800..U+0FFF E0 * A0..BF 80..BF
U+1000..U+CFFF E1..EC 80..BF 80..BF
U+D000..U+D7FF ED 80..9F 80..BF
- U+D800..U+DFFF ******* ill-formed *******
+ U+D800..U+DFFF +++++++ utf16 surrogates, not legal utf8 +++++++
U+E000..U+FFFF EE..EF 80..BF 80..BF
- U+10000..U+3FFFF F0 90..BF 80..BF 80..BF
+ U+10000..U+3FFFF F0 * 90..BF 80..BF 80..BF
U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF
U+100000..U+10FFFF F4 80..8F 80..BF 80..BF
-Note the C<A0..BF> in C<U+0800..U+0FFF>, the C<80..9F> in
-C<U+D000...U+D7FF>, the C<90..B>F in C<U+10000..U+3FFFF>, and the
-C<80...8F> in C<U+100000..U+10FFFF>. The "gaps" are caused by legal
-UTF-8 avoiding non-shortest encodings: it is technically possible to
-UTF-8-encode a single code point in different ways, but that is
-explicitly forbidden, and the shortest possible encoding should always
-be used. So that's what Perl does.
+Note the gaps marked by "*" before several of the byte entries above. These are
+caused by legal UTF-8 avoiding non-shortest encodings: it is technically
+possible to UTF-8-encode a single code point in different ways, but that is
+explicitly forbidden, and the shortest possible encoding should always be used
+(and that is what Perl does).
Another way to look at it is via bits:
ccccbbbbbbaaaaaa 1110cccc 10bbbbbb 10aaaaaa
00000dddccccccbbbbbbaaaaaa 11110ddd 10cccccc 10bbbbbb 10aaaaaa
-As you can see, the continuation bytes all begin with C<10>, and the
-leading bits of the start byte tell how many bytes the are in the
+As you can see, the continuation bytes all begin with "10", and the
+leading bits of the start byte tell how many bytes there are in the
encoded character.
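+
+The byte patterns above can be checked from Perl with the standard
+Encode module; for example, U+20AC (EURO SIGN) falls in the three-byte
+row of the table:

```perl
use Encode qw(encode);

# U+20AC = 0010 0000 1010 1100
#   -> 1110 0010, 10 000010, 10 101100 = E2 82 AC
my $bytes = encode("UTF-8", "\x{20AC}");
print join(" ", map { sprintf "%02X", ord } split //, $bytes), "\n";  # E2 82 AC
```
+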
+The original UTF-8 specification allowed up to 6 bytes, to allow
+encoding of numbers up to 0x7FFF_FFFF. Perl continues to allow those,
+and has extended that up to 13 bytes to encode code points up to what
+can fit in a 64-bit word. However, if you output any of these, Perl
+will warn that they are non-portable; and under strict UTF-8 input
+protocols, they are forbidden.
+
+The Unicode non-character code points are also disallowed in UTF-8 in
+"open interchange". See L</Non-character code points>.
+
=item *
UTF-EBCDIC
-The followings items are mostly for reference and general Unicode
-knowledge, Perl doesn't use these constructs internally.
+The following items are mostly for reference and general Unicode
+knowledge; Perl doesn't use these constructs internally.
-UTF-16 is a 2 or 4 byte encoding. The Unicode code points
-C<U+0000..U+FFFF> are stored in a single 16-bit unit, and the code
+Like UTF-8, UTF-16 is a variable-width encoding, but where
+UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units.
+All code points occupy either 2 or 4 bytes in UTF-16: code points
+C<U+0000..U+FFFF> are stored in a single 16-bit unit, and code
-points C<U+10000..U+10FFFF> in two 16-bit units. The latter case is
-using I<surrogates>, the first 16-bit unit being the I<high
-surrogate>, and the second being the I<low surrogate>.
+points C<U+10000..U+10FFFF> in two 16-bit units. The latter case uses
+I<surrogates>: the first 16-bit unit is the I<high surrogate>, and the
+second is the I<low surrogate>.
Surrogates are code points set aside to encode the C<U+10000..U+10FFFF>
range of Unicode code points in pairs of 16-bit units. The I<high
-surrogates> are the range C<U+D800..U+DBFF>, and the I<low surrogates>
+surrogates> are the range C<U+D800..U+DBFF> and the I<low surrogates>
are the range C<U+DC00..U+DFFF>. The surrogate encoding is
- $hi = ($uni - 0x10000) / 0x400 + 0xD800;
- $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
+ $hi = ($uni - 0x10000) / 0x400 + 0xD800;
+ $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
and the decoding is
- $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
-
-If you try to generate surrogates (for example by using chr()), you
-will get a warning if warnings are turned on, because those code
-points are not valid for a Unicode character.
+ $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
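+
+Applying these formulas to a concrete code point: U+1D11E, MUSICAL
+SYMBOL G CLEF, becomes the surrogate pair C<0xD834 0xDD1E>.

```perl
use integer;    # the formulas above assume integer division

my $uni = 0x1D11E;                           # MUSICAL SYMBOL G CLEF
my $hi  = ($uni - 0x10000) / 0x400 + 0xD800;
my $lo  = ($uni - 0x10000) % 0x400 + 0xDC00;
printf "U+%X => 0x%X 0x%X\n", $uni, $hi, $lo;    # U+1D11E => 0xD834 0xDD1E

# decoding recovers the original code point
my $back = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
die "round trip failed" unless $back == $uni;
```
+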
Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16
itself can be used for in-memory computations, but if storage or
was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.)
The way this trick works is that the character with the code point
-C<U+FFFE> is guaranteed not to be a valid Unicode character, so the
+C<U+FFFE> is not supposed to be in input streams, so the
sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in
little-endian format" and cannot be C<U+FFFE>, represented in big-endian
format".
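+
+The standard Encode module shows the BOM in action: unlike the
+endian-specific C<UTF-16BE> and C<UTF-16LE> encodings, its plain
+C<UTF-16> encoding prepends a BOM (and encodes big-endian):

```perl
use Encode qw(encode);

sub hex_dump { join " ", map { sprintf "%02X", ord } split //, shift }

print hex_dump(encode("UTF-16BE", "A")), "\n";  # 00 41
print hex_dump(encode("UTF-16LE", "A")), "\n";  # 41 00
print hex_dump(encode("UTF-16",   "A")), "\n";  # FE FF 00 41
```
+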
+Surrogates have no meaning in Unicode outside their use in pairs to
+represent other code points. However, Perl allows them to be
+represented individually internally, for example by saying
+C<chr(0xD801)>, so that all code points, not just those valid for open
+interchange, are
+representable. Unicode does define semantics for them, such as their
+General Category is "Cs". But because their use is somewhat dangerous,
+Perl will warn (using the warning category "surrogate", which is a
+sub-category of "utf8") if an attempt is made
+to do things like take the lower case of one, or match
+case-insensitively, or to output them. (But don't try this on Perls
+before 5.14.)
+
=item *
UTF-32, UTF-32BE, UTF-32LE
-The UTF-32 family is pretty much like the UTF-16 family, expect that
+The UTF-32 family is pretty much like the UTF-16 family, except that
the units are 32-bit, and therefore the surrogate scheme is not
-needed. The BOM signatures will be C<0x00 0x00 0xFE 0xFF> for BE and
-C<0xFF 0xFE 0x00 0x00> for LE.
+needed. UTF-32 is a fixed-width encoding. The BOM signatures are
+C<0x00 0x00 0xFE 0xFF> for BE and C<0xFF 0xFE 0x00 0x00> for LE.
=item *
UCS-2, UCS-4
-Encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit
+Legacy, fixed-width encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit
encoding. Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
because it does not use surrogates. UCS-4 is a 32-bit encoding,
-functionally identical to UTF-32.
+functionally identical to UTF-32 (the difference being that
+UCS-4 forbids neither surrogates nor code points larger than 0x10_FFFF).
=item *
=back
+=head2 Non-character code points
+
+66 code points are set aside in Unicode as "non-character code points".
+These all have the Unassigned (Cn) General Category, and they never will
+be assigned. These are never supposed to be in legal Unicode input
+streams, so that code can use them as sentinels that can be mixed in
+with character data, and they always will be distinguishable from that data.
+To keep them out of Perl input streams, strict UTF-8 should be
+specified, such as by using the layer C<:encoding('UTF-8')>. The
+non-character code points are the 32 between U+FDD0 and U+FDEF, and the
+34 code points U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF.
+Some people are under the mistaken impression that these are "illegal",
+but that is not true. An application or cooperating set of applications
+can legally use them at will internally; but these code points are
+"illegal for open interchange". Therefore, Perl will not accept these
+from input streams unless lax rules are being used, and will warn
+(using the warning category "nonchar", which is a sub-category of "utf8") if
+an attempt is made to output them.
+
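+As a cross-check, the two groups just described do add up to 66:

```perl
# U+FDD0..U+FDEF, plus U+...FFFE and U+...FFFF on each of the 17 planes
my @nonchars = map { 0xFDD0 + $_ } 0 .. 31;
push @nonchars, map { ($_ << 16) | 0xFFFE, ($_ << 16) | 0xFFFF } 0 .. 0x10;
print scalar(@nonchars), "\n";    # 66
```
+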
+=head2 Beyond Unicode code points
+
+The maximum Unicode code point is U+10FFFF. But Perl accepts code
+points up to the maximum permissible unsigned number available on the
+platform. However, Perl will not accept these from input streams unless
+lax rules are being used, and will warn (using the warning category
+"non_unicode", which is a sub-category of "utf8") if an attempt is made to
+operate on or output them. For example, C<uc(chr(0x11_0000))> will generate
+this warning, returning the input parameter as its result, as the upper
+case of every non-Unicode code point is the code point itself.
+
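+For example (the C<no warnings> line merely silences the warning just
+described):

```perl
no warnings 'non_unicode';     # sub-category of "utf8", described above

my $c = chr(0x11_0000);        # one past the maximum Unicode code point
print uc($c) eq $c ? "unchanged\n" : "changed\n";    # unchanged
```
+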
=head2 Security Implications of Unicode
+Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.
+Also, note the following:
+
=over 4
=item *
Malformed UTF-8
-Unfortunately, the specification of UTF-8 leaves some room for
+Unfortunately, the original specification of UTF-8 leaves some room for
interpretation of how many bytes of encoded output one should generate
from one input Unicode character. Strictly speaking, the shortest
possible sequence of UTF-8 bytes should be generated,
because otherwise there is potential for an input buffer overflow at
the receiving end of a UTF-8 connection. Perl always generates the
-shortest length UTF-8, and with warnings on Perl will warn about
+shortest length UTF-8, and with warnings on, Perl will warn about
non-shortest length UTF-8 along with other malformations, such as the
-surrogates, which are not real Unicode code points.
+surrogates, which are not Unicode code points valid for interchange.
=item *
-Regular expressions behave slightly differently between byte data and
-character (Unicode) data. For example, the "word character" character
-class C<\w> will work differently depending on if data is eight-bit bytes
-or Unicode.
+Regular expression pattern matching may surprise you if you're not
+accustomed to Unicode. Starting in Perl 5.14, several pattern
+modifiers are available to control this, called the character set
+modifiers. Details are given in L<perlre/Character set modifiers>.
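+
+For example, under C</u> the C<\w> class follows Unicode rules, while
+C</a> restricts it to ASCII (both modifiers need Perl 5.14 or later):

```perl
my $e_acute = "\xE9";    # LATIN SMALL LETTER E WITH ACUTE

print $e_acute =~ /\w/u ? 1 : 0, "\n";  # 1: a word character under Unicode rules
print $e_acute =~ /\w/a ? 1 : 0, "\n";  # 0: \w is ASCII-only under /a
```
+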
-In the first case, the set of C<\w> characters is either small--the
-default set of alphabetic characters, digits, and the "_"--or, if you
-are using a locale (see L<perllocale>), the C<\w> might contain a few
-more letters according to your language and country.
-
-In the second case, the C<\w> set of characters is much, much larger.
-Most importantly, even in the set of the first 256 characters, it will
-probably match different characters: unlike most locales, which are
-specific to a language and country pair, Unicode classifies all the
-characters that are letters I<somewhere> as C<\w>. For example, your
-locale might not think that LATIN SMALL LETTER ETH is a letter (unless
-you happen to speak Icelandic), but Unicode does.
+=back
As discussed elsewhere, Perl has one foot (two hooves?) planted in
each of two worlds: the old world of bytes and the new world of
switch-over to characters should happen. Characters shouldn't get
downgraded to bytes, either. It is possible to accidentally mix bytes
and characters, however (see L<perluniintro>), in which case C<\w> in
-regular expressions might start behaving differently. Review your
-code. Use warnings and the C<strict> pragma.
-
-=back
+regular expressions might start behaving differently (unless the C</a>
+modifier is in effect). Review your code. Use warnings and the C<strict> pragma.
=head2 Unicode in Perl on EBCDIC
=head2 Locales
-Usually locale settings and Unicode do not affect each other, but
-there are a couple of exceptions:
-
-=over 4
-
-=item *
-
-You can enable automatic UTF-8-ification of your standard file
-handles, default C<open()> layer, and C<@ARGV> by using either
-the C<-C> command line switch or the C<PERL_UNICODE> environment
-variable, see L<perlrun> for the documentation of the C<-C> switch.
-
-=item *
-
-Perl tries really hard to work both with Unicode and the old
-byte-oriented world. Most often this is nice, but sometimes Perl's
-straddling of the proverbial fence causes problems.
-
-=back
+See L<perllocale/Unicode and UTF-8>
=head2 When Unicode Does Not Happen
While Perl does have extensive ways to input and output in Unicode,
-and few other 'entry points' like the @ARGV which can be interpreted
-as Unicode (UTF-8), there still are many places where Unicode (in some
-encoding or another) could be given as arguments or received as
+and a few other "entry points" like the @ARGV array (which can sometimes be
+interpreted as UTF-8), there are still many places where Unicode
+(in some encoding or another) could be given as arguments or received as
results, or both, but it is not.
-The following are such interfaces. For all of these Perl currently
-(as of 5.8.1) simply assumes byte strings both as arguments and results.
+The following are such interfaces. Also, see L</The "Unicode Bug">.
+For all of these interfaces Perl
+currently (as of 5.8.3) simply assumes byte strings both as arguments
+and results, or UTF-8 strings if the (problematic) C<encoding> pragma has been used.
-One reason why Perl does not attempt to resolve the role of Unicode in
-this cases is that the answers are highly dependent on the operating
+One reason that Perl does not attempt to resolve the role of Unicode in
+these situations is that the answers are highly dependent on the operating
system and the file system(s). For example, whether filenames can be
-in Unicode, and in exactly what kind of encoding, is not exactly a
-portable concept. Similarly for the qx and system: how well will the
-'command line interface' (and which of them?) handle Unicode?
+in Unicode and in exactly what kind of encoding, is not exactly a
+portable concept. Similarly for C<qx> and C<system>: how well will the
+"command-line interface" (and which of them?) handle Unicode?
=over 4
=item *
-chmod, chmod, chown, chroot, exec, link, mkdir
-rename, rmdir stat, symlink, truncate, unlink, utime
+chdir, chmod, chown, chroot, exec, link, lstat, mkdir,
+rename, rmdir, stat, symlink, truncate, unlink, utime, -X
=item *
=back
+=head2 The "Unicode Bug"
+
+The term "Unicode bug" has been applied to an inconsistency on ASCII
+platforms with the Unicode code points in the Latin-1 Supplement block,
+that is, between 128 and 255. Without a locale specified, unlike all
+other characters or code points, these characters behave very
+differently under byte semantics than under character semantics, unless
+C<use feature 'unicode_strings'> is specified.
+(The lesson here is to specify C<unicode_strings> to avoid the
+headaches.)
+
+In character semantics they are interpreted as Unicode code points, which means
+they have the same semantics as Latin-1 (ISO-8859-1).
+
+In byte semantics, they are considered to be unassigned characters, meaning
+that the only semantics they have is their ordinal numbers, and that they are
+not members of various character classes. None are considered to match C<\w>
+for example, but all match C<\W>.
+
+The behavior is known to have effects on these areas:
+
+=over 4
+
+=item *
+
+Changing the case of a scalar, that is, using C<uc()>, C<ucfirst()>, C<lc()>,
+and C<lcfirst()>, or C<\L>, C<\U>, C<\u> and C<\l> in regular expression
+substitutions.
+
+=item *
+
+Using caseless (C</i>) regular expression matching
+
+=item *
+
+Matching any of several properties in regular expressions, namely C<\b>,
+C<\B>, C<\s>, C<\S>, C<\w>, C<\W>, and all the Posix character classes
+I<except> C<[[:ascii:]]>.
+
+=item *
+
+In C<quotemeta> or its inline equivalent C<\Q>, no code points above
+127 are quoted in UTF-8 encoded strings, but in byte-encoded strings,
+code points between 128 and 255 are always quoted.
+
+=item *
+
+User-defined case change mappings. You can create a C<ToUpper()> function, for
+example, which overrides Perl's built-in case mappings. The scalar must be
+encoded in utf8 for your function to actually be invoked.
+
+=back
+
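+To see the C<quotemeta> inconsistency from the list above directly
+(this sketch assumes the default semantics, without C<unicode_strings>):

```perl
no feature 'unicode_strings';

my $byte = "\xE9";         # stored as a single byte
my $char = "\xE9";
utf8::upgrade($char);      # same character, stored in UTF-8

print length(quotemeta $byte), "\n";   # 2: a backslash was added
print length(quotemeta $char), "\n";   # 1: code points above 127 not quoted
```
+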
+This behavior can lead to unexpected results in which a string's semantics
+suddenly change if a code point above 255 is appended to it or removed from
+it, flipping the string between byte and character semantics. As
+an example, consider the following program and its output:
+
+ $ perl -le'
+ no feature "unicode_strings";
+ $s1 = "\xC2";
+ $s2 = "\x{2660}";
+ for ($s1, $s2, $s1.$s2) {
+ print /\w/ || 0;
+ }
+ '
+ 0
+ 0
+ 1
+
+If there's no C<\w> in C<$s1> or in C<$s2>, why does their concatenation have one?
+
+This anomaly stems from Perl's attempt to not disturb older programs that
+didn't use Unicode, and hence had no semantics for characters outside of the
+ASCII range (except in a locale), along with Perl's desire to add Unicode
+support seamlessly. The result wasn't seamless: these characters were
+orphaned.
+
+Starting in Perl 5.14, C<use feature 'unicode_strings'> can be used to
+cause Perl to use Unicode semantics on all string operations within the
+scope of the feature subpragma. Regular expressions compiled in its
+scope retain that behavior even when executed or compiled into larger
+regular expressions outside the scope. (The pragma does not, however,
+affect the C<quotemeta> behavior. Nor does it affect the deprecated
+user-defined case changing operations--these still require a UTF-8
+encoded string to operate.)
+
+In Perl 5.12, the subpragma affected casing changes, but not regular
+expressions. See L<perlfunc/lc> for details on how this pragma works in
+combination with various others for casing.
+
+For earlier Perls, or when a string is passed to a function outside the
+subpragma's scope, a workaround is to always call C<utf8::upgrade($string)>,
+or to use the standard module L<Encode>. Also, a scalar that has any characters
+whose ordinal is above 0x100, or which were specified using either of the
+C<\N{...}> notations, will automatically have character semantics.
+
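+To illustrate the C<utf8::upgrade()> workaround (assuming the default
+byte semantics, without C<unicode_strings>):

```perl
no feature 'unicode_strings';

my $s = "\xC2";              # LATIN CAPITAL LETTER A WITH CIRCUMFLEX
print $s =~ /\w/ ? 1 : 0;    # 0: byte semantics; not seen as a word character
utf8::upgrade($s);           # force character (Unicode) semantics
print $s =~ /\w/ ? 1 : 0;    # 1: now matches \w
```
+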
=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)
-Sometimes (see L</"When Unicode Does Not Happen">) there are
-situations where you simply need to force Perl to believe that a byte
-string is UTF-8, or vice versa. The low-level calls
-utf8::upgrade($bytestring) and utf8::downgrade($utf8string) are
+Sometimes (see L</"When Unicode Does Not Happen"> or L</The "Unicode Bug">)
+there are situations where you simply need to force a byte
+string into UTF-8, or vice versa. The low-level calls
+utf8::upgrade($bytestring) and utf8::downgrade($utf8string[, FAIL_OK]) are
the answers.
-Do not use them without careful thought, though: Perl may easily get
-very confused, angry, or even crash, if you suddenly change the 'nature'
-of scalar like that. Especially careful you have to be if you use the
-utf8::upgrade(): any random byte string is not valid UTF-8.
+Note that utf8::downgrade() can fail if the string contains characters
+that don't fit into a byte.
+
+Calling either function on a string that already is in the desired state is a
+no-op.
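+
+A short sketch of both calls, including the optional C<FAIL_OK>
+argument to C<utf8::downgrade()>:

```perl
my $s = "caf\xE9";
utf8::upgrade($s);       # internal encoding is now UTF-8; contents unchanged
utf8::downgrade($s);     # every character fits in a byte, so this succeeds

my $wide = "\x{2660}";   # BLACK SPADE SUIT cannot fit in one byte
print utf8::downgrade($wide, 1) ? "ok" : "cannot downgrade";  # cannot downgrade
```
+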
=head2 Using Unicode in XS
=item *
C<DO_UTF8(sv)> returns true if the C<UTF8> flag is on and the bytes
-pragma is not in effect. C<SvUTF8(sv)> returns true is the C<UTF8>
+pragma is not in effect. C<SvUTF8(sv)> returns true if the C<UTF8>
flag is on; the bytes pragma is ignored. The C<UTF8> flag being on
does B<not> mean that there are any characters of code points greater
than 255 (or 127) in the scalar or that there are even any characters
=item *
-C<uvuni_to_utf8(buf, chr>) writes a Unicode character code point into
+C<uvchr_to_utf8(buf, chr)> writes a Unicode character code point into
a buffer encoding the code point as UTF-8, and returns a pointer
-pointing after the UTF-8 bytes.
+pointing after the UTF-8 bytes. It works appropriately on EBCDIC machines.
=item *
-C<utf8_to_uvuni(buf, lenp)> reads UTF-8 encoded bytes from a buffer and
+C<utf8_to_uvchr(buf, lenp)> reads UTF-8 encoded bytes from a buffer and
returns the Unicode character code point and, optionally, the length of
-the UTF-8 byte sequence.
+the UTF-8 byte sequence. It works appropriately on EBCDIC machines.
=item *
=item *
-C<utf8_hop(s, off)> will return a pointer to an UTF-8 encoded buffer
+C<utf8_hop(s, off)> will return a pointer to a UTF-8 encoded buffer
that is C<off> (positive or negative) Unicode characters displaced
from the UTF-8 buffer C<s>. Be careful not to overstep the buffer:
C<utf8_hop()> will merrily run off the end or the beginning of the
=item *
-C<ibcmp_utf8(s1, pe1, u1, l1, u1, s2, pe2, l2, u2)> can be used to
+C<foldEQ_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to
compare two strings case-insensitively in Unicode. For case-sensitive
-comparisons you can just use C<memEQ()> and C<memNE()> as usual.
+comparisons you can just use C<memEQ()> and C<memNE()> as usual, except
+if one string is in utf8 and the other isn't.
=back
For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h>
in the Perl source code distribution.
+=head2 Hacking Perl to work on earlier Unicode versions (for very serious hackers only)
+
+Perl by default comes with the latest supported Unicode version built in, but
+you can change to use any earlier one.
+
+Download the files in the desired version of Unicode from the Unicode web
+site (L<http://www.unicode.org>). These should replace the existing files in
+F<lib/unicore> in the Perl source tree. Follow the instructions in
+F<README.perl> in that directory to change some of their names, and then build
+perl (see L<INSTALL>).
+
+It is even possible to copy the built files to a different directory,
+and then change F<utf8_heavy.pl> in the directory C<$Config{privlib}>
+to point to the new directory; or to make a copy of that directory
+before making the change, and use C<@INC> or the C<-I> run-time flag to
+switch between versions at will (though, because of caching, not in the
+middle of a process). But all this is beyond the scope of these
+instructions.
+
=head1 BUGS
=head2 Interaction with Locales
-Use of locales with Unicode data may lead to odd results. Currently,
-Perl attempts to attach 8-bit locale info to characters in the range
-0..255, but this technique is demonstrably incorrect for locales that
-use characters above that range when mapped into Unicode. Perl's
-Unicode support will also tend to run slower. Use of locales with
-Unicode is discouraged.
+See L<perllocale/Unicode and UTF-8>
+
+=head2 Problems with characters in the Latin-1 Supplement range
+
+See L</The "Unicode Bug">
=head2 Interaction with Extensions
When Perl exchanges data with an extension, the extension should be
-able to understand the UTF-8 flag and act accordingly. If the
-extension doesn't know about the flag, it's likely that the extension
+able to understand the UTF8 flag and act accordingly. If the
+extension doesn't recognize that flag, it's likely that the extension
will return incorrectly-flagged data.
So if you're working with Unicode data, consult the documentation of
Perl's internal representation like so:
sub my_escape_html ($) {
- my($what) = shift;
- return unless defined $what;
- Encode::decode_utf8(Foo::Bar::escape_html(Encode::encode_utf8($what)));
+ my($what) = shift;
+ return unless defined $what;
+ Encode::decode_utf8(Foo::Bar::escape_html(
+ Encode::encode_utf8($what)));
}
Sometimes, when the extension does not convert data but just stores
-and retrieves them, you will be in a position to use the otherwise
+and retrieves them, you will be able to use the otherwise
dangerous Encode::_utf8_on() function. Let's say the popular
C<Foo::Bar> extension, written in C, provides a C<param> method that
lets you store and retrieve data according to these prototypes:
sub param {
my($self,$name,$value) = @_;
utf8::upgrade($name); # make sure it is UTF-8 encoded
- if (defined $value)
+ if (defined $value) {
utf8::upgrade($value); # make sure it is UTF-8 encoded
return $self->SUPER::param($name,$value);
} else {
operations with UTF-8 encoded strings are still slower. As an example,
the Unicode properties (character classes) like C<\p{Nd}> are known to
be quite a bit slower (5-20 times) than their simpler counterparts
-like C<\d> (then again, there 268 Unicode characters matching C<Nd>
+like C<\d> (then again, there are hundreds of Unicode characters matching C<Nd>
compared with the 10 ASCII characters matching C<d>).
+=head2 Problems on EBCDIC platforms
+
+There are several known problems with Perl on EBCDIC platforms. If you
+want to use Perl there, send email to perlbug@perl.org.
+
+In earlier versions, when byte and character data were concatenated,
+the new string was sometimes created by
+decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
+old Unicode string used EBCDIC.
+
+If you find any of these, please report them as bugs.
+
=head2 Porting code from perl-5.6.X
Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer
A filehandle that should read or write UTF-8
if ($] > 5.007) {
- binmode $fh, ":utf8";
+ binmode $fh, ":encoding(utf8)";
}
=item *
Be it Compress::Zlib, Apache::Request or any extension that has no
mention of Unicode in the manpage, you need to make sure that the
-UTF-8 flag is stripped off. Note that at the time of this writing
+UTF8 flag is stripped off. Note that at the time of this writing
(October 2002) the mentioned modules are not UTF-8-aware. Please
check the documentation to verify if this is still true.
A scalar we got back from an extension
If you believe the scalar comes back as UTF-8, you will most likely
-want the UTF-8 flag restored:
+want the UTF8 flag restored:
if ($] > 5.007) {
require Encode;
that is still true.
sub fetchrow {
- my($self, $sth, $what) = @_; # $what is one of fetchrow_{array,hashref}
+ # $what is one of fetchrow_{array,hashref}
+ my($self, $sth, $what) = @_;
if ($] < 5.007) {
return $sth->$what;
} else {
my $ret = $sth->$what;
if (ref $ret) {
for my $k (keys %$ret) {
- defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret->{$k};
+ defined
+ && /[^\000-\177]/
+ && Encode::_utf8_on($_) for $ret->{$k};
}
return $ret;
} else {
Scalars that contain only ASCII and are marked as UTF-8 are sometimes
a drag to your program. If you recognize such a situation, just remove
-the UTF-8 flag:
+the UTF8 flag:
utf8::downgrade($val) if $] > 5.007;
=head1 SEE ALSO
-L<perluniintro>, L<encoding>, L<Encode>, L<open>, L<utf8>, L<bytes>,
+L<perlunitut>, L<perluniintro>, L<perluniprops>, L<Encode>, L<open>, L<utf8>, L<bytes>,
L<perlretut>, L<perlvar/"${^UNICODE}">
+L<http://www.unicode.org/reports/tr44>.
=cut