-of blocks is more of an artificial grouping based on groups of 256
-Unicode characters. For example, the C<Latin> script contains letters
-from many blocks but does not contain all the characters from those
-blocks. It does not, for example, contain digits, because digits are
-shared across many scripts. Digits and similar groups, like
-punctuation, are in a category called C<Common>.
-
-For more about scripts, see the UAX#24 "Script Names":
-
- http://www.unicode.org/reports/tr24/
-
-For more about blocks, see:
-
- http://www.unicode.org/Public/UNIDATA/Blocks.txt
-
-Block names are given with the C<In> prefix. For example, the
-Katakana block is referenced via C<\p{InKatakana}>. The C<In>
-prefix may be omitted if there is no naming conflict with a script
-or any other property, but it is recommended that C<In> always be used
-for block tests to avoid confusion.
-
-These block names are supported:
-
- InAegeanNumbers
- InAlphabeticPresentationForms
- InAncientGreekMusicalNotation
- InAncientGreekNumbers
- InArabic
- InArabicPresentationFormsA
- InArabicPresentationFormsB
- InArabicSupplement
- InArmenian
- InArrows
- InBalinese
- InBasicLatin
- InBengali
- InBlockElements
- InBopomofo
- InBopomofoExtended
- InBoxDrawing
- InBraillePatterns
- InBuginese
- InBuhid
- InByzantineMusicalSymbols
- InCJKCompatibility
- InCJKCompatibilityForms
- InCJKCompatibilityIdeographs
- InCJKCompatibilityIdeographsSupplement
- InCJKRadicalsSupplement
- InCJKStrokes
- InCJKSymbolsAndPunctuation
- InCJKUnifiedIdeographs
- InCJKUnifiedIdeographsExtensionA
- InCJKUnifiedIdeographsExtensionB
- InCherokee
- InCombiningDiacriticalMarks
- InCombiningDiacriticalMarksSupplement
- InCombiningDiacriticalMarksforSymbols
- InCombiningHalfMarks
- InControlPictures
- InCoptic
- InCountingRodNumerals
- InCuneiform
- InCuneiformNumbersAndPunctuation
- InCurrencySymbols
- InCypriotSyllabary
- InCyrillic
- InCyrillicSupplement
- InDeseret
- InDevanagari
- InDingbats
- InEnclosedAlphanumerics
- InEnclosedCJKLettersAndMonths
- InEthiopic
- InEthiopicExtended
- InEthiopicSupplement
- InGeneralPunctuation
- InGeometricShapes
- InGeorgian
- InGeorgianSupplement
- InGlagolitic
- InGothic
- InGreekExtended
- InGreekAndCoptic
- InGujarati
- InGurmukhi
- InHalfwidthAndFullwidthForms
- InHangulCompatibilityJamo
- InHangulJamo
- InHangulSyllables
- InHanunoo
- InHebrew
- InHighPrivateUseSurrogates
- InHighSurrogates
- InHiragana
- InIPAExtensions
- InIdeographicDescriptionCharacters
- InKanbun
- InKangxiRadicals
- InKannada
- InKatakana
- InKatakanaPhoneticExtensions
- InKharoshthi
- InKhmer
- InKhmerSymbols
- InLao
- InLatin1Supplement
- InLatinExtendedA
- InLatinExtendedAdditional
- InLatinExtendedB
- InLatinExtendedC
- InLatinExtendedD
- InLetterlikeSymbols
- InLimbu
- InLinearBIdeograms
- InLinearBSyllabary
- InLowSurrogates
- InMalayalam
- InMathematicalAlphanumericSymbols
- InMathematicalOperators
- InMiscellaneousMathematicalSymbolsA
- InMiscellaneousMathematicalSymbolsB
- InMiscellaneousSymbols
- InMiscellaneousSymbolsAndArrows
- InMiscellaneousTechnical
- InModifierToneLetters
- InMongolian
- InMusicalSymbols
- InMyanmar
- InNKo
- InNewTaiLue
- InNumberForms
- InOgham
- InOldItalic
- InOldPersian
- InOpticalCharacterRecognition
- InOriya
- InOsmanya
- InPhagspa
- InPhoenician
- InPhoneticExtensions
- InPhoneticExtensionsSupplement
- InPrivateUseArea
- InRunic
- InShavian
- InSinhala
- InSmallFormVariants
- InSpacingModifierLetters
- InSpecials
- InSuperscriptsAndSubscripts
- InSupplementalArrowsA
- InSupplementalArrowsB
- InSupplementalMathematicalOperators
- InSupplementalPunctuation
- InSupplementaryPrivateUseAreaA
- InSupplementaryPrivateUseAreaB
- InSylotiNagri
- InSyriac
- InTagalog
- InTagbanwa
- InTags
- InTaiLe
- InTaiXuanJingSymbols
- InTamil
- InTelugu
- InThaana
- InThai
- InTibetan
- InTifinagh
- InUgaritic
- InUnifiedCanadianAboriginalSyllabics
- InVariationSelectors
- InVariationSelectorsSupplement
- InVerticalForms
- InYiRadicals
- InYiSyllables
- InYijingHexagramSymbols
+of blocks is more of an artificial grouping based on groups of Unicode
+characters with consecutive ordinal values. For example, the "Basic Latin"
+block is all characters whose ordinals are between 0 and 127, inclusive; in
+other words, the ASCII characters. The "Latin" script contains some letters
+from this as well as several other blocks, like "Latin-1 Supplement",
+"Latin Extended-A", etc., but it does not contain all the characters from
+those blocks. It does not, for example, contain the digits 0-9, because
+those digits are shared across many scripts, and hence are in the
+C<Common> script.
+
+For more about scripts versus blocks, see UAX#24 "Unicode Script Property":
+L<http://www.unicode.org/reports/tr24>
+
+The C<Script> or C<Script_Extensions> properties are likely to be the
+ones you want to use when processing
+natural language; the Block property may occasionally be useful in working
+with the nuts and bolts of Unicode.
+
+Block names are matched in the compound form, like C<\p{Block: Arrows}> or
+C<\p{Blk=Hebrew}>. Unlike most other properties, only a few block names have a
+Unicode-defined short name. But Perl does provide a (slight) shortcut: You
+can say, for example, C<\p{In_Arrows}> or C<\p{In_Hebrew}>. For backwards
+compatibility, the C<In> prefix may be omitted if there is no naming conflict
+with a script or any other property, and you can even use an C<Is> prefix
+instead in those cases. But it is not a good idea to do this, for a couple
+of reasons:
+
+=over 4
+
+=item 1
+
+It is confusing. There are many naming conflicts, and you may forget some.
+For example, C<\p{Hebrew}> means the I<script> Hebrew, and NOT the I<block>
+Hebrew. But would you remember that 6 months from now?
+
+=item 2
+
+It is unstable. A new version of Unicode may preempt the current meaning by
+creating a property with the same name. There was a time in very early Unicode
+releases when C<\p{Hebrew}> would have matched the I<block> Hebrew; now it
+doesn't.
+
+=back
+
+Some people prefer to always use C<\p{Block: foo}> and C<\p{Script: bar}>
+instead of the shortcuts, whether for clarity, because they can't remember the
+difference between 'In' and 'Is' anyway, or they aren't confident that those who
+eventually will read their code will know that difference.
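+
+As a rough illustration of the difference (a sketch only; the code points
+were chosen for this example), the following compares block and script
+matches:
+
+    use strict;
+    use warnings;
+
+    my $alef  = "\x{05D0}";   # HEBREW LETTER ALEF
+    my $yod   = "\x{FB1D}";   # HEBREW LETTER YOD WITH HIRIQ
+    my $digit = "1";
+
+    # U+05D0 is in both the Hebrew block and the Hebrew script.
+    print "alef: block\n"  if $alef =~ /\p{Block: Hebrew}/;
+    print "alef: script\n" if $alef =~ /\p{Script: Hebrew}/;
+
+    # U+FB1D belongs to the Hebrew script, but lives in the
+    # Alphabetic Presentation Forms block.
+    print "yod: script only\n"
+        if $yod =~ /\p{Script: Hebrew}/ && $yod !~ /\p{In_Hebrew}/;
+
+    # The digit 1 is in the Basic Latin block, but its script is
+    # Common, not Latin.
+    print "digit: Common\n"
+        if $digit =~ /\p{Blk=Basic_Latin}/
+        && $digit =~ /\p{Script: Common}/
+        && $digit !~ /\p{Script: Latin}/;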
+
+A complete list of blocks and their shortcuts is in L<perluniprops>.
+
+=head3 B<Other Properties>
+
+There are many more properties than the very basic ones described here.
+A complete list is in L<perluniprops>.
+
+Unicode defines all its properties in the compound form, so all single-form
+properties are Perl extensions. Most of these are just synonyms for the
+Unicode ones, but some are genuine extensions, including several that are in
+the compound form. And quite a few of these are actually recommended by Unicode
+(in L<http://www.unicode.org/reports/tr18>).
+
+This section gives some details on all extensions that aren't just
+synonyms for compound-form Unicode properties
+(for those properties, you'll have to refer to the
+L<Unicode Standard|http://www.unicode.org/reports/tr44>).
+
+=over
+
+=item B<C<\p{All}>>
+
+This matches any of the 1_114_112 Unicode code points. It is a synonym for
+C<\p{Any}>.
+
+=item B<C<\p{Alnum}>>
+
+This matches any C<\p{Alphabetic}> or C<\p{Decimal_Number}> character.
+
+=item B<C<\p{Any}>>
+
+This matches any of the 1_114_112 Unicode code points. It is a synonym for
+C<\p{All}>.
+
+=item B<C<\p{ASCII}>>
+
+This matches any of the 128 characters in the US-ASCII character set,
+which is a subset of Unicode.
+
+=item B<C<\p{Assigned}>>
+
+This matches any assigned code point; that is, any code point whose general
+category is not Unassigned (or equivalently, not Cn).
+
+=item B<C<\p{Blank}>>
+
+This is the same as C<\h> and C<\p{HorizSpace}>: A character that changes the
+spacing horizontally.
+
+=item B<C<\p{Decomposition_Type: Non_Canonical}>> (Short: C<\p{Dt=NonCanon}>)
+
+Matches a character that has a non-canonical decomposition.
+
+To understand the use of this rarely used property=value combination, it is
+necessary to know some basics about decomposition.
+Consider a character, say H. It could appear with various marks around it,
+such as an acute accent, or a circumflex, or various hooks, circles, arrows,
+I<etc.>, above, below, or to one side or the other. There are many
+possibilities among the world's languages. The number of combinations is
+astronomical, and if there were a character for each combination, it would
+soon exhaust Unicode's more than a million possible characters. So Unicode
+took a different approach: there is a character for the base H, and a
+character for each of the possible marks, and these can be variously combined
+to get a final logical character. So a logical character--what appears to be a
+single character--can be a sequence of more than one individual character.
+This is called an "extended grapheme cluster"; Perl furnishes the C<\X>
+regular expression construct to match such sequences.
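+
+A minimal sketch of the difference between counting code points and
+counting logical characters (the sample string is an assumption made for
+illustration):
+
+    use strict;
+    use warnings;
+
+    # "E" followed by COMBINING ACUTE ACCENT, then a plain "x".
+    my $string = "E\x{301}x";
+
+    my $code_points = length $string;          # individual characters
+    my $clusters    = () = $string =~ /\X/g;   # logical characters
+
+    # Prints "3 code points, 2 logical characters".
+    print "$code_points code points, $clusters logical characters\n";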
+
+But Unicode's intent is to unify the existing character set standards and
+practices, and several pre-existing standards have single characters that
+mean the same thing as some of these combinations. An example is ISO-8859-1,
+which has quite a few of these in the Latin-1 range, an example being "LATIN
+CAPITAL LETTER E WITH ACUTE". Because this character was in this pre-existing
+standard, Unicode added it to its repertoire. But this character is considered
+by Unicode to be equivalent to the sequence consisting of the character
+"LATIN CAPITAL LETTER E" followed by the character "COMBINING ACUTE ACCENT".
+
+"LATIN CAPITAL LETTER E WITH ACUTE" is called a "pre-composed" character, and
+its equivalence with the sequence is called canonical equivalence. All
+pre-composed characters are said to have a decomposition (into the equivalent
+sequence), and the decomposition type is also called canonical.
+
+However, many more characters have a different type of decomposition, a
+"compatible" or "non-canonical" decomposition. The sequences that form these
+decompositions are not considered canonically equivalent to the pre-composed
+character. An example, again in the Latin-1 range, is the "SUPERSCRIPT ONE".
+It is somewhat like a regular digit 1, but not exactly; its decomposition
+into the digit 1 is called a "compatible" decomposition, specifically a
+"super" decomposition. There are several such compatibility
+decompositions (see L<http://www.unicode.org/reports/tr44>), including one
+called "compat", which means some miscellaneous type of decomposition
+that doesn't fit into the decomposition categories that Unicode has chosen.
+
+Note that most Unicode characters don't have a decomposition, so their
+decomposition type is "None".
+
+For your convenience, Perl has added the C<Non_Canonical> decomposition
+type to mean any of the several compatibility decompositions.
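+
+The following sketch, which assumes the core C<Unicode::Normalize> module
+is available, shows how the two kinds of decomposition behave for the
+characters mentioned above:
+
+    use strict;
+    use warnings;
+    use Unicode::Normalize qw(NFD NFKD);
+
+    my $e_acute = "\x{C9}";   # LATIN CAPITAL LETTER E WITH ACUTE
+    my $sup_one = "\x{B9}";   # SUPERSCRIPT ONE
+
+    # Canonical decomposition splits the pre-composed character.
+    print "canonical\n" if NFD($e_acute) eq "E\x{301}";
+
+    # SUPERSCRIPT ONE is untouched by canonical decomposition, but
+    # compatibility decomposition maps it to the digit 1.
+    print "compatible\n"
+        if NFD($sup_one) eq $sup_one && NFKD($sup_one) eq "1";
+
+    # Only SUPERSCRIPT ONE has a non-canonical decomposition type.
+    print "Non_Canonical\n"
+        if $sup_one =~ /\p{Dt=NonCanon}/
+        && $e_acute !~ /\p{Dt=NonCanon}/;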
+
+=item B<C<\p{Graph}>>
+
+Matches any character that is graphic. Theoretically, this means a character
+that on a printer would cause ink to be used.
+
+=item B<C<\p{HorizSpace}>>
+
+This is the same as C<\h> and C<\p{Blank}>: a character that changes the
+spacing horizontally.
+
+=item B<C<\p{In=*}>>
+
+This is a synonym for C<\p{Present_In=*}>.
+
+=item B<C<\p{PerlSpace}>>
+
+This is the same as C<\s>, restricted to ASCII, namely C<S<[ \f\n\r\t]>>,
+and, starting in Perl v5.18, experimentally, a vertical tab.
+
+Mnemonic: Perl's (original) space.
+
+=item B<C<\p{PerlWord}>>
+
+This is the same as C<\w>, restricted to ASCII, namely C<[A-Za-z0-9_]>.
+
+Mnemonic: Perl's (original) word.
+
+=item B<C<\p{Posix...}>>
+
+There are several of these, which are equivalents using the C<\p>
+notation for Posix classes and are described in
+L<perlrecharclass/POSIX Character Classes>.
+
+=item B<C<\p{Present_In: *}>> (Short: C<\p{In=*}>)
+
+This property is used when you need to know in which Unicode version(s) a
+character is present.
+
+The "*" above stands for some two-digit Unicode version number, such as
+C<1.1> or C<4.0>; or the "*" can also be C<Unassigned>. This property will
+match the code points whose final disposition has been settled as of the
+Unicode release given by the version number; C<\p{Present_In: Unassigned}>
+will match those code points whose meaning has yet to be assigned.
+
+For example, C<U+0041> "LATIN CAPITAL LETTER A" was present in the very first
+Unicode release available, which is C<1.1>, so this property is true for all
+valid "*" versions. On the other hand, C<U+1EFF> was not assigned until version
+5.1 when it became "LATIN SMALL LETTER Y WITH LOOP", so the only "*" values
+that match it are 5.1, 5.2, and later.
+
+Unicode furnishes the C<Age> property from which this is derived. The problem
+with Age is that a strict interpretation of it (which Perl takes) has it
+matching the precise release a code point's meaning is introduced in. Thus
+C<U+0041> would match only 1.1; and C<U+1EFF> only 5.1. This is not usually what
+you want.
+
+Some non-Perl implementations of the Age property may change its meaning to be
+the same as the Perl Present_In property; just be aware of that.
+
+Another confusion with both these properties is that the definition is not
+that the code point has been I<assigned>, but that the meaning of the code point
+has been I<determined>. This is because 66 code points will always be
+unassigned, and so the Age for them is the Unicode version in which the decision
+to make them so was made. For example, C<U+FDD0> is to be permanently
+unassigned to a character, and the decision to do that was made in version 3.1,
+so C<\p{Age=3.1}> matches this character, as do C<\p{Present_In: 3.1}> and up.
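+
+A short sketch contrasting the two properties, assuming a Perl whose
+Unicode tables are version 5.2 or later:
+
+    use strict;
+    use warnings;
+
+    my $y_loop = "\x{1EFF}";  # LATIN SMALL LETTER Y WITH LOOP (5.1)
+
+    # Present_In is cumulative: true for 5.1 and every later version.
+    print "not in 5.0\n" if $y_loop !~ /\p{Present_In: 5.0}/;
+    print "in 5.1\n"     if $y_loop =~ /\p{Present_In: 5.1}/;
+    print "in 5.2\n"     if $y_loop =~ /\p{In=5.2}/;
+
+    # Age names only the release that introduced the code point.
+    print "Age is 5.1 only\n"
+        if $y_loop =~ /\p{Age=5.1}/ && $y_loop !~ /\p{Age=5.2}/;
+
+    # "A" has been present since 1.1, so its Age is 1.1, not 5.1.
+    print "A: present in 5.1, Age 1.1\n"
+        if "A" =~ /\p{Present_In: 5.1}/
+        && "A" =~ /\p{Age=1.1}/
+        && "A" !~ /\p{Age=5.1}/;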
+
+=item B<C<\p{Print}>>
+
+This matches any character that is graphical or blank, except controls.
+
+=item B<C<\p{SpacePerl}>>
+
+This is the same as C<\s>, including beyond ASCII.
+
+Mnemonic: Space, as modified by Perl. (It doesn't include the vertical tab
+which both the Posix standard and Unicode consider white space.)
+
+=item B<C<\p{Title}>> and B<C<\p{Titlecase}>>
+
+Under case-sensitive matching, these both match the same code points as
+C<\p{General Category=Titlecase_Letter}> (C<\p{gc=lt}>). The difference
+is that under C</i> caseless matching, these match the same as
+C<\p{Cased}>, whereas C<\p{gc=lt}> matches C<\p{Cased_Letter}>.
+
+=item B<C<\p{VertSpace}>>
+
+This is the same as C<\v>: A character that changes the spacing vertically.
+
+=item B<C<\p{Word}>>
+
+This is the same as C<\w>, including over 100_000 characters beyond ASCII.
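+
+As an illustration (the sample characters are chosen for this sketch), the
+ASCII-restricted and full-range properties differ like this:
+
+    use strict;
+    use warnings;
+
+    my $e_acute = "\x{E9}";   # LATIN SMALL LETTER E WITH ACUTE
+    my $nbsp    = "\x{A0}";   # NO-BREAK SPACE
+
+    # \p{Word} covers all of Unicode; \p{PerlWord} stops at ASCII.
+    print "Word, but not PerlWord\n"
+        if $e_acute =~ /\p{Word}/ && $e_acute !~ /\p{PerlWord}/;
+
+    # The same holds for \p{SpacePerl} versus \p{PerlSpace}.
+    print "SpacePerl, but not PerlSpace\n"
+        if $nbsp =~ /\p{SpacePerl}/ && $nbsp !~ /\p{PerlSpace}/;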
+
+=item B<C<\p{XPosix...}>>
+
+There are several of these, which are the standard Posix classes
+extended to the full Unicode range. They are described in
+L<perlrecharclass/POSIX Character Classes>.