From 069d116ce0762851ec18799ea770a5da89d9982c Mon Sep 17 00:00:00 2001
From: Karl Williamson
Date: Wed, 23 Mar 2022 20:21:01 -0600
Subject: [PATCH] Clarify \p{Decomposition_Type=NonCanonical}

This closes #18458
---
 pod/perlunicode.pod | 46 +++++++++++++++++++++++++---------------------
 1 file changed, 25 insertions(+), 21 deletions(-)

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 954a048..1716ae4 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -799,27 +799,31 @@ spacing horizontally.
 
 =item B> (Short: C<\p{Dt=NonCanon}>)
 
-Matches a character that has a non-canonical decomposition.
-
-The L section above
-talked about canonical decompositions. However, many more characters
-have a different type of decomposition, a "compatible" or
-"non-canonical" decomposition. The sequences that form these
-decompositions are not considered canonically equivalent to the
-pre-composed character. An example is the C<"SUPERSCRIPT ONE">. It is
-somewhat like a regular digit 1, but not exactly; its decomposition into
-the digit 1 is called a "compatible" decomposition, specifically a
-"super" decomposition. There are several such compatibility
-decompositions (see L), including
-one called "compat", which means some miscellaneous type of
-decomposition that doesn't fit into the other decomposition categories
-that Unicode has chosen.
-
-Note that most Unicode characters don't have a decomposition, so their
-decomposition type is C<"None">.
-
-For your convenience, Perl has added the C decomposition
-type to mean any of the several compatibility decompositions.
+Matches a character that has any of the non-canonical decomposition
+types. Canonical decompositions are introduced in the
+L section above.
+However, many more characters have a different type of decomposition,
+generically called "compatible" decompositions, or "non-canonical". The
+sequences that form these decompositions are not considered canonically
+equivalent to the pre-composed character. An example is the
+C<"SUPERSCRIPT ONE">. It is somewhat like a regular digit 1, but not
+exactly; its decomposition into the digit 1 is called a "compatible"
+decomposition, specifically a "super" (for "superscript") decomposition.
+There are several such compatibility decompositions (see
+L). S> is a
+Perl extension that uses just one name to refer to the union of all of
+them.
+
+Most Unicode characters don't have a decomposition, so their
+decomposition type is C<"None">. Hence, C is equivalent
+to
+
+ qr/(?[ \P{DT=Canonical} - \p{DT=None} ])/
+
+(Note that one of the non-canonical decompositions is named "compat",
+which could perhaps have been better named "miscellaneous". It includes
+just the things that Unicode couldn't figure out a better generic name
+for.)
 
 =item B>
 
-- 
1.8.3.1