From f7f5e97b7a9bb1015c2778e4f6b40b39f1459074 Mon Sep 17 00:00:00 2001
From: Karl Williamson
Date: Fri, 28 Jan 2011 09:01:05 -0700
Subject: [PATCH] perldiag.pod: Expand \p in locale description

---
 pod/perldiag.pod | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/pod/perldiag.pod b/pod/perldiag.pod
index 0ab9e92..bf22e1e 100644
--- a/pod/perldiag.pod
+++ b/pod/perldiag.pod
@@ -3331,10 +3331,25 @@ mixed-case attribute name, instead. See L.
 
 (W) You compiled a regular expression that contained a Unicode property
 match (C<\p> or C<\P>), but the regular expression is also being told to
-use the run-time locale, not Unicode. It's best to not use these
-Unicode properties with locale, as only if the locale is a properly
-implemented ISO 8859-1 (Latin1) locale (which is supposed to be a subset
-of Unicode) will there not be any anomalies.
+use the run-time locale, not Unicode. Instead, use a POSIX character
+class, which should know about the locale's rules.
+(See L.)
+
+Even if the run-time locale is ISO 8859-1 (Latin1), which is a subset of
+Unicode, some properties will give results that are not valid for that
+subset.
+
+Here are a couple of examples to help you see what's going on. If the
+locale is ISO 8859-7, the character at code point 0xD7 is the "GREEK
+CAPITAL LETTER CHI". But in Unicode that code point means the
+"MULTIPLICATION SIGN" instead, and C<\p> always uses the Unicode
+meaning. That means that C<\p{Alpha}> won't match, but C<[[:alpha:]]>
+should. Only in the Latin1 locale are all the characters in the same
+positions as they are in Unicode. But, even here, some properties give
+incorrect results. An example is C<\p{Changes_When_Uppercased}> which
+is true for "LATIN SMALL LETTER Y WITH DIAERESIS", but since the upper
+case of that character is not in Latin1, in that locale it doesn't
+change when upper cased.
 
 =item pack/unpack repeat count overflow
 
-- 
1.8.3.1
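
The two examples in the added text can be checked outside Perl. The following is a minimal sketch in Python, not part of the patch, that only verifies the code-point facts: what byte 0xD7 means in ISO 8859-7 versus Unicode, and where the upper case of "LATIN SMALL LETTER Y WITH DIAERESIS" lands. The variable names are illustrative, and `str.isalpha()` is used here as a rough stand-in for C<\p{Alpha}>; the two are not defined identically, though they agree for these characters.

```python
import unicodedata

BYTE = bytes([0xD7])

# Reading the value as a Unicode code point (Latin1 maps bytes 1:1 to
# U+0000..U+00FF), which is the interpretation \p{} always uses.
as_unicode = BYTE.decode("latin-1")

# Reading the same byte through the ISO 8859-7 character set, which is
# what a locale-aware class is expected to do in that locale.
as_greek = BYTE.decode("iso8859_7")

print(unicodedata.name(as_unicode), as_unicode.isalpha())
print(unicodedata.name(as_greek), as_greek.isalpha())

# The Latin1 caveat: the upper case of U+00FF exists in Unicode but
# falls outside the Latin1 range, so it cannot be stored in that locale.
y_diaeresis = "\u00ff"
upper = y_diaeresis.upper()
try:
    upper.encode("latin-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False

print(unicodedata.name(upper), hex(ord(upper)), fits_in_latin1)
```

The same byte is alphabetic under one reading and a symbol under the other, which is why the diagnostic steers users toward POSIX classes when a locale is in effect.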