that compiles today may not in a future Perl release. This warning is
to alert you to that risk.
-=item *
-
L<Wide character (U+%X) in %s|perldiag/"Wide character (U+%X) in %s">
(W locale) While in a single-byte locale (I<i.e.>, a non-UTF-8
with your single-byte locale (or perhaps you thought you had a UTF-8
locale, but Perl disagrees).
+=item *
+
+L<Both or neither range ends should be Unicode in regex; marked by E<lt>-- HERE in mE<sol>%sE<sol>|perldiag/"Both or neither range ends should be Unicode in regex; marked by <-- HERE in m/%s/">
+
=back
=head2 Changes to Existing Diagnostics
(P) When starting a new thread or returning values from a thread, Perl
encountered an invalid data type.
+=item Both or neither range ends should be Unicode in regex; marked by
+<-- HERE in m/%s/
+
+(W regexp) (only under C<S<use re 'strict'>> or within C<(?[...])>)
+
+In a bracketed character class in a regular expression pattern, you
+had a range in which exactly one end is specified using C<\N{}>, and
+the other end is specified using a non-portable mechanism. Perl treats
+the range as a Unicode range, that is, all the characters in it are
+considered to be Unicode characters, which may have different code
+points on some platforms Perl runs on. For example, C<[\N{U+06}-\x08]>
+is treated as if you had instead said C<[\N{U+06}-\N{U+08}]>, that is, it
+matches the characters whose code points in Unicode are 6, 7, and 8.
+But that C<\x08> might indicate that you meant something different, so
+the warning gets raised.
+
=item Buffer overflow in prime_env_iter: %s
(W internal) A warning peculiar to VMS. While Perl was preparing to
a range, the "-" is understood literally.
Note also that the whole range idea is rather unportable between
-character sets--and even within character sets they may cause results
-you probably didn't expect. A sound principle is to use only ranges
-that begin from and end at either alphabetics of equal case ([a-e],
-[A-E]), or digits ([0-9]). Anything else is unsafe or unclear. If in
-doubt, spell out the character sets in full. Specifying the end points
-of the range using the C<\N{...}> syntax, using Unicode names or code
-points makes the range portable, but still likely not easily
-understandable to someone reading the code. For example,
-C<[\N{U+04}-\N{U+07}]> means to match the Unicode code points
-C<\N{U+04}>, C<\N{U+05}>, C<\N{U+06}>, and C<\N{U+07}>, whatever their
-native values may be on the platform.
+character sets, except for four situations that Perl handles specially.
+Any subset of the ranges C<[A-Z]>, C<[a-z]>, and C<[0-9]> is guaranteed
+to match the expected subset of ASCII characters, no matter what
+character set the platform is running. The fourth portable way to
+specify ranges is to use the C<\N{...}> syntax to specify either end
+point of the range. For example, C<[\N{U+04}-\N{U+07}]> means to match
+the Unicode code points C<\N{U+04}>, C<\N{U+05}>, C<\N{U+06}>, and
+C<\N{U+07}>, whatever their native values may be on the platform. Under
+L<use re 'strict'|re/'strict' mode> or within a L</C<(?[ ])>>, a warning
+is raised, if enabled, when the other end point of a range which has a
+C<\N{...}> endpoint is not portably specified. For example,
+
+ [\N{U+00}-\x06] # Warning under "use re 'strict'".
+
+Without some digging, it is hard to tell exactly what ranges other
+than subsets of C<[A-Z]>, C<[a-z]>, and C<[0-9]> match. A sound
+principle is to use only ranges that begin from and end at either
+alphabetics of equal case ([a-e], [A-E]), or digits ([0-9]). Anything
+else is unsafe or unclear. If in doubt, spell out the range in full.
Characters may be specified using a metacharacter syntax much like that
used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,
using Unicode values, so C<[\N{U+27}-\N{U+3F}]> means to match
C<\N{U+27}>, C<\N{U+28}>, C<\N{U+29}>, ..., C<\N{U+3D}>, C<\N{U+3E}>,
and C<\N{U+3F}>, whatever the native code point versions for those are.
+These are called "Unicode" ranges. If either end is of the C<\N{...}>
+form, the range is considered Unicode. A C<regexp> warning is raised
+under C<S<"use re 'strict'">> if the other endpoint is specified
+non-portably:
+
+ [\N{U+00}-\x09] # Warning under re 'strict'; \x09 is non-portable
+ [\N{U+00}-\t] # No warning; \t is portable
+
+Both of the above match the characters C<\N{U+00}>, C<\N{U+01}>, ...
+C<\N{U+08}>, C<\N{U+09}>, but the C<\x09> looks like it could be a
+mistake so the warning is raised (under C<re 'strict'>) for it.
Perl also guarantees that the ranges C<A-Z>, C<a-z>, C<0-9>, and any
subranges of these match what an English-only speaker would expect them
REPORT_LOCATION_ARGS(offset)); \
} STMT_END
+#define vWARN(loc, m) STMT_START { \
+ const IV offset = loc - RExC_precomp; \
+ __ASSERT_(PASS2) Perl_warner(aTHX_ packWARN(WARN_REGEXP), m REPORT_LOCATION, \
+ REPORT_LOCATION_ARGS(offset)); \
+} STMT_END
+
#define vWARN_dep(loc, m) STMT_START { \
const IV offset = loc - RExC_precomp; \
__ASSERT_(PASS2) Perl_warner(aTHX_ packWARN(WARN_DEPRECATED), m REPORT_LOCATION, \
* runtime locale is UTF-8 */
SV* only_utf8_locale_list = NULL;
-#ifdef EBCDIC
/* In a range, if one of the endpoints is non-character-set portable,
* meaning that it hard-codes a code point that may mean a different
* character in ASCII vs. EBCDIC, as opposed to, say, a literal 'A' or a
* to Unicode (i.e. non-ASCII), each code point in it should be considered
* to be a Unicode value. */
bool unicode_range = FALSE;
-#endif
bool invert = FALSE; /* Is this class to be complemented */
bool warn_super = ALWAYS_WARN_SUPER;
if (!range) {
rangebegin = RExC_parse;
element_count++;
-#ifdef EBCDIC
non_portable_endpoint = 0;
-#endif
}
if (UTF) {
value = utf8n_to_uvchr((U8*)RExC_parse,
prevvalue = save_prevvalue;
continue; /* Back to top of loop to get next char */
}
+
/* Here, is a single code point, and <value> contains it */
-#ifdef EBCDIC
unicode_range = TRUE; /* \N{} are Unicode */
-#endif
}
break;
case 'p':
vFAIL(error_msg);
}
}
-#ifdef EBCDIC
non_portable_endpoint++;
-#endif
if (IN_ENCODING && value < 0x100) {
goto recode_encoding;
}
vFAIL(error_msg);
}
}
-#ifdef EBCDIC
non_portable_endpoint++;
-#endif
if (IN_ENCODING && value < 0x100)
goto recode_encoding;
break;
case 'c':
value = grok_bslash_c(*RExC_parse++, PASS2);
-#ifdef EBCDIC
non_portable_endpoint++;
-#endif
break;
case '0': case '1': case '2': case '3': case '4':
case '5': case '6': case '7':
(void)ReREFCNT_inc(RExC_rx_sv);
}
}
-#ifdef EBCDIC
non_portable_endpoint++;
-#endif
if (IN_ENCODING && value < 0x100)
goto recode_encoding;
break;
}
}
+ if (strict && PASS2 && ckWARN(WARN_REGEXP)) {
+ if (range) {
+
+ /* If the range starts above 255, everything is portable and
+ * likely to be so for any foreseeable character set, so don't
+ * warn. */
+ if (unicode_range && non_portable_endpoint && prevvalue < 256) {
+ vWARN(RExC_parse, "Both or neither range ends should be Unicode");
+ }
+ }
+ }
+
/* Deal with this element of the class */
if (! SIZE_ONLY) {
#ifndef EBCDIC
push @warning, @warnings_utf8;
my @warning_only_under_strict = (
+ '/[\N{U+00}-\x01]\x{100}/' => 'Both or neither range ends should be Unicode {#} m/[\N{U+00}-\x01{#}]\x{100}/',
+ '/[\x00-\N{SOH}]\x{100}/' => 'Both or neither range ends should be Unicode {#} m/[\x00-\N{U+01}{#}]\x{100}/',
+ '/[\N{DEL}-\o{377}]\x{100}/' => 'Both or neither range ends should be Unicode {#} m/[\N{U+7F}-\o{377}{#}]\x{100}/',
+ '/[\o{0}-\N{U+01}]\x{100}/' => 'Both or neither range ends should be Unicode {#} m/[\o{0}-\N{U+01}{#}]\x{100}/',
+ '/[\000-\N{U+01}]\x{100}/' => 'Both or neither range ends should be Unicode {#} m/[\000-\N{U+01}{#}]\x{100}/',
+ '/[\N{DEL}-\377]\x{100}/' => 'Both or neither range ends should be Unicode {#} m/[\N{U+7F}-\377{#}]\x{100}/',
+ '/[\N{U+00}-A]\x{100}/' => "",
+ '/[a-\N{U+FF}]\x{100}/' => "",
+ '/[\N{U+00}-\a]\x{100}/' => "",
+ '/[\a-\N{U+FF}]\x{100}/' => "",
+ '/[\N{U+FF}-\x{100}]/' => 'Both or neither range ends should be Unicode {#} m/[\N{U+FF}-\x{100}{#}]/',
+ '/[\N{U+100}-\x{101}]/' => "",
);
my @experimental_regex_sets = (