package feature;
-our $VERSION = '1.48';
+our $VERSION = '1.49';
our %feature = (
fc => 'feature_fc',
This feature is available starting with Perl 5.12; was almost fully
implemented in Perl 5.14; and extended in Perl 5.16 to cover C<quotemeta>;
-and extended further in Perl 5.26 to cover L<the range
-operator|perlop/Range Operators>.
+was extended further in Perl 5.26 to cover L<the range
+operator|perlop/Range Operators>; and was extended again in Perl 5.28 to
+cover L<special-cased whitespace splitting|perlfunc/split>.
=head2 The 'unicode_eval' and 'evalbytes' features
being supplied where an SV was expected, crashing perl. [perl
#131597]
+=item *
+
+C<split ' '> now correctly handles the argument being split when in the
+scope of the L<< C<unicode_strings>|feature/"The 'unicode_strings' feature"
+>> feature. Previously, when a string using the single-byte internal
+representation contained characters that are whitespace by Unicode rules but
+not by ASCII rules, it treated those characters as part of fields rather
+than as field separators. [perl #130907]
+
=back
=head1 Known Problems
pattern argument to split; in Perl 5.18.0 and later this special case is
triggered by any expression which evaluates to the simple string S<C<" ">>.
+As of Perl 5.28, this special-cased whitespace splitting works as expected in
+the scope of L<< S<C<"use feature 'unicode_strings">>|feature/The
+'unicode_strings' feature >>. In previous versions, and outside the scope of
+that feature, it exhibits L<perlunicode/The "Unicode Bug">: characters that are
+whitespace according to Unicode rules but not according to ASCII rules can be
+treated as part of fields rather than as field separators, depending on the
+string's internal encoding.
+
If omitted, PATTERN defaults to a single space, S<C<" ">>, triggering
the previously described I<awk> emulation.
exceeded that of the right-hand side, where the right-hand side took up more
bytes than the correct range endpoint.
+=item *
+
+In L<< C<split>'s special-case whitespace splitting|perlfunc/split >>.
+
+Starting in Perl 5.28.0, the C<split> function with a pattern specified as
+a string containing a single space handles whitespace characters consistently
+within the scope of of C<unicode_strings>. Prior to that, or outside its scope,
+characters that are whitespace according to Unicode rules but not according to
+ASCII rules were treated as field contents rather than field separators when
+they appear in byte-encoded strings.
+
=back
You can see from the above that the effect of C<unicode_strings>
problems of the initial Unicode implementation, but for example
regular expressions still do not work with Unicode in 5.6.1.
Perl v5.14.0 is the first release where Unicode support is
-(almost) seamlessly integrable without some gotchas. (There are two
+(almost) seamlessly integrable without some gotchas. (There are a few
exceptions. Firstly, some differences in L<quotemeta|perlfunc/quotemeta>
were fixed starting in Perl 5.16.0. Secondly, some differences in
L<the range operator|perlop/Range Operators> were fixed starting in
-Perl 5.26.0.)
+Perl 5.26.0. Thirdly, some differences in L<split|perlfunc/split> were fixed
+started in Perl 5.28.0.)
To enable this
seamless support, you should C<use feature 'unicode_strings'> (which is
STRLEN len;
const char *s = SvPV_const(sv, len);
const bool do_utf8 = DO_UTF8(sv);
+ const bool in_uni_8_bit = IN_UNI_8_BIT;
const char *strend = s + len;
PMOP *pm = cPMOPx(PL_op);
REGEXP *rx;
while (s < strend && isSPACE_LC(*s))
s++;
}
+ else if (in_uni_8_bit) {
+ while (s < strend && isSPACE_L1(*s))
+ s++;
+ }
else {
while (s < strend && isSPACE(*s))
s++;
{
while (m < strend && !isSPACE_LC(*m))
++m;
+ }
+ else if (in_uni_8_bit) {
+ while (m < strend && !isSPACE_L1(*m))
+ ++m;
} else {
while (m < strend && !isSPACE(*m))
++m;
{
while (s < strend && isSPACE_LC(*s))
++s;
+ }
+ else if (in_uni_8_bit) {
+ while (s < strend && isSPACE_L1(*s))
+ ++s;
} else {
while (s < strend && isSPACE(*s))
++s;
__END__
package feature;
-our $VERSION = '1.48';
+our $VERSION = '1.49';
FEATURES
This feature is available starting with Perl 5.12; was almost fully
implemented in Perl 5.14; and extended in Perl 5.16 to cover C<quotemeta>;
-and extended further in Perl 5.26 to cover L<the range
-operator|perlop/Range Operators>.
+was extended further in Perl 5.26 to cover L<the range
+operator|perlop/Range Operators>; and was extended again in Perl 5.28 to
+cover L<special-cased whitespace splitting|perlfunc/split>.
=head2 The 'unicode_eval' and 'evalbytes' features
set_up_inc('../lib');
}
-plan tests => 163;
+plan tests => 172;
$FS = ':';
qq{split(\$cond ? qr/ / : " ", "$exp") behaves as expected over repeated similar patterns};
}
+SKIP: {
+ # RT #130907: unicode_strings feature doesn't work with split ' '
+
+ my ($sp) = grep /\s/u, map chr, reverse 128 .. 255 # prefer \xA0 over \x85
+ or skip 'no unicode whitespace found in high-8-bit range', 9;
+
+ for (["$sp$sp. /", "leading unicode whitespace"],
+ [".$sp$sp/", "unicode whitespace separator"],
+ [". /$sp$sp", "trailing unicode whitespace"]) {
+ my ($str, $desc) = @$_;
+ use feature "unicode_strings";
+ my @got = split " ", $str;
+ is @got, 2, "whitespace split: $desc: field count";
+ is $got[0], '.', "whitespace split: $desc: field 0";
+ is $got[1], '/', "whitespace split: $desc: field 1";
+ }
+}
+
{
# 'RT #116086: split "\x20" does not work as documented';
my @results;