5.22.1 today

5.22.1 today - update Module::CoreList

avoid leaks when calling mg_set() in leave_scope()

In leave_scope() in places like SAVEt_SV, it does stuff like

    if (SvSMAGICAL(...))
        mg_set(...)
    SvREFCNT_dec_NN(ARG0_SV)

If mg_set() dies (e.g. it calls STORE() and STORE() dies), then ARG0_SV
would leak. Fix this by putting ARG0_SV back in the save stack in this
case.

A similar thing applies to SAVEt_AV and SAVEt_HV, but I couldn't
think of a simple test for those, as tied array and hashes don't have
set magic (just RMG).

Also, SAVEt_AV and SAVEt_HV share a lot of common code, so I made
SAVEt_HV goto into the SAVEt_AV code block for the common part.
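
The idea of putting the value back on the save stack before a call that may die can be sketched in plain C. This is only an illustration and not the code in scope.c: Obj, save_push(), save_pop(), set_hook_that_dies() and run_leak_demo() are hypothetical names, and longjmp() stands in for die.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdlib.h>

typedef struct { int refcnt; } Obj;      /* hypothetical stand-in for an SV */

static jmp_buf die_env;                  /* where a "die" unwinds to */
static int live_objects = 0;             /* leak detector for the demo */

#define SAVE_MAX 16
static Obj *save_stack[SAVE_MAX];        /* hypothetical stand-in for the save stack */
static int save_top = 0;

static Obj *obj_new(void) {
    Obj *o = malloc(sizeof *o);
    assert(o != NULL);
    o->refcnt = 1;
    live_objects++;
    return o;
}

static void obj_dec(Obj *o) {
    if (--o->refcnt == 0) { free(o); live_objects--; }
}

static void save_push(Obj *o) { assert(save_top < SAVE_MAX); save_stack[save_top++] = o; }
static Obj *save_pop(void)    { assert(save_top > 0); return save_stack[--save_top]; }

/* Stand-in for mg_set() calling a STORE() that dies. */
static void set_hook_that_dies(Obj *o) { (void)o; longjmp(die_env, 1); }

/* By the time the scope-exit code runs, the entry has been popped.
 * Re-register it before the risky call so unwinding can still free it. */
static void leave_scope_sketch(void) {
    Obj *o = save_pop();
    save_push(o);            /* if the hook dies, unwinding still finds o */
    set_hook_that_dies(o);
    (void)save_pop();        /* hook returned normally: take o back */
    obj_dec(o);
}

/* Returns the number of objects still alive after a dying hook. */
int run_leak_demo(void) {
    save_push(obj_new());
    if (setjmp(die_env) == 0) {
        leave_scope_sketch();
    } else {
        while (save_top > 0) obj_dec(save_pop());   /* unwind after "die" */
    }
    return live_objects;
}
```

Without the save_push() call in leave_scope_sketch(), the object would be unreachable once the hook dies and run_leak_demo() would report one live object.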

utf8.c: Fix EBCDIC double translation

In Perl_uvoffuni_to_utf8_flags(), the input is a Unicode, not native,
code point. But in ba6ed43c6aca7f1ff5a1b82062faa3e1c33c0582, I used a
macro that assumes the input is native.

Suppress overflow warning in bop.t.

There is a constant designed to exercise the limits of a 64-bit
integer that causes an overflow when IVs are 32 bits. The warning
happens at compile time and we don't know yet that we will never
execute the 64-bit path at run time.

hexfp: all ppc/powerpc-ld linux tailbits difference in exp(1)

(not just linux-ppc64-ld)

Not a regression from 5.22.0.

Skip casing for high code points

As discussed in the previous commit, most code points in Unicode
don't change if upper-, or lower-cased, etc.  In fact as of Unicode
v8.0, 93% of the available code points are above the highest one that
does change.

This commit skips trying to case these 93%.  A regen/ script keeps track
of the max changing one in the current Unicode release, and skips casing
for the higher ones.  Thus currently, casing emoji will be skipped.

Together with the previous commits that dealt with casing, the potential
for huge memory requirements for the swash hashes for casing is
severely limited.
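
As a rough sketch of the shortcut (not the code in utf8.c), the test amounts to one comparison before any table lookup. The constant's name and value below are placeholders chosen for illustration; the real value is generated by the regen/ script. slow_uc_lookup() is a toy stand-in for the swash-based lookup.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder for illustration only; the real value comes from a
 * regen/ script and tracks the current Unicode release. */
#define SKETCH_HIGHEST_CASE_CHANGING_CP 0x118DFu

static int slow_lookup_calls = 0;   /* counts how often the slow path runs */

/* Toy stand-in for the expensive table lookup: only knows ASCII. */
static uint32_t slow_uc_lookup(uint32_t cp) {
    slow_lookup_calls++;
    if (cp >= 'a' && cp <= 'z') return cp - 0x20;
    return cp;
}

/* Upper-case a code point, skipping the lookup for high code points. */
uint32_t uc_sketch(uint32_t cp) {
    if (cp > SKETCH_HIGHEST_CASE_CHANGING_CP) {
        return cp;                  /* nothing up here changes case */
    }
    return slow_uc_lookup(cp);
}
```

With this shape, uc_sketch(0x1F570) returns its input without ever reaching slow_uc_lookup().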

If the following command is run on a perl compiled with -O2 and no
DEBUGGING:

    blead Porting/bench.pl --raw --perlargs="-Ilib -X" --benchfile=plane1_case_perf /path_to_prior_perl=before_this_commit /path_to_new_perl=after

and the file 'plane1_case_perf' contains

    [
        'string::casing::emoji' => {
            desc    => 'yes swash vs no swash',
            setup   => 'my $a = "\x{1F570}"',  # MANTELPIECE CLOCK
            code    => 'uc($a)'
        },
    ];

the following results are obtained:

The numbers represent raw counts per loop iteration.

string::casing::emoji
yes swash vs no swash

       before_this_commit    after
       ------------------ --------
    Ir              981.0    306.0
    Dr              228.0     94.0
    Dw              100.0     45.0
  COND              137.0     49.0
   IND                7.0      4.0

COND_m                5.5      0.0
 IND_m                4.0      2.0

 Ir_m1                0.1     -0.1
 Dr_m1                0.0      0.0
 Dw_m1                0.0      0.0

 Ir_mm                0.0      0.0
 Dr_mm                0.0      0.0
 Dw_mm                0.0      0.0

Skip casing for some non-cased scripts

Characters whose upper, lower, title, or fold case differ from the
character itself amount to just 1.5% of the assigned Unicode characters,
and this percentage falls with each new Unicode release, as almost all
cased scripts have already been encoded.  But a lot of code is written
assuming a cased language, such as calling uc() or lcfirst(), or doing
qr//i.  When such code is run on a non-cased language, the work expended
in doing the casing is wasted.  And casing is expensive.  But finding
out if a character is cased or not is nearly as expensive, so one might
as well just do the casing.

However, the Unicode code space is organized so that there are some long
stretches of contiguous code points that aren't cased.  By adding tests
to see if the input code point is in just a few of these ranges, we can
quickly rule casing out for most of the non-cased scripts that are of
commercial use today, at essentially no expense to handling the more
common cased scripts.  Testing for just 3 ranges in Plane 0 of Unicode
(where most of the code points in common use today reside) allows us to
skip doing casing for more than 82% of code points in the plane,
including the following languages: Arabic, Chinese, Hebrew, Japanese,
Korean, Thai, and the major scripts of India.  No longer is a swash
generated when trying to case one of these, so runtime memory usage is
decreased.

(It should be noted that some of these languages have characters
scattered in other areas, because the original allocation for them
turned out to be not large enough.  When changing the case of these
other characters, the lookups won't be skipped.  But that original
allocation included all or nearly all the characters in current common
use, so these other characters are comparatively rare.)

The comments in the code indicate some candidate non-cased ranges that I
chose not to treat specially at this time.  The next commit will address
planes above Plane 0.
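
A minimal sketch of such range tests follows. The three ranges are illustrative guesses at long uncased stretches of Plane 0, not necessarily the ones the commit uses, and in_uncased_range_sketch() is a hypothetical name. The code points tried are the ones from the benchmark file.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if cp lies in one of a few long stretches of Plane 0 that
 * are assumed here to contain no cased characters.  The boundaries are
 * illustrative, not taken from utf8.c. */
int in_uncased_range_sketch(uint32_t cp) {
    return (cp >= 0x0590 && cp <= 0x109F)    /* Hebrew, Arabic, Indic, Thai, ... */
        || (cp >= 0x2E80 && cp <= 0xA4CF)    /* CJK, kana, Yi */
        || (cp >= 0xAC00 && cp <= 0xD7FF);   /* Hangul */
}
```

A caller would test this first and return the input unchanged when it matches, so Greek (U+03B1) still takes the normal casing path while Hebrew (U+05D0), CJK (U+4E01) and Korean (U+AC00) do not.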

When this command is run on a perl compiled with -O2, no DEBUGGING:

    blead Porting/bench.pl --perlargs="-Ilib -X" --benchfile=plane0_casing_perf /path_to_prior_perl=before_this_commit /path_to_new_perl=after

and the file 'plane0_casing_perf' contains

    [
        'string::casing::greek' => {
            desc    => 'should be no change',
            setup   => 'my $a = "\x{3B1}"',  # GREEK SMALL LETTER ALPHA
            code    => 'uc($a)'
        },
        'string::casing::hebrew' => {
            desc    => 'yes swash vs no swash',
            setup   => 'my $a = "\x{5D0}"',  # HEBREW LETTER ALEF
            code    => 'uc($a)'
        },
        'string::casing::cjk' => {
            desc    => 'yes swash vs no swash',
            setup   => 'my $a = "\x{4E01}"',
            code    => 'uc($a)'
        },
        'string::casing::korean' => {
            desc    => 'yes swash vs no swash',
            setup   => 'my $a = "\x{AC00}"',
            code    => 'uc($a)'
        },
    ];

These are the results:

The numbers represent raw counts per loop iteration.

string::casing::cjk
yes swash vs no swash

       before_this_commit    after
       ------------------ --------
    Ir              931.0    300.0
    Dr              217.0     93.0
    Dw               94.0     45.0
  COND              129.0     48.0
   IND                7.0      4.0

COND_m                1.5      0.0
 IND_m                4.0      2.0

 Ir_m1                0.1      0.0
 Dr_m1                0.0      0.0
 Dw_m1                0.0      0.0

 Ir_mm                0.0      0.0
 Dr_mm                0.0      0.0
 Dw_mm                0.0      0.0

string::casing::greek
should be no change

       before_this_commit    after
       ------------------ --------
    Ir              946.0    920.0
    Dr              218.0    220.0
    Dw              100.0    100.0
  COND              127.0    121.0
   IND                6.0      8.0

COND_m                0.5      1.3
 IND_m                2.0      2.0

 Ir_m1                0.1      0.0
 Dr_m1                0.0      0.0
 Dw_m1                0.0      0.0

 Ir_mm                0.0      0.0
 Dr_mm                0.0      0.0
 Dw_mm                0.0      0.0

string::casing::hebrew
yes swash vs no swash

       before_this_commit    after
       ------------------ --------
    Ir              928.0    290.0
    Dr              224.0     92.0
    Dw              100.0     45.0
  COND              129.0     46.0
   IND                6.0      4.0

COND_m                0.5      0.0
 IND_m                2.0      2.0

 Ir_m1                0.1      0.0
 Dr_m1                0.0      0.0
 Dw_m1                0.0      0.0

 Ir_mm                0.0      0.0
 Dr_mm                0.0      0.0
 Dw_mm                0.0      0.0

string::casing::korean
yes swash vs no swash

       before_this_commit    after
       ------------------ --------
    Ir              953.0    307.6
    Dr              224.0     93.0
    Dw              100.0     45.0
  COND              131.0     50.9
   IND                7.0      4.0

COND_m                1.5      0.0
 IND_m                4.0      2.0

 Ir_m1                0.1      0.0
 Dr_m1                0.0      0.0
 Dw_m1                0.0      0.0

 Ir_mm                0.0      0.0
 Dr_mm                0.0      0.0
 Dw_mm                0.0      0.0

utf8.c: Add indentation

This is in preparation for the next commit, so the diff command is less
confused

Don't try to case change surrogates, above-Unicodes

Changing the case (upper, lower, title, fold) of surrogate code points
and non-Unicode code points always yields the original, so there is no
need to actually try it.  And trying it is slow and creates swashes,
which uses up runtime memory.  We test for these code points anyway, so
at the cost of just two gotos and a label, we can skip all that work and
potential memory use.  This is worth doing even though such usage will
be rare in practice.
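
The early exit can be sketched as below. This is not the utf8.c code (which uses gotos and a label); uc_skip_sketch() and slow_uc_lookup() are hypothetical names, and the lookup is a toy that only knows ASCII.

```c
#include <assert.h>
#include <stdint.h>

static int slow_lookup_calls = 0;   /* counts how often the slow path runs */

/* Toy stand-in for the swash-based lookup. */
static uint32_t slow_uc_lookup(uint32_t cp) {
    slow_lookup_calls++;
    if (cp >= 'a' && cp <= 'z') return cp - 0x20;
    return cp;
}

static int is_surrogate(uint32_t cp)     { return cp >= 0xD800 && cp <= 0xDFFF; }
static int is_above_unicode(uint32_t cp) { return cp > 0x10FFFF; }

/* Surrogates and above-Unicode code points always case to themselves,
 * so return them untouched instead of consulting any table. */
uint32_t uc_skip_sketch(uint32_t cp) {
    if (is_surrogate(cp) || is_above_unicode(cp)) {
        return cp;
    }
    return slow_uc_lookup(cp);
}
```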

Running the following command

    blead Porting/bench.pl --perlargs="-Ilib -X" --benchfile=above_unicode path_to_prior_perl=before_this_commit path_to_this_perl=after

on a -O2 no DEBUGGING perl, where file 'above_unicode' contains

    [
        'string::casing::above_unicode' => {
            desc    => 'yes cases vs no casing',
            setup   => 'my $a = "\x{110000}"',
            code    => 'my $b = uc($a)'
        },
    ];

yields this output (the extra cost of swash creation is not included):

The numbers represent raw counts per loop iteration.

string::casing::above_unicode
yes cases vs no casing

        before_this_commit    after
        ------------------ --------
     Ir             1329.0    651.0
     Dr              324.0    190.0
     Dw              149.0     94.0
   COND              192.0    103.0
    IND               13.0     10.0

 COND_m                5.5      0.0
  IND_m                6.0      4.0

  Ir_m1                0.1      0.0
  Dr_m1                0.0      0.0
  Dw_m1                0.0      0.0

  Ir_mm                0.0      0.0
  Dr_mm                0.0      0.0
  Dw_mm                0.0      0.0

Fix awkward wording in 'say' documentation

For: RT #126833

reword $@ documentation (it's not just for syntax errors)

RT #124034

utf8.c: Don't throw away a value and then recalc it

In half the calls to to_utf8_case(), the code point being looked up is
known. It is thrown away because the API doesn't pass it, and then
recalculated first thing in to_utf8_case.

Fix this by making a new static function which adds the code point to
the parameter list, and change all calls to use this, leaving the
existing to_utf8_case() as just a wrapper for the new function.
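
The shape of that refactoring, reduced to a toy: a helper that takes the already-known code point, and a thin wrapper that decodes and then calls it. All names here are hypothetical, the decoder assumes well-formed 1- to 4-byte UTF-8 on an ASCII platform, and the case mapping is a stub.

```c
#include <assert.h>
#include <stdint.h>

static int decode_calls = 0;    /* counts decodes, to show what is saved */

/* Decode one well-formed UTF-8 sequence of 1 to 4 bytes. */
static uint32_t decode_utf8_sketch(const uint8_t *p) {
    decode_calls++;
    if (p[0] < 0x80) return p[0];
    if ((p[0] & 0xE0) == 0xC0)
        return ((uint32_t)(p[0] & 0x1F) << 6) | (uint32_t)(p[1] & 0x3F);
    if ((p[0] & 0xF0) == 0xE0)
        return ((uint32_t)(p[0] & 0x0F) << 12)
             | ((uint32_t)(p[1] & 0x3F) << 6) | (uint32_t)(p[2] & 0x3F);
    return ((uint32_t)(p[0] & 0x07) << 18) | ((uint32_t)(p[1] & 0x3F) << 12)
         | ((uint32_t)(p[2] & 0x3F) << 6)  | (uint32_t)(p[3] & 0x3F);
}

/* New-style helper: the caller passes the code point it already knows.
 * The mapping is a stub covering ASCII and GREEK SMALL LETTER ALPHA. */
static uint32_t upper_with_cp(uint32_t cp, const uint8_t *p) {
    (void)p;    /* a real implementation would still need the bytes */
    if (cp >= 'a' && cp <= 'z') return cp - 0x20;
    if (cp == 0x3B1) return 0x391;
    return cp;
}

/* Old-style entry point, now just a wrapper that decodes first. */
uint32_t upper_from_utf8(const uint8_t *p) {
    return upper_with_cp(decode_utf8_sketch(p), p);
}
```

Callers that already hold the code point call upper_with_cp() directly and skip the decode; everyone else keeps using the wrapper unchanged.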

embed.fnc: White-space only

perlapi: Vaguely deprecate to_utf8_case

by giving alternatives to use instead.

utf8.c: Add some LIKELY(), UNLIKELY()

perlgit.pod: update 'git status' sample output

perlgit.pod: standardize on % as shell prompt

utf8.h, utfebcdic.h: Add #define

for future use

utf8.h: Fix macro definition

This has been wrong, and won't compile, should anyone have tried, since
635e76f560b3b3ca075aa2cb5d6d661601968e04 earlier in 5.23.

utf8.h: Remove unused #define

UTF8_QUAD_MAX is no longer used in the core, and is not in cpan, and its
name is highly misleading. It is defined to be 2**36, which has really
nothing to do with what its name indicates.

t/lib/warnings/utf8: Add some tests

These better test the detection of surrogates, noncharacters, and
above-Unicode code points.

Achim Gratz is now a perl author

[perl #126834] Cygwin cygdrive prefix test

* t/lib/cygwin.t: Use the /proc virtual filesystem to determine the
  cygdrive prefix.  If that isn't available, fall back to using the
  cygpath executable instead of parsing the output from df or mount
  for older Cygwin.  That fallback can fail if C:\ is manually mounted
  someplace else, but the former code had the same problem.

[MERGE] rpeep() consistent oldoldop -> oldop -> o

rpeep() assert oldoldop -> oldop -> o form a chain

In rpeep(), in a loop, the var o becomes each op in the op_next chain in
turn. At the same time, oldop is set to the previous value of o, and
oldoldop the previous but one.

Some places that modify the op_next chain weren't correctly updating
oldop and oldoldop at the same time. The last few commits have fixed those
places.

This commit adds an assert at the top of loop to check that oldoldop and
oldop are in fact consistent.

(This assert was used to find the faults fixed in the previous couple of
commits).
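
A toy version of the invariant, on a plain singly linked list rather than real ops. The Op struct, the op types and the "drop the earlier of two adjacent NEXTSTATEs" optimisation are all simplified assumptions made for this sketch; only the variable names oldoldop, oldop and o come from rpeep().

```c
#include <assert.h>
#include <stddef.h>

enum { OP_OTHER, OP_NEXTSTATE };

typedef struct Op {
    int type;
    struct Op *next;        /* plays the role of op_next */
} Op;

/* Walk the chain, unlinking the earlier of two adjacent NEXTSTATE ops.
 * Returns how many ops were unlinked. */
int peep_sketch(Op *first) {
    Op *oldoldop = NULL, *oldop = NULL, *o;
    int removed = 0;

    for (o = first; o; o = o->next) {
        /* The invariant: oldoldop -> oldop -> o are adjacent. */
        assert(!oldop    || oldop->next    == o);
        assert(!oldoldop || oldoldop->next == oldop);

        if (o->type == OP_NEXTSTATE && oldop && oldoldop
            && oldop->type == OP_NEXTSTATE)
        {
            oldoldop->next = o;     /* unlink oldop from the chain */
            oldop    = oldoldop;    /* keep the bookkeeping consistent: */
            oldoldop = NULL;        /* the op before that is now unknown */
            removed++;
        }

        oldoldop = oldop;
        oldop    = o;
    }
    return removed;
}
```

If the two bookkeeping assignments after the unlink are left out, the assert at the top of the loop fires on the next iteration, which is the kind of fault the real assert was added to catch.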

rpeep: maintain chain when handling for(reverse..)

There's code in rpeep() that eliminates the reverse op from
for (reverse ....) {} and just flags the enteriter as needing to reverse
its args.

This code didn't leave oldoldop -> oldop -> o as a consistent chain of
adjacent op_next ops.

rpeep: maintain chain when del extra nextstates

There's code in rpeep() that eliminates duplicate nextstate ops.
E.g.

FOO -> NEXTSTATE1 -> NULL -> ... -> NULL -> NEXTSTATE2 -> ...

becomes

FOO --------------------------------------> NEXTSTATE2 -> ...

This code didn't leave oldoldop -> oldop -> o as a consistent chain of
adjacent op_next ops.

stop the eliding of void $pkg_var from assert fail

The code to eliminate things like our($foo); from the runtime op_next
chain in rpeep() caused the oldoldop, oldop, o vars not to form a chain of
3 adjacent ops. Instead, oldoldop and oldop ended up equal, which later
caused an assertion failure in the padrange code for something like

my($a,$b),$x,my($c,$d);

utf8.c: Fix broken EBCDIC compilation

Commit ba6ed43c6aca7f1ff5a1b82062faa3e1c33c0582 left out a '}' which is
skipped except in EBCDIC builds. (I meant to make sure things would
compile (by reversing the sense of the #if's) on EBCDIC, but forgot at
the time it should have been done.)

[perl #126593] make sure utf8_heavy.pl doesn't depend on itself

With ${^ENCODING} set, it did.

Partly reverts:

commit aa8f6cef961dc2009604f7464c66106421c3ae81
Author: Rafael Garcia-Suarez <rgs@consttype.org>
Date: Wed Jun 17 13:18:59 2015 +0200

Microoptimize some matches in utf8_heavy.pl

Perl_uvoffuni_to_utf8_flags() Combine ASCII, EBCDIC branches

This uses the underlying structure of UTF-8 and UTF-EBCDIC to unify most
of the code.  Previously, the ASCII platform version unrolled a loop,
and the EBCDIC didn't.  Now the loop is used for code points that
require 5 or more bytes to represent in UTF-8 and UTF-EBCDIC.  On ASCII
platforms, this means that all legal Unicode code points use the
unrolled version.  I used cachegrind to find that the unrolled savings
were not large, and in the trade-off between performance and
maintainability on code points that Unicode doesn't think are legal,
maintainability wins.

I also moved the tests so that there are no unnecessary tests on ASCII
platforms.  For example, if we know that we are in a range of code
points that doesn't have surrogates, no tests for surrogates are done.
Perhaps an optimizing compiler could figure this out.  There is a
smidgeon of extra tests on EBCDIC platforms, to keep the code unified
between the two platform types.
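
For reference, the non-unrolled loop can be written as a small standalone function. This sketch is for ASCII platforms only (so there is no I8_TO_NATIVE_UTF8 step), does none of the surrogate, non-character or above-Unicode checks, and takes a 32-bit input. SHIFT, MARK and MASK follow the short names used in this message; START_MARK, START_MASK, skip_len() and encode_sketch() are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SHIFT 6                 /* payload bits per continuation byte */
#define MARK  0x80              /* continuation bytes look like 10xxxxxx */
#define MASK  0x3F

/* Start-byte marker and payload mask for a sequence of len bytes (2..7). */
#define START_MARK(len) ((0xFF00u >> (len)) & 0xFFu)
#define START_MASK(len) (0x7Fu >> (len))

/* Number of bytes needed, using Perl's extended UTF-8 lengths. */
static size_t skip_len(uint32_t uv) {
    if (uv < 0x80)        return 1;
    if (uv < 0x800)       return 2;
    if (uv < 0x10000)     return 3;
    if (uv < 0x200000)    return 4;
    if (uv < 0x4000000)   return 5;
    if (uv < 0x80000000u) return 6;
    return 7;
}

/* Write uv at d and return a pointer just past the last byte written. */
uint8_t *encode_sketch(uint8_t *d, uint32_t uv) {
    size_t len = skip_len(uv);
    uint8_t *p;

    if (len == 1) {             /* invariant: the byte is the code point */
        *d = (uint8_t)uv;
        return d + 1;
    }

    p = d + len - 1;
    while (p > d) {             /* fill continuation bytes, last first */
        *p-- = (uint8_t)((uv & MASK) | MARK);
        uv >>= SHIFT;
    }
    *p = (uint8_t)((uv & START_MASK(len)) | START_MARK(len));
    return d + len;
}
```

The caller must supply a buffer of at least 7 bytes. U+03B1 comes out as CE B1 and U+1F570 as F0 9F 95 B0.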

Originally, I did try to keep the loop unrolled, which is how I found
that the performance savings weren't great.  Here is that code (with a
space inserted before column 1 '#' chars, so git doesn't think they are
comments):

U8 *
Perl_uvoffuni_to_utf8_flags(pTHX_ U8 *d, UV uv, UV flags)
{
    PERL_ARGS_ASSERT_UVOFFUNI_TO_UTF8_FLAGS;

    /* Test for and handle 1-byte result. */
    if (OFFUNI_IS_INVARIANT(uv)) {
        *d++ = LATIN1_TO_NATIVE(uv);
        return d;
    }

/*  Use shorter names internally in this file */
#define SHIFT   UTF_ACCUMULATION_SHIFT
#undef  MARK
#define MARK    UTF_CONTINUATION_MARK
#define MASK    UTF_CONTINUATION_MASK

    /* Below is an unrolled version of
     *
        STRLEN len  = OFFUNISKIP(uv);
        U8 *p = d+len-1;
        while (p > d) {
            *p-- = I8_TO_NATIVE_UTF8((uv & UTF_CONTINUATION_MASK) | UTF_CONTINUATION_MARK);
            uv >>= UTF_ACCUMULATION_SHIFT;
        }
        *p = I8_TO_NATIVE_UTF8((uv & UTF_START_MASK(len)) | UTF_START_MARK(len));
        return d+len;
     *
     * Unrolled, it looks like:
     *
        if (uv < max_2_byte_uv) return the 2 bytes;
        if (uv < max_3_byte_uv) return the 3 bytes;
        ...
    *
    * Note that on EBCDIC we have to turn things into NATIVE_UTF8, which is a
    * no-op on ASCII platforms */

    /* Not 1-byte; test for and handle 2-byte result.   In the test immediately
     * below, the 32 is for start bytes C0-CF, D0-DF, each of which has a
     * continuation byte which contributes SHIFT bits.  This yields 0x400 on
     * EBCDIC platforms, 0x800 on ASCII */
    if (uv < (32 * (1U << SHIFT))) {
        *d++ = I8_TO_NATIVE_UTF8(( uv >> SHIFT) | UTF_START_MARK(2));
        *d++ = I8_TO_NATIVE_UTF8(( uv           & MASK) |   MARK);
        return d;
    }

    /* Not 2-byte; test for and handle 3-byte result.   In the test immediately
     * below, the 16 is for start bytes E0-EF (which are the ones that indicate
     * 3 bytes), the 2 is for 2 continuation bytes which each contribute SHIFT
     * bits.  This yields 0x4000 on EBCDIC platforms, 0x1_0000 on ASCII, so 3
     * bytes covers the range 0x400-0x3FFF on EBCDIC; 0x800-0xFFFF on ASCII */
    if (uv < (16 * (1U << (2 * SHIFT)))) {
        *d++ = I8_TO_NATIVE_UTF8(( uv >> ((3 - 1) * SHIFT)) | UTF_START_MARK(3));
        *d++ = I8_TO_NATIVE_UTF8(((uv >> ((2 - 1) * SHIFT)) & MASK) |   MARK);
        *d++ = I8_TO_NATIVE_UTF8(( uv  /* (1 - 1) */        & MASK) |   MARK);

#ifndef EBCDIC  /* These problematic code points are 4 bytes on EBCDIC */
        /* The most likely code points in this range are below the surrogates.
         * Do an extra test to quickly exclude those. */
        if (UNLIKELY(uv >= UNICODE_SURROGATE_FIRST)) {
            if (UNLIKELY(   UNICODE_IS_32_NONCHARS(uv)
                         || UNICODE_IS_xFFF_E_F(uv)))
            {
                goto handle_nonchar;
            }
            if (UNLIKELY(UNICODE_IS_SURROGATE(uv))) {
                goto handle_surrogate;
            }
        }
#endif
        return d;
    }

    /* Not 3-byte; test for and handle 4-byte result.   In the test immediately
     * below, the 8 is for start bytes F0-F7, the 3 is for 3 continuation bytes
     * which each contribute SHIFT bits.  This yields 0x4_0000 on EBCDIC
     * platforms, 0x20_0000 on ASCII, so 4 bytes covers the range
     * 0x4000-0x3_FFFF on EBCDIC; 0x1_0000-0x1F_FFFF on ASCII */
    if (uv < (8 * (1U << (3 * SHIFT)))) {
        *d++ = I8_TO_NATIVE_UTF8(( uv >> ((4 - 1) * SHIFT)) | UTF_START_MARK(4));
        *d++ = I8_TO_NATIVE_UTF8(((uv >> ((3 - 1) * SHIFT)) & MASK) |   MARK);
        *d++ = I8_TO_NATIVE_UTF8(((uv >> ((2 - 1) * SHIFT)) & MASK) |   MARK);
        *d++ = I8_TO_NATIVE_UTF8(( uv  /* (1 - 1) */        & MASK) |   MARK);

#ifdef EBCDIC  /* These problematic code points are 3 bytes on ASCII */
        if (UNLIKELY(   UNICODE_IS_32_NONCHARS(uv)
                     || UNICODE_IS_xFFF_E_F(uv)))
        {
            goto handle_nonchar;
        }
        if (UNLIKELY(UNICODE_IS_SURROGATE(uv))) {
            goto handle_surrogate;
        }
#else
        if (UNLIKELY(   UNICODE_IS_xFFF_E_F(uv))
                     || UNICODE_IS_10_FFF_E_F(uv))
        {
            goto handle_nonchar;
        }
        if (UNLIKELY(UNICODE_IS_SUPER(uv))) {
            goto handle_super;
        }
#endif
        return d;
    }

    /* Not 4-byte; test for and handle 5-byte result.   In the test immediately
     * below, the first 4 is for start bytes F8-FB, the second 4 is for 4
     * continuation bytes which each contribute SHIFT bits.  This yields
     * 0x40_0000 on EBCDIC platforms, 0x400_0000 on ASCII, so 5 bytes covers
     * the range 0x4_0000-0x3F_FFFF on EBCDIC; 0x20_0000-0x3FF_FFFF on ASCII */
    if (uv < (4 * (1U << (4 * SHIFT)))) {
        *d++ = I8_TO_NATIVE_UTF8(( uv >> ((5 - 1) * SHIFT)) | UTF_START_MARK(5));
        *d++ = I8_TO_NATIVE_UTF8(((uv >> ((4 - 1) * SHIFT)) & MASK) |   MARK);
        *d++ = I8_TO_NATIVE_UTF8(((uv >> ((3 - 1) * SHIFT)) & MASK) |   MARK);
        *d++ = I8_TO_NATIVE_UTF8(((uv >> ((2 - 1) * SHIFT)) & MASK) |   MARK);
        *d++ = I8_TO_NATIVE_UTF8(( uv  /* (1 - 1) */        & MASK) |   MARK);

#ifdef EBCDIC
        if (UNLIKELY(   UNICODE_IS_xFFF_E_F(uv))
                     || UNICODE_IS_10_FFF_E_F(uv))
        {
            goto handle_nonchar;
        }
        if (UNLIKELY(UNICODE_IS_SUPER(uv))) {
            goto handle_super;
        }
        return d;
#else
        goto handle_super;
#endif
    }

    /* Not 5-byte; test for and handle 6-byte result.   In the test immediately
     * below, the 2 is for start bytes FC-FD, the 5 is for 5 continuation bytes
     * which each contribute SHIFT bits.  This yields 0x400_0000 on EBCDIC
     * platforms, 0x8000_0000 on ASCII, so 6 bytes covers the range
     * 0x40_0000-0x3FF_FFFF on EBCDIC; 0x400_0000-0x7FFF_FFFF on ASCII. */
    if (uv < (2 * (1U << (5 * SHIFT)))) {
        *d++ = I8_TO_NATIVE_UTF8(( uv >> ((6 - 1) * SHIFT)) | UTF_START_MARK(6));
        *d++ = I8_TO_NATIVE_UTF8(((uv >> ((5 - 1) * SHIFT)) & MASK) |   MARK);
        *d++ = I8_TO_NATIVE_UTF8(((uv >> ((4 - 1) * SHIFT)) & MASK) |   MARK);
        *d++ = I8_TO_NATIVE_UTF8(((uv >> ((3 - 1) * SHIFT)) & MASK) |   MARK);
        *d++ = I8_TO_NATIVE_UTF8(((uv >> ((2 - 1) * SHIFT)) & MASK) |   MARK);
        *d++ = I8_TO_NATIVE_UTF8(( uv  /* (1 - 1) */        & MASK) |   MARK);
        goto handle_super;
    }

    /* This could be moved down for EBCDIC, but not worth the complexity */
    if (   UNLIKELY(uv > MAX_NON_DEPRECATED_CP) && ckWARN_d(WARN_DEPRECATED)) {
        Perl_warner(aTHX_ packWARN(WARN_DEPRECATED),
                    cp_above_legal_max, uv, MAX_NON_DEPRECATED_CP);
    }

    /* Not 6-byte; handle 7-byte result.  There is no need for a test on
     * platforms where 7 bytes is the maximum possible.  The FE start byte
     * can have 6 continuation bytes which each contribute SHIFT bits.  This
     * yields 0x4000_0000 on EBCDIC platforms, 0x10_0000_0000 on ASCII, so 7
     * bytes covers the range 0x400_0000-0x3FFF_FFFF on EBCDIC;
     * 0x8000_0000-0xF_FFFF_FFFF on ASCII */
#if defined(UV_IS_QUAD) || defined(EBCDIC)
    if (uv < ((UV) 1U << (6 * SHIFT)))
#endif
    {
        *d++ = I8_TO_NATIVE_UTF8(0xfe); /* Can't match U+FEFF! */
        *d++ = I8_TO_NATIVE_UTF8(((uv >> ((6 - 1) * SHIFT)) & MASK) |   MARK);
        *d++ = I8_TO_NATIVE_UTF8(((uv >> ((5 - 1) * SHIFT)) & MASK) |   MARK);
        *d++ = I8_TO_NATIVE_UTF8(((uv >> ((4 - 1) * SHIFT)) & MASK) |   MARK);
        *d++ = I8_TO_NATIVE_UTF8(((uv >> ((3 - 1) * SHIFT)) & MASK) |   MARK);
        *d++ = I8_TO_NATIVE_UTF8(((uv >> ((2 - 1) * SHIFT)) & MASK) |   MARK);
        *d++ = I8_TO_NATIVE_UTF8(( uv  /* (1 - 1) */        & MASK) |   MARK);
#ifdef EBCDIC
        goto handle_super;
#else
        goto handle_above_31_bit;
#endif
    }

    /* Below is for a 0xFF start byte.  You need a 64-bit word size to be able
     * to express this on an ASCII machine, but a 32-bit word expresses the
     * lower range on EBCDIC platforms */
#if defined(UV_IS_QUAD) || defined(EBCDIC)
    {
        /* UTF8_MAX_BYTES result.  The 0xff start byte is followed by 13
         * continuation bytes on EBCDIC; 12 on ASCII.  These numbers of bytes
         * are essentially arbitrary, but were chosen to be enough to represent
         * 2**64 - 1 (plus an extra byte on ASCII).  */
        *d++ = I8_TO_NATIVE_UTF8(0xff); /* Can't match U+FFFE! */
#   ifdef UV_IS_QUAD
#      ifndef EBCDIC
        *d++ =    /*      ASCII platform (12 - 1) 6 Reserved bits */    MARK;
#      else
        *d++ = I8_TO_NATIVE_UTF8(((uv >>((13 - 1) * SHIFT)) & MASK) |   MARK);
        *d++ = I8_TO_NATIVE_UTF8(((uv >>((12 - 1) * SHIFT)) & MASK) |   MARK);
#      endif
        *d++ = I8_TO_NATIVE_UTF8(((uv >>((11 - 1) * SHIFT)) & MASK) |   MARK);
        *d++ = I8_TO_NATIVE_UTF8(((uv >>((10 - 1) * SHIFT)) & MASK) |   MARK);
        *d++ = I8_TO_NATIVE_UTF8(((uv >> ((9 - 1) * SHIFT)) & MASK) |   MARK);
        *d++ = I8_TO_NATIVE_UTF8(((uv >> ((8 - 1) * SHIFT)) & MASK) |   MARK);
#   else    /* Here, must be EBCDIC without quad */
        *d++ = I8_TO_NATIVE_UTF8(    /*  (13 - 1) 5 Reserved bits */    MARK);
        *d++ = I8_TO_NATIVE_UTF8(    /*  (12 - 1) 5 Reserved bits */    MARK);
        *d++ = I8_TO_NATIVE_UTF8(    /*  (11 - 1) 5 Reserved bits */    MARK);
        *d++ = I8_TO_NATIVE_UTF8(    /*  (10 - 1) 5 Reserved bits */    MARK);
        *d++ = I8_TO_NATIVE_UTF8(    /*  ( 9 - 1) 5 Reserved bits */    MARK);
        *d++ = I8_TO_NATIVE_UTF8(    /*  ( 8 - 1) 5 Reserved bits */    MARK);
#   endif
        *d++ = I8_TO_NATIVE_UTF8(((uv >> ((7 - 1) * SHIFT)) & MASK) |   MARK);
        *d++ = I8_TO_NATIVE_UTF8(((uv >> ((6 - 1) * SHIFT)) & MASK) |   MARK);
        *d++ = I8_TO_NATIVE_UTF8(((uv >> ((5 - 1) * SHIFT)) & MASK) |   MARK);
        *d++ = I8_TO_NATIVE_UTF8(((uv >> ((4 - 1) * SHIFT)) & MASK) |   MARK);
        *d++ = I8_TO_NATIVE_UTF8(((uv >> ((3 - 1) * SHIFT)) & MASK) |   MARK);
        *d++ = I8_TO_NATIVE_UTF8(((uv >> ((2 - 1) * SHIFT)) & MASK) |   MARK);
        *d++ = I8_TO_NATIVE_UTF8(( uv  /* (1 - 1) */        & MASK) |   MARK);
    }
#endif

#ifdef EBCDIC
    if (uv <= 0x7FFFFFFF) {
        goto handle_super;
    }
#endif
    /* FALLTHROUGH */                                                           \

  handle_above_31_bit:

    if (flags & (UNICODE_WARN_ABOVE_31_BIT|UNICODE_WARN_SUPER)) {
        Perl_ck_warner_d(aTHX_ packWARN(WARN_NON_UNICODE),
                  "Code point 0x%"UVXf" is not Unicode, and not portable", uv);
        /* So won't warn twice; we have to fall through into handle_super in
         * case supers are disallowed */
        flags &= ~UNICODE_WARN_SUPER;
    }

    if (flags & UNICODE_DISALLOW_ABOVE_31_BIT) {
        return NULL;
    }

  handle_super:
    if (flags & UNICODE_WARN_SUPER) {
        Perl_ck_warner_d(aTHX_ packWARN(WARN_NON_UNICODE),
            "Code point 0x%04"UVXf" is not Unicode, may not be portable", uv);
    }

    if (flags & UNICODE_DISALLOW_SUPER) {
        return NULL;
    }
    return d;

  handle_surrogate:
    if (flags & UNICODE_WARN_SURROGATE) {
        Perl_ck_warner_d(aTHX_ packWARN(WARN_SURROGATE),
                                    "UTF-16 surrogate U+%04"UVXf, uv);
    }
    if (flags & UNICODE_DISALLOW_SURROGATE) {
        return NULL;
    }
    return d;

  handle_nonchar:
    if (flags & UNICODE_WARN_NONCHAR) {
        Perl_ck_warner_d(aTHX_ packWARN(WARN_NONCHAR),
         "Unicode non-character U+%04"UVXf" is not recommended for open interchange",
         uv);
    }
    if (flags & UNICODE_DISALLOW_NONCHAR) {
        return NULL;
    }
    return d;
}

utf8.c: Extract some code into macros

This is in preparation for a future commit, where they will be used in
more than one place.

utf8.c: White-space only

Most of these white-space only changes are outdenting, in preparation
for a later commit

utf8.c: Add some UNLIKELY()

utf8.c: Reorder some tests

When converting a code point to its UTF-8 form, the code point is
checked to be sure it is not problematic.  It turns out that all
problematic code points require at least 3 UTF-8 bytes to represent.
Already, the single-byte characters are handled before the problematic
tests.  This changes so that the two-byte ones are as well.  Further, it
now uses the canned macros that do this which are valid on both EBCDIC
and ASCII systems, instead of the previous code, which was
ASCII-specific.  This means one less test when the input code point is
representable by 2 bytes.

utf8.h: Split UNICODE_IS_NONCHAR() into smaller macros

This defines 2 macros that can be used individually to check for
non-characters when the input range is known to be restricted to not
include every possible one. This is for future commits.
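
A sketch of what such a split can look like. The SKETCH_ macros below are written from the Unicode definition of non-characters (U+FDD0..U+FDEF, plus the last two code points of each plane) and are not copied from utf8.h, so the real definitions may differ in detail.

```c
#include <assert.h>
#include <stdint.h>

/* The 32 contiguous non-characters U+FDD0..U+FDEF. */
#define SKETCH_IS_32_NONCHARS(uv) \
    ((uint32_t)(uv) >= 0xFDD0u && (uint32_t)(uv) <= 0xFDEFu)

/* Code points ending in FFFE or FFFF; no upper bound is checked. */
#define SKETCH_IS_xFFF_E_F(uv) \
    (((uint32_t)(uv) & 0xFFFEu) == 0xFFFEu)

/* The full test, built from the two pieces plus a range check. */
#define SKETCH_IS_NONCHAR(uv)                                   \
    (   SKETCH_IS_32_NONCHARS(uv)                               \
     || (SKETCH_IS_xFFF_E_F(uv) && (uint32_t)(uv) <= 0x10FFFFu))

/* Function wrapper so the combined test is easy to call. */
int is_nonchar_sketch(uint32_t uv) { return SKETCH_IS_NONCHAR(uv); }
```

A caller that already knows its input is, say, above U+FFFF can use the second piece alone and skip the test for the 32 contiguous ones.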

improve my rewrite of installhtml’s dir scan

Add epigraph for 5.22.1-RC4

Perl 5.22.1-RC4 today

perldelta: davem's significant stuff since 5.23.5

Fix t/op/sprintf2.t on VMS

(Passed under harness, but failed under TEST.)

Fix Emacs dir-local variables

The setting for cperl-indent-level (which sets the default indentation step
used by Emacs for Perl code) was missing a dot in the relevant cons pair.
This meant that the value set was the single-element list (4) rather than
the integer 4, so attempting to indent lines made Emacs produce an error
"Wrong type argument: number-or-marker-p, (4)".

regen charclass_invlists.h

This is needed because mktables changed. A porting test did not pick
this up, and so probably should be made to.

mention $? in backticks documentation

Backticks work like system(), in that they use $? for the child exit
code. Mention that so people know to look in $? when their program
fails.

standardize on "lookahead" and "lookaround"

...not the hyphenated form

commit message by rjbs

Remove duplicate line from test

perldelta for 817e3e2c6

and a small typo fix

[perl #123991] report an error if we can't parse the number after -C

Update Math-BigInt-FastCalc to CPAN version 0.38

  [DELTA]

2015-12-02 v0.38 pjacklam
  * Use 'static double', not just 'double' in FastCalc.xs.
  * Move 'Test::More' from 'build_requires' to 'test_requires' in Makefile.PL.

utf8.h, utfebcdic.h: Comments, white-space only

hexfp: test longdblkind directly, instead of doublekind

Also, explain the gating on linux on this test.

(suggested by Aaron Crane)

Porting/Glossary: fix a set of typos

A few descriptions of floating-point formats included the word "big" before
the actual endianness.

perldelta: Add note about earlier \p{} changes

APItest: Add tests for valid_utf8_to_uvchr

t/uni/case.pl: Sort numerics with <=> to get better results

utf8.h: Remove a branch in macro for Unicode surrogates

By masking, this macro can be written so it only has one branch.
Presumably an optimizing compiler would do the same, but not necessarily
so.
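
The trick relies on the surrogates being exactly the 2048 code points whose bits above the low 11 equal those of 0xD800. A sketch, with hypothetical function names, comparing the two forms:

```c
#include <assert.h>
#include <stdint.h>

/* Two comparisons, hence potentially two branches. */
static int is_surrogate_range(uint32_t uv) {
    return uv >= 0xD800 && uv <= 0xDFFF;
}

/* One mask and one comparison: clear the low 11 bits and compare. */
static int is_surrogate_masked(uint32_t uv) {
    return (uv & 0xFFFFF800u) == 0xD800u;
}

/* Count disagreements over all of Unicode plus a margin above it. */
unsigned count_surrogate_mismatches(void) {
    unsigned mismatches = 0;
    uint32_t uv;
    for (uv = 0; uv <= 0x120000; uv++) {
        if (is_surrogate_range(uv) != is_surrogate_masked(uv)) {
            mismatches++;
        }
    }
    return mismatches;
}
```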

utf8.h: Add some casts in macros, for safety

This also renames the macro formal parameter to uv to be clearer as to
what is expected as input, and there were cases where it was referred to
inside the macro without being parenthesized, which was dangerous.
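
Why the missing parentheses matter can be shown with a made-up macro (neither macro below is from utf8.h). The parameter is an expression, and without parentheses operator precedence rearranges it.

```c
#include <assert.h>

/* Hypothetical macros: bits contributed by n continuation bytes. */
#define BAD_CONT_BITS(n)  (n * 6)                   /* parameter not parenthesized */
#define GOOD_CONT_BITS(n) (((unsigned)(n)) * 6u)    /* parenthesized and cast */

/* Expands to (len - 1 * 6), i.e. len - 6. */
int bad_cont_bits(int len) { return BAD_CONT_BITS(len - 1); }

/* Expands to (((unsigned)(len - 1)) * 6u), as intended. */
unsigned good_cont_bits(int len) { return GOOD_CONT_BITS(len - 1); }
```

For a 3-byte sequence the bad form yields -3 where 12 was meant.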

Merge commit for removing EBCDIC special code

It's better to use the same code and definitions whenever possible for
different platforms, for ease of maintenance. This set of commits
eliminates several cases of differing EBCDIC/ASCII definitions, so that
there is less separate code to maintain.

utf8.h: Combine EBCDIC and ASCII macros

Previous commits have set things up so the macros are the same on both
platforms. By moving them to the common part of utf8.h, they can share
the same definition. The difference listing shows instead other things
being moved due to the size of this move in comparison with those things
that really stayed the same.

utf8.h: Refactor macro definition

This changes to use the underlying UTF-8 structure to compute the
numbers in this macro, instead of hand-specifying the resultant ones.
Thus, this macro, with minor tweaks, is the same text on both ASCII and
EBCDIC platforms (though the resultant numbers differ), and the next
commit will change them to use it in common.

utf8.h: Combine EBCDIC and ASCII macros

The previous commits have made these macros be the exact same text, so
can be combined, and defined just once. This requires moving them to
the portion of the file that is common with both EBCDIC and ASCII.

The commit diff shows instead other code being moved.

utf8.h: Refactor a macro

This new definition expands to the same thing as before, but now the
unexpanded text is identical to the EBCDIC definition (which expands to
something else), so the next commit can combine the ASCII and EBCDIC
ones into a single definition.

utf8.h: Use common macro to avoid repeating

This refactors two macros that have mostly the same guts to have a third
macro to define the common guts.

It also changes to use UV_IS_QUAD instead of a home-grown #define that a
future commit will remove.

utf8.h: Move #define within file

This makes 2 related definitions adjacent.

utf8.h: Combine EBCDIC and ASCII #defines

Change to use the same definition for two macros on both types of
platforms, simplifying the code, by using the underlying structure of
the encoding.

utf8.h: Move #define to earlier in the file

And use its mnemonic in other #defines instead of repeating the raw
value.

utf8.h, et al.: Clean up some casts

By making sure the no-op macros cast the output appropriately, we can
eliminate the casts that have been added in things that call them.
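
A minimal sketch of the idea, using a hypothetical macro name rather than
any actual utf8.h definition: if a macro that does nothing on a given
platform still casts its result to the narrow type, callers get the same
type everywhere and need no cast of their own.

```c
#include <stdint.h>

/* Hypothetical no-op translation macro: it changes nothing about the
 * value on this platform, but it still yields a uint8_t. */
#define DEMO_NATIVE_TO_LATIN1(ch)  ((uint8_t) (ch))

/* The caller can store the result directly, with no cast of its own. */
static uint8_t demo_store_byte(int ch)
{
    uint8_t byte = DEMO_NATIVE_TO_LATIN1(ch);
    return byte;
}
```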

utf8.h: Combine ASCII and EBCDIC defines into one

By using a more fundamental value, these two definitions of the macro
can be made the same, so only one is needed, common to both platforms.

utfebcdic.h: Use an internal macro to avoid repeating

This creates a macro that is used in portions of 2 other macros, thus
removing repetition.

utf8.h, utfebcdic.h: Fix-up UTF8_MAXBYTES_CASE defn

The definition had gotten moved away from its comments in utf8.h, and
the wrong thing (UTF8_MAXBYTES instead) was being guarded by a #error.
It is also possible to generalize the definition so that the compiler
does the calculation, and to consolidate the definitions from the two
files into a single one.

amigaos4: use raise() instead of kill() on ourselves

Using kill() on the same task that called kill() circumvents
Perl's signal handlers, but raise() doesn't, so use that instead.
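
For illustration, a minimal standard C sketch of a process signalling
itself with raise(). It is not the amigaos4 code from this commit; the
function and variable names are made up, and SIGTERM is chosen only
because it is available in standard C.

```c
#include <signal.h>

/* Set by the handler so the caller can see that it ran. */
static volatile sig_atomic_t demo_handler_ran = 0;

static void demo_handler(int sig)
{
    (void) sig;
    demo_handler_ran = 1;
}

/* Install a handler, signal ourselves with raise(), and report whether
 * the handler was invoked. Returns 1 if it ran, 0 if it did not, and
 * -1 if the handler could not be installed or raise() failed. */
static int demo_raise_runs_handler(void)
{
    demo_handler_ran = 0;
    if (signal(SIGTERM, demo_handler) == SIG_ERR)
        return -1;
    if (raise(SIGTERM) != 0)
        return -1;
    return demo_handler_ran ? 1 : 0;
}
```

In standard C, raise() does not return until the installed handler has
returned, so the flag can be checked immediately afterwards.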

hexfp: Use Perl_fp_class_nzero unconditionally.

Otherwise in platforms with Perl_fp_class_nzero there would be no
return for the x != 0.0 case.

hexfp: these should be tested only with uselongdouble.

Have more fallbacks for our signbit() emulation.

These help in systems which do not have signbit(), or fail to find one,
or which explicitly disable it.

The idea for the fallback implementation came from Craig Berry.
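
As an illustration of what a signbit() fallback can look like, here is
one common approach. It is only a sketch and is not necessarily the
fallback added by this commit; the name demo_signbit is hypothetical,
and the code assumes IEEE 754 doubles, where negative zero has a
different bit pattern from positive zero.

```c
#include <string.h>

/* Fallback sign test for platforms without a usable signbit().
 * Returns 1 if x is negative (including negative zero), else 0.
 * The sign of a NaN is not reported by this sketch. */
static int demo_signbit(double x)
{
    const double pos_zero = 0.0;

    if (x != x)                 /* NaN compares unequal to itself */
        return 0;
    if (x != 0.0)               /* ordinary values and infinities */
        return x < 0.0;
    /* x is a zero: -0.0 == 0.0 compares true, so look at the bytes. */
    return memcmp(&x, &pos_zero, sizeof x) != 0;
}
```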

[PATCH] Bump Locale-Codes from 3.36 to 3.37

Signed-off-by: Chris 'BinGOs' Williams <chris@bingosnet.co.uk>

Update Unicode-Normalize to CPAN version 1.24

  [DELTA]

1.24  Sun Nov 29 05:48:44 UTC 2015
    - Updated to use most recent GNU license file.
      ( https://rt.cpan.org/Public/Bug/Display.html?id=108003 )
    - Silence compiler warning message
      ( https://rt.cpan.org/Public/Bug/Display.html?id=109577 )
    - Add kwalitee suggested changes.

Update PathTools to CPAN version 3.60

perlxs.pod: clarify PROTOTYPES: behaviour.

The default is to disable rather than enable.

Also mention the "Please specify prototyping behavior for Foo.xs"
warning.

threads.t: make stress test less stressy

Test 10 creates 100 threads that do 'require IO'. This can use a lot
of memory and other resources. Reduce it to 10.

Undo blead customization of Text-ParseWords test script

The customization simply changed DOS EOLs to UNIX EOLs, dating from a time
when the intention was to get all files in blead into UNIX EOL format.
Since then, however, many more files have crept in with DOS EOLs. For
example, many files under cpan/Pod-Checker, cpan/Pod-Parser and
cpan/Pod-Usage have DOS EOLs in my Git workspace (on Windows) and in the
most recent perl release tarballs: 5.22.1-RC3 (made from Windows) and
5.23.5 (not made from Windows AFAIK). They clearly do no harm, so there
is no point in trying to make all files have UNIX EOLs and keep them
that way, and therefore no point in this customization.

The GitHub PR that was referenced in Porting/Maintainers.pl has already
been closed (not merged).

There are no changes to ParseWords.t here other than the EOLs.

Add the -Wthread-safety also only for clang 3.6 (6.1) or later.

(follow-up to bdc795f4, suggested by Aaron Crane)

Config-Perl-V bump in Maintainers.pl

Module-CoreList bump in Maintainers.pl

fix the API description of SvLEN_set()

RT #126245

Make it clearer that it is the buffer length being specified, not the
string length. Also, change the 'See "SvIV_set"' to SvLEN. The old
reference appears to be a cut and paste error.

Based on suggested wording from jazzkutya@gmail.com

Ensure 'q' works in debugger with RemotePort.

Patch submitted by Shlomi Fish.

For: RT #126735

Add epigraph for 5.22.1-RC3

Perl 5.22.1-RC3 today

Revert "Revert "Module::CoreList updates for 5.22.1""

This reverts commit 85e4652903c8054317fceac9960608e261acb7f5...
... with some manual changes to now place 5.022001 before 5.023006 instead
of before 5.023005 since 5.023006 is now the impending blead release. Also,
set a tentative 5.22.1 final $VERSION/date of Sun 6th.

Complete some unfinished Module::CoreList work from commit 7c294235c2

Bump the TSA clang minimum to 3.6 (in Applese 6.1)

Since it looks like the 3.5 (6.0) in OS X 9 didn't recognize the annotations.

For TSA we want 4 or later clang, not just later.

(Noticed by Aaron Crane.)

More notes on OS X compiler versions.

/..\G/: use chars, not bytes

In something like /..\G/, the engine should start trying to match two
chars before pos(). It was actually trying to match two bytes before.
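
To show the difference between the two, here is a small sketch of
stepping backwards through a UTF-8 buffer by characters rather than by
bytes. The helper names are hypothetical and this is not the regex
engine's own code; it assumes well-formed UTF-8, in which continuation
bytes have the bit pattern 10xxxxxx.

```c
#include <stddef.h>

/* A UTF-8 continuation byte has the bit pattern 10xxxxxx. */
static int demo_is_continuation(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}

/* Step back `nchars` characters from byte offset `pos`, stopping at the
 * start of the buffer. Returns the new byte offset. */
static size_t demo_hop_back_chars(const unsigned char *s, size_t pos,
                                  size_t nchars)
{
    while (nchars > 0 && pos > 0) {
        pos--;                              /* last byte of previous char */
        while (pos > 0 && demo_is_continuation(s[pos]))
            pos--;                          /* skip back to its start byte */
        nchars--;
    }
    return pos;
}

/* The byte-based step, for comparison: it can land mid-character. */
static size_t demo_hop_back_bytes(size_t pos, size_t nbytes)
{
    return nbytes > pos ? 0 : pos - nbytes;
}
```

For a buffer holding "a", U+00E9 and U+20AC (1 + 2 + 3 bytes), stepping
back two characters from the end lands on the start of U+00E9, whereas
stepping back two bytes lands inside U+20AC.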

markstack_grow(): fix debugging stuff

This is a follow-on to commit ac07059afc75:

FOOMARK debugging macros: fix %d cast; only -Dsv

which missed fixing up the debugging statement in markstack_grow().

rpeep(): silence compiler warning

op.c: In function ‘Perl_rpeep’:
op.c:13666:35: warning: comparison is always false due to limited range of data
type [-Wtype-limits]

This condition is always false if, for example, base is 32 bit and UVs
are 64 bit:

    base > (UV_MAX >> (OPpPADRANGE_COUNTSHIFT+SAVE_TIGHT_SHIFT))

Silence the warning by replacing base with a constant-folded conditional:

    (cond ? base : 0) > ....

where cond is false if sizeof(base) is small.
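
A self-contained sketch of the same pattern, outside of op.c. The
names and the shift value are hypothetical stand-ins, with uint32_t
playing the part of base and uint64_t the part of a 64-bit UV; this is
not the actual code in Perl_rpeep.

```c
#include <stdint.h>

/* Hypothetical shift, standing in for the sum of the two shifts. */
#define DEMO_SHIFT 13

/* Returns nonzero if base exceeds the limit. When base is narrower than
 * 64 bits it can never exceed the limit, so the conditional folds to the
 * constant 0 at compile time instead of comparing base directly. */
static int demo_base_too_large(uint32_t base)
{
    return (sizeof(base) >= sizeof(uint64_t) ? (uint64_t) base : 0)
           > (UINT64_MAX >> DEMO_SHIFT);
}
```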

Configure: unbreak -S option now that -O is the default

As far as I can tell, using the -S and -O options together has always
yielded an error of this form:

Configure: 2042: .: Can't open ./optdef.sh

That's because, even though optdef.sh is created in the UU directory, and
most of Configure is run in that directory, part of the -S implementation is
run in the root directory, and was therefore trying to read ./optdef.sh
instead of ./UU/optdef.sh.

As of 41d73075f0801c26794dadb1ff690f305d7e53a7, the -O mode is always
enabled, so the -S option has been broken since then. This fixes that.

move Win32's $^X code to where all other OSes' $^X code lives

Back when the code in perllib.c was first added in 1999, in
commit 80252599d4, the large define-tree function that today in 2015 is
Perl_set_caret_X was an unremarkable single statement:
http://perl5.git.perl.org/perl.git/blob/80252599d4b7fb26eec4e3a0f451b4387c5dcc19:/perl.c#l2658

Over the years Perl_set_caret_X grew and grew with OS specific code. Move
the Win32 $^X code to match how all the other OSes do it. Fix a problem
where full perl's $^X is always absolute because perl5**.dll uses
GetModuleFileNameW in perllib.c, but miniperl's $^X is always a relative
path because it's coming from libc/command prompt/make tool/make_ext.pl.
Win32 miniperl's $^X being relative causes inefficiencies in EUMM as a
relative $^X is wrong the moment chdir executes in any perl process.
EUMM contains code to search PATH and some other places to guess/figure out
the absolute path to the current perl to write the absolute perl path
into the makefile. By making $^X absolute on all Win32 perl build variants,
this find-absolute-perl-path code won't execute in EUMM. It also harmonizes
behavior with other OSes and between Win32 mini and full perl. See details
in RT ticket for this patch.

Perl_set_caret_X gv_fetch with GV_ADD can't return NULL

The GV will be created if it doesn't exist. Remove the branch for smaller
code size.

Perl_magic_set(): remove unused var 's'

This var is (mostly) unused, but is set in a couple of places, hence:

mg.c:2657:17: warning: variable ‘s’ set but not used

In the one place it is used, declare it in a narrower scope.