perl5.git.perl.org Git - perl5.git/log

This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5

Karl Williamson [Fri, 5 Jan 2018 18:35:00 +0000 (11:35 -0700)]

Change some "shouldn't happen" failures into panics

If the system is so broken that these libc calls are failing, soldiering
on won't lead to sane results.

This rewords some existing panics, and adds the errno to the output for
all of them.

commit | commitdiff | tree

Karl Williamson [Sun, 28 Jan 2018 21:55:31 +0000 (14:55 -0700)]

Forbid 'pig' locale

This is a toy locale found on some systems.  It isn't fully
implemented, and trying to switch to it can cause failures.

commit | commitdiff | tree

Karl Williamson [Thu, 18 Jan 2018 23:20:02 +0000 (16:20 -0700)]

Avoid changing locale when finding radix char

On systems that have the POSIX 2008 operations, including
nl_langinfo_l(), this commit causes them to not have to actually change
the locale when determining what the decimal point character is.

The locale may have to change during the printing/reading of numbers,
but eventually we can use sprintf_l(), if available, to avoid that too.
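
As a rough illustration of the idea (not the code in locale.c), the
sketch below asks a locale object for its radix string through the POSIX
2008 calls newlocale() and nl_langinfo_l(), so setlocale() is never
called.  The function name radix_of_locale() is made up for this
example, and a POSIX 2008 libc is assumed.

```c
#define _XOPEN_SOURCE 700   /* request the POSIX 2008 interfaces */
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy the decimal-point string of the named locale
 * into buf without touching the global locale.
 * Returns 0 on success, -1 if the locale object cannot be created. */
int radix_of_locale(const char *name, char *buf, size_t bufsize)
{
    locale_t loc;

    assert(name != NULL && buf != NULL && bufsize > 0);

    loc = newlocale(LC_NUMERIC_MASK, name, (locale_t) 0);
    if (loc == (locale_t) 0)
        return -1;

    /* Query the locale object directly; no setlocale() call is made */
    snprintf(buf, bufsize, "%s", nl_langinfo_l(RADIXCHAR, loc));

    freelocale(loc);
    return 0;
}
```

The caller passes a locale name such as "C" and gets the radix back in
its own buffer, which also avoids relying on the static storage that
nl_langinfo_l() returns.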

commit | commitdiff | tree

Karl Williamson [Thu, 18 Jan 2018 22:56:33 +0000 (15:56 -0700)]

Perl_sv_2pv_flags: Potentially avoid work

By using a macro that is private to the core, this code can avoid
thinking it has to deal with a non-dot radix character, as even if we
are using the locale radix, that is often a dot.

commit | commitdiff | tree

Karl Williamson [Thu, 18 Jan 2018 22:53:42 +0000 (15:53 -0700)]

numeric.c: Remove duplicate PERL_ARGS_ASSERT

By moving the call to one instance of this macro, the other can be
removed.

commit | commitdiff | tree

Karl Williamson [Thu, 18 Jan 2018 22:52:50 +0000 (15:52 -0700)]

locale.c: White-space only

Outdent to compensate for previous patch removing several blocks

commit | commitdiff | tree

Karl Williamson [Thu, 18 Jan 2018 22:45:19 +0000 (15:45 -0700)]

Keep PL_numeric_radix_sv always set

Previously this was removed if the radix was dot. By keeping it set to
a dot, we simplify some code, removing some branches.

commit | commitdiff | tree

Karl Williamson [Thu, 18 Jan 2018 22:32:45 +0000 (15:32 -0700)]

locale.c: Replace by function that does the same thing

This logic occurs often enough that a function has been created to do
it. So use that.

commit | commitdiff | tree

Karl Williamson [Wed, 17 Jan 2018 19:40:40 +0000 (12:40 -0700)]

POSIX::localeconv() Use new fcn; avoid recalcs

This calls strlen() once, instead of passing 0 to the subsidiary
functions which causes them to call it each time. It also uses the new
function is_utf8_non_invariant_string() instead of doing here what that
function does.
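
The following is a simplified stand-in for the kind of test described
here, not perl's is_utf8_non_invariant_string().  It assumes strict
UTF-8 (no overlongs, surrogates, or code points above U+10FFFF), and the
names utf8_non_invariant() and c_string_needs_utf8_flag() are invented
for the sketch.  The length is computed once by the caller and passed
down.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if the len bytes at s are well-formed UTF-8 AND at least one
 * of them is non-ASCII; return 0 otherwise. */
int utf8_non_invariant(const unsigned char *s, size_t len)
{
    size_t i = 0;
    int saw_variant = 0;

    assert(s != NULL);
    while (i < len) {
        unsigned char b = s[i];
        unsigned char lo = 0x80, hi = 0xBF;  /* range of 1st continuation */
        size_t need;                         /* continuation bytes expected */
        size_t k;

        if (b < 0x80) {                      /* ASCII: invariant */
            i++;
            continue;
        }
        else if (b >= 0xC2 && b <= 0xDF) {
            need = 1;
        }
        else if (b >= 0xE0 && b <= 0xEF) {
            need = 2;
            if (b == 0xE0) lo = 0xA0;        /* exclude overlongs */
            if (b == 0xED) hi = 0x9F;        /* exclude surrogates */
        }
        else if (b >= 0xF0 && b <= 0xF4) {
            need = 3;
            if (b == 0xF0) lo = 0x90;        /* exclude overlongs */
            if (b == 0xF4) hi = 0x8F;        /* stay at or below U+10FFFF */
        }
        else {
            return 0;                        /* not a legal start byte */
        }

        if (len - i - 1 < need)
            return 0;                        /* truncated sequence */
        if (s[i + 1] < lo || s[i + 1] > hi)
            return 0;
        for (k = 2; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;
        }
        saw_variant = 1;
        i += need + 1;
    }
    return saw_variant;
}

/* The caller measures the C string once and hands the length down */
int c_string_needs_utf8_flag(const char *str)
{
    size_t len = strlen(str);

    return utf8_non_invariant((const unsigned char *) str, len);
}
```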

commit | commitdiff | tree

Karl Williamson [Wed, 17 Jan 2018 18:16:15 +0000 (11:16 -0700)]

POSIX.xs: White space only

Vertically align for readability

commit | commitdiff | tree

Karl Williamson [Wed, 17 Jan 2018 18:07:42 +0000 (11:07 -0700)]

locale.c: Move some mutex ops

A future commit will add a mutex, and create the convention that when
this mutex is used in combination with the new one, it is always
acquired after the new one is held, in order to prevent the possibility
of deadlock.  Do it now, before the new one gets added.

This also adds some comments about the reason for this mutex.
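
A minimal sketch of such a lock-ordering convention, using POSIX
threads.  The mutex names are hypothetical stand-ins ('outer_mutex' for
the one to be added, 'inner_mutex' for the existing one); perl's own
macros are not shown.  Because every code path takes the two in the same
order, no two threads can each hold one while waiting for the other.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical names, for illustration only */
static pthread_mutex_t outer_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t inner_mutex = PTHREAD_MUTEX_INITIALIZER;
static int shared_counter = 0;

/* Every path that needs both mutexes takes outer first, then inner,
 * and releases them in the reverse order. */
int bump_shared_counter(void)
{
    int value;
    int rc;

    rc = pthread_mutex_lock(&outer_mutex);
    assert(rc == 0);
    rc = pthread_mutex_lock(&inner_mutex);
    assert(rc == 0);

    value = ++shared_counter;

    rc = pthread_mutex_unlock(&inner_mutex);
    assert(rc == 0);
    rc = pthread_mutex_unlock(&outer_mutex);
    assert(rc == 0);
    (void) rc;          /* quiet unused warnings when NDEBUG is defined */

    return value;
}
```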

commit | commitdiff | tree

Karl Williamson [Wed, 17 Jan 2018 17:12:30 +0000 (10:12 -0700)]

locale.c: White-space only

Indent code to account for previous commits adding some blocks

commit | commitdiff | tree

Karl Williamson [Wed, 17 Jan 2018 15:27:04 +0000 (08:27 -0700)]

locale.c: Use macro instead of its expansion

This macro in a future commit will become more complex.

commit | commitdiff | tree

Karl Williamson [Wed, 17 Jan 2018 05:09:27 +0000 (22:09 -0700)]

locale.c: Do common task in one place

This function in some cases may need to temporarily switch the
LC_NUMERIC code. Instead of repeating the logic to determine if this is
needed, do it once.

commit | commitdiff | tree

Karl Williamson [Wed, 17 Jan 2018 00:38:45 +0000 (17:38 -0700)]

POSIX.xs: Keep locale change to minimum span

Move the restore to as close to the save as possible so that the locale
is in an unstable state for as short a time as possible.
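
The same idea in plain C, as a sketch rather than the POSIX.xs code: the
locale is saved, switched, used for exactly one call, and restored on
the very next line.  format_with_dot_radix() is a name invented for this
example.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Format value with a dot radix whatever LC_NUMERIC currently is.
 * Returns 0 on success, -1 on failure. */
int format_with_dot_radix(double value, char *buf, size_t bufsize)
{
    const char *current = setlocale(LC_NUMERIC, NULL);
    char *saved;
    int n;

    assert(buf != NULL && bufsize > 0);
    if (current == NULL)
        return -1;

    /* Copy the name: later setlocale() calls may overwrite the string */
    saved = malloc(strlen(current) + 1);
    if (saved == NULL)
        return -1;
    strcpy(saved, current);

    setlocale(LC_NUMERIC, "C");                 /* change ...             */
    n = snprintf(buf, bufsize, "%.2f", value);  /* ... do the one thing   */
    setlocale(LC_NUMERIC, saved);               /* ... restore right away */

    free(saved);
    return (n >= 0 && (size_t) n < bufsize) ? 0 : -1;
}
```

Error checking and cleanup happen only after the restore, so nothing
else runs while the locale is switched.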

commit | commitdiff | tree

Karl Williamson [Wed, 17 Jan 2018 20:24:46 +0000 (13:24 -0700)]

POSIX::strftime: Add better fallback about UTF-8

If the function returns a valid string that isn't completely UTF-8
invariant, the function assumes it is UTF-8 if we are in a UTF-8 locale.
This works, but in the unlikely event that the system has no LC_TIME, we
can't tell if it is in a UTF-8 locale. As a better fallback position,
this commit adds the check that there is just a single script in the
time string, adding a measure of reassurance that our call that it is
UTF-8 is correct.

This is unlikely to be used, but now that there is a function to call
that determines if this is a script run, it's easy to add, and unlikely
to actually get compiled.

commit | commitdiff | tree

Karl Williamson [Wed, 17 Jan 2018 20:18:50 +0000 (13:18 -0700)]

grok_numeric_radix(): Avoid recalculating

This function just determined that we are in the scope of 'use locale',
hence the underlying radix character should be used. This commit
changes to use the macro that directly does that; previously the macro
that redundantly looks at if we are in the scope was used.

commit | commitdiff | tree

Karl Williamson [Wed, 17 Jan 2018 20:00:44 +0000 (13:00 -0700)]

sv_vcatpvfn_flags() Balance LC_NUMERIC changes/restores

Prior to this commit, the restore for LC_NUMERIC was getting called even
if there were no corresponding store. Change so they are balanced; a
future commit will require this.

commit | commitdiff | tree

Karl Williamson [Mon, 15 Jan 2018 23:39:44 +0000 (16:39 -0700)]

perl.h: Remove some obsolete macros

These no longer make sense, and were for core internal use only

commit | commitdiff | tree

Karl Williamson [Mon, 15 Jan 2018 22:56:43 +0000 (15:56 -0700)]

vutil.c: White-space only

Properly indent a block, and add spaces where C++11 deprecates not
having them

commit | commitdiff | tree

Karl Williamson [Mon, 15 Jan 2018 22:48:57 +0000 (15:48 -0700)]

Simplify some LC_NUMERIC macros

These macros are marked as subject to change and are not documented
externally. I don't know what I was thinking when I named some of them,
but whatever no longer makes sense to me. Simplify them, and change so
there is only one restore macro to remember.

commit | commitdiff | tree

Karl Williamson [Mon, 15 Jan 2018 04:43:43 +0000 (21:43 -0700)]

toke.c: Remove unnecessary macro calls

These macros were to shift the LC_NUMERIC state into using a dot for the
radix character. When I wrote this code, I assumed that parsing should
be using just the dot. Since then, I have discovered that this wraps
other uses where the dot is not correct, so remove it.

commit | commitdiff | tree

Karl Williamson [Mon, 15 Jan 2018 04:37:16 +0000 (21:37 -0700)]

perl.h: Remove unused locale core macro

This undocumented macro is unused in the core, and all these are
commented that they are subject to change. And it confuses things, so
just remove it.

commit | commitdiff | tree

Karl Williamson [Sun, 7 Jan 2018 22:30:06 +0000 (15:30 -0700)]

locale.c: Windows will never be EBCDIC

This adjusts the conditional compilation so that win32 is a subset of
non-EBCDIC. This will be useful in the next commit.

commit | commitdiff | tree

Karl Williamson [Fri, 5 Jan 2018 19:57:37 +0000 (12:57 -0700)]

locale.c: Simplify expression

Since this is operating on C strings, we don't have to check the
lengths, but can rely on the underlying functions to work.

commit | commitdiff | tree

Karl Williamson [Tue, 2 Jan 2018 23:54:28 +0000 (16:54 -0700)]

Cache locale UTF8-ness lookups

Some locales are UTF-8, some are not.  Knowledge of this is needed in
various circumstances.  This commit saves the results of the last
several lookups so they don't have to be recalculated each time.

The full generality of POSIX locales is such that you can have error
messages be displayed in one locale, say Spanish, while other things are
in French.  To accommodate this generality, the program can loop through
all the locale categories finding the UTF8ness of the locale it points
to.  However, in almost all instances, people are going to be in either
French or in Spanish, and not in some combination.  Suppose it is a
French UTF-8 locale for all categories.  This new cache will know that
the French locale is UTF-8, and the queries for all but the first
category can return that immediately.

This simple cache avoids the overhead of hashes.

This also fixes a bug I realized exists in threaded perls, but haven't
reproduced.  We do not support locales in such perls, and the user must
not change the locale or 'use locale'.  But perl itself could change the
locale behind the scenes, leading to segfaults or incorrect results.
One such instance is the determination of UTF8ness.  But this only could
happen if the full generality of locales is used so that the categories
are not all in the same locale.  This could only happen (if the user
doesn't change locales) if the environment is such that the perl program
is started up so that the categories are in such a state.  This commit
fixes this potential bug by caching the UTF8ness of each category at
startup, before any threads are instantiated, and so checking for it
later just looks it up in the cache, without perl changing the locale.
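
A small cache of this general kind can be sketched as below.  This is an
assumption-laden illustration, not the structure used in locale.c: the
slot count, the struct, and the function names are invented, and
slow_is_utf8_locale() is a deliberately naive stand-in for the real,
expensive determination.  Lookups are a linear scan over a few slots,
with the most recent hit moved to the front.

```c
#include <assert.h>
#include <string.h>

#define CACHE_SLOTS     4       /* hypothetical size: "the last several" */
#define LOCALE_NAME_MAX 63

struct utf8ness_entry {
    char name[LOCALE_NAME_MAX + 1];
    int  is_utf8;
    int  used;
};

static struct utf8ness_entry cache[CACHE_SLOTS];
static int slow_lookups = 0;    /* counts trips down the slow path */

/* Stand-in for the expensive check; it only looks at the name */
static int slow_is_utf8_locale(const char *name)
{
    slow_lookups++;
    return strstr(name, "UTF-8") != NULL || strstr(name, "utf8") != NULL;
}

int cached_is_utf8_locale(const char *name)
{
    int i;

    assert(name != NULL);

    /* Linear scan: with so few slots this is cheaper than hashing */
    for (i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].used && strcmp(cache[i].name, name) == 0) {
            struct utf8ness_entry hit = cache[i];

            /* Move the hit to the front so hot entries stay cached */
            memmove(&cache[1], &cache[0], (size_t) i * sizeof cache[0]);
            cache[0] = hit;
            return hit.is_utf8;
        }
    }

    if (strlen(name) > LOCALE_NAME_MAX)
        return slow_is_utf8_locale(name);   /* too long to cache */

    /* Miss: push everything down one slot, dropping the oldest entry */
    memmove(&cache[1], &cache[0], (CACHE_SLOTS - 1) * sizeof cache[0]);
    strcpy(cache[0].name, name);
    cache[0].is_utf8 = slow_is_utf8_locale(name);
    cache[0].used = 1;
    return cache[0].is_utf8;
}
```

When every category names the same locale, only the first query takes
the slow path; the rest are answered from slot 0.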

commit | commitdiff | tree

Karl Williamson [Tue, 2 Jan 2018 21:23:24 +0000 (14:23 -0700)]

locale.c: Avoid duplicate work

As the comments say, the needed value is already readily available

commit | commitdiff | tree

Karl Williamson [Tue, 2 Jan 2018 20:38:16 +0000 (13:38 -0700)]

locale.c: Avoid some work

We've already worked out whether the decimal point is a dot or not. We
can pass that information to the called routine so it doesn't have to
figure it out again.

commit | commitdiff | tree

Karl Williamson [Tue, 2 Jan 2018 20:19:03 +0000 (13:19 -0700)]

locale.c: Use non-control for a format dummy

We need a plain character here. I used a '\e' before, but it would be
better to have something that isn't a control, so just change it to a
blank

commit | commitdiff | tree

Karl Williamson [Thu, 25 Jan 2018 18:28:54 +0000 (11:28 -0700)]

locale.c: Create a block around some code; indent

Under some configurations, depending on platform and Configure options,
these declarations are not at the beginning of a block, violating C89
language rules.

commit | commitdiff | tree

Karl Williamson [Tue, 2 Jan 2018 19:25:35 +0000 (12:25 -0700)]

locale.c: Avoid some more locale changes

In a few places here we can test if we are already in the locale we want
to be in, and not switch unnecessarily if so.

commit | commitdiff | tree

Karl Williamson [Tue, 2 Jan 2018 06:03:34 +0000 (23:03 -0700)]

Avoid some unnecessary changing of locales

The LC_NUMERIC locale category is kept so that generally the decimal
point (radix) is a dot. For some (mostly) output purposes, it needs to
be swapped into the program's current underlying locale so that a
non-dot can be printed.

This commit changes things so that if the current underlying locale uses
a dot as its decimal point, the swap doesn't happen, as it's not needed.
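
The test involved can be pictured with standard C alone.  This sketch
simply asks localeconv() about the locale currently in effect; perl
itself tracks the underlying locale separately, so treat the function
name radix_swap_needed() and the approach as illustrative assumptions.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Return 1 if the current LC_NUMERIC radix is something other than ".",
 * i.e. if swapping locales could actually change what gets printed. */
int radix_swap_needed(void)
{
    const struct lconv *lc = localeconv();

    assert(lc != NULL);
    return strcmp(lc->decimal_point, ".") != 0;
}
```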

commit | commitdiff | tree

Karl Williamson [Wed, 26 Jul 2017 14:59:33 +0000 (08:59 -0600)]

Teach perl about more locale categories

glibc has various other categories than the ones perl handles, for
example LC_PAPER.  This commit adds knowledge of these to perl, so that
one can set them, interrogate them, and have libraries work on them,
even though perl itself does not.

This is in preparation for future commits, where it becomes more
important than currently for perl to know about all the locale
categories on the system.

I looked through various other systems to try to find other categories,
but did not see any.  If a system does have such a category, it is
pretty easy to tell perl about it, and recompile.  Use the changes in
this commit as a template, and send an email to perlbug@perl.org, so
that the next Perl release will have it.

commit | commitdiff | tree

Karl Williamson [Tue, 2 Jan 2018 05:20:25 +0000 (22:20 -0700)]

perl.h: White-space only

commit | commitdiff | tree

Karl Williamson [Tue, 2 Jan 2018 03:41:21 +0000 (20:41 -0700)]

locale.c: Add compile check for unimplemented behavior

Instead of silently not working.

commit | commitdiff | tree

Karl Williamson [Tue, 2 Jan 2018 03:30:39 +0000 (20:30 -0700)]

locale.c: White-space only

Indent because the previous commit created an enclosing block, and
add a blank line elsewhere

commit | commitdiff | tree

Karl Williamson [Tue, 2 Jan 2018 03:00:03 +0000 (20:00 -0700)]

locale.c: Refactor Ultrix code

Examination shows that this code does nothing unless LC_ALL is defined.
So explicitly test at compile time for that.

Also, two variables don't have to be declared so globally. Reducing
their scope by creating a new block means we don't have to have
PERL_UNUSED_ARG()s for them

commit | commitdiff | tree

Karl Williamson [Tue, 2 Jan 2018 02:07:19 +0000 (19:07 -0700)]

locale.c: Avoid rescanning a string

We can use a parameter to find out where in the string the portion of
interest starts. Do that to avoid starting again from scratch.

commit | commitdiff | tree

Karl Williamson [Tue, 2 Jan 2018 01:33:59 +0000 (18:33 -0700)]

locale.c: Use fcns instead of macros

Here the macros being used expand into the functions being called,
without adding any value to using the macros, and making things slightly
less clear.

commit | commitdiff | tree

Karl Williamson [Tue, 2 Jan 2018 01:17:41 +0000 (18:17 -0700)]

locale.c: Add const to several variables

commit | commitdiff | tree

Karl Williamson [Tue, 2 Jan 2018 01:15:27 +0000 (18:15 -0700)]

locale.c: Improve, add comments

commit | commitdiff | tree

Karl Williamson [Tue, 2 Jan 2018 01:01:45 +0000 (18:01 -0700)]

perl.h: Add comment, rephrase another

commit | commitdiff | tree

Karl Williamson [Sun, 19 Nov 2017 00:34:25 +0000 (17:34 -0700)]

Perl_langinfo: Teach about YESSTR and NOSTR

These are items that nl_langinfo() used to be required to return, but
are considered obsolete. Nonetheless, this drop-in replacement for that
function should know about them for backward compatibility.

commit | commitdiff | tree

Karl Williamson [Mon, 1 Jan 2018 22:07:45 +0000 (15:07 -0700)]

APItest/t/locale.t: Add some tests

This makes sure that the entries for which the expected return value may
legitimately vary from platform to platform get tested as returning
something, skipping the test if the item isn't known on the platform.

A couple of comments are also added.

commit | commitdiff | tree

Karl Williamson [Thu, 4 Jan 2018 03:41:29 +0000 (20:41 -0700)]

Add check that "$!" is correctly interpreted as UTF-8

We sometimes need to know if an error message is UTF-8 or not.
Previously we checked that it is syntactically valid UTF-8, and that the
LC_MESSAGES locale is UTF-8.  But some systems, notably Windows, do not
have LC_MESSAGES.  For those, this commit adds a different, semantic,
check that the text of the message when interpreted as UTF-8 is all in
the same Unicode script.  This is not foolproof, unlike the LC_MESSAGES
check, but it's better than what we have now for such systems.  It
likely is foolproof for non-Latin locales, as any message will have a
bunch of characters in that locale, and no ASCII Latin ones.  For a
Latin locale, these ASCII letters could be intermixed with the UTF-8
ones, causing potential ambiguity.

commit | commitdiff | tree

Karl Williamson [Wed, 15 Nov 2017 05:27:06 +0000 (22:27 -0700)]

Remove uncompilable code

This code was never compiled because of a misspelling in the #ifdef.
No problem surfaced, so just remove it. The next commit adds a different
check.

commit | commitdiff | tree

Karl Williamson [Tue, 9 Jan 2018 02:08:54 +0000 (19:08 -0700)]

perl.c: Move initialization of inversion lists

This is now done very early in the file, as it may be needed for
initializing the locale handling.

commit | commitdiff | tree

Karl Williamson [Thu, 18 Jan 2018 21:09:24 +0000 (14:09 -0700)]

isSCRIPT_RUN: Document in perlintern

commit | commitdiff | tree

Karl Williamson [Tue, 9 Jan 2018 02:11:52 +0000 (19:11 -0700)]

An empty string is a script_run, but marked INVALID

commit | commitdiff | tree

Karl Williamson [Thu, 18 Jan 2018 21:08:47 +0000 (14:08 -0700)]

isSCRIPT_RUN: A sequence of entirely Inherited chars is Inherited

commit | commitdiff | tree

Karl Williamson [Thu, 18 Jan 2018 21:07:43 +0000 (14:07 -0700)]

regexec.c: Add comment

commit | commitdiff | tree

Karl Williamson [Thu, 18 Jan 2018 21:05:23 +0000 (14:05 -0700)]

Fix bug in isSCRIPT_RUN with digit following unassigned

This was being treated as a run, but shouldn't be one.

commit | commitdiff | tree

Karl Williamson [Thu, 18 Jan 2018 20:00:06 +0000 (13:00 -0700)]

isSCRIPT_RUN: Can short cut if not in UTF-8

All characters representable by single bytes are either Common or Latin,
so must be a script run. If we aren't asking for what the script is we
can return immediately. If we are, the run is Latin if any character in
it is Latin, otherwise is Common.
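
The short cut can be sketched as follows.  This is not the regexec.c
code: the enum, the function names, and the parameter layout are
invented, an ASCII/Latin-1 platform is assumed, and the empty-string
case is ignored.  The byte ranges are the Latin-1 code points whose
Unicode Script property is Latin.

```c
#include <assert.h>
#include <stddef.h>

enum script { SCRIPT_COMMON, SCRIPT_LATIN };   /* hypothetical enum */

/* Latin-1 code points that belong to the Latin script; every other
 * single-byte code point is Common. */
static int is_latin1_latin(unsigned char c)
{
    return (c >= 'A' && c <= 'Z')
        || (c >= 'a' && c <= 'z')
        || c == 0xAA || c == 0xBA       /* ordinal indicators */
        || (c >= 0xC0 && c <= 0xD6)
        || (c >= 0xD8 && c <= 0xF6)
        || c >= 0xF8;
}

/* A non-UTF-8 string is always a script run, so this always returns 1.
 * If ret_script is non-NULL it is set to Latin when any byte is Latin,
 * otherwise to Common. */
int single_byte_script_run(const unsigned char *s, size_t len,
                           enum script *ret_script)
{
    size_t i;

    assert(s != NULL);
    if (ret_script == NULL)
        return 1;               /* caller doesn't care which script */

    *ret_script = SCRIPT_COMMON;
    for (i = 0; i < len; i++) {
        if (is_latin1_latin(s[i])) {
            *ret_script = SCRIPT_LATIN;
            break;
        }
    }
    return 1;
}
```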

commit | commitdiff | tree

Karl Williamson [Sun, 7 Jan 2018 04:16:15 +0000 (21:16 -0700)]

Give isSCRIPT_RUN() an extra parameter

This allows it to return the script of the run.

commit | commitdiff | tree

Karl Williamson [Sat, 6 Jan 2018 23:15:12 +0000 (16:15 -0700)]

charclasslists.h: script enums visible to CORE,EXT

This exposes the enum definitions for the script extensions property to
the perl code and extensions, for use in future commits.

commit | commitdiff | tree

Karl Williamson [Sat, 6 Jan 2018 23:13:06 +0000 (16:13 -0700)]

regen/mk_invlists.pl: Allow override of where enums get defined

This adds code so that the enums defined by this, which are ordinarily
only used by regexec.c, can be specified to be somewhere else instead.

commit | commitdiff | tree

Karl Williamson [Sat, 6 Jan 2018 23:09:57 +0000 (16:09 -0700)]

regen/mk_invlists.pl: Allow multiple files to access

This changes the code so that the symbols defined by this program
can be #define'd in more than one file.

commit | commitdiff | tree

Karl Williamson [Thu, 18 Jan 2018 21:02:33 +0000 (14:02 -0700)]

regexec.c: Fix typo in comment

commit | commitdiff | tree

Karl Williamson [Sat, 6 Jan 2018 23:18:45 +0000 (16:18 -0700)]

Fix bug in script runs that start with Common

This is a follow on to 8535a06fea02528fe726855a139fcbd360d1fc6e. That
fixed one case in which things did not work properly when the first
character was in the Common script. It did not catch the case where a
later character in the string was non-Common from a script that has its
own set of digits, and this commit fixes that.

This just entails moving a block of code to slightly earlier.

commit | commitdiff | tree

Karl Williamson [Thu, 11 Jan 2018 00:10:09 +0000 (17:10 -0700)]

locale.c: Make sure variable is always defined

A future commit assumes this variable is there even on non-DEBUGGING
builds. #define it to 0 for those.

commit | commitdiff | tree

Karl Williamson [Thu, 18 Jan 2018 00:01:00 +0000 (17:01 -0700)]

my_atof(): Lock dot radix

This commit shows some redundant checks.  It examines the text and if it
finds a dot in the middle of the number, and the locale is expecting
something else, it toggles LC_NUMERIC to be the C locale so that the dot
is understood.  However, during further parsing, grok_numeric_radix()
gets called and sees that the locale shouldn't be C, and toggles it
back.  That ordinarily would cause the dot to not be recognized, but
this function always recognizes a dot no matter what the locale.  So
none of our tests fail.  I'm not sure if this is always the case, and I
don't understand this area of the code all that well, but there is a
simple way to cause grok_numeric_radix to not change the locale back,
and that is to call the macro LOCK_LC_NUMERIC_STANDARD() when changing
it the first time in my_atof().  The purpose of this macro is precisely
this situation, so that recursed calls don't try to override the
decisions of the outer calls
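
The locking idea can be modelled with a depth counter.  Everything in
this sketch is hypothetical (the variable and function names are not
perl's, and no real locale is switched); it only shows how an outer
"lock" makes an inner request to switch back a no-op.

```c
#include <assert.h>

/* All names here are hypothetical; this only models the behavior */
static int numeric_is_standard = 1;   /* 1: dot radix (C locale) in effect */
static int standard_lock_depth = 0;   /* >0: inner code may not switch away */

/* Outer caller: force the standard (dot) radix and pin it there */
void lock_numeric_standard(void)
{
    numeric_is_standard = 1;
    standard_lock_depth++;
}

void unlock_numeric_standard(void)
{
    assert(standard_lock_depth > 0);
    standard_lock_depth--;
}

/* Inner code asking for the underlying locale.  The request is ignored
 * while an outer caller holds the lock.  Returns 1 if the switch was
 * made, 0 if it was suppressed. */
int set_numeric_underlying(void)
{
    if (standard_lock_depth > 0)
        return 0;
    numeric_is_standard = 0;
    return 1;
}
```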

commit | commitdiff | tree

Karl Williamson [Wed, 24 Jan 2018 22:57:30 +0000 (15:57 -0700)]

hints/hpux.sh: HP-UX mbrlen() and mbrtowc() don't work

In spite of there being man pages for these, the #include file doesn't
define the mbstate_t type which is required for a parameter to these
functions.

Perhaps the Configure probe could be enhanced so it doesn't return
defined unless these can be successfully compiled, but for now use the
hints file.

commit | commitdiff | tree

Karl Williamson [Sun, 21 Jan 2018 17:08:33 +0000 (10:08 -0700)]

perlembed: Fix typos

Perl is capitalized when referring to the language; lowercased when
referring to a particular executable.

commit | commitdiff | tree

Tony Cook [Wed, 31 Jan 2018 00:00:01 +0000 (11:00 +1100)]

(perl #132761) Devel::PPPort updates for older perls

commit | commitdiff | tree

Tony Cook [Tue, 30 Jan 2018 23:46:49 +0000 (10:46 +1100)]

bump $Devel::PPPort::VERSION to 3.39

commit | commitdiff | tree

Pali [Fri, 26 Jan 2018 18:39:49 +0000 (19:39 +0100)]

Devel::PPPort: Use croak_nocontext() instead of croak() when dTHX is not declared

commit | commitdiff | tree

Pali [Fri, 26 Jan 2018 18:39:14 +0000 (19:39 +0100)]

Devel::PPPort: Declare dTHX in croak_xs_usage()

CvGV() takes aTHX_ as first argument.

commit | commitdiff | tree

Tony Cook [Thu, 25 Jan 2018 03:39:54 +0000 (14:39 +1100)]

(perl #132761) croak_xs_usage() shouldn't accept a THX argument

commit | commitdiff | tree

Pali [Tue, 23 Jan 2018 22:36:06 +0000 (23:36 +0100)]

Devel::PPPort: Do not run tests which use \N{U+XX} on Perl 5.12.0

Perl 5.12.0 has a bug when parsing \N{U+XX} syntax and throws an error:
Invalid hexadecimal number in \N{U+...} in regex.

commit | commitdiff | tree

Pali [Tue, 23 Jan 2018 22:00:55 +0000 (23:00 +0100)]

Devel::PPPort: Do not define PERL_MAGIC_qr more times

make regen shows a warning: magic: PERL_MAGIC_qr already provided by misc

Remove it from misc, but because misc depends on it, put magic before misc.

commit | commitdiff | tree

Pali [Tue, 23 Jan 2018 21:50:20 +0000 (22:50 +0100)]

Devel::PPPort: Do not mask Perl_warn_nocontext and Perl_croak_nocontext

It causes compile errors on older threaded Perl versions.

commit | commitdiff | tree

Karl Williamson [Mon, 22 Jan 2018 19:55:31 +0000 (12:55 -0700)]

Use dfa to speed up translating UTF-8 into code point

This dfa, available on the internet, has the reputation of being the
fastest general translator.  This commit changes to use it at the
beginning of our translator, modifying it slightly to accept surrogates
and all 4-byte Perl-extended.  If necessary, it drops down into our
translator to handle errors and warnings and Perl extended.

It shows some improvement over our base translation:

Key:
    Ir   Instruction read
    Dr   Data read
    Dw   Data write
    COND conditional branches
    IND  indirect branches
    _m   branch predict miss
    -    indeterminate percentage (e.g. 1/0)

The numbers represent raw counts per loop iteration.

unicode::utf8n_to_uvchr_0x007f
ord(X)

       blead   dfa Ratio %
       ----- ----- -------
    Ir 359.0 359.0   100.0
    Dr 111.0 111.0   100.0
    Dw  64.0  64.0   100.0
  COND  42.0  42.0   100.0
   IND   5.0   5.0   100.0

COND_m   2.0   0.0     Inf
 IND_m   5.0   5.0   100.0

unicode::utf8n_to_uvchr_0x07ff
ord(X)

       blead   dfa Ratio %
       ----- ----- -------
    Ir 478.0 467.0   102.4
    Dr 132.0 133.0    99.2
    Dw  79.0  78.0   101.3
  COND  63.0  57.0   110.5
   IND   5.0   5.0   100.0

COND_m   1.0   0.0     Inf
 IND_m   5.0   5.0   100.0

unicode::utf8n_to_uvchr_0xfffd
ord(X)

       blead   dfa Ratio %
       ----- ----- -------
    Ir 494.0 486.0   101.6
    Dr 134.0 136.0    98.5
    Dw  79.0  78.0   101.3
  COND  67.0  61.0   109.8
   IND   5.0   5.0   100.0

COND_m   2.0   0.0     Inf
 IND_m   5.0   5.0   100.0

unicode::utf8n_to_uvchr_0x1fffd
ord(X)

       blead   dfa Ratio %
       ----- ----- -------
    Ir 508.0 505.0   100.6
    Dr 135.0 139.0    97.1
    Dw  79.0  78.0   101.3
  COND  70.0  65.0   107.7
   IND   5.0   5.0   100.0

COND_m   2.0   1.0   200.0
 IND_m   5.0   5.0   100.0

unicode::utf8n_to_uvchr_0x10fffd
ord(X)

       blead   dfa Ratio %
       ----- ----- -------
    Ir 508.0 505.0   100.6
    Dr 135.0 139.0    97.1
    Dw  79.0  78.0   101.3
  COND  70.0  65.0   107.7
   IND   5.0   5.0   100.0

COND_m   2.0   1.0   200.0
 IND_m   5.0   5.0   100.0

Each code point represents an extra byte required in its UTF-8
representation from the previous one.
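
To show what decoding with a state machine looks like, here is a
hand-written sketch.  It is NOT the table-driven dfa this commit adopts,
and unlike perl's modified version it is strict: surrogates and anything
above U+10FFFF are rejected.  The state names and function names are
invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* States: T1..T3 expect that many ordinary continuation bytes; the
 * others restrict the range of the first continuation byte. */
enum { ST_ACCEPT, ST_REJECT, ST_T1, ST_T2, ST_T3,
       ST_E0, ST_ED, ST_F0, ST_F4 };

/* Feed one byte to the machine; returns the new state and accumulates
 * the code point in *cp. */
static unsigned utf8_step(unsigned state, uint32_t *cp, unsigned char b)
{
    unsigned char lo = 0x80, hi = 0xBF;   /* allowed continuation range */
    unsigned next;

    if (state == ST_ACCEPT) {             /* expecting a start byte */
        if (b < 0x80)               { *cp = b;        return ST_ACCEPT; }
        if (b >= 0xC2 && b <= 0xDF) { *cp = b & 0x1F; return ST_T1; }
        if (b == 0xE0)              { *cp = 0;        return ST_E0; }
        if (b == 0xED)              { *cp = 0x0D;     return ST_ED; }
        if (b >= 0xE1 && b <= 0xEF) { *cp = b & 0x0F; return ST_T2; }
        if (b == 0xF0)              { *cp = 0;        return ST_F0; }
        if (b >= 0xF1 && b <= 0xF3) { *cp = b & 0x07; return ST_T3; }
        if (b == 0xF4)              { *cp = 0x04;     return ST_F4; }
        return ST_REJECT;
    }

    switch (state) {
    case ST_T1: next = ST_ACCEPT;            break;
    case ST_T2: next = ST_T1;                break;
    case ST_T3: next = ST_T2;                break;
    case ST_E0: next = ST_T1; lo = 0xA0;     break;  /* no overlongs   */
    case ST_ED: next = ST_T1; hi = 0x9F;     break;  /* no surrogates  */
    case ST_F0: next = ST_T2; lo = 0x90;     break;  /* no overlongs   */
    case ST_F4: next = ST_T2; hi = 0x8F;     break;  /* max U+10FFFF   */
    default:    return ST_REJECT;
    }
    if (b < lo || b > hi)
        return ST_REJECT;
    *cp = (*cp << 6) | (b & 0x3F);
    return next;
}

/* Decode the first code point of s.  Returns the number of bytes
 * consumed, or 0 if the input is empty, truncated, or malformed. */
size_t utf8_decode_one(const unsigned char *s, size_t len, uint32_t *cp)
{
    unsigned state = ST_ACCEPT;
    size_t i;

    assert(s != NULL && cp != NULL);
    for (i = 0; i < len; i++) {
        state = utf8_step(state, cp, s[i]);
        if (state == ST_ACCEPT)
            return i + 1;
        if (state == ST_REJECT)
            return 0;
    }
    return 0;
}
```

A real translator would fall back to a slower routine on rejection to
work out which error or warning applies, as the commit describes.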

commit | commitdiff | tree

Karl Williamson [Tue, 30 Jan 2018 19:34:20 +0000 (12:34 -0700)]

regcomp.c: Silence compiler maybe uninit warnings

I don't believe these can actually be used uninitialized, but
initialize them anyway to silence the warnings.

commit | commitdiff | tree

Matthew Horsfall [Tue, 23 Jan 2018 19:45:06 +0000 (14:45 -0500)]

Use the correct path for valgrind logs in make test.valgrind

commit | commitdiff | tree

Karl Williamson [Tue, 30 Jan 2018 03:47:56 +0000 (20:47 -0700)]

Add ANYOFM regnode

This is a specialized ANYOF node for use when the code points in it
have characteristics that allow them to be matched with a mask instead
of a bit map.  When this happens, the speed up is pretty spectacular:

Key:
    Ir   Instruction read
    Dr   Data read
    Dw   Data write
    COND conditional branches
    IND  indirect branches

The numbers represent raw counts per loop iteration.

Results of ('b' x 10000) . 'a' =~ /[Aa]/

          blead    mask Ratio %
       -------- ------- -------
    Ir 153132.0 25636.0   597.3
    Dr  40909.0  2155.0  1898.3
    Dw  20593.0   593.0  3472.7
  COND  20529.0  3028.0   678.0
   IND     22.0    22.0   100.0

See the comments in regcomp.c or
http://nntp.perl.org/group/perl.perl5.porters/249001 for a description
of the cases that this new technique can handle.  But several common
ones include the C0 controls (on ASCII platforms), [01], [0-7], [Aa] and
any other ASCII case pair.

The set of ASCII characters also could be done with this node instead of
having the special ASCII regnode, reducing code size and complexity.
I haven't investigated the speed loss of doing so.

A NANYOFM node could be created for matching the complement of what
this one matches.

A pattern like /A/i is not affected by this commit, but the regex
optimizer could be changed to take advantage of this commit.  What would
need to be done is for it to look at the first byte of an EXACTFish node
and if it's one of the case pairs this handles, to generate a synthetic
start class for it.  This would automatically invoke the sped up code.
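
One way to picture the mask test: a set of bytes qualifies when its
members agree on some bits and take every combination of the remaining
ones, so that membership is just (c & mask) == base.  The sketch below
is my own reconstruction of that idea, not the regcomp.c logic, and
derive_mask() and mask_match() are invented names.  It assumes the set
contains no duplicates.

```c
#include <assert.h>
#include <stddef.h>

/* Try to express a set of n distinct bytes as "(c & mask) == base".
 * Returns 1 and fills in *mask and *base on success, 0 if the set
 * cannot be represented that way. */
int derive_mask(const unsigned char *set, size_t n,
                unsigned char *mask, unsigned char *base)
{
    unsigned char all_and = 0xFF, all_or = 0x00, varying;
    unsigned bits = 0;
    size_t i;

    assert(set != NULL && n > 0 && mask != NULL && base != NULL);

    for (i = 0; i < n; i++) {
        all_and &= set[i];
        all_or  |= set[i];
    }
    varying = all_and ^ all_or;     /* bits that differ within the set */
    for (i = 0; i < 8; i++)
        bits += (varying >> i) & 1;

    /* The mask would accept all 2**bits combinations of the varying
     * bits, so the set must consist of exactly that many members. */
    if (n != ((size_t) 1 << bits))
        return 0;

    *mask = (unsigned char) ~varying;
    *base = all_and;
    return 1;
}

/* The whole match is one AND and one compare */
int mask_match(unsigned char c, unsigned char mask, unsigned char base)
{
    return (c & mask) == base;
}
```

For [Aa] on an ASCII platform this yields mask 0xDF and base 0x41, and
for [0-7] mask 0xF8 and base 0x30, while a set such as [Ab] is rejected.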

commit | commitdiff | tree

Karl Williamson [Mon, 22 Jan 2018 20:55:03 +0000 (13:55 -0700)]

regcomp.sym: Add ANYOFM regnode

This uses a mask instead of a bitmap, and is restricted to representing
invariant characters under UTF-8 that meet particular bit patterns.

commit | commitdiff | tree

Karl Williamson [Thu, 25 Jan 2018 20:35:09 +0000 (13:35 -0700)]

regcomp.c: White-space only

Indent code that the previous commit created a block around

commit | commitdiff | tree

Karl Williamson [Thu, 25 Jan 2018 20:26:16 +0000 (13:26 -0700)]

regcomp.c: Allow a fcn param to be NULL

In which case handling is skipped. This is in preparation for a future
commit which will use this function in a slightly different manner

commit | commitdiff | tree

Karl Williamson [Fri, 29 Dec 2017 22:45:38 +0000 (15:45 -0700)]

regexec.c: Use word-at-a-time to repeat /i single byte pattern

For most of the case folding pairs, like [Aa], it is possible to use a
mask to match them word-at-a-time in regrepeat(), so that long sequences
of them are handled with significantly better performance.

commit | commitdiff | tree

Karl Williamson [Fri, 29 Dec 2017 22:17:41 +0000 (15:17 -0700)]

regexec.c: Use word-at-a-time to repeat a single byte pattern

There is special code in the function regrepeat() to handle instances
where the pattern to repeat is a single byte. These all can be done
word-at-a-time to significantly increase the performance of long
repeats.
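
A sketch of the word-at-a-time technique, under stated assumptions: it
uses the C99 type uint64_t for brevity (perl's code has to be more
portable), the function name span_matching() is invented, and this is
not the regrepeat() implementation.  A mask of 0xFF gives the plain
single-byte case; a mask of 0xDF with base 'A' folds an ASCII case pair
such as [Aa].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count how many leading bytes of s (at most len) satisfy
 * (byte & mask) == base. */
size_t span_matching(const unsigned char *s, size_t len,
                     unsigned char mask, unsigned char base)
{
    const uint64_t ones      = UINT64_C(0x0101010101010101);
    const uint64_t mask_word = ones * mask;   /* mask in all 8 bytes */
    const uint64_t base_word = ones * base;   /* base in all 8 bytes */
    size_t i = 0;

    assert(s != NULL);
    assert((base & mask) == base);

    /* Fast path: test 8 bytes per iteration.  The patterns repeat in
     * every byte, so the platform's byte order does not matter. */
    while (len - i >= sizeof(uint64_t)) {
        uint64_t word;

        memcpy(&word, s + i, sizeof word);    /* safe unaligned load */
        if ((word & mask_word) != base_word)
            break;                            /* a mismatch is in here */
        i += sizeof word;
    }

    /* Slow path: locate the exact end of the run byte by byte */
    while (i < len && (s[i] & mask) == base)
        i++;

    return i;
}
```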

commit | commitdiff | tree

Karl Williamson [Wed, 27 Dec 2017 01:25:26 +0000 (18:25 -0700)]

regexec.c: Replace loop by memchr()

This can be called on a potentially long string.

commit | commitdiff | tree

Karl Williamson [Tue, 30 Jan 2018 03:33:14 +0000 (20:33 -0700)]

Compile variant_byte_number() for EBCDIC

Future commits will use this without regard to platform.

commit | commitdiff | tree

Karl Williamson [Tue, 30 Jan 2018 03:07:51 +0000 (20:07 -0700)]

Use different scheme to handle MSVC6

Recent commit 0b08cab0fc46a5f381ca18a451f55cf12c81d966 caused a function
to not be compiled when running on MSVC6, and hence its callers needed
to use an alternative mechanism there. This is easy enough, it turns
out, but it also turns out that there are more opportunities to call
this function. Rather than having each caller have to know about the
MSVC6 problem, this current commit reimplements the function on that
platform to use a slow, dumb method, so knowing about the issue is
confined to just this one function.

commit | commitdiff | tree

Karl Williamson [Sun, 28 Jan 2018 21:48:53 +0000 (14:48 -0700)]

APItest/APItest.xs: Simplify mappings

Instead of using SVs, use the underlying C type, and so the code here
doesn't have to deal with the SV conversions

commit | commitdiff | tree

Karl Williamson [Sun, 28 Jan 2018 21:47:16 +0000 (14:47 -0700)]

APItest/t/utf8_warn_base.pl: White-space only

This outdents a bunch of code to make it a shift width of 2 instead of 4
because the nesting was getting too deep, making the space available on
a line too short.

commit | commitdiff | tree

Karl Williamson [Sun, 28 Jan 2018 21:43:00 +0000 (14:43 -0700)]

APItest/t/utf8_warn_base.pl: Improve diagnostics

commit | commitdiff | tree

Karl Williamson [Sun, 28 Jan 2018 00:43:00 +0000 (17:43 -0700)]

Add utf8n_to_uvchr_msgs()

This UTF-8 to code point translator variant is to meet the needs of
Encode, and provides XS authors with more general capability than
the other decoders.

commit | commitdiff | tree

Karl Williamson [Sun, 28 Jan 2018 17:02:11 +0000 (10:02 -0700)]

Don't use variant_byte_number on MSVC6

See [perl #132766]

commit | commitdiff | tree

Karl Williamson [Thu, 25 Jan 2018 17:37:04 +0000 (10:37 -0700)]

inline.h: Clarify comment

commit | commitdiff | tree

Karl Williamson [Thu, 25 Jan 2018 17:25:27 +0000 (10:25 -0700)]

Don't use C99 ULL constant suffix

The suffix ULL in, e.g., 7ULL, is C99, and since perl supports C89, we
can't use it. Change these occurrences to wrap those that would exceed
32 bits to use UINTMAX_C(...).

perl.h has logic to define that macro appropriately if the compiler
doesn't already know it.
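
For illustration, a constant wider than 32 bits written with
UINTMAX_C() rather than a ULL suffix.  This sketch takes the macro and
uintmax_t from <stdint.h>, which is itself an assumption; in the perl
source, perl.h supplies the macro where the compiler lacks it.  The
macro and function names below are invented.

```c
#include <assert.h>
#include <stdint.h>

/* The high bit of every byte in a 64-bit word, with no "ULL" suffix */
#define HIGH_BIT_IN_EVERY_BYTE  UINTMAX_C(0x8080808080808080)

/* Return 1 if any byte of word has its high bit set, i.e. is not ASCII */
int has_variant_byte(uintmax_t word)
{
    return (word & HIGH_BIT_IN_EVERY_BYTE) != 0;
}
```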

commit | commitdiff | tree

Karl Williamson [Mon, 29 Jan 2018 17:46:02 +0000 (10:46 -0700)]

Fix bug in new [[:ascii:]] nodes

Commit aff4cafe362e55c7722ba12952e287a7d1770cb9 added new regnodes for
[[:ascii:]] and its complement for a significant performance
improvement.  In looking at the code later, I realized that there was a
bug in find_byclass(): it didn't keep trying after an initial trial
match succeeded but the pattern as a whole then failed to match.
It's supposed to try again with the next ASCII character.

This commit fixes that, and adds tests.

I thought that these new changes might lower the performance improvement
of the original, but they don't.  Here's a typical one where we have a
string of a million non-ascii 2-byte characters, followed by a single
ASCII one.

         posixa    ascii  Ratio %
        ------- -------- --------
     Ir     Inf      Inf    665.9
     Dr     Inf 250907.0   1993.1
     Dw     Inf    597.0 167603.7
   COND     Inf 500532.0    399.7
    IND    22.0     22.0    100.0

(posixa is the old way of doing things; Inf just means the number was
too large for the program to want to display it; the ratio is still
valid).

commit | commitdiff | tree

Karl Williamson [Mon, 29 Jan 2018 16:51:45 +0000 (09:51 -0700)]

regexec.c: Extract some macro code into a submacro

A future commit will reuse this code, so will avoid duplication.

commit | commitdiff | tree

Karl Williamson [Mon, 29 Jan 2018 04:02:49 +0000 (21:02 -0700)]

regexec.c: Use different method for finding adjacent chars

Commit 3b6c52ce7db772c296d8f10d92dec46af03391dc changed the variable
name and commented what the code was doing. This changes that code to
use a different mechanism that I think is simpler, and is extensible so
that it can be used not just for instances in which the input is
examined character-by-character.

Until this commit, a boolean was used to indicate that we've found
adjacent characters. This commit saves the address of the next
character, so when we find the next match, if it begins at the saved
address, we know it is adjacent.
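
The saved-address approach can be sketched on a simpler problem than
the one in regexec.c: counting how many matches of a byte immediately
follow a previous match.  count_adjacent_matches() is a name invented
for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count the matches of target in s that begin exactly where the
 * previous match ended. */
size_t count_adjacent_matches(const char *s, char target)
{
    const char *expected_next = NULL;  /* address just past the last match */
    const char *p;
    size_t adjacent = 0;

    assert(s != NULL && target != '\0');

    for (p = strchr(s, target); p != NULL; p = strchr(p + 1, target)) {
        if (p == expected_next)
            adjacent++;                /* begins at the saved address */
        expected_next = p + 1;         /* save where the next could start */
    }
    return adjacent;
}
```

Because only an address is compared, the scan is free to jump ahead (here
with strchr()) instead of visiting every character to keep a flag up to
date.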

commit | commitdiff | tree

Karl Williamson [Mon, 29 Jan 2018 03:33:10 +0000 (20:33 -0700)]

regexec.c: Extract some macro code into a sub-macro

By doing this, it becomes common code with another place in the code, so
the duplication can be removed.

commit | commitdiff | tree

Karl Williamson [Mon, 29 Jan 2018 02:15:25 +0000 (19:15 -0700)]

regexec.c: Collapse some macros

By adding a utf8ness parameter these 4 macros can be collapsed into 2,
with no increase in run time, as the parameter is always a compile time
constant and modern compilers will avoid the conditional.

commit | commitdiff | tree

Karl Williamson [Mon, 29 Jan 2018 23:05:41 +0000 (16:05 -0700)]

Fix bug in t/re/regex_sets_compat.t

This tests the tests that regexp.t has and which have bracketed
character classes. It converts those to the regex sets notation, and
verifies they still work. It was adding an extra blank at the end of
the pattern in some cases, causing it to fail.

commit | commitdiff | tree

Karl Williamson [Fri, 26 Jan 2018 19:33:20 +0000 (12:33 -0700)]

regexec.c: Use meaningful variable name; comment

It took me quite a while to figure out what 'tmp' is doing here. So I
renamed it to a more meaningful name, and added comments.

commit | commitdiff | tree

Karl Williamson [Thu, 25 Jan 2018 20:36:25 +0000 (13:36 -0700)]

regcomp.c: Clarify comment

commit | commitdiff | tree

Karl Williamson [Thu, 25 Jan 2018 20:20:24 +0000 (13:20 -0700)]

regcomp.c: Use existing function to do task

The function does it better than this code, which looked too deeply into
the internals, and got it wrong sometimes, because it didn't look at the
state of the inversion. The consequences are not a bug, but potentially
forgoing an optimization, or needlessly looking for an optimization that
will turn out to not be there.