regexec: Do less work on quantified UTF-8
author Karl Williamson <public@khwilliamson.com>
Tue, 16 Oct 2012 16:17:01 +0000 (10:17 -0600)
committer Karl Williamson <public@khwilliamson.com>
Wed, 17 Oct 2012 03:48:37 +0000 (21:48 -0600)
commit 79a2a0e89816b80870df1f9b9e7bb5fb1edcd556
tree f530af448db6076a9fc00479d2d4a3bb64427eee
parent 57f0e7e230d864f5b78d28bb89545ef671c101a0

Consider the regexes /A*B/ and /A*?B/ where A and B are arbitrary,
except that B begins with an EXACTish node.  Prior to this patch, as a
shortcut, the loop for accumulating A* would look for the first character
of B to help it decide if B is a possibility for the next thing.  It did
not test for all of B unless testing showed that the next thing could be
the beginning of B.  If the target string was UTF-8, it converted each
new sequence of bytes to the code point they represented, and then did
the comparison.  This is a relatively expensive process.

This commit avoids that conversion by just doing a memEQ at the current
input position.  To do this, it revamps S_setup_EXACTISH_ST_c1_c2() to
output the UTF-8 sequences to compare against.  The function also has
been tightened up so that there are fewer false positives.
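
The difference between the two approaches can be illustrated with a
minimal standalone sketch.  This is not the code from regexec.c: the
helper names (utf8_len, decode_utf8, first_candidate_decode,
first_candidate_bytes) are hypothetical, the input is assumed to be
well-formed UTF-8, and only a single candidate sequence is handled,
with no case folding.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Length in bytes of the UTF-8 sequence that starts with this lead byte.
 * Assumes well-formed input (hypothetical helper, not a Perl API). */
static size_t utf8_len(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Decode one well-formed UTF-8 sequence at s into a code point. */
static uint32_t decode_utf8(const unsigned char *s)
{
    size_t n = utf8_len(s[0]);
    uint32_t cp;
    size_t i;

    if (n == 1) return s[0];
    cp = s[0] & (0xFF >> (n + 1));      /* payload bits of the lead byte */
    for (i = 1; i < n; i++)
        cp = (cp << 6) | (s[i] & 0x3F); /* 6 payload bits per continuation */
    return cp;
}

/* Decode-then-compare: every character is converted to a code point
 * before it is tested against the first character of B.
 * Returns the byte offset of the first candidate, or -1. */
static ptrdiff_t first_candidate_decode(const unsigned char *s, size_t len,
                                        uint32_t c1)
{
    size_t i = 0;
    while (i < len) {
        size_t n = utf8_len(s[i]);
        if (i + n <= len && decode_utf8(s + i) == c1)
            return (ptrdiff_t)i;
        i += n;
    }
    return -1;
}

/* Byte comparison: the first character of B is supplied already encoded
 * as UTF-8, so each test is a bounded memcmp with no decoding.
 * Returns the byte offset of the first candidate, or -1. */
static ptrdiff_t first_candidate_bytes(const unsigned char *s, size_t len,
                                       const unsigned char *c1_utf8,
                                       size_t c1_len)
{
    size_t i = 0;
    while (i < len) {
        if (len - i >= c1_len && memcmp(s + i, c1_utf8, c1_len) == 0)
            return (ptrdiff_t)i;
        i += utf8_len(s[i]);            /* step to the next character */
    }
    return -1;
}
```

Both functions report the same offsets on well-formed input; the second
simply skips the per-character conversion, which is the saving the
message above describes.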
regexec.c
regexp.h
utf8.c