As explained in the docs, this helps detect spoofing attacks.
t/re/regexp_unicode_prop_thr.t See if unicode properties work in regular expressions as expected under threads
t/re/rt122747.t Test rt122747 assert faile (requires DEBUGGING)
t/re/rxcode.t See if /(?{ code })/ works
+t/re/script_run.t See if script runs works
t/re/speed.t See if optimisations are keeping things fast
t/re/speed_thr.t ditto under threads
t/re/subst.t See if substitution works
#define PL_Posix_ptrs (vTHX->IPosix_ptrs)
#define PL_Proc (vTHX->IProc)
#define PL_SB_invlist (vTHX->ISB_invlist)
+#define PL_SCX_invlist (vTHX->ISCX_invlist)
#define PL_Sock (vTHX->ISock)
#define PL_StdIO (vTHX->IStdIO)
#define PL_Sv (vTHX->ISv)
PERLVAR(I, GCB_invlist, SV *)
PERLVAR(I, LB_invlist, SV *)
PERLVAR(I, SB_invlist, SV *)
+PERLVAR(I, SCX_invlist, SV *)
PERLVAR(I, WB_invlist, SV *)
PERLVAR(I, Assigned_invlist, SV *)
PERLVAR(I, seen_deprecated_macro, HV *)
PL_WB_invlist = _new_invlist_C_array(_Perl_WB_invlist);
PL_LB_invlist = _new_invlist_C_array(_Perl_LB_invlist);
PL_Assigned_invlist = _new_invlist_C_array(Assigned_invlist);
+ PL_SCX_invlist = _new_invlist_C_array(_Perl_SCX_invlist);
#ifdef HAS_POSIX_2008_LOCALE
PL_C_locale_obj = newlocale(LC_ALL_MASK, "C", NULL);
#endif
PL_GCB_invlist = NULL;
PL_LB_invlist = NULL;
PL_SB_invlist = NULL;
+ PL_SCX_invlist = NULL;
PL_WB_invlist = NULL;
PL_Assigned_invlist = NULL;
it does so atomically or not. Most file descriptors were getting the
flag set, but some were being missed.
+=head2 Mixed Unicode scripts are now detectable
+
+A mixture of scripts, such as Cyrillic and Latin, in a string is often
+the sign of a spoofing attack. A new regular expression construct
+now allows for easy detection of these. For example, you can say
+
+ qr/(?script_run: \d+ \b )/x
+
+And the digits matched will all be from the same set of 10. You won't
+get a look-alike digit from a different script that has a different
+value than what it appears to be.
+
=head1 Security
XXX Any security-related notices go here. In particular, any security
=item In '(*VERB...)', the '(' and '*' must be adjacent in regex;
marked by S<<-- HERE> in m/%s/
-(F) The two-character sequence C<"(*"> in
-this context in a regular expression pattern should be an
-indivisible token, with nothing intervening between the C<"(">
-and the C<"*">, but you separated them.
+=item In '(+...)', the '(' and '+' must be adjacent in regex;
+marked by S<<-- HERE> in m/%s/
+
+(F) The two-character sequences C<"(+"> and C<"(*"> in
+this context in a regular expression pattern should be
+indivisible tokens, with nothing intervening between the C<"(">
+and the C<"*"> or C<"+">, but you separated them.
=item ioctl is not implemented
use feature "signatures";
sub foo ($left, $right) { ... }
+=item The script_run feature is experimental
+
+(S experimental::script_run) This feature is experimental and its
+behavior may in any future release of perl.
+See L<perlre/Script Runs>.
+
=item The stat preceding %s wasn't an lstat
(F) It makes no sense to test the current stat buffer for symbolic
of valid modes: C<< < >>, C<< > >>, C<<< >> >>>, C<< +< >>,
C<< +> >>, C<<< +>> >>>, C<-|>, C<|->, C<< <& >>, C<< >& >>.
+=item Unknown (+ pattern in regex; marked by S<<-- HERE> in m/%s/
+
+(F) The C<(+> was followed by something that the regular expression
+compiler does not recognize. Check your spelling.
+
=item Unknown PerlIO layer "%s"
(W layer) An attempt was made to push an unknown layer onto the Perl I/O
the pattern uses L</C<(?[ ])>>
+=item 8
+
+the pattern uses L<C<(+script_run: ...)>|/Script Runs>
+
=back
Another mnemonic for this modifier is "Depends", as the rules actually
where side-effects of lookahead I<might> have influenced the
following match, see L</C<< (?>pattern) >>>.
+=head2 Script Runs
+
+A script run is basically a sequence of characters, all from the same
+Unicode script (see L<perlunicode/Scripts>), such as Latin or Greek. In
+most places a single word would never be written in multiple scripts,
+unless it is a spoofing attack. An infamous example, is
+
+ paypal.com
+
+Those letters could all be Latin (as in the example just above), or they
+could be all Cyrillic (except for the dot), or they could be a mixture
+of the two. In the case of an internet address the C<.com> would be in
+Latin, And any Cyrillic ones would cause it to be a mixture, not a
+script run. Someone clicking on such a link would not be directed to
+the real Paypal website, but an attacker would craft a look-alike one to
+attempt to gather sensitive information from the person.
+
+Starting in Perl 5.28, it is now easy to detect strings that aren't
+script runs. Simply enclose just about any pattern like this:
+
+ (+script_run:pattern)
+
+What happens is that after I<pattern> succeeds in matching, it is
+subjected to the additional criterion that every character in it must be
+from the same script (see exceptions below). If this isn't true,
+backtracking occurs until something all in the same script is found that
+matches, or all possibilities are exhausted. This can cause a lot of
+backtracking, but generally, only malicious input will result in this,
+though the slow down could cause a denial of service attack. If your
+needs permit, it is best to make the pattern atomic.
+
+ (+script_run:(?>pattern))
+
+(See L</C<(?E<gt>pattern)>>.)
+
+In Taiwan, Japan, and Korea, it is common for text to have a mixture of
+characters from their native scripts and base Chinese. Perl follows
+Unicode's UTS 39 (L<http://unicode.org/reports/tr39/>) Unicode Security
+Mechanisms in allowing such mixtures.
+
+The rules used for matching decimal digits are somewhat different. Many
+scripts have their own sets of digits equivalent to the Western C<0>
+through C<9> ones. A few, such as Arabic, have more than one set. For
+a string to be considered a script run, all digits in it must come from
+the same set, as determined by the first digit encountered. The ASCII
+C<[0-9]> are accepted as being in any script, even those that have their
+own set. This is because these are often used in commerce even in such
+scripts. But any mixing of the ASCII and other digits will cause the
+sequence to not be a script run, failing the match. As an example,
+
+ qr/(?script_run: \d+ \b )/x
+
+guarantees that the digits matched will all be from the same set of 10.
+You won't get a look-alike digit from a different script that has a
+different value than what it appears to be.
+
+Unicode has three pseudo scripts that are handled specially.
+
+"Unknown" is applied to code points whose meaning has yet to be
+determined. Perl currently will match as a script run, any single
+character string consisting of one of these code points. But any string
+longer than one code point containing one of these will not be
+considered a script run.
+
+"Inherited" is applied to characters that modify another, such as an
+accent of some type. These are considered to be in the script of the
+master character, and so never cause a script run to not match.
+
+The other one is "Common". This consists of mostly punctuation, emoji,
+and characters used in mathematics and music, and the ASCII digits C<0>
+through C<9>. These characters can appear intermixed in text in many of
+the world's scripts. These also don't cause a script run to not match,
+except any ASCII digits encountered have to obey the decimal digit rules
+described above.
+
+This construct is non-capturing. You can add parentheses to I<pattern>
+to capture, if desired. You will have to do this if you plan to use
+L</(*ACCEPT) (*ACCEPT:arg)> and not have it bypass the script run
+checking.
+
+This feature is experimental, and the exact syntax and details of
+operation are subject to change; using it yields a warning in the
+C<experimental::script_run> category.
+
+The C<Script_Extensions> property is used as the basis for this feature.
+
=head2 Special Backtracking Control Verbs
These special patterns are generally of the form C<(*I<VERB>:I<ARG>)>. Unless
bool seen_unfolded_sharp_s;
bool strict;
bool study_started;
+ bool in_script_run;
};
#define RExC_flags (pRExC_state->flags)
#define RExC_strict (pRExC_state->strict)
#define RExC_study_started (pRExC_state->study_started)
#define RExC_warn_text (pRExC_state->warn_text)
+#define RExC_in_script_run (pRExC_state->in_script_run)
/* Heuristic check on the complexity of the pattern: if TOO_NAUGHTY, we set
* a flag to disable back-off on the fixed/floating substrings - if it's
RExC_seen_unfolded_sharp_s = 0;
RExC_contains_locale = 0;
RExC_strict = cBOOL(pm_flags & RXf_PMf_STRICT);
+ RExC_in_script_run = 0;
RExC_study_started = 0;
pRExC_state->runtime_code_qr = NULL;
RExC_frame_head= NULL;
* here (if paren ==2). The forms '(*VERB' and '(?...' disallow such
* intervening space, as the sequence is a token, and a token should be
* indivisible */
- bool has_intervening_patws = paren == 2 && *(RExC_parse - 1) != '(';
+ bool has_intervening_patws = (paren == 2 || paren == 's')
+ && *(RExC_parse - 1) != '(';
if (RExC_parse >= RExC_end) {
vFAIL("Unmatched (");
}
- if ( *RExC_parse == '*') { /* (*VERB:ARG) */
+ if (paren == 's') {
+
+ /* A nested script run is a no-op besides clustering */
+ if (RExC_in_script_run) {
+ paren = ':';
+ nextchar(pRExC_state);
+ ret = NULL;
+ goto parse_rest;
+ }
+ RExC_in_script_run = 1;
+
+ ret = reg_node(pRExC_state, SROPEN);
+ is_open = 1;
+ }
+ else if ( *RExC_parse == '*') { /* (*VERB:ARG) */
char *start_verb = RExC_parse + 1;
STRLEN verb_len;
char *start_arg = NULL;
nextchar(pRExC_state);
return ret;
}
+ else if (*RExC_parse == '+') { /* (+...) */
+ RExC_parse++;
+
+ if (has_intervening_patws) {
+ /* XXX Note that a potential gotcha is that outside of /x '( +
+ * ...)' means to match a space at least once ... This is a
+ * problem elsewhere too */
+ vFAIL("In '(+...)', the '(' and '+' must be adjacent");
+ }
+
+ if (! memBEGINPs(RExC_parse, (STRLEN) (RExC_end - RExC_parse),
+ "script_run:"))
+ {
+ RExC_parse += strcspn(RExC_parse, ":)");
+ vFAIL("Unknown (+ pattern");
+ }
+ else {
+
+ /* This indicates Unicode rules. */
+ REQUIRE_UNI_RULES(flagp, NULL);
+
+ RExC_parse += sizeof("script_run:") - 1;
+
+ if (PASS2) {
+ Perl_ck_warner_d(aTHX_
+ packWARN(WARN_EXPERIMENTAL__SCRIPT_RUN),
+ "The script_run feature is experimental"
+ REPORT_LOCATION, REPORT_LOCATION_ARGS(RExC_parse));
+ }
+
+ ret = reg(pRExC_state, 's', &flags, depth+1);
+ if (flags & (RESTART_PASS1|NEED_UTF8)) {
+ *flagp = flags & (RESTART_PASS1|NEED_UTF8);
+ return NULL;
+ }
+
+ nextchar(pRExC_state);
+
+ return ret;
+ }
+ }
else if (*RExC_parse == '?') { /* (?...) */
bool is_logical = 0;
const char * const seqstart = RExC_parse;
Set_Node_Offset(ender,RExC_parse+1); /* MJD */
Set_Node_Length(ender,1); /* MJD */
break;
+ case 's':
+ ender = reg_node(pRExC_state, SRCLOSE);
+ RExC_in_script_run = 0;
+ break;
case '<':
case ',':
case '=':
/* While that wasn't END last time... */
NODE_ALIGN(node);
op = OP(node);
- if (op == CLOSE || op == WHILEM)
+ if (op == CLOSE || op == SRCLOSE || op == WHILEM)
indent--;
next = regnext((regnode *)node);
node = NEXTOPER(node);
node += regarglen[(U8)op];
}
- if (op == CURLYX || op == OPEN)
+ if (op == CURLYX || op == OPEN || op == SROPEN)
indent++;
}
CLEAR_OPTSTART;
bool is_utf8_pat = reginfo->is_utf8_pat;
bool match = FALSE;
I32 orig_savestack_ix = PL_savestack_ix;
+ char * script_run_begin = NULL;
/* Solaris Studio 12.3 messes up fetching PL_charclass['\n'] */
#if (defined(__SUNPRO_C) && (__SUNPRO_C == 0x5120) && defined(__x86_64) && defined(USE_64_BIT_ALL))
lastopen = n;
break;
+ case SROPEN: /* (*SCRIPT_RUN: */
+ script_run_begin = locinput;
+ break;
+
/* XXX really need to log other places start/end are set too */
#define CLOSE_CAPTURE \
rex->offs[n].start = rex->offs[n].start_tmp; \
break;
+ case SRCLOSE: /* (*SCRIPT_RUN: ... ) */
+ {
+ /* Checks that every character in the sequence started by SROPEN
+ * and ending here is one of three scripts: Common, Inherited, and
+ * possibly one other. Additionally all decimal digits must come
+ * from the same consecutive sequence of 10.
+ *
+ * Basically, it looks at each character in the sequence to see if
+ * the above conditions are met; if not it fails. It uses an
+ * inversion map to find the enum corresponding to the script of
+ * each character. But this is complicated by the fact that a few
+ * code points can be in any of several scripts. The data has been
+ * constructed so that there are additional enum values (all
+ * negative) for these situations. The absolute value of those is
+ * an index into another table which contains pointers to auxiliary
+ * tables for each such situation. Each aux array lists all the
+ * scripts for the given situation. There is another, parallel,
+ * table that gives the number of entries in each aux table. These
+ * are all defined in charclass_invlists.h */
+
+ /* XXX Here are the additional things UTS 39 says could be done:
+ * Mark Chinese strings as “mixed script” if they contain both
+ * simplified (S) and traditional (T) Chinese characters, using the
+ * Unihan data in the Unicode Character Database [UCD]. The
+ * criterion can only be applied if the language of the string is
+ * known to be Chinese. So, for example, the string
+ * “写真だけの結婚式 ” is Japanese, and should not be marked as
+ * mixed script because of a mixture of S and T characters.
+ * Testing for whether a character is S or T needs to be based not
+ * on whether the character has a S or T variant , but whether the
+ * character is an S or T variant. khw notes that the sample
+ * contains a Hiragana character, and it is unclear if absence of
+ * any foreign script marks the script as "Chinese"
+ *
+ * Forbid sequences of the same nonspacing mark
+ *
+ * Check to see that all the characters are in the sets of exemplar
+ * characters for at least one language in the Unicode Common
+ * Locale Data Repository [CLDR]. */
+
+
+ /* Things that match /\d/u */
+ SV * decimals_invlist = PL_XPosix_ptrs[_CC_DIGIT];
+ UV * decimals_array = invlist_array(decimals_invlist);
+
+ /* What code point is the digit '0' of the script run? */
+ UV zero_of_run = 0;
+ SCX_enum script_of_run = SCX_INVALID; /* Illegal value */
+ SCX_enum script_of_char = SCX_INVALID;
+
+ /* If the script remains not fully determined from iteration to
+ * iteration, this is the current intersection of the possiblities.
+ * */
+ SCX_enum * intersection = NULL;
+ PERL_UINT_FAST8_T intersection_len = 0;
+
+ const char * s = script_run_begin;
+ const char * strend = locinput;
+
+ assert(s);
+
+ /* Look at each character in the sequence */
+ while (s < strend) {
+ UV cp;
+
+ /* The code allows all scripts to use the ASCII digits. This
+ * is because they are used in commerce even in scripts that
+ * have their own set. Hence any ASCII ones found are ok,
+ * unless a digit from another set has already been
+ * encountered. (The other digit ranges in Common are not
+ * similarly blessed */
+ if (UNLIKELY(isDIGIT(*s))) {
+ if (zero_of_run > 0) {
+ if (zero_of_run != '0') {
+ Safefree(intersection);
+ sayNO;
+ }
+ }
+ else {
+ zero_of_run = '0';
+ }
+ s++;
+ continue;
+ }
+
+ /* Here, isn't an ASCII digit. Find the code point of the
+ * character */
+ if (utf8_target && ! UTF8_IS_INVARIANT(*s)) {
+ Size_t len;
+ cp = valid_utf8_to_uvchr((U8 *) s, &len);
+ s += len;
+ }
+ else {
+ cp = *(s++);
+ }
+
+ /* If is within the range [+0 .. +9] of the script's zero, it
+ * also is a digit in that script. We can skip the rest of
+ * this code for this character. */
+ if (UNLIKELY( zero_of_run > 0
+ && cp >= zero_of_run
+ && cp - zero_of_run <= 9))
+ {
+ continue;
+ }
+
+ /* Find the character's script. The correct values are
+ * hard-coded here for small-enough code points. */
+ if (cp < 0x2B9) { /* From inspection of Unicode db; extremely
+ unlikely to change */
+ if ( cp > 255
+ || ( isALPHA_L1(cp)
+ && LIKELY(cp != MICRO_SIGN_NATIVE)))
+ {
+ script_of_char = SCX_Latin;
+ }
+ else {
+ script_of_char = SCX_Common;
+ }
+ }
+ else {
+ script_of_char = _Perl_SCX_invmap[
+ _invlist_search(PL_SCX_invlist, cp)];
+ }
+
+ /* We arbitrarily accept a single unassigned character, but not
+ * in combination with anything else, and not a run of them. */
+ if ( UNLIKELY(script_of_run == SCX_Unknown)
+ || UNLIKELY( script_of_run != SCX_INVALID
+ && script_of_char == SCX_Unknown))
+ {
+ Safefree(intersection);
+ sayNO;
+ }
+
+ if (UNLIKELY(script_of_char == SCX_Unknown)) {
+ script_of_run = SCX_Unknown;
+ continue;
+ }
+
+ /* We accept 'inherited' script characters currently even at
+ * the beginning. (We know that no characters in Inherited
+ * are digits, or we'd have to check for that) */
+ if (UNLIKELY(script_of_char == SCX_Inherited)) {
+ continue;
+ }
+
+ /* If unknown, the run's script is set to the char's */
+ if (UNLIKELY(script_of_run == SCX_INVALID)) {
+ script_of_run = script_of_char;
+ }
+
+ /* All decimal digits must be from the same sequence of 10.
+ * Above, we handled any ASCII digits without descending to
+ * here. We also handled the case where we already knew what
+ * digit sequence is the one to use, and the character is in
+ * that sequence. Now that we know the script, we can use
+ * script_zeros[] to directly find which sequence the script
+ * uses, except in a few cases it returns 0 */
+ if (UNLIKELY(zero_of_run == 0) && script_of_char >= 0) {
+ zero_of_run = script_zeros[script_of_char];
+ }
+
+ /* Now we can see if the script of the character is the same as
+ * that of the run, or 'Common' which is considered to be in
+ * every script */
+ if (LIKELY( script_of_char == script_of_run
+ || script_of_char == SCX_Common))
+ { /* By far the most common case */
+ goto scripts_match;
+ }
+
+#ifndef HAS_SCX_AUX_TABLES
+
+ /* Too early a Unicode version to have a code point belonging
+ * to more than one script, so, if the scripts don't exactly
+ * match, fail */
+ Safefree(intersection);
+ sayNO;
+
+#else
+
+ /* Here there is no exact match between the character's script
+ * and the run's. Negative script numbers signify that the
+ * value may be any of several scripts, and we need to look at
+ * auxiliary information to make our deterimination. But if
+ * both are non-negative, we can fail now */
+ if (LIKELY(script_of_char >= 0)) {
+ const SCX_enum * search_in;
+ PERL_UINT_FAST8_T search_in_len;
+ PERL_UINT_FAST8_T i;
+
+ if (LIKELY(script_of_run >= 0)) {
+ Safefree(intersection);
+ sayNO;
+ }
+
+ /* Use any previously constructed set of possible scripts.
+ * */
+ if (intersection) {
+ search_in = intersection;
+ search_in_len = intersection_len;
+ }
+ else {
+ search_in = SCX_AUX_TABLE_ptrs[-script_of_run];
+ search_in_len = SCX_AUX_TABLE_lengths[-script_of_run];
+ }
+
+ for (i = 0; i < search_in_len; i++) {
+ if (search_in[i] == script_of_char) {
+ script_of_run = script_of_char;
+ goto scripts_match;
+ }
+ }
+
+ Safefree(intersection);
+ sayNO;
+ }
+ else if (LIKELY(script_of_run >= 0)) {
+ /* script of character could be one of several, but run is
+ * a single script */
+ const SCX_enum * search_in
+ = SCX_AUX_TABLE_ptrs[-script_of_char];
+ const PERL_UINT_FAST8_T search_in_len
+ = SCX_AUX_TABLE_lengths[-script_of_char];
+ PERL_UINT_FAST8_T i;
+
+ for (i = 0; i < search_in_len; i++) {
+ if (search_in[i] == script_of_run) {
+ script_of_char = script_of_run;
+ goto scripts_match;
+ }
+ }
+
+ Safefree(intersection);
+ sayNO;
+ }
+ else {
+ /* Both run and char could be in one of several scripts.
+ * If the intersection is empty, then this character isn't
+ * in this script run. Otherwise, we need to calculate the
+ * intersection to use for future iterations of the loop,
+ * unless we are already at the final character */
+ const SCX_enum * search_char
+ = SCX_AUX_TABLE_ptrs[-script_of_char];
+ const PERL_UINT_FAST8_T char_len
+ = SCX_AUX_TABLE_lengths[-script_of_char];
+ const SCX_enum * search_run;
+ PERL_UINT_FAST8_T run_len;
+
+ SCX_enum * new_overlap = NULL;
+ PERL_UINT_FAST8_T i, j;
+
+ if (intersection) {
+ search_run = intersection;
+ run_len = intersection_len;
+ }
+ else {
+ search_run = SCX_AUX_TABLE_ptrs[-script_of_run];
+ run_len = SCX_AUX_TABLE_lengths[-script_of_run];
+ }
+
+ intersection_len = 0;
+
+ for (i = 0; i < run_len; i++) {
+ for (j = 0; j < char_len; j++) {
+ if (search_run[i] == search_char[j]) {
+
+ /* Here, the script at i,j matches. That means
+ * this character is in the run. But continue
+ * on to find the complete intersection, for
+ * the next loop iteration, and for the digit
+ * check after it.
+ *
+ * On the first found common script, we malloc
+ * space for the intersection list for the worst
+ * case of the intersection, which is the
+ * minimum of the number of scripts remaining
+ * in each set. */
+ if (intersection_len == 0) {
+ Newx(new_overlap,
+ MIN(run_len - i, char_len - j),
+ SCX_enum);
+ }
+ new_overlap[intersection_len++] = search_run[i];
+ }
+ }
+ }
+
+ /* Here we've looked through everything. If they have no
+ * scripts in common, not a run */
+ if (intersection_len == 0) {
+ Safefree(intersection);
+ sayNO;
+ }
+
+
+ /* If there is only a single script in common, set to that.
+ * Otherwise, use the intersection going forward */
+ Safefree(intersection);
+ if (intersection_len == 1) {
+ script_of_run = script_of_char = new_overlap[0];
+ Safefree(new_overlap);
+ }
+ else {
+ intersection = new_overlap;
+ }
+ }
+
+#endif
+
+ scripts_match: ;
+
+ /* Here, the script of the character is compatible with that of
+ * the run. Either they match exactly, or one or both can be
+ * any of several scripts, and the intersection is not empty.
+ * If the character is not a decimal digit, we are done with
+ * it. Otherwise, it could still fail if it is from a
+ * different set of 10 than seen already (or we may not have
+ * seen any, and we need to set the sequence). If we have
+ * determined a single script and that script only has one set
+ * of digits (almost all scripts are like that), then this
+ * isn't a problem, as any digit must come from the same
+ * sequence. The only scripts that have multiple sequences
+ * have been constructed to be 0 in 'script_zeros[]'.
+ *
+ * Here we check if it is a digit. */
+ if ( cp >= FIRST_NON_ASCII_DECIMAL_DIGIT
+ && ( ( zero_of_run == 0
+ || ( ( script_of_char >= 0
+ && script_zeros[script_of_char] == 0)
+ || intersection))))
+ {
+ SSize_t range_zero_index;
+ range_zero_index = _invlist_search(decimals_invlist, cp);
+ if ( LIKELY(range_zero_index >= 0)
+ && ELEMENT_RANGE_MATCHES_INVLIST(range_zero_index))
+ {
+ UV range_zero = decimals_array[range_zero_index];
+ if (zero_of_run) {
+ if (zero_of_run != range_zero) {
+ Safefree(intersection);
+ sayNO;
+ }
+ }
+ else {
+ zero_of_run = range_zero;
+ }
+ }
+ }
+ } /* end of looping through CLOSESR text */
+
+ Safefree(intersection);
+ break;
+ }
+
case ACCEPT: /* (*ACCEPT) */
if (scan->flags)
sv_yes_mark = MUTABLE_SV(rexi->data->data[ ARG( scan ) ]);
}
PL_GCB_invlist = sv_dup_inc(proto_perl->IGCB_invlist, param);
PL_SB_invlist = sv_dup_inc(proto_perl->ISB_invlist, param);
+ PL_SCX_invlist = sv_dup_inc(proto_perl->ISCX_invlist, param);
PL_WB_invlist = sv_dup_inc(proto_perl->IWB_invlist, param);
PL_seen_deprecated_macro = hv_dup_inc(proto_perl->Iseen_deprecated_macro, param);
PL_utf8_mark = sv_dup_inc(proto_perl->Iutf8_mark, param);
'm/\cß/' => "Character following \"\\c\" must be printable ASCII",
'/((?# This is a comment in the middle of a token)?:foo)/' => 'In \'(?...)\', the \'(\' and \'?\' must be adjacent {#} m/((?# This is a comment in the middle of a token)?{#}:foo)/',
'/((?# This is a comment in the middle of a token)*FAIL)/' => 'In \'(*VERB...)\', the \'(\' and \'*\' must be adjacent {#} m/((?# This is a comment in the middle of a token)*{#}FAIL)/',
+ '/((?# This is a comment in the middle of a token)+script_run:foo)/' => 'In \'(+...)\', the \'(\' and \'+\' must be adjacent {#} m/((?# This is a comment in the middle of a token)+{#}script_run:foo)/',
+
+ '/(+script_runfoo)/' => 'Unknown (+ pattern {#} m/(+script_runfoo{#})/',
'/(?[\ &!])/' => 'Incomplete expression within \'(?[ ])\' {#} m/(?[\ &!{#}])/', # [perl #126180]
'/(?[\ +!])/' => 'Incomplete expression within \'(?[ ])\' {#} m/(?[\ +!{#}])/', # [perl #126180]
'/(?[\ -!])/' => 'Incomplete expression within \'(?[ ])\' {#} m/(?[\ -!{#}])/', # [perl #126180]
'/noutf8 ネ (?[ [\tネ] ])/' => 'The regex_sets feature is experimental {#} m/noutf8 ネ (?[{#} [\tネ] ])/',
);
+my @experimental_script_run = (
+ '/(+script_run:paypal.com)/' => 'The script_run feature is experimental {#} m/(+script_run:{#}paypal.com)/',
+ 'use utf8; /utf8 ネ (+script_run:ネ)/' => do { use utf8; 'The script_run feature is experimental {#} m/utf8 ネ (+script_run:{#}ネ)/' },
+ '/noutf8 ネ (+script_run:ネ)/' => 'The script_run feature is experimental {#} m/noutf8 ネ (+script_run:{#}ネ)/',
+);
+
my @deprecated = (
'/^{/' => "",
'/foo|{/' => "",
}
else {
no warnings 'experimental::regex_sets';
+ no warnings 'experimental::script_run';
no warnings 'experimental::re_strict';
warning_is(sub {
}
}
- foreach my $ref (\@warning_tests, \@experimental_regex_sets, \@deprecated) {
+ foreach my $ref (\@warning_tests,
+ \@experimental_regex_sets,
+ \@experimental_script_run,
+ \@deprecated)
+ {
my $warning_type;
my $turn_off_warnings = "";
my $default_on;
if ($ref == \@warning_tests) {
$warning_type = 'regexp, digit';
- $turn_off_warnings = "no warnings 'experimental::regex_sets';";
+ $turn_off_warnings = "no warnings 'experimental::regex_sets';"
+ . "no warnings 'experimental::script_run';";
$default_on = $strict;
}
elsif ($ref == \@deprecated) {
$warning_type = 'regexp, deprecated';
$default_on = 1;
}
- else {
+ elsif ($ref == \@experimental_regex_sets) {
$warning_type = 'experimental::regex_sets';
$default_on = 1;
}
+ elsif ($ref == \@experimental_script_run) {
+ $warning_type = 'experimental::script_run';
+ $default_on = 1;
+ }
+ else {
+ fail("$0: Internal error: Unexpected loop variable");
+ }
+
for (my $i = 0; $i < @$ref; $i += 2) {
my $this_default_on = $default_on;
my $regex = $ref->[$i];
--- /dev/null
+BEGIN {
+ chdir 't' if -d 't';
+ require './test.pl';
+ set_up_inc('../lib');
+}
+
+use strict;
+use warnings;
+
+$|=1;
+
+# The Script_Extension property has only recently become reasonably stable, so
+# later Unicode releases may change things. Some of these tests were
+# designed to provide more code covereage in regexec.c, so changes in it or
+# later Standards could cause them to not test what they originally were aimed
+# to do.
+
+no warnings "experimental::script_run";
+
+my $script_run = qr/ (+script_run: ^ .+ $ ) /x;
+
+unlike("\N{CYRILLIC SMALL LETTER ER}\N{CYRILLIC SMALL LETTER A}\N{CYRILLIC SMALL LETTER U}}\N{CYRILLIC SMALL LETTER ER}\N{CYRILLIC SMALL LETTER A}l", $script_run, "Cyrillic 'paypal' with a Latin 'l' is not a script run");
+unlike("A\N{GREEK CAPITAL LETTER GAMMA}", $script_run, "Latin followed by Greek isn't a script run");
+
+like("\N{CYRILLIC THOUSANDS SIGN}\N{COMBINING CYRILLIC TITLO}", $script_run, "Cyrillic followed by Permic-Arabic is Arabic");
+like("\N{OLD PERMIC LETTER AN}\N{COMBINING CYRILLIC TITLO}", $script_run, "Permic followed by Permic-Arabic is Permic");
+unlike("\N{GLAGOLITIC CAPITAL LETTER AZU}\N{COMBINING CYRILLIC TITLO}", $script_run, "Glagolithic followed by Permic-Arabic isn't a script run");
+
+like("\N{CYRILLIC THOUSANDS SIGN}\N{COMBINING CYRILLIC PALATALIZATION}", $script_run, "Cyrillic followed by Glagolithic-Arabic is Arabic");
+like("\N{GLAGOLITIC CAPITAL LETTER AZU}\N{COMBINING CYRILLIC PALATALIZATION}", $script_run, "Glagolithic followed by Glagolithic-Arabic is Glagolithic");
+unlike("\N{OLD PERMIC LETTER AN}\N{COMBINING CYRILLIC PALATALIZATION}", $script_run, "Permic followed by Glagolithic-Arabic isn't a script run");
+
+like("\N{ARABIC-INDIC DIGIT ZERO}\N{ARABIC-INDIC DIGIT ONE}\N{ARABIC-INDIC DIGIT TWO}\N{ARABIC-INDIC DIGIT THREE}\N{ARABIC COMMA}\N{ARABIC-INDIC DIGIT FOUR}\N{THAANA LETTER HAA}", $script_run, "Arabic-Thaana chars followed by Thaana is Thaana");
+unlike("\N{ARABIC-INDIC DIGIT ZERO}\N{ARABIC-INDIC DIGIT ONE}A", $script_run, "Arabic-Thaana chars followed by Latin isn't a script run");
+like("\N{ARABIC-INDIC DIGIT ZERO}\N{ARABIC-INDIC DIGIT ONE}\N{ARABIC-INDIC DIGIT TWO}\N{ARABIC-INDIC DIGIT THREE}\N{ARABIC COMMA}\N{ARABIC-INDIC DIGIT FOUR}\N{ARABIC NUMBER SIGN}", $script_run, "Arabic-Thaana chars followed by Arabic is Arabic");
+unlike("\N{ARABIC-INDIC DIGIT ZERO}\N{ARABIC-INDIC DIGIT ONE}\N{ARABIC-INDIC DIGIT TWO}\N{ARABIC-INDIC DIGIT THREE}\N{EXTENDED ARABIC-INDIC DIGIT NINE}", $script_run, "Arabic-Thaana digits followed by an Arabic digit from a different sequence isn't a script run");
+like("\N{ARABIC-INDIC DIGIT ZERO}\N{ARABIC-INDIC DIGIT ONE}\N{ARABIC-INDIC DIGIT TWO}\N{ARABIC-INDIC DIGIT THREE}\N{THAANA LETTER HAA}", $script_run, "Arabic-Thaana digits followed by a Thaana leter is a script run");
+
+# The next tests are at a hard-coded boundary in regexec.c at the time of this
+# writing (U+02B9/02BA).
+like("abc\N{MODIFIER LETTER SMALL Y}", $script_run, "All Latin is a script run");
+like("abc\N{MODIFIER LETTER PRIME}", $script_run, "Latin then Common is a script run");
+like("\N{HEBREW LETTER ALEF}\N{HEBREW LETTER TAV}\N{MODIFIER LETTER PRIME}", $script_run, "Hebrew then Common is a script run");
+unlike("\N{HEBREW LETTER ALEF}\N{HEBREW LETTER TAV}\N{MODIFIER LETTER SMALL Y}", $script_run, "Hebrew then Latin isn't a script run");
+like("9876543210\N{DESERET SMALL LETTER WU}", $script_run, "0-9 are the digits for Deseret");
+like("\N{DESERET SMALL LETTER WU}9876543210", $script_run, "Also when they aren't in the initial position");
+unlike("\N{DESERET SMALL LETTER WU}\N{FULLWIDTH DIGIT FIVE}", $script_run, "Fullwidth digits aren't the digits for Deseret");
+unlike("\N{FULLWIDTH DIGIT SIX}\N{DESERET SMALL LETTER LONG I}", $script_run, "... likewise if the digits come first");
+
+like("1234567890\N{ARABIC LETTER ALEF}", $script_run, "[0-9] work for Arabic");
+unlike("1234567890\N{ARABIC LETTER ALEF}\N{ARABIC-INDIC DIGIT FOUR}\N{ARABIC-INDIC DIGIT FIVE}", $script_run, "... but not in combination with real ARABIC digits");
+unlike("\N{ARABIC LETTER ALEF}\N{ARABIC-INDIC DIGIT SIX}\N{ARABIC-INDIC DIGIT SEVEN}1", $script_run, "... nor when the ARABIC digits come before them");
+
+# This exercises the case where the script zero but not the script is
+# ambiguous until a non-ambiguous digit is found.
+like("\N{ARABIC LETTER ALEF}\N{EXTENDED ARABIC-INDIC DIGIT EIGHT}", $script_run, "ARABIC with a Shia digit is a script run");
+
+like("\N{U+03A2}", $script_run, "A single unassigned code point is a script run");
+unlike("\N{U+03A2}\N{U+03A2}", $script_run, "But not more than one");
+unlike("A\N{U+03A2}", $script_run, "... and not in combination with an assigned one");
+unlike("\N{U+03A2}A", $script_run, "... in either order");
+
+like("A\N{COMBINING GRAVE ACCENT}", $script_run, "An inherited script matches others");
+like("\N{COMBINING GRAVE ACCENT}A", $script_run, "... even if first in the sequence");
+
+like("\N{DEVANAGARI DOUBLE DANDA}\N{DEVANAGARI DANDA}\N{DEVANAGARI STRESS SIGN UDATTA}\N{DEVANAGARI STRESS SIGN ANUDATTA}\N{NORTH INDIC FRACTION ONE QUARTER}\N{NORTH INDIC QUANTITY MARK}", $script_run, "A bunch of narrowing down of multiple possible scripts");
+
+unlike("\N{JAVANESE PANGRANGKEP}\N{GEORGIAN PARAGRAPH SEPARATOR}", $script_run, "Two code points each in multiple scripts that don't intersect aren't a script run");
+like("\N{DEVANAGARI SIGN CANDRABINDU VIRAMA}\N{VEDIC TONE YAJURVEDIC KATHAKA INDEPENDENT SVARITA}", $script_run, "Two code points each in multiple scripts that 't intersect singly are a script run");
+
+use utf8;
+
+# From UTS 39
+like("写真だけの結婚式", $script_run, "Mixed Hiragana and Han");
+
+done_testing();