-/* If the bitmap doesn't fully represent what this ANYOF node can match, the
- * ARG is set to this special value (since 0, 1, ... are legal, but will never
- * reach this high). */
-#define ANYOF_NONBITMAP_EMPTY ((U32) -1)
-
-/* The information used to be stored as as combination of the ANYOF_UTF8 and
- * ANYOF_NONBITMAP_NON_UTF8 bits in the flags field, but was moved out of there
- * to free up a bit for other uses. This tries to hide the change from
- * existing code as much as possible. Now, the data structure that goes in ARG
- * is not allocated unless it is needed, and that is what is used to determine
- * if there is something outside the bitmap. The code now assumes that if
- * that structure exists, that any UTF-8 encoded string should be tried against
- * it, but a non-UTF8-encoded string will be tried only if the
- * ANYOF_NONBITMAP_NON_UTF8 bit is also set. */
-#define ANYOF_NONBITMAP(node) (ARG(node) != ANYOF_NONBITMAP_EMPTY)
-
-/* Flags for node->flags of ANYOF. These are in short supply, with none
- * currently available. If more are needed, the ANYOF_LOCALE and
- * ANYOF_POSIXL bits could be shared, making a space penalty for all locale
- * nodes. Also, the ABOVE_LATIN1_ALL bit could be freed up by resorting to
- * creating a swash containing everything above 255. This introduces a
- * performance penalty. Better would be to split it off into a separate node,
- * which actually would improve performance a bit by allowing regexec.c to test
- * for a UTF-8 character being above 255 without having to call a function nor
- * calculate its code point value. Several flags are not used in synthetic
- * start class (SSC) nodes, so could be shared should new flags be needed for
- * SSCs. */
-
-/* regexec.c is expecting this to be in the low bit */
-#define ANYOF_INVERT 0x01
-
-/* For the SSC node only, which cannot be inverted, so is shared with that bit.
- * This means "Does this SSC match an empty string?" This is used only during
- * regex compilation. */
-#define ANYOF_EMPTY_STRING ANYOF_INVERT
+/* An ANYOF node is basically a bitmap with the index being a code point. If
+ * the bit for that code point is 1, the code point matches; if 0, it doesn't
+ * match (complemented if inverted). There is an additional mechanism to deal
+ * with cases where the bitmap is insufficient in and of itself. This #define
+ * indicates if the bitmap does fully represent what this ANYOF node can match.
+ * The ARG is set to this special value (since 0, 1, ... are legal, but will
+ * never reach this high). */
+#define ANYOF_ONLY_HAS_BITMAP ((U32) -1)
+
+/* When the bimap isn't completely sufficient for handling the ANYOF node,
+ * flags (in node->flags of the ANYOF node) get set to indicate this. These
+ * are perennially in short supply. Beyond several cases where warnings need
+ * to be raised under certain circumstances, currently, there are six cases
+ * where the bitmap alone isn't sufficient. We could use six flags to
+ * represent the 6 cases, but to save flags bits, we play some games. The
+ * cases are:
+ *
+ * 1) The bitmap has a compiled-in very finite size. So something else needs
+ * to be used to specify if a code point that is too large for the bitmap
+ * actually matches. The mechanism currently is a swash or inversion
+ * list. ANYOF_ONLY_HAS_BITMAP, described above, being TRUE indicates
+ * there are no matches of too-large code points. But if it is FALSE,
+ * then almost certainly there are matches too large for the bitmap. (The
+ * other cases, described below, either imply this one or are extremely
+ * rare in practice.) So we can just assume that a too-large code point
+ * will need something beyond the bitmap if ANYOF_ONLY_HAS_BITMAP is
+ * FALSE, instead of having a separate flag for this.
+ * 2) A subset of item 1) is if all possible code points outside the bitmap
+ * match. This is a common occurrence when the class is complemented,
+ * like /[^ij]/. Therefore a bit is reserved to indicate this,
+ * rather than having an expensive swash created,
+ * ANYOF_MATCHES_ALL_ABOVE_BITMAP.
+ * 3) Under /d rules, it can happen that code points that are in the upper
+ * latin1 range (\x80-\xFF or their equivalents on EBCDIC platforms) match
+ * only if the runtime target string being matched against is UTF-8. For
+ * example /[\w[:punct:]]/d. This happens only for posix classes (with a
+ * couple of exceptions, like \d where it doesn't happen), and all such
+ * ones also have above-bitmap matches. Thus, 3) implies 1) as well.
+ * Note that /d rules are no longer encouraged; 'use 5.14' or higher
+ * deselects them. But a flag is required so that they can be properly
+ * handled. But it can be a shared flag: see 5) below.
+ * 4) Also under /d rules, something like /[\Wfoo]/ will match everything in
+ * the \x80-\xFF range, unless the string being matched against is UTF-8.
+ * A swash could be created for this case, but this is relatively common,
+ * and it turns out that it's all or nothing: if any one of these code
+ * points matches, they all do. Hence a single bit suffices. We use a
+ * shared flag that doesn't take up space by itself:
+ * ANYOF_SHARED_d_MATCHES_ALL_NON_UTF8_NON_ASCII_non_d_WARN_SUPER.
+ * This also implies 1), with one exception: [:^cntrl:].
+ * 5) A user-defined \p{} property may not have been defined by the time the
+ * regex is compiled. In this case, we don't know until runtime what it
+ * will match, so we have to assume it could match anything, including
+ * code points that ordinarily would be in the bitmap. A flag bit is
+ * necessary to indicate this , though it can be shared with the item 3)
+ * flag, as that only occurs under /d, and this only occurs under non-d.
+ * This case is quite uncommon in the field, and the /(?[ ...])/ construct
+ * is a better way to accomplish what this feature does. This case also
+ * implies 1).
+ * ANYOF_SHARED_d_UPPER_LATIN1_UTF8_STRING_MATCHES_non_d_RUNTIME_USER_PROP
+ * is the shared flag.
+ * 6) /[foo]/il may have folds that are only valid if the runtime locale is a
+ * UTF-8 one. These are quite rare, so it would be good to avoid the
+ * expense of looking for them. But /l matching is slow anyway, and we've
+ * traditionally not worried too much about its performance. And this
+ * condition requires the ANYOFL_FOLD flag to be set, so testing for
+ * that flag would be sufficient to rule out most cases of this. So it is
+ * unclear if this should have a flag or not. But, this flag can be
+ * shared with another, so it doesn't occupy extra space.
+ *
+ * At the moment, there is one spare bit, but this could be increased by
+ * various tricks.
+ *
+ * If just one more bit is needed, at this writing it seems to khw that the
+ * best choice would be to make ANYOF_MATCHES_ALL_ABOVE_BITMAP not a flag, but
+ * something like
+ *
+ * #define ANYOF_MATCHES_ALL_ABOVE_BITMAP ((U32) -2)
+ *
+ * and access it through the ARG like ANYOF_ONLY_HAS_BITMAP is. This flag is
+ * used by all ANYOF node types, and it could be used to avoid calling the
+ * handler function, as the macro REGINCLASS in regexec.c does now for other
+ * cases.
+ *
+ * Another possibility is to instead (or additionally) rename the ANYOF_POSIXL
+ * flag to be ANYOFL_LARGE, to mean that the ANYOF node has an extra 32 bits
+ * beyond what a regular one does. That's what it effectively means now, with
+ * the extra space all for the POSIX class flags. But those classes actually
+ * only occupy 30 bits, so the ANYOFL_FOLD and
+ * ANYOFL_SHARED_UTF8_LOCALE_fold_HAS_MATCHES_nonfold_REQD flags could be moved
+ * to that extra space. The 30 bits in the extra word would indicate if a
+ * posix class should be looked up or not. The downside of this is that ANYOFL
+ * nodes with folding would always have to have the extra space allocated, even
+ * if they didn't use the 30 posix bits. There isn't an SSC problem as all
+ * SSCs are this large anyway.
+ *
+ * One could completely remove ANYOFL_LARGE and make all ANYOFL nodes large.
+ * REGINCLASS would have to be modified so that if the node type were this, it
+ * would call reginclass(), as the flag bit that indicates to do this now would
+ * be gone.
+ *
+ * All told, 5 bits could be available for other uses if all of the above were
+ * done.
+ *
+ * Some flags are not used in synthetic start class (SSC) nodes, so could be
+ * shared should new flags be needed for SSCs, like SSC_MATCHES_EMPTY_STRING
+ * now. */