character is a valid UTF-8 character. The actual number of bytes in the UTF-8
character will be returned if it is valid, otherwise 0.
+WARNING: use only if you *know* that C<s> has at least UTF8_MAXBYTES or
+UTF8SKIP(s) bytes.
+
=cut */
STRLEN
Perl_is_utf8_char(const U8 *s)
Returns true if first C<len> bytes of the given string form a valid
UTF-8 string, false otherwise. If C<len> is 0, it will be calculated
-using C<strlen(s)>. Note that 'a valid UTF-8 string' does not mean 'a
-string that contains code points above 0x7F encoded in UTF-8' because a
-valid ASCII string is a valid UTF-8 string.
+using C<strlen(s)>, in which case C<s> must be terminated by a NUL byte. Note
+that a string consisting entirely of ASCII characters is 'a valid UTF-8
+string'.
See also is_ascii_string(), is_utf8_string_loclen(), and is_utf8_string_loc().
PERL_ARGS_ASSERT_IS_UTF8_STRING;
while (x < send) {
- STRLEN c;
/* Inline the easy bits of is_utf8_char() here for speed... */
- if (UTF8_IS_INVARIANT(*x))
- c = 1;
+ if (UTF8_IS_INVARIANT(*x)) {
+ x++;
+ }
else if (!UTF8_IS_START(*x))
- goto out;
+ return FALSE;
else {
/* ... and call is_utf8_char() only if really needed. */
-#ifdef IS_UTF8_CHAR
- c = UTF8SKIP(x);
+ const STRLEN c = UTF8SKIP(x);
+ const U8* const next_char_ptr = x + c;
+
+ if (next_char_ptr > send) {
+ return FALSE;
+ }
+
if (IS_UTF8_CHAR_FAST(c)) {
if (!IS_UTF8_CHAR(x, c))
- c = 0;
+ return FALSE;
}
- else
- c = is_utf8_char_slow(x, c);
-#else
- c = is_utf8_char(x);
-#endif /* #ifdef IS_UTF8_CHAR */
- if (!c)
- goto out;
+ else if (! is_utf8_char_slow(x, c)) {
+ return FALSE;
+ }
+ x = next_char_ptr;
}
- x += c;
}
- out:
- if (x != send)
- return FALSE;
-
return TRUE;
}
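The point of the rewritten loop above is that the declared character length is checked against C<send> before any trailing bytes are examined. A minimal standalone sketch of that ordering follows. It is not the code in this patch: sketch_skip() and sketch_is_utf8() are hypothetical names, only the 1 to 4 byte forms of standard UTF-8 are handled (Perl accepts longer sequences), and only lead and continuation bytes are checked.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for UTF8SKIP: the length implied by a lead byte,
 * or 0 if the byte cannot start a character. Handles 1-4 byte forms only. */
static size_t sketch_skip(unsigned char b)
{
    if (b < 0x80) return 1;
    if (b >= 0xC2 && b <= 0xDF) return 2;
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;
    return 0;
}

/* Structural check only: lead byte, then the right number of continuation
 * bytes. It does not reject overlong 3/4-byte forms or surrogates. */
static bool sketch_is_utf8(const unsigned char *s, size_t len)
{
    const unsigned char *x = s;
    const unsigned char *const send = s + len;

    while (x < send) {
        const size_t c = sketch_skip(*x);
        size_t i;

        if (c == 0)
            return false;

        /* Bounds check first: never look at bytes past the buffer end. */
        if (c > (size_t)(send - x))
            return false;

        for (i = 1; i < c; i++) {
            if ((x[i] & 0xC0) != 0x80)
                return false;
        }
        x += c;
    }
    return true;
}
```

With this ordering, a buffer that ends in the middle of a multi-byte character is rejected without reading past its last byte.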
PERL_ARGS_ASSERT_IS_UTF8_STRING_LOCLEN;
while (x < send) {
+ const U8* next_char_ptr;
+
/* Inline the easy bits of is_utf8_char() here for speed... */
if (UTF8_IS_INVARIANT(*x))
- c = 1;
+ next_char_ptr = x + 1;
else if (!UTF8_IS_START(*x))
goto out;
else {
/* ... and call is_utf8_char() only if really needed. */
-#ifdef IS_UTF8_CHAR
c = UTF8SKIP(x);
+ next_char_ptr = c + x;
+ if (next_char_ptr > send) {
+ goto out;
+ }
if (IS_UTF8_CHAR_FAST(c)) {
if (!IS_UTF8_CHAR(x, c))
c = 0;
} else
c = is_utf8_char_slow(x, c);
-#else
- c = is_utf8_char(x);
-#endif /* #ifdef IS_UTF8_CHAR */
if (!c)
goto out;
}
- x += c;
+ x = next_char_ptr;
outlen++;
}
Certain code points are considered problematic. These are Unicode surrogates,
Unicode non-characters, and code points above the Unicode maximum of 0x10FFFF.
By default these are considered regular code points, but certain situations
-warrant special handling for them. if C<flags> contains
+warrant special handling for them. If C<flags> contains
UTF8_DISALLOW_ILLEGAL_INTERCHANGE, all three classes are treated as
malformations and handled as such. The flags UTF8_DISALLOW_SURROGATE,
UTF8_DISALLOW_NONCHAR, and UTF8_DISALLOW_SUPER (meaning above the legal Unicode
the others that are above the Unicode legal maximum. There are several
reasons, one of which is that the original UTF-8 specification never went above
this number (the current 0x10FFFF limit was imposed later). The UTF-8 encoding
-on ASCII platforms for these large code point begins with a byte containing
+on ASCII platforms for these large code points begins with a byte containing
0xFE or 0xFF. The UTF8_DISALLOW_FE_FF flag will cause them to be treated as
malformations, while allowing smaller above-Unicode code points. (Of course
UTF8_DISALLOW_SUPER will treat all above-Unicode code points, including these,
}
/* Note:
- * A "swash" is a swatch hash.
- * A "swatch" is a bit vector generated by utf8.c:S_swash_get().
+ * Returns a "swash" which is a hash described in utf8.c:S_swash_fetch().
* C<pkg> is a pointer to a package name for SWASHNEW, should be "utf8".
* For other parameters, see utf8::SWASHNEW in lib/utf8_heavy.pl.
*/
* of the string C<ptr>. If C<do_utf8> is true, the string C<ptr> is
* assumed to be in utf8. If C<do_utf8> is false, the string C<ptr> is
* assumed to be in native 8-bit encoding. Caches the swatch in C<swash>.
+ *
+ * A "swash" is a hash which contains initially the keys/values set up by
+ * SWASHNEW. The purpose is to be able to completely represent a Unicode
+ * property for all possible code points. Things are stored in a compact form
+ * (see utf8_heavy.pl) so that calculation is required to find the actual
+ * property value for a given code point. As code points are looked up, new
+ * key/value pairs are added to the hash, so that the calculation never has
+ * to be redone. Further, each calculation covers not just the requested
+ * code point, but a whole block of code points adjacent to it.
+ * For binary properties on ASCII machines, the block is usually for 64 code
+ * points, starting with a code point evenly divisible by 64. Thus if the
+ * property value for code point 257 is requested, the code goes out and
+ * calculates the property values for all 64 code points between 256 and 319,
+ * and stores these as a single 64-bit long bit vector, called a "swatch",
+ * under the key for code point 256. The key is the UTF-8 encoding for code
+ * point 256, minus the final byte. Thus, if the length of the UTF-8 encoding
+ * for a code point is 13 bytes, the key will be 12 bytes long. If the value
+ * for code point 258 is then requested, this code realizes that it would be
+ * stored under the key for 256, and would find that value and extract the
+ * relevant bit, offset from 256.
+ *
+ * Non-binary properties are stored in as many bits as necessary to represent
+ * their values (32 currently, though the code is more general than that), not
+ * as single bits, but the principle is the same: the value for each key is a
+ * vector that encompasses the property values for all code points whose UTF-8
+ * representations are represented by the key. That is, for all code points
+ * whose UTF-8 representations are length N bytes, and the key is the first N-1
+ * bytes of that.
*/
UV
Perl_swash_fetch(pTHX_ SV *swash, const U8 *ptr, bool do_utf8)
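The key and bit-offset arithmetic described in the comment above can be sketched in isolation. This is an illustration under stated assumptions, not the lookup code in swash_fetch(): sketch_encode(), sketch_locate() and struct sketch_slot are hypothetical names, the encoder covers only standard UTF-8 from 0x80 through 0x10FFFF on an ASCII platform, and single-byte (invariant) code points are left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode cp (0x80..0x10FFFF) as standard UTF-8.
 * Returns the number of bytes written, or 0 if cp is outside that range. */
static size_t sketch_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80 || cp > 0x10FFFF)
        return 0;
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Where a binary property bit for a code point would live. */
struct sketch_slot {
    int ok;                 /* 0 if cp is outside the range handled here */
    unsigned char key[3];   /* UTF-8 encoding minus its final byte */
    size_t key_len;         /* encoding length minus 1 */
    uint32_t block_start;   /* first code point covered by the swatch */
    unsigned bit;           /* offset of cp within the 64-entry swatch */
};

static struct sketch_slot sketch_locate(uint32_t cp)
{
    struct sketch_slot slot = { 0, { 0, 0, 0 }, 0, 0, 0 };
    unsigned char enc[4];
    const size_t n = sketch_encode(cp, enc);
    size_t i;

    if (n == 0)
        return slot;

    slot.ok = 1;
    slot.key_len = n - 1;
    for (i = 0; i < n - 1; i++)
        slot.key[i] = enc[i];

    /* The final continuation byte carries the low 6 bits of cp, so it
     * selects one of 64 entries within the block. */
    slot.bit = enc[n - 1] & 0x3F;
    slot.block_start = cp - slot.bit;
    return slot;
}
```

For code point 257 this gives a one-byte key of 0xC4, a block starting at 256, and a bit offset of 1, which matches the worked example in the comment.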
SvCUR_set(swatch, scur);
s = (U8*)SvPVX(swatch);
- /* read $swash->{LIST} */
+ /* read $swash->{LIST}. XXX Note that this is a linear scan through a
+ * sorted list. A binary search would be much more efficient */
l = (U8*)SvPV(*listsvp, lcur);
lend = l + lcur;
while (l < lend) {
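For reference, the binary search that the XXX note suggests might look roughly like the following. This is a sketch only, and it assumes the ranges have already been parsed into a sorted, non-overlapping array, which the code in this patch does not do (it scans the LIST text directly). struct sketch_range, sketch_letters and sketch_in_ranges() are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parsed form of one LIST entry: an inclusive range. */
struct sketch_range {
    uint32_t start;
    uint32_t end;
};

/* Sample data for illustration: sorted by start, non-overlapping. */
static const struct sketch_range sketch_letters[] = {
    { 0x41, 0x5A },   /* A-Z */
    { 0x61, 0x7A },   /* a-z */
    { 0xC0, 0xD6 },
};

/* Returns true if cp falls inside any of the n sorted ranges. */
static bool sketch_in_ranges(const struct sketch_range *r, size_t n,
                             uint32_t cp)
{
    size_t lo = 0, hi = n;   /* search window is [lo, hi) */

    while (lo < hi) {
        const size_t mid = lo + (hi - lo) / 2;

        if (cp < r[mid].start)
            hi = mid;
        else if (cp > r[mid].end)
            lo = mid + 1;
        else
            return true;
    }
    return false;
}
```

The trade-off is the up-front cost of parsing the list into such an array, which only pays off if the same swash is consulted many times.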
Perl_check_utf8_print(pTHX_ register const U8* s, const STRLEN len)
{
/* May change: warns if surrogates, non-character code points, or
- * non-Unicode code points are in s which has length len. Returns TRUE if
- * none found; FALSE otherwise. The only other validity check is to make
- * sure that this won't exceed the string's length */
+ * non-Unicode code points are in s which has length len bytes. Returns
+ * TRUE if none found; FALSE otherwise. The only other validity check is
+ * to make sure that this won't exceed the string's length */
const U8* const e = s + len;
bool ok = TRUE;
"%s in %s", unees, PL_op ? OP_DESC(PL_op) : "print");
return FALSE;
}
- if (*s >= UTF8_FIRST_PROBLEMATIC_CODE_POINT_FIRST_BYTE) {
+ if (UNLIKELY(*s >= UTF8_FIRST_PROBLEMATIC_CODE_POINT_FIRST_BYTE)) {
STRLEN char_len;
if (UTF8_IS_SUPER(s)) {
if (ckWARN_d(WARN_NON_UNICODE)) {
STRLEN n1 = 0, n2 = 0; /* Number of bytes in current char */
U8 foldbuf1[UTF8_MAXBYTES_CASE+1];
U8 foldbuf2[UTF8_MAXBYTES_CASE+1];
- U8 natbuf[2]; /* Holds native 8-bit char converted to utf8;
- these always fit in 2 bytes */
PERL_ARGS_ASSERT_FOLDEQ_UTF8_FLAGS;
else if (u1) {
to_utf8_fold(p1, foldbuf1, &n1);
}
- else { /* Not utf8, convert to it first and then get fold */
- uvuni_to_utf8(natbuf, (UV) NATIVE_TO_UNI(((UV)*p1)));
- to_utf8_fold(natbuf, foldbuf1, &n1);
+ else { /* Not utf8, get utf8 fold */
+ to_uni_fold(NATIVE_TO_UNI(*p1), foldbuf1, &n1);
}
f1 = foldbuf1;
}
to_utf8_fold(p2, foldbuf2, &n2);
}
else {
- uvuni_to_utf8(natbuf, (UV) NATIVE_TO_UNI(((UV)*p2)));
- to_utf8_fold(natbuf, foldbuf2, &n2);
+ to_uni_fold(NATIVE_TO_UNI(*p2), foldbuf2, &n2);
}
f2 = foldbuf2;
}