over. You're on your own about bounds checking, though, so don't use it
lightly.
-All bytes in a multi-byte UTF8 character will have the high bit set, so
-you can test if you need to do something special with this character
-like this:
+All bytes in a multi-byte UTF8 character have the high bit set, so you
+can test whether a character needs special handling like this
+(C<UTF8_IS_CONTINUED()> is a macro that tests whether a byte is part
+of a multi-byte UTF8 character):
- UV uv;
+ U8 *utf;
+ UV uv; /* Note: a UV, not a U8, not a char */
- if (utf & 0x80)
+ if (UTF8_IS_CONTINUED(*utf))
/* Must treat this as UTF8 */
uv = utf8_to_uv(utf);
else
value of the character; the inverse function C<uv_to_utf8> is available
for putting a UV into UTF8:
- if (uv > 0x80)
+ if (UTF8_IS_CONTINUED(uv))
/* Must treat this as UTF8 */
utf8 = uv_to_utf8(utf8, uv);
else
not it's dealing with UTF8 data, so that it can handle the string
appropriately.
+Merely passing an SV to an XS function and copying the SV's data does
+not copy the UTF8 flag, and passing just a C<char *> to an XS function
+is even less adequate, since a bare C<char *> carries no flag at all.
+
=head2 How do I convert a string to UTF8?
If you're mixing UTF8 and non-UTF8 strings, you might find it necessary
=item *
If a string is UTF8, B<always> use C<utf8_to_uv> to get at the value,
-unless C<!(*s & 0x80)> in which case you can use C<*s>.
+unless C<!UTF8_IS_CONTINUED(*s)> in which case you can use C<*s>.
=item *
-When writing to a UTF8 string, B<always> use C<uv_to_utf8>, unless
-C<uv < 0x80> in which case you can use C<*s = uv>.
+When writing a character C<uv> to a UTF8 string, B<always> use
+C<uv_to_utf8>, unless C<!UTF8_IS_CONTINUED(uv)> in which case
+you can use C<*s = uv>.
=item *
=head2 Using Unicode in XS
-If you want to handle Perl Unicode in XS extensions, you may find
-the following C APIs useful. See L<perlapi> for details.
+If you want to handle Perl Unicode in XS extensions, you may find the
+following C APIs useful. See also L<perlguts/"Unicode Support"> for an
+explanation of Unicode at the XS level, and L<perlapi> for the API
+details.
=over 4