locale.c: Revamp finding if locale is UTF-8
author Karl Williamson <khw@cpan.org>
Fri, 5 Jan 2018 21:09:40 +0000 (14:09 -0700)
committer Karl Williamson <khw@cpan.org>
Wed, 31 Jan 2018 13:33:02 +0000 (06:33 -0700)
commit 0dec74cdf5288e3984f799818ed991a90bc9675b
tree 970941e4599744dab59e416b05f968e64b049604
parent d707d7795b9381c06848df78c3ae9a7d5c364292

This changes how Perl determines whether the LC_CTYPE locale is UTF-8.
On systems that have nl_langinfo(), one can get a "definitive" answer
from that alone.  Otherwise (or if it doesn't return a usable result),
one can use mbtowc() to check whether the UTF-8 byte sequence for the
Unicode REPLACEMENT CHARACTER is actually converted to that code point.
This is also "definitive".  If the locale's maximum number of bytes per
character is too small to hold every Unicode code point in UTF-8, we
know without further checking that this isn't a UTF-8 locale, and so
can skip the mbtowc() check.
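
The checks described above can be sketched roughly as follows.  This is
an illustration only, not the code in locale.c: the function names are
hypothetical, the threshold of 4 bytes assumes the longest UTF-8
encoding of a Unicode code point, and the comparison against 0xFFFD
assumes the platform's wchar_t holds Unicode code points.

```c
#include <ctype.h>
#include <langinfo.h>
#include <stdbool.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: case-insensitive match of "UTF-8" or "UTF8". */
static bool codeset_name_is_utf8(const char *name)
{
    static const char prefix[] = "utf";
    size_t i;

    for (i = 0; i < 3; i++) {
        if (tolower((unsigned char) name[i]) != prefix[i])
            return false;
    }
    name += 3;
    if (*name == '-')
        name++;
    return name[0] == '8' && name[1] == '\0';
}

/* Hypothetical helper: does the current LC_CTYPE locale convert the
 * UTF-8 bytes of U+FFFD REPLACEMENT CHARACTER to that code point?
 * Assumes wchar_t values are Unicode code points. */
static bool replacement_char_converts(void)
{
    wchar_t wc = 0;
    int len;

    /* Too few bytes per character to cover all of Unicode in UTF-8
     * (assumed maximum: 4), so this cannot be a UTF-8 locale. */
    if (MB_CUR_MAX < 4)
        return false;

    (void) mbtowc(NULL, NULL, 0);           /* reset any shift state */
    len = mbtowc(&wc, "\xEF\xBF\xBD", 3);   /* U+FFFD encoded as UTF-8 */
    return len == 3 && wc == (wchar_t) 0xFFFD;
}

/* Hypothetical entry point: ask nl_langinfo() first, and fall back to
 * the mbtowc() probe if it returns nothing usable. */
static bool ctype_locale_is_utf8(void)
{
    const char *codeset = nl_langinfo(CODESET);

    if (codeset != NULL && *codeset != '\0')
        return codeset_name_is_utf8(codeset);
    return replacement_char_converts();
}
```

A caller would first select the locale, for example with
setlocale(LC_CTYPE, ""), and then call ctype_locale_is_utf8().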

It turns out, from testing, that some locales are labelled UTF-8 by
nl_langinfo() even though they depart from UTF-8 at times.  Similarly
for mbtowc().  Perl assumes that a locale labelled UTF-8 doesn't depart
from it, and uses its own internal rules, which it knows are UTF-8.  A
future commit will warn when a locale does depart.
locale.c