=item *
-The current locale is also used when going outside of Perl with
+The current locale is used when going outside of Perl with
operations like L<system()|perlfunc/system LIST> or
L<qxE<sol>E<sol>|perlop/qxE<sol>STRINGE<sol>>, if those operations are
locale-sensitive.
XS modules for all categories but C<LC_NUMERIC> get the underlying
locale, and hence any C library functions they call will use that
-underlying locale.
-
-Perl tries to keep C<LC_NUMERIC> set to C<"C">
-because too many modules are unable to cope with the decimal point in a
-floating point number not being a dot (it's a comma in many locales).
-Macros are provided for XS code to temporarily change to use the
-underlying locale when necessary; however buggy code that fails to
-restore when done can break other XS code (but not Perl code) in this
-regard. The API for these macros has not yet been nailed down, but will be
-during the course of v5.21. Send email to
-L<mailto:perl5-porters@perl.org> for guidance.
+underlying locale. For more discussion, see L<perlxs/CAVEATS>.
=back
=item *
Regular expression patterns can be compiled using
-L<qrE<sol>E<sol>|perlop/qrE<sol>STRINGE<sol>msixpodual> with actual
+L<qrE<sol>E<sol>|perlop/qrE<sol>STRINGE<sol>msixpodualn> with actual
matching deferred to later. Again, it is whether or not the compilation
was done within the scope of C<use locale> that determines the match
behavior, not whether the matches are done within such a scope.
=item *
-The variables L<$!|perlvar/$ERRNO> (and its synonyms C<$ERRNO> and
-C<$OS_ERROR>) and L<$^E|perlvar/$EXTENDED_OS_ERROR> (and its synonym
+B<The variables L<$!|perlvar/$ERRNO>> (and its synonyms C<$ERRNO> and
+C<$OS_ERROR>) B<and L<$^E|perlvar/$EXTENDED_OS_ERROR>> (and its synonym
C<$EXTENDED_OS_ERROR>) when used as strings use C<LC_MESSAGES>.
=back
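As a minimal sketch of the last item, the following sets C<$!> by hand (purely for illustration) and reads it both as a string and as a number. The choice of C<ENOENT> is an assumption made for the example; the message text depends on whatever C<LC_MESSAGES> locale is in effect, so no particular wording should be expected.

```perl
use strict;
use warnings;
use Errno qw(ENOENT);

my ($message, $number);
{
    use locale;            # within this scope, "$!" follows LC_MESSAGES
    local $! = ENOENT;     # set the error number directly, for illustration
    $message = "$!";       # string form: wording depends on the locale
    $number  = $! + 0;     # numeric form: not affected by the locale
}

print "errno $number: $message\n";
```

Only the string form is locale-sensitive; code that compares error numbers is unaffected.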
The default behavior is restored with the S<C<no locale>> pragma, or
upon reaching the end of the block enclosing C<use locale>.
-Note that C<use locale> and C<use locale ':not_characters'> may be
+Note that C<use locale> calls may be
nested, and that what is in effect within an inner scope will revert to
the outer scope's rules at the end of the inner scope.
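The scoping rules can be observed, as a rough sketch, by compiling the same pattern in different scopes and printing the compiled objects. This relies on the assumption that a stringified C<qrE<sol>E<sol>> object shows an C<l> flag when it was compiled under locale rules, which should hold on perls from v5.14 onwards; the variable names are illustrative only.

```perl
use strict;
use warnings;

my ($outer, $inner, $after);
{
    use locale;
    $outer = qr/\w+/;        # compiled under locale rules
    {
        no locale;
        $inner = qr/\w+/;    # compiled under the default rules
    }
    $after = qr/\w+/;        # the outer scope's locale rules apply again
}
my $outside = qr/\w+/;       # compiled after the block: default rules

# The flags shown in each stringified pattern reflect the rules in
# effect where that pattern was compiled, not where it is printed.
print "$_\n" for $outer, $inner, $after, $outside;
```

Because the rules are fixed at compile time, matching against C<$outer> outside the block still uses locale rules.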
# restore the old locale
setlocale(LC_CTYPE, $old_locale);
+This simultaneously affects all threads of the program, so it may be
+problematic to use locales in threaded applications except where there
+is a single locale applicable to all threads.
+
The first argument of C<setlocale()> gives the B<category>, the second the
B<locale>. The category tells in what aspect of data processing you
want to apply locale-specific rules. Category names are discussed in
You can test out changing these variables temporarily, and if the
new settings seem to help, put those settings into your shell startup
-files. Consult your local documentation for the exact details. For in
+files. Consult your local documentation for the exact details. For
Bourne-like shells (B<sh>, B<ksh>, B<bash>, B<zsh>):
LC_ALL=en_US.ISO8859-1
setenv LC_ALL en_US.ISO8859-1
-or if you have the "env" application you can do in any shell
+or if you have the "env" application you can do (in any shell)
env LC_ALL=en_US.ISO8859-1 perl ...
"color" follows "chocolate" in English, what about in traditional Spanish?
The following collations all make sense and you may meet any of them
-if you "use locale".
+if you C<"use locale">.
A B C D E a b c d e
A a B b C c D d E e
dictionary-like ordering that ignores space characters completely and
which folds case.
-Perl only supports single-byte locales for C<LC_COLLATE>. This means
+Perl currently only supports single-byte locales for C<LC_COLLATE>. This means
that a UTF-8 locale likely will just give you machine-native ordering.
Use L<Unicode::Collate> for the full implementation of the Unicode
Collation Algorithm.
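A minimal sketch of that suggestion, using only the default L<Unicode::Collate> settings (no locale tailoring), contrasts Perl's native ordering with the Unicode Collation Algorithm. The word list is made up for the example.

```perl
use strict;
use warnings;
use Unicode::Collate;

my @words = qw(banana Apple cherry apple);

# Plain sort, no "use locale": machine-native (code point) ordering.
my @native = sort @words;

# Default Unicode Collation Algorithm ordering, independent of LC_COLLATE.
my $collator = Unicode::Collate->new;
my @uca      = $collator->sort(@words);

print "native: @native\n";
print "UCA:    @uca\n";
```

With these defaults, the native sort places C<Apple> before C<apple>, while the collator places C<apple> first. Language-specific tailoring is outside the scope of this sketch.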
The C<LC_CTYPE> locale also provides the map used in transliterating
characters between lower and uppercase. This affects the case-mapping
-functions--C<fc()>, C<lc()>, C<lcfirst()>, C<uc()>, and C<ucfirst()>; case-mapping
+functions--C<fc()>, C<lc()>, C<lcfirst()>, C<uc()>, and C<ucfirst()>;
+case-mapping
interpolation with C<\F>, C<\l>, C<\L>, C<\u>, or C<\U> in double-quoted
strings and C<s///> substitutions; and case-independent regular expression
pattern matching using the C<i> modifier.
Finally, C<LC_CTYPE> affects the (deprecated) POSIX character-class test
functions--C<POSIX::isalpha()>, C<POSIX::islower()>, and so on. For
-example, if you move from the "C" locale to a 7-bit Scandinavian one,
-you may find--possibly to your surprise--that "|" moves from the
+example, if you move from the "C" locale to a 7-bit ISO 646 one,
+you may find--possibly to your surprise--that C<"|"> moves from the
C<POSIX::ispunct()> class to C<POSIX::isalpha()>.
Unfortunately, this creates big problems for regular expressions. "|" still
-means alternation even though it matches C<\w>.
+means alternation even though it matches C<\w>. Starting in v5.22, a
+warning will be raised when such a locale is switched into. More
+details are given several paragraphs further down.
Starting in v5.20, Perl supports UTF-8 locales for C<LC_CTYPE>, but
otherwise Perl only supports single-byte locales, such as the ISO 8859
series. This means that wide character locales, for example for Asian
-languages, are not supported. The UTF-8 locale support is actually a
+languages, are not well supported. (If the platform has the capability
+for Perl to detect such a locale, starting in Perl v5.22,
+L<Perl will warn, default enabled|warnings/Category Hierarchy>,
+using the C<locale> warning category, whenever such a locale is switched
+into.) The UTF-8 locale support is actually a
superset of POSIX locales, because it is really full Unicode behavior
as if no locale were in effect at all (except for tainting; see
L</SECURITY>). POSIX locales, even UTF-8 ones,
used as a workaround for this (see L</Unicode and UTF-8>).
Note that there are quite a few things that are unaffected by the
-current locale. All the escape sequences for particular characters,
+current locale. Any literal character is the native character for the
+given platform. Hence 'A' means the character at code point 65 on ASCII
+platforms, and 193 on EBCDIC. That may or may not be an 'A' in the
+current locale, if that locale even has an 'A'.
+Similarly, all the escape sequences for particular characters,
C<\n> for example, always mean the platform's native one. This means,
for example, that C<\N> in regular expressions (every character
but new-line) works on the platform character set.
+Starting in v5.22, Perl will by default warn when switching into a
+locale that redefines any ASCII printable character (plus C<\t> and
+C<\n>) into a different class than expected. This is unlikely to
+happen on modern locales, but can happen with the ISO 646 and other
+7-bit locales that are essentially obsolete. Things may still work,
+depending on what features of Perl are used by the program. For
+example, in the case above where C<"|"> becomes a C<\w> character, if
+there are no regular expressions where this matters, the program may
+still work properly. The warning lists all the characters that
+it can determine could be adversely affected.
+
B<Note:> A broken or malicious C<LC_CTYPE> locale definition may result
in clearly ineligible characters being considered to be alphanumeric by
your application. For strict matching of (mundane) ASCII letters and
Regular expression checks for safe file names or mail addresses using
C<\w> may be spoofed by an C<LC_CTYPE> locale that claims that
-characters such as "E<gt>" and "|" are alphanumeric.
+characters such as C<"E<gt>"> and C<"|"> are alphanumeric.
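One way to sidestep this, sketched here under the assumption that only plain ASCII names should be accepted, is to spell out the allowed characters rather than rely on C<\w>. The helper C<is_safe_name> and the set of permitted characters are hypothetical, not part of any Perl API; using C<\w> with the C</a> modifier is another option.

```perl
use strict;
use warnings;

# Hypothetical helper: accept only names made of ASCII letters, digits,
# underscore, dot and hyphen. The explicit ranges are matched by code
# point, so an LC_CTYPE definition cannot widen them.
sub is_safe_name {
    my ($name) = @_;
    return $name =~ /\A[A-Za-z0-9_.-]+\z/ ? 1 : 0;
}

for my $candidate ('report_2024.txt', 'a|b', 'x>y') {
    printf "%-16s %s\n", $candidate,
        is_safe_name($candidate) ? 'accepted' : 'rejected';
}
```

The anchors C<\A> and C<\z> ensure that the whole string is checked, not just a prefix.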
=item *
properly under C<LC_CTYPE>. To see if a character is a particular type
under a locale, Perl uses the functions like C<isalnum()>. Your C
library may not work for UTF-8 locales with those functions, instead
-only working under the newer wide library functions like C<iswalnum()>.
-However, they are treated like single-byte locales, and will have the
-restrictions described below.
+only working under the newer wide library functions like C<iswalnum()>,
+which Perl does not use.
+These multi-byte locales are treated like single-byte locales, and will
+have the restrictions described below. Starting in Perl v5.22 a warning
+message is raised when Perl detects a multi-byte locale that it doesn't
+fully support.
For single-byte locales,
Perl generally takes the tack to use locale rules on code points that can fit
issue occurs with C<\N{...}>. Prior to v5.20, it is therefore a bad
idea to use C<\p{}> or
C<\N{}> under plain C<use locale>--I<unless> you can guarantee that the
-locale will be a ISO8859-1. Use POSIX character classes instead.
+locale will be ISO8859-1. Use POSIX character classes instead.
Another problem with this approach is that operations that cross the
single byte/multiple byte boundary are not well-defined, and so are
points meaning the same character. Thus in a Greek locale, both U+03A7
and U+00D7 are GREEK CAPITAL LETTER CHI.
+Because of all these problems, starting in v5.22, Perl will raise a
+warning if a multi-byte (hence Unicode) code point is used when a
+single-byte locale is in effect. (Perl does not check for this, however,
+if doing so would unreasonably slow execution down.)
+
Vendor locales are notoriously buggy, and it is difficult for Perl to test
its locale-handling code because this interacts with code that Perl has no
control over; therefore the locale-handling code in Perl may be buggy as
Pre-v5.12, it was somewhat haphazard; in v5.12 it was applied fairly
consistently to regular expression matching except for bracketed
character classes; in v5.14 it was extended to all regex matches; and in
-v5.16 to the casing operations such as C<"\L"> and C<uc()>. For
-collation, in all releases, the system's C<strxfrm()> function is called,
-and whatever it does is what you get.
+v5.16 to the casing operations such as C<\L> and C<uc()>. For
+collation, in all releases so far, the system's C<strxfrm()> function is
+called, and whatever it does is what you get.
=head1 BUGS