design deficiencies, and nowadays, there is a series of "UTF-8
locales", based on Unicode. These are locales whose character set is
Unicode, encoded in UTF-8. Starting in v5.20, Perl fully supports
-UTF-8 locales, except for sorting and string comparisions. (Use
-L<Unicode::Collate> for these.) Perl continues to support the old
-non UTF-8 locales as well.
+UTF-8 locales, except for sorting and string comparisons like C<lt> and
+C<ge>. (Use L<Unicode::Collate> for these.) Perl continues to support
+the old non-UTF-8 locales as well. There are currently no UTF-8 locales
+for EBCDIC platforms.
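+
+A minimal sketch of that workaround follows. It is illustrative only:
+the word list and variable names are made up, and the collator uses the
+default Unicode Collation Algorithm ordering with no locale tailoring.
+
+    use strict;
+    use warnings;
+    use Unicode::Collate;
+
+    # Default UCA ordering; not tailored to any particular locale.
+    my $collator = Unicode::Collate->new;
+
+    # Use this instead of a plain sort() when in a UTF-8 locale.
+    my @sorted = $collator->sort(qw(cherry Apple banana));
+
+    # Method counterparts of the lt and ge string operators.
+    my $is_before  = $collator->lt('Apple', 'banana');
+    my $is_at_or_after = $collator->ge('cherry', 'banana');
+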
(Unicode is also creating C<CLDR>, the "Common Locale Data Repository",
L<http://cldr.unicode.org/> which includes more types of information than
are available in the POSIX locale system. At the time of this writing,
there was no CPAN module that provides access to this XML-encoded data.
-However, many of its locales have the POSIX-only data extracted, and are
-available as UTF-8 locales at
-L<http://unicode.org/Public/cldr/latest/>.)
+However, it is possible to compute the POSIX locale data from it, and
+earlier CLDR versions had these already extracted for you as UTF-8 locales
+at L<http://unicode.org/Public/cldr/2.0.1/>.)
=head1 WHAT IS A LOCALE
=head1 PREPARING TO USE LOCALES
-Perl itself will not use locales unless specifically requested to (but
+Perl itself (outside the L<POSIX> module) will not use locales unless
+specifically requested to (but
again note that Perl may interact with code that does use them). Even
if there is such a request, B<all> of the following must be true
for it to work properly:
=head2 The C<"use locale"> pragma
-By default, Perl itself ignores the current locale. The S<C<use locale>>
+WARNING! Do NOT use this pragma in scripts that have multiple
+L<threads|threads> active. The locale is not local to a single thread.
+Another thread may change the locale at any time, so that, at a minimum,
+a given thread may end up operating in a locale it isn't expecting.
+On some platforms, segfaults can also occur. The locale
+change need not be explicit; some operations cause perl to change the
+locale itself. You are vulnerable simply by having done a C<"use
+locale">.
+
+By default, Perl itself (outside the L<POSIX> module)
+ignores the current locale. The S<C<use locale>>
pragma tells Perl to use the current locale for some operations.
Starting in v5.16, there are optional parameters to this pragma,
described below, which restrict which operations are affected by it.
L<POSIX> module. Some of those functions are always affected by the
current locale. For example, C<POSIX::strftime()> uses C<LC_TIME>;
C<POSIX::strtod()> uses C<LC_NUMERIC>; C<POSIX::strcoll()> and
-C<POSIX::strxfrm()> use C<LC_COLLATE>; and character classification
-functions like C<POSIX::isalnum()> use C<LC_CTYPE>. All such functions
+C<POSIX::strxfrm()> use C<LC_COLLATE>. All such functions
will behave according to the current underlying locale, even if that
locale isn't exposed to Perl space.
=back
+Note that all C programs (including the perl interpreter, which is
+written in C) always have an underlying locale. That locale is the "C"
+locale unless changed by a call to L<setlocale()|/The setlocale
+function>. When Perl starts up, it changes the underlying locale to the
+one which is indicated by the L</ENVIRONMENT>. When using the L<POSIX>
+module or writing XS code, it is important to keep in mind that the
+underlying locale may be something other than "C", even if the program
+hasn't explicitly changed it.
+
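+As a minimal sketch of this point (an illustration, not a required
+idiom; the variable name is made up), a program can inspect the
+underlying locale without changing it by calling C<POSIX::setlocale()>
+with only a category argument:
+
+    use strict;
+    use warnings;
+    use POSIX qw(setlocale LC_CTYPE);
+
+    # With no second argument, setlocale() only queries: it returns
+    # the name of the current underlying locale for the category.
+    my $underlying = setlocale(LC_CTYPE);
+    print "Underlying LC_CTYPE locale: ",
+          (defined $underlying ? $underlying : '(unknown)'), "\n";
+
+The name printed depends on the environment the program was started in,
+so it may well be something other than "C".
+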
=for comment
The nbsp below makes this look better (though not great)
=item *
Regular expression patterns can be compiled using
-L<qrE<sol>E<sol>|perlop/qrE<sol>STRINGE<sol>msixpodual> with actual
+L<qrE<sol>E<sol>|perlop/qrE<sol>STRINGE<sol>msixpodualn> with actual
matching deferred to later. Again, it is whether or not the compilation
was done within the scope of C<use locale> that determines the match
behavior, not if the matches are done within such a scope or not.
=head2 The setlocale function
+WARNING! Do NOT use this function in a L<thread|threads>. The locale
+will change in all other threads at the same time, and should the
+operating system pause your thread and start another, that other
+thread will not have the locale it is expecting. On some platforms,
+there can be a race leading to segfaults if two threads call this
+function nearly simultaneously.
+
You can switch locales as often as you wish at run time with the
C<POSIX::setlocale()> function:
# restore the old locale
setlocale(LC_CTYPE, $old_locale);
-This simultaneously affects all threads of the program, so it may be
-problematic to use locales in threaded applications except where there
-is a single locale applicable to all threads.
-
The first argument of C<setlocale()> gives the B<category>, the second the
B<locale>. The category tells in what aspect of data processing you
want to apply locale-specific rules. Category names are discussed in
locale inconsistencies or to run Perl under the default locale "C".
Perl's moaning about locale problems can be silenced by setting the
-environment variable C<PERL_BADLANG> to a zero value, for example "0".
+environment variable C<PERL_BADLANG> to "0" or "".
This method really just sweeps the problem under the carpet: you tell
Perl to shut up even when Perl sees that something is wrong. Do not
be surprised if later something locale-dependent misbehaves.
strings and C<s///> substitutions; and case-independent regular expression
pattern matching using the C<i> modifier.
-Finally, C<LC_CTYPE> affects the (deprecated) POSIX character-class test
-functions--C<POSIX::isalpha()>, C<POSIX::islower()>, and so on. For
-example, if you move from the "C" locale to a 7-bit ISO 646 one,
-you may find--possibly to your surprise--that C<"|"> moves from the
-C<POSIX::ispunct()> class to C<POSIX::isalpha()>.
-Unfortunately, this creates big problems for regular expressions. "|" still
-means alternation even though it matches C<\w>. Starting in v5.22, a
-warning will be raised when such a locale is switched into. More
-details are given several paragraphs further down.
-
Starting in v5.20, Perl supports UTF-8 locales for C<LC_CTYPE>, but
otherwise Perl only supports single-byte locales, such as the ISO 8859
series. This means that wide character locales, for example for Asian
using the C<locale> warning category, whenever such a locale is switched
into.) The UTF-8 locale support is actually a
superset of POSIX locales, because it is really full Unicode behavior
-as if no locale were in effect at all (except for tainting; see
-L</SECURITY>). POSIX locales, even UTF-8 ones,
+as if no C<LC_CTYPE> locale were in effect at all (except for tainting;
+see L</SECURITY>). POSIX locales, even UTF-8 ones,
are lacking certain concepts in Unicode, such as the idea that changing
the case of a character could expand to be more than one character.
Perl in a UTF-8 locale, will give you that expansion. Prior to v5.20,
Starting in v5.22, Perl will by default warn when switching into a
locale that redefines any ASCII printable character (plus C<\t> and
-C<\n>) into a different class than expected. This is unlikely to
-happen on modern locales, but can happen with the ISO 646 and other
+C<\n>) into a different class than expected. On modern locales, this
+is likely to happen only on EBCDIC platforms, where, for example,
+a CCSID 0037 locale on a CCSID 1047 machine moves C<"[">. It can also
+happen on ASCII platforms with the ISO 646 and other
7-bit locales that are essentially obsolete. Things may still work,
depending on what features of Perl are used by the program. For
example, in the example from above where C<"|"> becomes a C<\w>, and
Results are never tainted.
-=item *
-
-B<POSIX character class tests> (C<POSIX::isalnum()>,
-C<POSIX::isalpha()>, C<POSIX::isdigit()>, C<POSIX::isgraph()>,
-C<POSIX::islower()>, C<POSIX::isprint()>, C<POSIX::ispunct()>,
-C<POSIX::isspace()>, C<POSIX::isupper()>, C<POSIX::isxdigit()>):
-
-True/false results are never tainted.
-
=back
Three examples illustrate locale-dependent tainting.
=item PERL_SKIP_LOCALE_INIT
-This environment variable, available starting in Perl v5.20, and if it
-evaluates to a TRUE value, tells Perl to not use the rest of the
+This environment variable, available starting in Perl v5.20, tells Perl,
+if set (to any value), not to use the rest of the
environment variables to initialize with. Instead, Perl uses whatever
the current locale settings are. This is particularly useful in
embedded environments, see
at startup. Failure can occur if the locale support in the operating
system is lacking (broken) in some way--or if you mistyped the name of
a locale when you set up your environment. If this environment
-variable is absent, or has a value that does not evaluate to integer
-zero--that is, "0" or ""-- Perl will complain about locale setting
-failures.
+variable is absent, or has a value other than "0" or "", Perl will
+complain about locale setting failures.
B<NOTE>: C<PERL_BADLANG> only gives you a way to hide the warning message.
The message tells about some problem in your system's locale support,
instead a "path" (":"-separated list) of I<languages> (not locales).
See the GNU C<gettext> library documentation for more information.
-=item C<LC_CTYPE>.
+=item C<LC_CTYPE>
In the absence of C<LC_ALL>, C<LC_CTYPE> chooses the character type
locale. In the absence of both C<LC_ALL> and C<LC_CTYPE>, C<LANG>
C<LANG> is the "catch-all" locale environment variable. If it is set, it
is used as the last resort after the overall C<LC_ALL> and the
-category-specific C<LC_I<foo>>
+category-specific C<LC_I<foo>>.
=back
The Unicode CLDR project extracts the POSIX portion of many of its
locales, available at
- http://unicode.org/Public/cldr/latest/
+ http://unicode.org/Public/cldr/2.0.1/
+
+(Newer versions of CLDR require you to compute the POSIX data yourself.
+See L<http://unicode.org/Public/cldr/latest/>.)
There is a large collection of locale definitions at:
The only multi-byte (or wide character) locale that Perl is ever likely
to support is UTF-8. This is due to the difficulty of implementation,
the fact that high quality UTF-8 locales are now published for every
-area of the world (L<http://unicode.org/Public/cldr/latest/>), and that
+area of the world (L<http://unicode.org/Public/cldr/2.0.1/> for
+ones that are already set up, but from an earlier version;
+L<http://unicode.org/Public/cldr/latest/> for the most up-to-date, but
+you have to extract the POSIX information yourself), and that
failing all that you can use the L<Encode> module to translate to/from
your locale. So, you'll have to do one of those things if you're using
one of these locales, such as Big5 or Shift JIS. For UTF-8 locales, in
points meaning the same character. Thus in a Greek locale, both U+03A7
and U+00D7 are GREEK CAPITAL LETTER CHI.
+Because of all these problems, starting in v5.22, Perl will raise a
+warning if a multi-byte (hence Unicode) code point is used when a
+single-byte locale is in effect. (It skips this check, however, when
+checking would unreasonably slow execution down.)
+
Vendor locales are notoriously buggy, and it is difficult for Perl to test
its locale-handling code because this interacts with code that Perl has no
control over; therefore the locale-handling code in Perl may be buggy as