X-Git-Url: https://perl5.git.perl.org/perl5.git/blobdiff_plain/2c6ee1a7a1ce7cff7755f9aa43a65b8278dd82a1..370c71c555fdc393b7abe16f74b496061898b884:/pod/perllocale.pod

diff --git a/pod/perllocale.pod b/pod/perllocale.pod
index a5d776a..a44ffbc 100644
--- a/pod/perllocale.pod
+++ b/pod/perllocale.pod
@@ -191,7 +191,7 @@ follows:
 
 =item *
 
-The current locale is also used when going outside of Perl with
+The current locale is used when going outside of Perl with
 operations like L or LE|perlop/qxESTRINGE>, if those operations are
 locale-sensitive.
 
@@ -211,17 +211,7 @@ locale isn't exposed to Perl space.
 
 XS modules for all categories but C get the underlying
 locale, and hence any C library functions they call will use that
-underlying locale.
-
-Perl tries to keep C set to C<"C">
-because too many modules are unable to cope with the decimal point in a
-floating point number not being a dot (it's a comma in many locales).
-Macros are provided for XS code to temporarily change to use the
-underlying locale when necessary; however buggy code that fails to
-restore when done can break other XS code (but not Perl code) in this
-regard. The API for these macros has not yet been nailed down, but will be
-during the course of v5.21. Send email to
-L for guidance.
+underlying locale. For more discussion, see L.
 
 =back
 
@@ -249,7 +239,7 @@ is.
 =item *
 
 Regular expression patterns can be compiled using
-LE|perlop/qrESTRINGEmsixpodual> with actual
+LE|perlop/qrESTRINGEmsixpodualn> with actual
 matching deferred to later. Again, it is whether or not the compilation
 was done within the scope of C that determines the match
 behavior, not if the matches are done within such a scope or not.
@@ -308,8 +298,8 @@ C, and C) use C
 
 =item *
 
-The variables L<$!|perlvar/$ERRNO> (and its synonyms C<$ERRNO> and
-C<$OS_ERROR>) and L<$^E|perlvar/$EXTENDED_OS_ERROR> (and its synonym
+B> (and its synonyms C<$ERRNO> and
+C<$OS_ERROR>) B> (and its synonym
 C<$EXTENDED_OS_ERROR>) when used as strings use C.
 
 =back
@@ -318,7 +308,7 @@ C<$EXTENDED_OS_ERROR>) when used as strings use C.
 The default behavior is restored with the S> pragma, or
 upon reaching the end of the block enclosing C.
 
-Note that C and C may be
+Note that C calls may be
 nested, and that what is in effect within an inner scope will revert to
 the outer scope's rules at the end of the inner scope.
 
@@ -416,6 +406,10 @@ C function:
         # restore the old locale
         setlocale(LC_CTYPE, $old_locale);
 
+This simultaneously affects all threads of the program, so it may be
+problematic to use locales in threaded applications except where there
+is a single locale applicable to all threads.
+
 The first argument of C gives the B, the second the
 B. The category tells in what aspect of data processing you
 want to apply locale-specific rules. Category names are discussed in
@@ -582,7 +576,7 @@ alphabetically in your system is called).
 
 You can test out changing these variables temporarily, and if the
 new settings seem to help, put those settings into your shell startup
-files. Consult your local documentation for the exact details. For in
+files. Consult your local documentation for the exact details. For
 Bourne-like shells (B, B, B, B):
 
         LC_ALL=en_US.ISO8859-1
@@ -594,7 +588,7 @@ locale "En_US"--and in Cshish shells (B, B)
 
         setenv LC_ALL en_US.ISO8859-1
 
-or if you have the "env" application you can do in any shell
+or if you have the "env" application you can do (in any shell)
 
         env LC_ALL=en_US.ISO8859-1 perl ...
 
@@ -761,7 +755,7 @@ alphabets, but where do "E" and "E" belong? And while "color"
 follows "chocolate" in English, what about in traditional Spanish?
 
 The following collations all make sense and you may meet any of them
-if you "use locale".
+if you C<"use locale">.
 
         A B C D E a b c d e
         A a B b C c D d E e
@@ -798,7 +792,7 @@ C<$equal_in_locale> will be true if the collation locale specifies a
 dictionary-like ordering that ignores space characters completely and
 which folds case.
 
-Perl only supports single-byte locales for C. This means
+Perl currently only supports single-byte locales for C. This means
 that a UTF-8 locale likely will just give you machine-native ordering.
 Use L for the full implementation of the Unicode
 Collation Algorithm.
@@ -857,23 +851,30 @@ information on all these.)
 
 The C locale also provides the map used in transliterating
 characters between lower and uppercase. This affects the case-mapping
-functions--C, C, C, C, and C; case-mapping
+functions--C, C, C, C, and C;
+case-mapping
 interpolation with C<\F>, C<\l>, C<\L>, C<\u>, or C<\U> in double-quoted
 strings and C substitutions; and case-independent regular expression
 pattern matching using the C modifier.
 
 Finally, C affects the (deprecated) POSIX character-class test
 functions--C, C, and so on. For
-example, if you move from the "C" locale to a 7-bit Scandinavian one,
-you may find--possibly to your surprise--that "|" moves from the
+example, if you move from the "C" locale to a 7-bit ISO 646 one,
+you may find--possibly to your surprise--that C<"|"> moves from the
 C class to C.
 Unfortunately, this creates big problems for regular expressions. "|" still
-means alternation even though it matches C<\w>.
+means alternation even though it matches C<\w>. Starting in v5.22, a
+warning will be raised when such a locale is switched into. More
+details are given several paragraphs further down.
 
 Starting in v5.20, Perl supports UTF-8 locales for C, but
 otherwise Perl only supports single-byte locales, such as the ISO 8859
 series. This means that wide character locales, for example for Asian
-languages, are not supported. The UTF-8 locale support is actually a
+languages, are not well-supported. (If the platform has the capability
+for Perl to detect such a locale, starting in Perl v5.22,
+L,
+using the C warning category, whenever such a locale is switched
+into.)
The UTF-8 locale support is actually a
 superset of POSIX locales, because it is really full Unicode behavior
 as if no locale were in effect at all (except for tainting; see
 L). POSIX locales, even UTF-8 ones,
@@ -886,11 +887,26 @@ For releases v5.16 and v5.18, C> could be
 used as a workaround for this (see L).
 
 Note that there are quite a few things that are unaffected by the
-current locale. All the escape sequences for particular characters,
+current locale. Any literal character is the native character for the
+given platform. Hence 'A' means the character at code point 65 on ASCII
+platforms, and 193 on EBCDIC. That may or may not be an 'A' in the
+current locale, if that locale even has an 'A'.
+Similarly, all the escape sequences for particular characters,
 C<\n> for example, always mean the platform's native one. This means,
 for example, that C<\N> in regular expressions (every character
 but new-line) works on the platform character set.
 
+Starting in v5.22, Perl will by default warn when switching into a
+locale that redefines any ASCII printable character (plus C<\t> and
+C<\n>) into a different class than expected. This is unlikely to
+happen on modern locales, but can happen with the ISO 646 and other
+7-bit locales that are essentially obsolete. Things may still work,
+depending on what features of Perl are used by the program. For
+example, in the example from above where C<"|"> becomes a C<\w>, and
+there are no regular expressions where this matters, the program may
+still work properly. The warning lists all the characters that
+it can determine could be adversely affected.
+
 B A broken or malicious C locale definition may result
 in clearly ineligible characters being considered to be alphanumeric by
 your application. For strict matching of (mundane) ASCII letters and
@@ -989,7 +1005,7 @@ results.
Here are a few possibilities:
 
 Regular expression checks for safe file names or mail addresses using
 C<\w> may be spoofed by an C locale that claims that
-characters such as "E" and "|" are alphanumeric.
+characters such as C<"E"> and C<"|"> are alphanumeric.
 
 =item *
 
@@ -1450,9 +1466,12 @@ the characters in the upper half of the Latin-1 range (128 - 255)
 properly under C. To see if a character is a particular type
 under a locale, Perl uses the functions like C. Your C
 library may not work for UTF-8 locales with those functions, instead
-only working under the newer wide library functions like C.
-However, they are treated like single-byte locales, and will have the
-restrictions described below.
+only working under the newer wide library functions like C,
+which Perl does not use.
+These multi-byte locales are treated like single-byte locales, and will
+have the restrictions described below. Starting in Perl v5.22 a warning
+message is raised when Perl detects a multi-byte locale that it doesn't
+fully support.
 
 For single-byte locales,
 Perl generally takes the tack to use locale rules on code points that can fit
@@ -1472,7 +1491,7 @@ Unicode, C<\p{Alpha}> will never match it, regardless of locale. A similar
 issue occurs with C<\N{...}>. Prior to v5.20, It is therefore a bad
 idea to use C<\p{}> or
 C<\N{}> under plain C--I you can guarantee that the
-locale will be a ISO8859-1. Use POSIX character classes instead.
+locale will be ISO8859-1. Use POSIX character classes instead.
 
 Another problem with this approach is that operations that cross the
 single byte/multiple byte boundary are not well-defined, and so are
@@ -1500,6 +1519,11 @@ Still another problem is that this approach can lead to two code
 points meaning the same character. Thus in a Greek locale, both U+03A7
 and U+00D7 are GREEK CAPITAL LETTER CHI.
 
+Because of all these problems, starting in v5.22, Perl will raise a
+warning if a multi-byte (hence Unicode) code point is used when a
+single-byte locale is in effect. (Although it doesn't check for this if
+doing so would unreasonably slow execution down.)
+
 Vendor locales are notoriously buggy, and it is difficult for Perl to test
 its locale-handling code because this interacts with code that Perl has no
 control over; therefore the locale-handling code in Perl may be buggy as
@@ -1524,9 +1548,9 @@ byte, and Unicode rules for those that can't is not uniformly applied.
 Pre-v5.12, it was somewhat haphazard; in v5.12 it was applied fairly
 consistently to regular expression matching except for bracketed
 character classes; in v5.14 it was extended to all regex matches; and in
-v5.16 to the casing operations such as C<"\L"> and C. For
-collation, in all releases, the system's C function is called,
-and whatever it does is what you get.
+v5.16 to the casing operations such as C<\L> and C. For
+collation, in all releases so far, the system's C function is
+called, and whatever it does is what you get.
 
 =head1 BUGS
 