Commit | Line | Data |
---|---|---|
5f05dabc | 1 | =head1 NAME |
2 | ||
b0c42ed9 | 3 | perllocale - Perl locale handling (internationalization and localization) |
5f05dabc | 4 | |
5 | =head1 DESCRIPTION | |
6 | ||
7 | Perl supports language-specific notions of data such as "is this a | |
14280422 DD |
8 | letter", "what is the upper-case equivalent of this letter", and "which |
9 | of these letters comes first". These are important issues, especially | |
10 | for languages other than English - but also for English: it would be | |
11 | very naE<iuml>ve to think that C<A-Za-z> defines all the "letters". Perl | |
12 | is also aware that some character other than '.' may be preferred as a | |
13 | decimal point, and that output date representations may be | |
14 | language-specific. The process of making an application take account of | |
15 | its users' preferences in such matters is called B<internationalization> | |
16 | (often abbreviated as B<i18n>); telling such an application about a | |
17 | particular set of preferences is known as B<localization> (B<l10n>). | |
18 | ||
19 | Perl can understand language-specific data via the standardized (ISO C, | |
20 | XPG4, POSIX 1.c) method called "the locale system". The locale system is | |
b0c42ed9 | 21 | controlled per application using one pragma, one function call, and |
14280422 DD |
22 | several environment variables. |
23 | ||
24 | B<NOTE>: This feature is new in Perl 5.004, and does not apply unless an | |
25 | application specifically requests it - see L<Backward compatibility>. | |
e38874e2 DD |
26 | The one exception is that write() now B<always> uses the current locale |
27 | - see L<"NOTES">. | |
5f05dabc | 28 | |
29 | =head1 PREPARING TO USE LOCALES | |
30 | ||
14280422 DD |
31 | If Perl applications are to be able to understand and present your data |
32 | correctly according to a locale of your choice, B<all> of the following | |
5f05dabc | 33 | must be true: |
34 | ||
35 | =over 4 | |
36 | ||
37 | =item * | |
38 | ||
39 | B<Your operating system must support the locale system>. If it does, | |
14280422 | 40 | you should find that the setlocale() function is a documented part of |
5f05dabc | 41 | its C library. |
42 | ||
43 | =item * | |
44 | ||
14280422 DD |
45 | B<Definitions for the locales which you use must be installed>. You, or |
46 | your system administrator, must make sure that this is the case. The | |
47 | available locales, the location in which they are kept, and the manner | |
48 | in which they are installed, vary from system to system. Some systems | |
49 | provide only a few, hard-wired, locales, and do not allow more to be | |
50 | added; others allow you to add "canned" locales provided by the system | |
51 | supplier; still others allow you or the system administrator to define | |
52 | and add arbitrary locales. (You may have to ask your supplier to | |
53 | provide canned locales which are not delivered with your operating | |
54 | system.) Read your system documentation for further illumination. | |
5f05dabc | 55 | |
56 | =item * | |
57 | ||
58 | B<Perl must believe that the locale system is supported>. If it does, | |
59 | C<perl -V:d_setlocale> will say that the value for C<d_setlocale> is | |
60 | C<define>. | |
61 | ||
62 | =back | |
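The third condition can also be checked from inside a program. The following is a minimal sketch, assuming only the standard Config module; the variable name $has_setlocale is illustrative and not part of any API.

```perl
use strict;
use warnings;
use Config;

# $Config{d_setlocale} holds the same value that `perl -V:d_setlocale` prints.
my $has_setlocale = defined $Config{d_setlocale}
                 && $Config{d_setlocale} eq 'define';

print $has_setlocale
    ? "This perl was built with setlocale() support\n"
    : "This perl was built without setlocale() support\n";
```

A program that depends on locale behavior could use such a check to warn early instead of silently producing locale-independent results.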
63 | ||
64 | If you want a Perl application to process and present your data | |
65 | according to a particular locale, the application code should include | |
14280422 | 66 | the S<C<use locale>> pragma (see L<The use locale pragma>) where |
5f05dabc | 67 | appropriate, and B<at least one> of the following must be true: |
68 | ||
69 | =over 4 | |
70 | ||
71 | =item * | |
72 | ||
14280422 DD |
73 | B<The locale-determining environment variables (see L<"ENVIRONMENT">) |
74 | must be correctly set up>, either by yourself, or by the person who set | |
75 | up your system account, at the time the application is started. | |
5f05dabc | 76 | |
77 | =item * | |
78 | ||
14280422 DD |
79 | B<The application must set its own locale> using the method described in |
80 | L<The setlocale function>. | |
5f05dabc | 81 | |
82 | =back | |
83 | ||
84 | =head1 USING LOCALES | |
85 | ||
86 | =head2 The use locale pragma | |
87 | ||
14280422 DD |
88 | By default, Perl ignores the current locale. The S<C<use locale>> |
89 | pragma tells Perl to use the current locale for some operations: | |
5f05dabc | 90 | |
91 | =over 4 | |
92 | ||
93 | =item * | |
94 | ||
14280422 DD |
95 | B<The comparison operators> (C<lt>, C<le>, C<cmp>, C<ge>, and C<gt>) and |
96 | the POSIX string collation functions strcoll() and strxfrm() use | |
97 | C<LC_COLLATE>. sort() is also affected if it is used without an | |
98 | explicit comparison function because it uses C<cmp> by default. | |
99 | ||
100 | B<Note:> C<eq> and C<ne> are unaffected by the locale: they always | |
101 | perform a byte-by-byte comparison of their scalar operands. What's | |
102 | more, if C<cmp> finds that its operands are equal according to the | |
103 | collation sequence specified by the current locale, it goes on to | |
104 | perform a byte-by-byte comparison, and only returns I<0> (equal) if the | |
105 | operands are bit-for-bit identical. If you really want to know whether | |
106 | two strings - which C<eq> and C<cmp> may consider different - are equal | |
107 | as far as collation in the locale is concerned, see the discussion in | |
108 | L<Category LC_COLLATE: Collation>. | |
5f05dabc | 109 | |
110 | =item * | |
111 | ||
14280422 DD |
112 | B<Regular expressions and case-modification functions> (uc(), lc(), |
113 | ucfirst(), and lcfirst()) use C<LC_CTYPE>. | |
5f05dabc | 114 | |
115 | =item * | |
116 | ||
14280422 | 117 | B<The formatting functions> (printf(), sprintf() and write()) use |
5f05dabc | 118 | C<LC_NUMERIC>. |
119 | ||
120 | =item * | |
121 | ||
14280422 | 122 | B<The POSIX date formatting function> (strftime()) uses C<LC_TIME>. |
5f05dabc | 123 | |
124 | =back | |
125 | ||
14280422 DD |
126 | C<LC_COLLATE>, C<LC_CTYPE>, and so on, are discussed further in L<LOCALE |
127 | CATEGORIES>. | |
5f05dabc | 128 | |
b0c42ed9 | 129 | The default behavior returns with S<C<no locale>> or on reaching the |
14280422 | 130 | end of the enclosing block. |
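The lexical scoping of the pragma can be seen in a short sketch. The word list and variable names are illustrative; the order printed for the first sort depends on the C<LC_COLLATE> locale in force when the program runs, whereas the second sort always uses the machine-native byte order.

```perl
use strict;
use warnings;

my @words = qw(banana Apple cherry apple);

my @locale_order;
{
    use locale;                   # locale-aware cmp inside this block only
    @locale_order = sort @words;
}

# Outside the block the default applies again: plain byte-wise comparison.
my @native_order = sort @words;

print "locale order: @locale_order\n";
print "native order: @native_order\n";
```

In many natural-language locales the first line would place "apple" next to "Apple" in dictionary fashion, but that is a property of the locale definition, not of Perl.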
5f05dabc | 131 | |
14280422 DD |
132 | Note that the string result of any operation that uses locale |
133 | information is tainted, as it is possible for a locale to be | |
134 | untrustworthy. See L<"SECURITY">. | |
5f05dabc | 135 | |
136 | =head2 The setlocale function | |
137 | ||
14280422 DD |
138 | You can switch locales as often as you wish at run time with the |
139 | POSIX::setlocale() function: | |
5f05dabc | 140 | |
141 | # This functionality not usable prior to Perl 5.004 | |
142 | require 5.004; | |
143 | ||
144 | # Import locale-handling tool set from POSIX module. | |
145 | # This example uses: setlocale -- the function call | |
146 | # LC_CTYPE -- explained below | |
147 | use POSIX qw(locale_h); | |
148 | ||
14280422 | 149 | # query and save the old locale |
5f05dabc | 150 | $old_locale = setlocale(LC_CTYPE); |
151 | ||
152 | setlocale(LC_CTYPE, "fr_CA.ISO8859-1"); | |
153 | # LC_CTYPE now in locale "French, Canada, codeset ISO 8859-1" | |
154 | ||
155 | setlocale(LC_CTYPE, ""); | |
156 | # LC_CTYPE now reset to default defined by LC_ALL/LC_CTYPE/LANG | |
157 | # environment variables. See below for documentation. | |
158 | ||
159 | # restore the old locale | |
160 | setlocale(LC_CTYPE, $old_locale); | |
161 | ||
14280422 DD |
162 | The first argument of setlocale() gives the B<category>, the second the |
163 | B<locale>. The category tells in what aspect of data processing you | |
164 | want to apply locale-specific rules. Category names are discussed in | |
165 | L<LOCALE CATEGORIES> and L<"ENVIRONMENT">. The locale is the name of a | |
166 | collection of customization information corresponding to a particular | |
167 | combination of language, country or territory, and codeset. Read on for | |
168 | hints on the naming of locales: not all systems name locales as in the | |
169 | example. | |
170 | ||
171 | If no second argument is provided, the function returns a string naming | |
172 | the current locale for the category. You can use this value as the | |
173 | second argument in a subsequent call to setlocale(). If a second | |
5f05dabc | 174 | argument is given and it corresponds to a valid locale, the locale for |
175 | the category is set to that value, and the function returns the | |
176 | now-current locale value. You can use this in a subsequent call to | |
14280422 | 177 | setlocale(). (In some implementations, the return value may sometimes |
5f05dabc | 178 | differ from the value you gave as the second argument - think of it as |
179 | an alias for the value that you gave.) | |
180 | ||
181 | As the example shows, if the second argument is an empty string, the | |
182 | category's locale is returned to the default specified by the | |
183 | corresponding environment variables. Generally, this results in a | |
184 | return to the default which was in force when Perl started up: changes | |
14280422 DD |
185 | to the environment made by the application after start-up may or may not |
186 | be noticed, depending on the implementation of your system's C library. | |
5f05dabc | 187 | |
14280422 DD |
188 | If the second argument does not correspond to a valid locale, the locale |
189 | for the category is not changed, and the function returns I<undef>. | |
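Because an unknown locale name leaves the category unchanged and yields an undefined value, a program can try several spellings of a locale name in turn. This is a sketch only: set_first_available_locale() is a hypothetical helper, and the candidate names are examples which may not exist on your system.

```perl
use strict;
use warnings;
use POSIX qw(locale_h);

# Hypothetical helper: try each candidate name in turn and return the value
# setlocale() reports for the first one the system accepts, or nothing
# (undef in scalar context) if every candidate is rejected.
sub set_first_available_locale {
    my ($category, @candidates) = @_;
    for my $name (@candidates) {
        my $result = setlocale($category, $name);
        return $result if defined $result;
    }
    return;
}

# "C" is listed last as a fallback that should be accepted everywhere.
my $chosen = set_first_available_locale(LC_CTYPE,
    "fr_CA.ISO8859-1", "fr_CA", "C");

print defined $chosen
    ? "LC_CTYPE is now $chosen\n"
    : "no candidate locale was accepted\n";
```

Ending the list with "C" means the helper should always succeed, at the cost of silently falling back to the default locale when none of the preferred names is installed.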
5f05dabc | 190 | |
14280422 DD |
191 | For further information about the categories, consult L<setlocale(3)>. |
192 | For the locales available in your system, also consult L<setlocale(3)> | |
193 | and see whether it leads you to the list of the available locales | |
194 | (search for the I<SEE ALSO> section). If that fails, try the following | |
195 | command lines: | |
5f05dabc | 196 | |
197 | locale -a | |
198 | ||
199 | nlsinfo | |
200 | ||
201 | ls /usr/lib/nls/loc | |
202 | ||
203 | ls /usr/lib/locale | |
204 | ||
205 | ls /usr/lib/nls | |
206 | ||
207 | and see whether they list something resembling these | |
208 | ||
2bdf8add JH |
209 | en_US.ISO8859-1 de_DE.ISO8859-1 ru_RU.ISO8859-5 |
210 | en_US de_DE ru_RU | |
14280422 | 211 | en de ru |
2bdf8add JH |
212 | english german russian |
213 | english.iso88591 german.iso88591 russian.iso88595 | |
5f05dabc | 214 | |
14280422 | 215 | Sadly, even though the calling interface for setlocale() has been |
2bdf8add JH |
216 | standardized, the names of the locales and the directories where |
217 | their configuration resides have not. The basic form of the name is | |
218 | I<language_country/territory>B<.>I<codeset>, but the | |
5f05dabc | 219 | latter parts are not always present. |
220 | ||
14280422 DD |
221 | Two special locales are worth particular mention: "C" and "POSIX". |
222 | Currently these are effectively the same locale: the difference is | |
223 | mainly that the first one is defined by the C standard and the second by | |
224 | the POSIX standard. What they define is the B<default locale> in which | |
225 | every program starts in the absence of locale information in its | |
226 | environment. (The default default locale, if you will.) Its language | |
227 | is (American) English and its character codeset ASCII. | |
5f05dabc | 228 | |
14280422 DD |
229 | B<NOTE>: Not all systems have the "POSIX" locale (not all systems are |
230 | POSIX-conformant), so use "C" when you need explicitly to specify this | |
231 | default locale. | |
5f05dabc | 232 | |
233 | =head2 The localeconv function | |
234 | ||
14280422 DD |
235 | The POSIX::localeconv() function allows you to get particulars of the |
236 | locale-dependent numeric formatting information specified by the current | |
237 | C<LC_NUMERIC> and C<LC_MONETARY> locales. (If you just want the name of | |
238 | the current locale for a particular category, use POSIX::setlocale() | |
239 | with a single parameter - see L<The setlocale function>.) | |
5f05dabc | 240 | |
241 | use POSIX qw(locale_h); | |
5f05dabc | 242 | |
243 | # Get a reference to a hash of locale-dependent info | |
244 | $locale_values = localeconv(); | |
245 | ||
246 | # Output sorted list of the values | |
247 | for (sort keys %$locale_values) { | |
14280422 | 248 | printf "%-20s = %s\n", $_, $locale_values->{$_} |
5f05dabc | 249 | } |
250 | ||
14280422 DD |
251 | localeconv() takes no arguments, and returns B<a reference to> a hash. |
252 | The keys of this hash are formatting variable names such as | |
253 | C<decimal_point> and C<thousands_sep>; the values are the corresponding | |
254 | values. See L<POSIX/localeconv> for a longer example, which lists | |
255 | all the categories an implementation might be expected to provide; some | |
256 | provide more and others fewer, however. Note that you don't need C<use | |
257 | locale>: as a function with the job of querying the locale, localeconv() | |
258 | always observes the current locale. | |
5f05dabc | 259 | |
260 | Here's a simple-minded example program which rewrites its command line | |
261 | parameters as integers formatted correctly in the current locale: | |
262 | ||
263 | # See comments in previous example | |
264 | require 5.004; | |
265 | use POSIX qw(locale_h); | |
5f05dabc | 266 | |
267 | # Get some of locale's numeric formatting parameters | |
268 | my ($thousands_sep, $grouping) = | |
14280422 | 269 | @{localeconv()}{'thousands_sep', 'grouping'}; |
5f05dabc | 270 | |
271 | # Apply defaults if values are missing | |
272 | $thousands_sep = ',' unless $thousands_sep; | |
273 | $grouping = $grouping ? unpack("C", $grouping) : 3; # first packed group size | |
274 | ||
275 | # Format command line params for current locale | |
14280422 DD |
276 | for (@ARGV) { |
277 | $_ = int; # Chop non-integer part | |
5f05dabc | 278 | 1 while |
14280422 DD |
279 | s/(\d)(\d{$grouping}($|\Q$thousands_sep\E))/$1$thousands_sep$2/; |
280 | print "$_ "; | |
5f05dabc | 281 | } |
282 | print "\n"; | |
283 | ||
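The same digit-grouping idea can be packaged as a function that takes the separator and group size as arguments, which makes it easy to try out independently of the current locale. This is a sketch under stated assumptions: group_digits() is a hypothetical helper, it applies a single group size throughout, and it assumes that localeconv() delivers C<grouping> as a packed string of small integers, of which only the first is used.

```perl
use strict;
use warnings;
use POSIX qw(locale_h);

# Hypothetical helper: insert $sep between groups of $size digits,
# working from the right-hand end of the integer part of $number.
sub group_digits {
    my ($number, $sep, $size) = @_;
    my $digits = int($number);            # chop non-integer part
    my $sign   = $digits < 0 ? '-' : '';
    $digits = abs($digits);
    return "$sign$digits" if !$size || $size < 1;   # nothing to group
    # Each pass splits $size digits off the leading ungrouped run.
    1 while $digits =~ s/^(\d+)(\d{$size})/$1$sep$2/;
    return $sign . $digits;
}

# Take the parameters from the current locale, with defaults when the
# locale supplies none (as the "C" locale typically does).
my $lc   = localeconv();
my $sep  = $lc->{thousands_sep} || ',';
my $size = $lc->{grouping} ? unpack("C", $lc->{grouping}) : 3;

print group_digits(1234567, $sep, $size), "\n";
print group_digits(1234567, ".", 3), "\n";    # explicit parameters
```

Because the separator appears only in the replacement text and never in the pattern, a separator such as "." needs no quoting here.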
5f05dabc | 284 | =head1 LOCALE CATEGORIES |
285 | ||
14280422 | 286 | The subsections which follow describe basic locale categories. As well |
5f05dabc | 287 | as these, there are some combination categories which allow the |
14280422 DD |
288 | manipulation of more than one basic category at a time. See |
289 | L<"ENVIRONMENT"> for a discussion of these. | |
5f05dabc | 290 | |
291 | =head2 Category LC_COLLATE: Collation | |
292 | ||
14280422 | 293 | When in the scope of S<C<use locale>>, Perl looks to the C<LC_COLLATE> |
5f05dabc | 294 | environment variable to determine the application's notions on the |
14280422 DD |
295 | collation (ordering) of characters. ('b' follows 'a' in Latin |
296 | alphabets, but where do 'E<aacute>' and 'E<aring>' belong?) | |
5f05dabc | 297 | |
298 | Here is a code snippet that will tell you what are the alphanumeric | |
299 | characters in the current locale, in the locale order: | |
300 | ||
301 | use locale; | |
302 | print +(sort grep /\w/, map { chr() } 0..255), "\n"; | |
303 | ||
14280422 DD |
304 | Compare this with the characters that you see and their order if you |
305 | state explicitly that the locale should be ignored: | |
5f05dabc | 306 | |
307 | no locale; | |
308 | print +(sort grep /\w/, map { chr() } 0..255), "\n"; | |
309 | ||
310 | This machine-native collation (which is what you get unless S<C<use | |
311 | locale>> has appeared earlier in the same block) must be used for | |
312 | sorting raw binary data, whereas the locale-dependent collation of the | |
b0c42ed9 | 313 | first example is useful for natural text. |
5f05dabc | 314 | |
14280422 DD |
315 | As noted in L<USING LOCALES>, C<cmp> compares according to the current |
316 | collation locale when C<use locale> is in effect, but falls back to a | |
317 | byte-by-byte comparison for strings which the locale says are equal. You | |
318 | can use POSIX::strcoll() if you don't want this fall-back: | |
319 | ||
320 | use POSIX qw(strcoll); | |
321 | $equal_in_locale = | |
322 | !strcoll("space and case ignored", "SpaceAndCaseIgnored"); | |
323 | ||
324 | $equal_in_locale will be true if the collation locale specifies a | |
325 | dictionary-like ordering which ignores space characters completely, and | |
326 | which folds case. Alternatively, you can use this idiom: | |
327 | ||
328 | use locale; | |
329 | $s_a = "space and case ignored"; | |
330 | $s_b = "SpaceAndCaseIgnored"; | |
331 | $equal_in_locale = $s_a ge $s_b && $s_a le $s_b; | |
332 | ||
333 | which works because neither C<le> nor C<ge> falls back to doing a | |
334 | byte-by-byte comparison when the operands are equal according to the | |
335 | locale. The idiom may be less efficient than using strcoll(), but, | |
336 | unlike that function, it is not confused by strings containing embedded | |
337 | nulls. | |
338 | ||
339 | If you have a single string which you want to check for "equality in | |
340 | locale" against several others, you might think you could gain a little | |
341 | efficiency by using POSIX::strxfrm() in conjunction with C<eq>: | |
342 | ||
343 | use POSIX qw(strxfrm); | |
344 | $xfrm_string = strxfrm("Mixed-case string"); | |
345 | print "locale collation ignores spaces\n" | |
346 | if $xfrm_string eq strxfrm("Mixed-casestring"); | |
347 | print "locale collation ignores hyphens\n" | |
348 | if $xfrm_string eq strxfrm("Mixedcase string"); | |
349 | print "locale collation ignores case\n" | |
350 | if $xfrm_string eq strxfrm("mixed-case string"); | |
351 | ||
352 | strxfrm() takes a string and maps it into a transformed string for use | |
353 | in byte-by-byte comparisons against other transformed strings during | |
354 | collation. "Under the hood", locale-affected Perl comparison operators | |
355 | call strxfrm() for both their operands, then do a byte-by-byte | |
356 | comparison of the transformed strings. By calling strxfrm() explicitly, | |
357 | and using a non locale-affected comparison, the example attempts to save | |
358 | a couple of transformations. In fact, it doesn't save anything: Perl | |
359 | magic (see L<perlguts/Magic>) creates the transformed version of a | |
360 | string the first time it's needed in a comparison, then keeps it around | |
361 | in case it's needed again. An example rewritten the easy way with | |
e38874e2 | 362 | C<cmp> runs just about as fast. It also copes with null characters |
14280422 | 363 | embedded in strings; if you call strxfrm() directly, it treats the first |
e38874e2 DD |
364 | null it finds as a terminator. And don't expect the transformed strings |
365 | it produces to be portable across systems - or even from one revision | |
366 | of your operating system to the next. In short, don't call strxfrm() | |
367 | directly: let Perl do it for you. | |
14280422 DD |
368 | |
369 | Note: C<use locale> isn't shown in some of these examples, as it isn't | |
370 | needed: strcoll() and strxfrm() exist only to generate locale-dependent | |
371 | results, and so always obey the current C<LC_COLLATE> locale. | |
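As a small illustration of calling strcoll() directly, the sketch below sorts with it and wraps the "equal in locale" test in a function. The collation locale is pinned to "C" purely so that the result is predictable; collates_equal() is a hypothetical helper name.

```perl
use strict;
use warnings;
use POSIX qw(locale_h strcoll);

# Pin collation to the "C" locale so that the result is predictable;
# a real application would normally keep the user's LC_COLLATE setting.
setlocale(LC_COLLATE, "C");

# strcoll() returns a negative, zero or positive number, like cmp.
my @sorted = sort { strcoll($a, $b) } qw(banana Apple cherry apple);
print "@sorted\n";

# Hypothetical helper: true if the current collation locale treats
# both strings as equal, with no byte-by-byte fall-back.
sub collates_equal {
    my ($left, $right) = @_;
    return strcoll($left, $right) == 0;
}

print "equal in locale\n" if collates_equal("apple", "apple");
```

As noted above, strcoll() stops at the first embedded null, so this approach is only suitable for strings known not to contain one.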
5f05dabc | 372 | |
373 | =head2 Category LC_CTYPE: Character Types | |
374 | ||
375 | When in the scope of S<C<use locale>>, Perl obeys the C<LC_CTYPE> locale | |
14280422 DD |
376 | setting. This controls the application's notion of which characters are |
377 | alphabetic. This affects Perl's C<\w> regular expression metanotation, | |
378 | which stands for alphanumeric characters - that is, alphabetic and | |
379 | numeric characters. (Consult L<perlre> for more information about | |
380 | regular expressions.) Thanks to C<LC_CTYPE>, depending on your locale | |
381 | setting, characters like 'E<aelig>', 'E<eth>', 'E<szlig>', and | |
382 | 'E<oslash>' may be understood as C<\w> characters. | |
5f05dabc | 383 | |
e38874e2 DD |
384 | The C<LC_CTYPE> locale also provides the map used in translating |
385 | characters between lower- and upper-case. This affects the case-mapping | |
386 | functions - lc(), lcfirst(), uc() and ucfirst(); case-mapping | |
387 | interpolation with C<\l>, C<\L>, C<\u> or C<\U> in double-quoted strings | |
388 | and in C<s///> substitutions; and case-independent regular expression | |
389 | pattern matching using the C<i> modifier. | |
390 | ||
391 | Finally, C<LC_CTYPE> affects the POSIX character-class test functions - | |
14280422 DD |
392 | isalpha(), islower() and so on. For example, if you move from the "C" |
393 | locale to a 7-bit Scandinavian one, you may find - possibly to your | |
394 | surprise - that "|" moves from the ispunct() class to isalpha(). | |
5f05dabc | 395 | |
14280422 DD |
396 | B<Note:> A broken or malicious C<LC_CTYPE> locale definition may result |
397 | in clearly ineligible characters being considered to be alphanumeric by | |
398 | your application. For strict matching of (unaccented) letters and | |
399 | digits - for example, in command strings - locale-aware applications | |
400 | should use C<\w> inside a C<no locale> block. See L<"SECURITY">. | |
5f05dabc | 401 | |
402 | =head2 Category LC_NUMERIC: Numeric Formatting | |
403 | ||
404 | When in the scope of S<C<use locale>>, Perl obeys the C<LC_NUMERIC> | |
14280422 DD |
405 | locale information, which controls an application's idea of how numbers |
406 | should be formatted for human readability by the printf(), sprintf(), | |
407 | and write() functions. String to numeric conversion by the | |
408 | POSIX::strtod() function is also affected. In most implementations the | |
409 | only effect is to change the character used for the decimal point - | |
410 | perhaps from '.' to ',': these functions aren't aware of such niceties | |
411 | as thousands separation and so on. (See L<The localeconv function> if | |
412 | you care about these things.) | |
413 | ||
414 | Note that output produced by print() is B<never> affected by the | |
5f05dabc | 415 | current locale: it is independent of whether C<use locale> or C<no |
14280422 | 416 | locale> is in effect, and corresponds to what you'd get from printf() |
5f05dabc | 417 | in the "C" locale. The same is true for Perl's internal conversions |
418 | between numeric and string formats: | |
419 | ||
420 | use POSIX qw(strtod); | |
421 | use locale; | |
14280422 | 422 | |
5f05dabc | 423 | $n = 5/2; # Assign numeric 2.5 to $n |
424 | ||
425 | $a = " $n"; # Locale-independent conversion to string | |
426 | ||
427 | print "half five is $n\n"; # Locale-independent output | |
428 | ||
429 | printf "half five is %g\n", $n; # Locale-dependent output | |
430 | ||
14280422 DD |
431 | print "DECIMAL POINT IS COMMA\n" |
432 | if $n == (strtod("2,5"))[0]; # Locale-dependent conversion | |
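When a locale-aware program must also emit numbers for another program to read, one option is to switch C<LC_NUMERIC> to "C" just for that formatting step. The following is a sketch only; sprintf_in_c_locale() is a hypothetical helper built on the save-and-restore pattern shown in L<The setlocale function>.

```perl
use strict;
use warnings;
use POSIX qw(locale_h);
use locale;

# Hypothetical helper: format with sprintf() while LC_NUMERIC is
# temporarily set to "C", then put the previous locale back.
sub sprintf_in_c_locale {
    my ($format, @args) = @_;
    my $old = setlocale(LC_NUMERIC);      # query and save
    setlocale(LC_NUMERIC, "C");
    my $text = sprintf($format, @args);
    setlocale(LC_NUMERIC, $old) if defined $old;   # restore
    return $text;
}

print sprintf_in_c_locale("%.1f", 5/2), "\n";
```

Formatting the same value inside a C<no locale> block would be an alternative; the helper form is shown because it also works when the format string is chosen at run time by locale-aware code.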
5f05dabc | 433 | |
434 | =head2 Category LC_MONETARY: Formatting of monetary amounts | |
435 | ||
14280422 DD |
436 | The C standard defines the C<LC_MONETARY> category, but no function that |
437 | is affected by its contents. (Those with experience of standards | |
b0c42ed9 | 438 | committees will recognize that the working group decided to punt on the |
14280422 DD |
439 | issue.) Consequently, Perl takes no notice of it. If you really want |
440 | to use C<LC_MONETARY>, you can query its contents - see L<The localeconv | |
441 | function> - and use the information that it returns in your | |
b0c42ed9 | 442 | application's own formatting of currency amounts. However, you may well |
14280422 DD |
443 | find that the information, though voluminous and complex, does not quite |
444 | meet your requirements: currency formatting is a hard nut to crack. | |
5f05dabc | 445 | |
446 | =head2 LC_TIME | |
447 | ||
14280422 | 448 | The output produced by POSIX::strftime(), which builds a formatted |
5f05dabc | 449 | human-readable date/time string, is affected by the current C<LC_TIME> |
450 | locale. Thus, in a French locale, the output produced by the C<%B> | |
451 | format element (full month name) for the first month of the year would | |
452 | be "janvier". Here's how to get a list of the long month names in the | |
453 | current locale: | |
454 | ||
455 | use POSIX qw(strftime); | |
14280422 DD |
456 | for (0..11) { |
457 | $long_month_name[$_] = | |
458 | strftime("%B", 0, 0, 0, 1, $_, 96); | |
5f05dabc | 459 | } |
460 | ||
14280422 DD |
461 | Note: C<use locale> isn't needed in this example: as a function which |
462 | exists only to generate locale-dependent results, strftime() always | |
463 | obeys the current C<LC_TIME> locale. | |
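Day names can be collected in the same way with the C<%A> format element. In this sketch the C<LC_TIME> locale is pinned to "C" only to make the output predictable; remove that line to see the names in your own locale. It relies on 7 January 1996 having been a Sunday.

```perl
use strict;
use warnings;
use POSIX qw(locale_h strftime);

setlocale(LC_TIME, "C");   # pinned only to make the output predictable

# 7 January 1996 was a Sunday, so days 7..13 run Sunday to Saturday.
# Arguments after the format: sec, min, hour, mday, mon (0-based), year-1900.
my @long_day_name = map { strftime("%A", 0, 0, 0, $_, 0, 96) } 7 .. 13;

print "@long_day_name\n";
```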
5f05dabc | 464 | |
465 | =head2 Other categories | |
466 | ||
467 | The remaining locale category, C<LC_MESSAGES> (possibly supplemented by | |
468 | others in particular implementations) is not currently used by Perl - | |
b0c42ed9 | 469 | except possibly to affect the behavior of library functions called by |
14280422 DD |
470 | extensions which are not part of the standard Perl distribution. |
471 | ||
472 | =head1 SECURITY | |
473 | ||
474 | While the main discussion of Perl security issues can be found in | |
475 | L<perlsec>, a discussion of Perl's locale handling would be incomplete | |
476 | if it did not draw your attention to locale-dependent security issues. | |
477 | Locales - particularly on systems which allow unprivileged users to | |
478 | build their own locales - are untrustworthy. A malicious (or just plain | |
479 | broken) locale can make a locale-aware application give unexpected | |
480 | results. Here are a few possibilities: | |
481 | ||
482 | =over 4 | |
483 | ||
484 | =item * | |
485 | ||
486 | Regular expression checks for safe file names or mail addresses using | |
487 | C<\w> may be spoofed by an C<LC_CTYPE> locale which claims that | |
488 | characters such as "E<gt>" and "|" are alphanumeric. | |
489 | ||
490 | =item * | |
491 | ||
e38874e2 DD |
492 | String interpolation with case-mapping, as in, say, C<$dest = |
493 | "C:\U$name.$ext">, may produce dangerous results if a bogus C<LC_CTYPE> | |
494 | case-mapping table is in effect. | |
495 | ||
496 | =item * | |
497 | ||
14280422 DD |
498 | If the decimal point character in the C<LC_NUMERIC> locale is |
499 | surreptitiously changed from a dot to a comma, C<sprintf("%g", | |
500 | 0.123456e3)> produces a string result of "123,456". Many people would | |
501 | interpret this as one hundred and twenty-three thousand, four hundred | |
502 | and fifty-six. | |
503 | ||
504 | =item * | |
505 | ||
506 | A sneaky C<LC_COLLATE> locale could result in the names of students with | |
507 | "D" grades appearing ahead of those with "A"s. | |
508 | ||
509 | =item * | |
510 | ||
511 | An application which takes the trouble to use the information in | |
512 | C<LC_MONETARY> may format debits as if they were credits and vice versa | |
513 | if that locale has been subverted. Or it may make payments in US | |
514 | dollars instead of Hong Kong dollars. | |
515 | ||
516 | =item * | |
517 | ||
518 | The date and day names in dates formatted by strftime() could be | |
519 | manipulated to advantage by a malicious user able to subvert the | |
520 | C<LC_TIME> locale. ("Look - it says I wasn't in the building on | |
521 | Sunday.") | |
522 | ||
523 | =back | |
524 | ||
525 | Such dangers are not peculiar to the locale system: any aspect of an | |
526 | application's environment which may maliciously be modified presents | |
527 | similar challenges. Similarly, they are not specific to Perl: any | |
528 | programming language which allows you to write programs which take | |
529 | account of their environment exposes you to these issues. | |
530 | ||
531 | Perl cannot protect you from all of the possibilities shown in the | |
532 | examples - there is no substitute for your own vigilance - but, when | |
533 | C<use locale> is in effect, Perl uses the tainting mechanism (see | |
534 | L<perlsec>) to mark string results which become locale-dependent, and | |
535 | which may be untrustworthy in consequence. Here is a summary of the | |
b0c42ed9 | 536 | tainting behavior of operators and functions which may be affected by |
14280422 DD |
537 | the locale: |
538 | ||
539 | =over 4 | |
540 | ||
541 | =item B<Comparison operators> (C<lt>, C<le>, C<ge>, C<gt> and C<cmp>): | |
542 | ||
543 | Scalar true/false (or less/equal/greater) result is never tainted. | |
544 | ||
e38874e2 DD |
545 | =item B<Case-mapping interpolation> (with C<\l>, C<\L>, C<\u> or C<\U>): |
546 | ||
547 | Result string containing interpolated material is tainted if | |
548 | C<use locale> is in effect. | |
549 | ||
14280422 DD |
550 | =item B<Matching operator> (C<m//>): |
551 | ||
552 | Scalar true/false result is never tainted. | |
553 | ||
554 | Subpatterns, either delivered as an array-context result, or as $1 etc. | |
555 | are tainted if C<use locale> is in effect, and the subpattern regular | |
e38874e2 DD |
556 | expression contains C<\w> (to match an alphanumeric character), C<\W> |
557 | (non-alphanumeric character), C<\s> (white-space character), or C<\S> | |
558 | (non white-space character). The matched pattern variables, $&, $` | |
559 | (pre-match), $' (post-match), and $+ (last match) are also tainted if | |
560 | C<use locale> is in effect and the regular expression contains C<\w>, | |
561 | C<\W>, C<\s>, or C<\S>. | |
14280422 DD |
562 | |
563 | =item B<Substitution operator> (C<s///>): | |
564 | ||
e38874e2 DD |
565 | Has the same behavior as the match operator. Also, the left |
566 | operand of C<=~> becomes tainted when C<use locale> is in effect, | |
567 | if it is modified as a result of a substitution based on a regular | |
568 | expression match involving C<\w>, C<\W>, C<\s>, or C<\S>; or of | |
569 | case-mapping with C<\l>, C<\L>, C<\u> or C<\U>. | |
14280422 DD |
570 | |
571 | =item B<In-memory formatting function> (sprintf()): | |
572 | ||
573 | Result is tainted if C<use locale> is in effect. | |
574 | ||
575 | =item B<Output formatting functions> (printf() and write()): | |
576 | ||
577 | Success/failure result is never tainted. | |
578 | ||
579 | =item B<Case-mapping functions> (lc(), lcfirst(), uc(), ucfirst()): | |
580 | ||
581 | Results are tainted if C<use locale> is in effect. | |
582 | ||
583 | =item B<POSIX locale-dependent functions> (localeconv(), strcoll(), | |
584 | strftime(), strxfrm()): | |
585 | ||
586 | Results are never tainted. | |
587 | ||
588 | =item B<POSIX character class tests> (isalnum(), isalpha(), isdigit(), | |
589 | isgraph(), islower(), isprint(), ispunct(), isspace(), isupper(), | |
590 | isxdigit()): | |
591 | ||
592 | True/false results are never tainted. | |
593 | ||
594 | =back | |
595 | ||
596 | Three examples illustrate locale-dependent tainting. | |
597 | The first program, which ignores its locale, won't run: a value taken | |
598 | directly from the command-line may not be used to name an output file | |
599 | when taint checks are enabled. | |
600 | ||
601 | #!/usr/local/bin/perl -T | |
602 | # Run with taint checking | |
603 | ||
604 | # Command-line sanity check omitted... | |
605 | $tainted_output_file = shift; | |
606 | ||
607 | open(F, ">$tainted_output_file") | |
608 | or warn "Open of $tainted_output_file failed: $!\n"; | |
609 | ||
610 | The program can be made to run by "laundering" the tainted value through | |
611 | a regular expression: the second example - which still ignores locale | |
612 | information - runs, creating the file named on its command-line | |
613 | if it can. | |
614 | ||
615 | #!/usr/local/bin/perl -T | |
616 | ||
617 | $tainted_output_file = shift; | |
618 | $tainted_output_file =~ m%[\w/]+%; | |
619 | $untainted_output_file = $&; | |
620 | ||
621 | open(F, ">$untainted_output_file") | |
622 | or warn "Open of $untainted_output_file failed: $!\n"; | |
623 | ||
624 | Compare this with a very similar program which is locale-aware: | |
625 | ||
626 | #!/usr/local/bin/perl -T | |
627 | ||
628 | $tainted_output_file = shift; | |
629 | use locale; | |
630 | $tainted_output_file =~ m%[\w/]+%; | |
631 | $localized_output_file = $&; | |
632 | ||
633 | open(F, ">$localized_output_file") | |
634 | or warn "Open of $localized_output_file failed: $!\n"; | |
635 | ||
636 | This third program fails to run because $& is tainted: it is the result | |
637 | of a match involving C<\w> when C<use locale> is in effect. | |
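
One way to let a locale-aware program launder a value is to do the laundering match inside a C<no locale> block, so that C<\w> is not locale-dependent for that one match. The following is a minimal sketch under that assumption, not a prescribed method. The helper name C<launder_filename> is hypothetical, and the script is meant to be run as C<perl -T script.pl name> (the C<-T> is left off the first line so the sketch also runs without taint checks).

```perl
#!/usr/local/bin/perl
use strict;
use warnings;
use locale;

# Hypothetical helper: returns the first run of word characters and
# slashes found in $name, or undef if there is none.
sub launder_filename {
    my ($name) = @_;
    no locale;    # for the rest of this block, \w ignores the locale
    return $name =~ m%([\w/]+)% ? $1 : undef;
}

# Only touch the file system when a name was given on the command line.
if (@ARGV) {
    my $untainted_output_file = launder_filename(shift);
    die "No usable file name given\n"
        unless defined $untainted_output_file;

    open(my $fh, '>', $untainted_output_file)
        or warn "Open of $untainted_output_file failed: $!\n";
}
```

Because C<no locale> is lexically scoped, the rest of the program still runs under C<use locale>. Only the pattern inside the helper is compiled without it.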
5f05dabc | 638 | |
639 | =head1 ENVIRONMENT | |
640 | ||
641 | =over 12 | |
642 | ||
643 | =item PERL_BADLANG | |
644 | ||
14280422 DD |
645 | A string that can suppress Perl's warning about failed locale settings |
646 | at start-up. Failure can occur if the locale support in the operating | |
647 | system is lacking (broken) in some way - or if you mistyped the name of | |
648 | a locale when you set up your environment. If this environment variable | |
649 | is absent, or has a value which does not evaluate to integer zero | |
650 | (as "0" and "" do), Perl will complain about locale setting failures. | |
5f05dabc | 651 | |
14280422 DD |
652 | B<NOTE>: PERL_BADLANG only gives you a way to hide the warning message. |
653 | The message tells about some problem in your system's locale support, | |
654 | and you should investigate what the problem is. | |
5f05dabc | 655 | |
656 | =back | |
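
The rule for C<PERL_BADLANG> can be written out as a small predicate. This is only an illustrative sketch: the helper name C<badlang_warns> is hypothetical, and it assumes the value is read as a leading integer, roughly as C's atoi() would read it, so that a non-numeric value also counts as zero.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Perl: returns 1 if a failed locale
# setting would produce a warning for the given PERL_BADLANG value,
# and 0 otherwise. Pass undef to mean "variable is absent".
sub badlang_warns {
    my ($value) = @_;
    return 1 unless defined $value;    # absent: Perl complains

    # Assumption: take a leading integer, treating anything else as zero.
    my $n = $value =~ /^\s*([+-]?\d+)/ ? $1 : 0;
    return $n != 0 ? 1 : 0;
}

for my $value (undef, '', '0', '1') {
    printf "%-8s => %s\n",
        defined $value ? "'$value'" : 'absent',
        badlang_warns($value) ? 'warns' : 'silent';
}
```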
657 | ||
658 | The following environment variables are not specific to Perl: they are | |
14280422 DD |
659 | part of the standardized (ISO C, XPG4, POSIX 1.c) setlocale() method |
660 | for controlling an application's opinion on data. | |
5f05dabc | 661 | |
662 | =over 12 | |
663 | ||
664 | =item LC_ALL | |
665 | ||
666 | C<LC_ALL> is the "override-all" locale environment variable. If it is | |
667 | set, it overrides all the rest of the locale environment variables. | |
668 | ||
669 | =item LC_CTYPE | |
670 | ||
671 | In the absence of C<LC_ALL>, C<LC_CTYPE> chooses the character type | |
672 | locale. In the absence of both C<LC_ALL> and C<LC_CTYPE>, C<LANG> | |
673 | chooses the character type locale. | |
674 | ||
675 | =item LC_COLLATE | |
676 | ||
14280422 DD |
677 | In the absence of C<LC_ALL>, C<LC_COLLATE> chooses the collation |
678 | (sorting) locale. In the absence of both C<LC_ALL> and C<LC_COLLATE>, | |
679 | C<LANG> chooses the collation locale. | |
5f05dabc | 680 | |
681 | =item LC_MONETARY | |
682 | ||
14280422 DD |
683 | In the absence of C<LC_ALL>, C<LC_MONETARY> chooses the monetary |
684 | formatting locale. In the absence of both C<LC_ALL> and C<LC_MONETARY>, | |
685 | C<LANG> chooses the monetary formatting locale. | |
5f05dabc | 686 | |
687 | =item LC_NUMERIC | |
688 | ||
689 | In the absence of C<LC_ALL>, C<LC_NUMERIC> chooses the numeric format | |
690 | locale. In the absence of both C<LC_ALL> and C<LC_NUMERIC>, C<LANG> | |
691 | chooses the numeric format locale. | |
692 | ||
693 | =item LC_TIME | |
694 | ||
14280422 DD |
695 | In the absence of C<LC_ALL>, C<LC_TIME> chooses the date and time |
696 | formatting locale. In the absence of both C<LC_ALL> and C<LC_TIME>, | |
697 | C<LANG> chooses the date and time formatting locale. | |
5f05dabc | 698 | |
699 | =item LANG | |
700 | ||
14280422 DD |
701 | C<LANG> is the "catch-all" locale environment variable. If it is set, it |
702 | is used as the last resort after the overall C<LC_ALL> and the | |
5f05dabc | 703 | category-specific C<LC_...>. |
704 | ||
705 | =back | |
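
The precedence among these variables can be sketched as a lookup: C<LC_ALL> first, then the category-specific variable, then C<LANG>. The helper name C<effective_locale> is hypothetical. Treating an empty value as absent and falling back to C<"C"> when nothing is set are assumptions based on common POSIX behavior, not statements from this document.

```perl
use strict;
use warnings;

# Hypothetical helper, not a Perl built-in: returns the locale name that
# would be chosen for $category (for example 'LC_COLLATE') from a hash
# of environment variables.
sub effective_locale {
    my ($category, $env) = @_;
    for my $name ('LC_ALL', $category, 'LANG') {
        my $value = $env->{$name};
        # Assumption: an empty value counts as absent.
        return $value if defined $value && length $value;
    }
    return 'C';    # Assumption: default when nothing is set
}

# Report what the current environment selects for each category.
for my $category (qw(LC_CTYPE LC_COLLATE LC_MONETARY LC_NUMERIC LC_TIME)) {
    printf "%-12s %s\n", $category, effective_locale($category, \%ENV);
}
```

Passing the environment in as a hash reference, rather than reading C<%ENV> directly inside the helper, makes it easy to try out combinations of settings without changing the real environment.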
706 | ||
707 | =head1 NOTES | |
708 | ||
709 | =head2 Backward compatibility | |
710 | ||
b0c42ed9 JH |
711 | Versions of Perl prior to 5.004 B<mostly> ignored locale information, |
712 | generally behaving as if something similar to the C<"C"> locale (see | |
713 | L<The setlocale function>) was always in force, even if the program | |
5f05dabc | 714 | environment suggested otherwise. By default, Perl still behaves this |
715 | way so as to maintain backward compatibility. If you want a Perl | |
b0c42ed9 JH |
716 | application to pay attention to locale information, you B<must> use |
717 | the S<C<use locale>> pragma (see L<The S<C<use locale>> Pragma>) to | |
718 | instruct it to do so. | |
719 | ||
720 | Versions of Perl from 5.002 to 5.003 did use the C<LC_CTYPE> | |
721 | information if that was available, that is, C<\w> did understand which | |
722 | characters are letters according to the locale environment variables. | |
723 | The problem was that the user had no control over the feature: | |
724 | if the C library supported locales, Perl used them. | |
725 | ||
726 | =head2 I18N::Collate obsolete | |
727 | ||
728 | In versions of Perl prior to 5.004 per-locale collation was possible | |
729 | using the C<I18N::Collate> library module. This module is now mildly | |
730 | obsolete and should be avoided in new applications. The C<LC_COLLATE> | |
731 | functionality is now integrated into the Perl core language: One can | |
732 | use locale-specific scalar data completely normally with C<use locale>, | |
733 | so there is no longer any need to juggle with the scalar references of | |
734 | C<I18N::Collate>. | |
5f05dabc | 735 | |
14280422 | 736 | =head2 Sort speed and memory use impacts |
5f05dabc | 737 | |
738 | Comparing and sorting by locale is usually slower than the default | |
14280422 DD |
739 | sorting; slow-downs of two to four times have been observed. It will |
740 | also consume more memory: once a Perl scalar variable has participated | |
741 | in any string comparison or sorting operation obeying the locale | |
742 | collation rules, it will take 3-15 times more memory than before. (The | |
743 | exact multiplier depends on the string's contents, the operating system | |
744 | and the locale.) These downsides are dictated more by the operating | |
745 | system's implementation of the locale system than by Perl. | |
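
To see how much locale-aware sorting costs on a particular system, the two kinds of sort can be timed side by side with the standard C<Benchmark> module. This is a rough sketch. The word list is made up, the sub names are hypothetical, and the timings depend on the operating system and the locale in force (in the C<"C"> locale the difference may be small).

```perl
use strict;
use warnings;
use Benchmark qw(timeit timestr);

# Build a deterministic list of made-up eight-letter words to sort.
my $state = 12345;
sub next_word {
    my $word = '';
    for (1 .. 8) {
        $state = ($state * 1103515245 + 12345) % 2147483648;
        $word .= chr(ord('a') + ($state >> 16) % 26);
    }
    return $word;
}
my @words = map { next_word() } 1 .. 2000;

# Default comparison, ignoring the locale.
sub sort_plain {
    my @sorted = sort { $a cmp $b } @_;
    return @sorted;
}

# The same comparison obeying the locale's collation rules.
sub sort_locale {
    use locale;
    my @sorted = sort { $a cmp $b } @_;
    return @sorted;
}

my $plain  = timeit(20, sub { my @s = sort_plain(@words) });
my $locale = timeit(20, sub { my @s = sort_locale(@words) });
print "default: ", timestr($plain),  "\n";
print "locale:  ", timestr($locale), "\n";
```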
5f05dabc | 746 | |
e38874e2 DD |
747 | =head2 write() and LC_NUMERIC |
748 | ||
749 | Formats are the only part of Perl which unconditionally uses information | |
750 | from a program's locale; if a program's environment specifies an | |
751 | LC_NUMERIC locale, it is always used to specify the decimal point | |
752 | character in formatted output. Formatted output cannot be controlled by | |
753 | C<use locale> because the pragma is tied to the block structure of the | |
754 | program, and, for historical reasons, formats exist outside that block | |
755 | structure. | |
756 | ||
5f05dabc | 757 | =head2 Freely available locale definitions |
758 | ||
759 | There is a large collection of locale definitions at | |
14280422 DD |
760 | C<ftp://dkuug.dk/i18n/WG15-collection>. You should be aware that it is |
761 | unsupported, and is not claimed to be fit for any purpose. If your | |
762 | system allows the installation of arbitrary locales, you may find the | |
763 | definitions useful as they are, or as a basis for the development of | |
764 | your own locales. | |
5f05dabc | 765 | |
14280422 | 766 | =head2 I18n and l10n |
5f05dabc | 767 | |
b0c42ed9 JH |
768 | "Internationalization" is often abbreviated as B<i18n> because its first |
769 | and last letters are separated by eighteen others. (You may guess why | |
770 | the internalin ... internaliti ... i18n tends to get abbreviated.) In | |
771 | the same way, "localization" is often abbreviated to B<l10n>. | |
14280422 DD |
772 | |
773 | =head2 An imperfect standard | |
774 | ||
775 | Internationalization, as defined in the C and POSIX standards, can be | |
776 | criticized as incomplete, ungainly, and having too large a granularity. | |
777 | (Locales apply to a whole process, when it would arguably be more useful | |
778 | to have them apply to a single thread, window group, or whatever.) Locales | |
779 | also have a tendency, like standards groups, to divide the world into | |
780 | nations, when we all know that the world can equally well be divided | |
781 | into bankers, bikers, gamers, and so on. But, for now, it's the only | |
782 | standard we've got. This may be construed as a bug. | |
5f05dabc | 783 | |
784 | =head1 BUGS | |
785 | ||
786 | =head2 Broken systems | |
787 | ||
2bdf8add JH |
788 | In certain system environments the operating system's locale support |
789 | is broken and cannot be fixed or used by Perl. Such deficiencies can | |
790 | and will result in mysterious hangs and/or Perl core dumps when | |
791 | C<use locale> is in effect. When confronted with such a system, | |
792 | please report in excruciating detail to C<perlbug@perl.com>, and | |
793 | complain to your vendor: maybe some bug fixes exist for these problems | |
794 | in your operating system. Sometimes such bug fixes are called an | |
795 | operating system upgrade. | |
5f05dabc | 796 | |
797 | =head1 SEE ALSO | |
798 | ||
799 | L<POSIX (3)/isalnum>, L<POSIX (3)/isalpha>, L<POSIX (3)/isdigit>, | |
800 | L<POSIX (3)/isgraph>, L<POSIX (3)/islower>, L<POSIX (3)/isprint>, | |
801 | L<POSIX (3)/ispunct>, L<POSIX (3)/isspace>, L<POSIX (3)/isupper>, | |
802 | L<POSIX (3)/isxdigit>, L<POSIX (3)/localeconv>, L<POSIX (3)/setlocale>, | |
14280422 DD |
803 | L<POSIX (3)/strcoll>, L<POSIX (3)/strftime>, L<POSIX (3)/strtod>, |
804 | L<POSIX (3)/strxfrm> | |
5f05dabc | 805 | |
806 | =head1 HISTORY | |
807 | ||
b0c42ed9 | 808 | Jarkko Hietaniemi's original F<perli18n.pod> heavily hacked by Dominic |
14280422 | 809 | Dunlop, assisted by the perl5-porters. |
5f05dabc | 810 | |
e38874e2 | 811 | Last update: Tue Dec 31 01:30:55 EST 1996 |