Commit | Line | Data |
---|---|---|
5f05dabc | 1 | =head1 NAME |
2 | ||
b0c42ed9 | 3 | perllocale - Perl locale handling (internationalization and localization) |
5f05dabc | 4 | |
5 | =head1 DESCRIPTION | |
6 | ||
7 | Perl supports language-specific notions of data such as "is this a | |
68dc0745 | 8 | letter", "what is the uppercase equivalent of this letter", and "which |
14280422 DD |
9 | of these letters comes first". These are important issues, especially |
10 | for languages other than English - but also for English: it would be | |
11 | very naE<iuml>ve to think that C<A-Za-z> defines all the "letters". Perl | |
12 | is also aware that some character other than '.' may be preferred as a | |
13 | decimal point, and that output date representations may be | |
14 | language-specific. The process of making an application take account of | |
15 | its users' preferences in such matters is called B<internationalization> | |
16 | (often abbreviated as B<i18n>); telling such an application about a | |
17 | particular set of preferences is known as B<localization> (B<l10n>). | |
18 | ||
19 | Perl can understand language-specific data via the standardized (ISO C, | |
20 | XPG4, POSIX 1.c) method called "the locale system". The locale system is | |
b0c42ed9 | 21 | controlled per application using one pragma, one function call, and |
14280422 DD |
22 | several environment variables. |
23 | ||
24 | B<NOTE>: This feature is new in Perl 5.004, and does not apply unless an | |
25 | application specifically requests it - see L<Backward compatibility>. | |
e38874e2 DD |
26 | The one exception is that write() now B<always> uses the current locale |
27 | - see L<"NOTES">. | |
5f05dabc | 28 | |
29 | =head1 PREPARING TO USE LOCALES | |
30 | ||
14280422 DD |
31 | If Perl applications are to be able to understand and present your data |
32 | correctly according to a locale of your choice, B<all> of the following | |
5f05dabc | 33 | must be true: |
34 | ||
35 | =over 4 | |
36 | ||
37 | =item * | |
38 | ||
39 | B<Your operating system must support the locale system>. If it does, | |
14280422 | 40 | you should find that the setlocale() function is a documented part of |
5f05dabc | 41 | its C library. |
42 | ||
43 | =item * | |
44 | ||
14280422 DD |
45 | B<Definitions for the locales which you use must be installed>. You, or |
46 | your system administrator, must make sure that this is the case. The | |
47 | available locales, the location in which they are kept, and the manner | |
48 | in which they are installed, vary from system to system. Some systems | |
4a6725af | 49 | provide only a few, hard-wired, locales, and do not allow more to be |
14280422 DD |
50 | added; others allow you to add "canned" locales provided by the system |
51 | supplier; still others allow you or the system administrator to define | |
52 | and add arbitrary locales. (You may have to ask your supplier to | |
53 | provide canned locales which are not delivered with your operating | |
54 | system.) Read your system documentation for further illumination. | |
5f05dabc | 55 | |
56 | =item * | |
57 | ||
58 | B<Perl must believe that the locale system is supported>. If it does, | |
59 | C<perl -V:d_setlocale> will say that the value for C<d_setlocale> is | |
60 | C<define>. | |
61 | ||
62 | =back | |
63 | ||
64 | If you want a Perl application to process and present your data | |
65 | according to a particular locale, the application code should include | |
2ae324a7 | 66 | the S<C<use locale>> pragma (see L<The use locale pragma>) where |
5f05dabc | 67 | appropriate, and B<at least one> of the following must be true: |
68 | ||
69 | =over 4 | |
70 | ||
71 | =item * | |
72 | ||
14280422 DD |
73 | B<The locale-determining environment variables (see L<"ENVIRONMENT">) |
74 | must be correctly set up>, either by yourself, or by the person who set | |
75 | up your system account, at the time the application is started. | |
5f05dabc | 76 | |
77 | =item * | |
78 | ||
14280422 DD |
79 | B<The application must set its own locale> using the method described in |
80 | L<The setlocale function>. | |
5f05dabc | 81 | |
82 | =back | |
83 | ||
84 | =head1 USING LOCALES | |
85 | ||
86 | =head2 The use locale pragma | |
87 | ||
14280422 DD |
88 | By default, Perl ignores the current locale. The S<C<use locale>> |
89 | pragma tells Perl to use the current locale for some operations: | |
5f05dabc | 90 | |
91 | =over 4 | |
92 | ||
93 | =item * | |
94 | ||
14280422 DD |
95 | B<The comparison operators> (C<lt>, C<le>, C<cmp>, C<ge>, and C<gt>) and |
96 | the POSIX string collation functions strcoll() and strxfrm() use | |
97 | C<LC_COLLATE>. sort() is also affected if it is used without an | |
98 | explicit comparison function because it uses C<cmp> by default. | |
99 | ||
100 | B<Note:> C<eq> and C<ne> are unaffected by the locale: they always | |
101 | perform a byte-by-byte comparison of their scalar operands. What's | |
102 | more, if C<cmp> finds that its operands are equal according to the | |
103 | collation sequence specified by the current locale, it goes on to | |
104 | perform a byte-by-byte comparison, and only returns I<0> (equal) if the | |
105 | operands are bit-for-bit identical. If you really want to know whether | |
106 | two strings - which C<eq> and C<cmp> may consider different - are equal | |
107 | as far as collation in the locale is concerned, see the discussion in | |
108 | L<Category LC_COLLATE: Collation>. | |
5f05dabc | 109 | |
110 | =item * | |
111 | ||
14280422 DD |
112 | B<Regular expressions and case-modification functions> (uc(), lc(), |
113 | ucfirst(), and lcfirst()) use C<LC_CTYPE>. | |
5f05dabc | 114 | |
115 | =item * | |
116 | ||
14280422 | 117 | B<The formatting functions> (printf(), sprintf() and write()) use |
5f05dabc | 118 | C<LC_NUMERIC>. |
119 | ||
120 | =item * | |
121 | ||
14280422 | 122 | B<The POSIX date formatting function> (strftime()) uses C<LC_TIME>. |
5f05dabc | 123 | |
124 | =back | |
125 | ||
14280422 DD |
126 | C<LC_COLLATE>, C<LC_CTYPE>, and so on, are discussed further in L<LOCALE |
127 | CATEGORIES>. | |
5f05dabc | 128 | |
b0c42ed9 | 129 | The default behavior returns with S<C<no locale>> or on reaching the |
14280422 | 130 | end of the enclosing block. |
5f05dabc | 131 | |
14280422 DD |
132 | Note that the string result of any operation that uses locale |
133 | information is tainted, as it is possible for a locale to be | |
134 | untrustworthy. See L<"SECURITY">. | |
5f05dabc | 135 | |
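The block scoping described above can be sketched as follows. This is a minimal illustration, not part of the original text: the word list and the variable names are made up, and the order of C<@localized> depends on whatever C<LC_COLLATE> locale is in force when the program runs.

```perl
use strict;
use warnings;

# Hypothetical sample data, chosen only for illustration.
my @words = qw(banana Apple cherry apple);

# Outside any "use locale" scope: sort() compares byte by byte.
my @native = sort @words;

my @localized;
{
    use locale;                  # in effect until the end of this block
    @localized = sort @words;    # cmp now follows the LC_COLLATE locale
}

# Back outside the block, the default (locale-ignoring) behavior applies.
my @native_again = sort @words;

print "native:    @native\n";
print "localized: @localized\n";
```

In the "C" locale both lines print the same order; in a locale with dictionary-style collation the second line may differ.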
136 | =head2 The setlocale function | |
137 | ||
14280422 DD |
138 | You can switch locales as often as you wish at run time with the |
139 | POSIX::setlocale() function: | |
5f05dabc | 140 | |
141 | # This functionality not usable prior to Perl 5.004 | |
142 | require 5.004; | |
143 | ||
144 | # Import locale-handling tool set from POSIX module. | |
145 | # This example uses: setlocale -- the function call | |
146 | # LC_CTYPE -- explained below | |
147 | use POSIX qw(locale_h); | |
148 | ||
14280422 | 149 | # query and save the old locale |
5f05dabc | 150 | $old_locale = setlocale(LC_CTYPE); |
151 | ||
152 | setlocale(LC_CTYPE, "fr_CA.ISO8859-1"); | |
153 | # LC_CTYPE now in locale "French, Canada, codeset ISO 8859-1" | |
154 | ||
155 | setlocale(LC_CTYPE, ""); | |
156 | # LC_CTYPE now reset to default defined by LC_ALL/LC_CTYPE/LANG | |
157 | # environment variables. See below for documentation. | |
158 | ||
159 | # restore the old locale | |
160 | setlocale(LC_CTYPE, $old_locale); | |
161 | ||
14280422 DD |
162 | The first argument of setlocale() gives the B<category>, the second the |
163 | B<locale>. The category tells in what aspect of data processing you | |
164 | want to apply locale-specific rules. Category names are discussed in | |
165 | L<LOCALE CATEGORIES> and L<"ENVIRONMENT">. The locale is the name of a | |
166 | collection of customization information corresponding to a particular | |
167 | combination of language, country or territory, and codeset. Read on for | |
168 | hints on the naming of locales: not all systems name locales as in the | |
169 | example. | |
170 | ||
171 | If no second argument is provided, the function returns a string naming | |
172 | the current locale for the category. You can use this value as the | |
173 | second argument in a subsequent call to setlocale(). If a second | |
5f05dabc | 174 | argument is given and it corresponds to a valid locale, the locale for |
175 | the category is set to that value, and the function returns the | |
176 | now-current locale value. You can use this in a subsequent call to | |
14280422 | 177 | setlocale(). (In some implementations, the return value may sometimes |
5f05dabc | 178 | differ from the value you gave as the second argument - think of it as |
179 | an alias for the value that you gave.) | |
180 | ||
181 | As the example shows, if the second argument is an empty string, the | |
182 | category's locale is returned to the default specified by the | |
183 | corresponding environment variables. Generally, this results in a | |
184 | return to the default which was in force when Perl started up: changes | |
54310121 | 185 | to the environment made by the application after startup may or may not |
14280422 | 186 | be noticed, depending on the implementation of your system's C library. |
5f05dabc | 187 | |
14280422 DD |
188 | If the second argument does not correspond to a valid locale, the locale |
189 | for the category is not changed, and the function returns I<undef>. | |
5f05dabc | 190 | |
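Because setlocale() returns I<undef> for a locale that is not installed, a program can probe several spellings of a locale name and keep the first one that works. The following is a sketch under stated assumptions: C<set_first_available> is a hypothetical helper, not part of the POSIX module, and the candidate names are only examples of the naming styles listed below.

```perl
use strict;
use warnings;
use POSIX qw(locale_h);

# Hypothetical helper: try each candidate name in turn and return the
# value setlocale() reports for the first one the system accepts.
sub set_first_available {
    my ($category, @candidates) = @_;
    for my $name (@candidates) {
        my $result = setlocale($category, $name);
        return $result if defined $result;   # undef means "not a valid locale"
    }
    return;                                  # none of the candidates worked
}

# Query and save the old locale so it can be restored afterwards.
my $old = setlocale(LC_CTYPE);

# Example candidate names only; "C" is listed last as a safe fallback.
my $chosen = set_first_available(LC_CTYPE, "fr_CA.ISO8859-1", "fr_CA", "C");
print defined $chosen
    ? "LC_CTYPE is now $chosen\n"
    : "no candidate locale available\n";

# Restore the old locale.
setlocale(LC_CTYPE, $old);
```

Ending the candidate list with "C" means the helper should always find something to return on a system with working locale support.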
14280422 DD |
191 | For further information about the categories, consult L<setlocale(3)>. |
192 | For the locales available in your system, also consult L<setlocale(3)> | |
193 | and see whether it leads you to the list of the available locales | |
194 | (search for the I<SEE ALSO> section). If that fails, try the following | |
195 | command lines: | |
5f05dabc | 196 | |
197 | locale -a | |
198 | ||
199 | nlsinfo | |
200 | ||
201 | ls /usr/lib/nls/loc | |
202 | ||
203 | ls /usr/lib/locale | |
204 | ||
205 | ls /usr/lib/nls | |
206 | ||
207 | and see whether they list something resembling these | |
208 | ||
2bdf8add JH |
209 | en_US.ISO8859-1 de_DE.ISO8859-1 ru_RU.ISO8859-5 |
210 | en_US de_DE ru_RU | |
14280422 | 211 | en de ru |
2bdf8add JH |
212 | english german russian |
213 | english.iso88591 german.iso88591 russian.iso88595 | |
5f05dabc | 214 | |
14280422 | 215 | Sadly, even though the calling interface for setlocale() has been |
2bdf8add JH |
216 | standardized, the names of the locales and the directories where |
217 | the configuration is, have not. The basic form of the name is | |
218 | I<language_country/territory>B<.>I<codeset>, but the | |
5f05dabc | 219 | latter parts are not always present. |
220 | ||
14280422 DD |
221 | Two special locales are worth particular mention: "C" and "POSIX". |
222 | Currently these are effectively the same locale: the difference is | |
223 | mainly that the first one is defined by the C standard and the second by | |
224 | the POSIX standard. What they define is the B<default locale> in which | |
225 | every program starts in the absence of locale information in its | |
226 | environment. (The default default locale, if you will.) Its language | |
227 | is (American) English and its character codeset ASCII. | |
5f05dabc | 228 | |
14280422 DD |
229 | B<NOTE>: Not all systems have the "POSIX" locale (not all systems are |
230 | POSIX-conformant), so use "C" when you need explicitly to specify this | |
231 | default locale. | |
5f05dabc | 232 | |
233 | =head2 The localeconv function | |
234 | ||
14280422 DD |
235 | The POSIX::localeconv() function allows you to get particulars of the |
236 | locale-dependent numeric formatting information specified by the current | |
237 | C<LC_NUMERIC> and C<LC_MONETARY> locales. (If you just want the name of | |
238 | the current locale for a particular category, use POSIX::setlocale() | |
239 | with a single parameter - see L<The setlocale function>.) | |
5f05dabc | 240 | |
241 | use POSIX qw(locale_h); | |
5f05dabc | 242 | |
243 | # Get a reference to a hash of locale-dependent info | |
244 | $locale_values = localeconv(); | |
245 | ||
246 | # Output sorted list of the values | |
247 | for (sort keys %$locale_values) { | |
14280422 | 248 | printf "%-20s = %s\n", $_, $locale_values->{$_} |
5f05dabc | 249 | } |
250 | ||
14280422 DD |
251 | localeconv() takes no arguments, and returns B<a reference to> a hash. |
252 | The keys of this hash are formatting variable names such as | |
253 | C<decimal_point> and C<thousands_sep>; the values are the corresponding | |
254 | values. See L<POSIX (3)/localeconv> for a longer example, which lists | |
255 | all the categories an implementation might be expected to provide; some | |
256 | provide more and others fewer, however. Note that you don't need C<use | |
257 | locale>: as a function with the job of querying the locale, localeconv() | |
258 | always observes the current locale. | |
5f05dabc | 259 | |
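Since implementations differ in which keys they provide, it is prudent to supply a default for any value you read from the hash. A minimal sketch, assuming a hypothetical helper named C<conv_or_default> and defaults chosen to match the "C" locale:

```perl
use strict;
use warnings;
use POSIX qw(locale_h);

# Reference to a hash of locale-dependent formatting values.
my $conv = localeconv();

# Hypothetical helper: return the locale's value for a key, or the
# supplied default if the key is missing or empty.
sub conv_or_default {
    my ($key, $default) = @_;
    my $value = $conv->{$key};
    return (defined $value && length $value) ? $value : $default;
}

# Defaults below are assumptions that mirror the "C" locale.
my $decimal_point = conv_or_default('decimal_point', '.');
my $thousands_sep = conv_or_default('thousands_sep', '');

printf "decimal point: '%s', thousands separator: '%s'\n",
    $decimal_point, $thousands_sep;
```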
260 | Here's a simple-minded example program which rewrites its command line | |
261 | parameters as integers formatted correctly in the current locale: | |
262 | ||
263 | # See comments in previous example | |
264 | require 5.004; | |
265 | use POSIX qw(locale_h); | |
5f05dabc | 266 | |
267 | # Get some of locale's numeric formatting parameters | |
268 | my ($thousands_sep, $grouping) = | |
14280422 | 269 | @{localeconv()}{'thousands_sep', 'grouping'}; |
5f05dabc | 270 | |
271 | # Apply defaults if values are missing | |
272 | $thousands_sep = ',' unless $thousands_sep; | |
273 | $grouping = 3 unless $grouping; | |
274 | ||
275 | # Format command line params for current locale | |
14280422 DD |
276 | for (@ARGV) { |
277 | $_ = int; # Chop non-integer part | |
5f05dabc | 278 | 1 while |
14280422 DD |
279 | s/(\d)(\d{$grouping}($|\Q$thousands_sep\E))/$1$thousands_sep$2/; |
280 | print "$_"; | |
5f05dabc | 281 | } |
282 | print "\n"; | |
283 | ||
5f05dabc | 284 | =head1 LOCALE CATEGORIES |
285 | ||
14280422 | 286 | The subsections which follow describe basic locale categories. As well |
5f05dabc | 287 | as these, there are some combination categories which allow the |
14280422 DD |
288 | manipulation of more than one basic category at a time. See |
289 | L<"ENVIRONMENT"> for a discussion of these. | |
5f05dabc | 290 | |
291 | =head2 Category LC_COLLATE: Collation | |
292 | ||
14280422 | 293 | When in the scope of S<C<use locale>>, Perl looks to the C<LC_COLLATE> |
5f05dabc | 294 | environment variable to determine the application's notions on the |
14280422 DD |
295 | collation (ordering) of characters. ('b' follows 'a' in Latin |
296 | alphabets, but where do 'E<aacute>' and 'E<aring>' belong?) | |
5f05dabc | 297 | |
298 | Here is a code snippet that will tell you what are the alphanumeric | |
299 | characters in the current locale, in the locale order: | |
300 | ||
301 | use locale; | |
302 | print +(sort grep /\w/, map { chr() } 0..255), "\n"; | |
303 | ||
14280422 DD |
304 | Compare this with the characters that you see and their order if you |
305 | state explicitly that the locale should be ignored: | |
5f05dabc | 306 | |
307 | no locale; | |
308 | print +(sort grep /\w/, map { chr() } 0..255), "\n"; | |
309 | ||
310 | This machine-native collation (which is what you get unless S<C<use | |
311 | locale>> has appeared earlier in the same block) must be used for | |
312 | sorting raw binary data, whereas the locale-dependent collation of the | |
b0c42ed9 | 313 | first example is useful for natural text. |
5f05dabc | 314 | |
14280422 DD |
315 | As noted in L<USING LOCALES>, C<cmp> compares according to the current |
316 | collation locale when C<use locale> is in effect, but falls back to a | |
317 | byte-by-byte comparison for strings which the locale says are equal. You | |
318 | can use POSIX::strcoll() if you don't want this fall-back: | |
319 | ||
320 | use POSIX qw(strcoll); | |
321 | $equal_in_locale = | |
322 | !strcoll("space and case ignored", "SpaceAndCaseIgnored"); | |
323 | ||
324 | $equal_in_locale will be true if the collation locale specifies a | |
325 | dictionary-like ordering which ignores space characters completely, and | |
9e3a2af8 | 326 | which folds case. |
14280422 DD |
327 | |
328 | If you have a single string which you want to check for "equality in | |
329 | locale" against several others, you might think you could gain a little | |
330 | efficiency by using POSIX::strxfrm() in conjunction with C<eq>: | |
331 | ||
332 | use POSIX qw(strxfrm); | |
333 | $xfrm_string = strxfrm("Mixed-case string"); | |
334 | print "locale collation ignores spaces\n" | |
335 | if $xfrm_string eq strxfrm("Mixed-casestring"); | |
336 | print "locale collation ignores hyphens\n" | |
337 | if $xfrm_string eq strxfrm("Mixedcase string"); | |
338 | print "locale collation ignores case\n" | |
339 | if $xfrm_string eq strxfrm("mixed-case string"); | |
340 | ||
341 | strxfrm() takes a string and maps it into a transformed string for use | |
342 | in byte-by-byte comparisons against other transformed strings during | |
343 | collation. "Under the hood", locale-affected Perl comparison operators | |
344 | call strxfrm() for both their operands, then do a byte-by-byte | |
345 | comparison of the transformed strings. By calling strxfrm() explicitly, | |
346 | and using a non locale-affected comparison, the example attempts to save | |
347 | a couple of transformations. In fact, it doesn't save anything: Perl | |
2ae324a7 | 348 | magic (see L<perlguts/Magic Variables>) creates the transformed version of a |
14280422 DD |
349 | string the first time it's needed in a comparison, then keeps it around |
350 | in case it's needed again. An example rewritten the easy way with | |
e38874e2 | 351 | C<cmp> runs just about as fast. It also copes with null characters |
14280422 | 352 | embedded in strings; if you call strxfrm() directly, it treats the first |
e38874e2 DD |
353 | null it finds as a terminator. And don't expect the transformed strings |
354 | it produces to be portable across systems - or even from one revision | |
355 | of your operating system to the next. In short, don't call strxfrm() | |
356 | directly: let Perl do it for you. | |
14280422 DD |
357 | |
358 | Note: C<use locale> isn't shown in some of these examples, as it isn't | |
359 | needed: strcoll() and strxfrm() exist only to generate locale-dependent | |
360 | results, and so always obey the current C<LC_COLLATE> locale. | |
5f05dabc | 361 | |
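The strcoll() test for "equality in locale" can be wrapped in a small function. This is a sketch only: C<equal_in_locale> is a hypothetical name, and the strings are illustrative. Which pairs compare equal depends entirely on the current C<LC_COLLATE> locale.

```perl
use strict;
use warnings;
use POSIX qw(strcoll);

# Hypothetical helper: true if the current LC_COLLATE locale considers
# the two strings equal, with no byte-by-byte fall-back.
sub equal_in_locale {
    my ($left, $right) = @_;
    return strcoll($left, $right) == 0;
}

my $reference  = "Mixed-case string";
my @candidates = ("Mixed-case string", "mixed-case string", "Mixed-casestring");

for my $candidate (@candidates) {
    printf "%-20s %s\n", $candidate,
        equal_in_locale($reference, $candidate)
            ? "equal in locale"
            : "different in locale";
}
```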
362 | =head2 Category LC_CTYPE: Character Types | |
363 | ||
364 | When in the scope of S<C<use locale>>, Perl obeys the C<LC_CTYPE> locale | |
14280422 DD |
365 | setting. This controls the application's notion of which characters are |
366 | alphabetic. This affects Perl's C<\w> regular expression metanotation, | |
367 | which stands for alphanumeric characters - that is, alphabetic and | |
368 | numeric characters. (Consult L<perlre> for more information about | |
369 | regular expressions.) Thanks to C<LC_CTYPE>, depending on your locale | |
370 | setting, characters like 'E<aelig>', 'E<eth>', 'E<szlig>', and | |
371 | 'E<oslash>' may be understood as C<\w> characters. | |
5f05dabc | 372 | |
e38874e2 | 373 | The C<LC_CTYPE> locale also provides the map used in translating |
68dc0745 | 374 | characters between lower and uppercase. This affects the case-mapping |
e38874e2 DD |
375 | functions - lc(), lcfirst(), uc() and ucfirst(); case-mapping | |
376 | interpolation with C<\l>, C<\L>, C<\u> or C<\U> in double-quoted strings | |
377 | and in C<s///> substitutions; and case-independent regular expression | |
378 | pattern matching using the C<i> modifier. | |
379 | ||
380 | Finally, C<LC_CTYPE> affects the POSIX character-class test functions - | |
14280422 DD |
381 | isalpha(), islower() and so on. For example, if you move from the "C" |
382 | locale to a 7-bit Scandinavian one, you may find - possibly to your | |
383 | surprise - that "|" moves from the ispunct() class to isalpha(). | |
5f05dabc | 384 | |
14280422 DD |
385 | B<Note:> A broken or malicious C<LC_CTYPE> locale definition may result |
386 | in clearly ineligible characters being considered to be alphanumeric by | |
387 | your application. For strict matching of (unaccented) letters and | |
388 | digits - for example, in command strings - locale-aware applications | |
389 | should use C<\w> inside a C<no locale> block. See L<"SECURITY">. | |
5f05dabc | 390 | |
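The advice in the note above might look like this in a locale-aware program. It is a sketch under assumptions: C<is_command_word> is a hypothetical function, and the rule that a command word consists only of C<\w> characters is invented for the example.

```perl
use strict;
use warnings;
use locale;    # the rest of the program is locale-aware

# Hypothetical check for a command word. Inside the sub, "no locale"
# makes \w ignore the LC_CTYPE locale.
sub is_command_word {
    my ($word) = @_;
    no locale;
    return $word =~ /\A\w+\z/ ? 1 : 0;
}

for my $word ("list_all", "rm|mail", "a>b") {
    printf "%-10s %s\n", $word,
        is_command_word($word) ? "accepted" : "rejected";
}
```

The S<C<no locale>> pragma is lexically scoped, so it affects only the body of the sub and leaves the surrounding S<C<use locale>> in force.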
391 | =head2 Category LC_NUMERIC: Numeric Formatting | |
392 | ||
393 | When in the scope of S<C<use locale>>, Perl obeys the C<LC_NUMERIC> | |
14280422 DD |
394 | locale information, which controls an application's idea of how numbers |
395 | should be formatted for human readability by the printf(), sprintf(), | |
396 | and write() functions. String to numeric conversion by the | |
397 | POSIX::strtod() function is also affected. In most implementations the | |
398 | only effect is to change the character used for the decimal point - | |
399 | perhaps from '.' to ',': these functions aren't aware of such niceties | |
400 | as thousands separation and so on. (See L<The localeconv function> if | |
401 | you care about these things.) | |
402 | ||
403 | Note that output produced by print() is B<never> affected by the | |
5f05dabc | 404 | current locale: it is independent of whether C<use locale> or C<no |
14280422 | 405 | locale> is in effect, and corresponds to what you'd get from printf() |
5f05dabc | 406 | in the "C" locale. The same is true for Perl's internal conversions |
407 | between numeric and string formats: | |
408 | ||
409 | use POSIX qw(strtod); | |
410 | use locale; | |
14280422 | 411 | |
5f05dabc | 412 | $n = 5/2; # Assign numeric 2.5 to $n |
413 | ||
414 | $a = " $n"; # Locale-independent conversion to string | |
415 | ||
416 | print "half five is $n\n"; # Locale-independent output | |
417 | ||
418 | printf "half five is %g\n", $n; # Locale-dependent output | |
419 | ||
14280422 DD |
420 | print "DECIMAL POINT IS COMMA\n" |
421 | if $n == (strtod("2,5"))[0]; # Locale-dependent conversion | |
5f05dabc | 422 | |
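When a program must read machine-generated numbers that always use '.' as the decimal point, one option is to pin C<LC_NUMERIC> to the "C" locale around the conversion. The following is a sketch, not a recommendation from the original text: C<parse_c_number> is a hypothetical helper, and it relies on POSIX::strtod() returning the number together with the count of unparsed characters when called in list context.

```perl
use strict;
use warnings;
use POSIX qw(locale_h strtod);

# Hypothetical helper: parse the leading number in $text using the "C"
# locale's decimal point, whatever the user's LC_NUMERIC locale is.
sub parse_c_number {
    my ($text) = @_;
    my $old = setlocale(LC_NUMERIC);      # query and save the old locale
    setlocale(LC_NUMERIC, "C");
    my ($num, $unparsed) = strtod($text); # list context: value, leftover count
    setlocale(LC_NUMERIC, $old);          # restore the old locale
    return ($num, $unparsed);
}

my ($num, $unparsed) = parse_c_number("2.5kg");
print "value $num, $unparsed trailing characters not parsed\n";
```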
423 | =head2 Category LC_MONETARY: Formatting of monetary amounts | |
424 | ||
14280422 DD |
425 | The C standard defines the C<LC_MONETARY> category, but no function that |
426 | is affected by its contents. (Those with experience of standards | |
b0c42ed9 | 427 | committees will recognize that the working group decided to punt on the |
14280422 DD |
428 | issue.) Consequently, Perl takes no notice of it. If you really want |
429 | to use C<LC_MONETARY>, you can query its contents - see L<The localeconv | |
430 | function> - and use the information that it returns in your | |
b0c42ed9 | 431 | application's own formatting of currency amounts. However, you may well |
14280422 DD |
432 | find that the information, though voluminous and complex, does not quite |
433 | meet your requirements: currency formatting is a hard nut to crack. | |
5f05dabc | 434 | |
435 | =head2 LC_TIME | |
436 | ||
14280422 | 437 | The output produced by POSIX::strftime(), which builds a formatted |
5f05dabc | 438 | human-readable date/time string, is affected by the current C<LC_TIME> |
439 | locale. Thus, in a French locale, the output produced by the C<%B> | |
440 | format element (full month name) for the first month of the year would | |
441 | be "janvier". Here's how to get a list of the long month names in the | |
442 | current locale: | |
443 | ||
444 | use POSIX qw(strftime); | |
14280422 DD |
445 | for (0..11) { |
446 | $long_month_name[$_] = | |
447 | strftime("%B", 0, 0, 0, 1, $_, 96); | |
5f05dabc | 448 | } |
449 | ||
14280422 DD |
450 | Note: C<use locale> isn't needed in this example: as a function which |
451 | exists only to generate locale-dependent results, strftime() always | |
452 | obeys the current C<LC_TIME> locale. | |
5f05dabc | 453 | |
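The same approach yields the full weekday names through the C<%A> format element. A minimal sketch, assuming a hypothetical function C<weekday_names>; it relies on the fact that 7 January 1996 was a Sunday, and passes the matching weekday number as well.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: full weekday names, Sunday first, in the current
# LC_TIME locale. 7 January 1996 was a Sunday, so days 7..13 of that
# month run from Sunday to Saturday.
sub weekday_names {
    return map { strftime("%A", 0, 0, 0, 7 + $_, 0, 96, $_) } 0 .. 6;
}

my @day_name = weekday_names();
print join(", ", @day_name), "\n";
```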
454 | =head2 Other categories | |
455 | ||
456 | The remaining locale category, C<LC_MESSAGES> (possibly supplemented by | |
457 | others in particular implementations) is not currently used by Perl - | |
b0c42ed9 | 458 | except possibly to affect the behavior of library functions called by |
14280422 DD |
459 | extensions which are not part of the standard Perl distribution. |
460 | ||
461 | =head1 SECURITY | |
462 | ||
463 | While the main discussion of Perl security issues can be found in | |
464 | L<perlsec>, a discussion of Perl's locale handling would be incomplete | |
465 | if it did not draw your attention to locale-dependent security issues. | |
466 | Locales - particularly on systems which allow unprivileged users to | |
467 | build their own locales - are untrustworthy. A malicious (or just plain | |
468 | broken) locale can make a locale-aware application give unexpected | |
469 | results. Here are a few possibilities: | |
470 | ||
471 | =over 4 | |
472 | ||
473 | =item * | |
474 | ||
475 | Regular expression checks for safe file names or mail addresses using | |
476 | C<\w> may be spoofed by an C<LC_CTYPE> locale which claims that | |
477 | characters such as "E<gt>" and "|" are alphanumeric. | |
478 | ||
479 | =item * | |
480 | ||
e38874e2 DD |
481 | String interpolation with case-mapping, as in, say, C<$dest = |
482 | "C:\U$name.$ext">, may produce dangerous results if a bogus C<LC_CTYPE> | |
483 | case-mapping table is in effect. | |
484 | ||
485 | =item * | |
486 | ||
14280422 DD |
487 | If the decimal point character in the C<LC_NUMERIC> locale is |
488 | surreptitiously changed from a dot to a comma, C<sprintf("%g", | |
489 | 0.123456e3)> produces a string result of "123,456". Many people would | |
490 | interpret this as one hundred and twenty-three thousand, four hundred | |
491 | and fifty-six. | |
492 | ||
493 | =item * | |
494 | ||
495 | A sneaky C<LC_COLLATE> locale could result in the names of students with | |
496 | "D" grades appearing ahead of those with "A"s. | |
497 | ||
498 | =item * | |
499 | ||
500 | An application which takes the trouble to use the information in | |
501 | C<LC_MONETARY> may format debits as if they were credits and vice versa | |
502 | if that locale has been subverted. Or it may make payments in US | |
503 | dollars instead of Hong Kong dollars. | |
504 | ||
505 | =item * | |
506 | ||
507 | The date and day names in dates formatted by strftime() could be | |
508 | manipulated to advantage by a malicious user able to subvert the | |
509 | C<LC_TIME> locale. ("Look - it says I wasn't in the building on | |
510 | Sunday.") | |
511 | ||
512 | =back | |
513 | ||
514 | Such dangers are not peculiar to the locale system: any aspect of an | |
515 | application's environment which may maliciously be modified presents | |
516 | similar challenges. Similarly, they are not specific to Perl: any | |
517 | programming language which allows you to write programs which take | |
518 | account of their environment exposes you to these issues. | |
519 | ||
520 | Perl cannot protect you from all of the possibilities shown in the | |
521 | examples - there is no substitute for your own vigilance - but, when | |
522 | C<use locale> is in effect, Perl uses the tainting mechanism (see | |
523 | L<perlsec>) to mark string results which become locale-dependent, and | |
524 | which may be untrustworthy in consequence. Here is a summary of the | |
b0c42ed9 | 525 | tainting behavior of operators and functions which may be affected by |
14280422 DD |
526 | the locale: |
527 | ||
528 | =over 4 | |
529 | ||
530 | =item B<Comparison operators> (C<lt>, C<le>, C<ge>, C<gt> and C<cmp>): | |
531 | ||
532 | Scalar true/false (or less/equal/greater) result is never tainted. | |
533 | ||
e38874e2 DD |
534 | =item B<Case-mapping interpolation> (with C<\l>, C<\L>, C<\u> or C<\U>) |
535 | ||
536 | Result string containing interpolated material is tainted if | |
537 | C<use locale> is in effect. | |
538 | ||
14280422 DD |
539 | =item B<Matching operator> (C<m//>): |
540 | ||
541 | Scalar true/false result never tainted. | |
542 | ||
543 | Subpatterns, either delivered as an array-context result, or as $1 etc. | |
544 | are tainted if C<use locale> is in effect, and the subpattern regular | |
e38874e2 DD |
545 | expression contains C<\w> (to match an alphanumeric character), C<\W> |
546 | (non-alphanumeric character), C<\s> (white-space character), or C<\S> | |
547 | (non white-space character). The matched-pattern variables, $&, $` | |
548 | (pre-match), $' (post-match), and $+ (last match) are also tainted if | |
549 | C<use locale> is in effect and the regular expression contains C<\w>, | |
550 | C<\W>, C<\s>, or C<\S>. | |
14280422 DD |
551 | |
552 | =item B<Substitution operator> (C<s///>): | |
553 | ||
e38874e2 DD |
554 | Has the same behavior as the match operator. Also, the left |
555 | operand of C<=~> becomes tainted when C<use locale> is in effect, | |
556 | if it is modified as a result of a substitution based on a regular | |
557 | expression match involving C<\w>, C<\W>, C<\s>, or C<\S>; or of | |
558 | case-mapping with C<\l>, C<\L>, C<\u> or C<\U>. | |
14280422 DD |
559 | |
560 | =item B<In-memory formatting function> (sprintf()): | |
561 | ||
562 | Result is tainted if C<use locale> is in effect. | |
563 | ||
564 | =item B<Output formatting functions> (printf() and write()): | |
565 | ||
566 | Success/failure result is never tainted. | |
567 | ||
568 | =item B<Case-mapping functions> (lc(), lcfirst(), uc(), ucfirst()): | |
569 | ||
570 | Results are tainted if C<use locale> is in effect. | |
571 | ||
572 | =item B<POSIX locale-dependent functions> (localeconv(), strcoll(), | |
573 | strftime(), strxfrm()): | |
574 | ||
575 | Results are never tainted. | |
576 | ||
577 | =item B<POSIX character class tests> (isalnum(), isalpha(), isdigit(), | |
578 | isgraph(), islower(), isprint(), ispunct(), isspace(), isupper(), | |
579 | isxdigit()): | |
580 | ||
581 | True/false results are never tainted. | |
582 | ||
583 | =back | |
584 | ||
585 | Three examples illustrate locale-dependent tainting. | |
586 | The first program, which ignores its locale, won't run: a value taken | |
54310121 | 587 | directly from the command line may not be used to name an output file |
14280422 DD |
588 | when taint checks are enabled. |
589 | ||
590 | #!/usr/local/bin/perl -T | |
591 | # Run with taint checking | |
592 | ||
54310121 | 593 | # Command line sanity check omitted... |
14280422 DD |
594 | $tainted_output_file = shift; |
595 | ||
596 | open(F, ">$tainted_output_file") | |
597 | or warn "Open of $tainted_output_file failed: $!\n"; | |
598 | ||
599 | The program can be made to run by "laundering" the tainted value through | |
600 | a regular expression: the second example - which still ignores locale | |
54310121 | 601 | information - runs, creating the file named on its command line |
14280422 DD |
602 | if it can. |
603 | ||
604 | #!/usr/local/bin/perl -T | |
605 | ||
606 | $tainted_output_file = shift; | |
607 | $tainted_output_file =~ m%[\w/]+%; | |
608 | $untainted_output_file = $&; | |
609 | ||
610 | open(F, ">$untainted_output_file") | |
611 | or warn "Open of $untainted_output_file failed: $!\n"; | |
612 | ||
613 | Compare this with a very similar program which is locale-aware: | |
614 | ||
615 | #!/usr/local/bin/perl -T | |
616 | ||
617 | $tainted_output_file = shift; | |
618 | use locale; | |
619 | $tainted_output_file =~ m%[\w/]+%; | |
620 | $localized_output_file = $&; | |
621 | ||
622 | open(F, ">$localized_output_file") | |
623 | or warn "Open of $localized_output_file failed: $!\n"; | |
624 | ||
625 | This third program fails to run because $& is tainted: it is the result | |
626 | of a match involving C<\w> when C<use locale> is in effect. | |
5f05dabc | 627 | |
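A locale-aware program that still needs to launder a file name can confine the match to a S<C<no locale>> block, so that C<\w> does not depend on the locale. The following is a sketch under assumptions: C<launder_file_name> is a hypothetical function, and the set of permitted characters is only an example. Passing the pattern does not by itself make a path safe to use.

```perl
use strict;
use warnings;
use locale;    # the rest of the program is locale-aware

# Hypothetical helper: return the file name if it consists only of
# \w characters and slashes, or undef otherwise. The match is done
# under "no locale", so its result is not locale-dependent.
sub launder_file_name {
    my ($name) = @_;
    no locale;
    return $name =~ m%\A([\w/]+)\z% ? $1 : undef;
}

for my $candidate ("out/report_1", "out;rm -rf") {
    my $clean = launder_file_name($candidate);
    print defined $clean
        ? "accepted: $clean\n"
        : "rejected: $candidate\n";
}
```

Unlike the examples above, this version anchors the pattern and uses a capture group, so a name containing any other character is rejected outright instead of being silently shortened to its first acceptable run.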
628 | =head1 ENVIRONMENT | |
629 | ||
630 | =over 12 | |
631 | ||
632 | =item PERL_BADLANG | |
633 | ||
14280422 | 634 | A string that can suppress Perl's warning about failed locale settings |
54310121 | 635 | at startup. Failure can occur if the locale support in the operating |
14280422 DD |
636 | system is lacking (broken) in some way - or if you mistyped the name of | |
637 | a locale when you set up your environment. If this environment variable | |
638 | is absent, or has a value that does not evaluate to integer zero (as | |
639 | "0" and "" do), Perl will complain about locale setting failures. | |
5f05dabc | 640 | |
14280422 DD |
641 | B<NOTE>: PERL_BADLANG only gives you a way to hide the warning message. |
642 | The message tells about some problem in your system's locale support, | |
643 | and you should investigate what the problem is. | |
5f05dabc | 644 | |
645 | =back | |
646 | ||
647 | The following environment variables are not specific to Perl: they are | |
14280422 DD |
648 | part of the standardized (ISO C, XPG4, POSIX 1.c) setlocale() method |
649 | for controlling an application's opinion on data. | |
5f05dabc | 650 | |
651 | =over 12 | |
652 | ||
653 | =item LC_ALL | |
654 | ||
655 | C<LC_ALL> is the "override-all" locale environment variable. If it is | |
656 | set, it overrides all the rest of the locale environment variables. | |
657 | ||
658 | =item LC_CTYPE | |
659 | ||
660 | In the absence of C<LC_ALL>, C<LC_CTYPE> chooses the character type | |
661 | locale. In the absence of both C<LC_ALL> and C<LC_CTYPE>, C<LANG> | |
662 | chooses the character type locale. | |
663 | ||
664 | =item LC_COLLATE | |
665 | ||
14280422 DD |
666 | In the absence of C<LC_ALL>, C<LC_COLLATE> chooses the collation |
667 | (sorting) locale. In the absence of both C<LC_ALL> and C<LC_COLLATE>, | |
668 | C<LANG> chooses the collation locale. | |
5f05dabc | 669 | |
670 | =item LC_MONETARY | |
671 | ||
14280422 DD |
672 | In the absence of C<LC_ALL>, C<LC_MONETARY> chooses the monetary |
673 | formatting locale. In the absence of both C<LC_ALL> and C<LC_MONETARY>, | |
674 | C<LANG> chooses the monetary formatting locale. | |
5f05dabc | 675 | |
676 | =item LC_NUMERIC | |
677 | ||
678 | In the absence of C<LC_ALL>, C<LC_NUMERIC> chooses the numeric format | |
679 | locale. In the absence of both C<LC_ALL> and C<LC_NUMERIC>, C<LANG> | |
680 | chooses the numeric format locale. | |
681 | ||
682 | =item LC_TIME | |
683 | ||
14280422 DD |
684 | In the absence of C<LC_ALL>, C<LC_TIME> chooses the date and time |
685 | formatting locale. In the absence of both C<LC_ALL> and C<LC_TIME>, | |
686 | C<LANG> chooses the date and time formatting locale. | |
5f05dabc | 687 | |
688 | =item LANG | |
689 | ||
14280422 DD |
690 | C<LANG> is the "catch-all" locale environment variable. If it is set, it |
691 | is used as the last resort after the overall C<LC_ALL> and the | |
5f05dabc | 692 | category-specific C<LC_...>. |
693 | ||
694 | =back | |
695 | ||
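The precedence described above (C<LC_ALL> first, then the category-specific variable, then C<LANG>) can be sketched as a small lookup. This is only an illustration of the ordering, not how Perl or the C library implements it: C<effective_locale> is a hypothetical helper, treating an empty value as unset is an assumption based on common POSIX behavior, and the final fallback to C<"C"> is likewise an assumption, since the real default is system-dependent.

```perl
use strict;
use warnings;

# Hypothetical helper: given a category name such as 'LC_COLLATE' and a
# hash reference of environment variables, return the locale name that
# the precedence rules would select.
sub effective_locale {
    my ($category, $env) = @_;
    for my $name ('LC_ALL', $category, 'LANG') {
        my $value = $env->{$name};
        # Assumption: an empty string counts as "not set".
        return $value if defined $value && length $value;
    }
    return 'C';    # assumed fallback when nothing is set
}

my %env = (LANG => 'en_US', LC_COLLATE => 'fi_FI');
print effective_locale('LC_COLLATE', \%env), "\n";    # prints fi_FI
print effective_locale('LC_TIME',    \%env), "\n";    # prints en_US

$env{LC_ALL} = 'de_DE';
print effective_locale('LC_COLLATE', \%env), "\n";    # prints de_DE
```

Passing C<\%ENV> instead of the sample hash shows which name would be chosen for the running process.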
696 | =head1 NOTES | |
697 | ||
698 | =head2 Backward compatibility | |
699 | ||
b0c42ed9 JH |
700 | Versions of Perl prior to 5.004 B<mostly> ignored locale information, |
701 | generally behaving as if something similar to the C<"C"> locale (see | |
702 | L<The setlocale function>) was always in force, even if the program | |
5f05dabc | 703 | environment suggested otherwise. By default, Perl still behaves this |
704 | way so as to maintain backward compatibility. If you want a Perl | |
b0c42ed9 | 705 | application to pay attention to locale information, you B<must> use |
2ae324a7 | 706 | the S<C<use locale>> pragma (see L<The use locale Pragma>) to |
b0c42ed9 JH |
707 | instruct it to do so. |
708 | ||
709 | Versions of Perl from 5.002 to 5.003 did use the C<LC_CTYPE> | |
710 | information if that was available; that is, C<\w> did recognize which | |
711 | characters are letters according to the locale environment variables. | |
712 | The problem was that the user had no control over the feature: | |
713 | if the C library supported locales, Perl used them. | |
714 | ||
715 | =head2 I18N::Collate obsolete | |
716 | ||
717 | In versions of Perl prior to 5.004 per-locale collation was possible | |
718 | using the C<I18N::Collate> library module. This module is now mildly | |
719 | obsolete and should be avoided in new applications. The C<LC_COLLATE> | |
720 | functionality is now integrated into the Perl core language: one can | |
721 | use locale-specific scalar data completely normally with C<use locale>, | |
722 | so there is no longer any need to juggle with the scalar references of | |
723 | C<I18N::Collate>. | |
5f05dabc | 724 | |
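As a minimal sketch of what "completely normally" means here: under C<use locale> the ordinary C<cmp> operator and C<sort> follow the C<LC_COLLATE> locale, with no helper objects involved. The subroutine name C<sort_by_locale> and the sample words are assumptions made for illustration, and the resulting order depends on the locale in force when the program runs.

```perl
use strict;
use warnings;
use locale;    # cmp and sort below obey the LC_COLLATE locale

# Hypothetical helper: sort plain scalars using locale collation.
sub sort_by_locale {
    return sort { $a cmp $b } @_;
}

my @names  = qw(pear Apple banana);
my @sorted = sort_by_locale(@names);
print "@sorted\n";    # order depends on the current collation locale
```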
14280422 | 725 | =head2 Sort speed and memory use impacts |
5f05dabc | 726 | |
727 | Comparing and sorting by locale is usually slower than the default | |
14280422 DD |
728 | sorting; slow-downs of two to four times have been observed. It will |
729 | also consume more memory: once a Perl scalar variable has participated | |
730 | in any string comparison or sorting operation obeying the locale | |
731 | collation rules, it will take 3-15 times more memory than before. (The | |
732 | exact multiplier depends on the string's contents, the operating system | |
733 | and the locale.) These downsides are dictated more by the operating | |
734 | system's implementation of the locale system than by Perl. | |
5f05dabc | 735 | |
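When many strings must be sorted, one possible way to limit the cost is to transform each string once with C<POSIX::strxfrm> and then compare the transformed keys with a plain, non-locale C<cmp>. The sketch below assumes that approach suits your data; C<sort_with_strxfrm> is a hypothetical name, and no timing or memory figures are claimed for it. Note that the comparison is deliberately compiled outside the scope of C<use locale>, so the keys are not transformed a second time.

```perl
use strict;
use warnings;
use POSIX qw(strxfrm);

# Hypothetical helper: compute one collation key per string, sort on
# the keys with an ordinary comparison, then recover the strings.
sub sort_with_strxfrm {
    return map  { $_->[1] }
           sort { $a->[0] cmp $b->[0] }
           map  { [ strxfrm($_), $_ ] } @_;
}

my @names  = qw(pear Apple banana);
my @sorted = sort_with_strxfrm(@names);
print "@sorted\n";    # order depends on the current collation locale
```

The temporary key list is discarded once the sort finishes, so the transformed strings do not stay attached to the original scalars.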
e38874e2 DD |
736 | =head2 write() and LC_NUMERIC |
737 | ||
738 | Formats are the only part of Perl which unconditionally use information | |
739 | from a program's locale; if a program's environment specifies an | |
740 | LC_NUMERIC locale, it is always used to specify the decimal point | |
741 | character in formatted output. Formatted output cannot be controlled by | |
742 | C<use locale> because the pragma is tied to the block structure of the | |
743 | program, and, for historical reasons, formats exist outside that block | |
744 | structure. | |
745 | ||
5f05dabc | 746 | =head2 Freely available locale definitions |
747 | ||
748 | There is a large collection of locale definitions at | |
14280422 DD |
749 | C<ftp://dkuug.dk/i18n/WG15-collection>. You should be aware that it is |
750 | unsupported, and is not claimed to be fit for any purpose. If your | |
751 | system allows the installation of arbitrary locales, you may find the | |
752 | definitions useful as they are, or as a basis for the development of | |
753 | your own locales. | |
5f05dabc | 754 | |
14280422 | 755 | =head2 I18n and l10n |
5f05dabc | 756 | |
b0c42ed9 JH |
757 | "Internationalization" is often abbreviated as B<i18n> because its first |
758 | and last letters are separated by eighteen others. (You may guess why | |
759 | the internalin ... internaliti ... i18n tends to get abbreviated.) In | |
760 | the same way, "localization" is often abbreviated to B<l10n>. | |
14280422 DD |
761 | |
762 | =head2 An imperfect standard | |
763 | ||
764 | Internationalization, as defined in the C and POSIX standards, can be | |
765 | criticized as incomplete, ungainly, and having too large a granularity. | |
766 | (Locales apply to a whole process, when it would arguably be more useful | |
767 | to have them apply to a single thread, window group, or whatever.) They | |
768 | also have a tendency, like standards groups, to divide the world into | |
769 | nations, when we all know that the world can equally well be divided | |
770 | into bankers, bikers, gamers, and so on. But, for now, it's the only | |
771 | standard we've got. This may be construed as a bug. | |
5f05dabc | 772 | |
773 | =head1 BUGS | |
774 | ||
775 | =head2 Broken systems | |
776 | ||
2bdf8add JH |
777 | In certain system environments the operating system's locale support |
778 | is broken and cannot be fixed or used by Perl. Such deficiencies can | |
779 | and will result in mysterious hangs and/or Perl core dumps when the | |
780 | C<use locale> pragma is in effect. When confronted with such a system, | |
9607fc9c | 781 | please report in excruciating detail to <F<perlbug@perl.com>>, and |
2bdf8add JH |
782 | complain to your vendor: maybe some bug fixes exist for these problems |
783 | in your operating system. Sometimes such bug fixes are called an | |
784 | operating system upgrade. | |
5f05dabc | 785 | |
786 | =head1 SEE ALSO | |
787 | ||
788 | L<POSIX (3)/isalnum>, L<POSIX (3)/isalpha>, L<POSIX (3)/isdigit>, | |
789 | L<POSIX (3)/isgraph>, L<POSIX (3)/islower>, L<POSIX (3)/isprint>, | |
790 | L<POSIX (3)/ispunct>, L<POSIX (3)/isspace>, L<POSIX (3)/isupper>, | |
791 | L<POSIX (3)/isxdigit>, L<POSIX (3)/localeconv>, L<POSIX (3)/setlocale>, | |
14280422 DD |
792 | L<POSIX (3)/strcoll>, L<POSIX (3)/strftime>, L<POSIX (3)/strtod>, |
793 | L<POSIX (3)/strxfrm> | |
5f05dabc | 794 | |
795 | =head1 HISTORY | |
796 | ||
b0c42ed9 | 797 | Jarkko Hietaniemi's original F<perli18n.pod> heavily hacked by Dominic |
14280422 | 798 | Dunlop, assisted by the perl5-porters. |
5f05dabc | 799 | |
9e3a2af8 | 800 | Last update: Wed Jan 22 11:04:58 EST 1997 |