package Unicode::Collate::Locale;

use strict;
use Carp;
use File::Spec;
use base qw(Unicode::Collate);

# Directory holding the per-locale tailoring files, derived from this module's path.
(my $ModPath = $INC{'Unicode/Collate/Locale.pm'}) =~ s/\.pm$//;
my $KeyPath = File::Spec->catfile('allkeys.txt');
my $PL_EXT  = '.pl';    # extension of the locale data files (assumed)

# Locale names that map to a tailoring file of the same name.
my %LocaleFile = map { ($_, $_) } qw(
   af ar az ca cs cy da eo es et fi fil fo fr ha haw
   hr hu hy ig is ja kk kl ko lt lv mt nb nn nso om pl ro ru
   se sk sl sq sv sw tn to tr uk vi wo yo zh
);
   $LocaleFile{'default'} = '';
# Variants and aliases whose file name differs from the locale name.
   $LocaleFile{'de__phonebook'}   = 'de_phone';
   $LocaleFile{'es__traditional'} = 'es_trad';
   $LocaleFile{'be'} = 'ru';
   $LocaleFile{'bg'} = 'ru';
   $LocaleFile{'mk'} = 'ru';
   $LocaleFile{'sr'} = 'ru';
   $LocaleFile{'zh__big5han'}   = 'zh_big5';
   $LocaleFile{'zh__gb2312han'} = 'zh_gb';
   $LocaleFile{'zh__pinyin'}    = 'zh_pin';
   $LocaleFile{'zh__stroke'}    = 'zh_strk';

# Normalize a user-supplied locale name and return the accepted locale,
# or 'default' if no tailoring matches.
sub _locale {
    my $locale = shift;
    if ($locale) {
        $locale = lc $locale;
        $locale =~ tr/\-\ \./_/;
        $locale =~ s/_phone(?:bk)?\z/_phonebook/;
        $locale =~ s/_trad\z/_traditional/;
        $locale =~ s/_big5\z/_big5han/;
        $locale =~ s/_gb2312\z/_gb2312han/;
        $LocaleFile{$locale} and return $locale;

        my ($l,$t,$v) = split(/_/, $locale.'__');
        for my $loc ("${l}_${t}_$v", "${l}_$t", "${l}__$v", "${l}__$t", $l) {
            $LocaleFile{$loc} and return $loc;
        }
    }
    return 'default';
}

sub getlocale {
    return shift->{accepted_locale};
}

# Load the tailoring hash for an accepted locale; 'default' has no file.
sub _fetchpl {
    my $accepted = shift;
    my $f = $LocaleFile{$accepted};
    return if !$f;
    $f .= $PL_EXT;
    my $path = File::Spec->catfile($ModPath, $f);
    my $h = do $path;
    croak "Unicode/Collate/Locale/$f can't be found" if !$h;
    return $h;
}

sub new {
    my $class = shift;
    my %hash = @_;
    $hash{accepted_locale} = _locale($hash{locale});

    if (exists $hash{table}) {
        croak "your table can't be used with Unicode::Collate::Locale";
    }
    $hash{table} = $KeyPath;

    my $href = _fetchpl($hash{accepted_locale});
    while (my($k,$v) = each %$href) {
        if (exists $hash{$k}) {
            croak "$k is reserved by $hash{locale}, can't be overwritten";
        }
        $hash{$k} = $v;
    }
    return $class->SUPER::new(%hash);
}

1;
__END__

=head1 NAME

Unicode::Collate::Locale - Linguistic tailoring for DUCET via Unicode::Collate

=head1 SYNOPSIS

  use Unicode::Collate::Locale;

  # construct
  $Collator = Unicode::Collate::Locale->
      new(locale => $locale_name, %tailoring);

  # sort
  @sorted = $Collator->sort(@not_sorted);

  # compare
  $result = $Collator->cmp($a, $b); # returns 1, 0, or -1.

B<Note:> Strings in C<@not_sorted>, C<$a> and C<$b> are interpreted
according to Perl's Unicode support. See L<perlunicode>,
L<perluniintro>, L<perlunitut>, L<perlunifaq>, L<utf8>.
If your strings are not yet decoded, either use C<preprocess>
(cf. C<Unicode::Collate>) or decode them beforehand.

=head1 DESCRIPTION

This module provides linguistic tailoring for DUCET,
building on C<Unicode::Collate>.

=head2 Constructor

The C<new> method returns a collator object.

The constructor takes a hash as its parameter list, which can include
the special key C<'locale'>. Its value (case-insensitive) is
an ISO-639 language code (usually two letters), like C<'en'> for English.
For example, C<Unicode::Collate::Locale-E<gt>new(locale =E<gt> 'FR')>
returns a collator tailored for French.

C<$locale_name> may be suffixed with a territory (country)
code or a variant code, separated by C<'_'>.
E.g. C<en_US> for English in USA, or
C<es_ES_traditional> for Spanish in Spain (Traditional).

If no tailoring is defined for C<$locale_name>,
a fallback is selected in the following order:

    1. language_territory_variant
    2. language_territory
    3. language__variant
    4. language
    5. default
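
The fallback order above can be sketched as a small standalone function.
This is only an illustration under stated assumptions: C<resolve_locale>
and the C<%known> table are hypothetical names, and the table holds
a tiny subset of locales rather than the module's real data.

```perl
use strict;
use warnings;

# Hypothetical subset of tailorable locales, for illustration only.
my %known = map { ($_ => 1) } qw(es es__traditional de__phonebook sv);

# Hypothetical helper: returns the first fallback candidate found in
# %known, or 'default' when nothing matches.
sub resolve_locale {
    my ($name) = @_;
    return 'default' unless defined $name && length $name;

    my $locale = lc $name;                      # names are case-insensitive
    $locale =~ tr/\-\ \./_/;                    # '-', ' ' and '.' act like '_'
    $locale =~ s/_phone(?:bk)?\z/_phonebook/;   # expand abbreviated variants
    $locale =~ s/_trad\z/_traditional/;
    return $locale if $known{$locale};

    # Split into language, territory and variant, padding missing parts.
    my @parts = split /_/, $locale;
    push @parts, '' while @parts < 3;
    my ($l, $t, $v) = @parts;

    # "${l}__$t" covers two-part names whose second part is a variant.
    for my $candidate ("${l}_${t}_$v", "${l}_$t", "${l}__$v", "${l}__$t", $l) {
        return $candidate if $known{$candidate};
    }
    return 'default';
}

print resolve_locale($_), "\n" for qw(es_ES_traditional de-phonebk sv_SE en_US);
```

With this table, a name such as C<en_US> matches no candidate and so
resolves to C<'default'>, while C<sv_SE> falls back to the bare language.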

Tailoring tags provided by C<Unicode::Collate> are allowed
as long as they are not used for C<'locale'> support.
In particular, the C<table> tag can never be tailored,
since it is reserved for DUCET.

E.g. a collator for French, which ignores diacritics and case differences
(i.e. level 1), with reversed case ordering and no normalization:

    Unicode::Collate::Locale->new(
        locale => 'fr',
        level => 1,
        upper_before_lower => 1,
        normalization => undef
    )

=head2 Methods

C<Unicode::Collate::Locale> is a subclass of C<Unicode::Collate>,
and all methods other than C<new> are inherited from C<Unicode::Collate>.

Here is a list of additional methods:

=over 4

=item C<$Collator-E<gt>getlocale>

Returns the locale name actually accepted and used for collation.
If linguistic tailoring is not provided for the language code you passed
(intentionally for some languages, or due to the incomplete implementation),
this method returns the string C<'default'>, meaning no special tailoring.

=back
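
As a brief usage sketch, the following compares Swedish tailoring with
the default ordering and prints what C<getlocale> reports for each collator.
It assumes that C<Unicode::Collate::Locale> and its data files are installed;
the word list is made up for illustration, and the set of tailored locales
may differ between versions of the module.

```perl
use strict;
use warnings;
use utf8;                                   # the literals below contain non-ASCII letters
use Unicode::Collate::Locale;

binmode STDOUT, ':encoding(UTF-8)';

my @words = qw(öl zebra apa);               # illustrative sample data

# Swedish tailoring orders 'ö' after 'z'.
my $swedish = Unicode::Collate::Locale->new(locale => 'sv_SE');
my @swedish_order = $swedish->sort(@words);

# A locale without tailoring falls back to the default UCA rules.
my $english = Unicode::Collate::Locale->new(locale => 'en_US');
my @default_order = $english->sort(@words);

print $swedish->getlocale, ": @swedish_order\n";
print $english->getlocale, ": @default_order\n";
```

Checking C<getlocale> after construction is a cheap way to confirm
whether the requested locale was tailored or silently fell back.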

=head2 A list of tailorable locales

      locale name       description
    ----------------------------------------------------------
      az                Azerbaijani (Azeri)
      de__phonebook     German (umlaut as 'ae', 'oe', 'ue')
      es__traditional   Spanish ('ch' and 'll' as a grapheme)
      zh__big5han       Chinese (ideographs: big5 order)
      zh__gb2312han     Chinese (ideographs: GB-2312 order)
      zh__pinyin        Chinese (ideographs: pinyin order)
      zh__stroke        Chinese (ideographs: stroke order)
    ----------------------------------------------------------

Locales according to the default UCA rules include

[1] ja: Ideographs are sorted in JIS X 0208 order.
Fullwidth and halfwidth forms are treated as identical to their normal forms.
Hiragana and katakana differ only at the 4th level;
comparing them also requires C<(variable =E<gt> 'Non-ignorable')>,
and in that case C<katakana_before_hiragana> has no effect.

[2] ko: Many ideographs are sorted by their reading. Such
an ideograph is primary (level 1) equal to, and secondary (level 2)
greater than, the corresponding hangul syllable.

=head1 INSTALL

Installation of C<Unicode::Collate::Locale> requires F<Collate/Locale.pm>,
F<Collate/Locale/*.pm>, F<Collate/CJK/*.pm> and F<Collate/allkeys.txt>.
Building C<Unicode::Collate::Locale> does not require any of F<data/*.txt>,
F<gendata/*>, or F<mklocale>.
Tests for C<Unicode::Collate::Locale> are named F<t/loc_*.t>.

=head1 CAVEAT

=over 4

=item Tailoring is not maximal

Even if a certain letter is tailored, its equivalents are not always
tailored along with it. For example, even though W is tailored,
fullwidth W (C<U+FF37>), W with acute (C<U+1E82>), etc. are not
tailored. The result may depend on whether source strings are
normalized or not, and whether they are decomposed or composed.
Thus C<(normalization =E<gt> undef)> is less preferred.

=back

=head1 AUTHOR

The Unicode::Collate::Locale module for perl was written
by SADAHIRO Tomoyuki, <SADAHIRO@cpan.org>.
This module is Copyright(C) 2004-2010, SADAHIRO Tomoyuki. Japan.

This module is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.

=head1 SEE ALSO

=over 4

=item Unicode Collation Algorithm - UTS #10

L<http://www.unicode.org/reports/tr10/>

=item The Default Unicode Collation Element Table (DUCET)

L<http://www.unicode.org/Public/UCA/latest/allkeys.txt>

=item Unicode Locale Data Markup Language (LDML) - UTS #35

L<http://www.unicode.org/reports/tr35/>

=item CLDR - Unicode Common Locale Data Repository

L<http://cldr.unicode.org/>

=item L<Unicode::Collate>

=item L<Unicode::Normalize>

=back

=cut