package Unicode::Collate::Locale;
use base qw(Unicode::Collate);
(my $ModPath = $INC{'Unicode/Collate/Locale.pm'}) =~ s/\.pm$//;
my %LocaleFile = map { ($_, $_) } qw(
    af ar az ca cs cy da eo es et fi fil fo fr ha haw
    hr hu hy ig is ja kk kl ko lt lv mt nb nn nso om pl ro ru
    se sk sl sq sv sw tn to tr uk vi wo yo zh
);
$LocaleFile{'default'} = '';
$LocaleFile{'de__phonebook'} = 'de_phone';
$LocaleFile{'es__traditional'} = 'es_trad';
$LocaleFile{'be'} = 'ru';
$LocaleFile{'bg'} = 'ru';
$LocaleFile{'mk'} = 'ru';
$LocaleFile{'sr'} = 'ru';
$LocaleFile{'zh__big5han'} = 'zh_big5';
$LocaleFile{'zh__gb2312han'} = 'zh_gb';
$LocaleFile{'zh__pinyin'} = 'zh_pin';
$LocaleFile{'zh__stroke'} = 'zh_strk';
# Treat hyphen, space, and dot as the '_' separator.
$locale =~ tr/\-\ \./_/;
# Expand abbreviated variant suffixes to their full names.
$locale =~ s/_phone(?:bk)?\z/_phonebook/;
$locale =~ s/_trad\z/_traditional/;
$locale =~ s/_big5\z/_big5han/;
$locale =~ s/_gb2312\z/_gb2312han/;
# An exact match wins.
$LocaleFile{$locale} and return $locale;
# Otherwise try candidates from most to least specific.
my ($l,$t,$v) = split(/_/, $locale.'__');
for my $loc ("${l}_${t}_$v", "${l}_$t", "${l}__$v", "${l}__$t", $l) {
    $LocaleFile{$loc} and return $loc;
return shift->{accepted_locale};
my $f = $LocaleFile{$accepted};
my $path = File::Spec->catfile($ModPath, $f);
croak "Unicode/Collate/Locale/$f can't be found" if !$h;
$hash{accepted_locale} = _locale($hash{locale});
if (exists $hash{table}) {
    croak "your table can't be used with Unicode::Collate::Locale";
my $href = _fetchpl($hash{accepted_locale});
while (my($k,$v) = each %$href) {
    if (exists $hash{$k}) {
        croak "$k is reserved by $hash{locale}, can't be overwritten";
return $class->SUPER::new(%hash);
Unicode::Collate::Locale - Linguistic tailoring for DUCET via Unicode::Collate
  use Unicode::Collate::Locale;

  $Collator = Unicode::Collate::Locale->
      new(locale => $locale_name, %tailoring);

  @sorted = $Collator->sort(@not_sorted);

  $result = $Collator->cmp($a, $b); # returns 1, 0, or -1.

B<Note:> Strings in C<@not_sorted>, C<$a> and C<$b> are interpreted
according to Perl's Unicode support. See L<perlunicode>,
L<perluniintro>, L<perlunitut>, L<perlunifaq>, L<utf8>.
If they are not yet character strings, either decode them beforehand
or use C<preprocess> (cf. C<Unicode::Collate>).
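
The two options in the note can be sketched as follows. This is a minimal illustration, assuming the input arrives as UTF-8 encoded octets and that the core C<Encode> module does the decoding; the variable names and the choice of the C<sv> locale are only for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Collate::Locale;

# Assumption: input arrives as UTF-8 encoded octets ("\xC3\xB6l" is "öl").
my @octets = ("\xC3\xB6l", "zebra", "apple");

# Option 1: decode to character strings before sorting.
my @chars = map { decode('UTF-8', $_) } @octets;
my $collator = Unicode::Collate::Locale->new(locale => 'sv');
my @sorted_chars = $collator->sort(@chars);

# Option 2: let the collator decode each string via preprocess;
# sort() then returns the original octet strings in collation order.
my $decoding = Unicode::Collate::Locale->new(
    locale     => 'sv',
    preprocess => sub { decode('UTF-8', $_[0]) },
);
my @sorted_octets = $decoding->sort(@octets);
```

Swedish places the letter with diaeresis after C<z>, so in both variants the "öl" entry should come last.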
This module provides linguistic tailoring for DUCET,
building on C<Unicode::Collate>.
The C<new> method returns a collator object.

The constructor takes its parameters as a hash, which can include
the special key C<locale>. Its value (case-insensitive) is an
ISO 639 language code, two-letter where one exists, like C<'en'> for English.
For example, C<Unicode::Collate::Locale-E<gt>new(locale =E<gt> 'FR')>
returns a collator tailored for French.

C<$locale_name> may be suffixed with a territory (country)
code or a variant code, each separated by C<'_'>.
E.g. C<en_US> for English in USA,
C<es_ES_traditional> for Spanish in Spain (Traditional),

If C<$locale_name> is not among the supported locale names,
a fallback is selected in the following order:

    1. language_territory_variant
    2. language_territory
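
The fallback can be observed through C<getlocale>. The sketch below is only an illustration: C<accepted_locale> is a hypothetical helper, not part of the module, and the three names were picked because C<de__phonebook> and C<sv> have tailoring while C<en> does not. They would therefore be expected to resolve to C<de__phonebook>, C<sv>, and C<default> respectively.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Hypothetical helper: report which locale name the collator accepted.
sub accepted_locale {
    my ($name) = @_;
    return Unicode::Collate::Locale->new(locale => $name)->getlocale;
}

for my $name (qw(de_DE_phonebook sv_SE en_US)) {
    printf "%-16s => %s\n", $name, accepted_locale($name);
}
```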

Tailoring tags provided by C<Unicode::Collate> are allowed as long as
they are not used for C<locale> support. In particular, the C<table> tag
can never be tailored, since it is reserved for DUCET.

E.g. a collator for French, which ignores diacritics and case difference
(i.e. level 1), with reversed case ordering and no normalization.

  Unicode::Collate::Locale->new(
      upper_before_lower => 1,
      normalization => undef
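
A runnable version of such a collator might look like the sketch below. The C<locale> and C<level> values are assumptions taken from the prose description, and the test word is made up for the example.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Assumption: locale => 'fr' and level => 1 follow the prose description.
my $collator = Unicode::Collate::Locale->new(
    locale             => 'fr',
    level              => 1,
    upper_before_lower => 1,
    normalization      => undef,
);

# With normalization disabled, pass precomposed characters
# (U+00E9 and U+00E8 here) so that they match DUCET entries directly.
my $word = "\x{E9}l\x{E8}ve";    # "élève"
my $same = $collator->eq($word, 'ELEVE');
print $same ? "equal at level 1\n" : "different at level 1\n";
```

At level 1 only primary weights are compared, so accents and case should not distinguish the two strings.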

Passing to C<new()> a tailoring that overrides behavior
already tailored by C<locale> is disallowed.

  Unicode::Collate::Locale->new(
      upper_before_lower => 0, # causes error as reserved by 'da'

However, C<change()>, inherited from C<Unicode::Collate>, does allow
changing a tailoring that is reserved by C<locale>. Examples:

  new(locale => 'ca')->change(backwards => undef)
  new(locale => 'da')->change(upper_before_lower => 0)
  new(locale => 'ja')->change(overrideCJK => undef)
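
The contrast between C<new()> and C<change()> can be sketched as follows, using the C<da> case from the examples. The variable names are illustrative, and the exact error text is not relied upon.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Passing a reserved tailoring to new() is expected to croak.
my $ok = eval {
    Unicode::Collate::Locale->new(locale => 'da', upper_before_lower => 0);
    1;
};
my $error = $ok ? '' : $@;
print "new() failed: $error" if $error;

# The same tailoring can be changed afterwards on an existing collator.
my $collator = Unicode::Collate::Locale->new(locale => 'da');
my $before = $collator->cmp('A', 'a');
$collator->change(upper_before_lower => 0);
my $after = $collator->cmp('A', 'a');
print "cmp('A', 'a'): $before before change(), $after after\n";
```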

C<Unicode::Collate::Locale> is a subclass of C<Unicode::Collate>
and methods other than C<new> are inherited from C<Unicode::Collate>.

Here is a list of additional methods:

=item C<$Collator-E<gt>getlocale>

Returns the language code that was accepted and is actually used for collation.
If linguistic tailoring is not provided for a language code you passed
(intentionally for some languages, or due to the incomplete implementation),
this method returns the string C<'default'>, meaning no special tailoring.

=head2 A list of tailorable locales

  locale name       description
  ----------------------------------------------------------
  az                Azerbaijani (Azeri)
  de__phonebook     German (umlaut as 'ae', 'oe', 'ue')
  es__traditional   Spanish ('ch' and 'll' as a grapheme)
  zh__big5han       Chinese (ideographs: big5 order)
  zh__gb2312han     Chinese (ideographs: GB-2312 order)
  zh__pinyin        Chinese (ideographs: pinyin order)
  zh__stroke        Chinese (ideographs: stroke order)
  ----------------------------------------------------------

Locales according to the default UCA rules include

[1] ja: Ideographs are sorted in JIS X 0208 order.
Fullwidth and halfwidth forms are identical to their normal form.
The difference between hiragana and katakana is at the 4th level;
the comparison also requires C<(variable =E<gt> 'Non-ignorable')>,
and then C<katakana_before_hiragana> has no effect.

[2] ko: Many ideographs are sorted by their reading. Such
an ideograph is primary (level 1) equal to, and secondary (level 2)
greater than, the corresponding hangul syllable.

Installation of C<Unicode::Collate::Locale> requires F<Collate/Locale.pm>,
F<Collate/Locale/*.pm>, F<Collate/CJK/*.pm> and F<Collate/allkeys.txt>.
On building, C<Unicode::Collate::Locale> doesn't require any of F<data/*.txt>,
F<gendata/*>, and F<mklocale>.
Tests for C<Unicode::Collate::Locale> are named F<t/loc_*.t>.

=item tailoring is not maximum

Even if a certain letter is tailored, its equivalents are not always
tailored along with it. For example, even though W is tailored,
fullwidth W (C<U+FF37>), W with acute (C<U+1E82>), etc. are not
tailored. The result may depend on whether source strings are
normalized or not, and whether decomposed or composed.
Thus C<(normalization =E<gt> undef)> is less preferred.

The Unicode::Collate::Locale module for perl was written
by SADAHIRO Tomoyuki, <SADAHIRO@cpan.org>.
This module is Copyright(C) 2004-2011, SADAHIRO Tomoyuki. Japan.

This module is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.

=item Unicode Collation Algorithm - UTS #10

L<http://www.unicode.org/reports/tr10/>

=item The Default Unicode Collation Element Table (DUCET)

L<http://www.unicode.org/Public/UCA/latest/allkeys.txt>

=item Unicode Locale Data Markup Language (LDML) - UTS #35

L<http://www.unicode.org/reports/tr35/>

=item CLDR - Unicode Common Locale Data Repository

L<http://cldr.unicode.org/>

=item L<Unicode::Collate>

=item L<Unicode::Normalize>