=encoding utf8

=head1 NAME

perlunicook - cookbookish examples of handling Unicode in Perl

=head1 DESCRIPTION

This manpage contains short recipes demonstrating how to handle common Unicode
operations in Perl, plus one complete program at the end. Any undeclared
variables in individual recipes are assumed to have a previous appropriate
value in them.

=head1 EXAMPLES

=head2 ℞ 0: Standard preamble

Unless otherwise noted, all examples below require this standard preamble
to work correctly, with the C<#!> adjusted to work on your system:

 #!/usr/bin/env perl

 use utf8;      # so literals and identifiers can be in UTF-8
 use v5.12;     # or later to get "unicode_strings" feature
 use strict;    # quote strings, declare variables
 use warnings;  # on by default
 use warnings  qw(FATAL utf8);    # fatalize encoding glitches
 use open      qw(:std :utf8);    # undeclared streams in UTF-8
 use charnames qw(:full :short);  # unneeded in v5.16

This I<does> mean that even Unix programmers must C<binmode> their binary
streams, or open them with C<:raw>, but that's the only way to get at them
portably anyway.

B<WARNING>: C<use autodie> (pre 2.26) and C<use open> do not get along with each
other.

=head2 ℞ 1: Generic Unicode-savvy filter

Always decompose on the way in, then recompose on the way out.

 use Unicode::Normalize;

 while (<>) {
     $_ = NFD($_);   # decompose + reorder canonically
     ...
 } continue {
     print NFC($_);  # recompose (where possible) + reorder canonically
 }

=head2 ℞ 2: Fine-tuning Unicode warnings

As of v5.14, Perl distinguishes three subclasses of UTF‑8 warnings.

 use v5.14;                  # subwarnings unavailable any earlier
 no warnings "nonchar";      # the 66 forbidden non-characters
 no warnings "surrogate";    # UTF-16/CESU-8 nonsense
 no warnings "non_unicode";  # for codepoints over 0x10_FFFF

=head2 ℞ 3: Declare source in utf8 for identifiers and literals

Without the all-critical C<use utf8> declaration, putting UTF‑8 in your
literals and identifiers won’t work right. If you used the standard
preamble just given above, this already happened. If you did, you can
do things like this:

 use utf8;

 my $measure   = "Ångström";
 my @μsoft     = qw( cp852 cp1251 cp1252 );
 my @ὑπέρμεγας = qw( ὑπέρ μεγας );
 my @鯉        = qw( koi8-f koi8-u koi8-r );
 my $motto     = "👪 💗 🐪"; # FAMILY, GROWING HEART, DROMEDARY CAMEL

If you forget C<use utf8>, high bytes will be misunderstood as
separate characters, and nothing will work right.

=head2 ℞ 4: Characters and their numbers

The C<ord> and C<chr> functions work transparently on all codepoints,
not just on ASCII, nor in fact even just on Unicode.

 # ASCII characters
 ord("A")
 chr(65)

 # characters from the Basic Multilingual Plane
 ord("Σ")
 chr(0x3A3)

 # beyond the BMP
 ord("𝑛")               # MATHEMATICAL ITALIC SMALL N
 chr(0x1D45B)

 # beyond Unicode! (up to MAXINT)
 ord("\x{20_0000}")
 chr(0x20_0000)

=head2 ℞ 5: Unicode literals by character number

In an interpolated literal, whether a double-quoted string or a
regex, you may specify a character by its number using the
C<\x{I<HHHHHH>}> escape.

 String: "\x{3a3}"
 Regex:  /\x{3a3}/

 String: "\x{1d45b}"
 Regex:  /\x{1d45b}/

 # even non-BMP ranges in regex work fine
 /[\x{1D434}-\x{1D467}]/

=head2 ℞ 6: Get character name by number

 use charnames ();
 my $name = charnames::viacode(0x03A3);

=head2 ℞ 7: Get character number by name

 use charnames ();
 my $number = charnames::vianame("GREEK CAPITAL LETTER SIGMA");
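
Recipes 6 and 7 are inverses of each other. The following is a small
round-trip sketch; the variable names are illustrative only.

```perl
use strict;
use warnings;
use charnames ();

# number -> name, then name -> number again
my $name   = charnames::viacode(0x03A3);
my $number = charnames::vianame($name);

printf "U+%04X is %s\n", $number, $name;
# prints: U+03A3 is GREEK CAPITAL LETTER SIGMA
```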

=head2 ℞ 8: Unicode named characters

Use the C<< \N{I<charname>} >> notation to get the character
by that name for use in interpolated literals (double-quoted
strings and regexes). In v5.16, there is an implicit

 use charnames qw(:full :short);

But prior to v5.16, you must be explicit about which set of charnames you
want. The C<:full> names are the official Unicode character name, alias, or
sequence, which all share a namespace.

 use charnames qw(:full :short latin greek);

 "\N{MATHEMATICAL ITALIC SMALL N}"      # :full
 "\N{GREEK CAPITAL LETTER SIGMA}"       # :full

Anything else is a Perl-specific convenience abbreviation. Specify one or
more scripts by names if you want short names that are script-specific.

 "\N{Greek:Sigma}"                      # :short
 "\N{ae}"                               #  latin
 "\N{epsilon}"                          #  greek

The v5.16 release also supports a C<:loose> import for loose matching of
character names, which works just like loose matching of property names:
that is, it disregards case, whitespace, and underscores:

 "\N{euro sign}"                        # :loose (from v5.16)

=head2 ℞ 9: Unicode named sequences

These look just like character names but return multiple codepoints.
Notice the C<%vx> vector-print functionality in C<printf>.

 use charnames qw(:full);
 my $seq = "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}";
 printf "U+%v04X\n", $seq;
 U+0100.0300

=head2 ℞ 10: Custom named characters

Use C<:alias> to give your own lexically scoped nicknames to existing
characters, or even to give unnamed private-use characters useful names.

 use charnames ":full", ":alias" => {
     ecute => "LATIN SMALL LETTER E WITH ACUTE",
     "APPLE LOGO" => 0xF8FF, # private use character
 };

 "\N{ecute}"
 "\N{APPLE LOGO}"

=head2 ℞ 11: Names of CJK codepoints

Sinograms like “東京” come back with character names of
C<CJK UNIFIED IDEOGRAPH-6771> and C<CJK UNIFIED IDEOGRAPH-4EAC>,
because their “names” vary. The CPAN C<Unicode::Unihan> module
has a large database for decoding these (and a whole lot more), provided you
know how to understand its output.

 # cpan -i Unicode::Unihan
 use Unicode::Unihan;
 my $str = "東京";
 my $unhan = Unicode::Unihan->new;
 for my $lang (qw(Mandarin Cantonese Korean JapaneseOn JapaneseKun)) {
     printf "CJK $str in %-12s is ", $lang;
     say $unhan->$lang($str);
 }

prints:

 CJK 東京 in Mandarin     is DONG1JING1
 CJK 東京 in Cantonese    is dung1ging1
 CJK 東京 in Korean       is TONGKYENG
 CJK 東京 in JapaneseOn   is TOUKYOU KEI KIN
 CJK 東京 in JapaneseKun  is HIGASHI AZUMAMIYAKO

If you have a specific romanization scheme in mind,
use the specific module:

 # cpan -i Lingua::JA::Romanize::Japanese
 use Lingua::JA::Romanize::Japanese;
 my $k2r = Lingua::JA::Romanize::Japanese->new;
 my $str = "東京";
 say "Japanese for $str is ", $k2r->chars($str);

prints

 Japanese for 東京 is toukyou

=head2 ℞ 12: Explicit encode/decode

On rare occasion, such as a database read, you may be
given encoded text you need to decode.

 use Encode qw(encode decode);

 my $chars = decode("shiftjis", $bytes, 1);
 # OR
 my $bytes = encode("MIME-Header-ISO_2022_JP", $chars, 1);

For streams all in the same encoding, don't use encode/decode; instead
set the file encoding when you open the file or immediately after with
C<binmode> as described later below.
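
As a hedged illustration of an explicit round trip, the sketch below
encodes two characters to UTF-8 and decodes them again. The C<Encode>
documentation cautions that the input scalar may be modified in place
depending on the CHECK argument, so this sketch adds C<LEAVE_SRC> to
C<FB_CROAK>; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# two characters: GREEK CAPITAL LETTER SIGMA, MATHEMATICAL ITALIC SMALL N
my $chars = "\x{3A3}\x{1D45B}";

# FB_CROAK dies on malformed data; LEAVE_SRC asks Encode not to touch the input
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

my $bytes = encode("UTF-8", $chars, $check);
my $again = decode("UTF-8", $bytes, $check);

printf "%d characters, %d bytes\n", length($chars), length($bytes);
# prints: 2 characters, 6 bytes
```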

=head2 ℞ 13: Decode program arguments as utf8

     $ perl -CA ...
 or
     $ export PERL_UNICODE=A
 or
     use Encode qw(decode_utf8);
     @ARGV = map { decode_utf8($_, 1) } @ARGV;

=head2 ℞ 14: Decode program arguments as locale encoding

 # cpan -i Encode::Locale
 use Encode qw(decode);
 use Encode::Locale;

 # use "locale" as an arg to encode/decode
 @ARGV = map { decode(locale => $_, 1) } @ARGV;

=head2 ℞ 15: Declare STD{IN,OUT,ERR} to be utf8

Use a command-line option, an environment variable, or else
call C<binmode> explicitly:

     $ perl -CS ...
 or
     $ export PERL_UNICODE=S
 or
     use open qw(:std :utf8);
 or
     binmode(STDIN,  ":utf8");
     binmode(STDOUT, ":utf8");
     binmode(STDERR, ":utf8");

=head2 ℞ 16: Declare STD{IN,OUT,ERR} to be in locale encoding

 # cpan -i Encode::Locale
 use Encode;
 use Encode::Locale;

 # or as a stream for binmode or open
 binmode STDIN,  ":encoding(console_in)"  if -t STDIN;
 binmode STDOUT, ":encoding(console_out)" if -t STDOUT;
 binmode STDERR, ":encoding(console_out)" if -t STDERR;

=head2 ℞ 17: Make file I/O default to utf8

Files opened without an encoding argument will be in UTF-8:

     $ perl -CD ...
 or
     $ export PERL_UNICODE=D
 or
     use open qw(:utf8);

=head2 ℞ 18: Make all I/O and args default to utf8

     $ perl -CSDA ...
 or
     $ export PERL_UNICODE=SDA
 or
     use open qw(:std :utf8);
     use Encode qw(decode_utf8);
     @ARGV = map { decode_utf8($_, 1) } @ARGV;

=head2 ℞ 19: Open file with specific encoding

Specify stream encoding. This is the normal way
to deal with encoded text, not by calling low-level
functions.

 # input file
     open(my $in_file, "< :encoding(UTF-16)", "wintext");
 OR
     open(my $in_file, "<", "wintext");
     binmode($in_file, ":encoding(UTF-16)");
 THEN
     my $line = <$in_file>;

 # output file
     open(my $out_file, "> :encoding(cp1252)", "wintext");
 OR
     open(my $out_file, ">", "wintext");
     binmode($out_file, ":encoding(cp1252)");
 THEN
     print $out_file "some text\n";

More layers than just the encoding can be specified here. For example,
the incantation C<":raw :encoding(UTF-16LE) :crlf"> includes implicit
CRLF handling.

=head2 ℞ 20: Unicode casing

Unicode casing is very different from ASCII casing.

 uc("henry ⅷ")  # "HENRY Ⅷ"
 uc("tschüß")   # "TSCHÜSS"  notice ß => SS

 # both are true:
 "tschüß"  =~ /TSCHÜSS/i   # notice ß => SS
 "Σίσυφος" =~ /ΣΊΣΥΦΟΣ/i   # notice Σ,σ,ς sameness

=head2 ℞ 21: Unicode case-insensitive comparisons

Also available in the CPAN L<Unicode::CaseFold> module,
the new C<fc> “foldcase” function from v5.16 grants
access to the same Unicode casefolding as the C</i>
pattern modifier has always used:

 use feature "fc"; # fc() function is from v5.16

 # sort case-insensitively
 my @sorted = sort { fc($a) cmp fc($b) } @list;

 # both are true:
 fc("tschüß")  eq fc("TSCHÜSS")
 fc("Σίσυφος") eq fc("ΣΊΣΥΦΟΣ")

=head2 ℞ 22: Match Unicode linebreak sequence in regex

A Unicode linebreak matches the two-character CRLF
grapheme or any of seven vertical whitespace characters.
Good for dealing with textfiles coming from different
operating systems.

 \R

 s/\R/\n/g;  # normalize all linebreaks to \n
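
As a small sketch of what that substitution does in practice, the
following feeds C<\R> a string with mixed line endings. The sample text
and variable names are made up for illustration.

```perl
use strict;
use warnings;

# Mixed line endings: CRLF, bare CR, LINE SEPARATOR (U+2028), and LF
my $text = "one\r\ntwo\rthree\x{2028}four\n";

(my $unix = $text) =~ s/\R/\n/g;   # each CRLF pair becomes a single \n
my @lines = split /\R/, $text;     # trailing empty field is dropped

print scalar(@lines), " lines\n";  # prints: 4 lines
```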

=head2 ℞ 23: Get character category

Find the general category of a numeric codepoint.

 use Unicode::UCD qw(charinfo);
 my $cat = charinfo(0x3A3)->{category};  # "Lu"

=head2 ℞ 24: Disabling Unicode-awareness in builtin charclasses

Disable C<\w>, C<\b>, C<\s>, C<\d>, and the POSIX
classes from working correctly on Unicode either in this
scope, or in just one regex.

 use v5.14;
 use re "/a";

 # OR

 my($num) = $str =~ /(\d+)/a;

Or use specific un-Unicode properties, like C<\p{ahex}>
and C<\p{POSIX_Digit}>. Properties still work normally
no matter what charset modifiers (C</d /u /l /a /aa>)
should be in effect.

=head2 ℞ 25: Match Unicode properties in regex with \p, \P

These all match a single codepoint with the given
property. Use C<\P> in place of C<\p> to match
one codepoint lacking that property.

 \pL, \pN, \pS, \pP, \pM, \pZ, \pC
 \p{Sk}, \p{Ps}, \p{Lt}
 \p{alpha}, \p{upper}, \p{lower}
 \p{Latin}, \p{Greek}
 \p{script=Latin}, \p{script=Greek}
 \p{East_Asian_Width=Wide}, \p{EA=W}
 \p{Line_Break=Hyphen}, \p{LB=HY}
 \p{Numeric_Value=4}, \p{NV=4}
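
To make a few of these concrete, here is a minimal sketch that counts
codepoints by property in a made-up sample string. The Greek word is
written with C<\x{...}> escapes so the example does not depend on the
source encoding.

```perl
use strict;
use warnings;

# "Σίσυφος pushed 4 rocks", with the Greek spelled out as escapes
my $str = "\x{3A3}\x{3AF}\x{3C3}\x{3C5}\x{3C6}\x{3BF}\x{3C2} pushed 4 rocks";

# "= () =" forces list context so that we get a count of the matches
my $greek  = () = $str =~ /\p{Greek}/g;   # 7 Greek letters
my $latin  = () = $str =~ /\p{Latin}/g;   # 11 Latin letters
my $upper  = () = $str =~ /\p{Lu}/g;      # 1 uppercase letter (the sigma)
my $nonlet = () = $str =~ /\PL/g;         # 4: three spaces and one digit

print "$greek $latin $upper $nonlet\n";   # prints: 7 11 1 4
```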

=head2 ℞ 26: Custom character properties

Define at compile-time your own custom character
properties for use in regexes.

 # using private-use characters
 sub In_Tengwar { "E000\tE07F\n" }

 if (/\p{In_Tengwar}/) { ... }

 # blending existing properties
 sub Is_GraecoRoman_Title {<<'END_OF_SET'}
 +utf8::IsLatin
 +utf8::IsGreek
 &utf8::IsTitle
 END_OF_SET

 if (/\p{Is_GraecoRoman_Title}/) { ... }

=head2 ℞ 27: Unicode normalization

Typically render into NFD on input and NFC on output. Using NFKC or NFKD
functions improves recall on searches, assuming you've already done the
same to the text to be searched. Note that this is about much more than just
precombined compatibility glyphs; it also reorders marks according to their
canonical combining classes and weeds out singletons.

 use Unicode::Normalize;
 my $nfd  = NFD($orig);
 my $nfc  = NFC($orig);
 my $nfkd = NFKD($orig);
 my $nfkc = NFKC($orig);
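
The effect of each form is easier to see on specific codepoints. The
sketch below uses three sample strings chosen for illustration: a
precomposed letter, a compatibility ligature, and a base letter whose
two combining marks are not in canonical order.

```perl
use strict;
use warnings;
use Unicode::Normalize;

my $e_acute  = "\x{E9}";            # LATIN SMALL LETTER E WITH ACUTE
my $ligature = "\x{FB01}";          # LATIN SMALL LIGATURE FI
my $marks    = "a\x{301}\x{323}";   # a + ACUTE (ccc 230) + DOT BELOW (ccc 220)

# canonical decomposition splits the letter; recomposition restores it
printf "%d %d\n", length(NFD($e_acute)), length(NFC(NFD($e_acute)));
# prints: 2 1

# only the compatibility forms touch the ligature
print NFC($ligature) eq $ligature ? "kept\n" : "changed\n";   # prints: kept
print NFKC($ligature), "\n";                                  # prints: fi

# NFD reorders the marks by canonical combining class
printf "%vX\n", NFD($marks);                                  # prints: 61.323.301
```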

=head2 ℞ 28: Convert non-ASCII Unicode numerics

Unless you’ve used C</a> or C</aa>, C<\d> matches more than
ASCII digits only, but Perl’s implicit string-to-number
conversion does not currently recognize these. Here’s how to
convert such strings manually.

 use v5.14;  # needed for num() function
 use Unicode::UCD qw(num);
 my $str = "got Ⅻ and ४५६७ and ⅞ and here";
 my @nums = ();
 while ($str =~ /(\d+|\pN)/g) {  # not just ASCII!
     push @nums, num($1);
 }
 say "@nums";   #     12 4567 0.875

 use charnames qw(:full);
 my $nv = num("\N{RUMI DIGIT ONE}\N{RUMI DIGIT TWO}");

=head2 ℞ 29: Match Unicode grapheme cluster in regex

Programmer-visible “characters” are codepoints matched by C</./s>,
but user-visible “characters” are graphemes matched by C</\X/>.

 # Find vowel *plus* any combining diacritics, underlining, etc.
 use Unicode::Normalize;
 my $nfd = NFD($orig);
 $nfd =~ / (?=[aeiou]) \X /xi;

=head2 ℞ 30: Extract by grapheme instead of by codepoint (regex)

 # match and grab the first five graphemes
 my($first_five) = $str =~ /^ ( \X{5} ) /x;

=head2 ℞ 31: Extract by grapheme instead of by codepoint (substr)

 # cpan -i Unicode::GCString
 use Unicode::GCString;
 my $gcs = Unicode::GCString->new($str);
 my $first_five = $gcs->substr(0, 5);

=head2 ℞ 32: Reverse string by grapheme

Reversing by codepoint messes up diacritics, mistakenly converting
C<crème brûlée> into C<éel̂urb em̀erc> instead of into C<eélûrb emèrc>;
so reverse by grapheme instead. Both these approaches work
right no matter what normalization the string is in:

 $str = join("", reverse $str =~ /\X/g);

 # OR: cpan -i Unicode::GCString
 use Unicode::GCString;
 $str = reverse Unicode::GCString->new($str);
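
To see the difference at the codepoint level, here is a short sketch
using only the core C<\X> approach on a decomposed sample string; the
variable names are illustrative only.

```perl
use strict;
use warnings;

my $str = "cre\x{300}me";   # "crème" with a decomposed è (e + COMBINING GRAVE)

my $by_codepoint = reverse $str;                    # accent lands after the "m"
my $by_grapheme  = join "", reverse $str =~ /\X/g;  # accent stays on its "e"

printf "%vX\n", $by_codepoint;  # prints: 65.6D.300.65.72.63
printf "%vX\n", $by_grapheme;   # prints: 65.6D.65.300.72.63
```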

=head2 ℞ 33: String length in graphemes

The string C<brûlée> has six graphemes but up to eight codepoints.
This counts by grapheme, not by codepoint:

 my $str = "brûlée";
 my $count = 0;
 while ($str =~ /\X/g) { $count++ }

 # OR: cpan -i Unicode::GCString
 use Unicode::GCString;
 my $gcs = Unicode::GCString->new($str);
 my $count = $gcs->length;

=head2 ℞ 34: Unicode column-width for printing

Perl’s C<printf>, C<sprintf>, and C<format> think all
codepoints take up 1 print column, but many take 0 or 2.
Here to show that normalization makes no difference,
we print out both forms:

 use Unicode::GCString;
 use Unicode::Normalize;

 my @words = qw/crème brûlée/;
 @words = map { NFC($_), NFD($_) } @words;

 for my $str (@words) {
     my $gcs = Unicode::GCString->new($str);
     my $cols = $gcs->columns;
     my $pad = " " x (10 - $cols);
     say $str, $pad, " |";
 }

generates this to show that it pads correctly no matter
the normalization:

 crème      |
 crème      |
 brûlée     |
 brûlée     |

=head2 ℞ 35: Unicode collation

Text sorted by numeric codepoint follows no reasonable alphabetic order;
use the UCA for sorting text.

 use Unicode::Collate;
 my $col = Unicode::Collate->new();
 my @list = $col->sort(@old_list);

See the I<ucsort> program from the L<Unicode::Tussle> CPAN module
for a convenient command-line interface to this module.

=head2 ℞ 36: Case- I<and> accent-insensitive Unicode sort

Specify a collation strength of level 1 to ignore case and
diacritics, only looking at the basic character.

 use Unicode::Collate;
 my $col = Unicode::Collate->new(level => 1);
 my @list = $col->sort(@old_list);

=head2 ℞ 37: Unicode locale collation

Some locales have special sorting rules.

 # either use v5.12, OR: cpan -i Unicode::Collate::Locale
 use Unicode::Collate::Locale;
 my $col = Unicode::Collate::Locale->new(locale => "de__phonebook");
 my @list = $col->sort(@old_list);

The I<ucsort> program mentioned above accepts a C<--locale> parameter.

=head2 ℞ 38: Making C<cmp> work on text instead of codepoints

Instead of this:

 @srecs = sort {
     $b->{AGE}   <=>  $a->{AGE}
                 ||
     $a->{NAME}  cmp  $b->{NAME}
 } @recs;

Use this:

 my $coll = Unicode::Collate->new();
 for my $rec (@recs) {
     $rec->{NAME_key} = $coll->getSortKey( $rec->{NAME} );
 }
 @srecs = sort {
     $b->{AGE}       <=>  $a->{AGE}
                     ||
     $a->{NAME_key}  cmp  $b->{NAME_key}
 } @recs;

=head2 ℞ 39: Case- I<and> accent-insensitive comparisons

Use a collator object to compare Unicode text by character
instead of by codepoint.

 use Unicode::Collate;
 my $es = Unicode::Collate->new(
     level => 1,
     normalization => undef
 );

 # now both are true:
 $es->eq("García",  "GARCIA" );
 $es->eq("Márquez", "MARQUEZ");

=head2 ℞ 40: Case- I<and> accent-insensitive locale comparisons

Same, but in a specific locale.

 use Unicode::Collate::Locale;
 my $de = Unicode::Collate::Locale->new(
     locale => "de__phonebook",
     level  => 1,   # as in the previous recipe: ignore case and accents
 );

 # now this is true:
 $de->eq("tschüß", "TSCHUESS");  # notice ü => UE, ß => SS

=head2 ℞ 41: Unicode linebreaking

Break up text into lines according to Unicode rules.

 # cpan -i Unicode::LineBreak
 use Unicode::LineBreak;
 use charnames qw(:full);

 my $para = "This is a super\N{HYPHEN}long string. " x 20;
 my $fmt = Unicode::LineBreak->new;
 print $fmt->break($para), "\n";

=head2 ℞ 42: Unicode text in DBM hashes, the tedious way

Using a regular Perl string as a key or value for a DBM
hash will trigger a wide character exception if any codepoints
won’t fit into a byte. Here’s how to manually manage the translation:

 use DB_File;
 use Encode qw(encode decode);
 tie %dbhash, "DB_File", "pathname";

 # STORE

 # assume $uni_key and $uni_value are abstract Unicode strings
 my $enc_key   = encode("UTF-8", $uni_key, 1);
 my $enc_value = encode("UTF-8", $uni_value, 1);
 $dbhash{$enc_key} = $enc_value;

 # FETCH

 # assume $uni_key holds a normal Perl string (abstract Unicode)
 my $enc_key   = encode("UTF-8", $uni_key, 1);
 my $enc_value = $dbhash{$enc_key};
 my $uni_value = decode("UTF-8", $enc_value, 1);

=head2 ℞ 43: Unicode text in DBM hashes, the easy way

Here’s how to implicitly manage the translation; all encoding
and decoding is done automatically, just as with streams that
have a particular encoding attached to them:

 use DB_File;
 use DBM_Filter;

 my $dbobj = tie %dbhash, "DB_File", "pathname";
 $dbobj->Filter_Push("utf8");  # this is the magic bit (keys and values)

 # STORE

 # assume $uni_key and $uni_value are abstract Unicode strings
 $dbhash{$uni_key} = $uni_value;

 # FETCH

 # $uni_key holds a normal Perl string (abstract Unicode)
 my $uni_value = $dbhash{$uni_key};

=head2 ℞ 44: PROGRAM: Demo of Unicode collation and printing

Here’s a full program showing how to make use of locale-sensitive
sorting, Unicode casing, and managing print widths when some of the
characters take up zero or two columns, not just one column each time.
When run, the following program produces this nicely aligned output:

 Crème Brûlée....... €2.00
 Éclair............. €1.60
 Fideuà............. €4.20
 Hamburger.......... €6.00
 Jamón Serrano...... €4.45
 Linguiça........... €7.00
 Pâté............... €4.15
 Pears.............. €2.00
 Pêches............. €2.25
 Smørbrød........... €5.75
 Spätzle............ €5.50
 Xoriço............. €3.00
 Γύρος.............. €6.50
 막걸리............. €4.00
 おもち............. €2.65
 お好み焼き......... €8.00
 シュークリーム..... €1.85
 寿司............... €9.99
 包子............... €7.50

Here's that program; tested on v5.14.

 #!/usr/bin/env perl
 # umenu - demo sorting and printing of Unicode food
 #
 # (obligatory and increasingly long preamble)
 #
 use utf8;
 use v5.14;                       # for locale sorting
 use strict;
 use warnings;
 use warnings  qw(FATAL utf8);    # fatalize encoding faults
 use open      qw(:std :utf8);    # undeclared streams in UTF-8
 use charnames qw(:full :short);  # unneeded in v5.16

 # std modules
 use Unicode::Normalize;          # std perl distro as of v5.8
 use List::Util qw(max);          # std perl distro as of v5.10
 use Unicode::Collate::Locale;    # std perl distro as of v5.14

 # cpan modules
 use Unicode::GCString;           # from CPAN

 # forward defs
 sub pad($$$);
 sub colwidth(_);
 sub entitle(_);

 my %price = (
     "γύρος"          => 6.50,  # gyros
     "pears"          => 2.00,  # like um, pears
     "linguiça"       => 7.00,  # spicy sausage, Portuguese
     "xoriço"         => 3.00,  # chorizo sausage, Catalan
     "hamburger"      => 6.00,  # burgermeister meisterburger
     "éclair"         => 1.60,  # dessert, French
     "smørbrød"       => 5.75,  # sandwiches, Norwegian
     "spätzle"        => 5.50,  # Bayerisch noodles, little sparrows
     "包子"           => 7.50,  # bao1 zi5, steamed pork buns, Mandarin
     "jamón serrano"  => 4.45,  # country ham, Spanish
     "pêches"         => 2.25,  # peaches, French
     "シュークリーム" => 1.85,  # cream-filled pastry like eclair
     "막걸리"         => 4.00,  # makgeolli, Korean rice wine
     "寿司"           => 9.99,  # sushi, Japanese
     "おもち"         => 2.65,  # omochi, rice cakes, Japanese
     "crème brûlée"   => 2.00,  # crema catalana
     "fideuà"         => 4.20,  # more noodles, Valencian
                                # (Catalan=fideuada)
     "pâté"           => 4.15,  # gooseliver paste, French
     "お好み焼き"     => 8.00,  # okonomiyaki, Japanese
 );

 my $width = 5 + max map { colwidth } keys %price;

 # So the Asian stuff comes out in an order that someone
 # who reads those scripts won't freak out over; the
 # CJK stuff will be in JIS X 0208 order that way.
 my $coll = Unicode::Collate::Locale->new(locale => "ja");

 for my $item ($coll->sort(keys %price)) {
     print pad(entitle($item), $width, ".");
     printf " €%.2f\n", $price{$item};
 }

 sub pad($$$) {
     my($str, $width, $padchar) = @_;
     return $str . ($padchar x ($width - colwidth($str)));
 }

 sub colwidth(_) {
     my($str) = @_;
     return Unicode::GCString->new($str)->columns;
 }

 sub entitle(_) {
     my($str) = @_;
     $str =~ s{ (?=\pL)(\S)     (\S*) }
              { ucfirst($1) . lc($2)  }xge;
     return $str;
 }

=head1 SEE ALSO

See these manpages, some of which are CPAN modules:
L<perlunicode>, L<perluniprops>,
L<perlre>, L<perlrecharclass>,
L<perluniintro>, L<perlunitut>, L<perlunifaq>,
L<PerlIO>, L<DB_File>, L<DBM_Filter>, L<DBM_Filter::utf8>,
L<Encode>, L<Encode::Locale>,
L<Unicode::UCD>,
L<Unicode::Normalize>,
L<Unicode::GCString>, L<Unicode::LineBreak>,
L<Unicode::Collate>, L<Unicode::Collate::Locale>,
L<Unicode::Unihan>,
L<Unicode::CaseFold>,
L<Unicode::Tussle>,
L<Lingua::JA::Romanize::Japanese>,
L<Lingua::ZH::Romanize::Pinyin>,
L<Lingua::KO::Romanize::Hangul>.

The L<Unicode::Tussle> CPAN module includes many programs
to help with working with Unicode, including
these programs to fully or partly replace standard utilities:
I<tcgrep> instead of I<egrep>,
I<uniquote> instead of I<cat -v> or I<hexdump>,
I<uniwc> instead of I<wc>,
I<unilook> instead of I<look>,
I<unifmt> instead of I<fmt>,
and
I<ucsort> instead of I<sort>.
For exploring Unicode character names and character properties,
see its I<uniprops>, I<unichars>, and I<uninames> programs.
It also supplies these programs, all of which are general filters that do Unicode-y things:
I<unititle> and I<unicaps>;
I<uniwide> and I<uninarrow>;
I<unisupers> and I<unisubs>;
I<nfd>, I<nfc>, I<nfkd>, and I<nfkc>;
and I<uc>, I<lc>, and I<tc>.

Finally, see the published Unicode Standard (page numbers are from version
6.0.0), including these specific annexes and technical reports:

=over

=item §3.13 Default Case Algorithms, page 113;
§4.2 Case, pages 120–122;
Case Mappings, pages 166–172, especially Caseless Matching starting on page 170.

=item UAX #44: Unicode Character Database

=item UTS #18: Unicode Regular Expressions

=item UAX #15: Unicode Normalization Forms

=item UTS #10: Unicode Collation Algorithm

=item UAX #29: Unicode Text Segmentation

=item UAX #14: Unicode Line Breaking Algorithm

=item UAX #11: East Asian Width

=back

=head1 AUTHOR

Tom Christiansen E<lt>tchrist@perl.comE<gt> wrote this, with occasional
kibitzing from Larry Wall and Jeffrey Friedl in the background.

=head1 COPYRIGHT AND LICENCE

Copyright © 2012 Tom Christiansen.

This program is free software; you may redistribute it and/or modify it
under the same terms as Perl itself.

Most of these examples are taken from the current edition of the “Camel Book”;
that is, from the 4ᵗʰ Edition of I<Programming Perl>, Copyright © 2012 Tom
Christiansen <et al.>, 2012-02-13 by O’Reilly Media. The code itself is
freely redistributable, and you are encouraged to transplant, fold,
spindle, and mutilate any of the examples in this manpage however you please
for inclusion into your own programs without any encumbrance whatsoever.
Acknowledgement via code comment is polite but not required.

=head1 REVISION HISTORY

v1.0.0 – first public release, 2012-02-27