This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5
=encoding utf8

=head1 NAME

perlunicook - cookbookish examples of handling Unicode in Perl

=head1 DESCRIPTION

This manpage contains short recipes demonstrating how to handle common Unicode
operations in Perl, plus one complete program at the end. Any undeclared
variables in individual recipes are assumed to have a previous appropriate
value in them.

=head1 EXAMPLES

=head2 ℞ 0: Standard preamble

Unless otherwise noted, all examples below require this standard preamble
to work correctly, with the C<#!> adjusted to work on your system:

    #!/usr/bin/env perl

    use v5.36;     # or later to get "unicode_strings" feature,
                   #   plus strict, warnings
    use utf8;      # so literals and identifiers can be in UTF-8
    use warnings  qw(FATAL utf8);    # fatalize encoding glitches
    use open      qw(:std :encoding(UTF-8)); # undeclared streams in UTF-8
    use charnames qw(:full :short);  # unneeded in v5.16

This I<does> make even Unix programmers C<binmode> their binary streams,
or open them with C<:raw>, but that's the only way to get at them
portably anyway.

B<WARNING>: C<use autodie> (pre 2.26) and C<use open> do not get along with each
other.

=head2 ℞ 1: Generic Unicode-savvy filter

Always decompose on the way in, then recompose on the way out.

    use Unicode::Normalize;

    while (<>) {
        $_ = NFD($_);   # decompose + reorder canonically
        ...
    } continue {
        print NFC($_);  # recompose (where possible) + reorder canonically
    }

=head2 ℞ 2: Fine-tuning Unicode warnings

As of v5.14, Perl distinguishes three subclasses of UTF‑8 warnings.

    use v5.14;                  # subwarnings unavailable any earlier
    no warnings "nonchar";      # the 66 forbidden non-characters
    no warnings "surrogate";    # UTF-16/CESU-8 nonsense
    no warnings "non_unicode";  # for codepoints over 0x10_FFFF

=head2 ℞ 3: Declare source in utf8 for identifiers and literals

Without the all-critical C<use utf8> declaration, putting UTF‑8 in your
literals and identifiers won’t work right. If you used the standard
preamble just given above, this already happened. If you did, you can
do things like this:

    use utf8;

    my $measure   = "Ångström";
    my @μsoft     = qw( cp852 cp1251 cp1252 );
    my @ὑπέρμεγας = qw( ὑπέρ μεγας );
    my @鯉        = qw( koi8-f koi8-u koi8-r );
    my $motto     = "👪 💗 🐪"; # FAMILY, GROWING HEART, DROMEDARY CAMEL

If you forget C<use utf8>, high bytes will be misunderstood as
separate characters, and nothing will work right.

=head2 ℞ 4: Characters and their numbers

The C<ord> and C<chr> functions work transparently on all codepoints,
not just on ASCII, and in fact not even just on Unicode.

    # ASCII characters
    ord("A")
    chr(65)

    # characters from the Basic Multilingual Plane
    ord("Σ")
    chr(0x3A3)

    # beyond the BMP
    ord("𝑛")               # MATHEMATICAL ITALIC SMALL N
    chr(0x1D45B)

    # beyond Unicode! (up to MAXINT)
    ord("\x{20_0000}")
    chr(0x20_0000)

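As a runnable sketch of that round trip, the following prints each
character's number in C<U+> notation and checks that C<chr> undoes C<ord>.
The helper name C<describe> is invented for this illustration and is not
part of any module.

    use v5.14;
    use utf8;

    # describe() is a hypothetical helper: codepoint in U+ notation
    sub describe {
        my ($char) = @_;
        return sprintf "U+%04X", ord($char);
    }

    binmode(STDOUT, ":encoding(UTF-8)");
    for my $char ("A", "Σ", "𝑛") {
        my $verdict = chr(ord($char)) eq $char ? "round-trips" : "differs";
        say describe($char), " ", $verdict;
    }
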
=head2 ℞ 5: Unicode literals by character number

In an interpolated literal, whether a double-quoted string or a
regex, you may specify a character by its number using the
C<\x{I<HHHHHH>}> escape.

    String: "\x{3a3}"
    Regex:  /\x{3a3}/

    String: "\x{1d45b}"
    Regex:  /\x{1d45b}/

    # even non-BMP ranges in regex work fine
    /[\x{1D434}-\x{1D467}]/

=head2 ℞ 6: Get character name by number

    use charnames ();
    my $name = charnames::viacode(0x03A3);

=head2 ℞ 7: Get character number by name

    use charnames ();
    my $number = charnames::vianame("GREEK CAPITAL LETTER SIGMA");

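The two lookups are inverses of each other. A minimal sketch, assuming
v5.14 or later, that goes from name to number and back again:

    use v5.14;
    use charnames ();

    # name -> number -> name
    my $code = charnames::vianame("GREEK CAPITAL LETTER SIGMA");
    my $name = charnames::viacode($code);

    printf "U+%04X is %s\n", $code, $name;
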
=head2 ℞ 8: Unicode named characters

Use the C<< \N{I<charname>} >> notation to get the character
by that name for use in interpolated literals (double-quoted
strings and regexes). In v5.16, there is an implicit

    use charnames qw(:full :short);

But prior to v5.16, you must be explicit about which set of charnames you
want. The C<:full> names are the official Unicode character name, alias, or
sequence, which all share a namespace.

    use charnames qw(:full :short latin greek);

    "\N{MATHEMATICAL ITALIC SMALL N}"      # :full
    "\N{GREEK CAPITAL LETTER SIGMA}"       # :full

Anything else is a Perl-specific convenience abbreviation. Specify one or
more scripts by names if you want short names that are script-specific.

    "\N{Greek:Sigma}"                      # :short
    "\N{ae}"                               #  latin
    "\N{epsilon}"                          #  greek

The v5.16 release also supports a C<:loose> import for loose matching of
character names, which works just like loose matching of property names:
that is, it disregards case, whitespace, and underscores:

    "\N{euro sign}"                        # :loose (from v5.16)

Starting in v5.32, you can also use

    qr/\p{name=euro sign}/

to get official Unicode named characters in regular expressions. Loose
matching is always done for these.

=head2 ℞ 9: Unicode named sequences

These look just like character names but return multiple codepoints.
Notice the C<%vx> vector-print functionality in C<printf>.

    use charnames qw(:full);
    my $seq = "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}";
    printf "U+%v04X\n", $seq;
    # prints: U+0100.0300

=head2 ℞ 10: Custom named characters

Use C<:alias> to give your own lexically scoped nicknames to existing
characters, or even to give unnamed private-use characters useful names.

    use charnames ":full", ":alias" => {
        ecute => "LATIN SMALL LETTER E WITH ACUTE",
        "APPLE LOGO" => 0xF8FF, # private use character
    };

    "\N{ecute}"
    "\N{APPLE LOGO}"

=head2 ℞ 11: Names of CJK codepoints

Sinograms like “東京” come back with character names of
C<CJK UNIFIED IDEOGRAPH-6771> and C<CJK UNIFIED IDEOGRAPH-4EAC>,
because their “names” vary. The CPAN C<Unicode::Unihan> module
has a large database for decoding these (and a whole lot more), provided you
know how to understand its output.

    # cpan -i Unicode::Unihan
    use Unicode::Unihan;
    my $str = "東京";
    my $unhan = Unicode::Unihan->new;
    for my $lang (qw(Mandarin Cantonese Korean JapaneseOn JapaneseKun)) {
        printf "CJK $str in %-12s is ", $lang;
        say $unhan->$lang($str);
    }

prints:

    CJK 東京 in Mandarin     is DONG1JING1
    CJK 東京 in Cantonese    is dung1ging1
    CJK 東京 in Korean       is TONGKYENG
    CJK 東京 in JapaneseOn   is TOUKYOU KEI KIN
    CJK 東京 in JapaneseKun  is HIGASHI AZUMAMIYAKO

If you have a specific romanization scheme in mind,
use the specific module:

    # cpan -i Lingua::JA::Romanize::Japanese
    use Lingua::JA::Romanize::Japanese;
    my $k2r = Lingua::JA::Romanize::Japanese->new;
    my $str = "東京";
    say "Japanese for $str is ", $k2r->chars($str);

prints

    Japanese for 東京 is toukyou

=head2 ℞ 12: Explicit encode/decode

On rare occasion, such as a database read, you may be
given encoded text you need to decode.

    use Encode qw(encode decode);

    my $chars = decode("shiftjis", $bytes, 1);
    # OR
    my $bytes = encode("MIME-Header-ISO_2022_JP", $chars, 1);

For streams all in the same encoding, don't use encode/decode; instead
set the file encoding when you open the file or immediately after with
C<binmode> as described later below.

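Here is a small self-contained round trip through UTF-8, offered as a
sketch rather than a recommendation. It assumes you want the original
string left intact, so it ORs C<Encode::LEAVE_SRC> into the check argument;
the Encode documentation notes that with a check value set and that bit
clear, the source string may be overwritten in place.

    use v5.14;
    use Encode qw(encode decode);

    # die on malformed data, but leave the source string alone
    my $check = Encode::FB_CROAK | Encode::LEAVE_SRC;

    my $chars = "\x{C5}ngstr\x{F6}m";             # "Ångström"
    my $bytes = encode("UTF-8", $chars, $check);
    my $again = decode("UTF-8", $bytes, $check);

    printf "%d characters, %d bytes, round trip %s\n",
        length($chars), length($bytes),
        $again eq $chars ? "ok" : "failed";
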
=head2 ℞ 13: Decode program arguments as utf8

    $ perl -CA ...
  or
    $ export PERL_UNICODE=A
  or
    use Encode qw(decode);
    @ARGV = map { decode('UTF-8', $_, 1) } @ARGV;

=head2 ℞ 14: Decode program arguments as locale encoding

    # cpan -i Encode::Locale
    use Encode qw(decode);
    use Encode::Locale;

    # use "locale" as an arg to encode/decode
    @ARGV = map { decode(locale => $_, 1) } @ARGV;

=head2 ℞ 15: Declare STD{IN,OUT,ERR} to be utf8

Use a command-line option, an environment variable, or else
call C<binmode> explicitly:

    $ perl -CS ...
  or
    $ export PERL_UNICODE=S
  or
    use open qw(:std :encoding(UTF-8));
  or
    binmode(STDIN,  ":encoding(UTF-8)");
    binmode(STDOUT, ":encoding(UTF-8)");
    binmode(STDERR, ":encoding(UTF-8)");

=head2 ℞ 16: Declare STD{IN,OUT,ERR} to be in locale encoding

    # cpan -i Encode::Locale
    use Encode;
    use Encode::Locale;

    # or as a stream for binmode or open
    binmode STDIN,  ":encoding(console_in)"  if -t STDIN;
    binmode STDOUT, ":encoding(console_out)" if -t STDOUT;
    binmode STDERR, ":encoding(console_out)" if -t STDERR;

=head2 ℞ 17: Make file I/O default to utf8

Files opened without an encoding argument will be in UTF-8:

    $ perl -CD ...
  or
    $ export PERL_UNICODE=D
  or
    use open qw(:encoding(UTF-8));

=head2 ℞ 18: Make all I/O and args default to utf8

    $ perl -CSDA ...
  or
    $ export PERL_UNICODE=SDA
  or
    use open qw(:std :encoding(UTF-8));
    use Encode qw(decode);
    @ARGV = map { decode('UTF-8', $_, 1) } @ARGV;

=head2 ℞ 19: Open file with specific encoding

Specify stream encoding. This is the normal way
to deal with encoded text, not by calling low-level
functions.

    # input file
        open(my $in_file, "< :encoding(UTF-16)", "wintext");
    # OR
        open(my $in_file, "<", "wintext");
        binmode($in_file, ":encoding(UTF-16)");
    # THEN
        my $line = <$in_file>;

    # output file
        open(my $out_file, "> :encoding(cp1252)", "wintext");
    # OR
        open(my $out_file, ">", "wintext");
        binmode($out_file, ":encoding(cp1252)");
    # THEN
        print $out_file "some text\n";

More layers than just the encoding can be specified here. For example,
the incantation C<":raw :encoding(UTF-16LE) :crlf"> includes implicit
CRLF handling.

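To see what an encoding layer actually does, this sketch writes one line
through C<:encoding(UTF-16LE)>, inspects the raw bytes on disk, then reads
the line back through the same layer. The temporary file from the core
C<File::Temp> module is only a stand-in for a real data file, and the
example assumes a Unix-like system where no C<:crlf> layer is applied by
default.

    use v5.14;
    use File::Temp qw(tempfile);

    my ($tmp_fh, $path) = tempfile(UNLINK => 1);
    close($tmp_fh);

    # write characters; the layer turns them into UTF-16LE bytes
    open(my $out, "> :encoding(UTF-16LE)", $path) or die "can't write: $!";
    print $out "\x{3A3}\n";                       # GREEK CAPITAL LETTER SIGMA
    close($out) or die "can't close: $!";

    # look at the undecoded bytes
    open(my $raw, "< :raw", $path) or die "can't read: $!";
    my $bytes = do { local $/; <$raw> };
    close($raw);

    # read characters back through the same layer
    open(my $in, "< :encoding(UTF-16LE)", $path) or die "can't read: $!";
    chomp(my $line = <$in>);
    close($in);

    printf "%d bytes on disk, read back U+%04X\n",
        length($bytes), ord($line);
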
=head2 ℞ 20: Unicode casing

Unicode casing is very different from ASCII casing.

    uc("henry ⅷ")  # "HENRY Ⅷ"
    uc("tschüß")   # "TSCHÜSS"  notice ß => SS

    # both are true:
    "tschüß"  =~ /TSCHÜSS/i   # notice ß => SS
    "Σίσυφος" =~ /ΣΊΣΥΦΟΣ/i   # notice Σ,σ,ς sameness

=head2 ℞ 21: Unicode case-insensitive comparisons

Also available in the CPAN L<Unicode::CaseFold> module,
the new C<fc> “foldcase” function from v5.16 grants
access to the same Unicode casefolding as the C</i>
pattern modifier has always used:

    use feature "fc"; # fc() function is from v5.16

    # sort case-insensitively
    my @sorted = sort { fc($a) cmp fc($b) } @list;

    # both are true:
    fc("tschüß")  eq fc("TSCHÜSS")
    fc("Σίσυφος") eq fc("ΣΊΣΥΦΟΣ")

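The difference between folding and lowercasing shows up as soon as one
character maps to several. In this sketch, C<same_text> is a hypothetical
helper written for the illustration; C<\x{DF}> is LATIN SMALL LETTER
SHARP S, spelled as an escape so the example does not depend on the
source encoding.

    use v5.16;                     # enables the fc feature

    # same_text() is a hypothetical helper: caseless equality via fc
    sub same_text {
        my ($x, $y) = @_;
        return fc($x) eq fc($y);
    }

    my $lower = "stra\x{DF}e";
    my $upper = "STRASSE";

    say "lc: ", (lc($lower) eq lc($upper) ? "equal" : "different");
    say "fc: ", (same_text($lower, $upper) ? "equal" : "different");

Here C<lc> leaves the sharp s alone, so the strings compare as different,
whereas C<fc> folds it to C<ss> and they compare as equal.
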
=head2 ℞ 22: Match Unicode linebreak sequence in regex

A Unicode linebreak matches the two-character CRLF
grapheme or any of seven vertical whitespace characters.
Good for dealing with textfiles coming from different
operating systems.

    \R

    s/\R/\n/g;  # normalize all linebreaks to \n

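For instance, a string mixing Windows, old Mac, Unicode, and Unix line
endings can be normalized in one pass. This is only a sketch; the sample
text is made up.

    use v5.14;

    # CRLF, bare CR, U+2028 LINE SEPARATOR, and LF in one string
    my $text = "one\r\ntwo\rthree\x{2028}four\n";

    my $unix  = $text =~ s/\R/\n/gr;   # /r returns the changed copy
    my @lines = split /\n/, $unix;

    say scalar(@lines), " lines: @lines";

Note that CRLF is consumed as a single linebreak, so no empty line
appears between C<one> and C<two>.
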
=head2 ℞ 23: Get character category

Find the general category of a numeric codepoint.

    use Unicode::UCD qw(charinfo);
    my $cat = charinfo(0x3A3)->{category};  # "Lu"

=head2 ℞ 24: Disabling Unicode-awareness in builtin charclasses

Disable C<\w>, C<\b>, C<\s>, C<\d>, and the POSIX
classes from working correctly on Unicode either in this
scope, or in just one regex.

    use v5.14;
    use re "/a";

    # OR

    my($num) = $str =~ /(\d+)/a;

Or use specific un-Unicode properties, like C<\p{ahex}>
and C<\p{POSIX_Digit}>. Properties still work normally
no matter what charset modifiers (C</d /u /l /a /aa>)
should be in effect.

=head2 ℞ 25: Match Unicode properties in regex with \p, \P

These all match a single codepoint with the given
property. Use C<\P> in place of C<\p> to match
one codepoint lacking that property.

    \pL, \pN, \pS, \pP, \pM, \pZ, \pC
    \p{Sk}, \p{Ps}, \p{Lt}
    \p{alpha}, \p{upper}, \p{lower}
    \p{Latin}, \p{Greek}
    \p{script_extensions=Latin}, \p{scx=Greek}
    \p{East_Asian_Width=Wide}, \p{EA=W}
    \p{Line_Break=Hyphen}, \p{LB=HY}
    \p{Numeric_Value=4}, \p{NV=4}

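As a quick way to try properties out, this sketch tests a few codepoints
against a small table of compiled patterns. The table C<%is> and the
helper C<props_of> are names invented for the example.

    use v5.14;

    # each pattern matches exactly one codepoint having the property
    my %is = (
        "Lu"    => qr/\A\p{Lu}\z/,
        "Greek" => qr/\A\p{Greek}\z/,
        "NV=4"  => qr/\A\p{Numeric_Value=4}\z/,
    );

    # props_of() is a hypothetical helper: which table entries match?
    sub props_of {
        my ($cp) = @_;
        my $char = chr($cp);
        return join ",", grep { $char =~ $is{$_} } sort keys %is;
    }

    # LATIN A, GREEK SIGMA, ARABIC-INDIC DIGIT FOUR, ROMAN NUMERAL FOUR
    printf "U+%04X: %s\n", $_, props_of($_) for 0x41, 0x3A3, 0x664, 0x2163;
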
=head2 ℞ 26: Custom character properties

Define at compile-time your own custom character
properties for use in regexes.

    # using private-use characters
    sub In_Tengwar { "E000\tE07F\n" }

    if (/\p{In_Tengwar}/) { ... }

    # blending existing properties
    sub Is_GraecoRoman_Title {<<'END_OF_SET'}
    +utf8::IsLatin
    +utf8::IsGreek
    &utf8::IsTitle
    END_OF_SET

    if (/\p{Is_GraecoRoman_Title}/) { ... }

=head2 ℞ 27: Unicode normalization

Typically render into NFD on input and NFC on output. Using NFKC or NFKD
functions improves recall on searches, assuming you've already done the
same to the text to be searched. Note that this is about much more than just
pre-combined compatibility glyphs; it also reorders marks according to their
canonical combining classes and weeds out singletons.

    use Unicode::Normalize;
    my $nfd  = NFD($orig);
    my $nfc  = NFC($orig);
    my $nfkd = NFKD($orig);
    my $nfkc = NFKC($orig);

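The effect is easiest to see by counting codepoints. In this sketch the
input strings are spelled with escapes so that their starting form is
unambiguous.

    use v5.14;
    use Unicode::Normalize qw(NFD NFC NFKC);

    my $orig = "cr\x{E8}me";            # precomposed e-grave
    my $nfd  = NFD($orig);              # e + COMBINING GRAVE ACCENT
    my $nfc  = NFC($nfd);               # back to the precomposed form

    printf "NFD has %d codepoints, NFC has %d\n", length($nfd), length($nfc);

    # compatibility forms also unpack ligatures
    my $fi = NFKC("\x{FB01}");          # LATIN SMALL LIGATURE FI
    say "NFKC turns the ligature into '$fi'";
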
=head2 ℞ 28: Convert non-ASCII Unicode numerics

Unless you’ve used C</a> or C</aa>, C<\d> matches more than
ASCII digits only, but Perl’s implicit string-to-number
conversion does not currently recognize these. Here’s how to
convert such strings manually.

    use v5.14;  # needed for num() function
    use Unicode::UCD qw(num);
    my $str = "got Ⅻ and ४५६७ and ⅞ and here";
    my @nums = ();
    while ($str =~ /(\d+|\pN)/g) {  # not just ASCII!
        push @nums, num($1);
    }
    say "@nums";   #     12      4567      0.875

    use charnames qw(:full);
    my $nv = num("\N{RUMI DIGIT ONE}\N{RUMI DIGIT TWO}");

=head2 ℞ 29: Match Unicode grapheme cluster in regex

Programmer-visible “characters” are codepoints matched by C</./s>,
but user-visible “characters” are graphemes matched by C</\X/>.

    # Find vowel *plus* any combining diacritics,underlining,etc.
    my $nfd = NFD($orig);
    $nfd =~ / (?=[aeiou]) \X /xi

=head2 ℞ 30: Extract by grapheme instead of by codepoint (regex)

    # match and grab first five graphemes
    my($first_five) = $str =~ /^ ( \X{5} ) /x;

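The distinction matters most on decomposed text. This sketch, using only
core modules, decomposes a short word and compares the two ways of
counting and slicing it; the variable names are illustrative.

    use v5.14;
    use Unicode::Normalize qw(NFD);

    my $str = NFD("br\x{FB}l\x{E9}e");            # "brûlée", decomposed

    my ($first_three) = $str =~ /^ ( \X{3} ) /x;  # b, r, u+circumflex
    my @graphemes     = $str =~ /\X/g;

    printf "%d codepoints, %d graphemes\n",
        length($str), scalar(@graphemes);
    printf "the first three graphemes span %d codepoints\n",
        length($first_three);
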
=head2 ℞ 31: Extract by grapheme instead of by codepoint (substr)

    # cpan -i Unicode::GCString
    use Unicode::GCString;
    my $gcs = Unicode::GCString->new($str);
    my $first_five = $gcs->substr(0, 5);

=head2 ℞ 32: Reverse string by grapheme

Reversing by codepoint messes up diacritics, mistakenly converting
C<crème brûlée> into C<éel̂urb em̀erc> instead of into C<eélûrb emèrc>;
so reverse by grapheme instead. Both these approaches work
right no matter what normalization the string is in:

    $str = join("", reverse $str =~ /\X/g);

    # OR: cpan -i Unicode::GCString
    use Unicode::GCString;
    $str = reverse Unicode::GCString->new($str);

=head2 ℞ 33: String length in graphemes

The string C<brûlée> has six graphemes but up to eight codepoints.
This counts by grapheme, not by codepoint:

    my $str = "brûlée";
    my $count = 0;
    while ($str =~ /\X/g) { $count++ }

    # OR: cpan -i Unicode::GCString
    use Unicode::GCString;
    my $gcs = Unicode::GCString->new($str);
    my $count = $gcs->length;

=head2 ℞ 34: Unicode column-width for printing

Perl’s C<printf>, C<sprintf>, and C<format> think all
codepoints take up 1 print column, but many take 0 or 2.
Here to show that normalization makes no difference,
we print out both forms:

    use Unicode::GCString;
    use Unicode::Normalize;

    my @words = qw/crème brûlée/;
    @words = map { NFC($_), NFD($_) } @words;

    for my $str (@words) {
        my $gcs = Unicode::GCString->new($str);
        my $cols = $gcs->columns;
        my $pad = " " x (10 - $cols);
        say $str, $pad, " |";
    }

generates this to show that it pads correctly no matter
the normalization:

    crème      |
    crème      |
    brûlée     |
    brûlée     |

=head2 ℞ 35: Unicode collation

Text sorted by numeric codepoint follows no reasonable alphabetic order;
use the UCA for sorting text.

    use Unicode::Collate;
    my $col = Unicode::Collate->new();
    my @list = $col->sort(@old_list);

See the I<ucsort> program from the L<Unicode::Tussle> CPAN module
for a convenient command-line interface to this module.

=head2 ℞ 36: Case- I<and> accent-insensitive Unicode sort

Specify a collation strength of level 1 to ignore case and
diacritics, only looking at the basic character.

    use Unicode::Collate;
    my $col = Unicode::Collate->new(level => 1);
    my @list = $col->sort(@old_list);

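To make the contrast concrete, this sketch sorts a made-up word list both
by codepoint and with the default collator, then uses a level 1 collator
to compare two spellings. C<\x{C9}> is LATIN CAPITAL LETTER E WITH ACUTE,
written as an escape so the example does not depend on the source
encoding.

    use v5.14;
    use Unicode::Collate;

    my @words = ("zebra", "\x{C9}clair", "apple", "Banana");

    my @by_codepoint = sort @words;                       # plain cmp
    my @by_uca       = Unicode::Collate->new->sort(@words);

    binmode(STDOUT, ":encoding(UTF-8)");
    say "codepoint order: @by_codepoint";
    say "UCA order:       @by_uca";

    # at level 1, case and accents are ignored
    my $loose = Unicode::Collate->new(level => 1);
    say "level 1 treats the two spellings as equal"
        if $loose->eq("\x{C9}clair", "eclair");

The codepoint sort puts every capital before every lowercase letter and
the accented word last, while the collator interleaves them
alphabetically.
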
=head2 ℞ 37: Unicode locale collation

Some locales have special sorting rules.

    # either use v5.12, OR: cpan -i Unicode::Collate::Locale
    use Unicode::Collate::Locale;
    my $col = Unicode::Collate::Locale->new(locale => "de__phonebook");
    my @list = $col->sort(@old_list);

The I<ucsort> program mentioned above accepts a C<--locale> parameter.

=head2 ℞ 38: Making C<cmp> work on text instead of codepoints

Instead of this:

    @srecs = sort {
        $b->{AGE}   <=>  $a->{AGE}
                    ||
        $a->{NAME}  cmp  $b->{NAME}
    } @recs;

Use this:

    use Unicode::Collate;

    my $coll = Unicode::Collate->new();
    for my $rec (@recs) {
        $rec->{NAME_key} = $coll->getSortKey( $rec->{NAME} );
    }
    @srecs = sort {
        $b->{AGE}       <=>  $a->{AGE}
                        ||
        $a->{NAME_key}  cmp  $b->{NAME_key}
    } @recs;

=head2 ℞ 39: Case- I<and> accent-insensitive comparisons

Use a collator object to compare Unicode text by character
instead of by codepoint.

    use Unicode::Collate;
    my $es = Unicode::Collate->new(
        level => 1,
        normalization => undef
    );

    # now both are true:
    $es->eq("García",  "GARCIA" );
    $es->eq("Márquez", "MARQUEZ");

=head2 ℞ 40: Case- I<and> accent-insensitive locale comparisons

Same, but in a specific locale.

    use Unicode::Collate::Locale;
    my $de = Unicode::Collate::Locale->new(
        locale => "de__phonebook",
        level  => 1,    # ignore case and accents, as in the recipe above
    );

    # now this is true:
    $de->eq("tschüß", "TSCHUESS");  # notice ü => UE, ß => SS

=head2 ℞ 41: Unicode linebreaking

Break up text into lines according to Unicode rules.

    # cpan -i Unicode::LineBreak
    use Unicode::LineBreak;
    use charnames qw(:full);

    my $para = "This is a super\N{HYPHEN}long string. " x 20;
    my $fmt = Unicode::LineBreak->new;
    print $fmt->break($para), "\n";

=head2 ℞ 42: Unicode text in DBM hashes, the tedious way

Using a regular Perl string as a key or value for a DBM
hash will trigger a wide character exception if any codepoints
won’t fit into a byte. Here’s how to manually manage the translation:

    use DB_File;
    use Encode qw(encode decode);
    tie %dbhash, "DB_File", "pathname";

    # STORE

    # assume $uni_key and $uni_value are abstract Unicode strings
    my $enc_key   = encode("UTF-8", $uni_key, 1);
    my $enc_value = encode("UTF-8", $uni_value, 1);
    $dbhash{$enc_key} = $enc_value;

    # FETCH

    # assume $uni_key holds a normal Perl string (abstract Unicode)
    my $enc_key   = encode("UTF-8", $uni_key, 1);
    my $enc_value = $dbhash{$enc_key};
    my $uni_value = decode("UTF-8", $enc_value, 1);

=head2 ℞ 43: Unicode text in DBM hashes, the easy way

Here’s how to implicitly manage the translation; all encoding
and decoding is done automatically, just as with streams that
have a particular encoding attached to them:

    use DB_File;
    use DBM_Filter;

    my $dbobj = tie %dbhash, "DB_File", "pathname";
    $dbobj->Filter_Push("utf8");  # this is the magic bit;
                                  # it filters both keys and values

    # STORE

    # assume $uni_key and $uni_value are abstract Unicode strings
    $dbhash{$uni_key} = $uni_value;

    # FETCH

    # $uni_key holds a normal Perl string (abstract Unicode)
    my $uni_value = $dbhash{$uni_key};

=head2 ℞ 44: PROGRAM: Demo of Unicode collation and printing

Here’s a full program showing how to make use of locale-sensitive
sorting, Unicode casing, and managing print widths when some of the
characters take up zero or two columns, not just one column each time.
When run, the following program produces this nicely aligned output:

    Crème Brûlée....... €2.00
    Éclair............. €1.60
    Fideuà............. €4.20
    Hamburger.......... €6.00
    Jamón Serrano...... €4.45
    Linguiça........... €7.00
    Pâté............... €4.15
    Pears.............. €2.00
    Pêches............. €2.25
    Smørbrød........... €5.75
    Spätzle............ €5.50
    Xoriço............. €3.00
    Γύρος.............. €6.50
    막걸리............. €4.00
    おもち............. €2.65
    お好み焼き......... €8.00
    シュークリーム..... €1.85
    寿司............... €9.99
    包子............... €7.50

Here's that program.

    #!/usr/bin/env perl
    # umenu - demo sorting and printing of Unicode food
    #
    # (obligatory and increasingly long preamble)
    #
    use v5.36;
    use utf8;
    use warnings  qw(FATAL utf8);    # fatalize encoding faults
    use open      qw(:std :encoding(UTF-8)); # undeclared streams in UTF-8
    use charnames qw(:full :short);  # unneeded in v5.16

    # std modules
    use Unicode::Normalize;          # std perl distro as of v5.8
    use List::Util qw(max);          # std perl distro as of v5.10
    use Unicode::Collate::Locale;    # std perl distro as of v5.14

    # cpan modules
    use Unicode::GCString;           # from CPAN

    my %price = (
        "γύρος"             => 6.50, # gyros
        "pears"             => 2.00, # like um, pears
        "linguiça"          => 7.00, # spicy sausage, Portuguese
        "xoriço"            => 3.00, # chorizo sausage, Catalan
        "hamburger"         => 6.00, # burgermeister meisterburger
        "éclair"            => 1.60, # dessert, French
        "smørbrød"          => 5.75, # sandwiches, Norwegian
        "spätzle"           => 5.50, # Bayerisch noodles, little sparrows
        "包子"              => 7.50, # bao1 zi5, steamed pork buns, Mandarin
        "jamón serrano"     => 4.45, # country ham, Spanish
        "pêches"            => 2.25, # peaches, French
        "シュークリーム"    => 1.85, # cream-filled pastry like eclair
        "막걸리"            => 4.00, # makgeolli, Korean rice wine
        "寿司"              => 9.99, # sushi, Japanese
        "おもち"            => 2.65, # omochi, rice cakes, Japanese
        "crème brûlée"      => 2.00, # crema catalana
        "fideuà"            => 4.20, # more noodles, Valencian
                                     # (Catalan=fideuada)
        "pâté"              => 4.15, # gooseliver paste, French
        "お好み焼き"        => 8.00, # okonomiyaki, Japanese
    );

    my $width = 5 + max map { colwidth($_) } keys %price;

    # So the Asian stuff comes out in an order that someone
    # who reads those scripts won't freak out over; the
    # CJK stuff will be in JIS X 0208 order that way.
    my $coll = Unicode::Collate::Locale->new(locale => "ja");

    for my $item ($coll->sort(keys %price)) {
        print pad(entitle($item), $width, ".");
        printf " €%.2f\n", $price{$item};
    }

    sub pad ($str, $width, $padchar) {
        return $str . ($padchar x ($width - colwidth($str)));
    }

    sub colwidth ($str) {
        return Unicode::GCString->new($str)->columns;
    }

    sub entitle ($str) {
        $str =~ s{ (?=\pL)(\S)     (\S*) }
                 { ucfirst($1) . lc($2)  }xge;
        return $str;
    }

=head1 SEE ALSO

See these manpages, some of which are CPAN modules:
L<perlunicode>, L<perluniprops>,
L<perlre>, L<perlrecharclass>,
L<perluniintro>, L<perlunitut>, L<perlunifaq>,
L<PerlIO>, L<DB_File>, L<DBM_Filter>, L<DBM_Filter::utf8>,
L<Encode>, L<Encode::Locale>,
L<Unicode::UCD>,
L<Unicode::Normalize>,
L<Unicode::GCString>, L<Unicode::LineBreak>,
L<Unicode::Collate>, L<Unicode::Collate::Locale>,
L<Unicode::Unihan>,
L<Unicode::CaseFold>,
L<Unicode::Tussle>,
L<Lingua::JA::Romanize::Japanese>,
L<Lingua::ZH::Romanize::Pinyin>,
L<Lingua::KO::Romanize::Hangul>.

The L<Unicode::Tussle> CPAN module includes many programs
to help with working with Unicode, including
these programs to fully or partly replace standard utilities:
I<tcgrep> instead of I<egrep>,
I<uniquote> instead of I<cat -v> or I<hexdump>,
I<uniwc> instead of I<wc>,
I<unilook> instead of I<look>,
I<unifmt> instead of I<fmt>,
and
I<ucsort> instead of I<sort>.
For exploring Unicode character names and character properties,
see its I<uniprops>, I<unichars>, and I<uninames> programs.
It also supplies these programs, all of which are general filters that do Unicode-y things:
I<unititle> and I<unicaps>;
I<uniwide> and I<uninarrow>;
I<unisupers> and I<unisubs>;
I<nfd>, I<nfc>, I<nfkd>, and I<nfkc>;
and I<uc>, I<lc>, and I<tc>.

Finally, see the published Unicode Standard (page numbers are from version
6.0.0), including these specific annexes and technical reports:

=over

=item §3.13 Default Case Algorithms, page 113;
§4.2 Case, pages 120–122;
Case Mappings, pages 166–172, especially Caseless Matching starting on page 170.

=item UAX #44: Unicode Character Database

=item UTS #18: Unicode Regular Expressions

=item UAX #15: Unicode Normalization Forms

=item UTS #10: Unicode Collation Algorithm

=item UAX #29: Unicode Text Segmentation

=item UAX #14: Unicode Line Breaking Algorithm

=item UAX #11: East Asian Width

=back

=head1 AUTHOR

Tom Christiansen E<lt>tchrist@perl.comE<gt> wrote this, with occasional
kibitzing from Larry Wall and Jeffrey Friedl in the background.

=head1 COPYRIGHT AND LICENCE

Copyright © 2012 Tom Christiansen.

This program is free software; you may redistribute it and/or modify it
under the same terms as Perl itself.

Most of these examples were taken from the current edition of the “Camel Book”;
that is, from the 4ᵗʰ Edition of I<Programming Perl>, Copyright © 2012 Tom
Christiansen I<et al.>, 2012-02-13 by O’Reilly Media. The code itself is
freely redistributable, and you are encouraged to transplant, fold,
spindle, and mutilate any of the examples in this manpage however you please
for inclusion into your own programs without any encumbrance whatsoever.
Acknowledgement via code comment is polite but not required.

=head1 REVISION HISTORY

v1.0.0 – first public release, 2012-02-27