This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5
perlexperiment: document the private_use experiment
[perl5.git] / pod / perlunicook.pod
CommitLineData
2561daa4
RS
1
2=encoding utf8
3
4=head1 NAME
5
6perlunicook - cookbookish examples of handling Unicode in Perl
7
8=head1 DESCRIPTION
9
10This manpage contains short recipes demonstrating how to handle common Unicode
11operations in Perl, plus one complete program at the end. Any undeclared
12variables in individual recipes are assumed to have a previous appropriate
13value in them.
14
15=head1 EXAMPLES
16
17=head2 ℞ 0: Standard preamble
18
19Unless otherwise notes, all examples below require this standard preamble
20to work correctly, with the C<#!> adjusted to work on your system:
21
22 #!/usr/bin/env perl
23
24 use utf8; # so literals and identifiers can be in UTF-8
25 use v5.12; # or later to get "unicode_strings" feature
26 use strict; # quote strings, declare variables
27 use warnings; # on by default
28 use warnings qw(FATAL utf8); # fatalize encoding glitches
a8980281 29 use open qw(:std :encoding(UTF-8)); # undeclared streams in UTF-8
2561daa4
RS
30 use charnames qw(:full :short); # unneeded in v5.16
31
32This I<does> make even Unix programmers C<binmode> your binary streams,
33or open them with C<:raw>, but that's the only way to get at them
34portably anyway.
35
2a403855
MH
36B<WARNING>: C<use autodie> (pre 2.26) and C<use open> do not get along with each
37other.
2561daa4
RS
38
39=head2 ℞ 1: Generic Unicode-savvy filter
40
41Always decompose on the way in, then recompose on the way out.
42
43 use Unicode::Normalize;
44
45 while (<>) {
46 $_ = NFD($_); # decompose + reorder canonically
47 ...
48 } continue {
49 print NFC($_); # recompose (where possible) + reorder canonically
50 }
51
52=head2 ℞ 2: Fine-tuning Unicode warnings
53
ddeccf1f 54As of v5.14, Perl distinguishes three subclasses of UTF‑8 warnings.
2561daa4
RS
55
56 use v5.14; # subwarnings unavailable any earlier
57 no warnings "nonchar"; # the 66 forbidden non-characters
58 no warnings "surrogate"; # UTF-16/CESU-8 nonsense
59 no warnings "non_unicode"; # for codepoints over 0x10_FFFF
60
61=head2 ℞ 3: Declare source in utf8 for identifiers and literals
62
63Without the all-critical C<use utf8> declaration, putting UTF‑8 in your
64literals and identifiers won’t work right. If you used the standard
65preamble just given above, this already happened. If you did, you can
66do things like this:
67
68 use utf8;
69
70 my $measure = "Ångström";
71 my @μsoft = qw( cp852 cp1251 cp1252 );
72 my @ὑπέρμεγας = qw( ὑπέρ μεγας );
73 my @鯉 = qw( koi8-f koi8-u koi8-r );
74 my $motto = "👪 💗 🐪"; # FAMILY, GROWING HEART, DROMEDARY CAMEL
75
76If you forget C<use utf8>, high bytes will be misunderstood as
77separate characters, and nothing will work right.
78
79=head2 ℞ 4: Characters and their numbers
80
81The C<ord> and C<chr> functions work transparently on all codepoints,
82not just on ASCII alone — nor in fact, not even just on Unicode alone.
83
84 # ASCII characters
85 ord("A")
86 chr(65)
87
88 # characters from the Basic Multilingual Plane
89 ord("Σ")
90 chr(0x3A3)
91
92 # beyond the BMP
93 ord("𝑛") # MATHEMATICAL ITALIC SMALL N
94 chr(0x1D45B)
95
96 # beyond Unicode! (up to MAXINT)
97 ord("\x{20_0000}")
98 chr(0x20_0000)
99
100=head2 ℞ 5: Unicode literals by character number
101
102In an interpolated literal, whether a double-quoted string or a
103regex, you may specify a character by its number using the
104C<\x{I<HHHHHH>}> escape.
105
106 String: "\x{3a3}"
107 Regex: /\x{3a3}/
108
109 String: "\x{1d45b}"
110 Regex: /\x{1d45b}/
111
112 # even non-BMP ranges in regex work fine
113 /[\x{1D434}-\x{1D467}]/
114
115=head2 ℞ 6: Get character name by number
116
117 use charnames ();
118 my $name = charnames::viacode(0x03A3);
119
120=head2 ℞ 7: Get character number by name
121
122 use charnames ();
123 my $number = charnames::vianame("GREEK CAPITAL LETTER SIGMA");
124
125=head2 ℞ 8: Unicode named characters
126
127Use the C<< \N{I<charname>} >> notation to get the character
128by that name for use in interpolated literals (double-quoted
129strings and regexes). In v5.16, there is an implicit
130
131 use charnames qw(:full :short);
132
133But prior to v5.16, you must be explicit about which set of charnames you
134want. The C<:full> names are the official Unicode character name, alias, or
135sequence, which all share a namespace.
136
137 use charnames qw(:full :short latin greek);
138
139 "\N{MATHEMATICAL ITALIC SMALL N}" # :full
140 "\N{GREEK CAPITAL LETTER SIGMA}" # :full
141
142Anything else is a Perl-specific convenience abbreviation. Specify one or
143more scripts by names if you want short names that are script-specific.
144
145 "\N{Greek:Sigma}" # :short
146 "\N{ae}" # latin
147 "\N{epsilon}" # greek
148
149The v5.16 release also supports a C<:loose> import for loose matching of
150character names, which works just like loose matching of property names:
151that is, it disregards case, whitespace, and underscores:
152
153 "\N{euro sign}" # :loose (from v5.16)
154
673c254b
KW
155Starting in v5.32, you can also use
156
157 qr/\p{name=euro sign}/
158
159to get official Unicode named characters in regular expressions. Loose
160matching is always done for these.
161
2561daa4
RS
162=head2 ℞ 9: Unicode named sequences
163
164These look just like character names but return multiple codepoints.
165Notice the C<%vx> vector-print functionality in C<printf>.
166
167 use charnames qw(:full);
168 my $seq = "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}";
169 printf "U+%v04X\n", $seq;
170 U+0100.0300
171
172=head2 ℞ 10: Custom named characters
173
174Use C<:alias> to give your own lexically scoped nicknames to existing
175characters, or even to give unnamed private-use characters useful names.
176
177 use charnames ":full", ":alias" => {
178 ecute => "LATIN SMALL LETTER E WITH ACUTE",
179 "APPLE LOGO" => 0xF8FF, # private use character
180 };
181
182 "\N{ecute}"
183 "\N{APPLE LOGO}"
184
185=head2 ℞ 11: Names of CJK codepoints
186
187Sinograms like “東京” come back with character names of
188C<CJK UNIFIED IDEOGRAPH-6771> and C<CJK UNIFIED IDEOGRAPH-4EAC>,
189because their “names” vary. The CPAN C<Unicode::Unihan> module
190has a large database for decoding these (and a whole lot more), provided you
191know how to understand its output.
192
193 # cpan -i Unicode::Unihan
194 use Unicode::Unihan;
195 my $str = "東京";
63602a3f 196 my $unhan = Unicode::Unihan->new;
2561daa4
RS
197 for my $lang (qw(Mandarin Cantonese Korean JapaneseOn JapaneseKun)) {
198 printf "CJK $str in %-12s is ", $lang;
199 say $unhan->$lang($str);
200 }
201
202prints:
203
204 CJK 東京 in Mandarin is DONG1JING1
205 CJK 東京 in Cantonese is dung1ging1
206 CJK 東京 in Korean is TONGKYENG
207 CJK 東京 in JapaneseOn is TOUKYOU KEI KIN
208 CJK 東京 in JapaneseKun is HIGASHI AZUMAMIYAKO
209
210If you have a specific romanization scheme in mind,
211use the specific module:
212
213 # cpan -i Lingua::JA::Romanize::Japanese
214 use Lingua::JA::Romanize::Japanese;
63602a3f 215 my $k2r = Lingua::JA::Romanize::Japanese->new;
2561daa4
RS
216 my $str = "東京";
217 say "Japanese for $str is ", $k2r->chars($str);
218
219prints
220
221 Japanese for 東京 is toukyou
222
223=head2 ℞ 12: Explicit encode/decode
224
225On rare occasion, such as a database read, you may be
226given encoded text you need to decode.
227
228 use Encode qw(encode decode);
229
230 my $chars = decode("shiftjis", $bytes, 1);
231 # OR
232 my $bytes = encode("MIME-Header-ISO_2022_JP", $chars, 1);
233
234For streams all in the same encoding, don't use encode/decode; instead
235set the file encoding when you open the file or immediately after with
236C<binmode> as described later below.
237
238=head2 ℞ 13: Decode program arguments as utf8
239
240 $ perl -CA ...
241 or
242 $ export PERL_UNICODE=A
243 or
8e179dd8
P
244 use Encode qw(decode);
245 @ARGV = map { decode('UTF-8', $_, 1) } @ARGV;
2561daa4
RS
246
247=head2 ℞ 14: Decode program arguments as locale encoding
248
249 # cpan -i Encode::Locale
250 use Encode qw(locale);
251 use Encode::Locale;
252
253 # use "locale" as an arg to encode/decode
254 @ARGV = map { decode(locale => $_, 1) } @ARGV;
255
256=head2 ℞ 15: Declare STD{IN,OUT,ERR} to be utf8
257
258Use a command-line option, an environment variable, or else
259call C<binmode> explicitly:
260
261 $ perl -CS ...
262 or
263 $ export PERL_UNICODE=S
264 or
a8980281 265 use open qw(:std :encoding(UTF-8));
2561daa4 266 or
a8980281 267 binmode(STDIN, ":encoding(UTF-8)");
2561daa4
RS
268 binmode(STDOUT, ":utf8");
269 binmode(STDERR, ":utf8");
270
271=head2 ℞ 16: Declare STD{IN,OUT,ERR} to be in locale encoding
272
273 # cpan -i Encode::Locale
274 use Encode;
275 use Encode::Locale;
276
277 # or as a stream for binmode or open
278 binmode STDIN, ":encoding(console_in)" if -t STDIN;
279 binmode STDOUT, ":encoding(console_out)" if -t STDOUT;
280 binmode STDERR, ":encoding(console_out)" if -t STDERR;
281
282=head2 ℞ 17: Make file I/O default to utf8
283
ddeccf1f 284Files opened without an encoding argument will be in UTF-8:
2561daa4
RS
285
286 $ perl -CD ...
287 or
288 $ export PERL_UNICODE=D
289 or
a8980281 290 use open qw(:encoding(UTF-8));
2561daa4
RS
291
292=head2 ℞ 18: Make all I/O and args default to utf8
293
294 $ perl -CSDA ...
295 or
296 $ export PERL_UNICODE=SDA
297 or
a8980281 298 use open qw(:std :encoding(UTF-8));
8e179dd8
P
299 use Encode qw(decode);
300 @ARGV = map { decode('UTF-8', $_, 1) } @ARGV;
2561daa4
RS
301
302=head2 ℞ 19: Open file with specific encoding
303
304Specify stream encoding. This is the normal way
305to deal with encoded text, not by calling low-level
306functions.
307
308 # input file
309 open(my $in_file, "< :encoding(UTF-16)", "wintext");
310 OR
311 open(my $in_file, "<", "wintext");
312 binmode($in_file, ":encoding(UTF-16)");
313 THEN
314 my $line = <$in_file>;
315
316 # output file
317 open($out_file, "> :encoding(cp1252)", "wintext");
318 OR
319 open(my $out_file, ">", "wintext");
320 binmode($out_file, ":encoding(cp1252)");
321 THEN
322 print $out_file "some text\n";
323
324More layers than just the encoding can be specified here. For example,
325the incantation C<":raw :encoding(UTF-16LE) :crlf"> includes implicit
326CRLF handling.
327
328=head2 ℞ 20: Unicode casing
329
330Unicode casing is very different from ASCII casing.
331
332 uc("henry ⅷ") # "HENRY Ⅷ"
333 uc("tschüß") # "TSCHÜSS" notice ß => SS
334
335 # both are true:
336 "tschüß" =~ /TSCHÜSS/i # notice ß => SS
337 "Σίσυφος" =~ /ΣΊΣΥΦΟΣ/i # notice Σ,σ,ς sameness
338
339=head2 ℞ 21: Unicode case-insensitive comparisons
340
341Also available in the CPAN L<Unicode::CaseFold> module,
342the new C<fc> “foldcase” function from v5.16 grants
343access to the same Unicode casefolding as the C</i>
344pattern modifier has always used:
345
346 use feature "fc"; # fc() function is from v5.16
347
348 # sort case-insensitively
349 my @sorted = sort { fc($a) cmp fc($b) } @list;
350
351 # both are true:
352 fc("tschüß") eq fc("TSCHÜSS")
353 fc("Σίσυφος") eq fc("ΣΊΣΥΦΟΣ")
354
355=head2 ℞ 22: Match Unicode linebreak sequence in regex
356
357A Unicode linebreak matches the two-character CRLF
358grapheme or any of seven vertical whitespace characters.
359Good for dealing with textfiles coming from different
360operating systems.
361
362 \R
363
364 s/\R/\n/g; # normalize all linebreaks to \n
365
366=head2 ℞ 23: Get character category
367
368Find the general category of a numeric codepoint.
369
370 use Unicode::UCD qw(charinfo);
371 my $cat = charinfo(0x3A3)->{category}; # "Lu"
372
373=head2 ℞ 24: Disabling Unicode-awareness in builtin charclasses
374
375Disable C<\w>, C<\b>, C<\s>, C<\d>, and the POSIX
376classes from working correctly on Unicode either in this
377scope, or in just one regex.
378
379 use v5.14;
380 use re "/a";
381
382 # OR
383
384 my($num) = $str =~ /(\d+)/a;
385
386Or use specific un-Unicode properties, like C<\p{ahex}>
387and C<\p{POSIX_Digit>}. Properties still work normally
388no matter what charset modifiers (C</d /u /l /a /aa>)
389should be effect.
390
391=head2 ℞ 25: Match Unicode properties in regex with \p, \P
392
393These all match a single codepoint with the given
394property. Use C<\P> in place of C<\p> to match
395one codepoint lacking that property.
396
397 \pL, \pN, \pS, \pP, \pM, \pZ, \pC
398 \p{Sk}, \p{Ps}, \p{Lt}
399 \p{alpha}, \p{upper}, \p{lower}
400 \p{Latin}, \p{Greek}
48791bf1 401 \p{script_extensions=Latin}, \p{scx=Greek}
2561daa4
RS
402 \p{East_Asian_Width=Wide}, \p{EA=W}
403 \p{Line_Break=Hyphen}, \p{LB=HY}
404 \p{Numeric_Value=4}, \p{NV=4}
405
406=head2 ℞ 26: Custom character properties
407
408Define at compile-time your own custom character
409properties for use in regexes.
410
411 # using private-use characters
412 sub In_Tengwar { "E000\tE07F\n" }
413
414 if (/\p{In_Tengwar}/) { ... }
415
416 # blending existing properties
417 sub Is_GraecoRoman_Title {<<'END_OF_SET'}
418 +utf8::IsLatin
419 +utf8::IsGreek
420 &utf8::IsTitle
421 END_OF_SET
422
423 if (/\p{Is_GraecoRoman_Title}/ { ... }
424
425=head2 ℞ 27: Unicode normalization
426
427Typically render into NFD on input and NFC on output. Using NFKC or NFKD
428functions improves recall on searches, assuming you've already done to the
429same text to be searched. Note that this is about much more than just pre-
430combined compatibility glyphs; it also reorders marks according to their
431canonical combining classes and weeds out singletons.
432
433 use Unicode::Normalize;
434 my $nfd = NFD($orig);
435 my $nfc = NFC($orig);
436 my $nfkd = NFKD($orig);
437 my $nfkc = NFKC($orig);
438
439=head2 ℞ 28: Convert non-ASCII Unicode numerics
440
441Unless you’ve used C</a> or C</aa>, C<\d> matches more than
442ASCII digits only, but Perl’s implicit string-to-number
443conversion does not current recognize these. Here’s how to
444convert such strings manually.
445
446 use v5.14; # needed for num() function
447 use Unicode::UCD qw(num);
448 my $str = "got Ⅻ and ४५६७ and ⅞ and here";
449 my @nums = ();
7b237c8f 450 while ($str =~ /(\d+|\N)/g) { # not just ASCII!
2561daa4
RS
451 push @nums, num($1);
452 }
453 say "@nums"; # 12 4567 0.875
454
455 use charnames qw(:full);
456 my $nv = num("\N{RUMI DIGIT ONE}\N{RUMI DIGIT TWO}");
457
458=head2 ℞ 29: Match Unicode grapheme cluster in regex
459
460Programmer-visible “characters” are codepoints matched by C</./s>,
461but user-visible “characters” are graphemes matched by C</\X/>.
462
463 # Find vowel *plus* any combining diacritics,underlining,etc.
464 my $nfd = NFD($orig);
465 $nfd =~ / (?=[aeiou]) \X /xi
466
467=head2 ℞ 30: Extract by grapheme instead of by codepoint (regex)
468
469 # match and grab five first graphemes
470 my($first_five) = $str =~ /^ ( \X{5} ) /x;
471
472=head2 ℞ 31: Extract by grapheme instead of by codepoint (substr)
473
474 # cpan -i Unicode::GCString
475 use Unicode::GCString;
476 my $gcs = Unicode::GCString->new($str);
477 my $first_five = $gcs->substr(0, 5);
478
479=head2 ℞ 32: Reverse string by grapheme
480
481Reversing by codepoint messes up diacritics, mistakenly converting
482C<crème brûlée> into C<éel̂urb em̀erc> instead of into C<eélûrb emèrc>;
483so reverse by grapheme instead. Both these approaches work
484right no matter what normalization the string is in:
485
486 $str = join("", reverse $str =~ /\X/g);
487
488 # OR: cpan -i Unicode::GCString
489 use Unicode::GCString;
490 $str = reverse Unicode::GCString->new($str);
491
492=head2 ℞ 33: String length in graphemes
493
494The string C<brûlée> has six graphemes but up to eight codepoints.
495This counts by grapheme, not by codepoint:
496
497 my $str = "brûlée";
498 my $count = 0;
499 while ($str =~ /\X/g) { $count++ }
500
501 # OR: cpan -i Unicode::GCString
502 use Unicode::GCString;
503 my $gcs = Unicode::GCString->new($str);
504 my $count = $gcs->length;
505
506=head2 ℞ 34: Unicode column-width for printing
507
508Perl’s C<printf>, C<sprintf>, and C<format> think all
509codepoints take up 1 print column, but many take 0 or 2.
510Here to show that normalization makes no difference,
511we print out both forms:
512
513 use Unicode::GCString;
514 use Unicode::Normalize;
515
516 my @words = qw/crème brûlée/;
517 @words = map { NFC($_), NFD($_) } @words;
518
519 for my $str (@words) {
520 my $gcs = Unicode::GCString->new($str);
521 my $cols = $gcs->columns;
522 my $pad = " " x (10 - $cols);
523 say str, $pad, " |";
524 }
525
526generates this to show that it pads correctly no matter
527the normalization:
528
529 crème |
530 crème |
531 brûlée |
532 brûlée |
533
534=head2 ℞ 35: Unicode collation
535
536Text sorted by numeric codepoint follows no reasonable alphabetic order;
537use the UCA for sorting text.
538
539 use Unicode::Collate;
540 my $col = Unicode::Collate->new();
541 my @list = $col->sort(@old_list);
542
543See the I<ucsort> program from the L<Unicode::Tussle> CPAN module
ddeccf1f 544for a convenient command-line interface to this module.
2561daa4
RS
545
546=head2 ℞ 36: Case- I<and> accent-insensitive Unicode sort
547
548Specify a collation strength of level 1 to ignore case and
549diacritics, only looking at the basic character.
550
551 use Unicode::Collate;
552 my $col = Unicode::Collate->new(level => 1);
553 my @list = $col->sort(@old_list);
554
555=head2 ℞ 37: Unicode locale collation
556
557Some locales have special sorting rules.
558
559 # either use v5.12, OR: cpan -i Unicode::Collate::Locale
560 use Unicode::Collate::Locale;
561 my $col = Unicode::Collate::Locale->new(locale => "de__phonebook");
562 my @list = $col->sort(@old_list);
563
564The I<ucsort> program mentioned above accepts a C<--locale> parameter.
565
566=head2 ℞ 38: Making C<cmp> work on text instead of codepoints
567
568Instead of this:
569
570 @srecs = sort {
571 $b->{AGE} <=> $a->{AGE}
572 ||
573 $a->{NAME} cmp $b->{NAME}
574 } @recs;
575
576Use this:
577
578 my $coll = Unicode::Collate->new();
579 for my $rec (@recs) {
580 $rec->{NAME_key} = $coll->getSortKey( $rec->{NAME} );
581 }
582 @srecs = sort {
583 $b->{AGE} <=> $a->{AGE}
584 ||
585 $a->{NAME_key} cmp $b->{NAME_key}
586 } @recs;
587
588=head2 ℞ 39: Case- I<and> accent-insensitive comparisons
589
590Use a collator object to compare Unicode text by character
591instead of by codepoint.
592
593 use Unicode::Collate;
594 my $es = Unicode::Collate->new(
595 level => 1,
596 normalization => undef
597 );
598
599 # now both are true:
600 $es->eq("García", "GARCIA" );
601 $es->eq("Márquez", "MARQUEZ");
602
603=head2 ℞ 40: Case- I<and> accent-insensitive locale comparisons
604
605Same, but in a specific locale.
606
607 my $de = Unicode::Collate::Locale->new(
608 locale => "de__phonebook",
609 );
610
611 # now this is true:
612 $de->eq("tschüß", "TSCHUESS"); # notice ü => UE, ß => SS
613
614=head2 ℞ 41: Unicode linebreaking
615
616Break up text into lines according to Unicode rules.
617
618 # cpan -i Unicode::LineBreak
619 use Unicode::LineBreak;
620 use charnames qw(:full);
621
622 my $para = "This is a super\N{HYPHEN}long string. " x 20;
63602a3f 623 my $fmt = Unicode::LineBreak->new;
2561daa4
RS
624 print $fmt->break($para), "\n";
625
626=head2 ℞ 42: Unicode text in DBM hashes, the tedious way
627
628Using a regular Perl string as a key or value for a DBM
629hash will trigger a wide character exception if any codepoints
630won’t fit into a byte. Here’s how to manually manage the translation:
631
632 use DB_File;
633 use Encode qw(encode decode);
634 tie %dbhash, "DB_File", "pathname";
635
636 # STORE
637
638 # assume $uni_key and $uni_value are abstract Unicode strings
639 my $enc_key = encode("UTF-8", $uni_key, 1);
640 my $enc_value = encode("UTF-8", $uni_value, 1);
641 $dbhash{$enc_key} = $enc_value;
642
643 # FETCH
644
645 # assume $uni_key holds a normal Perl string (abstract Unicode)
646 my $enc_key = encode("UTF-8", $uni_key, 1);
647 my $enc_value = $dbhash{$enc_key};
7b237c8f 648 my $uni_value = decode("UTF-8", $enc_value, 1);
2561daa4
RS
649
650=head2 ℞ 43: Unicode text in DBM hashes, the easy way
651
652Here’s how to implicitly manage the translation; all encoding
653and decoding is done automatically, just as with streams that
654have a particular encoding attached to them:
655
656 use DB_File;
657 use DBM_Filter;
658
659 my $dbobj = tie %dbhash, "DB_File", "pathname";
660 $dbobj->Filter_Value("utf8"); # this is the magic bit
661
662 # STORE
663
664 # assume $uni_key and $uni_value are abstract Unicode strings
665 $dbhash{$uni_key} = $uni_value;
666
667 # FETCH
668
669 # $uni_key holds a normal Perl string (abstract Unicode)
670 my $uni_value = $dbhash{$uni_key};
671
672=head2 ℞ 44: PROGRAM: Demo of Unicode collation and printing
673
674Here’s a full program showing how to make use of locale-sensitive
675sorting, Unicode casing, and managing print widths when some of the
676characters take up zero or two columns, not just one column each time.
677When run, the following program produces this nicely aligned output:
678
679 Crème Brûlée....... €2.00
680 Éclair............. €1.60
681 Fideuà............. €4.20
682 Hamburger.......... €6.00
683 Jamón Serrano...... €4.45
684 Linguiça........... €7.00
685 Pâté............... €4.15
686 Pears.............. €2.00
687 Pêches............. €2.25
688 Smørbrød........... €5.75
689 Spätzle............ €5.50
690 Xoriço............. €3.00
691 Γύρος.............. €6.50
692 막걸리............. €4.00
693 おもち............. €2.65
694 お好み焼き......... €8.00
695 シュークリーム..... €1.85
696 寿司............... €9.99
697 包子............... €7.50
698
699Here's that program; tested on v5.14.
700
701 #!/usr/bin/env perl
702 # umenu - demo sorting and printing of Unicode food
703 #
704 # (obligatory and increasingly long preamble)
705 #
706 use utf8;
707 use v5.14; # for locale sorting
708 use strict;
709 use warnings;
710 use warnings qw(FATAL utf8); # fatalize encoding faults
a8980281 711 use open qw(:std :encoding(UTF-8)); # undeclared streams in UTF-8
2561daa4
RS
712 use charnames qw(:full :short); # unneeded in v5.16
713
714 # std modules
715 use Unicode::Normalize; # std perl distro as of v5.8
716 use List::Util qw(max); # std perl distro as of v5.10
717 use Unicode::Collate::Locale; # std perl distro as of v5.14
718
719 # cpan modules
720 use Unicode::GCString; # from CPAN
721
722 # forward defs
723 sub pad($$$);
724 sub colwidth(_);
725 sub entitle(_);
726
727 my %price = (
728 "γύρος" => 6.50, # gyros
729 "pears" => 2.00, # like um, pears
730 "linguiça" => 7.00, # spicy sausage, Portuguese
731 "xoriço" => 3.00, # chorizo sausage, Catalan
732 "hamburger" => 6.00, # burgermeister meisterburger
733 "éclair" => 1.60, # dessert, French
734 "smørbrød" => 5.75, # sandwiches, Norwegian
735 "spätzle" => 5.50, # Bayerisch noodles, little sparrows
736 "包子" => 7.50, # bao1 zi5, steamed pork buns, Mandarin
737 "jamón serrano" => 4.45, # country ham, Spanish
738 "pêches" => 2.25, # peaches, French
739 "シュークリーム" => 1.85, # cream-filled pastry like eclair
740 "막걸리" => 4.00, # makgeolli, Korean rice wine
741 "寿司" => 9.99, # sushi, Japanese
742 "おもち" => 2.65, # omochi, rice cakes, Japanese
743 "crème brûlée" => 2.00, # crema catalana
720a02e2
FC
744 "fideuà" => 4.20, # more noodles, Valencian
745 # (Catalan=fideuada)
2561daa4
RS
746 "pâté" => 4.15, # gooseliver paste, French
747 "お好み焼き" => 8.00, # okonomiyaki, Japanese
748 );
749
750 my $width = 5 + max map { colwidth } keys %price;
751
752 # So the Asian stuff comes out in an order that someone
753 # who reads those scripts won't freak out over; the
754 # CJK stuff will be in JIS X 0208 order that way.
63602a3f 755 my $coll = Unicode::Collate::Locale->new(locale => "ja");
2561daa4
RS
756
757 for my $item ($coll->sort(keys %price)) {
758 print pad(entitle($item), $width, ".");
759 printf " €%.2f\n", $price{$item};
760 }
761
762 sub pad($$$) {
763 my($str, $width, $padchar) = @_;
764 return $str . ($padchar x ($width - colwidth($str)));
765 }
766
767 sub colwidth(_) {
768 my($str) = @_;
769 return Unicode::GCString->new($str)->columns;
770 }
771
772 sub entitle(_) {
773 my($str) = @_;
774 $str =~ s{ (?=\pL)(\S) (\S*) }
775 { ucfirst($1) . lc($2) }xge;
776 return $str;
777 }
778
779=head1 SEE ALSO
780
781See these manpages, some of which are CPAN modules:
782L<perlunicode>, L<perluniprops>,
783L<perlre>, L<perlrecharclass>,
784L<perluniintro>, L<perlunitut>, L<perlunifaq>,
785L<PerlIO>, L<DB_File>, L<DBM_Filter>, L<DBM_Filter::utf8>,
786L<Encode>, L<Encode::Locale>,
787L<Unicode::UCD>,
788L<Unicode::Normalize>,
789L<Unicode::GCString>, L<Unicode::LineBreak>,
790L<Unicode::Collate>, L<Unicode::Collate::Locale>,
791L<Unicode::Unihan>,
792L<Unicode::CaseFold>,
793L<Unicode::Tussle>,
794L<Lingua::JA::Romanize::Japanese>,
795L<Lingua::ZH::Romanize::Pinyin>,
796L<Lingua::KO::Romanize::Hangul>.
797
798The L<Unicode::Tussle> CPAN module includes many programs
799to help with working with Unicode, including
800these programs to fully or partly replace standard utilities:
801I<tcgrep> instead of I<egrep>,
802I<uniquote> instead of I<cat -v> or I<hexdump>,
803I<uniwc> instead of I<wc>,
804I<unilook> instead of I<look>,
805I<unifmt> instead of I<fmt>,
806and
807I<ucsort> instead of I<sort>.
808For exploring Unicode character names and character properties,
809see its I<uniprops>, I<unichars>, and I<uninames> programs.
810It also supplies these programs, all of which are general filters that do Unicode-y things:
811I<unititle> and I<unicaps>;
812I<uniwide> and I<uninarrow>;
813I<unisupers> and I<unisubs>;
814I<nfd>, I<nfc>, I<nfkd>, and I<nfkc>;
815and I<uc>, I<lc>, and I<tc>.
816
817Finally, see the published Unicode Standard (page numbers are from version
8186.0.0), including these specific annexes and technical reports:
819
820=over
821
822=item §3.13 Default Case Algorithms, page 113;
823§4.2 Case, pages 120–122;
824Case Mappings, page 166–172, especially Caseless Matching starting on page 170.
825
2561daa4
RS
826=item UAX #44: Unicode Character Database
827
828=item UTS #18: Unicode Regular Expressions
829
830=item UAX #15: Unicode Normalization Forms
831
832=item UTS #10: Unicode Collation Algorithm
833
834=item UAX #29: Unicode Text Segmentation
835
836=item UAX #14: Unicode Line Breaking Algorithm
837
838=item UAX #11: East Asian Width
839
840=back
841
842=head1 AUTHOR
843
844Tom Christiansen E<lt>tchrist@perl.comE<gt> wrote this, with occasional
845kibbitzing from Larry Wall and Jeffrey Friedl in the background.
846
847=head1 COPYRIGHT AND LICENCE
848
849Copyright © 2012 Tom Christiansen.
850
851This program is free software; you may redistribute it and/or modify it
852under the same terms as Perl itself.
853
854Most of these examples taken from the current edition of the “Camel Book”;
855that is, from the 4ᵗʰ Edition of I<Programming Perl>, Copyright © 2012 Tom
856Christiansen <et al.>, 2012-02-13 by O’Reilly Media. The code itself is
857freely redistributable, and you are encouraged to transplant, fold,
858spindle, and mutilate any of the examples in this manpage however you please
859for inclusion into your own programs without any encumbrance whatsoever.
860Acknowledgement via code comment is polite but not required.
861
ddeccf1f 862=head1 REVISION HISTORY
2561daa4
RS
863
864v1.0.0 – first public release, 2012-02-27