=head1 NAME

perlunicook - cookbookish examples of handling Unicode in Perl

=head1 DESCRIPTION

This manpage contains short recipes demonstrating how to handle common Unicode
operations in Perl, plus one complete program at the end. Any undeclared
variables in individual recipes are assumed to have a previous appropriate
declaration.
=head2 ℞ 0: Standard preamble

Unless otherwise noted, all examples below require this standard preamble
to work correctly, with the C<#!> adjusted to work on your system:

    #!/usr/bin/env perl

    use utf8;      # so literals and identifiers can be in UTF-8
    use v5.12;     # or later to get "unicode_strings" feature
    use strict;    # quote strings, declare variables
    use warnings;  # on by default
    use warnings  qw(FATAL utf8);    # fatalize encoding glitches
    use open      qw(:std :encoding(UTF-8)); # undeclared streams in UTF-8
    use charnames qw(:full :short);  # unneeded in v5.16
This I<does> make even Unix programmers C<binmode> your binary streams,
or open them with C<:raw>, but that's the only way to get at them
portably anyway.

B<WARNING>: C<use autodie> (pre 2.26) and C<use open> do not get along
with each other.
=head2 ℞ 1: Generic Unicode-savvy filter

Always decompose on the way in, then recompose on the way out.

    use Unicode::Normalize;

    while (<>) {
        $_ = NFD($_);   # decompose + reorder canonically
        ...
    } continue {
        print NFC($_);  # recompose (where possible) + reorder canonically
    }
=head2 ℞ 2: Fine-tuning Unicode warnings

As of v5.14, Perl distinguishes three subclasses of UTF-8 warnings.

    use v5.14;                  # subwarnings unavailable any earlier
    no warnings "nonchar";      # the 66 forbidden non-characters
    no warnings "surrogate";    # UTF-16/CESU-8 nonsense
    no warnings "non_unicode";  # for codepoints over 0x10_FFFF
=head2 ℞ 3: Declare source in utf8 for identifiers and literals

Without the all-critical C<use utf8> declaration, putting UTF-8 in your
literals and identifiers won’t work right. If you used the standard
preamble just given above, this already happened. If you did, you can
then do things like this:

    use utf8;

    my $measure   = "Ångström";
    my @μsoft     = qw( cp852 cp1251 cp1252 );
    my @ὑπέρμεγας = qw( ὑπέρ μεγας );
    my @鯉        = qw( koi8-f koi8-u koi8-r );
    my $motto     = "👪 💗 🐪"; # FAMILY, GROWING HEART, DROMEDARY CAMEL

If you forget C<use utf8>, high bytes will be misunderstood as
separate characters, and nothing will work right.
=head2 ℞ 4: Characters and their numbers

The C<ord> and C<chr> functions work transparently on all codepoints,
not just on ASCII alone, nor in fact even just on Unicode alone.

    # characters from the Basic Multilingual Plane
    ord("Σ")
    chr(0x3A3)

    # beyond the BMP
    ord("𝑛")     # MATHEMATICAL ITALIC SMALL N
    chr(0x1D45B)

    # beyond Unicode! (up to MAXINT)
    ord("\x{20_0000}")
    chr(0x20_0000)
=head2 ℞ 5: Unicode literals by character number

In an interpolated literal, whether a double-quoted string or a
regex, you may specify a character by its number using the
C<\x{I<HHHHHH>}> escape.

    String: "\x{3a3}"
    Regex:  /\x{3a3}/

    String: "\x{1d45b}"
    Regex:  /\x{1d45b}/

    # even non-BMP ranges in regex work fine
    /[\x{1D434}-\x{1D467}]/
=head2 ℞ 6: Get character name by number

    use charnames ();
    my $name = charnames::viacode(0x03A3);

=head2 ℞ 7: Get character number by name

    use charnames ();
    my $number = charnames::vianame("GREEK CAPITAL LETTER SIGMA");
=head2 ℞ 8: Unicode named characters

Use the C<< \N{I<charname>} >> notation to get the character
by that name for use in interpolated literals (double-quoted
strings and regexes). In v5.16, there is an implicit

    use charnames qw(:full :short);

But prior to v5.16, you must be explicit about which set of charnames you
want. The C<:full> names are the official Unicode character names, aliases,
and sequences, which all share a namespace.

    use charnames qw(:full :short latin greek);

    "\N{MATHEMATICAL ITALIC SMALL N}"   # :full
    "\N{GREEK CAPITAL LETTER SIGMA}"    # :full

Anything else is a Perl-specific convenience abbreviation. Specify one or
more scripts by name if you want short names that are script-specific.

    "\N{Greek:Sigma}"                   # :short
    "\N{ae}"                            #  latin
    "\N{epsilon}"                       #  greek

The v5.16 release also supports a C<:loose> import for loose matching of
character names, which works just like loose matching of property names:
that is, it disregards case, whitespace, and underscores:

    "\N{euro sign}"                     # :loose (from v5.16)

Starting in v5.32, you can also use

    qr/\p{name=euro sign}/

to get official Unicode named characters in regular expressions. Loose
matching is always done for these.
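As a small self-contained sketch (the sample string is invented), matching a
character by its loosely written name in a pattern:

    use v5.32;       # \p{name=...} needs v5.32 or later
    use utf8;
    use strict;
    use warnings;

    my $price = "€9.99";

    # loose name matching: case, spaces, and underscores are ignored
    if ($price =~ /^\p{name=euro sign}/) {
        print "price is in euros\n";
    }
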
=head2 ℞ 9: Unicode named sequences

These look just like character names but return multiple codepoints.
Notice the C<%vx> vector-print functionality in C<printf>.

    use charnames qw(:full);
    my $seq = "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}";
    printf "U+%v04X\n", $seq;
    # U+0100.0300
=head2 ℞ 10: Custom named characters

Use C<:alias> to give your own lexically scoped nicknames to existing
characters, or even to give unnamed private-use characters useful names.

    use charnames ":full", ":alias" => {
        ecute => "LATIN SMALL LETTER E WITH ACUTE",
        "APPLE LOGO" => 0xF8FF, # private use character
    };

    "\N{ecute}"
    "\N{APPLE LOGO}"
=head2 ℞ 11: Names of CJK codepoints

Sinograms like “東京” come back with character names of
C<CJK UNIFIED IDEOGRAPH-6771> and C<CJK UNIFIED IDEOGRAPH-4EAC>,
because their “names” vary. The CPAN C<Unicode::Unihan> module
has a large database for decoding these (and a whole lot more), provided you
know how to understand its output.

    # cpan -i Unicode::Unihan
    use Unicode::Unihan;
    my $str = "東京";
    my $unhan = Unicode::Unihan->new;
    for my $lang (qw(Mandarin Cantonese Korean JapaneseOn JapaneseKun)) {
        printf "CJK $str in %-12s is ", $lang;
        say $unhan->$lang($str);
    }

prints:

    CJK 東京 in Mandarin     is DONG1JING1
    CJK 東京 in Cantonese    is dung1ging1
    CJK 東京 in Korean       is TONGKYENG
    CJK 東京 in JapaneseOn   is TOUKYOU KEI KIN
    CJK 東京 in JapaneseKun  is HIGASHI AZUMAMIYAKO
If you have a specific romanization scheme in mind,
use the specific module:

    # cpan -i Lingua::JA::Romanize::Japanese
    use Lingua::JA::Romanize::Japanese;
    my $k2r = Lingua::JA::Romanize::Japanese->new;
    my $str = "東京";
    say "Japanese for $str is ", $k2r->chars($str);

prints:

    Japanese for 東京 is toukyou
=head2 ℞ 12: Explicit encode/decode

On rare occasion, such as a database read, you may be
given encoded text you need to decode.

    use Encode qw(encode decode);

    my $chars = decode("shiftjis", $bytes, 1);
    # OR
    my $bytes = encode("MIME-Header-ISO_2022_JP", $chars, 1);

For streams all in the same encoding, don't use encode/decode; instead
set the file encoding when you open the file or immediately after with
C<binmode> as described below.
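As a quick self-contained sketch (the sample string is arbitrary), decoding
what you encoded round-trips exactly, and character length differs from byte
length:

    use strict;
    use warnings;
    use utf8;
    use Encode qw(encode decode);

    my $chars = "café";                      # four abstract characters
    my $bytes = encode("UTF-8", $chars, 1);  # five bytes: é becomes 0xC3 0xA9
    my $again = decode("UTF-8", $bytes, 1);  # back to four characters

    printf "%d chars, %d bytes\n", length($again), length($bytes);
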
=head2 ℞ 13: Decode program arguments as utf8

    $ perl -CA ...

or

    $ export PERL_UNICODE=A

or

    use Encode qw(decode);
    @ARGV = map { decode('UTF-8', $_, 1) } @ARGV;
=head2 ℞ 14: Decode program arguments as locale encoding

    # cpan -i Encode::Locale
    use Encode qw(locale);
    use Encode::Locale;

    # use "locale" as an arg to encode/decode
    @ARGV = map { decode(locale => $_, 1) } @ARGV;
=head2 ℞ 15: Declare STD{IN,OUT,ERR} to be utf8

Use a command-line option, an environment variable, or else
call C<binmode> explicitly:

    $ perl -CS ...

or

    $ export PERL_UNICODE=S

or

    use open qw(:std :encoding(UTF-8));

or

    binmode(STDIN,  ":encoding(UTF-8)");
    binmode(STDOUT, ":utf8");
    binmode(STDERR, ":utf8");
=head2 ℞ 16: Declare STD{IN,OUT,ERR} to be in locale encoding

    # cpan -i Encode::Locale
    use Encode;
    use Encode::Locale;

    # or as a stream for binmode or open
    binmode STDIN,  ":encoding(console_in)"  if -t STDIN;
    binmode STDOUT, ":encoding(console_out)" if -t STDOUT;
    binmode STDERR, ":encoding(console_out)" if -t STDERR;
=head2 ℞ 17: Make file I/O default to utf8

Files opened without an encoding argument will be in UTF-8:

    $ perl -CD ...

or

    $ export PERL_UNICODE=D

or

    use open qw(:encoding(UTF-8));

=head2 ℞ 18: Make all I/O and args default to utf8

    $ perl -CSDA ...

or

    $ export PERL_UNICODE=SDA

or

    use open qw(:std :encoding(UTF-8));
    use Encode qw(decode);
    @ARGV = map { decode('UTF-8', $_, 1) } @ARGV;
=head2 ℞ 19: Open file with specific encoding

Specify stream encoding. This is the normal way
to deal with encoded text, not by calling low-level
functions.

    # input file
    open(my $in_file, "< :encoding(UTF-16)", "wintext");
    # OR
    open(my $in_file, "<", "wintext");
    binmode($in_file, ":encoding(UTF-16)");

    my $line = <$in_file>;

    # output file
    open(my $out_file, "> :encoding(cp1252)", "wintext");
    # OR
    open(my $out_file, ">", "wintext");
    binmode($out_file, ":encoding(cp1252)");

    print $out_file "some text\n";
More layers than just the encoding can be specified here. For example,
the incantation C<":raw :encoding(UTF-16LE) :crlf"> includes implicit
CRLF handling.
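For instance, here is a sketch (the filename is made up) that writes and then
reads a CRLF-terminated UTF-16LE file purely through that layer stack:

    use strict;
    use warnings;
    use utf8;

    my $file = "wintext.tmp";   # hypothetical scratch file

    # every \n goes out as CRLF, every character as little-endian UTF-16
    open(my $out, "> :raw :encoding(UTF-16LE) :crlf", $file)
        || die "cannot write $file: $!";
    print $out "Ångström\n";
    close($out) || die "close failed: $!";

    # the same stack undoes both transformations on the way back in
    open(my $in, "< :raw :encoding(UTF-16LE) :crlf", $file)
        || die "cannot read $file: $!";
    my $line = <$in>;
    close($in);
    unlink $file;

    chomp $line;
    print length($line), " characters\n";   # 8
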
=head2 ℞ 20: Unicode casing

Unicode casing is very different from ASCII casing.

    uc("henry ⅷ")  # "HENRY Ⅷ"
    uc("tschüß")   # "TSCHÜSS"  notice ß => SS

    # both are true:
    "tschüß"  =~ /TSCHÜSS/i   # notice ß => SS
    "Σίσυφος" =~ /ΣΊΣΥΦΟΣ/i   # notice Σ,σ,ς sameness
=head2 ℞ 21: Unicode case-insensitive comparisons

Also available in the CPAN L<Unicode::CaseFold> module,
the new C<fc> “foldcase” function from v5.16 grants
access to the same Unicode casefolding as the C</i>
pattern modifier has always used:

    use feature "fc"; # fc() function is from v5.16

    # sort case-insensitively
    my @sorted = sort { fc($a) cmp fc($b) } @list;

    # both are true:
    fc("tschüß")  eq fc("TSCHÜSS")
    fc("Σίσυφος") eq fc("ΣΊΣΥΦΟΣ")
=head2 ℞ 22: Match Unicode linebreak sequence in regex

A Unicode linebreak matches the two-character CRLF
grapheme or any of seven vertical whitespace characters.
Good for dealing with textfiles coming from different
operating systems.

    use v5.10;  # because of \R
    s/\R/\n/g;  # normalize all linebreaks to \n
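A tiny self-contained sketch (the sample string is made up) showing C<\R>
catching CRLF, bare CR, and U+2028 LINE SEPARATOR alike:

    use strict;
    use warnings;

    my $text = "one\r\ntwo\rthree\x{2028}four";
    (my $fixed = $text) =~ s/\R/\n/g;   # every linebreak flavor becomes \n
    my @lines = split /\n/, $fixed;
    print scalar(@lines), " lines\n";   # 4 lines
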
=head2 ℞ 23: Get character category

Find the general category of a numeric codepoint.

    use Unicode::UCD qw(charinfo);
    my $cat = charinfo(0x3A3)->{category};  # "Lu"
=head2 ℞ 24: Disabling Unicode-awareness in builtin charclasses

Disable C<\w>, C<\b>, C<\s>, C<\d>, and the POSIX
classes from working correctly on Unicode either in this
scope, or in just one regex.

    use v5.14;
    use re "/a";

    # OR

    my($num) = $str =~ /(\d+)/a;

Or use specific un-Unicode properties, like C<\p{ahex}>
and C<\p{POSIX_Digit}>. Properties still work normally
no matter what charset modifiers (C</d /u /l /a /aa>)
are in effect.
=head2 ℞ 25: Match Unicode properties in regex with \p, \P

These all match a single codepoint with the given
property. Use C<\P> in place of C<\p> to match
one codepoint lacking that property.

    \pL, \pN, \pS, \pP, \pM, \pZ, \pC
    \p{Sk}, \p{Ps}, \p{Lt}
    \p{alpha}, \p{upper}, \p{lower}
    \p{Latin}, \p{Greek}
    \p{script_extensions=Latin}, \p{scx=Greek}
    \p{East_Asian_Width=Wide}, \p{EA=W}
    \p{Line_Break=Hyphen}, \p{LB=HY}
    \p{Numeric_Value=4}, \p{NV=4}
=head2 ℞ 26: Custom character properties

Define at compile-time your own custom character
properties for use in regexes.

    # using private-use characters
    sub In_Tengwar { "E000\tE07F\n" }

    if (/\p{In_Tengwar}/) { ... }

    # blending existing properties
    sub Is_GraecoRoman_Title {<<'END_OF_SET'}
    +utf8::IsLatin
    +utf8::IsGreek
    &utf8::IsTitle
    END_OF_SET

    if (/\p{Is_GraecoRoman_Title}/) { ... }
=head2 ℞ 27: Unicode normalization

Typically render into NFD on input and NFC on output. Using NFKC or NFKD
functions improves recall on searches, assuming you've already done the
same to the text to be searched. Note that this is about much more than
just precombined compatibility glyphs; it also reorders marks according
to their canonical combining classes and weeds out singletons.

    use Unicode::Normalize;
    my $nfd  = NFD($orig);
    my $nfc  = NFC($orig);
    my $nfkd = NFKD($orig);
    my $nfkc = NFKC($orig);
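A small sketch of the singleton case: ANGSTROM SIGN (U+212B) and the
combining sequence A + COMBINING RING ABOVE both normalize under NFC to the
single precomposed LATIN CAPITAL LETTER A WITH RING ABOVE (U+00C5):

    use strict;
    use warnings;
    use Unicode::Normalize;

    my $angstrom = "\x{212B}";   # ANGSTROM SIGN, a singleton
    my $a_ring   = "A\x{30A}";   # A + COMBINING RING ABOVE

    my $nfc1 = NFC($angstrom);
    my $nfc2 = NFC($a_ring);

    printf "U+%04X and U+%04X\n", ord($nfc1), ord($nfc2);   # U+00C5 twice
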
=head2 ℞ 28: Convert non-ASCII Unicode numerics

Unless you’ve used C</a> or C</aa>, C<\d> matches more than
ASCII digits only, but Perl’s implicit string-to-number
conversion does not currently recognize these. Here’s how to
convert such strings manually.

    use v5.14;  # needed for num() function
    use Unicode::UCD qw(num);
    my $str = "got Ⅻ and ४५६७ and ⅞ and here";
    my @nums = ();
    while ($str =~ /(\d+|\N)/g) {  # not just ASCII!
        push @nums, num($1);
    }
    say "@nums";  #  12 4567 0.875

    use charnames qw(:full);
    my $nv = num("\N{RUMI DIGIT ONE}\N{RUMI DIGIT TWO}");
=head2 ℞ 29: Match Unicode grapheme cluster in regex

Programmer-visible “characters” are codepoints matched by C</./s>,
but user-visible “characters” are graphemes matched by C</\X/>.

    # Find vowel *plus* any combining diacritics, underlining, etc.
    my $nfd = NFD($orig);
    $nfd =~ / (?=[aeiou]) \X /xi
=head2 ℞ 30: Extract by grapheme instead of by codepoint (regex)

    # match and grab five first graphemes
    my($first_five) = $str =~ /^ ( \X{5} ) /x;
=head2 ℞ 31: Extract by grapheme instead of by codepoint (substr)

    # cpan -i Unicode::GCString
    use Unicode::GCString;
    my $gcs = Unicode::GCString->new($str);
    my $first_five = $gcs->substr(0, 5);
=head2 ℞ 32: Reverse string by grapheme

Reversing by codepoint messes up diacritics, mistakenly converting
C<crème brûlée> into C<éel̂urb em̀erc> instead of into C<eélûrb emèrc>;
so reverse by grapheme instead. Both these approaches work
right no matter what normalization the string is in:

    $str = join("", reverse $str =~ /\X/g);

    # OR: cpan -i Unicode::GCString
    use Unicode::GCString;
    $str = reverse Unicode::GCString->new($str);
=head2 ℞ 33: String length in graphemes

The string C<brûlée> has six graphemes but up to eight codepoints.
This counts by grapheme, not by codepoint:

    my $count = 0;
    while ($str =~ /\X/g) { $count++ }

    # OR: cpan -i Unicode::GCString
    use Unicode::GCString;
    my $gcs = Unicode::GCString->new($str);
    my $count = $gcs->length;
=head2 ℞ 34: Unicode column-width for printing

Perl’s C<printf>, C<sprintf>, and C<format> think all
codepoints take up 1 print column, but many take 0 or 2.
Here to show that normalization makes no difference,
we print out both forms:

    use Unicode::GCString;
    use Unicode::Normalize;

    my @words = qw/crème brûlée/;
    @words = map { NFC($_), NFD($_) } @words;

    for my $str (@words) {
        my $gcs  = Unicode::GCString->new($str);
        my $cols = $gcs->columns;
        my $pad  = " " x (10 - $cols);
        print "$str$pad|\n";
    }

generates this to show that it pads correctly no matter
the normalization:

    crème     |
    crème     |
    brûlée    |
    brûlée    |
=head2 ℞ 35: Unicode collation

Text sorted by numeric codepoint follows no reasonable alphabetic order;
use the UCA for sorting text.

    use Unicode::Collate;
    my $col = Unicode::Collate->new();
    my @list = $col->sort(@old_list);

See the I<ucsort> program from the L<Unicode::Tussle> CPAN module
for a convenient command-line interface to this module.
=head2 ℞ 36: Case- I<and> accent-insensitive Unicode sort

Specify a collation strength of level 1 to ignore case and
diacritics, only looking at the basic character.

    use Unicode::Collate;
    my $col = Unicode::Collate->new(level => 1);
    my @list = $col->sort(@old_list);
=head2 ℞ 37: Unicode locale collation

Some locales have special sorting rules.

    # either use v5.12, OR: cpan -i Unicode::Collate::Locale
    use Unicode::Collate::Locale;
    my $col = Unicode::Collate::Locale->new(locale => "de__phonebook");
    my @list = $col->sort(@old_list);

The I<ucsort> program mentioned above accepts a C<--locale> parameter.
=head2 ℞ 38: Making C<cmp> work on text instead of codepoints

Instead of this:

    my @srecs = sort {
        $b->{AGE}  <=>  $a->{AGE}
            ||
        $a->{NAME} cmp $b->{NAME}
    } @recs;

Use this:

    my $coll = Unicode::Collate->new();
    for my $rec (@recs) {
        $rec->{NAME_key} = $coll->getSortKey( $rec->{NAME} );
    }
    my @srecs = sort {
        $b->{AGE}      <=>  $a->{AGE}
            ||
        $a->{NAME_key} cmp  $b->{NAME_key}
    } @recs;
=head2 ℞ 39: Case- I<and> accent-insensitive comparisons

Use a collator object to compare Unicode text by character
instead of by codepoint.

    use Unicode::Collate;
    my $es = Unicode::Collate->new(
        level => 1,
        normalization => undef
    );

    # now both are true:
    $es->eq("García",  "GARCIA" );
    $es->eq("Márquez", "MARQUEZ");
=head2 ℞ 40: Case- I<and> accent-insensitive locale comparisons

Same, but in a specific locale.

    my $de = Unicode::Collate::Locale->new(
        locale => "de__phonebook",
    );

    # now this is true:
    $de->eq("tschüß", "TSCHUESS");  # notice ü => UE, ß => SS
=head2 ℞ 41: Unicode linebreaking

Break up text into lines according to Unicode rules.

    # cpan -i Unicode::LineBreak
    use Unicode::LineBreak;
    use charnames qw(:full);

    my $para = "This is a super\N{HYPHEN}long string. " x 20;
    my $fmt  = Unicode::LineBreak->new;
    print $fmt->break($para), "\n";
=head2 ℞ 42: Unicode text in DBM hashes, the tedious way

Using a regular Perl string as a key or value for a DBM
hash will trigger a wide character exception if any codepoints
won’t fit into a byte. Here’s how to manually manage the translation:

    use DB_File;
    use Encode qw(encode decode);
    tie %dbhash, "DB_File", "pathname";

    # STORE

    # assume $uni_key and $uni_value are abstract Unicode strings
    my $enc_key   = encode("UTF-8", $uni_key, 1);
    my $enc_value = encode("UTF-8", $uni_value, 1);
    $dbhash{$enc_key} = $enc_value;

    # FETCH

    # assume $uni_key holds a normal Perl string (abstract Unicode)
    my $enc_key   = encode("UTF-8", $uni_key, 1);
    my $enc_value = $dbhash{$enc_key};
    my $uni_value = decode("UTF-8", $enc_value, 1);
=head2 ℞ 43: Unicode text in DBM hashes, the easy way

Here’s how to implicitly manage the translation; all encoding
and decoding is done automatically, just as with streams that
have a particular encoding attached to them:

    use DB_File;
    use DBM_Filter;

    my $dbobj = tie %dbhash, "DB_File", "pathname";
    $dbobj->Filter_Value("utf8");  # this is the magic bit

    # STORE

    # assume $uni_key and $uni_value are abstract Unicode strings
    $dbhash{$uni_key} = $uni_value;

    # FETCH

    # $uni_key holds a normal Perl string (abstract Unicode)
    my $uni_value = $dbhash{$uni_key};
=head2 ℞ 44: PROGRAM: Demo of Unicode collation and printing

Here’s a full program showing how to make use of locale-sensitive
sorting, Unicode casing, and managing print widths when some of the
characters take up zero or two columns, not just one column each time.
When run, the following program produces this nicely aligned output:

    Crème Brûlée....... €2.00
    Éclair............. €1.60
    Fideuà............. €4.20
    Hamburger.......... €6.00
    Jamón Serrano...... €4.45
    Linguiça........... €7.00
    Pâté............... €4.15
    Pears.............. €2.00
    Pêches............. €2.25
    Smørbrød........... €5.75
    Spätzle............ €5.50
    Xoriço............. €3.00
    Γύρος.............. €6.50
    막걸리............. €4.00
    おもち............. €2.65
    お好み焼き......... €8.00
    シュークリーム..... €1.85
    寿司............... €9.99
    包子............... €7.50
Here's that program; tested on v5.14.

    #!/usr/bin/env perl
    # umenu - demo sorting and printing of Unicode food
    #
    # (obligatory and increasingly long preamble)
    #
    use utf8;
    use v5.14;                       # for locale sorting
    use strict;
    use warnings;
    use warnings  qw(FATAL utf8);    # fatalize encoding faults
    use open      qw(:std :encoding(UTF-8)); # undeclared streams in UTF-8
    use charnames qw(:full :short);  # unneeded in v5.16

    # std modules
    use Unicode::Normalize;          # std perl distro as of v5.8
    use List::Util qw(max);          # std perl distro as of v5.10
    use Unicode::Collate::Locale;    # std perl distro as of v5.14

    # cpan modules
    use Unicode::GCString;           # from CPAN

    # forward defs
    sub pad($$$);
    sub colwidth(_);
    sub entitle(_);

    my %price = (
        "γύρος"             => 6.50, # gyros
        "pears"             => 2.00, # like um, pears
        "linguiça"          => 7.00, # spicy sausage, Portuguese
        "xoriço"            => 3.00, # chorizo sausage, Catalan
        "hamburger"         => 6.00, # burgermeister meisterburger
        "éclair"            => 1.60, # dessert, French
        "smørbrød"          => 5.75, # sandwiches, Norwegian
        "spätzle"           => 5.50, # Bayerisch noodles, little sparrows
        "包子"              => 7.50, # bao1 zi5, steamed pork buns, Mandarin
        "jamón serrano"     => 4.45, # country ham, Spanish
        "pêches"            => 2.25, # peaches, French
        "シュークリーム"    => 1.85, # cream-filled pastry like eclair
        "막걸리"            => 4.00, # makgeolli, Korean rice wine
        "寿司"              => 9.99, # sushi, Japanese
        "おもち"            => 2.65, # omochi, rice cakes, Japanese
        "crème brûlée"      => 2.00, # crema catalana
        "fideuà"            => 4.20, # more noodles, Valencian
                                     # (Catalan=fideuada)
        "pâté"              => 4.15, # gooseliver paste, French
        "お好み焼き"        => 8.00, # okonomiyaki, Japanese
    );

    my $width = 5 + max map { colwidth } keys %price;

    # So the Asian stuff comes out in an order that someone
    # who reads those scripts won't freak out over; the
    # CJK stuff will be in JIS X 0208 order that way.
    my $coll = Unicode::Collate::Locale->new(locale => "ja");

    for my $item ($coll->sort(keys %price)) {
        print pad(entitle($item), $width, ".");
        printf " €%.2f\n", $price{$item};
    }

    sub pad($$$) {
        my($str, $width, $padchar) = @_;
        return $str . ($padchar x ($width - colwidth($str)));
    }

    sub colwidth(_) {
        my($str) = @_;
        return Unicode::GCString->new($str)->columns;
    }

    sub entitle(_) {
        my($str) = @_;
        $str =~ s{ (?=\pL)(\S) (\S*) }
                 { ucfirst($1) . lc($2) }xge;
        return $str;
    }
=head1 SEE ALSO

See these manpages, some of which are CPAN modules:
L<perlunicode>, L<perluniprops>,
L<perlre>, L<perlrecharclass>,
L<perluniintro>, L<perlunitut>, L<perlunifaq>,
L<PerlIO>, L<DB_File>, L<DBM_Filter>, L<DBM_Filter::utf8>,
L<Encode>, L<Encode::Locale>,
L<Unicode::UCD>,
L<Unicode::Normalize>,
L<Unicode::GCString>, L<Unicode::LineBreak>,
L<Unicode::Collate>, L<Unicode::Collate::Locale>,
L<Unicode::Unihan>,
L<Unicode::CaseFold>,
L<Unicode::Tussle>,
L<Lingua::JA::Romanize::Japanese>,
L<Lingua::ZH::Romanize::Pinyin>,
L<Lingua::KO::Romanize::Hangul>.
The L<Unicode::Tussle> CPAN module includes many programs
to help with working with Unicode, including
these programs to fully or partly replace standard utilities:
I<tcgrep> instead of I<egrep>,
I<uniquote> instead of I<cat -v> or I<hexdump>,
I<uniwc> instead of I<wc>,
I<unilook> instead of I<look>,
I<unifmt> instead of I<fmt>,
and I<ucsort> instead of I<sort>.
For exploring Unicode character names and character properties,
see its I<uniprops>, I<unichars>, and I<uninames> programs.
It also supplies these programs, all of which are general filters
that do Unicode-y things:
I<unititle> and I<unicaps>;
I<uniwide> and I<uninarrow>;
I<unisupers> and I<unisubs>;
I<nfd>, I<nfc>, I<nfkd>, and I<nfkc>;
and I<uc>, I<lc>, and I<tc>.
Finally, see the published Unicode Standard (page numbers are from version
6.0.0), including these specific annexes and technical reports:

=over

=item §3.13 Default Case Algorithms, page 113;
§4.2 Case, pages 120–122;
Case Mappings, pages 166–172, especially Caseless Matching starting on page 170.

=item UAX #44: Unicode Character Database

=item UTS #18: Unicode Regular Expressions

=item UAX #15: Unicode Normalization Forms

=item UTS #10: Unicode Collation Algorithm

=item UAX #29: Unicode Text Segmentation

=item UAX #14: Unicode Line Breaking Algorithm

=item UAX #11: East Asian Width

=back
=head1 AUTHOR

Tom Christiansen E<lt>tchrist@perl.comE<gt> wrote this, with occasional
kibbitzing from Larry Wall and Jeffrey Friedl in the background.
=head1 COPYRIGHT AND LICENCE

Copyright © 2012 Tom Christiansen.

This program is free software; you may redistribute it and/or modify it
under the same terms as Perl itself.
Most of these examples were taken from the current edition of the “Camel
Book”; that is, from the 4ᵗʰ Edition of I<Programming Perl>, Copyright ©
2012 Tom Christiansen I<et al.>, 2012-02-13 by O’Reilly Media. The code
itself is freely redistributable, and you are encouraged to transplant,
fold, spindle, and mutilate any of the examples in this manpage however
you please for inclusion into your own programs without any encumbrance
whatsoever. Acknowledgement via code comment is polite but not required.
=head1 REVISION HISTORY

v1.0.0 – first public release, 2012-02-27