6 perlunicook - cookbookish examples of handling Unicode in Perl
10 This manpage contains short recipes demonstrating how to handle common Unicode
11 operations in Perl, plus one complete program at the end. Any undeclared
variables in individual recipes are assumed to have a previous appropriate
declaration.
17 =head2 ℞ 0: Standard preamble
Unless otherwise noted, all examples below require this standard preamble
20 to work correctly, with the C<#!> adjusted to work on your system:
24 use utf8; # so literals and identifiers can be in UTF-8
25 use v5.12; # or later to get "unicode_strings" feature
26 use strict; # quote strings, declare variables
27 use warnings; # on by default
28 use warnings qw(FATAL utf8); # fatalize encoding glitches
29 use open qw(:std :utf8); # undeclared streams in UTF-8
30 use charnames qw(:full :short); # unneeded in v5.16
This I<does> make even Unix programmers C<binmode> their binary streams,
33 or open them with C<:raw>, but that's the only way to get at them
B<WARNING>: C<use autodie> (pre 2.26) and C<use open> do not get along with each
other.
39 =head2 ℞ 1: Generic Unicode-savvy filter
41 Always decompose on the way in, then recompose on the way out.
43 use Unicode::Normalize;
46 $_ = NFD($_); # decompose + reorder canonically
49 print NFC($_); # recompose (where possible) + reorder canonically
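As a rough, self-contained illustration of that round trip, the sketch below works on a single string rather than on an input stream. The sample string and variable names are made up for this example. It shows how the codepoint count changes between the decomposed and recomposed forms:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

# Hypothetical sample input: "creme" with a precomposed U+00E8 as its third letter.
my $input = "cr\x{E8}me";

my $work   = NFD($input);   # decompose: "e" followed by U+0300 COMBINING GRAVE ACCENT
my $output = NFC($work);    # recompose on the way out

printf "in=%d nfd=%d out=%d\n",
    length($input), length($work), length($output);
```

Any processing done between the two calls sees base letters and combining marks as separate codepoints.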
52 =head2 ℞ 2: Fine-tuning Unicode warnings
54 As of v5.14, Perl distinguishes three subclasses of UTF‑8 warnings.
56 use v5.14; # subwarnings unavailable any earlier
57 no warnings "nonchar"; # the 66 forbidden non-characters
58 no warnings "surrogate"; # UTF-16/CESU-8 nonsense
59 no warnings "non_unicode"; # for codepoints over 0x10_FFFF
61 =head2 ℞ 3: Declare source in utf8 for identifiers and literals
63 Without the all-critical C<use utf8> declaration, putting UTF‑8 in your
64 literals and identifiers won’t work right. If you used the standard
65 preamble just given above, this already happened. If you did, you can
70 my $measure = "Ångström";
71 my @μsoft = qw( cp852 cp1251 cp1252 );
72 my @ὑπέρμεγας = qw( ὑπέρ μεγας );
73 my @鯉 = qw( koi8-f koi8-u koi8-r );
74 my $motto = "👪 💗 🐪"; # FAMILY, GROWING HEART, DROMEDARY CAMEL
76 If you forget C<use utf8>, high bytes will be misunderstood as
77 separate characters, and nothing will work right.
79 =head2 ℞ 4: Characters and their numbers
81 The C<ord> and C<chr> functions work transparently on all codepoints,
not just on ASCII, and in fact not even just on Unicode.
88 # characters from the Basic Multilingual Plane
93 ord("𝑛") # MATHEMATICAL ITALIC SMALL N
96 # beyond Unicode! (up to MAXINT)
100 =head2 ℞ 5: Unicode literals by character number
102 In an interpolated literal, whether a double-quoted string or a
103 regex, you may specify a character by its number using the
104 C<\x{I<HHHHHH>}> escape.
112 # even non-BMP ranges in regex work fine
113 /[\x{1D434}-\x{1D467}]/
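A minimal sketch of the same escape in a string and in a pattern; the variable names are illustrative only:

```perl
use strict;
use warnings;

my $sigma  = "\x{3A3}";      # GREEK CAPITAL LETTER SIGMA
my $math_n = "\x{1D45B}";    # MATHEMATICAL ITALIC SMALL N, outside the BMP

# The same escape works as the endpoint of a character-class range.
my $in_range = $math_n =~ /[\x{1D434}-\x{1D467}]/ ? "yes" : "no";

printf "U+%04X U+%04X in-range=%s\n", ord($sigma), ord($math_n), $in_range;
```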
115 =head2 ℞ 6: Get character name by number
118 my $name = charnames::viacode(0x03A3);
120 =head2 ℞ 7: Get character number by name
123 my $number = charnames::vianame("GREEK CAPITAL LETTER SIGMA");
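The two lookups are inverses of each other. This short sketch assumes nothing beyond the core C<charnames> module, loaded with an empty import list so that only its functions are used:

```perl
use strict;
use warnings;
use charnames ();    # load the lookup functions without importing any name sets

my $name   = charnames::viacode(0x03A3);
my $number = charnames::vianame("GREEK CAPITAL LETTER SIGMA");

printf "%s is U+%04X\n", $name, $number;
```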
125 =head2 ℞ 8: Unicode named characters
127 Use the C<< \N{I<charname>} >> notation to get the character
128 by that name for use in interpolated literals (double-quoted
129 strings and regexes). In v5.16, there is an implicit
131 use charnames qw(:full :short);
133 But prior to v5.16, you must be explicit about which set of charnames you
134 want. The C<:full> names are the official Unicode character name, alias, or
135 sequence, which all share a namespace.
137 use charnames qw(:full :short latin greek);
139 "\N{MATHEMATICAL ITALIC SMALL N}" # :full
140 "\N{GREEK CAPITAL LETTER SIGMA}" # :full
142 Anything else is a Perl-specific convenience abbreviation. Specify one or
143 more scripts by names if you want short names that are script-specific.
145 "\N{Greek:Sigma}" # :short
147 "\N{epsilon}" # greek
149 The v5.16 release also supports a C<:loose> import for loose matching of
150 character names, which works just like loose matching of property names:
151 that is, it disregards case, whitespace, and underscores:
153 "\N{euro sign}" # :loose (from v5.16)
155 =head2 ℞ 9: Unicode named sequences
157 These look just like character names but return multiple codepoints.
158 Notice the C<%vx> vector-print functionality in C<printf>.
160 use charnames qw(:full);
161 my $seq = "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}";
162 printf "U+%v04X\n", $seq;
165 =head2 ℞ 10: Custom named characters
167 Use C<:alias> to give your own lexically scoped nicknames to existing
168 characters, or even to give unnamed private-use characters useful names.
170 use charnames ":full", ":alias" => {
171 ecute => "LATIN SMALL LETTER E WITH ACUTE",
172 "APPLE LOGO" => 0xF8FF, # private use character
178 =head2 ℞ 11: Names of CJK codepoints
180 Sinograms like “東京” come back with character names of
181 C<CJK UNIFIED IDEOGRAPH-6771> and C<CJK UNIFIED IDEOGRAPH-4EAC>,
182 because their “names” vary. The CPAN C<Unicode::Unihan> module
183 has a large database for decoding these (and a whole lot more), provided you
184 know how to understand its output.
186 # cpan -i Unicode::Unihan
189 my $unhan = Unicode::Unihan->new;
190 for my $lang (qw(Mandarin Cantonese Korean JapaneseOn JapaneseKun)) {
191 printf "CJK $str in %-12s is ", $lang;
192 say $unhan->$lang($str);
197 CJK 東京 in Mandarin is DONG1JING1
198 CJK 東京 in Cantonese is dung1ging1
199 CJK 東京 in Korean is TONGKYENG
200 CJK 東京 in JapaneseOn is TOUKYOU KEI KIN
201 CJK 東京 in JapaneseKun is HIGASHI AZUMAMIYAKO
203 If you have a specific romanization scheme in mind,
204 use the specific module:
206 # cpan -i Lingua::JA::Romanize::Japanese
207 use Lingua::JA::Romanize::Japanese;
208 my $k2r = Lingua::JA::Romanize::Japanese->new;
210 say "Japanese for $str is ", $k2r->chars($str);
214 Japanese for 東京 is toukyou
216 =head2 ℞ 12: Explicit encode/decode
218 On rare occasion, such as a database read, you may be
219 given encoded text you need to decode.
221 use Encode qw(encode decode);
223 my $chars = decode("shiftjis", $bytes, 1);
225 my $bytes = encode("MIME-Header-ISO_2022_JP", $chars, 1);
227 For streams all in the same encoding, don't use encode/decode; instead
228 set the file encoding when you open the file or immediately after with
229 C<binmode> as described later below.
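For the cases where an explicit call is the right tool, here is a small round-trip sketch. It uses UTF-8 rather than the encodings shown above, and it passes named C<CHECK> flags instead of a bare C<1>. This is a choice made for this example, because a bare true C<CHECK> lets Encode modify its source string in place:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Die on malformed data, but leave the source string untouched.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

my $chars = "\x{263A}";                        # one character: WHITE SMILING FACE
my $bytes = encode("UTF-8", $chars, $check);   # three bytes: E2 98 BA
my $again = decode("UTF-8", $bytes, $check);   # back to one character

printf "chars=%d bytes=%d again=%d\n",
    length($chars), length($bytes), length($again);
```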
231 =head2 ℞ 13: Decode program arguments as utf8
235 $ export PERL_UNICODE=A
237 use Encode qw(decode);
238 @ARGV = map { decode('UTF-8', $_, 1) } @ARGV;
240 =head2 ℞ 14: Decode program arguments as locale encoding
242 # cpan -i Encode::Locale
    use Encode qw(decode);
    use Encode::Locale;
246 # use "locale" as an arg to encode/decode
247 @ARGV = map { decode(locale => $_, 1) } @ARGV;
249 =head2 ℞ 15: Declare STD{IN,OUT,ERR} to be utf8
251 Use a command-line option, an environment variable, or else
252 call C<binmode> explicitly:
256 $ export PERL_UNICODE=S
258 use open qw(:std :utf8);
260 binmode(STDIN, ":utf8");
261 binmode(STDOUT, ":utf8");
262 binmode(STDERR, ":utf8");
264 =head2 ℞ 16: Declare STD{IN,OUT,ERR} to be in locale encoding
266 # cpan -i Encode::Locale
270 # or as a stream for binmode or open
271 binmode STDIN, ":encoding(console_in)" if -t STDIN;
272 binmode STDOUT, ":encoding(console_out)" if -t STDOUT;
273 binmode STDERR, ":encoding(console_out)" if -t STDERR;
275 =head2 ℞ 17: Make file I/O default to utf8
277 Files opened without an encoding argument will be in UTF-8:
281 $ export PERL_UNICODE=D
285 =head2 ℞ 18: Make all I/O and args default to utf8
289 $ export PERL_UNICODE=SDA
291 use open qw(:std :utf8);
292 use Encode qw(decode);
293 @ARGV = map { decode('UTF-8', $_, 1) } @ARGV;
295 =head2 ℞ 19: Open file with specific encoding
297 Specify stream encoding. This is the normal way
298 to deal with encoded text, not by calling low-level
302 open(my $in_file, "< :encoding(UTF-16)", "wintext");
304 open(my $in_file, "<", "wintext");
305 binmode($in_file, ":encoding(UTF-16)");
307 my $line = <$in_file>;
    open(my $out_file, "> :encoding(cp1252)", "wintext");
312 open(my $out_file, ">", "wintext");
313 binmode($out_file, ":encoding(cp1252)");
315 print $out_file "some text\n";
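A small round trip makes the effect of an encoding layer visible. This sketch assumes an in-memory buffer in place of the C<wintext> file and uses UTF-8 rather than the encodings above. The variable names are illustrative only:

```perl
use strict;
use warnings;

my $buffer = "";    # in-memory "file" standing in for a real path

# Write one character through an encoding layer; the buffer receives bytes.
open(my $out, ">:encoding(UTF-8)", \$buffer) or die "open for write: $!";
print {$out} "\x{263A}\n";
close($out) or die "close: $!";

# Read it back through the same layer; the program sees characters again.
open(my $in, "<:encoding(UTF-8)", \$buffer) or die "open for read: $!";
my $line = <$in>;
close($in);
chomp $line;

printf "bytes=%d chars=%d\n", length($buffer), length($line);
```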
More layers than just the encoding can be specified here. For example,
the incantation C<":raw :encoding(UTF-16LE) :crlf"> includes implicit
CRLF handling.
321 =head2 ℞ 20: Unicode casing
323 Unicode casing is very different from ASCII casing.
325 uc("henry ⅷ") # "HENRY Ⅷ"
326 uc("tschüß") # "TSCHÜSS" notice ß => SS
329 "tschüß" =~ /TSCHÜSS/i # notice ß => SS
330 "Σίσυφος" =~ /ΣΊΣΥΦΟΣ/i # notice Σ,σ,ς sameness
332 =head2 ℞ 21: Unicode case-insensitive comparisons
334 Also available in the CPAN L<Unicode::CaseFold> module,
335 the new C<fc> “foldcase” function from v5.16 grants
336 access to the same Unicode casefolding as the C</i>
337 pattern modifier has always used:
339 use feature "fc"; # fc() function is from v5.16
341 # sort case-insensitively
342 my @sorted = sort { fc($a) cmp fc($b) } @list;
345 fc("tschüß") eq fc("TSCHÜSS")
346 fc("Σίσυφος") eq fc("ΣΊΣΥΦΟΣ")
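To see why folding differs from lowercasing, this sketch compares the same pair of strings both ways. It writes the non-ASCII letters as C<\x{...}> escapes so that it does not depend on the source file's encoding:

```perl
use v5.16;       # enables fc() along with strict and unicode_strings
use warnings;

my $lower = "tsch\x{FC}\x{DF}";    # "tschuess" spelled with u-umlaut and sharp s
my $upper = "TSCH\x{DC}SS";        # the uppercase form, where sharp s became SS

my $lc_equal = lc($lower) eq lc($upper) ? 1 : 0;   # lc leaves the sharp s alone
my $fc_equal = fc($lower) eq fc($upper) ? 1 : 0;   # fc folds sharp s to "ss"

say "lc: $lc_equal, fc: $fc_equal";
```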
348 =head2 ℞ 22: Match Unicode linebreak sequence in regex
350 A Unicode linebreak matches the two-character CRLF
351 grapheme or any of seven vertical whitespace characters.
352 Good for dealing with textfiles coming from different
357 s/\R/\n/g; # normalize all linebreaks to \n
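A short sketch of that substitution on a string with mixed line endings; the input is invented for illustration:

```perl
use strict;
use warnings;

# Hypothetical input mixing CRLF, bare CR, LINE SEPARATOR, and LF endings.
my $text = "one\r\ntwo\rthree\x{2028}four\n";

(my $clean = $text) =~ s/\R/\n/g;    # CRLF is replaced as a single unit

my @lines = split /\n/, $clean;
print scalar(@lines), " lines\n";
```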
359 =head2 ℞ 23: Get character category
361 Find the general category of a numeric codepoint.
363 use Unicode::UCD qw(charinfo);
364 my $cat = charinfo(0x3A3)->{category}; # "Lu"
366 =head2 ℞ 24: Disabling Unicode-awareness in builtin charclasses
368 Disable C<\w>, C<\b>, C<\s>, C<\d>, and the POSIX
369 classes from working correctly on Unicode either in this
370 scope, or in just one regex.
377 my($num) = $str =~ /(\d+)/a;
379 Or use specific un-Unicode properties, like C<\p{ahex}>
and C<\p{POSIX_Digit}>. Properties still work normally
381 no matter what charset modifiers (C</d /u /l /a /aa>)
384 =head2 ℞ 25: Match Unicode properties in regex with \p, \P
386 These all match a single codepoint with the given
387 property. Use C<\P> in place of C<\p> to match
388 one codepoint lacking that property.
390 \pL, \pN, \pS, \pP, \pM, \pZ, \pC
391 \p{Sk}, \p{Ps}, \p{Lt}
392 \p{alpha}, \p{upper}, \p{lower}
394 \p{script_extensions=Latin}, \p{scx=Greek}
395 \p{East_Asian_Width=Wide}, \p{EA=W}
396 \p{Line_Break=Hyphen}, \p{LB=HY}
397 \p{Numeric_Value=4}, \p{NV=4}
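As a quick sketch of a few of these properties in use, the following counts matches in a small sample string. The string and variable names are made up for this example:

```perl
use strict;
use warnings;

# Sample: GREEK CAPITAL LETTER SIGMA, "a", "1", space, ARABIC-INDIC DIGIT ZERO.
my $str = "\x{3A3}a1 \x{660}";

my $letters     = () = $str =~ /\pL/g;                # sigma and "a"
my $numbers     = () = $str =~ /\pN/g;                # "1" and the Arabic-Indic digit
my $greek       = () = $str =~ /\p{Script=Greek}/g;   # sigma only
my $non_letters = () = $str =~ /\PL/g;                # everything that is not a letter

print "letters=$letters numbers=$numbers greek=$greek non_letters=$non_letters\n";
```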
399 =head2 ℞ 26: Custom character properties
401 Define at compile-time your own custom character
402 properties for use in regexes.
404 # using private-use characters
405 sub In_Tengwar { "E000\tE07F\n" }
407 if (/\p{In_Tengwar}/) { ... }
409 # blending existing properties
410 sub Is_GraecoRoman_Title {<<'END_OF_SET'}
    if (/\p{Is_GraecoRoman_Title}/) { ... }
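A complete, runnable sketch of both kinds of definition follows. The body of the blended property is an assumption chosen for this illustration (Latin or Greek, intersected with titlecase), as are the probe characters:

```perl
use strict;
use warnings;

# Range-based property: one "start<TAB>end" pair of hex codepoints per line.
sub In_Tengwar { "E000\tE07F\n" }

# Blended property (assumed definition): (Latin or Greek) and also titlecase.
sub Is_GraecoRoman_Title { <<'END_OF_SET' }
+utf8::IsLatin
+utf8::IsGreek
&utf8::IsTitle
END_OF_SET

my @probe = (
    "\x{E001}",    # private-use codepoint inside the range
    "\x{1C5}",     # LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON (titlecase)
    "A",           # plain uppercase letter, not titlecase
);

for my $ch (@probe) {
    printf "U+%04X tengwar=%d title=%d\n", ord($ch),
        ($ch =~ /\p{In_Tengwar}/           ? 1 : 0),
        ($ch =~ /\p{Is_GraecoRoman_Title}/ ? 1 : 0);
}
```

The property subs are defined before the patterns that use them, which keeps the sketch independent of how a given Perl version defers the lookup.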
418 =head2 ℞ 27: Unicode normalization
Typically render into NFD on input and NFC on output. Using the NFKC or NFKD
functions improves recall on searches, assuming you've already done the same
to the text being searched. Note that this is about much more than just
pre-combined compatibility glyphs; it also reorders marks according to their
canonical combining classes and weeds out singletons.
426 use Unicode::Normalize;
427 my $nfd = NFD($orig);
428 my $nfc = NFC($orig);
429 my $nfkd = NFKD($orig);
430 my $nfkc = NFKC($orig);
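To make the difference between the canonical and compatibility forms concrete, this sketch normalizes two sample characters chosen for illustration: a compatibility ligature and a canonical singleton.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC NFKD NFKC);

my $ligature = "\x{FB01}";    # LATIN SMALL LIGATURE FI (compatibility character)
my $angstrom = "\x{212B}";    # ANGSTROM SIGN (a canonical singleton)

# Canonical forms leave the ligature alone; compatibility forms expand it.
printf "NFC  ligature: U+%vX\n", NFC($ligature);
printf "NFKC ligature: U+%vX\n", NFKC($ligature);

# Singletons are replaced even by the canonical forms.
printf "NFC  angstrom: U+%vX\n", NFC($angstrom);
printf "NFD  angstrom: U+%vX\n", NFD($angstrom);
```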
432 =head2 ℞ 28: Convert non-ASCII Unicode numerics
434 Unless you’ve used C</a> or C</aa>, C<\d> matches more than
435 ASCII digits only, but Perl’s implicit string-to-number
conversion does not currently recognize these. Here’s how to
437 convert such strings manually.
439 use v5.14; # needed for num() function
440 use Unicode::UCD qw(num);
441 my $str = "got Ⅻ and ४५६७ and ⅞ and here";
443 while ($str =~ /(\d+|\N)/g) { # not just ASCII!
446 say "@nums"; # 12 4567 0.875
448 use charnames qw(:full);
449 my $nv = num("\N{RUMI DIGIT ONE}\N{RUMI DIGIT TWO}");
451 =head2 ℞ 29: Match Unicode grapheme cluster in regex
453 Programmer-visible “characters” are codepoints matched by C</./s>,
454 but user-visible “characters” are graphemes matched by C</\X/>.
456 # Find vowel *plus* any combining diacritics,underlining,etc.
457 my $nfd = NFD($orig);
458 $nfd =~ / (?=[aeiou]) \X /xi
460 =head2 ℞ 30: Extract by grapheme instead of by codepoint (regex)
462 # match and grab five first graphemes
463 my($first_five) = $str =~ /^ ( \X{5} ) /x;
465 =head2 ℞ 31: Extract by grapheme instead of by codepoint (substr)
467 # cpan -i Unicode::GCString
468 use Unicode::GCString;
469 my $gcs = Unicode::GCString->new($str);
470 my $first_five = $gcs->substr(0, 5);
472 =head2 ℞ 32: Reverse string by grapheme
474 Reversing by codepoint messes up diacritics, mistakenly converting
475 C<crème brûlée> into C<éel̂urb em̀erc> instead of into C<eélûrb emèrc>;
476 so reverse by grapheme instead. Both these approaches work
477 right no matter what normalization the string is in:
479 $str = join("", reverse $str =~ /\X/g);
481 # OR: cpan -i Unicode::GCString
482 use Unicode::GCString;
483 $str = reverse Unicode::GCString->new($str);
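The regex approach can be checked with core Perl alone. This sketch uses a decomposed sample string, invented for illustration, and prints codepoints in hex so the position of the combining mark is visible:

```perl
use strict;
use warnings;

my $str = "cre\x{300}me";    # "creme" with a combining grave accent on the first "e"

my $by_codepoint = reverse $str;                      # accent ends up after the "m"
my $by_grapheme  = join("", reverse $str =~ /\X/g);   # accent stays with its "e"

printf "codepoint: %vX\n", $by_codepoint;
printf "grapheme:  %vX\n", $by_grapheme;
```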
485 =head2 ℞ 33: String length in graphemes
487 The string C<brûlée> has six graphemes but up to eight codepoints.
488 This counts by grapheme, not by codepoint:
492 while ($str =~ /\X/g) { $count++ }
494 # OR: cpan -i Unicode::GCString
495 use Unicode::GCString;
496 my $gcs = Unicode::GCString->new($str);
497 my $count = $gcs->length;
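As a loop-free variant of the regex count, this sketch uses the list-assignment idiom on a decomposed sample string. The string and names are illustrative:

```perl
use strict;
use warnings;

my $str = "bru\x{302}le\x{301}e";    # "brulee" with two combining accents

my $codepoints = length $str;
my $graphemes  = () = $str =~ /\X/g;    # count matches without a loop

print "codepoints=$codepoints graphemes=$graphemes\n";
```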
499 =head2 ℞ 34: Unicode column-width for printing
501 Perl’s C<printf>, C<sprintf>, and C<format> think all
502 codepoints take up 1 print column, but many take 0 or 2.
503 Here to show that normalization makes no difference,
504 we print out both forms:
506 use Unicode::GCString;
507 use Unicode::Normalize;
509 my @words = qw/crème brûlée/;
510 @words = map { NFC($_), NFD($_) } @words;
512 for my $str (@words) {
513 my $gcs = Unicode::GCString->new($str);
514 my $cols = $gcs->columns;
515 my $pad = " " x (10 - $cols);
519 generates this to show that it pads correctly no matter
527 =head2 ℞ 35: Unicode collation
529 Text sorted by numeric codepoint follows no reasonable alphabetic order;
530 use the UCA for sorting text.
532 use Unicode::Collate;
533 my $col = Unicode::Collate->new();
534 my @list = $col->sort(@old_list);
536 See the I<ucsort> program from the L<Unicode::Tussle> CPAN module
537 for a convenient command-line interface to this module.
539 =head2 ℞ 36: Case- I<and> accent-insensitive Unicode sort
541 Specify a collation strength of level 1 to ignore case and
542 diacritics, only looking at the basic character.
544 use Unicode::Collate;
545 my $col = Unicode::Collate->new(level => 1);
546 my @list = $col->sort(@old_list);
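The following sketch contrasts codepoint order with default UCA order, and then shows a level 1 comparison. The word list and names are invented for this example:

```perl
use strict;
use warnings;
use Unicode::Collate;

my @words = ("zebra", "Cherry", "apple");

my @by_codepoint = sort @words;                      # uppercase letters sort low
my @by_uca = Unicode::Collate->new->sort(@words);    # alphabetic regardless of case

# Level 1 compares base letters only, ignoring case and accents.
my $loose = Unicode::Collate->new(level => 1);
my $same  = $loose->eq("re\x{301}sume\x{301}", "RESUME") ? "yes" : "no";

print "codepoint order: @by_codepoint\n";
print "UCA order:       @by_uca\n";
print "equal at level 1: $same\n";
```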
548 =head2 ℞ 37: Unicode locale collation
550 Some locales have special sorting rules.
552 # either use v5.12, OR: cpan -i Unicode::Collate::Locale
553 use Unicode::Collate::Locale;
554 my $col = Unicode::Collate::Locale->new(locale => "de__phonebook");
555 my @list = $col->sort(@old_list);
557 The I<ucsort> program mentioned above accepts a C<--locale> parameter.
559 =head2 ℞ 38: Making C<cmp> work on text instead of codepoints
564 $b->{AGE} <=> $a->{AGE}
566 $a->{NAME} cmp $b->{NAME}
571 my $coll = Unicode::Collate->new();
572 for my $rec (@recs) {
573 $rec->{NAME_key} = $coll->getSortKey( $rec->{NAME} );
576 $b->{AGE} <=> $a->{AGE}
578 $a->{NAME_key} cmp $b->{NAME_key}
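Put together as a runnable sketch, the cached-key approach looks roughly like this. The records and their values are invented for illustration; the field names follow the fragment above:

```perl
use strict;
use warnings;
use Unicode::Collate;

# Hypothetical records.
my @recs = (
    { NAME => "Zoe",  AGE => 30 },
    { NAME => "adam", AGE => 30 },
    { NAME => "bob",  AGE => 41 },
);

# Compute each binary sort key once, instead of on every comparison.
my $coll = Unicode::Collate->new();
$_->{NAME_key} = $coll->getSortKey($_->{NAME}) for @recs;

my @sorted = sort {
    $b->{AGE} <=> $a->{AGE}               # oldest first
        ||
    $a->{NAME_key} cmp $b->{NAME_key}     # then alphabetically by UCA key
} @recs;

print join(" ", map { $_->{NAME} } @sorted), "\n";
```

The keys are opaque binary strings, so plain C<cmp> on them gives collation order.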
581 =head2 ℞ 39: Case- I<and> accent-insensitive comparisons
583 Use a collator object to compare Unicode text by character
584 instead of by codepoint.
586 use Unicode::Collate;
587 my $es = Unicode::Collate->new(
589 normalization => undef
593 $es->eq("García", "GARCIA" );
594 $es->eq("Márquez", "MARQUEZ");
596 =head2 ℞ 40: Case- I<and> accent-insensitive locale comparisons
598 Same, but in a specific locale.
600 my $de = Unicode::Collate::Locale->new(
601 locale => "de__phonebook",
605 $de->eq("tschüß", "TSCHUESS"); # notice ü => UE, ß => SS
607 =head2 ℞ 41: Unicode linebreaking
609 Break up text into lines according to Unicode rules.
611 # cpan -i Unicode::LineBreak
612 use Unicode::LineBreak;
613 use charnames qw(:full);
615 my $para = "This is a super\N{HYPHEN}long string. " x 20;
616 my $fmt = Unicode::LineBreak->new;
617 print $fmt->break($para), "\n";
619 =head2 ℞ 42: Unicode text in DBM hashes, the tedious way
621 Using a regular Perl string as a key or value for a DBM
622 hash will trigger a wide character exception if any codepoints
623 won’t fit into a byte. Here’s how to manually manage the translation:
626 use Encode qw(encode decode);
627 tie %dbhash, "DB_File", "pathname";
631 # assume $uni_key and $uni_value are abstract Unicode strings
632 my $enc_key = encode("UTF-8", $uni_key, 1);
633 my $enc_value = encode("UTF-8", $uni_value, 1);
634 $dbhash{$enc_key} = $enc_value;
638 # assume $uni_key holds a normal Perl string (abstract Unicode)
639 my $enc_key = encode("UTF-8", $uni_key, 1);
640 my $enc_value = $dbhash{$enc_key};
641 my $uni_value = decode("UTF-8", $enc_value, 1);
643 =head2 ℞ 43: Unicode text in DBM hashes, the easy way
645 Here’s how to implicitly manage the translation; all encoding
646 and decoding is done automatically, just as with streams that
647 have a particular encoding attached to them:
652 my $dbobj = tie %dbhash, "DB_File", "pathname";
    $dbobj->Filter_Push("utf8"); # this is the magic bit
657 # assume $uni_key and $uni_value are abstract Unicode strings
658 $dbhash{$uni_key} = $uni_value;
662 # $uni_key holds a normal Perl string (abstract Unicode)
663 my $uni_value = $dbhash{$uni_key};
665 =head2 ℞ 44: PROGRAM: Demo of Unicode collation and printing
667 Here’s a full program showing how to make use of locale-sensitive
668 sorting, Unicode casing, and managing print widths when some of the
669 characters take up zero or two columns, not just one column each time.
670 When run, the following program produces this nicely aligned output:
672 Crème Brûlée....... €2.00
673 Éclair............. €1.60
674 Fideuà............. €4.20
675 Hamburger.......... €6.00
676 Jamón Serrano...... €4.45
677 Linguiça........... €7.00
678 Pâté............... €4.15
679 Pears.............. €2.00
680 Pêches............. €2.25
681 Smørbrød........... €5.75
682 Spätzle............ €5.50
683 Xoriço............. €3.00
684 Γύρος.............. €6.50
685 막걸리............. €4.00
686 おもち............. €2.65
689 寿司............... €9.99
690 包子............... €7.50
692 Here's that program; tested on v5.14.
695 # umenu - demo sorting and printing of Unicode food
697 # (obligatory and increasingly long preamble)
700 use v5.14; # for locale sorting
703 use warnings qw(FATAL utf8); # fatalize encoding faults
704 use open qw(:std :utf8); # undeclared streams in UTF-8
705 use charnames qw(:full :short); # unneeded in v5.16
708 use Unicode::Normalize; # std perl distro as of v5.8
709 use List::Util qw(max); # std perl distro as of v5.10
710 use Unicode::Collate::Locale; # std perl distro as of v5.14
713 use Unicode::GCString; # from CPAN
721 "γύρος" => 6.50, # gyros
722 "pears" => 2.00, # like um, pears
723 "linguiça" => 7.00, # spicy sausage, Portuguese
724 "xoriço" => 3.00, # chorizo sausage, Catalan
725 "hamburger" => 6.00, # burgermeister meisterburger
726 "éclair" => 1.60, # dessert, French
727 "smørbrød" => 5.75, # sandwiches, Norwegian
728 "spätzle" => 5.50, # Bayerisch noodles, little sparrows
729 "包子" => 7.50, # bao1 zi5, steamed pork buns, Mandarin
730 "jamón serrano" => 4.45, # country ham, Spanish
731 "pêches" => 2.25, # peaches, French
732 "シュークリーム" => 1.85, # cream-filled pastry like eclair
733 "막걸리" => 4.00, # makgeolli, Korean rice wine
734 "寿司" => 9.99, # sushi, Japanese
735 "おもち" => 2.65, # omochi, rice cakes, Japanese
736 "crème brûlée" => 2.00, # crema catalana
737 "fideuà" => 4.20, # more noodles, Valencian
739 "pâté" => 4.15, # gooseliver paste, French
740 "お好み焼き" => 8.00, # okonomiyaki, Japanese
    my $width = 5 + max map { colwidth($_) } keys %price;
745 # So the Asian stuff comes out in an order that someone
746 # who reads those scripts won't freak out over; the
747 # CJK stuff will be in JIS X 0208 order that way.
748 my $coll = Unicode::Collate::Locale->new(locale => "ja");
750 for my $item ($coll->sort(keys %price)) {
751 print pad(entitle($item), $width, ".");
752 printf " €%.2f\n", $price{$item};
756 my($str, $width, $padchar) = @_;
757 return $str . ($padchar x ($width - colwidth($str)));
762 return Unicode::GCString->new($str)->columns;
767 $str =~ s{ (?=\pL)(\S) (\S*) }
768 { ucfirst($1) . lc($2) }xge;
774 See these manpages, some of which are CPAN modules:
775 L<perlunicode>, L<perluniprops>,
776 L<perlre>, L<perlrecharclass>,
777 L<perluniintro>, L<perlunitut>, L<perlunifaq>,
778 L<PerlIO>, L<DB_File>, L<DBM_Filter>, L<DBM_Filter::utf8>,
779 L<Encode>, L<Encode::Locale>,
781 L<Unicode::Normalize>,
782 L<Unicode::GCString>, L<Unicode::LineBreak>,
783 L<Unicode::Collate>, L<Unicode::Collate::Locale>,
785 L<Unicode::CaseFold>,
787 L<Lingua::JA::Romanize::Japanese>,
788 L<Lingua::ZH::Romanize::Pinyin>,
789 L<Lingua::KO::Romanize::Hangul>.
791 The L<Unicode::Tussle> CPAN module includes many programs
792 to help with working with Unicode, including
793 these programs to fully or partly replace standard utilities:
794 I<tcgrep> instead of I<egrep>,
795 I<uniquote> instead of I<cat -v> or I<hexdump>,
796 I<uniwc> instead of I<wc>,
797 I<unilook> instead of I<look>,
798 I<unifmt> instead of I<fmt>,
800 I<ucsort> instead of I<sort>.
801 For exploring Unicode character names and character properties,
802 see its I<uniprops>, I<unichars>, and I<uninames> programs.
803 It also supplies these programs, all of which are general filters that do Unicode-y things:
804 I<unititle> and I<unicaps>;
805 I<uniwide> and I<uninarrow>;
806 I<unisupers> and I<unisubs>;
807 I<nfd>, I<nfc>, I<nfkd>, and I<nfkc>;
808 and I<uc>, I<lc>, and I<tc>.
810 Finally, see the published Unicode Standard (page numbers are from version
811 6.0.0), including these specific annexes and technical reports:
815 =item §3.13 Default Case Algorithms, page 113;
816 §4.2 Case, pages 120–122;
Case Mappings, pages 166–172, especially Caseless Matching starting on page 170.
819 =item UAX #44: Unicode Character Database
821 =item UTS #18: Unicode Regular Expressions
823 =item UAX #15: Unicode Normalization Forms
825 =item UTS #10: Unicode Collation Algorithm
827 =item UAX #29: Unicode Text Segmentation
829 =item UAX #14: Unicode Line Breaking Algorithm
831 =item UAX #11: East Asian Width
837 Tom Christiansen E<lt>tchrist@perl.comE<gt> wrote this, with occasional
kibitzing from Larry Wall and Jeffrey Friedl in the background.
840 =head1 COPYRIGHT AND LICENCE
842 Copyright © 2012 Tom Christiansen.
844 This program is free software; you may redistribute it and/or modify it
845 under the same terms as Perl itself.
Most of these examples are taken from the current edition of the “Camel Book”;
that is, from the 4ᵗʰ Edition of I<Programming Perl>, Copyright © 2012 Tom
Christiansen I<et al.>, 2012-02-13 by O’Reilly Media. The code itself is
850 freely redistributable, and you are encouraged to transplant, fold,
851 spindle, and mutilate any of the examples in this manpage however you please
852 for inclusion into your own programs without any encumbrance whatsoever.
853 Acknowledgement via code comment is polite but not required.
855 =head1 REVISION HISTORY
857 v1.0.0 – first public release, 2012-02-27