2561daa4 RS	1
	2	=encoding utf8
	3
	4	=head1 NAME
	5
	6	perlunicook - cookbookish examples of handling Unicode in Perl
	7
	8	=head1 DESCRIPTION
	9
	10	This manpage contains short recipes demonstrating how to handle common Unicode
	11	operations in Perl, plus one complete program at the end. Any undeclared
	12	variables in individual recipes are assumed to have a previous appropriate
	13	value in them.
	14
	15	=head1 EXAMPLES
	16
	17	=head2 ℞ 0: Standard preamble
	18
Unless otherwise noted, all examples below require this standard preamble
to work correctly, with the C<#!> adjusted to work on your system:
	21
	22	#!/usr/bin/env perl
	23
d84bd0bd PE	24	use v5.36; # or later to get "unicode_strings" feature,
d84bd0bd PE	25	# plus strict, warnings
2561daa4	26	use utf8; # so literals and identifiers can be in UTF-8
2561daa4	27	use warnings qw(FATAL utf8); # fatalize encoding glitches
a8980281	28	use open qw(:std :encoding(UTF-8)); # undeclared streams in UTF-8
2561daa4 RS	29	use charnames qw(:full :short); # unneeded in v5.16
	30
This I<does> require even Unix programmers to C<binmode> their binary streams,
or open them with C<:raw>, but that's the only way to get at them
portably anyway.
	34
2a403855 MH	35	B<WARNING>: C<use autodie> (pre 2.26) and C<use open> do not get along with each
2a403855 MH	36	other.
2561daa4 RS	37
	38	=head2 ℞ 1: Generic Unicode-savvy filter
	39
	40	Always decompose on the way in, then recompose on the way out.
	41
	42	use Unicode::Normalize;
	43
	44	while (<>) {
	45	$_ = NFD($_); # decompose + reorder canonically
	46	...
	47	} continue {
	48	print NFC($_); # recompose (where possible) + reorder canonically
	49	}
	50
	51	=head2 ℞ 2: Fine-tuning Unicode warnings
	52
ddeccf1f	53	As of v5.14, Perl distinguishes three subclasses of UTF‑8 warnings.
2561daa4 RS	54
	55	use v5.14; # subwarnings unavailable any earlier
	56	no warnings "nonchar"; # the 66 forbidden non-characters
	57	no warnings "surrogate"; # UTF-16/CESU-8 nonsense
	58	no warnings "non_unicode"; # for codepoints over 0x10_FFFF
	59
	60	=head2 ℞ 3: Declare source in utf8 for identifiers and literals
	61
	62	Without the all-critical C<use utf8> declaration, putting UTF‑8 in your
	63	literals and identifiers won’t work right. If you used the standard
	64	preamble just given above, this already happened. If you did, you can
	65	do things like this:
	66
	67	use utf8;
	68
	69	my $measure = "Ångström";
	70	my @μsoft = qw( cp852 cp1251 cp1252 );
	71	my @ὑπέρμεγας = qw( ὑπέρ μεγας );
	72	my @鯉 = qw( koi8-f koi8-u koi8-r );
	73	my $motto = "👪 💗 🐪"; # FAMILY, GROWING HEART, DROMEDARY CAMEL
	74
	75	If you forget C<use utf8>, high bytes will be misunderstood as
	76	separate characters, and nothing will work right.
	77
	78	=head2 ℞ 4: Characters and their numbers
	79
The C<ord> and C<chr> functions work transparently on all codepoints,
not just on ASCII, and in fact not even just on Unicode.
	82
	83	# ASCII characters
	84	ord("A")
	85	chr(65)
	86
	87	# characters from the Basic Multilingual Plane
	88	ord("Σ")
	89	chr(0x3A3)
	90
	91	# beyond the BMP
	92	ord("𝑛") # MATHEMATICAL ITALIC SMALL N
	93	chr(0x1D45B)
	94
	95	# beyond Unicode! (up to MAXINT)
	96	ord("\x{20_0000}")
	97	chr(0x20_0000)
	98
	99	=head2 ℞ 5: Unicode literals by character number
	100
	101	In an interpolated literal, whether a double-quoted string or a
	102	regex, you may specify a character by its number using the
	103	C<\x{I<HHHHHH>}> escape.
	104
	105	String: "\x{3a3}"
	106	Regex: /\x{3a3}/
	107
	108	String: "\x{1d45b}"
	109	Regex: /\x{1d45b}/
	110
	111	# even non-BMP ranges in regex work fine
	112	/[\x{1D434}-\x{1D467}]/
	113
	114	=head2 ℞ 6: Get character name by number
	115
	116	use charnames ();
	117	my $name = charnames::viacode(0x03A3);
118
119	=head2 ℞ 7: Get character number by name
120
121	use charnames ();
122	my $number = charnames::vianame("GREEK CAPITAL LETTER SIGMA");
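Recipes 6 and 7 are inverses of each other. The following is a minimal sketch of the round trip, assuming only the core C<charnames> module; the C<printf> line is added purely for illustration.

```perl
use strict;
use warnings;
use charnames ();

# Code point -> official Unicode name (recipe 6).
my $name = charnames::viacode(0x03A3);

# Official Unicode name -> code point (recipe 7).
my $number = charnames::vianame("GREEK CAPITAL LETTER SIGMA");

printf "U+%04X is %s\n", $number, $name;
```

Both functions work at run time, so they can be handed names or numbers that are only known once the program is running.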
123
124	=head2 ℞ 8: Unicode named characters
125
126	Use the C<< \N{I<charname>} >> notation to get the character
127	by that name for use in interpolated literals (double-quoted
128	strings and regexes). In v5.16, there is an implicit
129
130	use charnames qw(:full :short);
131
132	But prior to v5.16, you must be explicit about which set of charnames you
133	want. The C<:full> names are the official Unicode character name, alias, or
134	sequence, which all share a namespace.
135
136	use charnames qw(:full :short latin greek);
137
138	"\N{MATHEMATICAL ITALIC SMALL N}" # :full
139	"\N{GREEK CAPITAL LETTER SIGMA}" # :full
140
141	Anything else is a Perl-specific convenience abbreviation. Specify one or
142	more scripts by names if you want short names that are script-specific.
143
144	"\N{Greek:Sigma}" # :short
145	"\N{ae}" # latin
146	"\N{epsilon}" # greek
147
148	The v5.16 release also supports a C<:loose> import for loose matching of
149	character names, which works just like loose matching of property names:
150	that is, it disregards case, whitespace, and underscores:
151
152	"\N{euro sign}" # :loose (from v5.16)
153
673c254b KW	154	Starting in v5.32, you can also use
	155
	156	qr/\p{name=euro sign}/
	157
	158	to get official Unicode named characters in regular expressions. Loose
	159	matching is always done for these.
	160
2561daa4 RS	161	=head2 ℞ 9: Unicode named sequences
	162
	163	These look just like character names but return multiple codepoints.
	164	Notice the C<%vx> vector-print functionality in C<printf>.
	165
	166	use charnames qw(:full);
	167	my $seq = "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}";
	168	printf "U+%v04X\n", $seq;
	169	U+0100.0300
	170
	171	=head2 ℞ 10: Custom named characters
	172
	173	Use C<:alias> to give your own lexically scoped nicknames to existing
	174	characters, or even to give unnamed private-use characters useful names.
	175
	176	use charnames ":full", ":alias" => {
	177	ecute => "LATIN SMALL LETTER E WITH ACUTE",
	178	"APPLE LOGO" => 0xF8FF, # private use character
	179	};
	180
	181	"\N{ecute}"
	182	"\N{APPLE LOGO}"
	183
	184	=head2 ℞ 11: Names of CJK codepoints
	185
	186	Sinograms like “東京” come back with character names of
	187	C<CJK UNIFIED IDEOGRAPH-6771> and C<CJK UNIFIED IDEOGRAPH-4EAC>,
	188	because their “names” vary. The CPAN C<Unicode::Unihan> module
	189	has a large database for decoding these (and a whole lot more), provided you
	190	know how to understand its output.
	191
	192	# cpan -i Unicode::Unihan
	193	use Unicode::Unihan;
	194	my $str = "東京";
63602a3f	195	my $unhan = Unicode::Unihan->new;
2561daa4 RS	196	for my $lang (qw(Mandarin Cantonese Korean JapaneseOn JapaneseKun)) {
	197	printf "CJK $str in %-12s is ", $lang;
	198	say $unhan->$lang($str);
	199	}
	200
	201	prints:
	202
	203	CJK 東京 in Mandarin is DONG1JING1
	204	CJK 東京 in Cantonese is dung1ging1
	205	CJK 東京 in Korean is TONGKYENG
	206	CJK 東京 in JapaneseOn is TOUKYOU KEI KIN
	207	CJK 東京 in JapaneseKun is HIGASHI AZUMAMIYAKO
	208
	209	If you have a specific romanization scheme in mind,
	210	use the specific module:
	211
	212	# cpan -i Lingua::JA::Romanize::Japanese
	213	use Lingua::JA::Romanize::Japanese;
63602a3f	214	my $k2r = Lingua::JA::Romanize::Japanese->new;
2561daa4 RS	215	my $str = "東京";
	216	say "Japanese for $str is ", $k2r->chars($str);
	217
	218	prints
	219
	220	Japanese for 東京 is toukyou
	221
	222	=head2 ℞ 12: Explicit encode/decode
	223
	224	On rare occasion, such as a database read, you may be
	225	given encoded text you need to decode.
	226
	227	use Encode qw(encode decode);
	228
	229	my $chars = decode("shiftjis", $bytes, 1);
	230	# OR
	231	my $bytes = encode("MIME-Header-ISO_2022_JP", $chars, 1);
	232
	233	For streams all in the same encoding, don't use encode/decode; instead
	234	set the file encoding when you open the file or immediately after with
	235	C<binmode> as described later below.
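As a rough illustration of the explicit approach, the sketch below round-trips a short UTF-8 byte string. The sample data and variable names are assumptions made for the example, and C<LEAVE_SRC> is added to the check flags so that C<decode> and C<encode> leave their input untouched.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Assumption: $bytes stands in for data read from some external source.
my $bytes = "caf\xC3\xA9";    # "cafe" with an acute accent, as 5 UTF-8 octets

# FB_CROAK dies on malformed data; LEAVE_SRC keeps the input unmodified.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

my $chars = decode("UTF-8", $bytes, $check);    # 4 characters
my $again = encode("UTF-8", $chars, $check);    # the same 5 octets again

printf "%d bytes -> %d characters -> %d bytes\n",
    length($bytes), length($chars), length($again);

# Malformed input raises an exception instead of being silently repaired.
my $bad = "\xFF\xFE";
my $ok  = eval { decode("UTF-8", $bad, $check); 1 };
print "malformed input rejected\n" unless $ok;
```

Wrapping the call in C<eval> is one way to turn the exception into an ordinary error path; letting it propagate is equally reasonable in a short script.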
	236
	237	=head2 ℞ 13: Decode program arguments as utf8
	238
	239	$ perl -CA ...
	240	or
	241	$ export PERL_UNICODE=A
	242	or
8e179dd8 P	243	use Encode qw(decode);
8e179dd8 P	244	@ARGV = map { decode('UTF-8', $_, 1) } @ARGV;
2561daa4 RS	245
	246	=head2 ℞ 14: Decode program arguments as locale encoding
	247
    # cpan -i Encode::Locale
    use Encode qw(decode);
    use Encode::Locale;

    # use "locale" as an arg to encode/decode
    @ARGV = map { decode(locale => $_, 1) } @ARGV;
	254
	255	=head2 ℞ 15: Declare STD{IN,OUT,ERR} to be utf8
	256
	257	Use a command-line option, an environment variable, or else
	258	call C<binmode> explicitly:
	259
	260	$ perl -CS ...
	261	or
	262	$ export PERL_UNICODE=S
	263	or
a8980281	264	use open qw(:std :encoding(UTF-8));
2561daa4	265	or
a8980281	266	binmode(STDIN, ":encoding(UTF-8)");
2561daa4 RS	267	binmode(STDOUT, ":utf8");
	268	binmode(STDERR, ":utf8");
	269
	270	=head2 ℞ 16: Declare STD{IN,OUT,ERR} to be in locale encoding
	271
	272	# cpan -i Encode::Locale
	273	use Encode;
	274	use Encode::Locale;
	275
	276	# or as a stream for binmode or open
	277	binmode STDIN, ":encoding(console_in)" if -t STDIN;
	278	binmode STDOUT, ":encoding(console_out)" if -t STDOUT;
	279	binmode STDERR, ":encoding(console_out)" if -t STDERR;
	280
	281	=head2 ℞ 17: Make file I/O default to utf8
	282
ddeccf1f	283	Files opened without an encoding argument will be in UTF-8:
2561daa4 RS	284
	285	$ perl -CD ...
	286	or
	287	$ export PERL_UNICODE=D
	288	or
a8980281	289	use open qw(:encoding(UTF-8));
2561daa4 RS	290
	291	=head2 ℞ 18: Make all I/O and args default to utf8
	292
	293	$ perl -CSDA ...
	294	or
	295	$ export PERL_UNICODE=SDA
	296	or
a8980281	297	use open qw(:std :encoding(UTF-8));
8e179dd8 P	298	use Encode qw(decode);
8e179dd8 P	299	@ARGV = map { decode('UTF-8', $_, 1) } @ARGV;
2561daa4 RS	300
	301	=head2 ℞ 19: Open file with specific encoding
	302
	303	Specify stream encoding. This is the normal way
	304	to deal with encoded text, not by calling low-level
	305	functions.
	306
    # input file
    open(my $in_file, "< :encoding(UTF-16)", "wintext");
    # OR
    open(my $in_file, "<", "wintext");
    binmode($in_file, ":encoding(UTF-16)");
    # THEN
    my $line = <$in_file>;

    # output file
    open(my $out_file, "> :encoding(cp1252)", "wintext");
    # OR
    open(my $out_file, ">", "wintext");
    binmode($out_file, ":encoding(cp1252)");
    # THEN
    print $out_file "some text\n";
	322
	323	More layers than just the encoding can be specified here. For example,
	324	the incantation C<":raw :encoding(UTF-16LE) :crlf"> includes implicit
	325	CRLF handling.
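To see the encoding layers of this recipe at work, here is a small self-contained sketch that writes a line as UTF-16 and reads it back. The temporary file (via the core C<File::Temp> module) is an assumption standing in for C<wintext>, and the sample text is arbitrary.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Assumption: a throwaway temporary file stands in for "wintext".
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close($tmp_fh);

# GREEK CAPITAL LETTER SIGMA, MATHEMATICAL ITALIC SMALL N, newline
my $text = "\x{3A3}\x{1D45B}\n";

# Write characters; the layer converts them to UTF-16 on the way out.
open(my $out_file, "> :encoding(UTF-16)", $path) or die "open: $!";
print {$out_file} $text;
close($out_file) or die "close: $!";

# Read them back; the layer converts UTF-16 to characters on the way in.
open(my $in_file, "< :encoding(UTF-16)", $path) or die "open: $!";
my $line = <$in_file>;
close($in_file);

print "round trip ", ($line eq $text ? "succeeded" : "failed"), "\n";
```

The program itself only ever sees characters; all byte-level detail stays inside the I/O layer.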
	326
	327	=head2 ℞ 20: Unicode casing
	328
	329	Unicode casing is very different from ASCII casing.
	330
	331	uc("henry ⅷ") # "HENRY Ⅷ"
	332	uc("tschüß") # "TSCHÜSS" notice ß => SS
	333
	334	# both are true:
	335	"tschüß" =~ /TSCHÜSS/i # notice ß => SS
	336	"Σίσυφος" =~ /ΣΊΣΥΦΟΣ/i # notice Σ,σ,ς sameness
	337
	338	=head2 ℞ 21: Unicode case-insensitive comparisons
	339
	340	Also available in the CPAN L<Unicode::CaseFold> module,
	341	the new C<fc> “foldcase” function from v5.16 grants
	342	access to the same Unicode casefolding as the C</i>
	343	pattern modifier has always used:
	344
	345	use feature "fc"; # fc() function is from v5.16
	346
	347	# sort case-insensitively
	348	my @sorted = sort { fc($a) cmp fc($b) } @list;
	349
	350	# both are true:
	351	fc("tschüß") eq fc("TSCHÜSS")
	352	fc("Σίσυφος") eq fc("ΣΊΣΥΦΟΣ")
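The difference between C<lc> and C<fc> is easy to miss, so here is a short sketch contrasting them. The strings are written with C<\x{...}> escapes so the example does not depend on the source file's encoding, and the word list is made up for illustration.

```perl
use v5.16;    # enables say, fc, and unicode_strings
use warnings;

my $lower = "tsch\x{FC}\x{DF}";    # "tschuess" spelled with u-umlaut and sharp s
my $upper = "TSCH\x{DC}SS";        # its uppercase form, with SS for sharp s

# lc() leaves the sharp s alone, so the lowercased strings still differ.
my $same_lc = lc($lower) eq lc($upper) ? 1 : 0;

# fc() folds the sharp s to "ss", so the folded strings agree.
my $same_fc = fc($lower) eq fc($upper) ? 1 : 0;

say "lc comparison: $same_lc";
say "fc comparison: $same_fc";

# Case-insensitive sort, as in the recipe above.
my @sorted = sort { fc($a) cmp fc($b) } qw(banana Apple cherry);
say "@sorted";
```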
	353
	354	=head2 ℞ 22: Match Unicode linebreak sequence in regex
	355
	356	A Unicode linebreak matches the two-character CRLF
	357	grapheme or any of seven vertical whitespace characters.
	358	Good for dealing with textfiles coming from different
	359	operating systems.
	360
	361	\R
	362
	363	s/\R/\n/g; # normalize all linebreaks to \n
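A quick way to convince yourself that C<\R> treats CRLF as one unit is to normalize a string with mixed line endings. This is only a sketch; the sample text is invented.

```perl
use strict;
use warnings;

# One string mixing DOS, old Mac, Unicode, and Unix line endings.
my $text = "dos\r\nmac\runicode\x{2028}unix\n";

# Copy, then normalize every linebreak sequence to a single \n.
(my $clean = $text) =~ s/\R/\n/g;

my @lines = split /\n/, $clean;
print scalar(@lines), " lines\n";
```

Because C<\R> consumes C<\r\n> as a pair, the DOS ending becomes one newline rather than two.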
364
365	=head2 ℞ 23: Get character category
366
367	Find the general category of a numeric codepoint.
368
369	use Unicode::UCD qw(charinfo);
370	my $cat = charinfo(0x3A3)->{category}; # "Lu"
371
372	=head2 ℞ 24: Disabling Unicode-awareness in builtin charclasses
373
374	Disable C<\w>, C<\b>, C<\s>, C<\d>, and the POSIX
375	classes from working correctly on Unicode either in this
376	scope, or in just one regex.
377
378	use v5.14;
379	use re "/a";
380
381	# OR
382
383	my($num) = $str =~ /(\d+)/a;
384
Or use specific un-Unicode properties, like C<\p{ahex}>
and C<\p{POSIX_Digit}>. Properties still work normally
no matter what charset modifiers (C</d /u /l /a /aa>)
should be in effect.
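The effect of C</a> can be checked directly. The sketch below, which uses Devanagari digits as arbitrary test data, compares C<\d> with and without C</a> and shows that a C<\p{...}> property is unaffected by the modifier.

```perl
use v5.14;
use warnings;

my $deva = "\x{96A}\x{96B}\x{96C}";    # DEVANAGARI DIGITS FOUR, FIVE, SIX

# Without /a, \d matches any Unicode decimal digit.
my $unicode_rules = $deva =~ /^\d+$/ ? 1 : 0;

# With /a, \d is restricted to ASCII 0-9.
my $ascii_rules = $deva =~ /^\d+$/a ? 1 : 0;

# Properties keep their Unicode meaning even under /a.
my $by_property = $deva =~ /^\p{Nd}+$/a ? 1 : 0;

say "default: $unicode_rules, /a: $ascii_rules, property under /a: $by_property";
```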
389
390	=head2 ℞ 25: Match Unicode properties in regex with \p, \P
391
392	These all match a single codepoint with the given
393	property. Use C<\P> in place of C<\p> to match
394	one codepoint lacking that property.
395
396	\pL, \pN, \pS, \pP, \pM, \pZ, \pC
397	\p{Sk}, \p{Ps}, \p{Lt}
398	\p{alpha}, \p{upper}, \p{lower}
399	\p{Latin}, \p{Greek}
48791bf1	400	\p{script_extensions=Latin}, \p{scx=Greek}
2561daa4 RS	401	\p{East_Asian_Width=Wide}, \p{EA=W}
	402	\p{Line_Break=Hyphen}, \p{LB=HY}
	403	\p{Numeric_Value=4}, \p{NV=4}
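As a small illustration of a few of these properties, the following sketch tests one character against several of them and counts characters by numeric value. The hash of checks and its labels are invented for the example.

```perl
use v5.14;
use warnings;

my $sigma = "\x{3A3}";    # GREEK CAPITAL LETTER SIGMA

# Hypothetical labels mapped to the result of each property test.
my %check = (
    'general category Lu' => ($sigma =~ /^\p{Lu}$/    ? 1 : 0),
    'script Greek'        => ($sigma =~ /^\p{Greek}$/ ? 1 : 0),
    'script Latin'        => ($sigma =~ /^\p{Latin}$/ ? 1 : 0),
    'not Latin'           => ($sigma =~ /^\P{Latin}$/ ? 1 : 0),
);
say "$_: $check{$_}" for sort keys %check;

# Numeric_Value is independent of script: both DIGIT FOUR and
# ROMAN NUMERAL FOUR (U+2163) have the value 4.
my $fours = () = "4 \x{2163} 5" =~ /\p{NV=4}/g;
say "characters with numeric value 4: $fours";
```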
	404
	405	=head2 ℞ 26: Custom character properties
	406
	407	Define at compile-time your own custom character
	408	properties for use in regexes.
	409
	410	# using private-use characters
	411	sub In_Tengwar { "E000\tE07F\n" }
	412
	413	if (/\p{In_Tengwar}/) { ... }
	414
	415	# blending existing properties
	416	sub Is_GraecoRoman_Title {<<'END_OF_SET'}
	417	+utf8::IsLatin
	418	+utf8::IsGreek
	419	&utf8::IsTitle
	420	END_OF_SET
	421
    if (/\p{Is_GraecoRoman_Title}/) { ... }
	423
	424	=head2 ℞ 27: Unicode normalization
	425
Typically render into NFD on input and NFC on output. Using NFKC or NFKD
functions improves recall on searches, assuming you've already done the
same to the text to be searched. Note that this is about much more than just
pre-combined compatibility glyphs; it also reorders marks according to their
canonical combining classes and weeds out singletons.
	431
	432	use Unicode::Normalize;
	433	my $nfd = NFD($orig);
	434	my $nfc = NFC($orig);
	435	my $nfkd = NFKD($orig);
	436	my $nfkc = NFKC($orig);
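To make the point concrete, here is a minimal sketch showing why normalization matters for comparisons. The sample strings are chosen for illustration and written with escapes so they survive any source encoding.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC NFKC);

my $composed   = "\x{E9}";      # e with acute, as one code point
my $decomposed = "e\x{301}";    # e followed by COMBINING ACUTE ACCENT

# Canonically equivalent strings differ as raw code points...
my $raw_equal = $composed eq $decomposed ? 1 : 0;

# ...but agree once both are in the same normalization form.
my $nfd_equal = NFD($composed) eq NFD($decomposed) ? 1 : 0;
my $nfc_equal = NFC($composed) eq NFC($decomposed) ? 1 : 0;

# Compatibility forms additionally fold things like ligatures.
my $folded = NFKC("\x{FB01}le");    # LATIN SMALL LIGATURE FI + "le"

print "raw: $raw_equal, NFD: $nfd_equal, NFC: $nfc_equal, NFKC: $folded\n";
```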
	437
	438	=head2 ℞ 28: Convert non-ASCII Unicode numerics
	439
Unless you’ve used C</a> or C</aa>, C<\d> matches more than
ASCII digits only, but Perl’s implicit string-to-number
conversion does not currently recognize these. Here’s how to
convert such strings manually.
	444
    use v5.14;  # needed for num() function
    use Unicode::UCD qw(num);
    my $str = "got Ⅻ and ४५६७ and ⅞ and here";
    my @nums = ();
    # \pN rather than \N: in a regex, \N matches any non-newline character
    while ($str =~ /(\d+|\pN)/g) {  # not just ASCII!
        push @nums, num($1);
    }
    say "@nums";   #     12      4567      0.875
	453
	454	use charnames qw(:full);
	455	my $nv = num("\N{RUMI DIGIT ONE}\N{RUMI DIGIT TWO}");
	456
	457	=head2 ℞ 29: Match Unicode grapheme cluster in regex
	458
	459	Programmer-visible “characters” are codepoints matched by C</./s>,
	460	but user-visible “characters” are graphemes matched by C</\X/>.
	461
	462	# Find vowel plus any combining diacritics,underlining,etc.
	463	my $nfd = NFD($orig);
	464	$nfd =~ / (?=[aeiou]) \X /xi
	465
	466	=head2 ℞ 30: Extract by grapheme instead of by codepoint (regex)
	467
	468	# match and grab five first graphemes
	469	my($first_five) = $str =~ /^ ( \X{5} ) /x;
	470
	471	=head2 ℞ 31: Extract by grapheme instead of by codepoint (substr)
	472
	473	# cpan -i Unicode::GCString
	474	use Unicode::GCString;
	475	my $gcs = Unicode::GCString->new($str);
	476	my $first_five = $gcs->substr(0, 5);
	477
	478	=head2 ℞ 32: Reverse string by grapheme
	479
	480	Reversing by codepoint messes up diacritics, mistakenly converting
	481	C<crème brûlée> into C<éel̂urb em̀erc> instead of into C<eélûrb emèrc>;
	482	so reverse by grapheme instead. Both these approaches work
	483	right no matter what normalization the string is in:
	484
	485	$str = join("", reverse $str =~ /\X/g);
	486
	487	# OR: cpan -i Unicode::GCString
	488	use Unicode::GCString;
	489	$str = reverse Unicode::GCString->new($str);
	490
	491	=head2 ℞ 33: String length in graphemes
	492
	493	The string C<brûlée> has six graphemes but up to eight codepoints.
	494	This counts by grapheme, not by codepoint:
	495
	496	my $str = "brûlée";
	497	my $count = 0;
	498	while ($str =~ /\X/g) { $count++ }
	499
	500	# OR: cpan -i Unicode::GCString
	501	use Unicode::GCString;
	502	my $gcs = Unicode::GCString->new($str);
	503	my $count = $gcs->length;
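The grapheme recipes above can also be tried with core modules only. The following is a sketch under that assumption: it builds the decomposed form of the sample word with C<Unicode::Normalize>, then counts and reverses it by grapheme using C<\X>.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

# The sample word in decomposed form, so each accent is its own code point.
my $str = NFD("br\x{FB}l\x{E9}e");

my $codepoints = length($str);          # counts code points
my $graphemes  = () = $str =~ /\X/g;    # counts grapheme clusters

# Reverse grapheme by grapheme so each accent stays on its base letter.
my $reversed = join("", reverse $str =~ /\X/g);

print "$codepoints code points, $graphemes graphemes\n";
```

Comparing C<NFC($reversed)> against the expected precomposed string is a convenient way to check that no accent has migrated to a neighboring letter.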
	504
	505	=head2 ℞ 34: Unicode column-width for printing
	506
	507	Perl’s C<printf>, C<sprintf>, and C<format> think all
	508	codepoints take up 1 print column, but many take 0 or 2.
	509	Here to show that normalization makes no difference,
	510	we print out both forms:
	511
    use Unicode::GCString;
    use Unicode::Normalize;

    my @words = qw/crème brûlée/;
    @words = map { NFC($_), NFD($_) } @words;

    for my $str (@words) {
        my $gcs = Unicode::GCString->new($str);
        my $cols = $gcs->columns;
        my $pad = " " x (10 - $cols);
        say $str, $pad, " |";
    }
524
525	generates this to show that it pads correctly no matter
526	the normalization:
527
    crème      |
    crème      |
    brûlée     |
    brûlée     |
532
533	=head2 ℞ 35: Unicode collation
534
535	Text sorted by numeric codepoint follows no reasonable alphabetic order;
536	use the UCA for sorting text.
537
538	use Unicode::Collate;
539	my $col = Unicode::Collate->new();
540	my @list = $col->sort(@old_list);
541
542	See the I<ucsort> program from the L<Unicode::Tussle> CPAN module
ddeccf1f	543	for a convenient command-line interface to this module.
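For a quick comparison of the two orderings, here is a minimal sketch using the core C<Unicode::Collate> module. The word list is made up, and the non-ASCII word is written with an escape so the example does not depend on the source encoding.

```perl
use strict;
use warnings;
use Unicode::Collate;

# A made-up word list; the second word starts with e-acute (U+00E9).
my @old_list = ("zebra", "\x{E9}clair", "apple");

# Plain sort compares code points, so U+00E9 lands after "z".
my @by_codepoint = sort @old_list;

# The UCA treats e-acute as a variant of "e".
my $col    = Unicode::Collate->new();
my @by_uca = $col->sort(@old_list);

binmode(STDOUT, ":encoding(UTF-8)");
print "code point order: @by_codepoint\n";
print "UCA order:        @by_uca\n";
```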
2561daa4 RS	544
	545	=head2 ℞ 36: Case- I<and> accent-insensitive Unicode sort
	546
	547	Specify a collation strength of level 1 to ignore case and
	548	diacritics, only looking at the basic character.
	549
	550	use Unicode::Collate;
	551	my $col = Unicode::Collate->new(level => 1);
	552	my @list = $col->sort(@old_list);
	553
	554	=head2 ℞ 37: Unicode locale collation
	555
	556	Some locales have special sorting rules.
	557
	558	# either use v5.12, OR: cpan -i Unicode::Collate::Locale
	559	use Unicode::Collate::Locale;
	560	my $col = Unicode::Collate::Locale->new(locale => "de__phonebook");
	561	my @list = $col->sort(@old_list);
	562
	563	The I<ucsort> program mentioned above accepts a C<--locale> parameter.
	564
	565	=head2 ℞ 38: Making C<cmp> work on text instead of codepoints
	566
	567	Instead of this:
	568
    @srecs = sort {
        $b->{AGE} <=> $a->{AGE}
            ||
        $a->{NAME} cmp $b->{NAME}
    } @recs;
	574
	575	Use this:
	576
    use Unicode::Collate;
    my $coll = Unicode::Collate->new();
    for my $rec (@recs) {
        $rec->{NAME_key} = $coll->getSortKey( $rec->{NAME} );
    }
    @srecs = sort {
        $b->{AGE} <=> $a->{AGE}
            ||
        $a->{NAME_key} cmp $b->{NAME_key}
    } @recs;
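Here is a runnable sketch of the sort-key approach. The records are hypothetical; only the C<NAME>, C<AGE>, and C<NAME_key> field names come from the recipe.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Hypothetical records; two share an age so the name breaks the tie.
my @recs = (
    { NAME => "Zo\x{EB}",   AGE => 30 },    # Zoe with a diaeresis
    { NAME => "\x{C9}mile", AGE => 30 },    # Emile with an acute accent
    { NAME => "Adam",       AGE => 41 },
);

# Precompute one binary sort key per record.
my $coll = Unicode::Collate->new();
$_->{NAME_key} = $coll->getSortKey($_->{NAME}) for @recs;

# Oldest first; ties are broken by name in UCA order via the sort keys.
my @srecs = sort {
    $b->{AGE} <=> $a->{AGE}
        ||
    $a->{NAME_key} cmp $b->{NAME_key}
} @recs;

my @names = map { $_->{NAME} } @srecs;

binmode(STDOUT, ":encoding(UTF-8)");
print join(", ", @names), "\n";
```

Sort keys are plain byte strings, so the ordinary C<cmp> operator gives the collation order, and each name is analyzed only once however many comparisons the sort performs.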
	586
	587	=head2 ℞ 39: Case- I<and> accent-insensitive comparisons
	588
	589	Use a collator object to compare Unicode text by character
	590	instead of by codepoint.
	591
	592	use Unicode::Collate;
	593	my $es = Unicode::Collate->new(
	594	level => 1,
	595	normalization => undef
	596	);
	597
	598	# now both are true:
	599	$es->eq("García", "GARCIA" );
	600	$es->eq("Márquez", "MARQUEZ");
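To show what the strength level changes, the sketch below compares the same pair of strings with a default collator and a level 1 collator. It leaves normalization at its default, and the variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $plain   = Unicode::Collate->new();              # all levels significant
my $primary = Unicode::Collate->new(level => 1);    # base letters only

# "Garcia" with an acute accent on the i, against an unaccented uppercase form.
my ($accented, $bare) = ("Garc\x{ED}a", "GARCIA");

my $strict_eq = $plain->eq($accented, $bare)   ? 1 : 0;
my $loose_eq  = $primary->eq($accented, $bare) ? 1 : 0;

print "default strength: $strict_eq, level 1: $loose_eq\n";
```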
	601
	602	=head2 ℞ 40: Case- I<and> accent-insensitive locale comparisons
	603
	604	Same, but in a specific locale.
	605
	606	my $de = Unicode::Collate::Locale->new(
	607	locale => "de__phonebook",
608	);
609
610	# now this is true:
611	$de->eq("tschüß", "TSCHUESS"); # notice ü => UE, ß => SS
612
613	=head2 ℞ 41: Unicode linebreaking
614
615	Break up text into lines according to Unicode rules.
616
617	# cpan -i Unicode::LineBreak
618	use Unicode::LineBreak;
619	use charnames qw(:full);
620
621	my $para = "This is a super\N{HYPHEN}long string. " x 20;
63602a3f	622	my $fmt = Unicode::LineBreak->new;
2561daa4 RS	623	print $fmt->break($para), "\n";
	624
	625	=head2 ℞ 42: Unicode text in DBM hashes, the tedious way
	626
	627	Using a regular Perl string as a key or value for a DBM
	628	hash will trigger a wide character exception if any codepoints
	629	won’t fit into a byte. Here’s how to manually manage the translation:
	630
	631	use DB_File;
	632	use Encode qw(encode decode);
	633	tie %dbhash, "DB_File", "pathname";
	634
	635	# STORE
	636
	637	# assume $uni_key and $uni_value are abstract Unicode strings
	638	my $enc_key = encode("UTF-8", $uni_key, 1);
	639	my $enc_value = encode("UTF-8", $uni_value, 1);
	640	$dbhash{$enc_key} = $enc_value;
	641
	642	# FETCH
	643
	644	# assume $uni_key holds a normal Perl string (abstract Unicode)
	645	my $enc_key = encode("UTF-8", $uni_key, 1);
	646	my $enc_value = $dbhash{$enc_key};
7b237c8f	647	my $uni_value = decode("UTF-8", $enc_value, 1);
2561daa4 RS	648
	649	=head2 ℞ 43: Unicode text in DBM hashes, the easy way
	650
	651	Here’s how to implicitly manage the translation; all encoding
	652	and decoding is done automatically, just as with streams that
	653	have a particular encoding attached to them:
	654
	655	use DB_File;
	656	use DBM_Filter;
	657
	658	my $dbobj = tie %dbhash, "DB_File", "pathname";
    $dbobj->Filter_Push("utf8");  # this is the magic bit (keys and values)
	660
	661	# STORE
	662
	663	# assume $uni_key and $uni_value are abstract Unicode strings
	664	$dbhash{$uni_key} = $uni_value;
	665
	666	# FETCH
	667
	668	# $uni_key holds a normal Perl string (abstract Unicode)
	669	my $uni_value = $dbhash{$uni_key};
	670
	671	=head2 ℞ 44: PROGRAM: Demo of Unicode collation and printing
	672
	673	Here’s a full program showing how to make use of locale-sensitive
	674	sorting, Unicode casing, and managing print widths when some of the
	675	characters take up zero or two columns, not just one column each time.
	676	When run, the following program produces this nicely aligned output:
	677
	678	Crème Brûlée....... €2.00
	679	Éclair............. €1.60
	680	Fideuà............. €4.20
	681	Hamburger.......... €6.00
	682	Jamón Serrano...... €4.45
	683	Linguiça........... €7.00
	684	Pâté............... €4.15
	685	Pears.............. €2.00
	686	Pêches............. €2.25
	687	Smørbrød........... €5.75
	688	Spätzle............ €5.50
	689	Xoriço............. €3.00
	690	Γύρος.............. €6.50
	691	막걸리............. €4.00
	692	おもち............. €2.65
	693	お好み焼き......... €8.00
	694	シュークリーム..... €1.85
	695	寿司............... €9.99
	696	包子............... €7.50
	697
d84bd0bd	698	Here's that program.
2561daa4 RS	699
	700	#!/usr/bin/env perl
	701	# umenu - demo sorting and printing of Unicode food
	702	#
	703	# (obligatory and increasingly long preamble)
	704	#
d84bd0bd	705	use v5.36;
2561daa4	706	use utf8;
2561daa4	707	use warnings qw(FATAL utf8); # fatalize encoding faults
a8980281	708	use open qw(:std :encoding(UTF-8)); # undeclared streams in UTF-8
2561daa4 RS	709	use charnames qw(:full :short); # unneeded in v5.16
	710
	711	# std modules
	712	use Unicode::Normalize; # std perl distro as of v5.8
	713	use List::Util qw(max); # std perl distro as of v5.10
	714	use Unicode::Collate::Locale; # std perl distro as of v5.14
	715
	716	# cpan modules
	717	use Unicode::GCString; # from CPAN
	718
2561daa4 RS	719	my %price = (
	720	"γύρος" => 6.50, # gyros
	721	"pears" => 2.00, # like um, pears
	722	"linguiça" => 7.00, # spicy sausage, Portuguese
	723	"xoriço" => 3.00, # chorizo sausage, Catalan
	724	"hamburger" => 6.00, # burgermeister meisterburger
	725	"éclair" => 1.60, # dessert, French
	726	"smørbrød" => 5.75, # sandwiches, Norwegian
	727	"spätzle" => 5.50, # Bayerisch noodles, little sparrows
	728	"包子" => 7.50, # bao1 zi5, steamed pork buns, Mandarin
	729	"jamón serrano" => 4.45, # country ham, Spanish
	730	"pêches" => 2.25, # peaches, French
	731	"シュークリーム" => 1.85, # cream-filled pastry like eclair
	732	"막걸리" => 4.00, # makgeolli, Korean rice wine
	733	"寿司" => 9.99, # sushi, Japanese
	734	"おもち" => 2.65, # omochi, rice cakes, Japanese
	735	"crème brûlée" => 2.00, # crema catalana
720a02e2 FC	736	"fideuà" => 4.20, # more noodles, Valencian
720a02e2 FC	737	# (Catalan=fideuada)
2561daa4 RS	738	"pâté" => 4.15, # gooseliver paste, French
	739	"お好み焼き" => 8.00, # okonomiyaki, Japanese
	740	);
	741
d84bd0bd	742	my $width = 5 + max map { colwidth($_) } keys %price;
2561daa4 RS	743
	744	# So the Asian stuff comes out in an order that someone
	745	# who reads those scripts won't freak out over; the
	746	# CJK stuff will be in JIS X 0208 order that way.
63602a3f	747	my $coll = Unicode::Collate::Locale->new(locale => "ja");
2561daa4 RS	748
	749	for my $item ($coll->sort(keys %price)) {
	750	print pad(entitle($item), $width, ".");
	751	printf " €%.2f\n", $price{$item};
	752	}
	753
9af9b932	754	sub pad ($str, $width, $padchar) {
2561daa4 RS	755	return $str . ($padchar x ($width - colwidth($str)));
	756	}
	757
9af9b932	758	sub colwidth ($str) {
2561daa4 RS	759	return Unicode::GCString->new($str)->columns;
	760	}
	761
9af9b932	762	sub entitle ($str) {
2561daa4 RS	763	$str =~ s{ (?=\pL)(\S) (\S*) }
	764	{ ucfirst($1) . lc($2) }xge;
	765	return $str;
	766	}
	767
	768	=head1 SEE ALSO
	769
	770	See these manpages, some of which are CPAN modules:
	771	L<perlunicode>, L<perluniprops>,
	772	L<perlre>, L<perlrecharclass>,
	773	L<perluniintro>, L<perlunitut>, L<perlunifaq>,
	774	L<PerlIO>, L<DB_File>, L<DBM_Filter>, L<DBM_Filter::utf8>,
	775	L<Encode>, L<Encode::Locale>,
	776	L<Unicode::UCD>,
	777	L<Unicode::Normalize>,
	778	L<Unicode::GCString>, L<Unicode::LineBreak>,
	779	L<Unicode::Collate>, L<Unicode::Collate::Locale>,
	780	L<Unicode::Unihan>,
	781	L<Unicode::CaseFold>,
	782	L<Unicode::Tussle>,
	783	L<Lingua::JA::Romanize::Japanese>,
	784	L<Lingua::ZH::Romanize::Pinyin>,
	785	L<Lingua::KO::Romanize::Hangul>.
	786
	787	The L<Unicode::Tussle> CPAN module includes many programs
	788	to help with working with Unicode, including
	789	these programs to fully or partly replace standard utilities:
	790	I<tcgrep> instead of I<egrep>,
	791	I<uniquote> instead of I<cat -v> or I<hexdump>,
	792	I<uniwc> instead of I<wc>,
	793	I<unilook> instead of I<look>,
	794	I<unifmt> instead of I<fmt>,
	795	and
	796	I<ucsort> instead of I<sort>.
	797	For exploring Unicode character names and character properties,
	798	see its I<uniprops>, I<unichars>, and I<uninames> programs.
	799	It also supplies these programs, all of which are general filters that do Unicode-y things:
	800	I<unititle> and I<unicaps>;
	801	I<uniwide> and I<uninarrow>;
	802	I<unisupers> and I<unisubs>;
	803	I<nfd>, I<nfc>, I<nfkd>, and I<nfkc>;
	804	and I<uc>, I<lc>, and I<tc>.
	805
	806	Finally, see the published Unicode Standard (page numbers are from version
	807	6.0.0), including these specific annexes and technical reports:
	808
	809	=over
	810
	811	=item §3.13 Default Case Algorithms, page 113;
	812	§4.2 Case, pages 120–122;
Case Mappings, pages 166–172, especially Caseless Matching starting on page 170.
	814
2561daa4 RS	815	=item UAX #44: Unicode Character Database
	816
	817	=item UTS #18: Unicode Regular Expressions
	818
	819	=item UAX #15: Unicode Normalization Forms
	820
	821	=item UTS #10: Unicode Collation Algorithm
	822
	823	=item UAX #29: Unicode Text Segmentation
	824
	825	=item UAX #14: Unicode Line Breaking Algorithm
	826
	827	=item UAX #11: East Asian Width
	828
	829	=back
	830
	831	=head1 AUTHOR
	832
Tom Christiansen E<lt>tchrist@perl.comE<gt> wrote this, with occasional
kibitzing from Larry Wall and Jeffrey Friedl in the background.
	835
	836	=head1 COPYRIGHT AND LICENCE
	837
	838	Copyright © 2012 Tom Christiansen.
	839
	840	This program is free software; you may redistribute it and/or modify it
	841	under the same terms as Perl itself.
	842
Most of these examples are taken from the current edition of the “Camel Book”;
that is, from the 4ᵗʰ Edition of I<Programming Perl>, Copyright © 2012 Tom
Christiansen I<et al.>, 2012-02-13 by O’Reilly Media. The code itself is
freely redistributable, and you are encouraged to transplant, fold,
spindle, and mutilate any of the examples in this manpage however you please
for inclusion into your own programs without any encumbrance whatsoever.
Acknowledgement via code comment is polite but not required.
	850
ddeccf1f	851	=head1 REVISION HISTORY
2561daa4 RS	852
2561daa4 RS	853	v1.0.0 – first public release, 2012-02-27