Commit | Line | Data |
---|---|---|
de1df517 | 1 | package encoding::warnings; |
efd46721 | 2 | $encoding::warnings::VERSION = '0.11'; |
de1df517 RGS |
3 | |
4 | use strict; | |
6309fe6d | 5 | use 5.007; |
de1df517 RGS |
6 | |
7 | =head1 NAME | |
8 | ||
9 | encoding::warnings - Warn on implicit encoding conversions | |
10 | ||
11 | =head1 VERSION | |
12 | ||
efd46721 RGS |
13 | This document describes version 0.11 of encoding::warnings, released |
14 | June 5, 2007. | |
de1df517 RGS |
15 | |
16 | =head1 SYNOPSIS | |
17 | ||
18 | use encoding::warnings; # or 'FATAL' to raise fatal exceptions | |
19 | ||
20 | utf8::encode($a = chr(20000)); # a byte-string (raw bytes) | |
21 | $b = chr(20000); # a unicode-string (wide characters) | |
22 | ||
23 | # "Bytes implicitly upgraded into wide characters as iso-8859-1" | |
24 | $c = $a . $b; | |
25 | ||
26 | =head1 DESCRIPTION | |
27 | ||
28 | =head2 Overview of the problem | |
29 | ||
30 | By default, there is a fundamental asymmetry in Perl's unicode model: | |
31 | implicit upgrading from byte-strings to unicode-strings assumes that | |
32 | they were encoded in I<ISO 8859-1 (Latin-1)>, but unicode-strings are | |
33 | downgraded with UTF-8 encoding. This happens because the first 256 | |
34 | codepoints in Unicode happen to agree with Latin-1. | |
35 | ||
36 | However, this silent upgrading can easily cause problems, if you happen | |
37 | to mix unicode strings with non-Latin1 data -- i.e. byte-strings encoded | |
38 | in UTF-8 or other encodings. The error will not manifest until the | |
39 | combined string is written to output, at which point it is impossible | |
40 | to see where the silent upgrading occurred. | |
41 | ||
42 | =head2 Detecting the problem | |
43 | ||
44 | This module simplifies the process of diagnosing such problems. Just put | |
45 | this line on top of your main program: | |
46 | ||
47 | use encoding::warnings; | |
48 | ||
49 | Afterwards, implicit upgrading of high-bit bytes will raise a warning. | |
50 | For example: C<Bytes implicitly upgraded into wide characters as iso-8859-1 at | |
51 | - line 7>. | |
52 | ||
53 | However, strings composed purely of ASCII code points (C<0x00>..C<0x7F>) | |
54 | will I<not> trigger this warning. | |
55 | ||
56 | You can also make the warnings fatal by importing this module as: | |
57 | ||
58 | use encoding::warnings 'FATAL'; | |
59 | ||
60 | =head2 Solving the problem | |
61 | ||
62 | Most of the time, this warning occurs when a byte-string is concatenated | |
63 | with a unicode-string. There are a number of ways to solve it: | |
64 | ||
65 | =over 4 | |
66 | ||
67 | =item * Upgrade both sides to unicode-strings | |
68 | ||
69 | If your program does not need compatibility with Perl 5.6 and earlier, | |
70 | the recommended approach is to apply appropriate IO disciplines, so all | |
71 | data in your program become unicode-strings. See L<encoding>, L<open> and | |
72 | L<perlfunc/binmode> for how. | |
73 | ||
74 | =item * Downgrade both sides to byte-strings | |
75 | ||
76 | The other way works too, especially if you are sure that all your data | |
77 | are in the same encoding, or if compatibility with older versions | |
78 | of Perl is desired. | |
79 | ||
80 | You may downgrade strings with C<Encode::encode> and C<utf8::encode>. | |
81 | See L<Encode> and L<utf8> for details. | |
82 | ||
83 | =item * Specify the encoding for implicit byte-string upgrading | |
84 | ||
85 | If you are confident that all byte-strings will be in a specific | |
86 | encoding like UTF-8, I<and> need not support older versions of Perl, | |
87 | use the C<encoding> pragma: | |
88 | ||
89 | use encoding 'utf8'; | |
90 | ||
91 | Similarly, this will silence warnings from this module, and preserve the | |
92 | default behaviour: | |
93 | ||
94 | use encoding 'iso-8859-1'; | |
95 | ||
96 | However, note that C<use encoding> actually has three distinct effects: | |
97 | ||
98 | =over 4 | |
99 | ||
100 | =item * PerlIO layers for B<STDIN> and B<STDOUT> | |
101 | ||
102 | This is similar to what the L<open> pragma does. | |
103 | ||
104 | =item * Literal conversions | |
105 | ||
106 | This turns I<all> literal strings in your program into unicode-strings | |
107 | (equivalent to a C<use utf8>), by decoding them using the specified | |
108 | encoding. | |
109 | ||
110 | =item * Implicit upgrading for byte-strings | |
111 | ||
112 | This will silence warnings from this module, as shown above. | |
113 | ||
114 | =back | |
115 | ||
116 | Because literal conversions also apply to empty strings, the result may | |
117 | be surprising: | |
118 | ||
119 | use encoding 'big5'; | |
120 | ||
121 | my $byte_string = pack("C*", 0xA4, 0x40); | |
122 | print length $byte_string; # 2 here. | |
123 | $byte_string .= ""; # concatenating with a unicode string... | |
124 | print length $byte_string; # 1 here! | |
125 | ||
126 | In other words, do not C<use encoding> unless you are certain that the | |
127 | program will not deal with any raw, 8-bit binary data at all. | |
128 | ||
129 | However, the C<Filter =E<gt> 1> flavor of C<use encoding> will I<not> | |
130 | affect implicit upgrading for byte-strings, and is thus incapable of | |
131 | silencing warnings from this module. See L<encoding> for more details. | |
132 | ||
133 | =back | |
134 | ||
135 | =head1 CAVEATS | |
136 | ||
6309fe6d SP |
137 | For Perl 5.9.4 or later, this module's effect is lexical. |
138 | ||
139 | For Perl versions prior to 5.9.4, this module affects the whole script, | |
140 | rather than only the enclosing lexical block. | |
de1df517 RGS |
141 | |
142 | =cut | |
143 | ||
144 | # Constants. | |
145 | sub ASCII () { 0 } | |
146 | sub LATIN1 () { 1 } | |
147 | sub FATAL () { 2 } | |
148 | ||
149 | # Install a ${^ENCODING} handler if no other one is already in place. | |
150 | sub import { | |
151 | my $class = shift; | |
152 | my $fatal = shift || ''; | |
153 | ||
154 | local $@; | |
155 | return if ${^ENCODING} and ref(${^ENCODING}) ne $class; | |
156 | return unless eval { require Encode; 1 }; | |
157 | ||
158 | my $ascii = Encode::find_encoding('us-ascii') or return; | |
159 | my $latin1 = Encode::find_encoding('iso-8859-1') or return; | |
160 | ||
161 | # Have to undef explicitly here | |
162 | undef ${^ENCODING}; | |
163 | ||
164 | # Install a warning handler for decode() | |
6309fe6d | 165 | my $decoder = bless( |
de1df517 RGS |
166 | [ |
167 | $ascii, | |
168 | $latin1, | |
169 | (($fatal eq 'FATAL') ? 'Carp::croak' : 'Carp::carp'), | |
170 | ], $class, | |
171 | ); | |
6309fe6d SP |
172 | |
173 | ${^ENCODING} = $decoder; | |
174 | $^H{$class} = 1; | |
175 | } | |
176 | ||
177 | sub unimport { | |
178 | my $class = shift; | |
179 | $^H{$class} = undef; | |
efd46721 | 180 | undef ${^ENCODING}; |
de1df517 RGS |
181 | } |
182 | ||
183 | # Don't worry about source code literals. | |
184 | sub cat_decode { | |
185 | my $self = shift; | |
186 | return $self->[LATIN1]->cat_decode(@_); | |
187 | } | |
188 | ||
189 | # Warn if the data is not purely US-ASCII. | |
190 | sub decode { | |
191 | my $self = shift; | |
192 | ||
6309fe6d SP |
193 | DO_WARN: { |
194 | if ($] >= 5.009004) { | |
195 | my $hints = (caller(0))[10]; | |
196 | $hints->{ref($self)} or last DO_WARN; | |
197 | } | |
198 | ||
199 | local $@; | |
200 | my $rv = eval { $self->[ASCII]->decode($_[0], Encode::FB_CROAK()) }; | |
201 | return $rv unless $@; | |
202 | ||
203 | require Carp; | |
204 | no strict 'refs'; | |
205 | $self->[FATAL]->( | |
206 | "Bytes implicitly upgraded into wide characters as iso-8859-1" | |
207 | ); | |
208 | ||
209 | } | |
de1df517 | 210 | |
de1df517 RGS |
211 | return $self->[LATIN1]->decode(@_); |
212 | } | |
213 | ||
214 | sub name { 'iso-8859-1' } | |
215 | ||
216 | 1; | |
217 | ||
218 | __END__ | |
219 | ||
220 | =head1 SEE ALSO | |
221 | ||
222 | L<perlunicode>, L<perluniintro> | |
223 | ||
224 | L<open>, L<utf8>, L<encoding>, L<Encode> | |
225 | ||
226 | =head1 AUTHORS | |
227 | ||
6309fe6d | 228 | Audrey Tang |
de1df517 RGS |
229 | |
230 | =head1 COPYRIGHT | |
231 | ||
efd46721 | 232 | Copyright 2004, 2005, 2006, 2007 by Audrey Tang E<lt>cpan@audreyt.orgE<gt>. |
de1df517 RGS |
233 | |
234 | This program is free software; you can redistribute it and/or modify it | |
235 | under the same terms as Perl itself. | |
236 | ||
237 | See L<http://www.perl.com/perl/misc/Artistic.html> | |
238 | ||
239 | =cut |
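
The SYNOPSIS and "Solving the problem" sections above can be illustrated with a small standalone script. This is only a sketch: it does not load `encoding::warnings` itself (which relies on `${^ENCODING}` and so may not be usable on newer Perls), and the helper `is_pure_ascii` is a hypothetical name introduced here, loosely mirroring the US-ASCII check in `decode` above. It assumes the byte-string holds UTF-8 data.

```perl
use strict;
use warnings;
use Encode qw(encode decode find_encoding);

# Hypothetical helper (not part of encoding::warnings): true when $bytes
# holds only US-ASCII, i.e. when implicit upgrading cannot change its meaning.
sub is_pure_ascii {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its argument
    my $ok = eval {
        find_encoding('us-ascii')->decode($copy, Encode::FB_CROAK());
        1;
    };
    return $ok ? 1 : 0;
}

my $char  = chr(20000);                # unicode-string: 1 wide character
my $bytes = encode('UTF-8', $char);    # byte-string: 3 raw bytes

# The problem: each raw byte is silently upgraded as one Latin-1 character.
my $mixed = $bytes . $char;

# Fix 1: upgrade both sides by decoding the byte-string explicitly.
my $upgraded = decode('UTF-8', $bytes) . $char;

# Fix 2: downgrade both sides by encoding the unicode-string explicitly.
my $downgraded = $bytes . encode('UTF-8', $char);

# Lengths: 4 (3 mis-upgraded bytes + 1 char), 2 characters, 6 bytes.
printf "%d %d %d\n", length $mixed, length $upgraded, length $downgraded;
```

Only the first concatenation mixes the two kinds of string, and it is the case the module is meant to flag; the two fixes keep both operands in the same form, so no implicit upgrade of high-bit bytes takes place.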