Commit | Line | Data |
---|---|---|
18586f54 NIS |
1 | package Encode::Encoding; |
2 | # Base class for classes which implement encodings | |
3 | use strict; | |
80a5d8e7 | 4 | our $VERSION = do { my @r = (q$Revision: 1.25 $ =~ /\d+/g); sprintf "%d."."%02d" x $#r, @r }; |
18586f54 NIS |
5 | |
6 | sub Define | |
7 | { | |
8 | my $obj = shift; | |
9 | my $canonical = shift; | |
10 | $obj = bless { Name => $canonical },$obj unless ref $obj; | |
11 | # warn "$canonical => $obj\n"; | |
80a5d8e7 | 12 | Encode::define_encoding($obj, $canonical, @_); |
18586f54 NIS |
13 | } |
14 | ||
15 | sub name { shift->{'Name'} } | |
16 | ||
17 | # Temporary legacy methods | |
18 | sub toUnicode { shift->decode(@_) } | |
19 | sub fromUnicode { shift->encode(@_) } | |
20 | ||
21 | sub new_sequence { return $_[0] } | |
22 | ||
ca777f1c NIS |
23 | sub needs_lines { 0 } |
24 | ||
284ee456 NIS |
25 | sub DESTROY {} |
26 | ||
18586f54 NIS |
27 | 1; |
28 | __END__ | |
1b2c56c8 JH |
29 | |
30 | =head1 NAME | |
31 | ||
32 | Encode::Encoding - Encode Implementation Base Class | |
33 | ||
34 | =head1 SYNOPSIS | |
35 | ||
36 | package Encode::MyEncoding; | |
37 | use base qw(Encode::Encoding); | |
38 | ||
39 | __PACKAGE__->Define(qw(myCanonical myAlias)); | |
40 | ||
5129552c | 41 | =head1 DESCRIPTION |
1b2c56c8 JH |
42 | |
43 | As mentioned in L<Encode>, encodings are (in the current | |
44 | implementation at least) defined by objects. The mapping of encoding | |
45 | name to object is via the C<%encodings> hash. | |
46 | ||
47 | The values of the hash can currently be either strings or objects. | |
48 | The string form may go away in the future. The string form occurs | |
49 | when C<encodings()> has scanned C<@INC> for loadable encodings but has | |
50 | not actually loaded the encoding in question. This is because the | |
51 | current "loading" process is all Perl and a bit slow. | |
52 | ||
53 | Once an encoding is loaded, the value of the hash is the object which | |
54 | implements the encoding. The object should provide the following | |
55 | interface: | |
56 | ||
57 | =over 4 | |
58 | ||
59 | =item -E<gt>name | |
60 | ||
61 | Should return the string representing the canonical name of the encoding. | |
62 | ||
63 | =item -E<gt>new_sequence | |
64 | ||
65 | This is a placeholder for encodings with state. It should return an | |
66 | object which implements this interface; all current implementations | |
67 | return the original object. | |
68 | ||
69 | =item -E<gt>encode($string,$check) | |
70 | ||
71 | Should return the octet sequence representing I<$string>. If I<$check> | |
72 | is true it should modify I<$string> in place to remove the converted | |
73 | part (i.e. the whole string unless there is an error). If an error | |
74 | occurs it should return the octet sequence for the fragment of string | |
75 | that has been converted, and modify I<$string> in place to remove the | |
76 | converted part leaving it starting with the problem fragment. | |
77 | ||
78 | If I<$check> is false then C<encode> should make a "best effort" to | |
79 | convert the string - for example by using a replacement character. | |
80 | ||
81 | =item -E<gt>decode($octets,$check) | |
82 | ||
83 | Should return the string that I<$octets> represents. If I<$check> is | |
84 | true it should modify I<$octets> in place to remove the converted part | |
85 | (i.e. the whole sequence unless there is an error). If an error | |
86 | occurs it should return the fragment of string that has been | |
87 | converted, and modify I<$octets> in place to remove the converted part | |
88 | leaving it starting with the problem fragment. | |
89 | ||
90 | If I<$check> is false then C<decode> should make a "best effort" to | |
91 | convert the string - for example by using Unicode's "\x{FFFD}" as a | |
92 | replacement character. | |
93 | ||
94 | =back | |
95 | ||
96 | It should be noted that the check behaviour is different from the | |
97 | outer public API. The logic is that the "unchecked" case is useful | |
98 | when encoding is part of a stream which may be reporting errors | |
99 | (e.g. STDERR). In such cases it is desirable to get everything | |
100 | through somehow without causing additional errors which obscure the | |
101 | original one. Also the encoding is best placed to know what the | |
102 | correct replacement character is, so if that is the desired behaviour | |
103 | then letting low level code do it is the most efficient. | |
104 | ||
105 | In contrast if check is true, the scheme above allows the encoding to | |
106 | do as much as it can and tell the layer above how much that was. What is | |
107 | lacking at present is a mechanism to report what went wrong. The most | |
108 | likely interface will be an additional method call to the object, or | |
109 | perhaps (to avoid forcing per-stream objects on otherwise stateless | |
110 | encodings) an additional parameter. | |
111 | ||
112 | It is also highly desirable that encoding classes inherit from | |
113 | C<Encode::Encoding> as a base class. This allows that class to define | |
114 | additional behaviour for all encoding objects. For example, the built-in | |
115 | Unicode, UCS-2 and UTF-8 classes use: | |
116 | ||
117 | package Encode::MyEncoding; | |
118 | use base qw(Encode::Encoding); | |
119 | ||
120 | __PACKAGE__->Define(qw(myCanonical myAlias)); | |
121 | ||
122 | to create an object with C<< bless {Name => ...}, $class >> and call | |
123 | C<define_encoding>. They inherit their C<name> method from | |
124 | C<Encode::Encoding>. | |
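
The interface described above can be illustrated with a minimal sketch of a pure-Perl encoding class. The package name C<Encode::DemoASCII> and the encoding name C<x-demo-ascii> are hypothetical and exist only for this example; the choice of C<?> as the replacement on encode is likewise an assumption. The sketch accepts only 7-bit characters: with a true I<$check> it stops at the first character it cannot convert and leaves the remainder in the caller's variable, and with a false I<$check> it substitutes a replacement character.

```perl
package Encode::DemoASCII;    # hypothetical class, for illustration only
use strict;
use warnings;
use Encode ();
use base qw(Encode::Encoding);

# Register the object under a made-up canonical name.
__PACKAGE__->Define('x-demo-ascii');

sub encode {
    my ($self, $string, $check) = @_;
    if ($check) {
        # Convert the leading run of ASCII, then rewrite the caller's
        # variable (aliased through $_[1]) so it starts at the problem.
        my ($ok) = $string =~ /\A([\x00-\x7F]*)/;
        $_[1] = substr($string, length $ok);
        return $ok;
    }
    # Best effort: replace anything unrepresentable with '?'.
    (my $octets = $string) =~ s/[^\x00-\x7F]/?/g;
    return $octets;
}

sub decode {
    my ($self, $octets, $check) = @_;
    if ($check) {
        my ($ok) = $octets =~ /\A([\x00-\x7F]*)/;
        $_[1] = substr($octets, length $ok);
        return $ok;
    }
    # Best effort: use U+FFFD for octets outside the 7-bit range.
    (my $string = $octets) =~ s/[^\x00-\x7F]/\x{FFFD}/g;
    return $string;
}

package main;

my $enc  = Encode::find_encoding('x-demo-ascii');
my $text = "abc\x{263A}def";
my $out  = $enc->encode($text, 1);
# $out is "abc"; $text now begins with the unconvertible character.
```

The C<name> method is not defined in the sketch; it is inherited from C<Encode::Encoding>, as are C<new_sequence> and C<needs_lines>.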
125 | ||
126 | =head2 Compiled Encodings | |
127 | ||
48e3bbdd NIS |
128 | For the sake of speed and efficiency, most of the encodings are now |
129 | supported via I<Compiled Form>: XS modules generated from UCM | |
130 | files. Encode provides the C<enc2xs> tool to achieve that. Please see | |
131 | L<enc2xs> for more details. | |
1b2c56c8 | 132 | |
48e3bbdd | 133 | =head1 SEE ALSO |
1b2c56c8 | 134 | |
48e3bbdd | 135 | L<perlmod>, L<enc2xs> |
1b2c56c8 | 136 | |
80a5d8e7 NIS |
137 | =for future |
138 | ||
139 | ||
140 | =over 4 | |
141 | ||
142 | =item Scheme 1 | |
143 | ||
144 | Passed remaining fragment of string being processed. | |
145 | Modifies it in place to remove bytes/characters it can understand | |
146 | and returns a string used to represent them. | |
147 | e.g. | |
148 | ||
149 | sub fixup { | |
150 | my $ch = substr($_[0],0,1,''); | |
151 | return sprintf("\\x{%02X}",ord($ch)); | |
152 | } | |
153 | ||
154 | This scheme is close to how underlying C code for Encode works, but gives | |
155 | the fixup routine very little context. | |
156 | ||
157 | =item Scheme 2 | |
158 | ||
159 | Passed original string, and an index into it of the problem area, and | |
160 | output string so far. Appends what it will to output string and | |
161 | returns new index into original string. For example: | |
162 | ||
163 | sub fixup { | |
164 | # my ($s,$i,$d) = @_; | |
165 | my $ch = substr($_[0],$_[1],1); | |
166 | $_[2] .= sprintf("\\x{%02X}",ord($ch)); | |
167 | return $_[1]+1; | |
168 | } | |
169 | ||
170 | This scheme gives maximal control to the fixup routine but is more | |
171 | complicated to code, and may need internals of Encode to be tweaked to | |
172 | keep original string intact. | |
173 | ||
174 | =item Other Schemes | |
175 | ||
176 | Hybrids of above. | |
177 | ||
178 | Multiple return values rather than in-place modifications. | |
179 | ||
180 | Index into the string could be C<pos($str)> allowing C<s/\G...//>. | |
181 | ||
182 | =back | |
183 | ||
1b2c56c8 | 184 | =cut |