package Encode::Encoding;
# Base class for classes which implement encodings
use strict;
our $VERSION = do { my @r = (q$Revision: 1.25 $ =~ /\d+/g); sprintf "%d."."%02d" x $#r, @r };

sub Define
{
    my $obj = shift;
    my $canonical = shift;
    $obj = bless { Name => $canonical },$obj unless ref $obj;
    # warn "$canonical => $obj\n";
    Encode::define_encoding($obj, $canonical, @_);
}

sub name { shift->{'Name'} }

# Temporary legacy methods
sub toUnicode    { shift->decode(@_) }
sub fromUnicode  { shift->encode(@_) }

sub new_sequence { return $_[0] }

sub needs_lines { 0 }

sub DESTROY {}

1;
__END__

=head1 NAME

Encode::Encoding - Encode Implementation Base Class

=head1 SYNOPSIS

  package Encode::MyEncoding;
  use base qw(Encode::Encoding);

  __PACKAGE__->Define(qw(myCanonical myAlias));

=head1 DESCRIPTION

As mentioned in L<Encode>, encodings are (in the current
implementation at least) defined by objects. The C<%encodings> hash
maps each encoding name to its object.

The values of the hash can currently be either strings or objects.
The string form may go away in the future. It occurs when
C<encodings()> has scanned C<@INC> for loadable encodings but has not
actually loaded the encoding in question. Loading is deferred because
the current "loading" process is all Perl and a bit slow.

Once an encoding is loaded, the value of the hash is the object that
implements the encoding. The object should provide the following
interface:

=over 4

=item -E<gt>name

Should return the string representing the canonical name of the encoding.

=item -E<gt>new_sequence

This is a placeholder for encodings with state. It should return an
object which implements this interface. All current implementations
return the original object.

=item -E<gt>encode($string,$check)

Should return the octet sequence representing I<$string>. If I<$check>
is true, it should modify I<$string> in place to remove the converted
part (i.e. the whole string unless there is an error). If an error
occurs, it should return the octet sequence for the fragment of the
string that has been converted, and modify I<$string> in place to
remove the converted part, leaving it starting with the problem
fragment.

If I<$check> is false then C<encode> should make a "best effort" to
convert the string, for example by using a replacement character.

=item -E<gt>decode($octets,$check)

Should return the string that I<$octets> represents. If I<$check> is
true, it should modify I<$octets> in place to remove the converted part
(i.e. the whole sequence unless there is an error). If an error
occurs, it should return the fragment of the string that has been
converted, and modify I<$octets> in place to remove the converted
part, leaving it starting with the problem fragment.

If I<$check> is false then C<decode> should make a "best effort" to
convert the string, for example by using Unicode's "\x{FFFD}" as a
replacement character.

=back
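
The interface above can be sketched as a small standalone class. This
is only an illustration under stated assumptions: the package name
C<Encode::MyAscii>, its C<new> constructor, the C<_convert> helper and
the choice of C<?> as the replacement octet are all hypothetical and
are not part of Encode. The sketch treats code points below 0x80 as
convertible and anything else as an error, and to stay self-contained
it does not inherit from C<Encode::Encoding>.

  package Encode::MyAscii;    # hypothetical name, for illustration only
  use strict;
  use warnings;

  # Hypothetical constructor, used here only to keep the sketch standalone.
  sub new          { my ($class, $name) = @_; bless { Name => $name }, $class }
  sub name         { $_[0]{Name} }
  sub new_sequence { $_[0] }    # stateless, so return the same object

  # Hypothetical shared worker: copy characters below 0x80 from $src.
  # On any other character, stop if $check is true, otherwise
  # substitute $subst. Returns the converted part and the remainder.
  sub _convert {
      my ($src, $check, $subst) = @_;
      my $out = '';
      while (length $src) {
          my $ch = substr($src, 0, 1);
          if    (ord($ch) < 0x80) { $out .= $ch }
          elsif ($check)          { last }            # stop at the problem
          else                    { $out .= $subst }  # best effort
          substr($src, 0, 1, '');                     # consume one character
      }
      return ($out, $src);
  }

  sub encode {
      my ($obj, $string, $check) = @_;
      my ($octets, $rest) = _convert($string, $check, '?');
      $_[1] = $rest if $check;    # modify the caller's string in place
      return $octets;
  }

  sub decode {
      my ($obj, $octets, $check) = @_;
      my ($string, $rest) = _convert($octets, $check, "\x{FFFD}");
      $_[1] = $rest if $check;    # modify the caller's octets in place
      return $string;
  }

  package main;

  my $enc  = Encode::MyAscii->new('my-ascii');
  my $text = "ab\x{263A}cd";

  my $loose = $enc->encode($text, 0);   # "ab?cd", $text is untouched
  my $done  = $enc->encode($text, 1);   # "ab", $text is now "\x{263A}cd"

In the checked call the caller can tell how far conversion got by
looking at what is left in I<$text>, which is the only error signal the
interface currently provides.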

It should be noted that the check behaviour is different from the
outer public API. The logic is that the "unchecked" case is useful
when the encoding is part of a stream which may be reporting errors
(e.g. STDERR). In such cases it is desirable to get everything
through somehow without causing additional errors which obscure the
original one. Also, the encoding is best placed to know what the
correct replacement character is, so if that is the desired behaviour
then letting low-level code do it is the most efficient.

In contrast, if I<$check> is true, the scheme above allows the encoding
to do as much as it can and tell the layer above how much that was.
What is lacking at present is a mechanism to report what went wrong.
The most likely interface will be an additional method call to the
object, or perhaps (to avoid forcing per-stream objects on otherwise
stateless encodings) an additional parameter.

It is also highly desirable that encoding classes inherit from
C<Encode::Encoding> as a base class. This allows that class to define
additional behaviour for all encoding objects. For example, the
built-in Unicode, UCS-2 and UTF-8 classes use:

  package Encode::MyEncoding;
  use base qw(Encode::Encoding);

  __PACKAGE__->Define(qw(myCanonical myAlias));

to create an object with C<< bless {Name => ...}, $class >> and call
C<define_encoding>. They inherit their C<name> method from
C<Encode::Encoding>.
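
How a defined encoding is later found by name can be sketched as
follows. The package name and the names C<myCanonical> and C<myAlias>
are the placeholders from the example above. The lookup through
C<Encode::find_encoding> assumes the Encode module is installed and
loaded, and the placeholder class implements no C<encode> or C<decode>
of its own.

  use strict;
  use warnings;
  use Encode ();    # provides define_encoding and find_encoding

  package Encode::MyEncoding;    # placeholder class from the example above
  use base qw(Encode::Encoding);

  # Creates the object and registers it under both names
  __PACKAGE__->Define(qw(myCanonical myAlias));

  package main;

  # Either name resolves to the same object
  my $by_name  = Encode::find_encoding('myCanonical');
  my $by_alias = Encode::find_encoding('myAlias');

  print $by_alias->name, "\n";    # myCanonical

The C<name> method used here is the one inherited from
C<Encode::Encoding>, so it reports the canonical name even when the
object was found through its alias.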

=head2 Compiled Encodings

For the sake of speed and efficiency, most of the encodings are now
supported via a I<compiled form>: XS modules generated from UCM
files. Encode provides the enc2xs tool to generate them. Please see
L<enc2xs> for more details.

=head1 SEE ALSO

L<perlmod>, L<enc2xs>

=for future


=over 4

=item Scheme 1

The fixup routine is passed the remaining fragment of the string being
processed. It modifies the fragment in place to remove the
bytes/characters it can understand and returns a string used to
represent them. For example:

  sub fixup {
      my $ch = substr($_[0],0,1,'');
      return sprintf('\x{%02X}',ord($ch));
  }

This scheme is close to how the underlying C code for Encode works, but
gives the fixup routine very little context.

=item Scheme 2

The fixup routine is passed the original string, an index into it of
the problem area, and the output string so far. It appends what it
will to the output string and returns the new index into the original
string. For example:

  sub fixup {
      # my ($s,$i,$d) = @_;
      my $ch = substr($_[0],$_[1],1);
      $_[2] .= sprintf('\x{%02X}',ord($ch));
      return $_[1]+1;
  }

This scheme gives maximal control to the fixup routine but is more
complicated to code, and may need internals of Encode to be tweaked to
keep the original string intact.

=item Other Schemes

Hybrids of the above.

Multiple return values rather than in-place modifications.

The index into the string could be C<pos($str)>, allowing C<s/\G...//>.

=back

=cut