package Encode::Encoding;
# Base class for classes which implement encodings
use strict;
our $VERSION = do { my @r = (q$Revision: 1.25 $ =~ /\d+/g); sprintf "%d."."%02d" x $#r, @r };

sub Define
{
    my $obj = shift;
    my $canonical = shift;
    $obj = bless { Name => $canonical },$obj unless ref $obj;
    # warn "$canonical => $obj\n";
    Encode::define_encoding($obj, $canonical, @_);
}

sub name { shift->{'Name'} }

# Temporary legacy methods
sub toUnicode    { shift->decode(@_) }
sub fromUnicode  { shift->encode(@_) }

sub new_sequence { return $_[0] }

sub needs_lines  { 0 }

sub DESTROY {}

1;
__END__

=head1 NAME

Encode::Encoding - Encode Implementation Base Class

=head1 SYNOPSIS

  package Encode::MyEncoding;
  use base qw(Encode::Encoding);

  __PACKAGE__->Define(qw(myCanonical myAlias));

=head1 DESCRIPTION

As mentioned in L<Encode>, encodings are (in the current
implementation at least) defined by objects. The mapping of encoding
name to object is via the C<%encodings> hash.

The values of the hash can currently be either strings or objects.
The string form may go away in the future. The string form occurs
when C<encodings()> has scanned C<@INC> for loadable encodings but has
not actually loaded the encoding in question. This is because the
current "loading" process is all Perl and a bit slow.

Once an encoding is loaded, the value of the hash is the object which
implements the encoding. The object should provide the following
interface:

=over 4

=item -E<gt>name

Should return the string representing the canonical name of the encoding.

=item -E<gt>new_sequence

This is a placeholder for encodings with state. It should return an
object which implements this interface; all current implementations
return the original object.

=item -E<gt>encode($string,$check)

Should return the octet sequence representing I<$string>. If I<$check>
is true, it should modify I<$string> in place to remove the converted
part (i.e. the whole string unless there is an error). If an error
occurs, it should return the octet sequence for the fragment of the
string that has been converted, and modify I<$string> in place to
remove the converted part, leaving it starting with the problem fragment.

If I<$check> is false, then C<encode> should make a "best effort" to
convert the string, for example by using a replacement character.

=item -E<gt>decode($octets,$check)

Should return the string that I<$octets> represents. If I<$check> is
true, it should modify I<$octets> in place to remove the converted part
(i.e. the whole sequence unless there is an error). If an error
occurs, it should return the fragment of the string that has been
converted, and modify I<$octets> in place to remove the converted part,
leaving it starting with the problem fragment.

If I<$check> is false, then C<decode> should make a "best effort" to
convert the string, for example by using Unicode's "\x{FFFD}" as a
replacement character.

=back
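
As a minimal sketch of this interface (not taken from any encoding
shipped with Encode), the following hypothetical class handles 7-bit
ASCII only. The package name C<Encode::MyAscii>, the names C<my-ascii>
and C<myascii>, and the choice of C<?> as the replacement octet are all
assumptions made for illustration.

  package Encode::MyAscii;               # hypothetical class name
  use strict;
  use warnings;
  use Encode ();                         # Define() calls Encode::define_encoding
  use base qw(Encode::Encoding);

  __PACKAGE__->Define(qw(my-ascii myascii));   # hypothetical name and alias

  sub encode {
      my ($self, $string, $check) = @_;
      if ($check) {
          # Convert only the leading run of ASCII characters ...
          my ($ok) = $string =~ /\A([\x00-\x7F]*)/;
          # ... and remove that part from the caller's variable in place.
          $_[1] = substr($string, length $ok);
          return $ok;
      }
      # Best effort: substitute '?' for anything outside ASCII.
      (my $octets = $string) =~ s/[^\x00-\x7F]/?/g;
      return $octets;
  }

  sub decode {
      my ($self, $octets, $check) = @_;
      if ($check) {
          my ($ok) = $octets =~ /\A([\x00-\x7F]*)/;
          $_[1] = substr($octets, length $ok);
          return $ok;
      }
      # Best effort: substitute U+FFFD for octets outside ASCII.
      (my $string = $octets) =~ s/[^\x00-\x7F]/\x{FFFD}/g;
      return $string;
  }

  package main;

  my $enc  = Encode::find_encoding('myascii');
  my $text = "abc\x{263A}def";
  my $out  = $enc->encode($text, 1);
  # $out is "abc"; $text now begins with the character that failed

In checked mode the caller can tell from what is left in C<$text> where
the conversion stopped.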

It should be noted that the check behaviour is different from the
outer public API. The logic is that the "unchecked" case is useful
when encoding is part of a stream which may be reporting errors
(e.g. STDERR).  In such cases it is desirable to get everything
through somehow without causing additional errors which obscure the
original one. Also the encoding is best placed to know what the
correct replacement character is, so if that is the desired behaviour
then letting low level code do it is the most efficient.

In contrast, if I<$check> is true, the scheme above allows the encoding
to do as much as it can and tell the layer above how much that was. What
is lacking at present is a mechanism to report what went wrong. The most
likely interface will be an additional method call to the object, or
perhaps (to avoid forcing per-stream objects on otherwise stateless
encodings) an additional parameter.
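
The following sketch shows how a calling layer might use the checked
mode to find out how far a conversion got. Both C<Demo::SevenBit> (a
stand-in for a real encoding object) and C<decode_with_progress> are
hypothetical names introduced here; neither is part of Encode.

  use strict;
  use warnings;

  # Hypothetical stand-in for an encoding object. It accepts 7-bit
  # octets only and implements just the checked behaviour of decode().
  package Demo::SevenBit;
  sub new { return bless {}, shift }
  sub decode {
      my ($self, $octets, $check) = @_;
      my ($ok) = $octets =~ /\A([\x00-\x7F]*)/;
      # Trim the converted part from the caller's variable in place.
      $_[1] = substr($octets, length $ok) if $check;
      return $ok;
  }

  package main;

  # Hypothetical helper for the layer above: decode in checked mode and
  # report how many octets were consumed and what is left over.
  sub decode_with_progress {
      my ($enc, $octets) = @_;               # $octets is a private copy
      my $total  = length $octets;
      my $string = $enc->decode($octets, 1); # trims $octets in place
      return ($string, $total - length $octets, $octets);
  }

  my ($string, $consumed, $rest) =
      decode_with_progress(Demo::SevenBit->new, "ok\xFFtail");
  printf "decoded %d octets, %d left\n", $consumed, length $rest;
  # prints: decoded 2 octets, 5 left

The helper can say where the conversion stopped, but, as noted above, not
why.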

It is also highly desirable that encoding classes inherit from
C<Encode::Encoding> as a base class. This allows that class to define
additional behaviour for all encoding objects. For example, the built-in
Unicode, UCS-2 and UTF-8 classes use:

  package Encode::MyEncoding;
  use base qw(Encode::Encoding);

  __PACKAGE__->Define(qw(myCanonical myAlias));

This creates an object with C<< bless {Name => ...}, $class >> and calls
C<define_encoding>. The classes inherit their C<name> method from
C<Encode::Encoding>.
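
To illustrate, the sketch below registers the example class and looks it
up again. It assumes that C<Encode::find_encoding> (see L<Encode>)
resolves both the canonical name and the alias passed to C<Define>; the
variable names are purely illustrative.

  package Encode::MyEncoding;
  use strict;
  use warnings;
  use Encode ();                  # provides define_encoding and find_encoding
  use base qw(Encode::Encoding);

  __PACKAGE__->Define(qw(myCanonical myAlias));

  package main;

  my $by_name  = Encode::find_encoding('myCanonical');
  my $by_alias = Encode::find_encoding('myAlias');

  # Both lookups yield the same blessed object; name() is inherited
  # from Encode::Encoding and returns the canonical name.
  print $by_alias->name, "\n";                       # prints: myCanonical
  print "same object\n" if $by_name == $by_alias;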

=head2 Compiled Encodings

For the sake of speed and efficiency, most of the encodings are now
supported via a I<Compiled Form>: XS modules generated from UCM
files. Encode provides the C<enc2xs> tool to achieve that. Please see
L<enc2xs> for more details.

=head1 SEE ALSO

L<perlmod>, L<enc2xs>

=for future


=over 4

=item Scheme 1

Passed remaining fragment of string being processed.
Modifies it in place to remove bytes/characters it can understand
and returns a string used to represent them.
e.g.

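 sub fixup {
   # Remove the first character from the caller's string in place.
   my $ch = substr($_[0],0,1,'');
   # Return a printable escape such as \x{263A} to stand in for it.
   return sprintf('\x{%02X}',ord($ch));
 }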

This scheme is close to how underlying C code for Encode works, but gives
the fixup routine very little context.

=item Scheme 2

Passed original string, and an index into it of the problem area, and
output string so far.  Appends what it will to output string and
returns new index into original string.  For example:

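 sub fixup {
   # my ($s,$i,$d) = @_;   # original string, index, output so far
   my $ch = substr($_[0],$_[1],1);
   # Append a printable escape to the caller's output string in place.
   $_[2] .= sprintf('\x{%02X}',ord($ch));
   return $_[1]+1;
 }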

This scheme gives maximal control to the fixup routine but is more
complicated to code, and may need internals of Encode to be tweaked to
keep original string intact.

=item Other Schemes

Hybrids of above.

Multiple return values rather than in-place modifications.

Index into the string could be C<pos($str)> allowing C<s/\G...//>.

=back
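
As one possible reading of the "multiple return values" idea, the sketch
below reworks Scheme 1 so that nothing is modified in place. The name
C<fixup_pair> and its calling convention are assumptions made for
illustration; no such interface exists in Encode.

  use strict;
  use warnings;

  # Hypothetical variant of Scheme 1: takes the remaining fragment and
  # returns (replacement text, unprocessed remainder) as a list.
  sub fixup_pair {
      my ($fragment) = @_;
      my $ch = substr($fragment, 0, 1);
      return (sprintf('\x{%02X}', ord $ch), substr($fragment, 1));
  }

  my ($replacement, $rest) = fixup_pair("\x{263A}abc");
  # $replacement is the literal text \x{263A}; $rest is "abc"

The caller's string stays intact, at the cost of copying the remainder
on every call.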

=cut