Commit | Line | Data |
---|---|---|
657b208b | 1 | package bytes; |
5bc28da9 | 2 | |
ab473f03 | 3 | our $VERSION = '1.07'; |
b75c8c73 | 4 | |
d5448623 GS | 5 | $bytes::hint_bits = 0x00000008; |
6 | ||
5bc28da9 | 7 | sub import { |
d5448623 | 8 | $^H |= $bytes::hint_bits; |
5bc28da9 NIS | 9 | } |
10 | ||
11 | sub unimport { | |
d5448623 | 12 | $^H &= ~$bytes::hint_bits; |
5bc28da9 NIS | 13 | } |
14 | ||
15 | sub AUTOLOAD { | |
657b208b | 16 | require "bytes_heavy.pl"; |
5b5a256a TS | 17 | goto &$AUTOLOAD if defined &$AUTOLOAD; |
18 | require Carp; | |
19 | Carp::croak("Undefined subroutine $AUTOLOAD called"); | |
5bc28da9 NIS | 20 | } |
21 | ||
79077e6c RGS | 22 | sub length (_); |
23 | sub chr (_); | |
24 | sub ord (_); | |
579f6b36 JH | 25 | sub substr ($$;$$); |
26 | sub index ($$;$); | |
27 | sub rindex ($$;$); | |
5bc28da9 NIS | 28 | |
29 | 1; | |
30 | __END__ | |
31 | ||
32 | =head1 NAME | |
33 | ||
01e331e5 | 34 | bytes - Perl pragma to expose the individual bytes of characters |
5bc28da9 | 35 | |
490aa361 | 36 | =head1 NOTICE |
a515200d | 37 | |
01e331e5 KW | 38 | Because the bytes pragma breaks encapsulation (i.e. it exposes the innards of |
39 | how the perl executable currently happens to store a string), the byte values | |
06a2b43f JH | 40 | that result are in an unspecified encoding. |
41 | ||
42 | B<Use of this module for anything other than debugging purposes is | |
43 | strongly discouraged.> If you feel that the functions herein | |
44 | might be useful for your application, this possibly indicates a | |
45 | mismatch between your mental model of Perl Unicode and the current | |
01e331e5 KW | 46 | reality. In that case, you may wish to read some of the perl Unicode |
47 | documentation: L<perluniintro>, L<perlunitut>, L<perlunifaq> and | |
48 | L<perlunicode>. | |
a515200d | 49 | |
5bc28da9 NIS | 50 | =head1 SYNOPSIS |
51 | ||
657b208b | 52 | use bytes; |
579f6b36 JH | 53 | ... chr(...); # or bytes::chr |
54 | ... index(...); # or bytes::index | |
55 | ... length(...); # or bytes::length | |
56 | ... ord(...); # or bytes::ord | |
57 | ... rindex(...); # or bytes::rindex | |
58 | ... substr(...); # or bytes::substr | |
657b208b | 59 | no bytes; |
5bc28da9 | 60 | |
579f6b36 | 61 | |
5bc28da9 NIS | 62 | =head1 DESCRIPTION |
63 | ||
01e331e5 KW | 64 | Perl's characters are stored internally as sequences of one or more bytes. |
65 | This pragma allows for the examination of the individual bytes that together | |
66 | comprise a character. | |
67 | ||
68 | Originally the pragma was designed for the loftier goal of helping incorporate | |
69 | Unicode into Perl, but the approach that used it was found to be defective, | |
70 | and the one remaining legitimate use is for debugging when you need to | |
71 | non-destructively examine characters' individual bytes. Just insert this | |
72 | pragma temporarily, and remove it after the debugging is finished. | |
73 | ||
74 | The original usage can be accomplished by explicit (rather than this pragma's | |
ab473f03 | 75 | implicit) encoding using the L<Encode> module: |
01e331e5 KW | 76 | |
77 | use Encode qw/encode/; | |
78 | ||
79 | my $utf8_byte_string = encode "UTF8", $string; | |
80 | my $latin1_byte_string = encode "Latin1", $string; | |
81 | ||
82 | Or, if performance is needed and you are only interested in the UTF-8 | |
83 | representation: | |
84 | ||
01e331e5 | 85 | utf8::encode(my $utf8_byte_string = $string); |
393fec97 | 86 | |
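As a minimal sketch of the two approaches above, the following compares C<Encode::encode> with the in-place C<utf8::encode>. The sample string and the C<$copy> variable are illustrative assumptions, not part of this module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Illustrative input: "caf", e-acute (U+00E9), and U+0190 (code point 400).
my $string = "caf" . chr(0xE9) . chr(400);

# Encode returns a new byte string and leaves $string untouched.
my $utf8_byte_string = encode("UTF-8", $string);

# utf8::encode() converts its argument in place, so work on a copy.
my $copy = $string;
utf8::encode($copy);

printf "%d characters, %d UTF-8 bytes\n",
    length($string), length($utf8_byte_string);
print "Both approaches produced the same bytes\n"
    if $utf8_byte_string eq $copy;
```

Here the string has 5 characters and 7 UTF-8 bytes, because the last two characters each take two bytes in UTF-8.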
01e331e5 KW | 87 | C<no bytes> can be used to reverse the effect of C<use bytes> within the |
88 | current lexical scope. | |
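The lexical scoping can be sketched with nested blocks. The variable names below are hypothetical, and the byte count assumes an ASCII platform where the internal encoding is UTF-8.

```perl
use strict;
use warnings;

my $x = chr(400);
my ($with_bytes, $without_bytes);

{
    use bytes;                      # byte semantics until the closing brace
    $with_bytes = length $x;        # counts bytes of the internal encoding
    {
        no bytes;                   # character semantics in this inner block
        $without_bytes = length $x; # counts characters
    }
}

print "use bytes: $with_bytes, no bytes: $without_bytes\n";
```

On such a platform this reports 2 for the outer block and 1 for the inner one. Both pragmas stop applying at the end of their enclosing block.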
5de28535 SC | 89 | |
90 | As an example, when Perl sees C<$x = chr(400)>, it encodes the character | |
01e331e5 | 91 | in UTF-8 and stores it in C<$x>. Then it is marked as character data, so, |
5de28535 | 92 | for instance, C<length $x> returns C<1>. However, in the scope of the |
01e331e5 | 93 | C<bytes> pragma, C<$x> is treated as a series of bytes - the bytes that make |
5de28535 SC | 94 | up the UTF-8 encoding - and C<length $x> returns C<2>: |
95 | ||
01e331e5 KW | 96 | $x = chr(400); |
97 | print "Length is ", length $x, "\n"; # "Length is 1" | |
98 | printf "Contents are %vd\n", $x; # "Contents are 400" | |
99 | { | |
100 | use bytes; # or "require bytes; bytes::length()" | |
101 | print "Length is ", length $x, "\n"; # "Length is 2" | |
102 | printf "Contents are %vd\n", $x; # "Contents are 198.144 (on | |
103 | # ASCII platforms)" | |
104 | } | |
5de28535 | 105 | |
01e331e5 | 106 | C<chr()>, C<ord()>, C<substr()>, C<index()> and C<rindex()> behave similarly. |
579f6b36 | 107 | |
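To illustrate how the other functions behave, here is a small debugging sketch that calls them by their fully qualified names, so the pragma itself is never switched on. The C<@byte_values> array is a hypothetical name, and the values shown in the note apply to ASCII platforms only.

```perl
use strict;
use warnings;
use bytes ();    # load the module without turning on the pragma

my $x = chr(400);

# Walk the internal bytes one at a time with the bytes:: functions.
my @byte_values =
    map { bytes::ord(bytes::substr($x, $_, 1)) } 0 .. bytes::length($x) - 1;

print "ord:        ", ord($x), "\n";
print "bytes::ord: ", bytes::ord($x), "\n";
print "all bytes:  @byte_values\n";
```

C<ord($x)> still reports the code point 400, while C<bytes::ord($x)> reports only the first stored byte (198), and the loop recovers both bytes, 198 and 144.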
01e331e5 | 108 | For more on the implications, see L<perluniintro> and L<perlunicode>. |
579f6b36 | 109 | |
06a2b43f JH | 110 | C<bytes::length()> is admittedly handy if you need to know the |
111 | B<byte length> of a Perl scalar. A more modern way, which measures the explicit UTF-8 encoding instead, is: | |
112 | ||
113 | use Encode 'encode'; | |
114 | length(encode('UTF-8', $scalar)); | |
115 | ||
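The two measurements agree only when Perl happens to hold the string internally as UTF-8. The sketch below shows one case where they match and one where they differ. The variable names are hypothetical, and C<utf8::downgrade> is used only to make the internal single-byte storage explicit.

```perl
use strict;
use warnings;
use Encode qw(encode);
use bytes ();    # only needed for the bytes::length() comparison

# U+0190 cannot fit in one byte, so Perl stores it as UTF-8 internally.
my $scalar          = chr(400);
my $encoded_length  = length(encode('UTF-8', $scalar));
my $internal_length = bytes::length($scalar);

# e-acute (U+00E9) held internally as a single byte.
my $latin = chr(0xE9);
utf8::downgrade($latin);
my $latin_encoded_length  = length(encode('UTF-8', $latin));
my $latin_internal_length = bytes::length($latin);

print "chr(400):  encoded $encoded_length, internal $internal_length\n";
print "chr(0xE9): encoded $latin_encoded_length, ",
      "internal $latin_internal_length\n";
```

For the second string the encoded length is 2 but the internal length is 1, which is why the explicit encoding is the more predictable measure.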
579f6b36 JH | 116 | =head1 LIMITATIONS |
117 | ||
01e331e5 | 118 | C<bytes::substr()> does not work as an I<lvalue>. |
393fec97 GS | 119 | |
120 | =head1 SEE ALSO | |
121 | ||
01e331e5 | 122 | L<perluniintro>, L<perlunicode>, L<utf8>, L<Encode> |
5bc28da9 NIS | 123 | |
124 | =cut |