Commit | Line | Data |
---|---|---|
34babc16 JH |
1 | =head1 NAME |
2 | ||
3 | perlpacktut - tutorial on C<pack> and C<unpack> | |
4 | ||
5 | =head1 DESCRIPTION | |
6 | ||
7 | C<pack> and C<unpack> are two functions for transforming data according | |
8 | to a user-defined template, between the guarded way Perl stores values | |
47b6252e | 9 | and some well-defined representation as might be required in the |
34babc16 JH |
10 | environment of a Perl program. Unfortunately, they're also two of |
11 | the most misunderstood and most often overlooked functions that Perl | |
12 | provides. This tutorial will demystify them for you. | |
13 | ||
14 | ||
15 | =head1 The Basic Principle | |
16 | ||
17 | Most programming languages don't shelter the memory where variables are | |
18 | stored. In C, for instance, you can take the address of some variable, | |
19 | and the C<sizeof> operator tells you how many bytes are allocated to | |
20 | the variable. Using the address and the size, you may access the storage | |
21 | to your heart's content. | |
22 | ||
23 | In Perl, you just can't access memory at random, but the structural and | |
24 | representational conversion provided by C<pack> and C<unpack> is an | |
25 | excellent alternative. The C<pack> function converts values to a byte | |
26 | sequence containing representations according to a given specification, | |
27 | the so-called "template" argument. C<unpack> is the reverse process, | |
28 | deriving some values from the contents of a string of bytes. (Be cautioned, | |
29 | however, that not all that has been packed together can be neatly unpacked - | |
30 | a very common experience as seasoned travellers are likely to confirm.) | |
31 | ||
32 | Why, you may ask, would you need a chunk of memory containing some values | |
33 | in binary representation? One good reason is input and output accessing | |
34 | some file, a device, or a network connection, whereby this binary | |
35 | representation is either forced on you or will give you some benefit | |
36 | in processing. Another cause is passing data to some system call that | |
37 | is not available as a Perl function: C<syscall> requires you to provide | |
38 | parameters stored in the way it happens in a C program. Even text processing | |
39 | (as shown in the next section) may be simplified with judicious usage | |
40 | of these two functions. | |
41 | ||
42 | To see how (un)packing works, we'll start with a simple template | |
43 | code where the conversion is in low gear: between the contents of a byte | |
44 | sequence and a string of hexadecimal digits. Let's use C<unpack>, since | |
47b6252e | 45 | this is likely to remind you of a dump program, or some desperate last |
34babc16 JH |
46 | message unfortunate programs are wont to throw at you before they expire |
47 | into the wild blue yonder. Assuming that the variable C<$mem> holds a | |
48 | sequence of bytes that we'd like to inspect without assuming anything | |
49 | about its meaning, we can write | |
50 | ||
51 | my( $hex ) = unpack( 'H*', $mem ); | |
52 | print "$hex\n"; | |
53 | ||
54 | whereupon we might see something like this, with each pair of hex digits | |
55 | corresponding to a byte: | |
56 | ||
57 | 41204d414e204120504c414e20412043414e414c2050414e414d41 | |
58 | ||
47b6252e | 59 | What was in this chunk of memory? Numbers, characters, or a mixture of |
34babc16 JH |
60 | both? Assuming that we're on a computer where ASCII (or some similar) |
61 | encoding is used: hexadecimal values in the range C<0x41> - C<0x5A> | |
47f22e19 | 62 | indicate an uppercase letter, and C<0x20> encodes a space. So we might |
34babc16 JH |
63 | assume it is a piece of text, which some are able to read like a tabloid; |
64 | but others will have to get hold of an ASCII table and relive that | |
65 | first-grader feeling. Not caring too much about which way to read this, | |
66 | we note that C<unpack> with the template code C<H> converts the contents | |
67 | of a sequence of bytes into the customary hexadecimal notation. Since | |
68 | "a sequence of" is a pretty vague indication of quantity, C<H> has been | |
69 | defined to convert just a single hexadecimal digit unless it is followed | |
70 | by a repeat count. An asterisk for the repeat count means to use whatever | |
71 | remains. | |
72 | ||
73 | The inverse operation - packing byte contents from a string of hexadecimal | |
74 | digits - is just as easily written. For instance: | |
75 | ||
4e848ca9 | 76 | my $s = pack( 'H2' x 10, 30..39 ); |
34babc16 JH |
77 | print "$s\n"; |
78 | ||
79 | Since we feed a list of ten 2-digit hexadecimal strings to C<pack>, the | |
80 | pack template should contain ten pack codes. If this is run on a computer | |
81 | with ASCII character coding, it will print C<0123456789>. | |
82 | ||
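The two directions can be tried together. The following is a minimal sketch, assuming an ASCII platform; the sample text stored in C<$mem> is our own choice, picked so that it reproduces the dump shown earlier.

```perl
use strict;
use warnings;

# Sample data (an assumption for this sketch): the text behind the dump above.
my $mem = 'A MAN A PLAN A CANAL PANAMA';

# Bytes to hex digits: H* converts every byte, two digits per byte.
my ($hex) = unpack('H*', $mem);

# Hex digits back to bytes.
my $back = pack('H*', $hex);

print "$hex\n";
print "$back\n";
```

Since C<H*> consumes the whole string in either direction, no repeat count has to be worked out by hand.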
34babc16 JH |
83 | =head1 Packing Text |
84 | ||
85 | Let's suppose you've got to read in a data file like this: | |
86 | ||
87 | Date      |Description                | Income|Expenditure | |
9567c12e | 88 | 01/24/2001 Zed's Camel Emporium                    1147.99 |
34babc16 JH |
89 | 01/28/2001 Flea spray                                24.99 |
90 | 01/29/2001 Camel rides to tourists     1235.00 | |
91 | ||
92 | How do we do it? You might think first to use C<split>; however, since | |
93 | C<split> collapses blank fields, you'll never know whether a record was | |
94 | income or expenditure. Oops. Well, you could always use C<substr>: | |
95 | ||
96 | while (<>) { | |
97 | my $date = substr($_, 0, 11); | |
98 | my $desc = substr($_, 12, 27); | |
99 | my $income = substr($_, 40, 7); | |
100 | my $expend = substr($_, 52, 7); | |
101 | ... | |
102 | } | |
103 | ||
104 | It's not really a barrel of laughs, is it? In fact, it's worse than it | |
105 | may seem; the eagle-eyed may notice that the first field should only be | |
106 | 10 characters wide, and the error has propagated right through the other | |
107 | numbers - which we've had to count by hand. So it's error-prone as well | |
108 | as horribly unfriendly. | |
109 | ||
110 | Or maybe we could use regular expressions: | |
7207e29d | 111 | |
34babc16 JH |
112 | while (<>) { |
113 | my($date, $desc, $income, $expend) = | |
114 | m|(\d\d/\d\d/\d{4}) (.{27}) (.{7})(.*)|; | |
115 | ... | |
116 | } | |
117 | ||
118 | Urgh. Well, it's a bit better, but - well, would you want to maintain | |
119 | that? | |
120 | ||
121 | Hey, isn't Perl supposed to make this sort of thing easy? Well, it does, | |
122 | if you use the right tools. C<pack> and C<unpack> are designed to help | |
123 | you out when dealing with fixed-width data like the above. Let's have a | |
124 | look at a solution with C<unpack>: | |
125 | ||
126 | while (<>) { | |
127 | my($date, $desc, $income, $expend) = unpack("A10xA27xA7xA*", $_); | |
128 | ... | |
129 | } | |
130 | ||
131 | That looks a bit nicer; but we've got to take apart that weird template. | |
132 | Where did I pull that out of? | |
133 | ||
134 | OK, let's have a look at some of our data again; in fact, we'll include | |
135 | the headers, and a handy ruler so we can keep track of where we are. | |
136 | ||
137 |          1         2         3         4         5 | |
138 | 1234567890123456789012345678901234567890123456789012345678 | |
139 | Date      |Description                | Income|Expenditure | |
140 | 01/28/2001 Flea spray                                24.99 | |
141 | 01/29/2001 Camel rides to tourists     1235.00 | |
142 | ||
47b6252e | 143 | From this, we can see that the date column stretches from column 1 to |
34babc16 JH |
144 | column 10 - ten characters wide. The C<pack>-ese for "character" is |
145 | C<A>, and ten of them are C<A10>. So if we just wanted to extract the | |
146 | dates, we could say this: | |
147 | ||
148 | my($date) = unpack("A10", $_); | |
149 | ||
150 | OK, what's next? Between the date and the description is a blank column; | |
151 | we want to skip over that. The C<x> template means "skip forward", so we | |
152 | want one of those. Next, we have another batch of characters, from 12 to | |
153 | 38. That's 27 more characters, hence C<A27>. (Don't make the fencepost | |
154 | error - there are 27 characters between 12 and 38, not 26. Count 'em!) | |
155 | ||
156 | Now we skip another character and pick up the next 7 characters: | |
157 | ||
158 | my($date,$description,$income) = unpack("A10xA27xA7", $_); | |
159 | ||
160 | Now comes the clever bit. Lines in our ledger which are just income and | |
161 | not expenditure might end at column 46. Hence, we don't want to tell our | |
162 | C<unpack> pattern that we B<need> to find another 12 characters; we'll | |
163 | just say "if there's anything left, take it". As you might guess from | |
164 | regular expressions, that's what the C<*> means: "use everything | |
165 | remaining". | |
166 | ||
167 | =over 3 | |
168 | ||
169 | =item * | |
170 | ||
171 | Be warned, though, that unlike regular expressions, if the C<unpack> | |
172 | template doesn't match the incoming data, Perl will scream and die. | |
173 | ||
174 | =back | |
175 | ||
176 | ||
177 | Hence, putting it all together: | |
178 | ||
555bd962 BG |
179 | my ($date, $description, $income, $expend) = |
180 | unpack("A10xA27xA7xA*", $_); | |
34babc16 JH |
181 | |
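To see the complete template at work, here is a small sketch. The two records are built with C<sprintf> (our own device, not part of the ledger program) so that the column widths of 10, 27, 7 and 11 characters are spelled out rather than counted by eye.

```perl
use strict;
use warnings;

# Two fixed-width records: one with only an expenditure, one with only an income.
my @ledger = (
    sprintf("%-10s %-27s %7s %11s\n", '01/28/2001', 'Flea spray', '', '24.99'),
    sprintf("%-10s %-27s %7s\n", '01/29/2001', 'Camel rides to tourists', '1235.00'),
);

my @records;
for my $line (@ledger) {
    # A10, A27, A7 pick up the fields; each x skips one separator byte.
    my ($date, $desc, $income, $expend) = unpack('A10xA27xA7xA*', $line);
    push @records, [ $date, $desc, $income, $expend ];
    printf "%s|%s|%s|%s\n", $date, $desc, $income, $expend;
}
```

Note that C<A> strips trailing blanks only, so a right-justified expenditure keeps its leading blanks; that is harmless once the value is used as a number. In the income-only record the last C<x> happens to skip the newline, which is why the template does not run off the end of the string.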
182 | Now, that's our data parsed. I suppose what we might want to do now is | |
183 | total up our income and expenditure, and add another line to the end of | |
184 | our ledger - in the same format - saying how much we've brought in and | |
185 | how much we've spent: | |
186 | ||
187 | while (<>) { | |
555bd962 BG |
188 | my ($date, $desc, $income, $expend) = |
189 | unpack("A10xA27xA7xA*", $_); | |
34babc16 JH |
190 | $tot_income += $income; |
191 | $tot_expend += $expend; | |
192 | } | |
193 | ||
194 | $tot_income = sprintf("%.2f", $tot_income); # Get them into | |
195 | $tot_expend = sprintf("%.2f", $tot_expend); # "financial" format | |
196 | ||
197 | $date = POSIX::strftime("%m/%d/%Y", localtime); | |
198 | ||
199 | # OK, let's go: | |
200 | ||
555bd962 BG |
201 | print pack("A10xA27xA7xA*", $date, "Totals", |
202 | $tot_income, $tot_expend); | |
34babc16 JH |
203 | |
204 | Oh, hmm. That didn't quite work. Let's see what happened: | |
205 | ||
9567c12e | 206 | 01/24/2001 Zed's Camel Emporium                    1147.99 |
34babc16 JH |
207 | 01/28/2001 Flea spray                                24.99 |
208 | 01/29/2001 Camel rides to tourists     1235.00 | |
209 | 03/23/2001Totals                     1235.001172.98 | |
210 | ||
211 | OK, it's a start, but what happened to the spaces? We put C<x>, didn't | |
212 | we? Shouldn't it skip forward? Let's look at what L<perlfunc/pack> says: | |
213 | ||
214 | x A null byte. | |
215 | ||
216 | Urgh. No wonder. There's a big difference between "a null byte", | |
217 | character zero, and "a space", character 32. Perl's put something | |
218 | between the date and the description - but unfortunately, we can't see | |
219 | it! | |
220 | ||
221 | What we actually need to do is expand the width of the fields. The C<A> | |
222 | format pads any non-existent characters with spaces, so we can use the | |
223 | additional spaces to line up our fields, like this: | |
224 | ||
555bd962 BG |
225 | print pack("A11 A28 A8 A*", $date, "Totals", |
226 | $tot_income, $tot_expend); | |
34babc16 JH |
227 | |
228 | (Note that you can put spaces in the template to make it more readable, | |
229 | but they don't translate to spaces in the output.) Here's what we got | |
230 | this time: | |
231 | ||
9567c12e | 232 | 01/24/2001 Zed's Camel Emporium                    1147.99 |
34babc16 JH |
233 | 01/28/2001 Flea spray                                24.99 |
234 | 01/29/2001 Camel rides to tourists     1235.00 | |
235 | 03/23/2001 Totals                      1235.00 1172.98 | |
236 | ||
237 | That's a bit better, but we still have that last column which needs to | |
238 | be moved further over. There's an easy way to fix this up: | |
239 | unfortunately, we can't get C<pack> to right-justify our fields, but we | |
240 | can get C<sprintf> to do it: | |
241 | ||
242 | $tot_income = sprintf("%.2f", $tot_income); | |
243 | $tot_expend = sprintf("%11.2f", $tot_expend); | |
244 | $date = POSIX::strftime("%m/%d/%Y", localtime); | |
555bd962 BG |
245 | print pack("A11 A28 A8 A*", $date, "Totals", |
246 | $tot_income, $tot_expend); | |
34babc16 JH |
247 | |
248 | This time we get the right answer: | |
249 | ||
250 | 01/28/2001 Flea spray                                24.99 | |
251 | 01/29/2001 Camel rides to tourists     1235.00 | |
252 | 03/23/2001 Totals                      1235.00     1172.98 | |
253 | ||
254 | So that's how we consume and produce fixed-width data. Let's recap what | |
255 | we've seen of C<pack> and C<unpack> so far: | |
256 | ||
257 | =over 3 | |
258 | ||
259 | =item * | |
260 | ||
261 | Use C<pack> to go from several pieces of data to one fixed-width | |
262 | version; use C<unpack> to turn a fixed-width-format string into several | |
263 | pieces of data. | |
264 | ||
265 | =item * | |
266 | ||
267 | The pack format C<A> means "any character"; if you're C<pack>ing and | |
268 | you've run out of things to pack, C<pack> will fill the rest up with | |
269 | spaces. | |
270 | ||
271 | =item * | |
272 | ||
273 | C<x> means "skip a byte" when C<unpack>ing; when C<pack>ing, it means | |
274 | "introduce a null byte" - that's probably not what you mean if you're | |
275 | dealing with plain text. | |
276 | ||
277 | =item * | |
278 | ||
279 | You can follow the formats with numbers to say how many characters | |
280 | should be affected by that format: C<A12> means "take 12 characters"; | |
281 | C<x6> means "skip 6 bytes" or "character 0, 6 times". | |
282 | ||
283 | =item * | |
284 | ||
285 | Instead of a number, you can use C<*> to mean "consume everything else | |
286 | left". | |
287 | ||
288 | B<Warning>: when packing multiple pieces of data, C<*> only means | |
289 | "consume all of the current piece of data". That's to say | |
290 | ||
291 | pack("A*A*", $one, $two) | |
292 | ||
293 | packs all of C<$one> into the first C<A*> and then all of C<$two> into | |
294 | the second. This is a general principle: each format character | |
295 | corresponds to one piece of data to be C<pack>ed. | |
296 | ||
297 | =back | |
298 | ||
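The points of this recap can be checked in a few lines. This is only a sketch with made-up strings; the variable names are ours.

```perl
use strict;
use warnings;

my $padded = pack('A6', 'abc');            # "abc   ": A pads with spaces
my $cut    = pack('A2', 'abcdef');         # "ab": a count also truncates
my $nul    = pack('A2 x A2', 'ab', 'cd');  # "ab\0cd": x packs a null byte
my $joined = pack('A*A*', 'one', 'two');   # "onetwo": each A* takes one item

# Unpacking: A6 takes six characters and strips the trailing blanks,
# x skips one byte, and A* takes whatever remains.
my $record = 'abc   ' . ' ' . 'xyz';
my ($left, $right) = unpack('A6 x A*', $record);

print "[$padded] [$cut] [$joined] [$left] [$right]\n";
```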
299 | ||
300 | ||
301 | =head1 Packing Numbers | |
302 | ||
303 | So much for textual data. Let's get onto the meaty stuff that C<pack> | |
304 | and C<unpack> are best at: handling binary formats for numbers. There is, | |
305 | of course, not just one binary format - life would be too simple - but | |
306 | Perl will do all the finicky labor for you. | |
307 | ||
308 | ||
309 | =head2 Integers | |
310 | ||
311 | Packing and unpacking numbers implies conversion to and from some | |
312 | I<specific> binary representation. Leaving floating point numbers | |
313 | aside for the moment, the salient properties of any such representation | |
314 | are: | |
315 | ||
316 | =over 4 | |
317 | ||
318 | =item * | |
319 | ||
320 | the number of bytes used for storing the integer, | |
321 | ||
322 | =item * | |
323 | ||
324 | whether the contents are interpreted as a signed or unsigned number, | |
325 | ||
326 | =item * | |
327 | ||
328 | the byte ordering: whether the first byte is the least or most | |
329 | significant byte (or: little-endian or big-endian, respectively). | |
330 | ||
331 | =back | |
332 | ||
333 | So, for instance, to pack 20302 to a signed 16 bit integer in your | |
334 | computer's representation you write | |
335 | ||
336 | my $ps = pack( 's', 20302 ); | |
337 | ||
338 | Again, the result is a string, now containing 2 bytes. If you print | |
339 | this string (which is, generally, not recommended) you might see | |
340 | C<ON> or C<NO> (depending on your system's byte ordering) - or something | |
341 | entirely different if your computer doesn't use ASCII character encoding. | |
342 | Unpacking C<$ps> with the same template returns the original integer value: | |
343 | ||
344 | my( $s ) = unpack( 's', $ps ); | |
345 | ||
346 | This is true for all numeric template codes. But don't expect miracles: | |
47b6252e | 347 | if the packed value exceeds the allotted byte capacity, high order bits |
34babc16 JH |
348 | are silently discarded, and unpack certainly won't be able to pull them |
349 | back out of some magic hat. And, when you pack using a signed template | |
350 | code such as C<s>, an excess value may result in the sign bit | |
351 | getting set, and unpacking this will smartly return a negative value. | |
352 | ||
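A short sketch of the round trip and of what happens when a value does not fit; the numbers 40000 and 70000 are arbitrary examples of values too large for 16 bits.

```perl
use strict;
use warnings;

# Round trip through a signed 16-bit integer in native byte order.
my $ps  = pack('s', 20302);
my ($s) = unpack('s', $ps);

# 40000 needs the 16th bit, which a signed short treats as the sign bit.
my ($wrapped) = unpack('s', pack('s', 40000));     # 40000 - 65536

# 70000 needs 17 bits; the high-order bit is dropped.
my ($truncated) = unpack('S', pack('S', 70000));   # 70000 - 65536

print length($ps), " bytes; $s, $wrapped, $truncated\n";
```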
353 | 16 bits won't get you too far with integers, but there are C<l> and C<L> | |
354 | for signed and unsigned 32-bit integers. And if this is not enough and | |
355 | your system supports 64 bit integers you can push the limits much closer | |
356 | to infinity with pack codes C<q> and C<Q>. A notable exception is provided | |
357 | by pack codes C<i> and C<I> for signed and unsigned integers of the | |
358 | "local custom" variety: Such an integer will take up as many bytes as | |
359 | a local C compiler returns for C<sizeof(int)>, but it'll use I<at least> | |
360 | 32 bits. | |
361 | ||
362 | Each of the integer pack codes C<sSlLqQ> results in a fixed number of bytes, | |
363 | no matter where you execute your program. This may be useful for some | |
364 | applications, but it does not provide for a portable way to pass data | |
365 | structures between Perl and C programs (bound to happen when you call | |
366 | XS extensions or the Perl function C<syscall>), or when you read or | |
367 | write binary files. What you'll need in this case are template codes that | |
368 | depend on what your local C compiler compiles when you code C<short> or | |
369 | C<unsigned long>, for instance. These codes and their corresponding | |
370 | byte lengths are shown in the table below. Since the C standard leaves | |
371 | much leeway with respect to the relative sizes of these data types, actual | |
372 | values may vary, and that's why the values are given as expressions in | |
373 | C and Perl. (If you'd like to use values from C<%Config> in your program | |
374 | you have to import it with C<use Config>.) | |
375 | ||
376 | signed unsigned  byte length in C   byte length in Perl | |
f8b4d74f WL |
377 | s!     S!        sizeof(short)      $Config{shortsize} |
378 | i!     I!        sizeof(int)        $Config{intsize} | |
379 | l!     L!        sizeof(long)       $Config{longsize} | |
d832b8f6 | 380 | q!     Q!        sizeof(long long)  $Config{longlongsize} |
34babc16 JH |
381 | |
382 | The C<i!> and C<I!> codes aren't different from C<i> and C<I>; they are | |
383 | tolerated for completeness' sake. | |
384 | ||
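If you are curious what these sizes are on your own machine, a sketch like the following compares the length of a packed value with the corresponding C<%Config> entry. The C<q!> code is left out on purpose, since not every build supports 64-bit integers.

```perl
use strict;
use warnings;
use Config;

# Native-size codes and the %Config entries that describe them.
my %native = (
    's!' => $Config{shortsize},
    'i!' => $Config{intsize},
    'l!' => $Config{longsize},
);

my %packed_len;
for my $code (sort keys %native) {
    $packed_len{$code} = length pack($code, 0);
    printf "%-2s packs to %d byte(s), %%Config says %d\n",
        $code, $packed_len{$code}, $native{$code};
}
```

The plain codes C<s> and C<l>, by contrast, always yield 2 and 4 bytes.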
385 | ||
386 | =head2 Unpacking a Stack Frame | |
387 | ||
388 | Requesting a particular byte ordering may be necessary when you work with | |
47f22e19 | 389 | binary data coming from some specific architecture whereas your program could |
34babc16 JH |
390 | run on a totally different system. As an example, assume you have 24 bytes |
391 | containing a stack frame as it happens on an Intel 8086: | |
392 | ||
393 |       +---------+        +----+----+               +---------+ | |
394 |  TOS: |   IP    |  TOS+4:| FL | FH | FLAGS  TOS+14:|   SI    | | |
395 |       +---------+        +----+----+               +---------+ | |
396 |       |   CS    |        | AL | AH | AX            |   DI    | | |
397 |       +---------+        +----+----+               +---------+ | |
398 |                          | BL | BH | BX            |   BP    | | |
399 |                          +----+----+               +---------+ | |
400 |                          | CL | CH | CX            |   DS    | | |
401 |                          +----+----+               +---------+ | |
402 |                          | DL | DH | DX            |   ES    | | |
403 |                          +----+----+               +---------+ | |
404 | ||
405 | First, we note that this time-honored 16-bit CPU uses little-endian order, | |
406 | and that's why the low order byte is stored at the lower address. To | |
9dc383df | 407 | unpack such a (unsigned) short we'll have to use code C<v>. A repeat |
34babc16 JH |
408 | count unpacks all 12 shorts: |
409 | ||
410 | my( $ip, $cs, $flags, $ax, $bx, $cx, $dx, $si, $di, $bp, $ds, $es ) = | |
411 | unpack( 'v12', $frame ); | |
412 | ||
413 | Alternatively, we could have used C<C> to unpack the individually | |
414 | accessible byte registers FL, FH, AL, AH, etc.: | |
415 | ||
416 | my( $fl, $fh, $al, $ah, $bl, $bh, $cl, $ch, $dl, $dh ) = | |
417 | unpack( 'C10', substr( $frame, 4, 10 ) ); | |
418 | ||
419 | It would be nice if we could do this in one fell swoop: unpack a short, | |
420 | back up a little, and then unpack 2 bytes. Since Perl I<is> nice, it | |
421 | proffers the template code C<X> to back up one byte. Putting this all | |
422 | together, we may now write: | |
423 | ||
424 | my( $ip, $cs, | |
425 | $flags,$fl,$fh, | |
426 | $ax,$al,$ah, $bx,$bl,$bh, $cx,$cl,$ch, $dx,$dl,$dh, | |
427 | $si, $di, $bp, $ds, $es ) = | |
428 | unpack( 'v2' . ('vXXCC' x 5) . 'v5', $frame ); | |
429 | ||
49704364 WL |
430 | (The clumsy construction of the template can be avoided - just read on!) |
431 | ||
47f22e19 | 432 | We've taken some pains to construct the template so that it matches |
34babc16 JH |
433 | the contents of our frame buffer. Otherwise we'd either get undefined values, |
434 | or C<unpack> could not unpack all. If C<pack> runs out of items, it will | |
47f22e19 WL |
435 | supply null strings (which are coerced into zeroes whenever the pack code |
436 | says so). | |
34babc16 JH |
437 | |
438 | ||
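To try the combined template without an 8086 at hand, we can fabricate a frame first. The register values below are invented for this sketch; only the layout follows the diagram.

```perl
use strict;
use warnings;

# Twelve little-endian shorts: IP, CS, FLAGS, AX, BX, CX, DX, SI, DI, BP, DS, ES.
my $frame = pack('v12',
    0x0100, 0x2000, 0x0246,
    0x1234, 0x5678, 0x9ABC, 0xDEF0,
    1, 2, 3, 4, 5);

# v reads a short, XX backs up two bytes, CC reads the same bytes singly.
my( $ip, $cs,
    $flags, $fl, $fh,
    $ax, $al, $ah, $bx, $bl, $bh, $cx, $cl, $ch, $dx, $dl, $dh,
    $si, $di, $bp, $ds, $es ) =
    unpack('v2' . ('vXXCC' x 5) . 'v5', $frame);

printf "AX=%04X AL=%02X AH=%02X\n", $ax, $al, $ah;
```

Because the frame is little-endian, the low byte (AL) comes first and the high byte (AH) second.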
439 | =head2 How to Eat an Egg on a Net | |
440 | ||
441 | The pack code for big-endian (high order byte at the lowest address) is | |
442 | C<n> for 16 bit and C<N> for 32 bit integers. You use these codes | |
443 | if you know that your data comes from a compliant architecture, but, | |
444 | surprisingly enough, you should also use these pack codes if you | |
445 | exchange binary data, across the network, with some system that you | |
446 | know next to nothing about. The simple reason is that this | |
447 | order has been chosen as the I<network order>, and all standard-fearing | |
448 | programs ought to follow this convention. (This is, of course, a stern | |
449 | backing for one of the Lilliputian parties and may well influence the | |
450 | political development there.) So, if the protocol expects you to send | |
451 | a message by sending the length first, followed by just so many bytes, | |
452 | you could write: | |
453 | ||
454 | my $buf = pack( 'N', length( $msg ) ) . $msg; | |
455 | ||
456 | or even: | |
457 | ||
458 | my $buf = pack( 'NA*', length( $msg ), $msg ); | |
459 | ||
460 | and pass C<$buf> to your send routine. Some protocols demand that the | |
461 | count should include the length of the count itself: then just add 4 | |
462 | to the data length. (But make sure to read L<"Lengths and Widths"> before | |
463 | you really code this!) | |
464 | ||
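The receiving side reverses the process. A minimal sketch, with a made-up message and no actual network involved:

```perl
use strict;
use warnings;

my $msg = 'hello, world';

# Sender: a 32-bit big-endian length, followed by the message bytes.
my $buf = pack('N', length($msg)) . $msg;

# Receiver: read the length first, then take that many bytes after it.
my ($len)   = unpack('N', $buf);
my $payload = substr($buf, 4, $len);

print unpack('H8', $buf), " $len $payload\n";
```

The first four bytes are the same on every platform, which is the whole point of using C<N>.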
465 | ||
59f20ca5 MHM |
466 | =head2 Byte-order modifiers |
467 | ||
468 | In the previous sections we've learned how to use C<n>, C<N>, C<v> and | |
469 | C<V> to pack and unpack integers with big- or little-endian byte-order. | |
470 | While this is nice, it's still rather limited because it leaves out all | |
471 | kinds of signed integers as well as 64-bit integers. For example, if you | |
472 | wanted to unpack a sequence of signed big-endian 16-bit integers in a | |
473 | platform-independent way, you would have to write: | |
474 | ||
475 | my @data = unpack 's*', pack 'S*', unpack 'n*', $buf; | |
476 | ||
c4ecfaf1 | 477 | This is ugly. As of Perl 5.9.2, there's a much nicer way to express your |
59f20ca5 MHM |
478 | desire for a certain byte-order: the C<E<gt>> and C<E<lt>> modifiers. |
479 | C<E<gt>> is the big-endian modifier, while C<E<lt>> is the little-endian | |
480 | modifier. Using them, we could rewrite the above code as: | |
481 | ||
482 | my @data = unpack 's>*', $buf; | |
483 | ||
484 | As you can see, the "big end" of the arrow touches the C<s>, which is a | |
485 | nice way to remember that C<E<gt>> is the big-endian modifier. The same | |
486 | obviously works for C<E<lt>>, where the "little end" touches the code. | |
487 | ||
488 | You will probably find these modifiers even more useful if you have | |
489 | to deal with big- or little-endian C structures. Be sure to read | |
490 | L<"Packing and Unpacking C Structures"> for more on that. | |
491 | ||
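A quick sketch comparing the two spellings, assuming a perl recent enough to know the modifiers. The buffer holds three big-endian 16-bit values chosen so that two of them come out negative when read as signed.

```perl
use strict;
use warnings;

# Three big-endian shorts: 0x0001, 0xFFFF, 0x8000.
my $buf = pack('n*', 1, 0xFFFF, 0x8000);

# The roundabout way: unsigned big-endian, repack natively, read as signed.
my @old = unpack 's*', pack 'S*', unpack 'n*', $buf;

# The direct way, using the big-endian modifier.
my @new = unpack 's>*', $buf;

print "@old\n@new\n";
```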
34babc16 JH |
492 | |
493 | =head2 Floating point Numbers | |
494 | ||
495 | For packing floating point numbers you have the choice between the | |
59f20ca5 MHM |
496 | pack codes C<f>, C<d>, C<F> and C<D>. C<f> and C<d> pack into (or unpack |
497 | from) single-precision or double-precision representation as it is provided | |
498 | by your system. If your system supports it, C<D> can be used to pack and | |
499 | unpack extended-precision floating point values (C<long double>), which | |
500 | can offer even more resolution than C<f> or C<d>. C<F> packs an C<NV>, | |
501 | which is the floating point type used by Perl internally. (There | |
34babc16 JH |
502 | is no such thing as a network representation for reals, so if you want |
503 | to send your real numbers across computer boundaries, you'd better stick | |
504 | to ASCII representation, unless you're absolutely sure what's on the other | |
59f20ca5 MHM |
505 | end of the line. For the even more adventuresome, you can use the byte-order |
506 | modifiers from the previous section also on floating point codes.) | |
34babc16 JH |
507 | |
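The difference in resolution is easy to observe. In this sketch the value 0.1 is just a convenient number that has no exact binary representation:

```perl
use strict;
use warnings;

my $x = 0.1;

# F uses Perl's own floating point type, so the round trip is exact.
my ($nv) = unpack('F', pack('F', $x));

# f squeezes the value into single precision and loses some digits.
my ($single) = unpack('f', pack('f', $x));

printf "%.17g\n%.17g\n", $nv, $single;
```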
508 | ||
509 | ||
510 | =head1 Exotic Templates | |
511 | ||
512 | ||
513 | =head2 Bit Strings | |
514 | ||
515 | Bits are the atoms in the memory world. Access to individual bits may | |
516 | have to be used either as a last resort or because it is the most | |
517 | convenient way to handle your data. Bit string (un)packing converts | |
518 | between strings containing a series of C<0> and C<1> characters and | |
519 | a sequence of bytes each containing a group of 8 bits. This is almost | |
520 | as simple as it sounds, except that there are two ways the contents of | |
521 | a byte may be written as a bit string. Let's have a look at an annotated | |
522 | byte: | |
523 | ||
524 |    7 6 5 4 3 2 1 0 | |
525 |  +-----------------+ | |
526 |  | 1 0 0 0 1 1 0 0 | | |
527 |  +-----------------+ | |
528 |   MSB           LSB | |
529 | ||
530 | It's egg-eating all over again: Some think that as a bit string this should | |
531 | be written "10001100" i.e. beginning with the most significant bit, others | |
532 | insist on "00110001". Well, Perl isn't biased, so that's why we have two bit | |
533 | string codes: | |
534 | ||
535 | $byte = pack( 'B8', '10001100' ); # start with MSB | |
536 | $byte = pack( 'b8', '00110001' ); # start with LSB | |
537 | ||
538 | It is not possible to pack or unpack bit fields - just integral bytes. | |
539 | C<pack> always starts at the next byte boundary and "rounds up" to the | |
540 | next multiple of 8 by adding zero bits as required. (If you do want bit | |
541 | fields, there is L<perlfunc/vec>. Or you could implement bit field | |
542 | handling at the character string level, using split, substr, and | |
543 | concatenation on unpacked bit strings.) | |
544 | ||
545 | To illustrate unpacking for bit strings, we'll decompose a simple | |
546 | status register (a "-" stands for a "reserved" bit): | |
547 | ||
548 | +-----------------+-----------------+ | |
549 | | S Z - A - P - C | - - - - O D I T | | |
550 | +-----------------+-----------------+ | |
551 |  MSB           LSB MSB           LSB | |
552 | ||
553 | Converting these two bytes to a string can be done with the unpack | |
554 | template C<'b16'>. To obtain the individual bit values from the bit | |
47f22e19 | 555 | string we use C<split> with the "empty" separator pattern which dissects |
34babc16 JH |
556 | into individual characters. Bit values from the "reserved" positions are |
557 | simply assigned to C<undef>, a convenient notation for "I don't care where | |
558 | this goes". | |
559 | ||
49704364 | 560 | ($carry, undef, $parity, undef, $auxcarry, undef, $zero, $sign, |
34babc16 | 561 | $trace, $interrupt, $direction, $overflow) = |
47f22e19 | 562 | split( //, unpack( 'b16', $status ) ); |
34babc16 JH |
563 | |
564 | We could have used an unpack template C<'b12'> just as well, since the | |
565 | last 4 bits can be ignored anyway. | |
566 | ||
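Both points can be tried in one go. In this sketch the status word is fabricated with C<B16> so that only the sign, carry and overflow bits are set; that choice is ours, made purely for illustration.

```perl
use strict;
use warnings;

# The same byte, written MSB first and LSB first.
my $msb_first = pack('B8', '10001100');
my $lsb_first = pack('b8', '00110001');

# A made-up status word: S and C set in the first byte, O in the second.
my $status = pack('B16', '10000001' . '00001000');

# b16 lists each byte's bits starting with the LSB.
my ($carry, undef, $parity, undef, $auxcarry, undef, $zero, $sign,
    $trace, $interrupt, $direction, $overflow) =
    split( //, unpack( 'b16', $status ) );

printf "byte=%02X carry=%s sign=%s overflow=%s zero=%s\n",
    ord($msb_first), $carry, $sign, $overflow, $zero;
```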
567 | ||
568 | =head2 Uuencoding | |
569 | ||
99f65d01 | 570 | Another odd-man-out in the template alphabet is C<u>, which packs a |
34babc16 JH |
571 | "uuencoded string". ("uu" is short for Unix-to-Unix.) Chances are that |
572 | you won't ever need this encoding technique which was invented to overcome | |
573 | the shortcomings of old-fashioned transmission mediums that do not support | |
574 | other than simple ASCII data. The essential recipe is simple: Take three | |
575 | bytes, or 24 bits. Split them into 4 six-packs, adding a space (0x20) to | |
576 | each. Repeat until all of the data is blended. Fold groups of 4 bytes into | |
577 | lines no longer than 60 and garnish them in front with the original byte count | |
578 | (incremented by 0x20) and a C<"\n"> at the end. - The C<pack> chef will | |
579 | prepare this for you, a la minute, when you select pack code C<u> on the menu: | |
580 | ||
581 | my $uubuf = pack( 'u', $bindat ); | |
582 | ||
583 | A repeat count after C<u> sets the number of bytes to put into an | |
584 | uuencoded line, which is the maximum of 45 by default, but could be | |
585 | set to some (smaller) integer multiple of three. C<unpack> simply ignores | |
586 | the repeat count. | |
587 | ||
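A small sketch of the recipe in action; the 100 bytes of test data are arbitrary.

```perl
use strict;
use warnings;

# 100 bytes of binary data: byte values 0 through 99.
my $bindat = join '', map { chr } 0 .. 99;

# Default line length: 45 data bytes per line, so 45 + 45 + 10.
my $uubuf = pack('u', $bindat);
my @lines = split /\n/, $uubuf;

# A repeat count of 30 gives shorter lines: 30 + 30 + 30 + 10.
my @short = split /\n/, pack('u30', $bindat);

# unpack restores the original bytes.
my $back = unpack('u', $uubuf);

printf "%d lines, first starts with '%s'\n",
    scalar(@lines), substr($lines[0], 0, 1);
```

A full line starts with C<M>, which is the byte count 45 incremented by 0x20, followed by 60 encoded characters.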
588 | ||
589 | =head2 Doing Sums | |
590 | ||
591 | An even stranger template code is C<%>E<lt>I<number>E<gt>. First, because | |
592 | it's used as a prefix to some other template code. Second, because it | |
593 | cannot be used in C<pack> at all, and third, because in C<unpack> it doesn't return the | |
594 | data as defined by the template code it precedes. Instead it'll give you an | |
595 | integer of I<number> bits that is computed from the data value by | |
596 | doing sums. For numeric unpack codes, no big feat is achieved: | |
597 | ||
598 | my $buf = pack( 'iii', 100, 20, 3 ); | |
599 | print unpack( '%32i3', $buf ), "\n"; # prints 123 | |
600 | ||
601 | For string values, C<%> returns the sum of the byte values saving | |
602 | you the trouble of a sum loop with C<substr> and C<ord>: | |
603 | ||
604 | print unpack( '%32A*', "\x01\x10" ), "\n"; # prints 17 | |
605 | ||
606 | Although the C<%> code is documented as returning a "checksum": | |
607 | don't put your trust in such values! Even when applied to a small number | |
608 | of bytes, they won't guarantee a noticeable Hamming distance. | |
609 | ||
610 | In connection with C<b> or C<B>, C<%> simply adds bits, and this can be put | |
611 | to good use to count set bits efficiently: | |
612 | ||
613 | my $bitcount = unpack( '%32b*', $mask ); | |
614 | ||
615 | And an even parity bit can be determined like this: | |
616 | ||
617 | my $evenparity = unpack( '%1b*', $mask ); | |
618 | ||
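Here is a sketch that exercises the sums on known data. The mask is written out as a bit string so that the number of set bits, five, can be checked by eye.

```perl
use strict;
use warnings;

# Two bytes with five bits set in total.
my $mask = pack('B*', '1011000100000001');

my $bitcount   = unpack('%32b*', $mask);   # number of set bits
my $evenparity = unpack('%1b*',  $mask);   # the same sum, reduced to one bit

# Byte-wise sum of a string: 65 + 66 + 67 on an ASCII system.
my $sum = unpack('%32C*', 'ABC');

print "$bitcount $evenparity $sum\n";
```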
619 | ||
620 | =head2 Unicode | |
621 | ||
622 | Unicode is a character set that can represent most characters in most of | |
623 | the world's languages, providing room for over one million different | |
624 | characters. Unicode 3.1 specifies 94,140 characters: The Basic Latin | |
625 | characters are assigned to the numbers 0 - 127. The Latin-1 Supplement with | |
626 | characters that are used in several European languages is in the next | |
627 | range, up to 255. After some more Latin extensions we find the character | |
47b6252e | 628 | sets from languages using non-Roman alphabets, interspersed with a |
34babc16 | 629 | variety of symbol sets such as currency symbols, Zapf Dingbats or Braille. |
f979aebc | 630 | (You might want to visit L<http://www.unicode.org/> for a look at some of |
34babc16 JH |
631 | them - my personal favourites are Telugu and Kannada.) |
632 | ||
633 | The Unicode character set associates characters with integers. Encoding | |
634 | these numbers in an equal number of bytes would more than double the | |
47b6252e | 635 | requirements for storing texts written in Latin alphabets. |
34babc16 JH |
636 | The UTF-8 encoding avoids this by storing the most common (from a western |
637 | point of view) characters in a single byte while encoding the rarer | |
638 | ones in two to four bytes. | |
639 | ||
2575c402 JW |
640 | Perl uses UTF-8, internally, for most Unicode strings. |
641 | ||
642 | So what has this got to do with C<pack>? Well, if you want to compose a | |
643 | Unicode string (that is internally encoded as UTF-8), you can do so by | |
644 | using template code C<U>. As an example, let's produce the Euro currency | |
645 | symbol (code number 0x20AC): | |
34babc16 JH |
646 | |
647 | $UTF8{Euro} = pack( 'U', 0x20AC ); | |
2575c402 | 648 | # Equivalent to: $UTF8{Euro} = "\x{20ac}"; |
34babc16 | 649 | |
2575c402 JW |
650 | Inspecting C<$UTF8{Euro}> shows that it contains 3 bytes: |
651 | "\xe2\x82\xac". However, it contains only 1 character, number 0x20AC. | |
652 | The round trip can be completed with C<unpack>: | |
34babc16 JH |
653 | |
654 | $Unicode{Euro} = unpack( 'U', $UTF8{Euro} ); | |
655 | ||
2575c402 JW |
656 | Unpacking using the C<U> template code also works on UTF-8 encoded byte |
657 | strings. | |
658 | ||
34babc16 JH |
659 | Usually you'll want to pack or unpack UTF-8 strings: |
660 | ||
661 | # pack and unpack the Hebrew alphabet | |
662 | my $alefbet = pack( 'U*', 0x05d0..0x05ea ); | |
663 | my @hebrew = unpack( 'U*', $alefbet ); | |
664 | ||
2575c402 JW |
665 | Please note: in the general case, you're better off using |
666 | Encode::decode_utf8 to decode a UTF-8 encoded byte string to a Perl | |
38a44b82 | 667 | Unicode string, and Encode::encode_utf8 to encode a Perl Unicode string |
2575c402 JW |
668 | to UTF-8 bytes. These functions provide means of handling invalid byte |
669 | sequences and generally have a friendlier interface. | |
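
As a rough illustration of that advice, the following sketch takes the Euro sign through the core C<Encode> module. The variable names are ours, and the code only shows the difference between the character string and its UTF-8 encoded bytes.

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 decode_utf8 );

my $euro  = pack( 'U', 0x20AC );      # one character, U+20AC
my $bytes = encode_utf8( $euro );     # its UTF-8 encoding as a byte string
my $chars = decode_utf8( $bytes );    # and back to a character string

printf "characters: %d, bytes: %d\n", length( $euro ), length( $bytes );
printf "bytes in hex: %s\n", unpack( 'H*', $bytes );
printf "code point: U+%04X\n", ord( $chars );
```

The character string has length 1, while the encoded byte string has length 3.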
34babc16 | 670 | |
47f22e19 WL |
671 | =head2 Another Portable Binary Encoding |
672 | ||
673 | The pack code C<w> has been added to support a portable binary data | |
674 | encoding scheme that goes way beyond simple integers. (Details can | |
f979aebc | 675 | be found at L<http://Casbah.org/>, the Scarab project.) A BER (Binary Encoded |
47f22e19 WL |
676 | Representation) compressed unsigned integer stores base 128 |
677 | digits, most significant digit first, with as few digits as possible. | |
678 | Bit eight (the high bit) is set on each byte except the last. There | |
679 | is no size limit to BER encoding, but Perl won't go to extremes. | |
680 | ||
681 | my $berbuf = pack( 'w*', 1, 128, 128+1, 128*128+127 ); | |
682 | ||
683 | A hex dump of C<$berbuf>, with spaces inserted at the right places, | |
684 | shows 01 8100 8101 81807F. Since the last byte is always less than | |
685 | 128, C<unpack> knows where to stop. | |
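
The hex dump is easy to reproduce. This sketch packs the same four numbers, dumps the buffer with C<H*> and unpacks it again; nothing beyond the example above is assumed.

```perl
use strict;
use warnings;

my @numbers = ( 1, 128, 128 + 1, 128 * 128 + 127 );
my $berbuf  = pack( 'w*', @numbers );

my $hex     = unpack( 'H*', $berbuf );    # two hex digits per byte
my @decoded = unpack( 'w*', $berbuf );    # back to the original numbers

print "$hex\n";
print "@decoded\n";
```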
686 | ||
34babc16 | 687 | |
49704364 WL |
688 | =head1 Template Grouping |
689 | ||
690 | Prior to Perl 5.8, repetitions of templates had to be made by | |
691 | C<x>-multiplication of template strings. Now there is a better way as | |
692 | we may use the pack codes C<(> and C<)> combined with a repeat count. | |
693 | The C<unpack> template from the Stack Frame example can simply | |
694 | be written like this: | |
695 | ||
696 | unpack( 'v2 (vXXCC)5 v5', $frame ) | |
697 | ||
698 | Let's explore this feature a little more. We'll begin with the equivalent of | |
699 | ||
700 | join( '', map( substr( $_, 0, 1 ), @str ) ) | |
701 | ||
702 | which returns a string consisting of the first character from each string. | |
703 | Using pack, we can write | |
704 | ||
705 | pack( '(A)'.@str, @str ) | |
706 | ||
707 | or, because a repeat count C<*> means "repeat as often as required", | |
708 | simply | |
709 | ||
710 | pack( '(A)*', @str ) | |
711 | ||
712 | (Note that the template C<A*> would only have packed C<$str[0]> in full | |
713 | length.) | |
ffc145e8 | 714 | |
49704364 WL |
715 | To pack dates stored as triplets ( day, month, year ) in an array C<@dates> |
716 | into a sequence of byte, byte, short integer we can write | |
717 | ||
718 | $pd = pack( '(CCS)*', map( @$_, @dates ) ); | |
719 | ||
720 | To swap pairs of characters in a string (with even length) one could use | |
721 | several techniques. First, let's use C<x> and C<X> to skip forward and back: | |
722 | ||
723 | $s = pack( '(A)*', unpack( '(xAXXAx)*', $s ) ); | |
724 | ||
725 | We can also use C<@> to jump to an offset, with 0 being the position where | |
726 | we were when the last C<(> was encountered: | |
727 | ||
728 | $s = pack( '(A)*', unpack( '(@1A @0A @2)*', $s ) ); | |
729 | ||
730 | Finally, there is also an entirely different approach by unpacking big | |
731 | endian shorts and packing them in the reverse byte order: | |
732 | ||
733 | $s = pack( '(v)*', unpack( '(n)*', $s ) ); | |
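
The three techniques can be checked side by side. The sample string is arbitrary (but of even length), and the variable names are only for this sketch.

```perl
use strict;
use warnings;

my $s = 'abcdef';

# skip forward and back with x and X
my $by_skipping = pack( '(A)*', unpack( '(xAXXAx)*', $s ) );

# jump to offsets relative to the start of the group
my $by_offsets = pack( '(A)*', unpack( '(@1A @0A @2)*', $s ) );

# read big-endian shorts, write them little-endian
my $by_shorts = pack( '(v)*', unpack( '(n)*', $s ) );

print "$by_skipping $by_offsets $by_shorts\n";
```

All three produce the same swapped string.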
734 | ||
735 | ||
34babc16 JH |
736 | =head1 Lengths and Widths |
737 | ||
738 | =head2 String Lengths | |
739 | ||
740 | In the previous section we've seen a network message that was constructed | |
741 | by prefixing the binary message length to the actual message. You'll find | |
742 | that packing a length followed by so many bytes of data is a | |
743 | frequently used recipe since appending a null byte won't work | |
47b6252e | 744 | if a null byte may be part of the data. Here is an example where both |
34babc16 JH |
745 | techniques are used: after two null terminated strings with source and |
746 | destination address, a Short Message (to a mobile phone) is sent after | |
747 | a length byte: | |
748 | ||
749 | my $msg = pack( 'Z*Z*CA*', $src, $dst, length( $sm ), $sm ); | |
750 | ||
751 | Unpacking this message can be done with the same template: | |
752 | ||
753 | ( $src, $dst, $len, $sm ) = unpack( 'Z*Z*CA*', $msg ); | |
754 | ||
47b6252e | 755 | There's a subtle trap lurking in the offing: Adding another field after |
34babc16 JH |
756 | the Short Message (in variable C<$sm>) is all right when packing, but this |
757 | cannot be unpacked naively: | |
758 | ||
759 | # pack a message | |
760 | my $msg = pack( 'Z*Z*CA*C', $src, $dst, length( $sm ), $sm, $prio ); | |
7207e29d | 761 | |
34babc16 JH |
762 | # unpack fails - $prio remains undefined! |
763 | ( $src, $dst, $len, $sm, $prio ) = unpack( 'Z*Z*CA*C', $msg ); | |
764 | ||
765 | The pack code C<A*> gobbles up all remaining bytes, and C<$prio> remains | |
766 | undefined! Before we let disappointment dampen the morale: Perl's got | |
767 | the trump card to make this trick too, just a little further up the sleeve. | |
768 | Watch this: | |
769 | ||
770 | # pack a message: ASCIIZ, ASCIIZ, length/string, byte | |
771 | my $msg = pack( 'Z* Z* C/A* C', $src, $dst, $sm, $prio ); | |
772 | ||
773 | # unpack | |
774 | ( $src, $dst, $sm, $prio ) = unpack( 'Z* Z* C/A* C', $msg ); | |
775 | ||
776 | Combining two pack codes with a slash (C</>) associates them with a single | |
777 | value from the argument list. In C<pack>, the length of the argument is | |
778 | taken and packed according to the first code while the argument itself | |
779 | is added after being converted with the template code after the slash. | |
780 | This saves us the trouble of inserting the C<length> call, but it is | |
781 | in C<unpack> where we really score: The value of the length byte marks the | |
782 | end of the string to be taken from the buffer. Since this combination | |
f8b4d74f | 783 | doesn't make sense except when the second pack code is C<a*>, C<A*> | |
34babc16 JH |
784 | or C<Z*>, Perl won't let you. | |
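
Here is the round trip in one piece. The addresses, the text and the priority are invented sample values; the point is merely that the field after the string survives.

```perl
use strict;
use warnings;

# Made-up sample data for the sketch.
my ( $src, $dst, $sm, $prio ) = ( '12345', '67890', 'Hello, world', 2 );

# ASCIIZ, ASCIIZ, length byte followed by the string, priority byte
my $msg = pack( 'Z* Z* C/A* C', $src, $dst, $sm, $prio );

# the same template takes the message apart again
my @fields = unpack( 'Z* Z* C/A* C', $msg );

printf "%d bytes packed\n", length( $msg );
print join( '|', @fields ), "\n";
```

Note that the length byte itself does not show up in the unpacked list.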
785 | ||
786 | The pack code preceding C</> may be anything that's fit to represent a | |
787 | number: All the numeric binary pack codes, and even text codes such as | |
788 | C<A4> or C<Z*>: | |
789 | ||
790 | # pack/unpack a string preceded by its length in ASCII | |
791 | my $buf = pack( 'A4/A*', "Humpty-Dumpty" ); | |
792 | # unpack $buf: '13  Humpty-Dumpty' | |
793 | my $txt = unpack( 'A4/A*', $buf ); | |
794 | ||
47b6252e NC |
795 | C</> is not implemented in Perls before 5.6, so if your code is required to |
796 | work on older Perls you'll need to C<unpack( 'Z* Z* C')> to get the length, | |
797 | then use it to make a new unpack string. For example | |
798 | ||
555bd962 BG |
799 | # pack a message: ASCIIZ, ASCIIZ, length, string, byte |
800 | # (5.005 compatible) | |
47b6252e NC |
801 | my $msg = pack( 'Z* Z* C A* C', $src, $dst, length $sm, $sm, $prio ); |
802 | ||
803 | # unpack | |
804 | ( undef, undef, $len) = unpack( 'Z* Z* C', $msg ); | |
805 | ($src, $dst, $sm, $prio) = unpack ( "Z* Z* x A$len C", $msg ); | |
806 | ||
807 | But that second C<unpack> is rushing ahead. It isn't using a simple literal | |
808 | string for the template. So maybe we should introduce... | |
34babc16 JH |
809 | |
810 | =head2 Dynamic Templates | |
811 | ||
812 | So far, we've seen literals used as templates. If the list of pack | |
813 | items doesn't have fixed length, an expression constructing the | |
49704364 WL |
814 | template is required (whenever, for some reason, C<()*> cannot be used). |
815 | Here's an example: To store named string values in a way that can be | |
816 | conveniently parsed by a C program, we create a sequence of names and | |
817 | null terminated ASCII strings, with C<=> between the name and the value, | |
818 | followed by an additional delimiting null byte. Here's how: | |
34babc16 | 819 | |
49704364 | 820 | my $env = pack( '(A*A*Z*)' . keys( %Env ) . 'C', |
47f22e19 WL |
821 | map( { ( $_, '=', $Env{$_} ) } keys( %Env ) ), 0 ); |
822 | ||
823 | Let's examine the cogs of this byte mill, one by one. There's the C<map> | |
824 | call, creating the items we intend to stuff into the C<$env> buffer: | |
825 | to each key (in C<$_>) it adds the C<=> separator and the hash entry value. | |
826 | Each triplet is packed with the template code sequence C<A*A*Z*> that | |
49704364 | 827 | is repeated according to the number of keys. (Yes, that's what the C<keys> |
fe854a6f | 828 | function returns in scalar context.) To get the very last null byte, |
47f22e19 WL |
829 | we add a C<0> at the end of the C<pack> list, to be packed with C<C>. |
830 | (Attentive readers may have noticed that we could have omitted the 0.) | |
34babc16 JH |
831 | |
832 | For the reverse operation, we'll have to determine the number of items | |
833 | in the buffer before we can let C<unpack> rip it apart: | |
834 | ||
47b6252e | 835 | my $n = $env =~ tr/\0// - 1; |
49704364 | 836 | my %env = map( split( /=/, $_, 2 ), unpack( "(Z*)$n", $env ) ); | |
34babc16 | 837 | |
47b6252e | 838 | The C<tr> counts the null bytes. The C<unpack> call returns a list of |
47f22e19 | 839 | name-value pairs each of which is taken apart in the C<map> block. |
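
A complete round trip might look like the following sketch. The hash contents are invented, the keys are sorted only to make the result reproducible, and the limit of 2 in C<split> is our precaution for values that contain a C<=> themselves.

```perl
use strict;
use warnings;

# Sample data, made up for this sketch; note the '=' inside one value.
my %Env   = ( HOME => '/home/me', OPTS => 'a=b' );
my @names = sort keys %Env;

# name, '=', null terminated value for each entry, then a final null byte
my $env = pack( '(A*A*Z*)' . @names . 'C',
                map( { ( $_, '=', $Env{$_} ) } @names ), 0 );

# count the null bytes to learn how many entries there are
my $n    = ( $env =~ tr/\0// ) - 1;
my %back = map { split /=/, $_, 2 } unpack( "(Z*)$n", $env );

( my $printable = $env ) =~ s/\0/\\0/g;
print "$printable\n";
print "$_ => $back{$_}\n" for sort keys %back;
```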
34babc16 JH |
840 | |
841 | ||
49704364 WL |
842 | =head2 Counting Repetitions |
843 | ||
844 | Rather than storing a sentinel at the end of a data item (or a list of items), | |
845 | we could precede the data with a count. Again, we pack keys and values of | |
846 | a hash, preceding each with an unsigned short length count, and up front | |
847 | we store the number of pairs: | |
848 | ||
849 | my $env = pack( 'S(S/A* S/A*)*', scalar keys( %Env ), %Env ); | |
850 | ||
851 | This simplifies the reverse operation as the number of repetitions can be | |
852 | unpacked with the C</> code: | |
853 | ||
854 | my %env = unpack( 'S/(S/A* S/A*)', $env ); | |
855 | ||
856 | Note that this is one of the rare cases where you cannot use the same | |
857 | template for C<pack> and C<unpack> because C<pack> can't determine | |
858 | a repeat count for a C<()>-group. | |
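
The pair of templates can be tried out with a small hash. The entries are sample data chosen for this sketch; the order in which they are packed does not matter for the result.

```perl
use strict;
use warnings;

# Made-up sample data.
my %Env = ( HOME => '/home/me', SHELL => '/bin/sh' );

# number of pairs first, then each key and value with its own length count
my $env = pack( 'S(S/A* S/A*)*', scalar keys( %Env ), %Env );

# the leading count tells unpack how often to repeat the group
my %back = unpack( 'S/(S/A* S/A*)', $env );

printf "%d bytes packed\n", length( $env );
print "$_ => $back{$_}\n" for sort keys %back;
```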
859 | ||
860 | ||
aa51dd41 MB |
861 | =head2 Intel HEX |
862 | ||
863 | Intel HEX is a file format for representing binary data, mostly for | |
864 | programming various chips, as a text file. (See | |
865 | L<http://en.wikipedia.org/wiki/.hex> for a detailed description, and | |
866 | L<http://en.wikipedia.org/wiki/SREC_(file_format)> for the Motorola | |
867 | S-record format, which can be unravelled using the same technique.) | |
868 | Each line begins with a colon (':') and is followed by a sequence of | |
869 | hexadecimal characters, specifying a byte count I<n> (8 bit), | |
870 | an address (16 bit, big endian), a record type (8 bit), I<n> data bytes | |
871 | and a checksum (8 bit) computed as the least significant byte of the two's | |
872 | complement of the sum of the preceding bytes. Example: C<:0300300002337A1E>. | |
873 | ||
874 | The first step of processing such a line is the conversion, to binary, | |
875 | of the hexadecimal data, to obtain the four fields, while checking the | |
876 | checksum. No surprise here: we'll start with a simple C<pack> call to | |
877 | convert everything to binary: | |
878 | ||
879 | my $binrec = pack( 'H*', substr( $hexrec, 1 ) ); | |
880 | ||
881 | The resulting byte sequence is most convenient for checking the checksum. | |
882 | Don't slow your program down with a for loop adding the C<ord> values | |
883 | of this string's bytes - the C<unpack> code C<%> is the thing to use | |
884 | for computing the 8-bit sum of all bytes, which must be equal to zero: | |
885 | ||
886 | die unless unpack( "%8C*", $binrec ) == 0; | |
887 | ||
888 | Finally, let's get those four fields. By now, you shouldn't have any | |
889 | problems with the first three fields - but how can we use the byte count | |
890 | of the data in the first field as a length for the data field? Here | |
891 | the codes C<x> and C<X> come to the rescue, as they permit jumping | |
892 | back and forth in the string to unpack. | |
893 | ||
894 | my( $addr, $type, $data ) = unpack( "x n C X4 C x3 /a", $binrec ); | |
895 | ||
896 | Code C<x> skips a byte, since we don't need the count yet. Code C<n> takes | |
897 | care of the 16-bit big-endian integer address, and C<C> unpacks the | |
898 | record type. Being at offset 4, where the data begins, we need the count. | |
899 | C<X4> brings us back to square one, which is the byte at offset 0. | |
900 | Now we pick up the count, and zoom forth to offset 4, where we are | |
901 | now fully furnished to extract the exact number of data bytes, leaving | |
902 | the trailing checksum byte alone. | |
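
Putting the three steps together for the example record given above yields this sketch. Only the C<printf> at the end is our addition, to make the fields visible.

```perl
use strict;
use warnings;

my $hexrec = ':0300300002337A1E';

# hexadecimal text to binary, without the leading colon
my $binrec = pack( 'H*', substr( $hexrec, 1 ) );

# the 8-bit sum of all bytes, checksum included, must be zero
die "bad checksum\n" unless unpack( '%8C*', $binrec ) == 0;

# address, record type, and as many data bytes as the count says
my ( $addr, $type, $data ) = unpack( 'x n C X4 C x3 /a', $binrec );

printf "addr=%04X type=%02X data=%s\n", $addr, $type, unpack( 'H*', $data );
```

For this record the address is 0x0030, the type is 0 and the data consists of the three bytes 02 33 7A.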
903 | ||
904 | ||
905 | ||
34babc16 JH |
906 | =head1 Packing and Unpacking C Structures |
907 | ||
908 | In previous sections we have seen how to pack numbers and character | |
909 | strings. If it were not for a couple of snags we could conclude this | |
910 | section right away with the terse remark that C structures don't | |
911 | contain anything else, and therefore you already know all there is to it. | |
912 | Sorry, no: read on, please. | |
913 | ||
59f20ca5 MHM |
914 | If you have to deal with a lot of C structures, and don't want to |
915 | hack all your template strings manually, you'll probably want to have | |
916 | a look at the CPAN module C<Convert::Binary::C>. Not only can it parse | |
917 | your C source directly, but it also has built-in support for all the | |
918 | odds and ends described further on in this section. | |
919 | ||
34babc16 JH |
920 | =head2 The Alignment Pit |
921 | ||
922 | In the consideration of speed against memory requirements the balance | |
923 | has been tilted in favor of faster execution. This has influenced the | |
924 | way C compilers allocate memory for structures: On architectures | |
925 | where a 16-bit or 32-bit operand can be moved faster between places in | |
926 | memory, or to or from a CPU register, if it is aligned at an even or | |
927 | multiple-of-four or even at a multiple-of-eight address, a C compiler | |
928 | will give you this speed benefit by stuffing extra bytes into structures. | |
929 | If you don't cross the C shoreline this is not likely to cause you any | |
47b6252e NC |
930 | grief (although you should care when you design large data structures, |
931 | or you want your code to be portable between architectures (you do want | |
932 | that, don't you?)). | |
34babc16 JH |
933 | |
934 | To see how this affects C<pack> and C<unpack>, we'll compare these two | |
935 | C structures: | |
936 | ||
937 | typedef struct { | |
938 |     char c1; | |
939 |     short s; | |
940 |     char c2; | |
941 |     long l; | |
942 | } gappy_t; | |
943 | ||
944 | typedef struct { | |
945 |     long l; | |
946 |     short s; | |
947 |     char c1; | |
948 |     char c2; | |
949 | } dense_t; | |
950 | ||
951 | Typically, a C compiler allocates 12 bytes to a C<gappy_t> variable, but | |
952 | requires only 8 bytes for a C<dense_t>. After investigating this further, | |
953 | we can draw memory maps, showing where the extra 4 bytes are hidden: | |
954 | ||
955 | 0           +4          +8          +12 | |
956 | +--+--+--+--+--+--+--+--+--+--+--+--+ | |
957 | |c1|xx|  s  |c2|xx|xx|xx|     l     |    xx = fill byte | |
958 | +--+--+--+--+--+--+--+--+--+--+--+--+ | |
959 | gappy_t | |
960 | ||
961 | 0           +4          +8 | |
962 | +--+--+--+--+--+--+--+--+ | |
963 | |     l     |  s  |c1|c2| | |
964 | +--+--+--+--+--+--+--+--+ | |
965 | dense_t | |
966 | ||
967 | And that's where the first quirk strikes: C<pack> and C<unpack> | |
968 | templates have to be stuffed with C<x> codes to get those extra fill bytes. | |
969 | ||
970 | The natural question: "Why can't Perl compensate for the gaps?" warrants | |
971 | an answer. One good reason is that C compilers might provide (non-ANSI) | |
972 | extensions permitting all sorts of fancy control over the way structures | |
973 | are aligned, even at the level of an individual structure field. And, if | |
974 | this were not enough, there is an insidious thing called C<union> where | |
975 | the amount of fill bytes cannot be derived from the alignment of the next | |
976 | item alone. | |
977 | ||
978 | OK, so let's bite the bullet. Here's one way to get the alignment right | |
979 | by inserting template codes C<x>, which don't take a corresponding item | |
980 | from the list: | |
981 | ||
982 | my $gappy = pack( 'cxs cxxx l!', $c1, $s, $c2, $l ); | |
983 | ||
984 | Note the C<!> after C<l>: We want to make sure that we pack a long | |
47b6252e NC |
985 | integer as it is compiled by our C compiler. And even now, it will only |
986 | work for the platforms where the compiler aligns things as above. | |
987 | And somebody somewhere has a platform where it doesn't. | |
988 | [Probably a Cray, where C<short>s, C<int>s and C<long>s are all 8 bytes. :-)] | |
34babc16 JH |
989 | |
990 | Counting bytes and watching alignments in lengthy structures is bound to | |
991 | be a drag. Isn't there a way we can create the template with a simple | |
992 | program? Here's a C program that does the trick: | |
993 | ||
994 | #include <stdio.h> | |
995 | #include <stddef.h> | |
996 | ||
997 | typedef struct { | |
998 |     char fc1; | |
999 |     short fs; | |
1000 |     char fc2; | |
1001 |     long fl; | |
1002 | } gappy_t; | |
1003 | ||
1004 | #define Pt(type,field,tchar) \ | |
1005 |     printf( "@%d%s ", (int) offsetof(type,field), # tchar ); | |
1006 | ||
d832b8f6 | 1007 | int main() { | |
34babc16 JH |
1008 |     Pt( gappy_t, fc1, c ); | |
1009 |     Pt( gappy_t, fs, s! ); | |
1010 |     Pt( gappy_t, fc2, c ); | |
1011 |     Pt( gappy_t, fl, l! ); | |
1012 |     printf( "\n" ); | |
1013 | } | |
1014 | ||
1015 | The output line can be used as a template in a C<pack> or C<unpack> call: | |
1016 | ||
1017 | my $gappy = pack( '@0c @2s! @4c @8l!', $c1, $s, $c2, $l ); | |
1018 | ||
1019 | Gee, yet another template code - as if we hadn't plenty. But | |
1020 | C<@> saves our day by enabling us to specify the offset from the beginning | |
1021 | of the pack buffer to the next item: This is just the value | |
1022 | the C<offsetof> macro (defined in C<E<lt>stddef.hE<gt>>) returns when | |
1023 | given a C<struct> type and one of its field names ("member-designator" in | |
1024 | C standardese). | |
1025 | ||
49704364 WL |
1026 | Neither using offsets nor adding C<x>'s to bridge the gaps is satisfactory. |
1027 | (Just imagine what happens if the structure changes.) What we really need | |
1028 | is a way of saying "skip as many bytes as required to the next multiple of N". | |
1029 | In fluent Templatese, you say this with C<x!N> where N is replaced by the | |
1030 | appropriate value. Here's the next version of our struct packaging: | |
1031 | ||
1032 | my $gappy = pack( 'c x!2 s c x!4 l!', $c1, $s, $c2, $l ); | |
1033 | ||
1034 | That's certainly better, but we still have to know how long all the | |
1035 | integers are, and portability is far away. Rather than C<2>, | |
1036 | for instance, we want to say "however long a short is". But this can be | |
1037 | done by enclosing the appropriate pack code in brackets: C<[s]>. So, here's | |
1038 | the very best we can do: | |
1039 | ||
1040 | my $gappy = pack( 'c x![s] s c x![l!] l!', $c1, $s, $c2, $l ); | |
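
To see what this template produces on the machine at hand, one might run a sketch like the one below. The four field values are arbitrary, and the size of a native long is measured rather than assumed.

```perl
use strict;
use warnings;

my @fields   = ( 1, 2, 3, 4 );                # sample values for c1, s, c2, l
my $template = 'c x![s] s c x![l!] l!';

my $gappy     = pack( $template, @fields );
my $long_size = length( pack( 'l!', 0 ) );    # size of a native long

printf "native long: %d bytes, packed structure: %d bytes\n",
       $long_size, length( $gappy );

# the same template skips the fill bytes when unpacking
my @back = unpack( $template, $gappy );
print "@back\n";
```

With a 4-byte long the structure occupies 12 bytes, as in the memory map above; with an 8-byte long it grows to 16.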
1041 | ||
34babc16 | 1042 | |
59f20ca5 MHM |
1043 | =head2 Dealing with Endian-ness |
1044 | ||
1045 | Now, imagine that we want to pack the data for a machine with a | |
1046 | different byte-order. First, we'll have to figure out how big the data | |
1047 | types on the target machine really are. Let's assume that the longs are | |
1048 | 32 bits wide and the shorts are 16 bits wide. You can then rewrite the | |
1049 | template as: | |
1050 | ||
1051 | my $gappy = pack( 'c x![s] s c x![l] l', $c1, $s, $c2, $l ); | |
1052 | ||
1053 | If the target machine is little-endian, we could write: | |
1054 | ||
1055 | my $gappy = pack( 'c x![s] s< c x![l] l<', $c1, $s, $c2, $l ); | |
1056 | ||
1057 | This forces the short and the long members to be little-endian, and is | |
1058 | just fine if you don't have too many struct members. But we could also | |
1059 | use the byte-order modifier on a group and write the following: | |
1060 | ||
1061 | my $gappy = pack( '( c x![s] s c x![l] l )<', $c1, $s, $c2, $l ); | |
1062 | ||
1063 | This is not as short as before, but it makes it more obvious that we | |
1064 | intend to have little-endian byte-order for a whole group, not only | |
1065 | for individual template codes. It can also be more readable and easier | |
1066 | to maintain. | |
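
The effect of the group modifier shows up nicely in a hex dump. In this sketch the field values are chosen so that every byte is recognizable; the big-endian variant with C<< > >> is included only for comparison.

```perl
use strict;
use warnings;

# Sample values: each byte of the short and the long is distinct.
my @fields = ( 1, 0x0203, 4, 0x05060708 );

my $little = pack( '( c x![s] s c x![l] l )<', @fields );
my $big    = pack( '( c x![s] s c x![l] l )>', @fields );

print unpack( 'H*', $little ), "\n";
print unpack( 'H*', $big ),    "\n";
```

Only the bytes of the short and the long change places; the single bytes and the fill bytes are the same in both dumps.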
1067 | ||
1068 | ||
34babc16 JH |
1069 | =head2 Alignment, Take 2 |
1070 | ||
1071 | I'm afraid that we're not quite through with the alignment catch yet. The | |
1072 | hydra raises another ugly head when you pack arrays of structures: | |
1073 | ||
1074 | typedef struct { | |
1075 |     short count; | |
1076 |     char glyph; | |
1077 | } cell_t; | |
1078 | ||
1079 | typedef cell_t buffer_t[BUFLEN]; | |
1080 | ||
1081 | Where's the catch? Padding is neither required before the first field C<count>, | |
1082 | nor between this and the next field C<glyph>, so why can't we simply pack | |
1083 | like this: | |
1084 | ||
1085 | # something goes wrong here: | |
1086 | pack( 's!a' x @buffer, | |
1087 | map{ ( $_->{count}, $_->{glyph} ) } @buffer ); | |
1088 | ||
1089 | This packs C<3*@buffer> bytes, but it turns out that the size of | |
1090 | C<buffer_t> is four times C<BUFLEN>! The moral of the story is that | |
1091 | the required alignment of a structure or array is propagated to the | |
1092 | next higher level where we have to consider padding I<at the end> | |
1093 | of each component as well. Thus the correct template is: | |
1094 | ||
1095 | pack( 's!ax' x @buffer, | |
1096 | map{ ( $_->{count}, $_->{glyph} ) } @buffer ); | |
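
The difference between the two templates is easily measured. The buffer below holds three invented cells, and the size of a native short is determined at run time instead of being taken for granted.

```perl
use strict;
use warnings;

# Three made-up cells.
my @buffer = (
    { count => 1, glyph => 'a' },
    { count => 2, glyph => 'b' },
    { count => 3, glyph => 'c' },
);
my @items = map { ( $_->{count}, $_->{glyph} ) } @buffer;

my $tight  = pack( 's!a'  x @buffer, @items );    # no padding at the end
my $padded = pack( 's!ax' x @buffer, @items );    # one fill byte per cell
my $short  = length( pack( 's!', 0 ) );           # size of a native short

printf "short: %d, tight: %d, padded: %d bytes\n",
       $short, length( $tight ), length( $padded );

my @cells = unpack( '(s!ax)*', $padded );
print "@cells\n";
```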
1097 | ||
47b6252e NC |
1098 | =head2 Alignment, Take 3 |
1099 | ||
1100 | And even if you take all the above into account, ANSI still lets this: | |
1101 | ||
1102 | typedef struct { | |
1103 |     char foo[2]; | |
1104 | } foo_t; | |
34babc16 | 1105 | |
47b6252e NC |
1106 | vary in size. The alignment constraint of the structure can be greater than |
1107 | any of its elements. [And if you think that this doesn't affect anything | |
1108 | common, dismember the next cellphone that you see. Many have ARM cores, and | |
1109 | the ARM structure rules make C<sizeof (foo_t)> == 4] | |
34babc16 JH |
1110 | |
1111 | =head2 Pointers for How to Use Them | |
1112 | ||
1113 | The title of this section indicates the second problem you may run into | |
1114 | sooner or later when you pack C structures. If the function you intend | |
1115 | to call expects a, say, C<void *> value, you I<cannot> simply take | |
1116 | a reference to a Perl variable. (Although that value certainly is a | |
1117 | memory address, it's not the address where the variable's contents are | |
1118 | stored.) | |
1119 | ||
1120 | Template code C<P> promises to pack a "pointer to a fixed length string". | |
1121 | Isn't this what we want? Let's try: | |
1122 | ||
1123 | # allocate some storage and pack a pointer to it | |
1124 | my $memory = "\x00" x $size; | |
1125 | my $memptr = pack( 'P', $memory ); | |
1126 | ||
1127 | But wait: doesn't C<pack> just return a sequence of bytes? How can we pass this | |
1128 | string of bytes to some C code expecting a pointer which is, after all, | |
1129 | nothing but a number? The answer is simple: We have to obtain the numeric | |
1130 | address from the bytes returned by C<pack>. | |
1131 | ||
1132 | my $ptr = unpack( 'L!', $memptr ); | |
1133 | ||
1134 | Obviously this assumes that it is possible to typecast a pointer | |
1135 | to an unsigned long and vice versa, which frequently works but should not | |
1136 | be taken as a universal law. Now that we have this pointer the next question | |
1137 | is: How can we put it to good use? We need a call to some C function | |
1138 | where a pointer is expected. The read(2) system call comes to mind: | |
1139 | ||
1140 | ssize_t read(int fd, void *buf, size_t count); | |
1141 | ||
1142 | After reading L<perlfunc> explaining how to use C<syscall> we can write | |
1143 | this Perl function copying a file to standard output: | |
1144 | ||
1145 | require 'syscall.ph'; | |
1146 | sub cat($){ | |
1147 |     my $path = shift(); | |
1148 |     my $size = -s $path; | |
1149 |     my $memory = "\x00" x $size; # allocate some memory | |
1150 |     my $ptr = unpack( 'L!', pack( 'P', $memory ) ); | |
1151 |     open( F, $path ) || die( "$path: cannot open ($!)\n" ); | |
1152 |     my $fd = fileno(F); | |
1153 |     my $res = syscall( &SYS_read, $fd, $ptr, $size ); | |
1154 |     print $memory; | |
1155 |     close( F ); | |
1156 | } | |
1157 | ||
1158 | This is neither a specimen of simplicity nor a paragon of portability but | |
1159 | it illustrates the point: We are able to sneak behind the scenes and | |
1160 | access Perl's otherwise well-guarded memory! (Important note: Perl's | |
1161 | C<syscall> does I<not> require you to construct pointers in this roundabout | |
1162 | way. You simply pass a string variable, and Perl forwards the address.) | |
1163 | ||
1164 | How does C<unpack> with C<P> work? Imagine some pointer in the buffer | |
1165 | about to be unpacked: If it isn't the null pointer (which will smartly | |
1166 | produce the C<undef> value) we have a start address - but then what? | |
1167 | Perl has no way of knowing how long this "fixed length string" is, so | |
1168 | it's up to you to specify the actual size as an explicit length after C<P>. | |
1169 | ||
1170 | my $mem = "abcdefghijklmn"; | |
1171 | print unpack( 'P5', pack( 'P', $mem ) ); # prints "abcde" | |
1172 | ||
1173 | As a consequence, C<pack> ignores any number or C<*> after C<P>. | |
1174 | ||
1175 | ||
1176 | Now that we have seen C<P> at work, we might as well give C<p> a whirl. | |
1177 | Why do we need a second template code for packing pointers at all? The | |
1178 | answer lies behind the simple fact that an C<unpack> with C<p> promises | |
1179 | a null-terminated string starting at the address taken from the buffer, | |
1180 | and that implies a length for the data item to be returned: | |
1181 | ||
1182 | my $buf = pack( 'p', "abc\x00efhijklmn" ); | |
1183 | print unpack( 'p', $buf ); # prints "abc" | |
1184 | ||
1185 | ||
1186 | ||
1187 | This is apt to be confusing, however: As a consequence of the length being | |
1188 | implied by the string's length, a number after pack code C<p> is a repeat | |
1189 | count, not a length as after C<P>. | |
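
The two meanings of the number can be put side by side. In this sketch the strings are kept in named variables (our choice) so that the memory they occupy stays valid while the pointers are in use.

```perl
use strict;
use warnings;

my $mem = "abcdefghijklmn";

# after P, the number is the length of the data at the address
my $first5 = unpack( 'P5', pack( 'P', $mem ) );

# after p, the number is a repeat count: two pointers, two strings
my ( $x, $y ) = ( "abc\x00def", "ghi" );
my @strings = unpack( 'p2', pack( 'p2', $x, $y ) );

print "$first5\n";
print "@strings\n";
```

The first string comes back cut off at its embedded null byte, just as in the example above.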
1190 | ||
1191 | ||
1192 | Using C<pack(..., $x)> with C<P> or C<p> to get the address where C<$x> is | |
1193 | actually stored must be used with circumspection. Perl's internal machinery | |
1194 | considers the relation between a variable and that address as its very own | |
1195 | private matter and doesn't really care that we have obtained a copy. Therefore: | |
1196 | ||
1197 | =over 4 | |
1198 | ||
1199 | =item * | |
1200 | ||
1201 | Do not use C<pack> with C<p> or C<P> to obtain the address of a variable | |
1202 | that's bound to go out of scope (thereby freeing its memory) before you | |
1203 | are done with using the memory at that address. | |
1204 | ||
1205 | =item * | |
1206 | ||
1207 | Be very careful with Perl operations that change the value of the | |
1208 | variable. Appending something to the variable, for instance, might require | |
1209 | reallocation of its storage, leaving you with a pointer into no-man's land. | |
1210 | ||
1211 | =item * | |
1212 | ||
1213 | Don't think that you can get the address of a Perl variable | |
1214 | when it is stored as an integer or double number! C<pack('P', $x)> will | |
1215 | force the variable's internal representation to string, just as if you | |
1216 | had written something like C<$x .= ''>. | |
1217 | ||
1218 | =back | |
1219 | ||
1220 | It's safe, however, to P- or p-pack a string literal, because Perl simply | |
1221 | allocates an anonymous variable. | |
1222 | ||
1223 | ||
1224 | ||
1225 | =head1 Pack Recipes | |
1226 | ||
1227 | Here is a collection of (possibly) useful canned recipes for C<pack> | |
1228 | and C<unpack>: | |
1229 | ||
1230 | # Convert IP address for socket functions | |
1231 | pack( "C4", split /\./, "123.4.5.6" ); | |
1232 | ||
1233 | # Count the bits in a chunk of memory (e.g. a select vector) | |
1234 | unpack( '%32b*', $mask ); | |
1235 | ||
1236 | # Determine the endianness of your system | |
1237 | $is_little_endian = unpack( 'c', pack( 's', 1 ) ); | |
1238 | $is_big_endian = unpack( 'xc', pack( 's', 1 ) ); | |
1239 | ||
1240 | # Determine the number of bits in a native integer | |
1241 | $bits = unpack( '%32b*', pack( 'I!', ~0 ) ); | |
1242 | ||
1243 | # Prepare argument for the nanosleep system call | |
1244 | my $timespec = pack( 'L!L!', $secs, $nanosecs ); | |
1245 | ||
f8b4d74f WL |
1246 | For a simple memory dump we unpack some bytes into just as |
1247 | many pairs of hex digits, and use C<map> to handle the traditional | |
1248 | spacing - 16 bytes to a line: | |
1249 | ||
34babc16 | 1250 | my $i; |
49704364 WL |
1251 | print map( ++$i % 16 ? "$_ " : "$_\n", |
1252 | unpack( 'H2' x length( $mem ), $mem ) ), | |
f8b4d74f | 1253 | length( $mem ) % 16 ? "\n" : ''; |
34babc16 JH |
1254 | |
1255 | ||
47f22e19 WL |
1256 | =head1 Funnies Section |
1257 | ||
1258 | # Pulling digits out of nowhere... | |
1259 | print unpack( 'C', pack( 'x' ) ), | |
1260 | unpack( '%B*', pack( 'A' ) ), | |
1261 | unpack( 'H', pack( 'A' ) ), | |
1262 | unpack( 'A', unpack( 'C', pack( 'A' ) ) ), "\n"; | |
1263 | ||
1264 | # One for the road ;-) | |
1265 | my $advice = pack( 'all u can in a van' ); | |
1266 | ||
1267 | ||
34babc16 JH |
1268 | =head1 Authors |
1269 | ||
1270 | Simon Cozens and Wolfgang Laun. | |
1271 |