Commit | Line | Data |
---|---|---|
e8cd7eae GS |
1 | =head1 NAME |
2 | ||
3 | perlhack - How to hack at the Perl internals | |
4 | ||
5 | =head1 DESCRIPTION | |
6 | ||
7 | This document attempts to explain how Perl development takes place, | |
8 | and ends with some suggestions for people wanting to become bona fide | |
9 | porters. | |
10 | ||
11 | The perl5-porters mailing list is where the Perl standard distribution | |
12 | is maintained and developed. The list can get anywhere from 10 to 150 | |
13 | messages a day, depending on the heatedness of the debate. Most days | |
14 | there are two or three patches, extensions, features, or bugs being | |
15 | discussed at a time. | |
16 | ||
17 | A searchable archive of the list is at: | |
18 | ||
19 | http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/ | |
20 | ||
21 | The list is also archived under the usenet group name | |
22 | C<perl.porters-gw> at: | |
23 | ||
24 | http://www.deja.com/ | |
25 | ||
26 | List subscribers (the porters themselves) come in several flavours. | |
27 | Some are quiet curious lurkers, who rarely pitch in and instead watch | |
28 | the ongoing development to ensure they're forewarned of new changes or | |
29 | features in Perl. Some are representatives of vendors, who are there | |
30 | to make sure that Perl continues to compile and work on their | |
31 | platforms. Some patch any reported bug that they know how to fix, | |
32 | some are actively patching their pet area (threads, Win32, the regexp | |
33 | engine), while others seem to do nothing but complain. In other | |
34 | words, it's your usual mix of technical people. | |
35 | ||
36 | Over this group of porters presides Larry Wall. He has the final word | |
f6c51b38 GS |
37 | in what does and does not change in the Perl language. Various |
38 | releases of Perl are shepherded by a ``pumpking'', a porter | |
39 | responsible for gathering patches and deciding, on a patch-by-patch, | |
40 | feature-by-feature basis, what will and will not go into the release. | |
41 | For instance, Gurusamy Sarathy is the pumpking for the 5.6 release of | |
42 | Perl. | |
e8cd7eae GS |
43 | |
44 | In addition, various people are pumpkings for different things. For | |
45 | instance, Andy Dougherty and Jarkko Hietaniemi share the I<Configure> | |
46 | pumpkin, and Tom Christiansen is the documentation pumpking. | |
47 | ||
48 | Larry sees Perl development along the lines of the US government: | |
49 | there's the Legislature (the porters), the Executive branch (the | |
50 | pumpkings), and the Supreme Court (Larry). The legislature can | |
51 | discuss and submit patches to the executive branch all they like, but | |
52 | the executive branch is free to veto them. Rarely, the Supreme Court | |
53 | will side with the executive branch over the legislature, or the | |
54 | legislature over the executive branch. Mostly, however, the | |
55 | legislature and the executive branch are supposed to get along and | |
56 | work out their differences without impeachment or court cases. | |
57 | ||
58 | You might sometimes see reference to Rule 1 and Rule 2. Larry's power | |
59 | as Supreme Court is expressed in The Rules: | |
60 | ||
61 | =over 4 | |
62 | ||
63 | =item 1 | |
64 | ||
65 | Larry is always by definition right about how Perl should behave. | |
66 | This means he has final veto power on the core functionality. | |
67 | ||
68 | =item 2 | |
69 | ||
70 | Larry is allowed to change his mind about any matter at a later date, | |
71 | regardless of whether he previously invoked Rule 1. | |
72 | ||
73 | =back | |
74 | ||
75 | Got that? Larry is always right, even when he was wrong. It's rare | |
76 | to see either Rule exercised, but they are often alluded to. | |
77 | ||
78 | New features and extensions to the language are contentious, because | |
79 | the criteria used by the pumpkings, Larry, and other porters to decide | |
80 | which features should be implemented and incorporated are not codified | |
81 | in a few small design goals as with some other languages. Instead, | |
82 | the heuristics are flexible and often difficult to fathom. Here is | |
83 | one person's list, roughly in decreasing order of importance, of | |
84 | heuristics that new features have to be weighed against: | |
85 | ||
86 | =over 4 | |
87 | ||
88 | =item Does concept match the general goals of Perl? | |
89 | ||
90 | These haven't been written anywhere in stone, but one approximation | |
91 | is: | |
92 | ||
93 | 1. Keep it fast, simple, and useful. | |
94 | 2. Keep features/concepts as orthogonal as possible. | |
95 | 3. No arbitrary limits (platforms, data sizes, cultures). | |
96 | 4. Keep it open and exciting to use/patch/advocate Perl everywhere. | |
97 | 5. Either assimilate new technologies, or build bridges to them. | |
98 | ||
99 | =item Where is the implementation? | |
100 | ||
101 | All the talk in the world is useless without an implementation. In | |
102 | almost every case, the person or people who argue for a new feature | |
103 | will be expected to be the ones who implement it. Porters capable | |
104 | of coding new features have their own agendas, and are not available | |
105 | to implement your (possibly good) idea. | |
106 | ||
107 | =item Backwards compatibility | |
108 | ||
109 | It's a cardinal sin to break existing Perl programs. New warnings are | |
110 | contentious--some say that a program that emits warnings is not | |
111 | broken, while others say it is. Adding keywords has the potential to | |
112 | break programs; changing the meaning of existing token sequences or | |
113 | functions might break programs too. | |
114 | ||
115 | =item Could it be a module instead? | |
116 | ||
117 | Perl 5 has extension mechanisms, modules and XS, specifically to avoid | |
118 | the need to keep changing the Perl interpreter. You can write modules | |
119 | that export functions, you can give those functions prototypes so they | |
120 | can be called like built-in functions, you can even write XS code to | |
121 | mess with the runtime data structures of the Perl interpreter if you | |
122 | want to implement really complicated things. If it can be done in a | |
123 | module instead of in the core, it's highly unlikely to be added. | |
124 | ||
125 | =item Is the feature generic enough? | |
126 | ||
127 | Is this something that only the submitter wants added to the language, | |
128 | or would it be broadly useful? Sometimes, instead of adding a feature | |
129 | with a tight focus, the porters might decide to wait until someone | |
130 | implements the more generalized feature. For instance, instead of | |
131 | implementing a ``delayed evaluation'' feature, the porters are waiting | |
132 | for a macro system that would permit delayed evaluation and much more. | |
133 | ||
134 | =item Does it potentially introduce new bugs? | |
135 | ||
136 | Radical rewrites of large chunks of the Perl interpreter have the | |
137 | potential to introduce new bugs. The smaller and more localized the | |
138 | change, the better. | |
139 | ||
140 | =item Does it preclude other desirable features? | |
141 | ||
142 | A patch is likely to be rejected if it closes off future avenues of | |
143 | development. For instance, a patch that placed a true and final | |
144 | interpretation on prototypes is likely to be rejected because there | |
145 | are still options for the future of prototypes that haven't been | |
146 | addressed. | |
147 | ||
148 | =item Is the implementation robust? | |
149 | ||
150 | Good patches (tight code, complete, correct) stand more chance of | |
151 | going in. Sloppy or incorrect patches might be placed on the back | |
152 | burner until the pumpking has time to fix them, or might be discarded | |
153 | altogether without further notice. | |
154 | ||
155 | =item Is the implementation generic enough to be portable? | |
156 | ||
157 | The worst patches make use of system-specific features. It's highly | |
158 | unlikely that nonportable additions to the Perl language will be | |
159 | accepted. | |
160 | ||
161 | =item Is there enough documentation? | |
162 | ||
163 | Patches without documentation are probably ill-thought out or | |
164 | incomplete. Nothing can be added without documentation, so submitting | |
165 | a patch for the appropriate manpages as well as the source code is | |
166 | always a good idea. If appropriate, patches should add to the test | |
167 | suite as well. | |
168 | ||
169 | =item Is there another way to do it? | |
170 | ||
171 | Larry said ``Although the Perl Slogan is I<There's More Than One Way | |
172 | to Do It>, I hesitate to make 10 ways to do something''. This is a | |
173 | tricky heuristic to navigate, though--one man's essential addition is | |
174 | another man's pointless cruft. | |
175 | ||
176 | =item Does it create too much work? | |
177 | ||
178 | Work for the pumpking, work for Perl programmers, work for module | |
179 | authors, ... Perl is supposed to be easy. | |
180 | ||
f6c51b38 GS |
181 | =item Patches speak louder than words |
182 | ||
183 | Working code is always preferred to pie-in-the-sky ideas. A patch to | |
184 | add a feature stands a much higher chance of making it to the language | |
185 | than does a random feature request, no matter how fervently argued the | |
186 | request might be. This ties into ``Will it be useful?'', as the fact | |
187 | that someone took the time to make the patch demonstrates a strong | |
188 | desire for the feature. | |
189 | ||
e8cd7eae GS |
190 | =back |
191 | ||
192 | If you're on the list, you might hear the word ``core'' bandied | |
193 | around. It refers to the standard distribution. ``Hacking on the | |
194 | core'' means you're changing the C source code to the Perl | |
195 | interpreter. ``A core module'' is one that ships with Perl. | |
196 | ||
197 | The source code to the Perl interpreter, in its different versions, is | |
198 | kept in a repository managed by a revision control system (which is | |
199 | currently the Perforce program, see http://perforce.com/). The | |
200 | pumpkings and a few others have access to the repository to check in | |
201 | changes. Periodically the pumpking for the development version of Perl | |
202 | will release a new version, so the rest of the porters can see what's | |
2be4c08b GS |
203 | changed. The current state of the main trunk of repository, and patches |
204 | that describe the individual changes that have happened since the last | |
205 | public release are available at this location: | |
206 | ||
207 | ftp://ftp.linux.activestate.com/pub/staff/gsar/APC/ | |
208 | ||
209 | Selective parts are also visible via the rsync protocol. To get all | |
210 | the individual changes to the mainline since the last development | |
211 | release, use the following command: | |
212 | ||
6c7df0a8 | 213 | rsync -avz rsync://ftp.linux.activestate.com/perl-diffs perl-diffs |
2be4c08b GS |
214 | |
215 | Use this to get the latest source tree in full: | |
216 | ||
6c7df0a8 | 217 | rsync -avz rsync://ftp.linux.activestate.com/perl-current perl-current |
2be4c08b GS |
218 | |
219 | Needless to say, the source code in perl-current is usually in a perpetual | |
220 | state of evolution. You should expect it to be very buggy. Do B<not> use | |
221 | it for any purpose other than testing and development. | |
e8cd7eae GS |
222 | |
223 | Always submit patches to I<perl5-porters@perl.org>. This lets other | |
224 | porters review your patch, which catches a surprising number of errors | |
f6c51b38 GS |
225 | in patches. Either use the diff program (available in source code |
226 | form from I<ftp://ftp.gnu.org/pub/gnu/>), or use Johan Vromans' | |
227 | I<makepatch> (available from I<CPAN/authors/id/JV/>). Unified diffs | |
228 | are preferred, but context diffs are accepted. Do not send RCS-style | |
229 | diffs or diffs without context lines. More information is given in | |
230 | the I<Porting/patching.pod> file in the Perl source distribution. | |
231 | Please patch against the latest B<development> version (e.g., if | |
232 | you're fixing a bug in the 5.005 track, patch against the latest | |
233 | 5.005_5x version). Only patches that survive the heat of the | |
234 | development branch get applied to maintenance versions. | |
e8cd7eae GS |
235 | |
236 | Your patch should update the documentation and test suite. | |
237 | ||
238 | To report a bug in Perl, use the program I<perlbug> which comes with | |
239 | Perl (if you can't get Perl to work, send mail to the address | |
240 | I<perlbug@perl.com> or I<perlbug@perl.org>). Reporting bugs through | |
241 | I<perlbug> feeds into the automated bug-tracking system, access to | |
242 | which is provided through the web at I<http://bugs.perl.org/>. It | |
243 | often pays to check the archives of the perl5-porters mailing list to | |
244 | see whether the bug you're reporting has been reported before, and if | |
245 | so whether it was considered a bug. See above for the location of | |
246 | the searchable archives. | |
247 | ||
248 | The CPAN testers (I<http://testers.cpan.org/>) are a group of | |
249 | volunteers who test CPAN modules on a variety of platforms. Perl Labs | |
f6c51b38 GS |
250 | (I<http://labs.perl.org/>) automatically tests Perl source releases on |
251 | platforms and gives feedback to the CPAN testers mailing list. Both | |
252 | efforts welcome volunteers. | |
e8cd7eae | 253 | |
e8cd7eae GS |
254 | It's a good idea to read and lurk for a while before chipping in. |
255 | That way you'll get to see the dynamic of the conversations, learn the | |
256 | personalities of the players, and hopefully be better prepared to make | |
257 | a useful contribution when you do speak up. | |
258 | ||
259 | If after all this you still think you want to join the perl5-porters | |
f6c51b38 GS |
260 | mailing list, send mail to I<perl5-porters-subscribe@perl.org>. To |
261 | unsubscribe, send mail to I<perl5-porters-unsubscribe@perl.org>. | |
e8cd7eae | 262 | |
a422fd2d SC |
263 | To hack on the Perl guts, you'll need to read the following things: |
264 | ||
265 | =over 3 | |
266 | ||
267 | =item L<perlguts> | |
268 | ||
269 | This is of paramount importance, since it's the documentation of what | |
270 | goes where in the Perl source. Read it over a couple of times and it | |
271 | might start to make sense - don't worry if it doesn't yet, because the | |
272 | best way to study it is to read it in conjunction with poking at Perl | |
273 | source, and we'll do that later on. | |
274 | ||
275 | You might also want to look at Gisle Aas's illustrated perlguts - | |
276 | there's no guarantee that this will be absolutely up-to-date with the | |
277 | latest documentation in the Perl core, but the fundamentals will be | |
278 | right. (http://gisle.aas.no/perl/illguts/) | |
279 | ||
280 | =item L<perlxstut> and L<perlxs> | |
281 | ||
282 | A working knowledge of XSUB programming is incredibly useful for core | |
283 | hacking; XSUBs use techniques drawn from the PP code, the portion of the | |
284 | guts that actually executes a Perl program. It's a lot gentler to learn | |
285 | those techniques from simple examples and explanation than from the core | |
286 | itself. | |
287 | ||
288 | =item L<perlapi> | |
289 | ||
290 | The documentation for the Perl API explains what some of the internal | |
291 | functions do, as well as the many macros used in the source. | |
292 | ||
293 | =item F<Porting/pumpkin.pod> | |
294 | ||
295 | This is a collection of words of wisdom for a Perl porter; some of it is | |
296 | only useful to the pumpkin holder, but most of it applies to anyone | |
297 | wanting to go about Perl development. | |
298 | ||
299 | =item The perl5-porters FAQ | |
300 | ||
301 | This is posted to perl5-porters at the beginning of every month, and | |
302 | should be available from http://perlhacker.org/p5p-faq; alternatively, | |
303 | you can get the FAQ emailed to you by sending mail to | |
304 | C<perl5-porters-faq@perl.org>. It contains hints on reading | |
305 | perl5-porters, information on how perl5-porters works and how Perl | |
306 | development in general works. | |
307 | ||
308 | =back | |
309 | ||
310 | =head2 Finding Your Way Around | |
311 | ||
312 | Perl maintenance can be split into a number of areas, and certain people | |
313 | (pumpkins) will have responsibility for each area. These areas sometimes | |
314 | correspond to files or directories in the source kit. Among the areas are: | |
315 | ||
316 | =over 3 | |
317 | ||
318 | =item Core modules | |
319 | ||
320 | Modules shipped as part of the Perl core live in the F<lib/> and F<ext/> | |
321 | subdirectories: F<lib/> is for the pure-Perl modules, and F<ext/> | |
322 | contains the core XS modules. | |
323 | ||
324 | =item Documentation | |
325 | ||
326 | Documentation maintenance includes looking after everything in the | |
327 | F<pod/> directory (as well as contributing new documentation) and | |
328 | the documentation for the modules in core. | |
329 | ||
330 | =item Configure | |
331 | ||
332 | The configure process is the way we make Perl portable across the | |
333 | myriad of operating systems it supports. Responsibility for the | |
334 | configure, build and installation process, as well as the overall | |
335 | portability of the core code rests with the configure pumpkin - others | |
336 | help out with individual operating systems. | |
337 | ||
338 | The files involved are the operating system directories (F<win32/>, | |
339 | F<os2/>, F<vms/> and so on), the shell scripts which generate F<config.h> | |
340 | and F<Makefile>, as well as the metaconfig files which generate | |
341 | F<Configure>. (metaconfig isn't included in the core distribution.) | |
342 | ||
343 | =item Interpreter | |
344 | ||
345 | And of course, there's the core of the Perl interpreter itself. Let's | |
346 | have a look at that in a little more detail. | |
347 | ||
348 | =back | |
349 | ||
350 | Before we leave looking at the layout, though, don't forget that | |
351 | F<MANIFEST> contains not only the file names in the Perl distribution, | |
352 | but short descriptions of what's in them, too. For an overview of the | |
353 | important files, try this: | |
354 | ||
355 | perl -lne 'print if /^[^\/]+\.[ch]\s+/' MANIFEST | |
356 | ||
357 | =head2 Elements of the interpreter | |
358 | ||
359 | The work of the interpreter has two main stages: compiling the code | |
360 | into the internal representation, or bytecode, and then executing it. | |
361 | L<perlguts/Compiled code> explains exactly how the compilation stage | |
362 | happens. | |
363 | ||
364 | Here is a short breakdown of perl's operation: | |
365 | ||
366 | =over 3 | |
367 | ||
368 | =item Startup | |
369 | ||
370 | The action begins in F<perlmain.c> (or F<miniperlmain.c> for miniperl). | |
371 | This is very high-level code, enough to fit on a single screen, and it | |
372 | resembles the code found in L<perlembed>; most of the real action takes | |
373 | place in F<perl.c>. | |
374 | ||
375 | First, F<perlmain.c> allocates some memory and constructs a Perl | |
376 | interpreter: | |
377 | ||
378 | 1 PERL_SYS_INIT3(&argc,&argv,&env); | |
379 | 2 | |
380 | 3 if (!PL_do_undump) { | |
381 | 4 my_perl = perl_alloc(); | |
382 | 5 if (!my_perl) | |
383 | 6 exit(1); | |
384 | 7 perl_construct(my_perl); | |
385 | 8 PL_perl_destruct_level = 0; | |
386 | 9 } | |
387 | ||
388 | Line 1 is a macro, and its definition is dependent on your operating | |
389 | system. Line 3 references C<PL_do_undump>, a global variable - all | |
390 | global variables in Perl start with C<PL_>. This tells you whether the | |
391 | current running program was created with the C<-u> flag to perl and then | |
392 | F<undump>, which means it's going to be false in any sane context. | |
393 | ||
394 | Line 4 calls a function in F<perl.c> to allocate memory for a Perl | |
395 | interpreter. It's quite a simple function, and the guts of it look like | |
396 | this: | |
397 | ||
398 | my_perl = (PerlInterpreter*)PerlMem_malloc(sizeof(PerlInterpreter)); | |
399 | ||
400 | Here you see an example of Perl's system abstraction, which we'll see | |
401 | later: C<PerlMem_malloc> is either your system's C<malloc>, or Perl's | |
402 | own C<malloc> as defined in F<malloc.c> if you selected that option at | |
403 | configure time. | |
404 | ||
405 | Next, in line 7, we construct the interpreter; this sets up all the | |
406 | special variables that Perl needs, the stacks, and so on. | |
407 | ||
408 | Now we pass Perl the command line options, and tell it to go: | |
409 | ||
410 | exitstatus = perl_parse(my_perl, xs_init, argc, argv, (char **)NULL); | |
411 | if (!exitstatus) { | |
412 | exitstatus = perl_run(my_perl); | |
413 | } | |
414 | ||
415 | ||
416 | C<perl_parse> is actually a wrapper around C<S_parse_body>, as defined | |
417 | in F<perl.c>, which processes the command line options, sets up any | |
418 | statically linked XS modules, opens the program and calls C<yyparse> to | |
419 | parse it. | |
420 | ||
421 | =item Parsing | |
422 | ||
423 | The aim of this stage is to take the Perl source, and turn it into an op | |
424 | tree. We'll see what one of those looks like later. Strictly speaking, | |
425 | there are three things going on here. | |
426 | ||
427 | C<yyparse>, the parser, lives in F<perly.c>, although you're better off | |
428 | reading the original YACC input in F<perly.y>. (Yes, Virginia, there | |
429 | B<is> a YACC grammar for Perl!) The job of the parser is to take your | |
430 | code and `understand' it, splitting it into sentences, deciding which | |
431 | operands go with which operators and so on. | |
432 | ||
433 | The parser is nobly assisted by the lexer, which chunks up your input | |
434 | into tokens, and decides what type of thing each token is: a variable | |
435 | name, an operator, a bareword, a subroutine, a core function, and so on. | |
436 | The main point of entry to the lexer is C<yylex>, and that and its | |
437 | associated routines can be found in F<toke.c>. Perl isn't much like | |
438 | other computer languages; it's highly context sensitive at times, and it can | |
439 | be tricky to work out what sort of token something is, or where a token | |
440 | ends. As such, there's a lot of interplay between the tokeniser and the | |
441 | parser, which can get pretty frightening if you're not used to it. | |
442 | ||
443 | As the parser understands a Perl program, it builds up a tree of | |
444 | operations for the interpreter to perform during execution. The routines | |
445 | which construct and link together the various operations are to be found | |
446 | in F<op.c>, and will be examined later. | |
447 | ||
448 | =item Optimization | |
449 | ||
450 | Now the parsing stage is complete, and the finished tree represents | |
451 | the operations that the Perl interpreter needs to perform to execute our | |
452 | program. Next, Perl does a dry run over the tree looking for | |
453 | optimisations: constant expressions such as C<3 + 4> will be computed | |
454 | now, and the optimizer will also see if any multiple operations can be | |
455 | replaced with a single one. For instance, to fetch the variable C<$foo>, | |
456 | instead of grabbing the glob C<*foo> and looking at the scalar | |
457 | component, the optimizer fiddles the op tree to use a function which | |
458 | directly looks up the scalar in question. The main optimizer is C<peep> | |
459 | in F<op.c>, and many ops have their own optimizing functions. | |
460 | ||
461 | =item Running | |
462 | ||
463 | Now we're finally ready to go: we have compiled Perl byte code, and all | |
464 | that's left to do is run it. The actual execution is done by the | |
465 | C<runops_standard> function in F<run.c>; more specifically, it's done by | |
466 | these three innocent looking lines: | |
467 | ||
468 | while ((PL_op = CALL_FPTR(PL_op->op_ppaddr)(aTHX))) { | |
469 | PERL_ASYNC_CHECK(); | |
470 | } | |
471 | ||
472 | You may be more comfortable with the Perl version of that: | |
473 | ||
474 | PERL_ASYNC_CHECK() while $Perl::op = &{$Perl::op->{function}}; | |
475 | ||
476 | Well, maybe not. Anyway, each op contains a function pointer, which | |
477 | stipulates the function which will actually carry out the operation. | |
478 | This function will return the next op in the sequence - this allows for | |
479 | things like C<if> which choose the next op dynamically at run time. | |
480 | The C<PERL_ASYNC_CHECK> makes sure that things like signals interrupt | |
481 | execution if required. | |
482 | ||
483 | The actual functions called are known as PP code, and they're spread | |
484 | between four files: F<pp_hot.c> contains the `hot' code, which is most | |
485 | often used and highly optimized, F<pp_sys.c> contains all the | |
486 | system-specific functions, F<pp_ctl.c> contains the functions which | |
487 | implement control structures (C<if>, C<while> and the like) and F<pp.c> | |
488 | contains everything else. These are, if you like, the C code for Perl's | |
489 | built-in functions and operators. | |
490 | ||
491 | =back | |
492 | ||
493 | =head2 Internal Variable Types | |
494 | ||
495 | You should by now have had a look at L<perlguts>, which tells you about | |
496 | Perl's internal variable types: SVs, HVs, AVs and the rest. If not, do | |
497 | that now. | |
498 | ||
499 | These variables are used not only to represent Perl-space variables, but | |
500 | also any constants in the code, as well as some structures completely | |
501 | internal to Perl. The symbol table, for instance, is an ordinary Perl | |
502 | hash. Your code is represented by an SV as it's read into the parser; | |
503 | any program files you call are opened via ordinary Perl filehandles, and | |
504 | so on. | |
505 | ||
506 | The core L<Devel::Peek|Devel::Peek> module lets us examine SVs from a | |
507 | Perl program. Let's see, for instance, how Perl treats the constant | |
508 | C<"hello">. | |
509 | ||
510 | % perl -MDevel::Peek -e 'Dump("hello")' | |
511 | 1 SV = PV(0xa041450) at 0xa04ecbc | |
512 | 2 REFCNT = 1 | |
513 | 3 FLAGS = (POK,READONLY,pPOK) | |
514 | 4 PV = 0xa0484e0 "hello"\0 | |
515 | 5 CUR = 5 | |
516 | 6 LEN = 6 | |
517 | ||
518 | Reading C<Devel::Peek> output takes a bit of practise, so let's go | |
519 | through it line by line. | |
520 | ||
521 | Line 1 tells us we're looking at an SV which lives at C<0xa04ecbc> in | |
522 | memory. SVs themselves are very simple structures, but they contain a | |
523 | pointer to a more complex structure. In this case, it's a PV, a | |
524 | structure which holds a string value, at location C<0xa041450>. Line 2 | |
525 | is the reference count; there are no other references to this data, so | |
526 | it's 1. | |
527 | ||
528 | Line 3 lists the flags for this SV - it's OK to use it as a PV, it's a | |
529 | read-only SV (because it's a constant) and the data is a PV internally. | |
530 | Next we've got the contents of the string, starting at location | |
531 | C<0xa0484e0>. | |
532 | ||
533 | Line 5 gives us the current length of the string - note that this does | |
534 | B<not> include the null terminator. Line 6 is not the length of the | |
535 | string, but the length of the currently allocated buffer; as the string | |
536 | grows, Perl automatically extends the available storage via a routine | |
537 | called C<SvGROW>. | |
538 | ||
539 | You can get at any of these quantities from C very easily; just add | |
540 | C<Sv> to the name of the field shown in the snippet, and you've got a | |
541 | macro which will return the value: C<SvCUR(sv)> returns the current | |
542 | length of the string, C<SvREFCNT(sv)> returns the reference count, | |
543 | C<SvPV(sv, len)> returns the string itself with its length, and so on. | |
544 | More macros to manipulate these properties can be found in L<perlguts>. | |
545 | ||
546 | Let's take an example of manipulating a PV, from C<sv_catpvn>, in F<sv.c>: | |
547 | ||
548 | 1 void | |
549 | 2 Perl_sv_catpvn(pTHX_ register SV *sv, register const char *ptr, register STRLEN len) | |
550 | 3 { | |
551 | 4 STRLEN tlen; | |
552 | 5 char *junk; | |
553 | ||
554 | 6 junk = SvPV_force(sv, tlen); | |
555 | 7 SvGROW(sv, tlen + len + 1); | |
556 | 8 if (ptr == junk) | |
557 | 9 ptr = SvPVX(sv); | |
558 | 10 Move(ptr,SvPVX(sv)+tlen,len,char); | |
559 | 11 SvCUR(sv) += len; | |
560 | 12 *SvEND(sv) = '\0'; | |
561 | 13 (void)SvPOK_only_UTF8(sv); /* validate pointer */ | |
562 | 14 SvTAINT(sv); | |
563 | 15 } | |
564 | ||
565 | This is a function which adds a string, C<ptr>, of length C<len> onto | |
566 | the end of the PV stored in C<sv>. The first thing we do in line 6 is | |
567 | make sure that the SV B<has> a valid PV, by calling the C<SvPV_force> | |
568 | macro to force a PV. As a side effect, C<tlen> gets set to the current | |
569 | length of the PV, and the PV itself is returned to C<junk>. | |
570 | ||
b1866b2d | 571 | In line 7, we make sure that the SV will have enough room to accommodate |
a422fd2d SC |
572 | the old string, the new string and the null terminator. If C<LEN> isn't |
573 | big enough, C<SvGROW> will reallocate space for us. | |
574 | ||
575 | Now, if C<junk> is the same as the string we're trying to add, the SV is | |
576 | being appended to itself and C<SvGROW> may have moved its buffer, so we | |
577 | grab the string afresh from the SV; C<SvPVX> is the address of the PV. | |
578 | ||
579 | Line 10 does the actual catenation: the C<Move> macro moves a chunk of | |
580 | memory around: we move the string C<ptr> to the end of the PV - that's | |
581 | the start of the PV plus its current length. We're moving C<len> bytes | |
582 | of type C<char>. After doing so, we need to tell Perl we've extended the | |
583 | string, by altering C<CUR> to reflect the new length. C<SvEND> is a | |
584 | macro which gives us the end of the string, so that needs to be a | |
585 | C<"\0">. | |
586 | ||
587 | Line 13 manipulates the flags; since we've changed the PV, any IV or NV | |
588 | values will no longer be valid: if we have C<$a=10; $a.="6";> we don't | |
589 | want to use the old IV of 10. C<SvPOK_only_UTF8> is a special UTF8-aware | |
590 | version of C<SvPOK_only>, a macro which turns off the IOK and NOK flags | |
591 | and turns on POK. The final C<SvTAINT> is a macro which marks the SV as | |
592 | tainted if taint mode is turned on and tainted data was involved. | |
593 | ||
594 | AVs and HVs are more complicated, but SVs are by far the most common | |
595 | variable type being thrown around. Having seen something of how we | |
596 | manipulate these, let's go on and look at how the op tree is | |
597 | constructed. | |
598 | ||
599 | =head2 Op Trees | |
600 | ||
601 | First, what is the op tree, anyway? The op tree is the parsed | |
602 | representation of your program, as we saw in our section on parsing, and | |
603 | it's the sequence of operations that Perl goes through to execute your | |
604 | program, as we saw in L</Running>. | |
605 | ||
606 | An op is a fundamental operation that Perl can perform: all the built-in | |
607 | functions and operators are ops, and there are a series of ops which | |
608 | deal with concepts the interpreter needs internally - entering and | |
609 | leaving a block, ending a statement, fetching a variable, and so on. | |
610 | ||
611 | The op tree is connected in two ways: you can imagine that there are two | |
612 | "routes" through it, two orders in which you can traverse the tree. | |
613 | First, parse order reflects how the parser understood the code, and | |
614 | secondly, execution order tells perl what order to perform the | |
615 | operations in. | |
616 | ||
617 | The easiest way to examine the op tree is to stop Perl after it has | |
618 | finished parsing, and get it to dump out the tree. This is exactly what | |
619 | the compiler backends L<B::Terse|B::Terse> and L<B::Debug|B::Debug> do. | |
620 | ||
621 | Let's have a look at how Perl sees C<$a = $b + $c>: | |
622 | ||
623 | % perl -MO=Terse -e '$a=$b+$c' | |
624 | 1  LISTOP (0x8179888) leave | |
625 | 2      OP (0x81798b0) enter | |
626 | 3      COP (0x8179850) nextstate | |
627 | 4      BINOP (0x8179828) sassign | |
628 | 5          BINOP (0x8179800) add [1] | |
629 | 6              UNOP (0x81796e0) null [15] | |
630 | 7                  SVOP (0x80fafe0) gvsv GV (0x80fa4cc) *b | |
631 | 8              UNOP (0x81797e0) null [15] | |
632 | 9                  SVOP (0x8179700) gvsv GV (0x80efeb0) *c | |
633 | 10         UNOP (0x816b4f0) null [15] | |
634 | 11             SVOP (0x816dcf0) gvsv GV (0x80fa460) *a | |
635 | ||
636 | Let's start in the middle, at line 4. This is a BINOP, a binary | |
637 | operator, which is at location C<0x8179828>. The specific operator in | |
638 | question is C<sassign> - scalar assignment - and you can find the code | |
639 | which implements it in the function C<pp_sassign> in F<pp_hot.c>. As a | |
640 | binary operator, it has two children: the add operator, providing the | |
641 | result of C<$b+$c>, is uppermost on line 5, and the left hand side is on | |
642 | line 10. | |
643 | ||
644 | Line 10 is the null op: this does exactly nothing. What is that doing | |
645 | there? If you see the null op, it's a sign that something has been | |
646 | optimized away after parsing. As we mentioned in L</Optimization>, | |
647 | the optimization stage sometimes converts two operations into one, for | |
648 | example when fetching a scalar variable. When this happens, instead of | |
649 | rewriting the op tree and cleaning up the dangling pointers, it's easier | |
650 | just to replace the redundant operation with the null op. Originally, | |
651 | the tree would have looked like this: | |
652 | ||
653 | 10         UNOP (0x816b4f0) rv2sv [15] | |
654 | 11             SVOP (0x816dcf0) gv GV (0x80fa460) *a | |
655 | ||
656 | That is, fetch the C<a> entry from the main symbol table, and then look | |
657 | at the scalar component of it: C<gvsv> (C<pp_gvsv> in F<pp_hot.c>) | |
658 | happens to do both these things. | |
659 | ||
660 | The right hand side, starting at line 5, is similar to what we've just | |
661 | seen: we have the C<add> op (C<pp_add>, also in F<pp_hot.c>) adding together | |
662 | two C<gvsv>s. | |
663 | ||
664 | Now, what's this about? | |
665 | ||
666 | 1 LISTOP (0x8179888) leave | |
667 | 2 OP (0x81798b0) enter | |
668 | 3 COP (0x8179850) nextstate | |
669 | ||
670 | C<enter> and C<leave> are scoping ops, and their job is to perform any | |
671 | housekeeping every time you enter and leave a block: lexical variables | |
672 | are tidied up, unreferenced variables are destroyed, and so on. Every | |
673 | program will have those first three lines: C<leave> is a list, and its | |
674 | children are all the statements in the block. Statements are delimited | |
675 | by C<nextstate>, so a block is a collection of C<nextstate> ops, each | |
676 | followed by the ops to be performed for that statement as its | |
677 | siblings. C<enter> is a single op which functions as a marker. | |
678 | ||
679 | That's how Perl parsed the program, from top to bottom: | |
680 | ||
681 |          Program | |
682 |             | | |
683 |         Statement | |
684 |             | | |
685 |             = | |
686 |            / \ | |
687 |           /   \ | |
688 |          $a    + | |
689 |               / \ | |
690 |             $b   $c | |
691 | ||
692 | However, it's impossible to B<perform> the operations in this order: | |
693 | you have to find the values of C<$b> and C<$c> before you add them | |
694 | together, for instance. So, the other thread that runs through the op | |
695 | tree is the execution order: each op has a field C<op_next> which points | |
696 | to the next op to be run, so following these pointers tells us how perl | |
697 | executes the code. We can traverse the tree in this order using | |
698 | the C<exec> option to C<B::Terse>: | |
699 | ||
700 | % perl -MO=Terse,exec -e '$a=$b+$c' | |
701 | 1 OP (0x8179928) enter | |
702 | 2 COP (0x81798c8) nextstate | |
703 | 3 SVOP (0x81796c8) gvsv GV (0x80fa4d4) *b | |
704 | 4 SVOP (0x8179798) gvsv GV (0x80efeb0) *c | |
705 | 5 BINOP (0x8179878) add [1] | |
706 | 6 SVOP (0x816dd38) gvsv GV (0x80fa468) *a | |
707 | 7 BINOP (0x81798a0) sassign | |
708 | 8 LISTOP (0x8179900) leave | |
709 | ||
710 | This probably makes more sense for a human: enter a block, start a | |
711 | statement. Get the values of C<$b> and C<$c>, and add them together. | |
712 | Find C<$a>, and assign one to the other. Then leave. | |
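
As a rough sketch of what following C<op_next> means, here is a toy model in plain C. It is not Perl's real C<OP> structure or run loop: C<toy_op>, C<toy_pp_generic>, C<toy_run> and C<toy_trace> are names invented for this illustration, and every op does nothing except record its own name.

```c
#include <stddef.h>

/* Toy model only: Perl's real OP struct and run loop are more involved. */
typedef struct toy_op toy_op;
struct toy_op {
    const char *name;                 /* e.g. "gvsv", "add" */
    toy_op *(*ppaddr)(toy_op *self);  /* code to run; returns the next op */
    toy_op *op_next;                  /* next op in execution order */
};

#define TOY_TRACE_MAX 16
const char *toy_trace[TOY_TRACE_MAX]; /* names of the ops already run */
int toy_trace_len = 0;

/* Every toy op just records its name and hands back op_next. */
toy_op *toy_pp_generic(toy_op *self)
{
    if (toy_trace_len < TOY_TRACE_MAX)
        toy_trace[toy_trace_len++] = self->name;
    return self->op_next;
}

/* The run loop: keep calling the current op until there is no next op. */
void toy_run(toy_op *start)
{
    toy_op *op = start;
    while (op != NULL)
        op = op->ppaddr(op);
}

/* Thread the eight ops of $a = $b + $c in execution order and run them. */
int toy_run_example(void)
{
    static const char *names[] = { "enter", "nextstate", "gvsv", "gvsv",
                                   "add", "gvsv", "sassign", "leave" };
    static toy_op ops[8];
    int i;

    toy_trace_len = 0;
    for (i = 0; i < 8; i++) {
        ops[i].name = names[i];
        ops[i].ppaddr = toy_pp_generic;
        ops[i].op_next = (i + 1 < 8) ? &ops[i + 1] : NULL;
    }
    toy_run(&ops[0]);
    return toy_trace_len;   /* number of ops executed */
}
```

Calling C<toy_run_example()> runs all eight ops and leaves their names in C<toy_trace>, in the same order as the C<exec> listing.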
713 | ||
714 | The way Perl builds up these op trees in the parsing process can be | |
715 | unravelled by examining F<perly.y>, the YACC grammar. Let's take the | |
716 | piece we need to construct the tree for C<$a = $b + $c>: | |
717 | ||
718 | 1 term : term ASSIGNOP term | |
719 | 2 { $$ = newASSIGNOP(OPf_STACKED, $1, $2, $3); } | |
720 | 3 | term ADDOP term | |
721 | 4 { $$ = newBINOP($2, 0, scalar($1), scalar($3)); } | |
722 | ||
723 | If you're not used to reading BNF grammars, this is how it works: You're | |
724 | fed certain things by the tokeniser, which generally end up in upper | |
725 | case. Here, C<ADDOP> is provided when the tokeniser sees C<+> in your | |
726 | code. C<ASSIGNOP> is provided when C<=> is used for assigning. These are | |
727 | `terminal symbols', because you can't get any simpler than them. | |
728 | ||
729 | The grammar, lines one and three of the snippet above, tells you how to | |
730 | build up more complex forms. These complex forms, `non-terminal symbols', | |
731 | are generally placed in lower case. C<term> here is a non-terminal | |
732 | symbol, representing a single expression. | |
733 | ||
734 | The grammar gives you the following rule: you can make the thing on the | |
735 | left of the colon if you see all the things on the right in sequence. | |
736 | This is called a "reduction", and the aim of parsing is to completely | |
737 | reduce the input. There are several different ways you can perform a | |
738 | reduction, separated by vertical bars: so, C<term> followed by C<=> | |
739 | followed by C<term> makes a C<term>, and C<term> followed by C<+> | |
740 | followed by C<term> can also make a C<term>. | |
741 | ||
742 | So, if you see two terms with an C<=> or C<+> between them, you can | |
743 | turn them into a single expression. When you do this, you execute the | |
744 | code in the block on the next line: if you see C<=>, you'll do the code | |
745 | in line 2. If you see C<+>, you'll do the code in line 4. It's this code | |
746 | which contributes to the op tree. | |
747 | ||
748 | | term ADDOP term | |
749 | { $$ = newBINOP($2, 0, scalar($1), scalar($3)); } | |
750 | ||
751 | What this does is create a new binary op, and feed it a number of | |
752 | variables. The variables refer to the tokens: C<$1> is the first token in | |
753 | the input, C<$2> the second, and so on - think regular expression | |
754 | backreferences. C<$$> is the op returned from this reduction. So, we | |
755 | call C<newBINOP> to create a new binary operator. The first parameter to | |
756 | C<newBINOP>, a function in F<op.c>, is the op type. It's an addition | |
757 | operator, so we want the type to be C<ADDOP>. We could specify this | |
758 | directly, but it's right there as the second token in the input, so we | |
759 | use C<$2>. The second parameter is the op's flags: 0 means `nothing | |
760 | special'. Then the things to add: the left and right hand side of our | |
761 | expression, in scalar context. | |
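
To make the idea of a reduction building a tree node more concrete, here is a toy model in plain C. The names C<toy_node>, C<toy_new_var>, C<toy_new_binop> and C<toy_parse_example> are invented for this sketch; they only borrow the argument order of the C<newBINOP> call above and are not the functions in F<op.c>.

```c
#include <stdlib.h>

/* Toy model only: the names below are invented and are not Perl's API. */
enum toy_type { TOY_VAR, TOY_ADD, TOY_ASSIGN };

typedef struct toy_node {
    enum toy_type type;
    int flags;
    const char *name;               /* variable name, for TOY_VAR leaves */
    struct toy_node *left, *right;  /* operands, for binary nodes */
} toy_node;

/* Leaf constructor, standing in for whatever builds a variable fetch. */
toy_node *toy_new_var(const char *name)
{
    toy_node *n = calloc(1, sizeof *n);
    if (n == NULL)
        abort();
    n->type = TOY_VAR;
    n->name = name;
    return n;
}

/* Same argument order as the grammar action: type, flags, left, right. */
toy_node *toy_new_binop(enum toy_type type, int flags,
                        toy_node *left, toy_node *right)
{
    toy_node *n = calloc(1, sizeof *n);
    if (n == NULL)
        abort();
    n->type = type;
    n->flags = flags;
    n->left = left;
    n->right = right;
    return n;
}

/* The two reductions for $a = $b + $c: the inner "term ADDOP term" is
 * reduced first, and its result becomes a term of the assignment.
 * Nodes are never freed in this sketch. */
toy_node *toy_parse_example(void)
{
    toy_node *sum = toy_new_binop(TOY_ADD, 0,
                                  toy_new_var("b"), toy_new_var("c"));
    return toy_new_binop(TOY_ASSIGN, 0, toy_new_var("a"), sum);
}

/* Count the nodes in a tree, to check its shape. */
int toy_count_nodes(const toy_node *n)
{
    if (n == NULL)
        return 0;
    return 1 + toy_count_nodes(n->left) + toy_count_nodes(n->right);
}
```

The tree returned by C<toy_parse_example()> has the same shape as the diagram of C<$a = $b + $c> shown earlier: an assignment node whose right operand is the addition node.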
762 | ||
763 | =head2 Stacks | |
764 | ||
765 | When perl executes something like C<addop>, how does it pass on its | |
766 | results to the next op? The answer is, through the use of stacks. Perl | |
767 | has a number of stacks to store things it's currently working on, and | |
768 | we'll look at the three most important ones here. | |
769 | ||
770 | =over 3 | |
771 | ||
772 | =item Argument stack | |
773 | ||
774 | Arguments are passed to PP code and returned from PP code using the | |
775 | argument stack, C<ST>. The typical way to handle arguments is to pop | |
776 | them off the stack, deal with them how you wish, and then push the result | |
777 | back onto the stack. This is how, for instance, the cosine operator | |
778 | works: | |
779 | ||
780 | NV value; | |
781 | value = POPn; | |
782 | value = Perl_cos(value); | |
783 | XPUSHn(value); | |
784 | ||
785 | We'll see a more tricky example of this when we consider Perl's macros | |
786 | below. C<POPn> gives you the NV (floating point value) of the top SV on | |
787 | the stack: the C<$x> in C<cos($x)>. Then we compute the cosine, and push | |
788 | the result back as an NV. The C<X> in C<XPUSHn> means that the stack | |
789 | should be extended if necessary - it can't be necessary here, because we | |
790 | know there's room for one more item on the stack, since we've just | |
791 | removed one! The C<XPUSH*> macros at least guarantee safety. | |
792 | ||
793 | Alternatively, you can fiddle with the stack directly: C<SP> gives you | |
794 | the first element in your portion of the stack, and C<TOP*> gives you | |
795 | the top SV/IV/NV/etc. on the stack. So, for instance, to do unary | |
796 | negation of an integer: | |
797 | ||
798 | SETi(-TOPi); | |
799 | ||
800 | Just set the integer value of the top stack entry to its negation. | |
801 | ||
802 | Argument stack manipulation in the core is exactly the same as it is in | |
803 | XSUBs - see L<perlxstut>, L<perlxs> and L<perlguts> for a longer | |
804 | description of the macros used in stack manipulation. | |
805 | ||
806 | =item Mark stack | |
807 | ||
808 | I say `your portion of the stack' above because PP code doesn't | |
809 | necessarily get the whole stack to itself: if your function calls | |
810 | another function, you'll only want to expose the arguments aimed for the | |
811 | called function, and not (necessarily) let it get at your own data. The | |
812 | way we do this is to have a `virtual' bottom-of-stack, exposed to each | |
813 | function. The mark stack keeps bookmarks to locations in the argument | |
814 | stack usable by each function. For instance, when dealing with a tied | |
815 | variable (internally, something with `P' magic), Perl has to call | |
816 | methods for accesses to the tied variables. However, we need to separate | |
817 | the arguments exposed to the method from the arguments exposed to the | |
818 | original function - the store or fetch or whatever it may be. Here's how | |
819 | the tied C<push> is implemented; see C<av_push> in F<av.c>: | |
820 | ||
821 | 1 PUSHMARK(SP); | |
822 | 2 EXTEND(SP,2); | |
823 | 3 PUSHs(SvTIED_obj((SV*)av, mg)); | |
824 | 4 PUSHs(val); | |
825 | 5 PUTBACK; | |
826 | 6 ENTER; | |
827 | 7 call_method("PUSH", G_SCALAR|G_DISCARD); | |
828 | 8 LEAVE; | |
829 | 9 POPSTACK; | |
830 | ||
831 | The lines which concern the mark stack are the first, fifth and last | |
832 | lines: they save away, restore and remove the current position of the | |
833 | argument stack. | |
834 | ||
835 | Let's examine the whole implementation, for practice: | |
836 | ||
837 | 1 PUSHMARK(SP); | |
838 | ||
839 | Push the current state of the stack pointer onto the mark stack. This is | |
840 | so that when we've finished adding items to the argument stack, Perl | |
841 | knows how many things we've added recently. | |
842 | ||
843 | 2 EXTEND(SP,2); | |
844 | 3 PUSHs(SvTIED_obj((SV*)av, mg)); | |
845 | 4 PUSHs(val); | |
846 | ||
847 | We're going to add two more items onto the argument stack: when you have | |
848 | a tied array, the C<PUSH> subroutine receives the object and the value | |
849 | to be pushed, and that's exactly what we have here - the tied object, | |
850 | retrieved with C<SvTIED_obj>, and the value, the SV C<val>. | |
851 | ||
852 | 5 PUTBACK; | |
853 | ||
854 | Next we tell Perl to make the change to the global stack pointer: C<dSP> | |
855 | only gave us a local copy, not a reference to the global. | |
856 | ||
857 | 6 ENTER; | |
858 | 7 call_method("PUSH", G_SCALAR|G_DISCARD); | |
859 | 8 LEAVE; | |
860 | ||
861 | C<ENTER> and C<LEAVE> localise a block of code - they make sure that all | |
862 | variables are tidied up, everything that has been localised gets | |
863 | its previous value restored, and so on. Think of them as the C<{> and | |
864 | C<}> of a Perl block. | |
865 | ||
866 | To actually do the magic method call, we have to call a subroutine in | |
867 | Perl space: C<call_method> takes care of that, and it's described in | |
868 | L<perlcall>. We call the C<PUSH> method in scalar context, and we're | |
869 | going to discard its return value. | |
870 | ||
871 | 9 POPSTACK; | |
872 | ||
873 | Finally, we remove the value we placed on the mark stack, since we | |
874 | don't need it any more. | |
875 | ||
876 | =item Save stack | |
877 | ||
878 | C doesn't have a concept of dynamic (C<local>-style) scope, so perl provides one. We've | |
879 | seen that C<ENTER> and C<LEAVE> are used as scoping braces; the save | |
880 | stack implements the C equivalent of, for example: | |
881 | ||
882 | { | |
883 | local $foo = 42; | |
884 | ... | |
885 | } | |
886 | ||
887 | See L<perlguts/Localising Changes> for how to use the save stack. | |
888 | ||
889 | =back | |
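
The three stacks can be modelled in a few lines of plain C. This is a toy sketch under heavy assumptions, not the interpreter's code: it uses doubles and ints instead of SVs, fixed-size arrays, and invented names (C<toy_push>, C<toy_pushmark>, C<toy_pp_sum>, C<toy_save_int> and so on). It only shows the bookkeeping: a mark remembers where a callee's arguments begin, and a saved value is put back when the scope closes.

```c
#include <assert.h>

/* Toy model only: invented names, fixed-size arrays, doubles instead of SVs. */
#define TOY_MAX 32

double toy_stack[TOY_MAX];   /* argument stack */
int toy_sp = -1;             /* index of the top item, -1 when empty */
int toy_marks[TOY_MAX];      /* mark stack: remembered values of toy_sp */
int toy_mark_top = -1;

void toy_push(double v)
{
    assert(toy_sp + 1 < TOY_MAX);
    toy_stack[++toy_sp] = v;
}

double toy_pop(void)
{
    assert(toy_sp >= 0);
    return toy_stack[toy_sp--];
}

/* Bookmark the current top, so a callee can tell which items are its own. */
void toy_pushmark(void)
{
    assert(toy_mark_top + 1 < TOY_MAX);
    toy_marks[++toy_mark_top] = toy_sp;
}

/* Work on the top item in place, in the spirit of SETi(-TOPi). */
void toy_pp_negate(void)
{
    assert(toy_sp >= 0);
    toy_stack[toy_sp] = -toy_stack[toy_sp];
}

/* A callee that consumes only the items above the most recent mark. */
void toy_pp_sum(void)
{
    double total = 0.0;
    int mark;

    assert(toy_mark_top >= 0);
    mark = toy_marks[toy_mark_top--];
    while (toy_sp > mark)
        total += toy_pop();
    toy_push(total);
}

/* Save stack: remember an address and its old value so a scope can undo it. */
struct toy_save { int *addr; int old; };
struct toy_save toy_saves[TOY_MAX];
int toy_save_top = 0;

int toy_enter(void) { return toy_save_top; }   /* open a scope */

void toy_save_int(int *p)                      /* like "local" on an int */
{
    assert(toy_save_top < TOY_MAX);
    toy_saves[toy_save_top].addr = p;
    toy_saves[toy_save_top].old = *p;
    toy_save_top++;
}

void toy_leave(int base)                       /* close it, restoring values */
{
    while (toy_save_top > base) {
        toy_save_top--;
        *toy_saves[toy_save_top].addr = toy_saves[toy_save_top].old;
    }
}
```

A caller would push its own data, call C<toy_pushmark()>, push the callee's arguments and then call C<toy_pp_sum()>; the caller's data below the mark is left alone. Likewise, a value saved with C<toy_save_int()> after C<toy_enter()> gets its old contents back at C<toy_leave()>.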
890 | ||
891 | =head2 Millions of Macros | |
892 | ||
893 | One thing you'll notice about the Perl source is that it's full of | |
894 | macros. Some have called the pervasive use of macros the hardest thing | |
895 | to understand; others find it adds to clarity. Let's take an example, | |
896 | the code which implements the addition operator: | |
897 | ||
898 | 1 PP(pp_add) | |
899 | 2 { | |
900 | 3 djSP; dATARGET; tryAMAGICbin(add,opASSIGN); | |
901 | 4 { | |
902 | 5 dPOPTOPnnrl_ul; | |
903 | 6 SETn( left + right ); | |
904 | 7 RETURN; | |
905 | 8 } | |
906 | 9 } | |
907 | ||
908 | Every line here (apart from the braces, of course) contains a macro. The | |
909 | first line sets up the function declaration as Perl expects for PP code; | |
910 | line 3 sets up variable declarations for the argument stack and the | |
911 | target, the return value of the operation. Finally, it tries to see if | |
912 | the addition operation is overloaded; if so, the appropriate subroutine | |
913 | is called. | |
914 | ||
915 | Line 5 is another variable declaration - all variable declarations start | |
916 | with C<d> - which pops from the top of the argument stack two NVs (hence | |
917 | C<nn>) and puts them into the variables C<right> and C<left>, hence the | |
918 | C<rl>. These are the two operands to the addition operator. Next, we | |
919 | call C<SETn> to set the NV of the return value to the result of adding | |
920 | the two values. This done, we return - the C<RETURN> macro makes sure | |
921 | that our return value is properly handled, and we pass the next operator | |
922 | to run back to the main run loop. | |
923 | ||
924 | Most of these macros are explained in L<perlapi>, and some of the more | |
925 | important ones are explained in L<perlxs> as well. Pay special attention | |
926 | to L<perlguts/Background and PERL_IMPLICIT_CONTEXT> for information on | |
927 | the C<[pad]THX_?> macros. | |
928 | ||
929 | ||
930 | =head2 Poking at Perl | |
931 | ||
932 | To really poke around with Perl, you'll probably want to build Perl for | |
933 | debugging, like this: | |
934 | ||
935 | ./Configure -d -D optimize=-g | |
936 | make | |
937 | ||
938 | C<-g> is a flag to the C compiler to have it produce debugging | |
939 | information which will allow us to step through a running program. | |
940 | F<Configure> will also turn on the C<DEBUGGING> compilation symbol which | |
941 | enables all the internal debugging code in Perl. There are a whole bunch | |
942 | of things you can debug with this: L<perlrun> lists them all, and the | |
943 | best way to find out about them is to play about with them. The most | |
944 | useful options are probably | |
945 | ||
946 | l Context (loop) stack processing | |
947 | t Trace execution | |
948 | o Method and overloading resolution | |
949 | c String/numeric conversions | |
950 | ||
951 | Some of the functionality of the debugging code can be achieved using XS | |
952 | modules. | |
953 | ||
954 | -Dr => use re 'debug' | |
955 | -Dx => use O 'Debug' | |
956 | ||
957 | =head2 Using a source-level debugger | |
958 | ||
959 | If the debugging output of C<-D> doesn't help you, it's time to step | |
960 | through perl's execution with a source-level debugger. | |
961 | ||
962 | =over 3 | |
963 | ||
964 | =item * | |
965 | ||
966 | We'll use C<gdb> for our examples here; the principles will apply to any | |
967 | debugger, but check the manual of the one you're using. | |
968 | ||
969 | =back | |
970 | ||
971 | To fire up the debugger, type | |
972 | ||
973 | gdb ./perl | |
974 | ||
975 | You'll want to do that in your Perl source tree so the debugger can read | |
976 | the source code. You should see the copyright message, followed by the | |
977 | prompt. | |
978 | ||
979 | (gdb) | |
980 | ||
981 | C<help> will get you into the documentation, but here are the most | |
982 | useful commands: | |
983 | ||
984 | =over 3 | |
985 | ||
986 | =item run [args] | |
987 | ||
988 | Run the program with the given arguments. | |
989 | ||
990 | =item break function_name | |
991 | ||
992 | =item break source.c:xxx | |
993 | ||
994 | Tells the debugger that we'll want to pause execution when we reach | |
995 | either the named function (but see L</Function names>!) or the given | |
996 | line in the named source file. | |
997 | ||
998 | =item step | |
999 | ||
1000 | Steps through the program a line at a time. | |
1001 | ||
1002 | =item next | |
1003 | ||
1004 | Steps through the program a line at a time, without descending into | |
1005 | functions. | |
1006 | ||
1007 | =item continue | |
1008 | ||
1009 | Run until the next breakpoint. | |
1010 | ||
1011 | =item finish | |
1012 | ||
1013 | Run until the end of the current function, then stop again. | |
1014 | ||
1015 | =item | |
1016 | ||
1017 | Just pressing Enter will do the most recent operation again - it's a | |
1018 | blessing when stepping through miles of source code. | |
1019 | ||
1020 | =item print | |
1021 | ||
1022 | Execute the given C code and print its results. B<WARNING>: Perl makes | |
1023 | heavy use of macros, and F<gdb> is not aware of macros. You'll have to | |
1024 | substitute them yourself. So, for instance, you can't say | |
1025 | ||
1026 | print SvPV_nolen(sv) | |
1027 | ||
1028 | but you have to say | |
1029 | ||
1030 | print Perl_sv_2pv_nolen(sv) | |
1031 | ||
1032 | You may find it helpful to have a "macro dictionary", which you can | |
1033 | produce by saying C<cpp -dM perl.c | sort>. Even then, F<cpp> won't | |
1034 | recursively apply the macros for you. | |
1035 | ||
1036 | =back | |
1037 | ||
1038 | =head2 Dumping Perl Data Structures | |
1039 | ||
1040 | One way to get around this macro hell is to use the dumping functions in | |
1041 | F<dump.c>; these work a little like an internal | |
1042 | L<Devel::Peek|Devel::Peek>, but they also cover OPs and other structures | |
1043 | that you can't get at from Perl. Let's take an example. We'll use the | |
1044 | C<$a = $b + $c> we used before, but give it a bit of context: | |
1045 | C<$b = "6XXXX"; $c = 2.3;>. Where's a good place to stop and poke around? | |
1046 | ||
1047 | What about C<pp_add>, the function we examined earlier to implement the | |
1048 | C<+> operator: | |
1049 | ||
1050 | (gdb) break Perl_pp_add | |
1051 | Breakpoint 1 at 0x46249f: file pp_hot.c, line 309. | |
1052 | ||
1053 | Notice we use C<Perl_pp_add> and not C<pp_add> - see L<perlguts/Function Names>. | |
1054 | With the breakpoint in place, we can run our program: | |
1055 | ||
1056 | (gdb) run -e '$b = "6XXXX"; $c = 2.3; $a = $b + $c' | |
1057 | ||
1058 | Lots of junk will go past as gdb reads in the relevant source files and | |
1059 | libraries, and then: | |
1060 | ||
1061 | Breakpoint 1, Perl_pp_add () at pp_hot.c:309 | |
1062 | 309 djSP; dATARGET; tryAMAGICbin(add,opASSIGN); | |
1063 | (gdb) step | |
1064 | 311 dPOPTOPnnrl_ul; | |
1065 | (gdb) | |
1066 | ||
1067 | We looked at this bit of code before, and we said that C<dPOPTOPnnrl_ul> | |
1068 | arranges for two C<NV>s to be placed into C<left> and C<right> - let's | |
1069 | slightly expand it: | |
1070 | ||
1071 | #define dPOPTOPnnrl_ul NV right = POPn; \ | |
1072 | SV *leftsv = TOPs; \ | |
1073 | NV left = USE_LEFT(leftsv) ? SvNV(leftsv) : 0.0 | |
1074 | ||
1075 | C<POPn> takes the SV from the top of the stack and obtains its NV either | |
1076 | directly (if C<SvNOK> is set) or by calling the C<sv_2nv> function. | |
1077 | C<TOPs> takes the next SV from the top of the stack - yes, C<POPn> uses | |
1078 | C<TOPs> - but doesn't remove it. We then use C<SvNV> to get the NV from | |
1079 | C<leftsv> in the same way as before - yes, C<POPn> uses C<SvNV>. | |
1080 | ||
1081 | Since we don't have an NV for C<$b>, we'll have to use C<sv_2nv> to | |
1082 | convert it. If we step again, we'll find ourselves there: | |
1083 | ||
1084 | Perl_sv_2nv (sv=0xa0675d0) at sv.c:1669 | |
1085 | 1669 if (!sv) | |
1086 | (gdb) | |
1087 | ||
1088 | We can now use C<Perl_sv_dump> to investigate the SV: | |
1089 | ||
1090 | SV = PV(0xa057cc0) at 0xa0675d0 | |
1091 | REFCNT = 1 | |
1092 | FLAGS = (POK,pPOK) | |
1093 | PV = 0xa06a510 "6XXXX"\0 | |
1094 | CUR = 5 | |
1095 | LEN = 6 | |
1096 | $1 = void | |
1097 | ||
1098 | We know we're going to get C<6> from this, so let's finish the | |
1099 | subroutine: | |
1100 | ||
1101 | (gdb) finish | |
1102 | Run till exit from #0 Perl_sv_2nv (sv=0xa0675d0) at sv.c:1671 | |
1103 | 0x462669 in Perl_pp_add () at pp_hot.c:311 | |
1104 | 311 dPOPTOPnnrl_ul; | |
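
Why C<6>? The string's leading numeric prefix is what counts. The sketch below shows the same idea with C's standard C<strtod>; this is only an analogy, since Perl uses its own conversion code, and C<toy_numeric_prefix> is a name invented for the illustration.

```c
#include <stdlib.h>

/* Illustration only: C's strtod stands in for Perl's own string-to-number
 * code, which this sketch does not reproduce. */
double toy_numeric_prefix(const char *s, size_t *used)
{
    char *end;
    double value = strtod(s, &end);   /* parses the longest numeric prefix */

    if (used != NULL)
        *used = (size_t)(end - s);    /* how many characters were consumed */
    return value;
}
```

For C<"6XXXX"> only the first character is consumed and the value is 6; for a string with no numeric prefix at all, nothing is consumed and the value is 0.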
1105 | ||
1106 | We can also dump out this op: the current op is always stored in | |
1107 | C<PL_op>, and we can dump it with C<Perl_op_dump>. This'll give us | |
1108 | similar output to L<B::Debug|B::Debug>. | |
1109 | ||
1110 | { | |
1111 | 13 TYPE = add ===> 14 | |
1112 | TARG = 1 | |
1113 | FLAGS = (SCALAR,KIDS) | |
1114 | { | |
1115 | TYPE = null ===> (12) | |
1116 | (was rv2sv) | |
1117 | FLAGS = (SCALAR,KIDS) | |
1118 | { | |
1119 | 11 TYPE = gvsv ===> 12 | |
1120 | FLAGS = (SCALAR) | |
1121 | GV = main::b | |
1122 | } | |
1123 | } | |
1124 | ||
1125 | < finish this later > | |
1126 | ||
1127 | =head2 Patching | |
1128 | ||
1129 | All right, we've now had a look at how to navigate the Perl sources and | |
1130 | some things you'll need to know when fiddling with them. Let's now get | |
1131 | on and create a simple patch. Here's something Larry suggested: if a | |
1132 | C<U> is the first active format during a C<pack> (for example, | |
1133 | C<pack "U3C8", @stuff>), then the resulting string should be treated as | |
1134 | UTF8 encoded. | |
1135 | ||
1136 | How do we prepare to fix this up? First we locate the code in question - | |
1137 | the C<pack> happens at runtime, so it's going to be in one of the F<pp> | |
1138 | files. Sure enough, C<pp_pack> is in F<pp.c>. Since we're going to be | |
1139 | altering this file, let's copy it to F<pp.c~>. | |
1140 | ||
1141 | Now let's look over C<pp_pack>: we take a pattern into C<pat>, and then | |
1142 | loop over the pattern, taking each format character in turn into | |
1143 | C<datumtype>. Then for each possible format character, we swallow up | |
1144 | the other arguments in the pattern (a field width, an asterisk, and so | |
1145 | on) and convert the next chunk of input into the specified format, adding | |
1146 | it onto the output SV C<cat>. | |
1147 | ||
1148 | How do we know if the C<U> is the first format in the C<pat>? Well, if | |
1149 | we have a pointer to the start of C<pat>, then when we see a C<U> we can | |
1150 | test whether we're still at the start of the string. So, here's where | |
1151 | C<pat> is set up: | |
1152 | ||
1153 | STRLEN fromlen; | |
1154 | register char *pat = SvPVx(*++MARK, fromlen); | |
1155 | register char *patend = pat + fromlen; | |
1156 | register I32 len; | |
1157 | I32 datumtype; | |
1158 | SV *fromstr; | |
1159 | ||
1160 | We'll have another string pointer in there: | |
1161 | ||
1162 | STRLEN fromlen; | |
1163 | register char *pat = SvPVx(*++MARK, fromlen); | |
1164 | register char *patend = pat + fromlen; | |
1165 | + char *patcopy; | |
1166 | register I32 len; | |
1167 | I32 datumtype; | |
1168 | SV *fromstr; | |
1169 | ||
1170 | And just before we start the loop, we'll set C<patcopy> to be the start | |
1171 | of C<pat>: | |
1172 | ||
1173 | items = SP - MARK; | |
1174 | MARK++; | |
1175 | sv_setpvn(cat, "", 0); | |
1176 | + patcopy = pat; | |
1177 | while (pat < patend) { | |
1178 | ||
1179 | Now if we see a C<U> which was at the start of the string, we turn on | |
1180 | the UTF8 flag for the output SV, C<cat>: | |
1181 | ||
1182 | + if (datumtype == 'U' && pat==patcopy+1) | |
1183 | + SvUTF8_on(cat); | |
1184 | if (datumtype == '#') { | |
1185 | while (pat < patend && *pat != '\n') | |
1186 | pat++; | |
1187 | ||
1188 | Remember that it has to be C<patcopy+1> because the first character of | |
1189 | the string is the C<U> which has been swallowed into C<datumtype>! | |
1190 | ||
1191 | Oops, we forgot one thing: what if there are spaces at the start of the | |
1192 | pattern? C<pack(" U*", @stuff)> will have C<U> as the first active | |
1193 | character, even though it's not the first thing in the pattern. In this | |
1194 | case, we have to advance C<patcopy> along with C<pat> when we see spaces: | |
1195 | ||
1196 | if (isSPACE(datumtype)) | |
1197 | continue; | |
1198 | ||
1199 | needs to become | |
1200 | ||
1201 | if (isSPACE(datumtype)) { | |
1202 | patcopy++; | |
1203 | continue; | |
1204 | } | |
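
The pointer bookkeeping can be tried out in isolation. The function below is a standalone sketch of the C<patcopy> idea, not the C<pp_pack> code: C<toy_first_active_is_U> is an invented name, it uses the C library's C<isspace> where the patch uses C<isSPACE>, and it treats every character as one format, ignoring counts and the other details the real loop handles.

```c
#include <ctype.h>

/* Standalone sketch of the patcopy idea; this is not the pp_pack code. */
int toy_first_active_is_U(const char *pat)
{
    const char *patcopy = pat;   /* where the first active format would start */
    int utf8 = 0;                /* stands in for SvUTF8_on(cat) */

    while (*pat != '\0') {
        int datumtype = (unsigned char)*pat++;   /* swallow one character */

        if (isspace(datumtype)) {
            patcopy++;           /* spaces do not count as active formats */
            continue;
        }
        /* pat is now one past the character just read, hence patcopy + 1 */
        if (datumtype == 'U' && pat == patcopy + 1)
            utf8 = 1;
    }
    return utf8;
}
```

With this sketch, C<"U*"> and C<" U*"> both report a leading C<U>, while C<"C0U*"> does not, which is the behaviour the patch is after.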
1205 | ||
1206 | OK. That's the C part done. Now we must do two additional things before | |
1207 | this patch is ready to go: we've changed the behaviour of Perl, and so | |
1208 | we must document that change. We must also provide some more regression | |
1209 | tests to make sure our patch works and doesn't create a bug somewhere | |
1210 | else along the line. | |
1211 | ||
1212 | The regression tests for each operator live in F<t/op/>, and so we make | |
1213 | a copy of F<t/op/pack.t> to F<t/op/pack.t~>. Now we can add our tests | |
1214 | to the end. First, we'll test that the C<U> does indeed create Unicode | |
1215 | strings: | |
1216 | ||
1217 | print 'not ' unless "1.20.300.4000" eq sprintf "%vd", pack("U*",1,20,300,4000); | |
1218 | print "ok $test\n"; $test++; | |
1219 | ||
1220 | Now we'll test that we got that space-at-the-beginning business right: | |
1221 | ||
1222 | print 'not ' unless "1.20.300.4000" eq | |
1223 | sprintf "%vd", pack(" U*",1,20,300,4000); | |
1224 | print "ok $test\n"; $test++; | |
1225 | ||
1226 | And finally we'll test that we don't make Unicode strings if C<U> is B<not> | |
1227 | the first active format: | |
1228 | ||
1229 | print 'not ' unless "1.20.300.4000" ne | |
1230 | sprintf "%vd", pack("C0U*",1,20,300,4000); | |
1231 | print "ok $test\n"; $test++; | |
1232 | ||
1233 | Mustn't forget to change the number of tests which appears at the top, or | |
1234 | else the automated tester will get confused: |
1235 | ||
1236 | -print "1..156\n"; | |
1237 | +print "1..159\n"; | |
1238 | ||
1239 | We now compile up Perl, and run it through the test suite. Our new | |
1240 | tests pass, hooray! | |
1241 | ||
1242 | Finally, the documentation. The job is never done until the paperwork is | |
1243 | over, so let's describe the change we've just made. The relevant place | |
1244 | is F<pod/perlfunc.pod>; again, we make a copy, and then we'll insert | |
1245 | this text in the description of C<pack>: | |
1246 | ||
1247 | =item * | |
1248 | ||
1249 | If the pattern begins with a C<U>, the resulting string will be treated | |
1250 | as Unicode-encoded. You can force UTF8 encoding on in a string with an | |
1251 | initial C<U0>, and the bytes that follow will be interpreted as Unicode | |
1252 | characters. If you don't want this to happen, you can begin your pattern | |
1253 | with C<C0> (or anything else) to force Perl not to UTF8 encode your | |
1254 | string, and then follow this with a C<U*> somewhere in your pattern. | |
1255 | ||
1256 | All done. Now let's create the patch. F<Porting/patching.pod> tells us | |
1257 | that if we're making major changes, we should copy the entire directory | |
1258 | to somewhere safe before we begin fiddling, and then do | |
1259 | ||
1260 | diff -ruN old new > patch | |
1261 | ||
1262 | However, we know which files we've changed, and we can simply do this: | |
1263 | ||
1264 | diff -u pp.c~ pp.c > patch | |
1265 | diff -u t/op/pack.t~ t/op/pack.t >> patch | |
1266 | diff -u pod/perlfunc.pod~ pod/perlfunc.pod >> patch | |
1267 | ||
1268 | We end up with a patch looking a little like this: | |
1269 | ||
1270 | --- pp.c~ Fri Jun 02 04:34:10 2000 | |
1271 | +++ pp.c Fri Jun 16 11:37:25 2000 | |
1272 | @@ -4375,6 +4375,7 @@ | |
1273 | register I32 items; | |
1274 | STRLEN fromlen; | |
1275 | register char *pat = SvPVx(*++MARK, fromlen); | |
1276 | + char *patcopy; | |
1277 | register char *patend = pat + fromlen; | |
1278 | register I32 len; | |
1279 | I32 datumtype; | |
1280 | @@ -4405,6 +4406,7 @@ | |
1281 | ... | |
1282 | ||
1283 | And finally, we submit it, with our rationale, to perl5-porters. Job | |
1284 | done! | |
1285 | ||
1286 | =head1 EXTERNAL TOOLS FOR DEBUGGING PERL |
1287 | ||
1288 | Sometimes it helps to use external tools while debugging and | |
1289 | testing Perl. This section tries to guide you through using | |
1290 | some common testing and debugging tools with Perl. This is | |
1291 | meant as a guide to interfacing these tools with Perl, not | |
1292 | as any kind of guide to the use of the tools themselves. | |
1293 | ||
1294 | =head2 Rational Software's Purify | |
1295 | ||
1296 | Purify is a commercial tool that is helpful in identifying | |
1297 | memory overruns, wild pointers, memory leaks and other such | |
1298 | badness. Perl must be compiled in a specific way for | |
1299 | optimal testing with Purify. Purify is available under | |
1300 | Windows NT, Solaris, HP-UX, SGI, and Siemens Unix. | |
1301 | ||
1302 | The only currently known leaks happen when there are | |
1303 | compile-time errors within eval or require. (Fixing these | |
1304 | is non-trivial, unfortunately, but they must be fixed | |
1305 | eventually.) | |
1306 | ||
1307 | =head2 Purify on Unix | |
1308 | ||
1309 | On Unix, Purify creates a new Perl binary. To get the most | |
1310 | benefit out of Purify, you should create the perl to Purify | |
1311 | using: | |
1312 | ||
1313 | sh Configure -Accflags=-DPURIFY -Doptimize='-g' \ | |
1314 | -Uusemymalloc -Dusemultiplicity | |
1315 | ||
1316 | where these arguments mean: | |
1317 | ||
1318 | =over 4 | |
1319 | ||
1320 | =item -Accflags=-DPURIFY | |
1321 | ||
1322 | Disables Perl's arena memory allocation functions, as well as | |
1323 | forcing use of memory allocation functions derived from the | |
1324 | system malloc. | |
1325 | ||
1326 | =item -Doptimize='-g' | |
1327 | ||
1328 | Adds debugging information so that you see the exact source | |
1329 | statements where the problem occurs. Without this flag, all | |
1330 | you will see is the source filename of where the error occurred. | |
1331 | ||
1332 | =item -Uusemymalloc | |
1333 | ||
1334 | Disable Perl's malloc so that Purify can more closely monitor | |
1335 | allocations and leaks. Using Perl's malloc will make Purify | |
1336 | report most leaks in the "potential" leaks category. | |
1337 | ||
1338 | =item -Dusemultiplicity | |
1339 | ||
1340 | Enabling the multiplicity option allows perl to clean up | |
1341 | thoroughly when the interpreter shuts down, which reduces the | |
1342 | number of bogus leak reports from Purify. | |
1343 | ||
1344 | =back | |
1345 | ||
1346 | Once you've compiled a perl suitable for Purify'ing, then you | |
1347 | can just: | |
1348 | ||
1349 | make pureperl | |
1350 | ||
1351 | which creates a binary named 'pureperl' that has been Purify'ed. | |
1352 | This binary is used in place of the standard 'perl' binary | |
1353 | when you want to debug Perl memory problems. | |
1354 | ||
1355 | As an example, to show any memory leaks produced while running | |
1356 | the standard Perl test suite, you would create and run the | |
1357 | Purify'ed perl as: | |
1358 | ||
1359 | make pureperl | |
1360 | cd t | |
1361 | ../pureperl -I../lib harness | |
1362 | ||
1363 | which would run Perl on the harness script and report any memory problems. | |
1364 | ||
1365 | Purify outputs messages in "Viewer" windows by default. If | |
1366 | you don't have a windowing environment or if you simply | |
1367 | want the Purify output to unobtrusively go to a log file | |
1368 | instead of to the interactive window, use the following | |
1369 | options to output to the log file "perl.log": | |
1370 | ||
1371 | setenv PURIFYOPTIONS "-chain-length=25 -windows=no \ | |
1372 | -log-file=perl.log -append-logfile=yes" | |
1373 | ||
1374 | If you plan to use the "Viewer" windows, then you only need this option: | |
1375 | ||
1376 | setenv PURIFYOPTIONS "-chain-length=25" | |
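
The C<setenv> lines above use csh syntax. In a Bourne-compatible
shell (sh, ksh, bash) the log-file variant can be sketched as follows.
Only the shell syntax differs; the Purify options are the ones given
above.

```shell
# Bourne-shell equivalent of the csh "setenv" example above.
# The option string is copied unchanged from that example.
PURIFYOPTIONS="-chain-length=25 -windows=no -log-file=perl.log -append-logfile=yes"
export PURIFYOPTIONS
```

For the "Viewer" case, assign only C<-chain-length=25> in the same way.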
1377 | ||
1378 | =head2 Purify on NT | |
1379 | ||
1380 | Purify on Windows NT instruments the Perl binary 'perl.exe' | |
1381 | on the fly. There are several options in the makefile you | |
1382 | should change to get the most use out of Purify: | |
1383 | ||
1384 | =over 4 | |
1385 | ||
1386 | =item DEFINES | |
1387 | ||
1388 | You should add -DPURIFY to the DEFINES line so the DEFINES | |
1389 | line looks something like: | |
1390 | ||
1391 | DEFINES = -DWIN32 -D_CONSOLE -DNO_STRICT $(CRYPT_FLAG) -DPURIFY=1 | |
1392 | ||
1393 | This disables Perl's arena memory allocation functions and | |
1394 | forces use of memory allocation functions derived from the | |
1395 | system malloc. | |
1396 | ||
1397 | =item USE_MULTI = define | |
1398 | ||
1399 | Enabling the multiplicity option allows perl to clean up | |
1400 | thoroughly when the interpreter shuts down, which reduces the | |
1401 | number of bogus leak reports from Purify. | |
1402 | ||
1403 | =item #PERL_MALLOC = define | |
1404 | ||
1405 | Commenting this line out disables Perl's malloc so that Purify can | |
1406 | more closely monitor allocations and leaks. Using Perl's malloc will | |
1407 | make Purify report most leaks in the "potential" leaks category. | |
1408 | ||
1409 | =item CFG = Debug | |
1410 | ||
1411 | Adds debugging information so that you see the exact source | |
1412 | statements where the problem occurs. Without this flag, all you | |
1413 | will see is the name of the source file where the error occurred. | |
1414 | ||
1415 | =back | |
1416 | ||
1417 | As an example, to show any memory leaks produced while running the | |
1418 | standard Perl test suite, you would build perl and run Purify as: | |
1419 | ||
1420 | cd win32 | |
1421 | make | |
1422 | cd ../t | |
1423 | purify ../perl -I../lib harness | |
1424 | ||
1425 | which would instrument Perl in memory, run Perl on the harness | |
1426 | script, and finally report any memory problems. | |
1427 | ||
a422fd2d SC |
1428 | =head2 CONCLUSION |
1429 | ||
1430 | We've had a brief look around the Perl source, an overview of the stages | |
1431 | F<perl> goes through when it's running your code, and how to use a | |
902b9dbf MLF |
1432 | debugger to poke at the Perl guts. We took a very simple problem and |
1433 | demonstrated how to solve it fully - with documentation, regression | |
1434 | tests, and finally a patch for submission to p5p. Lastly, we talked | |
1435 | about how to use external tools to debug and test Perl. | |
a422fd2d SC |
1436 | |
1437 | I'd now suggest you read over those references again, and then, as soon | |
1438 | as possible, get your hands dirty. The best way to learn is by doing, | |
1439 | so: | |
1440 | ||
1441 | =over 3 | |
1442 | ||
1443 | =item * | |
1444 | ||
1445 | Subscribe to perl5-porters, follow the patches and try and understand | |
1446 | them; don't be afraid to ask if there's a portion you're not clear on - | |
1447 | who knows, you may unearth a bug in the patch... | |
1448 | ||
1449 | =item * | |
1450 | ||
1451 | Keep up to date with the bleeding edge Perl distributions and get | |
1452 | familiar with the changes. Try and get an idea of what areas people are | |
1453 | working on and the changes they're making. | |
1454 | ||
1455 | =item * | |
1456 | ||
1457 | Find an area of Perl that seems interesting to you, and see if you can | |
1458 | work out how it works. Scan through the source, and step through it in | |
1459 | the debugger. Play, poke, investigate, fiddle! You'll probably get to | |
1460 | understand not just your chosen area but a much wider range of F<perl>'s | |
1461 | activity as well, and probably sooner than you'd think. | |
1462 | ||
1463 | =back | |
1464 | ||
1465 | =over 3 | |
1466 | ||
1467 | =item I<The Road goes ever on and on, down from the door where it began.> | |
1468 | ||
1469 | =back | |
1470 | ||
1471 | If you can do these things, you've started on the long road to Perl porting. | |
1472 | Thanks for wanting to help make Perl better - and happy hacking! | |
1473 | ||
e8cd7eae GS |
1474 | =head1 AUTHOR |
1475 | ||
1476 | This document was written by Nathan Torkington, and is maintained by | |
1477 | the perl5-porters mailing list. | |
1478 |