=head1 NAME

perlfilter - Source Filters

=head1 DESCRIPTION

This article is about a little-known feature of Perl called
I<source filters>. Source filters alter the program text of a module
before Perl sees it, much as a C preprocessor alters the source text of
a C program before the compiler sees it. This article tells you more
about what source filters are, how they work, and how to write your
own.

The original purpose of source filters was to let you encrypt your
program source to prevent casual piracy. This isn't all they can do, as
you'll soon learn. But first, the basics.

=head1 CONCEPTS

Before the Perl interpreter can execute a Perl script, it must first
read it from a file into memory for parsing and compilation. (Even
scripts specified on the command line with the C<-e> option are stored in
a temporary file for the parser to process.) If that script itself
includes other scripts with a C<use> or C<require> statement, then each
of those scripts will have to be read from their respective files as
well.

Now think of each logical connection between the Perl parser and an
individual file as a I<source stream>. A source stream is created when
the Perl parser opens a file, it continues to exist as the source code
is read into memory, and it is destroyed when Perl is finished parsing
the file. If the parser encounters a C<require> or C<use> statement in
a source stream, a new and distinct stream is created just for that
file.

The diagram below represents a single source stream, with the flow of
source from a Perl script file on the left into the Perl parser on the
right. This is how Perl normally operates.

    file -------> parser

There are two important points to remember:

=over 5

=item 1.

Although there can be any number of source streams in existence at any
given time, only one will be active.

=item 2.

Every source stream is associated with only one file.

=back

A source filter is a special kind of Perl module that intercepts and
modifies a source stream before it reaches the parser. A source filter
changes our diagram like this:

    file ----> filter ----> parser

If that doesn't make much sense, consider the analogy of a command
pipeline. Say you have a shell script stored in the compressed file
I<trial.gz>. The simple pipeline command below runs the script without
needing to create a temporary file to hold the uncompressed file.

    gunzip -c trial.gz | sh

In this case, the data flow from the pipeline can be represented as follows:

    trial.gz ----> gunzip ----> sh

With source filters, you can store the text of your script compressed
and use a source filter to uncompress it for Perl's parser:

     compressed         gunzip
    Perl program ---> source filter ---> parser

=head1 USING FILTERS

So how do you use a source filter in a Perl script? Above, I said that
a source filter is just a special kind of module. Like all Perl
modules, a source filter is invoked with a C<use> statement.

Say you want to pass your Perl source through the C preprocessor before
execution. Older versions of Perl offered a C<-P> command line option
for this (it was removed in Perl 5.12), but as it happens, the source
filters distribution comes with a C preprocessor filter module called
C<Filter::cpp>. Let's use that instead.

Below is an example program, C<cpp_test>, which makes use of this filter.
Line numbers have been added to allow specific lines to be referenced
easily.

    1: use Filter::cpp ;
    2: #define TRUE 1
    3: $a = TRUE ;
    4: print "a = $a\n" ;

When you execute this script, Perl creates a source stream for the
file. Before the parser processes any of the lines from the file, the
source stream looks like this:

    cpp_test ---------> parser

Line 1, C<use Filter::cpp>, includes and installs the C<cpp> filter
module. All source filters work this way. The C<use> statement is
compiled and executed at compile time, before any more of the file is
read, and it attaches the C<cpp> filter to the source stream behind the
scenes. Now the data flow looks like this:

    cpp_test ----> cpp filter ----> parser

As the parser reads the second and subsequent lines from the source
stream, it feeds those lines through the C<cpp> source filter before
processing them. The C<cpp> filter simply passes each line through the
real C preprocessor. The output from the C preprocessor is then
inserted back into the source stream by the filter.

                   .-> cpp --.
                   |         |
                   |         |
                   |       <-'
    cpp_test ----> cpp filter ----> parser

The parser then sees the following code:

    use Filter::cpp ;
    $a = 1 ;
    print "a = $a\n" ;

Let's consider what happens when the filtered code includes another
module with use:

    1: use Filter::cpp ;
    2: #define TRUE 1
    3: use Fred ;
    4: $a = TRUE ;
    5: print "a = $a\n" ;

The C<cpp> filter does not apply to the text of the Fred module, only
to the text of the file that used it (C<cpp_test>). Although the use
statement on line 3 will pass through the cpp filter, the module that
gets included (C<Fred>) will not. The source streams look like this
after line 3 has been parsed and before line 4 is parsed:

    cpp_test ---> cpp filter ---> parser (INACTIVE)

    Fred.pm ----> parser

As you can see, a new stream has been created for reading the source
from C<Fred.pm>. This stream will remain active until all of C<Fred.pm>
has been parsed. The source stream for C<cpp_test> will still exist,
but is inactive. Once the parser has finished reading Fred.pm, the
source stream associated with it will be destroyed. The source stream
for C<cpp_test> then becomes active again and the parser reads line 4
and subsequent lines from C<cpp_test>.

You can use more than one source filter on a single file. Similarly,
you can reuse the same filter in as many files as you like.

For example, if you have a uuencoded and compressed source file, it is
possible to stack a uudecode filter and an uncompression filter like
this:

    use Filter::uudecode ; use Filter::uncompress ;
    M'XL(".H<US4''V9I;F%L')Q;>7/;1I;_>_I3=&E=%:F*I"T?22Q/
    M6]9*<IQCO*XFT"0[PL%%'Y+IG?WN^ZYN-$'J.[.JE$,20/?K=_[>
    ...

Once the first line has been processed, the flow will look like this:

    file ---> uudecode ---> uncompress ---> parser
               filter         filter

Data flows through filters in the same order they appear in the source
file. The uudecode filter appeared before the uncompress filter, so the
source file will be uudecoded before it's uncompressed.

=head1 WRITING A SOURCE FILTER

There are three ways to write your own source filter. You can write it
in C, use an external program as a filter, or write the filter in Perl.
I won't cover the first two in any great detail, so I'll get them out
of the way first. Writing the filter in Perl is most convenient, so
I'll devote the most space to it.

=head1 WRITING A SOURCE FILTER IN C

The first of the three available techniques is to write the filter
completely in C. The external module you create interfaces directly
with the source filter hooks provided by Perl.

The advantage of this technique is that you have complete control over
the implementation of your filter. The big disadvantage is the
increased complexity required to write the filter - not only do you
need to understand the source filter hooks, but you also need a
reasonable knowledge of Perl guts. One of the few times it is worth
going to this trouble is when writing a source scrambler. The
C<decrypt> filter (which unscrambles the source before Perl parses it)
included with the source filter distribution is an example of a C
source filter (see Decryption Filters, below).

=over 5

=item B<Decryption Filters>

All decryption filters work on the principle of "security through
obscurity." Regardless of how well you write a decryption filter and
how strong your encryption algorithm, anyone determined enough can
retrieve the original source code. The reason is quite simple - once
the decryption filter has decrypted the source back to its original
form, fragments of it will be stored in the computer's memory as Perl
parses it. The source might only be in memory for a short period of
time, but anyone possessing a debugger, skill, and lots of patience can
eventually reconstruct your program.

That said, there are a number of steps that can be taken to make life
difficult for the potential cracker. The most important: Write your
decryption filter in C and statically link the decryption module into
the Perl binary. For further tips to make life difficult for the
potential cracker, see the file I<decrypt.pm> in the source filters
module.

=back

=head1 CREATING A SOURCE FILTER AS A SEPARATE EXECUTABLE

An alternative to writing the filter in C is to create a separate
executable in the language of your choice. The separate executable
reads from standard input, does whatever processing is necessary, and
writes the filtered data to standard output. C<Filter::cpp> is an
example of a source filter implemented as a separate executable - the
executable is the C preprocessor bundled with your C compiler.

The source filter distribution includes two modules that simplify this
task: C<Filter::exec> and C<Filter::sh>. Both allow you to run any
external executable. Both use a coprocess to control the flow of data
into and out of the external executable. (For details on coprocesses,
see Stevens, W.R. "Advanced Programming in the UNIX Environment."
Addison-Wesley, ISBN 0-201-56317-7, pages 441-445.) The difference
between them is that C<Filter::exec> spawns the external command
directly, while C<Filter::sh> spawns a shell to execute the external
command. (Unix uses the Bourne shell; NT uses the cmd shell.) Spawning
a shell allows you to make use of the shell metacharacters and
redirection facilities.

Here is an example script that uses C<Filter::sh>:

    use Filter::sh 'tr XYZ PQR' ;
    $a = 1 ;
    print "XYZ a = $a\n" ;

The output you'll get when the script is executed:

    PQR a = 1

Writing a source filter as a separate executable works fine, but a
small performance penalty is incurred. For example, if you execute the
small example above, a separate subprocess will be created to run the
Unix C<tr> command. Each use of the filter requires its own subprocess.
If creating subprocesses is expensive on your system, you might want to
consider one of the other options for creating source filters.

=head1 WRITING A SOURCE FILTER IN PERL

The easiest and most portable option available for creating your own
source filter is to write it completely in Perl. To distinguish this
from the previous two techniques, I'll call it a Perl source filter.

To help understand how to write a Perl source filter we need an example
to study. Here is a complete source filter that performs rot13
decoding. (Rot13 is a very simple encryption scheme used in Usenet
postings to hide the contents of offensive posts. It moves every letter
forward thirteen places, so that A becomes N, B becomes O, and Z
becomes M.)

    package Rot13 ;

    use Filter::Util::Call ;

    sub import {
        my ($type) = @_ ;
        my ($ref) = [] ;
        filter_add(bless $ref) ;
    }

    sub filter {
        my ($self) = @_ ;
        my ($status) ;

        tr/n-za-mN-ZA-M/a-zA-Z/
            if ($status = filter_read()) > 0 ;
        $status ;
    }

    1;

All Perl source filters are implemented as Perl classes and have the
same basic structure as the example above.

First, we include the C<Filter::Util::Call> module, which exports a
number of functions into your filter's namespace. The filter shown
above uses two of these functions, C<filter_add()> and
C<filter_read()>.

Next, we create the filter object and associate it with the source
stream by defining the C<import> function. If you know Perl well
enough, you know that C<import> is called automatically every time a
module is included with a use statement. This makes C<import> the ideal
place to both create and install a filter object.
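
If the role of C<import> is unfamiliar, the following sketch may help.
It relies on the documented equivalence between C<use Module LIST> and
C<< BEGIN { require Module; Module->import(LIST) } >>. The package name
C<Demo::Tracer> and its C<@calls> array are made up for illustration
and are not part of any filter distribution. A real filter would call
C<filter_add()> where this sketch merely records the call.

    use strict ;
    use warnings ;

    # Hypothetical package, defined inline so the sketch is
    # self-contained.
    package Demo::Tracer ;

    our @calls ;

    # Perl calls import() as a class method each time the module is
    # "use"d, passing along any arguments from the use line.
    sub import {
        my ($class, @args) = @_ ;
        push @calls, "$class(@args)" ;
    }

    package main ;

    # "use Demo::Tracer 'first'" would load Demo/Tracer.pm and then
    # make this call, at compile time.
    BEGIN { Demo::Tracer->import('first') }

    print "import was called as: $_\n" for @Demo::Tracer::calls ;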

In the example filter, the object (C<$ref>) is blessed just like any
other Perl object. Our example uses an anonymous array, but this isn't
a requirement. Because this example doesn't need to store any context
information, we could have used a scalar or hash reference just as
well. The next section demonstrates context data.

The association between the filter object and the source stream is made
with the C<filter_add()> function. This takes a filter object as a
parameter (C<$ref> in this case) and installs it in the source stream.

Finally, there is the code that actually does the filtering. For this
type of Perl source filter, all the filtering is done in a method
called C<filter()>. (It is also possible to write a Perl source filter
using a closure. See the C<Filter::Util::Call> manual page for more
details.) It's called every time the Perl parser needs another line of
source to process. The C<filter()> method, in turn, reads lines from
the source stream using the C<filter_read()> function.

If a line was available from the source stream, C<filter_read()>
returns a status value greater than zero and appends the line to C<$_>.
A status value of zero indicates end-of-file, less than zero means an
error. The filter function itself is expected to return its status in
the same way, and put the filtered line it wants written to the source
stream in C<$_>. The use of C<$_> accounts for the brevity of most Perl
source filters.
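
For comparison, here is a minimal sketch of the closure style mentioned
above, applied to the same rot13 decoding. It assumes the closure form
of C<filter_add()> described in the C<Filter::Util::Call> manual page.
The package name C<Rot13Closure> is hypothetical.

    package Rot13Closure ;    # hypothetical name, not a shipped module

    use strict ;
    use warnings ;
    use Filter::Util::Call ;

    sub import {
        my ($type) = @_ ;

        # Passing a code reference instead of a blessed object
        # installs a closure filter. It is called once for each line
        # the parser needs, just as the filter() method would be.
        filter_add(
            sub {
                my $status = filter_read() ;
                tr/n-za-mN-ZA-M/a-zA-Z/ if $status > 0 ;
                return $status ;
            }
        ) ;
    }

    1 ;

The closure has no object to hold state. Any context it needs would
live in lexical variables declared inside C<import>.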

In order to make use of the rot13 filter we need some way of encoding
the source file in rot13 format. The script below, C<mkrot13>, does
just that.

    die "usage: mkrot13 filename\n" unless @ARGV ;
    my $in = $ARGV[0] ;
    my $out = "$in.tmp" ;
    open(my $in_fh, '<', $in) or die "Cannot open file $in: $!\n" ;
    open(my $out_fh, '>', $out) or die "Cannot open file $out: $!\n" ;

    # the encoded file must load the decoding filter first
    print $out_fh "use Rot13;\n" ;
    while (<$in_fh>) {
        tr/a-zA-Z/n-za-mN-ZA-M/ ;
        print $out_fh $_ ;
    }

    close $in_fh ;
    close $out_fh ;
    unlink $in ;
    rename $out, $in ;

If we encrypt this with C<mkrot13>:

    print " hello fred \n" ;

the result will be this:

    use Rot13;
    cevag " uryyb serq \a" ;

Running it produces this output:

     hello fred
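
The encoding and decoding translations can be checked without
installing any filter. The sketch below wraps the two C<tr> tables from
C<mkrot13> and from the C<Rot13> filter in ordinary functions. The
function names are made up for illustration.

    use strict ;
    use warnings ;

    # Same translation table that mkrot13 uses to encode.
    sub rot13_encode {
        my ($text) = @_ ;
        $text =~ tr/a-zA-Z/n-za-mN-ZA-M/ ;
        return $text ;
    }

    # Same translation table that the Rot13 filter uses to decode.
    sub rot13_decode {
        my ($text) = @_ ;
        $text =~ tr/n-za-mN-ZA-M/a-zA-Z/ ;
        return $text ;
    }

    # Single quotes keep the \n as two literal characters, as it
    # would appear in a source file.
    my $plain   = 'print " hello fred \n" ;' ;
    my $encoded = rot13_encode($plain) ;

    print "$encoded\n" ;                    # cevag " uryyb serq \a" ;
    print rot13_decode($encoded), "\n" ;    # print " hello fred \n" ;

Only letters are translated. That is why the C<n> of the C<\n> escape
turns into C<a> in the encoded file, while the quotes, the backslash
and the semicolon pass through untouched.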

=head1 USING CONTEXT: THE DEBUG FILTER

The rot13 example was a trivial example. Here's another demonstration
that shows off a few more features.

Say you wanted to include a lot of debugging code in your Perl script
during development, but you didn't want it available in the released
product. Source filters offer a solution. In order to keep the example
simple, let's say you wanted the debugging output to be controlled by
an environment variable, C<DEBUG>. Debugging code is enabled if the
variable exists; otherwise it is disabled.

Two special marker lines will bracket debugging code, like this:

    ## DEBUG_BEGIN
    if ($year > 1999) {
        warn "Debug: millennium bug in year $year\n" ;
    }
    ## DEBUG_END

When the C<DEBUG> environment variable exists, the filter ensures that
Perl parses the code between the C<DEBUG_BEGIN> and C<DEBUG_END>
markers along with the rest of the program. That means that when
C<DEBUG> does exist, the code above should be passed through the filter
unchanged. The marker lines can also be passed through as-is, because
the Perl parser will see them as comment lines. When C<DEBUG> isn't
set, we need a way to disable the debug code. A simple way to achieve
that is to convert the lines between the two markers into comments:

    ## DEBUG_BEGIN
    #if ($year > 1999) {
    #    warn "Debug: millennium bug in year $year\n" ;
    #}
    ## DEBUG_END

Here is the complete Debug filter:

    package Debug;

    use strict;
    use Filter::Util::Call ;

    use constant TRUE => 1 ;
    use constant FALSE => 0 ;

    sub import {
        my ($type) = @_ ;
        my (%context) = (
            Enabled      => defined $ENV{DEBUG},
            InTraceBlock => FALSE,
            Filename     => (caller)[1],
            LineNo       => 0,
            LastBegin    => 0,
        ) ;
        filter_add(bless \%context) ;
    }

    sub Die {
        my ($self) = shift ;
        my ($message) = shift ;
        my ($line_no) = shift || $self->{LastBegin} ;
        die "$message at $self->{Filename} line $line_no.\n" ;
    }

    sub filter {
        my ($self) = @_ ;
        my ($status) ;
        $status = filter_read() ;
        ++ $self->{LineNo} ;

        # deal with EOF/error first
        if ($status <= 0) {
            $self->Die("DEBUG_BEGIN has no DEBUG_END")
                if $self->{InTraceBlock} ;
            return $status ;
        }

        if ($self->{InTraceBlock}) {
            if (/^\s*##\s*DEBUG_BEGIN/ ) {
                $self->Die("Nested DEBUG_BEGIN", $self->{LineNo}) ;
            } elsif (/^\s*##\s*DEBUG_END/) {
                $self->{InTraceBlock} = FALSE ;
            }

            # comment out the debug lines when the filter is disabled
            s/^/#/ if ! $self->{Enabled} ;
        } elsif ( /^\s*##\s*DEBUG_BEGIN/ ) {
            $self->{InTraceBlock} = TRUE ;
            $self->{LastBegin} = $self->{LineNo} ;
        } elsif ( /^\s*##\s*DEBUG_END/ ) {
            $self->Die("DEBUG_END has no DEBUG_BEGIN", $self->{LineNo});
        }
        return $status ;
    }

    1 ;

The big difference between this filter and the previous example is the
use of context data in the filter object. The filter object is based on
a hash reference, and is used to keep various pieces of context
information between calls to the filter function. All but two of the
hash fields are used for error reporting. The first of those two,
Enabled, is used by the filter to determine whether the debugging code
should be given to the Perl parser. The second, InTraceBlock, is true
when the filter has encountered a C<DEBUG_BEGIN> line, but has not yet
encountered the following C<DEBUG_END> line.

If you ignore all the error checking that most of the code does, the
essence of the filter is as follows:

    sub filter {
        my ($self) = @_ ;
        my ($status) ;
        $status = filter_read() ;

        # deal with EOF/error first
        return $status if $status <= 0 ;
        if ($self->{InTraceBlock}) {
            if (/^\s*##\s*DEBUG_END/) {
                $self->{InTraceBlock} = FALSE ;
            }

            # comment out debug lines when the filter is disabled
            s/^/#/ if ! $self->{Enabled} ;
        } elsif ( /^\s*##\s*DEBUG_BEGIN/ ) {
            $self->{InTraceBlock} = TRUE ;
        }
        return $status ;
    }
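
To experiment with this logic outside a source stream, the same state
machine can be applied to a plain list of lines. This is only a sketch:
the function C<comment_out_debug> is hypothetical, and it takes the
enabled flag as an argument instead of reading C<$ENV{DEBUG}>.

    use strict ;
    use warnings ;

    # Apply the essence of the Debug filter to a list of lines and
    # return the lines the parser would see.
    sub comment_out_debug {
        my ($enabled, @lines) = @_ ;
        my $in_block = 0 ;
        my @out ;

        for my $line (@lines) {
            if ($in_block) {
                $in_block = 0 if $line =~ /^\s*##\s*DEBUG_END/ ;

                # comment out debug lines when debugging is disabled
                $line =~ s/^/#/ unless $enabled ;
            } elsif ($line =~ /^\s*##\s*DEBUG_BEGIN/) {
                $in_block = 1 ;
            }
            push @out, $line ;
        }
        return @out ;
    }

    my @source = map { "$_\n" } (
        '## DEBUG_BEGIN',
        'warn "checking" ;',
        '## DEBUG_END',
        'print "done" ;',
    ) ;

    print comment_out_debug(0, @source) ;

With the flag off, the C<warn> line comes out as C<#warn "checking" ;>.
The C<DEBUG_END> marker also gains an extra C<#>, because it is still
inside the block when the substitution runs. That is harmless, since
the line remains a comment.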

Be warned: just as the C preprocessor doesn't know C, the Debug filter
doesn't know Perl. It can be fooled quite easily:

    print <<EOM;
    ##DEBUG_BEGIN
    EOM

Such things aside, you can see that a lot can be achieved with a modest
amount of code.

=head1 CONCLUSION

You now have a better understanding of what a source filter is, and you
might even have a possible use for them. If you feel like playing with
source filters but need a bit of inspiration, here are some extra
features you could add to the Debug filter.

First, an easy one. Rather than having debugging code that is
all-or-nothing, it would be much more useful to be able to control
which specific blocks of debugging code get included. Try extending the
syntax for debug blocks to allow each to be identified. The contents of
the C<DEBUG> environment variable can then be used to control which
blocks get included.

Once you can identify individual blocks, try allowing them to be
nested. That isn't difficult either.

Here is an interesting idea that doesn't involve the Debug filter.
When this article was written, Perl subroutines had fairly limited
support for formal parameter lists. With prototypes you can specify the
number of parameters and their type, but you still have to take them
out of the C<@_> array yourself. (Perl 5.20 and later offer subroutine
signatures, but the exercise is still instructive.) Write a source
filter that allows you to have a named parameter list. Such a filter
would turn this:

    sub MySub ($first, $second, @rest) { ... }

into this:

    sub MySub($$@) {
        my ($first) = shift ;
        my ($second) = shift ;
        my (@rest) = @_ ;
        ...
    }
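
As a starting point, here is one naive way the rewriting step might
look. It is a sketch under strong assumptions: the whole C<sub> header
sits on one line, parameters are plain variable names, and nothing in a
string or comment happens to look like a C<sub> header. The function
C<rewrite_named_params> is hypothetical. Inside a filter it would be
applied to C<$_> after each successful C<filter_read()>.

    use strict ;
    use warnings ;

    # Rewrite a line such as "sub NAME ($a, $b, @rest) {" into a
    # prototype plus explicit unpacking. Other lines are returned
    # unchanged.
    sub rewrite_named_params {
        my ($line) = @_ ;

        return $line unless $line =~ /
            ^ (\s*) sub \s+ (\w+) \s*        # indentation and sub name
            \( \s*
                ( [\$\@%]\w+ (?: \s*,\s* [\$\@%]\w+ )* )
            \s* \)                           # named parameter list
            \s* \{ (.*) \z                   # opening brace, rest of line
        /xs ;
        my ($indent, $name, $params, $rest) = ($1, $2, $3, $4) ;

        my @vars   = split /\s*,\s*/, $params ;
        my $proto  = join '', map { substr($_, 0, 1) } @vars ;
        my @unpack = map {
            /^\$/ ? "my ($_) = shift ;" : "my ($_) = \@_ ;"
        } @vars ;

        # Keep everything on one line so that the line numbers in
        # error messages still match the original file.
        return "${indent}sub $name($proto) { "
             . join(' ', @unpack) . $rest ;
    }

    print rewrite_named_params(
        'sub MySub ($first, $second, @rest) { ... }'), "\n" ;

For the input shown, this prints the C<sub MySub($$@)> header followed
by the three C<my> statements and the original body, all on one line.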

Finally, if you feel like a real challenge, have a go at writing a
full-blown Perl macro preprocessor as a source filter. Borrow the
useful features from the C preprocessor and any other macro processors
you know. The tricky bit will be choosing how much knowledge of Perl's
syntax you want your filter to have.

=head1 REQUIREMENTS

The Source Filters distribution is available on CPAN, in

    CPAN/modules/by-module/Filter

=head1 AUTHOR

Paul Marquess E<lt>Paul.Marquess@btinternet.comE<gt>

=head1 Copyrights

This article originally appeared in The Perl Journal #11, and is
copyright 1998 The Perl Journal. It appears courtesy of Jon Orwant and
The Perl Journal. This document may be distributed under the same terms
as Perl itself.
566=head1 Copyrights
567
568This article originally appeared in The Perl Journal #11, and is
569copyright 1998 The Perl Journal. It appears courtesy of Jon Orwant and
570The Perl Journal. This document may be distributed under the same terms
571as Perl itself.