Commit | Line | Data |
---|---|---|
04c692a8 DR |
1 | =encoding utf8 |
2 | ||
3 | =for comment | |
4 | Consistent formatting of this file is achieved with: | |
5 | perl ./Porting/podtidy pod/perlhacktut.pod | |
6 | ||
7 | =head1 NAME | |
8 | ||
9 | perlhacktut - Walk through the creation of a simple C code patch | |
10 | ||
11 | =head1 DESCRIPTION | |
12 | ||
13 | This document takes you through a simple patch example. | |
14 | ||
15 | If you haven't read L<perlhack> yet, go do that first! You might also | |
16 | want to read through L<perlsource>. | |
17 | ||
18 | Once you're done here, check out L<perlhacktips> next. | |
19 | ||
20 | =head1 EXAMPLE OF A SIMPLE PATCH | |
21 | ||
22 | Let's take a simple patch from start to finish. | |
23 | ||
24 | Here's something Larry suggested: if a C<U> is the first active format | |
25 | during a C<pack> (for example, C<pack "U3C8", @stuff>), then the | |
26 | resulting string should be treated as UTF-8 encoded. | |
27 | ||
28 | If you are working with a git clone of the Perl repository, you will | |
29 | want to create a branch for your changes. This will make creating a | |
30 | proper patch much simpler. See L<perlgit> for details on how to do | |
31 | this. | |
32 | ||
33 | =head2 Writing the patch | |
34 | ||
35 | How do we prepare to fix this up? First we locate the code in question | |
36 | - the C<pack> happens at runtime, so it's going to be in one of the | |
37 | F<pp> files. Sure enough, C<pp_pack> is in F<pp.c>. Since we're going | |
38 | to be altering this file, let's copy it to F<pp.c~>. | |
39 | ||
40 | [Well, it was in F<pp.c> when this tutorial was written. It has now | |
41 | been split off with C<pp_unpack> to its own file, F<pp_pack.c>.] | |
42 | ||
43 | Now let's look over C<pp_pack>: we take a pattern into C<pat>, and then | |
44 | loop over the pattern, taking each format character in turn into | |
45 | C<datumtype>. Then for each possible format character, we swallow up | |
46 | the other arguments in the pattern (a field width, an asterisk, and so | |
47 | on) and convert the next chunk of input into the specified format, adding | |
48 | it onto the output SV C<cat>. | |
49 | ||
50 | How do we know if the C<U> is the first format in the C<pat>? Well, if | |
51 | we have a pointer to the start of C<pat> then, if we see a C<U> we can | |
52 | test whether we're still at the start of the string. So, here's where | |
53 | C<pat> is set up: | |
54 | ||
55 | STRLEN fromlen; | |
eb578fdb KW |
56 | char *pat = SvPVx(*++MARK, fromlen); |
57 | char *patend = pat + fromlen; | |
58 | I32 len; | |
04c692a8 DR |
59 | I32 datumtype; |
60 | SV *fromstr; | |
61 | ||
62 | We'll have another string pointer in there: | |
63 | ||
64 | STRLEN fromlen; | |
eb578fdb KW |
65 | char *pat = SvPVx(*++MARK, fromlen); |
66 | char *patend = pat + fromlen; | |
04c692a8 | 67 | + char *patcopy; |
eb578fdb | 68 | I32 len; |
04c692a8 DR |
69 | I32 datumtype; |
70 | SV *fromstr; | |
71 | ||
72 | And just before we start the loop, we'll set C<patcopy> to be the start | |
73 | of C<pat>: | |
74 | ||
75 | items = SP - MARK; | |
76 | MARK++; | |
5b1fede8 | 77 | SvPVCLEAR(cat); |
04c692a8 DR |
78 | + patcopy = pat; |
79 | while (pat < patend) { | |
80 | ||
81 | Now if we see a C<U> which was at the start of the string, we turn on | |
82 | the C<UTF8> flag for the output SV, C<cat>: | |
83 | ||
84 | + if (datumtype == 'U' && pat==patcopy+1) | |
85 | + SvUTF8_on(cat); | |
86 | if (datumtype == '#') { | |
87 | while (pat < patend && *pat != '\n') | |
88 | pat++; | |
89 | ||
90 | Remember that it has to be C<patcopy+1> because the first character of | |
91 | the string is the C<U> which has been swallowed into C<datumtype>! | |
92 | ||
93 | Oops, we forgot one thing: what if there are spaces at the start of the | |
94 | pattern? C<pack(" U*", @stuff)> will have C<U> as the first active | |
95 | character, even though it's not the first thing in the pattern. In this | |
96 | case, we have to advance C<patcopy> along with C<pat> when we see | |
97 | spaces: | |
98 | ||
99 | if (isSPACE(datumtype)) | |
100 | continue; | |
101 | ||
102 | needs to become | |
103 | ||
104 | if (isSPACE(datumtype)) { | |
105 | patcopy++; | |
106 | continue; | |
107 | } | |
108 | ||
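The pointer bookkeeping above can be tried outside the Perl source. The following is a minimal standalone sketch, not the real C<pp_pack>: the function name C<first_active_is_U> is hypothetical, the standard C<isspace> stands in for Perl's C<isSPACE> macro, and every character of the pattern is treated as a format character (counts, asterisks and modifiers are not parsed).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, NOT part of the Perl source.
 * Returns 1 if the first active (non-whitespace) character of the
 * pattern is 'U', using the same pat/patcopy comparison as the patch.
 * Simplification: every character is treated as a format character. */
int first_active_is_U(const char *pat, size_t len)
{
    const char *patend;
    const char *patcopy;
    int utf8 = 0;

    assert(pat != NULL);
    patend  = pat + len;
    patcopy = pat;                 /* remembers the start of the pattern */

    while (pat < patend) {
        /* swallow one character, as the real loop does into datumtype */
        int datumtype = (unsigned char)*pat++;

        if (isspace(datumtype)) {  /* stand-in for Perl's isSPACE() */
            patcopy++;             /* leading spaces move the "start" along */
            continue;
        }
        /* pat is already one past datumtype, hence patcopy + 1 */
        if (datumtype == 'U' && pat == patcopy + 1)
            utf8 = 1;              /* the real patch calls SvUTF8_on(cat) here */
    }
    return utf8;
}
```

With this sketch, C<"U3C8"> and C<" U*"> both report a leading C<U>, while C<"C0U*"> does not: once any non-space character has been swallowed, C<pat> stays more than one position ahead of C<patcopy>.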
109 | OK. That's the C part done. Now we must do two additional things before | |
110 | this patch is ready to go: we've changed the behaviour of Perl, and so | |
111 | we must document that change. We must also provide some more regression | |
112 | tests to make sure our patch works and doesn't create a bug somewhere | |
113 | else along the line. | |
114 | ||
115 | =head2 Testing the patch | |
116 | ||
117 | The regression tests for each operator live in F<t/op/>, and so we make | |
118 | a copy of F<t/op/pack.t> to F<t/op/pack.t~>. Now we can add our tests | |
119 | to the end. First, we'll test that the C<U> does indeed create Unicode | |
120 | strings. | |
121 | ||
122 | F<t/op/pack.t> has a sensible C<ok()> function, but if it didn't we could | |
123 | use the one from F<t/test.pl>: | |
124 | ||
125 | require './test.pl'; | |
126 | plan( tests => 159 ); | |
127 | ||
128 | So instead of this: | |
129 | ||
130 | print 'not ' unless "1.20.300.4000" eq sprintf "%vd", | |
131 | pack("U*",1,20,300,4000); | |
132 | print "ok $test\n"; $test++; | |
133 | ||
134 | we can write the following instead (see L<Test::More> for a full | |
135 | explanation of C<is()> and other testing functions): | |
136 | ||
137 | is( sprintf("%vd", pack("U*",1,20,300,4000)), "1.20.300.4000", | |
138 | "U* produces Unicode" ); | |
139 | ||
140 | Now we'll test that we got that space-at-the-beginning business right: | |
141 | ||
142 | is( sprintf("%vd", pack(" U*",1,20,300,4000)), "1.20.300.4000", | |
143 | " with spaces at the beginning" ); | |
144 | ||
145 | And finally we'll test that we don't make Unicode strings if C<U> is | |
146 | B<not> the first active format: | |
147 | ||
148 | isnt( sprintf("%vd", pack("C0U*",1,20,300,4000)), "1.20.300.4000", | |
149 | "U* not first isn't Unicode" ); | |
150 | ||
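To see why the flag changes what these tests observe, it helps to look at the bytes involved. The sketch below is a generic UTF-8 encoder written for illustration only; the name C<utf8_encode> is hypothetical and this is not the routine Perl itself uses. It shows that the four code points 1, 20, 300 and 4000 occupy seven bytes once encoded, so a string read byte by byte would be expected to give a different dotted form than one read as four characters.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper for illustration, NOT Perl's own encoder.
 * Encodes one code point as UTF-8 into out (at least 4 bytes long)
 * and returns the number of bytes written, or 0 if cp is out of range.
 * Simplification: surrogate code points are not rejected. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);

    if (cp < 0x80UL) {                       /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800UL) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000UL) {                    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000UL) {                   /* 4 bytes: 11110xxx then three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                                /* beyond U+10FFFF */
}
```

Under this encoding, 1 and 20 stay as single bytes, 300 becomes the two bytes 196 172, and 4000 becomes the three bytes 224 190 160.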
151 | We mustn't forget to change the number of tests which appears at the top, | |
152 | or else the automated tester will get confused. This will either look | |
153 | like this: | |
154 | ||
155 | print "1..156\n"; | |
156 | ||
157 | or this: | |
158 | ||
159 | plan( tests => 156 ); | |
160 | ||
161 | We now compile up Perl, and run it through the test suite. Our new | |
162 | tests pass, hooray! | |
163 | ||
164 | =head2 Documenting the patch | |
165 | ||
166 | Finally, the documentation. The job is never done until the paperwork | |
167 | is over, so let's describe the change we've just made. The relevant | |
168 | place is F<pod/perlfunc.pod>; again, we make a copy, and then we'll | |
169 | insert this text in the description of C<pack>: | |
170 | ||
171 | =item * | |
172 | ||
173 | If the pattern begins with a C<U>, the resulting string will be treated | |
174 | as UTF-8-encoded Unicode. You can force UTF-8 encoding on in a string | |
175 | with an initial C<U0>, and the bytes that follow will be interpreted as | |
176 | Unicode characters. If you don't want this to happen, you can begin | |
177 | your pattern with C<C0> (or anything else) to force Perl not to UTF-8 | |
178 | encode your string, and then follow this with a C<U*> somewhere in your | |
179 | pattern. | |
180 | ||
181 | =head2 Submit | |
182 | ||
183 | See L<perlhack> for details on how to submit this patch. | |
184 | ||
185 | =head1 AUTHOR | |
186 | ||
187 | This document was originally written by Nathan Torkington, and is | |
188 | maintained by the perl5-porters mailing list. |