=encoding utf8

=for comment
Consistent formatting of this file is achieved with:
  perl ./Porting/podtidy pod/perlhacktut.pod

=head1 NAME

perlhacktut - Walk through the creation of a simple C code patch

=head1 DESCRIPTION

This document takes you through a simple patch example.

If you haven't read L<perlhack> yet, go do that first! You might also
want to read through L<perlsource>.

Once you're done here, check out L<perlhacktips> next.

=head1 EXAMPLE OF A SIMPLE PATCH

Let's take a simple patch from start to finish.

Here's something Larry suggested: if a C<U> is the first active format
during a C<pack> (for example, C<pack "U3C8", @stuff>), then the
resulting string should be treated as UTF-8 encoded.

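To see what "treated as UTF-8 encoded" means at the byte level, here is
a minimal stand-alone sketch in C. It is not code from the perl source:
C<utf8_encode> is a hypothetical helper that follows the standard UTF-8
bit layout for code points up to U+FFFF only, and it skips the validity
checks a real encoder would need.

```c
#include <stddef.h>

/* Hypothetical helper, not taken from the perl source.
 * Encodes one code point in the range 0..0xFFFF as UTF-8 into out[]
 * and returns the number of bytes written, or 0 if cp is out of range.
 * Assumption: surrogates and other validity checks are ignored. */
size_t utf8_encode(unsigned long cp, unsigned char out[3])
{
    if (cp < 0x80) {                    /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                   /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                 /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    return 0;                           /* outside this sketch's range */
}
```

With this sketch, code point 300 becomes the two bytes 0xC4 0xAC and
4000 becomes the three bytes 0xE0 0xBE 0xA0. Roughly speaking, the UTF8
flag on an SV is what tells perl to read such byte sequences back as
single characters rather than as separate bytes.
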
If you are working with a git clone of the Perl repository, you will
want to create a branch for your changes. This will make creating a
proper patch much simpler. See L<perlgit> for details on how to do
this.

=head2 Writing the patch

How do we prepare to fix this up? First we locate the code in
question. The C<pack> happens at runtime, so it's going to be in one of
the F<pp> files. Sure enough, C<pp_pack> is in F<pp.c>. Since we're
going to be altering this file, let's copy it to F<pp.c~>.

[Well, it was in F<pp.c> when this tutorial was written. It has now
been split off with C<pp_unpack> to its own file, F<pp_pack.c>.]

Now let's look over C<pp_pack>: we take a pattern into C<pat>, and then
loop over the pattern, taking each format character in turn into
C<datumtype>. Then for each possible format character, we swallow up
the other arguments in the pattern (a field width, an asterisk, and so
on) and convert the next chunk of input into the specified format,
adding it onto the output SV C<cat>.

How do we know if the C<U> is the first format in the C<pat>? Well, if
we have a pointer to the start of C<pat> then, if we see a C<U> we can
test whether we're still at the start of the string. So, here's where
C<pat> is set up:

    STRLEN fromlen;
    char *pat = SvPVx(*++MARK, fromlen);
    char *patend = pat + fromlen;
    I32 len;
    I32 datumtype;
    SV *fromstr;

We'll have another string pointer in there:

      STRLEN fromlen;
      char *pat = SvPVx(*++MARK, fromlen);
      char *patend = pat + fromlen;
  +   char *patcopy;
      I32 len;
      I32 datumtype;
      SV *fromstr;

And just before we start the loop, we'll set C<patcopy> to be the start
of C<pat>:

      items = SP - MARK;
      MARK++;
      SvPVCLEAR(cat);
  +   patcopy = pat;
      while (pat < patend) {

Now if we see a C<U> which was at the start of the string, we turn on
the C<UTF8> flag for the output SV, C<cat>:

  +   if (datumtype == 'U' && pat==patcopy+1)
  +       SvUTF8_on(cat);
      if (datumtype == '#') {
          while (pat < patend && *pat != '\n')
              pat++;

Remember that it has to be C<patcopy+1> because the first character of
the string is the C<U> which has been swallowed into C<datumtype>!

Oops, we forgot one thing: what if there are spaces at the start of the
pattern? C<pack(" U*", @stuff)> will have C<U> as the first active
character, even though it's not the first thing in the pattern. In this
case, we have to advance C<patcopy> along with C<pat> when we see
spaces:

    if (isSPACE(datumtype))
        continue;

needs to become

    if (isSPACE(datumtype)) {
        patcopy++;
        continue;
    }

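The pointer bookkeeping above can be tried outside perl. The following
is a minimal sketch, not the real C<pp_pack>: C<first_active_is_U> is a
hypothetical function, it uses the standard C C<isspace> in place of
perl's C<isSPACE>, and it treats every non-space character as a format
character instead of parsing counts, modifiers or C<#> comments.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-alone helper, not part of the perl source.
 * Returns 1 if 'U' is the first active (non-whitespace) character of
 * the pack template, 0 otherwise. Assumption: counts, modifiers and
 * '#' comments are not parsed; they are just characters that are
 * never equal to 'U'. */
int first_active_is_U(const char *pat)
{
    const char *patend;
    const char *patcopy;
    int utf8 = 0;                /* stands in for SvUTF8_on(cat) */

    assert(pat != NULL);
    patend  = pat + strlen(pat);
    patcopy = pat;               /* start of the active pattern */

    while (pat < patend) {
        /* swallow one character, as pp_pack does into datumtype */
        int datumtype = (unsigned char)*pat++;

        if (isspace(datumtype)) {
            patcopy++;           /* keep patcopy in step over spaces */
            continue;
        }
        /* pat is already one past datumtype, hence patcopy + 1 */
        if (datumtype == 'U' && pat == patcopy + 1)
            utf8 = 1;
    }
    return utf8;
}
```

Under those assumptions, the function returns 1 for C<"U3C8"> and for
C<" U*">, and 0 for C<"C0U*">.
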
OK. That's the C part done. Now we must do two additional things before
this patch is ready to go: we've changed the behaviour of Perl, and so
we must document that change. We must also provide some more regression
tests to make sure our patch works and doesn't create a bug somewhere
else along the line.

=head2 Testing the patch

The regression tests for each operator live in F<t/op/>, and so we make
a copy of F<t/op/pack.t> to F<t/op/pack.t~>. Now we can add our tests
to the end. First, we'll test that the C<U> does indeed create Unicode
strings.

F<t/op/pack.t> has a sensible C<ok()> function, but if it didn't, we
could use the one from F<t/test.pl>:

    require './test.pl';
    plan( tests => 159 );

So instead of this:

    print 'not ' unless "1.20.300.4000" eq sprintf "%vd",
                                        pack("U*",1,20,300,4000);
    print "ok $test\n"; $test++;

we can write the more sensible version below (see L<Test::More> for a
full explanation of C<is()> and other testing functions). The
parentheses around the arguments to C<sprintf> matter: without them,
C<sprintf> would swallow the test description as well.

    is( sprintf("%vd", pack("U*",1,20,300,4000)), "1.20.300.4000",
        "U* produces Unicode" );

Now we'll test that we got that space-at-the-beginning business right:

    is( sprintf("%vd", pack(" U*",1,20,300,4000)), "1.20.300.4000",
        " with spaces at the beginning" );

And finally we'll test that we don't make Unicode strings if C<U> is
B<not> the first active format:

    isnt( sprintf("%vd", pack("C0U*",1,20,300,4000)), "1.20.300.4000",
          "U* not first isn't Unicode" );

We mustn't forget to change the number of tests which appears at the top,
or else the automated tester will get confused. This will either look
like this:

    print "1..156\n";

or this:

    plan( tests => 156 );

We now compile Perl, and run it through the test suite. Our new tests
pass, hooray!

=head2 Documenting the patch

Finally, the documentation. The job is never done until the paperwork
is over, so let's describe the change we've just made. The relevant
place is F<pod/perlfunc.pod>; again, we make a copy, and then we'll
insert this text in the description of C<pack>:

    =item *

    If the pattern begins with a C<U>, the resulting string will be treated
    as UTF-8-encoded Unicode. You can force UTF-8 encoding on in a string
    with an initial C<U0>, and the bytes that follow will be interpreted as
    Unicode characters. If you don't want this to happen, you can begin
    your pattern with C<C0> (or anything else) to force Perl not to UTF-8
    encode your string, and then follow this with a C<U*> somewhere in your
    pattern.

=head2 Submit

See L<perlhack> for details on how to submit this patch.

=head1 AUTHOR

This document was originally written by Nathan Torkington, and is
maintained by the perl5-porters mailing list.