| 1 | =head1 NAME |
| 2 | |
| 3 | perlguts - Introduction to the Perl API |
| 4 | |
| 5 | =head1 DESCRIPTION |
| 6 | |
| 7 | This document attempts to describe how to use the Perl API, as well as |
| 8 | to provide some info on the basic workings of the Perl core. It is far |
| 9 | from complete and probably contains many errors. Please refer any |
| 10 | questions or comments to the author below. |
| 11 | |
| 12 | =head1 Variables |
| 13 | |
| 14 | =head2 Datatypes |
| 15 | |
| 16 | Perl has three typedefs that handle Perl's three main data types: |
| 17 | |
| 18 | SV Scalar Value |
| 19 | AV Array Value |
| 20 | HV Hash Value |
| 21 | |
| 22 | Each typedef has specific routines that manipulate the various data types. |
| 23 | |
| 24 | =head2 What is an "IV"? |
| 25 | |
| 26 | Perl uses a special typedef IV which is a simple signed integer type that is |
| 27 | guaranteed to be large enough to hold a pointer (as well as an integer). |
| 28 | Additionally, there is the UV, which is simply an unsigned IV. |
| 29 | |
| 30 | Perl also uses two special typedefs, I32 and I16, which will always be at |
| 31 | least 32 bits and 16 bits long, respectively. (Again, there are U32 and U16, |
| 32 | as well.) They will usually be exactly 32 and 16 bits long, but on Crays |
| 33 | they will both be 64 bits. |
| 34 | |
| 35 | =head2 Working with SVs |
| 36 | |
| 37 | An SV can be created and loaded with one command. There are five types of |
| 38 | values that can be loaded: an integer value (IV), an unsigned integer |
| 39 | value (UV), a double (NV), a string (PV), and another scalar (SV). |
| 40 | |
| 41 | The seven routines are: |
| 42 | |
| 43 | SV* newSViv(IV); |
| 44 | SV* newSVuv(UV); |
| 45 | SV* newSVnv(double); |
| 46 | SV* newSVpv(const char*, STRLEN); |
| 47 | SV* newSVpvn(const char*, STRLEN); |
| 48 | SV* newSVpvf(const char*, ...); |
| 49 | SV* newSVsv(SV*); |
| 50 | |
| 51 | C<STRLEN> is an integer type (Size_t, usually defined as size_t in |
| 52 | F<config.h>) guaranteed to be large enough to represent the size of |
| 53 | any string that perl can handle. |
| 54 | |
| 55 | In the unlikely case of a SV requiring more complex initialisation, you |
| 56 | can create an empty SV with newSV(len). If C<len> is 0 an empty SV of |
| 57 | type NULL is returned, else an SV of type PV is returned with len + 1 (for |
| 58 | the NUL) bytes of storage allocated, accessible via SvPVX. In both cases |
| 59 | the SV has value undef. |
| 60 | |
| 61 | SV *sv = newSV(0); /* no storage allocated */ |
| 62 | SV *sv = newSV(10); /* 10 (+1) bytes of uninitialised storage allocated */ |
| 63 | |
| 64 | To change the value of an I<already-existing> SV, there are eight routines: |
| 65 | |
| 66 | void sv_setiv(SV*, IV); |
| 67 | void sv_setuv(SV*, UV); |
| 68 | void sv_setnv(SV*, double); |
| 69 | void sv_setpv(SV*, const char*); |
| 70 | void sv_setpvn(SV*, const char*, STRLEN); |
| 71 | void sv_setpvf(SV*, const char*, ...); |
| 72 | void sv_vsetpvfn(SV*, const char*, STRLEN, va_list *, SV **, I32, bool *); |
| 73 | void sv_setsv(SV*, SV*); |
| 74 | |
| 75 | Notice that you can choose to specify the length of the string to be |
| 76 | assigned by using C<sv_setpvn>, C<newSVpvn>, or C<newSVpv>, or you may |
| 77 | allow Perl to calculate the length by using C<sv_setpv> or by specifying |
| 78 | 0 as the second argument to C<newSVpv>. Be warned, though, that Perl will |
| 79 | determine the string's length by using C<strlen>, which depends on the |
| 80 | string terminating with a NUL character. |
| 81 | |
| 82 | The arguments of C<sv_setpvf> are processed like C<sprintf>, and the |
| 83 | formatted output becomes the value. |
| 84 | |
| 85 | C<sv_vsetpvfn> is an analogue of C<vsprintf>, but it allows you to specify |
| 86 | either a pointer to a variable argument list or the address and length of |
| 87 | an array of SVs. The last argument points to a boolean; on return, if that |
| 88 | boolean is true, then locale-specific information has been used to format |
| 89 | the string, and the string's contents are therefore untrustworthy (see |
| 90 | L<perlsec>). This pointer may be NULL if that information is not |
| 91 | important. Note that this function requires you to specify the length of |
| 92 | the format. |
| 93 | |
| 94 | The C<sv_set*()> functions are not generic enough to operate on values |
| 95 | that have "magic". See L<Magic Virtual Tables> later in this document. |
| 96 | |
| 97 | All SVs that contain strings should be terminated with a NUL character. |
| 98 | If a string is not NUL-terminated, there is a risk of |
| 99 | core dumps and corruptions from code which passes the string to C |
| 100 | functions or system calls which expect a NUL-terminated string. |
| 101 | Perl's own functions typically add a trailing NUL for this reason. |
| 102 | Nevertheless, you should be very careful when you pass a string stored |
| 103 | in an SV to a C function or system call. |
| 104 | |
| 105 | To access the actual value that an SV points to, you can use the macros: |
| 106 | |
| 107 | SvIV(SV*) |
| 108 | SvUV(SV*) |
| 109 | SvNV(SV*) |
| 110 | SvPV(SV*, STRLEN len) |
| 111 | SvPV_nolen(SV*) |
| 112 | |
| 113 | which will automatically coerce the actual scalar type into an IV, UV, double, |
| 114 | or string. |
| 115 | |
| 116 | In the C<SvPV> macro, the length of the string returned is placed into the |
| 117 | variable C<len> (this is a macro, so you do I<not> use C<&len>). If you do |
| 118 | not care what the length of the data is, use the C<SvPV_nolen> macro. |
| 119 | Historically the C<SvPV> macro with the global variable C<PL_na> has been |
| 120 | used in this case. But that can be quite inefficient because C<PL_na> must |
| 121 | be accessed in thread-local storage in threaded Perl. In any case, remember |
| 122 | that Perl allows arbitrary strings of data that may both contain NULs and |
| 123 | might not be terminated by a NUL. |
| 124 | |
| 125 | Also remember that C doesn't allow you to safely say C<foo(SvPV(s, len), |
| 126 | len);>. It might work with your compiler, but it won't work for everyone. |
| 127 | Break this sort of statement up into separate assignments: |
| 128 | |
| 129 | SV *s; |
| 130 | STRLEN len; |
| 131 | char * ptr; |
| 132 | ptr = SvPV(s, len); |
| 133 | foo(ptr, len); |
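The hazard comes from C<SvPV> being a macro that stores the length as a side effect while C leaves the order of argument evaluation unspecified. The following standalone sketch models that with a hypothetical C<ToyStr> type and C<TOY_PV> macro (neither is part of the Perl API) and shows the safe, split-up form:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an SV holding a string; not Perl's real layout. */
typedef struct {
    const char *pv;   /* string buffer */
    size_t      cur;  /* string length */
} ToyStr;

/* Like SvPV: evaluates to the buffer pointer and, as a side effect,
 * stores the length into the variable named by `len`. */
#define TOY_PV(s, len) ((len) = (s)->cur, (s)->pv)

/* Hypothetical consumer: returns 1 if `len` matches strlen(ptr). */
static int length_matches(const char *ptr, size_t len)
{
    return strlen(ptr) == len;
}

/* Safe pattern: the macro runs in its own statement, so `len` is
 * assigned before it is read as a function argument. */
static int safe_call(const ToyStr *s)
{
    size_t len;
    const char *ptr;

    assert(s != NULL);
    ptr = TOY_PV(s, len);
    return length_matches(ptr, len);
}
```

Writing C<length_matches(TOY_PV(s, len), len)> instead would let the compiler read C<len> before the macro has assigned it.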
| 134 | |
| 135 | If you want to know if the scalar value is TRUE, you can use: |
| 136 | |
| 137 | SvTRUE(SV*) |
| 138 | |
| 139 | Although Perl will automatically grow strings for you, if you need to force |
| 140 | Perl to allocate more memory for your SV, you can use the macro |
| 141 | |
| 142 | SvGROW(SV*, STRLEN newlen) |
| 143 | |
| 144 | which will determine if more memory needs to be allocated. If so, it will |
| 145 | call the function C<sv_grow>. Note that C<SvGROW> can only increase, not |
| 146 | decrease, the allocated memory of an SV and that it does not automatically |
| 147 | add a byte for the trailing NUL (perl's own string functions typically do |
| 148 | C<SvGROW(sv, len + 1)>). |
| 149 | |
| 150 | If you have an SV and want to know what kind of data Perl thinks is stored |
| 151 | in it, you can use the following macros to check the type of SV you have. |
| 152 | |
| 153 | SvIOK(SV*) |
| 154 | SvNOK(SV*) |
| 155 | SvPOK(SV*) |
| 156 | |
| 157 | You can get and set the current length of the string stored in an SV with |
| 158 | the following macros: |
| 159 | |
| 160 | SvCUR(SV*) |
| 161 | SvCUR_set(SV*, I32 val) |
| 162 | |
| 163 | You can also get a pointer to the end of the string stored in the SV |
| 164 | with the macro: |
| 165 | |
| 166 | SvEND(SV*) |
| 167 | |
| 168 | But note that these last three macros are valid only if C<SvPOK()> is true. |
| 169 | |
| 170 | If you want to append something to the end of the string stored in an C<SV*>, |
| 171 | you can use the following functions: |
| 172 | |
| 173 | void sv_catpv(SV*, const char*); |
| 174 | void sv_catpvn(SV*, const char*, STRLEN); |
| 175 | void sv_catpvf(SV*, const char*, ...); |
| 176 | void sv_vcatpvfn(SV*, const char*, STRLEN, va_list *, SV **, I32, bool *); |
| 177 | void sv_catsv(SV*, SV*); |
| 178 | |
| 179 | The first function calculates the length of the string to be appended by |
| 180 | using C<strlen>. In the second, you specify the length of the string |
| 181 | yourself. The third function processes its arguments like C<sprintf> and |
| 182 | appends the formatted output. The fourth function works like C<vsprintf>. |
| 183 | You can specify the address and length of an array of SVs instead of the |
| 184 | va_list argument. The fifth function extends the string stored in the first |
| 185 | SV with the string stored in the second SV. It also forces the second SV |
| 186 | to be interpreted as a string. |
| 187 | |
| 188 | The C<sv_cat*()> functions are not generic enough to operate on values that |
| 189 | have "magic". See L<Magic Virtual Tables> later in this document. |
| 190 | |
| 191 | If you know the name of a scalar variable, you can get a pointer to its SV |
| 192 | by using the following: |
| 193 | |
| 194 | SV* get_sv("package::varname", FALSE); |
| 195 | |
| 196 | This returns NULL if the variable does not exist. |
| 197 | |
| 198 | If you want to know if this variable (or any other SV) is actually C<defined>, |
| 199 | you can call: |
| 200 | |
| 201 | SvOK(SV*) |
| 202 | |
| 203 | The scalar C<undef> value is stored in an SV instance called C<PL_sv_undef>. |
| 204 | |
| 205 | Its address can be used whenever an C<SV*> is needed. Make sure that |
| 206 | you don't try to compare a random sv with C<&PL_sv_undef>. For example |
| 207 | when interfacing Perl code, it'll work correctly for: |
| 208 | |
| 209 | foo(undef); |
| 210 | |
| 211 | But won't work when called as: |
| 212 | |
| 213 | $x = undef; |
| 214 | foo($x); |
| 215 | |
| 216 | So, to repeat: always use SvOK() to check whether an sv is defined. |
| 217 | |
| 218 | Also you have to be careful when using C<&PL_sv_undef> as a value in |
| 219 | AVs or HVs (see L<AVs, HVs and undefined values>). |
| 220 | |
| 221 | There are also the two values C<PL_sv_yes> and C<PL_sv_no>, which contain |
| 222 | boolean TRUE and FALSE values, respectively. Like C<PL_sv_undef>, their |
| 223 | addresses can be used whenever an C<SV*> is needed. |
| 224 | |
| 225 | Do not be fooled into thinking that C<(SV *) 0> is the same as C<&PL_sv_undef>. |
| 226 | Take this code: |
| 227 | |
| 228 | SV* sv = (SV*) 0; |
| 229 | if (I-am-to-return-a-real-value) { |
| 230 | sv = sv_2mortal(newSViv(42)); |
| 231 | } |
| 232 | sv_setsv(ST(0), sv); |
| 233 | |
| 234 | This code tries to return a new SV (which contains the value 42) if it should |
| 235 | return a real value, or undef otherwise. Instead it has returned a NULL |
| 236 | pointer which, somewhere down the line, will cause a segmentation violation, |
| 237 | bus error, or just weird results. Change the zero to C<&PL_sv_undef> in the |
| 238 | first line and all will be well. |
| 239 | |
| 240 | To free an SV that you've created, call C<SvREFCNT_dec(SV*)>. Normally this |
| 241 | call is not necessary (see L<Reference Counts and Mortality>). |
| 242 | |
| 243 | =head2 Offsets |
| 244 | |
| 245 | Perl provides the function C<sv_chop> to efficiently remove characters |
| 246 | from the beginning of a string; you give it an SV and a pointer to |
| 247 | somewhere inside the PV, and it discards everything before the |
| 248 | pointer. The efficiency comes by means of a little hack: instead of |
| 249 | actually removing the characters, C<sv_chop> sets the flag C<OOK> |
| 250 | (offset OK) to signal to other functions that the offset hack is in |
| 251 | effect, and it puts the number of bytes chopped off into the IV field |
| 252 | of the SV. It then moves the PV pointer (called C<SvPVX>) forward that |
| 253 | many bytes, and adjusts C<SvCUR> and C<SvLEN>. |
| 254 | |
| 255 | Hence, at this point, the start of the buffer that we allocated lives |
| 256 | at C<SvPVX(sv) - SvIV(sv)> in memory and the PV pointer is pointing |
| 257 | into the middle of this allocated storage. |
| 258 | |
| 259 | This is best demonstrated by example: |
| 260 | |
| 261 | % ./perl -Ilib -MDevel::Peek -le '$a="12345"; $a=~s/.//; Dump($a)' |
| 262 | SV = PVIV(0x8128450) at 0x81340f0 |
| 263 | REFCNT = 1 |
| 264 | FLAGS = (POK,OOK,pPOK) |
| 265 | IV = 1 (OFFSET) |
| 266 | PV = 0x8135781 ( "1" . ) "2345"\0 |
| 267 | CUR = 4 |
| 268 | LEN = 5 |
| 269 | |
| 270 | Here the number of bytes chopped off (1) is put into IV, and |
| 271 | C<Devel::Peek::Dump> helpfully reminds us that this is an offset. The |
| 272 | portion of the string between the "real" and the "fake" beginnings is |
| 273 | shown in parentheses, and the values of C<SvCUR> and C<SvLEN> reflect |
| 274 | the fake beginning, not the real one. |
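The bookkeeping described above can be modelled in a few lines of standalone C. This is only a sketch: C<ToySV> and the C<toy_*> functions are hypothetical names, and the struct merely echoes the roles of C<SvPVX>, C<SvCUR>, C<SvLEN> and the IV slot rather than reproducing Perl's real layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature of a string SV. */
typedef struct {
    char  *pvx;     /* visible start of the string (role of SvPVX) */
    size_t cur;     /* visible string length (role of SvCUR) */
    size_t len;     /* visible allocated size (role of SvLEN) */
    size_t offset;  /* bytes chopped off (role of the IV slot) */
    int    ook;     /* nonzero while the offset hack is in effect */
} ToySV;

/* Allocate a ToySV holding a copy of `s`; returns NULL on failure. */
static ToySV *toy_new(const char *s)
{
    ToySV *sv = malloc(sizeof *sv);
    if (!sv) return NULL;
    sv->cur = strlen(s);
    sv->len = sv->cur + 1;              /* room for the trailing NUL */
    sv->pvx = malloc(sv->len);
    if (!sv->pvx) { free(sv); return NULL; }
    memcpy(sv->pvx, s, sv->len);
    sv->offset = 0;
    sv->ook = 0;
    return sv;
}

/* Discard everything before `ptr` without moving any bytes. */
static void toy_chop(ToySV *sv, char *ptr)
{
    size_t delta;

    assert(ptr >= sv->pvx && ptr <= sv->pvx + sv->cur);
    delta = (size_t)(ptr - sv->pvx);
    sv->ook = 1;
    sv->offset += delta;   /* remember how far the start has moved */
    sv->pvx += delta;
    sv->cur -= delta;
    sv->len -= delta;
}

/* The real start of the allocation, needed when freeing. */
static char *toy_real_start(const ToySV *sv)
{
    return sv->pvx - sv->offset;
}

static void toy_free(ToySV *sv)
{
    free(toy_real_start(sv));
    free(sv);
}
```

Chopping one byte from C<"12345"> in this model leaves a visible length of 4 and a visible size of 5, the same C<CUR> and C<LEN> values as in the dump.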
| 275 | |
| 276 | Something similar to the offset hack is performed on AVs to enable |
| 277 | efficient shifting and splicing off the beginning of the array; while |
| 278 | C<AvARRAY> points to the first element in the array that is visible from |
| 279 | Perl, C<AvALLOC> points to the real start of the C array. These are |
| 280 | usually the same, but a C<shift> operation can be carried out by |
| 281 | increasing C<AvARRAY> by one and decreasing C<AvFILL> and C<AvLEN>. |
| 282 | Again, the location of the real start of the C array only comes into |
| 283 | play when freeing the array. See C<av_shift> in F<av.c>. |
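A rough standalone model of that pointer adjustment, using an array of ints in place of C<SV*> elements. C<ToyAV> and its functions are hypothetical names chosen for this sketch; only the roles of the fields follow the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature of an AV. */
typedef struct {
    int      *alloc;  /* real start of the C array (role of AvALLOC) */
    int      *array;  /* first visible element (role of AvARRAY) */
    ptrdiff_t fill;   /* highest visible index, -1 if empty (AvFILL) */
    ptrdiff_t len;    /* capacity counted from `array` (the text's AvLEN) */
} ToyAV;

/* Wrap caller-owned storage holding `count` elements. */
static void toy_av_init(ToyAV *av, int *storage,
                        ptrdiff_t count, ptrdiff_t capacity)
{
    av->alloc = storage;
    av->array = storage;
    av->fill  = count - 1;
    av->len   = capacity;
}

/* Remove and return the first element without moving any data. */
static int toy_av_shift(ToyAV *av)
{
    int first;

    assert(av->fill >= 0);   /* caller must not shift an empty array */
    first = av->array[0];
    av->array++;             /* visible start moves forward by one */
    av->fill--;
    av->len--;
    return first;            /* av->alloc is left untouched */
}
```

After a shift, C<alloc> still points at the original storage, which is what a later free would need.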
| 284 | |
| 285 | =head2 What's Really Stored in an SV? |
| 286 | |
| 287 | Recall that the usual method of determining the type of scalar you have is |
| 288 | to use C<Sv*OK> macros. Because a scalar can be both a number and a string, |
| 289 | these macros will usually return TRUE, and calling the C<Sv*V> |
| 290 | macros will do the appropriate conversion of string to integer/double or |
| 291 | integer/double to string. |
| 292 | |
| 293 | If you I<really> need to know if you have an integer, double, or string |
| 294 | pointer in an SV, you can use the following three macros instead: |
| 295 | |
| 296 | SvIOKp(SV*) |
| 297 | SvNOKp(SV*) |
| 298 | SvPOKp(SV*) |
| 299 | |
| 300 | These will tell you if you truly have an integer, double, or string pointer |
| 301 | stored in your SV. The "p" stands for private. |
| 302 | |
| 303 | There are various ways in which the private and public flags may differ. |
| 304 | For example, a tied SV may have a valid underlying value in the IV slot |
| 305 | (so SvIOKp is true), but the data should be accessed via the FETCH |
| 306 | routine rather than directly, so SvIOK is false. Another is when |
| 307 | numeric conversion has occurred and precision has been lost: only the |
| 308 | private flag is set on 'lossy' values. So when an NV is converted to an |
| 309 | IV with loss, SvIOKp, SvNOKp and SvNOK will be set, while SvIOK won't be. |
| 310 | |
| 311 | In general, though, it's best to use the C<Sv*V> macros. |
| 312 | |
| 313 | =head2 Working with AVs |
| 314 | |
| 315 | There are two ways to create and load an AV. The first method creates an |
| 316 | empty AV: |
| 317 | |
| 318 | AV* newAV(); |
| 319 | |
| 320 | The second method both creates the AV and initially populates it with SVs: |
| 321 | |
| 322 | AV* av_make(I32 num, SV **ptr); |
| 323 | |
| 324 | The second argument points to an array containing C<num> C<SV*>'s. Once the |
| 325 | AV has been created, the SVs can be destroyed, if so desired. |
| 326 | |
| 327 | Once the AV has been created, the following operations are possible on AVs: |
| 328 | |
| 329 | void av_push(AV*, SV*); |
| 330 | SV* av_pop(AV*); |
| 331 | SV* av_shift(AV*); |
| 332 | void av_unshift(AV*, I32 num); |
| 333 | |
| 334 | These should be familiar operations, with the exception of C<av_unshift>. |
| 335 | This routine adds C<num> elements at the front of the array with the C<undef> |
| 336 | value. You must then use C<av_store> (described below) to assign values |
| 337 | to these new elements. |
| 338 | |
| 339 | Here are some other functions: |
| 340 | |
| 341 | I32 av_len(AV*); |
| 342 | SV** av_fetch(AV*, I32 key, I32 lval); |
| 343 | SV** av_store(AV*, I32 key, SV* val); |
| 344 | |
| 345 | The C<av_len> function returns the highest index value in the array (just |
| 346 | like $#array in Perl). If the array is empty, -1 is returned. The |
| 347 | C<av_fetch> function returns the value at index C<key>, but if C<lval> |
| 348 | is non-zero, then C<av_fetch> will store an undef value at that index. |
| 349 | The C<av_store> function stores the value C<val> at index C<key>, and does |
| 350 | not increment the reference count of C<val>. Thus the caller is responsible |
| 351 | for taking care of that, and if C<av_store> returns NULL, the caller will |
| 352 | have to decrement the reference count to avoid a memory leak. Note that |
| 353 | C<av_fetch> and C<av_store> both return C<SV**>'s, not C<SV*>'s as their |
| 354 | return value. |
| 355 | |
| 356 | void av_clear(AV*); |
| 357 | void av_undef(AV*); |
| 358 | void av_extend(AV*, I32 key); |
| 359 | |
| 360 | The C<av_clear> function deletes all the elements in the AV* array, but |
| 361 | does not actually delete the array itself. The C<av_undef> function will |
| 362 | delete all the elements in the array plus the array itself. The |
| 363 | C<av_extend> function extends the array so that it contains at least C<key+1> |
| 364 | elements. If C<key+1> is less than the currently allocated length of the array, |
| 365 | then nothing is done. |
| 366 | |
| 367 | If you know the name of an array variable, you can get a pointer to its AV |
| 368 | by using the following: |
| 369 | |
| 370 | AV* get_av("package::varname", FALSE); |
| 371 | |
| 372 | This returns NULL if the variable does not exist. |
| 373 | |
| 374 | See L<Understanding the Magic of Tied Hashes and Arrays> for more |
| 375 | information on how to use the array access functions on tied arrays. |
| 376 | |
| 377 | =head2 Working with HVs |
| 378 | |
| 379 | To create an HV, you use the following routine: |
| 380 | |
| 381 | HV* newHV(); |
| 382 | |
| 383 | Once the HV has been created, the following operations are possible on HVs: |
| 384 | |
| 385 | SV** hv_store(HV*, const char* key, U32 klen, SV* val, U32 hash); |
| 386 | SV** hv_fetch(HV*, const char* key, U32 klen, I32 lval); |
| 387 | |
| 388 | The C<klen> parameter is the length of the key being passed in (Note that |
| 389 | you cannot pass 0 in as a value of C<klen> to tell Perl to measure the |
| 390 | length of the key). The C<val> argument contains the SV pointer to the |
| 391 | scalar being stored, and C<hash> is the precomputed hash value (zero if |
| 392 | you want C<hv_store> to calculate it for you). The C<lval> parameter |
| 393 | indicates whether this fetch is actually a part of a store operation, in |
| 394 | which case a new undefined value will be added to the HV with the supplied |
| 395 | key and C<hv_fetch> will return as if the value had already existed. |
| 396 | |
| 397 | Remember that C<hv_store> and C<hv_fetch> return C<SV**>'s and not just |
| 398 | C<SV*>. To access the scalar value, you must first dereference the return |
| 399 | value. However, you should check to make sure that the return value is |
| 400 | not NULL before dereferencing it. |
| 401 | |
| 402 | These two functions check whether a hash table entry exists and delete it, respectively: |
| 403 | |
| 404 | bool hv_exists(HV*, const char* key, U32 klen); |
| 405 | SV* hv_delete(HV*, const char* key, U32 klen, I32 flags); |
| 406 | |
| 407 | If C<flags> does not include the C<G_DISCARD> flag then C<hv_delete> will |
| 408 | create and return a mortal copy of the deleted value. |
| 409 | |
| 410 | And more miscellaneous functions: |
| 411 | |
| 412 | void hv_clear(HV*); |
| 413 | void hv_undef(HV*); |
| 414 | |
| 415 | Like their AV counterparts, C<hv_clear> deletes all the entries in the hash |
| 416 | table but does not actually delete the hash table. C<hv_undef> deletes |
| 417 | both the entries and the hash table itself. |
| 418 | |
| 419 | Perl keeps the actual data in a linked list of structures with a typedef of HE. |
| 420 | These contain the actual key and value pointers (plus extra administrative |
| 421 | overhead). The key is a string pointer; the value is an C<SV*>. However, |
| 422 | once you have an C<HE*>, to get the actual key and value, use the routines |
| 423 | specified below. |
| 424 | |
| 425 | I32 hv_iterinit(HV*); |
| 426 | /* Prepares starting point to traverse hash table */ |
| 427 | HE* hv_iternext(HV*); |
| 428 | /* Get the next entry, and return a pointer to a |
| 429 | structure that has both the key and value */ |
| 430 | char* hv_iterkey(HE* entry, I32* retlen); |
| 431 | /* Get the key from an HE structure and also return |
| 432 | the length of the key string */ |
| 433 | SV* hv_iterval(HV*, HE* entry); |
| 434 | /* Return an SV pointer to the value of the HE |
| 435 | structure */ |
| 436 | SV* hv_iternextsv(HV*, char** key, I32* retlen); |
| 437 | /* This convenience routine combines hv_iternext, |
| 438 | hv_iterkey, and hv_iterval. The key and retlen |
| 439 | arguments are return values for the key and its |
| 440 | length. The value is returned as the function's SV* result */ |
| 441 | |
| 442 | If you know the name of a hash variable, you can get a pointer to its HV |
| 443 | by using the following: |
| 444 | |
| 445 | HV* get_hv("package::varname", FALSE); |
| 446 | |
| 447 | This returns NULL if the variable does not exist. |
| 448 | |
| 449 | The hash algorithm is defined in the C<PERL_HASH(hash, key, klen)> macro: |
| 450 | |
| 451 | hash = 0; |
| 452 | while (klen--) |
| 453 | hash = (hash * 33) + *key++; |
| 454 | hash = hash + (hash >> 5); /* after 5.6 */ |
| 455 | |
| 456 | The last step was added in version 5.6 to improve distribution of |
| 457 | lower bits in the resulting hash value. |
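As a standalone illustration, the loop above can be wrapped in an ordinary C function. The name C<toy_perl_hash> is hypothetical, and using C<uint32_t> in place of Perl's C<U32> is an assumption of this sketch; the arithmetic is copied from the macro as shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply-by-33 hash followed by the extra mixing step.
 * For bytes >= 0x80 the result depends on whether plain `char`
 * is signed on the platform, exactly as in the loop above. */
static uint32_t toy_perl_hash(const char *key, size_t klen)
{
    uint32_t hash = 0;

    assert(key != NULL || klen == 0);
    while (klen--)
        hash = (hash * 33) + *key++;
    hash = hash + (hash >> 5);   /* fold high bits into the low bits */
    return hash;
}
```

For the one-byte key C<"a"> (value 97) the loop yields 97, and the final step adds C<97 E<gt>E<gt> 5>, which is 3, giving 100.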
| 458 | |
| 459 | See L<Understanding the Magic of Tied Hashes and Arrays> for more |
| 460 | information on how to use the hash access functions on tied hashes. |
| 461 | |
| 462 | =head2 Hash API Extensions |
| 463 | |
| 464 | Beginning with version 5.004, the following functions are also supported: |
| 465 | |
| 466 | HE* hv_fetch_ent (HV* tb, SV* key, I32 lval, U32 hash); |
| 467 | HE* hv_store_ent (HV* tb, SV* key, SV* val, U32 hash); |
| 468 | |
| 469 | bool hv_exists_ent (HV* tb, SV* key, U32 hash); |
| 470 | SV* hv_delete_ent (HV* tb, SV* key, I32 flags, U32 hash); |
| 471 | |
| 472 | SV* hv_iterkeysv (HE* entry); |
| 473 | |
| 474 | Note that these functions take C<SV*> keys, which simplifies writing |
| 475 | of extension code that deals with hash structures. These functions |
| 476 | also allow passing of C<SV*> keys to C<tie> functions without forcing |
| 477 | you to stringify the keys (unlike the previous set of functions). |
| 478 | |
| 479 | They also return and accept whole hash entries (C<HE*>), making their |
| 480 | use more efficient (since the hash number for a particular string |
| 481 | doesn't have to be recomputed every time). See L<perlapi> for detailed |
| 482 | descriptions. |
| 483 | |
| 484 | The following macros must always be used to access the contents of hash |
| 485 | entries. Note that the arguments to these macros must be simple |
| 486 | variables, since they may get evaluated more than once. See |
| 487 | L<perlapi> for detailed descriptions of these macros. |
| 488 | |
| 489 | HePV(HE* he, STRLEN len) |
| 490 | HeVAL(HE* he) |
| 491 | HeHASH(HE* he) |
| 492 | HeSVKEY(HE* he) |
| 493 | HeSVKEY_force(HE* he) |
| 494 | HeSVKEY_set(HE* he, SV* sv) |
| 495 | |
| 496 | These two lower level macros are defined, but must only be used when |
| 497 | dealing with keys that are not C<SV*>s: |
| 498 | |
| 499 | HeKEY(HE* he) |
| 500 | HeKLEN(HE* he) |
| 501 | |
| 502 | Note that both C<hv_store> and C<hv_store_ent> do not increment the |
| 503 | reference count of the stored C<val>, which is the caller's responsibility. |
| 504 | If these functions return a NULL value, the caller will usually have to |
| 505 | decrement the reference count of C<val> to avoid a memory leak. |
| 506 | |
| 507 | =head2 AVs, HVs and undefined values |
| 508 | |
| 509 | Sometimes you have to store undefined values in AVs or HVs. Although |
| 510 | this may be a rare case, it can be tricky. That's because you're |
| 511 | used to using C<&PL_sv_undef> if you need an undefined SV. |
| 512 | |
| 513 | For example, intuition tells you that this XS code: |
| 514 | |
| 515 | AV *av = newAV(); |
| 516 | av_store( av, 0, &PL_sv_undef ); |
| 517 | |
| 518 | is equivalent to this Perl code: |
| 519 | |
| 520 | my @av; |
| 521 | $av[0] = undef; |
| 522 | |
| 523 | Unfortunately, this isn't true. AVs use C<&PL_sv_undef> as a marker |
| 524 | for indicating that an array element has not yet been initialized. |
| 525 | Thus, C<exists $av[0]> would be true for the above Perl code, but |
| 526 | false for the array generated by the XS code. |
| 527 | |
| 528 | Other problems can occur when storing C<&PL_sv_undef> in HVs: |
| 529 | |
| 530 | hv_store( hv, "key", 3, &PL_sv_undef, 0 ); |
| 531 | |
| 532 | This will indeed make the value C<undef>, but if you try to modify |
| 533 | the value of C<key>, you'll get the following error: |
| 534 | |
| 535 | Modification of non-creatable hash value attempted |
| 536 | |
| 537 | In perl 5.8.0, C<&PL_sv_undef> was also used to mark placeholders |
| 538 | in restricted hashes. This caused such hash entries not to appear |
| 539 | when iterating over the hash or when checking for the keys |
| 540 | with the C<hv_exists> function. |
| 541 | |
| 542 | You can run into similar problems when you store C<&PL_sv_yes> or |
| 543 | C<&PL_sv_no> into AVs or HVs. Trying to modify such elements |
| 544 | will give you the following error: |
| 545 | |
| 546 | Modification of a read-only value attempted |
| 547 | |
| 548 | To make a long story short, you can use the special variables |
| 549 | C<&PL_sv_undef>, C<&PL_sv_yes> and C<&PL_sv_no> with AVs and |
| 550 | HVs, but you have to make sure you know what you're doing. |
| 551 | |
| 552 | Generally, if you want to store an undefined value in an AV |
| 553 | or HV, you should not use C<&PL_sv_undef>, but rather create a |
| 554 | new undefined value using the C<newSV> function, for example: |
| 555 | |
| 556 | av_store( av, 42, newSV(0) ); |
| 557 | hv_store( hv, "foo", 3, newSV(0), 0 ); |
| 558 | |
| 559 | =head2 References |
| 560 | |
| 561 | References are a special type of scalar that point to other data types |
| 562 | (including references). |
| 563 | |
| 564 | To create a reference, use either of the following functions: |
| 565 | |
| 566 | SV* newRV_inc((SV*) thing); |
| 567 | SV* newRV_noinc((SV*) thing); |
| 568 | |
| 569 | The C<thing> argument can be any of an C<SV*>, C<AV*>, or C<HV*>. The |
| 570 | functions are identical except that C<newRV_inc> increments the reference |
| 571 | count of the C<thing>, while C<newRV_noinc> does not. For historical |
| 572 | reasons, C<newRV> is a synonym for C<newRV_inc>. |
| 573 | |
| 574 | Once you have a reference, you can use the following macro to dereference |
| 575 | the reference: |
| 576 | |
| 577 | SvRV(SV*) |
| 578 | |
| 579 | then call the appropriate routines, casting the returned C<SV*> to either an |
| 580 | C<AV*> or C<HV*>, if required. |
| 581 | |
| 582 | To determine if an SV is a reference, you can use the following macro: |
| 583 | |
| 584 | SvROK(SV*) |
| 585 | |
| 586 | To discover what type of value the reference refers to, use the following |
| 587 | macro and then check the return value. |
| 588 | |
| 589 | SvTYPE(SvRV(SV*)) |
| 590 | |
| 591 | The most useful types that will be returned are: |
| 592 | |
| 593 | SVt_IV Scalar |
| 594 | SVt_NV Scalar |
| 595 | SVt_PV Scalar |
| 596 | SVt_RV Scalar |
| 597 | SVt_PVAV Array |
| 598 | SVt_PVHV Hash |
| 599 | SVt_PVCV Code |
| 600 | SVt_PVGV Glob (possibly a file handle) |
| 601 | SVt_PVMG Blessed or Magical Scalar |
| 602 | |
| 603 | See the sv.h header file for more details. |
| 604 | |
| 605 | =head2 Blessed References and Class Objects |
| 606 | |
| 607 | References are also used to support object-oriented programming. In perl's |
| 608 | OO lexicon, an object is simply a reference that has been blessed into a |
| 609 | package (or class). Once blessed, the programmer may now use the reference |
| 610 | to access the various methods in the class. |
| 611 | |
| 612 | A reference can be blessed into a package with the following function: |
| 613 | |
| 614 | SV* sv_bless(SV* sv, HV* stash); |
| 615 | |
| 616 | The C<sv> argument must be a reference value. The C<stash> argument |
| 617 | specifies which class the reference will belong to. See |
| 618 | L<Stashes and Globs> for information on converting class names into stashes. |
| 619 | |
| 620 | /* Still under construction */ |
| 621 | |
| 622 | Upgrades rv to reference if not already one. Creates new SV for rv to |
| 623 | point to. If C<classname> is non-null, the SV is blessed into the specified |
| 624 | class. SV is returned. |
| 625 | |
| 626 | SV* newSVrv(SV* rv, const char* classname); |
| 627 | |
| 628 | Copies integer, unsigned integer or double into an SV whose reference is C<rv>. SV is blessed |
| 629 | if C<classname> is non-null. |
| 630 | |
| 631 | SV* sv_setref_iv(SV* rv, const char* classname, IV iv); |
| 632 | SV* sv_setref_uv(SV* rv, const char* classname, UV uv); |
| 633 | SV* sv_setref_nv(SV* rv, const char* classname, NV iv); |
| 634 | |
| 635 | Copies the pointer value (I<the address, not the string!>) into an SV whose |
| 636 | reference is rv. SV is blessed if C<classname> is non-null. |
| 637 | |
| 638 | SV* sv_setref_pv(SV* rv, const char* classname, void* pv); |
| 639 | |
| 640 | Copies string into an SV whose reference is C<rv>. Set length to 0 to let |
| 641 | Perl calculate the string length. SV is blessed if C<classname> is non-null. |
| 642 | |
| 643 | SV* sv_setref_pvn(SV* rv, const char* classname, char* pv, STRLEN length); |
| 644 | |
| 645 | Tests whether the SV is blessed into the specified class. It does not |
| 646 | check inheritance relationships. |
| 647 | |
| 648 | int sv_isa(SV* sv, const char* name); |
| 649 | |
| 650 | Tests whether the SV is a reference to a blessed object. |
| 651 | |
| 652 | int sv_isobject(SV* sv); |
| 653 | |
| 654 | Tests whether the SV is derived from the specified class. SV can be either |
| 655 | a reference to a blessed object or a string containing a class name. This |
| 656 | is the function implementing the C<UNIVERSAL::isa> functionality. |
| 657 | |
| 658 | bool sv_derived_from(SV* sv, const char* name); |
| 659 | |
| 660 | To check if you've got an object derived from a specific class you have |
| 661 | to write: |
| 662 | |
| 663 | if (sv_isobject(sv) && sv_derived_from(sv, class)) { ... } |
| 664 | |
| 665 | =head2 Creating New Variables |
| 666 | |
| 667 | To create a new Perl variable with an undef value which can be accessed from |
| 668 | your Perl script, use the following routines, depending on the variable type. |
| 669 | |
| 670 | SV* get_sv("package::varname", TRUE); |
| 671 | AV* get_av("package::varname", TRUE); |
| 672 | HV* get_hv("package::varname", TRUE); |
| 673 | |
| 674 | Notice the use of TRUE as the second parameter. The new variable can now |
| 675 | be set, using the routines appropriate to the data type. |
| 676 | |
| 677 | There are additional macros whose values may be bitwise OR'ed with the |
| 678 | C<TRUE> argument to enable certain extra features. Those bits are: |
| 679 | |
| 680 | =over |
| 681 | |
| 682 | =item GV_ADDMULTI |
| 683 | |
| 684 | Marks the variable as multiply defined, thus preventing the: |
| 685 | |
| 686 | Name <varname> used only once: possible typo |
| 687 | |
| 688 | warning. |
| 689 | |
| 690 | =item GV_ADDWARN |
| 691 | |
| 692 | Issues the warning: |
| 693 | |
| 694 | Had to create <varname> unexpectedly |
| 695 | |
| 696 | if the variable did not exist before the function was called. |
| 697 | |
| 698 | =back |
| 699 | |
| 700 | If you do not specify a package name, the variable is created in the current |
| 701 | package. |
| 702 | |
| 703 | =head2 Reference Counts and Mortality |
| 704 | |
| 705 | Perl uses a reference count-driven garbage collection mechanism. SVs, |
| 706 | AVs, or HVs (xV for short in the following) start their life with a |
| 707 | reference count of 1. If the reference count of an xV ever drops to 0, |
| 708 | then it will be destroyed and its memory made available for reuse. |
| 709 | |
| 710 | This normally doesn't happen at the Perl level unless a variable is |
| 711 | undef'ed or the last variable holding a reference to it is changed or |
| 712 | overwritten. At the internal level, however, reference counts can be |
| 713 | manipulated with the following macros: |
| 714 | |
| 715 | int SvREFCNT(SV* sv); |
| 716 | SV* SvREFCNT_inc(SV* sv); |
| 717 | void SvREFCNT_dec(SV* sv); |
| 718 | |
| 719 | However, there is one other function which manipulates the reference |
| 720 | count of its argument. The C<newRV_inc> function, you will recall, |
| 721 | creates a reference to the specified argument. As a side effect, |
| 722 | it increments the argument's reference count. If this is not what |
| 723 | you want, use C<newRV_noinc> instead. |
| 724 | |
| 725 | For example, imagine you want to return a reference from an XSUB function. |
| 726 | Inside the XSUB routine, you create an SV which initially has a reference |
| 727 | count of one. Then you call C<newRV_inc>, passing it the just-created SV. |
| 728 | This returns the reference as a new SV, but the reference count of the |
| 729 | SV you passed to C<newRV_inc> has been incremented to two. Now you |
| 730 | return the reference from the XSUB routine and forget about the SV. |
| 731 | But Perl hasn't! Whenever the returned reference is destroyed, the |
| 732 | reference count of the original SV is decreased to one and nothing happens. |
| 733 | The SV will hang around without any way to access it until Perl itself |
| 734 | terminates. This is a memory leak. |
| 735 | |
| 736 | The correct procedure, then, is to use C<newRV_noinc> instead of |
| 737 | C<newRV_inc>. Then, if and when the last reference is destroyed, |
| 738 | the reference count of the SV will go to zero and it will be destroyed, |
| 739 | stopping any memory leak. |
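
The leak described above can be reproduced in a toy model. The sketch below is plain C and does not use the Perl API: C<toy_sv>, C<toy_rv_inc>, C<toy_rv_noinc> and C<leaked_after> are hypothetical names that only imitate the counting behaviour of C<newRV_inc> and C<newRV_noinc>.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for an SV (an assumption, not Perl's real layout):
 * a reference count plus an optional referent. */
typedef struct toy_sv {
    int refcnt;
    struct toy_sv *referent;   /* non-NULL when this value is a reference */
} toy_sv;

static int live_count = 0;     /* number of toy values not yet freed */

static toy_sv *toy_new(void) {
    toy_sv *sv = calloc(1, sizeof *sv);
    assert(sv != NULL);
    sv->refcnt = 1;            /* values start life with a count of 1 */
    live_count++;
    return sv;
}

static void toy_dec(toy_sv *sv) {
    assert(sv->refcnt > 0);
    if (--sv->refcnt == 0) {
        if (sv->referent)
            toy_dec(sv->referent);   /* a dying reference releases its target */
        free(sv);
        live_count--;
    }
}

/* Imitates newRV_inc: the reference takes an additional count on the target. */
static toy_sv *toy_rv_inc(toy_sv *target) {
    toy_sv *rv = toy_new();
    target->refcnt++;
    rv->referent = target;
    return rv;
}

/* Imitates newRV_noinc: the reference takes over the caller's count. */
static toy_sv *toy_rv_noinc(toy_sv *target) {
    toy_sv *rv = toy_new();
    rv->referent = target;
    return rv;
}

/* Creates a value, wraps it in a reference, forgets the value, and later
 * destroys the reference. Returns how many values are still alive. */
static int leaked_after(int use_inc) {
    int before = live_count;
    toy_sv *sv = toy_new();
    toy_sv *rv = use_inc ? toy_rv_inc(sv) : toy_rv_noinc(sv);
    toy_dec(rv);
    return live_count - before;
}
```

In this model C<leaked_after(1)> leaves one unreachable value behind, while C<leaked_after(0)> leaves none, which mirrors the reasoning for preferring C<newRV_noinc> in that situation.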
| 740 | |
| 741 | There are some convenience functions available that can help with the |
| 742 | destruction of xVs. These functions introduce the concept of "mortality". |
| 743 | An xV that is mortal has had its reference count marked to be decremented, |
| 744 | but not actually decremented, until "a short time later". Generally the |
| 745 | term "short time later" means a single Perl statement, such as a call to |
| 746 | an XSUB function. The actual determinant for when mortal xVs have their |
| 747 | reference count decremented depends on two macros, SAVETMPS and FREETMPS. |
| 748 | See L<perlcall> and L<perlxs> for more details on these macros. |
| 749 | |
| 750 | "Mortalization" then is at its simplest a deferred C<SvREFCNT_dec>. |
| 751 | However, if you mortalize a variable twice, the reference count will |
| 752 | later be decremented twice. |
| 753 | |
| 754 | "Mortal" SVs are mainly used for SVs that are placed on perl's stack. |
| 755 | For example an SV which is created just to pass a number to a called sub |
| 756 | is made mortal to have it cleaned up automatically when it's popped off |
| 757 | the stack. Similarly, results returned by XSUBs (which are pushed on the |
| 758 | stack) are often made mortal. |
| 759 | |
| 760 | To create a mortal variable, use the functions: |
| 761 | |
| 762 | SV* sv_newmortal() |
| 763 | SV* sv_2mortal(SV*) |
| 764 | SV* sv_mortalcopy(SV*) |
| 765 | |
| 766 | The first call creates a mortal SV (with no value), the second converts an existing |
| 767 | SV to a mortal SV (and thus defers a call to C<SvREFCNT_dec>), and the |
| 768 | third creates a mortal copy of an existing SV. |
| 769 | Because C<sv_newmortal> gives the new SV no value, it must normally be given one |
| 770 | via C<sv_setpv>, C<sv_setiv>, etc.: |
| 771 | |
| 772 | SV *tmp = sv_newmortal(); |
| 773 | sv_setiv(tmp, an_integer); |
| 774 | |
| 775 | As that is multiple C statements, it is quite common to see this idiom instead: |
| 776 | |
| 777 | SV *tmp = sv_2mortal(newSViv(an_integer)); |
| 778 | |
| 779 | |
| 780 | You should be careful about creating mortal variables. Strange things |
| 781 | can happen if you make the same value mortal within multiple contexts, |
| 782 | or if you make a variable mortal multiple times. Thinking of "Mortalization" |
| 783 | as deferred C<SvREFCNT_dec> should help to minimize such problems. |
| 784 | For example if you are passing an SV which you I<know> has high enough REFCNT |
| 785 | to survive its use on the stack you need not do any mortalization. |
| 786 | If you are not sure then doing an C<SvREFCNT_inc> and C<sv_2mortal>, or |
| 787 | making a C<sv_mortalcopy> is safer. |
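
The "deferred C<SvREFCNT_dec>" view of mortality can be sketched with a small stack of pending decrements. This is a toy model in plain C, not Perl's implementation: C<toy_2mortal>, C<toy_savetmps> and C<toy_freetmps> are hypothetical stand-ins for C<sv_2mortal>, C<SAVETMPS> and C<FREETMPS>, and the fixed-size array is an assumption made for brevity.

```c
#include <assert.h>

#define TMPS_MAX 16

/* Toy value: "freed" stands in for real destruction. */
typedef struct { int refcnt; int freed; } toy_sv;

static toy_sv *tmps[TMPS_MAX];  /* pending deferred decrements */
static int tmps_top = 0;        /* number of entries in use */

static void toy_dec(toy_sv *sv) {
    assert(sv->refcnt > 0);
    if (--sv->refcnt == 0)
        sv->freed = 1;
}

/* Imitates sv_2mortal: record a decrement to perform later. */
static toy_sv *toy_2mortal(toy_sv *sv) {
    assert(tmps_top < TMPS_MAX);
    tmps[tmps_top++] = sv;
    return sv;
}

/* Imitates SAVETMPS: remember where the current batch of temporaries starts. */
static int toy_savetmps(void) { return tmps_top; }

/* Imitates FREETMPS: perform every decrement recorded since the floor. */
static void toy_freetmps(int floor) {
    while (tmps_top > floor)
        toy_dec(tmps[--tmps_top]);
}

/* Returns 1 if the model behaves as the text describes. */
static int demo_mortal(void) {
    toy_sv kept = {1, 0}, temp = {1, 0};
    int floor = toy_savetmps();

    kept.refcnt++;        /* like SvREFCNT_inc followed by sv_2mortal: */
    toy_2mortal(&kept);   /* the net effect on "kept" is zero          */
    toy_2mortal(&temp);   /* the sole owner gives "temp" up            */

    int alive_before = !temp.freed;   /* still usable until the flush */
    toy_freetmps(floor);
    return alive_before && kept.refcnt == 1 && !kept.freed && temp.freed;
}
```

The model also shows why mortalizing the same value twice is dangerous: two entries on the pending list mean two decrements at flush time.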
| 788 | |
| 789 | The mortal routines are not just for SVs -- AVs and HVs can be |
| 790 | made mortal by passing their address (type-casted to C<SV*>) to the |
| 791 | C<sv_2mortal> or C<sv_mortalcopy> routines. |
| 792 | |
| 793 | =head2 Stashes and Globs |
| 794 | |
| 795 | A B<stash> is a hash that contains all variables that are defined |
| 796 | within a package. Each key of the stash is a symbol |
| 797 | name (shared by all the different types of objects that have the same |
| 798 | name), and each value in the hash table is a GV (Glob Value). This GV |
| 799 | in turn contains references to the various objects of that name, |
| 800 | including (but not limited to) the following: |
| 801 | |
| 802 | Scalar Value |
| 803 | Array Value |
| 804 | Hash Value |
| 805 | I/O Handle |
| 806 | Format |
| 807 | Subroutine |
| 808 | |
| 809 | There is a single stash called C<PL_defstash> that holds the items that exist |
| 810 | in the C<main> package. To get at the items in other packages, append the |
| 811 | string "::" to the package name. The items in the C<Foo> package are in |
| 812 | the stash C<Foo::> in PL_defstash. The items in the C<Bar::Baz> package are |
| 813 | in the stash C<Baz::> in C<Bar::>'s stash. |
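
The naming rule for nested stashes can be illustrated without touching the Perl API. The helper below is a hypothetical plain C function (C<stash_key> is not part of Perl) that only computes which key would be looked up at each level for a given package name; it is a sketch of the rule above, not of how Perl performs the lookup.

```c
#include <assert.h>
#include <string.h>

/* Writes the idx-th stash key for package name "pkg" into "out".
 * For "Bar::Baz", idx 0 gives "Bar::" (looked up in the main stash)
 * and idx 1 gives "Baz::" (looked up in the stash found for "Bar::").
 * Returns 1 on success, 0 if there is no such component or "out" is
 * too small. */
static int stash_key(const char *pkg, int idx, char *out, size_t outlen) {
    const char *start = pkg;
    for (;;) {
        const char *sep = strstr(start, "::");
        size_t len = sep ? (size_t)(sep - start) : strlen(start);
        if (idx == 0) {
            if (len == 0 || len + 3 > outlen)
                return 0;
            memcpy(out, start, len);
            memcpy(out + len, "::", 3);   /* copies "::" and the terminator */
            return 1;
        }
        if (!sep)
            return 0;
        start = sep + 2;                  /* move past the separator */
        idx--;
    }
}
```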
| 814 | |
| 815 | To get the stash pointer for a particular package, use one of the functions: |
| 816 | |
| 817 | HV* gv_stashpv(const char* name, I32 create) |
| 818 | HV* gv_stashsv(SV*, I32 create) |
| 819 | |
| 820 | The first function takes a literal string, the second uses the string stored |
| 821 | in the SV. Remember that a stash is just a hash table, so you get back an |
| 822 | C<HV*>. The C<create> flag will create a new package if it is set. |
| 823 | |
| 824 | The name that C<gv_stash*v> wants is the name of the package whose symbol table |
| 825 | you want. The default package is called C<main>. If you have multiply nested |
| 826 | packages, pass their names to C<gv_stash*v>, separated by C<::> as in the Perl |
| 827 | language itself. |
| 828 | |
| 829 | Alternately, if you have an SV that is a blessed reference, you can find |
| 830 | out the stash pointer by using: |
| 831 | |
| 832 | HV* SvSTASH(SvRV(SV*)); |
| 833 | |
| 834 | then use the following to get the package name itself: |
| 835 | |
| 836 | char* HvNAME(HV* stash); |
| 837 | |
| 838 | If you need to bless or re-bless an object you can use the following |
| 839 | function: |
| 840 | |
| 841 | SV* sv_bless(SV*, HV* stash) |
| 842 | |
| 843 | where the first argument, an C<SV*>, must be a reference, and the second |
| 844 | argument is a stash. The returned C<SV*> can now be used in the same way |
| 845 | as any other SV. |
| 846 | |
| 847 | For more information on references and blessings, consult L<perlref>. |
| 848 | |
| 849 | =head2 Double-Typed SVs |
| 850 | |
| 851 | Scalar variables normally contain only one type of value, an integer, |
| 852 | double, pointer, or reference. Perl will automatically convert the |
| 853 | actual scalar data from the stored type into the requested type. |
| 854 | |
| 855 | Some scalar variables contain more than one type of scalar data. For |
| 856 | example, the variable C<$!> contains either the numeric value of C<errno> |
| 857 | or its string equivalent from either C<strerror> or C<sys_errlist[]>. |
| 858 | |
| 859 | To force multiple data values into an SV, you must do two things: use the |
| 860 | C<sv_set*v> routines to add the additional scalar type, then set a flag |
| 861 | so that Perl will believe it contains more than one type of data. The |
| 862 | four macros to set the flags are: |
| 863 | |
| 864 | SvIOK_on |
| 865 | SvNOK_on |
| 866 | SvPOK_on |
| 867 | SvROK_on |
| 868 | |
| 869 | The particular macro you must use depends on which C<sv_set*v> routine |
| 870 | you called first. This is because every C<sv_set*v> routine turns on |
| 871 | only the bit for the particular type of data being set, and turns off |
| 872 | all the rest. |
| 873 | |
| 874 | For example, to create a new Perl variable called "dberror" that contains |
| 875 | both the numeric and descriptive string error values, you could use the |
| 876 | following code: |
| 877 | |
| 878 | extern int dberror; |
| 879 | extern char *dberror_list[]; |
| 880 | |
| 881 | SV* sv = get_sv("dberror", TRUE); |
| 882 | sv_setiv(sv, (IV) dberror); |
| 883 | sv_setpv(sv, dberror_list[dberror]); |
| 884 | SvIOK_on(sv); |
| 885 | |
| 886 | If the order of C<sv_setiv> and C<sv_setpv> had been reversed, then the |
| 887 | macro C<SvPOK_on> would need to be called instead of C<SvIOK_on>. |
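
Why the flag has to be turned back on can be seen in a toy model of the flag handling. The code below is plain C with hypothetical names (C<toy_setiv>, C<toy_setpv>, C<TOY_IOK>, C<TOY_POK>); it assumes, as the text states, that each set routine leaves only its own flag set, and it does not reproduce Perl's real SV layout.

```c
#include <assert.h>

enum { TOY_IOK = 1, TOY_POK = 2 };

typedef struct {
    unsigned flags;     /* which of the slots below are considered valid */
    long iv;
    const char *pv;
} toy_sv;

/* Imitates sv_setiv: stores the integer and leaves only TOY_IOK set. */
static void toy_setiv(toy_sv *sv, long i) {
    sv->iv = i;
    sv->flags = TOY_IOK;
}

/* Imitates sv_setpv: stores the string and leaves only TOY_POK set. */
static void toy_setpv(toy_sv *sv, const char *s) {
    sv->pv = s;
    sv->flags = TOY_POK;
}

/* Builds a dual-valued toy scalar and returns its final flags. */
static unsigned demo_dual(void) {
    toy_sv sv = {0, 0, 0};
    toy_setiv(&sv, 2);
    toy_setpv(&sv, "No such file or directory");  /* clears TOY_IOK */
    sv.flags |= TOY_IOK;                          /* like SvIOK_on  */
    return sv.flags;
}
```

In this model the integer slot still holds its value after the string is set; only the flag was lost, which is why re-enabling the flag is enough.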
| 888 | |
| 889 | =head2 Magic Variables |
| 890 | |
| 891 | [This section still under construction. Ignore everything here. Post no |
| 892 | bills. Everything not permitted is forbidden.] |
| 893 | |
| 894 | Any SV may be magical, that is, it has special features that a normal |
| 895 | SV does not have. These features are stored in the SV structure in a |
| 896 | linked list of C<struct magic>'s, typedef'ed to C<MAGIC>. |
| 897 | |
| 898 | struct magic { |
| 899 | MAGIC* mg_moremagic; |
| 900 | MGVTBL* mg_virtual; |
| 901 | U16 mg_private; |
| 902 | char mg_type; |
| 903 | U8 mg_flags; |
| 904 | SV* mg_obj; |
| 905 | char* mg_ptr; |
| 906 | I32 mg_len; |
| 907 | }; |
| 908 | |
| 909 | Note this is current as of patchlevel 0, and could change at any time. |
| 910 | |
| 911 | =head2 Assigning Magic |
| 912 | |
| 913 | Perl adds magic to an SV using the sv_magic function: |
| 914 | |
| 915 | void sv_magic(SV* sv, SV* obj, int how, const char* name, I32 namlen); |
| 916 | |
| 917 | The C<sv> argument is a pointer to the SV that is to acquire a new magical |
| 918 | feature. |
| 919 | |
| 920 | If C<sv> is not already magical, Perl uses the C<SvUPGRADE> macro to |
| 921 | convert C<sv> to type C<SVt_PVMG>. Perl then continues by adding new magic |
| 922 | to the beginning of the linked list of magical features. Any prior entry |
| 923 | of the same type of magic is deleted. Note that this can be overridden, |
| 924 | and multiple instances of the same type of magic can be associated with an |
| 925 | SV. |
| 926 | |
| 927 | The C<name> and C<namlen> arguments are used to associate a string with |
| 928 | the magic, typically the name of a variable. C<namlen> is stored in the |
| 929 | C<mg_len> field and if C<name> is non-null then either a C<savepvn> copy of |
| 930 | C<name> or C<name> itself is stored in the C<mg_ptr> field, depending on |
| 931 | whether C<namlen> is greater than zero or equal to zero respectively. As a |
| 932 | special case, if C<(name && namlen == HEf_SVKEY)> then C<name> is assumed |
| 933 | to contain an C<SV*> and is stored as-is with its REFCNT incremented. |
| 934 | |
| 935 | The sv_magic function uses C<how> to determine which, if any, predefined |
| 936 | "Magic Virtual Table" should be assigned to the C<mg_virtual> field. |
| 937 | See the L<Magic Virtual Tables> section below. The C<how> argument is also |
| 938 | stored in the C<mg_type> field. The value of C<how> should be chosen |
| 939 | from the set of macros C<PERL_MAGIC_foo> found in F<perl.h>. Note that before |
| 940 | these macros were added, Perl internals used to directly use character |
| 941 | literals, so you may occasionally come across old code or documentation |
| 942 | referring to 'U' magic rather than C<PERL_MAGIC_uvar> for example. |
| 943 | |
| 944 | The C<obj> argument is stored in the C<mg_obj> field of the C<MAGIC> |
| 945 | structure. If it is not the same as the C<sv> argument, the reference |
| 946 | count of the C<obj> object is incremented. If it is the same, or if |
| 947 | the C<how> argument is C<PERL_MAGIC_arylen>, or if it is a NULL pointer, |
| 948 | then C<obj> is merely stored, without the reference count being incremented. |
| 949 | |
| 950 | See also C<sv_magicext> in L<perlapi> for a more flexible way to add magic |
| 951 | to an SV. |
| 952 | |
| 953 | There is also a function to add magic to an C<HV>: |
| 954 | |
| 955 | void hv_magic(HV *hv, GV *gv, int how); |
| 956 | |
| 957 | This simply calls C<sv_magic> and coerces the C<gv> argument into an C<SV>. |
| 958 | |
| 959 | To remove the magic from an SV, call the function sv_unmagic: |
| 960 | |
| 961 | void sv_unmagic(SV *sv, int type); |
| 962 | |
| 963 | The C<type> argument should be equal to the C<how> value when the C<SV> |
| 964 | was initially made magical. |
| 965 | |
| 966 | =head2 Magic Virtual Tables |
| 967 | |
| 968 | The C<mg_virtual> field in the C<MAGIC> structure is a pointer to an |
| 969 | C<MGVTBL> ("Magic Virtual Table"), a structure of function pointers |
| 970 | that handle the various operations that might be applied to that |
| 971 | variable. |
| 972 | |
| 973 | The C<MGVTBL> has five (or sometimes eight) pointers to the following |
| 974 | routine types: |
| 975 | |
| 976 | int (*svt_get)(SV* sv, MAGIC* mg); |
| 977 | int (*svt_set)(SV* sv, MAGIC* mg); |
| 978 | U32 (*svt_len)(SV* sv, MAGIC* mg); |
| 979 | int (*svt_clear)(SV* sv, MAGIC* mg); |
| 980 | int (*svt_free)(SV* sv, MAGIC* mg); |
| 981 | |
| 982 | int (*svt_copy)(SV *sv, MAGIC* mg, SV *nsv, const char *name, int namlen); |
| 983 | int (*svt_dup)(MAGIC *mg, CLONE_PARAMS *param); |
| 984 | int (*svt_local)(SV *nsv, MAGIC *mg); |
| 985 | |
| 986 | |
| 987 | This MGVTBL structure is set at compile-time in F<perl.h> and there are |
| 988 | currently 19 types (or 21 with overloading turned on). These different |
| 989 | structures contain pointers to various routines that perform additional |
| 990 | actions depending on which function is being called. |
| 991 | |
| 992 | Function pointer Action taken |
| 993 | ---------------- ------------ |
| 994 | svt_get Do something before the value of the SV is retrieved. |
| 995 | svt_set Do something after the SV is assigned a value. |
| 996 | svt_len Report on the SV's length. |
| 997 | svt_clear Clear something the SV represents. |
| 998 | svt_free Free any extra storage associated with the SV. |
| 999 | |
| 1000 | svt_copy copy tied variable magic to a tied element |
| 1001 | svt_dup duplicate a magic structure during thread cloning |
| 1002 | svt_local copy magic to local value during 'local' |
| 1003 | |
| 1004 | For instance, the MGVTBL structure called C<vtbl_sv> (which corresponds |
| 1005 | to an C<mg_type> of C<PERL_MAGIC_sv>) contains: |
| 1006 | |
| 1007 | { magic_get, magic_set, magic_len, 0, 0 } |
| 1008 | |
| 1009 | Thus, when an SV is determined to be magical and of type C<PERL_MAGIC_sv>, |
| 1010 | if a get operation is being performed, the routine C<magic_get> is |
| 1011 | called. All the various routines for the various magical types begin |
| 1012 | with C<magic_>. NOTE: the magic routines are not considered part of |
| 1013 | the Perl API, and may not be exported by the Perl library. |
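
The dispatch just described can be sketched as a walk over a linked list of magic entries, calling the C<get> slot of each vtable that has one. This is a simplified model in plain C: the C<toy_*> types, C<counting_get> and C<toy_read> are hypothetical, carry only a few of the fields shown earlier, and are not the routines Perl itself uses.

```c
#include <assert.h>
#include <stddef.h>

struct toy_sv;
struct toy_magic;

/* Cut-down vtable: only the get and set slots are modelled. */
typedef struct {
    int (*get)(struct toy_sv *sv, struct toy_magic *mg);
    int (*set)(struct toy_sv *sv, struct toy_magic *mg);
} toy_vtbl;

typedef struct toy_magic {
    struct toy_magic *moremagic;   /* next entry in the list */
    const toy_vtbl *virt;          /* may be NULL, like a "(none)" entry */
    char type;
} toy_magic;

typedef struct toy_sv {
    long value;
    toy_magic *magic;
} toy_sv;

static int get_calls = 0;

/* Example get handler: refreshes the value before it is read. */
static int counting_get(toy_sv *sv, toy_magic *mg) {
    (void)mg;
    sv->value = ++get_calls;
    return 0;
}

/* A vtable with a get handler and an empty set slot. */
static const toy_vtbl counting_vtbl = { counting_get, NULL };

/* Runs every get handler attached to the value, skipping empty slots. */
static void toy_mg_get(toy_sv *sv) {
    for (toy_magic *mg = sv->magic; mg; mg = mg->moremagic)
        if (mg->virt && mg->virt->get)
            mg->virt->get(sv, mg);
}

/* Reading a magical value: trigger 'get' magic first, then use the value. */
static long toy_read(toy_sv *sv) {
    toy_mg_get(sv);
    return sv->value;
}
```

Leaving a slot as zero, as C<vtbl_sv> does for its last two entries, simply means that no handler runs for that operation.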
| 1014 | |
| 1015 | The last three slots are a recent addition, and for source code |
| 1016 | compatibility they are only checked for if one of the three flags |
| 1017 | MGf_COPY, MGf_DUP or MGf_LOCAL is set in mg_flags. This means that most |
| 1018 | code can continue declaring a vtable as a 5-element value. These three are |
| 1019 | currently used exclusively by the threading code, and are highly subject |
| 1020 | to change. |
| 1021 | |
| 1022 | The current kinds of Magic Virtual Tables are: |
| 1023 | |
| 1024 | mg_type |
| 1025 | (old-style char and macro) MGVTBL Type of magic |
| 1026 | -------------------------- ------ ---------------------------- |
| 1027 | \0 PERL_MAGIC_sv vtbl_sv Special scalar variable |
| 1028 | A PERL_MAGIC_overload vtbl_amagic %OVERLOAD hash |
| 1029 | a PERL_MAGIC_overload_elem vtbl_amagicelem %OVERLOAD hash element |
| 1030 | c PERL_MAGIC_overload_table (none) Holds overload table (AMT) |
| 1031 | on stash |
| 1032 | B PERL_MAGIC_bm vtbl_bm Boyer-Moore (fast string search) |
| 1033 | D PERL_MAGIC_regdata vtbl_regdata Regex match position data |
| 1034 | (@+ and @- vars) |
| 1035 | d PERL_MAGIC_regdatum vtbl_regdatum Regex match position data |
| 1036 | element |
| 1037 | E PERL_MAGIC_env vtbl_env %ENV hash |
| 1038 | e PERL_MAGIC_envelem vtbl_envelem %ENV hash element |
| 1039 | f PERL_MAGIC_fm vtbl_fm Formline ('compiled' format) |
| 1040 | g PERL_MAGIC_regex_global vtbl_mglob m//g target / study()ed string |
| 1041 | I PERL_MAGIC_isa vtbl_isa @ISA array |
| 1042 | i PERL_MAGIC_isaelem vtbl_isaelem @ISA array element |
| 1043 | k PERL_MAGIC_nkeys vtbl_nkeys scalar(keys()) lvalue |
| 1044 | L PERL_MAGIC_dbfile (none) Debugger %_<filename |
| 1045 | l PERL_MAGIC_dbline vtbl_dbline Debugger %_<filename element |
| 1046 | m PERL_MAGIC_mutex vtbl_mutex ??? |
| 1047 | o PERL_MAGIC_collxfrm vtbl_collxfrm Locale collate transformation |
| 1048 | P PERL_MAGIC_tied vtbl_pack Tied array or hash |
| 1049 | p PERL_MAGIC_tiedelem vtbl_packelem Tied array or hash element |
| 1050 | q PERL_MAGIC_tiedscalar vtbl_packelem Tied scalar or handle |
| 1051 | r PERL_MAGIC_qr vtbl_qr precompiled qr// regex |
| 1052 | S PERL_MAGIC_sig vtbl_sig %SIG hash |
| 1053 | s PERL_MAGIC_sigelem vtbl_sigelem %SIG hash element |
| 1054 | t PERL_MAGIC_taint vtbl_taint Taintedness |
| 1055 | U PERL_MAGIC_uvar vtbl_uvar Available for use by extensions |
| 1056 | v PERL_MAGIC_vec vtbl_vec vec() lvalue |
| 1057 | V PERL_MAGIC_vstring (none) v-string scalars |
| 1058 | w PERL_MAGIC_utf8 vtbl_utf8 UTF-8 length+offset cache |
| 1059 | x PERL_MAGIC_substr vtbl_substr substr() lvalue |
| 1060 | y PERL_MAGIC_defelem vtbl_defelem Shadow "foreach" iterator |
| 1061 | variable / smart parameter |
| 1062 | vivification |
| 1063 | * PERL_MAGIC_glob vtbl_glob GV (typeglob) |
| 1064 | # PERL_MAGIC_arylen vtbl_arylen Array length ($#ary) |
| 1065 | . PERL_MAGIC_pos vtbl_pos pos() lvalue |
| 1066 | < PERL_MAGIC_backref vtbl_backref back pointer to a weak ref |
| 1067 | ~ PERL_MAGIC_ext (none) Available for use by extensions |
| 1068 | : PERL_MAGIC_symtab (none) hash used as symbol table |
| 1069 | % PERL_MAGIC_rhash (none) hash used as restricted hash |
| 1070 | @ PERL_MAGIC_arylen_p vtbl_arylen_p pointer to $#a from @a |
| 1071 | |
| 1072 | |
| 1073 | When an uppercase and lowercase letter both exist in the table, then the |
| 1074 | uppercase letter is typically used to represent some kind of composite type |
| 1075 | (a list or a hash), and the lowercase letter is used to represent an element |
| 1076 | of that composite type. Some internals code makes use of this case |
| 1077 | relationship. However, 'v' and 'V' (vec and v-string) are in no way related. |
| 1078 | |
| 1079 | The C<PERL_MAGIC_ext> and C<PERL_MAGIC_uvar> magic types are defined |
| 1080 | specifically for use by extensions and will not be used by perl itself. |
| 1081 | Extensions can use C<PERL_MAGIC_ext> magic to 'attach' private information |
| 1082 | to variables (typically objects). This is especially useful because |
| 1083 | there is no way for normal perl code to corrupt this private information |
| 1084 | (unlike using extra elements of a hash object). |
| 1085 | |
| 1086 | Similarly, C<PERL_MAGIC_uvar> magic can be used much like tie() to call a |
| 1087 | C function any time a scalar's value is used or changed. The C<MAGIC>'s |
| 1088 | C<mg_ptr> field points to a C<ufuncs> structure: |
| 1089 | |
| 1090 | struct ufuncs { |
| 1091 | I32 (*uf_val)(pTHX_ IV, SV*); |
| 1092 | I32 (*uf_set)(pTHX_ IV, SV*); |
| 1093 | IV uf_index; |
| 1094 | }; |
| 1095 | |
| 1096 | When the SV is read from or written to, the C<uf_val> or C<uf_set> |
| 1097 | function will be called with C<uf_index> as the first arg and a pointer to |
| 1098 | the SV as the second. A simple example of how to add C<PERL_MAGIC_uvar> |
| 1099 | magic is shown below. Note that the ufuncs structure is copied by |
| 1100 | sv_magic, so you can safely allocate it on the stack. |
| 1101 | |
| 1102 | void |
| 1103 | Umagic(sv) |
| 1104 | SV *sv; |
| 1105 | PREINIT: |
| 1106 | struct ufuncs uf; |
| 1107 | CODE: |
| 1108 | uf.uf_val = &my_get_fn; |
| 1109 | uf.uf_set = &my_set_fn; |
| 1110 | uf.uf_index = 0; |
| 1111 | sv_magic(sv, 0, PERL_MAGIC_uvar, (char*)&uf, sizeof(uf)); |
| 1112 | |
| 1113 | Note that because multiple extensions may be using C<PERL_MAGIC_ext> |
| 1114 | or C<PERL_MAGIC_uvar> magic, it is important for extensions to take |
| 1115 | extra care to avoid conflict. Typically only using the magic on |
| 1116 | objects blessed into the same class as the extension is sufficient. |
| 1117 | For C<PERL_MAGIC_ext> magic, it may also be appropriate to add an I32 |
| 1118 | 'signature' at the top of the private data area and check that. |
| 1119 | |
| 1120 | Also note that the C<sv_set*()> and C<sv_cat*()> functions described |
| 1121 | earlier do B<not> invoke 'set' magic on their targets. This must |
| 1122 | be done by the user either by calling the C<SvSETMAGIC()> macro after |
| 1123 | calling these functions, or by using one of the C<sv_set*_mg()> or |
| 1124 | C<sv_cat*_mg()> functions. Similarly, generic C code must call the |
| 1125 | C<SvGETMAGIC()> macro to invoke any 'get' magic if they use an SV |
| 1126 | obtained from external sources in functions that don't handle magic. |
| 1127 | See L<perlapi> for a description of these functions. |
| 1128 | For example, calls to the C<sv_cat*()> functions typically need to be |
| 1129 | followed by C<SvSETMAGIC()>, but they don't need a prior C<SvGETMAGIC()> |
| 1130 | since their implementation handles 'get' magic. |
| 1131 | |
| 1132 | =head2 Finding Magic |
| 1133 | |
| 1134 | MAGIC* mg_find(SV*, int type); /* Finds the magic pointer of that type */ |
| 1135 | |
| 1136 | This routine returns a pointer to the C<MAGIC> structure stored in the SV. |
| 1137 | If the SV does not have that magical feature, C<NULL> is returned. Also, |
| 1138 | if the SV is not of type SVt_PVMG, Perl may core dump. |
| 1139 | |
| 1140 | int mg_copy(SV* sv, SV* nsv, const char* key, STRLEN klen); |
| 1141 | |
| 1142 | This routine checks to see what types of magic C<sv> has. If the mg_type |
| 1143 | field is an uppercase letter, then the mg_obj is copied to C<nsv>, but |
| 1144 | the mg_type field is changed to be the lowercase letter. |
| 1145 | |
| 1146 | =head2 Understanding the Magic of Tied Hashes and Arrays |
| 1147 | |
| 1148 | Tied hashes and arrays are magical beasts of the C<PERL_MAGIC_tied> |
| 1149 | magic type. |
| 1150 | |
| 1151 | WARNING: As of the 5.004 release, proper usage of the array and hash |
| 1152 | access functions requires understanding a few caveats. Some |
| 1153 | of these caveats are actually considered bugs in the API, to be fixed |
| 1154 | in later releases, and are bracketed with [MAYCHANGE] below. If |
| 1155 | you find yourself actually applying such information in this section, be |
| 1156 | aware that the behavior may change in the future, umm, without warning. |
| 1157 | |
| 1158 | The perl tie function associates a variable with an object that implements |
| 1159 | the various GET, SET, etc methods. To perform the equivalent of the perl |
| 1160 | tie function from an XSUB, you must mimic this behaviour. The code below |
| 1161 | carries out the necessary steps - firstly it creates a new hash, and then |
| 1162 | creates a second hash which it blesses into the class which will implement |
| 1163 | the tie methods. Lastly it ties the two hashes together, and returns a |
| 1164 | reference to the new tied hash. Note that the code below does NOT call the |
| 1165 | TIEHASH method in the MyTie class - |
| 1166 | see L<Calling Perl Routines from within C Programs> for details on how |
| 1167 | to do this. |
| 1168 | |
| 1169 | SV* |
| 1170 | mytie() |
| 1171 | PREINIT: |
| 1172 | HV *hash; |
| 1173 | HV *stash; |
| 1174 | SV *tie; |
| 1175 | CODE: |
| 1176 | hash = newHV(); |
| 1177 | tie = newRV_noinc((SV*)newHV()); |
| 1178 | stash = gv_stashpv("MyTie", TRUE); |
| 1179 | sv_bless(tie, stash); |
| 1180 | hv_magic(hash, (GV*)tie, PERL_MAGIC_tied); |
| 1181 | RETVAL = newRV_noinc((SV*)hash); |
| 1182 | OUTPUT: |
| 1183 | RETVAL |
| 1184 | |
| 1185 | The C<av_store> function, when given a tied array argument, merely |
| 1186 | copies the magic of the array onto the value to be "stored", using |
| 1187 | C<mg_copy>. It may also return NULL, indicating that the value did not |
| 1188 | actually need to be stored in the array. [MAYCHANGE] After a call to |
| 1189 | C<av_store> on a tied array, the caller will usually need to call |
| 1190 | C<mg_set(val)> to actually invoke the perl level "STORE" method on the |
| 1191 | TIEARRAY object. If C<av_store> did return NULL, a call to |
| 1192 | C<SvREFCNT_dec(val)> will usually also be necessary to avoid a memory |
| 1193 | leak. [/MAYCHANGE] |
| 1194 | |
| 1195 | The previous paragraph is applicable verbatim to tied hash access using the |
| 1196 | C<hv_store> and C<hv_store_ent> functions as well. |
| 1197 | |
| 1198 | C<av_fetch> and the corresponding hash functions C<hv_fetch> and |
| 1199 | C<hv_fetch_ent> actually return an undefined mortal value whose magic |
| 1200 | has been initialized using C<mg_copy>. Note the value so returned does not |
| 1201 | need to be deallocated, as it is already mortal. [MAYCHANGE] But you will |
| 1202 | need to call C<mg_get()> on the returned value in order to actually invoke |
| 1203 | the perl level "FETCH" method on the underlying TIE object. Similarly, |
| 1204 | you may also call C<mg_set()> on the return value after possibly assigning |
| 1205 | a suitable value to it using C<sv_setsv>, which will invoke the "STORE" |
| 1206 | method on the TIE object. [/MAYCHANGE] |
| 1207 | |
| 1208 | [MAYCHANGE] |
| 1209 | In other words, the array or hash fetch/store functions don't really |
| 1210 | fetch and store actual values in the case of tied arrays and hashes. They |
| 1211 | merely call C<mg_copy> to attach magic to the values that were meant to be |
| 1212 | "stored" or "fetched". Later calls to C<mg_get> and C<mg_set> actually |
| 1213 | do the job of invoking the TIE methods on the underlying objects. Thus |
| 1214 | the magic mechanism currently implements a kind of lazy access to arrays |
| 1215 | and hashes. |
| 1216 | |
| 1217 | Currently (as of perl version 5.004), use of the hash and array access |
| 1218 | functions requires the user to be aware of whether they are operating on |
| 1219 | "normal" hashes and arrays, or on their tied variants. The API may be |
| 1220 | changed to provide more transparent access to both tied and normal data |
| 1221 | types in future versions. |
| 1222 | [/MAYCHANGE] |
| 1223 | |
| 1224 | You would do well to understand that the TIEARRAY and TIEHASH interfaces |
| 1225 | are mere sugar to invoke some perl method calls while using the uniform hash |
| 1226 | and array syntax. The use of this sugar imposes some overhead (typically |
| 1227 | about two to four extra opcodes per FETCH/STORE operation, in addition to |
| 1228 | the creation of all the mortal variables required to invoke the methods). |
| 1229 | This overhead will be comparatively small if the TIE methods are themselves |
| 1230 | substantial, but if they are only a few statements long, the overhead |
| 1231 | will not be insignificant. |
| 1232 | |
| 1233 | =head2 Localizing changes |
| 1234 | |
| 1235 | Perl has a very handy construction |
| 1236 | |
| 1237 | { |
| 1238 | local $var = 2; |
| 1239 | ... |
| 1240 | } |
| 1241 | |
| 1242 | This construction is I<approximately> equivalent to |
| 1243 | |
| 1244 | { |
| 1245 | my $oldvar = $var; |
| 1246 | $var = 2; |
| 1247 | ... |
| 1248 | $var = $oldvar; |
| 1249 | } |
| 1250 | |
| 1251 | The biggest difference is that the first construction would |
| 1252 | reinstate the initial value of $var, irrespective of how control exits |
| 1253 | the block: C<goto>, C<return>, C<die>/C<eval>, etc. It is a little bit |
| 1254 | more efficient as well. |
| 1255 | |
| 1256 | There is a way to achieve a similar task from C via the Perl API: create a |
| 1257 | I<pseudo-block>, and arrange for some changes to be automatically |
| 1258 | undone at the end of it, either explicitly or via a non-local exit (via |
| 1259 | die()). A I<block>-like construct is created by a pair of |
| 1260 | C<ENTER>/C<LEAVE> macros (see L<perlcall/"Returning a Scalar">). |
| 1261 | Such a construct may be created specially for some important localized |
| 1262 | task, or an existing one (like boundaries of enclosing Perl |
| 1263 | subroutine/block, or an existing pair for freeing TMPs) may be |
| 1264 | used. (In the second case the overhead of additional localization should |
| 1265 | be almost negligible.) Note that any XSUB is automatically enclosed in |
| 1266 | an C<ENTER>/C<LEAVE> pair. |
| 1267 | |
| 1268 | Inside such a I<pseudo-block> the following services are available: |
| 1269 | |
| 1270 | =over 4 |
| 1271 | |
| 1272 | =item C<SAVEINT(int i)> |
| 1273 | |
| 1274 | =item C<SAVEIV(IV i)> |
| 1275 | |
| 1276 | =item C<SAVEI32(I32 i)> |
| 1277 | |
| 1278 | =item C<SAVELONG(long i)> |
| 1279 | |
| 1280 | These macros arrange to restore the value of the integer variable |
| 1281 | C<i> at the end of the enclosing I<pseudo-block>. |
| 1282 | |
| 1283 | =item C<SAVESPTR(s)> |
| 1284 | |
| 1285 | =item C<SAVEPPTR(p)> |
| 1286 | |
| 1287 | These macros arrange things to restore the value of pointers C<s> and |
| 1288 | C<p>. C<s> must be a pointer of a type which survives conversion to |
| 1289 | C<SV*> and back, C<p> should be able to survive conversion to C<char*> |
| 1290 | and back. |
| 1291 | |
| 1292 | =item C<SAVEFREESV(SV *sv)> |
| 1293 | |
| 1294 | The refcount of C<sv> is decremented at the end of the |
| 1295 | I<pseudo-block>. This is similar to C<sv_2mortal> in that it is also a |
| 1296 | mechanism for doing a delayed C<SvREFCNT_dec>. However, while C<sv_2mortal> |
| 1297 | extends the lifetime of C<sv> until the beginning of the next statement, |
| 1298 | C<SAVEFREESV> extends it until the end of the enclosing scope. These |
| 1299 | lifetimes can be wildly different. |
| 1300 | |
| 1301 | Also compare C<SAVEMORTALIZESV>. |
| 1302 | |
| 1303 | =item C<SAVEMORTALIZESV(SV *sv)> |
| 1304 | |
| 1305 | Just like C<SAVEFREESV>, but mortalizes C<sv> at the end of the current |
| 1306 | scope instead of decrementing its reference count. This usually has the |
| 1307 | effect of keeping C<sv> alive until the statement that called the currently |
| 1308 | live scope has finished executing. |
| 1309 | |
| 1310 | =item C<SAVEFREEOP(OP *op)> |
| 1311 | |
| 1312 | The C<OP *> is op_free()ed at the end of I<pseudo-block>. |
| 1313 | |
| 1314 | =item C<SAVEFREEPV(p)> |
| 1315 | |
| 1316 | The chunk of memory which is pointed to by C<p> is Safefree()ed at the |
| 1317 | end of I<pseudo-block>. |
| 1318 | |
| 1319 | =item C<SAVECLEARSV(SV *sv)> |
| 1320 | |
| 1321 | Clears a slot in the current scratchpad which corresponds to C<sv> at |
| 1322 | the end of I<pseudo-block>. |
| 1323 | |
| 1324 | =item C<SAVEDELETE(HV *hv, char *key, I32 length)> |
| 1325 | |
| 1326 | The key C<key> of C<hv> is deleted at the end of I<pseudo-block>. The |
| 1327 | string pointed to by C<key> is Safefree()ed. If one has a I<key> in |
| 1328 | short-lived storage, the corresponding string may be copied to the heap |
| 1329 | like this: |
| 1330 | |
| 1331 | SAVEDELETE(PL_defstash, savepv(tmpbuf), strlen(tmpbuf)); |
| 1332 | |
| 1333 | =item C<SAVEDESTRUCTOR(DESTRUCTORFUNC_NOCONTEXT_t f, void *p)> |
| 1334 | |
| 1335 | At the end of I<pseudo-block> the function C<f> is called with the |
| 1336 | only argument C<p>. |
| 1337 | |
| 1338 | =item C<SAVEDESTRUCTOR_X(DESTRUCTORFUNC_t f, void *p)> |
| 1339 | |
| 1340 | At the end of I<pseudo-block> the function C<f> is called with the |
| 1341 | implicit context argument (if any), and C<p>. |
| 1342 | |
| 1343 | =item C<SAVESTACK_POS()> |
| 1344 | |
| 1345 | The current offset on the Perl internal stack (cf. C<SP>) is restored |
| 1346 | at the end of I<pseudo-block>. |
| 1347 | |
| 1348 | =back |
| 1349 | |
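The restore-at-scope-exit behaviour described above can be pictured with a small stand-alone model. This is only a sketch: every C<toy_*> name is hypothetical, there are no overflow checks, and Perl's real save stack handles many more kinds of entry than the two shown here (an integer restore in the spirit of C<SAVEINT> and a callback in the spirit of C<SAVEDESTRUCTOR>).

```c
#include <stddef.h>

/* Toy model only; all toy_* names are invented for this sketch. */
enum toy_save_kind { TOY_SAVE_INT, TOY_SAVE_DESTRUCTOR };

struct toy_save_entry {
    enum toy_save_kind kind;
    int  *int_ptr;            /* TOY_SAVE_INT: variable to restore   */
    int   int_old;            /* TOY_SAVE_INT: value to put back     */
    void (*func)(void *);     /* TOY_SAVE_DESTRUCTOR: callback       */
    void *arg;                /* TOY_SAVE_DESTRUCTOR: its argument   */
};

#define TOY_MAX_SAVES  64     /* fixed sizes, no overflow checks (toy) */
#define TOY_MAX_SCOPES 16

static struct toy_save_entry toy_savestack[TOY_MAX_SAVES];
static size_t toy_save_ix;                    /* next free save slot */
static size_t toy_scopestack[TOY_MAX_SCOPES]; /* save height at each enter */
static size_t toy_scope_ix;

/* In the spirit of ENTER: remember the current save-stack height. */
void toy_enter(void)
{
    toy_scopestack[toy_scope_ix++] = toy_save_ix;
}

/* In the spirit of SAVEINT(i): record the value so it can be restored. */
void toy_save_int(int *p)
{
    struct toy_save_entry *e = &toy_savestack[toy_save_ix++];
    e->kind = TOY_SAVE_INT;
    e->int_ptr = p;
    e->int_old = *p;
}

/* In the spirit of SAVEDESTRUCTOR(f, p): call f(p) at scope exit. */
void toy_save_destructor(void (*f)(void *), void *p)
{
    struct toy_save_entry *e = &toy_savestack[toy_save_ix++];
    e->kind = TOY_SAVE_DESTRUCTOR;
    e->func = f;
    e->arg = p;
}

/* In the spirit of LEAVE: undo everything saved since the matching
 * toy_enter(), most recent entry first. */
void toy_leave(void)
{
    size_t floor = toy_scopestack[--toy_scope_ix];
    while (toy_save_ix > floor) {
        struct toy_save_entry *e = &toy_savestack[--toy_save_ix];
        if (e->kind == TOY_SAVE_INT)
            *e->int_ptr = e->int_old;
        else
            e->func(e->arg);
    }
}

/* Example callback: increments the int that its argument points to. */
void toy_count_call(void *p)
{
    ++*(int *)p;
}
```

Entries are undone in reverse order of saving, so a nested scope never disturbs what an outer scope recorded.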
| 1350 | The following API list contains functions, thus one needs to |
| 1351 | provide pointers to the modifiable data explicitly (either C pointers, |
| 1352 | or Perlish C<GV *>s). Where the above macros take C<int>, a similar |
| 1353 | function takes C<int *>. |
| 1354 | |
| 1355 | =over 4 |
| 1356 | |
| 1357 | =item C<SV* save_scalar(GV *gv)> |
| 1358 | |
| 1359 | Equivalent to Perl code C<local $gv>. |
| 1360 | |
| 1361 | =item C<AV* save_ary(GV *gv)> |
| 1362 | |
| 1363 | =item C<HV* save_hash(GV *gv)> |
| 1364 | |
| 1365 | Similar to C<save_scalar>, but localize C<@gv> and C<%gv>. |
| 1366 | |
| 1367 | =item C<void save_item(SV *item)> |
| 1368 | |
| 1369 | Duplicates the current value of the C<SV>. On exit from the current |
| 1370 | C<ENTER>/C<LEAVE> I<pseudo-block>, the value of the C<SV> is restored |
| 1371 | from the stored copy. It doesn't handle magic. Use C<save_scalar> if |
| 1372 | magic is affected. |
| 1373 | |
| 1374 | =item C<void save_list(SV **sarg, I32 maxsarg)> |
| 1375 | |
| 1376 | A variant of C<save_item> which takes multiple arguments via an array |
| 1377 | C<sarg> of C<SV*> of length C<maxsarg>. |
| 1378 | |
| 1379 | =item C<SV* save_svref(SV **sptr)> |
| 1380 | |
| 1381 | Similar to C<save_scalar>, but will reinstate an C<SV *>. |
| 1382 | |
| 1383 | =item C<void save_aptr(AV **aptr)> |
| 1384 | |
| 1385 | =item C<void save_hptr(HV **hptr)> |
| 1386 | |
| 1387 | Similar to C<save_svref>, but localize C<AV *> and C<HV *>. |
| 1388 | |
| 1389 | =back |
| 1390 | |
| 1391 | The C<Alias> module implements localization of the basic types within the |
| 1392 | I<caller's scope>. People who are interested in how to localize things in |
| 1393 | the containing scope should take a look there too. |
| 1394 | |
| 1395 | =head1 Subroutines |
| 1396 | |
| 1397 | =head2 XSUBs and the Argument Stack |
| 1398 | |
| 1399 | The XSUB mechanism is a simple way for Perl programs to access C subroutines. |
| 1400 | An XSUB routine will have a stack that contains the arguments from the Perl |
| 1401 | program, and a way to map from the Perl data structures to a C equivalent. |
| 1402 | |
| 1403 | The stack arguments are accessible through the C<ST(n)> macro, which returns |
| 1404 | the C<n>'th stack argument. Argument 0 is the first argument passed in the |
| 1405 | Perl subroutine call. These arguments are C<SV*>, and can be used anywhere |
| 1406 | an C<SV*> is used. |
| 1407 | |
| 1408 | Most of the time, output from the C routine can be handled through use of |
| 1409 | the RETVAL and OUTPUT directives. However, there are some cases where the |
| 1410 | argument stack is not already long enough to handle all the return values. |
| 1411 | An example is the POSIX tzname() call, which takes no arguments, but returns |
| 1412 | two, the local time zone's standard and summer time abbreviations. |
| 1413 | |
| 1414 | To handle this situation, the PPCODE directive is used and the stack is |
| 1415 | extended using the macro: |
| 1416 | |
| 1417 | EXTEND(SP, num); |
| 1418 | |
| 1419 | where C<SP> is the macro that represents the local copy of the stack pointer, |
| 1420 | and C<num> is the number of elements the stack should be extended by. |
| 1421 | |
| 1422 | Now that there is room on the stack, values can be pushed on it using the |
| 1423 | C<PUSHs> macro. The pushed values will often need to be "mortal" (see |
| 1424 | L</Reference Counts and Mortality>): |
| 1425 | |
| 1426 | PUSHs(sv_2mortal(newSViv(an_integer))); |
| 1427 | PUSHs(sv_2mortal(newSVuv(an_unsigned_integer))); |
| 1428 | PUSHs(sv_2mortal(newSVnv(a_double))); |
| 1429 | PUSHs(sv_2mortal(newSVpv("Some String",0))); |
| 1430 | |
| 1431 | Now, when the Perl program calls C<tzname>, the two values will be |
| 1432 | assigned as in: |
| 1433 | |
| 1434 | ($standard_abbrev, $summer_abbrev) = POSIX::tzname; |
| 1435 | |
| 1436 | An alternative (and possibly simpler) way of pushing values on the stack is |
| 1437 | to use the macro: |
| 1438 | |
| 1439 | XPUSHs(SV*) |
| 1440 | |
| 1441 | This macro automatically adjusts the stack for you, if needed. Thus, you |
| 1442 | do not need to call C<EXTEND> to extend the stack. |
| 1443 | |
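The difference between "extend once, then push" and "check on every push" can be sketched with a stand-alone growable stack. This is a simplified model under stated assumptions: the C<toy_*> names are hypothetical, the elements are plain C strings rather than C<SV*>, and nothing here is Perl's actual stack code.

```c
#include <stdlib.h>

/* Toy argument stack; all names are invented for this sketch. */
struct toy_stack {
    const char **base;   /* bottom of the stack               */
    size_t used;         /* number of elements pushed so far  */
    size_t room;         /* number of elements allocated      */
};

/* In the spirit of EXTEND(SP, n): make sure n more pushes will fit.
 * Returns 1 on success, 0 if the allocation failed. */
int toy_extend(struct toy_stack *st, size_t n)
{
    if (st->room - st->used < n) {
        size_t want = st->used + n;
        const char **grown =
            (const char **)realloc((void *)st->base, want * sizeof *grown);
        if (!grown)
            return 0;
        st->base = grown;
        st->room = want;
    }
    return 1;
}

/* In the spirit of PUSHs: no capacity check, the caller must have
 * extended the stack already. */
void toy_pushs(struct toy_stack *st, const char *value)
{
    st->base[st->used++] = value;
}

/* In the spirit of XPUSHs: extend by one if needed, then push. */
int toy_xpushs(struct toy_stack *st, const char *value)
{
    if (!toy_extend(st, 1))
        return 0;
    toy_pushs(st, value);
    return 1;
}

/* A tzname-like routine: takes no arguments and returns two values,
 * so it extends by two and then pushes twice. Returns the number of
 * values pushed, or 0 on allocation failure. */
int toy_tzname(struct toy_stack *st, const char *standard,
               const char *summer)
{
    if (!toy_extend(st, 2))
        return 0;
    toy_pushs(st, standard);
    toy_pushs(st, summer);
    return 2;
}
```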
| 1444 | Despite what earlier versions of this document suggested, the macros |
| 1445 | C<(X)PUSH[iunp]> are I<not> suited to XSUBs which return multiple results. |
| 1446 | For that, either stick to the C<(X)PUSHs> macros shown above, or use the new |
| 1447 | C<m(X)PUSH[iunp]> macros instead; see L</Putting a C value on Perl stack>. |
| 1448 | |
| 1449 | For more information, consult L<perlxs> and L<perlxstut>. |
| 1450 | |
| 1451 | =head2 Calling Perl Routines from within C Programs |
| 1452 | |
| 1453 | There are four routines that can be used to call a Perl subroutine from |
| 1454 | within a C program. These four are: |
| 1455 | |
| 1456 | I32 call_sv(SV*, I32); |
| 1457 | I32 call_pv(const char*, I32); |
| 1458 | I32 call_method(const char*, I32); |
| 1459 | I32 call_argv(const char*, I32, register char**); |
| 1460 | |
| 1461 | The routine most often used is C<call_sv>. The C<SV*> argument |
| 1462 | contains either the name of the Perl subroutine to be called, or a |
| 1463 | reference to the subroutine. The second argument consists of flags |
| 1464 | that control the context in which the subroutine is called, whether |
| 1465 | or not the subroutine is being passed arguments, how errors should be |
| 1466 | trapped, and how to treat return values. |
| 1467 | |
| 1468 | All four routines return the number of values that the subroutine returned |
| 1469 | on the Perl stack. |
| 1470 | |
| 1471 | These routines used to be called C<perl_call_sv>, etc., before Perl v5.6.0, |
| 1472 | but those names are now deprecated; macros of the same name are provided for |
| 1473 | compatibility. |
| 1474 | |
| 1475 | When using any of these routines (except C<call_argv>), the programmer |
| 1476 | must manipulate the Perl stack. The macros and functions used for this |
| 1477 | include: |
| 1478 | |
| 1479 | dSP |
| 1480 | SP |
| 1481 | PUSHMARK() |
| 1482 | PUTBACK |
| 1483 | SPAGAIN |
| 1484 | ENTER |
| 1485 | SAVETMPS |
| 1486 | FREETMPS |
| 1487 | LEAVE |
| 1488 | XPUSH*() |
| 1489 | POP*() |
| 1490 | |
| 1491 | For a detailed description of calling conventions from C to Perl, |
| 1492 | consult L<perlcall>. |
| 1493 | |
| 1494 | =head2 Memory Allocation |
| 1495 | |
| 1496 | =head3 Allocation |
| 1497 | |
| 1498 | All memory meant to be used with the Perl API functions should be manipulated |
| 1499 | using the macros described in this section. The macros provide the necessary |
| 1500 | transparency between differences in the actual malloc implementation that is |
| 1501 | used within perl. |
| 1502 | |
| 1503 | It is suggested that you enable the version of malloc that is distributed |
| 1504 | with Perl. It keeps pools of various sizes of unallocated memory in |
| 1505 | order to satisfy allocation requests more quickly. However, on some |
| 1506 | platforms, it may cause spurious malloc or free errors. |
| 1507 | |
| 1508 | The following three macros are used to initially allocate memory: |
| 1509 | |
| 1510 | Newx(pointer, number, type); |
| 1511 | Newxc(pointer, number, type, cast); |
| 1512 | Newxz(pointer, number, type); |
| 1513 | |
| 1514 | The first argument C<pointer> should be the name of a variable that will |
| 1515 | point to the newly allocated memory. |
| 1516 | |
| 1517 | The second and third arguments C<number> and C<type> specify how many of |
| 1518 | the specified type of data structure should be allocated. The argument |
| 1519 | C<type> is passed to C<sizeof>. The final argument to C<Newxc>, C<cast>, |
| 1520 | should be used if the C<pointer> argument is different from the C<type> |
| 1521 | argument. |
| 1522 | |
| 1523 | Unlike the C<Newx> and C<Newxc> macros, the C<Newxz> macro calls C<memzero> |
| 1524 | to zero out all the newly allocated memory. |
| 1525 | |
| 1526 | =head3 Reallocation |
| 1527 | |
| 1528 | Renew(pointer, number, type); |
| 1529 | Renewc(pointer, number, type, cast); |
| 1530 | Safefree(pointer); |
| 1531 | |
| 1532 | These three macros are used to change a memory buffer size or to free a |
| 1533 | piece of memory no longer needed. The arguments to C<Renew> and C<Renewc> |
| 1534 | match those of C<Newx> and C<Newxc>. |
| 1536 | |
| 1537 | =head3 Moving |
| 1538 | |
| 1539 | Move(source, dest, number, type); |
| 1540 | Copy(source, dest, number, type); |
| 1541 | Zero(dest, number, type); |
| 1542 | |
| 1543 | These three macros are used to move, copy, or zero out previously allocated |
| 1544 | memory. The C<source> and C<dest> arguments point to the source and |
| 1545 | destination starting points. Perl will move, copy, or zero out C<number> |
| 1546 | instances of the size of the C<type> data structure (using the C<sizeof> |
| 1547 | operator). |
| 1548 | |
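To make the argument order concrete, here is a stand-alone sketch that mimics the shape of these macros with the C library. The C<TOY_*> macros are hypothetical stand-ins built on C<malloc>, C<realloc>, C<memmove>, C<memcpy> and C<memset>; they are not Perl's definitions and do none of Perl's own checking. In real extension code you would use C<Newx>, C<Renew>, C<Move> and friends directly.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins with the same argument order as the Perl
 * macros described above. NOT the real definitions. */
#define TOY_NEWX(p, n, t)     ((p) = (t *)malloc((n) * sizeof(t)))
#define TOY_NEWXZ(p, n, t)    ((p) = (t *)calloc((n), sizeof(t)))
#define TOY_RENEW(p, n, t)    ((p) = (t *)realloc((p), (n) * sizeof(t)))
#define TOY_SAFEFREE(p)       free(p)
#define TOY_MOVE(s, d, n, t)  memmove((d), (s), (n) * sizeof(t)) /* overlap ok */
#define TOY_COPY(s, d, n, t)  memcpy((d), (s), (n) * sizeof(t))  /* no overlap */
#define TOY_ZERO(d, n, t)     memset((d), 0, (n) * sizeof(t))

/* Build {1,2,3,4}, grow the buffer to six slots, shift the contents
 * right by two (source and destination overlap, hence "move" rather
 * than "copy"), zero the gap, and copy the result into out[].
 * Returns 1 on success, 0 if an allocation failed. */
int toy_buffer_demo(int out[6])
{
    int *buf;
    int i;

    TOY_NEWX(buf, 4, int);
    if (!buf)
        return 0;
    for (i = 0; i < 4; i++)
        buf[i] = i + 1;

    TOY_RENEW(buf, 6, int);
    if (!buf)
        return 0;          /* toy: the old block leaks on failure */

    TOY_MOVE(buf, buf + 2, 4, int);   /* source first, then destination */
    TOY_ZERO(buf, 2, int);
    TOY_COPY(buf, out, 6, int);

    TOY_SAFEFREE(buf);
    return 1;
}
```

Note that the source comes before the destination, the reverse of the C<memmove> and C<memcpy> convention.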
| 1549 | =head2 PerlIO |
| 1550 | |
| 1551 | The most recent development releases of Perl have been experimenting with |
| 1552 | removing Perl's dependency on the "normal" standard I/O suite and allowing |
| 1553 | other stdio implementations to be used. This involves creating a new |
| 1554 | abstraction layer that then calls whichever implementation of stdio Perl |
| 1555 | was compiled with. All XSUBs should now use the functions in the PerlIO |
| 1556 | abstraction layer and not make any assumptions about what kind of stdio |
| 1557 | is being used. |
| 1558 | |
| 1559 | For a complete description of the PerlIO abstraction, consult L<perlapio>. |
| 1560 | |
| 1561 | =head2 Putting a C value on Perl stack |
| 1562 | |
| 1563 | A lot of opcodes (an opcode is an elementary operation in the internal |
| 1564 | perl stack machine) put an SV* on the stack. However, as an optimization |
| 1565 | the corresponding SV is (usually) not recreated each time. The opcodes |
| 1566 | reuse specially assigned SVs (I<target>s) which are (as a corollary) |
| 1567 | not constantly freed/created. |
| 1568 | |
| 1569 | Each of the targets is created only once (but see |
| 1570 | L</Scratchpads and recursion> below), and when an opcode needs to put |
| 1571 | an integer, a double, or a string on the stack, it just sets the |
| 1572 | corresponding parts of its I<target> and puts the I<target> on the stack. |
| 1573 | |
| 1574 | The macro to put this target on the stack is C<PUSHTARG>, and it is |
| 1575 | directly used in some opcodes, as well as indirectly in zillions of |
| 1576 | others, which use it via C<(X)PUSH[iunp]>. |
| 1577 | |
| 1578 | Because the target is reused, you must be careful when pushing multiple |
| 1579 | values on the stack. The following code will not do what you think: |
| 1580 | |
| 1581 | XPUSHi(10); |
| 1582 | XPUSHi(20); |
| 1583 | |
| 1584 | This translates as "set C<TARG> to 10, push a pointer to C<TARG> onto |
| 1585 | the stack; set C<TARG> to 20, push a pointer to C<TARG> onto the stack". |
| 1586 | At the end of the operation, the stack does not contain the values 10 |
| 1587 | and 20, but actually contains two pointers to C<TARG>, which we have set |
| 1588 | to 20. |
| 1589 | |
| 1590 | If you need to push multiple different values then you should either use |
| 1591 | the C<(X)PUSHs> macros, or else use the new C<m(X)PUSH[iunp]> macros, |
| 1592 | none of which make use of C<TARG>. The C<(X)PUSHs> macros simply push an |
| 1593 | SV* on the stack, which, as noted under L</XSUBs and the Argument Stack>, |
| 1594 | will often need to be "mortal". The new C<m(X)PUSH[iunp]> macros make |
| 1595 | this a little easier to achieve by creating a new mortal for you (via |
| 1596 | C<(X)PUSHmortal>), pushing that onto the stack (extending it if necessary |
| 1597 | in the case of the C<mXPUSH[iunp]> macros), and then setting its value. |
| 1598 | Thus, instead of writing this to "fix" the example above: |
| 1599 | |
| 1600 | XPUSHs(sv_2mortal(newSViv(10))); |
| 1601 | XPUSHs(sv_2mortal(newSViv(20))); |
| 1602 | |
| 1603 | you can simply write: |
| 1604 | |
| 1605 | mXPUSHi(10); |
| 1606 | mXPUSHi(20); |
| 1607 | |
| 1608 | On a related note, if you do use C<(X)PUSH[iunp]>, then you're going to |
| 1609 | need a C<dTARG> in your variable declarations so that the C<*PUSH*> |
| 1610 | macros can make use of the local variable C<TARG>. See also C<dTARGET> |
| 1611 | and C<dXSTARG>. |
| 1612 | |
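The aliasing problem with a shared target can be reproduced in a few lines of plain C. This is a toy model, not Perl code: C<struct toy_sv>, C<toy_push_targ> and C<toy_push_fresh> are hypothetical names, and C<malloc> with a later C<free> by the caller stands in for creating a new mortal SV.

```c
#include <stdlib.h>

/* Toy scalar and toy stack; all names are invented for this sketch. */
struct toy_sv { int iv; };

#define TOY_STACK_MAX 8   /* fixed size, no overflow check (toy) */

struct toy_stack {
    struct toy_sv *slot[TOY_STACK_MAX];
    int top;
};

/* In the spirit of XPUSHi: store the value in the shared target, then
 * push a pointer to that same target. */
void toy_push_targ(struct toy_stack *st, struct toy_sv *targ, int value)
{
    targ->iv = value;
    st->slot[st->top++] = targ;
}

/* In the spirit of mXPUSHi: create a fresh scalar for each value and
 * push that instead. The caller frees the scalars later, standing in
 * for mortal cleanup. Returns 1 on success, 0 on allocation failure. */
int toy_push_fresh(struct toy_stack *st, int value)
{
    struct toy_sv *sv = (struct toy_sv *)malloc(sizeof *sv);
    if (!sv)
        return 0;
    sv->iv = value;
    st->slot[st->top++] = sv;
    return 1;
}
```

After two calls to C<toy_push_targ> with 10 and 20, both stack slots hold the same pointer and both read as 20. After two calls to C<toy_push_fresh>, the slots hold distinct scalars containing 10 and 20.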
| 1613 | =head2 Scratchpads |
| 1614 | |
| 1615 | The question remains of when the SVs which are I<target>s for opcodes |
| 1616 | are created. The answer is that they are created when the current unit -- |
| 1617 | a subroutine or a file (for opcodes for statements outside of |
| 1618 | subroutines) -- is compiled. During this time a special anonymous Perl |
| 1619 | array is created, which is called a scratchpad for the current |
| 1620 | unit. |
| 1621 | |
| 1622 | A scratchpad keeps SVs which are lexicals for the current unit and are |
| 1623 | targets for opcodes. One can deduce that an SV lives on a scratchpad |
| 1624 | by looking at its flags: lexicals have C<SVs_PADMY> set, and |
| 1625 | I<target>s have C<SVs_PADTMP> set. |
| 1626 | |
| 1627 | The correspondence between OPs and I<target>s is not 1-to-1. Different |
| 1628 | OPs in the compile tree of the unit can use the same target, if this |
| 1629 | would not conflict with the expected life of the temporary. |
| 1630 | |
| 1631 | =head2 Scratchpads and recursion |
| 1632 | |
| 1633 | It is not strictly true that a compiled unit contains a pointer to |
| 1634 | the scratchpad AV. In fact it contains a pointer to an AV of |
| 1635 | (initially) one element, and this element is the scratchpad AV. Why do |
| 1636 | we need an extra level of indirection? |
| 1637 | |
| 1638 | The answer is B<recursion>, and maybe B<threads>. Both of |
| 1639 | these can create several execution pointers going into the same |
| 1640 | subroutine. For the subroutine-child not to write over the temporaries |
| 1641 | for the subroutine-parent (lifespan of which covers the call to the |
| 1642 | child), the parent and the child should have different |
| 1643 | scratchpads. (I<And> the lexicals should be separate anyway!) |
| 1644 | |
| 1645 | So each subroutine is born with an array of scratchpads (of length 1). |
| 1646 | On each entry to the subroutine it is checked that the current |
| 1647 | depth of the recursion is not more than the length of this array, and |
| 1648 | if it is, a new scratchpad is created and pushed into the array. |
| 1649 | |
| 1650 | The I<target>s on this scratchpad are C<undef>s, but they are already |
| 1651 | marked with correct flags. |
| 1652 | |
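The one-pad-per-recursion-depth arrangement can be sketched as follows. This is a toy model: all C<toy_*> names are hypothetical, a pad is just a small array of C<int>s, and the growth policy is the simplest one that matches the description above, not a copy of what Perl does.

```c
#include <stdlib.h>

#define TOY_PAD_SLOTS 4

/* Toy scratchpad: a few temporaries. Names invented for this sketch. */
struct toy_pad { int slot[TOY_PAD_SLOTS]; };

/* Toy "array of scratchpads": pads[d - 1] serves recursion depth d. */
struct toy_padlist {
    struct toy_pad **pads;
    size_t len;
};

/* Return the pad for a 1-based depth, creating pads while the depth
 * exceeds the length of the array. New pads start zeroed. Returns
 * NULL for depth 0 or on allocation failure. */
struct toy_pad *toy_pad_for_depth(struct toy_padlist *pl, size_t depth)
{
    if (depth == 0)
        return NULL;
    while (pl->len < depth) {
        struct toy_pad **grown = (struct toy_pad **)
            realloc(pl->pads, (pl->len + 1) * sizeof *grown);
        if (!grown)
            return NULL;
        pl->pads = grown;
        pl->pads[pl->len] = (struct toy_pad *)calloc(1, sizeof(struct toy_pad));
        if (!pl->pads[pl->len])
            return NULL;
        pl->len++;
    }
    return pl->pads[depth - 1];
}

/* A subroutine is "born" with an array holding one scratchpad. */
int toy_padlist_init(struct toy_padlist *pl)
{
    pl->pads = NULL;
    pl->len = 0;
    return toy_pad_for_depth(pl, 1) != NULL;
}

void toy_padlist_free(struct toy_padlist *pl)
{
    size_t i;
    for (i = 0; i < pl->len; i++)
        free(pl->pads[i]);
    free(pl->pads);
    pl->pads = NULL;
    pl->len = 0;
}

/* Recursive "subroutine" computing n + (n-1) + ... + 0. Its temporary
 * lives in the pad for its own depth, so the child call cannot
 * overwrite it. Returns -1 on allocation failure. */
int toy_sum_to(struct toy_padlist *pl, size_t depth, int n)
{
    struct toy_pad *pad = toy_pad_for_depth(pl, depth);
    int rest;

    if (!pad)
        return -1;
    pad->slot[0] = n;      /* must survive the recursive call below */
    if (n == 0)
        return 0;
    rest = toy_sum_to(pl, depth + 1, n - 1);
    if (rest < 0)
        return -1;
    return pad->slot[0] + rest;
}
```

If every depth shared a single pad, the child's store to C<slot[0]> would clobber the parent's value before the parent read it back.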
| 1653 | =head1 Compiled code |
| 1654 | |
| 1655 | =head2 Code tree |
| 1656 | |
| 1657 | Here we describe the internal form your code is converted to by |
| 1658 | Perl. Start with a simple example: |
| 1659 | |
| 1660 | $a = $b + $c; |
| 1661 | |
| 1662 | This is converted to a tree similar to this one: |
| 1663 | |
| 1664 |         assign-to |
| 1665 |       /           \ |
| 1666 |      +             $a |
| 1667 |    /   \ |
| 1668 |  $b     $c |
| 1669 | |
| 1670 | (but slightly more complicated). This tree reflects the way Perl |
| 1671 | parsed your code, but has nothing to do with the execution order. |
| 1672 | There is an additional "thread" going through the nodes of the tree |
| 1673 | which shows the order of execution of the nodes. In our simplified |
| 1674 | example above it looks like: |
| 1675 | |
| 1676 | $b ---> $c ---> + ---> $a ---> assign-to |
| 1677 | |
| 1678 | But with the actual compile tree for C<$a = $b + $c> it is different: |
| 1679 | some nodes are I<optimized away>. As a corollary, though the actual tree |
| 1680 | contains more nodes than our simplified example, the execution order |
| 1681 | is the same as in our example. |
| 1682 | |
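The distinction between the parse tree and the execution-order "thread" can be illustrated with a stand-alone model of the simplified tree above. Everything here is hypothetical (C<struct toy_op>, its field names, and the C<toy_*> functions); the threading is a plain post-order walk, which is enough to reproduce the order shown for this example but is not a description of how Perl itself threads ops.

```c
#include <stdio.h>
#include <string.h>

/* Toy op node; names and layout are invented for this sketch. */
struct toy_op {
    const char *name;
    struct toy_op *first;     /* first child in the tree            */
    struct toy_op *sibling;   /* next child of the same parent      */
    struct toy_op *next;      /* next op in execution order         */
};

/* Thread the subtree rooted at op in post-order (children before
 * their parent). prev is the op executed just before this subtree,
 * or NULL. *start receives the very first op to execute. Returns the
 * last op executed in the subtree, which is op itself. */
struct toy_op *toy_link(struct toy_op *op, struct toy_op *prev,
                        struct toy_op **start)
{
    struct toy_op *kid;
    for (kid = op->first; kid; kid = kid->sibling)
        prev = toy_link(kid, prev, start);
    if (prev)
        prev->next = op;
    else
        *start = op;
    op->next = NULL;
    return op;
}

/* Build the simplified tree for "$a = $b + $c" in caller-provided
 * storage, thread it, and return the first op to execute. */
struct toy_op *toy_build_example(struct toy_op n[5])
{
    struct toy_op *start = NULL;

    memset(n, 0, 5 * sizeof n[0]);
    n[0].name = "assign-to";
    n[1].name = "+";
    n[2].name = "$a";
    n[3].name = "$b";
    n[4].name = "$c";
    n[0].first = &n[1];  n[1].sibling = &n[2];   /* assign-to: +, $a */
    n[1].first = &n[3];  n[3].sibling = &n[4];   /* +: $b, $c        */

    toy_link(&n[0], NULL, &start);
    return start;
}

/* Follow the thread and write the op names, separated by spaces,
 * into buf. Output is truncated if buf is too small. */
void toy_exec_order(const struct toy_op *start, char *buf, size_t size)
{
    const struct toy_op *op;
    size_t used = 0;

    if (size == 0)
        return;
    buf[0] = '\0';
    for (op = start; op; op = op->next) {
        int n = snprintf(buf + used, size - used, "%s%s",
                         used ? " " : "", op->name);
        if (n < 0 || (size_t)n >= size - used)
            break;
        used += (size_t)n;
    }
}
```

Walking the C<next> pointers from the returned start op visits the nodes in the same order as the diagram: the two operands, the addition, the assignment target, and finally the assignment.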
| 1683 | =head2 Examining the tree |
| 1684 | |
| 1685 | If you have your perl compiled for debugging (usually done with |
| 1686 | C<-DDEBUGGING> on the C<Configure> command line), you may examine the |
| 1687 | compiled tree by specifying C<-Dx> on the Perl command line. The |
| 1688 | output takes several lines per node, and for C<$b+$c> it looks like |
| 1689 | this: |
| 1690 | |
| 1691 | 5           TYPE = add  ===> 6 |
| 1692 |             TARG = 1 |
| 1693 |             FLAGS = (SCALAR,KIDS) |
| 1694 |             { |
| 1695 |                 TYPE = null  ===> (4) |
| 1696 |                   (was rv2sv) |
| 1697 |                 FLAGS = (SCALAR,KIDS) |
| 1698 |                 { |
| 1699 | 3                   TYPE = gvsv  ===> 4 |
| 1700 |                     FLAGS = (SCALAR) |
| 1701 |                     GV = main::b |
| 1702 |                 } |
| 1703 |             } |
| 1704 |             { |
| 1705 |                 TYPE = null  ===> (5) |
| 1706 |                   (was rv2sv) |
| 1707 |                 FLAGS = (SCALAR,KIDS) |
| 1708 |                 { |
| 1709 | 4                   TYPE = gvsv  ===> 5 |
| 1710 |                     FLAGS = (SCALAR) |
| 1711 |                     GV = main::c |
| 1712 |                 } |
| 1713 |             } |
| 1714 | |
| 1715 | This tree has 5 nodes (one per C<TYPE> specifier), but only 3 of them are |
| 1716 | not optimized away (one per number in the left column). The immediate |
| 1717 | children of the given node correspond to C<{}> pairs on the same level |
| 1718 | of indentation, thus this listing corresponds to the tree: |
| 1719 | |
| 1720 |        add |
| 1721 |      /     \ |
| 1722 |    null    null |
| 1723 |     |       | |
| 1724 |    gvsv    gvsv |
| 1725 | |
| 1726 | The execution order is indicated by C<===E<gt>> marks, thus it is C<3 |
| 1727 | 4 5 6> (node C<6> is not included in the listing above), i.e., |
| 1728 | C<gvsv gvsv add whatever>. |
| 1729 | |
| 1730 | Each of these nodes represents an op, a fundamental operation inside the |
| 1731 | Perl core. The code which implements each operation can be found in the |
| 1732 | F<pp*.c> files; the function which implements the op with type C<gvsv> |
| 1733 | is C<pp_gvsv>, and so on. As the tree above shows, different ops have |
| 1734 | different numbers of children: C<add> is a binary operator, as one would |
| 1735 | expect, and so has two children. To accommodate the various different |
| 1736 | numbers of children, there are various types of op data structure, and |
| 1737 | they link together in different ways. |
| 1738 | |
| 1739 | The simplest type of op structure is C<OP>: this has no children. Unary |
| 1740 | operators, C<UNOP>s, have one child, and this is pointed to by the |
| 1741 | C<op_first> field. Binary operators (C<BINOP>s) have not only an |
| 1742 | C<op_first> field but also an C<op_last> field. The most complex type of |
| 1743 | op is a C<LISTOP>, which has any number of children. In this case, the |
| 1744 | first child is pointed to by C<op_first> and the last child by |
| 1745 | C<op_last>. The children in between can be found by iteratively |
| 1746 | following the C<op_sibling> pointer from the first child to the last. |
| 1747 | |
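The child-walking rule for a C<LISTOP> can be sketched with two toy structures. The field names C<op_first>, C<op_last> and C<op_sibling> follow the text; the structure layouts and the C<toy_*> names are hypothetical and much simpler than the real op structures.

```c
#include <stddef.h>

/* Toy op structures; layouts are invented for this sketch. */
struct toy_op {
    const char *name;
    struct toy_op *op_sibling;   /* next child of the same parent */
};

struct toy_listop {
    struct toy_op base;
    struct toy_op *op_first;     /* first child, or NULL */
    struct toy_op *op_last;      /* last child, or NULL  */
};

/* Count the children of a list op by starting at op_first and
 * following op_sibling until op_last has been visited. */
size_t toy_count_kids(const struct toy_listop *parent)
{
    size_t n = 0;
    const struct toy_op *kid = parent->op_first;

    while (kid) {
        n++;
        if (kid == parent->op_last)
            break;
        kid = kid->op_sibling;
    }
    return n;
}
```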
| 1748 | There are also two other op types: a C<PMOP> holds a regular expression, |
| 1749 | and has no children, and a C<LOOP> may or may not have children. If the |
| 1750 | C<op_children> field is non-zero, it behaves like a C<LISTOP>. To |
| 1751 | complicate matters, if a C<UNOP> is actually a C<null> op after |
| 1752 | optimization (see L</Compile pass 2: context propagation>) it will still |
| 1753 | have children in accordance with its former type. |
| 1754 | |
| 1755 | Another way to examine the tree is to use a compiler back-end module, such |
| 1756 | as L<B::Concise>. |
| 1757 | |
| 1758 | =head2 Compile pass 1: check routines |
| 1759 | |
| 1760 | The tree is created by the compiler while I<yacc> code feeds it |
| 1761 | the constructions it recognizes. Since I<yacc> works bottom-up, so does |
| 1762 | the first pass of perl compilation. |
| 1763 | |
| 1764 | What makes this pass interesting for perl developers is that some |
| 1765 | optimization may be performed on this pass. This is optimization by |
| 1766 | so-called "check routines". The correspondence between node names |
| 1767 | and corresponding check routines is described in F<opcode.pl> (do not |
| 1768 | forget to run C<make regen_headers> if you modify this file). |
| 1769 | |
| 1770 | A check routine is called when the node is fully constructed except |
| 1771 | for the execution-order thread. Since at this time there are no |
| 1772 | back-links to the currently constructed node, one can do almost any |
| 1773 | operation to the top-level node, including freeing it and/or creating |
| 1774 | new nodes above/below it. |
| 1775 | |
| 1776 | The check routine returns the node which should be inserted into the |
| 1777 | tree (if the top-level node was not modified, the check routine returns |
| 1778 | its argument). |
| 1779 | |
| 1780 | By convention, check routines have names C<ck_*>. They are usually |
| 1781 | called from C<new*OP> subroutines (or C<convert>) (which in turn are |
| 1782 | called from F<perly.y>). |
| 1783 | |
| 1784 | =head2 Compile pass 1a: constant folding |
| 1785 | |
| 1786 | Immediately after the check routine is called the returned node is |
| 1787 | checked for being compile-time executable. If it is (the value is |
| 1788 | judged to be constant) it is immediately executed, and a I<constant> |
| 1789 | node with the "return value" of the corresponding subtree is |
| 1790 | substituted instead. The subtree is deleted. |
| 1791 | |
| 1792 | If constant folding was not performed, the execution-order thread is |
| 1793 | created. |
| 1794 | |
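As a rough illustration of folding at construction time, here is a toy expression builder that replaces an addition of two constants with a single constant node as soon as the addition node has been built. It is a sketch under stated assumptions: the C<toy_*> names and the three node kinds are hypothetical, only addition is handled, and the test for "compile-time executable" is reduced to "both children are constants".

```c
#include <stdlib.h>

/* Toy expression node; all names are invented for this sketch. */
enum toy_kind { TOY_CONST, TOY_VAR, TOY_ADD };

struct toy_node {
    enum toy_kind kind;
    int value;                      /* meaningful for TOY_CONST only */
    struct toy_node *left, *right;  /* children of a TOY_ADD node    */
};

struct toy_node *toy_new_leaf(enum toy_kind kind, int value)
{
    struct toy_node *n = (struct toy_node *)calloc(1, sizeof *n);
    if (n) {
        n->kind = kind;
        n->value = value;
    }
    return n;
}

void toy_free_tree(struct toy_node *n)
{
    if (!n)
        return;
    toy_free_tree(n->left);
    toy_free_tree(n->right);
    free(n);
}

/* If n is an addition of two constants, "execute" it now, substitute
 * a constant node holding the result, and delete the old subtree.
 * Otherwise return n unchanged. */
struct toy_node *toy_fold(struct toy_node *n)
{
    if (n && n->kind == TOY_ADD && n->left && n->right
        && n->left->kind == TOY_CONST && n->right->kind == TOY_CONST) {
        struct toy_node *k =
            toy_new_leaf(TOY_CONST, n->left->value + n->right->value);
        if (k) {
            toy_free_tree(n);
            return k;
        }
    }
    return n;
}

/* Build an addition node, then give the folder a chance to replace
 * it. Takes ownership of both children; returns NULL on failure. */
struct toy_node *toy_new_add(struct toy_node *l, struct toy_node *r)
{
    struct toy_node *n;

    if (!l || !r || !(n = toy_new_leaf(TOY_ADD, 0))) {
        toy_free_tree(l);
        toy_free_tree(r);
        return NULL;
    }
    n->left = l;
    n->right = r;
    return toy_fold(n);
}
```

Because the tree is built bottom-up, an inner constant addition has already been folded by the time the outer node is constructed, so folding cascades upward without a separate pass.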
| 1795 | =head2 Compile pass 2: context propagation |
| 1796 | |
| 1797 | When a context for a part of compile tree is known, it is propagated |
| 1798 | down through the tree. At this time the context can have 5 values |
| 1799 | (instead of 2 for runtime context): void, boolean, scalar, list, and |
| 1800 | lvalue. In contrast with pass 1, this pass is processed from top |
| 1801 | to bottom: a node's context determines the context for its children. |
| 1802 | |
| 1803 | Additional context-dependent optimizations are performed at this time. |
| 1804 | Since at this moment the compile tree contains back-references (via |
| 1805 | "thread" pointers), nodes cannot be free()d now. To allow |
| 1806 | optimized-away nodes at this stage, such nodes are null()ified instead |
| 1807 | of free()ing (i.e. their type is changed to OP_NULL). |
| 1808 | |
| 1809 | =head2 Compile pass 3: peephole optimization |
| 1810 | |
| 1811 | After the compile tree for a subroutine (or for an C<eval> or a file) |
| 1812 | is created, an additional pass over the code is performed. This pass |
| 1813 | is neither top-down nor bottom-up, but in the execution order (with |
| 1814 | additional complications for conditionals). These optimizations are |
| 1815 | done in the subroutine peep(). Optimizations performed at this stage |
| 1816 | are subject to the same restrictions as in pass 2. |
| 1817 | |
| 1818 | =head2 Pluggable runops |
| 1819 | |
| 1820 | The compile tree is executed in a runops function. There are two runops |
| 1821 | functions: C<Perl_runops_standard> in F<run.c> and C<Perl_runops_debug> in |
| 1822 | F<dump.c>. The latter is used with DEBUGGING, the former otherwise. For fine |
| 1823 | control over the execution of the compile tree it is possible to provide |
| 1824 | your own runops function. |
| 1825 | |
| 1826 | It's probably best to copy one of the existing runops functions and |
| 1827 | change it to suit your needs. Then, in the BOOT section of your XS |
| 1828 | file, add the line: |
| 1829 | |
| 1830 | PL_runops = my_runops; |
| 1831 | |
| 1832 | This function should be as efficient as possible to keep your programs |
| 1833 | running as fast as possible. |
| 1834 | |
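The idea of a replaceable execution loop can be shown with a stand-alone model. This is not Perl source: the C<toy_*> names are hypothetical, each toy op carries a function pointer that does its work and returns the next op, and C<toy_runops> plays the role that C<PL_runops> plays in the text.

```c
#include <stddef.h>

struct toy_op;
typedef struct toy_op *(*toy_ppaddr_t)(struct toy_op *);

/* Toy op; names and layout are invented for this sketch. */
struct toy_op {
    toy_ppaddr_t op_ppaddr;   /* function that performs this op */
    struct toy_op *op_next;   /* next op in execution order     */
    int operand;
};

int toy_accumulator;          /* stand-in for interpreter state       */
unsigned long toy_ops_run;    /* maintained by the counting loop only */

/* A toy "pp function": do the work, then say which op runs next. */
struct toy_op *toy_pp_add(struct toy_op *op)
{
    toy_accumulator += op->operand;
    return op->op_next;
}

/* Plain loop: run ops until one returns NULL. The starting op must
 * not be NULL. */
int toy_runops_standard(struct toy_op *op)
{
    while ((op = op->op_ppaddr(op)) != NULL) {
        /* nothing else to do */
    }
    return 0;
}

/* A replacement loop that does the same thing and also counts how
 * many ops were executed. */
int toy_runops_counting(struct toy_op *op)
{
    while (op) {
        toy_ops_run++;
        op = op->op_ppaddr(op);
    }
    return 0;
}

/* The loop actually used; assign to it to plug in a different one. */
int (*toy_runops)(struct toy_op *) = toy_runops_standard;
```

The extra counter costs one increment for every op executed, which is why the loop needs to stay as lean as possible.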
| 1835 | =head1 Examining internal data structures with the C<dump> functions |
| 1836 | |
| 1837 | To aid debugging, the source file F<dump.c> contains a number of |
| 1838 | functions which produce formatted output of internal data structures. |
| 1839 | |
| 1840 | The most commonly used of these functions is C<Perl_sv_dump>; it's used |
| 1841 | for dumping SVs, AVs, HVs, and CVs. The C<Devel::Peek> module calls |
| 1842 | C<sv_dump> to produce debugging output from Perl-space, so users of that |
| 1843 | module should already be familiar with its format. |
| 1844 | |
| 1845 | C<Perl_op_dump> can be used to dump an C<OP> structure or any of its |
| 1846 | derivatives, and produces output similar to C<perl -Dx>; in fact, |
| 1847 | C<Perl_dump_eval> will dump the main root of the code being evaluated, |
| 1848 | exactly like C<-Dx>. |
| 1849 | |
| 1850 | Other useful functions are C<Perl_dump_sub>, which turns a C<GV> into an |
| 1851 | op tree, C<Perl_dump_packsubs>, which calls C<Perl_dump_sub> on all the |
| 1852 | subroutines in a package like so (thankfully, these are all xsubs, so |
| 1853 | there is no op tree): |
| 1854 | |
| 1855 | (gdb) print Perl_dump_packsubs(PL_defstash) |
| 1856 | |
| 1857 | SUB attributes::bootstrap = (xsub 0x811fedc 0) |
| 1858 | |
| 1859 | SUB UNIVERSAL::can = (xsub 0x811f50c 0) |
| 1860 | |
| 1861 | SUB UNIVERSAL::isa = (xsub 0x811f304 0) |
| 1862 | |
| 1863 | SUB UNIVERSAL::VERSION = (xsub 0x811f7ac 0) |
| 1864 | |
| 1865 | SUB DynaLoader::boot_DynaLoader = (xsub 0x805b188 0) |
| 1866 | |
| 1867 | and C<Perl_dump_all>, which dumps all the subroutines in the stash and |
| 1868 | the op tree of the main root. |
| 1869 | |
| 1870 | =head1 How multiple interpreters and concurrency are supported |
| 1871 | |
| 1872 | =head2 Background and PERL_IMPLICIT_CONTEXT |
| 1873 | |
| 1874 | The Perl interpreter can be regarded as a closed box: it has an API |
| 1875 | for feeding it code or otherwise making it do things, but it also has |
| 1876 | functions for its own use. This smells a lot like an object, and |
| 1877 | there are ways for you to build Perl so that you can have multiple |
| 1878 | interpreters, with one interpreter represented either as a C structure, |
| 1879 | or inside a thread-specific structure. These structures contain all |
| 1880 | the context, the state of that interpreter. |
| 1881 | |
| 1882 | Two macros control the major Perl build flavors: MULTIPLICITY and |
| 1883 | USE_5005THREADS. The MULTIPLICITY build has a C structure |
| 1884 | that packages all the interpreter state, and there is a similar thread-specific |
| 1885 | data structure under USE_5005THREADS. In both cases, |
| 1886 | PERL_IMPLICIT_CONTEXT is also normally defined, and enables the |
| 1887 | support for passing in a "hidden" first argument that represents either |
| 1888 | data structure. |
| 1889 | |
| 1890 | Two other "encapsulation" macros are PERL_GLOBAL_STRUCT and |
| 1891 | PERL_GLOBAL_STRUCT_PRIVATE (the latter turns on the former, and the |
| 1892 | former turns on MULTIPLICITY). PERL_GLOBAL_STRUCT causes all the |
| 1893 | internal variables of Perl to be wrapped inside a single global struct, |
| 1894 | struct perl_vars, accessible as (globals) &PL_Vars or PL_VarsPtr or |
| 1895 | the function Perl_GetVars(). PERL_GLOBAL_STRUCT_PRIVATE goes |
| 1896 | one step further: there is still a single struct (allocated in main() |
| 1897 | either from heap or from stack) but there are no global data symbols |
| 1898 | pointing to it. In either case the global struct should be initialised |
| 1899 | as the very first thing in main() using Perl_init_global_struct() and |
| 1900 | correspondingly torn down after perl_free() using Perl_free_global_struct(); |
| 1901 | please see F<miniperlmain.c> for usage details. You may also need |
| 1902 | to use C<dVAR> in your coding to "declare the global variables" |
| 1903 | when you are using them. dTHX does this for you automatically. |
| 1904 | |
| 1905 | For backward compatibility reasons defining just PERL_GLOBAL_STRUCT |
| 1906 | doesn't actually hide all symbols inside a big global struct: some |
| 1907 | PerlIO_xxx vtables are left visible. The PERL_GLOBAL_STRUCT_PRIVATE |
| 1908 | then hides everything (see how the PERLIO_FUNCS_DECL is used). |
| 1909 | |
| 1910 | All this obviously requires a way for the Perl internal functions to be |
| 1911 | either subroutines taking some kind of structure as the first |
| 1912 | argument, or subroutines taking nothing as the first argument. To |
| 1913 | enable these two very different ways of building the interpreter, |
| 1914 | the Perl source (as it does in so many other situations) makes heavy |
| 1915 | use of macros and subroutine naming conventions. |
| 1916 | |
| 1917 | First problem: deciding which functions will be public API functions and |
| 1918 | which will be private. All functions whose names begin with C<S_> are private |
| 1919 | (think "S" for "secret" or "static"). All other functions begin with |
| 1920 | "Perl_", but just because a function begins with "Perl_" does not mean it is |
| 1921 | part of the API. (See L</Internal Functions>.) The easiest way to be B<sure> a |
| 1922 | function is part of the API is to find its entry in L<perlapi>. |
| 1923 | If it exists in L<perlapi>, it's part of the API. If it doesn't, and you |
| 1924 | think it should be (i.e., you need it for your extension), send mail via |
| 1925 | L<perlbug> explaining why you think it should be. |
| 1926 | |
| 1927 | Second problem: there must be a syntax so that the same subroutine |
| 1928 | declarations and calls can pass a structure as their first argument, |
| 1929 | or pass nothing. To solve this, the subroutines are named and |
| 1930 | declared in a particular way. Here's a typical start of a static |
| 1931 | function used within the Perl guts: |
| 1932 | |
| 1933 | STATIC void |
| 1934 | S_incline(pTHX_ char *s) |
| 1935 | |
| 1936 | STATIC becomes "static" in C, and may be #define'd to nothing in some |
| 1937 | configurations in future. |
| 1938 | |
| 1939 | A public function (i.e. part of the internal API, but not necessarily |
| 1940 | sanctioned for use in extensions) begins like this: |
| 1941 | |
| 1942 | void |
| 1943 | Perl_sv_setiv(pTHX_ SV* dsv, IV num) |
| 1944 | |
| 1945 | C<pTHX_> is one of a number of macros (in perl.h) that hide the |
| 1946 | details of the interpreter's context. THX stands for "thread", "this", |
| 1947 | or "thingy", as the case may be. (And no, George Lucas is not involved. :-) |
| 1948 | The first character could be 'p' for a B<p>rototype, 'a' for B<a>rgument, |
| 1949 | or 'd' for B<d>eclaration, so we have C<pTHX>, C<aTHX> and C<dTHX>, and |
| 1950 | their variants. |
| 1951 | |
| 1952 | When Perl is built without options that set PERL_IMPLICIT_CONTEXT, there is no |
| 1953 | first argument containing the interpreter's context. The trailing underscore |
| 1954 | in the pTHX_ macro indicates that the macro expansion needs a comma |
| 1955 | after the context argument because other arguments follow it. If |
| 1956 | PERL_IMPLICIT_CONTEXT is not defined, pTHX_ will be ignored, and the |
| 1957 | subroutine is not prototyped to take the extra argument. The form of the |
| 1958 | macro without the trailing underscore is used when there are no additional |
| 1959 | explicit arguments. |
| 1960 | |
| 1961 | When a core function calls another, it must pass the context. This |
| 1962 | is normally hidden via macros. Consider C<sv_setiv>. It expands into |
| 1963 | something like this: |
| 1964 | |
| 1965 | #ifdef PERL_IMPLICIT_CONTEXT |
| 1966 | #define sv_setiv(a,b) Perl_sv_setiv(aTHX_ a, b) |
| 1967 | /* can't do this for vararg functions, see below */ |
| 1968 | #else |
| 1969 | #define sv_setiv Perl_sv_setiv |
| 1970 | #endif |
| 1971 | |
| 1972 | This works well, and means that XS authors can gleefully write: |
| 1973 | |
| 1974 | sv_setiv(foo, bar); |
| 1975 | |
| 1976 | and still have it work under all the modes Perl could have been |
| 1977 | compiled with. |
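
The macro trick above can be imitated in a few lines of standalone C.
What follows is only a sketch: C<DEMO_IMPLICIT_CONTEXT>, C<pDEMO_>,
C<aDEMO_>, C<demo_interp> and the C<Demo_*> functions are hypothetical
names made up for illustration, not the real definitions from F<perl.h>.

```c
/* Hypothetical stand-ins for illustration only; these are NOT the
 * real definitions from perl.h. */
#define DEMO_IMPLICIT_CONTEXT 1

typedef struct demo_interp { long counter; } demo_interp;

#if DEMO_IMPLICIT_CONTEXT
   /* p = prototype, a = argument; the trailing underscore supplies
    * the comma needed when further arguments follow. */
#  define pDEMO   demo_interp *my_ctx
#  define pDEMO_  pDEMO,
#  define aDEMO   my_ctx
#  define aDEMO_  aDEMO,
#else
   /* No context argument: all state lives in one global. */
   static demo_interp demo_global;
#  define my_ctx  (&demo_global)
#  define pDEMO   void
#  define pDEMO_
#  define aDEMO
#  define aDEMO_
#endif

/* "Public" functions, spelled out in full like Perl_sv_setiv. */
void Demo_add(pDEMO_ long n) { my_ctx->counter += n; }
long Demo_get(pDEMO)         { return my_ctx->counter; }

/* Short names that hide the context argument from callers. */
#define demo_add(n) Demo_add(aDEMO_ n)
#define demo_get()  Demo_get(aDEMO)

/* A caller that receives the context and passes it on implicitly. */
long demo_sum(pDEMO_ long a, long b)
{
    demo_add(a);
    demo_add(b);
    return demo_get();
}
```

The body of C<demo_sum> compiles unchanged whether C<DEMO_IMPLICIT_CONTEXT>
is 1 or 0; only the expansion of the macros differs, which is the point
of the scheme.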
| 1978 | |
| 1979 | This doesn't work so cleanly for varargs functions, though, as macros |
| 1980 | imply that the number of arguments is known in advance. Instead we |
| 1981 | either need to spell them out fully, passing C<aTHX_> as the first |
| 1982 | argument (the Perl core tends to do this with functions like |
| 1983 | Perl_warner), or use a context-free version. |
| 1984 | |
| 1985 | The context-free version of Perl_warner is called |
| 1986 | Perl_warner_nocontext, and does not take the extra argument. Instead |
| 1987 | it does dTHX; to get the context from thread-local storage. We |
| 1988 | C<#define warner Perl_warner_nocontext> so that extensions get source |
| 1989 | compatibility at the expense of performance. (Passing an arg is |
| 1990 | cheaper than grabbing it from thread-local storage.) |
| 1991 | |
| 1992 | You can ignore [pad]THXx when browsing the Perl headers/sources. |
| 1993 | Those are strictly for use within the core. Extensions and embedders |
| 1994 | need only be aware of [pad]THX. |
| 1995 | |
| 1996 | =head2 So what happened to dTHR? |
| 1997 | |
| 1998 | C<dTHR> was introduced in perl 5.005 to support the older thread model. |
| 1999 | The older thread model now uses the C<THX> mechanism to pass context |
| 2000 | pointers around, so C<dTHR> is not useful any more. Perl 5.6.0 and |
| 2001 | later still have it for backward source compatibility, but it is defined |
| 2002 | to be a no-op. |
| 2003 | |
| 2004 | =head2 How do I use all this in extensions? |
| 2005 | |
| 2006 | When Perl is built with PERL_IMPLICIT_CONTEXT, extensions that call |
| 2007 | any functions in the Perl API will need to pass the initial context |
| 2008 | argument somehow. The kicker is that you will need to write it in |
| 2009 | such a way that the extension still compiles when Perl hasn't been |
| 2010 | built with PERL_IMPLICIT_CONTEXT enabled. |
| 2011 | |
| 2012 | There are three ways to do this. First, the easy but inefficient way, |
| 2013 | which is also the default, in order to maintain source compatibility |
| 2014 | with extensions: whenever XSUB.h is #included, it redefines the aTHX |
| 2015 | and aTHX_ macros to call a function that will return the context. |
| 2016 | Thus, something like: |
| 2017 | |
| 2018 | sv_setiv(sv, num); |
| 2019 | |
| 2020 | in your extension will translate to this when PERL_IMPLICIT_CONTEXT is |
| 2021 | in effect: |
| 2022 | |
| 2023 | Perl_sv_setiv(Perl_get_context(), sv, num); |
| 2024 | |
| 2025 | or to this otherwise: |
| 2026 | |
| 2027 | Perl_sv_setiv(sv, num); |
| 2028 | |
| 2029 | You have to do nothing new in your extension to get this; since |
| 2030 | the Perl library provides Perl_get_context(), it will all just |
| 2031 | work. |
| 2032 | |
| 2033 | The second, more efficient way is to use the following template for |
| 2034 | your Foo.xs: |
| 2035 | |
| 2036 | #define PERL_NO_GET_CONTEXT /* we want efficiency */ |
| 2037 | #include "EXTERN.h" |
| 2038 | #include "perl.h" |
| 2039 | #include "XSUB.h" |
| 2040 | |
| 2041 | static SV *my_private_function(int arg1, int arg2); |
| 2042 | |
| 2043 | static SV * |
| 2044 | my_private_function(int arg1, int arg2) |
| 2045 | { |
| 2046 | dTHX; /* fetch context */ |
| 2047 | ... call many Perl API functions ... |
| 2048 | } |
| 2049 | |
| 2050 | [... etc ...] |
| 2051 | |
| 2052 | MODULE = Foo PACKAGE = Foo |
| 2053 | |
| 2054 | /* typical XSUB */ |
| 2055 | |
| 2056 | void |
| 2057 | my_xsub(arg) |
| 2058 | int arg |
| 2059 | CODE: |
| 2060 | my_private_function(arg, 10); |
| 2061 | |
| 2062 | Note that the only two changes from the normal way of writing an |
| 2063 | extension are the addition of a C<#define PERL_NO_GET_CONTEXT> before |
| 2064 | including the Perl headers, followed by a C<dTHX;> declaration at |
| 2065 | the start of every function that will call the Perl API. (You'll |
| 2066 | know which functions need this, because the C compiler will complain |
| 2067 | that there's an undeclared identifier in those functions.) No changes |
| 2068 | are needed for the XSUBs themselves, because the XS() macro is |
| 2069 | correctly defined to pass in the implicit context if needed. |
| 2070 | |
| 2071 | The third, even more efficient way is to ape how it is done within |
| 2072 | the Perl guts: |
| 2073 | |
| 2074 | |
| 2075 | #define PERL_NO_GET_CONTEXT /* we want efficiency */ |
| 2076 | #include "EXTERN.h" |
| 2077 | #include "perl.h" |
| 2078 | #include "XSUB.h" |
| 2079 | |
| 2080 | /* pTHX_ only needed for functions that call Perl API */ |
| 2081 | static SV *my_private_function(pTHX_ int arg1, int arg2); |
| 2082 | |
| 2083 | static SV * |
| 2084 | my_private_function(pTHX_ int arg1, int arg2) |
| 2085 | { |
| 2086 | /* dTHX; not needed here, because THX is an argument */ |
| 2087 | ... call Perl API functions ... |
| 2088 | } |
| 2089 | |
| 2090 | [... etc ...] |
| 2091 | |
| 2092 | MODULE = Foo PACKAGE = Foo |
| 2093 | |
| 2094 | /* typical XSUB */ |
| 2095 | |
| 2096 | void |
| 2097 | my_xsub(arg) |
| 2098 | int arg |
| 2099 | CODE: |
| 2100 | my_private_function(aTHX_ arg, 10); |
| 2101 | |
| 2102 | This implementation never has to fetch the context using a function |
| 2103 | call, since it is always passed as an extra argument. Depending on |
| 2104 | your needs for simplicity or efficiency, you may mix the previous |
| 2105 | two approaches freely. |
| 2106 | |
| 2107 | Never add a comma after C<pTHX> yourself--always use the form of the |
| 2108 | macro with the underscore for functions that take explicit arguments, |
| 2109 | or the form without the underscore for functions with no explicit arguments. |
| 2110 | |
| 2111 | If Perl is compiled with C<-DPERL_GLOBAL_STRUCT>, a C<dVAR> declaration |
| 2112 | is needed in any function that accesses the Perl global variables (see |
| 2113 | F<perlvars.h> or F<globvar.sym>) and does not use C<dTHX> (C<dTHX> |
| 2114 | includes C<dVAR> if necessary). The need for C<dVAR> shows up only |
| 2115 | with that compile-time define, because otherwise the Perl global |
| 2116 | variables are visible as-is. |
| 2117 | |
| 2118 | =head2 Should I do anything special if I call perl from multiple threads? |
| 2119 | |
| 2120 | If you create interpreters in one thread and then proceed to call them in |
| 2121 | another, you need to make sure perl's own Thread Local Storage (TLS) slot is |
| 2122 | initialized correctly in each of those threads. |
| 2123 | |
| 2124 | The C<perl_alloc> and C<perl_clone> API functions will automatically set |
| 2125 | the TLS slot to the interpreter they created, so that there is no need to do |
| 2126 | anything special if the interpreter is always accessed in the same thread that |
| 2127 | created it, and that thread did not create or call any other interpreters |
| 2128 | afterwards. If that is not the case, you have to set the TLS slot of the |
| 2129 | thread before calling any functions in the Perl API on that particular |
| 2130 | interpreter. This is done by calling the C<PERL_SET_CONTEXT> macro in that |
| 2131 | thread as the first thing you do: |
| 2132 | |
| 2133 | /* do this before doing anything else with some_perl */ |
| 2134 | PERL_SET_CONTEXT(some_perl); |
| 2135 | |
| 2136 | ... other Perl API calls on some_perl go here ... |
| 2137 | |
| 2138 | =head2 Future Plans and PERL_IMPLICIT_SYS |
| 2139 | |
| 2140 | Just as PERL_IMPLICIT_CONTEXT provides a way to bundle up everything |
| 2141 | that the interpreter knows about itself and pass it around, so too are |
| 2142 | there plans to allow the interpreter to bundle up everything it knows |
| 2143 | about the environment it's running on. This is enabled with the |
| 2144 | PERL_IMPLICIT_SYS macro. Currently it only works with USE_ITHREADS |
| 2145 | and USE_5005THREADS on Windows (see inside iperlsys.h). |
| 2146 | |
| 2147 | This makes it possible to provide an extra pointer (called the "host" |
| 2148 | environment) for all the system calls, so that the system-level code |
| 2149 | can maintain its own state, broken down into seven C structures. For |
| 2150 | the default perl executable these are thin wrappers around the usual |
| 2151 | system calls (see win32/perllib.c), but for a more ambitious host (like |
| 2152 | the one that would do fork() emulation) all the extra work needed to |
| 2153 | pretend that different interpreters are actually different "processes" |
| 2154 | would be done here. |
| 2155 | |
| 2156 | The Perl engine/interpreter and the host are orthogonal entities. |
| 2157 | There could be one or more interpreters in a process, and one or |
| 2158 | more "hosts", with free association between them. |
| 2159 | |
| 2160 | =head1 Internal Functions |
| 2161 | |
| 2162 | All of Perl's internal functions which will be exposed to the outside |
| 2163 | world are prefixed by C<Perl_> so that they will not conflict with XS |
| 2164 | functions or functions used in a program in which Perl is embedded. |
| 2165 | Similarly, all global variables begin with C<PL_>. (By convention, |
| 2166 | static functions start with C<S_>.) |
| 2167 | |
| 2168 | Inside the Perl core, you can get at the functions either with or |
| 2169 | without the C<Perl_> prefix, thanks to a bunch of defines that live in |
| 2170 | F<embed.h>. This header file is generated automatically from |
| 2171 | F<embed.pl> and F<embed.fnc>. F<embed.pl> also creates the prototyping |
| 2172 | header files for the internal functions, generates the documentation |
| 2173 | and a lot of other bits and pieces. It's important that when you add |
| 2174 | a new function to the core or change an existing one, you change the |
| 2175 | data in the table in F<embed.fnc> as well. Here's a sample entry from |
| 2176 | that table: |
| 2177 | |
| 2178 | Apd |SV** |av_fetch |AV* ar|I32 key|I32 lval |
| 2179 | |
| 2180 | The second column is the return type, the third column the name. Columns |
| 2181 | after that are the arguments. The first column is a set of flags: |
| 2182 | |
| 2183 | =over 3 |
| 2184 | |
| 2185 | =item A |
| 2186 | |
| 2187 | This function is a part of the public API. All such functions should also |
| 2188 | have 'd'; very few do not. |
| 2189 | |
| 2190 | =item p |
| 2191 | |
| 2192 | This function has a C<Perl_> prefix; i.e. it is defined as |
| 2193 | C<Perl_av_fetch>. |
| 2194 | |
| 2195 | =item d |
| 2196 | |
| 2197 | This function has documentation using the C<apidoc> feature which we'll |
| 2198 | look at in a second. Some functions have 'd' but not 'A'; docs are good. |
| 2199 | |
| 2200 | =back |
| 2201 | |
| 2202 | Other available flags are: |
| 2203 | |
| 2204 | =over 3 |
| 2205 | |
| 2206 | =item s |
| 2207 | |
| 2208 | This is a static function and is defined as C<STATIC S_whatever>, and |
| 2209 | usually called within the sources as C<whatever(...)>. |
| 2210 | |
| 2211 | =item n |
| 2212 | |
| 2213 | This does not need an interpreter context, so the definition has no |
| 2214 | C<pTHX>, and it follows that callers don't use C<aTHX>. (See |
| 2215 | L<perlguts/Background and PERL_IMPLICIT_CONTEXT>.) |
| 2216 | |
| 2217 | =item r |
| 2218 | |
| 2219 | This function never returns; C<croak>, C<exit> and friends. |
| 2220 | |
| 2221 | =item f |
| 2222 | |
| 2223 | This function takes a variable number of arguments, C<printf> style. |
| 2224 | The argument list should end with C<...>, like this: |
| 2225 | |
| 2226 | Afprd |void |croak |const char* pat|... |
| 2227 | |
| 2228 | =item M |
| 2229 | |
| 2230 | This function is part of the experimental development API, and may change |
| 2231 | or disappear without notice. |
| 2232 | |
| 2233 | =item o |
| 2234 | |
| 2235 | This function should not have a compatibility macro to define, say, |
| 2236 | C<Perl_parse> to C<parse>. It must be called as C<Perl_parse>. |
| 2237 | |
| 2238 | =item x |
| 2239 | |
| 2240 | This function isn't exported out of the Perl core. |
| 2241 | |
| 2242 | =item m |
| 2243 | |
| 2244 | This is implemented as a macro. |
| 2245 | |
| 2246 | =item X |
| 2247 | |
| 2248 | This function is explicitly exported. |
| 2249 | |
| 2250 | =item E |
| 2251 | |
| 2252 | This function is visible to extensions included in the Perl core. |
| 2253 | |
| 2254 | =item b |
| 2255 | |
| 2256 | Binary backward compatibility; this function is a macro but also has |
| 2257 | a C<Perl_> implementation (which is exported). |
| 2258 | |
| 2259 | =item others |
| 2260 | |
| 2261 | See the comments at the top of C<embed.fnc> for others. |
| 2262 | |
| 2263 | =back |
| 2264 | |
| 2265 | If you edit F<embed.pl> or F<embed.fnc>, you will need to run |
| 2266 | C<make regen_headers> to force a rebuild of F<embed.h> and other |
| 2267 | auto-generated files. |
| 2268 | |
| 2269 | =head2 Formatted Printing of IVs, UVs, and NVs |
| 2270 | |
| 2271 | If you are printing IVs, UVs, or NVs, then instead of the stdio(3) style |
| 2272 | formatting codes like C<%d>, C<%ld>, C<%f>, you should use the |
| 2273 | following macros for portability: |
| 2274 | |
| 2275 | IVdf IV in decimal |
| 2276 | UVuf UV in decimal |
| 2277 | UVof UV in octal |
| 2278 | UVxf UV in hexadecimal |
| 2279 | NVef NV %e-like |
| 2280 | NVff NV %f-like |
| 2281 | NVgf NV %g-like |
| 2282 | |
| 2283 | These will take care of 64-bit integers and long doubles. |
| 2284 | For example: |
| 2285 | |
| 2286 | printf("IV is %"IVdf"\n", iv); |
| 2287 | |
| 2288 | The IVdf will expand to whatever is the correct format for the IVs. |
| 2289 | |
| 2290 | If you are printing addresses of pointers, use UVxf combined |
| 2291 | with PTR2UV(); do not use %lx or %p. |
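
Standard C99 offers the same kind of mechanism in F<inttypes.h>, which
makes for a self-contained sketch of the idea. The names C<demo_iv>,
C<demo_uv>, C<DEMO_IVdf> and C<DEMO_UVxf> below are hypothetical
analogues chosen for illustration; they are not how Perl defines C<IVdf>
and friends.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins: an integer type wide enough for a pointer,
 * plus format macros that always match it. */
typedef intptr_t  demo_iv;
typedef uintptr_t demo_uv;
#define DEMO_IVdf PRIdPTR   /* demo_iv in decimal     */
#define DEMO_UVxf PRIxPTR   /* demo_uv in hexadecimal */

/* Format an integer without guessing between %d, %ld and %lld. */
int demo_format_iv(char *buf, size_t size, demo_iv iv)
{
    return snprintf(buf, size, "IV is %" DEMO_IVdf, iv);
}

/* Format an address by converting it to the unsigned integer type. */
int demo_format_addr(char *buf, size_t size, const void *p)
{
    return snprintf(buf, size, "0x%" DEMO_UVxf, (demo_uv)p);
}
```

Because the format macro is a string literal, it is pasted into the
format string by ordinary literal concatenation, just as in the
C<printf> call shown earlier.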
| 2292 | |
| 2293 | =head2 Pointer-To-Integer and Integer-To-Pointer |
| 2294 | |
| 2295 | Because pointer size does not necessarily equal integer size, |
| 2296 | use the following macros to do it right: |
| 2297 | |
| 2298 | PTR2UV(pointer) |
| 2299 | PTR2IV(pointer) |
| 2300 | PTR2NV(pointer) |
| 2301 | INT2PTR(pointertotype, integer) |
| 2302 | |
| 2303 | For example: |
| 2304 | |
| 2305 | IV iv = ...; |
| 2306 | SV *sv = INT2PTR(SV*, iv); |
| 2307 | |
| 2308 | and |
| 2309 | |
| 2310 | AV *av = ...; |
| 2311 | UV uv = PTR2UV(av); |
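
For readers without a Perl build at hand, the round trip can be tried in
plain C. This is a sketch under the assumption that C<uintptr_t> is
available; C<DEMO_PTR2UV> and C<DEMO_INT2PTR> are hypothetical analogues,
and Perl's own macros may be defined differently.

```c
#include <stdint.h>

/* Hypothetical analogues of PTR2UV and INT2PTR built on uintptr_t;
 * Perl's own macros are defined in its headers and may differ. */
#define DEMO_PTR2UV(p)        ((uintptr_t)(p))
#define DEMO_INT2PTR(type, i) ((type)(uintptr_t)(i))

/* Stash a pointer in an integer and recover it; returns the value the
 * recovered pointer points at. */
int demo_roundtrip(int *p)
{
    uintptr_t uv = DEMO_PTR2UV(p);          /* pointer -> integer */
    int *q = DEMO_INT2PTR(int *, uv);       /* integer -> pointer */
    return *q;
}
```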
| 2312 | |
| 2313 | =head2 Exception Handling |
| 2314 | |
| 2315 | There are a couple of macros to do very basic exception handling in XS |
| 2316 | modules. You have to define C<NO_XSLOCKS> before including F<XSUB.h> to |
| 2317 | be able to use these macros: |
| 2318 | |
| 2319 | #define NO_XSLOCKS |
| 2320 | #include "XSUB.h" |
| 2321 | |
| 2322 | You can use these macros if you call code that may croak, but you need |
| 2323 | to do some cleanup before giving control back to Perl. For example: |
| 2324 | |
| 2325 | dXCPT; /* set up necessary variables */ |
| 2326 | |
| 2327 | XCPT_TRY_START { |
| 2328 | code_that_may_croak(); |
| 2329 | } XCPT_TRY_END |
| 2330 | |
| 2331 | XCPT_CATCH |
| 2332 | { |
| 2333 | /* do cleanup here */ |
| 2334 | XCPT_RETHROW; |
| 2335 | } |
| 2336 | |
| 2337 | Note that you always have to rethrow an exception that has been |
| 2338 | caught. Using these macros, it is not possible to just catch the |
| 2339 | exception and ignore it. If you have to ignore the exception, you |
| 2340 | have to use one of the C<call_*> functions instead. |
| 2341 | |
| 2342 | The advantage of using the above macros is that you don't have |
| 2343 | to set up an extra function for C<call_*>, and that using these |
| 2344 | macros is faster than using C<call_*>. |
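
The try / cleanup / rethrow shape can be sketched in standalone C with
C<setjmp> and C<longjmp>. This illustrates the general pattern only; it
is not a copy of the C<XCPT_*> macros, and every C<demo_*> name below is
hypothetical.

```c
#include <setjmp.h>
#include <stddef.h>

static jmp_buf *demo_top;    /* innermost active handler        */
static int      demo_error;  /* code of the exception in flight */

/* Hypothetical stand-in for croak(): jump to the innermost handler. */
void demo_croak(int code)
{
    demo_error = code;
    longjmp(*demo_top, 1);
}

void demo_may_croak(int fail)
{
    if (fail)
        demo_croak(42);
}

/* Try / catch / rethrow: cleanup runs only when an exception passes. */
void demo_with_cleanup(int fail, int *cleanup_ran)
{
    jmp_buf here;
    jmp_buf *volatile outer = demo_top;  /* remember enclosing handler */

    demo_top = &here;
    if (setjmp(here) == 0) {        /* "try" */
        demo_may_croak(fail);
        demo_top = outer;
    } else {                        /* "catch" */
        demo_top = outer;
        *cleanup_ran = 1;           /* do cleanup here */
        demo_croak(demo_error);     /* rethrow to the enclosing handler */
    }
}

/* Outermost handler, playing the role of the caller that finally
 * receives the exception. Returns 0 on success or the error code. */
int demo_run(int fail, int *cleanup_ran)
{
    jmp_buf top;

    *cleanup_ran = 0;
    demo_top = &top;
    if (setjmp(top) == 0) {
        demo_with_cleanup(fail, cleanup_ran);
        demo_top = NULL;
        return 0;
    }
    demo_top = NULL;
    return demo_error;
}
```

Note how the catch branch restores the enclosing handler before
rethrowing; without that, the rethrow would jump straight back into the
same handler.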
| 2345 | |
| 2346 | =head2 Source Documentation |
| 2347 | |
| 2348 | There's an effort going on to document the internal functions and |
| 2349 | automatically produce reference manuals from them - L<perlapi> is one |
| 2350 | such manual which details all the functions which are available to XS |
| 2351 | writers. L<perlintern> is the autogenerated manual for the functions |
| 2352 | which are not part of the API and are supposedly for internal use only. |
| 2353 | |
| 2354 | Source documentation is created by putting POD comments into the C |
| 2355 | source, like this: |
| 2356 | |
| 2357 | /* |
| 2358 | =for apidoc sv_setiv |
| 2359 | |
| 2360 | Copies an integer into the given SV. Does not handle 'set' magic. See |
| 2361 | C<sv_setiv_mg>. |
| 2362 | |
| 2363 | =cut |
| 2364 | */ |
| 2365 | |
| 2366 | Please try and supply some documentation if you add functions to the |
| 2367 | Perl core. |
| 2368 | |
| 2369 | =head2 Backwards compatibility |
| 2370 | |
| 2371 | The Perl API changes over time. New functions are added or the interfaces |
| 2372 | of existing functions are changed. The C<Devel::PPPort> module tries to |
| 2373 | provide compatibility code for some of these changes, so XS writers don't |
| 2374 | have to code it themselves when supporting multiple versions of Perl. |
| 2375 | |
| 2376 | C<Devel::PPPort> generates a C header file F<ppport.h> that can also |
| 2377 | be run as a Perl script. To generate F<ppport.h>, run: |
| 2378 | |
| 2379 | perl -MDevel::PPPort -eDevel::PPPort::WriteFile |
| 2380 | |
| 2381 | Besides checking existing XS code, the script can also be used to retrieve |
| 2382 | compatibility information for various API calls using the C<--api-info> |
| 2383 | command line switch. For example: |
| 2384 | |
| 2385 | % perl ppport.h --api-info=sv_magicext |
| 2386 | |
| 2387 | For details, see C<perldoc ppport.h>. |
| 2388 | |
| 2389 | =head1 Unicode Support |
| 2390 | |
| 2391 | Perl 5.6.0 introduced Unicode support. It's important for porters and XS |
| 2392 | writers to understand this support and make sure that the code they |
| 2393 | write does not corrupt Unicode data. |
| 2394 | |
| 2395 | =head2 What B<is> Unicode, anyway? |
| 2396 | |
| 2397 | In the olden, less enlightened times, we all used to use ASCII. Most of |
| 2398 | us did, anyway. The big problem with ASCII is that it's American. Well, |
| 2399 | no, that's not actually the problem; the problem is that it's not |
| 2400 | particularly useful for people who don't use the Roman alphabet. What |
| 2401 | used to happen was that particular languages would stick their own |
| 2402 | alphabet in the upper range of the sequence, between 128 and 255. Of |
| 2403 | course, we then ended up with plenty of variants that weren't quite |
| 2404 | ASCII, and the whole point of it being a standard was lost. |
| 2405 | |
| 2406 | Worse still, if you've got a language like Chinese or |
| 2407 | Japanese that has hundreds or thousands of characters, then you really |
| 2408 | can't fit them into a mere 256, so they had to forget about ASCII |
| 2409 | altogether, and build their own systems using pairs of numbers to refer |
| 2410 | to one character. |
| 2411 | |
| 2412 | To fix this, some people formed Unicode, Inc. and |
| 2413 | produced a new character set containing all the characters you can |
| 2414 | possibly think of and more. There are several ways of representing these |
| 2415 | characters, and the one Perl uses is called UTF-8. UTF-8 uses |
| 2416 | a variable number of bytes to represent a character, instead of just |
| 2417 | one. You can learn more about Unicode at http://www.unicode.org/ |
| 2418 | |
| 2419 | =head2 How can I recognise a UTF-8 string? |
| 2420 | |
| 2421 | You can't. This is because UTF-8 data is stored in bytes just like |
| 2422 | non-UTF-8 data. The Unicode character 200 (C<0xC8> for you hex types), |
| 2423 | capital E with a grave accent, is represented by the two bytes |
| 2424 | C<v195.136>. Unfortunately, the non-Unicode string C<chr(195).chr(136)> |
| 2425 | has that byte sequence as well. So you can't tell just by looking - this |
| 2426 | is what makes Unicode input an interesting problem. |
| 2427 | |
| 2428 | The API function C<is_utf8_string> can help; it'll tell you if a string |
| 2429 | contains only valid UTF-8 characters. However, it can't do the work for |
| 2430 | you. On a character-by-character basis, C<is_utf8_char> will tell you |
| 2431 | whether the current character in a string is valid UTF-8. |
| 2432 | |
| 2433 | =head2 How does UTF-8 represent Unicode characters? |
| 2434 | |
| 2435 | As mentioned above, UTF-8 uses a variable number of bytes to store a |
| 2436 | character. Characters with values 0...127 are stored in one byte, just |
| 2437 | like good ol' ASCII. Character 128 is stored as C<v194.128>; this |
| 2438 | continues up to character 191, which is C<v194.191>. Now we've run out of |
| 2439 | bits (191 is binary C<10111111>) so we move on; 192 is C<v195.128>. And |
| 2440 | so it goes on, moving to three bytes at character 2048. |
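
The byte layout described above can be written out as a small
standalone encoder. This is a sketch of standard UTF-8 for code points
up to 0xFFFF, not Perl's own C<uv_to_utf8>; the name C<demo_utf8_encode>
is hypothetical, and surrogates are not checked for.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode code point cp (up to 0xFFFF) as UTF-8 into out, which must
 * have room for 3 bytes. Returns the number of bytes written, or 0 if
 * cp is outside the range this sketch handles. */
size_t demo_utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {                         /* one byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                        /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp > 0xFFFF)
        return 0;                            /* not handled here */
    out[0] = (unsigned char)(0xE0 | (cp >> 12));          /* 1110xxxx */
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));  /* 10xxxxxx */
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));         /* 10xxxxxx */
    return 3;
}
```

Feeding it 191 yields the bytes 194 and 191, 192 yields 195 and 128, and
2048 is the first value to need three bytes, matching the walk-through
above.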
| 2441 | |
| 2442 | Assuming you know you're dealing with a UTF-8 string, you can find out |
| 2443 | how long the first character in it is with the C<UTF8SKIP> macro: |
| 2444 | |
| 2445 | char *utf = "\305\233\340\240\201"; |
| 2446 | I32 len; |
| 2447 | |
| 2448 | len = UTF8SKIP(utf); /* len is 2 here */ |
| 2449 | utf += len; |
| 2450 | len = UTF8SKIP(utf); /* len is 3 here */ |
| 2451 | |
| 2452 | Another way to skip over characters in a UTF-8 string is to use |
| 2453 | C<utf8_hop>, which takes a string and a number of characters to skip |
| 2454 | over. You're on your own about bounds checking, though, so don't use it |
| 2455 | lightly. |
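
Since bounds checking is left to the caller, here is a sketch of what a
careful hop might look like in standalone C. It assumes standard UTF-8
with sequences of at most four bytes, which is narrower than what Perl
itself accepts; C<demo_utf8_skip> and C<demo_utf8_hop> are hypothetical
names, not the real C<UTF8SKIP> and C<utf8_hop>.

```c
#include <stddef.h>

/* Length in bytes of the UTF-8 character that starts with lead byte c,
 * judged from the lead byte alone (0 if c cannot start a character). */
size_t demo_utf8_skip(unsigned char c)
{
    if (c < 0x80)           return 1;   /* 0xxxxxxx */
    if ((c & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((c & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((c & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return 0;                           /* continuation or invalid byte */
}

/* Advance over up to n characters without running past end.
 * Returns the new position, or NULL on malformed or truncated input. */
const unsigned char *demo_utf8_hop(const unsigned char *s,
                                   const unsigned char *end, size_t n)
{
    while (n > 0 && s < end) {
        size_t len = demo_utf8_skip(*s);
        if (len == 0 || len > (size_t)(end - s))
            return NULL;
        s += len;
        n--;
    }
    return s;
}
```

Passing an explicit C<end> pointer is the design choice that matters
here: the lead byte alone says how long the character claims to be, not
whether those bytes are actually present.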
| 2456 | |
| 2457 | All bytes in a multi-byte UTF-8 character will have the high bit set, |
| 2458 | so you can test if you need to do something special with this |
| 2459 | character like this (the UTF8_IS_INVARIANT() is a macro that tests |
| 2460 | whether the byte can be encoded as a single byte even in UTF-8): |
| 2461 | |
| 2462 | U8 *utf; |
| 2463 | UV uv; /* Note: a UV, not a U8, not a char */ |
| 2464 | |
| 2465 | if (!UTF8_IS_INVARIANT(*utf)) |
| 2466 | /* Must treat this as UTF-8 */ |
| 2467 | uv = utf8_to_uv(utf); |
| 2468 | else |
| 2469 | /* OK to treat this character as a byte */ |
| 2470 | uv = *utf; |
| 2471 | |
| 2472 | You can also see in that example that we use C<utf8_to_uv> to get the |
| 2473 | value of the character; the inverse function C<uv_to_utf8> is available |
| 2474 | for putting a UV into UTF-8: |
| 2475 | |
| 2476 | if (!UTF8_IS_INVARIANT(uv)) |
| 2477 | /* Must treat this as UTF8 */ |
| 2478 | utf8 = uv_to_utf8(utf8, uv); |
| 2479 | else |
| 2480 | /* OK to treat this character as a byte */ |
| 2481 | *utf8++ = uv; |
| 2482 | |
| 2483 | You B<must> convert characters to UVs using the above functions if |
| 2484 | you're ever in a situation where you have to match UTF-8 and non-UTF-8 |
| 2485 | characters. You may not skip over UTF-8 characters in this case. If you |
| 2486 | do this, you'll lose the ability to match hi-bit non-UTF-8 characters; |
| 2487 | for instance, if your UTF-8 string contains C<v195.136>, and you skip |
| 2488 | that character, you can never match a C<chr(200)> in a non-UTF-8 string. |
| 2489 | So don't do that! |
| 2490 | |
| 2491 | =head2 How does Perl store UTF-8 strings? |
| 2492 | |
| 2493 | Currently, Perl deals with Unicode strings and non-Unicode strings |
| 2494 | slightly differently. If a string has been identified as being UTF-8 |
| 2495 | encoded, Perl will set a flag in the SV, C<SVf_UTF8>. You can check and |
| 2496 | manipulate this flag with the following macros: |
| 2497 | |
| 2498 | SvUTF8(sv) |
| 2499 | SvUTF8_on(sv) |
| 2500 | SvUTF8_off(sv) |
| 2501 | |
| 2502 | This flag has an important effect on Perl's treatment of the string: if |
| 2503 | Unicode data is not properly distinguished, regular expressions, |
| 2504 | C<length>, C<substr> and other string handling operations will have |
| 2505 | undesirable results. |
| 2506 | |
| 2507 | The problem comes when you have, for instance, a string that isn't |
| 2508 | flagged as UTF-8, and contains a byte sequence that could be UTF-8 - |
| 2509 | especially when combining non-UTF-8 and UTF-8 strings. |
| 2510 | |
| 2511 | Never forget that the C<SVf_UTF8> flag is separate from the PV value; you |
| 2512 | need to be sure you don't accidentally knock it off while you're |
| 2513 | manipulating SVs. More specifically, you cannot expect to do this: |
| 2514 | |
| 2515 | SV *sv; |
| 2516 | SV *nsv; |
| 2517 | STRLEN len; |
| 2518 | char *p; |
| 2519 | |
| 2520 | p = SvPV(sv, len); |
| 2521 | frobnicate(p); |
| 2522 | nsv = newSVpvn(p, len); |
| 2523 | |
| 2524 | The C<char*> string does not tell you the whole story, and you can't |
| 2525 | copy or reconstruct an SV just by copying the string value. Check if the |
| 2526 | old SV has the UTF-8 flag set, and act accordingly: |
| 2527 | |
| 2528 | p = SvPV(sv, len); |
| 2529 | frobnicate(p); |
| 2530 | nsv = newSVpvn(p, len); |
| 2531 | if (SvUTF8(sv)) |
| 2532 | SvUTF8_on(nsv); |
| 2533 | |
| 2534 | In fact, your C<frobnicate> function should be made aware of whether or |
| 2535 | not it's dealing with UTF-8 data, so that it can handle the string |
| 2536 | appropriately. |
| 2537 | |
| 2538 | Since copying only the string data of an SV passed to an XS function |
| 2539 | does not carry over the UTF-8 flag, passing just a C<char *> to an XS |
| 2540 | function is even less adequate. |
| 2541 | |
| 2542 | =head2 How do I convert a string to UTF-8? |
| 2543 | |
| 2544 | If you're mixing UTF-8 and non-UTF-8 strings, you might find it necessary |
| 2545 | to upgrade one of the strings to UTF-8. If you've got an SV, the easiest |
| 2546 | way to do this is: |
| 2547 | |
| 2548 | sv_utf8_upgrade(sv); |
| 2549 | |
| 2550 | However, you must not do this, for example: |
| 2551 | |
| 2552 | if (!SvUTF8(left)) |
| 2553 | sv_utf8_upgrade(left); |
| 2554 | |
| 2555 | If you do this in a binary operator, you will actually change one of the |
| 2556 | strings that came into the operator, and, while it shouldn't be noticeable |
| 2557 | by the end user, it can cause problems. |
| 2558 | |
| 2559 | Instead, C<bytes_to_utf8> will give you a UTF-8-encoded B<copy> of its |
| 2560 | string argument. This is useful for having the data available for |
| 2561 | comparisons and so on, without harming the original SV. There's also |
| 2562 | C<utf8_to_bytes> to go the other way, but naturally, this will fail if |
| 2563 | the string contains any characters above 255 that can't be represented |
| 2564 | in a single byte. |
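
To make the "copy, don't modify" idea concrete, here is a standalone
sketch of such a conversion. It assumes the input bytes are characters
0 to 255 on an ASCII-based platform, and it is not Perl's
C<bytes_to_utf8>; the name C<demo_bytes_to_utf8> and its signature are
hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* Return a newly malloc'd UTF-8 copy of len bytes, treating each input
 * byte as one character 0..255. The input is left untouched. The new
 * length is stored in *newlen; the caller frees the result. */
unsigned char *demo_bytes_to_utf8(const unsigned char *bytes, size_t len,
                                  size_t *newlen)
{
    /* Worst case: every byte needs two bytes; +1 for a trailing NUL. */
    unsigned char *out = malloc(len * 2 + 1);
    unsigned char *d = out;
    size_t i;

    if (out == NULL)
        return NULL;
    for (i = 0; i < len; i++) {
        unsigned char c = bytes[i];
        if (c < 0x80) {
            *d++ = c;                                   /* unchanged */
        } else {
            *d++ = (unsigned char)(0xC0 | (c >> 6));    /* lead byte */
            *d++ = (unsigned char)(0x80 | (c & 0x3F));  /* continuation */
        }
    }
    *d = '\0';
    *newlen = (size_t)(d - out);
    return out;
}
```

The original buffer is only read, so a binary operator can compare the
upgraded copy against its other operand and then free it, leaving both
of its arguments exactly as they arrived.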
| 2565 | |
| 2566 | =head2 Is there anything else I need to know? |
| 2567 | |
| 2568 | Not really. Just remember these things: |
| 2569 | |
| 2570 | =over 3 |
| 2571 | |
| 2572 | =item * |
| 2573 | |
| 2574 | There's no way to tell if a string is UTF-8 or not. You can tell if an SV |
| 2575 | is UTF-8 by looking at its C<SvUTF8> flag. Don't forget to set the flag if |
| 2576 | something should be UTF-8. Treat the flag as part of the PV, even though |
| 2577 | it's not - if you pass on the PV to somewhere, pass on the flag too. |
| 2578 | |
| 2579 | =item * |
| 2580 | |
| 2581 | If a string is UTF-8, B<always> use C<utf8_to_uv> to get at the value, |
| 2582 | unless C<UTF8_IS_INVARIANT(*s)> in which case you can use C<*s>. |
| 2583 | |
| 2584 | =item * |
| 2585 | |
| 2586 | When writing a character C<uv> to a UTF-8 string, B<always> use |
| 2587 | C<uv_to_utf8>, unless C<UTF8_IS_INVARIANT(uv)> in which case |
| 2588 | you can use C<*s = uv>. |
| 2589 | |
| 2590 | =item * |
| 2591 | |
| 2592 | Mixing UTF-8 and non-UTF-8 strings is tricky. Use C<bytes_to_utf8> to get |
| 2593 | a new string which is UTF-8 encoded. There are tricks you can use to |
| 2594 | delay deciding whether you need to use a UTF-8 string until you get to a |
| 2595 | high character - C<HALF_UPGRADE> is one of those. |
| 2596 | |
| 2597 | =back |
| 2598 | |
| 2599 | =head1 Custom Operators |
| 2600 | |
| 2601 | Custom operator support is a new experimental feature that allows you to |
| 2602 | define your own ops. This is primarily to allow the building of |
| 2603 | interpreters for other languages in the Perl core, but it also allows |
| 2604 | optimizations through the creation of "macro-ops" (ops which perform the |
| 2605 | functions of multiple ops which are usually executed together, such as |
| 2606 | C<gvsv, gvsv, add>.) |
| 2607 | |
| 2608 | This feature is implemented as a new op type, C<OP_CUSTOM>. The Perl |
| 2609 | core does not "know" anything special about this op type, and so it will |
| 2610 | not be involved in any optimizations. This also means that you can |
| 2611 | define your custom ops to be any op structure - unary, binary, list and |
| 2612 | so on - you like. |
| 2613 | |
| 2614 | It's important to know what custom operators won't do for you. They |
| 2615 | won't let you add new syntax to Perl, directly. They won't even let you |
| 2616 | add new keywords, directly. In fact, they won't change the way Perl |
| 2617 | compiles a program at all. You have to do those changes yourself, after |
| 2618 | Perl has compiled the program. You do this either by manipulating the op |
| 2619 | tree using a C<CHECK> block and the C<B::Generate> module, or by adding |
| 2620 | a custom peephole optimizer with the C<optimize> module. |
| 2621 | |
| 2622 | When you do this, you replace ordinary Perl ops with custom ops by |
| 2623 | creating ops with the type C<OP_CUSTOM> and the C<op_ppaddr> of your own |
| 2624 | PP function. This should be defined in XS code, and should look like |
| 2625 | the PP ops in C<pp_*.c>. You are responsible for ensuring that your op |
| 2626 | takes the appropriate number of values from the stack, and you are |
| 2627 | responsible for adding stack marks if necessary. |
| 2628 | |
| 2629 | You should also "register" your op with the Perl interpreter so that it |
| 2630 | can produce sensible error and warning messages. Since it is possible to |
| 2631 | have multiple custom ops within the one "logical" op type C<OP_CUSTOM>, |
| 2632 | Perl uses the value of C<< o->op_ppaddr >> as a key into the |
| 2633 | C<PL_custom_op_descs> and C<PL_custom_op_names> hashes. This means you |
| 2634 | need to enter a name and description for your op at the appropriate |
| 2635 | place in the C<PL_custom_op_names> and C<PL_custom_op_descs> hashes. |
| 2636 | |
| 2637 | Forthcoming versions of C<B::Generate> (version 1.0 and above) should |
| 2638 | directly support the creation of custom ops by name. |
| 2639 | |
| 2640 | =head1 AUTHORS |
| 2641 | |
| 2642 | Until May 1997, this document was maintained by Jeff Okamoto |
| 2643 | E<lt>okamoto@corp.hp.comE<gt>. It is now maintained as part of Perl |
| 2644 | itself by the Perl 5 Porters E<lt>perl5-porters@perl.orgE<gt>. |
| 2645 | |
| 2646 | With lots of help and suggestions from Dean Roehrich, Malcolm Beattie, |
| 2647 | Andreas Koenig, Paul Hudson, Ilya Zakharevich, Paul Marquess, Neil |
| 2648 | Bowers, Matthew Green, Tim Bunce, Spider Boardman, Ulrich Pfeifer, |
| 2649 | Stephen McCamant, and Gurusamy Sarathy. |
| 2650 | |
| 2651 | =head1 SEE ALSO |
| 2652 | |
| 2653 | L<perlapi>, L<perlintern>, L<perlxs>, L<perlembed> |