This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5
=head1 NAME

perlguts - Introduction to the Perl API

=head1 DESCRIPTION

This document attempts to describe how to use the Perl API, as well as
to provide some info on the basic workings of the Perl core. It is far
from complete and probably contains many errors. Please refer any
questions or comments to the author below.

=head1 Variables

=head2 Datatypes

Perl has three typedefs that handle Perl's three main data types:

    SV  Scalar Value
    AV  Array Value
    HV  Hash Value

Each typedef has specific routines that manipulate the various data types.

=head2 What is an "IV"?

Perl uses a special typedef IV which is a simple signed integer type that is
guaranteed to be large enough to hold a pointer (as well as an integer).
Additionally, there is the UV, which is simply an unsigned IV.

Perl also uses two special typedefs, I32 and I16, which will always be at
least 32 bits and 16 bits long, respectively. (Again, there are U32 and U16,
as well.) They will usually be exactly 32 and 16 bits long, but on Crays
they will both be 64 bits.

=head2 Working with SVs

An SV can be created and loaded with one command. There are five types of
values that can be loaded: an integer value (IV), an unsigned integer
value (UV), a double (NV), a string (PV), and another scalar (SV).

The seven routines are:

    SV*  newSViv(IV);
    SV*  newSVuv(UV);
    SV*  newSVnv(double);
    SV*  newSVpv(const char*, STRLEN);
    SV*  newSVpvn(const char*, STRLEN);
    SV*  newSVpvf(const char*, ...);
    SV*  newSVsv(SV*);

C<STRLEN> is an integer type (Size_t, usually defined as size_t in
F<config.h>) guaranteed to be large enough to represent the size of
any string that perl can handle.

In the unlikely case of an SV requiring more complex initialisation, you
can create an empty SV with newSV(len). If C<len> is 0 an empty SV of
type NULL is returned, else an SV of type PV is returned with len + 1 (for
the NUL) bytes of storage allocated, accessible via SvPVX. In both cases
the SV has value undef.

    SV *sv = newSV(0);   /* no storage allocated  */
    SV *sv = newSV(10);  /* 10 (+1) bytes of uninitialised storage allocated  */

To change the value of an I<already-existing> SV, there are eight routines:

    void  sv_setiv(SV*, IV);
    void  sv_setuv(SV*, UV);
    void  sv_setnv(SV*, double);
    void  sv_setpv(SV*, const char*);
    void  sv_setpvn(SV*, const char*, STRLEN);
    void  sv_setpvf(SV*, const char*, ...);
    void  sv_vsetpvfn(SV*, const char*, STRLEN, va_list *, SV **, I32, bool *);
    void  sv_setsv(SV*, SV*);

Notice that you can choose to specify the length of the string to be
assigned by using C<sv_setpvn>, C<newSVpvn>, or C<newSVpv>, or you may
allow Perl to calculate the length by using C<sv_setpv> or by specifying
0 as the second argument to C<newSVpv>. Be warned, though, that Perl will
determine the string's length by using C<strlen>, which depends on the
string terminating with a NUL character.

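The difference between a measured length and a caller-supplied length
can be seen without any Perl headers. The following is only a sketch in
plain C: C<toy_buf>, C<toy_set_measured> and C<toy_set_counted> are
hypothetical names invented for this illustration and are not part of
the Perl API. They loosely mimic the C<sv_setpv> style and the
C<sv_setpvn> style respectively.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a string buffer; this is NOT Perl's SV layout. */
typedef struct {
    char   data[32];
    size_t cur;        /* payload bytes, excluding the trailing NUL */
} toy_buf;

/* Measured style (cf. sv_setpv): the length comes from strlen(). */
static void toy_set_measured(toy_buf *b, const char *src)
{
    size_t len;
    assert(b != NULL && src != NULL);
    len = strlen(src);                     /* stops at the first NUL */
    if (len > sizeof b->data - 1)
        len = sizeof b->data - 1;          /* keep the toy buffer safe */
    memcpy(b->data, src, len);
    b->data[len] = '\0';
    b->cur = len;
}

/* Counted style (cf. sv_setpvn): the caller states the length. */
static void toy_set_counted(toy_buf *b, const char *src, size_t len)
{
    assert(b != NULL && src != NULL);
    if (len > sizeof b->data - 1)
        len = sizeof b->data - 1;          /* keep the toy buffer safe */
    memcpy(b->data, src, len);             /* embedded NULs are copied too */
    b->data[len] = '\0';                   /* a trailing NUL is still added */
    b->cur = len;
}
```

Given the six payload bytes C<"abc\0de">, the measured variant records
a length of 3 while the counted variant records 6, which is why data
that may contain NULs should go through the length-taking routines.
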
The arguments of C<sv_setpvf> are processed like C<sprintf>, and the
formatted output becomes the value.

C<sv_vsetpvfn> is an analogue of C<vsprintf>, but it allows you to specify
either a pointer to a variable argument list or the address and length of
an array of SVs. The last argument points to a boolean; on return, if that
boolean is true, then locale-specific information has been used to format
the string, and the string's contents are therefore untrustworthy (see
L<perlsec>). This pointer may be NULL if that information is not
important. Note that this function requires you to specify the length of
the format.

The C<sv_set*()> functions are not generic enough to operate on values
that have "magic". See L<Magic Virtual Tables> later in this document.

All SVs that contain strings should be terminated with a NUL character.
If a string is not NUL-terminated there is a risk of
core dumps and corruptions from code which passes the string to C
functions or system calls which expect a NUL-terminated string.
Perl's own functions typically add a trailing NUL for this reason.
Nevertheless, you should be very careful when you pass a string stored
in an SV to a C function or system call.

To access the actual value that an SV points to, you can use the macros:

    SvIV(SV*)
    SvUV(SV*)
    SvNV(SV*)
    SvPV(SV*, STRLEN len)
    SvPV_nolen(SV*)

which will automatically coerce the actual scalar type into an IV, UV, double,
or string.

In the C<SvPV> macro, the length of the string returned is placed into the
variable C<len> (this is a macro, so you do I<not> use C<&len>). If you do
not care what the length of the data is, use the C<SvPV_nolen> macro.
Historically the C<SvPV> macro with the global variable C<PL_na> has been
used in this case. But that can be quite inefficient because C<PL_na> must
be accessed in thread-local storage in threaded Perl. In any case, remember
that Perl allows arbitrary strings of data that may both contain NULs and
might not be terminated by a NUL.

Also remember that C doesn't allow you to safely say C<foo(SvPV(s, len),
len);>. It might work with your compiler, but it won't work for everyone.
Break this sort of statement up into separate assignments:

    SV *s;
    STRLEN len;
    char * ptr;
    ptr = SvPV(s, len);
    foo(ptr, len);

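The reason is that C leaves the order in which function arguments are
evaluated unspecified, so the second argument may be read before the
macro has stored the length. The sketch below reproduces the safe
pattern in plain C. C<toy_sv>, C<TOY_PV>, C<consume> and C<safe_call>
are hypothetical stand-ins made up for this illustration; C<TOY_PV>
only imitates the "return a pointer, assign the length as a side
effect" shape of C<SvPV>.

```c
#include <assert.h>
#include <stddef.h>

/* Toy scalar: a pointer plus a length. NOT Perl's SV. */
typedef struct {
    const char *pv;
    size_t      cur;
} toy_sv;

/* Like SvPV in shape: yields the pointer, stores the length in lenvar. */
#define TOY_PV(sv, lenvar) ((lenvar) = (sv)->cur, (sv)->pv)

/* Stand-in for foo(): returns the length it was handed. */
static size_t consume(const char *ptr, size_t len)
{
    assert(ptr != NULL);
    return len;
}

static size_t safe_call(const toy_sv *sv)
{
    size_t len;
    /* The assignment to len completes in its own statement... */
    const char *ptr = TOY_PV(sv, len);
    /* ...so both arguments are plain, already-computed values here. */
    return consume(ptr, len);
}
```

Writing C<consume(TOY_PV(sv, len), len)> instead would leave it to the
compiler whether C<len> is read before or after the macro assigns it.
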
If you want to know if the scalar value is TRUE, you can use:

    SvTRUE(SV*)

Although Perl will automatically grow strings for you, if you need to force
Perl to allocate more memory for your SV, you can use the macro

    SvGROW(SV*, STRLEN newlen)

which will determine if more memory needs to be allocated. If so, it will
call the function C<sv_grow>. Note that C<SvGROW> can only increase, not
decrease, the allocated memory of an SV and that it does not automatically
add a byte for a trailing NUL (perl's own string functions typically do
C<SvGROW(sv, len + 1)>).

If you have an SV and want to know what kind of data Perl thinks is stored
in it, you can use the following macros to check the type of SV you have.

    SvIOK(SV*)
    SvNOK(SV*)
    SvPOK(SV*)

You can get and set the current length of the string stored in an SV with
the following macros:

    SvCUR(SV*)
    SvCUR_set(SV*, I32 val)

You can also get a pointer to the end of the string stored in the SV
with the macro:

    SvEND(SV*)

But note that these last three macros are valid only if C<SvPOK()> is true.

If you want to append something to the end of the string stored in an C<SV*>,
you can use the following functions:

    void  sv_catpv(SV*, const char*);
    void  sv_catpvn(SV*, const char*, STRLEN);
    void  sv_catpvf(SV*, const char*, ...);
    void  sv_vcatpvfn(SV*, const char*, STRLEN, va_list *, SV **, I32, bool *);
    void  sv_catsv(SV*, SV*);

The first function calculates the length of the string to be appended by
using C<strlen>. In the second, you specify the length of the string
yourself. The third function processes its arguments like C<sprintf> and
appends the formatted output. The fourth function works like C<vsprintf>.
You can specify the address and length of an array of SVs instead of the
va_list argument. The fifth function extends the string stored in the first
SV with the string stored in the second SV. It also forces the second SV
to be interpreted as a string.

The C<sv_cat*()> functions are not generic enough to operate on values that
have "magic". See L<Magic Virtual Tables> later in this document.

If you know the name of a scalar variable, you can get a pointer to its SV
by using the following:

    SV*  get_sv("package::varname", FALSE);

This returns NULL if the variable does not exist.

If you want to know if this variable (or any other SV) is actually C<defined>,
you can call:

    SvOK(SV*)

The scalar C<undef> value is stored in an SV instance called C<PL_sv_undef>.
Its address can be used whenever an C<SV*> is needed.

There are also the two values C<PL_sv_yes> and C<PL_sv_no>, which contain
boolean TRUE and FALSE values, respectively. Like C<PL_sv_undef>, their
addresses can be used whenever an C<SV*> is needed.

Do not be fooled into thinking that C<(SV *) 0> is the same as C<&PL_sv_undef>.
Take this code:

    SV* sv = (SV*) 0;
    if (I-am-to-return-a-real-value) {
            sv = sv_2mortal(newSViv(42));
    }
    sv_setsv(ST(0), sv);

This code tries to return a new SV (which contains the value 42) if it should
return a real value, or undef otherwise. Instead it has returned a NULL
pointer which, somewhere down the line, will cause a segmentation violation,
bus error, or just weird results. Change the zero to C<&PL_sv_undef> in the
first line and all will be well.

To free an SV that you've created, call C<SvREFCNT_dec(SV*)>. Normally this
call is not necessary (see L<Reference Counts and Mortality>).

=head2 Offsets

Perl provides the function C<sv_chop> to efficiently remove characters
from the beginning of a string; you give it an SV and a pointer to
somewhere inside the PV, and it discards everything before the
pointer. The efficiency comes by means of a little hack: instead of
actually removing the characters, C<sv_chop> sets the flag C<OOK>
(offset OK) to signal to other functions that the offset hack is in
effect, and it puts the number of bytes chopped off into the IV field
of the SV. It then moves the PV pointer (called C<SvPVX>) forward that
many bytes, and adjusts C<SvCUR> and C<SvLEN>.

Hence, at this point, the start of the buffer that we allocated lives
at C<SvPVX(sv) - SvIV(sv)> in memory and the PV pointer is pointing
into the middle of this allocated storage.

This is best demonstrated by example:

  % ./perl -Ilib -MDevel::Peek -le '$a="12345"; $a=~s/.//; Dump($a)'
  SV = PVIV(0x8128450) at 0x81340f0
    REFCNT = 1
    FLAGS = (POK,OOK,pPOK)
    IV = 1  (OFFSET)
    PV = 0x8135781 ( "1" . ) "2345"\0
    CUR = 4
    LEN = 5

Here the number of bytes chopped off (1) is put into IV, and
C<Devel::Peek::Dump> helpfully reminds us that this is an offset. The
portion of the string between the "real" and the "fake" beginnings is
shown in parentheses, and the values of C<SvCUR> and C<SvLEN> reflect
the fake beginning, not the real one.

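The bookkeeping behind the offset hack can be modelled in a few lines
of plain C. This is a toy model only, not Perl's implementation:
C<toy_str> and its functions are hypothetical names, and the C<offset>
field merely plays the role that the IV slot plays in the description
above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char  *pv;      /* visible start of the string (cf. SvPVX) */
    size_t cur;     /* visible length (cf. SvCUR) */
    size_t len;     /* allocated bytes counted from pv (cf. SvLEN) */
    size_t offset;  /* bytes chopped off the front (cf. the IV slot) */
} toy_str;

/* Allocate a copy of text; returns 0 on success, -1 on failure. */
static int toy_init(toy_str *s, const char *text)
{
    size_t n = strlen(text);
    s->pv = malloc(n + 1);
    if (s->pv == NULL)
        return -1;
    memcpy(s->pv, text, n + 1);
    s->cur = n;
    s->len = n + 1;
    s->offset = 0;
    return 0;
}

/* Discard everything before ptr without moving a single byte. */
static void toy_chop(toy_str *s, char *ptr)
{
    size_t delta;
    assert(ptr >= s->pv && ptr <= s->pv + s->cur);
    delta = (size_t)(ptr - s->pv);
    s->pv     += delta;
    s->cur    -= delta;
    s->len    -= delta;
    s->offset += delta;
}

/* The real allocation starts offset bytes before the visible pointer. */
static void toy_free(toy_str *s)
{
    free(s->pv - s->offset);
    s->pv = NULL;
}
```

Chopping one byte off C<"12345"> in this model leaves C<offset> at 1,
C<cur> at 4 and C<len> at 5, the same figures as in the C<Dump> output
above, and the original pointer is only recomputed when freeing.
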
Something similar to the offset hack is performed on AVs to enable
efficient shifting and splicing off the beginning of the array; while
C<AvARRAY> points to the first element in the array that is visible from
Perl, C<AvALLOC> points to the real start of the C array. These are
usually the same, but a C<shift> operation can be carried out by
increasing C<AvARRAY> by one and decreasing C<AvFILL> and C<AvLEN>.
Again, the location of the real start of the C array only comes into
play when freeing the array. See C<av_shift> in F<av.c>.

=head2 What's Really Stored in an SV?

Recall that the usual method of determining the type of scalar you have is
to use C<Sv*OK> macros. Because a scalar can be both a number and a string,
these macros will usually return TRUE, and calling the C<Sv*V>
macros will do the appropriate conversion of string to integer/double or
integer/double to string.

If you I<really> need to know if you have an integer, double, or string
pointer in an SV, you can use the following three macros instead:

    SvIOKp(SV*)
    SvNOKp(SV*)
    SvPOKp(SV*)

These will tell you if you truly have an integer, double, or string pointer
stored in your SV. The "p" stands for private.

There are various ways in which the private and public flags may differ.
For example, a tied SV may have a valid underlying value in the IV slot
(so SvIOKp is true), but the data should be accessed via the FETCH
routine rather than directly, so SvIOK is false. Another is when
numeric conversion has occurred and precision has been lost: only the
private flag is set on 'lossy' values. So when an NV is converted to an
IV with loss, SvIOKp, SvNOKp and SvNOK will be set, while SvIOK won't be.

In general, though, it's best to use the C<Sv*V> macros.

=head2 Working with AVs

There are two ways to create and load an AV. The first method creates an
empty AV:

    AV*  newAV();

The second method both creates the AV and initially populates it with SVs:

    AV*  av_make(I32 num, SV **ptr);

The second argument points to an array containing C<num> C<SV*>'s. Once the
AV has been created, the SVs can be destroyed, if so desired.

Once the AV has been created, the following operations are possible on AVs:

    void  av_push(AV*, SV*);
    SV*   av_pop(AV*);
    SV*   av_shift(AV*);
    void  av_unshift(AV*, I32 num);

These should be familiar operations, with the exception of C<av_unshift>.
This routine adds C<num> elements at the front of the array with the C<undef>
value. You must then use C<av_store> (described below) to assign values
to these new elements.

Here are some other functions:

    I32   av_len(AV*);
    SV**  av_fetch(AV*, I32 key, I32 lval);
    SV**  av_store(AV*, I32 key, SV* val);

The C<av_len> function returns the highest index value in the array (just
like $#array in Perl). If the array is empty, -1 is returned. The
C<av_fetch> function returns the value at index C<key>, but if C<lval>
is non-zero, then C<av_fetch> will store an undef value at that index.
The C<av_store> function stores the value C<val> at index C<key>, and does
not increment the reference count of C<val>. Thus the caller is responsible
for taking care of that, and if C<av_store> returns NULL, the caller will
have to decrement the reference count to avoid a memory leak. Note that
C<av_fetch> and C<av_store> both return C<SV**>'s, not C<SV*>'s as their
return value.

    void  av_clear(AV*);
    void  av_undef(AV*);
    void  av_extend(AV*, I32 key);

The C<av_clear> function deletes all the elements in the AV* array, but
does not actually delete the array itself. The C<av_undef> function will
delete all the elements in the array plus the array itself. The
C<av_extend> function extends the array so that it contains at least C<key+1>
elements. If C<key+1> is less than the currently allocated length of the array,
then nothing is done.

If you know the name of an array variable, you can get a pointer to its AV
by using the following:

    AV*  get_av("package::varname", FALSE);

This returns NULL if the variable does not exist.

See L<Understanding the Magic of Tied Hashes and Arrays> for more
information on how to use the array access functions on tied arrays.

=head2 Working with HVs

To create an HV, you use the following routine:

    HV*  newHV();

Once the HV has been created, the following operations are possible on HVs:

    SV**  hv_store(HV*, const char* key, U32 klen, SV* val, U32 hash);
    SV**  hv_fetch(HV*, const char* key, U32 klen, I32 lval);

The C<klen> parameter is the length of the key being passed in (note that
you cannot pass 0 in as a value of C<klen> to tell Perl to measure the
length of the key). The C<val> argument contains the SV pointer to the
scalar being stored, and C<hash> is the precomputed hash value (zero if
you want C<hv_store> to calculate it for you). The C<lval> parameter
indicates whether this fetch is actually a part of a store operation, in
which case a new undefined value will be added to the HV with the supplied
key and C<hv_fetch> will return as if the value had already existed.

Remember that C<hv_store> and C<hv_fetch> return C<SV**>'s and not just
C<SV*>. To access the scalar value, you must first dereference the return
value. However, you should check to make sure that the return value is
not NULL before dereferencing it.

These two functions check whether a hash table entry exists and delete
it, respectively.

    bool  hv_exists(HV*, const char* key, U32 klen);
    SV*   hv_delete(HV*, const char* key, U32 klen, I32 flags);

If C<flags> does not include the C<G_DISCARD> flag then C<hv_delete> will
create and return a mortal copy of the deleted value.

And more miscellaneous functions:

    void  hv_clear(HV*);
    void  hv_undef(HV*);

Like their AV counterparts, C<hv_clear> deletes all the entries in the hash
table but does not actually delete the hash table. The C<hv_undef> deletes
both the entries and the hash table itself.

Perl keeps the actual data in a linked list of structures with a typedef of HE.
These contain the actual key and value pointers (plus extra administrative
overhead). The key is a string pointer; the value is an C<SV*>. However,
once you have an C<HE*>, to get the actual key and value, use the routines
specified below.

    I32    hv_iterinit(HV*);
            /* Prepares starting point to traverse hash table */
    HE*    hv_iternext(HV*);
            /* Get the next entry, and return a pointer to a
               structure that has both the key and value */
    char*  hv_iterkey(HE* entry, I32* retlen);
            /* Get the key from an HE structure and also return
               the length of the key string */
    SV*    hv_iterval(HV*, HE* entry);
            /* Return an SV pointer to the value of the HE
               structure */
    SV*    hv_iternextsv(HV*, char** key, I32* retlen);
            /* This convenience routine combines hv_iternext,
               hv_iterkey, and hv_iterval.  The key and retlen
               arguments are return values for the key and its
               length.  The value is the SV* return value */

If you know the name of a hash variable, you can get a pointer to its HV
by using the following:

    HV*  get_hv("package::varname", FALSE);

This returns NULL if the variable does not exist.

The hash algorithm is defined in the C<PERL_HASH(hash, key, klen)> macro:

    hash = 0;
    while (klen--)
        hash = (hash * 33) + *key++;
    hash = hash + (hash >> 5);  /* after 5.6 */

The last step was added in version 5.6 to improve distribution of
lower bits in the resulting hash value.

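As a rough illustration, the steps listed above can be written as an
ordinary C function. C<toy_perl_hash> is a hypothetical name, not
something Perl exports, and two details are assumptions of this
sketch: the result is held in a C<uint32_t>, and each key byte is read
as C<unsigned char> so that the result does not depend on whether
plain C<char> is signed on the platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Times-33 hash with the final "after 5.6" mixing step. */
static uint32_t toy_perl_hash(const char *key, size_t klen)
{
    uint32_t hash = 0;

    assert(key != NULL || klen == 0);
    while (klen--)
        hash = (hash * 33) + (unsigned char)*key++;  /* wraps modulo 2^32 */

    return hash + (hash >> 5);                       /* spread the low bits */
}
```

On an ASCII platform the one-byte key C<"a"> (byte value 97) hashes to
97 + (97 >> 5) = 97 + 3 = 100 in this sketch.
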
See L<Understanding the Magic of Tied Hashes and Arrays> for more
information on how to use the hash access functions on tied hashes.

=head2 Hash API Extensions

Beginning with version 5.004, the following functions are also supported:

    HE*   hv_fetch_ent  (HV* tb, SV* key, I32 lval, U32 hash);
    HE*   hv_store_ent  (HV* tb, SV* key, SV* val, U32 hash);

    bool  hv_exists_ent (HV* tb, SV* key, U32 hash);
    SV*   hv_delete_ent (HV* tb, SV* key, I32 flags, U32 hash);

    SV*   hv_iterkeysv  (HE* entry);

Note that these functions take C<SV*> keys, which simplifies writing
of extension code that deals with hash structures. These functions
also allow passing of C<SV*> keys to C<tie> functions without forcing
you to stringify the keys (unlike the previous set of functions).

They also return and accept whole hash entries (C<HE*>), making their
use more efficient (since the hash number for a particular string
doesn't have to be recomputed every time). See L<perlapi> for detailed
descriptions.

The following macros must always be used to access the contents of hash
entries. Note that the arguments to these macros must be simple
variables, since they may get evaluated more than once. See
L<perlapi> for detailed descriptions of these macros.

    HePV(HE* he, STRLEN len)
    HeVAL(HE* he)
    HeHASH(HE* he)
    HeSVKEY(HE* he)
    HeSVKEY_force(HE* he)
    HeSVKEY_set(HE* he, SV* sv)

These two lower level macros are defined, but must only be used when
dealing with keys that are not C<SV*>s:

    HeKEY(HE* he)
    HeKLEN(HE* he)

Note that both C<hv_store> and C<hv_store_ent> do not increment the
reference count of the stored C<val>, which is the caller's responsibility.
If these functions return a NULL value, the caller will usually have to
decrement the reference count of C<val> to avoid a memory leak.

=head2 References

References are a special type of scalar that point to other data types
(including references).

To create a reference, use either of the following functions:

    SV* newRV_inc((SV*) thing);
    SV* newRV_noinc((SV*) thing);

The C<thing> argument can be any of an C<SV*>, C<AV*>, or C<HV*>. The
functions are identical except that C<newRV_inc> increments the reference
count of the C<thing>, while C<newRV_noinc> does not. For historical
reasons, C<newRV> is a synonym for C<newRV_inc>.

Once you have a reference, you can use the following macro to dereference
the reference:

    SvRV(SV*)

then call the appropriate routines, casting the returned C<SV*> to either an
C<AV*> or C<HV*>, if required.

To determine if an SV is a reference, you can use the following macro:

    SvROK(SV*)

To discover what type of value the reference refers to, use the following
macro and then check the return value.

    SvTYPE(SvRV(SV*))

The most useful types that will be returned are:

    SVt_IV    Scalar
    SVt_NV    Scalar
    SVt_PV    Scalar
    SVt_RV    Scalar
    SVt_PVAV  Array
    SVt_PVHV  Hash
    SVt_PVCV  Code
    SVt_PVGV  Glob (possibly a file handle)
    SVt_PVMG  Blessed or Magical Scalar

See the F<sv.h> header file for more details.

=head2 Blessed References and Class Objects

References are also used to support object-oriented programming. In perl's
OO lexicon, an object is simply a reference that has been blessed into a
package (or class). Once blessed, the programmer may now use the reference
to access the various methods in the class.

A reference can be blessed into a package with the following function:

    SV* sv_bless(SV* sv, HV* stash);

The C<sv> argument must be a reference value. The C<stash> argument
specifies which class the reference will belong to. See
L<Stashes and Globs> for information on converting class names into stashes.

/* Still under construction */

Upgrades rv to reference if not already one. Creates new SV for rv to
point to. If C<classname> is non-null, the SV is blessed into the specified
class. SV is returned.

    SV* newSVrv(SV* rv, const char* classname);

Copies integer, unsigned integer or double into an SV whose reference is
C<rv>. SV is blessed if C<classname> is non-null.

    SV* sv_setref_iv(SV* rv, const char* classname, IV iv);
    SV* sv_setref_uv(SV* rv, const char* classname, UV uv);
    SV* sv_setref_nv(SV* rv, const char* classname, NV nv);

Copies the pointer value (I<the address, not the string!>) into an SV whose
reference is C<rv>. SV is blessed if C<classname> is non-null.

    SV* sv_setref_pv(SV* rv, const char* classname, PV pv);

Copies string into an SV whose reference is C<rv>. Set length to 0 to let
Perl calculate the string length. SV is blessed if C<classname> is non-null.

    SV* sv_setref_pvn(SV* rv, const char* classname, PV pv, STRLEN length);

Tests whether the SV is blessed into the specified class. It does not
check inheritance relationships.

    int  sv_isa(SV* sv, const char* name);

Tests whether the SV is a reference to a blessed object.

    int  sv_isobject(SV* sv);

Tests whether the SV is derived from the specified class. SV can be either
a reference to a blessed object or a string containing a class name. This
is the function implementing the C<UNIVERSAL::isa> functionality.

    bool sv_derived_from(SV* sv, const char* name);

To check if you've got an object derived from a specific class you have
to write:

    if (sv_isobject(sv) && sv_derived_from(sv, class)) { ... }

=head2 Creating New Variables

To create a new Perl variable with an undef value which can be accessed from
your Perl script, use the following routines, depending on the variable type.

    SV*  get_sv("package::varname", TRUE);
    AV*  get_av("package::varname", TRUE);
    HV*  get_hv("package::varname", TRUE);

Notice the use of TRUE as the second parameter. The new variable can now
be set, using the routines appropriate to the data type.

There are additional macros whose values may be bitwise OR'ed with the
C<TRUE> argument to enable certain extra features. Those bits are:

=over

=item GV_ADDMULTI

Marks the variable as multiply defined, thus preventing the:

  Name <varname> used only once: possible typo

warning.

=item GV_ADDWARN

Issues the warning:

  Had to create <varname> unexpectedly

if the variable did not exist before the function was called.

=back

If you do not specify a package name, the variable is created in the current
package.

=head2 Reference Counts and Mortality

Perl uses a reference count-driven garbage collection mechanism. SVs,
AVs, or HVs (xV for short in the following) start their life with a
reference count of 1. If the reference count of an xV ever drops to 0,
then it will be destroyed and its memory made available for reuse.

This normally doesn't happen at the Perl level unless a variable is
undef'ed or the last variable holding a reference to it is changed or
overwritten. At the internal level, however, reference counts can be
manipulated with the following macros:

    int   SvREFCNT(SV* sv);
    SV*   SvREFCNT_inc(SV* sv);
    void  SvREFCNT_dec(SV* sv);

However, there is one other function which manipulates the reference
count of its argument. The C<newRV_inc> function, you will recall,
creates a reference to the specified argument. As a side effect,
it increments the argument's reference count. If this is not what
you want, use C<newRV_noinc> instead.

For example, imagine you want to return a reference from an XSUB function.
Inside the XSUB routine, you create an SV which initially has a reference
count of one. Then you call C<newRV_inc>, passing it the just-created SV.
This returns the reference as a new SV, but the reference count of the
SV you passed to C<newRV_inc> has been incremented to two. Now you
return the reference from the XSUB routine and forget about the SV.
But Perl hasn't! Whenever the returned reference is destroyed, the
reference count of the original SV is decreased to one and nothing happens.
The SV will hang around without any way to access it until Perl itself
terminates. This is a memory leak.

The correct procedure, then, is to use C<newRV_noinc> instead of
C<newRV_inc>. Then, if and when the last reference is destroyed,
the reference count of the SV will go to zero and it will be destroyed,
stopping any memory leak.

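The arithmetic of that leak can be traced with a small stand-alone
model. Everything here is hypothetical and exists only for this
illustration: C<toy_sv>, C<toy_live> and the C<toy_*> functions are
not Perl API, they merely imitate the counting behaviour described
above for C<newRV_inc>, C<newRV_noinc> and C<SvREFCNT_dec>.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct toy_sv {
    int            refcnt;
    struct toy_sv *target;   /* non-NULL when this value is a reference */
} toy_sv;

static int toy_live = 0;     /* values currently allocated */

/* New values start life with a reference count of 1. */
static toy_sv *toy_new(void)
{
    toy_sv *sv = calloc(1, sizeof *sv);
    if (sv == NULL)
        abort();
    sv->refcnt = 1;
    toy_live++;
    return sv;
}

/* Drop one count; destroy at zero and release what it referred to. */
static void toy_dec(toy_sv *sv)
{
    if (sv != NULL && --sv->refcnt == 0) {
        toy_dec(sv->target);
        free(sv);
        toy_live--;
    }
}

/* Reference that takes over the caller's count on thing. */
static toy_sv *toy_newrv_noinc(toy_sv *thing)
{
    toy_sv *rv = toy_new();
    rv->target = thing;
    return rv;
}

/* Reference that adds a count of its own to thing. */
static toy_sv *toy_newrv_inc(toy_sv *thing)
{
    assert(thing != NULL);
    thing->refcnt++;
    return toy_newrv_noinc(thing);
}
```

With C<toy_newrv_inc>, destroying the reference leaves the target
alive with a count of one unless the creator also drops its own count;
with C<toy_newrv_noinc>, destroying the reference frees both.
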
There are some convenience functions available that can help with the
destruction of xVs. These functions introduce the concept of "mortality".
An xV that is mortal has had its reference count marked to be decremented,
but not actually decremented, until "a short time later". Generally the
term "short time later" means a single Perl statement, such as a call to
an XSUB function. The actual determinant for when mortal xVs have their
reference count decremented depends on two macros, SAVETMPS and FREETMPS.
See L<perlcall> and L<perlxs> for more details on these macros.

"Mortalization" then is at its simplest a deferred C<SvREFCNT_dec>.
However, if you mortalize a variable twice, the reference count will
later be decremented twice.

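One way to picture a deferred decrement is a list of pending values
that is drained later. The model below is a deliberately simplified
sketch and makes no claim about how Perl stores its temporaries:
C<toy_sv>, C<toy_tmps>, C<toy_2mortal> and C<toy_free_tmps> are
hypothetical names, and the C<floor> argument is only an assumed
stand-in for whatever boundary C<SAVETMPS> establishes.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_TMPS_MAX 16

typedef struct {
    int refcnt;
} toy_sv;

static toy_sv *toy_tmps[TOY_TMPS_MAX];  /* pending deferred decrements */
static size_t  toy_tmps_ix = 0;         /* number of pending entries */

/* Cf. sv_2mortal: remember the value, leave its count alone for now. */
static toy_sv *toy_2mortal(toy_sv *sv)
{
    assert(sv != NULL);
    assert(toy_tmps_ix < TOY_TMPS_MAX);
    toy_tmps[toy_tmps_ix++] = sv;
    return sv;
}

/* Cf. FREETMPS: apply every decrement recorded above the given floor. */
static void toy_free_tmps(size_t floor)
{
    while (toy_tmps_ix > floor)
        toy_tmps[--toy_tmps_ix]->refcnt--;
}
```

A value queued once loses one count when the list is drained, and a
value queued twice loses two, which is the double decrement warned
about above.
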
00aadd71
NIS
687"Mortal" SVs are mainly used for SVs that are placed on perl's stack.
688For example an SV which is created just to pass a number to a called sub
06f6df17
RGS
689is made mortal to have it cleaned up automatically when it's popped off
690the stack. Similarly, results returned by XSUBs (which are pushed on the
691stack) are often made mortal.
a0d0e21e
LW
692
693To create a mortal variable, use the functions:
694
695 SV* sv_newmortal()
696 SV* sv_2mortal(SV*)
697 SV* sv_mortalcopy(SV*)
698
The first call creates a mortal SV (with no value), the second converts an existing
SV to a mortal SV (and thus defers a call to C<SvREFCNT_dec>), and the
third creates a mortal copy of an existing SV.
Because C<sv_newmortal> gives the new SV no value, it must normally be given one
via C<sv_setpv>, C<sv_setiv>, etc.:

    SV *tmp = sv_newmortal();
    sv_setiv(tmp, an_integer);

As that is multiple C statements, it is quite common to see this idiom instead:

    SV *tmp = sv_2mortal(newSViv(an_integer));

You should be careful about creating mortal variables. Strange things
can happen if you make the same value mortal within multiple contexts,
or if you make a variable mortal multiple times. Thinking of "Mortalization"
as deferred C<SvREFCNT_dec> should help to minimize such problems.
For example, if you are passing an SV which you I<know> has high enough REFCNT
to survive its use on the stack, you need not do any mortalization.
If you are not sure, then doing an C<SvREFCNT_inc> and C<sv_2mortal>, or
making a C<sv_mortalcopy>, is safer.

The mortal routines are not just for SVs -- AVs and HVs can be
made mortal by passing their address (type-casted to C<SV*>) to the
C<sv_2mortal> or C<sv_mortalcopy> routines.

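The "deferred C<SvREFCNT_dec>" view of mortality can be sketched in plain C.
This is only a toy model, not perl's implementation: every name in it
(C<ToyMortalSV>, C<toy_2mortal>, C<toy_freetmps>) is hypothetical, and the
fixed-size array merely stands in for whatever bookkeeping perl really does.

```c
/* Toy model only: NOT perl's implementation.  All names are hypothetical. */
#define TOY_TMPS_MAX 16

typedef struct { int refcnt; int freed; } ToyMortalSV;

static ToyMortalSV *toy_tmps[TOY_TMPS_MAX];  /* pending deferred decrements */
static int toy_tmps_ix = 0;                  /* number of pending entries   */

static void toy_refcnt_dec(ToyMortalSV *sv)
{
    if (--sv->refcnt == 0)
        sv->freed = 1;                       /* stands in for destruction */
}

/* Like sv_2mortal in spirit: record one deferred decrement.
 * (The toy silently ignores the request when its array is full.) */
static ToyMortalSV *toy_2mortal(ToyMortalSV *sv)
{
    if (toy_tmps_ix < TOY_TMPS_MAX)
        toy_tmps[toy_tmps_ix++] = sv;
    return sv;
}

/* Like FREETMPS in spirit: perform every decrement recorded since 'floor'. */
static void toy_freetmps(int floor)
{
    while (toy_tmps_ix > floor)
        toy_refcnt_dec(toy_tmps[--toy_tmps_ix]);
}

/* Returns 1 if mortalizing twice drives a count of 2 all the way to zero. */
static int toy_double_mortal_frees(void)
{
    ToyMortalSV sv = { 2, 0 };               /* two owners hold a reference */
    int floor = toy_tmps_ix;                 /* plays the role of SAVETMPS  */
    toy_2mortal(&sv);
    toy_2mortal(&sv);                        /* mistake: mortalized twice   */
    toy_freetmps(floor);
    return sv.freed;
}
```

In this model nothing happens to the count at the time of C<toy_2mortal>;
the decrement is only applied by C<toy_freetmps>, which is why mortalizing
the same value twice costs it two references later on.
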
=head2 Stashes and Globs

A B<stash> is a hash that contains all variables that are defined
within a package. Each key of the stash is a symbol
name (shared by all the different types of objects that have the same
name), and each value in the hash table is a GV (Glob Value). This GV
in turn contains references to the various objects of that name,
including (but not limited to) the following:

    Scalar Value
    Array Value
    Hash Value
    I/O Handle
    Format
    Subroutine

There is a single stash called C<PL_defstash> that holds the items that exist
in the C<main> package. To get at the items in other packages, append the
string "::" to the package name. The items in the C<Foo> package are in
the stash C<Foo::> in C<PL_defstash>. The items in the C<Bar::Baz> package are
in the stash C<Baz::> in C<Bar::>'s stash.

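The naming rule for nested stashes can be illustrated with a small,
self-contained C helper. It does not touch any real symbol table; the
function name C<toy_stash_key> is made up for this sketch, and it only
computes which key one would look up at each nesting level.

```c
#include <string.h>

/* Hypothetical helper: writes the n-th (0-based) stash key for 'pkg' into
 * 'out', e.g. for "Bar::Baz": n=0 gives "Bar::", n=1 gives "Baz::".
 * Returns 1 on success, 0 if there is no n-th component or 'out' is too
 * small to hold the key. */
static int toy_stash_key(const char *pkg, int n, char *out, size_t outlen)
{
    const char *start = pkg;
    for (;;) {
        const char *sep = strstr(start, "::");
        size_t len = sep ? (size_t)(sep - start) : strlen(start);
        if (n == 0) {
            if (len == 0 || len + 3 > outlen)
                return 0;
            memcpy(out, start, len);
            memcpy(out + len, "::", 3);   /* appends "::" plus the NUL */
            return 1;
        }
        if (!sep)
            return 0;                     /* ran out of components */
        start = sep + 2;
        n--;
    }
}
```

For C<Bar::Baz> the first key would be looked up in the default stash and
the second in the stash found under the first.
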
To get the stash pointer for a particular package, use one of these functions:

    HV*  gv_stashpv(const char* name, I32 create)
    HV*  gv_stashsv(SV*, I32 create)

The first function takes a literal string, the second uses the string stored
in the SV. Remember that a stash is just a hash table, so you get back an
C<HV*>. The C<create> flag will create a new package if it is set.

The name that C<gv_stash*v> wants is the name of the package whose symbol table
you want. The default package is called C<main>. If you have multiply nested
packages, pass their names to C<gv_stash*v>, separated by C<::> as in the Perl
language itself.

Alternatively, if you have an SV that is a blessed reference, you can find
out the stash pointer by using:

    HV*  SvSTASH(SvRV(SV*));

then use the following to get the package name itself:

    char*  HvNAME(HV* stash);

If you need to bless or re-bless an object you can use the following
function:

    SV*  sv_bless(SV*, HV* stash)

where the first argument, an C<SV*>, must be a reference, and the second
argument is a stash. The returned C<SV*> can now be used in the same way
as any other SV.

For more information on references and blessings, consult L<perlref>.

=head2 Double-Typed SVs

Scalar variables normally contain only one type of value, an integer,
double, pointer, or reference. Perl will automatically convert the
actual scalar data from the stored type into the requested type.

Some scalar variables contain more than one type of scalar data. For
example, the variable C<$!> contains either the numeric value of C<errno>
or its string equivalent from either C<strerror> or C<sys_errlist[]>.

To force multiple data values into an SV, you must do two things: use the
C<sv_set*v> routines to add the additional scalar type, then set a flag
so that Perl will believe it contains more than one type of data. The
four macros to set the flags are:

    SvIOK_on
    SvNOK_on
    SvPOK_on
    SvROK_on

The particular macro you must use depends on which C<sv_set*v> routine
you called first. This is because every C<sv_set*v> routine turns on
only the bit for the particular type of data being set, and turns off
all the rest.

For example, to create a new Perl variable called "dberror" that contains
both the numeric and descriptive string error values, you could use the
following code:

    extern int  dberror;
    extern char *dberror_list[];

    SV* sv = get_sv("dberror", TRUE);
    sv_setiv(sv, (IV) dberror);
    sv_setpv(sv, dberror_list[dberror]);
    SvIOK_on(sv);

If the order of C<sv_setiv> and C<sv_setpv> had been reversed, then the
macro C<SvPOK_on> would need to be called instead of C<SvIOK_on>.

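Why the flag has to be turned back on by hand may be easier to see in a toy
model. The sketch below is plain C, not the Perl API: the flag values, the
C<ToyDualSV> type and the C<toy_set*> helpers are all invented here, and they
assume only what the text states, namely that each setter turns its own bit
on and every other bit off.

```c
/* Toy model only; the flag values and names are hypothetical. */
enum { TOY_IOK = 1, TOY_NOK = 2, TOY_POK = 4, TOY_ROK = 8 };

typedef struct {
    unsigned flags;      /* which slots currently hold a valid value */
    long iv;             /* integer slot */
    const char *pv;      /* string slot  */
} ToyDualSV;

/* Each setter stores its value, turns its own bit on and all others off. */
static void toy_setiv(ToyDualSV *sv, long i)
{
    sv->iv = i;
    sv->flags = TOY_IOK;
}

static void toy_setpv(ToyDualSV *sv, const char *s)
{
    sv->pv = s;
    sv->flags = TOY_POK;
}

/* Build a dual-valued scalar in the same order as the dberror example. */
static ToyDualSV toy_make_dualvar(long num, const char *str)
{
    ToyDualSV sv = { 0, 0, 0 };
    toy_setiv(&sv, num);      /* flags: IOK                    */
    toy_setpv(&sv, str);      /* flags: POK (the IOK bit lost) */
    sv.flags |= TOY_IOK;      /* like SvIOK_on: the integer slot is still valid */
    return sv;
}
```

The integer was never overwritten by the second setter; only the bit saying
"this slot is valid" was cleared, so setting that one bit again is enough.
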
=head2 Magic Variables

[This section still under construction. Ignore everything here. Post no
bills. Everything not permitted is forbidden.]

Any SV may be magical, that is, it has special features that a normal
SV does not have. These features are stored in the SV structure in a
linked list of C<struct magic>'s, typedef'ed to C<MAGIC>.

    struct magic {
        MAGIC*      mg_moremagic;
        MGVTBL*     mg_virtual;
        U16         mg_private;
        char        mg_type;
        U8          mg_flags;
        SV*         mg_obj;
        char*       mg_ptr;
        I32         mg_len;
    };

Note this is current as of patchlevel 0, and could change at any time.

=head2 Assigning Magic

Perl adds magic to an SV using the sv_magic function:

    void sv_magic(SV* sv, SV* obj, int how, const char* name, I32 namlen);

The C<sv> argument is a pointer to the SV that is to acquire a new magical
feature.

If C<sv> is not already magical, Perl uses the C<SvUPGRADE> macro to
convert C<sv> to type C<SVt_PVMG>. Perl then continues by adding new magic
to the beginning of the linked list of magical features. Any prior entry
of the same type of magic is deleted. Note that this can be overridden,
and multiple instances of the same type of magic can be associated with an
SV.

The C<name> and C<namlen> arguments are used to associate a string with
the magic, typically the name of a variable. C<namlen> is stored in the
C<mg_len> field and, if C<name> is non-null and C<namlen> E<gt>= 0, a malloc'd
copy of the name is stored in the C<mg_ptr> field.

The sv_magic function uses C<how> to determine which, if any, predefined
"Magic Virtual Table" should be assigned to the C<mg_virtual> field.
See the L</Magic Virtual Tables> section below. The C<how> argument is also
stored in the C<mg_type> field. The value of C<how> should be chosen
from the set of macros C<PERL_MAGIC_foo> found in F<perl.h>. Note that before
these macros were added, Perl internals used to directly use character
literals, so you may occasionally come across old code or documentation
referring to 'U' magic rather than C<PERL_MAGIC_uvar> for example.

The C<obj> argument is stored in the C<mg_obj> field of the C<MAGIC>
structure. If it is not the same as the C<sv> argument, the reference
count of the C<obj> object is incremented. If it is the same, or if
the C<how> argument is C<PERL_MAGIC_arylen>, or if it is a NULL pointer,
then C<obj> is merely stored, without the reference count being incremented.

There is also a function to add magic to an C<HV>:

    void hv_magic(HV *hv, GV *gv, int how);

This simply calls C<sv_magic> and coerces the C<gv> argument into an C<SV>.

To remove the magic from an SV, call the function sv_unmagic:

    void sv_unmagic(SV *sv, int type);

The C<type> argument should be equal to the C<how> value when the C<SV>
was initially made magical.

=head2 Magic Virtual Tables

The C<mg_virtual> field in the C<MAGIC> structure is a pointer to an
C<MGVTBL> (short for "Magic Virtual Table"), which is a structure of
function pointers that handle the various operations that might be
applied to that variable.

The C<MGVTBL> has five pointers to the following routine types:

    int  (*svt_get)(SV* sv, MAGIC* mg);
    int  (*svt_set)(SV* sv, MAGIC* mg);
    U32  (*svt_len)(SV* sv, MAGIC* mg);
    int  (*svt_clear)(SV* sv, MAGIC* mg);
    int  (*svt_free)(SV* sv, MAGIC* mg);

This MGVTBL structure is set at compile-time in F<perl.h> and there are
currently 19 types (or 21 with overloading turned on). These different
structures contain pointers to various routines that perform additional
actions depending on which function is being called.

    Function pointer    Action taken
    ----------------    ------------
    svt_get             Do something before the value of the SV is retrieved.
    svt_set             Do something after the SV is assigned a value.
    svt_len             Report on the SV's length.
    svt_clear           Clear something the SV represents.
    svt_free            Free any extra storage associated with the SV.

For instance, the MGVTBL structure called C<vtbl_sv> (which corresponds
to an C<mg_type> of C<PERL_MAGIC_sv>) contains:

    { magic_get, magic_set, magic_len, 0, 0 }

Thus, when an SV is determined to be magical and of type C<PERL_MAGIC_sv>,
if a get operation is being performed, the routine C<magic_get> is
called. All the various routines for the various magical types begin
with C<magic_>. NOTE: the magic routines are not considered part of
the Perl API, and may not be exported by the Perl library.

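The dispatch through a table of function pointers, with empty slots simply
skipped, can be sketched in plain C. This is a toy model under stated
assumptions: the types C<ToyVtbl> and C<ToyMagicSV> and all C<toy_*> functions
are hypothetical, hooks take only the value itself rather than a separate
magic record, and the "set" hook's doubling is an arbitrary visible side
effect chosen for the example.

```c
#include <stddef.h>

/* Toy model only: these types mimic the shape of MGVTBL, not perl's code. */
typedef struct toy_magic_sv ToyMagicSV;
typedef int (*toy_hook)(ToyMagicSV *sv);

typedef struct {
    toy_hook get, set, len, clear, free_;   /* five slots; NULL means "none" */
} ToyVtbl;

struct toy_magic_sv {
    int value;
    int get_calls;
    const ToyVtbl *vtbl;                    /* NULL for a non-magical value */
};

static int toy_magic_get(ToyMagicSV *sv) { sv->get_calls++; return 0; }
static int toy_magic_set(ToyMagicSV *sv) { sv->value *= 2;  return 0; }

/* Same layout idea as { magic_get, magic_set, magic_len, 0, 0 }. */
static const ToyVtbl toy_vtbl = { toy_magic_get, toy_magic_set, NULL, NULL, NULL };

/* Reading runs the 'get' hook first, if the table supplies one. */
static int toy_fetch(ToyMagicSV *sv)
{
    if (sv->vtbl && sv->vtbl->get)
        sv->vtbl->get(sv);
    return sv->value;
}

/* Writing stores the value, then runs the 'set' hook, if any. */
static void toy_store(ToyMagicSV *sv, int v)
{
    sv->value = v;
    if (sv->vtbl && sv->vtbl->set)
        sv->vtbl->set(sv);
}
```

The ordering follows the table of actions: the get hook fires before the
value is handed out, and the set hook fires after the assignment.
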
The current kinds of Magic Virtual Tables are:

    mg_type
    (old-style char and macro)   MGVTBL          Type of magic
    --------------------------   ------          ----------------------------
    \0 PERL_MAGIC_sv             vtbl_sv         Special scalar variable
    A  PERL_MAGIC_overload       vtbl_amagic     %OVERLOAD hash
    a  PERL_MAGIC_overload_elem  vtbl_amagicelem %OVERLOAD hash element
    c  PERL_MAGIC_overload_table (none)          Holds overload table (AMT)
                                                 on stash
    B  PERL_MAGIC_bm             vtbl_bm         Boyer-Moore (fast string search)
    D  PERL_MAGIC_regdata        vtbl_regdata    Regex match position data
                                                 (@+ and @- vars)
    d  PERL_MAGIC_regdatum       vtbl_regdatum   Regex match position data
                                                 element
    E  PERL_MAGIC_env            vtbl_env        %ENV hash
    e  PERL_MAGIC_envelem        vtbl_envelem    %ENV hash element
    f  PERL_MAGIC_fm             vtbl_fm         Formline ('compiled' format)
    g  PERL_MAGIC_regex_global   vtbl_mglob      m//g target / study()ed string
    I  PERL_MAGIC_isa            vtbl_isa        @ISA array
    i  PERL_MAGIC_isaelem        vtbl_isaelem    @ISA array element
    k  PERL_MAGIC_nkeys          vtbl_nkeys      scalar(keys()) lvalue
    L  PERL_MAGIC_dbfile         (none)          Debugger %_<filename
    l  PERL_MAGIC_dbline         vtbl_dbline     Debugger %_<filename element
    m  PERL_MAGIC_mutex          vtbl_mutex      ???
    o  PERL_MAGIC_collxfrm       vtbl_collxfrm   Locale collate transformation
    P  PERL_MAGIC_tied           vtbl_pack       Tied array or hash
    p  PERL_MAGIC_tiedelem       vtbl_packelem   Tied array or hash element
    q  PERL_MAGIC_tiedscalar     vtbl_packelem   Tied scalar or handle
    r  PERL_MAGIC_qr             vtbl_qr         precompiled qr// regex
    S  PERL_MAGIC_sig            vtbl_sig        %SIG hash
    s  PERL_MAGIC_sigelem        vtbl_sigelem    %SIG hash element
    t  PERL_MAGIC_taint          vtbl_taint      Taintedness
    U  PERL_MAGIC_uvar           vtbl_uvar       Available for use by extensions
    v  PERL_MAGIC_vec            vtbl_vec        vec() lvalue
    V  PERL_MAGIC_vstring        (none)          v-string scalars
    w  PERL_MAGIC_utf8           vtbl_utf8       UTF-8 length+offset cache
    x  PERL_MAGIC_substr         vtbl_substr     substr() lvalue
    y  PERL_MAGIC_defelem        vtbl_defelem    Shadow "foreach" iterator
                                                 variable / smart parameter
                                                 vivification
    *  PERL_MAGIC_glob           vtbl_glob       GV (typeglob)
    #  PERL_MAGIC_arylen         vtbl_arylen     Array length ($#ary)
    .  PERL_MAGIC_pos            vtbl_pos        pos() lvalue
    <  PERL_MAGIC_backref        vtbl_backref    ???
    ~  PERL_MAGIC_ext            (none)          Available for use by extensions

When an uppercase and lowercase letter both exist in the table, then the
uppercase letter is typically used to represent some kind of composite type
(a list or a hash), and the lowercase letter is used to represent an element
of that composite type. Some internals code makes use of this case
relationship. However, 'v' and 'V' (vec and v-string) are in no way related.

The C<PERL_MAGIC_ext> and C<PERL_MAGIC_uvar> magic types are defined
specifically for use by extensions and will not be used by perl itself.
Extensions can use C<PERL_MAGIC_ext> magic to 'attach' private information
to variables (typically objects). This is especially useful because
there is no way for normal perl code to corrupt this private information
(unlike using extra elements of a hash object).

Similarly, C<PERL_MAGIC_uvar> magic can be used much like tie() to call a
C function any time a scalar's value is used or changed. The C<MAGIC>'s
C<mg_ptr> field points to a C<ufuncs> structure:

    struct ufuncs {
        I32 (*uf_val)(pTHX_ IV, SV*);
        I32 (*uf_set)(pTHX_ IV, SV*);
        IV uf_index;
    };

When the SV is read from or written to, the C<uf_val> or C<uf_set>
function will be called with C<uf_index> as the first arg and a pointer to
the SV as the second. A simple example of how to add C<PERL_MAGIC_uvar>
magic is shown below. Note that the ufuncs structure is copied by
sv_magic, so you can safely allocate it on the stack.

    void
    Umagic(sv)
        SV *sv;
    PREINIT:
        struct ufuncs uf;
    CODE:
        uf.uf_val   = &my_get_fn;
        uf.uf_set   = &my_set_fn;
        uf.uf_index = 0;
        sv_magic(sv, 0, PERL_MAGIC_uvar, (char*)&uf, sizeof(uf));

Note that because multiple extensions may be using C<PERL_MAGIC_ext>
or C<PERL_MAGIC_uvar> magic, it is important for extensions to take
extra care to avoid conflict. Typically only using the magic on
objects blessed into the same class as the extension is sufficient.
For C<PERL_MAGIC_ext> magic, it may also be appropriate to add an I32
'signature' at the top of the private data area and check that.

Also note that the C<sv_set*()> and C<sv_cat*()> functions described
earlier do B<not> invoke 'set' magic on their targets. This must
be done by the user either by calling the C<SvSETMAGIC()> macro after
calling these functions, or by using one of the C<sv_set*_mg()> or
C<sv_cat*_mg()> functions. Similarly, generic C code must call the
C<SvGETMAGIC()> macro to invoke any 'get' magic if it uses an SV
obtained from external sources in functions that don't handle magic.
See L<perlapi> for a description of these functions.
For example, calls to the C<sv_cat*()> functions typically need to be
followed by C<SvSETMAGIC()>, but they don't need a prior C<SvGETMAGIC()>
since their implementation handles 'get' magic.

=head2 Finding Magic

    MAGIC* mg_find(SV*, int type); /* Finds the magic pointer of that type */

This routine returns a pointer to the C<MAGIC> structure stored in the SV.
If the SV does not have that magical feature, C<NULL> is returned. Also,
if the SV is not of type C<SVt_PVMG>, Perl may core dump.

    int mg_copy(SV* sv, SV* nsv, const char* key, STRLEN klen);

This routine checks to see what types of magic C<sv> has. If the mg_type
field is an uppercase letter, then the mg_obj is copied to C<nsv>, but
the mg_type field is changed to be the lowercase letter.

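Both ideas, walking the chain to find one type of magic and copying
uppercase magic under its lowercase letter, can be sketched in plain C.
The record C<ToyMagic> is a cut-down, hypothetical stand-in for the real
structure, and the two functions only model the behaviour described above;
they are not the signatures or the code of C<mg_find> and C<mg_copy>.

```c
#include <ctype.h>
#include <stddef.h>

/* Toy model only: a cut-down magic record, not perl's struct magic. */
typedef struct toy_magic {
    struct toy_magic *more;   /* next entry, like mg_moremagic */
    char type;                /* like mg_type */
    void *obj;                /* like mg_obj  */
} ToyMagic;

/* Walk the chain and return the first entry of the wanted type, or NULL. */
static ToyMagic *toy_mg_find(ToyMagic *head, char type)
{
    ToyMagic *mg;
    for (mg = head; mg; mg = mg->more)
        if (mg->type == type)
            return mg;
    return NULL;
}

/* For every uppercase-typed entry in 'src', fill the next free slot of
 * 'dst' with the same obj under the lowercase type.  Returns the number
 * of entries written (at most 'max'). */
static int toy_mg_copy(const ToyMagic *src, ToyMagic *dst, int max)
{
    int n = 0;
    for (; src && n < max; src = src->more) {
        if (isupper((unsigned char)src->type)) {
            dst[n].type = (char)tolower((unsigned char)src->type);
            dst[n].obj  = src->obj;
            dst[n].more = NULL;
            if (n > 0)
                dst[n - 1].more = &dst[n];   /* chain the copies together */
            n++;
        }
    }
    return n;
}
```

With a container carrying 'P' (tied) magic, the copy ends up with 'p'
(tied element) magic that still points at the same object.
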
=head2 Understanding the Magic of Tied Hashes and Arrays

Tied hashes and arrays are magical beasts of the C<PERL_MAGIC_tied>
magic type.

WARNING: As of the 5.004 release, proper usage of the array and hash
access functions requires understanding a few caveats. Some
of these caveats are actually considered bugs in the API, to be fixed
in later releases, and are bracketed with [MAYCHANGE] below. If
you find yourself actually applying such information in this section, be
aware that the behavior may change in the future, umm, without warning.

The perl tie function associates a variable with an object that implements
the various GET, SET, etc. methods. To perform the equivalent of the perl
tie function from an XSUB, you must mimic this behaviour. The code below
carries out the necessary steps - firstly it creates a new hash, and then
creates a second hash which it blesses into the class which will implement
the tie methods. Lastly it ties the two hashes together, and returns a
reference to the new tied hash. Note that the code below does NOT call the
TIEHASH method in the MyTie class -
see L</Calling Perl Routines from within C Programs> for details on how
to do this.

    SV*
    mytie()
    PREINIT:
        HV *hash;
        HV *stash;
        SV *tie;
    CODE:
        hash = newHV();
        tie = newRV_noinc((SV*)newHV());
        stash = gv_stashpv("MyTie", TRUE);
        sv_bless(tie, stash);
        hv_magic(hash, (GV*)tie, PERL_MAGIC_tied);
        RETVAL = newRV_noinc(hash);
    OUTPUT:
        RETVAL

The C<av_store> function, when given a tied array argument, merely
copies the magic of the array onto the value to be "stored", using
C<mg_copy>. It may also return NULL, indicating that the value did not
actually need to be stored in the array. [MAYCHANGE] After a call to
C<av_store> on a tied array, the caller will usually need to call
C<mg_set(val)> to actually invoke the perl level "STORE" method on the
TIEARRAY object. If C<av_store> did return NULL, a call to
C<SvREFCNT_dec(val)> will usually also be necessary to avoid a memory
leak. [/MAYCHANGE]

The previous paragraph is applicable verbatim to tied hash access using the
C<hv_store> and C<hv_store_ent> functions as well.

C<av_fetch> and the corresponding hash functions C<hv_fetch> and
C<hv_fetch_ent> actually return an undefined mortal value whose magic
has been initialized using C<mg_copy>. Note the value so returned does not
need to be deallocated, as it is already mortal. [MAYCHANGE] But you will
need to call C<mg_get()> on the returned value in order to actually invoke
the perl level "FETCH" method on the underlying TIE object. Similarly,
you may also call C<mg_set()> on the return value after possibly assigning
a suitable value to it using C<sv_setsv>, which will invoke the "STORE"
method on the TIE object. [/MAYCHANGE]

[MAYCHANGE]
In other words, the array or hash fetch/store functions don't really
fetch and store actual values in the case of tied arrays and hashes. They
merely call C<mg_copy> to attach magic to the values that were meant to be
"stored" or "fetched". Later calls to C<mg_get> and C<mg_set> actually
do the job of invoking the TIE methods on the underlying objects. Thus
the magic mechanism currently implements a kind of lazy access to arrays
and hashes.

Currently (as of perl version 5.004), use of the hash and array access
functions requires the user to be aware of whether they are operating on
"normal" hashes and arrays, or on their tied variants. The API may be
changed to provide more transparent access to both tied and normal data
types in future versions.
[/MAYCHANGE]

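The "lazy access" just described can be pictured with a toy model in plain
C. Nothing here is the Perl API: C<ToyTie> plays the tie object, C<ToyElem>
the magical placeholder that a fetch hands back, and the function names are
invented for the sketch.

```c
/* Toy model only: 'ToyTie' plays the tie object, 'ToyElem' the magical
 * placeholder that a fetch on a tied array hands back. */
typedef struct {
    int data[4];
    int fetch_calls;                      /* counts simulated FETCH calls */
} ToyTie;

typedef struct {
    ToyTie *tie;                          /* copied "magic": whom to ask */
    int index;
    int defined;                          /* 0 until toy_mg_get() runs   */
    int value;
} ToyElem;

/* Like av_fetch on a tied array: attach magic, but fetch nothing yet. */
static ToyElem toy_av_fetch(ToyTie *tie, int index)
{
    ToyElem e = { tie, index, 0, 0 };
    return e;
}

/* Like mg_get: only now is the tie object's FETCH actually invoked. */
static void toy_mg_get(ToyElem *e)
{
    e->tie->fetch_calls++;
    e->value = e->tie->data[e->index];
    e->defined = 1;
}
```

In the model, forgetting the second call leaves the caller holding an
undefined placeholder, which mirrors the caveat in the text.
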
You would do well to understand that the TIEARRAY and TIEHASH interfaces
are mere sugar to invoke some perl method calls while using the uniform hash
and array syntax. The use of this sugar imposes some overhead (typically
about two to four extra opcodes per FETCH/STORE operation, in addition to
the creation of all the mortal variables required to invoke the methods).
This overhead will be comparatively small if the TIE methods are themselves
substantial, but if they are only a few statements long, the overhead
will not be insignificant.

=head2 Localizing changes

Perl has a very handy construction

    {
        local $var = 2;
        ...
    }

This construction is I<approximately> equivalent to

    {
        my $oldvar = $var;
        $var = 2;
        ...
        $var = $oldvar;
    }

The biggest difference is that the first construction would
reinstate the initial value of $var, irrespective of how control exits
the block: C<goto>, C<return>, C<die>/C<eval>, etc. It is a little bit
more efficient as well.

There is a way to achieve a similar task from C via the Perl API: create a
I<pseudo-block>, and arrange for some changes to be automatically
undone at the end of it, either explicitly or via a non-local exit (via
die()). A I<block>-like construct is created by a pair of
C<ENTER>/C<LEAVE> macros (see L<perlcall/"Returning a Scalar">).
Such a construct may be created specially for some important localized
task, or an existing one (like boundaries of enclosing Perl
subroutine/block, or an existing pair for freeing TMPs) may be
used. (In the second case the overhead of additional localization must
be almost negligible.) Note that any XSUB is automatically enclosed in
an C<ENTER>/C<LEAVE> pair.

Inside such a I<pseudo-block> the following services are available:

=over 4

=item C<SAVEINT(int i)>

=item C<SAVEIV(IV i)>

=item C<SAVEI32(I32 i)>

=item C<SAVELONG(long i)>

These macros arrange things to restore the value of integer variable
C<i> at the end of enclosing I<pseudo-block>.

=item C<SAVESPTR(s)>

=item C<SAVEPPTR(p)>

These macros arrange things to restore the value of pointers C<s> and
C<p>. C<s> must be a pointer of a type which survives conversion to
C<SV*> and back, C<p> should be able to survive conversion to C<char*>
and back.

=item C<SAVEFREESV(SV *sv)>

The refcount of C<sv> would be decremented at the end of
I<pseudo-block>. This is similar to C<sv_2mortal> in that it is also a
mechanism for doing a delayed C<SvREFCNT_dec>. However, while C<sv_2mortal>
extends the lifetime of C<sv> until the beginning of the next statement,
C<SAVEFREESV> extends it until the end of the enclosing scope. These
lifetimes can be wildly different.

Also compare C<SAVEMORTALIZESV>.

=item C<SAVEMORTALIZESV(SV *sv)>

Just like C<SAVEFREESV>, but mortalizes C<sv> at the end of the current
scope instead of decrementing its reference count. This usually has the
effect of keeping C<sv> alive until the statement that called the currently
live scope has finished executing.

=item C<SAVEFREEOP(OP *op)>

The C<OP *> is op_free()ed at the end of I<pseudo-block>.

=item C<SAVEFREEPV(p)>

The chunk of memory which is pointed to by C<p> is Safefree()ed at the
end of I<pseudo-block>.

=item C<SAVECLEARSV(SV *sv)>

Clears a slot in the current scratchpad which corresponds to C<sv> at
the end of I<pseudo-block>.

=item C<SAVEDELETE(HV *hv, char *key, I32 length)>

The key C<key> of C<hv> is deleted at the end of I<pseudo-block>. The
string pointed to by C<key> is Safefree()ed. If one has a I<key> in
short-lived storage, the corresponding string may be reallocated like
this:

    SAVEDELETE(PL_defstash, savepv(tmpbuf), strlen(tmpbuf));

=item C<SAVEDESTRUCTOR(DESTRUCTORFUNC_NOCONTEXT_t f, void *p)>

At the end of I<pseudo-block> the function C<f> is called with the
only argument C<p>.

=item C<SAVEDESTRUCTOR_X(DESTRUCTORFUNC_t f, void *p)>

At the end of I<pseudo-block> the function C<f> is called with the
implicit context argument (if any), and C<p>.

=item C<SAVESTACK_POS()>

The current offset on the Perl internal stack (cf. C<SP>) is restored
at the end of I<pseudo-block>.

=back

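The general shape of "save now, restore at the end of the pseudo-block" can
be sketched with a tiny save stack in plain C. This is a toy model only:
the names C<toy_enter>, C<toy_saveint> and C<toy_leave> are hypothetical, it
handles nothing but C<int> variables, and it says nothing about how perl
lays out its own save stack.

```c
/* Toy model only: a tiny save stack for int variables.  Names are
 * hypothetical; the real mechanism handles many more kinds of value. */
#define TOY_SAVE_MAX 32

typedef struct { int *addr; int old; } ToySave;

static ToySave toy_savestack[TOY_SAVE_MAX];
static int toy_save_ix = 0;

/* Like ENTER: remember where this pseudo-block's entries start. */
static int toy_enter(void) { return toy_save_ix; }

/* Like SAVEINT(i): record the current value so it can be put back.
 * Returns 0 (and saves nothing) if the toy stack is full. */
static int toy_saveint(int *addr)
{
    if (toy_save_ix >= TOY_SAVE_MAX)
        return 0;
    toy_savestack[toy_save_ix].addr = addr;
    toy_savestack[toy_save_ix].old  = *addr;
    toy_save_ix++;
    return 1;
}

/* Like LEAVE: undo, newest first, everything saved since toy_enter(). */
static void toy_leave(int mark)
{
    while (toy_save_ix > mark) {
        toy_save_ix--;
        *toy_savestack[toy_save_ix].addr = toy_savestack[toy_save_ix].old;
    }
}
```

Because entries are undone newest first, nested pseudo-blocks unwind in the
expected order: the inner block restores the value the outer block had set,
and the outer block then restores the original.
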
The following API list contains functions, thus one needs to
provide pointers to the modifiable data explicitly (either C pointers,
or Perlish C<GV *>s). Where the above macros take C<int>, a similar
function takes C<int *>.

=over 4

=item C<SV* save_scalar(GV *gv)>

Equivalent to Perl code C<local $gv>.

=item C<AV* save_ary(GV *gv)>

=item C<HV* save_hash(GV *gv)>

Similar to C<save_scalar>, but localize C<@gv> and C<%gv>.

=item C<void save_item(SV *item)>

Duplicates the current value of C<SV>; on exit from the current
C<ENTER>/C<LEAVE> I<pseudo-block>, the value of C<SV> is restored
using the stored value.

=item C<void save_list(SV **sarg, I32 maxsarg)>

A variant of C<save_item> which takes multiple arguments via an array
C<sarg> of C<SV*> of length C<maxsarg>.

=item C<SV* save_svref(SV **sptr)>

Similar to C<save_scalar>, but will reinstate an C<SV *>.

=item C<void save_aptr(AV **aptr)>

=item C<void save_hptr(HV **hptr)>

Similar to C<save_svref>, but localize C<AV *> and C<HV *>.

=back

The C<Alias> module implements localization of the basic types within the
I<caller's scope>. People who are interested in how to localize things in
the containing scope should take a look there too.

=head1 Subroutines

=head2 XSUBs and the Argument Stack

The XSUB mechanism is a simple way for Perl programs to access C subroutines.
An XSUB routine will have a stack that contains the arguments from the Perl
program, and a way to map from the Perl data structures to a C equivalent.

The stack arguments are accessible through the C<ST(n)> macro, which returns
the C<n>'th stack argument. Argument 0 is the first argument passed in the
Perl subroutine call. These arguments are C<SV*>, and can be used anywhere
an C<SV*> is used.

Most of the time, output from the C routine can be handled through use of
the RETVAL and OUTPUT directives. However, there are some cases where the
argument stack is not already long enough to handle all the return values.
An example is the POSIX tzname() call, which takes no arguments, but returns
two, the local time zone's standard and summer time abbreviations.

To handle this situation, the PPCODE directive is used and the stack is
extended using the macro:

    EXTEND(SP, num);

where C<SP> is the macro that represents the local copy of the stack pointer,
and C<num> is the number of elements the stack should be extended by.

Now that there is room on the stack, values can be pushed on it using the
C<PUSHs> macro. The pushed values will often need to be "mortal" (see
L</Reference Counts and Mortality>).

    PUSHs(sv_2mortal(newSViv(an_integer)));
    PUSHs(sv_2mortal(newSVpv("Some String",0)));
    PUSHs(sv_2mortal(newSVnv(3.141592)));

Now, in the Perl program calling C<tzname>, the two values will be assigned
as in:

    ($standard_abbrev, $summer_abbrev) = POSIX::tzname;

An alternate (and possibly simpler) method to pushing values on the stack is
to use the macro:

    XPUSHs(SV*)

This macro automatically adjusts the stack for you, if needed. Thus, you
do not need to call C<EXTEND> to extend the stack.

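The difference between "extend, then push" and "push with automatic
extension" can be shown on a toy stack in plain C. The sketch assumes
nothing about perl's real stack: C<ToyStack> holds plain C<int>s instead of
C<SV*>, and C<toy_extend>, C<toy_push> and C<toy_xpush> are hypothetical
names chosen to echo the macros above.

```c
#include <stdlib.h>

/* Toy model only: a growable stack of ints standing in for the SV*
 * argument stack.  All names here are hypothetical. */
typedef struct {
    int *base;
    size_t len;      /* slots in use    */
    size_t cap;      /* slots allocated */
} ToyStack;

/* Like EXTEND(SP, n): guarantee room for n more pushes.  Returns 0 on
 * allocation failure, 1 otherwise. */
static int toy_extend(ToyStack *st, size_t n)
{
    if (st->cap - st->len < n) {
        size_t newcap = st->len + n;
        int *p = realloc(st->base, newcap * sizeof(int));
        if (!p)
            return 0;
        st->base = p;
        st->cap  = newcap;
    }
    return 1;
}

/* Like PUSHs: the caller must already have made room. */
static void toy_push(ToyStack *st, int v) { st->base[st->len++] = v; }

/* Like XPUSHs: make room if needed, then push. */
static int toy_xpush(ToyStack *st, int v)
{
    if (!toy_extend(st, 1))
        return 0;
    toy_push(st, v);
    return 1;
}
```

Reserving all the room up front costs one check for many pushes, while the
self-extending push repeats the check on every call; which one reads better
depends on whether the number of return values is known in advance.
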
Despite their suggestions in earlier versions of this document, the macros
C<PUSHi>, C<PUSHn> and C<PUSHp> are I<not> suited to XSUBs which return
multiple results; see L</Putting a C value on Perl stack>.

For more information, consult L<perlxs> and L<perlxstut>.

=head2 Calling Perl Routines from within C Programs

There are four routines that can be used to call a Perl subroutine from
within a C program. These four are:

    I32  call_sv(SV*, I32);
    I32  call_pv(const char*, I32);
    I32  call_method(const char*, I32);
    I32  call_argv(const char*, I32, register char**);

The routine most often used is C<call_sv>. The C<SV*> argument
contains either the name of the Perl subroutine to be called, or a
reference to the subroutine. The second argument consists of flags
that control the context in which the subroutine is called, whether
or not the subroutine is being passed arguments, how errors should be
trapped, and how to treat return values.

All four routines return the number of arguments that the subroutine returned
on the Perl stack.

These routines used to be called C<perl_call_sv>, etc., before Perl v5.6.0,
but those names are now deprecated; macros of the same name are provided for
compatibility.

When using any of these routines (except C<call_argv>), the programmer
must manipulate the Perl stack. The tools for doing so include the following
macros and functions:

    dSP
    SP
    PUSHMARK()
    PUTBACK
    SPAGAIN
    ENTER
    SAVETMPS
    FREETMPS
    LEAVE
    XPUSH*()
    POP*()

For a detailed description of calling conventions from C to Perl,
consult L<perlcall>.

=head2 Memory Allocation

=head3 Allocation

All memory meant to be used with the Perl API functions should be manipulated
using the macros described in this section. The macros provide the necessary
transparency between differences in the actual malloc implementation that is
used within perl.

It is suggested that you enable the version of malloc that is distributed
with Perl. It keeps pools of various sizes of unallocated memory in
order to satisfy allocation requests more quickly. However, on some
platforms, it may cause spurious malloc or free errors.

The following three macros are used to initially allocate memory:

    New(x, pointer, number, type);
    Newc(x, pointer, number, type, cast);
    Newz(x, pointer, number, type);

The first argument C<x> was a "magic cookie" that was used to keep track
of who called the macro, to help when debugging memory problems. However,
the current code makes no use of this feature (most Perl developers now
use run-time memory checkers), so this argument can be any number.

The second argument C<pointer> should be the name of a variable that will
point to the newly allocated memory.

The third and fourth arguments C<number> and C<type> specify how many of
the specified type of data structure should be allocated. The argument
C<type> is passed to C<sizeof>. The final argument to C<Newc>, C<cast>,
should be used if the C<pointer> argument is different from the C<type>
argument.

Unlike the C<New> and C<Newc> macros, the C<Newz> macro calls C<memzero>
to zero out all the newly allocated memory.

06f6df17
RGS
1434=head3 Reallocation
1435
d1b91892
AD
1436 Renew(pointer, number, type);
1437 Renewc(pointer, number, type, cast);
1438 Safefree(pointer)
1439
1440These three macros are used to change a memory buffer size or to free a
1441piece of memory no longer needed. The arguments to C<Renew> and C<Renewc>
1442match those of C<New> and C<Newc> with the exception of not needing the
1443"magic cookie" argument.
1444
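As a rough illustration of how the C<number> and C<type> arguments are
used, the following sketch defines stand-in macros on top of the C
library allocator. The C<MY_*> names, C<must> and C<grow_demo> are
hypothetical and exist only for this example; the real macros come from
the Perl headers and go through whichever malloc perl was built with.

```c
#include <assert.h>
#include <stdlib.h>

/* Abort on allocation failure so the sketch needs no NULL checks. */
static void *must(void *p) { if (p == NULL) abort(); return p; }

/* Hypothetical stand-ins, NOT the real Perl macros: they only mimic the
 * argument conventions of New, Newz, Renew and Safefree. */
#define MY_NEW(x, ptr, n, type) \
    ((ptr) = (type *)must(malloc((n) * sizeof(type))))
#define MY_NEWZ(x, ptr, n, type) \
    ((ptr) = (type *)must(calloc((n), sizeof(type))))
#define MY_RENEW(ptr, n, type) \
    ((ptr) = (type *)must(realloc((ptr), (n) * sizeof(type))))
#define MY_SAFEFREE(ptr) free(ptr)

/* Allocate 4 zeroed ints, fill them, grow the buffer to 8 ints, and
 * return the sum of the first 4, which the resize must preserve. */
static int grow_demo(void)
{
    int *buf;
    int i, sum = 0;

    MY_NEWZ(0, buf, 4, int);    /* the first argument is ignored */
    assert(buf[0] == 0);        /* the zeroing variant clears the memory */
    for (i = 0; i < 4; i++)
        buf[i] += i + 1;        /* buffer now holds 1, 2, 3, 4 */

    MY_RENEW(buf, 8, int);      /* new size is 8 * sizeof(int) bytes */
    for (i = 0; i < 4; i++)
        sum += buf[i];

    MY_SAFEFREE(buf);
    return sum;
}
```

The point of the sketch is only that the size in bytes is never spelled
out by the caller: it is always derived from C<number> and
C<sizeof(type)>.
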
=head3 Moving

    Move(source, dest, number, type);
    Copy(source, dest, number, type);
    Zero(dest, number, type);

These three macros are used to move, copy, or zero out previously allocated
memory. The C<source> and C<dest> arguments point to the source and
destination starting points. Perl will move, copy, or zero out C<number>
instances of the size of the C<type> data structure (using the C<sizeof>
function).

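The argument order (source first, then destination) is the reverse of
the C library's, which is easy to get wrong. The sketch below uses
hypothetical C<MY_*> stand-ins built on C<memmove>, C<memcpy> and
C<memset> to show it; it assumes, as in the C library, that only the
"move" flavor is safe for overlapping regions.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins that mimic the argument order of Move, Copy
 * and Zero: source first, destination second, then count and type. */
#define MY_MOVE(src, dst, n, type) memmove((dst), (src), (n) * sizeof(type))
#define MY_COPY(src, dst, n, type) memcpy((dst), (src), (n) * sizeof(type))
#define MY_ZERO(dst, n, type)      memset((dst), 0, (n) * sizeof(type))

/* Shift the first three elements of a[] one slot to the right (the two
 * regions overlap), copy the result into out[], then zero a[].
 * Returns the sum of a[] after zeroing. */
static int shift_demo(int out[4])
{
    int a[4] = { 1, 2, 3, 0 };

    assert(out != NULL);
    MY_MOVE(a, a + 1, 3, int);  /* a is now 1, 1, 2, 3 */
    MY_COPY(a, out, 4, int);    /* distinct buffers, so a plain copy works */
    MY_ZERO(a, 4, int);         /* a is now all zeros; out is unaffected */

    return a[0] + a[1] + a[2] + a[3];
}
```
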
=head2 PerlIO

The most recent development releases of Perl have been experimenting with
removing Perl's dependency on the "normal" standard I/O suite and allowing
other stdio implementations to be used. This involves creating a new
abstraction layer that then calls whichever implementation of stdio Perl
was compiled with. All XSUBs should now use the functions in the PerlIO
abstraction layer and not make any assumptions about what kind of stdio
is being used.

For a complete description of the PerlIO abstraction, consult L<perlapio>.

=head2 Putting a C value on Perl stack

A lot of opcodes (an opcode is an elementary operation in the internal
perl stack machine) put an SV* on the stack. However, as an optimization
the corresponding SV is (usually) not recreated each time. The opcodes
reuse specially assigned SVs (I<target>s) which are (as a corollary)
not constantly freed/created.

Each of the targets is created only once (but see
L<Scratchpads and recursion> below), and when an opcode needs to put
an integer, a double, or a string on the stack, it just sets the
corresponding parts of its I<target> and puts the I<target> on the stack.

The macro to put this target on the stack is C<PUSHTARG>, and it is
directly used in some opcodes, as well as indirectly in zillions of
others, which use it via C<(X)PUSH[pni]>.

Because the target is reused, you must be careful when pushing multiple
values on the stack. The following code will not do what you think:

    XPUSHi(10);
    XPUSHi(20);

This translates as "set C<TARG> to 10, push a pointer to C<TARG> onto
the stack; set C<TARG> to 20, push a pointer to C<TARG> onto the stack".
At the end of the operation, the stack does not contain the values 10
and 20, but actually contains two pointers to C<TARG>, which we have set
to 20. If you need to push multiple different values, use C<XPUSHs>,
which bypasses C<TARG>.

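The aliasing problem can be reproduced without any Perl headers. The
following toy model is only a sketch: plain C<int>s stand in for SVs,
and C<push_via_targ>, C<push_own> and the other names are invented for
this example rather than taken from the Perl API.

```c
#include <assert.h>

/* A toy model, not the Perl API: "values" are plain ints and the stack
 * holds pointers to them, the way the Perl stack holds SV pointers. */
static int  targ;        /* the single reusable target */
static int *stack[8];
static int  sp;          /* number of items currently on the stack */

/* Like XPUSHi in spirit: store the value in the shared target, then
 * push a pointer to that target. */
static void push_via_targ(int value)
{
    assert(sp < 8);
    targ = value;
    stack[sp++] = &targ;
}

/* Like XPUSHs in spirit: push a pointer to a value the caller owns. */
static void push_own(int *value)
{
    assert(sp < 8);
    stack[sp++] = value;
}

static void reset_stack(void) { sp = 0; }
static int  stack_item(int i) { return *stack[i]; }
```

Calling C<push_via_targ(10)> and then C<push_via_targ(20)> leaves two
stack entries that both read back as 20, because both point at the same
storage. Pushing two separately owned values with C<push_own> keeps
them distinct.
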
On a related note, if you do use C<(X)PUSH[npi]>, then you're going to
need a C<dTARG> in your variable declarations so that the C<*PUSH*>
macros can make use of the local variable C<TARG>.

=head2 Scratchpads

The question remains of when the SVs which are I<target>s for opcodes
are created. The answer is that they are created when the current unit --
a subroutine or a file (for opcodes for statements outside of
subroutines) -- is compiled. During this time a special anonymous Perl
array is created, which is called a scratchpad for the current
unit.

A scratchpad keeps SVs which are lexicals for the current unit and are
targets for opcodes. One can deduce that an SV lives on a scratchpad
by looking at its flags: lexicals have C<SVs_PADMY> set, and
I<target>s have C<SVs_PADTMP> set.

The correspondence between OPs and I<target>s is not 1-to-1. Different
OPs in the compile tree of the unit can use the same target, if this
would not conflict with the expected life of the temporary.

=head2 Scratchpads and recursion

It is not 100% true that a compiled unit contains a pointer to
the scratchpad AV. In fact it contains a pointer to an AV of
(initially) one element, and this element is the scratchpad AV. Why do
we need an extra level of indirection?

The answer is B<recursion>, and maybe B<threads>. Both of
these can create several execution pointers going into the same
subroutine. For the subroutine-child not to write over the temporaries
for the subroutine-parent (lifespan of which covers the call to the
child), the parent and the child should have different
scratchpads. (I<And> the lexicals should be separate anyway!)

So each subroutine is born with an array of scratchpads (of length 1).
On each entry to the subroutine the current depth of the recursion is
compared with the length of this array, and if the depth is greater,
a new scratchpad is created and pushed into the array.

The I<target>s on this scratchpad are C<undef>s, but they are already
marked with correct flags.

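The bookkeeping described above can be sketched in a few lines of
plain C. This is a toy model under stated assumptions, not perl's
implementation: a pad is a fixed block of C<int> slots, and
C<toy_sub>, C<toy_sub_enter> and the other names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define PAD_SLOTS 4

/* One scratchpad: a fixed block of slots (ints stand in for SVs). */
typedef struct { int slot[PAD_SLOTS]; } toy_pad;

/* A sub owns a growable list of pads, one per recursion depth seen. */
typedef struct {
    toy_pad **pads;   /* pads[0] is created together with the sub */
    int       npads;  /* length of the list */
    int       depth;  /* current recursion depth, 0 when not running */
} toy_sub;

static void *must(void *p) { if (p == NULL) abort(); return p; }

static toy_sub *toy_sub_new(void)
{
    toy_sub *s = must(malloc(sizeof *s));
    s->pads    = must(malloc(sizeof *s->pads));
    s->pads[0] = must(calloc(1, sizeof(toy_pad)));
    s->npads   = 1;
    s->depth   = 0;
    return s;
}

/* On entry: if we are deeper than ever before, push a fresh pad. */
static toy_pad *toy_sub_enter(toy_sub *s)
{
    s->depth++;
    if (s->depth > s->npads) {
        s->pads = must(realloc(s->pads,
                               (size_t)s->depth * sizeof *s->pads));
        s->pads[s->npads++] = must(calloc(1, sizeof(toy_pad)));
    }
    return s->pads[s->depth - 1];
}

static void toy_sub_leave(toy_sub *s)
{
    assert(s->depth > 0);
    s->depth--;
}

static void toy_sub_free(toy_sub *s)
{
    int i;
    for (i = 0; i < s->npads; i++)
        free(s->pads[i]);
    free(s->pads);
    free(s);
}
```

In this model a recursive call gets its own pad, so writes made by the
inner call do not disturb the outer call's slots, and a pad created for
a given depth is found again the next time that depth is reached.
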
=head1 Compiled code

=head2 Code tree

Here we describe the internal form your code is converted to by
Perl. Start with a simple example:

    $a = $b + $c;

This is converted to a tree similar to this one:

             assign-to
           /           \
          +             $a
        /   \
      $b     $c

(but slightly more complicated). This tree reflects the way Perl
parsed your code, but has nothing to do with the execution order.
There is an additional "thread" going through the nodes of the tree
which shows the order of execution of the nodes. In our simplified
example above it looks like:

    $b ---> $c ---> + ---> $a ---> assign-to

But with the actual compile tree for C<$a = $b + $c> it is different:
some nodes are I<optimized away>. As a corollary, though the actual tree
contains more nodes than our simplified example, the execution order
is the same as in our example.

=head2 Examining the tree

If you have your perl compiled for debugging (usually done with
C<-DDEBUGGING> on the C<Configure> command line), you may examine the
compiled tree by specifying C<-Dx> on the Perl command line. The
output takes several lines per node, and for C<$b+$c> it looks like
this:

    5           TYPE = add  ===> 6
                TARG = 1
                FLAGS = (SCALAR,KIDS)
                {
                    TYPE = null  ===> (4)
                      (was rv2sv)
                    FLAGS = (SCALAR,KIDS)
                    {
    3                   TYPE = gvsv  ===> 4
                        FLAGS = (SCALAR)
                        GV = main::b
                    }
                }
                {
                    TYPE = null  ===> (5)
                      (was rv2sv)
                    FLAGS = (SCALAR,KIDS)
                    {
    4                   TYPE = gvsv  ===> 5
                        FLAGS = (SCALAR)
                        GV = main::c
                    }
                }

This tree has 5 nodes (one per C<TYPE> specifier), of which only 3 are
not optimized away (one per number in the left column). The immediate
children of the given node correspond to C<{}> pairs on the same level
of indentation, thus this listing corresponds to the tree:

                   add
                 /     \
               null    null
                |       |
               gvsv    gvsv

The execution order is indicated by C<===E<gt>> marks, thus it is
C<3 4 5 6> (node C<6> is not included in the above listing), i.e.,
C<gvsv gvsv add whatever>.

Each of these nodes represents an op, a fundamental operation inside the
Perl core. The code which implements each operation can be found in the
F<pp*.c> files; the function which implements the op with type C<gvsv>
is C<pp_gvsv>, and so on. As the tree above shows, different ops have
different numbers of children: C<add> is a binary operator, as one would
expect, and so has two children. To accommodate the various different
numbers of children, there are various types of op data structure, and
they link together in different ways.

The simplest type of op structure is C<OP>: this has no children. Unary
operators, C<UNOP>s, have one child, and this is pointed to by the
C<op_first> field. Binary operators (C<BINOP>s) have not only an
C<op_first> field but also an C<op_last> field. The most complex type of
op is a C<LISTOP>, which has any number of children. In this case, the
first child is pointed to by C<op_first> and the last child by
C<op_last>. The children in between can be found by iteratively
following the C<op_sibling> pointer from the first child to the last.

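The way children hang off a list-style op can be sketched with a
simplified structure. The C<toy_op> type below is an assumption made for
illustration: it folds the link fields named above into one struct and
is not the layout of perl's real op structures.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the op structures: only the link fields
 * mentioned in the text are modelled, folded into a single struct. */
typedef struct toy_op {
    const char    *name;
    struct toy_op *op_first;    /* first child, or NULL */
    struct toy_op *op_last;     /* last child, or NULL */
    struct toy_op *op_sibling;  /* next child of the same parent */
} toy_op;

/* Count the children of an op by starting at op_first and following
 * op_sibling until op_last has been visited. */
static int count_kids(const toy_op *parent)
{
    const toy_op *kid;
    int n = 0;

    assert(parent != NULL);
    for (kid = parent->op_first; kid != NULL; kid = kid->op_sibling) {
        n++;
        if (kid == parent->op_last)
            break;
    }
    return n;
}

/* Build a list op with three children (a, b, c) and count them. */
static int list_demo(void)
{
    toy_op c    = { "c", NULL, NULL, NULL };
    toy_op b    = { "b", NULL, NULL, &c };
    toy_op a    = { "a", NULL, NULL, &b };
    toy_op list = { "list", &a, &c, NULL };

    return count_kids(&list);
}
```
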
There are also two other op types: a C<PMOP> holds a regular expression,
and has no children, and a C<LOOP> may or may not have children. If the
C<op_children> field is non-zero, it behaves like a C<LISTOP>. To
complicate matters, if a C<UNOP> is actually a C<null> op after
optimization (see L</Compile pass 2: context propagation>) it will still
have children in accordance with its former type.

Another way to examine the tree is to use a compiler back-end module, such
as L<B::Concise>.

=head2 Compile pass 1: check routines

The tree is created by the compiler while I<yacc> code feeds it
the constructions it recognizes. Since I<yacc> works bottom-up, so does
the first pass of perl compilation.

What makes this pass interesting for perl developers is that some
optimization may be performed on this pass. This is optimization by
so-called "check routines". The correspondence between node names
and corresponding check routines is described in F<opcode.pl> (do not
forget to run C<make regen_headers> if you modify this file).

A check routine is called when the node is fully constructed except
for the execution-order thread. Since at this time there are no
back-links to the currently constructed node, one can do almost any
operation to the top-level node, including freeing it and/or creating
new nodes above/below it.

The check routine returns the node which should be inserted into the
tree (if the top-level node was not modified, the check routine returns
its argument).

By convention, check routines have names C<ck_*>. They are usually
called from C<new*OP> subroutines (or C<convert>) (which in turn are
called from F<perly.y>).

=head2 Compile pass 1a: constant folding

Immediately after the check routine is called the returned node is
checked for being compile-time executable. If it is (the value is
judged to be constant) it is immediately executed, and a I<constant>
node with the "return value" of the corresponding subtree is
substituted instead. The subtree is deleted.

If constant folding was not performed, the execution-order thread is
created.

=head2 Compile pass 2: context propagation

When a context for a part of compile tree is known, it is propagated
down through the tree. At this time the context can have 5 values
(instead of 2 for runtime context): void, boolean, scalar, list, and
lvalue. In contrast with pass 1, this pass is processed from top
to bottom: a node's context determines the context for its children.

Additional context-dependent optimizations are performed at this time.
Since at this moment the compile tree contains back-references (via
"thread" pointers), nodes cannot be free()d now. To allow
optimized-away nodes at this stage, such nodes are null()ified instead
of being free()d (i.e. their type is changed to OP_NULL).

=head2 Compile pass 3: peephole optimization

After the compile tree for a subroutine (or for an C<eval> or a file)
is created, an additional pass over the code is performed. This pass
is neither top-down nor bottom-up, but in the execution order (with
additional complications for conditionals). These optimizations are
done in the subroutine peep(). Optimizations performed at this stage
are subject to the same restrictions as in pass 2.

=head2 Pluggable runops

The compile tree is executed in a runops function. There are two runops
functions in F<run.c>. C<Perl_runops_debug> is used with DEBUGGING and
C<Perl_runops_standard> is used otherwise. For fine control over the
execution of the compile tree it is possible to provide your own runops
function.

It's probably best to copy one of the existing runops functions and
change it to suit your needs. Then, in the BOOT section of your XS
file, add the line:

    PL_runops = my_runops;

This function should be as efficient as possible to keep your programs
running as fast as possible.

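To make the idea of a replaceable dispatch loop concrete, here is a
self-contained toy model. It is a sketch only: C<toy_op>,
C<toy_runops> and the two loop functions are invented names, and the
real runops functions in F<run.c> operate on perl's own op structures
and interpreter state.

```c
#include <assert.h>
#include <stddef.h>

typedef struct toy_op toy_op;
struct toy_op {
    toy_op *(*ppaddr)(toy_op *);  /* code implementing this op */
    toy_op  *next;                /* next op in execution order */
};

static int ops_executed;   /* work done by the toy ops themselves */
static int ops_traced;     /* bookkeeping done by the custom loop */

/* A trivial op body: do some "work", then say which op runs next. */
static toy_op *pp_step(toy_op *op)
{
    ops_executed++;
    return op->next;
}

/* Plain loop: run each op and follow the pointer it returns. */
static int runops_plain(toy_op *op)
{
    while (op != NULL) {
        assert(op->ppaddr != NULL);
        op = op->ppaddr(op);
    }
    return 0;
}

/* Custom loop: same dispatch, plus per-op bookkeeping. */
static int runops_counting(toy_op *op)
{
    while (op != NULL) {
        ops_traced++;
        op = op->ppaddr(op);
    }
    return 0;
}

/* The pluggable hook, playing the role that PL_runops plays in perl. */
static int (*toy_runops)(toy_op *) = runops_plain;

/* Execute a chain of three ops through whichever loop is installed. */
static int run_three_ops(void)
{
    toy_op c = { pp_step, NULL };
    toy_op b = { pp_step, &c };
    toy_op a = { pp_step, &b };

    ops_executed = 0;
    toy_runops(&a);
    return ops_executed;
}
```

Assigning C<runops_counting> to C<toy_runops> changes what happens
around every op without touching the ops themselves, which is also why
any extra work in such a loop is paid once per op executed.
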
=head1 Examining internal data structures with the C<dump> functions

To aid debugging, the source file F<dump.c> contains a number of
functions which produce formatted output of internal data structures.

The most commonly used of these functions is C<Perl_sv_dump>; it's used
for dumping SVs, AVs, HVs, and CVs. The C<Devel::Peek> module calls
C<sv_dump> to produce debugging output from Perl-space, so users of that
module should already be familiar with its format.

C<Perl_op_dump> can be used to dump an C<OP> structure or any of its
derivatives, and produces output similar to C<perl -Dx>; in fact,
C<Perl_dump_eval> will dump the main root of the code being evaluated,
exactly like C<-Dx>.

Other useful functions are C<Perl_dump_sub>, which turns a C<GV> into an
op tree, C<Perl_dump_packsubs> which calls C<Perl_dump_sub> on all the
subroutines in a package like so: (Thankfully, these are all xsubs, so
there is no op tree)

    (gdb) print Perl_dump_packsubs(PL_defstash)

    SUB attributes::bootstrap = (xsub 0x811fedc 0)

    SUB UNIVERSAL::can = (xsub 0x811f50c 0)

    SUB UNIVERSAL::isa = (xsub 0x811f304 0)

    SUB UNIVERSAL::VERSION = (xsub 0x811f7ac 0)

    SUB DynaLoader::boot_DynaLoader = (xsub 0x805b188 0)

and C<Perl_dump_all>, which dumps all the subroutines in the stash and
the op tree of the main root.

=head1 How multiple interpreters and concurrency are supported

=head2 Background and PERL_IMPLICIT_CONTEXT

The Perl interpreter can be regarded as a closed box: it has an API
for feeding it code or otherwise making it do things, but it also has
functions for its own use. This smells a lot like an object, and
there are ways for you to build Perl so that you can have multiple
interpreters, with one interpreter represented either as a C structure,
or inside a thread-specific structure. These structures contain all
the context, the state of that interpreter.

Two macros control the major Perl build flavors: MULTIPLICITY and
USE_5005THREADS. The MULTIPLICITY build has a C structure
that packages all the interpreter state, and there is a similar thread-specific
data structure under USE_5005THREADS. In both cases,
PERL_IMPLICIT_CONTEXT is also normally defined, and enables the
support for passing in a "hidden" first argument that represents either
of these data structures.

All this obviously requires a way for the Perl internal functions to be
either subroutines taking some kind of structure as the first
argument, or subroutines taking nothing as the first argument. To
enable these two very different ways of building the interpreter,
the Perl source (as it does in so many other situations) makes heavy
use of macros and subroutine naming conventions.

First problem: deciding which functions will be public API functions and
which will be private. All functions whose names begin with C<S_> are private
(think "S" for "secret" or "static"). All other functions begin with
"Perl_", but just because a function begins with "Perl_" does not mean it is
part of the API. (See L</Internal Functions>.) The easiest way to be B<sure> a
function is part of the API is to find its entry in L<perlapi>.
If it exists in L<perlapi>, it's part of the API. If it doesn't, and you
think it should be (i.e., you need it for your extension), send mail via
L<perlbug> explaining why you think it should be.

Second problem: there must be a syntax so that the same subroutine
declarations and calls can pass a structure as their first argument,
or pass nothing. To solve this, the subroutines are named and
declared in a particular way. Here's a typical start of a static
function used within the Perl guts:

    STATIC void
    S_incline(pTHX_ char *s)

STATIC becomes "static" in C, and may be #define'd to nothing in some
configurations in the future.

A public function (i.e. part of the internal API, but not necessarily
sanctioned for use in extensions) begins like this:

    void
    Perl_sv_setiv(pTHX_ SV* dsv, IV num)

C<pTHX_> is one of a number of macros (in perl.h) that hide the
details of the interpreter's context. THX stands for "thread", "this",
or "thingy", as the case may be. (And no, George Lucas is not involved. :-)
The first character could be 'p' for a B<p>rototype, 'a' for B<a>rgument,
or 'd' for B<d>eclaration, so we have C<pTHX>, C<aTHX> and C<dTHX>, and
their variants.

When Perl is built without options that set PERL_IMPLICIT_CONTEXT, there is no
first argument containing the interpreter's context. The trailing underscore
in the pTHX_ macro indicates that the macro expansion needs a comma
after the context argument because other arguments follow it. If
PERL_IMPLICIT_CONTEXT is not defined, pTHX_ will be ignored, and the
subroutine is not prototyped to take the extra argument. The form of the
macro without the trailing underscore is used when there are no additional
explicit arguments.

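The role of the trailing underscore is easier to see in a miniature
version of the scheme. The following sketch is a toy model with
hypothetical C<toy_*> names; it mirrors the shape of the real macros
but is not copied from perl.h. Change C<TOY_IMPLICIT_CONTEXT> to 0 to
build the flavor without a context argument.

```c
#include <assert.h>

/* Toy model of the context macros; set to 0 for the other flavor. */
#define TOY_IMPLICIT_CONTEXT 1

typedef struct { int calls; } toy_interp;

#if TOY_IMPLICIT_CONTEXT
#  define toy_pTHX   toy_interp *my_perl  /* prototype, no other args */
#  define toy_pTHX_  toy_pTHX,            /* prototype, more args follow */
#  define toy_aTHX   my_perl              /* argument, no other args */
#  define toy_aTHX_  toy_aTHX,            /* argument, more args follow */
#  define TOY_CALLS  (my_perl->calls)
#else
static toy_interp toy_global;             /* state lives in one global */
#  define toy_pTHX   void
#  define toy_pTHX_
#  define toy_aTHX
#  define toy_aTHX_
#  define TOY_CALLS  (toy_global.calls)
#endif

/* Has explicit arguments, so it uses the form with the underscore. */
static int Toy_add(toy_pTHX_ int a, int b)
{
    TOY_CALLS++;
    return a + b;
}

/* Has no explicit arguments, so it uses the form without it. */
static int Toy_call_count(toy_pTHX)
{
    return TOY_CALLS;
}

/* Short names hide the context argument from callers. */
#define toy_add(a, b)     Toy_add(toy_aTHX_ a, b)
#define toy_call_count()  Toy_call_count(toy_aTHX)

static int context_demo(void)
{
#if TOY_IMPLICIT_CONTEXT
    toy_interp interp = { 0 };
    toy_interp *my_perl = &interp;  /* roughly what dTHX sets up */
#else
    toy_global.calls = 0;
#endif
    int sum = toy_add(2, 3) + toy_add(10, 20);

    assert(toy_call_count() == 2);
    return sum * 10 + toy_call_count();
}
```

The body of C<context_demo> is identical in both flavors, apart from
the declaration that a C<dTHX>-like macro would normally supply.
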
When a core function calls another, it must pass the context. This
is normally hidden via macros. Consider C<sv_setiv>. It expands into
something like this:

    #ifdef PERL_IMPLICIT_CONTEXT
      #define sv_setiv(a,b)      Perl_sv_setiv(aTHX_ a, b)
      /* can't do this for vararg functions, see below */
    #else
      #define sv_setiv           Perl_sv_setiv
    #endif

This works well, and means that XS authors can gleefully write:

    sv_setiv(foo, bar);

and still have it work under all the modes Perl could have been
compiled with.

This doesn't work so cleanly for varargs functions, though, as macros
imply that the number of arguments is known in advance. Instead we
either need to spell them out fully, passing C<aTHX_> as the first
argument (the Perl core tends to do this with functions like
Perl_warner), or use a context-free version.

The context-free version of Perl_warner is called
Perl_warner_nocontext, and does not take the extra argument. Instead
it does dTHX; to get the context from thread-local storage. We
C<#define warner Perl_warner_nocontext> so that extensions get source
compatibility at the expense of performance. (Passing an arg is
cheaper than grabbing it from thread-local storage.)

You can ignore [pad]THXx when browsing the Perl headers/sources.
Those are strictly for use within the core. Extensions and embedders
need only be aware of [pad]THX.

=head2 So what happened to dTHR?

C<dTHR> was introduced in perl 5.005 to support the older thread model.
The older thread model now uses the C<THX> mechanism to pass context
pointers around, so C<dTHR> is not useful any more. Perl 5.6.0 and
later still have it for backward source compatibility, but it is defined
to be a no-op.

=head2 How do I use all this in extensions?

When Perl is built with PERL_IMPLICIT_CONTEXT, extensions that call
any functions in the Perl API will need to pass the initial context
argument somehow. The kicker is that you will need to write it in
such a way that the extension still compiles when Perl hasn't been
built with PERL_IMPLICIT_CONTEXT enabled.

There are three ways to do this. First, the easy but inefficient way,
which is also the default, in order to maintain source compatibility
with extensions: whenever XSUB.h is #included, it redefines the aTHX
and aTHX_ macros to call a function that will return the context.
Thus, something like:

    sv_setiv(sv, num);

in your extension will translate to this when PERL_IMPLICIT_CONTEXT is
in effect:

    Perl_sv_setiv(Perl_get_context(), sv, num);

or to this otherwise:

    Perl_sv_setiv(sv, num);

You have to do nothing new in your extension to get this; since
the Perl library provides Perl_get_context(), it will all just
work.

The second, more efficient way is to use the following template for
your Foo.xs:

    #define PERL_NO_GET_CONTEXT     /* we want efficiency */
    #include "EXTERN.h"
    #include "perl.h"
    #include "XSUB.h"

    static SV *my_private_function(int arg1, int arg2);

    static SV *
    my_private_function(int arg1, int arg2)
    {
        dTHX;       /* fetch context */
        ... call many Perl API functions ...
    }

    [... etc ...]

    MODULE = Foo            PACKAGE = Foo

    /* typical XSUB */

    void
    my_xsub(arg)
            int arg
        CODE:
            my_private_function(arg, 10);

Note that the only two changes from the normal way of writing an
extension are the addition of a C<#define PERL_NO_GET_CONTEXT> before
including the Perl headers, followed by a C<dTHX;> declaration at
the start of every function that will call the Perl API. (You'll
know which functions need this, because the C compiler will complain
that there's an undeclared identifier in those functions.) No changes
are needed for the XSUBs themselves, because the XS() macro is
correctly defined to pass in the implicit context if needed.

The third, even more efficient way is to ape how it is done within
the Perl guts:

    #define PERL_NO_GET_CONTEXT     /* we want efficiency */
    #include "EXTERN.h"
    #include "perl.h"
    #include "XSUB.h"

    /* pTHX_ only needed for functions that call Perl API */
    static SV *my_private_function(pTHX_ int arg1, int arg2);

    static SV *
    my_private_function(pTHX_ int arg1, int arg2)
    {
        /* dTHX; not needed here, because THX is an argument */
        ... call Perl API functions ...
    }

    [... etc ...]

    MODULE = Foo            PACKAGE = Foo

    /* typical XSUB */

    void
    my_xsub(arg)
            int arg
        CODE:
            my_private_function(aTHX_ arg, 10);

This implementation never has to fetch the context using a function
call, since it is always passed as an extra argument. Depending on
your needs for simplicity or efficiency, you may mix the previous
two approaches freely.

Never add a comma after C<pTHX> yourself--always use the form of the
macro with the underscore for functions that take explicit arguments,
or the form without the underscore for functions with no explicit arguments.

=head2 Should I do anything special if I call perl from multiple threads?

If you create interpreters in one thread and then proceed to call them in
another, you need to make sure perl's own Thread Local Storage (TLS) slot is
initialized correctly in each of those threads.

The C<perl_alloc> and C<perl_clone> API functions will automatically set
the TLS slot to the interpreter they created, so that there is no need to do
anything special if the interpreter is always accessed in the same thread that
created it, and that thread did not create or call any other interpreters
afterwards. If that is not the case, you have to set the TLS slot of the
thread before calling any functions in the Perl API on that particular
interpreter. This is done by calling the C<PERL_SET_CONTEXT> macro in that
thread as the first thing you do:

    /* do this before doing anything else with some_perl */
    PERL_SET_CONTEXT(some_perl);

    ... other Perl API calls on some_perl go here ...

=head2 Future Plans and PERL_IMPLICIT_SYS

Just as PERL_IMPLICIT_CONTEXT provides a way to bundle up everything
that the interpreter knows about itself and pass it around, so too are
there plans to allow the interpreter to bundle up everything it knows
about the environment it's running on. This is enabled with the
PERL_IMPLICIT_SYS macro. Currently it only works with USE_ITHREADS
and USE_5005THREADS on Windows (see inside iperlsys.h).

This allows an extra pointer (called the "host" environment) to be
provided for all the system calls. This makes it possible for
all the system stuff to maintain its own state, broken down into
seven C structures. These are thin wrappers around the usual system
calls (see win32/perllib.c) for the default perl executable, but for a
more ambitious host (like the one that would do fork() emulation) all
the extra work needed to pretend that different interpreters are
actually different "processes" would be done here.

The Perl engine/interpreter and the host are orthogonal entities.
There could be one or more interpreters in a process, and one or
more "hosts", with free association between them.

=head1 Internal Functions

All of Perl's internal functions which will be exposed to the outside
world are prefixed by C<Perl_> so that they will not conflict with XS
functions or functions used in a program in which Perl is embedded.
Similarly, all global variables begin with C<PL_>. (By convention,
static functions start with C<S_>.)

Inside the Perl core, you can get at the functions either with or
without the C<Perl_> prefix, thanks to a bunch of defines that live in
F<embed.h>. This header file is generated automatically from
F<embed.pl>. F<embed.pl> also creates the prototyping header files for
the internal functions, generates the documentation and a lot of other
bits and pieces. It's important that when you add a new function to the
core or change an existing one, you change the data in the table at the
end of F<embed.pl> as well. Here's a sample entry from that table:

    Apd |SV**   |av_fetch   |AV* ar|I32 key|I32 lval

The second column is the return type, the third column the name. Columns
after that are the arguments. The first column is a set of flags:

=over 3

=item A

This function is a part of the public API.

=item p

This function has a C<Perl_> prefix; i.e., it is defined as C<Perl_av_fetch>.

=item d

This function has documentation using the C<apidoc> feature which we'll
look at in a second.

=back

Other available flags are:

=over 3

=item s

This is a static function and is defined as C<S_whatever>, and usually
called within the sources as C<whatever(...)>.

=item n

This does not use C<aTHX_> and C<pTHX> to pass interpreter context. (See
L<perlguts/Background and PERL_IMPLICIT_CONTEXT>.)

=item r

This function never returns; C<croak>, C<exit> and friends.

=item f

This function takes a variable number of arguments, C<printf> style.
The argument list should end with C<...>, like this:

    Afprd   |void   |croak          |const char* pat|...

=item M

This function is part of the experimental development API, and may change
or disappear without notice.

=item o

This function should not have a compatibility macro to define, say,
C<Perl_parse> to C<parse>. It must be called as C<Perl_parse>.

=item j

This function is not a member of C<CPerlObj>. If you don't know
what this means, don't use it.

=item x

This function isn't exported out of the Perl core.

=back

If you edit F<embed.pl>, you will need to run C<make regen_headers> to
force a rebuild of F<embed.h> and other auto-generated files.

=head2 Formatted Printing of IVs, UVs, and NVs

If you are printing IVs, UVs, or NVs, then instead of the stdio(3) style
formatting codes like C<%d>, C<%ld>, C<%f>, you should use the
following macros for portability:

    IVdf    IV in decimal
    UVuf    UV in decimal
    UVof    UV in octal
    UVxf    UV in hexadecimal
    NVef    NV %e-like
    NVff    NV %f-like
    NVgf    NV %g-like

These will take care of 64-bit integers and long doubles.
For example:

    printf("IV is %"IVdf"\n", iv);

The IVdf will expand to whatever is the correct format for the IVs.

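The same technique exists in standard C, which makes it possible to
try out without the Perl headers. The sketch below is an analogy, not
the Perl macros themselves: it uses C<PRId64> from C<E<lt>inttypes.hE<gt>>
in the place where C<IVdf> would go, and C<format_i64> is a name made
up for this example.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

/* Format a 64-bit integer portably. PRId64 expands to a string literal
 * holding the right conversion for int64_t on this platform, and the
 * compiler joins the adjacent string literals into one format string.
 * Returns what snprintf returns: the length of the formatted text. */
static int format_i64(char *buf, size_t size, int64_t value)
{
    assert(buf != NULL && size > 0);
    return snprintf(buf, size, "IV is %" PRId64 "\n", value);
}
```

As with C<IVdf>, the caller writes the C<%> sign and the macro supplies
only the part of the conversion that varies between platforms.
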
If you are printing addresses of pointers, use UVxf combined
with PTR2UV(); do not use %lx or %p.

=head2 Pointer-To-Integer and Integer-To-Pointer

Because pointer size does not necessarily equal integer size,
use the following macros to do it right.

    PTR2UV(pointer)
    PTR2IV(pointer)
    PTR2NV(pointer)
    INT2PTR(pointertotype, integer)

For example:

    IV  iv = ...;
    SV *sv = INT2PTR(SV*, iv);

and

    AV *av = ...;
    UV  uv = PTR2UV(av);

a422fd2d
SC
2155=head2 Source Documentation
2156
2157There's an effort going on to document the internal functions and
2158automatically produce reference manuals from them - L<perlapi> is one
2159such manual which details all the functions which are available to XS
2160writers. L<perlintern> is the autogenerated manual for the functions
2161which are not part of the API and are supposedly for internal use only.
2162
2163Source documentation is created by putting POD comments into the C
2164source, like this:
2165
2166 /*
2167 =for apidoc sv_setiv
2168
2169 Copies an integer into the given SV. Does not handle 'set' magic. See
2170 C<sv_setiv_mg>.
2171
2172 =cut
2173 */
2174
2175Please try and supply some documentation if you add functions to the
2176Perl core.

=head1 Unicode Support

Perl 5.6.0 introduced Unicode support. It's important for porters and XS
writers to understand this support and make sure that the code they
write does not corrupt Unicode data.

=head2 What B<is> Unicode, anyway?

In the olden, less enlightened times, we all used to use ASCII. Most of
us did, anyway. The big problem with ASCII is that it's American. Well,
no, that's not actually the problem; the problem is that it's not
particularly useful for people who don't use the Roman alphabet. What
used to happen was that particular languages would stick their own
alphabet in the upper range of the sequence, between 128 and 255. Of
course, we then ended up with plenty of variants that weren't quite
ASCII, and the whole point of it being a standard was lost.

Worse still, if you've got a language like Chinese or
Japanese that has hundreds or thousands of characters, then you really
can't fit them into a mere 256, so they had to forget about ASCII
altogether, and build their own systems using pairs of numbers to refer
to one character.

To fix this, some people formed Unicode, Inc. and
produced a new character set containing all the characters you can
possibly think of and more. There are several ways of representing these
characters, and the one Perl uses is called UTF8. UTF8 uses
a variable number of bytes to represent a character, instead of just
one. You can learn more about Unicode at http://www.unicode.org/

=head2 How can I recognise a UTF8 string?

You can't. This is because UTF8 data is stored in bytes just like
non-UTF8 data. The Unicode character 200, (C<0xC8> for you hex types)
capital E with a grave accent, is represented by the two bytes
C<v195.136>. Unfortunately, the non-Unicode string C<chr(195).chr(136)>
has that byte sequence as well. So you can't tell just by looking - this
is what makes Unicode input an interesting problem.

The API function C<is_utf8_string> can help; it'll tell you if a string
contains only valid UTF8 characters. However, it can't do the work for
you. On a character-by-character basis, C<is_utf8_char> will tell you
whether the current character in a string is valid UTF8.

=head2 How does UTF8 represent Unicode characters?

As mentioned above, UTF8 uses a variable number of bytes to store a
character. Characters with values 0...127 are stored in one byte, just
like good ol' ASCII. Character 128 is stored as C<v194.128>; this
continues up to character 191, which is C<v194.191>. Now we've run out of
bits (191 is binary C<10111111>) so we move on; 192 is C<v195.128>. And
so it goes on, moving to three bytes at character 2048.
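
The byte values quoted above can be checked with a few lines of plain C.
The following is a minimal sketch that does not use the Perl API; the
function name C<encode_utf8> is hypothetical, and it only covers
characters below 65536 (one to three bytes) without rejecting
surrogates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encode code point cp (below 0x10000) as UTF-8
 * into out, returning the number of bytes written (1 to 3).
 * The caller must supply room for at least three bytes. */
static size_t encode_utf8(unsigned long cp, unsigned char *out)
{
    assert(out != NULL && cp < 0x10000UL);
    if (cp < 0x80) {                 /* 0..127: one byte, same as ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                /* 128..2047: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    /* 2048..65535: 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

Feeding it 191 gives the bytes 194 and 191, feeding it 192 gives 195
and 128, and 2048 is the first value that needs three bytes.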

Assuming you know you're dealing with a UTF8 string, you can find out
how long the first character in it is with the C<UTF8SKIP> macro:

    char *utf = "\305\233\340\240\201";
    I32 len;

    len = UTF8SKIP(utf); /* len is 2 here */
    utf += len;
    len = UTF8SKIP(utf); /* len is 3 here */
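
To see why the lead byte alone is enough to know the length, here is a
standalone sketch of the same idea in plain C. It is not how
C<UTF8SKIP> is implemented in the Perl source; C<utf8_skip> and
C<utf8_count> are hypothetical names, and the code assumes well-formed
input of at most four bytes per character.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for UTF8SKIP: length in bytes of the character
 * whose first byte is at s, judged from that lead byte alone. */
static size_t utf8_skip(const unsigned char *s)
{
    assert(s != NULL);
    if (*s < 0x80)           return 1;  /* 0xxxxxxx */
    if ((*s & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((*s & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((*s & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 1;                /* not a lead byte: step over one byte */
}

/* Count characters in nbytes of UTF-8 by hopping between lead bytes.
 * Like the macro, this does no bounds checking inside a character. */
static size_t utf8_count(const unsigned char *s, size_t nbytes)
{
    size_t n = 0, i = 0;
    while (i < nbytes) {
        i += utf8_skip(s + i);
        n++;
    }
    return n;
}
```

Applied to the five bytes in the example above, C<utf8_skip> returns 2
for the first character and 3 for the second.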

Another way to skip over characters in a UTF8 string is to use
C<utf8_hop>, which takes a string and a number of characters to skip
over. You're on your own about bounds checking, though, so don't use it
lightly.

All bytes in a multi-byte UTF8 character will have the high bit set,
so you can test if you need to do something special with this
character like this (C<UTF8_IS_INVARIANT()> is a macro that tests
whether the byte can be encoded as a single byte even in UTF-8):

    U8 *utf;
    UV uv;      /* Note: a UV, not a U8, not a char */

    if (!UTF8_IS_INVARIANT(*utf))
        /* Must treat this as UTF8 */
        uv = utf8_to_uv(utf);
    else
        /* OK to treat this character as a byte */
        uv = *utf;

You can also see in that example that we use C<utf8_to_uv> to get the
value of the character; the inverse function C<uv_to_utf8> is available
for putting a UV into UTF8:

    if (!UTF8_IS_INVARIANT(uv))
        /* Must treat this as UTF8 */
        utf8 = uv_to_utf8(utf8, uv);
    else
        /* OK to treat this character as a byte */
        *utf8++ = uv;

You B<must> convert characters to UVs using the above functions if
you're ever in a situation where you have to match UTF8 and non-UTF8
characters. You may not skip over UTF8 characters in this case. If you
do this, you'll lose the ability to match hi-bit non-UTF8 characters;
for instance, if your UTF8 string contains C<v195.136>, and you skip
that character, you can never match a C<chr(200)> in a non-UTF8 string.
So don't do that!

=head2 How does Perl store UTF8 strings?

Currently, Perl deals with Unicode strings and non-Unicode strings
slightly differently. If a string has been identified as being UTF-8
encoded, Perl will set a flag in the SV, C<SVf_UTF8>. You can check and
manipulate this flag with the following macros:

    SvUTF8(sv)
    SvUTF8_on(sv)
    SvUTF8_off(sv)

This flag has an important effect on Perl's treatment of the string: if
Unicode data is not properly distinguished, regular expressions,
C<length>, C<substr> and other string handling operations will have
undesirable results.

The problem comes when you have, for instance, a string that isn't
flagged as UTF8, and contains a byte sequence that could be UTF8 -
especially when combining non-UTF8 and UTF8 strings.

Never forget that the C<SVf_UTF8> flag is separate from the PV value; you
need to be sure you don't accidentally knock it off while you're
manipulating SVs. More specifically, you cannot expect to do this:

    SV *sv;
    SV *nsv;
    STRLEN len;
    char *p;

    p = SvPV(sv, len);
    frobnicate(p);
    nsv = newSVpvn(p, len);

The C<char*> string does not tell you the whole story, and you can't
copy or reconstruct an SV just by copying the string value. Check if the
old SV has the UTF8 flag set, and act accordingly:

    p = SvPV(sv, len);
    frobnicate(p);
    nsv = newSVpvn(p, len);
    if (SvUTF8(sv))
        SvUTF8_on(nsv);

In fact, your C<frobnicate> function should be made aware of whether or
not it's dealing with UTF8 data, so that it can handle the string
appropriately.

Since just passing an SV to an XS function and copying the data of
the SV is not enough to copy the UTF8 flags, even less right is just
passing a C<char *> to an XS function.
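
The point that the flag travels separately from the bytes can be
modelled without the Perl headers. The sketch below is a toy: the
C<struct toy_sv> type and the C<toy_sv_copy> function are invented for
illustration and bear no relation to Perl's real SV layout.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for an SV: a byte buffer plus a separate UTF-8 flag. */
struct toy_sv {
    char  *pv;       /* the bytes */
    size_t len;      /* number of bytes in pv */
    int    is_utf8;  /* how those bytes are to be interpreted */
};

/* Copy the bytes and the flag together.
 * Returns 0 on success, -1 if memory cannot be allocated. */
static int toy_sv_copy(struct toy_sv *dst, const struct toy_sv *src)
{
    assert(dst != NULL && src != NULL);
    dst->pv = malloc(src->len + 1);
    if (dst->pv == NULL)
        return -1;
    memcpy(dst->pv, src->pv, src->len);
    dst->pv[src->len] = '\0';
    dst->len = src->len;
    dst->is_utf8 = src->is_utf8;  /* the step a plain byte copy forgets */
    return 0;
}
```

A function that accepted only C<pv> and C<len> would have no way to
carry out the last assignment, which is the same reason a bare
C<char *> is not enough.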

=head2 How do I convert a string to UTF8?

If you're mixing UTF8 and non-UTF8 strings, you might find it necessary
to upgrade one of the strings to UTF8. If you've got an SV, the easiest
way to do this is:

    sv_utf8_upgrade(sv);

However, you must not do this, for example:

    if (!SvUTF8(left))
        sv_utf8_upgrade(left);

If you do this in a binary operator, you will actually change one of the
strings that came into the operator, and, while it shouldn't be noticeable
by the end user, it can cause problems.

Instead, C<bytes_to_utf8> will give you a UTF8-encoded B<copy> of its
string argument. This is useful for having the data available for
comparisons and so on, without harming the original SV. There's also
C<utf8_to_bytes> to go the other way, but naturally, this will fail if
the string contains any characters above 255 that can't be represented
in a single byte.
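
The "copy, don't modify" approach can be sketched in plain C as
follows. This is a rough analogue of what a bytes-to-UTF8 conversion
does, not the Perl function itself; the name C<latin1_to_utf8> and its
signature are assumptions made for this example.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical analogue of a bytes-to-UTF8 conversion: return a newly
 * allocated UTF-8 copy of the len bytes at src, leaving src untouched.
 * The new length is stored in *out_len. Returns NULL if allocation
 * fails; the caller frees the result. */
static unsigned char *latin1_to_utf8(const unsigned char *src, size_t len,
                                     size_t *out_len)
{
    unsigned char *dst, *d;
    size_t i;

    assert(src != NULL && out_len != NULL);
    dst = malloc(2 * len + 1);       /* worst case: every byte doubles */
    if (dst == NULL)
        return NULL;
    d = dst;
    for (i = 0; i < len; i++) {
        if (src[i] < 0x80) {
            *d++ = src[i];           /* invariant: same byte in UTF-8 */
        } else {
            *d++ = (unsigned char)(0xC0 | (src[i] >> 6));
            *d++ = (unsigned char)(0x80 | (src[i] & 0x3F));
        }
    }
    *d = '\0';
    *out_len = (size_t)(d - dst);
    return dst;
}
```

Because the input is only read, both operands of a comparison stay
exactly as the caller passed them in.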

=head2 Is there anything else I need to know?

Not really. Just remember these things:

=over 3

=item *

There's no way to tell if a string is UTF8 or not. You can tell if an SV
is UTF8 by looking at its C<SvUTF8> flag. Don't forget to set the flag if
something should be UTF8. Treat the flag as part of the PV, even though
it's not - if you pass on the PV to somewhere, pass on the flag too.

=item *

If a string is UTF8, B<always> use C<utf8_to_uv> to get at the value,
unless C<UTF8_IS_INVARIANT(*s)> in which case you can use C<*s>.

=item *

When writing a character C<uv> to a UTF8 string, B<always> use
C<uv_to_utf8>, unless C<UTF8_IS_INVARIANT(uv)> in which case
you can use C<*s = uv>.

=item *

Mixing UTF8 and non-UTF8 strings is tricky. Use C<bytes_to_utf8> to get
a new string which is UTF8 encoded. There are tricks you can use to
delay deciding whether you need to use a UTF8 string until you get to a
high character - C<HALF_UPGRADE> is one of those.

=back

=head1 Custom Operators

Custom operator support is a new experimental feature that allows you to
define your own ops. This is primarily to allow the building of
interpreters for other languages in the Perl core, but it also allows
optimizations through the creation of "macro-ops" (ops which perform the
functions of multiple ops which are usually executed together, such as
C<gvsv, gvsv, add>.)

This feature is implemented as a new op type, C<OP_CUSTOM>. The Perl
core does not "know" anything special about this op type, and so it will
not be involved in any optimizations. This also means that you can
define your custom ops to be any op structure - unary, binary, list and
so on - you like.

It's important to know what custom operators won't do for you. They
won't let you add new syntax to Perl, directly. They won't even let you
add new keywords, directly. In fact, they won't change the way Perl
compiles a program at all. You have to do those changes yourself, after
Perl has compiled the program. You do this either by manipulating the op
tree using a C<CHECK> block and the C<B::Generate> module, or by adding
a custom peephole optimizer with the C<optimize> module.

When you do this, you replace ordinary Perl ops with custom ops by
creating ops with the type C<OP_CUSTOM> and the C<op_ppaddr> of your own
PP function. This should be defined in XS code, and should look like
the PP ops in C<pp_*.c>. You are responsible for ensuring that your op
takes the appropriate number of values from the stack, and you are
responsible for adding stack marks if necessary.

You should also "register" your op with the Perl interpreter so that it
can produce sensible error and warning messages. Since it is possible to
have multiple custom ops within the one "logical" op type C<OP_CUSTOM>,
Perl uses the value of C<< o->op_ppaddr >> as a key into the
C<PL_custom_op_descs> and C<PL_custom_op_names> hashes. This means you
need to enter a name and description for your op at the appropriate
place in the C<PL_custom_op_names> and C<PL_custom_op_descs> hashes.

Forthcoming versions of C<B::Generate> (version 1.0 and above) should
directly support the creation of custom ops by name; C<Opcodes::Custom>
will provide functions which make it trivial to "register" custom ops to
the Perl interpreter.

=head1 AUTHORS

Until May 1997, this document was maintained by Jeff Okamoto
E<lt>okamoto@corp.hp.comE<gt>. It is now maintained as part of Perl
itself by the Perl 5 Porters E<lt>perl5-porters@perl.orgE<gt>.

With lots of help and suggestions from Dean Roehrich, Malcolm Beattie,
Andreas Koenig, Paul Hudson, Ilya Zakharevich, Paul Marquess, Neil
Bowers, Matthew Green, Tim Bunce, Spider Boardman, Ulrich Pfeifer,
Stephen McCamant, and Gurusamy Sarathy.

=head1 SEE ALSO

perlapi(1), perlintern(1), perlxs(1), perlembed(1)