This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5
1NAME
2 Text::Balanced - Extract delimited text sequences from strings.
3
4SYNOPSIS
5 use Text::Balanced qw (
6 extract_delimited
7 extract_bracketed
8 extract_quotelike
9 extract_codeblock
10 extract_variable
11 extract_tagged
12 extract_multiple
13 gen_delimited_pat
14 gen_extract_tagged
15 );
16
17 # Extract the initial substring of $text that is delimited by
18 # two (unescaped) instances of the first character in $delim.
19
20 ($extracted, $remainder) = extract_delimited($text,$delim);
21
22
23 # Extract the initial substring of $text that is bracketed
24 # with a delimiter(s) specified by $delim (where the string
25 # in $delim contains one or more of '(){}[]<>').
26
27 ($extracted, $remainder) = extract_bracketed($text,$delim);
28
29
30 # Extract the initial substring of $text that is bounded by
31 # an XML tag.
32
33 ($extracted, $remainder) = extract_tagged($text);
34
35
36 # Extract the initial substring of $text that is bounded by
37 # a BEGIN...END pair. Don't allow nested BEGIN tags.
38
39 ($extracted, $remainder) =
40 extract_tagged($text,"BEGIN","END",undef,{reject=>["BEGIN"]});
41
42
43 # Extract the initial substring of $text that represents a
44 # Perl "quote or quote-like operation"
45
46 ($extracted, $remainder) = extract_quotelike($text);
47
48
49 # Extract the initial substring of $text that represents a block
50 # of Perl code, bracketed by any of the character(s) specified by $delim
51 # (where the string $delim contains one or more of '(){}[]<>').
52
53 ($extracted, $remainder) = extract_codeblock($text,$delim);
54
55
56 # Extract the initial substrings of $text that would be extracted by
57 # one or more sequential applications of the specified functions
58 # or regular expressions
59
60 @extracted = extract_multiple($text,
61 [ \&extract_bracketed,
62 \&extract_quotelike,
63 \&some_other_extractor_sub,
64 qr/[xyz]*/,
65 'literal',
66 ]);
67
68 # Create a string representing an optimized pattern (a la Friedl) that
69 # matches a substring delimited by any of the specified characters (in
70 # this case: any type of quote or a slash)
71
72 $patstring = gen_delimited_pat(q{'"`/});
73
74 # Generate a reference to an anonymous sub that is just like
75 # extract_tagged but pre-compiled and optimized for a specific pair of
76 # tags, and consequently much faster (i.e. 3 times faster). It uses qr//
77 # for better performance on repeated calls, so it only works under Perl
78 # 5.005 or later.
79
80 $extract_head = gen_extract_tagged('<HEAD>','</HEAD>');
81
82 ($extracted, $remainder) = $extract_head->($text);
83
84DESCRIPTION
85 The various "extract_..." subroutines may be used to extract a delimited
86 substring, possibly after skipping a specified prefix string. By
87 default, that prefix is optional whitespace ("/\s*/"), but you can
88 change it to whatever you wish (see below).
89
90 The substring to be extracted must appear at the current "pos" location
91 of the string's variable (or at index zero, if no "pos" position is
92 defined). In other words, the "extract_..." subroutines *don't* extract
93 the first occurrence of a substring anywhere in a string (like an
94 unanchored regex would). Rather, they extract an occurrence of the
95 substring appearing immediately at the current matching position in the
96 string (like a "\G"-anchored regex would).
97
98 General behaviour in list contexts
99 In a list context, all the subroutines return a list, the first three
100 elements of which are always:
101
102 [0] The extracted string, including the specified delimiters. If the
103 extraction fails "undef" is returned.
104
105 [1] The remainder of the input string (i.e. the characters after the
106 extracted string). On failure, the entire string is returned.
107
108 [2] The skipped prefix (i.e. the characters before the extracted
109 string). On failure, "undef" is returned.
110
111 Note that in a list context, the contents of the original input text
112 (the first argument) are not modified in any way.
113
114 However, if the input text was passed in a variable, that variable's
115 "pos" value is updated to point at the first character after the
116 extracted text. That means that in a list context the various
117 subroutines can be used much like regular expressions. For example:
118
119 while ( $next = (extract_quotelike($text))[0] )
120 {
121 # process next quote-like (in $next)
122 }
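As a minimal sketch of the list-context behaviour described above (the sample string and variable names are illustrative only), the following shows the three leading return values and the updated "pos":

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Illustrative input: two spaces of prefix, a nested group, then a tail.
my $text = "  (a (b) c) rest";

# List context: $text itself is not modified, only its pos() is updated.
my ($extracted, $remainder, $prefix) = extract_bracketed($text, '()');

print "extracted: [$extracted]\n";     # the group, delimiters included
print "remainder: [$remainder]\n";     # everything after the group
print "prefix:    [$prefix]\n";        # the skipped leading whitespace
print "pos:       ", pos($text), "\n"; # offset just past the extracted text
```

Because "pos" has moved, a second list-context call on the same variable would continue from just after the extracted text rather than from index zero.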
123
124 General behaviour in scalar and void contexts
125 In a scalar context, the extracted string is returned, having first been
126 removed from the input text. Thus, the following code also processes
127 each quote-like operation, but actually removes them from $text:
128
129 while ( $next = extract_quotelike($text) )
130 {
131 # process next quote-like (in $next)
132 }
133
134 Note that if the input text is a read-only string (i.e. a literal), no
135 attempt is made to remove the extracted text.
136
137 In a void context the behaviour of the extraction subroutines is exactly
138 the same as in a scalar context, except (of course) that the extracted
139 substring is not returned.
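A small sketch of the scalar-context behaviour, assuming an illustrative input of three single-quoted words: each call consumes the extracted string (and any skipped prefix) from the variable.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

# Illustrative input, not taken from the module's documentation.
my $text = q{'one' 'two' 'three'};
my @found;

# Scalar context: every successful call removes the extracted string and
# its prefix from $text, so the loop stops once nothing is left to extract.
while (defined(my $next = extract_delimited($text, "'"))) {
    push @found, $next;
}

print join(', ', @found), "\n";
print "left over: [$text]\n";
```

Testing with "defined" rather than plain truth is a cautious choice here, since an extracted string could in principle be false in boolean context.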
140
141 A note about prefixes
142 Prefix patterns are matched without any trailing modifiers ("/gimsox"
143 etc.) This can bite you if you're expecting a prefix specification like
144 '.*?(?=<H1>)' to skip everything up to the first <H1> tag. Such a prefix
145 pattern will only succeed if the <H1> tag is on the current line, since
146 . normally doesn't match newlines.
147
148 To overcome this limitation, you need to turn on /s matching within the
149 prefix pattern, using the "(?s)" directive: '(?s).*?(?=<H1>)'
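The difference can be sketched as follows, using "extract_tagged" with an illustrative multi-line string (the text and variable names are assumptions made for the example):

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

# The <H1> tag is not on the first line of the text.
my $text = "intro\nmore\n<H1>Title</H1> tail";

# Without (?s), '.' cannot cross the newlines, so the prefix never
# reaches the tag and the extraction fails.
my ($without_s) = extract_tagged($text, '<H1>', '</H1>', '.*?(?=<H1>)');

# With (?s), the prefix skips over the intervening lines.
my ($with_s) = extract_tagged($text, '<H1>', '</H1>', '(?s).*?(?=<H1>)');

print defined $without_s ? "matched: $without_s\n" : "no match without (?s)\n";
print defined $with_s    ? "matched: $with_s\n"    : "no match with (?s)\n";
```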
150
151 "extract_delimited"
152 The "extract_delimited" function formalizes the common idiom of
153 extracting a single-character-delimited substring from the start of a
154 string. For example, to extract a single-quote delimited string, the
155 following code is typically used:
156
157 ($remainder = $text) =~ s/\A('(\\.|[^'])*')//s;
158 $extracted = $1;
159
160 but with "extract_delimited" it can be simplified to:
161
162 ($extracted,$remainder) = extract_delimited($text, "'");
163
164 "extract_delimited" takes up to four scalars (the input text, the
165 delimiters, a prefix pattern to be skipped, and any escape characters)
166 and extracts the initial substring of the text that is appropriately
167 delimited. If the delimiter string has multiple characters, the first
168 one encountered in the text is taken to delimit the substring. The third
169 argument specifies a prefix pattern that is to be skipped (but must be
170 present!) before the substring is extracted. The final argument
171 specifies the escape character to be used for each delimiter.
172
173 All arguments are optional. If the escape characters are not specified,
174 every delimiter is escaped with a backslash ("\"). If the prefix is not
175 specified, the pattern '\s*' - optional whitespace - is used. If the
176 delimiter set is also not specified, the set "/["'`]/" is used. If the
177 text to be processed is not specified either, $_ is used.
178
179 In list context, "extract_delimited" returns an array of three elements,
180 the extracted substring (*including the surrounding delimiters*), the
181 remainder of the text, and the skipped prefix (if any). If a suitable
182 delimited substring is not found, the first element of the array is the
183 empty string, the second is the complete original text, and the prefix
184 returned in the third element is an empty string.
185
186 In a scalar context, just the extracted substring is returned. In a void
187 context, the extracted substring (and any prefix) are simply removed
188 from the beginning of the first argument.
189
190 Examples:
191
192 # Remove a single-quoted substring from the very beginning of $text:
193
194 $substring = extract_delimited($text, "'", '');
195
196 # Remove a single-quoted Pascalish substring (i.e. one in which
197 # doubling the quote character escapes it) from the very
198 # beginning of $text:
199
200 $substring = extract_delimited($text, "'", '', "'");
201
202 # Extract a single- or double- quoted substring from the
203 # beginning of $text, optionally after some whitespace
204 # (note the list context to protect $text from modification):
205
206 ($substring) = extract_delimited $text, q{"'};
207
208 # Delete the substring delimited by the first '/' in $text:
209
210 $text = join '', (extract_delimited($text,'/','[^/]*'))[2,1];
211
212 Note that this last example is *not* the same as deleting the first
213 quote-like pattern. For instance, if $text contained the string:
214
215 "if ('./cmd' =~ m/$UNIXCMD/s) { $cmd = $1; }"
216
217 then after the deletion it would contain:
218
219 "if ('.$UNIXCMD/s) { $cmd = $1; }"
220
221 not:
222
223 "if ('./cmd' =~ ms) { $cmd = $1; }"
224
225 See "extract_quotelike" for a (partial) solution to this problem.
226
227 "extract_bracketed"
228 Like "extract_delimited", the "extract_bracketed" function takes up to
229 three optional scalar arguments: a string to extract from, a delimiter
230 specifier, and a prefix pattern. As before, a missing prefix defaults to
231 optional whitespace and a missing text defaults to $_. However, a
232 missing delimiter specifier defaults to '{}()[]<>' (see below).
233
234 "extract_bracketed" extracts a balanced-bracket-delimited substring
235 (using any one (or more) of the user-specified delimiter brackets:
236 '(..)', '{..}', '[..]', or '<..>'). Optionally it will also respect
237 quoted unbalanced brackets (see below).
238
239 A "delimiter bracket" is a bracket in the list of delimiters passed as
240 "extract_bracketed"'s second argument. Delimiter brackets are specified
241 by giving either the left or right (or both!) versions of the required
242 bracket(s). Note that the order in which two or more delimiter brackets
243 are specified is not significant.
244
245 A "balanced-bracket-delimited substring" is a substring bounded by
246 matched brackets, such that any other (left or right) delimiter bracket
247 *within* the substring is also matched by an opposite (right or left)
248 delimiter bracket *at the same level of nesting*. Any type of bracket
249 not in the delimiter list is treated as an ordinary character.
250
251 In other words, each type of bracket specified as a delimiter must be
252 balanced and correctly nested within the substring, and any other kind
253 of ("non-delimiter") bracket in the substring is ignored.
254
255 For example, given the string:
256
257 $text = "{ an '[irregularly :-(] {} parenthesized >:-)' string }";
258
259 then a call to "extract_bracketed" in a list context:
260
261 @result = extract_bracketed( $text, '{}' );
262
263 would return:
264
265 ( "{ an '[irregularly :-(] {} parenthesized >:-)' string }" , "" , "" )
266
267 since both sets of '{..}' brackets are properly nested and evenly
268 balanced. (In a scalar context just the first element of the array would
269 be returned. In a void context, $text would be replaced by an empty
270 string.)
271
272 Likewise the call in:
273
274 @result = extract_bracketed( $text, '{[' );
275
276 would return the same result, since all sets of both types of specified
277 delimiter brackets are correctly nested and balanced.
278
279 However, the call in:
280
281 @result = extract_bracketed( $text, '{([<' );
282
283 would fail, returning:
284
285 ( undef , "{ an '[irregularly :-(] {} parenthesized >:-)' string }" );
286
287 because the embedded pairs of '(..)'s and '[..]'s are "cross-nested" and
288 the embedded '>' is unbalanced. (In a scalar context, this call would
289 return an empty string. In a void context, $text would be unchanged.)
290
291 Note that the embedded single-quotes in the string don't help in this
292 case, since they have not been specified as acceptable delimiters and
293 are therefore treated as non-delimiter characters (and ignored).
294
295 However, if a particular species of quote character is included in the
296 delimiter specification, then that type of quote will be correctly
297 handled. For example, if $text is:
298
299 $text = '<A HREF=">>>>">link</A>';
300
301 then
302
303 @result = extract_bracketed( $text, '<">' );
304
305 returns:
306
307 ( '<A HREF=">>>>">', 'link</A>', "" )
308
309 as expected. Without the specification of '"' as an embedded quoter:
310
311 @result = extract_bracketed( $text, '<>' );
312
313 the result would be:
314
315 ( '<A HREF=">', '>>>">link</A>', "" )
316
317 In addition to the quote delimiters "'", """, and "`", full Perl
318 quote-like quoting (i.e. q{string}, qq{string}, etc) can be specified by
319 including the letter 'q' as a delimiter. Hence:
320
321 @result = extract_bracketed( $text, '<q>' );
322
323 would correctly match something like this:
324
325 $text = '<leftop: conj /and/ conj>';
326
327 See also: "extract_quotelike" and "extract_codeblock".
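The quote-handling example above can be reproduced with a short script. This is only a sketch: the explicit "pos" reset is added here because the first list-context call advances the matching position of $text.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

my $text = '<A HREF=">>>>">link</A>';

# With '"' in the delimiter specification, brackets inside double quotes
# are not counted.
my @quoted = extract_bracketed($text, '<">');

# Rewind: the successful call above moved pos($text) past the extracted text.
pos($text) = 0;

# Without '"', the first '>' inside the attribute value closes the bracket.
my @plain = extract_bracketed($text, '<>');

print "quote-aware: $quoted[0]\n";
print "plain:       $plain[0]\n";
```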
328
329 "extract_variable"
330 "extract_variable" extracts any valid Perl variable or variable-involved
331 expression, including scalars, arrays, hashes, array accesses, hash
332 look-ups, method calls through objects, subroutine calls through
333 subroutine references, etc.
334
335 The subroutine takes up to two optional arguments:
336
337 1. A string to be processed ($_ if the string is omitted or "undef")
338
339 2. A string specifying a pattern to be matched as a prefix (which is to
340 be skipped). If omitted, optional whitespace is skipped.
341
342 On success in a list context, an array of 3 elements is returned. The
343 elements are:
344
345 [0] the extracted variable, or variablish expression
346
347 [1] the remainder of the input text,
348
349 [2] the prefix substring (if any),
350
351 On failure, all of these values (except the remaining text) are "undef".
352
353 In a scalar context, "extract_variable" returns just the complete
354 substring that matched a variablish expression. "undef" is returned on
355 failure. In addition, the original input text has the returned substring
356 (and any prefix) removed from it.
357
358 In a void context, the input text just has the matched substring (and
359 any specified prefix) removed.
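A brief sketch of "extract_variable" in list context, using an illustrative statement (the variable name in the sample text is hypothetical):

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_variable);

# Illustrative input: a hash look-up followed by an array access.
my $text = '$config{name}[2] = 5;';

my ($variable, $remainder, $prefix) = extract_variable($text);

print "variable:  [$variable]\n";   # the variablish expression
print "remainder: [$remainder]\n";  # the rest of the statement
```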
360
361 "extract_tagged"
362 "extract_tagged" extracts and segments text between (balanced) specified
363 tags.
364
365 The subroutine takes up to five optional arguments:
366
367 1. A string to be processed ($_ if the string is omitted or "undef")
368
369 2. A string specifying a pattern to be matched as the opening tag. If
370 the pattern string is omitted (or "undef") then a pattern that
371 matches any standard XML tag is used.
372
373 3. A string specifying a pattern to be matched at the closing tag. If
374 the pattern string is omitted (or "undef") then the closing tag is
375 constructed by inserting a "/" after any leading bracket characters
376 in the actual opening tag that was matched (*not* the pattern that
377 matched the tag). For example, if the opening tag pattern is
378 specified as '{{\w+}}' and actually matched the opening tag
379 "{{DATA}}", then the constructed closing tag would be "{{/DATA}}".
380
381 4. A string specifying a pattern to be matched as a prefix (which is to
382 be skipped). If omitted, optional whitespace is skipped.
383
384 5. A hash reference containing various parsing options (see below)
385
386 The various options that can be specified are:
387
388 "reject => $listref"
389 The list reference contains one or more strings specifying patterns
390 that must *not* appear within the tagged text.
391
392 For example, to extract an HTML link (which should not contain
393 nested links) use:
394
395 extract_tagged($text, '<A>', '</A>', undef, {reject => ['<A>']} );
396
397 "ignore => $listref"
398 The list reference contains one or more strings specifying patterns
399 that are *not* to be treated as nested tags within the tagged text
400 (even if they would match the start tag pattern).
401
402 For example, to extract an arbitrary XML tag, but ignore "empty"
403 elements:
404
405 extract_tagged($text, undef, undef, undef, {ignore => ['<[^>]*/>']} );
406
407 (also see "gen_delimited_pat" below).
408
409 "fail => $str"
410 The "fail" option indicates the action to be taken if a matching end
411 tag is not encountered (i.e. before the end of the string or some
412 "reject" pattern matches). By default, a failure to match a closing
413 tag causes "extract_tagged" to immediately fail.
414
415 However, if the string value associated with "fail" is "MAX", then
416 "extract_tagged" returns the complete text up to the point of
417 failure. If the string is "PARA", "extract_tagged" returns only the
418 first paragraph after the tag (up to the first line that is either
419 empty or contains only whitespace characters). If the string is "",
420 the default behaviour (i.e. failure) is reinstated.
421
422 For example, suppose the start tag "/para" introduces a paragraph,
423 which then continues until the next "/endpara" tag or until another
424 "/para" tag is encountered:
425
426 $text = "/para line 1\n\nline 3\n/para line 4";
427
428 extract_tagged($text, '/para', '/endpara', undef,
429 {reject => ['/para'], fail => 'MAX'} );
430
431 # EXTRACTED: "/para line 1\n\nline 3\n"
432
433 Suppose instead, that if no matching "/endpara" tag is found, the
434 "/para" tag refers only to the immediately following paragraph:
435
436 $text = "/para line 1\n\nline 3\n/para line 4";
437
438 extract_tagged($text, '/para', '/endpara', undef,
439 {reject => ['/para'], fail => 'PARA'} );
440
441 # EXTRACTED: "/para line 1\n"
442
443 Note that the specified "fail" behaviour applies to nested tags as
444 well.
445
446 On success in a list context, an array of 6 elements is returned. The
447 elements are:
448
449 [0] the extracted tagged substring (including the outermost tags),
450
451 [1] the remainder of the input text,
452
453 [2] the prefix substring (if any),
454
455 [3] the opening tag
456
457 [4] the text between the opening and closing tags
458
459 [5] the closing tag (or "" if no closing tag was found)
460
461 On failure, all of these values (except the remaining text) are "undef".
462
463 In a scalar context, "extract_tagged" returns just the complete
464 substring that matched a tagged text (including the start and end tags).
465 "undef" is returned on failure. In addition, the original input text has
466 the returned substring (and any prefix) removed from it.
467
468 In a void context, the input text just has the matched substring (and
469 any specified prefix) removed.
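As a sketch of the six-element list described above, the default XML-style tag pattern can be applied to an illustrative fragment containing a nested tag:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

# Illustrative input; with no tag patterns given, any XML-style tag is
# accepted and the closing tag is derived from the opening one.
my $text = '<b>bold <i>nested</i></b> after';

my ($extracted, $remainder, $prefix, $open, $inner, $close)
    = extract_tagged($text);

print "extracted: [$extracted]\n";
print "remainder: [$remainder]\n";
print "open tag:  [$open]\n";
print "inner:     [$inner]\n";
print "close tag: [$close]\n";
```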
470
471 "gen_extract_tagged"
472 (Note: This subroutine is only available under Perl 5.005 or later)
473
474 "gen_extract_tagged" generates a new anonymous subroutine which extracts
475 text between (balanced) specified tags. In other words, it generates a
476 function identical in behaviour to "extract_tagged".
477
478 The difference between "extract_tagged" and the anonymous subroutines
479 generated by "gen_extract_tagged", is that those generated subroutines:
480
481 * do not have to reparse tag specification or parsing options every
482 time they are called (whereas "extract_tagged" has to effectively
483 rebuild its tag parser on every call);
484
485 * make use of the new qr// construct to pre-compile the regexes they
486 use (whereas "extract_tagged" uses standard string variable
487 interpolation to create tag-matching patterns).
488
489 The subroutine takes up to four optional arguments (the same set as
490 "extract_tagged" except for the string to be processed). It returns a
491 reference to a subroutine which in turn takes a single argument (the
492 text to be extracted from).
493
494 In other words, the implementation of "extract_tagged" is exactly
495 equivalent to:
496
497 sub extract_tagged
498 {
499 my $text = shift;
500 my $extractor = gen_extract_tagged(@_);
501 return $extractor->($text);
502 }
503
504 (although "extract_tagged" is not currently implemented that way, in
505 order to preserve pre-5.005 compatibility).
506
507 Using "gen_extract_tagged" to create extraction functions for specific
508 tags is a good idea if those functions are going to be called more than
509 once, since their performance is typically twice as good as the more
510 general-purpose "extract_tagged".
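A minimal sketch of building an extractor once and reusing it; the two sample documents are illustrative only:

```perl
use strict;
use warnings;
use Text::Balanced qw(gen_extract_tagged);

# Build the extractor once; the tag patterns are compiled at this point.
my $extract_head = gen_extract_tagged('<HEAD>', '</HEAD>');

# Illustrative inputs.
my @documents = (
    '<HEAD><TITLE>First</TITLE></HEAD><BODY>one</BODY>',
    '<HEAD><TITLE>Second</TITLE></HEAD><BODY>two</BODY>',
);

# Reuse the same extractor for every document.
my @heads;
for my $doc (@documents) {
    my ($head, $rest) = $extract_head->($doc);
    push @heads, $head;
    print "head: $head\nrest: $rest\n";
}
```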
511
512 "extract_quotelike"
513 "extract_quotelike" attempts to recognize, extract, and segment any one
514 of the various Perl quotes and quotelike operators (see perlop(3)).
515 Nested backslashed delimiters, embedded balanced bracket delimiters (for
516 the quotelike operators), and trailing modifiers are all caught. For
517 example, in:
518
519 extract_quotelike 'q # an octothorpe: \# (not the end of the q!) #'
520
521 extract_quotelike ' "You said, \"Use sed\"." '
522
523 extract_quotelike ' s{([A-Z]{1,8}\.[A-Z]{3})} /\L$1\E/; '
524
525 extract_quotelike ' tr/\\\/\\\\/\\\//ds; '
526
527 the full Perl quotelike operations are all extracted correctly.
528
529 Note too that, when using the /x modifier on a regex, any comment
530 containing the current pattern delimiter will cause the regex to be
531 immediately terminated. In other words:
532
533 'm /
534 (?i) # CASE INSENSITIVE
535 [a-z_] # LEADING ALPHABETIC/UNDERSCORE
536 [a-z0-9]* # FOLLOWED BY ANY NUMBER OF ALPHANUMERICS
537 /x'
538
539 will be extracted as if it were:
540
541 'm /
542 (?i) # CASE INSENSITIVE
543 [a-z_] # LEADING ALPHABETIC/'
544
545 This behaviour is identical to that of the actual compiler.
546
547 "extract_quotelike" takes two arguments: the text to be processed and a
548 prefix to be matched at the very beginning of the text. If no prefix is
549 specified, optional whitespace is the default. If no text is given, $_
550 is used.
551
552 In a list context, an array of 11 elements is returned. The elements
553 are:
554
555 [0] the extracted quotelike substring (including trailing modifiers),
556
557 [1] the remainder of the input text,
558
559 [2] the prefix substring (if any),
560
561 [3] the name of the quotelike operator (if any),
562
563 [4] the left delimiter of the first block of the operation,
564
565 [5] the text of the first block of the operation (that is, the contents
566 of a quote, the regex of a match or substitution or the target list
567 of a translation),
568
569 [6] the right delimiter of the first block of the operation,
570
571 [7] the left delimiter of the second block of the operation (that is, if
572 it is a "s", "tr", or "y"),
573
574 [8] the text of the second block of the operation (that is, the
575 replacement of a substitution or the translation list of a
576 translation),
577
578 [9] the right delimiter of the second block of the operation (if any),
579
580 [10]
581 the trailing modifiers on the operation (if any).
582
583 For each of the fields marked "(if any)" the default value on success is
584 an empty string. On failure, all of these values (except the remaining
585 text) are "undef".
586
587 In a scalar context, "extract_quotelike" returns just the complete
588 substring that matched a quotelike operation (or "undef" on failure). In
589 a scalar or void context, the input text has the same substring (and any
590 specified prefix) removed.
591
592 Examples:
593
594 # Remove the first quotelike literal that appears in text
595
596 $quotelike = extract_quotelike($text,'.*?');
597
598 # Replace one or more leading whitespace-separated quotelike
599 # literals in $_ with "<QLL>"
600
601 do { $_ = join '<QLL>', (extract_quotelike)[2,1] } until $@;
602
603
604 # Isolate the search pattern in a quotelike operation from $text
605
606 ($op,$pat) = (extract_quotelike $text)[3,5];
607 if ($op =~ /[ms]/)
608 {
609 print "search pattern: $pat\n";
610 }
611 else
612 {
613 print "$op is not a pattern matching operation\n";
614 }
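To make the eleven-element list concrete, here is a sketch that splits an illustrative substitution into its parts (the field labels are informal names chosen for this example):

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_quotelike);

# Illustrative input: a substitution with bracketed blocks and modifiers.
my $text = 's{foo}{bar}gi; rest';

my @field = extract_quotelike($text);

# Informal labels for the eleven positions documented above.
my @label = qw(extracted remainder prefix operator
               left1 text1 right1 left2 text2 right2 modifiers);

printf "%-10s [%s]\n", $label[$_], $field[$_] for 0 .. $#label;
```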
615
616 "extract_quotelike" and "here documents"
617 "extract_quotelike" can successfully extract "here documents" from an
618 input string, but with an important caveat in list contexts.
619
620 Unlike other types of quote-like literals, a here document is rarely a
621 contiguous substring. For example, a typical piece of code using a here
622 document might look like this:
623
624 <<'EOMSG' || die;
625 This is the message.
626 EOMSG
627 exit;
628
629 Given this as an input string in a scalar context, "extract_quotelike"
630 would correctly return the string "<<'EOMSG'\nThis is the
631 message.\nEOMSG", leaving the string " || die;\nexit;" in the original
632 variable. In other words, the two separate pieces of the here document
633 are successfully extracted and concatenated.
634
635 In a list context, "extract_quotelike" would return the list
636
637 [0] "<<'EOMSG'\nThis is the message.\nEOMSG\n" (i.e. the full extracted
638 here document, including fore and aft delimiters),
639
640 [1] " || die;\nexit;" (i.e. the remainder of the input text,
641 concatenated),
642
643 [2] "" (i.e. the prefix substring -- trivial in this case),
644
645 [3] "<<" (i.e. the "name" of the quotelike operator)
646
647 [4] "'EOMSG'" (i.e. the left delimiter of the here document, including
648 any quotes),
649
650 [5] "This is the message.\n" (i.e. the text of the here document),
651
652 [6] "EOMSG" (i.e. the right delimiter of the here document),
653
654 [7..10]
655 "" (a here document has no second left delimiter, second text,
656 second right delimiter, or trailing modifiers).
657
658 However, the matching position of the input variable would be set to
659 "exit;" (i.e. *after* the closing delimiter of the here document), which
660 would cause the earlier " || die;\nexit;" to be skipped in any sequence
661 of code fragment extractions.
662
663 To avoid this problem, when it encounters a here document whilst
664 extracting from a modifiable string, "extract_quotelike" silently
665 rearranges the string to an equivalent piece of Perl:
666
667 <<'EOMSG'
668 This is the message.
669 EOMSG
670 || die;
671 exit;
672
673 in which the here document *is* contiguous. It still leaves the matching
674 position after the here document, but now the rest of the line on which
675 the here document starts is not skipped.
676
677 To prevent "extract_quotelike" from mucking about with the input in this
678 way (this is the only case where a list-context "extract_quotelike" does
679 so), you can pass the input variable as an interpolated literal:
680
681 $quotelike = extract_quotelike("$var");
682
683 "extract_codeblock"
684 "extract_codeblock" attempts to recognize and extract a balanced bracket
685 delimited substring that may contain unbalanced brackets inside Perl
686 quotes or quotelike operations. That is, "extract_codeblock" is like a
687 combination of "extract_bracketed" and "extract_quotelike".
688
689 "extract_codeblock" takes the same initial three parameters as
690 "extract_bracketed": a text to process, a set of delimiter brackets to
691 look for, and a prefix to match first. It also takes an optional fourth
692 parameter, which allows the outermost delimiter brackets to be specified
693 separately (see below).
694
695 Omitting the first argument (input text) means process $_ instead.
696 Omitting the second argument (delimiter brackets) indicates that only
697 '{' is to be used. Omitting the third argument (prefix argument) implies
698 optional whitespace at the start. Omitting the fourth argument
699 (outermost delimiter brackets) indicates that the value of the second
700 argument is to be used for the outermost delimiters.
701
702 Once the prefix and the outermost opening delimiter bracket have been
703 recognized, code blocks are extracted by stepping through the input text
704 and trying the following alternatives in sequence:
705
706 1. Try and match a closing delimiter bracket. If the bracket was the
707 same species as the last opening bracket, return the substring to
708 that point. If the bracket was mismatched, return an error.
709
710 2. Try to match a quote or quotelike operator. If found, call
711 "extract_quotelike" to eat it. If "extract_quotelike" fails, return
712 the error it returned. Otherwise go back to step 1.
713
714 3. Try to match an opening delimiter bracket. If found, call
715 "extract_codeblock" recursively to eat the embedded block. If the
716 recursive call fails, return an error. Otherwise, go back to step 1.
717
718 4. Unconditionally match a bareword or any other single character, and
719 then go back to step 1.
720
721 Examples:
722
723 # Find a while loop in the text
724
725 if ($text =~ s/.*?while\s*\{/{/)
726 {
727 $loop = "while " . extract_codeblock($text);
728 }
729
730 # Remove the first round-bracketed list (which may include
731 # round- or curly-bracketed code blocks or quotelike operators)
732
733 extract_codeblock $text, "(){}", '[^(]*';
734
735 The ability to specify a different outermost delimiter bracket is useful
736 in some circumstances. For example, in the Parse::RecDescent module,
737 parser actions which are to be performed only on a successful parse are
738 specified using a "<defer:...>" directive. For example:
739
740 sentence: subject verb object
741 <defer: {$::theVerb = $item{verb}} >
742
743 Parse::RecDescent uses "extract_codeblock($text, '{}<>')" to extract the
744 code within the "<defer:...>" directive, but there's a problem.
745
746 A deferred action like this:
747
748 <defer: {if ($count>10) {$count--}} >
749
750 will be incorrectly parsed as:
751
752 <defer: {if ($count>
753
754 because the "less than" operator is interpreted as a closing delimiter.
755
756 But, by extracting the directive using
757 "extract_codeblock($text, '{}', undef, '<>')" the '>' character is only
758 treated as a delimiter at the outermost level of the code block, so the
759 directive is parsed correctly.
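The two behaviours described in this section can be sketched as follows. The first input is an illustrative block with a closing brace inside a string; the second is the deferred action from the example above with some trailing text added.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_codeblock);

# A brace inside a quoted string should not end the block.
my $code = '{ my $str = "}"; print $str } trailing';
my ($block, $rest) = extract_codeblock($code, '{}');

# '<>' is given only as the outermost delimiter (fourth argument), so the
# '>' comparison inside the braces is not treated as a closing bracket.
my $directive = '<defer: {if ($count>10) {$count--}} > next';
my ($deferred) = extract_codeblock($directive, '{}', undef, '<>');

print "block:    $block\n";
print "rest:     $rest\n";
print "deferred: $deferred\n";
```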
760
761 "extract_multiple"
762 The "extract_multiple" subroutine takes a string to be processed and a
763 list of extractors (subroutines or regular expressions) to apply to that
764 string.
765
766 In a list context "extract_multiple" returns an array of substrings of
767 the original string, as extracted by the specified extractors. In a
768 scalar context, "extract_multiple" returns the first substring
769 successfully extracted from the original string. In both scalar and void
770 contexts the original string has the first successfully extracted
771 substring removed from it. In all contexts "extract_multiple" starts at
772 the current "pos" of the string, and sets that "pos" appropriately after
773 it matches.
774
775 Hence, the aim of a call to "extract_multiple" in a list context is
776 to split the processed string into as many non-overlapping fields as
777 possible, by repeatedly applying each of the specified extractors to the
778 remainder of the string. Thus "extract_multiple" is a generalized form
779 of Perl's "split" subroutine.
780
781 The subroutine takes up to four optional arguments:
782
783 1. A string to be processed ($_ if the string is omitted or "undef")
784
785 2. A reference to a list of subroutine references and/or qr// objects
786 and/or literal strings and/or hash references, specifying the
787 extractors to be used to split the string. If this argument is
788 omitted (or "undef") the list:
789
790 [
791 sub { extract_variable($_[0], '') },
792 sub { extract_quotelike($_[0],'') },
793 sub { extract_codeblock($_[0],'{}','') },
794 ]
795
796 is used.
797
798 3. A number specifying the maximum number of fields to return. If this
799 argument is omitted (or "undef"), split continues as long as
800 possible.
801
802 If the third argument is *N*, then extraction continues until *N*
803 fields have been successfully extracted, or until the string has
804 been completely processed.
805
806 Note that in scalar and void contexts the value of this argument is
807 automatically reset to 1 (under "-w", a warning is issued if the
808 argument has to be reset).
809
810 4. A value indicating whether unmatched substrings (see below) within
811 the text should be skipped or returned as fields. If the value is
812 true, such substrings are skipped. Otherwise, they are returned.
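
The effect of the third and fourth arguments can be sketched as follows. The input and the variable names are illustrative; note the explicit rewinding of "pos" between calls, which is needed here only because the same string is reused:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple extract_bracketed);

my $text   = 'a{b}c{d}e{f}';
my $braces = [ sub { extract_bracketed($_[0], '{}') } ];

# Third argument: stop after two fields (unmatched text counts as a field).
my @limited = extract_multiple($text, $braces, 2);

# Each call starts at the current pos of $text, so rewind before reuse.
pos($text) = 0;

# Fourth argument true: unmatched text between the groups is discarded.
my @skipping = extract_multiple($text, $braces, undef, 1);

pos($text) = 0;

# Both together: the first two bracketed groups only.
my @both = extract_multiple($text, $braces, 2, 1);

print "limited:  @limited\n";
print "skipping: @skipping\n";
print "both:     @both\n";
```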
813
814 The extraction process works by applying each extractor in sequence to
815 the text string.
816
817 If the extractor is a subroutine it is called in a list context and is
818 expected to return a list of a single element, namely the extracted
819 text. It may optionally also return two further arguments: a string
820 representing the text left after extraction (like $' for a pattern
821 match), and a string representing any prefix skipped before the
822 extraction (like $` in a pattern match). Note that this is designed to
823 facilitate the use of other Text::Balanced subroutines with
824 "extract_multiple". Note too that the value returned by an extractor
825 subroutine need not bear any relationship to the corresponding substring
826 of the original text (see examples below).
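
A hand-written extractor might look like the sketch below. The $double subroutine is hypothetical and not part of Text::Balanced; it consumes a run of digits at the current position of its argument and returns a value that never appears in the text:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple);

my $text = '3 apples, 10 pears';

# Hypothetical extractor: match digits at the current pos of $_[0]
# (which aliases the string being processed) and return twice their
# value. Returning undef signals that nothing was extracted here.
my $double = sub {
    return $_[0] =~ /\G(\d+)/gc ? $1 * 2 : undef;
};

my @fields = extract_multiple($text, [ $double ]);

print join(' | ', @fields), "\n";
```

Because the extractor matches with "\G" and the "gc" modifiers on its aliased argument, it advances the "pos" of the string on success and leaves it alone on failure, which is what "extract_multiple" relies on.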
827
828 If the extractor is a precompiled regular expression or a string, it is
829 matched against the text in a scalar context with a leading '\G' and the
830 gc modifiers enabled. The extracted value is either $1 if that variable
831 is defined after the match, or else the complete match (i.e. $&).
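
The following sketch contrasts a pattern with a capturing group against one without. The key/value string is invented for the example:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple);

my $text = 'k1=v1;k2=v2';

# With a capturing group, the field is $1 (the key without the "=").
my @keys = extract_multiple($text, [ qr/(\w+)=/ ], undef, 1);

# Rewind, since the previous call moved the pos of $text.
pos($text) = 0;

# Without a capturing group, the field is the complete match.
my @keys_with_eq = extract_multiple($text, [ qr/\w+=/ ], undef, 1);

print "captured:    @keys\n";
print "whole match: @keys_with_eq\n";
```

In both cases the whole match is consumed from the text, so a capturing group is a convenient way to drop delimiters from the returned field.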
832
833 If the extractor is a hash reference, it must contain exactly one
834 element. The value of that element is one of the above extractor types
835 (subroutine reference, regular expression, or string). The key of that
836 element is the name of a class into which the successful return value of
837 the extractor will be blessed.
838
839 If an extractor returns a defined value, that value is immediately
840 treated as the next extracted field and pushed onto the list of fields.
841 If the extractor was specified in a hash reference, the field is also
842 blessed into the appropriate class.
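
A sketch of how blessed fields might be told apart from plain ones is shown below. It assumes that a blessed field is a reference to a scalar holding the extracted text, which is how the implementation this sketch was written against behaves; the class names and input are illustrative:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple extract_delimited extract_bracketed);

my $text = q{a"b"{c}d};

# Empty prefixes keep each extractor anchored at the current position.
my @fields = extract_multiple($text, [
    { Delim => sub { extract_delimited($_[0], q{'"}, '') } },
    { Brack => sub { extract_bracketed($_[0], '{}', '') } },
]);

for my $field (@fields) {
    if (ref $field) {
        # Assumption: blessed fields are scalar references.
        printf "%-5s %s\n", ref($field), $$field;
    }
    else {
        printf "%-5s %s\n", 'plain', $field;
    }
}
```

Checking "ref" on each field is enough to separate extractor output from the unmatched text in between.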
843
844 If the extractor fails to match (in the case of a regex extractor), or
845 returns an empty list or an undefined value (in the case of a subroutine
846 extractor), it is assumed to have failed to extract. If none of the
847 extractor subroutines succeeds, then one character is extracted from the
848 start of the text and the extraction subroutines reapplied. Characters
849 which are thus removed are accumulated and eventually become the next
850 field (unless the fourth argument is true, in which case they are
851 discarded).
852
853 For example, the following extracts substrings that are valid Perl
854 variables:
855
856     @fields = extract_multiple($text,
857                                [ sub { extract_variable($_[0]) } ],
858                                undef, 1);
859
860 This example separates a text into fields which are quote delimited,
861 curly bracketed, and anything else. The delimited and bracketed parts
862 are also blessed to identify them (the "anything else" is unblessed):
863
864     @fields = extract_multiple($text,
865                [
866                  { Delim => sub { extract_delimited($_[0],q{'"}) } },
867                  { Brack => sub { extract_bracketed($_[0],'{}') } },
868                ]);
869
870 This call extracts the next single substring that is a valid Perl
871 quotelike operator (and removes it from $text):
872
873     $quotelike = extract_multiple($text,
874                                   [
875                                     sub { extract_quotelike($_[0]) },
876                                   ], undef, 1);
877
878 Finally, here is yet another way to do comma-separated value parsing:
879
880     @fields = extract_multiple($csv_text,
881                               [
882                                 sub { extract_delimited($_[0],q{'"}) },
883                                 qr/([^,]+)/,
884                               ],
885                               undef,1);
886
887 The list in the second argument means: *"Try and extract a ' or "
888 delimited string, otherwise extract anything up to a comma..."*. The
889 undef third argument means: *"...as many times as possible..."*, and the
890 true value in the fourth argument means *"...discarding anything else
891 that appears (i.e. the commas)"*.
892
893 If you wanted the commas preserved as separate fields (i.e. like split
894 does if your split pattern has capturing parentheses), you would just
895 make the last parameter undefined (or remove it).
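
The two variants can be compared with a sketch like the following. The sample line is invented, and the pattern used here captures only up to the next comma so that the rest of the line remains available for further extraction:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple extract_delimited);

my $csv = q{a,"b,c",d};

my $extractors = [
    sub { extract_delimited($_[0], q{'"}) },   # quoted value
    qr/([^,]+)/,                               # anything up to a comma
];

# Fourth argument true: the commas are discarded.
my @values = extract_multiple($csv, $extractors, undef, 1);

# Rewind, since the previous call moved the pos of $csv.
pos($csv) = 0;

# Fourth argument omitted: the commas come back as separate fields.
my @with_commas = extract_multiple($csv, $extractors);

print join(' | ', @values), "\n";
print join(' | ', @with_commas), "\n";
```

Note that the quoted value keeps its surrounding quotes and its embedded comma in both cases.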
896
897 "gen_delimited_pat"
898 The "gen_delimited_pat" subroutine takes a single (string) argument and
899 builds a Friedl-style optimized regex that matches a string delimited
900 by any one of the characters in the single argument. For example:
901
902 gen_delimited_pat(q{'"})
903
904 returns the regex:
905
906 (?:\"(?:\\\"|(?!\").)*\"|\'(?:\\\'|(?!\').)*\')
907
908 Note that the specified delimiters are automatically quotemeta'd.
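
The generated pattern can be interpolated into a larger regex. A minimal sketch, using an invented source string, that collects every quoted string including ones with backslash-escaped quotes:

```perl
use strict;
use warnings;
use Text::Balanced qw(gen_delimited_pat);

# Pattern matching a single- or double-quoted string.
my $quoted = gen_delimited_pat(q{'"});

# Within q{}, the sequence \' stays as a backslash followed by a quote.
my $source = q{x = 'it\'s' . "two"};

# The generated pattern has no capturing groups of its own, so wrap it
# in one to collect each complete quoted string.
my @strings = $source =~ /($quoted)/g;

print "$_\n" for @strings;
```

The exact text of the generated pattern may differ between versions of the module, so it is safer to rely on what it matches than on how it is spelled.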
909
910 A typical use of "gen_delimited_pat" would be to build special purpose
911 tags for "extract_tagged". For example, to properly ignore "empty" XML
912 elements (which might contain quoted strings):
913
914 my $empty_tag = '<(' . gen_delimited_pat(q{'"}) . '|.)+/>';
915
916 extract_tagged($text, undef, undef, undef, {ignore => [$empty_tag]} );
917
918 "gen_delimited_pat" may also be called with an optional second argument,
919 which specifies the "escape" character(s) to be used for each delimiter.
920 For example to match a Pascal-style string (where ' is the delimiter and
921 '' is a literal ' within the string):
922
923 gen_delimited_pat(q{'},q{'});
924
925 Different escape characters can be specified for different delimiters.
926 For example, to specify that '/' is the escape for single quotes and '%'
927 is the escape for double quotes:
928
929 gen_delimited_pat(q{'"},q{/%});
930
931 If more delimiters than escape chars are specified, the last escape char
932 is used for the remaining delimiters. If no escape char is specified for
933 a given specified delimiter, '\' is used.
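
The escape rules above can be checked with a short sketch. The test strings are made up, and each pattern is anchored with "\A" and "\z" so that only a complete delimited string counts as a match:

```perl
use strict;
use warnings;
use Text::Balanced qw(gen_delimited_pat);

# Pascal style: a doubled quote stands for a literal quote.
my $pascal = gen_delimited_pat(q{'}, q{'});

# '/' escapes inside single quotes, '%' inside double quotes.
my $mixed = gen_delimited_pat(q{'"}, q{/%});

my @checks = (
    [ 'pascal doubled quote',   $pascal, q{'it''s'} ],
    [ 'pascal backslash quote', $pascal, q{'it\'s'} ],
    [ 'mixed slash escape',     $mixed,  q{'a/'b'}  ],
    [ 'mixed percent escape',   $mixed,  q{"a%"b"}  ],
    [ 'mixed backslash quote',  $mixed,  q{'a\'b'}  ],
);

my %result;
for my $check (@checks) {
    my ($name, $pat, $string) = @$check;
    $result{$name} = $string =~ /\A$pat\z/ ? 'match' : 'no match';
    printf "%-24s %-10s %s\n", $name, $string, $result{$name};
}
```

Once a custom escape is given, backslash is no longer special for that delimiter, which is why the backslash cases do not match as a single string.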
934
935 "delimited_pat"
936 Note that "gen_delimited_pat" was previously called "delimited_pat".
937 That name may still be used, but is now deprecated.
938
939DIAGNOSTICS
940 In a list context, all the functions return "(undef,$original_text)" on
941 failure. In a scalar context, failure is indicated by returning "undef"
942 (in this case the input text is not modified in any way).
943
944 In addition, on failure in *any* context, the $@ variable is set.
945 Accessing "$@->{error}" returns one of the error diagnostics listed
946 below. Accessing "$@->{pos}" returns the offset into the original string
947 at which the error was detected (although not necessarily where it
948 occurred!). Printing $@ directly produces the error message, with the
949 offset appended. On success, the $@ variable is guaranteed to be
950 "undef".
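
A sketch of inspecting a failure follows. The input strings are illustrative, and $@ is copied straight after each call because later operations may overwrite it:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Failure: the text does not start with an opening brace.
my $text = 'no brackets here';
my ($extracted, $remainder) = extract_bracketed($text, '{}');
my $err = $@;

if (!defined $extracted) {
    print "error:  $err->{error}\n";
    print "offset: $err->{pos}\n";
    print "full:   $err\n";    # message with the offset appended
}

# Success: $@ is cleared.
my $good = '{ok} rest';
my ($ok, $rest) = extract_bracketed($good, '{}');
my $err_after_success = $@;

print "extracted: $ok\n" if !defined $err_after_success;
```

Testing the first returned value for definedness is the simplest success check in a list context; $@ is mainly useful for reporting what went wrong and where.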
951
952 The available diagnostics are:
953
954 "Did not find a suitable bracket: "%s""
955 The delimiter provided to "extract_bracketed" was not one of
956 '()[]<>{}'.
957
958 "Did not find prefix: /%s/"
959 A non-optional prefix was specified but wasn't found at the start of
960 the text.
961
962 "Did not find opening bracket after prefix: "%s""
963 "extract_bracketed" or "extract_codeblock" was expecting a
964 particular kind of bracket at the start of the text, and didn't find
965 it.
966
967 "No quotelike operator found after prefix: "%s""
968 "extract_quotelike" didn't find one of the quotelike operators "q",
969 "qq", "qw", "qx", "s", "tr" or "y" at the start of the substring it
970 was extracting.
971
972 "Unmatched closing bracket: "%c""
973 "extract_bracketed", "extract_quotelike" or "extract_codeblock"
974 encountered a closing bracket where none was expected.
975
976 "Unmatched opening bracket(s): "%s""
977 "extract_bracketed", "extract_quotelike" or "extract_codeblock" ran
978 out of characters in the text before closing one or more levels of
979 nested brackets.
980
981 "Unmatched embedded quote (%s)"
982 "extract_bracketed" attempted to match an embedded quoted substring,
983 but failed to find a closing quote to match it.
984
985 "Did not find closing delimiter to match '%s'"
986 "extract_quotelike" was unable to find a closing delimiter to match
987 the one that opened the quote-like operation.
988
989 "Mismatched closing bracket: expected "%c" but found "%s""
990 "extract_bracketed", "extract_quotelike" or "extract_codeblock"
991 found a valid bracket delimiter, but it was the wrong species. This
992 usually indicates a nesting error, but may indicate incorrect
993 quoting or escaping.
994
995 "No block delimiter found after quotelike "%s""
996 "extract_quotelike" or "extract_codeblock" found one of the
997 quotelike operators "q", "qq", "qw", "qx", "s", "tr" or "y" without
998 a suitable block after it.
999
1000 "Did not find leading dereferencer"
1001 "extract_variable" was expecting one of '$', '@', or '%' at the
1002 start of a variable, but didn't find any of them.
1003
1004 "Bad identifier after dereferencer"
1005 "extract_variable" found a '$', '@', or '%' indicating a variable,
1006 but that character was not followed by a legal Perl identifier.
1007
1008 "Did not find expected opening bracket at %s"
1009 "extract_codeblock" failed to find any of the outermost opening
1010 brackets that were specified.
1011
1012 "Improperly nested codeblock at %s"
1013 A nested code block was found that started with a delimiter that was
1014 specified as being only to be used as an outermost bracket.
1015
1016 "Missing second block for quotelike "%s""
1017 "extract_codeblock" or "extract_quotelike" found one of the
1018 quotelike operators "s", "tr" or "y" followed by only one block.
1019
1020 "No match found for opening bracket"
1021 "extract_codeblock" failed to find a closing bracket to match the
1022 outermost opening bracket.
1023
1024 "Did not find opening tag: /%s/"
1025 "extract_tagged" did not find a suitable opening tag (after any
1026 specified prefix was removed).
1027
1028 "Unable to construct closing tag to match: /%s/"
1029 "extract_tagged" matched the specified opening tag and tried to
1030 modify the matched text to produce a matching closing tag (because
1031 none was specified). It failed to generate the closing tag, almost
1032 certainly because the opening tag did not start with a bracket of
1033 some kind.
1034
1035 "Found invalid nested tag: %s"
1036 "extract_tagged" found a nested tag that appeared in the "reject"
1037 list (and the failure mode was not "MAX" or "PARA").
1038
1039 "Found unbalanced nested tag: %s"
1040 "extract_tagged" found a nested opening tag that was not matched by
1041 a corresponding nested closing tag (and the failure mode was not
1042 "MAX" or "PARA").
1043
1044 "Did not find closing tag"
1045 "extract_tagged" reached the end of the text without finding a
1046 closing tag to match the original opening tag (and the failure mode
1047 was not "MAX" or "PARA").
1048
1049AUTHOR
1050 Damian Conway (damian@conway.org)
1051
1052BUGS AND IRRITATIONS
1053 There are undoubtedly serious bugs lurking somewhere in this code, if
1054 only because parts of it give the impression of understanding a great
1055 deal more about Perl than they really do.
1056
1057 Bug reports and other feedback are most welcome.
1058
1059COPYRIGHT
1060 Copyright 1997 - 2001 Damian Conway. All Rights Reserved.
1061
1062 Some (minor) parts copyright 2009 Adam Kennedy.
1063
1064 This module is free software. It may be used, redistributed and/or
1065 modified under the same terms as Perl itself.
1066