lib/Text/Balanced.pod

   1 =head1 NAME
   2
   3 Text::Balanced - Extract delimited text sequences from strings.
   4
   5
   6 =head1 SYNOPSIS
   7
   8  use Text::Balanced qw (
   9                         extract_delimited
  10                         extract_bracketed
  11                         extract_quotelike
  12                         extract_codeblock
  13                         extract_variable
  14                         extract_tagged
  15                         extract_multiple
  16
  17                         gen_delimited_pat
  18                         gen_extract_tagged
  19                        );
  20
  21  # Extract the initial substring of $text that is delimited by
  22  # two (unescaped) instances of the first character in $delim.
  23
  24         ($extracted, $remainder) = extract_delimited($text,$delim);
  25
  26
  27  # Extract the initial substring of $text that is bracketed
  28  # with a delimiter(s) specified by $delim (where the string
  29  # in $delim contains one or more of '(){}[]<>').
  30
  31         ($extracted, $remainder) = extract_bracketed($text,$delim);
  32
  33
  34  # Extract the initial substring of $text that is bounded by
  35  # an HTML/XML tag.
  36
  37         ($extracted, $remainder) = extract_tagged($text);
  38
  39
  40  # Extract the initial substring of $text that is bounded by
   41  # a BEGIN...END pair. Don't allow nested BEGIN tags.
   42
   43         ($extracted, $remainder) =
   44                 extract_tagged($text,"BEGIN","END",undef,{reject=>["BEGIN"]});
  45
  46
  47  # Extract the initial substring of $text that represents a
  48  # Perl "quote or quote-like operation"
  49
  50         ($extracted, $remainder) = extract_quotelike($text);
  51
  52
  53  # Extract the initial substring of $text that represents a block
   54  # of Perl code, bracketed by any of the character(s) specified by $delim
  55  # (where the string $delim contains one or more of '(){}[]<>').
  56
  57         ($extracted, $remainder) = extract_codeblock($text,$delim);
  58
  59
  60  # Extract the initial substrings of $text that would be extracted by
  61  # one or more sequential applications of the specified functions
  62  # or regular expressions
  63
  64         @extracted = extract_multiple($text,
  65                                       [ \&extract_bracketed,
  66                                         \&extract_quotelike,
  67                                         \&some_other_extractor_sub,
  68                                         qr/[xyz]*/,
  69                                         'literal',
  70                                       ]);
  71
   72  # Create a string representing an optimized pattern (a la Friedl)
   73  # that matches a substring delimited by any of the specified characters
   74  # (in this case: any type of quote or a slash)
  75
  76         $patstring = gen_delimited_pat(q{'"`/});
  77
  78
   79  # Generate a reference to an anonymous sub that is just like extract_tagged
   80  # but pre-compiled and optimized for a specific pair of tags, and consequently
   81  # much faster (i.e. 3 times faster). It uses qr// for better performance on
   82  # repeated calls, so it only works under Perl 5.005 or later.
  83
  84         $extract_head = gen_extract_tagged('<HEAD>','</HEAD>');
  85
  86         ($extracted, $remainder) = $extract_head->($text);
  87
  88
  89 =head1 DESCRIPTION
  90
  91 The various C<extract_...> subroutines may be used to extract a
  92 delimited string (possibly after skipping a specified prefix string).
  93 The search for the string always begins at the current C<pos>
  94 location of the string's variable (or at index zero, if no C<pos>
  95 position is defined).
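
As a rough illustration of the C<pos> rule, the sketch below starts an
extraction part-way through a string. The sample text and variable names
are invented for the example; only C<extract_delimited> and Perl's built-in
C<pos> come from the module and the language.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

my $text = q{skip "one" "two"};

# Begin the search at offset 4 (the space after "skip") rather than index 0.
pos($text) = 4;

# List context: $text keeps its contents; only its pos() is moved.
my ($extracted, $remainder, $prefix) = extract_delimited($text, '"');

print "extracted: $extracted\n";          # "one", quotes included
print "remainder: $remainder\n";          # the text after the match
print "next pos:  ", pos($text), "\n";    # 10, just past the closing quote
```

Because the default prefix is optional whitespace, the space at offset 4 is
skipped and reported separately in C<$prefix>.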
  96
  97 =head2 General behaviour in list contexts
  98
  99 In a list context, all the subroutines return a list, the first three
 100 elements of which are always:
 101
 102 =over 4
 103
 104 =item [0]
 105
 106 The extracted string, including the specified delimiters.
 107 If the extraction fails an empty string is returned.
 108
 109 =item [1]
 110
 111 The remainder of the input string (i.e. the characters after the
 112 extracted string). On failure, the entire string is returned.
 113
 114 =item [2]
 115
 116 The skipped prefix (i.e. the characters before the extracted string).
 117 On failure, the empty string is returned.
 118
 119 =back
 120
 121 Note that in a list context, the contents of the original input text (the first
 122 argument) are not modified in any way.
 123
 124 However, if the input text was passed in a variable, that variable's
 125 C<pos> value is updated to point at the first character after the
 126 extracted text. That means that in a list context the various
 127 subroutines can be used much like regular expressions. For example:
 128
 129         while ( $next = (extract_quotelike($text))[0] )
 130         {
 131                 # process next quote-like (in $next)
 132         }
 133
 134
 135 =head2 General behaviour in scalar and void contexts
 136
 137 In a scalar context, the extracted string is returned, having first been
 138 removed from the input text. Thus, the following code also processes
  139 each quote-like operation, but actually removes them from C<$text>:
 140
 141         while ( $next = extract_quotelike($text) )
 142         {
 143                 # process next quote-like (in $next)
 144         }
 145
 146 Note that if the input text is a read-only string (i.e. a literal),
 147 no attempt is made to remove the extracted text.
 148
 149 In a void context the behaviour of the extraction subroutines is
 150 exactly the same as in a scalar context, except (of course) that the
 151 extracted substring is not returned.
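
The difference between list and scalar context can be seen side by side in
the following sketch. The input string and the variable names are
assumptions made for the example; the loops themselves follow the two
idioms shown in this section.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_quotelike);

my $original = q{'a' "b" 'c' rest};

# List context: the text is left alone and pos() advances on each call.
my $list_text = $original;
my @from_list;
while ( my $next = (extract_quotelike($list_text))[0] ) {
    push @from_list, $next;
}

# Scalar context: each item (and the whitespace skipped before it) is removed.
my $scalar_text = $original;
my @from_scalar;
while ( my $next = extract_quotelike($scalar_text) ) {
    push @from_scalar, $next;
}

print "list context found:   @from_list\n";
print "list text is now:     [$list_text]\n";     # unchanged
print "scalar context found: @from_scalar\n";
print "scalar text is now:   [$scalar_text]\n";   # only " rest" is left
```

Both loops stop at C<rest>, which is not a quote-like operation.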
 152
 153 =head2 A note about prefixes
 154
  155 Prefix patterns are matched without any trailing modifiers (C</gimsox> etc.).
  156 This can bite you if you're expecting a prefix specification like
  157 C<'.*?(?=E<lt>H1E<gt>)'> to skip everything up to the first C<E<lt>H1E<gt>>
  158 tag. Such a prefix pattern will only succeed if the C<E<lt>H1E<gt>> tag is
  159 on the current line, since C<.> normally doesn't match newlines.
  160
  161 To overcome this limitation, you need to turn on C</s> matching within
  162 the prefix pattern, using the C<(?s)> directive: C<'(?s).*?(?=E<lt>H1E<gt>)'>.
 163
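A small sketch of the prefix pitfall, assuming a three-line input in which
the heading sits on the last line. The sample text is invented; the calls
use the documented C<extract_tagged> argument order (text, opening tag,
closing tag, prefix).

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

my $text = "intro line\nsecond line\n<H1>Title</H1> tail";

# Without (?s), '.' cannot cross the newlines, so the prefix never reaches <H1>.
my @plain = extract_tagged($text, '<H1>', '</H1>', '.*?(?=<H1>)');

# Make sure the second attempt also starts at the beginning of the string.
pos($text) = 0;

# With (?s), the prefix may span lines and the heading is found.
my @multi = extract_tagged($text, '<H1>', '</H1>', '(?s).*?(?=<H1>)');

print "without (?s): ", ($plain[0] ? $plain[0] : 'no match'), "\n";
print "with (?s):    $multi[0]\n";    # <H1>Title</H1>
```

In the successful call the skipped lines are returned as the prefix in
element [2].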
 164
 165 =head2 C<extract_delimited>
 166
 167 The C<extract_delimited> function formalizes the common idiom
 168 of extracting a single-character-delimited substring from the start of
 169 a string. For example, to extract a single-quote delimited string, the
 170 following code is typically used:
 171
 172         ($remainder = $text) =~ s/\A('(\\.|[^'])*')//s;
 173         $extracted = $1;
 174
 175 but with C<extract_delimited> it can be simplified to:
 176
 177         ($extracted,$remainder) = extract_delimited($text, "'");
 178
 179 C<extract_delimited> takes up to four scalars (the input text, the
 180 delimiters, a prefix pattern to be skipped, and any escape characters)
 181 and extracts the initial substring of the text that
 182 is appropriately delimited. If the delimiter string has multiple
 183 characters, the first one encountered in the text is taken to delimit
 184 the substring.
 185 The third argument specifies a prefix pattern that is to be skipped
 186 (but must be present!) before the substring is extracted.
 187 The final argument specifies the escape character to be used for each
 188 delimiter.
 189
 190 All arguments are optional. If the escape characters are not specified,
 191 every delimiter is escaped with a backslash (C<\>).
 192 If the prefix is not specified, the
 193 pattern C<'\s*'> - optional whitespace - is used. If the delimiter set
 194 is also not specified, the set C</["'`]/> is used. If the text to be processed
 195 is not specified either, C<$_> is used.
 196
  197 In list context, C<extract_delimited> returns an array of three
  198 elements: the extracted substring (I<including the surrounding
 199 delimiters>), the remainder of the text, and the skipped prefix (if
 200 any). If a suitable delimited substring is not found, the first
 201 element of the array is the empty string, the second is the complete
 202 original text, and the prefix returned in the third element is an
 203 empty string.
 204
 205 In a scalar context, just the extracted substring is returned. In
 206 a void context, the extracted substring (and any prefix) are simply
 207 removed from the beginning of the first argument.
 208
 209 Examples:
 210
 211         # Remove a single-quoted substring from the very beginning of $text:
 212
 213                 $substring = extract_delimited($text, "'", '');
 214
 215         # Remove a single-quoted Pascalish substring (i.e. one in which
 216         # doubling the quote character escapes it) from the very
 217         # beginning of $text:
 218
 219                 $substring = extract_delimited($text, "'", '', "'");
 220
 221         # Extract a single- or double- quoted substring from the
 222         # beginning of $text, optionally after some whitespace
 223         # (note the list context to protect $text from modification):
 224
 225                 ($substring) = extract_delimited $text, q{"'};
 226
 227
 228         # Delete the substring delimited by the first '/' in $text:
 229
  230                 $text = join '', (extract_delimited($text,'/','[^/]*'))[2,1];
 231
 232 Note that this last example is I<not> the same as deleting the first
 233 quote-like pattern. For instance, if C<$text> contained the string:
 234
 235         "if ('./cmd' =~ m/$UNIXCMD/s) { $cmd = $1; }"
 236
 237 then after the deletion it would contain:
 238
 239         "if ('.$UNIXCMD/s) { $cmd = $1; }"
 240
 241 not:
 242
 243         "if ('./cmd' =~ ms) { $cmd = $1; }"
 244
 245
 246 See L<"extract_quotelike"> for a (partial) solution to this problem.
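
The escape-character argument and the slash-deletion behaviour described in
this section can be tried out with a sketch like the one below. The
variable names and the Pascal-style sample are illustrative assumptions;
the second input is the string discussed in this section.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

# Pascal-style quoting: the quote character is its own escape character.
my $pascal = q{'it''s' := x};
my ($string, $rest) = extract_delimited($pascal, q{'}, '', q{'});

# Delete the substring delimited by the first '/'. Elements [2] and [1]
# are the skipped prefix and the remainder, so joining them drops the match.
my $code    = q[if ('./cmd' =~ m/$UNIXCMD/s) { $cmd = $1; }];
my $trimmed = join '', (extract_delimited($code, '/', '[^/]*'))[2,1];

print "$string\n";     # 'it''s'
print "$trimmed\n";    # if ('.$UNIXCMD/s) { $cmd = $1; }
```

The second result shows why a plain delimiter search is not a substitute
for quote-like parsing: the first C</> found belongs to the single-quoted
path, not to the match operator.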
 247
 248
 249 =head2 C<extract_bracketed>
 250
 251 Like C<"extract_delimited">, the C<extract_bracketed> function takes
 252 up to three optional scalar arguments: a string to extract from, a delimiter
 253 specifier, and a prefix pattern. As before, a missing prefix defaults to
 254 optional whitespace and a missing text defaults to C<$_>. However, a missing
 255 delimiter specifier defaults to C<'{}()[]E<lt>E<gt>'> (see below).
 256
 257 C<extract_bracketed> extracts a balanced-bracket-delimited
 258 substring (using any one (or more) of the user-specified delimiter
 259 brackets: '(..)', '{..}', '[..]', or '<..>'). Optionally it will also
 260 respect quoted unbalanced brackets (see below).
 261
  262 A "delimiter bracket" is a bracket in the list of delimiters passed as
 263 C<extract_bracketed>'s second argument. Delimiter brackets are
 264 specified by giving either the left or right (or both!) versions
 265 of the required bracket(s). Note that the order in which
 266 two or more delimiter brackets are specified is not significant.
 267
 268 A "balanced-bracket-delimited substring" is a substring bounded by
 269 matched brackets, such that any other (left or right) delimiter
 270 bracket I<within> the substring is also matched by an opposite
 271 (right or left) delimiter bracket I<at the same level of nesting>. Any
 272 type of bracket not in the delimiter list is treated as an ordinary
 273 character.
 274
 275 In other words, each type of bracket specified as a delimiter must be
 276 balanced and correctly nested within the substring, and any other kind of
 277 ("non-delimiter") bracket in the substring is ignored.
 278
 279 For example, given the string:
 280
 281         $text = "{ an '[irregularly :-(] {} parenthesized >:-)' string }";
 282
 283 then a call to C<extract_bracketed> in a list context:
 284
 285         @result = extract_bracketed( $text, '{}' );
 286
 287 would return:
 288
 289         ( "{ an '[irregularly :-(] {} parenthesized >:-)' string }" , "" , "" )
 290
 291 since both sets of C<'{..}'> brackets are properly nested and evenly balanced.
 292 (In a scalar context just the first element of the array would be returned. In
 293 a void context, C<$text> would be replaced by an empty string.)
 294
 295 Likewise the call in:
 296
 297         @result = extract_bracketed( $text, '{[' );
 298
 299 would return the same result, since all sets of both types of specified
 300 delimiter brackets are correctly nested and balanced.
 301
 302 However, the call in:
 303
 304         @result = extract_bracketed( $text, '{([<' );
 305
 306 would fail, returning:
 307
  308         ( undef , "{ an '[irregularly :-(] {} parenthesized >:-)' string }" )
 309
 310 because the embedded pairs of C<'(..)'>s and C<'[..]'>s are "cross-nested" and
 311 the embedded C<'E<gt>'> is unbalanced. (In a scalar context, this call would
 312 return an empty string. In a void context, C<$text> would be unchanged.)
 313
 314 Note that the embedded single-quotes in the string don't help in this
 315 case, since they have not been specified as acceptable delimiters and are
 316 therefore treated as non-delimiter characters (and ignored).
 317
 318 However, if a particular species of quote character is included in the
 319 delimiter specification, then that type of quote will be correctly handled.
  320 For example, if C<$text> is:
 321
 322         $text = '<A HREF=">>>>">link</A>';
 323
 324 then
 325
 326         @result = extract_bracketed( $text, '<">' );
 327
 328 returns:
 329
 330         ( '<A HREF=">>>>">', 'link</A>', "" )
 331
 332 as expected. Without the specification of C<"> as an embedded quoter:
 333
 334         @result = extract_bracketed( $text, '<>' );
 335
 336 the result would be:
 337
 338         ( '<A HREF=">', '>>>">link</A>', "" )
 339
 340 In addition to the quote delimiters C<'>, C<">, and C<`>, full Perl quote-like
 341 quoting (i.e. q{string}, qq{string}, etc) can be specified by including the
 342 letter 'q' as a delimiter. Hence:
 343
 344         @result = extract_bracketed( $text, '<q>' );
 345
 346 would correctly match something like this:
 347
 348         $text = '<leftop: conj /and/ conj>';
 349
 350 See also: C<"extract_quotelike"> and C<"extract_codeblock">.
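
The cases discussed in this section can be reproduced with a short script
along these lines. It is only a sketch: the variable names are invented,
and C<pos> is reset by hand between calls because a successful list-context
call leaves it after the extracted text.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

my $text = "{ an '[irregularly :-(] {} parenthesized >:-)' string }";

# Only curly brackets count as delimiters; everything else is ordinary text.
my @curly = extract_bracketed($text, '{}');

# Now every bracket type counts, and the cross-nested "(]" makes the call fail.
pos($text) = 0;
my @all = extract_bracketed($text, '{([<');

# Naming the double quote as a delimiter protects the '>' characters inside it.
my $link = '<A HREF=">>>>">link</A>';
my @quoted = extract_bracketed($link, '<">');

pos($link) = 0;
my @unquoted = extract_bracketed($link, '<>');

print "curly:    $curly[0]\n";
print "all:      ", ($all[0] ? $all[0] : 'no match'), "\n";
print "quoted:   $quoted[0]\n";      # <A HREF=">>>>">
print "unquoted: $unquoted[0]\n";    # <A HREF=">
```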
 351
 352
 353 =head2 C<extract_tagged>
 354
 355 C<extract_tagged> extracts and segments text between (balanced)
 356 specified tags.
 357
 358 The subroutine takes up to five optional arguments:
 359
 360 =over 4
 361
 362 =item 1.
 363
 364 A string to be processed (C<$_> if the string is omitted or C<undef>)
 365
 366 =item 2.
 367
 368 A string specifying a pattern to be matched as the opening tag.
 369 If the pattern string is omitted (or C<undef>) then a pattern
 370 that matches any standard HTML/XML tag is used.
 371
 372 =item 3.
 373
 374 A string specifying a pattern to be matched at the closing tag.
 375 If the pattern string is omitted (or C<undef>) then the closing
 376 tag is constructed by inserting a C</> after any leading bracket
 377 characters in the actual opening tag that was matched (I<not> the pattern
 378 that matched the tag). For example, if the opening tag pattern
 379 is specified as C<'{{\w+}}'> and actually matched the opening tag
 380 C<"{{DATA}}">, then the constructed closing tag would be C<"{{/DATA}}">.
 381
 382 =item 4.
 383
 384 A string specifying a pattern to be matched as a prefix (which is to be
 385 skipped). If omitted, optional whitespace is skipped.
 386
 387 =item 5.
 388
 389 A hash reference containing various parsing options (see below)
 390
 391 =back
 392
 393 The various options that can be specified are:
 394
 395 =over 4
 396
 397 =item C<reject =E<gt> $listref>
 398
 399 The list reference contains one or more strings specifying patterns
 400 that must I<not> appear within the tagged text.
 401
 402 For example, to extract
 403 an HTML link (which should not contain nested links) use:
 404
 405         extract_tagged($text, '<A>', '</A>', undef, {reject => ['<A>']} );
 406
 407 =item C<ignore =E<gt> $listref>
 408
 409 The list reference contains one or more strings specifying patterns
 410 that are I<not> be be treated as nested tags within the tagged text
 411 (even if they would match the start tag pattern).
 412
 413 For example, to extract an arbitrary XML tag, but ignore "empty" elements:
 414
 415         extract_tagged($text, undef, undef, undef, {ignore => ['<[^>]*/>']} );
 416
 417 (also see L<"gen_delimited_pat"> below).
 418
 419
 420 =item C<fail =E<gt> $str>
 421
 422 The C<fail> option indicates the action to be taken if a matching end
 423 tag is not encountered (i.e. before the end of the string or some
 424 C<reject> pattern matches). By default, a failure to match a closing
 425 tag causes C<extract_tagged> to immediately fail.
 426
  427 However, if the string value associated with C<fail> is "MAX", then
  428 C<extract_tagged> returns the complete text up to the point of failure.
  429 If the string is "PARA", C<extract_tagged> returns only the first paragraph
  430 after the tag (up to the first line that is either empty or contains
  431 only whitespace characters).
  432 If the string is "", the default behaviour (i.e. failure) is reinstated.
 433
 434 For example, suppose the start tag "/para" introduces a paragraph, which then
 435 continues until the next "/endpara" tag or until another "/para" tag is
 436 encountered:
 437
 438         $text = "/para line 1\n\nline 3\n/para line 4";
 439
 440         extract_tagged($text, '/para', '/endpara', undef,
  441                                 {reject => '/para', fail => 'MAX'} );
 442
 443         # EXTRACTED: "/para line 1\n\nline 3\n"
 444
 445 Suppose instead, that if no matching "/endpara" tag is found, the "/para"
 446 tag refers only to the immediately following paragraph:
 447
 448         $text = "/para line 1\n\nline 3\n/para line 4";
 449
 450         extract_tagged($text, '/para', '/endpara', undef,
  451                         {reject => '/para', fail => 'PARA'} );
 452
 453         # EXTRACTED: "/para line 1\n"
 454
 455 Note that the specified C<fail> behaviour applies to nested tags as well.
 456
 457 =back
 458
 459 On success in a list context, an array of 6 elements is returned. The elements are:
 460
 461 =over 4
 462
 463 =item [0]
 464
 465 the extracted tagged substring (including the outermost tags),
 466
 467 =item [1]
 468
 469 the remainder of the input text,
 470
 471 =item [2]
 472
 473 the prefix substring (if any),
 474
 475 =item [3]
 476
 477 the opening tag
 478
 479 =item [4]
 480
 481 the text between the opening and closing tags
 482
 483 =item [5]
 484
 485 the closing tag (or "" if no closing tag was found)
 486
 487 =back
 488
 489 On failure, all of these values (except the remaining text) are C<undef>.
 490
 491 In a scalar context, C<extract_tagged> returns just the complete
 492 substring that matched a tagged text (including the start and end
 493 tags). C<undef> is returned on failure. In addition, the original input
 494 text has the returned substring (and any prefix) removed from it.
 495
 496 In a void context, the input text just has the matched substring (and
 497 any specified prefix) removed.
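
To make the six list-context return values concrete, here is a minimal
sketch that relies on the default HTML/XML tag pattern. The sample markup
and the variable names are assumptions chosen for the example.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

my $text = '  <b>bold <i>nested</i></b> rest';

# No tag patterns given: any HTML/XML-style tag is accepted, and the closing
# tag is derived from the opening tag that was actually matched.
my ($extracted, $remainder, $prefix, $open, $inside, $close)
    = extract_tagged($text);

print "extracted: [$extracted]\n";    # <b>bold <i>nested</i></b>
print "remainder: [$remainder]\n";
print "prefix:    [$prefix]\n";
print "open tag:  [$open]\n";         # <b>
print "inside:    [$inside]\n";       # bold <i>nested</i>
print "close tag: [$close]\n";        # </b>
```

The nested C<E<lt>iE<gt>> element is treated as a balanced inner tag, so it
stays inside the text of the outer element.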
 498
 499
 500 =head2 C<gen_extract_tagged>
 501
  502 (Note: This subroutine is only available under Perl 5.005 or later.)
 503
 504 C<gen_extract_tagged> generates a new anonymous subroutine which
 505 extracts text between (balanced) specified tags. In other words,
  506 it generates a function identical in behaviour to C<extract_tagged>.
 507
 508 The difference between C<extract_tagged> and the anonymous
 509 subroutines generated by
 510 C<gen_extract_tagged>, is that those generated subroutines:
 511
 512 =over 4
 513
 514 =item *
 515
 516 do not have to reparse tag specification or parsing options every time
 517 they are called (whereas C<extract_tagged> has to effectively rebuild
 518 its tag parser on every call);
 519
 520 =item *
 521
 522 make use of the new qr// construct to pre-compile the regexes they use
 523 (whereas C<extract_tagged> uses standard string variable interpolation
 524 to create tag-matching patterns).
 525
 526 =back
 527
 528 The subroutine takes up to four optional arguments (the same set as
 529 C<extract_tagged> except for the string to be processed). It returns
 530 a reference to a subroutine which in turn takes a single argument (the text to
 531 be extracted from).
 532
 533 In other words, the implementation of C<extract_tagged> is exactly
 534 equivalent to:
 535
 536         sub extract_tagged
 537         {
 538                 my $text = shift;
  539                 my $extractor = gen_extract_tagged(@_);
 540                 return $extractor->($text);
 541         }
 542
 543 (although C<extract_tagged> is not currently implemented that way, in order
 544 to preserve pre-5.005 compatibility).
 545
 546 Using C<gen_extract_tagged> to create extraction functions for specific tags
 547 is a good idea if those functions are going to be called more than once, since
 548 their performance is typically twice as good as the more general-purpose
 549 C<extract_tagged>.
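
A typical use is to build the extractor once and apply it to many inputs.
The sketch below assumes two small, made-up documents; the tag pair is the
one used in the SYNOPSIS.

```perl
use strict;
use warnings;
use Text::Balanced qw(gen_extract_tagged);

# Build the extractor once...
my $extract_head = gen_extract_tagged('<HEAD>', '</HEAD>');

# ...then reuse it for several documents.
my @documents = (
    '<HEAD><TITLE>One</TITLE></HEAD><BODY>first</BODY>',
    '<HEAD><TITLE>Two</TITLE></HEAD><BODY>second</BODY>',
);

my (@heads, @bodies);
for my $doc (@documents) {
    # List context, so $doc itself is not modified.
    my ($head, $remainder) = $extract_head->($doc);
    push @heads,  $head;
    push @bodies, $remainder;
}

print "$_\n" for @heads;
```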
 550
 551
 552 =head2 C<extract_quotelike>
 553
 554 C<extract_quotelike> attempts to recognize, extract, and segment any
 555 one of the various Perl quotes and quotelike operators (see
  556 L<perlop(3)>). Nested backslashed delimiters, embedded balanced bracket
 557 delimiters (for the quotelike operators), and trailing modifiers are
 558 all caught. For example, in:
 559
 560         extract_quotelike 'q # an octothorpe: \# (not the end of the q!) #'
 561
 562         extract_quotelike '  "You said, \"Use sed\"."  '
 563
 564         extract_quotelike ' s{([A-Z]{1,8}\.[A-Z]{3})} /\L$1\E/; '
 565
 566         extract_quotelike ' tr/\\\/\\\\/\\\//ds; '
 567
 568 the full Perl quotelike operations are all extracted correctly.
 569
 570 Note too that, when using the /x modifier on a regex, any comment
 571 containing the current pattern delimiter will cause the regex to be
 572 immediately terminated. In other words:
 573
 574         'm /
 575                 (?i)            # CASE INSENSITIVE
 576                 [a-z_]          # LEADING ALPHABETIC/UNDERSCORE
 577                 [a-z0-9]*       # FOLLOWED BY ANY NUMBER OF ALPHANUMERICS
 578            /x'
 579
 580 will be extracted as if it were:
 581
 582         'm /
 583                 (?i)            # CASE INSENSITIVE
 584                 [a-z_]          # LEADING ALPHABETIC/'
 585
 586 This behaviour is identical to that of the actual compiler.
 587
 588 C<extract_quotelike> takes two arguments: the text to be processed and
 589 a prefix to be matched at the very beginning of the text. If no prefix
 590 is specified, optional whitespace is the default. If no text is given,
 591 C<$_> is used.
 592
 593 In a list context, an array of 11 elements is returned. The elements are:
 594
 595 =over 4
 596
 597 =item [0]
 598
 599 the extracted quotelike substring (including trailing modifiers),
 600
 601 =item [1]
 602
 603 the remainder of the input text,
 604
 605 =item [2]
 606
 607 the prefix substring (if any),
 608
 609 =item [3]
 610
 611 the name of the quotelike operator (if any),
 612
 613 =item [4]
 614
 615 the left delimiter of the first block of the operation,
 616
 617 =item [5]
 618
 619 the text of the first block of the operation
 620 (that is, the contents of
 621 a quote, the regex of a match or substitution or the target list of a
 622 translation),
 623
 624 =item [6]
 625
 626 the right delimiter of the first block of the operation,
 627
 628 =item [7]
 629
 630 the left delimiter of the second block of the operation
 631 (that is, if it is a C<s>, C<tr>, or C<y>),
 632
 633 =item [8]
 634
 635 the text of the second block of the operation
 636 (that is, the replacement of a substitution or the translation list
 637 of a translation),
 638
 639 =item [9]
 640
 641 the right delimiter of the second block of the operation (if any),
 642
 643 =item [10]
 644
 645 the trailing modifiers on the operation (if any).
 646
 647 =back
 648
 649 For each of the fields marked "(if any)" the default value on success is
 650 an empty string.
 651 On failure, all of these values (except the remaining text) are C<undef>.
 652
 653
 654 In a scalar context, C<extract_quotelike> returns just the complete substring
 655 that matched a quotelike operation (or C<undef> on failure). In a scalar or
 656 void context, the input text has the same substring (and any specified
 657 prefix) removed.
 658
 659 Examples:
 660
 661         # Remove the first quotelike literal that appears in text
 662
 663                 $quotelike = extract_quotelike($text,'.*?');
 664
 665         # Replace one or more leading whitespace-separated quotelike
 666         # literals in $_ with "<QLL>"
 667
 668                 do { $_ = join '<QLL>', (extract_quotelike)[2,1] } until $@;
 669
 670
 671         # Isolate the search pattern in a quotelike operation from $text
 672
 673                 ($op,$pat) = (extract_quotelike $text)[3,5];
 674                 if ($op =~ /[ms]/)
 675                 {
 676                         print "search pattern: $pat\n";
 677                 }
 678                 else
 679                 {
 680                         print "$op is not a pattern matching operation\n";
 681                 }
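
For a substitution, the list-context fields can be picked apart as in the
following sketch. The input string and the hash keys are invented for the
example; the indices are the ones listed for C<extract_quotelike> in this
section.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_quotelike);

my $text = q{ s/foo/bar/gi; rest};

my @field = extract_quotelike($text);

# Give the more interesting elements readable names.
my %part = (
    whole       => $field[0],     # s/foo/bar/gi
    remainder   => $field[1],     # everything after the operation
    prefix      => $field[2],     # the skipped leading whitespace
    operator    => $field[3],     # s
    pattern     => $field[5],     # foo
    replacement => $field[8],     # bar
    modifiers   => $field[10],    # gi
);

printf "%-11s => [%s]\n", $_, $part{$_} for sort keys %part;
```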
 682
 683
 684 =head2 C<extract_quotelike> and "here documents"
 685
 686 C<extract_quotelike> can successfully extract "here documents" from an input
 687 string, but with an important caveat in list contexts.
 688
 689 Unlike other types of quote-like literals, a here document is rarely
 690 a contiguous substring. For example, a typical piece of code using
  691 a here document might look like this:
 692
 693         <<'EOMSG' || die;
 694         This is the message.
 695         EOMSG
 696         exit;
 697
 698 Given this as an input string in a scalar context, C<extract_quotelike>
 699 would correctly return the string "<<'EOMSG'\nThis is the message.\nEOMSG",
 700 leaving the string " || die;\nexit;" in the original variable. In other words,
 701 the two separate pieces of the here document are successfully extracted and
 702 concatenated.
 703
 704 In a list context, C<extract_quotelike> would return the list
 705
 706 =over 4
 707
 708 =item [0]
 709
 710 "<<'EOMSG'\nThis is the message.\nEOMSG\n" (i.e. the full extracted here document,
 711 including fore and aft delimiters),
 712
 713 =item [1]
 714
 715 " || die;\nexit;" (i.e. the remainder of the input text, concatenated),
 716
 717 =item [2]
 718
 719 "" (i.e. the prefix substring -- trivial in this case),
 720
 721 =item [3]
 722
 723 "<<" (i.e. the "name" of the quotelike operator)
 724
 725 =item [4]
 726
 727 "'EOMSG'" (i.e. the left delimiter of the here document, including any quotes),
 728
 729 =item [5]
 730
 731 "This is the message.\n" (i.e. the text of the here document),
 732
 733 =item [6]
 734
 735 "EOMSG" (i.e. the right delimiter of the here document),
 736
 737 =item [7..10]
 738
 739 "" (a here document has no second left delimiter, second text, second right
 740 delimiter, or trailing modifiers).
 741
 742 =back
 743
 744 However, the matching position of the input variable would be set to
 745 "exit;" (i.e. I<after> the closing delimiter of the here document),
 746 which would cause the earlier " || die;\nexit;" to be skipped in any
 747 sequence of code fragment extractions.
 748
 749 To avoid this problem, when it encounters a here document whilst
 750 extracting from a modifiable string, C<extract_quotelike> silently
 751 rearranges the string to an equivalent piece of Perl:
 752
 753         <<'EOMSG'
 754         This is the message.
 755         EOMSG
 756         || die;
 757         exit;
 758
 759 in which the here document I<is> contiguous. It still leaves the
 760 matching position after the here document, but now the rest of the line
 761 on which the here document starts is not skipped.
 762
  763 To prevent C<extract_quotelike> from mucking about with the input in this way
 764 (this is the only case where a list-context C<extract_quotelike> does so),
 765 you can pass the input variable as an interpolated literal:
 766
 767         $quotelike = extract_quotelike("$var");
 768
 769
 770 =head2 C<extract_codeblock>
 771
 772 C<extract_codeblock> attempts to recognize and extract a balanced
 773 bracket delimited substring that may contain unbalanced brackets
 774 inside Perl quotes or quotelike operations. That is, C<extract_codeblock>
 775 is like a combination of C<"extract_bracketed"> and
 776 C<"extract_quotelike">.
 777
 778 C<extract_codeblock> takes the same initial three parameters as C<extract_bracketed>:
 779 a text to process, a set of delimiter brackets to look for, and a prefix to
 780 match first. It also takes an optional fourth parameter, which allows the
 781 outermost delimiter brackets to be specified separately (see below).
 782
 783 Omitting the first argument (input text) means process C<$_> instead.
 784 Omitting the second argument (delimiter brackets) indicates that only C<'{'> is to be used.
 785 Omitting the third argument (prefix argument) implies optional whitespace at the start.
 786 Omitting the fourth argument (outermost delimiter brackets) indicates that the
 787 value of the second argument is to be used for the outermost delimiters.
 788
  789 Once the prefix and the outermost opening delimiter bracket have been
 790 recognized, code blocks are extracted by stepping through the input text and
 791 trying the following alternatives in sequence:
 792
 793 =over 4
 794
 795 =item 1.
 796
 797 Try and match a closing delimiter bracket. If the bracket was the same
 798 species as the last opening bracket, return the substring to that
 799 point. If the bracket was mismatched, return an error.
 800
 801 =item 2.
 802
 803 Try to match a quote or quotelike operator. If found, call
 804 C<extract_quotelike> to eat it. If C<extract_quotelike> fails, return
 805 the error it returned. Otherwise go back to step 1.
 806
 807 =item 3.
 808
 809 Try to match an opening delimiter bracket. If found, call
 810 C<extract_codeblock> recursively to eat the embedded block. If the
 811 recursive call fails, return an error. Otherwise, go back to step 1.
 812
 813 =item 4.
 814
 815 Unconditionally match a bareword or any other single character, and
 816 then go back to step 1.
 817
 818 =back
 819
 820
 821 Examples:
 822
 823         # Find a while loop in the text
 824
 825                 if ($text =~ s/.*?while\s*\{/{/)
 826                 {
 827                         $loop = "while " . extract_codeblock($text);
 828                 }
 829
 830         # Remove the first round-bracketed list (which may include
 831         # round- or curly-bracketed code blocks or quotelike operators)
 832
 833                 extract_codeblock $text, "(){}", '[^(]*';
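
The practical difference from C<extract_bracketed> shows up as soon as a
closing bracket appears inside a string literal. The following sketch uses
a made-up one-line block to compare the two; it is an illustration, not a
statement about how either routine is implemented.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed extract_codeblock);

my $source = q[{ print "}" } rest];

# extract_bracketed knows nothing about Perl quoting here, so the '}' inside
# the string literal closes the block too early.
my ($naive) = extract_bracketed($source, '{}');

# Start again from the beginning of the string.
pos($source) = 0;

# extract_codeblock skips over the quoted string and finds the real end.
my ($block, $remainder) = extract_codeblock($source, '{}');

print "bracketed: $naive\n";    # { print "}
print "codeblock: $block\n";    # { print "}" }
```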
 834
 835
 836 The ability to specify a different outermost delimiter bracket is useful
 837 in some circumstances. For example, in the Parse::RecDescent module,
 838 parser actions which are to be performed only on a successful parse
 839 are specified using a C<E<lt>defer:...E<gt>> directive. For example:
 840
 841         sentence: subject verb object
 842                         <defer: {$::theVerb = $item{verb}} >
 843
 844 Parse::RecDescent uses C<extract_codeblock($text, '{}E<lt>E<gt>')> to extract the code
 845 within the C<E<lt>defer:...E<gt>> directive, but there's a problem.
 846
 847 A deferred action like this:
 848
 849                         <defer: {if ($count>10) {$count--}} >
 850
 851 will be incorrectly parsed as:
 852
 853                         <defer: {if ($count>
 854
  855 because the "greater than" operator is interpreted as a closing delimiter.
 856
 857 But, by extracting the directive using
 858 S<C<extract_codeblock($text, '{}', undef, 'E<lt>E<gt>')>>
  859 the '>' character is only treated as a delimiter at the outermost
 860 level of the code block, so the directive is parsed correctly.
 861
 862 =head2 C<extract_multiple>
 863
 864 The C<extract_multiple> subroutine takes a string to be processed and a
 865 list of extractors (subroutines or regular expressions) to apply to that string.
 866
 867 In a list context C<extract_multiple> returns an array of substrings
 868 of the original string, as extracted by the specified extractors.
 869 In a scalar context, C<extract_multiple> returns the first
 870 substring successfully extracted from the original string. In both
 871 scalar and void contexts the original string has the first successfully
 872 extracted substring removed from it. In all contexts
 873 C<extract_multiple> starts at the current C<pos> of the string, and
 874 sets that C<pos> appropriately after it matches.
 875
 876 Hence, the aim of a call to C<extract_multiple> in a list context
 877 is to split the processed string into as many non-overlapping fields as
 878 possible, by repeatedly applying each of the specified extractors
 879 to the remainder of the string. Thus C<extract_multiple> is
 880 a generalized form of Perl's C<split> subroutine.
 881
 882 The subroutine takes up to four optional arguments:
 883
 884 =over 4
 885
 886 =item 1.
 887
 888 A string to be processed (C<$_> if the string is omitted or C<undef>)
 889
 890 =item 2.
 891
 892 A reference to a list of subroutine references and/or qr// objects and/or
 893 literal strings and/or hash references, specifying the extractors
 894 to be used to split the string. If this argument is omitted (or
 895 C<undef>) the list:
 896
 897         [
 898                 sub { extract_variable($_[0], '') },
 899                 sub { extract_quotelike($_[0],'') },
 900                 sub { extract_codeblock($_[0],'{}','') },
 901         ]
 902
 903 is used.
 904
 905
 906 =item 3.
 907
 908 A number specifying the maximum number of fields to return. If this
 909 argument is omitted (or C<undef>), split continues as long as possible.
 910
 911 If the third argument is I<N>, then extraction continues until I<N> fields
 912 have been successfully extracted, or until the string has been completely
 913 processed.
 914
 915 Note that in scalar and void contexts the value of this argument is
 916 automatically reset to 1 (under C<-w>, a warning is issued if the argument
 917 has to be reset).
 918
 919 =item 4.
 920
 921 A value indicating whether unmatched substrings (see below) within the
 922 text should be skipped or returned as fields. If the value is true,
 923 such substrings are skipped. Otherwise, they are returned.
 924
 925 =back
 926
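The effect of the third and fourth arguments can be seen in a minimal
sketch such as the following. The sample string and the C<qr/\d+/>
extractor are invented for illustration. Since each call starts at the
current C<pos> of the string, the sketch rewinds C<pos> between calls:

        use strict;
        use warnings;
        use Text::Balanced qw(extract_multiple);

        my $text = 'a1b22c333';

        # No limit; unmatched text is returned as fields too
        my @all = extract_multiple($text, [ qr/\d+/ ]);

        # No limit; unmatched text is skipped
        pos($text) = 0;
        my @digits = extract_multiple($text, [ qr/\d+/ ], undef, 1);

        # At most two fields; unmatched text is skipped
        pos($text) = 0;
        my @first2 = extract_multiple($text, [ qr/\d+/ ], 2, 1);

        print "all:    @all\n";
        print "digits: @digits\n";
        print "first2: @first2\n";

Here C<@all> interleaves the letters with the digit runs, whereas
C<@digits> and C<@first2> contain digit runs only.
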
 927 The extraction process works by applying each extractor in
 928 sequence to the text string. If the extractor is a subroutine it
 929 is called in a list
 930 context and is expected to return a list of a single element, namely
 931 the extracted text.
 932 Note that the value returned by an extractor subroutine need not bear any
 933 relationship to the corresponding substring of the original text (see
 934 examples below).
 935
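As a rough illustration of an extractor whose return value differs from
the text it consumes, consider the sketch below. The C<$doubled>
subroutine is hypothetical. It assumes that an extractor is expected to
advance C<pos> of its argument past whatever it consumes, which is done
here with the C</gc> modifiers:

        use strict;
        use warnings;
        use Text::Balanced qw(extract_multiple);

        my $text = '3 apples, 10 pears';

        # Hypothetical extractor: consumes a run of digits at the current
        # position of its argument and returns twice its numeric value.
        # An empty list signals that nothing was extracted.
        my $doubled = sub {
                return $_[0] =~ /\G(\d+)/gc ? ($1 * 2) : ();
        };

        # The true fourth argument skips everything that is not a number
        my @fields = extract_multiple($text, [ $doubled ], undef, 1);

        print "@fields\n";

The returned fields are the doubled numbers, not substrings of C<$text>.
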
 936 If the extractor is a precompiled regular expression or a string,
 937 it is matched against the text in a scalar context with a leading
 938 '\G' and the gc modifiers enabled. The extracted value is either
 939 $1 if that variable is defined after the match, or else the
 940 complete match (i.e. $&).
 941
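The following sketch contrasts a regex extractor that has a capturing
group with one that has none. The input string and both patterns are
made up for the example:

        use strict;
        use warnings;
        use Text::Balanced qw(extract_multiple);

        my $text = 'a=1,b=2';

        # With a capturing group, only $1 becomes the field
        my @keys = extract_multiple($text, [ qr/(\w+)=/ ], undef, 1);

        # Without one, the complete match becomes the field
        pos($text) = 0;
        my @pairs = extract_multiple($text, [ qr/\w+=/ ], undef, 1);

        print "keys:  @keys\n";
        print "pairs: @pairs\n";

The first call yields the bare keys, the second the keys together with
the trailing C<=>.
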
 942 If the extractor is a hash reference, it must contain exactly one element.
 943 The value of that element is one of the
 944 above extractor types (subroutine reference, regular expression, or string).
 945 The key of that element is the name of a class into which the successful
 946 return value of the extractor will be blessed.
 947
 948 If an extractor returns a defined value, that value is immediately
 949 treated as the next extracted field and pushed onto the list of fields.
 950 If the extractor was specified in a hash reference, the field is also
 951 blessed into the appropriate class.
 952
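One way to tell blessed fields from plain ones is sketched below. The
class name C<Number> is arbitrary, and the sketch assumes that a blessed
field is a reference to a scalar holding the extracted text, so that it
has to be dereferenced before use:

        use strict;
        use warnings;
        use Text::Balanced qw(extract_multiple);

        my $text = 'x=42;';

        my @fields = extract_multiple($text, [ { Number => qr/\d+/ } ]);

        for my $field (@fields) {
                if (ref $field) {
                        # Blessed field: ref() gives the class name
                        printf "%-8s %s\n", ref($field), $$field;
                }
                else {
                        # Unblessed field: an ordinary string
                        printf "%-8s %s\n", '(plain)', $field;
                }
        }
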
 953 If the extractor fails to match (in the case of a regex extractor), or returns an empty list or an undefined value (in the case of a subroutine extractor), it is
 954 assumed to have failed to extract.
 955 If none of the extractor subroutines succeeds, then one
 956 character is extracted from the start of the text and the extraction
 957 subroutines reapplied. Characters which are thus removed are accumulated and
 958 eventually become the next field (unless the fourth argument is true, in which
 959 case they are discarded).
 960
 961 For example, the following extracts substrings that are valid Perl variables:
 962
 963         @fields = extract_multiple($text,
 964                                    [ sub { extract_variable($_[0]) } ],
 965                                    undef, 1);
 966
 967 This example separates a text into fields which are quote delimited,
 968 curly bracketed, and anything else. The delimited and bracketed
 969 parts are also blessed to identify them (the "anything else" is unblessed):
 970
 971         @fields = extract_multiple($text,
 972                    [
 973                         { Delim => sub { extract_delimited($_[0],q{'"}) } },
 974                         { Brack => sub { extract_bracketed($_[0],'{}') } },
 975                    ]);
 976
 977 This call extracts the next single substring that is a valid Perl quotelike
 978 operator (and removes it from $text):
 979
 980         $quotelike = extract_multiple($text,
 981                                       [
 982                                         sub { extract_quotelike($_[0]) },
 983                                       ], undef, 1);
 984
 985 Finally, here is yet another way to do comma-separated value parsing:
 986
 987         @fields = extract_multiple($csv_text,
 988                                   [
 989                                         sub { extract_delimited($_[0],q{'"}) },
 990                                         qr/([^,]+)(.*)/,
 991                                   ],
 992                                   undef,1);
 993
 994 The list in the second argument means:
 995 I<"Try and extract a ' or " delimited string, otherwise extract anything up to a comma...">.
 996 The undef third argument means:
 997 I<"...as many times as possible...">,
 998 and the true value in the fourth argument means
 999 I<"...discarding anything else that appears (i.e. the commas)">.
1000
1001 If you wanted the commas preserved as separate fields (i.e. like split
1002 does if your split pattern has capturing parentheses), you would
1003 just make the last parameter undefined (or remove it).
1004
1005
1006 =head2 C<gen_delimited_pat>
1007
1008 The C<gen_delimited_pat> subroutine takes a single (string) argument and
1009 builds a Friedl-style optimized regex that matches a string delimited
1010 by any one of the characters in the single argument. For example:
1011
1012         gen_delimited_pat(q{'"})
1013
1014 returns the regex:
1015
1016         (?:\"(?:\\\"|(?!\").)*\"|\'(?:\\\'|(?!\').)*\')
1017
1018 Note that the specified delimiters are automatically quotemeta'd.
1019
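The returned pattern can also be interpolated directly into a match.
The sketch below, whose sample text is invented for the example,
collects every quoted string in a line, including one that contains
backslash-escaped quotes:

        use strict;
        use warnings;
        use Text::Balanced qw(gen_delimited_pat);

        my $quoted = gen_delimited_pat(q{'"});

        my $text = q{say "hello \"world\"" and 'bye'};

        # Each match is one complete delimited string
        my @strings = $text =~ /($quoted)/g;

        print "$_\n" for @strings;
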
1020 A typical use of C<gen_delimited_pat> would be to build special purpose tags
1021 for C<extract_tagged>. For example, to properly ignore "empty" XML elements
1022 (which might contain quoted strings):
1023
1024         my $empty_tag = '<(' . gen_delimited_pat(q{'"}) . '|.)+/>';
1025
1026         extract_tagged($text, undef, undef, undef, {ignore => [$empty_tag]} );
1027
1028
1029 C<gen_delimited_pat> may also be called with an optional second argument,
1030 which specifies the "escape" character(s) to be used for each delimiter.
1031 For example, to match a Pascal-style string (where ' is the delimiter
1032 and '' is a literal ' within the string):
1033
1034         gen_delimited_pat(q{'},q{'});
1035
1036 Different escape characters can be specified for different delimiters.
1037 For example, to specify that '/' is the escape for single quotes
1038 and '%' is the escape for double quotes:
1039
1040         gen_delimited_pat(q{'"},q{/%});
1041
1042 If more delimiters than escape chars are specified, the last escape char
1043 is used for the remaining delimiters.
1044 If no escape char is specified for a given delimiter, '\' is used.
1045
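A small sketch of the Pascal-style case follows. The source line being
searched is hypothetical:

        use strict;
        use warnings;
        use Text::Balanced qw(gen_delimited_pat);

        # ' is both the delimiter and its own escape character
        my $pascal = gen_delimited_pat(q{'}, q{'});

        my $source = q{writeln('it''s done'); halt};

        my ($literal) = $source =~ /($pascal)/;

        print "$literal\n";

The doubled quote is treated as part of the string rather than as its
end, so the whole literal is matched in one piece.
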
1046 Note that
1047 C<gen_delimited_pat> was previously called
1048 C<delimited_pat>. That name may still be used, but is now deprecated.
1049
1050
1051 =head1 DIAGNOSTICS
1052
1053 In a list context, all the functions return C<(undef,$original_text)>
1054 on failure. In a scalar context, failure is indicated by returning C<undef>
1055 (in this case the input text is not modified in any way).
1056
1057 In addition, on failure in I<any> context, the C<$@> variable is set.
1058 Accessing C<$@-E<gt>{error}> returns one of the error diagnostics listed
1059 below.
1060 Accessing C<$@-E<gt>{pos}> returns the offset into the original string at
1061 which the error was detected (although not necessarily where it occurred!).
1062 Printing C<$@> directly produces the error message, with the offset appended.
1063 On success, the C<$@> variable is guaranteed to be C<undef>.
1064
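A failure might be handled along the following lines. The input string
is chosen so that it contains no brackets at all, and C<$@> is copied
into the illustrative variable C<$err> straight away so that a later
call cannot overwrite it:

        use strict;
        use warnings;
        use Text::Balanced qw(extract_bracketed);

        my $text = 'no brackets here';

        my ($extracted, $remainder) = extract_bracketed($text, '()');
        my $err = $@;

        if (!defined $extracted) {
                # $remainder still holds the original text
                print "failed: $err->{error}\n";
                print "offset: $err->{pos}\n";
        }
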
1065 The available diagnostics are:
1066
1067 =over 4
1068
1069 =item  C<Did not find a suitable bracket: "%s">
1070
1071 The delimiter provided to C<extract_bracketed> was not one of
1072 C<'()[]E<lt>E<gt>{}'>.
1073
1074 =item  C<Did not find prefix: /%s/>
1075
1076 A non-optional prefix was specified but wasn't found at the start of the text.
1077
1078 =item  C<Did not find opening bracket after prefix: "%s">
1079
1080 C<extract_bracketed> or C<extract_codeblock> was expecting a
1081 particular kind of bracket at the start of the text, and didn't find it.
1082
1083 =item  C<No quotelike operator found after prefix: "%s">
1084
1085 C<extract_quotelike> didn't find one of the quotelike operators C<q>,
1086 C<qq>, C<qw>, C<qx>, C<s>, C<tr> or C<y> at the start of the substring
1087 it was extracting.
1088
1089 =item  C<Unmatched closing bracket: "%c">
1090
1091 C<extract_bracketed>, C<extract_quotelike> or C<extract_codeblock> encountered
1092 a closing bracket where none was expected.
1093
1094 =item  C<Unmatched opening bracket(s): "%s">
1095
1096 C<extract_bracketed>, C<extract_quotelike> or C<extract_codeblock> ran
1097 out of characters in the text before closing one or more levels of nested
1098 brackets.
1099
1100 =item C<Unmatched embedded quote (%s)>
1101
1102 C<extract_bracketed> attempted to match an embedded quoted substring, but
1103 failed to find a closing quote to match it.
1104
1105 =item C<Did not find closing delimiter to match '%s'>
1106
1107 C<extract_quotelike> was unable to find a closing delimiter to match the
1108 one that opened the quote-like operation.
1109
1110 =item  C<Mismatched closing bracket: expected "%c" but found "%s">
1111
1112 C<extract_bracketed>, C<extract_quotelike> or C<extract_codeblock> found
1113 a valid bracket delimiter, but it was the wrong species. This usually
1114 indicates a nesting error, but may indicate incorrect quoting or escaping.
1115
1116 =item  C<No block delimiter found after quotelike "%s">
1117
1118 C<extract_quotelike> or C<extract_codeblock> found one of the
1119 quotelike operators C<q>, C<qq>, C<qw>, C<qx>, C<s>, C<tr> or C<y>
1120 without a suitable block after it.
1121
1122 =item C<Did not find leading dereferencer>
1123
1124 C<extract_variable> was expecting one of '$', '@', or '%' at the start of
1125 a variable, but didn't find any of them.
1126
1127 =item C<Bad identifier after dereferencer>
1128
1129 C<extract_variable> found a '$', '@', or '%' indicating a variable, but that
1130 character was not followed by a legal Perl identifier.
1131
1132 =item C<Did not find expected opening bracket at %s>
1133
1134 C<extract_codeblock> failed to find any of the outermost opening brackets
1135 that were specified.
1136
1137 =item C<Improperly nested codeblock at %s>
1138
1139 A nested code block was found that started with a delimiter that was specified
1140 as being only to be used as an outermost bracket.
1141
1142 =item  C<Missing second block for quotelike "%s">
1143
1144 C<extract_codeblock> or C<extract_quotelike> found one of the
1145 quotelike operators C<s>, C<tr> or C<y> followed by only one block.
1146
1147 =item C<No match found for opening bracket>
1148
1149 C<extract_codeblock> failed to find a closing bracket to match the outermost
1150 opening bracket.
1151
1152 =item C<Did not find opening tag: /%s/>
1153
1154 C<extract_tagged> did not find a suitable opening tag (after any specified
1155 prefix was removed).
1156
1157 =item C<Unable to construct closing tag to match: /%s/>
1158
1159 C<extract_tagged> matched the specified opening tag and tried to
1160 modify the matched text to produce a matching closing tag (because
1161 none was specified). It failed to generate the closing tag, almost
1162 certainly because the opening tag did not start with a
1163 bracket of some kind.
1164
1165 =item C<Found invalid nested tag: %s>
1166
1167 C<extract_tagged> found a nested tag that appeared in the "reject" list
1168 (and the failure mode was not "MAX" or "PARA").
1169
1170 =item C<Found unbalanced nested tag: %s>
1171
1172 C<extract_tagged> found a nested opening tag that was not matched by a
1173 corresponding nested closing tag (and the failure mode was not "MAX" or "PARA").
1174
1175 =item C<Did not find closing tag>
1176
1177 C<extract_tagged> reached the end of the text without finding a closing tag
1178 to match the original opening tag (and the failure mode was not
1179 "MAX" or "PARA").
1180
1181
1182
1183
1184 =back
1185
1186
1187 =head1 AUTHOR
1188
1189 Damian Conway (damian@conway.org)
1190
1191
1192 =head1 BUGS AND IRRITATIONS
1193
1194 There are undoubtedly serious bugs lurking somewhere in this code, if
1195 only because parts of it give the impression of understanding a great deal
1196 more about Perl than they really do.
1197
1198 Bug reports and other feedback are most welcome.
1199
1200
1201 =head1 COPYRIGHT
1202
1203  Copyright (c) 1997-2000, Damian Conway. All Rights Reserved.
1204 This module is free software; you can redistribute it and/or
1205 modify it under the same terms as Perl itself.