Commit | Line | Data |
---|---|---|
85831461 | 1 | NAME |
55a1c97c JH |
2 | Text::Balanced - Extract delimited text sequences from strings. |
3 | ||
85831461 S |
4 | SYNOPSIS |
5 | use Text::Balanced qw ( | |
6 | extract_delimited | |
7 | extract_bracketed | |
8 | extract_quotelike | |
9 | extract_codeblock | |
10 | extract_variable | |
11 | extract_tagged | |
12 | extract_multiple | |
13 | gen_delimited_pat | |
14 | gen_extract_tagged | |
15 | ); | |
16 | ||
17 | # Extract the initial substring of $text that is delimited by | |
18 | # two (unescaped) instances of the first character in $delim. | |
19 | ||
20 | ($extracted, $remainder) = extract_delimited($text,$delim); | |
21 | ||
22 | ||
23 | # Extract the initial substring of $text that is bracketed | |
24 | # with a delimiter(s) specified by $delim (where the string | |
25 | # in $delim contains one or more of '(){}[]<>'). | |
26 | ||
27 | ($extracted, $remainder) = extract_bracketed($text,$delim); | |
28 | ||
29 | ||
30 | # Extract the initial substring of $text that is bounded by | |
31 | # an XML tag. | |
32 | ||
33 | ($extracted, $remainder) = extract_tagged($text); | |
34 | ||
35 | ||
36 | # Extract the initial substring of $text that is bounded by | |
37 | # a BEGIN...END pair. Don't allow nested BEGIN tags. | |
38 | ||
39 | ($extracted, $remainder) = | |
40 | extract_tagged($text,"BEGIN","END",undef,{reject=>["BEGIN"]}); | |
41 | ||
42 | ||
43 | # Extract the initial substring of $text that represents a | |
44 | # Perl "quote or quote-like operation" | |
45 | ||
46 | ($extracted, $remainder) = extract_quotelike($text); | |
47 | ||
48 | ||
49 | # Extract the initial substring of $text that represents a block | |
50 | # of Perl code, bracketed by any of the character(s) specified by $delim | |
51 | # (where the string $delim contains one or more of '(){}[]<>'). | |
52 | ||
53 | ($extracted, $remainder) = extract_codeblock($text,$delim); | |
54 | ||
55 | ||
56 | # Extract the initial substrings of $text that would be extracted by | |
57 | # one or more sequential applications of the specified functions | |
58 | # or regular expressions | |
59 | ||
60 | @extracted = extract_multiple($text, | |
61 | [ \&extract_bracketed, | |
62 | \&extract_quotelike, | |
63 | \&some_other_extractor_sub, | |
64 | qr/[xyz]*/, | |
65 | 'literal', | |
66 | ]); | |
67 | ||
68 | # Create a string representing an optimized pattern (a la Friedl) | |
69 | # that matches a substring delimited by any of the specified characters | |
70 | # (in this case: any type of quote or a slash) | |
71 | ||
72 | $patstring = gen_delimited_pat(q{'"`/}); | |
73 | ||
74 | # Generate a reference to an anonymous sub that is just like | |
75 | # extract_tagged but pre-compiled and optimized for a specific pair of | |
76 | # tags, and consequently much faster (i.e. 3 times faster). It uses qr// | |
77 | # for better performance on repeated calls, so it only works under Perl | |
78 | # 5.005 or later. | |
79 | ||
80 | $extract_head = gen_extract_tagged('<HEAD>','</HEAD>'); | |
81 | ||
82 | ($extracted, $remainder) = $extract_head->($text); | |
83 | ||
84 | DESCRIPTION | |
85 | The various "extract_..." subroutines may be used to extract a delimited | |
86 | substring, possibly after skipping a specified prefix string. By | |
87 | default, that prefix is optional whitespace ("/\s*/"), but you can | |
88 | change it to whatever you wish (see below). | |
89 | ||
90 | The substring to be extracted must appear at the current "pos" location | |
91 | of the string's variable (or at index zero, if no "pos" position is | |
92 | defined). In other words, the "extract_..." subroutines *don't* extract | |
93 | the first occurrence of a substring anywhere in a string (like an | |
94 | unanchored regex would). Rather, they extract an occurrence of the | |
95 | substring appearing immediately at the current matching position in the | |
96 | string (like a "\G"-anchored regex would). | |
97 | ||
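The anchoring rule above can be seen in a minimal sketch. The sample string and variable names are illustrative assumptions; only "extract_delimited" and Perl's built-in "pos" are used.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

my $text = q{foo 'bar'};

# At position 0 the text starts with "foo", which is neither optional
# whitespace nor a quote, so nothing is extracted (unlike an unanchored regex).
my ($at_start) = extract_delimited($text, q{'});
print defined $at_start ? "got $at_start\n" : "no match at start\n";

# Move the matching position to the opening quote and try again.
pos($text) = 4;
my ($at_pos) = extract_delimited($text, q{'});
print "got $at_pos\n";
```

Setting "pos" by hand is only one way to reach the quote; a prefix pattern (described below) is usually more convenient.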
98 | General behaviour in list contexts | |
99 | In a list context, all the subroutines return a list, the first three | |
100 | elements of which are always: | |
101 | ||
102 | [0] The extracted string, including the specified delimiters. If the | |
103 | extraction fails "undef" is returned. | |
104 | ||
105 | [1] The remainder of the input string (i.e. the characters after the | |
106 | extracted string). On failure, the entire string is returned. | |
107 | ||
108 | [2] The skipped prefix (i.e. the characters before the extracted | |
109 | string). On failure, "undef" is returned. | |
110 | ||
111 | Note that in a list context, the contents of the original input text | |
112 | (the first argument) are not modified in any way. | |
113 | ||
114 | However, if the input text was passed in a variable, that variable's | |
115 | "pos" value is updated to point at the first character after the | |
116 | extracted text. That means that in a list context the various | |
117 | subroutines can be used much like regular expressions. For example: | |
118 | ||
119 | while ( $next = (extract_quotelike($text))[0] ) | |
120 | { | |
121 | # process next quote-like (in $next) | |
122 | } | |
123 | ||
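As a rough illustration of the list-context behaviour described above, the following sketch walks through a string with "extract_delimited". The sample text is an assumption made up for the example.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

my $text = q{ 'a' "b" 'c'};
my @found;

# Each call starts at the current pos($text); list context leaves $text intact.
while (defined(my $next = (extract_delimited($text, q{'"}))[0])) {
    push @found, $next;
}

print join(', ', @found), "\n";
print "text is still: [$text]\n";
```

The loop ends when no further delimited substring follows the current position, at which point the first returned element is "undef".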
124 | General behaviour in scalar and void contexts | |
125 | In a scalar context, the extracted string is returned, having first been | |
126 | removed from the input text. Thus, the following code also processes | |
127 | each quote-like operation, but actually removes them from $text: | |
128 | ||
129 | while ( $next = extract_quotelike($text) ) | |
130 | { | |
131 | # process next quote-like (in $next) | |
132 | } | |
133 | ||
134 | Note that if the input text is a read-only string (i.e. a literal), no | |
135 | attempt is made to remove the extracted text. | |
136 | ||
137 | In a void context the behaviour of the extraction subroutines is exactly | |
138 | the same as in a scalar context, except (of course) that the extracted | |
139 | substring is not returned. | |
140 | ||
141 | A note about prefixes | |
142 | Prefix patterns are matched without any trailing modifiers ("/gimsox" | |
143 | etc.). This can bite you if you're expecting a prefix specification like | |
144 | '.*?(?=<H1>)' to skip everything up to the first <H1> tag. Such a prefix | |
145 | pattern will only succeed if the <H1> tag is on the current line, since | |
146 | . normally doesn't match newlines. | |
147 | ||
148 | To overcome this limitation, you need to turn on /s matching within the | |
149 | prefix pattern, using the "(?s)" directive: '(?s).*?(?=<H1>)' | |
150 | ||
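A small sketch of the difference, assuming a made-up three-line input and using "extract_tagged" (described later) with explicit <H1> tags:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

my $text = "intro line\nsecond line\n<H1>Title</H1> tail";

# Without (?s), "." stops at the first newline, so the prefix cannot
# reach the <H1> tag and the extraction fails.
my ($without_s) = extract_tagged($text, '<H1>', '</H1>', '.*?(?=<H1>)');

# With (?s), the prefix can skip across lines up to the tag.
my ($with_s) = extract_tagged($text, '<H1>', '</H1>', '(?s).*?(?=<H1>)');

print defined $without_s ? "plain prefix: $without_s\n" : "plain prefix: no match\n";
print "(?s) prefix: $with_s\n";
```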
151 | "extract_delimited" | |
152 | The "extract_delimited" function formalizes the common idiom of | |
153 | extracting a single-character-delimited substring from the start of a | |
154 | string. For example, to extract a single-quote delimited string, the | |
155 | following code is typically used: | |
156 | ||
157 | ($remainder = $text) =~ s/\A('(\\.|[^'])*')//s; | |
158 | $extracted = $1; | |
159 | ||
160 | but with "extract_delimited" it can be simplified to: | |
161 | ||
162 | ($extracted,$remainder) = extract_delimited($text, "'"); | |
163 | ||
164 | "extract_delimited" takes up to four scalars (the input text, the | |
165 | delimiters, a prefix pattern to be skipped, and any escape characters) | |
166 | and extracts the initial substring of the text that is appropriately | |
167 | delimited. If the delimiter string has multiple characters, the first | |
168 | one encountered in the text is taken to delimit the substring. The third | |
169 | argument specifies a prefix pattern that is to be skipped (but must be | |
170 | present!) before the substring is extracted. The final argument | |
171 | specifies the escape character to be used for each delimiter. | |
172 | ||
173 | All arguments are optional. If the escape characters are not specified, | |
174 | every delimiter is escaped with a backslash ("\"). If the prefix is not | |
175 | specified, the pattern '\s*' - optional whitespace - is used. If the | |
176 | delimiter set is also not specified, the set "/["'`]/" is used. If the | |
177 | text to be processed is not specified either, $_ is used. | |
178 | ||
179 | In list context, "extract_delimited" returns an array of three elements, | |
180 | the extracted substring (*including the surrounding delimiters*), the | |
181 | remainder of the text, and the skipped prefix (if any). If a suitable | |
182 | delimited substring is not found, the first element of the array is | |
183 | "undef", the second is the complete original text, and the prefix | |
184 | returned in the third element is also "undef". | |
185 | ||
186 | In a scalar context, just the extracted substring is returned. In a void | |
187 | context, the extracted substring (and any prefix) are simply removed | |
188 | from the beginning of the first argument. | |
189 | ||
190 | Examples: | |
191 | ||
192 | # Remove a single-quoted substring from the very beginning of $text: | |
193 | ||
194 | $substring = extract_delimited($text, "'", ''); | |
195 | ||
196 | # Remove a single-quoted Pascalish substring (i.e. one in which | |
197 | # doubling the quote character escapes it) from the very | |
198 | # beginning of $text: | |
199 | ||
200 | $substring = extract_delimited($text, "'", '', "'"); | |
201 | ||
202 | # Extract a single- or double- quoted substring from the | |
203 | # beginning of $text, optionally after some whitespace | |
204 | # (note the list context to protect $text from modification): | |
205 | ||
206 | ($substring) = extract_delimited $text, q{"'}; | |
207 | ||
208 | # Delete the substring delimited by the first '/' in $text: | |
209 | ||
210 | $text = join '', (extract_delimited($text,'/','[^/]*'))[2,1]; | |
211 | ||
212 | Note that this last example is *not* the same as deleting the first | |
213 | quote-like pattern. For instance, if $text contained the string: | |
214 | ||
215 | "if ('./cmd' =~ m/$UNIXCMD/s) { $cmd = $1; }" | |
216 | ||
217 | then after the deletion it would contain: | |
218 | ||
219 | "if ('.$UNIXCMD/s) { $cmd = $1; }" | |
220 | ||
221 | not: | |
222 | ||
223 | "if ('./cmd' =~ ms) { $cmd = $1; }" | |
224 | ||
225 | See "extract_quotelike" for a (partial) solution to this problem. | |
226 | ||
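The three list-context return values of "extract_delimited", and the effect of a custom escape character, can be checked with a short sketch. The input strings are illustrative assumptions.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

# Backslash-escaped quote inside a single-quoted substring,
# preceded by whitespace that the default prefix skips.
my $text = q{   'it\'s' and more};
my ($extracted, $remainder, $prefix) = extract_delimited($text, q{'});
print "extracted: [$extracted]\n";
print "remainder: [$remainder]\n";
print "prefix:    [$prefix]\n";

# Pascal-style quoting: a doubled quote escapes itself (fourth argument),
# and the empty prefix means the quote must start the string.
my $pascal = q{'it''s' and more};
my ($pascal_str) = extract_delimited($pascal, q{'}, '', q{'});
print "pascal:    [$pascal_str]\n";
```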
227 | "extract_bracketed" | |
228 | Like "extract_delimited", the "extract_bracketed" function takes up to | |
229 | three optional scalar arguments: a string to extract from, a delimiter | |
230 | specifier, and a prefix pattern. As before, a missing prefix defaults to | |
231 | optional whitespace and a missing text defaults to $_. However, a | |
232 | missing delimiter specifier defaults to '{}()[]<>' (see below). | |
233 | ||
234 | "extract_bracketed" extracts a balanced-bracket-delimited substring | |
235 | (using any one (or more) of the user-specified delimiter brackets: | |
236 | '(..)', '{..}', '[..]', or '<..>'). Optionally it will also respect | |
237 | quoted unbalanced brackets (see below). | |
238 | ||
239 | A "delimiter bracket" is a bracket in the list of delimiters passed as | |
240 | "extract_bracketed"'s second argument. Delimiter brackets are specified | |
241 | by giving either the left or right (or both!) versions of the required | |
242 | bracket(s). Note that the order in which two or more delimiter brackets | |
243 | are specified is not significant. | |
244 | ||
245 | A "balanced-bracket-delimited substring" is a substring bounded by | |
246 | matched brackets, such that any other (left or right) delimiter bracket | |
247 | *within* the substring is also matched by an opposite (right or left) | |
248 | delimiter bracket *at the same level of nesting*. Any type of bracket | |
249 | not in the delimiter list is treated as an ordinary character. | |
250 | ||
251 | In other words, each type of bracket specified as a delimiter must be | |
252 | balanced and correctly nested within the substring, and any other kind | |
253 | of ("non-delimiter") bracket in the substring is ignored. | |
254 | ||
255 | For example, given the string: | |
256 | ||
257 | $text = "{ an '[irregularly :-(] {} parenthesized >:-)' string }"; | |
258 | ||
259 | then a call to "extract_bracketed" in a list context: | |
260 | ||
261 | @result = extract_bracketed( $text, '{}' ); | |
262 | ||
263 | would return: | |
264 | ||
265 | ( "{ an '[irregularly :-(] {} parenthesized >:-)' string }" , "" , "" ) | |
266 | ||
267 | since both sets of '{..}' brackets are properly nested and evenly | |
268 | balanced. (In a scalar context just the first element of the array would | |
269 | be returned. In a void context, $text would be replaced by an empty | |
270 | string.) | |
271 | ||
272 | Likewise the call in: | |
273 | ||
274 | @result = extract_bracketed( $text, '{[' ); | |
275 | ||
276 | would return the same result, since all sets of both types of specified | |
277 | delimiter brackets are correctly nested and balanced. | |
278 | ||
279 | However, the call in: | |
280 | ||
281 | @result = extract_bracketed( $text, '{([<' ); | |
282 | ||
283 | would fail, returning: | |
284 | ||
285 | ( undef , "{ an '[irregularly :-(] {} parenthesized >:-)' string }" ); | |
286 | ||
287 | because the embedded pairs of '(..)'s and '[..]'s are "cross-nested" and | |
288 | the embedded '>' is unbalanced. (In a scalar context, this call would | |
289 | return an empty string. In a void context, $text would be unchanged.) | |
290 | ||
291 | Note that the embedded single-quotes in the string don't help in this | |
292 | case, since they have not been specified as acceptable delimiters and | |
293 | are therefore treated as non-delimiter characters (and ignored). | |
294 | ||
295 | However, if a particular species of quote character is included in the | |
296 | delimiter specification, then that type of quote will be correctly | |
297 | handled. For example, if $text is: | |
298 | ||
299 | $text = '<A HREF=">>>>">link</A>'; | |
300 | ||
301 | then | |
302 | ||
303 | @result = extract_bracketed( $text, '<">' ); | |
304 | ||
305 | returns: | |
306 | ||
307 | ( '<A HREF=">>>>">', 'link</A>', "" ) | |
308 | ||
309 | as expected. Without the specification of """ as an embedded quoter: | |
310 | ||
311 | @result = extract_bracketed( $text, '<>' ); | |
312 | ||
313 | the result would be: | |
314 | ||
315 | ( '<A HREF=">', '>>>">link</A>', "" ) | |
316 | ||
317 | In addition to the quote delimiters "'", """, and "`", full Perl | |
318 | quote-like quoting (i.e. q{string}, qq{string}, etc) can be specified by | |
319 | including the letter 'q' as a delimiter. Hence: | |
320 | ||
321 | @result = extract_bracketed( $text, '<q>' ); | |
322 | ||
323 | would correctly match something like this: | |
324 | ||
325 | $text = '<leftop: conj /and/ conj>'; | |
326 | ||
327 | See also: "extract_quotelike" and "extract_codeblock". | |
328 | ||
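A runnable sketch of "extract_bracketed", combining a simple nested case (an assumed input) with the quoted-bracket example from this section:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Nested round brackets are balanced, so the whole group is extracted.
my $expr = '(a (b) c) d';
my ($group, $after) = extract_bracketed($expr, '()');
print "group: [$group] after: [$after]\n";

# Listing '"' as a delimiter makes brackets inside double quotes inert.
my $html = '<A HREF=">>>>">link</A>';
my ($tag, $rest) = extract_bracketed($html, '<">');
print "tag: [$tag] rest: [$rest]\n";
```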
329 | "extract_variable" | |
330 | "extract_variable" extracts any valid Perl variable or variable-involved | |
331 | expression, including scalars, arrays, hashes, array accesses, hash | |
332 | look-ups, method calls through objects, subroutine calls through | |
333 | subroutine references, etc. | |
334 | ||
335 | The subroutine takes up to two optional arguments: | |
336 | ||
337 | 1. A string to be processed ($_ if the string is omitted or "undef") | |
338 | ||
339 | 2. A string specifying a pattern to be matched as a prefix (which is to | |
340 | be skipped). If omitted, optional whitespace is skipped. | |
341 | ||
342 | On success in a list context, an array of 3 elements is returned. The | |
343 | elements are: | |
344 | ||
345 | [0] the extracted variable, or variablish expression | |
346 | ||
347 | [1] the remainder of the input text, | |
348 | ||
349 | [2] the prefix substring (if any), | |
350 | ||
351 | On failure, all of these values (except the remaining text) are "undef". | |
352 | ||
353 | In a scalar context, "extract_variable" returns just the complete | |
354 | substring that matched a variablish expression. "undef" is returned on | |
355 | failure. In addition, the original input text has the returned substring | |
356 | (and any prefix) removed from it. | |
357 | ||
358 | In a void context, the input text just has the matched substring (and | |
359 | any specified prefix) removed. | |
360 | ||
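As a minimal sketch of "extract_variable" in list context, assuming a made-up assignment statement as input:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_variable);

my $code = '$obj->{name} = 1;';

# List context: the variable expression, the rest of the text, the prefix.
my ($var, $rest, $prefix) = extract_variable($code);

print "variable:  [$var]\n";
print "remainder: [$rest]\n";
```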
361 | "extract_tagged" | |
362 | "extract_tagged" extracts and segments text between (balanced) specified | |
363 | tags. | |
364 | ||
365 | The subroutine takes up to five optional arguments: | |
366 | ||
367 | 1. A string to be processed ($_ if the string is omitted or "undef") | |
368 | ||
369 | 2. A string specifying a pattern to be matched as the opening tag. If | |
370 | the pattern string is omitted (or "undef") then a pattern that | |
371 | matches any standard XML tag is used. | |
372 | ||
373 | 3. A string specifying a pattern to be matched at the closing tag. If | |
374 | the pattern string is omitted (or "undef") then the closing tag is | |
375 | constructed by inserting a "/" after any leading bracket characters | |
376 | in the actual opening tag that was matched (*not* the pattern that | |
377 | matched the tag). For example, if the opening tag pattern is | |
378 | specified as '{{\w+}}' and actually matched the opening tag | |
379 | "{{DATA}}", then the constructed closing tag would be "{{/DATA}}". | |
380 | ||
381 | 4. A string specifying a pattern to be matched as a prefix (which is to | |
382 | be skipped). If omitted, optional whitespace is skipped. | |
383 | ||
384 | 5. A hash reference containing various parsing options (see below) | |
385 | ||
386 | The various options that can be specified are: | |
387 | ||
388 | "reject => $listref" | |
389 | The list reference contains one or more strings specifying patterns | |
390 | that must *not* appear within the tagged text. | |
391 | ||
392 | For example, to extract an HTML link (which should not contain | |
393 | nested links) use: | |
394 | ||
395 | extract_tagged($text, '<A>', '</A>', undef, {reject => ['<A>']} ); | |
396 | ||
397 | "ignore => $listref" | |
398 | The list reference contains one or more strings specifying patterns | |
399 | that are *not* to be treated as nested tags within the tagged text | |
400 | (even if they would match the start tag pattern). | |
401 | ||
402 | For example, to extract an arbitrary XML tag, but ignore "empty" | |
403 | elements: | |
404 | ||
405 | extract_tagged($text, undef, undef, undef, {ignore => ['<[^>]*/>']} ); | |
406 | ||
407 | (also see "gen_delimited_pat" below). | |
408 | ||
409 | "fail => $str" | |
410 | The "fail" option indicates the action to be taken if a matching end | |
411 | tag is not encountered (i.e. before the end of the string or some | |
412 | "reject" pattern matches). By default, a failure to match a closing | |
413 | tag causes "extract_tagged" to immediately fail. | |
414 | ||
415 | However, if the string value associated with "fail" is "MAX", then | |
416 | "extract_tagged" returns the complete text up to the point of | |
417 | failure. If the string is "PARA", "extract_tagged" returns only the | |
418 | first paragraph after the tag (up to the first line that is either | |
419 | empty or contains only whitespace characters). If the string is "", | |
420 | the default behaviour (i.e. failure) is reinstated. | |
421 | ||
422 | For example, suppose the start tag "/para" introduces a paragraph, | |
423 | which then continues until the next "/endpara" tag or until another | |
424 | "/para" tag is encountered: | |
425 | ||
426 | $text = "/para line 1\n\nline 3\n/para line 4"; | |
427 | ||
428 | extract_tagged($text, '/para', '/endpara', undef, | |
429 | {reject => ['/para'], fail => 'MAX'} ); | |
430 | ||
431 | # EXTRACTED: "/para line 1\n\nline 3\n" | |
432 | ||
433 | Suppose instead, that if no matching "/endpara" tag is found, the | |
434 | "/para" tag refers only to the immediately following paragraph: | |
435 | ||
436 | $text = "/para line 1\n\nline 3\n/para line 4"; | |
437 | ||
438 | extract_tagged($text, '/para', '/endpara', undef, | |
439 | {reject => ['/para'], fail => 'PARA'} ); | |
440 | ||
441 | # EXTRACTED: "/para line 1\n" | |
442 | ||
443 | Note that the specified "fail" behaviour applies to nested tags as | |
444 | well. | |
445 | ||
446 | On success in a list context, an array of 6 elements is returned. The | |
447 | elements are: | |
448 | ||
449 | [0] the extracted tagged substring (including the outermost tags), | |
450 | ||
451 | [1] the remainder of the input text, | |
452 | ||
453 | [2] the prefix substring (if any), | |
454 | ||
455 | [3] the opening tag | |
456 | ||
457 | [4] the text between the opening and closing tags | |
458 | ||
459 | [5] the closing tag (or "" if no closing tag was found) | |
460 | ||
461 | On failure, all of these values (except the remaining text) are "undef". | |
462 | ||
463 | In a scalar context, "extract_tagged" returns just the complete | |
464 | substring that matched a tagged text (including the start and end tags). | |
465 | "undef" is returned on failure. In addition, the original input text has | |
466 | the returned substring (and any prefix) removed from it. | |
467 | ||
468 | In a void context, the input text just has the matched substring (and | |
469 | any specified prefix) removed. | |
470 | ||
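The six-element return list and the "reject" option can be illustrated with a short sketch. Both input strings are assumptions chosen for the example.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

# Default tag patterns: any XML-style tag, with nested tags balanced.
my $text  = '<b>bold <i>nested</i></b> rest';
my @parts = extract_tagged($text);

my @labels = ('extracted', 'remainder', 'prefix', 'open tag', 'content', 'close tag');
printf "%-10s [%s]\n", $labels[$_], $parts[$_] for 0 .. $#parts;

# With reject, a nested opening tag makes the extraction fail outright.
my $links  = '<A>outer <A>inner</A></A>';
my ($link) = extract_tagged($links, '<A>', '</A>', undef, { reject => ['<A>'] });
print defined $link ? "link: $link\n" : "nested <A> rejected\n";
```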
471 | "gen_extract_tagged" | |
472 | (Note: This subroutine is only available under Perl 5.005 or later) | |
473 | ||
474 | "gen_extract_tagged" generates a new anonymous subroutine which extracts | |
475 | text between (balanced) specified tags. In other words, it generates a | |
476 | function identical in function to "extract_tagged". | |
477 | ||
478 | The difference between "extract_tagged" and the anonymous subroutines | |
479 | generated by "gen_extract_tagged", is that those generated subroutines: | |
480 | ||
481 | * do not have to reparse tag specification or parsing options every | |
482 | time they are called (whereas "extract_tagged" has to effectively | |
483 | rebuild its tag parser on every call); | |
484 | ||
485 | * make use of the new qr// construct to pre-compile the regexes they | |
486 | use (whereas "extract_tagged" uses standard string variable | |
487 | interpolation to create tag-matching patterns). | |
488 | ||
489 | The subroutine takes up to four optional arguments (the same set as | |
490 | "extract_tagged" except for the string to be processed). It returns a | |
491 | reference to a subroutine which in turn takes a single argument (the | |
492 | text to be extracted from). | |
493 | ||
494 | In other words, the implementation of "extract_tagged" is exactly | |
495 | equivalent to: | |
496 | ||
497 | sub extract_tagged | |
498 | { | |
499 | my $text = shift; | |
500 | my $extractor = gen_extract_tagged(@_); | |
501 | return $extractor->($text); | |
502 | } | |
503 | ||
504 | (although "extract_tagged" is not currently implemented that way, in | |
505 | order to preserve pre-5.005 compatibility). | |
506 | ||
507 | Using "gen_extract_tagged" to create extraction functions for specific | |
508 | tags is a good idea if those functions are going to be called more than | |
509 | once, since their performance is typically twice as good as the more | |
510 | general-purpose "extract_tagged". | |
511 | ||
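A minimal sketch of reusing one generated extractor across several inputs. The documents in @docs are made up for illustration, and each is copied into a modifiable variable before extraction.

```perl
use strict;
use warnings;
use Text::Balanced qw(gen_extract_tagged);

# Build the extractor once; the tag patterns are compiled up front.
my $extract_head = gen_extract_tagged('<HEAD>', '</HEAD>');

my @docs = ('<HEAD>one</HEAD> body one', '<HEAD>two</HEAD> body two');
my @heads;

for my $doc (@docs) {
    my $copy = $doc;                          # work on a modifiable copy
    my ($head, $rest) = $extract_head->($copy);
    push @heads, $head;
    print "head: [$head] rest: [$rest]\n";
}
```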
512 | "extract_quotelike" | |
513 | "extract_quotelike" attempts to recognize, extract, and segment any one | |
514 | of the various Perl quotes and quotelike operators (see perlop(3)). | |
515 | Nested backslashed delimiters, embedded balanced bracket delimiters (for | |
516 | the quotelike operators), and trailing modifiers are all caught. For | |
517 | example, in: | |
518 | ||
519 | extract_quotelike 'q # an octothorpe: \# (not the end of the q!) #' | |
520 | ||
521 | extract_quotelike ' "You said, \"Use sed\"." ' | |
522 | ||
523 | extract_quotelike ' s{([A-Z]{1,8}\.[A-Z]{3})} /\L$1\E/; ' | |
524 | ||
525 | extract_quotelike ' tr/\\\/\\\\/\\\//ds; ' | |
526 | ||
527 | the full Perl quotelike operations are all extracted correctly. | |
528 | ||
529 | Note too that, when using the /x modifier on a regex, any comment | |
530 | containing the current pattern delimiter will cause the regex to be | |
531 | immediately terminated. In other words: | |
532 | ||
533 | 'm / | |
534 | (?i) # CASE INSENSITIVE | |
535 | [a-z_] # LEADING ALPHABETIC/UNDERSCORE | |
536 | [a-z0-9]* # FOLLOWED BY ANY NUMBER OF ALPHANUMERICS | |
537 | /x' | |
538 | ||
539 | will be extracted as if it were: | |
540 | ||
541 | 'm / | |
542 | (?i) # CASE INSENSITIVE | |
543 | [a-z_] # LEADING ALPHABETIC/' | |
544 | ||
545 | This behaviour is identical to that of the actual compiler. | |
546 | ||
547 | "extract_quotelike" takes two arguments: the text to be processed and a | |
548 | prefix to be matched at the very beginning of the text. If no prefix is | |
549 | specified, optional whitespace is the default. If no text is given, $_ | |
550 | is used. | |
551 | ||
552 | In a list context, an array of 11 elements is returned. The elements | |
553 | are: | |
554 | ||
555 | [0] the extracted quotelike substring (including trailing modifiers), | |
556 | ||
557 | [1] the remainder of the input text, | |
558 | ||
559 | [2] the prefix substring (if any), | |
560 | ||
561 | [3] the name of the quotelike operator (if any), | |
562 | ||
563 | [4] the left delimiter of the first block of the operation, | |
564 | ||
565 | [5] the text of the first block of the operation (that is, the contents | |
566 | of a quote, the regex of a match or substitution or the target list | |
567 | of a translation), | |
568 | ||
569 | [6] the right delimiter of the first block of the operation, | |
570 | ||
571 | [7] the left delimiter of the second block of the operation (that is, if | |
572 | it is a "s", "tr", or "y"), | |
573 | ||
574 | [8] the text of the second block of the operation (that is, the | |
575 | replacement of a substitution or the translation list of a | |
576 | translation), | |
577 | ||
578 | [9] the right delimiter of the second block of the operation (if any), | |
579 | ||
580 | [10] | |
581 | the trailing modifiers on the operation (if any). | |
582 | ||
583 | For each of the fields marked "(if any)" the default value on success is | |
584 | an empty string. On failure, all of these values (except the remaining | |
585 | text) are "undef". | |
586 | ||
587 | In a scalar context, "extract_quotelike" returns just the complete | |
588 | substring that matched a quotelike operation (or "undef" on failure). In | |
589 | a scalar or void context, the input text has the same substring (and any | |
590 | specified prefix) removed. | |
591 | ||
592 | Examples: | |
593 | ||
594 | # Remove the first quotelike literal that appears in text | |
595 | ||
596 | $quotelike = extract_quotelike($text,'.*?'); | |
597 | ||
598 | # Replace one or more leading whitespace-separated quotelike | |
599 | # literals in $_ with "<QLL>" | |
600 | ||
601 | do { $_ = join '<QLL>', (extract_quotelike)[2,1] } until $@; | |
602 | ||
603 | ||
604 | # Isolate the search pattern in a quotelike operation from $text | |
605 | ||
606 | ($op,$pat) = (extract_quotelike $text)[3,5]; | |
607 | if ($op =~ /[ms]/) | |
608 | { | |
609 | print "search pattern: $pat\n"; | |
610 | } | |
611 | else | |
612 | { | |
613 | print "$op is not a pattern matching operation\n"; | |
614 | } | |
615 | ||
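To make the field layout of the list-context return value more concrete, here is a hedged sketch that picks a few of the eleven elements out of a substitution. The input statement is an assumption for the example.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_quotelike);

my $text   = q{s/foo/bar/gi; print "done";};
my @fields = extract_quotelike($text);

# Indices follow the element list given above.
print "whole operation: $fields[0]\n";
print "operator:        $fields[3]\n";
print "first block:     $fields[5]\n";
print "second block:    $fields[8]\n";
print "modifiers:       $fields[10]\n";
```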
616 | "extract_quotelike" and "here documents" | |
617 | "extract_quotelike" can successfully extract "here documents" from an | |
618 | input string, but with an important caveat in list contexts. | |
619 | ||
620 | Unlike other types of quote-like literals, a here document is rarely a | |
621 | contiguous substring. For example, a typical piece of code using a here | |
622 | document might look like this: | |
623 | ||
624 | <<'EOMSG' || die; | |
625 | This is the message. | |
626 | EOMSG | |
627 | exit; | |
628 | ||
629 | Given this as an input string in a scalar context, "extract_quotelike" | |
630 | would correctly return the string "<<'EOMSG'\nThis is the | |
631 | message.\nEOMSG", leaving the string " || die;\nexit;" in the original | |
632 | variable. In other words, the two separate pieces of the here document | |
633 | are successfully extracted and concatenated. | |
634 | ||
635 | In a list context, "extract_quotelike" would return the list | |
636 | ||
637 | [0] "<<'EOMSG'\nThis is the message.\nEOMSG\n" (i.e. the full extracted | |
638 | here document, including fore and aft delimiters), | |
639 | ||
640 | [1] " || die;\nexit;" (i.e. the remainder of the input text, | |
641 | concatenated), | |
642 | ||
643 | [2] "" (i.e. the prefix substring -- trivial in this case), | |
644 | ||
645 | [3] "<<" (i.e. the "name" of the quotelike operator) | |
646 | ||
647 | [4] "'EOMSG'" (i.e. the left delimiter of the here document, including | |
648 | any quotes), | |
649 | ||
650 | [5] "This is the message.\n" (i.e. the text of the here document), | |
651 | ||
652 | [6] "EOMSG" (i.e. the right delimiter of the here document), | |
653 | ||
654 | [7..10] | |
655 | "" (a here document has no second left delimiter, second text, | |
656 | second right delimiter, or trailing modifiers). | |
657 | ||
658 | However, the matching position of the input variable would be set to | |
659 | "exit;" (i.e. *after* the closing delimiter of the here document), which | |
660 | would cause the earlier " || die;\nexit;" to be skipped in any sequence | |
661 | of code fragment extractions. | |
662 | ||
663 | To avoid this problem, when it encounters a here document whilst | |
664 | extracting from a modifiable string, "extract_quotelike" silently | |
665 | rearranges the string to an equivalent piece of Perl: | |
666 | ||
667 | <<'EOMSG' | |
668 | This is the message. | |
669 | EOMSG | |
670 | || die; | |
671 | exit; | |
672 | ||
673 | in which the here document *is* contiguous. It still leaves the matching | |
674 | position after the here document, but now the rest of the line on which | |
675 | the here document starts is not skipped. | |
676 | ||
677 | To prevent "extract_quotelike" from mucking about with the input in this | |
678 | way (this is the only case where a list-context "extract_quotelike" does | |
679 | so), you can pass the input variable as an interpolated literal: | |
680 | ||
681 | $quotelike = extract_quotelike("$var"); | |
682 | ||
683 | "extract_codeblock" | |
684 | "extract_codeblock" attempts to recognize and extract a balanced bracket | |
685 | delimited substring that may contain unbalanced brackets inside Perl | |
686 | quotes or quotelike operations. That is, "extract_codeblock" is like a | |
687 | combination of "extract_bracketed" and "extract_quotelike". | |
688 | ||
689 | "extract_codeblock" takes the same initial three parameters as | |
690 | "extract_bracketed": a text to process, a set of delimiter brackets to | |
691 | look for, and a prefix to match first. It also takes an optional fourth | |
692 | parameter, which allows the outermost delimiter brackets to be specified | |
693 | separately (see below). | |
694 | ||
695 | Omitting the first argument (input text) means process $_ instead. | |
696 | Omitting the second argument (delimiter brackets) indicates that only | |
697 | '{' is to be used. Omitting the third argument (prefix argument) implies | |
698 | optional whitespace at the start. Omitting the fourth argument | |
699 | (outermost delimiter brackets) indicates that the value of the second | |
700 | argument is to be used for the outermost delimiters. | |
701 | ||
702 | Once the prefix and the outermost opening delimiter bracket have been | |
703 | recognized, code blocks are extracted by stepping through the input text | |
704 | and trying the following alternatives in sequence: | |
705 | ||
706 | 1. Try and match a closing delimiter bracket. If the bracket was the | |
707 | same species as the last opening bracket, return the substring to | |
708 | that point. If the bracket was mismatched, return an error. | |
709 | ||
710 | 2. Try to match a quote or quotelike operator. If found, call | |
711 | "extract_quotelike" to eat it. If "extract_quotelike" fails, return | |
712 | the error it returned. Otherwise go back to step 1. | |
713 | ||
714 | 3. Try to match an opening delimiter bracket. If found, call | |
715 | "extract_codeblock" recursively to eat the embedded block. If the | |
716 | recursive call fails, return an error. Otherwise, go back to step 1. | |
717 | ||
718 | 4. Unconditionally match a bareword or any other single character, and | |
719 | then go back to step 1. | |
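
The stepping procedure above can be illustrated with a small, made-up
input. This is only a sketch (the sample string is hypothetical, not
taken from the module): it shows that a closing brace inside a quoted
string is consumed by the quotelike step and does not end the block.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_codeblock);

# Hypothetical input: the "}" inside the string literal must not
# be treated as the closing delimiter of the block.
my $text = q!{ print "}"; } rest!;

# List context: the extracted block and the remainder are returned.
# Only '{' ... '}' are used as delimiters here.
my ($block, $remainder) = extract_codeblock($text, '{}');

print "block:     [$block]\n";
print "remainder: [$remainder]\n";
```

The input is quoted with q!...! rather than q{...} only because the
sample text contains an unbalanced number of braces.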
720 | ||
721 | Examples: | |
722 | ||
723 | # Find a while loop in the text | |
724 | ||
725 | if ($text =~ s/.*?while\s*\{/{/) | |
726 | { | |
727 | $loop = "while " . extract_codeblock($text); | |
728 | } | |
729 | ||
730 | # Remove the first round-bracketed list (which may include | |
731 | # round- or curly-bracketed code blocks or quotelike operators) | |
732 | ||
733 | extract_codeblock $text, "(){}", '[^(]*'; | |
734 | ||
735 | The ability to specify a different outermost delimiter bracket is useful | |
736 | in some circumstances. For example, in the Parse::RecDescent module, | |
737 | parser actions which are to be performed only on a successful parse are | |
738 | specified using a "<defer:...>" directive. For example: | |
739 | ||
740 | sentence: subject verb object | |
741 | <defer: {$::theVerb = $item{verb}} > | |
742 | ||
743 | Parse::RecDescent uses "extract_codeblock($text, '{}<>')" to extract the | |
744 | code within the "<defer:...>" directive, but there's a problem. | |
745 | ||
746 | A deferred action like this: | |
747 | ||
748 | <defer: {if ($count>10) {$count--}} > | |
749 | ||
750 | will be incorrectly parsed as: | |
751 | ||
752 | <defer: {if ($count> | |
753 | ||
754 | because the "greater than" operator is interpreted as a closing delimiter. | |
755 | ||
756 | But, by extracting the directive using | |
757 | "extract_codeblock($text, '{}', undef, '<>')" the '>' character is only | |
758 | treated as a delimiter at the outermost level of the code block, so the | |
759 | directive is parsed correctly. | |
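
A minimal sketch of the four-argument call described above, using the
directive text from the example. The trailing " rest" is added here
purely for illustration.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_codeblock);

# Directive text modelled on the example above (single-quoted, so
# $count is literal text, not an interpolated variable).
my $text = '<defer: {if ($count>10) {$count--}} > rest';

# '{}' are the inner delimiters, the prefix is left at its default,
# and '<>' are accepted only as the outermost pair.
my ($directive, $remainder) = extract_codeblock($text, '{}', undef, '<>');

print "directive: [$directive]\n";
print "remainder: [$remainder]\n";
```

Because '>' is not among the inner delimiters, the comparison inside
the nested block is stepped over as ordinary code.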
760 | ||
761 | "extract_multiple" | |
762 | The "extract_multiple" subroutine takes a string to be processed and a | |
763 | list of extractors (subroutines or regular expressions) to apply to that | |
764 | string. | |
765 | ||
766 | In a list context "extract_multiple" returns a list of substrings of | |
767 | the original string, as extracted by the specified extractors. In a | |
768 | scalar context, "extract_multiple" returns the first substring | |
769 | successfully extracted from the original string. In both scalar and void | |
770 | contexts the original string has the first successfully extracted | |
771 | substring removed from it. In all contexts "extract_multiple" starts at | |
772 | the current "pos" of the string, and sets that "pos" appropriately after | |
773 | it matches. | |
774 | ||
775 | Hence, the aim of a call to "extract_multiple" in a list context is | |
776 | to split the processed string into as many non-overlapping fields as | |
777 | possible, by repeatedly applying each of the specified extractors to the | |
778 | remainder of the string. Thus "extract_multiple" is a generalized form | |
779 | of Perl's "split" subroutine. | |
780 | ||
781 | The subroutine takes up to four optional arguments: | |
782 | ||
783 | 1. A string to be processed ($_ if the string is omitted or "undef") | |
784 | ||
785 | 2. A reference to a list of subroutine references and/or qr// objects | |
786 | and/or literal strings and/or hash references, specifying the | |
787 | extractors to be used to split the string. If this argument is | |
788 | omitted (or "undef") the list: | |
789 | ||
790 | [ | |
791 | sub { extract_variable($_[0], '') }, | |
792 | sub { extract_quotelike($_[0],'') }, | |
793 | sub { extract_codeblock($_[0],'{}','') }, | |
794 | ] | |
795 | ||
796 | is used. | |
797 | ||
798 | 3. A number specifying the maximum number of fields to return. If this | |
799 | argument is omitted (or "undef"), split continues as long as | |
800 | possible. | |
801 | ||
802 | If the third argument is *N*, then extraction continues until *N* | |
803 | fields have been successfully extracted, or until the string has | |
804 | been completely processed. | |
805 | ||
806 | Note that in scalar and void contexts the value of this argument is | |
807 | automatically reset to 1 (under "-w", a warning is issued if the | |
808 | argument has to be reset). | |
809 | ||
810 | 4. A value indicating whether unmatched substrings (see below) within | |
811 | the text should be skipped or returned as fields. If the value is | |
812 | true, such substrings are skipped. Otherwise, they are returned. | |
813 | ||
814 | The extraction process works by applying each extractor in sequence to | |
815 | the text string. | |
816 | ||
817 | If the extractor is a subroutine it is called in a list context and is | |
818 | expected to return a list of a single element, namely the extracted | |
819 | text. It may optionally also return two further arguments: a string | |
820 | representing the text left after extraction (like $' for a pattern | |
821 | match), and a string representing any prefix skipped before the | |
822 | extraction (like $` in a pattern match). Note that this is designed to | |
823 | facilitate the use of other Text::Balanced subroutines with | |
824 | "extract_multiple". Note too that the value returned by an extractor | |
825 | subroutine need not bear any relationship to the corresponding substring | |
826 | of the original text (see examples below). | |
827 | ||
828 | If the extractor is a precompiled regular expression or a string, it is | |
829 | matched against the text in a scalar context with a leading '\G' and the | |
830 | gc modifiers enabled. The extracted value is either $1 if that variable | |
831 | is defined after the match, or else the complete match (i.e. $&). | |
832 | ||
833 | If the extractor is a hash reference, it must contain exactly one | |
834 | element. The value of that element is one of the above extractor types | |
835 | (subroutine reference, regular expression, or string). The key of that | |
836 | element is the name of a class into which the successful return value of | |
837 | the extractor will be blessed. | |
838 | ||
839 | If an extractor returns a defined value, that value is immediately | |
840 | treated as the next extracted field and pushed onto the list of fields. | |
841 | If the extractor was specified in a hash reference, the field is also | |
842 | blessed into the appropriate class. | |
843 | ||
844 | If the extractor fails to match (in the case of a regex extractor), or | |
845 | returns an empty list or an undefined value (in the case of a subroutine | |
846 | extractor), it is assumed to have failed to extract. If none of the | |
847 | extractor subroutines succeeds, then one character is extracted from the | |
848 | start of the text and the extraction subroutines reapplied. Characters | |
849 | which are thus removed are accumulated and eventually become the next | |
850 | field (unless the fourth argument is true, in which case they are | |
851 | discarded). | |
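
As a sketch of a subroutine extractor whose return value is not a
substring of the original text, the hypothetical extractor below
returns twice the value of each run of digits. It relies on an
assumption not spelled out above: that the extractor receives the text
as an alias in $_[0] and is itself responsible for advancing "pos" past
what it consumed (here the //gc match does that).

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple);

my $text = 'a12b3';

# Hypothetical extractor: matches a run of digits at the current
# position of its (aliased) argument and returns twice its value.
# The //gc match on $_[0] also advances pos() past the digits; on
# failure pos() is left alone and undef signals "no extraction".
my $double_digits = sub {
    return $_[0] =~ /\G(\d+)/gc ? $1 * 2 : undef;
};

# True fourth argument: unmatched characters ('a' and 'b') are skipped.
my @fields = extract_multiple($text, [ $double_digits ], undef, 1);

print join(',', @fields), "\n";
```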
852 | ||
853 | For example, the following extracts substrings that are valid Perl | |
854 | variables: | |
855 | ||
856 | @fields = extract_multiple($text, | |
857 | [ sub { extract_variable($_[0]) } ], | |
858 | undef, 1); | |
859 | ||
860 | This example separates a text into fields which are quote delimited, | |
861 | curly bracketed, and anything else. The delimited and bracketed parts | |
862 | are also blessed to identify them (the "anything else" is unblessed): | |
863 | ||
864 | @fields = extract_multiple($text, | |
865 | [ | |
866 | { Delim => sub { extract_delimited($_[0],q{'"}) } }, | |
867 | { Brack => sub { extract_bracketed($_[0],'{}') } }, | |
868 | ]); | |
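
The blessed fields can then be told apart with "ref". The sketch below
assumes that a blessed field is a reference to a scalar holding the
extracted text (this matches the implementations I am aware of, but is
not stated above), and uses a made-up input string.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple extract_delimited extract_bracketed);

my $text = q{'a'{b}c};

my @fields = extract_multiple($text,
    [
        { Delim => sub { extract_delimited($_[0], q{'"}) } },
        { Brack => sub { extract_bracketed($_[0], '{}') } },
    ]);

for my $field (@fields) {
    if (my $class = ref $field) {
        # Assumption: a blessed field is a reference to a scalar
        # holding the extracted text.
        print "$class: $$field\n";
    }
    else {
        print "plain: $field\n";
    }
}
```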
869 | ||
870 | This call extracts the next single substring that is a valid Perl | |
871 | quotelike operator (and removes it from $text): | |
872 | ||
873 | $quotelike = extract_multiple($text, | |
874 | [ | |
875 | sub { extract_quotelike($_[0]) }, | |
876 | ], undef, 1); | |
877 | ||
878 | Finally, here is yet another way to do comma-separated value parsing: | |
879 | ||
880 | @fields = extract_multiple($csv_text, | |
881 | [ | |
882 | sub { extract_delimited($_[0],q{'"}) }, | |
883 | qr/([^,]+)/, | |
884 | ], | |
885 | undef,1); | |
886 | ||
887 | The list in the second argument means: *"Try and extract a ' or " | |
888 | delimited string, otherwise extract anything up to a comma..."*. The | |
889 | undef third argument means: *"...as many times as possible..."*, and the | |
890 | true value in the fourth argument means *"...discarding anything else | |
891 | that appears (i.e. the commas)"*. | |
892 | ||
893 | If you wanted the commas preserved as separate fields (i.e. like split | |
894 | does if your split pattern has capturing parentheses), you would just | |
895 | make the last parameter undefined (or remove it). | |
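
A self-contained sketch of the comma-separated case, with a made-up
input line. The regular-expression extractor here captures only the
field itself, so that matching stops before the next comma. Note that
quoted fields come back with their quotes still attached.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple extract_delimited);

my $csv_text = q{a,"b,c",d};

my @fields = extract_multiple($csv_text,
    [
        sub { extract_delimited($_[0], q{'"}) },
        qr/([^,]+)/,    # one unquoted field, stopping before the comma
    ],
    undef, 1);          # no field limit; discard the commas

print "$_\n" for @fields;
```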
896 | ||
897 | "gen_delimited_pat" | |
898 | The "gen_delimited_pat" subroutine takes a single (string) argument and | |
899 | builds a Friedl-style optimized regex that matches a string delimited | |
900 | by any one of the characters in the single argument. For example: | |
901 | ||
902 | gen_delimited_pat(q{'"}) | |
903 | ||
904 | returns the regex: | |
905 | ||
906 | (?:\"(?:\\\"|(?!\").)*\"|\'(?:\\\'|(?!\').)*\') | |
907 | ||
908 | Note that the specified delimiters are automatically quotemeta'd. | |
909 | ||
910 | A typical use of "gen_delimited_pat" would be to build special purpose | |
911 | tags for "extract_tagged". For example, to properly ignore "empty" XML | |
912 | elements (which might contain quoted strings): | |
913 | ||
914 | my $empty_tag = '<(' . gen_delimited_pat(q{'"}) . '|.)+/>'; | |
915 | ||
916 | extract_tagged($text, undef, undef, undef, {ignore => [$empty_tag]} ); | |
917 | ||
918 | "gen_delimited_pat" may also be called with an optional second argument, | |
919 | which specifies the "escape" character(s) to be used for each delimiter. | |
920 | For example to match a Pascal-style string (where ' is the delimiter and | |
921 | '' is a literal ' within the string): | |
922 | ||
923 | gen_delimited_pat(q{'},q{'}); | |
924 | ||
925 | Different escape characters can be specified for different delimiters. | |
926 | For example, to specify that '/' is the escape for single quotes and '%' | |
927 | is the escape for double quotes: | |
928 | ||
929 | gen_delimited_pat(q{'"},q{/%}); | |
930 | ||
931 | If more delimiters than escape chars are specified, the last escape char | |
932 | is used for the remaining delimiters. If no escape char is specified for | |
933 | a given specified delimiter, '\' is used. | |
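
The returned pattern is an ordinary string that can be interpolated
into a larger regex. The following sketch (sample strings are made up)
uses it to pull quoted strings out of a line, first with the default
backslash escape and then in the Pascal style described above.

```perl
use strict;
use warnings;
use Text::Balanced qw(gen_delimited_pat);

# Default escape character (backslash) for both quote styles.
my $quoted = gen_delimited_pat(q{'"});
my $text   = q{say "hi \"x\"" and 'bye'};
my @found  = $text =~ /($quoted)/g;

# Pascal style: the delimiter doubles as its own escape.
my $pascal = gen_delimited_pat(q{'}, q{'});
my ($pascal_str) = q{x := 'it''s';} =~ /($pascal)/;

print "$_\n" for @found, $pascal_str;
```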
934 | ||
935 | "delimited_pat" | |
936 | Note that "gen_delimited_pat" was previously called "delimited_pat". | |
937 | That name may still be used, but is now deprecated. | |
938 | ||
939 | DIAGNOSTICS | |
940 | In a list context, all the functions return "(undef,$original_text)" on | |
941 | failure. In a scalar context, failure is indicated by returning "undef" | |
942 | (in this case the input text is not modified in any way). | |
943 | ||
944 | In addition, on failure in *any* context, the $@ variable is set. | |
945 | Accessing "$@->{error}" returns one of the error diagnostics listed | |
946 | below. Accessing "$@->{pos}" returns the offset into the original string | |
947 | at which the error was detected (although not necessarily where it | |
948 | occurred!). Printing $@ directly produces the error message, with the | |
949 | offset appended. On success, the $@ variable is guaranteed to be | |
950 | "undef". | |
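
A short sketch of inspecting a failure, using a deliberately
unterminated input (the sample text is made up). The exact wording of
the message is not assumed here; it is simply printed.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

my $text = '(never closed';

my ($extracted, $remainder) = extract_bracketed($text, '()');
my $error = $@;    # copy straight away, before anything else resets $@

if (!defined $extracted) {
    print "failed: $error->{error}\n";
    print "offset: $error->{pos}\n";
    print "text left untouched: $remainder\n";
}
```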
951 | ||
952 | The available diagnostics are: | |
953 | ||
954 | "Did not find a suitable bracket: "%s"" | |
955 | The delimiter provided to "extract_bracketed" was not one of | |
956 | '()[]<>{}'. | |
957 | ||
958 | "Did not find prefix: /%s/" | |
959 | A non-optional prefix was specified but wasn't found at the start of | |
960 | the text. | |
961 | ||
962 | "Did not find opening bracket after prefix: "%s"" | |
963 | "extract_bracketed" or "extract_codeblock" was expecting a | |
964 | particular kind of bracket at the start of the text, and didn't find | |
965 | it. | |
966 | ||
967 | "No quotelike operator found after prefix: "%s"" | |
968 | "extract_quotelike" didn't find one of the quotelike operators "q", | |
969 | "qq", "qw", "qx", "s", "tr" or "y" at the start of the substring it | |
970 | was extracting. | |
971 | ||
972 | "Unmatched closing bracket: "%c"" | |
973 | "extract_bracketed", "extract_quotelike" or "extract_codeblock" | |
974 | encountered a closing bracket where none was expected. | |
975 | ||
976 | "Unmatched opening bracket(s): "%s"" | |
977 | "extract_bracketed", "extract_quotelike" or "extract_codeblock" ran | |
978 | out of characters in the text before closing one or more levels of | |
979 | nested brackets. | |
980 | ||
981 | "Unmatched embedded quote (%s)" | |
982 | "extract_bracketed" attempted to match an embedded quoted substring, | |
983 | but failed to find a closing quote to match it. | |
984 | ||
985 | "Did not find closing delimiter to match '%s'" | |
986 | "extract_quotelike" was unable to find a closing delimiter to match | |
987 | the one that opened the quote-like operation. | |
988 | ||
989 | "Mismatched closing bracket: expected "%c" but found "%s"" | |
990 | "extract_bracketed", "extract_quotelike" or "extract_codeblock" | |
991 | found a valid bracket delimiter, but it was the wrong species. This | |
992 | usually indicates a nesting error, but may indicate incorrect | |
993 | quoting or escaping. | |
994 | ||
995 | "No block delimiter found after quotelike "%s"" | |
996 | "extract_quotelike" or "extract_codeblock" found one of the | |
997 | quotelike operators "q", "qq", "qw", "qx", "s", "tr" or "y" without | |
998 | a suitable block after it. | |
55a1c97c | 999 | |
85831461 S |
1000 | "Did not find leading dereferencer" |
1001 | "extract_variable" was expecting one of '$', '@', or '%' at the | |
1002 | start of a variable, but didn't find any of them. | |
55a1c97c | 1003 | |
85831461 S |
1004 | "Bad identifier after dereferencer" |
1005 | "extract_variable" found a '$', '@', or '%' indicating a variable, | |
1006 | but that character was not followed by a legal Perl identifier. | |
55a1c97c | 1007 | |
85831461 S |
1008 | "Did not find expected opening bracket at %s" |
1009 | "extract_codeblock" failed to find any of the outermost opening | |
1010 | brackets that were specified. | |
55a1c97c | 1011 | |
85831461 S |
1012 | "Improperly nested codeblock at %s" |
1013 | A nested code block was found that started with a delimiter that was | |
1014 | specified as being only to be used as an outermost bracket. | |
55a1c97c | 1015 | |
85831461 S |
1016 | "Missing second block for quotelike "%s"" |
1017 | "extract_codeblock" or "extract_quotelike" found one of the | |
1018 | quotelike operators "s", "tr" or "y" followed by only one block. | |
55a1c97c | 1019 | |
85831461 S |
1020 | "No match found for opening bracket" |
1021 | "extract_codeblock" failed to find a closing bracket to match the | |
1022 | outermost opening bracket. | |
55a1c97c | 1023 | |
85831461 S |
1024 | "Did not find opening tag: /%s/" |
1025 | "extract_tagged" did not find a suitable opening tag (after any | |
1026 | specified prefix was removed). | |
55a1c97c | 1027 | |
85831461 S |
1028 | "Unable to construct closing tag to match: /%s/" |
1029 | "extract_tagged" matched the specified opening tag and tried to | |
1030 | modify the matched text to produce a matching closing tag (because | |
1031 | none was specified). It failed to generate the closing tag, almost | |
1032 | certainly because the opening tag did not start with a bracket of | |
1033 | some kind. | |
55a1c97c | 1034 | |
85831461 S |
1035 | "Found invalid nested tag: %s" |
1036 | "extract_tagged" found a nested tag that appeared in the "reject" | |
1037 | list (and the failure mode was not "MAX" or "PARA"). | |
55a1c97c | 1038 | |
85831461 S |
1039 | "Found unbalanced nested tag: %s" |
1040 | "extract_tagged" found a nested opening tag that was not matched by | |
1041 | a corresponding nested closing tag (and the failure mode was not | |
1042 | "MAX" or "PARA"). | |
55a1c97c | 1043 | |
85831461 S |
1044 | "Did not find closing tag" |
1045 | "extract_tagged" reached the end of the text without finding a | |
1046 | closing tag to match the original opening tag (and the failure mode | |
1047 | was not "MAX" or "PARA"). | |
55a1c97c | 1048 | |
85831461 S |
1049 | AUTHOR |
1050 | Damian Conway (damian@conway.org) | |
55a1c97c | 1051 | |
85831461 S |
1052 | BUGS AND IRRITATIONS |
1053 | There are undoubtedly serious bugs lurking somewhere in this code, if | |
1054 | only because parts of it give the impression of understanding a great | |
1055 | deal more about Perl than they really do. | |
55a1c97c | 1056 | |
85831461 | 1057 | Bug reports and other feedback are most welcome. |
55a1c97c | 1058 | |
85831461 S |
1059 | COPYRIGHT |
1060 | Copyright 1997 - 2001 Damian Conway. All Rights Reserved. | |
55a1c97c | 1061 | |
85831461 | 1062 | Some (minor) parts copyright 2009 Adam Kennedy. |
55a1c97c | 1063 | |
85831461 S |
1064 | This module is free software. It may be used, redistributed and/or |
1065 | modified under the same terms as Perl itself. | |
55a1c97c | 1066 |