Commit | Line | Data |
---|---|---|
49781f4a | 1 | =encoding utf8 |
8a93676d SB |
2 | |
3 | =head1 NAME | |
4 | ||
5 | perlpodspec - Plain Old Documentation: format specification and notes | |
6 | ||
7 | =head1 DESCRIPTION | |
8 | ||
9 | This document contains detailed notes on the Pod markup language. Most | |
10 | people will only have to read L<perlpod|perlpod> to know how to write | |
11 | in Pod, but this document may answer some incidental questions to do | |
12 | with parsing and rendering Pod. | |
13 | ||
14 | In this document, "must" / "must not", "should" / | |
15 | "should not", and "may" have their conventional (cf. RFC 2119) | |
16 | meanings: "X must do Y" means that if X doesn't do Y, it's against | |
17 | this specification, and should really be fixed. "X should do Y" | |
18 | means that it's recommended, but X may fail to do Y, if there's a | |
19 | good reason. "X may do Y" is merely a note that X can do Y at | |
20 | will (although it is up to the reader to detect any connotation of | |
21 | "and I think it would be I<nice> if X did Y" versus "it wouldn't | |
22 | really I<bother> me if X did Y"). | |
23 | ||
24 | Notably, when I say "the parser should do Y", the | |
25 | parser may fail to do Y, if the calling application explicitly | |
26 | requests that the parser I<not> do Y. I often phrase this as | |
27 | "the parser should, by default, do Y." This doesn't I<require> | |
28 | the parser to provide an option for turning off whatever | |
29 | feature Y is (like expanding tabs in verbatim paragraphs), although | |
30 | it implies that such an option I<may> be provided. | |
31 | ||
32 | =head1 Pod Definitions | |
33 | ||
ac036724 | 34 | Pod is embedded in files, typically Perl source files, although you |
8a93676d SB |
35 | can write a file that's nothing but Pod. |
36 | ||
37 | A B<line> in a file consists of zero or more non-newline characters, | |
38 | terminated by either a newline or the end of the file. | |
39 | ||
40 | A B<newline sequence> is usually a platform-dependent concept, but | |
41 | Pod parsers should understand it to mean any of CR (ASCII 13), LF | |
42 | (ASCII 10), or a CRLF (ASCII 13 followed immediately by ASCII 10), in | |
43 | addition to any other system-specific meaning. The first CR/CRLF/LF | |
44 | sequence in the file may be used as the basis for identifying the | |
45 | newline sequence for parsing the rest of the file. | |
46 | ||
47 | A B<blank line> is a line consisting entirely of zero or more spaces | |
48 | (ASCII 32) or tabs (ASCII 9), and terminated by a newline or end-of-file. | |
49 | A B<non-blank line> is a line containing one or more characters other | |
50 | than space or tab (and terminated by a newline or end-of-file). | |
51 | ||
52 | (I<Note:> Many older Pod parsers did not accept a line consisting of | |
ac036724 | 53 | spaces/tabs and then a newline as a blank line. The only lines they |
8a93676d SB |
54 | considered blank were lines consisting of I<no characters at all>, |
55 | terminated by a newline.) | |
56 | ||
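As a rough illustration of the line and blank-line definitions above, a parser might split its input and test for blank lines as follows. This is only a sketch: the names C<split_pod_lines> and C<is_blank_line> are invented for this example, and the code accepts any mix of CR, LF, and CRLF rather than settling on the first newline sequence seen in the file.

```perl
use strict;
use warnings;

# Split raw file content into lines, treating CRLF, CR, and LF alike.
# (split_pod_lines and is_blank_line are illustrative names only.)
sub split_pod_lines {
    my ($text) = @_;
    # CRLF is listed first so that it counts as one terminator, not two.
    my @lines = split /\x0D\x0A|\x0D|\x0A/, $text, -1;
    # A final terminator ends the last line; it does not start a new one.
    pop @lines if @lines && $lines[-1] eq '';
    return @lines;
}

# True for a line made of zero or more spaces or tabs and nothing else.
sub is_blank_line {
    my ($line) = @_;
    return $line =~ m/\A[ \t]*\z/ ? 1 : 0;
}
```

A caller would typically read the whole file in raw mode, pass its content to C<split_pod_lines>, and then use C<is_blank_line> to find paragraph boundaries.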
57 | B<Whitespace> is used in this document as a blanket term for spaces, | |
58 | tabs, and newline sequences. (By itself, this term usually refers | |
59 | to literal whitespace. That is, sequences of whitespace characters | |
60 | in Pod source, as opposed to "EE<lt>32>", which is a formatting | |
61 | code that I<denotes> a whitespace character.) | |
62 | ||
63 | A B<Pod parser> is a module meant for parsing Pod (regardless of | |
64 | whether this involves calling callbacks or building a parse tree or | |
65 | directly formatting it). A B<Pod formatter> (or B<Pod translator>) | |
66 | is a module or program that converts Pod to some other format (HTML, | |
67 | plaintext, TeX, PostScript, RTF). A B<Pod processor> might be a | |
68 | formatter or translator, or might be a program that does something | |
353c6505 | 69 | else with the Pod (like counting words, scanning for index points, |
8a93676d SB |
70 | etc.). |
71 | ||
72 | Pod content is contained in B<Pod blocks>. A Pod block starts with a | |
1bca558f | 73 | line that matches C<m/\A=[a-zA-Z]/>, and continues up to the next line |
ac036724 | 74 | that matches C<m/\A=cut/> or up to the end of the file if there is |
8a93676d SB |
75 | no C<m/\A=cut/> line. |
76 | ||
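The Pod block rule above can be sketched as a small scanner over already-split lines. The function name C<pod_blocks> and the choice to return each block as an array reference are assumptions made for this illustration; a real parser would more likely process the lines as it goes.

```perl
use strict;
use warnings;

# Collect the Pod blocks found in a list of lines (newlines already removed).
# Each block is returned as an array ref of its lines, without the "=cut" line.
sub pod_blocks {
    my @lines = @_;
    my (@blocks, $current);
    for my $line (@lines) {
        if ($current) {
            if ($line =~ m/\A=cut/) {          # end of the current block
                push @blocks, $current;
                undef $current;
            }
            else {
                push @$current, $line;
            }
        }
        elsif ($line =~ m/\A=[a-zA-Z]/) {
            if ($line =~ m/\A=cut/) {          # a block may not start with =cut
                warn "=cut found outside a Pod block; halting\n";
                last;
            }
            $current = [$line];                # a new Pod block starts here
        }
        # any other line is outside Pod (for example, Perl code) and is skipped
    }
    push @blocks, $current if $current;        # block ran to end of file
    return @blocks;
}
```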
77 | =for comment | |
78 | The current perlsyn says: | |
79 | [beginquote] | |
80 | Note that pod translators should look at only paragraphs beginning | |
81 | with a pod directive (it makes parsing easier), whereas the compiler | |
82 | actually knows to look for pod escapes even in the middle of a | |
83 | paragraph. This means that the following secret stuff will be ignored | |
84 | by both the compiler and the translators. | |
85 | $a=3; | |
86 | =secret stuff | |
87 | warn "Neither POD nor CODE!?" | |
88 | =cut back | |
89 | print "got $a\n"; | |
90 | You probably shouldn't rely upon the warn() being podded out forever. | |
91 | Not all pod translators are well-behaved in this regard, and perhaps | |
92 | the compiler will become pickier. | |
93 | [endquote] | |
94 | I think that those paragraphs should just be removed; paragraph-based | |
95 | parsing seems to have been largely abandoned, because of the hassle | |
96 | with non-empty blank lines messing up what people meant by "paragraph". | |
97 | Even if the "it makes parsing easier" bit were especially true, | |
98 | it wouldn't be worth the confusion of having perl and pod2whatever | |
99 | actually disagree on what can constitute a Pod block. | |
100 | ||
e1a97e07 KW |
101 | Note that a parser is not expected to distinguish real Pod from something that |
102 | looks like pod, but is in a quoted string, such as a here document. | |
103 | ||
8a93676d SB |
104 | Within a Pod block, there are B<Pod paragraphs>. A Pod paragraph |
105 | consists of non-blank lines of text, separated by one or more blank | |
106 | lines. | |
107 | ||
108 | For purposes of Pod processing, there are four types of paragraphs in | |
109 | a Pod block: | |
110 | ||
111 | =over | |
112 | ||
113 | =item * | |
114 | ||
115 | A command paragraph (also called a "directive"). The first line of | |
116 | this paragraph must match C<m/\A=[a-zA-Z]/>. Command paragraphs are | |
117 | typically one line, as in: | |
118 | ||
119 | =head1 NOTES | |
120 | ||
121 | =item * | |
122 | ||
123 | But they may span several (non-blank) lines: | |
124 | ||
125 | =for comment | |
126 | Hm, I wonder what it would look like if | |
127 | you tried to write a BNF for Pod from this. | |
210b36aa | 128 | |
8a93676d SB |
129 | =head3 Dr. Strangelove, or: How I Learned to |
130 | Stop Worrying and Love the Bomb | |
131 | ||
132 | I<Some> command paragraphs allow formatting codes in their content | |
133 | (i.e., after the part that matches C<m/\A=[a-zA-Z]\S*\s*/>), as in: | |
134 | ||
135 | =head1 Did You Remember to C<use strict;>? | |
136 | ||
137 | In other words, the Pod processing handler for "head1" will apply the | |
138 | same processing to "Did You Remember to CE<lt>use strict;>?" that it | |
ac036724 | 139 | would to an ordinary paragraph (i.e., formatting codes like |
8a93676d SB |
140 | "CE<lt>...>" are parsed and presumably formatted appropriately, and |
141 | whitespace in the form of literal spaces and/or tabs is not | |
142 | significant). | |
143 | ||
144 | =item * | |
145 | ||
146 | A B<verbatim paragraph>. The first line of this paragraph must begin with a | |
147 | literal space or tab, and this paragraph must not be inside a "=begin | |
148 | I<identifier>", ... "=end I<identifier>" sequence unless | |
149 | "I<identifier>" begins with a colon (":"). That is, if a paragraph | |
150 | starts with a literal space or tab, but I<is> inside a | |
151 | "=begin I<identifier>", ... "=end I<identifier>" region, then it's | |
152 | a data paragraph, unless "I<identifier>" begins with a colon. | |
153 | ||
154 | Whitespace I<is> significant in verbatim paragraphs (although, in | |
155 | processing, tabs are probably expanded). | |
156 | ||
157 | =item * | |
158 | ||
159 | An B<ordinary paragraph>. A paragraph is an ordinary paragraph | |
160 | if its first line matches neither C<m/\A=[a-zA-Z]/> nor | |
161 | C<m/\A[ \t]/>, I<and> if it's not inside a "=begin I<identifier>", | |
162 | ... "=end I<identifier>" sequence unless "I<identifier>" begins with | |
163 | a colon (":"). | |
164 | ||
165 | =item * | |
166 | ||
167 | A B<data paragraph>. This is a paragraph that I<is> inside a "=begin | |
168 | I<identifier>" ... "=end I<identifier>" sequence where | |
169 | "I<identifier>" does I<not> begin with a literal colon (":"). In | |
170 | some sense, a data paragraph is not part of Pod at all (i.e., | |
171 | effectively it's "out-of-band"), since it's not subject to most kinds | |
172 | of Pod parsing; but it is specified here, since Pod | |
173 | parsers need to be able to call an event for it, or store it in some | |
174 | form in a parse tree, or at least just parse I<around> it. | |
175 | ||
176 | =back | |
177 | ||
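The four paragraph types listed above can be told apart with a few tests on the paragraph's first line and on the enclosing region. The following is a minimal sketch; C<paragraph_type> is a hypothetical name, and it assumes the caller tracks the identifier of the innermost open "=begin" region (or passes C<undef> when there is none).

```perl
use strict;
use warnings;

# Classify a paragraph from its first line and the identifier of the
# innermost enclosing "=begin" region (undef when not inside one).
sub paragraph_type {
    my ($first_line, $region) = @_;
    return 'command'  if $first_line =~ m/\A=[a-zA-Z]/;
    # Inside "=begin identifier" where identifier lacks a leading colon,
    # every non-command paragraph is a data paragraph.
    return 'data'     if defined $region && $region !~ m/\A:/;
    return 'verbatim' if $first_line =~ m/\A[ \t]/;
    return 'ordinary';
}
```

With the example paragraphs used in this section, "=head1 Foo" would classify as a command paragraph, "Stuff" as ordinary, and the indented "$foo->bar" as verbatim.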
178 | For example: consider the following paragraphs: | |
179 | ||
180 | # <- that's the 0th column | |
181 | ||
182 | =head1 Foo | |
210b36aa | 183 | |
8a93676d | 184 | Stuff |
210b36aa | 185 | |
8a93676d | 186 | $foo->bar |
210b36aa | 187 | |
8a93676d SB |
188 | =cut |
189 | ||
190 | Here, "=head1 Foo" and "=cut" are command paragraphs because the first | |
191 | line of each matches C<m/\A=[a-zA-Z]/>. "I<[space][space]>$foo->bar" | |
192 | is a verbatim paragraph, because its first line starts with a literal | |
193 | whitespace character (and there's no "=begin"..."=end" region around). | |
194 | ||
195 | The "=begin I<identifier>" ... "=end I<identifier>" commands stop | |
6fbdb1cc | 196 | paragraphs that they surround from being parsed as ordinary or verbatim |
8a93676d SB |
197 | paragraphs, if I<identifier> doesn't begin with a colon. This |
198 | is discussed in detail in the section | |
199 | L</About Data Paragraphs and "=beginE<sol>=end" Regions>. | |
200 | ||
201 | =head1 Pod Commands | |
202 | ||
203 | This section is intended to supplement and clarify the discussion in | |
204 | L<perlpod/"Command Paragraph">. These are the currently recognized | |
205 | Pod commands: | |
206 | ||
207 | =over | |
208 | ||
ee511750 | 209 | =item "=head1", "=head2", "=head3", "=head4", "=head5", "=head6" |
8a93676d SB |
210 | |
211 | This command indicates that the text in the remainder of the paragraph | |
212 | is a heading. That text may contain formatting codes. Examples: | |
213 | ||
214 | =head1 Object Attributes | |
210b36aa | 215 | |
8a93676d SB |
216 | =head3 What B<Not> to Do! |
217 | ||
ee511750 S |
218 | Both C<=head5> and C<=head6> were added in 2020 and might not be |
219 | supported by all Pod parsers. L<Pod::Simple> 3.41, released in October | |
220 | 2020, supports both, which provides support for them in all | |
221 | L<Pod::Simple>-based Pod parsers. | |
222 | ||
8a93676d SB |
223 | =item "=pod" |
224 | ||
225 | This command indicates that this paragraph begins a Pod block. (If we | |
226 | are already in the middle of a Pod block, this command has no effect at | |
227 | all.) If there is any text in this command paragraph after "=pod", | |
228 | it must be ignored. Examples: | |
229 | ||
230 | =pod | |
210b36aa | 231 | |
8a93676d | 232 | This is a plain Pod paragraph. |
210b36aa | 233 | |
8a93676d SB |
234 | =pod This text is ignored. |
235 | ||
236 | =item "=cut" | |
237 | ||
238 | This command indicates that this line is the end of the previously | |
239 | started Pod block. If there is any text after "=cut" on the line, it must be | |
240 | ignored. Examples: | |
241 | ||
242 | =cut | |
243 | ||
244 | =cut The documentation ends here. | |
245 | ||
246 | =cut | |
247 | # This is the first line of program text. | |
248 | sub foo { # This is the second. | |
249 | ||
659cfd94 | 250 | It is an error to try to I<start> a Pod block with a "=cut" command. In |
8a93676d SB |
251 | that case, the Pod processor must halt parsing of the input file, and |
252 | must by default emit a warning. | |
253 | ||
254 | =item "=over" | |
255 | ||
256 | This command indicates that this is the start of a list/indent | |
257 | region. If there is any text following the "=over", it must consist | |
258 | of only a nonzero positive numeral. The semantics of this numeral are | |
259 | explained in the L</"About =over...=back Regions"> section, further | |
260 | below. Formatting codes are not expanded. Examples: | |
261 | ||
262 | =over 3 | |
210b36aa | 263 | |
8a93676d | 264 | =over 3.5 |
210b36aa | 265 | |
8a93676d SB |
266 | =over |
267 | ||
268 | =item "=item" | |
269 | ||
270 | This command indicates that an item in a list begins here. Formatting | |
271 | codes are processed. The semantics of the (optional) text in the | |
272 | remainder of this paragraph are | |
273 | explained in the L</"About =over...=back Regions"> section, further | |
274 | below. Examples: | |
275 | ||
276 | =item | |
210b36aa | 277 | |
8a93676d | 278 | =item * |
210b36aa | 279 | |
8a93676d | 280 | =item * |
210b36aa | 281 | |
8a93676d | 282 | =item 14 |
210b36aa | 283 | |
8a93676d | 284 | =item 3. |
210b36aa | 285 | |
8a93676d | 286 | =item C<< $thing->stuff(I<dodad>) >> |
210b36aa | 287 | |
8a93676d SB |
288 | =item For transporting us beyond seas to be tried for pretended |
289 | offenses | |
210b36aa | 290 | |
8a93676d SB |
291 | =item He is at this time transporting large armies of foreign |
292 | mercenaries to complete the works of death, desolation and | |
293 | tyranny, already begun with circumstances of cruelty and perfidy | |
294 | scarcely paralleled in the most barbarous ages, and totally | |
295 | unworthy the head of a civilized nation. | |
296 | ||
297 | =item "=back" | |
298 | ||
299 | This command indicates that this is the end of the region begun | |
300 | by the most recent "=over" command. It permits no text after the | |
301 | "=back" command. | |
302 | ||
303 | =item "=begin formatname" | |
304 | ||
93592fd5 RS |
305 | =item "=begin formatname parameter" |
306 | ||
8a93676d SB |
307 | This marks the following paragraphs (until the matching "=end |
308 | formatname") as being for some special kind of processing. Unless | |
309 | "formatname" begins with a colon, the contained non-command | |
310 | paragraphs are data paragraphs. But if "formatname" I<does> begin | |
311 | with a colon, then non-command paragraphs are ordinary paragraphs | |
312 | or verbatim paragraphs. This is discussed in detail in the section | |
313 | L</About Data Paragraphs and "=beginE<sol>=end" Regions>. | |
314 | ||
315 | It is advised that formatnames match the regexp | |
c85e9b4c | 316 | C<m/\A:?[-a-zA-Z0-9_]+\z/>. Everything following whitespace after the |
93592fd5 RS |
317 | formatname is a parameter that may be used by the formatter when dealing |
318 | with this region. This parameter must not be repeated in the "=end" | |
319 | paragraph. Implementors should anticipate future expansion in the | |
320 | semantics and syntax of the first parameter to "=begin"/"=end"/"=for". | |
8a93676d SB |
321 | |
322 | =item "=end formatname" | |
323 | ||
324 | This marks the end of the region opened by the matching | |
325 | "=begin formatname" command. If "formatname" is not the formatname | |
326 | of the most recent open "=begin formatname" region, then this | |
327 | is an error, and must generate an error message. This | |
328 | is discussed in detail in the section | |
329 | L</About Data Paragraphs and "=beginE<sol>=end" Regions>. | |
330 | ||
331 | =item "=for formatname text..." | |
332 | ||
333 | This is synonymous with: | |
334 | ||
335 | =begin formatname | |
210b36aa | 336 | |
8a93676d | 337 | text... |
210b36aa | 338 | |
8a93676d SB |
339 | =end formatname |
340 | ||
341 | That is, it creates a region consisting of a single paragraph; that | |
342 | paragraph is to be treated as an ordinary paragraph if "formatname" | |
343 | begins with a ":"; if "formatname" I<doesn't> begin with a colon, | |
344 | then "text..." will constitute a data paragraph. There is no way | |
345 | to use "=for formatname text..." to express "text..." as a verbatim | |
346 | paragraph. | |
347 | ||
a179871b SB |
348 | =item "=encoding encodingname" |
349 | ||
350 | This command, which should occur early in the document (at least | |
1e54db1a | 351 | before any non-US-ASCII data!), declares that this document is |
a179871b | 352 | encoded in the encoding I<encodingname>, which must be |
6fbdb1cc | 353 | an encoding name that L<Encode> recognizes. (Encode's list |
8a3f7e95 | 354 | of supported encodings, in L<Encode::Supported>, is useful here.) |
a179871b SB |
355 | If the Pod parser cannot decode the declared encoding, it |
356 | should emit a warning and may abort parsing the document | |
357 | altogether. | |
358 | ||
359 | A document having more than one "=encoding" line should be | |
360 | considered an error. Pod processors may silently tolerate this if | |
361 | the subsequent "=encoding" lines are just duplicates of the |
6fbdb1cc RS |
362 | first one (e.g., if there's a "=encoding utf8" line, and later on |
363 | another "=encoding utf8" line). But Pod processors should complain if | |
a179871b SB |
364 | there are contradictory "=encoding" lines in the same document |
365 | (e.g., if there is a "=encoding utf8" early in the document and | |
366 | "=encoding big5" later). Pod processors that recognize BOMs | |
367 | may also complain if they see an "=encoding" line | |
1e54db1a JH |
368 | that contradicts the BOM (e.g., if a document with a UTF-16LE |
369 | BOM has an "=encoding shiftjis" line). | |
a179871b | 370 | |
8a93676d SB |
371 | =back |
372 | ||
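The rules for repeated "=encoding" lines given above can be sketched with the core L<Encode> module. The function name C<check_encoding_lines> is invented for this example. Note one simplification: names are compared by Encode's canonical name, so "utf8" and "UTF-8" (which Encode treats as distinct encodings) would be reported as contradictory.

```perl
use strict;
use warnings;
use Encode ();

# Given the encoding names from every "=encoding" line of a document, in
# order, return a list of problem descriptions (empty if all is well).
sub check_encoding_lines {
    my @declared = @_;
    my (@problems, $first);
    for my $name (@declared) {
        my $enc = Encode::find_encoding($name);
        if (!$enc) {
            push @problems, "unknown encoding '$name'";
            next;
        }
        my $canonical = $enc->name;
        if (!defined $first) {
            $first = $canonical;               # first declaration wins
        }
        elsif ($canonical ne $first) {
            push @problems, "'$name' contradicts earlier '$first'";
        }
        # an exact duplicate of the first declaration is silently tolerated
    }
    return @problems;
}
```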
373 | If a Pod processor sees any command other than the ones listed | |
374 | above (like "=head", or "=haed1", or "=stuff", or "=cuttlefish", | |
375 | or "=w123"), that processor must by default treat this as an | |
376 | error. It must not process the paragraph beginning with that | |
377 | command, must by default warn of this as an error, and may | |
378 | abort the parse. A Pod parser may allow a way for particular | |
379 | applications to add to the above list of known commands, and to | |
380 | stipulate, for each additional command, whether formatting | |
381 | codes should be processed. | |
382 | ||
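A check for unrecognized commands might look like the sketch below. The names C<command_name> and C<is_known_command> are hypothetical; the table simply restates the list of commands given in this section, and an application-extensible parser would presumably let callers add to it.

```perl
use strict;
use warnings;

# The commands recognized by this specification.
my %KNOWN_COMMAND = map { $_ => 1 } qw(
    head1 head2 head3 head4 head5 head6
    pod cut over item back begin end for encoding
);

# Extract the command name from the first line of a command paragraph.
sub command_name {
    my ($para) = @_;
    return $para =~ m/\A=([a-zA-Z]\S*)/ ? $1 : undef;
}

# True only for the commands listed above; "=head", "=haed1", and
# "=cuttlefish" are all rejected because the whole name must match.
sub is_known_command {
    my ($para) = @_;
    my $name = command_name($para);
    return defined $name && $KNOWN_COMMAND{$name} ? 1 : 0;
}
```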
383 | Future versions of this specification may add additional | |
384 | commands. | |
385 | ||
386 | ||
387 | ||
388 | =head1 Pod Formatting Codes | |
389 | ||
390 | (Note that in previous drafts of this document and of perlpod, | |
391 | formatting codes were referred to as "interior sequences", and | |
392 | this term may still be found in the documentation for Pod parsers, | |
393 | and in error messages from Pod processors.) | |
394 | ||
395 | There are two syntaxes for formatting codes: | |
396 | ||
397 | =over | |
398 | ||
399 | =item * | |
400 | ||
401 | A formatting code starts with a capital letter (just US-ASCII [A-Z]) | |
402 | followed by a "<", any number of characters, and ending with the first | |
403 | matching ">". Examples: | |
404 | ||
405 | That's what I<you> think! | |
406 | ||
d8ff3e95 | 407 | What's C<CORE::dump()> for? |
8a93676d SB |
408 | |
409 | X<C<chmod> and C<unlink()> Under Different Operating Systems> | |
410 | ||
411 | =item * | |
412 | ||
413 | A formatting code starts with a capital letter (just US-ASCII [A-Z]) | |
414 | followed by two or more "<"'s, one or more whitespace characters, | |
415 | any number of characters, one or more whitespace characters, | |
416 | and ending with the first matching sequence of two or more ">"'s, where | |
417 | the number of ">"'s equals the number of "<"'s in the opening of this | |
418 | formatting code. Examples: | |
419 | ||
420 | That's what I<< you >> think! | |
421 | ||
422 | C<<< open(X, ">>thing.dat") || die $! >>> | |
423 | ||
424 | B<< $foo->bar(); >> | |
425 | ||
426 | With this syntax, the whitespace character(s) after the "CE<lt><<" | |
1bca558f | 427 | (or whatever letter) and before the ">>>" are I<not> renderable. They |
8a93676d SB |
428 | do not signify whitespace, but are merely part of the formatting codes |
429 | themselves. That is, these are all synonymous: | |
430 | ||
431 | C<thing> | |
432 | C<< thing >> | |
433 | C<< thing >> | |
434 | C<<< thing >>> | |
435 | C<<<< | |
436 | thing | |
437 | >>>> | |
438 | ||
439 | and so on. | |
440 | ||
a3d78747 RS |
441 | Finally, the multiple-angle-bracket form does I<not> alter the interpretation |
442 | of nested formatting codes, meaning that the following four example lines are | |
443 | identical in meaning: | |
444 | ||
445 | B<example: C<$a E<lt>=E<gt> $b>> | |
446 | ||
447 | B<example: C<< $a <=> $b >>> | |
448 | ||
449 | B<example: C<< $a E<lt>=E<gt> $b >>> | |
450 | ||
451 | B<<< example: C<< $a E<lt>=E<gt> $b >> >>> | |
452 | ||
8a93676d SB |
453 | =back |
454 | ||
455 | In parsing Pod, a notably tricky part is the correct parsing of | |
456 | (potentially nested!) formatting codes. Implementors should | |
457 | consult the code in the C<parse_text> routine in Pod::Parser as an | |
458 | example of a correct implementation. | |
459 | ||
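To make the two syntaxes concrete, here is a small stack-based sketch of a formatting-code parser. It is not the Pod::Parser algorithm and has not been checked against it on edge cases; the name C<parse_codes> and the tree layout are assumptions of this example. It builds a tree only (it does not resolve EE<lt>...> escapes or split LE<lt>...> content), and it closes any code still open at the end of the paragraph, reporting the letters of those codes to the caller.

```perl
use strict;
use warnings;

# Parse the text of one paragraph into a tree of formatting codes.
# Each node is a plain string or an array ref [ $letter, @children ].
# Returns the top-level nodes and the letters of any codes left unclosed.
sub parse_codes {
    my ($text) = @_;
    my @stack = ( { kids => [] } );    # bottom frame holds top-level nodes

    my $emit = sub {                   # add literal text to the innermost frame
        my ($str) = @_;
        return if $str eq '';
        my $kids = $stack[-1]{kids};
        if (@$kids && !ref $kids->[-1]) { $kids->[-1] .= $str }
        else                            { push @$kids, $str }
    };
    my $close = sub {                  # finish the innermost open code
        my $frame = pop @stack;
        push @{ $stack[-1]{kids} }, [ $frame->{letter}, @{ $frame->{kids} } ];
    };

    while ($text =~ m{\G(?:
              ([A-Z])(<<+)\s+            # multiple-bracket opener plus whitespace
            | ([A-Z])<                   # single-bracket opener
            | (\s+)?(>+)                 # run of closers, maybe after whitespace
            | (\s+|[^A-Z\s>]+|[A-Z])     # anything else is literal text
        )}xgs) {
        if (defined $1) {
            push @stack, { letter => $1, need => length($2), kids => [] };
        }
        elsif (defined $3) {
            push @stack, { letter => $3, need => 1, kids => [] };
        }
        elsif (defined $5) {
            my ($ws, $run) = (defined $4 ? $4 : '', $5);
            while (length $run) {
                my $need = @stack > 1 ? $stack[-1]{need} : 0;
                if ($need == 1) {                  # single-bracket code closes
                    $emit->($ws);
                    $ws = '';
                    $close->();
                    substr($run, 0, 1, '');
                }
                elsif ($need > 1 && length $ws && length($run) >= $need) {
                    $ws = '';                      # whitespace is part of the closer
                    $close->();
                    substr($run, 0, $need, '');
                }
                else { last }                      # the rest is literal text
            }
            $emit->($ws . $run);
        }
        else {
            $emit->($6);
        }
    }

    my @unclosed;
    while (@stack > 1) {               # codes may not span paragraphs
        push @unclosed, $stack[-1]{letter};
        $close->();
    }
    return ($stack[0]{kids}, \@unclosed);
}
```

A caller would pass one paragraph at a time and would normally warn about each letter returned in the second value.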
460 | =over | |
461 | ||
462 | =item C<IE<lt>textE<gt>> -- italic text | |
463 | ||
464 | See the brief discussion in L<perlpod/"Formatting Codes">. | |
465 | ||
466 | =item C<BE<lt>textE<gt>> -- bold text | |
467 | ||
468 | See the brief discussion in L<perlpod/"Formatting Codes">. | |
469 | ||
470 | =item C<CE<lt>codeE<gt>> -- code text | |
471 | ||
472 | See the brief discussion in L<perlpod/"Formatting Codes">. | |
473 | ||
474 | =item C<FE<lt>filenameE<gt>> -- style for filenames | |
475 | ||
476 | See the brief discussion in L<perlpod/"Formatting Codes">. | |
477 | ||
478 | =item C<XE<lt>topic nameE<gt>> -- an index entry | |
479 | ||
480 | See the brief discussion in L<perlpod/"Formatting Codes">. | |
481 | ||
482 | This code is unusual in that most formatters completely discard | |
483 | this code and its content. Other formatters will render it with | |
484 | invisible codes that can be used in building an index of | |
485 | the current document. | |
486 | ||
487 | =item C<ZE<lt>E<gt>> -- a null (zero-effect) formatting code | |
488 | ||
489 | Discussed briefly in L<perlpod/"Formatting Codes">. | |
490 | ||
c195f169 | 491 | This code is unusual in that it should have no content. That is, |
8a93676d SB |
492 | a processor may complain if it sees C<ZE<lt>potatoesE<gt>>. Whether |
493 | or not it complains, the I<potatoes> text should be ignored. | |
494 | ||
495 | =item C<LE<lt>nameE<gt>> -- a hyperlink | |
496 | ||
497 | The complicated syntaxes of this code are discussed at length in | |
498 | L<perlpod/"Formatting Codes">, and implementation details are | |
499 | discussed below, in L</"About LE<lt>...E<gt> Codes">. Parsing the | |
500 | contents of LE<lt>content> is tricky. Notably, the content has to be | |
501 | checked for whether it looks like a URL, or whether it has to be split | |
502 | on literal "|" and/or "/" (in the right order!), and so on, | |
503 | I<before> EE<lt>...> codes are resolved. | |
504 | ||
505 | =item C<EE<lt>escapeE<gt>> -- a character escape | |
506 | ||
507 | See L<perlpod/"Formatting Codes">, and several points in | |
508 | L</Notes on Implementing Pod Processors>. | |
509 | ||
510 | =item C<SE<lt>textE<gt>> -- text contains non-breaking spaces | |
511 | ||
512 | This formatting code is syntactically simple, but semantically | |
513 | complex. What it means is that each space in the printable | |
3e666715 | 514 | content of this code signifies a non-breaking space. |
8a93676d SB |
515 | |
516 | Consider: | |
517 | ||
518 | C<$x ? $y : $z> | |
519 | ||
520 | S<C<$x ? $y : $z>> | |
521 | ||
522 | Both signify the monospace (c[ode] style) text consisting of | |
523 | "$x", one space, "?", one space, "$y", one space, ":", one space, "$z". | |
524 | The difference is that in the latter, with the S code, those spaces | |
3e666715 | 525 | are not "normal" spaces, but instead are non-breaking spaces. |
8a93676d SB |
526 | |
527 | =back | |
528 | ||
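For a formatter whose output is Unicode text, the effect of the S code described above could be approximated as below. This is a sketch under the assumption that the output format represents a non-breaking space as U+00A0; formats such as HTML or roff would need their own markup instead. The name C<s_code_text> is made up for this example, and it expects the already-rendered printable content of the code.

```perl
use strict;
use warnings;

# Replace each ordinary space in the printable content of an S<...> code
# with a non-breaking space (U+00A0). Other characters are left untouched.
sub s_code_text {
    my ($text) = @_;
    (my $out = $text) =~ tr/ /\x{A0}/;
    return $out;
}
```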
529 | ||
530 | If a Pod processor sees any formatting code other than the ones | |
531 | listed above (as in "NE<lt>...>", or "QE<lt>...>", etc.), that | |
532 | processor must by default treat this as an error. | |
533 | A Pod parser may allow a way for particular | |
534 | applications to add to the above list of known formatting codes; | |
535 | a Pod parser might even allow a way to stipulate, for each additional | |
536 | formatting code, whether it requires some form of special processing, as | |
537 | LE<lt>...> does. | |
538 | ||
539 | Future versions of this specification may add additional | |
540 | formatting codes. | |
541 | ||
542 | Historical note: A few older Pod processors would not see a ">" as | |
543 | closing a "CE<lt>" code, if the ">" was immediately preceded by | |
544 | a "-". This was so that this: | |
545 | ||
546 | C<$foo->bar> | |
547 | ||
548 | would parse as equivalent to this: | |
549 | ||
75f15e9f | 550 | C<$foo-E<gt>bar> |
8a93676d SB |
551 | |
552 | instead of as equivalent to a "C" formatting code containing | |
553 | only "$foo-", and then a "bar>" outside the "C" formatting code. This | |
554 | problem has since been solved by the addition of syntaxes like this: | |
555 | ||
556 | C<< $foo->bar >> | |
557 | ||
558 | Compliant parsers must not treat "->" as special. | |
559 | ||
560 | Formatting codes absolutely cannot span paragraphs. If a code is | |
561 | opened in one paragraph, and no closing code is found by the end of | |
562 | that paragraph, the Pod parser must close that formatting code, | |
563 | and should complain (as in "Unterminated I code in the paragraph | |
564 | starting at line 123: 'Time objects are not...'"). So these | |
565 | two paragraphs: | |
566 | ||
567 | I<I told you not to do this! | |
210b36aa | 568 | |
8a93676d SB |
569 | Don't make me say it again!> |
570 | ||
571 | ...must I<not> be parsed as two paragraphs in italics (with the I | |
572 | code starting in one paragraph and ending in another.) Instead, | |
573 | the first paragraph should generate a warning, but that aside, the | |
574 | above code must parse as if it were: | |
575 | ||
576 | I<I told you not to do this!> | |
210b36aa | 577 | |
8a93676d SB |
578 | Don't make me say it again!E<gt> |
579 | ||
580 | (In SGMLish jargon, all Pod commands are like block-level | |
581 | elements, whereas all Pod formatting codes are like inline-level | |
582 | elements.) | |
583 | ||
584 | ||
585 | ||
586 | =head1 Notes on Implementing Pod Processors | |
587 | ||
588 | The following is a long section of miscellaneous requirements | |
589 | and suggestions to do with Pod processing. | |
590 | ||
591 | =over | |
592 | ||
593 | =item * | |
594 | ||
595 | Pod formatters should tolerate lines in verbatim blocks that are of | |
596 | any length, even if that means having to break them (possibly several | |
597 | times, for very long lines) to avoid text running off the side of the | |
598 | page. Pod formatters may warn of such line-breaking. Such warnings | |
599 | are particularly appropriate for lines that are over 100 characters long, which | |
600 | are usually not intentional. | |
601 | ||
602 | =item * | |
603 | ||
604 | Pod parsers must recognize I<all> of the three well-known newline | |
605 | formats: CR, LF, and CRLF. See L<perlport|perlport>. | |
606 | ||
607 | =item * | |
608 | ||
609 | Pod parsers should accept input lines that are of any length. | |
610 | ||
611 | =item * | |
612 | ||
613 | Since Perl recognizes a Unicode Byte Order Mark at the start of files | |
614 | as signaling that the file is Unicode encoded as in UTF-16 (whether | |
615 | big-endian or little-endian) or UTF-8, Pod parsers should do the | |
616 | same. Otherwise, the character encoding should be understood as | |
617 | being UTF-8 if the first highbit byte sequence in the file seems | |
8f226aee DW |
618 | valid as a UTF-8 sequence, or otherwise as CP-1252 (earlier versions of |
619 | this specification used Latin-1 instead of CP-1252). | |
8a93676d SB |
620 | |
621 | Future versions of this specification may specify | |
622 | how Pod can accept other encodings. Presumably treatment of other | |
623 | encodings in Pod parsing would be as in XML parsing: whatever the | |
624 | encoding declared by a particular Pod file, content is to be | |
625 | stored in memory as Unicode characters. | |
626 | ||
627 | =item * | |
628 | ||
629 | The well-known Unicode Byte Order Marks are as follows: if the | |
630 | file begins with the two literal byte values 0xFE 0xFF, this is | |
631 | the BOM for big-endian UTF-16. If the file begins with the two | |
632 | literal byte values 0xFF 0xFE, this is the BOM for little-endian | |
df0c7995 KW |
633 | UTF-16. On an ASCII platform, if the file begins with the three literal |
634 | byte values | |
8a93676d | 635 | 0xEF 0xBB 0xBF, this is the BOM for UTF-8. |
e8a0e562 | 636 | A mechanism portable to EBCDIC platforms is to encode the BOM character: |
df0c7995 KW |
637 | |
638 | my $utf8_bom = "\x{FEFF}"; | |
639 | utf8::encode($utf8_bom); | |
8a93676d SB |
640 | |
641 | =for comment | |
642 | use bytes; print map sprintf(" 0x%02X", ord $_), split '', "\x{feff}"; | |
643 | 0xEF 0xBB 0xBF | |
644 | ||
645 | =for comment | |
1e54db1a | 646 | If toke.c is modified to support UTF-32, add mention of those here. |
8a93676d SB |
647 | |
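The BOM table above translates directly into a check on the first bytes of the file. The sketch below assumes an ASCII platform and assumes the input was read as raw bytes; C<sniff_bom> is an invented name. It only knows the three BOMs listed here, so a UTF-32 little-endian file would be misreported as UTF-16LE.

```perl
use strict;
use warnings;

# Return the encoding implied by a leading Byte Order Mark, or nothing
# if the byte string does not start with one of the three known BOMs.
sub sniff_bom {
    my ($bytes) = @_;
    return 'UTF-16BE' if $bytes =~ m/\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ m/\A\xFF\xFE/;
    return 'UTF-8'    if $bytes =~ m/\A\xEF\xBB\xBF/;
    return;
}
```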
648 | =item * | |
649 | ||
df0c7995 KW |
650 | A naive, but often sufficient heuristic on ASCII platforms, for testing |
651 | the first highbit | |
8a93676d SB |
652 | byte-sequence in a BOM-less file (whether in code or in Pod!), to see |
653 | whether that sequence is valid as UTF-8 (RFC 2279) is to check whether | |
9a5b9407 | 654 | the first byte in the sequence is in the range 0xC2 - 0xFD |
8a93676d SB |
655 | I<and> whether the next byte is in the range |
656 | 0x80 - 0xBF. If so, the parser may conclude that this file is in | |
657 | UTF-8, and all highbit sequences in the file should be assumed to | |
658 | be UTF-8. Otherwise the parser should treat the file as being | |
df0c7995 KW |
659 | in CP-1252. (A better check, which also works on EBCDIC platforms, |
660 | is to pass a copy of the sequence to | |
9a5b9407 KW |
661 | L<utf8::decode()|utf8> which performs a full validity check on the |
662 | sequence and returns TRUE if it is valid UTF-8, FALSE otherwise. This | |
663 | function is always pre-loaded, is fast because it is written in C, and | |
664 | will only get called at most once, so you don't need to avoid it out of | |
665 | performance concerns.) | |
666 | In the unlikely circumstance that the first highbit | |
8a93676d SB |
667 | sequence in a truly non-UTF-8 file happens to appear to be UTF-8, one |
668 | can cater to our heuristic (as well as any more intelligent heuristic) | |
669 | by prefacing that line with a comment line containing a highbit | |
670 | sequence that is clearly I<not> valid as UTF-8. A line consisting | |
671 | of simply "#", an e-acute, and any non-highbit byte, | |
672 | is sufficient to establish this file's encoding. | |
673 | ||
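Both checks described above can be sketched in a few lines. The function names are hypothetical, the input is assumed to be the raw bytes of a BOM-less file, and the naive version assumes an ASCII platform. In the second function, "the sequence" is taken to mean the first run of consecutive high-bit bytes, which is one reasonable reading of the text rather than the only one.

```perl
use strict;
use warnings;

# Naive heuristic: look only at the first high-bit byte and its successor.
sub guess_encoding_naive {
    my ($bytes) = @_;
    return unless $bytes =~ m/([\x80-\xFF])(.?)/s;   # no high-bit bytes at all
    my ($first, $next) = (ord $1, $2);
    return 'UTF-8'
        if $first >= 0xC2 && $first <= 0xFD
        && length $next && ord($next) >= 0x80 && ord($next) <= 0xBF;
    return 'CP-1252';
}

# Stricter check: fully validate the first run of high-bit bytes.
sub guess_encoding_strict {
    my ($bytes) = @_;
    return unless $bytes =~ m/([\x80-\xFF]+)/;
    my $copy = $1;                    # utf8::decode modifies its argument
    return utf8::decode($copy) ? 'UTF-8' : 'CP-1252';
}
```

Both functions return nothing when the file contains no high-bit bytes, in which case the choice of encoding makes no difference to the parse.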
674 | =for comment | |
675 | If/WHEN some brave soul makes these heuristics into a generic | |
fae2c0fb | 676 | text-file class (or PerlIO layer?), we can presumably delete |
8a93676d | 677 | mention of these icky details from this file, and can instead |
fae2c0fb | 678 | tell people to just use appropriate class/layer. |
8a93676d | 679 | Auto-recognition of newline sequences would be another desirable |
fae2c0fb | 680 | feature of such a class/layer. |
8a93676d SB |
681 | HINT HINT HINT. |
682 | ||
683 | =for comment | |
684 | "The probability that a string of characters | |
685 | in any other encoding appears as valid UTF-8 is low" - RFC2279 | |
686 | ||
687 | =item * | |
688 | ||
8a93676d SB |
689 | Pod processors must treat a "=for [label] [content...]" paragraph as |
690 | meaning the same thing as a "=begin [label]" paragraph, content, and | |
691 | an "=end [label]" paragraph. (The parser may conflate these two | |
692 | constructs, or may leave them distinct, in the expectation that the | |
693 | formatter will nevertheless treat them the same.) | |
694 | ||
695 | =item * | |
696 | ||
697 | When rendering Pod to a format that allows comments (i.e., to nearly | |
698 | any format other than plaintext), a Pod formatter must insert comment | |
699 | text identifying its name and version number, and the name and | |
700 | version numbers of any modules it might be using to process the Pod. | |
701 | Minimal examples: | |
702 | ||
555bd962 | 703 | %% POD::Pod2PS v3.14159, using POD::Parser v1.92 |
210b36aa | 704 | |
555bd962 | 705 | <!-- Pod::HTML v3.14159, using POD::Parser v1.92 --> |
210b36aa | 706 | |
555bd962 | 707 | {\doccomm generated by Pod::Tree::RTF 3.14159 using Pod::Tree 1.08} |
210b36aa | 708 | |
555bd962 | 709 | .\" Pod::Man version 3.14159, using POD::Parser version 1.92 |
8a93676d SB |
710 | |
711 | Formatters may also insert additional comments, including: the | |
712 | release date of the Pod formatter program, the contact address for | |
713 | the author(s) of the formatter, the current time, the name of the input |
714 | file, the formatting options in effect, the version of Perl used, etc. |
715 | ||
716 | Formatters may also choose to note errors/warnings as comments, | |
717 | besides or instead of emitting them otherwise (as in messages to | |
718 | STDERR, or C<die>ing). | |
719 | ||
720 | =item * | |
721 | ||
722 | Pod parsers I<may> emit warnings or error messages ("Unknown E code | |
723 | EE<lt>zslig>!") to STDERR (whether through printing to STDERR, or | |
724 | C<warn>ing/C<carp>ing, or C<die>ing/C<croak>ing), but I<must> allow | |
725 | suppressing all such STDERR output, and instead allow an option for | |
726 | reporting errors/warnings | |
727 | in some other way, whether by triggering a callback, or noting errors | |
728 | in some attribute of the document object, or some similarly unobtrusive | |
729 | mechanism -- or even by appending a "Pod Errors" section to the end of | |
730 | the parsed form of the document. | |
731 | ||
732 | =item * | |
733 | ||
734 | In cases of exceptionally aberrant documents, Pod parsers may abort the | |
735 | parse. Even then, using C<die>ing/C<croak>ing is to be avoided; where | |
736 | possible, the parser library may simply close the input file | |
737 | and add text like "*** Formatting Aborted ***" to the end of the | |
738 | (partial) in-memory document. | |
739 | ||
740 | =item * | |
741 | ||
742 | In paragraphs where formatting codes (like EE<lt>...>, BE<lt>...>) | |
743 | are understood (i.e., I<not> verbatim paragraphs, but I<including> | |
744 | ordinary paragraphs, and command paragraphs that produce renderable | |
745 | text, like "=head1"), literal whitespace should generally be considered | |
746 | "insignificant", in that one literal space has the same meaning as any | |
747 | (nonzero) number of literal spaces, literal newlines, and literal tabs | |
748 | (as long as this produces no blank lines, since those would terminate | |
749 | the paragraph). Pod parsers should compact literal whitespace in each | |
750 | processed paragraph, but may provide an option for overriding this | |
751 | (since some processing tasks do not require it), or may follow | |
752 | additional special rules (for example, specially treating | |
753 | period-space-space or period-newline sequences). | |
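
A minimal sketch of such compaction, assuming newlines have already
been normalized and that no special period-space-space rule is wanted.
The function name C<compact_whitespace> is hypothetical. Note that the
character class lists only literal space, tab, and newline characters,
so an NBSP (character 160) is left alone.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse each run of literal spaces, tabs, and
# newlines into a single space.  NBSP (0xA0) is deliberately not matched.
sub compact_whitespace {
    my ($text) = @_;
    $text =~ s/[ \t\r\n]+/ /g;
    return $text;
}
```

A parser offering an option to override compaction would simply skip
this call for the paragraphs concerned.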
754 | ||
755 | =item * | |
756 | ||
757 | Pod parsers should not, by default, try to coerce apostrophe (') and | |
758 | quote (") into smart quotes (little 9's, 66's, 99's, etc), nor try to | |
759 | turn backtick (`) into anything else but a single backtick character | |
353c6505 | 760 | (distinct from an open quote character!), nor "--" into anything but |
8a93676d SB |
761 | two minus signs. They I<must never> do any of those things to text |
762 | in CE<lt>...> formatting codes, and never I<ever> to text in verbatim | |
763 | paragraphs. | |
764 | ||
765 | =item * | |
766 | ||
767 | When rendering Pod to a format that has two kinds of hyphens (-), one | |
3e666715 | 768 | that's a non-breaking hyphen, and another that's a breakable hyphen |
8a93676d SB |
769 | (as in "object-oriented", which can be split across lines as |
770 | "object-", newline, "oriented"), formatters are encouraged to | |
3e666715 | 771 | generally translate "-" to non-breaking hyphen, but may apply |
8a93676d SB |
772 | heuristics to convert some of these to breaking hyphens. |
773 | ||
774 | =item * | |
775 | ||
776 | Pod formatters should make reasonable efforts to keep words of Perl | |
777 | code from being broken across lines. For example, "Foo::Bar" in some | |
778 | formatting systems is seen as eligible for being broken across lines | |
779 | as "Foo::" newline "Bar" or even "Foo::-" newline "Bar". This should | |
780 | be avoided where possible, either by disabling all line-breaking in | |
781 | mid-word, or by wrapping particular words with internal punctuation | |
782 | in "don't break this across lines" codes (which in some formats may | |
783 | not be a single code, but might be a matter of inserting non-breaking | |
784 | zero-width spaces between every pair of characters in a word). |
785 | ||
786 | =item * | |
787 | ||
788 | Pod parsers should, by default, expand tabs in verbatim paragraphs as | |
789 | they are processed, before passing them to the formatter or other | |
790 | processor. Parsers may also allow an option for overriding this. | |
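
One possible way to expand tabs is sketched below. It assumes tab
stops every eight columns unless told otherwise, and the function name
C<expand_tabs> is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: replace each tab with enough spaces to reach the
# next tab stop.  The default of 8 columns is an assumption of this sketch.
sub expand_tabs {
    my ($text, $tabstop) = @_;
    $tabstop = 8 unless defined $tabstop;
    my @lines = split /\n/, $text, -1;    # -1 keeps trailing empty lines
    for my $line (@lines) {
        # Expand the leftmost tab repeatedly, so columns stay correct.
        1 while $line =~ s/\A([^\t]*)\t/$1 . (' ' x ($tabstop - length($1) % $tabstop))/e;
    }
    return join "\n", @lines;
}
```

Expanding the leftmost tab first matters: the width of each tab
depends on the column it starts at, which is only known once every
earlier tab on the line has been expanded.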
791 | ||
792 | =item * | |
793 | ||
794 | Pod parsers should, by default, remove newlines from the end of | |
795 | ordinary and verbatim paragraphs before passing them to the | |
796 | formatter. For example, while the paragraph you're reading now | |
797 | could be considered, in Pod source, to end with (and contain) | |
798 | the newline(s) that end it, it should be processed as ending with | |
799 | (and containing) the period character that ends this sentence. | |
800 | ||
801 | =item * | |
802 | ||
803 | Pod parsers, when reporting errors, should make some effort to report | |
804 | an approximate line number ("Nested EE<lt>>'s in Paragraph #52, near | |
805 | line 633 of Thing/Foo.pm!"), instead of merely noting the paragraph | |
806 | number ("Nested EE<lt>>'s in Paragraph #52 of Thing/Foo.pm!"). Where | |
807 | this is problematic, the paragraph number should at least be | |
808 | accompanied by an excerpt from the paragraph ("Nested EE<lt>>'s in | |
809 | Paragraph #52 of Thing/Foo.pm, which begins 'Read/write accessor for | |
810 | the CE<lt>interest rate> attribute...'"). | |
811 | ||
812 | =item * | |
813 | ||
814 | Pod parsers, when processing a series of verbatim paragraphs one | |
815 | after another, should consider them to be one large verbatim | |
816 | paragraph that happens to contain blank lines. I.e., these two | |
d1be9408 | 817 | lines, which have a blank line between them: |
8a93676d SB |
818 | |
819 | use Foo; | |
820 | ||
821 | print Foo->VERSION | |
822 | ||
823 | should be unified into one paragraph ("\tuse Foo;\n\n\tprint | |
824 | Foo->VERSION") before being passed to the formatter or other | |
825 | processor. Parsers may also allow an option for overriding this. | |
826 | ||
827 | While this might be too cumbersome to implement in event-based Pod | |
828 | parsers, it is straightforward for parsers that return parse trees. | |
829 | ||
830 | =item * | |
831 | ||
832 | Pod formatters, where feasible, are advised to avoid splitting short | |
833 | verbatim paragraphs (under twelve lines, say) across pages. | |
834 | ||
835 | =item * | |
836 | ||
837 | Pod parsers must treat a line with only spaces and/or tabs on it as a | |
838 | "blank line" such as separates paragraphs. (Some older parsers | |
839 | recognized only two adjacent newlines as a "blank line" but would not | |
840 | recognize a newline, a space, and a newline, as a blank line. This | |
841 | is noncompliant behavior.) | |
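
A compliant paragraph splitter can be sketched as below, assuming the
input's newline sequences have already been normalized to "\n". The
function name C<split_paragraphs> is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: split text into paragraphs, treating any line
# that holds only spaces and/or tabs as a blank line.
sub split_paragraphs {
    my ($text) = @_;
    return grep { length } split /\n(?:[ \t]*\n)+/, $text;
}
```

The older, noncompliant behavior corresponds to splitting on C</\n\n+/>
instead, which fails to separate paragraphs divided by a line
containing a stray space or tab.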
842 | ||
843 | =item * | |
844 | ||
845 | Authors of Pod formatters/processors should make every effort to | |
846 | avoid writing their own Pod parser. There are already several in | |
847 | CPAN, with a wide range of interface styles -- and one of them, | |
33874d2e | 848 | Pod::Simple, comes with modern versions of Perl. |
8a93676d SB |
849 | |
850 | =item * | |
851 | ||
852 | Characters in Pod documents may be conveyed either as literals, or by | |
853 | number in EE<lt>n> codes, or by an equivalent mnemonic, as in | |
bd940430 KW |
854 | EE<lt>eacute> which is exactly equivalent to EE<lt>233>. The numbers |
855 | are the Latin1/Unicode values, even on EBCDIC platforms. | |
856 | ||
857 | When referring to characters by using a EE<lt>n> numeric code, numbers | |
858 | in the range 32-126 refer to those well known US-ASCII characters (also | |
859 | defined there by Unicode, with the same meaning), which all Pod | |
df0c7995 KW |
860 | formatters must render faithfully. Characters whose EE<lt>E<gt> numbers |
861 | are in the ranges 0-31 and 127-159 should not be used (neither as | |
862 | literals, | |
863 | nor as EE<lt>number> codes), except for the literal byte-sequences for | |
864 | newline (ASCII 13, ASCII 13 10, or ASCII 10), and tab (ASCII 9). | |
bd940430 KW |
865 | |
866 | Numbers in the range 160-255 refer to Latin-1 characters (also | |
867 | defined there by Unicode, with the same meaning). Numbers above | |
8a93676d SB |
868 | 255 should be understood to refer to Unicode characters. |
869 | ||
870 | =item * | |
871 | ||
872 | Be warned | |
873 | that some formatters cannot reliably render characters outside 32-126; | |
874 | and many are able to handle 32-126 and 160-255, but nothing above | |
875 | 255. | |
876 | ||
877 | =item * | |
878 | ||
879 | Besides the well-known "EE<lt>lt>" and "EE<lt>gt>" codes for | |
880 | less-than and greater-than, Pod parsers must understand "EE<lt>sol>" | |
881 | for "/" (solidus, slash), and "EE<lt>verbar>" for "|" (vertical bar, | |
882 | pipe). Pod parsers should also understand "EE<lt>lchevron>" and | |
883 | "EE<lt>rchevron>" as legacy codes for characters 171 and 187, i.e., | |
884 | "left-pointing double angle quotation mark" = "left pointing | |
885 | guillemet" and "right-pointing double angle quotation mark" = "right | |
886 | pointing guillemet". (These look like little "<<" and ">>", and they | |
887 | are now preferably expressed with the HTML/XHTML codes "EE<lt>laquo>" | |
888 | and "EE<lt>raquo>".) | |
889 | ||
890 | =item * | |
891 | ||
892 | Pod parsers should understand all "EE<lt>html>" codes as defined | |
893 | in the entity declarations in the most recent XHTML specification at | |
894 | C<www.W3.org>. Pod parsers must understand at least the entities | |
895 | that define characters in the range 160-255 (Latin-1). Pod parsers, | |
896 | when faced with some unknown "EE<lt>I<identifier>>" code, | |
897 | shouldn't simply replace it with nullstring (by default, at least), | |
898 | but may pass it through as a string consisting of the literal characters | |
899 | E, less-than, I<identifier>, greater-than. Or Pod parsers may offer the | |
900 | alternative option of processing such unknown | |
901 | "EE<lt>I<identifier>>" codes by firing an event especially | |
902 | for such codes, or by adding a special node-type to the in-memory | |
903 | document tree. Such "EE<lt>I<identifier>>" may have special meaning | |
904 | to some processors, or some processors may choose to add them to | |
905 | a special error report. | |
906 | ||
907 | =item * | |
908 | ||
909 | Pod parsers must also support the XHTML codes "EE<lt>quot>" for | |
910 | character 34 (doublequote, "), "EE<lt>amp>" for character 38 | |
911 | (ampersand, &), and "EE<lt>apos>" for character 39 (apostrophe, '). | |
912 | ||
913 | =item * | |
914 | ||
1bca558f | 915 | Note that in all cases of "EE<lt>whateverE<gt>", I<whatever> (whether |
8a93676d | 916 | an htmlname, or a number in any base) must consist only of |
817141f8 | 917 | alphanumeric characters -- that is, I<whatever> must match |
1bca558f | 918 | C<m/\A\w+\z/>. So S<"EE<lt> 0 1 2 3 E<gt>"> is invalid, because |
8a93676d SB |
919 | it contains spaces, which aren't alphanumeric characters. This |
920 | presumably does not I<need> special treatment by a Pod processor; | |
1bca558f | 921 | S<" 0 1 2 3 "> doesn't look like a number in any base, so it would |
8a93676d | 922 | presumably be looked up in the table of HTML-like names. Since |
1bca558f | 923 | there isn't (and cannot be) an HTML-like entity called S<" 0 1 2 3 ">, |
8a93676d | 924 | this will be treated as an error. However, Pod processors may |
1bca558f | 925 | treat S<"EE<lt> 0 1 2 3 E<gt>"> or "EE<lt>e-acute>" as I<syntactically> |
8a93676d SB |
926 | invalid, potentially earning a different error message than the |
927 | error message (or warning, or event) generated by a merely unknown | |
928 | (but theoretically valid) htmlname, as in "EE<lt>qacute>" | |
929 | [sic]. However, Pod parsers are not required to make this | |
930 | distinction. | |
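
The lookup order described above can be sketched roughly as follows.
The function name C<e_code_to_codepoint> and the tiny C<%NAMED> table
are hypothetical; a real parser would use a complete entity table. The
sketch also assumes the number conventions given in perlpod, namely a
leading "0x" for hex and a leading "0" for octal. It returns C<undef>
both for syntactically invalid content and for unknown names, since
parsers are not required to tell these apart.

```perl
use strict;
use warnings;

# Illustrative subset only; not a complete entity table.
my %NAMED = (
    lt   => 60,  gt     => 62,  sol    => 47,  verbar => 124,
    quot => 34,  amp    => 38,  apos   => 39,  nbsp   => 160,
    shy  => 173, eacute => 233, laquo  => 171, raquo  => 187,
    lchevron => 171, rchevron => 187,
);

# Hypothetical helper: map the content of an E code to a Unicode
# codepoint, or undef if it is invalid or unknown.
sub e_code_to_codepoint {
    my ($content) = @_;
    return undef unless defined $content && $content =~ m/\A\w+\z/;
    return hex($1)       if $content =~ m/\A0[xX]([0-9A-Fa-f]+)\z/;
    return oct($content) if $content =~ m/\A0[0-7]+\z/;
    return $content + 0  if $content =~ m/\A[0-9]+\z/;
    return $NAMED{$content};    # undef for unknown names
}
```

A parser wanting distinct messages for the two failure cases could
report the C<m/\A\w+\z/> failure separately before falling through to
the name lookup.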
931 | ||
932 | =item * | |
933 | ||
934 | Note that EE<lt>number> I<must not> be interpreted as simply | |
935 | "codepoint I<number> in the current/native character set". It always | |
936 | means only "the character represented by codepoint I<number> in | |
937 | Unicode." (This is identical to the semantics of &#I<number>; in XML.) | |
938 | ||
939 | This will likely require many formatters to have tables mapping from | |
940 | treatable Unicode codepoints (such as the "\xE9" for the e-acute | |
941 | character) to the escape sequences or codes necessary for conveying | |
942 | such sequences in the target output format. A converter to *roff | |
943 | would, for example, know that "\xE9" (whether conveyed literally, or via |
944 | a EE<lt>...> sequence) is to be conveyed as "e\\*'". | |
8939ba94 | 945 | Similarly, a program rendering Pod in a Mac OS application window would |
8a93676d | 946 | presumably need to know that "\xE9" maps to codepoint 142 in the MacRoman |
8939ba94 | 947 | encoding that (at time of writing) is native for Mac OS. Such |
8a93676d SB |
948 | Unicode2whatever mappings are presumably already widely available for |
949 | common output formats. (Such mappings may be incomplete! Implementers | |
950 | are not expected to bend over backwards in an attempt to render | |
951 | Cherokee syllabics, Etruscan runes, Byzantine musical symbols, or any | |
952 | of the other weird things that Unicode can encode.) And | |
953 | if a Pod document uses a character not found in such a mapping, the | |
954 | formatter should consider it an unrenderable character. | |
955 | ||
956 | =item * | |
957 | ||
958 | If, surprisingly, the implementor of a Pod formatter can't find a | |
959 | satisfactory pre-existing table mapping from Unicode characters to | |
960 | escapes in the target format (e.g., a decent table of Unicode | |
961 | characters to *roff escapes), it will be necessary to build such a | |
962 | table. If you are in this circumstance, you should begin with the | |
963 | characters in the range 0x00A0 - 0x00FF, which is mostly the heavily | |
964 | used accented characters. Then proceed (as patience permits and | |
965 | fastidiousness compels) through the characters that the (X)HTML | |
966 | standards groups judged important enough to merit mnemonics | |
967 | for. These are declared in the (X)HTML specifications at the | |
968 | www.W3.org site. At time of writing (September 2001), the most recent | |
969 | entity declaration files are: | |
970 | ||
971 | http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent | |
972 | http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent | |
973 | http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent | |
974 | ||
975 | Then you can progress through any remaining notable Unicode characters | |
976 | in the range 0x2000-0x204D (consult the character tables at | |
977 | www.unicode.org), and whatever else strikes your fancy. For example, | |
978 | in F<xhtml-symbol.ent>, there is the entry: | |
979 | ||
980 | <!ENTITY infin "&#8734;"> <!-- infinity, U+221E ISOtech --> |
981 | ||
982 | While the mapping "infin" to the character "\x{221E}" will (hopefully) | |
983 | have been already handled by the Pod parser, the presence of the | |
984 | character in this file means that it's important enough to |
985 | include in a formatter's table that maps from notable Unicode characters | |
986 | to the codes necessary for rendering them. So for a Unicode-to-*roff | |
987 | mapping, for example, this would merit the entry: | |
988 | ||
989 | "\x{221E}" => '\(in', | |
990 | ||
991 | It is eagerly hoped that in the future, increasing numbers of formats | |
992 | (and formatters) will support Unicode characters directly (as (X)HTML | |
993 | does with C<&infin;>, C<&#8734;>, or C<&#x221E;>), reducing the need |
994 | for idiosyncratic mappings of Unicode-to-I<my_escapes>. | |
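
To illustrate how a formatter might apply such a table, here is a
minimal sketch. The C<%ROFF> table holds only the two entries mentioned
in this document, the function name C<to_roff> is hypothetical, and
the "?" fallback is just one of the choices open to a formatter. The
sketch does not escape characters that are special to *roff itself,
such as a literal backslash.

```perl
use strict;
use warnings;

# Illustrative two-entry table; a real one would be far larger.
my %ROFF = (
    "\xE9"     => "e\\*'",    # e-acute
    "\x{221E}" => '\(in',     # infinity
);

# Hypothetical helper: replace every character outside printable
# US-ASCII (other than newline and tab) with its *roff escape, or with
# "?" if the table has no entry for it.
sub to_roff {
    my ($text) = @_;
    $text =~ s/([^\x20-\x7E\n\t])/ exists($ROFF{$1}) ? $ROFF{$1} : '?' /ge;
    return $text;
}
```

Characters missing from the table are exactly the "unrenderable
characters" referred to above; a formatter could also collect them for
a warning instead of silently substituting "?".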
995 | ||
996 | =item * | |
997 | ||
353c6505 | 998 | It is up to each individual Pod formatter to display good judgement when |
8a93676d SB |
999 | confronted with an unrenderable character (which is distinct from an |
1000 | unknown EE<lt>thing> sequence that the parser couldn't resolve to | |
1001 | anything, renderable or not). It is good practice to map Latin letters | |
1002 | with diacritics (like "EE<lt>eacute>"/"EE<lt>233>") to the corresponding | |
1003 | unaccented US-ASCII letters (like a simple character 101, "e"), but | |
210b36aa | 1004 | clearly this is often not feasible, and an unrenderable character may |
8a93676d SB |
1005 | be represented as "?", or the like. In attempting a sane fallback |
1006 | (as from EE<lt>233> to "e"), Pod formatters may use the | |
1007 | %Latin1Code_to_fallback table in L<Pod::Escapes|Pod::Escapes>, or | |
1008 | L<Text::Unidecode|Text::Unidecode>, if available. | |
1009 | ||
1010 | For example, this Pod text: | |
1011 | ||
1012 | magic is enabled if you set C<$Currency> to 'E<euro>'. | |
1013 | ||
1014 | may be rendered as: | |
1015 | "magic is enabled if you set C<$Currency> to 'I<?>'" or as | |
1016 | "magic is enabled if you set C<$Currency> to 'B<[euro]>'", or as | |
1017 | "magic is enabled if you set C<$Currency> to '[x20AC]'", etc. |
1018 | ||
1019 | A Pod formatter may also note, in a comment or warning, a list of what | |
1020 | unrenderable characters were encountered. | |
1021 | ||
1022 | =item * | |
1023 | ||
1024 | EE<lt>...> may freely appear in any formatting code (other than | |
1025 | in another EE<lt>...> or in an ZE<lt>>). That is, "XE<lt>The | |
1026 | EE<lt>euro>1,000,000 Solution>" is valid, as is "LE<lt>The | |
1027 | EE<lt>euro>1,000,000 Solution|Million::Euros>". | |
1028 | ||
1029 | =item * | |
1030 | ||
3e666715 | 1031 | Some Pod formatters output to formats that implement non-breaking |
8a93676d | 1032 | spaces as an individual character (which I'll call "NBSP"), and |
3e666715 | 1033 | others output to formats that implement non-breaking spaces just as |
8a93676d SB |
1034 | spaces wrapped in a "don't break this across lines" code. Note that |
1035 | at the level of Pod, both sorts of codes can occur: Pod can contain a | |
1036 | NBSP character (whether as a literal, or as a "EE<lt>160>" or | |
1037 | "EE<lt>nbsp>" code); and Pod can contain "SE<lt>foo | |
1038 | IE<lt>barE<gt> baz>" codes, where "mere spaces" (character 32) in | |
3e666715 | 1039 | such codes are taken to represent non-breaking spaces. Pod |
8a93676d SB |
1040 | parsers should consider supporting the optional parsing of "SE<lt>foo |
1041 | IE<lt>barE<gt> baz>" as if it were | |
1042 | "fooI<NBSP>IE<lt>barE<gt>I<NBSP>baz", and, going the other way, the | |
1043 | optional parsing of groups of words joined by NBSP's as if each group | |
1044 | were in a SE<lt>...> code, so that formatters may use the | |
1045 | representation that maps best to what the output format demands. | |
1046 | ||
1047 | =item * | |
1048 | ||
210b36aa | 1049 | Some processors may find that the C<SE<lt>...E<gt>> code is easiest to |
8a93676d SB |
1050 | implement by replacing each space in the parse tree under the content |
1051 | of the S, with an NBSP. But note: the replacement should apply I<not> to | |
1052 | spaces in I<all> text, but I<only> to spaces in I<printable> text. (This | |
1053 | distinction may or may not be evident in the particular tree/event | |
1054 | model implemented by the Pod parser.) For example, consider this | |
1055 | unusual case: | |
1056 | ||
1057 | S<L</Autoloaded Functions>> | |
1058 | ||
1059 | This means that the space in the middle of the visible link text must | |
1060 | not be broken across lines. In other words, it's the same as this: | |
1061 | ||
1062 | L<"AutoloadedE<160>Functions"/Autoloaded Functions> | |
1063 | ||
1064 | However, a misapplied space-to-NBSP replacement could (wrongly) | |
1065 | produce something equivalent to this: | |
1066 | ||
1067 | L<"AutoloadedE<160>Functions"/AutoloadedE<160>Functions> | |
1068 | ||
1069 | ...which is almost definitely not going to work as a hyperlink (assuming | |
1070 | this formatter outputs a format supporting hypertext). | |
1071 | ||
1072 | Formatters may choose to just not support the S format code, | |
1073 | especially in cases where the output format simply has no NBSP | |
1074 | character/code and no code for "don't break this stuff across lines". | |
1075 | ||
1076 | =item * | |
1077 | ||
1078 | Besides the NBSP character discussed above, implementors are reminded | |
1079 | of the existence of the other "special" character in Latin-1, the | |
210b36aa | 1080 | "soft hyphen" character, also known as "discretionary hyphen", |
8a93676d SB |
1081 | i.e. C<EE<lt>173E<gt>> = C<EE<lt>0xADE<gt>> = |
1082 | C<EE<lt>shyE<gt>>. This character expresses an optional hyphenation |
1083 | point. That is, it normally renders as nothing, but may render as a | |
1084 | "-" if a formatter breaks the word at that point. Pod formatters | |
1085 | should, as appropriate, do one of the following: 1) render this with | |
1086 | a code with the same meaning (e.g., "\-" in RTF), 2) pass it through | |
1087 | in the expectation that the formatter understands this character as | |
1088 | such, or 3) delete it. | |
1089 | ||
1090 | For example: | |
1091 | ||
1092 | sigE<shy>action | |
1093 | manuE<shy>script | |
1094 | JarkE<shy>ko HieE<shy>taE<shy>nieE<shy>mi | |
1095 | ||
1096 | These signal to a formatter that if it is to hyphenate "sigaction" | |
1097 | or "manuscript", then it should be done as | |
1098 | "sig-I<[linebreak]>action" or "manu-I<[linebreak]>script" | |
1099 | (and if it doesn't hyphenate it, then the C<EE<lt>shyE<gt>> doesn't | |
1100 | show up at all). And if it is | |
1101 | to hyphenate "Jarkko" and/or "Hietaniemi", it can do | |
1102 | so only at the points where there is a C<EE<lt>shyE<gt>> code. | |
1103 | ||
1104 | In practice, it is anticipated that this character will not be used | |
1105 | often, but formatters should either support it, or delete it. | |
1106 | ||
1107 | =item * | |
1108 | ||
1109 | If you think that you want to add a new command to Pod (like, say, a | |
1110 | "=biblio" command), consider whether you could get the same | |
1111 | effect with a for or begin/end sequence: "=for biblio ..." or "=begin | |
1112 | biblio" ... "=end biblio". Pod processors that don't understand | |
1113 | "=for biblio", etc, will simply ignore it, whereas they may complain | |
1114 | loudly if they see "=biblio". | |
1115 | ||
1116 | =item * | |
1117 | ||
1118 | Throughout this document, "Pod" has been the preferred spelling for | |
1119 | the name of the documentation format. One may also use "POD" or | |
da75cd15 | 1120 | "pod". For the documentation that is (typically) in the Pod |
8a93676d SB |
1121 | format, you may use "pod", or "Pod", or "POD". Understanding these |
1122 | distinctions is useful; but obsessing over how to spell them usually |
1123 | is not. | |
1124 | ||
1125 | =back | |
1126 | ||
1127 | ||
1128 | ||
1129 | ||
1130 | ||
1131 | =head1 About LE<lt>...E<gt> Codes | |
1132 | ||
1133 | As you can tell from a glance at L<perlpod|perlpod>, the LE<lt>...> | |
1134 | code is the most complex of the Pod formatting codes. The points below | |
1135 | will hopefully clarify what it means and how processors should deal | |
1136 | with it. | |
1137 | ||
1138 | =over | |
1139 | ||
1140 | =item * | |
1141 | ||
1142 | In parsing an LE<lt>...> code, Pod parsers must distinguish at least | |
1143 | four attributes: | |
1144 | ||
1145 | =over | |
1146 | ||
1147 | =item First: | |
1148 | ||
1bca558f | 1149 | The link-text. If there is none, this must be C<undef>. (E.g., in |
8a93676d SB |
1150 | "LE<lt>Perl Functions|perlfunc>", the link-text is "Perl Functions". |
1151 | In "LE<lt>Time::HiRes>" and even "LE<lt>|Time::HiRes>", there is no | |
1152 | link text. Note that link text may contain formatting.) | |
1153 | ||
1154 | =item Second: | |
1155 | ||
ac036724 | 1156 | The possibly inferred link-text; i.e., if there was no real link |
8a93676d SB |
1157 | text, then this is the text that we'll infer in its place. (E.g., for |
1158 | "LE<lt>Getopt::Std>", the inferred link text is "Getopt::Std".) | |
1159 | ||
1160 | =item Third: | |
1161 | ||
1bca558f | 1162 | The name or URL, or C<undef> if none. (E.g., in "LE<lt>Perl |
ac036724 | 1163 | Functions|perlfunc>", the name (also sometimes called the page) |
1bca558f | 1164 | is "perlfunc". In "LE<lt>/CAVEATS>", the name is C<undef>.) |
8a93676d SB |
1165 | |
1166 | =item Fourth: | |
1167 | ||
1bca558f | 1168 | The section (AKA "item" in older perlpods), or C<undef> if none. E.g., |
f41e638c | 1169 | in "LE<lt>Getopt::Std/DESCRIPTIONE<gt>", "DESCRIPTION" is the section. (Note |
8a93676d SB |
1170 | that this is not the same as a manpage section like the "5" in "man 5 |
1171 | crontab". "Section Foo" in the Pod sense means the part of the text | |
6edf2346 | 1172 | that's introduced by the heading or item whose text is "Foo".) |
8a93676d SB |
1173 | |
1174 | =back | |
1175 | ||
1176 | Pod parsers may also note additional attributes including: | |
1177 | ||
1178 | =over | |
1179 | ||
1180 | =item Fifth: | |
1181 | ||
1182 | A flag for whether item 3 (if present) is a URL (like | |
1183 | "http://lists.perl.org" is), in which case there should be no section | |
1184 | attribute; a Pod name (like "perldoc" and "Getopt::Std" are); or | |
1185 | possibly a man page name (like "crontab(5)" is). | |
1186 | ||
1187 | =item Sixth: | |
1188 | ||
1189 | The raw original LE<lt>...> content, before text is split on | |
1190 | "|", "/", etc, and before EE<lt>...> codes are expanded. | |
1191 | ||
1192 | =back | |
1193 | ||
1194 | (The above were numbered only for concise reference below. It is not | |
1195 | a requirement that these be passed as an actual list or array.) | |
1196 | ||
1197 | For example: | |
1198 | ||
1199 | L<Foo::Bar> | |
555bd962 BG |
1200 | => undef, # link text |
1201 | "Foo::Bar", # possibly inferred link text | |
1202 | "Foo::Bar", # name | |
1203 | undef, # section | |
1204 | 'pod', # what sort of link | |
1205 | "Foo::Bar" # original content | |
8a93676d SB |
1206 | |
1207 | L<Perlport's section on NL's|perlport/Newlines> | |
555bd962 BG |
1208 | => "Perlport's section on NL's", # link text |
1209 | "Perlport's section on NL's", # possibly inferred link text | |
1210 | "perlport", # name | |
1211 | "Newlines", # section | |
1212 | 'pod', # what sort of link | |
1213 | "Perlport's section on NL's|perlport/Newlines" | |
1214 | # original content | |
8a93676d SB |
1215 | |
1216 | L<perlport/Newlines> | |
555bd962 BG |
1217 | => undef, # link text |
1218 | '"Newlines" in perlport', # possibly inferred link text | |
1219 | "perlport", # name | |
1220 | "Newlines", # section | |
1221 | 'pod', # what sort of link | |
1222 | "perlport/Newlines" # original content | |
8a93676d SB |
1223 | |
1224 | L<crontab(5)/"DESCRIPTION"> | |
555bd962 BG |
1225 | => undef, # link text |
1226 | '"DESCRIPTION" in crontab(5)', # possibly inferred link text | |
1227 | "crontab(5)", # name | |
1228 | "DESCRIPTION", # section | |
1229 | 'man', # what sort of link | |
1230 | 'crontab(5)/"DESCRIPTION"' # original content | |
8a93676d SB |
1231 | |
1232 | L</Object Attributes> | |
555bd962 BG |
1233 | => undef, # link text |
1234 | '"Object Attributes"', # possibly inferred link text | |
1235 | undef, # name | |
1236 | "Object Attributes", # section | |
1237 | 'pod', # what sort of link | |
1238 | "/Object Attributes" # original content | |
8a93676d | 1239 | |
71c89d21 | 1240 | L<https://www.perl.org/> |
555bd962 | 1241 | => undef, # link text |
a7b1b289 MM |
1242 | "https://www.perl.org/", # possibly inferred link text |
1243 | "https://www.perl.org/", # name | |
555bd962 BG |
1244 | undef, # section |
1245 | 'url', # what sort of link | |
71c89d21 | 1246 | "https://www.perl.org/" # original content |
8a93676d | 1247 | |
71c89d21 | 1248 | L<Perl.org|https://www.perl.org/> |
555bd962 | 1249 | => "Perl.org", # link text |
a7b1b289 MM |
1250 | "https://www.perl.org/", # possibly inferred link text |
1251 | "https://www.perl.org/", # name | |
555bd962 BG |
1252 | undef, # section |
1253 | 'url', # what sort of link | |
71c89d21 | 1254 | "Perl.org|https://www.perl.org/" # original content |
f6e963e4 | 1255 | |
8a93676d SB |
1256 | Note that you can distinguish URL-links from anything else by the |
1257 | fact that they match C<m/\A\w+:[^:\s]\S*\z/>. So | |
1258 | C<LE<lt>http://www.perl.comE<gt>> is a URL, but | |
1259 | C<LE<lt>HTTP::ResponseE<gt>> isn't. | |
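
The attributes listed above can be derived along the following lines.
This is only a sketch: the function name C<parse_link> and the hash
keys are hypothetical, it handles only link content without nested
formatting codes (no EE<lt>...> expansion), and the test for man page
names (a parenthesized suffix) is an assumption of this example rather
than a rule of this specification.

```perl
use strict;
use warnings;

# Hypothetical helper: split the raw content of an L code into the six
# attributes discussed above.
sub parse_link {
    my ($raw) = @_;
    my ($text, $target) = (undef, $raw);
    if ($raw =~ m/\A([^|]*)\|(.*)\z/s) {       # optional "text|" part
        ($text, $target) = ($1, $2);
        $text = undef if $text eq '';
    }
    my ($name, $section, $type);
    if ($target =~ m/\A\w+:[^:\s]\S*\z/) {     # URL test from this document
        ($name, $type) = ($target, 'url');
    }
    else {
        ($name, $section) = split m{/}, $target, 2;
        $name = undef if defined($name) && $name eq '';
        $section =~ s/\A"(.*)"\z/$1/s if defined($section);
        # Assumed heuristic: "name(section)" looks like a man page.
        $type = (defined($name) && $name =~ m/\(\S*\)\z/) ? 'man' : 'pod';
    }
    my $inferred;
    if    (defined($text))     { $inferred = $text }
    elsif (!defined($section)) { $inferred = $name }
    elsif (defined($name))     { $inferred = qq{"$section" in $name} }
    else                       { $inferred = qq{"$section"} }
    return {
        text => $text, inferred => $inferred, name    => $name,
        section => $section, type => $type,   raw     => $raw,
    };
}
```

For instance, C<parse_link('perlport/Newlines')> yields an undefined
C<text>, a C<name> of "perlport", a C<section> of "Newlines", and the
inferred text C<"Newlines" in perlport>.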
1260 | ||
1261 | =item * | |
1262 | ||
1263 | In case of LE<lt>...> codes with no "text|" part in them, | |
1264 | older formatters have exhibited great variation in actually displaying | |
1265 | the link or cross reference. For example, LE<lt>crontab(5)> would render | |
1266 | as "the C<crontab(5)> manpage", or "in the C<crontab(5)> manpage" | |
1267 | or just "C<crontab(5)>". | |
1268 | ||
1269 | Pod processors must now treat "text|"-less links as follows: | |
1270 | ||
1271 | L<name> => L<name|name> | |
1272 | L</section> => L<"section"|/section> | |
1273 | L<name/section> => L<"section" in name|name/section> | |
1274 | ||
1275 | =item * | |
1276 | ||
1277 | Note that section names might contain markup. I.e., if a section | |
1278 | starts with: | |
1279 | ||
1280 | =head2 About the C<-M> Operator | |
1281 | ||
1282 | or with: | |
1283 | ||
1284 | =item About the C<-M> Operator | |
1285 | ||
1286 | then a link to it would look like this: | |
1287 | ||
1288 | L<somedoc/About the C<-M> Operator> | |
1289 | ||
1290 | Formatters may choose to ignore the markup for purposes of resolving | |
1291 | the link and use only the renderable characters in the section name, | |
1292 | as in: | |
1293 | ||
1294 | <h1><a name="About_the_-M_Operator">About the <code>-M</code> | |
1295 | Operator</a></h1> |
210b36aa | 1296 | |
8a93676d | 1297 | ... |
210b36aa | 1298 | |
8a93676d SB |
1299 | <a href="somedoc#About_the_-M_Operator">"About the <code>-M</code> |
1300 | Operator" in somedoc</a> | |
1301 | ||
1302 | =item * | |
1303 | ||
1304 | Previous versions of perlpod distinguished C<LE<lt>name/"section"E<gt>> | |
1305 | links from C<LE<lt>name/itemE<gt>> links (and their targets). These | |
1306 | have been merged syntactically and semantically in the current | |
1307 | specification, and I<section> can refer either to a "=headI<n> Heading | |
1308 | Content" command or to a "=item Item Content" command. This | |
1309 | specification does not specify what the behavior should be in the case |
1310 | of a given document having several things all seeming to produce the | |
1311 | same I<section> identifier (e.g., in HTML, several things all producing | |
1312 | the same I<anchorname> in <a name="I<anchorname>">...</a> | |
1313 | elements). Where Pod processors can control this behavior, they should | |
1314 | use the first such anchor. That is, C<LE<lt>Foo/BarE<gt>> refers to the | |
1315 | I<first> "Bar" section in Foo. | |
1316 | ||
1317 | But for some processors/formats this cannot be easily controlled; as | |
1318 | with the HTML example, the behavior of multiple ambiguous | |
1319 | <a name="I<anchorname>">...</a> is most easily just left up to | |
1320 | browsers to decide. | |
1321 | ||
1322 | =item * | |
1323 | ||
8a93676d SB |
1324 | In a C<LE<lt>text|...E<gt>> code, text may contain formatting codes |
1325 | for formatting or for EE<lt>...> escapes, as in: | |
1326 | ||
1327 | L<B<ummE<234>stuff>|...> | |
1328 | ||
1329 | For C<LE<lt>...E<gt>> codes without a "text|" part, only | |
ac036724 | 1330 | C<EE<lt>...E<gt>> and C<ZE<lt>E<gt>> codes may occur. That is, |
1331 | authors should not use "C<LE<lt>BE<lt>Foo::BarE<gt>E<gt>>". | |
8a93676d SB |
1332 | |
1333 | Note, however, that formatting codes and ZE<lt>>'s can occur in any | |
1334 | and all parts of an LE<lt>...> (i.e., in I<name>, I<section>, I<text>, | |
1335 | and I<url>). | |
1336 | ||
1337 | Authors must not nest LE<lt>...> codes. For example, "LE<lt>The | |
1338 | LE<lt>Foo::Bar> man page>" should be treated as an error. | |
1339 | ||
1340 | =item * | |
1341 | ||
1342 | Note that Pod authors may use formatting codes inside the "text" | |
1343 | part of "LE<lt>text|name>" (and so on for LE<lt>text|/"sec">). | |
1344 | ||
1345 | In other words, this is valid: | |
1346 | ||
1347 | Go read L<the docs on C<$.>|perlvar/"$."> | |
1348 | ||
1349 | Some output formats that do allow rendering "LE<lt>...>" codes as | |
1350 | hypertext might not allow the link-text to be formatted; in | |
1351 | that case, formatters will have to just ignore that formatting. | |
1352 | ||
1353 | =item * | |
1354 | ||
1355 | At time of writing, C<LE<lt>nameE<gt>> values are of two types: | |
1356 | either the name of a Pod page like C<LE<lt>Foo::BarE<gt>> (which | |
1357 | might be a real Perl module or program in an @INC / PATH | |
e1020413 | 1358 | directory, or a .pod file in those places); or the name of a Unix |
8a93676d | 1359 | man page, like C<LE<lt>crontab(5)E<gt>>. In theory, C<LE<lt>chmodE<gt>> |
62a78fcb | 1360 | is ambiguous between a Pod page called "chmod", or the Unix man page |
8a93676d SB |
1361 | "chmod" (in whatever man-section). However, the presence of a string |
1362 | in parens, as in "crontab(5)", is sufficient to signal that what | |
1363 | is being discussed is not a Pod page, and so is presumably a | |
e1020413 | 1364 | Unix man page. The distinction is of no importance to many |
8a93676d SB |
1365 | Pod processors, but some processors that render to hypertext formats |
1366 | may need to distinguish them in order to know how to render a | |
1367 | given C<LE<lt>fooE<gt>> code. | |
1368 | ||
1369 | =item * | |
1370 | ||
b41aadf2 RS |
1371 | Previous versions of perlpod allowed for a C<LE<lt>sectionE<gt>> syntax (as in |
1372 | C<LE<lt>Object AttributesE<gt>>), which was not easily distinguishable from | |
1373 | C<LE<lt>nameE<gt>> syntax and for C<LE<lt>"section"E<gt>> which was only | |
1374 | slightly less ambiguous. This syntax is no longer in the specification, and | |
1375 | has been replaced by the C<LE<lt>/sectionE<gt>> syntax (where the slash was | |
1376 | formerly optional). Pod parsers should tolerate the C<LE<lt>"section"E<gt>> | |
1377 | syntax, for a while at least. The suggested heuristic for distinguishing | |
1378 | C<LE<lt>sectionE<gt>> from C<LE<lt>nameE<gt>> is that if it contains any | |
1379 | whitespace, it's a I<section>. Pod processors should warn about this being | |
1380 | deprecated syntax. | |
8a93676d SB |
1381 | |
1382 | =back | |
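
The note above about resolving links by renderable characters only, and about preferring the first of several identical anchors, can be sketched as follows. This is a minimal illustration in Python, not part of the specification: the function names are hypothetical, only single-bracket formatting codes are unwrapped, EE<lt>...E<gt> escapes are not decoded, and the whitespace-to-underscore anchor scheme is an assumption taken from the HTML example above.

```python
import re

# Matches an innermost single-bracket formatting code such as C<-M>.
# Assumption: doubled brackets (C<< ... >>) and E<...> decoding are out of scope.
_CODE = re.compile(r"[A-Z]<([^<>]*)>")

def strip_codes(section):
    """Unwrap formatting codes repeatedly, keeping only their content."""
    previous = None
    while previous != section:
        previous = section
        section = _CODE.sub(r"\1", section)
    return section

def anchor_name(section):
    """Hypothetical anchor scheme: renderable text, whitespace runs -> '_'."""
    return re.sub(r"\s+", "_", strip_codes(section).strip())

def first_anchor_index(headings):
    """Map each anchor name to the index of the FIRST heading producing it."""
    index = {}
    for position, heading in enumerate(headings):
        index.setdefault(anchor_name(heading), position)
    return index
```

With this sketch, a formatter resolving a link to a duplicated section name would land on the earliest matching heading or item, which is the behavior the specification recommends where the processor can control it.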
1383 | ||
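The heuristics described in the last two items (a parenthesized string signals a man page; whitespace or surrounding quotes signal the deprecated section-only syntax) might be combined roughly like this. The sketch is in Python and makes assumptions: the function name and the returned labels are hypothetical, and the caller is assumed to have already split off any "text|" part and any "/section" part, so only the remaining target string is examined.

```python
import re

# A name followed by a parenthesized string, as in "crontab(5)".
_MAN_PAGE = re.compile(r"[^\s()]+\([^\s()]+\)")

def classify_link_target(target):
    """Classify the content of an L<...> code that has no '|' and no '/'.

    Returns one of the hypothetical labels 'legacy-section', 'man' or 'pod'.
    """
    # Deprecated L<"section"> form: the whole target is quoted.
    if len(target) >= 2 and target.startswith('"') and target.endswith('"'):
        return "legacy-section"
    # Deprecated L<section> form: suggested heuristic is "contains whitespace".
    if re.search(r"\s", target):
        return "legacy-section"
    if _MAN_PAGE.fullmatch(target):
        return "man"
    # Ambiguous names such as "chmod" fall through to 'pod' in this sketch.
    return "pod"
```

A processor using something like this would presumably also warn whenever it returns the legacy label, since that syntax is deprecated.
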
1384 | =head1 About =over...=back Regions | |
1385 | ||
1386 | "=over"..."=back" regions are used for various kinds of list-like | |
1387 | structures. (I use the term "region" here simply as a collective | |
1388 | term for everything from the "=over" to the matching "=back".) | |
1389 | ||
1390 | =over | |
1391 | ||
1392 | =item * | |
1393 | ||
1394 | The non-zero numeric I<indentlevel> in "=over I<indentlevel>" ... | |
1395 | "=back" is used for giving the formatter a clue as to how many | |
1396 | "spaces" (ems, or roughly equivalent units) it should tab over, | |
1397 | although many formatters will have to convert this to an absolute | |
1398 | measurement that may not exactly match with the size of spaces (or M's) | |
1399 | in the document's base font. Other formatters may have to completely | |
1400 | ignore the number. The lack of any explicit I<indentlevel> parameter is | |
1401 | equivalent to an I<indentlevel> value of 4. Pod processors may | |
1402 | complain if I<indentlevel> is present but is not a positive number | |
1403 | matching C<m/\A(\d*\.)?\d+\z/>. | |
1404 | ||
1405 | =item * | |
1406 | ||
1407 | Authors of Pod formatters are reminded that "=over" ... "=back" may | |
1408 | map to several different constructs in your output format. For | |
1409 | example, in converting Pod to (X)HTML, it can map to any of | |
1410 | <ul>...</ul>, <ol>...</ol>, <dl>...</dl>, or | |
1411 | <blockquote>...</blockquote>. Similarly, "=item" can map to <li> or | |
1412 | <dt>. | |
1413 | ||
1414 | =item * | |
1415 | ||
1416 | Each "=over" ... "=back" region should be one of the following: | |
1417 | ||
1418 | =over | |
1419 | ||
1420 | =item * | |
1421 | ||
1422 | An "=over" ... "=back" region containing only "=item *" commands, | |
1423 | each followed by some number of ordinary/verbatim paragraphs, other | |
1424 | nested "=over" ... "=back" regions, "=for..." paragraphs, and | |
1425 | "=begin"..."=end" regions. | |
1426 | ||
1427 | (Pod processors must tolerate a bare "=item" as if it were "=item | |
1428 | *".) Whether "*" is rendered as a literal asterisk, an "o", or as | |
1429 | some kind of real bullet character, is left up to the Pod formatter, | |
1430 | and may depend on the level of nesting. | |
1431 | ||
1432 | =item * | |
1433 | ||
1434 | An "=over" ... "=back" region containing only | |
1435 | C<m/\A=item\s+\d+\.?\s*\z/> paragraphs, each one (or each group of them) | |
1436 | followed by some number of ordinary/verbatim paragraphs, other nested | |
1437 | "=over" ... "=back" regions, "=for..." paragraphs, and/or | |
1438 | "=begin"..."=end" regions. Note that the numbers must start at 1 | |
1439 | in each section, and must proceed in order and without skipping | |
1440 | numbers. | |
1441 | ||
1442 | (Pod processors must tolerate lines like "=item 1" as if they were | |
1443 | "=item 1.", with the period.) | |
1444 | ||
1445 | =item * | |
1446 | ||
1447 | An "=over" ... "=back" region containing only "=item [text]" | |
1448 | commands, each one (or each group of them) followed by some number of | |
1449 | ordinary/verbatim paragraphs, other nested "=over" ... "=back" | |
1450 | regions, "=for..." paragraphs, and "=begin"..."=end" regions. | |
1451 | ||
1452 | The "=item [text]" paragraph should not match | |
1453 | C<m/\A=item\s+\d+\.?\s*\z/> or C<m/\A=item\s+\*\s*\z/>, nor should it | |
1454 | match just C<m/\A=item\s*\z/>. | |
1455 | ||
1456 | =item * | |
1457 | ||
1458 | An "=over" ... "=back" region containing no "=item" paragraphs at | |
1459 | all, and containing only some number of | |
1460 | ordinary/verbatim paragraphs, and possibly also some nested "=over" | |
1461 | ... "=back" regions, "=for..." paragraphs, and "=begin"..."=end" | |
1462 | regions. Such an itemless "=over" ... "=back" region in Pod is | |
1463 | equivalent in meaning to a "<blockquote>...</blockquote>" element in | |
1464 | HTML. | |
1465 | ||
1466 | =back | |
1467 | ||
1468 | Note that with all the above cases, you can determine which type of | |
1469 | "=over" ... "=back" you have, by examining the first (non-"=cut", | |
1470 | non-"=pod") Pod paragraph after the "=over" command. | |
1471 | ||
1472 | =item * | |
1473 | ||
1474 | Pod formatters I<must> tolerate arbitrarily large amounts of text | |
1475 | in the "=item I<text...>" paragraph. In practice, most such | |
1476 | paragraphs are short, as in: | |
1477 | ||
1478 | =item For cutting off our trade with all parts of the world | |
1479 | ||
1480 | But they may be arbitrarily long: | |
1481 | ||
1482 | =item For transporting us beyond seas to be tried for pretended | |
1483 | offenses | |
1484 | ||
1485 | =item He is at this time transporting large armies of foreign | |
1486 | mercenaries to complete the works of death, desolation and | |
1487 | tyranny, already begun with circumstances of cruelty and perfidy | |
1488 | scarcely paralleled in the most barbarous ages, and totally | |
1489 | unworthy the head of a civilized nation. | |
1490 | ||
1491 | =item * | |
1492 | ||
1493 | Pod processors should tolerate "=item *" / "=item I<number>" commands | |
1494 | with no accompanying paragraph. The middle item is an example: | |
1495 | ||
1496 | =over | |
210b36aa | 1497 | |
8a93676d | 1498 | =item 1 |
210b36aa | 1499 | |
8a93676d | 1500 | Pick up dry cleaning. |
210b36aa | 1501 | |
8a93676d | 1502 | =item 2 |
210b36aa | 1503 | |
8a93676d | 1504 | =item 3 |
210b36aa | 1505 | |
8a93676d | 1506 | Stop by the store. Get Abba Zabas, Stoli, and cheap lawn chairs. |
210b36aa | 1507 | |
8a93676d SB |
1508 | =back |
1509 | ||
1510 | =item * | |
1511 | ||
1512 | No "=over" ... "=back" region can contain headings. Processors may | |
1513 | treat such a heading as an error. | |
1514 | ||
1515 | =item * | |
1516 | ||
1517 | Note that an "=over" ... "=back" region should have some | |
1518 | content. That is, authors should not have an empty region like this: | |
1519 | ||
1520 | =over | |
210b36aa | 1521 | |
8a93676d SB |
1522 | =back |
1523 | ||
1524 | Pod processors seeing such a contentless "=over" ... "=back" region | |
1525 | may ignore it, or may report it as an error. | |
1526 | ||
1527 | =item * | |
1528 | ||
1529 | Processors must tolerate an "=over" list that goes off the end of the | |
1530 | document (i.e., which has no matching "=back"), but they may warn | |
1531 | about such a list. | |
1532 | ||
1533 | =item * | |
1534 | ||
1535 | Authors of Pod formatters should note that this construct: | |
1536 | ||
1537 | =item Neque | |
1538 | ||
1539 | =item Porro | |
1540 | ||
1541 | =item Quisquam Est | |
210b36aa | 1542 | |
8a93676d SB |
1543 | Qui dolorem ipsum quia dolor sit amet, consectetur, adipisci |
1544 | velit, sed quia non numquam eius modi tempora incidunt ut | |
1545 | labore et dolore magnam aliquam quaerat voluptatem. | |
1546 | ||
1547 | =item Ut Enim | |
1548 | ||
1549 | is semantically ambiguous, in a way that makes formatting decisions | |
1550 | a bit difficult. On the one hand, it could be a mention of an item | |
1551 | "Neque", mention of another item "Porro", and mention of another | |
1552 | item "Quisquam Est", with just the last one requiring the explanatory | |
1553 | paragraph "Qui dolorem ipsum quia dolor..."; and then an item | |
1554 | "Ut Enim". In that case, you'd want to format it like so: | |
1555 | ||
1556 | Neque | |
210b36aa | 1557 | |
8a93676d | 1558 | Porro |
210b36aa | 1559 | |
8a93676d SB |
1560 | Quisquam Est |
1561 | Qui dolorem ipsum quia dolor sit amet, consectetur, adipisci | |
1562 | velit, sed quia non numquam eius modi tempora incidunt ut | |
1563 | labore et dolore magnam aliquam quaerat voluptatem. | |
1564 | ||
1565 | Ut Enim | |
1566 | ||
1567 | But it could equally well be a discussion of three (related or equivalent) | |
1568 | items, "Neque", "Porro", and "Quisquam Est", followed by a paragraph | |
1569 | explaining them all, and then a new item "Ut Enim". In that case, you'd | |
1570 | probably want to format it like so: | |
1571 | ||
1572 | Neque | |
1573 | Porro | |
1574 | Quisquam Est | |
1575 | Qui dolorem ipsum quia dolor sit amet, consectetur, adipisci | |
1576 | velit, sed quia non numquam eius modi tempora incidunt ut | |
1577 | labore et dolore magnam aliquam quaerat voluptatem. | |
1578 | ||
1579 | Ut Enim | |
1580 | ||
353c6505 | 1581 | But (for the foreseeable future), Pod does not provide any way for Pod |
8a93676d SB |
1582 | authors to distinguish which grouping is meant by the above |
1583 | "=item"-cluster structure. So formatters should format it like so: | |
1584 | ||
1585 | Neque | |
1586 | ||
1587 | Porro | |
1588 | ||
1589 | Quisquam Est | |
1590 | ||
1591 | Qui dolorem ipsum quia dolor sit amet, consectetur, adipisci | |
1592 | velit, sed quia non numquam eius modi tempora incidunt ut | |
1593 | labore et dolore magnam aliquam quaerat voluptatem. | |
1594 | ||
1595 | Ut Enim | |
1596 | ||
210b36aa | 1597 | That is, there should be (at least roughly) equal spacing between |
8a93676d SB |
1598 | items as between paragraphs (although that spacing may well be less |
1599 | than the full height of a line of text). This leaves it to the reader | |
1600 | to use (con)textual cues to figure out whether the "Qui dolorem | |
1601 | ipsum..." paragraph applies to the "Quisquam Est" item or to all three | |
1602 | items "Neque", "Porro", and "Quisquam Est". While not an ideal | |
1603 | situation, this is preferable to providing formatting cues that may | |
1604 | be actually contrary to the author's intent. | |
1605 | ||
1606 | =back | |
1607 | ||
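As noted in the list above, the kind of "=over" ... "=back" region can be determined from the first Pod paragraph after the "=over" command, and the I<indentlevel> can be checked against the pattern given earlier. The following Python sketch illustrates one way to do both. It is an assumption-laden illustration: the function names and the returned labels are hypothetical, and the patterns are Python translations of the Perl regexes quoted in this section (C<fullmatch> stands in for the C<\A>...C<\z> anchors).

```python
import re

INDENT = re.compile(r"(\d*\.)?\d+")          # spec: m/\A(\d*\.)?\d+\z/
NUMBERED = re.compile(r"=item\s+\d+\.?\s*")  # spec: m/\A=item\s+\d+\.?\s*\z/
BULLET = re.compile(r"=item\s+\*\s*")        # spec: m/\A=item\s+\*\s*\z/
BARE = re.compile(r"=item\s*")               # spec: m/\A=item\s*\z/

def over_region_type(first_paragraph):
    """Guess the region type from the first paragraph after '=over'.

    Returns one of the hypothetical labels 'numbered', 'bullet',
    'text' or 'block' (an itemless, blockquote-like region).
    """
    if NUMBERED.fullmatch(first_paragraph):
        return "numbered"
    # A bare "=item" must be tolerated as if it were "=item *".
    if BULLET.fullmatch(first_paragraph) or BARE.fullmatch(first_paragraph):
        return "bullet"
    if re.match(r"=item\s", first_paragraph):
        return "text"
    return "block"

def valid_indentlevel(value):
    """True if value matches the spec's pattern and is non-zero."""
    return bool(INDENT.fullmatch(value)) and float(value) > 0
```

The caller is assumed to have already skipped any "=cut" and "=pod" paragraphs before passing in the first paragraph.
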
1608 | ||
1609 | ||
1610 | =head1 About Data Paragraphs and "=begin/=end" Regions | |
1611 | ||
1612 | Data paragraphs are typically used for inlining non-Pod data that is | |
1613 | to be used (typically passed through) when rendering the document to | |
1614 | a specific format: | |
1615 | ||
1616 | =begin rtf | |
210b36aa | 1617 | |
8a93676d | 1618 | \par{\pard\qr\sa4500{\i Printed\~\chdate\~\chtime}\par} |
210b36aa | 1619 | |
8a93676d SB |
1620 | =end rtf |
1621 | ||
1622 | The exact same effect could, incidentally, be achieved with a single | |
1623 | "=for" paragraph: | |
1624 | ||
1625 | =for rtf \par{\pard\qr\sa4500{\i Printed\~\chdate\~\chtime}\par} | |
1626 | ||
1627 | (Although that is not formally a data paragraph, it has the same | |
1628 | meaning as one, and Pod parsers may parse it as one.) | |
1629 | ||
1630 | Another example of a data paragraph: | |
1631 | ||
1632 | =begin html | |
210b36aa | 1633 | |
8a93676d | 1634 | I like <em>PIE</em>! |
210b36aa | 1635 | |
8a93676d | 1636 | <hr>Especially pecan pie! |
210b36aa | 1637 | |
8a93676d SB |
1638 | =end html |
1639 | ||
1640 | If these were ordinary paragraphs, the Pod parser would try to | |
1641 | expand the "EE<lt>/em>" (in the first paragraph) as a formatting | |
1642 | code, just like "EE<lt>lt>" or "EE<lt>eacute>". But since this | |
1643 | is in a "=begin I<identifier>"..."=end I<identifier>" region I<and> | |
1644 | the identifier "html" doesn't have a ":" prefix, the contents | |
1645 | of this region are stored as data paragraphs, instead of being | |
1646 | processed as ordinary paragraphs (or if they began with spaces | |
1647 | and/or tabs, as verbatim paragraphs). | |
1648 | ||
1649 | As a further example: At time of writing, no "biblio" identifier is | |
1650 | supported, but suppose some processor were written to recognize it as | |
1651 | a way of (say) denoting a bibliographic reference (necessarily | |
1652 | containing formatting codes in ordinary paragraphs). The fact that | |
1653 | "biblio" paragraphs were meant for ordinary processing would be | |
1654 | indicated by prefacing each "biblio" identifier with a colon: | |
1655 | ||
1656 | =begin :biblio | |
1657 | ||
1658 | Wirth, Niklaus. 1976. I<Algorithms + Data Structures = | |
1659 | Programs.> Prentice-Hall, Englewood Cliffs, NJ. | |
1660 | ||
1661 | =end :biblio | |
1662 | ||
1663 | This would signal to the parser that paragraphs in this begin...end | |
1664 | region are subject to normal handling as ordinary/verbatim paragraphs | |
1665 | (while still tagged as meant only for processors that understand the | |
1666 | "biblio" identifier). The same effect could be had with: | |
1667 | ||
1668 | =for :biblio | |
1669 | Wirth, Niklaus. 1976. I<Algorithms + Data Structures = | |
1670 | Programs.> Prentice-Hall, Englewood Cliffs, NJ. | |
1671 | ||
1672 | The ":" on these identifiers means simply "process this stuff | |
1673 | normally, even though the result will be for some special target". | |
1674 | I suggest that parser APIs report "biblio" as the target identifier, | |
1675 | but also report that it had a ":" prefix. (And similarly, with the | |
1676 | above "html", report "html" as the target identifier, and note the | |
1677 | I<lack> of a ":" prefix.) | |
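
The suggestion above (report the bare target identifier, and separately report whether it had a ":" prefix) could look something like this in a parser API. This is only a sketch in Python: C<parse_target> is a hypothetical helper, not an existing interface, and it assumes it is handed the text of a "=begin", "=end" or "=for" command paragraph.

```python
def parse_target(command_paragraph):
    """Split '=begin :biblio' into ('begin', 'biblio', True).

    Returns (command, target identifier without the colon, had_colon).
    Assumes the paragraph starts with '=' followed by the command name.
    """
    parts = command_paragraph.split(None, 2)
    command = parts[0][1:]                       # drop the leading "="
    identifier = parts[1] if len(parts) > 1 else ""
    had_colon = identifier.startswith(":")
    target = identifier[1:] if had_colon else identifier
    return command, target, had_colon
```

A caller could then treat the region's contents as data paragraphs exactly when C<had_colon> is false, and as ordinary or verbatim paragraphs otherwise.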
1678 | ||
1679 | Note that a "=begin I<identifier>"..."=end I<identifier>" region where | |
1680 | I<identifier> begins with a colon, I<can> contain commands. For example: | |
1681 | ||
1682 | =begin :biblio | |
210b36aa | 1683 | |
8a93676d | 1684 | Wirth's classic is available in several editions, including: |
210b36aa | 1685 | |
8a93676d SB |
1686 | =for comment |
1687 | hm, check abebooks.com for how much used copies cost. | |
210b36aa | 1688 | |
8a93676d | 1689 | =over |
210b36aa | 1690 | |
8a93676d | 1691 | =item |
210b36aa | 1692 | |
8a93676d SB |
1693 | Wirth, Niklaus. 1975. I<Algorithmen und Datenstrukturen.> |
1694 | Teubner, Stuttgart. [Yes, it's in German.] | |
210b36aa | 1695 | |
8a93676d | 1696 | =item |
210b36aa | 1697 | |
8a93676d SB |
1698 | Wirth, Niklaus. 1976. I<Algorithms + Data Structures = |
1699 | Programs.> Prentice-Hall, Englewood Cliffs, NJ. | |
210b36aa | 1700 | |
8a93676d | 1701 | =back |
210b36aa | 1702 | |
8a93676d SB |
1703 | =end :biblio |
1704 | ||
1705 | Note, however, that a "=begin I<identifier>"..."=end I<identifier>" | |
1706 | region where I<identifier> does I<not> begin with a colon, should not | |
1707 | directly contain "=head1" ... "=head4" commands, nor "=over", nor "=back", | |
1708 | nor "=item". For example, this may be considered invalid: | |
1709 | ||
1710 | =begin somedata | |
210b36aa | 1711 | |
8a93676d | 1712 | This is a data paragraph. |
210b36aa | 1713 | |
8a93676d | 1714 | =head1 Don't do this! |
210b36aa | 1715 | |
8a93676d | 1716 | This is a data paragraph too. |
210b36aa | 1717 | |
8a93676d SB |
1718 | =end somedata |
1719 | ||
1720 | A Pod processor may signal that the above (specifically the "=head1" | |
1721 | paragraph) is an error. Note, however, that the following should | |
1722 | I<not> be treated as an error: | |
1723 | ||
1724 | =begin somedata | |
210b36aa | 1725 | |
8a93676d | 1726 | This is a data paragraph. |
210b36aa | 1727 | |
8a93676d | 1728 | =cut |
210b36aa | 1729 | |
8a93676d SB |
1730 | # Yup, this isn't Pod anymore. |
1731 | sub excl { (rand() > .5) ? "hoo!" : "hah!" } | |
210b36aa | 1732 | |
8a93676d | 1733 | =pod |
210b36aa | 1734 | |
8a93676d | 1735 | This is a data paragraph too. |
210b36aa | 1736 | |
8a93676d SB |
1737 | =end somedata |
1738 | ||
1739 | And this too is valid: | |
1740 | ||
1741 | =begin someformat | |
210b36aa | 1742 | |
8a93676d | 1743 | This is a data paragraph. |
210b36aa | 1744 | |
8a93676d | 1745 | And this is a data paragraph. |
210b36aa | 1746 | |
8a93676d | 1747 | =begin someotherformat |
210b36aa | 1748 | |
8a93676d | 1749 | This is a data paragraph too. |
210b36aa | 1750 | |
8a93676d | 1751 | And this is a data paragraph too. |
210b36aa | 1752 | |
8a93676d SB |
1753 | =begin :yetanotherformat |
1754 | ||
1755 | =head2 This is a command paragraph! | |
1756 | ||
1757 | This is an ordinary paragraph! | |
210b36aa | 1758 | |
8a93676d | 1759 | And this is a verbatim paragraph! |
210b36aa | 1760 | |
8a93676d | 1761 | =end :yetanotherformat |
210b36aa | 1762 | |
8a93676d | 1763 | =end someotherformat |
210b36aa | 1764 | |
8a93676d | 1765 | Another data paragraph! |
210b36aa | 1766 | |
8a93676d SB |
1767 | =end someformat |
1768 | ||
1769 | The contents of the above "=begin :yetanotherformat" ... | |
1770 | "=end :yetanotherformat" region I<aren't> data paragraphs, because | |
1771 | the immediately containing region's identifier (":yetanotherformat") | |
1772 | begins with a colon. In practice, most regions that contain | |
1773 | data paragraphs will contain I<only> data paragraphs; however, | |
1774 | the above nesting is syntactically valid as Pod, even if it is | |
1775 | rare. However, the handlers for some formats, like "html", | |
1776 | will accept only data paragraphs, not nested regions; and they may | |
1777 | complain if they see (targeted for them) nested regions, or commands, | |
1778 | other than "=end", "=pod", and "=cut". | |
1779 | ||
1780 | Also consider this valid structure: | |
1781 | ||
1782 | =begin :biblio | |
210b36aa | 1783 | |
8a93676d | 1784 | Wirth's classic is available in several editions, including: |
210b36aa | 1785 | |
8a93676d | 1786 | =over |
210b36aa | 1787 | |
8a93676d | 1788 | =item |
210b36aa | 1789 | |
8a93676d SB |
1790 | Wirth, Niklaus. 1975. I<Algorithmen und Datenstrukturen.> |
1791 | Teubner, Stuttgart. [Yes, it's in German.] | |
210b36aa | 1792 | |
8a93676d | 1793 | =item |
210b36aa | 1794 | |
8a93676d SB |
1795 | Wirth, Niklaus. 1976. I<Algorithms + Data Structures = |
1796 | Programs.> Prentice-Hall, Englewood Cliffs, NJ. | |
1797 | ||
1798 | =back | |
210b36aa | 1799 | |
8a93676d | 1800 | Buy buy buy! |
210b36aa | 1801 | |
8a93676d | 1802 | =begin html |
210b36aa | 1803 | |
8a93676d | 1804 | <img src='wirth_spokesmodeling_book.png'> |
210b36aa | 1805 | |
8a93676d | 1806 | <hr> |
210b36aa | 1807 | |
8a93676d | 1808 | =end html |
210b36aa | 1809 | |
8a93676d | 1810 | Now now now! |
210b36aa | 1811 | |
8a93676d SB |
1812 | =end :biblio |
1813 | ||
1814 | There, the "=begin html"..."=end html" region is nested inside | |
1815 | the larger "=begin :biblio"..."=end :biblio" region. Note that the | |
1816 | content of the "=begin html"..."=end html" region is data | |
1817 | paragraph(s), because the immediately containing region's identifier | |
1818 | ("html") I<doesn't> begin with a colon. | |
1819 | ||
1820 | Pod parsers, when processing a series of data paragraphs one | |
1821 | after another (within a single region), should consider them to | |
1822 | be one large data paragraph that happens to contain blank lines. So | |
1823 | the content of the above "=begin html"..."=end html" I<may> be stored | |
1824 | as two data paragraphs (one consisting of | |
1825 | "<img src='wirth_spokesmodeling_book.png'>\n" | |
1826 | and another consisting of "<hr>\n"), but I<should> be stored as | |
1827 | a single data paragraph (consisting of | |
1828 | "<img src='wirth_spokesmodeling_book.png'>\n\n<hr>\n"). | |
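
A minimal Python sketch of that merging step, assuming each data paragraph is held as a string ending in a single newline and that one blank line separated the paragraphs in the source (the function name is hypothetical):

```python
def merge_data_paragraphs(paragraphs):
    """Join consecutive data paragraphs into one, restoring the blank
    line that separated them (assumed here to be exactly one "\n")."""
    return "\n".join(paragraphs)

merged = merge_data_paragraphs([
    "<img src='wirth_spokesmodeling_book.png'>\n",
    "<hr>\n",
])
```

Here C<merged> is the single-paragraph form described above.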
1829 | ||
1830 | Pod processors should tolerate empty | |
1831 | "=begin I<something>"..."=end I<something>" regions, | |
1832 | empty "=begin :I<something>"..."=end :I<something>" regions, and | |
1833 | contentless "=for I<something>" and "=for :I<something>" | |
1834 | paragraphs. I.e., these should be tolerated: | |
1835 | ||
1836 | =for html | |
210b36aa | 1837 | |
8a93676d | 1838 | =begin html |
210b36aa | 1839 | |
8a93676d | 1840 | =end html |
210b36aa | 1841 | |
8a93676d | 1842 | =begin :biblio |
210b36aa | 1843 | |
8a93676d SB |
1844 | =end :biblio |
1845 | ||
1846 | Incidentally, note that there's no easy way to express a data | |
1847 | paragraph starting with something that looks like a command. Consider: | |
1848 | ||
1849 | =begin stuff | |
210b36aa | 1850 | |
8a93676d | 1851 | =shazbot |
210b36aa | 1852 | |
8a93676d SB |
1853 | =end stuff |
1854 | ||
1855 | There, "=shazbot" will be parsed as a Pod command "shazbot", not as a data | |
1856 | paragraph "=shazbot\n". However, you can express a data paragraph consisting | |
1857 | of "=shazbot\n" using this code: | |
1858 | ||
1859 | =for stuff =shazbot | |
1860 | ||
1861 | The situation where this is necessary is presumably quite rare. | |
1862 | ||
1863 | Note that =end commands must match the currently open =begin command. That | |
1864 | is, they must properly nest. For example, this is valid: | |
1865 | ||
1866 | =begin outer | |
210b36aa | 1867 | |
8a93676d | 1868 | X |
210b36aa | 1869 | |
8a93676d | 1870 | =begin inner |
210b36aa | 1871 | |
8a93676d | 1872 | Y |
210b36aa | 1873 | |
8a93676d | 1874 | =end inner |
210b36aa | 1875 | |
8a93676d | 1876 | Z |
210b36aa | 1877 | |
8a93676d SB |
1878 | =end outer |
1879 | ||
1880 | while this is invalid: | |
1881 | ||
1882 | =begin outer | |
210b36aa | 1883 | |
8a93676d | 1884 | X |
210b36aa | 1885 | |
8a93676d | 1886 | =begin inner |
210b36aa | 1887 | |
8a93676d | 1888 | Y |
210b36aa | 1889 | |
8a93676d | 1890 | =end outer |
210b36aa | 1891 | |
8a93676d | 1892 | Z |
210b36aa | 1893 | |
8a93676d | 1894 | =end inner |
210b36aa | 1895 | |
8a93676d SB |
1896 | This latter is improper because when the "=end outer" command is seen, the |
1897 | currently open region has the formatname "inner", not "outer". (It just | |
1898 | happens that "outer" is the format name of a higher-up region.) This is | |
1899 | an error. Processors must by default report this as an error, and may halt | |
210b36aa | 1900 | processing the document containing that error. A corollary of this is that |
ac036724 | 1901 | regions cannot "overlap". That is, the latter block above does not represent |
8a93676d SB |
1902 | a region called "outer" which contains X and Y, overlapping a region called |
1903 | "inner" which contains Y and Z. But because it is invalid (as all | |
1904 | apparently overlapping regions would be), it doesn't represent that, or | |
1905 | anything at all. | |
1906 | ||
1907 | Similarly, this is invalid: | |
1908 | ||
1909 | =begin thing | |
210b36aa | 1910 | |
8a93676d SB |
1911 | =end hting |
1912 | ||
1913 | This is an error because the region is opened by "thing", and the "=end" | |
1914 | tries to close "hting" [sic]. | |
1915 | ||
1916 | This is also invalid: | |
1917 | ||
1918 | =begin thing | |
210b36aa | 1919 | |
8a93676d SB |
1920 | =end |
1921 | ||
1922 | This is invalid because every "=end" command must have a formatname | |
1923 | parameter. | |
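
The nesting rules above amount to a stack discipline, which the following Python sketch illustrates. It is not a definitive implementation: the function name and the (command, identifier) input shape are assumptions, it stops at the first error (processors may halt on such an error), and it does not address regions left open at the end of the document.

```python
def first_nesting_error(commands):
    """Check =begin/=end nesting.

    commands is a list of (command, identifier) pairs such as
    ("begin", "outer"). Returns None if every =end matches the
    currently open =begin, otherwise a message for the first error.
    """
    open_regions = []
    for number, (command, identifier) in enumerate(commands, 1):
        if command == "begin":
            open_regions.append(identifier)
        elif command == "end":
            if not identifier:
                return f"command {number}: =end has no formatname"
            if not open_regions:
                return f"command {number}: =end {identifier} with no open region"
            if open_regions[-1] != identifier:
                return (f"command {number}: =end {identifier} does not match "
                        f"open =begin {open_regions[-1]}")
            open_regions.pop()
    return None
```

Applied to the examples above, the properly nested "outer"/"inner" sequence yields no error, while the overlapping one, the "thing"/"hting" mismatch, and the "=end" without a formatname each yield a message.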
1924 | ||
1925 | =head1 SEE ALSO | |
1926 | ||
1927 | L<perlpod>, L<perlsyn/"PODs: Embedded Documentation">, | |
1928 | L<podchecker> | |
1929 | ||
1930 | =head1 AUTHOR | |
1931 | ||
1932 | Sean M. Burke | |
1933 | ||
1934 | =cut | |
1935 | ||
1936 |