| 1 | =encoding utf8 |
| 2 | |
| 3 | =head1 NAME |
| 4 | |
| 5 | perlpodspec - Plain Old Documentation: format specification and notes |
| 6 | |
| 7 | =head1 DESCRIPTION |
| 8 | |
| 9 | This document contains detailed notes on the Pod markup language. Most |
| 10 | people will only have to read L<perlpod|perlpod> to know how to write |
| 11 | in Pod, but this document may answer some incidental questions to do |
| 12 | with parsing and rendering Pod. |
| 13 | |
| 14 | In this document, "must" / "must not", "should" / |
| 15 | "should not", and "may" have their conventional (cf. RFC 2119) |
| 16 | meanings: "X must do Y" means that if X doesn't do Y, it's against |
| 17 | this specification, and should really be fixed. "X should do Y" |
| 18 | means that it's recommended, but X may fail to do Y, if there's a |
| 19 | good reason. "X may do Y" is merely a note that X can do Y at |
| 20 | will (although it is up to the reader to detect any connotation of |
| 21 | "and I think it would be I<nice> if X did Y" versus "it wouldn't |
| 22 | really I<bother> me if X did Y"). |
| 23 | |
| 24 | Notably, when I say "the parser should do Y", the |
| 25 | parser may fail to do Y, if the calling application explicitly |
| 26 | requests that the parser I<not> do Y. I often phrase this as |
| 27 | "the parser should, by default, do Y." This doesn't I<require> |
| 28 | the parser to provide an option for turning off whatever |
| 29 | feature Y is (like expanding tabs in verbatim paragraphs), although |
| 30 | it implies that such an option I<may> be provided. |
| 31 | |
| 32 | =head1 Pod Definitions |
| 33 | |
| 34 | Pod is embedded in files, typically Perl source files, although you |
| 35 | can write a file that's nothing but Pod. |
| 36 | |
| 37 | A B<line> in a file consists of zero or more non-newline characters, |
| 38 | terminated by either a newline or the end of the file. |
| 39 | |
| 40 | A B<newline sequence> is usually a platform-dependent concept, but |
| 41 | Pod parsers should understand it to mean any of CR (ASCII 13), LF |
| 42 | (ASCII 10), or a CRLF (ASCII 13 followed immediately by ASCII 10), in |
| 43 | addition to any other system-specific meaning. The first CR/CRLF/LF |
| 44 | sequence in the file may be used as the basis for identifying the |
| 45 | newline sequence for parsing the rest of the file. |
| 46 | |
| 47 | A B<blank line> is a line consisting entirely of zero or more spaces |
| 48 | (ASCII 32) or tabs (ASCII 9), and terminated by a newline or end-of-file. |
| 49 | A B<non-blank line> is a line containing one or more characters other |
| 50 | than space or tab (and terminated by a newline or end-of-file). |
| 51 | |
| 52 | (I<Note:> Many older Pod parsers did not accept a line consisting of |
| 53 | spaces/tabs and then a newline as a blank line. The only lines they |
| 54 | considered blank were lines consisting of I<no characters at all>, |
| 55 | terminated by a newline.) |
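
The line and blank-line definitions above can be sketched in Perl as
follows. This is only an illustration, not part of any existing parser:
the names C<split_pod_lines> and C<is_blank_line> are hypothetical, and
the sketch accepts CR, LF, and CRLF anywhere in the input rather than
fixing on the first newline sequence it sees.

```perl
use strict;
use warnings;

# Hypothetical helper: split raw file content into lines, accepting
# CR, LF, or CRLF as the newline sequence. CRLF is tried first so that
# it is not read as two separate newlines.
sub split_pod_lines {
    my ($content) = @_;
    my @lines = split /\x0D\x0A|\x0D|\x0A/, $content, -1;
    # A final newline leaves one empty trailing field; drop it, since
    # the last line may be terminated by either a newline or end-of-file.
    pop @lines if @lines && $lines[-1] eq '';
    return @lines;
}

# Hypothetical helper: a blank line consists entirely of zero or more
# spaces (ASCII 32) or tabs (ASCII 9).
sub is_blank_line {
    my ($line) = @_;
    return $line =~ /\A[ \t]*\z/ ? 1 : 0;
}
```

A parser that wants the stricter behavior of older Pod parsers could
test for the empty string instead of calling C<is_blank_line>.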
| 56 | |
| 57 | B<Whitespace> is used in this document as a blanket term for spaces, |
| 58 | tabs, and newline sequences. (By itself, this term usually refers |
| 59 | to literal whitespace. That is, sequences of whitespace characters |
| 60 | in Pod source, as opposed to "EE<lt>32>", which is a formatting |
| 61 | code that I<denotes> a whitespace character.) |
| 62 | |
| 63 | A B<Pod parser> is a module meant for parsing Pod (regardless of |
| 64 | whether this involves calling callbacks or building a parse tree or |
| 65 | directly formatting it). A B<Pod formatter> (or B<Pod translator>) |
| 66 | is a module or program that converts Pod to some other format (HTML, |
| 67 | plaintext, TeX, PostScript, RTF). A B<Pod processor> might be a |
| 68 | formatter or translator, or might be a program that does something |
| 69 | else with the Pod (like counting words, scanning for index points, |
| 70 | etc.). |
| 71 | |
| 72 | Pod content is contained in B<Pod blocks>. A Pod block starts with a |
| 73 | line that matches C<m/\A=[a-zA-Z]/>, and continues up to the next line |
| 74 | that matches C<m/\A=cut/> or up to the end of the file if there is |
| 75 | no C<m/\A=cut/> line. |
| 76 | |
| 77 | =for comment |
| 78 | The current perlsyn says: |
| 79 | [beginquote] |
| 80 | Note that pod translators should look at only paragraphs beginning |
| 81 | with a pod directive (it makes parsing easier), whereas the compiler |
| 82 | actually knows to look for pod escapes even in the middle of a |
| 83 | paragraph. This means that the following secret stuff will be ignored |
| 84 | by both the compiler and the translators. |
| 85 | $a=3; |
| 86 | =secret stuff |
| 87 | warn "Neither POD nor CODE!?" |
| 88 | =cut back |
| 89 | print "got $a\n"; |
| 90 | You probably shouldn't rely upon the warn() being podded out forever. |
| 91 | Not all pod translators are well-behaved in this regard, and perhaps |
| 92 | the compiler will become pickier. |
| 93 | [endquote] |
| 94 | I think that those paragraphs should just be removed; paragraph-based |
| 95 | parsing seems to have been largely abandoned, because of the hassle |
| 96 | with non-empty blank lines messing up what people meant by "paragraph". |
| 97 | Even if the "it makes parsing easier" bit were especially true, |
| 98 | it wouldn't be worth the confusion of having perl and pod2whatever |
| 99 | actually disagree on what can constitute a Pod block. |
| 100 | |
| 101 | Note that a parser is not expected to recognize when something that |
| 102 | looks like Pod is actually inside a quoted string, such as a here-document. |
| 103 | |
| 104 | Within a Pod block, there are B<Pod paragraphs>. A Pod paragraph |
| 105 | consists of non-blank lines of text, separated by one or more blank |
| 106 | lines. |
| 107 | |
| 108 | For purposes of Pod processing, there are four types of paragraphs in |
| 109 | a Pod block: |
| 110 | |
| 111 | =over |
| 112 | |
| 113 | =item * |
| 114 | |
| 115 | A command paragraph (also called a "directive"). The first line of |
| 116 | this paragraph must match C<m/\A=[a-zA-Z]/>. Command paragraphs are |
| 117 | typically one line, as in: |
| 118 | |
| 119 | =head1 NOTES |
| 120 | |
| 121 | |
| 122 | |
| 123 | But they may span several (non-blank) lines: |
| 124 | |
| 125 | =for comment |
| 126 | Hm, I wonder what it would look like if |
| 127 | you tried to write a BNF for Pod from this. |
| 128 | |
| 129 | =head3 Dr. Strangelove, or: How I Learned to |
| 130 | Stop Worrying and Love the Bomb |
| 131 | |
| 132 | I<Some> command paragraphs allow formatting codes in their content |
| 133 | (i.e., after the part that matches C<m/\A=[a-zA-Z]\S*\s*/>), as in: |
| 134 | |
| 135 | =head1 Did You Remember to C<use strict;>? |
| 136 | |
| 137 | In other words, the Pod processing handler for "head1" will apply the |
| 138 | same processing to "Did You Remember to CE<lt>use strict;>?" that it |
| 139 | would to an ordinary paragraph (i.e., formatting codes like |
| 140 | "CE<lt>...>" are parsed and presumably formatted appropriately, and |
| 141 | whitespace in the form of literal spaces and/or tabs is not |
| 142 | significant). |
| 143 | |
| 144 | =item * |
| 145 | |
| 146 | A B<verbatim paragraph>. The first line of this paragraph must begin with a |
| 147 | literal space or tab, and this paragraph must not be inside a "=begin |
| 148 | I<identifier>", ... "=end I<identifier>" sequence unless |
| 149 | "I<identifier>" begins with a colon (":"). That is, if a paragraph |
| 150 | starts with a literal space or tab, but I<is> inside a |
| 151 | "=begin I<identifier>", ... "=end I<identifier>" region, then it's |
| 152 | a data paragraph, unless "I<identifier>" begins with a colon. |
| 153 | |
| 154 | Whitespace I<is> significant in verbatim paragraphs (although, in |
| 155 | processing, tabs are probably expanded). |
| 156 | |
| 157 | =item * |
| 158 | |
| 159 | An B<ordinary paragraph>. A paragraph is an ordinary paragraph |
| 160 | if its first line matches neither C<m/\A=[a-zA-Z]/> nor |
| 161 | C<m/\A[ \t]/>, I<and> if it's not inside a "=begin I<identifier>", |
| 162 | ... "=end I<identifier>" sequence unless "I<identifier>" begins with |
| 163 | a colon (":"). |
| 164 | |
| 165 | =item * |
| 166 | |
| 167 | A B<data paragraph>. This is a paragraph that I<is> inside a "=begin |
| 168 | I<identifier>" ... "=end I<identifier>" sequence where |
| 169 | "I<identifier>" does I<not> begin with a literal colon (":"). In |
| 170 | some sense, a data paragraph is not part of Pod at all (i.e., |
| 171 | effectively it's "out-of-band"), since it's not subject to most kinds |
| 172 | of Pod parsing; but it is specified here, since Pod |
| 173 | parsers need to be able to call an event for it, or store it in some |
| 174 | form in a parse tree, or at least just parse I<around> it. |
| 175 | |
| 176 | =back |
| 177 | |
| 178 | For example: consider the following paragraphs: |
| 179 | |
| 180 | # <- that's the 0th column |
| 181 | |
| 182 | =head1 Foo |
| 183 | |
| 184 | Stuff |
| 185 | |
| 186 | $foo->bar |
| 187 | |
| 188 | =cut |
| 189 | |
| 190 | Here, "=head1 Foo" and "=cut" are command paragraphs because the first |
| 191 | line of each matches C<m/\A=[a-zA-Z]/>. "I<[space][space]>$foo->bar" |
| 192 | is a verbatim paragraph, because its first line starts with a literal |
| 193 | whitespace character (and there's no "=begin"..."=end" region around). |
| 194 | |
| 195 | The "=begin I<identifier>" ... "=end I<identifier>" commands stop |
| 196 | paragraphs that they surround from being parsed as ordinary or verbatim |
| 197 | paragraphs, if I<identifier> doesn't begin with a colon. This |
| 198 | is discussed in detail in the section |
| 199 | L</About Data Paragraphs and "=beginE<sol>=end" Regions>. |
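
As a rough sketch of the four paragraph types described in this section,
the following Perl function classifies a paragraph from its first line.
The function name and the convention of passing the identifier of the
innermost enclosing "=begin" region (or undef) are assumptions made for
this illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: classify one paragraph given its first line and
# the identifier of the innermost enclosing "=begin" region (undef when
# the paragraph is not inside any such region).
sub classify_paragraph {
    my ($first_line, $region) = @_;

    # Command paragraphs are recognized first, inside regions too.
    return 'command' if $first_line =~ m/\A=[a-zA-Z]/;

    # Inside a region whose identifier does not begin with a colon,
    # every non-command paragraph is a data paragraph.
    return 'data' if defined $region && $region !~ m/\A:/;

    # Otherwise a leading literal space or tab means verbatim.
    return 'verbatim' if $first_line =~ m/\A[ \t]/;

    return 'ordinary';
}
```

With the example above, "=head1 Foo" classifies as a command paragraph,
"Stuff" as ordinary, and the indented "$foo->bar" as verbatim; the same
indented line inside a "=begin html" region would be a data paragraph.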
| 200 | |
| 201 | =head1 Pod Commands |
| 202 | |
| 203 | This section is intended to supplement and clarify the discussion in |
| 204 | L<perlpod/"Command Paragraph">. These are the currently recognized |
| 205 | Pod commands: |
| 206 | |
| 207 | =over |
| 208 | |
| 209 | =item "=head1", "=head2", "=head3", "=head4" |
| 210 | |
| 211 | This command indicates that the text in the remainder of the paragraph |
| 212 | is a heading. That text may contain formatting codes. Examples: |
| 213 | |
| 214 | =head1 Object Attributes |
| 215 | |
| 216 | =head3 What B<Not> to Do! |
| 217 | |
| 218 | =item "=pod" |
| 219 | |
| 220 | This command indicates that this paragraph begins a Pod block. (If we |
| 221 | are already in the middle of a Pod block, this command has no effect at |
| 222 | all.) If there is any text in this command paragraph after "=pod", |
| 223 | it must be ignored. Examples: |
| 224 | |
| 225 | =pod |
| 226 | |
| 227 | This is a plain Pod paragraph. |
| 228 | |
| 229 | =pod This text is ignored. |
| 230 | |
| 231 | =item "=cut" |
| 232 | |
| 233 | This command indicates that this line is the end of this previously |
| 234 | started Pod block. If there is any text after "=cut" on the line, it must be |
| 235 | ignored. Examples: |
| 236 | |
| 237 | =cut |
| 238 | |
| 239 | =cut The documentation ends here. |
| 240 | |
| 241 | =cut |
| 242 | # This is the first line of program text. |
| 243 | sub foo { # This is the second. |
| 244 | |
| 245 | It is an error to try to I<start> a Pod block with a "=cut" command. In |
| 246 | that case, the Pod processor must halt parsing of the input file, and |
| 247 | must by default emit a warning. |
| 248 | |
| 249 | =item "=over" |
| 250 | |
| 251 | This command indicates that this is the start of a list/indent |
| 252 | region. If there is any text following the "=over", it must consist |
| 253 | of only a nonzero positive numeral. The semantics of this numeral are |
| 254 | explained in the L</"About =over...=back Regions"> section, further |
| 255 | below. Formatting codes are not expanded. Examples: |
| 256 | |
| 257 | =over 3 |
| 258 | |
| 259 | =over 3.5 |
| 260 | |
| 261 | =over |
| 262 | |
| 263 | =item "=item" |
| 264 | |
| 265 | This command indicates that an item in a list begins here. Formatting |
| 266 | codes are processed. The semantics of the (optional) text in the |
| 267 | remainder of this paragraph are |
| 268 | explained in the L</"About =over...=back Regions"> section, further |
| 269 | below. Examples: |
| 270 | |
| 271 | =item |
| 272 | |
| 273 | =item * |
| 274 | |
| 275 | =item * |
| 276 | |
| 277 | =item 14 |
| 278 | |
| 279 | =item 3. |
| 280 | |
| 281 | =item C<< $thing->stuff(I<dodad>) >> |
| 282 | |
| 283 | =item For transporting us beyond seas to be tried for pretended |
| 284 | offenses |
| 285 | |
| 286 | =item He is at this time transporting large armies of foreign |
| 287 | mercenaries to complete the works of death, desolation and |
| 288 | tyranny, already begun with circumstances of cruelty and perfidy |
| 289 | scarcely paralleled in the most barbarous ages, and totally |
| 290 | unworthy the head of a civilized nation. |
| 291 | |
| 292 | =item "=back" |
| 293 | |
| 294 | This command indicates that this is the end of the region begun |
| 295 | by the most recent "=over" command. It permits no text after the |
| 296 | "=back" command. |
| 297 | |
| 298 | =item "=begin formatname" |
| 299 | |
| 300 | =item "=begin formatname parameter" |
| 301 | |
| 302 | This marks the following paragraphs (until the matching "=end |
| 303 | formatname") as being for some special kind of processing. Unless |
| 304 | "formatname" begins with a colon, the contained non-command |
| 305 | paragraphs are data paragraphs. But if "formatname" I<does> begin |
| 306 | with a colon, then non-command paragraphs are ordinary paragraphs |
| 307 | or verbatim paragraphs. This is discussed in detail in the section |
| 308 | L</About Data Paragraphs and "=beginE<sol>=end" Regions>. |
| 309 | |
| 310 | It is advised that formatnames match the regexp |
| 311 | C<m/\A:?[-a-zA-Z0-9_]+\z/>. Everything following whitespace after the |
| 312 | formatname is a parameter that may be used by the formatter when dealing |
| 313 | with this region. This parameter must not be repeated in the "=end" |
| 314 | paragraph. Implementors should anticipate future expansion in the |
| 315 | semantics and syntax of the first parameter to "=begin"/"=end"/"=for". |
| 316 | |
| 317 | =item "=end formatname" |
| 318 | |
| 319 | This marks the end of the region opened by the matching |
| 320 | "=begin formatname" command. If "formatname" is not the formatname |
| 321 | of the most recent open "=begin formatname" region, then this |
| 322 | is an error, and must generate an error message. This |
| 323 | is discussed in detail in the section |
| 324 | L</About Data Paragraphs and "=beginE<sol>=end" Regions>. |
| 325 | |
| 326 | =item "=for formatname text..." |
| 327 | |
| 328 | This is synonymous with: |
| 329 | |
| 330 | =begin formatname |
| 331 | |
| 332 | text... |
| 333 | |
| 334 | =end formatname |
| 335 | |
| 336 | That is, it creates a region consisting of a single paragraph; that |
| 337 | paragraph is to be treated as an ordinary paragraph if "formatname" |
| 338 | begins with a ":"; if "formatname" I<doesn't> begin with a colon, |
| 339 | then "text..." will constitute a data paragraph. There is no way |
| 340 | to use "=for formatname text..." to express "text..." as a verbatim |
| 341 | paragraph. |
| 342 | |
| 343 | =item "=encoding encodingname" |
| 344 | |
| 345 | This command, which should occur early in the document (at least |
| 346 | before any non-US-ASCII data!), declares that this document is |
| 347 | encoded in the encoding I<encodingname>, which must be |
| 348 | an encoding name that L<Encode> recognizes. (Encode's list |
| 349 | of supported encodings, in L<Encode::Supported>, is useful here.) |
| 350 | If the Pod parser cannot decode the declared encoding, it |
| 351 | should emit a warning and may abort parsing the document |
| 352 | altogether. |
| 353 | |
| 354 | A document having more than one "=encoding" line should be |
| 355 | considered an error. Pod processors may silently tolerate this if |
| 356 | the not-first "=encoding" lines are just duplicates of the |
| 357 | first one (e.g., if there's a "=encoding utf8" line, and later on |
| 358 | another "=encoding utf8" line). But Pod processors should complain if |
| 359 | there are contradictory "=encoding" lines in the same document |
| 360 | (e.g., if there is a "=encoding utf8" early in the document and |
| 361 | "=encoding big5" later). Pod processors that recognize BOMs |
| 362 | may also complain if they see an "=encoding" line |
| 363 | that contradicts the BOM (e.g., if a document with a UTF-16LE |
| 364 | BOM has an "=encoding shiftjis" line). |
| 365 | |
| 366 | =back |
| 367 | |
| 368 | If a Pod processor sees any command other than the ones listed |
| 369 | above (like "=head", or "=haed1", or "=stuff", or "=cuttlefish", |
| 370 | or "=w123"), that processor must by default treat this as an |
| 371 | error. It must not process the paragraph beginning with that |
| 372 | command, must by default warn of this as an error, and may |
| 373 | abort the parse. A Pod parser may allow a way for particular |
| 374 | applications to add to the above list of known commands, and to |
| 375 | stipulate, for each additional command, whether formatting |
| 376 | codes should be processed. |
| 377 | |
| 378 | Future versions of this specification may add additional |
| 379 | commands. |
| 380 | |
| 381 | |
| 382 | |
| 383 | =head1 Pod Formatting Codes |
| 384 | |
| 385 | (Note that in previous drafts of this document and of perlpod, |
| 386 | formatting codes were referred to as "interior sequences", and |
| 387 | this term may still be found in the documentation for Pod parsers, |
| 388 | and in error messages from Pod processors.) |
| 389 | |
| 390 | There are two syntaxes for formatting codes: |
| 391 | |
| 392 | =over |
| 393 | |
| 394 | =item * |
| 395 | |
| 396 | A formatting code starts with a capital letter (just US-ASCII [A-Z]) |
| 397 | followed by a "<", any number of characters, and ending with the first |
| 398 | matching ">". Examples: |
| 399 | |
| 400 | That's what I<you> think! |
| 401 | |
| 402 | What's C<CORE::dump()> for? |
| 403 | |
| 404 | X<C<chmod> and C<unlink()> Under Different Operating Systems> |
| 405 | |
| 406 | =item * |
| 407 | |
| 408 | A formatting code starts with a capital letter (just US-ASCII [A-Z]) |
| 409 | followed by two or more "<"'s, one or more whitespace characters, |
| 410 | any number of characters, one or more whitespace characters, |
| 411 | and ending with the first matching sequence of two or more ">"'s, where |
| 412 | the number of ">"'s equals the number of "<"'s in the opening of this |
| 413 | formatting code. Examples: |
| 414 | |
| 415 | That's what I<< you >> think! |
| 416 | |
| 417 | C<<< open(X, ">>thing.dat") || die $! >>> |
| 418 | |
| 419 | B<< $foo->bar(); >> |
| 420 | |
| 421 | With this syntax, the whitespace character(s) after the "CE<lt><<" |
| 422 | (or whatever letter) and before the ">>>" are I<not> renderable. They |
| 423 | do not signify whitespace, but are merely part of the formatting codes |
| 424 | themselves. That is, these are all synonymous: |
| 425 | |
| 426 | C<thing> |
| 427 | C<< thing >> |
| 428 | C<<           thing     >> |
| 429 | C<<< thing >>> |
| 430 | C<<<< |
| 431 | thing |
| 432 | >>>> |
| 433 | |
| 434 | and so on. |
| 435 | |
| 436 | Finally, the multiple-angle-bracket form does I<not> alter the interpretation |
| 437 | of nested formatting codes, meaning that the following four example lines are |
| 438 | identical in meaning: |
| 439 | |
| 440 | B<example: C<$a E<lt>=E<gt> $b>> |
| 441 | |
| 442 | B<example: C<< $a <=> $b >>> |
| 443 | |
| 444 | B<example: C<< $a E<lt>=E<gt> $b >>> |
| 445 | |
| 446 | B<<< example: C<< $a E<lt>=E<gt> $b >> >>> |
| 447 | |
| 448 | =back |
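
To make the delimiter rule of the second syntax concrete, here is a
deliberately limited Perl sketch. It handles only a single
multiple-angle-bracket code at the start of a string and does not
attempt nested codes, so it is not a substitute for a real parser; the
function name C<parse_bracketed_code> is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: parse one multiple-angle-bracket formatting code
# at the start of $text. Returns (letter, content) on success, or an
# empty list if $text does not start with such a code. Nested codes are
# NOT handled; this only illustrates the delimiter-matching rule.
sub parse_bracketed_code {
    my ($text) = @_;

    # Capital letter, two or more "<", then mandatory whitespace.
    return unless $text =~ m/\A([A-Z])(<{2,})\s+/;
    my ($letter, $count) = ($1, length $2);
    my $rest = substr $text, $+[0];

    # Content runs up to the first whitespace followed by the same
    # number of ">" as there were "<" in the opening.
    return unless $rest =~ m/\A(.*?)\s+>{$count}/s;
    return ($letter, $1);
}
```

The whitespace next to the delimiters is consumed as part of the code
itself, so "CE<lt>E<lt> thing E<gt>E<gt>" yields the content "thing"
with no surrounding spaces.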
| 449 | |
| 450 | In parsing Pod, a notably tricky part is the correct parsing of |
| 451 | (potentially nested!) formatting codes. Implementors should |
| 452 | consult the code in the C<parse_text> routine in Pod::Parser as an |
| 453 | example of a correct implementation. |
| 454 | |
| 455 | =over |
| 456 | |
| 457 | =item C<IE<lt>textE<gt>> -- italic text |
| 458 | |
| 459 | See the brief discussion in L<perlpod/"Formatting Codes">. |
| 460 | |
| 461 | =item C<BE<lt>textE<gt>> -- bold text |
| 462 | |
| 463 | See the brief discussion in L<perlpod/"Formatting Codes">. |
| 464 | |
| 465 | =item C<CE<lt>codeE<gt>> -- code text |
| 466 | |
| 467 | See the brief discussion in L<perlpod/"Formatting Codes">. |
| 468 | |
| 469 | =item C<FE<lt>filenameE<gt>> -- style for filenames |
| 470 | |
| 471 | See the brief discussion in L<perlpod/"Formatting Codes">. |
| 472 | |
| 473 | =item C<XE<lt>topic nameE<gt>> -- an index entry |
| 474 | |
| 475 | See the brief discussion in L<perlpod/"Formatting Codes">. |
| 476 | |
| 477 | This code is unusual in that most formatters completely discard |
| 478 | this code and its content. Other formatters will render it with |
| 479 | invisible codes that can be used in building an index of |
| 480 | the current document. |
| 481 | |
| 482 | =item C<ZE<lt>E<gt>> -- a null (zero-effect) formatting code |
| 483 | |
| 484 | Discussed briefly in L<perlpod/"Formatting Codes">. |
| 485 | |
| 486 | This code is unusual in that it should have no content. That is, |
| 487 | a processor may complain if it sees C<ZE<lt>potatoesE<gt>>. Whether |
| 488 | or not it complains, the I<potatoes> text should be ignored. |
| 489 | |
| 490 | =item C<LE<lt>nameE<gt>> -- a hyperlink |
| 491 | |
| 492 | The complicated syntaxes of this code are discussed at length in |
| 493 | L<perlpod/"Formatting Codes">, and implementation details are |
| 494 | discussed below, in L</"About LE<lt>...E<gt> Codes">. Parsing the |
| 495 | contents of LE<lt>content> is tricky. Notably, the content has to be |
| 496 | checked for whether it looks like a URL, or whether it has to be split |
| 497 | on literal "|" and/or "/" (in the right order!), and so on, |
| 498 | I<before> EE<lt>...> codes are resolved. |
| 499 | |
| 500 | =item C<EE<lt>escapeE<gt>> -- a character escape |
| 501 | |
| 502 | See L<perlpod/"Formatting Codes">, and several points in |
| 503 | L</Notes on Implementing Pod Processors>. |
| 504 | |
| 505 | =item C<SE<lt>textE<gt>> -- text contains non-breaking spaces |
| 506 | |
| 507 | This formatting code is syntactically simple, but semantically |
| 508 | complex. What it means is that each space in the printable |
| 509 | content of this code signifies a non-breaking space. |
| 510 | |
| 511 | Consider: |
| 512 | |
| 513 | C<$x ? $y : $z> |
| 514 | |
| 515 | S<C<$x ? $y : $z>> |
| 516 | |
| 517 | Both signify the monospace (c[ode] style) text consisting of |
| 518 | "$x", one space, "?", one space, "$y", one space, ":", one space, "$z". |
| 519 | The difference is that in the latter, with the S code, those spaces |
| 520 | are not "normal" spaces, but instead are non-breaking spaces. |
| 521 | |
| 522 | =back |
| 523 | |
| 524 | |
| 525 | If a Pod processor sees any formatting code other than the ones |
| 526 | listed above (as in "NE<lt>...>", or "QE<lt>...>", etc.), that |
| 527 | processor must by default treat this as an error. |
| 528 | A Pod parser may allow a way for particular |
| 529 | applications to add to the above list of known formatting codes; |
| 530 | a Pod parser might even allow a way to stipulate, for each additional |
| 531 | formatting code, whether it requires some form of special processing, as |
| 532 | LE<lt>...> does. |
| 533 | |
| 534 | Future versions of this specification may add additional |
| 535 | formatting codes. |
| 536 | |
| 537 | Historical note: A few older Pod processors would not see a ">" as |
| 538 | closing a "CE<lt>" code, if the ">" was immediately preceded by |
| 539 | a "-". This was so that this: |
| 540 | |
| 541 | C<$foo->bar> |
| 542 | |
| 543 | would parse as equivalent to this: |
| 544 | |
| 545 | C<$foo-E<gt>bar> |
| 546 | |
| 547 | instead of as equivalent to a "C" formatting code containing |
| 548 | only "$foo-", and then a "bar>" outside the "C" formatting code. This |
| 549 | problem has since been solved by the addition of syntaxes like this: |
| 550 | |
| 551 | C<< $foo->bar >> |
| 552 | |
| 553 | Compliant parsers must not treat "->" as special. |
| 554 | |
| 555 | Formatting codes absolutely cannot span paragraphs. If a code is |
| 556 | opened in one paragraph, and no closing code is found by the end of |
| 557 | that paragraph, the Pod parser must close that formatting code, |
| 558 | and should complain (as in "Unterminated I code in the paragraph |
| 559 | starting at line 123: 'Time objects are not...'"). So these |
| 560 | two paragraphs: |
| 561 | |
| 562 | I<I told you not to do this! |
| 563 | |
| 564 | Don't make me say it again!> |
| 565 | |
| 566 | ...must I<not> be parsed as two paragraphs in italics (with the I |
| 567 | code starting in one paragraph and ending in another). Instead, |
| 568 | the first paragraph should generate a warning, but that aside, the |
| 569 | above code must parse as if it were: |
| 570 | |
| 571 | I<I told you not to do this!> |
| 572 | |
| 573 | Don't make me say it again!E<gt> |
| 574 | |
| 575 | (In SGMLish jargon, all Pod commands are like block-level |
| 576 | elements, whereas all Pod formatting codes are like inline-level |
| 577 | elements.) |
| 578 | |
| 579 | |
| 580 | |
| 581 | =head1 Notes on Implementing Pod Processors |
| 582 | |
| 583 | The following is a long section of miscellaneous requirements |
| 584 | and suggestions to do with Pod processing. |
| 585 | |
| 586 | =over |
| 587 | |
| 588 | =item * |
| 589 | |
| 590 | Pod formatters should tolerate lines in verbatim blocks that are of |
| 591 | any length, even if that means having to break them (possibly several |
| 592 | times, for very long lines) to avoid text running off the side of the |
| 593 | page. Pod formatters may warn of such line-breaking. Such warnings |
| 594 | are particularly appropriate for lines that are over 100 characters long, which |
| 595 | are usually not intentional. |
| 596 | |
| 597 | =item * |
| 598 | |
| 599 | Pod parsers must recognize I<all> of the three well-known newline |
| 600 | formats: CR, LF, and CRLF. See L<perlport|perlport>. |
| 601 | |
| 602 | =item * |
| 603 | |
| 604 | Pod parsers should accept input lines that are of any length. |
| 605 | |
| 606 | =item * |
| 607 | |
| 608 | Since Perl recognizes a Unicode Byte Order Mark at the start of files |
| 609 | as signaling that the file is Unicode encoded as in UTF-16 (whether |
| 610 | big-endian or little-endian) or UTF-8, Pod parsers should do the |
| 611 | same. Otherwise, the character encoding should be understood as |
| 612 | being UTF-8 if the first highbit byte sequence in the file seems |
| 613 | valid as a UTF-8 sequence, or otherwise as CP-1252 (earlier versions of |
| 614 | this specification used Latin-1 instead of CP-1252). |
| 615 | |
| 616 | Future versions of this specification may specify |
| 617 | how Pod can accept other encodings. Presumably treatment of other |
| 618 | encodings in Pod parsing would be as in XML parsing: whatever the |
| 619 | encoding declared by a particular Pod file, content is to be |
| 620 | stored in memory as Unicode characters. |
| 621 | |
| 622 | =item * |
| 623 | |
| 624 | The well-known Unicode Byte Order Marks are as follows: if the |
| 625 | file begins with the two literal byte values 0xFE 0xFF, this is |
| 626 | the BOM for big-endian UTF-16. If the file begins with the two |
| 627 | literal byte values 0xFF 0xFE, this is the BOM for little-endian |
| 628 | UTF-16. On an ASCII platform, if the file begins with the three literal |
| 629 | byte values |
| 630 | 0xEF 0xBB 0xBF, this is the BOM for UTF-8. |
| 631 | A mechanism portable to EBCDIC platforms is to: |
| 632 | |
| 633 | my $utf8_bom = "\x{FEFF}"; |
| 634 | utf8::encode($utf8_bom); |
| 635 | |
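Building on the two lines above, BOM detection might be sketched as
follows. The function name C<encoding_from_bom> and the returned
encoding labels are assumptions of this example; it expects the raw,
undecoded bytes from the start of the file and covers only the three
BOMs listed here.

```perl
use strict;
use warnings;

# Hypothetical helper: return the encoding implied by a leading BOM,
# or nothing if the input does not start with one of the three BOMs
# described above. $bytes must be raw, undecoded bytes.
sub encoding_from_bom {
    my ($bytes) = @_;

    # Build the UTF-8 BOM portably (0xEF 0xBB 0xBF on ASCII platforms).
    my $utf8_bom = "\x{FEFF}";
    utf8::encode($utf8_bom);

    return 'UTF-8'    if index($bytes, $utf8_bom) == 0;
    return 'UTF-16BE' if $bytes =~ m/\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ m/\A\xFF\xFE/;
    return;
}
```
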
| 636 | =for comment |
| 637 | use bytes; print map sprintf(" 0x%02X", ord $_), split '', "\x{feff}"; |
| 638 | 0xEF 0xBB 0xBF |
| 639 | |
| 640 | =for comment |
| 641 | If toke.c is modified to support UTF-32, add mention of those here. |
| 642 | |
| 643 | =item * |
| 644 | |
| 645 | A naive, but often sufficient heuristic on ASCII platforms, for testing |
| 646 | the first highbit |
| 647 | byte-sequence in a BOM-less file (whether in code or in Pod!), to see |
| 648 | whether that sequence is valid as UTF-8 (RFC 2279) is to check whether |
| 649 | the first byte in the sequence is in the range 0xC2 - 0xFD |
| 650 | I<and> whether the next byte is in the range |
| 651 | 0x80 - 0xBF. If so, the parser may conclude that this file is in |
| 652 | UTF-8, and all highbit sequences in the file should be assumed to |
| 653 | be UTF-8. Otherwise the parser should treat the file as being |
| 654 | in CP-1252. (A better check, and which works on EBCDIC platforms as |
| 655 | well, is to pass a copy of the sequence to |
| 656 | L<utf8::decode()|utf8> which performs a full validity check on the |
| 657 | sequence and returns TRUE if it is valid UTF-8, FALSE otherwise. This |
| 658 | function is always pre-loaded, is fast because it is written in C, and |
| 659 | will only get called at most once, so you don't need to avoid it out of |
| 660 | performance concerns.) |
| 661 | In the unlikely circumstance that the first highbit |
| 662 | sequence in a truly non-UTF-8 file happens to appear to be UTF-8, one |
| 663 | can cater to our heuristic (as well as any more intelligent heuristic) |
| 664 | by prefacing that line with a comment line containing a highbit |
| 665 | sequence that is clearly I<not> valid as UTF-8. A line consisting |
| 666 | of simply "#", an e-acute, and any non-highbit byte, |
| 667 | is sufficient to establish this file's encoding. |
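
The better check mentioned above can be sketched like this. It is a
minimal illustration only: the function name C<guess_encoding>, the
returned labels, and the 'US-ASCII' result for input with no highbit
bytes are assumptions of this example, and the byte range in the
pattern assumes an ASCII platform.

```perl
use strict;
use warnings;

# Hypothetical helper: guess the encoding of BOM-less raw bytes by
# testing the first run of highbit bytes for UTF-8 validity.
sub guess_encoding {
    my ($bytes) = @_;

    # No highbit bytes at all: the choice of encoding makes no
    # difference (the label returned here is this sketch's assumption).
    return 'US-ASCII' unless $bytes =~ m/([\x80-\xFF]+)/;

    # utf8::decode() modifies its argument in place, so work on a copy.
    my $sequence = $1;
    return utf8::decode($sequence) ? 'UTF-8' : 'CP-1252';
}
```

A file whose first highbit sequence is a lone e-acute byte, as in the
comment-line trick described above, is therefore treated as CP-1252
even if valid UTF-8 sequences appear later.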
| 668 | |
| 669 | =for comment |
| 670 | If/WHEN some brave soul makes these heuristics into a generic |
| 671 | text-file class (or PerlIO layer?), we can presumably delete |
| 672 | mention of these icky details from this file, and can instead |
| 673 | tell people to just use appropriate class/layer. |
| 674 | Auto-recognition of newline sequences would be another desirable |
| 675 | feature of such a class/layer. |
| 676 | HINT HINT HINT. |
| 677 | |
| 678 | =for comment |
| 679 | "The probability that a string of characters |
| 680 | in any other encoding appears as valid UTF-8 is low" - RFC2279 |
| 681 | |
| 682 | =item * |
| 683 | |
| 684 | Pod processors must treat a "=for [label] [content...]" paragraph as |
| 685 | meaning the same thing as a "=begin [label]" paragraph, content, and |
| 686 | an "=end [label]" paragraph. (The parser may conflate these two |
| 687 | constructs, or may leave them distinct, in the expectation that the |
| 688 | formatter will nevertheless treat them the same.) |
| 689 | |
| 690 | =item * |
| 691 | |
| 692 | When rendering Pod to a format that allows comments (i.e., to nearly |
| 693 | any format other than plaintext), a Pod formatter must insert comment |
| 694 | text identifying its name and version number, and the name and |
| 695 | version numbers of any modules it might be using to process the Pod. |
| 696 | Minimal examples: |
| 697 | |
| 698 | %% POD::Pod2PS v3.14159, using POD::Parser v1.92 |
| 699 | |
| 700 | <!-- Pod::HTML v3.14159, using POD::Parser v1.92 --> |
| 701 | |
| 702 | {\doccomm generated by Pod::Tree::RTF 3.14159 using Pod::Tree 1.08} |
| 703 | |
| 704 | .\" Pod::Man version 3.14159, using POD::Parser version 1.92 |
| 705 | |
| 706 | Formatters may also insert additional comments, including: the |
| 707 | release date of the Pod formatter program, the contact address for |
| 708 | the author(s) of the formatter, the current time, the name of the input |
| 709 | file, the formatting options in effect, the version of Perl used, etc. |
| 710 | |
| 711 | Formatters may also choose to note errors/warnings as comments, |
| 712 | besides or instead of emitting them otherwise (as in messages to |
| 713 | STDERR, or C<die>ing). |
| 714 | |
| 715 | =item * |
| 716 | |
| 717 | Pod parsers I<may> emit warnings or error messages ("Unknown E code |
| 718 | EE<lt>zslig>!") to STDERR (whether through printing to STDERR, or |
| 719 | C<warn>ing/C<carp>ing, or C<die>ing/C<croak>ing), but I<must> allow |
| 720 | suppressing all such STDERR output, and instead allow an option for |
| 721 | reporting errors/warnings |
| 722 | in some other way, whether by triggering a callback, or noting errors |
| 723 | in some attribute of the document object, or some similarly unobtrusive |
| 724 | mechanism -- or even by appending a "Pod Errors" section to the end of |
| 725 | the parsed form of the document. |
| 726 | |
| 727 | =item * |
| 728 | |
| 729 | In cases of exceptionally aberrant documents, Pod parsers may abort the |
| 730 | parse. Even then, using C<die>ing/C<croak>ing is to be avoided; where |
| 731 | possible, the parser library may simply close the input file |
| 732 | and add text like "*** Formatting Aborted ***" to the end of the |
| 733 | (partial) in-memory document. |
| 734 | |
| 735 | =item * |
| 736 | |
| 737 | In paragraphs where formatting codes (like EE<lt>...>, BE<lt>...>) |
| 738 | are understood (i.e., I<not> verbatim paragraphs, but I<including> |
| 739 | ordinary paragraphs, and command paragraphs that produce renderable |
| 740 | text, like "=head1"), literal whitespace should generally be considered |
| 741 | "insignificant", in that one literal space has the same meaning as any |
| 742 | (nonzero) number of literal spaces, literal newlines, and literal tabs |
| 743 | (as long as this produces no blank lines, since those would terminate |
| 744 | the paragraph). Pod parsers should compact literal whitespace in each |
| 745 | processed paragraph, but may provide an option for overriding this |
| 746 | (since some processing tasks do not require it), or may follow |
| 747 | additional special rules (for example, specially treating |
| 748 | period-space-space or period-newline sequences). |
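
As a minimal sketch of the default compaction described above, a parser written in Perl might normalize a non-verbatim paragraph as follows. The helper name C<compact_whitespace> is hypothetical and not part of any existing Pod parser. The character class lists space, tab, CR, and LF explicitly, on the assumption that a literal NBSP (character 160) should be left untouched.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse each run of literal spaces, tabs, and
# newlines into one space, then trim the ends of the paragraph.
sub compact_whitespace {
    my ($text) = @_;
    $text =~ s/[ \t\r\n]+/ /g;   # one space stands for any nonzero run
    $text =~ s/\A //;            # drop leading space, if any
    $text =~ s/ \z//;            # drop trailing space, if any
    return $text;
}

print compact_whitespace("one  literal\tspace\nis as good   as many\n"), "\n";
```

A parser offering the override mentioned above would simply skip this call when the option is set.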
| 749 | |
| 750 | =item * |
| 751 | |
| 752 | Pod parsers should not, by default, try to coerce apostrophe (') and |
| 753 | quote (") into smart quotes (little 9's, 66's, 99's, etc), nor try to |
| 754 | turn backtick (`) into anything else but a single backtick character |
| 755 | (distinct from an open quote character!), nor "--" into anything but |
| 756 | two minus signs. They I<must never> do any of those things to text |
| 757 | in CE<lt>...> formatting codes, and never I<ever> to text in verbatim |
| 758 | paragraphs. |
| 759 | |
| 760 | =item * |
| 761 | |
| 762 | When rendering Pod to a format that has two kinds of hyphens (-), one |
| 763 | that's a non-breaking hyphen, and another that's a breakable hyphen |
| 764 | (as in "object-oriented", which can be split across lines as |
| 765 | "object-", newline, "oriented"), formatters are encouraged to |
| 766 | generally translate "-" to non-breaking hyphen, but may apply |
| 767 | heuristics to convert some of these to breaking hyphens. |
| 768 | |
| 769 | =item * |
| 770 | |
| 771 | Pod formatters should make reasonable efforts to keep words of Perl |
| 772 | code from being broken across lines. For example, "Foo::Bar" in some |
| 773 | formatting systems is seen as eligible for being broken across lines |
| 774 | as "Foo::" newline "Bar" or even "Foo::-" newline "Bar". This should |
| 775 | be avoided where possible, either by disabling all line-breaking in |
| 776 | mid-word, or by wrapping particular words with internal punctuation |
| 777 | in "don't break this across lines" codes (which in some formats may |
| 778 | not be a single code, but might be a matter of inserting non-breaking |
| 779 | zero-width spaces between every pair of characters in a word.) |
| 780 | |
| 781 | =item * |
| 782 | |
| 783 | Pod parsers should, by default, expand tabs in verbatim paragraphs as |
| 784 | they are processed, before passing them to the formatter or other |
| 785 | processor. Parsers may also allow an option for overriding this. |
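
One possible way to do this in Perl is with the core Text::Tabs module. The wrapper name C<expand_verbatim_tabs> and its optional width parameter are illustrative assumptions; they only show how an override option could be threaded through.

```perl
use strict;
use warnings;
use Text::Tabs;    # core module; exports expand()

# Hypothetical wrapper: expand tabs in one verbatim paragraph.
# $width is an assumed optional override; 8 is the Text::Tabs default.
sub expand_verbatim_tabs {
    my ($paragraph, $width) = @_;
    local $Text::Tabs::tabstop = defined $width ? $width : 8;
    my ($expanded) = expand($paragraph);
    return $expanded;
}

print expand_verbatim_tabs("\tuse Foo;"), "\n";
```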
| 786 | |
| 787 | =item * |
| 788 | |
| 789 | Pod parsers should, by default, remove newlines from the end of |
| 790 | ordinary and verbatim paragraphs before passing them to the |
| 791 | formatter. For example, while the paragraph you're reading now |
| 792 | could be considered, in Pod source, to end with (and contain) |
| 793 | the newline(s) that end it, it should be processed as ending with |
| 794 | (and containing) the period character that ends this sentence. |
| 795 | |
| 796 | =item * |
| 797 | |
| 798 | Pod parsers, when reporting errors, should make some effort to report |
| 799 | an approximate line number ("Nested EE<lt>>'s in Paragraph #52, near |
| 800 | line 633 of Thing/Foo.pm!"), instead of merely noting the paragraph |
| 801 | number ("Nested EE<lt>>'s in Paragraph #52 of Thing/Foo.pm!"). Where |
| 802 | this is problematic, the paragraph number should at least be |
| 803 | accompanied by an excerpt from the paragraph ("Nested EE<lt>>'s in |
| 804 | Paragraph #52 of Thing/Foo.pm, which begins 'Read/write accessor for |
| 805 | the CE<lt>interest rate> attribute...'"). |
| 806 | |
| 807 | =item * |
| 808 | |
| 809 | Pod parsers, when processing a series of verbatim paragraphs one |
| 810 | after another, should consider them to be one large verbatim |
| 811 | paragraph that happens to contain blank lines. I.e., these two |
| 812 | lines, which have a blank line between them: |
| 813 | |
| 814 |   use Foo; |
| 815 | |
| 816 |   print Foo->VERSION |
| 817 | |
| 818 | should be unified into one paragraph ("\tuse Foo;\n\n\tprint |
| 819 | Foo->VERSION") before being passed to the formatter or other |
| 820 | processor. Parsers may also allow an option for overriding this. |
| 821 | |
| 822 | While this might be too cumbersome to implement in event-based Pod |
| 823 | parsers, it is straightforward for parsers that return parse trees. |
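
For a parser that builds a list or tree of paragraphs, the unification can be sketched as below. The representation (array references of type and text) and the name C<merge_verbatim> are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical sketch: each paragraph is [ $type, $text ].  Adjacent
# 'verbatim' paragraphs are joined with the blank line that separated them.
sub merge_verbatim {
    my @merged;
    for my $para (@_) {
        my ($type, $text) = @$para;
        if ($type eq 'verbatim' && @merged && $merged[-1][0] eq 'verbatim') {
            $merged[-1][1] .= "\n\n" . $text;
        }
        else {
            push @merged, [ $type, $text ];
        }
    }
    return @merged;
}

my @out = merge_verbatim(
    [ 'verbatim', "\tuse Foo;" ],
    [ 'verbatim', "\tprint Foo->VERSION" ],
);
print scalar(@out), " paragraph(s)\n";
```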
| 824 | |
| 825 | =item * |
| 826 | |
| 827 | Pod formatters, where feasible, are advised to avoid splitting short |
| 828 | verbatim paragraphs (under twelve lines, say) across pages. |
| 829 | |
| 830 | =item * |
| 831 | |
| 832 | Pod parsers must treat a line with only spaces and/or tabs on it as a |
| 833 | "blank line" such as separates paragraphs. (Some older parsers |
| 834 | recognized only two adjacent newlines as a "blank line" but would not |
| 835 | recognize a newline, a space, and a newline, as a blank line. This |
| 836 | is noncompliant behavior.) |
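
A compliant paragraph splitter can be sketched with a separator pattern that allows spaces and tabs on the otherwise empty line. The function name C<split_paragraphs> is hypothetical, and the sketch assumes the whole source is already in one string.

```perl
use strict;
use warnings;

# Hypothetical sketch: split Pod source into paragraphs, treating a line
# that holds only spaces and/or tabs as a blank line.
sub split_paragraphs {
    my ($source) = @_;
    $source =~ s/\r\n?/\n/g;    # normalize CRLF and bare CR to LF
    return grep { length } split /\n(?:[ \t]*\n)+/, $source;
}

my @paras = split_paragraphs("=head1 NAME\n \t\nFoo - a module\n\n\n=cut\n");
print scalar(@paras), " paragraphs\n";    # prints "3 paragraphs"
```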
| 837 | |
| 838 | =item * |
| 839 | |
| 840 | Authors of Pod formatters/processors should make every effort to |
| 841 | avoid writing their own Pod parser. There are already several in |
| 842 | CPAN, with a wide range of interface styles -- and one of them, |
| 843 | Pod::Simple, comes with modern versions of Perl. |
| 844 | |
| 845 | =item * |
| 846 | |
| 847 | Characters in Pod documents may be conveyed either as literals, or by |
| 848 | number in EE<lt>n> codes, or by an equivalent mnemonic, as in |
| 849 | EE<lt>eacute> which is exactly equivalent to EE<lt>233>. The numbers |
| 850 | are the Latin1/Unicode values, even on EBCDIC platforms. |
| 851 | |
| 852 | When referring to characters by using a EE<lt>n> numeric code, numbers |
| 853 | in the range 32-126 refer to those well known US-ASCII characters (also |
| 854 | defined there by Unicode, with the same meaning), which all Pod |
| 855 | formatters must render faithfully. Characters whose EE<lt>E<gt> numbers |
| 856 | are in the ranges 0-31 and 127-159 should not be used (neither as |
| 857 | literals, |
| 858 | nor as EE<lt>number> codes), except for the literal byte-sequences for |
| 859 | newline (ASCII 13, ASCII 13 10, or ASCII 10), and tab (ASCII 9). |
| 860 | |
| 861 | Numbers in the range 160-255 refer to Latin-1 characters (also |
| 862 | defined there by Unicode, with the same meaning). Numbers above |
| 863 | 255 should be understood to refer to Unicode characters. |
| 864 | |
| 865 | =item * |
| 866 | |
| 867 | Be warned |
| 868 | that some formatters cannot reliably render characters outside 32-126; |
| 869 | and many are able to handle 32-126 and 160-255, but nothing above |
| 870 | 255. |
| 871 | |
| 872 | =item * |
| 873 | |
| 874 | Besides the well-known "EE<lt>lt>" and "EE<lt>gt>" codes for |
| 875 | less-than and greater-than, Pod parsers must understand "EE<lt>sol>" |
| 876 | for "/" (solidus, slash), and "EE<lt>verbar>" for "|" (vertical bar, |
| 877 | pipe). Pod parsers should also understand "EE<lt>lchevron>" and |
| 878 | "EE<lt>rchevron>" as legacy codes for characters 171 and 187, i.e., |
| 879 | "left-pointing double angle quotation mark" = "left pointing |
| 880 | guillemet" and "right-pointing double angle quotation mark" = "right |
| 881 | pointing guillemet". (These look like little "<<" and ">>", and they |
| 882 | are now preferably expressed with the HTML/XHTML codes "EE<lt>laquo>" |
| 883 | and "EE<lt>raquo>".) |
| 884 | |
| 885 | =item * |
| 886 | |
| 887 | Pod parsers should understand all "EE<lt>html>" codes as defined |
| 888 | in the entity declarations in the most recent XHTML specification at |
| 889 | C<www.W3.org>. Pod parsers must understand at least the entities |
| 890 | that define characters in the range 160-255 (Latin-1). Pod parsers, |
| 891 | when faced with some unknown "EE<lt>I<identifier>>" code, |
| 892 | shouldn't simply replace it with nullstring (by default, at least), |
| 893 | but may pass it through as a string consisting of the literal characters |
| 894 | E, less-than, I<identifier>, greater-than. Or Pod parsers may offer the |
| 895 | alternative option of processing such unknown |
| 896 | "EE<lt>I<identifier>>" codes by firing an event especially |
| 897 | for such codes, or by adding a special node-type to the in-memory |
| 898 | document tree. Such "EE<lt>I<identifier>>" may have special meaning |
| 899 | to some processors, or some processors may choose to add them to |
| 900 | a special error report. |
| 901 | |
| 902 | =item * |
| 903 | |
| 904 | Pod parsers must also support the XHTML codes "EE<lt>quot>" for |
| 905 | character 34 (doublequote, "), "EE<lt>amp>" for character 38 |
| 906 | (ampersand, &), and "EE<lt>apos>" for character 39 (apostrophe, '). |
| 907 | |
| 908 | =item * |
| 909 | |
| 910 | Note that in all cases of "EE<lt>whateverE<gt>", I<whatever> (whether |
| 911 | an htmlname, or a number in any base) must consist only of |
| 912 | alphanumeric characters -- that is, I<whatever> must match |
| 913 | C<m/\A\w+\z/>. So S<"EE<lt> 0 1 2 3 E<gt>"> is invalid, because |
| 914 | it contains spaces, which aren't alphanumeric characters. This |
| 915 | presumably does not I<need> special treatment by a Pod processor; |
| 916 | S<" 0 1 2 3 "> doesn't look like a number in any base, so it would |
| 917 | presumably be looked up in the table of HTML-like names. Since |
| 918 | there isn't (and cannot be) an HTML-like entity called S<" 0 1 2 3 ">, |
| 919 | this will be treated as an error. However, Pod processors may |
| 920 | treat S<"EE<lt> 0 1 2 3 E<gt>"> or "EE<lt>e-acute>" as I<syntactically> |
| 921 | invalid, potentially earning a different error message than the |
| 922 | error message (or warning, or event) generated by a merely unknown |
| 923 | (but theoretically valid) htmlname, as in "EE<lt>qacute>" |
| 924 | [sic]. However, Pod parsers are not required to make this |
| 925 | distinction. |
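
The optional distinction between syntactically invalid and merely unknown codes can be sketched as follows. The function name C<classify_e_code> is hypothetical, and the pattern for numbers (decimal or octal digits, or hexadecimal with a leading "0x") is an assumption based on the number forms used elsewhere in this document.

```perl
use strict;
use warnings;

# Hypothetical sketch: classify the content of an E<...> code.
# 'invalid' = fails m/\A\w+\z/; 'number' = looks numeric; 'name' = to be
# looked up in the table of HTML-like names (and possibly not found).
sub classify_e_code {
    my ($content) = @_;
    return 'invalid' unless $content =~ m/\A\w+\z/;
    return 'number'  if $content =~ m/\A(?:0[xX][0-9a-fA-F]+|[0-9]+)\z/;
    return 'name';
}

for my $content ('233', '0xE9', 'eacute', 'qacute', ' 0 1 2 3 ', 'e-acute') {
    printf "%-12s %s\n", "[$content]", classify_e_code($content);
}
```

Here "qacute" is classified as a name; reporting it as unknown is left to the later table lookup.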
| 926 | |
| 927 | =item * |
| 928 | |
| 929 | Note that EE<lt>number> I<must not> be interpreted as simply |
| 930 | "codepoint I<number> in the current/native character set". It always |
| 931 | means only "the character represented by codepoint I<number> in |
| 932 | Unicode." (This is identical to the semantics of &#I<number>; in XML.) |
| 933 | |
| 934 | This will likely require many formatters to have tables mapping from |
| 935 | treatable Unicode codepoints (such as the "\xE9" for the e-acute |
| 936 | character) to the escape sequences or codes necessary for conveying |
| 937 | such sequences in the target output format. A converter to *roff |
| 938 | would, for example know that "\xE9" (whether conveyed literally, or via |
| 939 | a EE<lt>...> sequence) is to be conveyed as "e\\*'". |
| 940 | Similarly, a program rendering Pod in a Mac OS application window, would |
| 941 | presumably need to know that "\xE9" maps to codepoint 142 in MacRoman |
| 942 | encoding that (at time of writing) is native for Mac OS. Such |
| 943 | Unicode2whatever mappings are presumably already widely available for |
| 944 | common output formats. (Such mappings may be incomplete! Implementers |
| 945 | are not expected to bend over backwards in an attempt to render |
| 946 | Cherokee syllabics, Etruscan runes, Byzantine musical symbols, or any |
| 947 | of the other weird things that Unicode can encode.) And |
| 948 | if a Pod document uses a character not found in such a mapping, the |
| 949 | formatter should consider it an unrenderable character. |
| 950 | |
| 951 | =item * |
| 952 | |
| 953 | If, surprisingly, the implementor of a Pod formatter can't find a |
| 954 | satisfactory pre-existing table mapping from Unicode characters to |
| 955 | escapes in the target format (e.g., a decent table of Unicode |
| 956 | characters to *roff escapes), it will be necessary to build such a |
| 957 | table. If you are in this circumstance, you should begin with the |
| 958 | characters in the range 0x00A0 - 0x00FF, which is mostly the heavily |
| 959 | used accented characters. Then proceed (as patience permits and |
| 960 | fastidiousness compels) through the characters that the (X)HTML |
| 961 | standards groups judged important enough to merit mnemonics |
| 962 | for. These are declared in the (X)HTML specifications at the |
| 963 | www.W3.org site. At time of writing (September 2001), the most recent |
| 964 | entity declaration files are: |
| 965 | |
| 966 |   http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent |
| 967 |   http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent |
| 968 |   http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent |
| 969 | |
| 970 | Then you can progress through any remaining notable Unicode characters |
| 971 | in the range 0x2000-0x204D (consult the character tables at |
| 972 | www.unicode.org), and whatever else strikes your fancy. For example, |
| 973 | in F<xhtml-symbol.ent>, there is the entry: |
| 974 | |
| 975 |   <!ENTITY infin    "&#8734;"> <!-- infinity, U+221E ISOtech --> |
| 976 | |
| 977 | While the mapping "infin" to the character "\x{221E}" will (hopefully) |
| 978 | have been already handled by the Pod parser, the presence of the |
| 979 | character in this file means that it's reasonably important enough to |
| 980 | include in a formatter's table that maps from notable Unicode characters |
| 981 | to the codes necessary for rendering them. So for a Unicode-to-*roff |
| 982 | mapping, for example, this would merit the entry: |
| 983 | |
| 984 |   "\x{221E}" => '\(in', |
| 985 | |
| 986 | It is eagerly hoped that in the future, increasing numbers of formats |
| 987 | (and formatters) will support Unicode characters directly (as (X)HTML |
| 988 | does with C<&infin;>, C<&#8734;>, or C<&#x221E;>), reducing the need |
| 989 | for idiosyncratic mappings of Unicode-to-I<my_escapes>. |
| 990 | |
| 991 | =item * |
| 992 | |
| 993 | It is up to individual Pod formatters to display good judgement when |
| 994 | confronted with an unrenderable character (which is distinct from an |
| 995 | unknown EE<lt>thing> sequence that the parser couldn't resolve to |
| 996 | anything, renderable or not). It is good practice to map Latin letters |
| 997 | with diacritics (like "EE<lt>eacute>"/"EE<lt>233>") to the corresponding |
| 998 | unaccented US-ASCII letters (like a simple character 101, "e"), but |
| 999 | clearly this is often not feasible, and an unrenderable character may |
| 1000 | be represented as "?", or the like. In attempting a sane fallback |
| 1001 | (as from EE<lt>233> to "e"), Pod formatters may use the |
| 1002 | %Latin1Code_to_fallback table in L<Pod::Escapes|Pod::Escapes>, or |
| 1003 | L<Text::Unidecode|Text::Unidecode>, if available. |
| 1004 | |
| 1005 | For example, this Pod text: |
| 1006 | |
| 1007 |   magic is enabled if you set C<$Currency> to 'E<euro>'. |
| 1008 | |
| 1009 | may be rendered as: |
| 1010 | "magic is enabled if you set C<$Currency> to 'I<?>'" or as |
| 1011 | "magic is enabled if you set C<$Currency> to 'B<[euro]>'", or as |
| 1012 | "magic is enabled if you set C<$Currency> to '[x20AC]'", etc. |
| 1013 | |
| 1014 | A Pod formatter may also note, in a comment or warning, a list of what |
| 1015 | unrenderable characters were encountered. |
| 1016 | |
| 1017 | =item * |
| 1018 | |
| 1019 | EE<lt>...> may freely appear in any formatting code (other than |
| 1020 | in another EE<lt>...> or in a ZE<lt>>). That is, "XE<lt>The |
| 1021 | EE<lt>euro>1,000,000 Solution>" is valid, as is "LE<lt>The |
| 1022 | EE<lt>euro>1,000,000 Solution|Million::Euros>". |
| 1023 | |
| 1024 | =item * |
| 1025 | |
| 1026 | Some Pod formatters output to formats that implement non-breaking |
| 1027 | spaces as an individual character (which I'll call "NBSP"), and |
| 1028 | others output to formats that implement non-breaking spaces just as |
| 1029 | spaces wrapped in a "don't break this across lines" code. Note that |
| 1030 | at the level of Pod, both sorts of codes can occur: Pod can contain an |
| 1031 | NBSP character (whether as a literal, or as a "EE<lt>160>" or |
| 1032 | "EE<lt>nbsp>" code); and Pod can contain "SE<lt>foo |
| 1033 | IE<lt>barE<gt> baz>" codes, where "mere spaces" (character 32) in |
| 1034 | such codes are taken to represent non-breaking spaces. Pod |
| 1035 | parsers should consider supporting the optional parsing of "SE<lt>foo |
| 1036 | IE<lt>barE<gt> baz>" as if it were |
| 1037 | "fooI<NBSP>IE<lt>barE<gt>I<NBSP>baz", and, going the other way, the |
| 1038 | optional parsing of groups of words joined by NBSP's as if each group |
| 1039 | were in a SE<lt>...> code, so that formatters may use the |
| 1040 | representation that maps best to what the output format demands. |
| 1041 | |
| 1042 | =item * |
| 1043 | |
| 1044 | Some processors may find that the C<SE<lt>...E<gt>> code is easiest to |
| 1045 | implement by replacing each space in the parse tree under the content |
| 1046 | of the S, with an NBSP. But note: the replacement should apply I<not> to |
| 1047 | spaces in I<all> text, but I<only> to spaces in I<printable> text. (This |
| 1048 | distinction may or may not be evident in the particular tree/event |
| 1049 | model implemented by the Pod parser.) For example, consider this |
| 1050 | unusual case: |
| 1051 | |
| 1052 |   S<L</Autoloaded Functions>> |
| 1053 | |
| 1054 | This means that the space in the middle of the visible link text must |
| 1055 | not be broken across lines. In other words, it's the same as this: |
| 1056 | |
| 1057 |   L<"AutoloadedE<160>Functions"/Autoloaded Functions> |
| 1058 | |
| 1059 | However, a misapplied space-to-NBSP replacement could (wrongly) |
| 1060 | produce something equivalent to this: |
| 1061 | |
| 1062 |   L<"AutoloadedE<160>Functions"/AutoloadedE<160>Functions> |
| 1063 | |
| 1064 | ...which is almost definitely not going to work as a hyperlink (assuming |
| 1065 | this formatter outputs a format supporting hypertext). |
| 1066 | |
| 1067 | Formatters may choose to just not support the S format code, |
| 1068 | especially in cases where the output format simply has no NBSP |
| 1069 | character/code and no code for "don't break this stuff across lines". |
| 1070 | |
| 1071 | =item * |
| 1072 | |
| 1073 | Besides the NBSP character discussed above, implementors are reminded |
| 1074 | of the existence of the other "special" character in Latin-1, the |
| 1075 | "soft hyphen" character, also known as "discretionary hyphen" |
| 1076 | (i.e., C<EE<lt>173E<gt>> = C<EE<lt>0xADE<gt>> = |
| 1077 | C<EE<lt>shyE<gt>>). This character expresses an optional hyphenation |
| 1078 | point. That is, it normally renders as nothing, but may render as a |
| 1079 | "-" if a formatter breaks the word at that point. Pod formatters |
| 1080 | should, as appropriate, do one of the following: 1) render this with |
| 1081 | a code with the same meaning (e.g., "\-" in RTF), 2) pass it through |
| 1082 | in the expectation that the formatter understands this character as |
| 1083 | such, or 3) delete it. |
| 1084 | |
| 1085 | For example: |
| 1086 | |
| 1087 |   sigE<shy>action |
| 1088 |   manuE<shy>script |
| 1089 |   JarkE<shy>ko HieE<shy>taE<shy>nieE<shy>mi |
| 1090 | |
| 1091 | These signal to a formatter that if it is to hyphenate "sigaction" |
| 1092 | or "manuscript", then it should be done as |
| 1093 | "sig-I<[linebreak]>action" or "manu-I<[linebreak]>script" |
| 1094 | (and if it doesn't hyphenate it, then the C<EE<lt>shyE<gt>> doesn't |
| 1095 | show up at all). And if it is |
| 1096 | to hyphenate "Jarkko" and/or "Hietaniemi", it can do |
| 1097 | so only at the points where there is a C<EE<lt>shyE<gt>> code. |
| 1098 | |
| 1099 | In practice, it is anticipated that this character will not be used |
| 1100 | often, but formatters should either support it, or delete it. |
| 1101 | |
| 1102 | =item * |
| 1103 | |
| 1104 | If you think that you want to add a new command to Pod (like, say, a |
| 1105 | "=biblio" command), consider whether you could get the same |
| 1106 | effect with a "=for" or "=begin"/"=end" sequence: "=for biblio ..." or "=begin |
| 1107 | biblio" ... "=end biblio". Pod processors that don't understand |
| 1108 | "=for biblio", etc, will simply ignore it, whereas they may complain |
| 1109 | loudly if they see "=biblio". |
| 1110 | |
| 1111 | =item * |
| 1112 | |
| 1113 | Throughout this document, "Pod" has been the preferred spelling for |
| 1114 | the name of the documentation format. One may also use "POD" or |
| 1115 | "pod". For the documentation that is (typically) in the Pod |
| 1116 | format, you may use "pod", or "Pod", or "POD". Understanding these |
| 1117 | distinctions is useful; but obsessing over how to spell them usually |
| 1118 | is not. |
| 1119 | |
| 1120 | =back |
| 1121 | |
| 1122 | |
| 1123 | |
| 1124 | |
| 1125 | |
| 1126 | =head1 About LE<lt>...E<gt> Codes |
| 1127 | |
| 1128 | As you can tell from a glance at L<perlpod|perlpod>, the LE<lt>...> |
| 1129 | code is the most complex of the Pod formatting codes. The points below |
| 1130 | will hopefully clarify what it means and how processors should deal |
| 1131 | with it. |
| 1132 | |
| 1133 | =over |
| 1134 | |
| 1135 | =item * |
| 1136 | |
| 1137 | In parsing an LE<lt>...> code, Pod parsers must distinguish at least |
| 1138 | four attributes: |
| 1139 | |
| 1140 | =over |
| 1141 | |
| 1142 | =item First: |
| 1143 | |
| 1144 | The link-text. If there is none, this must be C<undef>. (E.g., in |
| 1145 | "LE<lt>Perl Functions|perlfunc>", the link-text is "Perl Functions". |
| 1146 | In "LE<lt>Time::HiRes>" and even "LE<lt>|Time::HiRes>", there is no |
| 1147 | link text. Note that link text may contain formatting.) |
| 1148 | |
| 1149 | =item Second: |
| 1150 | |
| 1151 | The possibly inferred link-text; i.e., if there was no real link |
| 1152 | text, then this is the text that we'll infer in its place. (E.g., for |
| 1153 | "LE<lt>Getopt::Std>", the inferred link text is "Getopt::Std".) |
| 1154 | |
| 1155 | =item Third: |
| 1156 | |
| 1157 | The name or URL, or C<undef> if none. (E.g., in "LE<lt>Perl |
| 1158 | Functions|perlfunc>", the name (also sometimes called the page) |
| 1159 | is "perlfunc". In "LE<lt>/CAVEATS>", the name is C<undef>.) |
| 1160 | |
| 1161 | =item Fourth: |
| 1162 | |
| 1163 | The section (AKA "item" in older perlpods), or C<undef> if none. E.g., |
| 1164 | in "LE<lt>Getopt::Std/DESCRIPTIONE<gt>", "DESCRIPTION" is the section. (Note |
| 1165 | that this is not the same as a manpage section like the "5" in "man 5 |
| 1166 | crontab". "Section Foo" in the Pod sense means the part of the text |
| 1167 | that's introduced by the heading or item whose text is "Foo".) |
| 1168 | |
| 1169 | =back |
| 1170 | |
| 1171 | Pod parsers may also note additional attributes including: |
| 1172 | |
| 1173 | =over |
| 1174 | |
| 1175 | =item Fifth: |
| 1176 | |
| 1177 | A flag for whether item 3 (if present) is a URL (like |
| 1178 | "http://lists.perl.org" is), in which case there should be no section |
| 1179 | attribute; a Pod name (like "perldoc" and "Getopt::Std" are); or |
| 1180 | possibly a man page name (like "crontab(5)" is). |
| 1181 | |
| 1182 | =item Sixth: |
| 1183 | |
| 1184 | The raw original LE<lt>...> content, before text is split on |
| 1185 | "|", "/", etc, and before EE<lt>...> codes are expanded. |
| 1186 | |
| 1187 | =back |
| 1188 | |
| 1189 | (The above were numbered only for concise reference below. It is not |
| 1190 | a requirement that these be passed as an actual list or array.) |
| 1191 | |
| 1192 | For example: |
| 1193 | |
| 1194 |   L<Foo::Bar> |
| 1195 |     =>  undef,                         # link text |
| 1196 |         "Foo::Bar",                    # possibly inferred link text |
| 1197 |         "Foo::Bar",                    # name |
| 1198 |         undef,                         # section |
| 1199 |         'pod',                         # what sort of link |
| 1200 |         "Foo::Bar"                     # original content |
| 1201 | |
| 1202 |   L<Perlport's section on NL's|perlport/Newlines> |
| 1203 |     =>  "Perlport's section on NL's",  # link text |
| 1204 |         "Perlport's section on NL's",  # possibly inferred link text |
| 1205 |         "perlport",                    # name |
| 1206 |         "Newlines",                    # section |
| 1207 |         'pod',                         # what sort of link |
| 1208 |         "Perlport's section on NL's|perlport/Newlines" |
| 1209 |                                        # original content |
| 1210 | |
| 1211 |   L<perlport/Newlines> |
| 1212 |     =>  undef,                         # link text |
| 1213 |         '"Newlines" in perlport',      # possibly inferred link text |
| 1214 |         "perlport",                    # name |
| 1215 |         "Newlines",                    # section |
| 1216 |         'pod',                         # what sort of link |
| 1217 |         "perlport/Newlines"            # original content |
| 1218 | |
| 1219 |   L<crontab(5)/"DESCRIPTION"> |
| 1220 |     =>  undef,                         # link text |
| 1221 |         '"DESCRIPTION" in crontab(5)', # possibly inferred link text |
| 1222 |         "crontab(5)",                  # name |
| 1223 |         "DESCRIPTION",                 # section |
| 1224 |         'man',                         # what sort of link |
| 1225 |         'crontab(5)/"DESCRIPTION"'     # original content |
| 1226 | |
| 1227 |   L</Object Attributes> |
| 1228 |     =>  undef,                         # link text |
| 1229 |         '"Object Attributes"',         # possibly inferred link text |
| 1230 |         undef,                         # name |
| 1231 |         "Object Attributes",           # section |
| 1232 |         'pod',                         # what sort of link |
| 1233 |         "/Object Attributes"           # original content |
| 1234 | |
| 1235 |   L<https://www.perl.org/> |
| 1236 |     =>  undef,                         # link text |
| 1237 |         "https://www.perl.org/",       # possibly inferred link text |
| 1238 |         "https://www.perl.org/",       # name |
| 1239 |         undef,                         # section |
| 1240 |         'url',                         # what sort of link |
| 1241 |         "https://www.perl.org/"        # original content |
| 1242 | |
| 1243 |   L<Perl.org|https://www.perl.org/> |
| 1244 |     =>  "Perl.org",                    # link text |
| 1245 |         "https://www.perl.org/",       # possibly inferred link text |
| 1246 |         "https://www.perl.org/",       # name |
| 1247 |         undef,                         # section |
| 1248 |         'url',                         # what sort of link |
| 1249 |         "Perl.org|https://www.perl.org/"  # original content |
| 1250 | |
| 1251 | Note that you can distinguish URL-links from anything else by the |
| 1252 | fact that they match C<m/\A\w+:[^:\s]\S*\z/>. So |
| 1253 | C<LE<lt>http://www.perl.comE<gt>> is a URL, but |
| 1254 | C<LE<lt>HTTP::ResponseE<gt>> isn't. |
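
The URL test above can be exercised directly. In this sketch the function name C<is_url_link> is hypothetical, and the argument is assumed to be the link target with any "text|" part already removed.

```perl
use strict;
use warnings;

# Hypothetical sketch: apply the URL pattern given above to a link target.
sub is_url_link {
    my ($target) = @_;
    return $target =~ m/\A\w+:[^:\s]\S*\z/ ? 1 : 0;
}

for my $target ('http://www.perl.com', 'HTTP::Response', 'crontab(5)') {
    printf "%-22s %s\n", $target, is_url_link($target) ? 'url' : 'not a url';
}
```

The second colon in "HTTP::Response" is what makes the pattern fail, since the character after the first colon must be neither a colon nor whitespace.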
| 1255 | |
| 1256 | =item * |
| 1257 | |
| 1258 | In case of LE<lt>...> codes with no "text|" part in them, |
| 1259 | older formatters have exhibited great variation in actually displaying |
| 1260 | the link or cross reference. For example, LE<lt>crontab(5)> would render |
| 1261 | as "the C<crontab(5)> manpage", or "in the C<crontab(5)> manpage" |
| 1262 | or just "C<crontab(5)>". |
| 1263 | |
| 1264 | Pod processors must now treat "text|"-less links as follows: |
| 1265 | |
| 1266 |   L<name>         =>  L<name|name> |
| 1267 |   L</section>     =>  L<"section"|/section> |
| 1268 |   L<name/section> =>  L<"section" in name|name/section> |
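
The three rules above amount to a small function of the name and section attributes. The sketch below uses the hypothetical name C<inferred_link_text> and assumes that a missing attribute is passed as C<undef>.

```perl
use strict;
use warnings;

# Hypothetical sketch: infer the link text for a link with no "text|" part.
sub inferred_link_text {
    my ($name, $section) = @_;
    return qq{"$section" in $name} if defined $name && defined $section;
    return qq{"$section"}          if defined $section;
    return $name;
}

print inferred_link_text('perlport', 'Newlines'), "\n";   # "Newlines" in perlport
print inferred_link_text(undef, 'Object Attributes'), "\n";
print inferred_link_text('Foo::Bar', undef), "\n";
```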
| 1269 | |
| 1270 | =item * |
| 1271 | |
| 1272 | Note that section names might contain markup. I.e., if a section |
| 1273 | starts with: |
| 1274 | |
| 1275 |   =head2 About the C<-M> Operator |
| 1276 | |
| 1277 | or with: |
| 1278 | |
| 1279 |   =item About the C<-M> Operator |
| 1280 | |
| 1281 | then a link to it would look like this: |
| 1282 | |
| 1283 |   L<somedoc/About the C<-M> Operator> |
| 1284 | |
| 1285 | Formatters may choose to ignore the markup for purposes of resolving |
| 1286 | the link and use only the renderable characters in the section name, |
| 1287 | as in: |
| 1288 | |
| 1289 |   <h1><a name="About_the_-M_Operator">About the <code>-M</code> |
| 1290 |   Operator</a></h1> |
| 1291 | |
| 1292 |   ... |
| 1293 | |
| 1294 |   <a href="somedoc#About_the_-M_Operator">"About the <code>-M</code> |
| 1295 |   Operator" in somedoc</a> |
| 1296 | |
| 1297 | =item * |
| 1298 | |
| 1299 | Previous versions of perlpod distinguished C<LE<lt>name/"section"E<gt>> |
| 1300 | links from C<LE<lt>name/itemE<gt>> links (and their targets). These |
| 1301 | have been merged syntactically and semantically in the current |
| 1302 | specification, and I<section> can refer either to a "=headI<n> Heading |
| 1303 | Content" command or to a "=item Item Content" command. This |
| 1304 | specification does not specify what the behavior should be in the case |
| 1305 | of a given document having several things all seeming to produce the |
| 1306 | same I<section> identifier (e.g., in HTML, several things all producing |
| 1307 | the same I<anchorname> in <a name="I<anchorname>">...</a> |
| 1308 | elements). Where Pod processors can control this behavior, they should |
| 1309 | use the first such anchor. That is, C<LE<lt>Foo/BarE<gt>> refers to the |
| 1310 | I<first> "Bar" section in Foo. |
| 1311 | |
| 1312 | But for some processors/formats this cannot be easily controlled; as |
| 1313 | with the HTML example, the behavior of multiple ambiguous |
| 1314 | <a name="I<anchorname>">...</a> is most easily just left up to |
| 1315 | browsers to decide. |
| 1316 | |
| 1317 | =item * |
| 1318 | |
| 1319 | In a C<LE<lt>text|...E<gt>> code, text may contain formatting codes |
| 1320 | for formatting or for EE<lt>...> escapes, as in: |
| 1321 | |
| 1322 |   L<B<ummE<234>stuff>|...> |
| 1323 | |
| 1324 | For C<LE<lt>...E<gt>> codes without a "text|" part, only |
| 1325 | C<EE<lt>...E<gt>> and C<ZE<lt>E<gt>> codes may occur. That is, |
| 1326 | authors should not use "C<LE<lt>BE<lt>Foo::BarE<gt>E<gt>>". |
| 1327 | |
| 1328 | Note, however, that formatting codes and ZE<lt>>'s can occur in any |
| 1329 | and all parts of an LE<lt>...> (i.e., in I<name>, I<section>, I<text>, |
| 1330 | and I<url>). |
| 1331 | |
| 1332 | Authors must not nest LE<lt>...> codes. For example, "LE<lt>The |
| 1333 | LE<lt>Foo::Bar> man page>" should be treated as an error. |
| 1334 | |
| 1335 | =item * |
| 1336 | |
| 1337 | Note that Pod authors may use formatting codes inside the "text" |
| 1338 | part of "LE<lt>text|name>" (and so on for LE<lt>text|/"sec">). |
| 1339 | |
| 1340 | In other words, this is valid: |
| 1341 | |
| 1342 |   Go read L<the docs on C<$.>|perlvar/"$."> |
| 1343 | |
| 1344 | Some output formats that do allow rendering "LE<lt>...>" codes as |
| 1345 | hypertext, might not allow the link-text to be formatted; in |
| 1346 | that case, formatters will have to just ignore that formatting. |
| 1347 | |
| 1348 | =item * |
| 1349 | |
| 1350 | At time of writing, C<LE<lt>nameE<gt>> values are of two types: |
| 1351 | either the name of a Pod page like C<LE<lt>Foo::BarE<gt>> (which |
| 1352 | might be a real Perl module or program in an @INC / PATH |
| 1353 | directory, or a .pod file in those places); or the name of a Unix |
| 1354 | man page, like C<LE<lt>crontab(5)E<gt>>. In theory, C<LE<lt>chmodE<gt>> |
| 1355 | is ambiguous between a Pod page called "chmod", or the Unix man page |
| 1356 | "chmod" (in whatever man-section). However, the presence of a string |
| 1357 | in parens, as in "crontab(5)", is sufficient to signal that what |
| 1358 | is being discussed is not a Pod page, and so is presumably a |
| 1359 | Unix man page. The distinction is of no importance to many |
| 1360 | Pod processors, but some processors that render to hypertext formats |
| 1361 | may need to distinguish them in order to know how to render a |
| 1362 | given C<LE<lt>fooE<gt>> code. |
| 1363 | |
| 1364 | =item * |
| 1365 | |
| 1366 | Previous versions of perlpod allowed for a C<LE<lt>sectionE<gt>> syntax (as in |
| 1367 | C<LE<lt>Object AttributesE<gt>>), which was not easily distinguishable from |
| 1368 | C<LE<lt>nameE<gt>> syntax and for C<LE<lt>"section"E<gt>> which was only |
| 1369 | slightly less ambiguous. This syntax is no longer in the specification, and |
| 1370 | has been replaced by the C<LE<lt>/sectionE<gt>> syntax (where the slash was |
| 1371 | formerly optional). Pod parsers should tolerate the C<LE<lt>"section"E<gt>> |
| 1372 | syntax, for a while at least. The suggested heuristic for distinguishing |
| 1373 | C<LE<lt>sectionE<gt>> from C<LE<lt>nameE<gt>> is that if it contains any |
| 1374 | whitespace, it's a I<section>. Pod processors should warn about this being |
| 1375 | deprecated syntax. |
| 1376 | |
| 1377 | =back |
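
The heuristics in the last two items can be sketched in Perl. This is only an illustration, not code from any existing Pod module: the function name C<classify_link_target> and its return values are made up for this example, and it assumes it is handed just the name portion of a C<LE<lt>nameE<gt>> code, with no "text|" or "/section" part.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any Pod module): guess what kind of
# target the content of a bare L<...> code refers to.
sub classify_link_target {
    my ($name) = @_;

    # A fully quoted string is the old L<"section"> form.
    return 'section-deprecated' if $name =~ m/\A".*"\z/s;

    # Any whitespace suggests the old L<section> form.
    return 'section-deprecated' if $name =~ m/\s/;

    # A trailing string in parens, as in "crontab(5)", signals a man page.
    return 'man-page' if $name =~ m/\([^()]+\)\z/;

    # Otherwise assume a Pod page such as "Foo::Bar".  A bare "chmod"
    # stays ambiguous and lands here too.
    return 'pod-page';
}

for my $name ('Foo::Bar', 'crontab(5)', 'Object Attributes', '"section"', 'chmod') {
    printf "%-20s %s\n", $name, classify_link_target($name);
}
```

A processor using something like this would presumably also warn whenever it sees the deprecated section forms.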
| 1378 | |
| 1379 | =head1 About =over...=back Regions |
| 1380 | |
| 1381 | "=over"..."=back" regions are used for various kinds of list-like |
| 1382 | structures. (I use the term "region" here simply as a collective |
| 1383 | term for everything from the "=over" to the matching "=back".) |
| 1384 | |
| 1385 | =over |
| 1386 | |
| 1387 | =item * |
| 1388 | |
| 1389 | The non-zero numeric I<indentlevel> in "=over I<indentlevel>" ... |
| 1390 | "=back" is used for giving the formatter a clue as to how many |
| 1391 | "spaces" (ems, or roughly equivalent units) it should tab over, |
| 1392 | although many formatters will have to convert this to an absolute |
| 1393 | measurement that may not exactly match with the size of spaces (or M's) |
| 1394 | in the document's base font. Other formatters may have to completely |
| 1395 | ignore the number. The lack of any explicit I<indentlevel> parameter is |
| 1396 | equivalent to an I<indentlevel> value of 4. Pod processors may |
| 1397 | complain if I<indentlevel> is present but is not a positive number |
| 1398 | matching C<m/\A(\d*\.)?\d+\z/>. |
| 1399 | |
| 1400 | =item * |
| 1401 | |
| 1402 | Authors of Pod formatters are reminded that "=over" ... "=back" may |
| 1403 | map to several different constructs in your output format. For |
| 1404 | example, in converting Pod to (X)HTML, it can map to any of |
| 1405 | <ul>...</ul>, <ol>...</ol>, <dl>...</dl>, or |
| 1406 | <blockquote>...</blockquote>. Similarly, "=item" can map to <li> or |
| 1407 | <dt>. |
| 1408 | |
| 1409 | =item * |
| 1410 | |
| 1411 | Each "=over" ... "=back" region should be one of the following: |
| 1412 | |
| 1413 | =over |
| 1414 | |
| 1415 | =item * |
| 1416 | |
| 1417 | An "=over" ... "=back" region containing only "=item *" commands, |
| 1418 | each followed by some number of ordinary/verbatim paragraphs, other |
| 1419 | nested "=over" ... "=back" regions, "=for..." paragraphs, and |
| 1420 | "=begin"..."=end" regions. |
| 1421 | |
| 1422 | (Pod processors must tolerate a bare "=item" as if it were "=item |
| 1423 | *".) Whether "*" is rendered as a literal asterisk, an "o", or as |
| 1424 | some kind of real bullet character, is left up to the Pod formatter, |
| 1425 | and may depend on the level of nesting. |
| 1426 | |
| 1427 | =item * |
| 1428 | |
| 1429 | An "=over" ... "=back" region containing only |
| 1430 | C<m/\A=item\s+\d+\.?\s*\z/> paragraphs, each one (or each group of them) |
| 1431 | followed by some number of ordinary/verbatim paragraphs, other nested |
| 1432 | "=over" ... "=back" regions, "=for..." paragraphs, and/or |
| 1433 | "=begin"..."=end" regions. Note that the numbers must start at 1 |
| 1434 | in each section, and must proceed in order and without skipping |
| 1435 | numbers. |
| 1436 | |
| 1437 | (Pod processors must tolerate lines like "=item 1" as if they were |
| 1438 | "=item 1.", with the period.) |
| 1439 | |
| 1440 | =item * |
| 1441 | |
| 1442 | An "=over" ... "=back" region containing only "=item [text]" |
| 1443 | commands, each one (or each group of them) followed by some number of |
| 1444 | ordinary/verbatim paragraphs, other nested "=over" ... "=back" |
| 1445 | regions, "=for..." paragraphs, and "=begin"..."=end" regions. |
| 1446 | |
| 1447 | The "=item [text]" paragraph should not match |
| 1448 | C<m/\A=item\s+\d+\.?\s*\z/> or C<m/\A=item\s+\*\s*\z/>, nor should it |
| 1449 | match just C<m/\A=item\s*\z/>. |
| 1450 | |
| 1451 | =item * |
| 1452 | |
| 1453 | An "=over" ... "=back" region containing no "=item" paragraphs at |
| 1454 | all, and containing only some number of |
| 1455 | ordinary/verbatim paragraphs, and possibly also some nested "=over" |
| 1456 | ... "=back" regions, "=for..." paragraphs, and "=begin"..."=end" |
| 1457 | regions. Such an itemless "=over" ... "=back" region in Pod is |
| 1458 | equivalent in meaning to a "<blockquote>...</blockquote>" element in |
| 1459 | HTML. |
| 1460 | |
| 1461 | =back |
| 1462 | |
| 1463 | Note that with all the above cases, you can determine which type of |
| 1464 | "=over" ... "=back" you have, by examining the first (non-"=cut", |
| 1465 | non-"=pod") Pod paragraph after the "=over" command. |
| 1466 | |
| 1467 | =item * |
| 1468 | |
| 1469 | Pod formatters I<must> tolerate arbitrarily large amounts of text |
| 1470 | in the "=item I<text...>" paragraph. In practice, most such |
| 1471 | paragraphs are short, as in: |
| 1472 | |
| 1473 | =item For cutting off our trade with all parts of the world |
| 1474 | |
| 1475 | But they may be arbitrarily long: |
| 1476 | |
| 1477 | =item For transporting us beyond seas to be tried for pretended |
| 1478 | offenses |
| 1479 | |
| 1480 | =item He is at this time transporting large armies of foreign |
| 1481 | mercenaries to complete the works of death, desolation and |
| 1482 | tyranny, already begun with circumstances of cruelty and perfidy |
| 1483 | scarcely paralleled in the most barbarous ages, and totally |
| 1484 | unworthy the head of a civilized nation. |
| 1485 | |
| 1486 | =item * |
| 1487 | |
| 1488 | Pod processors should tolerate "=item *" / "=item I<number>" commands |
| 1489 | with no accompanying paragraph. The middle item is an example: |
| 1490 | |
| 1491 | =over |
| 1492 | |
| 1493 | =item 1 |
| 1494 | |
| 1495 | Pick up dry cleaning. |
| 1496 | |
| 1497 | =item 2 |
| 1498 | |
| 1499 | =item 3 |
| 1500 | |
| 1501 | Stop by the store. Get Abba Zabas, Stoli, and cheap lawn chairs. |
| 1502 | |
| 1503 | =back |
| 1504 | |
| 1505 | =item * |
| 1506 | |
| 1507 | No "=over" ... "=back" region can contain headings. Processors may |
| 1508 | treat such a heading as an error. |
| 1509 | |
| 1510 | =item * |
| 1511 | |
| 1512 | Note that an "=over" ... "=back" region should have some |
| 1513 | content. That is, authors should not have an empty region like this: |
| 1514 | |
| 1515 | =over |
| 1516 | |
| 1517 | =back |
| 1518 | |
| 1519 | Pod processors seeing such a contentless "=over" ... "=back" region |
| 1520 | may ignore it, or may report it as an error. |
| 1521 | |
| 1522 | =item * |
| 1523 | |
| 1524 | Processors must tolerate an "=over" list that goes off the end of the |
| 1525 | document (i.e., which has no matching "=back"), but they may warn |
| 1526 | about such a list. |
| 1527 | |
| 1528 | =item * |
| 1529 | |
| 1530 | Authors of Pod formatters should note that this construct: |
| 1531 | |
| 1532 | =item Neque |
| 1533 | |
| 1534 | =item Porro |
| 1535 | |
| 1536 | =item Quisquam Est |
| 1537 | |
| 1538 | Qui dolorem ipsum quia dolor sit amet, consectetur, adipisci |
| 1539 | velit, sed quia non numquam eius modi tempora incidunt ut |
| 1540 | labore et dolore magnam aliquam quaerat voluptatem. |
| 1541 | |
| 1542 | =item Ut Enim |
| 1543 | |
| 1544 | is semantically ambiguous, in a way that makes formatting decisions |
| 1545 | a bit difficult. On the one hand, it could be a mention of an item |
| 1546 | "Neque", mention of another item "Porro", and mention of another |
| 1547 | item "Quisquam Est", with just the last one requiring the explanatory |
| 1548 | paragraph "Qui dolorem ipsum quia dolor..."; and then an item |
| 1549 | "Ut Enim". In that case, you'd want to format it like so: |
| 1550 | |
| 1551 | Neque |
| 1552 | |
| 1553 | Porro |
| 1554 | |
| 1555 | Quisquam Est |
| 1556 | Qui dolorem ipsum quia dolor sit amet, consectetur, adipisci |
| 1557 | velit, sed quia non numquam eius modi tempora incidunt ut |
| 1558 | labore et dolore magnam aliquam quaerat voluptatem. |
| 1559 | |
| 1560 | Ut Enim |
| 1561 | |
| 1562 | But it could equally well be a discussion of three (related or equivalent) |
| 1563 | items, "Neque", "Porro", and "Quisquam Est", followed by a paragraph |
| 1564 | explaining them all, and then a new item "Ut Enim". In that case, you'd |
| 1565 | probably want to format it like so: |
| 1566 | |
| 1567 | Neque |
| 1568 | Porro |
| 1569 | Quisquam Est |
| 1570 | Qui dolorem ipsum quia dolor sit amet, consectetur, adipisci |
| 1571 | velit, sed quia non numquam eius modi tempora incidunt ut |
| 1572 | labore et dolore magnam aliquam quaerat voluptatem. |
| 1573 | |
| 1574 | Ut Enim |
| 1575 | |
| 1576 | But (for the foreseeable future), Pod does not provide any way for Pod |
| 1577 | authors to distinguish which grouping is meant by the above |
| 1578 | "=item"-cluster structure. So formatters should format it like so: |
| 1579 | |
| 1580 | Neque |
| 1581 | |
| 1582 | Porro |
| 1583 | |
| 1584 | Quisquam Est |
| 1585 | |
| 1586 | Qui dolorem ipsum quia dolor sit amet, consectetur, adipisci |
| 1587 | velit, sed quia non numquam eius modi tempora incidunt ut |
| 1588 | labore et dolore magnam aliquam quaerat voluptatem. |
| 1589 | |
| 1590 | Ut Enim |
| 1591 | |
| 1592 | That is, there should be (at least roughly) equal spacing between |
| 1593 | items as between paragraphs (although that spacing may well be less |
| 1594 | than the full height of a line of text). This leaves it to the reader |
| 1595 | to use (con)textual cues to figure out whether the "Qui dolorem |
| 1596 | ipsum..." paragraph applies to the "Quisquam Est" item or to all three |
| 1597 | items "Neque", "Porro", and "Quisquam Est". While not an ideal |
| 1598 | situation, this is preferable to providing formatting cues that may |
| 1599 | actually be contrary to the author's intent. |
| 1600 | |
| 1601 | =back |
| 1602 | |
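As a rough illustration of two points from the list above (the I<indentlevel> check, and deciding the type of a region from the first paragraph after "=over"), here is a minimal Perl sketch. The function names and the returned labels are hypothetical, chosen only for this example. It assumes the caller has already split the region's content into paragraphs.

```perl
use strict;
use warnings;

# Hypothetical helper: return the effective indentlevel for the text
# following "=over", or undef if a processor might complain about it.
sub over_indentlevel {
    my ($arg) = @_;
    return 4 if !defined $arg || $arg !~ m/\S/;   # no parameter: same as 4
    $arg =~ s/\A\s+|\s+\z//g;                     # trim surrounding whitespace
    return unless $arg =~ m/\A(\d*\.)?\d+\z/;     # pattern given in the text
    return unless $arg > 0;                       # the pattern alone admits "0"
    return $arg + 0;
}

# Hypothetical helper: classify an "=over" region from the first Pod
# paragraph after the "=over" command, skipping "=cut" and "=pod".
sub classify_over_region {
    my (@paragraphs) = @_;
    for my $para (@paragraphs) {
        next if $para =~ m/\A=(?:cut|pod)\b/;
        return 'bullet'   if $para =~ m/\A=item\s+\*\s*\z/;
        return 'bullet'   if $para =~ m/\A=item\s*\z/;   # bare "=item" acts as "=item *"
        return 'numbered' if $para =~ m/\A=item\s+\d+\.?\s*\z/;
        return 'text'     if $para =~ m/\A=item\s/;
        return 'block';   # first paragraph is not an "=item": itemless region
    }
    return 'empty';       # nothing at all between "=over" and "=back"
}

print classify_over_region("=item 1\n", "Pick up dry cleaning.\n"), "\n";
print classify_over_region("=item Neque\n", "=item Porro\n"), "\n";
print over_indentlevel(undef), "\n";
```

Note that the sketch only looks at the first relevant paragraph. It does not verify that later "=item" paragraphs are of the same kind, nor that numbered items count up from 1 without gaps.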
| 1603 | |
| 1604 | |
| 1605 | =head1 About Data Paragraphs and "=begin/=end" Regions |
| 1606 | |
| 1607 | Data paragraphs are typically used for inlining non-Pod data that is |
| 1608 | to be used (typically passed through) when rendering the document to |
| 1609 | a specific format: |
| 1610 | |
| 1611 | =begin rtf |
| 1612 | |
| 1613 | \par{\pard\qr\sa4500{\i Printed\~\chdate\~\chtime}\par} |
| 1614 | |
| 1615 | =end rtf |
| 1616 | |
| 1617 | The exact same effect could, incidentally, be achieved with a single |
| 1618 | "=for" paragraph: |
| 1619 | |
| 1620 | =for rtf \par{\pard\qr\sa4500{\i Printed\~\chdate\~\chtime}\par} |
| 1621 | |
| 1622 | (Although that is not formally a data paragraph, it has the same |
| 1623 | meaning as one, and Pod parsers may parse it as one.) |
| 1624 | |
| 1625 | Another example of a data paragraph: |
| 1626 | |
| 1627 | =begin html |
| 1628 | |
| 1629 | I like <em>PIE</em>! |
| 1630 | |
| 1631 | <hr>Especially pecan pie! |
| 1632 | |
| 1633 | =end html |
| 1634 | |
| 1635 | If these were ordinary paragraphs, the Pod parser would try to |
| 1636 | expand the "EE<lt>/em>" (in the first paragraph) as a formatting |
| 1637 | code, just like "EE<lt>lt>" or "EE<lt>eacute>". But since this |
| 1638 | is in a "=begin I<identifier>"..."=end I<identifier>" region I<and> |
| 1639 | the identifier "html" doesn't have a ":" prefix, the contents |
| 1640 | of this region are stored as data paragraphs, instead of being |
| 1641 | processed as ordinary paragraphs (or if they began with spaces |
| 1642 | and/or tabs, as verbatim paragraphs). |
| 1643 | |
| 1644 | As a further example: At time of writing, no "biblio" identifier is |
| 1645 | supported, but suppose some processor were written to recognize it as |
| 1646 | a way of (say) denoting a bibliographic reference (necessarily |
| 1647 | containing formatting codes in ordinary paragraphs). The fact that |
| 1648 | "biblio" paragraphs were meant for ordinary processing would be |
| 1649 | indicated by prefacing each "biblio" identifier with a colon: |
| 1650 | |
| 1651 | =begin :biblio |
| 1652 | |
| 1653 | Wirth, Niklaus. 1976. I<Algorithms + Data Structures = |
| 1654 | Programs.> Prentice-Hall, Englewood Cliffs, NJ. |
| 1655 | |
| 1656 | =end :biblio |
| 1657 | |
| 1658 | This would signal to the parser that paragraphs in this begin...end |
| 1659 | region are subject to normal handling as ordinary/verbatim paragraphs |
| 1660 | (while still tagged as meant only for processors that understand the |
| 1661 | "biblio" identifier). The same effect could be had with: |
| 1662 | |
| 1663 | =for :biblio |
| 1664 | Wirth, Niklaus. 1976. I<Algorithms + Data Structures = |
| 1665 | Programs.> Prentice-Hall, Englewood Cliffs, NJ. |
| 1666 | |
| 1667 | The ":" on these identifiers means simply "process this stuff |
| 1668 | normally, even though the result will be for some special target". |
| 1669 | I suggest that parser APIs report "biblio" as the target identifier, |
| 1670 | but also report that it had a ":" prefix. (And similarly, with the |
| 1671 | above "html", report "html" as the target identifier, and note the |
| 1672 | I<lack> of a ":" prefix.) |
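
One way a parser might report the identifier and the ":" prefix separately is sketched below. The function name C<parse_target> and the hash keys "target" and "colon" are assumptions made for this example, not part of any real parser API.

```perl
use strict;
use warnings;

# Hypothetical helper: split the first word after "=begin" or "=for"
# into the target identifier and a flag saying whether it had a ":".
sub parse_target {
    my ($command) = @_;
    my ($colon, $target) = $command =~ m/\A=(?:begin|for)\s+(:?)([^\s:]\S*)/
        or return;    # not a "=begin"/"=for" command with an identifier
    return { target => $target, colon => ($colon eq ':' ? 1 : 0) };
}

for my $command ("=begin :biblio\n", "=begin html\n", "=for :biblio\nWirth, Niklaus.\n") {
    my $info = parse_target($command);
    printf "target=%s colon=%d\n", $info->{target}, $info->{colon};
}
```

Keeping the flag apart from the name lets a formatter look up its handler by the plain identifier, and consult the flag only to decide how the content was parsed.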
| 1673 | |
| 1674 | Note that a "=begin I<identifier>"..."=end I<identifier>" region where |
| 1675 | I<identifier> begins with a colon I<can> contain commands. For example: |
| 1676 | |
| 1677 | =begin :biblio |
| 1678 | |
| 1679 | Wirth's classic is available in several editions, including: |
| 1680 | |
| 1681 | =for comment |
| 1682 | hm, check abebooks.com for how much used copies cost. |
| 1683 | |
| 1684 | =over |
| 1685 | |
| 1686 | =item |
| 1687 | |
| 1688 | Wirth, Niklaus. 1975. I<Algorithmen und Datenstrukturen.> |
| 1689 | Teubner, Stuttgart. [Yes, it's in German.] |
| 1690 | |
| 1691 | =item |
| 1692 | |
| 1693 | Wirth, Niklaus. 1976. I<Algorithms + Data Structures = |
| 1694 | Programs.> Prentice-Hall, Englewood Cliffs, NJ. |
| 1695 | |
| 1696 | =back |
| 1697 | |
| 1698 | =end :biblio |
| 1699 | |
| 1700 | Note, however, that a "=begin I<identifier>"..."=end I<identifier>" |
| 1701 | region where I<identifier> does I<not> begin with a colon should not |
| 1702 | directly contain "=head1" ... "=head4" commands, nor "=over", nor "=back", |
| 1703 | nor "=item". For example, this may be considered invalid: |
| 1704 | |
| 1705 | =begin somedata |
| 1706 | |
| 1707 | This is a data paragraph. |
| 1708 | |
| 1709 | =head1 Don't do this! |
| 1710 | |
| 1711 | This is a data paragraph too. |
| 1712 | |
| 1713 | =end somedata |
| 1714 | |
| 1715 | A Pod processor may signal that the above (specifically the "=head1" |
| 1716 | paragraph) is an error. Note, however, that the following should |
| 1717 | I<not> be treated as an error: |
| 1718 | |
| 1719 | =begin somedata |
| 1720 | |
| 1721 | This is a data paragraph. |
| 1722 | |
| 1723 | =cut |
| 1724 | |
| 1725 | # Yup, this isn't Pod anymore. |
| 1726 | sub excl { (rand() > .5) ? "hoo!" : "hah!" } |
| 1727 | |
| 1728 | =pod |
| 1729 | |
| 1730 | This is a data paragraph too. |
| 1731 | |
| 1732 | =end somedata |
| 1733 | |
| 1734 | And this too is valid: |
| 1735 | |
| 1736 | =begin someformat |
| 1737 | |
| 1738 | This is a data paragraph. |
| 1739 | |
| 1740 | And this is a data paragraph. |
| 1741 | |
| 1742 | =begin someotherformat |
| 1743 | |
| 1744 | This is a data paragraph too. |
| 1745 | |
| 1746 | And this is a data paragraph too. |
| 1747 | |
| 1748 | =begin :yetanotherformat |
| 1749 | |
| 1750 | =head2 This is a command paragraph! |
| 1751 | |
| 1752 | This is an ordinary paragraph! |
| 1753 | |
| 1754 | And this is a verbatim paragraph! |
| 1755 | |
| 1756 | =end :yetanotherformat |
| 1757 | |
| 1758 | =end someotherformat |
| 1759 | |
| 1760 | Another data paragraph! |
| 1761 | |
| 1762 | =end someformat |
| 1763 | |
| 1764 | The contents of the above "=begin :yetanotherformat" ... |
| 1765 | "=end :yetanotherformat" region I<aren't> data paragraphs, because |
| 1766 | the immediately containing region's identifier (":yetanotherformat") |
| 1767 | begins with a colon. In practice, most regions that contain |
| 1768 | data paragraphs will contain I<only> data paragraphs; however, |
| 1769 | the above nesting is syntactically valid as Pod, even if it is |
| 1770 | rare. However, the handlers for some formats, like "html", |
| 1771 | will accept only data paragraphs, not nested regions; and they may |
| 1772 | complain if they see (targeted for them) nested regions, or commands, |
| 1773 | other than "=end", "=pod", and "=cut". |
| 1774 | |
| 1775 | Also consider this valid structure: |
| 1776 | |
| 1777 | =begin :biblio |
| 1778 | |
| 1779 | Wirth's classic is available in several editions, including: |
| 1780 | |
| 1781 | =over |
| 1782 | |
| 1783 | =item |
| 1784 | |
| 1785 | Wirth, Niklaus. 1975. I<Algorithmen und Datenstrukturen.> |
| 1786 | Teubner, Stuttgart. [Yes, it's in German.] |
| 1787 | |
| 1788 | =item |
| 1789 | |
| 1790 | Wirth, Niklaus. 1976. I<Algorithms + Data Structures = |
| 1791 | Programs.> Prentice-Hall, Englewood Cliffs, NJ. |
| 1792 | |
| 1793 | =back |
| 1794 | |
| 1795 | Buy buy buy! |
| 1796 | |
| 1797 | =begin html |
| 1798 | |
| 1799 | <img src='wirth_spokesmodeling_book.png'> |
| 1800 | |
| 1801 | <hr> |
| 1802 | |
| 1803 | =end html |
| 1804 | |
| 1805 | Now now now! |
| 1806 | |
| 1807 | =end :biblio |
| 1808 | |
| 1809 | There, the "=begin html"..."=end html" region is nested inside |
| 1810 | the larger "=begin :biblio"..."=end :biblio" region. Note that the |
| 1811 | content of the "=begin html"..."=end html" region is data |
| 1812 | paragraph(s), because the immediately containing region's identifier |
| 1813 | ("html") I<doesn't> begin with a colon. |
| 1814 | |
| 1815 | Pod parsers, when processing a series of data paragraphs one |
| 1816 | after another (within a single region), should consider them to |
| 1817 | be one large data paragraph that happens to contain blank lines. So |
| 1818 | the content of the above "=begin html"..."=end html" I<may> be stored |
| 1819 | as two data paragraphs (one consisting of |
| 1820 | "<img src='wirth_spokesmodeling_book.png'>\n" |
| 1821 | and another consisting of "<hr>\n"), but I<should> be stored as |
| 1822 | a single data paragraph (consisting of |
| 1823 | "<img src='wirth_spokesmodeling_book.png'>\n\n<hr>\n"). |
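
The merging described above amounts to a simple join. The following is a minimal sketch; the name C<merge_data_paragraphs> is made up for this example. It assumes each stored paragraph ends in exactly one newline and that exactly one blank line separated the paragraphs in the source.

```perl
use strict;
use warnings;

# Hypothetical helper: fold consecutive data paragraphs from one region
# into a single data paragraph.  Joining with "\n" restores the blank
# line that separated them.
sub merge_data_paragraphs {
    my (@paragraphs) = @_;
    return join "\n", @paragraphs;
}

my $merged = merge_data_paragraphs(
    "<img src='wirth_spokesmodeling_book.png'>\n",
    "<hr>\n",
);
print $merged;
```

A parser that wants to preserve the exact number of blank lines between data paragraphs would have to record them while reading, rather than reconstruct them afterwards as this sketch does.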
| 1824 | |
| 1825 | Pod processors should tolerate empty |
| 1826 | "=begin I<something>"..."=end I<something>" regions, |
| 1827 | empty "=begin :I<something>"..."=end :I<something>" regions, and |
| 1828 | contentless "=for I<something>" and "=for :I<something>" |
| 1829 | paragraphs. I.e., these should be tolerated: |
| 1830 | |
| 1831 | =for html |
| 1832 | |
| 1833 | =begin html |
| 1834 | |
| 1835 | =end html |
| 1836 | |
| 1837 | =begin :biblio |
| 1838 | |
| 1839 | =end :biblio |
| 1840 | |
| 1841 | Incidentally, note that there's no easy way to express a data |
| 1842 | paragraph starting with something that looks like a command. Consider: |
| 1843 | |
| 1844 | =begin stuff |
| 1845 | |
| 1846 | =shazbot |
| 1847 | |
| 1848 | =end stuff |
| 1849 | |
| 1850 | There, "=shazbot" will be parsed as a Pod command "shazbot", not as a data |
| 1851 | paragraph "=shazbot\n". However, you can express a data paragraph consisting |
| 1852 | of "=shazbot\n" using this code: |
| 1853 | |
| 1854 | =for stuff =shazbot |
| 1855 | |
| 1856 | Situations where this is necessary are presumably quite rare. |
| 1857 | |
| 1858 | Note that =end commands must match the currently open =begin command. That |
| 1859 | is, they must properly nest. For example, this is valid: |
| 1860 | |
| 1861 | =begin outer |
| 1862 | |
| 1863 | X |
| 1864 | |
| 1865 | =begin inner |
| 1866 | |
| 1867 | Y |
| 1868 | |
| 1869 | =end inner |
| 1870 | |
| 1871 | Z |
| 1872 | |
| 1873 | =end outer |
| 1874 | |
| 1875 | while this is invalid: |
| 1876 | |
| 1877 | =begin outer |
| 1878 | |
| 1879 | X |
| 1880 | |
| 1881 | =begin inner |
| 1882 | |
| 1883 | Y |
| 1884 | |
| 1885 | =end outer |
| 1886 | |
| 1887 | Z |
| 1888 | |
| 1889 | =end inner |
| 1890 | |
| 1891 | This latter is improper because when the "=end outer" command is seen, the |
| 1892 | currently open region has the formatname "inner", not "outer". (It just |
| 1893 | happens that "outer" is the format name of a higher-up region.) This is |
| 1894 | an error. Processors must by default report this as an error, and may halt |
| 1895 | processing the document containing that error. A corollary of this is that |
| 1896 | regions cannot "overlap". That is, the latter block above does not represent |
| 1897 | a region called "outer" which contains X and Y, overlapping a region called |
| 1898 | "inner" which contains Y and Z. Because it is invalid (as all |
| 1899 | apparently overlapping regions would be), it doesn't represent that, or |
| 1900 | anything at all. |
| 1901 | |
| 1902 | Similarly, this is invalid: |
| 1903 | |
| 1904 | =begin thing |
| 1905 | |
| 1906 | =end hting |
| 1907 | |
| 1908 | This is an error because the region is opened by "thing", and the "=end" |
| 1909 | tries to close "hting" [sic]. |
| 1910 | |
| 1911 | This is also invalid: |
| 1912 | |
| 1913 | =begin thing |
| 1914 | |
| 1915 | =end |
| 1916 | |
| 1917 | This is invalid because every "=end" command must have a formatname |
| 1918 | parameter. |
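
The nesting rules above lend themselves to a stack. What follows is a minimal sketch, not the implementation used by any real Pod parser: the function name C<check_begin_end> and the wording of the messages are assumptions for this example, and it expects to be given only the command paragraphs of a document, in order.

```perl
use strict;
use warnings;

# Hypothetical checker: return a list of error messages about
# "=begin"/"=end" nesting.  An empty list means no problem was found.
sub check_begin_end {
    my (@commands) = @_;
    my (@open, @errors);
    for my $command (@commands) {
        if ($command =~ m/\A=begin\s+(\S+)/) {
            push @open, $1;                       # remember the open region
        }
        elsif ($command =~ m/\A=end\b\s*(\S*)/) {
            my $name = $1;
            if ($name eq '') {
                push @errors, q{"=end" without a formatname};
            }
            elsif (!@open) {
                push @errors, qq{"=end $name" with no open region};
            }
            elsif ($open[-1] ne $name) {
                push @errors, qq{"=end $name" does not match open "=begin $open[-1]"};
            }
            else {
                pop @open;                        # properly nested close
            }
        }
    }
    return @errors;
}

my @errors = check_begin_end(
    "=begin outer\n", "=begin inner\n", "=end outer\n", "=end inner\n",
);
print "$_\n" for @errors;
```

The sketch compares identifiers as written, including any ":" prefix, and keeps checking after a mismatch. A real processor may prefer to halt at the first error, as the text above allows.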
| 1919 | |
| 1920 | =head1 SEE ALSO |
| 1921 | |
| 1922 | L<perlpod>, L<perlsyn/"PODs: Embedded Documentation">, |
| 1923 | L<podchecker> |
| 1924 | |
| 1925 | =head1 AUTHOR |
| 1926 | |
| 1927 | Sean M. Burke |
| 1928 | |
| 1929 | =cut |
| 1930 | |
| 1931 | |